Run multiple Spark queries in parallel in a multi-user environment on a static dataset

Is there a way to process different Spark SQL queries (read queries with different filters and group-bys) on a static dataset, received from the front end, in parallel rather than in FIFO order, so that users do not have to wait in a queue?

One option is to submit the queries from different threads of a thread pool, but then wouldn't concurrent threads compete for the same resources, i.e. the RDDs?
Source

Is there a more efficient way to achieve this with Spark or any other big data framework? Currently I am using Spark SQL, and the data is stored in Parquet format (200 GB).











Tags: java, apache-spark, parallel-processing, bigdata, parquet

asked Nov 12 at 14:11 by Divya, edited Nov 13 at 7:26

1 Answer

I assume you mean different users submitting their own programs or spark-shell sessions, not parallelism within the same application per se.

That being so, Fair Scheduler Pools or Spark Dynamic Resource Allocation would be the best bets. Both are described here: https://spark.apache.org/docs/latest/job-scheduling.html

This area is somewhat hard to follow, because the same documentation also notes:

"Note that none of the modes currently provide memory sharing across applications. If you would like to share data this way, we recommend running a single server application that can serve multiple requests by querying the same RDDs."

One can find opposing statements on Stack Overflow regarding this point. Apache Ignite is what is meant here, and that may well serve you as well.







answered Nov 12 at 15:08 by thebluephantom, edited Nov 12 at 19:39