Run multiple Spark queries in parallel in a multi-user environment on a static dataset
Is there a way to process different Spark SQL queries (read queries with different filters and group-bys) on a static dataset in parallel rather than in FIFO order, so that users do not have to wait in a queue? The queries are received from the front end.
One way is to submit the queries from different threads of a thread pool, but wouldn't concurrent threads then compete for the same resources, i.e. the RDDs?
Is there a more efficient way to achieve this using Spark or any other big data framework?
Currently, I'm using Spark SQL, and the data is stored in Parquet format (200 GB).
java apache-spark parallel-processing bigdata parquet
edited Nov 13 at 7:26
asked Nov 12 at 14:11
Divya
66
1 Answer
I assume you mean different users submitting their own programs or spark-shell sessions, and not parallelism within the same application per se.
That being so, Fair Scheduler Pools (which schedule jobs inside a single application) or Spark Dynamic Resource Allocation (which shares cluster resources between applications) would be the best bets. Both are described here: https://spark.apache.org/docs/latest/job-scheduling.html
This area is somewhat hard to follow, because the same page also says:
"Note that none of the modes currently provide memory sharing across applications. If you would like to share data this way, we recommend running a single server application that can serve multiple requests by querying the same RDDs."
One can find opposing statements on Stack Overflow regarding this point. I take the quote to point at something like Apache Ignite, which may well serve you too.
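For the "single server application" route from the quote, a minimal sketch of the thread-pool idea from the question might look like the following. It is only an illustration under stated assumptions: `ParallelQueryServer`, `runAll`, and the pool name `interactive` are hypothetical names, and the `runner` function stands in for the real Spark call so that the snippet runs without a cluster. With a real `SparkSession` started with `spark.scheduler.mode=FAIR`, the runner would set the scheduler pool on the worker thread and then execute the query, as shown in the comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: fans queries out over a thread pool inside ONE
// long-lived application (one shared SparkSession in the real case).
class ParallelQueryServer {

    // Submits every query to the pool and returns the results in input order.
    // `runner` is a stand-in for the Spark call. Assuming a SparkSession named
    // `spark` and a fair-scheduler pool named "interactive", it could be:
    //   q -> {
    //       spark.sparkContext().setLocalProperty("spark.scheduler.pool", "interactive");
    //       return spark.sql(q).collectAsList();
    //   }
    // setLocalProperty is per thread, so it must run on the worker thread.
    static <R> List<R> runAll(List<String> queries, Function<String, R> runner, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String q : queries) {
                Callable<R> task = () -> runner.apply(q);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> f : futures) {
                results.add(f.get()); // blocks until that query has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a query", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a query failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The threads do not fight over the data itself: each query becomes its own Spark job, and the scheduler decides how the executors' task slots are divided. With the default FIFO mode the first job gets priority on all slots, while FAIR mode lets later, shorter queries start without waiting for the earlier ones to finish.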
edited Nov 12 at 19:39
answered Nov 12 at 15:08
thebluephantom
2,3552925