Spark - MongoDB join performance [duplicate]
This question already has an answer here:
How can I force Spark to execute code?
2 answers
Pyspark with Elasticsearch
1 answer
I have two MongoDB collections, each with millions of records (say, 30 GB each).
I am trying to perform a join on the data after loading the DataFrames into Spark.
What I have noticed is that Spark does lazy loading in this case (the usual transformation/action scenario).
My PySpark code looks like this:
df1 = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", con).load()
df2 = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", con2).load()
res = df1.join(df2, df1.xyz == df2.xyz, "inner")
All of these lines execute almost immediately. But when I run:
res.show()
it takes ages to execute. My assessment is that the join is taking place within MongoDB, not Spark.
Is there any way to avoid lazy loading and make sure both DataFrames are materialized in Spark, so that the join takes place in Spark?
(I am trying to explore whether there is any visible difference between joining the data in MongoDB vs. Spark.)
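One approach I am considering (just a sketch, and I am not sure it is the right way): persist both DataFrames and fire an action such as count() so the data is actually pulled out of MongoDB into Spark before the join is evaluated.

from pyspark.storagelevel import StorageLevel

df1 = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", con).load()
df2 = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", con2).load()

# persist() only marks the DataFrames for caching; count() is the action
# that actually reads the collections out of MongoDB into Spark storage
df1.persist(StorageLevel.MEMORY_AND_DISK)
df2.persist(StorageLevel.MEMORY_AND_DISK)
df1.count()
df2.count()

# the join now runs against the cached copies inside Spark
res = df1.join(df2, df1.xyz == df2.xyz, "inner")
res.show()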
mongodb apache-spark pyspark
marked as duplicate by eliasah
Nov 15 '18 at 15:04
This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.
There is a huge misunderstanding here about how Spark works. Spark will evaluate the first 3 lines in a lazy manner. This means it will just build the DAG, and that's about it. On the other hand, res.show() will trigger an action on a partition of the data, which will trigger the load for df1 and df2 and then perform the join. Mongo just sends the data Spark asked for.
– eliasah
Nov 15 '18 at 15:04
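To illustrate that laziness (a minimal sketch reusing the DataFrames from the question): explain() prints the plan Spark has built without touching MongoDB, and only an action such as show() triggers the actual load and join.

res = df1.join(df2, df1.xyz == df2.xyz, "inner")   # returns instantly, only builds the plan

res.explain()   # prints the physical plan; still no data is read from MongoDB

res.show()      # action: triggers the load of df1/df2 and then the join itself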
Also, one problem might be that the MongoDB connector might not support pushing the join down to MongoDB itself (as in: Spark asks MongoDB to do the join instead of doing the join itself). Joining 30 GB on each side in memory might be really slow, also depending on how many workers you have and how much memory they can use. Databases are usually good at doing joins :) (with the right indexes).
– Frank
Nov 15 '18 at 16:33
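If the join does end up running in Spark, executor sizing matters for a shuffle of this size. A sketch only, with placeholder values that depend entirely on the cluster:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("mongo-join")                          # hypothetical app name
    .config("spark.executor.instances", "8")        # number of executors (placeholder)
    .config("spark.executor.memory", "8g")          # memory per executor (placeholder)
    .config("spark.executor.cores", "4")            # cores per executor (placeholder)
    .config("spark.sql.shuffle.partitions", "400")  # partitions used by the join shuffle (placeholder)
    .getOrCreate()
)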
@eliasah I think I did not phrase it correctly, hence the misunderstanding. I know that the action won't be triggered while I am only asking for transformations, and will only be triggered upon res.show(). I am trying to find a way to perform the join operation in Spark instead of MongoDB. My assumption was that if both DataFrames are in Spark's memory, the join will be performed there.
– someone
Nov 19 '18 at 7:08
@Frank This is true, but I want to see if Spark's distributed environment can somehow speed up join performance vs. Mongo. I know Spark is asking MongoDB to do the join, but I want it to somehow do the join itself. (How?)
– someone
Nov 19 '18 at 7:09
Here is the thing. The last time I checked, the MongoDB Spark connector supports only select and where ($match in the MongoDB DSL) pushdown predicates.
– eliasah
Nov 19 '18 at 8:23
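Given that only select and where are pushed down, one way to cut the amount of data MongoDB has to ship is to project and filter before the join (a sketch only; the status column is made up for illustration) and check the pushed filters with explain():

df1_small = df1.select("xyz", "status").where(df1.status == "active")  # projection/filter can be pushed to MongoDB
df2_small = df2.select("xyz")

res = df1_small.join(df2_small, "xyz", "inner")  # the join itself still runs in Spark
res.explain()                                    # pushed filters/projections appear in the scan node of the plan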
asked Nov 15 '18 at 14:46 by someone