Spark - MongoDB join performance [duplicate]












This question already has an answer here:




  • How can I force Spark to execute code?

    2 answers



  • Pyspark with Elasticsearch

    1 answer




I have two MongoDB collections, each with millions of records (say 30 GB each).
I am trying to perform a join on the data after loading both dataframes into Spark.
What I have noticed is that Spark loads the data lazily in this case (the usual transformation/action behaviour).



My PySpark code looks like this:



df1 = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", con).load()
df2 = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", con2).load()
res = df1.join(df2, df1.xyz == df2.xyz, "inner")


Now, all these lines execute "almost" immediately. But when I run:



res.show()


It takes ages to execute. My assessment is that the join is taking place within MongoDB, not in Spark.
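
One way I thought of to check this (a sketch only; the exact output depends on the connector and Spark version) is to print the query plan and see which parts are MongoDB scans and which are Spark operators:

# Print the extended plan. The scan/relation nodes show what is read from
# MongoDB (including any pushed-down filters), while the join itself should
# appear as a Spark operator (e.g. SortMergeJoin), i.e. be executed by Spark.
res.explain(True)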



Is there any way to avoid the lazy loading and make sure that both dataframes are materialized in Spark, so that the join takes place in Spark?
(I am trying to explore whether there is any visible difference between joining the data in MongoDB versus Spark.)
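
For reference, one approach I am considering (a rough sketch, assuming both collections fit in the cluster's memory/disk) is to cache both dataframes and trigger an action so that Spark materializes them before the join:

# cache() alone is lazy, so an action (count() here) is still needed to
# actually pull the data out of MongoDB into Spark's storage.
df1.cache()
df2.cache()
print(df1.count(), df2.count())

# The join now runs against the cached partitions inside Spark; as long as
# the cached data is not evicted, MongoDB is not queried again for this step.
res = df1.join(df2, df1.xyz == df2.xyz, "inner")
res.show()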










mongodb apache-spark pyspark

asked Nov 15 '18 at 14:46 by someone

marked as duplicate by eliasah Nov 15 '18 at 15:04


This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.














  • There is a huge misunderstanding here about how Spark works. Spark will evaluate the first three lines in a lazy manner; this means it will just build the DAG and that's about it. res.show, on the other hand, will trigger an action on a partition of the data, which will trigger the load for df1 and df2 and then perform the join. Mongo just sends the data Spark asked for.

    – eliasah
    Nov 15 '18 at 15:04











  • Also, one problem might be that the MongoDB connector may not support pushing the join down to MongoDB itself (as in: Spark asking MongoDB to perform the join rather than doing it itself). Joining 30 GB on each side in memory might be really slow, depending on how many workers you have and how much memory they can use. Databases are usually good at doing joins :) (with the right indexes).

    – Frank
    Nov 15 '18 at 16:33













  • @eliasah I think I did not phrase it correctly, hence the misunderstanding. I know that the action won't be triggered while I am only asking for transformations, and that it is only triggered upon res.show(). I am trying to find a way to perform the "join operation" in Spark instead of MongoDB. My assumption was that if both dataframes are in Spark's memory, the join will be performed there.

    – someone
    Nov 19 '18 at 7:08











  • @Frank This is true, but I want to see if Spark's distributed environment can somehow speed up join performance versus Mongo. I know Spark is asking MongoDB to do the join, but I want it to somehow do the join itself. (How?)

    – someone
    Nov 19 '18 at 7:09











  • Here is the thing: the last time I checked, the Mongo Spark connector only supports select and where ($match in the MongoDB DSL) pushdown predicates.

    – eliasah
    Nov 19 '18 at 8:23
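
Following up on the comments above: if only select and where are pushed down, one thing to try (a sketch; the column names a, b and the filter value are made up) is to project and filter as early as possible so that less data leaves MongoDB before Spark performs the join:

# Only the projection and the filter can be pushed down to MongoDB; the join
# itself is still executed by Spark, on the reduced data.
small1 = df1.filter(df1.a > 100).select("xyz", "a")
small2 = df2.select("xyz", "b")

res2 = small1.join(small2, "xyz", "inner")
res2.show()

# res2.explain(True) should show the pushed-down filter on the Mongo scan
# and a Spark join operator above it.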















