Spark - MongoDB join performance [duplicate]












This question already has an answer here:




  • How can I force Spark to execute code?

    2 answers



  • Pyspark with Elasticsearch

    1 answer




I have two MongoDB collections, each with millions of records (say 30 GB each).
I am trying to perform a join on the data after loading both dataframes into Spark.
What I have noticed is that Spark loads the data lazily in this case (the usual transformation/action behaviour).



My PySpark code looks like this:



df1 = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", con).load()
df2 = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", con2).load()
res = df1.join(df2, df1.xyz == df2.xyz, "inner")


Now, all these lines execute "almost" immediately. But when I run:



res.show()


It takes ages to execute. My assessment is that the join is taking place within MongoDB, not in Spark.
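
One way I thought of to check this (a sketch only; the exact output depends on the connector and Spark version) is to print the query plan and see which parts are MongoDB scans and which are Spark operators:

# Print the extended plan. The scan/relation nodes show what is read from
# MongoDB (including any pushed-down filters), while the join itself should
# appear as a Spark operator (e.g. SortMergeJoin), i.e. be executed by Spark.
res.explain(True)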



Is there any way to avoid the lazy loading and make sure that both dataframes are materialized in Spark, so that the join takes place in Spark?
(I am trying to explore whether there is any visible difference between joining the data in MongoDB versus Spark.)
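
For reference, one approach I am considering (a rough sketch, assuming both collections fit in the cluster's memory/disk) is to cache both dataframes and trigger an action so that Spark materializes them before the join:

# cache() alone is lazy, so an action (count() here) is still needed to
# actually pull the data out of MongoDB into Spark's storage.
df1.cache()
df2.cache()
print(df1.count(), df2.count())

# The join now runs against the cached partitions inside Spark; as long as
# the cached data is not evicted, MongoDB is not queried again for this step.
res = df1.join(df2, df1.xyz == df2.xyz, "inner")
res.show()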










mongodb apache-spark pyspark

asked Nov 15 '18 at 14:46 by someone

marked as duplicate by eliasah Nov 15 '18 at 15:04


This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.














  • There is a huge misunderstanding here about how Spark works. Spark will evaluate the first three lines in a lazy manner; this means it will just build the DAG and that's about it. res.show, on the other hand, will trigger an action on a partition of the data, which will trigger the load for df1 and df2 and then perform the join. Mongo just sends the data Spark asked for.

    – eliasah
    Nov 15 '18 at 15:04











  • Also, one problem might be that the MongoDB connector may not support pushing the join down to MongoDB itself (as in: Spark asking MongoDB to perform the join rather than doing it itself). Joining 30 GB on each side in memory might be really slow, depending on how many workers you have and how much memory they can use. Databases are usually good at doing joins :) (with the right indexes).

    – Frank
    Nov 15 '18 at 16:33













  • @eliasah I think I did not phrase it correctly, hence the misunderstanding. I know that the action won't be triggered while I am only asking for transformations, and that it is only triggered upon res.show(). I am trying to find a way to perform the "join operation" in Spark instead of MongoDB. My assumption was that if both dataframes are in Spark's memory, the join will be performed there.

    – someone
    Nov 19 '18 at 7:08











  • @Frank This is true, but I want to see if Spark's distributed environment can somehow speed up join performance versus Mongo. I know Spark is asking MongoDB to do the join, but I want it to somehow do the join itself. (How?)

    – someone
    Nov 19 '18 at 7:09











  • Here is the thing: the last time I checked, the Mongo Spark connector only supports select and where ($match in the MongoDB DSL) pushdown predicates.

    – eliasah
    Nov 19 '18 at 8:23
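
Following up on the comments above: if only select and where are pushed down, one thing to try (a sketch; the column names a, b and the filter value are made up) is to project and filter as early as possible so that less data leaves MongoDB before Spark performs the join:

# Only the projection and the filter can be pushed down to MongoDB; the join
# itself is still executed by Spark, on the reduced data.
small1 = df1.filter(df1.a > 100).select("xyz", "a")
small2 = df2.select("xyz", "b")

res2 = small1.join(small2, "xyz", "inner")
res2.show()

# res2.explain(True) should show the pushed-down filter on the Mongo scan
# and a Spark join operator above it.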















