add column from raw df to grouped df in pyspark
Hello, I have created a grouped dataframe from a raw dataframe with this command:

sp2 = spark_df.drop_duplicates().groupBy('Transaction').agg(F.collect_list("Product").alias("items"))

My spark_df dataframe has three columns: Transaction, Product and CustomerID. I want to put the CustomerID column into the sp2 dataframe (it won't be grouped). When I try to join them with this command:

df_joined = sp2.join(spark_df, "CustomerID")

I get this error message:

Py4JJavaError: An error occurred while calling o44.join. :
org.apache.spark.sql.AnalysisException: USING column CustomerID
cannot be resolved on the left side of the join. The left-side
columns: [Transaction, items];
python apache-spark pyspark
asked Nov 12 at 13:25 by yigitozmen, edited Nov 12 at 13:28 by Ali AzG
1 Answer
This error occurs because there is no CustomerID column in your sp2 dataframe, so you cannot join the two frames on CustomerID. One option is to create a CustomerID column with a placeholder None value in sp2 and then join it with spark_df on that column.

This is sample code to do this:

import pyspark.sql.functions as f
from pyspark.sql.types import StringType  # needed for the cast below

sp2 = sp2.withColumn('CustomerID', f.lit("None").cast(StringType()))
df_joined = sp2.join(spark_df, "CustomerID")
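A note on this first approach (an editorial addition, not part of the original answer): because the new CustomerID column holds the same literal value in every row of sp2, Spark cannot use it as a real equi-join key, and the optimizer flags the plan as an implicit cartesian product, which is the error reported in the comments below. If you deliberately want to run that join anyway, cross joins have to be enabled explicitly, for example:

# Assumption: `spark` is the active SparkSession that spark_df was created with.
# Only enable this if you accept the cross-join-like plan that results.
spark.conf.set("spark.sql.crossJoin.enabled", "true")
df_joined = sp2.join(spark_df, "CustomerID")

In practice, the second approach below avoids the problem entirely.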
UPDATE: The other way to add the CustomerID column to your grouped data is to use the first function (assuming each Transaction has a single CustomerID):

import pyspark.sql.functions as F

sp2 = spark_df.drop_duplicates().groupBy('Transaction').agg(
    F.collect_list("Product").alias("items"),
    F.first('CustomerID').alias('CustomerID'))

answered Nov 12 at 13:35 by Ali AzG, edited Nov 12 at 14:03
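For reference, here is a minimal, self-contained sketch of the first() approach. The sample values, and the assumption that each Transaction maps to exactly one CustomerID, are illustrative and not from the original post:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data mirroring the described schema: Transaction, Product, CustomerID
spark_df = spark.createDataFrame(
    [(1, "apple", "C1"),
     (1, "bread", "C1"),
     (2, "milk", "C2")],
    ["Transaction", "Product", "CustomerID"])

# Collect the products per transaction and carry CustomerID along with first()
sp2 = (spark_df.drop_duplicates()
       .groupBy("Transaction")
       .agg(F.collect_list("Product").alias("items"),
            F.first("CustomerID").alias("CustomerID")))

sp2.show(truncate=False)
# One row per Transaction, with its list of products and its CustomerID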
Thanks for helping, but now I get this error: Detected implicit cartesian product for INNER join between logical plans – yigitozmen Nov 12 at 13:38
Is there any way to put that column into the grouped dataframe? – yigitozmen Nov 12 at 13:42
I think it is because CustomerID is not the first column. Change the code like this and try joining again: sp2 = sp2.select('CustomerID', 'Transaction', 'items'), and join like this: df_joined = sp2.join(spark_df, "CustomerID", how='full') – Ali AzG Nov 12 at 13:45
I still get the error with this – yigitozmen Nov 12 at 13:47
Can I select CustomerID while creating the grouped dataframe with this command: sp2 = spark_df.drop_duplicates().groupBy('Transaction').agg(F.collect_list("Product").alias("items")) – yigitozmen Nov 12 at 13:50