add column from raw df to grouped df in pyspark
Hello, I have created a grouped dataframe from a raw dataframe with this command:

sp2 = spark_df.drop_duplicates().groupBy('Transaction').agg(F.collect_list("Product").alias("items"))

My spark_df dataframe has three columns: Transaction, Product and CustomerID. I want to put the CustomerID column into the sp2 dataframe (it won't be grouped). When I try to join them with this command:

df_joined = sp2.join(spark_df, "CustomerID")

I get this error message:

Py4JJavaError: An error occurred while calling o44.join. :
org.apache.spark.sql.AnalysisException: USING column CustomerID
cannot be resolved on the left side of the join. The left-side
columns: [Transaction, items];
python apache-spark pyspark
asked Nov 12 at 13:25 by yigitozmen, edited Nov 12 at 13:28 by Ali AzG
1 Answer
This error occurs because there is no CustomerID column in your sp2 dataframe, so you cannot join the two frames on CustomerID. One option is to create a CustomerID column with a placeholder None value in sp2 and then join it with spark_df on that column.

This is sample code to do this:

import pyspark.sql.functions as f
from pyspark.sql.types import StringType  # needed for the cast below

sp2 = sp2.withColumn('CustomerID', f.lit("None").cast(StringType()))
df_joined = sp2.join(spark_df, "CustomerID")
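A note on this first approach (an editorial addition, not part of the original answer): because the new CustomerID column holds the same literal value in every row of sp2, Spark cannot use it as a real equi-join key, and the optimizer flags the plan as an implicit cartesian product, which is the error reported in the comments below. If you deliberately want to run that join anyway, cross joins have to be enabled explicitly, for example:

# Assumption: `spark` is the active SparkSession that spark_df was created with.
# Only enable this if you accept the cross-join-like plan that results.
spark.conf.set("spark.sql.crossJoin.enabled", "true")
df_joined = sp2.join(spark_df, "CustomerID")

In practice, the second approach below avoids the problem entirely.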
UPDATE: The other way to add the CustomerID column to your grouped data is to use the first function (assuming each Transaction has a single CustomerID):

import pyspark.sql.functions as F

sp2 = spark_df.drop_duplicates().groupBy('Transaction').agg(
    F.collect_list("Product").alias("items"),
    F.first('CustomerID').alias('CustomerID'))

answered Nov 12 at 13:35 by Ali AzG, edited Nov 12 at 14:03
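For reference, here is a minimal, self-contained sketch of the first() approach. The sample values, and the assumption that each Transaction maps to exactly one CustomerID, are illustrative and not from the original post:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data mirroring the described schema: Transaction, Product, CustomerID
spark_df = spark.createDataFrame(
    [(1, "apple", "C1"),
     (1, "bread", "C1"),
     (2, "milk", "C2")],
    ["Transaction", "Product", "CustomerID"])

# Collect the products per transaction and carry CustomerID along with first()
sp2 = (spark_df.drop_duplicates()
       .groupBy("Transaction")
       .agg(F.collect_list("Product").alias("items"),
            F.first("CustomerID").alias("CustomerID")))

sp2.show(truncate=False)
# One row per Transaction, with its list of products and its CustomerID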
Thanks for helping, but now I get this error: Detected implicit cartesian product for INNER join between logical plans – yigitozmen Nov 12 at 13:38
Is there any way to put that column into the grouped dataframe? – yigitozmen Nov 12 at 13:42
I think it is because CustomerID is not the first column. Change the code like this and try joining again: sp2 = sp2.select('CustomerID', 'Transaction', 'items'), and join like this: df_joined = sp2.join(spark_df, "CustomerID", how='full') – Ali AzG Nov 12 at 13:45
I still get the error with this – yigitozmen Nov 12 at 13:47
Can I select CustomerID while creating the grouped dataframe with this command: sp2 = spark_df.drop_duplicates().groupBy('Transaction').agg(F.collect_list("Product").alias("items")) – yigitozmen Nov 12 at 13:50