Add column from raw df to grouped df in PySpark
Hello, I have created a grouped dataframe from a raw dataframe with this command:



sp2 = spark_df.drop_duplicates().groupBy('Transaction').agg(F.collect_list("Product").alias("items"))


My spark_df dataframe has three columns: Transaction, Product, and CustomerID.



I want to put the CustomerID column into the sp2 dataframe (it won't be aggregated).



When I try to join it with this command:



df_joined = sp2.join(spark_df, "CustomerID")


I got this error message:




Py4JJavaError: An error occurred while calling o44.join. :
org.apache.spark.sql.AnalysisException: USING column CustomerID
cannot be resolved on the left side of the join. The left-side
columns: [Transaction, items];
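For reference, here is a minimal sketch that reproduces the setup (the toy data and an existing SparkSession named spark are assumptions):

import pyspark.sql.functions as F

# made-up toy data with the three columns described above
spark_df = spark.createDataFrame(
    [(1, "apple", 100), (1, "pear", 100), (2, "milk", 200)],
    ["Transaction", "Product", "CustomerID"],
)

sp2 = (
    spark_df.drop_duplicates()
    .groupBy("Transaction")
    .agg(F.collect_list("Product").alias("items"))
)

sp2.printSchema()  # only Transaction and items remain, so a join on CustomerID cannot resolve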
python apache-spark pyspark

asked Nov 12 at 13:25 by yigitozmen · edited Nov 12 at 13:28 by Ali AzG
1 Answer
This error occurs because there is no CustomerID column in your sp2 dataframe, so you cannot join them on CustomerID. I suggest you create a CustomerID column with a None value in the sp2 dataframe and then join it with spark_df on the CustomerID column.



Here is sample code to do this:



import pyspark.sql.functions as f
from pyspark.sql.types import StringType  # needed for the cast below

sp2 = sp2.withColumn('CustomerID', f.lit("None").cast(StringType()))
df_joined = sp2.join(spark_df, "CustomerID")
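A caveat: since every row of sp2 gets the same literal CustomerID, this join can degenerate into a cross join, which Spark 2.x rejects by default ("Detected implicit cartesian product"). If such a join is really intended, it can be allowed explicitly; a sketch, assuming a Spark 2.x session named spark:

# Spark 2.x only: allow joins whose condition matches every row pair.
# (Spark 3.x allows cross joins by default.)
spark.conf.set("spark.sql.crossJoin.enabled", "true")

df_joined = sp2.join(spark_df, "CustomerID")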


UPDATE: The other way to add the CustomerID column to your grouped data is to use the first function:



import pyspark.sql.functions as F

sp2 = (
    spark_df.drop_duplicates()
    .groupBy('Transaction')
    .agg(
        F.collect_list("Product").alias("items"),
        F.first('CustomerID').alias('CustomerID'),  # keep one CustomerID per transaction
    )
)
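Note that F.first keeps an arbitrary CustomerID if a transaction involves more than one customer. A sketch of an alternative that keeps all of them, using the standard collect_set function:

sp2 = (
    spark_df.drop_duplicates()
    .groupBy('Transaction')
    .agg(
        F.collect_list("Product").alias("items"),
        F.collect_set("CustomerID").alias("customers"),  # all distinct customers per transaction
    )
)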
answered Nov 12 at 13:35 by Ali AzG · edited Nov 12 at 14:03
• Thanks for helping, but now I got this error: Detected implicit cartesian product for INNER join between logical plans
  – yigitozmen, Nov 12 at 13:38

• Is there any way to put that column into the grouped dataframe?
  – yigitozmen, Nov 12 at 13:42

• I think it is because CustomerID is not the first column. Change the code like this and try joining again: sp2 = sp2.select('CustomerID', 'Transaction', 'items'), and join like this: df_joined = sp2.join(spark_df, "CustomerID", how='full')
  – Ali AzG, Nov 12 at 13:45

• I still get the error with this.
  – yigitozmen, Nov 12 at 13:47

• Can I select CustomerID while creating the grouped dataframe with this command? sp2 = spark_df.drop_duplicates().groupBy('Transaction').agg(F.collect_list("Product").alias("items"))
  – yigitozmen, Nov 12 at 13:50