How to encode labels from array in pyspark






For example, I have a DataFrame with categorical features stored as arrays in the name column:

    from pyspark.sql import SparkSession

    # Parentheses let the builder chain span several lines.
    spark = (SparkSession.builder.master("local").appName("example")
             .config("spark.some.config.option", "some-value").getOrCreate())

    features = [(['a', 'b', 'c'], 1),
                (['a', 'c'], 2),
                (['d'], 3),
                (['b', 'c'], 4),
                (['a', 'b', 'd'], 5)]

    df = spark.createDataFrame(features, ['name', 'id'])
    df.show()


Out:

    +---------+---+
    |     name| id|
    +---------+---+
    |[a, b, c]|  1|
    |   [a, c]|  2|
    |      [d]|  3|
    |   [b, c]|  4|
    |[a, b, d]|  5|
    +---------+---+


What I want to get:

    +--------+--------+--------+--------+----+
    | name_a | name_b | name_c | name_d | id |
    +--------+--------+--------+--------+----+
    |      1 |      1 |      1 |      0 |  1 |
    +--------+--------+--------+--------+----+
    |      1 |      0 |      1 |      0 |  2 |
    +--------+--------+--------+--------+----+
    |      0 |      0 |      0 |      1 |  3 |
    +--------+--------+--------+--------+----+
    |      0 |      1 |      1 |      0 |  4 |
    +--------+--------+--------+--------+----+
    |      1 |      1 |      0 |      1 |  5 |
    +--------+--------+--------+--------+----+


I found the same question asked before, but nothing there was helpful.
I tried to use VectorIndexer from pyspark.ml, but I ran into problems converting the name field to a vector type.

    from pyspark.ml.feature import VectorIndexer

    indexer = VectorIndexer(inputCol="name", outputCol="indexed", maxCategories=5)
    indexerModel = indexer.fit(df)


I get the following error:

    Column name must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually ArrayType

I found a solution here, but it looks overcomplicated. I'm also not sure whether this can be done with VectorIndexer alone.







Tags: python, apache-spark, pyspark, pyspark-sql






asked Dec 4 '18 at 19:54 by Moris Huxley

2 Answers

If you want to use the output with Spark ML, it is best to use CountVectorizer:

    from pyspark.ml.feature import CountVectorizer

    # Pass binary=True to set every nonzero count to 1 if needed
    df_enc = (CountVectorizer(inputCol="name", outputCol="name_vector")
              .fit(df)
              .transform(df))
    df_enc.show(truncate=False)


    +---------+---+-------------------------+
    |name     |id |name_vector              |
    +---------+---+-------------------------+
    |[a, b, c]|1  |(4,[0,1,2],[1.0,1.0,1.0])|
    |[a, c]   |2  |(4,[0,1],[1.0,1.0])      |
    |[d]      |3  |(4,[3],[1.0])            |
    |[b, c]   |4  |(4,[1,2],[1.0,1.0])      |
    |[a, b, d]|5  |(4,[0,2,3],[1.0,1.0,1.0])|
    +---------+---+-------------------------+
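
The name_vector column above is printed in Spark's sparse-vector notation, (size, [indices], [values]). As a rough illustration that needs no Spark session, the plain-Python sketch below expands such a triple into a dense list. The helper name to_dense and the vocabulary order ['a', 'c', 'b', 'd'] are assumptions: the order is only inferred from the printed output (CountVectorizer orders terms by frequency, and ties can come out in any order), so with a real model it is safer to read it from the fitted model's vocabulary attribute.

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Assumed vocabulary order, inferred from the output shown above.
vocabulary = ['a', 'c', 'b', 'd']

# Row with id 5, name = [a, b, d], printed as (4,[0,2,3],[1.0,1.0,1.0]).
row5 = to_dense(4, [0, 2, 3], [1.0, 1.0, 1.0])

# Pair each vector position with its assumed vocabulary term.
labelled = dict(zip(vocabulary, row5))
print(labelled)
```

Under that assumed ordering, position 1 corresponds to c, which is why the row for [a, b, d] has no entry at index 1.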


Otherwise, collect the distinct values:

    from pyspark.sql.functions import array_contains, explode

    # Sorted list of every distinct element found in the name arrays
    names = [
        x[0] for x in
        df.select(explode("name").alias("name")).distinct().orderBy("name").collect()]


and select the columns with array_contains:

    # One 0/1 column per distinct value; cast first, then alias
    df_sep = df.select("*", *[
        array_contains("name", name).cast("integer").alias("name_{}".format(name))
        for name in names]
    )
    df_sep.show()


    +---------+---+------+------+------+------+
    |     name| id|name_a|name_b|name_c|name_d|
    +---------+---+------+------+------+------+
    |[a, b, c]|  1|     1|     1|     1|     0|
    |   [a, c]|  2|     1|     0|     1|     0|
    |      [d]|  3|     0|     0|     0|     1|
    |   [b, c]|  4|     0|     1|     1|     0|
    |[a, b, d]|  5|     1|     1|     0|     1|
    +---------+---+------+------+------+------+
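
To make the logic of this approach explicit, here is a plain-Python sketch of the same two steps on the sample data: gather the sorted distinct values, then test membership per row. It only mirrors what the explode/distinct and array_contains expressions compute and is not Spark code; the variable names rows, names and encoded are illustrative.

```python
rows = [(['a', 'b', 'c'], 1),
        (['a', 'c'], 2),
        (['d'], 3),
        (['b', 'c'], 4),
        (['a', 'b', 'd'], 5)]

# Step 1: sorted distinct values (the explode + distinct + orderBy step).
names = sorted({value for array, _ in rows for value in array})

# Step 2: one 0/1 flag per value (the array_contains + cast step).
encoded = [
    {**{"name_{}".format(n): int(n in array) for n in names}, "id": row_id}
    for array, row_id in rows
]

for record in encoded:
    print(record)
```

Because this is a membership test, an array with a repeated element such as ['a', 'a'] would still yield 1 for name_a.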






answered Dec 4 '18 at 20:12 by user10465355 (edited Dec 4 '18 at 20:32)

With explode from pyspark.sql.functions and pivot:

    from pyspark.sql import functions as F

    features = [(['a', 'b', 'c'], 1),
                (['a', 'c'], 2),
                (['d'], 3),
                (['b', 'c'], 4),
                (['a', 'b', 'd'], 5)]
    df = spark.createDataFrame(features, ['name', 'id'])
    df.show()
    +---------+---+
    |     name| id|
    +---------+---+
    |[a, b, c]|  1|
    |   [a, c]|  2|
    |      [d]|  3|
    |   [b, c]|  4|
    |[a, b, d]|  5|
    +---------+---+

    df = df.withColumn('exploded', F.explode('name'))

    df.drop('name').groupby('id').pivot('exploded').count().show()
    +---+----+----+----+----+
    | id|   a|   b|   c|   d|
    +---+----+----+----+----+
    |  5|   1|   1|null|   1|
    |  1|   1|   1|   1|null|
    |  3|null|null|null|   1|
    |  2|   1|null|   1|null|
    |  4|null|   1|   1|null|
    +---+----+----+----+----+


Sort by id and replace nulls with 0:

    (df.drop('name').groupby('id').pivot('exploded').count()
       .na.fill(0).sort(F.col('id').asc()).show())
    +---+---+---+---+---+
    | id|  a|  b|  c|  d|
    +---+---+---+---+---+
    |  1|  1|  1|  1|  0|
    |  2|  1|  0|  1|  0|
    |  3|  0|  0|  0|  1|
    |  4|  0|  1|  1|  0|
    |  5|  1|  1|  0|  1|
    +---+---+---+---+---+


explode returns a new row for each element in the given array or map. pivot then "transposes" the new column: each distinct value becomes a column of its own, and count fills in how often that value occurs for each id.
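
Two properties of this approach are easy to miss, so here is a plain-Python sketch of the explode, groupby, pivot and count steps (not Spark code; the sample rows with ids 6 and 7 and all variable names are hypothetical additions). It suggests that a repeated element produces a count above 1, and that a row whose array is empty yields no exploded rows and therefore drops out of the result.

```python
from collections import Counter

rows = [(['a', 'b', 'c'], 1),
        (['d'], 3),
        (['a', 'a'], 6),   # hypothetical: duplicate element
        ([], 7)]           # hypothetical: empty array

# explode: one (id, value) pair per array element; empty arrays emit nothing.
exploded = [(row_id, value) for array, row_id in rows for value in array]

# groupby('id').pivot('exploded').count(): count each value per id.
columns = sorted({value for _, value in exploded})
counts = {}
for row_id, value in exploded:
    counts.setdefault(row_id, Counter())[value] += 1

# na.fill(0) and sort by id: a Counter returns 0 for values it never saw.
pivoted = [
    {"id": row_id, **{c: counts[row_id][c] for c in columns}}
    for row_id in sorted(counts)
]

for record in pivoted:
    print(record)
```

If strict 0/1 flags or a row for every original id are required, the counts would need to be capped at 1 and the result joined back to the original ids; whether that is worth it depends on the data.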







answered Dec 4 '18 at 20:56 by user2314737 (edited Dec 4 '18 at 22:01)