How to encode labels from array in pyspark

For example, I have a DataFrame with categorical features in the name column:

    from pyspark.sql import SparkSession

    # Parentheses allow the builder chain to span several lines
    spark = (SparkSession.builder.master("local").appName("example")
             .config("spark.some.config.option", "some-value").getOrCreate())

    features = [(['a', 'b', 'c'], 1),
                (['a', 'c'], 2),
                (['d'], 3),
                (['b', 'c'], 4),
                (['a', 'b', 'd'], 5)]

    df = spark.createDataFrame(features, ['name', 'id'])
    df.show()

Out:

    +---------+---+
    |     name| id|
    +---------+---+
    |[a, b, c]|  1|
    |   [a, c]|  2|
    |      [d]|  3|
    |   [b, c]|  4|
    |[a, b, d]|  5|
    +---------+---+

What I want to get:

    +--------+--------+--------+--------+----+
    | name_a | name_b | name_c | name_d | id |
    +--------+--------+--------+--------+----+
    | 1      | 1      | 1      | 0      | 1  |
    | 1      | 0      | 1      | 0      | 2  |
    | 0      | 0      | 0      | 1      | 3  |
    | 0      | 1      | 1      | 0      | 4  |
    | 1      | 1      | 0      | 1      | 5  |
    +--------+--------+--------+--------+----+

I found the same question, but there is nothing helpful there.
I tried to use VectorIndexer from pyspark.ml, but I ran into problems converting the name field to a vector type.

    from pyspark.ml.feature import VectorIndexer

    indexer = VectorIndexer(inputCol="name", outputCol="indexed", maxCategories=5)
    indexerModel = indexer.fit(df)

I get the following error:

Column name must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually ArrayType

I found a solution here, but it looks overcomplicated. I'm also not sure whether this can be done with VectorIndexer alone.

asked Dec 4 '18 at 19:54

Moris Huxley


python apache-spark pyspark pyspark-sql


2 Answers

If you want to use the output with Spark ML, it is best to use CountVectorizer:

    from pyspark.ml.feature import CountVectorizer

    # Pass binary=True to get 0/1 indicators instead of counts
    df_enc = (CountVectorizer(inputCol="name", outputCol="name_vector")
        .fit(df)
        .transform(df))
    df_enc.show(truncate=False)

    +---------+---+-------------------------+
    |name     |id |name_vector              |
    +---------+---+-------------------------+
    |[a, b, c]|1  |(4,[0,1,2],[1.0,1.0,1.0])|
    |[a, c]   |2  |(4,[0,1],[1.0,1.0])      |
    |[d]      |3  |(4,[3],[1.0])            |
    |[b, c]   |4  |(4,[1,2],[1.0,1.0])      |
    |[a, b, d]|5  |(4,[0,2,3],[1.0,1.0,1.0])|
    +---------+---+-------------------------+
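Each name_vector value is printed in sparse form as (size, [indices], [values]). The following is a plain-Python sketch, with no Spark required, of how to read that notation. The vocabulary order a, c, b, d is an assumption inferred from the output above (in Spark it can be read from the fitted model's vocabulary attribute), and to_dense is a hypothetical helper name used only for illustration.

```python
# Plain-Python illustration; no Spark needed.
# Vocabulary order inferred from the output above (assumption).
vocabulary = ["a", "c", "b", "d"]


def to_dense(size, indices, values):
    """Expand (size, indices, values) sparse notation into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# The row with id 5, name = [a, b, d], is shown as (4,[0,2,3],[1.0,1.0,1.0])
row = to_dense(4, [0, 2, 3], [1.0, 1.0, 1.0])
labelled = dict(zip(vocabulary, row))
print(labelled)
```

Read this way, index 1 (label c under the assumed vocabulary) is the only zero entry for that row, which matches the input array.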

Otherwise, collect the distinct values:

    from pyspark.sql.functions import array_contains, explode

    # Sorted list of every distinct label that appears in any array
    names = [
        x[0] for x in
        df.select(explode("name").alias("name")).distinct().orderBy("name").collect()]

and select one column per value with array_contains:

    # Boolean membership test per label, cast to 0/1 and named name_<label>
    df_sep = df.select("*", *[
        array_contains("name", name).cast("integer").alias("name_{}".format(name))
        for name in names]
    )
    df_sep.show()

    +---------+---+------+------+------+------+
    |     name| id|name_a|name_b|name_c|name_d|
    +---------+---+------+------+------+------+
    |[a, b, c]|  1|     1|     1|     1|     0|
    |   [a, c]|  2|     1|     0|     1|     0|
    |      [d]|  3|     0|     0|     0|     1|
    |   [b, c]|  4|     0|     1|     1|     0|
    |[a, b, d]|  5|     1|     1|     0|     1|
    +---------+---+------+------+------+------+
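To make the two steps concrete, here is a plain-Python sketch of the same logic on the sample data, with no Spark involved. It is only an illustration of the semantics, not a replacement for the DataFrame code; encode is a hypothetical helper name.

```python
# Plain-Python illustration of the same logic; no Spark needed.
features = [(['a', 'b', 'c'], 1), (['a', 'c'], 2), (['d'], 3),
            (['b', 'c'], 4), (['a', 'b', 'd'], 5)]

# Step 1: distinct labels in sorted order (explode + distinct + orderBy).
names = sorted({label for labels, _ in features for label in labels})


# Step 2: one 0/1 indicator per label (array_contains + cast to integer).
def encode(labels):
    return {"name_{}".format(n): int(n in labels) for n in names}


rows = [dict(id=row_id, **encode(labels)) for labels, row_id in features]
for r in rows:
    print(r)
```

Because the indicator is a membership test, a label that occurs more than once in an array still yields 1.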

edited Dec 4 '18 at 20:32

answered Dec 4 '18 at 20:12

user10465355


Use explode from pyspark.sql.functions together with pivot:

    from pyspark.sql import functions as F

    features = [(['a', 'b', 'c'], 1),
                (['a', 'c'], 2),
                (['d'], 3),
                (['b', 'c'], 4),
                (['a', 'b', 'd'], 5)]
    df = spark.createDataFrame(features, ['name', 'id'])
    df.show()

    +---------+---+
    |     name| id|
    +---------+---+
    |[a, b, c]|  1|
    |   [a, c]|  2|
    |      [d]|  3|
    |   [b, c]|  4|
    |[a, b, d]|  5|
    +---------+---+

    # One row per array element
    df = df.withColumn('exploded', F.explode('name'))

    # One column per distinct element, holding its count per id
    df.drop('name').groupby('id').pivot('exploded').count().show()

    +---+----+----+----+----+
    | id|   a|   b|   c|   d|
    +---+----+----+----+----+
    |  5|   1|   1|null|   1|
    |  1|   1|   1|   1|null|
    |  3|null|null|null|   1|
    |  2|   1|null|   1|null|
    |  4|null|   1|   1|null|
    +---+----+----+----+----+

Sort by id and convert null to 0:

    df.drop('name').groupby('id').pivot('exploded').count().na.fill(0).sort(F.col('id').asc()).show()

    +---+---+---+---+---+
    | id|  a|  b|  c|  d|
    +---+---+---+---+---+
    |  1|  1|  1|  1|  0|
    |  2|  1|  0|  1|  0|
    |  3|  0|  0|  0|  1|
    |  4|  0|  1|  1|  0|
    |  5|  1|  1|  0|  1|
    +---+---+---+---+---+

explode returns a new row for each element in the given array or map. You can then use pivot to "transpose" the new column.
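As a rough illustration of what explode followed by pivot and count computes, here is a plain-Python sketch on the sample data, with no Spark involved. The variable names are chosen for illustration only.

```python
from collections import Counter

# Plain-Python illustration; no Spark needed.
features = [(['a', 'b', 'c'], 1), (['a', 'c'], 2), (['d'], 3),
            (['b', 'c'], 4), (['a', 'b', 'd'], 5)]

# explode: one (id, label) pair per array element.
exploded = [(row_id, label) for labels, row_id in features for label in labels]

# pivot + count: count the pairs per (id, label). Counter returns 0 for
# combinations that never occur, which plays the role of na.fill(0).
counts = Counter(exploded)
columns = sorted({label for _, label in exploded})
pivoted = {row_id: [counts[(row_id, c)] for c in columns]
           for row_id in sorted({i for i, _ in exploded})}

print(columns)
for row_id, values in pivoted.items():
    print(row_id, values)
```

Since this approach counts occurrences rather than testing membership, a label repeated within one array would produce a value greater than 1, and an id whose array is empty would not appear in the result at all.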

edited Dec 4 '18 at 22:01

answered Dec 4 '18 at 20:56

user2314737

