PySpark - remove duplicates from a DataFrame, keeping the last appearance

I'm trying to dedupe a Spark DataFrame, keeping only the latest appearance.
The duplicates are identified by three columns:

NAME
ID
DOB

I succeeded in pandas with the following:

df_dedupe = df.drop_duplicates(subset=['NAME','ID','DOB'], keep='last', inplace=False)
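
As a minimal, self-contained illustration of that call (a sketch that borrows the sample rows from the answers further down this thread):

import pandas as pd

# Sample rows from the answers below; the last 'Bob' row should survive.
df = pd.DataFrame(
    [('Bob', '10', '1542189668', '0', '0'),
     ('Alice', '10', '1425298030', '154', '39'),
     ('Bob', '10', '1542189668', '178', '42')],
    columns=['NAME', 'ID', 'DOB', 'Height', 'ShoeSize'])

# keep='last' retains the final occurrence of each (NAME, ID, DOB) group.
df_dedupe = df.drop_duplicates(subset=['NAME', 'ID', 'DOB'], keep='last')
print(df_dedupe)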


But in Spark, I tried the following:

df_dedupe = df.dropDuplicates(['NAME', 'ID', 'DOB'], keep='last')


I get this error:



TypeError: dropDuplicates() got an unexpected keyword argument 'keep'


Any ideas?

pandas dataframe pyspark

asked Nov 13 '18 at 16:00 by David Kon

  • quynhcodes.wordpress.com/2016/07/29/…

    – karma4917
    Nov 13 '18 at 16:07











  • Based on the [documentation][1], dropDuplicates does not have a keep parameter in PySpark.

    – W-B
    Nov 13 '18 at 16:09

2 Answers

As you can see in http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html, the documentation of dropDuplicates(subset=None) shows it only accepts a subset parameter. Why would you want to keep the last one if they're all equal?



EDIT



As @W-B pointed out, you want the other columns as well. My solution would be to sort the original dataframe in reverse order, then use df_dedupe on the three repeated columns in an inner join, preserving only the last values.



df_dedupe.join(original_df,['NAME','ID','DOB'],'inner')
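
Spelled out, a literal sketch of those steps might look like the following (my reading of the recipe, reusing the sample rows from the follow-up answer; the comments flag why, as that answer goes on to show, this does not reliably keep the last occurrence):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

original_df = spark.createDataFrame(
    [('Bob', '10', '1542189668', '0', '0'),
     ('Alice', '10', '1425298030', '154', '39'),
     ('Bob', '10', '1542189668', '178', '42')],
    ['NAME', 'ID', 'DOB', 'Height', 'ShoeSize'])

# Sort on the key columns in reverse, as described. Note this cannot order
# rows *within* a duplicate group, because their key values are identical.
df_reverse = original_df.sort(['NAME', 'ID', 'DOB'], ascending=False)

# dropDuplicates keeps an arbitrary row per (NAME, ID, DOB) group.
df_dedupe = df_reverse.dropDuplicates(['NAME', 'ID', 'DOB'])

# Joining back on the keys recovers the other columns, but every original
# row that shares a key matches, so the duplicates reappear here.
df_dedupe.join(original_df, ['NAME', 'ID', 'DOB'], 'inner').show()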





answered Nov 13 '18 at 16:10, edited Nov 13 '18 at 16:12, by Manrique

  • Because there are still other columns; he needs the last values for them.

    – W-B
    Nov 13 '18 at 16:11











  • Oh, I understand. I will edit my answer.

    – Manrique
    Nov 13 '18 at 16:13

Thanks for your help.
I followed your directions, but the outcome was not as expected:



d1 = [('Bob', '10', '1542189668', '0', '0'), ('Alice', '10', '1425298030', '154', '39'), ('Bob', '10', '1542189668', '178', '42')]
df1 = spark.createDataFrame(d1, ['NAME', 'ID', 'DOB', 'Height', 'ShoeSize'])
df_dedupe = df1.dropDuplicates(['NAME', 'ID', 'DOB'])
df_reverse = df1.sort(['NAME', 'ID', 'DOB'], ascending=False)
df_dedupe.join(df_reverse, ['NAME', 'ID', 'DOB'], 'inner')  # join result is never assigned
df_dedupe.show(100, False)


The outcome was:



+-----+---+----------+------+--------+
|NAME |ID |DOB       |Height|ShoeSize|
+-----+---+----------+------+--------+
|Bob  |10 |1542189668|0     |0       |
|Alice|10 |1425298030|154   |39      |
+-----+---+----------+------+--------+


This shows "Bob" with the wrong data: the first occurrence survived instead of the last.



Finally, I changed my approach: I converted the DataFrame to pandas and then back to Spark:



from pyspark.sql.types import StructType, StructField, StringType

p_schema = StructType([StructField('NAME', StringType(), True),
                       StructField('ID', StringType(), True),
                       StructField('DOB', StringType(), True),
                       StructField('Height', StringType(), True),
                       StructField('ShoeSize', StringType(), True)])
d1 = [('Bob', '10', '1542189668', '0', '0'), ('Alice', '10', '1425298030', '154', '39'), ('Bob', '10', '1542189668', '178', '42')]
df = spark.createDataFrame(d1, p_schema)
pdf = df.toPandas()
# pandas supports keep='last', which Spark's dropDuplicates does not.
df_dedupe = pdf.drop_duplicates(subset=['NAME', 'ID', 'DOB'], keep='last', inplace=False)

df_spark = spark.createDataFrame(df_dedupe, p_schema)
df_spark.show(100, False)


This finally produced the correct "Bob":



+-----+---+----------+------+--------+
|NAME |ID |DOB       |Height|ShoeSize|
+-----+---+----------+------+--------+
|Alice|10 |1425298030|154   |39      |
|Bob  |10 |1542189668|178   |42      |
+-----+---+----------+------+--------+


Of course, I'd still like a purely Spark solution, but the lack of row indexing seems to be the sticking point with Spark.
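
For what it's worth, a purely Spark sketch seems possible if you first materialize an explicit ordering column. The version below is my assumption rather than anything from this thread: it uses monotonically_increasing_id() as a stand-in row index (it reflects the current partition layout, not a guaranteed "appearance order") and a window to keep the highest index per (NAME, ID, DOB) group:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [('Bob', '10', '1542189668', '0', '0'),
     ('Alice', '10', '1425298030', '154', '39'),
     ('Bob', '10', '1542189668', '178', '42')],
    ['NAME', 'ID', 'DOB', 'Height', 'ShoeSize'])

# Best-effort row index; only a proxy for the original input order.
df = df.withColumn('_order', F.monotonically_increasing_id())

# Rank rows within each duplicate group, highest index (latest) first.
w = Window.partitionBy('NAME', 'ID', 'DOB').orderBy(F.col('_order').desc())
df_dedupe = (df.withColumn('_rn', F.row_number().over(w))
               .filter(F.col('_rn') == 1)
               .drop('_order', '_rn'))

df_dedupe.show(100, False)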



Thanks!






answered Nov 14 '18 at 13:44, edited Nov 15 '18 at 7:02, by David Kon