Spark Data frame 1.6

Data Dump



Work_Id,Assigned_to,Date,Status   
R1,John,3/4/15,Not Started
R1,John,3/5/15,In Progress
R1,John,3/6/15,Finished
R3,Alaxender,3/7/15,In Progress
R3,Alaxender,3/8/15,In Progress
R4,Patrick,3/9/15,Finished
R5,Peter,3/11/15,Finished
R7,George,3/13/15,Not Started
R7,George,3/14/15,In Progress
R8,John,3/15/15,In Progress
R8,John,3/16/15,In Progress
R9,Alaxender,3/17/15,Not Started


Final Output



Work_Id,Assigned_to,Date,Status   
R1,John,3/6/15,Finished
R7,George,3/14/15,In Progress
R9,Alaxender,3/17/15,Not Started
R3,Alaxender,3/7/15,In Progress
R3,Alaxender,3/8/15,In Progress
R4,Patrick,3/9/15,Finished
R5,Peter,3/11/15,Finished
R8,John,3/15/15,In Progress
R8,John,3/16/15,In Progress


The data dump above consists of work-order records. If there are several records for the same work order and person, and one of them has the status "Not Started", only the last record of that work order (sorted by date) qualifies. If the only record is the one with the status "Not Started", that record qualifies.

Eg:

R1,John,3/4/15,Not Started
R1,John,3/5/15,In Progress
R1,John,3/6/15,Finished

Only this record qualifies:

R1,John,3/6/15,Finished

All records of work orders that have no "Not Started" record qualify unchanged in the output (for example, both of John's R8 records).
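
To pin the rule down independently of Spark, here is a minimal sketch in plain Scala collections. It assumes that a "request" means one Work_Id for one person, which is what the Final Output above implies (John's R8 records are kept even though his R1 order has a "Not Started" record). The names WorkRecord, QualifyRule, parse and qualify are made up for this illustration and do not come from the question.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical record type for one line of the dump.
case class WorkRecord(workId: String, assignedTo: String, date: LocalDate, status: String)

object QualifyRule {
  // Dates in the dump look like 3/4/15, assumed here to mean month/day/two-digit year.
  private val fmt = DateTimeFormatter.ofPattern("M/d/yy")

  // Turn one comma-separated line into a WorkRecord.
  def parse(line: String): WorkRecord =
    line.split(",").map(_.trim) match {
      case Array(id, person, date, status) =>
        WorkRecord(id, person, LocalDate.parse(date, fmt), status)
      case _ =>
        throw new IllegalArgumentException(s"Unexpected line: $line")
    }

  // Per (Work_Id, Assigned_to) group: if any record is "Not Started",
  // keep only the latest record; otherwise keep the whole group.
  def qualify(records: Seq[WorkRecord]): Seq[WorkRecord] =
    records
      .groupBy(r => (r.workId, r.assignedTo))
      .values
      .flatMap { group =>
        if (group.exists(_.status == "Not Started")) Seq(group.maxBy(_.date.toEpochDay))
        else group
      }
      .toSeq
      .sortBy(r => (r.date.toEpochDay, r.workId)) // stable, date-ordered result
}
```

Applied to the twelve records of the dump, qualify returns the nine records listed under Final Output, ordered by date rather than in the order shown there.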



Any help would be appreciated. This needs to be done with Spark 1.6 DataFrames in Scala.

  • How is this any different from the last question you posted with no attempt at a solution yourself? stackoverflow.com/questions/49472718/… ... idownvotedbecau.se/noattempt

    – cricket_007
    Nov 14 '18 at 6:00

  • Appreciate your help on this..

    – Ansip
    Nov 14 '18 at 6:02

  • @cricket_007, We would not be able to achieve this with collect_list. Need a different approach altogether.

    – Ansip
    Nov 14 '18 at 6:06

  • Okay, well, still... Can you please edit the question with some Scala code that has gotten you part way towards an answer?

    – cricket_007
    Nov 14 '18 at 6:15

  • @cricket_007, I have updated an answer here. Is there any better way to do this?

    – Ansip
    Nov 14 '18 at 6:23

apache-spark apache-spark-sql

edited Nov 14 '18 at 5:59 by cricket_007
asked Nov 13 '18 at 18:13 by Ansip

1 Answer

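I have a working solution, but it currently degrades the job's performance. Is there a better way to do this?

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// Assumes a context named sqlContext (already in scope in spark-shell).
// In Spark 1.6 it must be a HiveContext for window functions to work.
import sqlContext.implicits._

// Parse the M/d/yy strings so that ordering is chronological, not lexicographic.
val df = myFile.toDF().withColumn("ts", unix_timestamp($"Date", "M/d/yy"))

// Work orders (per person) that have at least one "Not Started" record.
// Keyed on Work_Id as well as Assigned_to so that the result matches the
// Final Output in the question (John's R8 records are kept).
val dfFilter = df.filter($"Status" === "Not Started")
val dfSelect = dfFilter.select($"Work_Id".alias("order_id"), $"Assigned_to".alias("person")).distinct()

// Latest record of each of those work orders.
val dfInner = dfSelect.join(df, $"order_id" === $"Work_Id" && $"person" === $"Assigned_to")
val windowSpec = Window.partitionBy($"Work_Id", $"Assigned_to").orderBy($"ts".desc)
val dfRank = dfInner.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1)
val dfDrop = dfRank.drop($"rank").drop($"order_id").drop($"person").drop($"ts")

// All records of work orders that have no "Not Started" record.
val dfLeftOuter = df.join(dfSelect, $"Work_Id" === $"order_id" && $"Assigned_to" === $"person", "leftouter")
val nullDf = dfLeftOuter.filter($"person".isNull).drop($"order_id").drop($"person").drop($"ts")

nullDf.unionAll(dfDrop).show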
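
One possible way to avoid the two joins and the union is to compute both pieces of information with window aggregates over a single partitioning and then filter once. The following is a sketch, not a benchmarked replacement. It assumes the column names from the question, a date format of M/d/yy, and a work-order key of Work_Id plus Assigned_to, which is what the Final Output in the question implies. QualifyWorkOrders and qualify are hypothetical names. In Spark 1.6, window functions only work on DataFrames created through a HiveContext.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, unix_timestamp, when}

object QualifyWorkOrders {
  // Expects columns Work_Id, Assigned_to, Date (M/d/yy string) and Status.
  def qualify(df: DataFrame): DataFrame = {
    // Assumption: one "request" is one Work_Id for one person.
    val perOrder = Window.partitionBy(col("Work_Id"), col("Assigned_to"))

    df
      // Parse the date string so that comparisons are chronological.
      .withColumn("ts", unix_timestamp(col("Date"), "M/d/yy"))
      // 1 if the work order has any "Not Started" record, else 0.
      .withColumn("not_started",
        max(when(col("Status") === "Not Started", 1).otherwise(0)).over(perOrder))
      // Timestamp of the latest record of the work order.
      .withColumn("latest_ts", max(col("ts")).over(perOrder))
      // Keep everything for orders without "Not Started",
      // and only the latest record for orders with it.
      .filter(col("not_started") === 0 || col("ts") === col("latest_ts"))
      .select("Work_Id", "Assigned_to", "Date", "Status")
  }
}
```

Usage would be along the lines of `QualifyWorkOrders.qualify(myFile.toDF()).show`. Both aggregates share one window specification, so the data should only need to be partitioned once, but whether this is actually faster than the join-based version depends on the data and cluster and would need to be measured. Like the rank-based version, it keeps every tied record when two records of a work order share the latest date.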





answered Nov 14 '18 at 6:22 by Ansip