Spark Data frame 1.6


Data Dump

Work_Id,Assigned_to,Date,Status
R1,John,3/4/15,Not Started
R1,John,3/5/15,In Progress
R1,John,3/6/15,Finished
R3,Alaxender,3/7/15,In Progress
R3,Alaxender,3/8/15,In Progress
R4,Patrick,3/9/15,Finished
R5,Peter,3/11/15,Finished
R7,George,3/13/15,Not Started
R7,George,3/14/15,In Progress
R8,John,3/15/15,In Progress
R8,John,3/16/15,In Progress
R9,Alaxender,3/17/15,Not Started

Final Output

Work_Id,Assigned_to,Date,Status
R1,John,3/6/15,Finished
R7,George,3/14/15,In Progress
R9,Alaxender,3/17/15,Not Started
R3,Alaxender,3/7/15,In Progress
R3,Alaxender,3/8/15,In Progress
R4,Patrick,3/9/15,Finished
R5,Peter,3/11/15,Finished
R8,John,3/15/15,In Progress
R8,John,3/16/15,In Progress

The data dump above consists of work orders. If there are subsequent records for the same person and one of them has the status "Not Started", only the last record (sorted by date) qualifies. If there is only one record and its status is "Not Started", that record qualifies.

E.g., given:

R1,John,3/4/15,Not Started
R1,John,3/5/15,In Progress
R1,John,3/6/15,Finished

this record qualifies:

R1,John,3/6/15,Finished

All remaining records, where the same person has no "Not Started" status, qualify for the output as they are.

Any help would be appreciated. This needs to be done with Spark 1.6 DataFrames in Scala.
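The rule can be sketched on plain Scala collections before moving it to a DataFrame. This is only an illustration under two assumptions that are not stated in the question: records are grouped by Work_Id (the expected output keeps both R8 rows for John although John has a "Not Started" record on R1, which suggests grouping by work order rather than by person), and dates follow the M/d/yy pattern. The names WorkOrder, QualifyRule and qualify are hypothetical, and java.time requires Java 8 or later.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object QualifyRule {
  // Hypothetical record type mirroring the four columns of the dump.
  final case class WorkOrder(workId: String, assignedTo: String, date: String, status: String)

  // Assumption: dates such as 3/4/15 are month/day/two-digit year.
  private val fmt = DateTimeFormatter.ofPattern("M/d/yy")
  private def parse(d: String): LocalDate = LocalDate.parse(d.trim, fmt)

  // Assumption: the grouping key is Work_Id, inferred from the expected output.
  def qualify(rows: Seq[WorkOrder]): Seq[WorkOrder] =
    rows.groupBy(_.workId).values.toSeq.flatMap { group =>
      if (group.exists(_.status.trim == "Not Started"))
        Seq(group.maxBy(r => parse(r.date).toEpochDay)) // keep only the latest record
      else
        group                                           // keep every record unchanged
    }
}
```

On the sample dump this keeps nine rows, the same set as the Final Output above. Row order is not preserved, because groupBy does not guarantee one.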

asked Nov 13 '18 at 18:13 by Ansip

edited Nov 14 '18 at 5:59 by cricket_007

How is this any different from the last question you posted with no attempt at a solution yourself? stackoverflow.com/questions/49472718/… ... idownvotedbecau.se/noattempt

– cricket_007
Nov 14 '18 at 6:00

Appreciate your help on this..

– Ansip
Nov 14 '18 at 6:02

@cricket_007, We would not be able to achieve this with collect_list. Need a different approach altogether.

– Ansip
Nov 14 '18 at 6:06

Okay, well, still... Can you please edit the question with some Scala code that has gotten you part way towards an answer?

– cricket_007
Nov 14 '18 at 6:15

@cricket_007, I have updated an answer here. Is there any better way to do this?

– Ansip
Nov 14 '18 at 6:23


Tags: apache-spark, apache-spark-sql


1 Answer

I have an answer, but it currently degrades the job's performance. Is there a better way to do this?

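import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}
// Assumption: sqlContext is a HiveContext (Spark 1.6 window functions need one).
import sqlContext.implicits._

// myFile is assumed to be an RDD of case-class rows with the four columns above.
val df = myFile.toDF()

// People with at least one "Not Started" record; distinct avoids duplicated rows in the joins.
val dfFilter = df.filter($"Status" === "Not Started")
val dfSelect = dfFilter.select($"Assigned_to".alias("person")).distinct

// All records of those people.
val dfInner = dfSelect.join(df, $"person" === $"Assigned_to")

// Latest record per person. If Date is still a string such as 3/16/15,
// this ordering is lexicographic, not chronological.
val windowSpec = Window.partitionBy($"Assigned_to").orderBy(col("Date").desc)
val dfRank = dfInner.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1)
val dfDrop = dfRank.drop("rank").drop("person")

// Records of people who never had a "Not Started" record.
val dfLeftOuter = df.join(dfSelect, $"Assigned_to" === $"person", "leftouter")
val nullDf = dfLeftOuter.filter($"person".isNull).drop("person")

nullDf.unionAll(dfDrop).show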
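One possible way to avoid the two joins and the union is to compute both the "has a Not Started record" flag and the "latest record" position with window functions in a single pass. The sketch below has not been run against the real job, so whether it is actually faster is something to measure. It rests on several assumptions: the DataFrame comes from a HiveContext (window functions need one in Spark 1.6), records are grouped by Work_Id (the expected output keeps both R8 rows for John, which grouping by Assigned_to would not), and Date is a string in M/d/yy form. The names QualifyWorkOrders, qualify, ts, hasNotStarted and rn are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, row_number, trim, unix_timestamp, when}

object QualifyWorkOrders {
  def qualify(df: DataFrame): DataFrame = {
    // Assumption: one group per work order, not per person.
    val perOrder = Window.partitionBy("Work_Id")
    val latestFirst = perOrder.orderBy(col("ts").desc)

    df
      // Parse the date so ordering is chronological rather than lexicographic.
      .withColumn("ts", unix_timestamp(trim(col("Date")), "M/d/yy"))
      // 1 if any record of this work order is "Not Started", else 0.
      .withColumn("hasNotStarted",
        max(when(trim(col("Status")) === "Not Started", 1).otherwise(0)).over(perOrder))
      // 1 for the most recent record of the work order (ties broken arbitrarily).
      .withColumn("rn", row_number().over(latestFirst))
      // Keep everything for orders without "Not Started", otherwise only the latest record.
      .filter(col("hasNotStarted") === 0 || col("rn") === 1)
      .drop("ts").drop("hasNotStarted").drop("rn")
  }
}
```

It would be called as QualifyWorkOrders.qualify(df).show. Both window columns share the same partitioning, so a single shuffle by Work_Id should be enough, whereas the join-based version above shuffles the data for each join.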

answered Nov 14 '18 at 6:22

Ansip
