Spark Data frame 1.6
Data Dump
Work_Id,Assigned_to,Date,Status
R1,John,3/4/15,Not Started
R1,John,3/5/15,In Progress
R1,John,3/6/15,Finished
R3,Alaxender,3/7/15,In Progress
R3,Alaxender,3/8/15,In Progress
R4,Patrick,3/9/15,Finished
R5,Peter,3/11/15,Finished
R7,George,3/13/15,Not Started
R7,George,3/14/15,In Progress
R8,John,3/15/15,In Progress
R8,John,3/16/15,In Progress
R9,Alaxender,3/17/15,Not Started
Final Output
Work_Id,Assigned_to,Date,Status
R1,John,3/6/15,Finished
R7,George,3/14/15,In Progress
R9,Alaxender,3/17/15,Not Started
R3,Alaxender,3/7/15,In Progress
R3,Alaxender,3/8/15,In Progress
R4,Patrick,3/9/15,Finished
R5,Peter,3/11/15,Finished
R8,John,3/15/15,In Progress
R8,John,3/16/15,In Progress
The data dump above consists of work orders. If a request has a record with the status "Not Started" and subsequent records for the same person, only the last record (sorted by date) qualifies. If the "Not Started" record is the only record, that record qualifies.
E.g., given these records:
R1,John,3/4/15,Not Started
R1,John,3/5/15,In Progress
R1,John,3/6/15,Finished
only this record qualifies:
R1,John,3/6/15,Finished
All remaining records, i.e. those where the same person has no "Not Started" status, qualify for the output as they are.
Any help would be appreciated. This needs to be done with Spark 1.6 DataFrames in Scala.
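To pin the rule down, here is a minimal in-memory sketch in plain Scala (no Spark). It rests on two assumptions: that a group is one Work_Id for one person (the Final Output keeps both R8 rows for John although his R1 request has a "Not Started" record, which suggests grouping per request rather than per person alone), and that dates are M/d/yy strings. The names WorkRecord, QualifyRule and qualify are hypothetical.

```scala
import java.text.SimpleDateFormat

// Hypothetical record type mirroring the four columns of the dump.
case class WorkRecord(workId: String, assignedTo: String, date: String, status: String)

object QualifyRule {
  // Assumption: dates are M/d/yy strings such as 3/4/15.
  private def millis(date: String): Long =
    new SimpleDateFormat("M/d/yy").parse(date).getTime

  // Assumption: a group is one work order for one person.
  def qualify(records: Seq[WorkRecord]): Seq[WorkRecord] =
    records
      .groupBy(r => (r.workId, r.assignedTo))
      .values
      .flatMap { group =>
        if (group.exists(_.status == "Not Started"))
          // Group contains "Not Started": keep only its latest record.
          Seq(group.maxBy(r => millis(r.date)))
        else
          // No "Not Started" in the group: every record qualifies.
          group
      }
      .toSeq
}
```

Applied to the twelve rows of the dump, this should return the nine rows listed under Final Output, though not necessarily in that order, since grouping does not preserve row order.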
apache-spark apache-spark-sql
How is this any different from the last question you posted with no attempt at a solution yourself? stackoverflow.com/questions/49472718/… ... idownvotedbecau.se/noattempt
– cricket_007
Nov 14 '18 at 6:00
Appreciate your help on this..
– Ansip
Nov 14 '18 at 6:02
@cricket_007, We would not be able to achieve this with collect_list. Need a different approach altogether.
– Ansip
Nov 14 '18 at 6:06
Okay, well, still... Can you please edit the question with some Scala code that has gotten you part way towards an answer?
– cricket_007
Nov 14 '18 at 6:15
@cricket_007, I have updated an answer here. Is there any better way to do this?
– Ansip
Nov 14 '18 at 6:23
edited Nov 14 '18 at 5:59 by cricket_007
asked Nov 13 '18 at 18:13 by Ansip
1 Answer
I have an answer, but it currently degrades the job's performance. Is there a better way to do this?
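// Assumes sqlContext.implicits._ is in scope (as in spark-shell) for toDF() and $"..."
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}
val df = myFile.toDF()
// People with at least one "Not Started" record; distinct() keeps the joins below from duplicating rows
val dfFilter = df.filter($"Status" === "Not Started")
val dfSelect = dfFilter.select($"Assigned_to".alias("person")).distinct()
// All records of those people, reduced to the latest record per person
val dfInner = dfSelect.join(df, $"person" === $"Assigned_to")
val windowSpec = Window.partitionBy($"Assigned_to").orderBy(col("Date").desc)
val dfRank = dfInner.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1)
val dfDrop = dfRank.drop($"rank").drop($"person")
// Records of people without any "Not Started" record pass through unchanged
val dfLeftOuter = df.join(dfSelect, $"Assigned_to" === $"person", "leftouter")
val nullDf = dfLeftOuter.filter($"person".isNull).drop($"person")
nullDf.unionAll(dfDrop).show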
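One possible alternative, offered as an untested sketch rather than a verified fix, is to do the whole thing in a single pass with window functions instead of two joins and a union. It assumes that a group is one Work_Id for one person (partitioning by Assigned_to alone, as above, would reduce John to a single R8 row, whereas the Final Output keeps both R8 rows), that Date is a string in M/d/yy format, and that the DataFrame comes from a HiveContext, which Spark 1.6 requires for window functions. The helper name qualify and the columns has_not_started and rn are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, row_number, unix_timestamp, when}

// Hypothetical helper: keeps only the latest row of each group that contains a
// "Not Started" record, and every row of the groups that do not.
def qualify(df: DataFrame): DataFrame = {
  // Assumption: a group is one work order for one person.
  val group = Window.partitionBy(col("Work_Id"), col("Assigned_to"))
  // Assumption: Date is a string such as 3/4/15; parse it so that
  // ordering is chronological rather than lexicographic.
  val ts = unix_timestamp(col("Date"), "M/d/yy")

  df
    // 1 if any row in the group is "Not Started", else 0
    .withColumn("has_not_started",
      max(when(col("Status") === "Not Started", 1).otherwise(0)).over(group))
    // 1 for the latest row of each group
    .withColumn("rn", row_number().over(group.orderBy(ts.desc)))
    .filter(col("has_not_started") === 0 || col("rn") === 1)
    .drop("has_not_started")
    .drop("rn")
}
```

Usage would be qualify(df).show. Both window expressions share the same partitioning, so this avoids the self-joins, but whether it is actually faster on a given data set would need to be measured.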
answered Nov 14 '18 at 6:22 by Ansip