Spark - Sort Double values in an RDD and ignore NaNs
I want to sort the Double values in an RDD, and I want my sort function to ignore the Double.NaN values.
The Double.NaN values should appear either at the bottom or at the top of the sorted RDD.
I was not able to achieve this using sortBy; the NaN stays in the middle of the output:
scala> res13.sortBy(r => r, ascending = true)
res21: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[10] at sortBy at <console>:26
scala> res21.collect.foreach(println)
0.656
0.99
0.998
1.0
NaN
5.6
7.0
scala> res13.sortBy(r => r, ascending = false)
res23: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[15] at sortBy at <console>:26
scala> res23.collect.foreach(println)
7.0
5.6
NaN
1.0
0.998
0.99
0.656
My expected result is
scala> res23.collect.foreach(println)
7.0
5.6
1.0
0.998
0.99
0.656
NaN
or
scala> res21.collect.foreach(println)
NaN
0.656
0.99
0.998
1.0
5.6
7.0
scala sorting apache-spark rdd
asked Nov 15 '18 at 18:42
abc123
You can use an if expression to give the NaNs a substitute value in your r => r. That r => r is just a key function that tells Spark how the data should be ordered: you're saying element r should be ordered by the value r. Instead, you can make the key Double.MaxValue or Double.MinValue whenever r is NaN.
– user3685285
Nov 15 '18 at 19:07
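As a rough illustration of the comment above, here is a minimal plain-Scala sketch of the same key idea on a local Seq, so it runs without a SparkContext. The helper name keyWithSentinel and the value names are made up for this example, and sortWith is used only so the sketch does not depend on the implicit Ordering[Double]; on an RDD you would pass the equivalent key function to sortBy.

```scala
// Hypothetical helper (not from the thread): the sort key for one element.
// NaN is replaced by a sentinel; every other value is its own key.
def keyWithSentinel(r: Double, sentinel: Double): Double =
  if (r.isNaN) sentinel else r

val sample = Seq(0.656, 0.99, 0.998, 1.0, Double.NaN, 5.6, 7.0)

// Ascending sort on the key: a MinValue sentinel sends NaN to the front,
// a MaxValue sentinel sends it to the back.
val nanFirst = sample.sortWith((x, y) =>
  keyWithSentinel(x, Double.MinValue) < keyWithSentinel(y, Double.MinValue))
val nanLast = sample.sortWith((x, y) =>
  keyWithSentinel(x, Double.MaxValue) < keyWithSentinel(y, Double.MaxValue))

println(nanFirst.mkString(", "))
println(nanLast.mkString(", "))
```

Note that in Scala, Double.MinValue is the most negative finite double (-Double.MaxValue), not the smallest positive one (that is Double.MinPositiveValue), which is why it works as a "sorts first" sentinel.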
2 Answers
Taking what I said in the comment, you can try this:
scala> val a = sc.parallelize(Array(0.656, 0.99, 0.998, 1.0, Double.NaN, 5.6, 7.0))
a: org.apache.spark.rdd.RDD[Double] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> a.sortBy(r => r, ascending = false).collect
res2: Array[Double] = Array(7.0, 5.6, NaN, 1.0, 0.998, 0.99, 0.656)
scala> a.sortBy(r => if (r.isNaN) Double.MinValue else r, ascending = false).collect
res3: Array[Double] = Array(7.0, 5.6, 1.0, 0.998, 0.99, 0.656, NaN)
scala> a.sortBy(r => if (r.isNaN) Double.MaxValue else r, ascending = false).collect
res4: Array[Double] = Array(NaN, 7.0, 5.6, 1.0, 0.998, 0.99, 0.656)
answered Nov 15 '18 at 21:00
user3685285
This works! Thank you so much user3685285
– abc123
Nov 15 '18 at 21:42
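One caveat that may be worth checking against your data: Double.MaxValue and Double.MinValue are finite, so if the RDD can also contain Double.PositiveInfinity or Double.NegativeInfinity, those values would sort beyond the NaNs. A possible variant, sketched here on a local Seq, is to sort on a (Boolean, Double) pair instead of a sentinel. The names nanFlagKey, doubleOrd and keyOrd are hypothetical, and I have not run this against an RDD; RDD.sortBy takes a key function plus an implicit Ordering for the key type, so a tuple key should be usable there too, but treat that as an assumption.

```scala
// Hypothetical key function (not from the answer above): the Boolean flag
// puts every NaN in its own group, and the Double orders the other values.
// The NaN itself is replaced by 0.0; that placeholder is never compared
// against an ordinary value because the flags already differ.
def nanFlagKey(r: Double): (Boolean, Double) =
  if (r.isNaN) (true, 0.0) else (false, r)

// Explicit orderings, so the sketch does not rely on the implicit
// Ordering[Double], which differs between Scala versions.
val doubleOrd: Ordering[Double] = new Ordering[Double] {
  def compare(x: Double, y: Double): Int = java.lang.Double.compare(x, y)
}
val keyOrd: Ordering[(Boolean, Double)] =
  Ordering.Tuple2(Ordering.Boolean, doubleOrd)

val withInfinities =
  Seq(1.0, Double.NaN, Double.PositiveInfinity, 5.6, Double.NegativeInfinity)

// false sorts before true, so all ordinary values (infinities included)
// come first and the NaN comes last.
val flagged = withInfinities.sortBy(nanFlagKey)(keyOrd)
println(flagged.mkString(", "))
```

To keep the NaNs last in a descending sort, the flag would need to be inverted (for example !r.isNaN), since reversing the order also reverses the Boolean part of the key.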
To add to @user3685285's answer, you can define named key functions so that the NaNs end up at the bottom in both sort directions:
scala> def sortAscending(r: Double): Double = { if (r.isNaN) Double.MaxValue else r }
sortAscending: (r: Double)Double
scala> def sortDescending(r: Double): Double = { if (r.isNaN) Double.MinValue else r }
sortDescending: (r: Double)Double
scala> res0.sortBy(sortDescending, ascending=false)
res7: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[20] at sortBy at <console>:28
scala> res7.collect.foreach(println)
99.9
34.2
10.98
7.0
6.0
5.0
2.0
0.56
0.01
0.0
NaN
NaN
scala> res0.sortBy(sortAscending, ascending=true)
res9: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[25] at sortBy at <console>:28
scala> res9.collect.foreach(println)
0.0
0.01
0.56
2.0
5.0
6.0
7.0
10.98
34.2
99.9
NaN
NaN
answered Nov 15 '18 at 22:07
abc123
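If you would rather not keep two separate functions, one possible shape is a helper that picks the sentinel from the sort direction. This is only a sketch on a local Seq; nanLastKeyFor and sortNanLast are hypothetical names, and with an RDD the function returned by nanLastKeyFor(ascending) would be passed to sortBy together with the same ascending flag.

```scala
// Hypothetical helper: returns the key function for the given direction,
// so that NaNs end up last either way.
def nanLastKeyFor(ascending: Boolean): Double => Double =
  r =>
    if (r.isNaN) { if (ascending) Double.MaxValue else Double.MinValue }
    else r

// Local stand-in for a sort with an `ascending` flag: compare the keys
// with < for ascending order and with > for descending order.
def sortNanLast(xs: Seq[Double], ascending: Boolean): Seq[Double] = {
  val key = nanLastKeyFor(ascending)
  if (ascending) xs.sortWith((x, y) => key(x) < key(y))
  else xs.sortWith((x, y) => key(x) > key(y))
}

val values = Seq(5.0, Double.NaN, 0.01, 99.9, Double.NaN, 2.0)
val asc = sortNanLast(values, ascending = true)
val desc = sortNanLast(values, ascending = false)

println(asc.mkString(", "))
println(desc.mkString(", "))
```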