Spark - Sort Double values in an RDD and ignore NaNs












I want to sort the Double values in an RDD, and I want my sort function to ignore the Double.NaN values.

The Double.NaN values should appear at either the top or the bottom of the sorted RDD.

I was not able to achieve this using sortBy:



scala> res13.sortBy(r => r, ascending = true)
res21: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[10] at sortBy at <console>:26

scala> res21.collect.foreach(println)
0.656
0.99
0.998
1.0
NaN
5.6
7.0

scala> res13.sortBy(r => r, ascending = false)
res23: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[15] at sortBy at <console>:26

scala> res23.collect.foreach(println)
7.0
5.6
NaN
1.0
0.998
0.99
0.656


My expected result is one of the following:



scala> res23.collect.foreach(println)
7.0
5.6
1.0
0.998
0.99
0.656
NaN

or
scala> res21.collect.foreach(println)
NaN
0.656
0.99
0.998
1.0
5.6
7.0









  • You can use an if statement to give the NaNs a value in your r => r. That r => r is just a function that tells Scala how the data should be ordered: you're saying element r should be ordered by the value r, but you can make the key Double.MaxValue or Double.MinValue instead.

    – user3685285
    Nov 15 '18 at 19:07
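The comment's idea can be sketched as follows. This is a minimal illustration, not code from the question: `data` is a hypothetical RDD[Double] built from illustrative values, and `sc` is the SparkContext that spark-shell provides.

```scala
// Sketch of the comment's suggestion: substitute an extreme value for NaN
// in the sort key, so NaN sorts to one end instead of landing arbitrarily.
val data = sc.parallelize(Seq(0.656, 0.99, Double.NaN, 7.0))

// Ascending sort with NaN first: give NaN the smallest possible key.
val nanFirst = data.sortBy(r => if (r.isNaN) Double.MinValue else r, ascending = true)

// Ascending sort with NaN last: give NaN the largest possible key.
val nanLast = data.sortBy(r => if (r.isNaN) Double.MaxValue else r, ascending = true)
```

Note the original values are unchanged; only the key used for ordering is rewritten, so NaN still appears in the output, just at a predictable end.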


















scala sorting apache-spark rdd






asked Nov 15 '18 at 18:42









abc123













2 Answers


















Taking what I said in the comment, you can try this:



scala> val a = sc.parallelize(Array(0.656, 0.99, 0.998, 1.0, Double.NaN, 5.6, 7.0))
a: org.apache.spark.rdd.RDD[Double] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> a.sortBy(r => r, ascending = false).collect
res2: Array[Double] = Array(7.0, 5.6, NaN, 1.0, 0.998, 0.99, 0.656)

scala> a.sortBy(r => if (r.isNaN) Double.MinValue else r, ascending = false).collect
res3: Array[Double] = Array(7.0, 5.6, 1.0, 0.998, 0.99, 0.656, NaN)

scala> a.sortBy(r => if (r.isNaN) Double.MaxValue else r, ascending = false).collect
res4: Array[Double] = Array(NaN, 7.0, 5.6, 1.0, 0.998, 0.99, 0.656)
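If the NaNs can be dropped entirely rather than pushed to one end, a simpler route (a sketch of an alternative, not part of this answer) is to filter them out before sorting:

```scala
// Drop the NaNs first, then sort normally; `a` is the RDD from the
// transcript above. The result contains no NaN entries at all.
val noNaNs = a.filter(r => !r.isNaN).sortBy(r => r, ascending = false)
```

Whether this fits depends on the requirement: the question asks for NaNs at the top or bottom of the output, so the key-rewriting approach above preserves them while this one discards them.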





answered Nov 15 '18 at 21:00

user3685285
  • This works! Thank you so much user3685285

    – abc123
    Nov 15 '18 at 21:42



















To add to @user3685285's answer:



scala> def sortAscending(r: Double): Double = { if (r.isNaN) Double.MaxValue else r }
sortAscending: (r: Double)Double

scala> def sortDescending(r: Double): Double = {if (r.isNaN) Double.MinValue else r }
sortDescending: (r: Double)Double

scala> res0.sortBy(sortDescending, ascending=false)
res7: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[20] at sortBy at <console>:28

scala> res7.collect.foreach(println)
99.9
34.2
10.98
7.0
6.0
5.0
2.0
0.56
0.01
0.0
NaN
NaN

scala> res0.sortBy(sortAscending, ascending=true)
res9: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[25] at sortBy at <console>:28

scala> res9.collect.foreach(println)
0.0
0.01
0.56
2.0
5.0
6.0
7.0
10.98
34.2
99.9
NaN
NaN
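Another option (an untested sketch, not from either answer) is to leave the key function alone and pass a custom Ordering to sortBy instead. java.lang.Double.compare already orders NaN after every other value, including PositiveInfinity, so with this ordering NaN lands last when ascending and first when descending:

```scala
import scala.reflect.classTag

// java.lang.Double.compare treats NaN as greater than any other double,
// so this ordering pushes NaN to the large end of the sort.
val nanLastOrdering: Ordering[Double] =
  Ordering.fromLessThan((x, y) => java.lang.Double.compare(x, y) < 0)

// sortBy takes its Ordering (and ClassTag) as implicit parameters;
// here they are passed explicitly. `res0` is the RDD[Double] from above.
val sorted = res0.sortBy(identity[Double], ascending = true)(nanLastOrdering, classTag[Double])
```

This keeps the sort key equal to the value itself, which can read more clearly than mapping NaN to Double.MinValue or Double.MaxValue.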





answered Nov 15 '18 at 22:07

abc123


















