Spark - Sort Double values in an RDD and ignore NaNs



























I want to sort the Double values in an RDD, and I want the sort to ignore Double.NaN values.

The Double.NaN values should appear either at the bottom or at the top of the sorted RDD.

I was not able to achieve this using sortBy:



scala> res13.sortBy(r => r, ascending = true)
res21: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[10] at sortBy at <console>:26

scala> res21.collect.foreach(println)
0.656
0.99
0.998
1.0
NaN
5.6
7.0

scala> res13.sortBy(r => r, ascending = false)
res23: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[15] at sortBy at <console>:26

scala> res23.collect.foreach(println)
7.0
5.6
NaN
1.0
0.998
0.99
0.656


My expected result is either:



scala> res23.collect.foreach(println)
7.0
5.6
1.0
0.998
0.99
0.656
NaN

or
scala> res21.collect.foreach(println)
NaN
0.656
0.99
0.998
1.0
5.6
7.0
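For context, a minimal sketch of why NaN is awkward to sort. Under IEEE 754, every ordered comparison involving NaN is false, so NaN is neither smaller nor larger than any other value. It is an assumption here, not something checked against Spark's source, that this is what leaves NaN stranded in the middle of the sortBy output. The object name NanComparisonSketch is made up for illustration.

```scala
object NanComparisonSketch {
  val nan: Double = Double.NaN

  // IEEE 754 semantics: ordered comparisons with NaN are always false,
  // and NaN is not even equal to itself.
  val lessThanOne: Boolean    = nan < 1.0
  val greaterThanOne: Boolean = nan > 1.0
  val equalsItself: Boolean   = nan == nan

  // java.lang.Double.compare defines a total order instead, in which
  // NaN is greater than every other value, including +Infinity.
  val totalOrderSign: Int =
    java.lang.Double.compare(nan, Double.PositiveInfinity)
}
```

All three Boolean values are false, while totalOrderSign is positive.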
































  • You can use an if expression in your r => r to give the NaNs a sort value. That r => r is just a function that tells Scala how the data should be ordered: you're saying element r should be ordered by the value r. For NaN, you can make it return Double.MaxValue or Double.MinValue instead.

    – user3685285
    Nov 15 '18 at 19:07
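The comment's idea can be sketched on a plain Scala collection, so that it runs without a Spark context. The names NanKeySketch, nanAsLargest, and sortNaNLast are hypothetical, and it is assumed rather than shown that the same key function behaves the same way when passed to RDD.sortBy. Note that only the sort key is replaced; the elements themselves, including NaN, come back unchanged.

```scala
object NanKeySketch {
  // Hypothetical key function: NaN is ordered as if it were Double.MaxValue.
  def nanAsLargest(r: Double): Double =
    if (r.isNaN) Double.MaxValue else r

  // Explicit ordering for the keys, so the sketch does not rely on
  // whichever implicit Ordering[Double] the Scala version provides.
  private val keyOrdering: Ordering[Double] = new Ordering[Double] {
    def compare(a: Double, b: Double): Int = java.lang.Double.compare(a, b)
  }

  // Sorts ascending by key; the key is used for ordering only,
  // so the returned sequence still contains the original NaN values.
  def sortNaNLast(xs: Seq[Double]): Seq[Double] =
    xs.sortBy(r => nanAsLargest(r))(keyOrdering)
}
```

For example, NanKeySketch.sortNaNLast(Seq(0.998, Double.NaN, 7.0, 0.656)) yields the three numbers in ascending order followed by NaN.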
























Tags: scala, sorting, apache-spark, rdd
















asked Nov 15 '18 at 18:42 by abc123



























2 Answers
































Taking what I said in the comment, you can try this:



scala> val a = sc.parallelize(Array(0.656, 0.99, 0.998, 1.0, Double.NaN, 5.6, 7.0))
a: org.apache.spark.rdd.RDD[Double] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> a.sortBy(r => r, ascending = false).collect
res2: Array[Double] = Array(7.0, 5.6, NaN, 1.0, 0.998, 0.99, 0.656)

scala> a.sortBy(r => if (r.isNaN) Double.MinValue else r, ascending = false).collect
res3: Array[Double] = Array(7.0, 5.6, 1.0, 0.998, 0.99, 0.656, NaN)

scala> a.sortBy(r => if (r.isNaN) Double.MaxValue else r, ascending = false).collect
res4: Array[Double] = Array(NaN, 7.0, 5.6, 1.0, 0.998, 0.99, 0.656)
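One caveat with sentinel values: a NaN mapped to Double.MaxValue ties with a genuine Double.MaxValue and still sorts before Double.PositiveInfinity when ascending (likewise, Double.MinValue sits above Double.NegativeInfinity). If such values can occur in the data, one alternative is an explicit Ordering that treats NaN as larger than everything. The sketch below swaps the sentinel technique for a custom comparator; the names are hypothetical, and supplying such an ordering to RDD.sortBy is an assumption that is not tested here.

```scala
object NanOrderingSketch {
  // Builds an ordering in which NaN always compares greater than any number,
  // while two ordinary numbers are compared by the supplied function.
  private def nanLast(numbers: (Double, Double) => Int): Ordering[Double] =
    new Ordering[Double] {
      def compare(a: Double, b: Double): Int =
        if (a.isNaN && b.isNaN) 0
        else if (a.isNaN) 1
        else if (b.isNaN) -1
        else numbers(a, b)
    }

  // Numbers ascending, NaN at the end.
  val ascendingNaNLast: Ordering[Double] =
    nanLast((a, b) => java.lang.Double.compare(a, b))

  // Numbers descending, NaN still at the end.
  val descendingNaNLast: Ordering[Double] =
    nanLast((a, b) => java.lang.Double.compare(b, a))
}
```

On a local collection this can be used as data.sorted(NanOrderingSketch.ascendingNaNLast); infinities then keep their numeric position and only NaN is pushed to the end.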





























  • This works! Thank you so much user3685285

    – abc123
    Nov 15 '18 at 21:42

































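To add to @user3685285's answer, the substitution can be wrapped in named helper functions, one per sort direction, so that NaN ends up last either way: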



scala> def sortAscending(r: Double): Double = { if (r.isNaN) Double.MaxValue else r }
sortAscending: (r: Double)Double

scala> def sortDescending(r: Double): Double = {if (r.isNaN) Double.MinValue else r }
sortDescending: (r: Double)Double

scala> res0.sortBy(sortDescending, ascending=false)
res7: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[20] at sortBy at <console>:28

scala> res7.collect.foreach(println)
99.9
34.2
10.98
7.0
6.0
5.0
2.0
0.56
0.01
0.0
NaN
NaN

scala> res0.sortBy(sortAscending, ascending=true)
res9: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[25] at sortBy at <console>:28

scala> res9.collect.foreach(println)
0.0
0.01
0.56
2.0
5.0
6.0
7.0
10.98
34.2
99.9
NaN
NaN
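The two helpers could also be folded into one function that picks the sentinel from the sort direction, so that NaN lands last either way. This is a minimal sketch on a plain Scala collection with hypothetical names (NanSentinelSketch, sentinelKey, sortNaNLast). It mirrors the pairing used in the transcript (Double.MaxValue for ascending, Double.MinValue for descending) and assumes the data contains no infinities or values equal to the sentinels.

```scala
object NanSentinelSketch {
  // Hypothetical key: NaN becomes the value that sorts last
  // for the requested direction; other values are their own key.
  def sentinelKey(r: Double, ascending: Boolean): Double =
    if (!r.isNaN) r
    else if (ascending) Double.MaxValue
    else Double.MinValue

  // Explicit numeric ordering for the keys (keys are never NaN).
  private val numeric: Ordering[Double] = new Ordering[Double] {
    def compare(a: Double, b: Double): Int = java.lang.Double.compare(a, b)
  }

  // Sorts in the requested direction with NaN values at the end.
  def sortNaNLast(xs: Seq[Double], ascending: Boolean): Seq[Double] = {
    val ord = if (ascending) numeric else numeric.reverse
    xs.sortBy(r => sentinelKey(r, ascending))(ord)
  }
}
```

Calling NanSentinelSketch.sortNaNLast(values, ascending = false) gives the numbers in descending order followed by the NaN entries, matching the shape of the output shown above.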















Answer 1 above: answered Nov 15 '18 at 21:00 by user3685285























Answer 2 above: answered Nov 15 '18 at 22:07 by abc123





























