Find same values in two huge datasets












I have a list of roughly 2,000 rows of the form [UnixTimestamp, Value01, Value02] (it comes as JSON), and another list with a few million rows of the same form (it comes as a .csv). I want to find out whether each element of the smaller list has an element in the larger list with the same values.
Both lists are sorted by timestamp.



The simplest way is obviously something like this:



    def has_match(small_list, big_list):
        for x in small_list:
            if x in big_list:  # linear scan of big_list for every x
                return True
        return False


But does that make sense or is there a more efficient way?



Thanks







python-3.x algorithm list search bigdata






asked Nov 15 '18 at 15:02









TimS













  • Is there any relation between UnixTimestamp, Value01, and Value02?

    – vivek_23
    Nov 15 '18 at 15:34

  • You could post a cut-down version of the two datasets to make writing a solution easier. In general, two lists of lists can be converted to two sets of tuples and the set intersection computed.

    – Paddy3118
    Nov 15 '18 at 16:23





















2 Answers
Both are already sorted by timestamp, so use that to your advantage:

    def has_match(small_list, big_list):
        # Returns True as soon as one row of small_list is found in big_list.
        i = 0  # first candidate row in big_list; only ever moves forward
        for x in small_list:
            # Skip rows that are older than x.
            while i < len(big_list) and big_list[i].timestamp < x.timestamp:
                i += 1
            # Compare x with every row sharing its timestamp. A separate index
            # keeps i in place in case the next x has the same timestamp.
            j = i
            while j < len(big_list) and big_list[j].timestamp == x.timestamp:
                y = big_list[j]
                if y.value01 == x.value01 and y.value02 == x.value02:
                    return True
                j += 1
        return False

If timestamps are unique, the complexity is O(len(big_list) + len(small_list)), because each list is traversed at most once.
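
The loop above stops at the first match. For the check the question describes (every row of the smaller list must occur in the larger one), a minimal sketch might look like the following. It is not part of the original answer: it assumes the rows are plain [timestamp, value01, value02] sequences rather than objects with attributes, and the name `find_missing_rows` and the sample data are hypothetical.

```python
def find_missing_rows(small_list, big_list):
    """Return the rows of small_list that have no identical row in big_list.

    Assumes both lists hold [timestamp, value01, value02] rows and are
    sorted by timestamp (index 0).
    """
    missing = []
    i = 0  # first row of big_list whose timestamp is >= the current one
    n = len(big_list)
    for row in small_list:
        ts = row[0]
        # Advance past rows that are older than the current row.
        while i < n and big_list[i][0] < ts:
            i += 1
        # Scan the rows sharing this timestamp with a separate index, so
        # a later row of small_list with the same timestamp can rescan them.
        j = i
        found = False
        while j < n and big_list[j][0] == ts:
            if list(big_list[j]) == list(row):
                found = True
                break
            j += 1
        if not found:
            missing.append(row)
    return missing


small = [[1, "a", "b"], [3, "c", "d"], [5, "e", "f"]]
big = [[1, "a", "b"], [2, "x", "y"], [3, "c", "z"], [3, "c", "d"], [4, "q", "r"]]
print(find_missing_rows(small, big))  # [[5, 'e', 'f']]
```

An empty result means every row of the smaller list was found. Returning the missing rows rather than a bare boolean makes it easier to see which rows failed to match.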







answered Nov 16 '18 at 16:28

juvian

























If they are just lists, you can try something like this:

    set(small_list) & set(big_list)

Converting to a set removes duplicate values, and the & operator returns the values that both sets have in common.
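
As a comment on this answer points out, `set()` fails when the rows are themselves lists, because lists are not hashable. One possible workaround, sketched here under the assumption that each row is a short list of hashable values, is to convert every row to a tuple first. The sample data and variable names are illustrative only.

```python
small_list = [[1, 10, 20], [2, 11, 21], [3, 12, 22]]
big_list = [[1, 10, 20], [2, 99, 21], [3, 12, 22], [4, 13, 23]]

# Lists are unhashable, so convert each row to a tuple before building sets.
small_set = {tuple(row) for row in small_list}
big_set = {tuple(row) for row in big_list}

common = small_set & big_set        # rows present in both
missing = small_set - big_set       # rows of small_list absent from big_list
all_present = small_set <= big_set  # True only if every small row is in big

print(sorted(common))   # [(1, 10, 20), (3, 12, 22)]
print(sorted(missing))  # [(2, 11, 21)]
print(all_present)      # False
```

Rows only compare equal if their values have the same types. Fields read with Python's `csv` module arrive as strings, so they would need converting to match the numbers parsed from JSON.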







answered Nov 15 '18 at 15:07

Naveen













  • Great idea, thanks. The data doesn't come as a list yet, but I planned to convert it for that reason.

    – TimS
    Nov 15 '18 at 15:14

  • The cast to a set doesn't work; I guess the list actually contains lists.

    – TimS
    Nov 15 '18 at 15:36

  • What is the time complexity of this?

    – Surt
    Nov 15 '18 at 22:56
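
On the complexity question: building a set is linear in the number of rows on average, and each membership test is O(1) on average. A sketch that avoids holding the multi-million-row file in memory is to build the set from the small list only and stream the CSV with the standard `csv` module. It assumes the CSV has no header row and that all three columns are integers; the function name `rows_missing_from_csv` and the sample data are hypothetical.

```python
import csv
import io


def rows_missing_from_csv(small_list, csv_file):
    """Return the rows of small_list that never appear in csv_file.

    csv_file is any iterable of CSV lines (e.g. an open file object).
    Assumes no header row and integer columns; adjust the conversion
    to match the real data.
    """
    remaining = {tuple(row) for row in small_list}
    for record in csv.reader(csv_file):
        if not record:
            continue  # skip blank lines
        remaining.discard(tuple(int(field) for field in record))
        if not remaining:
            break  # every row was found; no need to read further
    return remaining


# In-memory stand-in for the real .csv file.
sample_csv = io.StringIO("1,10,20\n2,99,21\n3,12,22\n4,13,23\n")
small_list = [[1, 10, 20], [2, 11, 21], [3, 12, 22]]
print(rows_missing_from_csv(small_list, sample_csv))  # {(2, 11, 21)}
```

With a real file, a file object opened with `newline=""` can be passed in place of the `io.StringIO` stand-in. Memory use then depends only on the size of the smaller list, and the file is read once at most.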


















