Faster implementation of LSH (AND-OR)





I have a data set of size (160000, 3200) in which all the elements are either zero or one, and I want to find similar candidates. I have hashed it down to (160000, 200) with MinHash using a single for-loop; it took about two minutes, which I am happy with. I have implemented Locality-Sensitive Hashing (LSH) with the AND-OR scheme from chapter 3 of 'Mining of Massive Datasets' to find candidate pairs, using a for-loop nested inside another for-loop, but it took 30 minutes. I want to reduce this time. Is there a faster way?




Here is how I have done LSH: minhash signature length (n) = 200, sub-signature length (r) = 5, number of bands (b) = 40.
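
For reference, with the banding scheme from chapter 3, a pair with Jaccard similarity s becomes a candidate in at least one band with probability 1 - (1 - s^r)^b, so these values put the similarity threshold at roughly (1/b)^(1/r):

r, b = 5, 40

def candidate_prob(s):
    # probability that a pair with Jaccard similarity s shares at least one band
    return 1 - (1 - s**r)**b

threshold = (1 / b) ** (1 / r)   # approximate similarity threshold, about 0.48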




# buckets_of_ids: one dictionary per band, keyed by the band's r_signature
# (signatures is the (160000, 200) minhash signature matrix mentioned below)
buckets_of_ids = [dict() for _ in range(40)]

for user in range(160000):
    for band in range(40):
        # the band's 5 minhash values, turned into a hashable string key
        r_signature = str(signatures[user, band * 5:(band + 1) * 5])
        if r_signature in buckets_of_ids[band]:
            buckets_of_ids[band][r_signature].append(user)
        else:
            buckets_of_ids[band][r_signature] = [user]


The minhash signature matrix of size (160000, 200) is a numpy array. My idea is: if I can cheaply convert it into a (160000, 40) array, where each element of the new array is formed from 5 elements of the minhash array, then maybe I can use numpy.unique() to get the unique r_signature values for each column and use them as keys in the dictionaries of candidate ids. I am new to Python as well as to coding in general, and I can't think of a way to optimize this to make it run faster.
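
A rough sketch of that reshaping idea (variable names such as signatures are placeholders for the arrays above): reshaping to (160000, 40, 5) puts each band's five values together, and numpy.unique with axis=0 labels identical r-signatures, so users sharing a label in some band are candidates.

import numpy as np

n_users = signatures.shape[0]
b, r = 40, 5
bands = signatures.reshape(n_users, b, r)   # band j of a user = 5 consecutive values

candidate_groups = []
for band in range(b):
    # labels[user] identifies the user's r-signature within this band
    _, labels = np.unique(bands[:, band, :], axis=0, return_inverse=True)
    labels = labels.ravel()
    # group user ids by label without any dictionary membership tests
    order = np.argsort(labels, kind='stable')
    boundaries = np.cumsum(np.bincount(labels))[:-1]
    candidate_groups.append(np.split(order, boundaries))

# users appearing together in any group of any band are candidate pairs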



Here is the link to the code and data:
https://colab.research.google.com/drive/1HetBrWFRYqwUxn0v7wIwS7COBaNmusfD



Note: I have observed that the time taken by the MinHash part increases linearly with the data (the number of users, in this case), whereas the LSH part increases non-linearly (the first 6.25% took 20.15 seconds and the last 6.25% took 132.3 seconds). I think it's necessary to optimize this part, if possible, to scale properly with the data. I believe that checking whether the key is already present in the dictionary is the part of the code responsible for this.



Update: I have solved this by avoiding the check for the key's presence in the dict, though I still ended up with a for-loop nested inside another for-loop, twice. It now takes 310 seconds for 160000 candidates, and the time taken scales linearly with the data. I have updated the corresponding code in the Google Colab notebook.
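
For reference, one way to drop the explicit key check (only a sketch of the general technique, not necessarily the exact change in the notebook) is collections.defaultdict, which creates the empty list on first access:

from collections import defaultdict

buckets_of_ids = [defaultdict(list) for _ in range(40)]

for user in range(160000):
    for band in range(40):
        # bytes of the band's 5 minhash values as the bucket key
        r_signature = signatures[user, band * 5:(band + 1) * 5].tobytes()
        buckets_of_ids[band][r_signature].append(user)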










Tags: python, locality-sensitive-hash, minhash






asked Nov 17 '18 at 6:27 by Ramki (edited Nov 17 '18 at 12:40)
























1 Answer






































Have you tried using the datasketch library? It has implementations of the MinHash and LSH algorithms.
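
For example, something along these lines should work for your setup (untested sketch; data stands for the (160000, 3200) zero/one array from the question, and params=(40, 5) mirrors your 40 bands of 5 rows):

import numpy as np
from datasketch import MinHash, MinHashLSH

lsh = MinHashLSH(num_perm=200, params=(40, 5))

minhashes = {}
for user_id, row in enumerate(data):
    m = MinHash(num_perm=200)
    for col in np.flatnonzero(row):          # treat the set of 1-columns as the "document"
        m.update(str(col).encode('utf8'))
    lsh.insert(str(user_id), m)
    minhashes[str(user_id)] = m

# candidate ids that share at least one band with user 0
candidates = lsh.query(minhashes['0'])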






answered Apr 11 at 13:54 by negas































