Faster implementation of LSH (AND-OR)





I have a data set of size (160000, 3200) in which all the elements are either zero or one, and I want to find similar candidates. I have hashed it down to (160000, 200) with MinHash using a single for-loop; it took about two minutes, which I am happy with. I have implemented Locality-Sensitive Hashing (LSH) with the AND-OR scheme from chapter 3 of 'Mining of Massive Datasets' to find candidate pairs, using a for-loop nested inside another for-loop, but it took 30 minutes. I want to reduce this time. Is there a faster way?




Here is how I have done LSH: minhash signature length (n) = 200, sub-signature length (r) = 5, number of bands (b) = 40.
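
For reference, with the banding scheme from chapter 3, a pair with Jaccard similarity s becomes a candidate in at least one band with probability 1 - (1 - s^r)^b, so these values put the similarity threshold at roughly (1/b)^(1/r):

r, b = 5, 40

def candidate_prob(s):
    # probability that a pair with Jaccard similarity s shares at least one band
    return 1 - (1 - s**r)**b

threshold = (1 / b) ** (1 / r)   # approximate similarity threshold, about 0.48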




# buckets_of_ids: one dictionary per band, keyed by the band's r_signature
# (signatures is the (160000, 200) minhash signature matrix mentioned below)
buckets_of_ids = [dict() for _ in range(40)]

for user in range(160000):
    for band in range(40):
        # the band's 5 minhash values, turned into a hashable string key
        r_signature = str(signatures[user, band * 5:(band + 1) * 5])
        if r_signature in buckets_of_ids[band]:
            buckets_of_ids[band][r_signature].append(user)
        else:
            buckets_of_ids[band][r_signature] = [user]


The minhash signature matrix of size (160000, 200) is a numpy array. My idea is: if I can cheaply convert it into a (160000, 40) array, where each element of the new array is formed from 5 elements of the minhash array, then maybe I can use numpy.unique() to get the unique r_signature values for each column and use them as keys in the dictionaries of candidate ids. I am new to Python as well as to coding in general, and I can't think of a way to optimize this to make it run faster.
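
A rough sketch of that reshaping idea (variable names such as signatures are placeholders for the arrays above): reshaping to (160000, 40, 5) puts each band's five values together, and numpy.unique with axis=0 labels identical r-signatures, so users sharing a label in some band are candidates.

import numpy as np

n_users = signatures.shape[0]
b, r = 40, 5
bands = signatures.reshape(n_users, b, r)   # band j of a user = 5 consecutive values

candidate_groups = []
for band in range(b):
    # labels[user] identifies the user's r-signature within this band
    _, labels = np.unique(bands[:, band, :], axis=0, return_inverse=True)
    labels = labels.ravel()
    # group user ids by label without any dictionary membership tests
    order = np.argsort(labels, kind='stable')
    boundaries = np.cumsum(np.bincount(labels))[:-1]
    candidate_groups.append(np.split(order, boundaries))

# users appearing together in any group of any band are candidate pairs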



Here is the link to the code and data:
https://colab.research.google.com/drive/1HetBrWFRYqwUxn0v7wIwS7COBaNmusfD



Note: I have observed that the time taken by the MinHash part increases linearly with the data (the number of users, in this case), whereas the LSH part increases non-linearly (the first 6.25% took 20.15 seconds and the last 6.25% took 132.3 seconds). I think it's necessary to optimize this part, if possible, to scale properly with the data. I believe that checking whether the key is already present in the dictionary is the part of the code responsible for this.



Update: I have solved this by avoiding the check for the key's presence in the dict, though I still ended up with a for-loop nested inside another for-loop, twice. It now takes 310 seconds for 160000 candidates, and the time taken scales linearly with the data. I have updated the corresponding code in the Google Colab notebook.
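
For reference, one way to drop the explicit key check (only a sketch of the general technique, not necessarily the exact change in the notebook) is collections.defaultdict, which creates the empty list on first access:

from collections import defaultdict

buckets_of_ids = [defaultdict(list) for _ in range(40)]

for user in range(160000):
    for band in range(40):
        # bytes of the band's 5 minhash values as the bucket key
        r_signature = signatures[user, band * 5:(band + 1) * 5].tobytes()
        buckets_of_ids[band][r_signature].append(user)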










Tags: python, locality-sensitive-hash, minhash






asked Nov 17 '18 at 6:27 by Ramki (edited Nov 17 '18 at 12:40)
























1 Answer






































Have you tried using the datasketch library? It has implementations of the MinHash and LSH algorithms.
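
For example, something along these lines should work for your setup (untested sketch; data stands for the (160000, 3200) zero/one array from the question, and params=(40, 5) mirrors your 40 bands of 5 rows):

import numpy as np
from datasketch import MinHash, MinHashLSH

lsh = MinHashLSH(num_perm=200, params=(40, 5))

minhashes = {}
for user_id, row in enumerate(data):
    m = MinHash(num_perm=200)
    for col in np.flatnonzero(row):          # treat the set of 1-columns as the "document"
        m.update(str(col).encode('utf8'))
    lsh.insert(str(user_id), m)
    minhashes[str(user_id)] = m

# candidate ids that share at least one band with user 0
candidates = lsh.query(minhashes['0'])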






answered Apr 11 at 13:54 by negas































