Faster implementation of LSH (AND-OR)
I have a data set of size (160000, 3200) in which every element is either zero or one, and I want to find similar candidates. I hashed it down to (160000, 200) with Minhash using a single for-loop, which took about two minutes, and I am happy with that. I have implemented Locality-Sensitive Hashing (LSH) with the AND-OR scheme from Chapter 3 of 'Mining of Massive Datasets' to find candidate pairs, using a for-loop inside a for-loop, but it took 30 minutes. I want to reduce this time. Is there a faster way?
Here is how I have done LSH: Minhash signature length (n) = 200, sub-signature length (r) = 5, number of bands (b) = 40.
    # bucket_of_ids: one dictionary per band (40 in total); signature_matrix
    # is the (160000, 200) numpy minhash array described below.
    bucket_of_ids = [dict() for _ in range(40)]

    for user_id in range(160000):
        for band in range(40):
            # AND step: this band's 5 consecutive minhash values form the key
            r_signature = str(signature_matrix[user_id, band * 5:(band + 1) * 5])
            if r_signature in bucket_of_ids[band]:
                # add the user id to this band's bucket under the existing key
                bucket_of_ids[band][r_signature].append(user_id)
            else:
                # create the key, with the user id as a one-element list
                bucket_of_ids[band][r_signature] = [user_id]
The Minhash signature matrix of size (160000, 200) is a numpy array. My idea is: if I can cheaply convert it into a (160000, 40) array, where each element of the new array is formed from 5 elements of the minhash array, then maybe I can use numpy.unique() to get the unique r_signatures of each column to use as keys for the dictionary of candidate ids. I am new to Python as well as coding, and I can't think of a way to optimize this to make it run faster.
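Something like the following sketch seems to do that conversion in a vectorized way, assuming every minhash value fits in 12 bits (i.e., is less than 4096); sig is a placeholder name for the signature matrix:

    import numpy as np

    def band_keys(sig, b=40, r=5):
        # sig: the (160000, 200) minhash signature matrix.
        # Pack each band's r values into one 64-bit integer by treating them
        # as digits in base 2**12; assumes every minhash value < 4096.
        sig = sig.astype(np.uint64).reshape(sig.shape[0], b, r)
        base = np.uint64(1) << np.uint64(12)
        keys = np.zeros((sig.shape[0], b), dtype=np.uint64)
        for v in range(r):
            keys = keys * base + sig[:, :, v]
        return keys  # shape (160000, 40): one scalar key per user per band

For band j, np.unique(keys[:, j], return_inverse=True) then assigns every user a bucket index without any Python-level dict lookups.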
Here is the link to the code as well as the data:
https://colab.research.google.com/drive/1HetBrWFRYqwUxn0v7wIwS7COBaNmusfD
Note: I have observed that the time taken for the Minhash part increases linearly with data (number of users in this case), whereas for the LSH part it increases non-linearly (the first 6.25% took 20.15 seconds and the last 6.25% took 132.3 seconds). I think it's necessary to optimize this part, if possible, to scale properly with data. I believe checking whether the key is already present in the dictionary is the part of the code responsible for this.
Update: I have solved this by avoiding the check for the presence of a key in a dict, though I still ended up using a for-loop inside a for-loop twice. It now takes 310 seconds for 160000 candidates, and the time taken scales linearly with data. I have updated the corresponding code in the Google Colab notebook.
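One way to avoid the key check looks like this sketch, using collections.defaultdict, which creates the empty list on first access (signature_matrix is again a placeholder name for the minhash array):

    from collections import defaultdict

    # One dict per band; defaultdict creates the empty list on first access,
    # so the explicit membership test disappears.
    bucket_of_ids = [defaultdict(list) for _ in range(40)]

    for user_id in range(160000):
        for band in range(40):
            # the raw bytes of the 5 values make a cheap, hashable key
            r_signature = signature_matrix[user_id, band * 5:(band + 1) * 5].tobytes()
            bucket_of_ids[band][r_signature].append(user_id)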
python locality-sensitive-hash minhash
asked Nov 17 '18 at 6:27 by Ramki; edited Nov 17 '18 at 12:40
1 Answer
Have you tried the datasketch library? It has implementations of the Minhash and LSH algorithms.
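For example, a minimal sketch of using datasketch here; data stands in for the (160000, 3200) binary matrix, and the 0.5 threshold is illustrative:

    import numpy as np
    from datasketch import MinHash, MinHashLSH

    # Each row of 'data' is treated as the set of column indices
    # where it contains a one.
    lsh = MinHashLSH(threshold=0.5, num_perm=200)  # threshold is illustrative
    minhashes = {}
    for user_id, row in enumerate(data):
        m = MinHash(num_perm=200)  # matches the question's signature length
        for idx in np.flatnonzero(row):
            m.update(str(idx).encode('utf8'))
        lsh.insert(str(user_id), m)
        minhashes[user_id] = m

    # Candidate ids estimated to be at least 50% Jaccard-similar to user 0:
    candidates = lsh.query(minhashes[0])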
answered Apr 11 at 13:54 by negas