How to cluster similar text tokens
I'm working on a project to cluster similar text tokens. The objective is to group tokens that could be typos and also tokens that share similar spelling. Here is a snippet of my data and the expected clusterization:
John (cluster 1)
Mike (cluster 2)
Joe (cluster 1)
Jon (cluster 1)
Jony (cluster 1)
Ajon (cluster 1)
Brown (cluster 3)
I'm just kicking the tires on clusterization so I'm not sure how can I go about achieving this. Looking through the various clustering techniques in ScikitLearn I came across AffinityPropagation to precompute the similarity but it isnt viable for a large dataset (I have ~200k tokens). All other clustering algorithms need vectors but I'm not sure how to genrate features from these tokens.
Any guidance here would be really appreciative.
Thank you
text scikit-learn cluster-analysis
add a comment |
I'm working on a project to cluster similar text tokens. The objective is to group tokens that could be typos and also tokens that share similar spelling. Here is a snippet of my data and the expected clusterization:
John (cluster 1)
Mike (cluster 2)
Joe (cluster 1)
Jon (cluster 1)
Jony (cluster 1)
Ajon (cluster 1)
Brown (cluster 3)
I'm just kicking the tires on clusterization so I'm not sure how can I go about achieving this. Looking through the various clustering techniques in ScikitLearn I came across AffinityPropagation to precompute the similarity but it isnt viable for a large dataset (I have ~200k tokens). All other clustering algorithms need vectors but I'm not sure how to genrate features from these tokens.
Any guidance here would be really appreciative.
Thank you
text scikit-learn cluster-analysis
what's the origin of these tokens? you could for instance generate an embedding (dense vector of float values) for each token and use these vectors as input to a clustering algorithm
– David Batista
Nov 4 at 17:51
add a comment |
I'm working on a project to cluster similar text tokens. The objective is to group tokens that could be typos and also tokens that share similar spelling. Here is a snippet of my data and the expected clusterization:
John (cluster 1)
Mike (cluster 2)
Joe (cluster 1)
Jon (cluster 1)
Jony (cluster 1)
Ajon (cluster 1)
Brown (cluster 3)
I'm just kicking the tires on clusterization so I'm not sure how can I go about achieving this. Looking through the various clustering techniques in ScikitLearn I came across AffinityPropagation to precompute the similarity but it isnt viable for a large dataset (I have ~200k tokens). All other clustering algorithms need vectors but I'm not sure how to genrate features from these tokens.
Any guidance here would be really appreciative.
Thank you
text scikit-learn cluster-analysis
I'm working on a project to cluster similar text tokens. The objective is to group tokens that could be typos and also tokens that share similar spelling. Here is a snippet of my data and the expected clusterization:
John (cluster 1)
Mike (cluster 2)
Joe (cluster 1)
Jon (cluster 1)
Jony (cluster 1)
Ajon (cluster 1)
Brown (cluster 3)
I'm just kicking the tires on clusterization so I'm not sure how can I go about achieving this. Looking through the various clustering techniques in ScikitLearn I came across AffinityPropagation to precompute the similarity but it isnt viable for a large dataset (I have ~200k tokens). All other clustering algorithms need vectors but I'm not sure how to genrate features from these tokens.
Any guidance here would be really appreciative.
Thank you
text scikit-learn cluster-analysis
text scikit-learn cluster-analysis
edited Nov 4 at 18:53
Luca Kiebel
7,29841431
7,29841431
asked Nov 4 at 17:27
webber
66121236
66121236
what's the origin of these tokens? you could for instance generate an embedding (dense vector of float values) for each token and use these vectors as input to a clustering algorithm
– David Batista
Nov 4 at 17:51
add a comment |
what's the origin of these tokens? you could for instance generate an embedding (dense vector of float values) for each token and use these vectors as input to a clustering algorithm
– David Batista
Nov 4 at 17:51
what's the origin of these tokens? you could for instance generate an embedding (dense vector of float values) for each token and use these vectors as input to a clustering algorithm
– David Batista
Nov 4 at 17:51
what's the origin of these tokens? you could for instance generate an embedding (dense vector of float values) for each token and use these vectors as input to a clustering algorithm
– David Batista
Nov 4 at 17:51
add a comment |
2 Answers
2
active
oldest
votes
You need a similarity function that encodes the intuition that tokens that differ only by a couple of letters are maybe mispellings.
One way you could do this:
You could transform each token into a vector with 26 dimensions (one for each letter) and each element represents the number of times a given letter appears in the token.
Tokens that differ by a single letter (maybe misspellings) will be close together in that features space because only one element of the array will be different.
Thanks, this sounds like an interesting approach. Let me give it a shot tomorrow
– webber
Nov 5 at 3:23
Well, with this approach Jon and Joe have two letter in common. But "One" and "neo" have all three in common. Letter histograms are okay for language detection, but I wouldn't use them for similarity. At least, use letter bigrams or trigrams...
– Anony-Mousse
Nov 12 at 18:42
add a comment |
I don't think clustering is what you should do here.
Because methods such as k-mrans force every point into a "cluster". And that probably isn't what you want.
You will also run into the problem that transitively, almost everything is "similar". There are plenty of games where you have to turn one word into another one letter at a time.
Unsteady try to first identify some good values (e.g., by frequency) and then decide with what tolerance to merge alternatives. But without transitive operations. This is simpler than clustering and faster.
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53143461%2fhow-to-cluster-similar-text-tokens%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
You need a similarity function that encodes the intuition that tokens that differ only by a couple of letters are maybe mispellings.
One way you could do this:
You could transform each token into a vector with 26 dimensions (one for each letter) and each element represents the number of times a given letter appears in the token.
Tokens that differ by a single letter (maybe misspellings) will be close together in that features space because only one element of the array will be different.
Thanks, this sounds like an interesting approach. Let me give it a shot tomorrow
– webber
Nov 5 at 3:23
Well, with this approach Jon and Joe have two letter in common. But "One" and "neo" have all three in common. Letter histograms are okay for language detection, but I wouldn't use them for similarity. At least, use letter bigrams or trigrams...
– Anony-Mousse
Nov 12 at 18:42
add a comment |
You need a similarity function that encodes the intuition that tokens that differ only by a couple of letters are maybe mispellings.
One way you could do this:
You could transform each token into a vector with 26 dimensions (one for each letter) and each element represents the number of times a given letter appears in the token.
Tokens that differ by a single letter (maybe misspellings) will be close together in that features space because only one element of the array will be different.
Thanks, this sounds like an interesting approach. Let me give it a shot tomorrow
– webber
Nov 5 at 3:23
Well, with this approach Jon and Joe have two letter in common. But "One" and "neo" have all three in common. Letter histograms are okay for language detection, but I wouldn't use them for similarity. At least, use letter bigrams or trigrams...
– Anony-Mousse
Nov 12 at 18:42
add a comment |
You need a similarity function that encodes the intuition that tokens that differ only by a couple of letters are maybe mispellings.
One way you could do this:
You could transform each token into a vector with 26 dimensions (one for each letter) and each element represents the number of times a given letter appears in the token.
Tokens that differ by a single letter (maybe misspellings) will be close together in that features space because only one element of the array will be different.
You need a similarity function that encodes the intuition that tokens that differ only by a couple of letters are maybe mispellings.
One way you could do this:
You could transform each token into a vector with 26 dimensions (one for each letter) and each element represents the number of times a given letter appears in the token.
Tokens that differ by a single letter (maybe misspellings) will be close together in that features space because only one element of the array will be different.
answered Nov 4 at 20:49
Felipe Almeida
5,09563882
5,09563882
Thanks, this sounds like an interesting approach. Let me give it a shot tomorrow
– webber
Nov 5 at 3:23
Well, with this approach Jon and Joe have two letter in common. But "One" and "neo" have all three in common. Letter histograms are okay for language detection, but I wouldn't use them for similarity. At least, use letter bigrams or trigrams...
– Anony-Mousse
Nov 12 at 18:42
add a comment |
Thanks, this sounds like an interesting approach. Let me give it a shot tomorrow
– webber
Nov 5 at 3:23
Well, with this approach Jon and Joe have two letter in common. But "One" and "neo" have all three in common. Letter histograms are okay for language detection, but I wouldn't use them for similarity. At least, use letter bigrams or trigrams...
– Anony-Mousse
Nov 12 at 18:42
Thanks, this sounds like an interesting approach. Let me give it a shot tomorrow
– webber
Nov 5 at 3:23
Thanks, this sounds like an interesting approach. Let me give it a shot tomorrow
– webber
Nov 5 at 3:23
Well, with this approach Jon and Joe have two letter in common. But "One" and "neo" have all three in common. Letter histograms are okay for language detection, but I wouldn't use them for similarity. At least, use letter bigrams or trigrams...
– Anony-Mousse
Nov 12 at 18:42
Well, with this approach Jon and Joe have two letter in common. But "One" and "neo" have all three in common. Letter histograms are okay for language detection, but I wouldn't use them for similarity. At least, use letter bigrams or trigrams...
– Anony-Mousse
Nov 12 at 18:42
add a comment |
I don't think clustering is what you should do here.
Because methods such as k-mrans force every point into a "cluster". And that probably isn't what you want.
You will also run into the problem that transitively, almost everything is "similar". There are plenty of games where you have to turn one word into another one letter at a time.
Unsteady try to first identify some good values (e.g., by frequency) and then decide with what tolerance to merge alternatives. But without transitive operations. This is simpler than clustering and faster.
add a comment |
I don't think clustering is what you should do here.
Because methods such as k-mrans force every point into a "cluster". And that probably isn't what you want.
You will also run into the problem that transitively, almost everything is "similar". There are plenty of games where you have to turn one word into another one letter at a time.
Unsteady try to first identify some good values (e.g., by frequency) and then decide with what tolerance to merge alternatives. But without transitive operations. This is simpler than clustering and faster.
add a comment |
I don't think clustering is what you should do here.
Because methods such as k-mrans force every point into a "cluster". And that probably isn't what you want.
You will also run into the problem that transitively, almost everything is "similar". There are plenty of games where you have to turn one word into another one letter at a time.
Unsteady try to first identify some good values (e.g., by frequency) and then decide with what tolerance to merge alternatives. But without transitive operations. This is simpler than clustering and faster.
I don't think clustering is what you should do here.
Because methods such as k-mrans force every point into a "cluster". And that probably isn't what you want.
You will also run into the problem that transitively, almost everything is "similar". There are plenty of games where you have to turn one word into another one letter at a time.
Unsteady try to first identify some good values (e.g., by frequency) and then decide with what tolerance to merge alternatives. But without transitive operations. This is simpler than clustering and faster.
answered Nov 12 at 18:40
Anony-Mousse
57.2k796159
57.2k796159
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53143461%2fhow-to-cluster-similar-text-tokens%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
what's the origin of these tokens? you could for instance generate an embedding (dense vector of float values) for each token and use these vectors as input to a clustering algorithm
– David Batista
Nov 4 at 17:51