How to cluster similar text tokens

I'm working on a project to cluster similar text tokens. The objective is to group tokens that could be typos and also tokens that share similar spelling. Here is a snippet of my data and the expected clusterization:

John (cluster 1)

Mike  (cluster 2)

Joe  (cluster 1)

Jon  (cluster 1)

Jony  (cluster 1)

Ajon  (cluster 1)

Brown  (cluster 3)

I'm just kicking the tires on clusterization so I'm not sure how can I go about achieving this. Looking through the various clustering techniques in ScikitLearn I came across AffinityPropagation to precompute the similarity but it isnt viable for a large dataset (I have ~200k tokens). All other clustering algorithms need vectors but I'm not sure how to genrate features from these tokens.

Any guidance here would be really appreciative.

Thank you

edited Nov 4 at 18:53

Luca Kiebel

7,29841431

asked Nov 4 at 17:27

webber

66121236

what's the origin of these tokens? you could for instance generate an embedding (dense vector of float values) for each token and use these vectors as input to a clustering algorithm
– David Batista
Nov 4 at 17:51

add a comment |

John (cluster 1)

Mike  (cluster 2)

Joe  (cluster 1)

Jon  (cluster 1)

Jony  (cluster 1)

Ajon  (cluster 1)

Brown  (cluster 3)

Any guidance here would be really appreciative.

Thank you

edited Nov 4 at 18:53

Luca Kiebel

7,29841431

asked Nov 4 at 17:27

webber

66121236

what's the origin of these tokens? you could for instance generate an embedding (dense vector of float values) for each token and use these vectors as input to a clustering algorithm
– David Batista
Nov 4 at 17:51

add a comment |

John (cluster 1)

Mike  (cluster 2)

Joe  (cluster 1)

Jon  (cluster 1)

Jony  (cluster 1)

Ajon  (cluster 1)

Brown  (cluster 3)

Any guidance here would be really appreciative.

Thank you

edited Nov 4 at 18:53

Luca Kiebel

7,29841431

asked Nov 4 at 17:27

webber

66121236

John (cluster 1)

Mike  (cluster 2)

Joe  (cluster 1)

Jon  (cluster 1)

Jony  (cluster 1)

Ajon  (cluster 1)

Brown  (cluster 3)

Any guidance here would be really appreciative.

Thank you

text scikit-learn cluster-analysis

edited Nov 4 at 18:53

Luca Kiebel

7,29841431

asked Nov 4 at 17:27

webber

66121236

edited Nov 4 at 18:53

Luca Kiebel

7,29841431

asked Nov 4 at 17:27

webber

66121236

edited Nov 4 at 18:53

Luca Kiebel

7,29841431

edited Nov 4 at 18:53

Luca Kiebel

7,29841431

edited Nov 4 at 18:53

Luca Kiebel

7,29841431

asked Nov 4 at 17:27

webber

66121236

asked Nov 4 at 17:27

webber

66121236

asked Nov 4 at 17:27

webber

66121236

what's the origin of these tokens? you could for instance generate an embedding (dense vector of float values) for each token and use these vectors as input to a clustering algorithm
– David Batista
Nov 4 at 17:51

add a comment |

what's the origin of these tokens? you could for instance generate an embedding (dense vector of float values) for each token and use these vectors as input to a clustering algorithm
– David Batista
Nov 4 at 17:51

what's the origin of these tokens? you could for instance generate an embedding (dense vector of float values) for each token and use these vectors as input to a clustering algorithm
– David Batista
Nov 4 at 17:51

add a comment |

2 Answers
2

active

oldest

votes

You need a similarity function that encodes the intuition that tokens that differ only by a couple of letters are maybe mispellings.

One way you could do this:

You could transform each token into a vector with 26 dimensions (one for each letter) and each element represents the number of times a given letter appears in the token.

Tokens that differ by a single letter (maybe misspellings) will be close together in that features space because only one element of the array will be different.

answered Nov 4 at 20:49

Felipe Almeida

5,09563882

Thanks, this sounds like an interesting approach. Let me give it a shot tomorrow
– webber
Nov 5 at 3:23

Well, with this approach Jon and Joe have two letter in common. But "One" and "neo" have all three in common. Letter histograms are okay for language detection, but I wouldn't use them for similarity. At least, use letter bigrams or trigrams...
– Anony-Mousse
Nov 12 at 18:42

add a comment |

I don't think clustering is what you should do here.

Because methods such as k-mrans force every point into a "cluster". And that probably isn't what you want.

You will also run into the problem that transitively, almost everything is "similar". There are plenty of games where you have to turn one word into another one letter at a time.

Unsteady try to first identify some good values (e.g., by frequency) and then decide with what tolerance to merge alternatives. But without transitive operations. This is simpler than clustering and faster.

answered Nov 12 at 18:40

Anony-Mousse

57.2k796159

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53143461%2fhow-to-cluster-similar-text-tokens%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

You need a similarity function that encodes the intuition that tokens that differ only by a couple of letters are maybe mispellings.

One way you could do this:

You could transform each token into a vector with 26 dimensions (one for each letter) and each element represents the number of times a given letter appears in the token.

Tokens that differ by a single letter (maybe misspellings) will be close together in that features space because only one element of the array will be different.

answered Nov 4 at 20:49

Felipe Almeida

5,09563882

Thanks, this sounds like an interesting approach. Let me give it a shot tomorrow
– webber
Nov 5 at 3:23

Well, with this approach Jon and Joe have two letter in common. But "One" and "neo" have all three in common. Letter histograms are okay for language detection, but I wouldn't use them for similarity. At least, use letter bigrams or trigrams...
– Anony-Mousse
Nov 12 at 18:42

add a comment |

You need a similarity function that encodes the intuition that tokens that differ only by a couple of letters are maybe mispellings.

One way you could do this:

You could transform each token into a vector with 26 dimensions (one for each letter) and each element represents the number of times a given letter appears in the token.

Tokens that differ by a single letter (maybe misspellings) will be close together in that features space because only one element of the array will be different.

answered Nov 4 at 20:49

Felipe Almeida

5,09563882

Thanks, this sounds like an interesting approach. Let me give it a shot tomorrow
– webber
Nov 5 at 3:23

Well, with this approach Jon and Joe have two letter in common. But "One" and "neo" have all three in common. Letter histograms are okay for language detection, but I wouldn't use them for similarity. At least, use letter bigrams or trigrams...
– Anony-Mousse
Nov 12 at 18:42

add a comment |

You need a similarity function that encodes the intuition that tokens that differ only by a couple of letters are maybe mispellings.

One way you could do this:

You could transform each token into a vector with 26 dimensions (one for each letter) and each element represents the number of times a given letter appears in the token.

Tokens that differ by a single letter (maybe misspellings) will be close together in that features space because only one element of the array will be different.

answered Nov 4 at 20:49

Felipe Almeida

5,09563882

You need a similarity function that encodes the intuition that tokens that differ only by a couple of letters are maybe mispellings.

One way you could do this:

You could transform each token into a vector with 26 dimensions (one for each letter) and each element represents the number of times a given letter appears in the token.

Tokens that differ by a single letter (maybe misspellings) will be close together in that features space because only one element of the array will be different.

answered Nov 4 at 20:49

Felipe Almeida

5,09563882

answered Nov 4 at 20:49

Felipe Almeida

5,09563882

answered Nov 4 at 20:49

Felipe Almeida

5,09563882

answered Nov 4 at 20:49

Felipe Almeida

5,09563882

Thanks, this sounds like an interesting approach. Let me give it a shot tomorrow
– webber
Nov 5 at 3:23

Well, with this approach Jon and Joe have two letter in common. But "One" and "neo" have all three in common. Letter histograms are okay for language detection, but I wouldn't use them for similarity. At least, use letter bigrams or trigrams...
– Anony-Mousse
Nov 12 at 18:42

add a comment |

Thanks, this sounds like an interesting approach. Let me give it a shot tomorrow
– webber
Nov 5 at 3:23

Well, with this approach Jon and Joe have two letter in common. But "One" and "neo" have all three in common. Letter histograms are okay for language detection, but I wouldn't use them for similarity. At least, use letter bigrams or trigrams...
– Anony-Mousse
Nov 12 at 18:42

Thanks, this sounds like an interesting approach. Let me give it a shot tomorrow
– webber
Nov 5 at 3:23

Well, with this approach Jon and Joe have two letter in common. But "One" and "neo" have all three in common. Letter histograms are okay for language detection, but I wouldn't use them for similarity. At least, use letter bigrams or trigrams...
– Anony-Mousse
Nov 12 at 18:42

add a comment |

I don't think clustering is what you should do here.

Because methods such as k-mrans force every point into a "cluster". And that probably isn't what you want.

You will also run into the problem that transitively, almost everything is "similar". There are plenty of games where you have to turn one word into another one letter at a time.

answered Nov 12 at 18:40

Anony-Mousse

57.2k796159

add a comment |

I don't think clustering is what you should do here.

Because methods such as k-mrans force every point into a "cluster". And that probably isn't what you want.

You will also run into the problem that transitively, almost everything is "similar". There are plenty of games where you have to turn one word into another one letter at a time.

answered Nov 12 at 18:40

Anony-Mousse

57.2k796159

add a comment |

I don't think clustering is what you should do here.

Because methods such as k-mrans force every point into a "cluster". And that probably isn't what you want.

You will also run into the problem that transitively, almost everything is "similar". There are plenty of games where you have to turn one word into another one letter at a time.

answered Nov 12 at 18:40

Anony-Mousse

57.2k796159

I don't think clustering is what you should do here.

Because methods such as k-mrans force every point into a "cluster". And that probably isn't what you want.

You will also run into the problem that transitively, almost everything is "similar". There are plenty of games where you have to turn one word into another one letter at a time.

answered Nov 12 at 18:40

Anony-Mousse

57.2k796159

answered Nov 12 at 18:40

Anony-Mousse

57.2k796159

answered Nov 12 at 18:40

Anony-Mousse

57.2k796159

answered Nov 12 at 18:40

Anony-Mousse

57.2k796159

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Vfrdtyky