How to cluster similar text tokens












1














I'm working on a project to cluster similar text tokens. The objective is to group tokens that could be typos and also tokens that share similar spelling. Here is a snippet of my data and the expected clusterization:



John (cluster 1)
Mike (cluster 2)
Joe (cluster 1)
Jon (cluster 1)
Jony (cluster 1)
Ajon (cluster 1)
Brown (cluster 3)


I'm just kicking the tires on clusterization so I'm not sure how can I go about achieving this. Looking through the various clustering techniques in ScikitLearn I came across AffinityPropagation to precompute the similarity but it isnt viable for a large dataset (I have ~200k tokens). All other clustering algorithms need vectors but I'm not sure how to genrate features from these tokens.



Any guidance here would be really appreciative.



Thank you










share|improve this question
























  • what's the origin of these tokens? you could for instance generate an embedding (dense vector of float values) for each token and use these vectors as input to a clustering algorithm
    – David Batista
    Nov 4 at 17:51


















1














I'm working on a project to cluster similar text tokens. The objective is to group tokens that could be typos and also tokens that share similar spelling. Here is a snippet of my data and the expected clusterization:



John (cluster 1)
Mike (cluster 2)
Joe (cluster 1)
Jon (cluster 1)
Jony (cluster 1)
Ajon (cluster 1)
Brown (cluster 3)


I'm just kicking the tires on clusterization so I'm not sure how can I go about achieving this. Looking through the various clustering techniques in ScikitLearn I came across AffinityPropagation to precompute the similarity but it isnt viable for a large dataset (I have ~200k tokens). All other clustering algorithms need vectors but I'm not sure how to genrate features from these tokens.



Any guidance here would be really appreciative.



Thank you










share|improve this question
























  • what's the origin of these tokens? you could for instance generate an embedding (dense vector of float values) for each token and use these vectors as input to a clustering algorithm
    – David Batista
    Nov 4 at 17:51
















1












1








1







I'm working on a project to cluster similar text tokens. The objective is to group tokens that could be typos and also tokens that share similar spelling. Here is a snippet of my data and the expected clusterization:



John (cluster 1)
Mike (cluster 2)
Joe (cluster 1)
Jon (cluster 1)
Jony (cluster 1)
Ajon (cluster 1)
Brown (cluster 3)


I'm just kicking the tires on clusterization so I'm not sure how can I go about achieving this. Looking through the various clustering techniques in ScikitLearn I came across AffinityPropagation to precompute the similarity but it isnt viable for a large dataset (I have ~200k tokens). All other clustering algorithms need vectors but I'm not sure how to genrate features from these tokens.



Any guidance here would be really appreciative.



Thank you










share|improve this question















I'm working on a project to cluster similar text tokens. The objective is to group tokens that could be typos and also tokens that share similar spelling. Here is a snippet of my data and the expected clusterization:



John (cluster 1)
Mike (cluster 2)
Joe (cluster 1)
Jon (cluster 1)
Jony (cluster 1)
Ajon (cluster 1)
Brown (cluster 3)


I'm just kicking the tires on clusterization so I'm not sure how can I go about achieving this. Looking through the various clustering techniques in ScikitLearn I came across AffinityPropagation to precompute the similarity but it isnt viable for a large dataset (I have ~200k tokens). All other clustering algorithms need vectors but I'm not sure how to genrate features from these tokens.



Any guidance here would be really appreciative.



Thank you







text scikit-learn cluster-analysis






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 4 at 18:53









Luca Kiebel

7,29841431




7,29841431










asked Nov 4 at 17:27









webber

66121236




66121236












  • what's the origin of these tokens? you could for instance generate an embedding (dense vector of float values) for each token and use these vectors as input to a clustering algorithm
    – David Batista
    Nov 4 at 17:51




















  • what's the origin of these tokens? you could for instance generate an embedding (dense vector of float values) for each token and use these vectors as input to a clustering algorithm
    – David Batista
    Nov 4 at 17:51


















what's the origin of these tokens? you could for instance generate an embedding (dense vector of float values) for each token and use these vectors as input to a clustering algorithm
– David Batista
Nov 4 at 17:51






what's the origin of these tokens? you could for instance generate an embedding (dense vector of float values) for each token and use these vectors as input to a clustering algorithm
– David Batista
Nov 4 at 17:51














2 Answers
2






active

oldest

votes


















2














You need a similarity function that encodes the intuition that tokens that differ only by a couple of letters are maybe mispellings.



One way you could do this:



You could transform each token into a vector with 26 dimensions (one for each letter) and each element represents the number of times a given letter appears in the token.



Tokens that differ by a single letter (maybe misspellings) will be close together in that features space because only one element of the array will be different.






share|improve this answer





















  • Thanks, this sounds like an interesting approach. Let me give it a shot tomorrow
    – webber
    Nov 5 at 3:23










  • Well, with this approach Jon and Joe have two letter in common. But "One" and "neo" have all three in common. Letter histograms are okay for language detection, but I wouldn't use them for similarity. At least, use letter bigrams or trigrams...
    – Anony-Mousse
    Nov 12 at 18:42



















0














I don't think clustering is what you should do here.



Because methods such as k-mrans force every point into a "cluster". And that probably isn't what you want.



You will also run into the problem that transitively, almost everything is "similar". There are plenty of games where you have to turn one word into another one letter at a time.



Unsteady try to first identify some good values (e.g., by frequency) and then decide with what tolerance to merge alternatives. But without transitive operations. This is simpler than clustering and faster.






share|improve this answer





















    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53143461%2fhow-to-cluster-similar-text-tokens%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    2














    You need a similarity function that encodes the intuition that tokens that differ only by a couple of letters are maybe mispellings.



    One way you could do this:



    You could transform each token into a vector with 26 dimensions (one for each letter) and each element represents the number of times a given letter appears in the token.



    Tokens that differ by a single letter (maybe misspellings) will be close together in that features space because only one element of the array will be different.






    share|improve this answer





















    • Thanks, this sounds like an interesting approach. Let me give it a shot tomorrow
      – webber
      Nov 5 at 3:23










    • Well, with this approach Jon and Joe have two letter in common. But "One" and "neo" have all three in common. Letter histograms are okay for language detection, but I wouldn't use them for similarity. At least, use letter bigrams or trigrams...
      – Anony-Mousse
      Nov 12 at 18:42
















    2














    You need a similarity function that encodes the intuition that tokens that differ only by a couple of letters are maybe mispellings.



    One way you could do this:



    You could transform each token into a vector with 26 dimensions (one for each letter) and each element represents the number of times a given letter appears in the token.



    Tokens that differ by a single letter (maybe misspellings) will be close together in that features space because only one element of the array will be different.






    share|improve this answer





















    • Thanks, this sounds like an interesting approach. Let me give it a shot tomorrow
      – webber
      Nov 5 at 3:23










    • Well, with this approach Jon and Joe have two letter in common. But "One" and "neo" have all three in common. Letter histograms are okay for language detection, but I wouldn't use them for similarity. At least, use letter bigrams or trigrams...
      – Anony-Mousse
      Nov 12 at 18:42














    2












    2








    2






    You need a similarity function that encodes the intuition that tokens that differ only by a couple of letters are maybe mispellings.



    One way you could do this:



    You could transform each token into a vector with 26 dimensions (one for each letter) and each element represents the number of times a given letter appears in the token.



    Tokens that differ by a single letter (maybe misspellings) will be close together in that features space because only one element of the array will be different.






    share|improve this answer












    You need a similarity function that encodes the intuition that tokens that differ only by a couple of letters are maybe mispellings.



    One way you could do this:



    You could transform each token into a vector with 26 dimensions (one for each letter) and each element represents the number of times a given letter appears in the token.



    Tokens that differ by a single letter (maybe misspellings) will be close together in that features space because only one element of the array will be different.







    share|improve this answer












    share|improve this answer



    share|improve this answer










    answered Nov 4 at 20:49









    Felipe Almeida

    5,09563882




    5,09563882












    • Thanks, this sounds like an interesting approach. Let me give it a shot tomorrow
      – webber
      Nov 5 at 3:23










    • Well, with this approach Jon and Joe have two letter in common. But "One" and "neo" have all three in common. Letter histograms are okay for language detection, but I wouldn't use them for similarity. At least, use letter bigrams or trigrams...
      – Anony-Mousse
      Nov 12 at 18:42


















    • Thanks, this sounds like an interesting approach. Let me give it a shot tomorrow
      – webber
      Nov 5 at 3:23










    • Well, with this approach Jon and Joe have two letter in common. But "One" and "neo" have all three in common. Letter histograms are okay for language detection, but I wouldn't use them for similarity. At least, use letter bigrams or trigrams...
      – Anony-Mousse
      Nov 12 at 18:42
















    Thanks, this sounds like an interesting approach. Let me give it a shot tomorrow
    – webber
    Nov 5 at 3:23




    Thanks, this sounds like an interesting approach. Let me give it a shot tomorrow
    – webber
    Nov 5 at 3:23












    Well, with this approach Jon and Joe have two letter in common. But "One" and "neo" have all three in common. Letter histograms are okay for language detection, but I wouldn't use them for similarity. At least, use letter bigrams or trigrams...
    – Anony-Mousse
    Nov 12 at 18:42




    Well, with this approach Jon and Joe have two letter in common. But "One" and "neo" have all three in common. Letter histograms are okay for language detection, but I wouldn't use them for similarity. At least, use letter bigrams or trigrams...
    – Anony-Mousse
    Nov 12 at 18:42













    0














    I don't think clustering is what you should do here.



    Because methods such as k-mrans force every point into a "cluster". And that probably isn't what you want.



    You will also run into the problem that transitively, almost everything is "similar". There are plenty of games where you have to turn one word into another one letter at a time.



    Unsteady try to first identify some good values (e.g., by frequency) and then decide with what tolerance to merge alternatives. But without transitive operations. This is simpler than clustering and faster.






    share|improve this answer


























      0














      I don't think clustering is what you should do here.



      Because methods such as k-mrans force every point into a "cluster". And that probably isn't what you want.



      You will also run into the problem that transitively, almost everything is "similar". There are plenty of games where you have to turn one word into another one letter at a time.



      Unsteady try to first identify some good values (e.g., by frequency) and then decide with what tolerance to merge alternatives. But without transitive operations. This is simpler than clustering and faster.






      share|improve this answer
























        0












        0








        0






        I don't think clustering is what you should do here.



        Because methods such as k-mrans force every point into a "cluster". And that probably isn't what you want.



        You will also run into the problem that transitively, almost everything is "similar". There are plenty of games where you have to turn one word into another one letter at a time.



        Unsteady try to first identify some good values (e.g., by frequency) and then decide with what tolerance to merge alternatives. But without transitive operations. This is simpler than clustering and faster.






        share|improve this answer












        I don't think clustering is what you should do here.



        Because methods such as k-mrans force every point into a "cluster". And that probably isn't what you want.



        You will also run into the problem that transitively, almost everything is "similar". There are plenty of games where you have to turn one word into another one letter at a time.



        Unsteady try to first identify some good values (e.g., by frequency) and then decide with what tolerance to merge alternatives. But without transitive operations. This is simpler than clustering and faster.







        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Nov 12 at 18:40









        Anony-Mousse

        57.2k796159




        57.2k796159






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.





            Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


            Please pay close attention to the following guidance:


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53143461%2fhow-to-cluster-similar-text-tokens%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            List item for chat from Array inside array React Native

            Thiostrepton

            Caerphilly