Why is & tokenized as “&amp;” in Python NLTK

When I use the Toktok word tokenizer from NLTK in Python 3:

from nltk.tokenize.toktok import ToktokTokenizer

string = '&& Test & and L&R '
ToktokTokenizer().tokenize(string)

I obtain the following output:

['&&amp;', 'Test', '&amp;', 'and', 'L&R']

It looks like the tokenizer escapes an & that is followed by a space as &amp;.
I'm using NLTK version 3.3 and Python 3.6.4.

Any idea why this happens, and is there an efficient way to avoid it?
I know I can post-process the tokens with

[tok.replace("&amp;", "&") for tok in tokenized_sentence]

but that seems like a dirty hack. I would like to know if there is a way to avoid producing this effect in the first place.

The source just states that ampersand is a "problematic character" but doesn't explain why. I don't think there's a way to prevent it from happening.
– snakecharmerb, 2 days ago

python nltk tokenize

asked Nov 9 at 18:14 by David GG

1 Answer

As mentioned by @snakecharmerb, the only explanation the source gives for the & rule is this comment:

# Replace problematic character with numeric character reference.
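
To see which ampersands such a rule touches, here is a minimal sketch using only the standard library. It assumes the rule is a compiled pattern for '& ' (ampersand plus space) paired with the replacement '&amp; ', which is what the output in the question suggests. AMPERSAND_RULE and apply_rule are hypothetical names, and this mimics that single substitution, not the full tokenizer.

```python
import re

# Assumed shape of the one rule under discussion: (pattern, replacement).
# This is an illustration, not a copy of the tokenizer's rule list.
AMPERSAND_RULE = (re.compile('& '), '&amp; ')


def apply_rule(text):
    """Apply the assumed ampersand rule to the text."""
    pattern, replacement = AMPERSAND_RULE
    return pattern.sub(replacement, text)


escaped = apply_rule('&& Test & and L&R ')
tokens = escaped.split()
print(tokens)  # ['&&amp;', 'Test', '&amp;', 'and', 'L&R']
```

Only an ampersand followed by a space matches, which would explain why L&R comes through unchanged while the second & of && is escaped.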


One way to solve the issue is to override the attributes on a ToktokTokenizer instance, replacing the rule whose substitution is '&amp; ' with one that leaves '& ' unchanged, for example:

import re

from nltk.tokenize.toktok import ToktokTokenizer

string = '&& Test & and L&R '

tokenizer = ToktokTokenizer()
# A rule that maps '& ' to itself, so nothing is escaped.
tokenizer.AMPERCENT = re.compile('& '), '& '
# Rebuild the rule list, swapping out only the '&amp; ' substitution.
tokenizer.TOKTOK_REGEXES = [(regex, sub) if sub != '&amp; ' else (re.compile('& '), '& ')
                            for (regex, sub) in ToktokTokenizer.TOKTOK_REGEXES]

result = tokenizer.tokenize(string)
print(result)

Output



['&&', 'Test', '&', 'and', 'L&R']
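
If overriding the tokenizer's attributes is not an option, the post-processing idea from the question can be generalized with html.unescape from the standard library, which decodes any character reference rather than just one hard-coded string. The sketch below is an alternative under stated assumptions, not part of NLTK: the token list is hard-coded from the question so that it runs without NLTK installed.

```python
from html import unescape

# Tokens as reported in the question (hard-coded for illustration).
tokens = ['&&amp;', 'Test', '&amp;', 'and', 'L&R']

# Decode character references in each token. A bare '&' that does not
# start a recognized reference is left as it is.
cleaned = [unescape(tok) for tok in tokens]
print(cleaned)  # ['&&', 'Test', '&', 'and', 'L&R']
```

One caveat: this also decodes references that were literally present in the input text, so it only suits input that is not expected to contain HTML entities of its own.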





edited 2 days ago
answered 2 days ago by Daniel Mesejo