Why is & tokenized as “&amp;” in Python NLTK

When I use the Toktok word tokenizer from NLTK in Python 3:

from nltk.tokenize.toktok import ToktokTokenizer

string = '&& Test & and L&R '
ToktokTokenizer().tokenize(string)

I obtain the following output:

['&&amp;', 'Test', '&amp;', 'and', 'L&R']

It looks like the tokenizer escapes the & in a strange way.
I'm using NLTK version 3.3 and Python 3.6.4.

Any idea why this happens, and is there an efficient way to avoid it?
I know I can post-process the result with

[tok.replace("&amp;","&") for tok in tokenized_sentence]

but that seems like a dirty hack. I would like to know whether there is a way to avoid this effect in the first place.

asked Nov 9 at 18:14

David GG

The source just states that ampersand is a "problematic character" but doesn't explain why. I don't think there's a way to prevent it from happening.
– snakecharmerb
2 days ago

Tags: python, nltk, tokenize

1 Answer

As mentioned by @snakecharmerb, the only explanation the source gives for the & rule is this comment:

# Replace problematic character with numeric character reference.
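
To see why only some ampersands are affected, here is a minimal sketch using only the standard `re` module. It assumes the rule has the shape implied by the override further down in this answer (a pattern matching `'& '` paired with the replacement `'&amp; '`); the names `AMPERSAND_RULE` and `apply_rule` are made up for illustration and are not part of NLTK.

```python
import re

# Assumed shape of the rule, inferred from the override used in this answer:
# an ampersand followed by a space is rewritten to '&amp; '.
AMPERSAND_RULE = (re.compile('& '), '&amp; ')


def apply_rule(text):
    """Apply the single (pattern, replacement) pair to the text."""
    pattern, replacement = AMPERSAND_RULE
    return pattern.sub(replacement, text)


escaped = apply_rule('&& Test & and L&R ')
tokens = escaped.split()
print(tokens)
```

Under that assumption, only an ampersand directly followed by a space is rewritten, which would explain why `L&R` is untouched while the second `&` in `&&` is escaped.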

One way to solve the issue is to override the relevant attributes on the ToktokTokenizer instance, for example:

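import re

from nltk.tokenize.toktok import ToktokTokenizer

string = '&& Test & and L&R '

tokenizer = ToktokTokenizer()

# Replace the ampersand rule so that '& ' is substituted by itself
# rather than by '&amp; '.
tokenizer.AMPERCENT = re.compile('& '), '& '

# Rebuild the rule list on this instance, swapping out the rule whose
# replacement is '&amp; ' and keeping every other rule unchanged.
tokenizer.TOKTOK_REGEXES = [
    (regex, sub) if sub != '&amp; ' else (re.compile('& '), '& ')
    for (regex, sub) in ToktokTokenizer.TOKTOK_REGEXES
]

result = tokenizer.tokenize(string)
print(result)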

Output

['&&', 'Test', '&', 'and', 'L&R']
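
If overriding the tokenizer's attributes is not an option, a possible alternative is to post-process the tokens with the standard library's `html.unescape`, which is a more general form of the `replace` call from the question. This is only a sketch: `unescape_tokens` is a hypothetical helper name, and the input list below is simply the output reported in the question. Note that `html.unescape` also converts character references that were already present in the original text, including some legacy names written without a trailing semicolon, so it is only suitable when the input is not expected to contain literal HTML entities.

```python
import html


def unescape_tokens(tokens):
    """Convert HTML character references in each token back to characters."""
    return [html.unescape(tok) for tok in tokens]


# Tokens as reported in the question for '&& Test & and L&R '.
tokenized = ['&&amp;', 'Test', '&amp;', 'and', 'L&R']
cleaned = unescape_tokens(tokenized)
print(cleaned)
```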

edited 2 days ago

answered 2 days ago

Daniel Mesejo
