Why is & tokenized as “&amp;” in Python NLTK
When trying to use the Toktok word tokenizer from NLTK in Python 3:
string='&& Test & and L&R '
from nltk.tokenize.toktok import ToktokTokenizer
ToktokTokenizer().tokenize(string)
I obtain the following output:
['&&amp;', 'Test', '&amp;', 'and', 'L&R']
It looks like it escapes the & in a strange way.
I'm using NLTK version 3.3 and Python 3.6.4.
Any idea why this happens, and is there an efficient way to solve it?
I know I can post-process the tokens with
[tok.replace("&amp;", "&") for tok in tokenized_sentence]
but that seems like a dirty hack. I would like to know if there is a way to avoid this effect in the first place.
python nltk tokenize
The source just states that ampersand is a "problematic character" but doesn't explain why. I don't think there's a way to prevent it from happening.
– snakecharmerb
2 days ago
asked Nov 9 at 18:14
David GG
1 Answer
As @snakecharmerb mentioned, the source says this about the &:
# Replace problematic character with numeric character reference.
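To see what a rule like that does, here is a minimal sketch that uses only the standard library. It assumes the rule is a (pattern, replacement) pair like the AMPERCENT attribute overridden in this answer, rewriting '& ' to '&amp; '. The names AMPERSAND_RULE and apply_rules are hypothetical and only for illustration; this is a stand-in, not NLTK's own code.

```python
import re

# Assumed shape of the rule: a (compiled pattern, replacement) pair.
# Hypothetical name; it only imitates the kind of pair the tokenizer stores.
AMPERSAND_RULE = (re.compile('& '), '&amp; ')


def apply_rules(text, rules):
    """Apply each (pattern, replacement) pair in order, then split on whitespace."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text.strip().split()


tokens = apply_rules('Test & and L&R ', [AMPERSAND_RULE])
print(tokens)  # ['Test', '&amp;', 'and', 'L&R']
```

In this sketch only an ampersand followed by a space matches the pattern, so 'L&R' is left alone.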
One way to solve the issue is to override the relevant attributes on the ToktokTokenizer instance, for example:
import re
from nltk.tokenize.toktok import ToktokTokenizer
string = '&& Test & and L&R '
tokenizer = ToktokTokenizer()
# Replace the escaping rule with one that maps '& ' to itself.
tokenizer.AMPERCENT = re.compile('& '), '& '
# Rebuild the rule list on the instance, swapping out the pair whose replacement is '&amp; '.
tokenizer.TOKTOK_REGEXES = [
    (regex, sub) if sub != '&amp; ' else (re.compile('& '), '& ')
    for (regex, sub) in ToktokTokenizer.TOKTOK_REGEXES
]
result = tokenizer.tokenize(string)
print(result)
Output:
['&&', 'Test', '&', 'and', 'L&R']
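If overriding the tokenizer's attributes feels too invasive, post-processing the tokens is another option. As a rough sketch, the standard library's html.unescape can turn character references in the tokens back into plain characters. The helper name unescape_tokens is hypothetical, and the sample token list is made up for illustration rather than taken from a real tokenizer run.

```python
import html


def unescape_tokens(tokens):
    """Decode HTML character references (e.g. '&amp;') in each token."""
    return [html.unescape(tok) for tok in tokens]


# Made-up sample tokens, standing in for tokenizer output.
escaped = ['Test', '&amp;', 'and', 'L&R']
print(unescape_tokens(escaped))  # ['Test', '&', 'and', 'L&R']
```

Note that this also decodes any character references that were already present in the original text, so it only fits when the input is not expected to contain literal references such as '&amp;'.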
edited 2 days ago
answered 2 days ago
Daniel Mesejo