Unable to remove some emojis from Tweets in Python
I have a dataset of Tweets and am trying to remove all emojis and symbols from them. However, my code does not remove some emojis, such as 🤣🤣🤣, ☠, ❤, ⭐ and others. How can I improve what I have tried, or is there another way to remove all these emojis from the Tweets? The Tweets are in a pandas DataFrame.
########## How I tried
import re

emoji_pattern = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]")
cleanedData['text'] = cleanedData['text'].str.replace(emoji_pattern, '', regex=True)
cleanedData.head(5).to_dict()  # After removing emojis with the above
{'id': {0: 1042616899408945154, 1: 1042592536769044487, 2: 1042587702040903680, 3: 1042587263643930626, 4: 1042586780292276230}, 'month': {0: 9, 1: 9, 2: 9, 3: 9, 4: 9}, 'hour': {0: 3, 1: 1, 2: 1, 3: 1, 4: 1}, 'text': {0: ' are red, violets are blue, if you want to buy us , here is a CLUE Our eye & cheek palette is AL… ', 1: 'Is it too late now to say sorry ', 2: ' Oh no! Please email your order # to social & we can help . This is a newest offer!!', 3: " It's best applied with our buffer brush! \xa0", 4: ' DEAD 🤣🤣🤣'}, 'hasMedia': {0: 0, 1: 1, 2: 0, 3: 0, 4: 0}, 'hasHashtag': {0: 1, 1: 1, 2: 0, 3: 0, 4: 0}, 'followers_count': {0: 801745, 1: 801745, 2: 801745, 3: 801745, 4: 801745}, 'retweet_count': {0: 17, 1: 94, 2: 0, 3: 0, 4: 0}, 'favourite_count': {0: 181, 1: 408, 2: 0, 3: 0, 4: 14}, 'sentiments': {0: {'neg': 0.0, 'neu': 0.949, 'pos': 0.051, 'compound': 0.0772}, 1: {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}, 2: {'neg': 0.1, 'neu': 0.634, 'pos': 0.266, 'compound': 0.5684}, 3: {'neg': 0.0, 'neu': 0.64, 'pos': 0.36, 'compound': 0.6696}, 4: {'neg': 0.834, 'neu': 0.166, 'pos': 0.0, 'compound': -0.7213}}}
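The output above still contains 🤣, which suggests the character falls outside the four ranges in the pattern. One way to check is to print each character's code point and test it against the pattern. This is only a diagnostic sketch: the helper name `describe` is made up for illustration, and the pattern is the one from the question with the `\U` escapes written out.

```python
import re
import unicodedata

# The pattern from the question, with the \U escapes written out.
emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]"
)


def describe(ch):
    """Return (code point, Unicode name, matched by the pattern?) for one character."""
    code_point = f"U+{ord(ch):04X}"
    name = unicodedata.name(ch, "UNKNOWN")  # "UNKNOWN" if this Python lacks the name
    return code_point, name, bool(emoji_pattern.match(ch))


# The four characters the question reports as not removed, plus one that is covered.
for ch in "\U0001F923\u2620\u2764\u2B50\U0001F600":
    print(describe(ch))
```

The first four characters are U+1F923, U+2620, U+2764 and U+2B50. They belong to the Supplemental Symbols and Pictographs, Miscellaneous Symbols, Dingbats, and Miscellaneous Symbols and Arrows blocks respectively, none of which the original character class covers.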
python twitter
asked Nov 15 '18 at 15:36 by Kabilesh, edited Nov 15 '18 at 15:46
Do you have some examples of the tweets?
– Nick
Nov 15 '18 at 15:38
See the cleanedData.head(5).to_dict() output; I updated the question above.
– Kabilesh
Nov 15 '18 at 15:47
See my answer - you will need to use it on the text part of your tweet object
– Nick
Nov 15 '18 at 15:52
2 Answers
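Depending on what you need from the dataset, you could try a broader regex pattern that removes every non-ASCII character (this also strips accented letters and other non-ASCII text, not only emojis), such as
cleaned_data['text'] = cleaned_data['text'].str.replace(r'[^\x00-\x7F]+', '', regex=True)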
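To see what that broader pattern does without needing a DataFrame, here is a minimal sketch using only the standard `re` module. The function name `strip_non_ascii` and the sample strings are illustrative assumptions, not part of the original answer.

```python
import re

# Matches one or more consecutive characters outside the 7-bit ASCII range.
non_ascii = re.compile(r"[^\x00-\x7F]+")


def strip_non_ascii(text):
    """Remove every run of non-ASCII characters from text."""
    return non_ascii.sub("", text)


# Emoji are removed...
print(repr(strip_non_ascii("DEAD \U0001F923\U0001F923\U0001F923")))
# ...but so is the accented letter, which may or may not be what you want.
print(repr(strip_non_ascii("caf\u00e9 \u2764")))
```

If the Tweets contain non-English text that should be kept, this approach is probably too aggressive, and extending the original character class with more emoji ranges may be the better trade-off.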
answered Nov 15 '18 at 15:39 by Tim, edited Nov 15 '18 at 15:55
Getting TypeError: replace() missing 1 required positional argument: 'repl'
– Kabilesh
Nov 15 '18 at 15:52
Apologies - fixed it.
– Tim
Nov 15 '18 at 17:27
Try this - no regex:
cleaned_text = u"\U0001F600 some words then symbol \U0001F6FF".encode('ascii', 'ignore').decode('utf8')
I'm assuming the symbols are found within the text of a tweet.
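As a rough sketch of how the encode/decode approach could be applied to several Tweets at once, the snippet below wraps it in a small function. The name `to_ascii` and the sample list are assumptions made for illustration; the second sample is based on the text from the question's output.

```python
def to_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    # errors="ignore" silently discards unencodable characters instead of raising.
    return text.encode("ascii", "ignore").decode("ascii")


tweets = [
    "\U0001F600 some words then symbol \U0001F6FF",
    "It's best applied with our buffer brush!\xa0",
]
cleaned = [to_ascii(t) for t in tweets]
for line in cleaned:
    print(repr(line))
```

With a pandas DataFrame, the same function could presumably be applied per row with something like `cleanedData['text'].apply(to_ascii)`. As with the regex answer, all non-ASCII characters are dropped, including the non-breaking space `\xa0` and any accented letters.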
answered Nov 15 '18 at 15:46 by Nick