Unable to remove some emojis from Tweets in Python



























I have a data set of Tweets and am trying to remove all the emojis and symbols from them. However, my code does not remove some of the emojis, such as 🤣🤣🤣, ☠, ❤, ⭐ and others. How can I improve what I have tried, or is there another way to remove all of these emojis from the Tweets? The Tweets are in a pandas DataFrame.



# How I tried
import re

emoji_pattern = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]")

cleanedData['text'] = cleanedData['text'].str.replace(emoji_pattern, '', regex=True)



cleanedData.head(5).to_dict()  # After removing emojis with the above




{'id': {0: 1042616899408945154, 1: 1042592536769044487, 2: 1042587702040903680, 3: 1042587263643930626, 4: 1042586780292276230}, 'month': {0: 9, 1: 9, 2: 9, 3: 9, 4: 9}, 'hour': {0: 3, 1: 1, 2: 1, 3: 1, 4: 1}, 'text': {0: ' are red, violets are blue, if you want to buy us , here is a CLUE  Our  eye & cheek palette is AL… ', 1: 'Is it too late now to say sorry   ', 2: ' Oh no! Please email your order # to social & we can help . This is a newest offer!!', 3: " It's best applied with our buffer brush! \xa0", 4: ' DEAD 🤣🤣🤣'}, 'hasMedia': {0: 0, 1: 1, 2: 0, 3: 0, 4: 0}, 'hasHashtag': {0: 1, 1: 1, 2: 0, 3: 0, 4: 0}, 'followers_count': {0: 801745, 1: 801745, 2: 801745, 3: 801745, 4: 801745}, 'retweet_count': {0: 17, 1: 94, 2: 0, 3: 0, 4: 0}, 'favourite_count': {0: 181, 1: 408, 2: 0, 3: 0, 4: 14}, 'sentiments': {0: {'neg': 0.0, 'neu': 0.949, 'pos': 0.051, 'compound': 0.0772}, 1: {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}, 2: {'neg': 0.1, 'neu': 0.634, 'pos': 0.266, 'compound': 0.5684}, 3: {'neg': 0.0, 'neu': 0.64, 'pos': 0.36, 'compound': 0.6696}, 4: {'neg': 0.834, 'neu': 0.166, 'pos': 0.0, 'compound': -0.7213}}}
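A quick look at the code points involved suggests why these characters survive: none of them falls inside the four ranges listed in the pattern. The sketch below uses only built-ins; the `ranges` list restates the ranges from the pattern, and the helper name `in_pattern` is made up for illustration.

```python
# The four code point ranges used in the pattern from the question.
ranges = [
    (0x1F600, 0x1F64F),  # emoticons
    (0x1F300, 0x1F5FF),  # symbols & pictographs
    (0x1F680, 0x1F6FF),  # transport & map symbols
    (0x1F1E0, 0x1F1FF),  # flags
]


def in_pattern(ch):
    """Return True if the character lies in one of the ranges above (hypothetical helper)."""
    return any(lo <= ord(ch) <= hi for lo, hi in ranges)


# The characters reported as not removed: 🤣 ☠ ❤ ⭐
leftovers = ["\U0001F923", "\u2620", "\u2764", "\u2B50"]

for ch in leftovers:
    # Print the code point and whether the pattern's ranges cover it.
    print(f"U+{ord(ch):04X}", in_pattern(ch))
```

Each line prints `False`, so the pattern would need additional ranges (or a broader approach) to catch these characters.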


































  • Do you have some examples of the tweets?

    – Nick
    Nov 15 '18 at 15:38











  • See the cleanedData.head(5).to_dict() output. I updated the question above.

    – Kabilesh
    Nov 15 '18 at 15:47











  • See my answer - you will need to use it on the text part of your tweet object

    – Nick
    Nov 15 '18 at 15:52
















python twitter














edited Nov 15 '18 at 15:46







Kabilesh

















asked Nov 15 '18 at 15:36









Kabilesh














2 Answers
































Depending on what you need from the dataset, you could try a broader regex pattern that removes every non-ASCII character, such as:

cleaned_data['text'] = cleaned_data['text'].str.replace(r'[^\x00-\x7F]+', '', regex=True)
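To see what such a pattern does to individual strings, here is a minimal sketch using the standard `re` module rather than pandas. The sample strings are made up for illustration; the pattern is the non-ASCII character class from the line above.

```python
import re

# Matches one or more consecutive characters outside the ASCII range 0x00-0x7F.
non_ascii = re.compile(r"[^\x00-\x7F]+")

# Hypothetical sample strings standing in for the tweet text.
samples = [
    "DEAD \U0001F923\U0001F923\U0001F923",  # three 🤣
    "love \u2764 stars \u2B50",             # ❤ and ⭐
    "café au lait",                         # accented letter, not an emoji
]

# Replace every non-ASCII run with the empty string.
cleaned = [non_ascii.sub("", s) for s in samples]
print(cleaned)
```

Note that the accented letter in the last sample is removed as well, so this approach is probably only suitable when losing all non-ASCII text is acceptable.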































  • Getting TypeError: replace() missing 1 required positional argument: 'repl'

    – Kabilesh
    Nov 15 '18 at 15:52











  • Apologies - fixed it.

    – Tim
    Nov 15 '18 at 17:27

































Try this - no regex:



cleaned_text = u"\U0001F600 some words then symbol \U0001F6FF".encode('ascii', 'ignore').decode('utf8')


This assumes the symbols appear within the text of a tweet.
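The same encode/decode round trip can be wrapped in a small function so it is easy to apply to each tweet. This is only a sketch: the name `strip_non_ascii` and the sample strings are assumptions for illustration, not part of the original answer.

```python
def strip_non_ascii(text):
    """Drop every character that cannot be encoded as ASCII (hypothetical helper)."""
    # errors="ignore" silently skips characters that have no ASCII encoding.
    return text.encode("ascii", "ignore").decode("ascii")


# Sample strings modelled on the tweets in the question.
print(repr(strip_non_ascii("DEAD \U0001F923\U0001F923\U0001F923")))
print(repr(strip_non_ascii("It's best applied with our buffer brush!\xa0")))
```

On a DataFrame column this could be applied with `cleanedData['text'].apply(strip_non_ascii)`, assuming every value in the column is a string. Like the regex approach, it removes all non-ASCII characters, not only emojis.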





























    edited Nov 15 '18 at 15:55

























    answered Nov 15 '18 at 15:39









    Tim













        answered Nov 15 '18 at 15:46









        Nick





























