How to extract year (or datetime) from a column in a pandas dataframe that contains text












-1















Suppose I have a pandas dataframe:



Id    Book                      
1 Harry Potter (1997)
2 Of Mice and Men (1937)
3 Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story


How do I extract the year from the column?



Output should be:



Id    Book Title               Year
1 Harry Potter 1997
2 Of Mice and Men 1937
3 Babe Ruth Story, The 1948


So far I have tried:



movies['year'] = movies['title'].str.extract('([0-9(0-9)]+)', expand=False).str.strip()


and



books['year'] = books['title'].str[-5:-1]


I've messed around with some other things and haven't gotten it to work yet. Any suggestions?










share|improve this question





























    -1















    Suppose I have a pandas dataframe:



    Id    Book                      
    1 Harry Potter (1997)
    2 Of Mice and Men (1937)
    3 Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story


    How do I extract the year from the column?



    Output should be:



    Id    Book Title               Year
    1 Harry Potter 1997
    2 Of Mice and Men 1937
    3 Babe Ruth Story, The 1948


    So far I have tried:



    movies['year'] = movies['title'].str.extract('([0-9(0-9)]+)', expand=False).str.strip()


    and



    books['year'] = books['title'].str[-5:-1]


    I've messed around with some other things and haven't gotten it to work yet. Any suggestions?










    share|improve this question



























      -1












      -1








      -1








      Suppose I have a pandas dataframe:



      Id    Book                      
      1 Harry Potter (1997)
      2 Of Mice and Men (1937)
      3 Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story


      How do I extract the year from the column?



      Output should be:



      Id    Book Title               Year
      1 Harry Potter 1997
      2 Of Mice and Men 1937
      3 Babe Ruth Story, The 1948


      So far I have tried:



      movies['year'] = movies['title'].str.extract('([0-9(0-9)]+)', expand=False).str.strip()


      and



      books['year'] = books['title'].str[-5:-1]


      I've messed around with some other things and haven't gotten it to work yet. Any suggestions?










      share|improve this question
















      Suppose I have a pandas dataframe:



      Id    Book                      
      1 Harry Potter (1997)
      2 Of Mice and Men (1937)
      3 Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story


      How do I extract the year from the column?



      Output should be:



      Id    Book Title               Year
      1 Harry Potter 1997
      2 Of Mice and Men 1937
      3 Babe Ruth Story, The 1948


      So far I have tried:



      movies['year'] = movies['title'].str.extract('([0-9(0-9)]+)', expand=False).str.strip()


      and



      books['year'] = books['title'].str[-5:-1]


      I've messed around with some other things and haven't gotten it to work yet. Any suggestions?







      python regex pandas datetime parsing






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 15 '18 at 17:51







      Matt Elgazar

















      asked Nov 15 '18 at 17:44









      Matt ElgazarMatt Elgazar

      10210




      10210
























          3 Answers
          3






          active

          oldest

          votes


















          3














          How about a simple Regex:



          text = 'Harry Potter (1997)'
          re.findall('((d{4}))', text)
          # ['1997'] Note that this is a list of "all" the occurrences.




          With a Dataframe, it can be done like this:



          text = 'Harry Potter (1997)'
          df = pd.DataFrame({'Book': text}, index=[1])
          pattern = '((d{4}))'
          df['year'] = df.Book.str.extract(pattern, expand=False) #False returns a series

          df
          # Book year
          # 1 Harry Potter (1997) 1997




          Finally, if you actually want to separate the title and the data (taking the dataframe reconstruction from Philip in another answer):



          df = pd.DataFrame(columns=['Book'], data=[['Harry Potter (1997)'],['Of Mice and Men (1937)'],['Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story']])

          sep = df['Book'].str.extract('(.*)((d{4}))', expand=False)

          sep # A new df, separated into title and year
          # 0 1
          # 0 Harry Potter 1997
          # 1 Of Mice and Men 1937
          # 2 Babe Ruth Story, The 1948





          share|improve this answer


























          • Nice! This is pretty much what I was looking for. If I do this I would have to use a for loop through all the values in the pandas series, which is extremely slow. This helped me get the correct answer, which is this: books['title'].str.findall('((d{4}))').str.get(0)

            – Matt Elgazar
            Nov 15 '18 at 17:56













          • Whoops I mean df['Books']. I'll mark your answer as correct.

            – Matt Elgazar
            Nov 15 '18 at 18:11






          • 1





            @MattElgazar See my last update to extract the title as well =) Thanks for the fun problem; I learned a few things figuring this out

            – Stephen Cowley
            Nov 15 '18 at 18:17





















          0














          Answer for the full series is actually this:



          books['title'].str.findall('((d{4}))').str.get(0)





          share|improve this answer































            0














            You could do the following.



            import pandas as pd
            df = pd.DataFrame(columns=['id','Book'], data=[[1,'Harry Potter (1997)'],[2,'Of Mice and Men (1937)'],[3,'Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story']])

            df['Year'] = df['Book'].str.extract(r'(?!()b(d+){1}')



            1. line: import pandas

            2. line: create the dataframe for sake of understanding

            3. line: create a new column 'Year', which is created from a string extraction on the column Book.


            Use regex to find the digits. I use https://regex101.com/r/Bid0qA/1, which is a huge help in understanding how regex works.






            share|improve this answer


























            • This actually works for the cases I have presented but it does not work for all cases. See this example. Some Book Title 2 (2002)

              – Matt Elgazar
              Nov 15 '18 at 18:09






            • 1





              Ahh well I was about to write you. That it did work. Post more data so that we have more to work with. If you add more cases along the way you are making it a lot harder to actually answer the question. I've updated my answer

              – Philip
              Nov 15 '18 at 18:18











            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53325182%2fhow-to-extract-year-or-datetime-from-a-column-in-a-pandas-dataframe-that-conta%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            3 Answers
            3






            active

            oldest

            votes








            3 Answers
            3






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            3














            How about a simple Regex:



            text = 'Harry Potter (1997)'
            re.findall('((d{4}))', text)
            # ['1997'] Note that this is a list of "all" the occurrences.




            With a Dataframe, it can be done like this:



            text = 'Harry Potter (1997)'
            df = pd.DataFrame({'Book': text}, index=[1])
            pattern = '((d{4}))'
            df['year'] = df.Book.str.extract(pattern, expand=False) #False returns a series

            df
            # Book year
            # 1 Harry Potter (1997) 1997




            Finally, if you actually want to separate the title and the data (taking the dataframe reconstruction from Philip in another answer):



            df = pd.DataFrame(columns=['Book'], data=[['Harry Potter (1997)'],['Of Mice and Men (1937)'],['Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story']])

            sep = df['Book'].str.extract('(.*)((d{4}))', expand=False)

            sep # A new df, separated into title and year
            # 0 1
            # 0 Harry Potter 1997
            # 1 Of Mice and Men 1937
            # 2 Babe Ruth Story, The 1948





            share|improve this answer


























            • Nice! This is pretty much what I was looking for. If I do this I would have to use a for loop through all the values in the pandas series, which is extremely slow. This helped me get the correct answer, which is this: books['title'].str.findall('((d{4}))').str.get(0)

              – Matt Elgazar
              Nov 15 '18 at 17:56













            • Whoops I mean df['Books']. I'll mark your answer as correct.

              – Matt Elgazar
              Nov 15 '18 at 18:11






            • 1





              @MattElgazar See my last update to extract the title as well =) Thanks for the fun problem; I learned a few things figuring this out

              – Stephen Cowley
              Nov 15 '18 at 18:17


















            3














            How about a simple Regex:



            text = 'Harry Potter (1997)'
            re.findall('((d{4}))', text)
            # ['1997'] Note that this is a list of "all" the occurrences.




            With a Dataframe, it can be done like this:



            text = 'Harry Potter (1997)'
            df = pd.DataFrame({'Book': text}, index=[1])
            pattern = '((d{4}))'
            df['year'] = df.Book.str.extract(pattern, expand=False) #False returns a series

            df
            # Book year
            # 1 Harry Potter (1997) 1997




            Finally, if you actually want to separate the title and the data (taking the dataframe reconstruction from Philip in another answer):



            df = pd.DataFrame(columns=['Book'], data=[['Harry Potter (1997)'],['Of Mice and Men (1937)'],['Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story']])

            sep = df['Book'].str.extract('(.*)((d{4}))', expand=False)

            sep # A new df, separated into title and year
            # 0 1
            # 0 Harry Potter 1997
            # 1 Of Mice and Men 1937
            # 2 Babe Ruth Story, The 1948





            share|improve this answer


























            • Nice! This is pretty much what I was looking for. If I do this I would have to use a for loop through all the values in the pandas series, which is extremely slow. This helped me get the correct answer, which is this: books['title'].str.findall('((d{4}))').str.get(0)

              – Matt Elgazar
              Nov 15 '18 at 17:56













            • Whoops I mean df['Books']. I'll mark your answer as correct.

              – Matt Elgazar
              Nov 15 '18 at 18:11






            • 1





              @MattElgazar See my last update to extract the title as well =) Thanks for the fun problem; I learned a few things figuring this out

              – Stephen Cowley
              Nov 15 '18 at 18:17
















            3












            3








            3







            How about a simple Regex:



            text = 'Harry Potter (1997)'
            re.findall('((d{4}))', text)
            # ['1997'] Note that this is a list of "all" the occurrences.




            With a Dataframe, it can be done like this:



            text = 'Harry Potter (1997)'
            df = pd.DataFrame({'Book': text}, index=[1])
            pattern = '((d{4}))'
            df['year'] = df.Book.str.extract(pattern, expand=False) #False returns a series

            df
            # Book year
            # 1 Harry Potter (1997) 1997




            Finally, if you actually want to separate the title and the data (taking the dataframe reconstruction from Philip in another answer):



            df = pd.DataFrame(columns=['Book'], data=[['Harry Potter (1997)'],['Of Mice and Men (1937)'],['Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story']])

            sep = df['Book'].str.extract('(.*)((d{4}))', expand=False)

            sep # A new df, separated into title and year
            # 0 1
            # 0 Harry Potter 1997
            # 1 Of Mice and Men 1937
            # 2 Babe Ruth Story, The 1948





            share|improve this answer















            How about a simple Regex:



            text = 'Harry Potter (1997)'
            re.findall('((d{4}))', text)
            # ['1997'] Note that this is a list of "all" the occurrences.




            With a Dataframe, it can be done like this:



            text = 'Harry Potter (1997)'
            df = pd.DataFrame({'Book': text}, index=[1])
            pattern = '((d{4}))'
            df['year'] = df.Book.str.extract(pattern, expand=False) #False returns a series

            df
            # Book year
            # 1 Harry Potter (1997) 1997




            Finally, if you actually want to separate the title and the data (taking the dataframe reconstruction from Philip in another answer):



            df = pd.DataFrame(columns=['Book'], data=[['Harry Potter (1997)'],['Of Mice and Men (1937)'],['Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story']])

            sep = df['Book'].str.extract('(.*)((d{4}))', expand=False)

            sep # A new df, separated into title and year
            # 0 1
            # 0 Harry Potter 1997
            # 1 Of Mice and Men 1937
            # 2 Babe Ruth Story, The 1948






            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited Nov 15 '18 at 18:13

























            answered Nov 15 '18 at 17:50









            Stephen CowleyStephen Cowley

            1,206518




            1,206518













            • Nice! This is pretty much what I was looking for. If I do this I would have to use a for loop through all the values in the pandas series, which is extremely slow. This helped me get the correct answer, which is this: books['title'].str.findall('((d{4}))').str.get(0)

              – Matt Elgazar
              Nov 15 '18 at 17:56













            • Whoops I mean df['Books']. I'll mark your answer as correct.

              – Matt Elgazar
              Nov 15 '18 at 18:11






            • 1





              @MattElgazar See my last update to extract the title as well =) Thanks for the fun problem; I learned a few things figuring this out

              – Stephen Cowley
              Nov 15 '18 at 18:17





















            • Nice! This is pretty much what I was looking for. If I do this I would have to use a for loop through all the values in the pandas series, which is extremely slow. This helped me get the correct answer, which is this: books['title'].str.findall('((d{4}))').str.get(0)

              – Matt Elgazar
              Nov 15 '18 at 17:56













            • Whoops I mean df['Books']. I'll mark your answer as correct.

              – Matt Elgazar
              Nov 15 '18 at 18:11






            • 1





              @MattElgazar See my last update to extract the title as well =) Thanks for the fun problem; I learned a few things figuring this out

              – Stephen Cowley
              Nov 15 '18 at 18:17



















            Nice! This is pretty much what I was looking for. If I do this I would have to use a for loop through all the values in the pandas series, which is extremely slow. This helped me get the correct answer, which is this: books['title'].str.findall('((d{4}))').str.get(0)

            – Matt Elgazar
            Nov 15 '18 at 17:56







            Nice! This is pretty much what I was looking for. If I do this I would have to use a for loop through all the values in the pandas series, which is extremely slow. This helped me get the correct answer, which is this: books['title'].str.findall('((d{4}))').str.get(0)

            – Matt Elgazar
            Nov 15 '18 at 17:56















            Whoops I mean df['Books']. I'll mark your answer as correct.

            – Matt Elgazar
            Nov 15 '18 at 18:11





            Whoops I mean df['Books']. I'll mark your answer as correct.

            – Matt Elgazar
            Nov 15 '18 at 18:11




            1




            1





            @MattElgazar See my last update to extract the title as well =) Thanks for the fun problem; I learned a few things figuring this out

            – Stephen Cowley
            Nov 15 '18 at 18:17







            @MattElgazar See my last update to extract the title as well =) Thanks for the fun problem; I learned a few things figuring this out

            – Stephen Cowley
            Nov 15 '18 at 18:17















            0














            Answer for the full series is actually this:



            books['title'].str.findall('((d{4}))').str.get(0)





            share|improve this answer




























              0














              Answer for the full series is actually this:



              books['title'].str.findall('((d{4}))').str.get(0)





              share|improve this answer


























                0












                0








                0







                Answer for the full series is actually this:



                books['title'].str.findall('((d{4}))').str.get(0)





                share|improve this answer













                Answer for the full series is actually this:



                books['title'].str.findall('((d{4}))').str.get(0)






                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered Nov 15 '18 at 18:00









                Matt ElgazarMatt Elgazar

                10210




                10210























                    0














                    You could do the following.



                    import pandas as pd
                    df = pd.DataFrame(columns=['id','Book'], data=[[1,'Harry Potter (1997)'],[2,'Of Mice and Men (1937)'],[3,'Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story']])

                    df['Year'] = df['Book'].str.extract(r'(?!()b(d+){1}')



                    1. line: import pandas

                    2. line: create the dataframe for sake of understanding

                    3. line: create a new column 'Year', which is created from a string extraction on the column Book.


                    Use regex to find the digits. I use https://regex101.com/r/Bid0qA/1, which is a huge help in understanding how regex works.






                    share|improve this answer


























                    • This actually works for the cases I have presented but it does not work for all cases. See this example. Some Book Title 2 (2002)

                      – Matt Elgazar
                      Nov 15 '18 at 18:09






                    • 1





                      Ahh well I was about to write you. That it did work. Post more data so that we have more to work with. If you add more cases along the way you are making it a lot harder to actually answer the question. I've updated my answer

                      – Philip
                      Nov 15 '18 at 18:18
















                    0














                    You could do the following.



                    import pandas as pd
                    df = pd.DataFrame(columns=['id','Book'], data=[[1,'Harry Potter (1997)'],[2,'Of Mice and Men (1937)'],[3,'Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story']])

                    df['Year'] = df['Book'].str.extract(r'(?!()b(d+){1}')



                    1. line: import pandas

                    2. line: create the dataframe for sake of understanding

                    3. line: create a new column 'Year', which is created from a string extraction on the column Book.


                    Use regex to find the digits. I use https://regex101.com/r/Bid0qA/1, which is a huge help in understanding how regex works.






                    share|improve this answer


























                    • This actually works for the cases I have presented but it does not work for all cases. See this example. Some Book Title 2 (2002)

                      – Matt Elgazar
                      Nov 15 '18 at 18:09






                    • 1





                      Ahh well I was about to write you. That it did work. Post more data so that we have more to work with. If you add more cases along the way you are making it a lot harder to actually answer the question. I've updated my answer

                      – Philip
                      Nov 15 '18 at 18:18














                    0












                    0








                    0







                    You could do the following.



                    import pandas as pd
                    df = pd.DataFrame(columns=['id','Book'], data=[[1,'Harry Potter (1997)'],[2,'Of Mice and Men (1937)'],[3,'Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story']])

                    df['Year'] = df['Book'].str.extract(r'(?!()b(d+){1}')



                    1. line: import pandas

                    2. line: create the dataframe for sake of understanding

                    3. line: create a new column 'Year', which is created from a string extraction on the column Book.


                    Use regex to find the digits. I use https://regex101.com/r/Bid0qA/1, which is a huge help in understanding how regex works.






                    share|improve this answer















                    You could do the following.



                    import pandas as pd
                    df = pd.DataFrame(columns=['id','Book'], data=[[1,'Harry Potter (1997)'],[2,'Of Mice and Men (1937)'],[3,'Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story']])

                    df['Year'] = df['Book'].str.extract(r'(?!()b(d+){1}')



                    1. line: import pandas

                    2. line: create the dataframe for sake of understanding

                    3. line: create a new column 'Year', which is created from a string extraction on the column Book.


                    Use regex to find the digits. I use https://regex101.com/r/Bid0qA/1, which is a huge help in understanding how regex works.







                    share|improve this answer














                    share|improve this answer



                    share|improve this answer








                    edited Nov 15 '18 at 18:19

























                    answered Nov 15 '18 at 17:59









                    PhilipPhilip

                    341213




                    341213













                    • This actually works for the cases I have presented but it does not work for all cases. See this example. Some Book Title 2 (2002)

                      – Matt Elgazar
                      Nov 15 '18 at 18:09






                    • 1





                      Ahh well I was about to write you. That it did work. Post more data so that we have more to work with. If you add more cases along the way you are making it a lot harder to actually answer the question. I've updated my answer

                      – Philip
                      Nov 15 '18 at 18:18



















                    • This actually works for the cases I have presented but it does not work for all cases. See this example. Some Book Title 2 (2002)

                      – Matt Elgazar
                      Nov 15 '18 at 18:09






                    • 1





                      Ahh well I was about to write you. That it did work. Post more data so that we have more to work with. If you add more cases along the way you are making it a lot harder to actually answer the question. I've updated my answer

                      – Philip
                      Nov 15 '18 at 18:18

















                    This actually works for the cases I have presented but it does not work for all cases. See this example. Some Book Title 2 (2002)

                    – Matt Elgazar
                    Nov 15 '18 at 18:09





                    This actually works for the cases I have presented but it does not work for all cases. See this example. Some Book Title 2 (2002)

                    – Matt Elgazar
                    Nov 15 '18 at 18:09




                    1




                    1





                    Ahh well I was about to write you. That it did work. Post more data so that we have more to work with. If you add more cases along the way you are making it a lot harder to actually answer the question. I've updated my answer

                    – Philip
                    Nov 15 '18 at 18:18





                    Ahh well I was about to write you. That it did work. Post more data so that we have more to work with. If you add more cases along the way you are making it a lot harder to actually answer the question. I've updated my answer

                    – Philip
                    Nov 15 '18 at 18:18


















                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Stack Overflow!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53325182%2fhow-to-extract-year-or-datetime-from-a-column-in-a-pandas-dataframe-that-conta%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Xamarin.iOS Cant Deploy on Iphone

                    Glorious Revolution

                    Dulmage-Mendelsohn matrix decomposition in Python