How to extract year (or datetime) from a column in a pandas dataframe that contains text

-1

Suppose I have a pandas dataframe:

Id    Book                      

1     Harry Potter (1997)

2     Of Mice and Men (1937)

3     Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story

How do I extract the year from the column?

Output should be:

Id    Book Title               Year

1     Harry Potter             1997

2     Of Mice and Men          1937

3     Babe Ruth Story, The     1948

So far I have tried:

movies['year'] = movies['title'].str.extract('([0-9(0-9)]+)', expand=False).str.strip()

and

books['year'] = books['title'].str[-5:-1]

I've messed around with some other things and haven't gotten it to work yet. Any suggestions?

edited Nov 15 '18 at 17:51

asked Nov 15 '18 at 17:44

Matt Elgazar

10210

add a comment |

-1

Suppose I have a pandas dataframe:

Id    Book                      

1     Harry Potter (1997)

2     Of Mice and Men (1937)

3     Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story

How do I extract the year from the column?

Output should be:

Id    Book Title               Year

1     Harry Potter             1997

2     Of Mice and Men          1937

3     Babe Ruth Story, The     1948

So far I have tried:

movies['year'] = movies['title'].str.extract('([0-9(0-9)]+)', expand=False).str.strip()

and

books['year'] = books['title'].str[-5:-1]

I've messed around with some other things and haven't gotten it to work yet. Any suggestions?

edited Nov 15 '18 at 17:51

asked Nov 15 '18 at 17:44

Matt Elgazar

10210

add a comment |

-1

Suppose I have a pandas dataframe:

Id    Book                      

1     Harry Potter (1997)

2     Of Mice and Men (1937)

3     Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story

How do I extract the year from the column?

Output should be:

Id    Book Title               Year

1     Harry Potter             1997

2     Of Mice and Men          1937

3     Babe Ruth Story, The     1948

So far I have tried:

movies['year'] = movies['title'].str.extract('([0-9(0-9)]+)', expand=False).str.strip()

and

books['year'] = books['title'].str[-5:-1]

I've messed around with some other things and haven't gotten it to work yet. Any suggestions?

edited Nov 15 '18 at 17:51

asked Nov 15 '18 at 17:44

Matt Elgazar

10210

Suppose I have a pandas dataframe:

Id    Book                      

1     Harry Potter (1997)

2     Of Mice and Men (1937)

3     Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story

How do I extract the year from the column?

Output should be:

Id    Book Title               Year

1     Harry Potter             1997

2     Of Mice and Men          1937

3     Babe Ruth Story, The     1948

So far I have tried:

movies['year'] = movies['title'].str.extract('([0-9(0-9)]+)', expand=False).str.strip()

and

books['year'] = books['title'].str[-5:-1]

I've messed around with some other things and haven't gotten it to work yet. Any suggestions?

python regex pandas datetime parsing

edited Nov 15 '18 at 17:51

asked Nov 15 '18 at 17:44

Matt Elgazar

10210

edited Nov 15 '18 at 17:51

asked Nov 15 '18 at 17:44

Matt Elgazar

10210

edited Nov 15 '18 at 17:51

asked Nov 15 '18 at 17:44

Matt Elgazar

10210

asked Nov 15 '18 at 17:44

Matt Elgazar

10210

asked Nov 15 '18 at 17:44

Matt Elgazar

10210

add a comment |

3 Answers
3

active

oldest

votes

How about a simple Regex:

text = 'Harry Potter (1997)'

re.findall('((d{4}))', text)

# ['1997'] Note that this is a list of "all" the occurrences.

With a Dataframe, it can be done like this:

text = 'Harry Potter (1997)'

df = pd.DataFrame({'Book': text}, index=[1])

pattern = '((d{4}))'

df['year'] = df.Book.str.extract(pattern, expand=False) #False returns a series



df

#                  Book   year

# 1  Harry Potter (1997)  1997

Finally, if you actually want to separate the title and the data (taking the dataframe reconstruction from Philip in another answer):

df = pd.DataFrame(columns=['Book'], data=[['Harry Potter (1997)'],['Of Mice and Men (1937)'],['Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story']])



sep = df['Book'].str.extract('(.*)((d{4}))', expand=False)



sep # A new df, separated into title and year

#                       0      1                           

# 0          Harry Potter   1997 

# 1       Of Mice and Men   1937

# 2  Babe Ruth Story, The   1948

edited Nov 15 '18 at 18:13

answered Nov 15 '18 at 17:50

Stephen Cowley

1,206518

Nice! This is pretty much what I was looking for. If I do this I would have to use a for loop through all the values in the pandas series, which is extremely slow. This helped me get the correct answer, which is this: books['title'].str.findall('((d{4}))').str.get(0)

– Matt Elgazar
Nov 15 '18 at 17:56

Whoops I mean df['Books']. I'll mark your answer as correct.

– Matt Elgazar
Nov 15 '18 at 18:11

1

@MattElgazar See my last update to extract the title as well =) Thanks for the fun problem; I learned a few things figuring this out

– Stephen Cowley
Nov 15 '18 at 18:17

add a comment |

Answer for the full series is actually this:

books['title'].str.findall('((d{4}))').str.get(0)

answered Nov 15 '18 at 18:00

Matt Elgazar

10210

add a comment |

You could do the following.

import pandas as pd

df = pd.DataFrame(columns=['id','Book'], data=[[1,'Harry Potter (1997)'],[2,'Of Mice and Men (1937)'],[3,'Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story']])



df['Year'] = df['Book'].str.extract(r'(?!()b(d+){1}')

line: import pandas

line: create the dataframe for sake of understanding

line: create a new column 'Year', which is created from a string extraction on the column Book.

Use regex to find the digits. I use https://regex101.com/r/Bid0qA/1, which is a huge help in understanding how regex works.

edited Nov 15 '18 at 18:19

answered Nov 15 '18 at 17:59

Philip

341213

This actually works for the cases I have presented but it does not work for all cases. See this example. Some Book Title 2 (2002)

– Matt Elgazar
Nov 15 '18 at 18:09

1

Ahh well I was about to write you. That it did work. Post more data so that we have more to work with. If you add more cases along the way you are making it a lot harder to actually answer the question. I've updated my answer

– Philip
Nov 15 '18 at 18:18

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53325182%2fhow-to-extract-year-or-datetime-from-a-column-in-a-pandas-dataframe-that-conta%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

How about a simple Regex:

text = 'Harry Potter (1997)'

re.findall('((d{4}))', text)

# ['1997'] Note that this is a list of "all" the occurrences.

With a Dataframe, it can be done like this:

text = 'Harry Potter (1997)'

df = pd.DataFrame({'Book': text}, index=[1])

pattern = '((d{4}))'

df['year'] = df.Book.str.extract(pattern, expand=False) #False returns a series



df

#                  Book   year

# 1  Harry Potter (1997)  1997

Finally, if you actually want to separate the title and the data (taking the dataframe reconstruction from Philip in another answer):

df = pd.DataFrame(columns=['Book'], data=[['Harry Potter (1997)'],['Of Mice and Men (1937)'],['Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story']])



sep = df['Book'].str.extract('(.*)((d{4}))', expand=False)



sep # A new df, separated into title and year

#                       0      1                           

# 0          Harry Potter   1997 

# 1       Of Mice and Men   1937

# 2  Babe Ruth Story, The   1948

edited Nov 15 '18 at 18:13

answered Nov 15 '18 at 17:50

Stephen Cowley

1,206518

Nice! This is pretty much what I was looking for. If I do this I would have to use a for loop through all the values in the pandas series, which is extremely slow. This helped me get the correct answer, which is this: books['title'].str.findall('((d{4}))').str.get(0)

– Matt Elgazar
Nov 15 '18 at 17:56

Whoops I mean df['Books']. I'll mark your answer as correct.

– Matt Elgazar
Nov 15 '18 at 18:11

1

@MattElgazar See my last update to extract the title as well =) Thanks for the fun problem; I learned a few things figuring this out

– Stephen Cowley
Nov 15 '18 at 18:17

add a comment |

How about a simple Regex:

text = 'Harry Potter (1997)'

re.findall('((d{4}))', text)

# ['1997'] Note that this is a list of "all" the occurrences.

With a Dataframe, it can be done like this:

text = 'Harry Potter (1997)'

df = pd.DataFrame({'Book': text}, index=[1])

pattern = '((d{4}))'

df['year'] = df.Book.str.extract(pattern, expand=False) #False returns a series



df

#                  Book   year

# 1  Harry Potter (1997)  1997

Finally, if you actually want to separate the title and the data (taking the dataframe reconstruction from Philip in another answer):

df = pd.DataFrame(columns=['Book'], data=[['Harry Potter (1997)'],['Of Mice and Men (1937)'],['Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story']])



sep = df['Book'].str.extract('(.*)((d{4}))', expand=False)



sep # A new df, separated into title and year

#                       0      1                           

# 0          Harry Potter   1997 

# 1       Of Mice and Men   1937

# 2  Babe Ruth Story, The   1948

edited Nov 15 '18 at 18:13

answered Nov 15 '18 at 17:50

Stephen Cowley

1,206518

Nice! This is pretty much what I was looking for. If I do this I would have to use a for loop through all the values in the pandas series, which is extremely slow. This helped me get the correct answer, which is this: books['title'].str.findall('((d{4}))').str.get(0)

– Matt Elgazar
Nov 15 '18 at 17:56

Whoops I mean df['Books']. I'll mark your answer as correct.

– Matt Elgazar
Nov 15 '18 at 18:11

1

@MattElgazar See my last update to extract the title as well =) Thanks for the fun problem; I learned a few things figuring this out

– Stephen Cowley
Nov 15 '18 at 18:17

add a comment |

How about a simple Regex:

text = 'Harry Potter (1997)'

re.findall('((d{4}))', text)

# ['1997'] Note that this is a list of "all" the occurrences.

With a Dataframe, it can be done like this:

text = 'Harry Potter (1997)'

df = pd.DataFrame({'Book': text}, index=[1])

pattern = '((d{4}))'

df['year'] = df.Book.str.extract(pattern, expand=False) #False returns a series



df

#                  Book   year

# 1  Harry Potter (1997)  1997

Finally, if you actually want to separate the title and the data (taking the dataframe reconstruction from Philip in another answer):

df = pd.DataFrame(columns=['Book'], data=[['Harry Potter (1997)'],['Of Mice and Men (1937)'],['Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story']])



sep = df['Book'].str.extract('(.*)((d{4}))', expand=False)



sep # A new df, separated into title and year

#                       0      1                           

# 0          Harry Potter   1997 

# 1       Of Mice and Men   1937

# 2  Babe Ruth Story, The   1948

edited Nov 15 '18 at 18:13

answered Nov 15 '18 at 17:50

Stephen Cowley

1,206518

How about a simple Regex:

text = 'Harry Potter (1997)'

re.findall('((d{4}))', text)

# ['1997'] Note that this is a list of "all" the occurrences.

With a Dataframe, it can be done like this:

text = 'Harry Potter (1997)'

df = pd.DataFrame({'Book': text}, index=[1])

pattern = '((d{4}))'

df['year'] = df.Book.str.extract(pattern, expand=False) #False returns a series



df

#                  Book   year

# 1  Harry Potter (1997)  1997

Finally, if you actually want to separate the title and the data (taking the dataframe reconstruction from Philip in another answer):

df = pd.DataFrame(columns=['Book'], data=[['Harry Potter (1997)'],['Of Mice and Men (1937)'],['Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story']])



sep = df['Book'].str.extract('(.*)((d{4}))', expand=False)



sep # A new df, separated into title and year

#                       0      1                           

# 0          Harry Potter   1997 

# 1       Of Mice and Men   1937

# 2  Babe Ruth Story, The   1948

edited Nov 15 '18 at 18:13

answered Nov 15 '18 at 17:50

Stephen Cowley

1,206518

edited Nov 15 '18 at 18:13

answered Nov 15 '18 at 17:50

Stephen Cowley

1,206518

answered Nov 15 '18 at 17:50

Stephen Cowley

1,206518

answered Nov 15 '18 at 17:50

Stephen Cowley

1,206518

Nice! This is pretty much what I was looking for. If I do this I would have to use a for loop through all the values in the pandas series, which is extremely slow. This helped me get the correct answer, which is this: books['title'].str.findall('((d{4}))').str.get(0)

– Matt Elgazar
Nov 15 '18 at 17:56

Whoops I mean df['Books']. I'll mark your answer as correct.

– Matt Elgazar
Nov 15 '18 at 18:11

1

@MattElgazar See my last update to extract the title as well =) Thanks for the fun problem; I learned a few things figuring this out

– Stephen Cowley
Nov 15 '18 at 18:17

add a comment |

Nice! This is pretty much what I was looking for. If I do this I would have to use a for loop through all the values in the pandas series, which is extremely slow. This helped me get the correct answer, which is this: books['title'].str.findall('((d{4}))').str.get(0)

– Matt Elgazar
Nov 15 '18 at 17:56

Whoops I mean df['Books']. I'll mark your answer as correct.

– Matt Elgazar
Nov 15 '18 at 18:11

1

@MattElgazar See my last update to extract the title as well =) Thanks for the fun problem; I learned a few things figuring this out

– Stephen Cowley
Nov 15 '18 at 18:17

Nice! This is pretty much what I was looking for. If I do this I would have to use a for loop through all the values in the pandas series, which is extremely slow. This helped me get the correct answer, which is this: books['title'].str.findall('((d{4}))').str.get(0)

– Matt Elgazar
Nov 15 '18 at 17:56

Whoops I mean df['Books']. I'll mark your answer as correct.

– Matt Elgazar
Nov 15 '18 at 18:11

@MattElgazar See my last update to extract the title as well =) Thanks for the fun problem; I learned a few things figuring this out

– Stephen Cowley
Nov 15 '18 at 18:17

add a comment |

Answer for the full series is actually this:

books['title'].str.findall('((d{4}))').str.get(0)

answered Nov 15 '18 at 18:00

Matt Elgazar

10210

add a comment |

Answer for the full series is actually this:

books['title'].str.findall('((d{4}))').str.get(0)

answered Nov 15 '18 at 18:00

Matt Elgazar

10210

add a comment |

Answer for the full series is actually this:

books['title'].str.findall('((d{4}))').str.get(0)

answered Nov 15 '18 at 18:00

Matt Elgazar

10210

Answer for the full series is actually this:

books['title'].str.findall('((d{4}))').str.get(0)

answered Nov 15 '18 at 18:00

Matt Elgazar

10210

answered Nov 15 '18 at 18:00

Matt Elgazar

10210

answered Nov 15 '18 at 18:00

Matt Elgazar

10210

answered Nov 15 '18 at 18:00

Matt Elgazar

10210

add a comment |

You could do the following.

import pandas as pd

df = pd.DataFrame(columns=['id','Book'], data=[[1,'Harry Potter (1997)'],[2,'Of Mice and Men (1937)'],[3,'Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story']])



df['Year'] = df['Book'].str.extract(r'(?!()b(d+){1}')

line: import pandas

line: create the dataframe for sake of understanding

line: create a new column 'Year', which is created from a string extraction on the column Book.

Use regex to find the digits. I use https://regex101.com/r/Bid0qA/1, which is a huge help in understanding how regex works.

edited Nov 15 '18 at 18:19

answered Nov 15 '18 at 17:59

Philip

341213

This actually works for the cases I have presented but it does not work for all cases. See this example. Some Book Title 2 (2002)

– Matt Elgazar
Nov 15 '18 at 18:09

1

Ahh well I was about to write you. That it did work. Post more data so that we have more to work with. If you add more cases along the way you are making it a lot harder to actually answer the question. I've updated my answer

– Philip
Nov 15 '18 at 18:18

add a comment |

You could do the following.

import pandas as pd

df = pd.DataFrame(columns=['id','Book'], data=[[1,'Harry Potter (1997)'],[2,'Of Mice and Men (1937)'],[3,'Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story']])



df['Year'] = df['Book'].str.extract(r'(?!()b(d+){1}')

line: import pandas

line: create the dataframe for sake of understanding

line: create a new column 'Year', which is created from a string extraction on the column Book.

Use regex to find the digits. I use https://regex101.com/r/Bid0qA/1, which is a huge help in understanding how regex works.

edited Nov 15 '18 at 18:19

answered Nov 15 '18 at 17:59

Philip

341213

This actually works for the cases I have presented but it does not work for all cases. See this example. Some Book Title 2 (2002)

– Matt Elgazar
Nov 15 '18 at 18:09

1

Ahh well I was about to write you. That it did work. Post more data so that we have more to work with. If you add more cases along the way you are making it a lot harder to actually answer the question. I've updated my answer

– Philip
Nov 15 '18 at 18:18

add a comment |

You could do the following.

import pandas as pd

df = pd.DataFrame(columns=['id','Book'], data=[[1,'Harry Potter (1997)'],[2,'Of Mice and Men (1937)'],[3,'Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story']])



df['Year'] = df['Book'].str.extract(r'(?!()b(d+){1}')

line: import pandas

line: create the dataframe for sake of understanding

line: create a new column 'Year', which is created from a string extraction on the column Book.

Use regex to find the digits. I use https://regex101.com/r/Bid0qA/1, which is a huge help in understanding how regex works.

edited Nov 15 '18 at 18:19

answered Nov 15 '18 at 17:59

Philip

341213

You could do the following.

import pandas as pd

df = pd.DataFrame(columns=['id','Book'], data=[[1,'Harry Potter (1997)'],[2,'Of Mice and Men (1937)'],[3,'Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story']])



df['Year'] = df['Book'].str.extract(r'(?!()b(d+){1}')

line: import pandas

line: create the dataframe for sake of understanding

line: create a new column 'Year', which is created from a string extraction on the column Book.

Use regex to find the digits. I use https://regex101.com/r/Bid0qA/1, which is a huge help in understanding how regex works.

edited Nov 15 '18 at 18:19

answered Nov 15 '18 at 17:59

Philip

341213

edited Nov 15 '18 at 18:19

answered Nov 15 '18 at 17:59

Philip

341213

answered Nov 15 '18 at 17:59

Philip

341213

answered Nov 15 '18 at 17:59

Philip

341213

This actually works for the cases I have presented but it does not work for all cases. See this example. Some Book Title 2 (2002)

– Matt Elgazar
Nov 15 '18 at 18:09

1

Ahh well I was about to write you. That it did work. Post more data so that we have more to work with. If you add more cases along the way you are making it a lot harder to actually answer the question. I've updated my answer

– Philip
Nov 15 '18 at 18:18

add a comment |

This actually works for the cases I have presented but it does not work for all cases. See this example. Some Book Title 2 (2002)

– Matt Elgazar
Nov 15 '18 at 18:09

1

Ahh well I was about to write you. That it did work. Post more data so that we have more to work with. If you add more cases along the way you are making it a lot harder to actually answer the question. I've updated my answer

– Philip
Nov 15 '18 at 18:18

This actually works for the cases I have presented but it does not work for all cases. See this example. Some Book Title 2 (2002)

– Matt Elgazar
Nov 15 '18 at 18:09

Ahh well I was about to write you. That it did work. Post more data so that we have more to work with. If you add more cases along the way you are making it a lot harder to actually answer the question. I've updated my answer

– Philip
Nov 15 '18 at 18:18

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Vfrdtyky