How to extract year (or datetime) from a column in a pandas dataframe that contains text
Suppose I have a pandas dataframe:
Id Book
1 Harry Potter (1997)
2 Of Mice and Men (1937)
3 Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story
How do I extract the year from the column?
Output should be:
Id Book Title Year
1 Harry Potter 1997
2 Of Mice and Men 1937
3 Babe Ruth Story, The 1948
So far I have tried:
movies['year'] = movies['title'].str.extract('([0-9(0-9)]+)', expand=False).str.strip()
and
books['year'] = books['title'].str[-5:-1]
I've messed around with some other things and haven't gotten it to work yet. Any suggestions?
python regex pandas datetime parsing
add a comment |
Suppose I have a pandas dataframe:
Id Book
1 Harry Potter (1997)
2 Of Mice and Men (1937)
3 Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story
How do I extract the year from the column?
Output should be:
Id Book Title Year
1 Harry Potter 1997
2 Of Mice and Men 1937
3 Babe Ruth Story, The 1948
So far I have tried:
movies['year'] = movies['title'].str.extract('([0-9(0-9)]+)', expand=False).str.strip()
and
books['year'] = books['title'].str[-5:-1]
I've messed around with some other things and haven't gotten it to work yet. Any suggestions?
python regex pandas datetime parsing
add a comment |
Suppose I have a pandas dataframe:
Id Book
1 Harry Potter (1997)
2 Of Mice and Men (1937)
3 Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story
How do I extract the year from the column?
Output should be:
Id Book Title Year
1 Harry Potter 1997
2 Of Mice and Men 1937
3 Babe Ruth Story, The 1948
So far I have tried:
movies['year'] = movies['title'].str.extract('([0-9(0-9)]+)', expand=False).str.strip()
and
books['year'] = books['title'].str[-5:-1]
I've messed around with some other things and haven't gotten it to work yet. Any suggestions?
python regex pandas datetime parsing
Suppose I have a pandas dataframe:
Id Book
1 Harry Potter (1997)
2 Of Mice and Men (1937)
3 Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story
How do I extract the year from the column?
Output should be:
Id Book Title Year
1 Harry Potter 1997
2 Of Mice and Men 1937
3 Babe Ruth Story, The 1948
So far I have tried:
movies['year'] = movies['title'].str.extract('([0-9(0-9)]+)', expand=False).str.strip()
and
books['year'] = books['title'].str[-5:-1]
I've messed around with some other things and haven't gotten it to work yet. Any suggestions?
python regex pandas datetime parsing
python regex pandas datetime parsing
edited Nov 15 '18 at 17:51
Matt Elgazar
asked Nov 15 '18 at 17:44
Matt ElgazarMatt Elgazar
10210
10210
add a comment |
add a comment |
3 Answers
3
active
oldest
votes
How about a simple Regex:
text = 'Harry Potter (1997)'
re.findall('((d{4}))', text)
# ['1997'] Note that this is a list of "all" the occurrences.
With a Dataframe, it can be done like this:
text = 'Harry Potter (1997)'
df = pd.DataFrame({'Book': text}, index=[1])
pattern = '((d{4}))'
df['year'] = df.Book.str.extract(pattern, expand=False) #False returns a series
df
# Book year
# 1 Harry Potter (1997) 1997
Finally, if you actually want to separate the title and the data (taking the dataframe reconstruction from Philip in another answer):
df = pd.DataFrame(columns=['Book'], data=[['Harry Potter (1997)'],['Of Mice and Men (1937)'],['Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story']])
sep = df['Book'].str.extract('(.*)((d{4}))', expand=False)
sep # A new df, separated into title and year
# 0 1
# 0 Harry Potter 1997
# 1 Of Mice and Men 1937
# 2 Babe Ruth Story, The 1948
Nice! This is pretty much what I was looking for. If I do this I would have to use a for loop through all the values in the pandas series, which is extremely slow. This helped me get the correct answer, which is this: books['title'].str.findall('((d{4}))').str.get(0)
– Matt Elgazar
Nov 15 '18 at 17:56
Whoops I mean df['Books']. I'll mark your answer as correct.
– Matt Elgazar
Nov 15 '18 at 18:11
1
@MattElgazar See my last update to extract the title as well =) Thanks for the fun problem; I learned a few things figuring this out
– Stephen Cowley
Nov 15 '18 at 18:17
add a comment |
Answer for the full series is actually this:
books['title'].str.findall('((d{4}))').str.get(0)
add a comment |
You could do the following.
import pandas as pd
df = pd.DataFrame(columns=['id','Book'], data=[[1,'Harry Potter (1997)'],[2,'Of Mice and Men (1937)'],[3,'Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story']])
df['Year'] = df['Book'].str.extract(r'(?!()b(d+){1}')
- line: import pandas
- line: create the dataframe for sake of understanding
- line: create a new column 'Year', which is created from a string extraction on the column Book.
Use regex to find the digits. I use https://regex101.com/r/Bid0qA/1, which is a huge help in understanding how regex works.
This actually works for the cases I have presented but it does not work for all cases. See this example. Some Book Title 2 (2002)
– Matt Elgazar
Nov 15 '18 at 18:09
1
Ahh well I was about to write you. That it did work. Post more data so that we have more to work with. If you add more cases along the way you are making it a lot harder to actually answer the question. I've updated my answer
– Philip
Nov 15 '18 at 18:18
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53325182%2fhow-to-extract-year-or-datetime-from-a-column-in-a-pandas-dataframe-that-conta%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
How about a simple Regex:
text = 'Harry Potter (1997)'
re.findall('((d{4}))', text)
# ['1997'] Note that this is a list of "all" the occurrences.
With a Dataframe, it can be done like this:
text = 'Harry Potter (1997)'
df = pd.DataFrame({'Book': text}, index=[1])
pattern = '((d{4}))'
df['year'] = df.Book.str.extract(pattern, expand=False) #False returns a series
df
# Book year
# 1 Harry Potter (1997) 1997
Finally, if you actually want to separate the title and the data (taking the dataframe reconstruction from Philip in another answer):
df = pd.DataFrame(columns=['Book'], data=[['Harry Potter (1997)'],['Of Mice and Men (1937)'],['Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story']])
sep = df['Book'].str.extract('(.*)((d{4}))', expand=False)
sep # A new df, separated into title and year
# 0 1
# 0 Harry Potter 1997
# 1 Of Mice and Men 1937
# 2 Babe Ruth Story, The 1948
Nice! This is pretty much what I was looking for. If I do this I would have to use a for loop through all the values in the pandas series, which is extremely slow. This helped me get the correct answer, which is this: books['title'].str.findall('((d{4}))').str.get(0)
– Matt Elgazar
Nov 15 '18 at 17:56
Whoops I mean df['Books']. I'll mark your answer as correct.
– Matt Elgazar
Nov 15 '18 at 18:11
1
@MattElgazar See my last update to extract the title as well =) Thanks for the fun problem; I learned a few things figuring this out
– Stephen Cowley
Nov 15 '18 at 18:17
add a comment |
How about a simple Regex:
text = 'Harry Potter (1997)'
re.findall('((d{4}))', text)
# ['1997'] Note that this is a list of "all" the occurrences.
With a Dataframe, it can be done like this:
text = 'Harry Potter (1997)'
df = pd.DataFrame({'Book': text}, index=[1])
pattern = '((d{4}))'
df['year'] = df.Book.str.extract(pattern, expand=False) #False returns a series
df
# Book year
# 1 Harry Potter (1997) 1997
Finally, if you actually want to separate the title and the data (taking the dataframe reconstruction from Philip in another answer):
df = pd.DataFrame(columns=['Book'], data=[['Harry Potter (1997)'],['Of Mice and Men (1937)'],['Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story']])
sep = df['Book'].str.extract('(.*)((d{4}))', expand=False)
sep # A new df, separated into title and year
# 0 1
# 0 Harry Potter 1997
# 1 Of Mice and Men 1937
# 2 Babe Ruth Story, The 1948
Nice! This is pretty much what I was looking for. If I do this I would have to use a for loop through all the values in the pandas series, which is extremely slow. This helped me get the correct answer, which is this: books['title'].str.findall('((d{4}))').str.get(0)
– Matt Elgazar
Nov 15 '18 at 17:56
Whoops I mean df['Books']. I'll mark your answer as correct.
– Matt Elgazar
Nov 15 '18 at 18:11
1
@MattElgazar See my last update to extract the title as well =) Thanks for the fun problem; I learned a few things figuring this out
– Stephen Cowley
Nov 15 '18 at 18:17
add a comment |
How about a simple Regex:
text = 'Harry Potter (1997)'
re.findall('((d{4}))', text)
# ['1997'] Note that this is a list of "all" the occurrences.
With a Dataframe, it can be done like this:
text = 'Harry Potter (1997)'
df = pd.DataFrame({'Book': text}, index=[1])
pattern = '((d{4}))'
df['year'] = df.Book.str.extract(pattern, expand=False) #False returns a series
df
# Book year
# 1 Harry Potter (1997) 1997
Finally, if you actually want to separate the title and the data (taking the dataframe reconstruction from Philip in another answer):
df = pd.DataFrame(columns=['Book'], data=[['Harry Potter (1997)'],['Of Mice and Men (1937)'],['Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story']])
sep = df['Book'].str.extract('(.*)((d{4}))', expand=False)
sep # A new df, separated into title and year
# 0 1
# 0 Harry Potter 1997
# 1 Of Mice and Men 1937
# 2 Babe Ruth Story, The 1948
How about a simple Regex:
text = 'Harry Potter (1997)'
re.findall('((d{4}))', text)
# ['1997'] Note that this is a list of "all" the occurrences.
With a Dataframe, it can be done like this:
text = 'Harry Potter (1997)'
df = pd.DataFrame({'Book': text}, index=[1])
pattern = '((d{4}))'
df['year'] = df.Book.str.extract(pattern, expand=False) #False returns a series
df
# Book year
# 1 Harry Potter (1997) 1997
Finally, if you actually want to separate the title and the data (taking the dataframe reconstruction from Philip in another answer):
df = pd.DataFrame(columns=['Book'], data=[['Harry Potter (1997)'],['Of Mice and Men (1937)'],['Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story']])
sep = df['Book'].str.extract('(.*)((d{4}))', expand=False)
sep # A new df, separated into title and year
# 0 1
# 0 Harry Potter 1997
# 1 Of Mice and Men 1937
# 2 Babe Ruth Story, The 1948
edited Nov 15 '18 at 18:13
answered Nov 15 '18 at 17:50
Stephen CowleyStephen Cowley
1,206518
1,206518
Nice! This is pretty much what I was looking for. If I do this I would have to use a for loop through all the values in the pandas series, which is extremely slow. This helped me get the correct answer, which is this: books['title'].str.findall('((d{4}))').str.get(0)
– Matt Elgazar
Nov 15 '18 at 17:56
Whoops I mean df['Books']. I'll mark your answer as correct.
– Matt Elgazar
Nov 15 '18 at 18:11
1
@MattElgazar See my last update to extract the title as well =) Thanks for the fun problem; I learned a few things figuring this out
– Stephen Cowley
Nov 15 '18 at 18:17
add a comment |
Nice! This is pretty much what I was looking for. If I do this I would have to use a for loop through all the values in the pandas series, which is extremely slow. This helped me get the correct answer, which is this: books['title'].str.findall('((d{4}))').str.get(0)
– Matt Elgazar
Nov 15 '18 at 17:56
Whoops I mean df['Books']. I'll mark your answer as correct.
– Matt Elgazar
Nov 15 '18 at 18:11
1
@MattElgazar See my last update to extract the title as well =) Thanks for the fun problem; I learned a few things figuring this out
– Stephen Cowley
Nov 15 '18 at 18:17
Nice! This is pretty much what I was looking for. If I do this I would have to use a for loop through all the values in the pandas series, which is extremely slow. This helped me get the correct answer, which is this: books['title'].str.findall('((d{4}))').str.get(0)
– Matt Elgazar
Nov 15 '18 at 17:56
Nice! This is pretty much what I was looking for. If I do this I would have to use a for loop through all the values in the pandas series, which is extremely slow. This helped me get the correct answer, which is this: books['title'].str.findall('((d{4}))').str.get(0)
– Matt Elgazar
Nov 15 '18 at 17:56
Whoops I mean df['Books']. I'll mark your answer as correct.
– Matt Elgazar
Nov 15 '18 at 18:11
Whoops I mean df['Books']. I'll mark your answer as correct.
– Matt Elgazar
Nov 15 '18 at 18:11
1
1
@MattElgazar See my last update to extract the title as well =) Thanks for the fun problem; I learned a few things figuring this out
– Stephen Cowley
Nov 15 '18 at 18:17
@MattElgazar See my last update to extract the title as well =) Thanks for the fun problem; I learned a few things figuring this out
– Stephen Cowley
Nov 15 '18 at 18:17
add a comment |
Answer for the full series is actually this:
books['title'].str.findall('((d{4}))').str.get(0)
add a comment |
Answer for the full series is actually this:
books['title'].str.findall('((d{4}))').str.get(0)
add a comment |
Answer for the full series is actually this:
books['title'].str.findall('((d{4}))').str.get(0)
Answer for the full series is actually this:
books['title'].str.findall('((d{4}))').str.get(0)
answered Nov 15 '18 at 18:00
Matt ElgazarMatt Elgazar
10210
10210
add a comment |
add a comment |
You could do the following.
import pandas as pd
df = pd.DataFrame(columns=['id','Book'], data=[[1,'Harry Potter (1997)'],[2,'Of Mice and Men (1937)'],[3,'Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story']])
df['Year'] = df['Book'].str.extract(r'(?!()b(d+){1}')
- line: import pandas
- line: create the dataframe for sake of understanding
- line: create a new column 'Year', which is created from a string extraction on the column Book.
Use regex to find the digits. I use https://regex101.com/r/Bid0qA/1, which is a huge help in understanding how regex works.
This actually works for the cases I have presented but it does not work for all cases. See this example. Some Book Title 2 (2002)
– Matt Elgazar
Nov 15 '18 at 18:09
1
Ahh well I was about to write you. That it did work. Post more data so that we have more to work with. If you add more cases along the way you are making it a lot harder to actually answer the question. I've updated my answer
– Philip
Nov 15 '18 at 18:18
add a comment |
You could do the following.
import pandas as pd
df = pd.DataFrame(columns=['id','Book'], data=[[1,'Harry Potter (1997)'],[2,'Of Mice and Men (1937)'],[3,'Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story']])
df['Year'] = df['Book'].str.extract(r'(?!()b(d+){1}')
- line: import pandas
- line: create the dataframe for sake of understanding
- line: create a new column 'Year', which is created from a string extraction on the column Book.
Use regex to find the digits. I use https://regex101.com/r/Bid0qA/1, which is a huge help in understanding how regex works.
This actually works for the cases I have presented but it does not work for all cases. See this example. Some Book Title 2 (2002)
– Matt Elgazar
Nov 15 '18 at 18:09
1
Ahh well I was about to write you. That it did work. Post more data so that we have more to work with. If you add more cases along the way you are making it a lot harder to actually answer the question. I've updated my answer
– Philip
Nov 15 '18 at 18:18
add a comment |
You could do the following.
import pandas as pd
df = pd.DataFrame(columns=['id','Book'], data=[[1,'Harry Potter (1997)'],[2,'Of Mice and Men (1937)'],[3,'Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story']])
df['Year'] = df['Book'].str.extract(r'(?!()b(d+){1}')
- line: import pandas
- line: create the dataframe for sake of understanding
- line: create a new column 'Year', which is created from a string extraction on the column Book.
Use regex to find the digits. I use https://regex101.com/r/Bid0qA/1, which is a huge help in understanding how regex works.
You could do the following.
import pandas as pd
df = pd.DataFrame(columns=['id','Book'], data=[[1,'Harry Potter (1997)'],[2,'Of Mice and Men (1937)'],[3,'Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story']])
df['Year'] = df['Book'].str.extract(r'(?!()b(d+){1}')
- line: import pandas
- line: create the dataframe for sake of understanding
- line: create a new column 'Year', which is created from a string extraction on the column Book.
Use regex to find the digits. I use https://regex101.com/r/Bid0qA/1, which is a huge help in understanding how regex works.
edited Nov 15 '18 at 18:19
answered Nov 15 '18 at 17:59
PhilipPhilip
341213
341213
This actually works for the cases I have presented but it does not work for all cases. See this example. Some Book Title 2 (2002)
– Matt Elgazar
Nov 15 '18 at 18:09
1
Ahh well I was about to write you. That it did work. Post more data so that we have more to work with. If you add more cases along the way you are making it a lot harder to actually answer the question. I've updated my answer
– Philip
Nov 15 '18 at 18:18
add a comment |
This actually works for the cases I have presented but it does not work for all cases. See this example. Some Book Title 2 (2002)
– Matt Elgazar
Nov 15 '18 at 18:09
1
Ahh well I was about to write you. That it did work. Post more data so that we have more to work with. If you add more cases along the way you are making it a lot harder to actually answer the question. I've updated my answer
– Philip
Nov 15 '18 at 18:18
This actually works for the cases I have presented but it does not work for all cases. See this example. Some Book Title 2 (2002)
– Matt Elgazar
Nov 15 '18 at 18:09
This actually works for the cases I have presented but it does not work for all cases. See this example. Some Book Title 2 (2002)
– Matt Elgazar
Nov 15 '18 at 18:09
1
1
Ahh well I was about to write you. That it did work. Post more data so that we have more to work with. If you add more cases along the way you are making it a lot harder to actually answer the question. I've updated my answer
– Philip
Nov 15 '18 at 18:18
Ahh well I was about to write you. That it did work. Post more data so that we have more to work with. If you add more cases along the way you are making it a lot harder to actually answer the question. I've updated my answer
– Philip
Nov 15 '18 at 18:18
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53325182%2fhow-to-extract-year-or-datetime-from-a-column-in-a-pandas-dataframe-that-conta%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown