Regular expression to extract number before/after word

I have 10000 descriptions and I want to use regular expressions to extract the number associated with the word "arrested".

For example:

"police arrests 4 people"

"7 people were arrested".

The numbers range from 1 to 99.

I have tried the following code:

gen arrest= regexm(description, "(^[1-9][0-9]$)[ ]*(arrests|arrested)")

I cannot simply extract just the number, because the descriptions also mention numbers that have nothing to do with arrests.

edited Nov 14 '18 at 11:50

Pearly Spencer


asked Nov 14 '18 at 1:13

serpentina


regex stata


3 Answers

You can use this regex:

(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))

It splits the search into two alternatives, depending on whether the number comes before or after 'arrests' or 'arrested'.

Each alternative is wrapped in a non-capturing group. In the first, a capturing group matches the number: an optional digit from 1-9 followed by a digit from 0-9. This is followed by 0 to 20 letters or spaces (the other words), and then 'arrests' or 'arrested'. The second alternative handles the opposite situation, where the number comes last, and captures it in a second group.

This will match if the number is within 20 characters of 'arrests' or 'arrested'.
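As a rough illustration of how the two capturing groups can be read back, here is a minimal sketch in Python (not Stata; it is used here only to show the pattern's behaviour). The helper name `arrest_count` is hypothetical and not part of the answer above.

```python
import re

# Pattern from the answer above: group 1 holds a number that precedes the
# keyword, group 2 holds a number that follows it.
PATTERN = re.compile(
    r"(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))"
    r"|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))"
)


def arrest_count(description):
    """Return the number next to 'arrests'/'arrested', or None if absent."""
    match = PATTERN.search(description)
    if match is None:
        return None
    # Only one of the two capturing groups takes part in a given match;
    # the other one is None.
    number = match.group(1) or match.group(2)
    return int(number)


for text in ("police arrests 4 people", "7 people were arrested"):
    print(text, "->", arrest_count(text))
```

Because only letters and spaces are allowed between the number and the keyword, an unrelated number separated by punctuation or by more than 20 characters is skipped.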

answered Nov 14 '18 at 1:58

Poul Bak


The following works for me (solution based on @PoulBak's idea). Note that ustrregexs(0) returns the entire matched substring, which contains the number:

clear

input strL var1
"This is 1 long string saying that police arrests 4 people"
"3 news outlets today reported that 7 people were arrested"
"several witnesses saw 5 people arrested and other 3 killed"
end

generate var2 = ustrregexs(0) if ustrregexm(var1, "(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))")

list

   +-------------------------------------------------------------------------------------+
   |                                                       var1                     var2 |
   |-------------------------------------------------------------------------------------|
1. |  This is 1 long string saying that police arrests 4 people                arrests 4 |
2. |  3 news outlets today reported that 7 people were arrested   7 people were arrested |
3. | several witnesses saw 5 people arrested and other 3 killed        5 people arrested |
   +-------------------------------------------------------------------------------------+

answered Nov 14 '18 at 10:10

Pearly Spencer


Thank you! It worked!

– serpentina
Nov 14 '18 at 14:52


Perhaps something like this?

(\d+)[^,.\d\n]+?(?=arrest|custody)|(?<=arrest|custody)[^,.\d\n]+?(\d+)

Regex101

Keep in mind that this will not match textual versions of the number (e.g., five people were arrested), so you would have to incorporate that if desired.

Breaking down the pattern

(\d+)[^,.\d\n]+?(?=arrest|custody) First option, if the number comes before the watched terms
- (\d+) the number to capture, with + meaning one or more digits
- [^,.\d\n]+? matches anything except a comma (,), period (.), digit (\d), or newline (\n). These exclusions prevent false positives from different sentences (the number and the term must be in the same sentence)
  - +? one or more times (lazy)
- (?=arrest|custody) positive lookahead checking for either word

(?<=arrest|custody)[^,.\d\n]+?(\d+) Second option, if the number comes after the watched terms
- (?<=arrest|custody) positive lookbehind checking that the word comes before the number
- [^,.\d\n]+? matches anything except a comma (,), period (.), digit (\d), or newline (\n). These exclusions prevent false positives from different sentences (the number and the term must be in the same sentence)
  - +? one or more times (lazy)
- (\d+) the number to capture, with + meaning one or more digits

Miscellaneous Notes

If you want to add textual representations of your numbers, you would incorporate them into the (\d+) capturing group.

If you have any additional terms to watch for other than arrest or custody, you would add those terms to both lookaround groups.
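The breakdown above can be tried out with a short Python sketch. Two assumptions are worth flagging: Python's `re` module only accepts fixed-width lookbehinds, so the single lookbehind `(?<=arrest|custody)` is rewritten here as two separate ones, and the helper name `count_in_same_sentence` is made up for this example. Other regex engines may accept the pattern as originally written.

```python
import re

# Same idea as the pattern above. The lookbehind is split in two because
# Python's re module rejects alternatives of different lengths inside a
# single lookbehind.
SENTENCE_PATTERN = re.compile(
    r"(\d+)[^,.\d\n]+?(?=arrest|custody)"
    r"|(?:(?<=arrest)|(?<=custody))[^,.\d\n]+?(\d+)"
)


def count_in_same_sentence(text):
    """Return the number tied to 'arrest'/'custody' in the same sentence."""
    match = SENTENCE_PATTERN.search(text)
    if match is None:
        return None
    # Group 1 is set when the number comes first, group 2 when it comes last.
    return int(match.group(1) or match.group(2))


samples = (
    "police arrests 4 people",
    "7 people were arrested",
    "3 died. 7 people were arrested",
)
for text in samples:
    print(text, "->", count_in_same_sentence(text))
```

In the last sample the period after "died" stops the first number from being paired with "arrested", which is the false-positive protection described in the breakdown.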

edited Nov 14 '18 at 1:46

answered Nov 14 '18 at 1:38

K.Dᴀᴠɪs
