Regular expression to extract number before/after word
I have 10000 descriptions and I want to use regular expressions to extract the number associated with the phrase "arrested".
For example:
"police arrests 4 people"
"7 people were arrested".
The numbers range from 1 to 99.
I have tried the following code:
gen arrest= regexm(description, "(^[1-9][0-9]$)[ ]*(arrests|arrested)")
I cannot simply extract just the number, because the descriptions also mention numbers that have nothing to do with arrests.
regex stata
edited Nov 14 '18 at 11:50
Pearly Spencer
asked Nov 14 '18 at 1:13
serpentina
3 Answers
You can use this regex:
(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))
It splits the search into two alternatives, depending on whether the number comes before or after 'arrests' or 'arrested'.
Each alternative is wrapped in a non-capturing group. In the first one, a capturing group matches the number: an optional digit from 1-9 followed by a digit from 0-9. This is followed by 0 to 20 letters or spaces (the words in between) and then 'arrests' or 'arrested'.
The second alternative handles the opposite situation, where the number comes last; there the number ends up in the second capturing group.
The pattern therefore matches only if the number is within 20 characters of 'arrests' or 'arrested'.
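As a quick check of the 20-character window, here is a minimal sketch. It assumes Stata 14 or later (where ustrregexm() is available), and both test strings are made up for illustration. The first string has 13 characters between the number and the keyword, the second has 37.

```stata
* Store the pattern once so both checks use exactly the same regex
local re "(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))"

* Number close to the keyword: inside the 20-character window
display ustrregexm("7 people were arrested", "`re'")

* Number far from the keyword: outside the 20-character window
display ustrregexm("7 people who had been protesting were arrested", "`re'")
```

The first call should display 1 and the second 0, so widening or narrowing {0,20} is the knob to adjust if real descriptions put more text between the number and the keyword.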
answered Nov 14 '18 at 1:58
Poul Bak
The following works for me (solution based on @PoulBak's idea):
clear
input strL var1
"This is 1 long string saying that police arrests 4 people"
"3 news outlets today reported that 7 people were arrested"
"several witnesses saw 5 people arrested and other 3 killed"
end
generate var2 = ustrregexs(0) if ustrregexm(var1, "(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))")
list
     +-------------------------------------------------------------------------------------+
     | var1                                                         var2                   |
     |-------------------------------------------------------------------------------------|
  1. | This is 1 long string saying that police arrests 4 people    arrests 4              |
  2. | 3 news outlets today reported that 7 people were arrested    7 people were arrested |
  3. | several witnesses saw 5 people arrested and other 3 killed   5 people arrested      |
     +-------------------------------------------------------------------------------------+
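Note that ustrregexs(0) returns the whole match (for example "7 people were arrested"), not the number on its own. One way to keep only the number is sketched below, assuming Stata 14 or later. It splits the pattern into its two directions so that the number is always capture group 1, then converts it with real(). The variable names description and arrests are illustrative and not from the original post.

```stata
clear
input str80 description
"This is 1 long string saying that police arrests 4 people"
"3 news outlets today reported that 7 people were arrested"
"several witnesses saw 5 people arrested and other 3 killed"
end

* Number first: capture group 1 holds the digits that precede the keyword
generate arrests = real(ustrregexs(1)) if ustrregexm(description, "([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested)")

* Keyword first: fill in only the observations that are still missing
replace arrests = real(ustrregexs(1)) if missing(arrests) & ustrregexm(description, "(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9])")

list description arrests
```

On these three strings the sketch should yield 4, 7 and 5. Descriptions with no number within 20 characters of the keyword are left missing, which makes them easy to review by hand.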
answered Nov 14 '18 at 10:10
Pearly Spencer
Thank you! It worked!
– serpentina
Nov 14 '18 at 14:52
Perhaps something like this?
(\d+)[^,.\d\n]+?(?=arrest|custody)|(?<=arrest|custody)[^,.\d\n]+?(\d+)
Regex101
Keep in mind that this will not match numbers written out as words (e.g., "five people were arrested"), so you would have to add that if desired.
Breaking down the pattern
(\d+)[^,.\d\n]+?(?=arrest|custody)
First option, used when the number comes before the watched terms:
(\d+): the number to capture; + means one or more digits
[^,.\d\n]+?: matches anything except a comma (,), a period (.), a digit (\d), or a newline (\n). These exclusions prevent false positives from different sentences (the number and the term must be in the same sentence); +? means one or more times (lazy)
(?=arrest|custody): positive lookahead checking for either word
(?<=arrest|custody)[^,.\d\n]+?(\d+)
Second option, used when the number comes after the watched terms:
(?<=arrest|custody): positive lookbehind checking that the word comes before the number
[^,.\d\n]+?: matches anything except a comma (,), a period (.), a digit (\d), or a newline (\n), for the same reason as above; +? means one or more times (lazy)
(\d+): the number to capture; + means one or more digits
Miscellaneous Notes
If you want to add textual representations of your numbers, then you would incorporate that into the (\d+) capturing group.
If you have any additional terms to watch for other than arrest or custody, you would add those terms to both lookaround groups.
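For Stata users, here is a minimal sketch of how this lookaround pattern could be applied, written with \d for a digit and \n for a newline. It assumes Stata 14 or later, whose ustrregexm() uses the ICU regex engine; ICU accepts lookbehinds of bounded length such as (?<=arrest|custody). The two alternatives are run as separate calls so the number is always capture group 1. The variable names description and n and the sample strings are illustrative.

```stata
clear
input str60 description
"police arrests 4 people"
"7 people were arrested"
"2 suspects were taken into custody"
end

* Number before the keyword: the lookahead checks for the keyword without consuming it
generate n = real(ustrregexs(1)) if ustrregexm(description, "(\d+)[^,.\d\n]+?(?=arrest|custody)")

* Number after the keyword: the lookbehind requires the keyword just before the gap
replace n = real(ustrregexs(1)) if missing(n) & ustrregexm(description, "(?<=arrest|custody)[^,.\d\n]+?(\d+)")

list
```

This should give 4, 7 and 2. Unlike the 20-character version, this pattern puts no limit on the distance between number and keyword, so in a string such as "This is 1 long string saying that police arrests 4 people" it would pick up the 1 rather than the 4.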
edited Nov 14 '18 at 1:46
answered Nov 14 '18 at 1:38
K.Dᴀᴠɪs