Regular expression to extract number before/after word













I have 10,000 descriptions and I want to use regular expressions to extract the number associated with the word "arrested" (or "arrests").



For example:



"police arrests 4 people"
"7 people were arrested".


The numbers range from 1-99.



I have tried the following code:



gen arrest= regexm(description, "(^[1-9][0-9]$)[ ]*(arrests|arrested)")


I cannot simply extract just the number, because the descriptions also mention numbers that have nothing to do with arrests.










      regex stata






edited Nov 14 '18 at 11:50 by Pearly Spencer
asked Nov 14 '18 at 1:13 by serpentina

          3 Answers

You can use this regex:

(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))

It splits the search into two alternatives, depending on whether the number comes before or after 'arrests' or 'arrested'.

Each alternative is wrapped in a non-capturing group. In the first one, the capturing group ([1-9]?[0-9]) matches an optional digit 1-9 followed by a digit 0-9, that is, a one- or two-digit number. This is followed by 0 to 20 letters or spaces (the other words) and then 'arrests' or 'arrested'. The second alternative covers the opposite situation, where the number comes last. The number ends up in capture group 1 when it precedes the word and in capture group 2 when it follows it.

This will match only if the number is within 20 characters of 'arrests' or 'arrested'.
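
As a rough way to check how the pattern above behaves, here is a minimal sketch in Python rather than Stata. Python's re module is used only as a convenient way to exercise the same pattern, and the function name arrest_count is made up for this illustration. It shows one way to turn whichever capture group matched into a number.

```python
import re

# The pattern from the answer above, split over two lines for readability.
PATTERN = re.compile(
    r"(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))"
    r"|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))"
)


def arrest_count(text):
    """Return the number tied to 'arrests'/'arrested', or None if nothing matches."""
    match = PATTERN.search(text)
    if match is None:
        return None
    # Group 1 is filled when the number precedes the keyword,
    # group 2 when it follows; the other group is None.
    return int(match.group(1) or match.group(2))


print(arrest_count("police arrests 4 people"))   # 4
print(arrest_count("7 people were arrested"))    # 7
print(arrest_count("42 suspects arrested"))      # 42
print(arrest_count("nobody was arrested"))       # None
```

One thing this makes visible: the pattern has no word boundaries, so in a text such as "in 2018 arrests rose" it picks up 18. If years or other longer numbers occur in the descriptions, adding \b around the number group may be worth testing.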







answered Nov 14 '18 at 1:58 by Poul Bak

The following works for me (solution based on @PoulBak's idea):

clear

input strL var1
"This is 1 long string saying that police arrests 4 people"
"3 news outlets today reported that 7 people were arrested"
"several witnesses saw 5 people arrested and other 3 killed"
end

generate var2 = ustrregexs(0) if ustrregexm(var1, "(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))")

list

     +-------------------------------------------------------------------------------------+
     | var1                                                         var2                   |
     |-------------------------------------------------------------------------------------|
  1. | This is 1 long string saying that police arrests 4 people    arrests 4              |
  2. | 3 news outlets today reported that 7 people were arrested    7 people were arrested |
  3. | several witnesses saw 5 people arrested and other 3 killed   5 people arrested      |
     +-------------------------------------------------------------------------------------+

Here var2 holds the whole matched text returned by ustrregexs(0), that is, the number together with the keyword and the words between them.
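
The listing above stores the full match, while the question asks for the number alone. The sketch below replays the same three strings in Python (not Stata; split_match is a hypothetical helper name) to show the difference between the whole match and the two numeric capture groups. In Stata the corresponding subexpressions would presumably be read with ustrregexs(1) and ustrregexs(2) instead of ustrregexs(0), but that is not tested here.

```python
import re

# Same pattern as in the Stata call above.
PATTERN = re.compile(
    r"(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))"
    r"|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))"
)

rows = [
    "This is 1 long string saying that police arrests 4 people",
    "3 news outlets today reported that 7 people were arrested",
    "several witnesses saw 5 people arrested and other 3 killed",
]


def split_match(text):
    """Return (whole match, number) for the first hit, or (None, None)."""
    match = PATTERN.search(text)
    if match is None:
        return None, None
    # group(0) is the whole match; groups 1 and 2 hold the number,
    # depending on which side of the keyword it was found.
    return match.group(0), int(match.group(1) or match.group(2))


results = [split_match(row) for row in rows]
for whole, number in results:
    print(f"{whole!r} -> {number}")
```

In the third string the leftmost match wins, so the result is 5 (from "5 people arrested") and not the 3 that follows "arrested and other".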






answered Nov 14 '18 at 10:10 by Pearly Spencer

• Thank you! It worked! – serpentina, Nov 14 '18 at 14:52

Perhaps something like this?

(\d+)[^,.\d\n]+?(?=arrest|custody)|(?<=arrest|custody)[^,.\d\n]+?(\d+)

Keep in mind, this will not match textual versions of the number (e.g., "five people were arrested"), so you would have to incorporate that if desired.

Breaking down the pattern

• (\d+)[^,.\d\n]+?(?=arrest|custody) First option, if the number comes before the watched terms

  • (\d+) the number to capture; + means one or more digits
  • [^,.\d\n]+? matches anything except a comma, a period, a digit \d, or a newline \n. This prevents false positives from other sentences (the number and the term must be in the same sentence). +? means one or more times (lazy)
  • (?=arrest|custody) positive lookahead checking for either word

• (?<=arrest|custody)[^,.\d\n]+?(\d+) Second option, if the number comes after the watched terms

  • (?<=arrest|custody) positive lookbehind checking that the word comes before the number
  • [^,.\d\n]+? matches anything except a comma, a period, a digit \d, or a newline \n, as in the first option
  • (\d+) the number to capture; + means one or more digits
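
The breakdown above can be tried out with a short Python sketch (Python is used here only for checking; number_near_term is a made-up name). One adjustment is needed: Python's re module only accepts fixed-width lookbehinds, so the lookbehind for 'arrest' or 'custody' is written as two separate lookbehinds joined by alternation, which is logically the same test.

```python
import re

# Python's re rejects (?<=arrest|custody) because the two alternatives
# differ in length, so each term gets its own fixed-width lookbehind.
PATTERN = re.compile(
    r"(\d+)[^,.\d\n]+?(?=arrest|custody)"
    r"|(?:(?<=arrest)|(?<=custody))[^,.\d\n]+?(\d+)"
)


def number_near_term(text):
    """Return the number in the same sentence as a watched term, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    # Group 1: number before the term. Group 2: number after the term.
    return int(match.group(1) or match.group(2))


print(number_near_term("police arrests 4 people"))           # 4
print(number_near_term("3 men were taken into custody"))     # 3
print(number_near_term("In 2018, 12 people were arrested"))  # 12
# The period stops the match, so 200 is not tied to 'arrests'.
print(number_near_term("A crowd of 200 gathered. Police made arrests later"))  # None
```

Compared with a fixed 20-character window, this variant uses commas and periods as the boundary, so the distance between number and term is unlimited within one clause.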




Miscellaneous Notes

If you want to add textual representations of your numbers, you would incorporate them into the (\d+) capturing group.

If you have any additional terms to watch for other than arrest or custody, you would add those terms to both lookaround groups.
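
A possible way to apply both notes is sketched below in Python. Everything specific in it is an assumption for illustration: the word list covering one to ten, the extra term 'detain', the added \b word boundaries, case-insensitive matching, and the name count_near_term. It builds the number group and both lookarounds from lists so that new words or terms only need to be added in one place.

```python
import re

# Hypothetical word list and extra term, chosen only for illustration.
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}
TERMS = ["arrest", "custody", "detain"]

# Digits or a number word; \b keeps "one" from matching inside "someone".
NUMBER = r"\b(\d+|" + "|".join(NUMBER_WORDS) + r")\b"
GAP = r"[^,.\d\n]+?"
AHEAD = "(?=" + "|".join(TERMS) + ")"
# One fixed-width lookbehind per term, because Python's re rejects
# lookbehinds whose alternatives differ in length.
BEHIND = "(?:" + "|".join(f"(?<={term})" for term in TERMS) + ")"

PATTERN = re.compile(
    NUMBER + GAP + AHEAD + "|" + BEHIND + GAP + NUMBER,
    re.IGNORECASE,
)


def count_near_term(text):
    """Return the count next to a watched term as an int, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    token = (match.group(1) or match.group(2)).lower()
    return int(token) if token.isdigit() else NUMBER_WORDS[token]


print(count_near_term("Five people were arrested"))       # 5
print(count_near_term("officers detained two suspects"))  # 2
print(count_near_term("someone was arrested"))            # None
```

Larger number words (eleven, twenty-one, and so on) would need a longer list or a dedicated parser, which is outside the scope of this sketch.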







answered Nov 14 '18 at 1:38 by K.Dᴀᴠɪs (edited Nov 14 '18 at 1:46)