Can't create an xpath capable of meeting certain condition





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}







2















I've created a script which is able to extract the links ending with .html extention available under class tableFile from a webpage. The script can do it's job. However, my intention at this point is to get only those .html links which have EX- in its type field. I'm looking for any pure xpath solution (by not using .getparent() or something).



Link to that site



Script I've tried with so far:



import requests
from lxml.html import fromstring

res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
root = fromstring(res.text)

for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):
if ".htm" in item:
print(item)


When I try to get the links meeting above condition with the below approach, I get an error:



for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):
if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):
print(item)


Error I get:



if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):
AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'xpath'


This is how the files look like:



enter image description here










share|improve this question





























    2















    I've created a script which is able to extract the links ending with .html extention available under class tableFile from a webpage. The script can do it's job. However, my intention at this point is to get only those .html links which have EX- in its type field. I'm looking for any pure xpath solution (by not using .getparent() or something).



    Link to that site



    Script I've tried with so far:



    import requests
    from lxml.html import fromstring

    res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
    root = fromstring(res.text)

    for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):
    if ".htm" in item:
    print(item)


    When I try to get the links meeting above condition with the below approach, I get an error:



    for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):
    if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):
    print(item)


    Error I get:



    if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):
    AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'xpath'


    This is how the files look like:



    enter image description here










    share|improve this question

























      2












      2








      2








      I've created a script which is able to extract the links ending with .html extention available under class tableFile from a webpage. The script can do it's job. However, my intention at this point is to get only those .html links which have EX- in its type field. I'm looking for any pure xpath solution (by not using .getparent() or something).



      Link to that site



      Script I've tried with so far:



      import requests
      from lxml.html import fromstring

      res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
      root = fromstring(res.text)

      for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):
      if ".htm" in item:
      print(item)


      When I try to get the links meeting above condition with the below approach, I get an error:



      for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):
      if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):
      print(item)


      Error I get:



      if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):
      AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'xpath'


      This is how the files look like:



      enter image description here










      share|improve this question














      I've created a script which is able to extract the links ending with .html extention available under class tableFile from a webpage. The script can do it's job. However, my intention at this point is to get only those .html links which have EX- in its type field. I'm looking for any pure xpath solution (by not using .getparent() or something).



      Link to that site



      Script I've tried with so far:



      import requests
      from lxml.html import fromstring

      res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
      root = fromstring(res.text)

      for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):
      if ".htm" in item:
      print(item)


      When I try to get the links meeting above condition with the below approach, I get an error:



      for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):
      if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):
      print(item)


      Error I get:



      if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):
      AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'xpath'


      This is how the files look like:



      enter image description here







      python python-3.x xpath web-scraping






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Nov 17 '18 at 7:11









      robots.txtrobots.txt

      313118




      313118
























          3 Answers
          3






          active

          oldest

          votes


















          2














          If you need pure XPath solution, you can use below:



          import requests
          from lxml.html import fromstring

          res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
          root = fromstring(res.text)
          for item in root.xpath('//table[contains(@summary,"Document")]//tr[td[starts-with(., "EX-")]]/td/a[contains(@href, ".htm")]/@href'):
          print(item)


          /Archives/edgar/data/1085596/000146970918000185/ex31_1apg.htm
          /Archives/edgar/data/1085596/000146970918000185/ex31_2apg.htm
          /Archives/edgar/data/1085596/000146970918000185/ex32_1apg.htm
          /Archives/edgar/data/1085596/000146970918000185/ex32_2apg.htm





          share|improve this answer
























          • Apology for the delayed response @sir Andersson. Thanks for your effective solution.

            – robots.txt
            Nov 17 '18 at 10:41











          • Is there any way to do the same using .cssselect() @sir Andersson? I hope you will take a look in your spare time.

            – robots.txt
            Nov 17 '18 at 13:17






          • 1





            @robots.txt , you can try [link.attrib['href'] for link in root.cssselect('table[summary*="Document"] td>a:contains("ex")[href*="htm"]')] , but CSS selectors are not so flexible IMHO, so it's not the same as provided XPath

            – Andersson
            Nov 17 '18 at 13:36





















          1














          It looks like you want:



          //td[following-sibling::td[starts-with(text(), "EX")]]/a[contains(@href, ".htm")]


          There's a lot of different ways to do this with xpath. Css is probalby much simpler.






          share|improve this answer































            1














            Here is a way using dataframes and pandas



            import pandas as pd
            tables = pd.read_html("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
            base = "https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/"
            results = [base + row[1][2] for row in tables[0].iterrows() if row[1][2].endswith(('.htm', '.txt')) and str(row[1][3]).startswith('EX')]
            print(results)





            share|improve this answer


























              Your Answer






              StackExchange.ifUsing("editor", function () {
              StackExchange.using("externalEditor", function () {
              StackExchange.using("snippets", function () {
              StackExchange.snippets.init();
              });
              });
              }, "code-snippets");

              StackExchange.ready(function() {
              var channelOptions = {
              tags: "".split(" "),
              id: "1"
              };
              initTagRenderer("".split(" "), "".split(" "), channelOptions);

              StackExchange.using("externalEditor", function() {
              // Have to fire editor after snippets, if snippets enabled
              if (StackExchange.settings.snippets.snippetsEnabled) {
              StackExchange.using("snippets", function() {
              createEditor();
              });
              }
              else {
              createEditor();
              }
              });

              function createEditor() {
              StackExchange.prepareEditor({
              heartbeatType: 'answer',
              autoActivateHeartbeat: false,
              convertImagesToLinks: true,
              noModals: true,
              showLowRepImageUploadWarning: true,
              reputationToPostImages: 10,
              bindNavPrevention: true,
              postfix: "",
              imageUploader: {
              brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
              contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
              allowUrls: true
              },
              onDemand: true,
              discardSelector: ".discard-answer"
              ,immediatelyShowMarkdownHelp:true
              });


              }
              });














              draft saved

              draft discarded


















              StackExchange.ready(
              function () {
              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53349062%2fcant-create-an-xpath-capable-of-meeting-certain-condition%23new-answer', 'question_page');
              }
              );

              Post as a guest















              Required, but never shown

























              3 Answers
              3






              active

              oldest

              votes








              3 Answers
              3






              active

              oldest

              votes









              active

              oldest

              votes






              active

              oldest

              votes









              2














              If you need pure XPath solution, you can use below:



              import requests
              from lxml.html import fromstring

              res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
              root = fromstring(res.text)
              for item in root.xpath('//table[contains(@summary,"Document")]//tr[td[starts-with(., "EX-")]]/td/a[contains(@href, ".htm")]/@href'):
              print(item)


              /Archives/edgar/data/1085596/000146970918000185/ex31_1apg.htm
              /Archives/edgar/data/1085596/000146970918000185/ex31_2apg.htm
              /Archives/edgar/data/1085596/000146970918000185/ex32_1apg.htm
              /Archives/edgar/data/1085596/000146970918000185/ex32_2apg.htm





              share|improve this answer
























              • Apology for the delayed response @sir Andersson. Thanks for your effective solution.

                – robots.txt
                Nov 17 '18 at 10:41











              • Is there any way to do the same using .cssselect() @sir Andersson? I hope you will take a look in your spare time.

                – robots.txt
                Nov 17 '18 at 13:17






              • 1





                @robots.txt , you can try [link.attrib['href'] for link in root.cssselect('table[summary*="Document"] td>a:contains("ex")[href*="htm"]')] , but CSS selectors are not so flexible IMHO, so it's not the same as provided XPath

                – Andersson
                Nov 17 '18 at 13:36


















              2














              If you need pure XPath solution, you can use below:



              import requests
              from lxml.html import fromstring

              res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
              root = fromstring(res.text)
              for item in root.xpath('//table[contains(@summary,"Document")]//tr[td[starts-with(., "EX-")]]/td/a[contains(@href, ".htm")]/@href'):
              print(item)


              /Archives/edgar/data/1085596/000146970918000185/ex31_1apg.htm
              /Archives/edgar/data/1085596/000146970918000185/ex31_2apg.htm
              /Archives/edgar/data/1085596/000146970918000185/ex32_1apg.htm
              /Archives/edgar/data/1085596/000146970918000185/ex32_2apg.htm





              share|improve this answer
























              • Apology for the delayed response @sir Andersson. Thanks for your effective solution.

                – robots.txt
                Nov 17 '18 at 10:41











              • Is there any way to do the same using .cssselect() @sir Andersson? I hope you will take a look in your spare time.

                – robots.txt
                Nov 17 '18 at 13:17






              • 1





                @robots.txt , you can try [link.attrib['href'] for link in root.cssselect('table[summary*="Document"] td>a:contains("ex")[href*="htm"]')] , but CSS selectors are not so flexible IMHO, so it's not the same as provided XPath

                – Andersson
                Nov 17 '18 at 13:36
















              2












              2








              2







              If you need pure XPath solution, you can use below:



              import requests
              from lxml.html import fromstring

              res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
              root = fromstring(res.text)
              for item in root.xpath('//table[contains(@summary,"Document")]//tr[td[starts-with(., "EX-")]]/td/a[contains(@href, ".htm")]/@href'):
              print(item)


              /Archives/edgar/data/1085596/000146970918000185/ex31_1apg.htm
              /Archives/edgar/data/1085596/000146970918000185/ex31_2apg.htm
              /Archives/edgar/data/1085596/000146970918000185/ex32_1apg.htm
              /Archives/edgar/data/1085596/000146970918000185/ex32_2apg.htm





              share|improve this answer













              If you need pure XPath solution, you can use below:



              import requests
              from lxml.html import fromstring

              res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
              root = fromstring(res.text)
              for item in root.xpath('//table[contains(@summary,"Document")]//tr[td[starts-with(., "EX-")]]/td/a[contains(@href, ".htm")]/@href'):
              print(item)


              /Archives/edgar/data/1085596/000146970918000185/ex31_1apg.htm
              /Archives/edgar/data/1085596/000146970918000185/ex31_2apg.htm
              /Archives/edgar/data/1085596/000146970918000185/ex32_1apg.htm
              /Archives/edgar/data/1085596/000146970918000185/ex32_2apg.htm






              share|improve this answer












              share|improve this answer



              share|improve this answer










              answered Nov 17 '18 at 7:39









              AnderssonAndersson

              39.2k113769




              39.2k113769













              • Apology for the delayed response @sir Andersson. Thanks for your effective solution.

                – robots.txt
                Nov 17 '18 at 10:41











              • Is there any way to do the same using .cssselect() @sir Andersson? I hope you will take a look in your spare time.

                – robots.txt
                Nov 17 '18 at 13:17






              • 1





                @robots.txt , you can try [link.attrib['href'] for link in root.cssselect('table[summary*="Document"] td>a:contains("ex")[href*="htm"]')] , but CSS selectors are not so flexible IMHO, so it's not the same as provided XPath

                – Andersson
                Nov 17 '18 at 13:36





















              • Apology for the delayed response @sir Andersson. Thanks for your effective solution.

                – robots.txt
                Nov 17 '18 at 10:41











              • Is there any way to do the same using .cssselect() @sir Andersson? I hope you will take a look in your spare time.

                – robots.txt
                Nov 17 '18 at 13:17






              • 1





                @robots.txt , you can try [link.attrib['href'] for link in root.cssselect('table[summary*="Document"] td>a:contains("ex")[href*="htm"]')] , but CSS selectors are not so flexible IMHO, so it's not the same as provided XPath

                – Andersson
                Nov 17 '18 at 13:36



















              Apology for the delayed response @sir Andersson. Thanks for your effective solution.

              – robots.txt
              Nov 17 '18 at 10:41





              Apology for the delayed response @sir Andersson. Thanks for your effective solution.

              – robots.txt
              Nov 17 '18 at 10:41













              Is there any way to do the same using .cssselect() @sir Andersson? I hope you will take a look in your spare time.

              – robots.txt
              Nov 17 '18 at 13:17





              Is there any way to do the same using .cssselect() @sir Andersson? I hope you will take a look in your spare time.

              – robots.txt
              Nov 17 '18 at 13:17




              1




              1





              @robots.txt , you can try [link.attrib['href'] for link in root.cssselect('table[summary*="Document"] td>a:contains("ex")[href*="htm"]')] , but CSS selectors are not so flexible IMHO, so it's not the same as provided XPath

              – Andersson
              Nov 17 '18 at 13:36







              @robots.txt , you can try [link.attrib['href'] for link in root.cssselect('table[summary*="Document"] td>a:contains("ex")[href*="htm"]')] , but CSS selectors are not so flexible IMHO, so it's not the same as provided XPath

              – Andersson
              Nov 17 '18 at 13:36















              1














              It looks like you want:



              //td[following-sibling::td[starts-with(text(), "EX")]]/a[contains(@href, ".htm")]


              There's a lot of different ways to do this with xpath. Css is probalby much simpler.






              share|improve this answer




























                1














                It looks like you want:



                //td[following-sibling::td[starts-with(text(), "EX")]]/a[contains(@href, ".htm")]


                There's a lot of different ways to do this with xpath. Css is probalby much simpler.






                share|improve this answer


























                  1












                  1








                  1







                  It looks like you want:



                  //td[following-sibling::td[starts-with(text(), "EX")]]/a[contains(@href, ".htm")]


                  There's a lot of different ways to do this with xpath. Css is probalby much simpler.






                  share|improve this answer













                  It looks like you want:



                  //td[following-sibling::td[starts-with(text(), "EX")]]/a[contains(@href, ".htm")]


                  There's a lot of different ways to do this with xpath. Css is probalby much simpler.







                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered Nov 17 '18 at 10:25









                  pguardiariopguardiario

                  37.1k1080118




                  37.1k1080118























                      1














                      Here is a way using dataframes and pandas



                      import pandas as pd
                      tables = pd.read_html("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
                      base = "https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/"
                      results = [base + row[1][2] for row in tables[0].iterrows() if row[1][2].endswith(('.htm', '.txt')) and str(row[1][3]).startswith('EX')]
                      print(results)





                      share|improve this answer






























                        1














                        Here is a way using dataframes and pandas



                        import pandas as pd
                        tables = pd.read_html("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
                        base = "https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/"
                        results = [base + row[1][2] for row in tables[0].iterrows() if row[1][2].endswith(('.htm', '.txt')) and str(row[1][3]).startswith('EX')]
                        print(results)





                        share|improve this answer




























                          1












                          1








                          1







                          Here is a way using dataframes and pandas



                          import pandas as pd
                          tables = pd.read_html("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
                          base = "https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/"
                          results = [base + row[1][2] for row in tables[0].iterrows() if row[1][2].endswith(('.htm', '.txt')) and str(row[1][3]).startswith('EX')]
                          print(results)





                          share|improve this answer















                          Here is a way using dataframes and pandas



                          import pandas as pd
                          tables = pd.read_html("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
                          base = "https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/"
                          results = [base + row[1][2] for row in tables[0].iterrows() if row[1][2].endswith(('.htm', '.txt')) and str(row[1][3]).startswith('EX')]
                          print(results)






                          share|improve this answer














                          share|improve this answer



                          share|improve this answer








                          edited Nov 17 '18 at 11:38

























                          answered Nov 17 '18 at 7:47









                          QHarrQHarr

                          39.2k82345




                          39.2k82345






























                              draft saved

                              draft discarded




















































                              Thanks for contributing an answer to Stack Overflow!


                              • Please be sure to answer the question. Provide details and share your research!

                              But avoid



                              • Asking for help, clarification, or responding to other answers.

                              • Making statements based on opinion; back them up with references or personal experience.


                              To learn more, see our tips on writing great answers.




                              draft saved


                              draft discarded














                              StackExchange.ready(
                              function () {
                              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53349062%2fcant-create-an-xpath-capable-of-meeting-certain-condition%23new-answer', 'question_page');
                              }
                              );

                              Post as a guest















                              Required, but never shown





















































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown

































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown







                              Popular posts from this blog

                              Xamarin.iOS Cant Deploy on Iphone

                              Glorious Revolution

                              Dulmage-Mendelsohn matrix decomposition in Python