Python lxml/beautiful soup to find all links on a web page












7















I am writing a script to read a web page, and build a database of links that matches a certain criteria. Right now I am stuck with lxml and understanding how to grab all the <a href>'s from the html...



result = self._openurl(self.mainurl)
content = result.read()
html = lxml.html.fromstring(content)
print lxml.html.find_rel_links(html,'href')









share|improve this question




















  • 1





    this has been asked dozens of times and has good answers, e.g.: stackoverflow.com/questions/1080411/…

    – Benjamin Wohlwend
    May 25 '11 at 21:29
















7















I am writing a script to read a web page, and build a database of links that matches a certain criteria. Right now I am stuck with lxml and understanding how to grab all the <a href>'s from the html...



result = self._openurl(self.mainurl)
content = result.read()
html = lxml.html.fromstring(content)
print lxml.html.find_rel_links(html,'href')









share|improve this question




















  • 1





    this has been asked dozens of times and has good answers, e.g.: stackoverflow.com/questions/1080411/…

    – Benjamin Wohlwend
    May 25 '11 at 21:29














7












7








7


1






I am writing a script to read a web page, and build a database of links that matches a certain criteria. Right now I am stuck with lxml and understanding how to grab all the <a href>'s from the html...



result = self._openurl(self.mainurl)
content = result.read()
html = lxml.html.fromstring(content)
print lxml.html.find_rel_links(html,'href')









share|improve this question
















I am writing a script to read a web page, and build a database of links that matches a certain criteria. Right now I am stuck with lxml and understanding how to grab all the <a href>'s from the html...



result = self._openurl(self.mainurl)
content = result.read()
html = lxml.html.fromstring(content)
print lxml.html.find_rel_links(html,'href')






python lxml






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Dec 1 '16 at 23:50









halfer

14.7k759116




14.7k759116










asked May 25 '11 at 21:20









CmagCmag

5,0911662111




5,0911662111








  • 1





    this has been asked dozens of times and has good answers, e.g.: stackoverflow.com/questions/1080411/…

    – Benjamin Wohlwend
    May 25 '11 at 21:29














  • 1





    this has been asked dozens of times and has good answers, e.g.: stackoverflow.com/questions/1080411/…

    – Benjamin Wohlwend
    May 25 '11 at 21:29








1




1





this has been asked dozens of times and has good answers, e.g.: stackoverflow.com/questions/1080411/…

– Benjamin Wohlwend
May 25 '11 at 21:29





this has been asked dozens of times and has good answers, e.g.: stackoverflow.com/questions/1080411/…

– Benjamin Wohlwend
May 25 '11 at 21:29












4 Answers
4






active

oldest

votes


















10














Use XPath. Something like (can't test from here):



urls = html.xpath('//a/@href')





share|improve this answer
























  • thank you so much! i'll test

    – Cmag
    May 25 '11 at 21:29











  • OK, then how can I get the 2 variables back from a string such as: <li><a href="domain.org/blah/">Economic &amp; Name </a></li> I need the URL, and the description

    – Cmag
    May 25 '11 at 21:37













  • Use html.xpath('//a') instead and then (off the top of my head) .attr['href'] for the url and .text for the contents.

    – Fred Foo
    May 25 '11 at 21:53











  • where can i read more? how do you know the fields for xpath? reading w3schools.com/xpath/xpath_syntax.asp

    – Cmag
    May 25 '11 at 21:58











  • this works: //a/text()

    – Cmag
    May 25 '11 at 22:11



















4














With iterlinks, lxml provides an excellent function for this task.




This yields (element, attribute, link, pos) for every link [...] in an action, archive, background, cite, classid, codebase, data, href, longdesc, profile, src, usemap, dynsrc, or lowsrc attribute.







share|improve this answer

































    1














    I want to provide an alternative lxml-based solution.



    The solution uses the function provided in lxml.cssselect



        import urllib
    import lxml.html
    from lxml.cssselect import CSSSelector
    connection = urllib.urlopen('http://www.yourTargetURL/')
    dom = lxml.html.fromstring(connection.read())
    selAnchor = CSSSelector('a')
    foundElements = selAnchor(dom)
    print [e.get('href') for e in foundElements]





    share|improve this answer

































      0














      You can use this method:



      from urllib.parse import urljoin, urlparse
      from lxml import html as lh
      class Crawler:
      def __init__(self, start_url):
      self.start_url = start_url
      self.base_url = f'{urlparse(self.start_url).scheme}://{urlparse(self.start_url).netloc}'
      self.visited_urls = set()

      def fetch_urls(self, html):
      urls =
      dom = lh.fromstring(html)
      for href in dom.xpath('//a/@href'):
      url = urljoin(self.base_url, href)
      if url not in self.visited_urls and url.startswith(self.base_url):
      urls.append(url)
      return urls





      share|improve this answer























        Your Answer






        StackExchange.ifUsing("editor", function () {
        StackExchange.using("externalEditor", function () {
        StackExchange.using("snippets", function () {
        StackExchange.snippets.init();
        });
        });
        }, "code-snippets");

        StackExchange.ready(function() {
        var channelOptions = {
        tags: "".split(" "),
        id: "1"
        };
        initTagRenderer("".split(" "), "".split(" "), channelOptions);

        StackExchange.using("externalEditor", function() {
        // Have to fire editor after snippets, if snippets enabled
        if (StackExchange.settings.snippets.snippetsEnabled) {
        StackExchange.using("snippets", function() {
        createEditor();
        });
        }
        else {
        createEditor();
        }
        });

        function createEditor() {
        StackExchange.prepareEditor({
        heartbeatType: 'answer',
        autoActivateHeartbeat: false,
        convertImagesToLinks: true,
        noModals: true,
        showLowRepImageUploadWarning: true,
        reputationToPostImages: 10,
        bindNavPrevention: true,
        postfix: "",
        imageUploader: {
        brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
        contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
        allowUrls: true
        },
        onDemand: true,
        discardSelector: ".discard-answer"
        ,immediatelyShowMarkdownHelp:true
        });


        }
        });














        draft saved

        draft discarded


















        StackExchange.ready(
        function () {
        StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f6131089%2fpython-lxml-beautiful-soup-to-find-all-links-on-a-web-page%23new-answer', 'question_page');
        }
        );

        Post as a guest















        Required, but never shown

























        4 Answers
        4






        active

        oldest

        votes








        4 Answers
        4






        active

        oldest

        votes









        active

        oldest

        votes






        active

        oldest

        votes









        10














        Use XPath. Something like (can't test from here):



        urls = html.xpath('//a/@href')





        share|improve this answer
























        • thank you so much! i'll test

          – Cmag
          May 25 '11 at 21:29











        • OK, then how can I get the 2 variables back from a string such as: <li><a href="domain.org/blah/">Economic &amp; Name </a></li> I need the URL, and the description

          – Cmag
          May 25 '11 at 21:37













        • Use html.xpath('//a') instead and then (off the top of my head) .attr['href'] for the url and .text for the contents.

          – Fred Foo
          May 25 '11 at 21:53











        • where can i read more? how do you know the fields for xpath? reading w3schools.com/xpath/xpath_syntax.asp

          – Cmag
          May 25 '11 at 21:58











        • this works: //a/text()

          – Cmag
          May 25 '11 at 22:11
















        10














        Use XPath. Something like (can't test from here):



        urls = html.xpath('//a/@href')





        share|improve this answer
























        • thank you so much! i'll test

          – Cmag
          May 25 '11 at 21:29











        • OK, then how can I get the 2 variables back from a string such as: <li><a href="domain.org/blah/">Economic &amp; Name </a></li> I need the URL, and the description

          – Cmag
          May 25 '11 at 21:37













        • Use html.xpath('//a') instead and then (off the top of my head) .attr['href'] for the url and .text for the contents.

          – Fred Foo
          May 25 '11 at 21:53











        • where can i read more? how do you know the fields for xpath? reading w3schools.com/xpath/xpath_syntax.asp

          – Cmag
          May 25 '11 at 21:58











        • this works: //a/text()

          – Cmag
          May 25 '11 at 22:11














        10












        10








        10







        Use XPath. Something like (can't test from here):



        urls = html.xpath('//a/@href')





        share|improve this answer













        Use XPath. Something like (can't test from here):



        urls = html.xpath('//a/@href')






        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered May 25 '11 at 21:27









        Fred FooFred Foo

        283k58595727




        283k58595727













        • thank you so much! i'll test

          – Cmag
          May 25 '11 at 21:29











        • OK, then how can I get the 2 variables back from a string such as: <li><a href="domain.org/blah/">Economic &amp; Name </a></li> I need the URL, and the description

          – Cmag
          May 25 '11 at 21:37













        • Use html.xpath('//a') instead and then (off the top of my head) .attr['href'] for the url and .text for the contents.

          – Fred Foo
          May 25 '11 at 21:53











        • where can i read more? how do you know the fields for xpath? reading w3schools.com/xpath/xpath_syntax.asp

          – Cmag
          May 25 '11 at 21:58











        • this works: //a/text()

          – Cmag
          May 25 '11 at 22:11



















        • thank you so much! i'll test

          – Cmag
          May 25 '11 at 21:29











        • OK, then how can I get the 2 variables back from a string such as: <li><a href="domain.org/blah/">Economic &amp; Name </a></li> I need the URL, and the description

          – Cmag
          May 25 '11 at 21:37













        • Use html.xpath('//a') instead and then (off the top of my head) .attr['href'] for the url and .text for the contents.

          – Fred Foo
          May 25 '11 at 21:53











        • where can i read more? how do you know the fields for xpath? reading w3schools.com/xpath/xpath_syntax.asp

          – Cmag
          May 25 '11 at 21:58











        • this works: //a/text()

          – Cmag
          May 25 '11 at 22:11

















        thank you so much! i'll test

        – Cmag
        May 25 '11 at 21:29





        thank you so much! i'll test

        – Cmag
        May 25 '11 at 21:29













        OK, then how can I get the 2 variables back from a string such as: <li><a href="domain.org/blah/">Economic &amp; Name </a></li> I need the URL, and the description

        – Cmag
        May 25 '11 at 21:37







        OK, then how can I get the 2 variables back from a string such as: <li><a href="domain.org/blah/">Economic &amp; Name </a></li> I need the URL, and the description

        – Cmag
        May 25 '11 at 21:37















        Use html.xpath('//a') instead and then (off the top of my head) .attr['href'] for the url and .text for the contents.

        – Fred Foo
        May 25 '11 at 21:53





        Use html.xpath('//a') instead and then (off the top of my head) .attr['href'] for the url and .text for the contents.

        – Fred Foo
        May 25 '11 at 21:53













        where can i read more? how do you know the fields for xpath? reading w3schools.com/xpath/xpath_syntax.asp

        – Cmag
        May 25 '11 at 21:58





        where can i read more? how do you know the fields for xpath? reading w3schools.com/xpath/xpath_syntax.asp

        – Cmag
        May 25 '11 at 21:58













        this works: //a/text()

        – Cmag
        May 25 '11 at 22:11





        this works: //a/text()

        – Cmag
        May 25 '11 at 22:11













        4














        With iterlinks, lxml provides an excellent function for this task.




        This yields (element, attribute, link, pos) for every link [...] in an action, archive, background, cite, classid, codebase, data, href, longdesc, profile, src, usemap, dynsrc, or lowsrc attribute.







        share|improve this answer






























          4














          With iterlinks, lxml provides an excellent function for this task.




          This yields (element, attribute, link, pos) for every link [...] in an action, archive, background, cite, classid, codebase, data, href, longdesc, profile, src, usemap, dynsrc, or lowsrc attribute.







          share|improve this answer




























            4












            4








            4







            With iterlinks, lxml provides an excellent function for this task.




            This yields (element, attribute, link, pos) for every link [...] in an action, archive, background, cite, classid, codebase, data, href, longdesc, profile, src, usemap, dynsrc, or lowsrc attribute.







            share|improve this answer















            With iterlinks, lxml provides an excellent function for this task.




            This yields (element, attribute, link, pos) for every link [...] in an action, archive, background, cite, classid, codebase, data, href, longdesc, profile, src, usemap, dynsrc, or lowsrc attribute.








            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited Feb 13 '13 at 22:13









            Bengt

            9,70743757




            9,70743757










            answered May 28 '11 at 7:55









            Gregory PetukhovGregory Petukhov

            63039




            63039























                1














                I want to provide an alternative lxml-based solution.



                The solution uses the function provided in lxml.cssselect



                    import urllib
                import lxml.html
                from lxml.cssselect import CSSSelector
                connection = urllib.urlopen('http://www.yourTargetURL/')
                dom = lxml.html.fromstring(connection.read())
                selAnchor = CSSSelector('a')
                foundElements = selAnchor(dom)
                print [e.get('href') for e in foundElements]





                share|improve this answer






























                  1














                  I want to provide an alternative lxml-based solution.



                  The solution uses the function provided in lxml.cssselect



                      import urllib
                  import lxml.html
                  from lxml.cssselect import CSSSelector
                  connection = urllib.urlopen('http://www.yourTargetURL/')
                  dom = lxml.html.fromstring(connection.read())
                  selAnchor = CSSSelector('a')
                  foundElements = selAnchor(dom)
                  print [e.get('href') for e in foundElements]





                  share|improve this answer




























                    1












                    1








                    1







                    I want to provide an alternative lxml-based solution.



                    The solution uses the function provided in lxml.cssselect



                        import urllib
                    import lxml.html
                    from lxml.cssselect import CSSSelector
                    connection = urllib.urlopen('http://www.yourTargetURL/')
                    dom = lxml.html.fromstring(connection.read())
                    selAnchor = CSSSelector('a')
                    foundElements = selAnchor(dom)
                    print [e.get('href') for e in foundElements]





                    share|improve this answer















                    I want to provide an alternative lxml-based solution.



                    The solution uses the function provided in lxml.cssselect



                        import urllib
                    import lxml.html
                    from lxml.cssselect import CSSSelector
                    connection = urllib.urlopen('http://www.yourTargetURL/')
                    dom = lxml.html.fromstring(connection.read())
                    selAnchor = CSSSelector('a')
                    foundElements = selAnchor(dom)
                    print [e.get('href') for e in foundElements]






                    share|improve this answer














                    share|improve this answer



                    share|improve this answer








                    edited Dec 24 '11 at 7:52

























                    answered Aug 16 '11 at 7:53









                    吳強福吳強福

                    328218




                    328218























                        0














                        You can use this method:



                        from urllib.parse import urljoin, urlparse
                        from lxml import html as lh
                        class Crawler:
                        def __init__(self, start_url):
                        self.start_url = start_url
                        self.base_url = f'{urlparse(self.start_url).scheme}://{urlparse(self.start_url).netloc}'
                        self.visited_urls = set()

                        def fetch_urls(self, html):
                        urls =
                        dom = lh.fromstring(html)
                        for href in dom.xpath('//a/@href'):
                        url = urljoin(self.base_url, href)
                        if url not in self.visited_urls and url.startswith(self.base_url):
                        urls.append(url)
                        return urls





                        share|improve this answer




























                          0














                          You can use this method:



                          from urllib.parse import urljoin, urlparse
                          from lxml import html as lh
                          class Crawler:
                          def __init__(self, start_url):
                          self.start_url = start_url
                          self.base_url = f'{urlparse(self.start_url).scheme}://{urlparse(self.start_url).netloc}'
                          self.visited_urls = set()

                          def fetch_urls(self, html):
                          urls =
                          dom = lh.fromstring(html)
                          for href in dom.xpath('//a/@href'):
                          url = urljoin(self.base_url, href)
                          if url not in self.visited_urls and url.startswith(self.base_url):
                          urls.append(url)
                          return urls





                          share|improve this answer


























                            0












                            0








                            0







                            You can use this method:



                            from urllib.parse import urljoin, urlparse
                            from lxml import html as lh
                            class Crawler:
                            def __init__(self, start_url):
                            self.start_url = start_url
                            self.base_url = f'{urlparse(self.start_url).scheme}://{urlparse(self.start_url).netloc}'
                            self.visited_urls = set()

                            def fetch_urls(self, html):
                            urls =
                            dom = lh.fromstring(html)
                            for href in dom.xpath('//a/@href'):
                            url = urljoin(self.base_url, href)
                            if url not in self.visited_urls and url.startswith(self.base_url):
                            urls.append(url)
                            return urls





                            share|improve this answer













                            You can use this method:



                            from urllib.parse import urljoin, urlparse
                            from lxml import html as lh
                            class Crawler:
                            def __init__(self, start_url):
                            self.start_url = start_url
                            self.base_url = f'{urlparse(self.start_url).scheme}://{urlparse(self.start_url).netloc}'
                            self.visited_urls = set()

                            def fetch_urls(self, html):
                            urls =
                            dom = lh.fromstring(html)
                            for href in dom.xpath('//a/@href'):
                            url = urljoin(self.base_url, href)
                            if url not in self.visited_urls and url.startswith(self.base_url):
                            urls.append(url)
                            return urls






                            share|improve this answer












                            share|improve this answer



                            share|improve this answer










                            answered Nov 16 '18 at 7:27









                            Saeed GharedaghiSaeed Gharedaghi

                            432718




                            432718






























                                draft saved

                                draft discarded




















































                                Thanks for contributing an answer to Stack Overflow!


                                • Please be sure to answer the question. Provide details and share your research!

                                But avoid



                                • Asking for help, clarification, or responding to other answers.

                                • Making statements based on opinion; back them up with references or personal experience.


                                To learn more, see our tips on writing great answers.




                                draft saved


                                draft discarded














                                StackExchange.ready(
                                function () {
                                StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f6131089%2fpython-lxml-beautiful-soup-to-find-all-links-on-a-web-page%23new-answer', 'question_page');
                                }
                                );

                                Post as a guest















                                Required, but never shown





















































                                Required, but never shown














                                Required, but never shown












                                Required, but never shown







                                Required, but never shown

































                                Required, but never shown














                                Required, but never shown












                                Required, but never shown







                                Required, but never shown







                                Popular posts from this blog

                                List item for chat from Array inside array React Native

                                Thiostrepton

                                Caerphilly