Python lxml/beautiful soup to find all links on a web page

I am writing a script to read a web page, and build a database of links that matches a certain criteria. Right now I am stuck with lxml and understanding how to grab all the <a href>'s from the html...

result = self._openurl(self.mainurl)

content = result.read()

html = lxml.html.fromstring(content)

print lxml.html.find_rel_links(html,'href')

edited Dec 1 '16 at 23:50

halfer

14.7k759116

asked May 25 '11 at 21:20

Cmag

5,0911662111

1

this has been asked dozens of times and has good answers, e.g.: stackoverflow.com/questions/1080411/…

– Benjamin Wohlwend
May 25 '11 at 21:29

add a comment |

result = self._openurl(self.mainurl)

content = result.read()

html = lxml.html.fromstring(content)

print lxml.html.find_rel_links(html,'href')

edited Dec 1 '16 at 23:50

halfer

14.7k759116

asked May 25 '11 at 21:20

Cmag

5,0911662111

1

this has been asked dozens of times and has good answers, e.g.: stackoverflow.com/questions/1080411/…

– Benjamin Wohlwend
May 25 '11 at 21:29

add a comment |

result = self._openurl(self.mainurl)

content = result.read()

html = lxml.html.fromstring(content)

print lxml.html.find_rel_links(html,'href')

edited Dec 1 '16 at 23:50

halfer

14.7k759116

asked May 25 '11 at 21:20

Cmag

5,0911662111

result = self._openurl(self.mainurl)

content = result.read()

html = lxml.html.fromstring(content)

print lxml.html.find_rel_links(html,'href')

python lxml

edited Dec 1 '16 at 23:50

halfer

14.7k759116

asked May 25 '11 at 21:20

Cmag

5,0911662111

edited Dec 1 '16 at 23:50

halfer

14.7k759116

asked May 25 '11 at 21:20

Cmag

5,0911662111

edited Dec 1 '16 at 23:50

halfer

14.7k759116

edited Dec 1 '16 at 23:50

halfer

14.7k759116

edited Dec 1 '16 at 23:50

halfer

14.7k759116

asked May 25 '11 at 21:20

Cmag

5,0911662111

asked May 25 '11 at 21:20

Cmag

5,0911662111

asked May 25 '11 at 21:20

Cmag

5,0911662111

1

this has been asked dozens of times and has good answers, e.g.: stackoverflow.com/questions/1080411/…

– Benjamin Wohlwend
May 25 '11 at 21:29

add a comment |

1

this has been asked dozens of times and has good answers, e.g.: stackoverflow.com/questions/1080411/…

– Benjamin Wohlwend
May 25 '11 at 21:29

this has been asked dozens of times and has good answers, e.g.: stackoverflow.com/questions/1080411/…

– Benjamin Wohlwend
May 25 '11 at 21:29

add a comment |

4 Answers
4

active

oldest

votes

Use XPath. Something like (can't test from here):

urls = html.xpath('//a/@href')

answered May 25 '11 at 21:27

Fred Foo

283k58595727

thank you so much! i'll test

– Cmag
May 25 '11 at 21:29

OK, then how can I get the 2 variables back from a string such as: <li><a href="domain.org/blah/">Economic & Name </a></li> I need the URL, and the description

– Cmag
May 25 '11 at 21:37

Use html.xpath('//a') instead and then (off the top of my head) .attr['href'] for the url and .text for the contents.

– Fred Foo
May 25 '11 at 21:53

where can i read more? how do you know the fields for xpath? reading w3schools.com/xpath/xpath_syntax.asp

– Cmag
May 25 '11 at 21:58

this works: //a/text()

– Cmag
May 25 '11 at 22:11

|
show 1 more comment

With iterlinks, lxml provides an excellent function for this task.

This yields (element, attribute, link, pos) for every link [...] in an action, archive, background, cite, classid, codebase, data, href, longdesc, profile, src, usemap, dynsrc, or lowsrc attribute.

edited Feb 13 '13 at 22:13

Bengt

9,70743757

answered May 28 '11 at 7:55

Gregory Petukhov

63039

add a comment |

I want to provide an alternative lxml-based solution.

The solution uses the function provided in lxml.cssselect

    import urllib

    import lxml.html

    from lxml.cssselect import CSSSelector

    connection = urllib.urlopen('http://www.yourTargetURL/')

    dom =  lxml.html.fromstring(connection.read())

    selAnchor = CSSSelector('a')

    foundElements = selAnchor(dom)

    print [e.get('href') for e in foundElements]

edited Dec 24 '11 at 7:52

answered Aug 16 '11 at 7:53

吳強福

328218

add a comment |

You can use this method:

from urllib.parse import urljoin, urlparse

from lxml import html as lh

class Crawler:

     def __init__(self, start_url):

         self.start_url = start_url

         self.base_url = f'{urlparse(self.start_url).scheme}://{urlparse(self.start_url).netloc}'

         self.visited_urls = set()



     def fetch_urls(self, html):

         urls = 

         dom = lh.fromstring(html)

         for href in dom.xpath('//a/@href'):

              url = urljoin(self.base_url, href)

              if url not in self.visited_urls and url.startswith(self.base_url):

                   urls.append(url)

         return urls

answered Nov 16 '18 at 7:27

Saeed Gharedaghi

432718

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f6131089%2fpython-lxml-beautiful-soup-to-find-all-links-on-a-web-page%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

4 Answers
4

active

oldest

votes

4 Answers
4

active

oldest

votes

Use XPath. Something like (can't test from here):

urls = html.xpath('//a/@href')

answered May 25 '11 at 21:27

Fred Foo

283k58595727

thank you so much! i'll test

– Cmag
May 25 '11 at 21:29

OK, then how can I get the 2 variables back from a string such as: <li><a href="domain.org/blah/">Economic & Name </a></li> I need the URL, and the description

– Cmag
May 25 '11 at 21:37

Use html.xpath('//a') instead and then (off the top of my head) .attr['href'] for the url and .text for the contents.

– Fred Foo
May 25 '11 at 21:53

where can i read more? how do you know the fields for xpath? reading w3schools.com/xpath/xpath_syntax.asp

– Cmag
May 25 '11 at 21:58

this works: //a/text()

– Cmag
May 25 '11 at 22:11

|
show 1 more comment

Use XPath. Something like (can't test from here):

urls = html.xpath('//a/@href')

answered May 25 '11 at 21:27

Fred Foo

283k58595727

thank you so much! i'll test

– Cmag
May 25 '11 at 21:29

OK, then how can I get the 2 variables back from a string such as: <li><a href="domain.org/blah/">Economic & Name </a></li> I need the URL, and the description

– Cmag
May 25 '11 at 21:37

Use html.xpath('//a') instead and then (off the top of my head) .attr['href'] for the url and .text for the contents.

– Fred Foo
May 25 '11 at 21:53

where can i read more? how do you know the fields for xpath? reading w3schools.com/xpath/xpath_syntax.asp

– Cmag
May 25 '11 at 21:58

this works: //a/text()

– Cmag
May 25 '11 at 22:11

|
show 1 more comment

Use XPath. Something like (can't test from here):

urls = html.xpath('//a/@href')

answered May 25 '11 at 21:27

Fred Foo

283k58595727

Use XPath. Something like (can't test from here):

urls = html.xpath('//a/@href')

answered May 25 '11 at 21:27

Fred Foo

283k58595727

answered May 25 '11 at 21:27

Fred Foo

283k58595727

answered May 25 '11 at 21:27

Fred Foo

283k58595727

answered May 25 '11 at 21:27

Fred Foo

283k58595727

thank you so much! i'll test

– Cmag
May 25 '11 at 21:29

OK, then how can I get the 2 variables back from a string such as: <li><a href="domain.org/blah/">Economic & Name </a></li> I need the URL, and the description

– Cmag
May 25 '11 at 21:37

Use html.xpath('//a') instead and then (off the top of my head) .attr['href'] for the url and .text for the contents.

– Fred Foo
May 25 '11 at 21:53

where can i read more? how do you know the fields for xpath? reading w3schools.com/xpath/xpath_syntax.asp

– Cmag
May 25 '11 at 21:58

this works: //a/text()

– Cmag
May 25 '11 at 22:11

|
show 1 more comment

thank you so much! i'll test

– Cmag
May 25 '11 at 21:29

OK, then how can I get the 2 variables back from a string such as: <li><a href="domain.org/blah/">Economic & Name </a></li> I need the URL, and the description

– Cmag
May 25 '11 at 21:37

Use html.xpath('//a') instead and then (off the top of my head) .attr['href'] for the url and .text for the contents.

– Fred Foo
May 25 '11 at 21:53

where can i read more? how do you know the fields for xpath? reading w3schools.com/xpath/xpath_syntax.asp

– Cmag
May 25 '11 at 21:58

this works: //a/text()

– Cmag
May 25 '11 at 22:11

thank you so much! i'll test

– Cmag
May 25 '11 at 21:29

OK, then how can I get the 2 variables back from a string such as: <li><a href="domain.org/blah/">Economic & Name </a></li> I need the URL, and the description

– Cmag
May 25 '11 at 21:37

Use html.xpath('//a') instead and then (off the top of my head) .attr['href'] for the url and .text for the contents.

– Fred Foo
May 25 '11 at 21:53

where can i read more? how do you know the fields for xpath? reading w3schools.com/xpath/xpath_syntax.asp

– Cmag
May 25 '11 at 21:58

this works: //a/text()

– Cmag
May 25 '11 at 22:11

|
show 1 more comment

With iterlinks, lxml provides an excellent function for this task.

This yields (element, attribute, link, pos) for every link [...] in an action, archive, background, cite, classid, codebase, data, href, longdesc, profile, src, usemap, dynsrc, or lowsrc attribute.

edited Feb 13 '13 at 22:13

Bengt

9,70743757

answered May 28 '11 at 7:55

Gregory Petukhov

63039

add a comment |

With iterlinks, lxml provides an excellent function for this task.

This yields (element, attribute, link, pos) for every link [...] in an action, archive, background, cite, classid, codebase, data, href, longdesc, profile, src, usemap, dynsrc, or lowsrc attribute.

edited Feb 13 '13 at 22:13

Bengt

9,70743757

answered May 28 '11 at 7:55

Gregory Petukhov

63039

add a comment |

With iterlinks, lxml provides an excellent function for this task.

This yields (element, attribute, link, pos) for every link [...] in an action, archive, background, cite, classid, codebase, data, href, longdesc, profile, src, usemap, dynsrc, or lowsrc attribute.

edited Feb 13 '13 at 22:13

Bengt

9,70743757

answered May 28 '11 at 7:55

Gregory Petukhov

63039

With iterlinks, lxml provides an excellent function for this task.

This yields (element, attribute, link, pos) for every link [...] in an action, archive, background, cite, classid, codebase, data, href, longdesc, profile, src, usemap, dynsrc, or lowsrc attribute.

edited Feb 13 '13 at 22:13

Bengt

9,70743757

answered May 28 '11 at 7:55

Gregory Petukhov

63039

edited Feb 13 '13 at 22:13

Bengt

9,70743757

edited Feb 13 '13 at 22:13

Bengt

9,70743757

edited Feb 13 '13 at 22:13

Bengt

9,70743757

answered May 28 '11 at 7:55

Gregory Petukhov

63039

answered May 28 '11 at 7:55

Gregory Petukhov

63039

answered May 28 '11 at 7:55

Gregory Petukhov

63039

add a comment |

I want to provide an alternative lxml-based solution.

The solution uses the function provided in lxml.cssselect

    import urllib

    import lxml.html

    from lxml.cssselect import CSSSelector

    connection = urllib.urlopen('http://www.yourTargetURL/')

    dom =  lxml.html.fromstring(connection.read())

    selAnchor = CSSSelector('a')

    foundElements = selAnchor(dom)

    print [e.get('href') for e in foundElements]

edited Dec 24 '11 at 7:52

answered Aug 16 '11 at 7:53

吳強福

328218

add a comment |

I want to provide an alternative lxml-based solution.

The solution uses the function provided in lxml.cssselect

    import urllib

    import lxml.html

    from lxml.cssselect import CSSSelector

    connection = urllib.urlopen('http://www.yourTargetURL/')

    dom =  lxml.html.fromstring(connection.read())

    selAnchor = CSSSelector('a')

    foundElements = selAnchor(dom)

    print [e.get('href') for e in foundElements]

edited Dec 24 '11 at 7:52

answered Aug 16 '11 at 7:53

吳強福

328218

add a comment |

I want to provide an alternative lxml-based solution.

The solution uses the function provided in lxml.cssselect

    import urllib

    import lxml.html

    from lxml.cssselect import CSSSelector

    connection = urllib.urlopen('http://www.yourTargetURL/')

    dom =  lxml.html.fromstring(connection.read())

    selAnchor = CSSSelector('a')

    foundElements = selAnchor(dom)

    print [e.get('href') for e in foundElements]

edited Dec 24 '11 at 7:52

answered Aug 16 '11 at 7:53

吳強福

328218

I want to provide an alternative lxml-based solution.

The solution uses the function provided in lxml.cssselect

    import urllib

    import lxml.html

    from lxml.cssselect import CSSSelector

    connection = urllib.urlopen('http://www.yourTargetURL/')

    dom =  lxml.html.fromstring(connection.read())

    selAnchor = CSSSelector('a')

    foundElements = selAnchor(dom)

    print [e.get('href') for e in foundElements]

edited Dec 24 '11 at 7:52

answered Aug 16 '11 at 7:53

吳強福

328218

edited Dec 24 '11 at 7:52

answered Aug 16 '11 at 7:53

吳強福

328218

answered Aug 16 '11 at 7:53

吳強福

328218

answered Aug 16 '11 at 7:53

吳強福

328218

add a comment |

You can use this method:

from urllib.parse import urljoin, urlparse

from lxml import html as lh

class Crawler:

     def __init__(self, start_url):

         self.start_url = start_url

         self.base_url = f'{urlparse(self.start_url).scheme}://{urlparse(self.start_url).netloc}'

         self.visited_urls = set()



     def fetch_urls(self, html):

         urls = 

         dom = lh.fromstring(html)

         for href in dom.xpath('//a/@href'):

              url = urljoin(self.base_url, href)

              if url not in self.visited_urls and url.startswith(self.base_url):

                   urls.append(url)

         return urls

answered Nov 16 '18 at 7:27

Saeed Gharedaghi

432718

add a comment |

You can use this method:

from urllib.parse import urljoin, urlparse

from lxml import html as lh

class Crawler:

     def __init__(self, start_url):

         self.start_url = start_url

         self.base_url = f'{urlparse(self.start_url).scheme}://{urlparse(self.start_url).netloc}'

         self.visited_urls = set()



     def fetch_urls(self, html):

         urls = 

         dom = lh.fromstring(html)

         for href in dom.xpath('//a/@href'):

              url = urljoin(self.base_url, href)

              if url not in self.visited_urls and url.startswith(self.base_url):

                   urls.append(url)

         return urls

answered Nov 16 '18 at 7:27

Saeed Gharedaghi

432718

add a comment |

You can use this method:

from urllib.parse import urljoin, urlparse

from lxml import html as lh

class Crawler:

     def __init__(self, start_url):

         self.start_url = start_url

         self.base_url = f'{urlparse(self.start_url).scheme}://{urlparse(self.start_url).netloc}'

         self.visited_urls = set()



     def fetch_urls(self, html):

         urls = 

         dom = lh.fromstring(html)

         for href in dom.xpath('//a/@href'):

              url = urljoin(self.base_url, href)

              if url not in self.visited_urls and url.startswith(self.base_url):

                   urls.append(url)

         return urls

answered Nov 16 '18 at 7:27

Saeed Gharedaghi

432718

You can use this method:

from urllib.parse import urljoin, urlparse

from lxml import html as lh

class Crawler:

     def __init__(self, start_url):

         self.start_url = start_url

         self.base_url = f'{urlparse(self.start_url).scheme}://{urlparse(self.start_url).netloc}'

         self.visited_urls = set()



     def fetch_urls(self, html):

         urls = 

         dom = lh.fromstring(html)

         for href in dom.xpath('//a/@href'):

              url = urljoin(self.base_url, href)

              if url not in self.visited_urls and url.startswith(self.base_url):

                   urls.append(url)

         return urls

answered Nov 16 '18 at 7:27

Saeed Gharedaghi

432718

answered Nov 16 '18 at 7:27

Saeed Gharedaghi

432718

answered Nov 16 '18 at 7:27

Saeed Gharedaghi

432718

answered Nov 16 '18 at 7:27

Saeed Gharedaghi

432718

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Vfrdtyky