Python lxml/beautiful soup to find all links on a web page
I am writing a script to read a web page, and build a database of links that matches a certain criteria. Right now I am stuck with lxml and understanding how to grab all the <a href>'s from the html...
result = self._openurl(self.mainurl)
content = result.read()
html = lxml.html.fromstring(content)
print lxml.html.find_rel_links(html,'href')
python lxml
add a comment |
I am writing a script to read a web page, and build a database of links that matches a certain criteria. Right now I am stuck with lxml and understanding how to grab all the <a href>'s from the html...
result = self._openurl(self.mainurl)
content = result.read()
html = lxml.html.fromstring(content)
print lxml.html.find_rel_links(html,'href')
python lxml
1
this has been asked dozens of times and has good answers, e.g.: stackoverflow.com/questions/1080411/…
– Benjamin Wohlwend
May 25 '11 at 21:29
add a comment |
I am writing a script to read a web page, and build a database of links that matches a certain criteria. Right now I am stuck with lxml and understanding how to grab all the <a href>'s from the html...
result = self._openurl(self.mainurl)
content = result.read()
html = lxml.html.fromstring(content)
print lxml.html.find_rel_links(html,'href')
python lxml
I am writing a script to read a web page, and build a database of links that matches a certain criteria. Right now I am stuck with lxml and understanding how to grab all the <a href>'s from the html...
result = self._openurl(self.mainurl)
content = result.read()
html = lxml.html.fromstring(content)
print lxml.html.find_rel_links(html,'href')
python lxml
python lxml
edited Dec 1 '16 at 23:50
halfer
14.7k759116
14.7k759116
asked May 25 '11 at 21:20
CmagCmag
5,0911662111
5,0911662111
1
this has been asked dozens of times and has good answers, e.g.: stackoverflow.com/questions/1080411/…
– Benjamin Wohlwend
May 25 '11 at 21:29
add a comment |
1
this has been asked dozens of times and has good answers, e.g.: stackoverflow.com/questions/1080411/…
– Benjamin Wohlwend
May 25 '11 at 21:29
1
1
this has been asked dozens of times and has good answers, e.g.: stackoverflow.com/questions/1080411/…
– Benjamin Wohlwend
May 25 '11 at 21:29
this has been asked dozens of times and has good answers, e.g.: stackoverflow.com/questions/1080411/…
– Benjamin Wohlwend
May 25 '11 at 21:29
add a comment |
4 Answers
4
active
oldest
votes
Use XPath. Something like (can't test from here):
urls = html.xpath('//a/@href')
thank you so much! i'll test
– Cmag
May 25 '11 at 21:29
OK, then how can I get the 2 variables back from a string such as: <li><a href="domain.org/blah/">Economic & Name </a></li> I need the URL, and the description
– Cmag
May 25 '11 at 21:37
Usehtml.xpath('//a')instead and then (off the top of my head).attr['href']for the url and.textfor the contents.
– Fred Foo
May 25 '11 at 21:53
where can i read more? how do you know the fields for xpath? reading w3schools.com/xpath/xpath_syntax.asp
– Cmag
May 25 '11 at 21:58
this works: //a/text()
– Cmag
May 25 '11 at 22:11
|
show 1 more comment
With iterlinks, lxml provides an excellent function for this task.
This yields (element, attribute, link, pos) for every link [...] in an action, archive, background, cite, classid, codebase, data, href, longdesc, profile, src, usemap, dynsrc, or lowsrc attribute.
add a comment |
I want to provide an alternative lxml-based solution.
The solution uses the function provided in lxml.cssselect
import urllib
import lxml.html
from lxml.cssselect import CSSSelector
connection = urllib.urlopen('http://www.yourTargetURL/')
dom = lxml.html.fromstring(connection.read())
selAnchor = CSSSelector('a')
foundElements = selAnchor(dom)
print [e.get('href') for e in foundElements]
add a comment |
You can use this method:
from urllib.parse import urljoin, urlparse
from lxml import html as lh
class Crawler:
def __init__(self, start_url):
self.start_url = start_url
self.base_url = f'{urlparse(self.start_url).scheme}://{urlparse(self.start_url).netloc}'
self.visited_urls = set()
def fetch_urls(self, html):
urls =
dom = lh.fromstring(html)
for href in dom.xpath('//a/@href'):
url = urljoin(self.base_url, href)
if url not in self.visited_urls and url.startswith(self.base_url):
urls.append(url)
return urls
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f6131089%2fpython-lxml-beautiful-soup-to-find-all-links-on-a-web-page%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
4 Answers
4
active
oldest
votes
4 Answers
4
active
oldest
votes
active
oldest
votes
active
oldest
votes
Use XPath. Something like (can't test from here):
urls = html.xpath('//a/@href')
thank you so much! i'll test
– Cmag
May 25 '11 at 21:29
OK, then how can I get the 2 variables back from a string such as: <li><a href="domain.org/blah/">Economic & Name </a></li> I need the URL, and the description
– Cmag
May 25 '11 at 21:37
Usehtml.xpath('//a')instead and then (off the top of my head).attr['href']for the url and.textfor the contents.
– Fred Foo
May 25 '11 at 21:53
where can i read more? how do you know the fields for xpath? reading w3schools.com/xpath/xpath_syntax.asp
– Cmag
May 25 '11 at 21:58
this works: //a/text()
– Cmag
May 25 '11 at 22:11
|
show 1 more comment
Use XPath. Something like (can't test from here):
urls = html.xpath('//a/@href')
thank you so much! i'll test
– Cmag
May 25 '11 at 21:29
OK, then how can I get the 2 variables back from a string such as: <li><a href="domain.org/blah/">Economic & Name </a></li> I need the URL, and the description
– Cmag
May 25 '11 at 21:37
Usehtml.xpath('//a')instead and then (off the top of my head).attr['href']for the url and.textfor the contents.
– Fred Foo
May 25 '11 at 21:53
where can i read more? how do you know the fields for xpath? reading w3schools.com/xpath/xpath_syntax.asp
– Cmag
May 25 '11 at 21:58
this works: //a/text()
– Cmag
May 25 '11 at 22:11
|
show 1 more comment
Use XPath. Something like (can't test from here):
urls = html.xpath('//a/@href')
Use XPath. Something like (can't test from here):
urls = html.xpath('//a/@href')
answered May 25 '11 at 21:27
Fred FooFred Foo
283k58595727
283k58595727
thank you so much! i'll test
– Cmag
May 25 '11 at 21:29
OK, then how can I get the 2 variables back from a string such as: <li><a href="domain.org/blah/">Economic & Name </a></li> I need the URL, and the description
– Cmag
May 25 '11 at 21:37
Usehtml.xpath('//a')instead and then (off the top of my head).attr['href']for the url and.textfor the contents.
– Fred Foo
May 25 '11 at 21:53
where can i read more? how do you know the fields for xpath? reading w3schools.com/xpath/xpath_syntax.asp
– Cmag
May 25 '11 at 21:58
this works: //a/text()
– Cmag
May 25 '11 at 22:11
|
show 1 more comment
thank you so much! i'll test
– Cmag
May 25 '11 at 21:29
OK, then how can I get the 2 variables back from a string such as: <li><a href="domain.org/blah/">Economic & Name </a></li> I need the URL, and the description
– Cmag
May 25 '11 at 21:37
Usehtml.xpath('//a')instead and then (off the top of my head).attr['href']for the url and.textfor the contents.
– Fred Foo
May 25 '11 at 21:53
where can i read more? how do you know the fields for xpath? reading w3schools.com/xpath/xpath_syntax.asp
– Cmag
May 25 '11 at 21:58
this works: //a/text()
– Cmag
May 25 '11 at 22:11
thank you so much! i'll test
– Cmag
May 25 '11 at 21:29
thank you so much! i'll test
– Cmag
May 25 '11 at 21:29
OK, then how can I get the 2 variables back from a string such as: <li><a href="domain.org/blah/">Economic & Name </a></li> I need the URL, and the description
– Cmag
May 25 '11 at 21:37
OK, then how can I get the 2 variables back from a string such as: <li><a href="domain.org/blah/">Economic & Name </a></li> I need the URL, and the description
– Cmag
May 25 '11 at 21:37
Use
html.xpath('//a') instead and then (off the top of my head) .attr['href'] for the url and .text for the contents.– Fred Foo
May 25 '11 at 21:53
Use
html.xpath('//a') instead and then (off the top of my head) .attr['href'] for the url and .text for the contents.– Fred Foo
May 25 '11 at 21:53
where can i read more? how do you know the fields for xpath? reading w3schools.com/xpath/xpath_syntax.asp
– Cmag
May 25 '11 at 21:58
where can i read more? how do you know the fields for xpath? reading w3schools.com/xpath/xpath_syntax.asp
– Cmag
May 25 '11 at 21:58
this works: //a/text()
– Cmag
May 25 '11 at 22:11
this works: //a/text()
– Cmag
May 25 '11 at 22:11
|
show 1 more comment
With iterlinks, lxml provides an excellent function for this task.
This yields (element, attribute, link, pos) for every link [...] in an action, archive, background, cite, classid, codebase, data, href, longdesc, profile, src, usemap, dynsrc, or lowsrc attribute.
add a comment |
With iterlinks, lxml provides an excellent function for this task.
This yields (element, attribute, link, pos) for every link [...] in an action, archive, background, cite, classid, codebase, data, href, longdesc, profile, src, usemap, dynsrc, or lowsrc attribute.
add a comment |
With iterlinks, lxml provides an excellent function for this task.
This yields (element, attribute, link, pos) for every link [...] in an action, archive, background, cite, classid, codebase, data, href, longdesc, profile, src, usemap, dynsrc, or lowsrc attribute.
With iterlinks, lxml provides an excellent function for this task.
This yields (element, attribute, link, pos) for every link [...] in an action, archive, background, cite, classid, codebase, data, href, longdesc, profile, src, usemap, dynsrc, or lowsrc attribute.
edited Feb 13 '13 at 22:13
Bengt
9,70743757
9,70743757
answered May 28 '11 at 7:55
Gregory PetukhovGregory Petukhov
63039
63039
add a comment |
add a comment |
I want to provide an alternative lxml-based solution.
The solution uses the function provided in lxml.cssselect
import urllib
import lxml.html
from lxml.cssselect import CSSSelector
connection = urllib.urlopen('http://www.yourTargetURL/')
dom = lxml.html.fromstring(connection.read())
selAnchor = CSSSelector('a')
foundElements = selAnchor(dom)
print [e.get('href') for e in foundElements]
add a comment |
I want to provide an alternative lxml-based solution.
The solution uses the function provided in lxml.cssselect
import urllib
import lxml.html
from lxml.cssselect import CSSSelector
connection = urllib.urlopen('http://www.yourTargetURL/')
dom = lxml.html.fromstring(connection.read())
selAnchor = CSSSelector('a')
foundElements = selAnchor(dom)
print [e.get('href') for e in foundElements]
add a comment |
I want to provide an alternative lxml-based solution.
The solution uses the function provided in lxml.cssselect
import urllib
import lxml.html
from lxml.cssselect import CSSSelector
connection = urllib.urlopen('http://www.yourTargetURL/')
dom = lxml.html.fromstring(connection.read())
selAnchor = CSSSelector('a')
foundElements = selAnchor(dom)
print [e.get('href') for e in foundElements]
I want to provide an alternative lxml-based solution.
The solution uses the function provided in lxml.cssselect
import urllib
import lxml.html
from lxml.cssselect import CSSSelector
connection = urllib.urlopen('http://www.yourTargetURL/')
dom = lxml.html.fromstring(connection.read())
selAnchor = CSSSelector('a')
foundElements = selAnchor(dom)
print [e.get('href') for e in foundElements]
edited Dec 24 '11 at 7:52
answered Aug 16 '11 at 7:53
吳強福吳強福
328218
328218
add a comment |
add a comment |
You can use this method:
from urllib.parse import urljoin, urlparse
from lxml import html as lh
class Crawler:
def __init__(self, start_url):
self.start_url = start_url
self.base_url = f'{urlparse(self.start_url).scheme}://{urlparse(self.start_url).netloc}'
self.visited_urls = set()
def fetch_urls(self, html):
urls =
dom = lh.fromstring(html)
for href in dom.xpath('//a/@href'):
url = urljoin(self.base_url, href)
if url not in self.visited_urls and url.startswith(self.base_url):
urls.append(url)
return urls
add a comment |
You can use this method:
from urllib.parse import urljoin, urlparse
from lxml import html as lh
class Crawler:
def __init__(self, start_url):
self.start_url = start_url
self.base_url = f'{urlparse(self.start_url).scheme}://{urlparse(self.start_url).netloc}'
self.visited_urls = set()
def fetch_urls(self, html):
urls =
dom = lh.fromstring(html)
for href in dom.xpath('//a/@href'):
url = urljoin(self.base_url, href)
if url not in self.visited_urls and url.startswith(self.base_url):
urls.append(url)
return urls
add a comment |
You can use this method:
from urllib.parse import urljoin, urlparse
from lxml import html as lh
class Crawler:
def __init__(self, start_url):
self.start_url = start_url
self.base_url = f'{urlparse(self.start_url).scheme}://{urlparse(self.start_url).netloc}'
self.visited_urls = set()
def fetch_urls(self, html):
urls =
dom = lh.fromstring(html)
for href in dom.xpath('//a/@href'):
url = urljoin(self.base_url, href)
if url not in self.visited_urls and url.startswith(self.base_url):
urls.append(url)
return urls
You can use this method:
from urllib.parse import urljoin, urlparse
from lxml import html as lh
class Crawler:
def __init__(self, start_url):
self.start_url = start_url
self.base_url = f'{urlparse(self.start_url).scheme}://{urlparse(self.start_url).netloc}'
self.visited_urls = set()
def fetch_urls(self, html):
urls =
dom = lh.fromstring(html)
for href in dom.xpath('//a/@href'):
url = urljoin(self.base_url, href)
if url not in self.visited_urls and url.startswith(self.base_url):
urls.append(url)
return urls
answered Nov 16 '18 at 7:27
Saeed GharedaghiSaeed Gharedaghi
432718
432718
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f6131089%2fpython-lxml-beautiful-soup-to-find-all-links-on-a-web-page%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
this has been asked dozens of times and has good answers, e.g.: stackoverflow.com/questions/1080411/…
– Benjamin Wohlwend
May 25 '11 at 21:29