Extracting specific page links from an <a href> tag using BeautifulSoup
I am using BeautifulSoup to extract all the links from this page: http://kern.humdrum.org/search?s=t&keyword=Haydn
I am getting all these links this way:
# -*- coding: utf-8 -*-
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'http://kern.humdrum.org/search?s=t&keyword=Haydn'

# open the connection and grab the page
uClient = uReq(my_url)
# put all the content in a variable
page_html = uClient.read()
# close the connection
uClient.close()

# parse the HTML
page_soup = soup(page_html, "html.parser")

# grab all of the links
containers = page_soup.findAll('a', href=True)
#print(type(containers))

for container in containers:
    link = container
    #start_index = link.index('href="')
    print(link)
    print("---")
    #print(start_index)
Part of my output is: [screenshot of the printed anchor tags omitted]
Notice that it is returning several links, but I really want only the ones followed by some text, for example ">Allegro" and ">Allegro vivace" and so forth.
I am having a hard time getting the following type of output (example from the image):
"Allegro - http://kern.ccarh.org/cgi-bin/ksdata?location=users/craig/classical/beethoven/piano/sonata&file=sonata01-1.krn&format=info"
In other words, at this point I have a bunch of anchor tags (roughly 1000). A lot of them are just "trash", and roughly 350 are tags I would like to extract. All these tags look almost the same; the only difference is that the tags I need end with "> Somebody's name</a>". I would like to extract only the links of the anchor tags with this characteristic.
python beautifulsoup
edited Nov 11 at 4:38
asked Nov 11 at 4:08
Fabio Soares
759
Do you need to use BeautifulSoup? If you use an HTML parser that allows XPath expressions, this can be much easier. See here
– bunji
Nov 11 at 4:20
@bunji, I don't need to use it. I just saw that online most people use BeautifulSoup, which is why I followed that approach. I will check alternative ways, thanks.
– Fabio Soares
Nov 11 at 4:25
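A minimal sketch of the XPath approach bunji suggests, assuming lxml and requests are installed (the contains() filter on the href is an assumption about how to express the match, not from the comment):
# Sketch of the XPath alternative suggested in the comments (assumes lxml + requests).
import requests
from lxml import html

res = requests.get("http://kern.humdrum.org/search?s=t&keyword=Haydn")
tree = html.fromstring(res.content)

# anchors whose href contains "format=info"; print "text - url" pairs
for a in tree.xpath('//a[contains(@href, "format=info")]'):
    print(a.text_content().strip(), "-", a.get("href"))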
4 Answers
Accepted answer (score 3) – QHarr, answered Nov 11 at 5:16, edited Nov 11 at 13:41
From what I can see in the image, the ones with info have an href attribute containing format=info, so you could use an attribute=value CSS selector, [href*="format=info"], where the * indicates "contains": the attribute value contains the substring after the equals sign.
import bs4, requests

res = requests.get("http://kern.humdrum.org/search?s=t&keyword=Haydn")
soup = bs4.BeautifulSoup(res.text, "html.parser")

# select anchors whose href contains "format=info"
for link in soup.select('[href*="format=info"]'):
    print(link.getText(), link['href'])
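To get the exact "Title - URL" shape the question asks for, the answer's loop could be reshaped like this (a sketch; the f-string assumes Python 3.6+):
# Format each match as "Text - URL", as requested in the question.
for link in soup.select('[href*="format=info"]'):
    print(f"{link.getText(strip=True)} - {link['href']}")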
Answer (score 1) – Ali Kargar, answered Nov 11 at 5:22
The best and easiest way is to use the text attribute when printing the link, like this (in Python 3, matching the question's imports):
print(link.text)
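Note that .text alone drops the URL; to print both, one could pair it with the href (a sketch, reusing the containers list from the question):
# Print anchor text together with its href.
for link in containers:
    print(link.text, "-", link.get("href"))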
Answer (score 0) – eicksl, answered Nov 11 at 4:21, edited Nov 11 at 4:52
Assuming you already have a list of the substrings you need to search for, you can do something like:
for link in containers:
    text = link.get_text().lower()
    if any(text.endswith(substr) for substr in substring_list):
        print(link)
        print('---')
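For instance, substring_list might be defined as follows (a hypothetical list of tempo markings, purely illustrative; the real list is not given in the answer):
# Hypothetical substrings to match; lowercase to mirror the .lower() above.
substring_list = ["allegro", "presto", "andante", "allegro vivace"]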
This would not do what I need. The whole point of me scraping the page is to try to automatically grab all the links. If I filter for "allegro", this would give only one link.
– Fabio Soares
Nov 11 at 4:26
Not sure what you're trying to do then. That loop printed a whole bunch of anchor tags for me. Could you provide more context about your problem?
– eicksl
Nov 11 at 4:30
Sure. At this point, I have a bunch of anchor tags (roughly 1000). A lot of them are just "trash", and roughly 350 are tags I would like to extract. All these tags look almost the same; the only difference is that the tags I need end with "> Somebody's name</a>". I would like to extract only the links of the anchor tags with this characteristic.
– Fabio Soares
Nov 11 at 4:37
So in other words, you need to find all tags whose text ends with a certain substring? Or are there multiple different substrings you need to search for?
– eicksl
Nov 11 at 4:40
There are multiple substrings that I need to search for. For example, it could be "Allegro con brio" or "Presto" or... The only way to identify these tags is that at the end of the tag I would have something like "> Somebody's name</a>" (i.e. "> Presto</a>").
– Fabio Soares
Nov 11 at 4:45
Answer (score 0) – ewwink, answered Nov 11 at 10:12
You want to extract links with specified anchor text? For example:
for container in containers:
    link = container
    # exact match:
    #if 'Allegro di molto' == link.text:
    if 'Allegro' in link.text:  # contains
        print(link)
        print("---")