Extracting specific page links from a <a href tag using BeautifulSoup











I am using BeautifulSoup to extract all the links from this page: http://kern.humdrum.org/search?s=t&keyword=Haydn



I am getting all these links this way:



# -*- coding: utf-8 -*-

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'http://kern.humdrum.org/search?s=t&keyword=Haydn'

# open the connection and grab the page
uClient = uReq(my_url)

# put all the content in a variable
page_html = uClient.read()

# close the connection
uClient.close()

# parse the HTML
page_soup = soup(page_html, "html.parser")

# grab all of the links
containers = page_soup.findAll('a', href=True)
#print(type(containers))

for container in containers:
    link = container
    #start_index = link.index('href="')
    print(link)
    print("---")
    #print(start_index)


part of my output is:
[screenshot of the printed anchor tags]



Notice that it is returning several links, but I really only want the ones whose anchor text is something like ">Something" (for example, ">Allegro", ">Allegro vivace", and so forth).



I am having a hard time getting the following type of output (as in the example from the image):
"Allegro - http://kern.ccarh.org/cgi-bin/ksdata?location=users/craig/classical/beethoven/piano/sonata&file=sonata01-1.krn&format=info"



In other words, at this point I have a bunch of anchor tags (roughly 1000). A bunch of these are just "trash", and roughly 350 are tags that I would like to extract. All these tags look almost the same; the only difference is that the tags I need end with something like "> Somebody's name</a>". I would like to extract only the links of the anchor tags with this characteristic.










python beautifulsoup

edited Nov 11 at 4:38
asked Nov 11 at 4:08
Fabio Soares

  • Do you need to use BeautifulSoup? If you use an HTML parser that allows XPath expressions, this can be much easier. See here
    – bunji
    Nov 11 at 4:20

  • @bunji, I don't need to use it. I just saw that most people online use BeautifulSoup, which is why I followed that approach. I will check alternative ways, thanks.
    – Fabio Soares
    Nov 11 at 4:25
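
For reference, a minimal sketch of the XPath route bunji suggests, assuming the third-party lxml and requests packages are available; the XPath expression simply keeps anchors whose href contains format=info, the pattern visible in the desired output above:

import lxml.html
import requests

url = "http://kern.humdrum.org/search?s=t&keyword=Haydn"
tree = lxml.html.fromstring(requests.get(url).content)

# keep only anchors whose href contains the query parameter format=info
for a in tree.xpath('//a[contains(@href, "format=info")]'):
    print(a.text_content().strip(), "-", a.get("href"))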















4 Answers






accepted, 3 votes
answered Nov 11 at 5:16 (edited Nov 11 at 13:41) by QHarr

From what I can see in the image, the ones with info have an href attribute containing format=info, so you could use an attribute "contains" CSS selector, [href*="format=info"], where the * indicates "contains": the attribute value must contain the quoted substring.



import bs4, requests

res = requests.get("http://kern.humdrum.org/search?s=t&keyword=Haydn")
soup = bs4.BeautifulSoup(res.text, "html.parser")
for link in soup.select('[href*="format=info"]'):
    print(link.getText(), link['href'])
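
As a follow-up, a minimal variation that prints each match in the "Title - URL" form the question asks for; this assumes the matched hrefs are already absolute URLs, as in the example output above (if any turned out to be relative, urllib.parse.urljoin could resolve them against the page URL):

for link in soup.select('[href*="format=info"]'):
    # anchor text, stripped of surrounding whitespace, joined with the href
    print(f"{link.get_text(strip=True)} - {link['href']}")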





1 vote
answered Nov 11 at 5:22 by Ali Kargar

The best and easiest way is to use the text attribute when printing the link, like this:

print(link.text)
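
Applied to the loop from the question, that might look like the short sketch below; note it prints every anchor found, so some filter (for example on the href, as in the accepted answer) still has to be added:

for container in containers:
    # container is a bs4 Tag: .text is the anchor text, ['href'] is the link target
    print(container.text.strip(), "-", container['href'])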






0 votes
answered Nov 11 at 4:21 (edited Nov 11 at 4:52) by eicksl

      Assuming you already have a list of the substrings you need to search for, you can do something like:



for link in containers:
    text = link.get_text().lower()
    if any(text.endswith(substr) for substr in substring_list):
        print(link)
        print('---')
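
For example, with a small hypothetical substring_list (the entries below are placeholders, not an exhaustive list of the movement names on the page), the filter keeps only the anchors whose lowercased text ends with one of them:

# hypothetical endings to match; lowercase, since the answer lowercases the anchor text
substring_list = ['allegro', 'presto', 'adagio']

matches = [link['href'] for link in containers
           if any(link.get_text().lower().endswith(s) for s in substring_list)]
print(len(matches), "matching links")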





  • This would not do what I need. The whole point of me scraping the page is to try to automatically grab all the links. If I filter for "allegro", this would be only one link.
    – Fabio Soares
    Nov 11 at 4:26

  • Not sure what you're trying to do then. That loop printed a whole bunch of anchor tags for me. Could you provide more context about your problem?
    – eicksl
    Nov 11 at 4:30

  • Sure. At this point, I have a bunch of anchor tags (roughly 1000). A bunch of these are just "trash", and roughly 350 are tags that I would like to extract. All these tags look almost the same; the only difference is that the tags I need end with something like "> Somebody's name</a>". I would like to extract only the links of the anchor tags with this characteristic.
    – Fabio Soares
    Nov 11 at 4:37

  • So in other words, you need to find all tags whose text ends with a certain substring? Or are there multiple different substrings you need to search for?
    – eicksl
    Nov 11 at 4:40

  • There are multiple substrings that I need to search for. For example, it could be "Allegro con brio" or "Presto" or... The only way to identify these tags is that at the end of the tag I would have something like "> Somebody's name</a>" (e.g. "> Presto</a>").
    – Fabio Soares
    Nov 11 at 4:45

0 votes
answered Nov 11 at 10:12 by ewwink

You want to extract links with specified anchor text?

for container in containers:
    link = container
    # exact match:
    # if 'Allegro di molto' == link.text:
    if 'Allegro' in link.text:  # substring ("contains") match
        print(link)
        print("---")
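
A possible extension of the same idea, matching several movement names at once with a regular expression; the names in the pattern are illustrative only, not taken from the page:

import re

# illustrative pattern; extend it with whatever movement names actually occur
pattern = re.compile(r'\b(Allegro|Presto|Adagio|Andante)\b')

for container in containers:
    if container.text and pattern.search(container.text):
        print(container.text.strip(), "-", container['href'])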




