Using Python beautifulsoup to select everything except a specific tag [duplicate]


This question already has an answer here:

Exclude unwanted tag on Beautifulsoup Python

2 answers

I've got over 1000 HTML files with different formatting, elements and contents. I need to go through each one recursively and select all elements except the <h1> element.

Here is a sample file (note that this is the smallest and simplest of the files; the rest are substantially larger and more complex, with many different elements that do not conform to any single template, other than beginning with an <h1> element):

<h1>CXR Introduction</h1>

<h2>Basic Principles</h2>



<ul>

<li>Note differences in density.</li>

<li>Identify the site of the pathology by noting silhouettes.</li>

<li>If you can’t see lung vessels, then the pathology must be within the lung.</li>

<li>Loss of the ability to see lung vessels is supplanted by the ability to see air-bronchograms.</li>

</ul>



<p><a href="./A-CXR-TERMINOLOGY-2301158c-efe4-456e-9e0b-5747c5f3e1ce.md">A. CXR-TERMINOLOGY</a></p>

<p><a href="./B-SOME-RADIOLOGICAL-PATHOLOGY-2610a46c-44ca-4f81-a496-9ea3b911cb4e.md">B. SOME RADIOLOGICAL PATHOLOGY</a></p>

<p><a href="./C-Approach-to-common-clinical-scenarios-0e8f5c90-b14b-48d4-8484-0b0f8ca4464c.md">C. Approach to common clinical scenarios</a></p>

I wrote this code using beautifulsoup:

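from bs4 import BeautifulSoup

with open("file.htm") as ip:
    # HTML parsing done using the "html.parser".
    soup = BeautifulSoup(ip, "html.parser")
    # Attempt to select everything after the <h1> (this is the part that does not work).
    selection = soup.select("h1 > ")

print(selection)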

I was hoping this would select everything below the <h1> element, but it does not. Using soup.select("h1") selects only the <h1> line itself, not everything below it. What should I do?

edited Nov 17 '18 at 7:27

asked Nov 17 '18 at 6:47

Code Monkey

3201211

marked as duplicate by l'L'l, jpp python
Users with the python badge can single-handedly close python questions as duplicates and reopen them as needed.

Nov 17 '18 at 19:23

This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.

python beautifulsoup


2 Answers

Use .extract() to remove the selected tag:

from bs4 import BeautifulSoup

output = None
with open("file.htm") as ip:
    # HTML parsing done using the "html.parser".
    soup = BeautifulSoup(ip, "html.parser")
    # extract() removes the first <h1> from the tree and returns it.
    soup.h1.extract()
    output = soup

print(output)
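For the 1000-file case in the question, the same idea could be wrapped in a small helper and applied to every file under a directory. This is only a sketch: the names strip_h1 and strip_h1_in_tree, the "*.htm*" pattern and the UTF-8 encoding are assumptions for illustration, not part of the original answer.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def strip_h1(html: str) -> str:
    """Return the markup with the first <h1> removed (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1")
    if h1 is not None:  # leave files without an <h1> untouched
        h1.extract()
    return str(soup)


def strip_h1_in_tree(root: str) -> dict:
    """Map each .htm/.html path under root to its markup without the <h1>."""
    results = {}
    # rglob walks the directory tree recursively; UTF-8 is an assumption here.
    for path in Path(root).rglob("*.htm*"):
        results[path] = strip_h1(path.read_text(encoding="utf-8"))
    return results


sample = (
    "<h1>CXR Introduction</h1>"
    "<h2>Basic Principles</h2>"
    "<ul><li>Note differences in density.</li></ul>"
)
print(strip_h1(sample))
```

The guard on None matters only if some files turn out not to contain an <h1>; the question says every file starts with one, so it is there purely as a safety net.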

answered Nov 17 '18 at 7:23

ewwink

12.3k22441


Have you considered removing the <h1>...</h1> element using .decompose() and then just getting all the rest?
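A minimal sketch of that suggestion, assuming a shortened version of the sample markup from the question. decompose() removes the tag from the tree and destroys it, whereas extract() removes it and returns it, so decompose() fits when the <h1> is not needed afterwards.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample file in the question.
html = (
    "<h1>CXR Introduction</h1>"
    "<h2>Basic Principles</h2>"
    "<p>Note differences in density.</p>"
)
soup = BeautifulSoup(html, "html.parser")

# Remove the first <h1> and discard it.
soup.h1.decompose()

# Everything still in the tree is "all the rest".
remaining = [tag.name for tag in soup.find_all(True)]
print(remaining)
print(soup)
```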

answered Nov 17 '18 at 7:22

Khalt

312
