Using Python beautifulsoup to select everything except a specific tag [duplicate]
This question already has an answer here:
Exclude unwanted tag on Beautifulsoup Python
2 answers
I've got over 1000 HTML files with different formatting, elements and contents. I need to go through each one recursively and select all elements except the <h1> element.
Here is a sample file (note that this is the smallest and simplest of the files; the remainder are substantially larger and more complex, with many different elements that do not conform to any single template, other than beginning with the <h1> element):
<h1>CXR Introduction</h1>
<h2>Basic Principles</h2>
<ul>
<li>Note differences in density.</li>
<li>Identify the site of the pathology by noting silhouettes.</li>
<li>If you can’t see lung vessels, then the pathology must be within the lung.</li>
<li>Loss of the ability to see lung vessels is supplanted by the ability to see air-bronchograms.</li>
</ul>
<p><a href="./A-CXR-TERMINOLOGY-2301158c-efe4-456e-9e0b-5747c5f3e1ce.md">A. CXR-TERMINOLOGY</a></p>
<p><a href="./B-SOME-RADIOLOGICAL-PATHOLOGY-2610a46c-44ca-4f81-a496-9ea3b911cb4e.md">B. SOME RADIOLOGICAL PATHOLOGY</a></p>
<p><a href="./C-Approach-to-common-clinical-scenarios-0e8f5c90-b14b-48d4-8484-0b0f8ca4464c.md">C. Approach to common clinical scenarios</a></p>
I wrote this code using beautifulsoup:
from bs4 import BeautifulSoup

with open("file.htm") as ip:
    # HTML parsing done using the "html.parser".
    soup = BeautifulSoup(ip, "html.parser")
    selection = soup.select("h1 > ")
    print(selection)
I was hoping that this would select everything below the <h1> element, but it does not. Using soup.select("h1") only selects one line and doesn't select everything below it. What do I do?
python beautifulsoup
marked as duplicate by l'L'l, jpp
Nov 17 '18 at 19:23
This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.
asked Nov 17 '18 at 6:47 by Code Monkey (edited Nov 17 '18 at 7:27)
2 Answers
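Use .extract() to remove the selected tag:
from bs4 import BeautifulSoup

output = None
with open("file.htm") as ip:
    # HTML parsing done using the "html.parser".
    soup = BeautifulSoup(ip, "html.parser")
    # Detach the first <h1> from the tree; everything else stays in soup.
    soup.h1.extract()
    output = soup
print(output)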
answered Nov 17 '18 at 7:23 by ewwink
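Since the question mentions over 1000 files, some of which may not match the sample exactly, the .extract() approach could be wrapped in a small function that tolerates a missing <h1>. This is only a sketch: the helper name strip_first_h1 and the shortened sample string are assumptions for illustration, not part of the answer above.

```python
from bs4 import BeautifulSoup


def strip_first_h1(html: str) -> str:
    """Return the markup with the first <h1> removed (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.h1 is None when the document has no <h1>, so guard before extracting.
    if soup.h1 is not None:
        soup.h1.extract()
    return str(soup)


# Shortened version of the sample file from the question.
sample = (
    "<h1>CXR Introduction</h1>"
    "<h2>Basic Principles</h2>"
    "<ul><li>Note differences in density.</li></ul>"
)
remaining = strip_first_h1(sample)
print(remaining)
```

For the batch case, the same helper could be applied to the text of each path yielded by pathlib.Path(root).rglob("*.htm"), where root is whatever directory holds the files.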
Have you considered removing the <h1>...</h1> element using .decompose() and then just getting all the rest?
answered Nov 17 '18 at 7:22 by Khalt
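The .decompose() suggestion might look something like the sketch below. It assumes an in-memory string trimmed from the question's sample rather than a real file, and it removes every <h1>, not just the first. Unlike .extract(), which returns the detached tag, .decompose() destroys the tag and its contents.

```python
from bs4 import BeautifulSoup

# Trimmed version of the sample file from the question.
html = """<h1>CXR Introduction</h1>
<h2>Basic Principles</h2>
<p><a href="./A-CXR-TERMINOLOGY-2301158c-efe4-456e-9e0b-5747c5f3e1ce.md">A. CXR-TERMINOLOGY</a></p>"""

soup = BeautifulSoup(html, "html.parser")

# Remove every <h1> in place; the tag and its text are destroyed.
for heading in soup.find_all("h1"):
    heading.decompose()

# find_all(True) matches every remaining tag, at any depth.
rest = [tag.name for tag in soup.find_all(True)]
print(rest)
```

After the loop, soup holds only the non-<h1> content, so str(soup) or soup.find_all(True) gives "all the rest".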