Using Python BeautifulSoup to select everything except a specific tag [duplicate]





















This question already has an answer here:




  • Exclude unwanted tag on Beautifulsoup Python

    2 answers




I have over 1000 HTML files with different formatting, elements and contents. I need to go through each of them recursively and select all elements except the <h1> element.



Here is a sample file (note that this is the smallest and simplest of the files; the remainder are substantially larger and more complex, with many different elements that do not conform to any single template, other than beginning with the <h1> element):



<h1>CXR Introduction</h1>
<h2>Basic Principles</h2>

<ul>
<li>Note differences in density.</li>
<li>Identify the site of the pathology by noting silhouettes.</li>
<li>If you can’t see lung vessels, then the pathology must be within the lung.</li>
<li>Loss of the ability to see lung vessels is supplanted by the ability to see air-bronchograms.</li>
</ul>

<p><a href="./A-CXR-TERMINOLOGY-2301158c-efe4-456e-9e0b-5747c5f3e1ce.md">A. CXR-TERMINOLOGY</a></p>
<p><a href="./B-SOME-RADIOLOGICAL-PATHOLOGY-2610a46c-44ca-4f81-a496-9ea3b911cb4e.md">B. SOME RADIOLOGICAL PATHOLOGY</a></p>
<p><a href="./C-Approach-to-common-clinical-scenarios-0e8f5c90-b14b-48d4-8484-0b0f8ca4464c.md">C. Approach to common clinical scenarios</a></p>


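I wrote this code using BeautifulSoup:



from bs4 import BeautifulSoup

with open("file.htm") as ip:
    # HTML parsing done using the "html.parser".
    soup = BeautifulSoup(ip, "html.parser")
    # My attempt to select everything below the <h1>; this does not work.
    selection = soup.select("h1 > ")
    print(selection)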


I was hoping this would select everything below the <h1> element, but it does not. Using soup.select("h1") selects only the <h1> line itself and nothing below it. What should I do?
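For reference, the selector "h1 > " is incomplete: the child combinator > needs something on its right-hand side, and it would in any case look inside the <h1> rather than after it. A minimal sketch of a non-destructive alternative, assuming the wanted elements are siblings of the <h1> as in the sample file, is the general sibling combinator ~. The SAMPLE_HTML string and the variable names are illustrative only; the markup is a shortened copy of the sample above.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample file from the question (assumption: the
# elements of interest sit at the same level as the <h1>).
SAMPLE_HTML = """<h1>CXR Introduction</h1>
<h2>Basic Principles</h2>
<ul>
<li>Note differences in density.</li>
</ul>
<p><a href="./A-CXR-TERMINOLOGY-2301158c-efe4-456e-9e0b-5747c5f3e1ce.md">A. CXR-TERMINOLOGY</a></p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "h1 ~ *" matches every element that comes after an <h1> and shares its parent.
# Nested elements such as <li> and <a> are not matched on their own; they are
# still reachable inside their matched ancestors (<ul>, <p>).
following = soup.select("h1 ~ *")
following_names = [tag.name for tag in following]
print(following_names)  # ['h2', 'ul', 'p']
```

This leaves the parsed tree unchanged, which may matter if the <h1> text is still needed later. If the <h1> is nested deeper than the content that follows it, the sibling selector will miss that content and removing the tag is likely the simpler route.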

























marked as duplicate by l'L'l, jpp python
Nov 17 '18 at 19:23


This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.

































      python beautifulsoup














edited Nov 17 '18 at 7:27
Code Monkey

asked Nov 17 '18 at 6:47
Code Monkey






























2 Answers































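Use .extract() to remove the selected tag from the tree. What remains in soup is everything except the <h1>:



from bs4 import BeautifulSoup

output = None
with open("file.htm") as ip:
    # HTML parsing done using the "html.parser".
    soup = BeautifulSoup(ip, "html.parser")
    # extract() detaches the first <h1> from the tree and returns it.
    soup.h1.extract()
    output = soup

print(output)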
















answered Nov 17 '18 at 7:23
ewwink
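One caveat with the .extract() approach: soup.h1 is None when a file has no <h1>, so calling .extract() on it would raise an AttributeError. A minimal sketch of a guarded version follows, written as a function over an HTML string so it can be tried without a file on disk. The helper name strip_first_h1 and the shortened sample string are hypothetical, not part of the answer above.

```python
from bs4 import BeautifulSoup


def strip_first_h1(html: str) -> str:
    """Return the markup with its first <h1> removed (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1")
    if h1 is not None:  # find() returns None when there is no <h1>
        h1.extract()    # detach the tag; the rest of the tree is untouched
    return str(soup)


# Shortened version of the sample file from the question.
sample = (
    "<h1>CXR Introduction</h1>\n"
    "<h2>Basic Principles</h2>\n"
    "<ul><li>Note differences in density.</li></ul>"
)
cleaned = strip_first_h1(sample)
print(cleaned)
```

Because extract() returns the detached tag, the heading text could also be kept (for example as a title) by storing the return value instead of discarding it.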






































Have you considered removing the <h1>...</h1> element using .decompose() and then just getting all the rest?

















answered Nov 17 '18 at 7:22
Khalt
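The .decompose() suggestion could look roughly like the sketch below, extended to walk a directory tree since the question mentions over 1000 files. Everything here is an assumption for illustration: the helper names body_without_h1 and process_tree, the "*.htm*" glob pattern, and UTF-8 as the file encoding. Unlike .extract(), .decompose() destroys the tag instead of returning it.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def body_without_h1(path: Path) -> str:
    """Parse one file and return its markup minus every <h1> (hypothetical helper)."""
    # Assumption: files are UTF-8 encoded; adjust if yours are not.
    soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
    for h1 in soup.find_all("h1"):
        h1.decompose()  # removes the tag and its contents from the tree
    return str(soup)


def process_tree(root: Path) -> dict:
    """Map each matching file under root (searched recursively) to its cleaned markup."""
    results = {}
    # "*.htm*" matches .htm and .html (and any other suffix starting with .htm).
    for path in sorted(root.rglob("*.htm*")):
        if path.is_file():
            results[path] = body_without_h1(path)
    return results
```

A call such as process_tree(Path("pages")) would then return a dictionary keyed by file path, where "pages" stands in for whatever directory holds the files.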














