Create JSON with XML file using BeautifulSoup

I am using a Jupyter notebook running Python 3. My task is to extract data from an XML file and convert it to JSON (perhaps even saving the JSON to an output.dat file). I am using BeautifulSoup to navigate through the nodes. I have the following data:

<?xml version='1.0' encoding='UTF-8'?>
<Terms>
  <Term>
    <Title>.177 (4.5mm) Airgun</Title>
    <Description>The standard airgun calibre for international target
                 shooting.</Description>
    <RelatedTerms>
      <Term>
        <Title>Shooting sport equipment</Title>
        <Relationship>Narrower Term</Relationship>
      </Term>
    </RelatedTerms>
  </Term>
  <Term>
    <Title>1 Kilometre Time Trial</Title>
    <Description>test2</Description>
    <RelatedTerms>
      <Term>
        <Title>1 Kilometre TT</Title>
        <Relationship>Used For</Relationship>
      </Term>
      <Term>
        <Title>1km Time Trial</Title>
        <Relationship>Used For</Relationship>
      </Term>
      <Term>
        <Title>1km Time Trial</Title>
        <Relationship>Used For</Relationship>
      </Term>
      <Term>
        <Title>1km TT</Title>
        <Relationship>Used For</Relationship>
      </Term>
      <Term>
        <Title>One km Time Trial</Title>
        <Relationship>Used For</Relationship>
      </Term>
    </RelatedTerms>
  </Term>

This is the output that I am expecting in JSON:

{
    "thesaurus": [
        {
            "Description": "The standard airgun calibre for international target shooting.",
            "RelatedTerms": [
                {
                    "Relationship": "Narrower Term",
                    "Title": "Shooting sport equipment"
                }
            ],
            "Title": ".177 (4.5mm) Airgun"
        },
        {
            "Description": "test2",
            "RelatedTerms": [
                {
                    "Relationship": "Used For",
                    "Title": "1 Kilometre TT"
                },
                {
                    "Relationship": "Used For",
                    "Title": "1km Time Trial"
                },
                {
                    "Relationship": "Used For",
                    "Title": "1km Time Trial"
                },
                {
                    "Relationship": "Used For",
                    "Title": "1km TT"
                },
                {
                    "Relationship": "Used For",
                    "Title": "One km Time Trial"
                }
            ],
            "Title": "1 Kilometre Time Trial"
        },

I am navigating through the tags so that I can create dictionaries as seen in the output example. Since I am new to text scraping, this is quite frustrating.

I was able to extract the "Description" tag with the following code:

from bs4 import BeautifulSoup

xml_file = './xml.xml'
btree = BeautifulSoup(open(xml_file, encoding="utf8"), "xml")

# Collect the text of every <Description> element
elements = btree.find_all('Description')
descriptionTag = []
for element in elements:
    descriptionTag.append(element.text)

As with the Description tag above, I want to extract the information stored inside the "RelatedTerms" tag, but I am not sure how to build a list of dictionaries from it.
Ideally, I would parse all the tags into a dataframe and then convert that data to JSON.

Can someone please help me work out how to extract the information from the "RelatedTerms" tag?

asked 16 hours ago

Timetraveller

117114


json xml beautifulsoup


1 Answer

To extract RelatedTerms, first select the top-level Term elements using btree.select('Terms > Term'). You can then loop over them and extract the Term elements inside each RelatedTerms using term.select('RelatedTerms > Term').

import json
from bs4 import BeautifulSoup

xml_file = './xml.xml'
btree = BeautifulSoup(open(xml_file, 'r', encoding="utf8"), "xml")
# Select only the top-level <Term> elements (direct children of <Terms>)
Terms = btree.select('Terms > Term')
jsonObj = {"thesaurus": []}

for term in Terms:
    termDetail = {
        "Description": term.find('Description').text,
        "Title": term.find('Title').text
    }
    # Select the <Term> elements nested directly inside <RelatedTerms>
    RelatedTerms = term.select('RelatedTerms > Term')
    if RelatedTerms:
        termDetail["RelatedTerms"] = []
        for rterm in RelatedTerms:
            termDetail["RelatedTerms"].append({
                "Title": rterm.find('Title').text,
                "Relationship": rterm.find('Relationship').text
            })
    jsonObj["thesaurus"].append(termDetail)

# Python 3: print is a function
print(json.dumps(jsonObj, indent=4))
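As an alternative sketch, the same structure can probably be built with the standard library's xml.etree.ElementTree instead of BeautifulSoup. This is not the approach shown above, only an illustration under a few assumptions: SAMPLE_XML is an abbreviated, well-formed copy of the question's data, and clean, term_to_dict, build_thesaurus and save_json are hypothetical helper names. Unlike the code above, it collapses the line break inside the first Description so the text matches the expected output, and it always emits a RelatedTerms list, which may be empty.

```python
import json
import xml.etree.ElementTree as ET

# Abbreviated copy of the question's data (assumption: the real file is well-formed).
SAMPLE_XML = """<Terms>
 <Term>
    <Title>.177 (4.5mm) Airgun</Title>
    <Description>The standard airgun calibre for international target
                 shooting.</Description>
    <RelatedTerms>
      <Term>
        <Title>Shooting sport equipment</Title>
        <Relationship>Narrower Term</Relationship>
      </Term>
    </RelatedTerms>
 </Term>
 <Term>
    <Title>1 Kilometre Time Trial</Title>
    <Description>test2</Description>
    <RelatedTerms>
      <Term>
        <Title>1 Kilometre TT</Title>
        <Relationship>Used For</Relationship>
      </Term>
      <Term>
        <Title>1km TT</Title>
        <Relationship>Used For</Relationship>
      </Term>
    </RelatedTerms>
 </Term>
</Terms>
"""


def clean(text):
    """Collapse line breaks and repeated spaces into single spaces."""
    return " ".join((text or "").split())


def term_to_dict(term):
    """Convert one top-level <Term> element into a plain dict."""
    related = [
        {
            "Title": clean(rel.findtext("Title")),
            "Relationship": clean(rel.findtext("Relationship")),
        }
        # This path matches only <Term> children of this term's <RelatedTerms>.
        for rel in term.findall("RelatedTerms/Term")
    ]
    return {
        # findtext looks at direct children, so nested titles are not picked up.
        "Title": clean(term.findtext("Title")),
        "Description": clean(term.findtext("Description")),
        "RelatedTerms": related,
    }


def build_thesaurus(xml_text):
    """Parse the XML text and return the dict to be serialised as JSON."""
    root = ET.fromstring(xml_text)  # root is the <Terms> element
    # findall("Term") returns direct children only, so nested terms are skipped here.
    return {"thesaurus": [term_to_dict(term) for term in root.findall("Term")]}


def save_json(data, path):
    """Write the dict to a file as indented JSON."""
    with open(path, "w", encoding="utf8") as handle:
        json.dump(data, handle, indent=4)


thesaurus = build_thesaurus(SAMPLE_XML)
print(json.dumps(thesaurus, indent=4))
```

Calling save_json(thesaurus, "output.dat") would then write the same JSON to the output.dat file mentioned in the question.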

edited 13 hours ago

answered 14 hours ago

ewwink

5,33922231
