parsing hierarchical data from XML to pandas?

up vote
0
down vote

favorite

I have inherited data in XML format that has multiple, nested 1:many relationships. The lowest level is coordinate data I wish to plot (as scatter and/or line). I wish to use the higher levels as grouping variables and for formatting. Recognizing the inefficiency, a 'flat' data organization is acceptable for my purposes.

A sample of the XML looks like:

<?xml version="1.0"?>

-<Hilltop>

-<Section SiteName="1">

-<DataSource NumItems="1" Name="RL Section">

<DataType>RLOffsetSection</DataType>

<ItemFormat>0</ItemFormat>

</DataSource>

<SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>

-<Data>

-<E><O>-687.500</O><I1>95.837</I1></E>

-<E><O>-680.100</O><I1>95.773</I1></E>

</Data>

</Section>



-<Section SiteName="1">

-<DataSource NumItems="1" Name="RL Section">

<DataType>RLOffsetSection</DataType>

<ItemFormat>0</ItemFormat>

</DataSource>

<SurveyTime>10-Feb-1993 12:00:00</SurveyTime>

-<Data>

-<E><O>0.000</O><I1>91.138</I1></E>

-<E><O>5.600</O><I1>91.206</I1></E>

-<E><O>11.900</O><I1>91.012</I1></E>

</Data>

</Section>



<Section SiteName="2">

<DataSource Name="RL Section" NumItems="1">

<DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>

</DataSource>

<SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>

<Data>

<E><O>-99.100</O><I1>98.045</I1></E>

<E><O>-89.900</O><I1>98.334</I1></E>

<E><O>-85.900</O><I1>97.327</I1></E>

<E><O>-75.400</O><I1>94.177</I1></E>

<E><O>-53.100</O><I1>95.012</I1></E>

</Data>

</Section>



<Section SiteName="2">

<DataSource Name="RL Section" NumItems="1">

<DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>

</DataSource>

<SurveyTime>10-Feb-1993 12:00:00</SurveyTime>

<Data>

<E><O>0.000</O><I1>95.001</I1></E>

<E><O>4.600</O><I1>94.936</I1></E>

<E><O>14.800</O><I1>94.069</I1></E>

</Data>

</Section>

</Hilltop>

I've adapted some code (borrowed from Gokhan Atil's blog) to parse into a pandas dataframe:

def getvalueofnode(node):

    """ return node text or None """

    return node.text if node is not None else None



def main():

    """ main """

    parsed_xml = et.parse(###my path and file###)

    dfcols = ['Survey', 'XSnum', 'STA', 'Elev', 'CODE']

    df_xml = pd.DataFrame(columns=dfcols)



    for node in parsed_xml.getroot():

        Survey = node.attrib.get('SurveyTime')

        XSnum = node.attrib.get('Section SiteName')

        STA = node.find('O')

        ELEV = node.find('I1')

        CODE = node.find('Code')



        df_xml = df_xml.append(

            pd.Series([getvalueofnode(Survey), getvalueofnode(XSnum),

                       getvalueofnode(STA), getvalueofnode(ELEV), 

                       getvalueofnode(CODE)], index=dfcols),

            ignore_index=True)



    print(df_xml)



main()

It runs, and I get the dataframe structure I'm after, but I only get cell values of 'None', snippet follows...
screen capture of present output df

I suspect I'm calling the wrong method within the FOR loop, but have exhausted my (at this point) limited abilities troubleshooting code.

Using the above XML data as an example, I'm looking for the data like:
screen capture of desired output

Any assistance getting this sorted is greatly appreciated. Cheers!

########## EDIT ##########

Well, I was able to put this off for a while with manually-intensive workaround, but it doesn't scale, so I'm back to working on this.

While I'd still like to end up at a df eventually, I'm now approaching more incrementally working toward a 3D array (surveys, station, elevation).

I've re-grouped and am at least able to extract data with:

import xml.etree.ElementTree as ET

import numpy as np



path = "somepath"

file = "somefile.xml"



tree = ET.parse(path + file)

root = tree.getroot()

So far, so good. I can get a list of each discrete survey and convert to an array:

surveys =

for d in root.findall("./Section/SurveyTime"):

    date = d.text[:-9]

    surveys.append(date)

surv_arr = np.array(surveys)

The next bit extracts the numeric values I want to eventually plot on my x-axis, but it is a running list (then array) of all the values tagged with in the xml source.

station = 

for s in root.findall("./Section/Data/E/O"):

    sta = float(s.text)

    station.append(sta)

sta_arr = np.array(station)

But I need the "sta" values for each survey grouped in their own array (e.g. so "sta_arr" is a 1 x n array where n is the number of surveys). So I try a nested for loop:

station = 



for section in root.findall('./Section'):

    source_items = section.find('DataSource').attrib

    date = section.find('SurveyTime').text[:-9]

    for s in section.findall("./Section/Data/E/O"):

        sta = float(s.text)

        station.append(sta)

It produces n arrays, but they're empty. So tweak the code a bit to iterate over the date variable (surveys):

station = 



for section in root.findall('./Section'):

    source_items = section.find('DataSource').attrib

    date = section.find('SurveyTime').text[:-9]

    for d in date:

        sta = d.find('.Section/Data/E/O')

        station.append(sta)

The good news is this populates arrays arrays of varying lengths for 'station'. Though the length seems to be incrementing by 1. The bad news is all the values are "-1" and the array length for each survey does not equal the number of observations (tagged '') in the xml file.

Any thoughts?

edited Nov 12 at 1:19

asked Jan 22 at 4:47

CreekGeek

add a comment |

up vote
0
down vote

favorite

A sample of the XML looks like:

<?xml version="1.0"?>

-<Hilltop>

-<Section SiteName="1">

-<DataSource NumItems="1" Name="RL Section">

<DataType>RLOffsetSection</DataType>

<ItemFormat>0</ItemFormat>

</DataSource>

<SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>

-<Data>

-<E><O>-687.500</O><I1>95.837</I1></E>

-<E><O>-680.100</O><I1>95.773</I1></E>

</Data>

</Section>



-<Section SiteName="1">

-<DataSource NumItems="1" Name="RL Section">

<DataType>RLOffsetSection</DataType>

<ItemFormat>0</ItemFormat>

</DataSource>

<SurveyTime>10-Feb-1993 12:00:00</SurveyTime>

-<Data>

-<E><O>0.000</O><I1>91.138</I1></E>

-<E><O>5.600</O><I1>91.206</I1></E>

-<E><O>11.900</O><I1>91.012</I1></E>

</Data>

</Section>



<Section SiteName="2">

<DataSource Name="RL Section" NumItems="1">

<DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>

</DataSource>

<SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>

<Data>

<E><O>-99.100</O><I1>98.045</I1></E>

<E><O>-89.900</O><I1>98.334</I1></E>

<E><O>-85.900</O><I1>97.327</I1></E>

<E><O>-75.400</O><I1>94.177</I1></E>

<E><O>-53.100</O><I1>95.012</I1></E>

</Data>

</Section>



<Section SiteName="2">

<DataSource Name="RL Section" NumItems="1">

<DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>

</DataSource>

<SurveyTime>10-Feb-1993 12:00:00</SurveyTime>

<Data>

<E><O>0.000</O><I1>95.001</I1></E>

<E><O>4.600</O><I1>94.936</I1></E>

<E><O>14.800</O><I1>94.069</I1></E>

</Data>

</Section>

</Hilltop>

I've adapted some code (borrowed from Gokhan Atil's blog) to parse into a pandas dataframe:

def getvalueofnode(node):

    """ return node text or None """

    return node.text if node is not None else None



def main():

    """ main """

    parsed_xml = et.parse(###my path and file###)

    dfcols = ['Survey', 'XSnum', 'STA', 'Elev', 'CODE']

    df_xml = pd.DataFrame(columns=dfcols)



    for node in parsed_xml.getroot():

        Survey = node.attrib.get('SurveyTime')

        XSnum = node.attrib.get('Section SiteName')

        STA = node.find('O')

        ELEV = node.find('I1')

        CODE = node.find('Code')



        df_xml = df_xml.append(

            pd.Series([getvalueofnode(Survey), getvalueofnode(XSnum),

                       getvalueofnode(STA), getvalueofnode(ELEV), 

                       getvalueofnode(CODE)], index=dfcols),

            ignore_index=True)



    print(df_xml)



main()

It runs, and I get the dataframe structure I'm after, but I only get cell values of 'None', snippet follows...
screen capture of present output df

I suspect I'm calling the wrong method within the FOR loop, but have exhausted my (at this point) limited abilities troubleshooting code.

Using the above XML data as an example, I'm looking for the data like:
screen capture of desired output

Any assistance getting this sorted is greatly appreciated. Cheers!

########## EDIT ##########

Well, I was able to put this off for a while with manually-intensive workaround, but it doesn't scale, so I'm back to working on this.

While I'd still like to end up at a df eventually, I'm now approaching more incrementally working toward a 3D array (surveys, station, elevation).

I've re-grouped and am at least able to extract data with:

import xml.etree.ElementTree as ET

import numpy as np



path = "somepath"

file = "somefile.xml"



tree = ET.parse(path + file)

root = tree.getroot()

So far, so good. I can get a list of each discrete survey and convert to an array:

surveys =

for d in root.findall("./Section/SurveyTime"):

    date = d.text[:-9]

    surveys.append(date)

surv_arr = np.array(surveys)

The next bit extracts the numeric values I want to eventually plot on my x-axis, but it is a running list (then array) of all the values tagged with in the xml source.

station = 

for s in root.findall("./Section/Data/E/O"):

    sta = float(s.text)

    station.append(sta)

sta_arr = np.array(station)

But I need the "sta" values for each survey grouped in their own array (e.g. so "sta_arr" is a 1 x n array where n is the number of surveys). So I try a nested for loop:

station = 



for section in root.findall('./Section'):

    source_items = section.find('DataSource').attrib

    date = section.find('SurveyTime').text[:-9]

    for s in section.findall("./Section/Data/E/O"):

        sta = float(s.text)

        station.append(sta)

It produces n arrays, but they're empty. So tweak the code a bit to iterate over the date variable (surveys):

station = 



for section in root.findall('./Section'):

    source_items = section.find('DataSource').attrib

    date = section.find('SurveyTime').text[:-9]

    for d in date:

        sta = d.find('.Section/Data/E/O')

        station.append(sta)

Any thoughts?

edited Nov 12 at 1:19

asked Jan 22 at 4:47

CreekGeek

add a comment |

up vote
0
down vote

favorite

A sample of the XML looks like:

<?xml version="1.0"?>

-<Hilltop>

-<Section SiteName="1">

-<DataSource NumItems="1" Name="RL Section">

<DataType>RLOffsetSection</DataType>

<ItemFormat>0</ItemFormat>

</DataSource>

<SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>

-<Data>

-<E><O>-687.500</O><I1>95.837</I1></E>

-<E><O>-680.100</O><I1>95.773</I1></E>

</Data>

</Section>



-<Section SiteName="1">

-<DataSource NumItems="1" Name="RL Section">

<DataType>RLOffsetSection</DataType>

<ItemFormat>0</ItemFormat>

</DataSource>

<SurveyTime>10-Feb-1993 12:00:00</SurveyTime>

-<Data>

-<E><O>0.000</O><I1>91.138</I1></E>

-<E><O>5.600</O><I1>91.206</I1></E>

-<E><O>11.900</O><I1>91.012</I1></E>

</Data>

</Section>



<Section SiteName="2">

<DataSource Name="RL Section" NumItems="1">

<DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>

</DataSource>

<SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>

<Data>

<E><O>-99.100</O><I1>98.045</I1></E>

<E><O>-89.900</O><I1>98.334</I1></E>

<E><O>-85.900</O><I1>97.327</I1></E>

<E><O>-75.400</O><I1>94.177</I1></E>

<E><O>-53.100</O><I1>95.012</I1></E>

</Data>

</Section>



<Section SiteName="2">

<DataSource Name="RL Section" NumItems="1">

<DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>

</DataSource>

<SurveyTime>10-Feb-1993 12:00:00</SurveyTime>

<Data>

<E><O>0.000</O><I1>95.001</I1></E>

<E><O>4.600</O><I1>94.936</I1></E>

<E><O>14.800</O><I1>94.069</I1></E>

</Data>

</Section>

</Hilltop>

I've adapted some code (borrowed from Gokhan Atil's blog) to parse into a pandas dataframe:

def getvalueofnode(node):

    """ return node text or None """

    return node.text if node is not None else None



def main():

    """ main """

    parsed_xml = et.parse(###my path and file###)

    dfcols = ['Survey', 'XSnum', 'STA', 'Elev', 'CODE']

    df_xml = pd.DataFrame(columns=dfcols)



    for node in parsed_xml.getroot():

        Survey = node.attrib.get('SurveyTime')

        XSnum = node.attrib.get('Section SiteName')

        STA = node.find('O')

        ELEV = node.find('I1')

        CODE = node.find('Code')



        df_xml = df_xml.append(

            pd.Series([getvalueofnode(Survey), getvalueofnode(XSnum),

                       getvalueofnode(STA), getvalueofnode(ELEV), 

                       getvalueofnode(CODE)], index=dfcols),

            ignore_index=True)



    print(df_xml)



main()

It runs, and I get the dataframe structure I'm after, but I only get cell values of 'None', snippet follows...
screen capture of present output df

I suspect I'm calling the wrong method within the FOR loop, but have exhausted my (at this point) limited abilities troubleshooting code.

Using the above XML data as an example, I'm looking for the data like:
screen capture of desired output

Any assistance getting this sorted is greatly appreciated. Cheers!

########## EDIT ##########

Well, I was able to put this off for a while with manually-intensive workaround, but it doesn't scale, so I'm back to working on this.

While I'd still like to end up at a df eventually, I'm now approaching more incrementally working toward a 3D array (surveys, station, elevation).

I've re-grouped and am at least able to extract data with:

import xml.etree.ElementTree as ET

import numpy as np



path = "somepath"

file = "somefile.xml"



tree = ET.parse(path + file)

root = tree.getroot()

So far, so good. I can get a list of each discrete survey and convert to an array:

surveys =

for d in root.findall("./Section/SurveyTime"):

    date = d.text[:-9]

    surveys.append(date)

surv_arr = np.array(surveys)

The next bit extracts the numeric values I want to eventually plot on my x-axis, but it is a running list (then array) of all the values tagged with in the xml source.

station = 

for s in root.findall("./Section/Data/E/O"):

    sta = float(s.text)

    station.append(sta)

sta_arr = np.array(station)

But I need the "sta" values for each survey grouped in their own array (e.g. so "sta_arr" is a 1 x n array where n is the number of surveys). So I try a nested for loop:

station = 



for section in root.findall('./Section'):

    source_items = section.find('DataSource').attrib

    date = section.find('SurveyTime').text[:-9]

    for s in section.findall("./Section/Data/E/O"):

        sta = float(s.text)

        station.append(sta)

It produces n arrays, but they're empty. So tweak the code a bit to iterate over the date variable (surveys):

station = 



for section in root.findall('./Section'):

    source_items = section.find('DataSource').attrib

    date = section.find('SurveyTime').text[:-9]

    for d in date:

        sta = d.find('.Section/Data/E/O')

        station.append(sta)

Any thoughts?

edited Nov 12 at 1:19

asked Jan 22 at 4:47

CreekGeek

A sample of the XML looks like:

<?xml version="1.0"?>

-<Hilltop>

-<Section SiteName="1">

-<DataSource NumItems="1" Name="RL Section">

<DataType>RLOffsetSection</DataType>

<ItemFormat>0</ItemFormat>

</DataSource>

<SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>

-<Data>

-<E><O>-687.500</O><I1>95.837</I1></E>

-<E><O>-680.100</O><I1>95.773</I1></E>

</Data>

</Section>



-<Section SiteName="1">

-<DataSource NumItems="1" Name="RL Section">

<DataType>RLOffsetSection</DataType>

<ItemFormat>0</ItemFormat>

</DataSource>

<SurveyTime>10-Feb-1993 12:00:00</SurveyTime>

-<Data>

-<E><O>0.000</O><I1>91.138</I1></E>

-<E><O>5.600</O><I1>91.206</I1></E>

-<E><O>11.900</O><I1>91.012</I1></E>

</Data>

</Section>



<Section SiteName="2">

<DataSource Name="RL Section" NumItems="1">

<DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>

</DataSource>

<SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>

<Data>

<E><O>-99.100</O><I1>98.045</I1></E>

<E><O>-89.900</O><I1>98.334</I1></E>

<E><O>-85.900</O><I1>97.327</I1></E>

<E><O>-75.400</O><I1>94.177</I1></E>

<E><O>-53.100</O><I1>95.012</I1></E>

</Data>

</Section>



<Section SiteName="2">

<DataSource Name="RL Section" NumItems="1">

<DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>

</DataSource>

<SurveyTime>10-Feb-1993 12:00:00</SurveyTime>

<Data>

<E><O>0.000</O><I1>95.001</I1></E>

<E><O>4.600</O><I1>94.936</I1></E>

<E><O>14.800</O><I1>94.069</I1></E>

</Data>

</Section>

</Hilltop>

I've adapted some code (borrowed from Gokhan Atil's blog) to parse into a pandas dataframe:

def getvalueofnode(node):

    """ return node text or None """

    return node.text if node is not None else None



def main():

    """ main """

    parsed_xml = et.parse(###my path and file###)

    dfcols = ['Survey', 'XSnum', 'STA', 'Elev', 'CODE']

    df_xml = pd.DataFrame(columns=dfcols)



    for node in parsed_xml.getroot():

        Survey = node.attrib.get('SurveyTime')

        XSnum = node.attrib.get('Section SiteName')

        STA = node.find('O')

        ELEV = node.find('I1')

        CODE = node.find('Code')



        df_xml = df_xml.append(

            pd.Series([getvalueofnode(Survey), getvalueofnode(XSnum),

                       getvalueofnode(STA), getvalueofnode(ELEV), 

                       getvalueofnode(CODE)], index=dfcols),

            ignore_index=True)



    print(df_xml)



main()

It runs, and I get the dataframe structure I'm after, but I only get cell values of 'None', snippet follows...
screen capture of present output df

I suspect I'm calling the wrong method within the FOR loop, but have exhausted my (at this point) limited abilities troubleshooting code.

Using the above XML data as an example, I'm looking for the data like:
screen capture of desired output

Any assistance getting this sorted is greatly appreciated. Cheers!

########## EDIT ##########

Well, I was able to put this off for a while with manually-intensive workaround, but it doesn't scale, so I'm back to working on this.

While I'd still like to end up at a df eventually, I'm now approaching more incrementally working toward a 3D array (surveys, station, elevation).

I've re-grouped and am at least able to extract data with:

import xml.etree.ElementTree as ET

import numpy as np



path = "somepath"

file = "somefile.xml"



tree = ET.parse(path + file)

root = tree.getroot()

So far, so good. I can get a list of each discrete survey and convert to an array:

surveys =

for d in root.findall("./Section/SurveyTime"):

    date = d.text[:-9]

    surveys.append(date)

surv_arr = np.array(surveys)

The next bit extracts the numeric values I want to eventually plot on my x-axis, but it is a running list (then array) of all the values tagged with in the xml source.

station = 

for s in root.findall("./Section/Data/E/O"):

    sta = float(s.text)

    station.append(sta)

sta_arr = np.array(station)

But I need the "sta" values for each survey grouped in their own array (e.g. so "sta_arr" is a 1 x n array where n is the number of surveys). So I try a nested for loop:

station = 



for section in root.findall('./Section'):

    source_items = section.find('DataSource').attrib

    date = section.find('SurveyTime').text[:-9]

    for s in section.findall("./Section/Data/E/O"):

        sta = float(s.text)

        station.append(sta)

It produces n arrays, but they're empty. So tweak the code a bit to iterate over the date variable (surveys):

station = 



for section in root.findall('./Section'):

    source_items = section.find('DataSource').attrib

    date = section.find('SurveyTime').text[:-9]

    for d in date:

        sta = d.find('.Section/Data/E/O')

        station.append(sta)

Any thoughts?

python xml pandas

edited Nov 12 at 1:19

asked Jan 22 at 4:47

CreekGeek

edited Nov 12 at 1:19

asked Jan 22 at 4:47

CreekGeek

edited Nov 12 at 1:19

asked Jan 22 at 4:47

CreekGeek

asked Jan 22 at 4:47

CreekGeek

asked Jan 22 at 4:47

CreekGeek

add a comment |

1 Answer
1

active

oldest

votes

up vote
0
down vote

xmltodict was what I needed. I was able to sort through indexing in a ordered dictionary approach far faster than whatever hang-ups I was having with etree and minidom.

import xmltodict

import pandas as pd

xs, survey, station, elevation  = , , , 

with open(path + file) as extract:

    orddict = xmltodict.parse(extract.read())

num_surveys = len(orddict['Hilltop']['Section'])

for i in range(0,num_surveys):

    XS_id = orddict['Hilltop']['Section'][i]['@SiteName']

    surv = orddict['Hilltop']['Section'][i]['SurveyTime'][:-9]

    sta_temp = 

    elev_temp = 

    data = orddict['Hilltop']['Section'][i]['Data']['E']

    for d in data:

        sta = float(d['O'])

        elev = float(d['I1'])

        sta_temp.append(sta)

        elev_temp.append(elev)

    xs.append(XS_id)

    survey.append(surv)

    station.append(sta_temp)

    elevation.append(elev_temp)

df = pd.DataFrame.from_dict({'xs':xs, 'survey':survey, 'station':station, 'elevation':elevation})

Note: The DataFrame as structured stores the coordinates as lists and not in a true 'flat' structure as in my original question/request. My solution here delivers the plotting and analytical performance I need while saving one grouping level, so is actually preferable (to me).

Enjoy!!!

edited Nov 12 at 1:39

answered Nov 12 at 1:32

CreekGeek

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f48374823%2fparsing-hierarchical-data-from-xml-to-pandas%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
0
down vote

xmltodict was what I needed. I was able to sort through indexing in a ordered dictionary approach far faster than whatever hang-ups I was having with etree and minidom.

import xmltodict

import pandas as pd

xs, survey, station, elevation  = , , , 

with open(path + file) as extract:

    orddict = xmltodict.parse(extract.read())

num_surveys = len(orddict['Hilltop']['Section'])

for i in range(0,num_surveys):

    XS_id = orddict['Hilltop']['Section'][i]['@SiteName']

    surv = orddict['Hilltop']['Section'][i]['SurveyTime'][:-9]

    sta_temp = 

    elev_temp = 

    data = orddict['Hilltop']['Section'][i]['Data']['E']

    for d in data:

        sta = float(d['O'])

        elev = float(d['I1'])

        sta_temp.append(sta)

        elev_temp.append(elev)

    xs.append(XS_id)

    survey.append(surv)

    station.append(sta_temp)

    elevation.append(elev_temp)

df = pd.DataFrame.from_dict({'xs':xs, 'survey':survey, 'station':station, 'elevation':elevation})

Enjoy!!!

edited Nov 12 at 1:39

answered Nov 12 at 1:32

CreekGeek

add a comment |

up vote
0
down vote

xmltodict was what I needed. I was able to sort through indexing in a ordered dictionary approach far faster than whatever hang-ups I was having with etree and minidom.

import xmltodict

import pandas as pd

xs, survey, station, elevation  = , , , 

with open(path + file) as extract:

    orddict = xmltodict.parse(extract.read())

num_surveys = len(orddict['Hilltop']['Section'])

for i in range(0,num_surveys):

    XS_id = orddict['Hilltop']['Section'][i]['@SiteName']

    surv = orddict['Hilltop']['Section'][i]['SurveyTime'][:-9]

    sta_temp = 

    elev_temp = 

    data = orddict['Hilltop']['Section'][i]['Data']['E']

    for d in data:

        sta = float(d['O'])

        elev = float(d['I1'])

        sta_temp.append(sta)

        elev_temp.append(elev)

    xs.append(XS_id)

    survey.append(surv)

    station.append(sta_temp)

    elevation.append(elev_temp)

df = pd.DataFrame.from_dict({'xs':xs, 'survey':survey, 'station':station, 'elevation':elevation})

Enjoy!!!

edited Nov 12 at 1:39

answered Nov 12 at 1:32

CreekGeek

add a comment |

up vote
0
down vote

xmltodict was what I needed. I was able to sort through indexing in a ordered dictionary approach far faster than whatever hang-ups I was having with etree and minidom.

import xmltodict

import pandas as pd

xs, survey, station, elevation  = , , , 

with open(path + file) as extract:

    orddict = xmltodict.parse(extract.read())

num_surveys = len(orddict['Hilltop']['Section'])

for i in range(0,num_surveys):

    XS_id = orddict['Hilltop']['Section'][i]['@SiteName']

    surv = orddict['Hilltop']['Section'][i]['SurveyTime'][:-9]

    sta_temp = 

    elev_temp = 

    data = orddict['Hilltop']['Section'][i]['Data']['E']

    for d in data:

        sta = float(d['O'])

        elev = float(d['I1'])

        sta_temp.append(sta)

        elev_temp.append(elev)

    xs.append(XS_id)

    survey.append(surv)

    station.append(sta_temp)

    elevation.append(elev_temp)

df = pd.DataFrame.from_dict({'xs':xs, 'survey':survey, 'station':station, 'elevation':elevation})

Enjoy!!!

edited Nov 12 at 1:39

answered Nov 12 at 1:32

CreekGeek

xmltodict was what I needed. I was able to sort through indexing in a ordered dictionary approach far faster than whatever hang-ups I was having with etree and minidom.

import xmltodict

import pandas as pd

xs, survey, station, elevation  = , , , 

with open(path + file) as extract:

    orddict = xmltodict.parse(extract.read())

num_surveys = len(orddict['Hilltop']['Section'])

for i in range(0,num_surveys):

    XS_id = orddict['Hilltop']['Section'][i]['@SiteName']

    surv = orddict['Hilltop']['Section'][i]['SurveyTime'][:-9]

    sta_temp = 

    elev_temp = 

    data = orddict['Hilltop']['Section'][i]['Data']['E']

    for d in data:

        sta = float(d['O'])

        elev = float(d['I1'])

        sta_temp.append(sta)

        elev_temp.append(elev)

    xs.append(XS_id)

    survey.append(surv)

    station.append(sta_temp)

    elevation.append(elev_temp)

df = pd.DataFrame.from_dict({'xs':xs, 'survey':survey, 'station':station, 'elevation':elevation})

Enjoy!!!

edited Nov 12 at 1:39

answered Nov 12 at 1:32

CreekGeek

edited Nov 12 at 1:39

answered Nov 12 at 1:32

CreekGeek

answered Nov 12 at 1:32

CreekGeek

answered Nov 12 at 1:32

CreekGeek

add a comment |

draft saved

draft discarded

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Vfrdtyky