parsing hierarchical data from XML to pandas?











up vote
0
down vote

favorite












I have inherited data in XML format that has multiple, nested 1:many relationships. The lowest level is coordinate data I wish to plot (as scatter and/or line). I wish to use the higher levels as grouping variables and for formatting. Recognizing the inefficiency, a 'flat' data organization is acceptable for my purposes.



A sample of the XML looks like:



<?xml version="1.0"?>
-<Hilltop>
-<Section SiteName="1">
-<DataSource NumItems="1" Name="RL Section">
<DataType>RLOffsetSection</DataType>
<ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>
-<Data>
-<E><O>-687.500</O><I1>95.837</I1></E>
-<E><O>-680.100</O><I1>95.773</I1></E>
</Data>
</Section>

-<Section SiteName="1">
-<DataSource NumItems="1" Name="RL Section">
<DataType>RLOffsetSection</DataType>
<ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime>10-Feb-1993 12:00:00</SurveyTime>
-<Data>
-<E><O>0.000</O><I1>91.138</I1></E>
-<E><O>5.600</O><I1>91.206</I1></E>
-<E><O>11.900</O><I1>91.012</I1></E>
</Data>
</Section>

<Section SiteName="2">
<DataSource Name="RL Section" NumItems="1">
<DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>
<Data>
<E><O>-99.100</O><I1>98.045</I1></E>
<E><O>-89.900</O><I1>98.334</I1></E>
<E><O>-85.900</O><I1>97.327</I1></E>
<E><O>-75.400</O><I1>94.177</I1></E>
<E><O>-53.100</O><I1>95.012</I1></E>
</Data>
</Section>

<Section SiteName="2">
<DataSource Name="RL Section" NumItems="1">
<DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime>10-Feb-1993 12:00:00</SurveyTime>
<Data>
<E><O>0.000</O><I1>95.001</I1></E>
<E><O>4.600</O><I1>94.936</I1></E>
<E><O>14.800</O><I1>94.069</I1></E>
</Data>
</Section>
</Hilltop>


I've adapted some code (borrowed from Gokhan Atil's blog) to parse into a pandas dataframe:



def getvalueofnode(node):
""" return node text or None """
return node.text if node is not None else None

def main():
""" main """
parsed_xml = et.parse(###my path and file###)
dfcols = ['Survey', 'XSnum', 'STA', 'Elev', 'CODE']
df_xml = pd.DataFrame(columns=dfcols)

for node in parsed_xml.getroot():
Survey = node.attrib.get('SurveyTime')
XSnum = node.attrib.get('Section SiteName')
STA = node.find('O')
ELEV = node.find('I1')
CODE = node.find('Code')

df_xml = df_xml.append(
pd.Series([getvalueofnode(Survey), getvalueofnode(XSnum),
getvalueofnode(STA), getvalueofnode(ELEV),
getvalueofnode(CODE)], index=dfcols),
ignore_index=True)

print(df_xml)

main()


It runs, and I get the dataframe structure I'm after, but I only get cell values of 'None', snippet follows...
screen capture of present output df



I suspect I'm calling the wrong method within the FOR loop, but have exhausted my (at this point) limited abilities troubleshooting code.



Using the above XML data as an example, I'm looking for the data like:
screen capture of desired output



Any assistance getting this sorted is greatly appreciated. Cheers!



########## EDIT ##########

Well, I was able to put this off for a while with manually-intensive workaround, but it doesn't scale, so I'm back to working on this.



While I'd still like to end up at a df eventually, I'm now approaching more incrementally working toward a 3D array (surveys, station, elevation).



I've re-grouped and am at least able to extract data with:



import xml.etree.ElementTree as ET
import numpy as np

path = "somepath"
file = "somefile.xml"

tree = ET.parse(path + file)
root = tree.getroot()


So far, so good. I can get a list of each discrete survey and convert to an array:



surveys =
for d in root.findall("./Section/SurveyTime"):
date = d.text[:-9]
surveys.append(date)
surv_arr = np.array(surveys)


The next bit extracts the numeric values I want to eventually plot on my x-axis, but it is a running list (then array) of all the values tagged with in the xml source.



station = 
for s in root.findall("./Section/Data/E/O"):
sta = float(s.text)
station.append(sta)
sta_arr = np.array(station)


But I need the "sta" values for each survey grouped in their own array (e.g. so "sta_arr" is a 1 x n array where n is the number of surveys). So I try a nested for loop:



station = 

for section in root.findall('./Section'):
source_items = section.find('DataSource').attrib
date = section.find('SurveyTime').text[:-9]
for s in section.findall("./Section/Data/E/O"):
sta = float(s.text)
station.append(sta)


It produces n arrays, but they're empty. So tweak the code a bit to iterate over the date variable (surveys):



station = 

for section in root.findall('./Section'):
source_items = section.find('DataSource').attrib
date = section.find('SurveyTime').text[:-9]
for d in date:
sta = d.find('.Section/Data/E/O')
station.append(sta)


The good news is this populates arrays arrays of varying lengths for 'station'. Though the length seems to be incrementing by 1. The bad news is all the values are "-1" and the array length for each survey does not equal the number of observations (tagged '') in the xml file.



Any thoughts?










share|improve this question




























    up vote
    0
    down vote

    favorite












    I have inherited data in XML format that has multiple, nested 1:many relationships. The lowest level is coordinate data I wish to plot (as scatter and/or line). I wish to use the higher levels as grouping variables and for formatting. Recognizing the inefficiency, a 'flat' data organization is acceptable for my purposes.



    A sample of the XML looks like:



    <?xml version="1.0"?>
    -<Hilltop>
    -<Section SiteName="1">
    -<DataSource NumItems="1" Name="RL Section">
    <DataType>RLOffsetSection</DataType>
    <ItemFormat>0</ItemFormat>
    </DataSource>
    <SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>
    -<Data>
    -<E><O>-687.500</O><I1>95.837</I1></E>
    -<E><O>-680.100</O><I1>95.773</I1></E>
    </Data>
    </Section>

    -<Section SiteName="1">
    -<DataSource NumItems="1" Name="RL Section">
    <DataType>RLOffsetSection</DataType>
    <ItemFormat>0</ItemFormat>
    </DataSource>
    <SurveyTime>10-Feb-1993 12:00:00</SurveyTime>
    -<Data>
    -<E><O>0.000</O><I1>91.138</I1></E>
    -<E><O>5.600</O><I1>91.206</I1></E>
    -<E><O>11.900</O><I1>91.012</I1></E>
    </Data>
    </Section>

    <Section SiteName="2">
    <DataSource Name="RL Section" NumItems="1">
    <DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>
    </DataSource>
    <SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>
    <Data>
    <E><O>-99.100</O><I1>98.045</I1></E>
    <E><O>-89.900</O><I1>98.334</I1></E>
    <E><O>-85.900</O><I1>97.327</I1></E>
    <E><O>-75.400</O><I1>94.177</I1></E>
    <E><O>-53.100</O><I1>95.012</I1></E>
    </Data>
    </Section>

    <Section SiteName="2">
    <DataSource Name="RL Section" NumItems="1">
    <DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>
    </DataSource>
    <SurveyTime>10-Feb-1993 12:00:00</SurveyTime>
    <Data>
    <E><O>0.000</O><I1>95.001</I1></E>
    <E><O>4.600</O><I1>94.936</I1></E>
    <E><O>14.800</O><I1>94.069</I1></E>
    </Data>
    </Section>
    </Hilltop>


    I've adapted some code (borrowed from Gokhan Atil's blog) to parse into a pandas dataframe:



    def getvalueofnode(node):
    """ return node text or None """
    return node.text if node is not None else None

    def main():
    """ main """
    parsed_xml = et.parse(###my path and file###)
    dfcols = ['Survey', 'XSnum', 'STA', 'Elev', 'CODE']
    df_xml = pd.DataFrame(columns=dfcols)

    for node in parsed_xml.getroot():
    Survey = node.attrib.get('SurveyTime')
    XSnum = node.attrib.get('Section SiteName')
    STA = node.find('O')
    ELEV = node.find('I1')
    CODE = node.find('Code')

    df_xml = df_xml.append(
    pd.Series([getvalueofnode(Survey), getvalueofnode(XSnum),
    getvalueofnode(STA), getvalueofnode(ELEV),
    getvalueofnode(CODE)], index=dfcols),
    ignore_index=True)

    print(df_xml)

    main()


    It runs, and I get the dataframe structure I'm after, but I only get cell values of 'None', snippet follows...
    screen capture of present output df



    I suspect I'm calling the wrong method within the FOR loop, but have exhausted my (at this point) limited abilities troubleshooting code.



    Using the above XML data as an example, I'm looking for the data like:
    screen capture of desired output



    Any assistance getting this sorted is greatly appreciated. Cheers!



    ########## EDIT ##########

    Well, I was able to put this off for a while with manually-intensive workaround, but it doesn't scale, so I'm back to working on this.



    While I'd still like to end up at a df eventually, I'm now approaching more incrementally working toward a 3D array (surveys, station, elevation).



    I've re-grouped and am at least able to extract data with:



    import xml.etree.ElementTree as ET
    import numpy as np

    path = "somepath"
    file = "somefile.xml"

    tree = ET.parse(path + file)
    root = tree.getroot()


    So far, so good. I can get a list of each discrete survey and convert to an array:



    surveys =
    for d in root.findall("./Section/SurveyTime"):
    date = d.text[:-9]
    surveys.append(date)
    surv_arr = np.array(surveys)


    The next bit extracts the numeric values I want to eventually plot on my x-axis, but it is a running list (then array) of all the values tagged with in the xml source.



    station = 
    for s in root.findall("./Section/Data/E/O"):
    sta = float(s.text)
    station.append(sta)
    sta_arr = np.array(station)


    But I need the "sta" values for each survey grouped in their own array (e.g. so "sta_arr" is a 1 x n array where n is the number of surveys). So I try a nested for loop:



    station = 

    for section in root.findall('./Section'):
    source_items = section.find('DataSource').attrib
    date = section.find('SurveyTime').text[:-9]
    for s in section.findall("./Section/Data/E/O"):
    sta = float(s.text)
    station.append(sta)


    It produces n arrays, but they're empty. So tweak the code a bit to iterate over the date variable (surveys):



    station = 

    for section in root.findall('./Section'):
    source_items = section.find('DataSource').attrib
    date = section.find('SurveyTime').text[:-9]
    for d in date:
    sta = d.find('.Section/Data/E/O')
    station.append(sta)


    The good news is this populates arrays arrays of varying lengths for 'station'. Though the length seems to be incrementing by 1. The bad news is all the values are "-1" and the array length for each survey does not equal the number of observations (tagged '') in the xml file.



    Any thoughts?










    share|improve this question


























      up vote
      0
      down vote

      favorite









      up vote
      0
      down vote

      favorite











      I have inherited data in XML format that has multiple, nested 1:many relationships. The lowest level is coordinate data I wish to plot (as scatter and/or line). I wish to use the higher levels as grouping variables and for formatting. Recognizing the inefficiency, a 'flat' data organization is acceptable for my purposes.



      A sample of the XML looks like:



      <?xml version="1.0"?>
      -<Hilltop>
      -<Section SiteName="1">
      -<DataSource NumItems="1" Name="RL Section">
      <DataType>RLOffsetSection</DataType>
      <ItemFormat>0</ItemFormat>
      </DataSource>
      <SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>
      -<Data>
      -<E><O>-687.500</O><I1>95.837</I1></E>
      -<E><O>-680.100</O><I1>95.773</I1></E>
      </Data>
      </Section>

      -<Section SiteName="1">
      -<DataSource NumItems="1" Name="RL Section">
      <DataType>RLOffsetSection</DataType>
      <ItemFormat>0</ItemFormat>
      </DataSource>
      <SurveyTime>10-Feb-1993 12:00:00</SurveyTime>
      -<Data>
      -<E><O>0.000</O><I1>91.138</I1></E>
      -<E><O>5.600</O><I1>91.206</I1></E>
      -<E><O>11.900</O><I1>91.012</I1></E>
      </Data>
      </Section>

      <Section SiteName="2">
      <DataSource Name="RL Section" NumItems="1">
      <DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>
      </DataSource>
      <SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>
      <Data>
      <E><O>-99.100</O><I1>98.045</I1></E>
      <E><O>-89.900</O><I1>98.334</I1></E>
      <E><O>-85.900</O><I1>97.327</I1></E>
      <E><O>-75.400</O><I1>94.177</I1></E>
      <E><O>-53.100</O><I1>95.012</I1></E>
      </Data>
      </Section>

      <Section SiteName="2">
      <DataSource Name="RL Section" NumItems="1">
      <DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>
      </DataSource>
      <SurveyTime>10-Feb-1993 12:00:00</SurveyTime>
      <Data>
      <E><O>0.000</O><I1>95.001</I1></E>
      <E><O>4.600</O><I1>94.936</I1></E>
      <E><O>14.800</O><I1>94.069</I1></E>
      </Data>
      </Section>
      </Hilltop>


      I've adapted some code (borrowed from Gokhan Atil's blog) to parse into a pandas dataframe:



      def getvalueofnode(node):
      """ return node text or None """
      return node.text if node is not None else None

      def main():
      """ main """
      parsed_xml = et.parse(###my path and file###)
      dfcols = ['Survey', 'XSnum', 'STA', 'Elev', 'CODE']
      df_xml = pd.DataFrame(columns=dfcols)

      for node in parsed_xml.getroot():
      Survey = node.attrib.get('SurveyTime')
      XSnum = node.attrib.get('Section SiteName')
      STA = node.find('O')
      ELEV = node.find('I1')
      CODE = node.find('Code')

      df_xml = df_xml.append(
      pd.Series([getvalueofnode(Survey), getvalueofnode(XSnum),
      getvalueofnode(STA), getvalueofnode(ELEV),
      getvalueofnode(CODE)], index=dfcols),
      ignore_index=True)

      print(df_xml)

      main()


      It runs, and I get the dataframe structure I'm after, but I only get cell values of 'None', snippet follows...
      screen capture of present output df



      I suspect I'm calling the wrong method within the FOR loop, but have exhausted my (at this point) limited abilities troubleshooting code.



      Using the above XML data as an example, I'm looking for the data like:
      screen capture of desired output



      Any assistance getting this sorted is greatly appreciated. Cheers!



      ########## EDIT ##########

      Well, I was able to put this off for a while with manually-intensive workaround, but it doesn't scale, so I'm back to working on this.



      While I'd still like to end up at a df eventually, I'm now approaching more incrementally working toward a 3D array (surveys, station, elevation).



      I've re-grouped and am at least able to extract data with:



      import xml.etree.ElementTree as ET
      import numpy as np

      path = "somepath"
      file = "somefile.xml"

      tree = ET.parse(path + file)
      root = tree.getroot()


      So far, so good. I can get a list of each discrete survey and convert to an array:



      surveys =
      for d in root.findall("./Section/SurveyTime"):
      date = d.text[:-9]
      surveys.append(date)
      surv_arr = np.array(surveys)


      The next bit extracts the numeric values I want to eventually plot on my x-axis, but it is a running list (then array) of all the values tagged with in the xml source.



      station = 
      for s in root.findall("./Section/Data/E/O"):
      sta = float(s.text)
      station.append(sta)
      sta_arr = np.array(station)


      But I need the "sta" values for each survey grouped in their own array (e.g. so "sta_arr" is a 1 x n array where n is the number of surveys). So I try a nested for loop:



      station = 

      for section in root.findall('./Section'):
      source_items = section.find('DataSource').attrib
      date = section.find('SurveyTime').text[:-9]
      for s in section.findall("./Section/Data/E/O"):
      sta = float(s.text)
      station.append(sta)


      It produces n arrays, but they're empty. So tweak the code a bit to iterate over the date variable (surveys):



      station = 

      for section in root.findall('./Section'):
      source_items = section.find('DataSource').attrib
      date = section.find('SurveyTime').text[:-9]
      for d in date:
      sta = d.find('.Section/Data/E/O')
      station.append(sta)


      The good news is this populates arrays arrays of varying lengths for 'station'. Though the length seems to be incrementing by 1. The bad news is all the values are "-1" and the array length for each survey does not equal the number of observations (tagged '') in the xml file.



      Any thoughts?










      share|improve this question















      I have inherited data in XML format that has multiple, nested 1:many relationships. The lowest level is coordinate data I wish to plot (as scatter and/or line). I wish to use the higher levels as grouping variables and for formatting. Recognizing the inefficiency, a 'flat' data organization is acceptable for my purposes.



      A sample of the XML looks like:



      <?xml version="1.0"?>
      -<Hilltop>
      -<Section SiteName="1">
      -<DataSource NumItems="1" Name="RL Section">
      <DataType>RLOffsetSection</DataType>
      <ItemFormat>0</ItemFormat>
      </DataSource>
      <SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>
      -<Data>
      -<E><O>-687.500</O><I1>95.837</I1></E>
      -<E><O>-680.100</O><I1>95.773</I1></E>
      </Data>
      </Section>

      -<Section SiteName="1">
      -<DataSource NumItems="1" Name="RL Section">
      <DataType>RLOffsetSection</DataType>
      <ItemFormat>0</ItemFormat>
      </DataSource>
      <SurveyTime>10-Feb-1993 12:00:00</SurveyTime>
      -<Data>
      -<E><O>0.000</O><I1>91.138</I1></E>
      -<E><O>5.600</O><I1>91.206</I1></E>
      -<E><O>11.900</O><I1>91.012</I1></E>
      </Data>
      </Section>

      <Section SiteName="2">
      <DataSource Name="RL Section" NumItems="1">
      <DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>
      </DataSource>
      <SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>
      <Data>
      <E><O>-99.100</O><I1>98.045</I1></E>
      <E><O>-89.900</O><I1>98.334</I1></E>
      <E><O>-85.900</O><I1>97.327</I1></E>
      <E><O>-75.400</O><I1>94.177</I1></E>
      <E><O>-53.100</O><I1>95.012</I1></E>
      </Data>
      </Section>

      <Section SiteName="2">
      <DataSource Name="RL Section" NumItems="1">
      <DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>
      </DataSource>
      <SurveyTime>10-Feb-1993 12:00:00</SurveyTime>
      <Data>
      <E><O>0.000</O><I1>95.001</I1></E>
      <E><O>4.600</O><I1>94.936</I1></E>
      <E><O>14.800</O><I1>94.069</I1></E>
      </Data>
      </Section>
      </Hilltop>


      I've adapted some code (borrowed from Gokhan Atil's blog) to parse into a pandas dataframe:



      def getvalueofnode(node):
      """ return node text or None """
      return node.text if node is not None else None

      def main():
      """ main """
      parsed_xml = et.parse(###my path and file###)
      dfcols = ['Survey', 'XSnum', 'STA', 'Elev', 'CODE']
      df_xml = pd.DataFrame(columns=dfcols)

      for node in parsed_xml.getroot():
      Survey = node.attrib.get('SurveyTime')
      XSnum = node.attrib.get('Section SiteName')
      STA = node.find('O')
      ELEV = node.find('I1')
      CODE = node.find('Code')

      df_xml = df_xml.append(
      pd.Series([getvalueofnode(Survey), getvalueofnode(XSnum),
      getvalueofnode(STA), getvalueofnode(ELEV),
      getvalueofnode(CODE)], index=dfcols),
      ignore_index=True)

      print(df_xml)

      main()


      It runs, and I get the dataframe structure I'm after, but I only get cell values of 'None', snippet follows...
      screen capture of present output df



      I suspect I'm calling the wrong method within the FOR loop, but have exhausted my (at this point) limited abilities troubleshooting code.



      Using the above XML data as an example, I'm looking for the data like:
      screen capture of desired output



      Any assistance getting this sorted is greatly appreciated. Cheers!



      ########## EDIT ##########

      Well, I was able to put this off for a while with manually-intensive workaround, but it doesn't scale, so I'm back to working on this.



      While I'd still like to end up at a df eventually, I'm now approaching more incrementally working toward a 3D array (surveys, station, elevation).



      I've re-grouped and am at least able to extract data with:



      import xml.etree.ElementTree as ET
      import numpy as np

      path = "somepath"
      file = "somefile.xml"

      tree = ET.parse(path + file)
      root = tree.getroot()


      So far, so good. I can get a list of each discrete survey and convert to an array:



      surveys =
      for d in root.findall("./Section/SurveyTime"):
      date = d.text[:-9]
      surveys.append(date)
      surv_arr = np.array(surveys)


      The next bit extracts the numeric values I want to eventually plot on my x-axis, but it is a running list (then array) of all the values tagged with in the xml source.



      station = 
      for s in root.findall("./Section/Data/E/O"):
      sta = float(s.text)
      station.append(sta)
      sta_arr = np.array(station)


      But I need the "sta" values for each survey grouped in their own array (e.g. so "sta_arr" is a 1 x n array where n is the number of surveys). So I try a nested for loop:



      station = 

      for section in root.findall('./Section'):
      source_items = section.find('DataSource').attrib
      date = section.find('SurveyTime').text[:-9]
      for s in section.findall("./Section/Data/E/O"):
      sta = float(s.text)
      station.append(sta)


      It produces n arrays, but they're empty. So tweak the code a bit to iterate over the date variable (surveys):



      station = 

      for section in root.findall('./Section'):
      source_items = section.find('DataSource').attrib
      date = section.find('SurveyTime').text[:-9]
      for d in date:
      sta = d.find('.Section/Data/E/O')
      station.append(sta)


      The good news is this populates arrays arrays of varying lengths for 'station'. Though the length seems to be incrementing by 1. The bad news is all the values are "-1" and the array length for each survey does not equal the number of observations (tagged '') in the xml file.



      Any thoughts?







      python xml pandas






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 12 at 1:19

























      asked Jan 22 at 4:47









      CreekGeek

      13




      13
























          1 Answer
          1






          active

          oldest

          votes

















          up vote
          0
          down vote













          xmltodict was what I needed. I was able to sort through indexing in a ordered dictionary approach far faster than whatever hang-ups I was having with etree and minidom.



          import xmltodict
          import pandas as pd
          xs, survey, station, elevation = , , ,
          with open(path + file) as extract:
          orddict = xmltodict.parse(extract.read())
          num_surveys = len(orddict['Hilltop']['Section'])
          for i in range(0,num_surveys):
          XS_id = orddict['Hilltop']['Section'][i]['@SiteName']
          surv = orddict['Hilltop']['Section'][i]['SurveyTime'][:-9]
          sta_temp =
          elev_temp =
          data = orddict['Hilltop']['Section'][i]['Data']['E']
          for d in data:
          sta = float(d['O'])
          elev = float(d['I1'])
          sta_temp.append(sta)
          elev_temp.append(elev)
          xs.append(XS_id)
          survey.append(surv)
          station.append(sta_temp)
          elevation.append(elev_temp)
          df = pd.DataFrame.from_dict({'xs':xs, 'survey':survey, 'station':station, 'elevation':elevation})


          Note: The DataFrame as structured stores the coordinates as lists and not in a true 'flat' structure as in my original question/request. My solution here delivers the plotting and analytical performance I need while saving one grouping level, so is actually preferable (to me).



          Enjoy!!!






          share|improve this answer























            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














             

            draft saved


            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f48374823%2fparsing-hierarchical-data-from-xml-to-pandas%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            1 Answer
            1






            active

            oldest

            votes








            1 Answer
            1






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes








            up vote
            0
            down vote













            xmltodict was what I needed. I was able to sort through indexing in a ordered dictionary approach far faster than whatever hang-ups I was having with etree and minidom.



            import xmltodict
            import pandas as pd
            xs, survey, station, elevation = , , ,
            with open(path + file) as extract:
            orddict = xmltodict.parse(extract.read())
            num_surveys = len(orddict['Hilltop']['Section'])
            for i in range(0,num_surveys):
            XS_id = orddict['Hilltop']['Section'][i]['@SiteName']
            surv = orddict['Hilltop']['Section'][i]['SurveyTime'][:-9]
            sta_temp =
            elev_temp =
            data = orddict['Hilltop']['Section'][i]['Data']['E']
            for d in data:
            sta = float(d['O'])
            elev = float(d['I1'])
            sta_temp.append(sta)
            elev_temp.append(elev)
            xs.append(XS_id)
            survey.append(surv)
            station.append(sta_temp)
            elevation.append(elev_temp)
            df = pd.DataFrame.from_dict({'xs':xs, 'survey':survey, 'station':station, 'elevation':elevation})


            Note: The DataFrame as structured stores the coordinates as lists and not in a true 'flat' structure as in my original question/request. My solution here delivers the plotting and analytical performance I need while saving one grouping level, so is actually preferable (to me).



            Enjoy!!!






            share|improve this answer



























              up vote
              0
              down vote













              xmltodict was what I needed. I was able to sort through indexing in a ordered dictionary approach far faster than whatever hang-ups I was having with etree and minidom.



              import xmltodict
              import pandas as pd
              xs, survey, station, elevation = , , ,
              with open(path + file) as extract:
              orddict = xmltodict.parse(extract.read())
              num_surveys = len(orddict['Hilltop']['Section'])
              for i in range(0,num_surveys):
              XS_id = orddict['Hilltop']['Section'][i]['@SiteName']
              surv = orddict['Hilltop']['Section'][i]['SurveyTime'][:-9]
              sta_temp =
              elev_temp =
              data = orddict['Hilltop']['Section'][i]['Data']['E']
              for d in data:
              sta = float(d['O'])
              elev = float(d['I1'])
              sta_temp.append(sta)
              elev_temp.append(elev)
              xs.append(XS_id)
              survey.append(surv)
              station.append(sta_temp)
              elevation.append(elev_temp)
              df = pd.DataFrame.from_dict({'xs':xs, 'survey':survey, 'station':station, 'elevation':elevation})


              Note: The DataFrame as structured stores the coordinates as lists and not in a true 'flat' structure as in my original question/request. My solution here delivers the plotting and analytical performance I need while saving one grouping level, so is actually preferable (to me).



              Enjoy!!!






              share|improve this answer

























                up vote
                0
                down vote










                up vote
                0
                down vote









                xmltodict was what I needed. I was able to sort through indexing in a ordered dictionary approach far faster than whatever hang-ups I was having with etree and minidom.



                import xmltodict
                import pandas as pd
                xs, survey, station, elevation = , , ,
                with open(path + file) as extract:
                orddict = xmltodict.parse(extract.read())
                num_surveys = len(orddict['Hilltop']['Section'])
                for i in range(0,num_surveys):
                XS_id = orddict['Hilltop']['Section'][i]['@SiteName']
                surv = orddict['Hilltop']['Section'][i]['SurveyTime'][:-9]
                sta_temp =
                elev_temp =
                data = orddict['Hilltop']['Section'][i]['Data']['E']
                for d in data:
                sta = float(d['O'])
                elev = float(d['I1'])
                sta_temp.append(sta)
                elev_temp.append(elev)
                xs.append(XS_id)
                survey.append(surv)
                station.append(sta_temp)
                elevation.append(elev_temp)
                df = pd.DataFrame.from_dict({'xs':xs, 'survey':survey, 'station':station, 'elevation':elevation})


                Note: The DataFrame as structured stores the coordinates as lists and not in a true 'flat' structure as in my original question/request. My solution here delivers the plotting and analytical performance I need while saving one grouping level, so is actually preferable (to me).



                Enjoy!!!






                share|improve this answer














                xmltodict was what I needed. I was able to sort through indexing in a ordered dictionary approach far faster than whatever hang-ups I was having with etree and minidom.



                import xmltodict
                import pandas as pd
                xs, survey, station, elevation = , , ,
                with open(path + file) as extract:
                orddict = xmltodict.parse(extract.read())
                num_surveys = len(orddict['Hilltop']['Section'])
                for i in range(0,num_surveys):
                XS_id = orddict['Hilltop']['Section'][i]['@SiteName']
                surv = orddict['Hilltop']['Section'][i]['SurveyTime'][:-9]
                sta_temp =
                elev_temp =
                data = orddict['Hilltop']['Section'][i]['Data']['E']
                for d in data:
                sta = float(d['O'])
                elev = float(d['I1'])
                sta_temp.append(sta)
                elev_temp.append(elev)
                xs.append(XS_id)
                survey.append(surv)
                station.append(sta_temp)
                elevation.append(elev_temp)
                df = pd.DataFrame.from_dict({'xs':xs, 'survey':survey, 'station':station, 'elevation':elevation})


                Note: The DataFrame as structured stores the coordinates as lists and not in a true 'flat' structure as in my original question/request. My solution here delivers the plotting and analytical performance I need while saving one grouping level, so is actually preferable (to me).



                Enjoy!!!







                share|improve this answer














                share|improve this answer



                share|improve this answer








                edited Nov 12 at 1:39

























                answered Nov 12 at 1:32









                CreekGeek

                13




                13






























                     

                    draft saved


                    draft discarded



















































                     


                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f48374823%2fparsing-hierarchical-data-from-xml-to-pandas%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Bressuire

                    Vorschmack

                    Quarantine