parsing hierarchical data from XML to pandas?
up vote
0
down vote
favorite
I have inherited data in XML format that has multiple, nested 1:many relationships. The lowest level is coordinate data I wish to plot (as scatter and/or line). I wish to use the higher levels as grouping variables and for formatting. Recognizing the inefficiency, a 'flat' data organization is acceptable for my purposes.
A sample of the XML looks like:
<?xml version="1.0"?>
-<Hilltop>
-<Section SiteName="1">
-<DataSource NumItems="1" Name="RL Section">
<DataType>RLOffsetSection</DataType>
<ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>
-<Data>
-<E><O>-687.500</O><I1>95.837</I1></E>
-<E><O>-680.100</O><I1>95.773</I1></E>
</Data>
</Section>
-<Section SiteName="1">
-<DataSource NumItems="1" Name="RL Section">
<DataType>RLOffsetSection</DataType>
<ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime>10-Feb-1993 12:00:00</SurveyTime>
-<Data>
-<E><O>0.000</O><I1>91.138</I1></E>
-<E><O>5.600</O><I1>91.206</I1></E>
-<E><O>11.900</O><I1>91.012</I1></E>
</Data>
</Section>
<Section SiteName="2">
<DataSource Name="RL Section" NumItems="1">
<DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>
<Data>
<E><O>-99.100</O><I1>98.045</I1></E>
<E><O>-89.900</O><I1>98.334</I1></E>
<E><O>-85.900</O><I1>97.327</I1></E>
<E><O>-75.400</O><I1>94.177</I1></E>
<E><O>-53.100</O><I1>95.012</I1></E>
</Data>
</Section>
<Section SiteName="2">
<DataSource Name="RL Section" NumItems="1">
<DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime>10-Feb-1993 12:00:00</SurveyTime>
<Data>
<E><O>0.000</O><I1>95.001</I1></E>
<E><O>4.600</O><I1>94.936</I1></E>
<E><O>14.800</O><I1>94.069</I1></E>
</Data>
</Section>
</Hilltop>
I've adapted some code (borrowed from Gokhan Atil's blog) to parse into a pandas dataframe:
def getvalueofnode(node):
""" return node text or None """
return node.text if node is not None else None
def main():
""" main """
parsed_xml = et.parse(###my path and file###)
dfcols = ['Survey', 'XSnum', 'STA', 'Elev', 'CODE']
df_xml = pd.DataFrame(columns=dfcols)
for node in parsed_xml.getroot():
Survey = node.attrib.get('SurveyTime')
XSnum = node.attrib.get('Section SiteName')
STA = node.find('O')
ELEV = node.find('I1')
CODE = node.find('Code')
df_xml = df_xml.append(
pd.Series([getvalueofnode(Survey), getvalueofnode(XSnum),
getvalueofnode(STA), getvalueofnode(ELEV),
getvalueofnode(CODE)], index=dfcols),
ignore_index=True)
print(df_xml)
main()
It runs, and I get the dataframe structure I'm after, but I only get cell values of 'None', snippet follows...
screen capture of present output df
I suspect I'm calling the wrong method within the FOR loop, but have exhausted my (at this point) limited abilities troubleshooting code.
Using the above XML data as an example, I'm looking for the data like:
screen capture of desired output
Any assistance getting this sorted is greatly appreciated. Cheers!
########## EDIT ##########
Well, I was able to put this off for a while with manually-intensive workaround, but it doesn't scale, so I'm back to working on this.
While I'd still like to end up at a df eventually, I'm now approaching more incrementally working toward a 3D array (surveys, station, elevation).
I've re-grouped and am at least able to extract data with:
import xml.etree.ElementTree as ET
import numpy as np
path = "somepath"
file = "somefile.xml"
tree = ET.parse(path + file)
root = tree.getroot()
So far, so good. I can get a list of each discrete survey and convert to an array:
surveys =
for d in root.findall("./Section/SurveyTime"):
date = d.text[:-9]
surveys.append(date)
surv_arr = np.array(surveys)
The next bit extracts the numeric values I want to eventually plot on my x-axis, but it is a running list (then array) of all the values tagged with in the xml source.
station =
for s in root.findall("./Section/Data/E/O"):
sta = float(s.text)
station.append(sta)
sta_arr = np.array(station)
But I need the "sta" values for each survey grouped in their own array (e.g. so "sta_arr" is a 1 x n array where n is the number of surveys). So I try a nested for loop:
station =
for section in root.findall('./Section'):
source_items = section.find('DataSource').attrib
date = section.find('SurveyTime').text[:-9]
for s in section.findall("./Section/Data/E/O"):
sta = float(s.text)
station.append(sta)
It produces n arrays, but they're empty. So tweak the code a bit to iterate over the date variable (surveys):
station =
for section in root.findall('./Section'):
source_items = section.find('DataSource').attrib
date = section.find('SurveyTime').text[:-9]
for d in date:
sta = d.find('.Section/Data/E/O')
station.append(sta)
The good news is this populates arrays arrays of varying lengths for 'station'. Though the length seems to be incrementing by 1. The bad news is all the values are "-1" and the array length for each survey does not equal the number of observations (tagged '') in the xml file.
Any thoughts?
python xml pandas
add a comment |
up vote
0
down vote
favorite
I have inherited data in XML format that has multiple, nested 1:many relationships. The lowest level is coordinate data I wish to plot (as scatter and/or line). I wish to use the higher levels as grouping variables and for formatting. Recognizing the inefficiency, a 'flat' data organization is acceptable for my purposes.
A sample of the XML looks like:
<?xml version="1.0"?>
-<Hilltop>
-<Section SiteName="1">
-<DataSource NumItems="1" Name="RL Section">
<DataType>RLOffsetSection</DataType>
<ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>
-<Data>
-<E><O>-687.500</O><I1>95.837</I1></E>
-<E><O>-680.100</O><I1>95.773</I1></E>
</Data>
</Section>
-<Section SiteName="1">
-<DataSource NumItems="1" Name="RL Section">
<DataType>RLOffsetSection</DataType>
<ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime>10-Feb-1993 12:00:00</SurveyTime>
-<Data>
-<E><O>0.000</O><I1>91.138</I1></E>
-<E><O>5.600</O><I1>91.206</I1></E>
-<E><O>11.900</O><I1>91.012</I1></E>
</Data>
</Section>
<Section SiteName="2">
<DataSource Name="RL Section" NumItems="1">
<DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>
<Data>
<E><O>-99.100</O><I1>98.045</I1></E>
<E><O>-89.900</O><I1>98.334</I1></E>
<E><O>-85.900</O><I1>97.327</I1></E>
<E><O>-75.400</O><I1>94.177</I1></E>
<E><O>-53.100</O><I1>95.012</I1></E>
</Data>
</Section>
<Section SiteName="2">
<DataSource Name="RL Section" NumItems="1">
<DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime>10-Feb-1993 12:00:00</SurveyTime>
<Data>
<E><O>0.000</O><I1>95.001</I1></E>
<E><O>4.600</O><I1>94.936</I1></E>
<E><O>14.800</O><I1>94.069</I1></E>
</Data>
</Section>
</Hilltop>
I've adapted some code (borrowed from Gokhan Atil's blog) to parse into a pandas dataframe:
def getvalueofnode(node):
""" return node text or None """
return node.text if node is not None else None
def main():
""" main """
parsed_xml = et.parse(###my path and file###)
dfcols = ['Survey', 'XSnum', 'STA', 'Elev', 'CODE']
df_xml = pd.DataFrame(columns=dfcols)
for node in parsed_xml.getroot():
Survey = node.attrib.get('SurveyTime')
XSnum = node.attrib.get('Section SiteName')
STA = node.find('O')
ELEV = node.find('I1')
CODE = node.find('Code')
df_xml = df_xml.append(
pd.Series([getvalueofnode(Survey), getvalueofnode(XSnum),
getvalueofnode(STA), getvalueofnode(ELEV),
getvalueofnode(CODE)], index=dfcols),
ignore_index=True)
print(df_xml)
main()
It runs, and I get the dataframe structure I'm after, but I only get cell values of 'None', snippet follows...
screen capture of present output df
I suspect I'm calling the wrong method within the FOR loop, but have exhausted my (at this point) limited abilities troubleshooting code.
Using the above XML data as an example, I'm looking for the data like:
screen capture of desired output
Any assistance getting this sorted is greatly appreciated. Cheers!
########## EDIT ##########
Well, I was able to put this off for a while with manually-intensive workaround, but it doesn't scale, so I'm back to working on this.
While I'd still like to end up at a df eventually, I'm now approaching more incrementally working toward a 3D array (surveys, station, elevation).
I've re-grouped and am at least able to extract data with:
import xml.etree.ElementTree as ET
import numpy as np
path = "somepath"
file = "somefile.xml"
tree = ET.parse(path + file)
root = tree.getroot()
So far, so good. I can get a list of each discrete survey and convert to an array:
surveys =
for d in root.findall("./Section/SurveyTime"):
date = d.text[:-9]
surveys.append(date)
surv_arr = np.array(surveys)
The next bit extracts the numeric values I want to eventually plot on my x-axis, but it is a running list (then array) of all the values tagged with in the xml source.
station =
for s in root.findall("./Section/Data/E/O"):
sta = float(s.text)
station.append(sta)
sta_arr = np.array(station)
But I need the "sta" values for each survey grouped in their own array (e.g. so "sta_arr" is a 1 x n array where n is the number of surveys). So I try a nested for loop:
station =
for section in root.findall('./Section'):
source_items = section.find('DataSource').attrib
date = section.find('SurveyTime').text[:-9]
for s in section.findall("./Section/Data/E/O"):
sta = float(s.text)
station.append(sta)
It produces n arrays, but they're empty. So tweak the code a bit to iterate over the date variable (surveys):
station =
for section in root.findall('./Section'):
source_items = section.find('DataSource').attrib
date = section.find('SurveyTime').text[:-9]
for d in date:
sta = d.find('.Section/Data/E/O')
station.append(sta)
The good news is this populates arrays arrays of varying lengths for 'station'. Though the length seems to be incrementing by 1. The bad news is all the values are "-1" and the array length for each survey does not equal the number of observations (tagged '') in the xml file.
Any thoughts?
python xml pandas
add a comment |
up vote
0
down vote
favorite
up vote
0
down vote
favorite
I have inherited data in XML format that has multiple, nested 1:many relationships. The lowest level is coordinate data I wish to plot (as scatter and/or line). I wish to use the higher levels as grouping variables and for formatting. Recognizing the inefficiency, a 'flat' data organization is acceptable for my purposes.
A sample of the XML looks like:
<?xml version="1.0"?>
-<Hilltop>
-<Section SiteName="1">
-<DataSource NumItems="1" Name="RL Section">
<DataType>RLOffsetSection</DataType>
<ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>
-<Data>
-<E><O>-687.500</O><I1>95.837</I1></E>
-<E><O>-680.100</O><I1>95.773</I1></E>
</Data>
</Section>
-<Section SiteName="1">
-<DataSource NumItems="1" Name="RL Section">
<DataType>RLOffsetSection</DataType>
<ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime>10-Feb-1993 12:00:00</SurveyTime>
-<Data>
-<E><O>0.000</O><I1>91.138</I1></E>
-<E><O>5.600</O><I1>91.206</I1></E>
-<E><O>11.900</O><I1>91.012</I1></E>
</Data>
</Section>
<Section SiteName="2">
<DataSource Name="RL Section" NumItems="1">
<DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>
<Data>
<E><O>-99.100</O><I1>98.045</I1></E>
<E><O>-89.900</O><I1>98.334</I1></E>
<E><O>-85.900</O><I1>97.327</I1></E>
<E><O>-75.400</O><I1>94.177</I1></E>
<E><O>-53.100</O><I1>95.012</I1></E>
</Data>
</Section>
<Section SiteName="2">
<DataSource Name="RL Section" NumItems="1">
<DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime>10-Feb-1993 12:00:00</SurveyTime>
<Data>
<E><O>0.000</O><I1>95.001</I1></E>
<E><O>4.600</O><I1>94.936</I1></E>
<E><O>14.800</O><I1>94.069</I1></E>
</Data>
</Section>
</Hilltop>
I've adapted some code (borrowed from Gokhan Atil's blog) to parse into a pandas dataframe:
def getvalueofnode(node):
""" return node text or None """
return node.text if node is not None else None
def main():
""" main """
parsed_xml = et.parse(###my path and file###)
dfcols = ['Survey', 'XSnum', 'STA', 'Elev', 'CODE']
df_xml = pd.DataFrame(columns=dfcols)
for node in parsed_xml.getroot():
Survey = node.attrib.get('SurveyTime')
XSnum = node.attrib.get('Section SiteName')
STA = node.find('O')
ELEV = node.find('I1')
CODE = node.find('Code')
df_xml = df_xml.append(
pd.Series([getvalueofnode(Survey), getvalueofnode(XSnum),
getvalueofnode(STA), getvalueofnode(ELEV),
getvalueofnode(CODE)], index=dfcols),
ignore_index=True)
print(df_xml)
main()
It runs, and I get the dataframe structure I'm after, but I only get cell values of 'None', snippet follows...
screen capture of present output df
I suspect I'm calling the wrong method within the FOR loop, but have exhausted my (at this point) limited abilities troubleshooting code.
Using the above XML data as an example, I'm looking for the data like:
screen capture of desired output
Any assistance getting this sorted is greatly appreciated. Cheers!
########## EDIT ##########
Well, I was able to put this off for a while with manually-intensive workaround, but it doesn't scale, so I'm back to working on this.
While I'd still like to end up at a df eventually, I'm now approaching more incrementally working toward a 3D array (surveys, station, elevation).
I've re-grouped and am at least able to extract data with:
import xml.etree.ElementTree as ET
import numpy as np
path = "somepath"
file = "somefile.xml"
tree = ET.parse(path + file)
root = tree.getroot()
So far, so good. I can get a list of each discrete survey and convert to an array:
surveys =
for d in root.findall("./Section/SurveyTime"):
date = d.text[:-9]
surveys.append(date)
surv_arr = np.array(surveys)
The next bit extracts the numeric values I want to eventually plot on my x-axis, but it is a running list (then array) of all the values tagged with in the xml source.
station =
for s in root.findall("./Section/Data/E/O"):
sta = float(s.text)
station.append(sta)
sta_arr = np.array(station)
But I need the "sta" values for each survey grouped in their own array (e.g. so "sta_arr" is a 1 x n array where n is the number of surveys). So I try a nested for loop:
station =
for section in root.findall('./Section'):
source_items = section.find('DataSource').attrib
date = section.find('SurveyTime').text[:-9]
for s in section.findall("./Section/Data/E/O"):
sta = float(s.text)
station.append(sta)
It produces n arrays, but they're empty. So tweak the code a bit to iterate over the date variable (surveys):
station =
for section in root.findall('./Section'):
source_items = section.find('DataSource').attrib
date = section.find('SurveyTime').text[:-9]
for d in date:
sta = d.find('.Section/Data/E/O')
station.append(sta)
The good news is this populates arrays arrays of varying lengths for 'station'. Though the length seems to be incrementing by 1. The bad news is all the values are "-1" and the array length for each survey does not equal the number of observations (tagged '') in the xml file.
Any thoughts?
python xml pandas
I have inherited data in XML format that has multiple, nested 1:many relationships. The lowest level is coordinate data I wish to plot (as scatter and/or line). I wish to use the higher levels as grouping variables and for formatting. Recognizing the inefficiency, a 'flat' data organization is acceptable for my purposes.
A sample of the XML looks like:
<?xml version="1.0"?>
-<Hilltop>
-<Section SiteName="1">
-<DataSource NumItems="1" Name="RL Section">
<DataType>RLOffsetSection</DataType>
<ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>
-<Data>
-<E><O>-687.500</O><I1>95.837</I1></E>
-<E><O>-680.100</O><I1>95.773</I1></E>
</Data>
</Section>
-<Section SiteName="1">
-<DataSource NumItems="1" Name="RL Section">
<DataType>RLOffsetSection</DataType>
<ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime>10-Feb-1993 12:00:00</SurveyTime>
-<Data>
-<E><O>0.000</O><I1>91.138</I1></E>
-<E><O>5.600</O><I1>91.206</I1></E>
-<E><O>11.900</O><I1>91.012</I1></E>
</Data>
</Section>
<Section SiteName="2">
<DataSource Name="RL Section" NumItems="1">
<DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime> 1-Oct-1989 00:00:00</SurveyTime>
<Data>
<E><O>-99.100</O><I1>98.045</I1></E>
<E><O>-89.900</O><I1>98.334</I1></E>
<E><O>-85.900</O><I1>97.327</I1></E>
<E><O>-75.400</O><I1>94.177</I1></E>
<E><O>-53.100</O><I1>95.012</I1></E>
</Data>
</Section>
<Section SiteName="2">
<DataSource Name="RL Section" NumItems="1">
<DataType>RLOffsetSection</DataType><ItemFormat>0</ItemFormat>
</DataSource>
<SurveyTime>10-Feb-1993 12:00:00</SurveyTime>
<Data>
<E><O>0.000</O><I1>95.001</I1></E>
<E><O>4.600</O><I1>94.936</I1></E>
<E><O>14.800</O><I1>94.069</I1></E>
</Data>
</Section>
</Hilltop>
I've adapted some code (borrowed from Gokhan Atil's blog) to parse into a pandas dataframe:
def getvalueofnode(node):
""" return node text or None """
return node.text if node is not None else None
def main():
""" main """
parsed_xml = et.parse(###my path and file###)
dfcols = ['Survey', 'XSnum', 'STA', 'Elev', 'CODE']
df_xml = pd.DataFrame(columns=dfcols)
for node in parsed_xml.getroot():
Survey = node.attrib.get('SurveyTime')
XSnum = node.attrib.get('Section SiteName')
STA = node.find('O')
ELEV = node.find('I1')
CODE = node.find('Code')
df_xml = df_xml.append(
pd.Series([getvalueofnode(Survey), getvalueofnode(XSnum),
getvalueofnode(STA), getvalueofnode(ELEV),
getvalueofnode(CODE)], index=dfcols),
ignore_index=True)
print(df_xml)
main()
It runs, and I get the dataframe structure I'm after, but I only get cell values of 'None', snippet follows...
screen capture of present output df
I suspect I'm calling the wrong method within the FOR loop, but have exhausted my (at this point) limited abilities troubleshooting code.
Using the above XML data as an example, I'm looking for the data like:
screen capture of desired output
Any assistance getting this sorted is greatly appreciated. Cheers!
########## EDIT ##########
Well, I was able to put this off for a while with manually-intensive workaround, but it doesn't scale, so I'm back to working on this.
While I'd still like to end up at a df eventually, I'm now approaching more incrementally working toward a 3D array (surveys, station, elevation).
I've re-grouped and am at least able to extract data with:
import xml.etree.ElementTree as ET
import numpy as np
path = "somepath"
file = "somefile.xml"
tree = ET.parse(path + file)
root = tree.getroot()
So far, so good. I can get a list of each discrete survey and convert to an array:
surveys =
for d in root.findall("./Section/SurveyTime"):
date = d.text[:-9]
surveys.append(date)
surv_arr = np.array(surveys)
The next bit extracts the numeric values I want to eventually plot on my x-axis, but it is a running list (then array) of all the values tagged with in the xml source.
station =
for s in root.findall("./Section/Data/E/O"):
sta = float(s.text)
station.append(sta)
sta_arr = np.array(station)
But I need the "sta" values for each survey grouped in their own array (e.g. so "sta_arr" is a 1 x n array where n is the number of surveys). So I try a nested for loop:
station =
for section in root.findall('./Section'):
source_items = section.find('DataSource').attrib
date = section.find('SurveyTime').text[:-9]
for s in section.findall("./Section/Data/E/O"):
sta = float(s.text)
station.append(sta)
It produces n arrays, but they're empty. So tweak the code a bit to iterate over the date variable (surveys):
station =
for section in root.findall('./Section'):
source_items = section.find('DataSource').attrib
date = section.find('SurveyTime').text[:-9]
for d in date:
sta = d.find('.Section/Data/E/O')
station.append(sta)
The good news is this populates arrays arrays of varying lengths for 'station'. Though the length seems to be incrementing by 1. The bad news is all the values are "-1" and the array length for each survey does not equal the number of observations (tagged '') in the xml file.
Any thoughts?
python xml pandas
python xml pandas
edited Nov 12 at 1:19
asked Jan 22 at 4:47
CreekGeek
13
13
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
up vote
0
down vote
xmltodict was what I needed. I was able to sort through indexing in a ordered dictionary approach far faster than whatever hang-ups I was having with etree and minidom.
import xmltodict
import pandas as pd
xs, survey, station, elevation = , , ,
with open(path + file) as extract:
orddict = xmltodict.parse(extract.read())
num_surveys = len(orddict['Hilltop']['Section'])
for i in range(0,num_surveys):
XS_id = orddict['Hilltop']['Section'][i]['@SiteName']
surv = orddict['Hilltop']['Section'][i]['SurveyTime'][:-9]
sta_temp =
elev_temp =
data = orddict['Hilltop']['Section'][i]['Data']['E']
for d in data:
sta = float(d['O'])
elev = float(d['I1'])
sta_temp.append(sta)
elev_temp.append(elev)
xs.append(XS_id)
survey.append(surv)
station.append(sta_temp)
elevation.append(elev_temp)
df = pd.DataFrame.from_dict({'xs':xs, 'survey':survey, 'station':station, 'elevation':elevation})
Note: The DataFrame as structured stores the coordinates as lists and not in a true 'flat' structure as in my original question/request. My solution here delivers the plotting and analytical performance I need while saving one grouping level, so is actually preferable (to me).
Enjoy!!!
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
0
down vote
xmltodict was what I needed. I was able to sort through indexing in a ordered dictionary approach far faster than whatever hang-ups I was having with etree and minidom.
import xmltodict
import pandas as pd
xs, survey, station, elevation = , , ,
with open(path + file) as extract:
orddict = xmltodict.parse(extract.read())
num_surveys = len(orddict['Hilltop']['Section'])
for i in range(0,num_surveys):
XS_id = orddict['Hilltop']['Section'][i]['@SiteName']
surv = orddict['Hilltop']['Section'][i]['SurveyTime'][:-9]
sta_temp =
elev_temp =
data = orddict['Hilltop']['Section'][i]['Data']['E']
for d in data:
sta = float(d['O'])
elev = float(d['I1'])
sta_temp.append(sta)
elev_temp.append(elev)
xs.append(XS_id)
survey.append(surv)
station.append(sta_temp)
elevation.append(elev_temp)
df = pd.DataFrame.from_dict({'xs':xs, 'survey':survey, 'station':station, 'elevation':elevation})
Note: The DataFrame as structured stores the coordinates as lists and not in a true 'flat' structure as in my original question/request. My solution here delivers the plotting and analytical performance I need while saving one grouping level, so is actually preferable (to me).
Enjoy!!!
add a comment |
up vote
0
down vote
xmltodict was what I needed. I was able to sort through indexing in a ordered dictionary approach far faster than whatever hang-ups I was having with etree and minidom.
import xmltodict
import pandas as pd
xs, survey, station, elevation = , , ,
with open(path + file) as extract:
orddict = xmltodict.parse(extract.read())
num_surveys = len(orddict['Hilltop']['Section'])
for i in range(0,num_surveys):
XS_id = orddict['Hilltop']['Section'][i]['@SiteName']
surv = orddict['Hilltop']['Section'][i]['SurveyTime'][:-9]
sta_temp =
elev_temp =
data = orddict['Hilltop']['Section'][i]['Data']['E']
for d in data:
sta = float(d['O'])
elev = float(d['I1'])
sta_temp.append(sta)
elev_temp.append(elev)
xs.append(XS_id)
survey.append(surv)
station.append(sta_temp)
elevation.append(elev_temp)
df = pd.DataFrame.from_dict({'xs':xs, 'survey':survey, 'station':station, 'elevation':elevation})
Note: The DataFrame as structured stores the coordinates as lists and not in a true 'flat' structure as in my original question/request. My solution here delivers the plotting and analytical performance I need while saving one grouping level, so is actually preferable (to me).
Enjoy!!!
add a comment |
up vote
0
down vote
up vote
0
down vote
xmltodict was what I needed. I was able to sort through indexing in a ordered dictionary approach far faster than whatever hang-ups I was having with etree and minidom.
import xmltodict
import pandas as pd
xs, survey, station, elevation = , , ,
with open(path + file) as extract:
orddict = xmltodict.parse(extract.read())
num_surveys = len(orddict['Hilltop']['Section'])
for i in range(0,num_surveys):
XS_id = orddict['Hilltop']['Section'][i]['@SiteName']
surv = orddict['Hilltop']['Section'][i]['SurveyTime'][:-9]
sta_temp =
elev_temp =
data = orddict['Hilltop']['Section'][i]['Data']['E']
for d in data:
sta = float(d['O'])
elev = float(d['I1'])
sta_temp.append(sta)
elev_temp.append(elev)
xs.append(XS_id)
survey.append(surv)
station.append(sta_temp)
elevation.append(elev_temp)
df = pd.DataFrame.from_dict({'xs':xs, 'survey':survey, 'station':station, 'elevation':elevation})
Note: The DataFrame as structured stores the coordinates as lists and not in a true 'flat' structure as in my original question/request. My solution here delivers the plotting and analytical performance I need while saving one grouping level, so is actually preferable (to me).
Enjoy!!!
xmltodict was what I needed. I was able to sort through indexing in a ordered dictionary approach far faster than whatever hang-ups I was having with etree and minidom.
import xmltodict
import pandas as pd
xs, survey, station, elevation = , , ,
with open(path + file) as extract:
orddict = xmltodict.parse(extract.read())
num_surveys = len(orddict['Hilltop']['Section'])
for i in range(0,num_surveys):
XS_id = orddict['Hilltop']['Section'][i]['@SiteName']
surv = orddict['Hilltop']['Section'][i]['SurveyTime'][:-9]
sta_temp =
elev_temp =
data = orddict['Hilltop']['Section'][i]['Data']['E']
for d in data:
sta = float(d['O'])
elev = float(d['I1'])
sta_temp.append(sta)
elev_temp.append(elev)
xs.append(XS_id)
survey.append(surv)
station.append(sta_temp)
elevation.append(elev_temp)
df = pd.DataFrame.from_dict({'xs':xs, 'survey':survey, 'station':station, 'elevation':elevation})
Note: The DataFrame as structured stores the coordinates as lists and not in a true 'flat' structure as in my original question/request. My solution here delivers the plotting and analytical performance I need while saving one grouping level, so is actually preferable (to me).
Enjoy!!!
edited Nov 12 at 1:39
answered Nov 12 at 1:32
CreekGeek
13
13
add a comment |
add a comment |
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f48374823%2fparsing-hierarchical-data-from-xml-to-pandas%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown