Recursive walk through a JSON file extracting SELECTED strings



























I need to recursively walk through JSON files (POST responses from an API) and extract the strings stored under the key "text", e.g. {"text": "this is a string"}.



I need to start parsing from the source that has the oldest date in its metadata, extract the strings from that source, then move on to the 2nd oldest source, and so on. The JSON can be badly nested, and the level at which the strings appear can change from one response to the next.



Problem:
There are many keys called "text" and I don't need all of them; I need ONLY the ones whose values are strings. More precisely, the "text": "string" pairs I need are ALWAYS in the same object {} as a "type": "sentence" pair. See image.



What I am asking



Modify the 2nd code below so that it recursively walks the file and extracts ONLY the "text" values that are in the same object {} as "type": "sentence".



Below is a snippet of the JSON file (in green the text I need and the metadata, in red the ones I don't need to extract):



screenshot of contents of JSON file



Link to full JSON sample: http://pastebin.com/0NS5BiDk



What I have done so far:



1) The easy way: convert the JSON to a string and search for content between double quotes (""), because in all the JSON POST responses the strings I need are the only ones that appear between double quotes. However, this option prevents me from ordering the resources beforehand, so it is not good enough.



    import re

    # s, url2 and payload1 are defined earlier (not shown)
    r1 = s.post(url2, data=payload1)
    j = str(r1.json())

    sentences_list = re.findall(r'"(.+?)"', j)

    numentries = 0
    for sentences in sentences_list:
        numentries += 1
        print(sentences)
    print(numentries)
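A possible explanation for why only the wanted strings seemed to sit between double quotes: str() of a parsed response produces Python's repr rather than JSON text, and repr uses single quotes unless the string itself contains an apostrophe. The sketch below uses a made-up sample (the names sample, as_repr and as_json are illustrative, not from the real response) to show that the regex then picks up only strings containing an apostrophe, while on real JSON text it matches keys and values alike. This is a guess about the data, not something confirmed by the full sample.

```python
import json
import re

# Made-up sample; only the quoting behaviour matters here.
sample = {"type": "sentence", "text": "It's a string", "other": {"text": "plain"}}

as_repr = str(sample)         # Python repr of the dict, not JSON
as_json = json.dumps(sample)  # real JSON text, everything double-quoted

from_repr = re.findall(r'"(.+?)"', as_repr)
from_json = re.findall(r'"(.+?)"', as_json)

print(as_repr)
print(from_repr)       # only the string that contains an apostrophe
print(len(from_json))  # every key and every string value
```

If that is what is happening, sentences without an apostrophe would be silently skipped, which is another reason to walk the parsed structure instead.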


2) Smarter way: recursively walk through the JSON and extract the "text" values



    def get_all(myjson, key):
        if type(myjson) is dict:
            for jsonkey in myjson:
                if type(myjson[jsonkey]) in (list, dict):
                    get_all(myjson[jsonkey], key)
                elif jsonkey == key:
                    print(myjson[jsonkey])
        elif type(myjson) is list:
            for item in myjson:
                if type(item) in (list, dict):
                    get_all(item, key)

    # get_all() prints as it goes and returns None, so call it directly
    get_all(r1.json(), "text")


It extracts all the values that have "text" as a key. Unfortunately, other things in the file (that I don't need) also have "text" as a key, so it returns text that I don't need.



Please advise.



UPDATE



I have written two pieces of code to sort the list of objects by a certain key. The 1st one sorts by the 'text' of the XML. The 2nd one sorts by the 'Comprising period from' value.

The 1st one works, but a few of the XMLs, even though they have higher numbers, actually contain documents older than I expected.

For the 2nd code, the format of 'Comprising period from' is not consistent and sometimes the value is not present at all. The 2nd one also gives me an error, but I cannot figure out why: "string indices must be integers".



    # 1st code (it works but not ideal)

    j = r1.json()

    list = []
    for row in j["tree"]["children"][0]["children"]:
        list.append(row)

    newlist = sorted(list, key=lambda k: k['text'][-9:])
    print(newlist)

    # 2nd code: I need something to expect missing values and to solve the
    # list index error
    list = []
    for row in j["tree"]["children"][0]["children"]:
        list.append(row)

    def date(key):
        return dparser.parse((' '.join(key.split(' ')[-3:])), fuzzy=True)

    def order(list_to_order):
        try:
            return sorted(list_to_order,
                          key=lambda k: k[date(["metadata"][0]["value"])])
        except ValueError:
            return 0

    print(order(list))
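On the error in the 2nd code: in `["metadata"][0]["value"]` the expression `["metadata"]` is a list literal, so `[0]` gives the string "metadata" and `["value"]` then indexes a string with a string, which raises TypeError ("string indices must be integers"). The `except ValueError` clause does not catch a TypeError. What follows is a minimal sketch of a sort key that tolerates missing or unparseable dates. It assumes each row holds a "metadata" list of objects with "name" and "value" keys, which is a guess at the layout; the rows, DATE_FORMATS and the helper names are hypothetical, and the standard library's datetime.strptime stands in for dparser.parse.

```python
from datetime import datetime

# Hypothetical rows: the real layout of "metadata" is an assumption.
rows = [
    {"text": "S5CV0281P0.xml",
     "metadata": [{"name": "Comprising period from", "value": "from 12 March 1950"}]},
    {"text": "S5CV0280P0.xml",
     "metadata": [{"name": "Comprising period from", "value": "3 January 1949"}]},
    {"text": "S5CV0282P0.xml", "metadata": []},  # value missing
    {"text": "S5CV0283P0.xml",
     "metadata": [{"name": "Comprising period from", "value": "unknown"}]},
]

DATE_FORMATS = ("%d %B %Y", "%d %b %Y", "%Y-%m-%d")  # assumed formats


def parse_date(value):
    """Return a datetime built from the last three words of value, or None."""
    candidate = " ".join(value.split()[-3:])
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(candidate, fmt)
        except ValueError:
            continue
    return None


def period_start(row):
    """Return the 'Comprising period from' date of one row, or None."""
    for entry in row.get("metadata", []):
        if entry.get("name") == "Comprising period from":
            return parse_date(entry.get("value", ""))
    return None


def order(rows_to_order):
    """Oldest first; rows without a usable date go last, in original order."""
    def sort_key(row):
        when = period_start(row)
        return (when is None, when or datetime.max)
    return sorted(rows_to_order, key=sort_key)


ordered_names = [row["text"] for row in order(rows)]
print(ordered_names)
```

Returning a tuple as the key keeps undated rows out of the date comparison entirely, so no exception handling is needed around sorted().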









  • Please edit your question and include a sample of the JSON to be parsed as text; just a screenshot isn't enough.

    – martineau
    Jun 25 '16 at 14:53











  • @martineau, isn't the image I have already uploaded enough? I am not sure I understand your request, sorry. Please explain. By the way, I use Python 3.5.

    – ganesa75
    Jun 25 '16 at 15:03













  • @martineau, how can I load the entire JSON file?

    – ganesa75
    Jun 25 '16 at 15:12






  • If someone wants to test their code, some sample input will be needed. Folks don't want to have to try making their own. The entire JSON isn't needed, just enough to show the nesting you mention. Alternatively, you could post a link to the whole thing somewhere like pastebin.

    – martineau
    Jun 25 '16 at 15:13













  • I have added a full JSON sample link

    – ganesa75
    Jun 25 '16 at 15:54
















python json string recursion






edited Nov 15 '18 at 6:18









Cœur

asked Jun 25 '16 at 14:16









ganesa75

1 Answer
I think this will do what you want, as far as selecting the right strings. I also changed the way type-checking was done to use isinstance(), which is considered a better way to do it because it supports object-oriented polymorphism.



    import json
    _NUL = object()  # unique value guaranteed to never be in JSON data

    def get_all(myjson, kind, key):
        """ Recursively find all the values of key in all the dictionaries in myjson
            with a "type" key equal to kind.
        """
        if isinstance(myjson, dict):
            key_value = myjson.get(key, _NUL)  # _NUL if key not present
            if key_value is not _NUL and myjson.get("type") == kind:
                yield key_value
            for jsonkey in myjson:
                jsonvalue = myjson[jsonkey]
                for v in get_all(jsonvalue, kind, key):  # recursive
                    yield v
        elif isinstance(myjson, list):
            for item in myjson:
                for v in get_all(item, kind, key):  # recursive
                    yield v

    with open('json_sample.txt', 'r') as f:
        data = json.load(f)

    numentries = 0
    for text in get_all(data, "sentence", "text"):
        print(text)
        numentries += 1

    print('\nNumber of "text" entries found: {}'.format(numentries))
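The same idea can be tried without a file. The sketch below is a compact variant, not the code above: it uses `yield from` (Python 3.3+), runs on a small made-up document, keeps only values that are strings, and counts the results by collecting them in a list and taking its length. The document layout and the names data and texts are invented for illustration.

```python
def get_all(node, kind, key):
    """Yield node[key] for every dict in node whose "type" equals kind."""
    if isinstance(node, dict):
        if key in node and node.get("type") == kind:
            yield node[key]
        for value in node.values():
            yield from get_all(value, kind, key)  # recurse into each value
    elif isinstance(node, list):
        for item in node:
            yield from get_all(item, kind, key)  # recurse into each element


# Made-up document: only the string-valued "sentence" texts are wanted.
data = {
    "tree": {
        "text": "root label",
        "children": [
            {"type": "paragraph", "text": "skip me",
             "children": [
                 {"type": "sentence", "text": "first sentence"},
                 {"type": "sentence", "text": "second sentence"},
             ]},
            {"type": "sentence", "text": ["not", "a", "string"]},
        ],
    }
}

# Keep only string values, then count them via the list length.
texts = [t for t in get_all(data, "sentence", "text") if isinstance(t, str)]
numentries = len(texts)
print(texts, numentries)
```

Collecting into a list trades the generator's low memory use for a simple count; for very large responses the running counter shown earlier may be preferable.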





  • Thanks, it works. Can you advise how to check numentries? There are a couple of conditional loops and I am not sure where to place numentries += 1.

    – ganesa75
    Jun 25 '16 at 18:41











  • Thanks, it works. 1. Can you advise how to check numentries? There are a couple of conditional loops and I am not sure where to place "numentries += 1". 2. I still need to order the resources before extracting the "text". How can I do it? I studied the JSON file and found that 'resourceType': 'XML', 'text': 'S5CV0280P0.xml' can give me this order. Ideally your code should parse the "text" of 0280 first and then 0281. In the original JSON file the .xml entries are not in order.

    – ganesa75
    Jun 25 '16 at 18:50











  • I've modified my answer to determine the number of entries without counting each one as it's encountered — it just saves each one in a list and uses the length of that list at the end to determine numentries.

    – martineau
    Jun 25 '16 at 20:21











  • To do the sorting based on date, you're going to need to use the dates in the "Comprising period from" metadata in each "resourceType" dict/object encountered, convert that to a datetime and use that as the sorting criteria (aka sort key).

    – martineau
    Jun 25 '16 at 20:30











  • I've answered the main question you asked about recursively getting the selected strings and feel you should accept my answer — even if it doesn't do all the other processing you apparently would also like to do.

    – martineau
    Jun 26 '16 at 13:04
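The ordering discussed in the comments could be combined with the extraction roughly as follows. This is only a sketch under assumptions: each resource is taken to be an object with 'resourceType': 'XML' and a 'text' such as 'S5CV0280P0.xml', with its sentences nested somewhere below it; the sample resources and the helpers iter_sentences and xml_number are hypothetical. It sorts by the number in the file name, which the question notes does not always match the document dates, so a date-based key may be a better choice when the metadata is usable.

```python
import re


def iter_sentences(node):
    """Yield string "text" values from objects that have "type": "sentence"."""
    if isinstance(node, dict):
        if node.get("type") == "sentence" and isinstance(node.get("text"), str):
            yield node["text"]
        for value in node.values():
            yield from iter_sentences(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_sentences(item)


def xml_number(resource):
    """Sort key: the digits after 'CV' in names like 'S5CV0280P0.xml'."""
    match = re.search(r"CV(\d+)", resource.get("text", ""))
    # Resources whose name does not match the pattern sort last.
    return (match is None, int(match.group(1)) if match else 0)


# Hypothetical, deliberately unordered resources.
resources = [
    {"resourceType": "XML", "text": "S5CV0281P0.xml",
     "children": [{"type": "sentence", "text": "from 0281"}]},
    {"resourceType": "XML", "text": "S5CV0280P0.xml",
     "children": [{"type": "sentence", "text": "from 0280"}]},
]

# Sort the resources first, then extract sentences resource by resource.
ordered_texts = []
for resource in sorted(resources, key=xml_number):
    ordered_texts.extend(iter_sentences(resource))

print(ordered_texts)
```

Sorting the resources before walking them keeps the extraction itself unchanged; only the order in which subtrees are visited differs.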











edited Aug 27 '16 at 11:23

























answered Jun 25 '16 at 18:18









martineau

68.2k1090183




68.2k1090183













  • Thanks, it works. Can you advise how to count the entries (numentries)? There are a couple of conditional loops and I'm not sure where to place numentries += 1.

    – ganesa75
    Jun 25 '16 at 18:41











  • Thanks, it works. 1. Can you advise how to count the entries (numentries)? There are a couple of conditional loops and I am not sure where to place "numentries += 1". 2. I still need to order the resources before extracting the "text". How can I do that? I studied the JSON file and found that the entry 'resourceType': 'XML', 'text': 'S5CV0280P0.xml' can give me this order. Ideally your code should extract the "text" values from 0280 first and then from 0281. In the original JSON file the .xml entries are not in order.

    – ganesa75
    Jun 25 '16 at 18:50











  • I've modified my answer to determine the number of entries without counting each one as it's encountered — it just saves each one in a list and uses the length of that list at the end to determine numentries.

    – martineau
    Jun 25 '16 at 20:21











  • To do the sorting based on date, you're going to need to use the dates in the "Comprising period from" metadata in each "resourceType" dict/object encountered, convert that to a datetime and use that as the sorting criteria (aka sort key).

    – martineau
    Jun 25 '16 at 20:30











  • I've answered the main question you asked about recursively getting the selected strings and feel you should accept my answer — even if it doesn't do all the other processing you apparently would also like to do.

    – martineau
    Jun 26 '16 at 13:04
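
For the ordering question raised in the comments, one possible sketch is to sort the sources first and only then extract the sentences. It assumes the top level is a list of source objects and that each one contains, somewhere inside it, a dict with 'resourceType': 'XML' whose 'text' is the file name. The sample data, the key layout, and the xml_name helper are all hypothetical and will need adapting to the real response.

```python
def get_all(myjson, kind, key):
    """Yield each value of `key` from every dict whose "type" equals `kind`."""
    if isinstance(myjson, dict):
        if key in myjson and myjson.get("type") == kind:
            yield myjson[key]
        for value in myjson.values():
            yield from get_all(value, kind, key)
    elif isinstance(myjson, list):
        for item in myjson:
            yield from get_all(item, kind, key)


def xml_name(source):
    """Hypothetical helper: first 'text' found beside 'resourceType': 'XML'.

    Returns '' when no such entry exists, so those sources sort first.
    """
    if isinstance(source, dict):
        if source.get("resourceType") == "XML" and isinstance(source.get("text"), str):
            return source["text"]
        children = source.values()
    elif isinstance(source, list):
        children = source
    else:
        return ""
    for child in children:
        name = xml_name(child)
        if name:
            return name
    return ""


# Hypothetical data: the real layout may differ.
sources = [
    {"meta": {"resourceType": "XML", "text": "S5CV0281P0.xml"},
     "body": [{"type": "sentence", "text": "second source"}]},
    {"meta": {"resourceType": "XML", "text": "S5CV0280P0.xml"},
     "body": [{"type": "sentence", "text": "first source"},
              {"type": "word", "text": "ignored"}]},
]

# Sort the sources by file name, then extract sentences source by source.
ordered_texts = []
for source in sorted(sources, key=xml_name):
    ordered_texts.extend(get_all(source, "sentence", "text"))

print(ordered_texts)
```

Sorting on the "Comprising period from" date should work the same way, with a key function that parses the date (for example with datetime.strptime) instead of returning the file name. The date format is not shown in this thread, so that variant is left out here.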


















