Recursive walk through a JSON file extracting SELECTED strings

I need to recursively walk through JSON files (POST responses from an API) and extract the strings stored under the key "text", for example {"text":"this is a string"}.

I need to start parsing from the source that has the oldest date in its metadata, extract the strings from that source, then move on to the second-oldest source, and so on. The JSON can be deeply nested, and the level at which the strings appear can change from one response to the next.

Problem:
There are many keys called "text" and I don't need all of them. I need ONLY the ones whose values are strings. More precisely, the "text":"string" pairs I need are ALWAYS in the same object {} as "type":"sentence". See image.

What I am asking

Modify the 2nd code below so that it recursively walks the file and extracts ONLY the "text" values that sit in the same object {} as "type":"sentence".

Below is a snippet of the JSON file (in green the text I need and the metadata, in red the parts I don't need to extract):

[Image: screenshot of contents of JSON file]

Link to full JSON sample: http://pastebin.com/0NS5BiDk

What I have done so far:

1) The easy way: convert the JSON to a string and search for the content between double quotes (""), because in all the JSON POST responses the strings I need are the only ones that appear between double quotes. However, this option prevents me from ordering the resources first, so it is not good enough.

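import re

# s is an existing session object; url2 and payload1 are defined elsewhere
r1 = s.post(url2, data=payload1)
j = str(r1.json())

# grab everything that sits between double quotes
sentences_list = (re.findall(r'"(.+?)"', j))

numentries = 0
for sentences in sentences_list:
    numentries += 1
    print(sentences)
    print(numentries)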

2) Smarter way: recursively walk through the JSON file and extract the "text" values

def get_all(myjson, key):
    if type(myjson) is dict:
        for jsonkey in (myjson):
            if type(myjson[jsonkey]) in (list, dict):
                get_all(myjson[jsonkey], key)
            elif jsonkey == key:
                print(myjson[jsonkey])
    elif type(myjson) is list:
        for item in myjson:
            if type(item) in (list, dict):
                get_all(item, key)

print(get_all(r1.json(), "text"))

It extracts all the values that have "text" as their key. Unfortunately, other things in the file (which I don't need) also have "text" as a key, so it returns text that I don't need.

Please advise.

UPDATE

I have written two pieces of code to sort the list of objects by a certain key. The first sorts by the 'text' of the XML; the second sorts by the 'Comprising period from' value.

The first one works, but a few of the XMLs, even though they have higher numbers, actually contain documents older than I expected.

For the second one, the format of 'Comprising period from' is not consistent and sometimes the value is missing altogether. The second one also raises an error that I cannot figure out: string indices must be integers.

# 1st code (it works but not ideal)

j=r1.json()

list = []
for row in j["tree"]["children"][0]["children"]:
    list.append(row)

newlist = sorted(list, key=lambda k: k['text'][-9:])
print(newlist)

# 2nd code I need something to expect missing values and to solve the
# list index error
list = []
for row in j["tree"]["children"][0]["children"]:
    list.append(row)

def date(key):
    return dparser.parse((' '.join(key.split(' ')[-3:])),fuzzy=True)

def order(list_to_order):
    try:
        return sorted(list_to_order,
                      key=lambda k: k[date(["metadata"][0]["value"])])
    except ValueError:
        return 0

print(order(list))

edited Nov 15 '18 at 6:18 by Cœur

asked Jun 25 '16 at 14:16 by ganesa75

Please edit your question and include a sample of the JSON to be parsed as text; just a screenshot isn't enough.

– martineau
Jun 25 '16 at 14:53

@martineau, is the image I have already uploaded not enough? I am not sure I understand your request, sorry. Please explain. By the way, I use Python 3.5.

– ganesa75
Jun 25 '16 at 15:03

@martineau, how can I load the entire JSON file?

– ganesa75
Jun 25 '16 at 15:12

If someone wants to test their code, some sample input will be needed. Folks don't want to have to make their own. The entire JSON isn't needed, just enough to show the nesting you mention. Alternatively, you could post a link to the whole thing somewhere like pastebin.

– martineau
Jun 25 '16 at 15:13

I have added a full JSON sample link

– ganesa75
Jun 25 '16 at 15:54


Tags: python, json, string, recursion

1 Answer

I think this will do what you want, as far as selecting the right strings. I also changed the way type-checking was done to use isinstance(), which is considered a better way to do it because it supports object-oriented polymorphism.

import json

_NUL = object()  # unique value guaranteed to never be in JSON data

def get_all(myjson, kind, key):
    """ Recursively find all the values of key in all the dictionaries in myjson
        with a "type" key equal to kind.
    """
    if isinstance(myjson, dict):
        key_value = myjson.get(key, _NUL)  # _NUL if key not present
        if key_value is not _NUL and myjson.get("type") == kind:
            yield key_value
        for jsonkey in myjson:
            jsonvalue = myjson[jsonkey]
            for v in get_all(jsonvalue, kind, key):  # recursive
                yield v
    elif isinstance(myjson, list):
        for item in myjson:
            for v in get_all(item, kind, key):  # recursive
                yield v

with open('json_sample.txt', 'r') as f:
    data = json.load(f)

numentries = 0
for text in get_all(data, "sentence", "text"):
    print(text)
    numentries += 1

print('\nNumber of "text" entries found: {}'.format(numentries))
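
On the isinstance() point, here is a small illustration of the difference. It is only a sketch: the OrderedDict value is made up for the demonstration, and it would matter in practice only if the JSON were loaded into a dict subclass (for example via json.load(f, object_pairs_hook=OrderedDict)).

```python
from collections import OrderedDict

# Hypothetical object, standing in for one node of the parsed JSON.
data = OrderedDict([("type", "sentence"), ("text", "hello")])

# An exact type comparison rejects subclasses of dict...
exact_match = type(data) is dict
# ...whereas isinstance() accepts them.
subclass_match = isinstance(data, dict)

print(exact_match, subclass_match)  # False True
```

With the type() check used in the question, such a node would be skipped silently; with isinstance() it is walked like any other dictionary.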

edited Aug 27 '16 at 11:23

answered Jun 25 '16 at 18:18 by martineau

Thanks, it works. Can you advise how to check the numentries? There are a couple of conditional loops and I am not sure where to place numentries += 1.

– ganesa75
Jun 25 '16 at 18:41

Thanks, it works. 1. Can you advise how to check the numentries? There are a couple of conditional loops and I am not sure where to place "numentries += 1". 2. I still need to order the resources before extracting the "text". How can I do it? I studied the JSON file and found that 'resourceType': 'XML', 'text': 'S5CV0280P0.xml' can give me this order. Ideally your code should parse the "text" of 0280 first and then 0281. In the original JSON file the .xml entries are not in order.

– ganesa75
Jun 25 '16 at 18:50

I've modified my answer to determine the number of entries without counting each one as it's encountered — it just saves each one in a list and uses the length of that list at the end to determine numentries.

– martineau
Jun 25 '16 at 20:21
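
The list-based counting described in this comment might look roughly like the sketch below. It uses a condensed variant of get_all (a plain `key in myjson` test and `yield from`, so Python 3.3+), and the `sample` dictionary is a made-up miniature, not the real API response.

```python
def get_all(myjson, kind, key):
    """Yield every value of `key` found in a dict whose "type" equals `kind`."""
    if isinstance(myjson, dict):
        if key in myjson and myjson.get("type") == kind:
            yield myjson[key]
        for value in myjson.values():
            yield from get_all(value, kind, key)
    elif isinstance(myjson, list):
        for item in myjson:
            yield from get_all(item, kind, key)

# Hypothetical miniature of the response, for illustration only.
sample = {
    "type": "document",
    "text": "S5CV0280P0.xml",
    "children": [
        {"type": "sentence", "text": "first sentence"},
        {"type": "paragraph",
         "text": [{"type": "sentence", "text": "second sentence"}]},
    ],
}

# Materialise the generator once, then count with len() instead of a counter.
texts = list(get_all(sample, "sentence", "text"))
numentries = len(texts)
print(texts, numentries)
```

Keeping the results in a list costs some memory but leaves the walk itself free of bookkeeping, so there is no need to decide where a `numentries += 1` belongs.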

To do the sorting based on date, you're going to need to use the dates in the "Comprising period from" metadata in each "resourceType" dict/object encountered, convert that to a datetime and use that as the sorting criteria (aka sort key).

– martineau
Jun 25 '16 at 20:30
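
A rough sketch of that idea follows, under several assumptions that are not confirmed by the question: each resource carries a "metadata" list of dicts, the entry is identified by a "name" key (hypothetical), and the dates use one of the few layouts listed in DATE_FORMATS with English month names. Only the standard library is used here instead of the dparser call from the question. As a side note, the "string indices must be integers" error in the question comes from `["metadata"][0]["value"]`: that expression indexes the literal list `["metadata"]`, gets the string "metadata", and then tries to index that string with "value".

```python
from datetime import datetime

# Assumed date layouts; the real data may well use others.
DATE_FORMATS = ("%d %B %Y", "%d %b %Y", "%Y-%m-%d")

def parse_date(value):
    """Return a datetime for value, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt)
        except (ValueError, AttributeError):  # bad format, or value is None
            continue
    return None

def period_start(resource):
    """Look up the 'Comprising period from' date in a resource's metadata."""
    for entry in resource.get("metadata", []):
        # The "name" key is an assumption about the metadata layout.
        if isinstance(entry, dict) and entry.get("name") == "Comprising period from":
            return parse_date(entry.get("value"))
    return None

def sort_key(resource):
    date = period_start(resource)
    # Undated resources sort last; the tuple avoids comparing None with datetime.
    return (date is None, date or datetime.max)

# Made-up resources, for illustration only.
resources = [
    {"text": "S5CV0281P0.xml",
     "metadata": [{"name": "Comprising period from", "value": "12 March 1990"}]},
    {"text": "S5CV0282P0.xml", "metadata": []},
    {"text": "S5CV0280P0.xml",
     "metadata": [{"name": "Comprising period from", "value": "3 Jan 1989"}]},
]

ordered = sorted(resources, key=sort_key)
print([r["text"] for r in ordered])
```

Each resource in `ordered` could then be passed to get_all in turn, so the sentences come out oldest source first and resources with a missing or unreadable date are handled at the end rather than raising.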

I've answered the main question you asked about recursively getting the selected strings and feel you should accept my answer — even if it doesn't do all the other processing you apparently would also like to do.

– martineau
Jun 26 '16 at 13:04


