How to implement code to manipulate files that runs in parallel?























I'm trying to load 10 dependent directories, each of which contains a bunch of JSON files. The structure is shown below:



[Figure: 5 events, each divided into 2 categories]



import json
import os

import pandas as pd

# charliehebdo, rumours
data = []
for fpathe1, dirs1, fs1 in os.walk('../input/charliehebdo/rumours/'):
    for f in fs1:
        with open(os.path.join(fpathe1, f)) as dir_loc:
            data.append(json.loads(dir_loc.read()))
# Build the frame once, after every file has been read.
charliehebdo = pd.DataFrame(data)
charliehebdo['label'] = 'TRUE'
charliehebdo['event'] = 'charliehebdo'

# charliehebdo, non-rumours (reset data so the frames do not accumulate)
data = []
for fpathe2, dirs2, fs2 in os.walk('../input/charliehebdo/non-rumours/'):
    for f in fs2:
        with open(os.path.join(fpathe2, f)) as dir_loc:
            data.append(json.loads(dir_loc.read()))
nonRumourcharliehebdo = pd.DataFrame(data)
nonRumourcharliehebdo['label'] = 'FALSE'
nonRumourcharliehebdo['event'] = 'charliehebdo'

# ferguson, rumours
data = []
for fpathe3, dirs3, fs3 in os.walk('../input/ferguson/rumours/'):
    for f in fs3:
        with open(os.path.join(fpathe3, f)) as dir_loc:
            data.append(json.loads(dir_loc.read()))
ferguson = pd.DataFrame(data)
ferguson['label'] = 'TRUE'
ferguson['event'] = 'ferguson'

# ferguson, non-rumours (the loop must use fs4/fpathe4, not fs3/fpathe3)
data = []
for fpathe4, dirs4, fs4 in os.walk('../input/ferguson/non-rumours/'):
    for f in fs4:
        with open(os.path.join(fpathe4, f)) as dir_loc:
            data.append(json.loads(dir_loc.read()))
nonRumourferguson = pd.DataFrame(data)
nonRumourferguson['label'] = 'FALSE'
nonRumourferguson['event'] = 'ferguson'


However, this code is extremely slow (it ran for more than 24 hours on my laptop with an Intel Core i7-4720HQ), so I'm wondering whether there is a better solution.



It seems that my structure figure was confusing or misleading, so here is the dataset: raw dataset



I intended to illustrate the dataset with a figure, but it turned out to make things worse.
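The loops in the question repeat the same steps for each event and category, so one possible way to express them is a single function driven by a list of events. This is only a sketch under assumptions: the names `load_events` and `CATEGORY_LABELS` are hypothetical, the layout `<input_root>/<event>/<category>/` is inferred from the paths in the question, and each file is assumed to hold a single JSON object. It does not run anything in parallel by itself.

```python
import json
import os

# Label values follow the question: rumours -> 'TRUE', non-rumours -> 'FALSE'.
CATEGORY_LABELS = {"rumours": "TRUE", "non-rumours": "FALSE"}


def load_events(input_root, events):
    """Return one list of records covering every event and category.

    Assumption: each file holds a single JSON object (a dict).
    """
    records = []
    for event in events:
        for category, label in CATEGORY_LABELS.items():
            root = os.path.join(input_root, event, category)
            for dirpath, _dirnames, filenames in os.walk(root):
                for name in filenames:
                    with open(os.path.join(dirpath, name)) as fh:
                        record = json.load(fh)
                    # Tag each record instead of tagging a frame per directory.
                    record["label"] = label
                    record["event"] = event
                    records.append(record)
    return records
```

Passing the result to `pd.DataFrame(...)` once, for example `pd.DataFrame(load_events('../input', ['charliehebdo', 'ferguson']))`, would then give a single frame with `label` and `event` columns rather than one frame per directory.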


































  • What does the "structure" diagram convey exactly? What are the words intended to represent, and what are the lines for?
    – Mad Physicist
    Nov 12 at 6:39










  • I suggest you first profile your script and see where it's spending most of the time, because that will tell you whether using concurrent processing will be worthwhile. See How can you profile a script?
    – martineau
    Nov 12 at 6:41










  • Your code looks like it's probably I/O bound, so parallel processing may not speed things up—and could in fact slow things down because of the overhead involved in using it.
    – martineau
    Nov 12 at 6:50










  • @MadPhysicist: From the code it looks to me like it's the filesystem directory structure involved.
    – martineau
    Nov 12 at 6:52
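The comments above suggest measuring before parallelising, since the work may be I/O bound. A minimal sketch of that check follows: it profiles a sequential loader with `cProfile` and then times it against a thread-pool variant built on `concurrent.futures.ThreadPoolExecutor`. All function names here (`list_files`, `read_json`, `load_sequential`, `load_threaded`, `profile_call`, `timed`) and the `max_workers=8` default are hypothetical choices for illustration, not something taken from the question, and whether threads help at all depends on what the profile shows.

```python
import cProfile
import io
import json
import os
import pstats
import time
from concurrent.futures import ThreadPoolExecutor


def list_files(root):
    """Collect the path of every file under root."""
    return [
        os.path.join(dirpath, name)
        for dirpath, _dirnames, filenames in os.walk(root)
        for name in filenames
    ]


def read_json(path):
    """Read and parse a single JSON file."""
    with open(path) as fh:
        return json.load(fh)


def load_sequential(root):
    """Parse the files one after another, as the question's loops do."""
    return [read_json(path) for path in list_files(root)]


def load_threaded(root, max_workers=8):
    """Parse the files with a pool of threads; map() keeps input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(read_json, list_files(root)))


def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def timed(func, *args):
    """Return (result, elapsed seconds) for one call of func(*args)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    root = "../input/charliehebdo/rumours/"  # path from the question
    _, report = profile_call(load_sequential, root)
    print(report)
    for loader in (load_sequential, load_threaded):
        _, seconds = timed(loader, root)
        print(f"{loader.__name__}: {seconds:.2f}s")
```

If the profile shows most of the time inside `open` and file reads, the threaded variant may or may not be faster, depending on the disk. If the time is spent elsewhere, for example in repeatedly building data frames, concurrency is unlikely to be the right fix.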






















python performance














edited Nov 13 at 2:22

























asked Nov 12 at 6:02









Tilmant

13
































