reshaping data frame and applying calculation for each row
I have a data frame as follows:
df=pd.DataFrame({ 'family' : ["A","A","B","B"],
'V1' : [5,5,40,10,],
'V2' :[50,10,180,20],
'gr_0' :["all","all","all","all"],
'gr_1' :["m1","m1","m2","m3"],
'gr_2' :["m12","m12","m12","m9"],
'gr_3' :["NO","m14","m15","NO"]
})
and I would like to transform it in the following way:
df_new=pd.DataFrame({ 'family' : ["A","A","A","A","B","B","B","B","B","B"],
'gr' : ["all","m1","m12","m14","all","m2","m3","m12","m9","m15"],
"calc(sumV2/sumV1)":[6,6,6,2,4,4.5,2,4.5,2,4.5]
})
family gr calc(sumV2/sumV1)
0 A all 6.0
1 A m1 6.0
2 A m12 6.0
3 A m14 2.0
4 B all 4.0
5 B m2 4.5
6 B m3 2.0
7 B m12 4.5
8 B m9 2.0
9 B m15 4.5
In order to reach to df_new:
- I want the rows to be aligned by "family" X each unique value of "gr_" columns.
- For every line to calculate the respective sum(V2)/sum(V1) as shown in the df_new.
I am quite new to python. Softcoding of this seems quite complex to me.
Preferably, I don't want "No" records to be listed in this df_new but it can stay in the output as well.
python python-3.x pandas pivot-table pandas-groupby
add a comment |
I have a data frame as follows:
df=pd.DataFrame({ 'family' : ["A","A","B","B"],
'V1' : [5,5,40,10,],
'V2' :[50,10,180,20],
'gr_0' :["all","all","all","all"],
'gr_1' :["m1","m1","m2","m3"],
'gr_2' :["m12","m12","m12","m9"],
'gr_3' :["NO","m14","m15","NO"]
})
and I would like to transform it in the following way:
df_new=pd.DataFrame({ 'family' : ["A","A","A","A","B","B","B","B","B","B"],
'gr' : ["all","m1","m12","m14","all","m2","m3","m12","m9","m15"],
"calc(sumV2/sumV1)":[6,6,6,2,4,4.5,2,4.5,2,4.5]
})
family gr calc(sumV2/sumV1)
0 A all 6.0
1 A m1 6.0
2 A m12 6.0
3 A m14 2.0
4 B all 4.0
5 B m2 4.5
6 B m3 2.0
7 B m12 4.5
8 B m9 2.0
9 B m15 4.5
In order to reach to df_new:
- I want the rows to be aligned by "family" X each unique value of "gr_" columns.
- For every line to calculate the respective sum(V2)/sum(V1) as shown in the df_new.
I am quite new to python. Softcoding of this seems quite complex to me.
Preferably, I don't want "No" records to be listed in this df_new but it can stay in the output as well.
python python-3.x pandas pivot-table pandas-groupby
add a comment |
I have a data frame as follows:
df=pd.DataFrame({ 'family' : ["A","A","B","B"],
'V1' : [5,5,40,10,],
'V2' :[50,10,180,20],
'gr_0' :["all","all","all","all"],
'gr_1' :["m1","m1","m2","m3"],
'gr_2' :["m12","m12","m12","m9"],
'gr_3' :["NO","m14","m15","NO"]
})
and I would like to transform it in the following way:
df_new=pd.DataFrame({ 'family' : ["A","A","A","A","B","B","B","B","B","B"],
'gr' : ["all","m1","m12","m14","all","m2","m3","m12","m9","m15"],
"calc(sumV2/sumV1)":[6,6,6,2,4,4.5,2,4.5,2,4.5]
})
family gr calc(sumV2/sumV1)
0 A all 6.0
1 A m1 6.0
2 A m12 6.0
3 A m14 2.0
4 B all 4.0
5 B m2 4.5
6 B m3 2.0
7 B m12 4.5
8 B m9 2.0
9 B m15 4.5
In order to reach to df_new:
- I want the rows to be aligned by "family" X each unique value of "gr_" columns.
- For every line to calculate the respective sum(V2)/sum(V1) as shown in the df_new.
I am quite new to python. Softcoding of this seems quite complex to me.
Preferably, I don't want "No" records to be listed in this df_new but it can stay in the output as well.
python python-3.x pandas pivot-table pandas-groupby
I have a data frame as follows:
df=pd.DataFrame({ 'family' : ["A","A","B","B"],
'V1' : [5,5,40,10,],
'V2' :[50,10,180,20],
'gr_0' :["all","all","all","all"],
'gr_1' :["m1","m1","m2","m3"],
'gr_2' :["m12","m12","m12","m9"],
'gr_3' :["NO","m14","m15","NO"]
})
and I would like to transform it in the following way:
df_new=pd.DataFrame({ 'family' : ["A","A","A","A","B","B","B","B","B","B"],
'gr' : ["all","m1","m12","m14","all","m2","m3","m12","m9","m15"],
"calc(sumV2/sumV1)":[6,6,6,2,4,4.5,2,4.5,2,4.5]
})
family gr calc(sumV2/sumV1)
0 A all 6.0
1 A m1 6.0
2 A m12 6.0
3 A m14 2.0
4 B all 4.0
5 B m2 4.5
6 B m3 2.0
7 B m12 4.5
8 B m9 2.0
9 B m15 4.5
In order to reach to df_new:
- I want the rows to be aligned by "family" X each unique value of "gr_" columns.
- For every line to calculate the respective sum(V2)/sum(V1) as shown in the df_new.
I am quite new to python. Softcoding of this seems quite complex to me.
Preferably, I don't want "No" records to be listed in this df_new but it can stay in the output as well.
python python-3.x pandas pivot-table pandas-groupby
python python-3.x pandas pivot-table pandas-groupby
edited Nov 15 '18 at 17:55
user3483203
31.4k82656
31.4k82656
asked Nov 15 '18 at 16:22
glsvmkyglsvmky
111
111
add a comment |
add a comment |
2 Answers
2
active
oldest
votes
You can do so with this:
df_new = df.melt(id_vars=['family','V1','V2']).groupby(['family','value'])
.apply(lambda x: x.V2.sum()/x.V1.sum())
.reset_index(name='calc(sumV2/sumV1)')
df_new = df_new[df_new.value != 'NO'].reset_index(drop=True)
print(df_new)
family value calc(sumV2/sumV1)
0 A all 6.0
1 A m1 6.0
2 A m12 6.0
3 A m14 2.0
4 B all 4.0
5 B m12 4.5
6 B m15 4.5
7 B m2 4.5
8 B m3 2.0
9 B m9 2.0
1
Thanks for that @Ben.T :)
– yatu
Nov 15 '18 at 16:47
add a comment |
melt
+ groupby
:
v = df.melt(id_vars=['family','V1','V2'], value_name='gr')
w = v.loc[v.gr != 'NO']
x = w.groupby(['family', 'gr']).sum()
(x.V2 / x.V1).reset_index(name='calc(sumV2/sumV1)')
family gr calc(sumV2/sumV1)
0 A all 6.0
1 A m1 6.0
2 A m12 6.0
3 A m14 2.0
4 B all 4.0
5 B m12 4.5
6 B m15 4.5
7 B m2 4.5
8 B m3 2.0
9 B m9 2.0
A similar approach to this answer, but with the benefit of being fully vectorized, and avoiding apply
Performance:
a = np.random.randint(1, 1000, (1_000_000, 7))
df = pd.DataFrame(a, columns=['family', 'V1', 'V2', 'gr_0', 'gr_1', 'gr_2', 'gr_3'])
df[['gr_0', 'gr_1', 'gr_2', 'gr_3']] = df[['gr_0', 'gr_1', 'gr_2', 'gr_3']].astype(str)
%%timeit
v = df.melt(id_vars=['family','V1','V2'], value_name='gr')
w = v.loc[v.gr != 'NO']
x = w.groupby(['family', 'gr']).sum()
(x.V2 / x.V1).reset_index(name='calc(sumV2/sumV1)')
2.71 s ± 32.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df_new = (df.melt(id_vars=['family','V1','V2']).groupby(['family','value'])
.apply(lambda x: x.V2.sum()/x.V1.sum())
.reset_index(name='calc(sumV2/sumV1)'))
df_new = df_new[df_new.value != 'NO'].reset_index(drop=True)
5min 24s ± 3.35 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
@AlexandreNixon similar in approach, but I just added timings to show whyapply
should really be avoided. Vectorized operations make a big difference
– user3483203
Nov 15 '18 at 17:54
Fair point @user3483203
– yatu
Nov 15 '18 at 19:01
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53323755%2freshaping-data-frame-and-applying-calculation-for-each-row%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
You can do so with this:
df_new = df.melt(id_vars=['family','V1','V2']).groupby(['family','value'])
.apply(lambda x: x.V2.sum()/x.V1.sum())
.reset_index(name='calc(sumV2/sumV1)')
df_new = df_new[df_new.value != 'NO'].reset_index(drop=True)
print(df_new)
family value calc(sumV2/sumV1)
0 A all 6.0
1 A m1 6.0
2 A m12 6.0
3 A m14 2.0
4 B all 4.0
5 B m12 4.5
6 B m15 4.5
7 B m2 4.5
8 B m3 2.0
9 B m9 2.0
1
Thanks for that @Ben.T :)
– yatu
Nov 15 '18 at 16:47
add a comment |
You can do so with this:
df_new = df.melt(id_vars=['family','V1','V2']).groupby(['family','value'])
.apply(lambda x: x.V2.sum()/x.V1.sum())
.reset_index(name='calc(sumV2/sumV1)')
df_new = df_new[df_new.value != 'NO'].reset_index(drop=True)
print(df_new)
family value calc(sumV2/sumV1)
0 A all 6.0
1 A m1 6.0
2 A m12 6.0
3 A m14 2.0
4 B all 4.0
5 B m12 4.5
6 B m15 4.5
7 B m2 4.5
8 B m3 2.0
9 B m9 2.0
1
Thanks for that @Ben.T :)
– yatu
Nov 15 '18 at 16:47
add a comment |
You can do so with this:
df_new = df.melt(id_vars=['family','V1','V2']).groupby(['family','value'])
.apply(lambda x: x.V2.sum()/x.V1.sum())
.reset_index(name='calc(sumV2/sumV1)')
df_new = df_new[df_new.value != 'NO'].reset_index(drop=True)
print(df_new)
family value calc(sumV2/sumV1)
0 A all 6.0
1 A m1 6.0
2 A m12 6.0
3 A m14 2.0
4 B all 4.0
5 B m12 4.5
6 B m15 4.5
7 B m2 4.5
8 B m3 2.0
9 B m9 2.0
You can do so with this:
df_new = df.melt(id_vars=['family','V1','V2']).groupby(['family','value'])
.apply(lambda x: x.V2.sum()/x.V1.sum())
.reset_index(name='calc(sumV2/sumV1)')
df_new = df_new[df_new.value != 'NO'].reset_index(drop=True)
print(df_new)
family value calc(sumV2/sumV1)
0 A all 6.0
1 A m1 6.0
2 A m12 6.0
3 A m14 2.0
4 B all 4.0
5 B m12 4.5
6 B m15 4.5
7 B m2 4.5
8 B m3 2.0
9 B m9 2.0
edited Nov 15 '18 at 16:47
answered Nov 15 '18 at 16:33
yatuyatu
13.2k31341
13.2k31341
1
Thanks for that @Ben.T :)
– yatu
Nov 15 '18 at 16:47
add a comment |
1
Thanks for that @Ben.T :)
– yatu
Nov 15 '18 at 16:47
1
1
Thanks for that @Ben.T :)
– yatu
Nov 15 '18 at 16:47
Thanks for that @Ben.T :)
– yatu
Nov 15 '18 at 16:47
add a comment |
melt
+ groupby
:
v = df.melt(id_vars=['family','V1','V2'], value_name='gr')
w = v.loc[v.gr != 'NO']
x = w.groupby(['family', 'gr']).sum()
(x.V2 / x.V1).reset_index(name='calc(sumV2/sumV1)')
family gr calc(sumV2/sumV1)
0 A all 6.0
1 A m1 6.0
2 A m12 6.0
3 A m14 2.0
4 B all 4.0
5 B m12 4.5
6 B m15 4.5
7 B m2 4.5
8 B m3 2.0
9 B m9 2.0
A similar approach to this answer, but with the benefit of being fully vectorized, and avoiding apply
Performance:
a = np.random.randint(1, 1000, (1_000_000, 7))
df = pd.DataFrame(a, columns=['family', 'V1', 'V2', 'gr_0', 'gr_1', 'gr_2', 'gr_3'])
df[['gr_0', 'gr_1', 'gr_2', 'gr_3']] = df[['gr_0', 'gr_1', 'gr_2', 'gr_3']].astype(str)
%%timeit
v = df.melt(id_vars=['family','V1','V2'], value_name='gr')
w = v.loc[v.gr != 'NO']
x = w.groupby(['family', 'gr']).sum()
(x.V2 / x.V1).reset_index(name='calc(sumV2/sumV1)')
2.71 s ± 32.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df_new = (df.melt(id_vars=['family','V1','V2']).groupby(['family','value'])
.apply(lambda x: x.V2.sum()/x.V1.sum())
.reset_index(name='calc(sumV2/sumV1)'))
df_new = df_new[df_new.value != 'NO'].reset_index(drop=True)
5min 24s ± 3.35 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
@AlexandreNixon similar in approach, but I just added timings to show whyapply
should really be avoided. Vectorized operations make a big difference
– user3483203
Nov 15 '18 at 17:54
Fair point @user3483203
– yatu
Nov 15 '18 at 19:01
add a comment |
melt
+ groupby
:
v = df.melt(id_vars=['family','V1','V2'], value_name='gr')
w = v.loc[v.gr != 'NO']
x = w.groupby(['family', 'gr']).sum()
(x.V2 / x.V1).reset_index(name='calc(sumV2/sumV1)')
family gr calc(sumV2/sumV1)
0 A all 6.0
1 A m1 6.0
2 A m12 6.0
3 A m14 2.0
4 B all 4.0
5 B m12 4.5
6 B m15 4.5
7 B m2 4.5
8 B m3 2.0
9 B m9 2.0
A similar approach to this answer, but with the benefit of being fully vectorized, and avoiding apply
Performance:
a = np.random.randint(1, 1000, (1_000_000, 7))
df = pd.DataFrame(a, columns=['family', 'V1', 'V2', 'gr_0', 'gr_1', 'gr_2', 'gr_3'])
df[['gr_0', 'gr_1', 'gr_2', 'gr_3']] = df[['gr_0', 'gr_1', 'gr_2', 'gr_3']].astype(str)
%%timeit
v = df.melt(id_vars=['family','V1','V2'], value_name='gr')
w = v.loc[v.gr != 'NO']
x = w.groupby(['family', 'gr']).sum()
(x.V2 / x.V1).reset_index(name='calc(sumV2/sumV1)')
2.71 s ± 32.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df_new = (df.melt(id_vars=['family','V1','V2']).groupby(['family','value'])
.apply(lambda x: x.V2.sum()/x.V1.sum())
.reset_index(name='calc(sumV2/sumV1)'))
df_new = df_new[df_new.value != 'NO'].reset_index(drop=True)
5min 24s ± 3.35 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
@AlexandreNixon similar in approach, but I just added timings to show whyapply
should really be avoided. Vectorized operations make a big difference
– user3483203
Nov 15 '18 at 17:54
Fair point @user3483203
– yatu
Nov 15 '18 at 19:01
add a comment |
melt
+ groupby
:
v = df.melt(id_vars=['family','V1','V2'], value_name='gr')
w = v.loc[v.gr != 'NO']
x = w.groupby(['family', 'gr']).sum()
(x.V2 / x.V1).reset_index(name='calc(sumV2/sumV1)')
family gr calc(sumV2/sumV1)
0 A all 6.0
1 A m1 6.0
2 A m12 6.0
3 A m14 2.0
4 B all 4.0
5 B m12 4.5
6 B m15 4.5
7 B m2 4.5
8 B m3 2.0
9 B m9 2.0
A similar approach to this answer, but with the benefit of being fully vectorized, and avoiding apply
Performance:
a = np.random.randint(1, 1000, (1_000_000, 7))
df = pd.DataFrame(a, columns=['family', 'V1', 'V2', 'gr_0', 'gr_1', 'gr_2', 'gr_3'])
df[['gr_0', 'gr_1', 'gr_2', 'gr_3']] = df[['gr_0', 'gr_1', 'gr_2', 'gr_3']].astype(str)
%%timeit
v = df.melt(id_vars=['family','V1','V2'], value_name='gr')
w = v.loc[v.gr != 'NO']
x = w.groupby(['family', 'gr']).sum()
(x.V2 / x.V1).reset_index(name='calc(sumV2/sumV1)')
2.71 s ± 32.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df_new = (df.melt(id_vars=['family','V1','V2']).groupby(['family','value'])
.apply(lambda x: x.V2.sum()/x.V1.sum())
.reset_index(name='calc(sumV2/sumV1)'))
df_new = df_new[df_new.value != 'NO'].reset_index(drop=True)
5min 24s ± 3.35 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
melt
+ groupby
:
v = df.melt(id_vars=['family','V1','V2'], value_name='gr')
w = v.loc[v.gr != 'NO']
x = w.groupby(['family', 'gr']).sum()
(x.V2 / x.V1).reset_index(name='calc(sumV2/sumV1)')
family gr calc(sumV2/sumV1)
0 A all 6.0
1 A m1 6.0
2 A m12 6.0
3 A m14 2.0
4 B all 4.0
5 B m12 4.5
6 B m15 4.5
7 B m2 4.5
8 B m3 2.0
9 B m9 2.0
A similar approach to this answer, but with the benefit of being fully vectorized, and avoiding apply
Performance:
a = np.random.randint(1, 1000, (1_000_000, 7))
df = pd.DataFrame(a, columns=['family', 'V1', 'V2', 'gr_0', 'gr_1', 'gr_2', 'gr_3'])
df[['gr_0', 'gr_1', 'gr_2', 'gr_3']] = df[['gr_0', 'gr_1', 'gr_2', 'gr_3']].astype(str)
%%timeit
v = df.melt(id_vars=['family','V1','V2'], value_name='gr')
w = v.loc[v.gr != 'NO']
x = w.groupby(['family', 'gr']).sum()
(x.V2 / x.V1).reset_index(name='calc(sumV2/sumV1)')
2.71 s ± 32.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df_new = (df.melt(id_vars=['family','V1','V2']).groupby(['family','value'])
.apply(lambda x: x.V2.sum()/x.V1.sum())
.reset_index(name='calc(sumV2/sumV1)'))
df_new = df_new[df_new.value != 'NO'].reset_index(drop=True)
5min 24s ± 3.35 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
edited Nov 15 '18 at 17:53
answered Nov 15 '18 at 16:47
user3483203user3483203
31.4k82656
31.4k82656
@AlexandreNixon similar in approach, but I just added timings to show whyapply
should really be avoided. Vectorized operations make a big difference
– user3483203
Nov 15 '18 at 17:54
Fair point @user3483203
– yatu
Nov 15 '18 at 19:01
add a comment |
@AlexandreNixon similar in approach, but I just added timings to show whyapply
should really be avoided. Vectorized operations make a big difference
– user3483203
Nov 15 '18 at 17:54
Fair point @user3483203
– yatu
Nov 15 '18 at 19:01
@AlexandreNixon similar in approach, but I just added timings to show why
apply
should really be avoided. Vectorized operations make a big difference– user3483203
Nov 15 '18 at 17:54
@AlexandreNixon similar in approach, but I just added timings to show why
apply
should really be avoided. Vectorized operations make a big difference– user3483203
Nov 15 '18 at 17:54
Fair point @user3483203
– yatu
Nov 15 '18 at 19:01
Fair point @user3483203
– yatu
Nov 15 '18 at 19:01
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53323755%2freshaping-data-frame-and-applying-calculation-for-each-row%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown