Pandas: Enumerate duplicates in index
Let's say I have a list of events that happen on different keys.
data = [
{"key": "A", "event": "created"},
{"key": "A", "event": "updated"},
{"key": "A", "event": "updated"},
{"key": "A", "event": "updated"},
{"key": "B", "event": "created"},
{"key": "B", "event": "updated"},
{"key": "B", "event": "updated"},
{"key": "C", "event": "created"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
]
df = pandas.DataFrame(data)
I would like to index my DataFrame on the key first and then an enumeration. It looks like a simple unstack operation, but I'm unable to find how to do it properly.
The best I could do was
df.set_index("key", append=True).swaplevel(0, 1)
event
key
A 0 created
1 updated
2 updated
3 updated
B 4 created
5 updated
6 updated
C 7 created
8 updated
9 updated
10 updated
11 updated
12 updated
but what I'm expecting is
event
key
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated
I also tried something like
df.groupby("key")["key"].count().apply(range).apply(pandas.Series).stack()
but the order is not preserved, so I can't apply the result as an index. Besides, I feel it overkill for an operation that looks quite standard...
Any idea?
python python-3.x pandas
add a comment |
Let's say I have a list of events that happen on different keys.
data = [
{"key": "A", "event": "created"},
{"key": "A", "event": "updated"},
{"key": "A", "event": "updated"},
{"key": "A", "event": "updated"},
{"key": "B", "event": "created"},
{"key": "B", "event": "updated"},
{"key": "B", "event": "updated"},
{"key": "C", "event": "created"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
]
df = pandas.DataFrame(data)
I would like to index my DataFrame on the key first and then an enumeration. It looks like a simple unstack operation, but I'm unable to find how to do it properly.
The best I could do was
df.set_index("key", append=True).swaplevel(0, 1)
event
key
A 0 created
1 updated
2 updated
3 updated
B 4 created
5 updated
6 updated
C 7 created
8 updated
9 updated
10 updated
11 updated
12 updated
but what I'm expecting is
event
key
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated
I also tried something like
df.groupby("key")["key"].count().apply(range).apply(pandas.Series).stack()
but the order is not preserved, so I can't apply the result as an index. Besides, I feel it overkill for an operation that looks quite standard...
Any idea?
python python-3.x pandas
Related? stackoverflow.com/questions/41594703/…
– Evan
Nov 15 '18 at 22:08
add a comment |
Let's say I have a list of events that happen on different keys.
data = [
{"key": "A", "event": "created"},
{"key": "A", "event": "updated"},
{"key": "A", "event": "updated"},
{"key": "A", "event": "updated"},
{"key": "B", "event": "created"},
{"key": "B", "event": "updated"},
{"key": "B", "event": "updated"},
{"key": "C", "event": "created"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
]
df = pandas.DataFrame(data)
I would like to index my DataFrame on the key first and then an enumeration. It looks like a simple unstack operation, but I'm unable to find how to do it properly.
The best I could do was
df.set_index("key", append=True).swaplevel(0, 1)
event
key
A 0 created
1 updated
2 updated
3 updated
B 4 created
5 updated
6 updated
C 7 created
8 updated
9 updated
10 updated
11 updated
12 updated
but what I'm expecting is
event
key
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated
I also tried something like
df.groupby("key")["key"].count().apply(range).apply(pandas.Series).stack()
but the order is not preserved, so I can't apply the result as an index. Besides, I feel it overkill for an operation that looks quite standard...
Any idea?
python python-3.x pandas
Let's say I have a list of events that happen on different keys.
data = [
{"key": "A", "event": "created"},
{"key": "A", "event": "updated"},
{"key": "A", "event": "updated"},
{"key": "A", "event": "updated"},
{"key": "B", "event": "created"},
{"key": "B", "event": "updated"},
{"key": "B", "event": "updated"},
{"key": "C", "event": "created"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
]
df = pandas.DataFrame(data)
I would like to index my DataFrame on the key first and then an enumeration. It looks like a simple unstack operation, but I'm unable to find how to do it properly.
The best I could do was
df.set_index("key", append=True).swaplevel(0, 1)
event
key
A 0 created
1 updated
2 updated
3 updated
B 4 created
5 updated
6 updated
C 7 created
8 updated
9 updated
10 updated
11 updated
12 updated
but what I'm expecting is
event
key
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated
I also tried something like
df.groupby("key")["key"].count().apply(range).apply(pandas.Series).stack()
but the order is not preserved, so I can't apply the result as an index. Besides, I feel it overkill for an operation that looks quite standard...
Any idea?
python python-3.x pandas
python python-3.x pandas
asked Nov 15 '18 at 21:59
CilyanCilyan
3,79011930
3,79011930
Related? stackoverflow.com/questions/41594703/…
– Evan
Nov 15 '18 at 22:08
add a comment |
Related? stackoverflow.com/questions/41594703/…
– Evan
Nov 15 '18 at 22:08
Related? stackoverflow.com/questions/41594703/…
– Evan
Nov 15 '18 at 22:08
Related? stackoverflow.com/questions/41594703/…
– Evan
Nov 15 '18 at 22:08
add a comment |
2 Answers
2
active
oldest
votes
groupby
+ cumcount
Here are a couple of ways:
# new version thanks @ScottBoston
df = df.set_index(['key', df.groupby('key').cumcount()])
.rename_axis(['key','count'])
# original version
df = df.assign(count=df.groupby('key').cumcount())
.set_index(['key', 'count'])
print(df)
event
key count
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated
3
df.set_index(['key',df.groupby('key').cumcount()]).rename_axis(['key','count'])
– Scott Boston
Nov 15 '18 at 22:34
1
@ScottBoston haha... I was just going to suggest:df.set_index(['key', df.groupby('key').cumcount().rename('count')])
– Jon Clements♦
Nov 15 '18 at 22:38
Yeah... let's get rid of that extraassign
.
– Scott Boston
Nov 15 '18 at 22:41
1
@Scott Umm... timeit wise, yourrename_axis
is way ahead....rename()
is bloomin' slow for some reason... might you - I do have intermittent things running in the background so not the most of reliable timeits, but rename_axis is a good 3 times faster than assign...
– Jon Clements♦
Nov 15 '18 at 22:48
Thanks for the answer. I used @ScottBoston 's proposal to shorten the line.
– Cilyan
Nov 15 '18 at 23:01
|
show 1 more comment
You can do this in numpy like this:
# df like in OP
keys = df['key'].values
# detect indices where key changes value
change = np.zeros(keys.size, dtype=int)
change[1:] = keys[1:] != keys[:-1]
# naive sequential number
seq = np.arange(keys.size)
# offset by seq at most recent change
offset = np.maximum.accumulate(change * seq)
df['seq'] = seq - offset
print(df.set_index(['key', 'seq']))
event
key seq
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated
Thank you for the suggestion. :)
– Cilyan
Nov 15 '18 at 23:02
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53328489%2fpandas-enumerate-duplicates-in-index%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
groupby
+ cumcount
Here are a couple of ways:
# new version thanks @ScottBoston
df = df.set_index(['key', df.groupby('key').cumcount()])
.rename_axis(['key','count'])
# original version
df = df.assign(count=df.groupby('key').cumcount())
.set_index(['key', 'count'])
print(df)
event
key count
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated
3
df.set_index(['key',df.groupby('key').cumcount()]).rename_axis(['key','count'])
– Scott Boston
Nov 15 '18 at 22:34
1
@ScottBoston haha... I was just going to suggest:df.set_index(['key', df.groupby('key').cumcount().rename('count')])
– Jon Clements♦
Nov 15 '18 at 22:38
Yeah... let's get rid of that extraassign
.
– Scott Boston
Nov 15 '18 at 22:41
1
@Scott Umm... timeit wise, yourrename_axis
is way ahead....rename()
is bloomin' slow for some reason... might you - I do have intermittent things running in the background so not the most of reliable timeits, but rename_axis is a good 3 times faster than assign...
– Jon Clements♦
Nov 15 '18 at 22:48
Thanks for the answer. I used @ScottBoston 's proposal to shorten the line.
– Cilyan
Nov 15 '18 at 23:01
|
show 1 more comment
groupby
+ cumcount
Here are a couple of ways:
# new version thanks @ScottBoston
df = df.set_index(['key', df.groupby('key').cumcount()])
.rename_axis(['key','count'])
# original version
df = df.assign(count=df.groupby('key').cumcount())
.set_index(['key', 'count'])
print(df)
event
key count
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated
3
df.set_index(['key',df.groupby('key').cumcount()]).rename_axis(['key','count'])
– Scott Boston
Nov 15 '18 at 22:34
1
@ScottBoston haha... I was just going to suggest:df.set_index(['key', df.groupby('key').cumcount().rename('count')])
– Jon Clements♦
Nov 15 '18 at 22:38
Yeah... let's get rid of that extraassign
.
– Scott Boston
Nov 15 '18 at 22:41
1
@Scott Umm... timeit wise, yourrename_axis
is way ahead....rename()
is bloomin' slow for some reason... might you - I do have intermittent things running in the background so not the most of reliable timeits, but rename_axis is a good 3 times faster than assign...
– Jon Clements♦
Nov 15 '18 at 22:48
Thanks for the answer. I used @ScottBoston 's proposal to shorten the line.
– Cilyan
Nov 15 '18 at 23:01
|
show 1 more comment
groupby
+ cumcount
Here are a couple of ways:
# new version thanks @ScottBoston
df = df.set_index(['key', df.groupby('key').cumcount()])
.rename_axis(['key','count'])
# original version
df = df.assign(count=df.groupby('key').cumcount())
.set_index(['key', 'count'])
print(df)
event
key count
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated
groupby
+ cumcount
Here are a couple of ways:
# new version thanks @ScottBoston
df = df.set_index(['key', df.groupby('key').cumcount()])
.rename_axis(['key','count'])
# original version
df = df.assign(count=df.groupby('key').cumcount())
.set_index(['key', 'count'])
print(df)
event
key count
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated
edited Nov 15 '18 at 23:14
answered Nov 15 '18 at 22:07
jppjpp
102k2165115
102k2165115
3
df.set_index(['key',df.groupby('key').cumcount()]).rename_axis(['key','count'])
– Scott Boston
Nov 15 '18 at 22:34
1
@ScottBoston haha... I was just going to suggest:df.set_index(['key', df.groupby('key').cumcount().rename('count')])
– Jon Clements♦
Nov 15 '18 at 22:38
Yeah... let's get rid of that extraassign
.
– Scott Boston
Nov 15 '18 at 22:41
1
@Scott Umm... timeit wise, yourrename_axis
is way ahead....rename()
is bloomin' slow for some reason... might you - I do have intermittent things running in the background so not the most of reliable timeits, but rename_axis is a good 3 times faster than assign...
– Jon Clements♦
Nov 15 '18 at 22:48
Thanks for the answer. I used @ScottBoston 's proposal to shorten the line.
– Cilyan
Nov 15 '18 at 23:01
|
show 1 more comment
3
df.set_index(['key',df.groupby('key').cumcount()]).rename_axis(['key','count'])
– Scott Boston
Nov 15 '18 at 22:34
1
@ScottBoston haha... I was just going to suggest:df.set_index(['key', df.groupby('key').cumcount().rename('count')])
– Jon Clements♦
Nov 15 '18 at 22:38
Yeah... let's get rid of that extraassign
.
– Scott Boston
Nov 15 '18 at 22:41
1
@Scott Umm... timeit wise, yourrename_axis
is way ahead....rename()
is bloomin' slow for some reason... might you - I do have intermittent things running in the background so not the most of reliable timeits, but rename_axis is a good 3 times faster than assign...
– Jon Clements♦
Nov 15 '18 at 22:48
Thanks for the answer. I used @ScottBoston 's proposal to shorten the line.
– Cilyan
Nov 15 '18 at 23:01
3
3
df.set_index(['key',df.groupby('key').cumcount()]).rename_axis(['key','count'])
– Scott Boston
Nov 15 '18 at 22:34
df.set_index(['key',df.groupby('key').cumcount()]).rename_axis(['key','count'])
– Scott Boston
Nov 15 '18 at 22:34
1
1
@ScottBoston haha... I was just going to suggest:
df.set_index(['key', df.groupby('key').cumcount().rename('count')])
– Jon Clements♦
Nov 15 '18 at 22:38
@ScottBoston haha... I was just going to suggest:
df.set_index(['key', df.groupby('key').cumcount().rename('count')])
– Jon Clements♦
Nov 15 '18 at 22:38
Yeah... let's get rid of that extra
assign
.– Scott Boston
Nov 15 '18 at 22:41
Yeah... let's get rid of that extra
assign
.– Scott Boston
Nov 15 '18 at 22:41
1
1
@Scott Umm... timeit wise, your
rename_axis
is way ahead... .rename()
is bloomin' slow for some reason... might you - I do have intermittent things running in the background so not the most of reliable timeits, but rename_axis is a good 3 times faster than assign...– Jon Clements♦
Nov 15 '18 at 22:48
@Scott Umm... timeit wise, your
rename_axis
is way ahead... .rename()
is bloomin' slow for some reason... might you - I do have intermittent things running in the background so not the most of reliable timeits, but rename_axis is a good 3 times faster than assign...– Jon Clements♦
Nov 15 '18 at 22:48
Thanks for the answer. I used @ScottBoston 's proposal to shorten the line.
– Cilyan
Nov 15 '18 at 23:01
Thanks for the answer. I used @ScottBoston 's proposal to shorten the line.
– Cilyan
Nov 15 '18 at 23:01
|
show 1 more comment
You can do this in numpy like this:
# df like in OP
keys = df['key'].values
# detect indices where key changes value
change = np.zeros(keys.size, dtype=int)
change[1:] = keys[1:] != keys[:-1]
# naive sequential number
seq = np.arange(keys.size)
# offset by seq at most recent change
offset = np.maximum.accumulate(change * seq)
df['seq'] = seq - offset
print(df.set_index(['key', 'seq']))
event
key seq
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated
Thank you for the suggestion. :)
– Cilyan
Nov 15 '18 at 23:02
add a comment |
You can do this in numpy like this:
# df like in OP
keys = df['key'].values
# detect indices where key changes value
change = np.zeros(keys.size, dtype=int)
change[1:] = keys[1:] != keys[:-1]
# naive sequential number
seq = np.arange(keys.size)
# offset by seq at most recent change
offset = np.maximum.accumulate(change * seq)
df['seq'] = seq - offset
print(df.set_index(['key', 'seq']))
event
key seq
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated
Thank you for the suggestion. :)
– Cilyan
Nov 15 '18 at 23:02
add a comment |
You can do this in numpy like this:
# df like in OP
keys = df['key'].values
# detect indices where key changes value
change = np.zeros(keys.size, dtype=int)
change[1:] = keys[1:] != keys[:-1]
# naive sequential number
seq = np.arange(keys.size)
# offset by seq at most recent change
offset = np.maximum.accumulate(change * seq)
df['seq'] = seq - offset
print(df.set_index(['key', 'seq']))
event
key seq
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated
You can do this in numpy like this:
# df like in OP
keys = df['key'].values
# detect indices where key changes value
change = np.zeros(keys.size, dtype=int)
change[1:] = keys[1:] != keys[:-1]
# naive sequential number
seq = np.arange(keys.size)
# offset by seq at most recent change
offset = np.maximum.accumulate(change * seq)
df['seq'] = seq - offset
print(df.set_index(['key', 'seq']))
event
key seq
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated
answered Nov 15 '18 at 22:20
Matthias OssadnikMatthias Ossadnik
60937
60937
Thank you for the suggestion. :)
– Cilyan
Nov 15 '18 at 23:02
add a comment |
Thank you for the suggestion. :)
– Cilyan
Nov 15 '18 at 23:02
Thank you for the suggestion. :)
– Cilyan
Nov 15 '18 at 23:02
Thank you for the suggestion. :)
– Cilyan
Nov 15 '18 at 23:02
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53328489%2fpandas-enumerate-duplicates-in-index%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Related? stackoverflow.com/questions/41594703/…
– Evan
Nov 15 '18 at 22:08