Pandas: Enumerate duplicates in index












3















Let's say I have a list of events that happen on different keys.



data = [
{"key": "A", "event": "created"},
{"key": "A", "event": "updated"},
{"key": "A", "event": "updated"},
{"key": "A", "event": "updated"},
{"key": "B", "event": "created"},
{"key": "B", "event": "updated"},
{"key": "B", "event": "updated"},
{"key": "C", "event": "created"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
]

df = pandas.DataFrame(data)


I would like to index my DataFrame on the key first and then an enumeration. It looks like a simple unstack operation, but I'm unable to find how to do it properly.



The best I could do was



df.set_index("key", append=True).swaplevel(0, 1)

event
key
A 0 created
1 updated
2 updated
3 updated
B 4 created
5 updated
6 updated
C 7 created
8 updated
9 updated
10 updated
11 updated
12 updated


but what I'm expecting is



          event
key
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated


I also tried something like



df.groupby("key")["key"].count().apply(range).apply(pandas.Series).stack()


but the order is not preserved, so I can't apply the result as an index. Besides, I feel it overkill for an operation that looks quite standard...



Any idea?










share|improve this question























  • Related? stackoverflow.com/questions/41594703/…

    – Evan
    Nov 15 '18 at 22:08
















3















Let's say I have a list of events that happen on different keys.



data = [
{"key": "A", "event": "created"},
{"key": "A", "event": "updated"},
{"key": "A", "event": "updated"},
{"key": "A", "event": "updated"},
{"key": "B", "event": "created"},
{"key": "B", "event": "updated"},
{"key": "B", "event": "updated"},
{"key": "C", "event": "created"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
]

df = pandas.DataFrame(data)


I would like to index my DataFrame on the key first and then an enumeration. It looks like a simple unstack operation, but I'm unable to find how to do it properly.



The best I could do was



df.set_index("key", append=True).swaplevel(0, 1)

event
key
A 0 created
1 updated
2 updated
3 updated
B 4 created
5 updated
6 updated
C 7 created
8 updated
9 updated
10 updated
11 updated
12 updated


but what I'm expecting is



          event
key
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated


I also tried something like



df.groupby("key")["key"].count().apply(range).apply(pandas.Series).stack()


but the order is not preserved, so I can't apply the result as an index. Besides, I feel it overkill for an operation that looks quite standard...



Any idea?










share|improve this question























  • Related? stackoverflow.com/questions/41594703/…

    – Evan
    Nov 15 '18 at 22:08














3












3








3


0






Let's say I have a list of events that happen on different keys.



data = [
{"key": "A", "event": "created"},
{"key": "A", "event": "updated"},
{"key": "A", "event": "updated"},
{"key": "A", "event": "updated"},
{"key": "B", "event": "created"},
{"key": "B", "event": "updated"},
{"key": "B", "event": "updated"},
{"key": "C", "event": "created"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
]

df = pandas.DataFrame(data)


I would like to index my DataFrame on the key first and then an enumeration. It looks like a simple unstack operation, but I'm unable to find how to do it properly.



The best I could do was



df.set_index("key", append=True).swaplevel(0, 1)

event
key
A 0 created
1 updated
2 updated
3 updated
B 4 created
5 updated
6 updated
C 7 created
8 updated
9 updated
10 updated
11 updated
12 updated


but what I'm expecting is



          event
key
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated


I also tried something like



df.groupby("key")["key"].count().apply(range).apply(pandas.Series).stack()


but the order is not preserved, so I can't apply the result as an index. Besides, I feel it overkill for an operation that looks quite standard...



Any idea?










share|improve this question














Let's say I have a list of events that happen on different keys.



data = [
{"key": "A", "event": "created"},
{"key": "A", "event": "updated"},
{"key": "A", "event": "updated"},
{"key": "A", "event": "updated"},
{"key": "B", "event": "created"},
{"key": "B", "event": "updated"},
{"key": "B", "event": "updated"},
{"key": "C", "event": "created"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
]

df = pandas.DataFrame(data)


I would like to index my DataFrame on the key first and then an enumeration. It looks like a simple unstack operation, but I'm unable to find how to do it properly.



The best I could do was



df.set_index("key", append=True).swaplevel(0, 1)

event
key
A 0 created
1 updated
2 updated
3 updated
B 4 created
5 updated
6 updated
C 7 created
8 updated
9 updated
10 updated
11 updated
12 updated


but what I'm expecting is



          event
key
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated


I also tried something like



df.groupby("key")["key"].count().apply(range).apply(pandas.Series).stack()


but the order is not preserved, so I can't apply the result as an index. Besides, I feel it overkill for an operation that looks quite standard...



Any idea?







python python-3.x pandas






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 15 '18 at 21:59









CilyanCilyan

3,79011930




3,79011930













  • Related? stackoverflow.com/questions/41594703/…

    – Evan
    Nov 15 '18 at 22:08



















  • Related? stackoverflow.com/questions/41594703/…

    – Evan
    Nov 15 '18 at 22:08

















Related? stackoverflow.com/questions/41594703/…

– Evan
Nov 15 '18 at 22:08





Related? stackoverflow.com/questions/41594703/…

– Evan
Nov 15 '18 at 22:08












2 Answers
2






active

oldest

votes


















6















groupby + cumcount



Here are a couple of ways:



# new version thanks @ScottBoston
df = df.set_index(['key', df.groupby('key').cumcount()])
.rename_axis(['key','count'])

# original version
df = df.assign(count=df.groupby('key').cumcount())
.set_index(['key', 'count'])

print(df)

event
key count
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated





share|improve this answer





















  • 3





    df.set_index(['key',df.groupby('key').cumcount()]).rename_axis(['key','count'])

    – Scott Boston
    Nov 15 '18 at 22:34






  • 1





    @ScottBoston haha... I was just going to suggest: df.set_index(['key', df.groupby('key').cumcount().rename('count')])

    – Jon Clements
    Nov 15 '18 at 22:38













  • Yeah... let's get rid of that extra assign.

    – Scott Boston
    Nov 15 '18 at 22:41






  • 1





    @Scott Umm... timeit wise, your rename_axis is way ahead... .rename() is bloomin' slow for some reason... might you - I do have intermittent things running in the background so not the most of reliable timeits, but rename_axis is a good 3 times faster than assign...

    – Jon Clements
    Nov 15 '18 at 22:48













  • Thanks for the answer. I used @ScottBoston 's proposal to shorten the line.

    – Cilyan
    Nov 15 '18 at 23:01





















0














You can do this in numpy like this:



# df like in OP
keys = df['key'].values
# detect indices where key changes value
change = np.zeros(keys.size, dtype=int)
change[1:] = keys[1:] != keys[:-1]
# naive sequential number
seq = np.arange(keys.size)
# offset by seq at most recent change
offset = np.maximum.accumulate(change * seq)
df['seq'] = seq - offset
print(df.set_index(['key', 'seq']))

event
key seq
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated





share|improve this answer
























  • Thank you for the suggestion. :)

    – Cilyan
    Nov 15 '18 at 23:02











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53328489%2fpandas-enumerate-duplicates-in-index%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes









6















groupby + cumcount



Here are a couple of ways:



# new version thanks @ScottBoston
df = df.set_index(['key', df.groupby('key').cumcount()])
.rename_axis(['key','count'])

# original version
df = df.assign(count=df.groupby('key').cumcount())
.set_index(['key', 'count'])

print(df)

event
key count
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated





share|improve this answer





















  • 3





    df.set_index(['key',df.groupby('key').cumcount()]).rename_axis(['key','count'])

    – Scott Boston
    Nov 15 '18 at 22:34






  • 1





    @ScottBoston haha... I was just going to suggest: df.set_index(['key', df.groupby('key').cumcount().rename('count')])

    – Jon Clements
    Nov 15 '18 at 22:38













  • Yeah... let's get rid of that extra assign.

    – Scott Boston
    Nov 15 '18 at 22:41






  • 1





    @Scott Umm... timeit wise, your rename_axis is way ahead... .rename() is bloomin' slow for some reason... might you - I do have intermittent things running in the background so not the most of reliable timeits, but rename_axis is a good 3 times faster than assign...

    – Jon Clements
    Nov 15 '18 at 22:48













  • Thanks for the answer. I used @ScottBoston 's proposal to shorten the line.

    – Cilyan
    Nov 15 '18 at 23:01


















6















groupby + cumcount



Here are a couple of ways:



# new version thanks @ScottBoston
df = df.set_index(['key', df.groupby('key').cumcount()])
.rename_axis(['key','count'])

# original version
df = df.assign(count=df.groupby('key').cumcount())
.set_index(['key', 'count'])

print(df)

event
key count
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated





share|improve this answer





















  • 3





    df.set_index(['key',df.groupby('key').cumcount()]).rename_axis(['key','count'])

    – Scott Boston
    Nov 15 '18 at 22:34






  • 1





    @ScottBoston haha... I was just going to suggest: df.set_index(['key', df.groupby('key').cumcount().rename('count')])

    – Jon Clements
    Nov 15 '18 at 22:38













  • Yeah... let's get rid of that extra assign.

    – Scott Boston
    Nov 15 '18 at 22:41






  • 1





    @Scott Umm... timeit wise, your rename_axis is way ahead... .rename() is bloomin' slow for some reason... might you - I do have intermittent things running in the background so not the most of reliable timeits, but rename_axis is a good 3 times faster than assign...

    – Jon Clements
    Nov 15 '18 at 22:48













  • Thanks for the answer. I used @ScottBoston 's proposal to shorten the line.

    – Cilyan
    Nov 15 '18 at 23:01
















6












6








6








groupby + cumcount



Here are a couple of ways:



# new version thanks @ScottBoston
df = df.set_index(['key', df.groupby('key').cumcount()])
.rename_axis(['key','count'])

# original version
df = df.assign(count=df.groupby('key').cumcount())
.set_index(['key', 'count'])

print(df)

event
key count
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated





share|improve this answer
















groupby + cumcount



Here are a couple of ways:



# new version thanks @ScottBoston
df = df.set_index(['key', df.groupby('key').cumcount()])
.rename_axis(['key','count'])

# original version
df = df.assign(count=df.groupby('key').cumcount())
.set_index(['key', 'count'])

print(df)

event
key count
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated






share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 15 '18 at 23:14

























answered Nov 15 '18 at 22:07









jppjpp

102k2165115




102k2165115








  • 3





    df.set_index(['key',df.groupby('key').cumcount()]).rename_axis(['key','count'])

    – Scott Boston
    Nov 15 '18 at 22:34






  • 1





    @ScottBoston haha... I was just going to suggest: df.set_index(['key', df.groupby('key').cumcount().rename('count')])

    – Jon Clements
    Nov 15 '18 at 22:38













  • Yeah... let's get rid of that extra assign.

    – Scott Boston
    Nov 15 '18 at 22:41






  • 1





    @Scott Umm... timeit wise, your rename_axis is way ahead... .rename() is bloomin' slow for some reason... might you - I do have intermittent things running in the background so not the most of reliable timeits, but rename_axis is a good 3 times faster than assign...

    – Jon Clements
    Nov 15 '18 at 22:48













  • Thanks for the answer. I used @ScottBoston 's proposal to shorten the line.

    – Cilyan
    Nov 15 '18 at 23:01
















  • 3





    df.set_index(['key',df.groupby('key').cumcount()]).rename_axis(['key','count'])

    – Scott Boston
    Nov 15 '18 at 22:34






  • 1





    @ScottBoston haha... I was just going to suggest: df.set_index(['key', df.groupby('key').cumcount().rename('count')])

    – Jon Clements
    Nov 15 '18 at 22:38













  • Yeah... let's get rid of that extra assign.

    – Scott Boston
    Nov 15 '18 at 22:41






  • 1





    @Scott Umm... timeit wise, your rename_axis is way ahead... .rename() is bloomin' slow for some reason... might you - I do have intermittent things running in the background so not the most of reliable timeits, but rename_axis is a good 3 times faster than assign...

    – Jon Clements
    Nov 15 '18 at 22:48













  • Thanks for the answer. I used @ScottBoston 's proposal to shorten the line.

    – Cilyan
    Nov 15 '18 at 23:01










3




3





df.set_index(['key',df.groupby('key').cumcount()]).rename_axis(['key','count'])

– Scott Boston
Nov 15 '18 at 22:34





df.set_index(['key',df.groupby('key').cumcount()]).rename_axis(['key','count'])

– Scott Boston
Nov 15 '18 at 22:34




1




1





@ScottBoston haha... I was just going to suggest: df.set_index(['key', df.groupby('key').cumcount().rename('count')])

– Jon Clements
Nov 15 '18 at 22:38







@ScottBoston haha... I was just going to suggest: df.set_index(['key', df.groupby('key').cumcount().rename('count')])

– Jon Clements
Nov 15 '18 at 22:38















Yeah... let's get rid of that extra assign.

– Scott Boston
Nov 15 '18 at 22:41





Yeah... let's get rid of that extra assign.

– Scott Boston
Nov 15 '18 at 22:41




1




1





@Scott Umm... timeit wise, your rename_axis is way ahead... .rename() is bloomin' slow for some reason... might you - I do have intermittent things running in the background so not the most of reliable timeits, but rename_axis is a good 3 times faster than assign...

– Jon Clements
Nov 15 '18 at 22:48







@Scott Umm... timeit wise, your rename_axis is way ahead... .rename() is bloomin' slow for some reason... might you - I do have intermittent things running in the background so not the most of reliable timeits, but rename_axis is a good 3 times faster than assign...

– Jon Clements
Nov 15 '18 at 22:48















Thanks for the answer. I used @ScottBoston 's proposal to shorten the line.

– Cilyan
Nov 15 '18 at 23:01







Thanks for the answer. I used @ScottBoston 's proposal to shorten the line.

– Cilyan
Nov 15 '18 at 23:01















0














You can do this in numpy like this:



# df like in OP
keys = df['key'].values
# detect indices where key changes value
change = np.zeros(keys.size, dtype=int)
change[1:] = keys[1:] != keys[:-1]
# naive sequential number
seq = np.arange(keys.size)
# offset by seq at most recent change
offset = np.maximum.accumulate(change * seq)
df['seq'] = seq - offset
print(df.set_index(['key', 'seq']))

event
key seq
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated





share|improve this answer
























  • Thank you for the suggestion. :)

    – Cilyan
    Nov 15 '18 at 23:02
















0














You can do this in numpy like this:



# df like in OP
keys = df['key'].values
# detect indices where key changes value
change = np.zeros(keys.size, dtype=int)
change[1:] = keys[1:] != keys[:-1]
# naive sequential number
seq = np.arange(keys.size)
# offset by seq at most recent change
offset = np.maximum.accumulate(change * seq)
df['seq'] = seq - offset
print(df.set_index(['key', 'seq']))

event
key seq
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated





share|improve this answer
























  • Thank you for the suggestion. :)

    – Cilyan
    Nov 15 '18 at 23:02














0












0








0







You can do this in numpy like this:



# df like in OP
keys = df['key'].values
# detect indices where key changes value
change = np.zeros(keys.size, dtype=int)
change[1:] = keys[1:] != keys[:-1]
# naive sequential number
seq = np.arange(keys.size)
# offset by seq at most recent change
offset = np.maximum.accumulate(change * seq)
df['seq'] = seq - offset
print(df.set_index(['key', 'seq']))

event
key seq
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated





share|improve this answer













You can do this in numpy like this:



# df like in OP
keys = df['key'].values
# detect indices where key changes value
change = np.zeros(keys.size, dtype=int)
change[1:] = keys[1:] != keys[:-1]
# naive sequential number
seq = np.arange(keys.size)
# offset by seq at most recent change
offset = np.maximum.accumulate(change * seq)
df['seq'] = seq - offset
print(df.set_index(['key', 'seq']))

event
key seq
A 0 created
1 updated
2 updated
3 updated
B 0 created
1 updated
2 updated
C 0 created
1 updated
2 updated
3 updated
4 updated
5 updated






share|improve this answer












share|improve this answer



share|improve this answer










answered Nov 15 '18 at 22:20









Matthias OssadnikMatthias Ossadnik

60937




60937













  • Thank you for the suggestion. :)

    – Cilyan
    Nov 15 '18 at 23:02



















  • Thank you for the suggestion. :)

    – Cilyan
    Nov 15 '18 at 23:02

















Thank you for the suggestion. :)

– Cilyan
Nov 15 '18 at 23:02





Thank you for the suggestion. :)

– Cilyan
Nov 15 '18 at 23:02


















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53328489%2fpandas-enumerate-duplicates-in-index%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Xamarin.iOS Cant Deploy on Iphone

Glorious Revolution

Dulmage-Mendelsohn matrix decomposition in Python