Pandas: Enumerate duplicates in index

Let's say I have a list of events that happen on different keys.

data = [

    {"key": "A", "event": "created"},

    {"key": "A", "event": "updated"},

    {"key": "A", "event": "updated"},

    {"key": "A", "event": "updated"},

    {"key": "B", "event": "created"},

    {"key": "B", "event": "updated"},

    {"key": "B", "event": "updated"},

    {"key": "C", "event": "created"},

    {"key": "C", "event": "updated"},

    {"key": "C", "event": "updated"},

    {"key": "C", "event": "updated"},

    {"key": "C", "event": "updated"},

    {"key": "C", "event": "updated"},

]



df = pandas.DataFrame(data)

I would like to index my DataFrame on the key first and then an enumeration. It looks like a simple unstack operation, but I'm unable to find how to do it properly.

The best I could do was

df.set_index("key", append=True).swaplevel(0, 1)



          event

key            

A   0   created

    1   updated

    2   updated

    3   updated

B   4   created

    5   updated

    6   updated

C   7   created

    8   updated

    9   updated

    10  updated

    11  updated

    12  updated

but what I'm expecting is

          event

key            

A   0   created

    1   updated

    2   updated

    3   updated

B   0   created

    1   updated

    2   updated

C   0   created

    1   updated

    2   updated

    3   updated

    4   updated

    5   updated

I also tried something like

df.groupby("key")["key"].count().apply(range).apply(pandas.Series).stack()

but the order is not preserved, so I can't apply the result as an index. Besides, I feel it overkill for an operation that looks quite standard...

Any idea?

asked Nov 15 '18 at 21:59

Cilyan

3,79011930

Related? stackoverflow.com/questions/41594703/…

– Evan
Nov 15 '18 at 22:08

add a comment |

Let's say I have a list of events that happen on different keys.

data = [

    {"key": "A", "event": "created"},

    {"key": "A", "event": "updated"},

    {"key": "A", "event": "updated"},

    {"key": "A", "event": "updated"},

    {"key": "B", "event": "created"},

    {"key": "B", "event": "updated"},

    {"key": "B", "event": "updated"},

    {"key": "C", "event": "created"},

    {"key": "C", "event": "updated"},

    {"key": "C", "event": "updated"},

    {"key": "C", "event": "updated"},

    {"key": "C", "event": "updated"},

    {"key": "C", "event": "updated"},

]



df = pandas.DataFrame(data)

I would like to index my DataFrame on the key first and then an enumeration. It looks like a simple unstack operation, but I'm unable to find how to do it properly.

The best I could do was

df.set_index("key", append=True).swaplevel(0, 1)



          event

key            

A   0   created

    1   updated

    2   updated

    3   updated

B   4   created

    5   updated

    6   updated

C   7   created

    8   updated

    9   updated

    10  updated

    11  updated

    12  updated

but what I'm expecting is

          event

key            

A   0   created

    1   updated

    2   updated

    3   updated

B   0   created

    1   updated

    2   updated

C   0   created

    1   updated

    2   updated

    3   updated

    4   updated

    5   updated

I also tried something like

df.groupby("key")["key"].count().apply(range).apply(pandas.Series).stack()

but the order is not preserved, so I can't apply the result as an index. Besides, I feel it overkill for an operation that looks quite standard...

Any idea?

asked Nov 15 '18 at 21:59

Cilyan

3,79011930

Related? stackoverflow.com/questions/41594703/…

– Evan
Nov 15 '18 at 22:08

add a comment |

Let's say I have a list of events that happen on different keys.

data = [

    {"key": "A", "event": "created"},

    {"key": "A", "event": "updated"},

    {"key": "A", "event": "updated"},

    {"key": "A", "event": "updated"},

    {"key": "B", "event": "created"},

    {"key": "B", "event": "updated"},

    {"key": "B", "event": "updated"},

    {"key": "C", "event": "created"},

    {"key": "C", "event": "updated"},

    {"key": "C", "event": "updated"},

    {"key": "C", "event": "updated"},

    {"key": "C", "event": "updated"},

    {"key": "C", "event": "updated"},

]



df = pandas.DataFrame(data)

I would like to index my DataFrame on the key first and then an enumeration. It looks like a simple unstack operation, but I'm unable to find how to do it properly.

The best I could do was

df.set_index("key", append=True).swaplevel(0, 1)



          event

key            

A   0   created

    1   updated

    2   updated

    3   updated

B   4   created

    5   updated

    6   updated

C   7   created

    8   updated

    9   updated

    10  updated

    11  updated

    12  updated

but what I'm expecting is

          event

key            

A   0   created

    1   updated

    2   updated

    3   updated

B   0   created

    1   updated

    2   updated

C   0   created

    1   updated

    2   updated

    3   updated

    4   updated

    5   updated

I also tried something like

df.groupby("key")["key"].count().apply(range).apply(pandas.Series).stack()

but the order is not preserved, so I can't apply the result as an index. Besides, I feel it overkill for an operation that looks quite standard...

Any idea?

asked Nov 15 '18 at 21:59

Cilyan

3,79011930

Let's say I have a list of events that happen on different keys.

data = [

    {"key": "A", "event": "created"},

    {"key": "A", "event": "updated"},

    {"key": "A", "event": "updated"},

    {"key": "A", "event": "updated"},

    {"key": "B", "event": "created"},

    {"key": "B", "event": "updated"},

    {"key": "B", "event": "updated"},

    {"key": "C", "event": "created"},

    {"key": "C", "event": "updated"},

    {"key": "C", "event": "updated"},

    {"key": "C", "event": "updated"},

    {"key": "C", "event": "updated"},

    {"key": "C", "event": "updated"},

]



df = pandas.DataFrame(data)

I would like to index my DataFrame on the key first and then an enumeration. It looks like a simple unstack operation, but I'm unable to find how to do it properly.

The best I could do was

df.set_index("key", append=True).swaplevel(0, 1)



          event

key            

A   0   created

    1   updated

    2   updated

    3   updated

B   4   created

    5   updated

    6   updated

C   7   created

    8   updated

    9   updated

    10  updated

    11  updated

    12  updated

but what I'm expecting is

          event

key            

A   0   created

    1   updated

    2   updated

    3   updated

B   0   created

    1   updated

    2   updated

C   0   created

    1   updated

    2   updated

    3   updated

    4   updated

    5   updated

I also tried something like

df.groupby("key")["key"].count().apply(range).apply(pandas.Series).stack()

but the order is not preserved, so I can't apply the result as an index. Besides, I feel it overkill for an operation that looks quite standard...

Any idea?

python python-3.x pandas

asked Nov 15 '18 at 21:59

Cilyan

3,79011930

asked Nov 15 '18 at 21:59

Cilyan

3,79011930

asked Nov 15 '18 at 21:59

Cilyan

3,79011930

asked Nov 15 '18 at 21:59

Cilyan

3,79011930

asked Nov 15 '18 at 21:59

Cilyan

3,79011930

Related? stackoverflow.com/questions/41594703/…

– Evan
Nov 15 '18 at 22:08

add a comment |

Related? stackoverflow.com/questions/41594703/…

– Evan
Nov 15 '18 at 22:08

Related? stackoverflow.com/questions/41594703/…

– Evan
Nov 15 '18 at 22:08

add a comment |

2 Answers
2

active

oldest

votes

`groupby` + `cumcount`

Here are a couple of ways:

# new version thanks @ScottBoston

df = df.set_index(['key', df.groupby('key').cumcount()])

       .rename_axis(['key','count'])



# original version

df = df.assign(count=df.groupby('key').cumcount())

       .set_index(['key', 'count'])



print(df)



             event

key count         

A   0      created

    1      updated

    2      updated

    3      updated

B   0      created

    1      updated

    2      updated

C   0      created

    1      updated

    2      updated

    3      updated

    4      updated

    5      updated

edited Nov 15 '18 at 23:14

answered Nov 15 '18 at 22:07

jpp

102k2165115

3

df.set_index(['key',df.groupby('key').cumcount()]).rename_axis(['key','count'])

– Scott Boston
Nov 15 '18 at 22:34

1

@ScottBoston haha... I was just going to suggest: df.set_index(['key', df.groupby('key').cumcount().rename('count')])

– Jon Clements♦
Nov 15 '18 at 22:38

Yeah... let's get rid of that extra assign.

– Scott Boston
Nov 15 '18 at 22:41

1

@Scott Umm... timeit wise, your rename_axis is way ahead... .rename() is bloomin' slow for some reason... might you - I do have intermittent things running in the background so not the most of reliable timeits, but rename_axis is a good 3 times faster than assign...

– Jon Clements♦
Nov 15 '18 at 22:48

Thanks for the answer. I used @ScottBoston 's proposal to shorten the line.

– Cilyan
Nov 15 '18 at 23:01

|
show 1 more comment

You can do this in numpy like this:

# df like in OP

keys = df['key'].values

# detect indices where key changes value

change = np.zeros(keys.size, dtype=int)

change[1:] = keys[1:] != keys[:-1]

# naive sequential number

seq = np.arange(keys.size)

# offset by seq at most recent change

offset = np.maximum.accumulate(change * seq)

df['seq'] = seq - offset

print(df.set_index(['key', 'seq']))



           event

key seq         

A   0    created

    1    updated

    2    updated

    3    updated

B   0    created

    1    updated

    2    updated

C   0    created

    1    updated

    2    updated

    3    updated

    4    updated

    5    updated

answered Nov 15 '18 at 22:20

Matthias Ossadnik

60937

Thank you for the suggestion. :)

– Cilyan
Nov 15 '18 at 23:02

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53328489%2fpandas-enumerate-duplicates-in-index%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

`groupby` + `cumcount`

Here are a couple of ways:

# new version thanks @ScottBoston

df = df.set_index(['key', df.groupby('key').cumcount()])

       .rename_axis(['key','count'])



# original version

df = df.assign(count=df.groupby('key').cumcount())

       .set_index(['key', 'count'])



print(df)



             event

key count         

A   0      created

    1      updated

    2      updated

    3      updated

B   0      created

    1      updated

    2      updated

C   0      created

    1      updated

    2      updated

    3      updated

    4      updated

    5      updated

edited Nov 15 '18 at 23:14

answered Nov 15 '18 at 22:07

jpp

102k2165115

3

df.set_index(['key',df.groupby('key').cumcount()]).rename_axis(['key','count'])

– Scott Boston
Nov 15 '18 at 22:34

1

@ScottBoston haha... I was just going to suggest: df.set_index(['key', df.groupby('key').cumcount().rename('count')])

– Jon Clements♦
Nov 15 '18 at 22:38

Yeah... let's get rid of that extra assign.

– Scott Boston
Nov 15 '18 at 22:41

1

@Scott Umm... timeit wise, your rename_axis is way ahead... .rename() is bloomin' slow for some reason... might you - I do have intermittent things running in the background so not the most of reliable timeits, but rename_axis is a good 3 times faster than assign...

– Jon Clements♦
Nov 15 '18 at 22:48

Thanks for the answer. I used @ScottBoston 's proposal to shorten the line.

– Cilyan
Nov 15 '18 at 23:01

|
show 1 more comment

`groupby` + `cumcount`

Here are a couple of ways:

# new version thanks @ScottBoston

df = df.set_index(['key', df.groupby('key').cumcount()])

       .rename_axis(['key','count'])



# original version

df = df.assign(count=df.groupby('key').cumcount())

       .set_index(['key', 'count'])



print(df)



             event

key count         

A   0      created

    1      updated

    2      updated

    3      updated

B   0      created

    1      updated

    2      updated

C   0      created

    1      updated

    2      updated

    3      updated

    4      updated

    5      updated

edited Nov 15 '18 at 23:14

answered Nov 15 '18 at 22:07

jpp

102k2165115

3

df.set_index(['key',df.groupby('key').cumcount()]).rename_axis(['key','count'])

– Scott Boston
Nov 15 '18 at 22:34

1

@ScottBoston haha... I was just going to suggest: df.set_index(['key', df.groupby('key').cumcount().rename('count')])

– Jon Clements♦
Nov 15 '18 at 22:38

Yeah... let's get rid of that extra assign.

– Scott Boston
Nov 15 '18 at 22:41

1

@Scott Umm... timeit wise, your rename_axis is way ahead... .rename() is bloomin' slow for some reason... might you - I do have intermittent things running in the background so not the most of reliable timeits, but rename_axis is a good 3 times faster than assign...

– Jon Clements♦
Nov 15 '18 at 22:48

Thanks for the answer. I used @ScottBoston 's proposal to shorten the line.

– Cilyan
Nov 15 '18 at 23:01

|
show 1 more comment

`groupby` + `cumcount`

Here are a couple of ways:

# new version thanks @ScottBoston

df = df.set_index(['key', df.groupby('key').cumcount()])

       .rename_axis(['key','count'])



# original version

df = df.assign(count=df.groupby('key').cumcount())

       .set_index(['key', 'count'])



print(df)



             event

key count         

A   0      created

    1      updated

    2      updated

    3      updated

B   0      created

    1      updated

    2      updated

C   0      created

    1      updated

    2      updated

    3      updated

    4      updated

    5      updated

edited Nov 15 '18 at 23:14

answered Nov 15 '18 at 22:07

jpp

102k2165115

`groupby` + `cumcount`

Here are a couple of ways:

# new version thanks @ScottBoston

df = df.set_index(['key', df.groupby('key').cumcount()])

       .rename_axis(['key','count'])



# original version

df = df.assign(count=df.groupby('key').cumcount())

       .set_index(['key', 'count'])



print(df)



             event

key count         

A   0      created

    1      updated

    2      updated

    3      updated

B   0      created

    1      updated

    2      updated

C   0      created

    1      updated

    2      updated

    3      updated

    4      updated

    5      updated

edited Nov 15 '18 at 23:14

answered Nov 15 '18 at 22:07

jpp

102k2165115

edited Nov 15 '18 at 23:14

answered Nov 15 '18 at 22:07

jpp

102k2165115

answered Nov 15 '18 at 22:07

jpp

102k2165115

answered Nov 15 '18 at 22:07

jpp

102k2165115

3

df.set_index(['key',df.groupby('key').cumcount()]).rename_axis(['key','count'])

– Scott Boston
Nov 15 '18 at 22:34

1

@ScottBoston haha... I was just going to suggest: df.set_index(['key', df.groupby('key').cumcount().rename('count')])

– Jon Clements♦
Nov 15 '18 at 22:38

Yeah... let's get rid of that extra assign.

– Scott Boston
Nov 15 '18 at 22:41

1

@Scott Umm... timeit wise, your rename_axis is way ahead... .rename() is bloomin' slow for some reason... might you - I do have intermittent things running in the background so not the most of reliable timeits, but rename_axis is a good 3 times faster than assign...

– Jon Clements♦
Nov 15 '18 at 22:48

Thanks for the answer. I used @ScottBoston 's proposal to shorten the line.

– Cilyan
Nov 15 '18 at 23:01

|
show 1 more comment

3

df.set_index(['key',df.groupby('key').cumcount()]).rename_axis(['key','count'])

– Scott Boston
Nov 15 '18 at 22:34

1

@ScottBoston haha... I was just going to suggest: df.set_index(['key', df.groupby('key').cumcount().rename('count')])

– Jon Clements♦
Nov 15 '18 at 22:38

Yeah... let's get rid of that extra assign.

– Scott Boston
Nov 15 '18 at 22:41

1

@Scott Umm... timeit wise, your rename_axis is way ahead... .rename() is bloomin' slow for some reason... might you - I do have intermittent things running in the background so not the most of reliable timeits, but rename_axis is a good 3 times faster than assign...

– Jon Clements♦
Nov 15 '18 at 22:48

Thanks for the answer. I used @ScottBoston 's proposal to shorten the line.

– Cilyan
Nov 15 '18 at 23:01

df.set_index(['key',df.groupby('key').cumcount()]).rename_axis(['key','count'])

– Scott Boston
Nov 15 '18 at 22:34

@ScottBoston haha... I was just going to suggest: df.set_index(['key', df.groupby('key').cumcount().rename('count')])

– Jon Clements♦
Nov 15 '18 at 22:38

Yeah... let's get rid of that extra assign.

– Scott Boston
Nov 15 '18 at 22:41

@Scott Umm... timeit wise, your rename_axis is way ahead... .rename() is bloomin' slow for some reason... might you - I do have intermittent things running in the background so not the most of reliable timeits, but rename_axis is a good 3 times faster than assign...

– Jon Clements♦
Nov 15 '18 at 22:48

Thanks for the answer. I used @ScottBoston 's proposal to shorten the line.

– Cilyan
Nov 15 '18 at 23:01

|
show 1 more comment

You can do this in numpy like this:

# df like in OP

keys = df['key'].values

# detect indices where key changes value

change = np.zeros(keys.size, dtype=int)

change[1:] = keys[1:] != keys[:-1]

# naive sequential number

seq = np.arange(keys.size)

# offset by seq at most recent change

offset = np.maximum.accumulate(change * seq)

df['seq'] = seq - offset

print(df.set_index(['key', 'seq']))



           event

key seq         

A   0    created

    1    updated

    2    updated

    3    updated

B   0    created

    1    updated

    2    updated

C   0    created

    1    updated

    2    updated

    3    updated

    4    updated

    5    updated

answered Nov 15 '18 at 22:20

Matthias Ossadnik

60937

Thank you for the suggestion. :)

– Cilyan
Nov 15 '18 at 23:02

add a comment |

You can do this in numpy like this:

# df like in OP

keys = df['key'].values

# detect indices where key changes value

change = np.zeros(keys.size, dtype=int)

change[1:] = keys[1:] != keys[:-1]

# naive sequential number

seq = np.arange(keys.size)

# offset by seq at most recent change

offset = np.maximum.accumulate(change * seq)

df['seq'] = seq - offset

print(df.set_index(['key', 'seq']))



           event

key seq         

A   0    created

    1    updated

    2    updated

    3    updated

B   0    created

    1    updated

    2    updated

C   0    created

    1    updated

    2    updated

    3    updated

    4    updated

    5    updated

answered Nov 15 '18 at 22:20

Matthias Ossadnik

60937

Thank you for the suggestion. :)

– Cilyan
Nov 15 '18 at 23:02

add a comment |

You can do this in numpy like this:

# df like in OP

keys = df['key'].values

# detect indices where key changes value

change = np.zeros(keys.size, dtype=int)

change[1:] = keys[1:] != keys[:-1]

# naive sequential number

seq = np.arange(keys.size)

# offset by seq at most recent change

offset = np.maximum.accumulate(change * seq)

df['seq'] = seq - offset

print(df.set_index(['key', 'seq']))



           event

key seq         

A   0    created

    1    updated

    2    updated

    3    updated

B   0    created

    1    updated

    2    updated

C   0    created

    1    updated

    2    updated

    3    updated

    4    updated

    5    updated

answered Nov 15 '18 at 22:20

Matthias Ossadnik

60937

You can do this in numpy like this:

# df like in OP

keys = df['key'].values

# detect indices where key changes value

change = np.zeros(keys.size, dtype=int)

change[1:] = keys[1:] != keys[:-1]

# naive sequential number

seq = np.arange(keys.size)

# offset by seq at most recent change

offset = np.maximum.accumulate(change * seq)

df['seq'] = seq - offset

print(df.set_index(['key', 'seq']))



           event

key seq         

A   0    created

    1    updated

    2    updated

    3    updated

B   0    created

    1    updated

    2    updated

C   0    created

    1    updated

    2    updated

    3    updated

    4    updated

    5    updated

answered Nov 15 '18 at 22:20

Matthias Ossadnik

60937

answered Nov 15 '18 at 22:20

Matthias Ossadnik

60937

answered Nov 15 '18 at 22:20

Matthias Ossadnik

60937

answered Nov 15 '18 at 22:20

Matthias Ossadnik

60937

Thank you for the suggestion. :)

– Cilyan
Nov 15 '18 at 23:02

add a comment |

Thank you for the suggestion. :)

– Cilyan
Nov 15 '18 at 23:02

Thank you for the suggestion. :)

– Cilyan
Nov 15 '18 at 23:02

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

W0xa0Enyp

搜尋此網誌

Vfrdtyky