How to pivot a dataframe
























  • What is pivot?

  • How do I pivot?

  • Is this a pivot?

  • Long format to wide format?


I've seen a lot of questions that ask about pivot tables. Even if the askers don't know that they are asking about pivot tables, they usually are. It is virtually impossible to write a canonical question and answer that encompasses all aspects of pivoting...



... But I'm going to give it a go.





The problem with existing questions and answers is that the question often focuses on a nuance that the OP has trouble generalizing, so the OP cannot make use of the many good existing answers. However, none of the answers attempts to give a comprehensive explanation (because it's a daunting task).



Here are a few examples from my Google search:





  1. How to pivot a dataframe in Pandas?


    • Good question and answer. But the answer only answers the specific question with little explanation.




  2. pandas pivot table to data frame


    • In this question, the OP is concerned with the output of the pivot. Namely how the columns look. OP wanted it to look like R. This isn't very helpful for pandas users.




  3. pandas pivoting a dataframe, duplicate rows


    • Another decent question but the answer focuses on one method, namely pd.DataFrame.pivot




So whenever someone searches for pivot they get sporadic results that are likely not going to answer their specific question.





Setup



You may notice that I conspicuously named my columns and relevant column values to correspond with how I'm going to pivot in the answers below. Pay attention so that you get familiar with which column names go where to get the results you're looking for.



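import numpy as np
import pandas as pd

# Seed the legacy NumPy generator so the frame below is reproducible.
np.random.seed([3, 1415])
n = 20

cols = np.array(['key', 'row', 'item', 'col'])
# Random integer labels; the integer division leaves fewer distinct values for 'key' and 'item'.
arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)

# np.char.add is the public alias of numpy.core.defchararray.add.
# It prefixes every label with the name of its column.
df = pd.DataFrame(
    np.char.add(cols, arr1), columns=cols
).join(
    pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
)
print(df)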

     key   row   item   col  val0  val1
0   key0  row3  item1  col3  0.81  0.04
1   key1  row2  item1  col2  0.44  0.07
2   key1  row0  item1  col0  0.77  0.01
3   key0  row4  item0  col2  0.15  0.59
4   key1  row0  item2  col1  0.81  0.64
5   key1  row2  item2  col4  0.13  0.88
6   key2  row4  item1  col3  0.88  0.39
7   key1  row4  item1  col1  0.10  0.07
8   key1  row0  item2  col4  0.65  0.02
9   key1  row2  item0  col2  0.35  0.61
10  key2  row0  item2  col1  0.40  0.85
11  key2  row4  item1  col2  0.64  0.25
12  key0  row2  item2  col3  0.50  0.44
13  key0  row4  item1  col4  0.24  0.46
14  key1  row3  item2  col3  0.28  0.11
15  key0  row3  item1  col1  0.31  0.23
16  key0  row0  item2  col3  0.86  0.01
17  key0  row4  item0  col3  0.64  0.21
18  key2  row2  item2  col0  0.13  0.45
19  key0  row2  item0  col4  0.37  0.70


Question(s)




  1. Why do I get ValueError: Index contains duplicate entries, cannot reshape?



  2. How do I pivot df such that the col values are columns, row values are the index, and mean of val0 are the values?



    col   col0   col1   col2   col3  col4
    row
    row0  0.77  0.605    NaN  0.860  0.65
    row2  0.13    NaN  0.395  0.500  0.25
    row3   NaN  0.310    NaN  0.545   NaN
    row4   NaN  0.100  0.395  0.760  0.24



  3. How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?



    col   col0   col1   col2   col3  col4
    row
    row0  0.77  0.605  0.000  0.860  0.65
    row2  0.13  0.000  0.395  0.500  0.25
    row3  0.00  0.310  0.000  0.545  0.00
    row4  0.00  0.100  0.395  0.760  0.24



  4. Can I get something other than mean, like maybe sum?



    col   col0  col1  col2  col3  col4
    row
    row0  0.77  1.21  0.00  0.86  0.65
    row2  0.13  0.00  0.79  0.50  0.50
    row3  0.00  0.31  0.00  1.09  0.00
    row4  0.00  0.10  0.79  1.52  0.24



  5. Can I do more than one aggregation at a time?



           sum                          mean
    col   col0  col1  col2  col3  col4  col0   col1   col2   col3  col4
    row
    row0  0.77  1.21  0.00  0.86  0.65  0.77  0.605  0.000  0.860  0.65
    row2  0.13  0.00  0.79  0.50  0.50  0.13  0.000  0.395  0.500  0.25
    row3  0.00  0.31  0.00  1.09  0.00  0.00  0.310  0.000  0.545  0.00
    row4  0.00  0.10  0.79  1.52  0.24  0.00  0.100  0.395  0.760  0.24



  6. Can I aggregate over multiple value columns?



          val0                             val1
    col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
    row
    row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
    row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
    row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
    row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46



  7. Can I subdivide by multiple columns?



    item item0             item1                         item2
    col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
    row
    row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
    row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
    row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
    row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00



  8. Or



    item      item0             item1                         item2
    col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
    key  row
    key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
         row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
         row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
         row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
    key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
         row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
         row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
         row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
         row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
         row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00



  9. Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?



    col   col0  col1  col2  col3  col4
    row
    row0     1     2     0     1     1
    row2     1     0     2     1     2
    row3     0     1     0     2     0
    row4     0     1     2     2     1











      python pandas group-by pivot pandas-groupby














      edited Nov 13 '17 at 6:20

























      asked Nov 7 '17 at 8:00









      piRSquared

























          1 Answer










          We start by answering the first question:



          Question 1




          Why do I get ValueError: Index contains duplicate entries, cannot reshape




          This occurs because pandas is attempting to reindex either a columns or an index object with duplicate entries. There are several methods that can perform a pivot, and some of them are not well suited to duplicates in the keys they are asked to pivot on. For example, consider pd.DataFrame.pivot. I know there are duplicate entries that share the row and col values:



          df.duplicated(['row', 'col']).any()

          True


          So when I pivot using



          df.pivot(index='row', columns='col', values='val0')


          I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:



          df.set_index(['row', 'col'])['val0'].unstack()
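
          To make the failure easy to reproduce in isolation, here is a minimal sketch on a hypothetical three-row frame (toy and its values are my own illustration, not part of the setup above). It shows pivot raising the error and pivot_table resolving the duplicates by aggregating them.

```python
import pandas as pd

# Hypothetical toy frame: the pair ('r0', 'c0') occurs twice, so it is not unique.
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1'],
    'col': ['c0', 'c0', 'c1'],
    'val': [1.0, 3.0, 5.0],
})

# pivot cannot decide which of the two values belongs in cell (r0, c0).
try:
    toy.pivot(index='row', columns='col', values='val')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print(err)

# pivot_table aggregates the duplicates (here with the mean) instead of failing.
fixed = toy.pivot_table(index='row', columns='col', values='val', aggfunc='mean')
print(fixed)
```

          The two values that collide in cell (r0, c0) are averaged, and the pairs that never occur stay missing.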


          Here is a list of idioms we can use to pivot





          1. pd.DataFrame.groupby + pd.DataFrame.unstack


            • Good general approach for doing just about any type of pivot

            • You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you unstack the levels that you want to be in the column index.




          2. pd.DataFrame.pivot_table


            • A glorified version of groupby with a more intuitive API. For many people, this is the preferred approach, and it is the approach intended by the developers.

            • Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.




          3. pd.DataFrame.set_index + pd.DataFrame.unstack


            • Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.

            • Similar to the groupby paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We then unstack the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.




          4. pd.DataFrame.pivot


            • Very similar to set_index in that it shares the duplicate key limitation. The API is limited as well: in the pandas version used here it only takes scalar values for index, columns, and values (newer pandas releases also accept lists).

            • Similar to the pivot_table method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.




          5. pd.crosstab


            • This is a specialized version of pivot_table and in its purest form is the most intuitive way to perform several tasks.




          6. pd.factorize + np.bincount


            • This is a highly advanced technique that is very obscure but is very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.




          7. pd.get_dummies + pd.DataFrame.dot


            • I use this for cleverly performing cross tabulation.
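
          Idioms 6 and 7 are the least self-explanatory entries in the list, so here is a rough sketch of both applied to a cross tabulation. It assumes a hypothetical five-row frame (toy) and compares the results with pd.crosstab. Treat it as one possible way to write these idioms, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for this illustration.
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1', 'r1', 'r1'],
    'col': ['c0', 'c1', 'c0', 'c0', 'c1'],
})

# Idiom 6: encode each label as an integer code, then count (row, col) pairs.
i, rows = pd.factorize(toy['row'], sort=True)
j, cols = pd.factorize(toy['col'], sort=True)
n, m = len(rows), len(cols)
# i * m + j maps every (row, col) pair to one slot of a flat array of length n * m.
counts = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
fast = pd.DataFrame(counts, index=rows, columns=cols)

# Idiom 7: one-hot encode both columns and multiply the indicator matrices.
row_dummies = pd.get_dummies(toy['row']).astype(int)
col_dummies = pd.get_dummies(toy['col']).astype(int)
dotted = row_dummies.T.dot(col_dummies)

# Reference result from the dedicated function.
reference = pd.crosstab(toy['row'], toy['col'])
print(fast)
```

          All three frames hold the same counts. The bincount version works on plain integer arrays and skips the groupby machinery.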






          Examples



          For each subsequent question, I'll first answer it using pd.DataFrame.pivot_table and then provide alternatives that perform the same task.



          Question 3




          How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?






          • pd.DataFrame.pivot_table





            • fill_value is not set by default. I tend to set it appropriately. In this case I set it to 0. Notice I skipped question 2, as it's the same as this answer without the fill_value.


            • aggfunc='mean' is the default and I didn't have to set it. I included it to be explicit.



              df.pivot_table(
                  values='val0', index='row', columns='col',
                  fill_value=0, aggfunc='mean')

              col   col0   col1   col2   col3  col4
              row
              row0  0.77  0.605  0.000  0.860  0.65
              row2  0.13  0.000  0.395  0.500  0.25
              row3  0.00  0.310  0.000  0.545  0.00
              row4  0.00  0.100  0.395  0.760  0.24





          • pd.DataFrame.groupby



            df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)



          • pd.crosstab



            pd.crosstab(
                index=df['row'], columns=df['col'],
                values=df['val0'], aggfunc='mean').fillna(0)
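As a quick sanity check, the three idioms above can be compared side by side. This is only a sketch on a hypothetical toy frame (toy and val are illustrative names, not the df from the setup), assuming missing combinations should become 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; ('r0', 'c0') is duplicated and ('r0', 'c1') is missing.
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1', 'r1'],
    'col': ['c0', 'c0', 'c1', 'c0'],
    'val': [1.0, 2.0, 5.5, 7.5],
})

# 1. pivot_table with an explicit aggregation and fill value
a = toy.pivot_table(
    values='val', index='row', columns='col',
    fill_value=0, aggfunc='mean')

# 2. groupby on both keys, then move 'col' into the columns
b = toy.groupby(['row', 'col'])['val'].mean().unstack(fill_value=0)

# 3. crosstab with values and an aggregation, filling the gaps afterwards
c = pd.crosstab(
    index=toy['row'], columns=toy['col'],
    values=toy['val'], aggfunc='mean').fillna(0)

# All three should hold the same numbers in the same layout.
same = (
    np.allclose(a.to_numpy(dtype=float), b.to_numpy(dtype=float))
    and np.allclose(b.to_numpy(dtype=float), c.to_numpy(dtype=float))
)
```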





          Question 4




          Can I get something other than mean, like maybe sum?






          • pd.DataFrame.pivot_table



            df.pivot_table(
                values='val0', index='row', columns='col',
                fill_value=0, aggfunc='sum')

            col   col0  col1  col2  col3  col4
            row
            row0  0.77  1.21  0.00  0.86  0.65
            row2  0.13  0.00  0.79  0.50  0.50
            row3  0.00  0.31  0.00  1.09  0.00
            row4  0.00  0.10  0.79  1.52  0.24



          • pd.DataFrame.groupby



            df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)



          • pd.crosstab



            pd.crosstab(
                index=df['row'], columns=df['col'],
                values=df['val0'], aggfunc='sum').fillna(0)





          Question 5




          Can I do more than one aggregation at a time?




          Notice that for pivot_table and crosstab I passed a list of callables. On the other hand, groupby.agg is able to take strings for a limited number of special functions. groupby.agg would also have accepted the same callables we passed to the others, but the string function names are often more efficient because they dispatch to optimized implementations.





          • pd.DataFrame.pivot_table



            df.pivot_table(
                values='val0', index='row', columns='col',
                fill_value=0, aggfunc=[np.size, np.mean])

                  size                          mean
            col   col0  col1  col2  col3  col4  col0   col1   col2   col3  col4
            row
            row0     1     2     0     1     1  0.77  0.605  0.000  0.860  0.65
            row2     1     0     2     1     2  0.13  0.000  0.395  0.500  0.25
            row3     0     1     0     2     0  0.00  0.310  0.000  0.545  0.00
            row4     0     1     2     2     1  0.00  0.100  0.395  0.760  0.24



          • pd.DataFrame.groupby



            df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)



          • pd.crosstab



            pd.crosstab(
                index=df['row'], columns=df['col'],
                values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')
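To see what the groupby version produces when string function names are used, here is a small sketch on a hypothetical toy frame (toy and val are illustrative names only). Each aggregation becomes the outer level of the resulting column index.

```python
import pandas as pd

# Hypothetical toy data; ('r0', 'c0') occurs twice, ('r0', 'c1') never.
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1', 'r1'],
    'col': ['c0', 'c0', 'c1', 'c0'],
    'val': [1.0, 2.0, 5.5, 7.5],
})

# String names select the built-in group aggregations.
agg = (
    toy.groupby(['row', 'col'])['val']
       .agg(['size', 'mean'])
       .unstack(fill_value=0)
)

# Columns are now a two-level index: (aggregation, col value).
size_c0 = agg[('size', 'c0')].tolist()       # counts for column c0
mean_r0_c0 = agg.loc['r0', ('mean', 'c0')]   # mean of the duplicated pair
```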





          Question 6




          Can I aggregate over multiple value columns?






          • pd.DataFrame.pivot_table: we pass values=['val0', 'val1'], but we could have left that off, provided every remaining column can be aggregated with mean.



            df.pivot_table(
                values=['val0', 'val1'], index='row', columns='col',
                fill_value=0, aggfunc='mean')

                  val0                             val1
            col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
            row
            row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
            row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
            row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
            row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46



          • pd.DataFrame.groupby



            df.groupby(['row', 'col'])[['val0', 'val1']].mean().unstack(fill_value=0)
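A minimal sketch of aggregating two value columns at once, on a hypothetical toy frame (toy is an illustrative name, not the df from the setup). It selects the value columns with a list inside the brackets, which is the form recent pandas versions accept, and compares the result with pivot_table.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two value columns.
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1'],
    'col': ['c0', 'c0', 'c1'],
    'val0': [1.0, 2.0, 4.0],
    'val1': [10.0, 30.0, 50.0],
})

# Select both value columns with a list, aggregate, then unstack 'col'.
wide = (
    toy.groupby(['row', 'col'])[['val0', 'val1']]
       .mean()
       .unstack(fill_value=0)
)

# The equivalent pivot_table call for comparison.
pt = toy.pivot_table(
    values=['val0', 'val1'], index='row', columns='col',
    fill_value=0, aggfunc='mean')

same = np.allclose(wide.to_numpy(dtype=float), pt.to_numpy(dtype=float))
```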





          Question 7




          Can I subdivide by multiple columns?






          • pd.DataFrame.pivot_table



            df.pivot_table(
                values='val0', index='row', columns=['item', 'col'],
                fill_value=0, aggfunc='mean')

            item  item0             item1                         item2
            col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
            row
            row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
            row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
            row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
            row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00



          • pd.DataFrame.groupby



            df.groupby(
                ['row', 'item', 'col']
            )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(axis=1)





          Question 8




          Can I subdivide by multiple columns in both the index and the columns?






          • pd.DataFrame.pivot_table



            df.pivot_table(
                values='val0', index=['key', 'row'], columns=['item', 'col'],
                fill_value=0, aggfunc='mean')

            item       item0             item1                         item2
            col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
            key  row
            key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
                 row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
                 row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
                 row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
            key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
                 row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
                 row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
                 row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
            key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
                 row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
                 row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00



          • pd.DataFrame.groupby



            df.groupby(
                ['key', 'row', 'item', 'col']
            )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(axis=1)



          • pd.DataFrame.set_index works here because the set of keys is unique for both rows and columns



            df.set_index(
                ['key', 'row', 'item', 'col']
            )['val0'].unstack(['item', 'col']).fillna(0).sort_index(axis=1)





          Question 9




          Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?






          • pd.DataFrame.pivot_table



            df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')

            col   col0  col1  col2  col3  col4
            row
            row0     1     2     0     1     1
            row2     1     0     2     1     2
            row3     0     1     0     2     0
            row4     0     1     2     2     1



          • pd.DataFrame.groupby



            df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)



          • pd.crosstab



            pd.crosstab(df['row'], df['col'])



          • pd.factorize + np.bincount



            # get integer factorization `i` and unique values `r`
            # for column `'row'`
            i, r = pd.factorize(df['row'].values)
            # get integer factorization `j` and unique values `c`
            # for column `'col'`
            j, c = pd.factorize(df['col'].values)
            # `n` will be the number of rows
            # `m` will be the number of columns
            n, m = r.size, c.size
            # `i * m + j` is a clever way of counting the
            # factorization bins assuming a flat array of length
            # `n * m`. Which is why we subsequently reshape as `(n, m)`
            b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
            # BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
            pd.DataFrame(b, r, c)

                  col3  col2  col0  col1  col4
            row3     2     0     0     1     0
            row2     1     2     1     0     2
            row0     1     0     1     2     1
            row4     2     2     0     1     1
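The `i * m + j` step is easier to follow on a handful of values. The sketch below uses hypothetical toy labels (rows and cols are illustrative names, not columns of df) and traces each intermediate array. Note that factorize numbers labels in order of first appearance, which is why the result above is not sorted the way the crosstab output is.

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels, five observations.
rows = pd.Series(['r1', 'r0', 'r1', 'r0', 'r1'])
cols = pd.Series(['c0', 'c0', 'c1', 'c0', 'c1'])

i, r = pd.factorize(rows)   # i = [0, 1, 0, 1, 0], r = ['r1', 'r0']
j, c = pd.factorize(cols)   # j = [0, 0, 1, 0, 1], c = ['c0', 'c1']
n, m = r.size, c.size       # 2 row labels, 2 column labels

# Each (i, j) pair maps to one slot of a flat array of length n * m.
flat = i * m + j            # [0, 2, 1, 2, 1]

# Count how often each slot occurs, then fold back into an (n, m) grid.
counts = np.bincount(flat, minlength=n * m).reshape(n, m)
table = pd.DataFrame(counts, index=r, columns=c)

# Sorting both axes gives the same layout as pd.crosstab.
table_sorted = table.sort_index().sort_index(axis=1)
```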



          • pd.get_dummies



            # dtype=int keeps the indicators numeric so the dot product yields counts
            pd.get_dummies(df['row'], dtype=int).T.dot(pd.get_dummies(df['col'], dtype=int))

                  col0  col1  col2  col3  col4
            row0     1     2     0     1     1
            row2     1     0     2     1     2
            row3     0     1     0     2     0
            row4     0     1     2     2     1
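Why the dot product counts co-occurrences: each get_dummies frame has one row per observation with a single 1 marking its label, so multiplying the transposed row indicators by the column indicators adds 1 for every observation that carries both labels. A minimal sketch on a hypothetical toy frame (toy is an illustrative name), with integer indicators requested explicitly:

```python
import pandas as pd

# Hypothetical toy data, five observations.
toy = pd.DataFrame({
    'row': ['r1', 'r0', 'r1', 'r0', 'r1'],
    'col': ['c0', 'c0', 'c1', 'c0', 'c1'],
})

# One indicator column per label, one line per observation.
R = pd.get_dummies(toy['row'], dtype=int)   # columns: r0, r1
C = pd.get_dummies(toy['col'], dtype=int)   # columns: c0, c1

# (labels x observations) dot (observations x labels) -> co-occurrence counts
counts = R.T.dot(C)

# The result should match a plain cross tabulation.
matches_crosstab = (
    counts.to_numpy() == pd.crosstab(toy['row'], toy['col']).to_numpy()
).all()
```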







          • 5




            Woah! I missed this 'better than documents' answer all together. Good thing is, I actually get to upvote twice;)
            – Vaishali
            Nov 13 '17 at 19:04






          • 20




            Could you please consider extending official docs?
            – MaxU
            Dec 15 '17 at 10:31











          up vote
          148
          down vote



          accepted
          +500










          We start by answering the first question:



          Question 1




          Why do I get ValueError: Index contains duplicate entries, cannot reshape




          This occurs because pandas is attempting to reindex either a columns or index object with duplicate entries. There are varying methods to use that can perform a pivot. Some of them are not well suited to when there are duplicates of the keys in which it is being asked to pivot on. For example. Consider pd.DataFrame.pivot. I know there are duplicate entries that share the row and col values:



          df.duplicated(['row', 'col']).any()

          True


          So when I pivot using



          df.pivot(index='row', columns='col', values='val0')


          I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:



          df.set_index(['row', 'col'])['val0'].unstack()


          Here is a list of idioms we can use to pivot





          1. pd.DataFrame.groupby + pd.DataFrame.unstack


            • Good general approach for doing just about any type of pivot

            • You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you unstack the levels that you want to be in the column index.




          2. pd.DataFrame.pivot_table


            • A glorified version of groupby with more intuitive API. For many people, this is the preferred approach. And is the intended approach by the developers.

            • Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.




          3. pd.DataFrame.set_index + pd.DataFrame.unstack


            • Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.

            • Similar to the groupby paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We then unstack the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.




          4. pd.DataFrame.pivot


            • Very similar to set_index in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values for index, columns, values.

            • Similar to the pivot_table method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.




          5. pd.crosstab


            • This a specialized version of pivot_table and in it's purest form is the most intuitive way to perform several tasks.




          6. pd.factorize + np.bincount


            • This is a highly advanced technique that is very obscure but is very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.




          7. pd.get_dummies + pd.DataFrame.dot


            • I use this for cleverly performing cross tabulation.






          Examples



          What I'm going to do for each subsequent answer and question is to answer it using pd.DataFrame.pivot_table. Then I'll provide alternatives to perform the same task.



          Question 3




          How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?






          • pd.DataFrame.pivot_table





            • fill_value is not set by default. I tend to set it appropriately. In this case I set it to 0. Notice I skipped question 2 as it's the same as this answer without the fill_value


            • aggfunc='mean' is the default and I didn't have to set it. I included it to be explicit.



              df.pivot_table(
              values='val0', index='row', columns='col',
              fill_value=0, aggfunc='mean')

              col col0 col1 col2 col3 col4
              row
              row0 0.77 0.605 0.000 0.860 0.65
              row2 0.13 0.000 0.395 0.500 0.25
              row3 0.00 0.310 0.000 0.545 0.00
              row4 0.00 0.100 0.395 0.760 0.24





          • pd.DataFrame.groupby



            df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)



          • pd.crosstab



            pd.crosstab(
            index=df['row'], columns=df['col'],
            values=df['val0'], aggfunc='mean').fillna(0)





          Question 4




          Can I get something other than mean, like maybe sum?






          • pd.DataFrame.pivot_table



            df.pivot_table(
            values='val0', index='row', columns='col',
            fill_value=0, aggfunc='sum')

            col col0 col1 col2 col3 col4
            row
            row0 0.77 1.21 0.00 0.86 0.65
            row2 0.13 0.00 0.79 0.50 0.50
            row3 0.00 0.31 0.00 1.09 0.00
            row4 0.00 0.10 0.79 1.52 0.24



          • pd.DataFrame.groupby



            df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)



          • pd.crosstab



            pd.crosstab(
            index=df['row'], columns=df['col'],
            values=df['val0'], aggfunc='sum').fillna(0)





          Question 5




          Can I do more that one aggregation at a time?




          Notice that for pivot_table and cross_tab I needed to pass list of callables. On the other hand, groupby.agg is able to take strings for a limited number of special functions. groupby.agg would also have taken the same callables we passed to the others, but it is often more efficient to leverage the string function names as there are efficiencies to be gained.





          • pd.DataFrame.pivot_table



            df.pivot_table(
            values='val0', index='row', columns='col',
            fill_value=0, aggfunc=[np.size, np.mean])

            size mean
            col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
            row
            row0 1 2 0 1 1 0.77 0.605 0.000 0.860 0.65
            row2 1 0 2 1 2 0.13 0.000 0.395 0.500 0.25
            row3 0 1 0 2 0 0.00 0.310 0.000 0.545 0.00
            row4 0 1 2 2 1 0.00 0.100 0.395 0.760 0.24



          • pd.DataFrame.groupby



            df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)



          • pd.crosstab



            pd.crosstab(
            index=df['row'], columns=df['col'],
            values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')





          Question 6




          Can I aggregate over multiple value columns?






          • pd.DataFrame.pivot_table we pass values=['val0', 'val1'] but we could've left that off completely



            df.pivot_table(
            values=['val0', 'val1'], index='row', columns='col',
            fill_value=0, aggfunc='mean')

            val0 val1
            col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
            row
            row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02
            row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79
            row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00
            row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46



          • pd.DataFrame.groupby



            df.groupby(['row', 'col'])['val0', 'val1'].mean().unstack(fill_value=0)





          Question 7




          Can Subdivide by multiple columns?






          • pd.DataFrame.pivot_table



            df.pivot_table(
            values='val0', index='row', columns=['item', 'col'],
            fill_value=0, aggfunc='mean')

            item item0 item1 item2
            col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
            row
            row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65
            row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13
            row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00
            row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00



          • pd.DataFrame.groupby



            df.groupby(
            ['row', 'item', 'col']
            )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)





          Question 8




          Can Subdivide by multiple columns?






          • pd.DataFrame.pivot_table



            df.pivot_table(
                values='val0', index=['key', 'row'], columns=['item', 'col'],
                fill_value=0, aggfunc='mean')

            item      item0             item1                         item2
            col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
            key  row
            key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
                 row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
                 row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
                 row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
            key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
                 row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
                 row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
                 row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
            key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
                 row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
                 row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00



          • pd.DataFrame.groupby



            df.groupby(
                ['key', 'row', 'item', 'col']
            )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(axis=1)



          • pd.DataFrame.set_index works here because each combination of the four keys is unique, so there is nothing to aggregate



            df.set_index(
                ['key', 'row', 'item', 'col']
            )['val0'].unstack(['item', 'col']).fillna(0).sort_index(axis=1)
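The uniqueness requirement is easy to check before reaching for set_index. A minimal sketch, assuming a hypothetical toy frame (`toy` and `dup` are illustrative names, not part of the setup above): the reshape succeeds while every key pair is unique and raises ValueError once a pair is repeated.

```python
import pandas as pd

# Hypothetical toy data; every (row, col) pair occurs exactly once.
toy = pd.DataFrame({
    'row':  ['row0', 'row0', 'row1'],
    'col':  ['col0', 'col1', 'col0'],
    'val0': [1.0, 2.0, 3.0],
})

# With unique keys no aggregation is needed, so set_index + unstack works.
assert not toy.duplicated(['row', 'col']).any()
wide = toy.set_index(['row', 'col'])['val0'].unstack(fill_value=0)

# Appending a repeated (row0, col0) pair makes the reshape ambiguous.
dup = pd.concat([toy, toy.iloc[[0]]], ignore_index=True)
try:
    dup.set_index(['row', 'col'])['val0'].unstack()
    failed = False
except ValueError as err:
    failed = True
    print(err)
```

When the check with `duplicated` comes back True, fall back to groupby or pivot_table, which aggregate the repeated keys instead of failing.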





          Question 9




          Can I count how often each pair of row and column values occurs together, a.k.a. "cross tabulation"?






          • pd.DataFrame.pivot_table



            df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')

            col   col0  col1  col2  col3  col4
            row
            row0     1     2     0     1     1
            row2     1     0     2     1     2
            row3     0     1     0     2     0
            row4     0     1     2     2     1



          • pd.DataFrame.groupby



            df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)



          • pd.crosstab



            pd.crosstab(df['row'], df['col'])



          • pd.factorize + np.bincount



            # get integer factorization `i` and unique values `r`
            # for column `'row'`
            i, r = pd.factorize(df['row'].values)
            # get integer factorization `j` and unique values `c`
            # for column `'col'`
            j, c = pd.factorize(df['col'].values)
            # `n` will be the number of rows
            # `m` will be the number of columns
            n, m = r.size, c.size
            # `i * m + j` is a clever way of counting the
            # factorization bins assuming a flat array of length
            # `n * m`. Which is why we subsequently reshape as `(n, m)`
            b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
            # BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
            pd.DataFrame(b, r, c)

                  col3  col2  col0  col1  col4
            row3     2     0     0     1     0
            row2     1     2     1     0     2
            row0     1     0     1     2     1
            row4     2     2     0     1     1
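The `i * m + j` step is easier to follow on a tiny input. This is a sketch with made-up label arrays (`rows` and `cols` are hypothetical), tracing each intermediate value by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical labels: five observations of (row label, column label).
rows = np.array(['a', 'b', 'a', 'b', 'a'])
cols = np.array(['x', 'x', 'y', 'x', 'y'])

i, r = pd.factorize(rows)   # i = [0, 1, 0, 1, 0], r = ['a', 'b']
j, c = pd.factorize(cols)   # j = [0, 0, 1, 0, 1], c = ['x', 'y']
n, m = r.size, c.size       # 2 row labels, 2 column labels

# Cell (i, j) of an n-by-m grid sits at position i * m + j once the
# grid is flattened row by row.
flat = i * m + j            # [0, 2, 1, 2, 1]

# Count how often each flat position occurs, then restore the grid shape.
counts = np.bincount(flat, minlength=n * m).reshape(n, m)
table = pd.DataFrame(counts, r, c)
print(table)
```

The labels come out in order of first appearance, which is why the output above starts with col3 and row3. Passing `sort=True` to `pd.factorize` returns them sorted instead.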



          • pd.get_dummies



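            # dtype=int keeps the indicators numeric; recent pandas versions
            # return booleans by default, and a boolean dot product does not count
            pd.get_dummies(df['row'], dtype=int).T.dot(pd.get_dummies(df['col'], dtype=int))

                  col0  col1  col2  col3  col4
            row0     1     2     0     1     1
            row2     1     0     2     1     2
            row3     0     1     0     2     0
            row4     0     1     2     2     1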
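Why the dot product counts co-occurrences: each get_dummies frame is a 0/1 indicator matrix with one row per observation, so the transposed product sums, for every (row label, column label) pair, the observations where both indicators are 1. A small sketch on hypothetical toy data, compared against pd.crosstab:

```python
import pandas as pd

# Hypothetical toy data: five observations of (row label, column label).
toy = pd.DataFrame({
    'row': ['row0', 'row1', 'row0', 'row1', 'row0'],
    'col': ['col0', 'col0', 'col1', 'col0', 'col1'],
})

# dtype=int is an explicit choice so the product adds up 0/1 values.
r = pd.get_dummies(toy['row'], dtype=int)   # one indicator column per row label
c = pd.get_dummies(toy['col'], dtype=int)   # one indicator column per col label

# (labels x observations) dot (observations x labels) = co-occurrence counts
counts = r.T.dot(c)
expected = pd.crosstab(toy['row'], toy['col'])
print(counts)
```

Both `counts` and `expected` hold the same numbers; crosstab additionally names the index and column axes.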


























          • 5




            Woah! I missed this 'better than documents' answer all together. Good thing is, I actually get to upvote twice;)
            – Vaishali
            Nov 13 '17 at 19:04






          • 20




            Could you please consider extending official docs?
            – MaxU
            Dec 15 '17 at 10:31







































          edited Nov 14 '17 at 17:13

























          answered Nov 7 '17 at 8:00









          piRSquared








