pivot_longer: One function to cover transformations from wide to long form.

[1]:
import janitor
import pandas as pd
import numpy as np

Unpivoting (reshaping data from wide to long form) in pandas is typically executed with pd.melt, pd.wide_to_long, or pd.DataFrame.stack. However, there are scenarios where a few more steps are required to massage the data into the long form we desire. Take the dataframe below, copied from Stack Overflow:

[2]:
df = pd.DataFrame(
        {
            "id": [1, 2, 3],
            "M_start_date_1": [201709, 201709, 201709],
            "M_end_date_1": [201905, 201905, 201905],
            "M_start_date_2": [202004, 202004, 202004],
            "M_end_date_2": [202005, 202005, 202005],
            "F_start_date_1": [201803, 201803, 201803],
            "F_end_date_1": [201904, 201904, 201904],
            "F_start_date_2": [201912, 201912, 201912],
            "F_end_date_2": [202007, 202007, 202007],
        }
    )

df
[2]:
id M_start_date_1 M_end_date_1 M_start_date_2 M_end_date_2 F_start_date_1 F_end_date_1 F_start_date_2 F_end_date_2
0 1 201709 201905 202004 202005 201803 201904 201912 202007
1 2 201709 201905 202004 202005 201803 201904 201912 202007
2 3 201709 201905 202004 202005 201803 201904 201912 202007

Below is a beautiful solution from Stack Overflow:

[3]:
df1 = df.set_index('id')
df1.columns = df1.columns.str.split('_', expand=True)
df1 = (df1.stack(level=[0,2,3])
          .sort_index(level=[0,1], ascending=[True, False])
          .reset_index(level=[2,3], drop=True)
          .sort_index(axis=1, ascending=False)
          .rename_axis(['id','cod'])
          .reset_index())

df1
[3]:
id cod start end
0 1 M 201709 201905
1 1 M 202004 202005
2 1 F 201803 201904
3 1 F 201912 202007
4 2 M 201709 201905
5 2 M 202004 202005
6 2 F 201803 201904
7 2 F 201912 202007
8 3 M 201709 201905
9 3 M 202004 202005
10 3 F 201803 201904
11 3 F 201912 202007

We propose an alternative, based on pandas' melt, that abstracts the reshaping mechanism, allows the user to focus on the task, can be applied to other scenarios, and is chainable:

[4]:
result = df.pivot_longer(
            index="id",
            names_to=("cod", ".value"),
            names_pattern="(M|F)_(start|end)_.+",
            sort_by_appearance=True,
            )

result
[4]:
id cod start end
0 1 M 201709 201905
1 1 M 202004 202005
2 1 F 201803 201904
3 1 F 201912 202007
4 2 M 201709 201905
5 2 M 202004 202005
6 2 F 201803 201904
7 2 F 201912 202007
8 3 M 201709 201905
9 3 M 202004 202005
10 3 F 201803 201904
11 3 F 201912 202007
[5]:
df1.equals(result)
[5]:
True

pivot_longer is not a new idea; it combines ideas from R's tidyr and data.table, and is built on pandas' powerful melt function.

pivot_longer can melt dataframes easily; it is just a wrapper around pandas' melt.

Source Data

[6]:
index = pd.MultiIndex.from_tuples([('person', 'A'), ('person', 'B')])

df = pd.DataFrame({'first': ['John', 'Mary'],
                   'last': ['Doe', 'Bo'],
                   'height': [5.5, 6.0],
                   'weight': [130, 150]},
                   index=index)

df
[6]:
first last height weight
person A John Doe 5.5 130
B Mary Bo 6.0 150
[7]:
df.pivot_longer(index=['first','last'])
[7]:
first last variable value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0

If you want the data unpivoted in order of appearance, you can set sort_by_appearance to True:

[8]:
df.pivot_longer(
    index=['first','last'],
    sort_by_appearance = True
    )
[8]:
first last variable value
0 John Doe height 5.5
1 John Doe weight 130.0
2 Mary Bo height 6.0
3 Mary Bo weight 150.0

If you wish to reuse the original index, you can set ignore_index to False; note that the index labels will be repeated as necessary:

[9]:
df.pivot_longer(
    index=['first','last'],
    ignore_index = False
    )
[9]:
first last variable value
person A John Doe height 5.5
B Mary Bo height 6.0
A John Doe weight 130.0
B Mary Bo weight 150.0

You can also unpivot MultiIndex columns, the same way you would with pandas’ melt:

Source Data

[10]:
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 1, 1: 3, 2: 5},
                   'C': {0: 2, 1: 4, 2: 6}})
df.columns = [list('ABC'), list('DEF')]

df
[10]:
A B C
D E F
0 a 1 2
1 b 3 4
2 c 5 6
[11]:
df.pivot_longer(
    index = [("A", "D")],
    values_to = "num"
)
[11]:
(A, D) variable num
0 a B 1
1 b B 3
2 c B 5
3 a C 2
4 b C 4
5 c C 6
[12]:
df.pivot_longer(
    index = [("A", "D")],
    column_names = [("B", "E")]
)
[12]:
(A, D) variable value
0 a B 1
1 b B 3
2 c B 5

And just like melt, you can unpivot on a specific level, with column_level:

[13]:
df.pivot_longer(
    index = "A",
    column_names = "B",
    column_level = 0
)
[13]:
A variable value
0 a B 1
1 b B 3
2 c B 5

Note that when unpivoting MultiIndex columns, you need to pass a list of tuples to the index or column_names parameters.

Also, if names_sep or names_pattern is not None, then unpivoting on MultiIndex columns is not supported.

You can dynamically select columns with regular expressions, using the janitor.patterns function (inspired by the patterns function in R's data.table, and really just a wrapper around re.compile). This is especially handy when there are a lot of column names and you are lazy like me 😄
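Since patterns only compiles the regex, plain re can illustrate the column selection it performs. A minimal sketch, using a handful of the billboard column names:

```python
import re

# janitor.patterns is essentially a thin wrapper around re.compile,
# so the selection it drives can be mimicked with re directly
compiled = re.compile("^wk")
cols = ["year", "artist", "wk1", "wk2", "wk76"]
selected = [c for c in cols if compiled.search(c)]
print(selected)  # ['wk1', 'wk2', 'wk76']
```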

[14]:
url = 'https://github.com/tidyverse/tidyr/raw/master/data-raw/billboard.csv'
df = pd.read_csv(url)

df
[14]:
year artist track time date.entered wk1 wk2 wk3 wk4 wk5 ... wk67 wk68 wk69 wk70 wk71 wk72 wk73 wk74 wk75 wk76
0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 87 82.0 72.0 77.0 87.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 91 87.0 92.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2000 3 Doors Down Kryptonite 3:53 2000-04-08 81 70.0 68.0 67.0 66.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2000 3 Doors Down Loser 4:24 2000-10-21 76 76.0 72.0 69.0 67.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2000 504 Boyz Wobble Wobble 3:35 2000-04-15 57 34.0 25.0 17.0 17.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
312 2000 Yankee Grey Another Nine Minutes 3:10 2000-04-29 86 83.0 77.0 74.0 83.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
313 2000 Yearwood, Trisha Real Live Woman 3:55 2000-04-01 85 83.0 83.0 82.0 81.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
314 2000 Ying Yang Twins Whistle While You Tw... 4:19 2000-03-18 95 94.0 91.0 85.0 84.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
315 2000 Zombie Nation Kernkraft 400 3:30 2000-09-02 99 99.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
316 2000 matchbox twenty Bent 4:12 2000-04-29 60 37.0 29.0 24.0 22.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

317 rows × 81 columns

[15]:
# unpivot all columns that start with 'wk'
df.pivot_longer(column_names = janitor.patterns("^(wk)"),
                names_to='week')
[15]:
year artist track time date.entered week value
0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 wk1 87.0
1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 wk1 91.0
2 2000 3 Doors Down Kryptonite 3:53 2000-04-08 wk1 81.0
3 2000 3 Doors Down Loser 4:24 2000-10-21 wk1 76.0
4 2000 504 Boyz Wobble Wobble 3:35 2000-04-15 wk1 57.0
... ... ... ... ... ... ... ...
24087 2000 Yankee Grey Another Nine Minutes 3:10 2000-04-29 wk76 NaN
24088 2000 Yearwood, Trisha Real Live Woman 3:55 2000-04-01 wk76 NaN
24089 2000 Ying Yang Twins Whistle While You Tw... 4:19 2000-03-18 wk76 NaN
24090 2000 Zombie Nation Kernkraft 400 3:30 2000-09-02 wk76 NaN
24091 2000 matchbox twenty Bent 4:12 2000-04-29 wk76 NaN

24092 rows × 7 columns

You can also use pyjanitor’s select_columns syntax:

[16]:
df.pivot_longer(column_names = "wk*",
                names_to = 'week')
[16]:
year artist track time date.entered week value
0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 wk1 87.0
1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 wk1 91.0
2 2000 3 Doors Down Kryptonite 3:53 2000-04-08 wk1 81.0
3 2000 3 Doors Down Loser 4:24 2000-10-21 wk1 76.0
4 2000 504 Boyz Wobble Wobble 3:35 2000-04-15 wk1 57.0
... ... ... ... ... ... ... ...
24087 2000 Yankee Grey Another Nine Minutes 3:10 2000-04-29 wk76 NaN
24088 2000 Yearwood, Trisha Real Live Woman 3:55 2000-04-01 wk76 NaN
24089 2000 Ying Yang Twins Whistle While You Tw... 4:19 2000-03-18 wk76 NaN
24090 2000 Zombie Nation Kernkraft 400 3:30 2000-09-02 wk76 NaN
24091 2000 matchbox twenty Bent 4:12 2000-04-29 wk76 NaN

24092 rows × 7 columns

pivot_longer can also unpivot paired columns. In this regard, it is like pandas' wide_to_long, but with more flexibility and power. Let's look at an example from the pd.wide_to_long docs:

[17]:
df = pd.DataFrame({
    'famid': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'birth': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'ht1': [2.8, 2.9, 2.2, 2, 1.8, 1.9, 2.2, 2.3, 2.1],
    'ht2': [3.4, 3.8, 2.9, 3.2, 2.8, 2.4, 3.3, 3.4, 2.9]
})

df
[17]:
famid birth ht1 ht2
0 1 1 2.8 3.4
1 1 2 2.9 3.8
2 1 3 2.2 2.9
3 2 1 2.0 3.2
4 2 2 1.8 2.8
5 2 3 1.9 2.4
6 3 1 2.2 3.3
7 3 2 2.3 3.4
8 3 3 2.1 2.9

In the data above, the height (ht) is paired with age (the numbers). pd.wide_to_long can handle this easily:

[18]:
pd.wide_to_long(df, stubnames='ht', i=['famid', 'birth'], j='age')
[18]:
ht
famid birth age
1 1 1 2.8
2 3.4
2 1 2.9
2 3.8
3 1 2.2
2 2.9
2 1 1 2.0
2 3.2
2 1 1.8
2 2.8
3 1 1.9
2 2.4
3 1 1 2.2
2 3.3
2 1 2.3
2 3.4
3 1 2.1
2 2.9

Now let’s see how pivot_longer handles this:

[19]:
df.pivot_longer(index=['famid','birth'],
                names_to=('.value', 'age'),
                names_pattern=r"(ht)(\d)")
[19]:
famid birth age ht
0 1 1 1 2.8
1 1 2 1 2.9
2 1 3 1 2.2
3 2 1 1 2.0
4 2 2 1 1.8
5 2 3 1 1.9
6 3 1 1 2.2
7 3 2 1 2.3
8 3 3 1 2.1
9 1 1 2 3.4
10 1 2 2 3.8
11 1 3 2 2.9
12 2 1 2 3.2
13 2 2 2 2.8
14 2 3 2 2.4
15 3 1 2 3.3
16 3 2 2 3.4
17 3 3 2 2.9

The first observable difference is that pivot_longer is method chainable, while pd.wide_to_long is not. Now, let’s learn more about the .value variable.
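As an aside: since pd.wide_to_long is a free function rather than a method, the usual workaround for using it inside a chain is .pipe. A minimal, pandas-only sketch with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    'famid': [1, 1], 'birth': [1, 2],
    'ht1': [2.8, 2.9], 'ht2': [3.4, 3.8],
})

# pd.wide_to_long cannot be chained directly, but .pipe lets it
# participate in a method chain the way pivot_longer does natively
result = (
    df.pipe(pd.wide_to_long, stubnames='ht', i=['famid', 'birth'], j='age')
      .reset_index()
)
print(result)
```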

When .value is used in names_to, a pairing is created between names_to and names_pattern. For the example above, we get this pairing:

{".value": "(ht)", "age": "(\d)"}

This tells pivot_longer to keep the values associated with .value (ht) as column names, while the values not associated with .value (in this case, the numbers) are collated under a new column, age. Internally, pandas' str.extract is used to get the capture groups before reshaping. This level of abstraction, we believe, allows the user to focus on the task and get things done faster.
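You can see the str.extract step in isolation: applying the same regex to the column names yields one extracted column per capture group, and each extracted column lines up with one entry in names_to.

```python
import pandas as pd

columns = pd.Index(['ht1', 'ht2'])
# column 0 of the extract feeds .value (the new column names),
# column 1 feeds the new 'age' column
extracted = columns.str.extract(r"(ht)(\d)")
print(extracted)
```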

Note that if you want the data returned in order of appearance you can set sort_by_appearance to True:

[20]:
df.pivot_longer(
    index = ['famid','birth'],
    names_to = ('.value', 'age'),
    names_pattern = r"(ht)(\d)",
    sort_by_appearance = True,
    )
[20]:
famid birth age ht
0 1 1 1 2.8
1 1 1 2 3.4
2 1 2 1 2.9
3 1 2 2 3.8
4 1 3 1 2.2
5 1 3 2 2.9
6 2 1 1 2.0
7 2 1 2 3.2
8 2 2 1 1.8
9 2 2 2 2.8
10 2 3 1 1.9
11 2 3 2 2.4
12 3 1 1 2.2
13 3 1 2 3.3
14 3 2 1 2.3
15 3 2 2 3.4
16 3 3 1 2.1
17 3 3 2 2.9

Note that you are likely to get more speed when sort_by_appearance is False.

Note also that the values in the age column are of object dtype. You can change the dtype using pandas' astype method.
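For instance, a minimal sketch of the cast, using a small stand-in for the unpivoted frame:

```python
import pandas as pd

# the extracted 'age' values come back as strings (object dtype)
result = pd.DataFrame({'age': ['1', '1', '2'], 'ht': [2.8, 2.2, 3.4]})
# cast explicitly when you need numbers
result = result.astype({'age': int})
print(result.dtypes)
```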

We've seen that pd.wide_to_long handles this very well, so why bother? Let's look at another scenario where pd.wide_to_long would need a few more steps. Source Data:

[21]:
df = pd.DataFrame(
    {
        "off_loc": ["A", "B", "C", "D", "E", "F"],
        "pt_loc": ["G", "H", "I", "J", "K", "L"],
        "pt_lat": [
            100.07548220000001,
            75.191326,
            122.65134479999999,
            124.13553329999999,
            124.13553329999999,
            124.01028909999998,
        ],
        "off_lat": [
            121.271083,
            75.93845266,
            135.043791,
            134.51128400000002,
            134.484374,
            137.962195,
        ],
        "pt_long": [
            4.472089953,
            -144.387785,
            -40.45611048,
            -46.07156181,
            -46.07156181,
            -46.01594293,
        ],
        "off_long": [
            -7.188632000000001,
            -143.2288569,
            21.242563,
            40.937416999999996,
            40.78472,
            22.905889000000002,
        ],
    }
)

df
[21]:
off_loc pt_loc pt_lat off_lat pt_long off_long
0 A G 100.075482 121.271083 4.472090 -7.188632
1 B H 75.191326 75.938453 -144.387785 -143.228857
2 C I 122.651345 135.043791 -40.456110 21.242563
3 D J 124.135533 134.511284 -46.071562 40.937417
4 E K 124.135533 134.484374 -46.071562 40.784720
5 F L 124.010289 137.962195 -46.015943 22.905889

We can unpivot with pd.wide_to_long by first reorganising the columns:

[22]:
df1 = df.copy()
df1.columns = ["_".join(col.split("_")[::-1])
               for col in df1.columns]
df1
[22]:
loc_off loc_pt lat_pt lat_off long_pt long_off
0 A G 100.075482 121.271083 4.472090 -7.188632
1 B H 75.191326 75.938453 -144.387785 -143.228857
2 C I 122.651345 135.043791 -40.456110 21.242563
3 D J 124.135533 134.511284 -46.071562 40.937417
4 E K 124.135533 134.484374 -46.071562 40.784720
5 F L 124.010289 137.962195 -46.015943 22.905889

Now, we can unpivot:

[23]:
pd.wide_to_long(
    df1.reset_index(),
    stubnames=["loc", "lat", "long"],
    sep="_",
    i="index",
    j="set",
    suffix=".+",
)
[23]:
loc lat long
index set
0 off A 121.271083 -7.188632
pt G 100.075482 4.472090
1 off B 75.938453 -143.228857
pt H 75.191326 -144.387785
2 off C 135.043791 21.242563
pt I 122.651345 -40.456110
3 off D 134.511284 40.937417
pt J 124.135533 -46.071562
4 off E 134.484374 40.784720
pt K 124.135533 -46.071562
5 off F 137.962195 22.905889
pt L 124.010289 -46.015943

We can get the same transformed dataframe, with fewer lines, using pivot_longer:

[24]:
df.pivot_longer(
    names_to = ["set", ".value"],
    names_pattern = "(.+)_(.+)"
    )
[24]:
set loc lat long
0 off A 121.271083 -7.188632
1 off B 75.938453 -143.228857
2 off C 135.043791 21.242563
3 off D 134.511284 40.937417
4 off E 134.484374 40.784720
5 off F 137.962195 22.905889
6 pt G 100.075482 4.472090
7 pt H 75.191326 -144.387785
8 pt I 122.651345 -40.456110
9 pt J 124.135533 -46.071562
10 pt K 124.135533 -46.071562
11 pt L 124.010289 -46.015943
[25]:
# Another way to see the pairings,
# and what is linked to `.value`:

# names_to =      ["set", ".value"]
# names_pattern =  "(.+)_(.+)"
# column_names =     off_loc
#                    off_lat
#                    off_long

Again, the key here is the .value symbol. Pairing names_to with names_pattern, and the results from pandas' str.extract, we get:

set --> (.+) --> [off, pt]
.value --> (.+) --> [loc, lat, long]

All values associated with .value (loc, lat, long) remain as column names, while values not associated with .value (off, pt) are lumped into a new column, set.

Notice that we did not have to reset the index; pivot_longer takes care of that internally, allowing you to focus on what you want, so you can get it and move on.

Note that the unpivoting could also have been executed with names_sep:

[26]:
df.pivot_longer(
    names_to = ["set", ".value"],
    names_sep = "_",
    ignore_index = False,
    sort_by_appearance = True
    )
[26]:
set loc lat long
0 off A 121.271083 -7.188632
0 pt G 100.075482 4.472090
1 off B 75.938453 -143.228857
1 pt H 75.191326 -144.387785
2 off C 135.043791 21.242563
2 pt I 122.651345 -40.456110
3 off D 134.511284 40.937417
3 pt J 124.135533 -46.071562
4 off E 134.484374 40.784720
4 pt K 124.135533 -46.071562
5 off F 137.962195 22.905889
5 pt L 124.010289 -46.015943

Let's look at another example, from Stack Overflow:

[27]:
df = pd.DataFrame([{'a_1': 2, 'ab_1': 3,
                    'ac_1': 4, 'a_2': 5,
                    'ab_2': 6, 'ac_2': 7}])
df
[27]:
a_1 ab_1 ac_1 a_2 ab_2 ac_2
0 2 3 4 5 6 7

The data above requires extracting a, ab and ac from 1 and 2. This is another example of paired columns. We could solve this using pd.wide_to_long; in fact, there is a very good solution from Stack Overflow:

[28]:
df1 = df.copy()
df1['id'] = df1.index
pd.wide_to_long(df1, ['a','ab','ac'],i='id',j='num',sep='_')
[28]:
a ab ac
id num
0 1 2 3 4
2 5 6 7

Or you could simply pass the buck to pivot_longer:

[29]:
df.pivot_longer(
    names_to = ('.value', 'num'),
    names_sep = '_'
    )
[29]:
num a ab ac
0 1 2 3 4
1 2 5 6 7

In the solution above, we used the names_sep argument, as it is more convenient. Below are a few more examples to get you familiar with the .value symbol.
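Conceptually, names_sep amounts to splitting each column name on the separator, whereas names_pattern requires an explicit regex with capture groups. A small sketch of the split on the column names above:

```python
import pandas as pd

# each piece of the split lines up with one entry in names_to
columns = pd.Index(['a_1', 'ab_1', 'ac_1', 'a_2'])
parts = columns.str.split('_', expand=True)
print(parts)
```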

Source Data

[30]:
df = pd.DataFrame([[1,1,2,3,4,5,6],
                   [2,7,8,9,10,11,12]],
                  columns=['id', 'ax','ay','az','bx','by','bz'])

df
[30]:
id ax ay az bx by bz
0 1 1 2 3 4 5 6
1 2 7 8 9 10 11 12
[31]:
df.pivot_longer(
    index = 'id',
    names_to = ('name', '.value'),
    names_pattern = '(.)(.)'
    )
[31]:
id name x y z
0 1 a 1 2 3
1 2 a 7 8 9
2 1 b 4 5 6
3 2 b 10 11 12

For the code above, .value is paired with x, y, z (which become the new column names), while a and b are unpivoted into the name column.

In the dataframe below, we need to unpivot the data, keeping only the suffix hi, and pulling out the number between A and g. Source Data

[32]:
df = pd.DataFrame([{'id': 1, 'A1g_hi': 2,
                    'A2g_hi': 3, 'A3g_hi': 4,
                    'A4g_hi': 5}])
df
[32]:
id A1g_hi A2g_hi A3g_hi A4g_hi
0 1 2 3 4 5
[33]:
df.pivot_longer(
    index = 'id',
    names_to = ['time','.value'],
    names_pattern = r"A(\d)g_(hi)")
[33]:
id time hi
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5

Let's see an example where we have multiple values in a paired column, and we wish to split them into separate columns. Source Data:

[34]:
df = pd.DataFrame(
    {
        "Sony | TV | Model | value": {0: "A222", 1: "A234", 2: "A4345"},
        "Sony | TV | Quantity | value": {0: 5, 1: 5, 2: 4},
        "Sony | TV | Max-quant | value": {0: 10, 1: 9, 2: 9},
        "Panasonic | TV | Model | value": {0: "T232", 1: "S3424", 2: "X3421"},
        "Panasonic | TV | Quantity | value": {0: 1, 1: 5, 2: 1},
        "Panasonic | TV | Max-quant | value": {0: 10, 1: 12, 2: 11},
        "Sanyo | Radio | Model | value": {0: "S111", 1: "S1s1", 2: "S1s2"},
        "Sanyo | Radio | Quantity | value": {0: 4, 1: 2, 2: 4},
        "Sanyo | Radio | Max-quant | value": {0: 9, 1: 9, 2: 10},
    }
)

df
[34]:
Sony | TV | Model | value Sony | TV | Quantity | value Sony | TV | Max-quant | value Panasonic | TV | Model | value Panasonic | TV | Quantity | value Panasonic | TV | Max-quant | value Sanyo | Radio | Model | value Sanyo | Radio | Quantity | value Sanyo | Radio | Max-quant | value
0 A222 5 10 T232 1 10 S111 4 9
1 A234 5 9 S3424 5 12 S1s1 2 9
2 A4345 4 9 X3421 1 11 S1s2 4 10

The goal is to reshape the data into long format, with separate columns for Manufacturer (Sony, …), Device (TV, Radio), Model (S3424, …), maximum quantity, and quantity.

Below is the accepted solution on Stack Overflow:

[35]:
df1 = df.copy()
# Create a multiIndex column header
df1.columns = pd.MultiIndex.from_arrays(
    zip(*df1.columns.str.split(r"\s?\|\s?"))
)

# Reshape the dataframe using
# `set_index`, `droplevel`, and `stack`
(df1.stack([0, 1])
 .droplevel(1, axis=1)
 .set_index("Model", append=True)
 .rename_axis([None, "Manufacturer", "Device", "Model"])
 .sort_index(level=[1, 2, 3])
 .reset_index()
 .drop("level_0", axis=1)
 )

[35]:
Manufacturer Device Model Max-quant Quantity
0 Panasonic TV S3424 12.0 5.0
1 Panasonic TV T232 10.0 1.0
2 Panasonic TV X3421 11.0 1.0
3 Sanyo Radio S111 9.0 4.0
4 Sanyo Radio S1s1 9.0 2.0
5 Sanyo Radio S1s2 10.0 4.0
6 Sony TV A222 10.0 5.0
7 Sony TV A234 9.0 5.0
8 Sony TV A4345 9.0 4.0

Or, we could use pivot_longer, along with .value in names_to and a regular expression in names_pattern:

[36]:
df.pivot_longer(
    names_to = ("Manufacturer", "Device", ".value"),
    names_pattern = r"(.+)\|(.+)\|(.+)\|.*",
    )
[36]:
Manufacturer Device Model Quantity Max-quant
0 Sony TV A222 5 10
1 Sony TV A234 5 9
2 Sony TV A4345 4 9
3 Panasonic TV T232 1 10
4 Panasonic TV S3424 5 12
5 Panasonic TV X3421 1 11
6 Sanyo Radio S111 4 9
7 Sanyo Radio S1s1 2 9
8 Sanyo Radio S1s2 4 10

The cleanup (removal of the stray whitespace in the column names and extracted values) is left as an exercise for the reader.
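One possible cleanup, shown on a small stand-in frame with the kind of stray spaces that splitting on " | " leaves behind:

```python
import pandas as pd

# stand-in for the reshaped frame: spaces linger in both the new
# column names and the extracted values
result = pd.DataFrame({" Manufacturer ": ["Sony ", "Panasonic "],
                       " Device ": [" TV", " TV"]})
result.columns = result.columns.str.strip()
for col in ("Manufacturer", "Device"):
    result[col] = result[col].str.strip()
print(result)
```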

What if we are interested in unpivoting only a part of the entire dataframe? Source Data

[37]:
df = pd.DataFrame({'time': [1, 2, 3],
                   'factor': ['a','a','b'],
                   'variable1': [0,0,0],
                   'variable2': [0,0,1],
                   'variable3': [0,2,0],
                   'variable4': [2,0,1],
                   'variable5': [1,0,1],
                   'variable6': [0,1,1],
                   'O1V1': [0,0.2,-0.3],
                   'O1V2': [0,0.4,-0.9],
                   'O1V3': [0.5,0.2,-0.6],
                   'O1V4': [0.5,0.2,-0.6],
                   'O1V5': [0,0.2,-0.3],
                   'O1V6': [0,0.4,-0.9],
                   'O1V7': [0.5,0.2,-0.6],
                   'O1V8': [0.5,0.2,-0.6],
                   'O2V1': [0,0.5,0.3],
                   'O2V2': [0,0.2,0.9],
                   'O2V3': [0.6,0.1,-0.3],
                   'O2V4': [0.5,0.2,-0.6],
                   'O2V5': [0,0.5,0.3],
                   'O2V6': [0,0.2,0.9],
                   'O2V7': [0.6,0.1,-0.3],
                   'O2V8': [0.5,0.2,-0.6],
                   'O3V1': [0,0.7,0.4],
                   'O3V2': [0.9,0.2,-0.3],
                   'O3V3': [0.5,0.2,-0.7],
                   'O3V4': [0.5,0.2,-0.6],
                   'O3V5': [0,0.7,0.4],
                   'O3V6': [0.9,0.2,-0.3],
                   'O3V7': [0.5,0.2,-0.7],
                   'O3V8': [0.5,0.2,-0.6]})
df
[37]:
time factor variable1 variable2 variable3 variable4 variable5 variable6 O1V1 O1V2 ... O2V7 O2V8 O3V1 O3V2 O3V3 O3V4 O3V5 O3V6 O3V7 O3V8
0 1 a 0 0 0 2 1 0 0.0 0.0 ... 0.6 0.5 0.0 0.9 0.5 0.5 0.0 0.9 0.5 0.5
1 2 a 0 0 2 0 0 1 0.2 0.4 ... 0.1 0.2 0.7 0.2 0.2 0.2 0.7 0.2 0.2 0.2
2 3 b 0 1 0 1 1 1 -0.3 -0.9 ... -0.3 -0.6 0.4 -0.3 -0.7 -0.6 0.4 -0.3 -0.7 -0.6

3 rows × 32 columns

What is the task? This is copied verbatim from the source:

Each row of the data frame represents a time period. There are multiple ‘subjects’ being monitored, namely O1, O2, and O3. Each subject has 8 variables being measured. I need to convert this data into long format where each row contains the information for one subject at a given time period, but with only the first 4 subject variables, as well as the extra information about this time period in columns 2-4, but not columns 5-8.

Below is the accepted solution, using wide_to_long:

[38]:
df1 = df.rename(columns={x: x[2:]+x[1:2] for x in df.columns[df.columns.str.startswith('O')]})

df1 = pd.wide_to_long(df1, i=['time', 'factor']+[f'variable{i}' for i in range(1,7)],
                      j='id', stubnames=[f'V{i}' for i in range(1,9)], suffix='.*')

df1 = (df1.reset_index()
          .drop(columns=[f'V{i}' for i in range(5,9)]
                        +[f'variable{i}' for i in range(3,7)]))

df1
[38]:
time factor variable1 variable2 id V1 V2 V3 V4
0 1 a 0 0 1 0.0 0.0 0.5 0.5
1 1 a 0 0 2 0.0 0.0 0.6 0.5
2 1 a 0 0 3 0.0 0.9 0.5 0.5
3 2 a 0 0 1 0.2 0.4 0.2 0.2
4 2 a 0 0 2 0.5 0.2 0.1 0.2
5 2 a 0 0 3 0.7 0.2 0.2 0.2
6 3 b 0 1 1 -0.3 -0.9 -0.6 -0.6
7 3 b 0 1 2 0.3 0.9 -0.3 -0.6
8 3 b 0 1 3 0.4 -0.3 -0.7 -0.6

We can abstract the details and focus on the task with pivot_longer:

[39]:
df.pivot_longer(
    index = slice("time", "variable2"),
    column_names = janitor.patterns(".+V[1-4]$"),
    names_to = ("id", ".value"),
    names_pattern = ".(.)(.+)$",
    sort_by_appearance = True
)
[39]:
time factor variable1 variable2 id V1 V2 V3 V4
0 1 a 0 0 1 0.0 0.0 0.5 0.5
1 1 a 0 0 2 0.0 0.0 0.6 0.5
2 1 a 0 0 3 0.0 0.9 0.5 0.5
3 2 a 0 0 1 0.2 0.4 0.2 0.2
4 2 a 0 0 2 0.5 0.2 0.1 0.2
5 2 a 0 0 3 0.7 0.2 0.2 0.2
6 3 b 0 1 1 -0.3 -0.9 -0.6 -0.6
7 3 b 0 1 2 0.3 0.9 -0.3 -0.6
8 3 b 0 1 3 0.4 -0.3 -0.7 -0.6

One more example on the .value symbol for paired columns. Source Data:

[40]:
df = pd.DataFrame({'id': [1, 2],
                   'A_value': [50, 33],
                   'D_value': [60, 45]})
df
[40]:
id A_value D_value
0 1 50 60
1 2 33 45
[41]:
df.pivot_longer(
    index = 'id',
    names_to = ('value_type', '.value'),
    names_sep = '_'
    )
[41]:
id value_type value
0 1 A 50
1 2 A 33
2 1 D 60
3 2 D 45

There are scenarios where we need to unpivot the data and group values within the column names under new columns. Here the values in the column names will not become new column names, so we do not need the .value symbol. Let's see an example below. Source Data

[42]:
df = pd.DataFrame({'subject': [1, 2],
                   'A_target_word_gd': [1, 11],
                   'A_target_word_fd': [2, 12],
                   'B_target_word_gd': [3, 13],
                   'B_target_word_fd': [4, 14],
                   'subject_type': ['mild', 'moderate']})

df
[42]:
subject A_target_word_gd A_target_word_fd B_target_word_gd B_target_word_fd subject_type
0 1 1 2 3 4 mild
1 2 11 12 13 14 moderate

In the dataframe above, A and B represent conditions, while the suffixes gd and fd represent value types. We are not interested in the words in the middle (_target_word). We could solve it this way (this is the chosen solution, copied from Stack Overflow):

[43]:
new_df =(pd.melt(df,
                id_vars=['subject_type','subject'],
                var_name='abc')
           .sort_values(by=['subject', 'subject_type'])
         )
new_df['cond']=(new_df['abc']
                .apply(lambda x: (x.split('_'))[0])
                )
new_df['value_type']=(new_df
                      .pop('abc')
                      .apply(lambda x: (x.split('_'))[-1])
                      )
new_df

[43]:
subject_type subject value cond value_type
0 mild 1 1 A gd
2 mild 1 2 A fd
4 mild 1 3 B gd
6 mild 1 4 B fd
1 moderate 2 11 A gd
3 moderate 2 12 A fd
5 moderate 2 13 B gd
7 moderate 2 14 B fd

Or, we could just pass the buck to pivot_longer:

[44]:
df.pivot_longer(
    index = ["subject", "subject_type"],
    names_to = ("cond", "value_type"),
    names_pattern = "([A-Z]).*(gd|fd)",
)

[44]:
subject subject_type cond value_type value
0 1 mild A gd 1
1 2 moderate A gd 11
2 1 mild A fd 2
3 2 moderate A fd 12
4 1 mild B gd 3
5 2 moderate B gd 13
6 1 mild B fd 4
7 2 moderate B fd 14

In the code above, we pass the new column names to names_to ('cond', 'value_type'), and pass the groups to be extracted as a regular expression to names_pattern.

Here’s another example where pivot_longer abstracts the process and makes reshaping easy.

In the dataframe below, we would like to unpivot the data and separate the column names into individual columns (vault should be in an event column, 2012 should be in a year column, and f should be in a gender column). Source Data

[45]:
df = pd.DataFrame(
            {
                "country": ["United States", "Russia", "China"],
                "vault_2012_f": [
                    48.132,
                    46.366,
                    44.266,
                ],
                "vault_2012_m": [46.632, 46.866, 48.316],
                "vault_2016_f": [
                    46.866,
                    45.733,
                    44.332,
                ],
                "vault_2016_m": [45.865, 46.033, 45.0],
                "floor_2012_f": [45.366, 41.599, 40.833],
                "floor_2012_m": [45.266, 45.308, 45.133],
                "floor_2016_f": [45.999, 42.032, 42.066],
                "floor_2016_m": [43.757, 44.766, 43.799],
            }
        )
df

[45]:
country vault_2012_f vault_2012_m vault_2016_f vault_2016_m floor_2012_f floor_2012_m floor_2016_f floor_2016_m
0 United States 48.132 46.632 46.866 45.865 45.366 45.266 45.999 43.757
1 Russia 46.366 46.866 45.733 46.033 41.599 45.308 42.032 44.766
2 China 44.266 48.316 44.332 45.000 40.833 45.133 42.066 43.799

We could achieve this with a combination of pd.melt and pandas string methods (or janitor's deconcatenate_column method); or we could, again, pass the buck to pivot_longer:

[46]:
df.pivot_longer(
    index = "country",
    names_to = ["event", "year", "gender"],
    names_sep = "_",
    values_to = "score",
)
[46]:
country event year gender score
0 United States vault 2012 f 48.132
1 Russia vault 2012 f 46.366
2 China vault 2012 f 44.266
3 United States vault 2012 m 46.632
4 Russia vault 2012 m 46.866
5 China vault 2012 m 48.316
6 United States vault 2016 f 46.866
7 Russia vault 2016 f 45.733
8 China vault 2016 f 44.332
9 United States vault 2016 m 45.865
10 Russia vault 2016 m 46.033
11 China vault 2016 m 45.000
12 United States floor 2012 f 45.366
13 Russia floor 2012 f 41.599
14 China floor 2012 f 40.833
15 United States floor 2012 m 45.266
16 Russia floor 2012 m 45.308
17 China floor 2012 m 45.133
18 United States floor 2016 f 45.999
19 Russia floor 2016 f 42.032
20 China floor 2016 f 42.066
21 United States floor 2016 m 43.757
22 Russia floor 2016 m 44.766
23 China floor 2016 m 43.799

Again, if you want the data returned in order of appearance, you can turn on the sort_by_appearance parameter:

[47]:
df.pivot_longer(
    index = "country",
    names_to = ["event", "year", "gender"],
    names_sep = "_",
    values_to = "score",
    sort_by_appearance = True
)
[47]:
country event year gender score
0 United States vault 2012 f 48.132
1 United States vault 2012 m 46.632
2 United States vault 2016 f 46.866
3 United States vault 2016 m 45.865
4 United States floor 2012 f 45.366
5 United States floor 2012 m 45.266
6 United States floor 2016 f 45.999
7 United States floor 2016 m 43.757
8 Russia vault 2012 f 46.366
9 Russia vault 2012 m 46.866
10 Russia vault 2016 f 45.733
11 Russia vault 2016 m 46.033
12 Russia floor 2012 f 41.599
13 Russia floor 2012 m 45.308
14 Russia floor 2016 f 42.032
15 Russia floor 2016 m 44.766
16 China vault 2012 f 44.266
17 China vault 2012 m 48.316
18 China vault 2016 f 44.332
19 China vault 2016 m 45.000
20 China floor 2012 f 40.833
21 China floor 2012 m 45.133
22 China floor 2016 f 42.066
23 China floor 2016 m 43.799

One more feature that pivot_longer offers is the option to pass a list of regular expressions to names_pattern. This comes in handy when a single regex cannot encapsulate all the similar columns for reshaping to long form. This idea is inspired by the melt function in R's data.table. A couple of examples should make this clear.
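Conceptually, each regex in the list claims the columns it matches, and those columns are lumped under the corresponding entry of names_to. A hypothetical sketch of the matching step (illustrative column names, not pyjanitor's actual implementation):

```python
import re

columns = ['title', 'actor_1', 'actor_2',
           'actor_1_FB_likes', 'actor_2_FB_likes']
names_to = ['actor', 'likes']
names_pattern = [r'actor_\d$', r'FB_likes$']

# each regex collects its matching columns under the
# corresponding entry of names_to
groups = {name: [c for c in columns if re.search(pat, c)]
          for name, pat in zip(names_to, names_pattern)}
print(groups)
# {'actor': ['actor_1', 'actor_2'],
#  'likes': ['actor_1_FB_likes', 'actor_2_FB_likes']}
```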

Source Data

[48]:
df = pd.DataFrame(
    [{'title': 'Avatar',
  'actor_1': 'CCH_Pound…',
  'actor_2': 'Joel_Davi…',
  'actor_3': 'Wes_Studi',
  'actor_1_FB_likes': 1000,
  'actor_2_FB_likes': 936,
  'actor_3_FB_likes': 855},
 {'title': 'Pirates_of_the_Car…',
  'actor_1': 'Johnny_De…',
  'actor_2': 'Orlando_B…',
  'actor_3': 'Jack_Daven…',
  'actor_1_FB_likes': 40000,
  'actor_2_FB_likes': 5000,
  'actor_3_FB_likes': 1000},
 {'title': 'The_Dark_Knight_Ri…',
  'actor_1': 'Tom_Hardy',
  'actor_2': 'Christian…',
  'actor_3': 'Joseph_Gor…',
  'actor_1_FB_likes': 27000,
  'actor_2_FB_likes': 23000,
  'actor_3_FB_likes': 23000},
 {'title': 'John_Carter',
  'actor_1': 'Daryl_Sab…',
  'actor_2': 'Samantha_…',
  'actor_3': 'Polly_Walk…',
  'actor_1_FB_likes': 640,
  'actor_2_FB_likes': 632,
  'actor_3_FB_likes': 530},
 {'title': 'Spider-Man_3',
  'actor_1': 'J.K._Simm…',
  'actor_2': 'James_Fra…',
  'actor_3': 'Kirsten_Du…',
  'actor_1_FB_likes': 24000,
  'actor_2_FB_likes': 11000,
  'actor_3_FB_likes': 4000},
 {'title': 'Tangled',
  'actor_1': 'Brad_Garr…',
  'actor_2': 'Donna_Mur…',
  'actor_3': 'M.C._Gainey',
  'actor_1_FB_likes': 799,
  'actor_2_FB_likes': 553,
  'actor_3_FB_likes': 284}]
)

df
[48]:
title actor_1 actor_2 actor_3 actor_1_FB_likes actor_2_FB_likes actor_3_FB_likes
0 Avatar CCH_Pound… Joel_Davi… Wes_Studi 1000 936 855
1 Pirates_of_the_Car… Johnny_De… Orlando_B… Jack_Daven… 40000 5000 1000
2 The_Dark_Knight_Ri… Tom_Hardy Christian… Joseph_Gor… 27000 23000 23000
3 John_Carter Daryl_Sab… Samantha_… Polly_Walk… 640 632 530
4 Spider-Man_3 J.K._Simm… James_Fra… Kirsten_Du… 24000 11000 4000
5 Tangled Brad_Garr… Donna_Mur… M.C._Gainey 799 553 284

Above, we have a dataframe of movie titles, actors, and their Facebook likes. It would be great if we could transform this into long form, with just the title, the actor names, and the number of likes. Let’s look at a possible solution :

First, we reshape the columns, so that the numbers appear at the end.

[49]:
df1 = df.copy()
pat = r"(?P<actor>.+)_(?P<num>\d)_(?P<likes>.+)"
repl = lambda m: f"{m.group('actor')}_{m.group('likes')}_{m.group('num')}"
df1.columns = df1.columns.str.replace(pat, repl, regex=True)
df1
[49]:
title actor_1 actor_2 actor_3 actor_FB_likes_1 actor_FB_likes_2 actor_FB_likes_3
0 Avatar CCH_Pound… Joel_Davi… Wes_Studi 1000 936 855
1 Pirates_of_the_Car… Johnny_De… Orlando_B… Jack_Daven… 40000 5000 1000
2 The_Dark_Knight_Ri… Tom_Hardy Christian… Joseph_Gor… 27000 23000 23000
3 John_Carter Daryl_Sab… Samantha_… Polly_Walk… 640 632 530
4 Spider-Man_3 J.K._Simm… James_Fra… Kirsten_Du… 24000 11000 4000
5 Tangled Brad_Garr… Donna_Mur… M.C._Gainey 799 553 284

Now, we can reshape, using pd.wide_to_long :

[50]:
pd.wide_to_long(df1,
               stubnames = ['actor', 'actor_FB_likes'],
               i = 'title',
               j = 'group',
               sep = '_')
[50]:
actor actor_FB_likes
title group
Avatar 1 CCH_Pound… 1000
Pirates_of_the_Car… 1 Johnny_De… 40000
The_Dark_Knight_Ri… 1 Tom_Hardy 27000
John_Carter 1 Daryl_Sab… 640
Spider-Man_3 1 J.K._Simm… 24000
Tangled 1 Brad_Garr… 799
Avatar 2 Joel_Davi… 936
Pirates_of_the_Car… 2 Orlando_B… 5000
The_Dark_Knight_Ri… 2 Christian… 23000
John_Carter 2 Samantha_… 632
Spider-Man_3 2 James_Fra… 11000
Tangled 2 Donna_Mur… 553
Avatar 3 Wes_Studi 855
Pirates_of_the_Car… 3 Jack_Daven… 1000
The_Dark_Knight_Ri… 3 Joseph_Gor… 23000
John_Carter 3 Polly_Walk… 530
Spider-Man_3 3 Kirsten_Du… 4000
Tangled 3 M.C._Gainey 284

We could attempt to solve it with pivot_longer, using the .value symbol :

[51]:
df1.pivot_longer(
    index = 'title',
    names_to = (".value", "group"),
    names_pattern = r"(.+)_(\d)$"
    )
[51]:
title group actor actor_FB_likes
0 Avatar 1 CCH_Pound… 1000
1 Pirates_of_the_Car… 1 Johnny_De… 40000
2 The_Dark_Knight_Ri… 1 Tom_Hardy 27000
3 John_Carter 1 Daryl_Sab… 640
4 Spider-Man_3 1 J.K._Simm… 24000
5 Tangled 1 Brad_Garr… 799
6 Avatar 2 Joel_Davi… 936
7 Pirates_of_the_Car… 2 Orlando_B… 5000
8 The_Dark_Knight_Ri… 2 Christian… 23000
9 John_Carter 2 Samantha_… 632
10 Spider-Man_3 2 James_Fra… 11000
11 Tangled 2 Donna_Mur… 553
12 Avatar 3 Wes_Studi 855
13 Pirates_of_the_Car… 3 Jack_Daven… 1000
14 The_Dark_Knight_Ri… 3 Joseph_Gor… 23000
15 John_Carter 3 Polly_Walk… 530
16 Spider-Man_3 3 Kirsten_Du… 4000
17 Tangled 3 M.C._Gainey 284

What if we could just get our data in long form without the massaging? We know our data has a pattern: each column ends either with a number or with likes. Can’t we take advantage of that? Yes, we can (I know, I know; it sounds like a campaign slogan 🤪)

[52]:
df.pivot_longer(
    index = 'title',
    names_to = ("actor", "num_likes"),
    names_pattern = (r'\d$', 'likes$'),
    )
[52]:
title actor num_likes
0 Avatar CCH_Pound… 1000
1 Pirates_of_the_Car… Johnny_De… 40000
2 The_Dark_Knight_Ri… Tom_Hardy 27000
3 John_Carter Daryl_Sab… 640
4 Spider-Man_3 J.K._Simm… 24000
5 Tangled Brad_Garr… 799
6 Avatar Joel_Davi… 936
7 Pirates_of_the_Car… Orlando_B… 5000
8 The_Dark_Knight_Ri… Christian… 23000
9 John_Carter Samantha_… 632
10 Spider-Man_3 James_Fra… 11000
11 Tangled Donna_Mur… 553
12 Avatar Wes_Studi 855
13 Pirates_of_the_Car… Jack_Daven… 1000
14 The_Dark_Knight_Ri… Joseph_Gor… 23000
15 John_Carter Polly_Walk… 530
16 Spider-Man_3 Kirsten_Du… 4000
17 Tangled M.C._Gainey 284

A pairing of names_to and names_pattern results in :

{"actor": r'\d$', "num_likes": 'likes$'}

The first regex looks for columns that end with a number, while the second looks for columns that end with likes. pivot_longer lumps all the values from columns ending with a number into the actor column, and all the values from columns ending with likes into a new num_likes column. Under the hood, numpy.select and pd.Series.str.contains are used to sort the columns into the new columns.
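We can sketch that mechanism directly. This is a rough illustration of the idea, not pivot_longer’s actual internals: each column name is tested against the regexes in order, and numpy.select assigns it to the first names_to entry whose pattern matches.

```python
import numpy as np
import pandas as pd

# A few of the original column names (excluding the index column).
columns = pd.Series(
    ["actor_1", "actor_2", "actor_1_FB_likes", "actor_2_FB_likes"]
)

# One boolean mask per regex, in the same order as names_to.
conditions = [
    columns.str.contains(r"\d$"),    # ends with a number -> actor
    columns.str.contains("likes$"),  # ends with "likes"  -> num_likes
]

# The first matching pattern decides which new column a value lands in.
np.select(conditions, ["actor", "num_likes"], default=None)
# array(['actor', 'actor', 'num_likes', 'num_likes'], dtype=object)
```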

Again, it is about the goal: we are not interested in the numbers (1, 2, 3); we only need the names of the actors and their Facebook likes. pivot_longer aims to give as much flexibility as possible, in addition to ease of use, to allow the end user to focus on the task.

Let’s take a look at another example. Source Data :

[53]:
df = pd.DataFrame({'id': [0, 1],
 'Name': ['ABC', 'XYZ'],
 'code': [1, 2],
 'code1': [4, np.nan],
 'code2': ['8', 5],
 'type': ['S', 'R'],
 'type1': ['E', np.nan],
 'type2': ['T', 'U']})

df
[53]:
id Name code code1 code2 type type1 type2
0 0 ABC 1 4.0 8 S E T
1 1 XYZ 2 NaN 5 R NaN U

We cannot directly use pd.wide_to_long here without some massaging, as there is no definite suffix (the first code has no suffix); neither can we use .value, again because there is no suffix. However, we can see a pattern: some columns start with code, while others start with type. Let’s see how pivot_longer solves this, using a sequence of regular expressions in the names_pattern argument :

[54]:
df.pivot_longer(
    index = ["id", "Name"],
    names_to = ("code_all", "type_all"),
    names_pattern = ("^code", "^type")
    )
[54]:
id Name code_all type_all
0 0 ABC 1 S
1 1 XYZ 2 R
2 0 ABC 4 E
3 1 XYZ NaN NaN
4 0 ABC 8 T
5 1 XYZ 5 U

The key here is passing the right regular expressions, and ensuring each name in names_to is paired with the right regex in names_pattern; as such, every column that starts with code is lumped into the new code_all column, and likewise every column that starts with type into the type_all column. Easy and flexible, right?
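For intuition, the result above can be reproduced by hand. This is a hand-rolled sketch of what the list-of-patterns reshape amounts to, not what pivot_longer actually executes: stack each group of matching columns in column order, and repeat the index columns to match.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [0, 1],
                   'Name': ['ABC', 'XYZ'],
                   'code': [1, 2],
                   'code1': [4, np.nan],
                   'code2': ['8', 5],
                   'type': ['S', 'R'],
                   'type1': ['E', np.nan],
                   'type2': ['T', 'U']})

# Tile the index columns once per matching column (code, code1, code2),
# then flatten each group of columns in column-major (Fortran) order.
index = pd.concat([df[['id', 'Name']]] * 3, ignore_index=True)
out = index.assign(
    code_all=df.filter(regex='^code').to_numpy().ravel(order='F'),
    type_all=df.filter(regex='^type').to_numpy().ravel(order='F'),
)
```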

Let’s explore another example, from Stack Overflow :

[55]:
df = pd.DataFrame(
            [
                {
                    "ID": 1,
                    "DateRange1Start": "1/1/90",
                    "DateRange1End": "3/1/90",
                    "Value1": 4.4,
                    "DateRange2Start": "4/5/91",
                    "DateRange2End": "6/7/91",
                    "Value2": 6.2,
                    "DateRange3Start": "5/5/95",
                    "DateRange3End": "6/6/96",
                    "Value3": 3.3,
                }
            ])

df
[55]:
ID DateRange1Start DateRange1End Value1 DateRange2Start DateRange2End Value2 DateRange3Start DateRange3End Value3
0 1 1/1/90 3/1/90 4.4 4/5/91 6/7/91 6.2 5/5/95 6/6/96 3.3

In the dataframe above, we need to reshape the data to have a start date, an end date, and a value. For the DateRange columns, the numbers are embedded within the string, while for Value they are appended at the end. One possible solution is to reshape the columns so that the numbers are at the end :

[56]:
df1 = df.copy()
pat = r"(?P<head>.+)(?P<num>\d)(?P<tail>.+)"
repl = lambda m: f"{m.group('head')}{m.group('tail')}{m.group('num')}"
df1.columns = df1.columns.str.replace(pat, repl, regex=True)
df1
[56]:
ID DateRangeStart1 DateRangeEnd1 Value1 DateRangeStart2 DateRangeEnd2 Value2 DateRangeStart3 DateRangeEnd3 Value3
0 1 1/1/90 3/1/90 4.4 4/5/91 6/7/91 6.2 5/5/95 6/6/96 3.3

Now, we can unpivot:

[57]:
pd.wide_to_long(df1,
                stubnames = ['DateRangeStart',
                             'DateRangeEnd',
                             'Value'],
                i = 'ID',
                j = 'num')
[57]:
DateRangeStart DateRangeEnd Value
ID num
1 1 1/1/90 3/1/90 4.4
2 4/5/91 6/7/91 6.2
3 5/5/95 6/6/96 3.3

Using the .value symbol in pivot_longer:

[58]:
df1.pivot_longer(
    index = 'ID',
    names_to = [".value",'num'],
    names_pattern = r"(.+)(\d)$"
    )
[58]:
ID num DateRangeStart DateRangeEnd Value
0 1 1 1/1/90 3/1/90 4.4
1 1 2 4/5/91 6/7/91 6.2
2 1 3 5/5/95 6/6/96 3.3

Or, we could let pivot_longer worry about the massaging; simply pass names_pattern a list of regular expressions that match what we are after :

[59]:
df.pivot_longer(
    index = 'ID',
    names_to = ("DateRangeStart", "DateRangeEnd", "Value"),
    names_pattern = ("Start$", "End$", "^Value")
    )
[59]:
ID DateRangeStart DateRangeEnd Value
0 1 1/1/90 3/1/90 4.4
1 1 4/5/91 6/7/91 6.2
2 1 5/5/95 6/6/96 3.3

The code above looks for columns that end with Start (Start$) and aggregates their values into the DateRangeStart column; looks for columns that end with End (End$) and aggregates their values into the DateRangeEnd column; and finally looks for columns that start with Value (^Value) and aggregates their values into the Value column. Just know the patterns, and pair them accordingly. Again, the goal is a focus on the task, to make it simple for the end user.
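When pairing names and patterns like this, it can help to preview which columns each regex actually captures before reshaping. The snippet below is a small hypothetical check, not part of pivot_longer’s API :

```python
import pandas as pd

columns = pd.Index(['ID',
                    'DateRange1Start', 'DateRange1End', 'Value1',
                    'DateRange2Start', 'DateRange2End', 'Value2',
                    'DateRange3Start', 'DateRange3End', 'Value3'])

# Map each names_to entry to the columns its paired regex captures.
pairing = {"DateRangeStart": "Start$", "DateRangeEnd": "End$", "Value": "^Value"}
matches = {name: list(columns[columns.str.contains(pat)])
           for name, pat in pairing.items()}
# matches["Value"] -> ['Value1', 'Value2', 'Value3']
```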

Let’s look at another example. Source Data :

[60]:
df = pd.DataFrame({'Activity': ['P1', 'P2'],
 'General': ['AA', 'BB'],
 'm1': ['A1', 'B1'],
 't1': ['TA1', 'TB1'],
 'm2': ['A2', 'B2'],
 't2': ['TA2', 'TB2'],
 'm3': ['A3', 'B3'],
 't3': ['TA3', 'TB3']})

df
[60]:
Activity General m1 t1 m2 t2 m3 t3
0 P1 AA A1 TA1 A2 TA2 A3 TA3
1 P2 BB B1 TB1 B2 TB2 B3 TB3

This is a solution provided by yours truly :

[61]:
 (pd.wide_to_long(df,
                  i = ["Activity", "General"],
                  stubnames = ["t", "m"],
                  j = "number")
    .set_axis(["Task", "M"],
              axis = "columns")
    .droplevel(-1)
    .reset_index()
     )
[61]:
Activity General Task M
0 P1 AA TA1 A1
1 P1 AA TA2 A2
2 P1 AA TA3 A3
3 P2 BB TB1 B1
4 P2 BB TB2 B2
5 P2 BB TB3 B3

Or, we could use pivot_longer, abstract the details, and focus on the task :

[62]:
df.pivot_longer(
    index = ['Activity','General'],
    names_pattern = ['^m','^t'],
    names_to = ['M','Task']
    )
[62]:
Activity General M Task
0 P1 AA A1 TA1
1 P2 BB B1 TB1
2 P1 AA A2 TA2
3 P2 BB B2 TB2
4 P1 AA A3 TA3
5 P2 BB B3 TB3

Alright, another example :

Source Data

[63]:
df = pd.DataFrame({'Name': ['John', 'Chris', 'Alex'],
 'activity1': ['Birthday', 'Sleep Over', 'Track Race'],
 'number_activity_1': [1, 2, 4],
 'attendees1': [14, 18, 100],
 'activity2': ['Sleep Over', 'Painting', 'Birthday'],
 'number_activity_2': [4, 5, 1],
 'attendees2': [10, 8, 5]})

df
[63]:
Name activity1 number_activity_1 attendees1 activity2 number_activity_2 attendees2
0 John Birthday 1 14 Sleep Over 4 10
1 Chris Sleep Over 2 18 Painting 5 8
2 Alex Track Race 4 100 Birthday 1 5

The task here is to unpivot the data, grouping the values under three new columns (“activity”, “number_activity”, and “attendees”).

We can see that there is a pattern to the data; let’s create a list of regular expressions that match the patterns and pass to names_pattern:

[64]:
df.pivot_longer(
    index = 'Name',
    names_to = ('activity','number_activity','attendees'),
    names_pattern = ("^activity","^number_activity","^attendees")
    )

[64]:
Name activity number_activity attendees
0 John Birthday 1 14
1 Chris Sleep Over 2 18
2 Alex Track Race 4 100
3 John Sleep Over 4 10
4 Chris Painting 5 8
5 Alex Birthday 1 5

Alright, let’s look at one final example:

Source Data

[65]:
df = pd.DataFrame({'Location': ['Madrid', 'Madrid', 'Rome', 'Rome'],
 'Account': ['ABC', 'XYX', 'ABC', 'XYX'],
 'Y2019:MTD:January:Expense': [4354, 769867, 434654, 632556456],
 'Y2019:MTD:January:Income': [56456, 32556456, 5214, 46724423],
 'Y2019:MTD:February:Expense': [235423, 6785423, 235423, 46588]})

df
[65]:
Location Account Y2019:MTD:January:Expense Y2019:MTD:January:Income Y2019:MTD:February:Expense
0 Madrid ABC 4354 56456 235423
1 Madrid XYX 769867 32556456 6785423
2 Rome ABC 434654 5214 235423
3 Rome XYX 632556456 46724423 46588
[66]:
df.pivot_longer(index = ['Location','Account'],
                names_to=("year", "month", ".value"),
                names_pattern=r"Y(.+):MTD:(.{3}).+(Income|Expense)",
                sort_by_appearance=True)


[66]:
Location Account year month Expense Income
0 Madrid ABC 2019 Jan 4354 56456.0
1 Madrid ABC 2019 Feb 235423 NaN
2 Madrid XYX 2019 Jan 769867 32556456.0
3 Madrid XYX 2019 Feb 6785423 NaN
4 Rome ABC 2019 Jan 434654 5214.0
5 Rome ABC 2019 Feb 235423 NaN
6 Rome XYX 2019 Jan 632556456 46724423.0
7 Rome XYX 2019 Feb 46588 NaN

pivot_longer does not solve all problems; no function does. Its aim is to make it easy to unpivot dataframes from wide to long form, while offering a lot of flexibility and power.