janitor.pivot_longer

janitor.pivot_longer(df: pandas.core.frame.DataFrame, index: Optional[Union[List, Tuple, str, Pattern]] = None, column_names: Optional[Union[List, Tuple, str, Pattern]] = None, names_to: Optional[Union[List, Tuple, str]] = 'variable', values_to: Optional[str] = 'value', column_level: Optional[Union[int, str]] = None, names_sep: Optional[Union[str, Pattern]] = None, names_pattern: Optional[Union[List, Tuple, str, Pattern]] = None, sort_by_appearance: Optional[bool] = False, ignore_index: Optional[bool] = True) → pandas.core.frame.DataFrame[source]

Unpivots a DataFrame from ‘wide’ to ‘long’ format.

This method does not mutate the original DataFrame.

It is a wrapper around pd.melt and is meant to serve as a single point for transformations that require pd.melt or pd.wide_to_long.

It is modeled after the pivot_longer function in R’s tidyr package, and offers more functionality and flexibility than pd.wide_to_long.

This function is useful to massage a DataFrame into a format where one or more columns are considered measured variables, and all other columns are considered as identifier variables.

All measured variables are “unpivoted” (and typically duplicated) along the row axis.

See the Example notebooks for a more in depth exploration of pivot_longer.

Example 1: The following DataFrame contains heartrate data for patients treated with two different drugs, ‘a’ and ‘b’.

      name   a   b
0   Wilbur  67  56
1  Petunia  80  90
2  Gregory  64  50

The column names ‘a’ and ‘b’ are actually the names of a measured variable (i.e. the name of a drug), but the values are a different measured variable (heartrate). We would like to unpivot these ‘a’ and ‘b’ columns into a ‘drug’ column and a ‘heartrate’ column.

df = (pd.DataFrame(...)
      .pivot_longer(
          column_names = ['a', 'b'],
          names_to = 'drug',
          values_to = 'heartrate',
          sort_by_appearance = True
          )
    )

      name drug  heartrate
0   Wilbur    a         67
1   Wilbur    b         56
2  Petunia    a         80
3  Petunia    b         90
4  Gregory    a         64
5  Gregory    b         50

Note how the data is stacked in order of first appearance. If, however, you do not care for order of appearance, and want to wring out some more performance, you can set sort_by_appearance to False (the default is False).

df = (pd.DataFrame(...)
      .pivot_longer(
          column_names = ['a', 'b'],
          names_to = 'drug',
          values_to = 'heartrate',
          sort_by_appearance = False
          )
    )

        name     drug  heartrate
0       Wilbur     a    67
1       Petunia    a    80
2       Gregory    a    64
3       Wilbur     b    56
4       Petunia    b    90
5       Gregory    b    50

You can set ignore_index to False, if you wish to reuse the index from the source dataframe (the index will be repeated as many times as necessary):

df = (pd.DataFrame(...)
      .pivot_longer(
          column_names = ['a', 'b'],
          names_to = 'drug',
          values_to = 'heartrate',
          sort_by_appearance = False,
          ignore_index = False
          )
    )

        name     drug  heartrate
0       Wilbur     a    67
1       Petunia    a    80
2       Gregory    a    64
0       Wilbur     b    56
1       Petunia    b    90
2       Gregory    b    50

MultiIndex dataframes are unpivoted in the same form that you would expect from pandas’ melt:

    A  B  C
    D  E  F
0   a  1  2
1   b  3  4
2   c  5  6

df = (pd.DataFrame(...)
       .pivot_longer(
           index = [("A", "D")],
           names_to = ["first", "second"]
           )
    )

     (A, D)  first   second   value
0       a       B       E       1
1       b       B       E       3
2       c       B       E       5
3       a       C       F       2
4       b       C       F       4
5       c       C       F       6

You can also unpivot on a specific level:

df = (pd.DataFrame(...)
      .pivot_longer(
          index = "A",
          names_to = "first",
          column_level = 0
          )
    )

   A      first  value
0  a        B      1
1  b        B      3
2  c        B      5

Example 2: The dataframe below has year and month variables embedded within the column names.

      col1          2019-12      2020-01         2020-02
0       a          -1.085631    -1.506295       -2.426679
1       b           0.997345    -0.578600       -0.428913
2       c           0.282978     1.651437        1.265936

Pivot_longer can conveniently reshape the dataframe into long format, with new columns for the year and month. You simply pass in the new column names to names_to, and pass the hyphen ‘-‘ to the names_sep argument.

df = (pd.DataFrame(...)
      .pivot_longer(
          index = 'col1',
          names_to = ('year', 'month'),
          names_sep = '-',
          sort_by_appearance = True
          )
    )

   col1 year   month      value
0    a  2019     12     -1.085631
1    a  2020     01     -1.506295
2    a  2020     02     -2.426679
3    b  2019     12      0.997345
4    b  2020     01     -0.578600
5    b  2020     02     -0.428913
6    c  2019     12      0.282978
7    c  2020     01      1.651437
8    c  2020     02      1.265936

Example 3: The dataframe below has names embedded in it (‘measure1’, ‘measure2’) that we would love to reuse as column names.

    treat1-measure1     treat1-measure2 treat2-measure1 treat2-measure2
0                1              4                   2               5
1                2              5                   3               4

For this, we use the .value variable, which signals to pivot_longer to treat the part of the column names corresponding to .value as new column names. The .value variable is similar to stubnames in pandas’ wide_to_long function, but with more flexibility.

df = (pd.DataFrame(...)
      .pivot_longer(
          names_to = ("group", '.value'),
          names_sep = '-',
          sort_by_appearance = True
          )
    )

    group  measure1  measure2
0  treat1         1         4
1  treat2         2         5
2  treat1         2         5
3  treat2         3         4

Let’s break down the .value idea. When .value is used, pivot_longer creates a pairing. In the example above, we get a pairing {"group":["treat1", "treat2"], ".value":["measure1", "measure2"]}. All the values associated with .value become new column names, while those not associated with .value`(`treat1 and treat2) become values in a new column group. values_to is overridden during this process.

Note

The values not associated with “.value” (in the example above, this is the group column) are returned as object dtypes. You can change it to your preferred dtype using pandas’ astype method.

Example 4: You can also unpivot from wide to long using regular expressions

    n_1  n_2  n_3  pct_1  pct_2  pct_3
0   10   20   30   0.1    0.2    0.3

df = (pd.DataFrame(...)
      .pivot_longer(
          names_to = (".value", "name"),
          names_pattern = "(.*)_(.)"
          )
    )

    name    n  pct
0     1  10.0  0.1
1     2  20.0  0.2
2     3  30.0  0.3

The same idea of .value works here as well. Based on the capturing groups in the regex in names_pattern, we have two pairings –> {".value":["n", "pct"], "name":[1,2,3]}. Just like in the previous example, the values associated with .value become new column names, while those not associated with .value become values in the new column name.

Note that there are no limits to the pairing; however, you can only have one .value in names_to.

Example 5: You can also pass a list/tuple of regular expressions that match specific patterns to names_pattern, along with a list/tuple of new names to names_to; this can come in handy if .value falls short:

  GameID   Date         Visitor      Score_V       Home         Score_H
0  1     9/10/2020   Houston Texans     20    Kansas City Chiefs   34
1  2     9/13/2020   Seattle Seahawks   38    Atlanta Falcons      25
df = (pd.DataFrame(...)
      .pivot_longer(
            ['GameID','Date'],
            names_to=("Team","Score"),
            names_pattern=("^Visitor|Home", "^Score"))
    )

       GameID       Date              Team       Score
0       1       9/10/2020      Houston Texans     20
1       2       9/13/2020    Seattle Seahawks     38
2       1       9/10/2020  Kansas City Chiefs     34
3       2       9/13/2020     Atlanta Falcons     25

Note that in the code above, the number of entries in both names_to and names_pattern must match. Essentially, what the code does is look for columns that start with Visitor or Home (using the regex supplied) and puts all the values associated with these columns under a new column name Team. It then looks for columns that start with Score and collate all the values associated with these columns to a single column named Score.

You can also take advantage of janitor.patterns function, or the select_columns syntax, which allows selection of columns via a regular expression; this can come in handy if you have a lot of column names to pass to the index or column_names parameters, and you do not wish to manually type them all.

     name    wk1   wk2   wk3   wk4
0    Alice     5     9    20    22
1    Bob       7    11    17    33
2    Carla     6    13    39    40

df = (pd.DataFrame(...)
      .pivot_longer(index = janitor.patterns("^(?!wk)"))
      )

     name   variable  value
0   Alice      wk1      5
1     Bob      wk1      7
2   Carla      wk1      6
3   Alice      wk2      9
4     Bob      wk2     11
5   Carla      wk2     13
6   Alice      wk3     20
7     Bob      wk3     17
8   Carla      wk3     39
9   Alice      wk4     22
10    Bob      wk4     33
11  Carla      wk4     40

Note

Unpivoting a dataframe with MultiIndex columns, when either names_sep or names_pattern is provided is not supported.

Functional usage syntax:

import pandas as pd
import janitor as jn

df = pd.DataFrame(...)
df = jn.pivot_longer(
    df = df,
    index = [column1, column2, ...],
    column_names = [column3, column4, ...],
    names_to = new_column_name,
    names_sep = string/regular expression,
    names_pattern = string/regular expression,
    values_to= new_column_name,
    column_level=None/int/str,
    sort_by_appearance = True/False,
    ignore_index = True/False,
)

Method chaining syntax:

df = (
    pd.DataFrame(...)
    .pivot_longer(
        index = [column1, column2, ...],
        column_names = [column3, column4, ...],
        names_to = new_column_name,
        names_sep = string/regular expression,
        names_pattern = string/regular expression,
        values_to= new_column_name,
        column_level=None/int/str,
        sort_by_appearance = True/False,
        ignore_index = True/False,
    )
)
Parameters
  • df – A pandas dataframe.

  • index – Name(s) of columns to use as identifier variables. Should be either a single column name, or a list/tuple of column names. The janitor.select_columns syntax is supported here, allowing for flexible and dynamic column selection. Index should be a list of tuples if the columns are a MultiIndex.

  • column_names – Name(s) of columns to unpivot. Should be either a single column name or a list/tuple of column names. The janitor.select_columns syntax is supported here, allowing for flexible and dynamic column selection. Column_names should be a list of tuples if the columns are a MultiIndex.

  • names_to – Name of new column as a string that will contain what were previously the column names in column_names. The default is variable if no value is provided. It can also be a list/tuple of strings that will serve as new column names, if name_sep or names_pattern is provided. If .value is in names_to, new column names will be extracted from part of the existing column names and values_to will be replaced.

  • names_sep – Determines how the column name is broken up, if names_to contains multiple values. It takes the same specification as pandas’ str.split method, and can be a string or regular expression. names_sep does not work with MultiIndex columns.

  • names_pattern – Determines how the column name is broken up. It can be a regular expression containing matching groups (it takes the same specification as pandas’ str.extract method), or a list/tuple of regular expressions. If it is a single regex, the number of groups must match the length of names_to ( if the length of names_to is 3, then the number of groups must be 3. If names_to is a string, then there should be only one group in names_pattern). For a list/tuple of regular expressions, names_to must also be a list/tuple and the lengths of both arguments must match(if the length of names_to is 4, then the length of names_pattern must also be 4). The entries in both arguments must also match positionally, i.e if names_to = ("name1", "name2", "name3"), then names_pattern should be (“regex1”, “regex2”, “regex3”), with “name1” pairing “regex1”, “name2” pairing “regex2”, and “name3” pairing “regex3”. names_pattern does not work with MultiIndex columns.

  • values_to – Name of new column as a string that will contain what were previously the values of the columns in column_names.

  • column_level – If columns are a MultiIndex, then use this level to unpivot the dataframe. Provided for compatibility with pandas’ melt, and applies only if neither names_sep nor names_pattern is provided.

  • sort_by_appearance – Default False. Boolean value that determines the final look of the dataframe. If True, the unpivoted dataframe will be stacked in order of first appearance. See examples for more details. pivot_longer is usually more performant if sort_by_appearance is False.

  • ignore_index – Default True. If True, original index is ignored. If False, the original index is retained and the Index labels will be repeated as necessary.

Returns

A pandas DataFrame that has been unpivoted from wide to long format.

Raises
  • TypeError – if index or column_names is not a string, or a list/tuple of column names, or a janitor.patterns function.

  • TypeError – if names_to or column_names is not a string, or a list/tuple of strings.

  • TypeError – if names_sep is not a string or regular expression.

  • TypeError – if names_pattern is not a regular expression, or a list/tuple of regular expressions.

  • TypeError – if values_to is not a string.

  • TypeError – if sort_by_appearance is not a boolean.

  • TypeError – if ignore_index is not a boolean.

  • ValueError – if names_to is a string or a list/tuple of length 1, and names_sep is provided.

  • ValueError – if names_to is a string, and the number of extracted columns is greater than 1.

  • ValueError – if names_to is a list/tuple, and its length does not match the number of extracted columns.

  • ValueError – if the dataframe contains MultiIndex columns, and index or column_names is not a list of tuples.

  • ValueError – if the dataframe contains MultiIndex columns, and either names_sep or names_pattern is provided.