Groupby_agg : Shortcut for assigning a groupby-transform to a new column.

Background

This notebook serves to show how to use the groupby_agg method from pyjanitor’s general functions submodule.

The groupby_agg method allows us to add the result of an aggregation from a grouping, as a new column, back to the dataframe.

Currently in pandas, to append a column back to a dataframe, you do it in three steps: 1. Groupby a column or columns 2. Apply the transform method with an aggregate function on the grouping, and finally 3. Assign the result of the transform to a new column in the dataframe.

The groupby_agg allows you to replicate the same process in one step and with sensible arguments. The example below illustrates this function

[1]:
# load modules
import pandas as pd
import numpy as np
from janitor import groupby_agg
[2]:
data = {
    'item': ['shoe', 'shoe', 'bag', 'shoe', 'bag'],
    'MRP': [220, 450, 320, 200, 305],
    'number_sold': [100, 40, 56, 38, 25]
}

df = pd.DataFrame(data)

df
[2]:
item MRP number_sold
0 shoe 220 100
1 shoe 450 40
2 bag 320 56
3 shoe 200 38
4 bag 305 25

Use grouby_agg to find average price for each item and append column to dataframe

[3]:
df = df.groupby_agg(
    by='item',
    agg='mean',
    agg_column_name='MRP',
    new_column_name='Avg_MRP'
)

df
[3]:
item MRP number_sold Avg_MRP
0 shoe 220 100 290.0
1 shoe 450 40 290.0
2 bag 320 56 312.5
3 shoe 200 38 290.0
4 bag 305 25 312.5

Null cells are retained as well :

[4]:
df = pd.DataFrame(
        {
            "name": ("black", "black", "black", "red", "red"),
            "type": ("chair", "chair", "sofa", "sofa", "plate"),
            "num": (4, 5, 12, 4, 3),
            "nulls": (1, 1, np.nan, np.nan, 3),
        }
    )

df
[4]:
name type num nulls
0 black chair 4 1.0
1 black chair 5 1.0
2 black sofa 12 NaN
3 red sofa 4 NaN
4 red plate 3 3.0
[5]:
filtered_df = df.groupby_agg(by=['nulls'],
                             agg='size',
                             agg_column_name='type',
                             new_column_name='counter'
                             )
filtered_df
[5]:
name type num nulls counter
0 black chair 4 1.0 2
1 black chair 5 1.0 2
2 black sofa 12 NaN 2
3 red sofa 4 NaN 2
4 red plate 3 3.0 1

The groupby_agg method can be extended for different purposes. One of these is groupwise filtering, where only groups that meet a condition are retained. Let’s explore this with an example:

[6]:
filtered_df = (df.groupby_agg(by=['name', 'type'],
                             agg='size',
                             agg_column_name='type',
                             new_column_name='counter'
                             )
                .query('counter > 1'))
filtered_df
[6]:
name type num nulls counter
0 black chair 4 1.0 2
1 black chair 5 1.0 2
[ ]: