janitor.groupby_topk

janitor.groupby_topk(df: pandas.core.frame.DataFrame, groupby_column_name: Hashable, sort_column_name: Hashable, k: int, sort_values_kwargs: Dict = None) → pandas.core.frame.DataFrame

Return the top k rows of each group after grouping by a column.

Groups df by groupby_column_name, sorts each group by sort_column_name, and returns the first k rows of each group. Sorting is ascending by default; additional sort options (such as ascending=False) can be passed through sort_values_kwargs.

The full list of sort_values() parameters is given in the pandas DataFrame.sort_values documentation.

import pandas as pd
import janitor as jn

df = pd.DataFrame({'age' : [20, 22, 24, 23, 21, 22],
                   'ID' : [1,2,3,4,5,6],
                   'result' : ["pass", "fail", "pass",
                               "pass", "fail", "pass"]})

# Ascending top 3:
df.groupby_topk('result', 'age', 3)
#       age  ID  result
#result
#fail   21   5   fail
#       22   2   fail
#pass   20   1   pass
#       22   6   pass
#       23   4   pass

# Descending top 2:
df.groupby_topk('result', 'age', 2, {'ascending':False})
#       age  ID result
#result
#fail   22   2   fail
#       21   5   fail
#pass   24   3   pass
#       23   4   pass
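
For comparison, the same selection can be sketched with plain pandas. This is only an illustration of the idea, not the library's implementation: topk_per_group is a hypothetical helper name, and unlike the output above it returns a flat index rather than one keyed by group.

```python
import pandas as pd


def topk_per_group(df, groupby_column_name, sort_column_name, k, **sort_values_kwargs):
    """Hypothetical helper: top k rows per group using plain pandas."""
    # Sort the whole frame once, forwarding extra options to sort_values().
    ordered = df.sort_values(by=sort_column_name, **sort_values_kwargs)
    # head(k) on a groupby keeps the first k rows of each group and
    # preserves the row order of the sorted frame.
    return ordered.groupby(groupby_column_name).head(k)


df = pd.DataFrame({
    "age": [20, 22, 24, 23, 21, 22],
    "ID": [1, 2, 3, 4, 5, 6],
    "result": ["pass", "fail", "pass", "pass", "fail", "pass"],
})

# Two oldest rows per result group.
top2_desc = topk_per_group(df, "result", "age", 2, ascending=False)
print(top2_desc)
```

With the sample data this picks the same four rows as the "Descending top 2" example (IDs 3, 4, 2 and 5), ordered by age rather than by group.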

Functional usage syntax:

import pandas as pd
import janitor as jn

df = pd.DataFrame(...)
df = jn.groupby_topk(
    df = df,
    groupby_column_name = 'groupby_column',
    sort_column_name = 'sort_column',
    k = 5
    )

Method-chaining usage syntax:

import pandas as pd
import janitor as jn

df = (
    pd.DataFrame(...)
    .groupby_topk(
        groupby_column_name = 'groupby_column',
        sort_column_name = 'sort_column',
        k = 5
    )
)

Parameters
  • df – A pandas dataframe.

  • groupby_column_name – Column name to group input dataframe df by.

  • sort_column_name – Name of the column in df to sort each group by.

  • k – Number of top rows to return from each group after sorting.

  • sort_values_kwargs – Dict of keyword arguments to pass to sort_values(), for example {'ascending': False}.

Returns

A pandas dataframe containing the top k rows of each group, grouped by groupby_column_name, with each group sorted by sort_column_name.

Raises
  • ValueError – if k is less than 1.

  • ValueError – if groupby_column_name not in dataframe df.

  • ValueError – if sort_column_name not in dataframe df.

  • KeyError – if 'inplace': True is present in sort_values_kwargs.
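
The checks listed above could look roughly like the following. This is a sketch under assumptions, not the library's code: check_topk_args is a hypothetical name and the error messages are invented for illustration. The last lines show one plausible reason for rejecting inplace: sort_values(inplace=True) returns None, so nothing could be grouped afterwards.

```python
import pandas as pd


def check_topk_args(df, groupby_column_name, sort_column_name, k,
                    sort_values_kwargs=None):
    """Hypothetical validator mirroring the documented error conditions."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if groupby_column_name not in df.columns:
        raise ValueError(f"{groupby_column_name!r} not in dataframe columns")
    if sort_column_name not in df.columns:
        raise ValueError(f"{sort_column_name!r} not in dataframe columns")
    # Assumption: only inplace=True is rejected, as the Raises entry words it.
    if sort_values_kwargs and sort_values_kwargs.get("inplace"):
        raise KeyError("inplace=True is not supported in sort_values_kwargs")


df = pd.DataFrame({"age": [22, 21], "result": ["fail", "pass"]})

# Valid arguments pass silently.
check_topk_args(df, "result", "age", 1, {"ascending": False})

# An in-place sort returns None instead of a dataframe.
returned = df.copy().sort_values(by="age", inplace=True)
print(returned is None)
```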