pyjanitor Usage Walkthrough

pyjanitor is a Python-based API on top of pandas inspired by the janitor R package. It aims to provide a clean, understandable interface based on method chaining for common and less-common tasks involving data cleaning and DataFrame manipulation.

The core philosophy and augmentations on pandas’ approach to data cleaning and DataFrame manipulation include:

  • A method-chaining paradigm for coding efficiency & clarity of code through greatly improved readability.
  • Implementation of common, useful DataFrame manipulation tasks that saves on repetitive code.
  • Focus on active tense / verb approaches to function naming to provide at-a-glance understanding of a data manipulation pipeline.

Why pyjanitor?

Originally a simple port of the R package, pyjanitor has evolved from a set of convenient data cleaning routines into an experiment with the method-chaining paradigm.

Data preprocessing is best expressed as a directed acyclic graph (DAG) of actions taken on data. We take a base data file as the starting point and perform actions on it such as removing null/empty rows, replacing them with other values, adding/renaming/removing columns of data, filtering rows, and more.

The pandas API has been invaluable for the Python data science ecosystem and implements method chaining for a subset of methods as part of the API. For example, resetting indexes (.reset_index()), dropping null values (.dropna()), and more are accomplished via the appropriate pd.DataFrame method calls.

Inspired by the R statistical language ecosystem, where consistent and good API design in the dplyr package enables end-users, who are not necessarily developers, to concisely express data processing code, pyjanitor has evolved into a language for expressing the data processing DAG for pandas users.

What is method chaining?

To accomplish these goals, actions for which we would need to invoke imperative-style statements can be replaced with method chains that allow the user to read off the logical order of actions taken. Note the annotated example below. First, we introduce the textual description of a sample data cleaning pipeline:

  • Create DataFrame.
  • Delete one column.
  • Drop rows with empty values in two particular columns.
  • Rename another two columns.
  • Add a new column.
  • Reset index to account for the missing row we removed above

In pandas code, this would look as such:

df = pd.DataFrame(...)  # create a pandas DataFrame somehow.
del df['column1']  # delete a column from the dataframe.
df = df.dropna(subset=['column2', 'column3'])  # drop rows that have empty values in column 2 and 3.
df = df.rename({'column2': 'unicorns', 'column3': 'dragons'})  # rename column2 and column3
df['new_column'] = ['iterable', 'of', 'items']  # add a new column.
df.reset_index(inplace=True, drop=True)  # reset index to account for the missing row we removed above

The pyjanitor approach

With pyjanitor, we enable method chaining with method names that are verbs which describe the action taken:

df = (
    pd.DataFrame(...)
    .remove_columns(['column1'])
    .dropna(subset=['column2', 'column3'])
    .rename_column('column2', 'unicorns')
    .rename_column('column3', 'dragons')
    .add_column('new_column', ['iterable', 'of', 'items'])
    .reset_index(drop=True)
)

We believe the pyjanitor chaining-based approach leads to much cleaner code where the intent of a series of DataFrame manipulations is much more immediately clear.

pyjanitor’s etymology has a two-fold relationship to “cleanliness”. Firstly, it’s about extending pandas with convenient data cleaning routines. Secondly, it’s about providing a cleaner, method-chaining, verb-based API for common pandas routines.

A survey of pyjanitor functions

  • Cleaning column names (multi-indexes are possible!)
  • Removing empty rows and columns
  • Identifying duplicate entries
  • Encoding columns as categorical
  • Splitting your data into features and targets (for machine learning)
  • Adding, removing, and renaming columns
  • Coalesce multiple columns into a single column
  • Convert excel date (serial format) into a Python datetime format
  • Expand a single column that has delimited, categorical values into dummy-encoded variables

A full list of functionality that pyjanitor implements can be found in the API docs.

Some things that are different

Some pyjanitor methods are DataFrame-mutating operations, i.e., in place. Given that in a method-chaining paradigm, DataFrames that would be created at each step of the chain cannot be accessed anyway, duplication of data at each step would lead to unnecessary, potential considerable slowdowns and increased memory usage due to data-copying operations. The severity of such copying scales with DataFrame size. Take care to understand which functions change the original DataFrame you are chaining on if it is necessary to preserve that data. If it is, you can simply .copy() it as the first step in a df.copy().operation1().operation2()... chain.

How it works

pyjanitor relies on the Pandas Flavor package to register new functions as object methods that can be called directly on DataFrames. For example:

import pandas as pd
import pandas_flavor as pf


@pf.register_dataframe_method
def remove_column(df, column_name: str):
    del df[column_name]
    return df

df = (
    pd.read_csv('my_data.csv')
    .remove_column('my_column_name')
    .operation2(...)
)

Importing the janitor package immediately registers these functions. The fact that each DataFrame method pyjanitor registers returns the DataFrame is what gives it the capability to method chain.

Note that pyjanitor explicitly does not modify any existing pandas methods / functionality.

Demo of various DataFrame manipulation tasks using pyjanitor

Here, we’ll walk through some useful pyjanitor-based approaches to cleaning and manipulating DataFrames.

Code preamble:

[2]:
import pandas as pd
import janitor
import pandas_flavor as pf
import numpy as np

Let’s take a look at our dataset:

[3]:
df = pd.read_excel('dirty_data.xlsx')
df
[3]:
First Name Last Name Employee Status Subject Hire Date % Allocated Full time? do not edit! ---> Certification Certification.1 Certification.2
0 Jason Bourne Teacher PE 39690.0 0.75 Yes NaN Physical ed Theater NaN
1 Jason Bourne Teacher Drafting 39690.0 0.25 Yes NaN Physical ed Theater NaN
2 Alicia Keys Teacher Music 37118.0 1.00 Yes NaN Instr. music Vocal music NaN
3 Ada Lovelace Teacher NaN 27515.0 1.00 Yes NaN PENDING Computers NaN
4 Desus Nice Administration Dean 41431.0 1.00 Yes NaN PENDING NaN NaN
5 Chien-Shiung Wu Teacher Physics 11037.0 0.50 Yes NaN Science 6-12 Physics NaN
6 Chien-Shiung Wu Teacher Chemistry 11037.0 0.50 Yes NaN Science 6-12 Physics NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 James Joyce Teacher English 32994.0 0.50 No NaN NaN English 6-12 NaN
9 Hedy Lamarr Teacher Science 27919.0 0.50 No NaN PENDING NaN NaN
10 Carlos Boozer Coach Basketball 42221.0 NaN No NaN Physical ed NaN NaN
11 Young Boozer Coach NaN 34700.0 NaN No NaN NaN Political sci. NaN
12 Micheal Larsen Teacher English 40071.0 0.80 No NaN Vocal music English NaN

We can see that this dataset is dirty in a number of ways, including the following:

  • Column names contain spaces, punctuation marks, and inconsistent casing
  • One row (7) with completely missing data
  • One column (do not edit! --->) with completely missing data

Clean up our data using a pyjanitor method-chaining pipeline

Let’s run through a demo DataFrame cleaning procedure:

[4]:
cleaned_df = (
    pd.read_excel('dirty_data.xlsx')
    .clean_names()
    .remove_empty()
    .rename_column("%_allocated", "percent_allocated")
    .rename_column("full_time_", "full_time")
    .coalesce(["certification", "certification_1"], "certification")
    .encode_categorical(["subject", "employee_status", "full_time"])
    .convert_excel_date("hire_date")
    .reset_index(drop=True)
)

cleaned_df
[4]:
first_name last_name employee_status subject hire_date percent_allocated full_time certification
0 Jason Bourne Teacher PE 2008-08-30 0.75 Yes Physical ed
1 Jason Bourne Teacher Drafting 2008-08-30 0.25 Yes Physical ed
2 Alicia Keys Teacher Music 2001-08-15 1.00 Yes Instr. music
3 Ada Lovelace Teacher NaN 1975-05-01 1.00 Yes PENDING
4 Desus Nice Administration Dean 2013-06-06 1.00 Yes PENDING
5 Chien-Shiung Wu Teacher Physics 1930-03-20 0.50 Yes Science 6-12
6 Chien-Shiung Wu Teacher Chemistry 1930-03-20 0.50 Yes Science 6-12
7 James Joyce Teacher English 1990-05-01 0.50 No English 6-12
8 Hedy Lamarr Teacher Science 1976-06-08 0.50 No PENDING
9 Carlos Boozer Coach Basketball 2015-08-05 NaN No Physical ed
10 Young Boozer Coach NaN 1995-01-01 NaN No Political sci.
11 Micheal Larsen Teacher English 2009-09-15 0.80 No Vocal music

The cleaned DataFrame looks much better and quite a bit more usable for our downstream tasks.

Step-by-step walkthrough of pyjanitor DataFrame manipulations

Just for clearer understanding of the above, let’s see how pyjanitor progressively modified the data.

Loading data in:

[5]:
df = pd.read_excel('dirty_data.xlsx')
df
[5]:
First Name Last Name Employee Status Subject Hire Date % Allocated Full time? do not edit! ---> Certification Certification.1 Certification.2
0 Jason Bourne Teacher PE 39690.0 0.75 Yes NaN Physical ed Theater NaN
1 Jason Bourne Teacher Drafting 39690.0 0.25 Yes NaN Physical ed Theater NaN
2 Alicia Keys Teacher Music 37118.0 1.00 Yes NaN Instr. music Vocal music NaN
3 Ada Lovelace Teacher NaN 27515.0 1.00 Yes NaN PENDING Computers NaN
4 Desus Nice Administration Dean 41431.0 1.00 Yes NaN PENDING NaN NaN
5 Chien-Shiung Wu Teacher Physics 11037.0 0.50 Yes NaN Science 6-12 Physics NaN
6 Chien-Shiung Wu Teacher Chemistry 11037.0 0.50 Yes NaN Science 6-12 Physics NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 James Joyce Teacher English 32994.0 0.50 No NaN NaN English 6-12 NaN
9 Hedy Lamarr Teacher Science 27919.0 0.50 No NaN PENDING NaN NaN
10 Carlos Boozer Coach Basketball 42221.0 NaN No NaN Physical ed NaN NaN
11 Young Boozer Coach NaN 34700.0 NaN No NaN NaN Political sci. NaN
12 Micheal Larsen Teacher English 40071.0 0.80 No NaN Vocal music English NaN

Clean up names by removing whitespace, punctuation / symbols, capitalization:

[6]:
df = df.clean_names()
df
[6]:
first_name last_name employee_status subject hire_date %_allocated full_time_ do_not_edit!_> certification certification_1 certification_2
0 Jason Bourne Teacher PE 39690.0 0.75 Yes NaN Physical ed Theater NaN
1 Jason Bourne Teacher Drafting 39690.0 0.25 Yes NaN Physical ed Theater NaN
2 Alicia Keys Teacher Music 37118.0 1.00 Yes NaN Instr. music Vocal music NaN
3 Ada Lovelace Teacher NaN 27515.0 1.00 Yes NaN PENDING Computers NaN
4 Desus Nice Administration Dean 41431.0 1.00 Yes NaN PENDING NaN NaN
5 Chien-Shiung Wu Teacher Physics 11037.0 0.50 Yes NaN Science 6-12 Physics NaN
6 Chien-Shiung Wu Teacher Chemistry 11037.0 0.50 Yes NaN Science 6-12 Physics NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 James Joyce Teacher English 32994.0 0.50 No NaN NaN English 6-12 NaN
9 Hedy Lamarr Teacher Science 27919.0 0.50 No NaN PENDING NaN NaN
10 Carlos Boozer Coach Basketball 42221.0 NaN No NaN Physical ed NaN NaN
11 Young Boozer Coach NaN 34700.0 NaN No NaN NaN Political sci. NaN
12 Micheal Larsen Teacher English 40071.0 0.80 No NaN Vocal music English NaN

Remove entirely empty rows / columns:

[7]:
df = df.remove_empty()
df
[7]:
first_name last_name employee_status subject hire_date %_allocated full_time_ certification certification_1
0 Jason Bourne Teacher PE 39690.0 0.75 Yes Physical ed Theater
1 Jason Bourne Teacher Drafting 39690.0 0.25 Yes Physical ed Theater
2 Alicia Keys Teacher Music 37118.0 1.00 Yes Instr. music Vocal music
3 Ada Lovelace Teacher NaN 27515.0 1.00 Yes PENDING Computers
4 Desus Nice Administration Dean 41431.0 1.00 Yes PENDING NaN
5 Chien-Shiung Wu Teacher Physics 11037.0 0.50 Yes Science 6-12 Physics
6 Chien-Shiung Wu Teacher Chemistry 11037.0 0.50 Yes Science 6-12 Physics
7 James Joyce Teacher English 32994.0 0.50 No NaN English 6-12
8 Hedy Lamarr Teacher Science 27919.0 0.50 No PENDING NaN
9 Carlos Boozer Coach Basketball 42221.0 NaN No Physical ed NaN
10 Young Boozer Coach NaN 34700.0 NaN No NaN Political sci.
11 Micheal Larsen Teacher English 40071.0 0.80 No Vocal music English

Rename particular columns:

[8]:
df = (
    df.rename_column("%_allocated", "percent_allocated")
    .rename_column("full_time_", "full_time")
)
df
[8]:
first_name last_name employee_status subject hire_date percent_allocated full_time certification certification_1
0 Jason Bourne Teacher PE 39690.0 0.75 Yes Physical ed Theater
1 Jason Bourne Teacher Drafting 39690.0 0.25 Yes Physical ed Theater
2 Alicia Keys Teacher Music 37118.0 1.00 Yes Instr. music Vocal music
3 Ada Lovelace Teacher NaN 27515.0 1.00 Yes PENDING Computers
4 Desus Nice Administration Dean 41431.0 1.00 Yes PENDING NaN
5 Chien-Shiung Wu Teacher Physics 11037.0 0.50 Yes Science 6-12 Physics
6 Chien-Shiung Wu Teacher Chemistry 11037.0 0.50 Yes Science 6-12 Physics
7 James Joyce Teacher English 32994.0 0.50 No NaN English 6-12
8 Hedy Lamarr Teacher Science 27919.0 0.50 No PENDING NaN
9 Carlos Boozer Coach Basketball 42221.0 NaN No Physical ed NaN
10 Young Boozer Coach NaN 34700.0 NaN No NaN Political sci.
11 Micheal Larsen Teacher English 40071.0 0.80 No Vocal music English

Take first non-NaN row value in two columns:

[9]:
df = df.coalesce(["certification", "certification_1"], "certification")
df
[9]:
first_name last_name employee_status subject hire_date percent_allocated full_time certification
0 Jason Bourne Teacher PE 39690.0 0.75 Yes Physical ed
1 Jason Bourne Teacher Drafting 39690.0 0.25 Yes Physical ed
2 Alicia Keys Teacher Music 37118.0 1.00 Yes Instr. music
3 Ada Lovelace Teacher NaN 27515.0 1.00 Yes PENDING
4 Desus Nice Administration Dean 41431.0 1.00 Yes PENDING
5 Chien-Shiung Wu Teacher Physics 11037.0 0.50 Yes Science 6-12
6 Chien-Shiung Wu Teacher Chemistry 11037.0 0.50 Yes Science 6-12
7 James Joyce Teacher English 32994.0 0.50 No English 6-12
8 Hedy Lamarr Teacher Science 27919.0 0.50 No PENDING
9 Carlos Boozer Coach Basketball 42221.0 NaN No Physical ed
10 Young Boozer Coach NaN 34700.0 NaN No Political sci.
11 Micheal Larsen Teacher English 40071.0 0.80 No Vocal music

Convert string object rows to categorical to save on memory consumption and speed up access:

[10]:
df.dtypes
[10]:
first_name            object
last_name             object
employee_status       object
subject               object
hire_date            float64
percent_allocated    float64
full_time             object
certification         object
dtype: object
[11]:
df.encode_categorical(["subject", "employee_status", "full_time"])
df.dtypes
[11]:
first_name             object
last_name              object
employee_status      category
subject              category
hire_date             float64
percent_allocated     float64
full_time            category
certification          object
dtype: object

Convert Excel date-formatted column to a more interpretable format:

[12]:
df.convert_excel_date("hire_date")
df
[12]:
first_name last_name employee_status subject hire_date percent_allocated full_time certification
0 Jason Bourne Teacher PE 2008-08-30 0.75 Yes Physical ed
1 Jason Bourne Teacher Drafting 2008-08-30 0.25 Yes Physical ed
2 Alicia Keys Teacher Music 2001-08-15 1.00 Yes Instr. music
3 Ada Lovelace Teacher NaN 1975-05-01 1.00 Yes PENDING
4 Desus Nice Administration Dean 2013-06-06 1.00 Yes PENDING
5 Chien-Shiung Wu Teacher Physics 1930-03-20 0.50 Yes Science 6-12
6 Chien-Shiung Wu Teacher Chemistry 1930-03-20 0.50 Yes Science 6-12
7 James Joyce Teacher English 1990-05-01 0.50 No English 6-12
8 Hedy Lamarr Teacher Science 1976-06-08 0.50 No PENDING
9 Carlos Boozer Coach Basketball 2015-08-05 NaN No Physical ed
10 Young Boozer Coach NaN 1995-01-01 NaN No Political sci.
11 Micheal Larsen Teacher English 2009-09-15 0.80 No Vocal music

Example analysis of the data

Let’s perform analysis on the above, cleaned DataFrame. First we add some additional, randomly-generated data. Note that we .copy() the original to preserve it, given that the following would otherwise modify it:

[13]:
data_df = (
    cleaned_df
    .copy()
    .add_columns(
        lucky_number=np.random.randint(0, 10, len(cleaned_df)),
        age=np.random.randint(10, 100, len(cleaned_df)),
        employee_of_month_count=np.random.randint(0, 5, len(cleaned_df))
    )
)

data_df
[13]:
first_name last_name employee_status subject hire_date percent_allocated full_time certification lucky_number age employee_of_month_count
0 Jason Bourne Teacher PE 2008-08-30 0.75 Yes Physical ed 6 14 2
1 Jason Bourne Teacher Drafting 2008-08-30 0.25 Yes Physical ed 7 29 0
2 Alicia Keys Teacher Music 2001-08-15 1.00 Yes Instr. music 2 34 4
3 Ada Lovelace Teacher NaN 1975-05-01 1.00 Yes PENDING 2 81 3
4 Desus Nice Administration Dean 2013-06-06 1.00 Yes PENDING 0 69 2
5 Chien-Shiung Wu Teacher Physics 1930-03-20 0.50 Yes Science 6-12 9 75 4
6 Chien-Shiung Wu Teacher Chemistry 1930-03-20 0.50 Yes Science 6-12 3 34 3
7 James Joyce Teacher English 1990-05-01 0.50 No English 6-12 6 87 2
8 Hedy Lamarr Teacher Science 1976-06-08 0.50 No PENDING 7 92 0
9 Carlos Boozer Coach Basketball 2015-08-05 NaN No Physical ed 9 35 0
10 Young Boozer Coach NaN 1995-01-01 NaN No Political sci. 0 46 1
11 Micheal Larsen Teacher English 2009-09-15 0.80 No Vocal music 7 71 2

Calculate mean, median of all numerical columns after grouping by employee status. Use .collapse_levels(), a pyjanitor convenience function, to convert the DataFrame returned by .agg() from having multi-level columns (because we supplied a list of aggregation operations) to single-level by concatenating the level names with an underscore:

[14]:
stats_df = (
    data_df.groupby('employee_status')
    .agg(['mean', 'median'])
    .collapse_levels()
    .reset_index()
)

stats_df
[14]:
employee_status percent_allocated_mean percent_allocated_median lucky_number_mean lucky_number_median age_mean age_median employee_of_month_count_mean employee_of_month_count_median
0 Administration 1.000000 1.0 0.000000 0.0 69.000000 69.0 2.000000 2.0
1 Coach NaN NaN 4.500000 4.5 40.500000 40.5 0.500000 0.5
2 Teacher 0.644444 0.5 5.444444 6.0 57.444444 71.0 2.222222 2.0