janitor.unionize_dataframe_categories

janitor.unionize_dataframe_categories(*dataframes, column_names: Optional[Iterable[pandas.core.dtypes.dtypes.CategoricalDtype]] = None) → List[pandas.core.frame.DataFrame]

Given a group of dataframes that contain some categorical columns, find, for each categorical column, all the categories that appear across the dataframes containing that column. Each dataframe's corresponding column is then replaced with a new categorical object that holds the original data but has labels for all the possible categories from all dataframes. This is useful when concatenating a list of dataframes that all have the same categorical columns into one dataframe.
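The steps described above can be sketched with plain pandas. This is only an illustration under stated assumptions, not pyjanitor's actual implementation: the helper name `unionize_sketch` is hypothetical, and it relies on `pandas.api.types.union_categoricals` to compute the combined categories.

```python
import pandas as pd
from pandas.api.types import union_categoricals


def unionize_sketch(*dataframes, column_names=None):
    """Hypothetical pandas-only sketch; not pyjanitor's implementation."""
    wanted = None if column_names is None else set(column_names)

    # Gather, per categorical column, that column from every dataframe that has it.
    by_column = {}
    for df in dataframes:
        for name in df.columns:
            if wanted is not None and name not in wanted:
                continue
            if isinstance(df[name].dtype, pd.CategoricalDtype):
                by_column.setdefault(name, []).append(df[name])

    # Compute the union of categories for each column.
    unioned = {
        name: union_categoricals(cols, ignore_order=True).categories
        for name, cols in by_column.items()
    }

    # Rebuild each dataframe (as a copy) with the full category set applied.
    result = []
    for df in dataframes:
        out = df.copy()
        for name, categories in unioned.items():
            if name in out.columns and isinstance(out[name].dtype, pd.CategoricalDtype):
                out[name] = out[name].cat.set_categories(categories)
        result.append(out)
    return result


df1 = pd.DataFrame({"fruit": pd.Categorical(["apple", "banana"])})
df2 = pd.DataFrame({"fruit": pd.Categorical(["banana", "cherry"])})
u1, u2 = unionize_sketch(df1, df2)
combined = pd.concat([u1, u2], ignore_index=True)
```

In this sketch both returned dataframes end up with the same categorical dtype for `fruit`, so the concatenated column stays categorical.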

If, for a given categorical column, the input dataframes do not all share the same set of categories, pandas will change the output dtype of that column from category to object when concatenating, losing the dramatic speed gains you get from the former format.
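A small pandas-only demonstration of this dtype change may help. The dataframes and the `fruit` column are made up for illustration; the exact fallback dtype can depend on the pandas version, so the example only checks whether the result is still categorical.

```python
import pandas as pd

# Two dataframes whose "fruit" columns have different category sets.
df1 = pd.DataFrame({"fruit": pd.Categorical(["apple", "banana"])})
df2 = pd.DataFrame({"fruit": pd.Categorical(["banana", "cherry"])})

# Naive concatenation: the category sets differ, so the category dtype is lost.
naive = pd.concat([df1, df2], ignore_index=True)
naive_is_categorical = isinstance(naive["fruit"].dtype, pd.CategoricalDtype)

# Giving both columns an identical categorical dtype first preserves it.
shared = pd.CategoricalDtype(["apple", "banana", "cherry"])
aligned = pd.concat(
    [df1.astype({"fruit": shared}), df2.astype({"fruit": shared})],
    ignore_index=True,
)
aligned_is_categorical = isinstance(aligned["fruit"].dtype, pd.CategoricalDtype)

print(naive_is_categorical, aligned_is_categorical)
```

Here the shared dtype is built by hand; the function documented on this page is meant to do that alignment across all categorical columns for you.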

Usage example for concatenation of categorical column-containing dataframes:

Instead of:

concatenated_df = pd.concat([df1, df2, df3], ignore_index=True)

which in your case has resulted in category -> object conversion, use:

unionized_dataframes = unionize_dataframe_categories(df1, df2, df3)
concatenated_df = pd.concat(unionized_dataframes, ignore_index=True)
Parameters
  • dataframes – The dataframes you wish to unionize the categorical objects for.

  • column_names – If supplied, only unionize this subset of columns.

Returns

A list of the category-unioned dataframes in the same order they were provided.

Raises

TypeError – if any inputs are not pandas DataFrames.