Tidy Up Web-Scraped Media Franchise Data

Background

This example combines functionalities of pyjanitor and pandas-flavor to showcase an explicit–and thus reproducible–workflow enabled by dataframe method chaining.

The data cleaning workflow largely follows the R example from the tidytuesday project. The raw data is scraped from Wikipedia page titled “List of highest-grossing media franchises”. The workflow is presented both in multi-step (section1) and in one-shot (section 2) fashions.

More specifically, you will find several data-cleaning techniques that one may encounter frequently in web-scraping tasks; This includes:

  • String operations with regular expressions (with pandas-favor)
  • Data type changes (with pyjanitor)
  • Split strings in cells into seperate rows (with pandas-flavor)
  • Split strings in cells into separate columns (with pyjanitor + pandas-flavor)
  • Filter dataframe values based on substring pattern (with pyjanitor)
  • Column value remapping with fuzzy substring matching (with pyjanitor + pandas-flavor)

Data visualization is not included in this example. But if you are looking for inspirations, here is a good example.


Structural convention

1. Annotation system in code comments

This example includes both pyjanitor and pandas-flavors methods. As you step through this example, you will see the following annotation system in code comments that explains various methods’ categories:

  • [pyjanitor] denotes pyjanitor methods
  • [pandas-flavor] denotes custom pandas-flavor methods
  • [pyjanitor + pandas-flavor] denotes pandas-flavor methods built on top of pyjanitor functions

2. R counterpart snippets and python code in tandem

The multi-step workflow is presented by alternating the original R snippets (from tidytuesday) and the corresponding python implementations.


Python implementation

Preparation

[1]:
# Import pyjanitor and pandas
import janitor
import pandas as pd
import pandas_flavor as pf
from typing import List
[2]:
# Supress user warnings when we try overwriting our custom pandas flavor functions
import warnings
warnings.filterwarnings('ignore')

Section 1 Multi-Step

Load data

R snippet:

url <- "https://en.wikipedia.org/wiki/List_of_highest-grossing_media_franchises"
df <- url %>%
  read_html() %>%
  html_table(fill = TRUE) %>%
  .[[2]]
[3]:
fileurl = 'https://en.wikipedia.org/wiki/List_of_highest-grossing_media_franchises'
df_raw = pd.read_html(fileurl)[2]
df_raw.head(3)
[3]:
Franchise Year of inception Total revenue (USD) Revenue breakdown (est.) Original media Creator(s) Owner(s)
0 Pokémon 1996 est. $95 billion Licensed merchandise – $64.1 billion[a] Video ... Video game Satoshi Tajiri Ken Sugimori Nintendo (trademark) The Pokémon Company (Nint...
1 Hello Kitty 1974 est. $80 billion Merchandise sales – $80 billion[j] Manga magaz... Cartoon character[37] Yuko Shimizu Shintaro Tsuji Sanrio
2 Winnie the Pooh 1924 est. $75 billion Retail sales – $74.515 billion[n] DVD & Blu-ra... Book[58] A. A. Milne E. H. Shepard The Walt Disney Company

Rename columns

R snippet:

clean_money <- df %>%
  set_names(nm = c("franchise", "year_created", "total_revenue", "revenue_items",
                   "original_media", "creators", "owners"))
[4]:
# pandas-flavor helper functions

# [pandas-flavor]
@pf.register_dataframe_method
def str_remove(df, column_name: str, pattern: str = ''):
    """Remove string patten from a column

    Wrapper around df.str.replace()

    Parameters
    -----------
    df: pd.Dataframe
        Input dataframe to be modified
    column_name: str
        Name of the column to be operated on
    pattern: str, default to ''
        String pattern to be removed

    Returns
    --------
    df: pd.Dataframe

    """
    df[column_name] = df[column_name].str.replace(pattern, '')
    return df


# [pandas-flavor]
@pf.register_dataframe_method
def str_trim(df, column_name: str):
    """Remove leading and trailing white space from a column of strings

    Wrapper around df.str.strip()

    Parameters
    -----------
    df: pd.Dataframe
        Input dataframe to be modified
    column_name: str
        Name of the column to be operated on

    Returns
    --------
    df: pd.Dataframe

    """
    df[column_name] = df[column_name].str.strip()
    return df


# [pandas-flavor]
@pf.register_dataframe_method
def str_slice(
    df, column_name: str, start: int = 0, stop: int = -1
):
    """Slice a column of strings by indexes

    Parameters
    -----------
    df: pd.Dataframe
        Input dataframe to be modified
    column_name: str
        Name of the column to be operated on
    start: int, optional, default to 0
        Staring index for string slicing
    stop: int, optional, default to -1
        Ending index for string slicing

    Returns
    --------
    df: pd.Dataframe

    """
    df[column_name] = df[column_name].str[start:stop]
    return df
[5]:
colnames = (
    'franchise', 'year_created', 'total_revenue', 'revenue_items',
    'original_media', 'creators', 'owners'
)
df_dirty = df_raw.rename(columns={
    col_old: col_new for col_old, col_new in zip(df_raw.columns, colnames)
})
df_dirty.head(3)
[5]:
franchise year_created total_revenue revenue_items original_media creators owners
0 Pokémon 1996 est. $95 billion Licensed merchandise – $64.1 billion[a] Video ... Video game Satoshi Tajiri Ken Sugimori Nintendo (trademark) The Pokémon Company (Nint...
1 Hello Kitty 1974 est. $80 billion Merchandise sales – $80 billion[j] Manga magaz... Cartoon character[37] Yuko Shimizu Shintaro Tsuji Sanrio
2 Winnie the Pooh 1924 est. $75 billion Retail sales – $74.515 billion[n] DVD & Blu-ra... Book[58] A. A. Milne E. H. Shepard The Walt Disney Company

Clean up total_revenue column

R snippet:

clean_money <- df %>% ... %>%
mutate(total_revenue = str_remove(total_revenue, "est."),
     total_revenue = str_trim(total_revenue),
     total_revenue = str_remove(total_revenue, "[$]"),
     total_revenue = word(total_revenue, 1, 1),
     total_revenue = as.double(total_revenue))
[6]:
column_name = 'total_revenue'
df_clean_money = (
    df_dirty.str_remove(column_name, pattern='est.')  # [pandas-flavor]
        .str_trim(column_name)  # [pandas-flavor]
        .str_remove(column_name, pattern='\$')   # [pandas-flavor]
        .str_slice(column_name, start=0, stop=2)  # [pandas-flavor]
        .change_type(column_name, float)  # [pyjanitor]
)
df_clean_money.head(3)
[6]:
franchise year_created total_revenue revenue_items original_media creators owners
0 Pokémon 1996 95.0 Licensed merchandise – $64.1 billion[a] Video ... Video game Satoshi Tajiri Ken Sugimori Nintendo (trademark) The Pokémon Company (Nint...
1 Hello Kitty 1974 80.0 Merchandise sales – $80 billion[j] Manga magaz... Cartoon character[37] Yuko Shimizu Shintaro Tsuji Sanrio
2 Winnie the Pooh 1924 75.0 Retail sales – $74.515 billion[n] DVD & Blu-ra... Book[58] A. A. Milne E. H. Shepard The Walt Disney Company

Split column revenue_items into revenue_category and revenue

R snippet:

clean_category <- clean_money %>%
    separate_rows(revenue_items, sep = "\\[") %>%
    filter(str_detect(revenue_items, "illion")) %>%
    separate(revenue_items, into = c("revenue_category", "revenue"), sep = "[$]") %>%
    mutate(revenue_category = str_remove(revenue_category, " – "),
         revenue_category = str_remove(revenue_category, regex(".*\\]")),
         revenue_category = str_remove(revenue_category, "\n"))
[7]:
# pandas-flavor helper functions

# [pandas-flavor]
@pf.register_dataframe_method
def separate_rows(df, column_name: str, sep: str = ''):
    """Split each cell of a column that contains a list of items
    (separated by `sep`) into separate rows

    Parameters
    -----------
    df: pd.Dataframe
        Input dataframe to be modified
    column_name: str
        Name of the column to be operated on
    sep: str, default to ''
        Substring used as separators for cell splitting

    Returns
    --------
    df: pd.Dataframe

    """
    # Preserve an id field for later merge
    columns_original = list(df.columns)
    df['id'] = df.index
    wdf = (
        pd.DataFrame(
            df[column_name].str.split(sep).tolist()
        ).stack().reset_index()
    )
    # Preserve the same id field for merge
    wdf.rename(columns={'level_0': 'id', 0: 'revenue_items'}, inplace=True)
    wdf.drop(columns=['level_1'], inplace=True)
    # Merge and preserve original order
    return pd.merge(df, wdf, on="id", suffixes=("_drop", "")).drop(
        columns=["id", column_name + "_drop"]
    )[columns_original]


# [pyjanitor + pandas-flavor]
@pf.register_dataframe_method
def separate(df, column_name: str, into: List[str] = None, sep: str = ''):
    """Split a column into separate columns at separator specified by `sep`

    Parameters
    -----------
    df: pd.Dataframe
        Input dataframe to be modified
    column_name: str
        Name of the column to be operated on
    into: List[str], default to None
        New column names for the splitted columns
    sep: str, default to ''
        Separator at which to split the column

    Returns
    --------
    df: pd.Dataframe

    """
    index_original = list(df.columns).index(column_name)
    cols = list(df.columns)
    cols.remove(column_name)
    for i, col in enumerate(into):
        cols.insert(index_original + i, col)
    return (
        df.deconcatenate_column(column_name, new_column_names=into, sep=sep)
            .drop(columns=column_name)[cols]
    )
[8]:
# Generate `df_clean_category` on top of `df_clean_money`
column_name = 'revenue_items'
df_clean_category = (
    df_clean_money.separate_rows(column_name, sep='\[')  # [pandas-flavor]
        .filter_string(column_name, 'illion')  # [pyjanitor]
        .separate(
            column_name, into=['revenue_category', 'revenue'], sep='\$'
        )  # [pyjanitor + pandas-flavor]
        .str_remove('revenue_category', pattern=' – ')  # [pandas-flavor]
        .str_remove('revenue_category', pattern='.*\]')  # [pandas-flavor]
        .str_remove('revenue_category', pattern='\n')  # [pandas-flavor]
)
df_clean_category.head(3)
[8]:
franchise year_created total_revenue revenue_category revenue original_media creators owners
0 Pokémon 1996 95.0 Licensed merchandise 64.1 billion Video game Satoshi Tajiri Ken Sugimori Nintendo (trademark) The Pokémon Company (Nint...
1 Pokémon 1996 95.0 Video games 17.138 billion Video game Satoshi Tajiri Ken Sugimori Nintendo (trademark) The Pokémon Company (Nint...
2 Pokémon 1996 95.0 Card game 10.853 billion Video game Satoshi Tajiri Ken Sugimori Nintendo (trademark) The Pokémon Company (Nint...

Clean up revenue_category column

R snippet:

clean_df <- clean_category %>%
  mutate(revenue_category = case_when(
    str_detect(str_to_lower(revenue_category), "box office") ~ "Box Office",
    str_detect(str_to_lower(revenue_category), "dvd|blu|vhs|home video|video rentals|video sales|streaming|home entertainment") ~ "Home Video/Entertainment",
    str_detect(str_to_lower(revenue_category), "video game|computer game|mobile game|console|game|pachinko|pet|card") ~ "Video Games/Games",
    str_detect(str_to_lower(revenue_category), "comic|manga") ~ "Comic or Manga",
    str_detect(str_to_lower(revenue_category), "music|soundtrack") ~ "Music",
    str_detect(str_to_lower(revenue_category), "tv") ~ "TV",
    str_detect(str_to_lower(revenue_category), "merchandise|licens|mall|stage|retail") ~ "Merchandise, Licensing & Retail",
    TRUE ~ revenue_category))
[9]:
# pandas-flavor helper functions

# [pyjanitor + pandas-flavor]
@pf.register_dataframe_method
def fuzzy_match_replace(df, column_name: str, mapper: dict = None):
    """Value remapping for specific column with fuzzy matching and replacement
    of strings

    Parameters
    -----------
    df: pd.Dataframe
        Input dataframe to be modified
    column_name: str
        Name of the column to be operated on
    mapper: dict, default to None
        {oldstr_0: newstr_0, oldstr_1: newstr_1, ..., oldstr_n: newstr_n}

    Returns
    --------
    df: pd.Dataframe

    """
    for k, v in mapper.items():
        condition = df[column_name].str.contains(k)
        # [pyjanitor] update_where: update value when condition is True
        df = df.update_where(condition, column_name, v)
    return df
[10]:
# Value mapper `revenue_category`
value_mapper = {
    'box office': 'Box Office',
    'dvd|blu|vhs|home video|video rentals|video sales|streaming|home entertainment': 'Home Video/Entertainment',
    'video game|computer game|mobile game|console|game|pachinko|pet|card': 'Video Games/Games',
    'comic|manga': 'Comic or Manga',
    'music|soundtrac': 'Music',
    'tv': 'TV',
    'merchandise|licens|mall|stage|retail': 'Merchandise, Licensing & Retail',
}

column_name = 'revenue_category'
df_clean_category = (
    df_clean_category.transform_column(column_name, str.lower)  # [pyjanitor] convert to lower case
        .transform_column(column_name, str.strip)  # [pyjanitor] strip leading/trailing white space
        .fuzzy_match_replace(column_name, mapper=value_mapper)  # [pyjanitor + pandas_flavor]
)
df_clean_category.head(3)
[10]:
franchise year_created total_revenue revenue_category revenue original_media creators owners
0 Pokémon 1996 95.0 Merchandise, Licensing & Retail 64.1 billion Video game Satoshi Tajiri Ken Sugimori Nintendo (trademark) The Pokémon Company (Nint...
1 Pokémon 1996 95.0 Video Games/Games 17.138 billion Video game Satoshi Tajiri Ken Sugimori Nintendo (trademark) The Pokémon Company (Nint...
2 Pokémon 1996 95.0 Video Games/Games 10.853 billion Video game Satoshi Tajiri Ken Sugimori Nintendo (trademark) The Pokémon Company (Nint...

Clean up revenue column

R snippet:

%>%
mutate(revenue = str_remove(revenue, "illion"),
     revenue = str_trim(revenue),
     revenue = str_remove(revenue, " "),
     revenue = case_when(str_detect(revenue, "m") ~ paste0(str_extract(revenue, "[:digit:]+"), "e-3"),
                         str_detect(revenue, "b") ~ str_extract(revenue, "[:digit:]+"),
                         TRUE ~ NA_character_),
     revenue = format(revenue, scientific = FALSE),
     revenue = parse_number(revenue)) %>%
mutate(original_media = str_remove(original_media, "\\[.+"))
[11]:
# pandas-flavor helper functions

# [pandas-flavor]
@pf.register_dataframe_method
def str_replace(df, column_name: str, old: str = '', new: str = ''):
    """Match and replace strings from a dataframe column.
    Wrapper around df.str.replace

    Parameters
    -----------
    df: pd.Dataframe
        Input dataframe to be modified
    column_name: str
        Name of the column to be operated on
    old: str, default to ''
        Old string to be matched and replaced
    new: str, default to ''
        New string to replace old

    Returns
    --------
    df: pd.Dataframe

    """
    df[column_name] = df[column_name].str.replace(old, new)
    return df


# [pandas-flavor]
@pf.register_dataframe_method
def parse_number(df):
    """Check all columns of dataframe and properly parse numeric types

    Parameters
    -----------
    df: pd.Dataframe
        Input dataframe to be modified

    Returns
    --------
    df: pd.Dataframe

    """
    for col in df.columns:
        try:
            df[col] = pd.to_numeric(df[col])
        except ValueError as e:
            continue
    return df

# [pandas-flavor]
@pf.register_dataframe_method
def flatten_multiindex(df):
    """Flatten dataframe with multilevel index
    A wrapper around pd.DataFrame(df.to_records())

    Parameters
    -----------
    df: pd.Dataframe
        Input dataframe to be modified

    Returns
    --------
    df: pd.Dataframe

    """
    return pd.DataFrame(df.to_records())
[12]:
# clean up revenue values
column_name = 'revenue'
df_clean = (
    df_clean_category.str_remove(column_name, 'illion')  # [pandas-flavor]
        .str_trim(column_name)  # [pandas-flavor]
        .str_remove(column_name, ' ')  # [pandas-flavor]
        .str_replace(column_name, '\s*b', '')  # [pandas-flavor]
        .str_replace(column_name, '\s*m', 'e-3')  # [pandas-flavor]
        .parse_number()  # [pandas-flavor]
        .str_remove('original_media', '\[.+')  # [pandas-flavor]
)
df_clean.head(3)
[12]:
franchise year_created total_revenue revenue_category revenue original_media creators owners
0 Pokémon 1996 95.0 Merchandise, Licensing & Retail 64.100 Video game Satoshi Tajiri Ken Sugimori Nintendo (trademark) The Pokémon Company (Nint...
1 Pokémon 1996 95.0 Video Games/Games 17.138 Video game Satoshi Tajiri Ken Sugimori Nintendo (trademark) The Pokémon Company (Nint...
2 Pokémon 1996 95.0 Video Games/Games 10.853 Video game Satoshi Tajiri Ken Sugimori Nintendo (trademark) The Pokémon Company (Nint...

Section 2 One-Shot

[13]:
df_clean = (
    pd.read_html(fileurl)[2]
        .rename(
            columns={
                col_old: col_new
                for col_old, col_new in zip(df_raw.columns, colnames)
            }
        )
        .str_remove('total_revenue', pattern='est.')  # [pandas-flavor]
        .str_trim('total_revenue')  # [pandas-flavor]
        .str_remove('total_revenue', pattern='\$')   # [pandas-flavor]
        .str_slice('total_revenue', start=0, stop=2)  # [pandas-flavor]
        .change_type('total_revenue', float)  # [pyjanitor]
        .separate_rows('revenue_items', sep='\[')  # [pandas-flavor]
        .filter_string('revenue_items', 'illion')  # [pyjanitor]
        .separate(
            'revenue_items', into=['revenue_category', 'revenue'], sep='\$'
        )  # [pyjanitor + pandas-flavor]
        .str_remove('revenue_category', pattern=' – ')  # [pandas-flavor]
        .str_remove('revenue_category', pattern='.*\]')  # [pandas-flavor]
        .str_remove('revenue_category', pattern='\n')  # [pandas-flavor]
        .transform_column('revenue_category', str.lower)  # [pyjanitor] convert to lower case
        .transform_column('revenue_category', str.strip)  # [pyjanitor] strip leading/trailing white space
        .fuzzy_match_replace('revenue_category', mapper=value_mapper)  # [pyjanitor + pandas_flavor]
        .str_remove('revenue', 'illion')  # [pandas-flavor]
        .str_trim('revenue')  # [pandas-flavor]
        .str_remove('revenue', ' ')  # [pandas-flavor]
        .str_replace('revenue', '\s*b', '')  # [pandas-flavor]
        .str_replace('revenue', '\s*m', 'e-3')  # [pandas-flavor]
        .parse_number()  # [pandas-flavor]
        .str_remove('original_media', '\[.+')  # [pandas-flavor]
)

Final aggregation and join

R snippet:

 sum_df <- clean_df %>%
  group_by(franchise, revenue_category) %>%
  summarize(revenue = sum(revenue))

metadata_df <- clean_df %>%
  select(franchise:revenue_category, original_media:owners, -total_revenue)

final_df <- left_join(sum_df, metadata_df,
                      by = c("franchise", "revenue_category")) %>%
  distinct(.keep_all = TRUE)

final_df
[14]:
df_sum = (
    df_clean.groupby(['franchise', 'revenue_category'])
        .sum().flatten_multiindex()
)
df_sum.head(3)
[14]:
franchise revenue_category year_created total_revenue revenue
0 A Song of Ice and Fire (Game of Thrones) Box Office 1996 5.0 0.0019
1 A Song of Ice and Fire (Game of Thrones) Home Video/Entertainment 1996 5.0 0.2800
2 A Song of Ice and Fire (Game of Thrones) TV 1996 5.0 4.5000
[15]:
df_metadata = df_clean[
    ['franchise', 'revenue_category', 'original_media', 'creators']
]
df_metadata.head(3)
[15]:
franchise revenue_category original_media creators
0 Pokémon Merchandise, Licensing & Retail Video game Satoshi Tajiri Ken Sugimori
1 Pokémon Video Games/Games Video game Satoshi Tajiri Ken Sugimori
2 Pokémon Video Games/Games Video game Satoshi Tajiri Ken Sugimori

Final Dataframe

[16]:
# Generate final dataframe
df_final = (
    pd.merge(
        df_sum, df_metadata, how='left', on=['franchise', 'revenue_category']
    ).drop_duplicates(keep='first').reset_index(drop=True)
)
df_final.head(3)
[16]:
franchise revenue_category year_created total_revenue revenue original_media creators
0 A Song of Ice and Fire (Game of Thrones) Box Office 1996 5.0 0.0019 Novel George R. R. Martin
1 A Song of Ice and Fire (Game of Thrones) Home Video/Entertainment 1996 5.0 0.2800 Novel George R. R. Martin
2 A Song of Ice and Fire (Game of Thrones) TV 1996 5.0 4.5000 Novel George R. R. Martin
[ ]: