Processing Anime Data

Background

We will use pyjanitor to showcase how to conveniently chain methods together to perform data cleaning in one shot. We We first define and register a series of dataframe methods with pandas_flavor. Then we chain the dataframe methods together with pyjanitor methods to complete the data cleaning process. The below example shows a one-shot script followed by a step-by-step detail of each part of the method chain.

We have adapted a TidyTuesday analysis that was originally performed in R. The original text from TidyTuesday will be shown in blockquotes.

Note: TidyTuesday is based on the principles discussed and made popular by Hadley Wickham in his paper Tidy Data.

The original text from TidyTuesday will be shown in blockquotes. Here is a description of the Anime data set that we will use.

This week’s data comes from Tam Nguyen and MyAnimeList.net via Kaggle. According to Wikipedia - “MyAnimeList, often abbreviated as MAL, is an anime and manga social networking and social cataloging application website. The site provides its users with a list-like system to organize and score anime and manga. It facilitates finding users who share similar tastes and provides a large database on anime and manga. The site claims to have 4.4 million anime and 775,000 manga entries. In 2015, the site received 120 million visitors a month.”

Anime without rankings or popularity scores were excluded. Producers, genre, and studio were converted from lists to tidy observations, so there will be repetitions of shows with multiple producers, genres, etc. The raw data is also uploaded.

Lots of interesting ways to explore the data this week!

Import libraries and load data

[1]:
# Import pyjanitor and pandas
import janitor
import pandas as pd
import pandas_flavor as pf
[2]:
# Supress user warnings when we try overwriting our custom pandas flavor functions
import warnings
warnings.filterwarnings('ignore')

One-Shot

[3]:
filename = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-04-23/raw_anime.csv'
df = pd.read_csv(filename)

@pf.register_dataframe_method
def str_remove(df, column_name: str, pat: str, *args, **kwargs):
    """Wrapper around df.str.replace"""

    df[column_name] = df[column_name].str.replace(pat, "", *args, **kwargs)
    return df


@pf.register_dataframe_method
def str_trim(df, column_name: str, *args, **kwargs):
    """Wrapper around df.str.strip"""

    df[column_name] = df[column_name].str.strip(*args, **kwargs)
    return df


@pf.register_dataframe_method
def explode(df: pd.DataFrame, column_name: str, sep: str):
    """
    For rows with a list of values, this function will create new
    rows for each value in the list
    """

    df["id"] = df.index
    wdf = (
        pd.DataFrame(df[column_name].str.split(sep).fillna("").tolist())
        .stack()
        .reset_index()
    )
    # exploded_column = column_name
    wdf.columns = ["id", "depth", column_name]  ## plural form to singular form
    # wdf[column_name] = wdf[column_name].apply(lambda x: x.strip())  # trim
    wdf.drop("depth", axis=1, inplace=True)

    return pd.merge(df, wdf, on="id", suffixes=("_drop", "")).drop(
        columns=["id", column_name + "_drop"]
    )


@pf.register_dataframe_method
def str_word(
    df,
    column_name: str,
    start: int = None,
    stop: int = None,
    pat: str = " ",
    *args,
    **kwargs
):
    """
    Wrapper around `df.str.split` with additional `start` and `end` arguments
    to select a slice of the list of words.
    """

    df[column_name] = df[column_name].str.split(pat).str[start:stop]
    return df


@pf.register_dataframe_method
def str_join(df, column_name: str, sep: str, *args, **kwargs):
    """
    Wrapper around `df.str.join`
    Joins items in a list.
    """

    df[column_name] = df[column_name].str.join(sep)
    return df


@pf.register_dataframe_method
def str_slice(
    df, column_name: str, start: int = None, stop: int = None, *args, **kwargs
):
    """
    Wrapper around `df.str.slice
    """

    df[column_name] = df[column_name].str[start:stop]
    return df


clean_df = (
    df.str_remove(column_name="producers", pat="\[|\]")
    .explode(column_name="producers", sep=",")
    .str_remove(column_name="producers", pat="'")
    .str_trim("producers")
    .str_remove(column_name="genre", pat="\[|\]")
    .explode(column_name="genre", sep=",")
    .str_remove(column_name="genre", pat="'")
    .str_trim(column_name="genre")
    .str_remove(column_name="studio", pat="\[|\]")
    .explode(column_name="studio", sep=",")
    .str_remove(column_name="studio", pat="'")
    .str_trim(column_name="studio")
    .str_remove(column_name="aired", pat="\{|\}|'from':\s*|'to':\s*")
    .str_word(column_name="aired", start=0, stop=2, pat=",")
    .str_join(column_name="aired", sep=",")
    .deconcatenate_column(
        column_name="aired", new_column_names=["start_date", "end_date"], sep=","
    )
    .remove_columns(column_names=["aired"])
    .str_remove(column_name="start_date", pat="'")
    .str_slice(column_name="start_date", start=0, stop=10)
    .str_remove(column_name="end_date", pat="'")
    .str_slice(column_name="end_date", start=0, stop=11)
    .to_datetime("start_date", format="%Y-%m-%d", errors="coerce")
    .to_datetime("end_date", format="%Y-%m-%d", errors="coerce")
    .fill_empty(columns=["rank", "popularity"], value=0)
    .filter_on("rank != 0 & popularity != 0")
)
[4]:
clean_df.head()
[4]:
animeID name title_english title_japanese title_synonyms type source episodes status airing ... synopsis background premiered broadcast related producers genre studio start_date end_date
0 1 Cowboy Bebop Cowboy Bebop カウボーイビバップ [] TV Original 26.0 Finished Airing False ... In the year 2071, humanity has colonized sever... When Cowboy Bebop first aired in spring of 199... Spring 1998 Saturdays at 01:00 (JST) {'Adaptation': [{'mal_id': 173, 'type': 'manga... Bandai Visual Action Sunrise 1998-04-03 1999-04-24
1 1 Cowboy Bebop Cowboy Bebop カウボーイビバップ [] TV Original 26.0 Finished Airing False ... In the year 2071, humanity has colonized sever... When Cowboy Bebop first aired in spring of 199... Spring 1998 Saturdays at 01:00 (JST) {'Adaptation': [{'mal_id': 173, 'type': 'manga... Bandai Visual Adventure Sunrise 1998-04-03 1999-04-24
2 1 Cowboy Bebop Cowboy Bebop カウボーイビバップ [] TV Original 26.0 Finished Airing False ... In the year 2071, humanity has colonized sever... When Cowboy Bebop first aired in spring of 199... Spring 1998 Saturdays at 01:00 (JST) {'Adaptation': [{'mal_id': 173, 'type': 'manga... Bandai Visual Comedy Sunrise 1998-04-03 1999-04-24
3 1 Cowboy Bebop Cowboy Bebop カウボーイビバップ [] TV Original 26.0 Finished Airing False ... In the year 2071, humanity has colonized sever... When Cowboy Bebop first aired in spring of 199... Spring 1998 Saturdays at 01:00 (JST) {'Adaptation': [{'mal_id': 173, 'type': 'manga... Bandai Visual Drama Sunrise 1998-04-03 1999-04-24
4 1 Cowboy Bebop Cowboy Bebop カウボーイビバップ [] TV Original 26.0 Finished Airing False ... In the year 2071, humanity has colonized sever... When Cowboy Bebop first aired in spring of 199... Spring 1998 Saturdays at 01:00 (JST) {'Adaptation': [{'mal_id': 173, 'type': 'manga... Bandai Visual Sci-Fi Sunrise 1998-04-03 1999-04-24

5 rows × 28 columns

Multi-Step

Data Dictionary

Heads up the dataset is about 97 mb - if you want to free up some space, drop the synopsis and background, they are long strings, or broadcast, premiered, related as they are redundant or less useful.

variable class description
animeID double Anime ID (as in https://myanimelist.net/anime/animeID)
name character anime title - extracted from the site.
title_english character title in English (sometimes is different, sometimes is missing)
title_japanese character title in Japanese (if Anime is Chinese or Korean, the title, if available, in the respective language)
title_synonyms character other variants of the title
type character anime type (e.g. TV, Movie, OVA)
source character source of anime (i.e original, manga, game, music, visual novel etc.)
producers character producers
genre character genre
studio character studio
episodes double number of episodes
status character Aired or not aired
airing logical True/False is still airing
start_date double Start date (ymd)
end_date double End date (ymd)
duration character Per episode duration or entire duration, text string
rating character Age rating
score double Score (higher = better)
scored_by double Number of users that scored
rank double Rank - weight according to MyAnimeList formula
popularity double based on how many members/users have the respective anime in their list
members double number members that added this anime in their list
favorites double number members that favorites these in their list
synopsis character long string with anime synopsis
background character long string with production background and other things
premiered character anime premiered on season/year
broadcast character when is (regularly) broadcasted
related character dictionary: related animes, series, games etc.

Step 0: Load data

[5]:
filename = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-04-23/raw_anime.csv'
df = pd.read_csv(filename)
[6]:
df.head(3).T
[6]:
0 1 2
animeID 1 5 6
name Cowboy Bebop Cowboy Bebop: Tengoku no Tobira Trigun
title_english Cowboy Bebop Cowboy Bebop: The Movie Trigun
title_japanese カウボーイビバップ カウボーイビバップ 天国の扉 トライガン
title_synonyms [] ["Cowboy Bebop: Knockin' on Heaven's Door"] []
type TV Movie TV
source Original Original Manga
producers ['Bandai Visual'] ['Sunrise', 'Bandai Visual'] ['Victor Entertainment']
genre ['Action', 'Adventure', 'Comedy', 'Drama', 'Sc... ['Action', 'Drama', 'Mystery', 'Sci-Fi', 'Space'] ['Action', 'Sci-Fi', 'Adventure', 'Comedy', 'D...
studio ['Sunrise'] ['Bones'] ['Madhouse']
episodes 26 1 26
status Finished Airing Finished Airing Finished Airing
airing False False False
aired {'from': '1998-04-03T00:00:00+00:00', 'to': '1... {'from': '2001-09-01T00:00:00+00:00', 'to': No... {'from': '1998-04-01T00:00:00+00:00', 'to': '1...
duration 24 min per ep 1 hr 55 min 24 min per ep
rating R - 17+ (violence & profanity) R - 17+ (violence & profanity) PG-13 - Teens 13 or older
score 8.81 8.41 8.3
scored_by 405664 120243 212537
rank 26 164 255
popularity 39 449 146
members 795733 197791 408548
favorites 43460 776 10432
synopsis In the year 2071, humanity has colonized sever... Another day, another bounty—such is the life o... Vash the Stampede is the man with a $$60,000,0...
background When Cowboy Bebop first aired in spring of 199... NaN The Japanese release by Victor Entertainment h...
premiered Spring 1998 NaN Spring 1998
broadcast Saturdays at 01:00 (JST) NaN Thursdays at 01:15 (JST)
related {'Adaptation': [{'mal_id': 173, 'type': 'manga... {'Parent story': [{'mal_id': 1, 'type': 'anime... {'Adaptation': [{'mal_id': 703, 'type': 'manga...

Step 1: Clean producers column

The first step tries to clean up the producers column by removing some brackets (‘[]’) and trim off some empty spaces

clean_df <- raw_df %>%
 # Producers
 mutate(producers = str_remove(producers, "\\["),
producers = str_remove(producers, "\\]"))

What is mutate? This link compares R’s mutate to be similar to pandas’ df.assign. However, df.assign returns a new DataFrame whereas mutate adds a new variable while preserving the previous ones. Therefore, for this example, I will compare mutate to be similar to df['col'] = X

As we can see, this is looks like a list of items but in string form

[7]:
# Let's see what we trying to remove
df.loc[df['producers'].str.contains("\[", na=False), 'producers'].head()
[7]:
0               ['Bandai Visual']
1    ['Sunrise', 'Bandai Visual']
2        ['Victor Entertainment']
3               ['Bandai Visual']
4          ['TV Tokyo', 'Dentsu']
Name: producers, dtype: object

Let’s use pandas flavor to create a custom method for just removing some strings so we don’t have to use str.replace so many times.

[8]:
@pf.register_dataframe_method
def str_remove(df, column_name: str, pat: str, *args, **kwargs):
    """
    Wrapper around df.str.replace
    The function will loop through regex patterns and remove them from the desired column.

    :param df: A pandas DataFrame.
    :param column_name: A `str` indicating which column the string removal action is to be made.
    :param pat: A regex pattern to match and remove.
    """

    if not isinstance(pat, str):
        raise TypeError(
            f"Pattern should be a valid regex pattern. Received pattern: {pat} with dtype: {type(pat)}"
        )
    df[column_name] = df[column_name].str.replace(pat, "", *args, **kwargs)
    return df
[9]:
clean_df = (
    df
    .str_remove(column_name='producers', pat='\[|\]')
)

With brackets removed.

[10]:
clean_df['producers'].head()
[10]:
0               'Bandai Visual'
1    'Sunrise', 'Bandai Visual'
2        'Victor Entertainment'
3               'Bandai Visual'
4          'TV Tokyo', 'Dentsu'
Name: producers, dtype: object

Brackets are removed. Now the next part >>  separate_rows(producers, sep = ",") %>%  >

It seems like separate rows will go through each value of the column, and if the value is a list, will create a new row for each value in the list with the remaining column values being the same. This is commonly known as an explode method but it is not yet implemented in pandas. We will need a function for this (code adopted from here).

[11]:
@pf.register_dataframe_method
def explode(df: pd.DataFrame, column_name: str, sep: str):
    """
    For rows with a list of values, this function will create new rows for each value in the list

    :param df: A pandas DataFrame.
    :param column_name: A `str` indicating which column the string removal action is to be made.
    :param sep: The delimiter. Example delimiters include `|`, `, `, `,` etc.
    """

    df["id"] = df.index
    wdf = (
        pd.DataFrame(df[column_name].str.split(sep).fillna("").tolist())
        .stack()
        .reset_index()
    )
    # exploded_column = column_name
    wdf.columns = ["id", "depth", column_name]  ## plural form to singular form
    # wdf[column_name] = wdf[column_name].apply(lambda x: x.strip())  # trim
    wdf.drop("depth", axis=1, inplace=True)

    return pd.merge(df, wdf, on="id", suffixes=("_drop", "")).drop(
        columns=["id", column_name + "_drop"]
    )
[12]:
clean_df = (
    clean_df
    .explode(column_name='producers', sep=',')
)

Now every producer is its own row.

[13]:
clean_df['producers'].head()
[13]:
0           'Bandai Visual'
1                 'Sunrise'
2           'Bandai Visual'
3    'Victor Entertainment'
4           'Bandai Visual'
Name: producers, dtype: object

Now remove single quotes and a bit of trimming >mutate(producers = str_remove(producers, "\\'"),          producers = str_remove(producers, "\\'"),          producers = str_trim(producers)) %>%

[14]:
clean_df = (
    clean_df
    .str_remove(column_name='producers', pat='\'')
)

We’ll make another custom function for trimming whitespace.

[15]:
@pf.register_dataframe_method
def str_trim(df, column_name: str, *args, **kwargs):
    """Remove trailing and leading characters, in a given column"""
    df[column_name] = df[column_name].str.strip(*args, **kwargs)
    return df
[16]:
clean_df = clean_df.str_trim('producers')

Finally, here is our cleaned producers column.

[17]:
clean_df['producers'].head()
[17]:
0           Bandai Visual
1                 Sunrise
2           Bandai Visual
3    Victor Entertainment
4           Bandai Visual
Name: producers, dtype: object

Step 2: Clean genre and studio Columns

Let’s do the same process for columns Genre and Studio

 # Genre
mutate(genre = str_remove(genre, "\\["),
genre = str_remove(genre, "\\]")) %>%
separate_rows(genre, sep = ",") %>%
mutate(genre = str_remove(genre, "\\'"),
genre = str_remove(genre, "\\'"),
genre = str_trim(genre)) %>%
 # Studio
mutate(studio = str_remove(studio, "\\["),
studio = str_remove(studio, "\\]")) %>%
separate_rows(studio, sep = ",") %>%
mutate(studio = str_remove(studio, "\\'"),
studio = str_remove(studio, "\\'"),
studio = str_trim(studio)) %>%
[18]:
clean_df = (
    clean_df
    # Perform operation for genre.
    .str_remove(column_name='genre', pat='\[|\]')
    .explode(column_name='genre', sep=',')
    .str_remove(column_name='genre', pat='\'')
    .str_trim(column_name='genre')
    # Now do it for studio
    .str_remove(column_name='studio', pat='\[|\]')
    .explode(column_name='studio', sep=',')
    .str_remove(column_name='studio', pat='\'')
    .str_trim(column_name='studio')
)

Resulting cleaned columns.

[19]:
clean_df[['genre', 'studio']].head()
[19]:
genre studio
0 Action Sunrise
1 Adventure Sunrise
2 Comedy Sunrise
3 Drama Sunrise
4 Sci-Fi Sunrise

Step 3: Clean aired column

The aired column has something a little different. In addition to the usual removing some strings and whitespace trimming, we want to separate the values into two separate columns start_date and end_date

 # Aired
mutate(aired = str_remove(aired, "\\{"),
aired = str_remove(aired, "\\}"),
aired = str_remove(aired, "'from': "),
aired = str_remove(aired, "'to': "),
aired = word(aired, start = 1, 2, sep = ",")) %>%
separate(aired, into = c("start_date", "end_date"), sep = ",") %>%
mutate(start_date = str_remove_all(start_date, "'"),
start_date = str_sub(start_date, 1, 10),
end_date = str_remove_all(start_date, "'"),
end_date = str_sub(end_date, 1, 10)) %>%
mutate(start_date = lubridate::ymd(start_date),
end_date = lubridate::ymd(end_date)) %>%

We will create some custom wrapper functions to emulate R’s word and use pyjanitor’s deconcatenate_column.

[20]:
# Currently looks like this
clean_df['aired'].head()
[20]:
0    {'from': '1998-04-03T00:00:00+00:00', 'to': '1...
1    {'from': '1998-04-03T00:00:00+00:00', 'to': '1...
2    {'from': '1998-04-03T00:00:00+00:00', 'to': '1...
3    {'from': '1998-04-03T00:00:00+00:00', 'to': '1...
4    {'from': '1998-04-03T00:00:00+00:00', 'to': '1...
Name: aired, dtype: object
[21]:
@pf.register_dataframe_method
def str_word(
    df,
    column_name: str,
    start: int = None,
    stop: int = None,
    pat: str = " ",
    *args,
    **kwargs
):
    """
    Wrapper around `df.str.split` with additional `start` and `end` arguments
    to select a slice of the list of words.

    :param df: A pandas DataFrame.
    :param column_name: A `str` indicating which column the split action is to be made.
    :param start: optional An `int` for the start index of the slice
    :param stop: optinal  An `int` for the end index of the slice
    :param pat: String or regular expression to split on. If not specified, split on whitespace.

    """
    df[column_name] = df[column_name].str.split(pat).str[start:stop]
    return df


@pf.register_dataframe_method
def str_join(df, column_name: str, sep: str, *args, **kwargs):
    """
    Wrapper around `df.str.join`
    Joins items in a list.

    :param df: A pandas DataFrame.
    :param column_name: A `str` indicating which column the split action is to be made.
    :param sep: The delimiter. Example delimiters include `|`, `, `, `,` etc.
    """
    df[column_name] = df[column_name].str.join(sep)
    return df


@pf.register_dataframe_method
def str_slice(
    df, column_name: str, start: int = None, stop: int = None, *args, **kwargs
):
    """
    Wrapper around `df.str.slice
    Slices strings.

    :param df: A pandas DataFrame.
    :param column_name: A `str` indicating which column the split action is to be made.
    :param start: 'int' indicating start of slice.
    :param stop: 'int' indicating stop of slice.
    """
    df[column_name] = df[column_name].str[start:stop]
    return df
[22]:
clean_df = (
    clean_df.str_remove(column_name="aired", pat="\{|\}|'from':\s*|'to':\s*")
    .str_word(column_name="aired", start=0, stop=2, pat=",")
    .str_join(column_name="aired", sep=",")
    # .add_columns({'start_date': clean_df['aired'][0]})
    .deconcatenate_column(
        column_name="aired", new_column_names=["start_date", "end_date"], sep=","
    )
    .remove_columns(column_names=["aired"])
    .str_remove(column_name="start_date", pat="'")
    .str_slice(column_name="start_date", start=0, stop=10)
    .str_remove(column_name="end_date", pat="'")
    .str_slice(column_name="end_date", start=0, stop=11)
    .to_datetime("start_date", format="%Y-%m-%d", errors="coerce")
    .to_datetime("end_date", format="%Y-%m-%d", errors="coerce")
)
[23]:
# Resulting 'start_date' and 'end_date' columns with 'aired' column removed
clean_df[['start_date', 'end_date']].head()
[23]:
start_date end_date
0 1998-04-03 1999-04-24
1 1998-04-03 1999-04-24
2 1998-04-03 1999-04-24
3 1998-04-03 1999-04-24
4 1998-04-03 1999-04-24

Step 4: Filter out unranked and unpopular series

Finally, let’s drop the unranked or unpopular series with pyjanitor’s filter_on.

[24]:
# First fill any NA values with 0 and then filter != 0
clean_df = clean_df.fill_empty(column_names=["rank", "popularity"], value=0).filter_on(
    "rank != 0 & popularity != 0"
)

End Result

[25]:
clean_df.head()
[25]:
animeID name title_english title_japanese title_synonyms type source episodes status airing ... synopsis background premiered broadcast related producers genre studio start_date end_date
0 1 Cowboy Bebop Cowboy Bebop カウボーイビバップ [] TV Original 26.0 Finished Airing False ... In the year 2071, humanity has colonized sever... When Cowboy Bebop first aired in spring of 199... Spring 1998 Saturdays at 01:00 (JST) {'Adaptation': [{'mal_id': 173, 'type': 'manga... Bandai Visual Action Sunrise 1998-04-03 1999-04-24
1 1 Cowboy Bebop Cowboy Bebop カウボーイビバップ [] TV Original 26.0 Finished Airing False ... In the year 2071, humanity has colonized sever... When Cowboy Bebop first aired in spring of 199... Spring 1998 Saturdays at 01:00 (JST) {'Adaptation': [{'mal_id': 173, 'type': 'manga... Bandai Visual Adventure Sunrise 1998-04-03 1999-04-24
2 1 Cowboy Bebop Cowboy Bebop カウボーイビバップ [] TV Original 26.0 Finished Airing False ... In the year 2071, humanity has colonized sever... When Cowboy Bebop first aired in spring of 199... Spring 1998 Saturdays at 01:00 (JST) {'Adaptation': [{'mal_id': 173, 'type': 'manga... Bandai Visual Comedy Sunrise 1998-04-03 1999-04-24
3 1 Cowboy Bebop Cowboy Bebop カウボーイビバップ [] TV Original 26.0 Finished Airing False ... In the year 2071, humanity has colonized sever... When Cowboy Bebop first aired in spring of 199... Spring 1998 Saturdays at 01:00 (JST) {'Adaptation': [{'mal_id': 173, 'type': 'manga... Bandai Visual Drama Sunrise 1998-04-03 1999-04-24
4 1 Cowboy Bebop Cowboy Bebop カウボーイビバップ [] TV Original 26.0 Finished Airing False ... In the year 2071, humanity has colonized sever... When Cowboy Bebop first aired in spring of 199... Spring 1998 Saturdays at 01:00 (JST) {'Adaptation': [{'mal_id': 173, 'type': 'manga... Bandai Visual Sci-Fi Sunrise 1998-04-03 1999-04-24

5 rows × 28 columns