janitor.process_text

janitor.process_text(df: pandas.core.frame.DataFrame, column_name: str, new_column_names: Optional[Union[list, str]] = None, merge_frame: Optional[bool] = False, string_function: Optional[str] = None, **kwargs: str) → pandas.core.frame.DataFrame

Apply a Pandas string method to an existing column and return a dataframe.

This function makes string cleaning easy within a method chain: pass the name of a Pandas string method to process_text, along with any arguments that method needs. By default the existing column is modified; the function can also be used to create a new column instead.

Note

In versions < 0.20.11, this function did not support the creation of new columns.

A list of all the string methods in Pandas is available in the Pandas documentation on working with text data.
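Conceptually, passing a method name as a string resembles looking that method up on the column's .str accessor and calling it with the supplied keyword arguments. The sketch below uses plain Pandas only. The helper name apply_string_method is hypothetical and is not part of pyjanitor, and the actual implementation of process_text may differ.

```python
import pandas as pd


def apply_string_method(df, column_name, string_function, **kwargs):
    """Hypothetical helper: apply a pandas string method to one column."""
    # Look up the method by name on the .str accessor,
    # e.g. "lower" resolves to df["text"].str.lower.
    method = getattr(df[column_name].str, string_function)
    # Work on a copy so the caller's dataframe is left untouched
    # (an assumption made for this sketch only).
    out = df.copy()
    out[column_name] = method(**kwargs)
    return out


df = pd.DataFrame({"text": ["Ragnar", "sammywemmy", "ginger"], "code": [1, 2, 3]})
lowered = apply_string_method(df, "text", "lower")
print(lowered["text"].tolist())
```

Keyword arguments are forwarded unchanged, so apply_string_method(df, "text", "slice", stop=3) would call df["text"].str.slice(stop=3).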

Example:

import re
import pandas as pd
import janitor as jn

df = pd.DataFrame({"text" : ["Ragnar",
                            "sammywemmy",
                            "ginger"],
                   "code" : [1, 2, 3]})

df.process_text(column_name = "text",
                string_function = "lower")

  text          code
0 ragnar         1
1 sammywemmy     2
2 ginger         3

For string methods with parameters, simply pass the keyword arguments:

df.process_text(
    column_name = "text",
    string_function = "extract",
    pat = r"(ag)",
    expand = False,
    flags = re.IGNORECASE
    )

  text     code
0 ag        1
1 NaN       2
2 NaN       3

A new column can be created, leaving the existing column unmodified:

df.process_text(
    column_name = "text",
    new_column_names = "new_text",
    string_function = "extract",
    pat = r"(ag)",
    flags = re.IGNORECASE
    )

  text           code     new_text
0 Ragnar          1          ag
1 sammywemmy      2          NaN
2 ginger          3          NaN

Functional usage syntax:

import pandas as pd
import janitor as jn

df = pd.DataFrame(...)
df = jn.process_text(
    df = df,
    column_name = "column_name_here",
    new_column_names = None/string/list_of_strings,
    merge_frame = True/False,
    string_function = "string_func_name_here",
    **kwargs
    )

Method-chaining usage syntax:

import pandas as pd
import janitor as jn

df = (
    pd.DataFrame(...)
    .process_text(
        column_name = "column_name_here",
        new_column_names = None/string/list_of_strings,
        merge_frame = True/False,
        string_function = "string_func_name_here",
        **kwargs
        )
)

Parameters
  • df – A pandas dataframe.

  • column_name – String column to be operated on.

  • new_column_names – Name(s) to assign to the new column(s) created by the text processing. If the result of the text processing is a Series or string, new_column_names can be a string. If the result is a dataframe, a string is treated as a prefix for each column of the new dataframe, while a list of strings supplies the new column names directly. The existing column_name stays unmodified if new_column_names is not None.

  • merge_frame – Only relevant if the result of the text processing is a dataframe. If True, the resulting dataframe is merged with the original dataframe; otherwise the resulting dataframe, not the original dataframe, is returned.

  • string_function – Pandas string method to be applied.

  • kwargs – Keyword arguments for parameters of the string_function.
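To illustrate the dataframe case described for new_column_names and merge_frame, the following plain-Pandas sketch uses str.extract with two capture groups, which returns a dataframe rather than a Series. The prefix "part_" and the use of add_prefix and join are assumptions chosen for illustration; the exact column names and merge behaviour of process_text may differ.

```python
import pandas as pd

df = pd.DataFrame({"text": ["Ragnar", "sammywemmy", "ginger"], "code": [1, 2, 3]})

# Two capture groups with expand=True yield a dataframe with columns 0 and 1.
parts = df["text"].str.extract(r"^(\w)(\w)", expand=True)

# A string used as a prefix for each resulting column (illustrative naming).
prefixed = parts.add_prefix("part_")

# Analogue of merge_frame=True: attach the new columns to the original dataframe.
merged = df.join(prefixed)

# Analogue of merge_frame=False: only the new dataframe would be returned.
print(prefixed.columns.tolist())
print(merged.columns.tolist())
```

In both cases the original "text" column is left as it was, which matches the documented behaviour when new_column_names is not None.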

Returns

A pandas dataframe with modified column(s).

Raises
  • KeyError – if string_function is not a Pandas string method.

  • TypeError – if the wrong arg or kwarg is supplied.

  • ValueError – if column_name is not found in the dataframe.

  • ValueError – if new_column_names is not None and is already present in the dataframe.
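
The checks behind these exceptions can be sketched with plain Pandas as follows. The function check_process_text_args is hypothetical and is not part of pyjanitor; it only mirrors the documented conditions, and the order of checks and the error messages are assumptions.

```python
import pandas as pd


def check_process_text_args(df, column_name, new_column_names, string_function):
    """Hypothetical pre-flight checks mirroring the documented exceptions."""
    if column_name not in df.columns:
        raise ValueError(f"{column_name!r} not found in dataframe.")

    # Normalise a single name to a list so both forms are checked alike.
    if isinstance(new_column_names, str):
        names = [new_column_names]
    else:
        names = list(new_column_names or [])
    clashes = [name for name in names if name in df.columns]
    if clashes:
        raise ValueError(f"{clashes} already present in dataframe.")

    # pd.Series.str exposes the string-method namespace on the class itself;
    # names starting with an underscore are treated as private here.
    if string_function.startswith("_") or not hasattr(pd.Series.str, string_function):
        raise KeyError(f"{string_function!r} is not a pandas string method.")


df = pd.DataFrame({"text": ["Ragnar"], "code": [1]})
check_process_text_args(df, "text", "new_text", "lower")  # passes silently
```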