janitor.filter_string

janitor.filter_string(df: pandas.core.frame.DataFrame, column_name: Hashable, search_string: str, complement: bool = False) → pandas.core.frame.DataFrame

Filter a string-based column according to whether it contains a substring.

This is super sugary syntax that builds on top of pandas.Series.str.contains.

Because this function uses pandas.Series.str.contains internally, and that method accepts a regex string, search_string can also be a regex pattern.
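One consequence is that regex metacharacters in search_string are not matched literally. The snippet below is a small illustration that uses pandas.Series.str.contains directly rather than filter_string itself; the DataFrame and its "name" column are made up for the example, and re.escape is one way to force a literal match.

```python
import re

import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"name": ["a.b", "axb", "c.d"]})

# Interpreted as a regex: "." matches any single character.
regex_rows = df[df["name"].str.contains("a.b")]

# Escaped first, so "." only matches a literal dot.
literal_rows = df[df["name"].str.contains(re.escape("a.b"))]

print(regex_rows["name"].tolist())
print(literal_rows["name"].tolist())
```

If a plain substring containing characters such as ".", "(" or "+" is intended, escaping it before passing it as search_string is probably the safer choice.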

This method does not mutate the original DataFrame.

This function allows us to method chain filtering operations:

df = (pd.DataFrame(...)
      .filter_string('column', search_string='pattern', complement=False)
      ...)  # chain on more data preprocessing.

This stands in contrast to the boolean-indexing syntax that is usually used, which requires a separate reassignment:

df = pd.DataFrame(...)
df = df[df['column'].str.contains('pattern')]

As can be seen here, the API design allows for a more seamless flow in expressing the filtering operations.

Functional usage syntax:

df = filter_string(df,
                   column_name='column',
                   search_string='pattern',
                   complement=False)

Method chaining syntax:

df = (pd.DataFrame(...)
      .filter_string(column_name='column',
                     search_string='pattern',
                     complement=False)
      ...)

Parameters
  • df – A pandas DataFrame.

  • column_name – The column to filter. The column should contain strings.

  • search_string – A regex pattern or a (sub-)string to search for.

  • complement – If True, return the rows that do not match search_string instead of the rows that do. Defaults to False.

Returns

A filtered pandas DataFrame.
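
To make the effect of complement concrete, here is a minimal sketch of the documented behaviour written with plain pandas. filter_string_sketch is a hypothetical stand-in and not pyjanitor's actual source; it assumes the filter amounts to boolean indexing on the mask returned by pandas.Series.str.contains, and the "fruit" data is invented for the example.

```python
import pandas as pd


def filter_string_sketch(df, column_name, search_string, complement=False):
    """Hypothetical stand-in mirroring the documented behaviour."""
    # Boolean mask: True where the column value contains search_string.
    mask = df[column_name].str.contains(search_string)
    if complement:
        # Invert the mask to keep the non-matching rows instead.
        mask = ~mask
    # Boolean indexing returns a new DataFrame; df itself is not modified.
    return df[mask]


# Hypothetical data, used only for illustration.
df = pd.DataFrame({"fruit": ["apple", "banana", "pineapple", "cherry"]})

kept = filter_string_sketch(df, "fruit", "apple")
dropped = filter_string_sketch(df, "fruit", "apple", complement=True)

print(kept["fruit"].tolist())
print(dropped["fruit"].tolist())
```

Under these assumptions, the two calls split the rows between them, and the original DataFrame keeps all of its rows.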