janitor.process_text
janitor.process_text(df: pandas.core.frame.DataFrame, column_name: str, new_column_names: Optional[Union[list, str]] = None, merge_frame: Optional[bool] = False, string_function: Optional[str] = None, **kwargs: str) → pandas.core.frame.DataFrame
Apply a Pandas string method to an existing column and return a dataframe.
This function aims to make string cleaning easy, while chaining, by simply passing the string method name to the process_text function. This modifies an existing column and can also be used to create a new column.
Note
In versions < 0.20.11, this function did not support the creation of new columns.
A list of all the string methods in Pandas is available in the Pandas documentation.
Example:
    import pandas as pd
    import janitor as jn

    df = pd.DataFrame({"text" : ["Ragnar", "sammywemmy", "ginger"],
                       "code" : [1, 2, 3]})

    df.process_text(column_name = "text",
                    string_function = "lower")

             text  code
    0      ragnar     1
    1  sammywemmy     2
    2      ginger     3
For string methods with parameters, simply pass the keyword arguments:
    import re

    df.process_text(
        column_name = "text",
        string_function = "extract",
        pat = r"(ag)",
        expand = False,
        flags = re.IGNORECASE
    )

      text  code
    0   ag     1
    1  NaN     2
    2  NaN     3
A new column can be created, leaving the existing column unmodified:
    df.process_text(
        column_name = "text",
        new_column_names = "new_text",
        string_function = "extract",
        pat = r"(ag)",
        flags = re.IGNORECASE
    )

             text  code new_text
    0      Ragnar     1       ag
    1  sammywemmy     2      NaN
    2      ginger     3      NaN
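To show what the examples above amount to in plain Pandas, the sketch below looks up a .str method by name and forwards the keyword arguments to it. It assumes that process_text dispatches in roughly this way; the helper apply_str_method is hypothetical and is not part of pyjanitor.

```python
import re

import pandas as pd


def apply_str_method(series, string_function, **kwargs):
    """Hypothetical helper: call a Pandas .str method chosen by name."""
    method = getattr(series.str, string_function)
    return method(**kwargs)


df = pd.DataFrame({"text": ["Ragnar", "sammywemmy", "ginger"],
                   "code": [1, 2, 3]})

# No parameters: equivalent to df["text"].str.lower()
lowered = apply_str_method(df["text"], "lower")

# Keyword arguments are passed straight through to .str.extract
extracted = apply_str_method(
    df["text"], "extract", pat=r"(ag)", expand=False, flags=re.IGNORECASE
)

# Storing the result under a new name leaves the "text" column unmodified
result = df.assign(new_text=extracted)
```

Because the method is resolved by name at call time, any string method that Pandas exposes on the .str accessor can be used without a dedicated wrapper.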
Functional usage syntax:
    import pandas as pd
    import janitor as jn

    df = pd.DataFrame(...)

    df = jn.process_text(
        df = df,
        column_name = "column_name_here",
        new_column_names = None,  # or a string, or a list of strings
        merge_frame = False,  # or True
        string_function = "string_func_name_here",
        **kwargs
    )
Method-chaining usage syntax:
    import pandas as pd
    import janitor as jn

    df = (
        pd.DataFrame(...)
        .process_text(
            column_name = "column_name_here",
            new_column_names = None,  # or a string, or a list of strings
            merge_frame = False,  # or True
            string_function = "string_func_name_here",
            **kwargs
        )
    )
- Parameters
df – A pandas dataframe.
column_name – String column to be operated on.
new_column_names – Name(s) to assign to the new column(s) created from the text processing. new_column_names can be a string, if the result of the text processing is a Series or string; if the result of the text processing is a dataframe, then new_column_names is treated as a prefix for each of the columns in the new dataframe. new_column_names can also be a list of strings to act as new column names for the new dataframe. The existing column_name stays unmodified if new_column_names is not None.
merge_frame – This comes into play if the result of the text processing is a dataframe. If True, the resulting dataframe will be merged with the original dataframe, else the resulting dataframe, not the original dataframe, will be returned.
string_function – Pandas string method to be applied.
kwargs – Keyword arguments for parameters of the string_function.
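The handling of new_column_names and merge_frame when the string method returns a dataframe can be pictured with plain Pandas. This is a minimal sketch of the behavior described above, not the library's implementation; in particular, the use of add_prefix, set_axis and join is an assumption made for illustration.

```python
import pandas as pd

df = pd.DataFrame({"text": ["Ragnar", "sammywemmy", "ginger"],
                   "code": [1, 2, 3]})

# Two capture groups, so .str.extract returns a dataframe with columns 0 and 1.
parts = df["text"].str.extract(r"(.)(.+)")

# A string is treated as a prefix for each column of the new dataframe.
prefixed = parts.add_prefix("new_text_")

# A list of strings supplies the new column names directly.
renamed = parts.set_axis(["first", "rest"], axis=1)

# merge_frame = True: the new columns are attached to the original dataframe.
merged = df.join(prefixed)

# merge_frame = False: only the new dataframe (here, prefixed) is returned.
```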
- Returns
A pandas dataframe with modified column(s).
- Raises
KeyError – if string_function is not a Pandas string method.
TypeError – if wrong arg or kwarg is supplied.
ValueError – if column_name not found in dataframe.
ValueError – if new_column_names is not None and is found in dataframe.
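The error conditions listed above can be approximated with a few checks made before the string method is called. The function check_process_text_args below is hypothetical and only mirrors the documented behavior; how pyjanitor itself decides what counts as a Pandas string method, and how it treats a list of new column names, are assumptions here. The TypeError case needs no explicit check, since Pandas raises it when a string method receives an argument it does not accept.

```python
import pandas as pd


def check_process_text_args(df, column_name, new_column_names, string_function):
    """Hypothetical pre-flight checks mirroring the documented errors."""
    if column_name not in df.columns:
        raise ValueError(f"{column_name} not found in dataframe.")

    # Normalise None / string / list of strings into a list of names.
    if isinstance(new_column_names, str):
        names = [new_column_names]
    else:
        names = list(new_column_names or [])
    clashes = [name for name in names if name in df.columns]
    if clashes:
        raise ValueError(f"{clashes} already present in dataframe.")

    # Assumption: public attributes of the .str accessor are the valid methods.
    is_public = not string_function.startswith("_")
    if not (is_public and hasattr(pd.Series.str, string_function)):
        raise KeyError(f"{string_function} is not a Pandas string method.")


df = pd.DataFrame({"text": ["Ragnar", "sammywemmy", "ginger"],
                   "code": [1, 2, 3]})

# Valid arguments pass silently.
check_process_text_args(df, "text", "new_text", "lower")

# An unknown method name is rejected before any processing happens.
try:
    check_process_text_args(df, "text", None, "not_a_method")
except KeyError as err:
    message = str(err)
```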