janitor.select_columns¶
-
janitor.
select_columns
(df: pandas.core.frame.DataFrame, search_column_names: Union[str, callable, Pattern, slice, list], invert: bool = False) → pandas.core.frame.DataFrame[source]¶ Method-chainable selection of columns.
Not applicable to MultiIndex columns.
It accepts a string, shell-like glob strings (string), regex, slice, array-like object, or a list of the previous options.
This method does not mutate the original DataFrame.
Optional ability to invert selection of columns available as well.
import pandas as pd import janitor import numpy as np import datetime import re from janitor import patterns df = pd.DataFrame( { "id": [0, 1], "Name": ["ABC", "XYZ"], "code": [1, 2], "code1": [4, np.nan], "code2": ["8", 5], "type": ["S", "R"], "type1": ["E", np.nan], "type2": ["T", "U"], "code3": pd.Series(["a", "b"], dtype="category"), "type3": pd.to_datetime([np.datetime64("2018-01-01"), datetime.datetime(2018, 1, 1)]), } ) df id Name code code1 code2 type type1 type2 code3 type3 0 0 ABC 1 4.0 8 S E T a 2018-01-01 1 1 XYZ 2 NaN 5 R NaN U b 2018-01-01
Select by string:
df.select_columns("id") id 0 0 1 1
Select via shell-like glob strings (*) is possible:
df.select_columns("*type*") type type1 type2 type3 0 S E T 2018-01-01 1 R NaN U 2018-01-01
Select by slice:
df.select_columns(slice("code1", "type1")) code1 code2 type type1 0 4.0 8 S E 1 NaN 5 R NaN
Select by callable (the callable is applied to every column and should return a single
True
orFalse
per column):df.select_columns(pd.api.types.is_datetime64_dtype) type3 0 2018-01-01 1 2018-01-01 df.select_columns(lambda x: x.name.startswith("code") or x.name.endswith("1")) code code1 code2 type1 code3 0 1 4.0 8 E a 1 2 NaN 5 NaN b df.select_columns(lambda x: x.isna().any()) code1 type1 0 4.0 E 1 NaN NaN
Select by regular expression:
df.select_columns(re.compile("\d+")) code1 code2 type1 type2 code3 type3 0 4.0 8 E T a 2018-01-01 1 NaN 5 NaN U b 2018-01-01 # same as above, with janitor.patterns # simply a wrapper around re.compile df.select_columns(patterns("\d+")) code1 code2 type1 type2 code3 type3 0 4.0 8 E T a 2018-01-01 1 NaN 5 NaN U b 2018-01-01
Select via a list (you can combine any of the previous options):
df.select_columns(["id", "code*", slice("code", "code2")]) id code code1 code2 code3 0 0 1 4.0 8 a 1 1 2 NaN 5 b
You can also pass a list of booleans:
df.select_columns([True, False, True, True, True, False, False, False, True, False]) id code code1 code2 code3 0 0 1 4.0 8 a 1 1 2 NaN 5 b
Setting
invert
toTrue
returns the complement of the columns provided:df.select_columns(["id", "code*", slice("code", "code2")], invert = True) Name type type1 type2 type3 0 ABC S E T 2018-01-01 1 XYZ R NaN U 2018-01-01
Functional usage example:
import pandas as pd import janitor as jn df = pd.DataFrame(...) df = jn.select_columns(['a', 'b', 'col_*'], invert=True)
Method-chaining example:
df = (pd.DataFrame(...) .select_columns(['a', 'b', 'col_*'], invert=True))
- Parameters
df – A pandas DataFrame.
search_column_names –
Valid inputs include:
an exact column name to look for
a shell-style glob string (e.g., *_thing_*)
a regular expression
a callable which is applicable to each Series in the dataframe
a list of all the aforementioned options.
invert – Whether or not to invert the selection. This will result in the selection of the complement of the columns provided.
- Returns
A pandas DataFrame with the specified columns selected.
- Raises
KeyError – if one or more of the specified column names or search strings are not found in DataFrame columns.
ValueError – if the columns is a MultiIndex.