janitor.select_columns

janitor.select_columns(df: pandas.core.frame.DataFrame, search_column_names: Union[str, callable, Pattern, slice, list], invert: bool = False) → pandas.core.frame.DataFrame[source]

Method-chainable selection of columns.

Not applicable to MultiIndex columns.

It accepts a string, shell-like glob strings (string), regex, slice, array-like object, or a list of the previous options.

This method does not mutate the original DataFrame.

Optional ability to invert selection of columns available as well.

import pandas as pd
import janitor
import numpy as np
import datetime
import re
from janitor import patterns

df = pd.DataFrame(
        {
            "id": [0, 1],
            "Name": ["ABC", "XYZ"],
            "code": [1, 2],
            "code1": [4, np.nan],
            "code2": ["8", 5],
            "type": ["S", "R"],
            "type1": ["E", np.nan],
            "type2": ["T", "U"],
            "code3": pd.Series(["a", "b"], dtype="category"),
            "type3": pd.to_datetime([np.datetime64("2018-01-01"),
                                    datetime.datetime(2018, 1, 1)]),
        }
    )

df

   id Name  code  code1 code2 type type1 type2 code3    type3
0   0  ABC     1    4.0     8    S     E     T     a 2018-01-01
1   1  XYZ     2    NaN     5    R   NaN     U     b 2018-01-01
  • Select by string:

     df.select_columns("id")
        id
    0   0
    1   1
    

Select via shell-like glob strings (*) is possible:

df.select_columns("*type*")

   type type1 type2      type3
0    S     E     T 2018-01-01
1    R   NaN     U 2018-01-01
  • Select by slice:

    df.select_columns(slice("code1", "type1"))
    
       code1 code2 type type1
    0    4.0     8    S     E
    1    NaN     5    R   NaN
    
  • Select by callable (the callable is applied to every column and should return a single True or False per column):

    df.select_columns(pd.api.types.is_datetime64_dtype)
    
           type3
    0 2018-01-01
    1 2018-01-01
    
    df.select_columns(lambda x: x.name.startswith("code") or
                                x.name.endswith("1"))
    
       code  code1 code2 type1 code3
    0     1    4.0     8     E     a
    1     2    NaN     5   NaN     b
    
    df.select_columns(lambda x: x.isna().any())
    
         code1 type1
    0    4.0     E
    1    NaN   NaN
    
  • Select by regular expression:

    df.select_columns(re.compile("\d+"))
    
       code1 code2 type1 type2 code3      type3
    0    4.0     8     E     T     a 2018-01-01
    1    NaN     5   NaN     U     b 2018-01-01
    
    # same as above, with janitor.patterns
    # simply a wrapper around re.compile
    
    df.select_columns(patterns("\d+"))
    
       code1 code2 type1 type2 code3      type3
    0    4.0     8     E     T     a 2018-01-01
    1    NaN     5   NaN     U     b 2018-01-01
    
  • Select via a list (you can combine any of the previous options):

    df.select_columns(["id", "code*", slice("code", "code2")])
    
       id  code  code1 code2 code3
    0   0     1    4.0     8     a
    1   1     2    NaN     5     b
    
  • You can also pass a list of booleans:

    df.select_columns([True, False, True, True, True,
                       False, False, False, True, False])
    
       id  code  code1 code2 code3
    0   0     1    4.0     8     a
    1   1     2    NaN     5     b
    
  • Setting invert to True returns the complement of the columns provided:

    df.select_columns(["id", "code*", slice("code", "code2")],
                      invert = True)
    
       Name type type1 type2      type3
    0  ABC    S     E     T 2018-01-01
    1  XYZ    R   NaN     U 2018-01-01
    

Functional usage example:

import pandas as pd
import janitor as jn

df = pd.DataFrame(...)

df = jn.select_columns(['a', 'b', 'col_*'],
                       invert=True)

Method-chaining example:

df = (pd.DataFrame(...)
      .select_columns(['a', 'b', 'col_*'],
      invert=True))
Parameters
  • df – A pandas DataFrame.

  • search_column_names

    Valid inputs include:

    • an exact column name to look for

    • a shell-style glob string (e.g., *_thing_*)

    • a regular expression

    • a callable which is applicable to each Series in the dataframe

    • a list of all the aforementioned options.

  • invert – Whether or not to invert the selection. This will result in the selection of the complement of the columns provided.

Returns

A pandas DataFrame with the specified columns selected.

Raises
  • KeyError – if one or more of the specified column names or search strings are not found in DataFrame columns.

  • ValueError – if the columns is a MultiIndex.