Chemistry

Chemistry and cheminformatics-oriented data cleaning functions.

janitor.chemistry.maccs_keys_fingerprint(df: pandas.core.frame.DataFrame, mols_column_name: Hashable) → pandas.core.frame.DataFrame

Convert a column of RDKit Mol objects into MACCS keys fingerprints.

Returns a new dataframe without any of the original data. This is intentional, to leave the user only with the data requested.

This method does not mutate the original DataFrame.

Functional usage example:

import pandas as pd
import janitor.chemistry

df = pd.DataFrame(...)

maccs = janitor.chemistry.maccs_keys_fingerprint(
    df=df.smiles2mol('smiles', 'mols'),
    mols_column_name='mols'
)

Method chaining usage example:

import pandas as pd
import janitor.chemistry

df = pd.DataFrame(...)

maccs = (
    df.smiles2mol('smiles', 'mols')
      .maccs_keys_fingerprint(mols_column_name='mols')
)

If you wish to join the MACCS keys fingerprints back into the original dataframe, this can be accomplished by doing a join, because the indices are preserved:

joined = df.join(maccs)
Parameters:
  • df – A pandas DataFrame.
  • mols_column_name – The name of the column that has the RDKit Mol objects.
Returns:

A new pandas DataFrame of MACCS keys fingerprints.
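The index-preserving join described above can be sketched with pandas alone. In this minimal illustration the `fingerprints` frame and its `key_0`/`key_1` columns are hypothetical stand-ins with made-up values, not real MACCS keys output; the point is only that `DataFrame.join` aligns rows on index labels, so a result that shares the original index lines up row for row.

```python
import pandas as pd

# Original data with a non-default index, to show that the join aligns on labels.
df = pd.DataFrame(
    {"smiles": ["CCO", "c1ccccc1", "CC(=O)O"]},
    index=["ethanol", "benzene", "acetic_acid"],
)

# Hypothetical stand-in for a fingerprint frame: made-up values, same index as df.
fingerprints = pd.DataFrame(
    {"key_0": [0, 0, 0], "key_1": [1, 0, 1]},
    index=df.index,
)

# join() matches rows by index label, not by position.
joined = df.join(fingerprints)
print(joined)
```

If the two frames did not share index labels, the unmatched rows would come back filled with NaN, which is a quick way to spot a misaligned join.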

janitor.chemistry.molecular_descriptors(df: pandas.core.frame.DataFrame, mols_column_name: Hashable) → pandas.core.frame.DataFrame

Convert a column of RDKit Mol objects into a pandas DataFrame of molecular descriptors.

Returns a new dataframe without any of the original data. This is intentional, to leave the user only with the data requested.

This method does not mutate the original DataFrame.

The molecular descriptors come from the rdkit.Chem.rdMolDescriptors module:

Chi0n, Chi0v, Chi1n, Chi1v, Chi2n, Chi2v, Chi3n, Chi3v, Chi4n, Chi4v, ExactMolWt, FractionCSP3, HallKierAlpha, Kappa1, Kappa2, Kappa3, LabuteASA, NumAliphaticCarbocycles, NumAliphaticHeterocycles, NumAliphaticRings, NumAmideBonds, NumAromaticCarbocycles, NumAromaticHeterocycles, NumAromaticRings, NumAtomStereoCenters, NumBridgeheadAtoms, NumHBA, NumHBD, NumHeteroatoms, NumHeterocycles, NumLipinskiHBA, NumLipinskiHBD, NumRings, NumSaturatedCarbocycles, NumSaturatedHeterocycles, NumSaturatedRings, NumSpiroAtoms, NumUnspecifiedAtomStereoCenters, TPSA.

Functional usage example:

import pandas as pd
import janitor.chemistry

df = pd.DataFrame(...)

mol_desc = janitor.chemistry.molecular_descriptors(
    df=df.smiles2mol('smiles', 'mols'),
    mols_column_name='mols'
)

Method chaining usage example:

import pandas as pd
import janitor.chemistry

df = pd.DataFrame(...)

mol_desc = (
    df.smiles2mol('smiles', 'mols')
      .molecular_descriptors(mols_column_name='mols')
)

If you wish to join the molecular descriptors back into the original dataframe, this can be accomplished by doing a join, because the indices are preserved:

joined = df.join(mol_desc)
Parameters:
  • df – A pandas DataFrame.
  • mols_column_name – The name of the column that has the RDKit Mol objects.
Returns:

A new pandas DataFrame of molecular descriptors.

janitor.chemistry.morgan_fingerprint(df: pandas.core.frame.DataFrame, mols_column_name: str, radius: int = 3, nbits: int = 2048, kind: str = 'counts') → pandas.core.frame.DataFrame

Convert a column of RDKit Mol objects into Morgan fingerprints.

Returns a new dataframe without any of the original data. This is intentional, as Morgan fingerprints are usually high-dimensional features.

This method does not mutate the original DataFrame.

Functional usage example:

import pandas as pd
import janitor.chemistry

df = pd.DataFrame(...)

# For "counts" kind
morgans = janitor.chemistry.morgan_fingerprint(
    df=df.smiles2mol('smiles', 'mols'),
    mols_column_name='mols',
    radius=3,      # Defaults to 3
    nbits=2048,    # Defaults to 2048
    kind='counts'  # Defaults to "counts"
)

# For "bits" kind
morgans = janitor.chemistry.morgan_fingerprint(
    df=df.smiles2mol('smiles', 'mols'),
    mols_column_name='mols',
    radius=3,      # Defaults to 3
    nbits=2048,    # Defaults to 2048
    kind='bits'    # Defaults to "counts"
)

Method chaining usage example:

import pandas as pd
import janitor.chemistry

df = pd.DataFrame(...)

# For "counts" kind
morgans = (
    df.smiles2mol('smiles', 'mols')
      .morgan_fingerprint(mols_column_name='mols',
                          radius=3,      # Defaults to 3
                          nbits=2048,    # Defaults to 2048
                          kind='counts'  # Defaults to "counts"
      )
)

# For "bits" kind
morgans = (
    df.smiles2mol('smiles', 'mols')
      .morgan_fingerprint(mols_column_name='mols',
                          radius=3,    # Defaults to 3
                          nbits=2048,  # Defaults to 2048
                          kind='bits'  # Defaults to "counts"
      )
)

If you wish to join the Morgan fingerprints back into the original dataframe, this can be accomplished by doing a join, because the indices are preserved:

joined = df.join(morgans)
Parameters:
  • df – A pandas DataFrame.
  • mols_column_name – The name of the column that has the RDKit Mol objects.
  • radius – Radius of Morgan fingerprints. Defaults to 3.
  • nbits – The length of the fingerprints. Defaults to 2048.
  • kind – Whether to return counts ('counts') or bits ('bits'). Defaults to 'counts'.
Returns:

A new pandas DataFrame of Morgan fingerprints.
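The difference between the two `kind` values can be illustrated with a toy, pure-Python sketch. It is not RDKit's hashing scheme or pyjanitor's implementation: `fold_features` and the integer `feature_ids` are hypothetical, and the modulo folding is an assumption chosen only to show how a fixed-length vector of `nbits` positions can hold either occurrence counts or 0/1 presence flags.

```python
from collections import Counter


def fold_features(feature_ids, nbits=8, kind="counts"):
    """Fold integer feature identifiers into a vector of length nbits.

    Hypothetical helper for illustration only.
    """
    if kind not in ("counts", "bits"):
        raise ValueError("kind must be 'counts' or 'bits'")
    # Each identifier lands at position (identifier mod nbits).
    tally = Counter(fid % nbits for fid in feature_ids)
    counts = [tally.get(i, 0) for i in range(nbits)]
    if kind == "counts":
        return counts
    # "bits" keeps only whether a position was hit at all.
    return [1 if c > 0 else 0 for c in counts]


# Made-up identifiers: 3 and 11 both fold to position 3 when nbits=8.
feature_ids = [3, 11, 11, 20]
counts_vec = fold_features(feature_ids, nbits=8, kind="counts")
bits_vec = fold_features(feature_ids, nbits=8, kind="bits")
print(counts_vec)  # [0, 0, 0, 3, 1, 0, 0, 0]
print(bits_vec)    # [0, 0, 0, 1, 1, 0, 0, 0]
```

In this toy scheme a smaller `nbits` makes collisions like the one at position 3 more likely, while a larger one yields a wider feature matrix.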

janitor.chemistry.smiles2mol(df: pandas.core.frame.DataFrame, smiles_column_name: Hashable, mols_column_name: Hashable, drop_nulls: bool = True, progressbar: Union[None, str] = None) → pandas.core.frame.DataFrame

Convert a column of SMILES strings into RDKit Mol objects.

By default (drop_nulls=True), rows whose SMILES strings RDKit cannot parse are dropped.

This method mutates the original DataFrame.

Functional usage example:

import pandas as pd
import janitor.chemistry

df = pd.DataFrame(...)

df = janitor.chemistry.smiles2mol(
    df=df,
    smiles_column_name='smiles',
    mols_column_name='mols'
)

Method chaining usage example:

import pandas as pd
import janitor.chemistry

df = pd.DataFrame(...)

df = df.smiles2mol(smiles_column_name='smiles',
                   mols_column_name='mols')

A progress bar can optionally be shown.

  • Pass in “notebook” to show a tqdm notebook progress bar. (ipywidgets must be enabled with your Jupyter installation.)
  • Pass in “terminal” to show a tqdm progress bar. Better suited for use with scripts.
  • None is the default value; no progress bar is shown.
Parameters:
  • df – A pandas DataFrame.
  • smiles_column_name – Name of the column that holds the SMILES strings.
  • mols_column_name – Name to be given to the new mols column.
  • drop_nulls – Whether to drop rows whose mols failed to be constructed. Defaults to True.
  • progressbar – Which progress bar to show: “notebook”, “terminal”, or None (the default) for no progress bar.
Returns:

A pandas DataFrame with a new column of RDKit Mol objects.
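The effect of `drop_nulls` can be sketched with pandas alone. Everything here is an assumption made for illustration: `ToyMol`, `toy_parse`, and `add_mols` are hypothetical stand-ins, and the "contains no spaces" rule is not how SMILES validity is decided. The sketch also works on a copy so it can be run twice on the same input, unlike the mutating behavior documented above. It relies only on the idea that a failed parse yields None, which `dropna` can then filter out.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class ToyMol:
    """Hypothetical placeholder for a parsed molecule object."""
    smiles: str


def toy_parse(smiles):
    """Stand-in parser: returns None when the input is rejected."""
    if isinstance(smiles, str) and smiles and " " not in smiles:
        return ToyMol(smiles)
    return None


def add_mols(df, smiles_column_name, mols_column_name, drop_nulls=True):
    """Add a parsed-molecule column, optionally dropping failed rows."""
    out = df.copy()  # sketch only: keeps the input frame untouched
    out[mols_column_name] = out[smiles_column_name].apply(toy_parse)
    if drop_nulls:
        out = out.dropna(subset=[mols_column_name])
    return out


df = pd.DataFrame({"smiles": ["CCO", "not a smiles", "c1ccccc1"]})
kept = add_mols(df, "smiles", "mols")                        # failed row removed
unfiltered = add_mols(df, "smiles", "mols", drop_nulls=False)  # failed row kept as null
print(len(kept), len(unfiltered))
```

Passing `drop_nulls=False` in a sketch like this is useful when the failed rows themselves need inspecting before they are discarded.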