Chemical Utilities
Examples
Examples of using these chemical utilities are listed at the bottom of this page (see Examples).
Most of the chemical utilities in Workbench use either RDKit or Mordred (Community). Including these utilities lets this functionality be used and deployed in AWS (FeatureSets, Models, Endpoints).
Chem/RDKit/Mordred utilities for Workbench
add_compound_tags(df, mol_column='molecule')
Adds a 'tags' column to a DataFrame, tagging compounds based on their properties.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input DataFrame containing molecular data. | required |
mol_column | str | Column name containing RDKit molecule objects. | 'molecule' |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: Updated DataFrame with a 'tags' column. |
Source code in src/workbench/utils/chem_utils.py
canonicalize(df, remove_mol_col=True)
Generate RDKit's canonical SMILES for each molecule in the input DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input DataFrame containing a column named 'SMILES' (case-insensitive). | required |
remove_mol_col | bool | Whether to drop the intermediate 'molecule' column. Default is True. | True |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: A DataFrame with an additional 'smiles_canonical' column and, optionally, the 'molecule' column. |
Source code in src/workbench/utils/chem_utils.py
compute_molecular_descriptors(df)
Compute and add all the molecular descriptors.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The DataFrame to process and generate RDKit/Mordred descriptors for. | required |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: The input DataFrame with all the RDKit Descriptors added. |
Source code in src/workbench/utils/chem_utils.py
compute_morgan_fingerprints(df, radius=2, n_bits=2048, counts=True)
Compute and add Morgan fingerprints to the DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input DataFrame containing SMILES strings. | required |
radius | int | Radius for the Morgan fingerprint. | 2 |
n_bits | int | Number of bits for the fingerprint. | 2048 |
counts | bool | Whether to use count simulation for the fingerprint. | True |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: The input DataFrame with the Morgan fingerprints added as bit strings. |
Note
See: https://greglandrum.github.io/rdkit-blog/posts/2021-07-06-simulating-counts.html
Source code in src/workbench/utils/chem_utils.py
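The `counts` option refers to count simulation, where each fingerprint feature is given several bits and a bit is set once the feature's count reaches a threshold. The following is only a conceptual sketch in plain Python, not the Workbench or RDKit implementation: the function name `simulate_counts`, the thresholds `(1, 2, 4, 8)`, and the bit layout are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def simulate_counts(
    feature_counts: Dict[int, int],
    n_bits: int = 2048,
    count_bounds: Sequence[int] = (1, 2, 4, 8),  # assumed thresholds
) -> List[int]:
    """Fold feature counts into a bit vector using count thresholds."""
    n_bounds = len(count_bounds)
    n_slots = n_bits // n_bounds  # each feature slot owns n_bounds bits
    bits = [0] * n_bits
    for feature_id, count in feature_counts.items():
        slot = feature_id % n_slots  # fold the feature id into a slot
        for i, bound in enumerate(count_bounds):
            if count >= bound:
                bits[slot * n_bounds + i] = 1
    return bits


# Three hypothetical features seen 1, 3, and 9 times in a molecule
bits = simulate_counts({0: 1, 1: 3, 2: 9}, n_bits=16)
bit_string = "".join(map(str, bits))
print(bit_string)  # 1000110011110000
```

With four thresholds, a fingerprint of `n_bits` bits holds only `n_bits / 4` distinct feature slots, which is the trade-off for encoding approximate counts in a bit string.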
contains_heavy_metals(mol)
Check if a molecule contains any heavy metals (broad filter).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | RDKit molecule object. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if any heavy metals are detected, False otherwise. |
Source code in src/workbench/utils/chem_utils.py
contains_metalloenzyme_relevant_metals(mol)
Check if a molecule contains metals relevant to metalloenzymes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | RDKit molecule object. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if metalloenzyme-relevant metals are detected, False otherwise. |
Source code in src/workbench/utils/chem_utils.py
contains_salts(mol)
Check if a molecule contains common salts or counterions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | RDKit molecule object. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if salts are detected, False otherwise. |
Source code in src/workbench/utils/chem_utils.py
custom_tautomer_canonicalization(mol)
Domain-specific processing of a molecule to select the canonical tautomer.
This function enumerates all possible tautomers for a given molecule and applies custom logic to select the canonical form.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | The RDKit molecule for which the canonical tautomer is to be determined. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | The SMILES string of the selected canonical tautomer. |
Source code in src/workbench/utils/chem_utils.py
geometric_mean(series)
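This page gives no description for `geometric_mean`. For reference only, a geometric mean is commonly computed as the exponential of the mean of the logs. The sketch below uses plain Python lists rather than a pandas Series; the name `geometric_mean_sketch` is hypothetical, and how the Workbench function treats zeros, negatives, or missing values is not documented here.

```python
import math
from typing import Iterable


def geometric_mean_sketch(values: Iterable[float]) -> float:
    """Geometric mean of strictly positive values, computed in log space."""
    vals = list(values)
    if not vals:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in vals):
        raise ValueError("geometric mean requires strictly positive values")
    # exp(mean(log(x))) avoids overflow from multiplying many values together
    return math.exp(sum(math.log(v) for v in vals) / len(vals))


gm = geometric_mean_sketch([1.0, 10.0, 100.0])
print(round(gm, 6))  # 10.0
```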
halogen_toxicity_score(mol)
Calculate the halogen count and toxicity threshold for a molecule.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | RDKit molecule object. | required |
Returns:
Type | Description |
---|---|
(int, int) | Tuple[int, int]: (halogen_count, halogen_threshold), where the threshold scales with molecule size (minimum of 2 or 20% of atom count). |
Source code in src/workbench/utils/chem_utils.py
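The size-scaled threshold can be illustrated without RDKit. This sketch works on a list of element symbols and reads "minimum of 2 or 20% of atom count" as "20% of the atom count, but never below 2"; that reading, the halogen set, and the name `halogen_score_sketch` are assumptions rather than the documented behavior.

```python
from typing import List, Tuple

HALOGENS = {"F", "Cl", "Br", "I"}  # assumption: elements counted as halogens


def halogen_score_sketch(atom_symbols: List[str]) -> Tuple[int, int]:
    """Return (halogen_count, halogen_threshold) for a list of element symbols."""
    halogen_count = sum(1 for symbol in atom_symbols if symbol in HALOGENS)
    # Assumed reading: 20% of the atom count (rounded down), with a floor of 2
    halogen_threshold = max(2, len(atom_symbols) // 5)
    return halogen_count, halogen_threshold


# Heavy atoms of chloroform: one carbon and three chlorines
print(halogen_score_sketch(["C", "Cl", "Cl", "Cl"]))  # (3, 2)
# A 20-atom molecule with a single fluorine
print(halogen_score_sketch(["C"] * 19 + ["F"]))  # (1, 4)
```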
img_from_smiles(smiles, width=500, height=500, background='rgba(64, 64, 64, 1)')
Generate an image of the molecule represented by the given SMILES string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
smiles | str | A SMILES string representing the molecule. | required |
width | int | Width of the image in pixels. Default is 500. | 500 |
height | int | Height of the image in pixels. Default is 500. | 500 |
background | str | Background color of the image. Default is dark grey. | 'rgba(64, 64, 64, 1)' |
Returns:
Name | Type | Description |
---|---|---|
str | Optional[str] | PIL image of the molecule or None if the SMILES string is invalid. |
Source code in src/workbench/utils/chem_utils.py
is_druglike_compound(mol)
Filter for drug-likeness and QSAR relevance based on Lipinski's Rule of Five. Returns False for molecules unlikely to be orally bioavailable.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | RDKit molecule object. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if the molecule is drug-like, False otherwise. |
Source code in src/workbench/utils/chem_utils.py
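Lipinski's Rule of Five can be sketched on precomputed properties, independent of RDKit. The cutoffs below are the textbook ones (molecular weight ≤ 500, logP ≤ 5, at most 5 hydrogen-bond donors, at most 10 acceptors). The `MolProps` class, the function names, and the one-violation allowance are illustrative assumptions; the Workbench function takes an RDKit molecule and may apply different or additional filters.

```python
from dataclasses import dataclass


@dataclass
class MolProps:
    """Hypothetical container for precomputed molecular properties."""
    mol_weight: float  # Daltons
    logp: float        # octanol-water partition coefficient
    h_donors: int      # hydrogen-bond donors
    h_acceptors: int   # hydrogen-bond acceptors


def lipinski_violations(p: MolProps) -> int:
    """Count how many of the four Rule of Five limits are exceeded."""
    return sum([
        p.mol_weight > 500,
        p.logp > 5,
        p.h_donors > 5,
        p.h_acceptors > 10,
    ])


def is_druglike_sketch(p: MolProps, max_violations: int = 1) -> bool:
    # Assumption: tolerate up to one violation, a common convention
    return lipinski_violations(p) <= max_violations


# Illustrative (approximate) values, not measured data
small = MolProps(mol_weight=180.2, logp=1.2, h_donors=1, h_acceptors=4)
large = MolProps(mol_weight=900.0, logp=7.5, h_donors=6, h_acceptors=12)
print(is_druglike_sketch(small), is_druglike_sketch(large))  # True False
```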
log_to_category(log_series)
Convert a pandas Series of log values to concentration categories.
Parameters: log_series (pd.Series): Series of logarithmic values (log10).
Returns: pd.Series: Series of concentration categories.
Source code in src/workbench/utils/chem_utils.py
log_to_micromolar(log_series)
Convert a pandas Series of logarithmic values (log10) back to concentrations in µM (micromolar).
Parameters: log_series (pd.Series): Series of logarithmic values (log10).
Returns: pd.Series: Series of concentrations in micromolar.
Source code in src/workbench/utils/chem_utils.py
micromolar_to_log(series_μM)
Convert a pandas Series of concentrations in µM (micromolar) to their logarithmic values (log10).
Parameters: series_μM (pd.Series): Series of concentrations in micromolar.
Returns: pd.Series: Series of logarithmic values (log10).
Source code in src/workbench/utils/chem_utils.py
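The two conversion helpers are inverses of each other. A minimal round-trip sketch on plain lists follows, assuming the logarithm is taken of the concentration expressed in mol/L (so 1 µM maps to -6). The page does not state the reference unit, so treat that factor and the `_sketch` function names as assumptions.

```python
import math
from typing import List


def micromolar_to_log_sketch(conc_um: List[float]) -> List[float]:
    # Assumption: convert µM to mol/L (multiply by 1e-6) before taking log10
    return [math.log10(c * 1e-6) for c in conc_um]


def log_to_micromolar_sketch(log_values: List[float]) -> List[float]:
    # Inverse: 10**x gives mol/L, multiply by 1e6 to get back to µM
    return [(10 ** v) * 1e6 for v in log_values]


concs = [0.5, 1.0, 250.0]
logs = micromolar_to_log_sketch(concs)
round_trip = log_to_micromolar_sketch(logs)

# The round trip recovers the inputs up to floating-point error
print(all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(concs, round_trip)))
```

Note that log10 is undefined for zero or negative concentrations, so real data usually needs clipping or filtering first.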
project_fingerprints(df, projection='UMAP')
Project fingerprints onto a 2D plane using dimensionality reduction techniques.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input DataFrame containing fingerprint data. | required |
projection | str | Dimensionality reduction technique to use (TSNE or UMAP). | 'UMAP' |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: The input DataFrame with the projected coordinates added as 'x' and 'y' columns. |
Source code in src/workbench/utils/chem_utils.py
remove_disconnected_fragments(mol)
Remove disconnected fragments from a molecule, keeping the fragment with the most heavy atoms.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | RDKit molecule object. | required |
Returns:
Name | Type | Description |
---|---|---|
Mol | Mol | The fragment with the most heavy atoms, or None if no such fragment exists. |
Source code in src/workbench/utils/chem_utils.py
rollup_experimental_data(df, id, time, target, use_gmean=False)
Rolls up a dataset by selecting the largest time per unique ID and averaging the target value if multiple records exist at that time. Supports both arithmetic and geometric mean.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input dataframe. | required |
id | str | Column representing the unique molecule ID. | required |
time | str | Column representing the time. | required |
target | str | Column representing the target value. | required |
use_gmean | bool | Whether to use the geometric mean instead of the arithmetic mean. | False |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: Rolled-up dataframe with all original columns retained. |
Source code in src/workbench/utils/chem_utils.py
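The rollup logic described above (keep the largest time per ID, then average the target across records at that time) can be sketched with plain dictionaries. This is an illustration, not the Workbench implementation: `rollup_sketch` is a hypothetical name, rows are dicts rather than DataFrame rows, and taking the non-target columns from the first record at the latest time is an assumption.

```python
import math
from collections import defaultdict
from typing import Any, Dict, List

Row = Dict[str, Any]


def rollup_sketch(rows: List[Row], id_key: str, time_key: str,
                  target_key: str, use_gmean: bool = False) -> List[Row]:
    # Group rows by ID (insertion order of IDs is preserved)
    by_id: Dict[Any, List[Row]] = defaultdict(list)
    for row in rows:
        by_id[row[id_key]].append(row)

    rolled = []
    for group in by_id.values():
        latest = max(r[time_key] for r in group)
        at_latest = [r for r in group if r[time_key] == latest]
        values = [r[target_key] for r in at_latest]
        if use_gmean:
            mean = math.exp(sum(math.log(v) for v in values) / len(values))
        else:
            mean = sum(values) / len(values)
        # Assumption: other columns come from the first record at the latest time
        rolled.append({**at_latest[0], target_key: mean})
    return rolled


rows = [
    {"id": "A", "time": 1, "value": 5.0},
    {"id": "A", "time": 4, "value": 2.0},
    {"id": "A", "time": 4, "value": 8.0},
    {"id": "B", "time": 2, "value": 3.0},
]
arith = rollup_sketch(rows, "id", "time", "value")
geo = rollup_sketch(rows, "id", "time", "value", use_gmean=True)
print(arith)  # A averages 2.0 and 8.0 at time 4; B keeps its single record
```

For ID "A" the arithmetic mean at time 4 is 5.0, while the geometric mean is about 4.0, which shows why the choice matters for concentration-like data spanning orders of magnitude.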
show(smiles, width=500, height=500)
Displays an image of the molecule represented by the given SMILES string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
smiles | str | A SMILES string representing the molecule. | required |
width | int | Width of the image in pixels. Default is 500. | 500 |
height | int | Height of the image in pixels. Default is 500. | 500 |
Returns: None
Source code in src/workbench/utils/chem_utils.py
standard_tautomer_canonicalization(mol)
Standard processing of a molecule to select the canonical tautomer.
RDKit's TautomerEnumerator uses heuristics to select a canonical tautomer, such as preferring keto over enol forms and minimizing formal charges.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | The RDKit molecule for which the canonical tautomer is to be determined. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | The SMILES string of the canonical tautomer. |
Source code in src/workbench/utils/chem_utils.py
svg_from_smiles(smiles, width=500, height=500, background='rgba(64, 64, 64, 1)')
Generate an SVG image of the molecule represented by the given SMILES string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
smiles | str | A SMILES string representing the molecule. | required |
width | int | Width of the image in pixels. Default is 500. | 500 |
height | int | Height of the image in pixels. Default is 500. | 500 |
background | str | Background color of the image. Default is dark grey. | 'rgba(64, 64, 64, 1)' |
Returns:
Type | Description |
---|---|
Optional[str] | Optional[str]: Encoded SVG string of the molecule or None if the SMILES string is invalid. |
Source code in src/workbench/utils/chem_utils.py
tautomerize_smiles(df)
Perform tautomer enumeration and canonicalization on a DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input DataFrame containing SMILES strings. | required |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: A new DataFrame with additional 'smiles_canonical' and 'smiles_tautomer' columns. |
Source code in src/workbench/utils/chem_utils.py
toxic_elements(mol)
Identifies toxic elements or specific forms of elements in a molecule.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | RDKit molecule object. | required |
Returns:
Type | Description |
---|---|
Optional[List[str]] | Optional[List[str]]: List of toxic elements or specific forms if found, otherwise None. |
Notes
Halogen toxicity logic integrates with halogen_toxicity_score and scales thresholds based on molecule size.
Source code in src/workbench/utils/chem_utils.py
toxic_groups(mol)
Check if a molecule contains known toxic functional groups using RDKit's functional groups and SMARTS patterns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | The molecule to evaluate. | required |
Returns:
Type | Description |
---|---|
Optional[List[str]] | Optional[List[str]]: List of SMARTS patterns for toxic groups if found, otherwise None. |
Source code in src/workbench/utils/chem_utils.py
Examples
Canonical Smiles
"""Example for computing canonical SMILES strings"""
import pandas as pd
from workbench.utils.chem_utils import canonicalize

test_data = [
    {"id": "Acetylacetone", "smiles": "CC(=O)CC(=O)C", "expected": "CC(=O)CC(C)=O"},
    {"id": "Imidazole", "smiles": "c1cnc[nH]1", "expected": "c1c[nH]cn1"},
    {"id": "Pyridone", "smiles": "C1=CC=NC(=O)C=C1", "expected": "O=c1cccccn1"},
    {"id": "Guanidine", "smiles": "C(=N)N=C(N)N", "expected": "N=CN=C(N)N"},
    {"id": "Catechol", "smiles": "c1cc(c(cc1)O)O", "expected": "Oc1ccccc1O"},
    {"id": "Formamide", "smiles": "C(=O)N", "expected": "NC=O"},
    {"id": "Urea", "smiles": "C(=O)(N)N", "expected": "NC(N)=O"},
    {"id": "Phenol", "smiles": "c1ccc(cc1)O", "expected": "Oc1ccccc1"},
]

# Convert test data to a DataFrame
df = pd.DataFrame(test_data)

# Perform canonicalization
result_df = canonicalize(df)
print(result_df)
Output
   id             smiles            expected       smiles_canonical
0  Acetylacetone  CC(=O)CC(=O)C     CC(=O)CC(C)=O  CC(=O)CC(C)=O
1  Imidazole      c1cnc[nH]1        c1c[nH]cn1     c1c[nH]cn1
2  Pyridone       C1=CC=NC(=O)C=C1  O=c1cccccn1    O=c1cccccn1
3  Guanidine      C(=N)N=C(N)N      N=CN=C(N)N     N=CN=C(N)N
4  Catechol       c1cc(c(cc1)O)O    Oc1ccccc1O     Oc1ccccc1O
5  Formamide      C(=O)N            NC=O           NC=O
6  Urea           C(=O)(N)N         NC(N)=O        NC(N)=O
7  Phenol         c1ccc(cc1)O       Oc1ccccc1      Oc1ccccc1
Tautomerize Smiles
"""Example for tautomerizing SMILES strings"""
import pandas as pd
from workbench.utils.chem_utils import tautomerize_smiles

test_data = [
    # Salicylaldehyde undergoes keto-enol tautomerization.
    {"id": "Salicylaldehyde (Keto)", "smiles": "O=Cc1cccc(O)c1", "expected": "O=Cc1cccc(O)c1"},
    {"id": "2-Hydroxybenzaldehyde (Enol)", "smiles": "Oc1ccc(C=O)cc1", "expected": "O=Cc1ccc(O)cc1"},
    # Acetylacetone undergoes keto-enol tautomerization; the diketo form is expected here.
    {"id": "Acetylacetone", "smiles": "CC(=O)CC(=O)C", "expected": "CC(=O)CC(C)=O"},
    # Imidazole undergoes a proton shift in the aromatic ring.
    {"id": "Imidazole", "smiles": "c1cnc[nH]1", "expected": "c1c[nH]cn1"},
    # Pyridone prefers the lactam form in RDKit's tautomer enumeration.
    {"id": "Pyridone", "smiles": "C1=CC=NC(=O)C=C1", "expected": "O=c1cccccn1"},
    # Guanidine undergoes amine-imine tautomerization.
    {"id": "Guanidine", "smiles": "C(=N)N=C(N)N", "expected": "N=C(N)N=CN"},
    # Catechol standardizes hydroxyl group placement in the aromatic system.
    {"id": "Catechol", "smiles": "c1cc(c(cc1)O)O", "expected": "Oc1ccccc1O"},
    # Formamide canonicalizes to NC=O, reflecting its stable form.
    {"id": "Formamide", "smiles": "C(=O)N", "expected": "NC=O"},
    # Urea undergoes a proton shift between nitrogen atoms.
    {"id": "Urea", "smiles": "C(=O)(N)N", "expected": "NC(N)=O"},
    # Phenol standardizes hydroxyl group placement in the aromatic system.
    {"id": "Phenol", "smiles": "c1ccc(cc1)O", "expected": "Oc1ccccc1"},
]

# Convert test data to a DataFrame
df = pd.DataFrame(test_data)

# Perform tautomerization
result_df = tautomerize_smiles(df)
print(result_df)
Output
   id                            smiles_orig       expected        smiles_canonical  smiles
0  Salicylaldehyde (Keto)        O=Cc1cccc(O)c1    O=Cc1cccc(O)c1  O=Cc1cccc(O)c1    O=Cc1cccc(O)c1
1  2-Hydroxybenzaldehyde (Enol)  Oc1ccc(C=O)cc1    O=Cc1ccc(O)cc1  O=Cc1ccc(O)cc1    O=Cc1ccc(O)cc1
2  Acetylacetone                 CC(=O)CC(=O)C     CC(=O)CC(C)=O   CC(=O)CC(C)=O     CC(=O)CC(C)=O
3  Imidazole                     c1cnc[nH]1        c1c[nH]cn1      c1c[nH]cn1        c1c[nH]cn1
4  Pyridone                      C1=CC=NC(=O)C=C1  O=c1cccccn1     O=c1cccccn1       O=c1cccccn1
5  Guanidine                     C(=N)N=C(N)N      N=C(N)N=CN      N=CN=C(N)N        N=C(N)N=CN
6  Catechol                      c1cc(c(cc1)O)O    Oc1ccccc1O      Oc1ccccc1O        Oc1ccccc1O
7  Formamide                     C(=O)N            NC=O            NC=O              NC=O
8  Urea                          C(=O)(N)N         NC(N)=O         NC(N)=O           NC(N)=O
9  Phenol                        c1ccc(cc1)O       Oc1ccccc1       Oc1ccccc1         Oc1ccccc1
Additional Resources
- Workbench API Classes: API Classes
- Consulting Available: SuperCowPowers LLC