Chemical Utilities
Examples
Examples of using the Chemical Utilities are listed at the bottom of this page.
Most of the chemical utilities in Workbench are built on RDKit or Mordred (Community). Including them in Workbench makes this functionality available for use and deployment in AWS (FeatureSets, Models, Endpoints).
Chem/RDKit/Mordred utilities for Workbench
add_compound_tags(df, mol_column='molecule')
Adds a 'tags' column to a DataFrame, tagging compounds based on their properties.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input DataFrame containing molecular data. | required |
mol_column | str | Column name containing RDKit molecule objects. | 'molecule' |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: Updated DataFrame with a 'tags' column. |
Source code in src/workbench/utils/chem_utils.py
add_salt_features(df)
Add salt features to a DataFrame with a 'molecule' column containing RDKit molecules.
Source code in src/workbench/utils/chem_utils.py
canonicalize(df, remove_mol_col=True)
Generate RDKit's canonical SMILES for each molecule in the input DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input DataFrame containing a column named 'SMILES' (case-insensitive). | required |
remove_mol_col | bool | Whether to drop the intermediate 'molecule' column. Default is True. | True |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: A DataFrame with an additional 'smiles_canonical' column and, optionally, the 'molecule' column. |
Source code in src/workbench/utils/chem_utils.py
compute_molecular_descriptors(df, tautomerize=True)
Compute and add all the Molecular Descriptors
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input DataFrame containing SMILES strings. | required |
tautomerize | bool | Whether to tautomerize the SMILES strings. | True |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: The input DataFrame with all the RDKit Descriptors added |
Source code in src/workbench/utils/chem_utils.py
compute_morgan_fingerprints(df, radius=2, n_bits=2048, counts=True)
Compute and add Morgan fingerprints to the DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input DataFrame containing SMILES strings. | required |
radius | int | Radius for the Morgan fingerprint. | 2 |
n_bits | int | Number of bits for the fingerprint. | 2048 |
counts | bool | Whether to use count simulation for the fingerprint. | True |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: The input DataFrame with the Morgan fingerprints added as bit strings. |
Note
See: https://greglandrum.github.io/rdkit-blog/posts/2021-07-06-simulating-counts.html
Source code in src/workbench/utils/chem_utils.py
compute_stereochemistry_descriptors(df)
Compute stereochemistry descriptors for molecules in a DataFrame.
This function analyzes the stereochemical properties of molecules, including:
- Chiral centers (R/S configuration)
- Double bond stereochemistry (E/Z configuration)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input DataFrame with RDKit molecule objects in 'molecule' column | required |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: DataFrame with added stereochemistry descriptors |
Source code in src/workbench/utils/chem_utils.py
contains_heavy_metals(mol)
Check if a molecule contains any heavy metals (broad filter).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | RDKit molecule object. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if any heavy metals are detected, False otherwise. |
Source code in src/workbench/utils/chem_utils.py
contains_metalloenzyme_relevant_metals(mol)
Check if a molecule contains metals relevant to metalloenzymes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | RDKit molecule object. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if metalloenzyme-relevant metals are detected, False otherwise. |
Source code in src/workbench/utils/chem_utils.py
contains_salts(mol)
Check if a molecule contains common salts or counterions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | RDKit molecule object. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if salts are detected, False otherwise. |
Source code in src/workbench/utils/chem_utils.py
custom_tautomer_canonicalization(mol)
Domain-specific processing of a molecule to select the canonical tautomer.
This function enumerates all possible tautomers for a given molecule and applies custom logic to select the canonical form.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | The RDKit molecule for which the canonical tautomer is to be determined. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | The SMILES string of the selected canonical tautomer. |
Source code in src/workbench/utils/chem_utils.py
df_to_sdf_file(df, output_file, smiles_col='smiles', id_col=None, include_cols=None, skip_invalid=True, generate_3d=True)
Convert DataFrame with SMILES to SDF file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | DataFrame containing SMILES and other data | required |
output_file | str | Path to output SDF file | required |
smiles_col | str | Column name containing SMILES strings | 'smiles' |
id_col | Optional[str] | Column to use as molecule ID/name | None |
include_cols | Optional[List[str]] | Specific columns to include as properties (default: all except smiles) | None |
skip_invalid | bool | Skip invalid SMILES instead of raising error | True |
generate_3d | bool | Generate 3D coordinates and optimize geometry | True |
Source code in src/workbench/utils/chem_utils.py
extract_advanced_salt_features(mol)
Extract comprehensive salt-related features from RDKit molecule
Source code in src/workbench/utils/chem_utils.py
feature_resolution_issues(df, features, show_cols=None)
Identify and print groups in a DataFrame where the given features have more than one unique SMILES, sorted by group size (largest number of unique SMILES first).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input DataFrame containing SMILES strings. | required |
features | List[str] | List of features to check. | required |
show_cols | Optional[List[str]] | Columns to display; defaults to all columns. | None |
Source code in src/workbench/utils/chem_utils.py
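The grouping idea described above can be sketched without pandas. This is a minimal illustration, not the Workbench implementation: the helper name `find_resolution_issues`, the `smiles_key` parameter, and the sample rows are all hypothetical. It groups rows by their feature values, keeps the groups that map to more than one unique SMILES, and sorts them with the largest number of unique SMILES first.

```python
from collections import defaultdict


def find_resolution_issues(rows, features, smiles_key="smiles"):
    """Return (feature_values, unique_smiles) pairs where identical feature
    values correspond to more than one unique SMILES string.

    Hypothetical helper for illustration; rows are plain dictionaries.
    """
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[f] for f in features)  # identical features -> same key
        groups[key].add(row[smiles_key])

    # Keep only groups that the features cannot tell apart
    issues = [(key, sorted(smiles)) for key, smiles in groups.items() if len(smiles) > 1]

    # Largest number of unique SMILES first
    issues.sort(key=lambda item: len(item[1]), reverse=True)
    return issues


# Hypothetical rows: two coarse features that fail to separate some molecules
rows = [
    {"smiles": "CCO", "heavy_atoms": 3, "rings": 0},
    {"smiles": "COC", "heavy_atoms": 3, "rings": 0},
    {"smiles": "CCC", "heavy_atoms": 3, "rings": 0},
    {"smiles": "c1ccccc1", "heavy_atoms": 6, "rings": 1},
    {"smiles": "C1CCCCC1", "heavy_atoms": 6, "rings": 1},
    {"smiles": "CC", "heavy_atoms": 2, "rings": 0},
]

for feature_values, smiles in find_resolution_issues(rows, ["heavy_atoms", "rings"]):
    print(feature_values, smiles)
```

Groups reported this way indicate that the chosen features do not carry enough information to distinguish the listed molecules.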
geometric_mean(series)
halogen_toxicity_score(mol)
Calculate the halogen count and toxicity threshold for a molecule.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | RDKit molecule object. | required |
Returns:
Type | Description |
---|---|
(int, int) | Tuple[int, int]: (halogen_count, halogen_threshold), where the threshold scales with molecule size (minimum of 2 or 20% of atom count). |
Source code in src/workbench/utils/chem_utils.py
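One possible reading of the size-scaled threshold is sketched below. It assumes the threshold is 20% of the atom count with a floor of 2; the actual rule in `chem_utils.py` may differ. The helper name `halogen_count_and_threshold` is hypothetical, and it works on a plain list of element symbols rather than an RDKit molecule.

```python
from typing import Iterable, Tuple

HALOGENS = {"F", "Cl", "Br", "I"}


def halogen_count_and_threshold(symbols: Iterable[str]) -> Tuple[int, int]:
    """Count halogen atoms and compute a size-scaled threshold.

    Assumption: the threshold is 20% of the atom count, but never below 2.
    """
    symbols = list(symbols)
    count = sum(1 for s in symbols if s in HALOGENS)
    threshold = max(2, len(symbols) // 5)  # integer form of 20% of the atoms
    return count, threshold


# Heavy atoms of chloroform (CHCl3): one carbon and three chlorines
count, threshold = halogen_count_and_threshold(["C", "Cl", "Cl", "Cl"])
print(count, threshold, count > threshold)
```

Under this reading, a small molecule is flagged once it carries more than two halogens, while larger molecules tolerate proportionally more.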
img_from_smiles(smiles, width=500, height=500, background='rgba(64, 64, 64, 1)')
Generate an image of the molecule represented by the given SMILES string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
smiles | str | A SMILES string representing the molecule. | required |
width | int | Width of the image in pixels. Default is 500. | 500 |
height | int | Height of the image in pixels. Default is 500. | 500 |
background | str | Background color of the image. Default is dark grey | 'rgba(64, 64, 64, 1)' |
Returns:
Name | Type | Description |
---|---|---|
str | Optional[str] | PIL image of the molecule or None if the SMILES string is invalid. |
Source code in src/workbench/utils/chem_utils.py
is_druglike_compound(mol)
Filter for drug-likeness and QSAR relevance based on Lipinski's Rule of Five. Returns False for molecules unlikely to be orally bioavailable.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | RDKit molecule object. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if the molecule is drug-like, False otherwise. |
Source code in src/workbench/utils/chem_utils.py
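For orientation, the classic Rule of Five checks can be sketched on precomputed descriptor values. This is an illustration of the published rule (molecular weight at most 500, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors), not the exact logic of `is_druglike_compound`, which may apply different or additional filters. The `Descriptors` class, the function names, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptor values for one molecule (hypothetical container)."""
    mol_weight: float
    logp: float
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four Rule of Five limits are exceeded."""
    checks = [
        d.mol_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """The rule is commonly applied with at most one violation allowed."""
    return lipinski_violations(d) <= max_violations


# Hypothetical descriptor values, for illustration only
small = Descriptors(mol_weight=180.2, logp=1.2, h_bond_donors=1, h_bond_acceptors=4)
large = Descriptors(mol_weight=812.0, logp=6.3, h_bond_donors=7, h_bond_acceptors=12)

print(passes_rule_of_five(small))
print(passes_rule_of_five(large))
```

In practice the descriptor values would come from RDKit; they are passed in directly here so the sketch stays free of external dependencies.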
log_to_category(log_series)
Convert a pandas Series of log values to concentration categories.
Parameters: log_series (pd.Series): Series of logarithmic values (log10).
Returns: pd.Series: Series of concentration categories.
Source code in src/workbench/utils/chem_utils.py
log_to_micromolar(log_series)
Convert a pandas Series of logarithmic values (log10) back to concentrations in µM (micromolar).
Parameters: log_series (pd.Series): Series of logarithmic values (log10).
Returns: pd.Series: Series of concentrations in micromolar.
Source code in src/workbench/utils/chem_utils.py
micromolar_to_log(series_μM)
Convert a pandas Series of concentrations in µM (micromolar) to their logarithmic values (log10).
Parameters: series_μM (pd.Series): Series of concentrations in micromolar.
Returns: pd.Series: Series of logarithmic values (log10).
Source code in src/workbench/utils/chem_utils.py
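The two conversions are inverses of each other. The sketch below shows the round trip on plain Python lists, assuming the log value is simply log10 of the concentration expressed in µM; the Workbench functions operate on pandas Series and may apply extra handling (for example, for zero or negative values). The function names here are hypothetical.

```python
import math
from typing import List


def micromolar_to_log10(values_um: List[float]) -> List[float]:
    """Convert concentrations in µM to log10 values.

    Assumption: values are strictly positive; math.log10 raises ValueError otherwise.
    """
    return [math.log10(v) for v in values_um]


def log10_to_micromolar(log_values: List[float]) -> List[float]:
    """Convert log10 values back to concentrations in µM."""
    return [10.0 ** v for v in log_values]


concentrations = [0.5, 1.0, 10.0, 250.0]
logs = micromolar_to_log10(concentrations)
round_trip = log10_to_micromolar(logs)

# The round trip recovers the inputs up to floating-point error
print(all(math.isclose(a, b) for a, b in zip(concentrations, round_trip)))
```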
project_fingerprints(df, projection='UMAP')
Project fingerprints onto a 2D plane using dimensionality reduction techniques.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input DataFrame containing fingerprint data. | required |
projection | str | Dimensionality reduction technique to use (TSNE or UMAP). | 'UMAP' |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: The input DataFrame with the projected coordinates added as 'x' and 'y' columns. |
Source code in src/workbench/utils/chem_utils.py
remove_disconnected_fragments(mol)
Remove disconnected fragments from a molecule, keeping the fragment with the most heavy atoms.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | RDKit molecule object. | required |
Returns:
Name | Type | Description |
---|---|---|
Mol | Mol | The fragment with the most heavy atoms, or None if no such fragment exists. |
Source code in src/workbench/utils/chem_utils.py
rollup_experimental_data(df, id, time, target, use_gmean=False)
Rolls up a dataset by selecting the largest time per unique ID and averaging the target value if multiple records exist at that time. Supports both arithmetic and geometric mean.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input dataframe. | required |
id | str | Column representing the unique molecule ID. | required |
time | str | Column representing the time. | required |
target | str | Column representing the target value. | required |
use_gmean | bool | Whether to use the geometric mean instead of the arithmetic mean. | False |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: Rolled-up dataframe with all original columns retained. |
Source code in src/workbench/utils/chem_utils.py
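The roll-up logic can be sketched on plain dictionaries: group rows by ID, keep only the rows at the largest time, and average the target over those rows. This is a simplified illustration under stated assumptions, not the Workbench implementation; the function name `rollup`, the row layout, and the sample data are hypothetical, and non-target columns are taken from the first row at the latest time.

```python
from collections import defaultdict
from statistics import geometric_mean, mean


def rollup(rows, id_key, time_key, target_key, use_gmean=False):
    """Keep one row per ID: the latest time, with the target averaged."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[id_key]].append(row)

    result = []
    for group in groups.values():
        latest = max(r[time_key] for r in group)
        at_latest = [r for r in group if r[time_key] == latest]
        targets = [r[target_key] for r in at_latest]

        # The geometric mean requires strictly positive target values
        value = geometric_mean(targets) if use_gmean else mean(targets)

        rolled = dict(at_latest[0])  # retain the other columns
        rolled[target_key] = value
        result.append(rolled)
    return result


# Hypothetical measurements: compound A has two records at its latest time
rows = [
    {"id": "A", "time": 1, "value": 10.0},
    {"id": "A", "time": 2, "value": 2.0},
    {"id": "A", "time": 2, "value": 8.0},
    {"id": "B", "time": 1, "value": 5.0},
]

print(rollup(rows, "id", "time", "value"))
print(rollup(rows, "id", "time", "value", use_gmean=True))
```

The geometric mean is often preferred for values that span orders of magnitude, such as concentrations, because it is less dominated by a single large measurement.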
show(smiles, width=500, height=500)
Displays an image of the molecule represented by the given SMILES string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
smiles | str | A SMILES string representing the molecule. | required |
width | int | Width of the image in pixels. Default is 500. | 500 |
height | int | Height of the image in pixels. Default is 500. | 500 |
Returns: None
Source code in src/workbench/utils/chem_utils.py
standard_tautomer_canonicalization(mol)
Standard processing of a molecule to select the canonical tautomer.
RDKit's TautomerEnumerator uses heuristics to select a canonical tautomer, such as preferring keto over enol forms and minimizing formal charges.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | The RDKit molecule for which the canonical tautomer is to be determined. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | The SMILES string of the canonical tautomer. |
Source code in src/workbench/utils/chem_utils.py
svg_from_smiles(smiles, width=500, height=500, background='rgba(64, 64, 64, 1)')
Generate an SVG image of the molecule represented by the given SMILES string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
smiles | str | A SMILES string representing the molecule. | required |
width | int | Width of the image in pixels. Default is 500. | 500 |
height | int | Height of the image in pixels. Default is 500. | 500 |
background | str | Background color of the image. Default is dark grey. | 'rgba(64, 64, 64, 1)' |
Returns:
Type | Description |
---|---|
Optional[str] | Optional[str]: Encoded SVG string of the molecule or None if the SMILES string is invalid. |
Source code in src/workbench/utils/chem_utils.py
tautomerize_smiles(df)
Perform tautomer enumeration and canonicalization on a DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input DataFrame containing SMILES strings. | required |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: A new DataFrame with additional 'smiles_canonical' and 'smiles_tautomer' columns. |
Source code in src/workbench/utils/chem_utils.py
toxic_elements(mol)
Identifies toxic elements or specific forms of elements in a molecule.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | RDKit molecule object. | required |
Returns:
Type | Description |
---|---|
Optional[List[str]] | Optional[List[str]]: List of toxic elements or specific forms if found, otherwise None. |
Notes
Halogen toxicity logic integrates with halogen_toxicity_score and scales thresholds based on molecule size.
Source code in src/workbench/utils/chem_utils.py
toxic_groups(mol)
Check if a molecule contains known toxic functional groups using RDKit's functional groups and SMARTS patterns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | The molecule to evaluate. | required |
Returns:
Type | Description |
---|---|
Optional[List[str]] | Optional[List[str]]: List of SMARTS patterns for toxic groups if found, otherwise None. |
Source code in src/workbench/utils/chem_utils.py
Examples
Canonical SMILES
"""Example for computing canonical SMILES strings"""
import pandas as pd
from workbench.utils.chem_utils import canonicalize

test_data = [
    {"id": "Acetylacetone", "smiles": "CC(=O)CC(=O)C", "expected": "CC(=O)CC(C)=O"},
    {"id": "Imidazole", "smiles": "c1cnc[nH]1", "expected": "c1c[nH]cn1"},
    {"id": "Pyridone", "smiles": "C1=CC=NC(=O)C=C1", "expected": "O=c1cccccn1"},
    {"id": "Guanidine", "smiles": "C(=N)N=C(N)N", "expected": "N=CN=C(N)N"},
    {"id": "Catechol", "smiles": "c1cc(c(cc1)O)O", "expected": "Oc1ccccc1O"},
    {"id": "Formamide", "smiles": "C(=O)N", "expected": "NC=O"},
    {"id": "Urea", "smiles": "C(=O)(N)N", "expected": "NC(N)=O"},
    {"id": "Phenol", "smiles": "c1ccc(cc1)O", "expected": "Oc1ccccc1"},
]

# Convert test data to a DataFrame
df = pd.DataFrame(test_data)

# Perform canonicalization
result_df = canonicalize(df)
print(result_df)
Output
id smiles expected smiles_canonical
0 Acetylacetone CC(=O)CC(=O)C CC(=O)CC(C)=O CC(=O)CC(C)=O
1 Imidazole c1cnc[nH]1 c1c[nH]cn1 c1c[nH]cn1
2 Pyridone C1=CC=NC(=O)C=C1 O=c1cccccn1 O=c1cccccn1
3 Guanidine C(=N)N=C(N)N N=CN=C(N)N N=CN=C(N)N
4 Catechol c1cc(c(cc1)O)O Oc1ccccc1O Oc1ccccc1O
5 Formamide C(=O)N NC=O NC=O
6 Urea C(=O)(N)N NC(N)=O NC(N)=O
7 Phenol c1ccc(cc1)O Oc1ccccc1 Oc1ccccc1
Tautomerize SMILES
"""Example for Tautomerizing SMILES strings"""
import pandas as pd
from workbench.utils.chem_utils import tautomerize_smiles

test_data = [
    # Salicylaldehyde undergoes keto-enol tautomerization.
    {"id": "Salicylaldehyde (Keto)", "smiles": "O=Cc1cccc(O)c1", "expected": "O=Cc1cccc(O)c1"},
    {"id": "2-Hydroxybenzaldehyde (Enol)", "smiles": "Oc1ccc(C=O)cc1", "expected": "O=Cc1ccc(O)cc1"},
    # Acetylacetone undergoes keto-enol tautomerization; the expected canonical form here is the diketone.
    {"id": "Acetylacetone", "smiles": "CC(=O)CC(=O)C", "expected": "CC(=O)CC(C)=O"},
    # Imidazole undergoes a proton shift in the aromatic ring.
    {"id": "Imidazole", "smiles": "c1cnc[nH]1", "expected": "c1c[nH]cn1"},
    # Pyridone prefers the lactam form in RDKit's tautomer enumeration.
    {"id": "Pyridone", "smiles": "C1=CC=NC(=O)C=C1", "expected": "O=c1cccccn1"},
    # Guanidine undergoes amine-imine tautomerization.
    {"id": "Guanidine", "smiles": "C(=N)N=C(N)N", "expected": "N=C(N)N=CN"},
    # Catechol standardizes hydroxyl group placement in the aromatic system.
    {"id": "Catechol", "smiles": "c1cc(c(cc1)O)O", "expected": "Oc1ccccc1O"},
    # Formamide canonicalizes to NC=O, reflecting its stable form.
    {"id": "Formamide", "smiles": "C(=O)N", "expected": "NC=O"},
    # Urea undergoes a proton shift between nitrogen atoms.
    {"id": "Urea", "smiles": "C(=O)(N)N", "expected": "NC(N)=O"},
    # Phenol standardizes hydroxyl group placement in the aromatic system.
    {"id": "Phenol", "smiles": "c1ccc(cc1)O", "expected": "Oc1ccccc1"},
]

# Convert test data to a DataFrame
df = pd.DataFrame(test_data)

# Perform tautomerization
result_df = tautomerize_smiles(df)
print(result_df)
Output
id smiles_orig expected smiles_canonical smiles
0 Salicylaldehyde (Keto) O=Cc1cccc(O)c1 O=Cc1cccc(O)c1 O=Cc1cccc(O)c1 O=Cc1cccc(O)c1
1 2-Hydroxybenzaldehyde (Enol) Oc1ccc(C=O)cc1 O=Cc1ccc(O)cc1 O=Cc1ccc(O)cc1 O=Cc1ccc(O)cc1
2 Acetylacetone CC(=O)CC(=O)C CC(=O)CC(C)=O CC(=O)CC(C)=O CC(=O)CC(C)=O
3 Imidazole c1cnc[nH]1 c1c[nH]cn1 c1c[nH]cn1 c1c[nH]cn1
4 Pyridone C1=CC=NC(=O)C=C1 O=c1cccccn1 O=c1cccccn1 O=c1cccccn1
5 Guanidine C(=N)N=C(N)N N=C(N)N=CN N=CN=C(N)N N=C(N)N=CN
6 Catechol c1cc(c(cc1)O)O Oc1ccccc1O Oc1ccccc1O Oc1ccccc1O
7 Formamide C(=O)N NC=O NC=O NC=O
8 Urea C(=O)(N)N NC(N)=O NC(N)=O NC(N)=O
9 Phenol c1ccc(cc1)O Oc1ccccc1 Oc1ccccc1 Oc1ccccc1
Additional Resources
- Workbench API Classes: API Classes
- Consulting Available: SuperCowPowers LLC