Chemical Utilities
Chemical Utility Examples
Examples of using the chemical utilities are listed at the bottom of this page: Examples.
Most of the chemical utilities in Workbench are built on either RDKit or Mordred (Community). Including these utilities makes this functionality available for use and deployment in AWS (FeatureSets, Models, Endpoints).
Chem/RDKit/Mordred utilities for Workbench
add_compound_tags(df, mol_column='molecule')
Adds a 'tags' column to a DataFrame, tagging compounds based on their properties.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input DataFrame containing molecular data. | required |
mol_column | str | Column name containing RDKit molecule objects. | 'molecule' |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: Updated DataFrame with a 'tags' column. |
Source code in src/workbench/utils/chem_utils.py
canonicalize(df, remove_mol_col=True)
Generate RDKit's canonical SMILES for each molecule in the input DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input DataFrame containing a column named 'SMILES' (case-insensitive). | required |
remove_mol_col | bool | Whether to drop the intermediate 'molecule' column. Default is True. | True |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: A DataFrame with an additional 'smiles_canonical' column and, optionally, the 'molecule' column. |
Source code in src/workbench/utils/chem_utils.py
compute_molecular_descriptors(df)
Compute all of the molecular descriptors and add them to the DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input DataFrame containing SMILES strings. | required |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: The input DataFrame with all the RDKit descriptors added. |
Source code in src/workbench/utils/chem_utils.py
compute_morgan_fingerprints(df, radius=2, n_bits=2048, counts=True)
Compute and add Morgan fingerprints to the DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input DataFrame containing SMILES strings. | required |
radius | int | Radius for the Morgan fingerprint. | 2 |
n_bits | int | Number of bits for the fingerprint. | 2048 |
counts | bool | Whether to use count simulation for the fingerprint. | True |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: The input DataFrame with the Morgan fingerprints added as bit strings. |
Note
See: https://greglandrum.github.io/rdkit-blog/posts/2021-07-06-simulating-counts.html
Source code in src/workbench/utils/chem_utils.py
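The parameters above map naturally onto RDKit's fingerprint generator API. The following is a minimal sketch of computing one count-simulated Morgan fingerprint as a bit string, not the Workbench implementation; the helper name `morgan_bit_string` is hypothetical, and it assumes a recent RDKit that ships `rdFingerprintGenerator`.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator


def morgan_bit_string(smiles: str, radius: int = 2, n_bits: int = 2048, counts: bool = True) -> str:
    """Return a Morgan fingerprint as a string of '0'/'1' characters (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    # countSimulation spreads each feature's count over several bits (see the linked post)
    generator = rdFingerprintGenerator.GetMorganGenerator(
        radius=radius, fpSize=n_bits, countSimulation=counts
    )
    return generator.GetFingerprint(mol).ToBitString()


bits = morgan_bit_string("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print(len(bits), bits.count("1"))
```

Storing the fingerprint as a bit string keeps it in a single DataFrame column, at the cost of having to expand it again before feeding it to a model.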
compute_stereochemistry_descriptors(df)
Compute stereochemistry descriptors for molecules in a DataFrame.
This function calculates various descriptors related to molecular stereochemistry, including chiral centers (R/S configuration) and double bond stereochemistry (E/Z). It also adds selected topological and geometric descriptors from Mordred that relate to 3D molecular shape.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input DataFrame with RDKit molecule objects in 'molecule' column | required |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: DataFrame with added stereochemistry descriptors |
Source code in src/workbench/utils/chem_utils.py
contains_heavy_metals(mol)
Check if a molecule contains any heavy metals (broad filter).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | RDKit molecule object. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if any heavy metals are detected, False otherwise. |
Source code in src/workbench/utils/chem_utils.py
contains_metalloenzyme_relevant_metals(mol)
Check if a molecule contains metals relevant to metalloenzymes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | RDKit molecule object. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if metalloenzyme-relevant metals are detected, False otherwise. |
Source code in src/workbench/utils/chem_utils.py
contains_salts(mol)
Check if a molecule contains common salts or counterions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | RDKit molecule object. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if salts are detected, False otherwise. |
Source code in src/workbench/utils/chem_utils.py
custom_tautomer_canonicalization(mol)
Domain-specific processing of a molecule to select the canonical tautomer.
This function enumerates all possible tautomers for a given molecule and applies custom logic to select the canonical form.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | The RDKit molecule for which the canonical tautomer is to be determined. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | The SMILES string of the selected canonical tautomer. |
Source code in src/workbench/utils/chem_utils.py
feature_resolution_issues(df, features, show_cols=None)
Identify and print groups in a DataFrame where the given features have more than one unique SMILES.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input DataFrame containing SMILES strings. | required |
features | List[str] | List of features to check. | required |
show_cols | Optional[List[str]] | Columns to display; defaults to all columns. | None |
Source code in src/workbench/utils/chem_utils.py
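The check described above (feature values shared by more than one distinct SMILES) can be expressed with a pandas group-by. This is a sketch under assumptions rather than the library's code: the helper name `find_smiles_collisions`, the `smiles` column name, and the example features are all hypothetical.

```python
import pandas as pd


def find_smiles_collisions(df, features, smiles_col="smiles"):
    """Return rows whose feature values are shared by more than one distinct SMILES."""
    # For every row, count the distinct SMILES in its group of identical feature values
    n_smiles = df.groupby(features)[smiles_col].transform("nunique")
    return df[n_smiles > 1]


df = pd.DataFrame(
    {
        "smiles": ["CCO", "CCN", "c1ccccc1"],
        "heavy_atoms": [3, 3, 6],  # hypothetical coarse features
        "ring_count": [0, 0, 1],
    }
)
collisions = find_smiles_collisions(df, ["heavy_atoms", "ring_count"])
print(collisions)
```

Here ethanol and ethylamine collide because the two coarse features cannot tell them apart, which is the kind of resolution problem the function is meant to surface.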
geometric_mean(series)
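The page gives no description for geometric_mean. Assuming it computes the ordinary geometric mean of a pandas Series (plausibly the helper behind the use_gmean option of rollup_experimental_data, though the source does not say so), a minimal sketch looks like this; the name `geometric_mean_sketch` is hypothetical.

```python
import numpy as np
import pandas as pd


def geometric_mean_sketch(series: pd.Series) -> float:
    """Geometric mean via the mean of logs; assumes all values are strictly positive."""
    return float(np.exp(np.log(series).mean()))


values = pd.Series([1.0, 10.0, 100.0])
print(geometric_mean_sketch(values))  # close to 10.0, versus an arithmetic mean of 37.0
```

The geometric mean is often preferred for concentration-like data that spans orders of magnitude, because a single large replicate does not dominate the result.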
halogen_toxicity_score(mol)
Calculate the halogen count and toxicity threshold for a molecule.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | RDKit molecule object. | required |
Returns:
Type | Description |
---|---|
(int, int) | Tuple[int, int]: (halogen_count, halogen_threshold), where the threshold scales with molecule size (minimum of 2 or 20% of atom count). |
Source code in src/workbench/utils/chem_utils.py
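To make the size-scaled threshold concrete, here is a small sketch that works on a plain list of element symbols instead of an RDKit molecule. It rests on two assumptions that the page does not confirm: that the halogens counted are F, Cl, Br and I, and that "minimum of 2 or 20% of atom count" means a floor of 2 with 20% of the atom count above that.

```python
HALOGENS = {"F", "Cl", "Br", "I"}  # assumed halogen set


def halogen_score_sketch(atom_symbols):
    """Return (halogen_count, halogen_threshold) for a list of element symbols."""
    halogen_count = sum(1 for symbol in atom_symbols if symbol in HALOGENS)
    # Assumed reading of the docstring: a floor of 2, otherwise 20% of the atom count
    halogen_threshold = max(2, int(0.2 * len(atom_symbols)))
    return halogen_count, halogen_threshold


# Chloroform (CHCl3) heavy atoms: one carbon and three chlorines
count, threshold = halogen_score_sketch(["C", "Cl", "Cl", "Cl"])
print(count, threshold)  # 3 2
```

Under this reading a caller would flag a molecule when the count exceeds the threshold, so chloroform is flagged while a 30-atom molecule with three halogens is not.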
img_from_smiles(smiles, width=500, height=500, background='rgba(64, 64, 64, 1)')
Generate an image of the molecule represented by the given SMILES string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
smiles | str | A SMILES string representing the molecule. | required |
width | int | Width of the image in pixels. Default is 500. | 500 |
height | int | Height of the image in pixels. Default is 500. | 500 |
background | str | Background color of the image. Default is dark grey. | 'rgba(64, 64, 64, 1)' |
Returns:
Name | Type | Description |
---|---|---|
str | Optional[str] | PIL image of the molecule or None if the SMILES string is invalid. |
Source code in src/workbench/utils/chem_utils.py
is_druglike_compound(mol)
Filter for drug-likeness and QSAR relevance based on Lipinski's Rule of Five. Returns False for molecules unlikely to be orally bioavailable.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | RDKit molecule object. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if the molecule is drug-like, False otherwise. |
Source code in src/workbench/utils/chem_utils.py
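For readers unfamiliar with Lipinski's Rule of Five, the sketch below shows the textbook version of the check using RDKit descriptors. It is an illustration only: the helper name `passes_rule_of_five` is hypothetical, and the cutoffs and the one-violation allowance are the textbook values, which may differ from what is_druglike_compound applies.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski


def passes_rule_of_five(mol, max_violations=1):
    """Textbook Rule of Five check (hypothetical helper, not the Workbench function)."""
    violations = sum(
        [
            Descriptors.MolWt(mol) > 500,  # molecular weight over 500 Da
            Crippen.MolLogP(mol) > 5,  # calculated logP over 5
            Lipinski.NumHDonors(mol) > 5,  # more than 5 hydrogen-bond donors
            Lipinski.NumHAcceptors(mol) > 10,  # more than 10 hydrogen-bond acceptors
        ]
    )
    return violations <= max_violations


aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
tetracontane = Chem.MolFromSmiles("C" * 40)  # C40 alkane: too heavy and too lipophilic
print(passes_rule_of_five(aspirin), passes_rule_of_five(tetracontane))
```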
log_to_category(log_series)
Convert a pandas Series of log values to concentration categories.
Parameters: log_series (pd.Series): Series of logarithmic values (log10).
Returns: pd.Series: Series of concentration categories.
Source code in src/workbench/utils/chem_utils.py
log_to_micromolar(log_series)
Convert a pandas Series of logarithmic values (log10) back to concentrations in µM (micromolar).
Parameters: log_series (pd.Series): Series of logarithmic values (log10).
Returns: pd.Series: Series of concentrations in micromolar.
Source code in src/workbench/utils/chem_utils.py
micromolar_to_log(series_μM)
Convert a pandas Series of concentrations in µM (micromolar) to their logarithmic values (log10).
Parameters: series_μM (pd.Series): Series of concentrations in micromolar.
Returns: pd.Series: Series of logarithmic values (log10).
Source code in src/workbench/utils/chem_utils.py
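The two conversions above are inverses of each other. The sketch below shows the round trip under the simplest assumption, namely that log10 is taken directly on the µM values; the library may use a different reference unit (for example mol/L), so treat the helper names and the exact scaling as hypothetical.

```python
import numpy as np
import pandas as pd


def to_log10(series_um: pd.Series) -> pd.Series:
    """log10 of concentrations, assuming the values are already in µM."""
    return np.log10(series_um)


def from_log10(log_series: pd.Series) -> pd.Series:
    """Invert to_log10, returning concentrations in µM."""
    return 10.0 ** log_series


concentrations = pd.Series([0.1, 1.0, 100.0])
logs = to_log10(concentrations)
recovered = from_log10(logs)
print(pd.DataFrame({"uM": concentrations, "log10": logs, "recovered": recovered}))
```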
project_fingerprints(df, projection='UMAP')
Project fingerprints onto a 2D plane using dimensionality reduction techniques.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input DataFrame containing fingerprint data. | required |
projection | str | Dimensionality reduction technique to use (TSNE or UMAP). | 'UMAP' |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: The input DataFrame with the projected coordinates added as 'x' and 'y' columns. |
Source code in src/workbench/utils/chem_utils.py
remove_disconnected_fragments(mol)
Remove disconnected fragments from a molecule, keeping the fragment with the most heavy atoms.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | RDKit molecule object. | required |
Returns:
Name | Type | Description |
---|---|---|
Mol | Mol | The fragment with the most heavy atoms, or None if no such fragment exists. |
Source code in src/workbench/utils/chem_utils.py
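Keeping the fragment with the most heavy atoms can be done with standard RDKit calls. The following sketch illustrates the idea and is not the Workbench source; the helper name `largest_fragment` is hypothetical.

```python
from rdkit import Chem


def largest_fragment(mol):
    """Return the fragment with the most heavy atoms, or None if there are no fragments."""
    # asMols=True returns each disconnected component as its own Mol object
    fragments = Chem.GetMolFrags(mol, asMols=True)
    if not fragments:
        return None
    return max(fragments, key=lambda frag: frag.GetNumHeavyAtoms())


sodium_acetate = Chem.MolFromSmiles("CC(=O)[O-].[Na+]")
parent = largest_fragment(sodium_acetate)
print(Chem.MolToSmiles(parent))  # the acetate fragment, without the sodium counterion
```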
rollup_experimental_data(df, id, time, target, use_gmean=False)
Rolls up a dataset by selecting the largest time per unique ID and averaging the target value if multiple records exist at that time. Supports both arithmetic and geometric mean.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input dataframe. | required |
id | str | Column representing the unique molecule ID. | required |
time | str | Column representing the time. | required |
target | str | Column representing the target value. | required |
use_gmean | bool | Whether to use the geometric mean instead of the arithmetic mean. | False |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: Rolled-up dataframe with all original columns retained. |
Source code in src/workbench/utils/chem_utils.py
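The roll-up described above (latest time per ID, then an average over replicates at that time) can be sketched in a few lines of pandas. This is an illustration under assumptions, not the library code: `rollup_latest` is a hypothetical name, only the arithmetic mean is shown, and non-target columns are taken from the first replicate row.

```python
import pandas as pd


def rollup_latest(df, id_col, time_col, target_col):
    """Keep each ID's latest time point and average the target over replicates."""
    # Keep only rows recorded at each ID's largest time value
    latest = df[df[time_col] == df.groupby(id_col)[time_col].transform("max")]
    # Average the target over any replicate rows at that time point
    means = latest.groupby(id_col)[target_col].mean()
    # Keep one row per ID (all columns retained) and overwrite the target with the mean
    rolled = latest.drop_duplicates(subset=id_col).copy()
    rolled[target_col] = rolled[id_col].map(means)
    return rolled.reset_index(drop=True)


df = pd.DataFrame(
    {
        "id": ["A", "A", "A", "B"],
        "time": [1, 2, 2, 1],
        "value": [5.0, 2.0, 4.0, 7.0],
    }
)
rolled = rollup_latest(df, "id", "time", "value")
print(rolled)
```

Compound A keeps only its two time-2 measurements, which average to 3.0, while compound B passes through unchanged.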
show(smiles, width=500, height=500)
Displays an image of the molecule represented by the given SMILES string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
smiles | str | A SMILES string representing the molecule. | required |
width | int | Width of the image in pixels. Default is 500. | 500 |
height | int | Height of the image in pixels. Default is 500. | 500 |
Returns: None
Source code in src/workbench/utils/chem_utils.py
standard_tautomer_canonicalization(mol)
Standard processing of a molecule to select the canonical tautomer.
RDKit's TautomerEnumerator uses heuristics to select a canonical tautomer, such as preferring keto over enol forms and minimizing formal charges.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | The RDKit molecule for which the canonical tautomer is to be determined. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | The SMILES string of the canonical tautomer. |
Source code in src/workbench/utils/chem_utils.py
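The TautomerEnumerator mentioned above can be called directly from RDKit. Here is a minimal sketch of the underlying call, assuming the standard `rdMolStandardize` module; the wrapper name `canonical_tautomer_smiles` is hypothetical and the Workbench function may add its own pre- or post-processing.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def canonical_tautomer_smiles(smiles: str) -> str:
    """Return the SMILES of RDKit's canonical tautomer for the given SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    enumerator = rdMolStandardize.TautomerEnumerator()
    # Canonicalize enumerates tautomers and returns the highest-scoring one
    return Chem.MolToSmiles(enumerator.Canonicalize(mol))


# Two tautomers of the same compound: 2-hydroxypyridine and 2-pyridone
from_enol = canonical_tautomer_smiles("Oc1ccccn1")
from_keto = canonical_tautomer_smiles("O=c1cccc[nH]1")
print(from_enol, from_keto)
```

Both inputs should map to the same string, which is what makes the canonical tautomer useful as a deduplication key.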
svg_from_smiles(smiles, width=500, height=500, background='rgba(64, 64, 64, 1)')
Generate an SVG image of the molecule represented by the given SMILES string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
smiles | str | A SMILES string representing the molecule. | required |
width | int | Width of the image in pixels. Default is 500. | 500 |
height | int | Height of the image in pixels. Default is 500. | 500 |
background | str | Background color of the image. Default is dark grey. | 'rgba(64, 64, 64, 1)' |
Returns:
Type | Description |
---|---|
Optional[str] | Optional[str]: Encoded SVG string of the molecule or None if the SMILES string is invalid. |
Source code in src/workbench/utils/chem_utils.py
tautomerize_smiles(df)
Perform tautomer enumeration and canonicalization on a DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input DataFrame containing SMILES strings. | required |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: A new DataFrame with additional 'smiles_canonical' and 'smiles_tautomer' columns. |
Source code in src/workbench/utils/chem_utils.py
toxic_elements(mol)
Identifies toxic elements or specific forms of elements in a molecule.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | RDKit molecule object. | required |
Returns:
Type | Description |
---|---|
Optional[List[str]] | Optional[List[str]]: List of toxic elements or specific forms if found, otherwise None. |
Notes
Halogen toxicity logic integrates with halogen_toxicity_score and scales thresholds based on molecule size.
Source code in src/workbench/utils/chem_utils.py
toxic_groups(mol)
Check if a molecule contains known toxic functional groups using RDKit's functional groups and SMARTS patterns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | The molecule to evaluate. | required |
Returns:
Type | Description |
---|---|
Optional[List[str]] | Optional[List[str]]: List of SMARTS patterns for toxic groups if found, otherwise None. |
Source code in src/workbench/utils/chem_utils.py
Examples
Canonical SMILES
"""Example for computing canonical SMILES strings"""
import pandas as pd
from workbench.utils.chem_utils import canonicalize

test_data = [
    {"id": "Acetylacetone", "smiles": "CC(=O)CC(=O)C", "expected": "CC(=O)CC(C)=O"},
    {"id": "Imidazole", "smiles": "c1cnc[nH]1", "expected": "c1c[nH]cn1"},
    {"id": "Pyridone", "smiles": "C1=CC=NC(=O)C=C1", "expected": "O=c1cccccn1"},
    {"id": "Guanidine", "smiles": "C(=N)N=C(N)N", "expected": "N=CN=C(N)N"},
    {"id": "Catechol", "smiles": "c1cc(c(cc1)O)O", "expected": "Oc1ccccc1O"},
    {"id": "Formamide", "smiles": "C(=O)N", "expected": "NC=O"},
    {"id": "Urea", "smiles": "C(=O)(N)N", "expected": "NC(N)=O"},
    {"id": "Phenol", "smiles": "c1ccc(cc1)O", "expected": "Oc1ccccc1"},
]

# Convert test data to a DataFrame
df = pd.DataFrame(test_data)

# Perform canonicalization
result_df = canonicalize(df)
print(result_df)
Output
              id            smiles       expected  smiles_canonical
0  Acetylacetone     CC(=O)CC(=O)C  CC(=O)CC(C)=O     CC(=O)CC(C)=O
1      Imidazole        c1cnc[nH]1     c1c[nH]cn1        c1c[nH]cn1
2       Pyridone  C1=CC=NC(=O)C=C1    O=c1cccccn1       O=c1cccccn1
3      Guanidine      C(=N)N=C(N)N     N=CN=C(N)N        N=CN=C(N)N
4       Catechol    c1cc(c(cc1)O)O     Oc1ccccc1O        Oc1ccccc1O
5      Formamide            C(=O)N           NC=O              NC=O
6           Urea         C(=O)(N)N        NC(N)=O           NC(N)=O
7         Phenol       c1ccc(cc1)O      Oc1ccccc1         Oc1ccccc1
Tautomerize SMILES
"""Example for tautomerizing SMILES strings"""
import pandas as pd
from workbench.utils.chem_utils import tautomerize_smiles

test_data = [
    # Salicylaldehyde undergoes keto-enol tautomerization.
    {"id": "Salicylaldehyde (Keto)", "smiles": "O=Cc1cccc(O)c1", "expected": "O=Cc1cccc(O)c1"},
    {"id": "2-Hydroxybenzaldehyde (Enol)", "smiles": "Oc1ccc(C=O)cc1", "expected": "O=Cc1ccc(O)cc1"},
    # Acetylacetone exhibits keto-enol tautomerism; the expected canonical tautomer is the diketo form.
    {"id": "Acetylacetone", "smiles": "CC(=O)CC(=O)C", "expected": "CC(=O)CC(C)=O"},
    # Imidazole undergoes a proton shift in the aromatic ring.
    {"id": "Imidazole", "smiles": "c1cnc[nH]1", "expected": "c1c[nH]cn1"},
    # Pyridone prefers the lactam form in RDKit's tautomer enumeration.
    {"id": "Pyridone", "smiles": "C1=CC=NC(=O)C=C1", "expected": "O=c1cccccn1"},
    # Guanidine undergoes amine-imine tautomerization.
    {"id": "Guanidine", "smiles": "C(=N)N=C(N)N", "expected": "N=C(N)N=CN"},
    # Catechol standardizes hydroxyl group placement in the aromatic system.
    {"id": "Catechol", "smiles": "c1cc(c(cc1)O)O", "expected": "Oc1ccccc1O"},
    # Formamide canonicalizes to NC=O, reflecting its stable form.
    {"id": "Formamide", "smiles": "C(=O)N", "expected": "NC=O"},
    # Urea undergoes a proton shift between nitrogen atoms.
    {"id": "Urea", "smiles": "C(=O)(N)N", "expected": "NC(N)=O"},
    # Phenol standardizes hydroxyl group placement in the aromatic system.
    {"id": "Phenol", "smiles": "c1ccc(cc1)O", "expected": "Oc1ccccc1"},
]

# Convert test data to a DataFrame
df = pd.DataFrame(test_data)

# Perform tautomerization
result_df = tautomerize_smiles(df)
print(result_df)
Output
                             id       smiles_orig        expected  smiles_canonical          smiles
0        Salicylaldehyde (Keto)    O=Cc1cccc(O)c1  O=Cc1cccc(O)c1    O=Cc1cccc(O)c1  O=Cc1cccc(O)c1
1  2-Hydroxybenzaldehyde (Enol)    Oc1ccc(C=O)cc1  O=Cc1ccc(O)cc1    O=Cc1ccc(O)cc1  O=Cc1ccc(O)cc1
2                 Acetylacetone     CC(=O)CC(=O)C   CC(=O)CC(C)=O     CC(=O)CC(C)=O   CC(=O)CC(C)=O
3                     Imidazole        c1cnc[nH]1      c1c[nH]cn1        c1c[nH]cn1      c1c[nH]cn1
4                      Pyridone  C1=CC=NC(=O)C=C1     O=c1cccccn1       O=c1cccccn1     O=c1cccccn1
5                     Guanidine      C(=N)N=C(N)N      N=C(N)N=CN        N=CN=C(N)N      N=C(N)N=CN
6                      Catechol    c1cc(c(cc1)O)O      Oc1ccccc1O        Oc1ccccc1O      Oc1ccccc1O
7                     Formamide            C(=O)N            NC=O              NC=O            NC=O
8                          Urea         C(=O)(N)N         NC(N)=O           NC(N)=O         NC(N)=O
9                        Phenol       c1ccc(cc1)O       Oc1ccccc1         Oc1ccccc1       Oc1ccccc1
Additional Resources
- Workbench API Classes: API Classes
- Consulting Available: SuperCowPowers LLC