Molecular Standardization: Canonicalization, Tautomerization, and Salt Handling

In this blog we'll look at why molecular standardization matters for ML pipelines, what Workbench's feature endpoints actually do under the hood, and how the popular AqSol compound solubility dataset illustrates the challenges of working with real-world chemical data.

Why Standardization Matters

The same molecule can be written many different ways in SMILES notation. Benzene alone has multiple valid representations: C1=CC=CC=C1, c1ccccc1, and C1=CC=C(C=C1) all describe the same compound. Drug compounds are even messier: they arrive as salts, as mixtures of tautomers, and with inconsistent stereochemistry annotations.
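
A quick RDKit check makes this concrete: canonicalization collapses all three benzene strings into one (a minimal sketch, assuming RDKit is installed):

from rdkit import Chem

# All three benzene representations canonicalize to the same string
for smi in ["C1=CC=CC=C1", "c1ccccc1", "C1=CC=C(C=C1)"]:
    print(Chem.CanonSmiles(smi))  # prints c1ccccc1 each time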

If you feed these raw SMILES into a descriptor computation pipeline, structurally identical compounds produce different feature vectors. Your ML model sees noise where there should be signal. Standardization eliminates this problem.

Workbench's Standardization Pipeline

Workbench feature endpoints run a four-step standardization pipeline (based on the ChEMBL structure pipeline) before computing any molecular descriptors:

Step 1: Cleanup

Removes explicit hydrogens, disconnects metal atoms from organic fragments, and normalizes functional group representations (e.g., different ways of drawing nitro groups or sulfoxides).
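
As a rough sketch of the metal-disconnection part of this step (Workbench's exact cleanup internals may differ), RDKit's MetalDisconnector turns a covalently drawn metal bond into the ionic form:

from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

# Covalently drawn sodium acetate; the O-Na bond should really be ionic
mol = Chem.MolFromSmiles("CC(=O)O[Na]")
mol = rdMolStandardize.MetalDisconnector().Disconnect(mol)
print(Chem.MolToSmiles(mol))  # [Na+] and CC(=O)[O-] as separate fragments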

Step 2: Salt/Fragment Handling

Many drug compounds are stored as salt forms (e.g., sodium acetate [Na+].CC(=O)[O-]). Workbench provides two modes for handling these:

  • extract_salts=True (default): Identifies and keeps the largest organic fragment, removes counterions, and records the removed salt for traceability. Also distinguishes true salts from mixtures — multiple large neutral organic fragments are flagged as mixtures and logged.
  • extract_salts=False: Keeps the full molecule with all fragments intact and preserves ionic charges. Useful when the salt form itself affects the property you're modeling (e.g., solubility, formulation studies).

# Default: removes salts (ChEMBL standard)
df = standardize(df, extract_salts=True)
# Input:  [Na+].CC(=O)[O-]  →  smiles: CC(=O)O, salt: [Na+]

# Keep salts for salt-dependent properties
df = standardize(df, extract_salts=False)
# Input:  [Na+].CC(=O)[O-]  →  smiles: [Na+].CC(=O)[O-], salt: None
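
Under the hood, the largest-fragment selection in extract_salts=True resembles RDKit's LargestFragmentChooser; a minimal sketch, not Workbench's exact code:

from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

mol = Chem.MolFromSmiles("[Na+].CC(=O)[O-]")
parent = rdMolStandardize.LargestFragmentChooser().choose(mol)
print(Chem.MolToSmiles(parent))  # CC(=O)[O-], the counterion is gone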

Step 3: Charge Neutralization

When salts are extracted, charges on the parent molecule are neutralized (e.g., CC(=O)[O-] → CC(=O)O). This step is skipped when keeping salts to preserve ionic character.
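
The neutralization itself looks roughly like RDKit's Uncharger (again a sketch, picking up the acetate parent left over after salt extraction):

from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

# Neutralize the charged parent molecule
mol = rdMolStandardize.Uncharger().uncharge(Chem.MolFromSmiles("CC(=O)[O-]"))
print(Chem.MolToSmiles(mol))  # CC(=O)O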

Step 4: Tautomer Canonicalization

Tautomers are isomers that differ in proton and double-bond positions but exist in rapid equilibrium. The classic example is the keto-enol pair. Workbench uses RDKit's tautomer enumerator to pick a canonical form, ensuring that the same compound always produces the same descriptors regardless of which tautomeric form appeared in the source data.

# 2-hydroxypyridine and 2-pyridone are the same compound
Oc1ccccn1  →  O=c1cccc[nH]1  (canonical tautomer)
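
With RDKit this looks like the following (a minimal sketch; Workbench wraps this inside the feature endpoint):

from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

enumerator = rdMolStandardize.TautomerEnumerator()
mol = Chem.MolFromSmiles("Oc1ccccn1")       # 2-hydroxypyridine
canonical = enumerator.Canonicalize(mol)
print(Chem.MolToSmiles(canonical))          # O=c1cccc[nH]1 (2-pyridone)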

Descriptor Computation

After standardization, Workbench computes ~310 molecular descriptors from three sources:

Source          | Count | Description
RDKit           | ~220  | Constitutional, topological, electronic, lipophilicity, pharmacophore, and ADMET-specific descriptors (TPSA, QED, Lipinski)
Mordred         | ~85   | Five ADMET-focused modules: AcidBase, Aromatic, Constitutional, Chi connectivity indices, and CarbonTypes
Stereochemistry | 10    | Custom features: R/S center counts, E/Z bond counts, stereo complexity, and fraction-defined metrics

Invalid molecules receive NaN values rather than being dropped, preserving row alignment with the input DataFrame. The Ipc descriptor is excluded due to known overflow issues in RDKit.
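
A simplified sketch of the RDKit portion, including the NaN fill and the Ipc exclusion (the descriptor names come from RDKit's own registry; this is not Workbench's exact code):

from rdkit import Chem
from rdkit.Chem import Descriptors

def rdkit_descriptors(smiles):
    """Compute RDKit descriptors, filling NaN when the SMILES won't parse."""
    names_fns = [(n, f) for n, f in Descriptors.descList if n != "Ipc"]  # Ipc overflows
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return {n: float("nan") for n, _ in names_fns}  # keep row alignment
    return {n: f(mol) for n, f in names_fns}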

Feature Endpoints: Deployed on AWS

These standardization and descriptor computations run inside Workbench feature endpoints — SageMaker-hosted transformer models that take raw SMILES and return standardized structures plus computed descriptors. Two variants are available:

  • smiles-to-taut-md-stereo-v1: Standard pipeline with salt extraction (ChEMBL default)
  • smiles-to-taut-md-stereo-v1-keep-salts: Preserves salt forms for salt-sensitive modeling

Both endpoints can be deployed as serverless (cost-efficient for intermittent workloads) or on dedicated instances for higher throughput.
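
Invoking an endpoint from Python might look like this. This is a hypothetical sketch: the Endpoint class and inference() call follow Workbench's general API pattern, but check the Workbench docs for the exact names and signatures.

# Hypothetical sketch -- verify class/method names against the Workbench docs
from workbench.api import Endpoint

endpoint = Endpoint("smiles-to-taut-md-stereo-v1")
features_df = endpoint.inference(df)  # df has a 'smiles' column; returns standardized structures + descriptors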

The AqSol Dataset: A Real-World Example

AqSolDB is a curated reference set of aqueous solubility values containing 9,982 unique compounds drawn from 9 publicly available datasets, hosted on the Harvard Dataverse.

Running this dataset through the full standardization + descriptor pipeline reveals that roughly 9% of compounds produce NaNs, INFs, or parse errors in one or more descriptors. Common causes include:

  • Invalid or unusual SMILES: Organometallic compounds, polymers, or SMILES notation errors that RDKit can't parse
  • Descriptor overflow: Extremely large or complex molecules that cause numerical issues in certain descriptors
  • Mordred edge cases: Some Mordred modules return error objects rather than numbers for unusual chemical structures

This is why the pipeline uses errors="coerce" for Mordred values and returns NaN rather than crashing — downstream ML pipelines can then handle missing values through imputation or row filtering as appropriate.
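
In pandas terms, the coercion step looks something like this (descriptor_cols is an assumed name for the Mordred output columns):

import numpy as np
import pandas as pd

# Mordred error objects become NaN; INFs are mapped to NaN as well
df[descriptor_cols] = df[descriptor_cols].apply(pd.to_numeric, errors="coerce")
df[descriptor_cols] = df[descriptor_cols].replace([np.inf, -np.inf], np.nan)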

Key Differences: Canonicalization vs Tautomerization

Aspect   | Canonicalization                                  | Tautomerization
Purpose  | Standardizes the entire molecular representation  | Handles proton/bond-shift equilibria
Scope    | Atom ordering, bond types, stereochemistry        | Functional groups capable of tautomerization
Output   | A unique, canonical SMILES string                 | A specific canonical tautomeric form
Use Case | Deduplication, consistency, comparison            | Consistent descriptors across tautomeric forms
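
The distinction is easy to see in code: canonical SMILES alone treats the two tautomers from earlier as different compounds, while tautomer canonicalization unifies them:

from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

a, b = "Oc1ccccn1", "O=c1cccc[nH]1"
print(Chem.CanonSmiles(a) == Chem.CanonSmiles(b))  # False: still two strings

enumerator = rdMolStandardize.TautomerEnumerator()

def canon_taut(smi):
    return Chem.MolToSmiles(enumerator.Canonicalize(Chem.MolFromSmiles(smi)))

print(canon_taut(a) == canon_taut(b))              # True: one canonical tautomer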

Questions?

The SuperCowPowers team is happy to answer any questions you may have about AWS and Workbench. Please contact us at workbench@supercowpowers.com or chat with us on Discord.