
3D Molecular Descriptors: Shape, Surface, and Pharmacophore Features

Combine 2D + 3D

Run both the 2D descriptor endpoint and the 3D endpoint, then concatenate the results for a ~388-feature set covering topological, electronic, and geometric properties.

2D molecular descriptors capture a lot about a molecule's chemistry from its connectivity graph alone -- molecular weight, hydrogen bond donors, topological polar surface area, and hundreds of other properties. But some of the most important ADMET properties depend on the molecule's shape in three dimensions: how it fits into a transporter binding site, whether it can fold to mask polar groups for membrane permeation, or how its charge distribution maps onto its surface.

Workbench's 3D descriptor endpoints compute 74 conformer-based features from SMILES strings, covering molecular shape, charged partial surface area, pharmacophore spatial distribution, and conformational flexibility. Like all Workbench endpoints, the contract is simple: send a DataFrame, get the same DataFrame back with 74 descriptor columns appended. Two pipeline modes are available -- a fast endpoint for realtime inference and a full endpoint for high-quality batch processing. Both use Boltzmann-weighted conformer ensembles and produce the same 74 features, so downstream models can consume either interchangeably.

Why 3D Descriptors?

2D descriptors treat molecules as graphs -- atoms are nodes, bonds are edges. This misses geometry-dependent properties that matter for ADMET:

  • Membrane permeability depends on molecular shape and the spatial separation of polar and nonpolar regions (amphiphilic moment)
  • Transporter interactions (P-gp, BCRP) correlate with molecular elongation, nitrogen spatial distribution, and overall size
  • Protein-ligand binding depends on 3D shape complementarity, not just functional group counts
  • Intramolecular hydrogen bonds enable "chameleonic" behavior where molecules mask polar groups in nonpolar environments -- a purely 3D phenomenon

These properties can't be captured from the molecular graph. You need 3D coordinates.

Two Pipeline Modes: Fast and Full

Workbench provides two 3D descriptor endpoints that share the same computation core but differ in conformer sampling depth:

| | Fast | Full |
|---|---|---|
| Endpoint | smiles-to-3d-fast-v1 | smiles-to-3d-full-v1 |
| Conformers | 10 (fixed) | 50-500 (adaptive by rotatable bonds) |
| Aggregation | Boltzmann-weighted ensemble | Boltzmann-weighted ensemble |
| Deployment | Realtime SageMaker endpoint | Async SageMaker endpoint (scale-to-zero) |
| Use case | Synchronous inference from training pipelines | Overnight batch processing (10k-100k compounds) |
| Output | 74 features + 11 diagnostic columns | 74 features + 11 diagnostic columns |

Both modes use Boltzmann-weighted ensemble averaging -- descriptors are computed on every conformer within a 5 kcal/mol energy window of the MMFF minimum, then combined using normalized Boltzmann weights:

\[\Large w_i = \frac{e^{-\Delta E_i \,/\, k_BT}}{\displaystyle\sum_j e^{-\Delta E_j \,/\, k_BT}}, \qquad \langle d \rangle = \sum_i w_i \, d_i\]

where \(\Delta E_i = E_i - E_{\min}\) is the energy above the minimum conformer, \(k_BT\) is the thermal energy at 298 K (0.592 kcal/mol), and \(d_i\) is the descriptor value for conformer \(i\). This is more reproducible than single-conformer descriptors, which can vary significantly with random seed, especially for flexible molecules. The MARCEL benchmark and Nikonenko et al. have shown that ensemble approaches produce more stable QSAR models.
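The weighting formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the Workbench implementation: the function name `boltzmann_average` and its arguments are hypothetical, while the 5 kcal/mol window and the 0.592 kcal/mol thermal energy come from the text.

```python
import numpy as np

KT_298K = 0.592  # kcal/mol, thermal energy at 298 K (value from the text)


def boltzmann_average(energies, descriptors, window=5.0, kT=KT_298K):
    """Boltzmann-weighted mean of per-conformer descriptors (illustrative sketch).

    energies:    shape (n_conf,), force-field energies in kcal/mol
    descriptors: shape (n_conf, n_desc), one row of descriptor values per conformer
    Returns (weighted_mean, weights) for the conformers inside the energy window.
    """
    energies = np.asarray(energies, dtype=float)
    descriptors = np.asarray(descriptors, dtype=float)

    delta = energies - energies.min()      # dE_i = E_i - E_min
    keep = delta <= window                 # conformers within the energy window
    weights = np.exp(-delta[keep] / kT)    # unnormalized Boltzmann factors
    weights /= weights.sum()               # normalize so the weights sum to 1
    return weights @ descriptors[keep], weights


# Two conformers, one kT apart: the lower-energy one gets about 73% of the weight
mean, w = boltzmann_average([0.0, 0.592], [[0.0], [1.0]])
```

Because the weights decay exponentially with energy, conformers near the top of the window contribute very little, which is one reason the averaged value is less sensitive to which high-energy conformers a given random seed happens to produce.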

Adaptive Conformer Counts (Full Mode)

The full endpoint uses a datamol-style adaptive tiering, with the upper tiers bumped above the community-standard 200/300 to reduce single-seed stochastic variance observed on flexible molecules:

| Rotatable Bonds | Conformers | Rationale |
|---|---|---|
| < 8 | 50 | Low flexibility, few distinct conformers |
| 8-12 | 300 | Moderate flexibility, needs broader sampling |
| ≥ 13 | 500 | High flexibility, large conformational space |

This ensures adequate sampling of the conformational landscape without wasting compute on rigid molecules. On 300-conformer runs at 13+ rotatable bonds we measured ~20% seed-to-seed variation in NPR1. Raising the top tier to 500 conformers takes the documented "more samples" path: if sampling error scales as 1/√N, the stochastic spread shrinks by roughly √(500/300) ≈ 1.29×.
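The tiering is simple enough to express as a lookup. The sketch below uses a hypothetical function name, `conformers_for`, and takes the rotatable bond count as an input rather than computing it. The thresholds are the ones in the table above.

```python
import math


def conformers_for(n_rotatable_bonds: int) -> int:
    """Target conformer count for the full endpoint's adaptive tiers (sketch)."""
    if n_rotatable_bonds < 8:
        return 50       # low flexibility
    if n_rotatable_bonds <= 12:
        return 300      # moderate flexibility
    return 500          # high flexibility


# Expected reduction in sampling spread when going from 300 to 500 conformers,
# assuming the error scales as 1/sqrt(N)
spread_reduction = math.sqrt(500 / 300)
```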

The Computation Pipeline

The 3D descriptor endpoint runs a multi-step pipeline for each molecule:

Figure: The 3D descriptor pipeline (SMILES → Standardize → Conformers → 74 Descriptors): standardization, tiered conformer generation with MMFF94s optimization, and Boltzmann-weighted ensemble descriptors across four categories.

Step 1: Standardization

The same standardization pipeline used by the 2D endpoints runs first -- salt extraction, charge neutralization, and tautomer canonicalization. Stereochemistry is preserved through tautomer canonicalization (we override RDKit's default tautomerRemoveSp3Stereo=True which would otherwise silently strip @ markers). This ensures the 3D descriptors are computed on the same canonical, stereo-faithful structure as the 2D descriptors.

Standardize also emits an undefined_chiral_centers diagnostic column counting any chiral centers in the input SMILES that lacked a stereo flag. Nonzero values mean the downstream 3D features reflect an arbitrary enantiomer — users should address ambiguous input upstream.
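One way to reproduce this kind of count locally is RDKit's `FindMolChiralCenters` with `includeUnassigned=True`, which labels unspecified centers with `"?"`. This is an assumption about how such a diagnostic could be computed, not a description of what standardize() does internally. The helper name is hypothetical.

```python
from rdkit import Chem


def count_undefined_chiral_centers(smiles: str) -> int:
    """Count potential stereocenters that carry no stereo flag in the SMILES (sketch)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    # Unassigned centers are reported with the label "?"
    centers = Chem.FindMolChiralCenters(mol, includeUnassigned=True)
    return sum(1 for _atom_idx, label in centers if label == "?")


racemic = count_undefined_chiral_centers("CC(N)C(=O)O")        # alanine, no stereo flag
defined = count_undefined_chiral_centers("C[C@H](N)C(=O)O")    # stereo specified
```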

Step 2: Conformer Generation

Generating realistic 3D coordinates from a SMILES string is the most computationally intensive step. Workbench uses RDKit's ETKDGv3 (Experimental Torsion Knowledge Distance Geometry v3), which biases conformer sampling toward torsion angles observed in crystal structures -- appropriate for the condensed-phase geometries relevant to ADMET.

The algorithm uses a three-tier embedding strategy to maximize success rates across diverse chemical structures:

| Tier | Strategy | When It's Needed |
|---|---|---|
| 1. Standard ETKDGv3 | Experimental torsion preferences + small ring handling | Works for ~95% of drug-like molecules |
| 2. Random Coordinates | Random initial positions instead of distance matrix eigenvalues | Molecules where distance bounds are hard to satisfy |
| 3. Relaxed Constraints | Random coordinates + relaxed flat-ring enforcement | Strained bridged polycyclics, unusual ring topologies |

All conformers are optimized with the MMFF94s force field (preferred over MMFF94 for its improved handling of planar nitrogen centers common in drug molecules). The embedding step sets optimizerForceTol=0.0135 (an ETKDG embedding parameter rather than an MMFF setting), which provides a ~20% speedup with negligible geometry loss. For molecules with unsupported MMFF atom types, the pipeline automatically falls back to UFF (Universal Force Field).

RMSD-based pruning (pruneRmsThresh=0.5) removes redundant geometries -- rigid molecules like benzene naturally collapse to 1-2 unique conformers, while flexible chains retain more diversity.
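The tiered embedding and force-field fallback described above could look roughly like the following with plain RDKit. Treat it as a sketch under stated assumptions: the function name is hypothetical, and mapping tier 3's "relaxed flat-ring enforcement" to `useBasicKnowledge = False` is a guess at one plausible setting. It is not confirmed as the pipeline's actual choice.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def embed_and_optimize(smiles, n_confs=10, seed=42):
    """Embed conformers with a three-tier fallback, then minimize (illustrative)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    mol = Chem.AddHs(mol)  # explicit hydrogens are needed for force-field energies

    conf_ids, used_tier = [], None
    for tier in (1, 2, 3):
        params = AllChem.ETKDGv3()
        params.randomSeed = seed
        params.pruneRmsThresh = 0.5          # drop near-duplicate geometries
        params.optimizerForceTol = 0.0135    # looser embedding tolerance (from the text)
        params.useRandomCoords = tier >= 2   # tier 2+: random starting coordinates
        if tier == 3:
            # ASSUMPTION: one way to relax planarity enforcement; not confirmed
            params.useBasicKnowledge = False
        conf_ids = list(AllChem.EmbedMultipleConfs(mol, n_confs, params))
        if conf_ids:
            used_tier = tier
            break

    if not conf_ids:
        return mol, [], None, "none"

    # Prefer MMFF94s; fall back to UFF when MMFF lacks parameters for some atom
    if AllChem.MMFFHasAllMoleculeParams(mol):
        results = AllChem.MMFFOptimizeMoleculeConfs(mol, mmffVariant="MMFF94s", maxIters=500)
        force_field = "MMFF94s"
    else:
        results = AllChem.UFFOptimizeMoleculeConfs(mol, maxIters=500)
        force_field = "UFF"

    energies = [energy for _not_converged, energy in results]  # kcal/mol
    return mol, energies, used_tier, force_field


mol, energies, tier, ff = embed_and_optimize("CCO")
```

The returned tier and force-field name correspond in spirit to the desc3d_embed_tier and desc3d_force_field diagnostics described later in this document.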

Step 3: Boltzmann-Weighted Descriptor Calculation

All 74 descriptors are computed on the molecule with explicit hydrogens preserved throughout — MMFF94s energy calculations, Mordred CPSA partial charges, and RDKit's mass-weighted shape descriptors (PMI, radius of gyration) all require explicit Hs for correct results.

The custom pharmacophore descriptors, however, follow the cheminformatics convention of heavy-atom-only geometry for distance and centroid calculations (molecular axis, nitrogen span, charge/HBA centroids). The one exception is molecular volume, which uses RDKit's grid-based van der Waals volume and does include Hs — this gives a physically meaningful volume even for small molecules where a heavy-atom-only convex hull would be degenerate.

For each conformer within the 5 kcal/mol energy window, shape, surface, and pharmacophore descriptors are computed independently and then combined via Boltzmann-weighted averaging. Conformer ensemble statistics (energy range, flexibility index) are computed over the full generated ensemble, not just the window.

After embedding, the pipeline also verifies that the 3D geometry reproduces the input stereochemistry and reports the result in the desc3d_stereo_preserved diagnostic column — a True/False gate that catches the rare case where ETKDGv3 silently drops a stereo specification on strained scaffolds.

Descriptor Categories

RDKit 3D Shape Descriptors (10 features)

These capture the overall molecular shape using the inertia tensor:

| Descriptor | What It Captures |
|---|---|
| PMI1, PMI2, PMI3 | Principal moments of inertia -- raw shape information |
| NPR1, NPR2 | Normalized PMI ratios -- classify molecules as rod-like, disc-like, or spherical on the PMI triangle plot |
| Asphericity | How far from spherical (0 = sphere, higher = elongated) |
| Eccentricity | Shape elongation (0 = sphere, 1 = linear) |
| Inertial Shape Factor | PMI2 / (PMI1 × PMI3) in RDKit's definition -- flat vs compact |
| Radius of Gyration | Overall molecular size (mass-weighted spread from center) |
| Spherocity Index | How spherical the molecule is (1 = perfect sphere) |

The NPR1/NPR2 triangle plot is a widely used visualization for molecular shape classification: rod-shaped molecules cluster near (0, 1), disc-shaped near (0.5, 0.5), and spherical near (1, 1). Landrum's RDKit blog has shown that these PMI-derived descriptors are among the most conformer-sensitive, which is precisely why Boltzmann-weighted averaging improves their reproducibility.
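To make the triangle corners concrete, the NPR values can be computed directly from coordinates and masses via the inertia tensor. This is a NumPy sketch with a hypothetical function name. RDKit's own descriptors should be preferred in practice, and the toy point sets below are idealized shapes rather than real molecules.

```python
import numpy as np


def normalized_pmi_ratios(coords, masses):
    """Return (NPR1, NPR2) = (PMI1/PMI3, PMI2/PMI3) for a set of point masses."""
    coords = np.asarray(coords, dtype=float)
    masses = np.asarray(masses, dtype=float)

    # Shift to the center of mass
    centered = coords - np.average(coords, axis=0, weights=masses)

    # Inertia tensor: I = sum_i m_i (|r_i|^2 * Identity - r_i r_i^T)
    r_squared = (centered ** 2).sum(axis=1)
    outer = (masses[:, None] * centered).T @ centered
    inertia = (masses * r_squared).sum() * np.eye(3) - outer

    pmi1, pmi2, pmi3 = np.linalg.eigvalsh(inertia)  # eigenvalues in ascending order
    return pmi1 / pmi3, pmi2 / pmi3


rod = normalized_pmi_ratios([[-1, 0, 0], [0, 0, 0], [1, 0, 0]], [1, 1, 1])
disc = normalized_pmi_ratios([[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0]], [1, 1, 1, 1])
sphere = normalized_pmi_ratios(
    [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]], [1] * 6
)
```

The three toy shapes land on the rod, disc, and sphere corners of the triangle, respectively.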

Mordred 3D Descriptors (52 features)

Mordred's 3D modules compute surface-area-based descriptors that capture how charge, polarity, and hydrophobicity distribute across the molecular surface:

  • CPSA (43 descriptors): Charged Partial Surface Area -- the 3D extension of topological polar surface area. Maps partial charges onto the solvent-accessible surface to capture electrostatic features that govern solvation, permeability, and protein binding.
  • Geometrical Index (4): Petitjean shape indices measuring molecular topology in 3D space.
  • Gravitational Index (4): Mass-weighted distance descriptors.
  • PBF (1): Plane of Best Fit -- measures molecular planarity, relevant for membrane intercalation and crystal packing.

Pharmacophore 3D Descriptors (8 features)

Custom descriptors capturing the spatial distribution of pharmacophoric features:

| Descriptor | ADMET Relevance |
|---|---|
| Molecular Axis Length | Maximum heavy-atom distance -- P-gp substrates are typically 25-30 Å long |
| Molecular Volume | Van der Waals volume via RDKit grid (0.2 Å spacing) -- binding site fit, transporter size constraints |
| Amphiphilic Moment | Polar/nonpolar centroid separation (polar = N/O/S/P + halogens; carbons adjacent to N/O/S/P are neutral) -- membrane orientation, transporter recognition |
| Charge Centroid Distance | Distance from center of mass to centroid of charge-site nitrogens (quaternary/aromatic/N-H) -- captures peripheral vs central ionizable groups |
| Nitrogen Span | Max distance between any two nitrogens (no filter) -- multi-point binding, overall N distribution |
| HBA Centroid Distance | Distance from COM to centroid of pure H-bond acceptors (all O + N with no H and no + charge; nitro groups excluded) -- solubility, permeability |
| IMHB Potential | Intramolecular H-bond count: D...A distance 2.5-3.5 Å + 4-6 bond separation + D-H...A angle ≥ 120° -- chameleonic permeability |
| Elongation | Axis length / volume^(1/3) -- shape anisotropy |

The intramolecular hydrogen bond potential (IMHB) deserves special mention. Molecules that can form intramolecular H-bonds can "mask" their polar groups in nonpolar membrane environments, dramatically increasing permeability despite high polar surface area. This chameleonic behavior is a key design strategy in modern medicinal chemistry and is invisible to 2D descriptors.
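The geometric parts of these definitions are easy to sketch with NumPy. The helpers below are hypothetical illustrations, not Workbench code: `max_pairwise_distance` covers both axis length (over heavy atoms) and nitrogen span (over nitrogens), and `is_imhb_geometry` applies only the distance and angle criteria from the table. The 4-6 bond separation test needs the molecular graph and is left out here.

```python
import numpy as np


def max_pairwise_distance(coords):
    """Largest distance between any two points (axis length or nitrogen span)."""
    coords = np.asarray(coords, dtype=float)
    if len(coords) < 2:
        return 0.0
    diff = coords[:, None, :] - coords[None, :, :]
    return float(np.sqrt((diff ** 2).sum(axis=-1)).max())


def elongation(axis_length, volume):
    """Axis length divided by the cube root of the molecular volume."""
    return axis_length / volume ** (1.0 / 3.0)


def is_imhb_geometry(donor, hydrogen, acceptor, d_min=2.5, d_max=3.5, min_angle=120.0):
    """Check the D...A distance and D-H...A angle criteria for a candidate H-bond."""
    donor, hydrogen, acceptor = (
        np.asarray(p, dtype=float) for p in (donor, hydrogen, acceptor)
    )
    da_distance = np.linalg.norm(acceptor - donor)
    to_donor, to_acceptor = donor - hydrogen, acceptor - hydrogen
    cos_angle = to_donor @ to_acceptor / (
        np.linalg.norm(to_donor) * np.linalg.norm(to_acceptor)
    )
    angle = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))  # angle at the H atom
    return bool(d_min <= da_distance <= d_max and angle >= min_angle)


# Linear D-H...A arrangement at 2.9 Å satisfies both geometric criteria
linear_ok = is_imhb_geometry([0, 0, 0], [1, 0, 0], [2.9, 0, 0])
```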

Conformer Ensemble Statistics (4 features)

Statistics computed over the full generated conformer ensemble that capture conformational flexibility:

  • Energy minimum: The lowest MMFF94s/UFF energy -- a proxy for strain
  • Energy range / standard deviation: How spread out the conformer energies are
  • Conformational flexibility index: Normalized energy range -- higher values indicate more conformational freedom

Highly flexible molecules tend to have larger energy ranges and higher flexibility indices. These features correlate with permeability (flexible molecules pay higher entropic penalties for binding) and metabolic stability.
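A minimal sketch of the energy statistics follows, using hypothetical key names. The normalization behind the flexibility index is not specified in this document, so it is deliberately left out rather than guessed.

```python
import numpy as np


def ensemble_energy_stats(energies):
    """Minimum, range, and standard deviation over the full conformer ensemble."""
    e = np.asarray(energies, dtype=float)
    return {
        "energy_min": float(e.min()),
        "energy_range": float(e.max() - e.min()),
        "energy_std": float(e.std()),  # population standard deviation
    }


stats = ensemble_energy_stats([10.0, 12.0, 14.0])
```

Note that these statistics use every generated conformer, while the Boltzmann-averaged descriptors only use conformers inside the 5 kcal/mol window.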

Diagnostic Columns

In addition to the 74 model features, both endpoints produce 11 desc3d_* diagnostic columns that track pipeline status, conformer generation quality, stereochemistry preservation, and compute time. These are prefixed to distinguish them from model inputs:

| Column | Description |
|---|---|
| desc3d_status | ok, skip:parse, skip:heavy_atoms, skip:rot_bonds, skip:rings, skip:ring_complexity, skip:embed, skip:empty |
| desc3d_mode | fast or full |
| desc3d_conf_count | Conformers after RMSD pruning |
| desc3d_confs_requested | Target conformer count |
| desc3d_confs_in_window | Conformers in the Boltzmann energy window |
| desc3d_embed_failures | Distance geometry retry count |
| desc3d_timeout_failures | Per-conformer RDKit timeout count |
| desc3d_embed_tier | Which embedding tier succeeded (1/2/3) |
| desc3d_force_field | MMFF94s, UFF, or none |
| desc3d_stereo_preserved | True if the 3D geometry reproduces the input's assigned stereo (always True for achiral inputs) |
| desc3d_compute_time_s | Per-molecule wall clock |

Endpoint output also includes the undefined_chiral_centers column emitted by the upstream standardize() step — count of chiral centers in the original input SMILES that lacked a stereo flag, so users can see when features reflect an arbitrary enantiomer.
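A downstream pipeline might use these columns to separate rows it can trust from rows that need attention. The pandas sketch below is one possible policy, not a Workbench feature: the function name and the choice to flag any undefined chiral center are assumptions, while the column names are the ones listed above.

```python
import pandas as pd


def split_by_quality(df: pd.DataFrame):
    """Split endpoint output into (trusted, flagged) rows using the diagnostics."""
    computed = df["desc3d_status"] == "ok"
    stereo_ok = df["desc3d_stereo_preserved"].eq(True)
    fully_specified = df["undefined_chiral_centers"] == 0
    trusted = computed & stereo_ok & fully_specified
    return df[trusted], df[~trusted]


# Toy stand-in for endpoint output (values are made up for illustration)
example = pd.DataFrame(
    {
        "smiles": ["CCO", "C1CC1", "CC(N)C(=O)O"],
        "desc3d_status": ["ok", "skip:embed", "ok"],
        "desc3d_stereo_preserved": [True, True, True],
        "undefined_chiral_centers": [0, 0, 1],
    }
)
trusted_rows, flagged_rows = split_by_quality(example)
```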

Production Guardrails

The 3D endpoints are significantly more compute-intensive than 2D. Several safeguards keep them reliable:

Molecular Complexity Check

Before attempting conformer generation, molecules are screened against size and topology thresholds that catch molecules too large or complex for reliable conformer generation. These are sized for the full endpoint's 60-minute async invocation budget, which comfortably accommodates larger drug-like molecules (PROTACs, small peptides, natural products):

| Property | Skip Threshold | Rationale |
|---|---|---|
| Heavy atoms | > 150 | Embedding time scales roughly O(n^2) |
| Rotatable bonds | > 50 | Combinatorial explosion of conformer space |
| Ring systems | > 10 | Extreme ring counts indicate cage structures |
| Ring complexity score | > 15 | Backstop for highly constrained polycyclic cages |

The ring complexity score (rings + bridgehead atoms + spiro atoms) is a permissive backstop -- common drug scaffolds score well under 15. Upstream, standardize() independently rejects molecules over 500 atoms as a sanity cap. That limit is intentionally larger than the 3D pipeline's 150-heavy-atom limit, so the 3D pipeline's own guards are always the binding constraint.

Molecules exceeding any threshold receive NaN features and a specific desc3d_status explaining the skip reason (e.g. skip:heavy_atoms, skip:rot_bonds), so downstream pipelines can detect and route them appropriately. These guards can be disabled for local analysis (complexity_check=False).
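The screening logic amounts to a few comparisons. The sketch below takes precomputed counts as inputs and returns a status string. The function name and the order in which checks are applied are assumptions; the thresholds and status values come from the tables in this document.

```python
def complexity_status(heavy_atoms, rot_bonds, rings, bridgeheads, spiro_atoms):
    """Return "ok" or the skip status for the first threshold exceeded (sketch)."""
    if heavy_atoms > 150:
        return "skip:heavy_atoms"
    if rot_bonds > 50:
        return "skip:rot_bonds"
    if rings > 10:
        return "skip:rings"
    # Ring complexity score = rings + bridgehead atoms + spiro atoms
    if rings + bridgeheads + spiro_atoms > 15:
        return "skip:ring_complexity"
    return "ok"


typical_drug = complexity_status(heavy_atoms=30, rot_bonds=5, rings=3, bridgeheads=0, spiro_atoms=0)
```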

Deploying the Endpoints

Fast Endpoint (Realtime)

```bash
# Realtime instance (recommended for 3D)
SERVERLESS=false python feature_endpoints/smiles_to_3d_fast_v1.py

# Serverless (lower cost, but slower)
python feature_endpoints/smiles_to_3d_fast_v1.py
```

Full Endpoint (Async)

```bash
python feature_endpoints/smiles_to_3d_full_v1.py
```

The full endpoint deploys as an async endpoint with scale-to-zero -- the instance spins down when idle and cold-starts on the next request. This is ideal for overnight batch runs where you don't want to pay for idle compute during the day.

Using the Endpoints

```python
from workbench.api import Endpoint
from workbench.api.inference_cache import InferenceCache

# df and big_df are pandas DataFrames with a "smiles" column

# Fast endpoint: synchronous, for realtime inference
end_fast = Endpoint("smiles-to-3d-fast-v1")
df_3d = end_fast.inference(df)

# Full endpoint: async deployment, same Endpoint API (auto-routes through async core)
end_full = Endpoint("smiles-to-3d-full-v1")
df_3d_full = end_full.inference(df)

# Both work with InferenceCache for persistent S3-backed caching
cached_endpoint = InferenceCache(end_fast, cache_key_column="smiles")
df_cached = cached_endpoint.inference(big_df)  # Only computes uncached rows
```
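Since each endpoint returns the input DataFrame with its descriptor columns appended, combining 2D and 3D features is a join on the SMILES column. The sketch below uses stand-in DataFrames in place of real endpoint output. The helper name, the `molwt` and `pmi1` column names, and the values are illustrative placeholders, and it assumes one row per unique SMILES.

```python
import pandas as pd


def combine_2d_3d(df_2d: pd.DataFrame, df_3d: pd.DataFrame, key: str = "smiles") -> pd.DataFrame:
    """Append the 3D columns that are not already present in the 2D frame."""
    new_cols = [c for c in df_3d.columns if c not in df_2d.columns]
    # validate raises if the key is duplicated on either side
    return df_2d.merge(df_3d[[key] + new_cols], on=key, how="left", validate="one_to_one")


# Stand-ins for the 2D and 3D endpoint outputs
df_2d = pd.DataFrame({"smiles": ["CCO", "c1ccccc1"], "molwt": [46.07, 78.11]})
df_3d = pd.DataFrame({"smiles": ["c1ccccc1", "CCO"], "molwt": [78.11, 46.07], "pmi1": [1.0, 2.0]})
combined = combine_2d_3d(df_2d, df_3d)
```

Joining on the key rather than concatenating column-wise by position guards against the two results coming back in different row orders.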

References

Conformer Ensemble Methods

  • Zhu, J., Xia, Y., Wu, L., et al. "Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks." ICLR 2024. arXiv: 2310.00115
  • Nikonenko, A., Zankov, D., Baskin, I., et al. "Multiple Conformer Descriptors for QSAR Modeling." Mol. Inform. 40, 2060030 (2021). DOI: 10.1002/minf.202060030
  • Hamakawa, Y. & Miyao, T. "Understanding Conformation Importance in Data-Driven Property Prediction Models." J. Chem. Inf. Model. 65, 3388-3404 (2025). DOI: 10.1021/acs.jcim.5c00018
  • Adams, K. & Coley, C.W. "The Impact of Conformer Quality on Learned Representations of Molecular Conformer Ensembles." arXiv (2025). arXiv: 2502.13220

Conformer Generation

  • Riniker, S. & Landrum, G.A. "Better Informed Distance Geometry: Using What We Know To Improve Conformation Generation." J. Chem. Inf. Model. 55, 2562-2574 (2015). DOI: 10.1021/acs.jcim.5b00654
  • Wang, S., Witek, J., Landrum, G.A. & Riniker, S. "Improving Conformer Generation for Small Rings and Macrocycles Based on Distance Geometry and Experimental Torsional-Angle Preferences." J. Chem. Inf. Model. 60, 2044-2058 (2020). DOI: 10.1021/acs.jcim.0c00025
  • Landrum, G. "Optimizing conformer generation parameters." RDKit Blog (2022). Blog post
  • Landrum, G. "Variability of PMI Descriptors." RDKit Blog (2022). Blog post
  • Landrum, G. "Understanding conformer generation failures." RDKit Blog (2023). Blog post
  • Landrum, G. "Scaling conformer generation." RDKit Blog (2025). Blog post
  • Datamol conformer generation with adaptive tiering. Documentation

Force Fields

  • Tosco, P., Stiefl, N. & Landrum, G. "Bringing the MMFF force field to the RDKit: implementation and validation." J. Cheminform. 6, 37 (2014). DOI: 10.1186/s13321-014-0037-3

Descriptors

  • RDKit 3D Descriptors: Documentation
  • Mordred Community: GitHub
  • Stanton, D.T. & Jurs, P.C. "Development and Use of Charged Partial Surface Area Structural Descriptors in Computer-Assisted Quantitative Structure-Property Relationship Studies." Anal. Chem. 62, 2323-2329 (1990). DOI: 10.1021/ac00220a013

ADMET and Chameleonic Molecules

  • Whitty, A., et al. "Quantifying the chameleonic properties of macrocycles and other high-molecular-weight drugs." Drug Discov. Today 21, 712-717 (2016). DOI: 10.1016/j.drudis.2016.02.005

Questions?

The SuperCowPowers team is happy to answer any questions you may have about AWS and Workbench. Please contact us at workbench@supercowpowers.com or chat us up on Discord.