Confidence Scores in Workbench
Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and Workbench. So please contact us at workbench@supercowpowers.com or chat us up on Discord.
Workbench provides confidence scores for every model prediction, giving users a measure of how much to trust each prediction. Higher confidence means the ensemble models agree closely; lower confidence means they disagree.
Overview
Every Workbench model — XGBoost, PyTorch, or ChemProp — is a 5-model ensemble trained via cross-validation. The same uncertainty quantification pipeline runs for all three frameworks:
| Framework | Ensemble | Std Source | Calibration |
|---|---|---|---|
| XGBoost | 5-fold CV, XGBRegressor per fold | np.std across 5 predictions | Conformal scaling |
| PyTorch | 5-fold CV, TabularMLP per fold | np.std across 5 predictions | Conformal scaling |
| ChemProp | 5-fold CV, MPNN per fold | np.std across 5 predictions | Conformal scaling |
Three-Step Pipeline
1. Ensemble Disagreement
Each fold of the 5-fold cross-validation produces a model trained on a different slice of the data. At inference time, all 5 models make a prediction and we take the average. The standard deviation across the 5 predictions (prediction_std) is the raw uncertainty signal.
When the models agree closely (low std), the prediction is more reliable. When they disagree (high std), something about that compound is tricky.
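As a minimal sketch of this step (assuming the five per-fold models are available as a list of fitted objects with a scikit-learn style predict() method, which is an illustrative interface rather than the Workbench internals):

```python
import numpy as np

def ensemble_predict(fold_models, X):
    """Average the 5 per-fold models; the std across them is the raw uncertainty signal."""
    preds = np.column_stack([m.predict(X) for m in fold_models])  # shape (n_samples, 5)
    prediction = preds.mean(axis=1)       # ensemble average
    prediction_std = preds.std(axis=1)    # disagreement across the folds
    return prediction, prediction_std
```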
2. Conformal Calibration
Raw ensemble std tells you which predictions to trust more, but the numbers aren't calibrated — a std of 0.3 doesn't map to a meaningful interval. Workbench uses conformal prediction to fix this:
- Compute nonconformity scores on held-out validation data: score = |actual - predicted| / std
- For each confidence level (50%, 68%, 80%, 90%, 95%), find the quantile of scores that achieves the target coverage
- Build intervals: prediction ± scale_factor × std
The scaling factors are computed once during training and stored as metadata. At inference, building intervals is a simple multiply.
The result: prediction intervals that vary per-compound (based on ensemble disagreement) but are calibrated to achieve correct coverage. An 80% interval really does contain ~80% of true values.
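A hedged sketch of how the scaling factors and intervals could be computed from held-out validation data (function and variable names here are illustrative, not the Workbench API):

```python
import numpy as np

def conformal_scale_factors(y_true, y_pred, y_std, levels=(0.50, 0.68, 0.80, 0.90, 0.95)):
    """Per-level scale factor: the quantile of the nonconformity scores on validation data."""
    scores = np.abs(y_true - y_pred) / np.maximum(y_std, 1e-12)  # |actual - predicted| / std
    return {level: float(np.quantile(scores, level)) for level in levels}

def prediction_interval(y_pred, y_std, scale_factors, level=0.80):
    """At inference, building an interval is a simple multiply: pred ± scale * std."""
    scale = scale_factors[level]
    return y_pred - scale * y_std, y_pred + scale * y_std
```

The scale factors are small per-level constants, so they can be stored as model metadata and applied to any new prediction without revisiting the validation set.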
3. Percentile-Rank Confidence
Confidence is the percentile rank of each prediction's prediction_std within the training set's std distribution, inverted so that low ensemble disagreement maps to high confidence:
- Confidence 0.7 means this prediction's ensemble disagreement is lower than 70% of the training set — a relatively tight prediction.
- Confidence 0.1 means 90% of training predictions had lower uncertainty — this compound is an outlier.
This approach gives scores that spread across the full 0–1 range, are directly interpretable, and require no arbitrary parameters.
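A minimal sketch of the percentile-rank idea, assuming the training-set prediction_std values are available as an array (names are illustrative):

```python
import numpy as np

def confidence_from_std(pred_std, train_stds):
    """Confidence = 1 - fraction of training stds at or below this prediction's std,
    so low disagreement -> high confidence, high disagreement -> low confidence."""
    train_stds = np.sort(np.asarray(train_stds))
    rank = np.searchsorted(train_stds, pred_std, side="right") / len(train_stds)
    return 1.0 - rank
```

For example, a prediction whose std is lower than 70% of the training stds gets rank 0.3 and confidence 0.7, matching the interpretation above.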
Interpreting Confidence Scores
High Confidence (> 0.7)
- Ensemble models agree closely on the prediction
- Prediction intervals are narrower than most training predictions
- Good candidates for prioritization
Medium Confidence (0.3 – 0.7)
- Typical level of ensemble disagreement
- Predictions are likely reasonable but verify important decisions
Low Confidence (< 0.3)
- Ensemble models disagree significantly
- Prediction intervals are wider than most training predictions
- May indicate out-of-distribution compounds or regions where the model is uncertain
What Confidence Doesn't Tell You
Confidence reflects how much the ensemble models agree — but agreement doesn't guarantee correctness:
- High confidence ≠ correct prediction. It means the models agree, not that they're right.
- Novel chemistry may get falsely high confidence if it happens to fall in a region where models extrapolate consistently.
- Confidence is relative to the training set. A confidence of 0.9 from a kinase solubility model doesn't transfer to a PROTAC dataset.
For truly out-of-distribution detection, consider pairing confidence with applicability domain analysis.
Metrics for Evaluating Confidence
Workbench computes several metrics to evaluate how well confidence correlates with actual prediction quality:
confidence_to_error_corr
Spearman correlation between confidence and absolute error. Should be negative (high confidence = low error). Target: < -0.5
interval_to_error_corr
Spearman correlation between interval width and absolute error. Should be positive (wide intervals = high error). Target: > 0.5
Coverage Metrics
For each confidence level (50%, 68%, 80%, 90%, 95%), the percentage of true values that fall within the prediction interval. Should match the target coverage.
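A hedged sketch of how these diagnostics could be computed with NumPy and SciPy (the 80% interval columns are illustrative inputs):

```python
import numpy as np
from scipy.stats import spearmanr

def confidence_diagnostics(y_true, y_pred, confidence, lower_80, upper_80):
    abs_error = np.abs(y_true - y_pred)

    # confidence_to_error_corr: should be negative (high confidence -> low error)
    confidence_to_error_corr, _ = spearmanr(confidence, abs_error)

    # interval_to_error_corr: should be positive (wide intervals -> high error)
    interval_to_error_corr, _ = spearmanr(upper_80 - lower_80, abs_error)

    # Coverage at the 80% level: fraction of true values inside the interval
    coverage_80 = np.mean((y_true >= lower_80) & (y_true <= upper_80))

    return {"confidence_to_error_corr": confidence_to_error_corr,
            "interval_to_error_corr": interval_to_error_corr,
            "coverage_80": coverage_80}
```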
Deep Dive
For more details on the approach, including code walkthrough and validation results, see the Model Confidence Blog.
Additional Resources

Need help with confidence scores or uncertainty quantification? Want to develop a customized application tailored to your business needs?
- Consulting Available: SuperCowPowers LLC