Confidence Scores in Workbench
Need Help?
The SuperCowPowers team is happy to give any assistance needed when setting up AWS and Workbench, so please contact us at workbench@supercowpowers.com or chat with us on Discord.
Workbench provides confidence scores for model predictions, giving users a measure of how certain the model is about each prediction. Higher confidence indicates the model is more certain, while lower confidence suggests greater uncertainty.
Overview
| Framework | UQ Method | Confidence Source |
|---|---|---|
| XGBoost | MAPIE Conformalized Quantile Regression | 80% prediction interval width |
| PyTorch | MAPIE Conformalized Quantile Regression | 80% prediction interval width |
| ChemProp | Ensemble disagreement (5 models) | Ensemble standard deviation |
All frameworks use the same confidence formula, but the uncertainty source differs.
Confidence Formula
Confidence is computed using a simple exponential decay function:

    confidence = exp(-uncertainty / median_uncertainty)

Where:
- uncertainty: The model's uncertainty estimate for each prediction
- median_uncertainty: The median uncertainty from the training/validation data
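
A minimal sketch of this calculation in Python (the function name `compute_confidence` is illustrative, not Workbench's internal API):

```python
import numpy as np

def compute_confidence(uncertainty: np.ndarray, median_uncertainty: float) -> np.ndarray:
    """Exponential decay of uncertainty relative to the median uncertainty.

    A prediction with exactly the median uncertainty gets exp(-1) ~= 0.37;
    narrower-than-typical uncertainty pushes confidence toward 1.0.
    """
    return np.exp(-uncertainty / median_uncertainty)

# Example: median interval width of 2.0 on the training/validation data
print(compute_confidence(np.array([0.5, 2.0, 6.0]), median_uncertainty=2.0))
# -> approximately [0.78, 0.37, 0.05]
```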
XGBoost and PyTorch: MAPIE CQR
For XGBoost and PyTorch models, Workbench uses MAPIE Conformalized Quantile Regression (CQR) to generate prediction intervals with coverage guarantees.
How It Works
- Train quantile models: LightGBM models are trained to predict the 10th and 90th percentiles (80% confidence interval)
- Conformalize: MAPIE adjusts the intervals using a held-out calibration set to guarantee coverage
- Compute interval width: The uncertainty is the width of the 80% prediction interval (q_90 - q_10)
- Calculate confidence: Apply the exponential decay formula (see the sketch below)
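
A minimal sketch of these steps, assuming the pre-1.0 MAPIE API (`MapieQuantileRegressor`) with a LightGBM quantile estimator; the synthetic data and variable names are placeholders rather than Workbench's actual training code:

```python
import numpy as np
from lightgbm import LGBMRegressor
from mapie.regression import MapieQuantileRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a Workbench FeatureSet
X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_fit, X_calib, y_fit, y_calib = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

# alpha=0.2 -> conformalized 80% prediction interval (q_10 to q_90)
quantile_model = LGBMRegressor(objective="quantile", n_estimators=200)
mapie = MapieQuantileRegressor(estimator=quantile_model, cv="split", alpha=0.2)
mapie.fit(X_fit, y_fit, X_calib=X_calib, y_calib=y_calib)

# y_pis has shape (n_samples, 2, 1): lower and upper conformalized bounds
y_pred, y_pis = mapie.predict(X_test)
interval_width = y_pis[:, 1, 0] - y_pis[:, 0, 0]   # uncertainty = q_90 - q_10
confidence = np.exp(-interval_width / np.median(interval_width))
```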
Why 80% Confidence Interval?
We use the 80% CI (q_10 to q_90) rather than other intervals because:
- 68% CI is too narrow and sensitive to noise
- 95% CI is too wide and less discriminating between samples
- 80% CI provides a good balance for ranking prediction reliability
Coverage Guarantees
MAPIE's conformalization ensures that prediction intervals achieve their target coverage. For example, an 80% CI will contain approximately 80% of true values on the calibration set. This is a key advantage over simple quantile regression.
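
Empirical coverage is easy to check on held-out data; for example (reusing the illustrative `y_test` and `y_pis` from the sketch above):

```python
import numpy as np

def empirical_coverage(y_true: np.ndarray, lower: np.ndarray, upper: np.ndarray) -> float:
    """Fraction of true values that fall inside their prediction interval."""
    return float(np.mean((y_true >= lower) & (y_true <= upper)))

# For a conformalized 80% interval this should land close to 0.80
print(empirical_coverage(y_test, y_pis[:, 0, 0], y_pis[:, 1, 0]))
```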
ChemProp: Ensemble Disagreement
For ChemProp models, Workbench trains an ensemble of 5 models and uses their disagreement as the uncertainty measure.
How It Works
- Train ensemble: 5 ChemProp models are trained with different random seeds
- Predict: Each model makes a prediction for each sample
- Compute disagreement: The standard deviation across the 5 predictions is the uncertainty
- Calibrate: The standard deviation is empirically calibrated against actual errors
- Calculate confidence: Apply the exponential decay formula using the calibrated standard deviation (see the sketch below)
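
A sketch of the ensemble-disagreement calculation in plain NumPy; the ChemProp training itself is omitted, and the calibration step is shown as a simple global rescaling against validation errors, which is one reasonable choice rather than necessarily Workbench's exact method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predictions from 5 ChemProp models (different random seeds)
# on the same validation samples: shape (n_models, n_samples)
n_models, n_samples = 5, 200
y_true = rng.normal(size=n_samples)
ensemble_preds = y_true + rng.normal(scale=0.3, size=(n_models, n_samples))

# Mean prediction and ensemble disagreement (std across the 5 models)
y_pred = ensemble_preds.mean(axis=0)
raw_std = ensemble_preds.std(axis=0)

# Calibrate: rescale the std so it tracks the actual absolute errors
abs_err = np.abs(y_pred - y_true)
scale = np.median(abs_err) / np.median(raw_std)
calibrated_std = raw_std * scale

# Confidence via the same exponential decay formula
confidence = np.exp(-calibrated_std / np.median(calibrated_std))
```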
Ensemble vs MAPIE
| Aspect | MAPIE (XGBoost/PyTorch) | Ensemble (ChemProp) |
|---|---|---|
| Coverage guarantee | Yes (conformal) | No (empirical) |
| Computational cost | Base model + 5 MAPIE models | Just the 5 ChemProp models |
| Uncertainty type | Prediction interval width | Model disagreement |
Both approaches are valid and widely used in the ML community. MAPIE provides theoretical guarantees, while ensembles capture model uncertainty more directly.
Interpreting Confidence Scores
High Confidence (> 0.5)
- Model is relatively certain about the prediction
- Prediction interval is narrower than typical
- Good candidates for prioritization
Medium Confidence (0.3 - 0.5)
- Model has typical uncertainty
- Prediction is likely reasonable but verify important decisions
Low Confidence (< 0.3)
- Model is uncertain about the prediction
- Prediction interval is wider than typical
- May indicate out-of-distribution samples or difficult predictions
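
If you want to bucket predictions by these bands, a small helper like the following (illustrative, not a Workbench API) works:

```python
def confidence_band(conf: float) -> str:
    """Map a confidence score to the bands described above."""
    if conf > 0.5:
        return "high"
    if conf >= 0.3:
        return "medium"
    return "low"

print([confidence_band(c) for c in (0.78, 0.37, 0.05)])  # ['high', 'medium', 'low']
```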
Metrics for Evaluating Confidence
Workbench computes several metrics to evaluate how well confidence correlates with actual prediction quality:
confidence_to_error_corr
Spearman correlation between confidence and absolute error. Should be negative (high confidence = low error). Target: < -0.5
interval_to_error_corr
Spearman correlation between interval width and absolute error. Should be positive (wide intervals = high error). Target: > 0.5
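
Both correlations can be computed with SciPy's Spearman correlation; the arrays below are synthetic placeholders for a model's validation-set confidence, interval width, and absolute error:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
abs_error = np.abs(rng.normal(size=500))
interval_width = np.clip(2.0 * abs_error + rng.normal(scale=0.2, size=500), 0.05, None)
confidence = np.exp(-interval_width / np.median(interval_width))

conf_corr, _ = spearmanr(confidence, abs_error)        # target: < -0.5
width_corr, _ = spearmanr(interval_width, abs_error)   # target: > 0.5
print(f"confidence_to_error_corr: {conf_corr:.2f}, interval_to_error_corr: {width_corr:.2f}")
```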
Coverage Metrics
For each confidence level (50%, 68%, 80%, 90%, 95%), the percentage of true values that fall within the prediction interval. Should match the target coverage.
Additional Resources

Need help with confidence scores or uncertainty quantification? Want to develop a customized application tailored to your business needs?
- Consulting Available: SuperCowPowers LLC