Residual Analysis

Overview and Definition
Residual analysis involves examining the differences between observed and predicted values, known as residuals, to assess the performance of a predictive model. It is a critical step in model evaluation as it helps identify patterns of errors, diagnose potential problems, and improve model performance. By understanding where and why a model's predictions deviate from actual values, we can make informed adjustments to the model or the data to enhance accuracy and robustness.

Sparse Data Regions
The observation is in a part of feature space with little or no nearby training observations, leading to poor generalization in these regions and resulting in high prediction errors.

Noisy/Inconsistent Data and Preprocessing Issues
The observation is in a part of feature space where the training data is noisy, incorrect, or has high variance in the target variable. Additionally, missing values or incorrect data transformations can introduce errors, leading to unreliable predictions and high residuals.

Feature Resolution
The current feature set may not fully resolve the compounds, leading to ‘collisions’ where different compounds are assigned identical features. Such unresolved features can result in different compounds exhibiting the same features, causing high residuals due to unaccounted structural or chemical nuances.

Activity Cliffs
Structurally similar compounds exhibit significantly different activities, making accurate prediction challenging due to steep changes in activity with minor structural modifications.

Feature Engineering Issues
Irrelevant or redundant features and poor feature scaling can negatively impact the model's performance and accuracy, resulting in higher residuals.

Model Overfitting or Underfitting
Overfitting occurs when the model is too complex and captures noise, while underfitting happens when the model is too simple and misses underlying patterns, both leading to inaccurate predictions.