EDA: High Target Gradients
SageWorks EDS
The SageWorks toolkit a set of plots that show EDA results, it also has a flexible plugin architecture to expand, enhance, or even replace the current set of web components Dashboard.
The SageWorks framework has a broad range of Exploratory Data Analysis (EDA) functionality. Each time a DataSource or FeatureSet is created that data is run through a full set of EDA techniques:
- TBD
- TBD2
One of the latest EDA techniques we've added is the addition of a concept called High Target Gradients
- Definition: For a given data point (x_i) with target value (y_i), and its neighbor (x_j) with target value (y_j), the target gradient (G_{ij}) can be defined as:
[G_{ij} = \frac{|y_i - y_j|}{d(x_i, x_j)}]
where (d(x_i, x_j)) is the distance between (x_i) and (x_j) in the feature space. This equation gives you the rate of change of the target value with respect to the change in features, similar to a slope in a two-dimensional space.
- Max Gradient for Each Point: For each data point (x_i), you can compute the maximum target gradient with respect to all its neighbors:
[G_{i}^{max} = \max_{j \neq i} G_{ij}]
This gives you a scalar value for each point in your training data that represents the maximum rate of change of the target value in its local neighborhood.
-
Usage: You can use (G_{i}^{max}) to identify and filter areas in the feature space that have high target gradients, which may indicate potential issues with data quality or feature representation.
-
Visualization: Plotting the distribution of (G_{i}^{max}) values or visualizing them in the context of the feature space can help you identify regions or specific points that warrant further investigation.
Additional Resources
- Consulting Available: SuperCowPowers LLC