FeatureSet
FeatureSet Examples
Examples of using the FeatureSet class are in the Examples section at the bottom of this page. AWS Feature Store and Feature Groups are quite complicated to set up manually, but the SageWorks FeatureSet makes it a breeze!
FeatureSet: Handles the creation and management of AWS Feature Store/Feature Groups. FeatureSets are set up so they can easily be queried with AWS Athena. All FeatureSets are run through a full set of Exploratory Data Analysis (EDA) techniques (data quality, distributions, stats, outliers, etc.). FeatureSets can be viewed and explored within the SageWorks Dashboard UI.
FeatureSet
Bases: FeatureSetCore
FeatureSet: SageWorks FeatureSet API Class
Common Usage
Source code in src/sageworks/api/feature_set.py
details(**kwargs)
FeatureSet Details
Returns:
Name | Type | Description |
---|---|---|
dict | dict | A dictionary of details about the FeatureSet |
pull_dataframe(include_aws_columns=False)
Return a DataFrame of ALL the data from this FeatureSet
Parameters:
Name | Type | Description | Default |
---|---|---|---|
include_aws_columns | bool | Include the AWS columns in the DataFrame (default: False) | False |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: A DataFrame of ALL the data from this FeatureSet |
Note
Obviously this is not recommended for large datasets :)
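For larger FeatureSets, one option is to pull a bounded preview through `query()` rather than calling `pull_dataframe()`. The sketch below is illustrative only: `preview_query` and `pull_preview` are hypothetical helpers, not part of SageWorks, and it assumes the Athena table name matches the FeatureSet name, as in the query examples on this page.

```python
def preview_query(table_name: str, limit: int = 1000) -> str:
    """Build a SQL string that returns at most `limit` rows from a table."""
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return f"SELECT * FROM {table_name} LIMIT {limit}"


def pull_preview(feature_set, table_name: str, limit: int = 1000):
    """Run the bounded query through the object's query() method.

    `feature_set` is any object exposing query(str), as FeatureSet does.
    The table name is an assumption supplied by the caller.
    """
    return feature_set.query(preview_query(table_name, limit))
```

With a real FeatureSet this might be called as `pull_preview(FeatureSet("abalone_features"), "abalone_features", limit=100)`.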
query(query, **kwargs)
Query the AthenaSource
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | The query to run against the FeatureSet | required |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: The results of the query |
to_model(model_type=ModelType.UNKNOWN, model_class=None, name=None, tags=None, description=None, feature_list=None, target_column=None, **kwargs)
Create a Model from the FeatureSet
Args:
model_type (ModelType): The type of model to create (See sageworks.model.ModelType)
model_class (str): The model class to use for the model (e.g. "KNeighborsRegressor", default: None)
name (str): Set the name for the model. If not specified, a name will be generated.
tags (list): Set the tags for the model. If not specified, tags will be generated.
description (str): Set the description for the model. If not specified, a description will be generated.
feature_list (list): Set the feature list for the model. If not specified, a feature list will be generated.
target_column (str): The target column for the model (use None for an unsupervised model)
Returns:
Name | Type | Description |
---|---|---|
Model | Union[Model, None] | The Model created from the FeatureSet (or None if the Model could not be created) |
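Because `to_model()` can return None, calling code may want to check the result before using it. A minimal sketch, assuming only the documented return type; `to_model_or_raise` is a hypothetical wrapper, not a SageWorks function:

```python
def to_model_or_raise(feature_set, **to_model_kwargs):
    """Call to_model() and raise instead of passing a None result along.

    `feature_set` is any object exposing to_model(**kwargs), as FeatureSet does.
    """
    model = feature_set.to_model(**to_model_kwargs)
    if model is None:
        raise RuntimeError("Model creation failed: to_model() returned None")
    return model
```

Failing early here keeps a later call such as `model.details()` from raising a less informative `AttributeError` on None.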
Examples
All of the SageWorks Examples are in the SageWorks Repository under the examples/
directory. For a full code listing of any example, please visit our SageWorks Examples.
Create a FeatureSet from a DataSource
from sageworks.api.data_source import DataSource
# Convert the Data Source to a Feature Set
ds = DataSource('test_data')
fs = ds.to_features("test_features", id_column="id")
print(fs.details())
FeatureSet EDA Statistics
from sageworks.api.feature_set import FeatureSet
from pprint import pprint
import pandas as pd
# Grab a FeatureSet
my_features = FeatureSet('test_features')
# Pull the column correlations and view them as a DataFrame
corr_data = my_features.correlations()
corr_df = pd.DataFrame(corr_data)
print(corr_df)
# Get some outliers
outliers = my_features.outliers()
pprint(outliers.head())
# Full set of EDA Stats
eda_stats = my_features.column_stats()
pprint(eda_stats)
Output
age food_pizza food_steak food_sushi food_tacos height id iq_score
age NaN -0.188645 -0.256356 0.263048 0.054211 0.439678 -0.054948 -0.295513
food_pizza -0.188645 NaN -0.288175 -0.229591 -0.196818 -0.494380 0.137282 0.395378
food_steak -0.256356 -0.288175 NaN -0.374920 -0.321403 -0.002542 -0.005199 0.076477
food_sushi 0.263048 -0.229591 -0.374920 NaN -0.256064 0.536396 0.038279 -0.435033
food_tacos 0.054211 -0.196818 -0.321403 -0.256064 NaN -0.091493 -0.051398 0.033364
height 0.439678 -0.494380 -0.002542 0.536396 -0.091493 NaN -0.117372 -0.655210
id -0.054948 0.137282 -0.005199 0.038279 -0.051398 -0.117372 NaN 0.106020
iq_score -0.295513 0.395378 0.076477 -0.435033 0.033364 -0.655210 0.106020 NaN
name height weight salary age iq_score likes_dogs food_pizza food_steak food_sushi food_tacos outlier_group
0 Person 96 57.582840 148.461349 80000.000000 43 150.000000 1 0 0 0 0 height_low
1 Person 68 73.918663 189.527313 219994.000000 80 100.000000 0 0 0 1 0 iq_score_low
2 Person 49 70.381790 261.237000 175633.703125 49 107.933998 0 0 0 1 0 iq_score_low
3 Person 90 73.488739 193.840698 227760.000000 72 110.821541 1 0 0 0 0 salary_high
<lots of EDA data and statistics>
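A correlation matrix like the one above is often easier to read as a ranked list of column pairs. The following is a rough sketch in plain Python: `strongest_pairs` is a hypothetical helper, and it assumes the correlation data is a nested mapping of the form `{column: {other_column: value}}`, which is one shape that `pd.DataFrame(...)` accepts. The sample values are copied from a subset of the output above.

```python
import math


def strongest_pairs(corr, top_n=3):
    """Rank unique column pairs by absolute correlation.

    Self-pairs, mirrored duplicates, and NaN/None entries are skipped.
    """
    pairs = []
    seen = set()
    for col_a, row in corr.items():
        for col_b, value in row.items():
            key = frozenset((col_a, col_b))
            if col_a == col_b or key in seen:
                continue
            if value is None or math.isnan(value):
                continue
            seen.add(key)
            pairs.append((col_a, col_b, value))
    # Largest magnitude first, regardless of sign
    pairs.sort(key=lambda pair: abs(pair[2]), reverse=True)
    return pairs[:top_n]


# Subset of the correlation output shown above
sample = {
    "age": {"age": float("nan"), "height": 0.439678, "iq_score": -0.295513},
    "height": {"age": 0.439678, "height": float("nan"), "iq_score": -0.655210},
    "iq_score": {"age": -0.295513, "height": -0.655210, "iq_score": float("nan")},
}

top = strongest_pairs(sample, top_n=2)
for col_a, col_b, value in top:
    print(f"{col_a} vs {col_b}: {value:+.3f}")
```

For this subset the strongest relationship is the negative correlation between height and iq_score, followed by age and height.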
Query a FeatureSet
Every SageWorks FeatureSet has an 'offline' store that uses AWS Athena, so any query that you can run with Athena is accessible through the FeatureSet API.
from sageworks.api.feature_set import FeatureSet
# Grab a FeatureSet
my_features = FeatureSet("abalone_features")
# Make some queries using the Athena backend
df = my_features.query("select * from abalone_features where height > .3")
print(df.head())
df = my_features.query("select * from abalone_features where class_number_of_rings < 3")
print(df.head())
Output
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings
0 M 0.705 0.565 0.515 2.210 1.1075 0.4865 0.5120 10
1 F 0.455 0.355 1.130 0.594 0.3320 0.1160 0.1335 8
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings
0 I 0.075 0.055 0.010 0.002 0.0010 0.0005 0.0015 1
1 I 0.150 0.100 0.025 0.015 0.0045 0.0040 0.0050 2
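Since the queries are ordinary Athena SQL, aggregations can be used as well as row filters. Below is a small sketch that builds a GROUP BY query string; `mean_by_group_query` is a hypothetical helper, and the table and column names are taken from the abalone output above.

```python
def mean_by_group_query(table_name: str, group_column: str, value_column: str) -> str:
    """Build a query that averages one column per group and counts the rows."""
    return (
        f"SELECT {group_column}, "
        f"AVG({value_column}) AS avg_{value_column}, "
        f"COUNT(*) AS num_rows "
        f"FROM {table_name} "
        f"GROUP BY {group_column} "
        f"ORDER BY {group_column}"
    )


sql = mean_by_group_query("abalone_features", "sex", "height")
print(sql)
```

The resulting string can be passed to `my_features.query(sql)` in the same way as the filter queries above.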
Create a Model from a FeatureSet
from sageworks.api.feature_set import FeatureSet
from sageworks.api.model import ModelType
from pprint import pprint
# Grab a FeatureSet
my_features = FeatureSet('test_features')
# Create a Model from the FeatureSet
# Note: ModelTypes can be CLASSIFIER, REGRESSOR,
# UNSUPERVISED, or TRANSFORMER
my_model = my_features.to_model(model_type=ModelType.REGRESSOR,
                                target_column="iq_score")
pprint(my_model.details())
Output
{'approval_status': 'Approved',
'content_types': ['text/csv'],
...
'inference_types': ['ml.t2.medium'],
'input': 'test_features',
'model_metrics': metric_name value
0 RMSE 7.924
1 MAE 6.554
2 R2 0.604,
'regression_predictions': iq_score prediction
0 136.519012 139.964460
1 133.616974 130.819950
2 122.495415 124.967834
3 133.279510 121.010284
4 127.881073 113.825005
...
'response_types': ['text/csv'],
'sageworks_tags': ['test-model'],
'shapley_values': None,
'size': 0.0,
'status': 'Completed',
'transform_types': ['ml.m5.large'],
'uuid': 'test-model',
'version': 1}
SageWorks UI
Whenever a FeatureSet is created, SageWorks performs a comprehensive set of Exploratory Data Analysis techniques on your data, pushes the results into AWS, and provides a detailed web visualization of the results.
Not finding a particular method?
The SageWorks API classes use the 'Core' classes internally, so for an extensive listing of all the available methods, please take a deep dive into: SageWorks Core Classes