FeatureSet
FeatureSet Examples
Examples of using the FeatureSet Class are in the Examples section at the bottom of this page. AWS Feature Store and Feature Groups are quite complicated to set up manually but the Workbench FeatureSet makes it a breeze!
FeatureSet: Creates and manages AWS Feature Store/Feature Groups. FeatureSets are set up so they can easily be queried with AWS Athena. All FeatureSets are run through a full set of Exploratory Data Analysis (EDA) techniques (data quality, distributions, stats, outliers, etc.). FeatureSets can be viewed and explored within the Workbench Dashboard UI.
FeatureSet
Bases: FeatureSetCore
FeatureSet: Workbench FeatureSet API Class
Common Usage
Source code in src/workbench/api/feature_set.py
details(**kwargs)
FeatureSet Details
Returns:
Name | Type | Description |
---|---|---|
dict | dict | A dictionary of details about the FeatureSet |
pull_dataframe(include_aws_columns=False)
Return a DataFrame of ALL the data from this FeatureSet
Parameters:
Name | Type | Description | Default |
---|---|---|---|
include_aws_columns | bool | Include the AWS columns in the DataFrame (default: False) | False |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: A DataFrame of ALL the data from this FeatureSet |
Note
Obviously this is not recommended for large datasets :)
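For larger FeatureSets, one alternative is to pull a bounded sample through `query()` rather than the whole table. The following is a minimal sketch, assuming the Athena table name matches the FeatureSet name (as it does in the query examples on this page). `sample_query` is a hypothetical helper, not part of the Workbench API.

```python
# Hypothetical helper (not part of the Workbench API): build a bounded
# "select *" query so that at most `limit` rows are returned.
def sample_query(table: str, limit: int = 1000) -> str:
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return f"select * from {table} limit {int(limit)}"


sql = sample_query("abalone_features", limit=100)
print(sql)

# Usage with a real FeatureSet (needs AWS access, so left as a comment):
#   df = FeatureSet("abalone_features").query(sql)
```

Without an ORDER BY clause, the rows returned by a LIMIT query are arbitrary, so this gives a quick look at the data rather than a reproducible sample.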
query(query, **kwargs)
Query the AthenaSource
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | The query to run against the FeatureSet | required |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: The results of the query |
to_model(name, model_type, tags=None, description=None, feature_list=None, target_column=None, scikit_model_class=None, model_import_str=None, **kwargs)
Create a Model from the FeatureSet
Args:
name (str): The name of the Model to create
model_type (ModelType): The type of model to create (See workbench.model.ModelType)
tags (list, optional): Set the tags for the model. If not given, tags will be generated.
description (str, optional): Set the description for the model. If not given, a description is generated.
feature_list (list, optional): Set the feature list for the model. If not given, a feature list is generated.
target_column (str, optional): The target column for the model (use None for unsupervised model)
scikit_model_class (str, optional): Scikit model class to use (e.g. "KMeans", default: None)
model_import_str (str, optional): The import for the model (e.g. "from sklearn.cluster import KMeans")
Returns:
Name | Type | Description |
---|---|---|
Model | Union[Model, None] | The Model created from the FeatureSet (or None if the Model could not be created) |
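Because `to_model()` may return None, callers may want to check the result before using it. The sketch below wraps the call in a hypothetical helper, `create_model_or_raise`, and passes the unsupervised arguments described in the parameter list (`target_column=None` plus a scikit-learn class). The helper accepts any object with a `to_model` method, so a stand-in class replaces a real FeatureSet here. The model name `test-clusters` and the pairing of an unsupervised model type with KMeans are assumptions drawn from the parameter descriptions, not a confirmed recipe.

```python
# Hypothetical helper (not part of the Workbench API): fail loudly when
# to_model() returns None instead of passing None downstream.
def create_model_or_raise(feature_set, **to_model_kwargs):
    model = feature_set.to_model(**to_model_kwargs)
    if model is None:
        name = to_model_kwargs.get("name")
        raise RuntimeError(f"Model {name!r} could not be created")
    return model


# Arguments for an unsupervised model, following the parameter descriptions.
# "UNSUPERVISED" is a placeholder string; real code would pass
# ModelType.UNSUPERVISED.
kmeans_kwargs = {
    "name": "test-clusters",
    "model_type": "UNSUPERVISED",
    "target_column": None,
    "scikit_model_class": "KMeans",
    "model_import_str": "from sklearn.cluster import KMeans",
}


class FakeFeatureSet:
    """Stand-in for a FeatureSet so the sketch runs without AWS."""

    def __init__(self, succeed):
        self.succeed = succeed

    def to_model(self, **kwargs):
        # Echo the arguments back on success, mimic a failed build otherwise
        return dict(kwargs) if self.succeed else None


model = create_model_or_raise(FakeFeatureSet(succeed=True), **kmeans_kwargs)
print(model["name"])
```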
Examples
All of the Workbench Examples are in the Workbench Repository under the examples/
directory. For a full code listing of any example, please visit our Workbench Examples.
Create a FeatureSet from a DataSource
from workbench.api.data_source import DataSource
# Convert the Data Source to a Feature Set
ds = DataSource('test_data')
fs = ds.to_features("test_features", id_column="id")
print(fs.details())
FeatureSet EDA Statistics
from pprint import pprint
import pandas as pd
from workbench.api.feature_set import FeatureSet
# Grab a FeatureSet
my_features = FeatureSet('test_features')
# Pull the column correlations and view them as a DataFrame
corr_data = my_features.correlations()
corr_df = pd.DataFrame(corr_data)
print(corr_df)
# Get some outliers
outliers = my_features.outliers()
pprint(outliers.head())
# Full set of EDA Stats
eda_stats = my_features.column_stats()
pprint(eda_stats)
Output
age food_pizza food_steak food_sushi food_tacos height id iq_score
age NaN -0.188645 -0.256356 0.263048 0.054211 0.439678 -0.054948 -0.295513
food_pizza -0.188645 NaN -0.288175 -0.229591 -0.196818 -0.494380 0.137282 0.395378
food_steak -0.256356 -0.288175 NaN -0.374920 -0.321403 -0.002542 -0.005199 0.076477
food_sushi 0.263048 -0.229591 -0.374920 NaN -0.256064 0.536396 0.038279 -0.435033
food_tacos 0.054211 -0.196818 -0.321403 -0.256064 NaN -0.091493 -0.051398 0.033364
height 0.439678 -0.494380 -0.002542 0.536396 -0.091493 NaN -0.117372 -0.655210
id -0.054948 0.137282 -0.005199 0.038279 -0.051398 -0.117372 NaN 0.106020
iq_score -0.295513 0.395378 0.076477 -0.435033 0.033364 -0.655210 0.106020 NaN
name height weight salary age iq_score likes_dogs food_pizza food_steak food_sushi food_tacos outlier_group
0 Person 96 57.582840 148.461349 80000.000000 43 150.000000 1 0 0 0 0 height_low
1 Person 68 73.918663 189.527313 219994.000000 80 100.000000 0 0 0 1 0 iq_score_low
2 Person 49 70.381790 261.237000 175633.703125 49 107.933998 0 0 0 1 0 iq_score_low
3 Person 90 73.488739 193.840698 227760.000000 72 110.821541 1 0 0 0 0 salary_high
<lots of EDA data and statistics>
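The correlation matrix can be hard to scan by eye, so a small helper that ranks column pairs by absolute correlation may be useful. This is a minimal sketch, assuming `correlations()` returns a nested dictionary of the form `{column: {column: value}}` (inferred from the `pd.DataFrame(corr_data)` call in the example, not confirmed). `strongest_correlations` is a hypothetical helper, and the sample values are copied from the output on this page.

```python
import math
from itertools import combinations


def strongest_correlations(corr, top_n=3):
    """Return the top_n (col_a, col_b, value) tuples ranked by |value|."""
    pairs = []
    # Visit each unordered pair once, which also skips the NaN diagonal
    for a, b in combinations(sorted(corr), 2):
        value = corr.get(a, {}).get(b)
        if value is None or math.isnan(value):
            continue
        pairs.append((a, b, value))
    pairs.sort(key=lambda pair: abs(pair[2]), reverse=True)
    return pairs[:top_n]


# Subset of the correlation output shown on this page
sample = {
    "age": {"age": float("nan"), "height": 0.439678, "iq_score": -0.295513},
    "height": {"age": 0.439678, "height": float("nan"), "iq_score": -0.655210},
    "iq_score": {"age": -0.295513, "height": -0.655210, "iq_score": float("nan")},
}

top = strongest_correlations(sample, top_n=2)
for col_a, col_b, value in top:
    print(f"{col_a} vs {col_b}: {value:+.3f}")
```

With a real FeatureSet, the call would be `strongest_correlations(my_features.correlations())`, provided the return value has the assumed shape.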
Query a FeatureSet
All Workbench FeatureSets have an 'offline' store that uses AWS Athena, so any query that you can make with Athena is accessible through the FeatureSet API.
from workbench.api.feature_set import FeatureSet
# Grab a FeatureSet
my_features = FeatureSet("abalone_features")
# Make some queries using the Athena backend
df = my_features.query("select * from abalone_features where height > .3")
print(df.head())
df = my_features.query("select * from abalone_features where class_number_of_rings < 3")
print(df.head())
Output
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings
0 M 0.705 0.565 0.515 2.210 1.1075 0.4865 0.5120 10
1 F 0.455 0.355 1.130 0.594 0.3320 0.1160 0.1335 8
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings
0 I 0.075 0.055 0.010 0.002 0.0010 0.0005 0.0015 1
1 I 0.150 0.100 0.025 0.015 0.0045 0.0040 0.0050 2
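When the filter column or threshold comes from a variable rather than a hard-coded string, it can help to validate the pieces before assembling the SQL. The sketch below is one possible approach, not a Workbench feature: `query_where` is a hypothetical helper that accepts any object with a `query(str)` method, and a recording stand-in is used so the example runs without AWS.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_OPERATORS = {"<", "<=", ">", ">=", "=", "<>"}


def query_where(feature_set, table, column, op, value):
    """Run 'select * from <table> where <column> <op> <value>' (numeric value)."""
    if not _IDENTIFIER.fullmatch(table) or not _IDENTIFIER.fullmatch(column):
        raise ValueError("table and column must be plain SQL identifiers")
    if op not in _OPERATORS:
        raise ValueError(f"unsupported operator: {op!r}")
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError("value must be numeric")
    return feature_set.query(f"select * from {table} where {column} {op} {value}")


class RecordingFeatureSet:
    """Stand-in that records queries instead of sending them to Athena."""

    def __init__(self):
        self.queries = []

    def query(self, query):
        self.queries.append(query)
        return []  # a real FeatureSet would return a pandas DataFrame


fs = RecordingFeatureSet()
query_where(fs, "abalone_features", "height", ">", 0.3)
query_where(fs, "abalone_features", "class_number_of_rings", "<", 3)
print(fs.queries)
```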
Create a Model from a FeatureSet
from workbench.api.feature_set import FeatureSet
from workbench.api.model import ModelType
from pprint import pprint
# Grab a FeatureSet
my_features = FeatureSet('test_features')
# Create a Model from the FeatureSet
# Note: ModelTypes can be CLASSIFIER, REGRESSOR,
# UNSUPERVISED, or TRANSFORMER
my_model = my_features.to_model(name="test-model",
                                model_type=ModelType.REGRESSOR,
                                target_column="iq_score")
pprint(my_model.details())
Output
{'approval_status': 'Approved',
'content_types': ['text/csv'],
...
'inference_types': ['ml.t2.medium'],
'input': 'test_features',
'model_metrics': metric_name value
0 RMSE 7.924
1 MAE 6.554,
2 R2 0.604,
'regression_predictions': iq_score prediction
0 136.519012 139.964460
1 133.616974 130.819950
2 122.495415 124.967834
3 133.279510 121.010284
4 127.881073 113.825005
...
'response_types': ['text/csv'],
'workbench_tags': ['test-model'],
'shapley_values': None,
'size': 0.0,
'status': 'Completed',
'transform_types': ['ml.m5.large'],
'uuid': 'test-model',
'version': 1}
Workbench UI
Whenever a FeatureSet is created, Workbench performs a comprehensive set of Exploratory Data Analysis techniques on your data, pushes the results into AWS, and provides a detailed web visualization of the results.
Not Finding a particular method?
The Workbench API Classes use the 'Core' Classes internally, so for an extensive listing of all the available methods, please take a deep dive into: Workbench Core Classes