FeatureSet

FeatureSet Examples

Examples of using the FeatureSet class are in the Examples section at the bottom of this page. AWS Feature Store and Feature Groups are quite complicated to set up manually, but the SageWorks FeatureSet makes it a breeze!

FeatureSet: Manages the creation and upkeep of AWS Feature Store/Feature Groups. FeatureSets are set up so they can easily be queried with AWS Athena. All FeatureSets are run through a full set of Exploratory Data Analysis (EDA) techniques (data quality, distributions, stats, outliers, etc.). FeatureSets can be viewed and explored within the SageWorks Dashboard UI.

`FeatureSet`

Bases: FeatureSetCore

FeatureSet: SageWorks FeatureSet API Class

Common Usage

my_features = FeatureSet(name)
my_features.details()
my_features.to_model(
    ModelType.REGRESSOR,
    name="abalone-regression",
    target_column="class_number_of_rings",
    feature_list=["my", "best", "features"],
)

Source code in src/sageworks/api/feature_set.py

class FeatureSet(FeatureSetCore):
    """FeatureSet: SageWorks FeatureSet API Class

    Common Usage:
        ```python
        my_features = FeatureSet(name)
        my_features.details()
        my_features.to_model(
            ModelType.REGRESSOR,
            name="abalone-regression",
            target_column="class_number_of_rings",
            feature_list=["my", "best", "features"],
        )
        ```
    """

    def details(self, **kwargs) -> dict:
        """FeatureSet Details

        Returns:
            dict: A dictionary of details about the FeatureSet
        """
        return super().details(**kwargs)

    def query(self, query: str, **kwargs) -> pd.DataFrame:
        """Query the AthenaSource

        Args:
            query (str): The query to run against the FeatureSet

        Returns:
            pd.DataFrame: The results of the query
        """
        return super().query(query, **kwargs)

    def pull_dataframe(self, include_aws_columns=False) -> pd.DataFrame:
        """Return a DataFrame of ALL the data from this FeatureSet

        Args:
            include_aws_columns (bool): Include the AWS columns in the DataFrame (default: False)

        Returns:
            pd.DataFrame: A DataFrame of ALL the data from this FeatureSet

        Note:
            Obviously this is not recommended for large datasets :)
        """

        # Get the table associated with the data
        self.log.info(f"Pulling all data from {self.uuid}...")
        query = f"SELECT * FROM {self.athena_table}"
        df = self.query(query)

        # Drop any columns generated from AWS
        if not include_aws_columns:
            aws_cols = ["write_time", "api_invocation_time", "is_deleted", "event_time"]
            df = df.drop(columns=aws_cols, errors="ignore")
        return df

    def to_model(
        self,
        model_type: ModelType = ModelType.UNKNOWN,
        model_class: str = None,
        name: str = None,
        tags: list = None,
        description: str = None,
        feature_list: list = None,
        target_column: str = None,
        **kwargs,
    ) -> Union[Model, None]:
        """Create a Model from the FeatureSet

        Args:

            model_type (ModelType): The type of model to create (See sageworks.model.ModelType)
            model_class (str): The model class to use for the model (e.g. "KNeighborsRegressor", default: None)
            name (str): Set the name for the model. If not specified, a name will be generated
            tags (list): Set the tags for the model.  If not specified tags will be generated.
            description (str): Set the description for the model. If not specified a description is generated.
            feature_list (list): Set the feature list for the model. If not specified a feature list is generated.
            target_column (str): The target column for the model (use None for unsupervised model)

        Returns:
            Model: The Model created from the FeatureSet (or None if the Model could not be created)
        """

        # Ensure the model_name is valid
        if name:
            if not Artifact.is_name_valid(name, delimiter="-", lower_case=False):
                self.log.critical(f"Invalid Model name: {name}, not creating Model!")
                return None

        # If the model_name wasn't given generate it
        else:
            name = self.uuid.replace("_features", "") + "-model"
            name = Artifact.generate_valid_name(name, delimiter="-")

        # Create the Model Tags
        tags = [name] if tags is None else tags

        # Transform the FeatureSet into a Model
        features_to_model = FeaturesToModel(self.uuid, name, model_type=model_type, model_class=model_class)
        features_to_model.set_output_tags(tags)
        features_to_model.transform(
            target_column=target_column, description=description, feature_list=feature_list, **kwargs
        )

        # Return the Model
        return Model(name)

`details(**kwargs)`

FeatureSet Details

Returns:

Name	Type	Description
`dict`	`dict`	A dictionary of details about the FeatureSet

Source code in src/sageworks/api/feature_set.py

def details(self, **kwargs) -> dict:
    """FeatureSet Details

    Returns:
        dict: A dictionary of details about the FeatureSet
    """
    return super().details(**kwargs)

`pull_dataframe(include_aws_columns=False)`

Return a DataFrame of ALL the data from this FeatureSet

Parameters:

Name	Type	Description	Default
`include_aws_columns`	`bool`	Include the AWS columns in the DataFrame (default: False)	`False`

Returns:

Type	Description
`DataFrame`	A DataFrame of ALL the data from this FeatureSet

Note

Obviously this is not recommended for large datasets :)
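Because `pull_dataframe()` runs a `SELECT *` over the whole table, a bounded query passed to `query()` is usually a lighter option for large FeatureSets. The sketch below only builds the SQL string. `build_sample_query` is a hypothetical helper and not part of SageWorks, and the table and column names are placeholders.

```python
def build_sample_query(table, columns=None, limit=1000):
    """Build a bounded SELECT statement (hypothetical helper, not a SageWorks API)."""
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    # Select only the requested columns, or everything when none are given
    column_sql = ", ".join(columns) if columns else "*"
    return f"SELECT {column_sql} FROM {table} LIMIT {int(limit)}"


sample_sql = build_sample_query("abalone_features", ["length", "height"], limit=100)
print(sample_sql)

# With a live FeatureSet the string could then be passed along, for example:
#   df = my_features.query(sample_sql)
```

Restricting both the columns and the row count keeps the result small enough to inspect in memory.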

Source code in src/sageworks/api/feature_set.py

def pull_dataframe(self, include_aws_columns=False) -> pd.DataFrame:
    """Return a DataFrame of ALL the data from this FeatureSet

    Args:
        include_aws_columns (bool): Include the AWS columns in the DataFrame (default: False)

    Returns:
        pd.DataFrame: A DataFrame of ALL the data from this FeatureSet

    Note:
        Obviously this is not recommended for large datasets :)
    """

    # Get the table associated with the data
    self.log.info(f"Pulling all data from {self.uuid}...")
    query = f"SELECT * FROM {self.athena_table}"
    df = self.query(query)

    # Drop any columns generated from AWS
    if not include_aws_columns:
        aws_cols = ["write_time", "api_invocation_time", "is_deleted", "event_time"]
        df = df.drop(columns=aws_cols, errors="ignore")
    return df

`query(query, **kwargs)`

Query the AthenaSource

Parameters:

Name	Type	Description	Default
`query`	`str`	The query to run against the FeatureSet	required

Returns:

Type	Description
`DataFrame`	The results of the query

Source code in src/sageworks/api/feature_set.py

def query(self, query: str, **kwargs) -> pd.DataFrame:
    """Query the AthenaSource

    Args:
        query (str): The query to run against the FeatureSet

    Returns:
        pd.DataFrame: The results of the query
    """
    return super().query(query, **kwargs)

`to_model(model_type=ModelType.UNKNOWN, model_class=None, name=None, tags=None, description=None, feature_list=None, target_column=None, **kwargs)`

Create a Model from the FeatureSet

Parameters:

Name	Type	Description	Default
`model_type`	`ModelType`	The type of model to create (See sageworks.model.ModelType)	`ModelType.UNKNOWN`
`model_class`	`str`	The model class to use for the model (e.g. "KNeighborsRegressor")	`None`
`name`	`str`	Set the name for the model. If not specified, a name will be generated.	`None`
`tags`	`list`	Set the tags for the model. If not specified, tags will be generated.	`None`
`description`	`str`	Set the description for the model. If not specified, a description is generated.	`None`
`feature_list`	`list`	Set the feature list for the model. If not specified, a feature list is generated.	`None`
`target_column`	`str`	The target column for the model (use None for unsupervised model)	`None`

Returns:

Name	Type	Description
`Model`	`Union[Model, None]`	The Model created from the FeatureSet (or None if the Model could not be created)
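When `name` and `tags` are omitted, the source listing for this method derives them from the FeatureSet's uuid. The sketch below mirrors that logic in plain Python so it can be run without AWS. The regular expression is only a stand-in for `Artifact.generate_valid_name(name, delimiter="-")`, whose actual rules are not shown on this page, and both function names are hypothetical.

```python
import re


def default_model_name(feature_set_uuid: str) -> str:
    """Mirror the default naming step: strip '_features', append '-model'."""
    base = feature_set_uuid.replace("_features", "") + "-model"
    # Assumption: stand-in for Artifact.generate_valid_name(); real rules may differ
    return re.sub(r"[^A-Za-z0-9]+", "-", base).strip("-")


def default_tags(name: str, tags=None) -> list:
    """Mirror the tag default: a single tag equal to the model name."""
    return [name] if tags is None else tags


model_name = default_model_name("test_features")
print(model_name)                # test-model
print(default_tags(model_name))  # ['test-model']
```

This lines up with the `'uuid': 'test-model'` and `'sageworks_tags': ['test-model']` entries in the model details output shown in the Examples section.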

Source code in src/sageworks/api/feature_set.py

def to_model(
    self,
    model_type: ModelType = ModelType.UNKNOWN,
    model_class: str = None,
    name: str = None,
    tags: list = None,
    description: str = None,
    feature_list: list = None,
    target_column: str = None,
    **kwargs,
) -> Union[Model, None]:
    """Create a Model from the FeatureSet

    Args:

        model_type (ModelType): The type of model to create (See sageworks.model.ModelType)
        model_class (str): The model class to use for the model (e.g. "KNeighborsRegressor", default: None)
        name (str): Set the name for the model. If not specified, a name will be generated
        tags (list): Set the tags for the model.  If not specified tags will be generated.
        description (str): Set the description for the model. If not specified a description is generated.
        feature_list (list): Set the feature list for the model. If not specified a feature list is generated.
        target_column (str): The target column for the model (use None for unsupervised model)

    Returns:
        Model: The Model created from the FeatureSet (or None if the Model could not be created)
    """

    # Ensure the model_name is valid
    if name:
        if not Artifact.is_name_valid(name, delimiter="-", lower_case=False):
            self.log.critical(f"Invalid Model name: {name}, not creating Model!")
            return None

    # If the model_name wasn't given generate it
    else:
        name = self.uuid.replace("_features", "") + "-model"
        name = Artifact.generate_valid_name(name, delimiter="-")

    # Create the Model Tags
    tags = [name] if tags is None else tags

    # Transform the FeatureSet into a Model
    features_to_model = FeaturesToModel(self.uuid, name, model_type=model_type, model_class=model_class)
    features_to_model.set_output_tags(tags)
    features_to_model.transform(
        target_column=target_column, description=description, feature_list=feature_list, **kwargs
    )

    # Return the Model
    return Model(name)

Examples

All of the SageWorks Examples are in the SageWorks Repository under the examples/ directory. For a full code listing of any example, please visit our SageWorks Examples.

Create a FeatureSet from a DataSource

datasource_to_featureset.py

from sageworks.api.data_source import DataSource

# Convert the Data Source to a Feature Set
ds = DataSource('test_data')
fs = ds.to_features("test_features", id_column="id")
print(fs.details())

FeatureSet EDA Statistics

featureset_eda.py

from pprint import pprint

import pandas as pd
from sageworks.api.feature_set import FeatureSet

# Grab a FeatureSet and pull some of the EDA Stats
my_features = FeatureSet('test_features')

# Grab some of the EDA Stats
corr_data = my_features.correlations()
corr_df = pd.DataFrame(corr_data)
print(corr_df)

# Get some outliers
outliers = my_features.outliers()
pprint(outliers.head())

# Full set of EDA Stats
eda_stats = my_features.column_stats()
pprint(eda_stats)

Output

                 age  food_pizza  food_steak  food_sushi  food_tacos    height        id  iq_score
age              NaN   -0.188645   -0.256356    0.263048    0.054211  0.439678 -0.054948 -0.295513
food_pizza -0.188645         NaN   -0.288175   -0.229591   -0.196818 -0.494380  0.137282  0.395378
food_steak -0.256356   -0.288175         NaN   -0.374920   -0.321403 -0.002542 -0.005199  0.076477
food_sushi  0.263048   -0.229591   -0.374920         NaN   -0.256064  0.536396  0.038279 -0.435033
food_tacos  0.054211   -0.196818   -0.321403   -0.256064         NaN -0.091493 -0.051398  0.033364
height      0.439678   -0.494380   -0.002542    0.536396   -0.091493       NaN -0.117372 -0.655210
id         -0.054948    0.137282   -0.005199    0.038279   -0.051398 -0.117372       NaN  0.106020
iq_score   -0.295513    0.395378    0.076477   -0.435033    0.033364 -0.655210  0.106020       NaN

        name     height      weight         salary  age    iq_score  likes_dogs  food_pizza  food_steak  food_sushi  food_tacos outlier_group
0  Person 96  57.582840  148.461349   80000.000000   43  150.000000           1           0           0           0           0    height_low
1  Person 68  73.918663  189.527313  219994.000000   80  100.000000           0           0           0           1           0  iq_score_low
2  Person 49  70.381790  261.237000  175633.703125   49  107.933998           0           0           0           1           0  iq_score_low
3  Person 90  73.488739  193.840698  227760.000000   72  110.821541           1           0           0           0           0   salary_high

<lots of EDA data and statistics>
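A correlation matrix like the one printed above is easier to scan once it is flattened into ranked column pairs. The sketch below assumes `correlations()` returns a nested mapping of the form `{column: {column: value}}` with NaN on the diagonal, which is what the `pd.DataFrame(corr_data)` call and the output suggest but is not confirmed here. `strongest_pairs` is a hypothetical helper, and the values are a three-column subset copied from the output.

```python
import math


def strongest_pairs(corr, top_n=3):
    """Rank unique column pairs by absolute correlation (hypothetical helper)."""
    pairs = {}
    for col_a, row in corr.items():
        for col_b, value in row.items():
            # Skip the diagonal and any missing values
            if col_a == col_b or value is None or math.isnan(value):
                continue
            # Sort the pair so (a, b) and (b, a) collapse into one entry
            pairs[tuple(sorted((col_a, col_b)))] = value
    ranked = sorted(pairs.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:top_n]


nan = float("nan")
corr_data = {
    "age": {"age": nan, "height": 0.439678, "iq_score": -0.295513},
    "height": {"age": 0.439678, "height": nan, "iq_score": -0.655210},
    "iq_score": {"age": -0.295513, "height": -0.655210, "iq_score": nan},
}

top = strongest_pairs(corr_data)
for (col_a, col_b), value in top:
    print(f"{col_a} / {col_b}: {value:+.3f}")
```

Ranking by absolute value keeps strong negative relationships, such as height versus iq_score in this test data, at the top of the list.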

Query a FeatureSet

All SageWorks FeatureSets have an 'offline' store that uses AWS Athena, so any query that you can make with Athena is accessible through the FeatureSet API.

featureset_query.py

from sageworks.api.feature_set import FeatureSet

# Grab a FeatureSet
my_features = FeatureSet("abalone_features")

# Make some queries using the Athena backend
df = my_features.query("select * from abalone_features where height > .3")
print(df.head())

df = my_features.query("select * from abalone_features where class_number_of_rings < 3")
print(df.head())

Output

  sex  length  diameter  height  whole_weight  shucked_weight  viscera_weight  shell_weight  class_number_of_rings
0   M   0.705     0.565   0.515         2.210          1.1075          0.4865        0.5120                     10
1   F   0.455     0.355   1.130         0.594          0.3320          0.1160        0.1335                      8

  sex  length  diameter  height  whole_weight  shucked_weight  viscera_weight  shell_weight  class_number_of_rings
0   I   0.075     0.055   0.010         0.002          0.0010          0.0005        0.0015                      1
1   I   0.150     0.100   0.025         0.015          0.0045          0.0040         0.0050                      2
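To try the shape of these queries without an AWS account, the same SQL can be run against a local stand-in. The sketch below swaps Athena for Python's built-in `sqlite3` module and loads a subset of the columns from the four rows shown in the output above. SQLite's dialect differs from Athena's, so this is only a rough rehearsal for simple `SELECT ... WHERE` statements, not a substitute for the FeatureSet API.

```python
import sqlite3

# In-memory database standing in for the Athena-backed offline store
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE abalone_features ("
    "sex TEXT, length REAL, diameter REAL, height REAL, class_number_of_rings INTEGER)"
)

# Rows copied from the example output (subset of columns)
rows = [
    ("M", 0.705, 0.565, 0.515, 10),
    ("F", 0.455, 0.355, 1.130, 8),
    ("I", 0.075, 0.055, 0.010, 1),
    ("I", 0.150, 0.100, 0.025, 2),
]
conn.executemany("INSERT INTO abalone_features VALUES (?, ?, ?, ?, ?)", rows)

# The same filters used in featureset_query.py
tall = conn.execute("SELECT * FROM abalone_features WHERE height > 0.3").fetchall()
young = conn.execute(
    "SELECT * FROM abalone_features WHERE class_number_of_rings < 3"
).fetchall()

print(tall)
print(young)
conn.close()
```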

Create a Model from a FeatureSet

featureset_to_model.py

from sageworks.api.feature_set import FeatureSet
from sageworks.api.model import ModelType
from pprint import pprint

# Grab a FeatureSet
my_features = FeatureSet('test_features')

# Create a Model from the FeatureSet
# Note: ModelTypes can be CLASSIFIER, REGRESSOR, 
#       UNSUPERVISED, or TRANSFORMER
my_model = my_features.to_model(model_type=ModelType.REGRESSOR, 
                                target_column="iq_score")
pprint(my_model.details())

Output

{'approval_status': 'Approved',
 'content_types': ['text/csv'],
 ...
 'inference_types': ['ml.t2.medium'],
 'input': 'test_features',
 'model_metrics':   metric_name  value
                0        RMSE  7.924
                1         MAE  6.554
                2          R2  0.604,
 'regression_predictions':       iq_score  prediction
                            0   136.519012  139.964460
                            1   133.616974  130.819950
                            2   122.495415  124.967834
                            3   133.279510  121.010284
                            4   127.881073  113.825005
    ...
 'response_types': ['text/csv'],
 'sageworks_tags': ['test-model'],
 'shapley_values': None,
 'size': 0.0,
 'status': 'Completed',
 'transform_types': ['ml.m5.large'],
 'uuid': 'test-model',
 'version': 1}
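For readers unfamiliar with the `model_metrics` entries, the sketch below computes RMSE and MAE by hand using their standard definitions. It uses only the five `regression_predictions` rows visible in the output, so the numbers it prints will not match the reported metrics, which were presumably computed over the full prediction set. The function names are illustrative and not part of SageWorks.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of the absolute prediction errors."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared prediction error."""
    squared = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# The five rows shown under 'regression_predictions'
iq_score = [136.519012, 133.616974, 122.495415, 133.279510, 127.881073]
prediction = [139.964460, 130.819950, 124.967834, 121.010284, 113.825005]

mae = mean_absolute_error(iq_score, prediction)
rmse = root_mean_squared_error(iq_score, prediction)
print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
```

RMSE penalizes large misses more heavily than MAE, which is why it is the larger of the two on this sample.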

SageWorks UI

Whenever a FeatureSet is created, SageWorks performs a comprehensive set of Exploratory Data Analysis techniques on your data, pushes the results into AWS, and provides a detailed web visualization of the results.

Figure: SageWorks Dashboard: FeatureSets

Not Finding a particular method?

The SageWorks API Classes use the 'Core' Classes internally, so for an extensive listing of all the methods available, please take a deep dive into: SageWorks Core Classes.