Features To Model

API Classes

For most users, the API Classes provide all the general functionality needed to create a full AWS ML Pipeline.

FeaturesToModel: Train/Create a Model from a Feature Set

`FeaturesToModel`

Bases: Transform

FeaturesToModel: Train/Create a Model from a FeatureSet

Common Usage

from workbench.api import ModelType
from workbench.core.transforms.features_to_model.features_to_model import FeaturesToModel

# feature_name: an existing FeatureSet; model_name: the Model to create
to_model = FeaturesToModel(feature_name, model_name, model_type=ModelType.REGRESSOR)  # or ModelType.CLASSIFIER, etc.
to_model.set_output_tags(["abalone", "public", "whatever"])
to_model.transform(target_column="class_number_of_rings",
                   feature_list=["my", "best", "features"])

Source code in src/workbench/core/transforms/features_to_model/features_to_model.py

class FeaturesToModel(Transform):
    """FeaturesToModel: Train/Create a Model from a FeatureSet

    Common Usage:
        ```python
        from workbench.core.transforms.features_to_model.features_to_model import FeaturesToModel
        to_model = FeaturesToModel(feature_name, model_name, model_type=ModelType)
        to_model.set_output_tags(["abalone", "public", "whatever"])
        to_model.transform(target_column="class_number_of_rings",
                           feature_list=["my", "best", "features"])
        ```
    """

    def __init__(
        self,
        feature_name: str,
        model_name: str,
        model_type: ModelType,
        model_class=None,
        model_import_str=None,
        custom_script=None,
        custom_args=None,
        training_image="training",
        inference_image="inference",
        inference_arch="x86_64",
    ):
        """FeaturesToModel Initialization
        Args:
            feature_name (str): Name of the FeatureSet to use as input
            model_name (str): Name of the Model to create as output
            model_type (ModelType): ModelType.REGRESSOR or ModelType.CLASSIFIER, etc.
            model_class (str, optional): The scikit model (e.g. KNeighborsRegressor) (default None)
            model_import_str (str, optional): The import string for the model (default None)
            custom_script (str, optional): Custom script to use for the model (default None)
            custom_args (dict, optional): Custom arguments to pass to custom model scripts (default None)
            training_image (str, optional): Training image (default "training")
            inference_image (str, optional): Inference image (default "inference")
            inference_arch (str, optional): Inference architecture (default "x86_64")
        """

        # Make sure the model_name is a valid name
        Artifact.is_name_valid(model_name, delimiter="-", lower_case=False)

        # Call superclass init
        super().__init__(feature_name, model_name)

        # Set up all my instance attributes
        self.input_type = TransformInput.FEATURE_SET
        self.output_type = TransformOutput.MODEL
        self.model_type = model_type
        self.model_class = model_class
        self.model_import_str = model_import_str
        self.custom_script = str(custom_script) if custom_script else None
        self.custom_args = custom_args if custom_args else {}
        self.estimator = None
        self.model_description = None
        self.model_training_root = f"{self.models_s3_path}/{self.output_name}/training"
        self.model_feature_list = None
        self.target_column = None
        self.class_labels = None
        self.training_image = training_image
        self.inference_image = inference_image
        self.inference_arch = inference_arch

    def transform_impl(
        self, target_column: str, description: str = None, feature_list: list = None, train_all_data=False, **kwargs
    ):
        """Generic Features to Model: Note you should create a new class and inherit from
        this one to include specific logic for your Feature Set/Model
        Args:
            target_column (str): Column name of the target variable
            description (str): Description of the model (optional)
            feature_list (list[str]): A list of columns for the features (default None, will try to guess)
            train_all_data (bool): Train on ALL (100%) of the data (default False)
        """
        # Delete the existing model (if it exists)
        self.log.important(f"Trying to delete existing model {self.output_name}...")
        ModelCore.managed_delete(self.output_name)

        # Set our model description
        self.model_description = description if description is not None else f"Model created from {self.input_name}"

        # Get our Feature Set and create an S3 CSV Training dataset
        feature_set = FeatureSetCore(self.input_name)
        s3_training_path = feature_set.create_s3_training_data()
        self.log.info(f"Created new training data {s3_training_path}...")

        # Report the target column
        self.target_column = target_column
        self.log.info(f"Target column: {self.target_column}")

        # Did they specify a feature list?
        if feature_list:
            # AWS Feature Groups will also add these implicit columns, so remove them
            aws_cols = ["write_time", "api_invocation_time", "is_deleted", "event_time", "training"]
            feature_list = [c for c in feature_list if c not in aws_cols]

        # If they didn't specify a feature list, try to guess it
        else:
            # Try to figure out features with this logic
            # - Don't include id, event_time, __index_level_0__, or training columns
            # - Don't include AWS generated columns (e.g. write_time, api_invocation_time, is_deleted)
            # - Don't include the target columns
            # - Don't include any columns that are of type string or timestamp
            # - The rest of the columns are assumed to be features
            self.log.warning("Guessing at the feature list, HIGHLY RECOMMENDED to specify an explicit feature list!")
            all_columns = feature_set.columns
            filter_list = [
                "id",
                "auto_id",
                "__index_level_0__",
                "write_time",
                "api_invocation_time",
                "is_deleted",
                "event_time",
                "training",
            ] + [self.target_column]
            feature_list = [c for c in all_columns if c not in filter_list]

            # AWS Feature Store has 3 user column types (String, Integral, Fractional)
            # and two internal types (Timestamp and Boolean). A Feature List for
            # modeling can only contain Integral and Fractional types.
            remove_columns = []
            column_details = feature_set.column_details()
            for column_name in feature_list:
                if column_details[column_name] not in ["Integral", "Fractional"]:
                    self.log.warning(
                        f"Removing {column_name} from feature list, improper type {column_details[column_name]}"
                    )
                    remove_columns.append(column_name)

            # Remove the columns that are not Integral or Fractional
            feature_list = [c for c in feature_list if c not in remove_columns]

        # Set the final feature list
        self.model_feature_list = feature_list
        self.log.important(f"Feature List for Modeling: {self.model_feature_list}")

        # Set up our parameters for the model script
        template_params = {
            "model_imports": self.model_import_str,
            "model_type": self.model_type,
            "model_class": self.model_class,
            "target_column": self.target_column,
            "feature_list": self.model_feature_list,
            "compressed_features": feature_set.get_compressed_features(),
            "model_metrics_s3_path": self.model_training_root,
            "train_all_data": train_all_data,
            "id_column": feature_set.id_column,
            "hyperparameters": kwargs.get("hyperparameters", {}),
        }

        # Custom Script
        if self.custom_script:
            script_path = self.custom_script
            if self.custom_script.endswith(".template"):
                # Model Type is an enumerated type, so we need to convert it to a string
                template_params["model_type"] = template_params["model_type"].value

                # Fill in the custom script template with specific parameters (include any custom args)
                template_params.update(self.custom_args)
                script_path = fill_template(self.custom_script, template_params, "generated_model_script.py")
            self.log.info(f"Custom script path: {script_path}")

        # We're using one of the built-in model script templates
        else:
            # Generate our model script
            script_path = generate_model_script(template_params)

        # Metric Definitions for Regression
        if self.model_type in [ModelType.REGRESSOR, ModelType.UQ_REGRESSOR, ModelType.ENSEMBLE_REGRESSOR]:
            metric_definitions = [
                {"Name": "RMSE", "Regex": "RMSE: ([0-9.]+)"},
                {"Name": "MAE", "Regex": "MAE: ([0-9.]+)"},
                {"Name": "R2", "Regex": "R2: ([0-9.]+)"},
                {"Name": "NumRows", "Regex": "NumRows: ([0-9]+)"},
            ]

        # Metric Definitions for Classification
        elif self.model_type == ModelType.CLASSIFIER:
            # We need to get creative with the Classification Metrics

            # Grab all the target column class values (class labels)
            table = feature_set.data_source.table
            self.class_labels = feature_set.query(f'select DISTINCT {self.target_column} FROM "{table}"')[
                self.target_column
            ].to_list()

            # Sanity check on the targets
            if len(self.class_labels) > 10:
                msg = f"Too many target classes ({len(self.class_labels)}) for classification, aborting!"
                self.log.critical(msg)
                raise ValueError(msg)

            # Dynamically create the metric definitions
            metrics = ["precision", "recall", "fscore"]
            metric_definitions = []
            for t in self.class_labels:
                for m in metrics:
                    metric_definitions.append({"Name": f"Metrics:{t}:{m}", "Regex": f"Metrics:{t}:{m} ([0-9.]+)"})

            # Add the confusion matrix metrics
            for row in self.class_labels:
                for col in self.class_labels:
                    metric_definitions.append(
                        {"Name": f"ConfusionMatrix:{row}:{col}", "Regex": f"ConfusionMatrix:{row}:{col} ([0-9.]+)"}
                    )

        # If the model type is UNKNOWN, our metric_definitions will be empty
        else:
            self.log.important(f"ModelType is {self.model_type}, skipping metric_definitions...")
            metric_definitions = []

        # Take the full script path and extract the entry point and source directory
        entry_point = str(Path(script_path).name)
        source_dir = str(Path(script_path).parent)

        # Create a Sagemaker Model with our script
        image = ModelImages.get_image_uri(self.sm_session.boto_region_name, self.training_image, "0.1")
        self.estimator = Estimator(
            entry_point=entry_point,
            source_dir=source_dir,
            role=self.workbench_role_arn,
            instance_count=1,
            instance_type="ml.m5.xlarge",
            sagemaker_session=self.sm_session,
            image_uri=image,
            metric_definitions=metric_definitions,
        )

        # Training Job Name based on the Model Name and today's date
        training_date_time_utc = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H-%M")
        training_job_name = f"{self.output_name}-{training_date_time_utc}"

        # Train the estimator
        self.log.important(f"Training the Model {self.output_name} with Training Image {image}...")
        self.estimator.fit({"train": s3_training_path}, job_name=training_job_name)

        # Now delete the training data
        self.log.info(f"Deleting training data {s3_training_path}...")
        wr.s3.delete_objects(
            [s3_training_path, s3_training_path.replace(".csv", ".csv.metadata")],
            boto3_session=self.boto3_session,
        )

        # Create Model and officially Register
        self.log.important(f"Creating new model {self.output_name}...")
        self.create_and_register_model(**kwargs)

    def post_transform(self, **kwargs):
        """Post-Transform: Calling onboard() on the Model"""
        self.log.info("Post-Transform: Calling onboard() on the Model...")
        time.sleep(3)  # Give AWS time to complete Model register

        # Store the model feature_list and target_column in the workbench_meta
        output_model = ModelCore(self.output_name, model_type=self.model_type)
        output_model.upsert_workbench_meta({"workbench_model_features": self.model_feature_list})
        output_model.upsert_workbench_meta({"workbench_model_target": self.target_column})

        # Store the class labels (if they exist)
        if self.class_labels:
            output_model.set_class_labels(self.class_labels)

        # Call the Model onboard method
        output_model.onboard_with_args(self.model_type, self.target_column, self.model_feature_list)

    def create_and_register_model(self, aws_region=None, **kwargs):
        """Create and Register the Model

        Args:
            aws_region (str, optional): AWS Region to use (default None)
            **kwargs: Additional keyword arguments to pass to the model registration
        """

        # Get the metadata/tags to push into AWS
        aws_tags = self.get_aws_tags()

        # Create model group (if it doesn't already exist)
        self.sm_client.create_model_package_group(
            ModelPackageGroupName=self.output_name,
            ModelPackageGroupDescription=self.model_description,
            Tags=aws_tags,
        )

        # Register our model
        image = ModelImages.get_image_uri(
            self.sm_session.boto_region_name, self.inference_image, "0.1", self.inference_arch
        )
        self.log.important(f"Registering model {self.output_name} with Inference Image {image}...")
        model = self.estimator.create_model(role=self.workbench_role_arn)
        if aws_region:
            self.log.important(f"Setting AWS Region: {aws_region} for model {self.output_name}...")
            model.env = {"AWS_REGION": aws_region}
        model.register(
            model_package_group_name=self.output_name,
            image_uri=image,
            content_types=["text/csv"],
            response_types=["text/csv"],
            inference_instances=supported_instance_types(self.inference_arch),
            transform_instances=["ml.m5.large", "ml.m5.xlarge"],
            approval_status="Approved",
            description=self.model_description,
        )
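
The classification branch of `transform_impl` above builds one metric definition per class label and metric, plus one per confusion-matrix cell. A minimal standalone sketch of that construction follows. The helper name `build_classification_metrics`, the sample labels, and the sample log line are illustrative assumptions, not part of Workbench.

```python
import re


def build_classification_metrics(class_labels):
    """Mirror the metric-definition loops in transform_impl (illustrative helper)."""
    metric_definitions = []
    # One definition per (label, metric) pair
    for label in class_labels:
        for metric in ["precision", "recall", "fscore"]:
            name = f"Metrics:{label}:{metric}"
            metric_definitions.append({"Name": name, "Regex": f"{name} ([0-9.]+)"})
    # One definition per confusion-matrix cell
    for row in class_labels:
        for col in class_labels:
            name = f"ConfusionMatrix:{row}:{col}"
            metric_definitions.append({"Name": name, "Regex": f"{name} ([0-9.]+)"})
    return metric_definitions


definitions = build_classification_metrics(["low", "high"])
print(len(definitions))  # 2 labels * 3 metrics + 2 * 2 matrix cells

# A hypothetical training-log line, matched the way a metric regex would be applied
log_line = "Metrics:low:precision 0.875"
match = re.search(definitions[0]["Regex"], log_line)
print(match.group(1))
```

With N class labels this produces 3N + N² definitions, which is one reason a cap on the number of target classes is useful.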

`__init__(feature_name, model_name, model_type, model_class=None, model_import_str=None, custom_script=None, custom_args=None, training_image='training', inference_image='inference', inference_arch='x86_64')`

FeaturesToModel Initialization

Parameters:

Name	Type	Description	Default
`feature_name`	`str`	Name of the FeatureSet to use as input	required
`model_name`	`str`	Name of the Model to create as output	required
`model_type`	`ModelType`	ModelType.REGRESSOR or ModelType.CLASSIFIER, etc.	required
`model_class`	`str`	The scikit model (e.g. KNeighborsRegressor)	`None`
`model_import_str`	`str`	The import string for the model	`None`
`custom_script`	`str`	Custom script to use for the model	`None`
`custom_args`	`dict`	Custom arguments to pass to custom model scripts	`None`
`training_image`	`str`	Training image	`'training'`
`inference_image`	`str`	Inference image	`'inference'`
`inference_arch`	`str`	Inference architecture	`'x86_64'`

`create_and_register_model(aws_region=None, **kwargs)`

Create and Register the Model

Parameters:

Name	Type	Description	Default
`aws_region`	`str`	AWS Region to use (default None)	`None`
`**kwargs`		Additional keyword arguments to pass to the model registration	`{}`

`post_transform(**kwargs)`

Post-Transform: Calling onboard() on the Model

`transform_impl(target_column, description=None, feature_list=None, train_all_data=False, **kwargs)`

Generic Features to Model. Note: you should create a new class and inherit from this one to include specific logic for your Feature Set/Model.

Parameters:

Name	Type	Description	Default
`target_column`	`str`	Column name of the target variable	required
`description`	`str`	Description of the model (optional)	`None`
`feature_list`	`list[str]`	A list of columns for the features (default None, will try to guess)	`None`
`train_all_data`	`bool`	Train on ALL (100%) of the data	`False`
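
When no `feature_list` is given, the class source shown earlier guesses one by dropping bookkeeping columns, the target column, and any column that is not `Integral` or `Fractional`. The following is a minimal sketch of that logic in isolation. `guess_feature_list` and the sample column metadata are illustrative assumptions, not Workbench APIs.

```python
# Columns that transform_impl always excludes when guessing features
FILTER_COLUMNS = [
    "id", "auto_id", "__index_level_0__", "write_time",
    "api_invocation_time", "is_deleted", "event_time", "training",
]


def guess_feature_list(all_columns, column_types, target_column):
    """Illustrative stand-in for the guessing branch of transform_impl.

    column_types maps column name -> Feature Store type string
    (e.g. "Integral", "Fractional", "String").
    """
    excluded = FILTER_COLUMNS + [target_column]
    candidates = [c for c in all_columns if c not in excluded]
    # Only numeric Feature Store types are kept as model features
    return [c for c in candidates if column_types[c] in ("Integral", "Fractional")]


# Hypothetical column metadata for an abalone-like FeatureSet
column_types = {
    "id": "Integral",
    "length": "Fractional",
    "diameter": "Fractional",
    "sex": "String",
    "event_time": "Fractional",
    "class_number_of_rings": "Integral",
}
features = guess_feature_list(list(column_types), column_types, "class_number_of_rings")
print(features)
```

Because a numeric column that is not really a feature (a row counter, for example) would pass this filter, passing an explicit `feature_list` remains the safer choice.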

Supported Models

Currently Workbench supports XGBoost (classifier/regressor) and Scikit-Learn models. These models can be created just by passing different parameters to the FeaturesToModel class. The main limitation of the supported models is that they are vanilla versions with default parameters; any customization should be done with Custom Models.

XGBoost

from workbench.api import ModelType
from workbench.core.transforms.features_to_model.features_to_model import FeaturesToModel

# XGBoost Regression Model
input_name = "abalone_features"
output_name = "abalone-regression"
to_model = FeaturesToModel(input_name, output_name, model_type=ModelType.REGRESSOR)
to_model.set_output_tags(["abalone", "public"])
to_model.transform(target_column="class_number_of_rings", description="Abalone Regression")

# XGBoost Classification Model
input_name = "wine_features"
output_name = "wine-classification"
to_model = FeaturesToModel(input_name, output_name, ModelType.CLASSIFIER)
to_model.set_output_tags(["wine", "public"])
to_model.transform(target_column="wine_class", description="Wine Classification")

# Quantile Regression Model (Abalone)
input_name = "abalone_features"
output_name = "abalone-quantile-reg"
to_model = FeaturesToModel(input_name, output_name, ModelType.UQ_REGRESSOR)
to_model.set_output_tags(["abalone", "quantiles"])
to_model.transform(target_column="class_number_of_rings", description="Abalone Quantile Regression")

Scikit-Learn

from workbench.api import ModelType
from workbench.core.transforms.features_to_model.features_to_model import FeaturesToModel

# Scikit-Learn KMeans Clustering Model
input_name = "wine_features"
output_name = "wine-clusters"
to_model = FeaturesToModel(
    input_name,
    output_name,
    model_class="KMeans",  # Clustering algorithm
    model_import_str="from sklearn.cluster import KMeans",  # Import statement for KMeans
    model_type=ModelType.CLUSTERER,
)
to_model.set_output_tags(["wine", "clustering"])
to_model.transform(target_column=None, description="Wine Clustering", train_all_data=True)

# Scikit-Learn HDBSCAN Clustering Model
input_name = "wine_features"
output_name = "wine-clusters-hdbscan"
to_model = FeaturesToModel(
    input_name,
    output_name,
    model_class="HDBSCAN",  # Density-based clustering algorithm
    model_import_str="from sklearn.cluster import HDBSCAN",
    model_type=ModelType.CLUSTERER,
)
to_model.set_output_tags(["wine", "density-based clustering"])
to_model.transform(target_column=None, description="Wine Clustering with HDBSCAN", train_all_data=True)

# 2D Projection Model using UMAP (scikit-learn-compatible estimator from the umap-learn package)
input_name = "wine_features"
output_name = "wine-2d-projection"
to_model = FeaturesToModel(
    input_name,
    output_name,
    model_class="UMAP",
    model_import_str="from umap import UMAP",
    model_type=ModelType.PROJECTION,
)
to_model.set_output_tags(["wine", "2d-projection"])
to_model.transform(target_column=None, description="Wine 2D Projection", train_all_data=True)

Custom Models

Experimental

The Workbench Custom Models are currently experimental, so have fun but expect issues. Requires workbench >= 0.8.60. Feel free to submit issues to the Workbench GitHub.

For custom models we recommend the following steps:

1. Copy the example custom model script into your own directory
   - See: Custom Model Script
2. Make a requirements.txt and put it into the same directory
3. Train/deploy the *existing* example
   - This is an important step, don't skip it
   - If the existing model script trains and deploys, you're in great shape for the next step. If it doesn't, now is a good time to debug your AWS account, permissions, etc.
4. Now customize the model script
5. Train/deploy your custom script
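
Before training, it can help to confirm that the script and its requirements.txt really sit in the same directory, since the class source above splits the script path into an entry point (the file name) and a source directory (its parent). The following is a minimal sketch of such a check. `check_custom_script` is a hypothetical helper, not part of Workbench, and the temporary files exist only for demonstration.

```python
from pathlib import Path
import tempfile


def check_custom_script(script_path):
    """Hypothetical pre-flight check; returns (entry_point, source_dir)."""
    script = Path(script_path)
    if not script.is_file():
        raise FileNotFoundError(f"Custom script not found: {script}")
    if not (script.parent / "requirements.txt").is_file():
        raise FileNotFoundError(f"No requirements.txt next to {script.name}")
    # Same split that transform_impl applies to the script path
    return script.name, str(script.parent)


# Demonstrate with a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "my_custom_script.py").write_text("# model script\n")
    (Path(tmp) / "requirements.txt").write_text("scikit-learn\n")
    entry_point, source_dir = check_custom_script(Path(tmp) / "my_custom_script.py")
    print(entry_point)
```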

Training/Deploying Custom Models

from workbench.api import ModelType
from workbench.core.transforms.features_to_model.features_to_model import FeaturesToModel

# Note this directory should also have a requirements.txt in it
my_custom_script = "/full/path/to/my/directory/my_custom_script.py"
input_name = "wine_features"    # FeatureSet you want to use
output_name = "my-custom-model" # change to whatever
target_column = "wine_class"    # change to whatever
to_model = FeaturesToModel(input_name, output_name,
                           model_type=ModelType.CLASSIFIER, 
                           custom_script=my_custom_script)
to_model.set_output_tags(["your", "tags"])
to_model.transform(target_column=target_column, description="Custom Model")

Custom Models: Create an Endpoint/Run Inference

from workbench.api import FeatureSet, Model, Endpoint

model = Model("my-custom-model")
end = model.to_endpoint()   # Note: This takes a while

# Now run inference on my custom model :)
end.auto_inference(capture=True)

# Run inference with my own dataframe
fs = FeatureSet("wine_features")  # Assumption: the FeatureSet the model was trained on
df = fs.pull_dataframe()  # Or whatever dataframe
end.inference(df)

Questions?

The SuperCowPowers team is happy to answer any questions you may have about AWS and Workbench. Please contact us at workbench@supercowpowers.com or chat with us on Discord.