DataSource
DataSource Examples
Examples of using the DataSource class are in the Examples section at the bottom of this page. DataSource can read data from many different sources, including S3 data, local files, and Pandas DataFrames.
DataSource: Handles AWS Data Catalog creation and management. DataSources are set up so that they can easily be queried with AWS Athena. All DataSources are run through a full set of Exploratory Data Analysis (EDA) techniques (data quality, distributions, stats, outliers, etc.). DataSources can be viewed and explored within the Workbench Dashboard UI.
DataSource
Bases: AthenaSource
DataSource: Workbench DataSource API Class
Common Usage
Source code in src/workbench/api/data_source.py
__init__(source, name=None, tags=None, **kwargs)
Initializes a new DataSource object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
source | Union[str, DataFrame] | Source of data (existing name, filepath, S3 path, or a Pandas DataFrame) | required |
name | str | The name of the data source (must be lowercase). If not specified, a name will be generated | None |
tags | list[str] | A list of tags associated with the data source. If not specified, tags will be generated. | None |
Source code in src/workbench/api/data_source.py
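Since `name` must be lowercase, it can help to normalize a candidate name before passing it to the constructor. The helper below is a minimal sketch and not part of Workbench: `to_source_name` is a hypothetical function, and replacing non-alphanumeric characters with underscores is an assumption (the parameter description above only states that the name must be lowercase).

```python
import re


def to_source_name(raw: str) -> str:
    """Hypothetical helper: lowercase a label and replace unsafe characters.

    Assumption: only lowercase letters, digits, and underscores are kept.
    Workbench itself only documents that the name must be lowercase.
    """
    lowered = raw.strip().lower()
    # Collapse every run of other characters into a single underscore
    cleaned = re.sub(r"[^a-z0-9]+", "_", lowered)
    return cleaned.strip("_")


print(to_source_name("Abalone Data 2024"))
```

The result could then be supplied as the `name` argument, for example `DataSource(source_path, name=to_source_name("Abalone Data 2024"))`.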
details(**kwargs)
DataSource Details
Returns:
Name | Type | Description |
---|---|---|
dict | dict | A dictionary of details about the DataSource |
pull_dataframe(include_aws_columns=False)
Return a DataFrame of ALL the data from this DataSource
Parameters:
Name | Type | Description | Default |
---|---|---|---|
include_aws_columns | bool | Include the AWS columns in the DataFrame (default: False) | False |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: A DataFrame of ALL the data from this DataSource |
Note
Obviously this is not recommended for large datasets :)
Source code in src/workbench/api/data_source.py
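For larger datasets, one alternative to pulling everything is to send a row-limited query through `query()` instead. The following is a minimal sketch under that assumption: `preview_query` is a hypothetical helper (not a Workbench function) that only builds the SQL string, so it runs without AWS access.

```python
def preview_query(table: str, limit: int = 100) -> str:
    """Hypothetical helper: build a SQL string that returns at most `limit` rows."""
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return f"SELECT * FROM {table} LIMIT {limit}"


sql = preview_query("abalone_data", limit=10)
print(sql)

# With a live DataSource, the string could be passed along, e.g.:
# df = my_data.query(sql)
```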
query(query)
Query the AthenaSource
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | The query to run against the DataSource | required |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: The results of the query |
to_features(name, id_column=None, tags=None, event_time_column=None, one_hot_columns=None)
Convert the DataSource to a FeatureSet
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name for the feature set (must be lowercase). | required |
id_column | str | The ID column (if not specified, an 'auto_id' will be generated). | None |
tags | list | The tags for the feature set. If not specified, tags will be generated. | None |
event_time_column | str | The event time column for the feature set. If not specified, one will be generated. | None |
one_hot_columns | list | The columns to be one-hot encoded (default: None). | None |
Returns:
Name | Type | Description |
---|---|---|
FeatureSet | Union[FeatureSet, None] | The FeatureSet created from the DataSource (or None if the FeatureSet isn't created) |
Source code in src/workbench/api/data_source.py
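To make the `one_hot_columns` parameter more concrete, here is a small, library-free illustration of what one-hot encoding does to a categorical column such as `sex` in the abalone data. This is only a sketch of the general technique, not Workbench's implementation; the `one_hot` function and the `prefix_category` column naming are assumptions made for this example.

```python
def one_hot(values, prefix):
    """Illustrative one-hot encoding of a list of categorical values.

    Returns one dict per input value, with a 0/1 entry for every category.
    The `prefix_category` column naming is an assumption for this sketch.
    """
    categories = sorted(set(values))
    return [
        {f"{prefix}_{cat}": int(value == cat) for cat in categories}
        for value in values
    ]


rows = one_hot(["M", "F", "I", "M"], prefix="sex")
for row in rows:
    print(row)
```

Each input value becomes a row in which exactly one of the category columns is set to 1.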
Examples
All of the Workbench Examples are in the Workbench Repository under the examples/
directory. For a full code listing of any example, please visit our Workbench Examples.
Create a DataSource from an S3 Path or File Path
from workbench.api.data_source import DataSource
# Create a DataSource from an S3 Path (or a local file)
source_path = "s3://workbench-public-data/common/abalone.csv"
# source_path = "/full/path/to/local/file.csv"
my_data = DataSource(source_path)
print(my_data.details())
Create a DataSource from a Pandas Dataframe
from workbench.utils.test_data_generator import TestDataGenerator
from workbench.api.data_source import DataSource
# Create a DataSource from a Pandas DataFrame
gen_data = TestDataGenerator()
df = gen_data.person_data()
test_data = DataSource(df, name="test_data")
print(test_data.details())
Query a DataSource
All Workbench DataSources use AWS Athena, so any query that you can make with Athena is accessible through the DataSource API.
from workbench.api.data_source import DataSource
# Grab a DataSource
my_data = DataSource("abalone_data")
# Make some queries using the Athena backend
df = my_data.query("select * from abalone_data where height > .3")
print(df.head())
df = my_data.query("select * from abalone_data where class_number_of_rings < 3")
print(df.head())
Output
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings
0 M 0.705 0.565 0.515 2.210 1.1075 0.4865 0.5120 10
1 F 0.455 0.355 1.130 0.594 0.3320 0.1160 0.1335 8
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings
0 I 0.075 0.055 0.010 0.002 0.0010 0.0005 0.0015 1
1 I 0.150 0.100 0.025 0.015 0.0045 0.0040 0.0050 2
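Aggregate queries can be sent through the same `query()` method. The sketch below builds a simple `GROUP BY` count; `count_by_column` is a hypothetical helper, and `FakeSource` is a stand-in object (not a Workbench class) that records the SQL it receives so the example runs without AWS access. With a real DataSource, `query()` would return a DataFrame.

```python
class FakeSource:
    """Stand-in for a DataSource so the sketch runs without AWS access."""

    def __init__(self):
        self.queries = []

    def query(self, query: str):
        # A real DataSource would run this on Athena and return a DataFrame
        self.queries.append(query)
        return query


def count_by_column(source, table: str, column: str):
    """Hypothetical helper: count rows per distinct value of `column`."""
    sql = f"SELECT {column}, COUNT(*) AS n FROM {table} GROUP BY {column}"
    return source.query(sql)


source = FakeSource()
print(count_by_column(source, "abalone_data", "sex"))
```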
Create a FeatureSet from a DataSource
from workbench.api.data_source import DataSource
# Convert the Data Source to a Feature Set
test_data = DataSource('test_data')
my_features = test_data.to_features("test_features")
print(my_features.details())
Workbench UI
Whenever a DataSource is created, Workbench performs a comprehensive set of Exploratory Data Analysis techniques on your data, pushes the results into AWS, and provides a detailed web visualization of the results.
Not finding a particular method?
The Workbench API classes use the 'Core' classes internally, so for an extensive listing of all the available methods, please take a deep dive into: Workbench Core Classes