DataSource
DataSource Examples
Examples of using the DataSource class are in the Examples section at the bottom of this page. DataSource can read data from many different sources, including S3 data, local files, and Pandas DataFrames.
DataSource: Manages AWS Data Catalog creation and management. DataSources are set up so that they can easily be queried with AWS Athena. All DataSources are run through a full set of Exploratory Data Analysis (EDA) techniques (data quality, distributions, stats, outliers, etc.). DataSources can be viewed and explored within the SageWorks Dashboard UI.
DataSource
Bases: AthenaSource
DataSource: SageWorks DataSource API Class
Common Usage
Source code in src/sageworks/api/data_source.py
__init__(source, name=None, tags=None, **kwargs)
Initializes a new DataSource object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
source | Union[str, DataFrame] | Source of data (existing name, filepath, S3 path, or a Pandas DataFrame) | required |
name | str | The name of the data source (must be lowercase). If not specified, a name will be generated | None |
tags | list[str] | A list of tags associated with the data source. If not specified tags will be generated. | None |
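The `name` argument must be lowercase. The following is a minimal sketch of normalizing a candidate name before passing it to `DataSource`. It assumes you also want to replace anything other than letters, digits, and underscores. The helper `normalize_name` is hypothetical and not part of SageWorks, which may apply its own naming rules.

```python
import re


def normalize_name(raw: str) -> str:
    """Lowercase a candidate name and replace other characters with underscores.

    Hypothetical helper, not part of SageWorks; the only rule documented
    here is that the name must be lowercase.
    """
    lowered = raw.strip().lower()
    # Collapse each run of characters outside [a-z0-9_] into one underscore
    cleaned = re.sub(r"[^a-z0-9_]+", "_", lowered).strip("_")
    if not cleaned:
        raise ValueError(f"Cannot derive a name from {raw!r}")
    return cleaned


print(normalize_name("Abalone Data"))  # abalone_data
```

The result could then be passed along as `DataSource(df, name=normalize_name("Abalone Data"))`.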
details(**kwargs)
DataSource Details
Returns:
Name | Type | Description |
---|---|---|
dict | dict | A dictionary of details about the DataSource |
pull_dataframe(include_aws_columns=False)
Return a DataFrame of ALL the data from this DataSource
Parameters:
Name | Type | Description | Default |
---|---|---|---|
include_aws_columns | bool | Include the AWS columns in the DataFrame (default: False) | False |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: A DataFrame of ALL the data from this DataSource |
Note
Obviously this is not recommended for large datasets :)
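For larger tables, one alternative is to pull only a bounded number of rows through `query()`. The sketch below assumes the Athena table has the same name as the DataSource, as in the query examples on this page. `preview` is a hypothetical helper, not a SageWorks method, and it accepts any object that exposes `query(str)`.

```python
def preview(source, table: str, limit: int = 100):
    """Return at most `limit` rows by running a LIMIT query.

    `source` is any object with a query(str) method, such as a DataSource.
    Hypothetical helper; `table` is assumed to match the DataSource name.
    """
    # Only allow simple identifiers, since the name is interpolated into SQL
    if not table.replace("_", "").isalnum():
        raise ValueError(f"Unexpected table name: {table!r}")
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return source.query(f"SELECT * FROM {table} LIMIT {int(limit)}")
```

With a DataSource created as `my_data = DataSource("abalone_data")`, a call such as `preview(my_data, "abalone_data", limit=10)` would return the result of that LIMIT query instead of the whole table.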
query(query)
Query the AthenaSource
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | The query to run against the DataSource | required |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: The results of the query |
to_features(name, id_column, tags=None, event_time_column=None, one_hot_columns=None)
Convert the DataSource to a FeatureSet
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the feature set (must be lowercase). | required |
id_column | str | The ID column (must be specified, use "auto" for auto-generated IDs). | required |
tags | list | The tags for the feature set. If not specified, tags will be generated. | None |
event_time_column | str | The event time column for the feature set. If not specified, one will be generated. | None |
one_hot_columns | list | The columns to be one-hot encoded (default: None). | None |
Returns:
Name | Type | Description |
---|---|---|
FeatureSet | Union[FeatureSet, None] | The FeatureSet created from the DataSource (or None if the FeatureSet isn't created) |
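Because `to_features()` may return `None`, callers may want to check the result before using it. The following is a minimal sketch of such a check. The wrapper `create_feature_set` is hypothetical, not part of SageWorks, and relies only on the `to_features(name, id_column, ...)` signature documented above.

```python
def create_feature_set(source, name: str, id_column: str = "auto"):
    """Call to_features() and fail loudly if no FeatureSet comes back.

    `source` is any object exposing the documented
    to_features(name, id_column, ...) method, such as a DataSource.
    Hypothetical wrapper, not part of SageWorks.
    """
    if name != name.lower():
        raise ValueError("FeatureSet names must be lowercase")
    feature_set = source.to_features(name=name, id_column=id_column)
    if feature_set is None:
        raise RuntimeError(f"FeatureSet {name!r} was not created")
    return feature_set
```

Raising here is one design choice among several; logging the failure and returning `None` would keep closer to the behavior of `to_features()` itself.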
Examples
All of the SageWorks examples are in the SageWorks repository under the examples/ directory. For a full code listing of any example, please visit our SageWorks Examples.
Create a DataSource from an S3 Path or File Path
from sageworks.api.data_source import DataSource
# Create a DataSource from an S3 Path (or a local file)
source_path = "s3://sageworks-public-data/common/abalone.csv"
# source_path = "/full/path/to/local/file.csv"
my_data = DataSource(source_path)
print(my_data.details())
Create a DataSource from a Pandas Dataframe
from sageworks.utils.test_data_generator import TestDataGenerator
from sageworks.api.data_source import DataSource
# Create a DataSource from a Pandas DataFrame
gen_data = TestDataGenerator()
df = gen_data.person_data()
test_data = DataSource(df, name="test_data")
print(test_data.details())
Query a DataSource
All SageWorks DataSources use AWS Athena, so any query that you can make with Athena is accessible through the DataSource API.
from sageworks.api.data_source import DataSource
# Grab a DataSource
my_data = DataSource("abalone_data")
# Make some queries using the Athena backend
df = my_data.query("select * from abalone_data where height > .3")
print(df.head())
df = my_data.query("select * from abalone_data where class_number_of_rings < 3")
print(df.head())
Output
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings
0 M 0.705 0.565 0.515 2.210 1.1075 0.4865 0.5120 10
1 F 0.455 0.355 1.130 0.594 0.3320 0.1160 0.1335 8
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight class_number_of_rings
0 I 0.075 0.055 0.010 0.002 0.0010 0.0005 0.0015 1
1 I 0.150 0.100 0.025 0.015 0.0045 0.0040 0.0050 2
Create a FeatureSet from a DataSource
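from sageworks.api.data_source import DataSource
# Convert the DataSource to a FeatureSet
test_data = DataSource("test_data")
# name and id_column are required; "test_features" is an illustrative name
# and "auto" requests auto-generated IDs
my_features = test_data.to_features(name="test_features", id_column="auto")
print(my_features.details())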
SageWorks UI
Whenever a DataSource is created, SageWorks performs a comprehensive set of Exploratory Data Analysis techniques on your data, pushes the results into AWS, and provides a detailed web visualization of the results.
Not Finding a particular method?
The SageWorks API Classes use the 'Core' Classes internally, so for an extensive listing of all the available methods, please take a deep dive into: SageWorks Core Classes