Workbench DataFrame Storage
Examples
Examples of using the DFStore class are listed at the bottom of this page: Examples.
Why DataFrame Storage?
Great question, there are a couple of reasons. The first is that the AWS Parameter Store has a 4KB limit, so it won't support any kind of 'real' data. The second is that DataFrames are commonly used throughout data engineering, data science, and ML pipeline construction. Providing storage of named DataFrames in an accessible location that your ML team can inspect and use comes in super handy.
Efficient Storage
All DataFrames are stored in the Parquet format using 'snappy' storage. Parquet is a columnar storage format that efficiently handles large datasets, and using Snappy compression reduces file size while maintaining fast read/write speeds.
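To give a feel for what that means in practice, here's a minimal sketch of the same round trip using plain pandas (DFStore does the equivalent against S3 for you; this is just an illustration, not its actual implementation):

import pandas as pd

# Create a small DataFrame
df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})

# Write Parquet with Snappy compression (requires pyarrow or fastparquet)
df.to_parquet("my_df.parquet", compression="snappy")

# Read it back
df2 = pd.read_parquet("my_df.parquet")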
DFStore: Fast/efficient storage of DataFrames using AWS S3/Parquet/Snappy
DFStore
Bases: AWSDFStore
DFStore: Fast/efficient storage of DataFrames using AWS S3/Parquet/Snappy
Common Usage
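A minimal usage sketch, mirroring the full examples at the bottom of this page:

from workbench.api.df_store import DFStore
import pandas as pd

df_store = DFStore()

# List stored DataFrames
df_store.list()

# Add a DataFrame
df = pd.DataFrame({"A": [1]*1000, "B": [3]*1000})
df_store.upsert("test/test_df", df)

# Retrieve it
return_df = df_store.get("test/test_df")

# Delete it
df_store.delete("test/test_df")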
Source code in src/workbench/api/df_store.py
__init__(path_prefix=None)
DFStore Init Method
Parameters:
Name | Type | Description | Default
---|---|---|---
path_prefix | Union[str, None] | Add a path prefix to storage locations (Defaults to None) | None
check(location)
Check if a DataFrame exists at the specified location
Parameters:
Name | Type | Description | Default
---|---|---|---
location | str | The location of the data to check. | required

Returns:

Name | Type | Description
---|---|---
bool | bool | True if the data exists, False otherwise.
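A quick sketch (the location test/test_df is just an example name):

df_store = DFStore()
if df_store.check("test/test_df"):
    print("DataFrame exists!")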
delete(location)
Delete a DataFrame from AWS S3.
Parameters:
Name | Type | Description | Default
---|---|---|---
location | str | The location of the data to delete. | required
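A small sketch combining check() and delete(), again with an illustrative location:

if df_store.check("test/test_df"):
    df_store.delete("test/test_df")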
details(include_cache=False)
Return a DataFrame with detailed metadata for all objects in the data_store prefix.
Parameters:
Name | Type | Description | Default
---|---|---|---
include_cache | bool | Include cache objects in the details (Defaults to False). | False

Returns:

Type | Description
---|---
pd.DataFrame | A DataFrame with detailed metadata for all objects in the data_store prefix.
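A quick sketch (the exact metadata rows depend on the objects stored in your account):

details_df = df_store.details()
print(details_df)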
get(location)
Retrieve a DataFrame from AWS S3.
Parameters:
Name | Type | Description | Default
---|---|---|---
location | str | The location of the data to retrieve. | required

Returns:

Type | Description
---|---
Union[DataFrame, None] | The retrieved DataFrame, or None if not found.
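Because get() returns None when nothing is stored at the location, a defensive sketch looks like this (location is illustrative):

df = df_store.get("test/test_df")
if df is not None:
    print(df.head())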
last_modified(location)
Get the last modified date of the DataFrame at the specified location.
Parameters:
Name | Type | Description | Default
---|---|---|---
location | str | The location of the data to check. | required

Returns:

Type | Description
---|---
Union[datetime, None] | The last modified date of the DataFrame, or None if not found.
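A quick sketch (location is illustrative):

modified = df_store.last_modified("test/test_df")
if modified is not None:
    print(f"Last modified: {modified}")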
list(include_cache=False)
List all the objects in the data_store prefix.
Parameters:
Name | Type | Description | Default
---|---|---|---
include_cache | bool | Include cache objects in the list (Defaults to False). | False

Returns:

Name | Type | Description
---|---|---
list | list | A list of all the objects in the data_store prefix.
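For example, printing every stored location (assuming the entries are location strings, as in the summary output):

for location in df_store.list():
    print(location)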
summary(include_cache=False)
Return a nicely formatted summary of object locations, sizes (in MB), and modified dates.
Parameters:
Name | Type | Description | Default
---|---|---|---
include_cache | bool | Include cache objects in the summary (Defaults to False). | False

Returns:

Type | Description
---|---
pd.DataFrame | A formatted DataFrame with the summary details.
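A quick sketch of printing the summary:

print(df_store.summary())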
upsert(location, data)
Insert or update a DataFrame or Series in AWS S3.
Parameters:
Name | Type | Description | Default
---|---|---|---
location | str | The location of the data. | required
data | Union[DataFrame, Series] | The data to be stored. | required
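Since upsert() accepts either a DataFrame or a Series, a short sketch of both (locations are illustrative):

import pandas as pd

df_store.upsert("test/my_df", pd.DataFrame({"A": [1, 2], "B": [3, 4]}))
df_store.upsert("test/my_series", pd.Series([1, 2, 3], name="A"))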
Examples
These examples show how to use the DFStore()
class to list, add, and get DataFrames from AWS Storage.
Workbench REPL
If you'd like to experiment with listing, adding, and getting DataFrames with the DFStore()
class, you can spin up the Workbench REPL, use the class, and test out all the methods. Try it out! Workbench REPL
from workbench.api.df_store import DFStore
import pandas as pd
df_store = DFStore()
# List DataFrames
df_store.list()
Out[1]:
ml/confusion_matrix (0.002MB/2024-09-23 16:44:48)
ml/hold_out_ids (0.094MB/2024-09-23 16:57:01)
ml/my_awesome_df (0.002MB/2024-09-23 16:43:30)
ml/shap_values (0.019MB/2024-09-23 16:57:21)
# Add a DataFrame
df = pd.DataFrame({"A": [1]*1000, "B": [3]*1000})
df_store.upsert("test/test_df", df)
# List DataFrames (we can just use the REPR)
df_store
Out[2]:
ml/confusion_matrix (0.002MB/2024-09-23 16:44:48)
ml/hold_out_ids (0.094MB/2024-09-23 16:57:01)
ml/my_awesome_df (0.002MB/2024-09-23 16:43:30)
ml/shap_values (0.019MB/2024-09-23 16:57:21)
test/test_df (0.002MB/2024-09-23 16:59:27)
# Retrieve dataframes
return_df = df_store.get("test/test_df")
return_df.head()
Out[3]:
A B
0 1 3
1 1 3
2 1 3
3 1 3
4 1 3
# Delete dataframes
df_store.delete("test/test_df")