SQL Algorithms
SQL Algorithms
One of the main benefit of SQL Algorithms is that the 'heavy lifting' is all done on the SQL Database, so if you have large datassets this is the place for you.
-
SQL: SQL queries that provide a wide range of functionality:
- Outliers
- Descriptive Stats
- Correlations
- and More Oulliers
SQL based Outliers: Compute outliers for all the columns in a DataSource using SQL
Outliers
Outliers: Class to compute outliers for all the columns in a DataSource using SQL
Source code in src/sageworks/algorithms/sql/outliers.py
15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 |
|
__init__()
compute_outliers(data_source, scale=1.5, use_stddev=False)
Compute outliers for all the numeric columns in a DataSource Args: data_source(DataSource): The DataSource that we're computing outliers on scale (float): The scale to use for either the IQR or stddev outlier calculation (default: 1.5) use_stddev (bool): Option to use the standard deviation for the outlier calculation (default: False) Returns: pd.DataFrame: A DataFrame of outliers for this DataSource Notes: Uses the IQR * 1.5 (~= 2.5 Sigma) (use 1.7 for ~= 3 Sigma) The scale parameter can be adjusted to change the IQR multiplier
Source code in src/sageworks/algorithms/sql/outliers.py
get_top_n_outliers(outlier_df, n=10)
staticmethod
Function to retrieve the top N highest and lowest outliers for each outlier group.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
outlier_df
|
DataFrame
|
The DataFrame of outliers with 'outlier_group' column |
required |
n
|
int
|
Number of top outliers to retrieve for each group, defaults to 10 |
10
|
Returns:
Type | Description |
---|---|
DataFrame
|
pd.DataFrame: A DataFrame containing the top N outliers for each outlier group |
Source code in src/sageworks/algorithms/sql/outliers.py
SQL based Descriptive Stats: Compute Descriptive Stats for all the numeric columns in a DataSource using SQL
descriptive_stats(data_source)
Compute Descriptive Stats for all the numeric columns in a DataSource Args: data_source(DataSource): The DataSource that we're computing descriptive stats on Returns: dict(dict): A dictionary of descriptive stats for each column in this format {'col1': {'min': 0, 'q1': 1, 'median': 2, 'q3': 3, 'max': 4, 'mean': 2.5, 'stddev': 1.5}, 'col2': ...}
Source code in src/sageworks/algorithms/sql/descriptive_stats.py
descriptive_stats_query(columns, table_name)
Build a query to compute the descriptive stats for all columns in a table Args: columns(list(str)): The columns to compute descriptive stats on table_name(str): The table to compute descriptive stats on Returns: str: The SQL query to compute descriptive stats
Source code in src/sageworks/algorithms/sql/descriptive_stats.py
SQL based Correlations: Compute Correlations for all the numeric columns in a DataSource using SQL
correlation_query(columns, table_name)
Build a query to compute the correlations between columns in a table
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns
|
list(str
|
The columns to compute correlations on |
required |
table_name
|
str
|
The table to compute correlations on |
required |
Returns:
Name | Type | Description |
---|---|---|
str |
str
|
The SQL query to compute correlations |
Pearson correlation coefficient ranges from -1 to 1:
+1 indicates a perfect positive linear relationship. -1 indicates a perfect negative linear relationship. 0 indicates no linear relationship.
Source code in src/sageworks/algorithms/sql/correlations.py
correlations(data_source)
Compute Correlations for all the numeric columns in a DataSource Args: data_source(DataSource): The DataSource that we're computing correlations on Returns: dict(dict): A dictionary of correlations for each column in this format {'col1': {'col2': 0.5, 'col3': 0.9, 'col4': 0.4, ...}, 'col2': {'col1': 0.5, 'col3': 0.8, 'col4': 0.3, ...}}
Source code in src/sageworks/algorithms/sql/correlations.py
Questions?
The SuperCowPowers team is happy to answer any questions you may have about AWS and SageWorks. Please contact us at sageworks@supercowpowers.com or on chat us up on Discord