pdstools.utils.hds_utils
¶
Module Contents¶
Classes¶
Config – Configuration file for the data anonymizer.
DataAnonymization – Anonymize a historical dataset.
- class Config(config_file: str | None = None, hds_folder: pathlib.Path = '.', use_datamart: bool = False, datamart_folder: pathlib.Path = 'datamart', output_format: Literal['ndjson', 'parquet', 'arrow', 'csv'] = 'ndjson', output_folder: pathlib.Path = 'output', mapping_file: str = 'mapping.map', mask_predictor_names: bool = True, mask_context_key_names: bool = False, mask_ih_names: bool = True, mask_outcome_name: bool = False, mask_predictor_values: bool = True, mask_context_key_values: bool = True, mask_ih_values: bool = True, mask_outcome_values: bool = True, context_key_label: str = 'Context_*', ih_label: str = 'IH_*', outcome_column: str = 'Decision_Outcome', positive_outcomes: list = ['Accepted', 'Clicked'], negative_outcomes: list = ['Rejected', 'Impression'], special_predictors: list = ['Decision_DecisionTime', 'Decision_OutcomeTime', 'Decision_Rank'], sample_percentage_schema_inferencing: float = 0.01)¶
Configuration file for the data anonymizer.
- Parameters:
config_file (str = None) – An optional path to a config file
hds_folder (Path = ".") – The path to the hds files
use_datamart (bool = False) – Whether to use the datamart to infer predictor types
datamart_folder (Path = "datamart") – The folder of the datamart files
output_format (Literal["ndjson", "parquet", "arrow", "csv"] = "ndjson") – The output format to write the files in
output_folder (Path = "output") – The location to write the files to
mapping_file (str = "mapping.map") – The name of the predictor mapping file
mask_predictor_names (bool = True) – Whether to mask the names of regular predictors
mask_context_key_names (bool = False) – Whether to mask the names of context key predictors
mask_ih_names (bool = True) – Whether to mask the name of Interaction History summary predictors
mask_outcome_name (bool = False) – Whether to mask the name of the outcome column
mask_predictor_values (bool = True) – Whether to mask the values of regular predictors
mask_context_key_values (bool = True) – Whether to mask the values of context key predictors
mask_ih_values (bool = True) – Whether to mask the values of Interaction History summary predictors
mask_outcome_values (bool = True) – Whether to mask the values of the outcomes to binary
context_key_label (str = "Context_*") – The pattern of names for context key predictors
ih_label (str = "IH_*") – The pattern of names for Interaction History summary predictors
outcome_column (str = "Decision_Outcome") – The name of the outcome column
positive_outcomes (list = ["Accepted", "Clicked"]) – Which positive outcomes to map to True
negative_outcomes (list = ["Rejected", "Impression"]) – Which negative outcomes to map to False
special_predictors (list = ["Decision_DecisionTime", "Decision_OutcomeTime", "Decision_Rank"]) – A list of special predictors which are not anonymized
sample_percentage_schema_inferencing (float = 0.01) – The percentage of records to sample when inferring column types. If you are getting casting errors, increasing this percentage checks a larger portion of the data.
- load_from_config_file(config_file: pathlib.Path)¶
Load the configurations from a file.
- Parameters:
config_file (Path) – The path to the configuration file
- save_to_config_file(file_name: str | None = None)¶
Save the configurations to a file.
- Parameters:
file_name (str) – The name of the configuration file
- validate_paths()¶
Validate that the output folder exists.
- class DataAnonymization(config: Config | None = None, df: polars.LazyFrame | None = None, datamart: pdstools.ADMDatamart | None = None, **config_args)¶
Anonymize a historical dataset.
- Parameters:
config (Optional[Config]) – Override the default configurations with the Config class
df (Optional[polars.LazyFrame]) – Manually supply a Polars lazyframe to anonymize
datamart (Optional[pdstools.ADMDatamart]) – Manually supply a Datamart file to infer predictor types
- Keyword Arguments:
**config_args – Additional keyword arguments are passed through to Config; see Config for the available options.
- write_to_output(df: polars.DataFrame | None = None, ext: Literal['ndjson', 'parquet', 'arrow', 'csv'] | None = None, mode: Literal['optimized', 'robust'] = 'optimized')¶
Write the processed dataframe to an output file.
- Parameters:
df (Optional[pl.DataFrame]) – Dataframe to write. If not provided, runs self.process()
ext (Literal["ndjson", "parquet", "arrow", "csv"]) – What extension to write the file to
mode (Literal['optimized', 'robust'], default = 'optimized') – Whether to write a single file (optimized) or keep the same file structure as the original files (robust). Optimized is faster, but robust scales to larger datasets because it does not need all data in memory at once.
- create_mapping_file()¶
Create a file recording the mapping from original to anonymized column names.
- load_hds_files()¶
Load the historical dataset files from the config.hds_folder location.
- read_predictor_type_from_file(df: polars.LazyFrame)¶
Infer the types of the predictors from the data.
This is non-trivial, as it is not ideal to pull all data into memory. For that reason, we sample 1% of the data (or all of it if there are fewer than 50 rows) and try to cast each column to numeric. Columns that cast successfully are treated as numeric; the rest are treated as categorical.
You can manually override this by setting the symbolic_predictors_to_mask and numeric_predictors_to_mask properties.
- Parameters:
df (pl.LazyFrame) – The lazyframe to infer the types with
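The sampling-and-casting heuristic described above can be sketched with plain Python. infer_column_type is an illustrative stand-in, not the pdstools implementation (which operates on a polars LazyFrame):

```python
import random

def infer_column_type(values: list, sample_pct: float = 0.01, seed: int = 0) -> str:
    """Guess 'numeric' or 'categorical' by trying to cast a sample to float."""
    rng = random.Random(seed)
    if len(values) < 50:
        sample = values  # small data: check everything
    else:
        k = max(1, int(len(values) * sample_pct))
        sample = rng.sample(values, k)
    try:
        for v in sample:
            float(v)  # raises ValueError if the value is not numeric
    except (TypeError, ValueError):
        return "categorical"
    return "numeric"

print(infer_column_type(["1.5", "2", "3.25"]))    # numeric
print(infer_column_type(["low", "mid", "high"]))  # categorical
```

Because only a sample is checked, a rare non-numeric value can be missed; this is exactly why raising sample_percentage_schema_inferencing helps when casting errors occur.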
- static read_predictor_type_from_datamart(datamart_folder: pathlib.Path, datamart: pdstools.ADMDatamart | None = None)¶
The datamart contains type information about each predictor. This function extracts that information to infer types for the HDS.
- Parameters:
datamart_folder (Path) – The path to the datamart files
datamart (ADMDatamart) – The direct ADMDatamart object
- get_columns_by_type()¶
Get a list of columns for each type.
- get_predictors_mapping()¶
Map the predictor names to their anonymized form.
- getHasher(cols, algorithm='xxhash', seed='random', seed_1=None, seed_2=None, seed_3=None)¶
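Value masking boils down to applying a seeded hash to each value in the selected columns. A minimal stdlib stand-in is sketched below; it uses keyed blake2b rather than the xxhash that getHasher defaults to, and mask_value is a hypothetical helper, not pdstools code:

```python
import hashlib

def mask_value(value: str, seed: bytes = b"example-seed") -> str:
    """Return a short, deterministic, seeded hash of a value.
    blake2b stands in here for the xxhash default of getHasher."""
    h = hashlib.blake2b(value.encode("utf-8"), key=seed, digest_size=8)
    return h.hexdigest()

# The same value always maps to the same token under the same seed,
# so joins and group-bys on masked columns still work.
print(mask_value("Accepted") == mask_value("Accepted"))  # True
print(mask_value("Accepted") == mask_value("Rejected"))  # False
```

Keeping the seed secret (or random per run, as seed='random' suggests) is what prevents the masking from being trivially reversed by hashing candidate values.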
- process(strategy='eager', **kwargs)¶
Anonymize the dataset.