pdstools.utils.hds_utils

Module Contents

Classes

Config

Configuration file for the data anonymizer.

DataAnonymization

Anonymize a historical dataset.

class Config(config_file: str | None = None, hds_folder: pathlib.Path = '.', use_datamart: bool = False, datamart_folder: pathlib.Path = 'datamart', output_format: Literal[ndjson, parquet, arrow, csv] = 'ndjson', output_folder: pathlib.Path = 'output', mapping_file: str = 'mapping.map', mask_predictor_names: bool = True, mask_context_key_names: bool = False, mask_ih_names: bool = True, mask_outcome_name: bool = False, mask_predictor_values: bool = True, mask_context_key_values: bool = True, mask_ih_values: bool = True, mask_outcome_values: bool = True, context_key_label: str = 'Context_*', ih_label: str = 'IH_*', outcome_column: str = 'Decision_Outcome', positive_outcomes: list = ['Accepted', 'Clicked'], negative_outcomes: list = ['Rejected', 'Impression'], special_predictors: list = ['Decision_DecisionTime', 'Decision_OutcomeTime', 'Decision_Rank'], sample_percentage_schema_inferencing: float = 0.01)

Configuration file for the data anonymizer.

Parameters:
  • config_file (str = None) – An optional path to a config file

  • hds_folder (Path = ".") – The path to the hds files

  • use_datamart (bool = False) – Whether to use the datamart to infer predictor types

  • datamart_folder (Path = "datamart") – The folder of the datamart files

  • output_format (Literal["ndjson", "parquet", "arrow", "csv"] = "ndjson") – The output format to write the files in

  • output_folder (Path = "output") – The location to write the files to

  • mapping_file (str = "mapping.map") – The name of the predictor mapping file

  • mask_predictor_names (bool = True) – Whether to mask the names of regular predictors

  • mask_context_key_names (bool = False) – Whether to mask the names of context key predictors

  • mask_ih_names (bool = True) – Whether to mask the name of Interaction History summary predictors

  • mask_outcome_name (bool = False) – Whether to mask the name of the outcome column

  • mask_predictor_values (bool = True) – Whether to mask the values of regular predictors

  • mask_context_key_values (bool = True) – Whether to mask the values of context key predictors

  • mask_ih_values (bool = True) – Whether to mask the values of Interaction History summary predictors

  • mask_outcome_values (bool = True) – Whether to mask the values of the outcomes to binary

  • context_key_label (str = "Context_*") – The pattern of names for context key predictors

  • ih_label (str = "IH_*") – The pattern of names for Interaction History summary predictors

  • outcome_column (str = "Decision_Outcome") – The name of the outcome column

  • positive_outcomes (list = ["Accepted", "Clicked"]) – Which positive outcomes to map to True

  • negative_outcomes (list = ["Rejected", "Impression"]) – Which negative outcomes to map to False

  • special_predictors (list = ["Decision_DecisionTime", "Decision_OutcomeTime", "Decision_Rank"]) – A list of special predictors which are not touched

  • sample_percentage_schema_inferencing (float = 0.01) – The fraction of records to sample when inferring column types (0.01 samples 1% of the data). If you get casting errors, increase this value so that a larger portion of the data is checked.
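
The label and outcome parameters above can be pictured with a small standard-library sketch. It assumes that context_key_label and ih_label behave like shell-style wildcards and that outcomes outside both lists are left unmapped; neither assumption is confirmed by this page. The helpers classify_column and mask_outcome are hypothetical and are not part of pdstools.

```python
from fnmatch import fnmatchcase

# Defaults copied from the Config signature above.
CONTEXT_KEY_LABEL = "Context_*"
IH_LABEL = "IH_*"
OUTCOME_COLUMN = "Decision_Outcome"
SPECIAL_PREDICTORS = ["Decision_DecisionTime", "Decision_OutcomeTime", "Decision_Rank"]
POSITIVE_OUTCOMES = ["Accepted", "Clicked"]
NEGATIVE_OUTCOMES = ["Rejected", "Impression"]


def classify_column(name: str) -> str:
    """Hypothetical helper: assign a column to one of the groups Config distinguishes."""
    if name == OUTCOME_COLUMN:
        return "outcome"
    if name in SPECIAL_PREDICTORS:
        return "special"
    # Assumption: the labels are shell-style wildcard patterns.
    if fnmatchcase(name, CONTEXT_KEY_LABEL):
        return "context_key"
    if fnmatchcase(name, IH_LABEL):
        return "ih"
    return "predictor"


def mask_outcome(value: str):
    """Hypothetical helper: map an outcome to True/False, or None if it is in neither list."""
    if value in POSITIVE_OUTCOMES:
        return True
    if value in NEGATIVE_OUTCOMES:
        return False
    return None
```

Each group returned by classify_column corresponds to one pair of mask_*_names / mask_*_values switches, which is why the groups need to be told apart before any masking happens.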

load_from_config_file(config_file: pathlib.Path)

Load the configurations from a file.

Parameters:

config_file (Path) – The path to the configuration file

save_to_config_file(file_name: str = None)

Save the configurations to a file.

Parameters:

file_name (str) – The name of the configuration file

validate_paths()

Validate that the output folder exists.

class DataAnonymization(config: Config | None = None, df: polars.LazyFrame | None = None, datamart: pdstools.ADMDatamart | None = None, **config_args)

Anonymize a historical dataset.

Parameters:
  • config (Optional[Config]) – Override the default configurations with the Config class

  • df (Optional[polars.LazyFrame]) – Manually supply a Polars lazyframe to anonymize

  • datamart (Optional[pdstools.ADMDatamart]) – Manually supply a Datamart file to infer predictor types

Keyword Arguments:

**config_args – See Config

Example

See https://pegasystems.github.io/pega-datascientist-tools/Python/articles/Example_Data_Anonymization.html

write_to_output(df: polars.DataFrame | None = None, ext: Literal[ndjson, parquet, arrow, csv] = None, mode: Literal[optimized, robust] = 'optimized')

Write the processed dataframe to an output file.

Parameters:
  • df (Optional[pl.DataFrame]) – Dataframe to write. If not provided, runs self.process()

  • ext (Literal["ndjson", "parquet", "arrow", "csv"]) – The file format (extension) to write the output in

  • mode (Literal['optimized', 'robust'], default = 'optimized') – Whether to output a single file (optimized) or maintain the same file structure as the original files (robust). Optimized should be faster, but robust should allow for bigger data as we don’t need all data in memory at the same time.
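
The trade-off between the two modes can be illustrated with a standard-library sketch on ndjson files. This is not how pdstools implements write_to_output; the function names, the output file name hds.ndjson, and the transform callback are all hypothetical. It only shows why a single combined file needs every record in memory while a per-file approach does not.

```python
import json
from pathlib import Path


def write_single_file(input_files, output_folder, transform):
    """'optimized'-style: gather every record first, then write one output file."""
    records = []
    for path in input_files:
        with open(path) as f:
            records.extend(transform(json.loads(line)) for line in f if line.strip())
    out = Path(output_folder) / "hds.ndjson"  # hypothetical file name
    out.write_text("".join(json.dumps(r) + "\n" for r in records))
    return [out]


def write_per_file(input_files, output_folder, transform):
    """'robust'-style: stream each input file to an output file of the same name."""
    outputs = []
    for path in input_files:
        out = Path(output_folder) / Path(path).name
        with open(path) as src, open(out, "w") as dst:
            for line in src:
                if line.strip():
                    # Only one record is held in memory at a time.
                    dst.write(json.dumps(transform(json.loads(line))) + "\n")
        outputs.append(out)
    return outputs
```

In this sketch write_single_file holds the full dataset in the records list, whereas write_per_file keeps the original file layout and never holds more than one record.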

create_mapping_file()

Create a file to write the column mapping to.

load_hds_files()

Load the historical dataset files from the config.hds_folder location.

read_predictor_type_from_file(df: polars.LazyFrame)

Infer the types of the predictors from the data.

This is non-trivial, because pulling all data into memory just for type inference is undesirable. Instead, we sample 1% of the data (or use all of it if there are fewer than 50 rows) and try to cast it to numeric. If the cast fails, the predictor is treated as categorical; otherwise it is treated as numeric.

You can override the result manually by setting the symbolic_predictors_to_mask and numeric_predictors_to_mask properties.
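
The sampling heuristic described above can be sketched for a single column in plain Python. This is an illustration under stated assumptions, not the pdstools implementation (which works on a Polars LazyFrame): the function infer_predictor_type and its parameter names are hypothetical, and "cast to numeric" is approximated with float().

```python
import random


def infer_predictor_type(values, sample_fraction=0.01, min_rows=50, seed=0):
    """Return "numeric" if every sampled non-null value casts to float, else "categorical"."""
    values = list(values)
    if len(values) < min_rows:
        sample = values  # too few rows to sample: check everything
    else:
        k = max(1, int(len(values) * sample_fraction))
        sample = random.Random(seed).sample(values, k)
    try:
        for v in sample:
            if v is not None:
                float(v)  # raises ValueError for non-numeric strings
    except (TypeError, ValueError):
        return "categorical"
    return "numeric"
```

A rare non-numeric value can be missed by a small sample, so a column may be labelled numeric and then fail to cast later. That is the situation in which raising sample_percentage_schema_inferencing is likely to help.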

Parameters:

df (pl.LazyFrame) – The lazyframe to infer the types with

static read_predictor_type_from_datamart(datamart_folder: pathlib.Path, datamart: pdstools.ADMDatamart = None)

The datamart contains type information about each predictor. This function extracts that information to infer types for the HDS.

Parameters:
  • datamart_folder (Path) – The path to the datamart files

  • datamart (ADMDatamart) – The direct ADMDatamart object

get_columns_by_type()

Get a list of columns for each type.

get_predictors_mapping()

Map the predictor names to their anonymized form.
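
As a rough picture of what such a mapping looks like, the sketch below replaces each name with an opaque sequential label. The PREDICTOR_ prefix and the helper build_name_mapping are hypothetical; this page does not state the naming scheme pdstools actually uses.

```python
def build_name_mapping(columns, prefix="PREDICTOR_"):
    """Hypothetical helper: map each original column name to an opaque sequential name."""
    return {name: f"{prefix}{i}" for i, name in enumerate(columns)}


mapping = build_name_mapping(["Customer_Age", "Customer_Income"])

# Keeping the inverse allows anonymized names to be translated back later,
# which is the kind of information a mapping file would hold.
inverse = {masked: original for original, masked in mapping.items()}
```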

getHasher(cols, algorithm='xxhash', seed='random', seed_1=None, seed_2=None, seed_3=None)

process(strategy='eager', **kwargs)

Anonymize the dataset.