pdstools.utils.hds_utils_experimental

Module Contents

Classes

Config

Configuration file for the data anonymizer.

DataAnonymization2

Anonymize a historical dataset.

class Config(config_file: str | None = None, hds_folder: pathlib.Path = '.', use_datamart: bool = False, datamart_folder: pathlib.Path = 'datamart', output_format: Literal[ndjson, parquet, arrow, csv] = 'ndjson', output_folder: pathlib.Path = 'output', mapping_file: str = 'mapping.map', mask_predictor_names: bool = False, mask_context_key_names: bool = False, mask_ih_names: bool = True, mask_outcome_name: bool = False, mask_predictor_values: bool = True, mask_context_key_values: bool = False, mask_ih_values: bool = True, mask_outcome_values: bool = True, context_key_label: str = 'Context_*', ih_label: str = 'IH_*', outcome_column: str = 'Decision_Outcome', positive_outcomes: list = ['Accepted', 'Clicked'], negative_outcomes: list = ['Rejected', 'Impression'], special_predictors: list = ['Decision_DecisionTime', 'Decision_OutcomeTime', 'Decision_Rank'], sample_percentage_schema_inferencing: float = 0.01)

Configuration file for the data anonymizer.

Parameters:
  • config_file (str = None) – An optional path to a config file

  • hds_folder (Path = ".") – The path to the hds files

  • use_datamart (bool = False) – Whether to use the datamart to infer predictor types

  • datamart_folder (Path = "datamart") – The folder of the datamart files

  • output_format (Literal["ndjson", "parquet", "arrow", "csv"] = "ndjson") – The output format to write the files in

  • output_folder (Path = "output") – The location to write the files to

  • mapping_file (str = "mapping.map") – The name of the predictor mapping file

  • mask_predictor_names (bool = False) – Whether to mask the names of regular predictors

  • mask_context_key_names (bool = False) – Whether to mask the names of context key predictors

  • mask_ih_names (bool = True) – Whether to mask the name of Interaction History summary predictors

  • mask_outcome_name (bool = False) – Whether to mask the name of the outcome column

  • mask_predictor_values (bool = True) – Whether to mask the values of regular predictors

  • mask_context_key_values (bool = False) – Whether to mask the values of context key predictors

  • mask_ih_values (bool = True) – Whether to mask the values of Interaction History summary predictors

  • mask_outcome_values (bool = True) – Whether to mask the values of the outcomes to binary

  • context_key_label (str = "Context_*") – The pattern of names for context key predictors

  • ih_label (str = "IH_*") – The pattern of names for Interaction History summary predictors

  • outcome_column (str = "Decision_Outcome") – The name of the outcome column

  • positive_outcomes (list = ["Accepted", "Clicked"]) – Which positive outcomes to map to True

  • negative_outcomes (list = ["Rejected", "Impression"]) – Which negative outcomes to map to False

  • special_predictors (list = ["Decision_DecisionTime", "Decision_OutcomeTime", "Decision_Rank"]) – A list of special predictors which are not touched

  • sample_percentage_schema_inferencing (float = 0.01) – The percentage of records to sample to infer the column type. If you’re getting casting errors, it may be useful to increase this percentage to check a larger portion of the data.
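To illustrate the outcome masking configured by positive_outcomes, negative_outcomes, and mask_outcome_values, the sketch below maps raw outcome labels to a binary value. This is a self-contained illustration of the idea, not the library’s actual implementation:

```python
# Sketch: reduce outcome labels to booleans, mirroring the
# positive_outcomes / negative_outcomes configuration above.
positive_outcomes = ["Accepted", "Clicked"]
negative_outcomes = ["Rejected", "Impression"]

def binarize_outcome(value):
    """Return True/False for known outcomes, None for unknown labels."""
    if value in positive_outcomes:
        return True
    if value in negative_outcomes:
        return False
    return None  # unknown outcomes are left unmapped in this sketch

print([binarize_outcome(o) for o in ["Accepted", "Impression", "Clicked"]])
# [True, False, True]
```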

load_from_config_file(config_file: pathlib.Path)

Load the configurations from a file.

Parameters:

config_file (Path) – The path to the configuration file

save_to_config_file(file_name: str = None)

Save the configurations to a file.

Parameters:

file_name (str) – The name of the configuration file

validate_paths()

Validate that the output folder exists.
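A minimal sketch of what such a path validation could look like, using only the standard library (the helper name and create-if-missing behavior are assumptions for illustration, not the method’s documented behavior):

```python
import tempfile
from pathlib import Path

def validate_output_folder(output_folder):
    """Create the output folder if it does not exist and return it as a Path."""
    folder = Path(output_folder)
    folder.mkdir(parents=True, exist_ok=True)
    return folder

# Demonstrate on a temporary location to avoid touching real paths.
out = validate_output_folder(Path(tempfile.mkdtemp()) / "output")
print(out.is_dir())  # True
```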

class DataAnonymization2(files, config=None, **config_args)

inferTypes(number_of_files_to_sample=20)

_inferTypes(df)

setPredictorTypes(typeDict: dict)

Parameters:

typeDict (dict)

setTypeForPredictor(predictor, type: Literal[numeric, symbolic])

Parameters:

type (Literal[numeric, symbolic])

getPredictorsWithTypes()
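The numeric/symbolic distinction used by the type methods above can be illustrated with a self-contained sketch. The rule here is an assumption for illustration: a column counts as numeric when every sampled non-null value parses as a float, and symbolic otherwise:

```python
def infer_type(values):
    """Classify a column as 'numeric' or 'symbolic' from sampled values."""
    for v in values:
        if v is None:
            continue  # nulls carry no type information
        try:
            float(v)
        except (TypeError, ValueError):
            return "symbolic"
    return "numeric"

print(infer_type(["1.5", "2", None]))  # numeric
print(infer_type(["A", "B", "C"]))     # symbolic
```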
getFiles(orig_files)

Load the historical dataset files from the config.hds_folder location.

create_mapping_file()

Create a file to write the column mapping to.

get_columns_by_type()

Get a list of columns for each type.

create_predictor_mapping()

Map the predictor names to their anonymized form.
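The idea of mapping predictor names to an anonymized form can be sketched as below. The PREDICTOR_n naming and the handling of the Context_* / IH_* prefixes are illustrative assumptions, not the library’s actual scheme:

```python
def create_predictor_mapping(columns, context_key_label="Context_*", ih_label="IH_*"):
    """Map original predictor names to anonymized placeholder names."""
    mapping = {}
    counter = 0
    for col in columns:
        # Context key and IH predictors are recognized by their prefix
        # and kept as-is in this sketch (they are masked separately).
        if col.startswith(context_key_label.rstrip("*")) or col.startswith(ih_label.rstrip("*")):
            mapping[col] = col
        else:
            mapping[col] = f"PREDICTOR_{counter}"
            counter += 1
    return mapping

cols = ["Age", "Income", "Context_Channel", "IH_Accept_Rate"]
print(create_predictor_mapping(cols))
```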

calculateNumericRanges(verbose=False)

setNumericRanges(rangeDict: dict)

Parameters:

rangeDict (dict)

setRangeForPredictor(predictorName, min, max)
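The per-predictor min/max ranges set above are typically used to normalize numeric values. A sketch of that idea, assuming plain min-max normalization into [0, 1] (which may differ from the library’s exact formula):

```python
def normalize(value, min_value, max_value):
    """Min-max normalize a value into [0, 1]."""
    if max_value == min_value:
        return 0.0  # degenerate range: map everything to 0
    return (value - min_value) / (max_value - min_value)

print(normalize(75.0, 50.0, 100.0))  # 0.5
```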
write_to_output(df: polars.DataFrame | None = None, ext: Literal[ndjson, parquet, arrow, csv] = None, mode: Literal[optimized, robust] = 'optimized', on_failed_file: Literal[warn, ignore, fail] = 'fail', verbose=False)

Write the processed dataframe to an output file.

Parameters:
  • df (Optional[pl.DataFrame]) – Dataframe to write. If not provided, runs self.process()

  • ext (Literal["ndjson", "parquet", "arrow", "csv"]) – What extension to write the file to

  • mode (Literal['optimized', 'robust'], default = 'optimized') – Whether to output a single file (optimized) or maintain the same file structure as the original files (robust). Optimized should be faster, but robust should allow for bigger data as we don’t need all data in memory at the same time.

  • on_failed_file (Literal[warn, ignore, fail] = 'fail') – What to do when a file fails to process: raise an error (fail), emit a warning and continue (warn), or skip it silently (ignore)

getHasher(cols, algorithm='xxhash', seed='random', seed_1=None, seed_2=None, seed_3=None)
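Masking values through hashing, as getHasher suggests, can be sketched with the standard library. Here sha256 stands in for the xxhash algorithm named in the signature, and the seed handling is an illustrative assumption:

```python
import hashlib

def hash_value(value, seed=0):
    """Deterministically mask a value with a seeded hash (sha256 stand-in)."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).hexdigest()
    return digest[:16]  # shortened for readability

a = hash_value("John Doe", seed=42)
b = hash_value("John Doe", seed=42)
print(a == b)  # True: the same seed and value always give the same mask
```

Because the hash is deterministic for a given seed, joins across anonymized files remain possible, while a different (or random) seed produces unlinkable output.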
to_normalize(cols, verbose=False)

process(df=None, strategy='eager', verbose=False, **kwargs)

Anonymize the dataset.