pdstools.utils.hds_utils_experimental
Module Contents¶
Classes¶
Config – Configuration file for the data anonymizer.
- class Config(config_file: str | None = None, hds_folder: pathlib.Path = '.', use_datamart: bool = False, datamart_folder: pathlib.Path = 'datamart', output_format: Literal['ndjson', 'parquet', 'arrow', 'csv'] = 'ndjson', output_folder: pathlib.Path = 'output', mapping_file: str = 'mapping.map', mask_predictor_names: bool = False, mask_context_key_names: bool = False, mask_ih_names: bool = True, mask_outcome_name: bool = False, mask_predictor_values: bool = True, mask_context_key_values: bool = False, mask_ih_values: bool = True, mask_outcome_values: bool = True, context_key_label: str = 'Context_*', ih_label: str = 'IH_*', outcome_column: str = 'Decision_Outcome', positive_outcomes: list = ['Accepted', 'Clicked'], negative_outcomes: list = ['Rejected', 'Impression'], special_predictors: list = ['Decision_DecisionTime', 'Decision_OutcomeTime', 'Decision_Rank'], sample_percentage_schema_inferencing: float = 0.01)¶
Configuration file for the data anonymizer.
- Parameters:
config_file (str = None) – An optional path to a config file
hds_folder (Path = ".") – The path to the hds files
use_datamart (bool = False) – Whether to use the datamart to infer predictor types
datamart_folder (Path = "datamart") – The folder of the datamart files
output_format (Literal["ndjson", "parquet", "arrow", "csv"] = "ndjson") – The output format to write the files in
output_folder (Path = "output") – The location to write the files to
mapping_file (str = "mapping.map") – The name of the predictor mapping file
mask_predictor_names (bool = False) – Whether to mask the names of regular predictors
mask_context_key_names (bool = False) – Whether to mask the names of context key predictors
mask_ih_names (bool = True) – Whether to mask the names of Interaction History summary predictors
mask_outcome_name (bool = False) – Whether to mask the name of the outcome column
mask_predictor_values (bool = True) – Whether to mask the values of regular predictors
mask_context_key_values (bool = False) – Whether to mask the values of context key predictors
mask_ih_values (bool = True) – Whether to mask the values of Interaction History summary predictors
mask_outcome_values (bool = True) – Whether to mask the outcome values to binary True/False
context_key_label (str = "Context_*") – The pattern of names for context key predictors
ih_label (str = "IH_*") – The pattern of names for Interaction History summary predictors
outcome_column (str = "Decision_Outcome") – The name of the outcome column
positive_outcomes (list = ["Accepted", "Clicked"]) – Which positive outcomes to map to True
negative_outcomes (list = ["Rejected", "Impression"]) – Which negative outcomes to map to False
special_predictors (list = ["Decision_DecisionTime", "Decision_OutcomeTime", "Decision_Rank"]) – A list of special predictors that are left untouched
sample_percentage_schema_inferencing (float = 0.01) – The percentage of records to sample when inferring the column types. If you run into casting errors, increasing this percentage checks a larger portion of the data.
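When mask_outcome_values is set, outcomes in positive_outcomes map to True and those in negative_outcomes map to False. A minimal sketch of that mapping in plain Python (the function name and the None fallback for unlisted outcomes are assumptions, not part of the class API):

```python
# Sketch of the binary outcome masking implied by the defaults above.
POSITIVE_OUTCOMES = {"Accepted", "Clicked"}
NEGATIVE_OUTCOMES = {"Rejected", "Impression"}

def mask_outcome(value):
    """Map a raw outcome label to True/False; unlisted labels become None."""
    if value in POSITIVE_OUTCOMES:
        return True
    if value in NEGATIVE_OUTCOMES:
        return False
    return None
```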
- load_from_config_file(config_file: pathlib.Path)¶
Load the configurations from a file.
- Parameters:
config_file (Path) – The path to the configuration file
- save_to_config_file(file_name: str | None = None)¶
Save the configurations to a file.
- Parameters:
file_name (str) – The name of the configuration file
- validate_paths()¶
Validate that the output folder exists.
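The on-disk format of the config file is not documented here; purely as an illustration of the save/load/validate cycle, a JSON round-trip with constructor-argument-style keys (the real file format is an assumption):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical config round-trip; the actual file format may differ.
config = {"output_format": "ndjson", "mask_predictor_values": True}

folder = Path(tempfile.mkdtemp())
config_file = folder / "config.json"
config_file.write_text(json.dumps(config))    # save_to_config_file analogue
loaded = json.loads(config_file.read_text())  # load_from_config_file analogue
assert folder.exists()                        # validate_paths analogue
```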
- class DataAnonymization2(files, config=None, **config_args)¶
- inferTypes(number_of_files_to_sample=20)¶
- _inferTypes(df)¶
- setPredictorTypes(typeDict: dict)¶
- Parameters:
typeDict (dict)
- setTypeForPredictor(predictor, type: Literal['numeric', 'symbolic'])¶
- Parameters:
type (Literal['numeric', 'symbolic'])
- getPredictorsWithTypes()¶
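inferTypes samples a subset of files to decide whether each predictor is numeric or symbolic. One plausible inference rule, sketched in plain Python (the exact heuristic the class uses is not documented here and is an assumption):

```python
def infer_type(sample_values):
    """Classify a column as 'numeric' if every non-null sample
    parses as a float, otherwise as 'symbolic'."""
    for value in sample_values:
        if value is None:
            continue
        try:
            float(value)
        except (TypeError, ValueError):
            return "symbolic"
    return "numeric"
```

If the sample is too small to hit the odd non-numeric value, inference can misclassify a column — which is why raising sample_percentage_schema_inferencing helps with casting errors.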
- getFiles(orig_files)¶
Load the historical dataset files from the config.hds_folder location.
- create_mapping_file()¶
Create a file to write the column mapping to.
- get_columns_by_type()¶
Get a list of columns for each type.
- create_predictor_mapping()¶
Map the predictor names to their anonymized form.
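A simplified sketch of such a name mapping (the PREDICTOR_n placeholder scheme and the prefix check are assumptions for illustration; special predictors are left untouched, per the special_predictors parameter):

```python
def create_mapping(columns, special_predictors, mask_context_keys=False):
    """Map column names to anonymized placeholders.

    Special predictors keep their names; context keys keep theirs
    unless mask_context_keys is set; everything else is renamed.
    """
    mapping = {}
    counter = 0
    for col in columns:
        if col in special_predictors:
            mapping[col] = col
        elif col.startswith("Context_") and not mask_context_keys:
            mapping[col] = col
        else:
            mapping[col] = f"PREDICTOR_{counter}"
            counter += 1
    return mapping
```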
- calculateNumericRanges(verbose=False)¶
- setNumericRanges(rangeDict: dict)¶
- Parameters:
rangeDict (dict)
- setRangeForPredictor(predictorName, min, max)¶
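calculateNumericRanges and setRangeForPredictor supply the per-predictor min and max used when masking numeric values. The usual min-max rescaling looks like this (the fallback to 0.0 for a degenerate range is an assumption):

```python
def normalize(value, min_, max_):
    """Min-max rescale a numeric value into [0, 1]."""
    if max_ == min_:
        return 0.0  # degenerate range: assumed fallback
    return (value - min_) / (max_ - min_)
```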
- write_to_output(df: polars.DataFrame | None = None, ext: Literal['ndjson', 'parquet', 'arrow', 'csv'] | None = None, mode: Literal['optimized', 'robust'] = 'optimized', on_failed_file: Literal['warn', 'ignore', 'fail'] = 'fail', verbose=False)¶
Write the processed dataframe to an output file.
- Parameters:
df (Optional[pl.DataFrame]) – Dataframe to write. If not provided, runs self.process()
ext (Literal["ndjson", "parquet", "arrow", "csv"]) – What extension to write the file to
mode (Literal['optimized', 'robust'], default = 'optimized') – Whether to write a single output file (optimized) or keep the same file structure as the original files (robust). Optimized is generally faster, but robust scales to larger datasets because not all data has to be held in memory at once.
on_failed_file (Literal['warn', 'ignore', 'fail'], default = 'fail') – What to do when a file fails to process: emit a warning, skip it silently, or raise an error
- getHasher(cols, algorithm='xxhash', seed='random', seed_1=None, seed_2=None, seed_3=None)¶
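getHasher defaults to the xxhash algorithm with a random seed. As a stand-in sketch using only the standard library (xxhash is a third-party dependency; hashlib's blake2b accepts a seed-like key, so it is substituted here), seeded hashing of symbolic values looks like:

```python
import hashlib

def make_hasher(seed: bytes):
    """Return a function that hashes values deterministically under `seed`.
    Stand-in for the class's xxhash-based hasher."""
    def hash_value(value):
        digest = hashlib.blake2b(str(value).encode(), key=seed, digest_size=8)
        return digest.hexdigest()
    return hash_value
```

The same seed reproduces the same anonymized values across files, while a fresh random seed makes the mapping unrecoverable between runs.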
- to_normalize(cols, verbose=False)¶
- process(df=None, strategy='eager', verbose=False, **kwargs)¶
Anonymize the dataset.
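Putting the pieces together, process applies the name mapping, value masking, and outcome mapping to each record. A compact, self-contained sketch of that flow (the helper name, the row-dict representation, and the PREDICTOR_ prefix convention are all assumptions for illustration):

```python
import hashlib

POSITIVE = {"Accepted", "Clicked"}

def anonymize_row(row, name_mapping, seed=b"seed"):
    """Rename columns, hash masked predictor values, binarize the outcome."""
    out = {}
    for col, value in row.items():
        new_name = name_mapping.get(col, col)
        if col == "Decision_Outcome":
            out[new_name] = value in POSITIVE  # binary outcome masking
        elif new_name.startswith("PREDICTOR_"):
            digest = hashlib.blake2b(str(value).encode(), key=seed, digest_size=8)
            out[new_name] = digest.hexdigest()  # masked predictor value
        else:
            out[new_name] = value  # context keys / special predictors untouched
    return out
```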