pdstools.pega_io.Anonymization¶
Classes¶
Anonymization – A utility class to efficiently anonymize Pega Datasets
Module Contents¶
- class Anonymization(path_to_files: str, temporary_path: str = '/tmp/anonymisation', output_file: str = 'anonymised.parquet', skip_columns_with_prefix: List[str] | None = None, batch_size: int = 500, file_limit: int | None = None)¶
A utility class to efficiently anonymize Pega Datasets
In particular, this class is aimed at anonymizing the Historical Dataset.
- Parameters:
- path_to_files (str) – Path to the data files to anonymize.
- temporary_path (str, optional) – Directory where intermediate parquet files are written. Defaults to '/tmp/anonymisation'.
- output_file (str, optional) – Name of the anonymized output file. Defaults to 'anonymised.parquet'.
- skip_columns_with_prefix (List[str] | None, optional) – Prefixes of column names to exclude from anonymization. Defaults to None, in which case ('Context_', 'Decision_') is used.
- batch_size (int, optional) – Number of files to bundle per batch. Defaults to 500.
- file_limit (int | None, optional) – Maximum number of files to process. Defaults to None (no limit).
- path_to_files¶
- temp_path = '/tmp/anonymisation'¶
- output_file = 'anonymised.parquet'¶
- skip_col_prefix = ('Context_', 'Decision_')¶
- batch_size = 500¶
- file_limit = None¶
- anonymize(verbose: bool = True)¶
Anonymize the data.
This method performs the anonymization process on the data files specified during initialization. It writes temporary parquet files, combines them into a single dataset, and writes the anonymized result to the specified output file.
- Parameters:
verbose (bool, optional) – Whether to print verbose output during the anonymization process. Defaults to True.
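As a minimal usage sketch (the input path is illustrative, and the import location is assumed from the module name above):

>>> from pdstools.pega_io import Anonymization
>>> anon = Anonymization(path_to_files="data/hds")  # illustrative path
>>> anon.anonymize(verbose=True)  # writes anonymised.parquet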
- static min_max(column_name: str, range: List[Dict[str, float]]) → polars.Expr¶
Normalize the values in a column using the min-max scaling method.
- Parameters:
- column_name (str) – Name of the column to normalize.
- range (List[Dict[str, float]]) – A list containing a dictionary with "min" and "max" keys giving the bounds used for scaling.
- Returns:
A Polars expression representing the normalized column.
- Return type:
pl.Expr
Examples
>>> range = [{"min": 0.0, "max": 100.0}]
>>> min_max("age", range)
Column "age" normalized using min-max scaling.
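The underlying transformation is presumably the standard min-max formula x' = (x - min) / (max - min); below is a minimal sketch of an equivalent Polars expression (an illustrative re-implementation, not the library's own code):

import polars as pl
from typing import Dict, List

def min_max_sketch(column_name: str, range: List[Dict[str, float]]) -> pl.Expr:
    # Illustrative: scale the column into [0, 1] using the supplied bounds.
    lo, hi = range[0]["min"], range[0]["max"]
    return (pl.col(column_name) - lo) / (hi - lo)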
- static _infer_types(df: polars.DataFrame)¶
Infers the types of columns in a DataFrame.
- Parameters:
df (polars.DataFrame) – The DataFrame for which to infer column types.
- Returns:
A dictionary mapping column names to their inferred types. The inferred types can be either “numeric” or “symbolic”.
- Return type:
dict
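A plausible sketch of such an inference, assuming the split is made on the Polars dtype (this heuristic is an assumption, not necessarily the library's actual rule):

import polars as pl

def infer_types_sketch(df: pl.DataFrame) -> dict:
    # Assumed heuristic: numeric dtypes map to "numeric", everything else to "symbolic".
    return {
        name: "numeric" if dtype.is_numeric() else "symbolic"
        for name, dtype in zip(df.columns, df.dtypes)
    }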
- static chunker(files: List[str], size: int)¶
Split a list of files into chunks of a specified size.
- Parameters:
- files (List[str]) – The list of file paths to split into chunks.
- size (int) – The number of files per chunk.
- Returns:
A generator that yields chunks of files.
- Return type:
generator
Examples
>>> files = ['file1.txt', 'file2.txt', 'file3.txt', 'file4.txt', 'file5.txt']
>>> chunks = chunker(files, 2)
>>> for chunk in chunks:
...     print(chunk)
['file1.txt', 'file2.txt']
['file3.txt', 'file4.txt']
['file5.txt']
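The behaviour shown above can be reproduced with simple list slicing; a minimal sketch:

from typing import Generator, List

def chunker_sketch(files: List[str], size: int) -> Generator[List[str], None, None]:
    # Yield successive slices of at most `size` files; the last chunk may be shorter.
    for start in range(0, len(files), size):
        yield files[start:start + size]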
- chunk_to_parquet(files: List[str], i) → str¶
Convert a chunk of files to Parquet format.
- Parameters:
- files (List[str]) – List of file paths to be converted.
- i – Index of the chunk.
- Returns:
The file path of the converted Parquet file, written to the temporary directory temp_path.
- Return type:
str
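A hedged sketch of the conversion step, assuming JSON-lines input files and a chunk_<i>.parquet naming scheme (both assumptions; the actual input format and file name may differ):

import os
from typing import List

import polars as pl

def chunk_to_parquet_sketch(files: List[str], temp_path: str, i: int) -> str:
    # Assumption: the chunk consists of newline-delimited JSON files.
    df = pl.concat([pl.read_ndjson(f) for f in files], how="diagonal")
    out = os.path.join(temp_path, f"chunk_{i}.parquet")  # assumed naming scheme
    df.write_parquet(out)
    return out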
- preprocess(verbose: bool) → List[str]¶
Preprocesses the files in the specified path.
- Parameters:
verbose (bool) – Set to True to get a progress bar for the file count.
- Returns:
A list of the temporary bundled parquet files.
- Return type:
List[str]
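Putting the pieces together, preprocess plausibly chunks the input files and converts each chunk in turn, with an optional progress bar; a sketch under those assumptions, reusing the hypothetical helpers from the sketches above:

import glob
from typing import List, Optional

from tqdm import tqdm

def preprocess_sketch(
    path_to_files: str, batch_size: int, file_limit: Optional[int], verbose: bool
) -> List[str]:
    # Assumption: path_to_files is usable as a glob pattern.
    files = sorted(glob.glob(path_to_files))[:file_limit]
    chunks = list(chunker_sketch(files, batch_size))
    iterator = tqdm(chunks) if verbose else chunks  # optional progress bar
    return [
        chunk_to_parquet_sketch(chunk, "/tmp/anonymisation", i)
        for i, chunk in enumerate(iterator)
    ]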