pdstools.pega_io.Anonymization

Classes

Anonymization

A utility class to efficiently anonymize Pega Datasets

Module Contents

class Anonymization(path_to_files: str, temporary_path: str = '/tmp/anonymisation', output_file: str = 'anonymised.parquet', skip_columns_with_prefix: List[str] | None = None, batch_size: int = 500, file_limit: int | None = None)

A utility class to efficiently anonymize Pega Datasets

In particular, this class is aimed at anonymizing the Historical Dataset.

Parameters:
  • path_to_files (str) – Path to the data files to anonymize.

  • temporary_path (str) – Directory in which the temporary bundled parquet files are written.

  • output_file (str) – File the anonymized data is written to.

  • skip_columns_with_prefix (Optional[List[str]]) – Prefixes of columns that are skipped during anonymization (see skip_col_prefix below for the default).

  • batch_size (int) – Number of files to bundle into each temporary parquet file.

  • file_limit (Optional[int]) – Optional limit on the number of files to process.

path_to_files
temp_path = '/tmp/anonymisation'
output_file = 'anonymised.parquet'
skip_col_prefix = ('Context_', 'Decision_')
batch_size = 500
file_limit = None

anonymize(verbose: bool = True)

Anonymize the data.

This method runs the anonymization process on the data files specified during initialization: it bundles the source files into temporary parquet files, anonymizes and combines them into a single file, and writes the result to the configured output file.

Parameters:

verbose (bool, optional) – Whether to print verbose output during the anonymization process. Defaults to True.
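
Examples

A minimal usage sketch; the input path is a placeholder for your own dataset export:

>>> from pdstools.pega_io.Anonymization import Anonymization
>>> anon = Anonymization(path_to_files="data/hds")
>>> anon.anonymize()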

static min_max(column_name: str, range: List[Dict[str, float]]) → polars.Expr

Normalize the values in a column using the min-max scaling method.

Parameters:
  • column_name (str) – The name of the column to be normalized.

  • range (List[Dict[str, float]]) – A list of dictionaries containing the minimum and maximum values for the column.

Returns:

A Polars expression representing the normalized column.

Return type:

pl.Expr

Examples

>>> range = [{"min": 0.0, "max": 100.0}]
>>> expr = Anonymization.min_max("age", range)  # a pl.Expr scaling "age" into [0, 1]
static _infer_types(df: polars.DataFrame)

Infers the types of columns in a DataFrame.

Parameters:
  • df (pl.DataFrame) – The DataFrame for which to infer column types.

Returns:

A dictionary mapping column names to their inferred types. The inferred types can be either “numeric” or “symbolic”.

Return type:

dict
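
A minimal sketch of dtype-based inference, assuming the split is between numeric and non-numeric dtypes (an illustration, not the exact implementation; the column names are hypothetical):

>>> import polars as pl
>>> df = pl.DataFrame({"Age": [25, 38], "Channel": ["Web", "Mobile"]})
>>> {c: ("numeric" if dt.is_numeric() else "symbolic") for c, dt in df.schema.items()}
{'Age': 'numeric', 'Channel': 'symbolic'}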

static chunker(files: List[str], size: int)

Split a list of files into chunks of a specified size.

Parameters:
  • files (List[str]) – A list of file names.

  • size (int) – The size of each chunk.

Returns:

A generator that yields chunks of files.

Return type:

generator

Examples

>>> files = ['file1.txt', 'file2.txt', 'file3.txt', 'file4.txt', 'file5.txt']
>>> chunks = chunker(files, 2)
>>> for chunk in chunks:
...     print(chunk)
['file1.txt', 'file2.txt']
['file3.txt', 'file4.txt']
['file5.txt']
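
A minimal reimplementation sketch with the same slicing semantics, for illustration:

>>> def chunker(files, size):
...     return (files[pos:pos + size] for pos in range(0, len(files), size))
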
chunk_to_parquet(files: List[str], i) → str

Convert a chunk of files to Parquet format.

Parameters:
  • files (List[str]) – List of file paths to be converted.

  • i – Index of the chunk.

The Parquet file is saved to the temporary directory configured at initialization (temp_path).

Returns:

File path of the converted Parquet file.

Return type:

str
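
A hedged sketch of what one chunk conversion could look like; the JSON reading, the diagonal concat, and the file naming are assumptions rather than the exact implementation:

>>> import polars as pl
>>> def chunk_to_parquet_sketch(files, temp_path, i):
...     df = pl.concat([pl.read_ndjson(f) for f in files], how="diagonal")
...     out = f"{temp_path}/{i}.parquet"
...     df.write_parquet(out)
...     return out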

preprocess(verbose: bool) → List[str]

Preprocesses the files in the specified path.

Parameters:
  • verbose (bool) – Set to True to get a progress bar for the file count.

Returns:

A list of the temporary bundled parquet files.

Return type:

List[str]
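
A usage sketch (the path is a placeholder); preprocess bundles the raw files into temporary parquet files for process to consume:

>>> anon = Anonymization(path_to_files="data/hds")
>>> temp_files = anon.preprocess(verbose=False)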

process(chunked_files: List[str], verbose: bool = True)

Process the data for anonymization.

Parameters:
  • chunked_files (List[str]) – A list of the bundled temporary parquet files to process.

  • verbose (bool) – Whether to print verbose output. Default is True.

Raises:

ImportError – If polars-hash is not installed.

Returns:

None
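
Putting the low-level steps together, a hedged end-to-end sketch (the path is a placeholder, and the optional polars-hash dependency must be installed):

>>> from pdstools.pega_io.Anonymization import Anonymization
>>> anon = Anonymization(path_to_files="data/hds")
>>> chunked_files = anon.preprocess(verbose=False)
>>> anon.process(chunked_files, verbose=False)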