pdstools.pega_io.Anonymization

Hash-based anonymisation of Pega Historical Datasets.

Attributes

logger

Classes

Anonymization

Anonymise Pega datasets (in particular, the Historical Dataset).

Module Contents

logger
class Anonymization(path_to_files: str, temporary_path: str | None = None, output_file: str = 'anonymised.parquet', skip_columns_with_prefix: list[str] | tuple[str, ...] | None = None, batch_size: int = 500, file_limit: int | None = None)

Anonymise Pega datasets (in particular, the Historical Dataset).

Numeric columns are min-max scaled to [0, 1]. Symbolic columns are hashed with SHA-256. Columns whose name starts with one of the skip_columns_with_prefix values are passed through unchanged (by default Context_* and Decision_*).

Once constructed, call anonymize() to run the pipeline. All file system work happens then; __init__ is pure.
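The symbolic transform described above boils down to a SHA-256 digest of each value. A minimal standalone sketch (the function name is illustrative, not part of the pdstools API):

```python
import hashlib


def hash_symbolic(value: str) -> str:
    # SHA-256 of the UTF-8 bytes, as a 64-character hex digest.
    # Equal inputs map to equal outputs, so group-bys and joins on the
    # column still behave the same after anonymisation.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()
```

Because the digest is deterministic, the anonymised column preserves cardinality and value equality while hiding the original strings.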

Parameters:
  • path_to_files (str) – Glob pattern matching the input files, e.g. "~/Downloads/*.json".

  • temporary_path (str, optional) – Directory used for intermediate parquet chunks. Defaults to a fresh tempfile.mkdtemp directory created on first use.

  • output_file (str, default="anonymised.parquet") – Path to write the final anonymised parquet file.

  • skip_columns_with_prefix (list[str], optional) – Column-name prefixes to leave unchanged. Defaults to ("Context_", "Decision_").

  • batch_size (int, default=500) – Number of input files combined per intermediate parquet chunk.

  • file_limit (int, optional) – Process at most this many files (useful for testing).

Examples

>>> Anonymization(
...     path_to_files="~/Downloads/*.json",
...     batch_size=1000,
...     file_limit=10,
... ).anonymize()
path_to_files
_temp_path: str | None = None
output_file = 'anonymised.parquet'
skip_col_prefix: tuple[str, ...] = ('Context_', 'Decision_')
batch_size = 500
file_limit = None
property temp_path: str

Lazily create (and cache) the temp directory.

Return type:

str

anonymize(verbose: bool = True) → None

Run the full anonymisation pipeline.

Parameters:

verbose (bool, default=True) – Print progress messages between stages.

Return type:

None

static min_max(column_name: str, value_range: list[dict[str, float]]) → polars.Expr

Return a min-max scaling expression for column_name.

Parameters:
  • column_name (str) – Column to normalise.

  • value_range (list[dict[str, float]]) – Single-element list whose dict has "min" and "max" keys, matching the shape produced by Polars when collecting a struct of min/max aggregations.

Returns:

(col - min) / (max - min), or the literal 0.0 when min == max.

Return type:

pl.Expr
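Per value, the expression above reduces to simple arithmetic. A plain-Python sketch of the same formula, including the constant-column guard (the function name is illustrative, not part of the API):

```python
def min_max_scale(value: float, mn: float, mx: float) -> float:
    # (value - min) / (max - min), with constant columns
    # (min == max) collapsing to the literal 0.0
    if mx == mn:
        return 0.0
    return (value - mn) / (mx - mn)
```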

static _infer_types(df: polars.DataFrame) → dict[str, str]

Classify each column as "numeric" or "symbolic".

A column is considered numeric if its values can be cast to Float64 (after replacing empty strings with null).

Parameters:

df (polars.DataFrame)

Return type:

dict[str, str]
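A hypothetical per-column analogue of this check, written without polars to show the classification rule (empty strings count as null; the column is numeric only if every remaining value casts cleanly to float):

```python
def infer_type(values: list[str]) -> str:
    # Sketch of the rule above, not the packaged implementation
    for v in values:
        if v == "":
            # empty strings are treated as null and ignored
            continue
        try:
            float(v)
        except ValueError:
            return "symbolic"
    return "numeric"
```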

static chunker(files: list[str], size: int) → collections.abc.Iterator[list[str]]

Yield successive size-element slices of files.

Parameters:
  • files (list[str]) – File paths to slice into chunks.

  • size (int) – Maximum number of paths per slice.

Return type:

collections.abc.Iterator[list[str]]
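The slicing behaviour described above can be re-implemented in a few lines (a sketch for illustration, not the packaged code):

```python
from collections.abc import Iterator


def chunker(files: list[str], size: int) -> Iterator[list[str]]:
    # Yield successive size-element slices; the last slice
    # may hold fewer than size elements
    for start in range(0, len(files), size):
        yield files[start:start + size]
```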

chunk_to_parquet(files: list[str], i: int) → str

Read a chunk of NDJSON files and write them as a parquet file.

Parameters:
  • files (list[str]) – NDJSON file paths to combine.

  • i (int) – Chunk index (used in the output filename).

Returns:

Path to the parquet file produced.

Return type:

str

preprocess(verbose: bool) → list[str]

Convert input files into intermediate parquet chunks.

Parameters:

verbose (bool) – Show a tqdm progress bar over chunks (if installed).

Returns:

Paths to the temporary chunked parquet files.

Return type:

list[str]

process(chunked_files: list[str], verbose: bool = True) → None

Hash, scale, and write the final anonymised parquet file.

Parameters:
  • chunked_files (list[str]) – Intermediate parquet files produced by preprocess().

  • verbose (bool, default=True) – Print which columns will be hashed / scaled / preserved.

Raises:

MissingDependenciesException – When polars-hash is not installed.

Return type:

None