pdstools.pega_io.Anonymization
==============================

.. py:module:: pdstools.pega_io.Anonymization


Classes
-------

.. autoapisummary::

   pdstools.pega_io.Anonymization.Anonymization


Module Contents
---------------

.. py:class:: Anonymization(path_to_files: str, temporary_path: str = '/tmp/anonymisation', output_file: str = 'anonymised.parquet', skip_columns_with_prefix: Optional[List[str]] = None, batch_size: int = 500, file_limit: Optional[int] = None)

   A utility class to efficiently anonymize Pega datasets.

   In particular, this class is aimed at anonymizing the Historical Dataset.

   .. py:attribute:: path_to_files


   .. py:attribute:: temp_path
      :value: '/tmp/anonymisation'


   .. py:attribute:: output_file
      :value: 'anonymised.parquet'


   .. py:attribute:: skip_col_prefix
      :value: ('Context_', 'Decision_')


   .. py:attribute:: batch_size
      :value: 500


   .. py:attribute:: file_limit
      :value: None


   .. py:method:: anonymize(verbose: bool = True)

      Anonymize the data.

      This method runs the anonymization process on the data files specified
      during initialization: it writes temporary parquet files, combines them
      into a single parquet file, and writes the anonymized data to the
      specified output file.

      :param verbose: Whether to print verbose output during the anonymization process. Defaults to True.
      :type verbose: bool, optional
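      .. rubric:: Examples

      A minimal sketch of end-to-end usage; the input path is hypothetical
      and should point at your own Historical Dataset files.

      >>> from pdstools.pega_io.Anonymization import Anonymization
      >>> anon = Anonymization(path_to_files="data/hds")  # hypothetical path
      >>> anon.anonymize()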
   .. py:method:: min_max(column_name: str, range: List[Dict[str, float]]) -> polars.Expr
      :staticmethod:

      Normalize the values in a column using the min-max scaling method.

      :param column_name: The name of the column to be normalized.
      :type column_name: str
      :param range: A list of dictionaries containing the minimum and maximum values for the column.
      :type range: List[Dict[str, float]]

      :returns: A Polars expression representing the normalized column.
      :rtype: pl.Expr

      .. rubric:: Examples

      >>> range = [{"min": 0.0, "max": 100.0}]
      >>> expr = Anonymization.min_max("age", range)

      The returned expression normalizes the "age" column using min-max scaling.


   .. py:method:: _infer_types(df: polars.DataFrame)
      :staticmethod:

      Infer the types of the columns in a DataFrame.

      :param df: The DataFrame for which to infer column types.
      :type df: pl.DataFrame

      :returns: A dictionary mapping column names to their inferred types,
                either "numeric" or "symbolic".
      :rtype: dict


   .. py:method:: chunker(files: List[str], size: int)
      :staticmethod:

      Split a list of files into chunks of a specified size.

      :param files: A list of file names.
      :type files: List[str]
      :param size: The size of each chunk.
      :type size: int

      :returns: A generator that yields chunks of files.
      :rtype: generator

      .. rubric:: Examples

      >>> files = ['file1.txt', 'file2.txt', 'file3.txt', 'file4.txt', 'file5.txt']
      >>> chunks = Anonymization.chunker(files, 2)
      >>> for chunk in chunks:
      ...     print(chunk)
      ['file1.txt', 'file2.txt']
      ['file3.txt', 'file4.txt']
      ['file5.txt']


   .. py:method:: chunk_to_parquet(files: List[str], i) -> str

      Convert a chunk of files to Parquet format.

      The resulting file is written to the temporary directory configured at
      initialization.

      :param files: List of file paths to be converted.
      :type files: List[str]
      :param i: Index of the chunk, used to name the output file.

      :returns: File path of the converted Parquet file.
      :rtype: str


   .. py:method:: preprocess(verbose: bool) -> List[str]

      Preprocess the files in the specified path.

      :param verbose: Set to True to get a progress bar for the file count.
      :type verbose: bool

      :returns: A list of the temporary bundled parquet files.
      :rtype: List[str]


   .. py:method:: process(chunked_files: List[str], verbose: bool = True)

      Process the data for anonymization.

      :param chunked_files: A list of the bundled temporary parquet files to process.
      :type chunked_files: List[str]
      :param verbose: Whether to print verbose output. Default is True.
      :type verbose: bool

      :raises ImportError: If polars-hash is not installed.

      :returns: None
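      .. rubric:: Examples

      A sketch of running the preprocessing and processing steps manually
      instead of calling ``anonymize()``; the input path is hypothetical.

      >>> anon = Anonymization(path_to_files="data/hds")  # hypothetical path
      >>> chunked_files = anon.preprocess(verbose=False)
      >>> anon.process(chunked_files)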