pdstools.pega_io
================

.. py:module:: pdstools.pega_io


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/pdstools/pega_io/API/index
   /autoapi/pdstools/pega_io/Anonymization/index
   /autoapi/pdstools/pega_io/File/index
   /autoapi/pdstools/pega_io/S3/index


Classes
-------

.. autoapisummary::

   pdstools.pega_io.Anonymization
   pdstools.pega_io.S3Data


Functions
---------

.. autoapisummary::

   pdstools.pega_io._read_client_credential_file
   pdstools.pega_io.get_token
   pdstools.pega_io.cache_to_file
   pdstools.pega_io.find_files
   pdstools.pega_io.get_latest_file
   pdstools.pega_io.read_dataflow_output
   pdstools.pega_io.read_ds_export
   pdstools.pega_io.read_multi_zip
   pdstools.pega_io.read_zipped_file


Package Contents
----------------

.. py:class:: Anonymization(path_to_files: str, temporary_path: str = '/tmp/anonymisation', output_file: str = 'anonymised.parquet', skip_columns_with_prefix: Optional[List[str]] = None, batch_size: int = 500, file_limit: Optional[int] = None)

   A utility class to efficiently anonymize Pega datasets.

   In particular, this class is aimed at anonymizing the Historical Dataset.


   .. py:attribute:: path_to_files


   .. py:attribute:: temp_path
      :value: '/tmp/anonymisation'


   .. py:attribute:: output_file
      :value: 'anonymised.parquet'


   .. py:attribute:: skip_col_prefix
      :value: ('Context_', 'Decision_')


   .. py:attribute:: batch_size
      :value: 500


   .. py:attribute:: file_limit
      :value: None


   .. py:method:: anonymize(verbose: bool = True)

      Anonymize the data.

      This method performs the anonymization process on the data files specified
      during initialization. It writes temporary parquet files, processes and
      combines them into a single file, and writes the anonymized data to the
      specified output file.

      :param verbose: Whether to print verbose output during the anonymization process.
                      Defaults to True.
      :type verbose: bool, optional


   .. py:method:: min_max(column_name: str, range: List[Dict[str, float]]) -> polars.Expr
      :staticmethod:

      Normalize the values in a column using the min-max scaling method.

      :param column_name: The name of the column to be normalized.
      :type column_name: str
      :param range: A list of dictionaries containing the minimum and maximum values for the column.
      :type range: List[Dict[str, float]]

      :returns: A Polars expression representing the normalized column.
      :rtype: pl.Expr

      .. rubric:: Examples

      >>> range = [{"min": 0.0, "max": 100.0}]
      >>> min_max("age", range)
      Column "age" normalized using min-max scaling.


   .. py:method:: _infer_types(df: polars.DataFrame)
      :staticmethod:

      Infers the types of columns in a DataFrame.

      :param df: The DataFrame for which to infer column types.
      :type df: pl.DataFrame

      :returns: A dictionary mapping column names to their inferred types.
                The inferred types can be either "numeric" or "symbolic".
      :rtype: dict


   .. py:method:: chunker(files: List[str], size: int)
      :staticmethod:

      Split a list of files into chunks of a specified size.

      :param files: A list of file names.
      :type files: List[str]
      :param size: The size of each chunk.
      :type size: int

      :returns: A generator that yields chunks of files.
      :rtype: generator

      .. rubric:: Examples

      >>> files = ['file1.txt', 'file2.txt', 'file3.txt', 'file4.txt', 'file5.txt']
      >>> chunks = chunker(files, 2)
      >>> for chunk in chunks:
      ...     print(chunk)
      ['file1.txt', 'file2.txt']
      ['file3.txt', 'file4.txt']
      ['file5.txt']


   .. py:method:: chunk_to_parquet(files: List[str], i) -> str

      Convert a chunk of files to Parquet format.

      The Parquet file is saved in the temporary directory (``temp_path``).

      :param files: List of file paths to be converted.
      :type files: List[str]
      :param i: Index of the chunk.

      :returns: File path of the converted Parquet file.
      :rtype: str


   .. py:method:: preprocess(verbose: bool) -> List[str]

      Preprocesses the files in the specified path.
      :param verbose: Set to True to get a progress bar for the file count
      :type verbose: bool

      :returns: A list of the temporary bundled parquet files
      :rtype: list[str]


   .. py:method:: process(chunked_files: List[str], verbose: bool = True)

      Process the data for anonymization.

      :param chunked_files: A list of the bundled temporary parquet files to process
      :type chunked_files: list[str]
      :param verbose: Whether to print verbose output. Default is True.
      :type verbose: bool

      :returns: None

      :raises ImportError: If polars-hash is not installed.


.. py:function:: _read_client_credential_file(credential_file: os.PathLike)

.. py:function:: get_token(credential_file: os.PathLike, verify: bool = True)

   Get API credentials to a Pega Platform instance.

   After setting up OAuth2 authentication in Dev Studio, you should
   be able to download a credential file. Simply point this method to that file,
   and it'll read the relevant properties and give you your access token.

   :param credential_file: The credential file downloaded after setting up OAuth in a Pega system
   :type credential_file: str
   :param verify: Whether to only allow safe SSL requests.
                  If you're connecting to an unsecured API endpoint, you need to
                  explicitly set verify to False; otherwise SSL verification
                  of the request will fail.
   :type verify: bool, default = True


.. py:function:: cache_to_file(df: Union[polars.DataFrame, polars.LazyFrame], path: Union[str, os.PathLike], name: str, cache_type: Literal['parquet'] = 'parquet', compression: polars._typing.ParquetCompression = 'uncompressed') -> pathlib.Path
                 cache_to_file(df: Union[polars.DataFrame, polars.LazyFrame], path: Union[str, os.PathLike], name: str, cache_type: Literal['ipc'] = 'ipc', compression: polars._typing.IpcCompression = 'uncompressed') -> pathlib.Path

   Very simple convenience function to cache data.
   Caches in arrow format for very fast reading.

   :param df: The dataframe to cache
   :type df: pl.DataFrame
   :param path: The location to cache the data
   :type path: os.PathLike
   :param name: The name to give to the file
   :type name: str
   :param cache_type: The type of file to export.
                      Default is IPC, also supports parquet
   :type cache_type: str
   :param compression: The compression to apply, default is uncompressed
   :type compression: str

   :returns: The filepath to the cached file
   :rtype: os.PathLike


.. py:function:: find_files(files_dir, target)

.. py:function:: get_latest_file(path: Union[str, os.PathLike], target: str, verbose: bool = False) -> str

   Convenience method to find the latest model snapshot.
   It has a set of default names to search for and finds all files that match them.
   Once it finds all matching files in the directory, it chooses the most recent one.
   Supports [".json", ".csv", ".zip", ".parquet", ".feather", ".ipc"].
   Needs a path to the directory and a target of either 'model_data' or 'predictor_data'.

   :param path: The filepath where the data is stored
   :type path: str
   :param target: Whether to look for data about the predictive models ('model_data')
                  or the predictor bins ('predictor_data')
   :type target: str in ['model_data', 'predictor_data']
   :param verbose: Whether to print all found files before comparing name criteria for debugging purposes
   :type verbose: bool, default = False

   :returns: The most recent file given the file name criteria.
   :rtype: str


.. py:function:: read_dataflow_output(files: Union[Iterable[str], str], cache_file_name: Optional[str] = None, *, extension: Literal['json'] = 'json', compression: Literal['gzip'] = 'gzip', cache_directory: Union[str, os.PathLike] = 'cache')

   Reads the file output of a dataflow run.

   By default, the Prediction Studio data export also uses dataflows,
   thus this function can be used for those use cases as well.

   Because dataflows have good resiliency, they can produce a great number of files.
   By default, every few seconds each dataflow node writes a file for each partition.
   While this helps the system stay healthy, it makes the output a bit more difficult to consume.
   This function can take in a list of files (or a glob pattern), and read in all of the files.

   If `cache_file_name` is specified, this function caches the data it read before as a `parquet` file.
   This not only reduces the file size, it is also very fast. When this function is run
   and there is a pre-existing parquet file with the name specified in `cache_file_name`,
   it will read all of the files that weren't read in before and add them to the parquet file.
   If no new files are found, it simply returns the contents of that parquet file,
   significantly speeding up operations.

   In a future version, the functionality of this function will be extended to also
   read from S3 or other remote file systems directly using the same caching method.

   :param files: An iterable (list or a glob) of file strings to read.
                 If a string is provided, we call glob() on it to find all matching files
   :type files: Union[str, Iterable[str]]
   :param cache_file_name: If given, caches the files to a file with the given name.
                           If None, does not use the cache at all
   :type cache_file_name: str, Optional
   :param extension: The extension of the files, by default "json"
   :type extension: Literal["json"]
   :param compression: The compression of the files, by default "gzip"
   :type compression: Literal["gzip"]
   :param cache_directory: The file path to cache the previously read files
   :type cache_directory: os.PathLike

   .. rubric:: Usage

   >>> from glob import glob
   >>> read_dataflow_output(files=glob("model_snapshots_*.json"))


.. py:function:: read_ds_export(filename: Union[str, io.BytesIO], path: Union[str, os.PathLike] = '.', verbose: bool = False, **reading_opts) -> Optional[polars.LazyFrame]

   Read in most out of the box Pega dataset export formats

   Accepts one of the following formats:

   - .csv
   - .json
   - .zip (zipped json or CSV)
   - .feather
   - .ipc
   - .parquet

   It automatically infers the default file names for both model data as well as predictor data.
   If you supply either 'modelData' or 'predictorData' as the 'filename' argument, it will search for them.
   If you supply the full name of the file in the 'path' directory, it will import that instead.
   Since pdstools V3.x, returns a Polars LazyFrame. Simply call `.collect()` to get an eager frame.

   :param filename: Can be one of the following:

                    - A string with the full path to the file
                    - A string with the name of the file (to be searched in the given path)
                    - A BytesIO object containing the file data (e.g., from an uploaded file in a webapp)
   :type filename: Union[str, BytesIO]
   :param path: The location of the file
   :type path: str, default = '.'
   :param verbose: Whether to print out which file will be imported
   :type verbose: bool, default = False
   :keyword reading_opts: Any arguments to plug into the scan_* function from Polars.

   :returns: The (lazy) dataframe
   :rtype: pl.LazyFrame

   .. rubric:: Examples

   >>> df = read_ds_export(filename='full/path/to/ModelSnapshot.json')
   >>> df = read_ds_export(filename='ModelSnapshot.json', path='data/ADMData')
   >>> df = read_ds_export(filename=uploaded_file)  # Where uploaded_file is a BytesIO object


.. py:function:: read_multi_zip(files: Iterable[str], zip_type: Literal['gzip'] = 'gzip', add_original_file_name: bool = False, verbose: bool = True) -> polars.LazyFrame

   Reads multiple zipped ndjson files, and concats them to one Polars dataframe.

   :param files: The list of files to concat
   :type files: list
   :param zip_type: At this point, only 'gzip' is supported
   :type zip_type: Literal['gzip']
   :param verbose: Whether to print out the progress of the import
   :type verbose: bool, default = True


.. py:function:: read_zipped_file(file: Union[str, io.BytesIO], verbose: bool = False) -> Tuple[io.BytesIO, str]

   Read a zipped NDJSON file.

   Reads a dataset export file as exported and downloaded from Pega. The export
   file is formatted as a zipped multi-line JSON file. It reads the file,
   and then returns the file as a BytesIO object.

   :param file: The full path to the file
   :type file: str
   :param verbose: Whether to print the names of the files within the unzipped file for debugging purposes
   :type verbose: bool, default = False

   :returns: The raw bytes object to pass through to Polars
   :rtype: io.BytesIO


.. py:class:: S3Data(bucketName: str, temp_dir='./s3_download')

   .. py:attribute:: bucketName


   .. py:attribute:: temp_dir
      :value: './s3_download'


   .. py:method:: getS3Files(prefix, use_meta_files=False, verbose=True)
      :async:

      OOTB file exports can be written in many very small files.

      This method asynchronously retrieves these files, and puts them in
      a temporary directory.

      The logic, if `use_meta_files` is True, is:

      1. Take the prefix and add a `.` in front of its last path element
         (`'path/to/files'` becomes `'path/to/.files'`):

         * rsplit on `/` (`['path/to', 'files']`)
         * take the last element (`'files'`)
         * add `.` in front of it (`'.files'`)
         * concat back to a filepath (`'path/to/.files'`)

      2. Fetch all files in the repo that adhere to the prefix (`'path/to/.files*'`)

      3. For each file, if the file ends with `.meta`:

         * rsplit on `/` (`['path/to', '.files_001.json.meta']`)
         * for the last element (just the filename), strip the period and the `.meta` (`['path/to', 'files_001.json']`)
         * concat back to a filepath (`'path/to/files_001.json'`)

      4. Import all files in the list

      If `use_meta_files` is False, the logic is as simple as:

      1. Import all files starting with the prefix
         (`'path/to/files'` gives
         `['path/to/files_001.json', 'path/to/files_002.json', etc]`,
         irrespective of whether a `.meta` file exists).

      :param prefix: The prefix, pointing to the s3 files. See boto3 docs for filter.
      :type prefix: str
      :param use_meta_files: Whether to use the meta files to check for eligible files
      :type use_meta_files: bool, default=False

      .. rubric:: Notes

      We don't import/copy over the .meta files at all.
      There is an internal function, getNewFiles(), that checks if the filename
      exists in the local file system. Since the meta files are not really useful for
      local processing, there's no sense in copying them over. This logic also still
      works with `use_meta_files`: we first check which files are 'eligible' in S3
      because they have a meta file, then we check if the 'real' files exist on disk.
      If the file is already on disk, we don't copy it over.


   .. py:method:: getDatamartData(table, datamart_folder: str = 'datamart', verbose: bool = True)
      :async:

      Wrapper method to import one of the tables in the datamart.

      :param table: One of the datamart tables. See notes for the full list.
      :type table: str
      :param datamart_folder: The path to the 'datamart' folder within the s3 bucket.
                              Typically, this is the top-level folder in the bucket.
      :type datamart_folder: str, default='datamart'
      :param verbose: Whether to print out the progress of the import
      :type verbose: bool, default = True

      .. note::
         Supports the following tables:

         - "modelSnapshot": "Data-Decision-ADM-ModelSnapshot_pzModelSnapshots"
         - "predictorSnapshot": "Data-Decision-ADM-PredictorBinningSnapshot_pzADMPredictorSnapshots"
         - "binaryDistribution": "Data-DM-BinaryDistribution"
         - "contingencyTable": "Data-DM-ContingencyTable"
         - "histogram": "Data-DM-Histogram"
         - "snapshot": "Data-DM-Snapshot"
         - "notification": "Data-DM-Notification"


   .. py:method:: get_ADMDatamart(datamart_folder: str = 'datamart', verbose: bool = True)
      :async:

      Get the ADMDatamart class directly from files in S3

      In the Prediction Studio settings, you can configure an automatic
      export of the monitoring tables to a chosen repository. This method
      interacts with that repository to retrieve files.

      Because this is an async function, you need to await it.
      See `Examples` for an example on how to use this (in a jupyter notebook).

      It checks for files that are already on your local device, but it always
      concatenates the raw zipped files together when calling the function, which can
      potentially make it slow. If you don't always need the latest data, just use
      :meth:`pdstools.adm.ADMDatamart.save_data()` to save the data to more easily
      digestible files.

      :param verbose: Whether to print out the progress of the imports
      :type verbose: bool, default = True
      :param datamart_folder: The path to the 'datamart' folder within the s3 bucket.
                              Typically, this is the top-level folder in the bucket.
      :type datamart_folder: str, default='datamart'

      .. rubric:: Examples

      >>> dm = await S3Data(bucketName='testbucket').get_ADMDatamart()
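
As a footnote to ``getS3Files``: the name mapping it describes for ``use_meta_files=True`` can be sketched in plain Python. This is only an illustration of the documented steps, not the library's implementation. The helper names ``meta_prefix`` and ``eligible_files`` are hypothetical and do not exist in pdstools, and no S3 calls are made.

```python
from typing import Iterable, List


def meta_prefix(prefix: str) -> str:
    """Hypothetical helper: add a '.' in front of the last path element."""
    # rpartition splits on the last '/', e.g. ('path/to', '/', 'files').
    head, sep, tail = prefix.rpartition("/")
    return f"{head}{sep}.{tail}"


def eligible_files(keys: Iterable[str]) -> List[str]:
    """Hypothetical helper: map '.meta' marker keys back to their data files."""
    result = []
    for key in keys:
        if not key.endswith(".meta"):
            continue  # only marker files make a data file eligible
        head, sep, name = key.rpartition("/")
        # Strip the leading period and the trailing '.meta' from the file name.
        if name.startswith("."):
            name = name[1:]
        name = name[: -len(".meta")]
        result.append(f"{head}{sep}{name}")
    return result


if __name__ == "__main__":
    print(meta_prefix("path/to/files"))
    print(eligible_files(["path/to/.files_001.json.meta", "path/to/other.json"]))
```

The resulting list corresponds to the files that would then be checked against the local temporary directory before downloading.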