pdstools.pega_io.File

Module Contents

Functions

readDSExport(→ polars.LazyFrame)

Read a Pega dataset export file.

import_file(→ polars.LazyFrame)

Imports a file using Polars

readZippedFile(→ io.BytesIO)

Read a zipped NDJSON file.

readMultiZip(files[, zip_type, verbose])

Reads multiple zipped NDJSON files and concatenates them into one Polars dataframe.

get_latest_file(→ str)

Convenience method to find the latest model snapshot.

getMatches(files_dir, target)

cache_to_file(→ str)

Very simple convenience function to cache data.

readDSExport(filename: pandas.DataFrame | polars.DataFrame | str, path: str = '.', verbose: bool = True, **reading_opts) polars.LazyFrame

Read a Pega dataset export file. Accepts either a Pandas or Polars DataFrame, or a file in one of the following formats:
  • .csv
  • .json
  • .zip (zipped JSON or CSV)
  • .feather
  • .ipc
  • .parquet

It automatically infers the default file names for both model data and predictor data. If you supply either ‘modelData’ or ‘predictorData’ as the ‘filename’ argument, it searches the ‘path’ directory for a matching file. If you supply the full name of a file in the ‘path’ directory, it imports that file instead. Since pdstools V3.x, it returns a Polars LazyFrame. Call .collect() to get an eager frame.

Parameters:
  • filename ([pd.DataFrame, pl.DataFrame, str]) – Either a Pandas/Polars DataFrame with the source data (for compatibility), or a string. A string can be either:
      - the name of the file (if it has a custom name), or
      - ‘modelData’ or ‘predictorData’, to look for the default file names in the path folder.

  • path (str, default = '.') – The location of the file

  • verbose (bool, default = True) – Whether to print out which file will be imported

Keyword Arguments:

**reading_opts (Any) – Any keyword arguments to pass through to the scan_* function from Polars.

Returns:

  • pl.LazyFrame – The (lazy) dataframe

Examples:

    >>> df = readDSExport(filename='modelData', path='./datamart')
    >>> df = readDSExport(filename='ModelSnapshot.json', path='data/ADMData')

    >>> import pandas as pd
    >>> df = pd.read_csv('file.csv')
    >>> df = readDSExport(filename=df)

Return type:

polars.LazyFrame

import_file(file: str, extension: str, **reading_opts) polars.LazyFrame

Imports a file using Polars

Parameters:
  • file (str) – The path to the file, passed directly to the read functions

  • extension (str) – The extension of the file, used to determine which function to use

Returns:

The (imported) lazy dataframe

Return type:

pl.LazyFrame
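The extension-based dispatch described above can be illustrated with a minimal, stdlib-only sketch. This is not the pdstools implementation, which hands the file to Polars; the names `import_file_sketch`, `READERS`, `_read_csv` and `_read_ndjson` are hypothetical, only two extensions are covered, and treating `.json` as newline-delimited JSON is an assumption made for the example.

```python
import csv
import json
from typing import Callable, Dict, List


def _read_csv(path: str) -> List[dict]:
    # Parse a CSV file into a list of row dictionaries (all values are strings).
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def _read_ndjson(path: str) -> List[dict]:
    # Parse newline-delimited JSON: one JSON object per non-empty line.
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical dispatch table: file extension -> reader function.
READERS: Dict[str, Callable[[str], List[dict]]] = {
    ".csv": _read_csv,
    ".json": _read_ndjson,
}


def import_file_sketch(file: str, extension: str) -> List[dict]:
    # Pick the reader that matches the extension; fail loudly if there is none.
    try:
        reader = READERS[extension.lower()]
    except KeyError:
        raise ValueError(f"Unsupported extension: {extension!r}") from None
    return reader(file)
```

Keeping the mapping in a dictionary means that supporting another format only requires registering one more reader, without touching the dispatch function itself.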

readZippedFile(file: str, verbose: bool = False) io.BytesIO

Read a zipped NDJSON file. Reads a dataset export file as exported and downloaded from Pega. The export file is formatted as a zipped multi-line JSON file. It reads the file, and then returns the file as a BytesIO object.

Parameters:
  • file (str) – The full path to the file

  • verbose (bool, default = False) – Whether to print the names of the files within the unzipped file for debugging purposes

Returns:

The raw bytes object to pass through to Polars

Return type:

io.BytesIO
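As a rough illustration of the behaviour described above, the following stdlib-only sketch opens a zip archive and returns one member as an in-memory buffer. It is not the pdstools source: the function name `read_zipped_file_sketch` and the `member` parameter are hypothetical, and the default member name `data.json` is an assumption about how the export is laid out.

```python
import io
import zipfile


def read_zipped_file_sketch(
    file: str, member: str = "data.json", verbose: bool = False
) -> io.BytesIO:
    # Open the archive read-only and list the files it contains.
    with zipfile.ZipFile(file, mode="r") as archive:
        names = archive.namelist()
        if verbose:
            print(names)
        # "data.json" is an assumed member name, not confirmed by the docs above.
        if member not in names:
            raise FileNotFoundError(f"{member!r} not found in {file!r}")
        # Read the member fully into memory and wrap it in a BytesIO buffer.
        return io.BytesIO(archive.read(member))
```

The returned buffer behaves like a file opened in binary mode, so it can be handed to any reader that accepts a file-like object.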

readMultiZip(files: list, zip_type: Literal[gzip] = 'gzip', verbose: bool = True)

Reads multiple zipped NDJSON files and concatenates them into one Polars dataframe.

Parameters:
  • files (list) – The list of files to concat

  • zip_type (Literal['gzip']) – At this point, only ‘gzip’ is supported

  • verbose (bool, default = True) – Whether to print out the progress of the import
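The read-and-concatenate pattern can be sketched with the standard library alone. This is only an illustration under stated assumptions: `read_multi_gzip_sketch` is a hypothetical name, each file is assumed to be gzip-compressed newline-delimited JSON, and the result is a plain list of dictionaries rather than the Polars dataframe the real function produces.

```python
import gzip
import json
from typing import Iterable, List


def read_multi_gzip_sketch(files: Iterable[str], verbose: bool = False) -> List[dict]:
    records: List[dict] = []
    for index, path in enumerate(files, start=1):
        # gzip.open in text mode decompresses and decodes the file on the fly.
        with gzip.open(path, mode="rt", encoding="utf-8") as handle:
            # One JSON object per non-empty line; append in file order.
            records.extend(json.loads(line) for line in handle if line.strip())
        if verbose:
            print(f"Read {index} file(s), {len(records)} record(s) so far")
    return records
```

Records keep the order of the input list, so sorting `files` beforehand controls the order of the combined result.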

get_latest_file(path: str, target: str, verbose: bool = False) str

Convenience method to find the latest model snapshot. It has a set of default names to search for and finds all files that match them. Once it has found all matching files in the directory, it chooses the most recent one. Supports [“.json”, “.csv”, “.zip”, “.parquet”, “.feather”, “.ipc”]. Needs a path to the directory and a target of either ‘modelData’ or ‘predictorData’.

Parameters:
  • path (str) – The filepath where the data is stored

  • target (str in ['modelData', 'predictorData']) – Whether to look for data about the predictive models (‘modelData’) or the predictor bins (‘predictorData’)

  • verbose (bool, default = False) – Whether to print all found files before comparing name criteria for debugging purposes

Returns:

The most recent file given the file name criteria.

Return type:

str
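A simplified version of this selection logic might look like the sketch below. It is an assumption-laden illustration, not the library's code: `latest_file_sketch` and `name_fragment` are hypothetical, matching is a plain substring check instead of the library's default name lists, and "most recent" is decided by file modification time, which may differ from the criterion pdstools actually uses.

```python
import os
from typing import Optional

# Extensions listed in the description above.
SUPPORTED = (".json", ".csv", ".zip", ".parquet", ".feather", ".ipc")


def latest_file_sketch(path: str, name_fragment: str) -> Optional[str]:
    # Keep files whose name contains the fragment and has a supported extension.
    candidates = [
        os.path.join(path, name)
        for name in os.listdir(path)
        if name_fragment in name and name.lower().endswith(SUPPORTED)
    ]
    if not candidates:
        return None
    # Choose the candidate with the newest modification time (an assumption).
    return max(candidates, key=os.path.getmtime)
```

Returning `None` when nothing matches lets the caller decide whether a missing snapshot is an error.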

getMatches(files_dir, target)

cache_to_file(df: polars.DataFrame | polars.LazyFrame, path: os.PathLike, name: str, cache_type: Literal[ipc, parquet] = 'ipc', compression: str = 'uncompressed') str

Very simple convenience function to cache data. Caches in arrow format for very fast reading.

Parameters:
  • df (pl.DataFrame | pl.LazyFrame) – The dataframe to cache

  • path (os.PathLike) – The location to cache the data

  • name (str) – The name to give to the file

  • cache_type (str) – The type of file to export. Default is IPC, also supports parquet

  • compression (str) – The compression to apply, default is uncompressed

Returns:

The filepath to the cached file

Return type:

str