pdstools.pega_io

Submodules

Package Contents

Classes

Functions

_readClientCredentialFile(credentialFile)

get_URL(credentialFile)

Returns the URL of the Infinity instance in the credential file

get_token(credentialFile[, verify])

Get API credentials to a Pega Platform instance.

setupAzureOpenAI(api_base, api_version, ...)

Convenience function to automagically setup Azure AD-based authentication

fromPRPCDateTime(→ Union[datetime.datetime, str])

Convert from a Pega date-time string.

readDSExport(→ polars.LazyFrame)

Read a Pega dataset export file.

import_file(→ polars.LazyFrame)

Imports a file using Polars

readZippedFile(→ io.BytesIO)

Read a zipped NDJSON file.

readMultiZip(files[, zip_type, verbose])

Reads multiple zipped NDJSON files and concatenates them into one Polars DataFrame.

get_latest_file(→ str)

Convenience method to find the latest model snapshot.

getMatches(files_dir, target)

cache_to_file(→ str)

Very simple convenience function to cache data.

_readClientCredentialFile(credentialFile)
get_URL(credentialFile: str)

Returns the URL of the Infinity instance in the credential file

Parameters:

credentialFile (str)

get_token(credentialFile: str, verify: bool = True, **kwargs)

Get API credentials to a Pega Platform instance.

After setting up OAuth2 authentication in Dev Studio, you should be able to download a credential file. Simply point this method to that file, and it’ll read the relevant properties and give you your access token.

Parameters:
  • credentialFile (str) – The credential file downloaded after setting up OAuth in a Pega system

  • verify (bool, default = True) – Whether to only allow safe SSL requests. In case you’re connecting to an unsecured API endpoint, you need to explicitly set verify to False, otherwise Python will yell at you.

Keyword Arguments:

url (str) – An optional override of the URL to connect to. This is also extracted out of the credential file, but you may want to customize this (to a different port, etc).
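
A minimal usage sketch; the credential file name below is hypothetical:

>>> from pdstools.pega_io import get_token, get_URL
>>> # Both helpers read the OAuth credential file downloaded from Dev Studio
>>> token = get_token('oauth_credentials.json')
>>> url = get_URL('oauth_credentials.json')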

setupAzureOpenAI(api_base: str = 'https://aze-openai-01.openai.azure.com/', api_version: Literal[2022-12-01, 2023-03-15-preview, 2023-05-15, 2023-06-01-preview, 2023-07-01-preview, 2023-09-15-preview, 2023-10-01-preview, 2023-12-01-preview] = '2023-12-01-preview')

Convenience function to automagically setup Azure AD-based authentication for the Azure OpenAI service. Mostly meant as an internal tool within Pega, but can of course also be used beyond.

Prerequisites (you should only need to do this once!):

  • Download the Azure CLI (https://learn.microsoft.com/en-us/cli/azure/install-azure-cli)

  • Once installed, run 'az login' in your terminal

  • Additional dependencies: azure-identity and openai (install both via pip)

Running this function automatically sets, among others:

  • openai.api_key

  • os.environ["OPENAI_API_KEY"]

This should ensure that you don’t need to pass tokens and/or api_keys around. The key that’s set has a lifetime, typically of one hour. Therefore, if you get an error message like ‘invalid token’, you may need to run this method again to refresh the token for another hour.

Parameters:
  • api_base (str) – The URL of the Azure service you'd like to connect to. If you have access to the Azure OpenAI playground (https://oai.azure.com/portal), you can easily find this URL by clicking 'view code' in one of the playgrounds. If you have access to the Azure portal directly (https://portal.azure.com), it is listed under 'endpoint'. Otherwise, ask your system administrator for the correct URL.

  • api_version (str) – The version of the API to use

Usage

>>> from pdstools import setupAzureOpenAI
>>> setupAzureOpenAI()

fromPRPCDateTime(x: str, return_string: bool = False) datetime.datetime | str

Convert from a Pega date-time string.

Parameters:
  • x (str) – String of Pega date-time

  • return_string (bool, default=False) – If True, the date is returned as a string; if False, as a datetime object.

Returns:

The converted date in datetime format or string.

Return type:

Union[datetime.datetime, str]

Examples

>>> fromPRPCDateTime("20180316T134127.847 GMT")
>>> fromPRPCDateTime("20180316T134127.847 GMT", True)
>>> fromPRPCDateTime("20180316T184127.846")
>>> fromPRPCDateTime("20180316T184127.846", True)

readDSExport(filename: pandas.DataFrame | polars.DataFrame | str, path: str = '.', verbose: bool = True, **reading_opts) polars.LazyFrame

Read a Pega dataset export file. Can accept either a Pandas DataFrame or one of the following formats:

  • .csv

  • .json

  • .zip (zipped JSON or CSV)

  • .feather

  • .ipc

  • .parquet

It automatically infers the default file names for both model data and predictor data. If you supply either 'modelData' or 'predictorData' as the filename argument, it will search for them. If you supply the full name of a file in the 'path' directory, it will import that instead. Since pdstools v3.x, this function returns a Polars LazyFrame; simply call .collect() to get an eager frame.

Parameters:
  • filename ([pd.DataFrame, pl.DataFrame, str]) – Either a Pandas/Polars DataFrame with the source data (for compatibility), or a string, which can be either the name of the file (if a custom name), or 'modelData'/'predictorData' to look for the default exports in the path folder.

  • path (str, default = '.') – The location of the file

  • verbose (bool, default = True) – Whether to print out which file will be imported

Keyword Arguments:

Any – Any arguments to plug into the scan_* function from Polars.

Returns:

The (lazy) dataframe

Return type:

polars.LazyFrame

Examples

>>> df = readDSExport(filename = 'modelData', path = './datamart')
>>> df = readDSExport(filename = 'ModelSnapshot.json', path = 'data/ADMData')
>>> df = pd.read_csv('file.csv')
>>> df = readDSExport(filename = df)

import_file(file: str, extension: str, **reading_opts) polars.LazyFrame

Imports a file using Polars

Parameters:
  • file (str) – The path to the file, passed directly to the read functions

  • extension (str) – The extension of the file, used to determine which function to use

Returns:

The (imported) lazy dataframe

Return type:

pl.LazyFrame
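
A minimal sketch; the file name is hypothetical, and it assumes the extension is passed including the leading dot (as produced by os.path.splitext):

>>> from pdstools.pega_io import import_file
>>> # 'snapshots.parquet' is a placeholder file; the extension selects the Polars reader
>>> lf = import_file('snapshots.parquet', extension='.parquet')
>>> df = lf.collect()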

readZippedFile(file: str, verbose: bool = False) io.BytesIO

Read a zipped NDJSON file. Reads a dataset export file as exported and downloaded from Pega. The export file is formatted as a zipped multi-line JSON file. It reads the file, and then returns the file as a BytesIO object.

Parameters:
  • file (str) – The full path to the file

  • verbose (bool, default=False) – Whether to print the names of the files within the unzipped file for debugging purposes

Returns:

The raw bytes object to pass through to Polars

Return type:

io.BytesIO
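
A sketch of passing the returned BytesIO to Polars; the export file name is hypothetical:

>>> import polars as pl
>>> from pdstools.pega_io import readZippedFile
>>> raw = readZippedFile('Data-Decision-ADM-ModelSnapshot_All.zip')  # placeholder name
>>> df = pl.read_ndjson(raw)  # the archive contains newline-delimited JSON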

readMultiZip(files: list, zip_type: Literal[gzip] = 'gzip', verbose: bool = True)

Reads multiple zipped NDJSON files and concatenates them into one Polars DataFrame.

Parameters:
  • files (list) – The list of files to concat

  • zip_type (Literal['gzip']) – At this point, only ‘gzip’ is supported

  • verbose (bool, default = True) – Whether to print out the progress of the import
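
A sketch assuming the exports are gzipped NDJSON files in a local folder (the glob pattern is hypothetical):

>>> import glob
>>> from pdstools.pega_io import readMultiZip
>>> files = glob.glob('exports/*.json.gz')  # hypothetical location of the exports
>>> df = readMultiZip(files, zip_type='gzip', verbose=False)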

get_latest_file(path: str, target: str, verbose: bool = False) str

Convenience method to find the latest model snapshot. It has a set of default names to search for and finds all files that match them. Once it finds all matching files in the directory, it chooses the most recent one. Supports [".json", ".csv", ".zip", ".parquet", ".feather", ".ipc"]. Needs a path to the directory and a target of either 'modelData' or 'predictorData'.

Parameters:
  • path (str) – The filepath where the data is stored

  • target (str in ['modelData', 'predictorData']) – Whether to look for data about the predictive models (‘modelData’) or the predictor bins (‘predictorData’)

  • verbose (bool, default = False) – Whether to print all found files before comparing name criteria for debugging purposes

Returns:

The most recent file given the file name criteria.

Return type:

str
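
A sketch combining this with readDSExport; the directory is hypothetical and it assumes the returned path resolves from the working directory:

>>> from pdstools.pega_io import get_latest_file, readDSExport
>>> latest = get_latest_file(path='./datamart', target='modelData')
>>> df = readDSExport(filename=latest)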

getMatches(files_dir, target)
cache_to_file(df: polars.DataFrame | polars.LazyFrame, path: os.PathLike, name: str, cache_type: Literal[ipc, parquet] = 'ipc', compression: str = 'uncompressed') str

Very simple convenience function to cache data. Caches in arrow format for very fast reading.

Parameters:
  • df (pl.DataFrame | pl.LazyFrame) – The dataframe to cache

  • path (os.PathLike) – The location to cache the data

  • name (str) – The name to give to the file

  • cache_type (str) – The type of file to export. Default is IPC, also supports parquet

  • compression (str) – The compression to apply, default is uncompressed

Returns:

The filepath to the cached file

Return type:

str
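
A minimal sketch; the dataframe and cache location are made up for illustration:

>>> import polars as pl
>>> from pdstools.pega_io import cache_to_file
>>> df = pl.DataFrame({'ModelID': ['a', 'b'], 'Positives': [10, 25]})  # toy data
>>> cached = cache_to_file(df, path='./cache', name='model_snapshot')  # hypothetical location
>>> lf = pl.scan_ipc(cached)  # default cache_type is 'ipc'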

class S3Data(bucketName: str, temp_dir='./s3_download')
Parameters:

bucketName (str)

async getS3Files(prefix, use_meta_files=False, verbose=True)

OOTB file exports can be written as many very small files.

This method asynchronously retrieves these files and puts them in a temporary directory.

The logic, if use_meta_files is True, is:

1. Take the prefix and add a . in front of it ('path/to/files' becomes 'path/to/.files'):

  • rsplit on / (['path/to', 'files'])

  • take the last element ('files')

  • add . in front of it ('.files')

  • concat back to a filepath ('path/to/.files')

2. Fetch all files in the repo that adhere to the prefix ('path/to/.files*')

3. For each file, if the file ends with .meta:

  • rsplit on '/' (['path/to', '.files_001.json.meta'])

  • for the last element (just the filename), strip the period and the .meta (['path/to', 'files_001.json'])

  • concat back to a filepath ('path/to/files_001.json')

4. Import all files in the list

If use_meta_files is False, the logic is as simple as:

1. Import all files starting with the prefix ('path/to/files' gives ['path/to/files_001.json', 'path/to/files_002.json', etc.], irrespective of whether a .meta file exists).

Parameters:
  • prefix (str) – The prefix, pointing to the s3 files. See boto3 docs for filter.

  • use_meta_files (bool, default=False) – Whether to use the meta files to check for eligible files

Notes

We don't import or copy over the .meta files at all. There is an internal function, getNewFiles(), which checks whether a filename already exists in the local file system. Since the meta files are not really useful for local processing, there is no sense in copying them over. This logic still works with use_meta_files: we first check which files are eligible in S3 because they have a meta file, then we check whether the 'real' files already exist on disk. If a file is already on disk, we don't copy it over.
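
A minimal async sketch, assuming S3Data is importable from this module and the call is awaited in a Jupyter notebook; bucket name and prefix are hypothetical:

>>> from pdstools.pega_io import S3Data
>>> s3 = S3Data(bucketName='my-pega-exports')
>>> files = await s3.getS3Files(prefix='datamart/modelSnapshot', use_meta_files=True)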

async getDatamartData(table, datamart_folder: str = 'datamart', verbose: bool = True)

Wrapper method to import one of the tables in the datamart.

Parameters:
  • table (str) – One of the datamart tables. See notes for the full list.

  • datamart_folder (str, default='datamart') – The path to the ‘datamart’ folder within the s3 bucket. Typically, this is the top-level folder in the bucket.

  • verbose (bool, default = True) – Whether to print out the progress of the import

Note

Supports the following tables:

  • modelSnapshot: Data-Decision-ADM-ModelSnapshot_pzModelSnapshots

  • predictorSnapshot: Data-Decision-ADM-PredictorBinningSnapshot_pzADMPredictorSnapshots

  • binaryDistribution: Data-DM-BinaryDistribution

  • contingencyTable: Data-DM-ContingencyTable

  • histogram: Data-DM-Histogram

  • snapshot: Data-DM-Snapshot

  • notification: Data-DM-Notification
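
A sketch continuing from an S3Data instance created as in the getS3Files example above; the table contents are hypothetical:

>>> model_data = await s3.getDatamartData('modelSnapshot')
>>> predictor_data = await s3.getDatamartData('predictorSnapshot', datamart_folder='datamart')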

async get_ADMDatamart(datamart_folder: str = 'datamart', verbose: bool = True)

Get the ADMDatamart class directly from files in S3

In the Prediction Studio settings, you can configure an automatic export of the monitoring tables to a chosen repository. This method interacts with that repository to retrieve files.

Because this is an async function, you need to await it. See Examples for an example on how to use this (in a jupyter notebook).

It checks for files that are already on your local device, but it always concatenates the raw zipped files on each call, which can be slow. If you don't always need the latest data, use pdstools.adm.ADMDatamart.save_data() to save the data to more easily digestible files.

Parameters:
  • verbose (bool) – Whether to print out the progress of the imports

  • datamart_folder (str, default='datamart') – The path to the ‘datamart’ folder within the s3 bucket. Typically, this is the top-level folder in the bucket.

Examples

>>> dm = await S3Data(bucketName='testbucket').get_ADMDatamart()