pdstools.pega_io.S3

Async S3 helper for downloading Pega dataset exports.

Attributes

logger
DATAMART_TABLE_PREFIXES

Classes

S3Data

Asynchronous helper for downloading Pega datasets from S3.

Module Contents

logger
DATAMART_TABLE_PREFIXES: dict[str, str]
class S3Data(bucket_name: str, temp_dir: str = './s3_download')

Asynchronous helper for downloading Pega datasets from S3.

Use this when Prediction Studio is configured to export monitoring tables to an S3 bucket: it downloads the partitioned .json.gz files into a local directory and (optionally) hands them off to pdstools.adm.ADMDatamart.

Parameters:
  • bucket_name (str) – Name of the S3 bucket containing the dataset folder.

  • temp_dir (str, default="./s3_download") – Directory where downloaded files are cached; use a folder you don’t mind being filled with export files.

bucket_name
temp_dir = './s3_download'
async get_files(prefix: str, *, use_meta_files: bool = False, verbose: bool = True) → list[str]

Download files from the bucket whose key starts with prefix.

Pega data exports are split into many small files. This method fetches them concurrently into temp_dir, skipping any file that already exists locally.

When use_meta_files is True, each export file X is accompanied by a hidden sentinel file .X.meta that signals the export of that file has finished. The method lists keys under the dotted prefix (path/to/.files), keeps only entries ending in .meta, and maps each back to its underlying data file (path/to/files_001.json). The .meta files themselves are never copied locally.

When use_meta_files is False, every key under prefix is downloaded.

Parameters:
  • prefix (str) – S3 key prefix (see boto3 Bucket.objects.filter(Prefix=...)).

  • use_meta_files (bool, keyword-only, default=False) – Whether to use companion .meta files to gate downloads.

  • verbose (bool, keyword-only, default=True) – Show a tqdm progress bar (if installed) and print a summary.

Returns:

Local paths of all files that match prefix (newly downloaded and already cached).

Return type:

list[str]
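The meta-to-data key mapping described above can be sketched in plain Python. This is a hypothetical helper for illustration only; the library's internal naming logic may differ:

```python
import posixpath  # S3 keys always use forward slashes

def meta_key_to_data_key(meta_key: str) -> str:
    """Map a sentinel key like 'path/to/.files_001.json.meta' back to
    the data file it guards, 'path/to/files_001.json' (sketch)."""
    folder, name = posixpath.split(meta_key)
    # Strip the leading dot and the trailing '.meta' suffix.
    data_name = name.removeprefix(".").removesuffix(".meta")
    return posixpath.join(folder, data_name)

print(meta_key_to_data_key("path/to/.files_001.json.meta"))
# path/to/files_001.json
```

Only the keys that survive this mapping are fetched, which is why partially written exports are skipped when use_meta_files is True.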

async get_datamart_data(table: str, *, datamart_folder: str = 'datamart', verbose: bool = True) → list[str]

Download a single datamart table from S3.

Parameters:
  • table (str) – Datamart table name. One of the keys in DATAMART_TABLE_PREFIXES: "modelSnapshot", "predictorSnapshot", "binaryDistribution", "contingencyTable", "histogram", "snapshot", "notification".

  • datamart_folder (str, keyword-only, default="datamart") – Top-level folder inside the bucket that contains the datamart export.

  • verbose (bool, keyword-only, default=True) – Show download progress.

Returns:

Local paths of the downloaded files.

Return type:

list[str]
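How a table name turns into an S3 listing prefix can be sketched as follows. The table keys come from the documentation above, but the mapped prefix values and the helper itself are invented placeholders, not the library's real DATAMART_TABLE_PREFIXES contents:

```python
import posixpath

# Placeholder subset of the table-name -> prefix mapping; values are
# illustrative only, not the actual pdstools prefixes.
TABLE_PREFIXES = {
    "modelSnapshot": "model_snapshots",
    "predictorSnapshot": "predictor_snapshots",
}

def datamart_prefix(table: str, datamart_folder: str = "datamart") -> str:
    """Build the S3 key prefix for one datamart table (sketch)."""
    try:
        prefix = TABLE_PREFIXES[table]
    except KeyError:
        raise ValueError(
            f"Unknown table {table!r}; expected one of {sorted(TABLE_PREFIXES)}"
        )
    return posixpath.join(datamart_folder, prefix)

print(datamart_prefix("modelSnapshot"))
# datamart/model_snapshots
```

The resulting prefix is what get_files would then list and download.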

async get_adm_datamart(*, datamart_folder: str = 'datamart', verbose: bool = True) → pdstools.adm.ADMDatamart.ADMDatamart

Construct an ADMDatamart directly from S3.

Convenience wrapper that downloads the model and predictor snapshot exports and feeds them into ADMDatamart. Because this is an async function, it must be awaited.

Parameters:
  • datamart_folder (str, keyword-only, default="datamart") – Top-level folder inside the bucket that contains the datamart export.

  • verbose (bool, keyword-only, default=True) – Show download progress.

Returns:

A datamart populated with the freshly downloaded files.

Return type:

ADMDatamart

Examples

>>> dm = await S3Data(bucket_name="testbucket").get_adm_datamart()
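The top-level await above only works in an async-aware shell such as IPython or Jupyter. In a plain script you drive the coroutine with asyncio.run; the pattern is sketched below with a stub coroutine standing in for the real S3-backed call, which needs AWS credentials and an existing bucket:

```python
import asyncio

async def get_adm_datamart_stub():
    # Stand-in for S3Data(bucket_name="testbucket").get_adm_datamart();
    # the awaiting pattern is identical for the real coroutine.
    return "datamart"

# In a script, wrap the awaited call in asyncio.run:
dm = asyncio.run(get_adm_datamart_stub())
print(dm)
# datamart
```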