pdstools.pega_io.S3
===================

.. py:module:: pdstools.pega_io.S3


Classes
-------

.. autoapisummary::

   pdstools.pega_io.S3.S3Data


Module Contents
---------------

.. py:class:: S3Data(bucketName: str, temp_dir='./s3_download')

   .. py:attribute:: bucketName

   .. py:attribute:: temp_dir
      :value: './s3_download'

   .. py:method:: getS3Files(prefix, use_meta_files=False, verbose=True)
      :async:

      OOTB file exports can be written as many very small files. This
      method asynchronously retrieves these files and puts them in a
      temporary directory.

      The logic, if `use_meta_files` is True, is:

      1. Take the prefix and add a `.` in front of the file name
         (`'path/to/files'` becomes `'path/to/.files'`):

         * rsplit on `/` (`['path/to', 'files']`)
         * take the last element (`'files'`)
         * add `.` in front of it (`'.files'`)
         * concat back to a filepath (`'path/to/.files'`)

      2. Fetch all files in the repo that adhere to the prefix
         (`'path/to/.files*'`).
      3. For each file, if the file ends with `.meta`:

         * rsplit on `/` (`['path/to', '.files_001.json.meta']`)
         * for the last element (just the filename), strip the leading
           period and the `.meta` suffix (`['path/to', 'files_001.json']`)
         * concat back to a filepath (`'path/to/files_001.json'`)

      4. Import all files in the list.

      If `use_meta_files` is False, the logic is as simple as:

      1. Import all files starting with the prefix
         (`'path/to/files'` gives `['path/to/files_001.json',
         'path/to/files_002.json', etc]`, irrespective of whether a
         `.meta` file exists).

      :param prefix: The prefix, pointing to the s3 files. See the boto3 docs for filter.
      :type prefix: str
      :param use_meta_files: Whether to use the meta files to check for eligible files
      :type use_meta_files: bool, default=False
      :param verbose: Whether to print out the progress of the import
      :type verbose: bool, default=True

      .. rubric:: Notes

      We don't import/copy over the `.meta` files at all. There is an
      internal function, getNewFiles(), that checks whether a filename
      already exists in the local file system. Since the meta files are
      not really useful for local processing, there is no sense in
      copying them over. This logic also still works with
      `use_meta_files`: we first check which files are 'eligible' in S3
      because they have a meta file, then we check whether the 'real'
      files exist on disk. If a file is already on disk, we don't copy
      it over.
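      .. rubric:: Examples

      A minimal usage sketch, assuming a bucket named `'my-bucket'` with
      an export written under the prefix `'datamart/files'` (both names
      are illustrative):

      >>> s3 = S3Data(bucketName='my-bucket')
      >>> await s3.getS3Files(prefix='datamart/files', use_meta_files=True)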
   .. py:method:: getDatamartData(table, datamart_folder: str = 'datamart', verbose: bool = True)
      :async:

      Wrapper method to import one of the tables in the datamart.

      :param table: One of the datamart tables. See the note below for the full list.
      :type table: str
      :param datamart_folder: The path to the 'datamart' folder within the s3 bucket.
                              Typically, this is the top-level folder in the bucket.
      :type datamart_folder: str, default='datamart'
      :param verbose: Whether to print out the progress of the import
      :type verbose: bool, default=True

      .. note::

         Supports the following tables:

         - "modelSnapshot": "Data-Decision-ADM-ModelSnapshot_pzModelSnapshots"
         - "predictorSnapshot": "Data-Decision-ADM-PredictorBinningSnapshot_pzADMPredictorSnapshots"
         - "binaryDistribution": "Data-DM-BinaryDistribution"
         - "contingencyTable": "Data-DM-ContingencyTable"
         - "histogram": "Data-DM-Histogram"
         - "snapshot": "Data-DM-Snapshot"
         - "notification": "Data-DM-Notification"

   .. py:method:: get_ADMDatamart(datamart_folder: str = 'datamart', verbose: bool = True)
      :async:

      Get the ADMDatamart class directly from files in S3.

      In the Prediction Studio settings, you can configure an automatic
      export of the monitoring tables to a chosen repository. This method
      interacts with that repository to retrieve the exported files.

      Because this is an async function, you need to await it. See
      `Examples` for an example of how to use this (in a Jupyter
      notebook).

      It checks for files that are already on your local device, but it
      always concatenates the raw zipped files together when you call the
      function, which can potentially make it slow. If you don't always
      need the latest data, just use
      :meth:`pdstools.adm.ADMDatamart.save_data()` to save the data to
      more easily digestible files.

      :param datamart_folder: The path to the 'datamart' folder within the s3 bucket.
                              Typically, this is the top-level folder in the bucket.
      :type datamart_folder: str, default='datamart'
      :param verbose: Whether to print out the progress of the imports
      :type verbose: bool, default=True

      .. rubric:: Examples

      >>> dm = await S3Data(bucketName='testbucket').get_ADMDatamart()
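      Outside a notebook, where top-level `await` is not available, the
      coroutine can be driven with `asyncio` instead (the bucket name is
      illustrative):

      >>> import asyncio
      >>> dm = asyncio.run(S3Data(bucketName='testbucket').get_ADMDatamart())

      If only a single table is needed, :meth:`getDatamartData` can be
      awaited in the same way:

      >>> await S3Data(bucketName='testbucket').getDatamartData('modelSnapshot')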