pdstools.decision_analyzer.data_read_utils

Functions

read_nested_zip_files(→ polars.DataFrame)

Read Pega Action Analysis export format (nested archive with gzipped NDJSON).

read_gzipped_data(→ polars.DataFrame | None)

Read a single gzipped NDJSON chunk from Pega Action Analysis export.

read_gzipped_ndjson_directory(→ polars.DataFrame)

Read directory of Pega Action Analysis gzipped NDJSON files.

validate_columns(→ tuple[bool, str | None])

Validate that default columns from table definition exist in the dataframe.

Module Contents

read_nested_zip_files(file_buffer) polars.DataFrame

Read Pega Action Analysis export format (nested archive with gzipped NDJSON).

Pega’s Action Analysis feature exports decision data as a ZIP archive containing multiple inner files with .zip extensions. Despite the extension, these inner files are gzipped NDJSON (not ZIP archives). This function handles the format by:

1. Opening the outer ZIP archive
2. Treating each inner .zip file as gzipped NDJSON
3. Decompressing and concatenating all data into a single DataFrame

This format is used for high-volume decision event exports where data is partitioned across multiple compressed files.

Parameters:

file_buffer (UploadedFile or BytesIO) – ZIP archive buffer (e.g., from Streamlit file upload) containing inner gzipped NDJSON files with misleading .zip extensions.

Returns:

Concatenated DataFrame from all inner files, with consistent column ordering.

Return type:

pl.DataFrame

Notes

This function is specific to Pega Action Analysis exports. Modern exports may use hive-partitioned Parquet directories instead, which can be read with read_data() from pega_io.
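The three steps described for read_nested_zip_files() can be sketched with the standard library alone. This is a minimal illustration, not the pdstools implementation: the name read_nested_export is hypothetical, and it returns a list of dicts as a stand-in for the Polars DataFrame the real function produces.

```python
import gzip
import io
import json
import zipfile


def read_nested_export(file_buffer):
    """Hypothetical stand-in: returns a list of dicts, not a Polars DataFrame."""
    rows = []
    # Step 1: open the outer archive, which is a genuine ZIP file.
    with zipfile.ZipFile(file_buffer) as outer:
        for name in sorted(outer.namelist()):
            if not name.endswith(".zip"):
                continue  # skip anything that is not an inner data file
            # Step 2: the inner ".zip" member is actually gzip-compressed NDJSON.
            raw = outer.read(name)
            text = gzip.decompress(raw).decode("utf-8")
            # Step 3: parse one JSON object per line and append to the result.
            for line in text.splitlines():
                if line.strip():
                    rows.append(json.loads(line))
    return rows
```

Any seekable binary buffer accepted by zipfile.ZipFile, such as an io.BytesIO, can be passed as file_buffer.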

read_gzipped_data(data: io.BytesIO) polars.DataFrame | None

Read a single gzipped NDJSON chunk from Pega Action Analysis export.

Helper function for read_nested_zip_files(). Reads one inner file from the Action Analysis export format, decompresses the gzipped content, and parses the NDJSON data.

Parameters:

data (BytesIO) – Gzipped NDJSON data (from an inner file in Action Analysis export).

Returns:

Polars DataFrame, or None if decompression/parsing fails.

Return type:

pl.DataFrame | None

Notes

Returns None on errors so that the caller can keep processing the remaining files even if some are corrupted.
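The return-None-on-failure behaviour can be illustrated as follows. This is a sketch under assumptions, not the library's code: read_gzipped_chunk and read_all are hypothetical names, the exception list is one plausible choice, and rows are returned as a list of dicts rather than a Polars DataFrame.

```python
import gzip
import io
import json
import zlib


def read_gzipped_chunk(data: io.BytesIO):
    """Return parsed NDJSON rows, or None if the chunk cannot be read."""
    try:
        with gzip.open(data, mode="rt", encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]
    except (OSError, EOFError, zlib.error, ValueError):
        # gzip.BadGzipFile is an OSError; JSON and Unicode decode errors
        # are ValueErrors; truncated streams raise EOFError.
        return None


def read_all(chunks):
    """Concatenate every readable chunk and skip the ones that fail."""
    rows = []
    for chunk in chunks:
        parsed = read_gzipped_chunk(chunk)
        if parsed is None:
            continue  # corrupted chunk: ignore it and carry on
        rows.extend(parsed)
    return rows
```

Signalling failure with None keeps the per-file error handling in one place, so the caller only needs a simple check before concatenating.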

read_gzipped_ndjson_directory(path: str) polars.DataFrame

Read directory of Pega Action Analysis gzipped NDJSON files.

For extracted Action Analysis exports, this function recursively finds all files with .zip extension (which are actually gzipped NDJSON, not ZIP archives) and concatenates them into a single DataFrame. Useful when the outer archive has been extracted to disk.

Parameters:

path (str) – Path to directory containing gzipped NDJSON files with .zip extension (from extracted Action Analysis export).

Returns:

Concatenated DataFrame from all files with consistent column ordering.

Return type:

pl.DataFrame

Notes

This is specific to Pega Action Analysis exports. For normal data reading (including hive-partitioned directories), use read_data() from pega_io instead.
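As a rough illustration of the recursive search described for read_gzipped_ndjson_directory(), the sketch below walks a directory with pathlib and reads every .zip file as gzip. The name read_export_directory is hypothetical, and the result is a list of dicts standing in for the concatenated Polars DataFrame.

```python
import gzip
import json
from pathlib import Path


def read_export_directory(path: str):
    """Hypothetical stand-in: collect NDJSON rows from every '*.zip' file under path."""
    rows = []
    # rglob searches the directory and all of its subdirectories.
    for file in sorted(Path(path).rglob("*.zip")):
        if not file.is_file():
            continue
        # Despite the extension, each file is opened as gzip, not as a ZIP archive.
        with gzip.open(file, mode="rt", encoding="utf-8") as fh:
            rows.extend(json.loads(line) for line in fh if line.strip())
    return rows
```

Sorting the paths makes the row order deterministic across runs, which is a choice made for this sketch rather than documented behaviour.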

validate_columns(df: polars.LazyFrame, extract_type: dict[str, pdstools.decision_analyzer.column_schema.TableConfig]) tuple[bool, str | None]

Validate that default columns from table definition exist in the dataframe.

This function checks if required columns exist in the data, accounting for the fact that columns may be present under either their source name or their target label name.

Parameters:

df (polars.LazyFrame) – The dataframe to validate.

extract_type (dict[str, TableConfig]) – Table configuration mapping column names to their properties.

Returns:

Tuple containing validation success (bool) and error message (str or None).

Return type:

tuple[bool, str | None]
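The source-name-or-label check described for validate_columns() might look like the sketch below. The fields of TableConfig are not shown on this page, so the "label" and "default" keys are assumptions made for illustration, as is the helper name check_default_columns. It takes plain column names instead of a LazyFrame.

```python
def check_default_columns(available, table_config):
    """Check default columns by source name or label.

    available: iterable of column names present in the data.
    table_config: hypothetical mapping of source name to
        {"label": str, "default": bool}; the real TableConfig may differ.
    """
    available = set(available)
    missing = [
        source
        for source, props in table_config.items()
        if props.get("default")                     # only default columns are required
        and source not in available                 # not present under the source name
        and props.get("label") not in available     # nor under the target label
    ]
    if missing:
        return False, f"Missing default columns: {', '.join(missing)}"
    return True, None
```

With a Polars LazyFrame, the available names could be obtained with df.collect_schema().names() before calling a helper like this one.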