pdstools.decision_analyzer.data_read_utils
==========================================

.. py:module:: pdstools.decision_analyzer.data_read_utils


Functions
---------

.. autoapisummary::

   pdstools.decision_analyzer.data_read_utils.read_nested_zip_files
   pdstools.decision_analyzer.data_read_utils.read_gzipped_data
   pdstools.decision_analyzer.data_read_utils.read_gzipped_ndjson_directory
   pdstools.decision_analyzer.data_read_utils.validate_columns


Module Contents
---------------

.. py:function:: read_nested_zip_files(file_buffer) -> polars.DataFrame

   Read Pega Action Analysis export format (nested archive with gzipped NDJSON).

   Pega's Action Analysis feature exports decision data as a ZIP archive
   containing multiple inner files with `.zip` extensions. Despite the
   extension, these inner files are **gzipped NDJSON** (not ZIP archives).

   This function handles this format by:

   1. Opening the outer ZIP archive
   2. Treating each inner `.zip` file as gzipped NDJSON
   3. Decompressing and concatenating all data into a single DataFrame

   This format is used for high-volume decision event exports where data is
   partitioned across multiple compressed files.

   :param file_buffer: ZIP archive buffer (e.g., from Streamlit file upload)
                       containing inner gzipped NDJSON files with misleading
                       `.zip` extensions.
   :type file_buffer: UploadedFile or BytesIO

   :returns: Concatenated DataFrame from all inner files, with consistent
             column ordering.
   :rtype: pl.DataFrame

   .. rubric:: Notes

   This is specific to Pega Action Analysis exports. Modern exports may use
   hive-partitioned parquet directories instead, which can be read with
   read_data().


.. py:function:: read_gzipped_data(data: io.BytesIO) -> polars.DataFrame | None

   Read a single gzipped NDJSON chunk from Pega Action Analysis export.

   Helper function for read_nested_zip_files(). Reads one inner file from the
   Action Analysis export format, decompresses the gzipped content, and parses
   the NDJSON data.

   :param data: Gzipped NDJSON data (from an inner file in Action Analysis export).
   :type data: BytesIO

   :returns: Polars DataFrame, or None if decompression/parsing fails.
   :rtype: pl.DataFrame | None

   .. rubric:: Notes

   Returns None on errors to allow processing remaining files even if some
   are corrupted.


.. py:function:: read_gzipped_ndjson_directory(path: str) -> polars.DataFrame

   Read directory of Pega Action Analysis gzipped NDJSON files.

   For extracted Action Analysis exports, this function recursively finds all
   files with `.zip` extension (which are actually gzipped NDJSON, not ZIP
   archives) and concatenates them into a single DataFrame.

   Useful when the outer archive has been extracted to disk.

   :param path: Path to directory containing gzipped NDJSON files with `.zip`
                extension (from extracted Action Analysis export).
   :type path: str

   :returns: Concatenated DataFrame from all files with consistent column ordering.
   :rtype: pl.DataFrame

   .. rubric:: Notes

   This is specific to Pega Action Analysis exports. For normal data reading
   (including hive-partitioned directories), use read_data() from pega_io
   instead.


.. py:function:: validate_columns(df: polars.LazyFrame, extract_type: dict[str, pdstools.decision_analyzer.column_schema.TableConfig]) -> tuple[bool, str | None]

   Validate that default columns from table definition exist in the dataframe.

   This function checks if required columns exist in the data, accounting for
   the fact that columns may be present under either their source name or
   their target label name.

   :param df: The dataframe to validate.
   :param extract_type: Table configuration mapping column names to their
                        properties.

   :returns: Tuple containing validation success (bool) and error message
             (str or None).
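
The Action Analysis layout described for read_nested_zip_files() and read_gzipped_data() (an outer ZIP whose inner `.zip` members are really gzipped NDJSON, with unreadable members skipped) can be illustrated with the standard library alone. The following is a minimal sketch and not the pdstools implementation: the helper names ``build_sample_export``, ``read_gzipped_rows`` and ``read_nested_rows`` are hypothetical, the sample rows are invented, and rows are returned as plain dictionaries rather than a Polars DataFrame.

```python
import gzip
import io
import json
import zipfile
import zlib
from typing import Optional


def build_sample_export() -> io.BytesIO:
    """Build an in-memory archive shaped like the export format (sample data only)."""
    rows_by_member = {
        "part-0.zip": [{"id": 1, "action": "A"}, {"id": 2, "action": "B"}],
        "part-1.zip": [{"id": 3, "action": "C"}],
    }
    outer = io.BytesIO()
    with zipfile.ZipFile(outer, "w") as archive:
        for name, rows in rows_by_member.items():
            ndjson = "\n".join(json.dumps(row) for row in rows).encode("utf-8")
            # Inner members carry a .zip name but hold gzip-compressed NDJSON.
            archive.writestr(name, gzip.compress(ndjson))
        # A deliberately corrupted member that is not valid gzip data.
        archive.writestr("part-2.zip", b"not gzip")
    outer.seek(0)
    return outer


def read_gzipped_rows(data: io.BytesIO) -> Optional[list]:
    """Decompress one gzipped NDJSON chunk; return None if it cannot be read."""
    try:
        text = gzip.decompress(data.getvalue()).decode("utf-8")
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    except (OSError, EOFError, ValueError, zlib.error):
        return None


def read_nested_rows(file_buffer: io.BytesIO) -> list:
    """Open the outer ZIP and concatenate rows from every readable inner member."""
    rows = []
    with zipfile.ZipFile(file_buffer) as archive:
        for name in archive.namelist():
            if not name.endswith(".zip"):
                continue
            chunk = read_gzipped_rows(io.BytesIO(archive.read(name)))
            if chunk is not None:  # skip corrupted members, keep the rest
                rows.extend(chunk)
    return rows


rows = read_nested_rows(build_sample_export())
```

Returning ``None`` from the per-chunk helper, rather than raising, mirrors the behaviour documented for read_gzipped_data(): one corrupted member does not prevent the remaining members from being read.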