pdstools.decision_analyzer.data_read_utils

Functions

read_nested_zip_files(→ polars.DataFrame)

Reads a zip file buffer (uploaded from Streamlit) that contains .zip files, which are in fact gzipped ndjson files.

read_gzipped_data(→ Optional[polars.DataFrame])

Reads gzipped ndjson data from a BytesIO object and returns a Polars DataFrame.

read_gzips_with_zip_extension(→ polars.DataFrame)

Iterates over all files with a .zip extension in the given directory, treats them as gzipped ndjson files, and concatenates them into a single Polars DataFrame.

read_data(path)

validate_columns(→ Tuple[bool, Optional[str]])

Validate that default columns from table definition exist in the dataframe.

Module Contents

read_nested_zip_files(file_buffer) → polars.DataFrame

Reads a zip file buffer (uploaded from Streamlit) that contains .zip files, which are in fact gzipped ndjson files. Extracts, reads, and concatenates them into a single Polars DataFrame.

Parameters:

file_buffer (UploadedFile) – The uploaded zip file buffer from Streamlit.

Returns:

A concatenated Polars DataFrame containing the data from all gzipped ndjson files.

Return type:

pl.DataFrame
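
The nested layout described above (an outer zip whose members carry a .zip extension but are really gzipped ndjson) can be illustrated with a stdlib-only sketch. This is not the pdstools implementation: it returns a list of dicts where the real function returns a Polars DataFrame, and the helper names `build_sample_archive` and `read_nested_archive` are hypothetical.

```python
import gzip
import io
import json
import zipfile


def build_sample_archive() -> io.BytesIO:
    """Build an in-memory zip whose members are named *.zip but hold gzipped ndjson."""
    outer = io.BytesIO()
    with zipfile.ZipFile(outer, "w") as archive:
        for index in range(2):
            rows = [{"part": index, "row": n} for n in range(3)]
            ndjson = "\n".join(json.dumps(row) for row in rows).encode("utf-8")
            archive.writestr(f"part_{index}.zip", gzip.compress(ndjson))
    outer.seek(0)
    return outer


def read_nested_archive(file_buffer) -> list:
    """Read every *.zip member as gzipped ndjson and concatenate the rows."""
    rows = []
    with zipfile.ZipFile(file_buffer) as archive:
        for name in archive.namelist():
            if not name.endswith(".zip"):
                continue  # skip members that are not disguised gzip files
            payload = gzip.decompress(archive.read(name))
            for line in payload.decode("utf-8").splitlines():
                if line.strip():
                    rows.append(json.loads(line))
    return rows


rows = read_nested_archive(build_sample_archive())
```

Any seekable binary file-like object works as `file_buffer` here, which is why an in-memory `io.BytesIO` can stand in for a Streamlit upload.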

read_gzipped_data(data: io.BytesIO) → polars.DataFrame | None

Reads gzipped ndjson data from a BytesIO object and returns a Polars DataFrame.

Parameters:

data (BytesIO) – The gzipped ndjson data.

Returns:

The Polars DataFrame containing the data, or None if reading fails.

Return type:

Optional[pl.DataFrame]
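
The "DataFrame or None" contract can be sketched with the standard library alone. The function below is a hypothetical stand-in named `read_gzipped_ndjson`, not the pdstools code: it yields a list of dicts instead of a Polars DataFrame, and the set of exceptions it treats as a read failure is an assumption.

```python
import gzip
import io
import json
import zlib
from typing import Optional


def read_gzipped_ndjson(data: io.BytesIO) -> Optional[list]:
    """Decode gzipped ndjson into a list of dicts, or return None on failure."""
    try:
        with gzip.open(data, mode="rt", encoding="utf-8") as handle:
            return [json.loads(line) for line in handle if line.strip()]
    except (OSError, EOFError, ValueError, zlib.error):
        # Assumed failure modes: bad gzip header (OSError), truncated stream
        # (EOFError), corrupt deflate data (zlib.error), invalid JSON or
        # text encoding (both ValueError subclasses).
        return None


good = io.BytesIO(gzip.compress(b'{"a": 1}\n{"a": 2}\n'))
bad = io.BytesIO(b"this is not gzip data")

parsed = read_gzipped_ndjson(good)
failed = read_gzipped_ndjson(bad)
```

Because the return type is optional, callers need to check for `None` before using the result.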

read_gzips_with_zip_extension(path: str) → polars.DataFrame

Iterates over all files with a .zip extension in the given directory, treats them as gzipped ndjson files, reads them, and concatenates the results into a single Polars DataFrame.

Parameters:

path (str) – The path to the directory containing the .zip files.

Returns:

A concatenated Polars DataFrame containing the data from all gzipped ndjson files.

Return type:

pl.DataFrame
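
A directory-based variant of the same idea, again as a stdlib-only sketch under stated assumptions: `read_gzips_in_directory` is a hypothetical name, the rows come back as a list of dicts instead of a Polars DataFrame, and sorting the file names is a choice made here for reproducible ordering, not something the source specifies.

```python
import gzip
import json
import tempfile
from pathlib import Path


def read_gzips_in_directory(path: str) -> list:
    """Treat every *.zip file in `path` as gzipped ndjson and concatenate the rows."""
    rows = []
    for file_path in sorted(Path(path).glob("*.zip")):
        with gzip.open(file_path, mode="rt", encoding="utf-8") as handle:
            rows.extend(json.loads(line) for line in handle if line.strip())
    return rows


with tempfile.TemporaryDirectory() as directory:
    # Two gzipped ndjson files carrying a misleading .zip extension.
    for index in range(2):
        payload = json.dumps({"file": index}).encode("utf-8") + b"\n"
        (Path(directory) / f"export_{index}.zip").write_bytes(gzip.compress(payload))
    # A file with another extension, which the glob pattern skips.
    (Path(directory) / "notes.txt").write_text("ignored")
    combined = read_gzips_in_directory(directory)
```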

read_data(path)

validate_columns(df: polars.LazyFrame, extract_type: Dict[str, pdstools.decision_analyzer.table_definition.TableConfig]) → Tuple[bool, str | None]

Validate that default columns from table definition exist in the dataframe.

This function checks if required columns exist in the data, accounting for the fact that columns may be present under either their source name or their target label name.

Args:

df: The dataframe to validate.

extract_type: Table configuration mapping column names to their properties.

Returns:

Tuple containing validation success (bool) and error message (str or None)

Return type:

Tuple[bool, Optional[str]]
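
The "source name or target label" check can be sketched as follows. Everything specific here is an assumption made for illustration: the function name `check_columns`, the plain-dict configuration with `label` and `default` keys (the real `TableConfig` structure is not shown in this page), the column names, and the wording of the error message. With a Polars LazyFrame, the available names would typically come from its schema, for example `df.collect_schema().names()` in recent Polars versions.

```python
from typing import Dict, List, Optional, Tuple


def check_columns(
    available: List[str], config: Dict[str, dict]
) -> Tuple[bool, Optional[str]]:
    """Return (True, None) if every default column is present under either name."""
    present = set(available)
    missing = [
        source
        for source, props in config.items()
        # A default column counts as present under its source name OR its label.
        if props["default"]
        and source not in present
        and props["label"] not in present
    ]
    if missing:
        return False, "Missing required columns: " + ", ".join(missing)
    return True, None


# Hypothetical configuration: source column name -> properties.
config = {
    "src_action": {"label": "Action", "default": True},
    "src_channel": {"label": "Channel", "default": True},
    "src_rank": {"label": "Rank", "default": False},
}

ok_result = check_columns(["Action", "src_channel"], config)
bad_result = check_columns(["Action"], config)
```

In this sketch, `ok_result` passes because one column is found under its label and the other under its source name, while the non-default column is never required.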