pdstools.decision_analyzer.data_read_utils

Attributes

_SUPPORTED_EXTENSIONS

Functions

_is_artifact(→ bool)

Return True for OS-generated junk entries (macOS, Windows, etc.).

read_nested_zip_files(→ polars.DataFrame)

Reads a zip file buffer (uploaded from Streamlit) that contains .zip files, which are in fact gzipped ndjson files.

read_gzipped_data(→ polars.DataFrame | None)

Reads gzipped ndjson data from a BytesIO object and returns a Polars DataFrame.

read_gzips_with_zip_extension(→ polars.DataFrame)

Recursively finds all files with a .zip extension under the given directory and reads them as gzipped ndjson.

read_data(path)

validate_columns(→ tuple[bool, str | None])

Validate that default columns from table definition exist in the dataframe.

Module Contents

_SUPPORTED_EXTENSIONS: set[str]
_is_artifact(name: str) → bool

Return True for OS-generated junk entries (macOS, Windows, etc.).

Parameters:

name (str)

Return type:

bool
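
A minimal sketch of such a filter follows. The exact patterns the library matches are not documented here, so the names below (`.DS_Store`, `__MACOSX`, `Thumbs.db`, `desktop.ini`, `._*` resource forks) are assumptions based on common OS junk files, not the actual implementation:

```python
import posixpath

def is_artifact(name: str) -> bool:
    """Return True for OS-generated junk entries (sketch; patterns are assumptions)."""
    base = posixpath.basename(name.rstrip("/"))
    # macOS zips carry a __MACOSX/ directory with resource-fork copies.
    if "__MACOSX" in name.split("/"):
        return True
    # Finder metadata, AppleDouble files, and Windows thumbnail/config files.
    return base in {".DS_Store", "Thumbs.db", "desktop.ini"} or base.startswith("._")

print(is_artifact("__MACOSX/data.json.gz"))   # True
print(is_artifact("data/interactions.zip"))   # False
```

Filtering on the basename rather than the full path lets the check work for entries at any depth inside an archive.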

read_nested_zip_files(file_buffer) → polars.DataFrame

Reads a zip file buffer (uploaded from Streamlit) that contains .zip files, which are in fact gzipped ndjson files. Extracts, reads, and concatenates them into a single Polars DataFrame.

Parameters:

file_buffer (UploadedFile) – The uploaded zip file buffer from Streamlit.

Returns:

A concatenated Polars DataFrame containing the data from all gzipped ndjson files.

Return type:

pl.DataFrame

read_gzipped_data(data: io.BytesIO) → polars.DataFrame | None

Reads gzipped ndjson data from a BytesIO object and returns a Polars DataFrame.

Parameters:

data (BytesIO) – The gzipped ndjson data.

Returns:

The Polars DataFrame containing the data, or None if reading fails.

Return type:

pl.DataFrame | None

read_gzips_with_zip_extension(path: str) → polars.DataFrame

Recursively finds all files with a .zip extension under the given directory, treats them as gzipped ndjson files, reads, and concatenates them into a single Polars DataFrame. Supports arbitrary directory depth.

Parameters:

path (str) – The path to the directory containing the .zip files.

Returns:

A concatenated Polars DataFrame containing the data from all gzipped ndjson files.

Return type:

pl.DataFrame

read_data(path)
validate_columns(df: polars.LazyFrame, extract_type: dict[str, pdstools.decision_analyzer.column_schema.TableConfig]) → tuple[bool, str | None]

Validate that default columns from table definition exist in the dataframe.

This function checks if required columns exist in the data, accounting for the fact that columns may be present under either their source name or their target label name.

Parameters:

df (polars.LazyFrame) – The dataframe to validate
extract_type (dict[str, pdstools.decision_analyzer.column_schema.TableConfig]) – Table configuration mapping column names to their properties

Returns:

tuple containing validation success (bool) and error message (str or None)

Return type:

tuple[bool, str | None]