pdstools.utils.cdh_utils._polars
================================

.. py:module:: pdstools.utils.cdh_utils._polars

.. autoapi-nested-parse::

   Polars expression and frame helpers (queries, sampling, schema, overlap).

Attributes
----------

.. autoapisummary::

   pdstools.utils.cdh_utils._polars.POLARS_DURATION_PATTERN

Functions
---------

.. autoapisummary::

   pdstools.utils.cdh_utils._polars.is_valid_polars_duration
   pdstools.utils.cdh_utils._polars._apply_query
   pdstools.utils.cdh_utils._polars._combine_queries
   pdstools.utils.cdh_utils._polars._polars_capitalize
   pdstools.utils.cdh_utils._polars._extract_keys
   pdstools.utils.cdh_utils._polars.weighted_average_polars
   pdstools.utils.cdh_utils._polars.weighted_performance_polars
   pdstools.utils.cdh_utils._polars.overlap_matrix
   pdstools.utils.cdh_utils._polars.overlap_lists_polars
   pdstools.utils.cdh_utils._polars.lazy_sample
   pdstools.utils.cdh_utils._polars._apply_schema_types

Module Contents
---------------

.. py:data:: POLARS_DURATION_PATTERN

.. py:function:: is_valid_polars_duration(value: str, max_length: int = 30) -> bool

   Validate Polars duration syntax.

   Checks whether a string is a valid Polars duration (e.g., "1d", "1w", "1mo",
   "1h30m"). Used to validate user input before passing it to Polars methods
   such as ``dt.truncate()`` or ``group_by_dynamic()``.

   :param value: The duration string to validate.
   :type value: str
   :param max_length: Maximum allowed string length (prevents excessive input).
   :type max_length: int, default 30

   :returns: True if the string is a valid Polars duration, False otherwise.
   :rtype: bool

   .. rubric:: Examples

   >>> is_valid_polars_duration("1d")
   True
   >>> is_valid_polars_duration("1w")
   True
   >>> is_valid_polars_duration("1h30m")
   True
   >>> is_valid_polars_duration("invalid")
   False
   >>> is_valid_polars_duration("")
   False
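The duration grammar this validator accepts can be sketched with a small plain-Python approximation. Note that ``check_duration`` and ``_DURATION_RE`` below are illustrative names only; the module's actual ``POLARS_DURATION_PATTERN`` is not shown on this page and may accept a slightly different grammar.

```python
import re

# Illustrative approximation: a duration is one or more <number><unit> tokens,
# where the unit is ns, us, ms, mo, or one of s/m/h/d/w/q/y/i.
_DURATION_RE = re.compile(r"(\d+(?:ns|us|ms|mo|[smhdwqyi]))+")


def check_duration(value: str, max_length: int = 30) -> bool:
    """Mimic is_valid_polars_duration: non-empty, bounded length, and a
    sequence of <number><unit> tokens such as '1d' or '1h30m'."""
    if not value or len(value) > max_length:
        return False
    return _DURATION_RE.fullmatch(value) is not None
```

Validating up front like this lets callers fail fast with a clear error instead of surfacing a Polars parse error from deep inside ``dt.truncate()`` or ``group_by_dynamic()``.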
.. py:function:: _apply_query(df: polars.LazyFrame, query: pdstools.utils.types.QUERY | None = None, allow_empty: bool = False) -> polars.LazyFrame
                 _apply_query(df: polars.DataFrame, query: pdstools.utils.types.QUERY | None = None, allow_empty: bool = False) -> polars.DataFrame

.. py:function:: _combine_queries(existing_query: pdstools.utils.types.QUERY, new_query: polars.Expr) -> pdstools.utils.types.QUERY

.. py:function:: _polars_capitalize(df: pdstools.utils.cdh_utils._common.F, extra_endwords: collections.abc.Iterable[str] | None = None) -> pdstools.utils.cdh_utils._common.F

.. py:function:: _extract_keys(df: pdstools.utils.cdh_utils._common.F, key: str = 'Name', capitalize: bool = True) -> pdstools.utils.cdh_utils._common.F

   Extracts keys out of the pyName column.

   This is not a lazy operation, as we don't know the possible keys in advance.
   For that reason, we select only the key column, extract the keys from it,
   and then collect the resulting dataframe, which is joined back to the
   original dataframe. This is relatively efficient, but the whole pyName
   column still needs to fit in memory, so it won't work completely lazily
   from e.g. S3. That's why it only works in eager mode.

   The data in the column from which the JSON is extracted is normalized a
   little by removing non-space, non-printable characters (not just ASCII, of
   course). This may be relatively expensive.

   JSON extraction only happens on the unique values, which saves a lot of
   time when there are multiple snapshots of the same models. It also only
   processes rows for which the key column appears to be valid JSON; it will
   break when you "trick" it with malformed JSON.

   Column values for columns that are also encoded in the key column are
   overwritten with the values from the key column, but only for rows that are
   JSON. In previous versions all values were overwritten, resulting in many
   nulls.
   :param df: The dataframe to extract the keys from
   :type df: pl.DataFrame | pl.LazyFrame
   :param key: The column with embedded JSON
   :type key: str
   :param capitalize: If True (default), normalizes the names of the embedded columns; otherwise keeps the names as-is.
   :type capitalize: bool

.. py:function:: weighted_average_polars(vals: str | polars.Expr, weights: str | polars.Expr) -> polars.Expr

.. py:function:: weighted_performance_polars(vals: str | polars.Expr = 'Performance', weights: str | polars.Expr = 'ResponseCount') -> polars.Expr

   Polars expression that returns the weighted performance.

.. py:function:: overlap_matrix(df: polars.DataFrame, list_col: str, by: str, show_fraction: bool = True) -> polars.DataFrame

   Calculate the overlap of each list element with all other list elements, returning a full matrix.

   For each list in the specified column, this function calculates the overlap
   ratio (intersection size divided by the original list size) with every
   other list in the column, including itself. The result is a matrix where
   each row represents the overlap ratios of one list with all others.

   :param df: The Polars DataFrame containing the list column and grouping column.
   :type df: pl.DataFrame
   :param list_col: The name of the column containing the lists. Each element in this column should be a list.
   :type list_col: str
   :param by: The name of the column to use for grouping and labeling the rows in the result matrix.
   :type by: str

   :returns: A DataFrame where:

             - Each row represents the overlap ratios for one list with all others
             - Each column (except the last) represents the overlap ratio with a specific list
             - Column names are formatted as "Overlap_{list_col_name}_{by}"
             - The last column contains the original values from the 'by' column
   :rtype: pl.DataFrame

   .. rubric:: Examples

   >>> import polars as pl
   >>> df = pl.DataFrame({
   ...     "Channel": ["Mobile", "Web", "Email"],
   ...     "Actions": [
   ...         [1, 2, 3],
   ...         [2, 3, 4, 6],
   ...         [3, 5, 7, 8]
   ...     ]
   ... })
   >>> overlap_matrix(df, "Actions", "Channel")
   shape: (3, 4)
   ┌────────────────────┬────────────────┬────────────────┬─────────┐
   │ Overlap_Actions_M… │ Overlap_Actio… │ Overlap_Actio… │ Channel │
   │ ---                │ ---            │ ---            │ ---     │
   │ f64                │ f64            │ f64            │ str     │
   ╞════════════════════╪════════════════╪════════════════╪═════════╡
   │ 1.0                │ 0.6666667      │ 0.3333333      │ Mobile  │
   │ 0.5                │ 1.0            │ 0.25           │ Web     │
   │ 0.25               │ 0.25           │ 1.0            │ Email   │
   └────────────────────┴────────────────┴────────────────┴─────────┘

.. py:function:: overlap_lists_polars(col: polars.Series) -> polars.Series

   Calculate the average overlap ratio of each list element with all other list elements, returned as a single Series.

   For each list in the input Series, this function calculates the average
   overlap (intersection) with all other lists, normalized by the size of the
   original list. The overlap ratio represents how much each list has in
   common with the other lists on average.

   :param col: A Polars Series where each element is a list. The function calculates the overlap between each list and all other lists in the Series.
   :type col: pl.Series

   :returns: A Polars Series of float values representing the average overlap ratio for each list. Each value is calculated as: (sum of intersection sizes with all other lists) / (number of other lists) / (size of the original list)
   :rtype: pl.Series

   .. rubric:: Examples

   >>> import polars as pl
   >>> data = pl.Series([
   ...     [1, 2, 3],
   ...     [2, 3, 4, 6],
   ...     [3, 5, 7, 8]
   ... ])
   >>> overlap_lists_polars(data)
   shape: (3,)
   Series: '' [f64]
   [
       0.5
       0.375
       0.25
   ]
   >>> df = pl.DataFrame({"Channel": ["Mobile", "Web", "Email"], "Actions": pl.Series([
   ...     [1, 2, 3],
   ...     [2, 3, 4, 6],
   ...     [3, 5, 7, 8]
   ... ])})
   >>> df.with_columns(pl.col("Actions").map_batches(overlap_lists_polars))
   shape: (3, 2)
   ┌─────────┬─────────┐
   │ Channel │ Actions │
   │ ---     │ ---     │
   │ str     │ f64     │
   ╞═════════╪═════════╡
   │ Mobile  │ 0.5     │
   │ Web     │ 0.375   │
   │ Email   │ 0.25    │
   └─────────┴─────────┘
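The overlap formula documented above can be illustrated with a plain-Python reference implementation. ``average_overlap`` is an illustrative helper, not part of the module; it mirrors the documented formula (sum of intersection sizes / number of other lists / size of the original list), not the actual Polars-based implementation.

```python
from typing import List, Sequence


def average_overlap(lists: Sequence[Sequence[int]]) -> List[float]:
    """For each list, average its intersection size with every other list,
    normalized by the original list's size."""
    sets = [set(xs) for xs in lists]
    n = len(sets)
    out: List[float] = []
    for i, s in enumerate(sets):
        # Sum of intersection sizes with all *other* lists.
        total = sum(len(s & t) for j, t in enumerate(sets) if j != i)
        out.append(total / (n - 1) / len(s))
    return out


print(average_overlap([[1, 2, 3], [2, 3, 4, 6], [3, 5, 7, 8]]))
# → [0.5, 0.375, 0.25]
```

These are the same values shown in the ``overlap_lists_polars`` doctest above, which makes the normalization explicit: the first list shares 2 + 1 elements with the other two, giving 3 / 2 / 3 = 0.5.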
.. py:function:: lazy_sample(df: pdstools.utils.cdh_utils._common.F, n_rows: int, with_replacement: bool = True) -> pdstools.utils.cdh_utils._common.F

.. py:function:: _apply_schema_types(df: pdstools.utils.cdh_utils._common.F, definition, **timestamp_opts) -> pdstools.utils.cdh_utils._common.F

   Converts the data types of columns in a DataFrame to the desired types.
   The desired types are defined in a `PegaDefaultTables` class.

   :param df: The DataFrame whose columns' data types need to be converted.
   :type df: pl.LazyFrame
   :param definition: A `PegaDefaultTables` object that contains the desired data types for the columns.
   :type definition: PegaDefaultTables
   :param timestamp_opts: Additional arguments for timestamp parsing.
   :type timestamp_opts: str

   :returns: The DataFrame with the casting expressions applied.
   :rtype: pl.LazyFrame
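``lazy_sample`` carries no docstring on this page. As a hedged illustration of the general idea of drawing a bounded sample without materializing the whole input, here is a classic reservoir-sampling sketch in plain Python; ``reservoir_sample`` is an illustrative helper and not necessarily how the pdstools implementation works.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], n_rows: int,
                     seed: Optional[int] = None) -> List[T]:
    """Uniformly sample up to n_rows items from a stream in a single pass,
    keeping at most n_rows items in memory (classic reservoir sampling)."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < n_rows:
            reservoir.append(row)
        else:
            # Replace an existing item with probability n_rows / (i + 1).
            j = rng.randint(0, i)
            if j < n_rows:
                reservoir[j] = row
    return reservoir
```

The single-pass, bounded-memory property is what makes sampling compatible with lazy scans of data that is too large to collect in full.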