pdstools.utils.cdh_utils
========================

.. py:module:: pdstools.utils.cdh_utils


Attributes
----------

.. autoapisummary::

   pdstools.utils.cdh_utils.F
   pdstools.utils.cdh_utils.Figure


Functions
---------

.. autoapisummary::

   pdstools.utils.cdh_utils._apply_query
   pdstools.utils.cdh_utils._combine_queries
   pdstools.utils.cdh_utils.default_predictor_categorization
   pdstools.utils.cdh_utils._extract_keys
   pdstools.utils.cdh_utils.parse_pega_date_time_formats
   pdstools.utils.cdh_utils.safe_range_auc
   pdstools.utils.cdh_utils.auc_from_probs
   pdstools.utils.cdh_utils.auc_from_bincounts
   pdstools.utils.cdh_utils.aucpr_from_probs
   pdstools.utils.cdh_utils.aucpr_from_bincounts
   pdstools.utils.cdh_utils.auc_to_gini
   pdstools.utils.cdh_utils._capitalize
   pdstools.utils.cdh_utils._polars_capitalize
   pdstools.utils.cdh_utils.from_prpc_date_time
   pdstools.utils.cdh_utils.to_prpc_date_time
   pdstools.utils.cdh_utils.weighted_average_polars
   pdstools.utils.cdh_utils.weighted_performance_polars
   pdstools.utils.cdh_utils.overlap_lists_polars
   pdstools.utils.cdh_utils.z_ratio
   pdstools.utils.cdh_utils.lift
   pdstools.utils.cdh_utils.log_odds
   pdstools.utils.cdh_utils.feature_importance
   pdstools.utils.cdh_utils._apply_schema_types
   pdstools.utils.cdh_utils.gains_table
   pdstools.utils.cdh_utils.lazy_sample
   pdstools.utils.cdh_utils.legend_color_order
   pdstools.utils.cdh_utils.process_files_to_bytes
   pdstools.utils.cdh_utils.get_latest_pdstools_version
   pdstools.utils.cdh_utils.setup_logger
   pdstools.utils.cdh_utils.create_working_and_temp_dir


Module Contents
---------------

.. py:data:: F

.. py:data:: Figure

.. py:function:: _apply_query(df: F, query: Optional[pdstools.utils.types.QUERY] = None) -> F

.. py:function:: _combine_queries(existing_query: pdstools.utils.types.QUERY, new_query: polars.Expr) -> pdstools.utils.types.QUERY

.. py:function:: default_predictor_categorization(x: Union[str, polars.Expr] = pl.col('PredictorName')) -> polars.Expr

   Function to determine the 'category' of a predictor.

   It is possible to supply a custom function. Such a function can accept an
   optional column as input and should return a Polars expression. The most
   straightforward way to implement this is with
   ``pl.when().then().otherwise()``, which you can chain; see the sketch below.

   By default, this function returns "Primary" whenever there is no '.'
   anywhere in the name string, and otherwise returns the first string before
   the first period.

   :param x: The column to parse
   :type x: Union[str, pl.Expr], default = pl.col('PredictorName')
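A minimal sketch of such a custom categorization function, chaining
``pl.when().then().otherwise()``. The category names and prefixes used here
("Customer", "Account") are hypothetical:

.. code-block:: python

   import polars as pl

   def my_predictor_categorization(x: pl.Expr = pl.col("PredictorName")) -> pl.Expr:
       # Map predictors to a category based on their (hypothetical) name prefix
       return (
           pl.when(x.str.starts_with("Customer."))
           .then(pl.lit("Customer"))
           .when(x.str.starts_with("Account."))
           .then(pl.lit("Account"))
           .otherwise(pl.lit("Primary"))
       )

   df = pl.DataFrame({"PredictorName": ["Customer.Age", "Account.Balance", "Context"]})
   df.with_columns(my_predictor_categorization().alias("PredictorCategory"))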
.. py:function:: _extract_keys(df: F, key: str = 'Name', capitalize: bool = True) -> F

   Extracts keys out of the pyName column.

   This is not a lazy operation, as we don't know the possible keys in
   advance. For that reason, we select only the key column, extract the keys
   from that, and then collect the resulting dataframe. This dataframe is then
   joined back to the original dataframe. This is relatively efficient, but we
   still need the whole pyName column in memory to do this, so it won't work
   completely lazily from e.g. s3. That's why it only works in eager mode.

   The data in the column from which the JSON is extracted is normalized a
   little by taking out non-space, non-printable characters (not just ASCII).
   This may be relatively expensive.

   JSON extraction only happens on the unique values, which saves a lot of
   time with multiple snapshots of the same models. It also only processes
   rows for which the key column appears to be valid JSON; it will break when
   you "trick" it with malformed JSON.

   Column values for columns that are also encoded in the key column will be
   overwritten with values from the key column, but only for rows that are
   JSON. In previous versions all values were overwritten, resulting in many
   nulls.

   :param df: The dataframe to extract the keys from
   :type df: Union[pl.DataFrame, pl.LazyFrame]
   :param key: The column with embedded JSON
   :type key: str
   :param capitalize: If True (default), normalizes the names of the embedded
                      columns, otherwise keeps the names as-is.
   :type capitalize: bool

.. py:function:: parse_pega_date_time_formats(timestamp_col='SnapshotTime', timestamp_fmt: Optional[str] = None)

   Parses Pega DateTime formats.

   Supports commonly used formats:

   - "%Y-%m-%d %H:%M:%S"
   - "%Y%m%dT%H%M%S.%f %Z"
   - "%d-%b-%y"

   Removes timezones and rounds to seconds, with a 'ns' time unit.

   In the implementation, the third expression uses timestamp_fmt or "%Y".
   This is a bit of a hack: if we pass None, it tries to infer the format
   automatically, and inferring raises an error when it can't find an
   appropriate format.

   :param timestamp_col: The column to parse
   :type timestamp_col: str, default = 'SnapshotTime'
   :param timestamp_fmt: An optional format to use rather than the default formats
   :type timestamp_fmt: str, default = None

.. py:function:: safe_range_auc(auc: float) -> float

   Internal helper to keep the AUC a safe number, always between 0.5 and 1.0.

   :param auc: The AUC (Area Under the Curve) score
   :type auc: float
   :returns: 'Safe' AUC score, between 0.5 and 1.0
   :rtype: float

.. py:function:: auc_from_probs(groundtruth: List[int], probs: List[float]) -> List[float]

   Calculates AUC from an array of truth values and predictions.

   Calculates the area under the ROC curve from an array of truth values and
   predictions, making sure to always return a value between 0.5 and 1.0. It
   returns 0.5 when there is just one groundtruth label.

   :param groundtruth: The 'true' values. Positive values must be represented
                       as True or 1; negative values as False or 0.
   :type groundtruth: List[int]
   :param probs: The predictions, as a numeric vector of the same length as groundtruth
   :type probs: List[float]
   :returns: The AUC as a value between 0.5 and 1.
   :rtype: float

   .. rubric:: Examples

   >>> auc_from_probs([1, 1, 0], [0.6, 0.2, 0.2])

.. py:function:: auc_from_bincounts(pos: List[int], neg: List[int], probs: List[float] = None) -> float

   Calculates AUC from counts of positives and negatives directly.

   This is an efficient calculation of the area under the ROC curve directly
   from an array of positives and negatives. It makes sure to always return a
   value between 0.5 and 1.0 and will return 0.5 when there is just one
   groundtruth label.

   :param pos: Vector with counts of the positive responses
   :type pos: List[int]
   :param neg: Vector with counts of the negative responses
   :type neg: List[int]
   :param probs: Optional list with probabilities, used to set the order of
                 the bins. If missing, defaults to pos/(pos+neg).
   :type probs: List[float]
   :returns: The AUC as a value between 0.5 and 1.
   :rtype: float

   .. rubric:: Examples

   >>> auc_from_bincounts([3, 1, 0], [2, 0, 1])
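A minimal usage sketch of the bin-count AUC helper, using the counts from the
example above and converting the result to a Gini coefficient with
``auc_to_gini`` (documented further below):

.. code-block:: python

   from pdstools.utils import cdh_utils

   # Counts of positive and negative responses per predictor bin
   pos = [3, 1, 0]
   neg = [2, 0, 1]

   auc = cdh_utils.auc_from_bincounts(pos, neg)  # between 0.5 and 1.0
   gini = cdh_utils.auc_to_gini(auc)             # between 0 and 1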
.. py:function:: aucpr_from_probs(groundtruth: List[int], probs: List[float]) -> List[float]

   Calculates PR AUC (precision-recall) from an array of truth values and predictions.

   Calculates the area under the PR curve from an array of truth values and
   predictions. Returns 0.0 when there is just one groundtruth label.

   :param groundtruth: The 'true' values. Positive values must be represented
                       as True or 1; negative values as False or 0.
   :type groundtruth: List[int]
   :param probs: The predictions, as a numeric vector of the same length as groundtruth
   :type probs: List[float]
   :returns: The PR AUC as a value between 0.0 and 1.
   :rtype: float

   .. rubric:: Examples

   >>> aucpr_from_probs([1, 1, 0], [0.6, 0.2, 0.2])

.. py:function:: aucpr_from_bincounts(pos: List[int], neg: List[int], probs: List[float] = None) -> float

   Calculates PR AUC (precision-recall) from counts of positives and negatives directly.

   This is an efficient calculation of the area under the PR curve directly
   from an array of positives and negatives. Returns 0.0 when there is just
   one groundtruth label.

   :param pos: Vector with counts of the positive responses
   :type pos: List[int]
   :param neg: Vector with counts of the negative responses
   :type neg: List[int]
   :param probs: Optional list with probabilities, used to set the order of
                 the bins. If missing, defaults to pos/(pos+neg).
   :type probs: List[float]
   :returns: The PR AUC as a value between 0.0 and 1.
   :rtype: float

   .. rubric:: Examples

   >>> aucpr_from_bincounts([3, 1, 0], [2, 0, 1])

.. py:function:: auc_to_gini(auc: float) -> float

   Convert the AUC performance metric to GINI.

   :param auc: The AUC (number between 0.5 and 1)
   :type auc: float
   :returns: GINI metric, a number between 0 and 1
   :rtype: float

   .. rubric:: Examples

   >>> auc_to_gini(0.8232)

.. py:function:: _capitalize(fields: Union[str, Iterable[str]]) -> List[str]

   Applies automatic capitalization, aligned with the R counterpart.

   :param fields: A list of names
   :type fields: list
   :returns: **fields** -- The input list, but with each value properly capitalized
   :rtype: list

.. py:function:: _polars_capitalize(df: F) -> F

.. py:function:: from_prpc_date_time(x: str, return_string: bool = False) -> Union[datetime.datetime, str]

   Convert from a Pega date-time string.

   :param x: String of Pega date-time
   :type x: str
   :param return_string: If True, returns the date in string format; if False,
                         returns a datetime object.
   :type return_string: bool, default=False
   :returns: The converted date, as a datetime object or a string.
   :rtype: Union[datetime.datetime, str]

   .. rubric:: Examples

   >>> from_prpc_date_time("20180316T134127.847 GMT")
   >>> from_prpc_date_time("20180316T134127.847 GMT", True)
   >>> from_prpc_date_time("20180316T184127.846")
   >>> from_prpc_date_time("20180316T184127.846", True)

.. py:function:: to_prpc_date_time(dt: datetime.datetime) -> str

   Convert to a Pega date-time string.

   :param dt: A datetime object
   :type dt: datetime.datetime
   :returns: A string representation in the format used by Pega
   :rtype: str

   .. rubric:: Examples

   >>> to_prpc_date_time(datetime.datetime.now())

.. py:function:: weighted_average_polars(vals: Union[str, polars.Expr], weights: Union[str, polars.Expr]) -> polars.Expr

.. py:function:: weighted_performance_polars() -> polars.Expr

   Polars function to return a weighted performance.
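``weighted_average_polars`` returns a Polars expression, so it can be used
directly inside a ``group_by``/``agg``. A minimal sketch with hypothetical
column data:

.. code-block:: python

   import polars as pl
   from pdstools.utils import cdh_utils

   df = pl.DataFrame(
       {
           "Channel": ["Web", "Web", "Mobile"],
           "Performance": [0.55, 0.75, 0.65],
           "ResponseCount": [100, 300, 200],
       }
   )

   # Average performance per channel, weighted by the number of responses
   df.group_by("Channel").agg(
       cdh_utils.weighted_average_polars("Performance", "ResponseCount").alias(
           "WeightedPerformance"
       )
   )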
.. py:function:: overlap_lists_polars(col: polars.Series, row_validity: polars.Series) -> List[float]

   Calculate the overlap of each of the elements (must be a list) with all the others.

.. py:function:: z_ratio(pos_col: polars.Expr = pl.col('BinPositives'), neg_col: polars.Expr = pl.col('BinNegatives')) -> polars.Expr

   Calculates the Z-Ratio for predictor bins.

   The Z-Ratio is a measure of how the propensity in a bin differs from the
   average, but it takes into account the size of the bin and is thus
   statistically more relevant. It represents the number of standard
   deviations from the average, so it centers around 0. The wider the spread,
   the better the predictor is.

   To recreate the OOTB Z-Ratios from the datamart, use it in a group_by; see
   the examples.

   :param pos_col: The (Polars) column of the bin positives
   :type pos_col: pl.Expr
   :param neg_col: The (Polars) column of the bin negatives
   :type neg_col: pl.Expr

   .. rubric:: Examples

   >>> df.group_by(['ModelID', 'PredictorName']).agg([z_ratio()]).explode()

.. py:function:: lift(pos_col: polars.Expr = pl.col('BinPositives'), neg_col: polars.Expr = pl.col('BinNegatives')) -> polars.Expr

   Calculates the Lift for predictor bins.

   The Lift is the ratio of the propensity in a particular bin over the
   average propensity. So a value of 1 is the average, larger than 1 means
   higher propensity, and smaller means lower propensity.

   :param pos_col: The (Polars) column of the bin positives
   :type pos_col: pl.Expr
   :param neg_col: The (Polars) column of the bin negatives
   :type neg_col: pl.Expr

   .. rubric:: Examples

   >>> df.group_by(['ModelID', 'PredictorName']).agg([lift()]).explode()

.. py:function:: log_odds(positives=pl.col('Positives'), negatives=pl.col('ResponseCount') - pl.col('Positives'))

.. py:function:: feature_importance(over=['PredictorName', 'ModelID'])

.. py:function:: _apply_schema_types(df: F, definition, verbose=False, **timestamp_opts) -> F

   Converts the data types of columns in a DataFrame to the desired types.
   The desired types are defined in a `PegaDefaultTables` class.

   :param df: The DataFrame whose columns' data types need to be converted
   :type df: pl.LazyFrame
   :param definition: A `PegaDefaultTables` object that contains the desired
                      data types for the columns
   :type definition: PegaDefaultTables
   :param verbose: If True, the function will print a message when a column is
                   not in the default table schema
   :type verbose: bool
   :param timestamp_opts: Additional arguments for timestamp parsing
   :type timestamp_opts: str
   :returns: A list with Polars expressions for casting data types
   :rtype: List

.. py:function:: gains_table(df, value: str, index=None, by=None)

   Calculates cumulative gains from any data frame.

   The cumulative gains are the cumulative values expressed as a percentage
   vs. the size of the population, also expressed as a percentage.

   :param df: The (Polars) dataframe with the raw values
   :type df: pl.DataFrame
   :param value: The name of the field with the values (plotted on the y-axis)
   :type value: str
   :param index: Optional name of the field for the x-axis. If not passed in,
                 all records are used and weighted equally.
   :param by: Grouping field(s); can also be None
   :returns: A (Polars) dataframe with cum_x and cum_y columns and optionally
             the grouping column(s). Values for cum_x and cum_y are relative,
             so expressed as values between 0 and 1.
   :rtype: pl.DataFrame

   .. rubric:: Examples

   >>> gains_data = gains_table(df, 'ResponseCount', by=['Channel', 'Direction'])
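A minimal, self-contained sketch of ``gains_table`` on hypothetical data,
grouped per channel:

.. code-block:: python

   import polars as pl
   from pdstools.utils import cdh_utils

   df = pl.DataFrame(
       {
           "Channel": ["Web", "Web", "Mobile", "Mobile"],
           "ResponseCount": [100, 50, 30, 20],
       }
   )

   # Cumulative gains of responses per channel; cum_x and cum_y are in 0-1
   gains = cdh_utils.gains_table(df, "ResponseCount", by=["Channel"])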
.. py:function:: lazy_sample(df: F, n_rows: int, with_replacement: bool = True) -> F

.. py:function:: legend_color_order(fig)

   Orders legend colors alphabetically to provide Pega color consistency
   among different categories.

.. py:function:: process_files_to_bytes(file_paths: List[Union[str, pathlib.Path]], base_file_name: Union[str, pathlib.Path]) -> Tuple[bytes, str]

   Processes a list of file paths, returning file content as bytes and a
   corresponding file name. Useful for zipping multiple model reports; the
   bytes object is used for downloading files in a Streamlit app.

   This function handles three scenarios:

   1. Single file: Returns the file's content as bytes and the provided base
      file name.
   2. Multiple files: Creates a zip file containing all files, and returns
      the zip file's content as bytes and a generated zip file name.
   3. No files: Returns empty bytes and an empty string.

   :param file_paths: A list of file paths to process. Can be empty, or
                      contain a single path or multiple paths.
   :type file_paths: List[Union[str, Path]]
   :param base_file_name: The base name to use for the output file. For a
                          single file, this name is returned as-is. For
                          multiple files, it is used as part of the generated
                          zip file name.
   :type base_file_name: Union[str, Path]
   :returns: A tuple containing:

             - bytes: The content of the single file or the created zip file,
               or empty bytes if there are no files.
             - str: The file name (either base_file_name or a generated zip
               file name), or an empty string if there are no files.
   :rtype: Tuple[bytes, str]

.. py:function:: get_latest_pdstools_version()

.. py:function:: setup_logger()

   Returns a logger and log buffer at root level.

.. py:function:: create_working_and_temp_dir(name: Optional[str] = None, working_dir: Optional[os.PathLike] = None) -> Tuple[pathlib.Path, pathlib.Path]

   Creates a working directory for saving files and a temporary directory.
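A minimal usage sketch of ``create_working_and_temp_dir``. The return order
(working directory first, then the temporary directory) is an assumption
based on the docstring wording:

.. code-block:: python

   from pdstools.utils import cdh_utils

   # Assumption: returns (working_dir, temp_dir), both as pathlib.Path
   working_dir, temp_dir = cdh_utils.create_working_and_temp_dir(name="model_report")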