pdstools version 3: Polars

With the version 3 release of pdstools, the back-end of the library has shifted to Polars. Polars is a blazingly fast DataFrames library implemented in Rust, using the Apache Arrow columnar format as its memory model. This means faster analyses and a more robust API, both externally and internally within our library. However, this is not a minor change, and it will change the way you interact with the library. This article goes over some of the most notable changes and, most importantly, best practices.

First off:

Lazy execution graphs

One of the most powerful features of Polars, if not the single most powerful, is its lazy execution API. This allows you to iteratively build up your query, have Polars optimise it for you, and only collect the data once your entire query is defined. In practice, this means that all row-wise filters and column-wise selects are pushed right down to the moment you read the file, and all computations are optimised further. Let's look at an example, and one close to home!
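As a minimal plain-Polars sketch of this idea (the file name here is hypothetical):

import polars as pl

# Nothing is read at this point: we only build up a query plan.
lf = (
    pl.scan_parquet("snapshots.parquet")  # hypothetical file
    .filter(pl.col("ResponseCount") > 0)
    .select(["ModelID", "Positives", "ResponseCount"])
)

# Only now does Polars touch the file, and it pushes the filter and
# the column selection down into the read itself.
df = lf.collect()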

If we read in the shipped CDHSample dataset and pass in the keyword argument import_strategy='lazy', we can demonstrate this effect. Note: by default, the import strategy is 'eager'. We'll get to that later.

[2]:
from pdstools import datasets, ADMDatamart
import polars as pl
dm = datasets.CDHSample(import_strategy='lazy')
dm
[2]:
<pdstools.adm.ADMDatamart.ADMDatamart at 0x7f362bd44350>

If you’ve used the previous version of pdstools, you’ve probably noticed nothing’s changed. You still get the same ADMDatamart object.

Try accessing the modelData, predictorData or combinedData properties, however, and things are different.

[3]:
dm.modelData
[3]:

NAIVE QUERY PLAN

run LazyFrame.show_graph() to see the optimized version

[Graphviz rendering of the naive query plan: TABLE π */27; σ - → RENAME → π 12/27 → WITH COLUMNS ["Group","SnapshotTime","Issue","Configuration","Channel","Direction","ResponseCount","Positives"] → WITH COLUMNS ["SuccessRate","Last_Positives","Last_ResponseCount"]]

Not a DataFrame, but a NAIVE QUERY PLAN. It's relatively intuitive, but you read this chart bottom-to-top:

  • We read the table: \(\pi\) */27 (reading all columns out of the 27 available) and \(\sigma\) - (no row-wise filters applied).

  • We then rename the columns in a predefined manner.

  • Then we keep only 12 out of the 27 available columns.

  • We change ["Issue","Group","Channel","Direction","Configuration"] to be categorical, and "Performance" to numeric.

  • We transform "SnapshotTime" to datetime.

  • Finally, we add ["SuccessRate","Last_Positives","Last_ResponseCount"] as columns.
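In plain Polars, that pipeline looks roughly like the sketch below. This is illustrative only, not pdstools' actual code: the file name, the rename mapping, and the source format of "SnapshotTime" are assumptions.

import polars as pl

kept_columns = [
    "ModelID", "Name", "Treatment", "Issue", "Group", "Channel", "Direction",
    "Configuration", "Performance", "ResponseCount", "Positives", "SnapshotTime",
]

lf = (
    pl.scan_parquet("model_snapshots.parquet")         # scan, don't read
    .rename({"pyIssue": "Issue", "pyGroup": "Group"})  # predefined renaming (illustrative)
    .select(kept_columns)                              # keep 12 of the 27 columns
    .with_columns(
        pl.col(["Issue", "Group", "Channel", "Direction", "Configuration"]).cast(pl.Categorical),
        pl.col("Performance").cast(pl.Float64),
        pl.col("SnapshotTime").str.strptime(pl.Datetime),  # assuming string timestamps
    )
    .with_columns(
        (pl.col("Positives") / pl.col("ResponseCount")).alias("SuccessRate"),
    )
)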

A sharp eye might already notice an inefficiency here. If we first read in all columns, then rename, and only then drop most of them, can't we infer up front which columns we need based on our predefined renaming, and select just those while reading? Well, yes!

This is the naive plan. If we call .show_graph(), we get the optimized version:

[4]:
dm.modelData.show_graph()
../_images/articles_pdstoolsv3_6_0.svg

At first glance, not a huge change. However, note that at the bottom we now read \(\pi\) 12/27: we select the columns right when reading the file, and only read in 12 of them! This can make a big difference, especially because one column ("pyModelData") is very big.

Thus far, the \(\sigma\) symbol has been -. We can change that too: the \(\sigma\) symbol shows which filters are applied to the rows while reading. For our example, let's say we're only interested in actions containing Visa. We can add that filter right to our execution plan!

To tell Polars we're trying to filter on a column, we simply start our query with pl.col("Name"). The Name column, containing the action names, is a string column. To access string-based operations, we use .str to enter the string namespace. We want the string to contain Visa, therefore: query=pl.col("Name").str.contains("Visa").

[5]:
dm = datasets.CDHSample(
    import_strategy="lazy", query=pl.col("Name").str.contains("Visa")
)
dm.modelData.show_graph()
../_images/articles_pdstoolsv3_8_0.svg

“Okay, but how do we actually get to the data?” - Simple! Just use .collect().

[6]:
dm.modelData.collect()
[6]:
shape: (22, 15)
Group | SnapshotTime | Issue | Treatment | Configuration | Performance | ModelID | Name | Channel | Direction | ResponseCount | Positives | SuccessRate | Last_Positives | Last_ResponseCount
cat | datetime[ns] | cat | null | cat | f64 | str | str | cat | cat | f32 | f32 | f32 | datetime[ns] | datetime[ns]
"CreditCards" | 2021-06-01 08:23:26.789 | "Sales" | null | "OmniAdaptiveModel" | 0.5 | "17571d69-f802-5122-9bfe-6f3f4e…" | "VisaGold" | "Web" | "Inbound" | 72.0 | 1.0 | 0.013889 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789
"CreditCards" | 2021-06-01 08:23:26.789 | "Sales" | null | "OmniAdaptiveModel" | 0.5 | "1be7cd72-741b-5d68-b148-c1af01…" | "VisaGold" | "SMS" | "Outbound" | 166.0 | 4.0 | 0.024096 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789
"CreditCards" | 2021-06-01 08:27:55.789 | "Sales" | null | "OmniAdaptiveModel" | 0.5 | "17571d69-f802-5122-9bfe-6f3f4e…" | "VisaGold" | "Web" | "Inbound" | 161.0 | 3.0 | 0.018634 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789
"CreditCards" | 2021-06-01 09:03:01.789 | "Sales" | null | "OmniAdaptiveModel" | 0.5 | "17571d69-f802-5122-9bfe-6f3f4e…" | "VisaGold" | "Web" | "Inbound" | 161.0 | 3.0 | 0.018634 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789
"CreditCards" | 2021-06-01 09:03:01.789 | "Sales" | null | "OmniAdaptiveModel" | 0.5 | "1be7cd72-741b-5d68-b148-c1af01…" | "VisaGold" | "SMS" | "Outbound" | 166.0 | 4.0 | 0.024096 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789
…
"CreditCards" | 2021-06-01 12:42:04.789 | "Sales" | null | "OmniAdaptiveModel" | 0.524312 | "1be7cd72-741b-5d68-b148-c1af01…" | "VisaGold" | "SMS" | "Outbound" | 674.0 | 21.0 | 0.031157 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789
"CreditCards" | 2021-06-01 12:45:20.789 | "Sales" | null | "OmniAdaptiveModel" | 0.535621 | "1be7cd72-741b-5d68-b148-c1af01…" | "VisaGold" | "SMS" | "Outbound" | 746.0 | 24.0 | 0.032172 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789
"CreditCards" | 2021-06-01 13:18:21.789 | "Sales" | null | "OmniAdaptiveModel" | 0.538104 | "17571d69-f802-5122-9bfe-6f3f4e…" | "VisaGold" | "Web" | "Inbound" | 706.0 | 43.0 | 0.060907 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789
"CreditCards" | 2021-06-01 13:23:26.789 | "Sales" | null | "OmniAdaptiveModel" | 0.531863 | "1be7cd72-741b-5d68-b148-c1af01…" | "VisaGold" | "SMS" | "Outbound" | 822.0 | 27.0 | 0.032847 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789
"CreditCards" | 2021-06-01 13:23:26.789 | "Sales" | null | "OmniAdaptiveModel" | 0.519586 | "17571d69-f802-5122-9bfe-6f3f4e…" | "VisaGold" | "Web" | "Inbound" | 778.0 | 46.0 | 0.059126 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789

This table is very similar to a pandas table, though there are some slight differences. The types are explicitly mentioned in the header, string values are quoted, and long string values get cut off more aggressively. Other than that, there's no real change! And if you still want to use pandas, just call .to_pandas() :)

[7]:
dm.modelData.collect().to_pandas()
[7]:
Group SnapshotTime Issue Treatment Configuration Performance ModelID Name Channel Direction ResponseCount Positives SuccessRate Last_Positives Last_ResponseCount
0 CreditCards 2021-06-01 08:23:26.789 Sales None OmniAdaptiveModel 0.500000 17571d69-f802-5122-9bfe-6f3f4ef9073b VisaGold Web Inbound 72.0 1.0 0.013889 2021-06-01 13:23:26.789 2021-06-01 13:23:26.789
1 CreditCards 2021-06-01 08:23:26.789 Sales None OmniAdaptiveModel 0.500000 1be7cd72-741b-5d68-b148-c1af0142fd4a VisaGold SMS Outbound 166.0 4.0 0.024096 2021-06-01 13:23:26.789 2021-06-01 13:23:26.789
2 CreditCards 2021-06-01 08:27:55.789 Sales None OmniAdaptiveModel 0.500000 17571d69-f802-5122-9bfe-6f3f4ef9073b VisaGold Web Inbound 161.0 3.0 0.018634 2021-06-01 13:23:26.789 2021-06-01 13:23:26.789
3 CreditCards 2021-06-01 09:03:01.789 Sales None OmniAdaptiveModel 0.500000 17571d69-f802-5122-9bfe-6f3f4ef9073b VisaGold Web Inbound 161.0 3.0 0.018634 2021-06-01 13:23:26.789 2021-06-01 13:23:26.789
4 CreditCards 2021-06-01 09:03:01.789 Sales None OmniAdaptiveModel 0.500000 1be7cd72-741b-5d68-b148-c1af0142fd4a VisaGold SMS Outbound 166.0 4.0 0.024096 2021-06-01 13:23:26.789 2021-06-01 13:23:26.789
5 CreditCards 2021-06-01 09:06:33.789 Sales None OmniAdaptiveModel 0.500000 1be7cd72-741b-5d68-b148-c1af0142fd4a VisaGold SMS Outbound 271.0 7.0 0.025830 2021-06-01 13:23:26.789 2021-06-01 13:23:26.789
6 CreditCards 2021-06-01 09:06:33.789 Sales None OmniAdaptiveModel 0.554538 17571d69-f802-5122-9bfe-6f3f4ef9073b VisaGold Web Inbound 233.0 12.0 0.051502 2021-06-01 13:23:26.789 2021-06-01 13:23:26.789
7 CreditCards 2021-06-01 09:34:34.789 Sales None OmniAdaptiveModel 0.554538 17571d69-f802-5122-9bfe-6f3f4ef9073b VisaGold Web Inbound 233.0 12.0 0.051502 2021-06-01 13:23:26.789 2021-06-01 13:23:26.789
8 CreditCards 2021-06-01 09:43:16.789 Sales None OmniAdaptiveModel 0.500000 1be7cd72-741b-5d68-b148-c1af0142fd4a VisaGold SMS Outbound 353.0 8.0 0.022663 2021-06-01 13:23:26.789 2021-06-01 13:23:26.789
9 CreditCards 2021-06-01 09:43:16.789 Sales None OmniAdaptiveModel 0.541487 17571d69-f802-5122-9bfe-6f3f4ef9073b VisaGold Web Inbound 320.0 19.0 0.059375 2021-06-01 13:23:26.789 2021-06-01 13:23:26.789
10 CreditCards 2021-06-01 10:07:35.789 Sales None OmniAdaptiveModel 0.500000 1be7cd72-741b-5d68-b148-c1af0142fd4a VisaGold SMS Outbound 437.0 10.0 0.022883 2021-06-01 13:23:26.789 2021-06-01 13:23:26.789
11 CreditCards 2021-06-01 10:07:35.789 Sales None OmniAdaptiveModel 0.557512 17571d69-f802-5122-9bfe-6f3f4ef9073b VisaGold Web Inbound 391.0 24.0 0.061381 2021-06-01 13:23:26.789 2021-06-01 13:23:26.789
12 CreditCards 2021-06-01 10:58:26.789 Sales None OmniAdaptiveModel 0.531173 17571d69-f802-5122-9bfe-6f3f4ef9073b VisaGold Web Inbound 470.0 28.0 0.059574 2021-06-01 13:23:26.789 2021-06-01 13:23:26.789
13 CreditCards 2021-06-01 10:58:26.789 Sales None OmniAdaptiveModel 0.551713 1be7cd72-741b-5d68-b148-c1af0142fd4a VisaGold SMS Outbound 507.0 17.0 0.033531 2021-06-01 13:23:26.789 2021-06-01 13:23:26.789
14 CreditCards 2021-06-01 11:29:54.789 Sales None OmniAdaptiveModel 0.551713 1be7cd72-741b-5d68-b148-c1af0142fd4a VisaGold SMS Outbound 507.0 17.0 0.033531 2021-06-01 13:23:26.789 2021-06-01 13:23:26.789
15 CreditCards 2021-06-01 12:15:05.789 Sales None OmniAdaptiveModel 0.524312 1be7cd72-741b-5d68-b148-c1af0142fd4a VisaGold SMS Outbound 674.0 21.0 0.031157 2021-06-01 13:23:26.789 2021-06-01 13:23:26.789
16 CreditCards 2021-06-01 12:15:05.789 Sales None OmniAdaptiveModel 0.549843 17571d69-f802-5122-9bfe-6f3f4ef9073b VisaGold Web Inbound 633.0 39.0 0.061611 2021-06-01 13:23:26.789 2021-06-01 13:23:26.789
17 CreditCards 2021-06-01 12:42:04.789 Sales None OmniAdaptiveModel 0.524312 1be7cd72-741b-5d68-b148-c1af0142fd4a VisaGold SMS Outbound 674.0 21.0 0.031157 2021-06-01 13:23:26.789 2021-06-01 13:23:26.789
18 CreditCards 2021-06-01 12:45:20.789 Sales None OmniAdaptiveModel 0.535621 1be7cd72-741b-5d68-b148-c1af0142fd4a VisaGold SMS Outbound 746.0 24.0 0.032172 2021-06-01 13:23:26.789 2021-06-01 13:23:26.789
19 CreditCards 2021-06-01 13:18:21.789 Sales None OmniAdaptiveModel 0.538104 17571d69-f802-5122-9bfe-6f3f4ef9073b VisaGold Web Inbound 706.0 43.0 0.060907 2021-06-01 13:23:26.789 2021-06-01 13:23:26.789
20 CreditCards 2021-06-01 13:23:26.789 Sales None OmniAdaptiveModel 0.531863 1be7cd72-741b-5d68-b148-c1af0142fd4a VisaGold SMS Outbound 822.0 27.0 0.032847 2021-06-01 13:23:26.789 2021-06-01 13:23:26.789
21 CreditCards 2021-06-01 13:23:26.789 Sales None OmniAdaptiveModel 0.519586 17571d69-f802-5122-9bfe-6f3f4ef9073b VisaGold Web Inbound 778.0 46.0 0.059126 2021-06-01 13:23:26.789 2021-06-01 13:23:26.789

Now, we don't just have modelData; there's also information about the predictors. This lives under the predictorData attribute, and it works the same way. Thirdly, there is an attribute called combinedData. combinedData is technically just an inner join between the model data and the predictor data on the model ID. Therefore, the model IDs we filter out of the models also propagate to the combined data.
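Conceptually, that join looks roughly like the sketch below. This is just a sketch of the idea, not pdstools' exact implementation:

# Take the last snapshot of each frame, then join on the model ID.
last_models = dm.modelData.filter(
    pl.col("SnapshotTime") == pl.col("SnapshotTime").max()
)
last_predictors = dm.predictorData.filter(
    pl.col("SnapshotTime") == pl.col("SnapshotTime").max()
)
combined = last_models.join(last_predictors, on="ModelID", how="inner")

To give more insight into this join, we can simply show its execution graph: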

[8]:
dm.combinedData
[8]:

NAIVE QUERY PLAN

run LazyFrame.show_graph() to see the optimized version

[Graphviz rendering of the naive query plan: the modelData pipeline (TABLE π */27; σ - → RENAME → π 12/27 → WITH COLUMNS → FILTER BY col("Name").str.contains(...)) appears twice, once to determine which model IDs survive the filter and once to join; the predictorData pipeline (TABLE π */36; σ - → RENAME → π 16/36 → WITH COLUMNS ["BinPropensity","BinAdjustedPropensity","PredictorCategory"]) is joined with it via JOIN INNER on col("ModelID"), with both sides first filtered to the last snapshot (FILTER BY col("SnapshotTime")...)]

While a beautiful graph, it is quite a lot to digest. The left side of the chart should look very familiar, as should the right side. The 'middle' part is new: this is the execution plan for the predictor data. Here we only parse Performance to a float, parse the timezones, and add ["BinPropensity","BinAdjustedPropensity"] as columns. For both the model data and the predictor data we take only the last snapshot when combining them, which you can see right before the topmost inner join.

Again, a sharp eye will notice quite some inefficiencies here. The biggest one is that we now import modelData twice: once to filter out model IDs that don't match the filter criteria, and once more to join with the predictor table. But remember, this is not the optimized graph:

[9]:
dm.combinedData.show_graph()
[Graphviz rendering of the optimized query plan: modelData is read once and cached, and the cached result is reused for both the filter on model IDs and the join with the predictor data]

Pretty neat! Not only do we not import modelData twice, we cache it at a certain point in the execution graph so we can use it twice. And we get all of this for free with Polars' optimization engine! Also remember: at this point, we still haven't read in any data. We can even profile this execution graph, but that's a challenge to figure out on your own :). We can easily show the gains in speed with our new save_data method though: it saves our data to disk in an efficient format, and gives us a fast way to benchmark.

[10]:
files_without_subset = datasets.CDHSample(
    import_strategy="lazy", subset=False
).save_data()
files_with_subset = datasets.CDHSample(
    import_strategy="lazy", query=pl.col("Name").str.contains("Visa")
).save_data()

[11]:
%timeit ADMDatamart(model_filename=files_without_subset[0], predictor_filename=files_without_subset[1], import_strategy='lazy').combinedData.collect()
18.6 ms ± 69.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
[12]:
%timeit ADMDatamart(model_filename=files_with_subset[0], predictor_filename=files_with_subset[1], import_strategy='lazy').combinedData.collect()
11.5 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
[13]:
from os import remove
[remove(file) for file in [*files_with_subset, *files_without_subset]]
[13]:
[None, None, None, None]

This roughly 2x speedup will only become more significant as the size of the data increases.

Lazy vs Eager

As mentioned before, however, the default import strategy is "eager". This still uses the execution graph we just described; the difference lies in what happens after all of this preprocessing, right before the ADMDatamart class is returned: .collect().lazy(). Opening up modelData in eager mode shows that we don't have a graph anymore:

[14]:
modd = datasets.CDHSample().modelData
print(type(modd))
modd.show_graph()

<class 'polars.lazyframe.frame.LazyFrame'>
../_images/articles_pdstoolsv3_23_1.svg

The type of modelData is still a LazyFrame (whose representation is a query plan). DataFrames and LazyFrames are not that different: a LazyFrame builds up a query and executes it only once you ask it to, while a DataFrame executes each command as you call it. In eager mode, however, that execution graph is empty.

That leads to the fundamental difference between the two import_strategy options:

  • In lazy mode, we only build the execution graph; rather than reading the data, we only scan it.

  • In eager mode, we build the execution graph and execute it, reading all data into memory. This means we don't have to execute that graph every time we call it.

Eager mode still maintains LazyFrames for both modelData and predictorData, but those tables are no longer stored on disk: they are kept in memory. Any subsequent operations build up a new execution graph, which is only executed when calling .collect().
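To illustrate with a plain-Polars sketch, this is essentially what eager mode leaves you with:

import polars as pl

df = pl.DataFrame({"x": [1, 2, 3]})  # data fully in memory
lf = df.lazy()                       # a LazyFrame again, but with an empty graph
out = lf.filter(pl.col("x") > 1).collect()  # new operations build a new graph,
                                            # executed only on .collect()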

When to use which?

By default, we use eager mode. We recommend using that, unless:

  • Your data is too big to fit into memory

  • You only need a single plot

In all other cases, it's probably fine to just use eager mode.

File Types

Pega's default export format is a zip file containing a data.json file in newline-delimited JSON format. While this is a decent storage format, it doesn't allow for any scanning operations, because we need to read the entire zip into memory before we can read anything. Therefore, if you're using the default export format, we set the import strategy to eager. If you're working with very large files and need faster processing, it may be worth unzipping the files, or even transforming them to Arrow/IPC (fastest) or Parquet (most efficient). Then you can use lazy mode and, as an added benefit, process files larger than memory.
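A one-time conversion could look like the sketch below (file names are hypothetical):

import polars as pl

# After unzipping the Pega export:
df = pl.read_ndjson("data.json")             # newline-delimited JSON
df.write_ipc("model_snapshots.arrow")        # Arrow/IPC: fastest
df.write_parquet("model_snapshots.parquet")  # Parquet: most efficient

# Both formats can then be scanned lazily, without reading everything:
lf = pl.scan_ipc("model_snapshots.arrow")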

Further Polars examples

To help you get adjusted to this new syntax, let’s go over a few more examples.

The regular facets syntax still works, and can be combined with the new querying syntax.

[15]:
dm = datasets.CDHSample()

dm.plotPerformanceSuccessRateBubbleChart(
    query=pl.col("Issue") == "Sales", facets=["Channel"]
)

However, if you want to combine queries, say to look at Sales on the Web channel only, the best way to do that is as follows (also note the extra parentheses around each expression):

[16]:
dm.plotPerformanceSuccessRateBubbleChart(
    query=(pl.col("Issue") == "Sales") & (pl.col("Channel") == "Web")
)

We can make use of string operations (hover to verify for yourself!):

[17]:
dm.plotPerformanceSuccessRateBubbleChart(
    query=(pl.col("Name").str.starts_with("Premier"))
    | (pl.col("Name").str.ends_with("Gold"))
)

Since we have full access to Polars' expressions, we can use a very wide range of functions, including window functions, folds, and slices.

For something more advanced, let's look at the highest-performing model per group.

[18]:
dm.plotPerformanceSuccessRateBubbleChart(
    query=(pl.col("Performance") == pl.max("Performance")).over("Group")
)

Or we can apply temporal filters, such as only looking at the first snapshot:

[19]:
dm.plotPerformanceSuccessRateBubbleChart(
    query=pl.col("SnapshotTime") == pl.col("SnapshotTime").min()
)
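Or, as one more sketch (the time window here is made up, based on the sample data's snapshot times), a date range using is_between:

from datetime import datetime

dm.plotPerformanceSuccessRateBubbleChart(
    query=pl.col("SnapshotTime").is_between(
        datetime(2021, 6, 1, 9), datetime(2021, 6, 1, 12)
    )
)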