pdstools version 3: Polars¶
With the version 3 release of pdstools, the back-end of the library has shifted to Polars. Polars is a blazingly fast DataFrames library implemented in Rust, using the Apache Arrow columnar format as its memory model. This means faster analyses and a more robust API, both externally and internally within our library. It is not a minor change, however, and it will change the way you interact with the library. This article goes over some of the most notable changes and, most importantly, best practices.
First off:
Lazy execution graphs¶
One of the most powerful features of Polars, if not the single most powerful, is its lazy execution API. This allows you to iteratively build up your query, have Polars optimise it for you, and only collect the data once your entire query is defined. In practice, this means that all row-wise filters and column-wise selects are pushed right down to the moment the file is read, and all other computations are optimised further. Let’s look at an example, and one close to home!
If we simply read in the shipped CDHSample dataset and pass the keyword import_strategy='lazy', we can demonstrate this effect. Note: by default the import strategy is 'eager'; we’ll get to that later.
[2]:
from pdstools import datasets, ADMDatamart
import polars as pl
dm = datasets.CDHSample(import_strategy='lazy')
dm
[2]:
<pdstools.adm.ADMDatamart.ADMDatamart at 0x7f362bd44350>
If you’ve used the previous version of pdstools, you’ve probably noticed nothing’s changed. You still get the same ADMDatamart
object.
Try accessing the modelData, predictorData or combinedData properties however, and things are different.
[3]:
dm.modelData
[3]:
NAIVE QUERY PLAN
run LazyFrame.show_graph() to see the optimized version
Not a DataFrame, but a NAIVE QUERY PLAN. It’s relatively intuitive, but you should read this chart bottom-to-top:
We read the table: \(\pi\) */27 (reading all columns out of the 27 available) and \(\sigma\) - (no row-wise filters applied).
We then rename the columns in a predefined manner.
Then we keep only 12 of the 27 available columns.
We change ["Issue","Group","Channel","Direction","Configuration"] to be categorical, and "Performance" to numeric.
We transform "SnapshotTime" to datetime.
We add ["SuccessRate","Last_Positives","Last_ResponseCount"] as columns.
A sharp eye might already notice an inefficiency here. If we first read in all columns, then rename, and only then drop most of them, can’t we infer up front which columns we need based on our predefined function, and select only those while reading? Well, yes! The above is the naive plan. If we call .show_graph(), we get the optimized version:
[4]:
dm.modelData.show_graph()
At first glance, not a huge change. However, note that at the bottom we now have \(\pi\) 12/27: we select columns right when reading the file, and only read in 12 of them! This can make a big difference, especially because one column ("pyModelData") is very big.
Thus far, the \(\sigma\) symbol has been -. We can also change that: the \(\sigma\) symbol shows which filters are applied on the rows while reading. For the purpose of our example, let’s say we’re only interested in actions containing Visa. We can add that filter right to our execution plan!
To tell Polars we’re trying to filter on a column, we simply start our query with pl.col("Name"). The Name column, containing the action names, is a string column. Therefore, to access string-based operations, we simply use .str to go to the string namespace. We want the string to contain Visa, therefore: query=pl.col("Name").str.contains("Visa").
[5]:
dm = datasets.CDHSample(
import_strategy="lazy", query=pl.col("Name").str.contains("Visa")
)
dm.modelData.show_graph()
“Okay, but how do we actually get to the data?” - Simple! Just use .collect()
.
[6]:
dm.modelData.collect()
[6]:
Group | SnapshotTime | Issue | Treatment | Configuration | Performance | ModelID | Name | Channel | Direction | ResponseCount | Positives | SuccessRate | Last_Positives | Last_ResponseCount |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
cat | datetime[ns] | cat | null | cat | f64 | str | str | cat | cat | f32 | f32 | f32 | datetime[ns] | datetime[ns] |
"CreditCards" | 2021-06-01 08:23:26.789 | "Sales" | null | "OmniAdaptiveModel" | 0.5 | "17571d69-f802-5122-9bfe-6f3f4e… | "VisaGold" | "Web" | "Inbound" | 72.0 | 1.0 | 0.013889 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
"CreditCards" | 2021-06-01 08:23:26.789 | "Sales" | null | "OmniAdaptiveModel" | 0.5 | "1be7cd72-741b-5d68-b148-c1af01… | "VisaGold" | "SMS" | "Outbound" | 166.0 | 4.0 | 0.024096 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
"CreditCards" | 2021-06-01 08:27:55.789 | "Sales" | null | "OmniAdaptiveModel" | 0.5 | "17571d69-f802-5122-9bfe-6f3f4e… | "VisaGold" | "Web" | "Inbound" | 161.0 | 3.0 | 0.018634 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
"CreditCards" | 2021-06-01 09:03:01.789 | "Sales" | null | "OmniAdaptiveModel" | 0.5 | "17571d69-f802-5122-9bfe-6f3f4e… | "VisaGold" | "Web" | "Inbound" | 161.0 | 3.0 | 0.018634 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
"CreditCards" | 2021-06-01 09:03:01.789 | "Sales" | null | "OmniAdaptiveModel" | 0.5 | "1be7cd72-741b-5d68-b148-c1af01… | "VisaGold" | "SMS" | "Outbound" | 166.0 | 4.0 | 0.024096 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
"CreditCards" | 2021-06-01 12:42:04.789 | "Sales" | null | "OmniAdaptiveModel" | 0.524312 | "1be7cd72-741b-5d68-b148-c1af01… | "VisaGold" | "SMS" | "Outbound" | 674.0 | 21.0 | 0.031157 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
"CreditCards" | 2021-06-01 12:45:20.789 | "Sales" | null | "OmniAdaptiveModel" | 0.535621 | "1be7cd72-741b-5d68-b148-c1af01… | "VisaGold" | "SMS" | "Outbound" | 746.0 | 24.0 | 0.032172 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
"CreditCards" | 2021-06-01 13:18:21.789 | "Sales" | null | "OmniAdaptiveModel" | 0.538104 | "17571d69-f802-5122-9bfe-6f3f4e… | "VisaGold" | "Web" | "Inbound" | 706.0 | 43.0 | 0.060907 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
"CreditCards" | 2021-06-01 13:23:26.789 | "Sales" | null | "OmniAdaptiveModel" | 0.531863 | "1be7cd72-741b-5d68-b148-c1af01… | "VisaGold" | "SMS" | "Outbound" | 822.0 | 27.0 | 0.032847 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
"CreditCards" | 2021-06-01 13:23:26.789 | "Sales" | null | "OmniAdaptiveModel" | 0.519586 | "17571d69-f802-5122-9bfe-6f3f4e… | "VisaGold" | "Web" | "Inbound" | 778.0 | 46.0 | 0.059126 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
This table is very similar to Pandas tables, though there are some slight differences. The types are explicitly mentioned in the header, string values are in quotes, and long string values get cut off more aggressively. Other than that, there’s no real change! And if you still want to use pandas, just call .to_pandas() :)
[7]:
dm.modelData.collect().to_pandas()
[7]:
Group | SnapshotTime | Issue | Treatment | Configuration | Performance | ModelID | Name | Channel | Direction | ResponseCount | Positives | SuccessRate | Last_Positives | Last_ResponseCount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | CreditCards | 2021-06-01 08:23:26.789 | Sales | None | OmniAdaptiveModel | 0.500000 | 17571d69-f802-5122-9bfe-6f3f4ef9073b | VisaGold | Web | Inbound | 72.0 | 1.0 | 0.013889 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
1 | CreditCards | 2021-06-01 08:23:26.789 | Sales | None | OmniAdaptiveModel | 0.500000 | 1be7cd72-741b-5d68-b148-c1af0142fd4a | VisaGold | SMS | Outbound | 166.0 | 4.0 | 0.024096 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
2 | CreditCards | 2021-06-01 08:27:55.789 | Sales | None | OmniAdaptiveModel | 0.500000 | 17571d69-f802-5122-9bfe-6f3f4ef9073b | VisaGold | Web | Inbound | 161.0 | 3.0 | 0.018634 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
3 | CreditCards | 2021-06-01 09:03:01.789 | Sales | None | OmniAdaptiveModel | 0.500000 | 17571d69-f802-5122-9bfe-6f3f4ef9073b | VisaGold | Web | Inbound | 161.0 | 3.0 | 0.018634 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
4 | CreditCards | 2021-06-01 09:03:01.789 | Sales | None | OmniAdaptiveModel | 0.500000 | 1be7cd72-741b-5d68-b148-c1af0142fd4a | VisaGold | SMS | Outbound | 166.0 | 4.0 | 0.024096 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
5 | CreditCards | 2021-06-01 09:06:33.789 | Sales | None | OmniAdaptiveModel | 0.500000 | 1be7cd72-741b-5d68-b148-c1af0142fd4a | VisaGold | SMS | Outbound | 271.0 | 7.0 | 0.025830 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
6 | CreditCards | 2021-06-01 09:06:33.789 | Sales | None | OmniAdaptiveModel | 0.554538 | 17571d69-f802-5122-9bfe-6f3f4ef9073b | VisaGold | Web | Inbound | 233.0 | 12.0 | 0.051502 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
7 | CreditCards | 2021-06-01 09:34:34.789 | Sales | None | OmniAdaptiveModel | 0.554538 | 17571d69-f802-5122-9bfe-6f3f4ef9073b | VisaGold | Web | Inbound | 233.0 | 12.0 | 0.051502 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
8 | CreditCards | 2021-06-01 09:43:16.789 | Sales | None | OmniAdaptiveModel | 0.500000 | 1be7cd72-741b-5d68-b148-c1af0142fd4a | VisaGold | SMS | Outbound | 353.0 | 8.0 | 0.022663 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
9 | CreditCards | 2021-06-01 09:43:16.789 | Sales | None | OmniAdaptiveModel | 0.541487 | 17571d69-f802-5122-9bfe-6f3f4ef9073b | VisaGold | Web | Inbound | 320.0 | 19.0 | 0.059375 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
10 | CreditCards | 2021-06-01 10:07:35.789 | Sales | None | OmniAdaptiveModel | 0.500000 | 1be7cd72-741b-5d68-b148-c1af0142fd4a | VisaGold | SMS | Outbound | 437.0 | 10.0 | 0.022883 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
11 | CreditCards | 2021-06-01 10:07:35.789 | Sales | None | OmniAdaptiveModel | 0.557512 | 17571d69-f802-5122-9bfe-6f3f4ef9073b | VisaGold | Web | Inbound | 391.0 | 24.0 | 0.061381 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
12 | CreditCards | 2021-06-01 10:58:26.789 | Sales | None | OmniAdaptiveModel | 0.531173 | 17571d69-f802-5122-9bfe-6f3f4ef9073b | VisaGold | Web | Inbound | 470.0 | 28.0 | 0.059574 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
13 | CreditCards | 2021-06-01 10:58:26.789 | Sales | None | OmniAdaptiveModel | 0.551713 | 1be7cd72-741b-5d68-b148-c1af0142fd4a | VisaGold | SMS | Outbound | 507.0 | 17.0 | 0.033531 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
14 | CreditCards | 2021-06-01 11:29:54.789 | Sales | None | OmniAdaptiveModel | 0.551713 | 1be7cd72-741b-5d68-b148-c1af0142fd4a | VisaGold | SMS | Outbound | 507.0 | 17.0 | 0.033531 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
15 | CreditCards | 2021-06-01 12:15:05.789 | Sales | None | OmniAdaptiveModel | 0.524312 | 1be7cd72-741b-5d68-b148-c1af0142fd4a | VisaGold | SMS | Outbound | 674.0 | 21.0 | 0.031157 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
16 | CreditCards | 2021-06-01 12:15:05.789 | Sales | None | OmniAdaptiveModel | 0.549843 | 17571d69-f802-5122-9bfe-6f3f4ef9073b | VisaGold | Web | Inbound | 633.0 | 39.0 | 0.061611 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
17 | CreditCards | 2021-06-01 12:42:04.789 | Sales | None | OmniAdaptiveModel | 0.524312 | 1be7cd72-741b-5d68-b148-c1af0142fd4a | VisaGold | SMS | Outbound | 674.0 | 21.0 | 0.031157 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
18 | CreditCards | 2021-06-01 12:45:20.789 | Sales | None | OmniAdaptiveModel | 0.535621 | 1be7cd72-741b-5d68-b148-c1af0142fd4a | VisaGold | SMS | Outbound | 746.0 | 24.0 | 0.032172 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
19 | CreditCards | 2021-06-01 13:18:21.789 | Sales | None | OmniAdaptiveModel | 0.538104 | 17571d69-f802-5122-9bfe-6f3f4ef9073b | VisaGold | Web | Inbound | 706.0 | 43.0 | 0.060907 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
20 | CreditCards | 2021-06-01 13:23:26.789 | Sales | None | OmniAdaptiveModel | 0.531863 | 1be7cd72-741b-5d68-b148-c1af0142fd4a | VisaGold | SMS | Outbound | 822.0 | 27.0 | 0.032847 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
21 | CreditCards | 2021-06-01 13:23:26.789 | Sales | None | OmniAdaptiveModel | 0.519586 | 17571d69-f802-5122-9bfe-6f3f4ef9073b | VisaGold | Web | Inbound | 778.0 | 46.0 | 0.059126 | 2021-06-01 13:23:26.789 | 2021-06-01 13:23:26.789 |
Now, we don’t just have modelData; there’s also information about predictors. This lives under the predictorData attribute, and it works the same way. Thirdly, we add an attribute called combinedData. combinedData is technically just an inner join between the model data and the predictor data on the model ID. Therefore, the model IDs we filter out of the model data also propagate to the combined data. To give more insight into this join, we can simply show its execution graph:
[8]:
dm.combinedData
[8]:
NAIVE QUERY PLAN
run LazyFrame.show_graph() to see the optimized version
While a beautiful graph, it is quite a lot to digest. The left side of the chart should look very familiar, as should the right side. The ‘middle’ part is new: this is the execution plan for the predictor data. Here we only parse Performance to a float, parse the timezones, and add ["BinPropensity","BinAdjustedPropensity"]. For both the model data and the predictor data we take only the last snapshot when combining them, which you can see right before the topmost inner join.
Again, a sharp eye will notice some inefficiencies here. The most notable is that we’re now importing modelData twice: once to filter out model IDs not matching the filter criteria, and another time to join with the predictor table. But remember, this is not yet the optimized graph:
[9]:
dm.combinedData.show_graph()
Pretty neat. Not only do we not import modelData twice, we cache it at a certain point in the execution graph so we can use it twice. And we get all of this for free with Polars’ optimization engine! Also remember: at this point, we’ve not even read in any data yet! We can even profile this execution graph, but that’s a challenge to figure out on your own :). We can easily demonstrate the gains in speed using our new save_data method though: it saves our data to disk in an efficient format, and gives us a fast way to benchmark.
[10]:
files_without_subset = datasets.CDHSample(
import_strategy="lazy", subset=False
).save_data()
files_with_subset = datasets.CDHSample(
import_strategy="lazy", query=pl.col("Name").str.contains("Visa")
).save_data()
[11]:
%timeit ADMDatamart(model_filename=files_without_subset[0], predictor_filename=files_without_subset[1], import_strategy='lazy').combinedData.collect()
18.6 ms ± 69.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
[12]:
%timeit ADMDatamart(model_filename=files_with_subset[0], predictor_filename=files_with_subset[1], import_strategy='lazy').combinedData.collect()
11.5 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
[13]:
from os import remove
[remove(file) for file in [*files_with_subset, *files_without_subset]]
[13]:
[None, None, None, None]
This roughly 2x speedup will only grow more significant as the size of the data increases.
Lazy vs Eager¶
As mentioned before, however, the default import strategy is "eager". This still uses the execution graph described above; the difference is what happens after all of the preprocessing, right before the ADMDatamart class is returned: .collect().lazy(). Opening up modelData in eager mode shows that we don’t have a populated graph anymore:
[14]:
modd = datasets.CDHSample().modelData
print(type(modd))
modd.show_graph()
<class 'polars.lazyframe.frame.LazyFrame'>
The type of modelData is still a LazyFrame (whose representation is a query plan). DataFrames and LazyFrames are not so different: LazyFrames build up a query and execute it once you ask them to, while DataFrames execute each command as you call it. However, in eager mode, that execution graph is empty.
This leads to the fundamental difference between the import_strategy keyword arguments:
In lazy mode, we only build the execution graph; rather than reading the data, we only scan it.
In eager mode, we build the execution graph and execute it, reading all data into memory. This means we don’t have to execute that graph every time we use the data.
Eager mode still maintains LazyFrames for both modelData and predictorData, but the underlying tables are no longer read from disk; they are kept in memory. Any subsequent operations build up a new execution graph, executed only when calling .collect().
When to use which?¶
By default, we use eager mode. We recommend sticking with that, unless:
Your data is too big to fit into memory
You only need a single plot or computation, so there is little benefit to keeping everything in memory
In all other cases, it’s probably fine to just use eager mode.
File Types¶
Pega’s default export format is a zip file containing a data.json file in newline-delimited JSON format. While this is a decent storage format, it doesn’t allow for any scanning operations, because we need to decompress the entire zip into memory before we can read anything. Therefore, if you’re using the default export format, we set the import strategy to eager. If you’re working with very large files and need faster processing, it may be worth unzipping the files, or even transforming them to Arrow/IPC (fastest) or Parquet (most efficient). Then you can use lazy mode, and as an added benefit you can process files larger than memory.
Further Polars examples¶
To help you get adjusted to this new syntax, let’s go over a few more examples.
The regular facets syntax still works, and can be combined with the new querying syntax.
[15]:
dm = datasets.CDHSample()
dm.plotPerformanceSuccessRateBubbleChart(
query=pl.col("Issue") == "Sales", facets=["Channel"]
)
However, if you want to combine queries, say looking only at Sales on the Web channel, the best way to do that is as follows (note the parentheses around each expression, which are required when combining expressions with & or |):
[16]:
dm.plotPerformanceSuccessRateBubbleChart(
query=(pl.col("Issue") == "Sales") & (pl.col("Channel") == "Web")
)
We can make use of string operations (hover to verify for yourself!):
[17]:
dm.plotPerformanceSuccessRateBubbleChart(
query=(pl.col("Name").str.starts_with("Premier"))
| (pl.col("Name").str.ends_with("Gold"))
)
Since we have full access to Polars’ expressions, we can use a very wide range of functions, including window functions, folds, and slices.
For something more advanced, let’s look at the highest-performance model per group.
[18]:
dm.plotPerformanceSuccessRateBubbleChart(
query=(pl.col("Performance") == pl.max("Performance")).over("Group")
)
Or we can use temporal filters, for example only looking at the first snapshot:
[19]:
dm.plotPerformanceSuccessRateBubbleChart(
    query=pl.col("SnapshotTime") == pl.col("SnapshotTime").min()
)