Data Anonymization¶
In Pega CDH 8.5 and later, you can record the historical data as seen by the Adaptive Models; see this academy challenge for reference. This historical data can be used to experiment with offline models, and also to fine-tune the out-of-the-box Gradient Boosting model. However, sharing this information with Pega can be sensitive, as it contains raw predictor data.
To this end, we provide a simple and transparent script to fully anonymize this dataset.
The DataAnonymization script is now part of pdstools, so you can import it directly:
[2]:
from pdstools import ADMDatamart
from pdstools import Config, DataAnonymization
import polars as pl
Input data¶
To demonstrate this process, we’re going to anonymize this toy example dataframe:
[3]:
pl.read_ndjson('../../../../data/SampleHDS.json')
[3]:
Context_Name | Customer_MaritalStatus | Customer_CLV | Customer_City | IH_Web_Inbound_Accepted_pxLastGroupID | Decision_Outcome |
---|---|---|---|---|---|
str | str | i64 | str | str | str |
"FirstMortgage30yr" | "Married" | 1460 | "Port Raoul" | "Account" | "Rejected" |
"FirstMortgage30yr" | "Unknown" | 669 | "Laurianneshire" | "AutoLoans" | "Accepted" |
"MoneyMarketSavingsAccount" | "No Resp+" | 1174 | "Jacobshaven" | "Account" | "Rejected" |
"BasicChecking" | "Unknown" | 1476 | "Lindton" | "Account" | "Rejected" |
"BasicChecking" | "Married" | 1211 | "South Jimmieshire" | "DepositAccounts" | "Accepted" |
"UPlusFinPersonal" | "No Resp+" | 533 | "Bergeville" | null | "Rejected" |
"BasicChecking" | "No Resp+" | 555 | "Willyville" | "DepositAccounts" | "Rejected" |
As you can see, this dataset consists of regular predictors, IH predictors, context keys and the outcome column. Additionally, some columns are numeric, others are strings. Let’s first initialize the DataAnonymization class.
[4]:
anon = DataAnonymization(hds_folder='../../../../data/')
By default, the class applies a set of anonymization techniques:

- Column names are remapped to non-descriptive names
- Categorical values are hashed with a random seed
- Numerical values are normalized between 0 and 1
- Outcomes are mapped to a binary outcome
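These techniques can be sketched in a few lines of plain Python. This is an illustrative stand-in, not the actual pdstools implementation; the helper names are made up:

```python
import random

SEED = random.randrange(2**32)  # fresh random seed per run

def hash_value(value: str) -> str:
    """Categorical values: salted hash, rendered as a digit string."""
    return str(hash((SEED, value)) & (2**64 - 1))

def normalize(values: list[float]) -> list[float]:
    """Numerical values: min-max scaling into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def map_outcome(outcome: str) -> bool:
    """Outcomes: binary mapping based on the configured positive outcomes."""
    return outcome in {"Accepted", "Clicked"}

print(normalize([533.0, 1460.0, 1476.0]))  # smallest -> 0.0, largest -> 1.0
print(hash_value("Married"))  # same digits for every "Married" within one run
print(map_outcome("Rejected"))  # False
```

Because the seed is drawn fresh each run, the hashed values differ between runs (as visible in the outputs below), but remain consistent within one dataset.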
To apply these techniques, simply call .process():
[5]:
anon.process()
[5]:
filename | PREDICTOR_1 | PREDICTOR_2 | PREDICTOR_3 | Context_Name | IH_PREDICTOR_0 | Decision_Outcome |
---|---|---|---|---|---|---|
str | f64 | str | str | str | str | bool |
"../../../../data/SampleHDS.jso… | 9.3879e18 | "4322352375578778816" | "17476576417299871274" | "16667091606759666905" | "9928487072251299005" | false |
"../../../../data/SampleHDS.jso… | 9.5130e18 | "4377567666271395146" | "4246104531867950704" | "16667091606759666905" | "9704894017741411585" | true |
"../../../../data/SampleHDS.jso… | 1.3543e19 | "7449789436079273182" | "16245490868189049411" | "9110662069881979168" | "9928487072251299005" | false |
"../../../../data/SampleHDS.jso… | 2.9338e18 | "4377567666271395146" | "14052206754333961127" | "3167542439088846566" | "9928487072251299005" | false |
"../../../../data/SampleHDS.jso… | 9.1170e18 | "4322352375578778816" | "15706014193656415919" | "3167542439088846566" | "4429618874616721482" | true |
"../../../../data/SampleHDS.jso… | 5.9423e18 | "7449789436079273182" | "18083692497973389187" | "15443884618242616713" | null | false |
"../../../../data/SampleHDS.jso… | 2.9437e17 | "7449789436079273182" | "17630561027144944859" | "3167542439088846566" | "4429618874616721482" | false |
To trace the columns back to their original names, the class also keeps a column mapping. You can retain this mapping locally; it does not have to be shared along with the anonymized data.
[6]:
anon.column_mapping
[6]:
{'filename': 'filename',
'Customer_CLV': 'PREDICTOR_1',
'Customer_MaritalStatus': 'PREDICTOR_2',
'Customer_City': 'PREDICTOR_3',
'Context_Name': 'Context_Name',
'IH_Web_Inbound_Accepted_pxLastGroupID': 'IH_PREDICTOR_0',
'Decision_Outcome': 'Decision_Outcome'}
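Since the mapping runs from original to anonymized names, inverting it gives a lookup for restoring the originals, e.g. to rename columns on a dataframe you get back. A minimal sketch using part of the mapping shown above:

```python
# Invert the original -> anonymized mapping to trace names back.
column_mapping = {
    "Customer_CLV": "PREDICTOR_1",
    "Customer_MaritalStatus": "PREDICTOR_2",
    "Customer_City": "PREDICTOR_3",
}
reverse_mapping = {anon: orig for orig, anon in column_mapping.items()}
print(reverse_mapping["PREDICTOR_1"])  # Customer_CLV
# With polars, this could then be applied as df.rename(reverse_mapping).
```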
Configs¶
Each capability can be turned off individually; see below for the full list of config options, and refer to the API reference for full descriptions.
[7]:
dict(zip(Config.__init__.__code__.co_varnames[1:], Config.__init__.__defaults__))
[7]:
{'config_file': None,
'hds_folder': '.',
'use_datamart': False,
'datamart_folder': 'datamart',
'output_format': 'ndjson',
'output_folder': 'output',
'mapping_file': 'mapping.map',
'mask_predictor_names': True,
'mask_context_key_names': False,
'mask_ih_names': True,
'mask_outcome_name': False,
'mask_predictor_values': True,
'mask_context_key_values': True,
'mask_ih_values': True,
'mask_outcome_values': True,
'context_key_label': 'Context_*',
'ih_label': 'IH_*',
'outcome_column': 'Decision_Outcome',
'positive_outcomes': ['Accepted', 'Clicked'],
'negative_outcomes': ['Rejected', 'Impression'],
'special_predictors': ['Decision_DecisionTime',
'Decision_OutcomeTime',
'Decision_Rank'],
'sample_percentage_schema_inferencing': 0.01}
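The zip trick above relies on `__code__` internals; `inspect.signature` is a more robust way to list keyword defaults. Shown here on a stand-in class with made-up parameters, but the same expression works on `Config.__init__`:

```python
import inspect

class Example:
    """Stand-in class with illustrative keyword defaults."""
    def __init__(self, hds_folder=".", output_format="ndjson"):
        ...

# Collect every parameter that has a default value (skips `self`,
# which has no default).
defaults = {
    name: param.default
    for name, param in inspect.signature(Example.__init__).parameters.items()
    if param.default is not inspect.Parameter.empty
}
print(defaults)  # {'hds_folder': '.', 'output_format': 'ndjson'}
```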
It’s easy to change these parameters by just passing the keyword arguments. In the following example, we:

- Keep the IH predictor names
- Keep the outcome values
- Keep the context key values
- Keep the context key predictor names
[8]:
anon = DataAnonymization(
hds_folder="../../../../data/",
mask_ih_names=False,
mask_outcome_values=False,
mask_context_key_values=False,
mask_context_key_names=False,
)
anon.process()
[8]:
filename | PREDICTOR_1 | PREDICTOR_2 | PREDICTOR_3 | Context_Name | IH_Web_Inbound_Accepted_pxLastGroupID | Decision_Outcome |
---|---|---|---|---|---|---|
str | f64 | str | str | str | str | str |
"../../../../data/SampleHDS.jso… | 7.5292e16 | "5652797536064939235" | "1883914281589318169" | "FirstMortgage30yr" | "7516570166790360641" | "Rejected" |
"../../../../data/SampleHDS.jso… | 1.0513e19 | "13504995809060850713" | "13130779151990347928" | "FirstMortgage30yr" | "7207095931022615371" | "Accepted" |
"../../../../data/SampleHDS.jso… | 1.2205e19 | "12266615714185244798" | "17512333760039993470" | "MoneyMarketSavingsAccount" | "7516570166790360641" | "Rejected" |
"../../../../data/SampleHDS.jso… | 1.5206e19 | "13504995809060850713" | "13070512220880338382" | "BasicChecking" | "7516570166790360641" | "Rejected" |
"../../../../data/SampleHDS.jso… | 9.2184e18 | "5652797536064939235" | "9707500569982642684" | "BasicChecking" | "15207313976030574712" | "Accepted" |
"../../../../data/SampleHDS.jso… | 1.7792e19 | "12266615714185244798" | "16463319030652753949" | "UPlusFinPersonal" | null | "Rejected" |
"../../../../data/SampleHDS.jso… | 8.9321e18 | "12266615714185244798" | "9489790242227414652" | "BasicChecking" | "15207313976030574712" | "Rejected" |
The configuration can also be written to, and read back from, a file:
[9]:
anon.config.save_to_config_file('config.json')
[10]:
anon = DataAnonymization(config=Config(config_file='config.json'))
anon.process()
[10]:
filename | PREDICTOR_1 | PREDICTOR_2 | PREDICTOR_3 | Context_Name | IH_Web_Inbound_Accepted_pxLastGroupID | Decision_Outcome |
---|---|---|---|---|---|---|
str | f64 | str | str | str | str | str |
"../../../../data/SampleHDS.jso… | 1.4092e19 | "13050987138282047001" | "7273466646321961720" | "FirstMortgage30yr" | "201110393970616223" | "Rejected" |
"../../../../data/SampleHDS.jso… | 7.5421e18 | "4687207091070476948" | "12321688881564301403" | "FirstMortgage30yr" | "14469491008813909687" | "Accepted" |
"../../../../data/SampleHDS.jso… | 2.7647e18 | "9221479136500986521" | "13224801672048989748" | "MoneyMarketSavingsAccount" | "201110393970616223" | "Rejected" |
"../../../../data/SampleHDS.jso… | 1.8531e18 | "4687207091070476948" | "3005892654458747413" | "BasicChecking" | "201110393970616223" | "Rejected" |
"../../../../data/SampleHDS.jso… | 2.9153e17 | "13050987138282047001" | "11143060317304375272" | "BasicChecking" | "14949876674167102948" | "Accepted" |
"../../../../data/SampleHDS.jso… | 5.8514e18 | "9221479136500986521" | "3042835870227387670" | "UPlusFinPersonal" | null | "Rejected" |
"../../../../data/SampleHDS.jso… | 1.3068e19 | "9221479136500986521" | "6985670021126623909" | "BasicChecking" | "14949876674167102948" | "Rejected" |
Exporting¶
Two functions handle exporting:

- create_mapping_file() writes the mapping file of the predictor names
- write_to_output() writes the processed dataframe to disk

write_to_output() accepts the following extensions: ["ndjson", "parquet", "arrow", "csv"]
[11]:
anon.create_mapping_file()
with open('mapping.map') as f:
print(f.read())
filename=filename
Customer_CLV=PREDICTOR_1
Customer_MaritalStatus=PREDICTOR_2
Customer_City=PREDICTOR_3
Context_Name=Context_Name
IH_Web_Inbound_Accepted_pxLastGroupID=IH_Web_Inbound_Accepted_pxLastGroupID
Decision_Outcome=Decision_Outcome
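If you need this mapping programmatically on the receiving side, the key=value format parses in one line. A sketch assuming the line format shown above, with the file contents inlined as a string:

```python
# Inlined sample of the mapping file format shown above.
sample = """\
Customer_CLV=PREDICTOR_1
Customer_MaritalStatus=PREDICTOR_2
Customer_City=PREDICTOR_3
"""

# Split each non-empty line on the first "=" into (original, anonymized).
mapping = dict(
    line.split("=", 1) for line in sample.splitlines() if line.strip()
)
print(mapping["Customer_CLV"])  # PREDICTOR_1
```

In practice you would read the file with `open('mapping.map')` instead of the inlined string.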
[12]:
anon.write_to_output(ext='arrow')
[13]:
pl.read_ipc('output/hds.arrow')
[13]:
PREDICTOR_1 | PREDICTOR_2 | PREDICTOR_3 | Context_Name | IH_Web_Inbound_Accepted_pxLastGroupID | Decision_Outcome |
---|---|---|---|---|---|
f64 | str | str | str | str | str |
1.4553e18 | "1677290861643629533" | "675924639818905441" | "FirstMortgage30yr" | "7325192143980280913" | "Rejected" |
1.3342e19 | "3995997704274528624" | "12466662781232983786" | "FirstMortgage30yr" | "11250321987204776838" | "Accepted" |
6.4610e18 | "299236262880695810" | "14480333672661849787" | "MoneyMarketSavingsAccount" | "7325192143980280913" | "Rejected" |
9.5945e18 | "3995997704274528624" | "10510107923994113798" | "BasicChecking" | "7325192143980280913" | "Rejected" |
8.8990e18 | "1677290861643629533" | "11351735953431803477" | "BasicChecking" | "14140642726280753534" | "Accepted" |
1.8116e19 | "299236262880695810" | "6705003666536151207" | "UPlusFinPersonal" | null | "Rejected" |
3.0632e18 | "299236262880695810" | "16498174082774312399" | "BasicChecking" | "14140642726280753534" | "Rejected" |