Data Anonymization¶
In Pega CDH 8.5 and later, you can record the historical data as seen by the Adaptive Models; see this academy challenge for reference. This historical data can be used to experiment with offline models, and also to fine-tune the out-of-the-box Gradient Boosting model. However, sharing this information with Pega can be sensitive, as it contains raw predictor data.
To this end, we provide a simple and transparent script to fully anonymize this dataset.
The DataAnonymization script is now part of pdstools, so you can import it directly:
[2]:
from pdstools import ADMDatamart
from pdstools import Config, DataAnonymization
import polars as pl
Input data¶
To demonstrate this process, we’re going to anonymize this toy example dataframe:
[3]:
pl.read_ndjson('../../../../data/SampleHDS.json')
[3]:
Context_Name | Customer_MaritalStatus | Customer_CLV | Customer_City | IH_Web_Inbound_Accepted_pxLastGroupID | Decision_Outcome |
---|---|---|---|---|---|
str | str | i64 | str | str | str |
"FirstMortgage30yr" | "Married" | 1460 | "Port Raoul" | "Account" | "Rejected" |
"FirstMortgage30yr" | "Unknown" | 669 | "Laurianneshire" | "AutoLoans" | "Accepted" |
"MoneyMarketSavingsAccount" | "No Resp+" | 1174 | "Jacobshaven" | "Account" | "Rejected" |
"BasicChecking" | "Unknown" | 1476 | "Lindton" | "Account" | "Rejected" |
"BasicChecking" | "Married" | 1211 | "South Jimmieshire" | "DepositAccounts" | "Accepted" |
"UPlusFinPersonal" | "No Resp+" | 533 | "Bergeville" | null | "Rejected" |
"BasicChecking" | "No Resp+" | 555 | "Willyville" | "DepositAccounts" | "Rejected" |
As you can see, this dataset consists of regular predictors, IH predictors, context keys and the outcome column. Additionally, some columns are numeric, others are strings. Let’s first initialize the DataAnonymization class.
[4]:
anon = DataAnonymization(hds_folder='../../../../data/')
By default, the class applies a set of anonymization techniques:

- Column names are remapped to non-descriptive names
- Categorical values are hashed with a random seed
- Numerical values are normalized between 0 and 1
- Outcomes are mapped to a binary outcome
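These techniques can be sketched in a few lines of plain Python. This is an illustrative stand-in, not the actual pdstools implementation; the helper names are made up:

```python
import random

SEED = random.randrange(2**32)  # fresh random seed per run

def hash_value(value: str) -> str:
    """Categorical values: salted hash, rendered as a digit string."""
    return str(hash((SEED, value)) & (2**64 - 1))

def normalize(values: list[float]) -> list[float]:
    """Numerical values: min-max scaling into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def map_outcome(outcome: str) -> bool:
    """Outcomes: binary mapping based on the configured positive outcomes."""
    return outcome in {"Accepted", "Clicked"}

print(normalize([533.0, 1460.0, 1476.0]))  # smallest -> 0.0, largest -> 1.0
print(hash_value("Married"))  # same digits for every "Married" within one run
print(map_outcome("Rejected"))  # False
```

Because the seed is drawn fresh each run, the hashed values differ between runs (as visible in the outputs below), but remain consistent within one dataset.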
To apply these techniques, simply call .process():
[5]:
anon.process()
[5]:
filename | PREDICTOR_1 | PREDICTOR_2 | PREDICTOR_3 | Context_Name | IH_PREDICTOR_0 | Decision_Outcome |
---|---|---|---|---|---|---|
str | f64 | str | str | str | str | bool |
"../../../../data/SampleHDS.jso… | 9.3879e18 | "4322352375578778816" | "17476576417299871274" | "16667091606759666905" | "9928487072251299005" | false |
"../../../../data/SampleHDS.jso… | 9.5130e18 | "4377567666271395146" | "4246104531867950704" | "16667091606759666905" | "9704894017741411585" | true |
"../../../../data/SampleHDS.jso… | 1.3543e19 | "7449789436079273182" | "16245490868189049411" | "9110662069881979168" | "9928487072251299005" | false |
"../../../../data/SampleHDS.jso… | 2.9338e18 | "4377567666271395146" | "14052206754333961127" | "3167542439088846566" | "9928487072251299005" | false |
"../../../../data/SampleHDS.jso… | 9.1170e18 | "4322352375578778816" | "15706014193656415919" | "3167542439088846566" | "4429618874616721482" | true |
"../../../../data/SampleHDS.jso… | 5.9423e18 | "7449789436079273182" | "18083692497973389187" | "15443884618242616713" | null | false |
"../../../../data/SampleHDS.jso… | 2.9437e17 | "7449789436079273182" | "17630561027144944859" | "3167542439088846566" | "4429618874616721482" | false |
To trace the columns back to their original names, the class also keeps a column mapping. You can retain this mapping locally; it does not have to be shared along with the anonymized data.
[6]:
anon.column_mapping
[6]:
{'filename': 'filename',
'Customer_CLV': 'PREDICTOR_1',
'Customer_MaritalStatus': 'PREDICTOR_2',
'Customer_City': 'PREDICTOR_3',
'Context_Name': 'Context_Name',
'IH_Web_Inbound_Accepted_pxLastGroupID': 'IH_PREDICTOR_0',
'Decision_Outcome': 'Decision_Outcome'}
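Since the mapping runs from original to anonymized names, inverting it gives a lookup for restoring the originals, e.g. to rename columns on a dataframe you get back. A minimal sketch using part of the mapping shown above:

```python
# Invert the original -> anonymized mapping to trace names back.
column_mapping = {
    "Customer_CLV": "PREDICTOR_1",
    "Customer_MaritalStatus": "PREDICTOR_2",
    "Customer_City": "PREDICTOR_3",
}
reverse_mapping = {anon: orig for orig, anon in column_mapping.items()}
print(reverse_mapping["PREDICTOR_1"])  # Customer_CLV
# With polars, this could then be applied as df.rename(reverse_mapping).
```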
Configs¶
Each capability can be turned off individually; see below for the full list of config options, and refer to the API reference for full descriptions.
[7]:
dict(zip(Config.__init__.__code__.co_varnames[1:], Config.__init__.__defaults__))
[7]:
{'config_file': None,
'hds_folder': '.',
'use_datamart': False,
'datamart_folder': 'datamart',
'output_format': 'ndjson',
'output_folder': 'output',
'mapping_file': 'mapping.map',
'mask_predictor_names': True,
'mask_context_key_names': False,
'mask_ih_names': True,
'mask_outcome_name': False,
'mask_predictor_values': True,
'mask_context_key_values': True,
'mask_ih_values': True,
'mask_outcome_values': True,
'context_key_label': 'Context_*',
'ih_label': 'IH_*',
'outcome_column': 'Decision_Outcome',
'positive_outcomes': ['Accepted', 'Clicked'],
'negative_outcomes': ['Rejected', 'Impression'],
'special_predictors': ['Decision_DecisionTime',
'Decision_OutcomeTime',
'Decision_Rank'],
'sample_percentage_schema_inferencing': 0.01}
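The zip trick above relies on `__code__` internals; `inspect.signature` is a more robust way to list keyword defaults. Shown here on a stand-in class with made-up parameters, but the same expression works on `Config.__init__`:

```python
import inspect

class Example:
    """Stand-in class with illustrative keyword defaults."""
    def __init__(self, hds_folder=".", output_format="ndjson"):
        ...

# Collect every parameter that has a default value (skips `self`,
# which has no default).
defaults = {
    name: param.default
    for name, param in inspect.signature(Example.__init__).parameters.items()
    if param.default is not inspect.Parameter.empty
}
print(defaults)  # {'hds_folder': '.', 'output_format': 'ndjson'}
```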
It’s easy to change these parameters by just passing the keyword arguments. In the following example, we:

- Keep the IH predictor names
- Keep the outcome values
- Keep the context key values
- Keep the context key predictor names
[8]:
anon = DataAnonymization(
hds_folder="../../../../data/",
mask_ih_names=False,
mask_outcome_values=False,
mask_context_key_values=False,
mask_context_key_names=False,
)
anon.process()
[8]:
filename | PREDICTOR_1 | PREDICTOR_2 | PREDICTOR_3 | Context_Name | IH_Web_Inbound_Accepted_pxLastGroupID | Decision_Outcome |
---|---|---|---|---|---|---|
str | f64 | str | str | str | str | str |
"../../../../data/SampleHDS.jso… | 7.5292e16 | "5652797536064939235" | "1883914281589318169" | "FirstMortgage30yr" | "7516570166790360641" | "Rejected" |
"../../../../data/SampleHDS.jso… | 1.0513e19 | "13504995809060850713" | "13130779151990347928" | "FirstMortgage30yr" | "7207095931022615371" | "Accepted" |
"../../../../data/SampleHDS.jso… | 1.2205e19 | "12266615714185244798" | "17512333760039993470" | "MoneyMarketSavingsAccount" | "7516570166790360641" | "Rejected" |
"../../../../data/SampleHDS.jso… | 1.5206e19 | "13504995809060850713" | "13070512220880338382" | "BasicChecking" | "7516570166790360641" | "Rejected" |
"../../../../data/SampleHDS.jso… | 9.2184e18 | "5652797536064939235" | "9707500569982642684" | "BasicChecking" | "15207313976030574712" | "Accepted" |
"../../../../data/SampleHDS.jso… | 1.7792e19 | "12266615714185244798" | "16463319030652753949" | "UPlusFinPersonal" | null | "Rejected" |
"../../../../data/SampleHDS.jso… | 8.9321e18 | "12266615714185244798" | "9489790242227414652" | "BasicChecking" | "15207313976030574712" | "Rejected" |
The configuration can also be written to, and read back from, a file:
[9]:
anon.config.save_to_config_file('config.json')
[10]:
anon = DataAnonymization(config=Config(config_file='config.json'))
anon.process()
[10]:
filename | PREDICTOR_1 | PREDICTOR_2 | PREDICTOR_3 | Context_Name | IH_Web_Inbound_Accepted_pxLastGroupID | Decision_Outcome |
---|---|---|---|---|---|---|
str | f64 | str | str | str | str | str |
"../../../../data/SampleHDS.jso… | 1.4092e19 | "13050987138282047001" | "7273466646321961720" | "FirstMortgage30yr" | "201110393970616223" | "Rejected" |
"../../../../data/SampleHDS.jso… | 7.5421e18 | "4687207091070476948" | "12321688881564301403" | "FirstMortgage30yr" | "14469491008813909687" | "Accepted" |
"../../../../data/SampleHDS.jso… | 2.7647e18 | "9221479136500986521" | "13224801672048989748" | "MoneyMarketSavingsAccount" | "201110393970616223" | "Rejected" |
"../../../../data/SampleHDS.jso… | 1.8531e18 | "4687207091070476948" | "3005892654458747413" | "BasicChecking" | "201110393970616223" | "Rejected" |
"../../../../data/SampleHDS.jso… | 2.9153e17 | "13050987138282047001" | "11143060317304375272" | "BasicChecking" | "14949876674167102948" | "Accepted" |
"../../../../data/SampleHDS.jso… | 5.8514e18 | "9221479136500986521" | "3042835870227387670" | "UPlusFinPersonal" | null | "Rejected" |
"../../../../data/SampleHDS.jso… | 1.3068e19 | "9221479136500986521" | "6985670021126623909" | "BasicChecking" | "14949876674167102948" | "Rejected" |
Exporting¶
Two functions handle exporting:

- create_mapping_file() writes the mapping file of the predictor names
- write_to_output() writes the processed dataframe to disk

write_to_output() accepts the following extensions: ["ndjson", "parquet", "arrow", "csv"]
[11]:
anon.create_mapping_file()
with open('mapping.map') as f:
print(f.read())
filename=filename
Customer_CLV=PREDICTOR_1
Customer_MaritalStatus=PREDICTOR_2
Customer_City=PREDICTOR_3
Context_Name=Context_Name
IH_Web_Inbound_Accepted_pxLastGroupID=IH_Web_Inbound_Accepted_pxLastGroupID
Decision_Outcome=Decision_Outcome
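If you need this mapping programmatically on the receiving side, the key=value format parses in one line. A sketch assuming the line format shown above, with the file contents inlined as a string:

```python
# Inlined sample of the mapping file format shown above.
sample = """\
Customer_CLV=PREDICTOR_1
Customer_MaritalStatus=PREDICTOR_2
Customer_City=PREDICTOR_3
"""

# Split each non-empty line on the first "=" into (original, anonymized).
mapping = dict(
    line.split("=", 1) for line in sample.splitlines() if line.strip()
)
print(mapping["Customer_CLV"])  # PREDICTOR_1
```

In practice you would read the file with `open('mapping.map')` instead of the inlined string.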
[12]:
anon.write_to_output(ext='arrow')
[13]:
pl.read_ipc('output/hds.arrow')
[13]:
PREDICTOR_1 | PREDICTOR_2 | PREDICTOR_3 | Context_Name | IH_Web_Inbound_Accepted_pxLastGroupID | Decision_Outcome |
---|---|---|---|---|---|
f64 | str | str | str | str | str |
1.4553e18 | "1677290861643629533" | "675924639818905441" | "FirstMortgage30yr" | "7325192143980280913" | "Rejected" |
1.3342e19 | "3995997704274528624" | "12466662781232983786" | "FirstMortgage30yr" | "11250321987204776838" | "Accepted" |
6.4610e18 | "299236262880695810" | "14480333672661849787" | "MoneyMarketSavingsAccount" | "7325192143980280913" | "Rejected" |
9.5945e18 | "3995997704274528624" | "10510107923994113798" | "BasicChecking" | "7325192143980280913" | "Rejected" |
8.8990e18 | "1677290861643629533" | "11351735953431803477" | "BasicChecking" | "14140642726280753534" | "Accepted" |
1.8116e19 | "299236262880695810" | "6705003666536151207" | "UPlusFinPersonal" | null | "Rejected" |
3.0632e18 | "299236262880695810" | "16498174082774312399" | "BasicChecking" | "14140642726280753534" | "Rejected" |