Data Anonymization

In Pega CDH 8.5 and later, you can record the historical data as seen by the Adaptive Models. See the corresponding academy challenge for reference. This historical data can be used to experiment with offline models and to fine-tune the out-of-the-box (OOTB) Gradient Boosting model. However, sharing this data with Pega can be sensitive because it contains raw predictor data.

To this end, we provide a simple and transparent script to fully anonymize this dataset.

The DataAnonymization script is part of pdstools and can be imported directly:

[2]:
from pdstools import ADMDatamart
from pdstools import Config, DataAnonymization
import polars as pl

Input data

To demonstrate this process, we will anonymize this toy example dataframe:

[3]:
pl.read_ndjson('../../../../data/SampleHDS.json')
[3]:
shape: (7, 6)
Context_Name                | Customer_MaritalStatus | Customer_CLV | Customer_City       | IH_Web_Inbound_Accepted_pxLastGroupID | Decision_Outcome
str                         | str                    | i64          | str                 | str                                   | str
"FirstMortgage30yr"         | "Married"              | 1460         | "Port Raoul"        | "Account"                             | "Rejected"
"FirstMortgage30yr"         | "Unknown"              | 669          | "Laurianneshire"    | "AutoLoans"                           | "Accepted"
"MoneyMarketSavingsAccount" | "No Resp+"             | 1174         | "Jacobshaven"       | "Account"                             | "Rejected"
"BasicChecking"             | "Unknown"              | 1476         | "Lindton"           | "Account"                             | "Rejected"
"BasicChecking"             | "Married"              | 1211         | "South Jimmieshire" | "DepositAccounts"                     | "Accepted"
"UPlusFinPersonal"          | "No Resp+"             | 533          | "Bergeville"        | null                                  | "Rejected"
"BasicChecking"             | "No Resp+"             | 555          | "Willyville"        | "DepositAccounts"                     | "Rejected"

As you can see, this dataset consists of regular predictors, IH predictors, context keys and the outcome column. Additionally, some columns are numeric, others are strings. Let’s first initialize the DataAnonymization class.

[4]:
anon = DataAnonymization(hds_folder='../../../../data/')

By default, the class applies a set of anonymization techniques:
- Column names are remapped to a non-descriptive name
- Categorical values are hashed with a random seed
- Numerical values are normalized between 0 and 1
- Outcomes are mapped to a binary outcome
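The four techniques listed above can be illustrated with a small standard-library sketch. This is not the pdstools implementation: the salted SHA-256 hash, the helper names (`hash_value`, `normalize`), the set of positive outcome labels, and the three toy rows are all assumptions made for illustration only.

```python
import hashlib
import secrets

# Toy rows modelled on the sample data (only three columns, for brevity).
rows = [
    {"Customer_MaritalStatus": "Married", "Customer_CLV": 1460, "Decision_Outcome": "Rejected"},
    {"Customer_MaritalStatus": "Unknown", "Customer_CLV": 669, "Decision_Outcome": "Accepted"},
    {"Customer_MaritalStatus": "No Resp+", "Customer_CLV": 1174, "Decision_Outcome": "Rejected"},
]

# 1. Remap column names to non-descriptive ones (hypothetical mapping).
name_map = {"Customer_CLV": "PREDICTOR_1", "Customer_MaritalStatus": "PREDICTOR_2"}

# 2. Hash categorical values with a random salt, so hashes differ between runs.
salt = secrets.token_bytes(16)


def hash_value(value):
    """Return a salted, truncated SHA-256 hex digest; None stays None."""
    if value is None:
        return None
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()[:16]


# 3. Min-max normalize numeric values to the range [0, 1].
clv_values = [row["Customer_CLV"] for row in rows]
low, high = min(clv_values), max(clv_values)


def normalize(x):
    """Scale x linearly so that low maps to 0.0 and high maps to 1.0."""
    return (x - low) / (high - low) if high > low else 0.0


# 4. Map outcome labels to a binary outcome (assumed positive labels).
positive_outcomes = {"Accepted", "Clicked"}

anonymized = [
    {
        name_map["Customer_CLV"]: normalize(row["Customer_CLV"]),
        name_map["Customer_MaritalStatus"]: hash_value(row["Customer_MaritalStatus"]),
        "Decision_Outcome": row["Decision_Outcome"] in positive_outcomes,
    }
    for row in rows
]

for row in anonymized:
    print(row)
```

Because the salt is drawn at random, the hashed values change on every run while equal inputs within one run still hash to the same value.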

To apply these techniques, simply call .process():

[5]:
anon.process()
[5]:
shape: (7, 7)
filename                         | PREDICTOR_1 | PREDICTOR_2           | PREDICTOR_3            | Context_Name           | IH_PREDICTOR_0        | Decision_Outcome
str                              | f64         | str                   | str                    | str                    | str                   | bool
"../../../../data/SampleHDS.jso… | 9.3879e18   | "4322352375578778816" | "17476576417299871274" | "16667091606759666905" | "9928487072251299005" | false
"../../../../data/SampleHDS.jso… | 9.5130e18   | "4377567666271395146" | "4246104531867950704"  | "16667091606759666905" | "9704894017741411585" | true
"../../../../data/SampleHDS.jso… | 1.3543e19   | "7449789436079273182" | "16245490868189049411" | "9110662069881979168"  | "9928487072251299005" | false
"../../../../data/SampleHDS.jso… | 2.9338e18   | "4377567666271395146" | "14052206754333961127" | "3167542439088846566"  | "9928487072251299005" | false
"../../../../data/SampleHDS.jso… | 9.1170e18   | "4322352375578778816" | "15706014193656415919" | "3167542439088846566"  | "4429618874616721482" | true
"../../../../data/SampleHDS.jso… | 5.9423e18   | "7449789436079273182" | "18083692497973389187" | "15443884618242616713" | null                  | false
"../../../../data/SampleHDS.jso… | 2.9437e17   | "7449789436079273182" | "17630561027144944859" | "3167542439088846566"  | "4429618874616721482" | false

To trace the columns back to their original names, the class also keeps a column mapping. This mapping is generated automatically, so you do not have to provide one.

[6]:
anon.column_mapping
[6]:
{'filename': 'filename',
 'Customer_CLV': 'PREDICTOR_1',
 'Customer_MaritalStatus': 'PREDICTOR_2',
 'Customer_City': 'PREDICTOR_3',
 'Context_Name': 'Context_Name',
 'IH_Web_Inbound_Accepted_pxLastGroupID': 'IH_PREDICTOR_0',
 'Decision_Outcome': 'Decision_Outcome'}
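Since the mapping goes from original name to anonymized name, tracing an anonymized column back requires the inverse lookup. A minimal sketch, using a plain copy of the dictionary printed above (the `reverse_mapping` name is only illustrative):

```python
# Mapping as printed above: original column name -> anonymized column name.
column_mapping = {
    "filename": "filename",
    "Customer_CLV": "PREDICTOR_1",
    "Customer_MaritalStatus": "PREDICTOR_2",
    "Customer_City": "PREDICTOR_3",
    "Context_Name": "Context_Name",
    "IH_Web_Inbound_Accepted_pxLastGroupID": "IH_PREDICTOR_0",
    "Decision_Outcome": "Decision_Outcome",
}

# Invert it: anonymized column name -> original column name.
reverse_mapping = {anon_name: original for original, anon_name in column_mapping.items()}

print(reverse_mapping["PREDICTOR_3"])  # Customer_City
```

The inversion is only safe because every anonymized name is unique; duplicate values would silently overwrite each other.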

Configs

Each capability can be turned off individually. The full list of config options is shown below; refer to the API reference for a full description of each.

[7]:
dict(zip(Config.__init__.__code__.co_varnames[1:], Config.__init__.__defaults__))
[7]:
{'config_file': None,
 'hds_folder': '.',
 'use_datamart': False,
 'datamart_folder': 'datamart',
 'output_format': 'ndjson',
 'output_folder': 'output',
 'mapping_file': 'mapping.map',
 'mask_predictor_names': True,
 'mask_context_key_names': False,
 'mask_ih_names': True,
 'mask_outcome_name': False,
 'mask_predictor_values': True,
 'mask_context_key_values': True,
 'mask_ih_values': True,
 'mask_outcome_values': True,
 'context_key_label': 'Context_*',
 'ih_label': 'IH_*',
 'outcome_column': 'Decision_Outcome',
 'positive_outcomes': ['Accepted', 'Clicked'],
 'negative_outcomes': ['Rejected', 'Impression'],
 'special_predictors': ['Decision_DecisionTime',
  'Decision_OutcomeTime',
  'Decision_Rank'],
 'sample_percentage_schema_inferencing': 0.01}
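The `context_key_label`, `ih_label` and `outcome_column` options suggest that columns are assigned to a group by name. The sketch below shows one way such a grouping could work, assuming the labels behave like shell-style wildcards; whether pdstools matches them this way is an assumption, and the `classify` helper is hypothetical.

```python
from fnmatch import fnmatchcase

# Column names from the sample dataset.
columns = [
    "Context_Name",
    "Customer_MaritalStatus",
    "Customer_CLV",
    "Customer_City",
    "IH_Web_Inbound_Accepted_pxLastGroupID",
    "Decision_Outcome",
]

# Default values taken from the config listing above.
context_key_label = "Context_*"
ih_label = "IH_*"
outcome_column = "Decision_Outcome"


def classify(column):
    """Assign a column to a group based on its name (assumed glob semantics)."""
    if column == outcome_column:
        return "outcome"
    if fnmatchcase(column, context_key_label):
        return "context_key"
    if fnmatchcase(column, ih_label):
        return "ih_predictor"
    return "predictor"


groups = {column: classify(column) for column in columns}
for column, group in groups.items():
    print(f"{column}: {group}")
```

Each group can then be paired with its own `mask_*_names` and `mask_*_values` switches.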

These parameters are easy to change by passing keyword arguments. In the following example, we:
- Keep the IH predictor names
- Keep the outcome values
- Keep the context key values
- Keep the context key names (already the default)

[8]:
anon = DataAnonymization(
    hds_folder="../../../../data/",
    mask_ih_names=False,
    mask_outcome_values=False,
    mask_context_key_values=False,
    mask_context_key_names=False,
)
anon.process()

[8]:
shape: (7, 7)
filename                         | PREDICTOR_1 | PREDICTOR_2            | PREDICTOR_3            | Context_Name                | IH_Web_Inbound_Accepted_pxLastGroupID | Decision_Outcome
str                              | f64         | str                    | str                    | str                         | str                                   | str
"../../../../data/SampleHDS.jso… | 7.5292e16   | "5652797536064939235"  | "1883914281589318169"  | "FirstMortgage30yr"         | "7516570166790360641"                 | "Rejected"
"../../../../data/SampleHDS.jso… | 1.0513e19   | "13504995809060850713" | "13130779151990347928" | "FirstMortgage30yr"         | "7207095931022615371"                 | "Accepted"
"../../../../data/SampleHDS.jso… | 1.2205e19   | "12266615714185244798" | "17512333760039993470" | "MoneyMarketSavingsAccount" | "7516570166790360641"                 | "Rejected"
"../../../../data/SampleHDS.jso… | 1.5206e19   | "13504995809060850713" | "13070512220880338382" | "BasicChecking"             | "7516570166790360641"                 | "Rejected"
"../../../../data/SampleHDS.jso… | 9.2184e18   | "5652797536064939235"  | "9707500569982642684"  | "BasicChecking"             | "15207313976030574712"                | "Accepted"
"../../../../data/SampleHDS.jso… | 1.7792e19   | "12266615714185244798" | "16463319030652753949" | "UPlusFinPersonal"          | null                                  | "Rejected"
"../../../../data/SampleHDS.jso… | 8.9321e18   | "12266615714185244798" | "9489790242227414652"  | "BasicChecking"             | "15207313976030574712"                | "Rejected"

The config can also be saved to a file and loaded again:

[9]:
anon.config.save_to_config_file('config.json')
[10]:
anon = DataAnonymization(config=Config(config_file='config.json'))
anon.process()
[10]:
shape: (7, 7)
filename                         | PREDICTOR_1 | PREDICTOR_2            | PREDICTOR_3            | Context_Name                | IH_Web_Inbound_Accepted_pxLastGroupID | Decision_Outcome
str                              | f64         | str                    | str                    | str                         | str                                   | str
"../../../../data/SampleHDS.jso… | 1.4092e19   | "13050987138282047001" | "7273466646321961720"  | "FirstMortgage30yr"         | "201110393970616223"                  | "Rejected"
"../../../../data/SampleHDS.jso… | 7.5421e18   | "4687207091070476948"  | "12321688881564301403" | "FirstMortgage30yr"         | "14469491008813909687"                | "Accepted"
"../../../../data/SampleHDS.jso… | 2.7647e18   | "9221479136500986521"  | "13224801672048989748" | "MoneyMarketSavingsAccount" | "201110393970616223"                  | "Rejected"
"../../../../data/SampleHDS.jso… | 1.8531e18   | "4687207091070476948"  | "3005892654458747413"  | "BasicChecking"             | "201110393970616223"                  | "Rejected"
"../../../../data/SampleHDS.jso… | 2.9153e17   | "13050987138282047001" | "11143060317304375272" | "BasicChecking"             | "14949876674167102948"                | "Accepted"
"../../../../data/SampleHDS.jso… | 5.8514e18   | "9221479136500986521"  | "3042835870227387670"  | "UPlusFinPersonal"          | null                                  | "Rejected"
"../../../../data/SampleHDS.jso… | 1.3068e19   | "9221479136500986521"  | "6985670021126623909"  | "BasicChecking"             | "14949876674167102948"                | "Rejected"
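The exact layout that save_to_config_file() writes is not shown here. Assuming the file is a flat JSON object of keyword arguments, the save-and-reload round trip can be sketched with the standard library alone; the `overrides` dictionary and the temporary file name are illustrative, not part of pdstools.

```python
import json
import os
import tempfile

# Keyword arguments used in the earlier example, as a plain dictionary.
overrides = {
    "hds_folder": "../../../../data/",
    "mask_ih_names": False,
    "mask_outcome_values": False,
    "mask_context_key_values": False,
    "mask_context_key_names": False,
}

# Write the settings to a JSON file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example_config.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(overrides, f, indent=2)

# Read them back and confirm nothing was lost in the round trip.
with open(path, encoding="utf-8") as f:
    restored = json.load(f)

print(restored == overrides)
```

Note that a saved config preserves the settings, not the hashes: the hashed values in the output above differ from the previous run because the seed is random.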

Exporting

Two functions handle exporting:
- create_mapping_file() writes the mapping file of the predictor names
- write_to_output() writes the processed dataframe to disk

write_to_output() accepts the following extensions: ["ndjson", "parquet", "arrow", "csv"]

[11]:
anon.create_mapping_file()
with open('mapping.map') as f:
    print(f.read())
filename=filename
Customer_CLV=PREDICTOR_1
Customer_MaritalStatus=PREDICTOR_2
Customer_City=PREDICTOR_3
Context_Name=Context_Name
IH_Web_Inbound_Accepted_pxLastGroupID=IH_Web_Inbound_Accepted_pxLastGroupID
Decision_Outcome=Decision_Outcome
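The mapping file printed above uses one `original=anonymized` pair per line, so it is straightforward to read back into a dictionary. A minimal sketch, with the file content inlined as a string; `parse_mapping` is a hypothetical helper, not a pdstools function.

```python
# Content of mapping.map as printed above.
mapping_text = """filename=filename
Customer_CLV=PREDICTOR_1
Customer_MaritalStatus=PREDICTOR_2
Customer_City=PREDICTOR_3
Context_Name=Context_Name
IH_Web_Inbound_Accepted_pxLastGroupID=IH_Web_Inbound_Accepted_pxLastGroupID
Decision_Outcome=Decision_Outcome
"""


def parse_mapping(text):
    """Parse 'original=anonymized' lines into a dict, skipping blank lines."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first '=' only, in case a name contains one.
        original, _, anonymized = line.partition("=")
        mapping[original] = anonymized
    return mapping


mapping = parse_mapping(mapping_text)
print(mapping["Customer_CLV"])  # PREDICTOR_1
```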

[12]:
anon.write_to_output(ext='arrow')
[13]:
pl.read_ipc('output/hds.arrow')
[13]:
shape: (7, 6)
PREDICTOR_1 | PREDICTOR_2           | PREDICTOR_3            | Context_Name                | IH_Web_Inbound_Accepted_pxLastGroupID | Decision_Outcome
f64         | str                   | str                    | str                         | str                                   | str
1.4553e18   | "1677290861643629533" | "675924639818905441"   | "FirstMortgage30yr"         | "7325192143980280913"                 | "Rejected"
1.3342e19   | "3995997704274528624" | "12466662781232983786" | "FirstMortgage30yr"         | "11250321987204776838"                | "Accepted"
6.4610e18   | "299236262880695810"  | "14480333672661849787" | "MoneyMarketSavingsAccount" | "7325192143980280913"                 | "Rejected"
9.5945e18   | "3995997704274528624" | "10510107923994113798" | "BasicChecking"             | "7325192143980280913"                 | "Rejected"
8.8990e18   | "1677290861643629533" | "11351735953431803477" | "BasicChecking"             | "14140642726280753534"                | "Accepted"
1.8116e19   | "299236262880695810"  | "6705003666536151207"  | "UPlusFinPersonal"          | null                                  | "Rejected"
3.0632e18   | "299236262880695810"  | "16498174082774312399" | "BasicChecking"             | "14140642726280753534"                | "Rejected"