G.1. pgpro_anonymizer
pgpro_anonymizer is an extension to mask or replace personally identifiable information (PII) or commercially sensitive data from a Postgres Pro database.
The project has a declarative approach of anonymization. This means you can declare the masking rules using the Data Definition Language (DDL) and specify your anonymization strategy inside the table definition itself.
Once the masking rules are defined, you can access the anonymized data in one of the following ways:
Anonymous dumps: simply export the masked data into an SQL file.
Static masking: remove the PII according to the rules.
Dynamic masking: hide PII only for the masked users.
In addition, various functions are available to implement the following masking techniques: randomization, faking, partial scrambling, shuffling, noise. You can also create a custom function.
Beyond masking, you can also use another approach called generalization, which is useful for statistics and data analytics.
G.1.1. Terms and Definitions
The following main strategies are used:
Dynamic masking offers an altered view of the real data without modifying it. Some users may only read the masked data, others may access the authentic version.
Static masking is the definitive action of substituting the sensitive information with uncorrelated data. Once processed, the authentic data cannot be retrieved.
The data can be altered with several techniques:
Deletion simply removes data.
Static substitution consistently replaces the data with a generic value. For instance: replacing all values of a
textcolumn with the value “CONFIDENTIAL”.Variance “shifts” dates and numeric values. For example, by applying a +/- 10% variance to a salary column, the dataset will remain meaningful.
Generalization reduces the accuracy of the data by replacing it with a range of values. Instead of saying “Bob is 28 years old”, you can say “Bob is between 20 and 30 years old”. This is useful for analytics because the data remains true.
Shuffling mixes values within the same columns. This method is open to being reversed if the shuffling algorithm can be deciphered.
Randomization replaces sensitive data with random-but-plausible values. The goal is to avoid any identification from the data record while remaining suitable for testing, data analysis and data processing.
Partial scrambling is similar to static substitution but leaves out some part of the data. For instance: a credit card number can be replaced by “40XX XXXX XXXX XX96”.
Custom rules are designed to alter data following specific needs (e.g., simultaneously randomize a zipcode and a city name while keeping them coherent).
Pseudonymization is a way to protect personal information by hiding it using additional information. Encryption and hashing are two examples of pseudonymization techniques. However, pseudonymizated data is still linked to the original data.
G.1.2. Anonymization Example
Suppose you want to mask last names and phone numbers:
SELECT * FROM people; id | firstname | lastname | phone ----+-----------+----------+------------ T1 | Sarah | Connor | 0609110911
Activate the dynamic masking engine:
SELECT anon.start_dynamic_masking();
Declare a masked user:
CREATE ROLE skynet LOGIN; SECURITY LABEL FOR anon ON ROLE skynet IS 'MASKED';
Declare the masking rules:
SECURITY LABEL FOR anon ON COLUMN people.lastname IS 'MASKED WITH FUNCTION anon.fake_last_name()'; SECURITY LABEL FOR anon ON COLUMN people.phone IS 'MASKED WITH FUNCTION anon.partial(phone,2,$$******$$,2)';
Connect with the masked user:
\connect - skynet SELECT * FROM people; id | firstname | lastname | phone ----+-----------+-----------+------------ T1 | Sarah | Stranahan | 06******11
G.1.3. Installation and Setup
The pgpro_anonymizer extension is provided with Postgres Pro Enterprise as a separate pre-built package pgpro-anonymizer-ent-15 (for the detailed installation instructions, see Chapter 17). To enable pgpro_anonymizer, complete the following steps:
Add the library name to the
shared_preload_librariesvariable in thepostgresql.conffile:shared_preload_libraries = 'anon'
Reload the database server for the changes to take effect.
Note
To verify that the library was installed correctly, you can run the following command:
SHOW shared_preload_libraries;
Create the extension using the following query:
CREATE EXTENSION anon CASCADE;
Initialize the extension:
SELECT anon.init();
The
init()function imports a default dataset of random data (IBAN, names, cities, etc.). This dataset is in English and is very small (1000 values for each category).
G.1.4. Configuration
The extension has several options that can be defined for the entire instance (in postgresql.conf or with ALTER SYSTEM).
It is also possible and often a good idea to define them at the database level like this:
ALTER DATABASE customers SET anon.algorithm = sha512;
Only a superuser can change the parameters below:
anon.algorithmThe hashing method used by pseudonymizing functions. See pgcrypto documentation for the list of available options.
See
anon.saltto learn why this parameter is a very sensitive information.Type:
textDefault:
sha256Visible: only to superusers
anon.maskschemaThe schema (i.e. namespace) where the dynamic masking views will be stored.
Type:
textDefault:
maskVisible: to all users
anon.restrict_to_trusted_schemasWhen this parameter is enabled (by default), masking rules must be defined using functions located in a limited list of namespaces. By default, only the
anonschema is trusted.This improves security by preventing users from declaring their custom masking filters.
This also means that the schema must be explicit inside the masking rules. For more details, see the Write your own masks section of the Masking functions chapter.
SECURITY LABEL FOR anon ON COLUMN people.name IS 'MASKED WITH FUNCTION lower(people.name)';
The correct way to declare it would be:
SECURITY LABEL FOR anon ON COLUMN people.name IS 'MASKED WITH FUNCTION pg_catalog.lower(people.name)';
Type:
booleanDefault:
onVisible: to all users
anon.saltThe salt used by pseudonymizing functions. It is very important to define a custom salt for each database like this:
ALTER DATABASE foo SET anon.salt = 'This_Is_A_Very_Secret_Salt';
If masked users can read the salt, they can run a brute force attack to retrieve the original data based on the following elements:
The pseudonymized data
The hashing algorithm (see
anon.algorithm)The salt
The salt and the name of the hashing algorithm should be protected with the same level of security that the data itself. This is why the salt should be stored directly within the database with
ALTER DATABASE.Type:
textDefault: (empty)
Visible: only to superusers
anon.sourceshemaThe schema (i.e. namespace) where the tables are masked by the dynamic masking engine.
Change this value before starting dynamic masking.
ALTER DATABASE foo SET anon.sourceschema TO 'my_app';
Then reconnect so that the change takes effect and start the engine.
SELECT anon.start_dynamic_masking();
Type:
textDefault:
publicVisible: to all users
G.1.5. Declare Masking Rules
The main idea of this extension is to offer anonymization by design.
The data masking rules should be written by people who develop the application because they have the best knowledge of how the data model works. Therefore, masking rules must be implemented directly inside the database schema.
This allows to mask the data directly inside the Postgres Pro instance without using an external tool, and thus limiting the exposure and the risks of data leak.
The data masking rules are declared simply by using security labels:
CREATE TABLE player (id SERIAL, name TEXT, points INT); INSERT INTO player VALUES ( 1, 'Kareem Abdul-Jabbar', 38387), ( 5, 'Michael Jordan', 32292 ); SECURITY LABEL FOR anon ON COLUMN player.name IS 'MASKED WITH FUNCTION anon.fake_last_name()'; SECURITY LABEL FOR anon ON COLUMN player.id IS 'MASKED WITH VALUE NULL';
Important
The maximum length of a masking rule is 1024 characters. If you need more, you should probably write a dedicated masking function.
Important
The masking rules are not inherited. If you have split a table into multiple partitions, you need to declare the masking rules for each partition.
G.1.5.1. Escaping String Literals
As you may have noticed, the masking rule definitions are placed between single quotes. Therefore, if you need to use a string inside a masking rule, use dollar quoting:
SECURITY LABEL FOR anon ON COLUMN player.name IS 'MASKED WITH VALUE $$CONFIDENTIAL$$';
G.1.5.2. Listing Masking Rules
To display all the masking rules declared in the current database, see anon.pg_masking_rules:
SELECT * FROM anon.pg_masking_rules;
G.1.5.3. Debugging Masking Rules
When an error occurs due to a wrong masking rule, you can get more detailed information about the problem by setting client_min_messages to DEBUG and you will get useful details.
postgres=# SET client_min_messages=DEBUG; postgres=# SELECT anon.anonymize_database(); DEBUG: Anonymize table public.bar with firstname = anon.fake_first_name() DEBUG: Anonymize table public.foo with id = NULL ERROR: Cannot mask a "NOT NULL" column with a NULL value HINT: If privacy_by_design is enabled, add a default value to the column CONTEXT: PL/pgSQL function anon.anonymize_table(regclass) line 47 at RAISE SQL function "anonymize_database" statement 1
G.1.5.4. Removing a Masking Rule
You can simply erase a masking rule like this:
SECURITY LABEL FOR anon ON COLUMN player.name IS NULL;
To remove all rules at once, you can use:
SELECT anon.remove_masks_for_all_columns();
G.1.6. Masking Functions
The extension provides functions to implement the following main anonymization strategies:
Depending on your data, you may need to use different strategies on different columns:
For names and other “direct identifiers”, faking is often useful
Shuffling is convenient for foreign keys
Adding noise is interesting for numeric values and dates
Partial scrambling is perfect for email addresses and phone numbers
G.1.6.1. Deletion
First of all, the fastest and safest way to anonymize a data is to delete it.
In many cases, the best approach to hide the content of a column is to replace all the values with a single static value.
For instance, you can replace an entire column with the word CONFIDENTIAL like this:
SECURITY LABEL FOR anon ON COLUMN users.address IS 'MASKED WITH VALUE $$CONFIDENTIAL$$';
G.1.6.2. Adding Noise
This is also called Variance. The idea is to “shift” dates and numeric values. For example, by applying a +/- 10% variance to a salary column, the dataset will remain meaningful.
-
anon.noise(noise_valueanyelement,ratiodouble precision) If
ratiois 0.33, the return value will be the original value randomly shifted with a ratio of +/- 33%.-
anon.dnoise(noise_valueanyelement,noise_rangeinterval) If
interval= “2 days”, the return value will be the original value randomly shifted by +/- 2 days.
Important
The noise() masking functions are vulnerable to a form of repeat attack, especially with Dynamic Masking. A masked user can guess an original value by requesting its masked value multiple times and then simply use the avg() function to get a close approximation. In a nutshell, these functions are best fitted for Anonymous Dumps and Static Masking. They should be avoided when using Dynamic Masking.
G.1.6.3. Randomization
The extension provides a large choice of functions to generate purely random data:
G.1.6.3.1. Basic Random Values
G.1.6.3.2. Random Between
To pick any value inside between two bounds:
Note
With these functions, the lower and upper bounds are included. For instance, anon.random_int_between(1,3) returns either 1, 2 or 3.
For more advanced interval descriptions, see Random in Range.
G.1.6.3.3. Random in Array
The random_in function returns an element in a given array. For example:
-
anon.random_in(ARRAY[1,2,3]) Returns an integer between
1and3.-
anon.random_in(ARRAY['red','green','blue']) Returns a random text value from an array of ['red', 'green', 'blue'].
G.1.6.3.4. Random in Enum
This function returns an element in a given variable of the ENUM type.
-
anon.random_in_enum(variable_of_an_enum_type) Returns any value.
CREATE TYPE card AS ENUM ('visa', 'mastercard', ‘amex’);
SELECT anon.random_in_enum(NULL::CARD);
random_in_enum
----------------
mastercard
CREATE TABLE customer (
id INT,
...
credit_card CARD
);
SECURITY LABEL FOR anon ON COLUMN customer.creditcard
IS 'MASKED WITH FUNCTION anon.random_in_enum(creditcard)'
G.1.6.3.5. Random in Range
RANGE types are a powerfull way to describe an interval of values, where you can define inclusive or exclusive bounds. See examples.
There is a function for each subtype of range:
-
anon.random_in_int4range('[5,6)') Returns an INT of value 5.
-
anon.random_in_int8range('(6,7]') Returns a BIGINT of value 7.
-
anon.random_in_numrange('[0.1,0.9]') Returns a NUMERIC between 0.1 and 0.9.
-
anon.random_in_daterange('[2001-01-01, 2001-12-31)') Returns a date in 2001.
-
anon.random_in_tsrange('[2022-10-01,2022-10-31]') Returns a TIMESTAMP in October 2022.
-
anon.random_in_tstzrange('[2022-10-01,2022-10-31]') Returns a TIMESTAMP WITH TIMEZONE in October 2022.
Note
It is not possible to get a random value from a RANGE with an infinite bound. For example anon.random_in_int4range('[2022,)') returns NULL.
G.1.6.4. Faking
The idea of Faking is to replace sensitive data with random-but-plausible values. The goal is to avoid any identification from the data record while remaining suitable for testing, data analysis and data processing.
The following faking functions are available:
-
anon.fake_address() Returns a complete postal address.
-
anon.fake_city() Returns an existing city.
-
anon.fake_country() Returns an existing country.
-
anon.fake_company() Returns a generic company name.
-
anon.fake_email() Returns a valid email address.
-
anon.fake_first_name() Returns a generic first name.
-
anon.fake_iban() Returns a valid IBAN.
-
anon.fake_last_name() Returns a generic last name.
-
anon.fake_postcode() Returns a valid zip code.
For text and varchar columns, you can use the classic Lorem Ipsum generator:
-
anon.lorem_ipsum() Returns five paragraphs.
-
anon.lorem_ipsum(2) Returns two paragraphs.
-
anon.lorem_ipsum( paragraphs := 4 ) Returns four paragraphs.
-
anon.lorem_ipsum( words := 20 ) Returns 20 words.
-
anon.lorem_ipsum( characters := anon.length(table.column) ) Returns the same number of characters as the original string.
G.1.6.5. Advanced Faking
Generating fake data is a complex topic. The functions provided here are limited to basic use case. For more advanced faking methods, in particular if you are looking for localized fake data, take a look at PostgreSQL Faker, an extension based upon the well-known Faker Python library.
This extension provides an advanced faking engine with localization support.
For example:
CREATE SCHEMA faker;
CREATE EXTENSION faker SCHEMA faker;
SELECT faker.faker('de_DE');
SELECT faker.first_name_female();
first_name_female
-------------------
Mirja
G.1.6.6. Pseudonymization
Pseudonymization is similar to Faking in the sense that it generates realistic values. The main difference is that the pseudonymization is deterministic: the functions will always return the same fake value based on a seed and an optional salt.
The following pseudonymization functions are available:
-
anon.pseudo_first_name(seedanyelement,salttext) Returns a generic first name.
-
anon.pseudo_last_name(seedanyelement,salttext) Returns a generic last name.
-
anon.pseudo_email(seedanyelement,salttext) Returns a valid email address.
-
anon.pseudo_city(seedanyelement,salttext) Returns an existing city.
-
anon.pseudo_country(seedanyelement,salttext) Returns an existing country.
-
anon.pseudo_company(seedanyelement,salttext) Returns a generic company name.
-
anon.pseudo_iban(seedanyelement,salttext) Returns a valid IBAN.
-
anon.pseudo_siret(seedanyelement,salttext) Returns a valid SIRET.
The second argument (salt) is optional. You can call each function only with the seed like this: anon.pseudo_city('bob'). The salt is here to increase complexity and avoid dictionary and brute force attacks. If a specific salt is not given, the value of the anon.salt GUC parameter is used instead (see Generic Hashing for more details).
The seed can be any information related to the subject. For instance, we can consistently generate the same fake email address for a given person by using her login as the seed:
SECURITY LABEL FOR anon ON COLUMN users.emailaddress IS 'MASKED WITH FUNCTION anon.pseudo_email(users.login)';
Note
You may want to produce unique values using a pseudonymization function. For instance, if you want to mask an email column that is declared as UNIQUE. In this case, you will need to initialize the extension with a fake dataset that is way bigger than the number of rows of the table. Otherwise, you may see some “collisions” happening, i.e. two different original values producing the same pseudo value.
Important
Pseudonymization is often confused with anonymization but in fact they serve two different purposes: pseudonymization is a way to protect the personal information but the pseudonymized data is still “linked” to the real data.
G.1.6.7. Generic Hashing
In theory, hashing is not a valid anonymization technique, however in practice it is sometimes necessary to generate a determinist hash of the original data.
For instance, when a pair of primary key / foreign key is a “natural key”, it may contain actual information (like a customer number containing a birth date or something similar).
Hashing such columns allows to keep referential integrity intact even for relatively unusual source data.
-
anon.digest(valuetext,salttext,algorithmtext) Lets you choose a salt and a hash algorithm. Supported algorithms are:
md5,sha1,sha224,sha256,sha384, andsha512.-
anon.hash(valuetext) Returns a text hash of the value using a secret salt that is defined by the
anon.saltparameter and hash algorithm that is defined by theanon.algorithmparameter. The default value ofanon.algorithmissha256and possible values are:md5,sha1,sha224,sha256,sha384orsha512. The default value ofanon.saltis an empty string. You can modify these values with:ALTER DATABASE foo SET anon.salt TO 'xsfnjefnjsnfjsnf'; ALTER DATABASE foo SET anon.algorithm TO 'sha384';
Keep in mind that hashing is a form of Pseudonymization. This means that the data can be “de-anonymized” using the hashed value and the masking function. If attackers get access to these two elements, they could re-identify some persons using brute force or dictionary attacks. Therefore, the salt and the algorithm used to hash the data must be protected with the same level of security as the original dataset.
In a nutshell, we recommend that you use the anon.hash() function rather than anon.digest() because the salt will not appear clearly in the masking rule.
Furthermore, in practice the hash function will return a long string of character like this:
SELECT anon.hash('bob');
hash
----------------------------------------------------------------------------------------------------------------------------------
95b6accef02c5a725a8c9abf19ab5575f99ca3d9997984181e4b3f81d96cbca4d0977d694ac490350e01d0d213639909987ef52de8e44d6258d536c55e427397
For some columns, this may be too long and you may have to cut some parts of the hash in order to fit into the column. For instance, if you have a foreign key based on a phone number and the column is a varchar(12), you can transform the data like this:
SECURITY LABEL FOR anon ON COLUMN people.phone_number IS 'MASKED WITH FUNCTION anon.left(anon.hash(phone_number),12)'; SECURITY LABEL FOR anon ON COLUMN call_history.fk_phone_number IS 'MASKED WITH FUNCTION anon.left(anon.hash(fk_phone_number),12)';
Of course, cutting the hash value to 12 characters will increase the risk of “collision” (2 different values having the same fake hash). In such case, it's up to you to evaluate this risk.
G.1.6.8. Partial Scrambling
Partial scrambling leaves out some part of the data. For instance, a credit card number can be replaced by “40XX XXXX XXXX XX96”.
-
anon.partial(inputtext,prefixint,paddingtext,suffixint) Partially replaces a given text. For example,
anon.partial('abcdefgh',1,'xxxx',3)returnsaxxxxfgh.-
anon.partial_email(emailtext) Partially replaces a given email. For example,
anon.partial_email('daamien@gmail.com')returnsda******@gm******.com.
G.1.6.9. Conditional Masking
In some situations, you may want to apply a masking filter only for some value or for a limited number of lines in the table.
For instance, if you want to "preserve NULL values", that is masking only the lines that contain a value, you can use the anon.ternary function, which works like a CASE WHEN x THEN y ELSE z statement:
SECURITY LABEL FOR anon ON COLUMN player.score
IS 'MASKED WITH FUNCTION anon.ternary(score IS NULL,
NULL,
anon.random_int_between(0,100));
You may also want to exclude some lines within the table. Like keeping the password of some users so that they still may be able to connect to a testing deployment of your application:
SECURITY LABEL FOR anon ON COLUMN account.password IS 'MASKED WITH FUNCTION anon.ternary( id > 1000, NULL::TEXT, password)';
Warning
Conditional masking may create a partially deterministic "connection" between the original data and the masked data. And that connection can be used to retrieve personal information from the masked data. For instance, if NULL values are preserved for a "deceased_date" column, it will reveal which persons are still actually alive... In a nutshell: conditional masking may often produce a dataset that is not fully anonymized and therefore would still technically contain personal information.
G.1.6.10. Generalization
Generalization is the principle of replacing the original value by a range containing this value. For instance, instead of saying “Paul is 42 years old”, you would say “Paul is between 40 and 50 years old”.
Note
The generalization functions perform data type transformation. Therefore, it is not possible to use them with the dynamic masking engine. However, they are useful to create anonymized views. See example below.
Let's imagine a table containing health information:
SELECT * FROM patient; id | name | zipcode | birth | disease ----+----------+----------+------------+--------------- 1 | Alice | 47678 | 1979-12-29 | Heart Disease 2 | Bob | 47678 | 1959-03-22 | Heart Disease 3 | Caroline | 47678 | 1988-07-22 | Heart Disease 4 | David | 47905 | 1997-03-04 | Flu 5 | Eleanor | 47909 | 1999-12-15 | Heart Disease 6 | Frank | 47906 | 1968-07-04 | Cancer 7 | Geri | 47605 | 1977-10-30 | Heart Disease 8 | Harry | 47673 | 1978-06-13 | Cancer 9 | Ingrid | 47607 | 1991-12-12 | Cancer
We can build a view upon this table to suppress some columns ( SSN and name ) and generalize the zipcode and the birth date like this:
CREATE VIEW anonymized_patient AS
SELECT
'REDACTED' AS lastname,
anon.generalize_int4range(zipcode,100) AS zipcode,
anon.generalize_tsrange(birth,'decade') AS birth,
disease
FROM patient;
The anonymized table now looks like that:
SELECT * FROM anonymized_patient; lastname | zipcode | birth | disease ----------+---------------+-----------------------------+--------------- REDACTED | [47600,47700) | ["1970-01-01","1980-01-01") | Heart Disease REDACTED | [47600,47700) | ["1950-01-01","1960-01-01") | Heart Disease REDACTED | [47600,47700) | ["1980-01-01","1990-01-01") | Heart Disease REDACTED | [47900,48000) | ["1990-01-01","2000-01-01") | Flu REDACTED | [47900,48000) | ["1990-01-01","2000-01-01") | Heart Disease REDACTED | [47900,48000) | ["1960-01-01","1970-01-01") | Cancer REDACTED | [47600,47700) | ["1970-01-01","1980-01-01") | Heart Disease REDACTED | [47600,47700) | ["1970-01-01","1980-01-01") | Cancer REDACTED | [47600,47700) | ["1990-01-01","2000-01-01") | Cancer
The generalized values are still useful for statistics because they remain true, but they are less accurate, and therefore reduce the risk of re-identification.
Postgres Pro offers several range data types, which are perfect for dates and numeric values.
For numeric values, the following functions are available:
-
generalize_int4range(value,step)
generalize_int8range(value,step)
generalize_numrange(value,step) where
valueis the data that will be generalized, andstepis the size of each range.
G.1.6.11. Using pg_catalog Functions
The pg_catalog schema is not trusted by default. This is a security measure designed to prevent users from using sophisticated functions in masking rules (such as pg_catalog.query_to_xml, pg_catalog.ts_stat or the system administration functions) that should not be used as masking functions.
However, the extension provides bindings to some useful and safe functions from the pg_catalog schema for your convenience:
anon.concat(TEXT,TEXT)anon.concat(TEXT,TEXT, TEXT)anon.date_add(TIMESTAMP WITH TIME ZONE,INTERVAL)anon.date_part(TEXT,TIMESTAMP)anon.date_part(TEXT,INTERVAL)anon.date_subtract(TIMESTAMP WITH TIME ZONE, INTERVAL )anon.date_trunc(TEXT,TIMESTAMP)anon.date_trunc(TEXT,TIMESTAMP WITH TIME ZONE,TEXT)anon.date_trunc(TEXT,INTERVAL)anon.left(TEXT,INTEGER)anon.length(TEXT)anon.lower(TEXT)anon.make_date(INT,INT,INT )anon.make_time(INT,INT,DOUBLE PRECISION)anon.md5(TEXT)anon.random()anon.replace(TEXT,TEXT,TEXT)anon.regexp_replace(TEXT,TEXT,TEXT)anon.regexp_replace(TEXT,TEXT,TEXT,TEXT)anon.right(TEXT,INTEGER)anon.substr(TEXT,INTEGER)anon.substr(TEXT,INTEGER,INTEGER)anon.upper(TEXT)
If you need more bindings, you can either:
Write your own mapping function in a trusted schema.
Set the
pg_catalogschema asTRUSTED(not recommended).
G.1.6.12. Write Your Own Masks
You can also use your own function as a mask. The function must either be destructive (like Partial Scrambling) or insert some randomness in the dataset (like Faking).
Especially for complex data types, you may have to write your own function. This will be a common use case if you have to hide certain parts of a JSON field.
For example:
CREATE TABLE company ( business_name TEXT, info JSONB )
The info field contains unstructured data like this:
SELECT jsonb_pretty(info) FROM company WHERE business_name = 'Soylent Green';
jsonb_pretty
----------------------------------
{
"employees": [
{
"lastName": "Doe",
"firstName": "John"
},
{
"lastName": "Smith",
"firstName": "Anna"
},
{
"lastName": "Jones",
"firstName": "Peter"
}
]
}
(1 row)
Using the JSON functions and operators, you can walk through the keys and replace the sensitive values as needed.
CREATE SCHEMA custom_masks;
-- This step requires superuser privilege
SECURITY LABEL FOR anon ON SCHEMA custom_masks IS 'TRUSTED';
CREATE FUNCTION custom_masks.remove_last_name(j JSONB)
RETURNS JSONB
VOLATILE
LANGUAGE SQL
AS $func$
SELECT
json_build_object(
'employees' ,
array_agg(
jsonb_set(e ,'{lastName}', to_jsonb(anon.fake_last_name()))
)
)::JSONB
FROM jsonb_array_elements( j->'employees') e
$func$;
Then check that the function is working correctly:
SELECT custom_masks.remove_last_name(info) FROM company;
When it is okay, you can declare this function as the mask of the info field:
SECURITY LABEL FOR anon ON COLUMN company.info IS 'MASKED WITH FUNCTION custom_masks.remove_last_name(info)';
And try it out:
SELECT anonymize_table('company');
SELECT jsonb_pretty(info) FROM company WHERE business_name = 'Soylent Green';
jsonb_pretty
-------------------------------------
{
"employees": [ +
{ +
"lastName": "Prawdzik",+
"firstName": "John" +
}, +
{ +
"lastName": "Baltazor",+
"firstName": "Anna" +
}, +
{ +
"lastName": "Taylan", +
"firstName": "Peter" +
} +
] +
}
(1 row)
This is just a quick example. As you can see, manipulating a sophisticated JSON structure with SQL is possible but it can be tricky at first. There are multiple ways of walking through the keys and updating values. You will probably have to try different approaches, depending on your real JSON data and the performance you want to reach.
G.1.7. Static Masking
Sometimes, it is useful to transform directly the original dataset. You can do that with different methods:
These methods will destroy the original data. Use with care.
G.1.7.1. Applying Masking Rules
You can permanently apply the masking rules of a database with anon.anonymize_database().
Let's use a basic example:
CREATE TABLE customer ( id SERIAL, full_name TEXT, birth DATE, employer TEXT, zipcode TEXT, fk_shop INTEGER ); INSERT INTO customer VALUES (911,'Chuck Norris','1940-03-10','Texas Rangers', '75001',12), (312,'David Hasselhoff','1952-07-17','Baywatch', '90001',423) ; SELECT * FROM customer; id | full_name | birth | employer | zipcode | fk_shop -----+------------------+------------+---------------+---------+--------- 911 | Chuck Norris | 1940-03-10 | Texas Rangers | 75001 | 12 112 | David Hasselhoff | 1952-07-17 | Baywatch | 90001 | 423
Declare the masking rules:
SECURITY LABEL FOR anon ON COLUMN customer.full_name IS 'MASKED WITH FUNCTION anon.fake_first_name() || '' '' || anon.fake_last_name()'; SECURITY LABEL FOR anon ON COLUMN customer.employer IS 'MASKED WITH FUNCTION anon.fake_company()'; SECURITY LABEL FOR anon ON COLUMN customer.zipcode IS 'MASKED WITH FUNCTION anon.random_zip()';
Replace authentic data in the masked columns:
SELECT anon.anonymize_database(); SELECT * FROM customer; id | full_name | birth | employer | zipcode | fk_shop -----+-------------+------------+---------------------+---------+--------- 911 | Jesse Kosel | 1940-03-10 | Marigold Properties | 62172 | 12 312 | Leolin Bose | 1952-07-17 | Inventure | 20026 | 423
You can also use anonymize_table() and anonymize_column() to remove data from a subset of the database:
SELECT anon.anonymize_table('customer');
SELECT anon.anonymize_column('customer','zipcode');
Important
Static masking is a slow process. The principle of static masking is to update all lines of all tables containing at least one masked column. This basically means that the server will rewrite all the data on disk. Depending on the database size, the hardware and the instance configuration, it may be faster to export the anonymized data (see Anonymous Dumps) and reload it into the database.
G.1.7.2. Shuffling
Shuffling mixes values within the same columns.
-
anon.shuffle_column(shuffle_tableregclass,shuffle_columnname,primary_keyname) Rearranges all values in a given column. You need to provide the primary key of the table.
This is useful for foreign keys because referential integrity will be kept.
Important
shuffle_column() is not a masking function because it works “verticaly”: it will modify all the values of a column at once.
G.1.7.3. Adding Noise to a Column
There are also some functions that can add noise on an entire column:
-
anon.add_noise_on_numeric_column(tableregclass,columntext,ratiofloat) If
ratio= 0.33, all values of the column will be randomly shifted with a ratio of +/- 33%.-
anon.add_noise_on_datetime_column(tableregclass,columntext,intervalinterval) If
interval=2 days, all values of the column will be randomly shifted by +/- 2 days.
Important
These noise functions are vulnerable to a form of repeat attack.
G.1.8. Dynamic Masking
You can hide some data from a role by declaring this role as a “MASKED” one. Other roles will still access the original data.
CREATE TABLE people (id TEXT, firstname TEXT, lastname TEXT, phone TEXT);
INSERT INTO people VALUES ('T1','Sarah', 'Connor','0609110911');
SELECT * FROM people;
SELECT * FROM people;
id | firstname | lastname | phone
----+----------+----------+------------
T1 | Sarah | Connor | 0609110911
(1 row)
Activate the dynamic masking engine:
SELECT anon.start_dynamic_masking();
Declare a masked user:
CREATE ROLE skynet LOGIN; SECURITY LABEL FOR anon ON ROLE skynet IS 'MASKED';
Declare the masking rules:
SECURITY LABEL FOR anon ON COLUMN people.lastname IS 'MASKED WITH FUNCTION anon.fake_last_name()'; SECURITY LABEL FOR anon ON COLUMN people.phone IS 'MASKED WITH FUNCTION anon.partial(phone,2,$$******$$,2)';
Connect with the masked user:
\c - skynet SELECT * FROM people; id | firstname | lastname | phone ----+----------+-----------+------------ T1 | Sarah | Stranahan | 06******11 (1 row)
G.1.8.1. How to Change the Type of a Masked Column
When dynamic masking is activated, you are not allowed to change the datatype of a column if there's a mask upon it.
To modify a masked column, you need to temporarily switch off the masking engine like this:
BEGIN; SELECT anon.stop_dynamic_masking(); ALTER TABLE people ALTER COLUMN phone TYPE VARCHAR(255); SELECT anon.start_dynamic_masking(); COMMIT;
G.1.8.2. How to Drop a Masked Table
The dynamic masking engine will build masking views upon the masked tables. This means that it is not possible to drop a masked table directly. You will get an error like this:
DROP TABLE people; psql: ERROR: cannot drop table people because other objects depend on it DETAIL: view mask.company depends on table people
To effectively remove the table, it is necessary to add the CASCADE option, so that the masking view will be dropped too:
DROP TABLE people CASCADE;
G.1.8.3. How to Unmask a Role
Simply remove the security label like this:
SECURITY LABEL FOR anon ON ROLE bob IS NULL;
To unmask all masked roles at once, you can type:
SELECT anon.remove_masks_for_all_roles();
G.1.8.4. Limitations
G.1.8.4.1. Listing the Tables
Due to how the dynamic masking engine works, when a masked role tries to display the tables in psql with the \dt command, then psql will not show any tables.
This is because the search_path of the masked role is rigged.
You can try adding explicit schema you want to search, for instance:
\dt *.* \dt public.*
G.1.8.4.2. Only One Schema
The dynamic masking system only works with one schema (by default public). When you start the masking engine with start_dynamic_masking(), you can specify the schema that will be masked as follows:
ALTER DATABASE foo SET anon.sourceschema TO 'sales';
Then open a new session to the database and type:
SELECT start_dynamic_masking();
However, static masking with anon.anonymize_column(), anon.anonymize_table(), and anon.anonymize_database() will work fine with multiple schemas.
G.1.8.4.3. Performance
Dynamic Masking is known to be very slow with some queries, especially if you try to join two tables on a masked key using hashing or pseudonymization.
G.1.8.4.4. Graphic Tools
When you are using a masked role with a graphic interface, such as DBeaver or pgAdmin, the data panel may produce the following error when trying to display the content of a masked table called foo:
SQL Error [42501]: ERROR: permission denied for table foo
This is because most of these tools will directly query the public.foo table instead of being “redirected” by the masking engine toward the mask.foo view.
In order the view the masked data with a graphic tool, you can either:
Open the SQL query panel and type SELECT * FROM foo
Navigate to Database > Schemas > mask > Views > foo
G.1.9. Anonymous Dumps
G.1.9.1. pg_dump_anon.sh
The pg_dump_anon.sh script supports most of the options of the regular pg_dump command. The environment variables (PGHOST, PGUSER, etc.) and the .pgpass files are also supported.
G.1.9.2. Example
A user named bob can export an anonymous dump of the app database like this:
/opt/pgpro/ent-15/bin/pg_dump_anon.sh -h localhost -U bob --password --file=anonymous_dump.sql app
Important
The name of the database must be the last parameter.
For more details about the supported options, simply type ./pg_dump_anon.sh --help.
G.1.9.3. Limitations
The user password is asked automatically. This means you must either add the
--passwordoption to define it interactively or declare it in thePGPASSWORDvariable or put it inside the.pgpassfile ( however on Windows,thePGPASSFILEvariable must be specified explicitly)The
plainformat is the only supported format. The other formats (custom,dirandtar) are not supported.
G.1.10. Generalization
G.1.10.1. Reducing the Accuracy of Sensitive Data
The idea of generalization is to replace data with a broader, less accurate value. For instance, instead of saying “Bob is 28 years old”, you can say “Bob is between 20 and 30 years old”. This is interesting for analytics because the data remains true while avoiding the risk of re-identification.
Generalization is a way to achieve k-anonymity.
Postgres Pro can handle generalization very easily with the range data types, a very powerful way to store and manipulate a set of values contained between a lower and an upper bound.
G.1.10.2. Example
Here's a basic table containing medical data:
SELECT * FROM patient;
ssn | firstname | zipcode | birth | disease
-------------+-----------+---------+------------+---------------
253-51-6170 | Alice | 47012 | 1989-12-29 | Heart Disease
091-20-0543 | Bob | 42678 | 1979-03-22 | Allergy
565-94-1926 | Caroline | 42678 | 1971-07-22 | Heart Disease
510-56-7882 | Eleanor | 47909 | 1989-12-15 | Acne
098-24-5548 | David | 47905 | 1997-03-04 | Flu
118-49-5228 | Jean | 47511 | 1993-09-14 | Flu
263-50-7396 | Tim | 47900 | 1981-02-25 | Heart Disease
109-99-6362 | Bernard | 47168 | 1992-01-03 | Asthma
287-17-2794 | Sophie | 42020 | 1972-07-14 | Asthma
409-28-2014 | Arnold | 47000 | 1999-11-20 | Diabetes
(10 rows)
We want the anonymized data to remain true because it will be used for statistics. We can build a view upon this table to remove useless columns and generalize the indirect identifiers:
CREATE MATERIALIZED VIEW generalized_patient AS SELECT 'REDACTED'::TEXT AS firstname, anon.generalize_int4range(zipcode,1000) AS zipcode, anon.generalize_daterange(birth,'decade') AS birth, disease FROM patient;
This will give us a less accurate view of the data:
SELECT * FROM generalized_patient; firstname | zipcode | birth | disease -----------+---------------+-------------------------+--------------- REDACTED | [47000,48000) | [1980-01-01,1990-01-01) | Heart Disease REDACTED | [42000,43000) | [1970-01-01,1980-01-01) | Allergy REDACTED | [42000,43000) | [1970-01-01,1980-01-01) | Heart Disease REDACTED | [47000,48000) | [1980-01-01,1990-01-01) | Acne REDACTED | [47000,48000) | [1990-01-01,2000-01-01) | Flu REDACTED | [47000,48000) | [1990-01-01,2000-01-01) | Flu REDACTED | [47000,48000) | [1980-01-01,1990-01-01) | Heart Disease REDACTED | [47000,48000) | [1990-01-01,2000-01-01) | Asthma REDACTED | [42000,43000) | [1970-01-01,1980-01-01) | Asthma REDACTED | [47000,48000) | [1990-01-01,2000-01-01) | Diabetes (10 rows)
G.1.10.3. Generalization Functions
pgpro_anonymizer provides generalization functions, one for each range type. Generally, these functions take the original value as the first parameter, and a second parameter for the length of each step.
For numeric values:
-
anon.generalize_int4range(valueinteger,stepinteger) For example,
anon.generalize_int4range(42,5)returns the range[40,45).-
anon.generalize_int8range(valueinteger,stepinteger) For example,
anon.generalize_int8range(12345,1000)returns the range[12000,13000).-
anon.generalize_numrange(valueinteger,stepinteger) For example,
anon.generalize_numrange(42.32378,10)returns the range[40,50).
For time values:
-
anon.generalize_tsrange(valueinteger,stepinteger) For example,
anon.generalize_tsrange('1904-11-07','year')returns['1904-01-01','1905-01-01').-
anon.generalize_tstzrange(valueinteger,stepinteger) For example,
anon.generalize_tstzrange('1904-11-07','week')returns['1904-11-07','1904-11-14').-
anon.generalize_daterange(valueinteger,stepinteger) For example,
anon.generalize_daterange('1904-11-07','decade')returns[1900-01-01,1910-01-01).
The possible steps are: microsecond, millisecond, second, minute, hour, day, week, month, year, decade, century and millennium.
G.1.10.4. Limitations
G.1.10.4.1. Singling Out and Extreme Values
Singling Out is the possibility to isolate an individual in a dataset by using extreme value or exceptional values.
For example:
SELECT * FROM employees; id | name | job | salary ------+----------------+------+-------- 1578 | xkjefus3sfzd | NULL | 1498 2552 | cksnd2se5dfa | NULL | 2257 5301 | fnefckndc2xn | NULL | 45489 7114 | npodn5ltyp3d | NULL | 1821
In this table, we can see that a particular employee has a very high salary, very far from the average salary. Therefore this person is probably the CEO of the company.
With generalization, this is important because the size of the range (the “step”) must be wide enough to prevent the identification of one single individual.
k-anonymity is a way to assess this risk.
G.1.10.4.2. Generalization is Not Compatible with Dynamic Masking
By definition, with generalization the data remains true, but the column type is changed.
This means that the transformation is not transparent, and therefore it cannot be used with dynamic masking.
G.1.10.5. k-anonymity
k-anonymity is an industry-standard term used to describe a property of an anonymized dataset. The k-anonymity principle states that within a given dataset, any anonymized individual cannot be distinguished from at least k-1 other individuals. In other words, k-anonymity might be described as a “hiding in the crowd” guarantee. A low value of k indicates there's a risk of re-identification using linkage with other data sources.
-
anon.k_anonymity(idregclass) You can evaluate the k-anonymity factor of a table as follows:
Define the columns that are indirect identifiers (also known as quasi identifiers) like this:
SECURITY LABEL FOR k_anonymity ON COLUMN patient.firstname IS 'INDIRECT IDENTIFIER'; SECURITY LABEL FOR k_anonymity ON COLUMN patient.zipcode IS 'INDIRECT IDENTIFIER'; SECURITY LABEL FOR k_anonymity ON COLUMN patient.birth IS 'INDIRECT IDENTIFIER';
Once the indirect identifiers are declared:
SELECT anon.k_anonymity('generalized_patient')The higher the value, the better.
G.1.11. Performance
Any anonymization process has a price as it will consume CPU time, RAM space and probably a lot of disk I/O. Here's a quick overview of the question depending on the strategy you are using.
In a nutshell, the anonymization performance mainly depends on the following important factors:
The size of the database
The number of masking rules
G.1.11.1. Static Masking
Basically what static masking does it rewrite entirely the masked tables on disk. This may be slow depending on your environment. And during this process, the tables will be locked.
Note
In this case, the cost of anonymization is “paid” by all the users but it is paid once and for all.
G.1.11.2. Dynamic Masking
With dynamic masking, the real data is replaced on-the-fly every time a masked user sends a query to the database. This means that the masking users will have slower response time than regular (unmasked) users. This is generally ok because usually masked users are not considered as important as the regular ones.
If you apply three or four rules to a table, the response time for the masked users should be approximately 20% to 30% slower than for the normal users.
As the masking rules are applied for each queries of the masked users, the dynamic masking is appropriate when you have a limited number of masked users that connect only from time to time to the database. For instance, a data analyst connecting once a week to generate a business report.
If there are multiple masked users or if a masked user is very active, you should probably export the masked data once-a-week on a secondary instance and let these users connect to this secondary instance.
Note
In this case, the cost of anonymization is “paid” only by the masked users.
G.1.11.3. Anonymous Dumps
If the backup process of your database takes one hour with pg_dump, then anonymizing and exporting the entire database with pg_dump_anon.sh will probably take two hours.
Note
In this case, the cost of anonymization is “paid” by the user asking for the anonymous export. Other users of the database will not be affected.
G.1.11.4. How to Speed Things Up
G.1.11.4.1. Prefer MASKED WITH VALUE Whenever Possible
It is always faster to replace the original data with a static value instead of calling a masking function.
G.1.11.4.2. Sampling
If you need to anonymize data for testing purposes, chances are that a smaller subset of your database will be enough. In that case, you can easily speed up the anonymization by downsizing the volume of data.
See Sampling for more details.
G.1.11.4.3. Materialized Views
Dynamic masking is not always required. In some cases, it is more efficient to build Materialized Views instead.
For instance:
CREATE MATERIALIZED VIEW masked_customer AS
SELECT
id,
anon.random_last_name() AS name,
anon.random_date_between('1920-01-01'::DATE,now()) AS birth,
fk_last_order,
store_id
FROM customer;
G.1.12. Security
G.1.12.1. Permissions
Here's an overview of what users can do depending on the privilege they have:
Table G.1. Privileges
| Action | Superuser | Owner | Masked Role |
|---|---|---|---|
| Create the extension | Yes | ||
| Drop the extension | Yes | ||
| Init the extension | Yes | ||
| Reset the extension | Yes | ||
| Configure the extension | Yes | ||
| Put a mask upon a role | Yes | ||
| Start dynamic masking | Yes | ||
| Stop dynamic masking | Yes | ||
| Create a table | Yes | Yes | |
| Declare a masking rule | Yes | Yes | |
| Insert, delete, update a row | Yes | Yes | |
| Use static masking | Yes | Yes | |
| Select the real data | Yes | Yes | |
| Use regular dumps | Yes | Yes | |
| Use anonymous dumps | Yes | Yes | |
| Use the masking functions | Yes | Yes | Yes |
| Select the masked data | Yes | Yes | Yes |
| View the masking rules | Yes | Yes | Yes |
G.1.12.2. Limit Masking Filters Only to Trusted Schemas
By default, database owners can only write masking rules with functions that are located in the trusted schemas which are controlled by the superusers.
Out of the box, only the anon schema is declared as trusted. This means that by default the functions from the pg_catalog cannot be used in masking rules.
For more information, see Using pg_catalog functions.
G.1.12.3. Security Context of the Functions
Most of the functions of this extension are declared with the SECURITY INVOKER tag. This means that these functions are executed with the privileges of the user that calls them. This is an important restriction.
This extension contains another few functions declared with the tag SECURITY DEFINER.
G.1.13. Sampling
If you are writing an anonymization policy for a dataset, chances are that you do not need to anonymize the entire database. In most cases, extracting a subset of the table is sufficient. For example, if you want to export an anonymous dumps of the data for testing purpose in a CI workflow, extracting and masking only 10% of the database might be enough.
Furthermore, anonymizing a smaller portion (that is a "sample") of the dataset will be way faster.
G.1.13.1. Example
Let's say you have a huge amount of HTTP logs stored in a table. You want to remove the IP addresses and extract only 10% of the table:
CREATE TABLE http_logs ( id integer NOT NULL, date_opened DATE, ip_address INET, url TEXT ); SECURITY LABEL FOR anon ON COLUMN http_logs.ip_address IS 'MASKED WITH VALUE NULL'; SECURITY LABEL FOR anon ON TABLE http_logs IS 'TABLESAMPLE BERNOULLI(10)';
Now you can either do static masking, dynamic masking or an anonymous dumps. The mask data will represent a 10% portion of the real data.
G.1.13.2. Syntax
The syntax is exactly the same as the TABLESAMPLE clause, which can be placed at the end of a SELECT statement.
You can also define a sampling ratio at the database level, and it will be applied to all the tables that do not have their own TABLESAMPLE rule.
SECURITY LABEL FOR anon ON DATABASE app IS 'TABLESAMPLE SYSTEM(33)';
G.1.13.3. Maintaining Referential Integrity
Warning
The sampling method described above WILL FAIL if you have foreign keys pointing at the table you want to sample.
Extracting a subset of a database while maintaining referential integrity is tricky and it is not supported by this extension.
If you really need to keep referential integrity in an anonymized dataset, you need to do it in two steps:
Extract a sample with pg_sample.
Anonymize the extracted sample.
There may be other sampling tools for PostgreSQL but pg_sample is probably the best one.