G.1. pgpro_anonymizer
pgpro_anonymizer is an extension to mask or replace personally identifiable information (PII) or commercially sensitive data from a Postgres Pro database.
The project has a declarative approach of anonymization. This means you can declare the masking rules using the Data Definition Language (DDL) and specify your anonymization strategy inside the table definition itself.
Once the masking rules are defined, you can access the anonymized data in one of the following ways:
Anonymous dumps: simply export the masked data into an SQL file.
Static masking: remove the PII according to the rules.
Dynamic masking: hide PII only for the masked users.
In addition, various functions are available to implement the following masking techniques: randomization, faking, partial scrambling, shuffling, noise. You can also create a custom function.
Beyond masking, you can also use another approach called generalization, which is useful for statistics and data analytics.
G.1.1. Terms and Definitions
The following main strategies are used:
Dynamic masking offers an altered view of the real data without modifying it. Some users may only read the masked data, others may access the authentic version.
Static masking is the definitive action of substituting the sensitive information with uncorrelated data. Once processed, the authentic data cannot be retrieved.
The data can be altered with several techniques:
Deletion simply removes data.
Static substitution consistently replaces the data with a generic value. For instance: replacing all values of a
text
column with the value “CONFIDENTIAL”.Variance “shifts” dates and numeric values. For example, by applying a +/- 10% variance to a salary column, the dataset will remain meaningful.
Generalization reduces the accuracy of the data by replacing it with a range of values. Instead of saying “Bob is 28 years old”, you can say “Bob is between 20 and 30 years old”. This is useful for analytics because the data remains true.
Shuffling mixes values within the same columns. This method is open to being reversed if the shuffling algorithm can be deciphered.
Randomization replaces sensitive data with random-but-plausible values. The goal is to avoid any identification from the data record while remaining suitable for testing, data analysis and data processing.
Partial scrambling is similar to static substitution but leaves out some part of the data. For instance: a credit card number can be replaced by “40XX XXXX XXXX XX96”.
Custom rules are designed to alter data following specific needs (e.g., simultaneously randomize a zipcode and a city name while keeping them coherent).
Pseudonymization is a way to protect personal information by hiding it using additional information. Encryption and hashing are two examples of pseudonymization techniques. However, pseudonymizated data is still linked to the original data.
G.1.2. Anonymization Example
Suppose you want to mask last names and phone numbers:
SELECT * FROM people; id | firstname | lastname | phone ----+-----------+----------+------------ T1 | Sarah | Connor | 0609110911
Activate the dynamic masking engine:
SELECT anon.start_dynamic_masking();
Declare a masked user:
CREATE ROLE skynet LOGIN; SECURITY LABEL FOR anon ON ROLE skynet IS 'MASKED';
Declare the masking rules:
SECURITY LABEL FOR anon ON COLUMN people.lastname IS 'MASKED WITH FUNCTION anon.fake_last_name()'; SECURITY LABEL FOR anon ON COLUMN people.phone IS 'MASKED WITH FUNCTION anon.partial(phone,2,$$******$$,2)';
Connect with the masked user:
\connect - skynet SELECT * FROM people; id | firstname | lastname | phone ----+-----------+-----------+------------ T1 | Sarah | Stranahan | 06******11
G.1.3. Installation and Setup
The pgpro_anonymizer
extension is provided with Postgres Pro Enterprise as a separate pre-built package pgpro-anonymizer-ent-15
(for the detailed installation instructions, see Chapter 17). To enable pgpro_anonymizer
, complete the following steps:
Add the library name to the
shared_preload_libraries
variable in thepostgresql.conf
file:shared_preload_libraries = 'anon'
Reload the database server for the changes to take effect.
Note
To verify that the library was installed correctly, you can run the following command:
SHOW shared_preload_libraries;
Create the extension using the following query:
CREATE EXTENSION anon CASCADE;
Initialize the extension:
SELECT anon.init();
The
init()
function imports a default dataset of random data (IBAN, names, cities, etc.). This dataset is in English and is very small (1000 values for each category).
G.1.4. Configuration
The extension has several options that can be defined for the entire instance (in postgresql.conf
or with ALTER SYSTEM
).
It is also possible and often a good idea to define them at the database level like this:
ALTER DATABASE customers SET anon.algorithm = sha512;
Only a superuser can change the parameters below:
anon.algorithm
The hashing method used by pseudonymizing functions. Checkout the pgcrypto documentation for the list of available options.
See
anon.salt
to learn why this parameter is a very sensitive information.Type:
text
Default:
sha256
Visible: only to superusers
anon.maskschema
The schema (i.e. namespace) where the dynamic masking views will be stored.
Type:
text
Default:
mask
Visible: to all users
anon.restrict_to_trusted_schemas
When this parameter is enabled (by default), masking rules must be defined using functions located in a limited list of namespaces. By default,
pg_catalog
andanon
are trusted.This improves security by preventing users from declaring their custom masking filters.
This also means that the schema must be explicit inside the masking rules. For instance, the rules below would fail because the schema of the
lower
function is not declared.SECURITY LABEL FOR anon ON COLUMN people.name IS 'MASKED WITH FUNCTION lower(people.name)';
The correct way to declare it would be:
SECURITY LABEL FOR anon ON COLUMN people.name IS 'MASKED WITH FUNCTION pg_catalog.lower(people.name)';
Type:
boolean
Default:
on
Visible: to all users
anon.salt
The salt used by pseudonymizing functions. It is very important to define a custom salt for each database like this:
ALTER DATABASE foo SET anon.salt = 'This_Is_A_Very_Secret_Salt';
If masked users can read the salt, they can run a brute force attack to retrieve the original data based on the following elements:
The pseudonymized data
The hashing algorithm (see
anon.algorithm
)The salt
The salt and the name of the hashing algorithm should be protected with the same level of security that the data itself. This is why the salt should be stored directly within the database with
ALTER DATABASE
.Type:
text
Default: (empty)
Visible: only to superusers
anon.sourceshema
The schema (i.e. namespace) where the tables are masked by the dynamic masking engine.
Change this value before starting dynamic masking.
ALTER DATABASE foo SET anon.sourceschema TO 'my_app';
Then reconnect so that the change takes effect and start the engine.
SELECT anon.start_dynamic_masking();
Type:
text
Default:
public
Visible: to all users
G.1.5. Declare Masking Rules
The main idea of this extension is to offer anonymization by design.
The data masking rules should be written by people who develop the application because they have the best knowledge of how the data model works. Therefore, masking rules must be implemented directly inside the database schema.
This allows to mask the data directly inside the Postgres Pro instance without using an external tool, and thus limiting the exposure and the risks of data leak.
The data masking rules are declared simply by using security labels:
CREATE TABLE player (id SERIAL, name TEXT, points INT); INSERT INTO player VALUES ( 1, 'Kareem Abdul-Jabbar', 38387), ( 5, 'Michael Jordan', 32292 ); SECURITY LABEL FOR anon ON COLUMN player.name IS 'MASKED WITH FUNCTION anon.fake_last_name()'; SECURITY LABEL FOR anon ON COLUMN player.id IS 'MASKED WITH VALUE NULL';
Important
The masking rules are not inherited. If you have split a table into multiple partitions, you need to declare the masking rules for each partition.
G.1.5.1. Escaping String Literals
As you may have noticed, the masking rule definitions are placed between single quotes. Therefore, if you need to use a string inside a masking rule, use dollar quoting:
SECURITY LABEL FOR anon ON COLUMN player.name IS 'MASKED WITH VALUE $$CONFIDENTIAL$$';
G.1.5.2. Using Expressions
You can use more advanced expressions with the MASKED WITH VALUE
syntax:
SECURITY LABEL FOR anon ON COLUMN player.name IS 'MASKED WITH VALUE CASE WHEN name IS NULL THEN $$John$$ ELSE anon.random_string(LENGTH(name)) END';
G.1.5.3. Removing a Masking Rule
You can simply erase a masking rule like this:
SECURITY LABEL FOR anon ON COLUMN player.name IS NULL;
To remove all rules at once, you can use:
SELECT anon.remove_masks_for_all_columns();
G.1.6. Masking Functions
The extension provides functions to implement the following main anonymization strategies:
Depending on your data, you may need to use different strategies on different columns:
For names and other “direct identifiers”, faking is often useful
Shuffling is convenient for foreign keys
Adding noise is interesting for numeric values and dates
Partial scrambling is perfect for email addresses and phone numbers
G.1.6.1. Deletion
First of all, the fastest and safest way to anonymize a data is to delete it.
In many cases, the best approach to hide the content of a column is to replace all the values with a single static value.
For instance, you can replace an entire column with the word CONFIDENTIAL
like this:
SECURITY LABEL FOR anon ON COLUMN users.address IS 'MASKED WITH VALUE $$CONFIDENTIAL$$';
G.1.6.2. Adding Noise
This is also called Variance. The idea is to “shift” dates and numeric values. For example, by applying a +/- 10% variance to a salary column, the dataset will remain meaningful.
-
anon.noise(
noise_value
anyelement
,ratio
double precision
) If
ratio
is 0.33, the return value will be the original value randomly shifted with a ratio of +/- 33%.-
anon.dnoise(
noise_value
anyelement
,noise_range
interval
) If
interval
= “2 days”, the return value will be the original value randomly shifted by +/- 2 days.
Important
The noise()
masking functions are vulnerable to a form of repeat attack, especially with Dynamic Masking. A masked user can guess an original value by requesting its masked value multiple times and then simply use the avg()
function to get a close approximation. In a nutshell, these functions are best fitted for Anonymous Dumps and Static Masking. They should be avoided when using Dynamic Masking.
G.1.6.3. Randomization
The extension provides a large choice of functions to generate purely random data:
G.1.6.3.1. Basic Random values
-
anon.random_date()
Returns a
date
.-
random_date_between(
d1
date
,d2
date
) Returns a
date
betweend1
andd2
.-
random_int_between(
i1
integer
,i2
integer
) Returns an
integer
betweeni1
andi2
.-
random_bigint_between(
b1
bigint
,b2
bigint
) Returns a
bigint
betweenb1
andb2
.-
anon.random_string(
n
integer
) Returns a
text
value containingn
letters.-
anon.random_zip()
Returns a 5-digit code.
-
anon.random_phone(
phone_prefix
text
) Returns an 8-digit phone with
phone_prefix
as a prefix.-
anon.random_in(ARRAY[1,2,3])
Returns an integer between
1
and3
.-
anon.random_in(ARRAY['red','green','blue'])
Returns a random text value from an array of ['red', 'green', 'blue'].
G.1.6.4. Faking
The idea of Faking is to replace sensitive data with random-but-plausible values. The goal is to avoid any identification from the data record while remaining suitable for testing, data analysis and data processing.
The following faking functions are available:
-
anon.fake_address()
Returns a complete postal address.
-
anon.fake_city()
Returns an existing city.
-
anon.fake_country()
Returns an existing country.
-
anon.fake_company()
Returns a generic company name.
-
anon.fake_email()
Returns a valid email address.
-
anon.fake_first_name()
Returns a generic first name.
-
anon.fake_iban()
Returns a valid IBAN.
-
anon.fake_last_name()
Returns a generic last name.
-
anon.fake_postcode()
Returns a valid zip code.
For text
and varchar
columns, you can use the classic Lorem Ipsum generator:
-
anon.lorem_ipsum()
Returns five paragraphs.
-
anon.lorem_ipsum(2)
Returns two paragraphs.
-
anon.lorem_ipsum( paragraphs := 4 )
Returns four paragraphs.
-
anon.lorem_ipsum( words := 20 )
Returns 20 words.
-
anon.lorem_ipsum( characters := LENGTH(table.column) )
Returns the same number of characters as the original string.
G.1.6.5. Advanced Faking
Generating fake data is a complex topic. The functions provided here are limited to basic use case. For more advanced faking methods, in particular if you are looking for localized fake data, take a look at PostgreSQL Faker, an extension based upon the well-known Faker Python library.
This extension provides an advanced faking engine with localization support.
For example:
CREATE SCHEMA faker; CREATE EXTENSION faker SCHEMA faker; SELECT faker.faker('de_DE'); SELECT faker.first_name_female(); first_name_female ------------------- Mirja
G.1.6.6. Pseudonymization
Pseudonymization is similar to Faking in the sense that it generates realistic values. The main difference is that the pseudonymization is deterministic: the functions will always return the same fake value based on a seed and an optional salt.
The following pseudonymization functions are available:
-
anon.pseudo_first_name(
seed
anyelement
,salt
text
) Returns a generic first name.
-
anon.pseudo_last_name(
seed
anyelement
,salt
text
) Returns a generic last name.
-
anon.pseudo_email(
seed
anyelement
,salt
text
) Returns a valid email address.
-
anon.pseudo_city(
seed
anyelement
,salt
text
) Returns an existing city.
-
anon.pseudo_country(
seed
anyelement
,salt
text
) Returns an existing country.
-
anon.pseudo_company(
seed
anyelement
,salt
text
) Returns a generic company name.
-
anon.pseudo_iban(
seed
anyelement
,salt
text
) Returns a valid IBAN.
The second argument (salt
) is optional. You can call each function only with the seed like this: anon.pseudo_city('bob')
. The salt is here to increase complexity and avoid dictionary and brute force attacks. If a salt is not given, a random secret salt is used instead (see Generic Hashing for more details).
The seed can be any information related to the subject. For instance, we can consistently generate the same fake email address for a given person by using her login as the seed:
SECURITY LABEL FOR anon ON COLUMN users.emailaddress IS 'MASKED WITH FUNCTION anon.pseudo_email(users.login)';
Note
You may want to produce unique values using a pseudonymization function. For instance, if you want to mask an email
column that is declared as UNIQUE
. In this case, you will need to initialize the extension with a fake dataset that is way bigger than the number of rows of the table. Otherwise, you may see some “collisions” happening, i.e. two different original values producing the same pseudo value.
Important
Pseudonymization is often confused with anonymization but in fact they serve two different purposes: pseudonymization is a way to protect the personal information but the pseudonymized data is still “linked” to the real data.
G.1.6.7. Generic Hashing
In theory, hashing is not a valid anonymization technique, however in practice it is sometimes necessary to generate a determinist hash of the original data.
For instance, when a pair of primary key / foreign key is a “natural key”, it may contain actual information (like a customer number containing a birth date or something similar).
Hashing such columns allows to keep referential integrity intact even for relatively unusual source data.
-
anon.hash(
value
text
) Returns a text hash of the value using a secret salt and hash algorithm.
-
anon.digest(
value
text
,salt
text
,algorithm
text
) Lets you choose a salt and a hash algorithm. Supported algorithms are:
md5
,sha1
,sha224
,sha256
,sha384
, andsha512
.
By default, a random secret salt is generated when the extension is initialized, and the default hash algorithm is sha512
. You can change these for the entire database with the following functions:
-
anon.set_secret_salt(
value
text
) Define your own salt.
-
anon.set_algorithm(
value
text
) Select another hash algorithm. For the list of supported algorithms, see
anon.digest
.
Keep in mind that hashing is a form of Pseudonymization. This means that the data can be “de-anonymized” using the hashed value and the masking function. If attackers get access to these two elements, they could re-identify some persons using brute force
or dictionary
attacks. Therefore, the salt and the algorithm used to hash the data must be protected with the same level of security as the original dataset.
In a nutshell, we recommend that you use the anon.hash()
function rather than anon.digest()
because the salt will not appear clearly in the masking rule.
Furthermore, in practice the hash function will return a long string of character like this:
SELECT anon.hash('bob'); hash ---------------------------------------------------------------------------------------------------------------------------------- 95b6accef02c5a725a8c9abf19ab5575f99ca3d9997984181e4b3f81d96cbca4d0977d694ac490350e01d0d213639909987ef52de8e44d6258d536c55e427397
For some columns, this may be too long and you may have to cut some parts of the hash in order to fit into the column. For instance, if you have a foreign key based on a phone number and the column is a varchar(12)
, you can transform the data like this:
SECURITY LABEL FOR anon ON COLUMN people.phone_number IS 'MASKED WITH FUNCTION pg_catalog.left(anon.hash(phone_number),12)'; SECURITY LABEL FOR anon ON COLUMN call_history.fk_phone_number IS 'MASKED WITH FUNCTION pg_catalog.left(anon.hash(fk_phone_number),12)';
Of course, cutting the hash value to 12 characters will increase the risk of “collision” (2 different values having the same fake hash). In such case, it's up to you to evaluate this risk.
G.1.6.8. Partial Scrambling
Partial scrambling leaves out some part of the data. For instance, a credit card number can be replaced by “40XX XXXX XXXX XX96”.
-
anon.partial(
input
text
,prefix
int
,padding
text
,suffix
int
) Partially replaces a given text. For example,
anon.partial('abcdefgh',1,'xxxx',3)
returnsaxxxxfgh
.-
anon.partial_email(
email
text
) Partially replaces a given email. For example,
anon.partial_email('daamien@gmail.com')
returnsda******@gm******.com
.
G.1.6.9. Generalization
Generalization is the principle of replacing the original value by a range containing this value. For instance, instead of saying “Paul is 42 years old”, you would say “Paul is between 40 and 50 years old”.
Note
The generalization functions perform data type transformation. Therefore, it is not possible to use them with the dynamic masking engine. However, they are useful to create anonymized views. See example below.
Let's imagine a table containing health information:
SELECT * FROM patient; id | name | zipcode | birth | disease ----+----------+----------+------------+--------------- 1 | Alice | 47678 | 1979-12-29 | Heart Disease 2 | Bob | 47678 | 1959-03-22 | Heart Disease 3 | Caroline | 47678 | 1988-07-22 | Heart Disease 4 | David | 47905 | 1997-03-04 | Flu 5 | Eleanor | 47909 | 1999-12-15 | Heart Disease 6 | Frank | 47906 | 1968-07-04 | Cancer 7 | Geri | 47605 | 1977-10-30 | Heart Disease 8 | Harry | 47673 | 1978-06-13 | Cancer 9 | Ingrid | 47607 | 1991-12-12 | Cancer
We can build a view upon this table to suppress some columns ( SSN
and name
) and generalize the zipcode and the birth date like this:
CREATE VIEW anonymized_patient AS SELECT 'REDACTED' AS lastname, anon.generalize_int4range(zipcode,100) AS zipcode, anon.generalize_tsrange(birth,'decade') AS birth, disease FROM patient;
The anonymized table now looks like that:
SELECT * FROM anonymized_patient; lastname | zipcode | birth | disease ----------+---------------+-----------------------------+--------------- REDACTED | [47600,47700) | ["1970-01-01","1980-01-01") | Heart Disease REDACTED | [47600,47700) | ["1950-01-01","1960-01-01") | Heart Disease REDACTED | [47600,47700) | ["1980-01-01","1990-01-01") | Heart Disease REDACTED | [47900,48000) | ["1990-01-01","2000-01-01") | Flu REDACTED | [47900,48000) | ["1990-01-01","2000-01-01") | Heart Disease REDACTED | [47900,48000) | ["1960-01-01","1970-01-01") | Cancer REDACTED | [47600,47700) | ["1970-01-01","1980-01-01") | Heart Disease REDACTED | [47600,47700) | ["1970-01-01","1980-01-01") | Cancer REDACTED | [47600,47700) | ["1990-01-01","2000-01-01") | Cancer
The generalized values are still useful for statistics because they remain true, but they are less accurate, and therefore reduce the risk of re-identification.
Postgres Pro offers several range data types, which are perfect for dates and numeric values.
For numeric values, the following functions are available:
-
generalize_int4range(
value
,step
)
generalize_int8range(
value
,step
)
generalize_numrange(
value
,step
) where
value
is the data that will be generalized, andstep
is the size of each range.
G.1.6.10. Write Your Own Masks
You can also use your own function as a mask. The function must either be destructive (like Partial Scrambling) or insert some randomness in the dataset (like Faking).
For instance, if you wrote a function foo()
inside the schema bar
, then you can apply it like this:
SECURITY LABEL FOR anon ON SCHEMA bar IS 'TRUSTED'; SECURITY LABEL FOR anon ON COLUMN player.score IS 'MASKED WITH FUNCTION bar.foo()';
Note
The bar
schema must be declared as TRUSTED
by a superuser.
G.1.6.10.1. Writing a Masking Function for a JSONB Column
For complex data types, you may have to write your own function. This will be a common use case if you have to hide certain parts of a JSON field.
For example:
CREATE TABLE company ( business_name TEXT, info JSONB )
The info
field contains unstructured data like this:
SELECT jsonb_pretty(info) FROM company WHERE business_name = 'Soylent Green'; jsonb_pretty ---------------------------------- { "employees": [ { "lastName": "Doe", "firstName": "John" }, { "lastName": "Smith", "firstName": "Anna" }, { "lastName": "Jones", "firstName": "Peter" } ] } (1 row)
Using the JSON functions and operators, you can walk through the keys and replace the sensitive values as needed.
SECURITY LABEL FOR anon ON SCHEMA custom_masks IS 'TRUSTED'; CREATE FUNCTION custom_masks.remove_last_name(j JSONB) RETURNS JSONB VOLATILE LANGUAGE SQL AS $func$ SELECT json_build_object( 'employees' , array_agg( jsonb_set(e ,'{lastName}', to_jsonb(anon.fake_last_name())) ) )::JSONB FROM jsonb_array_elements( j->'employees') e $func$;
Then check that the function is working correctly:
SELECT custom_masks.remove_last_name(info) FROM company;
When it is okay, you can declare this function as the mask of the info
field:
SECURITY LABEL FOR anon ON COLUMN company.info IS 'MASKED WITH FUNCTION custom_masks.remove_last_name(info)';
And try it out:
SELECT anonymize_table('company'); SELECT jsonb_pretty(info) FROM company WHERE business_name = 'Soylent Green'; jsonb_pretty ------------------------------------- { "employees": [ + { + "lastName": "Prawdzik",+ "firstName": "John" + }, + { + "lastName": "Baltazor",+ "firstName": "Anna" + }, + { + "lastName": "Taylan", + "firstName": "Peter" + } + ] + } (1 row)
This is just a quick example. As you can see, manipulating a sophisticated JSON structure with SQL is possible but it can be tricky at first. There are multiple ways of walking through the keys and updating values. You will probably have to try different approaches, depending on your real JSON data and the performance you want to reach.
G.1.7. Static Masking
Sometimes, it is useful to transform directly the original dataset. You can do that with different methods:
These methods will destroy the original data. Use with care.
G.1.7.1. Applying Masking Rules
You can permanently apply the masking rules of a database with anon.anonymize_database()
.
Let's use a basic example:
CREATE TABLE customer ( id SERIAL, full_name TEXT, birth DATE, employer TEXT, zipcode TEXT, fk_shop INTEGER ); INSERT INTO customer VALUES (911,'Chuck Norris','1940-03-10','Texas Rangers', '75001',12), (312,'David Hasselhoff','1952-07-17','Baywatch', '90001',423) ; SELECT * FROM customer; id | full_name | birth | employer | zipcode | fk_shop -----+------------------+------------+---------------+---------+--------- 911 | Chuck Norris | 1940-03-10 | Texas Rangers | 75001 | 12 112 | David Hasselhoff | 1952-07-17 | Baywatch | 90001 | 423
Declare the masking rules:
SECURITY LABEL FOR anon ON COLUMN customer.full_name IS 'MASKED WITH FUNCTION anon.fake_first_name() || '' '' || anon.fake_last_name()'; SECURITY LABEL FOR anon ON COLUMN customer.employer IS 'MASKED WITH FUNCTION anon.fake_company()'; SECURITY LABEL FOR anon ON COLUMN customer.zipcode IS 'MASKED WITH FUNCTION anon.random_zip()';
Replace authentic data in the masked columns:
SELECT anon.anonymize_database(); SELECT * FROM customer; id | full_name | birth | employer | zipcode | fk_shop -----+-------------+------------+---------------------+---------+--------- 911 | Jesse Kosel | 1940-03-10 | Marigold Properties | 62172 | 12 312 | Leolin Bose | 1952-07-17 | Inventure | 20026 | 423
You can also use anonymize_table()
and anonymize_column()
to remove data from a subset of the database:
SELECT anon.anonymize_table('customer'); SELECT anon.anonymize_column('customer','zipcode');
Important
Static masking is a slow process. The principle of static masking is to update all lines of all tables containing at least one masked column. This basically means that the server will rewrite all the data on disk. Depending on the database size, the hardware and the instance configuration, it may be faster to export the anonymized data (see Anonymous Dumps) and reload it into the database.
G.1.7.2. Shuffling
Shuffling mixes values within the same columns.
-
anon.shuffle_column(
shuffle_table
regclass
,shuffle_column
name
,primary_key
name
) Rearranges all values in a given column. You need to provide the primary key of the table.
This is useful for foreign keys because referential integrity will be kept.
Important
shuffle_column()
is not a masking function because it works “verticaly”: it will modify all the values of a column at once.
G.1.7.3. Adding Noise to a Column
There are also some functions that can add noise on an entire column:
-
anon.add_noise_on_numeric_column(
table
regclass
,column
text
,ratio
float
) If
ratio
= 0.33, all values of the column will be randomly shifted with a ratio of +/- 33%.-
anon.add_noise_on_datetime_column(
table
regclass
,column
text
,interval
interval
) If
interval
=2 days
, all values of the column will be randomly shifted by +/- 2 days.
Important
These noise functions are vulnerable to a form of repeat attack.
G.1.8. Dynamic Masking
You can hide some data from a role by declaring this role as a “MASKED” one. Other roles will still access the original data.
CREATE TABLE people (id TEXT, firstname TEXT, lastname TEXT, phone TEXT); INSERT INTO people VALUES ('T1','Sarah', 'Connor','0609110911'); SELECT * FROM people; SELECT * FROM people; id | firstname | lastname | phone ----+----------+----------+------------ T1 | Sarah | Connor | 0609110911 (1 row)
Activate the dynamic masking engine:
SELECT anon.start_dynamic_masking();
Declare a masked user:
CREATE ROLE skynet LOGIN; SECURITY LABEL FOR anon ON ROLE skynet IS 'MASKED';
Declare the masking rules:
SECURITY LABEL FOR anon ON COLUMN people.lastname IS 'MASKED WITH FUNCTION anon.fake_last_name()'; SECURITY LABEL FOR anon ON COLUMN people.phone IS 'MASKED WITH FUNCTION anon.partial(phone,2,$$******$$,2)';
Connect with the masked user:
\c - skynet SELECT * FROM people; id | firstname | lastname | phone ----+----------+-----------+------------ T1 | Sarah | Stranahan | 06******11 (1 row)
G.1.8.1. How to Change the Type of a Masked Column
When dynamic masking is activated, you are not allowed to change the datatype of a column if there's a mask upon it.
To modify a masked column, you need to temporarily switch off the masking engine like this:
BEGIN; SELECT anon.stop_dynamic_masking(); ALTER TABLE people ALTER COLUMN phone TYPE VARCHAR(255); SELECT anon.start_dynamic_masking(); COMMIT;
G.1.8.2. How to Drop a Masked Table
The dynamic masking engine will build masking views upon the masked tables. This means that it is not possible to drop a masked table directly. You will get an error like this:
DROP TABLE people; psql: ERROR: cannot drop table people because other objects depend on it DETAIL: view mask.company depends on table people
To effectively remove the table, it is necessary to add the CASCADE
option, so that the masking view will be dropped too:
DROP TABLE people CASCADE;
G.1.8.3. How to Unmask a Role
Simply remove the security label like this:
SECURITY LABEL FOR anon ON ROLE bob IS NULL;
To unmask all masked roles at once, you can type:
SELECT anon.remove_masks_for_all_roles();
G.1.8.4. Limitations
G.1.8.4.1. Listing the Tables
Due to how the dynamic masking engine works, when a masked role tries to display the tables in psql with the \dt
command, then psql will not show any tables.
This is because the search_path
of the masked role is rigged.
You can try adding explicit schema you want to search, for instance:
\dt *.* \dt public.*
G.1.8.4.2. Only One Schema
The dynamic masking system only works with one schema (by default public
). When you start the masking engine with start_dynamic_masking()
, you can specify the schema that will be masked as follows:
ALTER DATABASE foo SET anon.sourceschema TO 'sales';
Then open a new session to the database and type:
SELECT start_dynamic_masking();
However, static masking with anon.anonymize()
and anonymous export with anon.dump()
will work fine with multiple schemas.
G.1.8.4.3. Performance
Dynamic Masking is known to be very slow with some queries, especially if you try to join two tables on a masked key using hashing or pseudonymization.
G.1.8.4.4. Graphic Tools
When you are using a masked role with a graphic interface, such as DBeaver or pgAdmin, the data
panel may produce the following error when trying to display the content of a masked table called foo
:
SQL Error [42501]: ERROR: permission denied for table foo
This is because most of these tools will directly query the public.foo
table instead of being “redirected” by the masking engine toward the mask.foo
view.
In order the view the masked data with a graphic tool, you can either:
Open the SQL query panel and type SELECT * FROM foo
Navigate to Database > Schemas > mask > Views > foo
G.1.9. Anonymous Dumps
Due to the core design of this extension, you cannot use pg_dump with a masked user. If you want to export the entire database with the anonymized data, you must use the pg_dump_anon.sh
script.
G.1.9.1. pg_dump_anon.sh
The pg_dump_anon.sh
script supports most of the options of the regular pg_dump
command. The environment variables (PGHOST
, PGUSER
, etc.) and the .pgpass
files are also supported.
G.1.9.2. Example
A user named bob
can export an anonymous dump of the app
database like this:
/opt/pgpro/ent-15/bin/pg_dump_anon.sh -h localhost -U bob --password --file=anonymous_dump.sql app
Important
The name of the database must be the last parameter.
For more details about the supported options, simply type ./pg_dump_anon.sh --help
.
G.1.9.3. Limitations
The user password is asked automatically. This means you must either add the
--password
option to define it interactively or declare it in thePGPASSWORD
variable or put it inside the.pgpass
file ( however on Windows,thePGPASSFILE
variable must be specified explicitly)The
plain
format is the only supported format. The other formats (custom
,dir
andtar
) are not supported.
G.1.10. Generalization
G.1.10.1. Reducing the Accuracy of Sensitive Data
The idea of generalization is to replace data with a broader, less accurate value. For instance, instead of saying “Bob is 28 years old”, you can say “Bob is between 20 and 30 years old”. This is interesting for analytics because the data remains true while avoiding the risk of re-identification.
Generalization is a way to achieve k-anonymity.
Postgres Pro can handle generalization very easily with the range data types, a very powerful way to store and manipulate a set of values contained between a lower and an upper bound.
G.1.10.2. Example
Here's a basic table containing medical data:
SELECT * FROM patient; ssn | firstname | zipcode | birth | disease -------------+-----------+---------+------------+--------------- 253-51-6170 | Alice | 47012 | 1989-12-29 | Heart Disease 091-20-0543 | Bob | 42678 | 1979-03-22 | Allergy 565-94-1926 | Caroline | 42678 | 1971-07-22 | Heart Disease 510-56-7882 | Eleanor | 47909 | 1989-12-15 | Acne 098-24-5548 | David | 47905 | 1997-03-04 | Flu 118-49-5228 | Jean | 47511 | 1993-09-14 | Flu 263-50-7396 | Tim | 47900 | 1981-02-25 | Heart Disease 109-99-6362 | Bernard | 47168 | 1992-01-03 | Asthma 287-17-2794 | Sophie | 42020 | 1972-07-14 | Asthma 409-28-2014 | Arnold | 47000 | 1999-11-20 | Diabetes (10 rows)
We want the anonymized data to remain true because it will be used for statistics. We can build a view upon this table to remove useless columns and generalize the indirect identifiers:
CREATE MATERIALIZED VIEW generalized_patient AS SELECT 'REDACTED'::TEXT AS firstname, anon.generalize_int4range(zipcode,1000) AS zipcode, anon.generalize_daterange(birth,'decade') AS birth, disease FROM patient;
This will give us a less accurate view of the data:
SELECT * FROM generalized_patient; firstname | zipcode | birth | disease -----------+---------------+-------------------------+--------------- REDACTED | [47000,48000) | [1980-01-01,1990-01-01) | Heart Disease REDACTED | [42000,43000) | [1970-01-01,1980-01-01) | Allergy REDACTED | [42000,43000) | [1970-01-01,1980-01-01) | Heart Disease REDACTED | [47000,48000) | [1980-01-01,1990-01-01) | Acne REDACTED | [47000,48000) | [1990-01-01,2000-01-01) | Flu REDACTED | [47000,48000) | [1990-01-01,2000-01-01) | Flu REDACTED | [47000,48000) | [1980-01-01,1990-01-01) | Heart Disease REDACTED | [47000,48000) | [1990-01-01,2000-01-01) | Asthma REDACTED | [42000,43000) | [1970-01-01,1980-01-01) | Asthma REDACTED | [47000,48000) | [1990-01-01,2000-01-01) | Diabetes (10 rows)
G.1.10.3. Generalization Functions
pgpro_anonymizer
provides generalization functions, one for each range type. Generally, these functions take the original value as the first parameter, and a second parameter for the length of each step.
For numeric values:
-
anon.generalize_int4range(
value
integer
,step
integer
) For example,
anon.generalize_int4range(42,5)
returns the range[40,45)
.-
anon.generalize_int8range(
value
integer
,step
integer
) For example,
anon.generalize_int8range(12345,1000)
returns the range[12000,13000)
.-
anon.generalize_numrange(
value
integer
,step
integer
) For example,
anon.generalize_numrange(42.32378,10)
returns the range[40,50)
.
For time values:
-
anon.generalize_tsrange(
value
integer
,step
integer
) For example,
anon.generalize_tsrange('1904-11-07','year')
returns['1904-01-01','1905-01-01')
.-
anon.generalize_tstzrange(
value
integer
,step
integer
) For example,
anon.generalize_tstzrange('1904-11-07','week')
returns['1904-11-07','1904-11-14')
.-
anon.generalize_daterange(
value
integer
,step
integer
) For example,
anon.generalize_daterange('1904-11-07','decade')
returns[1900-01-01,1910-01-01)
.
The possible steps are: microsecond, millisecond, second, minute, hour, day, week, month, year, decade, century and millennium.
G.1.10.4. Limitations
G.1.10.4.1. Singling Out and Extreme Values
Singling Out is the possibility to isolate an individual in a dataset by using extreme value or exceptional values.
For example:
SELECT * FROM employees; id | name | job | salary ------+----------------+------+-------- 1578 | xkjefus3sfzd | NULL | 1498 2552 | cksnd2se5dfa | NULL | 2257 5301 | fnefckndc2xn | NULL | 45489 7114 | npodn5ltyp3d | NULL | 1821
In this table, we can see that a particular employee has a very high salary, very far from the average salary. Therefore this person is probably the CEO of the company.
With generalization, this is important because the size of the range (the “step”) must be wide enough to prevent the identification of one single individual.
k-anonymity is a way to assess this risk.
G.1.10.4.2. Generalization is Not Compatible with Dynamic Masking
By definition, with generalization the data remains true, but the column type is changed.
This means that the transformation is not transparent, and therefore it cannot be used with dynamic masking.
G.1.10.5. k-anonymity
k-anonymity is an industry-standard term used to describe a property of an anonymized dataset. The k-anonymity principle states that within a given dataset, any anonymized individual cannot be distinguished from at least k-1
other individuals. In other words, k-anonymity might be described as a “hiding in the crowd” guarantee. A low value of k
indicates there's a risk of re-identification using linkage with other data sources.
-
anon.k_anonymity(
id
regclass
) You can evaluate the k-anonymity factor of a table as follows:
Define the columns that are indirect identifiers (also known as quasi identifiers) like this:
SECURITY LABEL FOR anon ON COLUMN patient.firstname IS 'INDIRECT IDENTIFIER'; SECURITY LABEL FOR anon ON COLUMN patient.zipcode IS 'INDIRECT IDENTIFIER'; SECURITY LABEL FOR anon ON COLUMN patient.birth IS 'INDIRECT IDENTIFIER';
Once the indirect identifiers are declared:
SELECT anon.k_anonymity('generalized_patient')
The higher the value, the better.
G.1.11. Performance
Any anonymization process has a price as it will consume CPU time, RAM space and probably a lot of disk I/O. Here's a quick overview of the question depending on the strategy you are using.
In a nutshell, the anonymization performance mainly depends on the following important factors:
The size of the database
The number of masking rules
G.1.11.1. Static Masking
Basically what static masking does it rewrite entirely the masked tables on disk. This may be slow depending on your environment. And during this process, the tables will be locked.
Note
In this case, the cost of anonymization is “paid” by all the users but it is paid once and for all.
G.1.11.2. Dynamic Masking
With dynamic masking, the real data is replaced on-the-fly every time a masked user sends a query to the database. This means that the masking users will have slower response time than regular (unmasked) users. This is generally ok because usually masked users are not considered as important as the regular ones.
If you apply three or four rules to a table, the response time for the masked users should be approximately 20% to 30% slower than for the normal users.
As the masking rules are applied for each queries of the masked users, the dynamic masking is appropriate when you have a limited number of masked users that connect only from time to time to the database. For instance, a data analyst connecting once a week to generate a business report.
If there are multiple masked users or if a masked user is very active, you should probably export the masked data once-a-week on a secondary instance and let these users connect to this secondary instance.
Note
In this case, the cost of anonymization is “paid” only by the masked users.
G.1.11.3. Anonymous Dumps
If the backup process of your database takes one hour with pg_dump
, then anonymizing and exporting the entire database with pg_dump_anon.sh
will probably take two hours.
Note
In this case, the cost of anonymization is “paid” by the user asking for the anonymous export. Other users of the database will not be affected.
G.1.11.4. How to Speed Things Up
G.1.11.4.1. Prefer MASKED WITH VALUE
Whenever Possible
It is always faster to replace the original data with a static value instead of calling a masking function.
G.1.11.4.2. Materialized Views
Dynamic masking is not always required. In some cases, it is more efficient to build Materialized Views instead.
For instance:
CREATE MATERIALIZED VIEW masked_customer AS SELECT id, anon.random_last_name() AS name, anon.random_date_between('1920-01-01'::DATE,now()) AS birth, fk_last_order, store_id FROM customer;
G.1.12. Security
G.1.12.1. Permissions
Here's an overview of what users can do depending on the privilege they have:
Table G.1. Privileges
Action | Superuser | Owner | Masked Role |
---|---|---|---|
Create the extension | Yes | ||
Drop the extension | Yes | ||
Init the extension | Yes | ||
Reset the extension | Yes | ||
Configure the extension | Yes | ||
Put a mask upon a role | Yes | ||
Start dynamic masking | Yes | ||
Stop dynamic masking | Yes | ||
Create a table | Yes | Yes | |
Declare a masking rule | Yes | Yes | |
Insert, delete, update a row | Yes | Yes | |
Use static masking | Yes | Yes | |
Select the real data | Yes | Yes | |
Use regular dumps | Yes | Yes | |
Use anonymous dumps | Yes | Yes | |
Use the masking functions | Yes | Yes | Yes |
Select the masked data | Yes | Yes | Yes |
View the masking rules | Yes | Yes | Yes |
G.1.12.2. Limit Masking Filters Only to Trusted Schemas
Database owners are allowed to declare masking rules. They can also create functions containing arbitrary code and use these function inside masking rules. In certain circumstances, the database owner can “trick” a superuser into querying a masked table and thus executing the arbitrary code.
To prevent this, superusers can configure the parameter below:
anon.restrict_to_trusted_schemas = on
With this setting, the database owner can only write masking rules with functions that are located in the trusted schemas which are controlled by superusers.
G.1.12.3. Security Context of the Functions
Most of the functions of this extension are declared with the SECURITY INVOKER
tag. This means that these functions are executed with the privileges of the user that calls them. This is an important restriction.
This extension contains another few functions declared with the tag SECURITY DEFINER
.