G.1. pgpro_anonymizer — mask or replace sensitive data #

pgpro_anonymizer is an extension to mask or replace personally identifiable information (PII) or commercially sensitive data from a Postgres Pro database.

The project has a declarative approach of anonymization. This means you can declare the masking rules using the Data Definition Language (DDL) and specify your anonymization strategy inside the table definition itself.

Once the masking rules are defined, you can access the anonymized data in one of the following ways:

In addition, various functions are available to implement the following masking techniques: randomization, faking, partial scrambling, shuffling, noise. You can also create a custom function.

Beyond masking, you can also use another approach called generalization, which is useful for statistics and data analytics.

G.1.1. Terms and Definitions #

The following main strategies are used:

  • Dynamic masking offers an altered view of the real data without modifying it. Some users may only read the masked data, others may access the authentic version.

  • Static masking is the definitive action of substituting the sensitive information with uncorrelated data. Once processed, the authentic data cannot be retrieved.

The data can be altered with several techniques:

  • Deletion simply removes data.

  • Static substitution consistently replaces the data with a generic value. For instance: replacing all values of a text column with the value CONFIDENTIAL.

  • Variance shifts dates and numeric values. For example, by applying a +/- 10% variance to a salary column, the dataset will remain meaningful.

  • Generalization reduces the accuracy of the data by replacing it with a range of values. Instead of saying Bob is 28 years old, you can say Bob is between 20 and 30 years old. This is useful for analytics because the data remains true.

  • Shuffling mixes values within the same columns. This method is open to being reversed if the shuffling algorithm can be deciphered.

  • Randomization replaces sensitive data with random-but-plausible values. The goal is to avoid any identification from the data record while remaining suitable for testing, data analysis and data processing.

  • Partial scrambling is similar to static substitution but leaves out some part of the data. For instance: a credit card number can be replaced by 40XX XXXX XXXX XX96.

  • Custom rules are designed to alter data following specific needs (e.g., simultaneously randomize a zipcode and a city name while keeping them coherent).

  • Pseudonymization is a way to protect personal information by hiding it using additional information. Encryption and hashing are two examples of pseudonymization techniques. However, pseudonymizated data is still linked to the original data.

G.1.2. Anonymization Example #

Suppose you want to mask last names and phone numbers:

SELECT * FROM people;
 id | firstname | lastname |   phone
----+-----------+----------+------------
 T1 | Sarah     | Connor    | 0609110911
  1. Activate the dynamic masking engine:

    SELECT anon.start_dynamic_masking();
    
  2. Declare a masked user:

    CREATE ROLE skynet LOGIN;
    SECURITY LABEL FOR anon ON ROLE skynet IS 'MASKED';
    
  3. Declare the masking rules:

    SECURITY LABEL FOR anon ON COLUMN people.lastname
    IS 'MASKED WITH FUNCTION anon.fake_last_name()';
    
    SECURITY LABEL FOR anon ON COLUMN people.phone
    IS 'MASKED WITH FUNCTION anon.partial(phone,2,$$******$$,2)';
    
  4. Connect with the masked user:

    \connect - skynet
    SELECT * FROM people;
     id | firstname | lastname  |   phone
    ----+-----------+-----------+------------
     T1 | Sarah     | Stranahan | 06******11
    

G.1.3. Installation and Setup #

The pgpro_anonymizer extension is provided with Postgres Pro Enterprise as a separate pre-built package pgpro-anonymizer-ent-16 (for the detailed installation instructions, see Chapter 17). To enable pgpro_anonymizer, complete the following steps:

  1. Add the library name to the shared_preload_libraries variable in the postgresql.conf file:

    shared_preload_libraries = 'anon'
    
  2. Reload the database server for the changes to take effect.

    Note

    To verify that the library was installed correctly, you can run the following command:

    SHOW shared_preload_libraries;
    
  3. Create the extension using the following query:

    CREATE EXTENSION anon CASCADE;
    
  4. Initialize the extension:

    SELECT anon.init();
    

    The init() function imports a default dataset of random data (IBAN, names, cities, etc.). This dataset is in English and is very small (1000 values for each category).

G.1.4. Configuration #

The extension has several options that can be defined for the entire instance (in postgresql.conf or with ALTER SYSTEM).

It is also possible and often a good idea to define them at the database level like this:

ALTER DATABASE customers SET anon.algorithm = sha512;

Only a superuser can change the parameters below:

anon.algorithm #

The hashing method used by pseudonymizing functions. Checkout the pgcrypto documentation for the list of available options.

See anon.salt to learn why this parameter is a very sensitive information.

Type: text

Default: sha256

Visible: only to superusers

anon.maskschema #

The schema (i.e. namespace) where the dynamic masking views will be stored.

Type: text

Default: mask

Visible: to all users

anon.restrict_to_trusted_schemas #

When this parameter is enabled (by default), masking rules must be defined using functions located in a limited list of namespaces. By default, pg_catalog and anon are trusted.

This improves security by preventing users from declaring their custom masking filters.

This also means that the schema must be explicit inside the masking rules. For instance, the rules below would fail because the schema of the lower function is not declared.

SECURITY LABEL FOR anon ON COLUMN people.name
IS 'MASKED WITH FUNCTION lower(people.name)';

The correct way to declare it would be:

SECURITY LABEL FOR anon ON COLUMN people.name
IS 'MASKED WITH FUNCTION pg_catalog.lower(people.name)';

Type: boolean

Default: on

Visible: to all users

anon.salt #

The salt used by pseudonymizing functions. It is very important to define a custom salt for each database like this:

ALTER DATABASE foo SET anon.salt = 'This_Is_A_Very_Secret_Salt';

If masked users can read the salt, they can run a brute force attack to retrieve the original data based on the following elements:

  • The pseudonymized data

  • The hashing algorithm (see anon.algorithm)

  • The salt

The salt and the name of the hashing algorithm should be protected with the same level of security that the data itself. This is why the salt should be stored directly within the database with ALTER DATABASE.

Type: text

Default: (empty)

Visible: only to superusers

anon.sourceshema #

The schema (i.e. namespace) where the tables are masked by the dynamic masking engine.

Change this value before starting dynamic masking.

ALTER DATABASE foo SET anon.sourceschema TO 'my_app';

Then reconnect so that the change takes effect and start the engine.

SELECT start_dynamic_masking();

Type: text

Default: public

Visible: to all users

G.1.5. Declare Masking Rules #

The main idea of this extension is to offer anonymization by design.

The data masking rules should be written by people who develop the application because they have the best knowledge of how the data model works. Therefore, masking rules must be implemented directly inside the database schema.

This allows to mask the data directly inside the Postgres Pro instance without using an external tool, and thus limiting the exposure and the risks of data leak.

The data masking rules are declared simply by using security labels:

CREATE TABLE player (id SERIAL, name TEXT, points INT);

INSERT INTO player VALUES
  ( 1, 'Kareem Abdul-Jabbar', 38387),
  ( 5, 'Michael Jordan', 32292 );

SECURITY LABEL FOR anon ON COLUMN player.name
  IS 'MASKED WITH FUNCTION anon.fake_last_name()';

SECURITY LABEL FOR anon ON COLUMN player.id
  IS 'MASKED WITH VALUE NULL';

Important

The masking rules are not inherited. If you have split a table into multiple partitions, you need to declare the masking rules for each partition.

G.1.5.1. Escaping String Literals #

As you may have noticed, the masking rule definitions are placed between single quotes. Therefore, if you need to use a string inside a masking rule, use dollar quoting:

SECURITY LABEL FOR anon ON COLUMN player.name
  IS 'MASKED WITH VALUE $$CONFIDENTIAL$$';

G.1.5.2. Using Expressions #

You can use more advanced expressions with the MASKED WITH VALUE syntax:

SECURITY LABEL FOR anon ON COLUMN player.name
  IS 'MASKED WITH VALUE CASE WHEN name IS NULL
                             THEN $$John$$
                             ELSE anon.random_string(LENGTH(name))
                             END';

G.1.5.3. Removing a Masking Rule #

You can simply erase a masking rule like this:

SECURITY LABEL FOR anon ON COLUMN player.name IS NULL;

To remove all rules at once, you can use:

SELECT anon.remove_masks_for_all_columns();

G.1.6. Masking Functions #

The extension provides functions to implement the following main anonymization strategies:

Depending on your data, you may need to use different strategies on different columns:

  • For names and other direct identifiers, faking is often useful

  • Shuffling is convenient for foreign keys

  • Adding noise is interesting for numeric values and dates

  • Partial scrambling is perfect for email addresses and phone numbers

G.1.6.1. Deletion #

First of all, the fastest and safest way to anonymize a data is to delete it.

In many cases, the best approach to hide the content of a column is to replace all the values with a single static value.

For instance, you can replace an entire column with the word CONFIDENTIAL like this:

SECURITY LABEL FOR anon
  ON COLUMN users.address
  IS 'MASKED WITH VALUE $$CONFIDENTIAL$$';

G.1.6.2. Adding Noise #

This is also called Variance. The idea is to shift dates and numeric values. For example, by applying a +/- 10% variance to a salary column, the dataset will remain meaningful.

anon.noise(noise_value anyelement, ratio double precision) #

If ratio is 0.33, the return value will be the original value randomly shifted with a ratio of +/- 33%.

anon.dnoise(noise_value anyelement, noise_range interval) #

If interval = 2 days, the return value will be the original value randomly shifted by +/- 2 days.

Important

The noise() masking functions are vulnerable to a form of repeat attack, especially with Dynamic Masking. A masked user can guess an original value by requesting its masked value multiple times and then simply use the avg() function to get a close approximation. In a nutshell, these functions are best fitted for Anonymous Dumps and Static Masking. They should be avoided when using Dynamic Masking.

G.1.6.3. Randomization #

The extension provides a large choice of functions to generate purely random data:

G.1.6.3.1. Basic Random values #
anon.random_date() #

Returns a date.

random_date_between(d1 date, d2 date) #

Returns a date between d1 and d2.

random_int_between(i1 integer, i2 integer) #

Returns an integer between i1 and i2.

random_bigint_between(b1 bigint, b2 bigint) #

Returns a bigint between b1 and b2.

anon.random_string(n integer) #

Returns a text value containing n letters.

anon.random_zip() #

Returns a 5-digit code.

anon.random_phone(phone_prefix text) #

Returns an 8-digit phone with phone_prefix as a prefix.

anon.random_in(ARRAY[1,2,3]) #

Returns an integer between 1 and 3.

anon.random_in(ARRAY['red','green','blue']) #

Returns a random text value from an array of ['red', 'green', 'blue'].

G.1.6.4. Faking #

The idea of Faking is to replace sensitive data with random-but-plausible values. The goal is to avoid any identification from the data record while remaining suitable for testing, data analysis and data processing.

The following faking functions are available:

anon.fake_address() #

Returns a complete postal address.

anon.fake_city() #

Returns an existing city.

anon.fake_country() #

Returns an existing country.

anon.fake_company() #

Returns a generic company name.

anon.fake_email() #

Returns a valid email address.

anon.fake_first_name() #

Returns a generic first name.

anon.fake_iban() #

Returns a valid IBAN.

anon.fake_last_name() #

Returns a generic last name.

anon.fake_postcode() #

Returns a valid zip code.

For text and varchar columns, you can use the classic Lorem Ipsum generator:

anon.lorem_ipsum() #

Returns five paragraphs.

anon.lorem_ipsum(2) #

Returns two paragraphs.

anon.lorem_ipsum( paragraphs := 4 ) #

Returns four paragraphs.

anon.lorem_ipsum( words := 20 ) #

Returns 20 words.

anon.lorem_ipsum( characters := LENGTH(table.column) ) #

Returns the same number of characters as the original string.

G.1.6.5. Advanced Faking #

Generating fake data is a complex topic. The functions provided here are limited to basic use case. For more advanced faking methods, in particular if you are looking for localized fake data, take a look at PostgreSQL Faker, an extension based upon the well-known Faker Python library.

This extension provides an advanced faking engine with localization support.

For example:

CREATE SCHEMA faker;
CREATE EXTENSION faker SCHEMA faker;
SELECT faker.faker('de_DE');
SELECT faker.first_name_female();
 first_name_female
-------------------
 Mirja

G.1.6.6. Pseudonymization #

Pseudonymization is similar to Faking in the sense that it generates realistic values. The main difference is that the pseudonymization is deterministic: the functions will always return the same fake value based on a seed and an optional salt.

The following pseudonymization functions are available:

anon.pseudo_first_name(seed anyelement, salt text) #

Returns a generic first name.

anon.pseudo_last_name(seed anyelement, salt text) #

Returns a generic last name.

anon.pseudo_email(seed anyelement, salt text) #

Returns a valid email address.

anon.pseudo_city(seed anyelement, salt text) #

Returns an existing city.

anon.pseudo_country(seed anyelement, salt text) #

Returns an existing country.

anon.pseudo_company(seed anyelement, salt text) #

Returns a generic company name.

anon.pseudo_iban(seed anyelement, salt text) #

Returns a valid IBAN.

The second argument (salt) is optional. You can call each function only with the seed like this: anon.pseudo_city('bob'). The salt is here to increase complexity and avoid dictionary and brute force attacks. If a salt is not given, a random secret salt is used instead (see Generic Hashing for more details).

The seed can be any information related to the subject. For instance, we can consistently generate the same fake email address for a given person by using her login as the seed:

SECURITY LABEL FOR anon
  ON COLUMN users.emailaddress
  IS 'MASKED WITH FUNCTION anon.pseudo_email(users.login)';

Note

You may want to produce unique values using a pseudonymization function. For instance, if you want to mask an email column that is declared as UNIQUE. In this case, you will need to initialize the extension with a fake dataset that is way bigger than the number of rows of the table. Otherwise, you may see some collisions happening, i.e. two different original values producing the same pseudo value.

Important

Pseudonymization is often confused with anonymization but in fact they serve two different purposes: pseudonymization is a way to protect the personal information but the pseudonymized data is still linked to the real data.

G.1.6.7. Generic Hashing #

In theory, hashing is not a valid anonymization technique, however in practice it is sometimes necessary to generate a determinist hash of the original data.

For instance, when a pair of primary key / foreign key is a natural key, it may contain actual information (like a customer number containing a birth date or something similar).

Hashing such columns allows to keep referential integrity intact even for relatively unusual source data.

anon.hash(value text) #

Returns a text hash of the value using a secret salt and hash algorithm.

anon.digest(value text, salt text, algorithm text) #

Lets you choose a salt and a hash algorithm. Supported algorithms are: md5, sha1, sha224, sha256, sha384, and sha512.

By default, a random secret salt is generated when the extension is initialized, and the default hash algorithm is sha512. You can change these for the entire database with the following functions:

anon.set_secret_salt(value text) #

Define your own salt.

anon.set_algorithm(value text) #

Select another hash algorithm. For the list of supported algorithms, see anon.digest.

Keep in mind that hashing is a form of Pseudonymization. This means that the data can be de-anonymized using the hashed value and the masking function. If attackers get access to these two elements, they could re-identify some persons using brute force or dictionary attacks. Therefore, the salt and the algorithm used to hash the data must be protected with the same level of security as the original dataset.

In a nutshell, we recommend that you use the anon.hash() function rather than anon.digest() because the salt will not appear clearly in the masking rule.

Furthermore, in practice the hash function will return a long string of character like this:

SELECT anon.hash('bob');
                                  hash
----------------------------------------------------------------------------------------------------------------------------------
95b6accef02c5a725a8c9abf19ab5575f99ca3d9997984181e4b3f81d96cbca4d0977d694ac490350e01d0d213639909987ef52de8e44d6258d536c55e427397

For some columns, this may be too long and you may have to cut some parts of the hash in order to fit into the column. For instance, if you have a foreign key based on a phone number and the column is a varchar(12), you can transform the data like this:

SECURITY LABEL FOR anon ON COLUMN people.phone_number
IS 'MASKED WITH FUNCTION pg_catalog.left(anon.hash(phone_number),12)';

SECURITY LABEL FOR anon ON COLUMN call_history.fk_phone_number
IS 'MASKED WITH FUNCTION pg_catalog.left(anon.hash(fk_phone_number),12)';

Of course, cutting the hash value to 12 characters will increase the risk of collision (2 different values having the same fake hash). In such case, it's up to you to evaluate this risk.

G.1.6.8. Partial Scrambling #

Partial scrambling leaves out some part of the data. For instance, a credit card number can be replaced by 40XX XXXX XXXX XX96.

anon.partial(input text, prefix int, padding text, suffix int) #

Partially replaces a given text. For example, anon.partial('abcdefgh',1,'xxxx',3) returns axxxxfgh.

anon.partial_email(email text) #

Partially replaces a given email. For example, anon.partial_email('daamien@gmail.com') returns da******@gm******.com.

G.1.6.9. Generalization #

Generalization is the principle of replacing the original value by a range containing this value. For instance, instead of saying Paul is 42 years old, you would say Paul is between 40 and 50 years old.

Note

The generalization functions perform data type transformation. Therefore, it is not possible to use them with the dynamic masking engine. However, they are useful to create anonymized views. See example below.

Let's imagine a table containing health information:

SELECT * FROM patient;
 id |   name   |  zipcode |   birth    |    disease
----+----------+----------+------------+---------------
  1 | Alice    |    47678 | 1979-12-29 | Heart Disease
  2 | Bob      |    47678 | 1959-03-22 | Heart Disease
  3 | Caroline |    47678 | 1988-07-22 | Heart Disease
  4 | David    |    47905 | 1997-03-04 | Flu
  5 | Eleanor  |    47909 | 1999-12-15 | Heart Disease
  6 | Frank    |    47906 | 1968-07-04 | Cancer
  7 | Geri     |    47605 | 1977-10-30 | Heart Disease
  8 | Harry    |    47673 | 1978-06-13 | Cancer
  9 | Ingrid   |    47607 | 1991-12-12 | Cancer

We can build a view upon this table to suppress some columns ( SSN and name ) and generalize the zipcode and the birth date like this:

CREATE VIEW anonymized_patient AS
SELECT
    'REDACTED' AS lastname,
    anon.generalize_int4range(zipcode,100) AS zipcode,
    anon.generalize_tsrange(birth,'decade') AS birth,
    disease
FROM patient;

The anonymized table now looks like that:

SELECT * FROM anonymized_patient;
 lastname |   zipcode     |           birth             |    disease
----------+---------------+-----------------------------+---------------
 REDACTED | [47600,47700) | ["1970-01-01","1980-01-01") | Heart Disease
 REDACTED | [47600,47700) | ["1950-01-01","1960-01-01") | Heart Disease
 REDACTED | [47600,47700) | ["1980-01-01","1990-01-01") | Heart Disease
 REDACTED | [47900,48000) | ["1990-01-01","2000-01-01") | Flu
 REDACTED | [47900,48000) | ["1990-01-01","2000-01-01") | Heart Disease
 REDACTED | [47900,48000) | ["1960-01-01","1970-01-01") | Cancer
 REDACTED | [47600,47700) | ["1970-01-01","1980-01-01") | Heart Disease
 REDACTED | [47600,47700) | ["1970-01-01","1980-01-01") | Cancer
 REDACTED | [47600,47700) | ["1990-01-01","2000-01-01") | Cancer

The generalized values are still useful for statistics because they remain true, but they are less accurate, and therefore reduce the risk of re-identification.

Postgres Pro offers several range data types, which are perfect for dates and numeric values.

For numeric values, the following functions are available:

generalize_int4range(value, step)
generalize_int8range(value, step)
generalize_numrange(value, step) #

where value is the data that will be generalized, and step is the size of each range.

G.1.6.10. Write Your Own Masks #

You can also use your own function as a mask. The function must either be destructive (like Partial Scrambling) or insert some randomness in the dataset (like Faking).

For instance, if you wrote a function foo() inside the schema bar, then you can apply it like this:

SECURITY LABEL FOR anon ON SCHEMA bar IS 'TRUSTED';

SECURITY LABEL FOR anon ON COLUMN player.score
IS 'MASKED WITH FUNCTION bar.foo()';

Note

The bar schema must be declared as TRUSTED by a superuser.

G.1.6.10.1. Writing a Masking Function for a JSONB Column #

For complex data types, you may have to write your own function. This will be a common use case if you have to hide certain parts of a JSON field.

For example:

CREATE TABLE company (
  business_name TEXT,
  info JSONB
)

The info field contains unstructured data like this:

SELECT jsonb_pretty(info) FROM company WHERE business_name = 'Soylent Green';
           jsonb_pretty
----------------------------------
 {
     "employees": [
         {
             "lastName": "Doe",
             "firstName": "John"
         },
         {
             "lastName": "Smith",
             "firstName": "Anna"
         },
         {
             "lastName": "Jones",
             "firstName": "Peter"
         }
     ]
 }
(1 row)

Using the JSON functions and operators, you can walk through the keys and replace the sensitive values as needed.

SECURITY LABEL FOR anon ON SCHEMA custom_masks IS 'TRUSTED';

CREATE FUNCTION custom_masks.remove_last_name(j JSONB)
RETURNS JSONB
VOLATILE
LANGUAGE SQL
AS $func$
SELECT
  json_build_object(
    'employees' ,
    array_agg(
      jsonb_set(e ,'{lastName}', to_jsonb(anon.fake_last_name()))
    )
  )::JSONB
FROM jsonb_array_elements( j->'employees') e
$func$;

Then check that the function is working correctly:

SELECT custom_masks.remove_last_name(info) FROM company;

When it is okay, you can declare this function as the mask of the info field:

SECURITY LABEL FOR anon ON COLUMN company.info
IS 'MASKED WITH FUNCTION custom_masks.remove_last_name(info)';

And try it out:

SELECT anonymize_table('company');
SELECT jsonb_pretty(info) FROM company WHERE business_name = 'Soylent Green';
            jsonb_pretty
-------------------------------------
 {
     "employees": [                 +
         {                          +
             "lastName": "Prawdzik",+
             "firstName": "John"    +
         },                         +
         {                          +
             "lastName": "Baltazor",+
             "firstName": "Anna"    +
         },                         +
         {                          +
             "lastName": "Taylan",  +
             "firstName": "Peter"   +
         }                          +
     ]                              +
 }
(1 row)

This is just a quick example. As you can see, manipulating a sophisticated JSON structure with SQL is possible but it can be tricky at first. There are multiple ways of walking through the keys and updating values. You will probably have to try different approaches, depending on your real JSON data and the performance you want to reach.

G.1.7. Static Masking #

Sometimes, it is useful to transform directly the original dataset. You can do that with different methods:

These methods will destroy the original data. Use with care.

G.1.7.1. Applying Masking Rules #

You can permanently apply the masking rules of a database with anon.anonymize_database().

Let's use a basic example:

CREATE TABLE customer (
  id SERIAL,
  full_name TEXT,
  birth DATE,
  employer TEXT,
  zipcode TEXT,
  fk_shop INTEGER
);

INSERT INTO customer
VALUES
(911,'Chuck Norris','1940-03-10','Texas Rangers', '75001',12),
(312,'David Hasselhoff','1952-07-17','Baywatch', '90001',423)
;

SELECT * FROM customer;

 id  |   full_name      |   birth    |    employer   | zipcode | fk_shop
-----+------------------+------------+---------------+---------+---------
 911 | Chuck Norris     | 1940-03-10 | Texas Rangers | 75001   | 12
 112 | David Hasselhoff | 1952-07-17 | Baywatch      | 90001   | 423
  1. Declare the masking rules:

    SECURITY LABEL FOR anon ON COLUMN customer.full_name
    IS 'MASKED WITH FUNCTION anon.fake_first_name() || '' '' || anon.fake_last_name()';
    
    SECURITY LABEL FOR anon ON COLUMN customer.employer
    IS 'MASKED WITH FUNCTION anon.fake_company()';
    
    SECURITY LABEL FOR anon ON COLUMN customer.zipcode
    IS 'MASKED WITH FUNCTION anon.random_zip()';
    
  2. Replace authentic data in the masked columns:

    SELECT anon.anonymize_database();
    
    SELECT * FROM customer;
    
     id  |  full_name  |   birth    |      employer       | zipcode | fk_shop
    -----+-------------+------------+---------------------+---------+---------
     911 | Jesse Kosel | 1940-03-10 | Marigold Properties | 62172   |      12
     312 | Leolin Bose | 1952-07-17 | Inventure           | 20026   |     423
    

You can also use anonymize_table() and anonymize_column() to remove data from a subset of the database:

SELECT anon.anonymize_table('customer');
SELECT anon.anonymize_column('customer','zipcode');

Important

Static masking is a slow process. The principle of static masking is to update all lines of all tables containing at least one masked column. This basically means that the server will rewrite all the data on disk. Depending on the database size, the hardware and the instance configuration, it may be faster to export the anonymized data (see Anonymous Dumps) and reload it into the database.

G.1.7.2. Shuffling #

Shuffling mixes values within the same columns.

anon.shuffle_column(shuffle_table regclass, shuffle_column name, primary_key name) #

Rearranges all values in a given column. You need to provide the primary key of the table.

This is useful for foreign keys because referential integrity will be kept.

Important

shuffle_column() is not a masking function because it works verticaly: it will modify all the values of a column at once.

G.1.7.3. Adding Noise to a Column #

There are also some functions that can add noise on an entire column:

anon.add_noise_on_numeric_column(table regclass, column text, ratio float) #

If ratio = 0.33, all values of the column will be randomly shifted with a ratio of +/- 33%.

anon.add_noise_on_datetime_column(table regclass, column text, interval interval) #

If interval = 2 days, all values of the column will be randomly shifted by +/- 2 days.

Important

These noise functions are vulnerable to a form of repeat attack.

G.1.8. Dynamic Masking #

You can hide some data from a role by declaring this role as a MASKED one. Other roles will still access the original data.

CREATE TABLE people (id TEXT, firstname TEXT, lastname TEXT, phone TEXT);
INSERT INTO people VALUES ('T1','Sarah', 'Connor','0609110911');
SELECT * FROM people;

SELECT * FROM people;
 id | firstname | lastname |   phone
----+----------+----------+------------
 T1 | Sarah    | Connor    | 0609110911
(1 row)
  1. Activate the dynamic masking engine:

    SELECT anon.start_dynamic_masking();
    
  2. Declare a masked user:

    CREATE ROLE skynet LOGIN;
    SECURITY LABEL FOR anon ON ROLE skynet
    IS 'MASKED';
    
  3. Declare the masking rules:

    SECURITY LABEL FOR anon ON COLUMN people.lastname
    IS 'MASKED WITH FUNCTION anon.fake_last_name()';
    
    SECURITY LABEL FOR anon ON COLUMN people.phone
    IS 'MASKED WITH FUNCTION anon.partial(phone,2,$$******$$,2)';
    
  4. Connect with the masked user:

    \c - skynet
    SELECT * FROM people;
     id | firstname | lastname  |   phone
    ----+----------+-----------+------------
     T1 | Sarah    | Stranahan | 06******11
    (1 row)
    

G.1.8.1. How to Change the Type of a Masked Column #

When dynamic masking is activated, you are not allowed to change the datatype of a column if there's a mask upon it.

To modify a masked column, you need to temporarily switch off the masking engine like this:

BEGIN;
SELECT anon.stop_dynamic_masking();
ALTER TABLE people ALTER COLUMN phone TYPE VARCHAR(255);
SELECT anon.start_dynamic_masking();
COMMIT;

G.1.8.2. How to Drop a Masked Table #

The dynamic masking engine will build masking views upon the masked tables. This means that it is not possible to drop a masked table directly. You will get an error like this:

DROP TABLE people;
psql: ERROR:  cannot drop table people because other objects depend on it
DETAIL:  view mask.company depends on table people

To effectively remove the table, it is necessary to add the CASCADE option, so that the masking view will be dropped too:

DROP TABLE people CASCADE;

G.1.8.3. How to Unmask a Role #

Simply remove the security label like this:

SECURITY LABEL FOR anon ON ROLE bob IS NULL;

To unmask all masked roles at once, you can type:

SELECT anon.remove_masks_for_all_roles();

G.1.8.4. Limitations #

G.1.8.4.1. Listing the Tables #

Due to how the dynamic masking engine works, when a masked role tries to display the tables in psql with the \dt command, then psql will not show any tables.

This is because the search_path of the masked role is rigged.

You can try adding explicit schema you want to search, for instance:

\dt *.*
\dt public.*
G.1.8.4.2. Only One Schema #

The dynamic masking system only works with one schema (by default public). When you start the masking engine with start_dynamic_masking(), you can specify the schema that will be masked as follows:

ALTER DATABASE foo SET anon.sourceschema TO 'sales';

Then open a new session to the database and type:

SELECT start_dynamic_masking();

However, static masking with anon.anonymize() and anonymous export with anon.dump() will work fine with multiple schemas.

G.1.8.4.3. Performance #

Dynamic Masking is known to be very slow with some queries, especially if you try to join two tables on a masked key using hashing or pseudonymization.

G.1.8.4.4. Graphic Tools #

When you are using a masked role with a graphic interface, such as DBeaver or pgAdmin, the data panel may produce the following error when trying to display the content of a masked table called foo:

SQL Error [42501]: ERROR: permission denied for table foo

This is because most of these tools will directly query the public.foo table instead of being redirected by the masking engine toward the mask.foo view.

In order the view the masked data with a graphic tool, you can either:

Open the SQL query panel and type SELECT * FROM foo

Navigate to Database > Schemas > mask > Views > foo

G.1.9. Anonymous Dumps #

Due to the core design of this extension, you cannot use pg_dump with a masked user. If you want to export the entire database with the anonymized data, you must use the pg_dump_anon.sh script.

G.1.9.1. pg_dump_anon.sh #

The pg_dump_anon.sh script supports most of the options of the regular pg_dump command. The environment variables (PGHOST, PGUSER, etc.) and the .pgpass files are also supported.

G.1.9.2. Example #

A user named bob can export an anonymous dump of the app database like this:

/opt/pgpro/ent-16/bin/pg_dump_anon.sh -h localhost -U bob --password --file=anonymous_dump.sql app

Important

The name of the database must be the last parameter.

For more details about the supported options, simply type ./pg_dump_anon.sh --help.

G.1.9.3. Limitations #

  • The user password is asked automatically. This means you must either add the --password option to define it interactively or declare it in the PGPASSWORD variable or put it inside the .pgpass file ( however on Windows,the PGPASSFILE variable must be specified explicitly)

  • The plain format is the only supported format. The other formats (custom, dir and tar) are not supported.

G.1.10. Generalization #

G.1.10.1. Reducing the Accuracy of Sensitive Data #

The idea of generalization is to replace data with a broader, less accurate value. For instance, instead of saying Bob is 28 years old, you can say Bob is between 20 and 30 years old. This is interesting for analytics because the data remains true while avoiding the risk of re-identification.

Generalization is a way to achieve k-anonymity.

Postgres Pro can handle generalization very easily with the range data types, a very powerful way to store and manipulate a set of values contained between a lower and an upper bound.

G.1.10.2. Example #

Here's a basic table containing medical data:

SELECT * FROM patient;
     ssn     | firstname | zipcode |   birth    |    disease
-------------+-----------+---------+------------+---------------
 253-51-6170 | Alice     |   47012 | 1989-12-29 | Heart Disease
 091-20-0543 | Bob       |   42678 | 1979-03-22 | Allergy
 565-94-1926 | Caroline  |   42678 | 1971-07-22 | Heart Disease
 510-56-7882 | Eleanor   |   47909 | 1989-12-15 | Acne
 098-24-5548 | David     |   47905 | 1997-03-04 | Flu
 118-49-5228 | Jean      |   47511 | 1993-09-14 | Flu
 263-50-7396 | Tim       |   47900 | 1981-02-25 | Heart Disease
 109-99-6362 | Bernard   |   47168 | 1992-01-03 | Asthma
 287-17-2794 | Sophie    |   42020 | 1972-07-14 | Asthma
 409-28-2014 | Arnold    |   47000 | 1999-11-20 | Diabetes
(10 rows)

We want the anonymized data to remain true because it will be used for statistics. We can build a view upon this table to remove useless columns and generalize the indirect identifiers:

CREATE MATERIALIZED VIEW generalized_patient AS
SELECT
  'REDACTED'::TEXT AS firstname,
  anon.generalize_int4range(zipcode,1000) AS zipcode,
  anon.generalize_daterange(birth,'decade') AS birth,
  disease
FROM patient;

This will give us a less accurate view of the data:

SELECT * FROM generalized_patient;
 firstname |    zipcode    |          birth          |    disease
-----------+---------------+-------------------------+---------------
 REDACTED  | [47000,48000) | [1980-01-01,1990-01-01) | Heart Disease
 REDACTED  | [42000,43000) | [1970-01-01,1980-01-01) | Allergy
 REDACTED  | [42000,43000) | [1970-01-01,1980-01-01) | Heart Disease
 REDACTED  | [47000,48000) | [1980-01-01,1990-01-01) | Acne
 REDACTED  | [47000,48000) | [1990-01-01,2000-01-01) | Flu
 REDACTED  | [47000,48000) | [1990-01-01,2000-01-01) | Flu
 REDACTED  | [47000,48000) | [1980-01-01,1990-01-01) | Heart Disease
 REDACTED  | [47000,48000) | [1990-01-01,2000-01-01) | Asthma
 REDACTED  | [42000,43000) | [1970-01-01,1980-01-01) | Asthma
 REDACTED  | [47000,48000) | [1990-01-01,2000-01-01) | Diabetes
(10 rows)

G.1.10.3. Generalization Functions #

pgpro_anonymizer provides generalization functions, one for each range type. Generally, these functions take the original value as the first parameter, and a second parameter for the length of each step.

For numeric values:

anon.generalize_int4range(value integer, step integer) #

For example, anon.generalize_int4range(42,5) returns the range [40,45).

anon.generalize_int8range(value integer, step integer) #

For example, anon.generalize_int8range(12345,1000) returns the range [12000,13000).

anon.generalize_numrange(value integer, step integer) #

For example, anon.generalize_numrange(42.32378,10) returns the range [40,50).

For time values:

anon.generalize_tsrange(value integer, step integer) #

For example, anon.generalize_tsrange('1904-11-07','year') returns ['1904-01-01','1905-01-01').

anon.generalize_tstzrange(value integer, step integer) #

For example, anon.generalize_tstzrange('1904-11-07','week') returns ['1904-11-07','1904-11-14').

anon.generalize_daterange(value integer, step integer) #

For example, anon.generalize_daterange('1904-11-07','decade') returns [1900-01-01,1910-01-01).

The possible steps are: microsecond, millisecond, second, minute, hour, day, week, month, year, decade, century and millennium.

G.1.10.4. Limitations #

G.1.10.4.1. Singling Out and Extreme Values #

Singling Out is the possibility to isolate an individual in a dataset by using extreme value or exceptional values.

For example:

SELECT * FROM employees;

  id  |  name          | job  | salary
------+----------------+------+--------
 1578 | xkjefus3sfzd   | NULL |    1498
 2552 | cksnd2se5dfa   | NULL |    2257
 5301 | fnefckndc2xn   | NULL |   45489
 7114 | npodn5ltyp3d   | NULL |    1821

In this table, we can see that a particular employee has a very high salary, very far from the average salary. Therefore this person is probably the CEO of the company.

With generalization, this is important because the size of the range (the step) must be wide enough to prevent the identification of one single individual.

k-anonymity is a way to assess this risk.

G.1.10.4.2. Generalization is Not Compatible with Dynamic Masking #

By definition, with generalization the data remains true, but the column type is changed.

This means that the transformation is not transparent, and therefore it cannot be used with dynamic masking.

G.1.10.5. k-anonymity #

k-anonymity is an industry-standard term used to describe a property of an anonymized dataset. The k-anonymity principle states that within a given dataset, any anonymized individual cannot be distinguished from at least k-1 other individuals. In other words, k-anonymity might be described as a hiding in the crowd guarantee. A low value of k indicates there's a risk of re-identification using linkage with other data sources.

anon.k_anonymity(id regclass) #

You can evaluate the k-anonymity factor of a table as follows:

  1. Define the columns that are indirect identifiers (also known as quasi identifiers) like this:

    SECURITY LABEL FOR anon ON COLUMN patient.firstname
    IS 'INDIRECT IDENTIFIER';
    
    SECURITY LABEL FOR anon ON COLUMN patient.zipcode
    IS 'INDIRECT IDENTIFIER';
    
    SECURITY LABEL FOR anon ON COLUMN patient.birth
    IS 'INDIRECT IDENTIFIER';
    

  2. Once the indirect identifiers are declared:

    SELECT anon.k_anonymity('generalized_patient')
    

    The higher the value, the better.

G.1.11. Performance #

Any anonymization process has a price as it will consume CPU time, RAM space and probably a lot of disk I/O. Here's a quick overview of the question depending on the strategy you are using.

In a nutshell, the anonymization performance mainly depends on the following important factors:

  • The size of the database

  • The number of masking rules

G.1.11.1. Static Masking #

Basically what static masking does it rewrite entirely the masked tables on disk. This may be slow depending on your environment. And during this process, the tables will be locked.

Note

In this case, the cost of anonymization is paid by all the users but it is paid once and for all.

G.1.11.2. Dynamic Masking #

With dynamic masking, the real data is replaced on-the-fly every time a masked user sends a query to the database. This means that the masking users will have slower response time than regular (unmasked) users. This is generally ok because usually masked users are not considered as important as the regular ones.

If you apply three or four rules to a table, the response time for the masked users should be approximately 20% to 30% slower than for the normal users.

As the masking rules are applied for each queries of the masked users, the dynamic masking is appropriate when you have a limited number of masked users that connect only from time to time to the database. For instance, a data analyst connecting once a week to generate a business report.

If there are multiple masked users or if a masked user is very active, you should probably export the masked data once-a-week on a secondary instance and let these users connect to this secondary instance.

Note

In this case, the cost of anonymization is paid only by the masked users.

G.1.11.3. Anonymous Dumps #

If the backup process of your database takes one hour with pg_dump, then anonymizing and exporting the entire database with pg_dump_anon.sh will probably take two hours.

Note

In this case, the cost of anonymization is paid by the user asking for the anonymous export. Other users of the database will not be affected.

G.1.11.4. How to Speed Things Up #

G.1.11.4.1. Prefer MASKED WITH VALUE Whenever Possible #

It is always faster to replace the original data with a static value instead of calling a masking function.

G.1.11.4.2. Materialized Views #

Dynamic masking is not always required. In some cases, it is more efficient to build Materialized Views instead.

For instance:

CREATE MATERIALIZED VIEW masked_customer AS
SELECT
    id,
    anon.random_last_name() AS name,
    anon.random_date_between('1920-01-01'::DATE,now()) AS birth,
    fk_last_order,
    store_id
FROM customer;

G.1.12. Security #

G.1.12.1. Permissions #

Here's an overview of what users can do depending on the privilege they have:

Table G.1. Privileges

Action Superuser Owner Masked Role
Create the extension Yes
Drop the extension Yes
Init the extension Yes
Reset the extension Yes
Configure the extension Yes
Put a mask upon a role Yes
Start dynamic masking Yes
Stop dynamic masking Yes
Create a table Yes Yes
Declare a masking rule Yes Yes
Insert, delete, update a row Yes Yes
Use static masking Yes Yes
Select the real data Yes Yes
Use regular dumps Yes Yes
Use anonymous dumps Yes Yes
Use the masking functions Yes Yes Yes
Select the masked data Yes Yes Yes
View the masking rules Yes Yes Yes

G.1.12.2. Limit Masking Filters Only to Trusted Schemas #

Database owners are allowed to declare masking rules. They can also create functions containing arbitrary code and use these function inside masking rules. In certain circumstances, the database owner can trick a superuser into querying a masked table and thus executing the arbitrary code.

To prevent this, superusers can configure the parameter below:

anon.restrict_to_trusted_schemas = on

With this setting, the database owner can only write masking rules with functions that are located in the trusted schemas which are controlled by superusers.

G.1.12.3. Security Context of the Functions #

Most of the functions of this extension are declared with the SECURITY INVOKER tag. This means that these functions are executed with the privileges of the user that calls them. This is an important restriction.

This extension contains another few functions declared with the tag SECURITY DEFINER.