PostgreSQL Anonymizer 0.5: Generalization and k-anonymity - Mailing list pgsql-announce

From Damien Clochard
Subject PostgreSQL Anonymizer 0.5: Generalization and k-anonymity
Date
Msg-id 156ba24736cd6c28ff616801129a4de4@dalibo.info
Whole thread Raw
List pgsql-announce
_Eymoutiers, France, November 6, 2019_

`Postgresql Anonymizer` is an extension that hides or replaces 
personally
identifiable information (PII) or commercially sensitive data from a 
PostgreSQL
database.

The extension supports 3 different anonymization strategies: Dynamic 
Masking,
In-Place Anonymization and Anonymous Dumps. It also offers a large 
choice of
Masking Functions: Substitution, Randomization, Faking, Partial 
Scrambling,
Shuffling, Noise Addition and Generalization.


Generalization
--------------------------------------------------------------------------------

The idea of generalization is to replace data with a broader, less 
accurate
value. For instance, instead of saying "Bob is 28 years old", you can 
say
"Bob is between 20 and 30 years old". This is interesting for analytics 
because
the data remains true while avoiding the risk of re-identification.

PostgreSQL can handle generalization very easily with the RANGE data 
types,
a very powerful way to store and manipulate a set of values contained 
between
a lower and an upper bound.


Here's a basic table containing medical data:


     SELECT * FROM patient;
          ssn     | firstname | zipcode |   birth    |    disease
     -------------+-----------+---------+------------+---------------
      253-51-6170 | Alice     |   47012 | 1989-12-29 | Heart Disease
      091-20-0543 | Bob       |   42678 | 1979-03-22 | Allergy
      565-94-1926 | Caroline  |   42678 | 1971-07-22 | Heart Disease
      510-56-7882 | Eleanor   |   47909 | 1989-12-15 | Acne


We want the anonymized data to remain **true** because it will be
used for statistics. We can build a view upon this table to remove
useless columns and generalize the indirect identifiers (zipcode and
birthday):


     CREATE MATERIALIZED VIEW generalized_patient AS
     SELECT
       'REDACTED'::TEXT AS firstname,
       anon.generalize_int4range(zipcode,1000) AS zipcode,
       anon.generalize_daterange(birth,'decade') AS birth,
       disease
     FROM patient;

This will give us a less accurate view of the data:


     SELECT * FROM generalized_patient;
      firstname |    zipcode    |          birth          |    disease
     
-----------+---------------+-------------------------+---------------
      REDACTED  | [47000,48000) | [1980-01-01,1990-01-01) | Heart Disease
      REDACTED  | [42000,43000) | [1970-01-01,1980-01-01) | Allergy
      REDACTED  | [42000,43000) | [1970-01-01,1980-01-01) | Heart Disease
      REDACTED  | [47000,48000) | [1980-01-01,1990-01-01) | Acne


k-anonymity
--------------------------------------------------------------------------------

k-anonymity is an industry-standard term used to describe a property of 
an
anonymized dataset. The k-anonymity principle states that within a
given dataset, any anonymized individual cannot be distinguished from at
least `k-1` other individuals. In other words, k-anonymity might be 
described
as a "hiding in the crowd" guarantee. A low value of `k` indicates 
there's a risk
of re-identification using linkage with other data sources.

You can evaluate the k-anonymity factor of a table in 2 steps :

1/ First defined the columns that are indirect identifiers ( also known 
as "quasi identifers") like this:

     SECURITY LABEL FOR anon ON COLUMN generalized_patient.zipcode
     IS 'INDIRECT IDENTIFIER';

     SECURITY LABEL FOR anon ON COLUMN generalized_patient.birth
     IS 'INDIRECT IDENTIFIER';


2/ Once the indirect identifiers are declared :


     SELECT anon.k_anonymity('generalized_patient')


In the example above, the k-anonymity factor of the 
`generalized_patient`
materialized view is `2`.

Lorem Ipsum
--------------------------------------------------------------------------------

For TEXT and VARCHAR columns, you can now use the classic Lorem Ipsum
generator:

* `anon.lorem_ipsum()` returns 5 paragraphs
* `anon.lorem_ipsum(2)` returns 2 paragraphs
* `anon.lorem_ipsum( paragraphs := 4 )` returns 4 paragraphs
* `anon.lorem_ipsum( words := 20 )` returns 20 words
* `anon.lorem_ipsum( characters := 7 )` returns 7 characters



How to Install
--------------------------------------------------------------------------------

This extension is officially supported on PostgreSQL 9.6 and later.

On Red Hat / CentOS systems, you can install it from the
official PostgreSQL RPM repository:


     $ yum install postgresql_anonymizer12

(replace 12 by the major version of your PostgreSQL instance)

Then add 'anon' in the `shared_preload_libraries` parameter of your
`postgresql.conf` file. And restart your instance.

For other system, check out the install documentation :

https://postgresql-anonymizer.readthedocs.io/en/latest/INSTALL/

> **WARNING:** The project is at an early stage of development and should 
> be
> used carefully.


Thanks
--------------------------------------------------------------------------------

This release includes code and ideas from Travis Miller, Jan Birk and 
Olleg
Samoylov. Many thanks to them !


How to contribute
--------------------------------------------------------------------------------

PostgreSQL Anonymizer is part of the Dalibo Labs initiative. It is 
mainly
developed by Damien Clochard.

This is an open project, contributions are welcome. We need your 
feedback and
ideas ! Let us know what you think of this tool, how it fits your needs 
and
what features are missing.

If you want to help, you can find a list of `Junior Jobs` here:

https://gitlab.com/dalibo/postgresql_anonymizer/issues?label_name%5B%5D=Junior+Jobs






pgsql-announce by date:

Previous
From: "Hans-Jürgen Schönig (PostgreSQL)"
Date:
Subject: Announcing Cybertec Migrator 1.0
Next
From: Tomas Vondra
Date:
Subject: Prague PostgreSQL Developer Day 2020 / dates, call for papers &sponsors, etc.