Possible solution for masking chosen columns when using pg_dump - Mailing list pgsql-hackers

From Олег Целебровский
Subject Possible solution for masking chosen columns when using pg_dump
Date
Msg-id 1664811017.111625872@f404.i.mail.ru
Whole thread Raw
Responses Re: Possible solution for masking chosen columns when using pg_dump  (Julien Rouhaud <rjuju123@gmail.com>)
List pgsql-hackers
Hello, here's my take on masking data when using pg_dump
 
The main idea is using PostgreSQL functions to replace data during a SELECT.
When table data is dumped SELECT a,b,c,d ... from ... query is generated, the columns that are marked for masking are replaced with result of functions on those columns
Example: columns name, count are to be masked, so the query will look as such: SELECT id, mask_text(name), mask_int(count), date from ...
 
So about the interface: I added 2 more command-line options: 
 
--mask-columns, which specifies what columns from what tables will be masked 
    usage example:
            --mask-columns "t1.name, t2.description" - both columns will be masked with the same corresponding function
            or --mask-columns name - ALL columns with name "name" from all dumped tables will be masked with correspoding function
 
--mask-function, which specifies what functions will mask data
    usage example:
            --mask-function mask_int - corresponding columns will be masked with function named "mask_int" from default schema (public)
            or --mask-function my_schema.mask_varchar - same as above but with specified schema where the function is stored
            or --mask-function somedir/filename - the function is "defined" here - more on the structure below
 
Structure of the file with function description:
 
First row - function name (with or without schema name)
Second row - type of in and out value (the design is to only work with same input/output type so no int-to-text shenanigans)
Third row - language of function
Forth and later rows - body of a function
 
Example of such file:
 
mask_text
text
plpgsql
res := '***';
 
First iteration of using file-described functions used just plain SQL query, but since it executed during read-write connection, some things such as writing "DROP TABLE t1;" after the CREATE FUNCTION ...; were possible.
Now even if something harmful is written in function body, it will be executed during dump-read-only connection, where it will just throw an error
 
About "corresponding columns and functions" - masking functions and columns are paired with eachother based on the input order, but --masking-columns and --masking-functions don't have to be subsequent.
Example: pg_dump -t table_name --mask-columns name --mask-colums count --mask-function mask_text --mask-function mask_int - here 'name' will be paired with function 'mask_text' and 'count' with 'mask_int' 
 
Patch includes regression tests
 
I'm open to discussion of this patch
 
Best regards,
 
Oleg Tselebrovskiy
Attachment

pgsql-hackers by date:

Previous
From: vignesh C
Date:
Subject: Miscellaneous tab completion issue fixes
Next
From: Julien Rouhaud
Date:
Subject: Re: Possible solution for masking chosen columns when using pg_dump