Hello, here's my take on masking data when using pg_dump
The main idea is using PostgreSQL functions to replace data during a SELECT.
When table data is dumped SELECT a,b,c,d ... from ... query is generated, the columns that are marked for masking are replaced with result of functions on those columns
Example: columns name, count are to be masked, so the query will look as such: SELECT id, mask_text(name), mask_int(count), date from ...
So about the interface: I added 2 more command-line options:
--mask-columns, which specifies what columns from what tables will be masked
usage example:
--mask-columns "t1.name, t2.description" - both columns will be masked with the same corresponding function
or --mask-columns name - ALL columns with name "name" from all dumped tables will be masked with correspoding function
--mask-function, which specifies what functions will mask data
usage example:
--mask-function mask_int - corresponding columns will be masked with function named "mask_int" from default schema (public)
or --mask-function my_schema.mask_varchar - same as above but with specified schema where the function is stored
or --mask-function somedir/filename - the function is "defined" here - more on the structure below
Structure of the file with function description:
First row - function name (with or without schema name)
Second row - type of in and out value (the design is to only work with same input/output type so no int-to-text shenanigans)
Third row - language of function
Forth and later rows - body of a function
mask_text
text
plpgsql
res := '***';
First iteration of using file-described functions used just plain SQL query, but since it executed during read-write connection, some things such as writing "DROP TABLE t1;" after the CREATE FUNCTION ...; were possible.
Now even if something harmful is written in function body, it will be executed during dump-read-only connection, where it will just throw an error
About "corresponding columns and functions" - masking functions and columns are paired with eachother based on the input order, but --masking-columns and --masking-functions don't have to be subsequent.
Example: pg_dump -t table_name --mask-columns name --mask-colums count --mask-function mask_text --mask-function mask_int - here 'name' will be paired with function 'mask_text' and 'count' with 'mask_int'
Patch includes regression tests
I'm open to discussion of this patch