Thread: Possible solution for masking chosen columns when using pg_dump

Possible solution for masking chosen columns when using pg_dump

From

Олег Целебровский

Date:

03 October 2022, 15:30:17

Hello, here's my take on masking data when using pg_dump

The main idea is using PostgreSQL functions to replace data during a SELECT.

When table data is dumped SELECT a,b,c,d ... from ... query is generated, the columns that are marked for masking are replaced with result of functions on those columns

Example: columns name, count are to be masked, so the query will look as such: SELECT id, mask_text(name), mask_int(count), date from ...

So about the interface: I added 2 more command-line options:

--mask-columns, which specifies what columns from what tables will be masked

usage example:
--mask-columns "t1.name, t2.description" - both columns will be masked with the same corresponding function

or --mask-columns name - ALL columns with name "name" from all dumped tables will be masked with correspoding function

--mask-function, which specifies what functions will mask data

usage example:
--mask-function mask_int - corresponding columns will be masked with function named "mask_int" from default schema (public)

or --mask-function my_schema.mask_varchar - same as above but with specified schema where the function is stored

or --mask-function somedir/filename - the function is "defined" here - more on the structure below

Structure of the file with function description:

First row - function name (with or without schema name)

Second row - type of in and out value (the design is to only work with same input/output type so no int-to-text shenanigans)

Third row - language of function

Forth and later rows - body of a function

Example of such file:

mask_text

text

plpgsql

res := '***';

First iteration of using file-described functions used just plain SQL query, but since it executed during read-write connection, some things such as writing "DROP TABLE t1;" after the CREATE FUNCTION ...; were possible.

Now even if something harmful is written in function body, it will be executed during dump-read-only connection, where it will just throw an error

About "corresponding columns and functions" - masking functions and columns are paired with eachother based on the input order, but --masking-columns and --masking-functions don't have to be subsequent.

Example: pg_dump -t table_name --mask-columns name --mask-colums count --mask-function mask_text --mask-function mask_int - here 'name' will be paired with function 'mask_text' and 'count' with 'mask_int'

Patch includes regression tests

I'm open to discussion of this patch

Best regards,

Oleg Tselebrovskiy

Attachment

masking_for_pg_dump_v1.patch

Re: Possible solution for masking chosen columns when using pg_dump

From

Julien Rouhaud

Date:

03 October 2022, 15:44:53

Hi,

On Mon, Oct 03, 2022 at 06:30:17PM +0300, Олег Целебровский wrote:
>
> Hello, here's my take on masking data when using pg_dump
>  
> The main idea is using PostgreSQL functions to replace data during a SELECT.
> When table data is dumped SELECT a,b,c,d ... from ... query is generated, the columns that are marked for masking are
replacedwith result of functions on those columns
 
> Example: columns name, count are to be masked, so the query will look as such: SELECT id, mask_text(name),
mask_int(count),date from ...
 
>  
> So about the interface: I added 2 more command-line options: 
>  
> --mask-columns, which specifies what columns from what tables will be masked 
>     usage example:
>             --mask-columns "t1.name, t2.description" - both columns will be masked with the same corresponding
function
>             or --mask-columns name - ALL columns with name "name" from all dumped tables will be masked with
correspodingfunction
 
>  
> --mask-function, which specifies what functions will mask data
>     usage example:
>             --mask-function mask_int - corresponding columns will be masked with function named "mask_int" from
defaultschema (public)
 
>             or --mask-function my_schema.mask_varchar - same as above but with specified schema where the function is
stored
>             or --mask-function somedir/filename - the function is "defined" here - more on the structure below

FTR I wrote an extension POC [1] last weekend that does that but on the backend
side.  The main advantage is that it's working with any existing versions of
pg_dump (or any client relying on COPY or even plain interactive SQL
statements), and that the DBA can force a dedicated role to only get a masked
dump, even if they forgot to ask for it.

I only had a quick look at your patch but it seems that you left some todo in
russian, which isn't helpful at least to me.

[1] https://github.com/rjuju/pg_anonymize

Re: Possible solution for masking chosen columns when using pg_dump

From

Виктория Шепард

Date:

07 October 2022, 01:01:24

Hi,

I took a look, here are several suggestions for improvement:

- Masking is not a main functionality of pg_dump and it is better to write most of the connected things in a separate file like parallel.c or dumputils.c. This will help slow down the growth of an already huge pg_dump file.

- Also it can be hard to use a lot of different functions for different fields, maybe it would be better to set up functions in a file.

- How will it work for the same field and tables in the different schemas? Can we set up the exact schema for the field?

- misspelling in a word

>/*
>* Add all columns and funcions to list of MaskColumnInfo structures,
>*/

- Why did you use 256 here?

> char* table = (char*) pg_malloc(256 * sizeof(char));

Also for malloc you need malloc on 1 symbol more because you have to store '\0' symbol.

- Instead of addFuncToDatabase you can run your query using something already defined from fe_utils/query_utils.c. And It will be better to set up a connection only once and create all functions. Establishing a connection is a resource-intensive procedure. There are a lot of magic numbers, better to leave some comments explaining why there are 64 or 512.

- It seems that you are not using temp_string

> char *temp_string = (char*)malloc(256 * sizeof(char));

- Grammar issues

>/*
>* mask_column_info_list contains info about every to-be-masked column:
>* its name, a name its table (if nothing is specified - mask all columns with this name),
>* name of masking function and name of schema containing this function (public if not specified)
>*/

the name of its table

пн, 3 окт. 2022 г. в 20:45, Julien Rouhaud <rjuju123@gmail.com>:

Hi,

On Mon, Oct 03, 2022 at 06:30:17PM +0300, Олег Целебровский wrote:
>
> Hello, here's my take on masking data when using pg_dump
>
> The main idea is using PostgreSQL functions to replace data during a SELECT.
> When table data is dumped SELECT a,b,c,d ... from ... query is generated, the columns that are marked for masking are replaced with result of functions on those columns
> Example: columns name, count are to be masked, so the query will look as such: SELECT id, mask_text(name), mask_int(count), date from ...
>
> So about the interface: I added 2 more command-line options:
>
> --mask-columns, which specifies what columns from what tables will be masked
>    usage example:
>        --mask-columns "t1.name, t2.description" - both columns will be masked with the same corresponding function
>        or --mask-columns name - ALL columns with name "name" from all dumped tables will be masked with correspoding function
>
> --mask-function, which specifies what functions will mask data
>    usage example:
>        --mask-function mask_int - corresponding columns will be masked with function named "mask_int" from default schema (public)
>        or --mask-function my_schema.mask_varchar - same as above but with specified schema where the function is stored
>        or --mask-function somedir/filename - the function is "defined" here - more on the structure below

FTR I wrote an extension POC [1] last weekend that does that but on the backend
side. The main advantage is that it's working with any existing versions of
pg_dump (or any client relying on COPY or even plain interactive SQL
statements), and that the DBA can force a dedicated role to only get a masked
dump, even if they forgot to ask for it.

I only had a quick look at your patch but it seems that you left some todo in
russian, which isn't helpful at least to me.

[1] https://github.com/rjuju/pg_anonymize

Re[2]: Possible solution for masking chosen columns when using pg_dump

From

Олег Целебровский

Date:

10 October 2022, 09:53:43

Hi,

I applied most of suggestions: used separate files for most of added code, fixed typos/mistakes, got rid of that pesky TODO that was already implemented, just not deleted.

Added tests (and functionality) for cases when you need to mask columns in tables with the same name in different schemas. If schema is not specified, then columns in all tables with specified name are masked (Example - pg_dump -t ‘t0’ --mask-columns id --mask-function mask_int will mask all ids in all tables with names ‘t0’ in all existing schemas).

Wrote comments for all ‘magic numbers’

About that

>- Also it can be hard to use a lot of different functions for different fields, maybe it would be better to set up functions in a file.

I agree with that, but I know about at least 2 other patches (both are WIP, but still) that are interacting with reading command-line options from file. And if everyone will write their own version of reading command-line options from file, it will quickly get confusing.

A solution to that problem is another patch that will put all options from file (one file for any possible options, from existing to future ones) into **argv in main, so that pg_dump can process them as if they came form command line.

Пятница, 7 октября 2022, 8:01 +07:00 от Виктория Шепард <we.viktory@gmail.com>:

Hi,
I took a look, here are several suggestions for improvement:

- Masking is not a main functionality of pg_dump and it is better to write most of the connected things in a separate file like parallel.c or dumputils.c. This will help slow down the growth of an already huge pg_dump file.

- Also it can be hard to use a lot of different functions for different fields, maybe it would be better to set up functions in a file.

- How will it work for the same field and tables in the different schemas? Can we set up the exact schema for the field?

- misspelling in a word
>/*
>* Add all columns and funcions to list of MaskColumnInfo structures,
>*/

- Why did you use 256 here?
> char* table = (char*) pg_malloc(256 * sizeof(char));
Also for malloc you need malloc on 1 symbol more because you have to store '\0' symbol.

- Instead of addFuncToDatabase you can run your query using something already defined from fe_utils/query_utils.c. And It will be better to set up a connection only once and create all functions. Establishing a connection is a resource-intensive procedure. There are a lot of magic numbers, better to leave some comments explaining why there are 64 or 512.

- It seems that you are not using temp_string
> char *temp_string = (char*)malloc(256 * sizeof(char));

- Grammar issues
>/*
>* mask_column_info_list contains info about every to-be-masked column:
>* its name, a name its table (if nothing is specified - mask all columns with this name),
>* name of masking function and name of schema containing this function (public if not specified)
>*/
the name of its table

пн, 3 окт. 2022 г. в 20:45, Julien Rouhaud <rjuju123@gmail.com>:
Hi,

On Mon, Oct 03, 2022 at 06:30:17PM +0300, Олег Целебровский wrote:
>
> Hello, here's my take on masking data when using pg_dump
>
> The main idea is using PostgreSQL functions to replace data during a SELECT.
> When table data is dumped SELECT a,b,c,d ... from ... query is generated, the columns that are marked for masking are replaced with result of functions on those columns
> Example: columns name, count are to be masked, so the query will look as such: SELECT id, mask_text(name), mask_int(count), date from ...
>
> So about the interface: I added 2 more command-line options:
>
> --mask-columns, which specifies what columns from what tables will be masked
>    usage example:
>        --mask-columns "t1.name, t2.description" - both columns will be masked with the same corresponding function
>        or --mask-columns name - ALL columns with name "name" from all dumped tables will be masked with correspoding function
>
> --mask-function, which specifies what functions will mask data
>    usage example:
>        --mask-function mask_int - corresponding columns will be masked with function named "mask_int" from default schema (public)
>        or --mask-function my_schema.mask_varchar - same as above but with specified schema where the function is stored
>        or --mask-function somedir/filename - the function is "defined" here - more on the structure below

FTR I wrote an extension POC [1] last weekend that does that but on the backend
side. The main advantage is that it's working with any existing versions of
pg_dump (or any client relying on COPY or even plain interactive SQL
statements), and that the DBA can force a dedicated role to only get a masked
dump, even if they forgot to ask for it.

I only had a quick look at your patch but it seems that you left some todo in
russian, which isn't helpful at least to me.

[1] https://github.com/rjuju/pg_anonymize

Attachment

masking_for_pg_dump_v2.patch

Re: Re[2]: Possible solution for masking chosen columns when using pg_dump

From

Виктория Шепард

Date:

12 October 2022, 07:19:34

Hi,

Here is an idea of how to read masking options from a file. Please, take a look.

пн, 10 окт. 2022 г. в 14:54, Олег Целебровский <oleg_tselebrovskiy@mail.ru>:

Hi,

I applied most of suggestions: used separate files for most of added code, fixed typos/mistakes, got rid of that pesky TODO that was already implemented, just not deleted.

Added tests (and functionality) for cases when you need to mask columns in tables with the same name in different schemas. If schema is not specified, then columns in all tables with specified name are masked (Example - pg_dump -t ‘t0’ --mask-columns id --mask-function mask_int will mask all ids in all tables with names ‘t0’ in all existing schemas).

Wrote comments for all ‘magic numbers’

About that

>- Also it can be hard to use a lot of different functions for different fields, maybe it would be better to set up functions in a file.

I agree with that, but I know about at least 2 other patches (both are WIP, but still) that are interacting with reading command-line options from file. And if everyone will write their own version of reading command-line options from file, it will quickly get confusing.

A solution to that problem is another patch that will put all options from file (one file for any possible options, from existing to future ones) into **argv in main, so that pg_dump can process them as if they came form command line.

Пятница, 7 октября 2022, 8:01 +07:00 от Виктория Шепард <we.viktory@gmail.com>:

Hi,
I took a look, here are several suggestions for improvement:

- Masking is not a main functionality of pg_dump and it is better to write most of the connected things in a separate file like parallel.c or dumputils.c. This will help slow down the growth of an already huge pg_dump file.

- Also it can be hard to use a lot of different functions for different fields, maybe it would be better to set up functions in a file.

- How will it work for the same field and tables in the different schemas? Can we set up the exact schema for the field?

- misspelling in a word
>/*
>* Add all columns and funcions to list of MaskColumnInfo structures,
>*/

- Why did you use 256 here?
> char* table = (char*) pg_malloc(256 * sizeof(char));
Also for malloc you need malloc on 1 symbol more because you have to store '\0' symbol.

- Instead of addFuncToDatabase you can run your query using something already defined from fe_utils/query_utils.c. And It will be better to set up a connection only once and create all functions. Establishing a connection is a resource-intensive procedure. There are a lot of magic numbers, better to leave some comments explaining why there are 64 or 512.

- It seems that you are not using temp_string
> char *temp_string = (char*)malloc(256 * sizeof(char));

- Grammar issues
>/*
>* mask_column_info_list contains info about every to-be-masked column:
>* its name, a name its table (if nothing is specified - mask all columns with this name),
>* name of masking function and name of schema containing this function (public if not specified)
>*/
the name of its table

пн, 3 окт. 2022 г. в 20:45, Julien Rouhaud <rjuju123@gmail.com>:
Hi,

On Mon, Oct 03, 2022 at 06:30:17PM +0300, Олег Целебровский wrote:
>
> Hello, here's my take on masking data when using pg_dump
>
> The main idea is using PostgreSQL functions to replace data during a SELECT.
> When table data is dumped SELECT a,b,c,d ... from ... query is generated, the columns that are marked for masking are replaced with result of functions on those columns
> Example: columns name, count are to be masked, so the query will look as such: SELECT id, mask_text(name), mask_int(count), date from ...
>
> So about the interface: I added 2 more command-line options:
>
> --mask-columns, which specifies what columns from what tables will be masked
>    usage example:
>        --mask-columns "t1.name, t2.description" - both columns will be masked with the same corresponding function
>        or --mask-columns name - ALL columns with name "name" from all dumped tables will be masked with correspoding function
>
> --mask-function, which specifies what functions will mask data
>    usage example:
>        --mask-function mask_int - corresponding columns will be masked with function named "mask_int" from default schema (public)
>        or --mask-function my_schema.mask_varchar - same as above but with specified schema where the function is stored
>        or --mask-function somedir/filename - the function is "defined" here - more on the structure below

FTR I wrote an extension POC [1] last weekend that does that but on the backend
side. The main advantage is that it's working with any existing versions of
pg_dump (or any client relying on COPY or even plain interactive SQL
statements), and that the DBA can force a dedicated role to only get a masked
dump, even if they forgot to ask for it.

I only had a quick look at your patch but it seems that you left some todo in
russian, which isn't helpful at least to me.

[1] https://github.com/rjuju/pg_anonymize

Attachment

0001-C4a-pg-dump-masking-option.patch

Re: Re[2]: Possible solution for masking chosen columns when using pg_dump

From

Виктория Шепард

Date:

23 October 2022, 23:24:38

Hi,

Thank you, Oleg Tselebrovskiy, for your valuable review, here are the fixes

Best regards,

Viktoria Shepard

ср, 12 окт. 2022 г. в 12:19, Виктория Шепард <we.viktory@gmail.com>:

Hi,

Here is an idea of how to read masking options from a file. Please, take a look.

пн, 10 окт. 2022 г. в 14:54, Олег Целебровский <oleg_tselebrovskiy@mail.ru>:
Hi,

I applied most of suggestions: used separate files for most of added code, fixed typos/mistakes, got rid of that pesky TODO that was already implemented, just not deleted.

Added tests (and functionality) for cases when you need to mask columns in tables with the same name in different schemas. If schema is not specified, then columns in all tables with specified name are masked (Example - pg_dump -t ‘t0’ --mask-columns id --mask-function mask_int will mask all ids in all tables with names ‘t0’ in all existing schemas).

Wrote comments for all ‘magic numbers’

About that

>- Also it can be hard to use a lot of different functions for different fields, maybe it would be better to set up functions in a file.

I agree with that, but I know about at least 2 other patches (both are WIP, but still) that are interacting with reading command-line options from file. And if everyone will write their own version of reading command-line options from file, it will quickly get confusing.

A solution to that problem is another patch that will put all options from file (one file for any possible options, from existing to future ones) into **argv in main, so that pg_dump can process them as if they came form command line.

Пятница, 7 октября 2022, 8:01 +07:00 от Виктория Шепард <we.viktory@gmail.com>:

Hi,
I took a look, here are several suggestions for improvement:

- Masking is not a main functionality of pg_dump and it is better to write most of the connected things in a separate file like parallel.c or dumputils.c. This will help slow down the growth of an already huge pg_dump file.

- Also it can be hard to use a lot of different functions for different fields, maybe it would be better to set up functions in a file.

- How will it work for the same field and tables in the different schemas? Can we set up the exact schema for the field?

- misspelling in a word
>/*
>* Add all columns and funcions to list of MaskColumnInfo structures,
>*/

- Why did you use 256 here?
> char* table = (char*) pg_malloc(256 * sizeof(char));
Also for malloc you need malloc on 1 symbol more because you have to store '\0' symbol.

- Instead of addFuncToDatabase you can run your query using something already defined from fe_utils/query_utils.c. And It will be better to set up a connection only once and create all functions. Establishing a connection is a resource-intensive procedure. There are a lot of magic numbers, better to leave some comments explaining why there are 64 or 512.

- It seems that you are not using temp_string
> char *temp_string = (char*)malloc(256 * sizeof(char));

- Grammar issues
>/*
>* mask_column_info_list contains info about every to-be-masked column:
>* its name, a name its table (if nothing is specified - mask all columns with this name),
>* name of masking function and name of schema containing this function (public if not specified)
>*/
the name of its table

пн, 3 окт. 2022 г. в 20:45, Julien Rouhaud <rjuju123@gmail.com>:
Hi,

On Mon, Oct 03, 2022 at 06:30:17PM +0300, Олег Целебровский wrote:
>
> Hello, here's my take on masking data when using pg_dump
>
> The main idea is using PostgreSQL functions to replace data during a SELECT.
> When table data is dumped SELECT a,b,c,d ... from ... query is generated, the columns that are marked for masking are replaced with result of functions on those columns
> Example: columns name, count are to be masked, so the query will look as such: SELECT id, mask_text(name), mask_int(count), date from ...
>
> So about the interface: I added 2 more command-line options:
>
> --mask-columns, which specifies what columns from what tables will be masked
>    usage example:
>        --mask-columns "t1.name, t2.description" - both columns will be masked with the same corresponding function
>        or --mask-columns name - ALL columns with name "name" from all dumped tables will be masked with correspoding function
>
> --mask-function, which specifies what functions will mask data
>    usage example:
>        --mask-function mask_int - corresponding columns will be masked with function named "mask_int" from default schema (public)
>        or --mask-function my_schema.mask_varchar - same as above but with specified schema where the function is stored
>        or --mask-function somedir/filename - the function is "defined" here - more on the structure below

FTR I wrote an extension POC [1] last weekend that does that but on the backend
side. The main advantage is that it's working with any existing versions of
pg_dump (or any client relying on COPY or even plain interactive SQL
statements), and that the DBA can force a dedicated role to only get a masked
dump, even if they forgot to ask for it.

I only had a quick look at your patch but it seems that you left some todo in
russian, which isn't helpful at least to me.

[1] https://github.com/rjuju/pg_anonymize

Attachment

0001-C4a-pg-dump-masking-option.patch