Thread: Possible solution for masking chosen columns when using pg_dump
Hello, here's my take on masking data when using pg_dump
The main idea is using PostgreSQL functions to replace data during a SELECT.
When table data is dumped SELECT a,b,c,d ... from ... query is generated, the columns that are marked for masking are replaced with result of functions on those columns
Example: columns name, count are to be masked, so the query will look as such: SELECT id, mask_text(name), mask_int(count), date from ...
So about the interface: I added 2 more command-line options:
--mask-columns, which specifies what columns from what tables will be masked
usage example:
--mask-columns "t1.name, t2.description" - both columns will be masked with the same corresponding function
--mask-columns "t1.name, t2.description" - both columns will be masked with the same corresponding function
or --mask-columns name - ALL columns with name "name" from all dumped tables will be masked with correspoding function
--mask-function, which specifies what functions will mask data
usage example:
--mask-function mask_int - corresponding columns will be masked with function named "mask_int" from default schema (public)
--mask-function mask_int - corresponding columns will be masked with function named "mask_int" from default schema (public)
or --mask-function my_schema.mask_varchar - same as above but with specified schema where the function is stored
or --mask-function somedir/filename - the function is "defined" here - more on the structure below
Structure of the file with function description:
First row - function name (with or without schema name)
Second row - type of in and out value (the design is to only work with same input/output type so no int-to-text shenanigans)
Third row - language of function
Forth and later rows - body of a function
Example of such file:
mask_text
text
plpgsql
res := '***';
First iteration of using file-described functions used just plain SQL query, but since it executed during read-write connection, some things such as writing "DROP TABLE t1;" after the CREATE FUNCTION ...; were possible.
Now even if something harmful is written in function body, it will be executed during dump-read-only connection, where it will just throw an error
About "corresponding columns and functions" - masking functions and columns are paired with eachother based on the input order, but --masking-columns and --masking-functions don't have to be subsequent.
Example: pg_dump -t table_name --mask-columns name --mask-colums count --mask-function mask_text --mask-function mask_int - here 'name' will be paired with function 'mask_text' and 'count' with 'mask_int'
Patch includes regression tests
I'm open to discussion of this patch
Best regards,
Oleg Tselebrovskiy
Attachment
Hi, On Mon, Oct 03, 2022 at 06:30:17PM +0300, Олег Целебровский wrote: > > Hello, here's my take on masking data when using pg_dump > > The main idea is using PostgreSQL functions to replace data during a SELECT. > When table data is dumped SELECT a,b,c,d ... from ... query is generated, the columns that are marked for masking are replacedwith result of functions on those columns > Example: columns name, count are to be masked, so the query will look as such: SELECT id, mask_text(name), mask_int(count),date from ... > > So about the interface: I added 2 more command-line options: > > --mask-columns, which specifies what columns from what tables will be masked > usage example: > --mask-columns "t1.name, t2.description" - both columns will be masked with the same corresponding function > or --mask-columns name - ALL columns with name "name" from all dumped tables will be masked with correspodingfunction > > --mask-function, which specifies what functions will mask data > usage example: > --mask-function mask_int - corresponding columns will be masked with function named "mask_int" from defaultschema (public) > or --mask-function my_schema.mask_varchar - same as above but with specified schema where the function is stored > or --mask-function somedir/filename - the function is "defined" here - more on the structure below FTR I wrote an extension POC [1] last weekend that does that but on the backend side. The main advantage is that it's working with any existing versions of pg_dump (or any client relying on COPY or even plain interactive SQL statements), and that the DBA can force a dedicated role to only get a masked dump, even if they forgot to ask for it. I only had a quick look at your patch but it seems that you left some todo in russian, which isn't helpful at least to me. [1] https://github.com/rjuju/pg_anonymize
Hi,
I took a look, here are several suggestions for improvement:
- Masking is not a main functionality of pg_dump and it is better to write most of the connected things in a separate file like parallel.c or dumputils.c. This will help slow down the growth of an already huge pg_dump file.
- Also it can be hard to use a lot of different functions for different fields, maybe it would be better to set up functions in a file.
- How will it work for the same field and tables in the different schemas? Can we set up the exact schema for the field?
- misspelling in a word
>/*
>* Add all columns and funcions to list of MaskColumnInfo structures,
>*/
>* Add all columns and funcions to list of MaskColumnInfo structures,
>*/
- Why did you use 256 here?
> char* table = (char*) pg_malloc(256 * sizeof(char));
Also for malloc you need malloc on 1 symbol more because you have to store '\0' symbol.
- Instead of addFuncToDatabase you can run your query using something already defined from fe_utils/query_utils.c. And It will be better to set up a connection only once and create all functions. Establishing a connection is a resource-intensive procedure. There are a lot of magic numbers, better to leave some comments explaining why there are 64 or 512.
- It seems that you are not using temp_string
> char *temp_string = (char*)malloc(256 * sizeof(char));
- Grammar issues
>/*
>* mask_column_info_list contains info about every to-be-masked column:
>* its name, a name its table (if nothing is specified - mask all columns with this name),
>* name of masking function and name of schema containing this function (public if not specified)
>*/
>* mask_column_info_list contains info about every to-be-masked column:
>* its name, a name its table (if nothing is specified - mask all columns with this name),
>* name of masking function and name of schema containing this function (public if not specified)
>*/
the name of its table
пн, 3 окт. 2022 г. в 20:45, Julien Rouhaud <rjuju123@gmail.com>:
Hi,
On Mon, Oct 03, 2022 at 06:30:17PM +0300, Олег Целебровский wrote:
>
> Hello, here's my take on masking data when using pg_dump
>
> The main idea is using PostgreSQL functions to replace data during a SELECT.
> When table data is dumped SELECT a,b,c,d ... from ... query is generated, the columns that are marked for masking are replaced with result of functions on those columns
> Example: columns name, count are to be masked, so the query will look as such: SELECT id, mask_text(name), mask_int(count), date from ...
>
> So about the interface: I added 2 more command-line options:
>
> --mask-columns, which specifies what columns from what tables will be masked
> usage example:
> --mask-columns "t1.name, t2.description" - both columns will be masked with the same corresponding function
> or --mask-columns name - ALL columns with name "name" from all dumped tables will be masked with correspoding function
>
> --mask-function, which specifies what functions will mask data
> usage example:
> --mask-function mask_int - corresponding columns will be masked with function named "mask_int" from default schema (public)
> or --mask-function my_schema.mask_varchar - same as above but with specified schema where the function is stored
> or --mask-function somedir/filename - the function is "defined" here - more on the structure below
FTR I wrote an extension POC [1] last weekend that does that but on the backend
side. The main advantage is that it's working with any existing versions of
pg_dump (or any client relying on COPY or even plain interactive SQL
statements), and that the DBA can force a dedicated role to only get a masked
dump, even if they forgot to ask for it.
I only had a quick look at your patch but it seems that you left some todo in
russian, which isn't helpful at least to me.
[1] https://github.com/rjuju/pg_anonymize
Hi,
I applied most of suggestions: used separate files for most of added code, fixed typos/mistakes, got rid of that pesky TODO that was already implemented, just not deleted.
Added tests (and functionality) for cases when you need to mask columns in tables with the same name in different schemas. If schema is not specified, then columns in all tables with specified name are masked (Example - pg_dump -t ‘t0’ --mask-columns id --mask-function mask_int will mask all ids in all tables with names ‘t0’ in all existing schemas).
Wrote comments for all ‘magic numbers’
About that
>- Also it can be hard to use a lot of different functions for different fields, maybe it would be better to set up functions in a file.
I agree with that, but I know about at least 2 other patches (both are WIP, but still) that are interacting with reading command-line options from file. And if everyone will write their own version of reading command-line options from file, it will quickly get confusing.
A solution to that problem is another patch that will put all options from file (one file for any possible options, from existing to future ones) into **argv in main, so that pg_dump can process them as if they came form command line.
I applied most of suggestions: used separate files for most of added code, fixed typos/mistakes, got rid of that pesky TODO that was already implemented, just not deleted.
Added tests (and functionality) for cases when you need to mask columns in tables with the same name in different schemas. If schema is not specified, then columns in all tables with specified name are masked (Example - pg_dump -t ‘t0’ --mask-columns id --mask-function mask_int will mask all ids in all tables with names ‘t0’ in all existing schemas).
Wrote comments for all ‘magic numbers’
About that
>- Also it can be hard to use a lot of different functions for different fields, maybe it would be better to set up functions in a file.
I agree with that, but I know about at least 2 other patches (both are WIP, but still) that are interacting with reading command-line options from file. And if everyone will write their own version of reading command-line options from file, it will quickly get confusing.
A solution to that problem is another patch that will put all options from file (one file for any possible options, from existing to future ones) into **argv in main, so that pg_dump can process them as if they came form command line.
Пятница, 7 октября 2022, 8:01 +07:00 от Виктория Шепард <we.viktory@gmail.com>:
Hi,I took a look, here are several suggestions for improvement:- Masking is not a main functionality of pg_dump and it is better to write most of the connected things in a separate file like parallel.c or dumputils.c. This will help slow down the growth of an already huge pg_dump file.- Also it can be hard to use a lot of different functions for different fields, maybe it would be better to set up functions in a file.- How will it work for the same field and tables in the different schemas? Can we set up the exact schema for the field?- misspelling in a word>/*
>* Add all columns and funcions to list of MaskColumnInfo structures,
>*/- Why did you use 256 here?> char* table = (char*) pg_malloc(256 * sizeof(char));Also for malloc you need malloc on 1 symbol more because you have to store '\0' symbol.- Instead of addFuncToDatabase you can run your query using something already defined from fe_utils/query_utils.c. And It will be better to set up a connection only once and create all functions. Establishing a connection is a resource-intensive procedure. There are a lot of magic numbers, better to leave some comments explaining why there are 64 or 512.- It seems that you are not using temp_string> char *temp_string = (char*)malloc(256 * sizeof(char));- Grammar issues>/*
>* mask_column_info_list contains info about every to-be-masked column:
>* its name, a name its table (if nothing is specified - mask all columns with this name),
>* name of masking function and name of schema containing this function (public if not specified)
>*/the name of its tableпн, 3 окт. 2022 г. в 20:45, Julien Rouhaud <rjuju123@gmail.com>:Hi,
On Mon, Oct 03, 2022 at 06:30:17PM +0300, Олег Целебровский wrote:
>
> Hello, here's my take on masking data when using pg_dump
>
> The main idea is using PostgreSQL functions to replace data during a SELECT.
> When table data is dumped SELECT a,b,c,d ... from ... query is generated, the columns that are marked for masking are replaced with result of functions on those columns
> Example: columns name, count are to be masked, so the query will look as such: SELECT id, mask_text(name), mask_int(count), date from ...
>
> So about the interface: I added 2 more command-line options:
>
> --mask-columns, which specifies what columns from what tables will be masked
> usage example:
> --mask-columns "t1.name, t2.description" - both columns will be masked with the same corresponding function
> or --mask-columns name - ALL columns with name "name" from all dumped tables will be masked with correspoding function
>
> --mask-function, which specifies what functions will mask data
> usage example:
> --mask-function mask_int - corresponding columns will be masked with function named "mask_int" from default schema (public)
> or --mask-function my_schema.mask_varchar - same as above but with specified schema where the function is stored
> or --mask-function somedir/filename - the function is "defined" here - more on the structure below
FTR I wrote an extension POC [1] last weekend that does that but on the backend
side. The main advantage is that it's working with any existing versions of
pg_dump (or any client relying on COPY or even plain interactive SQL
statements), and that the DBA can force a dedicated role to only get a masked
dump, even if they forgot to ask for it.
I only had a quick look at your patch but it seems that you left some todo in
russian, which isn't helpful at least to me.
[1] https://github.com/rjuju/pg_anonymize
Attachment
Re: Re[2]: Possible solution for masking chosen columns when using pg_dump
From
Виктория Шепард
Date:
Hi,
Here is an idea of how to read masking options from a file. Please, take a look.
пн, 10 окт. 2022 г. в 14:54, Олег Целебровский <oleg_tselebrovskiy@mail.ru>:
Hi,
I applied most of suggestions: used separate files for most of added code, fixed typos/mistakes, got rid of that pesky TODO that was already implemented, just not deleted.
Added tests (and functionality) for cases when you need to mask columns in tables with the same name in different schemas. If schema is not specified, then columns in all tables with specified name are masked (Example - pg_dump -t ‘t0’ --mask-columns id --mask-function mask_int will mask all ids in all tables with names ‘t0’ in all existing schemas).
Wrote comments for all ‘magic numbers’
About that
>- Also it can be hard to use a lot of different functions for different fields, maybe it would be better to set up functions in a file.
I agree with that, but I know about at least 2 other patches (both are WIP, but still) that are interacting with reading command-line options from file. And if everyone will write their own version of reading command-line options from file, it will quickly get confusing.
A solution to that problem is another patch that will put all options from file (one file for any possible options, from existing to future ones) into **argv in main, so that pg_dump can process them as if they came form command line.
Пятница, 7 октября 2022, 8:01 +07:00 от Виктория Шепард <we.viktory@gmail.com>:
Hi,I took a look, here are several suggestions for improvement:- Masking is not a main functionality of pg_dump and it is better to write most of the connected things in a separate file like parallel.c or dumputils.c. This will help slow down the growth of an already huge pg_dump file.- Also it can be hard to use a lot of different functions for different fields, maybe it would be better to set up functions in a file.- How will it work for the same field and tables in the different schemas? Can we set up the exact schema for the field?- misspelling in a word>/*
>* Add all columns and funcions to list of MaskColumnInfo structures,
>*/- Why did you use 256 here?> char* table = (char*) pg_malloc(256 * sizeof(char));Also for malloc you need malloc on 1 symbol more because you have to store '\0' symbol.- Instead of addFuncToDatabase you can run your query using something already defined from fe_utils/query_utils.c. And It will be better to set up a connection only once and create all functions. Establishing a connection is a resource-intensive procedure. There are a lot of magic numbers, better to leave some comments explaining why there are 64 or 512.- It seems that you are not using temp_string> char *temp_string = (char*)malloc(256 * sizeof(char));- Grammar issues>/*
>* mask_column_info_list contains info about every to-be-masked column:
>* its name, a name its table (if nothing is specified - mask all columns with this name),
>* name of masking function and name of schema containing this function (public if not specified)
>*/the name of its tableпн, 3 окт. 2022 г. в 20:45, Julien Rouhaud <rjuju123@gmail.com>:Hi,
On Mon, Oct 03, 2022 at 06:30:17PM +0300, Олег Целебровский wrote:
>
> Hello, here's my take on masking data when using pg_dump
>
> The main idea is using PostgreSQL functions to replace data during a SELECT.
> When table data is dumped SELECT a,b,c,d ... from ... query is generated, the columns that are marked for masking are replaced with result of functions on those columns
> Example: columns name, count are to be masked, so the query will look as such: SELECT id, mask_text(name), mask_int(count), date from ...
>
> So about the interface: I added 2 more command-line options:
>
> --mask-columns, which specifies what columns from what tables will be masked
> usage example:
> --mask-columns "t1.name, t2.description" - both columns will be masked with the same corresponding function
> or --mask-columns name - ALL columns with name "name" from all dumped tables will be masked with correspoding function
>
> --mask-function, which specifies what functions will mask data
> usage example:
> --mask-function mask_int - corresponding columns will be masked with function named "mask_int" from default schema (public)
> or --mask-function my_schema.mask_varchar - same as above but with specified schema where the function is stored
> or --mask-function somedir/filename - the function is "defined" here - more on the structure below
FTR I wrote an extension POC [1] last weekend that does that but on the backend
side. The main advantage is that it's working with any existing versions of
pg_dump (or any client relying on COPY or even plain interactive SQL
statements), and that the DBA can force a dedicated role to only get a masked
dump, even if they forgot to ask for it.
I only had a quick look at your patch but it seems that you left some todo in
russian, which isn't helpful at least to me.
[1] https://github.com/rjuju/pg_anonymize
Attachment
Re: Re[2]: Possible solution for masking chosen columns when using pg_dump
From
Виктория Шепард
Date:
Hi,
Thank you, Oleg Tselebrovskiy, for your valuable review, here are the fixes
Best regards,
Viktoria Shepard
ср, 12 окт. 2022 г. в 12:19, Виктория Шепард <we.viktory@gmail.com>:
Hi,Here is an idea of how to read masking options from a file. Please, take a look.пн, 10 окт. 2022 г. в 14:54, Олег Целебровский <oleg_tselebrovskiy@mail.ru>:Hi,
I applied most of suggestions: used separate files for most of added code, fixed typos/mistakes, got rid of that pesky TODO that was already implemented, just not deleted.
Added tests (and functionality) for cases when you need to mask columns in tables with the same name in different schemas. If schema is not specified, then columns in all tables with specified name are masked (Example - pg_dump -t ‘t0’ --mask-columns id --mask-function mask_int will mask all ids in all tables with names ‘t0’ in all existing schemas).
Wrote comments for all ‘magic numbers’
About that
>- Also it can be hard to use a lot of different functions for different fields, maybe it would be better to set up functions in a file.
I agree with that, but I know about at least 2 other patches (both are WIP, but still) that are interacting with reading command-line options from file. And if everyone will write their own version of reading command-line options from file, it will quickly get confusing.
A solution to that problem is another patch that will put all options from file (one file for any possible options, from existing to future ones) into **argv in main, so that pg_dump can process them as if they came form command line.
Пятница, 7 октября 2022, 8:01 +07:00 от Виктория Шепард <we.viktory@gmail.com>:
Hi,I took a look, here are several suggestions for improvement:- Masking is not a main functionality of pg_dump and it is better to write most of the connected things in a separate file like parallel.c or dumputils.c. This will help slow down the growth of an already huge pg_dump file.- Also it can be hard to use a lot of different functions for different fields, maybe it would be better to set up functions in a file.- How will it work for the same field and tables in the different schemas? Can we set up the exact schema for the field?- misspelling in a word>/*
>* Add all columns and funcions to list of MaskColumnInfo structures,
>*/- Why did you use 256 here?> char* table = (char*) pg_malloc(256 * sizeof(char));Also for malloc you need malloc on 1 symbol more because you have to store '\0' symbol.- Instead of addFuncToDatabase you can run your query using something already defined from fe_utils/query_utils.c. And It will be better to set up a connection only once and create all functions. Establishing a connection is a resource-intensive procedure. There are a lot of magic numbers, better to leave some comments explaining why there are 64 or 512.- It seems that you are not using temp_string> char *temp_string = (char*)malloc(256 * sizeof(char));- Grammar issues>/*
>* mask_column_info_list contains info about every to-be-masked column:
>* its name, a name its table (if nothing is specified - mask all columns with this name),
>* name of masking function and name of schema containing this function (public if not specified)
>*/the name of its tableпн, 3 окт. 2022 г. в 20:45, Julien Rouhaud <rjuju123@gmail.com>:Hi,
On Mon, Oct 03, 2022 at 06:30:17PM +0300, Олег Целебровский wrote:
>
> Hello, here's my take on masking data when using pg_dump
>
> The main idea is using PostgreSQL functions to replace data during a SELECT.
> When table data is dumped SELECT a,b,c,d ... from ... query is generated, the columns that are marked for masking are replaced with result of functions on those columns
> Example: columns name, count are to be masked, so the query will look as such: SELECT id, mask_text(name), mask_int(count), date from ...
>
> So about the interface: I added 2 more command-line options:
>
> --mask-columns, which specifies what columns from what tables will be masked
> usage example:
> --mask-columns "t1.name, t2.description" - both columns will be masked with the same corresponding function
> or --mask-columns name - ALL columns with name "name" from all dumped tables will be masked with correspoding function
>
> --mask-function, which specifies what functions will mask data
> usage example:
> --mask-function mask_int - corresponding columns will be masked with function named "mask_int" from default schema (public)
> or --mask-function my_schema.mask_varchar - same as above but with specified schema where the function is stored
> or --mask-function somedir/filename - the function is "defined" here - more on the structure below
FTR I wrote an extension POC [1] last weekend that does that but on the backend
side. The main advantage is that it's working with any existing versions of
pg_dump (or any client relying on COPY or even plain interactive SQL
statements), and that the DBA can force a dedicated role to only get a masked
dump, even if they forgot to ask for it.
I only had a quick look at your patch but it seems that you left some todo in
russian, which isn't helpful at least to me.
[1] https://github.com/rjuju/pg_anonymize