Re: Re[2]: Possible solution for masking chosen columns when using pg_dump - Mailing list pgsql-hackers

From Виктория Шепард
Subject Re: Re[2]: Possible solution for masking chosen columns when using pg_dump
Date
Msg-id CAAB=fdPuaNk_O9_dcZ6Nj0ZmgpBTKEiibdGo5sz_FAFOK+Rxfw@mail.gmail.com
Whole thread Raw
In response to Re[2]: Possible solution for masking chosen columns when using pg_dump  (Олег Целебровский <oleg_tselebrovskiy@mail.ru>)
Responses Re: Re[2]: Possible solution for masking chosen columns when using pg_dump  (Виктория Шепард <we.viktory@gmail.com>)
List pgsql-hackers
Hi,

Here is an idea of how to read masking options from a file. Please, take a look.

пн, 10 окт. 2022 г. в 14:54, Олег Целебровский <oleg_tselebrovskiy@mail.ru>:
Hi,

I applied most of suggestions: used separate files for most of added code, fixed typos/mistakes, got rid of that pesky TODO that was already implemented, just not deleted.

Added tests (and functionality) for cases when you need to mask columns in tables with the same name in different schemas. If schema is not specified, then columns in all tables with specified name are masked (Example - pg_dump -t ‘t0’ --mask-columns id --mask-function mask_int will mask all ids in all tables with names ‘t0’ in all existing schemas).

Wrote comments for all ‘magic numbers’

About that

>- Also it can be hard to use a lot of different functions for different fields, maybe it would be better to set up functions in a file.

I agree with that, but I know about at least 2 other patches (both are WIP, but still) that are interacting with reading command-line options from file. And if everyone will write their own version of reading command-line options from file, it will quickly get confusing.

A solution to that problem is another patch that will put all options from file (one file for any possible options, from existing to future ones) into **argv in main, so that pg_dump can process them as if they came form command line.
 
Пятница, 7 октября 2022, 8:01 +07:00 от Виктория Шепард <we.viktory@gmail.com>:
 
Hi,
I took a look, here are several suggestions for improvement:
 
- Masking is not a main functionality of pg_dump and it is better to write most of the connected things in a separate file like parallel.c or dumputils.c. This will help slow down the growth of an already huge pg_dump file.
 
- Also it can be hard to use a lot of different functions for different fields, maybe it would be better to set up functions in a file.
 
- How will it work for the same field and tables in the different schemas? Can we set up the exact schema for the field?
 
- misspelling in a word
>/*
>* Add all columns and funcions to list of MaskColumnInfo structures,
>*/
 
- Why did you use 256 here?
> char* table = (char*) pg_malloc(256 * sizeof(char));
Also for malloc you need malloc on 1 symbol more because you have to store '\0' symbol.
 
- Instead of addFuncToDatabase you can run your query using something already defined from fe_utils/query_utils.c. And It will be better to set up a connection only once and create all functions. Establishing a connection is a resource-intensive procedure. There are a lot of magic numbers, better to leave some comments explaining why there are 64 or 512.
 
- It seems that you are not using temp_string
> char   *temp_string = (char*)malloc(256 * sizeof(char));
 
- Grammar issues
>/*
>* mask_column_info_list contains info about every to-be-masked column:
>* its name, a name its table (if nothing is specified - mask all columns with this name),
>* name of masking function and name of schema containing this function (public if not specified)
>*/
the name of its table
 
 
пн, 3 окт. 2022 г. в 20:45, Julien Rouhaud <rjuju123@gmail.com>:
Hi,

On Mon, Oct 03, 2022 at 06:30:17PM +0300, Олег Целебровский wrote:
>
> Hello, here's my take on masking data when using pg_dump
>  
> The main idea is using PostgreSQL functions to replace data during a SELECT.
> When table data is dumped SELECT a,b,c,d ... from ... query is generated, the columns that are marked for masking are replaced with result of functions on those columns
> Example: columns name, count are to be masked, so the query will look as such: SELECT id, mask_text(name), mask_int(count), date from ...
>  
> So about the interface: I added 2 more command-line options: 
>  
> --mask-columns, which specifies what columns from what tables will be masked 
>     usage example:
>             --mask-columns "t1.name, t2.description" - both columns will be masked with the same corresponding function
>             or --mask-columns name - ALL columns with name "name" from all dumped tables will be masked with correspoding function
>  
> --mask-function, which specifies what functions will mask data
>     usage example:
>             --mask-function mask_int - corresponding columns will be masked with function named "mask_int" from default schema (public)
>             or --mask-function my_schema.mask_varchar - same as above but with specified schema where the function is stored
>             or --mask-function somedir/filename - the function is "defined" here - more on the structure below

FTR I wrote an extension POC [1] last weekend that does that but on the backend
side.  The main advantage is that it's working with any existing versions of
pg_dump (or any client relying on COPY or even plain interactive SQL
statements), and that the DBA can force a dedicated role to only get a masked
dump, even if they forgot to ask for it.

I only had a quick look at your patch but it seems that you left some todo in
russian, which isn't helpful at least to me.

[1] https://github.com/rjuju/pg_anonymize

 
 
Attachment

pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: Add index scan progress to pg_stat_progress_vacuum
Next
From: Masahiko Sawada
Date:
Subject: Re: test_decoding assertion failure for the loss of top-sub transaction relationship