Re: Anonymized database dumps - Mailing list pgsql-general

From Kiriakos Georgiou
Subject Re: Anonymized database dumps
Date
Msg-id EBBC9763-B788-484B-BC4D-3DE836FA4513@olympiakos.com
Whole thread Raw
In response to Re: Anonymized database dumps  (Bill Moran <wmoran@potentialtech.com>)
Responses Re: Anonymized database dumps  (Bill Moran <wmoran@potentialtech.com>)
List pgsql-general
The data anonymizer process is flawed because you are one misstep away from data spillage.  Sensitive data should be
storedencrypted to begin.  For test databases you or your developers can invoke a process that replaces the real
encrypteddata with fake encrypted data (for which everybody has the key/password.)  Or if the overhead is too much (ie
billionsof rows), you can have different decrypt() routines on your test databases that return fake data without
touchingthe real encrypted columns. 

Kiriakos

On Mar 19, 2012, at 8:22 AM, Bill Moran wrote:

> In response to Janning Vygen <vygen@kicktipp.de>:
>>
>> I am working on postgresql 9.1 and loving it!
>>
>> Sometimes we need a full database dump to test some performance issues
>> with real data.
>>
>> Of course we don't like to have sensible data like bunches of e-mail
>> addresses on our development machines as they are of no interest for
>> developers and should be kept secure.
>>
>> So we need an anonymized database dump. I thought about a few ways to
>> achieve this.
>>
>> 1. Best solution would be a special db user and some rules which fire on
>> reading some tables and replace privacy data with some random data. Now
>> doing a dump as this special user doesn't even copy the sensible data at
>> all. The user just has a different view on this database even when he
>> calls pg_dump.
>>
>> But as rules are not fired on COPY it can't work, right?
>>
>> 2. The other solution I can think of is something like
>>
>> pg_dump | sed > pgdump_anon
>>
>> where 'sed' does a lot of magical replace operations on the content of
>> the dump. I don't think this is going to work reliable.
>>
>> 3. More reliable would be to dump the database, restore it on a
>> different server, run some sql script which randomize some data, and
>> dump it again. hmm, seems to be the only reliable way so far. But it is
>> no fun when dumping and restoring takes an hour.
>>
>> Does anybody has a better idea how to achieve an anonymized database dump?
>
> I highly recommend #3.  It's how we do it where I work.
>
> At first it seems like a big, slow, complicated monster, but once you've
> built the tools and have it running reliably it's very nice.  Our system
> does the dumps overnight via cron (we have over 100 production databases)
> then changes the sensitive data, as well changing all the passwords to
> "password" so developers can easily log in as any account.  During the
> day, the developers have access to all the sanitized dump files and can
> use them to make as many testing databases as they need.  Yes, the data
> gets up to 24 hours out of date, but it's never been a problem for us.
>
> --
> Bill Moran
> http://www.potentialtech.com
> http://people.collaborativefusion.com/~wmoran/
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general


pgsql-general by date:

Previous
From: Steve Crawford
Date:
Subject: Re: nice'ing the postgres COPY backend process to make pg_dumps run more "softly"
Next
From: Bill Moran
Date:
Subject: Re: Anonymized database dumps