Re: Anonymized database dumps - Mailing list pgsql-general

From Janning Vygen
Subject Re: Anonymized database dumps
Date
Msg-id 3EC70397-81C3-42E3-B29E-69F8A9A2C084@kicktipp.de
Whole thread Raw
In response to Re: Anonymized database dumps  (Bill Moran <wmoran@potentialtech.com>)
List pgsql-general
Am 19.03.2012 um 13:22 schrieb Bill Moran <wmoran@potentialtech.com>:

> In response to Janning Vygen <vygen@kicktipp.de>:
>>
>> I am working on postgresql 9.1 and loving it!
>>
>> Sometimes we need a full database dump to test some performance issues
>> with real data.
>>
>> Of course we don't like to have sensible data like bunches of e-mail
>> addresses on our development machines as they are of no interest for
>> developers and should be kept secure.
>>
>> So we need an anonymized database dump. I thought about a few ways to
>> achieve this.
>>
>> 1. Best solution would be a special db user and some rules which fire on
>> reading some tables and replace privacy data with some random data. Now
>> doing a dump as this special user doesn't even copy the sensible data at
>> all. The user just has a different view on this database even when he
>> calls pg_dump.
>>
>> But as rules are not fired on COPY it can't work, right?
>>
>> 2. The other solution I can think of is something like
>>
>> pg_dump | sed > pgdump_anon
>>
>> where 'sed' does a lot of magical replace operations on the content of
>> the dump. I don't think this is going to work reliable.
>>
>> 3. More reliable would be to dump the database, restore it on a
>> different server, run some sql script which randomize some data, and
>> dump it again. hmm, seems to be the only reliable way so far. But it is
>> no fun when dumping and restoring takes an hour.
>>
>> Does anybody has a better idea how to achieve an anonymized database dump?
>
> I highly recommend #3.  It's how we do it where I work.
>
> At first it seems like a big, slow, complicated monster, but once you've
> built the tools and have it running reliably it's very nice.  Our system
> does the dumps overnight via cron (we have over 100 production databases)
> then changes the sensitive data, as well changing all the passwords to
> "password" so developers can easily log in as any account.  During the
> day, the developers have access to all the sanitized dump files and can
> use them to make as many testing databases as they need.  Yes, the data
> gets up to 24 hours out of date, but it's never been a problem for us.

Thanks for your response and your insights to your process. Sounds reasonable.

Regards
Janning

pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: WHERE IN (subselect) versus WHERE IN (1,2,3,)
Next
From: Florent THOMAS
Date:
Subject: Conditionnal validation for transaction