Thread: Anonymized database dumps

Anonymized database dumps

From
Janning Vygen
Date:
Hi,

I am working on postgresql 9.1 and loving it!

Sometimes we need a full database dump to test some performance issues
with real data.

Of course we don't want sensitive data, like bunches of e-mail
addresses, on our development machines; it is of no interest to
developers and should be kept secure.

So we need an anonymized database dump. I thought about a few ways to
achieve this.

1. The best solution would be a special db user and some rules which
fire on reads of certain tables and replace private data with random
data. A dump taken as this special user would then not copy the
sensitive data at all. The user simply has a different view of the
database, even when he calls pg_dump.

But as rules are not fired on COPY, this can't work, right?

2. The other solution I can think of is something like

pg_dump | sed > pgdump_anon

where 'sed' does a lot of magical replace operations on the content of
the dump. I don't think this is going to work reliably.
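
For what it's worth, the core of option 2 fits in a few lines. The regex here is illustrative and far from exhaustive, and it will rewrite anything @-shaped anywhere in the dump (including data that only looks like an address), which is exactly why the approach is unreliable:

```shell
# Minimal sketch of option 2. The regex is illustrative only; it will
# rewrite anything that merely looks like an e-mail address.
anonymize() {
  sed -E 's/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+/anonymous@example.org/g'
}

# Intended use: pg_dump mydb | anonymize > pgdump_anon
echo 'vygen@kicktipp.de placed an order' | anonymize
# prints: anonymous@example.org placed an order
```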

3. More reliable would be to dump the database, restore it on a
different server, run some SQL script which randomizes some data, and
dump it again. Hmm, that seems to be the only reliable way so far. But
it is no fun when dumping and restoring takes an hour.

Does anybody have a better idea how to achieve an anonymized database dump?

regards
Janning





--
Kicktipp GmbH

Venloer Straße 8, 40477 Düsseldorf
Sitz der Gesellschaft: Düsseldorf
Geschäftsführung: Janning Vygen
Handelsregister Düsseldorf: HRB 55639

http://www.kicktipp.de/

Re: Anonymized database dumps

From
Kiriakos Georgiou
Date:
I would store sensitive data encrypted in the database.  Check the pgcrypto module.

Kiriakos


On Mar 18, 2012, at 1:00 PM, Janning Vygen wrote:



Re: Anonymized database dumps

From
Janning Vygen
Date:
pgcrypto does not work for this scenario as far as I know.

pgcrypto enables me to encrypt my data and lets only a user with the
right password (or key or whatever) decrypt it, right? So if I run it in
a test environment without this password, the application is broken.

I still want to use these table columns in my test environment, but
instead of real email addresses I want addresses like
random_number@example.org.
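
As a rough sketch (all table and column names here are invented), the kind of statement that would produce such addresses on a restored copy, keeping values unique via the primary key, is:

```sql
-- Run on the restored test copy, never on production.
-- "users", "id" and "email" are hypothetical names.
UPDATE users
SET    email = 'user_' || id || '@example.org';
```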

You might be right that it is a good idea to additionally encrypt this data.

regards
Janning

On 19.03.2012 06:24, Kiriakos Georgiou wrote:


Re: Anonymized database dumps

From
hari.fuchs@gmail.com
Date:
Janning Vygen <vygen@kicktipp.de> writes:


Maybe you could change your application so that it doesn't access the
critical tables directly, and instead define views for them which, based
on current_user, either do decryption or return random strings.
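
A sketch of that idea, with every name invented (a users table holding a pgcrypto-encrypted email_enc column, and a production role named app_prod):

```sql
-- Requires the pgcrypto extension for pgp_sym_decrypt().
-- All table, column, and role names are hypothetical.
CREATE VIEW users_v AS
SELECT id,
       CASE WHEN current_user = 'app_prod'
            THEN pgp_sym_decrypt(email_enc, 'production-passphrase')
            ELSE 'user_' || id || '@example.org'
       END AS email
FROM   users;
```

The application then reads users_v instead of users, and only the production role ever sees the cleartext.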

Re: Anonymized database dumps

From
Marko Kreen
Date:
On Mon, Mar 19, 2012 at 10:12:01AM +0100, hari.fuchs@gmail.com wrote:

Encryption is the wrong tool for "anonymization".

The right tool is hmac(), which gives you a one-way hash that is
protected by a key, meaning the other side can't even
calculate the hashes unless they have the same key.

You can calculate it with pgcrypto when dumping,
or later when post-processing the dumps.

But it produces random-looking values; if you need something
realistic-looking you need custom mapping logic.
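
For illustration, with a made-up key, pgcrypto's hmac() can even produce a stable fake address, so that equal inputs still map to equal outputs across the whole dump:

```sql
-- Requires the pgcrypto extension. 'dump-secret' is a made-up key;
-- the same input and key always yield the same digest, but without
-- the key the mapping can be neither recomputed nor reversed.
SELECT encode(hmac('vygen@kicktipp.de', 'dump-secret', 'sha256'), 'hex')
       || '@example.org' AS anonymized_email;
```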

--
marko


Re: Anonymized database dumps

From
Bill Moran
Date:
In response to Janning Vygen <vygen@kicktipp.de>:

I highly recommend #3.  It's how we do it where I work.

At first it seems like a big, slow, complicated monster, but once you've
built the tools and have it running reliably it's very nice.  Our system
does the dumps overnight via cron (we have over 100 production
databases), then changes the sensitive data, as well as changing all the
passwords to "password" so developers can easily log in as any account.
During the day, the developers have access to all the sanitized dump
files and can use them to make as many testing databases as they need.
Yes, the data gets up to 24 hours out of date, but it's never been a
problem for us.
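
To illustrate (all names invented), the sanitizing script that runs between the restore and the second dump in such a setup might contain statements along these lines:

```sql
-- Run on the restored copy before re-dumping it for developers.
-- crypt()/gen_salt() come from pgcrypto; names are hypothetical.
UPDATE users SET email = 'user_' || id || '@example.org';
UPDATE users SET password_hash = crypt('password', gen_salt('bf'));
```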

--
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/

Re: Anonymized database dumps

From
Janning Vygen
Date:
On 19.03.2012 at 13:22, Bill Moran <wmoran@potentialtech.com> wrote:


Thanks for your response and your insight into your process. Sounds reasonable.

Regards
Janning

Re: Anonymized database dumps

From
Kiriakos Georgiou
Date:
The data anonymizer process is flawed because you are one misstep away
from data spillage.  Sensitive data should be stored encrypted to begin
with.  For test databases, you or your developers can invoke a process
that replaces the real encrypted data with fake encrypted data (for
which everybody has the key/password).  Or, if the overhead is too much
(i.e. billions of rows), you can have different decrypt() routines on
your test databases that return fake data without touching the real
encrypted columns.
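
A sketch of the fake-decrypt variant, with an invented name and signature; the point is only that the test-database routine mirrors the real one's interface while ignoring the ciphertext entirely:

```sql
-- Installed only on test databases, in place of the real routine.
-- Name and signature are hypothetical; $2 is the user id.
CREATE FUNCTION decrypt_email(ciphertext bytea, user_id integer)
RETURNS text
LANGUAGE sql
AS $$ SELECT 'user_' || $2 || '@example.org' $$;
```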

Kiriakos

On Mar 19, 2012, at 8:22 AM, Bill Moran wrote:



Re: Anonymized database dumps

From
Bill Moran
Date:
In response to Kiriakos Georgiou <kg.postgresql@olympiakos.com>:

> The data anonymizer process is flawed because you are one misstep away from data spillage.

In our case, it's only one layer.

Other layers that exist:
* The systems where this test data is instantiated can't send email
* The systems where this exist have limited access (i.e., not all
  developers can access it, and it's not used for typical testing --
  only for specific testing that requires production-like data)

You are correct, however, in that there's always the danger of
spillage if new sensitive data is added and the sanitization script
is not properly updated.  It's part of the ongoing overhead of
maintaining such a system.

> Sensitive data should be stored encrypted to begin with.  For test
> databases you or your developers can invoke a process that replaces the
> real encrypted data with fake encrypted data (for which everybody has
> the key/password.)  Or if the overhead is too much (ie billions of
> rows), you can have different decrypt() routines on your test databases
> that return fake data without touching the real encrypted columns.

The thing is, this process has the same potential data spillage
issues as sanitizing the data.  I find it intriguing, however, and
I'm going to see if there are places where this approach might
have advantages over our current one.

Since much of our sensitive data is already de-identified, it
provides an additional level of protection on that level as well.

--
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/

Re: Anonymized database dumps

From
Kiriakos Georgiou
Date:
On Mar 19, 2012, at 5:55 PM, Bill Moran wrote:

>
>> Sensitive data should be stored encrypted to begin with.  For test
>> databases you or your developers can invoke a process that replaces
>> the real encrypted data with fake encrypted data (for which everybody
>> has the key/password.)  Or if the overhead is too much (ie billions of
>> rows), you can have different decrypt() routines on your test
>> databases that return fake data without touching the real encrypted
>> columns.
>
> The thing is, this process has the same potential data spillage
> issues as sanitizing the data.


Not really; in the modality I describe, the sensitive data is always
encrypted in the database and "useless", because nobody will have the
private key or know the password that protects it other than the ops
subsystems that require access.  So even if you take an ops dump, load
it to a test box, and walk away, you are good.  If your
developers/testers want to play with the data, they will be forced to
overwrite and "stage" test encrypted data they can decrypt, or call a
"fake" decrypt() that gives them test data (e.g. joins to a test data
table).

Kiriakos