On Sat, Apr 22, 2023 at 4:33 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> Hi,
>
> On Sat, Apr 22, 2023 at 12:53:23PM -0400, Isaac Morland wrote:
> >
> > I have an application for this: creating various dev/test versions of data
> > from production.
> >
> > Start by restoring a copy of production from backup. Then successively
> > create several altered versions of the data and save them to a place where
> > developers can pick them up. For example, you might have one version which
> has all data older than 1 year deleted, and another where 99% of the
> > students/customers/whatever are deleted. Anonymization could also be
> > applied. This would give you realistic (because it ultimately originates
> > from production) test data.
> >
> > This could be done by starting a non-committable transaction, making the
> > adjustments, then doing a pg_dump in the same transaction (using --snapshot
> > to allow it to see that transaction). Then rollback, and repeat for the
> > other versions. This saves repeatedly restoring the (probably very large)
> > production data each time.
>
> There already are tools to handle those use cases. Look, for instance, at
> https://github.com/mla/pg_sample to backup a consistent subset of the data, or
> https://github.com/rjuju/pg_anonymize to transparently pg_dump (or
> interactively query) anonymized data.
>
> Both tools also work when connected to a physical standby, while trying to
> update the data before dumping it wouldn't.

Like everything in life, there are pros and cons to every approach.
pg_anonymize is an extension that may not be installed on the database
you're working with, and pg_sample (and similar utilities) may not have
a way to extract or sanitize the exact data you want.

With this feature built into Postgres, you wouldn't need any external
utilities or extensions. The benefit of a built-in feature is that users
can find ways of leveraging it in the future that we can't envision
today.
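
For concreteness, here is a rough sketch of the workflow Isaac
describes upthread. The UNCOMMITTABLE syntax is hypothetical (it stands
in for whatever form the proposed feature takes), and the table,
database, and file names are made up; pg_export_snapshot() and
pg_dump's --snapshot option do exist today.

    -- psql, connected to a freshly restored copy of production
    BEGIN UNCOMMITTABLE;  -- hypothetical: this transaction can only roll back
    DELETE FROM students WHERE enrolled_at < now() - interval '1 year';
    SELECT pg_export_snapshot();  -- returns e.g. '00000003-0000001B-1'

    -- in a shell, while the transaction above is still open
    $ pg_dump --snapshot=00000003-0000001B-1 -Fc -f dev_subset.dump proddb

    -- back in psql
    ROLLBACK;  -- the restored data is untouched; repeat with different
               -- adjustments for the next test-data variant
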
Best regards,
Gurjeet
http://Gurje.et