Re: Backup advice - Mailing list pgsql-general

From Eduardo Morras
Subject Re: Backup advice
Date
Msg-id 20130416135013.94873e45a8ee1365f28d2554@yahoo.es
In response to Backup advice  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-general
On Mon, 15 Apr 2013 19:54:15 -0700
Jeff Janes <jeff.janes@gmail.com> wrote:

> On Tue, Apr 9, 2013 at 3:05 AM, Eduardo Morras <emorrasg@yahoo.es> wrote:
>
> > On Mon, 8 Apr 2013 10:40:16 -0500
> > Shaun Thomas <sthomas@optionshouse.com> wrote:
> >
> > >
> > > Anyone else?
> > >
> >
> > If his db has low inserts/updates/deletes he can use diff between pg_dumps
> > (with default -Fp) before compressing.
> >
>
> Most "diff" implementations will read the entirety of both files into
> memory, so may not work well with 200GB of data, unless it is broken into a
> large number of much smaller files.
>
> open-vcdiff only reads one of the files into memory, but I couldn't really
> figure out what happens memory-wise when you try to undo the resulting
> patch, the documentation is a bit mysterious.
>
> xdelta3 will "work" on streamed files of unlimited size, but it doesn't
> work very well unless the files fit in memory, or have the analogous data
> in the same order between the two files.

For my 12-13 GB dump files I use:

git diff -p 1.sql 2.sql > diff.patch


It uses 4 MB for the first phase and up to 140 MB for the last one, and produces a patch file from which 2.sql can be recovered with:

patch -o 2.sql 1.sql < diff.patch

or using git apply.
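
Spelled out end to end, the cycle is roughly as follows ('mydb' stands in for
the real database name, and any compressor works in place of xz):

pg_dump -Fp mydb > 2.sql                 # today's plain-text dump
git diff -p 1.sql 2.sql > diff.patch     # delta against the previous dump
xz diff.patch                            # keep only the compressed patch

# later, rebuild 2.sql from the previous dump plus the patch:
unxz -k diff.patch.xz
patch -o 2.sql 1.sql < diff.patch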

> A while ago I did some attempts to "co-compress" dump files, based on the
> notion that the pg_dump text format does not have \n within records so it
> is sortable as ordinary text, and that usually tables have their "stable"
> columns, like a pk, near the beginning of the table and volatile columns
> near the end, so that sorting the lines of several dump files together will
> gather replicate or near-replicate lines together where ordinary
> compression algorithms can work their magic.  So if you tag each line with
> its line number and which file it originally came from, then sort the lines
> (skipping the tag), you get much better compression.  But not nearly as
> good as open-vcdiff, assuming you have the RAM to spare.
>
> Using two dumps taken months apart on a slowly-changing database, it worked
> fairly well:
>
> cat 1.sql | pigz |wc -c
> 329833147
>
> cat 2.sql | pigz |wc -c
> 353716759
>
> cat 1.sql 2.sql | pigz |wc -c
> 683548147
>
> sort -k2 <(perl -lne 'print "${.}a\t$_"' 1.sql) <(perl -lne 'print
> "${.}b\t$_"' 2.sql) | pigz |wc -c
> 436350774
>
> A certain file could be recovered by, for example:
>
> zcat group_compressed.gz |sort -n|perl -lne 's/^(\d+b\t)// and print' >
> 2.sql2

Be careful: some z* utilities, such as zdiff, decompress the whole file to /tmp.

> There all kinds of short-comings here, of course, it was just a quick and
> dirty proof of concept.

A nice one!
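
For anyone who wants to try it, here is the round trip from your mail written
out in full (the output and recovered file names come from your examples):

# tag each line with its number plus a per-file marker, sort by content, compress
sort -k2 <(perl -lne 'print "${.}a\t$_"' 1.sql) \
         <(perl -lne 'print "${.}b\t$_"' 2.sql) | pigz > group_compressed.gz

# recover 2.sql: keep only the "b" lines, restore their order, strip the tag
zcat group_compressed.gz | sort -n | perl -lne 's/^\d+b\t// and print' > 2.sql2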

> For now I think storage is cheap enough for what I need to do to make this
> not worth fleshing it out any more.
>
> Cheers,
>
> Jeff


---   ---
Eduardo Morras <emorrasg@yahoo.es>

