Thread: Backup advice

Backup advice

From
Johann Spies
Date:
I would appreciate some advice from the experts on this list about the best backup strategy for my database.

The setup:

Size: might be about 200Gb
The server uses a Tivoli backup client with daily backup
At the moment there are pg_dumps for each database on the server on a daily, weekly and monthly basis.  All this gets backed up to the Tivoli server.

I would like to reduce the size of the backups as far as possible as we have to pay for space on the backup server and I do not want any downtime on the database server.

I do not want replication, but want to be able to restore the database as recent as possible (at least as it was within the past 24 hours) as quickly as possible.

I have read about using pg_basebackup in an article from Shaun Thomas' booklet on Packt Publishers (I will probably buy the booklet). That seems to be a possible solution.

I am considering dropping the pg_dumps in favour of the pg_basebackup method.  Will that be wise?

Regards
Johann
--
Because experiencing your loyal love is better than life itself,
my lips will praise you.  (Psalm 63:3)

Re: Backup advice

From
Birta Levente
Date:
On 08/04/2013 16:14, Johann Spies wrote:
> I would appreciate some advice from the experts on this list about the
> best backup strategy for my database.
>
> The setup:
>
> Size: might be about 200Gb
> The server uses a Tivoli backup client with daily backup
> At the moment There are pg_dumps for each database on the server on a
> daily, weekly and monthly basis.  All this gets backed up to the Tivoli
> server.
>
> I would like to reduce the size of the backups as far as possible as we
> have to pay for space on the backup server and I do not want any
> downtime on the database server.
>
> I do not want replication, but want to be able to restore the database
> as recent as possible (at least as it was within the past 24 hours) as
> quickly as possible.
>
> I have read about using pg_basebackup in an article from Shaun Thomas'
> booklet on Packt Publishers (I will probably buy the booklet). That
> seems to be a possible solution.
>
> I am considering dropping the pg_dumps in favour of the pg_basebackup
> method.  Will that be wise?
>
> Regards
> Johann
> --
> Because experiencing your loyal love is better than life itself,
> my lips will praise you.  (Psalm 63:3)
>

You should read this:
http://www.postgresql.org/docs/9.2/static/continuous-archiving.html


And you have to decide which method fits best for you. I dropped the
dump method because of the size of the dumps: in my case, a weekly base
backup plus the WAL archives is considerably smaller than the dumps.

But note that with PITR you can only restore the whole cluster; you
cannot restore just one database.

And hey ... with PITR you can restore the cluster to any specified
moment or transaction, not only to the time of a dump or base backup.
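
Roughly, the pieces involved look like this (a minimal sketch for 9.2;
the /backup paths and the weekly schedule are just placeholders):

# postgresql.conf: turn on WAL archiving
wal_level = archive
archive_mode = on
archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'

# weekly base backup, taken while the server keeps running
pg_basebackup -D /backup/base/$(date +%Y%m%d) -Ft -z -X fetch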

Levi


Re: Backup advice

From
Ian Lawrence Barwick
Date:
2013/4/8 Johann Spies <johann.spies@gmail.com>:
> I would appreciate some advice from the experts on this list about the best
> backup strategy for my database.

(...)
>
> I have read about using pg_basebackup in an article from Shaun Thomas'
> booklet on Packt Publishers (I will probably buy the booklet).

Get the booklet, it's worth it.

Regards

Ian Barwick


Re: Backup advice

From
Shaun Thomas
Date:
On 04/08/2013 08:14 AM, Johann Spies wrote:

> Size: might be about 200Gb
> The server uses a Tivoli backup client with daily backup
> At the moment There are pg_dumps for each database on the server on a
> daily, weekly and monthly basis.  All this gets backed up to the Tivoli
> server.

Ok. So far, so good.

> I have read about using pg_basebackup in an article from Shaun
> Thomas' booklet on Packt Publishers (I will probably buy the
> booklet). That seems to be a possible solution.

Ok, with pg_basebackup, you'll get a binary backup of the actual data
files involved in your database cluster. This will, in almost every
case, be larger than a pg_dump, and take about the same amount of time
to produce. You also won't be able to keep the per-database layout you
are currently sending to your Tivoli software, since pg_basebackup
works on the entire installation instead of each individual database.

One benefit, as Birta pointed out, is that you could use this backup as
a base, and apply WAL / transaction logs instead, and those are
generally smaller if your database doesn't see a lot of daily overhead.
Unfortunately if you do a weekly base, and need to recover far into the
week, that can be a time-consuming process.

Then again, so can restoring a pg_dump of a 200GB cluster, thanks to
the index creation steps.

One thing you might want to consider, is that 9.3 will presumably have
parallel pg_dump to complement parallel pg_restore. This would greatly
reduce the amount of time a full dump->restore cycle requires.
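
As a rough sketch (the database name and path here are placeholders),
the directory format is what makes the parallelism possible:

# hypothetical example: dump and restore with 4 parallel jobs (9.3+)
pg_dump -Fd -j 4 -f /backup/mydb.dump mydb
pg_restore -j 4 -d mydb /backup/mydb.dump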

Aside from that, you're already likely using the smallest backup you can
have. Backing up a 600GB database takes about 120GB for us using tar,
which is close to what pg_basebackup would give you in terms of size. A
compressed pg_dump of the same data is around 50GB.

Still, we use the binaries, because we need uptime more than size, and
they are much faster to restore. We have daily binary backups going back
over four years, because the backups are probably the most important
thing in the company. Lose those under bad circumstances, and we'd
literally be out of business.

If size really is that important, you might want to check what kind of
compression is going on with Tivoli. If you can produce a pg_dump and
compress it with bzip2, or a utility such as lrzip that makes use of
better compression algorithms like LZMA, you may be able to back up much
smaller files than your current process.

I'll warn you, though. Even parallel compression methods like lbzip2 are
much slower than something like parallel gzip (pigz). You'll get a
20-30% smaller file at the cost of a 4-8x slower backup process. Since
your data is likely to grow from its current size of 200GB, it's
something to consider.
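
For example, piping a plain dump through the two compressors looks
something like this (database name, thread counts and paths are
placeholders):

# fast, larger output: parallel gzip
pg_dump -Fp mydb | pigz -p 8 > /backup/mydb.sql.gz

# slower, roughly 20-30% smaller: parallel bzip2
pg_dump -Fp mydb | lbzip2 -n 8 > /backup/mydb.sql.bz2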

Anyone else?

I'm so used to using basic utilities, I know I haven't covered things
like deduplication backup solutions. It may be that Tivoli isn't right
for this, but I'm not familiar with that software.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email


Re: Backup advice

From
Jeff Janes
Date:
On Mon, Apr 8, 2013 at 6:14 AM, Johann Spies <johann.spies@gmail.com> wrote:
I would appreciate some advice from the experts on this list about the best backup strategy for my database.

The setup:

Size: might be about 200Gb
The server uses a Tivoli backup client with daily backup
At the moment There are pg_dumps for each database on the server on a daily, weekly and monthly basis.  All this gets backed up to the Tivoli server.

Hi Johann,

This backup pattern means that after a year of operation you will have about 23 pg_dump files, 90+% of which have nothing to do with your requirement to restore the database to within the last 24 hours.

In other words, your storage needs are being driven by your historical retention policy, not your immediate restoration policy.  So, can you elaborate on your historical retention policy?  For example, if you need to restore your database to the state it was in 9 months ago, how fast do you need to be able to do that?  If you had a 12 month old pg_basebackup and 3 months of log files, how long would it take to replay those and how big would that many log files be?  (Both of those questions depend on your specific usage, so it is hard to make general guesses about those--they are questions that can only be answered empirically.)
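
To give an idea of the mechanics, replaying archived WAL on top of an old base backup is driven by a recovery.conf roughly like this (the paths and the target timestamp are placeholders):

# recovery.conf in the restored data directory (pre-9.3 style)
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = '2013-01-15 00:00:00'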

pg_basebackup will almost surely be larger than pg_dumps.  For one thing, it contains all the index data; for another, it contains any obsolete rows which have not yet been vacuumed and their space reused.  So switching to that will save you space only if you need to keep fewer of them than you do of the pg_dumps.
 
Cheers,

Jeff

Re: Backup advice

From
Eduardo Morras
Date:
On Mon, 8 Apr 2013 10:40:16 -0500
Shaun Thomas <sthomas@optionshouse.com> wrote:

>
> Anyone else?
>

If his db has low inserts/updates/deletes he can use diff between pg_dumps (with default -Fp) before compressing.
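
Something like this, for instance (file names are placeholders; you keep one full dump and store only the deltas):

# weekly full plain-text dump
pg_dump -Fp mydb > full.sql

# daily dump, kept only as a diff against the full one
pg_dump -Fp mydb > today.sql
diff full.sql today.sql > today.diff
pigz today.diff

# to rebuild today's dump later (after decompressing the diff)
patch -o today.sql full.sql < today.diff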

---   ---
Eduardo Morras <emorrasg@yahoo.es>


Re: Backup advice

From
Johann Spies
Date:
Thanks everybody for your valuable comments.

I can specify how many versions of the files should be kept on Tivoli.

The database will regularly get new data and there is a continuous process of data cleaning. It is a database mainly for research purposes and a few researchers are using it.

I will explore the options mentioned. After a first read it looks like continuing with pg_dump is not a bad idea, maybe with some optimization (like using diffs).

Regards
Johann

On 9 April 2013 12:05, Eduardo Morras <emorrasg@yahoo.es> wrote:
On Mon, 8 Apr 2013 10:40:16 -0500
Shaun Thomas <sthomas@optionshouse.com> wrote:

>
> Anyone else?
>

If his db has low inserts/updates/deletes he can use diff between pg_dumps (with default -Fp) before compressing.

---   ---
Eduardo Morras <emorrasg@yahoo.es>





--
Because experiencing your loyal love is better than life itself,
my lips will praise you.  (Psalm 63:3)

Re: Backup advice

From
Gabriele Bartolini
Date:
 Hi Johann,

 On Wed, 10 Apr 2013 09:58:05 +0200, Johann Spies
 <johann.spies@gmail.com> wrote:
> I can specify how many versions of the files should be kept on
> Tivoli.

 Another option you can evaluate is Barman (www.pgbarman.org), with its
 backup catalogues, retention policies and archiving. We use it in some
 contexts with Tivoli (currently only through file system backup, but I
 guess that's what you do anyway).

 However, the approach is totally different and is based on physical
 backups and continuous archiving, allowing you to perform point in time
 recovery as well.

 Maybe it is worth evaluating it.
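
 Day-to-day usage is fairly simple; roughly (the server name "main" is
 just a placeholder for whatever you configure):

 # take a new base backup of the configured server
 barman backup main

 # list what the backup catalogue holds
 barman list-backup main

 # restore the latest backup (plus archived WAL) to a directory
 barman recover main latest /path/to/restore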

 Cheers,
 Gabriele
--
  Gabriele Bartolini - 2ndQuadrant Italia
  PostgreSQL Training, Services and Support
  Gabriele.Bartolini@2ndQuadrant.it - www.2ndQuadrant.it


Re: Backup advice

From
Jeff Janes
Date:
On Tue, Apr 9, 2013 at 3:05 AM, Eduardo Morras <emorrasg@yahoo.es> wrote:
On Mon, 8 Apr 2013 10:40:16 -0500
Shaun Thomas <sthomas@optionshouse.com> wrote:

>
> Anyone else?
>

If his db has low inserts/updates/deletes he can use diff between pg_dumps (with default -Fp) before compressing.

Most "diff" implementations will read the entirety of both files into memory, so may not work well with 200GB of data, unless it is broken into a large number of much smaller files.

open-vcdiff only reads one of the files into memory, but I couldn't really figure out what happens memory-wise when you try to undo the resulting patch; the documentation is a bit mysterious.

xdelta3 will "work" on streamed files of unlimited size, but it doesn't work very well unless the files fit in memory, or have the analogous data in the same order between the two files.

A while ago I made some attempts to "co-compress" dump files, based on the notion that the pg_dump text format does not have \n within records, so it is sortable as ordinary text, and that tables usually have their "stable" columns, like a pk, near the beginning and volatile columns near the end. Sorting the lines of several dump files together therefore gathers duplicate or near-duplicate lines together, where ordinary compression algorithms can work their magic.  So if you tag each line with its line number and which file it originally came from, then sort the lines (skipping the tag), you get much better compression.  But not nearly as good as open-vcdiff, assuming you have the RAM to spare.

Using two dumps taken months apart on a slowly-changing database, it worked fairly well:

cat 1.sql | pigz |wc -c
329833147

cat 2.sql | pigz |wc -c
353716759

cat 1.sql 2.sql | pigz |wc -c
683548147

sort -k2 <(perl -lne 'print "${.}a\t$_"' 1.sql) <(perl -lne 'print "${.}b\t$_"' 2.sql) | pigz |wc -c
436350774

A certain file could be recovered by, for example:

zcat group_compressed.gz |sort -n|perl -lne 's/^(\d+b\t)// and print' > 2.sql2

There are all kinds of shortcomings here, of course; it was just a quick and dirty proof of concept.

For now I think storage is cheap enough for what I need to do that this is not worth fleshing out any more.

Cheers,

Jeff

Re: Backup advice

From
Eduardo Morras
Date:
On Mon, 15 Apr 2013 19:54:15 -0700
Jeff Janes <jeff.janes@gmail.com> wrote:

> On Tue, Apr 9, 2013 at 3:05 AM, Eduardo Morras <emorrasg@yahoo.es> wrote:
>
> > On Mon, 8 Apr 2013 10:40:16 -0500
> > Shaun Thomas <sthomas@optionshouse.com> wrote:
> >
> > >
> > > Anyone else?
> > >
> >
> > If his db has low inserts/updates/deletes he can use diff between pg_dumps
> > (with default -Fp) before compressing.
> >
>
> Most "diff" implementations will read the entirety of both files into
> memory, so may not work well with 200GB of data, unless it is broken into a
> large number of much smaller files.
>
> open-vcdiff only reads one of the files into memory, but I couldn't really
> figure out what happens memory-wise when you try to undo the resulting
> patch, the documentation is a bit mysterious.
>
> xdelta3 will "work" on streamed files of unlimited size, but it doesn't
> work very well unless the files fit in memory, or have the analogous data
> in the same order between the two files.

I use for my 12-13 GB dump files:

git diff -p 1.sql 2.sql > diff.patch


It uses 4MB for the first phase and up to 140MB in the last one, and produces a patch file from which the newer dump can be recovered with:

patch -o 2.sql 1.sql < diff.patch

or using git apply.

> A while ago I did some attempts to "co-compress" dump files, based on the
> notion that the pg_dump text format does not have \n within records so it
> is sortable as ordinary text, and that usually tables have their "stable"
> columns, like a pk, near the beginning of the table and volatile columns
> near the end, so that sorting the lines of several dump files together will
> gather replicate or near-replicate lines together where ordinary
> compression algorithms can work their magic.  So if you tag each line with
> its line number and which file it originally came from, then sort the lines
> (skipping the tag), you get much better compression.  But not nearly as
> good as open-vcdiff, assuming you have the RAM to spare.
>
> Using two dumps taken months apart on a slowly-changing database, it worked
> fairly well:
>
> cat 1.sql | pigz |wc -c
> 329833147
>
> cat 2.sql | pigz |wc -c
> 353716759
>
> cat 1.sql 2.sql | pigz |wc -c
> 683548147
>
> sort -k2 <(perl -lne 'print "${.}a\t$_"' 1.sql) <(perl -lne 'print
> "${.}b\t$_"' 2.sql) | pigz |wc -c
> 436350774
>
> A certain file could be recovered by, for example:
>
> zcat group_compressed.gz |sort -n|perl -lne 's/^(\d+b\t)// and print' >
> 2.sql2

Be careful: some z* utilities (zdiff, for example) decompress the whole file to /tmp.

> There all kinds of short-comings here, of course, it was just a quick and
> dirty proof of concept.

A nice one!

> For now I think storage is cheap enough for what I need to do to make this
> not worth fleshing it out any more.
>
> Cheers,
>
> Jeff


---   ---
Eduardo Morras <emorrasg@yahoo.es>