Re: Proposal: Incremental Backup - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Proposal: Incremental Backup
Date
Msg-id CAB7nPqQvpg6ETg4oxcCsFFAQjP5kL3WcuYv60jLtzUxf1=Q5Gg@mail.gmail.com
In response to Proposal: Incremental Backup  (Marco Nenciarini <marco.nenciarini@2ndquadrant.it>)
Responses Re: Proposal: Incremental Backup  (Marco Nenciarini <marco.nenciarini@2ndquadrant.it>)
List pgsql-hackers
On Fri, Jul 25, 2014 at 10:14 PM, Marco Nenciarini
<marco.nenciarini@2ndquadrant.it> wrote:
> 0. Introduction:
> =================================
> This is a proposal for adding incremental backup support to streaming
> protocol and hence to pg_basebackup command.
I am not sure that "incremental" is the right word, as the existing
backup methods using WAL archives already work that way. I recall
others calling this "differential backup" in some previous threads.
Would that sound better?

> 1. Proposal
> =================================
> Our proposal is to introduce the concept of a backup profile.
Sounds good. Thanks for looking at that.

> The backup
> profile consists of a file with one line per file detailing tablespace,
> path, modification time, size and checksum.
> Using that file the BASE_BACKUP command can decide which file needs to
> be sent again and which is not changed. The algorithm should be very
> similar to rsync, but since our files are never bigger than 1 GB per
> file that is probably granular enough not to worry about copying parts
> of files, just whole files.
There are actually two levels of differential backup: file-level,
which is the approach you are taking, and block-level. A block-level
backup requires scanning all the blocks of all the relations and
taking only the data from the blocks newer than the LSN given by the
BASE_BACKUP command. With the file-level approach, you could back up
the whole relation file as soon as you find at least one modified
block. Btw, the size of relation files depends on the size defined by
--with-segsize when running configure. 1GB is the default though, and
the value usually used. Differential backups can reduce the overall
size of backups depending on the application, at the cost of some CPU
to analyze the relation blocks that need to be included in the
backup.
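
To make the distinction concrete, here is a rough sketch in Python of
how both checks could look. It is only an illustration, not proposed
server code. It assumes the default 8kB block size, a little-endian
machine, and that a block counts as "newer" when the LSN stored in the
first 8 bytes of its page header (pd_lsn) is strictly greater than the
given LSN. The function names are made up for the example.

```python
import struct

BLCKSZ = 8192  # assumed: default PostgreSQL block size


def parse_lsn(text):
    """Convert an LSN written as 'X/Y' in hex (e.g. '0/2000060') to an int."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def page_lsn(page):
    """Read pd_lsn from the start of a page header.

    pd_lsn is stored as two 32-bit values (high part, then low part);
    little-endian byte order is an assumption of this sketch.
    """
    high, low = struct.unpack_from("<II", page, 0)
    return (high << 32) | low


def blocks_newer_than(segment, since_lsn):
    """Block-level check: yield every block number modified after since_lsn."""
    for blkno in range(len(segment) // BLCKSZ):
        page = segment[blkno * BLCKSZ:(blkno + 1) * BLCKSZ]
        if page_lsn(page) > since_lsn:
            yield blkno


def file_needs_backup(segment, since_lsn):
    """File-level check: stop scanning at the first modified block."""
    return next(blocks_newer_than(segment, since_lsn), None) is not None
```

The file-level check can stop early, while the block-level one always
has to read the whole segment.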

> It could also be used in 'refresh' mode, by allowing the pg_basebackup
> command to 'refresh' an old backup directory with a new backup.
I am not sure this is really helpful...

> The final piece of this architecture is a new program called
> pg_restorebackup which is able to operate on a "chain of incremental
> backups", allowing the user to build an usable PGDATA from them or
> executing maintenance operations like verify the checksums or estimate
> the final size of recovered PGDATA.
Yes, right. Taking a differential backup is not difficult. Rebuilding
a consistent base backup from a full base backup and a set of
differential ones is the tricky part, as you need to be sure that all
the pieces of the puzzle are there.

> We created a wiki page with all implementation details at
> https://wiki.postgresql.org/wiki/Incremental_backup
I had a look at that, and I think that you are missing the mark in
the way differential backups should be taken. What would be necessary
is to pass a WAL position (or LSN, log sequence number, like
0/2000060) with a new clause called DIFFERENTIAL (INCREMENTAL in your
first proposal) in the BASE_BACKUP command, and then have the server
report back to the client all the files that contain blocks newer than
the given LSN for the file-level backup, or the blocks newer than the
given LSN for the block-level differential backup.
Note that we would need a way to identify the type of the backup taken
in backup_label, along with the LSN position sent with the
DIFFERENTIAL clause of BASE_BACKUP, by adding a new field to it.

When taking a differential backup, the LSN position needed would
simply be the START WAL LOCATION value of the last differential or
full backup taken. This also results in a new option for
pg_basebackup, of the form --differential='0/2000060', to take a
differential backup directly.
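
As a small illustration, a client-side script could pull that value
out of the backup_label of the previous backup along these lines. This
is only a sketch: the helper names are made up, and --differential is
the option suggested in this mail, not something pg_basebackup
supports today.

```python
import re


def start_wal_location(backup_label_text):
    """Return the LSN from the 'START WAL LOCATION:' line of a backup_label."""
    match = re.search(
        r"^START WAL LOCATION: ([0-9A-Fa-f]+/[0-9A-Fa-f]+)",
        backup_label_text,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("no START WAL LOCATION line found")
    return match.group(1)


def differential_option(backup_label_text):
    """Build the hypothetical pg_basebackup option for the next backup."""
    return "--differential='%s'" % start_wal_location(backup_label_text)
```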

Then, for the pg_restorebackup utility, what you would need to do is
simply pass it a list of backups, validate that they can be combined
into a consistent backup, and build it.
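
The validation step might look roughly like the sketch below. The
Backup record and its fields are assumptions made for the example
(they stand for what backup_label would carry with the new field), and
the rule used is that a differential backup must not start from an LSN
later than the START WAL LOCATION of the backup it is applied on top
of, otherwise changes in between would be missing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Backup:
    name: str
    start_lsn: int            # START WAL LOCATION of this backup
    since_lsn: Optional[int]  # LSN given with DIFFERENTIAL, None for a full backup


def validate_chain(backups: List[Backup]) -> None:
    """Raise ValueError if the ordered list cannot form a consistent backup."""
    if not backups:
        raise ValueError("no backups given")
    if backups[0].since_lsn is not None:
        raise ValueError("%s: chain must start with a full backup" % backups[0].name)
    for prev, cur in zip(backups, backups[1:]):
        if cur.since_lsn is None:
            raise ValueError("%s: full backup in the middle of the chain" % cur.name)
        if cur.since_lsn > prev.start_lsn:
            # Changes between prev.start_lsn and cur.since_lsn are in neither backup.
            raise ValueError("%s: gap after %s" % (cur.name, prev.name))
```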

Btw, the file-based method would be simpler to implement, especially
for rebuilding the backups.

Regards,
-- 
Michael


