From: Robert Haas
Subject: Re: block-level incremental backup
Msg-id: CA+TgmoZsvpGJOx8U9euCqcJ2zTC=N7csk0RJOfT+6Wd0O9kEwQ@mail.gmail.com
In response to: Re: block-level incremental backup (Stephen Frost <sfrost@snowman.net>)
List: pgsql-hackers
On Thu, Apr 18, 2019 at 6:39 PM Stephen Frost <sfrost@snowman.net> wrote:
> Where is the client going to get the threshold LSN from?
>
> If it doesn't have access to the old backup, then I'm a bit confused as
to how an incremental backup would be possible?  Isn't that a requirement
> here?

I explained this in the very first email that I wrote on this thread,
and then wrote a very extensive further reply on this exact topic to
Peter Eisentraut.  It's a bit disheartening to see you arguing against
my ideas when it's not clear that you've actually read and understood
them.

> > The obvious way of extending this system to parallel backup is to have
> > N connections each streaming a separate tarfile such that when you
> > combine them all you recreate the original data directory.  That would
> > be perfectly compatible with what I'm proposing for incremental
> > backup.  Maybe you have another idea in mind, but I don't know what it
> > is exactly.
>
> So, while that's an obvious approach, it isn't the most sensible, and
> we know that from experience in actually implementing parallel backup of
> PG files.  I'm happy to discuss the approach we use in pgBackRest if
> you'd like to discuss this further, but it seems a bit far afield from
> the topic of discussion here and it seems like you're not interested or
> offering to work on supporting parallel backup in core.

If there's some way of modifying my proposal so that it makes life
better for external backup tools, I'm certainly willing to consider
that, but you're going to have to tell me what you have in mind.  If
that means describing what pgbackrest does, then do it.
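
Just to be concrete about what "combine them all" amounts to on the
restore side in my proposal: it's nothing more than extracting each of
the N archives into the same target directory.  A rough Python sketch,
with made-up file names:

    import tarfile

    # Each of the N parallel connections streams its own tarfile;
    # extracting them all into one target directory recreates the
    # original data directory, since their members are disjoint.
    def combine_tar_streams(tar_paths, target_dir):
        for path in tar_paths:
            with tarfile.open(path) as tar:
                tar.extractall(path=target_dir)

    combine_tar_streams(["base.0.tar", "base.1.tar", "base.2.tar"],
                        "/restore/pgdata")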

My concern here is that you seem to want a lot of complicated stuff
that will require *significant* setup in order for people to be able
to use it.  From what I am able to gather from your remarks so far,
you think people should archive their WAL to a separate machine, and
then the WAL-summarizer should run there, and then data from that
should be fed back to the backup client, which should then give the
server a list of modified files (and presumably, someday, blocks) and
the server then returns that data, which the client then
cross-verifies with checksums and awesome sauce.
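
Spelled out, with every step stubbed (none of these helpers exist
anywhere, in core or elsewhere; this is just to make the number of
moving parts concrete), that pipeline looks roughly like this:

    # Deliberately rough sketch of the multi-machine flow described
    # above.  Every function is a stand-in.
    def summarize_wal(archive_dir):
        # runs on the WAL-archive machine; pretend each summary covers
        # an LSN range and lists the files modified in it
        return [("0/1000000", "0/2000000", {"base/16384/16385"})]

    def modified_files_since(summaries, threshold_lsn):
        # computed by the backup client from the fed-back summaries
        return sorted(set().union(*(files for _, _, files in summaries)))

    def fetch_from_server(file_list):
        # the server returns the data for the listed files
        return {name: b"..." for name in file_list}

    def verify(data):
        # client cross-checks checksums
        return True

    summaries = summarize_wal("/archive/wal")
    wanted = modified_files_since(summaries, "0/1000000")
    assert verify(fetch_from_server(wanted))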

Which is all fine, but actually requires quite a bit of set-up and
quite a bit of buy-in to the tool.  And I have no problem with people
having that level of buy-in to the tool.  EnterpriseDB offers a number
of tools which require similar levels of setup and configuration, and
it's not inappropriate for an enterprise-grade backup tool to have all
that stuff.  However, for those who may not want to do all that, my
original proposal lets you take an incremental backup by doing the
following list of steps:

1. Take an incremental backup.

If you'd like, you can also:

0. Enable the WAL-scanning background worker to make incremental
backups much faster.
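
To make step 1 concrete, the invocation might boil down to something
like the following; the --lsn option name is made up here, since the
exact UI is part of what's under discussion:

    import subprocess

    # Hypothetical invocation.  The point is that this is the entire
    # workflow: no WAL archive, no access to any previous backup,
    # nothing beyond core tools.
    subprocess.run(
        ["pg_basebackup", "-D", "/backups/incr-2019-04-22",
         "--lsn=0/16B3F80"],  # start-of-backup LSN of the previous backup
        check=True,
    )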

You do not need a WAL archive, and you do not need EITHER the backup
tool or the server to have access to previous backups, and you do not
need the client to have any access to archived WAL or the summary
files produced from it.  The only thing you need to know is the
start-of-backup LSN of the previous backup.
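
And if the previous backup does happen to be on hand, that LSN is
already recorded in its backup_label file, on a line like
"START WAL LOCATION: 0/2000028 (file 000000010000000000000002)";
otherwise you just note it down when the backup is taken.  A sketch of
pulling it out:

    import re

    def start_lsn_from_backup_label(path):
        with open(path) as f:
            for line in f:
                m = re.match(r"START WAL LOCATION: ([0-9A-Fa-f]+/[0-9A-Fa-f]+)",
                             line)
                if m:
                    return m.group(1)
        raise ValueError("no START WAL LOCATION line in " + path)

    threshold = start_lsn_from_backup_label("/backups/full/backup_label")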

I expect you to reply with a long complaint about how my proposal is
totally inadequate, but actually I think for most people, most of the
time, it would not only be adequate, but extremely convenient.  And
despite your protestations to the contrary, it does not block
parallelism, checksum verification, or any other cool features that
somebody may want to add later.  It'll work just fine with those
things.

And for the record, I am willing to put some effort into parallelism.
I just think that it makes more sense to do the incremental part
first.  I think that incremental backup is likely to have less effect
on parallel backup than the other way around.  What I'm NOT willing to
do is build a whole bunch of infrastructure that will help pgbackrest
do amazing things but will not provide a simple and convenient way of
taking incremental backups using only core tools.  I do care about
having something that's good for pgbackrest and other out-of-core
tools.  I just care about it MUCH LESS than I care about making
PostgreSQL core awesome.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


