Re: pg_basebackup and snapshots - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: pg_basebackup and snapshots
Date
Msg-id 20200207211339.GN3195@tamriel.snowman.net
In response to Re: pg_basebackup and snapshots  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2020-02-07 14:56:47 -0500, Stephen Frost wrote:
> > * Andres Freund (andres@anarazel.de) wrote:
> > > Maybe that's looking too far into the future, but I'd like to see
> > > improvements to pg_basebackup that make it integrate with root requiring
> > > tooling, to do more efficient base backups. E.g. having pg_basebackup
> > > handle start/stop backup and WAL handling, but do the actual backup of
> > > the data via a snapshot mechanism (yes, one needs start/stop backup in
> > > the general case, for multiple FSs), would be nice.
> >
> > The challenge with this approach is that you need to drop the 'backup
> > label' file into place as part of this operation, either by putting it
> > into the snapshot after it's been taken, or by putting it into the data
> > directory at restore time.  Of course, you have to keep track of WAL
> > anyway from the time the snapshots are taken until the restore is done,
> > so it's certainly possible, as with all of this, it's just somewhat
> > complicated.
>
> It's not dead trivial, but also doesn't seem *that* hard to me compared
> to the other challenges of adding features like this?  How to best
> approach it I think depends somewhat on what exact type of backup
> (mainly whether to set up a new system or to make a PITR base backup)
> we'd want to focus on. And what kind of snapshotting system / what kind
> of target data store.

I'm also not sure that pg_basebackup is the right tool for this though,
really, given the complications and how it's somewhat beyond what
pg_basebackup's mandate is.  This isn't something you'd likely do
remotely, for example, due to the need to take the snapshot, mount the
snapshot, etc.  I don't see this as really in line with "just another
option to -F", there'd be a fair bit of configuring, it seems, and a
good deal of what pg_basebackup would really be doing with this feature
is just running bits of code the user has given us, except for the
actual calls to PG to do start/stop backup.

> Plenty of snapshotting systems allow write access to the snapshot once
> it finished, so that's one way one can deal with that. I have a hard
> time believing that it'd be hard to have pg_basebackup delay writing the
> backup label in that case.  The WAL part would probably be harder, since
> there we want to start writing before the snapshot is done. And copying
> all the WAL at the end isn't enticing either.

pg_basebackup already delays writing out the backup label until the end.

But, yes, there's also timing issues to deal with, which are complicated
because there isn't just a syscall we can use to say "take a snapshot
for us" or to say "mount this snapshot over here" (at least, not in any
kind of portable way, even in places where such things do exist).  Maybe
we could have shell commands that a user provides for "take a snapshot"
and "mount this snapshot", but putting all of that on the user has its
own drawbacks (more on that below..).
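
To make the ordering concrete, here is a minimal sketch of the flow being
discussed: start a non-exclusive backup, take the snapshot through a
user-supplied hook, stop the backup, and only then drop the backup label
into the snapshot.  The names snapshot_backup, run_sql, take_snapshot and
add_file_to_snapshot are hypothetical and are not part of pg_basebackup;
run_sql is assumed to run every statement on one and the same connection,
and tablespace_map handling is left out.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical hook types; none of these exist in PostgreSQL itself.
RunSql = Callable[[str, Sequence], Optional[Tuple]]
TakeSnapshot = Callable[[], str]
AddFile = Callable[[str, str, str], None]


def snapshot_backup(run_sql: RunSql,
                    take_snapshot: TakeSnapshot,
                    add_file_to_snapshot: AddFile,
                    label: str = "snapshot-backup") -> Tuple[str, str]:
    """Wrap a filesystem snapshot in a non-exclusive start/stop backup.

    run_sql must use a single session: a non-exclusive backup is tied to
    the connection that started it.
    """
    # Function names as of the releases current when this thread was
    # written; PostgreSQL 15 renamed them to pg_backup_start/pg_backup_stop.
    run_sql("SELECT pg_start_backup(%s, true, false)", (label,))
    try:
        # User-provided "take a snapshot" step (LVM, ZFS, a cloud API, ...).
        snapshot_id = take_snapshot()
    finally:
        # Always end backup mode, even if the snapshot step failed.
        row = run_sql("SELECT lsn, labelfile FROM pg_stop_backup(false)", ())
    stop_lsn, labelfile = row
    # The label is only known once the backup has been stopped, so it is
    # added to the (writable) snapshot afterwards.
    add_file_to_snapshot(snapshot_id, "backup_label", labelfile)
    return snapshot_id, stop_lsn
```

WAL from the start of the backup through the returned stop LSN still has
to be kept somewhere for the snapshot to be restorable.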

> For the PITR base backup case it'd definitely be nice to support writing
> (potentially with callbacks instead of implementing all of them in core)
> into $cloud_provider's blob store, without having to transfer all data
> first through a replication connection and then again to the blob store
> (and without manually implementing non-exclusive base backup). Adding
> WAL after the fact to the same blob isn't really a thing for anything like
> that (obviously - even if one can hack it by storing tars etc).

We seem to be mixing things now..  You've moved into talking about 'blob
stores' which are rather different from snapshots, no?  I certainly agree
with the general idea of supporting blob stores (pgbackrest has
supported s3 for quite some time, with a nicely pluggable architecture
that we'll be using to write drivers for other blob storage, all in very
well tested C99 code, and it's all done directly, if you want, without
going over the network in some other way first..).

I don't really care for the idea of using callbacks for this, at least
if what you mean by "callback" is "something like archive_command".
There's a lot of potential failure cases and issues, writing to most s3
stores requires retries, and getting it all to work right when you're
going through a shell to run some other command to actually get the data
across safely and durably is, ultimately, a bit of a mess.  I feel like
we should be learning from the mess that is archive_command and avoiding
anything like that if at all possible when it comes to moving data
around that needs to be confirmed durably written.  Making users have to
piece together the bits to make it work just isn't a good idea either
(see, again, archive_command, and our own documentation for why that's a
bad idea...).
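
As an illustration of what "confirmed durably written" involves even for
a plain local copy, here is a rough sketch, assuming a POSIX filesystem.
The function durable_copy and its parameters are hypothetical; a real
blob-store driver would have its own retry and confirmation rules.

```python
import os
import shutil
import time


def durable_copy(src, dst_dir, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Copy src into dst_dir, returning only once the data is on disk."""
    final = os.path.join(dst_dir, os.path.basename(src))
    if os.path.exists(final):
        # Never silently overwrite a file that was already archived.
        raise FileExistsError(final)
    tmp = final + ".tmp"
    for attempt in range(1, attempts + 1):
        try:
            with open(src, "rb") as fin, open(tmp, "wb") as fout:
                shutil.copyfileobj(fin, fout)
                fout.flush()
                os.fsync(fout.fileno())   # file contents reach stable storage
            os.rename(tmp, final)         # atomic within one filesystem
            dir_fd = os.open(dst_dir, os.O_RDONLY)
            try:
                os.fsync(dir_fd)          # make the rename itself durable
            finally:
                os.close(dir_fd)
            return final
        except OSError:
            if attempt == attempts:
                raise
            # Exponential backoff between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
```

A one-line "cp" in a shell command does none of the fsync, rename or
retry steps, which is the kind of gap being described above.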

> Wonder if the WAL part in particular would actually be best solved
> by having recovery probe more than one WAL directory when looking for
> WAL segments (i.e. doing so before switching methods). Much faster than
> using restore_command, and what one really wants in a pretty decent
> number of cases. And it'd allow to just restore the base backup
> (e.g. mount [copy of] the snapshot) and the received WAL stream
> separately, without needing more complicated orchestration.

That looks to be pretty orthogonal to the original discussion, but it
doesn't seem like a terrible idea.  I'd want David's thoughts on it, but
it seems like this might work pretty well for pgbackrest- we already
pull down WAL in advance of the restore_command asking for it and store
it nearby so we can swap it into place about as fast as possible.  Being
able to give a directory instead would be nice, although you have to
figure out which WAL is going to be needed (which timeline, what time or
recovery point for PITR, etc) and that information isn't passed to the
restore_command currently.  We are working presently on adding support
to pgbackrest to better understand the point in time being asked by the
user for a restore, and we have plans to scan the WAL and track recovery
points, and we should know the timeline they're asking for, so maybe
once all that's done we will just 'know' what PG is going to ask for and
can prep it into a directory, but I don't think it really makes sense to
assume that all of the WAL that might ever be asked for is going to be
in one directory or that users will necessarily be happy with having
what would potentially be a pretty large volume have all of the WAL to
perform the restore with.  Having something fetch WAL and feed it into
the directory, maintaining some user-defined size, and then having
something (PG maybe?) remove WAL when done might work..
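
A small sketch of the "figure out which WAL is going to be needed" part,
under the simplifying assumption that recovery stays on one timeline.
The functions next_wal_segment and prefetch_plan are hypothetical; the
only PostgreSQL fact relied on is the 24-hex-digit segment file name
(timeline, then the segment number split into two 8-digit fields).

```python
def next_wal_segment(name, wal_segment_size=16 * 1024 * 1024):
    """Return the name of the segment following `name` on the same timeline."""
    segs_per_id = 0x100000000 // wal_segment_size
    tli = int(name[0:8], 16)
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    segno = log * segs_per_id + seg + 1
    return "%08X%08X%08X" % (tli, segno // segs_per_id, segno % segs_per_id)


def prefetch_plan(requested, max_bytes, wal_segment_size=16 * 1024 * 1024):
    """List the segments to stage, starting at `requested`, within max_bytes."""
    names = []
    current = requested
    for _ in range(max_bytes // wal_segment_size):
        names.append(current)
        current = next_wal_segment(current, wal_segment_size)
    return names
```

This deliberately ignores timeline switches and recovery targets, which
is exactly the information that isn't available to the command today.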

If we were doing all of this from scratch, or without a
'restore_command' kind of interface, I feel like we'd have 3 or 4
different patches to choose from that implemented s3 support in core,
potentially with all of this pre-fetching and queueing.  The restore
command approach does mean that older versions of PG can leverage a tool
like pgbackrest to get these features though, so I guess that's a
positive for it.  Certainly, one of the reasons we've hacked on
pgbackrest with these things is because we can support *existing*
deployments, whereas something in core wouldn't be available until at
least next year and you'd have to get people upgraded to it and such..

> Perhaps I am also answering something completely besides what you were
> wondering about?

There definitely are a few different threads and thoughts in here...
They're mostly about backups and PITR of some sort though, so I'm happy
to chat about them. :)

Thanks,

Stephen
