Re: pg_basebackup and snapshots - Mailing list pgsql-hackers

From: Stephen Frost
Subject: Re: pg_basebackup and snapshots
Date:
Msg-id: 20200207211339.GN3195@tamriel.snowman.net
In response to: Re: pg_basebackup and snapshots (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2020-02-07 14:56:47 -0500, Stephen Frost wrote:
> > * Andres Freund (andres@anarazel.de) wrote:
> > > Maybe that's looking too far into the future, but I'd like to see
> > > improvements to pg_basebackup that make it integrate with root requiring
> > > tooling, to do more efficient base backups. E.g. having pg_basebackup
> > > handle start/stop backup and WAL handling, but do the actual backup of
> > > the data via a snapshot mechanism (yes, one needs start/stop backup in
> > > the general case, for multiple FSs), would be nice.
> >
> > The challenge with this approach is that you need to drop the 'backup
> > label' file into place as part of this operation, either by putting it
> > into the snapshot after it's been taken, or by putting it into the data
> > directory at restore time. Of course, you have to keep track of WAL
> > anyway from the time the snapshots are taken until the restore is done,
> > so it's certainly possible, as with all of this, it's just somewhat
> > complicated.
>
> It's not dead trivial, but also doesn't seem *that* hard to me compared
> to the other challenges of adding features like this? How to best
> approach it I think depends somewhat on what exact type of backup
> (mainly whether to set up a new system or to make a PITR base backup)
> we'd want to focus on. And what kind of snapshotting system / what kind
> of target data store.

I'm also not sure that pg_basebackup is the right tool for this, really,
given the complications and how it's somewhat beyond pg_basebackup's
mandate. This isn't something you'd likely do remotely, for example, due
to the need to take the snapshot, mount the snapshot, etc. I don't see
this as really in line with "just another option to -F"; there'd be a
fair bit of configuring, it seems, and a good deal of what pg_basebackup
would really be doing with this feature is just running bits of code the
user has given us, except for the actual calls to PG to do start/stop
backup.

> Plenty of snapshotting systems allow write access to the snapshot once
> it finished, so that's one way one can deal with that. I have a hard
> time believing that it'd be hard to have pg_basebackup delay writing the
> backup label in that case. The WAL part would probably be harder, since
> there we want to start writing before the snapshot is done. And copying
> all the WAL at the end isn't enticing either.

pg_basebackup already delays writing out the backup label until the end.
But, yes, there are also timing issues to deal with, which are
complicated because there isn't a syscall we can just use to say "take a
snapshot for us" or to say "mount this snapshot over here" (at least,
not in any kind of portable way, even in places where such things do
exist). Maybe we could have shell commands that a user provides for
"take a snapshot" and "mount this snapshot" (a rough sketch of that flow
is below), but putting all of that on the user has its own drawbacks
(more on that below..).
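Just to make the sequencing concrete, here's a minimal sketch of what
that orchestration might look like, assuming user-provided snapshot and
mount commands (take-snapshot, mount-snapshot, and the paths here are
all hypothetical placeholders; the pg_start_backup/pg_stop_backup calls
are the existing non-exclusive API):

    # Sketch only: snapshot-based base backup driven through the
    # non-exclusive backup API. "take-snapshot" / "mount-snapshot" are
    # placeholders for whatever platform-specific tooling the user has.
    import subprocess
    import psycopg2

    conn = psycopg2.connect("dbname=postgres")
    conn.autocommit = True
    cur = conn.cursor()

    # Start a non-exclusive backup; this session has to stay open
    # until pg_stop_backup(), or the backup is considered aborted.
    cur.execute("SELECT pg_start_backup('snap-backup', false, false)")

    # Take the filesystem snapshot(s) while the backup is in progress.
    subprocess.run(["take-snapshot", "/var/lib/pgsql/12/data"], check=True)

    # Stop the backup; the returned labelfile (and spcmapfile, if any)
    # must be kept with the snapshot for the restore to work.
    cur.execute("SELECT labelfile, spcmapfile FROM pg_stop_backup(false, true)")
    labelfile, spcmapfile = cur.fetchone()

    # If the snapshot is writable once taken, backup_label can go
    # straight into it; otherwise it gets dropped in at restore time.
    subprocess.run(["mount-snapshot", "/mnt/snap"], check=True)
    with open("/mnt/snap/backup_label", "w") as f:
        f.write(labelfile)

The fiddly parts are everything around that, of course: collecting the
WAL from the start-backup point onwards, and cleaning up sanely if the
snapshot command fails mid-backup.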
> For the PITR base backup case it'd definitely be nice to support writing
> (potentially with callbacks instead of implementing all of them in core)
> into $cloud_provider's blob store, without having to transfer all data
> first through a replication connection and then again to the blob store
> (and without manually implementing non-exclusive base backup). Adding
> WAL after the fact to the same blob isn't really a thing for anything
> like that (obviously - even if one can hack it by storing tars etc).

We seem to be mixing things now.. You've moved into talking about 'blob
stores', which are rather different from snapshots, no? I certainly
agree with the general idea of supporting blob stores (pgbackrest has
supported s3 for quite some time, with a nicely pluggable architecture
that we'll be using to write drivers for other blob storage, all in very
well tested C99 code, and it's all done directly, if you want, without
going over the network in some other way first..).

I don't really care for the idea of using callbacks for this, at least
if what you mean by "callback" is "something like archive_command".
There are a lot of potential failure cases and issues, writing to most
s3 stores requires retries, and getting it all to work right when you're
going through a shell to run some other command to actually get the data
across safely and durably is, ultimately, a bit of a mess. I feel like
we should be learning from the mess that is archive_command and avoiding
anything like that if at all possible when it comes to moving data
around that needs to be confirmed durably written. Making users have to
piece together the bits to make it work just isn't a good idea either
(see, again, archive_command, and our own documentation for why that's a
bad idea...).

> Wonder if the WAL part in particular would actually be best solved
> by having recovery probe more than one WAL directory when looking for
> WAL segments (i.e. doing so before switching methods). Much faster than
> using restore_command, and what one really wants in a pretty decent
> number of cases. And it'd allow to just restore the base backup
> (e.g. mount [copy of] the snapshot) and the received WAL stream
> separately, without needing more complicated orchestration.

That looks to be pretty orthogonal to the original discussion, but it
doesn't seem like a terrible idea. I'd want David's thoughts on it, but
it seems like this might work pretty well for pgbackrest - we already
pull down WAL in advance of the restore_command asking for it and store
it nearby, so we can swap it into place about as fast as possible. Being
able to give a directory instead would be nice, although you have to
figure out which WAL is going to be needed (which timeline, what time or
recovery point for PITR, etc.), and that information isn't passed to the
restore_command currently.

We are presently working on adding support to pgbackrest to better
understand the point in time the user is asking for in a restore, and we
have plans to scan the WAL and track recovery points, and we should know
the timeline they're asking for, so maybe once all that's done we will
just 'know' what PG is going to ask for and can prep it into a
directory. I don't think it really makes sense, though, to assume that
all of the WAL that might ever be asked for is going to be in one
directory, or that users will necessarily be happy with having a
potentially pretty large volume hold all of the WAL needed to perform
the restore. Having something fetch WAL and feed it into the directory,
maintaining some user-defined size, and then having something (PG
maybe?) remove WAL when done might work..

If we were doing all of this from scratch, or without a 'restore_command'
kind of interface, I feel like we'd have 3 or 4 different patches to
choose from that implemented s3 support in core, potentially with all of
this pre-fetching and queueing. The restore_command approach does mean
that older versions of PG can leverage a tool like pgbackrest to get
these features, though, so I guess that's a positive for it.
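(For reference, the restore-side hookup for pgbackrest is just a
postgresql.conf line along these lines, with the stanza name being an
example:

    restore_command = 'pgbackrest --stanza=main archive-get %f "%p"'

and since it's the same restore_command interface going back many
releases, existing deployments pick up the prefetching behavior without
any changes in core.)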
Certainly, one of the reasons we've hacked on pgbackrest with these
things is because we can support *existing* deployments, whereas
something in core wouldn't be available until at least next year, and
you'd have to get people upgraded to it and such..

> Perhaps I am also answering something completely besides what you were
> wondering about?

There definitely are a few different threads and thoughts in here...
They're mostly about backups and PITR of some sort though, so I'm happy
to chat about them. :)

Thanks,

Stephen