Re: trying again to get incremental backup - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: trying again to get incremental backup |
Date | |
Msg-id | CA+TgmoaJ1KC982R4-Fw+AG8njoB5Ta5+-oE0nmXK2H9z8ZEP9A@mail.gmail.com |
In response to | Re: trying again to get incremental backup (David Steele <david@pgmasters.net>) |
Responses | Re: trying again to get incremental backup |
List | pgsql-hackers |
On Mon, Oct 23, 2023 at 7:56 PM David Steele <david@pgmasters.net> wrote:
> > I also think a lot of the use of the low-level API is
> > driven by it being just too darn slow to copy the whole database, and
> > incremental backup can help with that in some circumstances.
>
> I would argue that restore performance is *more* important than backup
> performance and this patch is a step backward in that regard. Backups
> will be faster and less space will be used in the repository, but
> restore performance is going to suffer. If the deltas are very small the
> difference will probably be negligible, but as the deltas get large (and
> especially if there are a lot of them) the penalty will be more noticeable.

I think an awful lot depends here on whether the repository is local or remote. If you have filesystem access to wherever the backups are stored anyway, I don't think that using pg_combinebackup to write out a new data directory is going to be much slower than copying one data directory from the repository to wherever you'd actually use the backup. It may be somewhat slower, because we do need to access some data in every involved backup, but I don't think it should be vastly slower, because we don't have to read every backup in its entirety. For each file, we read the (small) header of the newest incremental file and of every incremental file that precedes it until we find a full file. Then, we construct a map of which blocks need to be read from which sources and read only the required blocks from each source. If all the blocks are coming from a single file (because there are no incrementals for a certain file, or because they contain no blocks) then we just copy the entire source file in one shot, which can be optimized using the same tricks we use elsewhere. Inevitably, this is going to read more data and do more random I/O than a flat copy of a directory, but it's not terrible. The overall amount of I/O should be a lot closer to the size of the output directory than to the sum of the sizes of the input directories.

Now, if the repository is remote, and you have to download all of those backups first and then run pg_combinebackup on them afterward, that is going to be unpleasant unless the incremental backups are all quite small. Possibly this could be addressed by teaching pg_combinebackup to do things like accessing data over HTTP and SSH and, relatedly, looking inside tar files without needing them unpacked. For now, I've left those as ideas for future improvement, but I think they could potentially address some of your concerns here. A difficulty is that there are a lot of protocols that people might want to use to push bytes around, and it might be hard to keep up with the march of progress.

I do agree, though, that there's no such thing as a free lunch. I wouldn't recommend to anyone that they plan to restore from a chain of 100 incremental backups. Not only might it be slow, but the opportunities for something to go wrong are magnified. Even if you've automated everything well enough that there's no human error involved, what if you've got a corrupted file somewhere? Maybe that's not likely in absolute terms, but the more files you've got, the more likely it becomes. What I'd suggest someone do instead is to periodically run pg_combinebackup full_reference_backup oldest_incremental -o new_full_reference_backup; rm -rf full_reference_backup; mv new_full_reference_backup full_reference_backup.
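Spelling that last step out as a small maintenance script (the directory names are just the placeholders from the command above, and this is only a sketch of the idea, not a blessed procedure):

    # Fold the oldest incremental into the current full reference backup,
    # producing a new synthetic full backup alongside the old one.
    pg_combinebackup full_reference_backup oldest_incremental \
        -o new_full_reference_backup

    # Once that has succeeded, retire the old full backup and put the
    # new one in its place.
    rm -rf full_reference_backup
    mv new_full_reference_backup full_reference_backup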
The new full reference backup is intended to still be usable for restoring incrementals based on the incremental it replaced (there's a small worked example at the end of this message). I hope that, if people use the feature well, this should limit the need for really long backup chains. I am sure, though, that some people will use it poorly. Maybe there's room for more documentation on this topic.

> I was concerned with the difficulty of trying to stage the correct
> backups for pg_combinebackup, not whether it would recognize that the
> needed data was not available and then error appropriately. The latter
> is surmountable within pg_combinebackup but the former is left up to the
> user.

Indeed.

> One note regarding the patches. I feel like
> v5-0005-Prototype-patch-for-incremental-backup should be split to have
> the WAL summarizer as one patch and the changes to base backup as a
> separate patch.
>
> It might not be useful to commit one without the other, but it would
> make for an easier read. Just my 2c.

Yeah, maybe so. I'm not quite ready to commit to doing that split as of this writing, but I will think about it and possibly do it.

--
Robert Haas
EDB: http://www.enterprisedb.com
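P.S. To put the point about the replaced incremental in concrete terms, here's a rough sketch with made-up directory names, assuming the backups were taken as a chain full -> incr1 -> incr2 -> incr3:

    # Fold the oldest incremental into the full backup, as above.
    pg_combinebackup full incr1 -o full_new

    # incr2 was taken relative to incr1, but everything incr1 contained
    # is now part of full_new, so the rest of the chain can still be
    # reconstructed from it when you need a data directory to restore:
    pg_combinebackup full_new incr2 incr3 -o restored_datadir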