Thread: trying again to get incremental backup
A few years ago, I sketched out a design for incremental backup, but no patch for incremental backup ever got committed. Instead, the whole thing evolved into a project to add backup manifests, which are nice, but not as nice as incremental backup would be. So I've decided to have another go at incremental backup itself. Attached are some WIP patches. Let me summarize the design and some open questions and problems with it that I've discovered. I welcome problem reports and test results from others, as well.

The basic design of this patch set is pretty simple, and there are three main parts.

First, there's a new background process called the walsummarizer which runs all the time. It reads the WAL and generates WAL summary files. WAL summary files are extremely small compared to the original WAL and contain only the minimal amount of information that we need in order to determine which parts of the database need to be backed up. They tell us about files getting created, destroyed, or truncated, and they tell us about modified blocks. Naturally, we don't find out about blocks that were modified without any write-ahead log record, e.g. hint bit updates, but those are of necessity not critical for correctness, so it's OK.

Second, pg_basebackup has a mode where it can take an incremental backup. You must supply a backup manifest from a previous full backup. We read the WAL summary files that have been generated between the start of the previous backup and the start of this one, and use that to figure out which relation files have changed and how much. Non-relation files are sent normally, just as they would be in a full backup. Relation files can either be sent in full or be replaced by an incremental file, which contains a subset of the blocks in the file plus a bit of information to handle truncations properly.
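As a rough illustration of the bookkeeping described above -- not the actual on-disk format, which is a compact binary one; the class and method names here are invented -- a summary needs to track, per relation fork, roughly this much state:

```python
# Toy model of what a WAL summary must record per relation fork:
# truncations and the set of modified block numbers. Illustrative only.

class ForkSummary:
    def __init__(self):
        self.truncated_to = None   # smallest truncation length seen, in blocks
        self.modified_blocks = set()

    def note_modified(self, blkno):
        self.modified_blocks.add(blkno)

    def note_truncate(self, nblocks):
        # After a truncation, blocks at or beyond the new length are gone,
        # so there is no point remembering modifications to them.
        if self.truncated_to is None or nblocks < self.truncated_to:
            self.truncated_to = nblocks
        self.modified_blocks = {b for b in self.modified_blocks if b < nblocks}

s = ForkSummary()
s.note_modified(10)
s.note_modified(500)
s.note_truncate(100)   # relation truncated to 100 blocks; block 500 forgotten
s.note_modified(42)    # block below the truncation point, kept
```

The point of the truncation bookkeeping is that an incremental backup must not resurrect blocks that a truncation removed, even if they were modified earlier in the summarized range.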
Third, there's now a pg_combinebackup utility which takes a full backup and one or more incremental backups, performs a bunch of sanity checks, and if everything works out, writes out a new, synthetic full backup, aka a data directory. Simple usage example:

pg_basebackup -cfast -Dx
pg_basebackup -cfast -Dy --incremental x/backup_manifest
pg_combinebackup x y -o z

The part of all this with which I'm least happy is the WAL summarization engine. Actually, the core process of summarizing the WAL seems totally fine, and the file format is very compact thanks to some nice ideas from my colleague Dilip Kumar. Someone may of course wish to argue that the information should be represented in some other file format instead, and that can be done if it's really needed, but I don't see a lot of value in tinkering with it, either.

Where I do think there's a problem is deciding how much WAL ought to be summarized in one WAL summary file. Summary files cover a certain range of WAL records - they have names like $TLI${START_LSN}${END_LSN}.summary. It's not too hard to figure out where a file should start - generally, it's wherever the previous file ended, possibly on a new timeline - but figuring out where the summary should end is trickier. You always have the option to either read another WAL record and fold it into the current summary, or end the current summary where you are, write out the file, and begin a new one. So how do you decide what to do?

I originally had the idea of summarizing a certain number of MB of WAL per WAL summary file, and so I added a GUC wal_summarize_mb for that purpose. But then I realized that actually, you really want WAL summary file boundaries to line up with possible redo points, because when you do an incremental backup, you need a summary that stretches from the redo point of the checkpoint written at the start of the prior backup to the redo point of the checkpoint written at the start of the current backup.
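One way to picture the requirement above: the backup needs a contiguous chain of summary files whose [start, end) LSN ranges cover the interval between the two redo points. A simplified sketch (hypothetical helper, single timeline only; the real code must also cope with timeline switches):

```python
# Illustrative sketch: given summary files covering [start, end) LSN ranges,
# decide whether they form a contiguous chain covering [need_start, need_end).
# This models why summary boundaries matter; it is not the actual algorithm.

def covers(summaries, need_start, need_end):
    # summaries: list of (start_lsn, end_lsn) tuples
    relevant = sorted(s for s in summaries if s[1] > need_start and s[0] < need_end)
    pos = need_start
    for start, end in relevant:
        if start > pos:
            return False          # gap in coverage: cannot take the backup
        pos = max(pos, end)
    return pos >= need_end

files = [(0, 1000), (1000, 2500), (2500, 4000)]
covers(files, 500, 3000)                   # chain covers the needed range
covers(files[:1] + files[2:], 500, 3000)   # the 1000-2500 file is missing
```

If the chain has a gap, or the last file ends before the current redo point, the backup has to wait for the summarizer -- which is exactly where the hang described below came from.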
The block modifications that happen in that range of WAL records are the ones that need to be included in the incremental. Unfortunately, there's no indication in the WAL itself that you've reached a redo point, but I wrote code that tries to notice when we've reached the redo point stored in shared memory and stops the summary there. But I eventually realized that's not good enough either, because if summarization zooms past the redo point before noticing the updated redo point in shared memory, then the backup sat around waiting for the next summary file to be generated so it had enough summaries to proceed with the backup, while the summarizer was in no hurry to finish up the current file and just sat there waiting for more WAL to be generated. Eventually the incremental backup would just time out.

I tried to fix that by making it so that if somebody's waiting for a summary file to be generated, they can let the summarizer know about that and it can write a summary file ending at the LSN up to which it has read and then begin a new file from there. That seems to fix the hangs, but now I've got three overlapping, interconnected systems for deciding where to end the current summary file, and maybe that's OK, but I have a feeling there might be a better way.

Dilip had an interesting potential solution to this problem, which was to always emit a special WAL record at the redo pointer. That is, when we fix the redo pointer for the checkpoint record we're about to write, also insert a WAL record there. That way, when the summarizer reaches that sentinel record, it knows it should stop the summary just before. I'm not sure whether this approach is viable, especially from a performance and concurrency perspective, and I'm not sure whether people here would like it, but it does seem like it would make things a whole lot simpler for this patch set.
Another thing that I'm not too sure about is: what happens if we find a relation file on disk that doesn't appear in the backup_manifest for the previous backup and isn't mentioned in the WAL summaries either? The fact that said file isn't mentioned in the WAL summaries seems like it ought to mean that the file is unchanged, in which case perhaps this ought to be an error condition. But I'm not too sure about that treatment. I have a feeling that there might be some subtle problems here, especially if databases or tablespaces get dropped and then new ones get created that happen to have the same OIDs. And what about wal_level=minimal? I'm not at a point where I can say I've gone through and plugged up these kinds of corner-case holes tightly yet, and I'm worried that there may be still other scenarios of which I haven't even thought. Happy to hear your ideas about what the problem cases are or how any of the problems should be solved.

A related design question is whether we should really be sending the whole backup manifest to the server at all. If it turns out that we don't really need anything except for the LSN of the previous backup, we could send that one piece of information instead of everything. On the other hand, if we need the list of files from the previous backup, then sending the whole manifest makes sense.

Another big and rather obvious problem with the patch set is that it doesn't currently have any automated test cases, or any real documentation. Those are obviously things that need a lot of work before there could be any thought of committing this. And probably a lot of bugs will be found along the way, too.

A few less-serious problems with the patch:

- We don't have an incremental JSON parser, so if you have a backup_manifest>1GB, pg_basebackup --incremental is going to fail. That's also true of the existing code in pg_verifybackup, and for the same reason.
I talked to Andrew Dunstan at one point about adapting our JSON parser to support incremental parsing, and he had a patch for that, but I think he found some problems with it and I'm not sure what the current status is.

- The patch does support differential backup, aka an incremental atop another incremental. There's no particular limit to how long a chain of backups can be. However, pg_combinebackup currently requires that the first backup is a full backup and all the later ones are incremental backups. So if you have a full backup a and an incremental backup b and a differential backup c, you can combine a, b, and c to get a full backup equivalent to one you would have gotten if you had taken a full backup at the time you took c. However, you can't combine b and c with each other without combining them with a, and that might be desirable in some situations. You might want to collapse a bunch of older differential backups into a single one that covers the whole time range of all of them. I think that the file format can support that, but the tool is currently too dumb.

- We only know how to operate on directories, not tar files. I thought about that when working on pg_verifybackup as well, but I didn't do anything about it. It would be nice to go back and make that tool work on tar-format backups, and this one, too. I don't think there would be a whole lot of point trying to operate on compressed tar files because you need random access and that seems hard on a compressed file, but on uncompressed files it seems at least theoretically doable. I'm not sure whether anyone would care that much about this, though, even though it does sound pretty cool.

In the attached patch series, patches 1 through 6 are various refactoring patches, patch 7 is the main event, and patch 8 adds a useful inspection tool.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com
Attachment
- v1-0001-In-basebackup.c-refactor-to-create-verify_page_ch.patch
- v1-0005-Change-how-a-base-backup-decides-which-files-have.patch
- v1-0003-Change-struct-tablespaceinfo-s-oid-member-from-ch.patch
- v1-0002-In-basebackup.c-refactor-to-create-read_file_data.patch
- v1-0004-Refactor-parse_filename_for_nontemp_relation-to-p.patch
- v1-0006-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch
- v1-0008-Add-new-pg_walsummary-tool.patch
- v1-0007-Prototype-patch-for-incremental-and-differential-.patch
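The per-file reconstruction that pg_combinebackup performs, as described upthread, can be sketched roughly like this (illustrative Python only, not the actual implementation; block images and the truncation bookkeeping are heavily simplified, and the placeholder for newly grown blocks is an invented detail):

```python
# Rough sketch of combining a full backup with a chain of incrementals for
# one relation file: honor each incremental's truncation info, then overlay
# its changed blocks, newest last. Purely conceptual.

def combine(full_blocks, incrementals):
    # full_blocks: list of block images from the full backup
    # incrementals: list of (trunc_len, {blkno: image}), oldest to newest
    result = list(full_blocks)
    for trunc_len, changed in incrementals:
        result = result[:trunc_len]            # apply the truncation first
        for blkno, image in sorted(changed.items()):
            while len(result) <= blkno:        # the file may have grown back
                result.append(b"\x00")         # placeholder for a new block
            result[blkno] = image
    return result

full = [b"A0", b"A1", b"A2", b"A3"]
incr1 = (3, {1: b"B1"})    # truncated to 3 blocks, block 1 rewritten
incr2 = (3, {3: b"C3"})    # later grew back to 4 blocks via block 3
combine(full, [incr1, incr2])
```

Each incremental overrides whatever came before it for the blocks it carries, which is why the chain must be applied oldest-first and why the tool insists on starting from a full backup.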
Hi,

On 2023-06-14 14:46:48 -0400, Robert Haas wrote:
> A few years ago, I sketched out a design for incremental backup, but
> no patch for incremental backup ever got committed. Instead, the whole
> thing evolved into a project to add backup manifests, which are nice,
> but not as nice as incremental backup would be. So I've decided to
> have another go at incremental backup itself. Attached are some WIP
> patches. Let me summarize the design and some open questions and
> problems with it that I've discovered. I welcome problem reports and
> test results from others, as well.

Cool!

> I originally had the idea of summarizing a certain number of MB of WAL
> per WAL summary file, and so I added a GUC wal_summarize_mb for that
> purpose. But then I realized that actually, you really want WAL
> summary file boundaries to line up with possible redo points, because
> when you do an incremental backup, you need a summary that stretches
> from the redo point of the checkpoint written at the start of the
> prior backup to the redo point of the checkpoint written at the start
> of the current backup. The block modifications that happen in that
> range of WAL records are the ones that need to be included in the
> incremental.

I assume this is "solely" required for keeping the incremental backups as small as possible, rather than being required for correctness?

> Unfortunately, there's no indication in the WAL itself
> that you've reached a redo point, but I wrote code that tries to
> notice when we've reached the redo point stored in shared memory and
> stops the summary there. But I eventually realized that's not good
> enough either, because if summarization zooms past the redo point
> before noticing the updated redo point in shared memory, then the
> backup sat around waiting for the next summary file to be generated so
> it had enough summaries to proceed with the backup, while the
> summarizer was in no hurry to finish up the current file and just sat
> there waiting for more WAL to be generated. Eventually the incremental
> backup would just time out. I tried to fix that by making it so that
> if somebody's waiting for a summary file to be generated, they can let
> the summarizer know about that and it can write a summary file ending
> at the LSN up to which it has read and then begin a new file from
> there. That seems to fix the hangs, but now I've got three
> overlapping, interconnected systems for deciding where to end the
> current summary file, and maybe that's OK, but I have a feeling there
> might be a better way.

Could we just recompute the WAL summary for the [redo, end of chunk] for the relevant summary file?

> Dilip had an interesting potential solution to this problem, which was
> to always emit a special WAL record at the redo pointer. That is, when
> we fix the redo pointer for the checkpoint record we're about to
> write, also insert a WAL record there. That way, when the summarizer
> reaches that sentinel record, it knows it should stop the summary just
> before. I'm not sure whether this approach is viable, especially from
> a performance and concurrency perspective, and I'm not sure whether
> people here would like it, but it does seem like it would make things
> a whole lot simpler for this patch set.

FWIW, I like the idea of a special WAL record at that point, independent of this feature. It wouldn't be a meaningful overhead compared to the cost of a checkpoint, and it seems like it'd be quite useful for debugging. But I can see uses going beyond that - we occasionally have been discussing associating additional data with redo points, and that'd be a lot easier to deal with during recovery with such a record.

I don't really see a performance and concurrency angle right now - what are you wondering about?

> Another thing that I'm not too sure about is: what happens if we find
> a relation file on disk that doesn't appear in the backup_manifest for
> the previous backup and isn't mentioned in the WAL summaries either?

Wouldn't that commonly happen for unlogged relations at least?

I suspect there's also other ways to end up with such additional files, e.g. by crashing during the creation of a new relation.

> A few less-serious problems with the patch:
>
> - We don't have an incremental JSON parser, so if you have a
> backup_manifest>1GB, pg_basebackup --incremental is going to fail.
> That's also true of the existing code in pg_verifybackup, and for the
> same reason. I talked to Andrew Dunstan at one point about adapting
> our JSON parser to support incremental parsing, and he had a patch for
> that, but I think he found some problems with it and I'm not sure what
> the current status is.

As a stopgap measure, can't we just use the relevant flag to allow larger allocations?

> - The patch does support differential backup, aka an incremental atop
> another incremental. There's no particular limit to how long a chain
> of backups can be. However, pg_combinebackup currently requires that
> the first backup is a full backup and all the later ones are
> incremental backups. So if you have a full backup a and an incremental
> backup b and a differential backup c, you can combine a b and c to get
> a full backup equivalent to one you would have gotten if you had taken
> a full backup at the time you took c. However, you can't combine b and
> c with each other without combining them with a, and that might be
> desirable in some situations. You might want to collapse a bunch of
> older differential backups into a single one that covers the whole
> time range of all of them. I think that the file format can support
> that, but the tool is currently too dumb.

That seems like a feature for the future...

> - We only know how to operate on directories, not tar files. I thought
> about that when working on pg_verifybackup as well, but I didn't do
> anything about it. It would be nice to go back and make that tool work
> on tar-format backups, and this one, too. I don't think there would be
> a whole lot of point trying to operate on compressed tar files because
> you need random access and that seems hard on a compressed file, but
> on uncompressed files it seems at least theoretically doable. I'm not
> sure whether anyone would care that much about this, though, even
> though it does sound pretty cool.

I don't know the tar format well, but my understanding is that it doesn't have a "central metadata" portion. I.e. doing something like this would entail scanning the tar file sequentially, skipping file contents? And wouldn't you have to create an entirely new tar file for the modified output? That kind of makes it not so incremental ;)

IOW, I'm not sure it's worth bothering about this ever, and certainly doesn't seem worth bothering about now. But I might just be missing something.

Greetings,

Andres Freund
On Wed, Jun 14, 2023 at 3:47 PM Andres Freund <andres@anarazel.de> wrote:
> I assume this is "solely" required for keeping the incremental backups as
> small as possible, rather than being required for correctness?

I believe so. I want to spend some more time thinking about this to make sure I'm not missing anything.

> Could we just recompute the WAL summary for the [redo, end of chunk] for the
> relevant summary file?

I'm not understanding how that would help. If we were going to compute a WAL summary on the fly rather than waiting for one to show up on disk, what we'd want is [end of last WAL summary that does exist on disk, redo]. But I'm not sure that's a great approach, because that LSN gap might be large and then we're duplicating a lot of work that the summarizer has probably already done most of.

> FWIW, I like the idea of a special WAL record at that point, independent of
> this feature. It wouldn't be a meaningful overhead compared to the cost of a
> checkpoint, and it seems like it'd be quite useful for debugging. But I can
> see uses going beyond that - we occasionally have been discussing associating
> additional data with redo points, and that'd be a lot easier to deal with
> during recovery with such a record.
>
> I don't really see a performance and concurrency angle right now - what are
> you wondering about?

I'm not really sure. I expect Dilip would be happy to post his patch, and if you'd be willing to have a look at it and express your concerns or lack thereof, that would be super valuable.

> > Another thing that I'm not too sure about is: what happens if we find
> > a relation file on disk that doesn't appear in the backup_manifest for
> > the previous backup and isn't mentioned in the WAL summaries either?
>
> Wouldn't that commonly happen for unlogged relations at least?
>
> I suspect there's also other ways to end up with such additional files,
> e.g. by crashing during the creation of a new relation.

Yeah, this needs some more careful thought.

> > A few less-serious problems with the patch:
> >
> > - We don't have an incremental JSON parser, so if you have a
> > backup_manifest>1GB, pg_basebackup --incremental is going to fail.
> > That's also true of the existing code in pg_verifybackup, and for the
> > same reason. I talked to Andrew Dunstan at one point about adapting
> > our JSON parser to support incremental parsing, and he had a patch for
> > that, but I think he found some problems with it and I'm not sure what
> > the current status is.
>
> As a stopgap measure, can't we just use the relevant flag to allow larger
> allocations?

I'm not sure that's a good idea, but theoretically, yes. We can also just choose to accept the limitation that your data directory can't be too darn big if you want to use this feature. But getting incremental JSON parsing would be better. Not having the manifest in JSON would be an even better solution, but regrettably I did not win that argument.

> That seems like a feature for the future...

Sure.

> I don't know the tar format well, but my understanding is that it doesn't have
> a "central metadata" portion. I.e. doing something like this would entail
> scanning the tar file sequentially, skipping file contents? And wouldn't you
> have to create an entirely new tar file for the modified output? That kind of
> makes it not so incremental ;)
>
> IOW, I'm not sure it's worth bothering about this ever, and certainly doesn't
> seem worth bothering about now. But I might just be missing something.

Oh, yeah, it's just an idle thought. I'll get to it when I get to it, or else I won't.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, 14 Jun 2023 at 20:47, Robert Haas <robertmhaas@gmail.com> wrote:
>
> A few years ago, I sketched out a design for incremental backup, but
> no patch for incremental backup ever got committed. Instead, the whole
> thing evolved into a project to add backup manifests, which are nice,
> but not as nice as incremental backup would be. So I've decided to
> have another go at incremental backup itself. Attached are some WIP
> patches.

Nice, I like this idea.

> Let me summarize the design and some open questions and
> problems with it that I've discovered. I welcome problem reports and
> test results from others, as well.

Skimming through the 7th patch, I see claims that FSM is not fully WAL-logged and thus shouldn't be tracked, and so it indeed doesn't track those changes.

I disagree with that decision: we now have support for custom resource managers, which may use the various forks for other purposes than those used in PostgreSQL right now. It would be a shame if data is lost because of the backup tool ignoring forks because the PostgreSQL project itself doesn't have post-recovery consistency guarantees in that fork.

So, unless we document that WAL-logged changes in the FSM fork are actually not recoverable from backup, regardless of the type of contents, we should still keep track of the changes in the FSM fork and include the fork in our backups, or only exclude those FSM updates that we know are safe to ignore.

Kind regards,

Matthias van de Meent
Neon, Inc.
Hi,

On 2023-06-14 16:10:38 -0400, Robert Haas wrote:
> On Wed, Jun 14, 2023 at 3:47 PM Andres Freund <andres@anarazel.de> wrote:
> > Could we just recompute the WAL summary for the [redo, end of chunk] for the
> > relevant summary file?
>
> I'm not understanding how that would help. If we were going to compute
> a WAL summary on the fly rather than waiting for one to show up on
> disk, what we'd want is [end of last WAL summary that does exist on
> disk, redo].

Oh, right.

> But I'm not sure that's a great approach, because that LSN gap might be
> large and then we're duplicating a lot of work that the summarizer has
> probably already done most of.

I guess that really depends on what the summary granularity is. If you create a separate summary every 32MB or so, recomputing just the required range shouldn't be too bad.

> > FWIW, I like the idea of a special WAL record at that point, independent of
> > this feature. It wouldn't be a meaningful overhead compared to the cost of a
> > checkpoint, and it seems like it'd be quite useful for debugging. But I can
> > see uses going beyond that - we occasionally have been discussing associating
> > additional data with redo points, and that'd be a lot easier to deal with
> > during recovery with such a record.
> >
> > I don't really see a performance and concurrency angle right now - what are
> > you wondering about?
>
> I'm not really sure. I expect Dilip would be happy to post his patch,
> and if you'd be willing to have a look at it and express your concerns
> or lack thereof, that would be super valuable.

Will do. Adding me to CC: might help, I have a backlog unfortunately :(.

Greetings,

Andres Freund
On Thu, Jun 15, 2023 at 2:11 AM Andres Freund <andres@anarazel.de> wrote:
>
> > I'm not really sure. I expect Dilip would be happy to post his patch,
> > and if you'd be willing to have a look at it and express your concerns
> > or lack thereof, that would be super valuable.
>
> Will do. Adding me to CC: might help, I have a backlog unfortunately :(.

Thanks, I have posted it here[1]

[1] https://www.postgresql.org/message-id/CAFiTN-s-K%3DmVA%3DHPr_VoU-5bvyLQpNeuzjq1ebPJMEfCJZKFsg%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 14, 2023 at 4:40 PM Andres Freund <andres@anarazel.de> wrote:
> > But I'm not sure that's a great approach, because that LSN gap might be
> > large and then we're duplicating a lot of work that the summarizer has
> > probably already done most of.
>
> I guess that really depends on what the summary granularity is. If you create
> a separate summary every 32MB or so, recomputing just the required range
> shouldn't be too bad.

Yeah, but I don't think that's the right approach, for two reasons.

First, one of the things I'm rather worried about is what happens when the WAL distance between the prior backup and the incremental backup is large. It could be a terabyte. If we have a WAL summary for every 32MB of WAL, that's 32k files we have to read, and I'm concerned that's too many. Maybe it isn't, but it's something that has really been weighing on my mind as I've been thinking through the design questions here. The files are really very small, and having to open a bazillion tiny little files to get the job done sounds lame.

Second, I don't see what problem it actually solves. Why not just signal the summarizer to write out the accumulated data to a file instead of re-doing the work ourselves? Or else adopt the WAL-record-at-the-redo-pointer approach, and then the whole thing is moot?

--
Robert Haas
EDB: http://www.enterprisedb.com
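The 32k figure in the message above is straightforward back-of-the-envelope arithmetic (assuming 1 TiB of WAL and one summary per 32 MiB):

```python
# Number of 32MiB-granularity summary files needed to cover 1TiB of WAL
# between two backups -- the file-count concern raised above.
wal_bytes = 1 << 40          # 1 TiB of WAL since the prior backup
summary_span = 32 << 20      # one summary file per 32 MiB
n_files = wal_bytes // summary_span   # 2**15 = 32768 files to open and merge
```

Whether opening 32768 tiny files is actually too slow is an empirical question, but it frames why coarser summaries (or packing multiple ranges per file) look attractive.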
Hi,

On 2023-06-19 09:46:12 -0400, Robert Haas wrote:
> On Wed, Jun 14, 2023 at 4:40 PM Andres Freund <andres@anarazel.de> wrote:
> > > But I'm not sure that's a great approach, because that LSN gap might be
> > > large and then we're duplicating a lot of work that the summarizer has
> > > probably already done most of.
> >
> > I guess that really depends on what the summary granularity is. If you create
> > a separate summary every 32MB or so, recomputing just the required range
> > shouldn't be too bad.
>
> Yeah, but I don't think that's the right approach, for two reasons.
> First, one of the things I'm rather worried about is what happens when
> the WAL distance between the prior backup and the incremental backup
> is large. It could be a terabyte. If we have a WAL summary for every
> 32MB of WAL, that's 32k files we have to read, and I'm concerned
> that's too many. Maybe it isn't, but it's something that has really
> been weighing on my mind as I've been thinking through the design
> questions here.

It doesn't have to be a separate file - you could easily summarize ranges at a higher granularity, storing multiple ranges into a single file with a coarser naming pattern.

> The files are really very small, and having to open a bazillion tiny little
> files to get the job done sounds lame. Second, I don't see what problem it
> actually solves. Why not just signal the summarizer to write out the
> accumulated data to a file instead of re-doing the work ourselves? Or else
> adopt the WAL-record-at-the-redo-pointer approach, and then the whole thing
> is moot?

The one point for a relatively grainy summarization scheme that I see is that it would pave the way for using the WAL summary data for other purposes in the future. That could be done orthogonal to the other solutions to the redo pointer issues.

Other potential use cases:
- only restore parts of a base backup that aren't going to be overwritten by WAL replay
- reconstructing database contents from WAL after data loss
- more efficient pg_rewind
- more efficient prefetching during WAL replay

Greetings,

Andres Freund
Hi,

In the limited time that I've had to work on this project lately, I've been trying to come up with a test case for this feature -- and since I've gotten completely stuck, I thought it might be time to post and see if anyone else has a better idea.

I thought a reasonable test case would be: Do a full backup. Change some stuff. Do an incremental backup. Restore both backups and perform replay to the same LSN. Then compare the files on disk. But I cannot make this work.

The first problem I ran into was that replay of the full backup does a restartpoint, while the replay of the incremental backup does not. That results in, for example, pg_subtrans having different contents. I'm not sure whether it can also result in data files having different contents: are changes that we replayed following the last restartpoint guaranteed to end up on disk when the server is shut down? It wasn't clear to me that this is the case. I thought maybe I could get both servers to perform a restartpoint at the same location by shutting down the primary and then replaying through the shutdown checkpoint, but that doesn't work because the primary doesn't finish archiving before shutting down.

After some more fiddling I settled (at least for research purposes) on having the restored backups PITR and promote, instead of PITR and pause, so that we're guaranteed a checkpoint. But that just caused me to run into a far worse problem: replay on the standby doesn't actually create a state that is byte-for-byte identical to the one that exists on the primary. I quickly discovered that in my test case, I was ending up with different contents in the "hole" of a block wherein a tuple got updated. Replay doesn't think it's important to make the hole end up with the same contents on all machines that replay the WAL, so I end up with one server that has more junk in there than the other one and the tests fail.

Unless someone has a brilliant idea that I lack, this suggests to me that this whole line of testing is a dead end. I can, of course, write tests that compare clusters *logically* -- do the correct relations exist, are they accessible, do they have the right contents? But I feel like it would be easy to have bugs that escape detection in such a test but would be detected by a physical comparison of the clusters. However, such a comparison can only be conducted if either (a) there's some way to set up the test so that byte-for-byte identical clusters can be expected or (b) there's some way to perform the comparison that can distinguish between expected, harmless differences and unexpected, problematic differences. And at the moment my conclusion is that neither (a) nor (b) exists. Does anyone think otherwise?

Meanwhile, here's a rebased set of patches. The somewhat-primitive attempts at writing tests are in 0009, but they don't work, for the reasons explained above. I think I'd probably like to go ahead and commit 0001 and 0002 soon if there are no objections, since I think those are good refactorings independently of the rest of this.

...Robert
Attachment
- 0004-Refactor-parse_filename_for_nontemp_relation-to-pars.patch
- 0002-In-basebackup.c-refactor-to-create-read_file_data_in.patch
- 0001-In-basebackup.c-refactor-to-create-verify_page_check.patch
- 0005-Change-how-a-base-backup-decides-which-files-have-ch.patch
- 0003-Change-struct-tablespaceinfo-s-oid-member-from-char-.patch
- 0006-Move-src-bin-pg_verifybackup-parse_manifest.c-into-s.patch
- 0008-Add-new-pg_walsummary-tool.patch
- 0009-Add-TAP-tests-this-is-broken-doesn-t-work.patch
- 0007-Prototype-patch-for-incremental-and-differential-bac.patch
Hi Robert, On 8/30/23 10:49, Robert Haas wrote: > In the limited time that I've had to work on this project lately, I've > been trying to come up with a test case for this feature -- and since > I've gotten completely stuck, I thought it might be time to post and > see if anyone else has a better idea. I thought a reasonable test case > would be: Do a full backup. Change some stuff. Do an incremental > backup. Restore both backups and perform replay to the same LSN. Then > compare the files on disk. But I cannot make this work. The first > problem I ran into was that replay of the full backup does a > restartpoint, while the replay of the incremental backup does not. > That results in, for example, pg_subtrans having different contents. pg_subtrans, at least, can be ignored since it is excluded from the backup and not required for recovery. > I'm not sure whether it can also result in data files having different > contents: are changes that we replayed following the last restartpoint > guaranteed to end up on disk when the server is shut down? It wasn't > clear to me that this is the case. I thought maybe I could get both > servers to perform a restartpoint at the same location by shutting > down the primary and then replaying through the shutdown checkpoint, > but that doesn't work because the primary doesn't finish archiving > before shutting down. After some more fiddling I settled (at least for > research purposes) on having the restored backups PITR and promote, > instead of PITR and pause, so that we're guaranteed a checkpoint. But > that just caused me to run into a far worse problem: replay on the > standby doesn't actually create a state that is byte-for-byte > identical to the one that exists on the primary. I quickly discovered > that in my test case, I was ending up with different contents in the > "hole" of a block wherein a tuple got updated. 
Replay doesn't think > it's important to make the hole end up with the same contents on all > machines that replay the WAL, so I end up with one server that has > more junk in there than the other one and the tests fail. This is pretty much what I discovered when investigating backup from standby back in 2016. My (ultimately unsuccessful) efforts to find a clean delta resulted in [1] as I systematically excluded directories that are not required for recovery and will not be synced between a primary and standby. FWIW Heikki also made similar attempts at this before me (back then I found the thread but I doubt I could find it again) and arrived at similar results. We discussed this in person and figured out that we had come to more or less the same conclusion. Welcome to the club! > Unless someone has a brilliant idea that I lack, this suggests to me > that this whole line of testing is a dead end. I can, of course, write > tests that compare clusters *logically* -- do the correct relations > exist, are they accessible, do they have the right contents? But I > feel like it would be easy to have bugs that escape detection in such > a test but would be detected by a physical comparison of the clusters. Agreed, though a matching logical result is still very compelling. > However, such a comparison can only be conducted if either (a) there's > some way to set up the test so that byte-for-byte identical clusters > can be expected or (b) there's some way to perform the comparison that > can distinguish between expected, harmless differences and unexpected, > problematic differences. And at the moment my conclusion is that > neither (a) nor (b) exists. Does anyone think otherwise? I do not. My conclusion back then was that validating a physical comparison would be nearly impossible without changes to Postgres to make the primary and standby match via replication. Which, by the way, I still think would be a great idea. In principle, at least. 
Replay is already a major bottleneck and anything that makes it slower will likely not be very popular. This would also be great for WAL, since the last time I tested, the same WAL segment could differ between the primary and standby because on the standby the unused (and recycled) portion at the end is not zeroed as it is on the primary (though logically they do match). I would be very happy if somebody told me that my info is out of date here and this has been fixed. But when I looked at the code it was incredibly tricky to do because of how WAL is replicated.

> Meanwhile, here's a rebased set of patches. The somewhat-primitive
> attempts at writing tests are in 0009, but they don't work, for the
> reasons explained above. I think I'd probably like to go ahead and
> commit 0001 and 0002 soon if there are no objections, since I think
> those are good refactorings independently of the rest of this.

No objections to 0001/0002.

Regards,
-David

[1] http://git.postgresql.org/pg/commitdiff/6ad8ac6026287e3ccbc4d606b6ab6116ccc0eec8
Hey, thanks for the reply. On Thu, Aug 31, 2023 at 6:50 PM David Steele <david@pgmasters.net> wrote: > pg_subtrans, at least, can be ignored since it is excluded from the > backup and not required for recovery. I agree... > Welcome to the club! Thanks for the welcome, but being a member feels *terrible*. :-) > I do not. My conclusion back then was that validating a physical > comparison would be nearly impossible without changes to Postgres to > make the primary and standby match via replication. Which, by the way, I > still think would be a great idea. In principle, at least. Replay is > already a major bottleneck and anything that makes it slower will likely > not be very popular. Fair point. But maybe the bigger issue is the work involved. I don't think zeroing the hole in all cases would likely be that expensive, but finding everything that can cause the standby to diverge from the primary and fixing all of it sounds like an unpleasant amount of effort. Still, it's good to know that I'm not missing something obvious. > No objections to 0001/0002. Cool. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Aug 30, 2023 at 9:20 PM Robert Haas <robertmhaas@gmail.com> wrote: > Unless someone has a brilliant idea that I lack, this suggests to me > that this whole line of testing is a dead end. I can, of course, write > tests that compare clusters *logically* -- do the correct relations > exist, are they accessible, do they have the right contents? Can't we think of comparing at the block level, like we can compare each block but ignore the content of the hole? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 4, 2023 at 8:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Can't we think of comparing at the block level, like we can compare
> each block but ignore the content of the hole?

We could do that, but I don't think that's a full solution. I think I'd end up having to reimplement the equivalent of heap_mask, btree_mask, et al. in Perl, which doesn't seem very reasonable. It's fairly complicated logic even written in C, and doing the right thing in Perl would be more complex, I think, because it wouldn't have access to all the same #defines, which depend on things like word size and endianness and stuff. If we want to allow this sort of comparison, I feel we should think of changing the C code in some way to make it work reliably rather than try to paper over the problems in Perl.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
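[Editor's note: for what it's worth, the hole-ignoring part of the comparison Dilip suggests is the easy piece; it's the per-access-method masking Robert mentions that is hard. A minimal hypothetical sketch of just the hole-only comparison, assuming the standard 8kB page layout with pd_lower and pd_upper as 16-bit fields at page-header byte offsets 12 and 14, read in native byte order:]

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

/* Compare two pages while ignoring the hole [pd_lower, pd_upper).
 * Hypothetical helper for illustration only: it assumes the standard
 * PostgreSQL page header layout (pd_lower at offset 12, pd_upper at
 * offset 14) and does none of the heap_mask/btree_mask-style masking
 * that a real comparison would also need. */
static bool
pages_equal_ignoring_hole(const uint8_t *a, const uint8_t *b)
{
    uint16_t a_lower, a_upper, b_lower, b_upper;

    memcpy(&a_lower, a + 12, sizeof(uint16_t));
    memcpy(&a_upper, a + 14, sizeof(uint16_t));
    memcpy(&b_lower, b + 12, sizeof(uint16_t));
    memcpy(&b_upper, b + 14, sizeof(uint16_t));

    /* Pages whose hole boundaries differ are structurally different. */
    if (a_lower != b_lower || a_upper != b_upper)
        return false;

    /* Compare everything before the hole and everything from pd_upper on. */
    return memcmp(a, b, a_lower) == 0 &&
           memcmp(a + a_upper, b + a_upper, BLCKSZ - a_upper) == 0;
}
```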
On Wed, Aug 30, 2023 at 9:20 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Meanwhile, here's a rebased set of patches. The somewhat-primitive
> attempts at writing tests are in 0009, but they don't work, for the
> reasons explained above. I think I'd probably like to go ahead and
> commit 0001 and 0002 soon if there are no objections, since I think
> those are good refactorings independently of the rest of this.
>

I have started reading the patch today. I haven't yet completed one pass, but here are my comments on 0007.

1.
+ BlockNumber relative_block_numbers[RELSEG_SIZE];

This is close to 400kB of memory, so I think it is better we palloc it instead of keeping it on the stack.

2.
 /*
  * Try to parse the directory name as an unsigned integer.
  *
- * Tablespace directories should be positive integers that can
- * be represented in 32 bits, with no leading zeroes or trailing
+ * Tablespace directories should be positive integers that can be
+ * represented in 32 bits, with no leading zeroes or trailing
  * garbage. If we come across a name that doesn't meet those
  * criteria, skip it.

Unrelated code refactoring hunk.

3.
+typedef struct
+{
+	const char *filename;
+	pg_checksum_context *checksum_ctx;
+	bbsink	   *sink;
+	size_t		bytes_sent;
+} FileChunkContext;

This structure is not used anywhere.

4.
+ * If the file is to be set incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks

/If the file is to be set incrementally/If the file is to be sent incrementally

5.
- while (bytes_done < statbuf->st_size)
+ while (1)
  {
- 	size_t		remaining = statbuf->st_size - bytes_done;
+ 	/*

I do not really like this change, because after removing this you have put 2 independent checks for sending the full file[1] and sending it incrementally[2]. Actually, for sending incrementally, 'statbuf->st_size' is computed from 'num_incremental_blocks' itself, so why don't we keep this breaking condition in the while loop itself? So that we can avoid these two separate conditions.

[1]
+ /*
+  * If we've read the required number of bytes, then it's time to
+  * stop.
+  */
+ if (bytes_done >= statbuf->st_size)
+ 	break;

[2]
+ /*
+  * If we've read all the blocks, then it's time to stop.
+  */
+ if (ibindex >= num_incremental_blocks)
+ 	break;

6.
+typedef struct
+{
+	TimeLineID	tli;
+	XLogRecPtr	start_lsn;
+	XLogRecPtr	end_lsn;
+} backup_wal_range;
+
+typedef struct
+{
+	uint32		status;
+	const char *path;
+	size_t		size;
+} backup_file_entry;

Better we add some comments for these structures.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Aug 30, 2023 at 4:50 PM Robert Haas <robertmhaas@gmail.com> wrote: [..]

I've played a little bit more with this second batch of patches on e8d74ad625f7344f6b715254d3869663c1569a51 @ 31Aug (days before the wait events refactor):

test_across_wallevelminimal.sh
test_many_incrementals_dbcreate.sh
test_many_incrementals.sh
test_multixact.sh
test_pending_2pc.sh
test_reindex_and_vacuum_full.sh
test_truncaterollback.sh
test_unlogged_table.sh

All those basic tests had GOOD results. Please find attached. I'll try to schedule some more realistic (in terms of workload and sizes) tests in a couple of days + maybe have some fun with cross-backup-and-restores across standbys.

As for the earlier doubt: the raw wal_level = minimal situation shouldn't be a concern, because it requires max_wal_senders == 0, while pg_basebackup requires it above 0 (due to "FATAL: number of requested standby connections exceeds max_wal_senders (currently 0)"). I wanted to also introduce corruption into pg_walsummaries files, but later saw in the code that this is already covered by CRC32, cool.

In v07:

> +#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 160000

170000 ?

> A related design question is whether we should really be sending the
> whole backup manifest to the server at all. If it turns out that we
> don't really need anything except for the LSN of the previous backup,
> we could send that one piece of information instead of everything. On
> the other hand, if we need the list of files from the previous backup,
> then sending the whole manifest makes sense.

If that is still an area open for discussion: wouldn't it be better to just specify an LSN, as that would allow resyncing a standby across a major lag where the WAL to replay would be enormous? Given that we had primary->standby where the standby would be stuck on some LSN, right now it would be:

1) calculate the backup manifest of the desynced 10TB standby (how? using which tool?) - even if possible, that means reading 10TB of data instead of just putting in a number, doesn't it?
2) backup the primary with such an incremental backup >= LSN
3) copy the incremental backup to the standby
4) apply it to the impaired standby
5) restart the WAL replay

> - We only know how to operate on directories, not tar files. I thought
> about that when working on pg_verifybackup as well, but I didn't do
> anything about it. It would be nice to go back and make that tool work
> on tar-format backups, and this one, too. I don't think there would be
> a whole lot of point trying to operate on compressed tar files because
> you need random access and that seems hard on a compressed file, but
> on uncompressed files it seems at least theoretically doable. I'm not
> sure whether anyone would care that much about this, though, even
> though it does sound pretty cool.

Also, maybe it's too early to ask, but wouldn't it be nice if we could have a future option in pg_combinebackup to avoid double writes when used from restore hosts? (Right now we first need to reconstruct the original datadir from the full and incremental backups on the host storing the backups, and then TRANSFER it again to the target host.) So something like this could work well from the restore host:

pg_combinebackup /tmp/backup1 /tmp/incbackup2 /tmp/incbackup3 -O tar -o - | ssh dbserver 'tar xvf -C /path/to/restored/cluster - '

The bad thing is that such a pipe prevents parallelism from day 1, and I'm afraid I do not have a better easy idea on how to have both at the same time in the long term.

-J.
Attachment
On Fri, Sep 1, 2023 at 10:30 AM Robert Haas <robertmhaas@gmail.com> wrote: > > No objections to 0001/0002. > > Cool. Nobody else objected either, so I went ahead and committed those. I'll rebase the rest of the patches on top of the latest master and repost, hopefully after addressing some of the other review comments from Dilip and Jakub. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Sep 12, 2023 at 5:56 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > + BlockNumber relative_block_numbers[RELSEG_SIZE]; > > This is close to 400kB of memory, so I think it is better we palloc it > instead of keeping it in the stack. Fixed. > Unrelated code refactoring hunk Fixed. > This structure is not used anywhere. Removed. > /If the file is to be set incrementally/If the file is to be sent incrementally Fixed. > I do not really like this change, because after removing this you have > put 2 independent checks for sending the full file[1] and sending it > incrementally[1]. Actually for sending incrementally > 'statbuf->st_size' is computed from the 'num_incremental_blocks' > itself so why don't we keep this breaking condition in the while loop > itself? So that we can avoid these two separate conditions. I don't think that would be correct. The number of bytes that need to be read from the original file is not equal to the number of bytes that will be written to the incremental file. Admittedly, they're currently different by less than a block, but that could change if we change the format of the incremental file (e.g. suppose we compressed the blocks in the incremental file with gzip, or smushed out the holes in the pages). I wrote the loop as I did precisely so that the two cases could have different loop exit conditions. > Better we add some comments for these structures. Done. Here's a new patch set, also addressing Jakub's observation that MINIMUM_VERSION_FOR_WAL_SUMMARIES needed updating. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
- v3-0002-Refactor-parse_filename_for_nontemp_relation-to-p.patch
- v3-0004-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch
- v3-0001-Change-struct-tablespaceinfo-s-oid-member-from-ch.patch
- v3-0003-Change-how-a-base-backup-decides-which-files-have.patch
- v3-0005-Prototype-patch-for-incremental-and-differential-.patch
- v3-0006-Add-new-pg_walsummary-tool.patch
- v3-0007-Add-TAP-tests-this-is-broken-doesn-t-work.patch
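[Editor's note: Robert's point above about the two loop exit conditions can be illustrated with a small sketch. This is hypothetical C for illustration, not the patch's actual code: the incremental-send loop stops on the count of blocks to send (ibindex), independent of how many bytes end up in the output, so the output format could later compress blocks or squeeze out holes without changing the loop:]

```c
#include <stdio.h>

#define BLCKSZ 8192

/* Hypothetical sketch of the incremental-send loop discussed above. The
 * exit condition is the number of blocks to send, not the byte size of
 * the output file, so the read side and the write side are free to have
 * different sizes. Returns the number of bytes written to the sink. */
static size_t
send_incremental_blocks(FILE *relfile, const unsigned *block_numbers,
                        unsigned num_incremental_blocks, FILE *sink)
{
    unsigned ibindex = 0;
    size_t bytes_sent = 0;
    char buf[BLCKSZ];

    while (1)
    {
        /* If we've read all the blocks, then it's time to stop. */
        if (ibindex >= num_incremental_blocks)
            break;

        fseek(relfile, (long) block_numbers[ibindex] * BLCKSZ, SEEK_SET);
        if (fread(buf, 1, BLCKSZ, relfile) != BLCKSZ)
            break;              /* truncated source file */
        bytes_sent += fwrite(buf, 1, BLCKSZ, sink);
        ibindex++;
    }
    return bytes_sent;
}
```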
On Thu, Sep 28, 2023 at 6:22 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > all those basic tests had GOOD results. Please find attached. I'll try > to schedule some more realistic (in terms of workload and sizes) test > in a couple of days + maybe have some fun with cross-backup-and > restores across standbys. That's awesome! Thanks for testing! This can definitely benefit from any amount of beating on it that people wish to do. It's a complex, delicate area that risks data loss. > If that is still an area open for discussion: wouldn't it be better to > just specify LSN as it would allow resyncing standby across major lag > where the WAL to replay would be enormous? Given that we had > primary->standby where standby would be stuck on some LSN, right now > it would be: > 1) calculate backup manifest of desynced 10TB standby (how? using > which tool?) - even if possible, that means reading 10TB of data > instead of just putting a number, isn't it? > 2) backup primary with such incremental backup >= LSN > 3) copy the incremental backup to standby > 4) apply it to the impaired standby > 5) restart the WAL replay Hmm. I wonder if this would even be a safe procedure. I admit that I can't quite see a problem with it, but sometimes I'm kind of dumb. > Also maybe it's too early to ask, but wouldn't it be nice if we could > have an future option in pg_combinebackup to avoid double writes when > used from restore hosts (right now we need to first to reconstruct the > original datadir from full and incremental backups on host hosting > backups and then TRANSFER it again and on target host?). So something > like that could work well from restorehost: pg_combinebackup > /tmp/backup1 /tmp/incbackup2 /tmp/incbackup3 -O tar -o - | ssh > dbserver 'tar xvf -C /path/to/restored/cluster - ' . The bad thing is > that such a pipe prevents parallelism from day 1 and I'm afraid I do > not have a better easy idea on how to have both at the same time in > the long term. 
I don't think it's too early to ask for this, but I do think it's too early for you to get it. ;-) -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Oct 3, 2023 at 2:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Here's a new patch set, also addressing Jakub's observation that
> MINIMUM_VERSION_FOR_WAL_SUMMARIES needed updating.

Here's yet another new version. In this version, I reversed the order of the first two patches, with the idea that what's now 0001 seems fairly reasonable as an independent commit, and could thus perhaps be committed sometime soon-ish. In the main patch, I added SGML documentation for pg_combinebackup. I also fixed the broken TAP tests so that they work, by basing them on pg_dump equivalence rather than file-level equivalence. I'm sad to give up on testing the latter, but it seems to be unrealistic. I cleaned up a few other odds and ends, too.

But, what exactly is the bigger picture for this patch in terms of moving forward? Here's a list of things that are on my mind:

- I'd like to get the patch to mark the redo point in the WAL committed[1] and then rework this patch set to make use of that infrastructure. Right now, we make a best effort to end WAL summaries at redo point boundaries, but it's racey, and sometimes we fail to do so. In theory that just has the effect of potentially making an incremental backup contain some extra blocks that it shouldn't really need to contain, but I think it can actually lead to weird stalls, because when an incremental backup is taken, we have to wait until a WAL summary shows up that extends at least up to the start LSN of the backup we're about to take. I believe all the logic in this area can be made a good deal simpler and more reliable if that patch gets committed and this one reworked accordingly.

- I would like some feedback on the generation of WAL summary files. Right now, I have it enabled by default, and summaries are kept for a week. That means that, with no additional setup, you can take an incremental backup as long as the reference backup was taken in the last week.
File removal is governed by mtimes, so if you change the mtimes of your summary files or whack your system clock around, weird things might happen. But obviously this might be inconvenient. Some people might not want WAL summary files to be generated at all because they don't care about incremental backup, and other people might want them retained for longer, and still other people might want them to be not removed automatically, or removed automatically based on some criteria other than mtime. I don't really know what's best here. I don't think the default policy that the patches implement is especially terrible, but it's just something that I made up and I don't have any real confidence that it's wonderful.

One point to consider here is that, if WAL summarization is enabled, checkpoints can't remove WAL that isn't summarized yet. Mostly that's not a problem, I think, because the WAL summarizer is pretty fast. But it could increase disk consumption for some people. I don't think that we need to worry about the summaries themselves being a problem in terms of space consumption; at least in all the cases I've tested, they're just not very big.

- On a related note, I haven't yet tested this on a standby, which is a thing that I definitely need to do. I don't know of a reason why it shouldn't be possible for all of this machinery to work on a standby just as it does on a primary, but then we need the WAL summarizer to run there too, which could end up being a waste if nobody ever tries to take an incremental backup. I wonder how that should be reflected in the configuration. We could do something like what we've done for archive_mode, where on means "only on if this is a primary" and you have to say always if you want it to run on standbys as well ... but I'm not sure if that's a design pattern that we really want to replicate into more places.
I'd be somewhat inclined to just make whatever configuration parameters we need to configure this thing on the primary also work on standbys, and you can set each server up as you please. But I'm open to other suggestions.

- We need to settle the question of whether to send the whole backup manifest to the server or just the LSN. In a previous attempt at incremental backup, we decided the whole manifest was necessary, because flat-copying files could make new data show up with old LSNs. But that version of the patch set was trying to find modified blocks by checking their LSNs individually, not by summarizing WAL. And since the operations that flat-copy files are WAL-logged, the WAL summary approach seems to eliminate that problem - maybe an LSN (and the associated TLI) is good enough now. This also relates to Jakub's question about whether this machinery could be used to fast-forward a standby, which is not exactly a base backup but ... perhaps close enough? I'm somewhat inclined to believe that we can simplify to an LSN and TLI; however, if we do that, then we'll have big problems if later we realize that we want the manifest for something after all. So if anybody thinks that there's a reason to keep doing what the patch does today -- namely, upload the whole manifest to the server -- please speak up.

- It's regrettable that we don't have incremental JSON parsing; I think that means anyone who has a backup manifest that is bigger than 1GB can't use this feature. However, that's also a problem for the existing backup manifest feature, and as far as I can see, we have no complaints about it. So maybe people just don't have databases with enough relations for that to be much of a live issue yet. I'm inclined to treat this as a non-blocker, although Andrew Dunstan tells me he does have a prototype for incremental JSON parsing, so maybe that will land and we can use it here.

- Right now, I have a hard-coded 60 second timeout for WAL summarization.
If you try to take an incremental backup and the WAL summaries you need don't show up within 60 seconds, the backup times out. I think that's a reasonable default, but should it be configurable? If yes, should that be a GUC or, perhaps better, a pg_basebackup option?

- I'm curious what people think about the pg_walsummary tool that is included in 0006. I think it's going to be fairly important for debugging, but it does feel a little bit bad to add a new binary for something pretty niche. Nevertheless, merging it into any other utility seems relatively awkward, so I'm inclined to think both that this should be included in whatever finally gets committed and that it should be a separate binary. I considered whether it should go in contrib, but we seem to have moved to a policy that heavily favors limiting contrib to extensions and loadable modules, rather than binaries.

Clearly there's a good amount of stuff to sort out here, but we've still got quite a bit of time left before feature freeze so I'd like to have a go at it. Please let me know your thoughts, if you have any.

[1] http://postgr.es/m/CA+TgmoZAM24Ub=uxP0aWuWstNYTUJQ64j976FYJeVaMJ+qD0uw@mail.gmail.com

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Attachment
- v4-0006-Add-new-pg_walsummary-tool.patch
- v4-0001-Refactor-parse_filename_for_nontemp_relation-to-p.patch
- v4-0003-Change-how-a-base-backup-decides-which-files-have.patch
- v4-0004-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch
- v4-0005-Prototype-patch-for-incremental-and-differential-.patch
- v4-0002-Change-struct-tablespaceinfo-s-oid-member-from-ch.patch
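[Editor's note: the mtime-based retention policy Robert describes above can be sketched roughly as follows. This is a hypothetical standalone sketch: the function name, the flat directory scan, and the error handling are invented for illustration, and the real patch's removal logic lives in the summarizer machinery, not in a helper like this:]

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Remove any *.summary file whose mtime is older than keep_seconds.
 * Hypothetical sketch of the retention policy discussed above; note the
 * caveat from the thread: because this keys off mtime, touching the
 * files or moving the system clock changes what gets removed. */
static int
prune_wal_summaries(const char *summary_dir, time_t keep_seconds)
{
    DIR *dir = opendir(summary_dir);
    struct dirent *de;
    time_t cutoff = time(NULL) - keep_seconds;
    int removed = 0;

    if (dir == NULL)
        return -1;
    while ((de = readdir(dir)) != NULL)
    {
        char path[1024];
        struct stat st;
        size_t len = strlen(de->d_name);

        /* Only consider files with the ".summary" suffix. */
        if (len < 8 || strcmp(de->d_name + len - 8, ".summary") != 0)
            continue;
        snprintf(path, sizeof(path), "%s/%s", summary_dir, de->d_name);
        if (stat(path, &st) == 0 && st.st_mtime < cutoff)
        {
            unlink(path);
            removed++;
        }
    }
    closedir(dir);
    return removed;
}
```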
On Wed, Oct 4, 2023 at 4:08 PM Robert Haas <robertmhaas@gmail.com> wrote: > Clearly there's a good amount of stuff to sort out here, but we've > still got quite a bit of time left before feature freeze so I'd like > to have a go at it. Please let me know your thoughts, if you have any. Apparently, nobody has any thoughts, but here's an updated patch set anyway. The main change, other than rebasing, is that I did a bunch more documentation work on the main patch (0005). I'm much happier with it now, although I expect it may need more adjustments here and there as outstanding design questions get settled. After some thought, I think that it should be fine to commit 0001 and 0002 as independent refactoring patches, and I plan to go ahead and do that pretty soon unless somebody objects. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
- v5-0004-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch
- v5-0006-Add-new-pg_walsummary-tool.patch
- v5-0003-Change-how-a-base-backup-decides-which-files-have.patch
- v5-0002-Change-struct-tablespaceinfo-s-oid-member-from-ch.patch
- v5-0005-Prototype-patch-for-incremental-backup.patch
- v5-0001-Refactor-parse_filename_for_nontemp_relation-to-p.patch
On 10/19/23 12:05, Robert Haas wrote:
> On Wed, Oct 4, 2023 at 4:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
>> Clearly there's a good amount of stuff to sort out here, but we've
>> still got quite a bit of time left before feature freeze so I'd like
>> to have a go at it. Please let me know your thoughts, if you have any.
>
> Apparently, nobody has any thoughts, but here's an updated patch set
> anyway. The main change, other than rebasing, is that I did a bunch
> more documentation work on the main patch (0005). I'm much happier
> with it now, although I expect it may need more adjustments here and
> there as outstanding design questions get settled.
>
> After some thought, I think that it should be fine to commit 0001 and
> 0002 as independent refactoring patches, and I plan to go ahead and do
> that pretty soon unless somebody objects.

0001 looks pretty good to me. The only thing I find a little troublesome is the repeated construction of file names with/without segment numbers in ResetUnloggedRelationsInDbspaceDir(), e.g.:

+ if (segno == 0)
+ 	snprintf(dstpath, sizeof(dstpath), "%s/%u",
+ 			 dbspacedirname, relNumber);
+ else
+ 	snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
+ 			 dbspacedirname, relNumber, segno);

If this happened three times I'd definitely want a helper function, but even with two I think it would be a bit nicer.

0002 is definitely a good idea. FWIW pgBackRest does this conversion but also errors if it does not succeed. We have never seen a report of this error happening in the wild, so I think it must be pretty rare if it does happen.

Regards,
-David
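[Editor's note: for illustration, the helper David has in mind might look something like this. Hypothetical sketch: the function name and signature are invented here and are not part of the patch set:]

```c
#include <stdio.h>

/* Build the path for a relation file, with or without a segment number.
 * Hypothetical helper for illustration; the patch instead repeats the
 * two snprintf() calls inline, which Robert argues below is clearer. */
static void
unlogged_relation_path(char *buf, size_t bufsz, const char *dbspacedirname,
                       unsigned relNumber, unsigned segno)
{
    if (segno == 0)
        snprintf(buf, bufsz, "%s/%u", dbspacedirname, relNumber);
    else
        snprintf(buf, bufsz, "%s/%u.%u", dbspacedirname, relNumber, segno);
}
```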
On Thu, Oct 19, 2023 at 3:18 PM David Steele <david@pgmasters.net> wrote: > 0001 looks pretty good to me. The only thing I find a little troublesome > is the repeated construction of file names with/without segment numbers > in ResetUnloggedRelationsInDbspaceDir(), .e.g.: > > + if (segno == 0) > + snprintf(dstpath, sizeof(dstpath), "%s/%u", > + dbspacedirname, relNumber); > + else > + snprintf(dstpath, sizeof(dstpath), "%s/%u.%u", > + dbspacedirname, relNumber, segno); > > > If this happened three times I'd definitely want a helper function, but > even with two I think it would be a bit nicer. Personally I think that would make the code harder to read rather than easier. I agree that repeating code isn't great, but this is a relatively brief idiom and pretty self-explanatory. If other people agree with you I can change it, but to me it's not an improvement. > 0002 is definitely a good idea. FWIW pgBackRest does this conversion but > also errors if it does not succeed. We have never seen a report of this > error happening in the wild, so I think it must be pretty rare if it > does happen. Cool, but ... how about the main patch set? It's nice to get some of these refactoring bits and pieces out of the way, but if I spend the effort to work out what I think are the right answers to the remaining design questions for the main patch set and then find out after I've done all that that you have massive objections, I'm going to be annoyed. I've been trying to get this feature into PostgreSQL for years, and if I don't succeed this time, I want the reason to be something better than "well, I didn't find out that David disliked X until five minutes before I was planning to type 'git push'." I'm not really concerned about detailed bug-hunting in the main patches just yet. The time for that will come. 
But if you have views on how to resolve the design questions that I mentioned in a couple of emails back, or intend to advocate vigorously against the whole concept for some reason, let's try to sort that out sooner rather than later. -- Robert Haas EDB: http://www.enterprisedb.com
Hi Robert,

On Wed, Oct 4, 2023 at 10:09 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Oct 3, 2023 at 2:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Here's a new patch set, also addressing Jakub's observation that
> > MINIMUM_VERSION_FOR_WAL_SUMMARIES needed updating.
>
> Here's yet another new version.[..]

Okay, so more good news - related to patch version #4. A not-so-tiny stress test consisting of a pgbench run for 24h straight (with incremental backups every 2h, based on an initial full backup), followed by two PITRs (one not using incremental backup and one using it, to illustrate the performance point - and potentially spot any errors in between). In both cases it worked fine. Pgbench has the behaviour that it doesn't cause space growth over time - it produces lots of WAL instead.

Some stats:

START DBSIZE: ~3.3GB (pgbench -i -s 200 --partitions=8)
END DBSIZE: ~4.3GB
RUN DURATION: 24h (pgbench -P 1 -R 100 -T 86400)
WALARCHIVES-24h: 77GB
FULL-DB-BACKUP-SIZE: 3.4GB
INCREMENTAL-BACKUP-11-SIZE: 3.5GB

Env: Azure VM D4s (4 vCPU), Debian 11, gcc 10.2, normal build (asserts and debug disabled)

The increments were taken every 2h just to see if they would fail for any reason - they did not.

PITR RTO RESULTS (copy/pg_combinebackup time + recovery time):

1. time to restore from the full backup (+ recovery of 24h of WAL [77GB]): 53s + 4640s =~ 78min
2. time to restore from the full backup + the incremental backup from 2h ago (+ recovery of 2h of WAL [5.4GB]): 68s + 190s =~ 4min18s

I could probably pre-populate the DB with 1TB of cold data (not touched by pgbench at all), just for the sake of argument, and that could demonstrate how space-efficient the incremental backup can be, but most of the time would be wasted on copying the 1TB here...

> - I would like some feedback on the generation of WAL summary files.
> Right now, I have it enabled by default, and summaries are kept for a
> week.
> That means that, with no additional setup, you can take an
> incremental backup as long as the reference backup was taken in the
> last week.

I've just noticed one thing while recovery is in progress: is summarization working during recovery - in the background - an expected behaviour? I'm wondering about that, because after a freshly restored and recovered DB, one would need to create a *new* full backup, and only from that point would new summaries have any use? Sample log:

2023-10-20 11:10:02.288 UTC [64434] LOG: restored log file "000000010000000200000022" from archive
2023-10-20 11:10:02.599 UTC [64434] LOG: restored log file "000000010000000200000023" from archive
2023-10-20 11:10:02.769 UTC [64446] LOG: summarized WAL on TLI 1 from 2/139B1130 to 2/239B1518
2023-10-20 11:10:02.923 UTC [64434] LOG: restored log file "000000010000000200000024" from archive
2023-10-20 11:10:03.193 UTC [64434] LOG: restored log file "000000010000000200000025" from archive
2023-10-20 11:10:03.345 UTC [64432] LOG: restartpoint starting: wal
2023-10-20 11:10:03.407 UTC [64446] LOG: summarized WAL on TLI 1 from 2/239B1518 to 2/25B609D0
2023-10-20 11:10:03.521 UTC [64434] LOG: restored log file "000000010000000200000026" from archive
2023-10-20 11:10:04.429 UTC [64434] LOG: restored log file "000000010000000200000027" from archive

> - On a related note, I haven't yet tested this on a standby, which is
> a thing that I definitely need to do. I don't know of a reason why it
> shouldn't be possible for all of this machinery to work on a standby
> just as it does on a primary, but then we need the WAL summarizer to
> run there too, which could end up being a waste if nobody ever tries
> to take an incremental backup. I wonder how that should be reflected
> in the configuration. We could do something like what we've done for
> archive_mode, where on means "only on if this is a primary" and you
> have to say always if you want it to run on standbys as well ...
but > I'm not sure if that's a design pattern that we really want to > replicate into more places. I'd be somewhat inclined to just make > whatever configuration parameters we need to configure this thing on > the primary also work on standbys, and you can set each server up as > you please. But I'm open to other suggestions. I'll try to play with some standby restores in the future, stay tuned. Regards, -J.
On 10/19/23 16:00, Robert Haas wrote: > On Thu, Oct 19, 2023 at 3:18 PM David Steele <david@pgmasters.net> wrote: >> 0001 looks pretty good to me. The only thing I find a little troublesome >> is the repeated construction of file names with/without segment numbers >> in ResetUnloggedRelationsInDbspaceDir(), .e.g.: >> >> + if (segno == 0) >> + snprintf(dstpath, sizeof(dstpath), "%s/%u", >> + dbspacedirname, relNumber); >> + else >> + snprintf(dstpath, sizeof(dstpath), "%s/%u.%u", >> + dbspacedirname, relNumber, segno); >> >> >> If this happened three times I'd definitely want a helper function, but >> even with two I think it would be a bit nicer. > > Personally I think that would make the code harder to read rather than > easier. I agree that repeating code isn't great, but this is a > relatively brief idiom and pretty self-explanatory. If other people > agree with you I can change it, but to me it's not an improvement. Then I'm fine with it as is. >> 0002 is definitely a good idea. FWIW pgBackRest does this conversion but >> also errors if it does not succeed. We have never seen a report of this >> error happening in the wild, so I think it must be pretty rare if it >> does happen. > > Cool, but ... how about the main patch set? It's nice to get some of > these refactoring bits and pieces out of the way, but if I spend the > effort to work out what I think are the right answers to the remaining > design questions for the main patch set and then find out after I've > done all that that you have massive objections, I'm going to be > annoyed. I've been trying to get this feature into PostgreSQL for > years, and if I don't succeed this time, I want the reason to be > something better than "well, I didn't find out that David disliked X > until five minutes before I was planning to type 'git push'." I simply have not had time to look at the main patch set in any detail. > I'm not really concerned about detailed bug-hunting in the main > patches just yet. 
The time for that will come. But if you have views > on how to resolve the design questions that I mentioned in a couple of > emails back, or intend to advocate vigorously against the whole > concept for some reason, let's try to sort that out sooner rather than > later. In my view this feature puts the cart way before the horse. I'd think higher-priority features might be parallelism, a backup repository, expiration management, archiving, or maybe even a restore command. It seems the only goal here is to make pg_basebackup a tool for external backup software to use, which might be OK, but I don't believe this feature really advances pg_basebackup as a usable piece of stand-alone software. If people really think that start/stop backup is too complicated an interface, how are they supposed to track page incrementals and get them to a place where pg_combinebackup can put them back together? If automation is required to use this feature, shouldn't pg_basebackup implement that automation? I have plenty of thoughts about the implementation as well, but I have a lot on my plate right now and I don't have time to get into it. I don't plan to stand in your way on this feature. I'm reviewing what patches I can out of courtesy and to be sure that nothing adjacent to your work is being affected. My apologies if my reviews are not meeting your expectations, but I am contributing as my time constraints allow. Regards, -David
On Fri, Oct 20, 2023 at 11:30 AM David Steele <david@pgmasters.net> wrote: > Then I'm fine with it as is. OK, thanks. > In my view this feature puts the cart way before the horse. I'd think > higher priority features might be parallelism, a backup repository, > expiration management, archiving, or maybe even a restore command. > > It seems the only goal here is to make pg_basebackup a tool for external > backup software to use, which might be OK, but I don't believe this > feature really advances pg_basebackup as a usable piece of stand-alone > software. If people really think that start/stop backup is too > complicated an interface how are they supposed to track page > incrementals and get them to a place where pg_combinebackup can put them > backup together? If automation is required to use this feature, > shouldn't pg_basebackup implement that automation? > > I have plenty of thoughts about the implementation as well, but I have a > lot on my plate right now and I don't have time to get into it. > > I don't plan to stand in your way on this feature. I'm reviewing what > patches I can out of courtesy and to be sure that nothing adjacent to > your work is being affected. My apologies if my reviews are not meeting > your expectations, but I am contributing as my time constraints allow. Sorry, I realize reading this response that I probably didn't do a very good job writing that email and came across sounding like a jerk. Possibly, I actually am a jerk. Whether it just sounded like it or I actually am, I apologize. But your last paragraph here gets at my real question, which is whether you were going to try to block the feature. I recognize that we have different priorities when it comes to what would make most sense to implement in PostgreSQL, and that's OK, or at least, it's OK with me. I also don't have any particular expectation about how much you should review the patch or in what level of detail, and I have sincerely appreciated your feedback thus far. 
If you are able to continue to provide more, that's great, and if that's not, well, you're not obligated. What I was concerned about was whether that review was a precursor to a vigorous attempt to keep the main patch from getting committed, because if that was going to be the case, then I'd like to surface that conflict sooner rather than later. It sounds like that's not an issue, which is great. At the risk of drifting into the fraught question of what I *should* be implementing rather than the hopefully-less-fraught question of whether what I am actually implementing is any good, I see incremental backup as a way of removing some of the use cases for the low-level backup API. If you said "but people still will have lots of reasons to use it," I would agree; and if you said "people can still screw things up with pg_basebackup," I would also agree. Nonetheless, most of the disasters I've personally seen have stemmed from the use of the low-level API rather than from the use of pg_basebackup, though there are exceptions. I also think a lot of the use of the low-level API is driven by it being just too darn slow to copy the whole database, and incremental backup can help with that in some circumstances. Also, I have worked fairly hard to try to make sure that if you misuse pg_combinebackup, or fail to use it altogether, you'll get an error rather than silent data corruption. I would be interested to hear about scenarios where the checks that I've implemented can be defeated by something that is plausibly described as stupidity rather than malice. I'm not sure we can fix all such cases, but I'm very alert to the horror that will befall me if user error looks exactly like a bug in the code. For my own sanity, we have to be able to distinguish those cases. 
Moreover, we also need to be able to distinguish backup-time bugs from reassembly-time bugs, which is why I've got the pg_walsummary tool, and why pg_combinebackup has the ability to emit fairly detailed debugging output. I anticipate those things being useful in investigating bug reports when they show up. I won't be too surprised if it turns out that more work on sanity-checking and/or debugging tools is needed, but I think your concern about people misusing stuff is bang on target and I really want to do whatever we can to avoid that when possible and detect it when it happens. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Oct 20, 2023 at 9:20 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > Okay, so another good news - related to the patch version #4. > Not-so-tiny stress test consisting of pgbench run for 24h straight > (with incremental backups every 2h, with base of initial full backup), > followed by two PITRs (one not using incremental backup and one using > to to illustrate the performance point - and potentially spot any > errors in between). In both cases it worked fine. This is great testing, thanks. What might be even better is to test whether the resulting backups are correct, somehow. > I've just noticed one thing when recovery is progress: is > summarization working during recovery - in the background - an > expected behaviour? I'm wondering about that, because after freshly > restored and recovered DB, one would need to create a *new* full > backup and only from that point new summaries would have any use? Actually, I think you could take an incremental backup relative to a full backup from a previous timeline. But the question of what summarization ought to do (or not do) during recovery, and whether it ought to be enabled by default, and what the retention policy ought to be are very much live ones. Right now, it's enabled by default and keeps summaries for a week, assuming you don't reset your local clock and that it advances at the same speed as the universe's own clock. But that's all debatable. Any views? Meanwhile, here's a new patch set. I went ahead and committed the first two preparatory patches, as I said earlier that I intended to do. And here I've adjusted the main patch, which is now 0003, for the addition of XLOG_CHECKPOINT_REDO, which permitted me to simplify a few things. wal_summarize_mb now feels like a bit of a silly GUC -- presumably you'd never care, unless you had an absolutely gigantic inter-checkpoint WAL distance. And if you have that, maybe you should also have enough memory to summarize all that WAL. 
Or maybe not: perhaps it's better to write WAL summaries more than once per checkpoint when checkpoints are really big. But I'm worried that the GUC will become a source of needless confusion for users. For most people, it seems like emitting one summary per checkpoint should be totally fine, and they might prefer a simple Boolean GUC, summarize_wal = true | false, over this. I'm just not quite sure about the corner cases. -- Robert Haas EDB: http://www.enterprisedb.com
On 10/23/23 11:44, Robert Haas wrote: > On Fri, Oct 20, 2023 at 11:30 AM David Steele <david@pgmasters.net> wrote: >> >> I don't plan to stand in your way on this feature. I'm reviewing what >> patches I can out of courtesy and to be sure that nothing adjacent to >> your work is being affected. My apologies if my reviews are not meeting >> your expectations, but I am contributing as my time constraints allow. > > Sorry, I realize reading this response that I probably didn't do a > very good job writing that email and came across sounding like a jerk. > Possibly, I actually am a jerk. Whether it just sounded like it or I > actually am, I apologize. That was the way it came across, though I prefer to think it was unintentional. I certainly understand how frustrating dealing with a large and uncertain patch can be. Either way, I appreciate the apology. Now onward... > But your last paragraph here gets at my real > question, which is whether you were going to try to block the feature. > I recognize that we have different priorities when it comes to what > would make most sense to implement in PostgreSQL, and that's OK, or at > least, it's OK with me. This seems perfectly natural to me. > I also don't have any particular expectation > about how much you should review the patch or in what level of detail, > and I have sincerely appreciated your feedback thus far. If you are > able to continue to provide more, that's great, and if that's not, > well, you're not obligated. What I was concerned about was whether > that review was a precursor to a vigorous attempt to keep the main > patch from getting committed, because if that was going to be the > case, then I'd like to surface that conflict sooner rather than later. > It sounds like that's not an issue, which is great. Overall I would say I'm not strongly for or against the patch.
I think it will be very difficult to use in a manual fashion, but automation is the way to go in general, so that's not necessarily an argument against it. However, this is an area of great interest to me so I do want to at least make sure nothing is being impacted adjacent to the main goal of this patch. So far I have seen no sign of that, but that has been a primary goal of my reviews. > At the risk of drifting into the fraught question of what I *should* > be implementing rather than the hopefully-less-fraught question of > whether what I am actually implementing is any good, I see incremental > backup as a way of removing some of the use cases for the low-level > backup API. If you said "but people still will have lots of reasons to > use it," I would agree; and if you said "people can still screw things > up with pg_basebackup," I would also agree. Nonetheless, most of the > disasters I've personally seen have stemmed from the use of the > low-level API rather than from the use of pg_basebackup, though there > are exceptions. This all makes sense to me. > I also think a lot of the use of the low-level API is > driven by it being just too darn slow to copy the whole database, and > incremental backup can help with that in some circumstances. I would argue that restore performance is *more* important than backup performance and this patch is a step backward in that regard. Backups will be faster and less space will be used in the repository, but restore performance is going to suffer. If the deltas are very small the difference will probably be negligible, but as the deltas get large (and especially if there are a lot of them) the penalty will be more noticeable. > Also, I > have worked fairly hard to try to make sure that if you misuse > pg_combinebackup, or fail to use it altogether, you'll get an error > rather than silent data corruption.
I would be interested to hear > about scenarios where the checks that I've implemented can be defeated > by something that is plausibly described as stupidity rather than > malice. I'm not sure we can fix all such cases, but I'm very alert to > the horror that will befall me if user error looks exactly like a bug > in the code. For my own sanity, we have to be able to distinguish > those cases. I was concerned with the difficulty of trying to stage the correct backups for pg_combinebackup, not whether it would recognize that the needed data was not available and then error appropriately. The latter is surmountable within pg_combinebackup but the former is left up to the user. > Moreover, we also need to be able to distinguish > backup-time bugs from reassembly-time bugs, which is why I've got the > pg_walsummary tool, and why pg_combinebackup has the ability to emit > fairly detailed debugging output. I anticipate those things being > useful in investigating bug reports when they show up. I won't be too > surprised if it turns out that more work on sanity-checking and/or > debugging tools is needed, but I think your concern about people > misusing stuff is bang on target and I really want to do whatever we > can to avoid that when possible and detect it when it happens. The ability of users to misuse tools is, of course, legendary, so that all sounds good to me. One note regarding the patches. I feel like v5-0005-Prototype-patch-for-incremental-backup should be split to have the WAL summarizer as one patch and the changes to base backup as a separate patch. It might not be useful to commit one without the other, but it would make for an easier read. Just my 2c. Regards, -David
On Mon, Oct 23, 2023 at 7:56 PM David Steele <david@pgmasters.net> wrote: > > I also think a lot of the use of the low-level API is > > driven by it being just too darn slow to copy the whole database, and > > incremental backup can help with that in some circumstances. > > I would argue that restore performance is *more* important than backup > performance and this patch is a step backward in that regard. Backups > will be faster and less space will be used in the repository, but > restore performance is going to suffer. If the deltas are very small the > difference will probably be negligible, but as the deltas get large (and > especially if there are a lot of them) the penalty will be more noticeable. I think an awful lot depends here on whether the repository is local or remote. If you have filesystem access to wherever the backups are stored anyway, I don't think that using pg_combinebackup to write out a new data directory is going to be much slower than copying one data directory from the repository to wherever you'd actually use the backup. It may be somewhat slower because we do need to access some data in every involved backup, but I don't think it should be vastly slower because we don't have to read every backup in its entirety. For each file, we read the (small) header of the newest incremental file and every incremental file that precedes it until we find a full file. Then, we construct a map of which blocks need to be read from which sources and read only the required blocks from each source. If all the blocks are coming from a single file (because there are no incrementals for a certain file, or they contain no blocks) then we just copy the entire source file in one shot, which can be optimized using the same tricks we use elsewhere. Inevitably, this is going to read more data and do more random I/O than just a flat copy of a directory, but it's not terrible.
The overall amount of I/O should be a lot closer to the size of the output directory than to the sum of the sizes of the input directories. Now, if the repository is remote, and you have to download all of those backups first, and then run pg_combinebackup on them afterward, that is going to be unpleasant, unless the incremental backups are all quite small. Possibly this could be addressed by teaching pg_combinebackup to do things like accessing data over HTTP and SSH, and relatedly, looking inside tarfiles without needing them unpacked. For now, I've left those as ideas for future improvement, but I think potentially they could address some of your concerns here. A difficulty is that there are a lot of protocols that people might want to use to push bytes around, and it might be hard to keep up with the march of progress. I do agree, though, that there's no such thing as a free lunch. I wouldn't recommend to anyone that they plan to restore from a chain of 100 incremental backups. Not only might it be slow, but the opportunities for something to go wrong are magnified. Even if you've automated everything well enough that there's no human error involved, what if you've got a corrupted file somewhere? Maybe that's not likely in absolute terms, but the more files you've got, the more likely it becomes. What I'd suggest someone do instead is periodically do pg_combinebackup full_reference_backup oldest_incremental -o new_full_reference_backup; rm -rf full_reference_backup; mv new_full_reference_backup full_reference_backup. The new full reference backup is intended to still be usable for restoring incrementals based on the incremental it replaced. I hope that, if people use the feature well, this should limit the need for really long backup chains. I am sure, though, that some people will use it poorly. Maybe there's room for more documentation on this topic. 
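[Editor's note: the newest-wins reconstruction Robert describes a couple of paragraphs above can be sketched in a few lines of Python. The names and in-memory structures here are illustrative assumptions, not the actual pg_combinebackup code, which works block-by-block on files.]

```python
# Sketch of the reconstruction strategy described above: walk the backups
# newest-to-oldest so the newest source seen for each block wins, with the
# full backup supplying whatever no incremental covers.

def build_block_map(full_blocks, incrementals):
    """full_blocks: dict of block_no -> data for the full backup's copy of a file.
    incrementals: list of dicts (oldest to newest), each mapping the block
    numbers present in that incremental file to their data."""
    source_map = {}
    # The newest incremental wins for any block it contains.
    for incr in reversed(incrementals):
        for block_no, data in incr.items():
            source_map.setdefault(block_no, data)
    # Any block not covered by an incremental comes from the full backup.
    for block_no, data in full_blocks.items():
        source_map.setdefault(block_no, data)
    return source_map
```

A real implementation would then batch the reads per source file, which is where the "copy the entire source file in one shot" optimization comes from when one source supplies every block.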
> I was concerned with the difficulty of trying to stage the correct > backups for pg_combinebackup, not whether it would recognize that the > needed data was not available and then error appropriately. The latter > is surmountable within pg_combinebackup but the former is left up to the > user. Indeed. > One note regarding the patches. I feel like > v5-0005-Prototype-patch-for-incremental-backup should be split to have > the WAL summarizer as one patch and the changes to base backup as a > separate patch. > > It might not be useful to commit one without the other, but it would > make for an easier read. Just my 2c. Yeah, maybe so. I'm not quite ready to commit to doing that split as of this writing but I will think about it and possibly do it. -- Robert Haas EDB: http://www.enterprisedb.com
On 04.10.23 22:08, Robert Haas wrote: > - I would like some feedback on the generation of WAL summary files. > Right now, I have it enabled by default, and summaries are kept for a > week. That means that, with no additional setup, you can take an > incremental backup as long as the reference backup was taken in the > last week. File removal is governed by mtimes, so if you change the > mtimes of your summary files or whack your system clock around, weird > things might happen. But obviously this might be inconvenient. Some > people might not want WAL summary files to be generated at all because > they don't care about incremental backup, and other people might want > them retained for longer, and still other people might want them to be > not removed automatically or removed automatically based on some > criteria other than mtime. I don't really know what's best here. I > don't think the default policy that the patches implement is > especially terrible, but it's just something that I made up and I > don't have any real confidence that it's wonderful. The easiest answer is to have it off by default. Let people figure out what works for them. There are various factors like storage, network, server performance, RTO that will determine what combination of full backup, incremental backup, and WAL replay will satisfy someone's requirements. I suppose tests could be set up to determine this to some degree. But we could also start slow and let people figure it out themselves. When pg_basebackup was added, it was also disabled by default. If we think that 7d is a good setting, then I would suggest to consider, like 10d. Otherwise, if you do a weekly incremental backup and you have a time change or a hiccup of some kind one day, you lose your backup sequence. Another possible answer is, like, 400 days? Because why not? What is a reasonable upper limit for this? 
> - It's regrettable that we don't have incremental JSON parsing; I > think that means anyone who has a backup manifest that is bigger than > 1GB can't use this feature. However, that's also a problem for the > existing backup manifest feature, and as far as I can see, we have no > complaints about it. So maybe people just don't have databases with > enough relations for that to be much of a live issue yet. I'm inclined > to treat this as a non-blocker, It looks like each file entry in the manifest takes about 150 bytes, so 1 GB would allow for 1024**3/150 = 7158278 files. That seems fine for now? > - Right now, I have a hard-coded 60 second timeout for WAL > summarization. If you try to take an incremental backup and the WAL > summaries you need don't show up within 60 seconds, the backup times > out. I think that's a reasonable default, but should it be > configurable? If yes, should that be a GUC or, perhaps better, a > pg_basebackup option? The current user experience of pg_basebackup is that it waits possibly a long time for a checkpoint, and there is an option to make it go faster, but there is no timeout AFAICT. Is this substantially different? Could we just let it wait forever? Also, does waiting for checkpoint and WAL summarization happen in parallel? If so, what if it starts a checkpoint that might take 15 min to complete, and then after 60 seconds it kicks you off because the WAL summarization isn't ready. That might be wasteful. > - I'm curious what people think about the pg_walsummary tool that is > included in 0006. I think it's going to be fairly important for > debugging, but it does feel a little bit bad to add a new binary for > something pretty niche. This seems fine. Is the WAL summary file format documented anywhere in your patch set yet? My only thought was, maybe the file format could be human-readable (more like backup_label) to avoid this. But maybe not.
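[Editor's note: Peter's back-of-envelope estimate above is easy to verify.]

```python
# Check of the manifest-capacity estimate quoted above: at roughly 150
# bytes per file entry, a 1 GB manifest holds about 7.15 million entries.
bytes_per_entry = 150          # approximate size of one manifest file entry
manifest_limit = 1024 ** 3     # 1 GB ceiling imposed by non-incremental parsing
max_files = manifest_limit // bytes_per_entry
print(max_files)  # 7158278
```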
On Tue, Oct 24, 2023 at 10:53 AM Peter Eisentraut <peter@eisentraut.org> wrote: > The easiest answer is to have it off by default. Let people figure out > what works for them. There are various factors like storage, network, > server performance, RTO that will determine what combination of full > backup, incremental backup, and WAL replay will satisfy someone's > requirements. I suppose tests could be set up to determine this to some > degree. But we could also start slow and let people figure it out > themselves. When pg_basebackup was added, it was also disabled by default. > > If we think that 7d is a good setting, then I would suggest to consider, > like 10d. Otherwise, if you do a weekly incremental backup and you have > a time change or a hiccup of some kind one day, you lose your backup > sequence. > > Another possible answer is, like, 400 days? Because why not? What is a > reasonable upper limit for this? In concept, I don't think this should even be time-based. What you should do is remove WAL summaries once you know that you've taken as many incremental backups that might use them as you're ever going to do. But PostgreSQL itself doesn't have any way of knowing what your intended backup patterns are. If your incremental backup fails on Monday night and you run it manually on Tuesday morning, you might still rerun it as an incremental backup. If it fails every night for a month and you finally realize and decide to intervene manually, maybe you want a new full backup at that point. It's been a month. But on the other hand maybe you don't. There's no time-based answer to this question that is really correct, and I think it's quite possible that your backup software might want to shut off time-based deletion altogether and make its own decisions about when to nuke summaries. However, I also don't think that's a great default setting. It could easily lead to people wasting a bunch of disk space for no reason. 
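[Editor's note: the mtime-based removal being debated amounts to something like the following sketch. The function name, directory layout, and parameter are assumptions for illustration, not the patch's actual code.]

```python
import os
import time

def prune_wal_summaries(summary_dir, retention_days=7.0):
    """Remove WAL summary files whose mtime falls outside the retention
    window. As the thread notes, this is only as trustworthy as the system
    clock and the files' mtimes."""
    cutoff = time.time() - retention_days * 86400
    removed = []
    for name in os.listdir(summary_dir):
        if not name.endswith(".summary"):
            continue
        path = os.path.join(summary_dir, name)
        if os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(name)
    return removed
```

This makes the failure modes discussed above concrete: touch a summary's mtime or move the clock backward and a file survives (or dies) for the wrong reason, which is why a purely time-based policy can never exactly match "delete once no future incremental will need it."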
As for the 7d value, I figured that nightly incremental backups would be fairly common. If we think weekly incremental backups would be common, then pushing this out to 10d would make sense. While there's no reason you couldn't take an annual incremental backup, and thus want a 400d value, it seems like a pretty niche use case. Note that whether to remove summaries is a separate question from whether to generate them in the first place. Right now, I have wal_summarize_mb controlling whether they get generated in the first place, but as I noted in another recent email, that isn't an entirely satisfying solution. > It looks like each file entry in the manifest takes about 150 bytes, so > 1 GB would allow for 1024**3/150 = 7158278 files. That seems fine for now? I suspect a few people have more files than that. They'll just have to wait to use this feature until we get incremental JSON parsing (or undo the decision to use JSON for the manifest). > The current user experience of pg_basebackup is that it waits possibly a > long time for a checkpoint, and there is an option to make it go faster, > but there is no timeout AFAICT. Is this substantially different? Could > we just let it wait forever? We could. I installed the timeout because the first versions of the feature were buggy, and I didn't like having my tests hang forever with no indication of what had gone wrong. At least in my experience so far, the time spent waiting for WAL summarization is typically quite short, because the only WAL that needs to be summarized is whatever was emitted since the last time it woke up, up through the start LSN of the backup. That's probably not much, and the next time the summarizer wakes up, the file should appear within moments. So, it's a little different from the checkpoint case, where long waits are expected. > Also, does waiting for checkpoint and WAL summarization happen in > parallel?
If so, what if it starts a checkpoint that might take 15 min > to complete, and then after 60 seconds it kicks you off because the WAL > summarization isn't ready. That might be wasteful. It is not parallel. The trouble is, we don't really have any way to know whether WAL summarization is going to fail for whatever reason. We don't expect that to happen, but if somebody changes the permissions on the WAL summary directory or attaches gdb to the WAL summarizer process or something of that sort, it might. We could check at the outset whether we seem to be really far behind and emit a warning. For instance, if we're 1TB behind on WAL summarization when the checkpoint is requested, chances are something is busted and we're probably not going to catch up any time soon. We could warn the user about that and let them make their own decision about whether to cancel. But, that idea won't help in unattended operation, and the threshold for "really far behind" is not very clear. It might be better to wait until we get more experience with how things actually fail before doing too much engineering here, but I'm also open to suggestions. > Is the WAL summary file format documented anywhere in your patch set > yet? My only thought was, maybe the file format could be human-readable > (more like backup_label) to avoid this. But maybe not. The comment in blkreftable.c just above "#define BLOCKS_PER_CHUNK" gives an overview of the format. I think that we probably don't want to convert to a text format, because this format is extremely space-efficient and very convenient to transfer between disk and memory. We don't want to run out of memory when summarizing large ranges of WAL, or when taking an incremental backup that requires combining many individual summaries into a combined summary that tells us what needs to be included in the backup. -- Robert Haas EDB: http://www.enterprisedb.com
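[Editor's note: to make the space argument concrete, here is a toy chunked encoding in the spirit of, but not identical to, what the blkreftable.c comment describes. The chunk size, two-byte offsets, and function name are illustrative assumptions: the idea is that each fixed-size chunk of block numbers is stored either as a short offset array (sparse) or as a bitmap (dense), whichever is smaller.]

```python
# Toy array-vs-bitmap chunk encoding illustrating why a compact binary
# block-reference format stays small regardless of modification density.

BLOCKS_PER_CHUNK = 1 << 16  # chunk granularity; an assumption for this sketch

def encode_chunk(block_offsets):
    """block_offsets: sorted offsets (0 .. BLOCKS_PER_CHUNK-1) of modified
    blocks within one chunk. Returns ('array', offsets) when a plain offset
    list is smaller, else ('bitmap', bytes) with one bit per block."""
    bitmap_bytes = BLOCKS_PER_CHUNK // 8          # 8 KB worst case per chunk
    array_bytes = 2 * len(block_offsets)          # 2 bytes per stored offset
    if array_bytes <= bitmap_bytes:
        return ("array", list(block_offsets))
    bitmap = bytearray(bitmap_bytes)
    for off in block_offsets:
        bitmap[off // 8] |= 1 << (off % 8)
    return ("bitmap", bytes(bitmap))
```

Either representation converts cheaply between disk and memory, which is the property Robert cites for combining many summaries without running out of memory; a human-readable format would give up both the size bound and that cheap conversion.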
On 2023-10-24 Tu 12:08, Robert Haas wrote: > >> It looks like each file entry in the manifest takes about 150 bytes, so >> 1 GB would allow for 1024**3/150 = 7158278 files. That seems fine for now? > I suspect a few people have more files than that. They'll just have to > wait to use this feature until we get incremental JSON parsing (or > undo the decision to use JSON for the manifest). Robert asked me to work on this quite some time ago, and most of this work was done last year. Here's my WIP for an incremental JSON parser. It works and passes all the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book. The reason I haven't posted it before is that it's about 50% slower in pure parsing speed than the current recursive descent parser in my testing. I've tried various things to make it faster, but haven't made much impact. One of my colleagues is going to take a fresh look at it, but maybe someone on the list can see where we can save some cycles. If we can't make it faster, I guess we could use the RD parser for non-incremental cases and only use the non-RD parser for incremental, although that would be a bit sad. However, I don't think we can make the RD parser suitable for incremental parsing - there's too much state involved in the call stack. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Wed, Oct 25, 2023 at 7:54 AM Andrew Dunstan <andrew@dunslane.net> wrote: > Robert asked me to work on this quite some time ago, and most of this > work was done last year. > > Here's my WIP for an incremental JSON parser. It works and passes all > the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book. > The reason I haven't posted it before is that it's about 50% slower in > pure parsing speed than the current recursive descent parser in my > testing. I've tried various things to make it faster, but haven't made > much impact. One of my colleagues is going to take a fresh look at it, > but maybe someone on the list can see where we can save some cycles. > > If we can't make it faster, I guess we could use the RD parser for > non-incremental cases and only use the non-RD parser for incremental, > although that would be a bit sad. However, I don't think we can make the > RD parser suitable for incremental parsing - there's too much state > involved in the call stack. Yeah, this is exactly why I didn't want to use JSON for the backup manifest in the first place. Parsing such a manifest incrementally is complicated. If we'd gone with my original design where the manifest consisted of a bunch of lines each of which could be parsed separately, we'd already have incremental parsing and wouldn't be faced with these difficult trade-offs. Unfortunately, I'm not in a good position either to figure out how to make your prototype faster, or to evaluate how painful it is to keep both in the source tree. It's probably worth considering how likely it is that we'd be interested in incremental JSON parsing in other cases. Maintaining two JSON parsers is probably not a lot of fun regardless, but if each of them gets used for a bunch of things, that feels less bad than if one of them gets used for a bunch of things and the other one only ever gets used for backup manifests. Would we be interested in JSON-format database dumps? Incrementally parsing JSON LOBs? 
Either seems tenuous, but those are examples of the kind of thing that could make us happy to have incremental JSON parsing as a general facility. If nobody's very excited by those kinds of use cases, then this just boils down to whether we want to (a) accept that users with very large numbers of relation files won't be able to use pg_verifybackup or incremental backup, (b) accept that we're going to maintain a second JSON parser just to enable that use case and with no other benefit, or (c) undertake to change the manifest format to something that is straightforward to parse incrementally. I think (a) is reasonable short term, but at some point I think we should do better. I'm not really that enthused about (c) because it means more work for me and possibly more arguing, but if (b) is going to cause a lot of hassle then we might need to consider it. -- Robert Haas EDB: http://www.enterprisedb.com
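As an illustration of the line-per-entry alternative described above: if each manifest entry were one line, the only parser state that survives between reads is an incomplete trailing line, so incremental parsing becomes trivial. This is a minimal sketch assuming a hypothetical one-JSON-object-per-line format, not the actual backup_manifest contents:

```python
import json

def incremental_lines(chunks):
    """Yield complete lines from an iterable of arbitrary byte chunks.

    With a line-per-entry format, each complete line can be parsed
    independently; the buffer of one partial line is the only state.
    """
    buf = b""
    for chunk in chunks:
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            yield line
    if buf:
        yield buf

# Hypothetical manifest arriving in arbitrary-sized network chunks.
chunks = [b'{"path": "base/1/12', b'34", "size": 8192}\n{"pa',
          b'th": "base/1/5678", "size": 16384}\n']
entries = [json.loads(line) for line in incremental_lines(chunks)]
```

Note that chunk boundaries can fall anywhere, even mid-entry, without any grammar-level parser state being carried across calls.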
On 2023-10-25 We 09:05, Robert Haas wrote: > On Wed, Oct 25, 2023 at 7:54 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> Robert asked me to work on this quite some time ago, and most of this >> work was done last year. >> >> Here's my WIP for an incremental JSON parser. It works and passes all >> the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book. >> The reason I haven't posted it before is that it's about 50% slower in >> pure parsing speed than the current recursive descent parser in my >> testing. I've tried various things to make it faster, but haven't made >> much impact. One of my colleagues is going to take a fresh look at it, >> but maybe someone on the list can see where we can save some cycles. >> >> If we can't make it faster, I guess we could use the RD parser for >> non-incremental cases and only use the non-RD parser for incremental, >> although that would be a bit sad. However, I don't think we can make the >> RD parser suitable for incremental parsing - there's too much state >> involved in the call stack. > Yeah, this is exactly why I didn't want to use JSON for the backup > manifest in the first place. Parsing such a manifest incrementally is > complicated. If we'd gone with my original design where the manifest > consisted of a bunch of lines each of which could be parsed > separately, we'd already have incremental parsing and wouldn't be > faced with these difficult trade-offs. > > Unfortunately, I'm not in a good position either to figure out how to > make your prototype faster, or to evaluate how painful it is to keep > both in the source tree. It's probably worth considering how likely it > is that we'd be interested in incremental JSON parsing in other cases. > Maintaining two JSON parsers is probably not a lot of fun regardless, > but if each of them gets used for a bunch of things, that feels less > bad than if one of them gets used for a bunch of things and the other > one only ever gets used for backup manifests. 
Would we be interested > in JSON-format database dumps? Incrementally parsing JSON LOBs? Either > seems tenuous, but those are examples of the kind of thing that could > make us happy to have incremental JSON parsing as a general facility. > > If nobody's very excited by those kinds of use cases, then this just > boils down to whether we want to (a) accept that users with very large > numbers of relation files won't be able to use pg_verifybackup or > incremental backup, (b) accept that we're going to maintain a second > JSON parser just to enable that use cas and with no other benefit, or > (c) undertake to change the manifest format to something that is > straightforward to parse incrementally. I think (a) is reasonable > short term, but at some point I think we should do better. I'm not > really that enthused about (c) because it means more work for me and > possibly more arguing, but if (b) is going to cause a lot of hassle > then we might need to consider it. I'm not too worried about the maintenance burden. The RD routines were added in March 2013 (commit a570c98d7fa) and have hardly changed since then. The new code is not ground-breaking - it's just a different (and fairly well known) way of doing the same thing. I'd be happier if we could make it faster, but maybe it's just a fact that keeping an explicit stack, which is how this works, is slower. I wouldn't at all be surprised if there were other good uses for incremental JSON parsing, including some you've identified. That said, I agree that JSON might not be the best format for backup manifests, but maybe that ship has sailed. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
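The call-stack-versus-explicit-stack distinction can be shown with a toy sketch. This is not the proposed parser, just an illustration of why keeping all state in an object (rather than on the call stack, as a recursive descent parser does) lets parsing suspend and resume between input chunks. It checks only bracket nesting, not the full JSON grammar:

```python
class IncrementalNestingChecker:
    """Toy resumable checker: every bit of parser state lives in this
    object, so feed() can return between chunks and be called again.
    A recursive descent parser holds equivalent state in stack frames
    and cannot suspend mid-document."""

    def __init__(self):
        self.stack = []        # explicit stack of open brackets
        self.in_string = False
        self.escaped = False

    def feed(self, chunk):
        for ch in chunk:
            if self.in_string:
                if self.escaped:
                    self.escaped = False
                elif ch == "\\":
                    self.escaped = True
                elif ch == '"':
                    self.in_string = False
            elif ch == '"':
                self.in_string = True
            elif ch in "{[":
                self.stack.append(ch)
            elif ch in "}]":
                opener = self.stack.pop()
                if "{[".index(opener) != "}]".index(ch):
                    raise ValueError("mismatched nesting")

    def done(self):
        return not self.stack and not self.in_string

checker = IncrementalNestingChecker()
for chunk in ['{"files": [{"path": ', '"x"}, {"path": "y"}]}']:
    checker.feed(chunk)
```

The cost Andrew mentions is visible even here: every character goes through an explicit state dispatch instead of straight-line recursive code.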
On Wed, Oct 25, 2023 at 10:33 AM Andrew Dunstan <andrew@dunslane.net> wrote: > I'm not too worried about the maintenance burden. > > That said, I agree that JSON might not be the best format for backup > manifests, but maybe that ship has sailed. I think it's a decision we could walk back if we had a good enough reason, but it would be nicer if we didn't have to, because what we have right now is working. If we change it for no real reason, we might introduce new bugs, and at least in theory, incompatibility with third-party tools that parse the existing format. If you think we can live with the additional complexity in the JSON parsing stuff, I'd rather go that way. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Oct 24, 2023 at 8:29 AM Robert Haas <robertmhaas@gmail.com> wrote: > Yeah, maybe so. I'm not quite ready to commit to doing that split as > of this writing but I will think about it and possibly do it. I have done this. Here's v7. This version also includes several new TAP tests for the main patch, some of which were inspired by our discussion. It also includes SGML documentation for pg_walsummary. New tests: 003_timeline.pl tests the case where the prior backup for an incremental backup was taken on an earlier timeline. 004_manifest.pl tests the manifest-related options for pg_combinebackup. 005_integrity.pl tests the sanity checks that prevent combining a backup with the wrong prior backup. Overview of the new organization of the patch set: 0001 - preparatory refactoring of basebackup.c, changing the algorithm that we use to decide which files have checksums 0002 - code movement only. makes it possible to reuse parse_manifest.c 0003 - add the WAL summarizer process, but useless on its own 0004 - add incremental backup, making use of 0003 0005 - add pg_walsummary debugging tool Notes: - I suspect that 0003 is the most likely to have serious bugs, followed by 0004. - See XXX comments in the commit messages for some known open issues. - Still looking for more comments on http://postgr.es/m/CA+TgmoYdPS7a4eiqAFCZ8dr4r3-O0zq1LvTO5drwWr+7wHQaSQ@mail.gmail.com and other recent emails where design questions came up -- Robert Haas EDB: http://www.enterprisedb.com
On 2023-10-25 We 11:24, Robert Haas wrote: > On Wed, Oct 25, 2023 at 10:33 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> I'm not too worried about the maintenance burden. >> >> That said, I agree that JSON might not be the best format for backup >> manifests, but maybe that ship has sailed. > I think it's a decision we could walk back if we had a good enough > reason, but it would be nicer if we didn't have to, because what we > have right now is working. If we change it for no real reason, we > might introduce new bugs, and at least in theory, incompatibility with > third-party tools that parse the existing format. If you think we can > live with the additional complexity in the JSON parsing stuff, I'd > rather go that way. > OK, I'll go with that. It will actually be a bit less invasive than the patch I posted. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Wed, Oct 25, 2023 at 3:17 PM Andrew Dunstan <andrew@dunslane.net> wrote: > OK, I'll go with that. It will actually be a bit less invasive than the > patch I posted. Why's that? -- Robert Haas EDB: http://www.enterprisedb.com
On 2023-10-25 We 15:19, Robert Haas wrote: > On Wed, Oct 25, 2023 at 3:17 PM Andrew Dunstan <andrew@dunslane.net> wrote: >> OK, I'll go with that. It will actually be a bit less invasive than the >> patch I posted. > Why's that? > Because we won't be removing the RD parser. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Thu, Oct 26, 2023 at 6:59 AM Andrew Dunstan <andrew@dunslane.net> wrote: > Because we won't be removing the RD parser. Ah, OK. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Oct 24, 2023 at 12:08 PM Robert Haas <robertmhaas@gmail.com> wrote: > Note that whether to remove summaries is a separate question from > whether to generate them in the first place. Right now, I have > wal_summarize_mb controlling whether they get generated in the first > place, but as I noted in another recent email, that isn't an entirely > satisfying solution. I did some more research on this. My conclusion is that I should remove wal_summarize_mb and just have a GUC summarize_wal = on|off that controls whether the summarizer runs at all. There will be one summary file per checkpoint, no matter how far apart checkpoints are or how large the summary gets. Below I'll explain the reasoning; let me know if you disagree. What I describe above would be a bad plan if it were realistically possible for a summary file to get so large that it might run the machine out of memory either when producing it or when trying to make use of it for an incremental backup. This seems to be a somewhat difficult scenario to create. So far, I haven't been able to generate WAL summary files more than a few tens of megabytes in size, even when summarizing 50+ GB of WAL per summary file. One reason why it's hard to produce large summary files is because, for a single relation fork, the WAL summary size converges to 1 bit per modified block when the number of modified blocks is large. This means that, even if you have a terabyte sized relation, you're looking at no more than perhaps 20MB of summary data no matter how much of it gets modified. Now, somebody could have a 30TB relation and then if they modify the whole thing they could have the better part of a gigabyte of summary data for that relation, but if you've got a 30TB table you probably have enough memory that that's no big deal. But, what if you have multiple relations? I initialized pgbench with a scale factor of 30000 and also with 30000 partitions and did a 1-hour run. 
I got 4 checkpoints during that time and each one produced an approximately 16MB summary file. The efficiency here drops considerably. For example, one of the files is 16495398 bytes and records information on 7498403 modified blocks, which works out to about 2.2 bytes per modified block. That's more than an order of magnitude worse than what I got in the single-relation case, where the summary file didn't even use two *bits* per modified block. But here again, the file just isn't that big in absolute terms. To get a 1GB+ WAL summary file, you'd need to modify millions of relation forks, maybe tens of millions, and most installations aren't even going to have that many relation forks, let alone be modifying them all frequently. My conclusion here is that it's pretty hard to have a database where WAL summarization is going to use too much memory. I wouldn't be terribly surprised if there are some extreme cases where it happens, but those databases probably aren't great candidates for incremental backup anyway. They're probably databases with millions of relations and frequent, widely-scattered modifications to those relations. And if you have that kind of high turnover rate then incremental backup isn't going to be as helpful anyway, so there's probably no reason to enable WAL summarization in the first place. Maybe if you have that plus, in the same database cluster, 100TB of completely static data that is never modified, and if you also do all of this on a pretty small machine, then you can find a case where incremental backup would have worked well but for the memory consumed by WAL summarization. But I think that's sufficiently niche that the current patch shouldn't concern itself with such cases. If we find that they're common enough to worry about, we might eventually want to do something to mitigate them, but whether that thing looks anything like wal_summarize_mb seems pretty unclear. 
So I conclude that it's a mistake to include that GUC as currently designed and propose to replace it with a Boolean as described above. Comments? -- Robert Haas EDB: http://www.enterprisedb.com
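The convergence toward 1 bit per modified block described in the message above can be modeled in a few lines. This is not the actual summary file format (which is Dilip Kumar's design); it's just a model in which each fork's modified blocks are stored either as an explicit list of 4-byte block numbers or as a bitmap over the fork, whichever is smaller:

```python
def summary_size_bytes(modified_blocks, total_blocks):
    """Rough size model for one relation fork's summary: pick the
    cheaper of an explicit block-number list and a whole-fork bitmap.
    Illustrates why size converges to 1 bit per block, not the real
    on-disk encoding."""
    list_size = 4 * len(modified_blocks)       # 4 bytes per block number
    bitmap_size = (total_blocks + 7) // 8      # 1 bit per block in the fork
    return min(list_size, bitmap_size)

total = 1 << 27  # a ~1 TB relation fork at 8 kB blocks
sparse = summary_size_bytes(range(1000), total)   # lightly modified: list wins
dense = summary_size_bytes(range(total), total)   # fully modified: bitmap wins
bits_per_modified_block = dense * 8 / total       # converges to 1.0
```

Under this model a fully modified 1 TB fork needs about 16 MB of summary, in the same ballpark as the "perhaps 20MB" figure quoted earlier in the thread.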
While reviewing this thread today, I realized that I never responded to this email. That was inadvertent; my apologies. On Wed, Jun 14, 2023 at 4:34 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > Nice, I like this idea. Cool. > Skimming through the 7th patch, I see claims that FSM is not fully > WAL-logged and thus shouldn't be tracked, and so it indeed doesn't > track those changes. > I disagree with that decision: we now have support for custom resource > managers, which may use the various forks for other purposes than > those used in PostgreSQL right now. It would be a shame if data is > lost because of the backup tool ignoring forks because the PostgreSQL > project itself doesn't have post-recovery consistency guarantees in > that fork. So, unless we document that WAL-logged changes in the FSM > fork are actually not recoverable from backup, regardless of the type > of contents, we should still keep track of the changes in the FSM fork > and include the fork in our backups or only exclude those FSM updates > that we know are safe to ignore. I'm not sure what to do about this problem. I don't think any data would be *lost* in the scenario that you mention; what I think would happen is that the FSM forks would be backed up in their entirety even if they were owned by some other table AM or index AM that was WAL-logging all changes to whatever it was storing in that fork. So I think that there is not a correctness issue here but rather an efficiency issue. It would still be nice to fix that somehow, but I don't see how to do it. It would be easy to make the WAL summarizer stop treating the FSM as a special case, but there's no way for basebackup_incremental.c to know whether a particular relation fork is for the heap AM or some other AM that handles WAL-logging differently. It can't for example examine pg_class; it's not connected to any database, let alone every database. 
So we have to either trust that the WAL for the FSM is correct and complete in all cases, or assume that it isn't in any case. And the former doesn't seem like a safe or wise assumption given how the heap AM works. I think the reality here is unfortunately that we're missing a lot of important infrastructure to really enable a multi-table-AM world. The heap AM, and every other table AM, should include a metapage so we can tell what we're looking at just by examining the disk files. Relation forks don't scale and should be replaced with some better system that does. We should have at least two table AMs in core that are fully supported and do truly useful things. Until some of that stuff (and probably a bunch of other things) get sorted out, out-of-core AMs are going to have to remain second-class citizens to some degree. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Sep 28, 2023 at 6:22 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > If that is still an area open for discussion: wouldn't it be better to > just specify LSN as it would allow resyncing standby across major lag > where the WAL to replay would be enormous? Given that we had > primary->standby where standby would be stuck on some LSN, right now > it would be: > 1) calculate backup manifest of desynced 10TB standby (how? using > which tool?) - even if possible, that means reading 10TB of data > instead of just putting a number, isn't it? > 2) backup primary with such incremental backup >= LSN > 3) copy the incremental backup to standby > 4) apply it to the impaired standby > 5) restart the WAL replay As you may be able to tell from the flurry of posts and new patch sets, I'm trying hard to sort out the remaining open items that pertain to this patch set, and I'm now back to thinking about this one. TL;DR: I think the idea has some potential, but there are some pitfalls that I'm not sure how to address. I spent some time looking at how we currently use the data from the backup manifest. Currently, we do two things with it. First, when we're backing up each file, we check whether it's present in the backup manifest and, if not, we back it up in full. This actually feels fairly poor. If it makes any difference at all, then presumably the underlying algorithm is buggy and needs to be fixed. Maybe that should be ripped out altogether or turned into some kind of sanity check that causes a big explosion if it fails. Second, we check whether the WAL ranges reported by the client match up with the timeline history of the server (see PrepareForIncrementalBackup). This set of sanity checks seems fairly important to me, and I'd regret discarding them. I think there's some possibility that they might catch user error, like where somebody promotes multiple standbys and maybe they even get the same timeline on more than one of them, and then confusion might ensue. 
I also think that there's a real possibility that they might make it easier to track down bugs in my code, even if those bugs aren't necessarily timeline-related. If (or more realistically when) somebody ends up with a corrupted cluster after running pg_combinebackup, we're going to need to figure out whether that corruption is the result of bugs (and if so where they are) or user error (and if so what it was). The most obvious ways of ending up with a corrupted cluster are (1) taking an incremental backup against a prior backup that is not in the history of the server from which the backup is taken or (2) combining an incremental backup with the wrong prior backup, so whatever sanity checks we can have that will tend to prevent those kinds of mistakes seem like a really good idea. And those kinds of checks seem relevant here, too. Consider that it wouldn't be valid to use pg_combinebackup to fast-forward a standby server if the incremental backup's backup-end-LSN preceded the standby server's minimum recovery point. Imagine that you have a standby whose last checkpoint's redo location was at LSN 2/48. Being the enterprising DBA that you are, you make a note of that LSN and go take an incremental backup based on it. You then stop the standby server and try to apply the incremental backup to fast-forward the standby. Well, it's possible that in the meanwhile the standby actually caught up, and now has a minimum recovery point that follows the backup-end-LSN of your incremental backup. In that case, you can't legally use that incremental backup to fast-forward that standby, but no code I've yet written would be smart enough to figure that out. Or, maybe you (or some other DBA on your team) got really excited and actually promoted that standby meanwhile, and now it's not even on the same timeline any more. 
In the "normal" case where you take an incremental backup based on an earlier base backup, these kinds of problems are detectable, and it seems to me that if we want to enable this kind of use case, it would be pretty smart to have a plan to detect similar mistakes here. I don't, currently, but maybe there is one. Another practical problem here is that, right now, pg_combinebackup doesn't have an in-place mode. It knows how to take a bunch of input backups and write out an output backup, but that output backup needs to go into a new, fresh directory (or directories plural, if there are user-defined tablespaces). I had previously considered adding such a mode, but the idea I had at the time wouldn't have worked for this case. I imagined that someone might want to run "pg_combinebackup --in-place full incr" and clobber the contents of the incr directory with the output, basically discarding the incremental backup you took in favor of a full backup that could have been taken at the same point in time. But here, you'd want to clobber the *first* input to pg_combinebackup, not the last one, so if we want to add something like this, the UI needs some thought. One thing that I find quite scary about such a mode is that if you crash mid-way through, you're in a lot of trouble. In the case that I had previously contemplated -- overwrite the last incremental with the reconstructed full backup -- you *might* be able to make it crash safe by writing out the full files for each incremental file, fsyncing everything, then removing all of the incremental files and fsyncing again. The idea would be that if you crash midway through it's OK to just repeat whatever you were trying to do before the crash and if it succeeds the second time then all is well. If, for a given file, there are both incremental and non-incremental versions, then the second attempt should remove and recreate the non-incremental version from the incremental version. 
If there's only a non-incremental version, it could be that the previous attempt got far enough to remove the incremental file, but in that case the full file that we now have should be the same thing that we would produce if we did the operation now. It all sounds a little scary, but maybe it's OK. And as long as you don't remove the this-is-an-incremental-backup markers from the backup_label file until you've done everything else, you can tell whether you've ever successfully completed the reassembly or not. But if you're using a hypothetical overwrite mode to overwrite the first input rather than the last one, well, it looks like a valid data directory already, and if you replace a bunch of files and then crash, it still does, but it's not any more, really. I'm not sure I've really wrapped my head around all of the cases here, but it does feel like there are some new ways to go wrong. One thing I also realized when thinking about this is that you could probably hose yourself with the patch set as it stands today by taking a full backup, downgrading to wal_level=minimal for a while, doing some WAL-skipping operations, upgrading to a higher WAL-level again, and then taking an incremental backup. I think the solution to that is probably for the WAL summarizer to refuse to run if wal_level=minimal. Then there would be a gap in the summary files which an incremental backup attempt would detect. -- Robert Haas EDB: http://www.enterprisedb.com
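The repeat-on-crash overwrite sequence described above (write full files, fsync, only then remove the incremental files, fsync again) could look roughly like this. The "INCREMENTAL." file-name prefix and the reconstruct_full callback are assumptions for illustration, not the actual pg_combinebackup code:

```python
import os
import shutil
import tempfile

def reconstruct_in_place(backup_dir, reconstruct_full):
    """Sketch of a crash-safe in-place reconstruction: rerunnable after
    a crash because incremental files are only removed once every full
    file is durable. reconstruct_full(src, dst) is assumed to rebuild
    a full file from an incremental one."""
    incrementals = [name for name in os.listdir(backup_dir)
                    if name.startswith("INCREMENTAL.")]
    # Step 1: write a full version alongside every incremental file and
    # make it durable before touching the originals. A crash here just
    # means the whole sequence gets repeated.
    for name in incrementals:
        full = os.path.join(backup_dir, name[len("INCREMENTAL."):])
        reconstruct_full(os.path.join(backup_dir, name), full)
        with open(full, "rb") as f:
            os.fsync(f.fileno())
    # Step 2: only now remove the incremental files, then fsync the
    # directory so the removals are durable too.
    for name in incrementals:
        os.remove(os.path.join(backup_dir, name))
    fd = os.open(backup_dir, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

# Toy demo: the "reconstruction" is just a copy.
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "INCREMENTAL.16384"), "wb") as f:
    f.write(b"reconstructed blocks")
reconstruct_in_place(demo, shutil.copyfile)
```

As the message notes, this idea works for overwriting the *last* input (the incremental backup); overwriting the *first* input would leave something that still looks like a valid data directory mid-operation, which is the scary part.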
Hi, On 2023-10-30 10:45:03 -0400, Robert Haas wrote: > On Tue, Oct 24, 2023 at 12:08 PM Robert Haas <robertmhaas@gmail.com> wrote: > > Note that whether to remove summaries is a separate question from > > whether to generate them in the first place. Right now, I have > > wal_summarize_mb controlling whether they get generated in the first > > place, but as I noted in another recent email, that isn't an entirely > > satisfying solution. > > I did some more research on this. My conclusion is that I should > remove wal_summarize_mb and just have a GUC summarize_wal = on|off > that controls whether the summarizer runs at all. There will be one > summary file per checkpoint, no matter how far apart checkpoints are > or how large the summary gets. Below I'll explain the reasoning; let > me know if you disagree. > What I describe above would be a bad plan if it were realistically > possible for a summary file to get so large that it might run the > machine out of memory either when producing it or when trying to make > use of it for an incremental backup. This seems to be a somewhat > difficult scenario to create. So far, I haven't been able to generate > WAL summary files more than a few tens of megabytes in size, even when > summarizing 50+ GB of WAL per summary file. One reason why it's hard > to produce large summary files is because, for a single relation fork, > the WAL summary size converges to 1 bit per modified block when the > number of modified blocks is large. This means that, even if you have > a terabyte sized relation, you're looking at no more than perhaps 20MB > of summary data no matter how much of it gets modified. Now, somebody > could have a 30TB relation and then if they modify the whole thing > they could have the better part of a gigabyte of summary data for that > relation, but if you've got a 30TB table you probably have enough > memory that that's no big deal. 
I'm not particularly worried about the rewriting-30TB-table case - that'd also generate >= 30TB of WAL most of the time. Which realistically is going to trigger a few checkpoints, even on very big instances. > But, what if you have multiple relations? I initialized pgbench with a > scale factor of 30000 and also with 30000 partitions and did a 1-hour > run. I got 4 checkpoints during that time and each one produced an > approximately 16MB summary file. Hm, I assume the pgbench run will be fairly massively bottlenecked on IO, due to having to read data from disk, lots of full page writes and having to write out lots of data? I.e. we won't do all that many transactions during the 1h? > To get a 1GB+ WAL summary file, you'd need to modify millions of relation > forks, maybe tens of millions, and most installations aren't even going to > have that many relation forks, let alone be modifying them all frequently. I tried to find bad cases for a bit - and I am not worried. I wrote a pgbench script to create 10k single-row relations in each script, ran that with 96 clients, checkpointed, and ran a pgbench script that updated the single row in each table. After creation of the relations, WAL summarizer uses

LOG: level: 1; Wal Summarizer: 378433680 total in 43 blocks; 5628936 free (66 chunks); 372804744 used

and creates a 26MB summary file. After checkpoint & updates, WAL summarizer uses:

LOG: level: 1; Wal Summarizer: 369205392 total in 43 blocks; 5864536 free (26 chunks); 363340856 used

and creates a 26MB summary file. Sure, 350MB ain't nothing, but simply just executing \dt in the database created by this makes the backend use 260MB after. Which isn't going away, whereas WAL summarizer drops its memory usage soon after. > But I think that's sufficiently niche that the current patch shouldn't > concern itself with such cases. 
If we find that they're common enough > to worry about, we might eventually want to do something to mitigate > them, but whether that thing looks anything like wal_summarize_mb > seems pretty unclear. So I conclude that it's a mistake to include > that GUC as currently designed and propose to replace it with a > Boolean as described above. After playing with this for a while, I don't see a reason for wal_summarize_mb from a memory usage POV at least. I wonder if there are use cases that might like to consume WAL summaries before the next checkpoint? For those wal_summarize_mb likely wouldn't be a good control, but they might want to request a summary file to be created at some point? Greetings, Andres Freund
On Mon, Oct 30, 2023 at 2:46 PM Andres Freund <andres@anarazel.de> wrote: > After playing with this for a while, I don't see a reason for wal_summarize_mb > from a memory usage POV at least. Cool! Thanks for testing. > I wonder if there are use cases that might like to consume WAL summaries > before the next checkpoint? For those wal_summarize_mb likely wouldn't be a > good control, but they might want to request a summary file to be created at > some point? It's possible. I actually think it's even more likely that there are use cases that will also want the WAL summarized, but in some different way. For example, you might want a summary that would give you the LSN or approximate LSN where changes to a certain block occurred. Such a summary would be way bigger than these summaries and therefore, at least IMHO, a lot less useful for incremental backup, but it could be really useful for something else. Or you might want summaries that focus on something other than which blocks got changed, like what relations were created or destroyed, or only changes to certain kinds of relations or relation forks, or whatever. In a way, you can even think of logical decoding as a kind of WAL summarization, just with a very different set of goals from this one. I won't be too surprised if the next hacker wants something that is different enough from what this does that it doesn't make sense to share mechanism, but if by chance they want the same thing but dumped a bit more frequently, well, that can be done. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Oct 30, 2023 at 6:46 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Sep 28, 2023 at 6:22 AM Jakub Wartak > <jakub.wartak@enterprisedb.com> wrote: > > If that is still an area open for discussion: wouldn't it be better to > > just specify LSN as it would allow resyncing standby across major lag > > where the WAL to replay would be enormous? Given that we had > > primary->standby where standby would be stuck on some LSN, right now > > it would be: > > 1) calculate backup manifest of desynced 10TB standby (how? using > > which tool?) - even if possible, that means reading 10TB of data > > instead of just putting a number, isn't it? > > 2) backup primary with such incremental backup >= LSN > > 3) copy the incremental backup to standby > > 4) apply it to the impaired standby > > 5) restart the WAL replay > > As you may be able to tell from the flurry of posts and new patch > sets, I'm trying hard to sort out the remaining open items that > pertain to this patch set, and I'm now back to thinking about this > one. > > TL;DR: I think the idea has some potential, but there are some > pitfalls that I'm not sure how to address. > > I spent some time looking at how we currently use the data from the > backup manifest. Currently, we do two things with it. First, when > we're backing up each file, we check whether it's present in the > backup manifest and, if not, we back it up in full. This actually > feels fairly poor. If it makes any difference at all, then presumably > the underlying algorithm is buggy and needs to be fixed. Maybe that > should be ripped out altogether or turned into some kind of sanity > check that causes a big explosion if it fails. Second, we check > whether the WAL ranges reported by the client match up with the > timeline history of the server (see PrepareForIncrementalBackup). This > set of sanity checks seems fairly important to me, and I'd regret > discarding them. 
I think there's some possibility that they might > catch user error, like where somebody promotes multiple standbys and > maybe they even get the same timeline on more than one of them, and > then confusion might ensue. [..] > Another practical problem here is that, right now, pg_combinebackup > doesn't have an in-place mode. It knows how to take a bunch of input > backups and write out an output backup, but that output backup needs > to go into a new, fresh directory (or directories plural, if there are > user-defined tablespaces). I had previously considered adding such a > mode, but the idea I had at the time wouldn't have worked for this > case. I imagined that someone might want to run "pg_combinebackup > --in-place full incr" and clobber the contents of the incr directory > with the output, basically discarding the incremental backup you took > in favor of a full backup that could have been taken at the same point > in time. [..] Thanks for answering! It all sounds like this resync-standby-using-primary-incrbackup idea isn't a fit for the current pg_combinebackup, but rather for a new tool, hopefully in the future. It could take the current LSN from the stuck standby, calculate a manifest on the lagged and offline standby (do we need to calculate manifest checksums in that case? I cannot find code for it), deliver it via "UPLOAD_MANIFEST" to the primary, and start fetching and applying the differences while doing some form of copy-on-write from the old & incoming incrbackup data to "$relfilenodeid.new", and then durable_unlink() the old one and durable_rename("$relfilenodeid.new", "$relfilenodeid"). Would it still be possible in theory? (It could use additional safeguards, like renaming the controlfile when starting and just before ending, to additionally block startup if it hasn't finished.) Also, judging by the comment near struct IncrementalBackupInfo.manifest_files, it looks like even the checksums are more of a safeguard than part of the core implementation (?) 
What I've meant in the initial idea is not to hinder current efforts, but asking if the current design will not stand in the way of such a cool new addition in the future? > One thing I also realized when thinking about this is that you could > probably hose yourself with the patch set as it stands today by taking > a full backup, downgrading to wal_level=minimal for a while, doing > some WAL-skipping operations, upgrading to a higher WAL-level again, > and then taking an incremental backup. I think the solution to that is > probably for the WAL summarizer to refuse to run if wal_level=minimal. > Then there would be a gap in the summary files which an incremental > backup attempt would detect. As per the earlier test [1], I've already tried to simulate that in incrbackuptests-0.1.tgz/test_across_wallevelminimal.sh, but that worked (but that was with the CTAS wal_level=minimal optimization -> a new relfilenode OID is used for CTAS, which got included in the incremental backup as it's a new file). Even retested that with your v7 patch with asserts, same. When simulating with "BEGIN; TRUNCATE nightmare; COPY nightmare FROM '/tmp/copy.out'; COMMIT;" on wal_level=minimal it still recovers using incremental backup because the WAL contains: rmgr: Storage, desc: CREATE base/5/36425 [..] rmgr: XLOG, desc: FPI , blkref #0: rel 1663/5/36425 blk 0 FPW [..] e.g. TRUNCATE sets a new relfilenode each time, so they will always be included in the backup, and the wal_level=minimal optimizations kick in only for commands that issue a new relfilenode. True/false?
postgres=# select oid, relfilenode, relname from pg_class where relname like 'night%' order by 1;
  oid  | relfilenode |       relname
-------+-------------+---------------------
 16384 |           0 | nightmare
 16390 |       36420 | nightmare_p0
 16398 |       36425 | nightmare_p1
 36411 |           0 | nightmare_pkey
 36413 |       36422 | nightmare_p0_pkey
 36415 |       36427 | nightmare_p1_pkey
 36417 |           0 | nightmare_brin_idx
 36418 |       36423 | nightmare_p0_ts_idx
 36419 |       36428 | nightmare_p1_ts_idx
(9 rows)

postgres=# truncate nightmare;
TRUNCATE TABLE
postgres=# select oid, relfilenode, relname from pg_class where relname like 'night%' order by 1;
  oid  | relfilenode |       relname
-------+-------------+---------------------
 16384 |           0 | nightmare
 16390 |       36434 | nightmare_p0
 16398 |       36439 | nightmare_p1
 36411 |           0 | nightmare_pkey
 36413 |       36436 | nightmare_p0_pkey
 36415 |       36441 | nightmare_p1_pkey
 36417 |           0 | nightmare_brin_idx
 36418 |       36437 | nightmare_p0_ts_idx
 36419 |       36442 | nightmare_p1_ts_idx
(9 rows)

-J.

[1] - https://www.postgresql.org/message-id/CAKZiRmzT%2BbX2ZYdORO32cADtfQ9DvyaOE8fsOEWZc2V5FkEWVg%40mail.gmail.com
On Wed, Nov 1, 2023 at 8:57 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > Thanks for answering! It all sounds like this > resync-standby-using-primary-incrbackup idea isn't fit for the current > pg_combinebackup, but rather for a new tool hopefully in future. It > could take the current LSN from stuck standby, calculate manifest on > the lagged and offline standby (do we need to calculate manifest > Checksum in that case? I cannot find code for it), deliver it via > "UPLOAD_MANIFEST" to primary and start fetching and applying the > differences while doing some form of copy-on-write from old & incoming > incrbackup data to "$relfilenodeid.new" and then durable_unlink() old > one and durable_rename("$relfilenodeid.new", "$relfilenodeid". Would > it still be possible in theory? (it could use additional safeguards > like rename controlfile when starting and just before ending to > additionally block startup if it hasn't finished). Also it looks as > per comment nearby struct IncrementalBackupInfo.manifest_files that > even checksums are just more for safeguarding rather than core > implementation (?) > > What I've meant in the initial idea is not to hinder current efforts, > but asking if the current design will not stand in a way for such a > cool new addition in future ? Hmm, interesting idea. I think something like that could be made to work. My first thought was that it would sort of suck to have to compute a manifest as a precondition of doing this, but then I started to think maybe it wouldn't, really. I mean, you'd have to scan the local directory tree and collect all the filenames so that you could remove any files that are no longer present in the current version of the data directory which the incremental backup would send to you. If you're already doing that, the additional cost of generating a manifest isn't that high, at least if you don't include checksums, which aren't required. 
On the other hand, if you didn't need to send the server a manifest and just needed to send the required WAL ranges, that would be even cheaper. I'll spend some more time thinking about this next week. > As per earlier test [1], I've already tried to simulate that in > incrbackuptests-0.1.tgz/test_across_wallevelminimal.sh , but that > worked (but that was with CTAS-wal-minimal-optimization -> new > relfilenodeOID is used for CTAS which got included in the incremental > backup as it's new file) Even retested that with Your v7 patch with > asserts, same. When simulating with "BEGIN; TRUNCATE nightmare; COPY > nightmare FROM '/tmp/copy.out'; COMMIT;" on wal_level=minimal it still > recovers using incremental backup because the WAL contains: TRUNCATE itself is always WAL-logged, but data added to the relation in the same transaction as the TRUNCATE isn't always WAL-logged (but sometimes it is, depending on the relation size). So the failure case wouldn't be missing the TRUNCATE but missing some data-containing blocks within the relation shortly after it was created or truncated. I think what I need to do here is avoid summarizing WAL that was generated under wal_level=minimal. The walsummarizer process should just refuse to emit summaries for any such WAL. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Oct 30, 2023 at 2:46 PM Andres Freund <andres@anarazel.de> wrote:
> After playing with this for a while, I don't see a reason for wal_summarize_mb
> from a memory usage POV at least.

Here's v8. Changes:

- Replace wal_summarize_mb GUC with summarize_wal = on | off.
- Document the summarize_wal and wal_summary_keep_time GUCs.
- Refuse to start with summarize_wal = on and wal_level = minimal.
- Increase default wal_summary_keep_time to 10d from 7d, per (what I think was) a suggestion from Peter E.
- Fix fencepost errors when deciding which WAL summaries are needed for a backup.
- Fix indentation damage.
- Standardize on ereport(DEBUG1, ...) in walsummarizer.c vs. various more and less chatty things I had before.
- Include the timeline in some error messages because not having it proved confusing.
- Be more consistent about ignoring the FSM fork.
- Fix a bug that could cause WAL summarization to error out when switching timelines.
- Fix the division between the wal summarizer and incremental backup patches so that the former passes tests without the latter.
- Fix some things that an older compiler didn't like, including adding pg_attribute_printf in some places.
- Die with an error instead of crashing if someone feeds us a manifest with no WAL ranges.
- Sort the block numbers that need to be read from a relation file before reading them, so that we're certain to read them in ascending order.
- Be more careful about computing the truncation_block_length of an incremental file; don't do math on a block number that might be InvalidBlockNumber.
- Fix pg_combinebackup so it doesn't fail when zero-filled blocks are added to a relation between the prior backup and the incremental backup.
- Improve the pg_combinebackup -d output so that it explains in detail how it's carrying out reconstruction, to improve debuggability.
- Disable WAL summarization by default, but add a test patch to the series to enable it, because running the whole test suite with it turned on is good for bug-hunting.
- In pg_walsummary, zero a struct before using instead of starting with arbitrary junk values.

To do list:

- Figure out whether to do something other than uploading the whole summary, per discussion with Jakub Wartak.
- Decide what to do about the 60-second waiting-for-WAL-summarization timeout.
- Make incremental backup fail quickly if WAL summarization is not even enabled.
- Have pg_basebackup error out nicely if an incremental backup is requested from an older server that can't do that.
- Add some kind of tests for pg_walsummary.

-- Robert Haas EDB: http://www.enterprisedb.com
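As a side note on the "sort the block numbers before reading them" item: reading an incremental file's blocks in ascending order keeps the I/O pattern sequential. A minimal sketch of such a sort, using illustrative names rather than the patch's actual code:

```c
#include <stdlib.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/*
 * qsort comparator for ascending BlockNumber order. Comparing
 * explicitly (rather than subtracting) avoids unsigned-wraparound
 * surprises when the values differ by more than INT_MAX.
 */
static int
compare_block_numbers(const void *a, const void *b)
{
	BlockNumber aa = *(const BlockNumber *) a;
	BlockNumber bb = *(const BlockNumber *) b;

	if (aa < bb)
		return -1;
	if (aa > bb)
		return 1;
	return 0;
}

/* Sort the blocks to be read so reads proceed in ascending order. */
static void
sort_blocks_to_read(BlockNumber *blocks, size_t nblocks)
{
	qsort(blocks, nblocks, sizeof(BlockNumber), compare_block_numbers);
}
```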
Attachment
- v8-0005-Add-new-pg_walsummary-tool.patch
- v8-0001-Change-how-a-base-backup-decides-which-files-have.patch
- v8-0006-Test-patch-Enable-summarize_wal-by-default.patch
- v8-0003-Add-a-new-WAL-summarizer-process.patch
- v8-0004-Add-support-for-incremental-backup.patch
- v8-0002-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch
On Tue, Nov 7, 2023 at 2:06 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Oct 30, 2023 at 2:46 PM Andres Freund <andres@anarazel.de> wrote: > > After playing with this for a while, I don't see a reason for wal_summarize_mb > > from a memory usage POV at least. > > Here's v8. Changes: Review comments, based on what I reviewed so far. - I think 0001 looks good improvement irrespective of the patch series. - review 0003 1. + be enabled either on a primary or on a standby. WAL summarization can + cannot be enabled when <varname>wal_level</varname> is set to + <literal>minimal</literal>. Grammatical error "WAL summarization can cannot" -> WAL summarization cannot 2. + <varlistentry id="guc-wal-summarize-keep-time" xreflabel="wal_summarize_keep_time"> + <term><varname>wal_summarize_keep_time</varname> (<type>boolean</type>) + <indexterm> + <primary><varname>wal_summarize_keep_time</varname> configuration parameter</primary> + </indexterm> I feel the name of the GUC should be either wal_summarizer_keep_time or wal_summaries_keep_time, I mean either we should refer to the summarizer process or to the WAL summary files. 3. +XLogGetOldestSegno(TimeLineID tli) +{ + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name)) + continue; + + /* Parse filename to get TLI and segno. */ + XLogFromFileName(xlde->d_name, &file_tli, &file_segno, + wal_segment_size); + + /* Ignore anything that's not from the TLI of interest. */ + if (tli != file_tli) + continue; + + /* If it's the oldest so far, update oldest_segno. */ Some of the single-line comments end with a full stop whereas others do not, so better to be consistent. 4. + * If start_lsn != InvalidXLogRecPtr, only summaries that end before the + * indicated LSN will be included. + * + * If end_lsn != InvalidXLogRecPtr, only summaries that start before the + * indicated LSN will be included.
+ * + * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn) + * to get all WAL summaries on the indicated timeline that overlap the + * specified LSN range. + */ +List * +GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn) Instead of "If start_lsn != InvalidXLogRecPtr, only summaries that end before the" it should be "If start_lsn != InvalidXLogRecPtr, only summaries that end after the" because only the summary files that end after the start_lsn overlap the range and need to be returned; files ending before the start_lsn are not overlapping at all, right? 5. In the FilterWalSummaries() header the comment is wrong in the same way as for the GetWalSummaries() function. 6. + * If the whole range of LSNs is covered, returns true, otherwise false. + * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr + * if there are no WAL summary files in the input list, or to the first LSN + * in the range that is not covered by a WAL summary file in the input list. + */ +bool +WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn, I did not see the usage of this function, but I think if the whole range is not covered why not keep the behavior uniform w.r.t. what we set for '*missing_lsn', I mean suppose there is no file then missing_lsn is the start_lsn because a very first LSN is missing. 7. + nbytes = FileRead(io->file, data, length, io->filepos, + WAIT_EVENT_WAL_SUMMARY_READ); + if (nbytes < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not write file \"%s\": %m", + FilePathName(io->file)))); /could not write file/ could not read file 8. +/* + * Comparator to sort a List of WalSummaryFile objects by start_lsn. + */ +static int +ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b) +{ -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
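To put the overlap direction from point 4 in concrete terms: a summary overlaps the requested range exactly when it ends after start_lsn and starts before end_lsn. A minimal sketch of that predicate, with illustrative names and simplified types rather than the actual GetWalSummaries() code:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Illustrative stand-in for the WalSummaryFile fields discussed here. */
typedef struct
{
	XLogRecPtr	summary_start_lsn;
	XLogRecPtr	summary_end_lsn;
} WalSummaryFile;

/*
 * Keep a summary if it overlaps [start_lsn, end_lsn): it must end
 * after start_lsn and start before end_lsn. An InvalidXLogRecPtr
 * bound means "no bound on that side".
 */
static bool
summary_overlaps_range(const WalSummaryFile *ws,
					   XLogRecPtr start_lsn, XLogRecPtr end_lsn)
{
	if (start_lsn != InvalidXLogRecPtr && ws->summary_end_lsn <= start_lsn)
		return false;			/* ends at or before the range start */
	if (end_lsn != InvalidXLogRecPtr && ws->summary_start_lsn >= end_lsn)
		return false;			/* starts at or after the range end */
	return true;
}
```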
Great stuff you got here. I'm doing a first pass trying to grok the whole thing for more substantive comments, but in the meantime here are some cosmetic ones.

I got the following warnings, both valid:

../../../../pgsql/source/master/src/common/blkreftable.c: In function 'WriteBlockRefTable':
../../../../pgsql/source/master/src/common/blkreftable.c:520:45: warning: declaration of 'brtentry' shadows a previous local [-Wshadow=compatible-local]
  520 | BlockRefTableEntry *brtentry;
      |                     ^~~~~~~~
../../../../pgsql/source/master/src/common/blkreftable.c:492:37: note: shadowed declaration is here
  492 | BlockRefTableEntry *brtentry;
      |                     ^~~~~~~~
../../../../../pgsql/source/master/src/backend/postmaster/walsummarizer.c: In function 'SummarizeWAL':
../../../../../pgsql/source/master/src/backend/postmaster/walsummarizer.c:833:57: warning: declaration of 'private_data' shadows a previous local [-Wshadow=compatible-local]
  833 | SummarizerReadLocalXLogPrivate *private_data;
      |                                 ^~~~~~~~~~~~
../../../../../pgsql/source/master/src/backend/postmaster/walsummarizer.c:709:41: note: shadowed declaration is here
  709 | SummarizerReadLocalXLogPrivate *private_data;
      |                                 ^~~~~~~~~~~~

In blkreftable.c, I think the definition of SH_EQUAL should have an outer layer of parentheses. Also, it would be good to provide and use a function to initialize a BlockRefTableKey from the RelFileNode and forknum components, and ensure that any padding bytes are zeroed. Otherwise it's not going to be a great hash key. On my platform there aren't any (padding bytes), but I think it's unwise to rely on that.

I don't think SummarizerReadLocalXLogPrivate->waited is used for anything. Could be removed AFAICS, unless you foresee adding something that uses it.
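To make the padding concern concrete: if a struct is used as a memcmp-style hash key, any compiler-inserted padding bytes hold indeterminate values unless explicitly zeroed, so two logically equal keys can hash or compare unequal. A hedged sketch of the kind of initializer being suggested (the field names approximate BlockRefTableKey, but this is not the actual patch code):

```c
#include <string.h>
#include <stdint.h>

/* Illustrative stand-ins for RelFileLocator / ForkNumber. */
typedef struct
{
	uint32_t	spcOid;
	uint32_t	dbOid;
	uint32_t	relNumber;
} RelFileLocator;
typedef int ForkNumber;

typedef struct
{
	RelFileLocator rlocator;
	ForkNumber	forknum;
} BlockRefTableKey;

/*
 * Initialize a key for hashing. memset() the whole struct first so
 * that any padding bytes are zeroed; otherwise two keys with the same
 * logical contents could compare unequal under memcmp-based hashing.
 */
static void
blockreftable_key_init(BlockRefTableKey *key,
					   const RelFileLocator *rlocator, ForkNumber forknum)
{
	memset(key, 0, sizeof(BlockRefTableKey));
	key->rlocator = *rlocator;
	key->forknum = forknum;
}
```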
These forward struct declarations are not buying you anything, I'd remove them: diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h index 70d6c072d7..316e67122c 100644 --- a/src/include/common/blkreftable.h +++ b/src/include/common/blkreftable.h @@ -29,10 +29,7 @@ /* Magic number for serialization file format. */ #define BLOCKREFTABLE_MAGIC 0x652b137b -struct BlockRefTable; -struct BlockRefTableEntry; -struct BlockRefTableReader; -struct BlockRefTableWriter; +/* Struct definitions appear in blkreftable.c */ typedef struct BlockRefTable BlockRefTable; typedef struct BlockRefTableEntry BlockRefTableEntry; typedef struct BlockRefTableReader BlockRefTableReader; and backup_label.h doesn't know about TimeLineID, so it needs this: diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h index 08d6ed67a9..3af7ea274c 100644 --- a/src/bin/pg_combinebackup/backup_label.h +++ b/src/bin/pg_combinebackup/backup_label.h @@ -12,6 +12,7 @@ #ifndef BACKUP_LABEL_H #define BACKUP_LABEL_H +#include "access/xlogdefs.h" #include "common/checksum_helper.h" #include "lib/stringinfo.h" I don't much like the way header files in src/bin/pg_combinebackup files are structured. Particularly, causing a simplehash to be "instantiated" just because load_manifest.h is included seems poised to cause pain. I think there should be a file with the basic struct declarations (no simplehash); and then maybe since both pg_basebackup and pg_combinebackup seem to need the same simplehash, create a separate header file containing just that.. But, did you notice that anything that includes reconstruct.h will instantiate the simplehash stuff, because it includes load_manifest.h? It may be unwise to have the simplehash in a header file. Maybe just declare it in each .c file that needs it. The duplicity is not that large. I'll see if I can understand the way all these headers are needed to propose some other arrangement. 
I see this idea of having "struct FooBar;" immediately followed by "typedef struct FooBar FooBar;" which I mentioned from blkreftable.h occurs in other places as well (JsonManifestParseContext in parse_manifest.h, maybe others?). Was this pattern cargo-culted from somewhere? Perhaps we have other places to clean up. Why leave unnamed arguments in function declarations? For example, in static void manifest_process_file(JsonManifestParseContext *, char *pathname, size_t size, pg_checksum_type checksum_type, int checksum_length, uint8 *checksum_payload); the first argument lacks a name. Is this just an oversight, I hope? In GetFileBackupMethod(), which arguments are in and which are out? The comment doesn't say, and it's not obvious why we pass both the file path as well as the individual constituent pieces for it. DO_NOT_BACKUP_FILE appears not to be set anywhere. Do you expect to use this later? If not, maybe remove it. There are two functions named record_manifest_details_for_file() in different programs. I think this sort of arrangement is not great, as it is confusing to follow. It would be better if those two routines were called something like, say, verifybackup_perfile_cb and combinebackup_perfile_cb instead; then in the function comment say something like /* * JsonManifestParseContext->perfile_cb implementation for pg_combinebackup. * * Record details extracted from the backup manifest for one file, * because we like to keep things tracked or whatever. */ so it's easy to track down what does what and why. Same with perwalrange_cb. "perfile" looks bothersome to me as a name entity. Why not per_file_cb? and per_walrange_cb? In walsummarizer.c, HandleWalSummarizerInterrupts is called in summarizer_read_local_xlog_page but SummarizeWAL() doesn't do that. Maybe it should? I think this path is not going to be very human-likeable.
snprintf(final_path, MAXPGPATH, XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary", tli, LSN_FORMAT_ARGS(summary_start_lsn), LSN_FORMAT_ARGS(summary_end_lsn)); Why not add a dash between the TLI and between both LSNs, or something like that? (Also, are we really printing TLIs as 8-byte hexs?) -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "I suspect most samba developers are already technically insane... Of course, since many of them are Australians, you can't tell." (L. Torvalds)
On Fri, Nov 10, 2023 at 6:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > - I think 0001 looks good improvement irrespective of the patch series. OK, perhaps that can be independently committed, then, if nobody objects. Thanks for the review; I've fixed a bunch of things that you mentioned. I'll just comment on the ones I haven't yet done anything about below. > 2. > + <varlistentry id="guc-wal-summarize-keep-time" > xreflabel="wal_summarize_keep_time"> > + <term><varname>wal_summarize_keep_time</varname> (<type>boolean</type>) > + <indexterm> > + <primary><varname>wal_summarize_keep_time</varname> > configuration parameter</primary> > + </indexterm> > > I feel the name of the guy should be either wal_summarizer_keep_time > or wal_summaries_keep_time, I mean either we should refer to the > summarizer process or to the way summaries files. How about wal_summary_keep_time? > 6. > + * If the whole range of LSNs is covered, returns true, otherwise false. > + * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr > + * if there are no WAL summary files in the input list, or to the first LSN > + * in the range that is not covered by a WAL summary file in the input list. > + */ > +bool > +WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn, > > I did not see the usage of this function, but I think if the whole > range is not covered why not keep the behavior uniform w.r.t. what we > set for '*missing_lsn', I mean suppose there is no file then > missing_lsn is the start_lsn because a very first LSN is missing. It's used later in the patch series. I think the way that I have it makes for a more understandable error message. > 8. > +/* > + * Comparator to sort a List of WalSummaryFile objects by start_lsn. > + */ > +static int > +ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b) > +{ I'm not sure what needs fixing here. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Nov 13, 2023 at 11:25 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Great stuff you got here. I'm doing a first pass trying to grok the > whole thing for more substantive comments, but in the meantime here are > some cosmetic ones. Thanks, thanks, and thanks. I've fixed some things that you mentioned in the attached version. Other comments below. > In blkreftable.c, I think the definition of SH_EQUAL should have an > outer layer of parentheses. Also, it would be good to provide and use a > function to initialize a BlockRefTableKey from the RelFileNode and > forknum components, and ensure that any padding bytes are zeroed. > Otherwise it's not going to be a great hash key. On my platform there > aren't any (padding bytes), but I think it's unwise to rely on that. I'm having trouble understanding the second part of this suggestion. Note that in a frontend context, SH_RAW_ALLOCATOR is pg_malloc0, and in a backend context, we get the default, which is MemoryContextAllocZero. Maybe there's some case this doesn't cover, though? > These forward struct declarations are not buying you anything, I'd > remove them: I've had problems from time to time when I don't do this. I'll remove it here, but I'm not convinced that it's always useless. > I don't much like the way header files in src/bin/pg_combinebackup files > are structured. Particularly, causing a simplehash to be "instantiated" > just because load_manifest.h is included seems poised to cause pain. I > think there should be a file with the basic struct declarations (no > simplehash); and then maybe since both pg_basebackup and > pg_combinebackup seem to need the same simplehash, create a separate > header file containing just that.. But, did you notice that anything > that includes reconstruct.h will instantiate the simplehash stuff, > because it includes load_manifest.h? It may be unwise to have the > simplehash in a header file. Maybe just declare it in each .c file that > needs it. 
The duplicity is not that large. I think that I did this correctly. AIUI, if you're defining a simplehash that only one source file needs, you make the scope "static" and do both SH_DECLARE and SH_DEFINE in that file. If you need it to be shared by multiple files, you make it "extern" in the header file, do SH_DECLARE there, and SH_DEFINE in one of those source files. Or you could make the scope "static inline" in the header file and then you'd both SH_DECLARE and SH_DEFINE it in the header file. If I were to do as you suggest here, I think I'd end up with 2 copies of the compiled code for this instead of one, and if they ever got out of sync everything would break silently. > Why leave unnamed arguments in function declarations? For example, in > > static void manifest_process_file(JsonManifestParseContext *, > char *pathname, > size_t size, > pg_checksum_type checksum_type, > int checksum_length, > uint8 *checksum_payload); > the first argument lacks a name. Is this just an oversight, I hope? I mean, I've changed it now, but I don't think it's worth getting too excited about. "int checksum_length" is much better documentation than just "int," but "JsonManifestParseContext *context" is just noise, IMHO. You can argue that it's better for consistency that way, but whatever. > In GetFileBackupMethod(), which arguments are in and which are out? > The comment doesn't say, and it's not obvious why we pass both the file > path as well as the individual constituent pieces for it. The header comment does document which values are potentially set on return. I guess I thought it was clear enough that the stuff not documented to be output parameters was input parameters. Most of them aren't even pointers, so they have to be input parameters. The only exception is 'path', which I have some difficulty thinking that anyone is going to imagine to be an input pointer. Maybe you could propose a more specific rewording of this comment?
FWIW, I'm not altogether sure whether this function is going to get more heavily adjusted in a rev or three of the patch set, so maybe we want to wait to sort this out until this is closer to final, but OTOH if I know what you have in mind for the current version, I might be more likely to keep it in a good place if I end up changing it. > DO_NOT_BACKUP_FILE appears not to be set anywhere. Do you expect to use > this later? If not, maybe remove it. Woops, that was a holdover from an earlier version. > There are two functions named record_manifest_details_for_file() in > different programs. I think this sort of arrangement is not great, as > it is confusing confusing to follow. It would be better if those two > routines were called something like, say, verifybackup_perfile_cb and > combinebackup_perfile_cb instead; then in the function comment say > something like > /* > * JsonManifestParseContext->perfile_cb implementation for pg_combinebackup. > * > * Record details extracted from the backup manifest for one file, > * because we like to keep things tracked or whatever. > */ > so it's easy to track down what does what and why. Same with > perwalrange_cb. "perfile" looks bothersome to me as a name entity. Why > not per_file_cb? and per_walrange_cb? I had trouble figuring out how to name this stuff. I did notice the awkwardness, but surely nobody can think that two functions with the same name in different binaries can be actually the same function. If we want to inject more underscores here, my vote is to go all the way and make it per_wal_range_cb. > In walsummarizer.c, HandleWalSummarizerInterrupts is called in > summarizer_read_local_xlog_page but SummarizeWAL() doesn't do that. > Maybe it should? I replaced all the CHECK_FOR_INTERRUPTS() in that file with HandleWalSummarizerInterrupts(). Does that seem right? > I think this path is not going to be very human-likeable. 
> snprintf(final_path, MAXPGPATH, > XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary", > tli, > LSN_FORMAT_ARGS(summary_start_lsn), > LSN_FORMAT_ARGS(summary_end_lsn)); > Why not add a dash between the TLI and between both LSNs, or something > like that? (Also, are we really printing TLIs as 8-byte hexs?) Dealing with the last part first, we already do that in every WAL file name. I actually think these file names are easier to work with than WAL file names, because 000000010000000000000020 is not the WAL starting at 0/20, but rather the WAL starting at 0/20000000. To know at what LSN a WAL file starts, you have to mentally delete characters 17 through 22, which will always be zero, and instead add six zeroes at the end. I don't think whoever came up with that file naming convention deserves an award, unless it's a raspberry award. With these names, you get something like 0000000100000000015125B800000000015128F0.summary and you can sort of see that 1512 repeats so the LSN went from something ending in 5B8 to something ending in 8F0. I actually think it's way better. But I have a hard time arguing that it wouldn't be more readable still if we put some separator characters in there. I didn't do that because then they'd look less like WAL file names, but maybe that's not really a problem. A possible reason not to bother is that these files are less necessary for humans to care about than WAL files, since they don't need to be archived or transported between nodes in any way. Basically I think this is probably fine the way it is, but if you or others think it's really important to change it, I can do that. Just as long as we don't spend 50 emails arguing about which separator character to use. -- Robert Haas EDB: http://www.enterprisedb.com
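For the curious, the 40-hex-digit name discussed above is just the TLI followed by each LSN split into its high and low 32-bit halves, so it can be composed and decoded mechanically. A hedged sketch with stand-in typedefs (these are not the patch's actual helpers):

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

typedef uint32_t TimeLineID;
typedef uint64_t XLogRecPtr;

/* Split an LSN into the high/low halves printed as %08X%08X. */
#define LSN_FORMAT_ARGS(lsn) (uint32_t) ((lsn) >> 32), (uint32_t) (lsn)

/* Compose a summary file name: TLI, start LSN, end LSN, 8 hex digits each. */
static void
summary_file_name(char *buf, size_t sz, TimeLineID tli,
				  XLogRecPtr start_lsn, XLogRecPtr end_lsn)
{
	snprintf(buf, sz, "%08X%08X%08X%08X%08X.summary",
			 tli, LSN_FORMAT_ARGS(start_lsn), LSN_FORMAT_ARGS(end_lsn));
}

/* Decode such a name; returns 1 on success, 0 on a malformed name. */
static int
summary_file_parse(const char *name, TimeLineID *tli,
				   XLogRecPtr *start_lsn, XLogRecPtr *end_lsn)
{
	unsigned int t, s_hi, s_lo, e_hi, e_lo;

	/* Each %8X consumes exactly 8 hex digits. */
	if (sscanf(name, "%8X%8X%8X%8X%8X.summary",
			   &t, &s_hi, &s_lo, &e_hi, &e_lo) != 5)
		return 0;
	*tli = (TimeLineID) t;
	*start_lsn = ((XLogRecPtr) s_hi << 32) | s_lo;
	*end_lsn = ((XLogRecPtr) e_hi << 32) | e_lo;
	return 1;
}
```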
Attachment
- v9-0005-Add-new-pg_walsummary-tool.patch
- v9-0001-Change-how-a-base-backup-decides-which-files-have.patch
- v9-0002-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch
- v9-0004-Add-support-for-incremental-backup.patch
- v9-0003-Add-a-new-WAL-summarizer-process.patch
- v9-0006-Test-patch-Enable-summarize_wal-by-default.patch
On Tue, Nov 14, 2023 at 12:52 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Nov 10, 2023 at 6:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > - I think 0001 looks good improvement irrespective of the patch series. > > OK, perhaps that can be independently committed, then, if nobody objects. > > Thanks for the review; I've fixed a bunch of things that you > mentioned. I'll just comment on the ones I haven't yet done anything > about below. > > > 2. > > + <varlistentry id="guc-wal-summarize-keep-time" > > xreflabel="wal_summarize_keep_time"> > > + <term><varname>wal_summarize_keep_time</varname> (<type>boolean</type>) > > + <indexterm> > > + <primary><varname>wal_summarize_keep_time</varname> > > configuration parameter</primary> > > + </indexterm> > > > > I feel the name of the guy should be either wal_summarizer_keep_time > > or wal_summaries_keep_time, I mean either we should refer to the > > summarizer process or to the way summaries files. > > How about wal_summary_keep_time? Yes, that looks perfect to me. > > 6. > > + * If the whole range of LSNs is covered, returns true, otherwise false. > > + * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr > > + * if there are no WAL summary files in the input list, or to the first LSN > > + * in the range that is not covered by a WAL summary file in the input list. > > + */ > > +bool > > +WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn, > > > > I did not see the usage of this function, but I think if the whole > > range is not covered why not keep the behavior uniform w.r.t. what we > > set for '*missing_lsn', I mean suppose there is no file then > > missing_lsn is the start_lsn because a very first LSN is missing. > > It's used later in the patch series. I think the way that I have it > makes for a more understandable error message. Okay > > 8. > > +/* > > + * Comparator to sort a List of WalSummaryFile objects by start_lsn. 
> > + */ > > +static int > > +ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b) > > +{ > > I'm not sure what needs fixing here. I think I copy-pasted it by mistake, just ignore it. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Nov 14, 2023 at 2:10 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Nov 13, 2023 at 11:25 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Great stuff you got here. I'm doing a first pass trying to grok the > > whole thing for more substantive comments, but in the meantime here are > > some cosmetic ones. > > Thanks, thanks, and thanks. > > I've fixed some things that you mentioned in the attached version. > Other comments below. Here are some more comments based on what I have read so far, mostly cosmetic comments. 1. + * summary file yet, then stoppng doesn't make any sense, and we + * should wait until the next stop point instead. Typo /stoppng/stopping 2. + /* Close temporary file and shut down xlogreader. */ + FileClose(io.file); + We have already freed the xlogreader so the second part of the comment is not valid. 3. + /* + * If a relation fork is truncated on disk, there is in point in + * tracking anything about block modifications beyond the truncation + * point. Typo. /there is in point/ there is no point 4. +/* + * Special handling for WAL recods with RM_XACT_ID. + */ /recods/records 5. + if (xact_info == XLOG_XACT_COMMIT || + xact_info == XLOG_XACT_COMMIT_PREPARED) + { + xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader); + xl_xact_parsed_commit parsed; + int i; + + ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed); + for (i = 0; i < parsed.nrels; ++i) + { + ForkNumber forknum; + + for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum) + if (forknum != FSM_FORKNUM) + BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i], + forknum, 0); + } + } For SmgrCreate and Truncate I understand setting the 'limit block' but why for commit/abort? I think it would be better to add some comments here. 6.
+ * Caller must set private_data->tli to the TLI of interest, + * private_data->read_upto to the lowest LSN that is not known to be safe + * to read on that timeline, and private_data->historic to true if and only + * if the timeline is not the current timeline. This function will update + * private_data->read_upto and private_data->historic if more WAL appears + * on the current timeline or if the current timeline becomes historic. + */ +static int +summarizer_read_local_xlog_page(XLogReaderState *state, + XLogRecPtr targetPagePtr, int reqLen, + XLogRecPtr targetRecPtr, char *cur_page) The comments say "private_data->read_upto to the lowest LSN that is not known to be safe" but is it really the lowest LSN? I think it is the highest LSN this is known to be safe for that TLI no? 7. + /* If it's time to remove any old WAL summaries, do that now. */ + MaybeRemoveOldWalSummaries(); I was just wondering whether removing old summaries should be the job of the wal summarizer or it should be the job of the checkpointer, I mean while removing the old wals it can also check and remove the old summaries? Anyway, it's just a question and I do not have a strong opinion on this. 8. + /* + * Whether we we removed the file or not, we need not consider it + * again. + */ Typo /Whether we we removed/ Whether we removed 9. +/* + * Get an entry from a block reference table. + * + * If the entry does not exist, this function returns NULL. Otherwise, it + * returns the entry and sets *limit_block to the value from the entry. + */ +BlockRefTableEntry * +BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator, + ForkNumber forknum, BlockNumber *limit_block) If this function is already returning 'BlockRefTableEntry' then why would it need to set an extra '*limit_block' out parameter which it is actually reading from the entry itself? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
0001 looks OK to push, and since it stands on its own I would get it out of the way soon rather than waiting for the rest of the series to be further reviewed.

0002:
This moves bin/pg_verifybackup/parse_manifest.c to common/parse_manifest.c, where it's not clear that it's for backup manifests (which wasn't a problem in the previous location). I wonder if we're going to have anything else called a "manifest", in which case I propose to rename the file to make it clear that this is about backup manifests -- maybe parse_bkp_manifest.c. This patch looks pretty uncontroversial, but there's no point in going further with this one until the followup patches are closer to commit.

0003:
AmWalSummarizerProcess() is unused. Remove?

MaybeWalSummarizer() is called on each ServerLoop() in postmaster.c. This causes a function call to be emitted every time through, which looks odd; all other process starts have some triggering condition.

GetOldestUnsummarizedLSN uses while (true), but WaitForWalSummarization and SummarizeWAL use while (1). Maybe settle on one style?

Still reading this one.

0004:
In PrepareForIncrementalBackup(), the logic that determines earliest_wal_range_tli and latest_wal_range_tli looks pretty weird. I think it works fine if there's a single timeline, but not otherwise. Or maybe the trick is that it relies on the timelines returned by readTimeLineHistory being sorted backwards? If so, maybe add a comment about that somewhere; I don't think other callers of readTimeLineHistory make that assumption.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Postgres is bloatware by design: it was built to house PhD theses." (Joey Hellerstein, SIGMOD annual conference 2002)
Hi Robert,

[..spotted the v9 patchset..] so I've spent some time playing still with patchset v8 (without the 6/6 testing patch related to wal_level=minimal), except where marked otherwise as patchset v9.

1. At compile time there were two warnings about shadowed variables (at least with gcc version 10.2.1), but in v9 those are fixed:

blkreftable.c: In function ‘WriteBlockRefTable’:
blkreftable.c:520:24: warning: declaration of ‘brtentry’ shadows a previous local [-Wshadow=compatible-local]

walsummarizer.c: In function ‘SummarizeWAL’:
walsummarizer.c:833:36: warning: declaration of ‘private_data’ shadows a previous local [-Wshadow=compatible-local]

2. Usability thing: I hit the timeout hard: "This backup requires WAL to be summarized up to 0/90000D8, but summarizer has only reached 0/0." with summarize_wal=off (the default), but apparently this is in the TODO. Looks like an important usability thing.

3. I've verified that if the DB was in wal_level=minimal even temporarily (and thus summarization was disabled) it is impossible to take an incremental backup:

pg_basebackup: error: could not initiate base backup: ERROR: WAL summaries are required on timeline 1 from 0/70000D8 to 0/10000028, but the summaries for that timeline and LSN range are incomplete
DETAIL: The first unsummarized LSN is this range is 0/D04AE88.

4. As we have discussed off list, there is (was) this pg_combinebackup bug in v8's reconstruct_from_incremental_file() where it was unable to realize that - in case of combining multiple incremental backups - it should stop looking for the previous instance of the full file (while it was fine with v6 of the patchset). I've checked it on v9 - it is good now.

5. On v8 I've finally played a little bit with standby(s) and this patchset, with a couple of basic scenarios while mixing the source of the backups:

a. full on standby, incr1 on standby, full db restore (incl. incr1) on standby
# sometimes I'm getting spurious errors like these when doing incrementals on standby with -c fast:

2023-11-15 13:49:05.721 CET [10573] LOG: recovery restart point at 0/A000028
2023-11-15 13:49:07.591 CET [10597] WARNING: aborting backup due to backend exiting before pg_backup_stop was called
2023-11-15 13:49:07.591 CET [10597] ERROR: manifest requires WAL from final timeline 1 ending at 0/A0000F8, but this backup starts at 0/A000028
2023-11-15 13:49:07.591 CET [10597] STATEMENT: BASE_BACKUP ( INCREMENTAL, LABEL 'pg_basebackup base backup', PROGRESS, CHECKPOINT 'fast', WAIT 0, MANIFEST 'yes', TARGET 'client')

# when you retry the same pg_basebackup it goes fine (looks like a CHECKPOINT-on-standby/restartpoint <-> summarizer disconnect; I'll dig deeper tomorrow. It seems that issuing "CHECKPOINT; pg_sleep(1);" against the primary just before pg_basebackup --incr on the standby works around it)

b. full on primary, incr1 on standby, full db restore (incl. incr1) on standby # WORKS
c. full on standby, incr1 on standby, full db restore (incl. incr1) on primary # WORKS*
d. full on primary, incr1 on standby, full db restore (incl. incr1) on primary # WORKS*

* - needs pg_promote() due to the controlfile having the standby bit set, plus potential fiddling with postgresql.auto.conf as it contains the primary_connstring GUC.

6. Sci-fi-mode-on: I was wondering about the dangers of e.g. having a more recent pg_basebackup (e.g. from pg18 one day) running against pg17 in the scope of this incremental backup possibility. Is it going to be safe? (currently there seem to be no safeguards against such use) Or should those things (core, pg_basebackup) be running in version lock step?

Regards,
-J.
On 2023-Nov-13, Robert Haas wrote:

> On Mon, Nov 13, 2023 at 11:25 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > Also, it would be good to provide and use a
> > function to initialize a BlockRefTableKey from the RelFileNode and
> > forknum components, and ensure that any padding bytes are zeroed.
> > Otherwise it's not going to be a great hash key. On my platform there
> > aren't any (padding bytes), but I think it's unwise to rely on that.
>
> I'm having trouble understanding the second part of this suggestion.
> Note that in a frontend context, SH_RAW_ALLOCATOR is pg_malloc0, and
> in a backend context, we get the default, which is
> MemoryContextAllocZero. Maybe there's some case this doesn't cover,
> though?

I meant code like this

  memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
  key.forknum = forknum;
  entry = blockreftable_lookup(brtab->hash, key);

where any padding bytes in "key" could have arbitrary values, because they're not initialized. So I'd have a (maybe static inline) function BlockRefTableKeyInit(&key, rlocator, forknum) that fills it in for you.

Note:

  #define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0)

AFAICT the new simplehash uses in this patch series are the only ones that use memcmp() as SH_EQUAL, so we don't necessarily have precedent on lack of padding-byte initialization in existing uses of simplehash.

> > These forward struct declarations are not buying you anything, I'd
> > remove them:
>
> I've had problems from time to time when I don't do this. I'll remove
> it here, but I'm not convinced that it's always useless.

Well, certainly there are places where they are necessary.

> > I don't much like the way header files in src/bin/pg_combinebackup files
> > are structured. Particularly, causing a simplehash to be "instantiated"
> > just because load_manifest.h is included seems poised to cause pain. I
> > think there should be a file with the basic struct declarations (no
> > simplehash); and then maybe since both pg_basebackup and
> > pg_combinebackup seem to need the same simplehash, create a separate
> > header file containing just that. But, did you notice that anything
> > that includes reconstruct.h will instantiate the simplehash stuff,
> > because it includes load_manifest.h? It may be unwise to have the
> > simplehash in a header file. Maybe just declare it in each .c file that
> > needs it. The duplicity is not that large.
>
> I think that I did this correctly.

Oh, I hadn't grokked that we had this SH_SCOPE thing and a separate SH_DECLARE for it being extern. OK, please ignore that.

> > Why leave unnamed arguments in function declarations?
>
> I mean, I've changed it now, but I don't think it's worth getting too
> excited about.

Well, we did get into consistency arguments on this point previously. I agree it's not *terribly* important, but on thread https://www.postgresql.org/message-id/flat/CAH2-WznJt9CMM9KJTMjJh_zbL5hD9oX44qdJ4aqZtjFi-zA3Tg%40mail.gmail.com people got really worked up about this stuff.

> > In GetFileBackupMethod(), which arguments are in and which are out?
> > The comment doesn't say, and it's not obvious why we pass both the file
> > path as well as the individual constituent pieces for it.
>
> The header comment does document which values are potentially set on
> return. I guess I thought it was clear enough that the stuff not
> documented to be output parameters was input parameters. Most of them
> aren't even pointers, so they have to be input parameters. The only
> exception is 'path', which I have some difficulty thinking that anyone
> is going to imagine to be an input pointer.

An output pointer, you mean :-)  (Should it be const?)

When the return value is BACK_UP_FILE_FULLY, it's not clear what happens to these output values; we modify some, but why? Maybe they should be left alone? In that case, the "if size == 0" test should move a couple of lines up, into the brtentry == NULL block.

BTW, you could do the qsort() after deciding to back up the file fully if more than 90% needs to be replaced.

BTW, in sendDir(), why do

  lookup_path = pstrdup(pathbuf + basepathlen + 1);

when you could do

  lookup_path = pstrdup(tarfilename);

?

> > There are two functions named record_manifest_details_for_file() in
> > different programs.
>
> I had trouble figuring out how to name this stuff. I did notice the
> awkwardness, but surely nobody can think that two functions with the
> same name in different binaries can be actually the same function.

Of course not, but when cscope-jumping around, it is weird.

> If we want to inject more underscores here, my vote is to go all the
> way and make it per_wal_range_cb.

+1

> > In walsummarizer.c, HandleWalSummarizerInterrupts is called in
> > summarizer_read_local_xlog_page but SummarizeWAL() doesn't do that.
> > Maybe it should?
>
> I replaced all the CHECK_FOR_INTERRUPTS() in that file with
> HandleWalSummarizerInterrupts(). Does that seem right?

Looks to be what walwriter.c does at least, so I guess it's OK.

> > I think this path is not going to be very human-likeable.
> >   snprintf(final_path, MAXPGPATH,
> >            XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
> >            tli,
> >            LSN_FORMAT_ARGS(summary_start_lsn),
> >            LSN_FORMAT_ARGS(summary_end_lsn));
> > Why not add a dash between the TLI and between both LSNs, or something
> > like that?
>
> But I have a hard time arguing that it wouldn't be more readable still
> if we put some separator characters in there. I didn't do that because
> then they'd look less like WAL file names, but maybe that's not really
> a problem. A possible reason not to bother is that these files are
> less necessary for humans to care about than WAL files, since they
> don't need to be archived or transported between nodes in any way.
> Basically I think this is probably fine the way it is, but if you or
> others think it's really important to change it, I can do that. Just
> as long as we don't spend 50 emails arguing about which separator
> character to use.

Yeah, I just think that endless streams of hex chars are hard to read, and I've found myself following digits on the screen with my fingers in order to parse file names. I guess you could say thousands separators for regular numbers aren't needed either, but we do have them for readability's sake.

I think a new section in chapter 30 "Reliability and the Write-Ahead Log" is warranted. It would explain the summarization process, what the summary files are used for, and the deletion mechanism. I can draft something if you want.

It's not clear to me whether WalSummarizerCtl->pending_lsn is fulfilling some purpose or is just a leftover from prior development. I see it's only read in an assertion ... Maybe if we think this cross-check is important, it should be turned into an elog? Otherwise, I'd remove it.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"No me acuerdo, pero no es cierto. No es cierto, y si fuera cierto, no me acuerdo." (Augusto Pinochet a una corte de justicia)
On Tue, Nov 14, 2023 at 8:12 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> 0001 looks OK to push, and since it stands on its own I would get it out
> of the way soon rather than waiting for the rest of the series to be
> further reviewed.

All right, done.

> 0003:
> AmWalSummarizerProcess() is unused. Remove?

The intent seems to be to have one of these per enum value, whether it gets used or not. Some of the others aren't used, either.

> MaybeWalSummarizer() is called on each ServerLoop() in postmaster.c?
> This causes a function call to be emitted every time through. That
> looks odd. All other process starts have some triggering condition.

I'm not sure how much this matters, really. I would expect that the function call overhead here wouldn't be very noticeable. Generally I think that when ServerLoop returns from WaitEventSetWait it's going to be because we need to fork a process. That's pretty expensive compared to a function call. If we can iterate through this loop lots of times without doing any real work then it might matter, but I feel like that's probably not the case, and probably something we would want to fix if it were the case.

Now, I could nevertheless move some of the triggering conditions into MaybeStartWalSummarizer(), but moving, say, just the summarize_wal condition wouldn't be enough to avoid having MaybeStartWalSummarizer() called repeatedly when there was no work to do, because summarize_wal could be true and the summarizer could already be running. Similarly, if I moved just the WalSummarizerPID == 0 condition, the function would get called repeatedly without doing anything when summarize_wal = off. So at a minimum you have to move both of those if you care about avoiding the function call overhead, and then you have to wonder whether you care about the corner cases where the function would be called repeatedly for no gain even then.

Another approach would be to make the function static inline rather than just static. Or we could delete the whole function and just duplicate the logic it contains at both call sites. Personally I'm inclined to just leave it how it is in the absence of some evidence that there's a real problem here. It's nice to have all the triggering conditions in one place with nothing duplicated.

> GetOldestUnsummarizedLSN uses while(true), but WaitForWalSummarization
> and SummarizeWAL use while(1). Maybe settle on one style?

OK.

> 0004:
> in PrepareForIncrementalBackup(), the logic that determines
> earliest_wal_range_tli and latest_wal_range_tli looks pretty weird. I
> think it works fine if there's a single timeline, but not otherwise.
> Or maybe the trick is that it relies on timelines returned by
> readTimeLineHistory being sorted backwards? If so, maybe add a comment
> about that somewhere; I don't think other callers of readTimeLineHistory
> make that assumption.

It does indeed rely on that assumption, and the comment at the top of the for (i = 0; i < num_wal_ranges; ++i) loop explains that. Note also the comment just below that begins "If we found this TLI in the server's history".

I agree with you that this logic looks strange, and it's possible that there's some better way to encode the idea than what I've done here, but I think it might be just that the particular calculation we're trying to do here is strange. It's almost easier to understand the logic if you start by reading the sanity checks ("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X" et al.), look at the triggering conditions for those, and then work upward to see how earliest/latest_wal_range_tli get set, and then look up from there to see how saw_earliest/latest_wal_range_tli are used in computing those values.

We do rely on the ordering assumption elsewhere. For example, in XLogFileReadAnyTLI, see "if (tli < curFileTLI) break". We also use it to set expectedTLEs, which is documented to have this property. And AddWALInfoToBackupManifest relies on it too; see the comment "Because the timeline history file lists newer timelines before older ones" in AddWALInfoToBackupManifest. We're not entirely consistent about this, e.g., unlike XLogFileReadAnyTLI, tliInHistory() and tliOfPointInHistory() don't have an early exit provision, but we do use it in some places.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Nov 16, 2023 at 5:21 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I meant code like this
>
>   memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
>   key.forknum = forknum;
>   entry = blockreftable_lookup(brtab->hash, key);

Ah, I hadn't thought about that. Another way of handling that might be to add = {0} to the declaration of key. But I can do the initializer thing too if you think it's better. I'm not sure if there's an argument that the initializer might optimize better.

> An output pointer, you mean :-)  (Should it be const?)

I'm bad at const, but that seems to work, so sure.

> When the return value is BACK_UP_FILE_FULLY, it's not clear what happens
> to these output values; we modify some, but why? Maybe they should be
> left alone? In that case, the "if size == 0" test should move a couple
> of lines up, in the brtentry == NULL block.

OK.

> BTW, you could do the qsort() after deciding to backup the file fully if
> more than 90% needs to be replaced.

OK.

> BTW, in sendDir() why do
>   lookup_path = pstrdup(pathbuf + basepathlen + 1);
> when you could do
>   lookup_path = pstrdup(tarfilename);
> ?

No reason, changed.

> > If we want to inject more underscores here, my vote is to go all the
> > way and make it per_wal_range_cb.
>
> +1

Will look into this.

> Yeah, I just think that endless streams of hex chars are hard to read,
> and I've found myself following digits on the screen with my fingers in
> order to parse file names. I guess you could say thousands separators
> for regular numbers aren't needed either, but we do have them for
> readability's sake.

Sigh.

> I think a new section in chapter 30 "Reliability and the Write-Ahead
> Log" is warranted. It would explain the summarization process, what the
> summary files are used for, and the deletion mechanism. I can draft
> something if you want.

Sure, if you want to take a crack at it, that's great.

> It's not clear to me whether WalSummarizerCtl->pending_lsn is fulfilling
> some purpose or is just a leftover from prior development. I see it's
> only read in an assertion ... Maybe if we think this cross-check is
> important, it should be turned into an elog? Otherwise, I'd remove it.

I've been thinking about that. One thing I'm not quite sure about though is introspection. Maybe there should be a function that shows summarized_tli and summarized_lsn from WalSummarizerData, and maybe it should expose pending_lsn too.

--
Robert Haas
EDB: http://www.enterprisedb.com
On 2023-Oct-04, Robert Haas wrote:

> - I would like some feedback on the generation of WAL summary files.
> Right now, I have it enabled by default, and summaries are kept for a
> week. That means that, with no additional setup, you can take an
> incremental backup as long as the reference backup was taken in the
> last week. File removal is governed by mtimes, so if you change the
> mtimes of your summary files or whack your system clock around, weird
> things might happen. But obviously this might be inconvenient. Some
> people might not want WAL summary files to be generated at all because
> they don't care about incremental backup, and other people might want
> them retained for longer, and still other people might want them to be
> not removed automatically or removed automatically based on some
> criteria other than mtime. I don't really know what's best here. I
> don't think the default policy that the patches implement is
> especially terrible, but it's just something that I made up and I
> don't have any real confidence that it's wonderful. One point to
> consider here is that, if WAL summarization is enabled, checkpoints
> can't remove WAL that isn't summarized yet. Mostly that's not a
> problem, I think, because the WAL summarizer is pretty fast. But it
> could increase disk consumption for some people. I don't think that we
> need to worry about the summaries themselves being a problem in terms
> of space consumption; at least in all the cases I've tested, they're
> just not very big.

So, wal_summary is no longer turned on by default, I think following a comment from Peter E. I think this is a good decision, as we're only going to need summaries on servers from which incremental backups are going to be taken, which is a strict subset of all servers; and furthermore, people who need them are going to realize that very easily, while if we went the other way around, most people would not realize that they need to turn them off to save some resource consumption.

Granted, the amount of resources additionally used is probably not very big. And since it can be changed with a reload, not a restart, it doesn't seem problematic.

... oh, I just noticed that this patch now fails to compile because of the MemoryContextResetAndDeleteChildren removal.

(Typo in the pg_walsummary manpage: "since WAL summary files primary exist" -> "primarily")

> - On a related note, I haven't yet tested this on a standby, which is
> a thing that I definitely need to do. I don't know of a reason why it
> shouldn't be possible for all of this machinery to work on a standby
> just as it does on a primary, but then we need the WAL summarizer to
> run there too, which could end up being a waste if nobody ever tries
> to take an incremental backup. I wonder how that should be reflected
> in the configuration. We could do something like what we've done for
> archive_mode, where on means "only on if this is a primary" and you
> have to say always if you want it to run on standbys as well ... but
> I'm not sure if that's a design pattern that we really want to
> replicate into more places. I'd be somewhat inclined to just make
> whatever configuration parameters we need to configure this thing on
> the primary also work on standbys, and you can set each server up as
> you please. But I'm open to other suggestions.

I think it should default to off on both primary and standby, and the user has to enable it on whichever server they want to take backups from.

> - We need to settle the question of whether to send the whole backup
> manifest to the server or just the LSN. In a previous attempt at
> incremental backup, we decided the whole manifest was necessary,
> because flat-copying files could make new data show up with old LSNs.
> But that version of the patch set was trying to find modified blocks
> by checking their LSNs individually, not by summarizing WAL. And since
> the operations that flat-copy files are WAL-logged, the WAL summary
> approach seems to eliminate that problem - maybe an LSN (and the
> associated TLI) is good enough now. This also relates to Jakub's
> question about whether this machinery could be used to fast-forward a
> standby, which is not exactly a base backup but ... perhaps close
> enough? I'm somewhat inclined to believe that we can simplify to an
> LSN and TLI; however, if we do that, then we'll have big problems if
> later we realize that we want the manifest for something after all. So
> if anybody thinks that there's a reason to keep doing what the patch
> does today -- namely, upload the whole manifest to the server --
> please speak up.

I don't understand this point. Currently, the protocol is that UPLOAD_MANIFEST is used to send the manifest prior to requesting the backup. You seem to be saying that you're thinking of removing support for UPLOAD_MANIFEST and instead just giving the LSN as an option to the BASE_BACKUP command?

> - It's regrettable that we don't have incremental JSON parsing;

We now do have it, at least in patch form. I guess the question is whether we're going to accept it in core. I see the chances of changing the format of the manifest as rather slim at this point, and the need for very large manifests is likely to go up with time, so we probably need to take that code and polish it up, and see if we can improve its performance.

> - Right now, I have a hard-coded 60 second timeout for WAL
> summarization. If you try to take an incremental backup and the WAL
> summaries you need don't show up within 60 seconds, the backup times
> out. I think that's a reasonable default, but should it be
> configurable? If yes, should that be a GUC or, perhaps better, a
> pg_basebackup option?

I'd rather have a way for the server to provide diagnostics on why the summaries aren't being produced. Maybe a server running under valgrind is going to fail and need a longer timeout, but otherwise a hardcoded one seems sufficient.

You did say later that you thought summary files would just go from one checkpoint to the next. So the only question is at what point the file for the last checkpoint (i.e. from the previous one up to the one requested by pg_basebackup) is written. If walsummarizer keeps almost the complete state in memory and just waits for the checkpoint record to write it, then it's probably okay.

> - I'm curious what people think about the pg_walsummary tool that is
> included in 0006. I think it's going to be fairly important for
> debugging, but it does feel a little bit bad to add a new binary for
> something pretty niche. Nevertheless, merging it into any other
> utility seems relatively awkward, so I'm inclined to think both that
> this should be included in whatever finally gets committed and that it
> should be a separate binary. I considered whether it should go in
> contrib, but we seem to have moved to a policy that heavily favors
> limiting contrib to extensions and loadable modules, rather than
> binaries.

I propose to keep the door open for that binary doing other things than dumping the files as text. So add a command argument, which currently can only be "dump", to allow the command to do other things later if needed. (For example, remove files from a server on which summarize_wal has been turned off; or perhaps remove files that are below some LSN.)

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Estoy de acuerdo contigo en que la verdad absoluta no existe... El problema es que la mentira sí existe y tu estás mintiendo" (G. Lama)
On 2023-Nov-16, Robert Haas wrote:

> On Thu, Nov 16, 2023 at 5:21 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > I meant code like this
> >
> >   memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
> >   key.forknum = forknum;
> >   entry = blockreftable_lookup(brtab->hash, key);
>
> Ah, I hadn't thought about that. Another way of handling that might be
> to add = {0} to the declaration of key. But I can do the initializer
> thing too if you think it's better. I'm not sure if there's an
> argument that the initializer might optimize better.

I think the {0} initializer is good enough, given a comment to indicate why.

> > It's not clear to me whether WalSummarizerCtl->pending_lsn is fulfilling
> > some purpose or is just a leftover from prior development. I see it's
> > only read in an assertion ... Maybe if we think this cross-check is
> > important, it should be turned into an elog? Otherwise, I'd remove it.
>
> I've been thinking about that. One thing I'm not quite sure about
> though is introspection. Maybe there should be a function that shows
> summarized_tli and summarized_lsn from WalSummarizerData, and maybe it
> should expose pending_lsn too.

True.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
On 2023-Nov-16, Alvaro Herrera wrote:

> On 2023-Oct-04, Robert Haas wrote:
> > - Right now, I have a hard-coded 60 second timeout for WAL
> > summarization. If you try to take an incremental backup and the WAL
> > summaries you need don't show up within 60 seconds, the backup times
> > out. I think that's a reasonable default, but should it be
> > configurable? If yes, should that be a GUC or, perhaps better, a
> > pg_basebackup option?
>
> I'd rather have a way for the server to provide diagnostics on why the
> summaries aren't being produced. Maybe a server running under valgrind
> is going to fail and need a longer one, but otherwise a hardcoded
> timeout seems sufficient.
>
> You did say later that you thought summary files would just go from one
> checkpoint to the next. So the only question is at what point the file
> for the last checkpoint (i.e. from the previous one up to the one
> requested by pg_basebackup) is written. If walsummarizer keeps almost
> the complete state in memory and just waits for the checkpoint record to
> write it, then it's probably okay.

On 2023-Nov-16, Alvaro Herrera wrote:

> On 2023-Nov-16, Robert Haas wrote:
> > I've been thinking about that. One thing I'm not quite sure about
> > though is introspection. Maybe there should be a function that shows
> > summarized_tli and summarized_lsn from WalSummarizerData, and maybe it
> > should expose pending_lsn too.
>
> True.

Putting those two thoughts together, I think pg_basebackup with --progress could tell you "still waiting for the summary file up to LSN %X/%X to appear, and the walsummarizer is currently handling lsn %X/%X" or something like that. This would probably require two concurrent connections, one to run BASE_BACKUP and another to inquire about server state; but this should be easy enough to integrate together with parallel basebackup later.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
On Thu, Nov 16, 2023 at 12:34 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Putting those two thoughts together, I think pg_basebackup with > --progress could tell you "still waiting for the summary file up to LSN > %X/%X to appear, and the walsummarizer is currently handling lsn %X/%X" > or something like that. This would probably require two concurrent > connections, one to run BASE_BACKUP and another to inquire server state; > but this should easy enough to integrate together with parallel > basebackup later. I had similar thoughts, except I was thinking it would be better to have the warnings be generated on the server side. That would save the need for a second libpq connection, which would be good, because I think adding that would result in a pretty large increase in complexity and some not-so-great user-visible consequences. In fact, my latest thought is to just remove the timeout altogether, and emit warnings like this: WARNING: still waiting for WAL summarization to reach %X/%X after %d seconds, currently at %X/%X We could emit that every 30 seconds or so until either the situation resolves itself or the user hits ^C. I think that would be good enough here. If we want, the interval between messages can be a GUC, but I don't know how much real need there will be to tailor that. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Nov 16, 2023 at 12:23 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > So, wal_summary is no longer turned on by default, I think following a > comment from Peter E. I think this is a good decision, as we're only > going to need them on servers from which incremental backups are going > to be taken, which is a strict subset of all servers; and furthermore, > people that need them are going to realize that very easily, while if we > went the other way around most people would not realize that they need to > turn them off to save some resource consumption. > > Granted, the amount of resources additionally used is probably not very > big. But since it can be changed with a reload not restart, it doesn't > seem problematic. Yeah. I meant to say that I'd changed that for that reason, but in the flurry of new versions I omitted to do so. > ... oh, I just noticed that this patch now fails to compile because of > the MemoryContextResetAndDeleteChildren removal. Fixed. > (Typo in the pg_walsummary manpage: "since WAL summary files primary exist" -> "primarily") This, too. > I think it should default to off in primary and standby, and the user > has to enable it in whichever server they want to take backups from. Yeah, that's how it works currently. > I don't understand this point. Currently, the protocol is that > UPLOAD_MANIFEST is used to send the manifest prior to requesting the > backup. You seem to be saying that you're thinking of removing support > for UPLOAD_MANIFEST and instead just give the LSN as an option to the > BASE_BACKUP command? I don't think I'd want to do exactly that, because then you could only send one LSN, and I do think we want to send a set of LSN ranges with the corresponding TLI for each. I was thinking about dumping UPLOAD_MANIFEST and instead having a command like: INCREMENTAL_WAL_RANGE 1 2/462AC48 2/462C698 The client would execute this command one or more times before starting an incremental backup.
> I propose to keep the door open for that binary doing other things than > dumping the files as text. So add a command argument, which currently > can only be "dump", to allow the command to do other things later if > needed. (For example, remove files from a server on which summarize_wal > has been turned off; or perhaps remove files that are below some LSN.) I don't like that very much. That sounds like one of those forward-compatibility things that somebody designs and then nothing ever happens and ten years later you still have an ugly wart. My theory is that these files are going to need very little management. In general, they're small; if you never removed them, it probably wouldn't hurt, or at least, not for a long time. As to specific use cases, if you want to remove files from a server on which summarize_wal has been turned off, you can just use rm. Removing files from before a certain LSN would probably need a bit of scripting, but only a bit. Conceivably we could provide something like that in core, but it doesn't seem necessary, and it also seems to me that we might do well to include that in pg_archivecleanup rather than in pg_walsummary. Here's a new version. Changes: - Add preparatory renaming patches to the series. - Rename wal_summarize_keep_time to wal_summary_keep_time. - Change while (true) to while (1). - Typo fixes. - Fix incorrect assertion in summarizer_read_local_xlog_page; this could cause occasional regression test failures in 004_pg_xlog_symlink and 009_growing_files. - Zero-initialize BlockRefTableKey variables. - Replace a couple of instances of pathbuf + basepathlen + 1 with tarfilename. - Add const to path argument of GetFileBackupMethod. - Avoid setting output parameters of GetFileBackupMethod unless the return value is BACK_UP_FILE_INCREMENTALLY. - In GetFileBackupMethod, postpone qsorting block numbers slightly.
- Define INCREMENTAL_PREFIX_LENGTH using sizeof(), because that should hopefully work everywhere and the StaticAssertStmt that checks the value of this doesn't work on Windows. - Change MemoryContextResetAndDeleteChildren to MemoryContextReset. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
- v10-0003-Move-src-bin-pg_verifybackup-parse_manifest.c-in.patch
- v10-0006-Add-new-pg_walsummary-tool.patch
- v10-0004-Add-a-new-WAL-summarizer-process.patch
- v10-0007-Test-patch-Enable-summarize_wal-by-default.patch
- v10-0005-Add-support-for-incremental-backup.patch
- v10-0002-Rename-pg_verifybackup-s-JsonManifestParseContex.patch
- v10-0001-Rename-JsonManifestParseContext-callbacks.patch
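As a rough illustration of the "bit of scripting" that removing summary files from before a certain LSN would take, here is a hypothetical helper. Nothing here is part of the patch set; in particular, the fixed-width upper-case hex layout assumed for the $TLI${START_LSN}${END_LSN}.summary names (8 digits of TLI, 16 for each LSN) is an assumption for the sketch, not something verified against the patches.

```python
import os
import re

# Assumed layout: 8 hex digits of TLI, then 16 each for start/end LSN.
SUMMARY_RE = re.compile(r"^([0-9A-F]{8})([0-9A-F]{16})([0-9A-F]{16})\.summary$")

def parse_lsn(s):
    """Convert the usual %X/%X notation, e.g. '2/462C698', to an integer."""
    hi, lo = s.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def prune_summaries(summary_dir, cutoff, dry_run=True):
    """Return (and, unless dry_run, unlink) summary files lying wholly
    before the cutoff LSN, i.e. whose end LSN is <= cutoff."""
    cutoff_lsn = parse_lsn(cutoff)
    victims = []
    for name in sorted(os.listdir(summary_dir)):
        m = SUMMARY_RE.match(name)
        if not m:
            continue  # not a WAL summary file; leave it alone
        end_lsn = int(m.group(3), 16)
        if end_lsn <= cutoff_lsn:
            victims.append(name)
            if not dry_run:
                os.unlink(os.path.join(summary_dir, name))
    return victims
```

Running with dry_run=True first shows what would go; as suggested above, the same logic could just as well live in pg_archivecleanup.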
I made a pass over pg_combinebackup for NLS. I propose the attached patch. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Right now the sectors on the hard disk run clockwise, but I heard a rumor that you can squeeze 0.2% more throughput by running them counterclockwise. It's worth the effort. Recommended." (Gerry Pourwelle)
Attachment
On Fri, Nov 17, 2023 at 5:01 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > I made a pass over pg_combinebackup for NLS. I propose the attached > patch. This doesn't quite compile for me so I changed a few things and incorporated it. Hopefully I didn't mess anything up. Here's v11. In addition to incorporating Álvaro's NLS changes, with the off-list help of Jakub Wartak, I finally tracked down two one-line bugs in BlockRefTableEntryGetBlocks that have been causing the cfbot to blow up on these patches. What I hadn't realized is that cfbot runs with the relation segment size changed to 6 blocks, which tickled some code paths that I wasn't exercising locally. Thanks a ton to Jakub for the help running this down. cfbot was unhappy about a %lu so I've changed that to %zu in this version, too. Finally, the previous version of this patch set had some pgindent damage, so that is hopefully now cleaned up as well. I wish I had better ideas about how to thoroughly test this. I've got a bunch of different tests for pg_combinebackup and I think those are good, but the bugs mentioned in the previous paragraph show that those aren't sufficient to catch all of the logic errors that can exist, which is not great. But, as I say, I'm not quite sure how to do better, so I guess I'll just need to keep fixing problems as we find them. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
- v11-0003-Move-src-bin-pg_verifybackup-parse_manifest.c-in.patch
- v11-0002-Rename-pg_verifybackup-s-JsonManifestParseContex.patch
- v11-0001-Rename-JsonManifestParseContext-callbacks.patch
- v11-0005-Add-support-for-incremental-backup.patch
- v11-0004-Add-a-new-WAL-summarizer-process.patch
- v11-0006-Add-new-pg_walsummary-tool.patch
- v11-0007-Test-patch-Enable-summarize_wal-by-default.patch
On 2023-Nov-16, Robert Haas wrote: > On Thu, Nov 16, 2023 at 12:23 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > I don't understand this point. Currently, the protocol is that > > UPLOAD_MANIFEST is used to send the manifest prior to requesting the > > backup. You seem to be saying that you're thinking of removing support > > for UPLOAD_MANIFEST and instead just give the LSN as an option to the > > BASE_BACKUP command? > > I don't think I'd want to do exactly that, because then you could only > send one LSN, and I do think we want to send a set of LSN ranges with > the corresponding TLI for each. I was thinking about dumping > UPLOAD_MANIFEST and instead having a command like: > > INCREMENTAL_WAL_RANGE 1 2/462AC48 2/462C698 > > The client would execute this command one or more times before > starting an incremental backup. That sounds good to me. Not having to parse the manifest server-side sounds like a win, as does saving the transfer, for the cases where the manifest is large. Is this meant to support multiple timelines each with non-overlapping adjacent ranges, rather than multiple non-adjacent ranges? Do I have it right that you want to rewrite this bit before considering this ready to commit? -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "No nos atrevemos a muchas cosas porque son difíciles, pero son difíciles porque no nos atrevemos a hacerlas" (Séneca)
On Mon, Nov 20, 2023 at 2:03 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > That sounds good to me. Not having to parse the manifest server-side > sounds like a win, as does saving the transfer, for the cases where the > manifest is large. OK. I'll look into this next week, hopefully. > Is this meant to support multiple timelines each with non-overlapping > adjacent ranges, rather than multiple non-adjacent ranges? Correct. I don't see how non-adjacent LSN ranges could ever be a useful thing, but adjacent ranges on different timelines are useful. > Do I have it right that you want to rewrite this bit before considering > this ready to commit? For sure. I don't think this is the only thing that needs to be revised before commit, but it's definitely *a* thing that needs to be revised before commit. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Nov 20, 2023 at 2:10 PM Robert Haas <robertmhaas@gmail.com> wrote: > > Is this meant to support multiple timelines each with non-overlapping > > adjacent ranges, rather than multiple non-adjacent ranges? > > Correct. I don't see how non-adjacent LSN ranges could ever be a > useful thing, but adjacent ranges on different timelines are useful. Thinking about this a bit more, there are a couple of things we could do here in terms of syntax. One idea is to give up on having a separate MANIFEST-WAL-RANGE command for each range and instead just cram everything into either a single command: MANIFEST-WAL-RANGES {tli} {startlsn} {endlsn}... Or even into a single option to the BASE_BACKUP command: BASE_BACKUP yadda yadda INCREMENTAL 'tli@startlsn-endlsn,...' Or, since we expect adjacent, non-overlapping ranges, you could even arrange to elide the duplicated boundary LSNs, e.g. MANIFEST-WAL-RANGES {{tli} {lsn}}... {final-lsn} Or BASE_BACKUP yadda yadda INCREMENTAL 'tli@lsn,...,final-lsn' I'm not sure what's best here. Trying to trim out the duplicated boundary LSNs feels a bit like rearrangement for the sake of rearrangement, but maybe it isn't really. -- Robert Haas EDB: http://www.enterprisedb.com
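For concreteness, the 'tli@startlsn-endlsn,...' form could be parsed and validated along these lines. This is purely illustrative — none of these syntaxes was settled on, and the function names are invented for the sketch:

```python
def parse_lsn(s):
    """Parse the usual %X/%X LSN notation into a 64-bit integer."""
    hi, lo = s.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def parse_incremental_option(spec):
    """Parse 'tli@startlsn-endlsn,...' into (tli, start, end) tuples,
    enforcing the adjacent, non-overlapping property discussed above:
    each range must start exactly where the previous one ended."""
    ranges = []
    for item in spec.split(","):
        tli_part, lsns = item.split("@")
        start_s, end_s = lsns.split("-")
        tli, start, end = int(tli_part), parse_lsn(start_s), parse_lsn(end_s)
        if start >= end:
            raise ValueError("empty or backwards range: " + item)
        if ranges and ranges[-1][2] != start:
            raise ValueError("ranges must be adjacent: " + item)
        ranges.append((tli, start, end))
    return ranges
```

The adjacency check is what makes the elided-boundary variants equivalent: since every interior LSN appears twice, it can be written once without loss.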
On Mon, Nov 20, 2023 at 4:43 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Nov 17, 2023 at 5:01 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > I made a pass over pg_combinebackup for NLS. I propose the attached > > patch. > > This doesn't quite compile for me so I changed a few things and > incorporated it. Hopefully I didn't mess anything up. > > Here's v11. [..] > I wish I had better ideas about how to thoroughly test this. [..] Hopefully the below add some confidence, I've done some further quick(?) checks today and results are good: make check-world #GOOD test_full_pri__incr_stby__restore_on_pri.sh #GOOD test_full_pri__incr_stby__restore_on_stby.sh #GOOD* test_full_stby__incr_stby__restore_on_pri.sh #GOOD test_full_stby__incr_stby__restore_on_stby.sh #GOOD* test_many_incrementals_dbcreate.sh #GOOD test_many_incrementals.sh #GOOD test_multixact.sh #GOOD test_pending_2pc.sh #GOOD test_reindex_and_vacuum_full.sh #GOOD test_truncaterollback.sh #GOOD test_unlogged_table.sh #GOOD test_across_wallevelminimal.sh # GOOD(expected failure, that walsummaries are off during walminimal and incremental cannot be taken--> full needs to be taken after wal_level=minimal) CFbot failed on two hosts this time, I haven't looked at the details yet (https://cirrus-ci.com/task/6425149646307328 -> end of EOL? -> LOG: WAL summarizer process (PID 71511) was terminated by signal 6: Aborted?) The remaining test idea is to have a longer running DB under stress test in more real-world conditions and try to recover using chained incremental backups (one such test was carried out on patchset v6 and the result was good back then). -J.
On Wed, Nov 22, 2023 at 3:14 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > CFbot failed on two hosts this time, I haven't looked at the details > yet (https://cirrus-ci.com/task/6425149646307328 -> end of EOL? -> > LOG: WAL summarizer process (PID 71511) was terminated by signal 6: > Aborted?) Robert pinged me to see if I had any ideas. The reason it fails on Windows is because there is a special code path for WIN32 in the patch's src/bin/pg_combinebackup/copy_file.c, but it is incomplete: it returns early without feeding the data into the checksum, so all the checksums have the same initial and bogus value. I commented that part out so it took the normal path like Unix, and it passed. The reason it fails on Linux 32 bit with -fsanitize is because this has uncovered a bug in xlogreader.c, which overflows a 32 bit pointer when doing a size test that could easily be changed to a non-overflowing formulation. AFAICS it is not a live bug because it comes to the right conclusion without dereferencing the pointer due to other checks, but the sanitizer is not wrong to complain about it and I will post a patch to fix that in a new thread. With the draft patch I am testing, the sanitizer is happy and this passes too.
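The non-overflowing reformulation mentioned here is the standard idiom of comparing the requested size against the space remaining, rather than computing ptr + size. A small sketch simulating 32-bit pointer arithmetic in Python with masking; this is illustrative only, not the actual xlogreader code:

```python
MASK32 = 0xFFFFFFFF  # simulate 32-bit pointer arithmetic

def fits_overflowing(ptr, size, end):
    """Problematic form: ptr + size can wrap past 2^32 and then
    falsely compare as in-bounds."""
    return ((ptr + size) & MASK32) <= end

def fits_safe(ptr, size, end):
    """Safe form: compare the request against the space remaining;
    there is no addition that can overflow."""
    return ptr <= end and size <= end - ptr
```

With ptr = 0xFFFFF000 and end = 0xFFFFF800 (2 KB left), an 8 KB request wraps to 0x1000 in the first form and wrongly appears to fit, while the second form rejects it.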
On Thu, Nov 23, 2023 at 11:18 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Robert pinged me to see if I had any ideas. Thanks, Thomas. > The reason it fails on Windows is because there is a special code path > for WIN32 in the patch's src/bin/pg_combinebackup/copy_file.c, but it > is incomplete: it returns early without feeding the data into the > checksum, so all the checksums have the same initial and bogus value. > I commented that part out so it took the normal path like Unix, and it > passed. Yikes, that's embarrassing. Thanks for running it down. There is logic in the caller to figure out whether we need to recompute the checksum or can reuse one we already have, but copy_file() doesn't understand that it should take the slow path if a new checksum computation is required. > The reason it fails on Linux 32 bit with -fsanitize is because this > has uncovered a bug in xlogreader.c, which overflows a 32 bit pointer > when doing a size test that could easily be changed to a non-overflowing > formulation. AFAICS it is not a live bug because it comes to the > right conclusion without dereferencing the pointer due to other checks, > but the sanitizer is not wrong to complain about it and I will post a > patch to fix that in a new thread. With the draft patch I am testing, > the sanitizer is happy and this passes too. Thanks so much. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Nov 15, 2023 at 9:14 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > so I've spent some time playing still with patchset v8 (without the > 6/6 testing patch related to wal_level=minimal), with the exception of > - patchset v9 - marked otherwise. Thanks, as usual, for that. > 2. Usability thing: I hit the timeout hard: "This backup requires WAL > to be summarized up to 0/90000D8, but summarizer has only reached > 0/0." with summarize_wal=off (default) but apparently this is in TODO. > Looks like an important usability thing. All right. I'd sort of forgotten about the need to address that issue, but apparently, I need to re-remember. > 5. On v8 i've finally played a little bit with standby(s) and this > patchset with couple of basic scenarios while mixing source of the > backups: > > a. full on standby, incr1 on standby, full db restore (incl. incr1) on standby > # sometimes i'm getting spurious error like those when doing > incrementals on standby with -c fast : > 2023-11-15 13:49:05.721 CET [10573] LOG: recovery restart point > at 0/A000028 > 2023-11-15 13:49:07.591 CET [10597] WARNING: aborting backup due > to backend exiting before pg_backup_stop was called > 2023-11-15 13:49:07.591 CET [10597] ERROR: manifest requires WAL > from final timeline 1 ending at 0/A0000F8, but this backup starts at > 0/A000028 > 2023-11-15 13:49:07.591 CET [10597] STATEMENT: BASE_BACKUP ( > INCREMENTAL, LABEL 'pg_basebackup base backup', PROGRESS, > CHECKPOINT 'fast', WAIT 0, MANIFEST 'yes', TARGET 'client') > # when you retry the same pg_basebackup it goes fine (looks like > CHECKPOINT on standby/restartpoint <-> summarizer disconnect, I'll dig > deeper tomorrow. It seems that issuing "CHECKPOINT; pg_sleep(1);" > against primary just before pg_basebackup --incr on standby > works around it) > > b. full on primary, incr1 on standby, full db restore (incl. incr1) on > standby # WORKS > c. full on standby, incr1 on standby, full db restore (incl.
incr1) on > primary # WORKS* > d. full on primary, incr1 on standby, full db restore (incl. incr1) on > primary # WORKS* > > * - needs pg_promote() due to the controlfile having standby bit + > potential fiddling with postgresql.auto.conf as it is having > primary_conninfo GUC. Well, "manifest requires WAL from final timeline 1 ending at 0/A0000F8, but this backup starts at 0/A000028" is a valid complaint, not a spurious error. It's essentially saying that WAL replay for this incremental backup would have to begin at a location that is earlier than where replay for the earlier backup would have to end while recovering that backup. It's almost like you're trying to go backwards in time, with the incremental happening before the full backup instead of after it. I think the reason this is happening is that when you take a backup, recovery has to start from the previous checkpoint. On the primary, we perform a new checkpoint and plan to start recovery from it. But on a standby, we can't perform a new checkpoint, since we can't write WAL, so we arrange for recovery of the backup to begin from the most recent checkpoint. And if you do two backups on the standby in a row without much happening in the middle, then the most recent checkpoint will be the same for both. And that I think is what's resulting in this error, because the end of the backup follows the start of the backup, so if two consecutive backups have the same start, then the start of the second one will precede the end of the first one. One thing that's interesting to note here is that there is no point in performing an incremental backup under these circumstances. You would accrue no advantage over just letting replay continue further from the full backup.
The whole point of an incremental backup is that it lets you "fast forward" your older backup -- you could have just replayed all the WAL from the older backup until you got to the start LSN of the newer backup, but reconstructing a backup that can start replay from the newer LSN directly is, hopefully, quicker than replaying all of that WAL. But in this scenario, you're starting from the same checkpoint no matter what -- the amount of WAL replay required to reach any given LSN will be unchanged. So storing an incremental backup would be strictly a loss. Another interesting point to consider is that you could also get this complaint by doing something like take the full backup from the primary, and then try to take an incremental backup from a standby, maybe even a time-delayed standby that's far behind the primary. In that case, you would really be trying to take an incremental backup before you actually took the full backup, as far as LSN time goes. I'm not quite sure what to do about any of this. I think the error is correct and accurate, but understanding what it means and why it's happening and what to do about it is probably going to be difficult for people. Possibly we should have documentation that talks you through all of this. Or possibly there are ways to elaborate on the error message itself. But I'm a little skeptical about the latter approach because it's all so complicated. I don't know that we can summarize it in a sentence or two. > 6. Sci-fi-mode-on: I was wondering about the dangers of e.g. having > more recent pg_basebackup (e.g. from pg18 one day) running against > pg17 in the scope of having this incremental backups possibility. Is > it going to be safe? (currently there seems to be no safeguards > against such use) or should those things (core, pg_basebackup) should > be running in version lock step? I think it should be safe, actually. pg_basebackup has no reason to care about WAL format changes across versions. 
It doesn't even care about the format of the WAL summaries, which it never sees, but only needs the server to have. If we change the format of the incremental files that are included in the backup, then we will need backward-compatibility code, or we can disallow cross-version operations. I don't currently foresee a need to do that, but you never know. It's manageable in any event. But note that I also didn't (and can't, without a lot of ugliness) make pg_combinebackup version-independent. So you could think of taking incremental backups with a different version of pg_basebackup, but if you want to restore you're going to need a matching version of pg_combinebackup. -- Robert Haas EDB: http://www.enterprisedb.com
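The invariant behind the "manifest requires WAL from final timeline ... but this backup starts at ..." error discussed above can be restated as a toy check. This is a hedged sketch — the names are invented and the real server-side logic lives elsewhere:

```python
def fmt_lsn(lsn):
    """Render a 64-bit LSN in the usual %X/%X notation."""
    return "%X/%X" % (lsn >> 32, lsn & 0xFFFFFFFF)

def check_backup_ordering(final_tli, prior_end_lsn, incr_start_lsn):
    """An incremental backup's start LSN (the redo point of the checkpoint
    it will replay from) must not precede the LSN where the prior backup's
    WAL requirement ends on the shared final timeline; otherwise replay
    would have to run backwards in time."""
    if incr_start_lsn < prior_end_lsn:
        raise ValueError(
            "manifest requires WAL from final timeline %d ending at %s, "
            "but this backup starts at %s"
            % (final_tli, fmt_lsn(prior_end_lsn), fmt_lsn(incr_start_lsn)))
```

In the standby scenario described earlier, two consecutive backups taken from the same restartpoint have identical starts, so the second one's start precedes the first one's end and this check fires.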
New patch set. 0001: Rename JsonManifestParseContext callbacks, per feedback from Álvaro. Not logically related to the rest of this, except by code proximity. Separately committable, if nobody objects. 0002: Rename record_manifest_details_for_{file,wal_range}, per feedback from Álvaro that the names were too generic. Separately committable, if nobody objects. 0003: Move parse_manifest.c to src/common. No significant changes since the previous version. 0004: Add a new WAL summarizer process. No significant changes since the previous version. 0005: Incremental backup itself. Changes: - Remove UPLOAD_MANIFEST replication command and instead add INCREMENTAL_WAL_RANGE replication command. - In consequence, load_manifest.c which was included in the previous patch sets now moves to src/fe_utils and has some adjustments. - Actually document the new replication command which I overlooked previously. - Error out promptly if an incremental backup is attempted with summarize_wal = off. - Fix test in copy_file(). We should be willing to use the fast-path if a new checksum is *not* required, but the sense of the test was inverted in previous versions. - Fix some handling of the missing-manifest case in pg_combinebackup. - Fix terminology in a help message. 0006: Add pg_walsummary tool. No significant changes since the previous version. 0007: Test patch, not for commit. As far as I know, the main commit-blockers here are (1) the timeout when waiting for WAL summarization is still hard-coded to 60 seconds and (2) the ubsan issue that Thomas hunted down, which would cause at least the entire CF environment and maybe some portion of the BF to turn red if this were committed. That issue is in xlogreader rather than in this patch set, at least in part, but it still needs fixing before this goes ahead. I also suspect that the slightly more significant refactoring in this version may turn up a few new bugs in the CF environment.
I think once the aforementioned items are sorted out, this could be committed through 0005, and 0001 and 0002 could be committed sooner. 0006 should have some tests written before it gets committed, but it doesn't necessarily have to be committed at the exact same moment as everything else, and writing tests isn't that hard, either. Other loose ends that would be nice to tidy up at some point: - Incremental JSON parsing so we can handle huge manifests. - More documentation as proposed by Álvaro but I'm failing to find the details of his proposal right now. - More introspection facilities, maybe, or possibly rip some stuff out of WalSummarizerCtl if we don't want it. This one might be a higher priority to address before initial commit, but it's probably not absolutely critical, either. I'm not quite sure how aggressively to press forward with getting stuff committed. I'd certainly rather debug as much as I can locally and via cfbot before turning the buildfarm pretty colors, but I think it generally works out better when larger features get pushed earlier in the cycle rather than in the mad frenzy right before feature freeze, so I'm not inclined to be too patient, either. ...Robert
Attachment
- v12-0002-Rename-pg_verifybackup-s-JsonManifestParseContex.patch
- v12-0001-Rename-JsonManifestParseContext-callbacks.patch
- v12-0003-Move-src-bin-pg_verifybackup-parse_manifest.c-in.patch
- v12-0004-Add-a-new-WAL-summarizer-process.patch
- v12-0006-Add-new-pg_walsummary-tool.patch
- v12-0005-Add-support-for-incremental-backup.patch
- v12-0007-Test-patch-Enable-summarize_wal-by-default.patch
On Thu, Nov 30, 2023 at 9:33 AM Robert Haas <robertmhaas@gmail.com> wrote: > 0005: Incremental backup itself. Changes: > - Remove UPLOAD_MANIFEST replication command and instead add > INCREMENTAL_WAL_RANGE replication command. Unfortunately, I think this change is going to need to be reverted. Jakub reported a problem to me off-list, which I think boils down to this: take a full backup on the primary. create a database on the primary. now take an incremental backup on the standby using the full backup from the primary as the prior backup. What happens at this point depends on how far replay has progressed on the standby. I think there are three scenarios: (1) If replay has not yet reached a checkpoint later than the one at which the full backup began, then taking the incremental backup will fail. This is correct, because it makes no sense to take an incremental backup that goes backwards in time, and it's pointless to take one that goes forwards but not far enough to reach the next checkpoint, as you won't save anything. (2) If replay has progressed far enough that the redo pointer is now beyond the CREATE DATABASE record, then everything is fine. (3) But if the redo pointer for the backup is a later checkpoint than the one from which the full backup started, but also before the CREATE DATABASE record, then the new database's files exist on disk, but are not mentioned in the WAL summary, which covers all LSNs from the start of the prior backup to the start of this one. Here, the start of the backup is basically the LSN from which replay will start, and since the database was created after that, those changes aren't in the WAL summary. This means that we think the file is unchanged since the prior backup, and so back up no blocks at all. But now we have an incremental file for a relation for which no full file is present in the prior backup, and we're in big trouble. If my analysis is correct, this bug should be new in v12.
In v11 and prior, I think that we always included every file that didn't appear in the prior manifest in full. I didn't really quite know why I was doing that, which is why I was willing to rip it out and thus remove the need for the manifest, but now I think it was actually preventing exactly this problem. This issue, in general, is files that get created after the start of the backup. By that time, the WAL summary that drives the backup has already been built, so it doesn't know anything about the new files. That would be fine if we either (A) omitted those new files from the backup completely, since replay would recreate them or (B) backed them up in full, so that there was nothing relying on them being there in the earlier backup. But an incremental backup of such a file is no good. Then I started worrying about whether there were problems in cases where a file was dropped and recreated with the same name. I *think* it's OK. If file F is dropped and recreated after being copied into the full backup but before being copied into the incremental backup, then there are basically two cases. First, F might be dropped before the start LSN of the incremental backup; if so, we'll know from the WAL summary that the limit block is 0 and back up the whole thing. Second, F might be dropped after the start LSN of the incremental backup and before it's actually copied. In that case, we'll not know when backing up the file that it was dropped and recreated, so we'll back it up incrementally as if that hadn't happened. That's OK as long as reconstruction doesn't fail, because WAL replay will again drop and recreate F. And I think reconstruction won't fail: blocks that are in the incremental file will be taken from there, blocks in the prior backup file will be taken from there, and blocks in neither place will be zero-filled. The result is logically incoherent, but replay will nuke the file anyway, so whatever.
It bugs me a bit that we don't obey the WAL-before-data rule with file creation, e.g. RelationCreateStorage does smgrcreate() and then log_smgrcreate(). So in theory we could see a file on disk for which nothing has been logged yet; it could even happen that the file gets created before the start LSN of the backup and the log record gets written afterward. It seems like it would be far more comfortable to swap the order there, so that if it's on disk, it's definitely in the WAL. But I haven't yet been able to think of a scenario in which the current ordering causes a real problem. If we back up a stray file in full (or, hypothetically, if we skipped it entirely) then nothing will happen that can't already happen today with full backup; any problems we end up having are, I think, not new problems. It's only when we back up a file incrementally that we need to be careful, and the analysis is basically the same as before ... whatever we put into an incremental file will cause *something* to get reconstructed except when there's no prior file at all. Having the manifest for the prior backup lets us avoid the incremental-with-no-prior-file scenario. And as long as *something* gets reconstructed, I think WAL replay will fix up the rest. Considering all this, what I'm inclined to do is go and put UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust accordingly. But first: does anybody see more problems here that I may have missed? -- Robert Haas EDB: http://www.enterprisedb.com
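The reconstruction rule described above — prefer the block from the incremental file, fall back to the prior backup's copy, and zero-fill anything found in neither — can be sketched as follows. This is toy code, not pg_combinebackup itself; the helper name and signature are invented:

```python
def reconstruct(incr_blocks, prior_file, total_blocks, block_size=8192):
    """Rebuild one relation file from an incremental backup.

    incr_blocks maps block number -> block bytes (the subset of blocks
    stored in the incremental file); prior_file is the same file as found
    in the prior backup; total_blocks is the reconstructed length in
    blocks."""
    out = bytearray()
    for blkno in range(total_blocks):
        if blkno in incr_blocks:
            out += incr_blocks[blkno]                 # newest copy wins
        elif (blkno + 1) * block_size <= len(prior_file):
            off = blkno * block_size
            out += prior_file[off:off + block_size]   # unchanged block
        else:
            out += bytes(block_size)                  # in neither: zero-fill
    return bytes(out)
```

As the message notes, a logically incoherent result from this procedure is tolerable for a dropped-and-recreated file, because WAL replay drops and recreates the file anyway.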
On Mon, Dec 4, 2023 at 3:58 PM Robert Haas <robertmhaas@gmail.com> wrote: > Considering all this, what I'm inclined to do is go and put > UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust > accordingly. But first: does anybody see more problems here that I may > have missed? OK, so here's a new version with UPLOAD_MANIFEST put back. I wrote a long comment explaining why that's believed to be necessary and sufficient. I committed 0001 and 0002 from the previous series also, since it doesn't seem like anyone has further comments on those renamings. This version also improves (at least, IMHO) the way that we wait for WAL summarization to finish. Previously, you either caught up fully within 60 seconds or you died. I didn't like that, because it seemed like some people would get timeouts when the operation was slowly progressing and would eventually succeed. So what this version does is: - Every 10 seconds, it logs a warning saying that it's still waiting for WAL summarization. That way, a human operator can understand what's happening easily, and cancel if they want. - If 60 seconds go by without the WAL summarizer ingesting even a single WAL record, it times out. That way, if the WAL summarizer is dead or totally stuck (e.g. debugger attached, hung I/O) the user won't be left waiting forever even if they never cancel. But if it's just slow, it probably won't time out, and the operation should eventually succeed. To me, this seems like a reasonable compromise. It might be unreasonable if WAL summarization is proceeding at a very low but non-zero rate. But it's hard for me to think of a situation where that will happen, with the exception of when CPU or I/O are badly overloaded. But in those cases, the WAL generation rate is probably also not that high, because apparently the system is paralyzed, so maybe the wait won't even be that bad, especially given that everything else on the box should be super-slow too. 
Plus, even if we did want to time out in such a case, it's hard to know how slow is too slow. In any event, I think most failures here are likely to be complete failures, where the WAL summarizer just doesn't make progress at all, so the fact that this times out in those cases seems to me to likely be as much as we need to do here. But if someone sees a problem with this or has a clever idea how to make it better, I'm all ears.

-- Robert Haas EDB: http://www.enterprisedb.com
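[Editor's note: the waiting policy described in the two messages above can be sketched in a few lines. This is an illustrative simulation only, not the patch's actual C code; the function names, polling interval, and injectable clock are my own. The key property is that a slow-but-moving summarizer never times out; only a completely stalled one does.]

```python
import time

WARN_INTERVAL = 10   # seconds between "still waiting" warnings
STALL_TIMEOUT = 60   # give up only if *no* WAL is ingested for this long

def wait_for_summarization(target_lsn, get_summarized_lsn,
                           now=time.monotonic, sleep=time.sleep, warn=print):
    """Wait until the summarizer reaches target_lsn.

    Warn every WARN_INTERVAL seconds while waiting; raise only if the
    summarized LSN has not advanced at all for STALL_TIMEOUT seconds.
    """
    last_lsn = get_summarized_lsn()
    last_progress = now()
    next_warn = last_progress + WARN_INTERVAL
    while True:
        current = get_summarized_lsn()
        if current >= target_lsn:
            return current
        if current > last_lsn:
            # Progress was made, however slowly: reset the stall clock.
            last_lsn = current
            last_progress = now()
        elif now() - last_progress >= STALL_TIMEOUT:
            raise TimeoutError("WAL summarization is not progressing")
        if now() >= next_warn:
            warn(f"still waiting for WAL summarization through {target_lsn}")
            next_warn += WARN_INTERVAL
        sleep(0.1)
```

With an injected fake clock one can check both behaviors: a summarizer advancing one unit per second reaches the target despite taking longer than 60 seconds of simulated time, while a frozen summarizer raises after 60 seconds.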
Attachment
On Tue, Dec 5, 2023 at 7:11 PM Robert Haas <robertmhaas@gmail.com> wrote: [..v13 patchset]

The results with the v13 patchset are the following (* - requires a checkpoint on the primary when taking an incremental backup on a standby that is too idle; this was explained by Robert in [1], a too-fast-incremental-backup artifact of the testing scenario):

test_across_wallevelminimal.sh - GOOD
test_many_incrementals_dbcreate.sh - GOOD
test_many_incrementals.sh - GOOD
test_multixact.sh - GOOD
test_reindex_and_vacuum_full.sh - GOOD
test_standby_incr_just_backup.sh - GOOD*
test_truncaterollback.sh - GOOD
test_unlogged_table.sh - GOOD
test_full_pri__incr_stby__restore_on_pri.sh - GOOD
test_full_pri__incr_stby__restore_on_stby.sh - GOOD
test_full_stby__incr_stby__restore_on_pri.sh - GOOD*
test_full_stby__incr_stby__restore_on_stby.sh - GOOD*
test_incr_on_standby_after_promote.sh - GOOD*

test_incr_after_timelineincrease.sh (pg_ctl stop, pg_resetwal -l 00000002000000000000000E ..., pg_ctl start, pg_basebackup --incremental) - GOOD, I've got:

pg_basebackup: error: could not initiate base backup: ERROR: timeline 1 found in manifest, but not in this server's history

Comment: I was wondering if it wouldn't make some sense to teach pg_resetwal to actually delete all WAL summaries after any WAL/controlfile alteration?

test_stuck_walsummary.sh (pkill -STOP walsumm) - GOOD:

> This version also improves (at least, IMHO) the way that we wait for
> WAL summarization to finish. Previously, you either caught up fully
> within 60 seconds or you died. I didn't like that, because it seemed
> like some people would get timeouts when the operation was slowly
> progressing and would eventually succeed. So what this version does
> is:

WARNING: still waiting for WAL summarization through 0/A0000D8 after 10 seconds
DETAIL: Summarization has reached 0/8000028 on disk and 0/80000F8 in memory.
[..]
pg_basebackup: error: could not initiate base backup: ERROR: WAL summarization is not progressing
DETAIL: Summarization is needed through 0/A0000D8, but is stuck at 0/8000028 on disk and 0/80000F8 in memory.

Comment2: looks good to me!

test_pending_2pc.sh - getting GOOD on most recent runs, but several times during early testing (probably due to my own mishaps), I've been hit by Abort/TRAP. I'm still investigating and trying to reproduce those ones.

TRAP: failed Assert("summary_end_lsn >= WalSummarizerCtl->pending_lsn"), File: "walsummarizer.c", Line: 940

Regards, -J.

[1] - https://www.postgresql.org/message-id/CA%2BTgmoYuC27_ToGtTTNyHgpn_eJmdqrmhJ93bAbinkBtXsWHaA%40mail.gmail.com
On Thu, Dec 7, 2023 at 9:42 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote:
> Comment: I was wondering if it wouldn't make some sense to teach
> pg_resetwal to actually delete all WAL summaries after any
> WAL/controlfile alteration?

I thought that this was a good idea so I decided to go implement it, only to discover that it was already part of the patch set ... did you find some case where it doesn't work as expected? The code looks like this:

    RewriteControlFile();
    KillExistingXLOG();
    KillExistingArchiveStatus();
    KillExistingWALSummaries();
    WriteEmptyXLOG();

> test_pending_2pc.sh - getting GOOD on most recent runs, but several
> times during early testing (probably due to my own mishaps), I've been
> hit by Abort/TRAP. I'm still investigating and trying to reproduce
> those ones. TRAP: failed Assert("summary_end_lsn >=
> WalSummarizerCtl->pending_lsn"), File: "walsummarizer.c", Line: 940

I have a fix for this locally, but I'm going to hold off on publishing a new version until either there's a few more things I can address all at once, or until Thomas commits the ubsan fix.

-- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Dec 7, 2023 at 4:15 PM Robert Haas <robertmhaas@gmail.com> wrote:

Hi Robert,

> On Thu, Dec 7, 2023 at 9:42 AM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> > Comment: I was wondering if it wouldn't make some sense to teach
> > pg_resetwal to actually delete all WAL summaries after any
> > WAL/controlfile alteration?
>
> I thought that this was a good idea so I decided to go implement it,
> only to discover that it was already part of the patch set ... did you
> find some case where it doesn't work as expected? The code looks like
> this:

Ah, my bad, with a fresh mind and coffee the error message makes it clear, and of course it did reset the summaries properly.

While we are at it, maybe around the below in PrepareForIncrementalBackup()

    if (tlep[i] == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("timeline %u found in manifest, but not in this server's history",
                        range->tli)));

we could add

    errhint("You might need to start a new full backup instead of an incremental one")

?

> > test_pending_2pc.sh - getting GOOD on most recent runs, but several
> > times during early testing (probably due to my own mishaps), I've been
> > hit by Abort/TRAP. I'm still investigating and trying to reproduce
> > those ones. TRAP: failed Assert("summary_end_lsn >=
> > WalSummarizerCtl->pending_lsn"), File: "walsummarizer.c", Line: 940
>
> I have a fix for this locally, but I'm going to hold off on publishing
> a new version until either there's a few more things I can address all
> at once, or until Thomas commits the ubsan fix.

Great, I cannot get it to fail again today; it had to be some dirty state of the testing env. BTW: Thomas has pushed that ubsan fix.

-J.
On Tue, Dec 5, 2023 at 11:40 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> > On Mon, Dec 4, 2023 at 3:58 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Considering all this, what I'm inclined to do is go and put
> > UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust
> > accordingly. But first: does anybody see more problems here that I may
> > have missed?
>
> OK, so here's a new version with UPLOAD_MANIFEST put back. I wrote a
> long comment explaining why that's believed to be necessary and
> sufficient. I committed 0001 and 0002 from the previous series also,
> since it doesn't seem like anyone has further comments on those
> renamings.

I have done some testing on standby, but I am facing some issues, although things are working fine on the primary. As shown in test [1] below, the standby reports an error that the manifest requires WAL from 0/60000F8, but this backup starts at 0/6000028. Then I tried to look into the manifest file of the full backup, and it shows the contents below [0]. Actually, from this WARNING and ERROR, I am not clear what the problem is. I understand that the full backup ends at "0/60000F8", so for the next incremental backup we should be looking for a summary that has WAL starting at "0/60000F8", and we do have those WALs. In fact, the error message is saying "this backup starts at 0/6000028", which is before "0/60000F8", so what's the issue?
[0]
"WAL-Ranges": [
  { "Timeline": 1, "Start-LSN": "0/6000028", "End-LSN": "0/60000F8" }

[1]
-- test on primary
dilipkumar@dkmac bin % ./pg_basebackup -D d
dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest

-- cleanup the backup directory
dilipkumar@dkmac bin % rm -rf d
dilipkumar@dkmac bin % rm -rf d1

-- test on standby
dilipkumar@dkmac bin % ./pg_basebackup -D d -p 5433
dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest -p 5433

WARNING: aborting backup due to backend exiting before pg_backup_stop was called
pg_basebackup: error: could not initiate base backup: ERROR: manifest requires WAL from final timeline 1 ending at 0/60000F8, but this backup starts at 0/6000028
pg_basebackup: removing data directory "d1"

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, Dec 11, 2023 at 11:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Dec 5, 2023 at 11:40 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Mon, Dec 4, 2023 at 3:58 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > Considering all this, what I'm inclined to do is go and put > > > UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust > > > accordingly. But first: does anybody see more problems here that I may > > > have missed? > > > > OK, so here's a new version with UPLOAD_MANIFEST put back. I wrote a > > long comment explaining why that's believed to be necessary and > > sufficient. I committed 0001 and 0002 from the previous series also, > > since it doesn't seem like anyone has further comments on those > > renamings. > > I have done some testing on standby, but I am facing some issues, > although things are working fine on the primary. As shown below test > [1]standby is reporting some errors that manifest require WAL from > 0/60000F8, but this backup starts at 0/6000028. Then I tried to look > into the manifest file of the full backup and it shows contents as > below[0]. Actually from this WARNING and ERROR, I am not clear what > is the problem, I understand that full backup ends at "0/60000F8" so > for the next incremental backup we should be looking for a summary > that has WAL starting at "0/60000F8" and we do have those WALs. In > fact, the error message is saying "this backup starts at 0/6000028" > which is before "0/60000F8" so whats the issue? 
>
> [0]
> "WAL-Ranges": [
>   { "Timeline": 1, "Start-LSN": "0/6000028", "End-LSN": "0/60000F8" }
>
> [1]
> -- test on primary
> dilipkumar@dkmac bin % ./pg_basebackup -D d
> dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest
>
> -- cleanup the backup directory
> dilipkumar@dkmac bin % rm -rf d
> dilipkumar@dkmac bin % rm -rf d1
>
> -- test on standby
> dilipkumar@dkmac bin % ./pg_basebackup -D d -p 5433
> dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest -p 5433
>
> WARNING: aborting backup due to backend exiting before pg_backup_stop
> was called
> pg_basebackup: error: could not initiate base backup: ERROR: manifest
> requires WAL from final timeline 1 ending at 0/60000F8, but this
> backup starts at 0/6000028
> pg_basebackup: removing data directory "d1"

Jakub pinged me offlist and pointed me to the thread [1] where this is already explained, so I think we can ignore it.

[1] https://www.postgresql.org/message-id/CA%2BTgmoYuC27_ToGtTTNyHgpn_eJmdqrmhJ93bAbinkBtXsWHaA%40mail.gmail.com

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
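[Editor's note: for the record, one reading of the check that fires here, sketched rather than taken from the server's actual PrepareForIncrementalBackup() code: the incremental backup must start at or after the LSN where the prior backup's manifest says its WAL coverage ended. On an idle standby no new checkpoint has happened, so the new backup reuses the old redo point and starts *before* the manifest's end LSN. Function names below are the editor's own.]

```python
def parse_lsn(text):
    """Parse an LSN like '0/60000F8' into a 64-bit integer
    (high 32 bits before the slash, low 32 bits after)."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def check_incremental_start(manifest_end_lsn, backup_start_lsn):
    """Sketch of the sanity check: an incremental backup must begin at
    or after the point where the prior backup's WAL coverage ended,
    otherwise there is a gap the incremental cannot account for."""
    if parse_lsn(backup_start_lsn) < parse_lsn(manifest_end_lsn):
        raise ValueError(
            f"manifest requires WAL ending at {manifest_end_lsn}, "
            f"but this backup starts at {backup_start_lsn}")
```

With the values from the manifest above, check_incremental_start("0/60000F8", "0/6000028") fails just as the server did, while a backup starting at or after 0/60000F8 passes.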
On Fri, Dec 8, 2023 at 5:02 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > While we are at it, maybe around the below in PrepareForIncrementalBackup() > > if (tlep[i] == NULL) > ereport(ERROR, > > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > errmsg("timeline %u found in > manifest, but not in this server's history", > range->tli))); > > we could add > > errhint("You might need to start a new full backup instead of > incremental one") > > ? I can't exactly say that such a hint would be inaccurate, but I think the impulse to add it here is misguided. One of my design goals for this system is to make it so that you never have to take a new incremental backup "just because," not even in case of an intervening timeline switch. So, all of the errors in this function are warning you that you've done something that you really should not have done. In this particular case, you've either (1) manually removed the timeline history file, and not just any timeline history file but the one for a timeline for a backup that you still intend to use as the basis for taking an incremental backup or (2) tried to use a full backup taken from one server as the basis for an incremental backup on a completely different server that happens to share the same system identifier, e.g. because you promoted two standbys derived from the same original primary and then tried to use a full backup taken on one as the basis for an incremental backup taken on the other. The scenario I was really concerned about when I wrote this test was (2), because that could lead to a corrupt restore. This test isn't strong enough to prevent that completely, because two unrelated standbys can branch onto the same new timelines at the same LSNs, and then these checks can't tell that something bad has happened. However, they can detect a useful subset of problem cases. And the solution is not so much "take a new full backup" as "keep straight which server is which." 
Likewise, in case (1), the relevant hint would be "don't manually remove timeline history files, and if you must, then at least don't nuke timelines that you actually still care about." > > I have a fix for this locally, but I'm going to hold off on publishing > > a new version until either there's a few more things I can address all > > at once, or until Thomas commits the ubsan fix. > > > > Great, I cannot get it to fail again today, it had to be some dirty > state of the testing env. BTW: Thomas has pushed that ubsan fix. Huzzah, the cfbot likes the patch set now. Here's a new version with the promised fix for your non-reproducible issue. Let's see whether you and cfbot still like this version. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
Hi Robert, On Mon, Dec 11, 2023 at 6:08 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Dec 8, 2023 at 5:02 AM Jakub Wartak > <jakub.wartak@enterprisedb.com> wrote: > > While we are at it, maybe around the below in PrepareForIncrementalBackup() > > > > if (tlep[i] == NULL) > > ereport(ERROR, > > > > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > errmsg("timeline %u found in > > manifest, but not in this server's history", > > range->tli))); > > > > we could add > > > > errhint("You might need to start a new full backup instead of > > incremental one") > > > > ? > > I can't exactly say that such a hint would be inaccurate, but I think > the impulse to add it here is misguided. One of my design goals for > this system is to make it so that you never have to take a new > incremental backup "just because," Did you mean take a new full backup here? > not even in case of an intervening > timeline switch. So, all of the errors in this function are warning > you that you've done something that you really should not have done. > In this particular case, you've either (1) manually removed the > timeline history file, and not just any timeline history file but the > one for a timeline for a backup that you still intend to use as the > basis for taking an incremental backup or (2) tried to use a full > backup taken from one server as the basis for an incremental backup on > a completely different server that happens to share the same system > identifier, e.g. because you promoted two standbys derived from the > same original primary and then tried to use a full backup taken on one > as the basis for an incremental backup taken on the other. > Okay, but please consider two other possibilities: (3) I had a corrupted DB where I've fixed it by running pg_resetwal and some cronjob just a day later attempted to take incremental and failed with that error. 
(4) I had pg_upgraded the DB (pg_upgrade calls pg_resetwal on the fresh initdb directory) and a cronjob then simply failed with this error.

I bet that (4) is going to happen more often than (1) or (2), which might trigger users to complain on forums or in support tickets.

> > > I have a fix for this locally, but I'm going to hold off on publishing
> > > a new version until either there's a few more things I can address all
> > > at once, or until Thomas commits the ubsan fix.
> >
> > Great, I cannot get it to fail again today, it had to be some dirty
> > state of the testing env. BTW: Thomas has pushed that ubsan fix.
>
> Huzzah, the cfbot likes the patch set now. Here's a new version with
> the promised fix for your non-reproducible issue. Let's see whether
> you and cfbot still like this version.

LGTM, all quick tests work from my end too. BTW: I have also scheduled the long/large pgbench -s 14000 (~200GB?) multiple-day incremental test. I'll let you know how it went.

-J.
On Wed, Dec 13, 2023 at 5:39 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > > I can't exactly say that such a hint would be inaccurate, but I think > > the impulse to add it here is misguided. One of my design goals for > > this system is to make it so that you never have to take a new > > incremental backup "just because," > > Did you mean take a new full backup here? Yes, apologies for the typo. > > not even in case of an intervening > > timeline switch. So, all of the errors in this function are warning > > you that you've done something that you really should not have done. > > In this particular case, you've either (1) manually removed the > > timeline history file, and not just any timeline history file but the > > one for a timeline for a backup that you still intend to use as the > > basis for taking an incremental backup or (2) tried to use a full > > backup taken from one server as the basis for an incremental backup on > > a completely different server that happens to share the same system > > identifier, e.g. because you promoted two standbys derived from the > > same original primary and then tried to use a full backup taken on one > > as the basis for an incremental backup taken on the other. > > > > Okay, but please consider two other possibilities: > > (3) I had a corrupted DB where I've fixed it by running pg_resetwal > and some cronjob just a day later attempted to take incremental and > failed with that error. > > (4) I had pg_upgraded (which calls pg_resetwal on fresh initdb > directory) the DB where I had cronjob that just failed with this error > > I bet that (4) is going to happen more often than (1), (2) , which > might trigger users to complain on forums, support tickets. Hmm. In case (4), I was thinking that you'd get a complaint about the database system identifier not matching. I'm not actually sure that's what would happen, though, now that you mention it. 
In case (3), I think you would get an error about missing WAL summary files. > > Huzzah, the cfbot likes the patch set now. Here's a new version with > > the promised fix for your non-reproducible issue. Let's see whether > > you and cfbot still like this version. > > LGTM, all quick tests work from my end too. BTW: I have also scheduled > the long/large pgbench -s 14000 (~200GB?) - multiple day incremental > test. I'll let you know how it went. Awesome, thank you so much. -- Robert Haas EDB: http://www.enterprisedb.com
Hi Robert, On Wed, Dec 13, 2023 at 2:16 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > > not even in case of an intervening > > > timeline switch. So, all of the errors in this function are warning > > > you that you've done something that you really should not have done. > > > In this particular case, you've either (1) manually removed the > > > timeline history file, and not just any timeline history file but the > > > one for a timeline for a backup that you still intend to use as the > > > basis for taking an incremental backup or (2) tried to use a full > > > backup taken from one server as the basis for an incremental backup on > > > a completely different server that happens to share the same system > > > identifier, e.g. because you promoted two standbys derived from the > > > same original primary and then tried to use a full backup taken on one > > > as the basis for an incremental backup taken on the other. > > > > > > > Okay, but please consider two other possibilities: > > > > (3) I had a corrupted DB where I've fixed it by running pg_resetwal > > and some cronjob just a day later attempted to take incremental and > > failed with that error. > > > > (4) I had pg_upgraded (which calls pg_resetwal on fresh initdb > > directory) the DB where I had cronjob that just failed with this error > > > > I bet that (4) is going to happen more often than (1), (2) , which > > might trigger users to complain on forums, support tickets. > > Hmm. In case (4), I was thinking that you'd get a complaint about the > database system identifier not matching. I'm not actually sure that's > what would happen, though, now that you mention it. 
>

I've played with initdb/pg_upgrade (17->17) and I don't get a DBID mismatch (of course they do differ after initdb), but I get this instead:

$ pg_basebackup -c fast -D /tmp/incr2.after.upgrade -p 5432 --incremental /tmp/incr1.before.upgrade/backup_manifest
WARNING: aborting backup due to backend exiting before pg_backup_stop was called
pg_basebackup: error: could not initiate base backup: ERROR: timeline 2 found in manifest, but not in this server's history
pg_basebackup: removing data directory "/tmp/incr2.after.upgrade"

Also, in the manifest I don't see a DBID? Maybe it's a nuisance, and all I'm trying to say is that if an automated cronjob with pg_basebackup --incremental hits a freshly upgraded cluster, that error message without an errhint() is going to scare some junior DBAs.

> > LGTM, all quick tests work from my end too. BTW: I have also scheduled
> > the long/large pgbench -s 14000 (~200GB?) - multiple day incremental
> > test. I'll let you know how it went.
>
> Awesome, thank you so much.

OK, so pgbench -i -s 14440 and pgbench -P 1 -R 100 -c 8 -T 259200 did generate pretty large incrementals, so I had to abort the test for lack of space (I was expecting the incrementals to be much smaller). I initially suspected that the problem lay in the uniform distribution of `\set aid random(1, 100000 * :scale)` in the tpcb-like script that UPDATEs the big pgbench_accounts.

$ du -sm /backups/backups/* /backups/archive/
216205 /backups/backups/full
215207 /backups/backups/incr.1
216706 /backups/backups/incr.2
102273 /backups/archive/

So I verified recoverability yesterday anyway - the pg_combinebackup "full incr.1 incr.2" took 44 minutes, and the later archive WAL recovery and promotion succeeded. The 8-way parallel seqscan for sum(abalance) on pgbench_accounts and the other tables worked fine.
The pg_combinebackup was using 15-20% CPU (mostly on %sys), while performing mostly 60-80MB/s separately for both reads and writes (it's slow, but that's due to the maxed-out sequential I/O of the Premium tier on a small SSD on Azure).

So I've launched another, improved test (to force more localized UPDATEs) to see the more real-world space-effectiveness of the incremental backup:

\set aid random_exponential(1, 100000 * :scale, 8)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

But then... (and I have verified the low IDs for :aid above)... the same has happened:

backups/backups$ du -sm /backups/backups/*
210229 /backups/backups/full
208299 /backups/backups/incr.1
208351 /backups/backups/incr.2

# pgbench_accounts has relfilenodeid 16486
postgres@jw-test-1:/backups/backups$ for L in 5 10 15 30 100 161 173 174 175 ; do md5sum full/base/5/16486.$L ./incr.1/base/5/16486.$L ./incr.2/base/5/16486.$L /var/lib/postgres/17/data/base/5/16486.$L ; echo; done

005c6bbb40fca3c1a0a819376ef0e793 full/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793 ./incr.1/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793 ./incr.2/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793 /var/lib/postgres/17/data/base/5/16486.5

[.. all the checksums match (!) for the above $L..]
c5117a213253035da5e5ee8a80c3ee3d full/base/5/16486.173
c5117a213253035da5e5ee8a80c3ee3d ./incr.1/base/5/16486.173
c5117a213253035da5e5ee8a80c3ee3d ./incr.2/base/5/16486.173
c5117a213253035da5e5ee8a80c3ee3d /var/lib/postgres/17/data/base/5/16486.173

47ee6b18d7f8e40352598d194b9a3c8a full/base/5/16486.174
47ee6b18d7f8e40352598d194b9a3c8a ./incr.1/base/5/16486.174
47ee6b18d7f8e40352598d194b9a3c8a ./incr.2/base/5/16486.174
47ee6b18d7f8e40352598d194b9a3c8a /var/lib/postgres/17/data/base/5/16486.174

82dfeba58b4a1031ac12c23f9559a330 full/base/5/16486.175
21a8ac1e6fef3cf0b34546c41d59b2cc ./incr.1/base/5/16486.175
2c3d89c612b2f97d575a55c6c0204d0b ./incr.2/base/5/16486.175
73367d44d76e98276d3a6bbc14bb31f1 /var/lib/postgres/17/data/base/5/16486.175

So to me, it looks like it copied 174 out of 175 files in full anyway, lowering the effectiveness of that incremental backup to 0%. The commands to generate those incr backups were:

pg_basebackup -v -P -c fast -D /backups/backups/incr.1 --incremental=/backups/backups/full/backup_manifest
sleep 4h
pg_basebackup -v -P -c fast -D /backups/backups/incr.2 --incremental=/backups/backups/incr1/backup_manifest

The incrementals are being generated, but just for the first (0) segment of the relation?
/backups/backups$ ls -l incr.2/base/5 | grep INCR
-rw------- 1 postgres postgres 12 Dec 14 21:33 INCREMENTAL.112
-rw------- 1 postgres postgres 12 Dec 14 21:01 INCREMENTAL.113
-rw------- 1 postgres postgres 12 Dec 14 21:36 INCREMENTAL.1247
-rw------- 1 postgres postgres 12 Dec 14 21:38 INCREMENTAL.1247_vm
[..note, no INCREMENTAL.$int.$segment files]
-rw------- 1 postgres postgres 12 Dec 14 21:24 INCREMENTAL.6238
-rw------- 1 postgres postgres 12 Dec 14 21:17 INCREMENTAL.6239
-rw------- 1 postgres postgres 12 Dec 14 21:55 INCREMENTAL.827

# 16486 is pgbench_accounts
/backups/backups$ ls -l incr.2/base/5/*16486* | grep INCR
-rw------- 1 postgres postgres 14613480 Dec 14 21:00 incr.2/base/5/INCREMENTAL.16486
-rw------- 1 postgres postgres 12 Dec 14 21:52 incr.2/base/5/INCREMENTAL.16486_vm

/backups/backups$ find incr* -name INCREMENTAL.* | wc -l
1342
/backups/backups$ find incr* -name INCREMENTAL.*_* | wc -l  # VM or FSM
236
/backups/backups$ find incr* -name INCREMENTAL.*.* | wc -l  # not a single >1GB incremental relation segment
0

I'm quickly passing info along and I haven't really looked at the code yet, but it should be somewhere around GetFileBackupMethod() and should be reproducible easily with the configure --with-segsize-blocks=X switch.

-J.
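[Editor's note: to make the symptom concrete - a relation bigger than one segment is stored as several files (16486, 16486.1, 16486.2, ..., with a fork suffix like _vm coming before the segment number), and each changed segment should be independently eligible to become an INCREMENTAL.* file. A toy sketch of the names the `find` above should have matched; the helpers are the editor's own and sidestep the real per-segment changed-block decision:]

```python
def segment_names(relfilenode, nsegments, fork=None):
    """List the on-disk names for a relation's segments: the base
    segment is just '<relfilenode>', later segments append '.<seg>',
    and a fork suffix such as '_vm' precedes the segment number."""
    suffix = f"_{fork}" if fork else ""
    names = []
    for seg in range(nsegments):
        base = f"{relfilenode}{suffix}"
        names.append(base if seg == 0 else f"{base}.{seg}")
    return names

def incremental_names(names):
    """An incremental backup may replace any changed segment file with
    an INCREMENTAL.-prefixed counterpart - per segment, not just seg 0."""
    return [f"INCREMENTAL.{n}" for n in names]
```

For pgbench_accounts (relfilenode 16486, 176 segments at ~1GB each), one would expect names like INCREMENTAL.16486.175 to appear; the bug reported above meant only the base INCREMENTAL.16486 was ever produced.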
I have a couple of quick fixes here. The first fixes up some things in nls.mk related to a file move. The second is some cleanup because some function you are using has been removed in the meantime; you probably found that yourself while rebasing. The pg_walsummary patch doesn't have a nls.mk, but you also comment that it doesn't have tests yet, so I assume it's not considered complete yet anyway.
Attachment
A separate bikeshedding topic: The GUC "summarize_wal", could that be "wal_something" instead? (wal_summarize? wal_summarizer?) It would be nice if these settings names group together a bit, both with existing wal_* ones and also with the new ones you are adding (wal_summary_keep_time).
Another set of comments, about the patch that adds pg_combinebackup:

Make sure all the options are listed in a consistent order. We have lately changed everything to be alphabetical. This includes:

- reference page pg_combinebackup.sgml
- long_options listing
- getopt_long() argument
- subsequent switch
- (--help output, but it looks ok as is)

Also, in pg_combinebackup.sgml, the option --sync-method is listed as if it does not take an argument, but it does.
On Fri, Dec 15, 2023 at 6:53 AM Peter Eisentraut <peter@eisentraut.org> wrote: > The first fixes up some things in nls.mk related to a file move. The > second is some cleanup because some function you are using has been > removed in the meantime; you probably found that yourself while rebasing. Incorporated these. As you guessed, MemoryContextResetAndDeleteChildren -> MemoryContextReset had already been done locally. > The pg_walsummary patch doesn't have a nls.mk, but you also comment that > it doesn't have tests yet, so I assume it's not considered complete yet > anyway. I think this was more of a case of me just not realizing that I should add that. I'll add something simple to the next version, but I'm not very good at this NLS stuff. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Dec 15, 2023 at 6:58 AM Peter Eisentraut <peter@eisentraut.org> wrote: > A separate bikeshedding topic: The GUC "summarize_wal", could that be > "wal_something" instead? (wal_summarize? wal_summarizer?) It would be > nice if these settings names group together a bit, both with existing > wal_* ones and also with the new ones you are adding > (wal_summary_keep_time). Yeah, this is highly debatable, so bikeshed away. IMHO, the question here is whether we care more about (1) having the name of the GUC sound nice grammatically or (2) having the GUC begin with the same string as other, related GUCs. I think that Tom Lane tends to prefer the former, and probably some other people do too, while some other people tend to prefer the latter. Ideally it would be possible to satisfy both goals at once here, but everything I thought about that started with "wal" sounded too awkward for me to like it; hence the current choice of name. But if there's consensus on something else, so be it. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Dec 18, 2023 at 4:10 AM Peter Eisentraut <peter@eisentraut.org> wrote: > Another set of comments, about the patch that adds pg_combinebackup: > > Make sure all the options are listed in a consistent order. We have > lately changed everything to be alphabetical. This includes: > > - reference page pg_combinebackup.sgml > > - long_options listing > > - getopt_long() argument > > - subsequent switch > > - (--help output, but it looks ok as is) > > Also, in pg_combinebackup.sgml, the option --sync-method is listed as if > it does not take an argument, but it does. I've attempted to clean this stuff up in the attached version. This version also includes a fix for the bug found by Jakub that caused things to not work properly for segment files beyond the first for any particular relation, which turns out to be a really stupid mistake in my earlier commit 025584a168a4b3002e193. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
- v15-0001-Fix-brown-paper-bag-bug-in-025584a168a4b3002e193.patch
- v15-0005-Add-new-pg_walsummary-tool.patch
- v15-0003-Add-a-new-WAL-summarizer-process.patch
- v15-0002-Move-src-bin-pg_verifybackup-parse_manifest.c-in.patch
- v15-0004-Add-support-for-incremental-backup.patch
- v15-0006-Test-patch-Enable-summarize_wal-by-default.patch
On Fri, Dec 15, 2023 at 5:36 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote:
> I've played with initdb/pg_upgrade (17->17) and I don't get a DBID
> mismatch (of course they do differ after initdb), but I get this
> instead:
>
> $ pg_basebackup -c fast -D /tmp/incr2.after.upgrade -p 5432
> --incremental /tmp/incr1.before.upgrade/backup_manifest
> WARNING: aborting backup due to backend exiting before pg_backup_stop
> was called
> pg_basebackup: error: could not initiate base backup: ERROR: timeline
> 2 found in manifest, but not in this server's history
> pg_basebackup: removing data directory "/tmp/incr2.after.upgrade"
>
> Also, in the manifest I don't see a DBID?
> Maybe it's a nuisance, and all I'm trying to say is that if an
> automated cronjob with pg_basebackup --incremental hits a freshly
> upgraded cluster, that error message without an errhint() is going to
> scare some junior DBAs.

Yeah. I think we should add the system identifier to the manifest, but I think that should be left for a future project, as I don't think the lack of it is a good reason to stop all progress here. When we have that, we can give more reliable error messages about system mismatches at an earlier stage. Unfortunately, I don't think that the timeline messages you're seeing here are going to apply in every case: suppose you have two unrelated servers that are both on timeline 1. I think you could use a base backup from one of those servers and use it as the basis for the incremental from the other, and I think that if you did it right you might fail to hit any sanity check that would block that. pg_combinebackup will realize there's a problem, because it has the whole cluster to work with, not just the manifest, and will notice the mismatching system identifiers, but that's kind of late to find out that you made a big mistake. However, right now, it's the best we can do.

> The incrementals are being generated, but just for the first (0)
> segment of the relation?
I committed the first two patches from the series I posted yesterday. The first should fix this, and the second relocates parse_manifest.c. That patch hasn't changed in a while and seems unlikely to attract major objections. There's no real reason to commit it until we're ready to move forward with the main patches, but I think we're very close to that now, so I did. Here's a rebase for cfbot. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
Hi Robert, On Tue, Dec 19, 2023 at 9:36 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Dec 15, 2023 at 5:36 AM Jakub Wartak > <jakub.wartak@enterprisedb.com> wrote: > > I've played with initdb/pg_upgrade (17->17) and I don't get DBID > > mismatch (of course they do differ after initdb), but I get this > > instead: > > > > $ pg_basebackup -c fast -D /tmp/incr2.after.upgrade -p 5432 > > --incremental /tmp/incr1.before.upgrade/backup_manifest > > WARNING: aborting backup due to backend exiting before pg_backup_stop > > was called > > pg_basebackup: error: could not initiate base backup: ERROR: timeline > > 2 found in manifest, but not in this server's history > > pg_basebackup: removing data directory "/tmp/incr2.after.upgrade" > > > > Also in the manifest I don't see DBID? > > Maybe it's a nuisance and all I'm trying to see is that if an > > automated cronjob with pg_basebackup --incremental hits a freshly > > upgraded cluster, that error message without errhint() is going to > > scare some Junior DBAs. > > Yeah. I think we should add the system identifier to the manifest, but > I think that should be left for a future project, as I don't think the > lack of it is a good reason to stop all progress here. When we have > that, we can give more reliable error messages about system mismatches > at an earlier stage. Unfortunately, I don't think that the timeline > messages you're seeing here are going to apply in every case: suppose > you have two unrelated servers that are both on timeline 1. I think > you could use a base backup from one of those servers and use it as > the basis for the incremental from the other, and I think that if you > did it right you might fail to hit any sanity check that would block > that. pg_combinebackup will realize there's a problem, because it has > the whole cluster to work with, not just the manifest, and will notice > the mismatching system identifiers, but that's kind of late to find > out that you made a big mistake. 
However, right now, it's the best we > can do. > OK, understood. > > The incrementals are being generated, but just for the first (0) > > segment of the relation? > > I committed the first two patches from the series I posted yesterday. > The first should fix this, and the second relocates parse_manifest.c. > That patch hasn't changed in a while and seems unlikely to attract > major objections. There's no real reason to commit it until we're > ready to move forward with the main patches, but I think we're very > close to that now, so I did. > > Here's a rebase for cfbot. the v15 patchset (posted yesterday) test results are GOOD: 1. make check-world - GOOD 2. cfbot was GOOD 3. the devel/master bug present in parse_filename_for_nontemp_relation() seems to be gone (in local testing) 4. some further tests: test_across_wallevelminimal.sh - GOOD test_incr_after_timelineincrease.sh - GOOD test_incr_on_standby_after_promote.sh - GOOD test_many_incrementals_dbcreate.sh - GOOD test_many_incrementals.sh - GOOD test_multixact.sh - GOOD test_pending_2pc.sh - GOOD test_reindex_and_vacuum_full.sh - GOOD test_repro_assert.sh test_standby_incr_just_backup.sh - GOOD test_stuck_walsum.sh - GOOD test_truncaterollback.sh - GOOD test_unlogged_table.sh - GOOD test_full_pri__incr_stby__restore_on_pri.sh - GOOD test_full_pri__incr_stby__restore_on_stby.sh - GOOD test_full_stby__incr_stby__restore_on_pri.sh - GOOD test_full_stby__incr_stby__restore_on_stby.sh - GOOD 5. the more real-world pgbench test with localized segment writes using `\set aid random_exponential...` [1] indicates much greater efficiency in terms of backup space use now, du -sm shows: 210229 /backups/backups/full 250 /backups/backups/incr.1 255 /backups/backups/incr.2 [..] 348 /backups/backups/incr.13 408 /backups/backups/incr.14 // latest(20th of Dec on 10:40) 6673 /backups/archive/ The DB size was as reported by \l+ 205GB. 
That pgbench was running for ~27h (19th Dec 08:39 -> 20th Dec 11:30) with slow 100 TPS (-R), so no insane amounts of WAL. Time to reconstruct 14 chained incremental backups was 45mins (pg_combinebackup -o /var/lib/postgres/17/data /backups/backups/full /backups/backups/incr.1 (..) /backups/backups/incr.14). DB after recovering was OK and working fine. -J.
On Wed, Dec 20, 2023 at 8:11 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > the v15 patchset (posted yesterday) test results are GOOD: All right. I committed the main two patches, dropped the for-testing-only patch, and added a simple test to the remaining pg_walsummary patch. That needs more work, but here's what I have as of now. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
Hello Robert, 20.12.2023 23:56, Robert Haas wrote: > On Wed, Dec 20, 2023 at 8:11 AM Jakub Wartak > <jakub.wartak@enterprisedb.com> wrote: >> the v15 patchset (posted yesterday) test results are GOOD: > All right. I committed the main two patches, dropped the > for-testing-only patch, and added a simple test to the remaining > pg_walsummary patch. That needs more work, but here's what I have as > of now. I've found several typos/inconsistencies introduced with 174c48050 and dc2123400. Maybe you would want to fix them, while on it?: s/arguent/argument/; s/BlkRefTableEntry/BlockRefTableEntry/; s/BlockRefTablEntry/BlockRefTableEntry/; s/Caonicalize/Canonicalize/; s/Checksum_Algorithm/Checksum-Algorithm/; s/corresonding/corresponding/; s/differenly/differently/; s/excessing/excessive/; s/ exta / extra /; s/hexademical/hexadecimal/; s/initally/initially/; s/MAXGPATH/MAXPGPATH/; s/overrreacting/overreacting/; s/old_meanifest_file/old_manifest_file/; s/pg_cominebackup/pg_combinebackup/; s/pg_tblpc/pg_tblspc/; s/pointrs/pointers/; s/Recieve/Receive/; s/recieved/received/; s/ recod / record /; s/ recods / records /; s/substntially/substantially/; s/sumamry/summary/; s/summry/summary/; s/synchronizaton/synchronization/; s/sytem/system/; s/withot/without/; s/Woops/Whoops/; s/xlograder/xlogreader/; Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a comment above redo_pointer_at_last_summary_removal declaration, but perhaps it should say about removing summaries instead? Best regards, Alexander
On Wed, Dec 20, 2023 at 11:00 PM Alexander Lakhin <exclusion@gmail.com> wrote: > I've found several typos/inconsistencies introduced with 174c48050 and > dc2123400. Maybe you would want to fix them, while on it?: That's an impressively long list of mistakes in something I thought I'd been careful about. Sigh. I don't suppose you could provide these corrections in the form of a patch? I don't really want to run these sed commands across the entire tree and then try to figure out what's what... > Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a > comment above redo_pointer_at_last_summary_removal declaration, but > perhaps it should say about removing summaries instead? Wow, yeah. Thanks, will fix. -- Robert Haas EDB: http://www.enterprisedb.com
21.12.2023 15:07, Robert Haas wrote: > On Wed, Dec 20, 2023 at 11:00 PM Alexander Lakhin <exclusion@gmail.com> wrote: >> I've found several typos/inconsistencies introduced with 174c48050 and >> dc2123400. Maybe you would want to fix them, while on it?: > That's an impressively long list of mistakes in something I thought > I'd been careful about. Sigh. > > I don't suppose you could provide these corrections in the form of a > patch? I don't really want to run these sed commands across the entire > tree and then try to figure out what's what... Please look at the attached patch; it corrects all 29 items ("recods" fixed in two places), but maybe you find some substitutions wrong... I've also observed that those commits introduced new warnings: $ CC=gcc-12 CPPFLAGS="-Wtype-limits" ./configure -q && make -s -j8 reconstruct.c: In function ‘read_bytes’: reconstruct.c:511:24: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits] 511 | if (rb < 0) | ^ reconstruct.c: In function ‘write_reconstructed_file’: reconstruct.c:650:40: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits] 650 | if (rb < 0) | ^ reconstruct.c:662:32: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits] 662 | if (wb < 0) There are also two deadcode.DeadStores complaints from clang. First one is about: /* * Align the wait time to prevent drift. This doesn't really matter, * but we'd like the warnings about how long we've been waiting to say * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever * drifting to something that is not a multiple of ten. */ timeout_in_ms -= TimestampDifferenceMilliseconds(current_time, initial_time) % timeout_in_ms; It looks like this timeout is really not used. 
And the minor one (similar to many existing, maybe doesn't deserve fixing): walsummarizer.c:808:5: warning: Value stored to 'summary_end_lsn' is never read [deadcode.DeadStores] summary_end_lsn = private_data->read_upto; ^ ~~~~~~~~~~~~~~~~~~~~~~~ >> Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a >> comment above redo_pointer_at_last_summary_removal declaration, but >> perhaps it should say about removing summaries instead? > Wow, yeah. Thanks, will fix. Thank you for paying attention to it! Best regards, Alexander
Attachment
On Thu, Dec 21, 2023 at 10:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: > Please look at the attached patch; it corrects all 29 items ("recods" > fixed in two places), but maybe you find some substitutions wrong... Thanks, committed with a few additions. > I've also observed that those commits introduced new warnings: > $ CC=gcc-12 CPPFLAGS="-Wtype-limits" ./configure -q && make -s -j8 > reconstruct.c: In function ‘read_bytes’: > reconstruct.c:511:24: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits] > 511 | if (rb < 0) > | ^ > reconstruct.c: In function ‘write_reconstructed_file’: > reconstruct.c:650:40: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits] > 650 | if (rb < 0) > | ^ > reconstruct.c:662:32: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits] > 662 | if (wb < 0) Oops. I think the variables should be type int. See attached. > There are also two deadcode.DeadStores complaints from clang. First one is > about: > /* > * Align the wait time to prevent drift. This doesn't really matter, > * but we'd like the warnings about how long we've been waiting to say > * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever > * drifting to something that is not a multiple of ten. > */ > timeout_in_ms -= > TimestampDifferenceMilliseconds(current_time, initial_time) % > timeout_in_ms; > It looks like this timeout is really not used. Oops. It should be. See attached. > And the minor one (similar to many existing, maybe doesn't deserve fixing): > walsummarizer.c:808:5: warning: Value stored to 'summary_end_lsn' is never read [deadcode.DeadStores] > summary_end_lsn = private_data->read_upto; > ^ ~~~~~~~~~~~~~~~~~~~~~~~ It kind of surprises me that this is dead, but it seems best to keep it there to be on the safe side, in case some change to the logic renders it not dead in the future. 
> >> Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a > >> comment above redo_pointer_at_last_summary_removal declaration, but > >> perhaps it should say about removing summaries instead? > > Wow, yeah. Thanks, will fix. > > Thank you for paying attention to it! I'll fix this next. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
21.12.2023 23:43, Robert Haas wrote: >> There are also two deadcode.DeadStores complaints from clang. First one is >> about: >> /* >> * Align the wait time to prevent drift. This doesn't really matter, >> * but we'd like the warnings about how long we've been waiting to say >> * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever >> * drifting to something that is not a multiple of ten. >> */ >> timeout_in_ms -= >> TimestampDifferenceMilliseconds(current_time, initial_time) % >> timeout_in_ms; >> It looks like this timeout is really not used. > Oops. It should be. See attached. My quick experiment shows that that TimestampDifferenceMilliseconds call always returns zero, due to its arguments being swapped. The other changes look good to me. Thank you! Best regards, Alexander
My compiler has the following complaint: ../postgresql/src/backend/postmaster/walsummarizer.c: In function ‘GetOldestUnsummarizedLSN’: ../postgresql/src/backend/postmaster/walsummarizer.c:540:32: error: ‘unsummarized_lsn’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 540 | WalSummarizerCtl->pending_lsn = unsummarized_lsn; | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~ I haven't looked closely to see whether there is actually a problem here, but the attached patch at least resolves the warning. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Sat, Dec 23, 2023 at 4:51 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > My compiler has the following complaint: > > ../postgresql/src/backend/postmaster/walsummarizer.c: In function ‘GetOldestUnsummarizedLSN’: > ../postgresql/src/backend/postmaster/walsummarizer.c:540:32: error: ‘unsummarized_lsn’ may be used uninitialized in this function [-Werror=maybe-uninitialized] > 540 | WalSummarizerCtl->pending_lsn = unsummarized_lsn; > | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~ Thanks. I don't think there's a real bug, but I pushed a fix, same as what you had. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Dec 27, 2023 at 09:11:02AM -0500, Robert Haas wrote: > Thanks. I don't think there's a real bug, but I pushed a fix, same as > what you had. Thanks! I also noticed that WALSummarizerLock probably needs a mention in wait_event_names.txt. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Wed, Dec 27, 2023 at 10:36 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > On Wed, Dec 27, 2023 at 09:11:02AM -0500, Robert Haas wrote: > > Thanks. I don't think there's a real bug, but I pushed a fix, same as > > what you had. > > Thanks! I also noticed that WALSummarizerLock probably needs a mention in > wait_event_names.txt. Fixed. It seems like it would be good if there were an automated cross-check between lwlocknames.txt and wait_event_names.txt. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Dec 22, 2023 at 12:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: > My quick experiment shows that that TimestampDifferenceMilliseconds call > always returns zero, due to its arguments being swapped. Thanks. Tom already changed the unsigned -> int stuff in a separate commit, so I just pushed the fixes to PrepareForIncrementalBackup, both the one I had before, and swapping the arguments to TimestampDifferenceMilliseconds. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, 3 Jan 2024 at 15:10, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 22, 2023 at 12:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> > My quick experiment shows that that TimestampDifferenceMilliseconds call
> > always returns zero, due to its arguments being swapped.
>
> Thanks. Tom already changed the unsigned -> int stuff in a separate
> commit, so I just pushed the fixes to PrepareForIncrementalBackup,
> both the one I had before, and swapping the arguments to
> TimestampDifferenceMilliseconds.
I would like to query the following:
--tablespace-mapping=olddir=newdir
Relocates the tablespace in directory olddir to newdir during the backup. olddir is the absolute path of the tablespace as it exists in the first backup specified on the command line, and newdir is the absolute path to use for the tablespace in the reconstructed backup.
The first backup specified on the command line will be the regular, full, non-incremental backup. But if a tablespace was introduced subsequently, it would only appear in an incremental backup. Wouldn't this then mean that a mapping would need to be provided based on the path to the tablespace of that incremental backup's copy?
Regards
Thom
On Thu, Apr 25, 2024 at 6:44 PM Thom Brown <thom@linux.com> wrote: > I would like to query the following: > > --tablespace-mapping=olddir=newdir > > Relocates the tablespace in directory olddir to newdir during the backup. olddir is the absolute path of the tablespace as it exists in the first backup specified on the command line, and newdir is the absolute path to use for the tablespace in the reconstructed backup. > > The first backup specified on the command line will be the regular, full, non-incremental backup. But if a tablespace was introduced subsequently, it would only appear in an incremental backup. Wouldn't this then mean that a mapping would need to be provided based on the path to the tablespace of that incremental backup's copy? Yes. Tomas Vondra found the same issue, which I have fixed in 1713e3d6cd393fcc1d4873e75c7fa1f6c7023d75. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, So I am a bit confused about the status of the tar format support, and after re-reading the thread (or at least grepping it for ' tar '), this wasn't really much discussed here either. On Wed, Jun 14, 2023 at 02:46:48PM -0400, Robert Haas wrote: > - We only know how to operate on directories, not tar files. I thought > about that when working on pg_verifybackup as well, but I didn't do > anything about it. It would be nice to go back and make that tool work > on tar-format backups, and this one, too. I believe "that tool" is pg_verifybackup, while "this one" is pg_combinebackup? However, what's up with pg_basebackup itself with respect to tar format incremental backups? AFAICT (see below), pg_basebackup -Ft --incremental=foo/backup_manifest happily creates an incremental backup in tar format; however, pg_combinebackup will not be able to restore it? If that is the case, shouldn't there be a bigger warning in the documentation about this, or maybe pg_basebackup should refuse to make incremental tar-format backups in the first place? Am I missing something here? It will be obvious to users after the first failure (to try to restore) that this will not work, and hopefully everybody tests a restore before they put a backup solution into production (or even better, wait until this whole feature is included in a wholesale solution), but I wonder whether somebody might trip over this after all and be unhappy. If one reads the pg_combinebackup documentation carefully it kinda becomes obvious that it does not occupy itself with tar format backups, but it is not spelt out explicitly either. |postgres@mbanck-lin-1:~$ pg_basebackup -c fast -Ft -D backup/backup_full |postgres@mbanck-lin-1:~$ pg_basebackup -c fast -Ft -D backup/backup_incr_1 --incremental=backup/backup_full/backup_manifest |postgres@mbanck-lin-1:~$ echo $? 
|0 |postgres@mbanck-lin-1:~$ du -h backup/ |44M backup/backup_incr_1 |4,5G backup/backup_full |4,5G backup/ |postgres@mbanck-lin-1:~$ tar tf backup/backup_incr_1/base.tar | grep INCR | head |base/1/INCREMENTAL.3603 |base/1/INCREMENTAL.2187 |base/1/INCREMENTAL.13418 |base/1/INCREMENTAL.3467 |base/1/INCREMENTAL.2615_vm |base/1/INCREMENTAL.2228 |base/1/INCREMENTAL.3503 |base/1/INCREMENTAL.2659 |base/1/INCREMENTAL.2607_vm |base/1/INCREMENTAL.4164 |postgres@mbanck-lin-1:~$ /usr/lib/postgresql/17/bin/pg_combinebackup backup/backup_full/ backup/backup_incr_1/ -o backup/combined |pg_combinebackup: error: could not open file "backup/backup_incr_1//PG_VERSION": No such file or directory Michael