Thread: trying again to get incremental backup

trying again to get incremental backup

From
Robert Haas
Date:
A few years ago, I sketched out a design for incremental backup, but
no patch for incremental backup ever got committed. Instead, the whole
thing evolved into a project to add backup manifests, which are nice,
but not as nice as incremental backup would be. So I've decided to
have another go at incremental backup itself. Attached are some WIP
patches. Let me summarize the design and some open questions and
problems with it that I've discovered. I welcome problem reports and
test results from others, as well.

The basic design of this patch set is pretty simple, and there are
three main parts.

First, there's a new background process called the walsummarizer which
runs all the time. It reads the WAL and generates WAL summary files.
WAL summary files are extremely small compared to the original WAL and
contain only the minimal amount of information that we need in order
to determine which parts of the database need to be backed up. They
tell us about files getting created, destroyed, or truncated, and they
tell us about modified blocks. Naturally, we don't find out about
blocks that were modified without any write-ahead log record, e.g.
hint bit updates, but those are of necessity not critical for
correctness, so it's OK.

Second, pg_basebackup has a mode where it can take an incremental
backup. You must supply a backup manifest from a previous backup. We
read the WAL summary files that have been generated between the start
of the previous backup and the start of this one, and use that to
figure out which relation files have changed and how much.
Non-relation files are sent normally, just as they would be in a full
backup. Relation files can either be sent in full or be replaced by an
incremental file, which contains a subset of the blocks in the file
plus a bit of information to handle truncations properly.

Third, there's now a pg_combinebackup utility which takes a full
backup and one or more incremental backups, performs a bunch of sanity
checks, and if everything works out, writes out a new, synthetic full
backup, aka a data directory.
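
To make the reconstruction step a bit more concrete, here is a rough
sketch of the block-level idea, using toy in-memory structures rather
than the patch's actual incremental file format and ignoring
truncation handling: for each block of the output file, take the
newest copy available in the chain, falling back to the full backup.

#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

typedef struct
{
    uint32_t    nblocks;    /* how many blocks this incremental carries */
    uint32_t   *blkno;      /* their block numbers within the relation */
    uint8_t    *pages;      /* nblocks page images, BLCKSZ bytes each */
} toy_incremental;

/* Fill "out" (BLCKSZ bytes) with block "target" of the combined result. */
static void
combine_block(uint32_t target,
              const uint8_t *full_file,        /* the full backup's copy */
              const toy_incremental *incr,     /* ordered oldest to newest */
              int nincr,
              uint8_t *out)
{
    /* Prefer the newest incremental that contains the block. */
    for (int i = nincr - 1; i >= 0; i--)
    {
        for (uint32_t j = 0; j < incr[i].nblocks; j++)
        {
            if (incr[i].blkno[j] == target)
            {
                memcpy(out, incr[i].pages + (size_t) j * BLCKSZ, BLCKSZ);
                return;
            }
        }
    }

    /* Not modified since the full backup; copy it from there. */
    memcpy(out, full_file + (size_t) target * BLCKSZ, BLCKSZ);
}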

Simple usage example:

pg_basebackup -cfast -Dx
pg_basebackup -cfast -Dy --incremental x/backup_manifest
pg_combinebackup x y -o z

The part of all this with which I'm least happy is the WAL
summarization engine. Actually, the core process of summarizing the
WAL seems totally fine, and the file format is very compact thanks to
some nice ideas from my colleague Dilip Kumar. Someone may of course
wish to argue that the information should be represented in some other
file format instead, and that can be done if it's really needed, but I
don't see a lot of value in tinkering with it, either. Where I do
think there's a problem is deciding how much WAL ought to be
summarized in one WAL summary file. Summary files cover a certain
range of WAL records - they have names like
$TLI${START_LSN}${END_LSN}.summary. It's not too hard to figure out
where a file should start - generally, it's wherever the previous file
ended, possibly on a new timeline, but figuring out where the summary
should end is trickier. You always have the option to either read
another WAL record and fold it into the current summary, or end the
current summary where you are, write out the file, and begin a new
one. So how do you decide what to do?

I originally had the idea of summarizing a certain number of MB of WAL
per WAL summary file, and so I added a GUC wal_summarize_mb for that
purpose. But then I realized that actually, you really want WAL
summary file boundaries to line up with possible redo points, because
when you do an incremental backup, you need a summary that stretches
from the redo point of the checkpoint written at the start of the
prior backup to the redo point of the checkpoint written at the start
of the current backup. The block modifications that happen in that
range of WAL records are the ones that need to be included in the
incremental. Unfortunately, there's no indication in the WAL itself
that you've reached a redo point, but I wrote code that tries to
notice when we've reached the redo point stored in shared memory and
stops the summary there. But I eventually realized that's not good
enough either, because if summarization zooms past the redo point
before noticing the updated redo point in shared memory, then the
backup sat around waiting for the next summary file to be generated so
it had enough summaries to proceed with the backup, while the
summarizer was in no hurry to finish up the current file and just sat
there waiting for more WAL to be generated. Eventually the incremental
backup would just time out. I tried to fix that by making it so that
if somebody's waiting for a summary file to be generated, they can let
the summarizer know about that and it can write a summary file ending
at the LSN up to which it has read and then begin a new file from
there. That seems to fix the hangs, but now I've got three
overlapping, interconnected systems for deciding where to end the
current summary file, and maybe that's OK, but I have a feeling there
might be a better way.
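
Concretely, the requirement on the backup side is essentially that the
summaries on disk cover the range between the prior backup's redo
point and the current one with no gaps. A simplified sketch of that
check, using plain integer LSNs, a single timeline, and invented names
rather than anything from the patch:

#include <stdbool.h>
#include <stdint.h>

typedef struct
{
    uint64_t    start_lsn;
    uint64_t    end_lsn;
} summary_range;

/*
 * "ranges" is sorted by start_lsn.  Return true if, taken together, the
 * summaries cover every byte of WAL in [prior_redo, current_redo).
 */
static bool
summaries_cover(const summary_range *ranges, int n,
                uint64_t prior_redo, uint64_t current_redo)
{
    uint64_t    covered_to = prior_redo;

    for (int i = 0; i < n && covered_to < current_redo; i++)
    {
        if (ranges[i].start_lsn > covered_to)
            break;              /* a gap that nothing covers */
        if (ranges[i].end_lsn > covered_to)
            covered_to = ranges[i].end_lsn;
    }

    return covered_to >= current_redo;
}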

Dilip had an interesting potential solution to this problem, which was
to always emit a special WAL record at the redo pointer. That is, when
we fix the redo pointer for the checkpoint record we're about to
write, also insert a WAL record there. That way, when the summarizer
reaches that sentinel record, it knows it should stop the summary just
before. I'm not sure whether this approach is viable, especially from
a performance and concurrency perspective, and I'm not sure whether
people here would like it, but it does seem like it would make things
a whole lot simpler for this patch set.
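
For illustration only, the idea might look roughly like this inside
CreateCheckPoint(). XLogBeginInsert(), XLogInsert(), RM_XLOG_ID, and
ProcLastRecPtr are existing backend facilities; the XLOG_CHECKPOINT_REDO
info code and the exact placement and locking are hypothetical here,
and are exactly the parts that would need scrutiny:

    /*
     * Hypothetical sketch, not the posted patch.  XLOG_CHECKPOINT_REDO
     * stands in for a to-be-assigned RM_XLOG_ID info code.
     */
    XLogBeginInsert();
    (void) XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);

    /*
     * The record's start LSN becomes the checkpoint's redo pointer, so
     * when the summarizer reads this record back it knows to end the
     * current summary immediately before it.
     */
    checkPoint.redo = ProcLastRecPtr;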

Another thing that I'm not too sure about is: what happens if we find
a relation file on disk that doesn't appear in the backup_manifest for
the previous backup and isn't mentioned in the WAL summaries either?
The fact that said file isn't mentioned in the WAL summaries seems
like it ought to mean that the file is unchanged, but an unchanged
file should have been present in the previous backup's manifest, so
perhaps this ought to be an error condition. But I'm not too sure
about that treatment. I have a feeling that there might be some subtle
problems here, especially if databases or tablespaces get dropped and
then new ones get created that happen to have the same OIDs. And what
about wal_level=minimal? I'm not at a point where I can say I've gone
through and plugged up these kinds of corner-case holes tightly yet,
and I'm worried that there may be still other scenarios of which I
haven't even thought. Happy to hear your ideas about what the problem
cases are or how any of the problems should be solved.

A related design question is whether we should really be sending the
whole backup manifest to the server at all. If it turns out that we
don't really need anything except for the LSN of the previous backup,
we could send that one piece of information instead of everything. On
the other hand, if we need the list of files from the previous backup,
then sending the whole manifest makes sense.

Another big and rather obvious problem with the patch set is that it
doesn't currently have any automated test cases, or any real
documentation. Those are obviously things that need a lot of work
before there could be any thought of committing this. And probably a
lot of bugs will be found along the way, too.

A few less-serious problems with the patch:

- We don't have an incremental JSON parser, so if you have a
backup_manifest>1GB, pg_basebackup --incremental is going to fail.
That's also true of the existing code in pg_verifybackup, and for the
same reason. I talked to Andrew Dunstan at one point about adapting
our JSON parser to support incremental parsing, and he had a patch for
that, but I think he found some problems with it and I'm not sure what
the current status is.

- The patch does support differential backup, aka an incremental atop
another incremental. There's no particular limit to how long a chain
of backups can be. However, pg_combinebackup currently requires that
the first backup is a full backup and all the later ones are
incremental backups. So if you have a full backup a and an incremental
backup b and a differential backup c, you can combine a, b, and c to get
a full backup equivalent to one you would have gotten if you had taken
a full backup at the time you took c. However, you can't combine b and
c with each other without combining them with a, and that might be
desirable in some situations. You might want to collapse a bunch of
older differential backups into a single one that covers the whole
time range of all of them. I think that the file format can support
that, but the tool is currently too dumb.

- We only know how to operate on directories, not tar files. I thought
about that when working on pg_verifybackup as well, but I didn't do
anything about it. It would be nice to go back and make that tool work
on tar-format backups, and this one, too. I don't think there would be
a whole lot of point trying to operate on compressed tar files because
you need random access and that seems hard on a compressed file, but
on uncompressed files it seems at least theoretically doable. I'm not
sure whether anyone would care that much about this, though, even
though it does sound pretty cool.

In the attached patch series, patches 1 through 6 are various
refactoring patches, patch 7 is the main event, and patch 8 adds a
useful inspection tool.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: trying again to get incremental backup

From
Andres Freund
Date:
Hi,

On 2023-06-14 14:46:48 -0400, Robert Haas wrote:
> A few years ago, I sketched out a design for incremental backup, but
> no patch for incremental backup ever got committed. Instead, the whole
> thing evolved into a project to add backup manifests, which are nice,
> but not as nice as incremental backup would be. So I've decided to
> have another go at incremental backup itself. Attached are some WIP
> patches. Let me summarize the design and some open questions and
> problems with it that I've discovered. I welcome problem reports and
> test results from others, as well.

Cool!


> I originally had the idea of summarizing a certain number of MB of WAL
> per WAL summary file, and so I added a GUC wal_summarize_mb for that
> purpose. But then I realized that actually, you really want WAL
> summary file boundaries to line up with possible redo points, because
> when you do an incremental backup, you need a summary that stretches
> from the redo point of the checkpoint written at the start of the
> prior backup to the redo point of the checkpoint written at the start
> of the current backup. The block modifications that happen in that
> range of WAL records are the ones that need to be included in the
> incremental.

I assume this is "solely" required for keeping the incremental backups as
small as possible, rather than being required for correctness?


> Unfortunately, there's no indication in the WAL itself
> that you've reached a redo point, but I wrote code that tries to
> notice when we've reached the redo point stored in shared memory and
> stops the summary there. But I eventually realized that's not good
> enough either, because if summarization zooms past the redo point
> before noticing the updated redo point in shared memory, then the
> backup sat around waiting for the next summary file to be generated so
> it had enough summaries to proceed with the backup, while the
> summarizer was in no hurry to finish up the current file and just sat
> there waiting for more WAL to be generated. Eventually the incremental
> backup would just time out. I tried to fix that by making it so that
> if somebody's waiting for a summary file to be generated, they can let
> the summarizer know about that and it can write a summary file ending
> at the LSN up to which it has read and then begin a new file from
> there. That seems to fix the hangs, but now I've got three
> overlapping, interconnected systems for deciding where to end the
> current summary file, and maybe that's OK, but I have a feeling there
> might be a better way.

Could we just recompute the WAL summary for the [redo, end of chunk] for the
relevant summary file?


> Dilip had an interesting potential solution to this problem, which was
> to always emit a special WAL record at the redo pointer. That is, when
> we fix the redo pointer for the checkpoint record we're about to
> write, also insert a WAL record there. That way, when the summarizer
> reaches that sentinel record, it knows it should stop the summary just
> before. I'm not sure whether this approach is viable, especially from
> a performance and concurrency perspective, and I'm not sure whether
> people here would like it, but it does seem like it would make things
> a whole lot simpler for this patch set.

FWIW, I like the idea of a special WAL record at that point, independent of
this feature. It wouldn't be a meaningful overhead compared to the cost of a
checkpoint, and it seems like it'd be quite useful for debugging. But I can
see uses going beyond that - we occasionally have been discussing associating
additional data with redo points, and that'd be a lot easier to deal with
during recovery with such a record.

I don't really see a performance and concurrency angle right now - what are
you wondering about?


> Another thing that I'm not too sure about is: what happens if we find
> a relation file on disk that doesn't appear in the backup_manifest for
> the previous backup and isn't mentioned in the WAL summaries either?

Wouldn't that commonly happen for unlogged relations at least?

I suspect there's also other ways to end up with such additional files,
e.g. by crashing during the creation of a new relation.


> A few less-serious problems with the patch:
> 
> - We don't have an incremental JSON parser, so if you have a
> backup_manifest>1GB, pg_basebackup --incremental is going to fail.
> That's also true of the existing code in pg_verifybackup, and for the
> same reason. I talked to Andrew Dunstan at one point about adapting
> our JSON parser to support incremental parsing, and he had a patch for
> that, but I think he found some problems with it and I'm not sure what
> the current status is.

As a stopgap measure, can't we just use the relevant flag to allow larger
allocations?


> - The patch does support differential backup, aka an incremental atop
> another incremental. There's no particular limit to how long a chain
> of backups can be. However, pg_combinebackup currently requires that
> the first backup is a full backup and all the later ones are
> incremental backups. So if you have a full backup a and an incremental
> backup b and a differential backup c, you can combine a, b, and c to get
> a full backup equivalent to one you would have gotten if you had taken
> a full backup at the time you took c. However, you can't combine b and
> c with each other without combining them with a, and that might be
> desirable in some situations. You might want to collapse a bunch of
> older differential backups into a single one that covers the whole
> time range of all of them. I think that the file format can support
> that, but the tool is currently too dumb.

That seems like a feature for the future...


> - We only know how to operate on directories, not tar files. I thought
> about that when working on pg_verifybackup as well, but I didn't do
> anything about it. It would be nice to go back and make that tool work
> on tar-format backups, and this one, too. I don't think there would be
> a whole lot of point trying to operate on compressed tar files because
> you need random access and that seems hard on a compressed file, but
> on uncompressed files it seems at least theoretically doable. I'm not
> sure whether anyone would care that much about this, though, even
> though it does sound pretty cool.

I don't know the tar format well, but my understanding is that it doesn't have
a "central metadata" portion. I.e. doing something like this would entail
scanning the tar file sequentially, skipping file contents?  And wouldn't you
have to create an entirely new tar file for the modified output? That kind of
makes it not so incremental ;)

IOW, I'm not sure it's worth bothering about this ever, and certainly doesn't
seem worth bothering about now. But I might just be missing something.


Greetings,

Andres Freund



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Wed, Jun 14, 2023 at 3:47 PM Andres Freund <andres@anarazel.de> wrote:
> I assume this is "solely" required for keeping the incremental backups as
> small as possible, rather than being required for correctness?

I believe so. I want to spend some more time thinking about this to
make sure I'm not missing anything.

> Could we just recompute the WAL summary for the [redo, end of chunk] for the
> relevant summary file?

I'm not understanding how that would help. If we were going to compute
a WAL summary on the fly rather than waiting for one to show up on
disk, what we'd want is [end of last WAL summary that does exist on
disk, redo]. But I'm not sure that's a great approach, because that
LSN gap might be large and then we're duplicating a lot of work that
the summarizer has probably already done most of.

> FWIW, I like the idea of a special WAL record at that point, independent of
> this feature. It wouldn't be a meaningful overhead compared to the cost of a
> checkpoint, and it seems like it'd be quite useful for debugging. But I can
> see uses going beyond that - we occasionally have been discussing associating
> additional data with redo points, and that'd be a lot easier to deal with
> during recovery with such a record.
>
> I don't really see a performance and concurrency angle right now - what are
> you wondering about?

I'm not really sure. I expect Dilip would be happy to post his patch,
and if you'd be willing to have a look at it and express your concerns
or lack thereof, that would be super valuable.

> > Another thing that I'm not too sure about is: what happens if we find
> > a relation file on disk that doesn't appear in the backup_manifest for
> > the previous backup and isn't mentioned in the WAL summaries either?
>
> Wouldn't that commonly happen for unlogged relations at least?
>
> I suspect there's also other ways to end up with such additional files,
> e.g. by crashing during the creation of a new relation.

Yeah, this needs some more careful thought.

> > A few less-serious problems with the patch:
> >
> > - We don't have an incremental JSON parser, so if you have a
> > backup_manifest>1GB, pg_basebackup --incremental is going to fail.
> > That's also true of the existing code in pg_verifybackup, and for the
> > same reason. I talked to Andrew Dunstan at one point about adapting
> > our JSON parser to support incremental parsing, and he had a patch for
> > that, but I think he found some problems with it and I'm not sure what
> > the current status is.
>
> As a stopgap measure, can't we just use the relevant flag to allow larger
> allocations?

I'm not sure that's a good idea, but theoretically, yes. We can also
just choose to accept the limitation that your data directory can't be
too darn big if you want to use this feature. But getting incremental
JSON parsing would be better.

Not having the manifest in JSON would be an even better solution, but
regrettably I did not win that argument.

> That seems like a feature for the future...

Sure.

> I don't know the tar format well, but my understanding is that it doesn't have
> a "central metadata" portion. I.e. doing something like this would entail
> scanning the tar file sequentially, skipping file contents?  And wouldn't you
> have to create an entirely new tar file for the modified output? That kind of
> makes it not so incremental ;)
>
> IOW, I'm not sure it's worth bothering about this ever, and certainly doesn't
> seem worth bothering about now. But I might just be missing something.

Oh, yeah, it's just an idle thought. I'll get to it when I get to it,
or else I won't.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Matthias van de Meent
Date:
On Wed, 14 Jun 2023 at 20:47, Robert Haas <robertmhaas@gmail.com> wrote:
>
> A few years ago, I sketched out a design for incremental backup, but
> no patch for incremental backup ever got committed. Instead, the whole
> thing evolved into a project to add backup manifests, which are nice,
> but not as nice as incremental backup would be. So I've decided to
> have another go at incremental backup itself. Attached are some WIP
> patches.

Nice, I like this idea.

> Let me summarize the design and some open questions and
> problems with it that I've discovered. I welcome problem reports and
> test results from others, as well.

Skimming through the 7th patch, I see claims that FSM is not fully
WAL-logged and thus shouldn't be tracked, and so it indeed doesn't
track those changes.
I disagree with that decision: we now have support for custom resource
managers, which may use the various forks for other purposes than
those used in PostgreSQL right now. It would be a shame if data is
lost because of the backup tool ignoring forks because the PostgreSQL
project itself doesn't have post-recovery consistency guarantees in
that fork. So, unless we document that WAL-logged changes in the FSM
fork are actually not recoverable from backup, regardless of the type
of contents, we should still keep track of the changes in the FSM fork
and include that fork in our backups, or else exclude only those FSM
updates that we know are safe to ignore.

Kind regards,

Matthias van de Meent
Neon, Inc.



Re: trying again to get incremental backup

From
Andres Freund
Date:
Hi,

On 2023-06-14 16:10:38 -0400, Robert Haas wrote:
> On Wed, Jun 14, 2023 at 3:47 PM Andres Freund <andres@anarazel.de> wrote:
> > Could we just recompute the WAL summary for the [redo, end of chunk] for the
> > relevant summary file?
> 
> I'm not understanding how that would help. If we were going to compute
> a WAL summary on the fly rather than waiting for one to show up on
> disk, what we'd want is [end of last WAL summary that does exist on
> disk, redo].

Oh, right.


> But I'm not sure that's a great approach, because that LSN gap might be
> large and then we're duplicating a lot of work that the summarizer has
> probably already done most of.

I guess that really depends on what the summary granularity is. If you create
a separate summary every 32MB or so, recomputing just the required range
shouldn't be too bad.


> > FWIW, I like the idea of a special WAL record at that point, independent of
> > this feature. It wouldn't be a meaningful overhead compared to the cost of a
> > checkpoint, and it seems like it'd be quite useful for debugging. But I can
> > see uses going beyond that - we occasionally have been discussing associating
> > additional data with redo points, and that'd be a lot easier to deal with
> > during recovery with such a record.
> >
> > I don't really see a performance and concurrency angle right now - what are
> > you wondering about?
> 
> I'm not really sure. I expect Dilip would be happy to post his patch,
> and if you'd be willing to have a look at it and express your concerns
> or lack thereof, that would be super valuable.

Will do. Adding me to CC: might help, I have a backlog unfortunately :(.


Greetings,

Andres Freund



Re: trying again to get incremental backup

From
Dilip Kumar
Date:
On Thu, Jun 15, 2023 at 2:11 AM Andres Freund <andres@anarazel.de> wrote:
>

> > I'm not really sure. I expect Dilip would be happy to post his patch,
> > and if you'd be willing to have a look at it and express your concerns
> > or lack thereof, that would be super valuable.
>
> Will do. Adding me to CC: might help, I have a backlog unfortunately :(.

Thanks, I have posted it here[1]

[1] https://www.postgresql.org/message-id/CAFiTN-s-K%3DmVA%3DHPr_VoU-5bvyLQpNeuzjq1ebPJMEfCJZKFsg%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Wed, Jun 14, 2023 at 4:40 PM Andres Freund <andres@anarazel.de> wrote:
> > But I'm not sure that's a great approach, because that LSN gap might be
> > large and then we're duplicating a lot of work that the summarizer has
> > probably already done most of.
>
> I guess that really depends on what the summary granularity is. If you create
> a separate summary every 32MB or so, recomputing just the required range
> shouldn't be too bad.

Yeah, but I don't think that's the right approach, for two reasons.
First, one of the things I'm rather worried about is what happens when
the WAL distance between the prior backup and the incremental backup
is large. It could be a terabyte. If we have a WAL summary for every
32MB of WAL, that's 32k files we have to read, and I'm concerned
that's too many. Maybe it isn't, but it's something that has really
been weighing on my mind as I've been thinking through the design
questions here. The files are really very small, and having to open a
bazillion tiny little files to get the job done sounds lame. Second, I
don't see what problem it actually solves. Why not just signal the
summarizer to write out the accumulated data to a file instead of
re-doing the work ourselves? Or else adopt the
WAL-record-at-the-redo-pointer approach, and then the whole thing is
moot?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Andres Freund
Date:
Hi,

On 2023-06-19 09:46:12 -0400, Robert Haas wrote:
> On Wed, Jun 14, 2023 at 4:40 PM Andres Freund <andres@anarazel.de> wrote:
> > > But I'm not sure that's a great approach, because that LSN gap might be
> > > large and then we're duplicating a lot of work that the summarizer has
> > > probably already done most of.
> >
> > I guess that really depends on what the summary granularity is. If you create
> > a separate summary every 32MB or so, recomputing just the required range
> > shouldn't be too bad.
> 
> Yeah, but I don't think that's the right approach, for two reasons.
> First, one of the things I'm rather worried about is what happens when
> the WAL distance between the prior backup and the incremental backup
> is large. It could be a terabyte. If we have a WAL summary for every
> 32MB of WAL, that's 32k files we have to read, and I'm concerned
> that's too many. Maybe it isn't, but it's something that has really
> been weighing on my mind as I've been thinking through the design
> questions here.

It doesn't have to be a separate file - you could easily summarize ranges
at a higher granularity, storing multiple ranges into a single file with a
coarser naming pattern.


> The files are really very small, and having to open a bazillion tiny little
> files to get the job done sounds lame. Second, I don't see what problem it
> actually solves. Why not just signal the summarizer to write out the
> accumulated data to a file instead of re-doing the work ourselves? Or else
> adopt the WAL-record-at-the-redo-pointer approach, and then the whole thing
> is moot?

The one point for a relatively fine-grained summarization scheme that I see is that
it would pave the way for using the WAL summary data for other purposes in the
future. That could be done orthogonal to the other solutions to the redo
pointer issues.

Other potential use cases:

- only restore parts of a base backup that aren't going to be overwritten by
  WAL replay
- reconstructing database contents from WAL after data loss
- more efficient pg_rewind
- more efficient prefetching during WAL replay


Greetings,

Andres Freund



Re: trying again to get incremental backup

From
Robert Haas
Date:
Hi,

In the limited time that I've had to work on this project lately, I've
been trying to come up with a test case for this feature -- and since
I've gotten completely stuck, I thought it might be time to post and
see if anyone else has a better idea. I thought a reasonable test case
would be: Do a full backup. Change some stuff. Do an incremental
backup. Restore both backups and perform replay to the same LSN. Then
compare the files on disk. But I cannot make this work. The first
problem I ran into was that replay of the full backup does a
restartpoint, while the replay of the incremental backup does not.
That results in, for example, pg_subtrans having different contents.
I'm not sure whether it can also result in data files having different
contents: are changes that we replayed following the last restartpoint
guaranteed to end up on disk when the server is shut down? It wasn't
clear to me that this is the case. I thought maybe I could get both
servers to perform a restartpoint at the same location by shutting
down the primary and then replaying through the shutdown checkpoint,
but that doesn't work because the primary doesn't finish archiving
before shutting down. After some more fiddling I settled (at least for
research purposes) on having the restored backups PITR and promote,
instead of PITR and pause, so that we're guaranteed a checkpoint. But
that just caused me to run into a far worse problem: replay on the
standby doesn't actually create a state that is byte-for-byte
identical to the one that exists on the primary. I quickly discovered
that in my test case, I was ending up with different contents in the
"hole" of a block wherein a tuple got updated. Replay doesn't think
it's important to make the hole end up with the same contents on all
machines that replay the WAL, so I end up with one server that has
more junk in there than the other one and the tests fail.

Unless someone has a brilliant idea that I lack, this suggests to me
that this whole line of testing is a dead end. I can, of course, write
tests that compare clusters *logically* -- do the correct relations
exist, are they accessible, do they have the right contents? But I
feel like it would be easy to have bugs that escape detection in such
a test but would be detected by a physical comparison of the clusters.
However, such a comparison can only be conducted if either (a) there's
some way to set up the test so that byte-for-byte identical clusters
can be expected or (b) there's some way to perform the comparison that
can distinguish between expected, harmless differences and unexpected,
problematic differences. And at the moment my conclusion is that
neither (a) nor (b) exists. Does anyone think otherwise?

Meanwhile, here's a rebased set of patches. The somewhat-primitive
attempts at writing tests are in 0009, but they don't work, for the
reasons explained above. I think I'd probably like to go ahead and
commit 0001 and 0002 soon if there are no objections, since I think
those are good refactorings independently of the rest of this.

...Robert

Attachment

Re: trying again to get incremental backup

From
David Steele
Date:
Hi Robert,

On 8/30/23 10:49, Robert Haas wrote:
> In the limited time that I've had to work on this project lately, I've
> been trying to come up with a test case for this feature -- and since
> I've gotten completely stuck, I thought it might be time to post and
> see if anyone else has a better idea. I thought a reasonable test case
> would be: Do a full backup. Change some stuff. Do an incremental
> backup. Restore both backups and perform replay to the same LSN. Then
> compare the files on disk. But I cannot make this work. The first
> problem I ran into was that replay of the full backup does a
> restartpoint, while the replay of the incremental backup does not.
> That results in, for example, pg_subtrans having different contents.

pg_subtrans, at least, can be ignored since it is excluded from the 
backup and not required for recovery.

> I'm not sure whether it can also result in data files having different
> contents: are changes that we replayed following the last restartpoint
> guaranteed to end up on disk when the server is shut down? It wasn't
> clear to me that this is the case. I thought maybe I could get both
> servers to perform a restartpoint at the same location by shutting
> down the primary and then replaying through the shutdown checkpoint,
> but that doesn't work because the primary doesn't finish archiving
> before shutting down. After some more fiddling I settled (at least for
> research purposes) on having the restored backups PITR and promote,
> instead of PITR and pause, so that we're guaranteed a checkpoint. But
> that just caused me to run into a far worse problem: replay on the
> standby doesn't actually create a state that is byte-for-byte
> identical to the one that exists on the primary. I quickly discovered
> that in my test case, I was ending up with different contents in the
> "hole" of a block wherein a tuple got updated. Replay doesn't think
> it's important to make the hole end up with the same contents on all
> machines that replay the WAL, so I end up with one server that has
> more junk in there than the other one and the tests fail.

This is pretty much what I discovered when investigating backup from 
standby back in 2016. My (ultimately unsuccessful) efforts to find a 
clean delta resulted in [1] as I systematically excluded directories 
that are not required for recovery and will not be synced between a 
primary and standby.

FWIW Heikki also made similar attempts at this before me (back then I 
found the thread but I doubt I could find it again) and arrived at 
similar results. We discussed this in person and figured out that we had 
come to more or less the same conclusion. Welcome to the club!

> Unless someone has a brilliant idea that I lack, this suggests to me
> that this whole line of testing is a dead end. I can, of course, write
> tests that compare clusters *logically* -- do the correct relations
> exist, are they accessible, do they have the right contents? But I
> feel like it would be easy to have bugs that escape detection in such
> a test but would be detected by a physical comparison of the clusters.

Agreed, though a matching logical result is still very compelling.

> However, such a comparison can only be conducted if either (a) there's
> some way to set up the test so that byte-for-byte identical clusters
> can be expected or (b) there's some way to perform the comparison that
> can distinguish between expected, harmless differences and unexpected,
> problematic differences. And at the moment my conclusion is that
> neither (a) nor (b) exists. Does anyone think otherwise?

I do not. My conclusion back then was that validating a physical 
comparison would be nearly impossible without changes to Postgres to 
make the primary and standby match via replication. Which, by the way, I 
still think would be a great idea. In principle, at least. Replay is 
already a major bottleneck and anything that makes it slower will likely 
not be very popular.

This would also be great for WAL, since last time I tested the same WAL 
segment can be different between the primary and standby because the 
unused (and recycled) portion at the end is not zeroed as it is on the 
primary (but logically they do match). I would be very happy if somebody 
told me that my info is out of date here and this has been fixed. But 
when I looked at the code it was incredibly tricky to do this because of 
how WAL is replicated.

> Meanwhile, here's a rebased set of patches. The somewhat-primitive
> attempts at writing tests are in 0009, but they don't work, for the
> reasons explained above. I think I'd probably like to go ahead and
> commit 0001 and 0002 soon if there are no objections, since I think
> those are good refactorings independently of the rest of this.

No objections to 0001/0002.

Regards,
-David

[1] 
http://git.postgresql.org/pg/commitdiff/6ad8ac6026287e3ccbc4d606b6ab6116ccc0eec8



Re: trying again to get incremental backup

From
Robert Haas
Date:
Hey, thanks for the reply.

On Thu, Aug 31, 2023 at 6:50 PM David Steele <david@pgmasters.net> wrote:
> pg_subtrans, at least, can be ignored since it is excluded from the
> backup and not required for recovery.

I agree...

> Welcome to the club!

Thanks for the welcome, but being a member feels *terrible*. :-)

> I do not. My conclusion back then was that validating a physical
> comparison would be nearly impossible without changes to Postgres to
> make the primary and standby match via replication. Which, by the way, I
> still think would be a great idea. In principle, at least. Replay is
> already a major bottleneck and anything that makes it slower will likely
> not be very popular.

Fair point. But maybe the bigger issue is the work involved. I don't
think zeroing the hole in all cases would likely be that expensive,
but finding everything that can cause the standby to diverge from the
primary and fixing all of it sounds like an unpleasant amount of
effort. Still, it's good to know that I'm not missing something
obvious.

> No objections to 0001/0002.

Cool.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Dilip Kumar
Date:
On Wed, Aug 30, 2023 at 9:20 PM Robert Haas <robertmhaas@gmail.com> wrote:

> Unless someone has a brilliant idea that I lack, this suggests to me
> that this whole line of testing is a dead end. I can, of course, write
> tests that compare clusters *logically* -- do the correct relations
> exist, are they accessible, do they have the right contents?

Can't we think of comparing at the block level, like we can compare
each block but ignore the content of the hole?
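
For what it's worth, masking just the hole is simple enough; a minimal
sketch, assuming standard 8kB pages and the usual page header layout
with pd_lower/pd_upper at byte offsets 12 and 14, could look like this
(though, as the reply below points out, the hole is not the only way
replayed pages can legitimately differ):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

/* Compare two page images, ignoring the hole between pd_lower and pd_upper. */
static bool
pages_equal_ignoring_hole(const uint8_t *a, const uint8_t *b)
{
    uint16_t    a_lower, a_upper, b_lower, b_upper;

    memcpy(&a_lower, a + 12, sizeof(uint16_t));
    memcpy(&a_upper, a + 14, sizeof(uint16_t));
    memcpy(&b_lower, b + 12, sizeof(uint16_t));
    memcpy(&b_upper, b + 14, sizeof(uint16_t));

    /* The holes must at least be in the same place on both pages. */
    if (a_lower != b_lower || a_upper != b_upper)
        return false;

    /* Treat an implausible header as a mismatch rather than guessing. */
    if (a_lower > a_upper || a_upper > BLCKSZ)
        return false;

    /* Compare everything before the hole and everything after it. */
    return memcmp(a, b, a_lower) == 0 &&
        memcmp(a + a_upper, b + b_upper, BLCKSZ - a_upper) == 0;
}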

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Mon, Sep 4, 2023 at 8:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Can't we think of comparing at the block level, like we can compare
> each block but ignore the content of the hole?

We could do that, but I don't think that's a full solution. I think
I'd end up having to reimplement the equivalent of heap_mask,
btree_mask, et al. in Perl, which doesn't seem very reasonable. It's
fairly complicated logic even written in C, and doing the right thing
in Perl would be more complex, I think, because it wouldn't have
access to all the same #defines which depend on things like word size
and endianness and stuff. If we want to allow this sort of comparison,
I feel we should think of changing the C code in some way to make it
work reliably rather than try to paper over the problems in Perl.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Dilip Kumar
Date:
On Wed, Aug 30, 2023 at 9:20 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Meanwhile, here's a rebased set of patches. The somewhat-primitive
> attempts at writing tests are in 0009, but they don't work, for the
> reasons explained above. I think I'd probably like to go ahead and
> commit 0001 and 0002 soon if there are no objections, since I think
> those are good refactorings independently of the rest of this.
>

I have started reading the patch today. I haven't yet completed one
pass, but here are my comments on 0007:

1.

+ BlockNumber relative_block_numbers[RELSEG_SIZE];

This is close to 400kB of memory, so I think it is better we palloc it
instead of keeping it in the stack.

2.
  /*
  * Try to parse the directory name as an unsigned integer.
  *
- * Tablespace directories should be positive integers that can
- * be represented in 32 bits, with no leading zeroes or trailing
+ * Tablespace directories should be positive integers that can be
+ * represented in 32 bits, with no leading zeroes or trailing
  * garbage. If we come across a name that doesn't meet those
  * criteria, skip it.

Unrelated code refactoring hunk

3.
+typedef struct
+{
+ const char *filename;
+ pg_checksum_context *checksum_ctx;
+ bbsink    *sink;
+ size_t bytes_sent;
+} FileChunkContext;

This structure is not used anywhere.

4.
+ * If the file is to be set incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks

/If the file is to be set incrementally/If the file is to be sent incrementally

5.
- while (bytes_done < statbuf->st_size)
+ while (1)
  {
- size_t remaining = statbuf->st_size - bytes_done;
+ /*

I do not really like this change, because after removing this you have
put 2 independent checks for sending the full file[1] and sending it
incrementally[2].  Actually for sending incrementally
'statbuf->st_size' is computed from the 'num_incremental_blocks'
itself so why don't we keep this breaking condition in the while loop
itself?  So that we can avoid these two separate conditions.

[1]
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;

[2]
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;


6.
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;

Better we add some comments for these structures.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Jakub Wartak
Date:
On Wed, Aug 30, 2023 at 4:50 PM Robert Haas <robertmhaas@gmail.com> wrote:
[..]

I've played a little bit more with this second batch of patches on
e8d74ad625f7344f6b715254d3869663c1569a51 @ 31Aug (days before wait
events refactor):

test_across_wallevelminimal.sh
test_many_incrementals_dbcreate.sh
test_many_incrementals.sh
test_multixact.sh
test_pending_2pc.sh
test_reindex_and_vacuum_full.sh
test_truncaterollback.sh
test_unlogged_table.sh

all those basic tests had GOOD results. Please find attached. I'll try
to schedule some more realistic (in terms of workload and sizes) test
in a couple of days + maybe have some fun with cross-backup-and
restores across standbys. As per earlier doubt: raw wal_level =
minimal situation, shouldn't be a concern, sadly because it requires
max_wal_senders==0, while pg_basebackup requires it above 0 (due to
"FATAL:  number of requested standby connections exceeds
max_wal_senders (currently 0)").

I wanted to also introduce corruption into pg_walsummaries files, but
later saw in the code that they are already covered by a CRC32, cool.

In v07:
> +#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 160000

170000 ?

> A related design question is whether we should really be sending the
> whole backup manifest to the server at all. If it turns out that we
> don't really need anything except for the LSN of the previous backup,
> we could send that one piece of information instead of everything. On
> the other hand, if we need the list of files from the previous backup,
> then sending the whole manifest makes sense.

If that is still an area open for discussion: wouldn't it be better to
just specify LSN as it would allow resyncing standby across major lag
where the WAL to replay would be enormous? Given that we had
primary->standby where standby would be stuck on some LSN, right now
it would be:
1) calculate backup manifest of desynced 10TB standby (how? using
which tool?)  - even if possible, that means reading 10TB of data
instead of just putting a number, isn't it?
2) backup primary with such incremental backup >= LSN
3) copy the incremental backup to standby
4) apply it to the impaired standby
5) restart the WAL replay

> - We only know how to operate on directories, not tar files. I thought
> about that when working on pg_verifybackup as well, but I didn't do
> anything about it. It would be nice to go back and make that tool work
> on tar-format backups, and this one, too. I don't think there would be
> a whole lot of point trying to operate on compressed tar files because
> you need random access and that seems hard on a compressed file, but
> on uncompressed files it seems at least theoretically doable. I'm not
> sure whether anyone would care that much about this, though, even
> though it does sound pretty cool.

Also maybe it's too early to ask, but wouldn't it be nice if we could
have a future option in pg_combinebackup to avoid double writes when
used from restore hosts (right now we need to first reconstruct the
original datadir from the full and incremental backups on the host hosting
the backups and then TRANSFER it again to the target host)? So something
like that could work well from restorehost: pg_combinebackup
/tmp/backup1 /tmp/incbackup2 /tmp/incbackup3 -O tar -o - | ssh
dbserver 'tar xvf -C /path/to/restored/cluster - ' . The bad thing is
that such a pipe prevents parallelism from day 1 and I'm afraid I do
not have a better easy idea on how to have both at the same time in
the long term.

-J.

Attachment

Re: trying again to get incremental backup

From
Robert Haas
Date:
On Fri, Sep 1, 2023 at 10:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > No objections to 0001/0002.
>
> Cool.

Nobody else objected either, so I went ahead and committed those. I'll
rebase the rest of the patches on top of the latest master and repost,
hopefully after addressing some of the other review comments from
Dilip and Jakub.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Tue, Sep 12, 2023 at 5:56 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> + BlockNumber relative_block_numbers[RELSEG_SIZE];
>
> This is close to 400kB of memory, so I think it is better we palloc it
> instead of keeping it in the stack.

Fixed.

> Unrelated code refactoring hunk

Fixed.

> This structure is not used anywhere.

Removed.

> /If the file is to be set incrementally/If the file is to be sent incrementally

Fixed.

> I do not really like this change, because after removing this you have
> put 2 independent checks for sending the full file[1] and sending it
> incrementally[2].  Actually for sending incrementally
> 'statbuf->st_size' is computed from the 'num_incremental_blocks'
> itself so why don't we keep this breaking condition in the while loop
> itself?  So that we can avoid these two separate conditions.

I don't think that would be correct. The number of bytes that need to
be read from the original file is not equal to the number of bytes
that will be written to the incremental file. Admittedly, they're
currently different by less than a block, but that could change if we
change the format of the incremental file (e.g. suppose we compressed
the blocks in the incremental file with gzip, or smushed out the holes
in the pages). I wrote the loop as I did precisely so that the two
cases could have different loop exit conditions.

> Better we add some comments for these structures.

Done.

Here's a new patch set, also addressing Jakub's observation that
MINIMUM_VERSION_FOR_WAL_SUMMARIES needed updating.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: trying again to get incremental backup

From
Robert Haas
Date:
On Thu, Sep 28, 2023 at 6:22 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> all those basic tests had GOOD results. Please find attached. I'll try
> to schedule some more realistic (in terms of workload and sizes) test
> in a couple of days + maybe have some fun with cross-backup-and
> restores across standbys.

That's awesome! Thanks for testing! This can definitely benefit from
any amount of beating on it that people wish to do. It's a complex,
delicate area that risks data loss.

> If that is still an area open for discussion: wouldn't it be better to
> just specify LSN as it would allow resyncing standby across major lag
> where the WAL to replay would be enormous? Given that we had
> primary->standby where standby would be stuck on some LSN, right now
> it would be:
> 1) calculate backup manifest of desynced 10TB standby (how? using
> which tool?)  - even if possible, that means reading 10TB of data
> instead of just putting a number, isn't it?
> 2) backup primary with such incremental backup >= LSN
> 3) copy the incremental backup to standby
> 4) apply it to the impaired standby
> 5) restart the WAL replay

Hmm. I wonder if this would even be a safe procedure. I admit that I
can't quite see a problem with it, but sometimes I'm kind of dumb.

> Also maybe it's too early to ask, but wouldn't it be nice if we could
> have a future option in pg_combinebackup to avoid double writes when
> used from restore hosts (right now we need to first reconstruct the
> original datadir from the full and incremental backups on the host hosting
> the backups and then TRANSFER it again to the target host)? So something
> like that could work well from restorehost: pg_combinebackup
> /tmp/backup1 /tmp/incbackup2 /tmp/incbackup3 -O tar -o - | ssh
> dbserver 'tar xvf -C /path/to/restored/cluster - ' . The bad thing is
> that such a pipe prevents parallelism from day 1 and I'm afraid I do
> not have a better easy idea on how to have both at the same time in
> the long term.

I don't think it's too early to ask for this, but I do think it's too
early for you to get it. ;-)

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Tue, Oct 3, 2023 at 2:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Here's a new patch set, also addressing Jakub's observation that
> MINIMUM_VERSION_FOR_WAL_SUMMARIES needed updating.

Here's yet another new version. In this version, I reversed the order
of the first two patches, with the idea that what's now 0001 seems
fairly reasonable as an independent commit, and could thus perhaps be
committed sometime soon-ish. In the main patch, I added SGML
documentation for pg_combinebackup. I also fixed the broken TAP tests
so that they work, by basing them on pg_dump equivalence rather than
file-level equivalence. I'm sad to give up on testing the latter, but
it seems to be unrealistic. I cleaned up a few other odds and ends,
too. But, what exactly is the bigger picture for this patch in terms
of moving forward? Here's a list of things that are on my mind:

- I'd like to get the patch to mark the redo point in the WAL
committed[1] and then reword this patch set to make use of that
infrastructure. Right now, we make a best effort to end WAL summaries
at redo point boundaries, but it's racey, and sometimes we fail to do
so. In theory that just has the effect of potentially making an
incremental backup contain some extra blocks that it shouldn't really
need to contain, but I think it can actually lead to weird stalls,
because when an incremental backup is taken, we have to wait until a
WAL summary shows up that extends at least up to the start LSN of the
backup we're about to take. I believe all the logic in this area can
be made a good deal simpler and more reliable if that patch gets
committed and this one reworked accordingly.

- I would like some feedback on the generation of WAL summary files.
Right now, I have it enabled by default, and summaries are kept for a
week. That means that, with no additional setup, you can take an
incremental backup as long as the reference backup was taken in the
last week. File removal is governed by mtimes, so if you change the
mtimes of your summary files or whack your system clock around, weird
things might happen. But obviously this might be inconvenient. Some
people might not want WAL summary files to be generated at all because
they don't care about incremental backup, and other people might want
them retained for longer, and still other people might want them not
to be removed automatically, or removed automatically based on some
criteria other than mtime. I don't really know what's best here. I
don't think the default policy that the patches implement is
especially terrible, but it's just something that I made up and I
don't have any real confidence that it's wonderful. One point to be
consider here is that, if WAL summarization is enabled, checkpoints
can't remove WAL that isn't summarized yet. Mostly that's not a
problem, I think, because the WAL summarizer is pretty fast. But it
could increase disk consumption for some people. I don't think that we
need to worry about the summaries themselves being a problem in terms
of space consumption; at least in all the cases I've tested, they're
just not very big.

- On a related note, I haven't yet tested this on a standby, which is
a thing that I definitely need to do. I don't know of a reason why it
shouldn't be possible for all of this machinery to work on a standby
just as it does on a primary, but then we need the WAL summarizer to
run there too, which could end up being a waste if nobody ever tries
to take an incremental backup. I wonder how that should be reflected
in the configuration. We could do something like what we've done for
archive_mode, where on means "only on if this is a primary" and you
have to say always if you want it to run on standbys as well ... but
I'm not sure if that's a design pattern that we really want to
replicate into more places. I'd be somewhat inclined to just make
whatever configuration parameters we need to configure this thing on
the primary also work on standbys, and you can set each server up as
you please. But I'm open to other suggestions.

- We need to settle the question of whether to send the whole backup
manifest to the server or just the LSN. In a previous attempt at
incremental backup, we decided the whole manifest was necessary,
because flat-copying files could make new data show up with old LSNs.
But that version of the patch set was trying to find modified blocks
by checking their LSNs individually, not by summarizing WAL. And since
the operations that flat-copy files are WAL-logged, the WAL summary
approach seems to eliminate that problem - maybe an LSN (and the
associated TLI) is good enough now. This also relates to Jakub's
question about whether this machinery could be used to fast-forward a
standby, which is not exactly a base backup but  ... perhaps close
enough? I'm somewhat inclined to believe that we can simplify to an
LSN and TLI; however, if we do that, then we'll have big problems if
later we realize that we want the manifest for something after all. So
if anybody thinks that there's a reason to keep doing what the patch
does today -- namely, upload the whole manifest to the server --
please speak up.

- It's regrettable that we don't have incremental JSON parsing; I
think that means anyone who has a backup manifest that is bigger than
1GB can't use this feature. However, that's also a problem for the
existing backup manifest feature, and as far as I can see, we have no
complaints about it. So maybe people just don't have databases with
enough relations for that to be much of a live issue yet. I'm inclined
to treat this as a non-blocker, although Andrew Dunstan tells me he
does have a prototype for incremental JSON parsing so maybe that will
land and we can use it here.

- Right now, I have a hard-coded 60 second timeout for WAL
summarization. If you try to take an incremental backup and the WAL
summaries you need don't show up within 60 seconds, the backup times
out. I think that's a reasonable default, but should it be
configurable? If yes, should that be a GUC or, perhaps better, a
pg_basebackup option?

- I'm curious what people think about the pg_walsummary tool that is
included in 0006. I think it's going to be fairly important for
debugging, but it does feel a little bit bad to add a new binary for
something pretty niche. Nevertheless, merging it into any other
utility seems relatively awkward, so I'm inclined to think both that
this should be included in whatever finally gets committed and that it
should be a separate binary. I considered whether it should go in
contrib, but we seem to have moved to a policy that heavily favors
limiting contrib to extensions and loadable modules, rather than
binaries.

Clearly there's a good amount of stuff to sort out here, but we've
still got quite a bit of time left before feature freeze so I'd like
to have a go at it. Please let me know your thoughts, if you have any.

[1]  http://postgr.es/m/CA+TgmoZAM24Ub=uxP0aWuWstNYTUJQ64j976FYJeVaMJ+qD0uw@mail.gmail.com

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: trying again to get incremental backup

From
Robert Haas
Date:
On Wed, Oct 4, 2023 at 4:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Clearly there's a good amount of stuff to sort out here, but we've
> still got quite a bit of time left before feature freeze so I'd like
> to have a go at it. Please let me know your thoughts, if you have any.

Apparently, nobody has any thoughts, but here's an updated patch set
anyway. The main change, other than rebasing, is that I did a bunch
more documentation work on the main patch (0005). I'm much happier
with it now, although I expect it may need more adjustments here and
there as outstanding design questions get settled.

After some thought, I think that it should be fine to commit 0001 and
0002 as independent refactoring patches, and I plan to go ahead and do
that pretty soon unless somebody objects.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: trying again to get incremental backup

From
David Steele
Date:
On 10/19/23 12:05, Robert Haas wrote:
> On Wed, Oct 4, 2023 at 4:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
>> Clearly there's a good amount of stuff to sort out here, but we've
>> still got quite a bit of time left before feature freeze so I'd like
>> to have a go at it. Please let me know your thoughts, if you have any.
> 
> Apparently, nobody has any thoughts, but here's an updated patch set
> anyway. The main change, other than rebasing, is that I did a bunch
> more documentation work on the main patch (0005). I'm much happier
> with it now, although I expect it may need more adjustments here and
> there as outstanding design questions get settled.
> 
> After some thought, I think that it should be fine to commit 0001 and
> 0002 as independent refactoring patches, and I plan to go ahead and do
> that pretty soon unless somebody objects.

0001 looks pretty good to me. The only thing I find a little troublesome 
is the repeated construction of file names with/without segment numbers 
in ResetUnloggedRelationsInDbspaceDir(), e.g.:

+            if (segno == 0)
+                snprintf(dstpath, sizeof(dstpath), "%s/%u",
+                         dbspacedirname, relNumber);
+            else
+                snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
+                         dbspacedirname, relNumber, segno);


If this happened three times I'd definitely want a helper function, but 
even with two I think it would be a bit nicer.
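
For illustration, something along these lines is what I have in mind -
the helper name, the RelFileNumber type, and the exact signature are
just guesses on my part, not taken from the patch:

static void
unlogged_relation_dstpath(char *dstpath, size_t dstpathlen,
                          const char *dbspacedirname,
                          RelFileNumber relNumber, unsigned segno)
{
    /* Segment 0 has no suffix; later segments are named "<rel>.<segno>". */
    if (segno == 0)
        snprintf(dstpath, dstpathlen, "%s/%u",
                 dbspacedirname, relNumber);
    else
        snprintf(dstpath, dstpathlen, "%s/%u.%u",
                 dbspacedirname, relNumber, segno);
}

Then each of the two places would just be a single
unlogged_relation_dstpath() call.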

0002 is definitely a good idea. FWIW pgBackRest does this conversion but 
also errors if it does not succeed. We have never seen a report of this 
error happening in the wild, so I think it must be pretty rare if it 
does happen.

Regards,
-David



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Thu, Oct 19, 2023 at 3:18 PM David Steele <david@pgmasters.net> wrote:
> 0001 looks pretty good to me. The only thing I find a little troublesome
> is the repeated construction of file names with/without segment numbers
> in ResetUnloggedRelationsInDbspaceDir(), e.g.:
>
> +                       if (segno == 0)
> +                               snprintf(dstpath, sizeof(dstpath), "%s/%u",
> +                                                dbspacedirname, relNumber);
> +                       else
> +                               snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
> +                                                dbspacedirname, relNumber, segno);
>
>
> If this happened three times I'd definitely want a helper function, but
> even with two I think it would be a bit nicer.

Personally I think that would make the code harder to read rather than
easier. I agree that repeating code isn't great, but this is a
relatively brief idiom and pretty self-explanatory. If other people
agree with you I can change it, but to me it's not an improvement.

> 0002 is definitely a good idea. FWIW pgBackRest does this conversion but
> also errors if it does not succeed. We have never seen a report of this
> error happening in the wild, so I think it must be pretty rare if it
> does happen.

Cool, but ... how about the main patch set? It's nice to get some of
these refactoring bits and pieces out of the way, but if I spend the
effort to work out what I think are the right answers to the remaining
design questions for the main patch set and then find out after I've
done all that that you have massive objections, I'm going to be
annoyed. I've been trying to get this feature into PostgreSQL for
years, and if I don't succeed this time, I want the reason to be
something better than "well, I didn't find out that David disliked X
until five minutes before I was planning to type 'git push'."

I'm not really concerned about detailed bug-hunting in the main
patches just yet. The time for that will come. But if you have views
on how to resolve the design questions that I mentioned in a couple of
emails back, or intend to advocate vigorously against the whole
concept for some reason, let's try to sort that out sooner rather than
later.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Jakub Wartak
Date:
Hi Robert,

On Wed, Oct 4, 2023 at 10:09 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Oct 3, 2023 at 2:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Here's a new patch set, also addressing Jakub's observation that
> > MINIMUM_VERSION_FOR_WAL_SUMMARIES needed updating.
>
> Here's yet another new version.[..]

Okay, so more good news related to patch version #4. I ran a
not-so-tiny stress test: pgbench for 24h straight (with incremental
backups every 2h, based on the initial full backup), followed by two
PITRs (one not using an incremental backup and one using one, to
illustrate the performance point - and potentially spot any errors in
between). In both cases it worked fine. Pgbench doesn't cause space
growth over time - it produces lots of WAL instead. Some stats:

START DBSIZE: ~3.3GB (pgbench -i -s 200 --partitions=8)
END DBSIZE: ~4.3GB
RUN DURATION: 24h (pgbench -P 1 -R 100 -T 86400)
WALARCHIVES-24h: 77GB
FULL-DB-BACKUP-SIZE: 3.4GB
INCREMENTAL-BACKUP-11-SIZE: 3.5GB
Env: Azure VM D4s (4VCPU), Debian 11, gcc 10.2, normal build (asserts
and debug disabled)
The increments were taken every 2h just to see if they would fail for
any reason - they did not.

PITR RTO RESULTS (copy/pg_combinebackup time + recovery time):
1. time to restore from fullbackup (+ recovery of 24h WAL[77GB]): 53s
+ 4640s =~ 78min
2. time to restore from fullbackup+incremental backup from 2h ago (+
recovery of 2h WAL [5.4GB]): 68s + 190s =~ 4min18s

I could probably pre-populate the DB with 1TB of cold data (not
touched by pgbench at all), just for the sake of argument, and that
would demonstrate how space-efficient the incremental backup can be,
but most of the time would be wasted on copying the 1TB here...

> - I would like some feedback on the generation of WAL summary files.
> Right now, I have it enabled by default, and summaries are kept for a
> week. That means that, with no additional setup, you can take an
> incremental backup as long as the reference backup was taken in the
> last week.

I've just noticed one thing while recovery is in progress: is
summarization working during recovery - in the background - expected
behaviour? I'm wondering about that because, after the DB is freshly
restored and recovered, one would need to create a *new* full backup,
and only from that point would new summaries have any use?
Sample log:

2023-10-20 11:10:02.288 UTC [64434] LOG:  restored log file
"000000010000000200000022" from archive
2023-10-20 11:10:02.599 UTC [64434] LOG:  restored log file
"000000010000000200000023" from archive
2023-10-20 11:10:02.769 UTC [64446] LOG:  summarized WAL on TLI 1 from
2/139B1130 to 2/239B1518
2023-10-20 11:10:02.923 UTC [64434] LOG:  restored log file
"000000010000000200000024" from archive
2023-10-20 11:10:03.193 UTC [64434] LOG:  restored log file
"000000010000000200000025" from archive
2023-10-20 11:10:03.345 UTC [64432] LOG:  restartpoint starting: wal
2023-10-20 11:10:03.407 UTC [64446] LOG:  summarized WAL on TLI 1 from
2/239B1518 to 2/25B609D0
2023-10-20 11:10:03.521 UTC [64434] LOG:  restored log file
"000000010000000200000026" from archive
2023-10-20 11:10:04.429 UTC [64434] LOG:  restored log file
"000000010000000200000027" from archive

>
> - On a related note, I haven't yet tested this on a standby, which is
> a thing that I definitely need to do. I don't know of a reason why it
> shouldn't be possible for all of this machinery to work on a standby
> just as it does on a primary, but then we need the WAL summarizer to
> run there too, which could end up being a waste if nobody ever tries
> to take an incremental backup. I wonder how that should be reflected
> in the configuration. We could do something like what we've done for
> archive_mode, where on means "only on if this is a primary" and you
> have to say always if you want it to run on standbys as well ... but
> I'm not sure if that's a design pattern that we really want to
> replicate into more places. I'd be somewhat inclined to just make
> whatever configuration parameters we need to configure this thing on
> the primary also work on standbys, and you can set each server up as
> you please. But I'm open to other suggestions.

I'll try to play with some standby restores in the future - stay tuned.

Regards,
-J.



Re: trying again to get incremental backup

From
David Steele
Date:
On 10/19/23 16:00, Robert Haas wrote:
> On Thu, Oct 19, 2023 at 3:18 PM David Steele <david@pgmasters.net> wrote:
>> 0001 looks pretty good to me. The only thing I find a little troublesome
>> is the repeated construction of file names with/without segment numbers
>> in ResetUnloggedRelationsInDbspaceDir(), e.g.:
>>
>> +                       if (segno == 0)
>> +                               snprintf(dstpath, sizeof(dstpath), "%s/%u",
>> +                                                dbspacedirname, relNumber);
>> +                       else
>> +                               snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
>> +                                                dbspacedirname, relNumber, segno);
>>
>>
>> If this happened three times I'd definitely want a helper function, but
>> even with two I think it would be a bit nicer.
> 
> Personally I think that would make the code harder to read rather than
> easier. I agree that repeating code isn't great, but this is a
> relatively brief idiom and pretty self-explanatory. If other people
> agree with you I can change it, but to me it's not an improvement.

Then I'm fine with it as is.

>> 0002 is definitely a good idea. FWIW pgBackRest does this conversion but
>> also errors if it does not succeed. We have never seen a report of this
>> error happening in the wild, so I think it must be pretty rare if it
>> does happen.
> 
> Cool, but ... how about the main patch set? It's nice to get some of
> these refactoring bits and pieces out of the way, but if I spend the
> effort to work out what I think are the right answers to the remaining
> design questions for the main patch set and then find out after I've
> done all that that you have massive objections, I'm going to be
> annoyed. I've been trying to get this feature into PostgreSQL for
> years, and if I don't succeed this time, I want the reason to be
> something better than "well, I didn't find out that David disliked X
> until five minutes before I was planning to type 'git push'."

I simply have not had time to look at the main patch set in any detail.

> I'm not really concerned about detailed bug-hunting in the main
> patches just yet. The time for that will come. But if you have views
> on how to resolve the design questions that I mentioned in a couple of
> emails back, or intend to advocate vigorously against the whole
> concept for some reason, let's try to sort that out sooner rather than
> later.

In my view this feature puts the cart way before the horse. I'd think 
higher priority features might be parallelism, a backup repository, 
expiration management, archiving, or maybe even a restore command.

It seems the only goal here is to make pg_basebackup a tool for external 
backup software to use, which might be OK, but I don't believe this 
feature really advances pg_basebackup as a usable piece of stand-alone 
software. If people really think that start/stop backup is too 
complicated an interface, how are they supposed to track page 
incrementals and get them to a place where pg_combinebackup can put them 
back together? If automation is required to use this feature, 
shouldn't pg_basebackup implement that automation?

I have plenty of thoughts about the implementation as well, but I have a 
lot on my plate right now and I don't have time to get into it.

I don't plan to stand in your way on this feature. I'm reviewing what 
patches I can out of courtesy and to be sure that nothing adjacent to 
your work is being affected. My apologies if my reviews are not meeting 
your expectations, but I am contributing as my time constraints allow.

Regards,
-David



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Fri, Oct 20, 2023 at 11:30 AM David Steele <david@pgmasters.net> wrote:
> Then I'm fine with it as is.

OK, thanks.

> In my view this feature puts the cart way before the horse. I'd think
> higher priority features might be parallelism, a backup repository,
> expiration management, archiving, or maybe even a restore command.
>
> It seems the only goal here is to make pg_basebackup a tool for external
> backup software to use, which might be OK, but I don't believe this
> feature really advances pg_basebackup as a usable piece of stand-alone
> software. If people really think that start/stop backup is too
> complicated an interface, how are they supposed to track page
> incrementals and get them to a place where pg_combinebackup can put them
> back together? If automation is required to use this feature,
> shouldn't pg_basebackup implement that automation?
>
> I have plenty of thoughts about the implementation as well, but I have a
> lot on my plate right now and I don't have time to get into it.
>
> I don't plan to stand in your way on this feature. I'm reviewing what
> patches I can out of courtesy and to be sure that nothing adjacent to
> your work is being affected. My apologies if my reviews are not meeting
> your expectations, but I am contributing as my time constraints allow.

Sorry, I realize reading this response that I probably didn't do a
very good job writing that email and came across sounding like a jerk.
Possibly, I actually am a jerk. Whether it just sounded like it or I
actually am, I apologize. But your last paragraph here gets at my real
question, which is whether you were going to try to block the feature.
I recognize that we have different priorities when it comes to what
would make most sense to implement in PostgreSQL, and that's OK, or at
least, it's OK with me. I also don't have any particular expectation
about how much you should review the patch or in what level of detail,
and I have sincerely appreciated your feedback thus far. If you are
able to continue to provide more, that's great, and if that's not,
well, you're not obligated. What I was concerned about was whether
that review was a precursor to a vigorous attempt to keep the main
patch from getting committed, because if that was going to be the
case, then I'd like to surface that conflict sooner rather than later.
It sounds like that's not an issue, which is great.

At the risk of drifting into the fraught question of what I *should*
be implementing rather than the hopefully-less-fraught question of
whether what I am actually implementing is any good, I see incremental
backup as a way of removing some of the use cases for the low-level
backup API. If you said "but people still will have lots of reasons to
use it," I would agree; and if you said "people can still screw things
up with pg_basebackup," I would also agree. Nonetheless, most of the
disasters I've personally seen have stemmed from the use of the
low-level API rather than from the use of pg_basebackup, though there
are exceptions. I also think a lot of the use of the low-level API is
driven by it being just too darn slow to copy the whole database, and
incremental backup can help with that in some circumstances. Also, I
have worked fairly hard to try to make sure that if you misuse
pg_combinebackup, or fail to use it altogether, you'll get an error
rather than silent data corruption. I would be interested to hear
about scenarios where the checks that I've implemented can be defeated
by something that is plausibly described as stupidity rather than
malice. I'm not sure we can fix all such cases, but I'm very alert to
the horror that will befall me if user error looks exactly like a bug
in the code. For my own sanity, we have to be able to distinguish
those cases. Moreover, we also need to be able to distinguish
backup-time bugs from reassembly-time bugs, which is why I've got the
pg_walsummary tool, and why pg_combinebackup has the ability to emit
fairly detailed debugging output. I anticipate those things being
useful in investigating bug reports when they show up. I won't be too
surprised if it turns out that more work on sanity-checking and/or
debugging tools is needed, but I think your concern about people
misusing stuff is bang on target and I really want to do whatever we
can to avoid that when possible and detect it when it happens.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Fri, Oct 20, 2023 at 9:20 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> Okay, so more good news related to patch version #4. I ran a
> not-so-tiny stress test: pgbench for 24h straight (with incremental
> backups every 2h, based on the initial full backup), followed by two
> PITRs (one not using an incremental backup and one using one, to
> illustrate the performance point - and potentially spot any errors in
> between). In both cases it worked fine.

This is great testing, thanks. What might be even better is to test
whether the resulting backups are correct, somehow.

> I've just noticed one thing while recovery is in progress: is
> summarization working during recovery - in the background - expected
> behaviour? I'm wondering about that because, after the DB is freshly
> restored and recovered, one would need to create a *new* full backup,
> and only from that point would new summaries have any use?

Actually, I think you could take an incremental backup relative to a
full backup from a previous timeline.

But the question of what summarization ought to do (or not do) during
recovery, and whether it ought to be enabled by default, and what the
retention policy ought to be are very much live ones. Right now, it's
enabled by default and keeps summaries for a week, assuming you don't
reset your local clock and that it advances at the same speed as the
universe's own clock. But that's all debatable. Any views?

Meanwhile, here's a new patch set. I went ahead and committed the
first two preparatory patches, as I said earlier that I intended to
do. And here I've adjusted the main patch, which is now 0003, for the
addition of XLOG_CHECKPOINT_REDO, which permitted me to simplify a few
things. wal_summarize_mb now feels like a bit of a silly GUC --
presumably you'd never care, unless you had an absolutely gigantic
inter-checkpoint WAL distance. And if you have that, maybe you should
also have enough memory to summarize all that WAL. Or maybe not:
perhaps it's better to write WAL summaries more than once per
checkpoint when checkpoints are really big. But I'm worried that the
GUC will become a source of needless confusion for users. For most
people, it seems like emitting one summary per checkpoint should be
totally fine, and they might prefer a simple Boolean GUC,
summarize_wal = true | false, over this. I'm just not quite sure about
the corner cases.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: trying again to get incremental backup

From
David Steele
Date:
On 10/23/23 11:44, Robert Haas wrote:
> On Fri, Oct 20, 2023 at 11:30 AM David Steele <david@pgmasters.net> wrote:
>>
>> I don't plan to stand in your way on this feature. I'm reviewing what
>> patches I can out of courtesy and to be sure that nothing adjacent to
>> your work is being affected. My apologies if my reviews are not meeting
>> your expectations, but I am contributing as my time constraints allow.
> 
> Sorry, I realize reading this response that I probably didn't do a
> very good job writing that email and came across sounding like a jerk.
> Possibly, I actually am a jerk. Whether it just sounded like it or I
> actually am, I apologize. 

That was the way it came across, though I prefer to think it was 
unintentional. I certainly understand how frustrating dealing with a 
large and uncertain patch can be. Either way, I appreciate the apology.

Now onward...

> But your last paragraph here gets at my real
> question, which is whether you were going to try to block the feature.
> I recognize that we have different priorities when it comes to what
> would make most sense to implement in PostgreSQL, and that's OK, or at
> least, it's OK with me. 

This seems perfectly natural to me.

> I also don't have any particular expectation
> about how much you should review the patch or in what level of detail,
> and I have sincerely appreciated your feedback thus far. If you are
> able to continue to provide more, that's great, and if that's not,
> well, you're not obligated. What I was concerned about was whether
> that review was a precursor to a vigorous attempt to keep the main
> patch from getting committed, because if that was going to be the
> case, then I'd like to surface that conflict sooner rather than later.
> It sounds like that's not an issue, which is great.

Overall I would say I'm not strongly for or against the patch. I think 
it will be very difficult to use in a manual fashion, but automation is 
the way to go in general, so that's not necessarily an argument against.

However, this is an area of great interest to me, so I do want to at 
least make sure that nothing adjacent to the main goal of this patch is 
being impacted. So far I have seen no sign of that, but that has been a 
primary goal of my reviews.

> At the risk of drifting into the fraught question of what I *should*
> be implementing rather than the hopefully-less-fraught question of
> whether what I am actually implementing is any good, I see incremental
> backup as a way of removing some of the use cases for the low-level
> backup API. If you said "but people still will have lots of reasons to
> use it," I would agree; and if you said "people can still screw things
> up with pg_basebackup," I would also agree. Nonetheless, most of the
> disasters I've personally seen have stemmed from the use of the
> low-level API rather than from the use of pg_basebackup, though there
> are exceptions. 

This all makes sense to me.

> I also think a lot of the use of the low-level API is
> driven by it being just too darn slow to copy the whole database, and
> incremental backup can help with that in some circumstances. 

I would argue that restore performance is *more* important than backup 
performance and this patch is a step backward in that regard. Backups 
will be faster and less space will be used in the repository, but 
restore performance is going to suffer. If the deltas are very small the 
difference will probably be negligible, but as the deltas get large (and 
especially if there are a lot of them) the penalty will be more noticeable.

> Also, I
> have worked fairly hard to try to make sure that if you misuse
> pg_combinebackup, or fail to use it altogether, you'll get an error
> rather than silent data corruption. I would be interested to hear
> about scenarios where the checks that I've implemented can be defeated
> by something that is plausibly described as stupidity rather than
> malice. I'm not sure we can fix all such cases, but I'm very alert to
> the horror that will befall me if user error looks exactly like a bug
> in the code. For my own sanity, we have to be able to distinguish
> those cases. 

I was concerned with the difficulty of trying to stage the correct 
backups for pg_combinebackup, not whether it would recognize that the 
needed data was not available and then error appropriately. The latter 
is surmountable within pg_combinebackup but the former is left up to the 
user.

> Moreover, we also need to be able to distinguish
> backup-time bugs from reassembly-time bugs, which is why I've got the
> pg_walsummary tool, and why pg_combinebackup has the ability to emit
> fairly detailed debugging output. I anticipate those things being
> useful in investigating bug reports when they show up. I won't be too
> surprised if it turns out that more work on sanity-checking and/or
> debugging tools is needed, but I think your concern about people
> misusing stuff is bang on target and I really want to do whatever we
> can to avoid that when possible and detect it when it happens.

The ability of users to misuse tools is, of course, legendary, so that 
all sounds good to me.

One note regarding the patches. I feel like 
v5-0005-Prototype-patch-for-incremental-backup should be split to have 
the WAL summarizer as one patch and the changes to base backup as a 
separate patch.

It might not be useful to commit one without the other, but it would 
make for an easier read. Just my 2c.

Regards,
-David



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Mon, Oct 23, 2023 at 7:56 PM David Steele <david@pgmasters.net> wrote:
> > I also think a lot of the use of the low-level API is
> > driven by it being just too darn slow to copy the whole database, and
> > incremental backup can help with that in some circumstances.
>
> I would argue that restore performance is *more* important than backup
> performance and this patch is a step backward in that regard. Backups
> will be faster and less space will be used in the repository, but
> restore performance is going to suffer. If the deltas are very small the
> difference will probably be negligible, but as the deltas get large (and
> especially if there are a lot of them) the penalty will be more noticeable.

I think an awful lot depends here on whether the repository is local
or remote. If you have filesystem access to wherever the backups are
stored anyway, I don't think that using pg_combinebackup to write out
a new data directory is going to be much slower than copying one data
directory from the repository to wherever you'd actually use the
backup. It may be somewhat slower because we do need to access some
data in every involved backup, but I don't think it should be vastly
slower because we don't have to read every backup in its entirety. For
each file, we read the (small) header of the newest incremental file
and every incremental file that precedes it until we find a full file.
Then, we construct a map of which blocks need to be read from which
sources and read only the required blocks from each source. If all the
blocks are coming from a single file (because there are no incrementals
for a certain file, or they contain no blocks) then we just copy the
entire source file in one shot, which can be optimized using the same
tricks we use elsewhere. Inevitably, this is going to read more data
and do more random I/O than just a flat copy of a directory, but it's
not terrible. The overall amount of I/O should be a lot closer to the
size of the output directory than to the sum of the sizes of the input
directories.
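
To make the "map of which blocks need to be read from which sources"
idea concrete, here's a toy, self-contained sketch - in-memory arrays
standing in for real backup files, and nothing here is
pg_combinebackup's actual code or file format:

#include <stdio.h>

#define NBLOCKS 8

/* Toy stand-in for one copy of a relation file within a backup chain. */
typedef struct
{
    int         is_full;                /* full copy, or incremental? */
    int         has_block[NBLOCKS];     /* which blocks this copy carries */
} ToyFileCopy;

int
main(void)
{
    /* Oldest first: a full backup followed by two incrementals. */
    ToyFileCopy chain[] = {
        {1, {1, 1, 1, 1, 1, 1, 1, 1}},  /* full backup */
        {0, {0, 1, 0, 0, 1, 0, 0, 0}},  /* incremental #1 */
        {0, {0, 0, 0, 1, 1, 0, 0, 0}},  /* incremental #2 (newest) */
    };
    int         nlinks = 3;
    int         source[NBLOCKS];

    /*
     * For each block, walk backward from the newest copy until we find a
     * copy that contains the block or a full copy of the file.
     * Reconstruction then reads each block only from its chosen source.
     */
    for (int blk = 0; blk < NBLOCKS; blk++)
    {
        source[blk] = 0;
        for (int i = nlinks - 1; i >= 0; i--)
        {
            if (chain[i].is_full || chain[i].has_block[blk])
            {
                source[blk] = i;
                break;
            }
        }
        printf("block %d read from backup %d\n", blk, source[blk]);
    }
    return 0;
}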

Now, if the repository is remote, and you have to download all of
those backups first, and then run pg_combinebackup on them afterward,
that is going to be unpleasant, unless the incremental backups are all
quite small. Possibly this could be addressed by teaching
pg_combinebackup to do things like accessing data over HTTP and SSH,
and relatedly, looking inside tarfiles without needing them unpacked.
For now, I've left those as ideas for future improvement, but I think
potentially they could address some of your concerns here. A
difficulty is that there are a lot of protocols that people might want
to use to push bytes around, and it might be hard to keep up with the
march of progress.

I do agree, though, that there's no such thing as a free lunch. I
wouldn't recommend to anyone that they plan to restore from a chain of
100 incremental backups. Not only might it be slow, but the
opportunities for something to go wrong are magnified. Even if you've
automated everything well enough that there's no human error involved,
what if you've got a corrupted file somewhere? Maybe that's not likely
in absolute terms, but the more files you've got, the more likely it
becomes. What I'd suggest someone do instead is periodically do
pg_combinebackup full_reference_backup oldest_incremental -o
new_full_reference_backup; rm -rf full_reference_backup; mv
new_full_reference_backup full_reference_backup. The new full
reference backup is intended to still be usable for restoring
incrementals based on the incremental it replaced. I hope that, if
people use the feature well, this should limit the need for really
long backup chains. I am sure, though, that some people will use it
poorly. Maybe there's room for more documentation on this topic.
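
Spelled out as commands, that rotation is just (directory names here
are placeholders, of course):

pg_combinebackup full_reference_backup oldest_incremental -o new_full_reference_backup
rm -rf full_reference_backup
mv new_full_reference_backup full_reference_backup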

> I was concerned with the difficulty of trying to stage the correct
> backups for pg_combinebackup, not whether it would recognize that the
> needed data was not available and then error appropriately. The latter
> is surmountable within pg_combinebackup but the former is left up to the
> user.

Indeed.

> One note regarding the patches. I feel like
> v5-0005-Prototype-patch-for-incremental-backup should be split to have
> the WAL summarizer as one patch and the changes to base backup as a
> separate patch.
>
> It might not be useful to commit one without the other, but it would
> make for an easier read. Just my 2c.

Yeah, maybe so. I'm not quite ready to commit to doing that split as
of this writing but I will think about it and possibly do it.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Peter Eisentraut
Date:
On 04.10.23 22:08, Robert Haas wrote:
> - I would like some feedback on the generation of WAL summary files.
> Right now, I have it enabled by default, and summaries are kept for a
> week. That means that, with no additional setup, you can take an
> incremental backup as long as the reference backup was taken in the
> last week. File removal is governed by mtimes, so if you change the
> mtimes of your summary files or whack your system clock around, weird
> things might happen. But obviously this might be inconvenient. Some
> people might not want WAL summary files to be generated at all because
> they don't care about incremental backup, and other people might want
> them retained for longer, and still other people might want them to be
> not removed automatically or removed automatically based on some
> criteria other than mtime. I don't really know what's best here. I
> don't think the default policy that the patches implement is
> especially terrible, but it's just something that I made up and I
> don't have any real confidence that it's wonderful.

The easiest answer is to have it off by default.  Let people figure out 
what works for them.  There are various factors like storage, network, 
server performance, RTO that will determine what combination of full 
backup, incremental backup, and WAL replay will satisfy someone's 
requirements.  I suppose tests could be set up to determine this to some 
degree.  But we could also start slow and let people figure it out 
themselves.  When pg_basebackup was added, it was also disabled by default.

If we think that 7d is a good setting, then I would suggest considering 
something like 10d instead.  Otherwise, if you do a weekly incremental 
backup and you have 
a time change or a hiccup of some kind one day, you lose your backup 
sequence.

Another possible answer is, like, 400 days?  Because why not?  What is a 
reasonable upper limit for this?

> - It's regrettable that we don't have incremental JSON parsing; I
> think that means anyone who has a backup manifest that is bigger than
> 1GB can't use this feature. However, that's also a problem for the
> existing backup manifest feature, and as far as I can see, we have no
> complaints about it. So maybe people just don't have databases with
> enough relations for that to be much of a live issue yet. I'm inclined
> to treat this as a non-blocker,

It looks like each file entry in the manifest takes about 150 bytes, so 
1 GB would allow for 1024**3/150 = 7158278 files.  That seems fine for now?

> - Right now, I have a hard-coded 60 second timeout for WAL
> summarization. If you try to take an incremental backup and the WAL
> summaries you need don't show up within 60 seconds, the backup times
> out. I think that's a reasonable default, but should it be
> configurable? If yes, should that be a GUC or, perhaps better, a
> pg_basebackup option?

The current user experience of pg_basebackup is that it waits possibly a 
long time for a checkpoint, and there is an option to make it go faster, 
but there is no timeout AFAICT.  Is this substantially different?  Could 
we just let it wait forever?

Also, does waiting for checkpoint and WAL summarization happen in 
parallel?  If so, what if it starts a checkpoint that might take 15 min 
to complete, and then after 60 seconds it kicks you off because the WAL 
summarization isn't ready.  That might be wasteful.

> - I'm curious what people think about the pg_walsummary tool that is
> included in 0006. I think it's going to be fairly important for
> debugging, but it does feel a little bit bad to add a new binary for
> something pretty niche.

This seems fine.

Is the WAL summary file format documented anywhere in your patch set 
yet?  My only thought was, maybe the file format could be human-readable 
(more like backup_label) to avoid this.  But maybe not.




Re: trying again to get incremental backup

From
Robert Haas
Date:
On Tue, Oct 24, 2023 at 10:53 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> The easiest answer is to have it off by default.  Let people figure out
> what works for them.  There are various factors like storage, network,
> server performance, RTO that will determine what combination of full
> backup, incremental backup, and WAL replay will satisfy someone's
> requirements.  I suppose tests could be set up to determine this to some
> degree.  But we could also start slow and let people figure it out
> themselves.  When pg_basebackup was added, it was also disabled by default.
>
> If we think that 7d is a good setting, then I would suggest considering
> something like 10d instead.  Otherwise, if you do a weekly incremental
> backup and you have
> a time change or a hiccup of some kind one day, you lose your backup
> sequence.
>
> Another possible answer is, like, 400 days?  Because why not?  What is a
> reasonable upper limit for this?

In concept, I don't think this should even be time-based. What you
should do is remove WAL summaries once you know that you're never going
to take another incremental backup that might use them. But PostgreSQL
itself doesn't have any way of knowing what your
intended backup patterns are. If your incremental backup fails on
Monday night and you run it manually on Tuesday morning, you might
still rerun it as an incremental backup. If it fails every night for a
month and you finally realize and decide to intervene manually, maybe
you want a new full backup at that point. It's been a month. But on
the other hand maybe you don't. There's no time-based answer to this
question that is really correct, and I think it's quite possible that
your backup software might want to shut off time-based deletion
altogether and make its own decisions about when to nuke summaries.
However, I also don't think that's a great default setting. It could
easily lead to people wasting a bunch of disk space for no reason.

As for the 7d value, I figured that nightly incremental backups
would be fairly common. If we think weekly incremental backups would
be common, then pushing this out to 10d would make sense. While
there's no reason you couldn't take an annual incremental backup, and
thus want a 400d value, it seems like a pretty niche use case.

Note that whether to remove summaries is a separate question from
whether to generate them in the first place. Right now, I have
wal_summarize_mb controlling whether they get generated in the first
place, but as I noted in another recent email, that isn't an entirely
satisfying solution.

> It looks like each file entry in the manifest takes about 150 bytes, so
> 1 GB would allow for 1024**3/150 = 7158278 files.  That seems fine for now?

I suspect a few people have more files than that. They'll just have to
wait to use this feature until we get incremental JSON parsing (or
undo the decision to use JSON for the manifest).

> The current user experience of pg_basebackup is that it waits possibly a
> long time for a checkpoint, and there is an option to make it go faster,
> but there is no timeout AFAICT.  Is this substantially different?  Could
> we just let it wait forever?

We could. I installed the timeout because the first versions of the
feature were buggy, and I didn't like having my tests hang forever
with no indication of what had gone wrong. At least in my experience
so far, the time spent waiting for WAL summarization is typically
quite short, because the only WAL that needs to be summarized is
whatever was emitted between the last time the summarizer woke up and
the start LSN of the backup. That's probably not much, and the next time
the summarizer wakes up, the file should appear within moments. So,
it's a little different from the checkpoint case, where long waits are
expected.

> Also, does waiting for checkpoint and WAL summarization happen in
> parallel?  If so, what if it starts a checkpoint that might take 15 min
> to complete, and then after 60 seconds it kicks you off because the WAL
> summarization isn't ready.  That might be wasteful.

It is not parallel. The trouble is, we don't really have any way to
know whether WAL summarization is going to fail for whatever reason.
We don't expect that to happen, but if somebody changes the
permissions on the WAL summary directory or attaches gdb to the WAL
summarizer process or something of that sort, it might.

We could check at the outset whether we seem to be really far behind
and emit a warning. For instance, if we're 1TB behind on WAL
summarization when the checkpoint is requested, chances are something
is busted and we're probably not going to catch up any time soon. We
could warn the user about that and let them make their own decision
about whether to cancel. But, that idea won't help in unattended
operation, and the threshold for "really far behind" is not very
clear. It might be better to wait until we get more experience with
how things actually fail before doing too much engineering here, but
I'm also open to suggestions.

> Is the WAL summary file format documented anywhere in your patch set
> yet?  My only thought was, maybe the file format could be human-readable
> (more like backup_label) to avoid this.  But maybe not.

The comment in blkreftable.c just above "#define BLOCKS_PER_CHUNK"
gives an overview of the format. I think that we probably don't want
to convert to a text format, because this format is extremely
space-efficient and very convenient to transfer between disk and
memory. We don't want to run out of memory when summarizing large
ranges of WAL, or when taking an incremental backup that requires
combining many individual summaries into a combined summary that tells
us what needs to be included in the backup.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Andrew Dunstan
Date:
On 2023-10-24 Tu 12:08, Robert Haas wrote:
>
>> It looks like each file entry in the manifest takes about 150 bytes, so
>> 1 GB would allow for 1024**3/150 = 7158278 files.  That seems fine for now?
> I suspect a few people have more files than that. They'll just have to
> wait to use this feature until we get incremental JSON parsing (or
> undo the decision to use JSON for the manifest).


Robert asked me to work on this quite some time ago, and most of this 
work was done last year.

Here's my WIP for an incremental JSON parser. It works and passes all 
the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book. 
The reason I haven't posted it before is that it's about 50% slower in 
pure parsing speed than the current recursive descent parser in my 
testing. I've tried various things to make it faster, but haven't made 
much impact. One of my colleagues is going to take a fresh look at it, 
but maybe someone on the list can see where we can save some cycles.

If we can't make it faster, I guess we could use the RD parser for 
non-incremental cases and only use the non-RD parser for incremental, 
although that would be a bit sad. However, I don't think we can make the 
RD parser suitable for incremental parsing - there's too much state 
involved in the call stack.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Attachment

Re: trying again to get incremental backup

From
Robert Haas
Date:
On Wed, Oct 25, 2023 at 7:54 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> Robert asked me to work on this quite some time ago, and most of this
> work was done last year.
>
> Here's my WIP for an incremental JSON parser. It works and passes all
> the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book.
> The reason I haven't posted it before is that it's about 50% slower in
> pure parsing speed than the current recursive descent parser in my
> testing. I've tried various things to make it faster, but haven't made
> much impact. One of my colleagues is going to take a fresh look at it,
> but maybe someone on the list can see where we can save some cycles.
>
> If we can't make it faster, I guess we could use the RD parser for
> non-incremental cases and only use the non-RD parser for incremental,
> although that would be a bit sad. However, I don't think we can make the
> RD parser suitable for incremental parsing - there's too much state
> involved in the call stack.

Yeah, this is exactly why I didn't want to use JSON for the backup
manifest in the first place. Parsing such a manifest incrementally is
complicated. If we'd gone with my original design where the manifest
consisted of a bunch of lines each of which could be parsed
separately, we'd already have incremental parsing and wouldn't be
faced with these difficult trade-offs.

Unfortunately, I'm not in a good position either to figure out how to
make your prototype faster, or to evaluate how painful it is to keep
both in the source tree. It's probably worth considering how likely it
is that we'd be interested in incremental JSON parsing in other cases.
Maintaining two JSON parsers is probably not a lot of fun regardless,
but if each of them gets used for a bunch of things, that feels less
bad than if one of them gets used for a bunch of things and the other
one only ever gets used for backup manifests. Would we be interested
in JSON-format database dumps? Incrementally parsing JSON LOBs? Either
seems tenuous, but those are examples of the kind of thing that could
make us happy to have incremental JSON parsing as a general facility.

If nobody's very excited by those kinds of use cases, then this just
boils down to whether we want to (a) accept that users with very large
numbers of relation files won't be able to use pg_verifybackup or
incremental backup, (b) accept that we're going to maintain a second
JSON parser just to enable that use case and with no other benefit, or
(c) undertake to change the manifest format to something that is
straightforward to parse incrementally. I think (a) is reasonable
short term, but at some point I think we should do better. I'm not
really that enthused about (c) because it means more work for me and
possibly more arguing, but if (b) is going to cause a lot of hassle
then we might need to consider it.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Andrew Dunstan
Date:
On 2023-10-25 We 09:05, Robert Haas wrote:
> On Wed, Oct 25, 2023 at 7:54 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>> Robert asked me to work on this quite some time ago, and most of this
>> work was done last year.
>>
>> Here's my WIP for an incremental JSON parser. It works and passes all
>> the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book.
>> The reason I haven't posted it before is that it's about 50% slower in
>> pure parsing speed than the current recursive descent parser in my
>> testing. I've tried various things to make it faster, but haven't made
>> much impact. One of my colleagues is going to take a fresh look at it,
>> but maybe someone on the list can see where we can save some cycles.
>>
>> If we can't make it faster, I guess we could use the RD parser for
>> non-incremental cases and only use the non-RD parser for incremental,
>> although that would be a bit sad. However, I don't think we can make the
>> RD parser suitable for incremental parsing - there's too much state
>> involved in the call stack.
> Yeah, this is exactly why I didn't want to use JSON for the backup
> manifest in the first place. Parsing such a manifest incrementally is
> complicated. If we'd gone with my original design where the manifest
> consisted of a bunch of lines each of which could be parsed
> separately, we'd already have incremental parsing and wouldn't be
> faced with these difficult trade-offs.
>
> Unfortunately, I'm not in a good position either to figure out how to
> make your prototype faster, or to evaluate how painful it is to keep
> both in the source tree. It's probably worth considering how likely it
> is that we'd be interested in incremental JSON parsing in other cases.
> Maintaining two JSON parsers is probably not a lot of fun regardless,
> but if each of them gets used for a bunch of things, that feels less
> bad than if one of them gets used for a bunch of things and the other
> one only ever gets used for backup manifests. Would we be interested
> in JSON-format database dumps? Incrementally parsing JSON LOBs? Either
> seems tenuous, but those are examples of the kind of thing that could
> make us happy to have incremental JSON parsing as a general facility.
>
> If nobody's very excited by those kinds of use cases, then this just
> boils down to whether we want to (a) accept that users with very large
> numbers of relation files won't be able to use pg_verifybackup or
> incremental backup, (b) accept that we're going to maintain a second
> JSON parser just to enable that use case and with no other benefit, or
> (c) undertake to change the manifest format to something that is
> straightforward to parse incrementally. I think (a) is reasonable
> short term, but at some point I think we should do better. I'm not
> really that enthused about (c) because it means more work for me and
> possibly more arguing, but if (b) is going to cause a lot of hassle
> then we might need to consider it.


I'm not too worried about the maintenance burden. The RD routines were 
added in March 2013 (commit a570c98d7fa) and have hardly changed since 
then. The new code is not ground-breaking - it's just a different (and 
fairly well known) way of doing the same thing. I'd be happier if we 
could make it faster, but maybe it's just a fact that keeping an 
explicit stack, which is how this works, is slower.

I wouldn't at all be surprised if there were other good uses for 
incremental JSON parsing, including some you've identified.

That said, I agree that JSON might not be the best format for backup 
manifests, but maybe that ship has sailed.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: trying again to get incremental backup

From
Robert Haas
Date:
On Wed, Oct 25, 2023 at 10:33 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> I'm not too worried about the maintenance burden.
>
> That said, I agree that JSON might not be the best format for backup
> manifests, but maybe that ship has sailed.

I think it's a decision we could walk back if we had a good enough
reason, but it would be nicer if we didn't have to, because what we
have right now is working. If we change it for no real reason, we
might introduce new bugs, and at least in theory, incompatibility with
third-party tools that parse the existing format. If you think we can
live with the additional complexity in the JSON parsing stuff, I'd
rather go that way.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Tue, Oct 24, 2023 at 8:29 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Yeah, maybe so. I'm not quite ready to commit to doing that split as
> of this writing but I will think about it and possibly do it.

I have done this. Here's v7.

This version also includes several new TAP tests for the main patch,
some of which were inspired by our discussion. It also includes SGML
documentation for pg_walsummary.

New tests:
003_timeline.pl tests the case where the prior backup for an
incremental backup was taken on an earlier timeline.
004_manifest.pl tests the manifest-related options for pg_combinebackup.
005_integrity.pl tests the sanity checks that prevent combining a
backup with the wrong prior backup.

Overview of the new organization of the patch set:
0001 - preparatory refactoring of basebackup.c, changing the algorithm
that we use to decide which files have checksums
0002 - code movement only. makes it possible to reuse parse_manifest.c
0003 - add the WAL summarizer process, but useless on its own
0004 - add incremental backup, making use of 0003
0005 - add pg_walsummary debugging tool

Notes:
- I suspect that 0003 is the most likely to have serious bugs, followed by 0004.
- See XXX comments in the commit messages for some known open issues.
- Still looking for more comments on
http://postgr.es/m/CA+TgmoYdPS7a4eiqAFCZ8dr4r3-O0zq1LvTO5drwWr+7wHQaSQ@mail.gmail.com
and other recent emails where design questions came up

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: trying again to get incremental backup

From
Andrew Dunstan
Date:
On 2023-10-25 We 11:24, Robert Haas wrote:
> On Wed, Oct 25, 2023 at 10:33 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>> I'm not too worried about the maintenance burden.
>>
>> That said, I agree that JSON might not be the best format for backup
>> manifests, but maybe that ship has sailed.
> I think it's a decision we could walk back if we had a good enough
> reason, but it would be nicer if we didn't have to, because what we
> have right now is working. If we change it for no real reason, we
> might introduce new bugs, and at least in theory, incompatibility with
> third-party tools that parse the existing format. If you think we can
> live with the additional complexity in the JSON parsing stuff, I'd
> rather go that way.
>

OK, I'll go with that. It will actually be a bit less invasive than the 
patch I posted.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: trying again to get incremental backup

From
Robert Haas
Date:
On Wed, Oct 25, 2023 at 3:17 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> OK, I'll go with that. It will actually be a bit less invasive than the
> patch I posted.

Why's that?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Andrew Dunstan
Date:
On 2023-10-25 We 15:19, Robert Haas wrote:
> On Wed, Oct 25, 2023 at 3:17 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>> OK, I'll go with that. It will actually be a bit less invasive than the
>> patch I posted.
> Why's that?
>

Because we won't be removing the RD parser.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: trying again to get incremental backup

From
Robert Haas
Date:
On Thu, Oct 26, 2023 at 6:59 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> Because we won't be removing the RD parser.

Ah, OK.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Tue, Oct 24, 2023 at 12:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Note that whether to remove summaries is a separate question from
> whether to generate them in the first place. Right now, I have
> wal_summarize_mb controlling whether they get generated in the first
> place, but as I noted in another recent email, that isn't an entirely
> satisfying solution.

I did some more research on this. My conclusion is that I should
remove wal_summarize_mb and just have a GUC summarize_wal = on|off
that controls whether the summarizer runs at all. There will be one
summary file per checkpoint, no matter how far apart checkpoints are
or how large the summary gets. Below I'll explain the reasoning; let
me know if you disagree.
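
In postgresql.conf terms, that would reduce the configuration to a
single switch, something like this (sketch only - the final name and
default are of course still up for discussion):

# instead of tuning wal_summarize_mb:
summarize_wal = on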

What I describe above would be a bad plan if it were realistically
possible for a summary file to get so large that it might run the
machine out of memory either when producing it or when trying to make
use of it for an incremental backup. This seems to be a somewhat
difficult scenario to create. So far, I haven't been able to generate
WAL summary files more than a few tens of megabytes in size, even when
summarizing 50+ GB of WAL per summary file. One reason why it's hard
to produce large summary files is because, for a single relation fork,
the WAL summary size converges to 1 bit per modified block when the
number of modified blocks is large. This means that, even if you have
a terabyte sized relation, you're looking at no more than perhaps 20MB
of summary data no matter how much of it gets modified. Now, somebody
could have a 30TB relation and then if they modify the whole thing
they could have the better part of a gigabyte of summary data for that
relation, but if you've got a 30TB table you probably have enough
memory that that's no big deal.
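
For reference, the arithmetic behind those estimates, assuming the
default 8kB block size and roughly 1 bit per modified block:

1TB relation:  1TB / 8kB  = ~134 million blocks -> ~16MB of summary data
30TB relation: 30TB / 8kB = ~4 billion blocks   -> ~500MB of summary data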

But, what if you have multiple relations? I initialized pgbench with a
scale factor of 30000 and also with 30000 partitions and did a 1-hour
run. I got 4 checkpoints during that time and each one produced an
approximately 16MB summary file. The efficiency here drops
considerably. For example, one of the files is 16495398 bytes and
records information on 7498403 modified blocks, which works out to
about 2.2 bytes per modified block. That's more than an order of
magnitude worse than what I got in the single-relation case, where the
summary file didn't even use two *bits* per modified block. But here
again, the file just isn't that big in absolute terms. To get a 1GB+
WAL summary file, you'd need to modify millions of relation forks,
maybe tens of millions, and most installations aren't even going to
have that many relation forks, let alone be modifying them all
frequently.

My conclusion here is that it's pretty hard to have a database where
WAL summarization is going to use too much memory. I wouldn't be
terribly surprised if there are some extreme cases where it happens,
but those databases probably aren't great candidates for incremental
backup anyway. They're probably databases with millions of relations
and frequent, widely-scattered modifications to those relations. And
if you have that kind of high turnover rate then incremental backup
isn't going to be as helpful anyway, so there's probably no reason to
enable WAL summarization in the first place. Maybe if you have that,
plus 100TB of completely static data in the same database cluster that
is never modified, and if you also do all of this on
a pretty small machine, then you can find a case where incremental
backup would have worked well but for the memory consumed by WAL
summarization.

But I think that's sufficiently niche that the current patch shouldn't
concern itself with such cases. If we find that they're common enough
to worry about, we might eventually want to do something to mitigate
them, but whether that thing looks anything like wal_summarize_mb
seems pretty unclear. So I conclude that it's a mistake to include
that GUC as currently designed and propose to replace it with a
Boolean as described above.

Comments?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
While reviewing this thread today, I realized that I never responded
to this email. That was inadvertent; my apologies.

On Wed, Jun 14, 2023 at 4:34 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Nice, I like this idea.

Cool.

> Skimming through the 7th patch, I see claims that FSM is not fully
> WAL-logged and thus shouldn't be tracked, and so it indeed doesn't
> track those changes.
> I disagree with that decision: we now have support for custom resource
> managers, which may use the various forks for other purposes than
> those used in PostgreSQL right now. It would be a shame if data is
> lost because of the backup tool ignoring forks because the PostgreSQL
> project itself doesn't have post-recovery consistency guarantees in
> that fork. So, unless we document that WAL-logged changes in the FSM
> fork are actually not recoverable from backup, regardless of the type
> of contents, we should still keep track of the changes in the FSM fork
> and include the fork in our backups or only exclude those FSM updates
> that we know are safe to ignore.

I'm not sure what to do about this problem. I don't think any data
would be *lost* in the scenario that you mention; what I think would
happen is that the FSM forks would be backed up in their entirety even
if they were owned by some other table AM or index AM that was
WAL-logging all changes to whatever it was storing in that fork. So I
think that there is not a correctness issue here but rather an
efficiency issue.

It would still be nice to fix that somehow, but I don't see how to do
it. It would be easy to make the WAL summarizer stop treating the FSM
as a special case, but there's no way for basebackup_incremental.c to
know whether a particular relation fork is for the heap AM or some
other AM that handles WAL-logging differently. It can't for example
examine pg_class; it's not connected to any database, let alone every
database. So we have to either trust that the WAL for the FSM is
correct and complete in all cases, or assume that it isn't in any
case. And the former doesn't seem like a safe or wise assumption given
how the heap AM works.
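
To illustrate the problem: on disk, a relation's forks are just
sibling files distinguished by a suffix, with nothing in them that
says which AM owns them (the OIDs below are made up):

ls "$PGDATA/base/16384" | grep '^16385'
# 16385  16385_fsm  16385_vm   <- main fork, free space map, visibility map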

I think the reality here is unfortunately that we're missing a lot of
important infrastructure to really enable a multi-table-AM world. The
heap AM, and every other table AM, should include a metapage so we can
tell what we're looking at just by examining the disk files. Relation
forks don't scale and should be replaced with some better system that
does. We should have at least two table AMs in core that are fully
supported and do truly useful things. Until some of that stuff (and
probably a bunch of other things) get sorted out, out-of-core AMs are
going to have to remain second-class citizens to some degree.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Thu, Sep 28, 2023 at 6:22 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> If that is still an area open for discussion: wouldn't it be better to
> just specify LSN as it would allow resyncing standby across major lag
> where the WAL to replay would be enormous? Given that we had
> primary->standby where standby would be stuck on some LSN, right now
> it would be:
> 1) calculate backup manifest of desynced 10TB standby (how? using
> which tool?)  - even if possible, that means reading 10TB of data
> instead of just putting a number, isn't it?
> 2) backup primary with such incremental backup >= LSN
> 3) copy the incremental backup to standby
> 4) apply it to the impaired standby
> 5) restart the WAL replay

As you may be able to tell from the flurry of posts and new patch
sets, I'm trying hard to sort out the remaining open items that
pertain to this patch set, and I'm now back to thinking about this
one.

TL;DR: I think the idea has some potential, but there are some
pitfalls that I'm not sure how to address.

I spent some time looking at how we currently use the data from the
backup manifest. Currently, we do two things with it. First, when
we're backing up each file, we check whether it's present in the
backup manifest and, if not, we back it up in full. This actually
feels fairly poor. If it makes any difference at all, then presumably
the underlying algorithm is buggy and needs to be fixed. Maybe that
should be ripped out altogether or turned into some kind of sanity
check that causes a big explosion if it fails. Second, we check
whether the WAL ranges reported by the client match up with the
timeline history of the server (see PrepareForIncrementalBackup). This
set of sanity checks seems fairly important to me, and I'd regret
discarding them. I think there's some possibility that they might
catch user error, like where somebody promotes multiple standbys and
maybe they even get the same timeline on more than one of them, and
then confusion might ensue. I also think that there's a real
possibility that they might make it easier to track down bugs in my
code, even if those bugs aren't necessarily timeline-related. If (or
more realistically when) somebody ends up with a corrupted cluster
after running pg_combinebackup, we're going to need to figure out
whether that corruption is the result of bugs (and if so where they
are) or user error (and if so what it was). The most obvious ways of
ending up with a corrupted cluster are (1) taking an incremental
backup against a prior backup that is not in the history of the server
from which the backup is taken or (2) combining an incremental backup
with the wrong prior backup, so whatever sanity checks we can have that
will tend to prevent those kinds of mistakes seem like a really good
idea.

And those kinds of checks seem relevant here, too. Consider that it
wouldn't be valid to use pg_combinebackup to fast-forward a standby
server if the incremental backup's backup-end-LSN preceded the standby
server's minimum recovery point. Imagine that you have a standby whose
last checkpoint's redo location was at LSN 2/48. Being the
enterprising DBA that you are, you make a note of that LSN and go take
an incremental backup based on it. You then stop the standby server
and try to apply the incremental backup to fast-forward the standby.
Well, it's possible that in the meanwhile the standby actually caught
up, and now has a minimum recovery point that follows the
backup-end-LSN of your incremental backup. In that case, you can't
legally use that incremental backup to fast-forward that standby, but
no code I've yet written would be smart enough to figure that out. Or,
maybe you (or some other DBA on your team) got really excited and
actually promoted that standby meanwhile, and now it's not even on the
same timeline any more. In the "normal" case where you take an
incremental backup based on an earlier base backup, these kinds of
problems are detectable, and it seems to me that if we want to enable
this kind of use case, it would be pretty smart to have a plan to
detect similar mistakes here. I don't, currently, but maybe there is
one.
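
For what it's worth, the comparison that's missing can at least be
eyeballed by hand, along these lines (a sketch only, assuming the
incremental backup's backup_manifest is at hand; nothing in the patch
set does this for you):

# end LSN of the incremental backup, from the manifest's WAL-Ranges
grep -o '"End-LSN": "[^"]*"' incr/backup_manifest | tail -n1
# the standby's minimum recovery point, which must not be later than that
pg_controldata "$STANDBY_PGDATA" | grep 'Minimum recovery ending location'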

Another practical problem here is that, right now, pg_combinebackup
doesn't have an in-place mode. It knows how to take a bunch of input
backups and write out an output backup, but that output backup needs
to go into a new, fresh directory (or directories plural, if there are
user-defined tablespaces). I had previously considered adding such a
mode, but the idea I had at the time wouldn't have worked for this
case. I imagined that someone might want to run "pg_combinebackup
--in-place full incr" and clobber the contents of the incr directory
with the output, basically discarding the incremental backup you took
in favor of a full backup that could have been taken at the same point
in time. But here, you'd want to clobber the *first* input to
pg_combinebackup, not the last one, so if we want to add something
like this, the UI needs some thought.

One thing that I find quite scary about such a mode is that if you
crash mid-way through, you're in a lot of trouble. In the case that I
had previously contemplated -- overwrite the last incremental with the
reconstructed full backup -- you *might* be able to make it crash safe
by writing out the full files for each incremental file, fsyncing
everything, then removing all of the incremental files and fsyncing
again. The idea would be that if you crash midway through it's OK to
just repeat whatever you were trying to do before the crash and if it
succeeds the second time then all is well. If, for a given file, there
are both incremental and non-incremental versions, then the second
attempt should remove and recreate the non-incremental version from
the incremental version. If there's only a non-incremental version, it
could be that the previous attempt got far enough to remove the
incremental file, but in that case the full file that we now have
should be the same thing that we would produce if we did the operation
now. It all sounds a little scary, but maybe it's OK. And as long as
you don't remove the this-is-an-incremental-backup markers from the
backup_label file until you've done everything else, you can tell
whether you've ever successfully completed the reassembly or not. But
if you're using a hypothetical overwrite mode to overwrite the first
input rather than the last one, well, it looks like a valid data
directory already, and if you replace a bunch of files and then crash,
it still does, but it's not any more, really. I'm not sure I've really
wrapped my head around all of the cases here, but it does feel like
there are some new ways to go wrong.
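
To make the sequencing concrete, here's a very rough sketch of that
overwrite-the-last-incremental idea; reconstruct_file is a purely
hypothetical stand-in for the real per-file reconstruction logic, and
the INCREMENTAL. file-name prefix is an assumption about how
incremental files are marked:

for inc in "$BACKUP"/base/*/INCREMENTAL.*; do
    full="${inc%/*}/${inc##*/INCREMENTAL.}"   # .../INCREMENTAL.16385 -> .../16385
    reconstruct_file "$inc" > "$full"         # hypothetical helper: write the full file
done
sync -f "$BACKUP"                             # fsync everything
rm -f "$BACKUP"/base/*/INCREMENTAL.*          # only then remove the incremental files
sync -f "$BACKUP"                             # fsync again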

One thing I also realized when thinking about this is that you could
probably hose yourself with the patch set as it stands today by taking
a full backup, downgrading to wal_level=minimal for a while, doing
some WAL-skipping operations, upgrading to a higher WAL-level again,
and then taking an incremental backup. I think the solution to that is
probably for the WAL summarizer to refuse to run if wal_level=minimal.
Then there would be a gap in the summary files which an incremental
backup attempt would detect.
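
(Until then, a trivial pre-flight check one can do by hand before
relying on incremental backups is just to verify the setting; sketch:)

test "$(psql -Atc 'SHOW wal_level')" != minimal || echo "wal_level=minimal: take a full backup instead"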

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Andres Freund
Date:
Hi,

On 2023-10-30 10:45:03 -0400, Robert Haas wrote:
> On Tue, Oct 24, 2023 at 12:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Note that whether to remove summaries is a separate question from
> > whether to generate them in the first place. Right now, I have
> > wal_summarize_mb controlling whether they get generated in the first
> > place, but as I noted in another recent email, that isn't an entirely
> > satisfying solution.
> 
> I did some more research on this. My conclusion is that I should
> remove wal_summarize_mb and just have a GUC summarize_wal = on|off
> that controls whether the summarizer runs at all. There will be one
> summary file per checkpoint, no matter how far apart checkpoints are
> or how large the summary gets. Below I'll explain the reasoning; let
> me know if you disagree.

> What I describe above would be a bad plan if it were realistically
> possible for a summary file to get so large that it might run the
> machine out of memory either when producing it or when trying to make
> use of it for an incremental backup. This seems to be a somewhat
> difficult scenario to create. So far, I haven't been able to generate
> WAL summary files more than a few tens of megabytes in size, even when
> summarizing 50+ GB of WAL per summary file. One reason why it's hard
> to produce large summary files is because, for a single relation fork,
> the WAL summary size converges to 1 bit per modified block when the
> number of modified blocks is large. This means that, even if you have
> a terabyte sized relation, you're looking at no more than perhaps 20MB
> of summary data no matter how much of it gets modified. Now, somebody
> could have a 30TB relation and then if they modify the whole thing
> they could have the better part of a gigabyte of summary data for that
> relation, but if you've got a 30TB table you probably have enough
> memory that that's no big deal.

I'm not particularly worried about the rewriting-30TB-table case - that'd also
generate >= 30TB of WAL most of the time. Which realistically is going to
trigger a few checkpoints, even on very big instances.


> But, what if you have multiple relations? I initialized pgbench with a
> scale factor of 30000 and also with 30000 partitions and did a 1-hour
> run. I got 4 checkpoints during that time and each one produced an
> approximately 16MB summary file.

Hm, I assume the pgbench run will be fairly massively bottlenecked on IO, due
to having to read data from disk, lots of full page writes and having to write
out lots of data?  I.e. we won't do all that many transactions during the 1h?


> To get a 1GB+ WAL summary file, you'd need to modify millions of relation
> forks, maybe tens of millions, and most installations aren't even going to
> have that many relation forks, let alone be modifying them all frequently.

I tried to find bad cases for a bit - and I am not worried. I wrote a pgbench
script to create 10k single-row relations in each script, ran that with 96
clients, checkpointed, and ran a pgbench script that updated the single row in
each table.

After creation of the relation WAL summarizer uses
LOG:  level: 1; Wal Summarizer: 378433680 total in 43 blocks; 5628936 free (66 chunks); 372804744 used
and creates a 26MB summary file.

After checkpoint & updates WAL summarizer uses:
LOG:  level: 1; Wal Summarizer: 369205392 total in 43 blocks; 5864536 free (26 chunks); 363340856 used
and creates a 26MB summary file.

Sure, 350MB ain't nothing, but just executing \dt in the database
created by this makes the backend use 260MB afterwards, and that isn't
going away, whereas the WAL summarizer drops its memory usage soon after.


> But I think that's sufficiently niche that the current patch shouldn't
> concern itself with such cases. If we find that they're common enough
> to worry about, we might eventually want to do something to mitigate
> them, but whether that thing looks anything like wal_summarize_mb
> seems pretty unclear. So I conclude that it's a mistake to include
> that GUC as currently designed and propose to replace it with a
> Boolean as described above.

After playing with this for a while, I don't see a reason for wal_summarize_mb
from a memory usage POV at least.

I wonder if there are use cases that might like to consume WAL summaries
before the next checkpoint? For those wal_summarize_mb likely wouldn't be a
good control, but they might want to request a summary file to be created at
some point?

Greetings,

Andres Freund



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Mon, Oct 30, 2023 at 2:46 PM Andres Freund <andres@anarazel.de> wrote:
> After playing with this for a while, I don't see a reason for wal_summarize_mb
> from a memory usage POV at least.

Cool! Thanks for testing.

> I wonder if there are use cases that might like to consume WAL summaries
> before the next checkpoint? For those wal_summarize_mb likely wouldn't be a
> good control, but they might want to request a summary file to be created at
> some point?

It's possible. I actually think it's even more likely that there are
use cases that will also want the WAL summarized, but in some
different way. For example, you might want a summary that would give
you the LSN or approximate LSN where changes to a certain block
occurred. Such a summary would be way bigger than these summaries and
therefore, at least IMHO, a lot less useful for incremental backup,
but it could be really useful for something else. Or you might want
summaries that focus on something other than which blocks got changed,
like what relations were created or destroyed, or only changes to
certain kinds of relations or relation forks, or whatever. In a way,
you can even think of logical decoding as a kind of WAL summarization,
just with a very different set of goals from this one. I won't be too
surprised if the next hacker wants something that is different enough
from what this does that it doesn't make sense to share mechanism, but
if by chance they want the same thing but dumped a bit more
frequently, well, that can be done.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Jakub Wartak
Date:
On Mon, Oct 30, 2023 at 6:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Sep 28, 2023 at 6:22 AM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> > If that is still an area open for discussion: wouldn't it be better to
> > just specify LSN as it would allow resyncing standby across major lag
> > where the WAL to replay would be enormous? Given that we had
> > primary->standby where standby would be stuck on some LSN, right now
> > it would be:
> > 1) calculate backup manifest of desynced 10TB standby (how? using
> > which tool?)  - even if possible, that means reading 10TB of data
> > instead of just putting a number, isn't it?
> > 2) backup primary with such incremental backup >= LSN
> > 3) copy the incremental backup to standby
> > 4) apply it to the impaired standby
> > 5) restart the WAL replay
>
> As you may be able to tell from the flurry of posts and new patch
> sets, I'm trying hard to sort out the remaining open items that
> pertain to this patch set, and I'm now back to thinking about this
> one.
>
> TL;DR: I think the idea has some potential, but there are some
> pitfalls that I'm not sure how to address.
>
> I spent some time looking at how we currently use the data from the
> backup manifest. Currently, we do two things with it. First, when
> we're backing up each file, we check whether it's present in the
> backup manifest and, if not, we back it up in full. This actually
> feels fairly poor. If it makes any difference at all, then presumably
> the underlying algorithm is buggy and needs to be fixed. Maybe that
> should be ripped out altogether or turned into some kind of sanity
> check that causes a big explosion if it fails. Second, we check
> whether the WAL ranges reported by the client match up with the
> timeline history of the server (see PrepareForIncrementalBackup). This
> set of sanity checks seems fairly important to me, and I'd regret
> discarding them. I think there's some possibility that they might
> catch user error, like where somebody promotes multiple standbys and
> maybe they even get the same timeline on more than one of them, and
> then confusion might ensue.
[..]

> Another practical problem here is that, right now, pg_combinebackup
> doesn't have an in-place mode. It knows how to take a bunch of input
> backups and write out an output backup, but that output backup needs
> to go into a new, fresh directory (or directories plural, if there are
> user-defined tablespaces). I had previously considered adding such a
> mode, but the idea I had at the time wouldn't have worked for this
> case. I imagined that someone might want to run "pg_combinebackup
> --in-place full incr" and clobber the contents of the incr directory
> with the output, basically discarding the incremental backup you took
> in favor of a full backup that could have been taken at the same point
> in time.
[..]

Thanks for answering! It all sounds like this
resync-standby-using-primary-incrbackup idea isn't a fit for the
current pg_combinebackup, but rather for a new tool, hopefully in the
future. It could take the current LSN from the stuck standby, calculate
a manifest on the lagged and offline standby (do we need to calculate
manifest checksums in that case? I cannot find code for it), deliver it
via "UPLOAD_MANIFEST" to the primary, and start fetching and applying
the differences while doing some form of copy-on-write from the old and
incoming incremental backup data to "$relfilenodeid.new", and then
durable_unlink() the old one and
durable_rename("$relfilenodeid.new", "$relfilenodeid"). Would it still
be possible in theory? (It could use additional safeguards, like
renaming the control file when starting and again just before ending,
to block startup if it hasn't finished.) Also, judging by the comment
near struct IncrementalBackupInfo.manifest_files, it looks as if the
checksums are more a safeguard than a core part of the implementation (?)

What I meant with the initial idea is not to hinder the current
efforts, but to ask whether the current design would stand in the way
of such a cool new addition in the future?

> One thing I also realized when thinking about this is that you could
> probably hose yourself with the patch set as it stands today by taking
> a full backup, downgrading to wal_level=minimal for a while, doing
> some WAL-skipping operations, upgrading to a higher WAL-level again,
> and then taking an incremental backup. I think the solution to that is
> probably for the WAL summarizer to refuse to run if wal_level=minimal.
> Then there would be a gap in the summary files which an incremental
> backup attempt would detect.

As per the earlier test [1], I've already tried to simulate that in
incrbackuptests-0.1.tgz/test_across_wallevelminimal.sh, but that
worked (though that was with the CTAS wal_level=minimal optimization:
a new relfilenode OID is used for CTAS, which got included in the
incremental backup as it's a new file). I even retested that with your
v7 patch with asserts, same result. When simulating with "BEGIN;
TRUNCATE nightmare; COPY nightmare FROM '/tmp/copy.out'; COMMIT;" on
wal_level=minimal it still recovers using the incremental backup
because the WAL contains:

rmgr: Storage, desc: CREATE base/5/36425
[..]
rmgr: XLOG,    desc: FPI , blkref #0: rel 1663/5/36425 blk 0 FPW
[..]

E.g. TRUNCATE sets a new relfilenode each time, so truncated relations
will always be included in the backup, and the wal_level=minimal
optimizations kick in only for commands that issue a new relfilenode.
True/false?

postgres=# select oid, relfilenode, relname from pg_class where
relname like 'night%' order by 1;
  oid  | relfilenode |       relname
-------+-------------+---------------------
 16384 |           0 | nightmare
 16390 |       36420 | nightmare_p0
 16398 |       36425 | nightmare_p1
 36411 |           0 | nightmare_pkey
 36413 |       36422 | nightmare_p0_pkey
 36415 |       36427 | nightmare_p1_pkey
 36417 |           0 | nightmare_brin_idx
 36418 |       36423 | nightmare_p0_ts_idx
 36419 |       36428 | nightmare_p1_ts_idx
(9 rows)

postgres=# truncate nightmare;
TRUNCATE TABLE
postgres=# select oid, relfilenode, relname from pg_class where
relname like 'night%' order by 1;
  oid  | relfilenode |       relname
-------+-------------+---------------------
 16384 |           0 | nightmare
 16390 |       36434 | nightmare_p0
 16398 |       36439 | nightmare_p1
 36411 |           0 | nightmare_pkey
 36413 |       36436 | nightmare_p0_pkey
 36415 |       36441 | nightmare_p1_pkey
 36417 |           0 | nightmare_brin_idx
 36418 |       36437 | nightmare_p0_ts_idx
 36419 |       36442 | nightmare_p1_ts_idx

-J.

[1] - https://www.postgresql.org/message-id/CAKZiRmzT%2BbX2ZYdORO32cADtfQ9DvyaOE8fsOEWZc2V5FkEWVg%40mail.gmail.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Wed, Nov 1, 2023 at 8:57 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> Thanks for answering! It all sounds like this
> resync-standby-using-primary-incrbackup idea isn't a fit for the
> current pg_combinebackup, but rather for a new tool, hopefully in the
> future. It could take the current LSN from the stuck standby, calculate
> a manifest on the lagged and offline standby (do we need to calculate
> manifest checksums in that case? I cannot find code for it), deliver it
> via "UPLOAD_MANIFEST" to the primary, and start fetching and applying
> the differences while doing some form of copy-on-write from the old and
> incoming incremental backup data to "$relfilenodeid.new", and then
> durable_unlink() the old one and
> durable_rename("$relfilenodeid.new", "$relfilenodeid"). Would it still
> be possible in theory? (It could use additional safeguards, like
> renaming the control file when starting and again just before ending,
> to block startup if it hasn't finished.) Also, judging by the comment
> near struct IncrementalBackupInfo.manifest_files, it looks as if the
> checksums are more a safeguard than a core part of the implementation (?)
>
> What I meant with the initial idea is not to hinder the current
> efforts, but to ask whether the current design would stand in the way
> of such a cool new addition in the future?

Hmm, interesting idea. I think something like that could be made to
work. My first thought was that it would sort of suck to have to
compute a manifest as a precondition of doing this, but then I started
to think maybe it wouldn't, really. I mean, you'd have to scan the
local directory tree and collect all the filenames so that you could
remove any files that are no longer present in the current version of
the data directory which the incremental backup would send to you. If
you're already doing that, the additional cost of generating a
manifest isn't that high, at least if you don't include checksums,
which aren't required. On the other hand, if you didn't need to send
the server a manifest and just needed to send the required WAL ranges,
that would be even cheaper. I'll spend some more time thinking about
this next week.

> As per the earlier test [1], I've already tried to simulate that in
> incrbackuptests-0.1.tgz/test_across_wallevelminimal.sh, but that
> worked (though that was with the CTAS wal_level=minimal optimization:
> a new relfilenode OID is used for CTAS, which got included in the
> incremental backup as it's a new file). I even retested that with your
> v7 patch with asserts, same result. When simulating with "BEGIN;
> TRUNCATE nightmare; COPY nightmare FROM '/tmp/copy.out'; COMMIT;" on
> wal_level=minimal it still recovers using the incremental backup
> because the WAL contains:

TRUNCATE itself is always WAL-logged, but data added to the relation
in the same transaction as the TRUNCATE isn't always WAL-logged (but
sometimes it is, depending on the relation size). So the failure case
wouldn't be missing the TRUNCATE but missing some data-containing
blocks within the relation shortly after it was created or truncated.
I think what I need to do here is avoid summarizing WAL that was
generated under wal_level=minimal. The walsummarizer process should
just refuse to emit summaries for any such WAL.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Mon, Oct 30, 2023 at 2:46 PM Andres Freund <andres@anarazel.de> wrote:
> After playing with this for a while, I don't see a reason for wal_summarize_mb
> from a memory usage POV at least.

Here's v8. Changes:

- Replace wal_summarize_mb GUC with summarize_wal = on | off.
- Document the summarize_wal and wal_summary_keep_time GUCs.
- Refuse to start with summarize_wal = on and wal_level = minimal.
- Increase default wal_summary_keep_time to 10d from 7d, per (what I
think was) a suggestion from Peter E.
- Fix fencepost errors when deciding which WAL summaries are needed
for a backup.
- Fix indentation damage.
- Standardize on ereport(DEBUG1, ...) in walsummarizer.c vs. various
more and less chatty things I had before.
- Include the timeline in some error messages because not having it
proved confusing.
- Be more consistent about ignoring the FSM fork.
- Fix a bug that could cause WAL summarization to error out when
switching timelines.
- Fix the division between the wal summarizer and incremental backup
patches so that the former passes tests without the latter.
- Fix some things that an older compiler didn't like, including adding
pg_attribute_printf in some places.
- Die with an error instead of crashing if someone feeds us a manifest
with no WAL ranges.
- Sort the block numbers that need to be read from a relation file
before reading them, so that we're certain to read them in ascending
order.
- Be more careful about computing the truncation_block_length of an
incremental file; don't do math on a block number that might be
InvalidBlockNumber.
- Fix pg_combinebackup so it doesn't fail when zero-filled blocks are
added to a relation between the prior backup and the incremental
backup.
- Improve the pg_combinebackup -d output so that it explains in detail
how it's carrying out reconstruction, to improve debuggability.
- Disable WAL summarization by default, but add a test patch to the
series to enable it, because running the whole test suite with it
turned on is good for bug-hunting.
- In pg_walsummary, zero a struct before using instead of starting
with arbitrary junk values.

To do list:

- Figure out whether to do something other than uploading the whole
summary, per discussion with Jakub Wartak.
- Decide what to do about the 60-second waiting-for-WAL-summarization timeout.
- Make incremental backup fail quickly if WAL summarization is not even enabled.
- Have pg_basebackup error out nicely if an incremental backup is
requested from an older server that can't do that.
- Add some kind of tests for pg_walsummary.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: trying again to get incremental backup

From
Dilip Kumar
Date:
On Tue, Nov 7, 2023 at 2:06 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Oct 30, 2023 at 2:46 PM Andres Freund <andres@anarazel.de> wrote:
> > After playing with this for a while, I don't see a reason for wal_summarize_mb
> > from a memory usage POV at least.
>
> Here's v8. Changes:

Review comments, based on what I reviewed so far.

- I think 0001 looks like a good improvement irrespective of the patch series.

- review 0003
1.
+       be enabled either on a primary or on a standby. WAL summarization can
+       cannot be enabled when <varname>wal_level</varname> is set to
+       <literal>minimal</literal>.

Grammatical error
"WAL summarization can cannot" -> WAL summarization cannot

2.
+     <varlistentry id="guc-wal-summarize-keep-time"
xreflabel="wal_summarize_keep_time">
+      <term><varname>wal_summarize_keep_time</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>wal_summarize_keep_time</varname>
configuration parameter</primary>
+      </indexterm>

I feel the name of the GUC should be either wal_summarizer_keep_time
or wal_summaries_keep_time, I mean we should refer either to the
summarizer process or to the summary files.

3.

+XLogGetOldestSegno(TimeLineID tli)
+{
+
+ /* Ignore files that are not XLOG segments */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */

Some of the single-line comments end with a full stop whereas others
do not, so better to be consistent.

4.

+ * If start_lsn != InvalidXLogRecPtr, only summaries that end before the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)


Instead of "If start_lsn != InvalidXLogRecPtr, only summaries that end
before the" it should be "If start_lsn != InvalidXLogRecPtr, only
summaries that end after the" because only if the summary files are
Ending after the start_lsn then it will have some overlapping and we
need to return them if ending before start lsn then those files are
not overlapping at all, right?

5.
In FilterWalSummaries() header also the comment is wrong same as for
GetWalSummaries() function.

6.
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,

I did not see the usage of this function, but I think if the whole
range is not covered why not keep the behavior uniform w.r.t. what we
set for '*missing_lsn',  I mean suppose there is no file then
missing_lsn is the start_lsn because a very first LSN is missing.

7.
+ nbytes = FileRead(io->file, data, length, io->filepos,
+   WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));

/could not write file/ could not read file

8.
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Alvaro Herrera
Date:
Great stuff you got here.  I'm doing a first pass trying to grok the
whole thing for more substantive comments, but in the meantime here are
some cosmetic ones.

I got the following warnings, both valid:

../../../../pgsql/source/master/src/common/blkreftable.c: In function 'WriteBlockRefTable':
../../../../pgsql/source/master/src/common/blkreftable.c:520:45: warning: declaration of 'brtentry' shadows a previous local [-Wshadow=compatible-local]
 
  520 |                         BlockRefTableEntry *brtentry;
      |                                             ^~~~~~~~
../../../../pgsql/source/master/src/common/blkreftable.c:492:37: note: shadowed declaration is here
  492 |                 BlockRefTableEntry *brtentry;
      |                                     ^~~~~~~~

../../../../../pgsql/source/master/src/backend/postmaster/walsummarizer.c: In function 'SummarizeWAL':
../../../../../pgsql/source/master/src/backend/postmaster/walsummarizer.c:833:57: warning: declaration of 'private_data' shadows a previous local [-Wshadow=compatible-local]
 
  833 |                         SummarizerReadLocalXLogPrivate *private_data;
      |                                                         ^~~~~~~~~~~~
../../../../../pgsql/source/master/src/backend/postmaster/walsummarizer.c:709:41: note: shadowed declaration is here
  709 |         SummarizerReadLocalXLogPrivate *private_data;
      |                                         ^~~~~~~~~~~~

In blkreftable.c, I think the definition of SH_EQUAL should have an
outer layer of parentheses.  Also, it would be good to provide and use a
function to initialize a BlockRefTableKey from the RelFileNode and
forknum components, and ensure that any padding bytes are zeroed.
Otherwise it's not going to be a great hash key.  On my platform there
aren't any (padding bytes), but I think it's unwise to rely on that.

I don't think SummarizerReadLocalXLogPrivate->waited is used for
anything.  Could be removed AFAICS, unless you foresee adding
something that uses it.

These forward struct declarations are not buying you anything, I'd
remove them:

diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
index 70d6c072d7..316e67122c 100644
--- a/src/include/common/blkreftable.h
+++ b/src/include/common/blkreftable.h
@@ -29,10 +29,7 @@
 /* Magic number for serialization file format. */
 #define BLOCKREFTABLE_MAGIC            0x652b137b
 
-struct BlockRefTable;
-struct BlockRefTableEntry;
-struct BlockRefTableReader;
-struct BlockRefTableWriter;
+/* Struct definitions appear in blkreftable.c */
 typedef struct BlockRefTable BlockRefTable;
 typedef struct BlockRefTableEntry BlockRefTableEntry;
 typedef struct BlockRefTableReader BlockRefTableReader;


and backup_label.h doesn't know about TimeLineID, so it needs this:

diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
index 08d6ed67a9..3af7ea274c 100644
--- a/src/bin/pg_combinebackup/backup_label.h
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -12,6 +12,7 @@
 #ifndef BACKUP_LABEL_H
 #define BACKUP_LABEL_H
 
+#include "access/xlogdefs.h"
 #include "common/checksum_helper.h"
 #include "lib/stringinfo.h"
 

I don't much like the way header files in src/bin/pg_combinebackup files
are structured.  Particularly, causing a simplehash to be "instantiated"
just because load_manifest.h is included seems poised to cause pain.  I
think there should be a file with the basic struct declarations (no
simplehash); and then maybe since both pg_basebackup and
pg_combinebackup seem to need the same simplehash, create a separate
header file containing just that..  But, did you notice that anything
that includes reconstruct.h will instantiate the simplehash stuff,
because it includes load_manifest.h?  It may be unwise to have the
simplehash in a header file.  Maybe just declare it in each .c file that
needs it.  The duplication is not that large.

I'll see if I can understand the way all these headers are needed to
propose some other arrangement.

I see this idea of having "struct FooBar;" immediately followed by
"typedef struct FooBar FooBar;" which I mentioned from blkreftable.h
occurs in other places as well (JsonManifestParseContext in
parse_manifest.h, maybe others?).  Was this pattern cargo-culted from
somewhere?  Perhaps we have other places to clean up.


Why leave unnamed arguments in function declarations?  For example, in

static void manifest_process_file(JsonManifestParseContext *,
                                  char *pathname,
                                  size_t size,
                                  pg_checksum_type checksum_type,
                                  int checksum_length,
                                  uint8 *checksum_payload);
the first argument lacks a name.  Is this just an oversight, I hope?


In GetFileBackupMethod(), which arguments are in and which are out?
The comment doesn't say, and it's not obvious why we pass both the file
path as well as the individual constituent pieces for it.

DO_NOT_BACKUP_FILE appears not to be set anywhere.  Do you expect to use
this later?  If not, maybe remove it.

There are two functions named record_manifest_details_for_file() in
different programs.  I think this sort of arrangement is not great, as
it is confusing to follow.  It would be better if those two
routines were called something like, say, verifybackup_perfile_cb and
combinebackup_perfile_cb instead; then in the function comment say
something like 
/*
 * JsonManifestParseContext->perfile_cb implementation for pg_combinebackup.
 *
 * Record details extracted from the backup manifest for one file,
 * because we like to keep things tracked or whatever.
 */
so it's easy to track down what does what and why.  Same with
perwalrange_cb.  "perfile" looks bothersome to me as a name entity.  Why
not per_file_cb? and per_walrange_cb?
 

In walsummarizer.c, HandleWalSummarizerInterrupts is called in
summarizer_read_local_xlog_page but SummarizeWAL() doesn't do that.
Maybe it should?

I think this path is not going to be very human-likeable.
        snprintf(final_path, MAXPGPATH,
                 XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
                 tli,
                 LSN_FORMAT_ARGS(summary_start_lsn),
                 LSN_FORMAT_ARGS(summary_end_lsn));
Why not add a dash between the TLI and between both LSNs, or something
like that?  (Also, are we really printing TLIs as 8-byte hexs?)

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"I suspect most samba developers are already technically insane...
Of course, since many of them are Australians, you can't tell." (L. Torvalds)



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Fri, Nov 10, 2023 at 6:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> - I think 0001 looks like a good improvement irrespective of the patch series.

OK, perhaps that can be independently committed, then, if nobody objects.

Thanks for the review; I've fixed a bunch of things that you
mentioned. I'll just comment on the ones I haven't yet done anything
about below.

> 2.
> +     <varlistentry id="guc-wal-summarize-keep-time"
> xreflabel="wal_summarize_keep_time">
> +      <term><varname>wal_summarize_keep_time</varname> (<type>boolean</type>)
> +      <indexterm>
> +       <primary><varname>wal_summarize_keep_time</varname>
> configuration parameter</primary>
> +      </indexterm>
>
> I feel the name of the GUC should be either wal_summarizer_keep_time
> or wal_summaries_keep_time, I mean we should refer either to the
> summarizer process or to the summary files.

How about wal_summary_keep_time?

> 6.
> + * If the whole range of LSNs is covered, returns true, otherwise false.
> + * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
> + * if there are no WAL summary files in the input list, or to the first LSN
> + * in the range that is not covered by a WAL summary file in the input list.
> + */
> +bool
> +WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
>
> I did not see the usage of this function, but I think if the whole
> range is not covered why not keep the behavior uniform w.r.t. what we
> set for '*missing_lsn',  I mean suppose there is no file then
> missing_lsn is the start_lsn because a very first LSN is missing.

It's used later in the patch series. I think the way that I have it
makes for a more understandable error message.

> 8.
> +/*
> + * Comparator to sort a List of WalSummaryFile objects by start_lsn.
> + */
> +static int
> +ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
> +{

I'm not sure what needs fixing here.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Mon, Nov 13, 2023 at 11:25 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Great stuff you got here.  I'm doing a first pass trying to grok the
> whole thing for more substantive comments, but in the meantime here are
> some cosmetic ones.

Thanks, thanks, and thanks.

I've fixed some things that you mentioned in the attached version.
Other comments below.

> In blkreftable.c, I think the definition of SH_EQUAL should have an
> outer layer of parentheses.  Also, it would be good to provide and use a
> function to initialize a BlockRefTableKey from the RelFileNode and
> forknum components, and ensure that any padding bytes are zeroed.
> Otherwise it's not going to be a great hash key.  On my platform there
> aren't any (padding bytes), but I think it's unwise to rely on that.

I'm having trouble understanding the second part of this suggestion.
Note that in a frontend context, SH_RAW_ALLOCATOR is pg_malloc0, and
in a backend context, we get the default, which is
MemoryContextAllocZero. Maybe there's some case this doesn't cover,
though?

> These forward struct declarations are not buying you anything, I'd
> remove them:

I've had problems from time to time when I don't do this. I'll remove
it here, but I'm not convinced that it's always useless.

> I don't much like the way header files in src/bin/pg_combinebackup files
> are structured.  Particularly, causing a simplehash to be "instantiated"
> just because load_manifest.h is included seems poised to cause pain.  I
> think there should be a file with the basic struct declarations (no
> simplehash); and then maybe since both pg_basebackup and
> pg_combinebackup seem to need the same simplehash, create a separate
> header file containing just that..  But, did you notice that anything
> that includes reconstruct.h will instantiate the simplehash stuff,
> because it includes load_manifest.h?  It may be unwise to have the
> simplehash in a header file.  Maybe just declare it in each .c file that
> needs it.  The duplication is not that large.

I think that I did this correctly. AIUI, if you're defining a
simplehash that only one source file needs, you make the scope
"static" and do both SH_DECLARE and SH_DEFILE it in that file. If you
need it to be shared by multiple files, you make it "extern" in the
header file, do SH_DECLARE there, and SH_DEFINE in one of those source
files. Or you could make the scope "static inline" in the header file
and then you'd both SH_DECLARE and SH_DEFINE it in the header file.

If I were to do as you suggest here, I think I'd end up with 2 copies
of the compiled code for this instead of one, and if they ever got out
of sync everything would break silently.

> Why leave unnamed arguments in function declarations?  For example, in
>
> static void manifest_process_file(JsonManifestParseContext *,
>                                   char *pathname,
>                                   size_t size,
>                                   pg_checksum_type checksum_type,
>                                   int checksum_length,
>                                   uint8 *checksum_payload);
> the first argument lacks a name.  Is this just an oversight, I hope?

I mean, I've changed it now, but I don't think it's worth getting too
excited about. "int checksum_length" is much better documentation than
just "int," but "JsonManifestParseContext *context" is just noise,
IMHO. You can argue that it's better for consistency that way, but
whatever.

> In GetFileBackupMethod(), which arguments are in and which are out?
> The comment doesn't say, and it's not obvious why we pass both the file
> path as well as the individual constituent pieces for it.

The header comment does document which values are potentially set on
return. I guess I thought it was clear enough that the stuff not
documented to be output parameters was input parameters. Most of them
aren't even pointers, so they have to be input parameters. The only
exception is 'path', which I have some difficulty thinking that anyone
is going to imagine to be an input pointer.

Maybe you could propose a more specific rewording of this comment?
FWIW, I'm not altogether sure whether this function is going to get
more heavily adjusted in a rev or three of the patch set, so maybe we
want to wait to sort this out until this is closer to final, but OTOH
if I know what you have in mind for the current version, I might be
more likely to keep it in a good place if I end up changing it.

> DO_NOT_BACKUP_FILE appears not to be set anywhere.  Do you expect to use
> this later?  If not, maybe remove it.

Woops, that was a holdover from an earlier version.

> There are two functions named record_manifest_details_for_file() in
> different programs.  I think this sort of arrangement is not great, as
> it is confusing to follow.  It would be better if those two
> routines were called something like, say, verifybackup_perfile_cb and
> combinebackup_perfile_cb instead; then in the function comment say
> something like
> /*
>  * JsonManifestParseContext->perfile_cb implementation for pg_combinebackup.
>  *
>  * Record details extracted from the backup manifest for one file,
>  * because we like to keep things tracked or whatever.
>  */
> so it's easy to track down what does what and why.  Same with
> perwalrange_cb.  "perfile" looks bothersome to me as a name entity.  Why
> not per_file_cb? and per_walrange_cb?

I had trouble figuring out how to name this stuff. I did notice the
awkwardness, but surely nobody can think that two functions with the
same name in different binaries can be actually the same function.

If we want to inject more underscores here, my vote is to go all the
way and make it per_wal_range_cb.

> In walsummarizer.c, HandleWalSummarizerInterrupts is called in
> summarizer_read_local_xlog_page but SummarizeWAL() doesn't do that.
> Maybe it should?

I replaced all the CHECK_FOR_INTERRUPTS() in that file with
HandleWalSummarizerInterrupts(). Does that seem right?

> I think this path is not going to be very human-likeable.
>                 snprintf(final_path, MAXPGPATH,
>                                  XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
>                                  tli,
>                                  LSN_FORMAT_ARGS(summary_start_lsn),
>                                  LSN_FORMAT_ARGS(summary_end_lsn));
> Why not add a dash between the TLI and between both LSNs, or something
> like that?  (Also, are we really printing TLIs as 8-byte hexs?)

Dealing with the last part first, we already do that in every WAL file
name. I actually think these file names are easier to work with than
WAL file names, because 000000010000000000000020 is not the WAL
starting at 0/20, but rather the WAL starting at 0/20000000. To know
at what LSN a WAL file starts, you have to mentally delete characters
17 through 22, which will always be zero, and instead add six zeroes
at the end. I don't think whoever came up with that file naming
convention deserves an award, unless it's a raspberry award. With
these names, you get something like
0000000100000000015125B800000000015128F0.summary and you can sort of
see that 1512 repeats so the LSN went from something ending in 5B8 to
something ending in 8F0. I actually think it's way better.
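
For anyone squinting at those names, they decode mechanically; a
throwaway sketch assuming the layout shown above (an 8-hex-digit TLI
followed by two 16-hex-digit LSNs):

f=0000000100000000015125B800000000015128F0
printf 'TLI %d, start %X/%X, end %X/%X\n' \
    "$((16#${f:0:8}))" "$((16#${f:8:8}))" "$((16#${f:16:8}))" \
    "$((16#${f:24:8}))" "$((16#${f:32:8}))"
# TLI 1, start 0/15125B8, end 0/15128F0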

But I have a hard time arguing that it wouldn't be more readable still
if we put some separator characters in there. I didn't do that because
then they'd look less like WAL file names, but maybe that's not really
a problem. A possible reason not to bother is that these files are
less necessary for humans to care about than WAL files, since they
don't need to be archived or transported between nodes in any way.
Basically I think this is probably fine the way it is, but if you or
others think it's really important to change it, I can do that. Just
as long as we don't spend 50 emails arguing about which separator
character to use.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: trying again to get incremental backup

From
Dilip Kumar
Date:
On Tue, Nov 14, 2023 at 12:52 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Nov 10, 2023 at 6:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > - I think 0001 looks like a good improvement irrespective of the patch series.
>
> OK, perhaps that can be independently committed, then, if nobody objects.
>
> Thanks for the review; I've fixed a bunch of things that you
> mentioned. I'll just comment on the ones I haven't yet done anything
> about below.
>
> > 2.
> > +     <varlistentry id="guc-wal-summarize-keep-time"
> > xreflabel="wal_summarize_keep_time">
> > +      <term><varname>wal_summarize_keep_time</varname> (<type>boolean</type>)
> > +      <indexterm>
> > +       <primary><varname>wal_summarize_keep_time</varname>
> > configuration parameter</primary>
> > +      </indexterm>
> >
> > I feel the name of the GUC should be either wal_summarizer_keep_time
> > or wal_summaries_keep_time, I mean we should refer either to the
> > summarizer process or to the summary files.
>
> How about wal_summary_keep_time?

Yes, that looks perfect to me.

> > 6.
> > + * If the whole range of LSNs is covered, returns true, otherwise false.
> > + * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
> > + * if there are no WAL summary files in the input list, or to the first LSN
> > + * in the range that is not covered by a WAL summary file in the input list.
> > + */
> > +bool
> > +WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
> >
> > I did not see the usage of this function, but I think if the whole
> > range is not covered why not keep the behavior uniform w.r.t. what we
> > set for '*missing_lsn',  I mean suppose there is no file then
> > missing_lsn is the start_lsn because a very first LSN is missing.
>
> It's used later in the patch series. I think the way that I have it
> makes for a more understandable error message.

Okay

> > 8.
> > +/*
> > + * Comparator to sort a List of WalSummaryFile objects by start_lsn.
> > + */
> > +static int
> > +ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
> > +{
>
> I'm not sure what needs fixing here.

I think I copy-pasted it by mistake, just ignore it.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Dilip Kumar
Date:
On Tue, Nov 14, 2023 at 2:10 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Nov 13, 2023 at 11:25 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > Great stuff you got here.  I'm doing a first pass trying to grok the
> > whole thing for more substantive comments, but in the meantime here are
> > some cosmetic ones.
>
> Thanks, thanks, and thanks.
>
> I've fixed some things that you mentioned in the attached version.
> Other comments below.

Here are some more comments based on what I have read so far, mostly
cosmetic comments.

1.
+ * summary file yet, then stoppng doesn't make any sense, and we
+ * should wait until the next stop point instead.

Typo /stoppng/stopping

2.
+ /* Close temporary file and shut down xlogreader. */
+ FileClose(io.file);
+

We have already freed the xlogreader so the second part of the comment
is not valid.

3.
+ /*
+ * If a relation fork is truncated on disk, there is in point in
+ * tracking anything about block modifications beyond the truncation
+ * point.


Typo. /there is in point/ there is no point

4.
+/*
+ * Special handling for WAL recods with RM_XACT_ID.
+ */

/recods/records

5.

+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+    forknum, 0);
+ }
+ }

For SmgrCreate and Truncate I understand setting the 'limit block' but
why for commit/abort?  I think it would be better to add some comments
here.

6.
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)

The comments say "private_data->read_upto to the lowest LSN that is
not known to be safe" but is it really the lowest LSN? I think it is
the highest LSN this is known to be safe for that TLI no?

7.
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();

I was just wondering whether removing old summaries should be the job
of the wal summarizer or the job of the checkpointer; I mean, while
removing the old WAL it could also check for and remove the old
summaries?  Anyway, it's just a question and I do not have a strong
opinion on this.

8.
+ /*
+ * Whether we we removed the file or not, we need not consider it
+ * again.
+ */

Typo /Whether we we removed/ Whether we removed

9.
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+   ForkNumber forknum, BlockNumber *limit_block)

If this function is already returning 'BlockRefTableEntry' then why
would it need to set an extra '*limit_block' out parameter which it is
actually reading from the entry itself?


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Alvaro Herrera
Date:
0001 looks OK to push, and since it stands on its own I would get it out
of the way soon rather than waiting for the rest of the series to be
further reviewed.


0002:
This moves bin/pg_verifybackup/parse_manifest.c to
common/parse_manifest.c, where it's not clear that it's for backup
manifests (wasn't a problem in the previous location).  I wonder if
we're going to have anything else called "manifest", in which case I
propose to rename the file to make it clear that this is about backup
manifests -- maybe parse_bkp_manifest.c.

This patch looks pretty uncontroversial, but there's no point in going
further with this one until followup patches are closer to commit.


0003:
AmWalSummarizerProcess() is unused.  Remove?

MaybeWalSummarizer() is called on each ServerLoop() in postmaster.c?
This causes a function call to be emitted every time through.  That
looks odd.  All other process starts have some triggering condition.  

GetOldestUnsummarizedLSN uses while(true), but WaitForWalSummarization
and SummarizeWAL use while(1). Maybe settle on one style?

Still reading this one.


0004:
in PrepareForIncrementalBackup(), the logic that determines
earliest_wal_range_tli and latest_wal_range_tli looks pretty weird.  I
think it works fine if there's a single timeline, but not otherwise.
Or maybe the trick is that it relies on timelines returned by
readTimeLineHistory being sorted backwards?  If so, maybe add a comment
about that somewhere; I don't think other callers of readTimeLineHistory
make that assumption.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Postgres is bloatware by design: it was built to house
 PhD theses." (Joey Hellerstein, SIGMOD annual conference 2002)



Re: trying again to get incremental backup

From
Jakub Wartak
Date:
Hi Robert,

[..spotted the v9 patchset..]

so I've spent some more time playing with patchset v8 (without the
6/6 testing patch related to wal_level=minimal); anything tested
against patchset v9 is marked as such.

1. At compile time there were 2 warnings about shadowed variables (at
least with gcc version 10.2.1), but in v9 that is fixed:

blkreftable.c: In function ‘WriteBlockRefTable’:
blkreftable.c:520:24: warning: declaration of ‘brtentry’ shadows a
previous local [-Wshadow=compatible-local]
walsummarizer.c: In function ‘SummarizeWAL’:
walsummarizer.c:833:36: warning: declaration of ‘private_data’ shadows
a previous local [-Wshadow=compatible-local]

2. Usability thing: I hit the timeout hard: "This backup requires WAL
to be summarized up to 0/90000D8, but summarizer has only reached
0/0." with summarize_wal=off (default), but apparently this is in the
TODO. Looks like an important usability thing.

3. I've verified that if the DB was in wal_level=minimal even
temporarily (and thus summarization was disabled), it is impossible to
take an incremental backup:

pg_basebackup: error: could not initiate base backup: ERROR:  WAL
summaries are required on timeline 1 from 0/70000D8 to 0/10000028, but
the summaries for that timeline and LSN range are incomplete
DETAIL:  The first unsummarized LSN is this range is 0/D04AE88.

4. As we have discussed off list, there is (was) this
pg_combinebackup bug in v8's reconstruct_from_incremental_file() where,
when combining multiple incremental backups, it was unable to realize
that it should stop looking for the previous instance of the full file
(while it was fine with v6 of the patchset). I've checked it on v9 -
it is good now.

5. On v8 I've finally played a little bit with standby(s) and this
patchset, with a couple of basic scenarios mixing the source of the
backups:

a. full on standby, incr1 on standby, full db restore (incl. incr1) on standby
    # sometimes I'm getting spurious errors like the ones below when
doing incrementals on standby with -c fast:
    2023-11-15 13:49:05.721 CET [10573] LOG:  recovery restart point
at 0/A000028
    2023-11-15 13:49:07.591 CET [10597] WARNING:  aborting backup due
to backend exiting before pg_backup_stop was called
    2023-11-15 13:49:07.591 CET [10597] ERROR:  manifest requires WAL
from final timeline 1 ending at 0/A0000F8, but this backup starts at
0/A000028
    2023-11-15 13:49:07.591 CET [10597] STATEMENT:  BASE_BACKUP (
INCREMENTAL,  LABEL 'pg_basebackup base backup',  PROGRESS,
CHECKPOINT 'fast',  WAIT 0,  MANIFEST 'yes',  TARGET 'client')
    # when you retry the same pg_basebackup it goes fine (looks like a
CHECKPOINT on standby/restartpoint <-> summarizer disconnect; I'll dig
deeper tomorrow. It seems that issuing "CHECKPOINT; pg_sleep(1);"
against the primary just before pg_basebackup --incr on the standby
works around it)

b. full on primary, incr1 on standby, full db restore (incl. incr1) on
standby # WORKS
c. full on standby, incr1 on standby, full db restore (incl. incr1) on
primary # WORKS*
d. full on primary, incr1 on standby, full db restore (incl. incr1) on
primary # WORKS*

* - needs pg_promote() due to the controlfile having the standby bit
set, plus potential fiddling with postgresql.auto.conf as it contains
the primary_connstring GUC.

6. Sci-fi mode on: I was wondering about the dangers of e.g. having a
more recent pg_basebackup (e.g. from pg18 one day) running against
pg17 with respect to this incremental backup capability. Is it going
to be safe (currently there seem to be no safeguards against such
use), or should those things (core, pg_basebackup) be running in
version lock step?

Regards,
-J.



Re: trying again to get incremental backup

From
Alvaro Herrera
Date:
On 2023-Nov-13, Robert Haas wrote:

> On Mon, Nov 13, 2023 at 11:25 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> > Also, it would be good to provide and use a
> > function to initialize a BlockRefTableKey from the RelFileNode and
> > forknum components, and ensure that any padding bytes are zeroed.
> > Otherwise it's not going to be a great hash key.  On my platform there
> > aren't any (padding bytes), but I think it's unwise to rely on that.
> 
> I'm having trouble understanding the second part of this suggestion.
> Note that in a frontend context, SH_RAW_ALLOCATOR is pg_malloc0, and
> in a backend context, we get the default, which is
> MemoryContextAllocZero. Maybe there's some case this doesn't cover,
> though?

I meant code like this

    memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
    key.forknum = forknum;
    entry = blockreftable_lookup(brtab->hash, key);

where any padding bytes in "key" could have arbitrary values, because
they're not initialized.  So I'd have a (maybe static inline) function
  BlockRefTableKeyInit(&key, rlocator, forknum)
that fills it in for you.

Note:
  #define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0)
AFAICT the new simplehash uses in this patch series are the only ones
that use memcmp() as SH_EQUAL, so we don't necessarily have precedent on
lack of padding bytes initialization in existing uses of simplehash.
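
To make that concrete, a minimal sketch of such an initializer
(assuming the BlockRefTableKey and RelFileLocator definitions from the
patch; this is not the patch's code) could be:

static inline void
BlockRefTableKeyInit(BlockRefTableKey *key, const RelFileLocator *rlocator,
                     ForkNumber forknum)
{
    /* Zero the whole struct first so any padding bytes are well-defined. */
    memset(key, 0, sizeof(BlockRefTableKey));
    memcpy(&key->rlocator, rlocator, sizeof(RelFileLocator));
    key->forknum = forknum;
}

Callers would then write BlockRefTableKeyInit(&key, rlocator, forknum)
before blockreftable_lookup(), and the memcmp()-based SH_EQUAL stays
reliable.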

> > These forward struct declarations are not buying you anything, I'd
> > remove them:
> 
> I've had problems from time to time when I don't do this. I'll remove
> it here, but I'm not convinced that it's always useless.

Well, certainly there are places where they are necessary.

> > I don't much like the way header files in src/bin/pg_combinebackup files
> > are structured.  Particularly, causing a simplehash to be "instantiated"
> > just because load_manifest.h is included seems poised to cause pain.  I
> > think there should be a file with the basic struct declarations (no
> > simplehash); and then maybe since both pg_basebackup and
> > pg_combinebackup seem to need the same simplehash, create a separate
> > header file containing just that..  But, did you notice that anything
> > that includes reconstruct.h will instantiate the simplehash stuff,
> > because it includes load_manifest.h?  It may be unwise to have the
> > simplehash in a header file.  Maybe just declare it in each .c file that
> > needs it.  The duplicity is not that large.
> 
> I think that I did this correctly.

Oh, I hadn't grokked that we had this SH_SCOPE thing and a separate
SH_DECLARE for it being extern.  OK, please ignore that.

> > Why leave unnamed arguments in function declarations?
> 
> I mean, I've changed it now, but I don't think it's worth getting too
> excited about.

Well, we did get into consistency arguments on this point previously.  I
agree it's not *terribly* important, but on thread
https://www.postgresql.org/message-id/flat/CAH2-WznJt9CMM9KJTMjJh_zbL5hD9oX44qdJ4aqZtjFi-zA3Tg%40mail.gmail.com
people got really worked up about this stuff.

> > In GetFileBackupMethod(), which arguments are in and which are out?
> > The comment doesn't say, and it's not obvious why we pass both the file
> > path as well as the individual constituent pieces for it.
> 
> The header comment does document which values are potentially set on
> return. I guess I thought it was clear enough that the stuff not
> documented to be output parameters was input parameters. Most of them
> aren't even pointers, so they have to be input parameters. The only
> exception is 'path', which I have some difficulty thinking that anyone
> is going to imagine to be an input pointer.

An output pointer, you mean :-)  (Should it be const?)

When the return value is BACK_UP_FILE_FULLY, it's not clear what happens
to these output values; we modify some, but why?  Maybe they should be
left alone?  In that case, the "if size == 0" test should move a couple
of lines up, in the brtentry == NULL block.

BTW, you could do the qsort() after deciding to back up the file fully if
more than 90% needs to be replaced.

BTW, in sendDir() why do
 lookup_path = pstrdup(pathbuf + basepathlen + 1);
when you could do
 lookup_path = pstrdup(tarfilename);
?

> > There are two functions named record_manifest_details_for_file() in
> > different programs.
> 
> I had trouble figuring out how to name this stuff. I did notice the
> awkwardness, but surely nobody can think that two functions with the
> same name in different binaries can be actually the same function.

Of course not, but when cscope-jumping around, it is weird.

> If we want to inject more underscores here, my vote is to go all the
> way and make it per_wal_range_cb.

+1

> > In walsummarizer.c, HandleWalSummarizerInterrupts is called in
> > summarizer_read_local_xlog_page but SummarizeWAL() doesn't do that.
> > Maybe it should?
> 
> I replaced all the CHECK_FOR_INTERRUPTS() in that file with
> HandleWalSummarizerInterrupts(). Does that seem right?

Looks to be what walwriter.c does at least, so I guess it's OK.

> > I think this path is not going to be very human-likeable.
> >                 snprintf(final_path, MAXPGPATH,
> >                                  XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
> >                                  tli,
> >                                  LSN_FORMAT_ARGS(summary_start_lsn),
> >                                  LSN_FORMAT_ARGS(summary_end_lsn));
> > Why not add a dash between the TLI and between both LSNs, or something
> > like that?

> But I have a hard time arguing that it wouldn't be more readable still
> if we put some separator characters in there. I didn't do that because
> then they'd look less like WAL file names, but maybe that's not really
> a problem. A possible reason not to bother is that these files are
> less necessary for humans to care about than WAL files, since they
> don't need to be archived or transported between nodes in any way.
> Basically I think this is probably fine the way it is, but if you or
> others think it's really important to change it, I can do that. Just
> as long as we don't spend 50 emails arguing about which separator
> character to use.

Yeah, I just think that an endless stream of hex chars is hard to read,
and I've found myself following digits on the screen with my fingers in
order to parse file names.  I guess you could say thousands separators
for regular numbers aren't needed either, but we do have them for
readability's sake.
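
For illustration only, the dash-separated variant under discussion
would amount to something like this (a sketch; which separator to use
is left open):

    snprintf(final_path, MAXPGPATH,
             XLOGDIR "/summaries/%08X-%08X%08X-%08X%08X.summary",
             tli,
             LSN_FORMAT_ARGS(summary_start_lsn),
             LSN_FORMAT_ARGS(summary_end_lsn));

which would produce names like
00000001-0000000001000028-00000000010000F8.summary.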


I think a new section in chapter 30 "Reliability and the Write-Ahead
Log" is warranted.  It would explain the summarization process, what the
summary files are used for, and the deletion mechanism.  I can draft
something if you want.

It's not clear to me if WalSummarizerCtl->pending_lsn is fulfilling some
purpose or it's just a leftover from prior development.  I see it's only
read in an assertion ... Maybe if we think this cross-check is
important, it should be turned into an elog?  Otherwise, I'd remove it.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"No me acuerdo, pero no es cierto.  No es cierto, y si fuera cierto,
 no me acuerdo."                 (Augusto Pinochet a una corte de justicia)



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Tue, Nov 14, 2023 at 8:12 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> 0001 looks OK to push, and since it stands on its own I would get it out
> of the way soon rather than waiting for the rest of the series to be
> further reviewed.

All right, done.

> 0003:
> AmWalSummarizerProcess() is unused.  Remove?

The intent seems to be to have one of these per enum value, whether it
gets used or not. Some of the others aren't used, either.

> MaybeWalSummarizer() is called on each ServerLoop() in postmaster.c?
> This causes a function call to be emitted every time through.  That
> looks odd.  All other process starts have some triggering condition.

I'm not sure how much this matters, really. I would expect that the
function call overhead here wouldn't be very noticeable. Generally I
think that when ServerLoop returns from WaitEventSetWait it's going to
be because we need to fork a process. That's pretty expensive compared
to a function call. If we can iterate through this loop lots of times
without doing any real work then it might matter, but I feel like
that's probably not the case, and probably something we would want to
fix if it were the case.

Now, I could nevertheless move some of the triggering conditions into
MaybeStartWalSummarizer(), but moving, say, just the summarize_wal
condition wouldn't be enough to avoid having MaybeStartWalSummarizer()
called repeatedly when there was no work to do, because summarize_wal
could be true and the summarizer could already be running. Similarly, if I
move just the WalSummarizerPID == 0 condition, the function gets
called repeatedly without doing anything when summarize_wal = off. So
at a minimum you have to move both of those if you care about avoiding
the function call overhead, and then you have to wonder if you care
about the corner cases where the function would be called repeatedly
for no gain even then.

Another approach would be to make the function static inline rather
than just static. Or we could delete the whole function and just
duplicate the logic it contains at both call sites. Personally I'm
inclined to just leave it how it is in the absence of some evidence
that there's a real problem here. It's nice to have all the triggering
conditions in one place with nothing duplicated.

> GetOldestUnsummarizedLSN uses while(true), but WaitForWalSummarization
> and SummarizeWAL use while(1). Maybe settle on one style?

OK.

> 0004:
> in PrepareForIncrementalBackup(), the logic that determines
> earliest_wal_range_tli and latest_wal_range_tli looks pretty weird.  I
> think it works fine if there's a single timeline, but not otherwise.
> Or maybe the trick is that it relies on timelines returned by
> readTimeLineHistory being sorted backwards?  If so, maybe add a comment
> about that somewhere; I don't think other callers of readTimeLineHistory
> make that assumption.

It does indeed rely on that assumption, and the comment at the top of
the for (i = 0; i < num_wal_ranges; ++i) loop explains that. Note also
the comment just below that begins "If we found this TLI in the
server's history". I agree with you that this logic looks strange, and
it's possible that there's some better way to encode the idea than
what I've done here, but I think it might be just that the particular
calculation we're trying to do here is strange. It's almost easier to
understand the logic if you start by reading the sanity checks
("manifest requires WAL from initial timeline %u starting at %X/%X,
but that timeline begins at %X/%X" et. al.), look at the triggering
conditions for those, and then work upward to see how
earliest/latest_wal_range_tli get set, and then look up from there to
see how saw_earliest/latest_wal_range_tli are used in computing those
values.

We do rely on the ordering assumption elsewhere. For example, in
XLogFileReadAnyTLI, see if (tli < curFileTLI) break. We also use it to
set expectedTLEs, which is documented to have this property. And
AddWALInfoToBackupManifest relies on it too; see the comment "Because
the timeline history file lists newer timelines before older ones" in
AddWALInfoToBackupManifest. We're not entirely consistent about this,
e.g., unlike XLogFileReadAnyTLI, tliInHistory() and
tliOfPointInHistory() don't have an early exit provision, but we do
use it some places.
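
For readers following along, the ordering assumption looks roughly like
this (paraphrasing the existing loop in XLogFileReadAnyTLI; nothing new
in this patch set):

/*
 * readTimeLineHistory() returns the history newest-first, e.g. for a
 * history 1 -> 2 -> 3 the list is {3, 2, 1}, which is what makes the
 * early exit below safe.
 */
foreach(cell, tles)
{
    TimeLineID  tli = ((TimeLineHistoryEntry *) lfirst(cell))->tli;

    if (tli < curFileTLI)
        break;              /* don't bother looking at too-old TLIs */

    /* ... otherwise try to read the segment on this timeline ... */
}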

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Thu, Nov 16, 2023 at 5:21 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I meant code like this
>
>         memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
>         key.forknum = forknum;
>         entry = blockreftable_lookup(brtab->hash, key);

Ah, I hadn't thought about that. Another way of handling that might be
to add = {0} to the declaration of key. But I can do the initializer
thing too if you think it's better. I'm not sure if there's an
argument that the initializer might optimize better.
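
In other words, a sketch of the {0} alternative at the call site quoted
above (illustrative only):

    BlockRefTableKey key = {0};     /* zero-initializes the fields and, in
                                     * practice, any padding bytes as well */

    memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
    key.forknum = forknum;
    entry = blockreftable_lookup(brtab->hash, key);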

> An output pointer, you mean :-)  (Should it be const?)

I'm bad at const, but that seems to work, so sure.

> When the return value is BACK_UP_FILE_FULLY, it's not clear what happens
> to these output values; we modify some, but why?  Maybe they should be
> left alone?  In that case, the "if size == 0" test should move a couple
> of lines up, in the brtentry == NULL block.

OK.

> BTW, you could do the qsort() after deciding to backup the file fully if
> more than 90% needs to be replaced.

OK.

> BTW, in sendDir() why do
>  lookup_path = pstrdup(pathbuf + basepathlen + 1);
> when you could do
>  lookup_path = pstrdup(tarfilename);
> ?

No reason, changed.

> > If we want to inject more underscores here, my vote is to go all the
> > way and make it per_wal_range_cb.
>
> +1

Will look into this.

> Yeah, I just think that endless stream of hex chars are hard to read,
> and I've found myself following digits in the screen with my fingers in
> order to parse file names.  I guess you could say thousands separators
> for regular numbers aren't needed either, but we do have them for
> readability sake.

Sigh.

> I think a new section in chapter 30 "Reliability and the Write-Ahead
> Log" is warranted.  It would explain the summarization process, what the
> summary files are used for, and the deletion mechanism.  I can draft
> something if you want.

Sure, if you want to take a crack at it, that's great.

> It's not clear to me if WalSummarizerCtl->pending_lsn if fulfilling some
> purpose or it's just a leftover from prior development.  I see it's only
> read in an assertion ... Maybe if we think this cross-check is
> important, it should be turned into an elog?  Otherwise, I'd remove it.

I've been thinking about that. One thing I'm not quite sure about
though is introspection. Maybe there should be a function that shows
summarized_tli and summarized_lsn from WalSummarizerData, and maybe it
should expose pending_lsn too.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Alvaro Herrera
Date:
On 2023-Oct-04, Robert Haas wrote:

> - I would like some feedback on the generation of WAL summary files.
> Right now, I have it enabled by default, and summaries are kept for a
> week. That means that, with no additional setup, you can take an
> incremental backup as long as the reference backup was taken in the
> last week. File removal is governed by mtimes, so if you change the
> mtimes of your summary files or whack your system clock around, weird
> things might happen. But obviously this might be inconvenient. Some
> people might not want WAL summary files to be generated at all because
> they don't care about incremental backup, and other people might want
> them retained for longer, and still other people might want them to be
> not removed automatically or removed automatically based on some
> criteria other than mtime. I don't really know what's best here. I
> don't think the default policy that the patches implement is
> especially terrible, but it's just something that I made up and I
> don't have any real confidence that it's wonderful. One point to be
> consider here is that, if WAL summarization is enabled, checkpoints
> can't remove WAL that isn't summarized yet. Mostly that's not a
> problem, I think, because the WAL summarizer is pretty fast. But it
> could increase disk consumption for some people. I don't think that we
> need to worry about the summaries themselves being a problem in terms
> of space consumption; at least in all the cases I've tested, they're
> just not very big.

So, wal_summary is no longer turned on by default, I think following a
comment from Peter E.  I think this is a good decision, as we're only
going to need them on servers from which incremental backups are going
to be taken, which is a strict subset of all servers; and furthermore,
people that need them are going to realize that very easily, while if we
went the other way around most people would not realize that they need to
turn them off to save some resource consumption.

Granted, the amount of additional resources used is probably not very
big.  But since it can be changed with a reload, not a restart, it
doesn't seem problematic.

... oh, I just noticed that this patch now fails to compile because of
the MemoryContextResetAndDeleteChildren removal.

(Typo in the pg_walsummary manpage: "since WAL summary files primary
exist" -> "primarily")

> - On a related note, I haven't yet tested this on a standby, which is
> a thing that I definitely need to do. I don't know of a reason why it
> shouldn't be possible for all of this machinery to work on a standby
> just as it does on a primary, but then we need the WAL summarizer to
> run there too, which could end up being a waste if nobody ever tries
> to take an incremental backup. I wonder how that should be reflected
> in the configuration. We could do something like what we've done for
> archive_mode, where on means "only on if this is a primary" and you
> have to say always if you want it to run on standbys as well ... but
> I'm not sure if that's a design pattern that we really want to
> replicate into more places. I'd be somewhat inclined to just make
> whatever configuration parameters we need to configure this thing on
> the primary also work on standbys, and you can set each server up as
> you please. But I'm open to other suggestions.

I think it should default to off in primary and standby, and the user
has to enable it in whichever server they want to take backups from.

> - We need to settle the question of whether to send the whole backup
> manifest to the server or just the LSN. In a previous attempt at
> incremental backup, we decided the whole manifest was necessary,
> because flat-copying files could make new data show up with old LSNs.
> But that version of the patch set was trying to find modified blocks
> by checking their LSNs individually, not by summarizing WAL. And since
> the operations that flat-copy files are WAL-logged, the WAL summary
> approach seems to eliminate that problem - maybe an LSN (and the
> associated TLI) is good enough now. This also relates to Jakub's
> question about whether this machinery could be used to fast-forward a
> standby, which is not exactly a base backup but  ... perhaps close
> enough? I'm somewhat inclined to believe that we can simplify to an
> LSN and TLI; however, if we do that, then we'll have big problems if
> later we realize that we want the manifest for something after all. So
> if anybody thinks that there's a reason to keep doing what the patch
> does today -- namely, upload the whole manifest to the server --
> please speak up.

I don't understand this point.  Currently, the protocol is that
UPLOAD_MANIFEST is used to send the manifest prior to requesting the
backup.  You seem to be saying that you're thinking of removing support
for UPLOAD_MANIFEST and instead just give the LSN as an option to the
BASE_BACKUP command?

> - It's regrettable that we don't have incremental JSON parsing;

We now do have it, at least in patch form.  I guess the question is
whether we're going to accept it in core.  I see the chances of changing
the format of the manifest as rather slim at this point, and the need for very
large manifests is likely to go up with time, so we probably need to
take that code and polish it up, and see if we can improve its
performance.

> - Right now, I have a hard-coded 60 second timeout for WAL
> summarization. If you try to take an incremental backup and the WAL
> summaries you need don't show up within 60 seconds, the backup times
> out. I think that's a reasonable default, but should it be
> configurable? If yes, should that be a GUC or, perhaps better, a
> pg_basebackup option?

I'd rather have a way for the server to provide diagnostics on why the
summaries aren't being produced.  Maybe a server running under valgrind
is going to fail and need a longer one, but otherwise a hardcoded
timeout seems sufficient.

You did say later that you thought summary files would just go from one
checkpoint to the next.  So the only question is at what point the file
for the last checkpoint (i.e. from the previous one up to the one
requested by pg_basebackup) is written.  If walsummarizer keeps almost
the complete state in memory and just waits for the checkpoint record to
write it, then it's probably okay.

> - I'm curious what people think about the pg_walsummary tool that is
> included in 0006. I think it's going to be fairly important for
> debugging, but it does feel a little bit bad to add a new binary for
> something pretty niche. Nevertheless, merging it into any other
> utility seems relatively awkward, so I'm inclined to think both that
> this should be included in whatever finally gets committed and that it
> should be a separate binary. I considered whether it should go in
> contrib, but we seem to have moved to a policy that heavily favors
> limiting contrib to extensions and loadable modules, rather than
> binaries.

I propose to keep the door open for that binary doing other things than
dumping the files as text.  So add a command argument, which currently
can only be "dump", to allow the command to do other things later if
needed.  (For example, remove files from a server on which summarize_wal
has been turned off; or perhaps remove files that are below some LSN.)

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Estoy de acuerdo contigo en que la verdad absoluta no existe...
El problema es que la mentira sí existe y tu estás mintiendo" (G. Lama)



Re: trying again to get incremental backup

From
Alvaro Herrera
Date:
On 2023-Nov-16, Robert Haas wrote:

> On Thu, Nov 16, 2023 at 5:21 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > I meant code like this
> >
> >         memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
> >         key.forknum = forknum;
> >         entry = blockreftable_lookup(brtab->hash, key);
> 
> Ah, I hadn't thought about that. Another way of handling that might be
> to add = {0} to the declaration of key. But I can do the initializer
> thing too if you think it's better. I'm not sure if there's an
> argument that the initializer might optimize better.

I think the {0} initializer is good enough, given a comment to indicate
why.

> > It's not clear to me if WalSummarizerCtl->pending_lsn if fulfilling some
> > purpose or it's just a leftover from prior development.  I see it's only
> > read in an assertion ... Maybe if we think this cross-check is
> > important, it should be turned into an elog?  Otherwise, I'd remove it.
> 
> I've been thinking about that. One thing I'm not quite sure about
> though is introspection. Maybe there should be a function that shows
> summarized_tli and summarized_lsn from WalSummarizerData, and maybe it
> should expose pending_lsn too.

True.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/



Re: trying again to get incremental backup

From
Alvaro Herrera
Date:
On 2023-Nov-16, Alvaro Herrera wrote:

> On 2023-Oct-04, Robert Haas wrote:

> > - Right now, I have a hard-coded 60 second timeout for WAL
> > summarization. If you try to take an incremental backup and the WAL
> > summaries you need don't show up within 60 seconds, the backup times
> > out. I think that's a reasonable default, but should it be
> > configurable? If yes, should that be a GUC or, perhaps better, a
> > pg_basebackup option?
> 
> I'd rather have a way for the server to provide diagnostics on why the
> summaries aren't being produced.  Maybe a server running under valgrind
> is going to fail and need a longer one, but otherwise a hardcoded
> timeout seems sufficient.
> 
> You did say later that you thought summary files would just go from one
> checkpoint to the next.  So the only question is at what point the file
> for the last checkpoint (i.e. from the previous one up to the one
> requested by pg_basebackup) is written.  If walsummarizer keeps almost
> the complete state in memory and just waits for the checkpoint record to
> write it, then it's probably okay.

On 2023-Nov-16, Alvaro Herrera wrote:

> On 2023-Nov-16, Robert Haas wrote:
> 
> > On Thu, Nov 16, 2023 at 5:21 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> > > It's not clear to me if WalSummarizerCtl->pending_lsn if fulfilling some
> > > purpose or it's just a leftover from prior development.  I see it's only
> > > read in an assertion ... Maybe if we think this cross-check is
> > > important, it should be turned into an elog?  Otherwise, I'd remove it.
> > 
> > I've been thinking about that. One thing I'm not quite sure about
> > though is introspection. Maybe there should be a function that shows
> > summarized_tli and summarized_lsn from WalSummarizerData, and maybe it
> > should expose pending_lsn too.
> 
> True.

Putting those two thoughts together, I think pg_basebackup with
--progress could tell you "still waiting for the summary file up to LSN
%X/%X to appear, and the walsummarizer is currently handling lsn %X/%X"
or something like that.  This would probably require two concurrent
connections, one to run BASE_BACKUP and another to inquire server state;
but this should be easy enough to integrate with parallel
basebackup later.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Thu, Nov 16, 2023 at 12:34 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Putting those two thoughts together, I think pg_basebackup with
> --progress could tell you "still waiting for the summary file up to LSN
> %X/%X to appear, and the walsummarizer is currently handling lsn %X/%X"
> or something like that.  This would probably require two concurrent
> connections, one to run BASE_BACKUP and another to inquire server state;
> but this should easy enough to integrate together with parallel
> basebackup later.

I had similar thoughts, except I was thinking it would be better to
have the warnings be generated on the server side. That would save the
need for a second libpq connection, which would be good, because I
think adding that would result in a pretty large increase in
complexity and some not-so-great user-visible consequences. In fact,
my latest thought is to just remove the timeout altogether, and emit
warnings like this:

WARNING:  still waiting for WAL summarization to reach %X/%X after %d
seconds, currently at %X/%X

We could emit that every 30 seconds or so until either the situation
resolves itself or the user hits ^C. I think that would be good enough
here. If we want, the interval between messages can be a GUC, but I
don't know how much real need there will be to tailor that.
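
Roughly, the shape of that would be something like the sketch below
(assumed names; GetSummarizedLSN() stands in for whatever the real
progress-reporting call turns out to be, and this is not the patch's
actual wait loop):

static void
wait_for_wal_summarization_sketch(XLogRecPtr target_lsn)
{
    int         seconds_waited = 0;

    /* Poll once per second; warn every 30 seconds instead of timing out. */
    while (GetSummarizedLSN() < target_lsn)
    {
        CHECK_FOR_INTERRUPTS();
        pg_usleep(1000000L);

        if (++seconds_waited % 30 == 0)
            ereport(WARNING,
                    errmsg("still waiting for WAL summarization to reach %X/%X after %d seconds, currently at %X/%X",
                           LSN_FORMAT_ARGS(target_lsn),
                           seconds_waited,
                           LSN_FORMAT_ARGS(GetSummarizedLSN())));
    }
}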

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Thu, Nov 16, 2023 at 12:23 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> So, wal_summary is no longer turned on by default, I think following a
> comment from Peter E.  I think this is a good decision, as we're only
> going to need them on servers from which incremental backups are going
> to be taken, which is a strict subset of all servers; and furthermore,
> people that need them are going to realize that very easily, while if we
> went the other around most people would not realize that they need to
> turn them off to save some resource consumption.
>
> Granted, the amount of resources additionally used is probably not very
> big.  But since it can be changed with a reload not restart, it doesn't
> seem problematic.

Yeah. I meant to say that I'd changed that for that reason, but in the
flurry of new versions I omitted to do so.

> ... oh, I just noticed that this patch now fails to compile because of
> the MemoryContextResetAndDeleteChildren removal.

Fixed.

> (Typo in the pg_walsummary manpage: "since WAL summary files primary
> exist" -> "primarily")

This, too.

> I think it should default to off in primary and standby, and the user
> has to enable it in whichever server they want to take backups from.

Yeah, that's how it works currently.

> I don't understand this point.  Currently, the protocol is that
> UPLOAD_MANIFEST is used to send the manifest prior to requesting the
> backup.  You seem to be saying that you're thinking of removing support
> for UPLOAD_MANIFEST and instead just give the LSN as an option to the
> BASE_BACKUP command?

I don't think I'd want to do exactly that, because then you could only
send one LSN, and I do think we want to send a set of LSN ranges with
the corresponding TLI for each. I was thinking about dumping
UPLOAD_MANIFEST and instead having a command like:

INCREMENTAL_WAL_RANGE 1 2/462AC48 2/462C698

The client would execute this command one or more times before
starting an incremental backup.

> I propose to keep the door open for that binary doing other things that
> dumping the files as text.  So add a command argument, which currently
> can only be "dump", to allow the command do other things later if
> needed.  (For example, remove files from a server on which summarize_wal
> has been turned off; or perhaps remove files that are below some LSN.)

I don't like that very much. That sounds like one of those
forward-compatibility things that somebody designs and then nothing
ever happens and ten years later you still have an ugly wart.

My theory is that these files are going to need very little
management. In general, they're small; if you never removed them, it
probably wouldn't hurt, or at least, not for a long time. As to
specific use cases, if you want to remove files from a server on which
summarize_wal has been turned off, you can just use rm. Removing files
from before a certain LSN would probably need a bit of scripting, but
only a bit. Conceivably we could provide something like that in core,
but it doesn't seem necessary, and it also seems to me that we might
do well to include that in pg_archivecleanup rather than in
pg_walsummary.

Here's a new version. Changes:

- Add preparatory renaming patches to the series.
- Rename wal_summarize_keep_time to wal_summary_keep_time.
- Change while (true) to while (1).
- Typo fixes.
- Fix incorrect assertion in summarizer_read_local_xlog_page; this
could cause occasional regression test failures in 004_pg_xlog_symlink
and 009_growing_files.
- Zero-initialize BlockRefTableKey variables.
- Replace a couple instances of pathbuf + basepathlen + 1 with tarfilename.
- Add const to path argument of GetFileBackupMethod.
- Avoid setting output parameters of GetFileBackupMethod unless the
return value is BACK_UP_FILE_INCREMENTALLY.
- In GetFileBackupMethod, postpone qsorting block numbers slightly.
- Define INCREMENTAL_PREFIX_LENGTH using sizeof(), because that should
hopefully work everywhere and the StaticAssertStmt that checks the
value of this doesn't work on Windows.
- Change MemoryContextResetAndDeleteChildren to MemoryContextReset.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: trying again to get incremental backup

From
Alvaro Herrera
Date:
I made a pass over pg_combinebackup for NLS.  I propose the attached
patch.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Right now the sectors on the hard disk run clockwise, but I heard a rumor that
you can squeeze 0.2% more throughput by running them counterclockwise.
It's worth the effort. Recommended."  (Gerry Pourwelle)

Attachment

Re: trying again to get incremental backup

From
Robert Haas
Date:
On Fri, Nov 17, 2023 at 5:01 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I made a pass over pg_combinebackup for NLS.  I propose the attached
> patch.

This doesn't quite compile for me so I changed a few things and
incorporated it. Hopefully I didn't mess anything up.

Here's v11. In addition to incorporating Álvaro's NLS changes, with
the off-list help of Jakub Wartak, I finally tracked down two one-line
bugs in BlockRefTableEntryGetBlocks that have been causing the cfbot
to blow up on these patches. What I hadn't realized is that cfbot runs
with the relation segment size changed to 6 blocks, which tickled some
code paths that I wasn't exercising locally. Thanks a ton to Jakub for
the help running this down. cfbot was unhappy about a %lu so I've
changed that to %zu in this version, too. Finally, the previous
version of this patch set had some pgindent damage, so that is
hopefully now cleaned up as well.

I wish I had better ideas about how to thoroughly test this. I've got
a bunch of different tests for pg_combinebackup and I think those are
good, but the bugs mentioned in the previous paragraph show that those
aren't sufficient to catch all of the logic errors that can exist,
which is not great. But, as I say, I'm not quite sure how to do
better, so I guess I'll just need to keep fixing problems as we find
them.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: trying again to get incremental backup

From
Alvaro Herrera
Date:
On 2023-Nov-16, Robert Haas wrote:

> On Thu, Nov 16, 2023 at 12:23 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> > I don't understand this point.  Currently, the protocol is that
> > UPLOAD_MANIFEST is used to send the manifest prior to requesting the
> > backup.  You seem to be saying that you're thinking of removing support
> > for UPLOAD_MANIFEST and instead just give the LSN as an option to the
> > BASE_BACKUP command?
> 
> I don't think I'd want to do exactly that, because then you could only
> send one LSN, and I do think we want to send a set of LSN ranges with
> the corresponding TLI for each. I was thinking about dumping
> UPLOAD_MANIFEST and instead having a command like:
> 
> INCREMENTAL_WAL_RANGE 1 2/462AC48 2/462C698
> 
> The client would execute this command one or more times before
> starting an incremental backup.

That sounds good to me.  Not having to parse the manifest server-side
sounds like a win, as does saving the transfer, for the cases where the
manifest is large.

Is this meant to support multiple timelines each with non-overlapping
adjacent ranges, rather than multiple non-adjacent ranges?

Do I have it right that you want to rewrite this bit before considering
this ready to commit?

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"No nos atrevemos a muchas cosas porque son difíciles,
pero son difíciles porque no nos atrevemos a hacerlas" (Séneca)



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Mon, Nov 20, 2023 at 2:03 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> That sounds good to me.  Not having to parse the manifest server-side
> sounds like a win, as does saving the transfer, for the cases where the
> manifest is large.

OK. I'll look into this next week, hopefully.

> Is this meant to support multiple timelines each with non-overlapping
> adjacent ranges, rather than multiple non-adjacent ranges?

Correct. I don't see how non-adjacent LSN ranges could ever be a
useful thing, but adjacent ranges on different timelines are useful.

> Do I have it right that you want to rewrite this bit before considering
> this ready to commit?

For sure. I don't think this is the only thing that needs to be
revised before commit, but it's definitely *a* thing that needs to be
revised before commit.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Mon, Nov 20, 2023 at 2:10 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Is this meant to support multiple timelines each with non-overlapping
> > adjacent ranges, rather than multiple non-adjacent ranges?
>
> Correct. I don't see how non-adjacent LSN ranges could ever be a
> useful thing, but adjacent ranges on different timelines are useful.

Thinking about this a bit more, there are a couple of things we could
do here in terms of syntax. One idea is to give up on having a
separate MANIFEST-WAL-RANGE command for each range and instead just
cram everything into either a single command:

MANIFEST-WAL-RANGES {tli} {startlsn} {endlsn}...

Or even into a single option to the BASE_BACKUP command:

BASE_BACKUP yadda yadda INCREMENTAL 'tli@startlsn-endlsn,...'

Or, since we expect adjacent, non-overlapping ranges, you could even
arrange to elide the duplicated boundary LSNs, e.g.

MANIFEST-WAL-RANGES {{tli} {lsn}}... {final-lsn}

Or

BASE_BACKUP yadda yadda INCREMENTAL 'tli@lsn,...,final-lsn'

I'm not sure what's best here. Trying to trim out the duplicated
boundary LSNs feels a bit like rearrangement for the sake of
rearrangement, but maybe it isn't really.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Jakub Wartak
Date:
On Mon, Nov 20, 2023 at 4:43 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Nov 17, 2023 at 5:01 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > I made a pass over pg_combinebackup for NLS.  I propose the attached
> > patch.
>
> This doesn't quite compile for me so I changed a few things and
> incorporated it. Hopefully I didn't mess anything up.
>
> Here's v11.
[..]

> I wish I had better ideas about how to thoroughly test this. [..]

Hopefully the below adds some confidence; I've done some further
quick(?) checks today and the results are good:

make check-world #GOOD
test_full_pri__incr_stby__restore_on_pri.sh #GOOD
test_full_pri__incr_stby__restore_on_stby.sh #GOOD*
test_full_stby__incr_stby__restore_on_pri.sh #GOOD
test_full_stby__incr_stby__restore_on_stby.sh #GOOD*
test_many_incrementals_dbcreate.sh #GOOD
test_many_incrementals.sh #GOOD
test_multixact.sh #GOOD
test_pending_2pc.sh #GOOD
test_reindex_and_vacuum_full.sh #GOOD
test_truncaterollback.sh #GOOD
test_unlogged_table.sh #GOOD
test_across_wallevelminimal.sh # GOOD (expected failure: WAL
summaries are off while wal_level=minimal, so an incremental cannot be
taken --> a full needs to be taken after wal_level=minimal)

CFbot failed on two hosts this time, I haven't looked at the details
yet (https://cirrus-ci.com/task/6425149646307328 -> end of EOL? ->
LOG:  WAL summarizer process (PID 71511) was terminated by signal 6:
Aborted?)

The remaining test idea is to have a longer running DB under stress
test in more real-world conditions and try to recover using chained
incremental backups (one such test was carried out on patchset v6 and
the result was good back then).

-J.



Re: trying again to get incremental backup

From
Thomas Munro
Date:
On Wed, Nov 22, 2023 at 3:14 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> CFbot failed on two hosts this time, I haven't looked at the details
> yet (https://cirrus-ci.com/task/6425149646307328 -> end of EOL? ->
> LOG:  WAL summarizer process (PID 71511) was terminated by signal 6:
> Aborted?)

Robert pinged me to see if I had any ideas.

The reason it fails on Windows is because there is a special code path
for WIN32 in the patch's src/bin/pg_combinebackup/copy_file.c, but it
is incomplete: it returns early without feeding the data into the
checksum, so all the checksums have the same initial and bogus value.
I commented that part out so it took the normal path like Unix, and it
passed.

The reason it fails on Linux 32 bit with -fsanitize is because this
has uncovered a bug in xlogreader.c, which overflows a 32 bit pointer
when doing a size test that could easily be changed to a
non-overflowing formulation.  AFAICS it is not a live bug, because it
comes to the right conclusion without dereferencing the pointer due to
other checks, but the sanitizer is not wrong to complain about it and I
will post a patch to fix that in a new thread.  With the draft patch I
am testing, the sanitizer is happy and this passes too.
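
For the curious, the kind of rewrite being described is just comparing
lengths instead of computing an end pointer that can wrap around in a
32-bit address space; an illustration (not the actual xlogreader.c
code):

/* Overflow-prone: "ptr + size" can wrap around on a 32-bit build:
 *     if (ptr + size > end_of_buffer) ... bail out ...
 * Non-overflowing equivalent: compare sizes, not pointers. */
static bool
record_fits(const char *ptr, size_t size, const char *end_of_buffer)
{
    return size <= (size_t) (end_of_buffer - ptr);
}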



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Thu, Nov 23, 2023 at 11:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Robert pinged me to see if I had any ideas.

Thanks, Thomas.

> The reason it fails on Windows is because there is a special code path
> for WIN32 in the patch's src/bin/pg_combinebackup/copy_file.c, but it
> is incomplete: it returns early without feeding the data into the
> checksum, so all the checksums have the same initial and bogus value.
> I commented that part out so it took the normal path like Unix, and it
> passed.

Yikes, that's embarrassing. Thanks for running it down. There is logic
in the caller to figure out whether we need to recompute the checksum
or can reuse one we already have, but copy_file() doesn't understand
that it should take the slow path if a new checksum computation is
required.

> The reason it fails on Linux 32 bit with -fsanitize is because this
> has uncovered a bug in xlogreader.c, which overflows a 32 bit pointer
> when doing a size test that could easily be changed to non-overflowing
> formulation.  AFAICS it is not a live bug because it comes to the
> right conclusion without deferencing the pointer due to other checks,
> but the sanitizer is not wrong to complain about it and I will post a
> patch to fix that in a new thread.  With the draft patch I am testing,
> the sanitizer is happy and this passes too.

Thanks so much.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Wed, Nov 15, 2023 at 9:14 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> so I've spent some time playing still with patchset v8 (without the
> 6/6 testing patch related to wal_level=minimal), with the exception of
> - patchset v9 - marked otherwise.

Thanks, as usual, for that.

> 2. Usability thing: I hit the timeout hard: "This backup requires WAL
> to be summarized up to 0/90000D8, but summarizer has only reached
> 0/0." with summarize_wal=off (default) but apparently this in TODO.
> Looks like an important usability thing.

All right. I'd sort of forgotten about the need to address that issue,
but apparently, I need to re-remember.

> 5. On v8 i've finally played a little bit with standby(s) and this
> patchset with couple of basic scenarios while mixing source of the
> backups:
>
> a. full on standby, incr1 on standby, full db restore (incl. incr1) on standby
>     # sometimes i'm getting spurious error like those when doing
> incrementals on standby with -c fast :
>     2023-11-15 13:49:05.721 CET [10573] LOG:  recovery restart point
> at 0/A000028
>     2023-11-15 13:49:07.591 CET [10597] WARNING:  aborting backup due
> to backend exiting before pg_backup_stop was called
>     2023-11-15 13:49:07.591 CET [10597] ERROR:  manifest requires WAL
> from final timeline 1 ending at 0/A0000F8, but this backup starts at
> 0/A000028
>     2023-11-15 13:49:07.591 CET [10597] STATEMENT:  BASE_BACKUP (
> INCREMENTAL,  LABEL 'pg_basebackup base backup',  PROGRESS,
> CHECKPOINT 'fast',  WAIT 0,  MANIFEST 'yes',  TARGET 'client')
>     # when you retry the same pg_basebackup it goes fine (looks like
> CHECKPOINT on standby/restartpoint <-> summarizer disconnect, I'll dig
> deeper tomorrow. It seems that issuing "CHECKPOINT; pg_sleep(1);"
> against primary just before pg_basebackup --incr on standby
> workarounds it)
>
> b. full on primary, incr1 on standby, full db restore (incl. incr1) on
> standby # WORKS
> c. full on standby, incr1 on standby, full db restore (incl. incr1) on
> primary # WORKS*
> d. full on primary, incr1 on standby, full db restore (incl. incr1) on
> primary # WORKS*
>
> * - needs pg_promote() due to the controlfile having standby bit +
> potential fiddling with postgresql.auto.conf as it is having
> primary_connstring GUC.

Well, "manifest requires WAL from final timeline 1 ending at
0/A0000F8, but this backup starts at 0/A000028" is a valid complaint,
not a spurious error. It's essentially saying that WAL replay for this
incremental backup would have to begin at a location that is earlier
than where replay for the earlier backup would have to end while
recovering that backup. It's almost like you're trying to go backwards
in time, with the incremental happening before the full backup instead
of after it. I think the reason this is happening is that when you
take a backup, recovery has to start from the previous checkpoint. On
the primary, we perform a new checkpoint and plan to start recovery
from it. But on a standby, we can't perform a new checkpoint, since we
can't write WAL, so we arrange for recovery of the backup to begin
from the most recent checkpoint. And if you do two backups on the
standby in a row without much happening in the middle, then the most
recent checkpoint will be the same for both. And that I think is
what's resulting in this error, because the end of the backup follows
the start of the backup, so if two consecutive backups have the same
start, then the start of the second one will precede the end of the
first one.

One thing that's interesting to note here is that there is no point in
performing an incremental backup under these circumstances. You would
accrue no advantage over just letting replay continue further from the
full backup. The whole point of an incremental backup is that it lets
you "fast forward" your older backup -- you could have just replayed
all the WAL from the older backup until you got to the start LSN of
the newer backup, but reconstructing a backup that can start replay
from the newer LSN directly is, hopefully, quicker than replaying all
of that WAL. But in this scenario, you're starting from the same
checkpoint no matter what -- the amount of WAL replay required to
reach any given LSN will be unchanged. So storing an incremental
backup would be strictly a loss.

Another interesting point to consider is that you could also get this
complaint by doing something like take the full backup from the
primary, and then try to take an incremental backup from a standby,
maybe even a time-delayed standby that's far behind the primary. In
that case, you would really be trying to take an incremental backup
before you actually took the full backup, as far as LSN time goes.

I'm not quite sure what to do about any of this. I think the error is
correct and accurate, but understanding what it means and why it's
happening and what to do about it is probably going to be difficult
for people. Possibly we should have documentation that talks you
through all of this. Or possibly there are ways to elaborate on the
error message itself. But I'm a little skeptical about the latter
approach because it's all so complicated. I don't know that we can
summarize it in a sentence or two.

> 6. Sci-fi-mode-on: I was wondering about the dangers of e.g. having
> more recent pg_basebackup (e.g. from pg18 one day) running against
> pg17 in the scope of having this incremental backups possibility. Is
> it going to be safe? (currently there seems to be no safeguards
> against such use) or should those things (core, pg_basebackup) should
> be running in version lock step?

I think it should be safe, actually. pg_basebackup has no reason to
care about WAL format changes across versions. It doesn't even care
about the format of the WAL summaries, which it never sees, but only
needs the server to have. If we change the format of the incremental
files that are included in the backup, then we will need
backward-compatibility code, or we can disallow cross-version
operations. I don't currently foresee a need to do that, but you never
know. It's manageable in any event.

But note that I also didn't (and can't, without a lot of ugliness)
make pg_combinebackup version-independent. So you could think of
taking incremental backups with a different version of pg_basebackup,
but if you want to restore you're going to need a matching version of
pg_combinebackup.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
New patch set.

0001: Rename JsonManifestParseContext callbacks, per feedback from
Álvaro. Not logically related to the rest of this, except by code
proximity. Separately committable, if nobody objects.

0002: Rename record_manifest_details_for_{file,wal_range}, per
feedback from Álvaro that the names were too generic. Separately
committable, if nobody objects.

0003: Move parse_manifest.c to src/common. No significant changes
since the previous version.

0004: Add a new WAL summarizer process. No significant changes since
the previous version.

0005: Incremental backup itself. Changes:
- Remove UPLOAD_MANIFEST replication command and instead add
INCREMENTAL_WAL_RANGE replication command.
- In consequence, load_manifest.c which was included in the previous
patch sets now moves to src/fe_utils and has some adjustments.
- Actually document the new replication command which I overlooked previously.
- Error out promptly if an incremental backup is attempted with
summarize_wal = off.
- Fix test in copy_file(). We should be willing to use the fast-path
if a new checksum is *not* required, but the sense of the test was
inverted in previous versions.
- Fix some handling of the missing-manifest case in pg_combinebackup.
- Fix terminology in a help message.

0006: Add pg_walsummary tool. No significant changes since the previous version.

0007: Test patch, not for commit.

As far as I know, the main commit-blockers here are (1) the timeout
when waiting for WAL summarization is still hard-coded to 60 seconds
and (2) the ubsan issue that Thomas hunted down, which would cause at
least the entire CF environment and maybe some portion of the BF to
turn red if this were committed. That issue is in xlogreader rather
than in this patch set, at least in part, but it still needs fixing
before this goes ahead. I also suspect that the slightly-more
significant refactoring in this version may turn up a few new bugs in
the CF environment. I think that once the aforementioned items are
sorted out, this could be committed through 0005, and 0001 and 0002
could be committed sooner. 0006 should have some tests written before
it gets committed, but it doesn't necessarily have to be committed at
the exact same moment as everything else, and writing tests isn't that
hard, either.

Other loose ends that would be nice to tidy up at some point:

- Incremental JSON parsing so we can handle huge manifests.

- More documentation as proposed by Álvaro but I'm failing to find the
details of his proposal right now.

- More introspection facilities, maybe, or possibly rip some
stuff out of WalSummarizerCtl if we don't want it. This one might be a
higher priority to address before initial commit, but it's probably
not absolutely critical, either.

I'm not quite sure how aggressively to press forward with getting
stuff committed. I'd certainly rather debug as much as I can locally
and via cfbot before turning the buildfarm pretty colors, but I think
it generally works out better when larger features get pushed earlier
in the cycle rather than in the mad frenzy right before feature
freeze, so I'm not inclined to be too patient, either.

...Robert

Attachment

Re: trying again to get incremental backup

From
Robert Haas
Date:
On Thu, Nov 30, 2023 at 9:33 AM Robert Haas <robertmhaas@gmail.com> wrote:
> 0005: Incremental backup itself. Changes:
> - Remove UPLOAD_MANIFEST replication command and instead add
> INCREMENTAL_WAL_RANGE replication command.

Unfortunately, I think this change is going to need to be reverted.
Jakub reported a problem to me off-list, which I think boils down
to this: take a full backup on the primary. create a database on the
primary. now take an incremental backup on the standby using the full
backup from the master as the prior backup. What happens at this point
depends on how far replay has progressed on the standby. I think there
are three scenarios: (1) If replay has not yet reached a checkpoint
later than the one at which the full backup began, then taking the
incremental backup will fail. This is correct, because it makes no
sense to take an incremental backup that goes backwards in time, and
it's pointless to take one that goes forwards but not far enough to
reach the next checkpoint, as you won't save anything. (2) If replay
has progressed far enough that the redo pointer is now beyond the
CREATE DATABASE record, then everything is fine. (3) But if the redo
pointer for the incremental backup is at a checkpoint later than the one
from which the full backup started, yet still before the CREATE DATABASE
record,
then the new database's files exist on disk, but are not mentioned in
the WAL summary, which covers all LSNs from the start of the prior
backup to the start of this one. Here, the start of the backup is
basically the LSN from which replay will start, and since the database
was created after that, those changes aren't in the WAL summary. This
means that we think the file is unchanged since the prior backup, and
so back up no blocks at all. But now we have an incremental file for a
relation for which no full file is present in the prior backup, and
we're in big trouble.
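
To make scenario (3) concrete with made-up LSNs: say the redo LSN of the
checkpoint from which the full backup started is 0/6000028, the CREATE
DATABASE record is written at 0/8000100, and by the time the incremental
backup starts on the standby, replay has passed 0/8000100 but the latest
restartpoint's redo LSN is only 0/7000028. The WAL summaries consulted for
the incremental backup then cover 0/6000028 through 0/7000028, which says
nothing about the CREATE DATABASE, even though the new database's files
are already sitting on disk on the standby.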

If my analysis is correct, this bug should be new in v12. In v11 and
prior, I think that we always included every file that didn't appear
in the prior manifest in full. I didn't quite know why I was
doing that, which is why I was willing to rip it out and thus remove
the need for the manifest, but now I think it was actually preventing
exactly this problem. The issue, in general, is with files that get
created after the start of the backup. By that time, the WAL summary
that drives the backup has already been built, so it doesn't know
anything about the new files. That would be fine if we either (A)
omitted those new files from the backup completely, since replay would
recreate them or (B) backed them up in full, so that there was nothing
relying on them being there in the earlier backup. But an incremental
backup of such a file is no good.
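
Spelled out as pseudocode (all names here are invented, not the actual
code in the patch set), the rule I think we need when sending a relation
file is roughly:

    if (!present_in_prior_manifest)
        send_file_in_full();                    /* nothing in the old backup to apply an incremental to */
    else if (!mentioned_in_wal_summary)
        send_incremental_file(no_blocks);       /* unchanged as far as the summary knows */
    else
        send_incremental_file(blocks_flagged_by_summary);

The first branch is exactly what the INCREMENTAL_WAL_RANGE approach loses,
because without the prior manifest we can't distinguish a brand-new file
from one that's merely unchanged.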

Then I started worrying about whether there were problems in cases
where a file was dropped and recreated with the same name. I *think*
it's OK. If file F is dropped and recreated after being copied into
the full backup but before being copied into the incremental backup,
then there are basically two cases. First, F might be dropped before
the start LSN of the incremental backup; if so, we'll know from the
WAL summary that the limit block is 0 and back up the whole thing.
Second, F might be dropped after the start LSN of the incremental
backup and before it's actually copied. In that case, we won't know
when backing up the file that it was dropped and recreated, so we'll
back it up incrementally as if that hadn't happened. That's OK as long
as reconstruction doesn't fail, because WAL replay will again drop and
recreate F. And I think reconstruction won't fail: blocks that are in
the incremental file will be taken from there, blocks in the prior
backup file will be taken from there, and blocks in neither place will
be zero-filled. The result is logically incoherent, but replay will
nuke the file anyway, so whatever.
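
Said another way, reconstruction's per-block rule is roughly this (a
sketch with invented names, not the actual pg_combinebackup code):

    /* for each block number of the output file */
    if (incremental_file_has_block(blkno))
        copy_block_from_incremental_file(blkno);
    else if (prior_backup_has_block(blkno))
        copy_block_from_prior_backup(blkno);
    else
        write_all_zeroes_block(blkno);      /* WAL replay is responsible for this one */

Since every block comes out as *something*, the worst case is a logically
incoherent file, which is fine whenever replay is going to drop or
overwrite it anyway.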

It bugs me a bit that we don't obey the WAL-before-data rule with file
creation, e.g. RelationCreateStorage does smgrcreate() and then
log_smgrcreate(). So in theory we could see a file on disk for which
nothing has been logged yet; it could even happen that the file gets
created before the start LSN of the backup and the log record gets
written afterward. It seems like it would be far more comfortable to
swap the order there, so that if it's on disk, it's definitely in the
WAL. But I haven't yet been able to think of a scenario in which the
current ordering causes a real problem. If we backup a stray file in
full (or, hypothetically, if we skipped it entirely) then nothing will
happen that can't already happen today with full backup; any problems
we end up having are, I think, not new problems. It's only when we
back up a file incrementally that we need to be careful, and the
analysis is basically the same as before ... whatever we put into an
incremental file will cause *something* to get reconstructed except
when there's no prior file at all. Having the manifest for the prior
backup lets us avoid the incremental-with-no-prior-file scenario. And
as long as *something* gets reconstructed, I think WAL replay will fix
up the rest.
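
For reference, the ordering I'm talking about is roughly the following
(paraphrasing RelationCreateStorage from memory, so the details may be
off):

    smgrcreate(srel, MAIN_FORKNUM, false);      /* file appears on disk here ... */
    log_smgrcreate(&srel->smgr_rlocator.locator, MAIN_FORKNUM);  /* ... WAL record written only afterward */

Swapping those two calls is the reordering I'm musing about above.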

Considering all this, what I'm inclined to do is go and put
UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust
accordingly. But first: does anybody see more problems here that I may
have missed?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Mon, Dec 4, 2023 at 3:58 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Considering all this, what I'm inclined to do is go and put
> UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust
> accordingly. But first: does anybody see more problems here that I may
> have missed?

OK, so here's a new version with UPLOAD_MANIFEST put back. I wrote a
long comment explaining why that's believed to be necessary and
sufficient. I committed 0001 and 0002 from the previous series also,
since it doesn't seem like anyone has further comments on those
renamings.

This version also improves (at least, IMHO) the way that we wait for
WAL summarization to finish. Previously, you either caught up fully
within 60 seconds or you died. I didn't like that, because it seemed
like some people would get timeouts when the operation was slowly
progressing and would eventually succeed. So what this version does
is:

- Every 10 seconds, it logs a warning saying that it's still waiting
for WAL summarization. That way, a human operator can understand
what's happening easily, and cancel if they want.

- If 60 seconds go by without the WAL summarizer ingesting even a
single WAL record, it times out. That way, if the WAL summarizer is
dead or totally stuck (e.g. debugger attached, hung I/O) the user
won't be left waiting forever even if they never cancel. But if it's
just slow, it probably won't time out, and the operation should
eventually succeed.

To me, this seems like a reasonable compromise. It might be
unreasonable if WAL summarization is proceeding at a very low but
non-zero rate. But it's hard for me to think of a situation where that
will happen, with the exception of when CPU or I/O are badly
overloaded. But in those cases, the WAL generation rate is probably
also not that high, because apparently the system is paralyzed, so
maybe the wait won't even be that bad, especially given that
everything else on the box should be super-slow too. Plus, even if we
did want to time out in such a case, it's hard to know how slow is too
slow. In any event, I think most failures here are likely to be
complete failures, where the WAL summarizer just doesn't do anything, so
the fact that this times out in those cases seems to me likely to be as much as
we need to do here. But if someone sees a problem with this or has a
clever idea how to make it better, I'm all ears.
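
In rough pseudocode, with invented helper names rather than the real
walsummarizer/basebackup functions, the policy is:

    summarized = current_summarized_lsn();
    progress_deadline = now() + 60 * 1000;      /* 60s of zero progress => give up */
    while (summarized < lsn_we_need)
    {
        sleep_ms(1000);
        if (current_summarized_lsn() > summarized)
        {
            summarized = current_summarized_lsn();
            progress_deadline = now() + 60 * 1000;  /* any progress at all resets the clock */
        }
        else if (now() >= progress_deadline)
            error("WAL summarization is not progressing");
        if (total_wait_is_a_multiple_of_10_seconds())
            warn("still waiting for WAL summarization");
    }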

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: trying again to get incremental backup

From
Jakub Wartak
Date:
On Tue, Dec 5, 2023 at 7:11 PM Robert Haas <robertmhaas@gmail.com> wrote:

[..v13 patchset]

The results with v13 patchset are following:

* - requires a checkpoint on the primary when taking an incremental on a
standby that is too idle; this was explained by Robert in [1], i.e. a
too-fast incremental backup due to the testing scenario:

test_across_wallevelminimal.sh - GOOD
test_many_incrementals_dbcreate.sh - GOOD
test_many_incrementals.sh - GOOD
test_multixact.sh - GOOD
test_reindex_and_vacuum_full.sh - GOOD
test_standby_incr_just_backup.sh - GOOD*
test_truncaterollback.sh - GOOD
test_unlogged_table.sh - GOOD
test_full_pri__incr_stby__restore_on_pri.sh - GOOD
test_full_pri__incr_stby__restore_on_stby.sh - GOOD
test_full_stby__incr_stby__restore_on_pri.sh - GOOD*
test_full_stby__incr_stby__restore_on_stby.sh - GOOD*
test_incr_on_standby_after_promote.sh - GOOD*
test_incr_after_timelineincrease.sh (pg_ctl stop, pg_resetwal -l
00000002000000000000000E ..., pg_ctl start, pg_basebackup
--incremental) - GOOD, I've got:
    pg_basebackup: error: could not initiate base backup: ERROR:
timeline 1 found in manifest, but not in this server's history
    Comment: I was wondering if it wouldn't make some sense to teach
pg_resetwal to actually delete all WAL summaries after any
WAL/controlfile alteration?

test_stuck_walsummary.sh (pkill -STOP walsumm) - GOOD:

> This version also improves (at least, IMHO) the way that we wait for
> WAL summarization to finish. Previously, you either caught up fully
> within 60 seconds or you died. I didn't like that, because it seemed
> like some people would get timeouts when the operation was slowly
> progressing and would eventually succeed. So what this version does
> is:

    WARNING:  still waiting for WAL summarization through 0/A0000D8
after 10 seconds
    DETAIL:  Summarization has reached 0/8000028 on disk and 0/80000F8
in memory.
[..]
    pg_basebackup: error: could not initiate base backup: ERROR:  WAL
summarization is not progressing
    DETAIL:  Summarization is needed through 0/A0000D8, but is stuck
at 0/8000028 on disk and 0/80000F8 in memory.
    Comment2: looks good to me!

test_pending_2pc.sh - getting GOOD on most recent runs, but several
times during early testing (probably due to my own mishaps), I've been
hit by Abort/TRAP. I'm still investigating and trying to reproduce
those ones. TRAP: failed Assert("summary_end_lsn >=
WalSummarizerCtl->pending_lsn"), File: "walsummarizer.c", Line: 940

Regards,
-J.

[1] -  https://www.postgresql.org/message-id/CA%2BTgmoYuC27_ToGtTTNyHgpn_eJmdqrmhJ93bAbinkBtXsWHaA%40mail.gmail.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Thu, Dec 7, 2023 at 9:42 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
>     Comment: I was wondering if it wouldn't make some sense to teach
> pg_resetwal to actually delete all WAL summaries after any any
> WAL/controlfile alteration?

I thought that this was a good idea so I decided to go implement it,
only to discover that it was already part of the patch set ... did you
find some case where it doesn't work as expected? The code looks like
this:

    RewriteControlFile();
    KillExistingXLOG();
    KillExistingArchiveStatus();
    KillExistingWALSummaries();
    WriteEmptyXLOG();

> test_pending_2pc.sh - getting GOOD on most recent runs, but several
> times during early testing (probably due to my own mishaps), I've been
> hit by Abort/TRAP. I'm still investigating and trying to reproduce
> those ones. TRAP: failed Assert("summary_end_lsn >=
> WalSummarizerCtl->pending_lsn"), File: "walsummarizer.c", Line: 940

I have a fix for this locally, but I'm going to hold off on publishing
a new version until either there's a few more things I can address all
at once, or until Thomas commits the ubsan fix.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Jakub Wartak
Date:
On Thu, Dec 7, 2023 at 4:15 PM Robert Haas <robertmhaas@gmail.com> wrote:

Hi Robert,

> On Thu, Dec 7, 2023 at 9:42 AM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> >     Comment: I was wondering if it wouldn't make some sense to teach
> > pg_resetwal to actually delete all WAL summaries after any any
> > WAL/controlfile alteration?
>
> I thought that this was a good idea so I decided to go implement it,
> only to discover that it was already part of the patch set ... did you
> find some case where it doesn't work as expected? The code looks like
> this:

Ah, my bad; with a fresh mind and coffee the error message makes it
clear, and of course it did reset the summaries properly.

While we are at it, maybe around the below in PrepareForIncrementalBackup()

                if (tlep[i] == NULL)
                        ereport(ERROR,

(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                                         errmsg("timeline %u found in
manifest, but not in this server's history",
                                                        range->tli)));

we could add

    errhint("You might need to start a new full backup instead of
incremental one")

?
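
i.e. the whole thing would then look roughly like this (whitespace
condensed, hint wording as above with the grammar slightly adjusted):

    if (tlep[i] == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("timeline %u found in manifest, but not in this server's history",
                        range->tli),
                 errhint("You might need to start a new full backup instead of an incremental one.")));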

> > test_pending_2pc.sh - getting GOOD on most recent runs, but several
> > times during early testing (probably due to my own mishaps), I've been
> > hit by Abort/TRAP. I'm still investigating and trying to reproduce
> > those ones. TRAP: failed Assert("summary_end_lsn >=
> > WalSummarizerCtl->pending_lsn"), File: "walsummarizer.c", Line: 940
>
> I have a fix for this locally, but I'm going to hold off on publishing
> a new version until either there's a few more things I can address all
> at once, or until Thomas commits the ubsan fix.
>

Great, I cannot get it to fail again today, it had to be some dirty
state of the testing env. BTW: Thomas has pushed that ubsan fix.

-J.



Re: trying again to get incremental backup

From
Dilip Kumar
Date:
On Tue, Dec 5, 2023 at 11:40 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Dec 4, 2023 at 3:58 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Considering all this, what I'm inclined to do is go and put
> > UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust
> > accordingly. But first: does anybody see more problems here that I may
> > have missed?
>
> OK, so here's a new version with UPLOAD_MANIFEST put back. I wrote a
> long comment explaining why that's believed to be necessary and
> sufficient. I committed 0001 and 0002 from the previous series also,
> since it doesn't seem like anyone has further comments on those
> renamings.

I have done some testing on a standby, but I am facing some issues,
although things are working fine on the primary.  As shown in test [1]
below, the standby is reporting an error that the manifest requires WAL
from 0/60000F8, but this backup starts at 0/6000028.  Then I tried to look
into the manifest file of the full backup, and it shows the contents as in
[0] below.  Actually, from this WARNING and ERROR, I am not clear what the
problem is. I understand that the full backup ends at "0/60000F8", so for
the next incremental backup we should be looking for a summary that has
WAL starting at "0/60000F8", and we do have those WALs.  In fact, the
error message is saying "this backup starts at 0/6000028", which is before
"0/60000F8", so what's the issue?

[0]
"WAL-Ranges": [
{ "Timeline": 1, "Start-LSN": "0/6000028", "End-LSN": "0/60000F8" }


[1]
-- test on primary
dilipkumar@dkmac bin % ./pg_basebackup -D d
dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest

-- cleanup the backup directory
dilipkumar@dkmac bin % rm -rf d
dilipkumar@dkmac bin % rm -rf d1

--test on standby
dilipkumar@dkmac bin % ./pg_basebackup -D d -p 5433
dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest -p 5433

WARNING:  aborting backup due to backend exiting before pg_backup_stop
was called
pg_basebackup: error: could not initiate base backup: ERROR:  manifest
requires WAL from final timeline 1 ending at 0/60000F8, but this
backup starts at 0/6000028
pg_basebackup: removing data directory "d1"


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Dilip Kumar
Date:
On Mon, Dec 11, 2023 at 11:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Dec 5, 2023 at 11:40 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Mon, Dec 4, 2023 at 3:58 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > > Considering all this, what I'm inclined to do is go and put
> > > UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust
> > > accordingly. But first: does anybody see more problems here that I may
> > > have missed?
> >
> > OK, so here's a new version with UPLOAD_MANIFEST put back. I wrote a
> > long comment explaining why that's believed to be necessary and
> > sufficient. I committed 0001 and 0002 from the previous series also,
> > since it doesn't seem like anyone has further comments on those
> > renamings.
>
> I have done some testing on standby, but I am facing some issues,
> although things are working fine on the primary.  As shown below test
> [1]standby is reporting some errors that manifest require WAL from
> 0/60000F8, but this backup starts at 0/6000028.  Then I tried to look
> into the manifest file of the full backup and it shows contents as
> below[0].  Actually from this WARNING and ERROR, I am not clear what
> is the problem, I understand that full backup ends at  "0/60000F8" so
> for the next incremental backup we should be looking for a summary
> that has WAL starting at "0/60000F8" and we do have those WALs.  In
> fact, the error message is saying "this backup starts at 0/6000028"
> which is before  "0/60000F8" so whats the issue?
>
> [0]
> "WAL-Ranges": [
> { "Timeline": 1, "Start-LSN": "0/6000028", "End-LSN": "0/60000F8" }
>
>
> [1]
> -- test on primary
> dilipkumar@dkmac bin % ./pg_basebackup -D d
> dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest
>
> -- cleanup the backup directory
> dilipkumar@dkmac bin % rm -rf d
> dilipkumar@dkmac bin % rm -rf d1
>
> --test on standby
> dilipkumar@dkmac bin % ./pg_basebackup -D d -p 5433
> dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest -p 5433
>
> WARNING:  aborting backup due to backend exiting before pg_backup_stop
> was called
> pg_basebackup: error: could not initiate base backup: ERROR:  manifest
> requires WAL from final timeline 1 ending at 0/60000F8, but this
> backup starts at 0/6000028
> pg_basebackup: removing data directory "d1"

Jakub pinged me off-list and pointed me to the thread [1] where it is
already explained, so I think we can ignore this.

[1] https://www.postgresql.org/message-id/CA%2BTgmoYuC27_ToGtTTNyHgpn_eJmdqrmhJ93bAbinkBtXsWHaA%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Fri, Dec 8, 2023 at 5:02 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> While we are at it, maybe around the below in PrepareForIncrementalBackup()
>
>                 if (tlep[i] == NULL)
>                         ereport(ERROR,
>
> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>                                          errmsg("timeline %u found in
> manifest, but not in this server's history",
>                                                         range->tli)));
>
> we could add
>
>     errhint("You might need to start a new full backup instead of
> incremental one")
>
> ?

I can't exactly say that such a hint would be inaccurate, but I think
the impulse to add it here is misguided. One of my design goals for
this system is to make it so that you never have to take a new
incremental backup "just because," not even in case of an intervening
timeline switch. So, all of the errors in this function are warning
you that you've done something that you really should not have done.
In this particular case, you've either (1) manually removed the
timeline history file, and not just any timeline history file but the
one for a timeline for a backup that you still intend to use as the
basis for taking an incremental backup or (2) tried to use a full
backup taken from one server as the basis for an incremental backup on
a completely different server that happens to share the same system
identifier, e.g. because you promoted two standbys derived from the
same original primary and then tried to use a full backup taken on one
as the basis for an incremental backup taken on the other.

The scenario I was really concerned about when I wrote this test was
(2), because that could lead to a corrupt restore. This test isn't
strong enough to prevent that completely, because two unrelated
standbys can branch onto the same new timelines at the same LSNs, and
then these checks can't tell that something bad has happened. However,
they can detect a useful subset of problem cases. And the solution is
not so much "take a new full backup" as "keep straight which server is
which." Likewise, in case (1), the relevant hint would be "don't
manually remove timeline history files, and if you must, then at least
don't nuke timelines that you actually still care about."

> > I have a fix for this locally, but I'm going to hold off on publishing
> > a new version until either there's a few more things I can address all
> > at once, or until Thomas commits the ubsan fix.
> >
>
> Great, I cannot get it to fail again today, it had to be some dirty
> state of the testing env. BTW: Thomas has pushed that ubsan fix.

Huzzah, the cfbot likes the patch set now. Here's a new version with
the promised fix for your non-reproducible issue. Let's see whether
you and cfbot still like this version.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: trying again to get incremental backup

From
Jakub Wartak
Date:
Hi Robert,

On Mon, Dec 11, 2023 at 6:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Dec 8, 2023 at 5:02 AM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> > While we are at it, maybe around the below in PrepareForIncrementalBackup()
> >
> >                 if (tlep[i] == NULL)
> >                         ereport(ERROR,
> >
> > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >                                          errmsg("timeline %u found in
> > manifest, but not in this server's history",
> >                                                         range->tli)));
> >
> > we could add
> >
> >     errhint("You might need to start a new full backup instead of
> > incremental one")
> >
> > ?
>
> I can't exactly say that such a hint would be inaccurate, but I think
> the impulse to add it here is misguided. One of my design goals for
> this system is to make it so that you never have to take a new
> incremental backup "just because,"

Did you mean take a new full backup here?

> not even in case of an intervening
> timeline switch. So, all of the errors in this function are warning
> you that you've done something that you really should not have done.
> In this particular case, you've either (1) manually removed the
> timeline history file, and not just any timeline history file but the
> one for a timeline for a backup that you still intend to use as the
> basis for taking an incremental backup or (2) tried to use a full
> backup taken from one server as the basis for an incremental backup on
> a completely different server that happens to share the same system
> identifier, e.g. because you promoted two standbys derived from the
> same original primary and then tried to use a full backup taken on one
> as the basis for an incremental backup taken on the other.
>

Okay, but please consider two other possibilities:

(3) I had a corrupted DB that I fixed by running pg_resetwal, and some
cronjob just a day later attempted to take an incremental backup and
failed with that error.

(4) I had pg_upgraded the DB (which calls pg_resetwal on the fresh initdb
directory), and a cronjob there then just failed with this error.

I bet that (4) is going to happen more often than (1) or (2), which
might trigger users to complain on forums and in support tickets.

> > > I have a fix for this locally, but I'm going to hold off on publishing
> > > a new version until either there's a few more things I can address all
> > > at once, or until Thomas commits the ubsan fix.
> > >
> >
> > Great, I cannot get it to fail again today, it had to be some dirty
> > state of the testing env. BTW: Thomas has pushed that ubsan fix.
>
> Huzzah, the cfbot likes the patch set now. Here's a new version with
> the promised fix for your non-reproducible issue. Let's see whether
> you and cfbot still like this version.

LGTM, all quick tests work from my end too. BTW: I have also scheduled
the long/large pgbench -s 14000 (~200GB?) - multiple day incremental
test. I'll let you know how it went.

-J.



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Wed, Dec 13, 2023 at 5:39 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> > I can't exactly say that such a hint would be inaccurate, but I think
> > the impulse to add it here is misguided. One of my design goals for
> > this system is to make it so that you never have to take a new
> > incremental backup "just because,"
>
> Did you mean take a new full backup here?

Yes, apologies for the typo.

> > not even in case of an intervening
> > timeline switch. So, all of the errors in this function are warning
> > you that you've done something that you really should not have done.
> > In this particular case, you've either (1) manually removed the
> > timeline history file, and not just any timeline history file but the
> > one for a timeline for a backup that you still intend to use as the
> > basis for taking an incremental backup or (2) tried to use a full
> > backup taken from one server as the basis for an incremental backup on
> > a completely different server that happens to share the same system
> > identifier, e.g. because you promoted two standbys derived from the
> > same original primary and then tried to use a full backup taken on one
> > as the basis for an incremental backup taken on the other.
> >
>
> Okay, but please consider two other possibilities:
>
> (3) I had a corrupted DB where I've fixed it by running pg_resetwal
> and some cronjob just a day later attempted to take incremental and
> failed with that error.
>
> (4) I had pg_upgraded (which calls pg_resetwal on fresh initdb
> directory) the DB where I had cronjob that just failed with this error
>
> I bet that (4) is going to happen more often than (1), (2) , which
> might trigger users to complain on forums, support tickets.

Hmm. In case (4), I was thinking that you'd get a complaint about the
database system identifier not matching. I'm not actually sure that's
what would happen, though, now that you mention it.

In case (3), I think you would get an error about missing WAL summary files.

> > Huzzah, the cfbot likes the patch set now. Here's a new version with
> > the promised fix for your non-reproducible issue. Let's see whether
> > you and cfbot still like this version.
>
> LGTM, all quick tests work from my end too. BTW: I have also scheduled
> the long/large pgbench -s 14000 (~200GB?) - multiple day incremental
> test. I'll let you know how it went.

Awesome, thank you so much.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Jakub Wartak
Date:
Hi Robert,

On Wed, Dec 13, 2023 at 2:16 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
 >
> > > not even in case of an intervening
> > > timeline switch. So, all of the errors in this function are warning
> > > you that you've done something that you really should not have done.
> > > In this particular case, you've either (1) manually removed the
> > > timeline history file, and not just any timeline history file but the
> > > one for a timeline for a backup that you still intend to use as the
> > > basis for taking an incremental backup or (2) tried to use a full
> > > backup taken from one server as the basis for an incremental backup on
> > > a completely different server that happens to share the same system
> > > identifier, e.g. because you promoted two standbys derived from the
> > > same original primary and then tried to use a full backup taken on one
> > > as the basis for an incremental backup taken on the other.
> > >
> >
> > Okay, but please consider two other possibilities:
> >
> > (3) I had a corrupted DB where I've fixed it by running pg_resetwal
> > and some cronjob just a day later attempted to take incremental and
> > failed with that error.
> >
> > (4) I had pg_upgraded (which calls pg_resetwal on fresh initdb
> > directory) the DB where I had cronjob that just failed with this error
> >
> > I bet that (4) is going to happen more often than (1), (2) , which
> > might trigger users to complain on forums, support tickets.
>
> Hmm. In case (4), I was thinking that you'd get a complaint about the
> database system identifier not matching. I'm not actually sure that's
> what would happen, though, now that you mention it.
>

I've played with initdb/pg_upgrade (17->17) and I don't get a DBID
mismatch (of course they do differ after initdb), but I get this
instead:

 $ pg_basebackup -c fast -D /tmp/incr2.after.upgrade -p 5432
--incremental /tmp/incr1.before.upgrade/backup_manifest
WARNING:  aborting backup due to backend exiting before pg_backup_stop
was called
pg_basebackup: error: could not initiate base backup: ERROR:  timeline
2 found in manifest, but not in this server's history
pg_basebackup: removing data directory "/tmp/incr2.after.upgrade"

Also, I don't see a DBID in the manifest?
Maybe it's a nuisance, but all I'm trying to say is that if an
automated cronjob with pg_basebackup --incremental hits a freshly
upgraded cluster, that error message without an errhint() is going to
scare some junior DBAs.

> > LGTM, all quick tests work from my end too. BTW: I have also scheduled
> > the long/large pgbench -s 14000 (~200GB?) - multiple day incremental
> > test. I'll let you know how it went.
>
> Awesome, thank you so much.

OK, so pgbench -i -s 14440 and pgbench -P 1 -R 100 -c 8 -T 259200 did
generate pretty large incrementals (so I had to abort it due to lack
of space; I was expecting to see smaller incrementals, so it took too
much space). I initially suspected that the problem lies in the uniform
distribution of `\set aid random(1, 100000 * :scale)` in the tpcb-like
script, which UPDATEs the big pgbench_accounts table.

$ du -sm /backups/backups/* /backups/archive/
216205  /backups/backups/full
215207  /backups/backups/incr.1
216706  /backups/backups/incr.2
102273  /backups/archive/

So I verified the recoverability yesterday anyway - the
pg_combinebackup "full incr.1 incr.2" took 44 minutes, and the later
archive WAL recovery and promotion SUCCEEDED. The 8-way parallel seqscan
for sum(abalance) on pgbench_accounts and the other tables worked
fine. The pg_combinebackup was using 15-20% CPU (mostly on %sys),
while performing mostly 60-80MB/s separately for both reads and writes
(it's slow, but that's due to the maxed-out sequential I/O of the small
Premium SSD on Azure).

So I've launched another, improved test (to force more localized
UPDATEs) to see the more real-world space-effectiveness of
incremental backup:

\set aid random_exponential(1, 100000 * :scale, 8)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

But then... (and I have verified the low IDs for :aid above) the same
thing happened:

backups/backups$ du -sm /backups/backups/*
210229  /backups/backups/full
208299  /backups/backups/incr.1
208351  /backups/backups/incr.2

# pgbench_accounts has relfilenodeid 16486
postgres@jw-test-1:/backups/backups$ for L in 5 10 15 30 100 161 173
174 175  ; do md5sum full/base/5/16486.$L ./incr.1/base/5/16486.$L
./incr.2/base/5/16486.$L /var/lib/postgres/17/data/base/5/16486.$L ;
echo; done
005c6bbb40fca3c1a0a819376ef0e793  full/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793  ./incr.1/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793  ./incr.2/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793  /var/lib/postgres/17/data/base/5/16486.5

[.. all the checksums match (!) for the above $L..]

c5117a213253035da5e5ee8a80c3ee3d  full/base/5/16486.173
c5117a213253035da5e5ee8a80c3ee3d  ./incr.1/base/5/16486.173
c5117a213253035da5e5ee8a80c3ee3d  ./incr.2/base/5/16486.173
c5117a213253035da5e5ee8a80c3ee3d  /var/lib/postgres/17/data/base/5/16486.173

47ee6b18d7f8e40352598d194b9a3c8a  full/base/5/16486.174
47ee6b18d7f8e40352598d194b9a3c8a  ./incr.1/base/5/16486.174
47ee6b18d7f8e40352598d194b9a3c8a  ./incr.2/base/5/16486.174
47ee6b18d7f8e40352598d194b9a3c8a  /var/lib/postgres/17/data/base/5/16486.174

82dfeba58b4a1031ac12c23f9559a330  full/base/5/16486.175
21a8ac1e6fef3cf0b34546c41d59b2cc  ./incr.1/base/5/16486.175
2c3d89c612b2f97d575a55c6c0204d0b  ./incr.2/base/5/16486.175
73367d44d76e98276d3a6bbc14bb31f1  /var/lib/postgres/17/data/base/5/16486.175

So to me, it looks like it copied 174 out of 175 files in full anyway,
lowering the effectiveness of that incremental backup to ~0%. The
commands to generate those incr backups were:
pg_basebackup -v -P -c fast -D /backups/backups/incr.1
--incremental=/backups/backups/full/backup_manifest
sleep 4h
pg_basebackup -v -P -c fast -D /backups/backups/incr.2
--incremental=/backups/backups/incr1/backup_manifest

The incrementals are being generated, but just for the first (0)
segment of the relation?

/backups/backups$ ls -l incr.2/base/5 | grep INCR
-rw------- 1 postgres postgres         12 Dec 14 21:33 INCREMENTAL.112
-rw------- 1 postgres postgres         12 Dec 14 21:01 INCREMENTAL.113
-rw------- 1 postgres postgres         12 Dec 14 21:36 INCREMENTAL.1247
-rw------- 1 postgres postgres         12 Dec 14 21:38 INCREMENTAL.1247_vm
[..note, no INCREMENTAL.$int.$segment files]
-rw------- 1 postgres postgres         12 Dec 14 21:24 INCREMENTAL.6238
-rw------- 1 postgres postgres         12 Dec 14 21:17 INCREMENTAL.6239
-rw------- 1 postgres postgres         12 Dec 14 21:55 INCREMENTAL.827

# 16486 is pgbench_accounts
/backups/backups$ ls -l incr.2/base/5/*16486* | grep INCR
-rw------- 1 postgres postgres   14613480 Dec 14 21:00
incr.2/base/5/INCREMENTAL.16486
-rw------- 1 postgres postgres         12 Dec 14 21:52
incr.2/base/5/INCREMENTAL.16486_vm
/backups/backups$

/backups/backups$ find incr* -name INCREMENTAL.* | wc -l
1342
/backups/backups$ find incr* -name INCREMENTAL.*_* | wc -l # VM or FSM
236
/backups/backups$ find incr* -name INCREMENTAL.*.* | wc -l # not a
single incremental file for a >1GB relation segment
0

I'm quickly passing this info along and I haven't really looked at the
code yet, but it should be somewhere around GetFileBackupMethod() and
easily reproducible with that configure --with-segsize-blocks=X
switch.

-J.



Re: trying again to get incremental backup

From
Peter Eisentraut
Date:
I have a couple of quick fixes here.

The first fixes up some things in nls.mk related to a file move.  The 
second is some cleanup because some function you are using has been 
removed in the meantime; you probably found that yourself while rebasing.

The pg_walsummary patch doesn't have a nls.mk, but you also comment that 
it doesn't have tests yet, so I assume it's not considered complete yet 
anyway.

Attachment

Re: trying again to get incremental backup

From
Peter Eisentraut
Date:
A separate bikeshedding topic: The GUC "summarize_wal", could that be 
"wal_something" instead?  (wal_summarize? wal_summarizer?)  It would be 
nice if these setting names grouped together a bit, both with existing 
wal_* ones and also with the new ones you are adding 
(wal_summary_keep_time).




Re: trying again to get incremental backup

From
Peter Eisentraut
Date:
Another set of comments, about the patch that adds pg_combinebackup:

Make sure all the options are listed in a consistent order.  We have 
lately changed everything to be alphabetical.  This includes:

- reference page pg_combinebackup.sgml

- long_options listing

- getopt_long() argument

- subsequent switch

- (--help output, but it looks ok as is)

Also, in pg_combinebackup.sgml, the option --sync-method is listed as if 
it does not take an argument, but it does.




Re: trying again to get incremental backup

From
Robert Haas
Date:
On Fri, Dec 15, 2023 at 6:53 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> The first fixes up some things in nls.mk related to a file move.  The
> second is some cleanup because some function you are using has been
> removed in the meantime; you probably found that yourself while rebasing.

Incorporated these. As you guessed,
MemoryContextResetAndDeleteChildren -> MemoryContextReset had already
been done locally.

> The pg_walsummary patch doesn't have a nls.mk, but you also comment that
> it doesn't have tests yet, so I assume it's not considered complete yet
> anyway.

I think this was more of a case of me just not realizing that I should
add that. I'll add something simple to the next version, but I'm not
very good at this NLS stuff.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Fri, Dec 15, 2023 at 6:58 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> A separate bikeshedding topic: The GUC "summarize_wal", could that be
> "wal_something" instead?  (wal_summarize? wal_summarizer?)  It would be
> nice if these settings names group together a bit, both with existing
> wal_* ones and also with the new ones you are adding
> (wal_summary_keep_time).

Yeah, this is highly debatable, so bikeshed away. IMHO, the question
here is whether we care more about (1) having the name of the GUC
sound nice grammatically or (2) having the GUC begin with the same
string as other, related GUCs. I think that Tom Lane tends to prefer
the former, and probably some other people do too, while some other
people tend to prefer the latter. Ideally it would be possible to
satisfy both goals at once here, but everything I thought about that
started with "wal" sounded too awkward for me to like it; hence the
current choice of name. But if there's consensus on something else, so
be it.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Mon, Dec 18, 2023 at 4:10 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> Another set of comments, about the patch that adds pg_combinebackup:
>
> Make sure all the options are listed in a consistent order.  We have
> lately changed everything to be alphabetical.  This includes:
>
> - reference page pg_combinebackup.sgml
>
> - long_options listing
>
> - getopt_long() argument
>
> - subsequent switch
>
> - (--help output, but it looks ok as is)
>
> Also, in pg_combinebackup.sgml, the option --sync-method is listed as if
> it does not take an argument, but it does.

I've attempted to clean this stuff up in the attached version. This
version also includes a fix for the bug found by Jakub that caused
things to not work properly for segment files beyond the first for any
particular relation, which turns out to be a really stupid mistake in
my earlier commit 025584a168a4b3002e193.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: trying again to get incremental backup

From
Robert Haas
Date:
On Fri, Dec 15, 2023 at 5:36 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> I've played with with initdb/pg_upgrade (17->17) and i don't get DBID
> mismatch (of course they do differ after initdb), but i get this
> instead:
>
>  $ pg_basebackup -c fast -D /tmp/incr2.after.upgrade -p 5432
> --incremental /tmp/incr1.before.upgrade/backup_manifest
> WARNING:  aborting backup due to backend exiting before pg_backup_stop
> was called
> pg_basebackup: error: could not initiate base backup: ERROR:  timeline
> 2 found in manifest, but not in this server's history
> pg_basebackup: removing data directory "/tmp/incr2.after.upgrade"
>
> Also in the manifest I don't see DBID ?
> Maybe it's a nuisance and all I'm trying to see is that if an
> automated cronjob with pg_basebackup --incremental hits a freshly
> upgraded cluster, that error message without errhint() is going to
> scare some Junior DBAs.

Yeah. I think we should add the system identifier to the manifest, but
I think that should be left for a future project, as I don't think the
lack of it is a good reason to stop all progress here. When we have
that, we can give more reliable error messages about system mismatches
at an earlier stage. Unfortunately, I don't think that the timeline
messages you're seeing here are going to apply in every case: suppose
you have two unrelated servers that are both on timeline 1. I think
you could use a base backup from one of those servers and use it as
the basis for the incremental from the other, and I think that if you
did it right you might fail to hit any sanity check that would block
that. pg_combinebackup will realize there's a problem, because it has
the whole cluster to work with, not just the manifest, and will notice
the mismatching system identifiers, but that's kind of late to find
out that you made a big mistake. However, right now, it's the best we
can do.

> The incrementals are being generated , but just for the first (0)
> segment of the relation?

I committed the first two patches from the series I posted yesterday.
The first should fix this, and the second relocates parse_manifest.c.
That patch hasn't changed in a while and seems unlikely to attract
major objections. There's no real reason to commit it until we're
ready to move forward with the main patches, but I think we're very
close to that now, so I did.

Here's a rebase for cfbot.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: trying again to get incremental backup

From
Jakub Wartak
Date:
Hi Robert,

On Tue, Dec 19, 2023 at 9:36 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Dec 15, 2023 at 5:36 AM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> > I've played with with initdb/pg_upgrade (17->17) and i don't get DBID
> > mismatch (of course they do differ after initdb), but i get this
> > instead:
> >
> >  $ pg_basebackup -c fast -D /tmp/incr2.after.upgrade -p 5432
> > --incremental /tmp/incr1.before.upgrade/backup_manifest
> > WARNING:  aborting backup due to backend exiting before pg_backup_stop
> > was called
> > pg_basebackup: error: could not initiate base backup: ERROR:  timeline
> > 2 found in manifest, but not in this server's history
> > pg_basebackup: removing data directory "/tmp/incr2.after.upgrade"
> >
> > Also in the manifest I don't see DBID ?
> > Maybe it's a nuisance and all I'm trying to see is that if an
> > automated cronjob with pg_basebackup --incremental hits a freshly
> > upgraded cluster, that error message without errhint() is going to
> > scare some Junior DBAs.
>
> Yeah. I think we should add the system identifier to the manifest, but
> I think that should be left for a future project, as I don't think the
> lack of it is a good reason to stop all progress here. When we have
> that, we can give more reliable error messages about system mismatches
> at an earlier stage. Unfortunately, I don't think that the timeline
> messages you're seeing here are going to apply in every case: suppose
> you have two unrelated servers that are both on timeline 1. I think
> you could use a base backup from one of those servers and use it as
> the basis for the incremental from the other, and I think that if you
> did it right you might fail to hit any sanity check that would block
> that. pg_combinebackup will realize there's a problem, because it has
> the whole cluster to work with, not just the manifest, and will notice
> the mismatching system identifiers, but that's kind of late to find
> out that you made a big mistake. However, right now, it's the best we
> can do.
>

OK, understood.

> > The incrementals are being generated , but just for the first (0)
> > segment of the relation?
>
> I committed the first two patches from the series I posted yesterday.
> The first should fix this, and the second relocates parse_manifest.c.
> That patch hasn't changed in a while and seems unlikely to attract
> major objections. There's no real reason to commit it until we're
> ready to move forward with the main patches, but I think we're very
> close to that now, so I did.
>
> Here's a rebase for cfbot.

the v15 patchset (posted yesterday) test results are GOOD:

1. make check-world - GOOD
2. cfbot was GOOD
3. the devel/master bug present in
parse_filename_for_nontemp_relation() seems to be gone (in local
testing)
4. some further tests:
test_across_wallevelminimal.sh - GOOD
test_incr_after_timelineincrease.sh - GOOD
test_incr_on_standby_after_promote.sh - GOOD
test_many_incrementals_dbcreate.sh - GOOD
test_many_incrementals.sh - GOOD
test_multixact.sh - GOOD
test_pending_2pc.sh - GOOD
test_reindex_and_vacuum_full.sh - GOOD
test_repro_assert.sh
test_standby_incr_just_backup.sh - GOOD
test_stuck_walsum.sh - GOOD
test_truncaterollback.sh - GOOD
test_unlogged_table.sh - GOOD
test_full_pri__incr_stby__restore_on_pri.sh - GOOD
test_full_pri__incr_stby__restore_on_stby.sh - GOOD
test_full_stby__incr_stby__restore_on_pri.sh - GOOD
test_full_stby__incr_stby__restore_on_stby.sh - GOOD

5. the more real-world pgbench test with localized segment writes
using `\set aid random_exponential...` [1] indicates much greater
efficiency in terms of backup space use now, du -sm shows:

210229  /backups/backups/full
250     /backups/backups/incr.1
255     /backups/backups/incr.2
[..]
348     /backups/backups/incr.13
408     /backups/backups/incr.14 // latest(20th of Dec on 10:40)
6673    /backups/archive/

The DB size was as reported by \l+ 205GB.
That pgbench was running for ~27h (19th Dec 08:39 -> 20th Dec 11:30)
with slow 100 TPS (-R), so no insane amounts of WAL.
Time to reconstruct 14 chained incremental backups was 45mins
(pg_combinebackup -o /var/lib/postgres/17/data /backups/backups/full
/backups/backups/incr.1 (..)  /backups/backups/incr.14).
DB after recovering was OK and working fine.

-J.



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Wed, Dec 20, 2023 at 8:11 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> the v15 patchset (posted yesterday) test results are GOOD:

All right. I committed the main two patches, dropped the
for-testing-only patch, and added a simple test to the remaining
pg_walsummary patch. That needs more work, but here's what I have as
of now.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: trying again to get incremental backup

From
Alexander Lakhin
Date:
Hello Robert,

20.12.2023 23:56, Robert Haas wrote:
> On Wed, Dec 20, 2023 at 8:11 AM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
>> the v15 patchset (posted yesterday) test results are GOOD:
> All right. I committed the main two patches, dropped the
> for-testing-only patch, and added a simple test to the remaining
> pg_walsummary patch. That needs more work, but here's what I have as
> of now.

I've found several typos/inconsistencies introduced with 174c48050 and
dc2123400. Maybe you would want to fix them while you're at it:
s/arguent/argument/;
s/BlkRefTableEntry/BlockRefTableEntry/;
s/BlockRefTablEntry/BlockRefTableEntry/;
s/Caonicalize/Canonicalize/;
s/Checksum_Algorithm/Checksum-Algorithm/;
s/corresonding/corresponding/;
s/differenly/differently/;
s/excessing/excessive/;
s/ exta / extra /;
s/hexademical/hexadecimal/;
s/initally/initially/;
s/MAXGPATH/MAXPGPATH/;
s/overrreacting/overreacting/;
s/old_meanifest_file/old_manifest_file/;
s/pg_cominebackup/pg_combinebackup/;
s/pg_tblpc/pg_tblspc/;
s/pointrs/pointers/;
s/Recieve/Receive/;
s/recieved/received/;
s/ recod / record /;
s/ recods / records /;
s/substntially/substantially/;
s/sumamry/summary/;
s/summry/summary/;
s/synchronizaton/synchronization/;
s/sytem/system/;
s/withot/without/;
s/Woops/Whoops/;
s/xlograder/xlogreader/;

Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a
comment above the redo_pointer_at_last_summary_removal declaration, but
perhaps it should talk about removing summaries instead?

Best regards,
Alexander



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Wed, Dec 20, 2023 at 11:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> I've found several typos/inconsistencies introduced with 174c48050 and
> dc2123400. Maybe you would want to fix them, while on it?:

That's an impressively long list of mistakes in something I thought
I'd been careful about. Sigh.

I don't suppose you could provide these corrections in the form of a
patch? I don't really want to run these sed commands across the entire
tree and then try to figure out what's what...

> Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a
> comment above redo_pointer_at_last_summary_removal declaration, but
> perhaps it should say about removing summaries instead?

Wow, yeah. Thanks, will fix.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Alexander Lakhin
Date:
21.12.2023 15:07, Robert Haas wrote:
> On Wed, Dec 20, 2023 at 11:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>> I've found several typos/inconsistencies introduced with 174c48050 and
>> dc2123400. Maybe you would want to fix them, while on it?:
> That's an impressively long list of mistakes in something I thought
> I'd been careful about. Sigh.
>
> I don't suppose you could provide these corrections in the form of a
> patch? I don't really want to run these sed commands across the entire
> tree and then try to figure out what's what...

Please look at the attached patch; it corrects all 29 items ("recods"
fixed in two places), but maybe you'll find some substitutions wrong...

I've also observed that those commits introduced new warnings:
$ CC=gcc-12 CPPFLAGS="-Wtype-limits" ./configure -q && make -s -j8
reconstruct.c: In function ‘read_bytes’:
reconstruct.c:511:24: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits]
   511 |                 if (rb < 0)
       |                        ^
reconstruct.c: In function ‘write_reconstructed_file’:
reconstruct.c:650:40: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits]
   650 |                                 if (rb < 0)
       |                                        ^
reconstruct.c:662:32: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits]
   662 |                         if (wb < 0)

There are also two deadcode.DeadStores complaints from clang. First one is
about:
         /*
          * Align the wait time to prevent drift. This doesn't really matter,
          * but we'd like the warnings about how long we've been waiting to say
          * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
          * drifting to something that is not a multiple of ten.
          */
         timeout_in_ms -=
             TimestampDifferenceMilliseconds(current_time, initial_time) %
             timeout_in_ms;
It looks like this timeout is really not used.

And a minor one (similar to many existing ones; maybe it doesn't deserve fixing):
walsummarizer.c:808:5: warning: Value stored to 'summary_end_lsn' is never read [deadcode.DeadStores]
                                         summary_end_lsn = private_data->read_upto;
                                         ^ ~~~~~~~~~~~~~~~~~~~~~~~

>> Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a
>> comment above redo_pointer_at_last_summary_removal declaration, but
>> perhaps it should say about removing summaries instead?
> Wow, yeah. Thanks, will fix.

Thank you for paying attention to it!

Best regards,
Alexander
Attachment

Re: trying again to get incremental backup

From
Robert Haas
Date:
On Thu, Dec 21, 2023 at 10:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> Please look at the attached patch; it corrects all 29 items ("recods"
> fixed in two places), but maybe you find some substitutions wrong...

Thanks, committed with a few additions.

> I've also observed that those commits introduced new warnings:
> $ CC=gcc-12 CPPFLAGS="-Wtype-limits" ./configure -q && make -s -j8
> reconstruct.c: In function ‘read_bytes’:
> reconstruct.c:511:24: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits]
>    511 |                 if (rb < 0)
>        |                        ^
> reconstruct.c: In function ‘write_reconstructed_file’:
> reconstruct.c:650:40: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits]
>    650 |                                 if (rb < 0)
>        |                                        ^
> reconstruct.c:662:32: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits]
>    662 |                         if (wb < 0)

Oops. I think the variables should be type int. See attached.

> There are also two deadcode.DeadStores complaints from clang. First one is
> about:
>          /*
>           * Align the wait time to prevent drift. This doesn't really matter,
>           * but we'd like the warnings about how long we've been waiting to say
>           * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
>           * drifting to something that is not a multiple of ten.
>           */
>          timeout_in_ms -=
>              TimestampDifferenceMilliseconds(current_time, initial_time) %
>              timeout_in_ms;
> It looks like this timeout is really not used.

Oops. It should be. See attached.

> And the minor one (similar to many existing, maybe doesn't deserve fixing):
> walsummarizer.c:808:5: warning: Value stored to 'summary_end_lsn' is never read [deadcode.DeadStores]
>                                          summary_end_lsn = private_data->read_upto;
>                                          ^ ~~~~~~~~~~~~~~~~~~~~~~~

It kind of surprises me that this is dead, but it seems best to keep
it there to be on the safe side, in case some change to the logic
renders it not dead in the future.

> >> Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a
> >> comment above redo_pointer_at_last_summary_removal declaration, but
> >> perhaps it should say about removing summaries instead?
> > Wow, yeah. Thanks, will fix.
>
> Thank you for paying attention to it!

I'll fix this next.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: trying again to get incremental backup

From
Alexander Lakhin
Date:
21.12.2023 23:43, Robert Haas wrote:
>> There are also two deadcode.DeadStores complaints from clang. First one is
>> about:
>>           /*
>>            * Align the wait time to prevent drift. This doesn't really matter,
>>            * but we'd like the warnings about how long we've been waiting to say
>>            * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
>>            * drifting to something that is not a multiple of ten.
>>            */
>>           timeout_in_ms -=
>>               TimestampDifferenceMilliseconds(current_time, initial_time) %
>>               timeout_in_ms;
>> It looks like this timeout is really not used.
> Oops. It should be. See attached.

My quick experiment shows that that TimestampDifferenceMilliseconds call
always returns zero, due to its arguments being swapped.
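
That is, as far as I can tell the fragment quoted above computes the
elapsed time with start and stop reversed, so the result is clamped to
zero and the subtraction is a no-op; I believe the intent was simply to
pass the arguments the other way around, i.e. (sketch only):

         timeout_in_ms -=
             TimestampDifferenceMilliseconds(initial_time, current_time) %
             timeout_in_ms;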

The other changes look good to me.

Thank you!

Best regards,
Alexander



Re: trying again to get incremental backup

From
Nathan Bossart
Date:
My compiler has the following complaint:

../postgresql/src/backend/postmaster/walsummarizer.c: In function ‘GetOldestUnsummarizedLSN’:
../postgresql/src/backend/postmaster/walsummarizer.c:540:32: error: ‘unsummarized_lsn’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
  540 |  WalSummarizerCtl->pending_lsn = unsummarized_lsn;
      |  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~

I haven't looked closely to see whether there is actually a problem here,
but the attached patch at least resolves the warning.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: trying again to get incremental backup

From
Robert Haas
Date:
On Sat, Dec 23, 2023 at 4:51 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> My compiler has the following complaint:
>
> ../postgresql/src/backend/postmaster/walsummarizer.c: In function ‘GetOldestUnsummarizedLSN’:
> ../postgresql/src/backend/postmaster/walsummarizer.c:540:32: error: ‘unsummarized_lsn’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
>   540 |  WalSummarizerCtl->pending_lsn = unsummarized_lsn;
>       |  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~

Thanks. I don't think there's a real bug, but I pushed a fix, same as
what you had.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Nathan Bossart
Date:
On Wed, Dec 27, 2023 at 09:11:02AM -0500, Robert Haas wrote:
> Thanks. I don't think there's a real bug, but I pushed a fix, same as
> what you had.

Thanks!  I also noticed that WALSummarizerLock probably needs a mention in
wait_event_names.txt.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Wed, Dec 27, 2023 at 10:36 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
> On Wed, Dec 27, 2023 at 09:11:02AM -0500, Robert Haas wrote:
> > Thanks. I don't think there's a real bug, but I pushed a fix, same as
> > what you had.
>
> Thanks!  I also noticed that WALSummarizerLock probably needs a mention in
> wait_event_names.txt.

Fixed.

It seems like it would be good if there were an automated cross-check
between lwlocknames.txt and wait_event_names.txt.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Robert Haas
Date:
On Fri, Dec 22, 2023 at 12:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> My quick experiment shows that that TimestampDifferenceMilliseconds call
> always returns zero, due to its arguments being swapped.

Thanks. Tom already changed the unsigned -> int stuff in a separate
commit, so I just pushed the fixes to PrepareForIncrementalBackup,
both the one I had before, and swapping the arguments to
TimestampDifferenceMilliseconds.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Thom Brown
Date:
On Wed, 3 Jan 2024 at 15:10, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 22, 2023 at 12:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> > My quick experiment shows that that TimestampDifferenceMilliseconds call
> > always returns zero, due to its arguments being swapped.
>
> Thanks. Tom already changed the unsigned -> int stuff in a separate
> commit, so I just pushed the fixes to PrepareForIncrementalBackup,
> both the one I had before, and swapping the arguments to
> TimestampDifferenceMilliseconds.

I would like to query the following:

--tablespace-mapping=olddir=newdir

    Relocates the tablespace in directory olddir to newdir during the backup. olddir is the absolute path of the tablespace as it exists in the first backup specified on the command line, and newdir is the absolute path to use for the tablespace in the reconstructed backup.

The first backup specified on the command line will be the regular, full, non-incremental backup.  But if a tablespace was introduced subsequently, it would only appear in an incremental backup.  Wouldn't this then mean that a mapping would need to be provided based on the path to the tablespace of that incremental backup's copy?
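
To make that concrete (paths purely illustrative): with a full backup in
full/, a tablespace created afterwards at /srv/ts1, and an incremental backup
in incr/, the reconstruction step in question would be something like

    pg_combinebackup full incr -o combined --tablespace-mapping=/srv/ts1=/srv/ts1_new

and the question is whether olddir is matched against the tablespace as
recorded in full/ (where it does not exist at all) or in incr/ (where it
does).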

Regards

Thom

Re: trying again to get incremental backup

From
Robert Haas
Date:
On Thu, Apr 25, 2024 at 6:44 PM Thom Brown <thom@linux.com> wrote:
> I would like to query the following:
>
> --tablespace-mapping=olddir=newdir
>
>     Relocates the tablespace in directory olddir to newdir during the backup. olddir is the absolute path of the tablespace as it exists in the first backup specified on the command line, and newdir is the absolute path to use for the tablespace in the reconstructed backup.
>
> The first backup specified on the command line will be the regular, full, non-incremental backup.  But if a tablespace was introduced subsequently, it would only appear in an incremental backup.  Wouldn't this then mean that a mapping would need to be provided based on the path to the tablespace of that incremental backup's copy?

Yes. Tomas Vondra found the same issue, which I have fixed in
1713e3d6cd393fcc1d4873e75c7fa1f6c7023d75.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: trying again to get incremental backup

From
Michael Banck
Date:
Hi,

So I am a bit confused about the status of the tar-format support, and
after re-reading the thread (or at least grepping it for ' tar '), it does
not seem to have been discussed much here either.

On Wed, Jun 14, 2023 at 02:46:48PM -0400, Robert Haas wrote:
> - We only know how to operate on directories, not tar files. I thought
> about that when working on pg_verifybackup as well, but I didn't do
> anything about it. It would be nice to go back and make that tool work
> on tar-format backups, and this one, too.

I believe "that tool" is pg_verifybackup, while "this one" is
pg_combinebackup? However, what's up with pg_basebackup itself with
respect to tar format incremental backups?

AFAICT (see below), pg_basebackup -Ft --incremental=foo/backup_manifest
happily creates an incremental backup in tar format; however,
pg_combinebackup will not be able to restore it? If that is the case,
shouldn't there be a bigger warning in the documentation about this, or
maybe pg_basebackup should refuse to make incremental tar-format backups
in the first place?

Am I missing something here? It will be obvious to users on the first failed
restore attempt that this does not work, and hopefully everybody tests a
restore before putting a backup solution into production (or, even better,
waits until this whole feature is wrapped up in a complete backup solution),
but I wonder whether somebody might trip over this after all and be unhappy.
If one reads the pg_combinebackup documentation carefully, it kind of becomes
obvious that it does not handle tar-format backups, but that is not spelt out
explicitly either.

|postgres@mbanck-lin-1:~$ pg_basebackup -c fast -Ft -D backup/backup_full
|postgres@mbanck-lin-1:~$ pg_basebackup -c fast -Ft -D backup/backup_incr_1 --incremental=backup/backup_full/backup_manifest
|postgres@mbanck-lin-1:~$ echo $?
|0
|postgres@mbanck-lin-1:~$ du -h backup/
|44M    backup/backup_incr_1
|4,5G    backup/backup_full
|4,5G    backup/
|postgres@mbanck-lin-1:~$ tar tf backup/backup_incr_1/base.tar | grep INCR | head
|base/1/INCREMENTAL.3603
|base/1/INCREMENTAL.2187
|base/1/INCREMENTAL.13418
|base/1/INCREMENTAL.3467
|base/1/INCREMENTAL.2615_vm
|base/1/INCREMENTAL.2228
|base/1/INCREMENTAL.3503
|base/1/INCREMENTAL.2659
|base/1/INCREMENTAL.2607_vm
|base/1/INCREMENTAL.4164
|postgres@mbanck-lin-1:~$ /usr/lib/postgresql/17/bin/pg_combinebackup backup/backup_full/ backup/backup_incr_1/ -o backup/combined
|pg_combinebackup: error: could not open file "backup/backup_incr_1//PG_VERSION": No such file or directory


Michael