Thread: trying again to get incremental backup
A few years ago, I sketched out a design for incremental backup, but no patch for incremental backup ever got committed. Instead, the whole thing evolved into a project to add backup manifests, which are nice, but not as nice as incremental backup would be. So I've decided to have another go at incremental backup itself. Attached are some WIP patches. Let me summarize the design and some open questions and problems with it that I've discovered. I welcome problem reports and test results from others, as well.

The basic design of this patch set is pretty simple, and there are three main parts.

First, there's a new background process called the walsummarizer which runs all the time. It reads the WAL and generates WAL summary files. WAL summary files are extremely small compared to the original WAL and contain only the minimal amount of information that we need in order to determine which parts of the database need to be backed up. They tell us about files getting created, destroyed, or truncated, and they tell us about modified blocks. Naturally, we don't find out about blocks that were modified without any write-ahead log record, e.g. hint bit updates, but those are of necessity not critical for correctness, so it's OK.

Second, pg_basebackup has a mode where it can take an incremental backup. You must supply a backup manifest from a previous full backup. We read the WAL summary files that have been generated between the start of the previous backup and the start of this one, and use that to figure out which relation files have changed and how much. Non-relation files are sent normally, just as they would be in a full backup. Relation files can either be sent in full or be replaced by an incremental file, which contains a subset of the blocks in the file plus a bit of information to handle truncations properly.
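As a rough illustration of the bookkeeping described above -- not the actual on-disk format, which is a compact binary one; the class and method names here are invented -- a summary needs to track, per relation fork, roughly this much state:

```python
# Toy model of what a WAL summary must record per relation fork:
# truncations and the set of modified block numbers. Illustrative only.

class ForkSummary:
    def __init__(self):
        self.truncated_to = None   # smallest truncation length seen, in blocks
        self.modified_blocks = set()

    def note_modified(self, blkno):
        self.modified_blocks.add(blkno)

    def note_truncate(self, nblocks):
        # After a truncation, blocks at or beyond the new length are gone,
        # so there is no point remembering modifications to them.
        if self.truncated_to is None or nblocks < self.truncated_to:
            self.truncated_to = nblocks
        self.modified_blocks = {b for b in self.modified_blocks if b < nblocks}

s = ForkSummary()
s.note_modified(10)
s.note_modified(500)
s.note_truncate(100)   # relation truncated to 100 blocks; block 500 forgotten
s.note_modified(42)    # block below the truncation point, kept
```

The point of the truncation bookkeeping is that an incremental backup must not resurrect blocks that a truncation removed, even if they were modified earlier in the summarized range.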
Third, there's now a pg_combinebackup utility which takes a full backup and one or more incremental backups, performs a bunch of sanity checks, and if everything works out, writes out a new, synthetic full backup, aka a data directory. Simple usage example:

pg_basebackup -cfast -Dx
pg_basebackup -cfast -Dy --incremental x/backup_manifest
pg_combinebackup x y -o z

The part of all this with which I'm least happy is the WAL summarization engine. Actually, the core process of summarizing the WAL seems totally fine, and the file format is very compact thanks to some nice ideas from my colleague Dilip Kumar. Someone may of course wish to argue that the information should be represented in some other file format instead, and that can be done if it's really needed, but I don't see a lot of value in tinkering with it, either.

Where I do think there's a problem is deciding how much WAL ought to be summarized in one WAL summary file. Summary files cover a certain range of WAL records - they have names like $TLI${START_LSN}${END_LSN}.summary. It's not too hard to figure out where a file should start - generally, it's wherever the previous file ended, possibly on a new timeline - but figuring out where the summary should end is trickier. You always have the option to either read another WAL record and fold it into the current summary, or end the current summary where you are, write out the file, and begin a new one. So how do you decide what to do?

I originally had the idea of summarizing a certain number of MB of WAL per WAL summary file, and so I added a GUC wal_summarize_mb for that purpose. But then I realized that actually, you really want WAL summary file boundaries to line up with possible redo points, because when you do an incremental backup, you need a summary that stretches from the redo point of the checkpoint written at the start of the prior backup to the redo point of the checkpoint written at the start of the current backup.
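One way to picture the requirement above: the backup needs a contiguous chain of summary files whose [start, end) LSN ranges cover the interval between the two redo points. A simplified sketch (hypothetical helper, single timeline only; the real code must also cope with timeline switches):

```python
# Illustrative sketch: given summary files covering [start, end) LSN ranges,
# decide whether they form a contiguous chain covering [need_start, need_end).
# This models why summary boundaries matter; it is not the actual algorithm.

def covers(summaries, need_start, need_end):
    # summaries: list of (start_lsn, end_lsn) tuples
    relevant = sorted(s for s in summaries if s[1] > need_start and s[0] < need_end)
    pos = need_start
    for start, end in relevant:
        if start > pos:
            return False          # gap in coverage: cannot take the backup
        pos = max(pos, end)
    return pos >= need_end

files = [(0, 1000), (1000, 2500), (2500, 4000)]
covers(files, 500, 3000)                   # chain covers the needed range
covers(files[:1] + files[2:], 500, 3000)   # the 1000-2500 file is missing
```

If the chain has a gap, or the last file ends before the current redo point, the backup has to wait for the summarizer -- which is exactly where the hang described below came from.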
The block modifications that happen in that range of WAL records are the ones that need to be included in the incremental. Unfortunately, there's no indication in the WAL itself that you've reached a redo point, but I wrote code that tries to notice when we've reached the redo point stored in shared memory and stops the summary there. But I eventually realized that's not good enough either, because if summarization zooms past the redo point before noticing the updated redo point in shared memory, then the backup sat around waiting for the next summary file to be generated so it had enough summaries to proceed with the backup, while the summarizer was in no hurry to finish up the current file and just sat there waiting for more WAL to be generated. Eventually the incremental backup would just time out.

I tried to fix that by making it so that if somebody's waiting for a summary file to be generated, they can let the summarizer know about that and it can write a summary file ending at the LSN up to which it has read and then begin a new file from there. That seems to fix the hangs, but now I've got three overlapping, interconnected systems for deciding where to end the current summary file, and maybe that's OK, but I have a feeling there might be a better way.

Dilip had an interesting potential solution to this problem, which was to always emit a special WAL record at the redo pointer. That is, when we fix the redo pointer for the checkpoint record we're about to write, also insert a WAL record there. That way, when the summarizer reaches that sentinel record, it knows it should stop the summary just before. I'm not sure whether this approach is viable, especially from a performance and concurrency perspective, and I'm not sure whether people here would like it, but it does seem like it would make things a whole lot simpler for this patch set.
Another thing that I'm not too sure about is: what happens if we find a relation file on disk that doesn't appear in the backup_manifest for the previous backup and isn't mentioned in the WAL summaries either? The fact that said file isn't mentioned in the WAL summaries seems like it ought to mean that the file is unchanged, in which case perhaps this ought to be an error condition. But I'm not too sure about that treatment. I have a feeling that there might be some subtle problems here, especially if databases or tablespaces get dropped and then new ones get created that happen to have the same OIDs. And what about wal_level=minimal? I'm not at a point where I can say I've gone through and plugged up these kinds of corner-case holes tightly yet, and I'm worried that there may be still other scenarios of which I haven't even thought. Happy to hear your ideas about what the problem cases are or how any of the problems should be solved.

A related design question is whether we should really be sending the whole backup manifest to the server at all. If it turns out that we don't really need anything except for the LSN of the previous backup, we could send that one piece of information instead of everything. On the other hand, if we need the list of files from the previous backup, then sending the whole manifest makes sense.

Another big and rather obvious problem with the patch set is that it doesn't currently have any automated test cases, or any real documentation. Those are obviously things that need a lot of work before there could be any thought of committing this. And probably a lot of bugs will be found along the way, too.

A few less-serious problems with the patch:

- We don't have an incremental JSON parser, so if you have a backup_manifest>1GB, pg_basebackup --incremental is going to fail. That's also true of the existing code in pg_verifybackup, and for the same reason.
I talked to Andrew Dunstan at one point about adapting our JSON parser to support incremental parsing, and he had a patch for that, but I think he found some problems with it and I'm not sure what the current status is.

- The patch does support differential backup, aka an incremental atop another incremental. There's no particular limit to how long a chain of backups can be. However, pg_combinebackup currently requires that the first backup is a full backup and all the later ones are incremental backups. So if you have a full backup a and an incremental backup b and a differential backup c, you can combine a, b, and c to get a full backup equivalent to one you would have gotten if you had taken a full backup at the time you took c. However, you can't combine b and c with each other without combining them with a, and that might be desirable in some situations. You might want to collapse a bunch of older differential backups into a single one that covers the whole time range of all of them. I think that the file format can support that, but the tool is currently too dumb.

- We only know how to operate on directories, not tar files. I thought about that when working on pg_verifybackup as well, but I didn't do anything about it. It would be nice to go back and make that tool work on tar-format backups, and this one, too. I don't think there would be a whole lot of point trying to operate on compressed tar files because you need random access and that seems hard on a compressed file, but on uncompressed files it seems at least theoretically doable. I'm not sure whether anyone would care that much about this, though, even though it does sound pretty cool.

In the attached patch series, patches 1 through 6 are various refactoring patches, patch 7 is the main event, and patch 8 adds a useful inspection tool.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com
Attachment
- v1-0001-In-basebackup.c-refactor-to-create-verify_page_ch.patch
- v1-0005-Change-how-a-base-backup-decides-which-files-have.patch
- v1-0003-Change-struct-tablespaceinfo-s-oid-member-from-ch.patch
- v1-0002-In-basebackup.c-refactor-to-create-read_file_data.patch
- v1-0004-Refactor-parse_filename_for_nontemp_relation-to-p.patch
- v1-0006-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch
- v1-0008-Add-new-pg_walsummary-tool.patch
- v1-0007-Prototype-patch-for-incremental-and-differential-.patch
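The per-file reconstruction that pg_combinebackup performs, as described upthread, can be sketched roughly like this (illustrative Python only, not the actual implementation; block images and the truncation bookkeeping are heavily simplified, and the placeholder for newly grown blocks is an invented detail):

```python
# Rough sketch of combining a full backup with a chain of incrementals for
# one relation file: honor each incremental's truncation info, then overlay
# its changed blocks, newest last. Purely conceptual.

def combine(full_blocks, incrementals):
    # full_blocks: list of block images from the full backup
    # incrementals: list of (trunc_len, {blkno: image}), oldest to newest
    result = list(full_blocks)
    for trunc_len, changed in incrementals:
        result = result[:trunc_len]            # apply the truncation first
        for blkno, image in sorted(changed.items()):
            while len(result) <= blkno:        # the file may have grown back
                result.append(b"\x00")         # placeholder for a new block
            result[blkno] = image
    return result

full = [b"A0", b"A1", b"A2", b"A3"]
incr1 = (3, {1: b"B1"})    # truncated to 3 blocks, block 1 rewritten
incr2 = (3, {3: b"C3"})    # later grew back to 4 blocks via block 3
combine(full, [incr1, incr2])
```

Each incremental overrides whatever came before it for the blocks it carries, which is why the chain must be applied oldest-first and why the tool insists on starting from a full backup.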
Hi,

On 2023-06-14 14:46:48 -0400, Robert Haas wrote:
> A few years ago, I sketched out a design for incremental backup, but
> no patch for incremental backup ever got committed. Instead, the whole
> thing evolved into a project to add backup manifests, which are nice,
> but not as nice as incremental backup would be. So I've decided to
> have another go at incremental backup itself. Attached are some WIP
> patches. Let me summarize the design and some open questions and
> problems with it that I've discovered. I welcome problem reports and
> test results from others, as well.

Cool!

> I originally had the idea of summarizing a certain number of MB of WAL
> per WAL summary file, and so I added a GUC wal_summarize_mb for that
> purpose. But then I realized that actually, you really want WAL
> summary file boundaries to line up with possible redo points, because
> when you do an incremental backup, you need a summary that stretches
> from the redo point of the checkpoint written at the start of the
> prior backup to the redo point of the checkpoint written at the start
> of the current backup. The block modifications that happen in that
> range of WAL records are the ones that need to be included in the
> incremental.

I assume this is "solely" required for keeping the incremental backups as small as possible, rather than being required for correctness?

> Unfortunately, there's no indication in the WAL itself
> that you've reached a redo point, but I wrote code that tries to
> notice when we've reached the redo point stored in shared memory and
> stops the summary there. But I eventually realized that's not good
> enough either, because if summarization zooms past the redo point
> before noticing the updated redo point in shared memory, then the
> backup sat around waiting for the next summary file to be generated so
> it had enough summaries to proceed with the backup, while the
> summarizer was in no hurry to finish up the current file and just sat
> there waiting for more WAL to be generated. Eventually the incremental
> backup would just time out. I tried to fix that by making it so that
> if somebody's waiting for a summary file to be generated, they can let
> the summarizer know about that and it can write a summary file ending
> at the LSN up to which it has read and then begin a new file from
> there. That seems to fix the hangs, but now I've got three
> overlapping, interconnected systems for deciding where to end the
> current summary file, and maybe that's OK, but I have a feeling there
> might be a better way.

Could we just recompute the WAL summary for the [redo, end of chunk] for the relevant summary file?

> Dilip had an interesting potential solution to this problem, which was
> to always emit a special WAL record at the redo pointer. That is, when
> we fix the redo pointer for the checkpoint record we're about to
> write, also insert a WAL record there. That way, when the summarizer
> reaches that sentinel record, it knows it should stop the summary just
> before. I'm not sure whether this approach is viable, especially from
> a performance and concurrency perspective, and I'm not sure whether
> people here would like it, but it does seem like it would make things
> a whole lot simpler for this patch set.

FWIW, I like the idea of a special WAL record at that point, independent of this feature. It wouldn't be a meaningful overhead compared to the cost of a checkpoint, and it seems like it'd be quite useful for debugging. But I can see uses going beyond that - we occasionally have been discussing associating additional data with redo points, and that'd be a lot easier to deal with during recovery with such a record.

I don't really see a performance and concurrency angle right now - what are you wondering about?

> Another thing that I'm not too sure about is: what happens if we find
> a relation file on disk that doesn't appear in the backup_manifest for
> the previous backup and isn't mentioned in the WAL summaries either?

Wouldn't that commonly happen for unlogged relations at least?

I suspect there's also other ways to end up with such additional files, e.g. by crashing during the creation of a new relation.

> A few less-serious problems with the patch:
>
> - We don't have an incremental JSON parser, so if you have a
> backup_manifest>1GB, pg_basebackup --incremental is going to fail.
> That's also true of the existing code in pg_verifybackup, and for the
> same reason. I talked to Andrew Dunstan at one point about adapting
> our JSON parser to support incremental parsing, and he had a patch for
> that, but I think he found some problems with it and I'm not sure what
> the current status is.

As a stopgap measure, can't we just use the relevant flag to allow larger allocations?

> - The patch does support differential backup, aka an incremental atop
> another incremental. There's no particular limit to how long a chain
> of backups can be. However, pg_combinebackup currently requires that
> the first backup is a full backup and all the later ones are
> incremental backups. So if you have a full backup a and an incremental
> backup b and a differential backup c, you can combine a b and c to get
> a full backup equivalent to one you would have gotten if you had taken
> a full backup at the time you took c. However, you can't combine b and
> c with each other without combining them with a, and that might be
> desirable in some situations. You might want to collapse a bunch of
> older differential backups into a single one that covers the whole
> time range of all of them. I think that the file format can support
> that, but the tool is currently too dumb.

That seems like a feature for the future...

> - We only know how to operate on directories, not tar files. I thought
> about that when working on pg_verifybackup as well, but I didn't do
> anything about it. It would be nice to go back and make that tool work
> on tar-format backups, and this one, too. I don't think there would be
> a whole lot of point trying to operate on compressed tar files because
> you need random access and that seems hard on a compressed file, but
> on uncompressed files it seems at least theoretically doable. I'm not
> sure whether anyone would care that much about this, though, even
> though it does sound pretty cool.

I don't know the tar format well, but my understanding is that it doesn't have a "central metadata" portion. I.e. doing something like this would entail scanning the tar file sequentially, skipping file contents? And wouldn't you have to create an entirely new tar file for the modified output? That kind of makes it not so incremental ;)

IOW, I'm not sure it's worth bothering about this ever, and certainly doesn't seem worth bothering about now. But I might just be missing something.

Greetings,

Andres Freund
On Wed, Jun 14, 2023 at 3:47 PM Andres Freund <andres@anarazel.de> wrote:
> I assume this is "solely" required for keeping the incremental backups as
> small as possible, rather than being required for correctness?

I believe so. I want to spend some more time thinking about this to make sure I'm not missing anything.

> Could we just recompute the WAL summary for the [redo, end of chunk] for the
> relevant summary file?

I'm not understanding how that would help. If we were going to compute a WAL summary on the fly rather than waiting for one to show up on disk, what we'd want is [end of last WAL summary that does exist on disk, redo]. But I'm not sure that's a great approach, because that LSN gap might be large and then we're duplicating a lot of work that the summarizer has probably already done most of.

> FWIW, I like the idea of a special WAL record at that point, independent of
> this feature. It wouldn't be a meaningful overhead compared to the cost of a
> checkpoint, and it seems like it'd be quite useful for debugging. But I can
> see uses going beyond that - we occasionally have been discussing associating
> additional data with redo points, and that'd be a lot easier to deal with
> during recovery with such a record.
>
> I don't really see a performance and concurrency angle right now - what are
> you wondering about?

I'm not really sure. I expect Dilip would be happy to post his patch, and if you'd be willing to have a look at it and express your concerns or lack thereof, that would be super valuable.

> > Another thing that I'm not too sure about is: what happens if we find
> > a relation file on disk that doesn't appear in the backup_manifest for
> > the previous backup and isn't mentioned in the WAL summaries either?
>
> Wouldn't that commonly happen for unlogged relations at least?
>
> I suspect there's also other ways to end up with such additional files,
> e.g. by crashing during the creation of a new relation.

Yeah, this needs some more careful thought.

> > A few less-serious problems with the patch:
> >
> > - We don't have an incremental JSON parser, so if you have a
> > backup_manifest>1GB, pg_basebackup --incremental is going to fail.
> > That's also true of the existing code in pg_verifybackup, and for the
> > same reason. I talked to Andrew Dunstan at one point about adapting
> > our JSON parser to support incremental parsing, and he had a patch for
> > that, but I think he found some problems with it and I'm not sure what
> > the current status is.
>
> As a stopgap measure, can't we just use the relevant flag to allow larger
> allocations?

I'm not sure that's a good idea, but theoretically, yes. We can also just choose to accept the limitation that your data directory can't be too darn big if you want to use this feature. But getting incremental JSON parsing would be better. Not having the manifest in JSON would be an even better solution, but regrettably I did not win that argument.

> That seems like a feature for the future...

Sure.

> I don't know the tar format well, but my understanding is that it doesn't have
> a "central metadata" portion. I.e. doing something like this would entail
> scanning the tar file sequentially, skipping file contents? And wouldn't you
> have to create an entirely new tar file for the modified output? That kind of
> makes it not so incremental ;)
>
> IOW, I'm not sure it's worth bothering about this ever, and certainly doesn't
> seem worth bothering about now. But I might just be missing something.

Oh, yeah, it's just an idle thought. I'll get to it when I get to it, or else I won't.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, 14 Jun 2023 at 20:47, Robert Haas <robertmhaas@gmail.com> wrote:
>
> A few years ago, I sketched out a design for incremental backup, but
> no patch for incremental backup ever got committed. Instead, the whole
> thing evolved into a project to add backup manifests, which are nice,
> but not as nice as incremental backup would be. So I've decided to
> have another go at incremental backup itself. Attached are some WIP
> patches.

Nice, I like this idea.

> Let me summarize the design and some open questions and
> problems with it that I've discovered. I welcome problem reports and
> test results from others, as well.

Skimming through the 7th patch, I see claims that FSM is not fully WAL-logged and thus shouldn't be tracked, and so it indeed doesn't track those changes.

I disagree with that decision: we now have support for custom resource managers, which may use the various forks for other purposes than those used in PostgreSQL right now. It would be a shame if data is lost because of the backup tool ignoring forks because the PostgreSQL project itself doesn't have post-recovery consistency guarantees in that fork.

So, unless we document that WAL-logged changes in the FSM fork are actually not recoverable from backup, regardless of the type of contents, we should still keep track of the changes in the FSM fork and include the fork in our backups, or only exclude those FSM updates that we know are safe to ignore.

Kind regards,

Matthias van de Meent
Neon, Inc.
Hi,

On 2023-06-14 16:10:38 -0400, Robert Haas wrote:
> On Wed, Jun 14, 2023 at 3:47 PM Andres Freund <andres@anarazel.de> wrote:
> > Could we just recompute the WAL summary for the [redo, end of chunk] for the
> > relevant summary file?
>
> I'm not understanding how that would help. If we were going to compute
> a WAL summary on the fly rather than waiting for one to show up on
> disk, what we'd want is [end of last WAL summary that does exist on
> disk, redo].

Oh, right.

> But I'm not sure that's a great approach, because that LSN gap might be
> large and then we're duplicating a lot of work that the summarizer has
> probably already done most of.

I guess that really depends on what the summary granularity is. If you create a separate summary every 32MB or so, recomputing just the required range shouldn't be too bad.

> > FWIW, I like the idea of a special WAL record at that point, independent of
> > this feature. It wouldn't be a meaningful overhead compared to the cost of a
> > checkpoint, and it seems like it'd be quite useful for debugging. But I can
> > see uses going beyond that - we occasionally have been discussing associating
> > additional data with redo points, and that'd be a lot easier to deal with
> > during recovery with such a record.
> >
> > I don't really see a performance and concurrency angle right now - what are
> > you wondering about?
>
> I'm not really sure. I expect Dilip would be happy to post his patch,
> and if you'd be willing to have a look at it and express your concerns
> or lack thereof, that would be super valuable.

Will do. Adding me to CC: might help, I have a backlog unfortunately :(.

Greetings,

Andres Freund
On Thu, Jun 15, 2023 at 2:11 AM Andres Freund <andres@anarazel.de> wrote:
>
> > I'm not really sure. I expect Dilip would be happy to post his patch,
> > and if you'd be willing to have a look at it and express your concerns
> > or lack thereof, that would be super valuable.
>
> Will do. Adding me to CC: might help, I have a backlog unfortunately :(.

Thanks, I have posted it here[1]

[1] https://www.postgresql.org/message-id/CAFiTN-s-K%3DmVA%3DHPr_VoU-5bvyLQpNeuzjq1ebPJMEfCJZKFsg%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 14, 2023 at 4:40 PM Andres Freund <andres@anarazel.de> wrote:
> > But I'm not sure that's a great approach, because that LSN gap might be
> > large and then we're duplicating a lot of work that the summarizer has
> > probably already done most of.
>
> I guess that really depends on what the summary granularity is. If you create
> a separate summary every 32MB or so, recomputing just the required range
> shouldn't be too bad.

Yeah, but I don't think that's the right approach, for two reasons.

First, one of the things I'm rather worried about is what happens when the WAL distance between the prior backup and the incremental backup is large. It could be a terabyte. If we have a WAL summary for every 32MB of WAL, that's 32k files we have to read, and I'm concerned that's too many. Maybe it isn't, but it's something that has really been weighing on my mind as I've been thinking through the design questions here. The files are really very small, and having to open a bazillion tiny little files to get the job done sounds lame.

Second, I don't see what problem it actually solves. Why not just signal the summarizer to write out the accumulated data to a file instead of re-doing the work ourselves? Or else adopt the WAL-record-at-the-redo-pointer approach, and then the whole thing is moot?

--
Robert Haas
EDB: http://www.enterprisedb.com
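The 32k figure in the message above is straightforward back-of-the-envelope arithmetic (assuming 1 TiB of WAL and one summary per 32 MiB):

```python
# Number of 32MiB-granularity summary files needed to cover 1TiB of WAL
# between two backups -- the file-count concern raised above.
wal_bytes = 1 << 40          # 1 TiB of WAL since the prior backup
summary_span = 32 << 20      # one summary file per 32 MiB
n_files = wal_bytes // summary_span   # 2**15 = 32768 files to open and merge
```

Whether opening 32768 tiny files is actually too slow is an empirical question, but it frames why coarser summaries (or packing multiple ranges per file) look attractive.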
Hi,

On 2023-06-19 09:46:12 -0400, Robert Haas wrote:
> On Wed, Jun 14, 2023 at 4:40 PM Andres Freund <andres@anarazel.de> wrote:
> > > But I'm not sure that's a great approach, because that LSN gap might be
> > > large and then we're duplicating a lot of work that the summarizer has
> > > probably already done most of.
> >
> > I guess that really depends on what the summary granularity is. If you create
> > a separate summary every 32MB or so, recomputing just the required range
> > shouldn't be too bad.
>
> Yeah, but I don't think that's the right approach, for two reasons.
> First, one of the things I'm rather worried about is what happens when
> the WAL distance between the prior backup and the incremental backup
> is large. It could be a terabyte. If we have a WAL summary for every
> 32MB of WAL, that's 32k files we have to read, and I'm concerned
> that's too many. Maybe it isn't, but it's something that has really
> been weighing on my mind as I've been thinking through the design
> questions here.

It doesn't have to be a separate file - you could easily summarize ranges at a higher granularity, storing multiple ranges into a single file with a coarser naming pattern.

> The files are really very small, and having to open a bazillion tiny little
> files to get the job done sounds lame. Second, I don't see what problem it
> actually solves. Why not just signal the summarizer to write out the
> accumulated data to a file instead of re-doing the work ourselves? Or else
> adopt the WAL-record-at-the-redo-pointer approach, and then the whole thing
> is moot?

The one point for a relatively grainy summarization scheme that I see is that it would pave the way for using the WAL summary data for other purposes in the future. That could be done orthogonal to the other solutions to the redo pointer issues.

Other potential use cases:
- only restore parts of a base backup that aren't going to be overwritten by WAL replay
- reconstructing database contents from WAL after data loss
- more efficient pg_rewind
- more efficient prefetching during WAL replay

Greetings,

Andres Freund
Hi,

In the limited time that I've had to work on this project lately, I've been trying to come up with a test case for this feature -- and since I've gotten completely stuck, I thought it might be time to post and see if anyone else has a better idea.

I thought a reasonable test case would be: Do a full backup. Change some stuff. Do an incremental backup. Restore both backups and perform replay to the same LSN. Then compare the files on disk. But I cannot make this work.

The first problem I ran into was that replay of the full backup does a restartpoint, while the replay of the incremental backup does not. That results in, for example, pg_subtrans having different contents. I'm not sure whether it can also result in data files having different contents: are changes that we replayed following the last restartpoint guaranteed to end up on disk when the server is shut down? It wasn't clear to me that this is the case. I thought maybe I could get both servers to perform a restartpoint at the same location by shutting down the primary and then replaying through the shutdown checkpoint, but that doesn't work because the primary doesn't finish archiving before shutting down.

After some more fiddling I settled (at least for research purposes) on having the restored backups PITR and promote, instead of PITR and pause, so that we're guaranteed a checkpoint. But that just caused me to run into a far worse problem: replay on the standby doesn't actually create a state that is byte-for-byte identical to the one that exists on the primary. I quickly discovered that in my test case, I was ending up with different contents in the "hole" of a block wherein a tuple got updated. Replay doesn't think it's important to make the hole end up with the same contents on all machines that replay the WAL, so I end up with one server that has more junk in there than the other one and the tests fail.

Unless someone has a brilliant idea that I lack, this suggests to me that this whole line of testing is a dead end. I can, of course, write tests that compare clusters *logically* -- do the correct relations exist, are they accessible, do they have the right contents? But I feel like it would be easy to have bugs that escape detection in such a test but would be detected by a physical comparison of the clusters. However, such a comparison can only be conducted if either (a) there's some way to set up the test so that byte-for-byte identical clusters can be expected or (b) there's some way to perform the comparison that can distinguish between expected, harmless differences and unexpected, problematic differences. And at the moment my conclusion is that neither (a) nor (b) exists. Does anyone think otherwise?

Meanwhile, here's a rebased set of patches. The somewhat-primitive attempts at writing tests are in 0009, but they don't work, for the reasons explained above. I think I'd probably like to go ahead and commit 0001 and 0002 soon if there are no objections, since I think those are good refactorings independently of the rest of this.

...Robert
Attachment
- 0004-Refactor-parse_filename_for_nontemp_relation-to-pars.patch
- 0002-In-basebackup.c-refactor-to-create-read_file_data_in.patch
- 0001-In-basebackup.c-refactor-to-create-verify_page_check.patch
- 0005-Change-how-a-base-backup-decides-which-files-have-ch.patch
- 0003-Change-struct-tablespaceinfo-s-oid-member-from-char-.patch
- 0006-Move-src-bin-pg_verifybackup-parse_manifest.c-into-s.patch
- 0008-Add-new-pg_walsummary-tool.patch
- 0009-Add-TAP-tests-this-is-broken-doesn-t-work.patch
- 0007-Prototype-patch-for-incremental-and-differential-bac.patch
Hi Robert, On 8/30/23 10:49, Robert Haas wrote: > In the limited time that I've had to work on this project lately, I've > been trying to come up with a test case for this feature -- and since > I've gotten completely stuck, I thought it might be time to post and > see if anyone else has a better idea. I thought a reasonable test case > would be: Do a full backup. Change some stuff. Do an incremental > backup. Restore both backups and perform replay to the same LSN. Then > compare the files on disk. But I cannot make this work. The first > problem I ran into was that replay of the full backup does a > restartpoint, while the replay of the incremental backup does not. > That results in, for example, pg_subtrans having different contents. pg_subtrans, at least, can be ignored since it is excluded from the backup and not required for recovery. > I'm not sure whether it can also result in data files having different > contents: are changes that we replayed following the last restartpoint > guaranteed to end up on disk when the server is shut down? It wasn't > clear to me that this is the case. I thought maybe I could get both > servers to perform a restartpoint at the same location by shutting > down the primary and then replaying through the shutdown checkpoint, > but that doesn't work because the primary doesn't finish archiving > before shutting down. After some more fiddling I settled (at least for > research purposes) on having the restored backups PITR and promote, > instead of PITR and pause, so that we're guaranteed a checkpoint. But > that just caused me to run into a far worse problem: replay on the > standby doesn't actually create a state that is byte-for-byte > identical to the one that exists on the primary. I quickly discovered > that in my test case, I was ending up with different contents in the > "hole" of a block wherein a tuple got updated. 
Replay doesn't think > it's important to make the hole end up with the same contents on all > machines that replay the WAL, so I end up with one server that has > more junk in there than the other one and the tests fail. This is pretty much what I discovered when investigating backup from standby back in 2016. My (ultimately unsuccessful) efforts to find a clean delta resulted in [1] as I systematically excluded directories that are not required for recovery and will not be synced between a primary and standby. FWIW Heikki also made similar attempts at this before me (back then I found the thread but I doubt I could find it again) and arrived at similar results. We discussed this in person and figured out that we had come to more or less the same conclusion. Welcome to the club! > Unless someone has a brilliant idea that I lack, this suggests to me > that this whole line of testing is a dead end. I can, of course, write > tests that compare clusters *logically* -- do the correct relations > exist, are they accessible, do they have the right contents? But I > feel like it would be easy to have bugs that escape detection in such > a test but would be detected by a physical comparison of the clusters. Agreed, though a matching logical result is still very compelling. > However, such a comparison can only be conducted if either (a) there's > some way to set up the test so that byte-for-byte identical clusters > can be expected or (b) there's some way to perform the comparison that > can distinguish between expected, harmless differences and unexpected, > problematic differences. And at the moment my conclusion is that > neither (a) nor (b) exists. Does anyone think otherwise? I do not. My conclusion back then was that validating a physical comparison would be nearly impossible without changes to Postgres to make the primary and standby match via replication. Which, by the way, I still think would be a great idea. In principle, at least. 
Replay is already a major bottleneck and anything that makes it slower will likely not be very popular. This would also be great for WAL, since the last time I tested, the same WAL segment could differ between the primary and standby because on the standby the unused (and recycled) portion at the end is not zeroed as it is on the primary (though logically they do match). I would be very happy if somebody told me that my info is out of date here and this has been fixed. But when I looked at the code it was incredibly tricky to do because of how WAL is replicated.

> Meanwhile, here's a rebased set of patches. The somewhat-primitive
> attempts at writing tests are in 0009, but they don't work, for the
> reasons explained above. I think I'd probably like to go ahead and
> commit 0001 and 0002 soon if there are no objections, since I think
> those are good refactorings independently of the rest of this.

No objections to 0001/0002.

Regards,
-David

[1] http://git.postgresql.org/pg/commitdiff/6ad8ac6026287e3ccbc4d606b6ab6116ccc0eec8
Hey, thanks for the reply. On Thu, Aug 31, 2023 at 6:50 PM David Steele <david@pgmasters.net> wrote: > pg_subtrans, at least, can be ignored since it is excluded from the > backup and not required for recovery. I agree... > Welcome to the club! Thanks for the welcome, but being a member feels *terrible*. :-) > I do not. My conclusion back then was that validating a physical > comparison would be nearly impossible without changes to Postgres to > make the primary and standby match via replication. Which, by the way, I > still think would be a great idea. In principle, at least. Replay is > already a major bottleneck and anything that makes it slower will likely > not be very popular. Fair point. But maybe the bigger issue is the work involved. I don't think zeroing the hole in all cases would likely be that expensive, but finding everything that can cause the standby to diverge from the primary and fixing all of it sounds like an unpleasant amount of effort. Still, it's good to know that I'm not missing something obvious. > No objections to 0001/0002. Cool. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Aug 30, 2023 at 9:20 PM Robert Haas <robertmhaas@gmail.com> wrote: > Unless someone has a brilliant idea that I lack, this suggests to me > that this whole line of testing is a dead end. I can, of course, write > tests that compare clusters *logically* -- do the correct relations > exist, are they accessible, do they have the right contents? Can't we think of comparing at the block level, like we can compare each block but ignore the content of the hole? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 4, 2023 at 8:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Can't we think of comparing at the block level, like we can compare
> each block but ignore the content of the hole?

We could do that, but I don't think that's a full solution. I think I'd end up having to reimplement the equivalent of heap_mask, btree_mask, et al. in Perl, which doesn't seem very reasonable. It's fairly complicated logic even written in C, and doing the right thing in Perl would be more complex, I think, because it wouldn't have access to all the same #defines, which depend on things like word size and endianness and stuff. If we want to allow this sort of comparison, I feel we should think of changing the C code in some way to make it work reliably rather than try to paper over the problems in Perl.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
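[Editor's note: for what it's worth, the hole-ignoring part of the comparison Dilip suggests is the easy piece; it's the per-access-method masking Robert mentions that is hard. A minimal hypothetical sketch of just the hole-only comparison, assuming the standard 8kB page layout with pd_lower and pd_upper as 16-bit fields at page-header byte offsets 12 and 14, read in native byte order:]

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

/* Compare two pages while ignoring the hole [pd_lower, pd_upper).
 * Hypothetical helper for illustration only: it assumes the standard
 * PostgreSQL page header layout (pd_lower at offset 12, pd_upper at
 * offset 14) and does none of the heap_mask/btree_mask-style masking
 * that a real comparison would also need. */
static bool
pages_equal_ignoring_hole(const uint8_t *a, const uint8_t *b)
{
    uint16_t a_lower, a_upper, b_lower, b_upper;

    memcpy(&a_lower, a + 12, sizeof(uint16_t));
    memcpy(&a_upper, a + 14, sizeof(uint16_t));
    memcpy(&b_lower, b + 12, sizeof(uint16_t));
    memcpy(&b_upper, b + 14, sizeof(uint16_t));

    /* Pages whose hole boundaries differ are structurally different. */
    if (a_lower != b_lower || a_upper != b_upper)
        return false;

    /* Compare everything before the hole and everything from pd_upper on. */
    return memcmp(a, b, a_lower) == 0 &&
           memcmp(a + a_upper, b + a_upper, BLCKSZ - a_upper) == 0;
}
```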
On Wed, Aug 30, 2023 at 9:20 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Meanwhile, here's a rebased set of patches. The somewhat-primitive
> attempts at writing tests are in 0009, but they don't work, for the
> reasons explained above. I think I'd probably like to go ahead and
> commit 0001 and 0002 soon if there are no objections, since I think
> those are good refactorings independently of the rest of this.
>

I have started reading the patch today. I haven't yet completed one pass, but here are my comments on 0007.

1.
+ BlockNumber relative_block_numbers[RELSEG_SIZE];

This is close to 400kB of memory, so I think it is better we palloc it instead of keeping it on the stack.

2.
 /*
  * Try to parse the directory name as an unsigned integer.
  *
- * Tablespace directories should be positive integers that can
- * be represented in 32 bits, with no leading zeroes or trailing
+ * Tablespace directories should be positive integers that can be
+ * represented in 32 bits, with no leading zeroes or trailing
  * garbage. If we come across a name that doesn't meet those
  * criteria, skip it.

Unrelated code refactoring hunk.

3.
+typedef struct
+{
+	const char *filename;
+	pg_checksum_context *checksum_ctx;
+	bbsink	   *sink;
+	size_t		bytes_sent;
+} FileChunkContext;

This structure is not used anywhere.

4.
+ * If the file is to be set incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks

/If the file is to be set incrementally/If the file is to be sent incrementally

5.
- while (bytes_done < statbuf->st_size)
+ while (1)
  {
- 	size_t		remaining = statbuf->st_size - bytes_done;
+ 	/*

I do not really like this change, because after removing this you have put 2 independent checks for sending the full file[1] and sending it incrementally[2]. Actually, for sending incrementally, 'statbuf->st_size' is computed from 'num_incremental_blocks' itself, so why don't we keep this breaking condition in the while loop itself? So that we can avoid these two separate conditions.

[1]
+ /*
+  * If we've read the required number of bytes, then it's time to
+  * stop.
+  */
+ if (bytes_done >= statbuf->st_size)
+ 	break;

[2]
+ /*
+  * If we've read all the blocks, then it's time to stop.
+  */
+ if (ibindex >= num_incremental_blocks)
+ 	break;

6.
+typedef struct
+{
+	TimeLineID	tli;
+	XLogRecPtr	start_lsn;
+	XLogRecPtr	end_lsn;
+} backup_wal_range;
+
+typedef struct
+{
+	uint32		status;
+	const char *path;
+	size_t		size;
+} backup_file_entry;

Better we add some comments for these structures.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Aug 30, 2023 at 4:50 PM Robert Haas <robertmhaas@gmail.com> wrote: [..]

I've played a little bit more with this second batch of patches on e8d74ad625f7344f6b715254d3869663c1569a51 @ 31Aug (days before the wait events refactor):

test_across_wallevelminimal.sh
test_many_incrementals_dbcreate.sh
test_many_incrementals.sh
test_multixact.sh
test_pending_2pc.sh
test_reindex_and_vacuum_full.sh
test_truncaterollback.sh
test_unlogged_table.sh

All those basic tests had GOOD results. Please find attached. I'll try to schedule some more realistic (in terms of workload and sizes) tests in a couple of days + maybe have some fun with cross-backup-and-restores across standbys.

As for the earlier doubt: the raw wal_level = minimal situation shouldn't be a concern, because it requires max_wal_senders == 0, while pg_basebackup requires it above 0 (due to "FATAL: number of requested standby connections exceeds max_wal_senders (currently 0)"). I wanted to also introduce corruption into pg_walsummaries files, but later saw in the code that this is already covered by CRC32, cool.

In v07:

> +#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 160000

170000 ?

> A related design question is whether we should really be sending the
> whole backup manifest to the server at all. If it turns out that we
> don't really need anything except for the LSN of the previous backup,
> we could send that one piece of information instead of everything. On
> the other hand, if we need the list of files from the previous backup,
> then sending the whole manifest makes sense.

If that is still an area open for discussion: wouldn't it be better to just specify an LSN, as that would allow resyncing a standby across a major lag where the WAL to replay would be enormous? Given that we had primary->standby where the standby would be stuck on some LSN, right now it would be:

1) calculate the backup manifest of the desynced 10TB standby (how? using which tool?) - even if possible, that means reading 10TB of data instead of just putting in a number, doesn't it?
2) backup the primary with such an incremental backup >= LSN
3) copy the incremental backup to the standby
4) apply it to the impaired standby
5) restart the WAL replay

> - We only know how to operate on directories, not tar files. I thought
> about that when working on pg_verifybackup as well, but I didn't do
> anything about it. It would be nice to go back and make that tool work
> on tar-format backups, and this one, too. I don't think there would be
> a whole lot of point trying to operate on compressed tar files because
> you need random access and that seems hard on a compressed file, but
> on uncompressed files it seems at least theoretically doable. I'm not
> sure whether anyone would care that much about this, though, even
> though it does sound pretty cool.

Also, maybe it's too early to ask, but wouldn't it be nice if we could have a future option in pg_combinebackup to avoid double writes when used from restore hosts? (Right now we first need to reconstruct the original datadir from the full and incremental backups on the host storing the backups, and then TRANSFER it again to the target host.) So something like this could work well from the restore host:

pg_combinebackup /tmp/backup1 /tmp/incbackup2 /tmp/incbackup3 -O tar -o - | ssh dbserver 'tar xvf -C /path/to/restored/cluster - '

The bad thing is that such a pipe prevents parallelism from day 1, and I'm afraid I do not have a better easy idea on how to have both at the same time in the long term.

-J.
Attachment
On Fri, Sep 1, 2023 at 10:30 AM Robert Haas <robertmhaas@gmail.com> wrote: > > No objections to 0001/0002. > > Cool. Nobody else objected either, so I went ahead and committed those. I'll rebase the rest of the patches on top of the latest master and repost, hopefully after addressing some of the other review comments from Dilip and Jakub. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Sep 12, 2023 at 5:56 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > + BlockNumber relative_block_numbers[RELSEG_SIZE]; > > This is close to 400kB of memory, so I think it is better we palloc it > instead of keeping it in the stack. Fixed. > Unrelated code refactoring hunk Fixed. > This structure is not used anywhere. Removed. > /If the file is to be set incrementally/If the file is to be sent incrementally Fixed. > I do not really like this change, because after removing this you have > put 2 independent checks for sending the full file[1] and sending it > incrementally[1]. Actually for sending incrementally > 'statbuf->st_size' is computed from the 'num_incremental_blocks' > itself so why don't we keep this breaking condition in the while loop > itself? So that we can avoid these two separate conditions. I don't think that would be correct. The number of bytes that need to be read from the original file is not equal to the number of bytes that will be written to the incremental file. Admittedly, they're currently different by less than a block, but that could change if we change the format of the incremental file (e.g. suppose we compressed the blocks in the incremental file with gzip, or smushed out the holes in the pages). I wrote the loop as I did precisely so that the two cases could have different loop exit conditions. > Better we add some comments for these structures. Done. Here's a new patch set, also addressing Jakub's observation that MINIMUM_VERSION_FOR_WAL_SUMMARIES needed updating. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
- v3-0002-Refactor-parse_filename_for_nontemp_relation-to-p.patch
- v3-0004-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch
- v3-0001-Change-struct-tablespaceinfo-s-oid-member-from-ch.patch
- v3-0003-Change-how-a-base-backup-decides-which-files-have.patch
- v3-0005-Prototype-patch-for-incremental-and-differential-.patch
- v3-0006-Add-new-pg_walsummary-tool.patch
- v3-0007-Add-TAP-tests-this-is-broken-doesn-t-work.patch
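[Editor's note: Robert's point above about the two loop exit conditions can be illustrated with a small sketch. This is hypothetical C for illustration, not the patch's actual code: the incremental-send loop stops on the count of blocks to send (ibindex), independent of how many bytes end up in the output, so the output format could later compress blocks or squeeze out holes without changing the loop:]

```c
#include <stdio.h>

#define BLCKSZ 8192

/* Hypothetical sketch of the incremental-send loop discussed above. The
 * exit condition is the number of blocks to send, not the byte size of
 * the output file, so the read side and the write side are free to have
 * different sizes. Returns the number of bytes written to the sink. */
static size_t
send_incremental_blocks(FILE *relfile, const unsigned *block_numbers,
                        unsigned num_incremental_blocks, FILE *sink)
{
    unsigned ibindex = 0;
    size_t bytes_sent = 0;
    char buf[BLCKSZ];

    while (1)
    {
        /* If we've read all the blocks, then it's time to stop. */
        if (ibindex >= num_incremental_blocks)
            break;

        fseek(relfile, (long) block_numbers[ibindex] * BLCKSZ, SEEK_SET);
        if (fread(buf, 1, BLCKSZ, relfile) != BLCKSZ)
            break;              /* truncated source file */
        bytes_sent += fwrite(buf, 1, BLCKSZ, sink);
        ibindex++;
    }
    return bytes_sent;
}
```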
On Thu, Sep 28, 2023 at 6:22 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > all those basic tests had GOOD results. Please find attached. I'll try > to schedule some more realistic (in terms of workload and sizes) test > in a couple of days + maybe have some fun with cross-backup-and > restores across standbys. That's awesome! Thanks for testing! This can definitely benefit from any amount of beating on it that people wish to do. It's a complex, delicate area that risks data loss. > If that is still an area open for discussion: wouldn't it be better to > just specify LSN as it would allow resyncing standby across major lag > where the WAL to replay would be enormous? Given that we had > primary->standby where standby would be stuck on some LSN, right now > it would be: > 1) calculate backup manifest of desynced 10TB standby (how? using > which tool?) - even if possible, that means reading 10TB of data > instead of just putting a number, isn't it? > 2) backup primary with such incremental backup >= LSN > 3) copy the incremental backup to standby > 4) apply it to the impaired standby > 5) restart the WAL replay Hmm. I wonder if this would even be a safe procedure. I admit that I can't quite see a problem with it, but sometimes I'm kind of dumb. > Also maybe it's too early to ask, but wouldn't it be nice if we could > have an future option in pg_combinebackup to avoid double writes when > used from restore hosts (right now we need to first to reconstruct the > original datadir from full and incremental backups on host hosting > backups and then TRANSFER it again and on target host?). So something > like that could work well from restorehost: pg_combinebackup > /tmp/backup1 /tmp/incbackup2 /tmp/incbackup3 -O tar -o - | ssh > dbserver 'tar xvf -C /path/to/restored/cluster - ' . The bad thing is > that such a pipe prevents parallelism from day 1 and I'm afraid I do > not have a better easy idea on how to have both at the same time in > the long term. 
I don't think it's too early to ask for this, but I do think it's too early for you to get it. ;-) -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Oct 3, 2023 at 2:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Here's a new patch set, also addressing Jakub's observation that
> MINIMUM_VERSION_FOR_WAL_SUMMARIES needed updating.

Here's yet another new version. In this version, I reversed the order of the first two patches, with the idea that what's now 0001 seems fairly reasonable as an independent commit, and could thus perhaps be committed sometime soon-ish. In the main patch, I added SGML documentation for pg_combinebackup. I also fixed the broken TAP tests so that they work, by basing them on pg_dump equivalence rather than file-level equivalence. I'm sad to give up on testing the latter, but it seems to be unrealistic. I cleaned up a few other odds and ends, too.

But, what exactly is the bigger picture for this patch in terms of moving forward? Here's a list of things that are on my mind:

- I'd like to get the patch to mark the redo point in the WAL committed[1] and then rework this patch set to make use of that infrastructure. Right now, we make a best effort to end WAL summaries at redo point boundaries, but it's racey, and sometimes we fail to do so. In theory that just has the effect of potentially making an incremental backup contain some extra blocks that it shouldn't really need to contain, but I think it can actually lead to weird stalls, because when an incremental backup is taken, we have to wait until a WAL summary shows up that extends at least up to the start LSN of the backup we're about to take. I believe all the logic in this area can be made a good deal simpler and more reliable if that patch gets committed and this one reworked accordingly.

- I would like some feedback on the generation of WAL summary files. Right now, I have it enabled by default, and summaries are kept for a week. That means that, with no additional setup, you can take an incremental backup as long as the reference backup was taken in the last week.
File removal is governed by mtimes, so if you change the mtimes of your summary files or whack your system clock around, weird things might happen. But obviously this might be inconvenient. Some people might not want WAL summary files to be generated at all because they don't care about incremental backup, and other people might want them retained for longer, and still other people might want them to be not removed automatically, or removed automatically based on some criteria other than mtime. I don't really know what's best here. I don't think the default policy that the patches implement is especially terrible, but it's just something that I made up and I don't have any real confidence that it's wonderful.

One point to consider here is that, if WAL summarization is enabled, checkpoints can't remove WAL that isn't summarized yet. Mostly that's not a problem, I think, because the WAL summarizer is pretty fast. But it could increase disk consumption for some people. I don't think that we need to worry about the summaries themselves being a problem in terms of space consumption; at least in all the cases I've tested, they're just not very big.

- On a related note, I haven't yet tested this on a standby, which is a thing that I definitely need to do. I don't know of a reason why it shouldn't be possible for all of this machinery to work on a standby just as it does on a primary, but then we need the WAL summarizer to run there too, which could end up being a waste if nobody ever tries to take an incremental backup. I wonder how that should be reflected in the configuration. We could do something like what we've done for archive_mode, where on means "only on if this is a primary" and you have to say always if you want it to run on standbys as well ... but I'm not sure if that's a design pattern that we really want to replicate into more places.
I'd be somewhat inclined to just make whatever configuration parameters we need to configure this thing on the primary also work on standbys, and you can set each server up as you please. But I'm open to other suggestions.

- We need to settle the question of whether to send the whole backup manifest to the server or just the LSN. In a previous attempt at incremental backup, we decided the whole manifest was necessary, because flat-copying files could make new data show up with old LSNs. But that version of the patch set was trying to find modified blocks by checking their LSNs individually, not by summarizing WAL. And since the operations that flat-copy files are WAL-logged, the WAL summary approach seems to eliminate that problem - maybe an LSN (and the associated TLI) is good enough now. This also relates to Jakub's question about whether this machinery could be used to fast-forward a standby, which is not exactly a base backup but ... perhaps close enough? I'm somewhat inclined to believe that we can simplify to an LSN and TLI; however, if we do that, then we'll have big problems if later we realize that we want the manifest for something after all. So if anybody thinks that there's a reason to keep doing what the patch does today -- namely, upload the whole manifest to the server -- please speak up.

- It's regrettable that we don't have incremental JSON parsing; I think that means anyone who has a backup manifest that is bigger than 1GB can't use this feature. However, that's also a problem for the existing backup manifest feature, and as far as I can see, we have no complaints about it. So maybe people just don't have databases with enough relations for that to be much of a live issue yet. I'm inclined to treat this as a non-blocker, although Andrew Dunstan tells me he does have a prototype for incremental JSON parsing, so maybe that will land and we can use it here.

- Right now, I have a hard-coded 60 second timeout for WAL summarization.
If you try to take an incremental backup and the WAL summaries you need don't show up within 60 seconds, the backup times out. I think that's a reasonable default, but should it be configurable? If yes, should that be a GUC or, perhaps better, a pg_basebackup option?

- I'm curious what people think about the pg_walsummary tool that is included in 0006. I think it's going to be fairly important for debugging, but it does feel a little bit bad to add a new binary for something pretty niche. Nevertheless, merging it into any other utility seems relatively awkward, so I'm inclined to think both that this should be included in whatever finally gets committed and that it should be a separate binary. I considered whether it should go in contrib, but we seem to have moved to a policy that heavily favors limiting contrib to extensions and loadable modules, rather than binaries.

Clearly there's a good amount of stuff to sort out here, but we've still got quite a bit of time left before feature freeze so I'd like to have a go at it. Please let me know your thoughts, if you have any.

[1] http://postgr.es/m/CA+TgmoZAM24Ub=uxP0aWuWstNYTUJQ64j976FYJeVaMJ+qD0uw@mail.gmail.com

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Attachment
- v4-0006-Add-new-pg_walsummary-tool.patch
- v4-0001-Refactor-parse_filename_for_nontemp_relation-to-p.patch
- v4-0003-Change-how-a-base-backup-decides-which-files-have.patch
- v4-0004-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch
- v4-0005-Prototype-patch-for-incremental-and-differential-.patch
- v4-0002-Change-struct-tablespaceinfo-s-oid-member-from-ch.patch
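[Editor's note: the mtime-based retention policy Robert describes above can be sketched roughly as follows. This is a hypothetical standalone sketch: the function name, the flat directory scan, and the error handling are invented for illustration, and the real patch's removal logic lives in the summarizer machinery, not in a helper like this:]

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Remove any *.summary file whose mtime is older than keep_seconds.
 * Hypothetical sketch of the retention policy discussed above; note the
 * caveat from the thread: because this keys off mtime, touching the
 * files or moving the system clock changes what gets removed. */
static int
prune_wal_summaries(const char *summary_dir, time_t keep_seconds)
{
    DIR *dir = opendir(summary_dir);
    struct dirent *de;
    time_t cutoff = time(NULL) - keep_seconds;
    int removed = 0;

    if (dir == NULL)
        return -1;
    while ((de = readdir(dir)) != NULL)
    {
        char path[1024];
        struct stat st;
        size_t len = strlen(de->d_name);

        /* Only consider files with the ".summary" suffix. */
        if (len < 8 || strcmp(de->d_name + len - 8, ".summary") != 0)
            continue;
        snprintf(path, sizeof(path), "%s/%s", summary_dir, de->d_name);
        if (stat(path, &st) == 0 && st.st_mtime < cutoff)
        {
            unlink(path);
            removed++;
        }
    }
    closedir(dir);
    return removed;
}
```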
On Wed, Oct 4, 2023 at 4:08 PM Robert Haas <robertmhaas@gmail.com> wrote: > Clearly there's a good amount of stuff to sort out here, but we've > still got quite a bit of time left before feature freeze so I'd like > to have a go at it. Please let me know your thoughts, if you have any. Apparently, nobody has any thoughts, but here's an updated patch set anyway. The main change, other than rebasing, is that I did a bunch more documentation work on the main patch (0005). I'm much happier with it now, although I expect it may need more adjustments here and there as outstanding design questions get settled. After some thought, I think that it should be fine to commit 0001 and 0002 as independent refactoring patches, and I plan to go ahead and do that pretty soon unless somebody objects. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
- v5-0004-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch
- v5-0006-Add-new-pg_walsummary-tool.patch
- v5-0003-Change-how-a-base-backup-decides-which-files-have.patch
- v5-0002-Change-struct-tablespaceinfo-s-oid-member-from-ch.patch
- v5-0005-Prototype-patch-for-incremental-backup.patch
- v5-0001-Refactor-parse_filename_for_nontemp_relation-to-p.patch
On 10/19/23 12:05, Robert Haas wrote:
> On Wed, Oct 4, 2023 at 4:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
>> Clearly there's a good amount of stuff to sort out here, but we've
>> still got quite a bit of time left before feature freeze so I'd like
>> to have a go at it. Please let me know your thoughts, if you have any.
>
> Apparently, nobody has any thoughts, but here's an updated patch set
> anyway. The main change, other than rebasing, is that I did a bunch
> more documentation work on the main patch (0005). I'm much happier
> with it now, although I expect it may need more adjustments here and
> there as outstanding design questions get settled.
>
> After some thought, I think that it should be fine to commit 0001 and
> 0002 as independent refactoring patches, and I plan to go ahead and do
> that pretty soon unless somebody objects.

0001 looks pretty good to me. The only thing I find a little troublesome is the repeated construction of file names with/without segment numbers in ResetUnloggedRelationsInDbspaceDir(), e.g.:

+ if (segno == 0)
+ 	snprintf(dstpath, sizeof(dstpath), "%s/%u",
+ 			 dbspacedirname, relNumber);
+ else
+ 	snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
+ 			 dbspacedirname, relNumber, segno);

If this happened three times I'd definitely want a helper function, but even with two I think it would be a bit nicer.

0002 is definitely a good idea. FWIW pgBackRest does this conversion but also errors if it does not succeed. We have never seen a report of this error happening in the wild, so I think it must be pretty rare if it does happen.

Regards,
-David
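[Editor's note: for illustration, the helper David has in mind might look something like this. Hypothetical sketch: the function name and signature are invented here and are not part of the patch set:]

```c
#include <stdio.h>

/* Build the path for a relation file, with or without a segment number.
 * Hypothetical helper for illustration; the patch instead repeats the
 * two snprintf() calls inline, which Robert argues below is clearer. */
static void
unlogged_relation_path(char *buf, size_t bufsz, const char *dbspacedirname,
                       unsigned relNumber, unsigned segno)
{
    if (segno == 0)
        snprintf(buf, bufsz, "%s/%u", dbspacedirname, relNumber);
    else
        snprintf(buf, bufsz, "%s/%u.%u", dbspacedirname, relNumber, segno);
}
```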
On Thu, Oct 19, 2023 at 3:18 PM David Steele <david@pgmasters.net> wrote: > 0001 looks pretty good to me. The only thing I find a little troublesome > is the repeated construction of file names with/without segment numbers > in ResetUnloggedRelationsInDbspaceDir(), .e.g.: > > + if (segno == 0) > + snprintf(dstpath, sizeof(dstpath), "%s/%u", > + dbspacedirname, relNumber); > + else > + snprintf(dstpath, sizeof(dstpath), "%s/%u.%u", > + dbspacedirname, relNumber, segno); > > > If this happened three times I'd definitely want a helper function, but > even with two I think it would be a bit nicer. Personally I think that would make the code harder to read rather than easier. I agree that repeating code isn't great, but this is a relatively brief idiom and pretty self-explanatory. If other people agree with you I can change it, but to me it's not an improvement. > 0002 is definitely a good idea. FWIW pgBackRest does this conversion but > also errors if it does not succeed. We have never seen a report of this > error happening in the wild, so I think it must be pretty rare if it > does happen. Cool, but ... how about the main patch set? It's nice to get some of these refactoring bits and pieces out of the way, but if I spend the effort to work out what I think are the right answers to the remaining design questions for the main patch set and then find out after I've done all that that you have massive objections, I'm going to be annoyed. I've been trying to get this feature into PostgreSQL for years, and if I don't succeed this time, I want the reason to be something better than "well, I didn't find out that David disliked X until five minutes before I was planning to type 'git push'." I'm not really concerned about detailed bug-hunting in the main patches just yet. The time for that will come. 
But if you have views on how to resolve the design questions that I mentioned in a couple of emails back, or intend to advocate vigorously against the whole concept for some reason, let's try to sort that out sooner rather than later. -- Robert Haas EDB: http://www.enterprisedb.com
Hi Robert,

On Wed, Oct 4, 2023 at 10:09 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Oct 3, 2023 at 2:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Here's a new patch set, also addressing Jakub's observation that
> > MINIMUM_VERSION_FOR_WAL_SUMMARIES needed updating.
>
> Here's yet another new version.[..]

Okay, so more good news - related to patch version #4. A not-so-tiny stress test consisting of a pgbench run for 24h straight (with incremental backups every 2h, based on an initial full backup), followed by two PITRs (one not using incremental backup and one using it, to illustrate the performance point - and potentially spot any errors in between). In both cases it worked fine. Pgbench has the behaviour that it doesn't cause space growth over time - it produces lots of WAL instead.

Some stats:

START DBSIZE: ~3.3GB (pgbench -i -s 200 --partitions=8)
END DBSIZE: ~4.3GB
RUN DURATION: 24h (pgbench -P 1 -R 100 -T 86400)
WALARCHIVES-24h: 77GB
FULL-DB-BACKUP-SIZE: 3.4GB
INCREMENTAL-BACKUP-11-SIZE: 3.5GB

Env: Azure VM D4s (4 vCPU), Debian 11, gcc 10.2, normal build (asserts and debug disabled)

The increments were taken every 2h just to see if they would fail for any reason - they did not.

PITR RTO RESULTS (copy/pg_combinebackup time + recovery time):

1. time to restore from the full backup (+ recovery of 24h of WAL [77GB]): 53s + 4640s =~ 78min
2. time to restore from the full backup + the incremental backup from 2h ago (+ recovery of 2h of WAL [5.4GB]): 68s + 190s =~ 4min18s

I could probably pre-populate the DB with 1TB of cold data (not touched by pgbench at all), just for the sake of argument, and that could demonstrate how space-efficient the incremental backup can be, but most of the time would be wasted on copying the 1TB here...

> - I would like some feedback on the generation of WAL summary files.
> Right now, I have it enabled by default, and summaries are kept for a
> week.
> That means that, with no additional setup, you can take an
> incremental backup as long as the reference backup was taken in the
> last week.

I've just noticed one thing while recovery is in progress: is summarization working during recovery - in the background - an expected behaviour? I'm wondering about that, because after a freshly restored and recovered DB, one would need to create a *new* full backup, and only from that point would new summaries have any use? Sample log:

2023-10-20 11:10:02.288 UTC [64434] LOG: restored log file "000000010000000200000022" from archive
2023-10-20 11:10:02.599 UTC [64434] LOG: restored log file "000000010000000200000023" from archive
2023-10-20 11:10:02.769 UTC [64446] LOG: summarized WAL on TLI 1 from 2/139B1130 to 2/239B1518
2023-10-20 11:10:02.923 UTC [64434] LOG: restored log file "000000010000000200000024" from archive
2023-10-20 11:10:03.193 UTC [64434] LOG: restored log file "000000010000000200000025" from archive
2023-10-20 11:10:03.345 UTC [64432] LOG: restartpoint starting: wal
2023-10-20 11:10:03.407 UTC [64446] LOG: summarized WAL on TLI 1 from 2/239B1518 to 2/25B609D0
2023-10-20 11:10:03.521 UTC [64434] LOG: restored log file "000000010000000200000026" from archive
2023-10-20 11:10:04.429 UTC [64434] LOG: restored log file "000000010000000200000027" from archive

> - On a related note, I haven't yet tested this on a standby, which is
> a thing that I definitely need to do. I don't know of a reason why it
> shouldn't be possible for all of this machinery to work on a standby
> just as it does on a primary, but then we need the WAL summarizer to
> run there too, which could end up being a waste if nobody ever tries
> to take an incremental backup. I wonder how that should be reflected
> in the configuration. We could do something like what we've done for
> archive_mode, where on means "only on if this is a primary" and you
> have to say always if you want it to run on standbys as well ...
but > I'm not sure if that's a design pattern that we really want to > replicate into more places. I'd be somewhat inclined to just make > whatever configuration parameters we need to configure this thing on > the primary also work on standbys, and you can set each server up as > you please. But I'm open to other suggestions. I'll try to play with some standby restores in the future, stay tuned. Regards, -J.
On 10/19/23 16:00, Robert Haas wrote: > On Thu, Oct 19, 2023 at 3:18 PM David Steele <david@pgmasters.net> wrote: >> 0001 looks pretty good to me. The only thing I find a little troublesome >> is the repeated construction of file names with/without segment numbers >> in ResetUnloggedRelationsInDbspaceDir(), .e.g.: >> >> + if (segno == 0) >> + snprintf(dstpath, sizeof(dstpath), "%s/%u", >> + dbspacedirname, relNumber); >> + else >> + snprintf(dstpath, sizeof(dstpath), "%s/%u.%u", >> + dbspacedirname, relNumber, segno); >> >> >> If this happened three times I'd definitely want a helper function, but >> even with two I think it would be a bit nicer. > > Personally I think that would make the code harder to read rather than > easier. I agree that repeating code isn't great, but this is a > relatively brief idiom and pretty self-explanatory. If other people > agree with you I can change it, but to me it's not an improvement. Then I'm fine with it as is. >> 0002 is definitely a good idea. FWIW pgBackRest does this conversion but >> also errors if it does not succeed. We have never seen a report of this >> error happening in the wild, so I think it must be pretty rare if it >> does happen. > > Cool, but ... how about the main patch set? It's nice to get some of > these refactoring bits and pieces out of the way, but if I spend the > effort to work out what I think are the right answers to the remaining > design questions for the main patch set and then find out after I've > done all that that you have massive objections, I'm going to be > annoyed. I've been trying to get this feature into PostgreSQL for > years, and if I don't succeed this time, I want the reason to be > something better than "well, I didn't find out that David disliked X > until five minutes before I was planning to type 'git push'." I simply have not had time to look at the main patch set in any detail. > I'm not really concerned about detailed bug-hunting in the main > patches just yet. 
The time for that will come. But if you have views > on how to resolve the design questions that I mentioned in a couple of > emails back, or intend to advocate vigorously against the whole > concept for some reason, let's try to sort that out sooner rather than > later. In my view this feature puts the cart way before the horse. I'd think higher-priority features might be parallelism, a backup repository, expiration management, archiving, or maybe even a restore command. It seems the only goal here is to make pg_basebackup a tool for external backup software to use, which might be OK, but I don't believe this feature really advances pg_basebackup as a usable piece of stand-alone software. If people really think that start/stop backup is too complicated an interface, how are they supposed to track page incrementals and get them to a place where pg_combinebackup can put them back together? If automation is required to use this feature, shouldn't pg_basebackup implement that automation? I have plenty of thoughts about the implementation as well, but I have a lot on my plate right now and I don't have time to get into it. I don't plan to stand in your way on this feature. I'm reviewing what patches I can out of courtesy and to be sure that nothing adjacent to your work is being affected. My apologies if my reviews are not meeting your expectations, but I am contributing as my time constraints allow. Regards, -David
On Fri, Oct 20, 2023 at 11:30 AM David Steele <david@pgmasters.net> wrote: > Then I'm fine with it as is. OK, thanks. > In my view this feature puts the cart way before the horse. I'd think > higher priority features might be parallelism, a backup repository, > expiration management, archiving, or maybe even a restore command. > > It seems the only goal here is to make pg_basebackup a tool for external > backup software to use, which might be OK, but I don't believe this > feature really advances pg_basebackup as a usable piece of stand-alone > software. If people really think that start/stop backup is too > complicated an interface how are they supposed to track page > incrementals and get them to a place where pg_combinebackup can put them > backup together? If automation is required to use this feature, > shouldn't pg_basebackup implement that automation? > > I have plenty of thoughts about the implementation as well, but I have a > lot on my plate right now and I don't have time to get into it. > > I don't plan to stand in your way on this feature. I'm reviewing what > patches I can out of courtesy and to be sure that nothing adjacent to > your work is being affected. My apologies if my reviews are not meeting > your expectations, but I am contributing as my time constraints allow. Sorry, I realize reading this response that I probably didn't do a very good job writing that email and came across sounding like a jerk. Possibly, I actually am a jerk. Whether it just sounded like it or I actually am, I apologize. But your last paragraph here gets at my real question, which is whether you were going to try to block the feature. I recognize that we have different priorities when it comes to what would make most sense to implement in PostgreSQL, and that's OK, or at least, it's OK with me. I also don't have any particular expectation about how much you should review the patch or in what level of detail, and I have sincerely appreciated your feedback thus far. 
If you are able to continue to provide more, that's great, and if that's not, well, you're not obligated. What I was concerned about was whether that review was a precursor to a vigorous attempt to keep the main patch from getting committed, because if that was going to be the case, then I'd like to surface that conflict sooner rather than later. It sounds like that's not an issue, which is great. At the risk of drifting into the fraught question of what I *should* be implementing rather than the hopefully-less-fraught question of whether what I am actually implementing is any good, I see incremental backup as a way of removing some of the use cases for the low-level backup API. If you said "but people still will have lots of reasons to use it," I would agree; and if you said "people can still screw things up with pg_basebackup," I would also agree. Nonetheless, most of the disasters I've personally seen have stemmed from the use of the low-level API rather than from the use of pg_basebackup, though there are exceptions. I also think a lot of the use of the low-level API is driven by it being just too darn slow to copy the whole database, and incremental backup can help with that in some circumstances. Also, I have worked fairly hard to try to make sure that if you misuse pg_combinebackup, or fail to use it altogether, you'll get an error rather than silent data corruption. I would be interested to hear about scenarios where the checks that I've implemented can be defeated by something that is plausibly described as stupidity rather than malice. I'm not sure we can fix all such cases, but I'm very alert to the horror that will befall me if user error looks exactly like a bug in the code. For my own sanity, we have to be able to distinguish those cases. 
Moreover, we also need to be able to distinguish backup-time bugs from reassembly-time bugs, which is why I've got the pg_walsummary tool, and why pg_combinebackup has the ability to emit fairly detailed debugging output. I anticipate those things being useful in investigating bug reports when they show up. I won't be too surprised if it turns out that more work on sanity-checking and/or debugging tools is needed, but I think your concern about people misusing stuff is bang on target and I really want to do whatever we can to avoid that when possible and detect it when it happens. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Oct 20, 2023 at 9:20 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > Okay, so another good news - related to the patch version #4. > Not-so-tiny stress test consisting of pgbench run for 24h straight > (with incremental backups every 2h, with base of initial full backup), > followed by two PITRs (one not using incremental backup and one using > to to illustrate the performance point - and potentially spot any > errors in between). In both cases it worked fine. This is great testing, thanks. What might be even better is to test whether the resulting backups are correct, somehow. > I've just noticed one thing when recovery is progress: is > summarization working during recovery - in the background - an > expected behaviour? I'm wondering about that, because after freshly > restored and recovered DB, one would need to create a *new* full > backup and only from that point new summaries would have any use? Actually, I think you could take an incremental backup relative to a full backup from a previous timeline. But the question of what summarization ought to do (or not do) during recovery, and whether it ought to be enabled by default, and what the retention policy ought to be are very much live ones. Right now, it's enabled by default and keeps summaries for a week, assuming you don't reset your local clock and that it advances at the same speed as the universe's own clock. But that's all debatable. Any views? Meanwhile, here's a new patch set. I went ahead and committed the first two preparatory patches, as I said earlier that I intended to do. And here I've adjusted the main patch, which is now 0003, for the addition of XLOG_CHECKPOINT_REDO, which permitted me to simplify a few things. wal_summarize_mb now feels like a bit of a silly GUC -- presumably you'd never care, unless you had an absolutely gigantic inter-checkpoint WAL distance. And if you have that, maybe you should also have enough memory to summarize all that WAL. 
Or maybe not: perhaps it's better to write WAL summaries more than once per checkpoint when checkpoints are really big. But I'm worried that the GUC will become a source of needless confusion for users. For most people, it seems like emitting one summary per checkpoint should be totally fine, and they might prefer a simple Boolean GUC, summarize_wal = true | false, over this. I'm just not quite sure about the corner cases. -- Robert Haas EDB: http://www.enterprisedb.com
On 10/23/23 11:44, Robert Haas wrote: > On Fri, Oct 20, 2023 at 11:30 AM David Steele <david@pgmasters.net> wrote: >> >> I don't plan to stand in your way on this feature. I'm reviewing what >> patches I can out of courtesy and to be sure that nothing adjacent to >> your work is being affected. My apologies if my reviews are not meeting >> your expectations, but I am contributing as my time constraints allow. > > Sorry, I realize reading this response that I probably didn't do a > very good job writing that email and came across sounding like a jerk. > Possibly, I actually am a jerk. Whether it just sounded like it or I > actually am, I apologize. That was the way it came across, though I prefer to think it was unintentional. I certainly understand how frustrating dealing with a large and uncertain patch can be. Either way, I appreciate the apology. Now onward... > But your last paragraph here gets at my real > question, which is whether you were going to try to block the feature. > I recognize that we have different priorities when it comes to what > would make most sense to implement in PostgreSQL, and that's OK, or at > least, it's OK with me. This seems perfectly natural to me. > I also don't have any particular expectation > about how much you should review the patch or in what level of detail, > and I have sincerely appreciated your feedback thus far. If you are > able to continue to provide more, that's great, and if that's not, > well, you're not obligated. What I was concerned about was whether > that review was a precursor to a vigorous attempt to keep the main > patch from getting committed, because if that was going to be the > case, then I'd like to surface that conflict sooner rather than later. > It sounds like that's not an issue, which is great. Overall I would say I'm not strongly for or against the patch.
I think it will be very difficult to use in a manual fashion, but automation is the way to go in general, so that's not necessarily an argument against it. However, this is an area of great interest to me so I do want to at least make sure nothing is being impacted adjacent to the main goal of this patch. So far I have seen no sign of that, but that has been a primary goal of my reviews. > At the risk of drifting into the fraught question of what I *should* > be implementing rather than the hopefully-less-fraught question of > whether what I am actually implementing is any good, I see incremental > backup as a way of removing some of the use cases for the low-level > backup API. If you said "but people still will have lots of reasons to > use it," I would agree; and if you said "people can still screw things > up with pg_basebackup," I would also agree. Nonetheless, most of the > disasters I've personally seen have stemmed from the use of the > low-level API rather than from the use of pg_basebackup, though there > are exceptions. This all makes sense to me. > I also think a lot of the use of the low-level API is > driven by it being just too darn slow to copy the whole database, and > incremental backup can help with that in some circumstances. I would argue that restore performance is *more* important than backup performance and this patch is a step backward in that regard. Backups will be faster and less space will be used in the repository, but restore performance is going to suffer. If the deltas are very small the difference will probably be negligible, but as the deltas get large (and especially if there are a lot of them) the penalty will be more noticeable. > Also, I > have worked fairly hard to try to make sure that if you misuse > pg_combinebackup, or fail to use it altogether, you'll get an error > rather than silent data corruption.
I would be interested to hear > about scenarios where the checks that I've implemented can be defeated > by something that is plausibly described as stupidity rather than > malice. I'm not sure we can fix all such cases, but I'm very alert to > the horror that will befall me if user error looks exactly like a bug > in the code. For my own sanity, we have to be able to distinguish > those cases. I was concerned with the difficulty of trying to stage the correct backups for pg_combinebackup, not whether it would recognize that the needed data was not available and then error appropriately. The latter is surmountable within pg_combinebackup but the former is left up to the user. > Moreover, we also need to be able to distinguish > backup-time bugs from reassembly-time bugs, which is why I've got the > pg_walsummary tool, and why pg_combinebackup has the ability to emit > fairly detailed debugging output. I anticipate those things being > useful in investigating bug reports when they show up. I won't be too > surprised if it turns out that more work on sanity-checking and/or > debugging tools is needed, but I think your concern about people > misusing stuff is bang on target and I really want to do whatever we > can to avoid that when possible and detect it when it happens. The ability of users to misuse tools is, of course, legendary, so that all sounds good to me. One note regarding the patches. I feel like v5-0005-Prototype-patch-for-incremental-backup should be split to have the WAL summarizer as one patch and the changes to base backup as a separate patch. It might not be useful to commit one without the other, but it would make for an easier read. Just my 2c. Regards, -David
On Mon, Oct 23, 2023 at 7:56 PM David Steele <david@pgmasters.net> wrote: > > I also think a lot of the use of the low-level API is > > driven by it being just too darn slow to copy the whole database, and > > incremental backup can help with that in some circumstances. > > I would argue that restore performance is *more* important than backup > performance and this patch is a step backward in that regard. Backups > will be faster and less space will be used in the repository, but > restore performance is going to suffer. If the deltas are very small the > difference will probably be negligible, but as the deltas get large (and > especially if there are a lot of them) the penalty will be more noticeable. I think an awful lot depends here on whether the repository is local or remote. If you have filesystem access to wherever the backups are stored anyway, I don't think that using pg_combinebackup to write out a new data directory is going to be much slower than copying one data directory from the repository to wherever you'd actually use the backup. It may be somewhat slower because we do need to access some data in every involved backup, but I don't think it should be vastly slower because we don't have to read every backup in its entirety. For each file, we read the (small) header of the newest incremental file and every incremental file that precedes it until we find a full file. Then, we construct a map of which blocks need to be read from which sources and read only the required blocks from each source. If all the blocks are coming from a single file (because there are no incrementals for a certain file, or they contain no blocks) then we just copy the entire source file in one shot, which can be optimized using the same tricks we use elsewhere. Inevitably, this is going to read more data and do more random I/O than just a flat copy of a directory, but it's not terrible.
The overall amount of I/O should be a lot closer to the size of the output directory than to the sum of the sizes of the input directories. Now, if the repository is remote, and you have to download all of those backups first, and then run pg_combinebackup on them afterward, that is going to be unpleasant, unless the incremental backups are all quite small. Possibly this could be addressed by teaching pg_combinebackup to do things like accessing data over HTTP and SSH, and relatedly, looking inside tarfiles without needing them unpacked. For now, I've left those as ideas for future improvement, but I think potentially they could address some of your concerns here. A difficulty is that there are a lot of protocols that people might want to use to push bytes around, and it might be hard to keep up with the march of progress. I do agree, though, that there's no such thing as a free lunch. I wouldn't recommend to anyone that they plan to restore from a chain of 100 incremental backups. Not only might it be slow, but the opportunities for something to go wrong are magnified. Even if you've automated everything well enough that there's no human error involved, what if you've got a corrupted file somewhere? Maybe that's not likely in absolute terms, but the more files you've got, the more likely it becomes. What I'd suggest someone do instead is periodically do pg_combinebackup full_reference_backup oldest_incremental -o new_full_reference_backup; rm -rf full_reference_backup; mv new_full_reference_backup full_reference_backup. The new full reference backup is intended to still be usable for restoring incrementals based on the incremental it replaced. I hope that, if people use the feature well, this should limit the need for really long backup chains. I am sure, though, that some people will use it poorly. Maybe there's room for more documentation on this topic. 
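[Editor's note: the newest-wins reconstruction Robert describes a couple of paragraphs above can be sketched in a few lines of Python. The names and in-memory structures here are illustrative assumptions, not the actual pg_combinebackup code, which works block-by-block on files.]

```python
# Sketch of the reconstruction strategy described above: walk the backups
# newest-to-oldest so the newest source seen for each block wins, with the
# full backup supplying whatever no incremental covers.

def build_block_map(full_blocks, incrementals):
    """full_blocks: dict of block_no -> data for the full backup's copy of a file.
    incrementals: list of dicts (oldest to newest), each mapping the block
    numbers present in that incremental file to their data."""
    source_map = {}
    # The newest incremental wins for any block it contains.
    for incr in reversed(incrementals):
        for block_no, data in incr.items():
            source_map.setdefault(block_no, data)
    # Any block not covered by an incremental comes from the full backup.
    for block_no, data in full_blocks.items():
        source_map.setdefault(block_no, data)
    return source_map
```

A real implementation would then batch the reads per source file, which is where the "copy the entire source file in one shot" optimization comes from when one source supplies every block.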
> I was concerned with the difficulty of trying to stage the correct > backups for pg_combinebackup, not whether it would recognize that the > needed data was not available and then error appropriately. The latter > is surmountable within pg_combinebackup but the former is left up to the > user. Indeed. > One note regarding the patches. I feel like > v5-0005-Prototype-patch-for-incremental-backup should be split to have > the WAL summarizer as one patch and the changes to base backup as a > separate patch. > > It might not be useful to commit one without the other, but it would > make for an easier read. Just my 2c. Yeah, maybe so. I'm not quite ready to commit to doing that split as of this writing but I will think about it and possibly do it. -- Robert Haas EDB: http://www.enterprisedb.com
On 04.10.23 22:08, Robert Haas wrote: > - I would like some feedback on the generation of WAL summary files. > Right now, I have it enabled by default, and summaries are kept for a > week. That means that, with no additional setup, you can take an > incremental backup as long as the reference backup was taken in the > last week. File removal is governed by mtimes, so if you change the > mtimes of your summary files or whack your system clock around, weird > things might happen. But obviously this might be inconvenient. Some > people might not want WAL summary files to be generated at all because > they don't care about incremental backup, and other people might want > them retained for longer, and still other people might want them to be > not removed automatically or removed automatically based on some > criteria other than mtime. I don't really know what's best here. I > don't think the default policy that the patches implement is > especially terrible, but it's just something that I made up and I > don't have any real confidence that it's wonderful. The easiest answer is to have it off by default. Let people figure out what works for them. There are various factors like storage, network, server performance, RTO that will determine what combination of full backup, incremental backup, and WAL replay will satisfy someone's requirements. I suppose tests could be set up to determine this to some degree. But we could also start slow and let people figure it out themselves. When pg_basebackup was added, it was also disabled by default. If we think that 7d is a good setting, then I would suggest to consider, like 10d. Otherwise, if you do a weekly incremental backup and you have a time change or a hiccup of some kind one day, you lose your backup sequence. Another possible answer is, like, 400 days? Because why not? What is a reasonable upper limit for this? 
> - It's regrettable that we don't have incremental JSON parsing; I > think that means anyone who has a backup manifest that is bigger than > 1GB can't use this feature. However, that's also a problem for the > existing backup manifest feature, and as far as I can see, we have no > complaints about it. So maybe people just don't have databases with > enough relations for that to be much of a live issue yet. I'm inclined > to treat this as a non-blocker, It looks like each file entry in the manifest takes about 150 bytes, so 1 GB would allow for 1024**3/150 = 7158278 files. That seems fine for now? > - Right now, I have a hard-coded 60 second timeout for WAL > summarization. If you try to take an incremental backup and the WAL > summaries you need don't show up within 60 seconds, the backup times > out. I think that's a reasonable default, but should it be > configurable? If yes, should that be a GUC or, perhaps better, a > pg_basebackup option? The current user experience of pg_basebackup is that it waits possibly a long time for a checkpoint, and there is an option to make it go faster, but there is no timeout AFAICT. Is this substantially different? Could we just let it wait forever? Also, does waiting for checkpoint and WAL summarization happen in parallel? If so, what if it starts a checkpoint that might take 15 min to complete, and then after 60 seconds it kicks you off because the WAL summarization isn't ready. That might be wasteful. > - I'm curious what people think about the pg_walsummary tool that is > included in 0006. I think it's going to be fairly important for > debugging, but it does feel a little bit bad to add a new binary for > something pretty niche. This seems fine. Is the WAL summary file format documented anywhere in your patch set yet? My only thought was, maybe the file format could be human-readable (more like backup_label) to avoid this. But maybe not.
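[Editor's note: Peter's back-of-envelope estimate above is easy to verify.]

```python
# Check of the manifest-capacity estimate quoted above: at roughly 150
# bytes per file entry, a 1 GB manifest holds about 7.15 million entries.
bytes_per_entry = 150          # approximate size of one manifest file entry
manifest_limit = 1024 ** 3     # 1 GB ceiling imposed by non-incremental parsing
max_files = manifest_limit // bytes_per_entry
print(max_files)  # 7158278
```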
On Tue, Oct 24, 2023 at 10:53 AM Peter Eisentraut <peter@eisentraut.org> wrote: > The easiest answer is to have it off by default. Let people figure out > what works for them. There are various factors like storage, network, > server performance, RTO that will determine what combination of full > backup, incremental backup, and WAL replay will satisfy someone's > requirements. I suppose tests could be set up to determine this to some > degree. But we could also start slow and let people figure it out > themselves. When pg_basebackup was added, it was also disabled by default. > > If we think that 7d is a good setting, then I would suggest to consider, > like 10d. Otherwise, if you do a weekly incremental backup and you have > a time change or a hiccup of some kind one day, you lose your backup > sequence. > > Another possible answer is, like, 400 days? Because why not? What is a > reasonable upper limit for this? In concept, I don't think this should even be time-based. What you should do is remove WAL summaries once you know that you've taken as many incremental backups that might use them as you're ever going to do. But PostgreSQL itself doesn't have any way of knowing what your intended backup patterns are. If your incremental backup fails on Monday night and you run it manually on Tuesday morning, you might still rerun it as an incremental backup. If it fails every night for a month and you finally realize and decide to intervene manually, maybe you want a new full backup at that point. It's been a month. But on the other hand maybe you don't. There's no time-based answer to this question that is really correct, and I think it's quite possible that your backup software might want to shut off time-based deletion altogether and make its own decisions about when to nuke summaries. However, I also don't think that's a great default setting. It could easily lead to people wasting a bunch of disk space for no reason. 
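[Editor's note: the mtime-based removal being debated amounts to something like the following sketch. The function name, directory layout, and parameter are assumptions for illustration, not the patch's actual code.]

```python
import os
import time

def prune_wal_summaries(summary_dir, retention_days=7.0):
    """Remove WAL summary files whose mtime falls outside the retention
    window. As the thread notes, this is only as trustworthy as the system
    clock and the files' mtimes."""
    cutoff = time.time() - retention_days * 86400
    removed = []
    for name in os.listdir(summary_dir):
        if not name.endswith(".summary"):
            continue
        path = os.path.join(summary_dir, name)
        if os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(name)
    return removed
```

This makes the failure modes discussed above concrete: touch a summary's mtime or move the clock backward and a file survives (or dies) for the wrong reason, which is why a purely time-based policy can never exactly match "delete once no future incremental will need it."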
As for the 7d value, I figured that nightly incremental backups would be fairly common. If we think weekly incremental backups would be common, then pushing this out to 10d would make sense. While there's no reason you couldn't take an annual incremental backup, and thus want a 400d value, it seems like a pretty niche use case. Note that whether to remove summaries is a separate question from whether to generate them in the first place. Right now, I have wal_summarize_mb controlling whether they get generated in the first place, but as I noted in another recent email, that isn't an entirely satisfying solution. > It looks like each file entry in the manifest takes about 150 bytes, so > 1 GB would allow for 1024**3/150 = 7158278 files. That seems fine for now? I suspect a few people have more files than that. They'll just have to wait to use this feature until we get incremental JSON parsing (or undo the decision to use JSON for the manifest). > The current user experience of pg_basebackup is that it waits possibly a > long time for a checkpoint, and there is an option to make it go faster, > but there is no timeout AFAICT. Is this substantially different? Could > we just let it wait forever? We could. I installed the timeout because the first versions of the feature were buggy, and I didn't like having my tests hang forever with no indication of what had gone wrong. At least in my experience so far, the time spent waiting for WAL summarization is typically quite short, because the only WAL that needs to be summarized is whatever was emitted since the last time it woke up, up through the start LSN of the backup. That's probably not much, and the next time the summarizer wakes up, the file should appear within moments. So, it's a little different from the checkpoint case, where long waits are expected. > Also, does waiting for checkpoint and WAL summarization happen in > parallel?
If so, what if it starts a checkpoint that might take 15 min > to complete, and then after 60 seconds it kicks you off because the WAL > summarization isn't ready. That might be wasteful. It is not parallel. The trouble is, we don't really have any way to know whether WAL summarization is going to fail for whatever reason. We don't expect that to happen, but if somebody changes the permissions on the WAL summary directory or attaches gdb to the WAL summarizer process or something of that sort, it might. We could check at the outset whether we seem to be really far behind and emit a warning. For instance, if we're 1TB behind on WAL summarization when the checkpoint is requested, chances are something is busted and we're probably not going to catch up any time soon. We could warn the user about that and let them make their own decision about whether to cancel. But, that idea won't help in unattended operation, and the threshold for "really far behind" is not very clear. It might be better to wait until we get more experience with how things actually fail before doing too much engineering here, but I'm also open to suggestions. > Is the WAL summary file format documented anywhere in your patch set > yet? My only thought was, maybe the file format could be human-readable > (more like backup_label) to avoid this. But maybe not. The comment in blkreftable.c just above "#define BLOCKS_PER_CHUNK" gives an overview of the format. I think that we probably don't want to convert to a text format, because this format is extremely space-efficient and very convenient to transfer between disk and memory. We don't want to run out of memory when summarizing large ranges of WAL, or when taking an incremental backup that requires combining many individual summaries into a combined summary that tells us what needs to be included in the backup. -- Robert Haas EDB: http://www.enterprisedb.com
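[Editor's note: to make the space argument concrete, here is a toy chunked encoding in the spirit of, but not identical to, what the blkreftable.c comment describes. The chunk size, two-byte offsets, and function name are illustrative assumptions: the idea is that each fixed-size chunk of block numbers is stored either as a short offset array (sparse) or as a bitmap (dense), whichever is smaller.]

```python
# Toy array-vs-bitmap chunk encoding illustrating why a compact binary
# block-reference format stays small regardless of modification density.

BLOCKS_PER_CHUNK = 1 << 16  # chunk granularity; an assumption for this sketch

def encode_chunk(block_offsets):
    """block_offsets: sorted offsets (0 .. BLOCKS_PER_CHUNK-1) of modified
    blocks within one chunk. Returns ('array', offsets) when a plain offset
    list is smaller, else ('bitmap', bytes) with one bit per block."""
    bitmap_bytes = BLOCKS_PER_CHUNK // 8          # 8 KB worst case per chunk
    array_bytes = 2 * len(block_offsets)          # 2 bytes per stored offset
    if array_bytes <= bitmap_bytes:
        return ("array", list(block_offsets))
    bitmap = bytearray(bitmap_bytes)
    for off in block_offsets:
        bitmap[off // 8] |= 1 << (off % 8)
    return ("bitmap", bytes(bitmap))
```

Either representation converts cheaply between disk and memory, which is the property Robert cites for combining many summaries without running out of memory; a human-readable format would give up both the size bound and that cheap conversion.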
On 2023-10-24 Tu 12:08, Robert Haas wrote: > >> It looks like each file entry in the manifest takes about 150 bytes, so >> 1 GB would allow for 1024**3/150 = 7158278 files. That seems fine for now? > I suspect a few people have more files than that. They'll just have to > wait to use this feature until we get incremental JSON parsing (or > undo the decision to use JSON for the manifest). Robert asked me to work on this quite some time ago, and most of this work was done last year. Here's my WIP for an incremental JSON parser. It works and passes all the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book. The reason I haven't posted it before is that it's about 50% slower in pure parsing speed than the current recursive descent parser in my testing. I've tried various things to make it faster, but haven't made much impact. One of my colleagues is going to take a fresh look at it, but maybe someone on the list can see where we can save some cycles. If we can't make it faster, I guess we could use the RD parser for non-incremental cases and only use the non-RD parser for incremental, although that would be a bit sad. However, I don't think we can make the RD parser suitable for incremental parsing - there's too much state involved in the call stack. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Wed, Oct 25, 2023 at 7:54 AM Andrew Dunstan <andrew@dunslane.net> wrote: > Robert asked me to work on this quite some time ago, and most of this > work was done last year. > > Here's my WIP for an incremental JSON parser. It works and passes all > the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book. > The reason I haven't posted it before is that it's about 50% slower in > pure parsing speed than the current recursive descent parser in my > testing. I've tried various things to make it faster, but haven't made > much impact. One of my colleagues is going to take a fresh look at it, > but maybe someone on the list can see where we can save some cycles. > > If we can't make it faster, I guess we could use the RD parser for > non-incremental cases and only use the non-RD parser for incremental, > although that would be a bit sad. However, I don't think we can make the > RD parser suitable for incremental parsing - there's too much state > involved in the call stack. Yeah, this is exactly why I didn't want to use JSON for the backup manifest in the first place. Parsing such a manifest incrementally is complicated. If we'd gone with my original design where the manifest consisted of a bunch of lines each of which could be parsed separately, we'd already have incremental parsing and wouldn't be faced with these difficult trade-offs. Unfortunately, I'm not in a good position either to figure out how to make your prototype faster, or to evaluate how painful it is to keep both in the source tree. It's probably worth considering how likely it is that we'd be interested in incremental JSON parsing in other cases. Maintaining two JSON parsers is probably not a lot of fun regardless, but if each of them gets used for a bunch of things, that feels less bad than if one of them gets used for a bunch of things and the other one only ever gets used for backup manifests. Would we be interested in JSON-format database dumps? Incrementally parsing JSON LOBs? 
Either seems tenuous, but those are examples of the kind of thing that could make us happy to have incremental JSON parsing as a general facility. If nobody's very excited by those kinds of use cases, then this just boils down to whether we want to (a) accept that users with very large numbers of relation files won't be able to use pg_verifybackup or incremental backup, (b) accept that we're going to maintain a second JSON parser just to enable that use case and with no other benefit, or (c) undertake to change the manifest format to something that is straightforward to parse incrementally. I think (a) is reasonable short term, but at some point I think we should do better. I'm not really that enthused about (c) because it means more work for me and possibly more arguing, but if (b) is going to cause a lot of hassle then we might need to consider it. -- Robert Haas EDB: http://www.enterprisedb.com
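As an illustration of the line-per-entry alternative described above: if each manifest entry were one line, the only parser state that survives between reads is an incomplete trailing line, so incremental parsing becomes trivial. This is a minimal sketch assuming a hypothetical one-JSON-object-per-line format, not the actual backup_manifest contents:

```python
import json

def incremental_lines(chunks):
    """Yield complete lines from an iterable of arbitrary byte chunks.

    With a line-per-entry format, each complete line can be parsed
    independently; the buffer of one partial line is the only state.
    """
    buf = b""
    for chunk in chunks:
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            yield line
    if buf:
        yield buf

# Hypothetical manifest arriving in arbitrary-sized network chunks.
chunks = [b'{"path": "base/1/12', b'34", "size": 8192}\n{"pa',
          b'th": "base/1/5678", "size": 16384}\n']
entries = [json.loads(line) for line in incremental_lines(chunks)]
```

Note that chunk boundaries can fall anywhere, even mid-entry, without any grammar-level parser state being carried across calls.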
On 2023-10-25 We 09:05, Robert Haas wrote: > On Wed, Oct 25, 2023 at 7:54 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> Robert asked me to work on this quite some time ago, and most of this >> work was done last year. >> >> Here's my WIP for an incremental JSON parser. It works and passes all >> the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book. >> The reason I haven't posted it before is that it's about 50% slower in >> pure parsing speed than the current recursive descent parser in my >> testing. I've tried various things to make it faster, but haven't made >> much impact. One of my colleagues is going to take a fresh look at it, >> but maybe someone on the list can see where we can save some cycles. >> >> If we can't make it faster, I guess we could use the RD parser for >> non-incremental cases and only use the non-RD parser for incremental, >> although that would be a bit sad. However, I don't think we can make the >> RD parser suitable for incremental parsing - there's too much state >> involved in the call stack. > Yeah, this is exactly why I didn't want to use JSON for the backup > manifest in the first place. Parsing such a manifest incrementally is > complicated. If we'd gone with my original design where the manifest > consisted of a bunch of lines each of which could be parsed > separately, we'd already have incremental parsing and wouldn't be > faced with these difficult trade-offs. > > Unfortunately, I'm not in a good position either to figure out how to > make your prototype faster, or to evaluate how painful it is to keep > both in the source tree. It's probably worth considering how likely it > is that we'd be interested in incremental JSON parsing in other cases. > Maintaining two JSON parsers is probably not a lot of fun regardless, > but if each of them gets used for a bunch of things, that feels less > bad than if one of them gets used for a bunch of things and the other > one only ever gets used for backup manifests. 
Would we be interested > in JSON-format database dumps? Incrementally parsing JSON LOBs? Either > seems tenuous, but those are examples of the kind of thing that could > make us happy to have incremental JSON parsing as a general facility. > > If nobody's very excited by those kinds of use cases, then this just > boils down to whether we want to (a) accept that users with very large > numbers of relation files won't be able to use pg_verifybackup or > incremental backup, (b) accept that we're going to maintain a second > JSON parser just to enable that use cas and with no other benefit, or > (c) undertake to change the manifest format to something that is > straightforward to parse incrementally. I think (a) is reasonable > short term, but at some point I think we should do better. I'm not > really that enthused about (c) because it means more work for me and > possibly more arguing, but if (b) is going to cause a lot of hassle > then we might need to consider it. I'm not too worried about the maintenance burden. The RD routines were added in March 2013 (commit a570c98d7fa) and have hardly changed since then. The new code is not ground-breaking - it's just a different (and fairly well known) way of doing the same thing. I'd be happier if we could make it faster, but maybe it's just a fact that keeping an explicit stack, which is how this works, is slower. I wouldn't at all be surprised if there were other good uses for incremental JSON parsing, including some you've identified. That said, I agree that JSON might not be the best format for backup manifests, but maybe that ship has sailed. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
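The call-stack-versus-explicit-stack distinction can be shown with a toy sketch. This is not the proposed parser, just an illustration of why keeping all state in an object (rather than on the call stack, as a recursive descent parser does) lets parsing suspend and resume between input chunks. It checks only bracket nesting, not the full JSON grammar:

```python
class IncrementalNestingChecker:
    """Toy resumable checker: every bit of parser state lives in this
    object, so feed() can return between chunks and be called again.
    A recursive descent parser holds equivalent state in stack frames
    and cannot suspend mid-document."""

    def __init__(self):
        self.stack = []        # explicit stack of open brackets
        self.in_string = False
        self.escaped = False

    def feed(self, chunk):
        for ch in chunk:
            if self.in_string:
                if self.escaped:
                    self.escaped = False
                elif ch == "\\":
                    self.escaped = True
                elif ch == '"':
                    self.in_string = False
            elif ch == '"':
                self.in_string = True
            elif ch in "{[":
                self.stack.append(ch)
            elif ch in "}]":
                opener = self.stack.pop()
                if "{[".index(opener) != "}]".index(ch):
                    raise ValueError("mismatched nesting")

    def done(self):
        return not self.stack and not self.in_string

checker = IncrementalNestingChecker()
for chunk in ['{"files": [{"path": ', '"x"}, {"path": "y"}]}']:
    checker.feed(chunk)
```

The cost Andrew mentions is visible even here: every character goes through an explicit state dispatch instead of straight-line recursive code.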
On Wed, Oct 25, 2023 at 10:33 AM Andrew Dunstan <andrew@dunslane.net> wrote: > I'm not too worried about the maintenance burden. > > That said, I agree that JSON might not be the best format for backup > manifests, but maybe that ship has sailed. I think it's a decision we could walk back if we had a good enough reason, but it would be nicer if we didn't have to, because what we have right now is working. If we change it for no real reason, we might introduce new bugs, and at least in theory, incompatibility with third-party tools that parse the existing format. If you think we can live with the additional complexity in the JSON parsing stuff, I'd rather go that way. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Oct 24, 2023 at 8:29 AM Robert Haas <robertmhaas@gmail.com> wrote: > Yeah, maybe so. I'm not quite ready to commit to doing that split as > of this writing but I will think about it and possibly do it. I have done this. Here's v7. This version also includes several new TAP tests for the main patch, some of which were inspired by our discussion. It also includes SGML documentation for pg_walsummary. New tests: 003_timeline.pl tests the case where the prior backup for an incremental backup was taken on an earlier timeline. 004_manifest.pl tests the manifest-related options for pg_combinebackup. 005_integrity.pl tests the sanity checks that prevent combining a backup with the wrong prior backup. Overview of the new organization of the patch set: 0001 - preparatory refactoring of basebackup.c, changing the algorithm that we use to decide which files have checksums 0002 - code movement only. makes it possible to reuse parse_manifest.c 0003 - add the WAL summarizer process, but useless on its own 0004 - add incremental backup, making use of 0003 0005 - add pg_walsummary debugging tool Notes: - I suspect that 0003 is the most likely to have serious bugs, followed by 0004. - See XXX comments in the commit messages for some known open issues. - Still looking for more comments on http://postgr.es/m/CA+TgmoYdPS7a4eiqAFCZ8dr4r3-O0zq1LvTO5drwWr+7wHQaSQ@mail.gmail.com and other recent emails where design questions came up -- Robert Haas EDB: http://www.enterprisedb.com
On 2023-10-25 We 11:24, Robert Haas wrote: > On Wed, Oct 25, 2023 at 10:33 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> I'm not too worried about the maintenance burden. >> >> That said, I agree that JSON might not be the best format for backup >> manifests, but maybe that ship has sailed. > I think it's a decision we could walk back if we had a good enough > reason, but it would be nicer if we didn't have to, because what we > have right now is working. If we change it for no real reason, we > might introduce new bugs, and at least in theory, incompatibility with > third-party tools that parse the existing format. If you think we can > live with the additional complexity in the JSON parsing stuff, I'd > rather go that way. > OK, I'll go with that. It will actually be a bit less invasive than the patch I posted. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Wed, Oct 25, 2023 at 3:17 PM Andrew Dunstan <andrew@dunslane.net> wrote: > OK, I'll go with that. It will actually be a bit less invasive than the > patch I posted. Why's that? -- Robert Haas EDB: http://www.enterprisedb.com
On 2023-10-25 We 15:19, Robert Haas wrote: > On Wed, Oct 25, 2023 at 3:17 PM Andrew Dunstan <andrew@dunslane.net> wrote: >> OK, I'll go with that. It will actually be a bit less invasive than the >> patch I posted. > Why's that? > Because we won't be removing the RD parser. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Thu, Oct 26, 2023 at 6:59 AM Andrew Dunstan <andrew@dunslane.net> wrote: > Because we won't be removing the RD parser. Ah, OK. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Oct 24, 2023 at 12:08 PM Robert Haas <robertmhaas@gmail.com> wrote: > Note that whether to remove summaries is a separate question from > whether to generate them in the first place. Right now, I have > wal_summarize_mb controlling whether they get generated in the first > place, but as I noted in another recent email, that isn't an entirely > satisfying solution. I did some more research on this. My conclusion is that I should remove wal_summarize_mb and just have a GUC summarize_wal = on|off that controls whether the summarizer runs at all. There will be one summary file per checkpoint, no matter how far apart checkpoints are or how large the summary gets. Below I'll explain the reasoning; let me know if you disagree. What I describe above would be a bad plan if it were realistically possible for a summary file to get so large that it might run the machine out of memory either when producing it or when trying to make use of it for an incremental backup. This seems to be a somewhat difficult scenario to create. So far, I haven't been able to generate WAL summary files more than a few tens of megabytes in size, even when summarizing 50+ GB of WAL per summary file. One reason why it's hard to produce large summary files is because, for a single relation fork, the WAL summary size converges to 1 bit per modified block when the number of modified blocks is large. This means that, even if you have a terabyte sized relation, you're looking at no more than perhaps 20MB of summary data no matter how much of it gets modified. Now, somebody could have a 30TB relation and then if they modify the whole thing they could have the better part of a gigabyte of summary data for that relation, but if you've got a 30TB table you probably have enough memory that that's no big deal. But, what if you have multiple relations? I initialized pgbench with a scale factor of 30000 and also with 30000 partitions and did a 1-hour run. 
I got 4 checkpoints during that time and each one produced an approximately 16MB summary file. The efficiency here drops considerably. For example, one of the files is 16495398 bytes and records information on 7498403 modified blocks, which works out to about 2.2 bytes per modified block. That's more than an order of magnitude worse than what I got in the single-relation case, where the summary file didn't even use two *bits* per modified block. But here again, the file just isn't that big in absolute terms. To get a 1GB+ WAL summary file, you'd need to modify millions of relation forks, maybe tens of millions, and most installations aren't even going to have that many relation forks, let alone be modifying them all frequently. My conclusion here is that it's pretty hard to have a database where WAL summarization is going to use too much memory. I wouldn't be terribly surprised if there are some extreme cases where it happens, but those databases probably aren't great candidates for incremental backup anyway. They're probably databases with millions of relations and frequent, widely-scattered modifications to those relations. And if you have that kind of high turnover rate then incremental backup isn't going to be as helpful anyway, so there's probably no reason to enable WAL summarization in the first place. Maybe if you have that plus, in the same database cluster, 100TB of completely static data that is never modified, and if you also do all of this on a pretty small machine, then you can find a case where incremental backup would have worked well but for the memory consumed by WAL summarization. But I think that's sufficiently niche that the current patch shouldn't concern itself with such cases. If we find that they're common enough to worry about, we might eventually want to do something to mitigate them, but whether that thing looks anything like wal_summarize_mb seems pretty unclear. 
So I conclude that it's a mistake to include that GUC as currently designed and propose to replace it with a Boolean as described above. Comments? -- Robert Haas EDB: http://www.enterprisedb.com
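The convergence toward 1 bit per modified block described in the message above can be modeled in a few lines. This is not the actual summary file format (which is Dilip Kumar's design); it's just a model in which each fork's modified blocks are stored either as an explicit list of 4-byte block numbers or as a bitmap over the fork, whichever is smaller:

```python
def summary_size_bytes(modified_blocks, total_blocks):
    """Rough size model for one relation fork's summary: pick the
    cheaper of an explicit block-number list and a whole-fork bitmap.
    Illustrates why size converges to 1 bit per block, not the real
    on-disk encoding."""
    list_size = 4 * len(modified_blocks)       # 4 bytes per block number
    bitmap_size = (total_blocks + 7) // 8      # 1 bit per block in the fork
    return min(list_size, bitmap_size)

total = 1 << 27  # a ~1 TB relation fork at 8 kB blocks
sparse = summary_size_bytes(range(1000), total)   # lightly modified: list wins
dense = summary_size_bytes(range(total), total)   # fully modified: bitmap wins
bits_per_modified_block = dense * 8 / total       # converges to 1.0
```

Under this model a fully modified 1 TB fork needs about 16 MB of summary, in the same ballpark as the "perhaps 20MB" figure quoted earlier in the thread.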
While reviewing this thread today, I realized that I never responded to this email. That was inadvertent; my apologies. On Wed, Jun 14, 2023 at 4:34 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > Nice, I like this idea. Cool. > Skimming through the 7th patch, I see claims that FSM is not fully > WAL-logged and thus shouldn't be tracked, and so it indeed doesn't > track those changes. > I disagree with that decision: we now have support for custom resource > managers, which may use the various forks for other purposes than > those used in PostgreSQL right now. It would be a shame if data is > lost because of the backup tool ignoring forks because the PostgreSQL > project itself doesn't have post-recovery consistency guarantees in > that fork. So, unless we document that WAL-logged changes in the FSM > fork are actually not recoverable from backup, regardless of the type > of contents, we should still keep track of the changes in the FSM fork > and include the fork in our backups or only exclude those FSM updates > that we know are safe to ignore. I'm not sure what to do about this problem. I don't think any data would be *lost* in the scenario that you mention; what I think would happen is that the FSM forks would be backed up in their entirety even if they were owned by some other table AM or index AM that was WAL-logging all changes to whatever it was storing in that fork. So I think that there is not a correctness issue here but rather an efficiency issue. It would still be nice to fix that somehow, but I don't see how to do it. It would be easy to make the WAL summarizer stop treating the FSM as a special case, but there's no way for basebackup_incremental.c to know whether a particular relation fork is for the heap AM or some other AM that handles WAL-logging differently. It can't for example examine pg_class; it's not connected to any database, let alone every database. 
So we have to either trust that the WAL for the FSM is correct and complete in all cases, or assume that it isn't in any case. And the former doesn't seem like a safe or wise assumption given how the heap AM works. I think the reality here is unfortunately that we're missing a lot of important infrastructure to really enable a multi-table-AM world. The heap AM, and every other table AM, should include a metapage so we can tell what we're looking at just by examining the disk files. Relation forks don't scale and should be replaced with some better system that does. We should have at least two table AMs in core that are fully supported and do truly useful things. Until some of that stuff (and probably a bunch of other things) get sorted out, out-of-core AMs are going to have to remain second-class citizens to some degree. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Sep 28, 2023 at 6:22 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > If that is still an area open for discussion: wouldn't it be better to > just specify LSN as it would allow resyncing standby across major lag > where the WAL to replay would be enormous? Given that we had > primary->standby where standby would be stuck on some LSN, right now > it would be: > 1) calculate backup manifest of desynced 10TB standby (how? using > which tool?) - even if possible, that means reading 10TB of data > instead of just putting a number, isn't it? > 2) backup primary with such incremental backup >= LSN > 3) copy the incremental backup to standby > 4) apply it to the impaired standby > 5) restart the WAL replay As you may be able to tell from the flurry of posts and new patch sets, I'm trying hard to sort out the remaining open items that pertain to this patch set, and I'm now back to thinking about this one. TL;DR: I think the idea has some potential, but there are some pitfalls that I'm not sure how to address. I spent some time looking at how we currently use the data from the backup manifest. Currently, we do two things with it. First, when we're backing up each file, we check whether it's present in the backup manifest and, if not, we back it up in full. This actually feels fairly poor. If it makes any difference at all, then presumably the underlying algorithm is buggy and needs to be fixed. Maybe that should be ripped out altogether or turned into some kind of sanity check that causes a big explosion if it fails. Second, we check whether the WAL ranges reported by the client match up with the timeline history of the server (see PrepareForIncrementalBackup). This set of sanity checks seems fairly important to me, and I'd regret discarding them. I think there's some possibility that they might catch user error, like where somebody promotes multiple standbys and maybe they even get the same timeline on more than one of them, and then confusion might ensue. 
I also think that there's a real possibility that they might make it easier to track down bugs in my code, even if those bugs aren't necessarily timeline-related. If (or more realistically when) somebody ends up with a corrupted cluster after running pg_combinebackup, we're going to need to figure out whether that corruption is the result of bugs (and if so where they are) or user error (and if so what it was). The most obvious ways of ending up with a corrupted cluster are (1) taking an incremental backup against a prior backup that is not in the history of the server from which the backup is taken or (2) combining an incremental backup with the wrong prior backup, so whatever sanity checks we can have that will tend to prevent those kinds of mistakes seem like a really good idea. And those kinds of checks seem relevant here, too. Consider that it wouldn't be valid to use pg_combinebackup to fast-forward a standby server if the incremental backup's backup-end-LSN preceded the standby server's minimum recovery point. Imagine that you have a standby whose last checkpoint's redo location was at LSN 2/48. Being the enterprising DBA that you are, you make a note of that LSN and go take an incremental backup based on it. You then stop the standby server and try to apply the incremental backup to fast-forward the standby. Well, it's possible that in the meanwhile the standby actually caught up, and now has a minimum recovery point that follows the backup-end-LSN of your incremental backup. In that case, you can't legally use that incremental backup to fast-forward that standby, but no code I've yet written would be smart enough to figure that out. Or, maybe you (or some other DBA on your team) got really excited and actually promoted that standby meanwhile, and now it's not even on the same timeline any more. 
In the "normal" case where you take an incremental backup based on an earlier base backup, these kinds of problems are detectable, and it seems to me that if we want to enable this kind of use case, it would be pretty smart to have a plan to detect similar mistakes here. I don't, currently, but maybe there is one. Another practical problem here is that, right now, pg_combinebackup doesn't have an in-place mode. It knows how to take a bunch of input backups and write out an output backup, but that output backup needs to go into a new, fresh directory (or directories plural, if there are user-defined tablespaces). I had previously considered adding such a mode, but the idea I had at the time wouldn't have worked for this case. I imagined that someone might want to run "pg_combinebackup --in-place full incr" and clobber the contents of the incr directory with the output, basically discarding the incremental backup you took in favor of a full backup that could have been taken at the same point in time. But here, you'd want to clobber the *first* input to pg_combinebackup, not the last one, so if we want to add something like this, the UI needs some thought. One thing that I find quite scary about such a mode is that if you crash mid-way through, you're in a lot of trouble. In the case that I had previously contemplated -- overwrite the last incremental with the reconstructed full backup -- you *might* be able to make it crash safe by writing out the full files for each incremental file, fsyncing everything, then removing all of the incremental files and fsyncing again. The idea would be that if you crash midway through it's OK to just repeat whatever you were trying to do before the crash and if it succeeds the second time then all is well. If, for a given file, there are both incremental and non-incremental versions, then the second attempt should remove and recreate the non-incremental version from the incremental version. 
If there's only a non-incremental version, it could be that the previous attempt got far enough to remove the incremental file, but in that case the full file that we now have should be the same thing that we would produce if we did the operation now. It all sounds a little scary, but maybe it's OK. And as long as you don't remove the this-is-an-incremental-backup markers from the backup_label file until you've done everything else, you can tell whether you've ever successfully completed the reassembly or not. But if you're using a hypothetical overwrite mode to overwrite the first input rather than the last one, well, it looks like a valid data directory already, and if you replace a bunch of files and then crash, it still does, but it's not any more, really. I'm not sure I've really wrapped my head around all of the cases here, but it does feel like there are some new ways to go wrong. One thing I also realized when thinking about this is that you could probably hose yourself with the patch set as it stands today by taking a full backup, downgrading to wal_level=minimal for a while, doing some WAL-skipping operations, upgrading to a higher WAL-level again, and then taking an incremental backup. I think the solution to that is probably for the WAL summarizer to refuse to run if wal_level=minimal. Then there would be a gap in the summary files which an incremental backup attempt would detect. -- Robert Haas EDB: http://www.enterprisedb.com
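The repeat-on-crash overwrite sequence described above (write full files, fsync, only then remove the incremental files, fsync again) could look roughly like this. The "INCREMENTAL." file-name prefix and the reconstruct_full callback are assumptions for illustration, not the actual pg_combinebackup code:

```python
import os
import shutil
import tempfile

def reconstruct_in_place(backup_dir, reconstruct_full):
    """Sketch of a crash-safe in-place reconstruction: rerunnable after
    a crash because incremental files are only removed once every full
    file is durable. reconstruct_full(src, dst) is assumed to rebuild
    a full file from an incremental one."""
    incrementals = [name for name in os.listdir(backup_dir)
                    if name.startswith("INCREMENTAL.")]
    # Step 1: write a full version alongside every incremental file and
    # make it durable before touching the originals. A crash here just
    # means the whole sequence gets repeated.
    for name in incrementals:
        full = os.path.join(backup_dir, name[len("INCREMENTAL."):])
        reconstruct_full(os.path.join(backup_dir, name), full)
        with open(full, "rb") as f:
            os.fsync(f.fileno())
    # Step 2: only now remove the incremental files, then fsync the
    # directory so the removals are durable too.
    for name in incrementals:
        os.remove(os.path.join(backup_dir, name))
    fd = os.open(backup_dir, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

# Toy demo: the "reconstruction" is just a copy.
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "INCREMENTAL.16384"), "wb") as f:
    f.write(b"reconstructed blocks")
reconstruct_in_place(demo, shutil.copyfile)
```

As the message notes, this idea works for overwriting the *last* input (the incremental backup); overwriting the *first* input would leave something that still looks like a valid data directory mid-operation, which is the scary part.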
Hi, On 2023-10-30 10:45:03 -0400, Robert Haas wrote: > On Tue, Oct 24, 2023 at 12:08 PM Robert Haas <robertmhaas@gmail.com> wrote: > > Note that whether to remove summaries is a separate question from > > whether to generate them in the first place. Right now, I have > > wal_summarize_mb controlling whether they get generated in the first > > place, but as I noted in another recent email, that isn't an entirely > > satisfying solution. > > I did some more research on this. My conclusion is that I should > remove wal_summarize_mb and just have a GUC summarize_wal = on|off > that controls whether the summarizer runs at all. There will be one > summary file per checkpoint, no matter how far apart checkpoints are > or how large the summary gets. Below I'll explain the reasoning; let > me know if you disagree. > What I describe above would be a bad plan if it were realistically > possible for a summary file to get so large that it might run the > machine out of memory either when producing it or when trying to make > use of it for an incremental backup. This seems to be a somewhat > difficult scenario to create. So far, I haven't been able to generate > WAL summary files more than a few tens of megabytes in size, even when > summarizing 50+ GB of WAL per summary file. One reason why it's hard > to produce large summary files is because, for a single relation fork, > the WAL summary size converges to 1 bit per modified block when the > number of modified blocks is large. This means that, even if you have > a terabyte sized relation, you're looking at no more than perhaps 20MB > of summary data no matter how much of it gets modified. Now, somebody > could have a 30TB relation and then if they modify the whole thing > they could have the better part of a gigabyte of summary data for that > relation, but if you've got a 30TB table you probably have enough > memory that that's no big deal. 
I'm not particularly worried about the rewriting-30TB-table case - that'd also generate >= 30TB of WAL most of the time. Which realistically is going to trigger a few checkpoints, even on very big instances. > But, what if you have multiple relations? I initialized pgbench with a > scale factor of 30000 and also with 30000 partitions and did a 1-hour > run. I got 4 checkpoints during that time and each one produced an > approximately 16MB summary file. Hm, I assume the pgbench run will be fairly massively bottlenecked on IO, due to having to read data from disk, lots of full page writes and having to write out lots of data? I.e. we won't do all that many transactions during the 1h? > To get a 1GB+ WAL summary file, you'd need to modify millions of relation > forks, maybe tens of millions, and most installations aren't even going to > have that many relation forks, let alone be modifying them all frequently. I tried to find bad cases for a bit - and I am not worried. I wrote a pgbench script to create 10k single-row relations in each script, ran that with 96 clients, checkpointed, and ran a pgbench script that updated the single row in each table. After creation of the relations, WAL summarizer uses

LOG: level: 1; Wal Summarizer: 378433680 total in 43 blocks; 5628936 free (66 chunks); 372804744 used

and creates a 26MB summary file. After checkpoint & updates, WAL summarizer uses:

LOG: level: 1; Wal Summarizer: 369205392 total in 43 blocks; 5864536 free (26 chunks); 363340856 used

and creates a 26MB summary file. Sure, 350MB ain't nothing, but simply just executing \dt in the database created by this makes the backend use 260MB after. Which isn't going away, whereas WAL summarizer drops its memory usage soon after. > But I think that's sufficiently niche that the current patch shouldn't > concern itself with such cases. 
If we find that they're common enough > to worry about, we might eventually want to do something to mitigate > them, but whether that thing looks anything like wal_summarize_mb > seems pretty unclear. So I conclude that it's a mistake to include > that GUC as currently designed and propose to replace it with a > Boolean as described above. After playing with this for a while, I don't see a reason for wal_summarize_mb from a memory usage POV at least. I wonder if there are use cases that might like to consume WAL summaries before the next checkpoint? For those wal_summarize_mb likely wouldn't be a good control, but they might want to request a summary file to be created at some point? Greetings, Andres Freund
On Mon, Oct 30, 2023 at 2:46 PM Andres Freund <andres@anarazel.de> wrote: > After playing with this for a while, I don't see a reason for wal_summarize_mb > from a memory usage POV at least. Cool! Thanks for testing. > I wonder if there are use cases that might like to consume WAL summaries > before the next checkpoint? For those wal_summarize_mb likely wouldn't be a > good control, but they might want to request a summary file to be created at > some point? It's possible. I actually think it's even more likely that there are use cases that will also want the WAL summarized, but in some different way. For example, you might want a summary that would give you the LSN or approximate LSN where changes to a certain block occurred. Such a summary would be way bigger than these summaries and therefore, at least IMHO, a lot less useful for incremental backup, but it could be really useful for something else. Or you might want summaries that focus on something other than which blocks got changed, like what relations were created or destroyed, or only changes to certain kinds of relations or relation forks, or whatever. In a way, you can even think of logical decoding as a kind of WAL summarization, just with a very different set of goals from this one. I won't be too surprised if the next hacker wants something that is different enough from what this does that it doesn't make sense to share mechanism, but if by chance they want the same thing but dumped a bit more frequently, well, that can be done. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Oct 30, 2023 at 6:46 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Sep 28, 2023 at 6:22 AM Jakub Wartak > <jakub.wartak@enterprisedb.com> wrote: > > If that is still an area open for discussion: wouldn't it be better to > > just specify LSN as it would allow resyncing standby across major lag > > where the WAL to replay would be enormous? Given that we had > > primary->standby where standby would be stuck on some LSN, right now > > it would be: > > 1) calculate backup manifest of desynced 10TB standby (how? using > > which tool?) - even if possible, that means reading 10TB of data > > instead of just putting a number, isn't it? > > 2) backup primary with such incremental backup >= LSN > > 3) copy the incremental backup to standby > > 4) apply it to the impaired standby > > 5) restart the WAL replay > > As you may be able to tell from the flurry of posts and new patch > sets, I'm trying hard to sort out the remaining open items that > pertain to this patch set, and I'm now back to thinking about this > one. > > TL;DR: I think the idea has some potential, but there are some > pitfalls that I'm not sure how to address. > > I spent some time looking at how we currently use the data from the > backup manifest. Currently, we do two things with it. First, when > we're backing up each file, we check whether it's present in the > backup manifest and, if not, we back it up in full. This actually > feels fairly poor. If it makes any difference at all, then presumably > the underlying algorithm is buggy and needs to be fixed. Maybe that > should be ripped out altogether or turned into some kind of sanity > check that causes a big explosion if it fails. Second, we check > whether the WAL ranges reported by the client match up with the > timeline history of the server (see PrepareForIncrementalBackup). This > set of sanity checks seems fairly important to me, and I'd regret > discarding them. 
I think there's some possibility that they might > catch user error, like where somebody promotes multiple standbys and > maybe they even get the same timeline on more than one of them, and > then confusion might ensue. [..] > Another practical problem here is that, right now, pg_combinebackup > doesn't have an in-place mode. It knows how to take a bunch of input > backups and write out an output backup, but that output backup needs > to go into a new, fresh directory (or directories plural, if there are > user-defined tablespaces). I had previously considered adding such a > mode, but the idea I had at the time wouldn't have worked for this > case. I imagined that someone might want to run "pg_combinebackup > --in-place full incr" and clobber the contents of the incr directory > with the output, basically discarding the incremental backup you took > in favor of a full backup that could have been taken at the same point > in time. [..] Thanks for answering! It all sounds like this resync-standby-using-primary-incrbackup idea isn't a fit for the current pg_combinebackup, but rather for a new tool, hopefully in the future. It could take the current LSN from the stuck standby, calculate a manifest on the lagged and offline standby (do we need to calculate manifest checksums in that case? I cannot find code for it), deliver it via "UPLOAD_MANIFEST" to the primary, and start fetching and applying the differences while doing some form of copy-on-write from the old & incoming incrbackup data to "$relfilenodeid.new", and then durable_unlink() the old one and durable_rename("$relfilenodeid.new", "$relfilenodeid"). Would it still be possible in theory? (It could use additional safeguards, like renaming the controlfile when starting and just before ending, to additionally block startup if it hasn't finished.) Also, judging by the comment near struct IncrementalBackupInfo.manifest_files, it looks like even the checksums are more of a safeguard than part of the core implementation (?) 
What I've meant in the initial idea is not to hinder current efforts, but asking if the current design will not stand in the way of such a cool new addition in the future? > One thing I also realized when thinking about this is that you could > probably hose yourself with the patch set as it stands today by taking > a full backup, downgrading to wal_level=minimal for a while, doing > some WAL-skipping operations, upgrading to a higher WAL-level again, > and then taking an incremental backup. I think the solution to that is > probably for the WAL summarizer to refuse to run if wal_level=minimal. > Then there would be a gap in the summary files which an incremental > backup attempt would detect. As per the earlier test [1], I've already tried to simulate that in incrbackuptests-0.1.tgz/test_across_wallevelminimal.sh, but that worked (but that was with the CTAS wal_level=minimal optimization -> a new relfilenode OID is used for CTAS, which got included in the incremental backup as it's a new file). Even retested that with your v7 patch with asserts, same. When simulating with "BEGIN; TRUNCATE nightmare; COPY nightmare FROM '/tmp/copy.out'; COMMIT;" on wal_level=minimal it still recovers using incremental backup because the WAL contains: rmgr: Storage, desc: CREATE base/5/36425 [..] rmgr: XLOG, desc: FPI , blkref #0: rel 1663/5/36425 blk 0 FPW [..] e.g. TRUNCATE sets a new relfilenode each time, so they will always be included in the backup, and the wal_level=minimal optimizations kick in only for commands that issue a new relfilenode. True/false?
postgres=# select oid, relfilenode, relname from pg_class where relname like 'night%' order by 1;
  oid  | relfilenode |       relname
-------+-------------+---------------------
 16384 |           0 | nightmare
 16390 |       36420 | nightmare_p0
 16398 |       36425 | nightmare_p1
 36411 |           0 | nightmare_pkey
 36413 |       36422 | nightmare_p0_pkey
 36415 |       36427 | nightmare_p1_pkey
 36417 |           0 | nightmare_brin_idx
 36418 |       36423 | nightmare_p0_ts_idx
 36419 |       36428 | nightmare_p1_ts_idx
(9 rows)

postgres=# truncate nightmare;
TRUNCATE TABLE
postgres=# select oid, relfilenode, relname from pg_class where relname like 'night%' order by 1;
  oid  | relfilenode |       relname
-------+-------------+---------------------
 16384 |           0 | nightmare
 16390 |       36434 | nightmare_p0
 16398 |       36439 | nightmare_p1
 36411 |           0 | nightmare_pkey
 36413 |       36436 | nightmare_p0_pkey
 36415 |       36441 | nightmare_p1_pkey
 36417 |           0 | nightmare_brin_idx
 36418 |       36437 | nightmare_p0_ts_idx
 36419 |       36442 | nightmare_p1_ts_idx
(9 rows)

-J.

[1] - https://www.postgresql.org/message-id/CAKZiRmzT%2BbX2ZYdORO32cADtfQ9DvyaOE8fsOEWZc2V5FkEWVg%40mail.gmail.com
On Wed, Nov 1, 2023 at 8:57 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > Thanks for answering! It all sounds like this > resync-standby-using-primary-incrbackup idea isn't fit for the current > pg_combinebackup, but rather for a new tool hopefully in future. It > could take the current LSN from stuck standby, calculate manifest on > the lagged and offline standby (do we need to calculate manifest > Checksum in that case? I cannot find code for it), deliver it via > "UPLOAD_MANIFEST" to primary and start fetching and applying the > differences while doing some form of copy-on-write from old & incoming > incrbackup data to "$relfilenodeid.new" and then durable_unlink() old > one and durable_rename("$relfilenodeid.new", "$relfilenodeid". Would > it still be possible in theory? (it could use additional safeguards > like rename controlfile when starting and just before ending to > additionally block startup if it hasn't finished). Also it looks as > per comment nearby struct IncrementalBackupInfo.manifest_files that > even checksums are just more for safeguarding rather than core > implementation (?) > > What I've meant in the initial idea is not to hinder current efforts, > but asking if the current design will not stand in a way for such a > cool new addition in future ? Hmm, interesting idea. I think something like that could be made to work. My first thought was that it would sort of suck to have to compute a manifest as a precondition of doing this, but then I started to think maybe it wouldn't, really. I mean, you'd have to scan the local directory tree and collect all the filenames so that you could remove any files that are no longer present in the current version of the data directory which the incremental backup would send to you. If you're already doing that, the additional cost of generating a manifest isn't that high, at least if you don't include checksums, which aren't required. 
On the other hand, if you didn't need to send the server a manifest and just needed to send the required WAL ranges, that would be even cheaper. I'll spend some more time thinking about this next week. > As per earlier test [1], I've already tried to simulate that in > incrbackuptests-0.1.tgz/test_across_wallevelminimal.sh , but that > worked (but that was with CTAS-wal-minimal-optimization -> new > relfilenodeOID is used for CTAS which got included in the incremental > backup as it's new file) Even retested that with Your v7 patch with > asserts, same. When simulating with "BEGIN; TRUNCATE nightmare; COPY > nightmare FROM '/tmp/copy.out'; COMMIT;" on wal_level=minimal it still > recovers using incremental backup because the WAL contains: TRUNCATE itself is always WAL-logged, but data added to the relation in the same transaction as the TRUNCATE isn't always WAL-logged (but sometimes it is, depending on the relation size). So the failure case wouldn't be missing the TRUNCATE but missing some data-containing blocks within the relation shortly after it was created or truncated. I think what I need to do here is avoid summarizing WAL that was generated under wal_level=minimal. The walsummarizer process should just refuse to emit summaries for any such WAL. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Oct 30, 2023 at 2:46 PM Andres Freund <andres@anarazel.de> wrote:
> After playing with this for a while, I don't see a reason for wal_summarize_mb
> from a memory usage POV at least.

Here's v8. Changes:

- Replace wal_summarize_mb GUC with summarize_wal = on | off.
- Document the summarize_wal and wal_summary_keep_time GUCs.
- Refuse to start with summarize_wal = on and wal_level = minimal.
- Increase default wal_summary_keep_time to 10d from 7d, per (what I think was) a suggestion from Peter E.
- Fix fencepost errors when deciding which WAL summaries are needed for a backup.
- Fix indentation damage.
- Standardize on ereport(DEBUG1, ...) in walsummarizer.c vs. various more and less chatty things I had before.
- Include the timeline in some error messages because not having it proved confusing.
- Be more consistent about ignoring the FSM fork.
- Fix a bug that could cause WAL summarization to error out when switching timelines.
- Fix the division between the wal summarizer and incremental backup patches so that the former passes tests without the latter.
- Fix some things that an older compiler didn't like, including adding pg_attribute_printf in some places.
- Die with an error instead of crashing if someone feeds us a manifest with no WAL ranges.
- Sort the block numbers that need to be read from a relation file before reading them, so that we're certain to read them in ascending order.
- Be more careful about computing the truncation_block_length of an incremental file; don't do math on a block number that might be InvalidBlockNumber.
- Fix pg_combinebackup so it doesn't fail when zero-filled blocks are added to a relation between the prior backup and the incremental backup.
- Improve the pg_combinebackup -d output so that it explains in detail how it's carrying out reconstruction, to improve debuggability.
- Disable WAL summarization by default, but add a test patch to the series to enable it, because running the whole test suite with it turned on is good for bug-hunting.
- In pg_walsummary, zero a struct before using instead of starting with arbitrary junk values.

To do list:

- Figure out whether to do something other than uploading the whole summary, per discussion with Jakub Wartak.
- Decide what to do about the 60-second waiting-for-WAL-summarization timeout.
- Make incremental backup fail quickly if WAL summarization is not even enabled.
- Have pg_basebackup error out nicely if an incremental backup is requested from an older server that can't do that.
- Add some kind of tests for pg_walsummary.

-- Robert Haas EDB: http://www.enterprisedb.com
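As a side note on the "sort the block numbers before reading them" item: reading an incremental file's blocks in ascending order keeps the I/O pattern sequential. A minimal sketch of such a sort, using illustrative names rather than the patch's actual code:

```c
#include <stdlib.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/*
 * qsort comparator for ascending BlockNumber order. Comparing
 * explicitly (rather than subtracting) avoids unsigned-wraparound
 * surprises when the values differ by more than INT_MAX.
 */
static int
compare_block_numbers(const void *a, const void *b)
{
	BlockNumber aa = *(const BlockNumber *) a;
	BlockNumber bb = *(const BlockNumber *) b;

	if (aa < bb)
		return -1;
	if (aa > bb)
		return 1;
	return 0;
}

/* Sort the blocks to be read so reads proceed in ascending order. */
static void
sort_blocks_to_read(BlockNumber *blocks, size_t nblocks)
{
	qsort(blocks, nblocks, sizeof(BlockNumber), compare_block_numbers);
}
```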
Attachment
- v8-0005-Add-new-pg_walsummary-tool.patch
- v8-0001-Change-how-a-base-backup-decides-which-files-have.patch
- v8-0006-Test-patch-Enable-summarize_wal-by-default.patch
- v8-0003-Add-a-new-WAL-summarizer-process.patch
- v8-0004-Add-support-for-incremental-backup.patch
- v8-0002-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch
On Tue, Nov 7, 2023 at 2:06 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Oct 30, 2023 at 2:46 PM Andres Freund <andres@anarazel.de> wrote: > > After playing with this for a while, I don't see a reason for wal_summarize_mb > > from a memory usage POV at least. > > Here's v8. Changes: Review comments, based on what I reviewed so far. - I think 0001 looks good improvement irrespective of the patch series. - review 0003 1. + be enabled either on a primary or on a standby. WAL summarization can + cannot be enabled when <varname>wal_level</varname> is set to + <literal>minimal</literal>. Grammatical error "WAL summarization can cannot" -> WAL summarization cannot 2. + <varlistentry id="guc-wal-summarize-keep-time" xreflabel="wal_summarize_keep_time"> + <term><varname>wal_summarize_keep_time</varname> (<type>boolean</type>) + <indexterm> + <primary><varname>wal_summarize_keep_time</varname> configuration parameter</primary> + </indexterm> I feel the name of the GUC should be either wal_summarizer_keep_time or wal_summaries_keep_time, I mean either we should refer to the summarizer process or to the WAL summary files. 3. +XLogGetOldestSegno(TimeLineID tli) +{ + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name)) + continue; + + /* Parse filename to get TLI and segno. */ + XLogFromFileName(xlde->d_name, &file_tli, &file_segno, + wal_segment_size); + + /* Ignore anything that's not from the TLI of interest. */ + if (tli != file_tli) + continue; + + /* If it's the oldest so far, update oldest_segno. */ Some of the single-line comments end with a full stop whereas others do not, so better to be consistent. 4. + * If start_lsn != InvalidXLogRecPtr, only summaries that end before the + * indicated LSN will be included. + * + * If end_lsn != InvalidXLogRecPtr, only summaries that start before the + * indicated LSN will be included.
+ * + * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn) + * to get all WAL summaries on the indicated timeline that overlap the + * specified LSN range. + */ +List * +GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn) Instead of "If start_lsn != InvalidXLogRecPtr, only summaries that end before the" it should be "If start_lsn != InvalidXLogRecPtr, only summaries that end after the" because only the summary files that end after the start_lsn overlap the range and need to be returned; files ending before the start_lsn are not overlapping at all, right? 5. In the FilterWalSummaries() header the comment is wrong in the same way as for the GetWalSummaries() function. 6. + * If the whole range of LSNs is covered, returns true, otherwise false. + * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr + * if there are no WAL summary files in the input list, or to the first LSN + * in the range that is not covered by a WAL summary file in the input list. + */ +bool +WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn, I did not see the usage of this function, but I think if the whole range is not covered why not keep the behavior uniform w.r.t. what we set for '*missing_lsn', I mean suppose there is no file then missing_lsn is the start_lsn because a very first LSN is missing. 7. + nbytes = FileRead(io->file, data, length, io->filepos, + WAIT_EVENT_WAL_SUMMARY_READ); + if (nbytes < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not write file \"%s\": %m", + FilePathName(io->file)))); /could not write file/ could not read file 8. +/* + * Comparator to sort a List of WalSummaryFile objects by start_lsn. + */ +static int +ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b) +{ -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
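To put the overlap direction from point 4 in concrete terms: a summary overlaps the requested range exactly when it ends after start_lsn and starts before end_lsn. A minimal sketch of that predicate, with illustrative names and simplified types rather than the actual GetWalSummaries() code:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Illustrative stand-in for the WalSummaryFile fields discussed here. */
typedef struct
{
	XLogRecPtr	summary_start_lsn;
	XLogRecPtr	summary_end_lsn;
} WalSummaryFile;

/*
 * Keep a summary if it overlaps [start_lsn, end_lsn): it must end
 * after start_lsn and start before end_lsn. An InvalidXLogRecPtr
 * bound means "no bound on that side".
 */
static bool
summary_overlaps_range(const WalSummaryFile *ws,
					   XLogRecPtr start_lsn, XLogRecPtr end_lsn)
{
	if (start_lsn != InvalidXLogRecPtr && ws->summary_end_lsn <= start_lsn)
		return false;			/* ends at or before the range start */
	if (end_lsn != InvalidXLogRecPtr && ws->summary_start_lsn >= end_lsn)
		return false;			/* starts at or after the range end */
	return true;
}
```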
Great stuff you got here. I'm doing a first pass trying to grok the whole thing for more substantive comments, but in the meantime here are some cosmetic ones.

I got the following warnings, both valid:

../../../../pgsql/source/master/src/common/blkreftable.c: In function 'WriteBlockRefTable':
../../../../pgsql/source/master/src/common/blkreftable.c:520:45: warning: declaration of 'brtentry' shadows a previous local [-Wshadow=compatible-local]
  520 | BlockRefTableEntry *brtentry;
      |                     ^~~~~~~~
../../../../pgsql/source/master/src/common/blkreftable.c:492:37: note: shadowed declaration is here
  492 | BlockRefTableEntry *brtentry;
      |                     ^~~~~~~~
../../../../../pgsql/source/master/src/backend/postmaster/walsummarizer.c: In function 'SummarizeWAL':
../../../../../pgsql/source/master/src/backend/postmaster/walsummarizer.c:833:57: warning: declaration of 'private_data' shadows a previous local [-Wshadow=compatible-local]
  833 | SummarizerReadLocalXLogPrivate *private_data;
      |                                 ^~~~~~~~~~~~
../../../../../pgsql/source/master/src/backend/postmaster/walsummarizer.c:709:41: note: shadowed declaration is here
  709 | SummarizerReadLocalXLogPrivate *private_data;
      |                                 ^~~~~~~~~~~~

In blkreftable.c, I think the definition of SH_EQUAL should have an outer layer of parentheses. Also, it would be good to provide and use a function to initialize a BlockRefTableKey from the RelFileNode and forknum components, and ensure that any padding bytes are zeroed. Otherwise it's not going to be a great hash key. On my platform there aren't any (padding bytes), but I think it's unwise to rely on that.

I don't think SummarizerReadLocalXLogPrivate->waited is used for anything. Could be removed AFAICS, unless you foresee adding something that uses it.
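To make the padding concern concrete: if a struct is used as a memcmp-style hash key, any compiler-inserted padding bytes hold indeterminate values unless explicitly zeroed, so two logically equal keys can hash or compare unequal. A hedged sketch of the kind of initializer being suggested (the field names approximate BlockRefTableKey, but this is not the actual patch code):

```c
#include <string.h>
#include <stdint.h>

/* Illustrative stand-ins for RelFileLocator / ForkNumber. */
typedef struct
{
	uint32_t	spcOid;
	uint32_t	dbOid;
	uint32_t	relNumber;
} RelFileLocator;
typedef int ForkNumber;

typedef struct
{
	RelFileLocator rlocator;
	ForkNumber	forknum;
} BlockRefTableKey;

/*
 * Initialize a key for hashing. memset() the whole struct first so
 * that any padding bytes are zeroed; otherwise two keys with the same
 * logical contents could compare unequal under memcmp-based hashing.
 */
static void
blockreftable_key_init(BlockRefTableKey *key,
					   const RelFileLocator *rlocator, ForkNumber forknum)
{
	memset(key, 0, sizeof(BlockRefTableKey));
	key->rlocator = *rlocator;
	key->forknum = forknum;
}
```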
These forward struct declarations are not buying you anything, I'd remove them: diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h index 70d6c072d7..316e67122c 100644 --- a/src/include/common/blkreftable.h +++ b/src/include/common/blkreftable.h @@ -29,10 +29,7 @@ /* Magic number for serialization file format. */ #define BLOCKREFTABLE_MAGIC 0x652b137b -struct BlockRefTable; -struct BlockRefTableEntry; -struct BlockRefTableReader; -struct BlockRefTableWriter; +/* Struct definitions appear in blkreftable.c */ typedef struct BlockRefTable BlockRefTable; typedef struct BlockRefTableEntry BlockRefTableEntry; typedef struct BlockRefTableReader BlockRefTableReader; and backup_label.h doesn't know about TimeLineID, so it needs this: diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h index 08d6ed67a9..3af7ea274c 100644 --- a/src/bin/pg_combinebackup/backup_label.h +++ b/src/bin/pg_combinebackup/backup_label.h @@ -12,6 +12,7 @@ #ifndef BACKUP_LABEL_H #define BACKUP_LABEL_H +#include "access/xlogdefs.h" #include "common/checksum_helper.h" #include "lib/stringinfo.h" I don't much like the way header files in src/bin/pg_combinebackup files are structured. Particularly, causing a simplehash to be "instantiated" just because load_manifest.h is included seems poised to cause pain. I think there should be a file with the basic struct declarations (no simplehash); and then maybe since both pg_basebackup and pg_combinebackup seem to need the same simplehash, create a separate header file containing just that.. But, did you notice that anything that includes reconstruct.h will instantiate the simplehash stuff, because it includes load_manifest.h? It may be unwise to have the simplehash in a header file. Maybe just declare it in each .c file that needs it. The duplicity is not that large. I'll see if I can understand the way all these headers are needed to propose some other arrangement. 
I see this idea of having "struct FooBar;" immediately followed by "typedef struct FooBar FooBar;" which I mentioned from blkreftable.h occurs in other places as well (JsonManifestParseContext in parse_manifest.h, maybe others?). Was this pattern cargo-culted from somewhere? Perhaps we have other places to clean up. Why leave unnamed arguments in function declarations? For example, in static void manifest_process_file(JsonManifestParseContext *, char *pathname, size_t size, pg_checksum_type checksum_type, int checksum_length, uint8 *checksum_payload); the first argument lacks a name. Is this just an oversight, I hope? In GetFileBackupMethod(), which arguments are in and which are out? The comment doesn't say, and it's not obvious why we pass both the file path as well as the individual constituent pieces for it. DO_NOT_BACKUP_FILE appears not to be set anywhere. Do you expect to use this later? If not, maybe remove it. There are two functions named record_manifest_details_for_file() in different programs. I think this sort of arrangement is not great, as it is confusing to follow. It would be better if those two routines were called something like, say, verifybackup_perfile_cb and combinebackup_perfile_cb instead; then in the function comment say something like /* * JsonManifestParseContext->perfile_cb implementation for pg_combinebackup. * * Record details extracted from the backup manifest for one file, * because we like to keep things tracked or whatever. */ so it's easy to track down what does what and why. Same with perwalrange_cb. "perfile" looks bothersome to me as a name entity. Why not per_file_cb? and per_walrange_cb? In walsummarizer.c, HandleWalSummarizerInterrupts is called in summarizer_read_local_xlog_page but SummarizeWAL() doesn't do that. Maybe it should? I think this path is not going to be very human-likeable.
snprintf(final_path, MAXPGPATH, XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary", tli, LSN_FORMAT_ARGS(summary_start_lsn), LSN_FORMAT_ARGS(summary_end_lsn)); Why not add a dash between the TLI and between both LSNs, or something like that? (Also, are we really printing TLIs as 8-byte hexs?) -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "I suspect most samba developers are already technically insane... Of course, since many of them are Australians, you can't tell." (L. Torvalds)
On Fri, Nov 10, 2023 at 6:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > - I think 0001 looks good improvement irrespective of the patch series. OK, perhaps that can be independently committed, then, if nobody objects. Thanks for the review; I've fixed a bunch of things that you mentioned. I'll just comment on the ones I haven't yet done anything about below. > 2. > + <varlistentry id="guc-wal-summarize-keep-time" > xreflabel="wal_summarize_keep_time"> > + <term><varname>wal_summarize_keep_time</varname> (<type>boolean</type>) > + <indexterm> > + <primary><varname>wal_summarize_keep_time</varname> > configuration parameter</primary> > + </indexterm> > > I feel the name of the guy should be either wal_summarizer_keep_time > or wal_summaries_keep_time, I mean either we should refer to the > summarizer process or to the way summaries files. How about wal_summary_keep_time? > 6. > + * If the whole range of LSNs is covered, returns true, otherwise false. > + * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr > + * if there are no WAL summary files in the input list, or to the first LSN > + * in the range that is not covered by a WAL summary file in the input list. > + */ > +bool > +WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn, > > I did not see the usage of this function, but I think if the whole > range is not covered why not keep the behavior uniform w.r.t. what we > set for '*missing_lsn', I mean suppose there is no file then > missing_lsn is the start_lsn because a very first LSN is missing. It's used later in the patch series. I think the way that I have it makes for a more understandable error message. > 8. > +/* > + * Comparator to sort a List of WalSummaryFile objects by start_lsn. > + */ > +static int > +ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b) > +{ I'm not sure what needs fixing here. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Nov 13, 2023 at 11:25 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Great stuff you got here. I'm doing a first pass trying to grok the > whole thing for more substantive comments, but in the meantime here are > some cosmetic ones. Thanks, thanks, and thanks. I've fixed some things that you mentioned in the attached version. Other comments below. > In blkreftable.c, I think the definition of SH_EQUAL should have an > outer layer of parentheses. Also, it would be good to provide and use a > function to initialize a BlockRefTableKey from the RelFileNode and > forknum components, and ensure that any padding bytes are zeroed. > Otherwise it's not going to be a great hash key. On my platform there > aren't any (padding bytes), but I think it's unwise to rely on that. I'm having trouble understanding the second part of this suggestion. Note that in a frontend context, SH_RAW_ALLOCATOR is pg_malloc0, and in a backend context, we get the default, which is MemoryContextAllocZero. Maybe there's some case this doesn't cover, though? > These forward struct declarations are not buying you anything, I'd > remove them: I've had problems from time to time when I don't do this. I'll remove it here, but I'm not convinced that it's always useless. > I don't much like the way header files in src/bin/pg_combinebackup files > are structured. Particularly, causing a simplehash to be "instantiated" > just because load_manifest.h is included seems poised to cause pain. I > think there should be a file with the basic struct declarations (no > simplehash); and then maybe since both pg_basebackup and > pg_combinebackup seem to need the same simplehash, create a separate > header file containing just that.. But, did you notice that anything > that includes reconstruct.h will instantiate the simplehash stuff, > because it includes load_manifest.h? It may be unwise to have the > simplehash in a header file. Maybe just declare it in each .c file that > needs it. 
The duplicity is not that large. I think that I did this correctly. AIUI, if you're defining a simplehash that only one source file needs, you make the scope "static" and do both SH_DECLARE and SH_DEFINE in that file. If you need it to be shared by multiple files, you make it "extern" in the header file, do SH_DECLARE there, and SH_DEFINE in one of those source files. Or you could make the scope "static inline" in the header file and then you'd both SH_DECLARE and SH_DEFINE it in the header file. If I were to do as you suggest here, I think I'd end up with 2 copies of the compiled code for this instead of one, and if they ever got out of sync everything would break silently. > Why leave unnamed arguments in function declarations? For example, in > > static void manifest_process_file(JsonManifestParseContext *, > char *pathname, > size_t size, > pg_checksum_type checksum_type, > int checksum_length, > uint8 *checksum_payload); > the first argument lacks a name. Is this just an oversight, I hope? I mean, I've changed it now, but I don't think it's worth getting too excited about. "int checksum_length" is much better documentation than just "int," but "JsonManifestParseContext *context" is just noise, IMHO. You can argue that it's better for consistency that way, but whatever. > In GetFileBackupMethod(), which arguments are in and which are out? > The comment doesn't say, and it's not obvious why we pass both the file > path as well as the individual constituent pieces for it. The header comment does document which values are potentially set on return. I guess I thought it was clear enough that the stuff not documented to be output parameters was input parameters. Most of them aren't even pointers, so they have to be input parameters. The only exception is 'path', which I have some difficulty thinking that anyone is going to imagine to be an input pointer. Maybe you could propose a more specific rewording of this comment?
FWIW, I'm not altogether sure whether this function is going to get more heavily adjusted in a rev or three of the patch set, so maybe we want to wait to sort this out until this is closer to final, but OTOH if I know what you have in mind for the current version, I might be more likely to keep it in a good place if I end up changing it. > DO_NOT_BACKUP_FILE appears not to be set anywhere. Do you expect to use > this later? If not, maybe remove it. Woops, that was a holdover from an earlier version. > There are two functions named record_manifest_details_for_file() in > different programs. I think this sort of arrangement is not great, as > it is confusing confusing to follow. It would be better if those two > routines were called something like, say, verifybackup_perfile_cb and > combinebackup_perfile_cb instead; then in the function comment say > something like > /* > * JsonManifestParseContext->perfile_cb implementation for pg_combinebackup. > * > * Record details extracted from the backup manifest for one file, > * because we like to keep things tracked or whatever. > */ > so it's easy to track down what does what and why. Same with > perwalrange_cb. "perfile" looks bothersome to me as a name entity. Why > not per_file_cb? and per_walrange_cb? I had trouble figuring out how to name this stuff. I did notice the awkwardness, but surely nobody can think that two functions with the same name in different binaries can be actually the same function. If we want to inject more underscores here, my vote is to go all the way and make it per_wal_range_cb. > In walsummarizer.c, HandleWalSummarizerInterrupts is called in > summarizer_read_local_xlog_page but SummarizeWAL() doesn't do that. > Maybe it should? I replaced all the CHECK_FOR_INTERRUPTS() in that file with HandleWalSummarizerInterrupts(). Does that seem right? > I think this path is not going to be very human-likeable. 
> snprintf(final_path, MAXPGPATH, > XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary", > tli, > LSN_FORMAT_ARGS(summary_start_lsn), > LSN_FORMAT_ARGS(summary_end_lsn)); > Why not add a dash between the TLI and between both LSNs, or something > like that? (Also, are we really printing TLIs as 8-byte hexs?) Dealing with the last part first, we already do that in every WAL file name. I actually think these file names are easier to work with than WAL file names, because 000000010000000000000020 is not the WAL starting at 0/20, but rather the WAL starting at 0/20000000. To know at what LSN a WAL file starts, you have to mentally delete characters 17 through 22, which will always be zero, and instead add six zeroes at the end. I don't think whoever came up with that file naming convention deserves an award, unless it's a raspberry award. With these names, you get something like 0000000100000000015125B800000000015128F0.summary and you can sort of see that 1512 repeats so the LSN went from something ending in 5B8 to something ending in 8F0. I actually think it's way better. But I have a hard time arguing that it wouldn't be more readable still if we put some separator characters in there. I didn't do that because then they'd look less like WAL file names, but maybe that's not really a problem. A possible reason not to bother is that these files are less necessary for humans to care about than WAL files, since they don't need to be archived or transported between nodes in any way. Basically I think this is probably fine the way it is, but if you or others think it's really important to change it, I can do that. Just as long as we don't spend 50 emails arguing about which separator character to use. -- Robert Haas EDB: http://www.enterprisedb.com
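For the curious, the 40-hex-digit name discussed above is just the TLI followed by each LSN split into its high and low 32-bit halves, so it can be composed and decoded mechanically. A hedged sketch with stand-in typedefs (these are not the patch's actual helpers):

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

typedef uint32_t TimeLineID;
typedef uint64_t XLogRecPtr;

/* Split an LSN into the high/low halves printed as %08X%08X. */
#define LSN_FORMAT_ARGS(lsn) (uint32_t) ((lsn) >> 32), (uint32_t) (lsn)

/* Compose a summary file name: TLI, start LSN, end LSN, 8 hex digits each. */
static void
summary_file_name(char *buf, size_t sz, TimeLineID tli,
				  XLogRecPtr start_lsn, XLogRecPtr end_lsn)
{
	snprintf(buf, sz, "%08X%08X%08X%08X%08X.summary",
			 tli, LSN_FORMAT_ARGS(start_lsn), LSN_FORMAT_ARGS(end_lsn));
}

/* Decode such a name; returns 1 on success, 0 on a malformed name. */
static int
summary_file_parse(const char *name, TimeLineID *tli,
				   XLogRecPtr *start_lsn, XLogRecPtr *end_lsn)
{
	unsigned int t, s_hi, s_lo, e_hi, e_lo;

	/* Each %8X consumes exactly 8 hex digits. */
	if (sscanf(name, "%8X%8X%8X%8X%8X.summary",
			   &t, &s_hi, &s_lo, &e_hi, &e_lo) != 5)
		return 0;
	*tli = (TimeLineID) t;
	*start_lsn = ((XLogRecPtr) s_hi << 32) | s_lo;
	*end_lsn = ((XLogRecPtr) e_hi << 32) | e_lo;
	return 1;
}
```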
Attachment
- v9-0005-Add-new-pg_walsummary-tool.patch
- v9-0001-Change-how-a-base-backup-decides-which-files-have.patch
- v9-0002-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch
- v9-0004-Add-support-for-incremental-backup.patch
- v9-0003-Add-a-new-WAL-summarizer-process.patch
- v9-0006-Test-patch-Enable-summarize_wal-by-default.patch
On Tue, Nov 14, 2023 at 12:52 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Nov 10, 2023 at 6:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > - I think 0001 looks good improvement irrespective of the patch series. > > OK, perhaps that can be independently committed, then, if nobody objects. > > Thanks for the review; I've fixed a bunch of things that you > mentioned. I'll just comment on the ones I haven't yet done anything > about below. > > > 2. > > + <varlistentry id="guc-wal-summarize-keep-time" > > xreflabel="wal_summarize_keep_time"> > > + <term><varname>wal_summarize_keep_time</varname> (<type>boolean</type>) > > + <indexterm> > > + <primary><varname>wal_summarize_keep_time</varname> > > configuration parameter</primary> > > + </indexterm> > > > > I feel the name of the guy should be either wal_summarizer_keep_time > > or wal_summaries_keep_time, I mean either we should refer to the > > summarizer process or to the way summaries files. > > How about wal_summary_keep_time? Yes, that looks perfect to me. > > 6. > > + * If the whole range of LSNs is covered, returns true, otherwise false. > > + * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr > > + * if there are no WAL summary files in the input list, or to the first LSN > > + * in the range that is not covered by a WAL summary file in the input list. > > + */ > > +bool > > +WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn, > > > > I did not see the usage of this function, but I think if the whole > > range is not covered why not keep the behavior uniform w.r.t. what we > > set for '*missing_lsn', I mean suppose there is no file then > > missing_lsn is the start_lsn because a very first LSN is missing. > > It's used later in the patch series. I think the way that I have it > makes for a more understandable error message. Okay > > 8. > > +/* > > + * Comparator to sort a List of WalSummaryFile objects by start_lsn. 
> > + */ > > +static int > > +ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b) > > +{ > > I'm not sure what needs fixing here. I think I copy-pasted it by mistake, just ignore it. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Nov 14, 2023 at 2:10 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Nov 13, 2023 at 11:25 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Great stuff you got here. I'm doing a first pass trying to grok the > > whole thing for more substantive comments, but in the meantime here are > > some cosmetic ones. > > Thanks, thanks, and thanks. > > I've fixed some things that you mentioned in the attached version. > Other comments below. Here are some more comments based on what I have read so far, mostly cosmetic comments. 1. + * summary file yet, then stoppng doesn't make any sense, and we + * should wait until the next stop point instead. Typo /stoppng/stopping 2. + /* Close temporary file and shut down xlogreader. */ + FileClose(io.file); + We have already freed the xlogreader so the second part of the comment is not valid. 3. + /* + * If a relation fork is truncated on disk, there is in point in + * tracking anything about block modifications beyond the truncation + * point. Typo. /there is in point/ there is no point 4. +/* + * Special handling for WAL recods with RM_XACT_ID. + */ /recods/records 5. + if (xact_info == XLOG_XACT_COMMIT || + xact_info == XLOG_XACT_COMMIT_PREPARED) + { + xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader); + xl_xact_parsed_commit parsed; + int i; + + ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed); + for (i = 0; i < parsed.nrels; ++i) + { + ForkNumber forknum; + + for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum) + if (forknum != FSM_FORKNUM) + BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i], + forknum, 0); + } + } For SmgrCreate and Truncate I understand setting the 'limit block' but why for commit/abort? I think it would be better to add some comments here. 6.
+ * Caller must set private_data->tli to the TLI of interest, + * private_data->read_upto to the lowest LSN that is not known to be safe + * to read on that timeline, and private_data->historic to true if and only + * if the timeline is not the current timeline. This function will update + * private_data->read_upto and private_data->historic if more WAL appears + * on the current timeline or if the current timeline becomes historic. + */ +static int +summarizer_read_local_xlog_page(XLogReaderState *state, + XLogRecPtr targetPagePtr, int reqLen, + XLogRecPtr targetRecPtr, char *cur_page) The comments say "private_data->read_upto to the lowest LSN that is not known to be safe" but is it really the lowest LSN? I think it is the highest LSN this is known to be safe for that TLI no? 7. + /* If it's time to remove any old WAL summaries, do that now. */ + MaybeRemoveOldWalSummaries(); I was just wondering whether removing old summaries should be the job of the wal summarizer or it should be the job of the checkpointer, I mean while removing the old wals it can also check and remove the old summaries? Anyway, it's just a question and I do not have a strong opinion on this. 8. + /* + * Whether we we removed the file or not, we need not consider it + * again. + */ Typo /Whether we we removed/ Whether we removed 9. +/* + * Get an entry from a block reference table. + * + * If the entry does not exist, this function returns NULL. Otherwise, it + * returns the entry and sets *limit_block to the value from the entry. + */ +BlockRefTableEntry * +BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator, + ForkNumber forknum, BlockNumber *limit_block) If this function is already returning 'BlockRefTableEntry' then why would it need to set an extra '*limit_block' out parameter which it is actually reading from the entry itself? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
0001 looks OK to push, and since it stands on its own I would get it out of the way soon rather than waiting for the rest of the series to be further reviewed.

0002:
This moves bin/pg_verifybackup/parse_manifest.c to common/parse_manifest.c, where it's not clear that it's for backup manifests (which wasn't a problem in the previous location). I wonder if we're going to have anything else called a "manifest", in which case I propose to rename the file to make it clear that this is about backup manifests -- maybe parse_bkp_manifest.c. This patch looks pretty uncontroversial, but there's no point in going further with this one until the followup patches are closer to commit.

0003:
AmWalSummarizerProcess() is unused. Remove?

MaybeWalSummarizer() is called on each ServerLoop() in postmaster.c. This causes a function call to be emitted every time through, which looks odd; all other process starts have some triggering condition.

GetOldestUnsummarizedLSN uses while (true), but WaitForWalSummarization and SummarizeWAL use while (1). Maybe settle on one style?

Still reading this one.

0004:
In PrepareForIncrementalBackup(), the logic that determines earliest_wal_range_tli and latest_wal_range_tli looks pretty weird. I think it works fine if there's a single timeline, but not otherwise. Or maybe the trick is that it relies on the timelines returned by readTimeLineHistory being sorted backwards? If so, maybe add a comment about that somewhere; I don't think other callers of readTimeLineHistory make that assumption.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Postgres is bloatware by design: it was built to house PhD theses." (Joey Hellerstein, SIGMOD annual conference 2002)
Hi Robert,

[..spotted the v9 patchset..] so I've spent some time playing still with patchset v8 (without the 6/6 testing patch related to wal_level=minimal), except where marked otherwise as patchset v9.

1. At compile time there were two warnings about shadowed variables (at least with gcc version 10.2.1), but in v9 those are fixed:

blkreftable.c: In function ‘WriteBlockRefTable’:
blkreftable.c:520:24: warning: declaration of ‘brtentry’ shadows a previous local [-Wshadow=compatible-local]

walsummarizer.c: In function ‘SummarizeWAL’:
walsummarizer.c:833:36: warning: declaration of ‘private_data’ shadows a previous local [-Wshadow=compatible-local]

2. Usability thing: I hit the timeout hard: "This backup requires WAL to be summarized up to 0/90000D8, but summarizer has only reached 0/0." with summarize_wal=off (the default), but apparently this is in the TODO. Looks like an important usability thing.

3. I've verified that if the DB was in wal_level=minimal even temporarily (and thus summarization was disabled) it is impossible to take an incremental backup:

pg_basebackup: error: could not initiate base backup: ERROR: WAL summaries are required on timeline 1 from 0/70000D8 to 0/10000028, but the summaries for that timeline and LSN range are incomplete
DETAIL: The first unsummarized LSN is this range is 0/D04AE88.

4. As we have discussed off list, there is (was) this pg_combinebackup bug in v8's reconstruct_from_incremental_file() where it was unable to realize that - in case of combining multiple incremental backups - it should stop looking for the previous instance of the full file (while it was fine with v6 of the patchset). I've checked it on v9 - it is good now.

5. On v8 I've finally played a little bit with standby(s) and this patchset, with a couple of basic scenarios while mixing the source of the backups:

a. full on standby, incr1 on standby, full db restore (incl. incr1) on standby
# sometimes I'm getting spurious errors like these when doing incrementals on standby with -c fast:

2023-11-15 13:49:05.721 CET [10573] LOG: recovery restart point at 0/A000028
2023-11-15 13:49:07.591 CET [10597] WARNING: aborting backup due to backend exiting before pg_backup_stop was called
2023-11-15 13:49:07.591 CET [10597] ERROR: manifest requires WAL from final timeline 1 ending at 0/A0000F8, but this backup starts at 0/A000028
2023-11-15 13:49:07.591 CET [10597] STATEMENT: BASE_BACKUP ( INCREMENTAL, LABEL 'pg_basebackup base backup', PROGRESS, CHECKPOINT 'fast', WAIT 0, MANIFEST 'yes', TARGET 'client')

# when you retry the same pg_basebackup it goes fine (looks like a CHECKPOINT-on-standby/restartpoint <-> summarizer disconnect; I'll dig deeper tomorrow. It seems that issuing "CHECKPOINT; pg_sleep(1);" against the primary just before pg_basebackup --incr on the standby works around it)

b. full on primary, incr1 on standby, full db restore (incl. incr1) on standby # WORKS
c. full on standby, incr1 on standby, full db restore (incl. incr1) on primary # WORKS*
d. full on primary, incr1 on standby, full db restore (incl. incr1) on primary # WORKS*

* - needs pg_promote() due to the controlfile having the standby bit set, plus potential fiddling with postgresql.auto.conf as it contains the primary_connstring GUC.

6. Sci-fi-mode-on: I was wondering about the dangers of e.g. having a more recent pg_basebackup (e.g. from pg18 one day) running against pg17 in the scope of this incremental backup possibility. Is it going to be safe? (currently there seem to be no safeguards against such use) Or should those things (core, pg_basebackup) be running in version lock step?

Regards,
-J.
On 2023-Nov-13, Robert Haas wrote:

> On Mon, Nov 13, 2023 at 11:25 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > Also, it would be good to provide and use a
> > function to initialize a BlockRefTableKey from the RelFileNode and
> > forknum components, and ensure that any padding bytes are zeroed.
> > Otherwise it's not going to be a great hash key. On my platform there
> > aren't any (padding bytes), but I think it's unwise to rely on that.
>
> I'm having trouble understanding the second part of this suggestion.
> Note that in a frontend context, SH_RAW_ALLOCATOR is pg_malloc0, and
> in a backend context, we get the default, which is
> MemoryContextAllocZero. Maybe there's some case this doesn't cover,
> though?

I meant code like this

  memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
  key.forknum = forknum;
  entry = blockreftable_lookup(brtab->hash, key);

where any padding bytes in "key" could have arbitrary values, because they're not initialized. So I'd have a (maybe static inline) function BlockRefTableKeyInit(&key, rlocator, forknum) that fills it in for you.

Note:

  #define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0)

AFAICT the new simplehash uses in this patch series are the only ones that use memcmp() as SH_EQUAL, so we don't necessarily have precedent on lack of padding-byte initialization in existing uses of simplehash.

> > These forward struct declarations are not buying you anything, I'd
> > remove them:
>
> I've had problems from time to time when I don't do this. I'll remove
> it here, but I'm not convinced that it's always useless.

Well, certainly there are places where they are necessary.

> > I don't much like the way header files in src/bin/pg_combinebackup files
> > are structured. Particularly, causing a simplehash to be "instantiated"
> > just because load_manifest.h is included seems poised to cause pain. I
> > think there should be a file with the basic struct declarations (no
> > simplehash); and then maybe since both pg_basebackup and
> > pg_combinebackup seem to need the same simplehash, create a separate
> > header file containing just that. But, did you notice that anything
> > that includes reconstruct.h will instantiate the simplehash stuff,
> > because it includes load_manifest.h? It may be unwise to have the
> > simplehash in a header file. Maybe just declare it in each .c file that
> > needs it. The duplicity is not that large.
>
> I think that I did this correctly.

Oh, I hadn't grokked that we had this SH_SCOPE thing and a separate SH_DECLARE for it being extern. OK, please ignore that.

> > Why leave unnamed arguments in function declarations?
>
> I mean, I've changed it now, but I don't think it's worth getting too
> excited about.

Well, we did get into consistency arguments on this point previously. I agree it's not *terribly* important, but on thread https://www.postgresql.org/message-id/flat/CAH2-WznJt9CMM9KJTMjJh_zbL5hD9oX44qdJ4aqZtjFi-zA3Tg%40mail.gmail.com people got really worked up about this stuff.

> > In GetFileBackupMethod(), which arguments are in and which are out?
> > The comment doesn't say, and it's not obvious why we pass both the file
> > path as well as the individual constituent pieces for it.
>
> The header comment does document which values are potentially set on
> return. I guess I thought it was clear enough that the stuff not
> documented to be output parameters was input parameters. Most of them
> aren't even pointers, so they have to be input parameters. The only
> exception is 'path', which I have some difficulty thinking that anyone
> is going to imagine to be an input pointer.

An output pointer, you mean :-)  (Should it be const?)

When the return value is BACK_UP_FILE_FULLY, it's not clear what happens to these output values; we modify some, but why? Maybe they should be left alone? In that case, the "if size == 0" test should move a couple of lines up, into the brtentry == NULL block.

BTW, you could do the qsort() after deciding to back up the file fully if more than 90% needs to be replaced.

BTW, in sendDir(), why do

  lookup_path = pstrdup(pathbuf + basepathlen + 1);

when you could do

  lookup_path = pstrdup(tarfilename);

?

> > There are two functions named record_manifest_details_for_file() in
> > different programs.
>
> I had trouble figuring out how to name this stuff. I did notice the
> awkwardness, but surely nobody can think that two functions with the
> same name in different binaries can be actually the same function.

Of course not, but when cscope-jumping around, it is weird.

> If we want to inject more underscores here, my vote is to go all the
> way and make it per_wal_range_cb.

+1

> > In walsummarizer.c, HandleWalSummarizerInterrupts is called in
> > summarizer_read_local_xlog_page but SummarizeWAL() doesn't do that.
> > Maybe it should?
>
> I replaced all the CHECK_FOR_INTERRUPTS() in that file with
> HandleWalSummarizerInterrupts(). Does that seem right?

Looks to be what walwriter.c does at least, so I guess it's OK.

> > I think this path is not going to be very human-likeable.
> >   snprintf(final_path, MAXPGPATH,
> >            XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
> >            tli,
> >            LSN_FORMAT_ARGS(summary_start_lsn),
> >            LSN_FORMAT_ARGS(summary_end_lsn));
> > Why not add a dash between the TLI and between both LSNs, or something
> > like that?
>
> But I have a hard time arguing that it wouldn't be more readable still
> if we put some separator characters in there. I didn't do that because
> then they'd look less like WAL file names, but maybe that's not really
> a problem. A possible reason not to bother is that these files are
> less necessary for humans to care about than WAL files, since they
> don't need to be archived or transported between nodes in any way.
> Basically I think this is probably fine the way it is, but if you or
> others think it's really important to change it, I can do that. Just
> as long as we don't spend 50 emails arguing about which separator
> character to use.

Yeah, I just think that endless streams of hex chars are hard to read, and I've found myself following digits on the screen with my fingers in order to parse file names. I guess you could say thousands separators for regular numbers aren't needed either, but we do have them for readability's sake.

I think a new section in chapter 30 "Reliability and the Write-Ahead Log" is warranted. It would explain the summarization process, what the summary files are used for, and the deletion mechanism. I can draft something if you want.

It's not clear to me whether WalSummarizerCtl->pending_lsn is fulfilling some purpose or is just a leftover from prior development. I see it's only read in an assertion ... Maybe if we think this cross-check is important, it should be turned into an elog? Otherwise, I'd remove it.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"No me acuerdo, pero no es cierto. No es cierto, y si fuera cierto, no me acuerdo." (Augusto Pinochet a una corte de justicia)
On Tue, Nov 14, 2023 at 8:12 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> 0001 looks OK to push, and since it stands on its own I would get it out
> of the way soon rather than waiting for the rest of the series to be
> further reviewed.

All right, done.

> 0003:
> AmWalSummarizerProcess() is unused. Remove?

The intent seems to be to have one of these per enum value, whether it gets used or not. Some of the others aren't used, either.

> MaybeWalSummarizer() is called on each ServerLoop() in postmaster.c?
> This causes a function call to be emitted every time through. That
> looks odd. All other process starts have some triggering condition.

I'm not sure how much this matters, really. I would expect that the function call overhead here wouldn't be very noticeable. Generally I think that when ServerLoop returns from WaitEventSetWait it's going to be because we need to fork a process. That's pretty expensive compared to a function call. If we can iterate through this loop lots of times without doing any real work then it might matter, but I feel like that's probably not the case, and probably something we would want to fix if it were the case.

Now, I could nevertheless move some of the triggering conditions into MaybeStartWalSummarizer(), but moving, say, just the summarize_wal condition wouldn't be enough to avoid having MaybeStartWalSummarizer() called repeatedly when there was no work to do, because summarize_wal could be true and the summarizer could already be running. Similarly, if I moved just the WalSummarizerPID == 0 condition, the function would get called repeatedly without doing anything when summarize_wal = off. So at a minimum you have to move both of those if you care about avoiding the function call overhead, and then you have to wonder whether you care about the corner cases where the function would be called repeatedly for no gain even then.

Another approach would be to make the function static inline rather than just static. Or we could delete the whole function and just duplicate the logic it contains at both call sites. Personally I'm inclined to just leave it how it is in the absence of some evidence that there's a real problem here. It's nice to have all the triggering conditions in one place with nothing duplicated.

> GetOldestUnsummarizedLSN uses while(true), but WaitForWalSummarization
> and SummarizeWAL use while(1). Maybe settle on one style?

OK.

> 0004:
> in PrepareForIncrementalBackup(), the logic that determines
> earliest_wal_range_tli and latest_wal_range_tli looks pretty weird. I
> think it works fine if there's a single timeline, but not otherwise.
> Or maybe the trick is that it relies on timelines returned by
> readTimeLineHistory being sorted backwards? If so, maybe add a comment
> about that somewhere; I don't think other callers of readTimeLineHistory
> make that assumption.

It does indeed rely on that assumption, and the comment at the top of the for (i = 0; i < num_wal_ranges; ++i) loop explains that. Note also the comment just below that begins "If we found this TLI in the server's history".

I agree with you that this logic looks strange, and it's possible that there's some better way to encode the idea than what I've done here, but I think it might be just that the particular calculation we're trying to do here is strange. It's almost easier to understand the logic if you start by reading the sanity checks ("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X" et al.), look at the triggering conditions for those, and then work upward to see how earliest/latest_wal_range_tli get set, and then look up from there to see how saw_earliest/latest_wal_range_tli are used in computing those values.

We do rely on the ordering assumption elsewhere. For example, in XLogFileReadAnyTLI, see "if (tli < curFileTLI) break". We also use it to set expectedTLEs, which is documented to have this property. And AddWALInfoToBackupManifest relies on it too; see the comment "Because the timeline history file lists newer timelines before older ones" in AddWALInfoToBackupManifest. We're not entirely consistent about this, e.g., unlike XLogFileReadAnyTLI, tliInHistory() and tliOfPointInHistory() don't have an early exit provision, but we do use it in some places.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Nov 16, 2023 at 5:21 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I meant code like this
>
>   memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
>   key.forknum = forknum;
>   entry = blockreftable_lookup(brtab->hash, key);

Ah, I hadn't thought about that. Another way of handling that might be to add = {0} to the declaration of key. But I can do the initializer thing too if you think it's better. I'm not sure if there's an argument that the initializer might optimize better.

> An output pointer, you mean :-)  (Should it be const?)

I'm bad at const, but that seems to work, so sure.

> When the return value is BACK_UP_FILE_FULLY, it's not clear what happens
> to these output values; we modify some, but why? Maybe they should be
> left alone? In that case, the "if size == 0" test should move a couple
> of lines up, in the brtentry == NULL block.

OK.

> BTW, you could do the qsort() after deciding to backup the file fully if
> more than 90% needs to be replaced.

OK.

> BTW, in sendDir() why do
>   lookup_path = pstrdup(pathbuf + basepathlen + 1);
> when you could do
>   lookup_path = pstrdup(tarfilename);
> ?

No reason, changed.

> > If we want to inject more underscores here, my vote is to go all the
> > way and make it per_wal_range_cb.
>
> +1

Will look into this.

> Yeah, I just think that endless streams of hex chars are hard to read,
> and I've found myself following digits on the screen with my fingers in
> order to parse file names. I guess you could say thousands separators
> for regular numbers aren't needed either, but we do have them for
> readability's sake.

Sigh.

> I think a new section in chapter 30 "Reliability and the Write-Ahead
> Log" is warranted. It would explain the summarization process, what the
> summary files are used for, and the deletion mechanism. I can draft
> something if you want.

Sure, if you want to take a crack at it, that's great.

> It's not clear to me whether WalSummarizerCtl->pending_lsn is fulfilling
> some purpose or is just a leftover from prior development. I see it's
> only read in an assertion ... Maybe if we think this cross-check is
> important, it should be turned into an elog? Otherwise, I'd remove it.

I've been thinking about that. One thing I'm not quite sure about though is introspection. Maybe there should be a function that shows summarized_tli and summarized_lsn from WalSummarizerData, and maybe it should expose pending_lsn too.

--
Robert Haas
EDB: http://www.enterprisedb.com
On 2023-Oct-04, Robert Haas wrote:

> - I would like some feedback on the generation of WAL summary files.
> Right now, I have it enabled by default, and summaries are kept for a
> week. That means that, with no additional setup, you can take an
> incremental backup as long as the reference backup was taken in the
> last week. File removal is governed by mtimes, so if you change the
> mtimes of your summary files or whack your system clock around, weird
> things might happen. But obviously this might be inconvenient. Some
> people might not want WAL summary files to be generated at all because
> they don't care about incremental backup, and other people might want
> them retained for longer, and still other people might want them to be
> not removed automatically or removed automatically based on some
> criteria other than mtime. I don't really know what's best here. I
> don't think the default policy that the patches implement is
> especially terrible, but it's just something that I made up and I
> don't have any real confidence that it's wonderful. One point to
> consider here is that, if WAL summarization is enabled, checkpoints
> can't remove WAL that isn't summarized yet. Mostly that's not a
> problem, I think, because the WAL summarizer is pretty fast. But it
> could increase disk consumption for some people. I don't think that we
> need to worry about the summaries themselves being a problem in terms
> of space consumption; at least in all the cases I've tested, they're
> just not very big.

So, wal_summary is no longer turned on by default, I think following a comment from Peter E. I think this is a good decision, as we're only going to need summaries on servers from which incremental backups are going to be taken, which is a strict subset of all servers; and furthermore, people who need them are going to realize that very easily, while if we went the other way around, most people would not realize that they need to turn them off to save some resource consumption.

Granted, the amount of resources additionally used is probably not very big. And since it can be changed with a reload, not a restart, it doesn't seem problematic.

... oh, I just noticed that this patch now fails to compile because of the MemoryContextResetAndDeleteChildren removal.

(Typo in the pg_walsummary manpage: "since WAL summary files primary exist" -> "primarily")

> - On a related note, I haven't yet tested this on a standby, which is
> a thing that I definitely need to do. I don't know of a reason why it
> shouldn't be possible for all of this machinery to work on a standby
> just as it does on a primary, but then we need the WAL summarizer to
> run there too, which could end up being a waste if nobody ever tries
> to take an incremental backup. I wonder how that should be reflected
> in the configuration. We could do something like what we've done for
> archive_mode, where on means "only on if this is a primary" and you
> have to say always if you want it to run on standbys as well ... but
> I'm not sure if that's a design pattern that we really want to
> replicate into more places. I'd be somewhat inclined to just make
> whatever configuration parameters we need to configure this thing on
> the primary also work on standbys, and you can set each server up as
> you please. But I'm open to other suggestions.

I think it should default to off on both primary and standby, and the user has to enable it on whichever server they want to take backups from.

> - We need to settle the question of whether to send the whole backup
> manifest to the server or just the LSN. In a previous attempt at
> incremental backup, we decided the whole manifest was necessary,
> because flat-copying files could make new data show up with old LSNs.
> But that version of the patch set was trying to find modified blocks
> by checking their LSNs individually, not by summarizing WAL. And since
> the operations that flat-copy files are WAL-logged, the WAL summary
> approach seems to eliminate that problem - maybe an LSN (and the
> associated TLI) is good enough now. This also relates to Jakub's
> question about whether this machinery could be used to fast-forward a
> standby, which is not exactly a base backup but ... perhaps close
> enough? I'm somewhat inclined to believe that we can simplify to an
> LSN and TLI; however, if we do that, then we'll have big problems if
> later we realize that we want the manifest for something after all. So
> if anybody thinks that there's a reason to keep doing what the patch
> does today -- namely, upload the whole manifest to the server --
> please speak up.

I don't understand this point. Currently, the protocol is that UPLOAD_MANIFEST is used to send the manifest prior to requesting the backup. You seem to be saying that you're thinking of removing support for UPLOAD_MANIFEST and instead just giving the LSN as an option to the BASE_BACKUP command?

> - It's regrettable that we don't have incremental JSON parsing;

We now do have it, at least in patch form. I guess the question is whether we're going to accept it in core. I see the chances of changing the format of the manifest as rather slim at this point, and the need for very large manifests is likely to go up with time, so we probably need to take that code and polish it up, and see if we can improve its performance.

> - Right now, I have a hard-coded 60 second timeout for WAL
> summarization. If you try to take an incremental backup and the WAL
> summaries you need don't show up within 60 seconds, the backup times
> out. I think that's a reasonable default, but should it be
> configurable? If yes, should that be a GUC or, perhaps better, a
> pg_basebackup option?

I'd rather have a way for the server to provide diagnostics on why the summaries aren't being produced. Maybe a server running under valgrind is going to fail and need a longer timeout, but otherwise a hardcoded one seems sufficient.

You did say later that you thought summary files would just go from one checkpoint to the next. So the only question is at what point the file for the last checkpoint (i.e. from the previous one up to the one requested by pg_basebackup) is written. If walsummarizer keeps almost the complete state in memory and just waits for the checkpoint record to write it, then it's probably okay.

> - I'm curious what people think about the pg_walsummary tool that is
> included in 0006. I think it's going to be fairly important for
> debugging, but it does feel a little bit bad to add a new binary for
> something pretty niche. Nevertheless, merging it into any other
> utility seems relatively awkward, so I'm inclined to think both that
> this should be included in whatever finally gets committed and that it
> should be a separate binary. I considered whether it should go in
> contrib, but we seem to have moved to a policy that heavily favors
> limiting contrib to extensions and loadable modules, rather than
> binaries.

I propose to keep the door open for that binary doing other things than dumping the files as text. So add a command argument, which currently can only be "dump", to allow the command to do other things later if needed. (For example, remove files from a server on which summarize_wal has been turned off; or perhaps remove files that are below some LSN.)

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Estoy de acuerdo contigo en que la verdad absoluta no existe... El problema es que la mentira sí existe y tu estás mintiendo" (G. Lama)
On 2023-Nov-16, Robert Haas wrote:

> On Thu, Nov 16, 2023 at 5:21 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > I meant code like this
> >
> >   memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
> >   key.forknum = forknum;
> >   entry = blockreftable_lookup(brtab->hash, key);
>
> Ah, I hadn't thought about that. Another way of handling that might be
> to add = {0} to the declaration of key. But I can do the initializer
> thing too if you think it's better. I'm not sure if there's an
> argument that the initializer might optimize better.

I think the {0} initializer is good enough, given a comment to indicate why.

> > It's not clear to me whether WalSummarizerCtl->pending_lsn is fulfilling
> > some purpose or is just a leftover from prior development. I see it's
> > only read in an assertion ... Maybe if we think this cross-check is
> > important, it should be turned into an elog? Otherwise, I'd remove it.
>
> I've been thinking about that. One thing I'm not quite sure about
> though is introspection. Maybe there should be a function that shows
> summarized_tli and summarized_lsn from WalSummarizerData, and maybe it
> should expose pending_lsn too.

True.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
On 2023-Nov-16, Alvaro Herrera wrote:

> On 2023-Oct-04, Robert Haas wrote:
> > - Right now, I have a hard-coded 60 second timeout for WAL
> > summarization. If you try to take an incremental backup and the WAL
> > summaries you need don't show up within 60 seconds, the backup times
> > out. I think that's a reasonable default, but should it be
> > configurable? If yes, should that be a GUC or, perhaps better, a
> > pg_basebackup option?
>
> I'd rather have a way for the server to provide diagnostics on why the
> summaries aren't being produced. Maybe a server running under valgrind
> is going to fail and need a longer one, but otherwise a hardcoded
> timeout seems sufficient.
>
> You did say later that you thought summary files would just go from one
> checkpoint to the next. So the only question is at what point the file
> for the last checkpoint (i.e. from the previous one up to the one
> requested by pg_basebackup) is written. If walsummarizer keeps almost
> the complete state in memory and just waits for the checkpoint record to
> write it, then it's probably okay.

On 2023-Nov-16, Alvaro Herrera wrote:

> On 2023-Nov-16, Robert Haas wrote:
> > I've been thinking about that. One thing I'm not quite sure about
> > though is introspection. Maybe there should be a function that shows
> > summarized_tli and summarized_lsn from WalSummarizerData, and maybe it
> > should expose pending_lsn too.
>
> True.

Putting those two thoughts together, I think pg_basebackup with --progress could tell you "still waiting for the summary file up to LSN %X/%X to appear, and the walsummarizer is currently handling lsn %X/%X" or something like that. This would probably require two concurrent connections, one to run BASE_BACKUP and another to inquire about server state; but this should be easy enough to integrate together with parallel basebackup later.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
On Thu, Nov 16, 2023 at 12:34 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Putting those two thoughts together, I think pg_basebackup with > --progress could tell you "still waiting for the summary file up to LSN > %X/%X to appear, and the walsummarizer is currently handling lsn %X/%X" > or something like that. This would probably require two concurrent > connections, one to run BASE_BACKUP and another to inquire server state; > but this should easy enough to integrate together with parallel > basebackup later. I had similar thoughts, except I was thinking it would be better to have the warnings be generated on the server side. That would save the need for a second libpq connection, which would be good, because I think adding that would result in a pretty large increase in complexity and some not-so-great user-visible consequences. In fact, my latest thought is to just remove the timeout altogether, and emit warnings like this: WARNING: still waiting for WAL summarization to reach %X/%X after %d seconds, currently at %X/%X We could emit that every 30 seconds or so until either the situation resolves itself or the user hits ^C. I think that would be good enough here. If we want, the interval between messages can be a GUC, but I don't know how much real need there will be to tailor that. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Nov 16, 2023 at 12:23 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > So, wal_summary is no longer turned on by default, I think following a > comment from Peter E. I think this is a good decision, as we're only > going to need them on servers from which incremental backups are going > to be taken, which is a strict subset of all servers; and furthermore, > people that need them are going to realize that very easily, while if we > went the other way around most people would not realize that they need to > turn them off to save some resource consumption. > > Granted, the amount of resources additionally used is probably not very > big. But since it can be changed with a reload not restart, it doesn't > seem problematic. Yeah. I meant to say that I'd changed that for that reason, but in the flurry of new versions I omitted to do so. > ... oh, I just noticed that this patch now fails to compile because of > the MemoryContextResetAndDeleteChildren removal. Fixed. > (Typo in the pg_walsummary manpage: "since WAL summary files primary exist" -> "primarily") This, too. > I think it should default to off in primary and standby, and the user > has to enable it in whichever server they want to take backups from. Yeah, that's how it works currently. > I don't understand this point. Currently, the protocol is that > UPLOAD_MANIFEST is used to send the manifest prior to requesting the > backup. You seem to be saying that you're thinking of removing support > for UPLOAD_MANIFEST and instead just give the LSN as an option to the > BASE_BACKUP command? I don't think I'd want to do exactly that, because then you could only send one LSN, and I do think we want to send a set of LSN ranges with the corresponding TLI for each. I was thinking about dumping UPLOAD_MANIFEST and instead having a command like: INCREMENTAL_WAL_RANGE 1 2/462AC48 2/462C698 The client would execute this command one or more times before starting an incremental backup.
> I propose to keep the door open for that binary doing other things than > dumping the files as text. So add a command argument, which currently > can only be "dump", to allow the command to do other things later if > needed. (For example, remove files from a server on which summarize_wal > has been turned off; or perhaps remove files that are below some LSN.) I don't like that very much. That sounds like one of those forward-compatibility things that somebody designs and then nothing ever happens and ten years later you still have an ugly wart. My theory is that these files are going to need very little management. In general, they're small; if you never removed them, it probably wouldn't hurt, or at least, not for a long time. As to specific use cases, if you want to remove files from a server on which summarize_wal has been turned off, you can just use rm. Removing files from before a certain LSN would probably need a bit of scripting, but only a bit. Conceivably we could provide something like that in core, but it doesn't seem necessary, and it also seems to me that we might do well to include that in pg_archivecleanup rather than in pg_walsummary. Here's a new version. Changes: - Add preparatory renaming patches to the series. - Rename wal_summarize_keep_time to wal_summary_keep_time. - Change while (true) to while (1). - Typo fixes. - Fix incorrect assertion in summarizer_read_local_xlog_page; this could cause occasional regression test failures in 004_pg_xlog_symlink and 009_growing_files. - Zero-initialize BlockRefTableKey variables. - Replace a couple of instances of pathbuf + basepathlen + 1 with tarfilename. - Add const to path argument of GetFileBackupMethod. - Avoid setting output parameters of GetFileBackupMethod unless the return value is BACK_UP_FILE_INCREMENTALLY. - In GetFileBackupMethod, postpone qsorting block numbers slightly.
- Define INCREMENTAL_PREFIX_LENGTH using sizeof(), because that should hopefully work everywhere and the StaticAssertStmt that checks the value of this doesn't work on Windows. - Change MemoryContextResetAndDeleteChildren to MemoryContextReset. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
- v10-0003-Move-src-bin-pg_verifybackup-parse_manifest.c-in.patch
- v10-0006-Add-new-pg_walsummary-tool.patch
- v10-0004-Add-a-new-WAL-summarizer-process.patch
- v10-0007-Test-patch-Enable-summarize_wal-by-default.patch
- v10-0005-Add-support-for-incremental-backup.patch
- v10-0002-Rename-pg_verifybackup-s-JsonManifestParseContex.patch
- v10-0001-Rename-JsonManifestParseContext-callbacks.patch
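As a rough illustration of the "bit of scripting" that removing summary files from before a certain LSN would take, here is a hypothetical helper. Nothing here is part of the patch set; in particular, the fixed-width upper-case hex layout assumed for the $TLI${START_LSN}${END_LSN}.summary names (8 digits of TLI, 16 for each LSN) is an assumption for the sketch, not something verified against the patches.

```python
import os
import re

# Assumed layout: 8 hex digits of TLI, then 16 each for start/end LSN.
SUMMARY_RE = re.compile(r"^([0-9A-F]{8})([0-9A-F]{16})([0-9A-F]{16})\.summary$")

def parse_lsn(s):
    """Convert the usual %X/%X notation, e.g. '2/462C698', to an integer."""
    hi, lo = s.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def prune_summaries(summary_dir, cutoff, dry_run=True):
    """Return (and, unless dry_run, unlink) summary files lying wholly
    before the cutoff LSN, i.e. whose end LSN is <= cutoff."""
    cutoff_lsn = parse_lsn(cutoff)
    victims = []
    for name in sorted(os.listdir(summary_dir)):
        m = SUMMARY_RE.match(name)
        if not m:
            continue  # not a WAL summary file; leave it alone
        end_lsn = int(m.group(3), 16)
        if end_lsn <= cutoff_lsn:
            victims.append(name)
            if not dry_run:
                os.unlink(os.path.join(summary_dir, name))
    return victims
```

Running with dry_run=True first shows what would go; as suggested above, the same logic could just as well live in pg_archivecleanup.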
I made a pass over pg_combinebackup for NLS. I propose the attached patch. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Right now the sectors on the hard disk run clockwise, but I heard a rumor that you can squeeze 0.2% more throughput by running them counterclockwise. It's worth the effort. Recommended." (Gerry Pourwelle)
Attachment
On Fri, Nov 17, 2023 at 5:01 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > I made a pass over pg_combinebackup for NLS. I propose the attached > patch. This doesn't quite compile for me so I changed a few things and incorporated it. Hopefully I didn't mess anything up. Here's v11. In addition to incorporating Álvaro's NLS changes, with the off-list help of Jakub Wartak, I finally tracked down two one-line bugs in BlockRefTableEntryGetBlocks that have been causing the cfbot to blow up on these patches. What I hadn't realized is that cfbot runs with the relation segment size changed to 6 blocks, which tickled some code paths that I wasn't exercising locally. Thanks a ton to Jakub for the help running this down. cfbot was unhappy about a %lu so I've changed that to %zu in this version, too. Finally, the previous version of this patch set had some pgindent damage, so that is hopefully now cleaned up as well. I wish I had better ideas about how to thoroughly test this. I've got a bunch of different tests for pg_combinebackup and I think those are good, but the bugs mentioned in the previous paragraph show that those aren't sufficient to catch all of the logic errors that can exist, which is not great. But, as I say, I'm not quite sure how to do better, so I guess I'll just need to keep fixing problems as we find them. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
- v11-0003-Move-src-bin-pg_verifybackup-parse_manifest.c-in.patch
- v11-0002-Rename-pg_verifybackup-s-JsonManifestParseContex.patch
- v11-0001-Rename-JsonManifestParseContext-callbacks.patch
- v11-0005-Add-support-for-incremental-backup.patch
- v11-0004-Add-a-new-WAL-summarizer-process.patch
- v11-0006-Add-new-pg_walsummary-tool.patch
- v11-0007-Test-patch-Enable-summarize_wal-by-default.patch
On 2023-Nov-16, Robert Haas wrote: > On Thu, Nov 16, 2023 at 12:23 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > I don't understand this point. Currently, the protocol is that > > UPLOAD_MANIFEST is used to send the manifest prior to requesting the > > backup. You seem to be saying that you're thinking of removing support > > for UPLOAD_MANIFEST and instead just give the LSN as an option to the > > BASE_BACKUP command? > > I don't think I'd want to do exactly that, because then you could only > send one LSN, and I do think we want to send a set of LSN ranges with > the corresponding TLI for each. I was thinking about dumping > UPLOAD_MANIFEST and instead having a command like: > > INCREMENTAL_WAL_RANGE 1 2/462AC48 2/462C698 > > The client would execute this command one or more times before > starting an incremental backup. That sounds good to me. Not having to parse the manifest server-side sounds like a win, as does saving the transfer, for the cases where the manifest is large. Is this meant to support multiple timelines each with non-overlapping adjacent ranges, rather than multiple non-adjacent ranges? Do I have it right that you want to rewrite this bit before considering this ready to commit? -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "No nos atrevemos a muchas cosas porque son difíciles, pero son difíciles porque no nos atrevemos a hacerlas" (Séneca)
On Mon, Nov 20, 2023 at 2:03 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > That sounds good to me. Not having to parse the manifest server-side > sounds like a win, as does saving the transfer, for the cases where the > manifest is large. OK. I'll look into this next week, hopefully. > Is this meant to support multiple timelines each with non-overlapping > adjacent ranges, rather than multiple non-adjacent ranges? Correct. I don't see how non-adjacent LSN ranges could ever be a useful thing, but adjacent ranges on different timelines are useful. > Do I have it right that you want to rewrite this bit before considering > this ready to commit? For sure. I don't think this is the only thing that needs to be revised before commit, but it's definitely *a* thing that needs to be revised before commit. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Nov 20, 2023 at 2:10 PM Robert Haas <robertmhaas@gmail.com> wrote: > > Is this meant to support multiple timelines each with non-overlapping > > adjacent ranges, rather than multiple non-adjacent ranges? > > Correct. I don't see how non-adjacent LSN ranges could ever be a > useful thing, but adjacent ranges on different timelines are useful. Thinking about this a bit more, there are a couple of things we could do here in terms of syntax. One idea is to give up on having a separate MANIFEST-WAL-RANGE command for each range and instead just cram everything into either a single command: MANIFEST-WAL-RANGES {tli} {startlsn} {endlsn}... Or even into a single option to the BASE_BACKUP command: BASE_BACKUP yadda yadda INCREMENTAL 'tli@startlsn-endlsn,...' Or, since we expect adjacent, non-overlapping ranges, you could even arrange to elide the duplicated boundary LSNs, e.g. MANIFEST-WAL-RANGES {{tli} {lsn}}... {final-lsn} Or BASE_BACKUP yadda yadda INCREMENTAL 'tli@lsn,...,final-lsn' I'm not sure what's best here. Trying to trim out the duplicated boundary LSNs feels a bit like rearrangement for the sake of rearrangement, but maybe it isn't really. -- Robert Haas EDB: http://www.enterprisedb.com
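For concreteness, the 'tli@startlsn-endlsn,...' form could be parsed and validated along these lines. This is purely illustrative — none of these syntaxes was settled on, and the function names are invented for the sketch:

```python
def parse_lsn(s):
    """Parse the usual %X/%X LSN notation into a 64-bit integer."""
    hi, lo = s.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def parse_incremental_option(spec):
    """Parse 'tli@startlsn-endlsn,...' into (tli, start, end) tuples,
    enforcing the adjacent, non-overlapping property discussed above:
    each range must start exactly where the previous one ended."""
    ranges = []
    for item in spec.split(","):
        tli_part, lsns = item.split("@")
        start_s, end_s = lsns.split("-")
        tli, start, end = int(tli_part), parse_lsn(start_s), parse_lsn(end_s)
        if start >= end:
            raise ValueError("empty or backwards range: " + item)
        if ranges and ranges[-1][2] != start:
            raise ValueError("ranges must be adjacent: " + item)
        ranges.append((tli, start, end))
    return ranges
```

The adjacency check is what makes the elided-boundary variants equivalent: since every interior LSN appears twice, it can be written once without loss.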
On Mon, Nov 20, 2023 at 4:43 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Nov 17, 2023 at 5:01 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > I made a pass over pg_combinebackup for NLS. I propose the attached > > patch. > > This doesn't quite compile for me so I changed a few things and > incorporated it. Hopefully I didn't mess anything up. > > Here's v11. [..] > I wish I had better ideas about how to thoroughly test this. [..] Hopefully the below add some confidence, I've done some further quick(?) checks today and results are good: make check-world #GOOD test_full_pri__incr_stby__restore_on_pri.sh #GOOD test_full_pri__incr_stby__restore_on_stby.sh #GOOD* test_full_stby__incr_stby__restore_on_pri.sh #GOOD test_full_stby__incr_stby__restore_on_stby.sh #GOOD* test_many_incrementals_dbcreate.sh #GOOD test_many_incrementals.sh #GOOD test_multixact.sh #GOOD test_pending_2pc.sh #GOOD test_reindex_and_vacuum_full.sh #GOOD test_truncaterollback.sh #GOOD test_unlogged_table.sh #GOOD test_across_wallevelminimal.sh # GOOD(expected failure, that walsummaries are off during walminimal and incremental cannot be taken--> full needs to be taken after wal_level=minimal) CFbot failed on two hosts this time, I haven't looked at the details yet (https://cirrus-ci.com/task/6425149646307328 -> end of EOL? -> LOG: WAL summarizer process (PID 71511) was terminated by signal 6: Aborted?) The remaining test idea is to have a longer running DB under stress test in more real-world conditions and try to recover using chained incremental backups (one such test was carried out on patchset v6 and the result was good back then). -J.
On Wed, Nov 22, 2023 at 3:14 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > CFbot failed on two hosts this time, I haven't looked at the details > yet (https://cirrus-ci.com/task/6425149646307328 -> end of EOL? -> > LOG: WAL summarizer process (PID 71511) was terminated by signal 6: > Aborted?) Robert pinged me to see if I had any ideas. The reason it fails on Windows is because there is a special code path for WIN32 in the patch's src/bin/pg_combinebackup/copy_file.c, but it is incomplete: it returns early without feeding the data into the checksum, so all the checksums have the same initial and bogus value. I commented that part out so it took the normal path like Unix, and it passed. The reason it fails on Linux 32 bit with -fsanitize is because this has uncovered a bug in xlogreader.c, which overflows a 32 bit pointer when doing a size test that could easily be changed to a non-overflowing formulation. AFAICS it is not a live bug because it comes to the right conclusion without dereferencing the pointer due to other checks, but the sanitizer is not wrong to complain about it and I will post a patch to fix that in a new thread. With the draft patch I am testing, the sanitizer is happy and this passes too.
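The non-overflowing reformulation mentioned here is the standard idiom of comparing the requested size against the space remaining, rather than computing ptr + size. A small sketch simulating 32-bit pointer arithmetic in Python with masking; this is illustrative only, not the actual xlogreader code:

```python
MASK32 = 0xFFFFFFFF  # simulate 32-bit pointer arithmetic

def fits_overflowing(ptr, size, end):
    """Problematic form: ptr + size can wrap past 2^32 and then
    falsely compare as in-bounds."""
    return ((ptr + size) & MASK32) <= end

def fits_safe(ptr, size, end):
    """Safe form: compare the request against the space remaining;
    there is no addition that can overflow."""
    return ptr <= end and size <= end - ptr
```

With ptr = 0xFFFFF000 and end = 0xFFFFF800 (2 KB left), an 8 KB request wraps to 0x1000 in the first form and wrongly appears to fit, while the second form rejects it.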
On Thu, Nov 23, 2023 at 11:18 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Robert pinged me to see if I had any ideas. Thanks, Thomas. > The reason it fails on Windows is because there is a special code path > for WIN32 in the patch's src/bin/pg_combinebackup/copy_file.c, but it > is incomplete: it returns early without feeding the data into the > checksum, so all the checksums have the same initial and bogus value. > I commented that part out so it took the normal path like Unix, and it > passed. Yikes, that's embarrassing. Thanks for running it down. There is logic in the caller to figure out whether we need to recompute the checksum or can reuse one we already have, but copy_file() doesn't understand that it should take the slow path if a new checksum computation is required. > The reason it fails on Linux 32 bit with -fsanitize is because this > has uncovered a bug in xlogreader.c, which overflows a 32 bit pointer > when doing a size test that could easily be changed to a non-overflowing > formulation. AFAICS it is not a live bug because it comes to the > right conclusion without dereferencing the pointer due to other checks, > but the sanitizer is not wrong to complain about it and I will post a > patch to fix that in a new thread. With the draft patch I am testing, > the sanitizer is happy and this passes too. Thanks so much. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Nov 15, 2023 at 9:14 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > so I've spent some time playing still with patchset v8 (without the > 6/6 testing patch related to wal_level=minimal), with the exception of > - patchset v9 - marked otherwise. Thanks, as usual, for that. > 2. Usability thing: I hit the timeout hard: "This backup requires WAL > to be summarized up to 0/90000D8, but summarizer has only reached > 0/0." with summarize_wal=off (default) but apparently this is in TODO. > Looks like an important usability thing. All right. I'd sort of forgotten about the need to address that issue, but apparently, I need to re-remember. > 5. On v8 i've finally played a little bit with standby(s) and this > patchset with couple of basic scenarios while mixing source of the > backups: > > a. full on standby, incr1 on standby, full db restore (incl. incr1) on standby > # sometimes i'm getting spurious error like those when doing > incrementals on standby with -c fast : > 2023-11-15 13:49:05.721 CET [10573] LOG: recovery restart point > at 0/A000028 > 2023-11-15 13:49:07.591 CET [10597] WARNING: aborting backup due > to backend exiting before pg_backup_stop was called > 2023-11-15 13:49:07.591 CET [10597] ERROR: manifest requires WAL > from final timeline 1 ending at 0/A0000F8, but this backup starts at > 0/A000028 > 2023-11-15 13:49:07.591 CET [10597] STATEMENT: BASE_BACKUP ( > INCREMENTAL, LABEL 'pg_basebackup base backup', PROGRESS, > CHECKPOINT 'fast', WAIT 0, MANIFEST 'yes', TARGET 'client') > # when you retry the same pg_basebackup it goes fine (looks like > CHECKPOINT on standby/restartpoint <-> summarizer disconnect, I'll dig > deeper tomorrow. It seems that issuing "CHECKPOINT; pg_sleep(1);" > against primary just before pg_basebackup --incr on standby > works around it) > > b. full on primary, incr1 on standby, full db restore (incl. incr1) on > standby # WORKS > c. full on standby, incr1 on standby, full db restore (incl.
incr1) on > primary # WORKS* > d. full on primary, incr1 on standby, full db restore (incl. incr1) on > primary # WORKS* > > * - needs pg_promote() due to the controlfile having standby bit + > potential fiddling with postgresql.auto.conf as it is having > primary_conninfo GUC. Well, "manifest requires WAL from final timeline 1 ending at 0/A0000F8, but this backup starts at 0/A000028" is a valid complaint, not a spurious error. It's essentially saying that WAL replay for this incremental backup would have to begin at a location that is earlier than where replay for the earlier backup would have to end while recovering that backup. It's almost like you're trying to go backwards in time, with the incremental happening before the full backup instead of after it. I think the reason this is happening is that when you take a backup, recovery has to start from the previous checkpoint. On the primary, we perform a new checkpoint and plan to start recovery from it. But on a standby, we can't perform a new checkpoint, since we can't write WAL, so we arrange for recovery of the backup to begin from the most recent checkpoint. And if you do two backups on the standby in a row without much happening in the middle, then the most recent checkpoint will be the same for both. And that I think is what's resulting in this error, because the end of the backup follows the start of the backup, so if two consecutive backups have the same start, then the start of the second one will precede the end of the first one. One thing that's interesting to note here is that there is no point in performing an incremental backup under these circumstances. You would accrue no advantage over just letting replay continue further from the full backup.
The whole point of an incremental backup is that it lets you "fast forward" your older backup -- you could have just replayed all the WAL from the older backup until you got to the start LSN of the newer backup, but reconstructing a backup that can start replay from the newer LSN directly is, hopefully, quicker than replaying all of that WAL. But in this scenario, you're starting from the same checkpoint no matter what -- the amount of WAL replay required to reach any given LSN will be unchanged. So storing an incremental backup would be strictly a loss. Another interesting point to consider is that you could also get this complaint by doing something like take the full backup from the primary, and then try to take an incremental backup from a standby, maybe even a time-delayed standby that's far behind the primary. In that case, you would really be trying to take an incremental backup before you actually took the full backup, as far as LSN time goes. I'm not quite sure what to do about any of this. I think the error is correct and accurate, but understanding what it means and why it's happening and what to do about it is probably going to be difficult for people. Possibly we should have documentation that talks you through all of this. Or possibly there are ways to elaborate on the error message itself. But I'm a little skeptical about the latter approach because it's all so complicated. I don't know that we can summarize it in a sentence or two. > 6. Sci-fi-mode-on: I was wondering about the dangers of e.g. having > more recent pg_basebackup (e.g. from pg18 one day) running against > pg17 in the scope of having this incremental backups possibility. Is > it going to be safe? (currently there seems to be no safeguards > against such use) or should those things (core, pg_basebackup) should > be running in version lock step? I think it should be safe, actually. pg_basebackup has no reason to care about WAL format changes across versions. 
It doesn't even care about the format of the WAL summaries, which it never sees, but only needs the server to have. If we change the format of the incremental files that are included in the backup, then we will need backward-compatibility code, or we can disallow cross-version operations. I don't currently foresee a need to do that, but you never know. It's manageable in any event. But note that I also didn't (and can't, without a lot of ugliness) make pg_combinebackup version-independent. So you could think of taking incremental backups with a different version of pg_basebackup, but if you want to restore you're going to need a matching version of pg_combinebackup. -- Robert Haas EDB: http://www.enterprisedb.com
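The invariant behind the "manifest requires WAL from final timeline ... but this backup starts at ..." error discussed above can be restated as a toy check. This is a hedged sketch — the names are invented and the real server-side logic lives elsewhere:

```python
def fmt_lsn(lsn):
    """Render a 64-bit LSN in the usual %X/%X notation."""
    return "%X/%X" % (lsn >> 32, lsn & 0xFFFFFFFF)

def check_backup_ordering(final_tli, prior_end_lsn, incr_start_lsn):
    """An incremental backup's start LSN (the redo point of the checkpoint
    it will replay from) must not precede the LSN where the prior backup's
    WAL requirement ends on the shared final timeline; otherwise replay
    would have to run backwards in time."""
    if incr_start_lsn < prior_end_lsn:
        raise ValueError(
            "manifest requires WAL from final timeline %d ending at %s, "
            "but this backup starts at %s"
            % (final_tli, fmt_lsn(prior_end_lsn), fmt_lsn(incr_start_lsn)))
```

In the standby scenario described earlier, two consecutive backups taken from the same restartpoint have identical starts, so the second one's start precedes the first one's end and this check fires.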
New patch set. 0001: Rename JsonManifestParseContext callbacks, per feedback from Álvaro. Not logically related to the rest of this, except by code proximity. Separately committable, if nobody objects. 0002: Rename record_manifest_details_for_{file,wal_range}, per feedback from Álvaro that the names were too generic. Separately committable, if nobody objects. 0003: Move parse_manifest.c to src/common. No significant changes since the previous version. 0004: Add a new WAL summarizer process. No significant changes since the previous version. 0005: Incremental backup itself. Changes: - Remove UPLOAD_MANIFEST replication command and instead add INCREMENTAL_WAL_RANGE replication command. - In consequence, load_manifest.c which was included in the previous patch sets now moves to src/fe_utils and has some adjustments. - Actually document the new replication command which I overlooked previously. - Error out promptly if an incremental backup is attempted with summarize_wal = off. - Fix test in copy_file(). We should be willing to use the fast-path if a new checksum is *not* required, but the sense of the test was inverted in previous versions. - Fix some handling of the missing-manifest case in pg_combinebackup. - Fix terminology in a help message. 0006: Add pg_walsummary tool. No significant changes since the previous version. 0007: Test patch, not for commit. As far as I know, the main commit-blockers here are (1) the timeout when waiting for WAL summarization is still hard-coded to 60 seconds and (2) the ubsan issue that Thomas hunted down, which would cause at least the entire CF environment and maybe some portion of the BF to turn red if this were committed. That issue is in xlogreader rather than in this patch set, at least in part, but it still needs fixing before this goes ahead. I also suspect that the slightly more significant refactoring in this version may turn up a few new bugs in the CF environment.
I think once the aforementioned items are sorted out, this could be committed through 0005, and 0001 and 0002 could be committed sooner. 0006 should have some tests written before it gets committed, but it doesn't necessarily have to be committed at the exact same moment as everything else, and writing tests isn't that hard, either. Other loose ends that would be nice to tidy up at some point: - Incremental JSON parsing so we can handle huge manifests. - More documentation as proposed by Álvaro but I'm failing to find the details of his proposal right now. - More introspection facilities, maybe, or possibly rip some stuff out of WalSummarizerCtl if we don't want it. This one might be a higher priority to address before initial commit, but it's probably not absolutely critical, either. I'm not quite sure how aggressively to press forward with getting stuff committed. I'd certainly rather debug as much as I can locally and via cfbot before turning the buildfarm pretty colors, but I think it generally works out better when larger features get pushed earlier in the cycle rather than in the mad frenzy right before feature freeze, so I'm not inclined to be too patient, either. ...Robert
Attachment
- v12-0002-Rename-pg_verifybackup-s-JsonManifestParseContex.patch
- v12-0001-Rename-JsonManifestParseContext-callbacks.patch
- v12-0003-Move-src-bin-pg_verifybackup-parse_manifest.c-in.patch
- v12-0004-Add-a-new-WAL-summarizer-process.patch
- v12-0006-Add-new-pg_walsummary-tool.patch
- v12-0005-Add-support-for-incremental-backup.patch
- v12-0007-Test-patch-Enable-summarize_wal-by-default.patch
On Thu, Nov 30, 2023 at 9:33 AM Robert Haas <robertmhaas@gmail.com> wrote: > 0005: Incremental backup itself. Changes: > - Remove UPLOAD_MANIFEST replication command and instead add > INCREMENTAL_WAL_RANGE replication command. Unfortunately, I think this change is going to need to be reverted. Jakub reported a problem to me off-list, which I think boils down to this: take a full backup on the primary. create a database on the primary. now take an incremental backup on the standby using the full backup from the primary as the prior backup. What happens at this point depends on how far replay has progressed on the standby. I think there are three scenarios: (1) If replay has not yet reached a checkpoint later than the one at which the full backup began, then taking the incremental backup will fail. This is correct, because it makes no sense to take an incremental backup that goes backwards in time, and it's pointless to take one that goes forwards but not far enough to reach the next checkpoint, as you won't save anything. (2) If replay has progressed far enough that the redo pointer is now beyond the CREATE DATABASE record, then everything is fine. (3) But if the redo pointer for the backup is a later checkpoint than the one from which the full backup started, but also before the CREATE DATABASE record, then the new database's files exist on disk, but are not mentioned in the WAL summary, which covers all LSNs from the start of the prior backup to the start of this one. Here, the start of the backup is basically the LSN from which replay will start, and since the database was created after that, those changes aren't in the WAL summary. This means that we think the file is unchanged since the prior backup, and so back up no blocks at all. But now we have an incremental file for a relation for which no full file is present in the prior backup, and we're in big trouble. If my analysis is correct, this bug should be new in v12.
In v11 and prior, I think that we always included every file that didn't appear in the prior manifest in full. I didn't really quite know why I was doing that, which is why I was willing to rip it out and thus remove the need for the manifest, but now I think it was actually preventing exactly this problem. This issue, in general, is files that get created after the start of the backup. By that time, the WAL summary that drives the backup has already been built, so it doesn't know anything about the new files. That would be fine if we either (A) omitted those new files from the backup completely, since replay would recreate them or (B) backed them up in full, so that there was nothing relying on them being there in the earlier backup. But an incremental backup of such a file is no good. Then I started worrying about whether there were problems in cases where a file was dropped and recreated with the same name. I *think* it's OK. If file F is dropped and recreated after being copied into the full backup but before being copied into the incremental backup, then there are basically two cases. First, F might be dropped before the start LSN of the incremental backup; if so, we'll know from the WAL summary that the limit block is 0 and back up the whole thing. Second, F might be dropped after the start LSN of the incremental backup and before it's actually copied. In that case, we'll not know when backing up the file that it was dropped and recreated, so we'll back it up incrementally as if that hadn't happened. That's OK as long as reconstruction doesn't fail, because WAL replay will again drop and recreate F. And I think reconstruction won't fail: blocks that are in the incremental file will be taken from there, blocks in the prior backup file will be taken from there, and blocks in neither place will be zero-filled. The result is logically incoherent, but replay will nuke the file anyway, so whatever.
It bugs me a bit that we don't obey the WAL-before-data rule with file creation, e.g. RelationCreateStorage does smgrcreate() and then log_smgrcreate(). So in theory we could see a file on disk for which nothing has been logged yet; it could even happen that the file gets created before the start LSN of the backup and the log record gets written afterward. It seems like it would be far more comfortable to swap the order there, so that if it's on disk, it's definitely in the WAL. But I haven't yet been able to think of a scenario in which the current ordering causes a real problem. If we back up a stray file in full (or, hypothetically, if we skipped it entirely) then nothing will happen that can't already happen today with full backup; any problems we end up having are, I think, not new problems. It's only when we back up a file incrementally that we need to be careful, and the analysis is basically the same as before ... whatever we put into an incremental file will cause *something* to get reconstructed except when there's no prior file at all. Having the manifest for the prior backup lets us avoid the incremental-with-no-prior-file scenario. And as long as *something* gets reconstructed, I think WAL replay will fix up the rest. Considering all this, what I'm inclined to do is go and put UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust accordingly. But first: does anybody see more problems here that I may have missed? -- Robert Haas EDB: http://www.enterprisedb.com
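The reconstruction rule described above — prefer the block from the incremental file, fall back to the prior backup's copy, and zero-fill anything found in neither — can be sketched as follows. This is toy code, not pg_combinebackup itself; the helper name and signature are invented:

```python
def reconstruct(incr_blocks, prior_file, total_blocks, block_size=8192):
    """Rebuild one relation file from an incremental backup.

    incr_blocks maps block number -> block bytes (the subset of blocks
    stored in the incremental file); prior_file is the same file as found
    in the prior backup; total_blocks is the reconstructed length in
    blocks."""
    out = bytearray()
    for blkno in range(total_blocks):
        if blkno in incr_blocks:
            out += incr_blocks[blkno]                 # newest copy wins
        elif (blkno + 1) * block_size <= len(prior_file):
            off = blkno * block_size
            out += prior_file[off:off + block_size]   # unchanged block
        else:
            out += bytes(block_size)                  # in neither: zero-fill
    return bytes(out)
```

As the message notes, a logically incoherent result from this procedure is tolerable for a dropped-and-recreated file, because WAL replay drops and recreates the file anyway.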
On Mon, Dec 4, 2023 at 3:58 PM Robert Haas <robertmhaas@gmail.com> wrote: > Considering all this, what I'm inclined to do is go and put > UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust > accordingly. But first: does anybody see more problems here that I may > have missed? OK, so here's a new version with UPLOAD_MANIFEST put back. I wrote a long comment explaining why that's believed to be necessary and sufficient. I committed 0001 and 0002 from the previous series also, since it doesn't seem like anyone has further comments on those renamings. This version also improves (at least, IMHO) the way that we wait for WAL summarization to finish. Previously, you either caught up fully within 60 seconds or you died. I didn't like that, because it seemed like some people would get timeouts when the operation was slowly progressing and would eventually succeed. So what this version does is: - Every 10 seconds, it logs a warning saying that it's still waiting for WAL summarization. That way, a human operator can understand what's happening easily, and cancel if they want. - If 60 seconds go by without the WAL summarizer ingesting even a single WAL record, it times out. That way, if the WAL summarizer is dead or totally stuck (e.g. debugger attached, hung I/O) the user won't be left waiting forever even if they never cancel. But if it's just slow, it probably won't time out, and the operation should eventually succeed. To me, this seems like a reasonable compromise. It might be unreasonable if WAL summarization is proceeding at a very low but non-zero rate. But it's hard for me to think of a situation where that will happen, with the exception of when CPU or I/O are badly overloaded. But in those cases, the WAL generation rate is probably also not that high, because apparently the system is paralyzed, so maybe the wait won't even be that bad, especially given that everything else on the box should be super-slow too. 
Plus, even if we did want to time out in such a case, it's hard to know how slow is too slow. In any event, I think most failures here are likely to be complete failures, where the WAL summarizer just doesn't make progress at all, so the fact that this times out in those cases seems to me to likely be as much as we need to do here. But if someone sees a problem with this or has a clever idea how to make it better, I'm all ears.

-- Robert Haas EDB: http://www.enterprisedb.com
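[Editor's note: the waiting policy described in the two messages above can be sketched in a few lines. This is an illustrative simulation only, not the patch's actual C code; the function names, polling interval, and injectable clock are my own. The key property is that a slow-but-moving summarizer never times out; only a completely stalled one does.]

```python
import time

WARN_INTERVAL = 10   # seconds between "still waiting" warnings
STALL_TIMEOUT = 60   # give up only if *no* WAL is ingested for this long

def wait_for_summarization(target_lsn, get_summarized_lsn,
                           now=time.monotonic, sleep=time.sleep, warn=print):
    """Wait until the summarizer reaches target_lsn.

    Warn every WARN_INTERVAL seconds while waiting; raise only if the
    summarized LSN has not advanced at all for STALL_TIMEOUT seconds.
    """
    last_lsn = get_summarized_lsn()
    last_progress = now()
    next_warn = last_progress + WARN_INTERVAL
    while True:
        current = get_summarized_lsn()
        if current >= target_lsn:
            return current
        if current > last_lsn:
            # Progress was made, however slowly: reset the stall clock.
            last_lsn = current
            last_progress = now()
        elif now() - last_progress >= STALL_TIMEOUT:
            raise TimeoutError("WAL summarization is not progressing")
        if now() >= next_warn:
            warn(f"still waiting for WAL summarization through {target_lsn}")
            next_warn += WARN_INTERVAL
        sleep(0.1)
```

With an injected fake clock one can check both behaviors: a summarizer advancing one unit per second reaches the target despite taking longer than 60 seconds of simulated time, while a frozen summarizer raises after 60 seconds.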
Attachment
On Tue, Dec 5, 2023 at 7:11 PM Robert Haas <robertmhaas@gmail.com> wrote: [..v13 patchset]

The results with the v13 patchset are the following (* - requires a checkpoint on the primary when taking an incremental backup on a standby that is too idle; this was explained by Robert in [1], a too-fast-incremental-backup artifact of the testing scenario):

test_across_wallevelminimal.sh - GOOD
test_many_incrementals_dbcreate.sh - GOOD
test_many_incrementals.sh - GOOD
test_multixact.sh - GOOD
test_reindex_and_vacuum_full.sh - GOOD
test_standby_incr_just_backup.sh - GOOD*
test_truncaterollback.sh - GOOD
test_unlogged_table.sh - GOOD
test_full_pri__incr_stby__restore_on_pri.sh - GOOD
test_full_pri__incr_stby__restore_on_stby.sh - GOOD
test_full_stby__incr_stby__restore_on_pri.sh - GOOD*
test_full_stby__incr_stby__restore_on_stby.sh - GOOD*
test_incr_on_standby_after_promote.sh - GOOD*

test_incr_after_timelineincrease.sh (pg_ctl stop, pg_resetwal -l 00000002000000000000000E ..., pg_ctl start, pg_basebackup --incremental) - GOOD, I've got:

pg_basebackup: error: could not initiate base backup: ERROR: timeline 1 found in manifest, but not in this server's history

Comment: I was wondering if it wouldn't make some sense to teach pg_resetwal to actually delete all WAL summaries after any WAL/controlfile alteration?

test_stuck_walsummary.sh (pkill -STOP walsumm) - GOOD:

> This version also improves (at least, IMHO) the way that we wait for
> WAL summarization to finish. Previously, you either caught up fully
> within 60 seconds or you died. I didn't like that, because it seemed
> like some people would get timeouts when the operation was slowly
> progressing and would eventually succeed. So what this version does
> is:

WARNING: still waiting for WAL summarization through 0/A0000D8 after 10 seconds
DETAIL: Summarization has reached 0/8000028 on disk and 0/80000F8 in memory.
[..]
pg_basebackup: error: could not initiate base backup: ERROR: WAL summarization is not progressing
DETAIL: Summarization is needed through 0/A0000D8, but is stuck at 0/8000028 on disk and 0/80000F8 in memory.

Comment2: looks good to me!

test_pending_2pc.sh - getting GOOD on most recent runs, but several times during early testing (probably due to my own mishaps), I've been hit by Abort/TRAP. I'm still investigating and trying to reproduce those ones.

TRAP: failed Assert("summary_end_lsn >= WalSummarizerCtl->pending_lsn"), File: "walsummarizer.c", Line: 940

Regards, -J.

[1] - https://www.postgresql.org/message-id/CA%2BTgmoYuC27_ToGtTTNyHgpn_eJmdqrmhJ93bAbinkBtXsWHaA%40mail.gmail.com
On Thu, Dec 7, 2023 at 9:42 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote:
> Comment: I was wondering if it wouldn't make some sense to teach
> pg_resetwal to actually delete all WAL summaries after any
> WAL/controlfile alteration?

I thought that this was a good idea so I decided to go implement it, only to discover that it was already part of the patch set ... did you find some case where it doesn't work as expected? The code looks like this:

    RewriteControlFile();
    KillExistingXLOG();
    KillExistingArchiveStatus();
    KillExistingWALSummaries();
    WriteEmptyXLOG();

> test_pending_2pc.sh - getting GOOD on most recent runs, but several
> times during early testing (probably due to my own mishaps), I've been
> hit by Abort/TRAP. I'm still investigating and trying to reproduce
> those ones. TRAP: failed Assert("summary_end_lsn >=
> WalSummarizerCtl->pending_lsn"), File: "walsummarizer.c", Line: 940

I have a fix for this locally, but I'm going to hold off on publishing a new version until either there's a few more things I can address all at once, or until Thomas commits the ubsan fix.

-- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Dec 7, 2023 at 4:15 PM Robert Haas <robertmhaas@gmail.com> wrote:

Hi Robert,

> On Thu, Dec 7, 2023 at 9:42 AM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> > Comment: I was wondering if it wouldn't make some sense to teach
> > pg_resetwal to actually delete all WAL summaries after any
> > WAL/controlfile alteration?
>
> I thought that this was a good idea so I decided to go implement it,
> only to discover that it was already part of the patch set ... did you
> find some case where it doesn't work as expected? The code looks like
> this:

Ah, my bad, with a fresh mind and coffee the error message makes it clear, and of course it did reset the summaries properly.

While we are at it, maybe around the below in PrepareForIncrementalBackup()

    if (tlep[i] == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("timeline %u found in manifest, but not in this server's history",
                        range->tli)));

we could add

    errhint("You might need to start a new full backup instead of an incremental one")

?

> > test_pending_2pc.sh - getting GOOD on most recent runs, but several
> > times during early testing (probably due to my own mishaps), I've been
> > hit by Abort/TRAP. I'm still investigating and trying to reproduce
> > those ones. TRAP: failed Assert("summary_end_lsn >=
> > WalSummarizerCtl->pending_lsn"), File: "walsummarizer.c", Line: 940
>
> I have a fix for this locally, but I'm going to hold off on publishing
> a new version until either there's a few more things I can address all
> at once, or until Thomas commits the ubsan fix.

Great, I cannot get it to fail again today; it had to be some dirty state of the testing env. BTW: Thomas has pushed that ubsan fix.

-J.
On Tue, Dec 5, 2023 at 11:40 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> > On Mon, Dec 4, 2023 at 3:58 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Considering all this, what I'm inclined to do is go and put
> > UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust
> > accordingly. But first: does anybody see more problems here that I may
> > have missed?
>
> OK, so here's a new version with UPLOAD_MANIFEST put back. I wrote a
> long comment explaining why that's believed to be necessary and
> sufficient. I committed 0001 and 0002 from the previous series also,
> since it doesn't seem like anyone has further comments on those
> renamings.

I have done some testing on standby, but I am facing some issues, although things are working fine on the primary. As shown in test [1] below, the standby reports an error that the manifest requires WAL from 0/60000F8, but this backup starts at 0/6000028. Then I tried to look into the manifest file of the full backup, and it shows the contents below [0]. Actually, from this WARNING and ERROR, I am not clear what the problem is. I understand that the full backup ends at "0/60000F8", so for the next incremental backup we should be looking for a summary that has WAL starting at "0/60000F8", and we do have those WALs. In fact, the error message is saying "this backup starts at 0/6000028", which is before "0/60000F8", so what's the issue?
[0]
"WAL-Ranges": [
  { "Timeline": 1, "Start-LSN": "0/6000028", "End-LSN": "0/60000F8" }

[1]
-- test on primary
dilipkumar@dkmac bin % ./pg_basebackup -D d
dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest

-- cleanup the backup directory
dilipkumar@dkmac bin % rm -rf d
dilipkumar@dkmac bin % rm -rf d1

-- test on standby
dilipkumar@dkmac bin % ./pg_basebackup -D d -p 5433
dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest -p 5433

WARNING: aborting backup due to backend exiting before pg_backup_stop was called
pg_basebackup: error: could not initiate base backup: ERROR: manifest requires WAL from final timeline 1 ending at 0/60000F8, but this backup starts at 0/6000028
pg_basebackup: removing data directory "d1"

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, Dec 11, 2023 at 11:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Dec 5, 2023 at 11:40 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Mon, Dec 4, 2023 at 3:58 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > Considering all this, what I'm inclined to do is go and put > > > UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust > > > accordingly. But first: does anybody see more problems here that I may > > > have missed? > > > > OK, so here's a new version with UPLOAD_MANIFEST put back. I wrote a > > long comment explaining why that's believed to be necessary and > > sufficient. I committed 0001 and 0002 from the previous series also, > > since it doesn't seem like anyone has further comments on those > > renamings. > > I have done some testing on standby, but I am facing some issues, > although things are working fine on the primary. As shown below test > [1]standby is reporting some errors that manifest require WAL from > 0/60000F8, but this backup starts at 0/6000028. Then I tried to look > into the manifest file of the full backup and it shows contents as > below[0]. Actually from this WARNING and ERROR, I am not clear what > is the problem, I understand that full backup ends at "0/60000F8" so > for the next incremental backup we should be looking for a summary > that has WAL starting at "0/60000F8" and we do have those WALs. In > fact, the error message is saying "this backup starts at 0/6000028" > which is before "0/60000F8" so whats the issue? 
>
> [0]
> "WAL-Ranges": [
>   { "Timeline": 1, "Start-LSN": "0/6000028", "End-LSN": "0/60000F8" }
>
> [1]
> -- test on primary
> dilipkumar@dkmac bin % ./pg_basebackup -D d
> dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest
>
> -- cleanup the backup directory
> dilipkumar@dkmac bin % rm -rf d
> dilipkumar@dkmac bin % rm -rf d1
>
> -- test on standby
> dilipkumar@dkmac bin % ./pg_basebackup -D d -p 5433
> dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest -p 5433
>
> WARNING: aborting backup due to backend exiting before pg_backup_stop
> was called
> pg_basebackup: error: could not initiate base backup: ERROR: manifest
> requires WAL from final timeline 1 ending at 0/60000F8, but this
> backup starts at 0/6000028
> pg_basebackup: removing data directory "d1"

Jakub pinged me offlist and pointed me to the thread [1] where this is already explained, so I think we can ignore it.

[1] https://www.postgresql.org/message-id/CA%2BTgmoYuC27_ToGtTTNyHgpn_eJmdqrmhJ93bAbinkBtXsWHaA%40mail.gmail.com

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
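[Editor's note: for the record, one reading of the check that fires here, sketched rather than taken from the server's actual PrepareForIncrementalBackup() code: the incremental backup must start at or after the LSN where the prior backup's manifest says its WAL coverage ended. On an idle standby no new checkpoint has happened, so the new backup reuses the old redo point and starts *before* the manifest's end LSN. Function names below are the editor's own.]

```python
def parse_lsn(text):
    """Parse an LSN like '0/60000F8' into a 64-bit integer
    (high 32 bits before the slash, low 32 bits after)."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def check_incremental_start(manifest_end_lsn, backup_start_lsn):
    """Sketch of the sanity check: an incremental backup must begin at
    or after the point where the prior backup's WAL coverage ended,
    otherwise there is a gap the incremental cannot account for."""
    if parse_lsn(backup_start_lsn) < parse_lsn(manifest_end_lsn):
        raise ValueError(
            f"manifest requires WAL ending at {manifest_end_lsn}, "
            f"but this backup starts at {backup_start_lsn}")
```

With the values from the manifest above, check_incremental_start("0/60000F8", "0/6000028") fails just as the server did, while a backup starting at or after 0/60000F8 passes.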
On Fri, Dec 8, 2023 at 5:02 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > While we are at it, maybe around the below in PrepareForIncrementalBackup() > > if (tlep[i] == NULL) > ereport(ERROR, > > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > errmsg("timeline %u found in > manifest, but not in this server's history", > range->tli))); > > we could add > > errhint("You might need to start a new full backup instead of > incremental one") > > ? I can't exactly say that such a hint would be inaccurate, but I think the impulse to add it here is misguided. One of my design goals for this system is to make it so that you never have to take a new incremental backup "just because," not even in case of an intervening timeline switch. So, all of the errors in this function are warning you that you've done something that you really should not have done. In this particular case, you've either (1) manually removed the timeline history file, and not just any timeline history file but the one for a timeline for a backup that you still intend to use as the basis for taking an incremental backup or (2) tried to use a full backup taken from one server as the basis for an incremental backup on a completely different server that happens to share the same system identifier, e.g. because you promoted two standbys derived from the same original primary and then tried to use a full backup taken on one as the basis for an incremental backup taken on the other. The scenario I was really concerned about when I wrote this test was (2), because that could lead to a corrupt restore. This test isn't strong enough to prevent that completely, because two unrelated standbys can branch onto the same new timelines at the same LSNs, and then these checks can't tell that something bad has happened. However, they can detect a useful subset of problem cases. And the solution is not so much "take a new full backup" as "keep straight which server is which." 
Likewise, in case (1), the relevant hint would be "don't manually remove timeline history files, and if you must, then at least don't nuke timelines that you actually still care about." > > I have a fix for this locally, but I'm going to hold off on publishing > > a new version until either there's a few more things I can address all > > at once, or until Thomas commits the ubsan fix. > > > > Great, I cannot get it to fail again today, it had to be some dirty > state of the testing env. BTW: Thomas has pushed that ubsan fix. Huzzah, the cfbot likes the patch set now. Here's a new version with the promised fix for your non-reproducible issue. Let's see whether you and cfbot still like this version. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
Hi Robert, On Mon, Dec 11, 2023 at 6:08 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Dec 8, 2023 at 5:02 AM Jakub Wartak > <jakub.wartak@enterprisedb.com> wrote: > > While we are at it, maybe around the below in PrepareForIncrementalBackup() > > > > if (tlep[i] == NULL) > > ereport(ERROR, > > > > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > errmsg("timeline %u found in > > manifest, but not in this server's history", > > range->tli))); > > > > we could add > > > > errhint("You might need to start a new full backup instead of > > incremental one") > > > > ? > > I can't exactly say that such a hint would be inaccurate, but I think > the impulse to add it here is misguided. One of my design goals for > this system is to make it so that you never have to take a new > incremental backup "just because," Did you mean take a new full backup here? > not even in case of an intervening > timeline switch. So, all of the errors in this function are warning > you that you've done something that you really should not have done. > In this particular case, you've either (1) manually removed the > timeline history file, and not just any timeline history file but the > one for a timeline for a backup that you still intend to use as the > basis for taking an incremental backup or (2) tried to use a full > backup taken from one server as the basis for an incremental backup on > a completely different server that happens to share the same system > identifier, e.g. because you promoted two standbys derived from the > same original primary and then tried to use a full backup taken on one > as the basis for an incremental backup taken on the other. > Okay, but please consider two other possibilities: (3) I had a corrupted DB where I've fixed it by running pg_resetwal and some cronjob just a day later attempted to take incremental and failed with that error. 
(4) I had pg_upgraded the DB (pg_upgrade calls pg_resetwal on the fresh initdb directory) and a cronjob then simply failed with this error.

I bet that (4) is going to happen more often than (1) or (2), which might trigger users to complain on forums or in support tickets.

> > > I have a fix for this locally, but I'm going to hold off on publishing
> > > a new version until either there's a few more things I can address all
> > > at once, or until Thomas commits the ubsan fix.
> >
> > Great, I cannot get it to fail again today, it had to be some dirty
> > state of the testing env. BTW: Thomas has pushed that ubsan fix.
>
> Huzzah, the cfbot likes the patch set now. Here's a new version with
> the promised fix for your non-reproducible issue. Let's see whether
> you and cfbot still like this version.

LGTM, all quick tests work from my end too. BTW: I have also scheduled the long/large pgbench -s 14000 (~200GB?) multiple-day incremental test. I'll let you know how it went.

-J.
On Wed, Dec 13, 2023 at 5:39 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > > I can't exactly say that such a hint would be inaccurate, but I think > > the impulse to add it here is misguided. One of my design goals for > > this system is to make it so that you never have to take a new > > incremental backup "just because," > > Did you mean take a new full backup here? Yes, apologies for the typo. > > not even in case of an intervening > > timeline switch. So, all of the errors in this function are warning > > you that you've done something that you really should not have done. > > In this particular case, you've either (1) manually removed the > > timeline history file, and not just any timeline history file but the > > one for a timeline for a backup that you still intend to use as the > > basis for taking an incremental backup or (2) tried to use a full > > backup taken from one server as the basis for an incremental backup on > > a completely different server that happens to share the same system > > identifier, e.g. because you promoted two standbys derived from the > > same original primary and then tried to use a full backup taken on one > > as the basis for an incremental backup taken on the other. > > > > Okay, but please consider two other possibilities: > > (3) I had a corrupted DB where I've fixed it by running pg_resetwal > and some cronjob just a day later attempted to take incremental and > failed with that error. > > (4) I had pg_upgraded (which calls pg_resetwal on fresh initdb > directory) the DB where I had cronjob that just failed with this error > > I bet that (4) is going to happen more often than (1), (2) , which > might trigger users to complain on forums, support tickets. Hmm. In case (4), I was thinking that you'd get a complaint about the database system identifier not matching. I'm not actually sure that's what would happen, though, now that you mention it. 
In case (3), I think you would get an error about missing WAL summary files. > > Huzzah, the cfbot likes the patch set now. Here's a new version with > > the promised fix for your non-reproducible issue. Let's see whether > > you and cfbot still like this version. > > LGTM, all quick tests work from my end too. BTW: I have also scheduled > the long/large pgbench -s 14000 (~200GB?) - multiple day incremental > test. I'll let you know how it went. Awesome, thank you so much. -- Robert Haas EDB: http://www.enterprisedb.com
Hi Robert, On Wed, Dec 13, 2023 at 2:16 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > > not even in case of an intervening > > > timeline switch. So, all of the errors in this function are warning > > > you that you've done something that you really should not have done. > > > In this particular case, you've either (1) manually removed the > > > timeline history file, and not just any timeline history file but the > > > one for a timeline for a backup that you still intend to use as the > > > basis for taking an incremental backup or (2) tried to use a full > > > backup taken from one server as the basis for an incremental backup on > > > a completely different server that happens to share the same system > > > identifier, e.g. because you promoted two standbys derived from the > > > same original primary and then tried to use a full backup taken on one > > > as the basis for an incremental backup taken on the other. > > > > > > > Okay, but please consider two other possibilities: > > > > (3) I had a corrupted DB where I've fixed it by running pg_resetwal > > and some cronjob just a day later attempted to take incremental and > > failed with that error. > > > > (4) I had pg_upgraded (which calls pg_resetwal on fresh initdb > > directory) the DB where I had cronjob that just failed with this error > > > > I bet that (4) is going to happen more often than (1), (2) , which > > might trigger users to complain on forums, support tickets. > > Hmm. In case (4), I was thinking that you'd get a complaint about the > database system identifier not matching. I'm not actually sure that's > what would happen, though, now that you mention it. 
>

I've played with initdb/pg_upgrade (17->17) and I don't get a DBID mismatch (of course they do differ after initdb), but I get this instead:

$ pg_basebackup -c fast -D /tmp/incr2.after.upgrade -p 5432 --incremental /tmp/incr1.before.upgrade/backup_manifest
WARNING: aborting backup due to backend exiting before pg_backup_stop was called
pg_basebackup: error: could not initiate base backup: ERROR: timeline 2 found in manifest, but not in this server's history
pg_basebackup: removing data directory "/tmp/incr2.after.upgrade"

Also, in the manifest I don't see a DBID? Maybe it's a nuisance, and all I'm trying to say is that if an automated cronjob with pg_basebackup --incremental hits a freshly upgraded cluster, that error message without an errhint() is going to scare some junior DBAs.

> > LGTM, all quick tests work from my end too. BTW: I have also scheduled
> > the long/large pgbench -s 14000 (~200GB?) - multiple day incremental
> > test. I'll let you know how it went.
>
> Awesome, thank you so much.

OK, so pgbench -i -s 14440 and pgbench -P 1 -R 100 -c 8 -T 259200 did generate pretty large incrementals, so I had to abort the test for lack of space (I was expecting the incrementals to be much smaller). I initially suspected that the problem lay in the uniform distribution of `\set aid random(1, 100000 * :scale)` in the tpcb-like script that UPDATEs the big pgbench_accounts.

$ du -sm /backups/backups/* /backups/archive/
216205 /backups/backups/full
215207 /backups/backups/incr.1
216706 /backups/backups/incr.2
102273 /backups/archive/

So I verified recoverability yesterday anyway - the pg_combinebackup "full incr.1 incr.2" took 44 minutes, and the later archive WAL recovery and promotion succeeded. The 8-way parallel seqscan for sum(abalance) on pgbench_accounts and the other tables worked fine.
The pg_combinebackup was using 15-20% CPU (mostly on %sys), while performing mostly 60-80MB/s separately for both reads and writes (it's slow, but that's due to the maxed-out sequential I/O of the Premium tier on a small SSD on Azure).

So I've launched another, improved test (to force more localized UPDATEs) to see the more real-world space-effectiveness of the incremental backup:

\set aid random_exponential(1, 100000 * :scale, 8)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

But then... (and I have verified the low IDs for :aid above)... the same has happened:

backups/backups$ du -sm /backups/backups/*
210229 /backups/backups/full
208299 /backups/backups/incr.1
208351 /backups/backups/incr.2

# pgbench_accounts has relfilenodeid 16486
postgres@jw-test-1:/backups/backups$ for L in 5 10 15 30 100 161 173 174 175 ; do md5sum full/base/5/16486.$L ./incr.1/base/5/16486.$L ./incr.2/base/5/16486.$L /var/lib/postgres/17/data/base/5/16486.$L ; echo; done

005c6bbb40fca3c1a0a819376ef0e793 full/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793 ./incr.1/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793 ./incr.2/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793 /var/lib/postgres/17/data/base/5/16486.5

[.. all the checksums match (!) for the above $L..]
c5117a213253035da5e5ee8a80c3ee3d full/base/5/16486.173
c5117a213253035da5e5ee8a80c3ee3d ./incr.1/base/5/16486.173
c5117a213253035da5e5ee8a80c3ee3d ./incr.2/base/5/16486.173
c5117a213253035da5e5ee8a80c3ee3d /var/lib/postgres/17/data/base/5/16486.173

47ee6b18d7f8e40352598d194b9a3c8a full/base/5/16486.174
47ee6b18d7f8e40352598d194b9a3c8a ./incr.1/base/5/16486.174
47ee6b18d7f8e40352598d194b9a3c8a ./incr.2/base/5/16486.174
47ee6b18d7f8e40352598d194b9a3c8a /var/lib/postgres/17/data/base/5/16486.174

82dfeba58b4a1031ac12c23f9559a330 full/base/5/16486.175
21a8ac1e6fef3cf0b34546c41d59b2cc ./incr.1/base/5/16486.175
2c3d89c612b2f97d575a55c6c0204d0b ./incr.2/base/5/16486.175
73367d44d76e98276d3a6bbc14bb31f1 /var/lib/postgres/17/data/base/5/16486.175

So to me, it looks like it copied 174 out of 175 files in full anyway, lowering the effectiveness of that incremental backup to 0%. The commands to generate those incr backups were:

pg_basebackup -v -P -c fast -D /backups/backups/incr.1 --incremental=/backups/backups/full/backup_manifest
sleep 4h
pg_basebackup -v -P -c fast -D /backups/backups/incr.2 --incremental=/backups/backups/incr1/backup_manifest

The incrementals are being generated, but just for the first (0) segment of the relation?
/backups/backups$ ls -l incr.2/base/5 | grep INCR
-rw------- 1 postgres postgres 12 Dec 14 21:33 INCREMENTAL.112
-rw------- 1 postgres postgres 12 Dec 14 21:01 INCREMENTAL.113
-rw------- 1 postgres postgres 12 Dec 14 21:36 INCREMENTAL.1247
-rw------- 1 postgres postgres 12 Dec 14 21:38 INCREMENTAL.1247_vm
[..note, no INCREMENTAL.$int.$segment files]
-rw------- 1 postgres postgres 12 Dec 14 21:24 INCREMENTAL.6238
-rw------- 1 postgres postgres 12 Dec 14 21:17 INCREMENTAL.6239
-rw------- 1 postgres postgres 12 Dec 14 21:55 INCREMENTAL.827

# 16486 is pgbench_accounts
/backups/backups$ ls -l incr.2/base/5/*16486* | grep INCR
-rw------- 1 postgres postgres 14613480 Dec 14 21:00 incr.2/base/5/INCREMENTAL.16486
-rw------- 1 postgres postgres 12 Dec 14 21:52 incr.2/base/5/INCREMENTAL.16486_vm

/backups/backups$ find incr* -name INCREMENTAL.* | wc -l
1342
/backups/backups$ find incr* -name INCREMENTAL.*_* | wc -l  # VM or FSM
236
/backups/backups$ find incr* -name INCREMENTAL.*.* | wc -l  # not a single >1GB incremental relation segment
0

I'm quickly passing info along and I haven't really looked at the code yet, but it should be somewhere around GetFileBackupMethod() and should be reproducible easily with the configure --with-segsize-blocks=X switch.

-J.
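[Editor's note: to make the symptom concrete - a relation bigger than one segment is stored as several files (16486, 16486.1, 16486.2, ..., with a fork suffix like _vm coming before the segment number), and each changed segment should be independently eligible to become an INCREMENTAL.* file. A toy sketch of the names the `find` above should have matched; the helpers are the editor's own and sidestep the real per-segment changed-block decision:]

```python
def segment_names(relfilenode, nsegments, fork=None):
    """List the on-disk names for a relation's segments: the base
    segment is just '<relfilenode>', later segments append '.<seg>',
    and a fork suffix such as '_vm' precedes the segment number."""
    suffix = f"_{fork}" if fork else ""
    names = []
    for seg in range(nsegments):
        base = f"{relfilenode}{suffix}"
        names.append(base if seg == 0 else f"{base}.{seg}")
    return names

def incremental_names(names):
    """An incremental backup may replace any changed segment file with
    an INCREMENTAL.-prefixed counterpart - per segment, not just seg 0."""
    return [f"INCREMENTAL.{n}" for n in names]
```

For pgbench_accounts (relfilenode 16486, 176 segments at ~1GB each), one would expect names like INCREMENTAL.16486.175 to appear; the bug reported above meant only the base INCREMENTAL.16486 was ever produced.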
I have a couple of quick fixes here. The first fixes up some things in nls.mk related to a file move. The second is some cleanup because some function you are using has been removed in the meantime; you probably found that yourself while rebasing. The pg_walsummary patch doesn't have a nls.mk, but you also comment that it doesn't have tests yet, so I assume it's not considered complete yet anyway.
Attachment
A separate bikeshedding topic: The GUC "summarize_wal", could that be "wal_something" instead? (wal_summarize? wal_summarizer?) It would be nice if these settings names group together a bit, both with existing wal_* ones and also with the new ones you are adding (wal_summary_keep_time).
Another set of comments, about the patch that adds pg_combinebackup:

Make sure all the options are listed in a consistent order. We have lately changed everything to be alphabetical. This includes:

- reference page pg_combinebackup.sgml
- long_options listing
- getopt_long() argument
- subsequent switch
- (--help output, but it looks ok as is)

Also, in pg_combinebackup.sgml, the option --sync-method is listed as if it does not take an argument, but it does.
On Fri, Dec 15, 2023 at 6:53 AM Peter Eisentraut <peter@eisentraut.org> wrote: > The first fixes up some things in nls.mk related to a file move. The > second is some cleanup because some function you are using has been > removed in the meantime; you probably found that yourself while rebasing. Incorporated these. As you guessed, MemoryContextResetAndDeleteChildren -> MemoryContextReset had already been done locally. > The pg_walsummary patch doesn't have a nls.mk, but you also comment that > it doesn't have tests yet, so I assume it's not considered complete yet > anyway. I think this was more of a case of me just not realizing that I should add that. I'll add something simple to the next version, but I'm not very good at this NLS stuff. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Dec 15, 2023 at 6:58 AM Peter Eisentraut <peter@eisentraut.org> wrote: > A separate bikeshedding topic: The GUC "summarize_wal", could that be > "wal_something" instead? (wal_summarize? wal_summarizer?) It would be > nice if these settings names group together a bit, both with existing > wal_* ones and also with the new ones you are adding > (wal_summary_keep_time). Yeah, this is highly debatable, so bikeshed away. IMHO, the question here is whether we care more about (1) having the name of the GUC sound nice grammatically or (2) having the GUC begin with the same string as other, related GUCs. I think that Tom Lane tends to prefer the former, and probably some other people do too, while some other people tend to prefer the latter. Ideally it would be possible to satisfy both goals at once here, but everything I thought about that started with "wal" sounded too awkward for me to like it; hence the current choice of name. But if there's consensus on something else, so be it. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Dec 18, 2023 at 4:10 AM Peter Eisentraut <peter@eisentraut.org> wrote: > Another set of comments, about the patch that adds pg_combinebackup: > > Make sure all the options are listed in a consistent order. We have > lately changed everything to be alphabetical. This includes: > > - reference page pg_combinebackup.sgml > > - long_options listing > > - getopt_long() argument > > - subsequent switch > > - (--help output, but it looks ok as is) > > Also, in pg_combinebackup.sgml, the option --sync-method is listed as if > it does not take an argument, but it does. I've attempted to clean this stuff up in the attached version. This version also includes a fix for the bug found by Jakub that caused things to not work properly for segment files beyond the first for any particular relation, which turns out to be a really stupid mistake in my earlier commit 025584a168a4b3002e193. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
- v15-0001-Fix-brown-paper-bag-bug-in-025584a168a4b3002e193.patch
- v15-0005-Add-new-pg_walsummary-tool.patch
- v15-0003-Add-a-new-WAL-summarizer-process.patch
- v15-0002-Move-src-bin-pg_verifybackup-parse_manifest.c-in.patch
- v15-0004-Add-support-for-incremental-backup.patch
- v15-0006-Test-patch-Enable-summarize_wal-by-default.patch
On Fri, Dec 15, 2023 at 5:36 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote:
> I've played with initdb/pg_upgrade (17->17) and I don't get a DBID
> mismatch (of course they do differ after initdb), but I get this
> instead:
>
> $ pg_basebackup -c fast -D /tmp/incr2.after.upgrade -p 5432
> --incremental /tmp/incr1.before.upgrade/backup_manifest
> WARNING: aborting backup due to backend exiting before pg_backup_stop
> was called
> pg_basebackup: error: could not initiate base backup: ERROR: timeline
> 2 found in manifest, but not in this server's history
> pg_basebackup: removing data directory "/tmp/incr2.after.upgrade"
>
> Also, in the manifest I don't see a DBID?
> Maybe it's a nuisance, and all I'm trying to say is that if an
> automated cronjob with pg_basebackup --incremental hits a freshly
> upgraded cluster, that error message without an errhint() is going to
> scare some junior DBAs.

Yeah. I think we should add the system identifier to the manifest, but I think that should be left for a future project, as I don't think the lack of it is a good reason to stop all progress here. When we have that, we can give more reliable error messages about system mismatches at an earlier stage. Unfortunately, I don't think that the timeline messages you're seeing here are going to apply in every case: suppose you have two unrelated servers that are both on timeline 1. I think you could use a base backup from one of those servers and use it as the basis for the incremental from the other, and I think that if you did it right you might fail to hit any sanity check that would block that. pg_combinebackup will realize there's a problem, because it has the whole cluster to work with, not just the manifest, and will notice the mismatching system identifiers, but that's kind of late to find out that you made a big mistake. However, right now, it's the best we can do.

> The incrementals are being generated, but just for the first (0)
> segment of the relation?
I committed the first two patches from the series I posted yesterday. The first should fix this, and the second relocates parse_manifest.c. That patch hasn't changed in a while and seems unlikely to attract major objections. There's no real reason to commit it until we're ready to move forward with the main patches, but I think we're very close to that now, so I did. Here's a rebase for cfbot. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
Hi Robert, On Tue, Dec 19, 2023 at 9:36 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Dec 15, 2023 at 5:36 AM Jakub Wartak > <jakub.wartak@enterprisedb.com> wrote: > > I've played with initdb/pg_upgrade (17->17) and I don't get DBID > > mismatch (of course they do differ after initdb), but I get this > > instead: > > > > $ pg_basebackup -c fast -D /tmp/incr2.after.upgrade -p 5432 > > --incremental /tmp/incr1.before.upgrade/backup_manifest > > WARNING: aborting backup due to backend exiting before pg_backup_stop > > was called > > pg_basebackup: error: could not initiate base backup: ERROR: timeline > > 2 found in manifest, but not in this server's history > > pg_basebackup: removing data directory "/tmp/incr2.after.upgrade" > > > > Also in the manifest I don't see DBID? > > Maybe it's a nuisance and all I'm trying to see is that if an > > automated cronjob with pg_basebackup --incremental hits a freshly > > upgraded cluster, that error message without errhint() is going to > > scare some Junior DBAs. > > Yeah. I think we should add the system identifier to the manifest, but > I think that should be left for a future project, as I don't think the > lack of it is a good reason to stop all progress here. When we have > that, we can give more reliable error messages about system mismatches > at an earlier stage. Unfortunately, I don't think that the timeline > messages you're seeing here are going to apply in every case: suppose > you have two unrelated servers that are both on timeline 1. I think > you could use a base backup from one of those servers and use it as > the basis for the incremental from the other, and I think that if you > did it right you might fail to hit any sanity check that would block > that. pg_combinebackup will realize there's a problem, because it has > the whole cluster to work with, not just the manifest, and will notice > the mismatching system identifiers, but that's kind of late to find > out that you made a big mistake. 
However, right now, it's the best we > can do. > OK, understood. > > The incrementals are being generated, but just for the first (0) > > segment of the relation? > > I committed the first two patches from the series I posted yesterday. > The first should fix this, and the second relocates parse_manifest.c. > That patch hasn't changed in a while and seems unlikely to attract > major objections. There's no real reason to commit it until we're > ready to move forward with the main patches, but I think we're very > close to that now, so I did. > > Here's a rebase for cfbot. the v15 patchset (posted yesterday) test results are GOOD: 1. make check-world - GOOD 2. cfbot was GOOD 3. the devel/master bug present in parse_filename_for_nontemp_relation() seems to be gone (in local testing) 4. some further tests: test_across_wallevelminimal.sh - GOOD test_incr_after_timelineincrease.sh - GOOD test_incr_on_standby_after_promote.sh - GOOD test_many_incrementals_dbcreate.sh - GOOD test_many_incrementals.sh - GOOD test_multixact.sh - GOOD test_pending_2pc.sh - GOOD test_reindex_and_vacuum_full.sh - GOOD test_repro_assert.sh test_standby_incr_just_backup.sh - GOOD test_stuck_walsum.sh - GOOD test_truncaterollback.sh - GOOD test_unlogged_table.sh - GOOD test_full_pri__incr_stby__restore_on_pri.sh - GOOD test_full_pri__incr_stby__restore_on_stby.sh - GOOD test_full_stby__incr_stby__restore_on_pri.sh - GOOD test_full_stby__incr_stby__restore_on_stby.sh - GOOD 5. the more real-world pgbench test with localized segment writes using `\set aid random_exponential...` [1] indicates much greater efficiency in terms of backup space use now, du -sm shows: 210229 /backups/backups/full 250 /backups/backups/incr.1 255 /backups/backups/incr.2 [..] 348 /backups/backups/incr.13 408 /backups/backups/incr.14 // latest(20th of Dec on 10:40) 6673 /backups/archive/ The DB size was as reported by \l+ 205GB. 
That pgbench was running for ~27h (19th Dec 08:39 -> 20th Dec 11:30) with slow 100 TPS (-R), so no insane amounts of WAL. Time to reconstruct 14 chained incremental backups was 45mins (pg_combinebackup -o /var/lib/postgres/17/data /backups/backups/full /backups/backups/incr.1 (..) /backups/backups/incr.14). DB after recovering was OK and working fine. -J.
On Wed, Dec 20, 2023 at 8:11 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > the v15 patchset (posted yesterday) test results are GOOD: All right. I committed the main two patches, dropped the for-testing-only patch, and added a simple test to the remaining pg_walsummary patch. That needs more work, but here's what I have as of now. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
Hello Robert, 20.12.2023 23:56, Robert Haas wrote: > On Wed, Dec 20, 2023 at 8:11 AM Jakub Wartak > <jakub.wartak@enterprisedb.com> wrote: >> the v15 patchset (posted yesterday) test results are GOOD: > All right. I committed the main two patches, dropped the > for-testing-only patch, and added a simple test to the remaining > pg_walsummary patch. That needs more work, but here's what I have as > of now. I've found several typos/inconsistencies introduced with 174c48050 and dc2123400. Maybe you would want to fix them, while on it?: s/arguent/argument/; s/BlkRefTableEntry/BlockRefTableEntry/; s/BlockRefTablEntry/BlockRefTableEntry/; s/Caonicalize/Canonicalize/; s/Checksum_Algorithm/Checksum-Algorithm/; s/corresonding/corresponding/; s/differenly/differently/; s/excessing/excessive/; s/ exta / extra /; s/hexademical/hexadecimal/; s/initally/initially/; s/MAXGPATH/MAXPGPATH/; s/overrreacting/overreacting/; s/old_meanifest_file/old_manifest_file/; s/pg_cominebackup/pg_combinebackup/; s/pg_tblpc/pg_tblspc/; s/pointrs/pointers/; s/Recieve/Receive/; s/recieved/received/; s/ recod / record /; s/ recods / records /; s/substntially/substantially/; s/sumamry/summary/; s/summry/summary/; s/synchronizaton/synchronization/; s/sytem/system/; s/withot/without/; s/Woops/Whoops/; s/xlograder/xlogreader/; Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a comment above redo_pointer_at_last_summary_removal declaration, but perhaps it should say about removing summaries instead? Best regards, Alexander
On Wed, Dec 20, 2023 at 11:00 PM Alexander Lakhin <exclusion@gmail.com> wrote: > I've found several typos/inconsistencies introduced with 174c48050 and > dc2123400. Maybe you would want to fix them, while on it?: That's an impressively long list of mistakes in something I thought I'd been careful about. Sigh. I don't suppose you could provide these corrections in the form of a patch? I don't really want to run these sed commands across the entire tree and then try to figure out what's what... > Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a > comment above redo_pointer_at_last_summary_removal declaration, but > perhaps it should say about removing summaries instead? Wow, yeah. Thanks, will fix. -- Robert Haas EDB: http://www.enterprisedb.com
21.12.2023 15:07, Robert Haas wrote: > On Wed, Dec 20, 2023 at 11:00 PM Alexander Lakhin <exclusion@gmail.com> wrote: >> I've found several typos/inconsistencies introduced with 174c48050 and >> dc2123400. Maybe you would want to fix them, while on it?: > That's an impressively long list of mistakes in something I thought > I'd been careful about. Sigh. > > I don't suppose you could provide these corrections in the form of a > patch? I don't really want to run these sed commands across the entire > tree and then try to figure out what's what... Please look at the attached patch; it corrects all 29 items ("recods" fixed in two places), but maybe you find some substitutions wrong... I've also observed that those commits introduced new warnings: $ CC=gcc-12 CPPFLAGS="-Wtype-limits" ./configure -q && make -s -j8 reconstruct.c: In function ‘read_bytes’: reconstruct.c:511:24: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits] 511 | if (rb < 0) | ^ reconstruct.c: In function ‘write_reconstructed_file’: reconstruct.c:650:40: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits] 650 | if (rb < 0) | ^ reconstruct.c:662:32: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits] 662 | if (wb < 0) There are also two deadcode.DeadStores complaints from clang. First one is about: /* * Align the wait time to prevent drift. This doesn't really matter, * but we'd like the warnings about how long we've been waiting to say * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever * drifting to something that is not a multiple of ten. */ timeout_in_ms -= TimestampDifferenceMilliseconds(current_time, initial_time) % timeout_in_ms; It looks like this timeout is really not used. 
And the minor one (similar to many existing, maybe doesn't deserve fixing): walsummarizer.c:808:5: warning: Value stored to 'summary_end_lsn' is never read [deadcode.DeadStores] summary_end_lsn = private_data->read_upto; ^ ~~~~~~~~~~~~~~~~~~~~~~~ >> Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a >> comment above redo_pointer_at_last_summary_removal declaration, but >> perhaps it should say about removing summaries instead? > Wow, yeah. Thanks, will fix. Thank you for paying attention to it! Best regards, Alexander
Attachment
On Thu, Dec 21, 2023 at 10:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: > Please look at the attached patch; it corrects all 29 items ("recods" > fixed in two places), but maybe you find some substitutions wrong... Thanks, committed with a few additions. > I've also observed that those commits introduced new warnings: > $ CC=gcc-12 CPPFLAGS="-Wtype-limits" ./configure -q && make -s -j8 > reconstruct.c: In function ‘read_bytes’: > reconstruct.c:511:24: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits] > 511 | if (rb < 0) > | ^ > reconstruct.c: In function ‘write_reconstructed_file’: > reconstruct.c:650:40: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits] > 650 | if (rb < 0) > | ^ > reconstruct.c:662:32: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits] > 662 | if (wb < 0) Oops. I think the variables should be type int. See attached. > There are also two deadcode.DeadStores complaints from clang. First one is > about: > /* > * Align the wait time to prevent drift. This doesn't really matter, > * but we'd like the warnings about how long we've been waiting to say > * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever > * drifting to something that is not a multiple of ten. > */ > timeout_in_ms -= > TimestampDifferenceMilliseconds(current_time, initial_time) % > timeout_in_ms; > It looks like this timeout is really not used. Oops. It should be. See attached. > And the minor one (similar to many existing, maybe doesn't deserve fixing): > walsummarizer.c:808:5: warning: Value stored to 'summary_end_lsn' is never read [deadcode.DeadStores] > summary_end_lsn = private_data->read_upto; > ^ ~~~~~~~~~~~~~~~~~~~~~~~ It kind of surprises me that this is dead, but it seems best to keep it there to be on the safe side, in case some change to the logic renders it not dead in the future. 
> >> Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a > >> comment above redo_pointer_at_last_summary_removal declaration, but > >> perhaps it should say about removing summaries instead? > > Wow, yeah. Thanks, will fix. > > Thank you for paying attention to it! I'll fix this next. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
21.12.2023 23:43, Robert Haas wrote: >> There are also two deadcode.DeadStores complaints from clang. First one is >> about: >> /* >> * Align the wait time to prevent drift. This doesn't really matter, >> * but we'd like the warnings about how long we've been waiting to say >> * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever >> * drifting to something that is not a multiple of ten. >> */ >> timeout_in_ms -= >> TimestampDifferenceMilliseconds(current_time, initial_time) % >> timeout_in_ms; >> It looks like this timeout is really not used. > Oops. It should be. See attached. My quick experiment shows that that TimestampDifferenceMilliseconds call always returns zero, due to its arguments being swapped. The other changes look good to me. Thank you! Best regards, Alexander
My compiler has the following complaint: ../postgresql/src/backend/postmaster/walsummarizer.c: In function ‘GetOldestUnsummarizedLSN’: ../postgresql/src/backend/postmaster/walsummarizer.c:540:32: error: ‘unsummarized_lsn’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 540 | WalSummarizerCtl->pending_lsn = unsummarized_lsn; | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~ I haven't looked closely to see whether there is actually a problem here, but the attached patch at least resolves the warning. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Sat, Dec 23, 2023 at 4:51 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > My compiler has the following complaint: > > ../postgresql/src/backend/postmaster/walsummarizer.c: In function ‘GetOldestUnsummarizedLSN’: > ../postgresql/src/backend/postmaster/walsummarizer.c:540:32: error: ‘unsummarized_lsn’ may be used uninitialized in this function [-Werror=maybe-uninitialized] > 540 | WalSummarizerCtl->pending_lsn = unsummarized_lsn; > | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~ Thanks. I don't think there's a real bug, but I pushed a fix, same as what you had. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Dec 27, 2023 at 09:11:02AM -0500, Robert Haas wrote: > Thanks. I don't think there's a real bug, but I pushed a fix, same as > what you had. Thanks! I also noticed that WALSummarizerLock probably needs a mention in wait_event_names.txt. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Wed, Dec 27, 2023 at 10:36 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > On Wed, Dec 27, 2023 at 09:11:02AM -0500, Robert Haas wrote: > > Thanks. I don't think there's a real bug, but I pushed a fix, same as > > what you had. > > Thanks! I also noticed that WALSummarizerLock probably needs a mention in > wait_event_names.txt. Fixed. It seems like it would be good if there were an automated cross-check between lwlocknames.txt and wait_event_names.txt. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Dec 22, 2023 at 12:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: > My quick experiment shows that that TimestampDifferenceMilliseconds call > always returns zero, due to its arguments being swapped. Thanks. Tom already changed the unsigned -> int stuff in a separate commit, so I just pushed the fixes to PrepareForIncrementalBackup, both the one I had before, and swapping the arguments to TimestampDifferenceMilliseconds. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, 3 Jan 2024 at 15:10, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 22, 2023 at 12:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> > My quick experiment shows that that TimestampDifferenceMilliseconds call
> > always returns zero, due to its arguments being swapped.
>
> Thanks. Tom already changed the unsigned -> int stuff in a separate
> commit, so I just pushed the fixes to PrepareForIncrementalBackup,
> both the one I had before, and swapping the arguments to
> TimestampDifferenceMilliseconds.
I would like to query the following:
--tablespace-mapping=olddir=newdir
Relocates the tablespace in directory olddir to newdir during the backup. olddir is the absolute path of the tablespace as it exists in the first backup specified on the command line, and newdir is the absolute path to use for the tablespace in the reconstructed backup.
The first backup specified on the command line will be the regular, full, non-incremental backup. But if a tablespace was introduced subsequently, it would only appear in an incremental backup. Wouldn't this then mean that a mapping would need to be provided based on the path to the tablespace of that incremental backup's copy?
Regards
Thom
On Thu, Apr 25, 2024 at 6:44 PM Thom Brown <thom@linux.com> wrote: > I would like to query the following: > > --tablespace-mapping=olddir=newdir > > Relocates the tablespace in directory olddir to newdir during the backup. olddir is the absolute path of the tablespace as it exists in the first backup specified on the command line, and newdir is the absolute path to use for the tablespace in the reconstructed backup. > > The first backup specified on the command line will be the regular, full, non-incremental backup. But if a tablespace was introduced subsequently, it would only appear in an incremental backup. Wouldn't this then mean that a mapping would need to be provided based on the path to the tablespace of that incremental backup's copy? Yes. Tomas Vondra found the same issue, which I have fixed in 1713e3d6cd393fcc1d4873e75c7fa1f6c7023d75. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, So I am a bit confused about the status of the tar format support, and after re-reading the thread (or at least grepping it for ' tar '), this wasn't really much discussed here either. On Wed, Jun 14, 2023 at 02:46:48PM -0400, Robert Haas wrote: > - We only know how to operate on directories, not tar files. I thought > about that when working on pg_verifybackup as well, but I didn't do > anything about it. It would be nice to go back and make that tool work > on tar-format backups, and this one, too. I believe "that tool" is pg_verifybackup, while "this one" is pg_combinebackup? However, what's up with pg_basebackup itself with respect to tar format incremental backups? AFAICT (see below), pg_basebackup -Ft --incremental=foo/backup_manifest happily creates an incremental backup in tar format; however, pg_combinebackup will not be able to restore it? If that is the case, shouldn't there be a bigger warning in the documentation about this, or maybe pg_basebackup should refuse to make incremental tar-format backups in the first place? Am I missing something here? It will be obvious to users after the first failure (to try to restore) that this will not work, and hopefully everybody tests a restore before they put a backup solution into production (or even better, wait until this whole feature is included in a wholesale solution), but I wonder whether somebody might trip over this after all and be unhappy. If one reads the pg_combinebackup documentation carefully it kinda becomes obvious that it does not occupy itself with tar format backups, but it is not spelt out explicitly either. |postgres@mbanck-lin-1:~$ pg_basebackup -c fast -Ft -D backup/backup_full |postgres@mbanck-lin-1:~$ pg_basebackup -c fast -Ft -D backup/backup_incr_1 --incremental=backup/backup_full/backup_manifest |postgres@mbanck-lin-1:~$ echo $? 
|0 |postgres@mbanck-lin-1:~$ du -h backup/ |44M backup/backup_incr_1 |4,5G backup/backup_full |4,5G backup/ |postgres@mbanck-lin-1:~$ tar tf backup/backup_incr_1/base.tar | grep INCR | head |base/1/INCREMENTAL.3603 |base/1/INCREMENTAL.2187 |base/1/INCREMENTAL.13418 |base/1/INCREMENTAL.3467 |base/1/INCREMENTAL.2615_vm |base/1/INCREMENTAL.2228 |base/1/INCREMENTAL.3503 |base/1/INCREMENTAL.2659 |base/1/INCREMENTAL.2607_vm |base/1/INCREMENTAL.4164 |postgres@mbanck-lin-1:~$ /usr/lib/postgresql/17/bin/pg_combinebackup backup/backup_full/ backup/backup_incr_1/ -o backup/combined |pg_combinebackup: error: could not open file "backup/backup_incr_1//PG_VERSION": No such file or directory Michael