Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal
Date
Msg-id CA+TgmoYPNfgAT3OubjoYSiXTb3-urOjQxT3SCEUCrRBx5g6ihA@mail.gmail.com
Whole thread Raw
In response to Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal  (Fujii Masao <masao.fujii@oss.nttdata.com>)
Responses Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal
List pgsql-hackers
On Thu, Jul 4, 2024 at 10:35 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> +1 for v18 or later. However, since the reported issue is in v17,
> it needs to be addressed without such a improved check mechanism.

Here is a draft patch for that. This is only lightly tested at this
point, so further testing would be greatly appreciated, as would code
review.

NOTE THAT this seems to require bumping XLOG_PAGE_MAGIC and
PG_CONTROL_VERSION, which I imagine won't be super-popular considering
we've already shipped beta2. However, I don't see another principled
way to fix the reported problem. We do currently log an
XLOG_PARAMETER_CHANGE message at startup when wal_level has changed,
but that's useless for this purpose. Consider that WAL summarization
on a standby can start from any XLOG_CHECKPOINT_SHUTDOWN or
XLOG_CHECKPOINT_REDO record, and the most recent XLOG_PARAMETER_CHANGE
may be arbitrarily far behind that point, and replay may be
arbitrarily far ahead of that point by which things may have changed
again. The only way to know for certain what wal_level was in effect
at the time one of those records was generated is if the record itself
contains that information, so that's what I did. If I'm reading the
code correctly, this doesn't increase the size of the WAL, because
XLOG_CHECKPOINT_REDO was previously storing a dummy integer, and
CheckPoint looks to have an alignment padding hole where I inserted
the new member. Even if the size did increase, I don't think it would
matter: these records aren't emitted frequently enough for it to be a
problem. But I think it doesn't. However, it does change the format of
WAL, and because the checkpoint data is also stored in the control
file, it changes the format of that, too.

If we really, really don't want to let those values change at this
point in the release cycle, then we could alternatively just document
the kinds of scenarios that this protects against as unsupported and
admonish people sternly not to do that kind of thing. I think it's
actually sort of rare, because you have to do something like: enable
WAL summarization, do some stuff, restart with wal_level=minimal and
summarize_wal=off, do some more stuff but not so much stuff that any
of the WAL you generate gets removed, then restart again with
wal_level=replica/logical and summarize_wal=on again, so that the WAL
generated at the lower WAL level gets summarized before it gets
removed. Then, an incremental backup that you took after doing that
dance against a prior backup taken before doing all that might get
confused about what to do. Although that does not sound like a
terribly likely or advisable sequence of steps, I bet some people will
do it and then it will be hard to figure out what went wrong (and even
after we do, it won't make those people whole). And on the flip side I
don't think that bumping XLOG_PAGE_MAGIC or PG_CONTROL_VERSION will
cause huge inconvenience, because people don't tend to put beta2 into
production -- the early adopters tend to go for .0 or .1, not betaX.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Improve the connection failure error messages
Next
From: Andrew Dunstan
Date:
Subject: Re: ssl tests fail due to TCP port conflict