Re: Changing the state of data checksums in a running cluster - Mailing list pgsql-hackers

From: Daniel Gustafsson
Subject: Re: Changing the state of data checksums in a running cluster
Msg-id: 477897AE-1314-4724-9694-0BABC4F4ABDA@yesql.se
In response to: Re: Changing the state of data checksums in a running cluster  (Tomas Vondra <tomas@vondra.me>)
List: pgsql-hackers

> On 5 Nov 2025, at 15:05, Tomas Vondra <tomas@vondra.me> wrote:

> I took a look at the patch again, focusing mostly on the docs and
> comments.

While I am still working on a few ideas for how to handle the PITR issue, I
didn't want to leave this review hanging any longer (and it would be nice to
get it out of the way and not mix it with other issues).

> 1) func-admin.sgml
>
> - This is missing documentation for the "fast" parameters for both
> functions (enable/disable).

I had originally omitted them intentionally since they were intended for
testing, but I agree that they should be documented as they might be useful
outside of testing as well.
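
Roughly what I have in mind for the examples, assuming the function names and
the boolean "fast" parameter as they stand in the current patch (the exact
spelling may still change):

  -- Start enabling data checksums with the default throttled processing.
  SELECT pg_enable_data_checksums();

  -- Same, but without throttling ("fast"), trading a higher I/O impact for
  -- a shorter total runtime.
  SELECT pg_enable_data_checksums(fast => true);

  -- Disabling accepts the same parameter, even though there is a lot less
  -- work to perform.
  SELECT pg_disable_data_checksums(fast => true);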

> - The paragraph starts with "Initiates data checksums for ..." but that
> doesn't sound right to me. I think you can initiate enabling/disabling,
> not "data checksums".

Fair point.

> - I think the docs should at least briefly describe the impact on the
> cluster, and also on a standby, due to having to write everything into
> WAL, waiting for checkpoints etc. And maybe discuss how to mitigate that
> in some way. More about the standby stuff later.

Agreed, but I think it's better done in one central place like the data
checksums section in wal.sgml (which your XXX comment also mentions).  In the
preamble to the function table I've added a mention of the system performance
impact with a link to the aforementioned section.

> 2) glossary.sgml
>
> - This describes "checksum worker" as process that enables or disables
> checksums in a specific database, but we don't need any per-database
> processes when disabling, no?

That's very true: the worker launcher receives the operation, and if it is a
disable it proceeds without launching any workers since disabling is cluster
wide.  Thinking about it, maybe DataChecksumsWorkerLauncher isn't a very good
name; DataChecksumsController or DataChecksumsCoordinator might be a better
choice?

Naming seems to be hard, who knew..

> 3) monitoring.sgml
>
> - I'm not sure what "regardless" implies here. Does that mean we simply
> don't hide/reset the counters?

Correct, I've expanded this para to mention this.

> - I added a brief explanation about using the "launcher" row for overall
> progress, and per-database workers for "current progress".

+1

> - Do we want to refer to "datachecksumsworker"? Isn't it a bit too
> low-level detail?

I think so, we should stick to "launcher process" and "worker process" and be
consistent about it.

> - The table of phases is missing "waiting" and "done" phases. IMHO if
> the progress view can return it, it should be in the docs.

Nice catch.  The code can't actually return "waiting" since it was broken up
into three different wait phases, but one of them wasn't documented or added to
the view properly.  Fixed.
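
For the docs I'm considering a small example along these lines, using the
launcher row for the overall picture and the per-database worker rows for the
current progress (the view and column names below are just placeholders for
whatever the progress view ends up being called):

  -- Placeholder names, sketching the intended usage rather than the final
  -- catalog spelling: the launcher row reports overall progress across the
  -- cluster, the worker rows report progress within their database.
  SELECT pid, datname, phase
    FROM pg_stat_progress_data_checksums;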

> 4) wal.sgml
>
> - I added a sentence explaining that both enabling and disabling happens
> in phases, with checkpoints in between. I think that's helpful for users
> and DBAs.

+1

> - The section only described "enabling checksums", but it probably
> should explain the process for disabling too. Added a para.

+1

> - Again, I think we should explain the checkpoints, restartpoints,
> impact on standby ... somewhere. Not sure if this is the right place.

I've added a subsection in the main Data Checksums section for this.

> 5) xlog.c

> - Some minor corrections (typos, ...).

Thanks, I did some additional minor wordsmithing around these.

> - Isn't the claim that PG_DATA_CHECKSUM_ON_VERSION is the only place
> where we check InitialDataChecksumTransition stale? AFAIK we now check
> this in AbsorbDataChecksumsBarrier for all transitions, no?

Correct, fixed.

> 6) datachecksumsworker.c

> - One of the items in the synchronization/correctness section states
> that "Backends SHALL NOT violate local data_checksums state" but what
> does "violating local data_checksums state" mean? What even is "local
> state in this context"? Should we explain/define that, or would that be
> unnecessarily detailed?

By "local state" I was referring to the data checksum state that a backend
knows about.  I've reworded this to hopefully be a little clearer.

> - The section only talks about "enabling checksums" but also discusses
> all four possible states. Maybe it should talk about disabling too, as
> that requires the same synchronization/correctness.
>
> - Maybe it'd be good make it more explicit at which point the process
> waits on a barrier, which backend initiates that (and which backends are
> required to absorb the signal). It kinda is there, but only indirectly.

I've tried to address these, but they might still be off since I am very
Stockholm syndromed into this.

> - Another idea I had is that maybe it'd help to have some visualization
> of the process (with data_checksum states, barriers, ...) e.g. in the
> form of an ASCII image.

I instead opted for an SVG image in the docs which illustrates the states and
the transitions.  My Graphviz skills aren't that great so it doesn't look all
that pretty yet, but it's something to iterate on at least.

> I'm much more concerned about streaming replicas, because there it
> forces a restart point - and it *blocks redo* until it completes. Which
> means there'll be replication lag, and for synchronous standbys this
> would even block progress on the primary.
>
> We should very clearly document this.

Indeed. I've started on such a section.

> I see some similarities to shutdown checkpoints, which can take a lot of
> time if there happen to be a lot of dirty data, increasing disruption
> during planned restarts (when no one can connect). A common mitigation
> is to run CHECKPOINT shortly before the restart, to flush most of the
> dirty data while still allowing new connections.
>
> Maybe we could do something like that for checksum changes? I don't know
> how exactly we could do that, but let's say we can predict when roughly
> to expect the next state change.

Each worker knows how far it has come within its database, but is unaware of
other databases; the launcher knows how far it has come across all databases,
but is unaware of the relative size of each database.  Maybe there is still a
heuristic that can be teased out of that imperfect knowledge.
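
In the meantime, a DBA who keeps an eye on the progress view could apply the
same trick by hand; a strawman sketch of the mitigation you describe:

  -- Once progress suggests the run is close to finishing, flush dirty
  -- buffers on the standby so the forced restartpoint at the state change
  -- has little left to write.
  CHECKPOINT;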

> And we'd ensure the standby starts
> flushing stuff before that, so that creating the restartpoint is cheap.
> Or maybe we'd (gradually?) lower max_wal_size on the standby, to reduce
> the amount of WAL as we're getting closer to the end?

That's an interesting idea; do you know if we have any processes taking a
similar approach today?

The attached is a rebase with the above fixes along with a few more smaller
fixups and cleanups noticed along the way, nothing which changes any
functionality though.

--
Daniel Gustafsson


