Re: Changing the state of data checksums in a running cluster - Mailing list pgsql-hackers

From Daniel Gustafsson
Subject Re: Changing the state of data checksums in a running cluster
Date
Msg-id C458D1B0-EF58-4DFC-9833-BC23D6B9C7F2@yesql.se
Whole thread Raw
In response to Re: Changing the state of data checksums in a running cluster  (Tomas Vondra <tomas@vondra.me>)
Responses Re: Changing the state of data checksums in a running cluster
Re: Changing the state of data checksums in a running cluster
List pgsql-hackers
> On 6 Feb 2026, at 18:15, Tomas Vondra <tomas@vondra.me> wrote:

> I spent a bit more time looking at this today, and I figured out a
> simpler way to "cause trouble" by using PITR. I don't know if this has
> the same root cause as the failures in the 006 TAP test, but I find it
> interesting and I think it highlights some issues with the new patch.

I've encoded some of this into a new test suite in the attached patchset,
having it in a repeatable test should make it easier to reproduce and reason
about.

Attached is a rebase of the patchset which further refines the switch to using
XLOG_CHECKPOINT_* records which was proposed by Andres upstream.  The errors I
encountered could be traced to replay across multiple state transitions (with
at least one case of ON->OFF), and re-introducing checkpoints at on and off
state transitions helped removing those.

* A lot of code has been moved out from xlog.c and into the file which contains
  the code for the datachecksumsworker.  This file has in turn been renamed to
  signal what it's used for.  It should probably be moved from backend/postmaster
  but it's not clear to me where it should go.
* Improved signalhandling in the worker code.
* A new testsuite for PITR is added, which is based on Tomas' findings.  This
  can be further increased and refined but it's at least a start.
* Another new testsuite which changes full_page_writes settings is also added.
  This one should perhaps be rolled into 001_basic, not sure.
* Tests 006 and 007, the pgbench "torture test" suites, are now being run in a
  pared-down fashion when PG_TEST_EXTRA isn't defined, rather than being skipped
  entirely
* The minimum max_wal_size in tests is now 32 instead of 64.
* Lots of polish and small tweaking and new comments

There are a few TODOs, the more interesting ones include:

* The change to XLOG_CHECKPOINT_REDO to move the wal_level into a proper record
  structure should be pulled out as a 0001 patch as it's an cleanup that has
  value on its own.
* Try to move even more code awy from xlog.c
* Address, and remove, all XXX comments (most of which stem from a much earlier
  version of the patchset).
* The code for emitting, and waiting on, the procsignalbarrier exists in
  multiple copies which needs to be refactored into a single one.
* Tests failed in Autoconf on CI, which I need to fix.

My next step will be addressing these TODOs, but I wanted to get a new version
on the list sooner rather than later.

--
Daniel Gustafsson


Attachment

pgsql-hackers by date:

Previous
From: Henson Choi
Date:
Subject: Re: Row pattern recognition
Next
From: Chao Li
Date:
Subject: Re: Add documentation for PG_ABS_SRCDIR, PG_ABS_BUILDDIR, PG_LIBDIR, PG_DLSUFFIX