Re: Changing the state of data checksums in a running cluster - Mailing list pgsql-hackers
| From | Ayush Tiwari |
|---|---|
| Subject | Re: Changing the state of data checksums in a running cluster |
| Date | |
| Msg-id | CAJTYsWWn9_n2CGJOPNTvnqQUNtE3d4EXm5hy51CtEXptykdGUA@mail.gmail.com Whole thread |
| In response to | Re: Changing the state of data checksums in a running cluster (Daniel Gustafsson <daniel@yesql.se>) |
| List | pgsql-hackers |
Hi,
On Thu, 23 Apr 2026 at 15:47, Daniel Gustafsson <daniel@yesql.se> wrote:
While performing extensive post-commit testing using the built-in TAP tests in
checksum_extended mode, Tomas observed a couple issues. The testing was done
on multiple machines, ranging from a fast x86 machine to a much slower rpi4,
but they all performed a very large number of iterations. The combination of
hardware and the large number of tests executed is likely why some of the
issues went unnoticed.
Thanks for the patchset. I applied the eight proposed patches on top of
current master and did a review/test pass. I do have one concern
about the duplicate-launcher race discussed earlier, though.
I don't think 0006 fully addresses the duplicate-launcher exit race [1]. I was
able to reproduce the issue with a deterministic test using the existing
test_checksums delay hooks: delay
the initial StartDataChecksumsWorkerLauncher()
path so two quick pg_enable_data_checksums() calls can queue two launchers,
delay the final transition to on, then fire follow-up pg_enable_data_checksums()
calls while the real launcher is still active.
The relevant logs from that run:
[37940] DEBUG: background worker "datachecksums launcher" started
[37941] DEBUG: background worker "datachecksums launcher" started
[37940] LOG: enabling data checksums requested, starting data checksum calculation
[37941] LOG: background worker "datachecksums launcher" already running, exiting
[37940] LOG: initiating data checksum processing in database "postgres"
[37952] DEBUG: background worker "datachecksums launcher" started
[37952] LOG: enabling data checksums requested, starting data checksum calculation
[37952] LOG: initiating data checksum processing in database "postgres"
[37952] WARNING: cannot set data checksums to "on", current state is not "inprogress-on", disabling
[37989] DEBUG: background worker "datachecksums launcher" started
[37989] LOG: enabling data checksums requested, starting data checksum calculation
[38006] DEBUG: background worker "datachecksums launcher" started
[38006] LOG: enabling data checksums requested, starting data checksum calculation
So, at least the duplicate-launcher part is still reproducible after the fixup
series. The warning also suggests the state can be perturbed by the duplicate
launcher, although in my run the final state still ended up as on.
A few other smaller comments/nits:
1. In src/backend/access/transam/xlog.c, the comment in SetDataChecksumsOff()
says the branch implies the state is inprogress-on or inprogress-off, but
after the patch inprogress-on is handled by the preceding if condition. It
looks like this should probably only mention inprogress-off.
2. There is a repeated typo in comments:
src/backend/utils/init/postinit.c: "interrups" -> "interrupts"
src/backend/postmaster/auxprocess.c: "interrups" -> "interrupts"
3. A few commit messages could use a quick polish pass:
0002: "onine" -> "online"
0004: "Additionelly" -> "Additionally"
0006: "being enabled of disabled" -> "being enabled or disabled"
0006: "change it's operation" -> "change its operation"
0006: "ss now avoided" -> "is now avoided"
4. One minor question on 0006: the runtime cost-parameter update path is only
meaningful while checksums are being enabled. That looks fine from the current
control flow, but would it be worth adding a short comment or assertion near
that update path to make the assumption explicit?
Other than the above concern and cleanup items, the approach and behavior
looked good to me.
Regards,
Ayush
[1] PostgreSQL: [BUG] Race in online checksums launcher_exit()
the initial StartDataChecksumsWorkerLauncher()
path so two quick pg_enable_data_checksums() calls can queue two launchers,
delay the final transition to on, then fire follow-up pg_enable_data_checksums()
calls while the real launcher is still active.
The relevant logs from that run:
[37940] DEBUG: background worker "datachecksums launcher" started
[37941] DEBUG: background worker "datachecksums launcher" started
[37940] LOG: enabling data checksums requested, starting data checksum calculation
[37941] LOG: background worker "datachecksums launcher" already running, exiting
[37940] LOG: initiating data checksum processing in database "postgres"
[37952] DEBUG: background worker "datachecksums launcher" started
[37952] LOG: enabling data checksums requested, starting data checksum calculation
[37952] LOG: initiating data checksum processing in database "postgres"
[37952] WARNING: cannot set data checksums to "on", current state is not "inprogress-on", disabling
[37989] DEBUG: background worker "datachecksums launcher" started
[37989] LOG: enabling data checksums requested, starting data checksum calculation
[38006] DEBUG: background worker "datachecksums launcher" started
[38006] LOG: enabling data checksums requested, starting data checksum calculation
So, at least the duplicate-launcher part is still reproducible after the fixup
series. The warning also suggests the state can be perturbed by the duplicate
launcher, although in my run the final state still ended up as on.
A few other smaller comments/nits:
1. In src/backend/access/transam/xlog.c, the comment in SetDataChecksumsOff()
says the branch implies the state is inprogress-on or inprogress-off, but
after the patch inprogress-on is handled by the preceding if condition. It
looks like this should probably only mention inprogress-off.
2. There is a repeated typo in comments:
src/backend/utils/init/postinit.c: "interrups" -> "interrupts"
src/backend/postmaster/auxprocess.c: "interrups" -> "interrupts"
3. A few commit messages could use a quick polish pass:
0002: "onine" -> "online"
0004: "Additionelly" -> "Additionally"
0006: "being enabled of disabled" -> "being enabled or disabled"
0006: "change it's operation" -> "change its operation"
0006: "ss now avoided" -> "is now avoided"
4. One minor question on 0006: the runtime cost-parameter update path is only
meaningful while checksums are being enabled. That looks fine from the current
control flow, but would it be worth adding a short comment or assertion near
that update path to make the assumption explicit?
Other than the above concern and cleanup items, the approach and behavior
looked good to me.
Regards,
Ayush
[1] PostgreSQL: [BUG] Race in online checksums launcher_exit()
pgsql-hackers by date: