Home > mailing lists

Re: Changing the state of data checksums in a running cluster - Mailing list pgsql-hackers

From	Daniel Gustafsson
Subject	Re: Changing the state of data checksums in a running cluster
Date	April 23 13:16:44
Msg-id	9197F930-DDEB-4CAC-82A2-16FEC715CCE8@yesql.se Whole thread
In response to	Re: Changing the state of data checksums in a running cluster (Daniel Gustafsson <daniel@yesql.se>)
List	pgsql-hackers

Tree view

While performing extensive post-commit testing using the built-in TAP tests in
checksum_extended mode, Tomas observed a couple issues.  The testing was done
on multiple machines, ranging from a fast x86 machine to a much slower rpi4,
but they all performed a very large number of iterations.  The combination of
hardware and the large number of tests executed is likely why some of the
issues went unnoticed.

Below are the issues described in more detail:

1) race condition between checkpoint and checksum state changes

The main issue is a race condition causing invalid state transitions.
When analyzing the failure (which we couldn't synthetically reproduce, only
observe by running tests over and over for a long time) we realized that the
original coding was performing checksum state transitions while replaying
online checkpoints as well as redo checkpoints.

Updating checksum state while replaying online checkpoints is incorrect.  After
a crash we'll start with the REDO record, which initializes checksums to the
right state.  And then later the checksum state is updated by the regular
XLOG_CHECKSUMS entries.

Moreover, the checksum state in the XLOG_CHECKPOINT_ONLINE record can be stale,
because the value is determined at the checkpoint start, but the WAL entry
is added at the end.  If there is a concurrent checksum change, the value
written to the WAL record will be stale.  Replaying it will cause an "invalid
transition" error later, during the next checksum state update. The fix is to
remove the checksum update from the online checkpoint altogether.

In fact, there's no need for CreateCheckPoint to update checksum state
in the ControlFile. If the value is stale, it could make it permanent.
But it's unnecessary - the ControlFile is updated by the process
performing the checksum state change. So remove that.

This, along with an re-ordering updating the controlfile and procsignal barrier
enission fixes the issue.  The re-ordering makes sure that the controlfile is
always updated *before* a procsignalbarrier is emitted to avoid a race like the
one described below:

1. A barrier for off to inprogress-on is emitted
2. All active backends absorbs the barrier
  - All processes in the cluster are in state inprogress-on
3. A new backend b' forks, reads controlfile and sets state of off
4. The controlfile is updated
5. A new backend b'' forks, reads controlfile and sets state to inprogress-on

b' and b'' have different states, and b' has an incompatible state with the
rest of the cluster.  Re-ordering as done in the attached makes this go away.


2) race condition in launcher exit

Another timing related issue was that reverting to the "off" state then
launcher errors out had synchronization logic which was racy as it was relying
on the cached checksum state and not the value from XLogCtl.  The logic for
determining if a launcher/worker was already active was also fragile as it
started another launcher which would overwrite certain data in shared memory.
The attached patch inspects shared memory instead and use that to signal the
running launcher to either abort (disable), or change cost parameters on a
running enable process.  These fixes makes erroring out and going to off state
stable.


3) Concurrency issue with ProcSignalInit / InitLocalDataChecksumState

The checksum barriers must not be consumed before the initial value gets
properly set.  On very slow systems, there could potentially be multiple
checksum state transitions between a fork and InitLocalDataChecksumState.  In
such cases we might get failures due to incorrect transitions.

With the current code this is not a live issue, as there is no place checking
interrupts in between the two functions.  But it's fragile, as it's trivial to
break this by adding an elog() somewhere.  Which is what happened to us while
debugging the other issues.  So better to explicitly hold interrupts for a
brief moment.

To find the issues, and to validate their fix, Tomas developed a new testsuite
which is attached as a .txt.  This is not proposed for adding to v19, it is
included to showcase what was done, and what will be further hacked on for a
new suite during the v20 cycle.  It is gated behind PG_TEST_EXTRA and is
intended to be executed by select members of the buildfarm.

As part of this postcommit review we also found a few more cleanups and smaller
fixes which are included.  The patchset also includes the patch submitted
upthread by Satyanarayana Narlapuram.

More details can be found in the individual commit messages.

Unless there are objections I would like to go ahead with this fixup fairly soon.

--
Daniel Gustafsson

Attachment

pgsql-hackers by date:

From: "Zhijie Hou (Fujitsu)"
Date: 23 April, 13:07:33
Subject: RE: Parallel Apply

From: Jakub Wartak
Date: 23 April, 13:20:18
Subject: Re: Add errdetail() with PID and UID about source of termination signal

Re: Changing the state of data checksums in a running cluster - Mailing list pgsql-hackers

Attachment

Previous

Next