Thread: Changing the state of data checksums in a running cluster

Changing the state of data checksums in a running cluster

From

Daniel Gustafsson

Date:

03 July 2024, 06:41:01

After some off-list discussion about the desirability of this feature, where
several hackers opined that it's something that we should have, I've decided to
rebase this patch and submit it one more time. There are several (long)
threads covering the history of this patch [0][1], related work stemming from
this [2] as well as earlier attempts and discussions [3][4]. Below I try to
respond to a summary of points raised in those threads.

The mechanics of the patch hasn't changed since the last posted version, it has
mainly been polished slightly. A high-level overview of the processing is:
It's using a launcher/worker model where the launcher will spawn a worker per
database which will traverse all pages and dirty them in order to calculate and
set the checksum on them. During this inprogress state all backends calculated
and write checksums but don't verify them on read. Once all pages have been
checksummed the state of the cluster will switch over to "on" synchronized
across all backends with a procsignalbarrier. At this point checksums are
verified and processing is equal to checksums having been enabled initdb. When
a user disables checksums the cluster enters a state where all backends still
write checksums until all backends have acknowledged that they have stopped
verifying checksums (again using a procsignalbarrier). At this point the
cluster switches to "off" and checksums are neither written nor verified. In
case the cluster is restarted, voluntarily or via a crash, processing will have
to be restarted (more on that further down).

The user facing controls for this are two SQL level functions, for enabling and
disabling. The existing data_checksums GUC remains but is expanded with more
possible states (with on/off retained).

Complaints against earlier versions
===================================
Seasoned hackers might remember that this patch has been on -hackers before.
There has been a lot of review, and AFAICT all specific comments have been
addressed. There are however a few larger more generic complaints:

* Restartability - the initial version of the patch did not support stateful
restarts, a shutdown performed (or crash) before checksums were enabled would
result in a need to start over from the beginning. This was deemed the safe
orchestration method. The lack of this feature was seen as serious drawback,
so it was added. Subsequent review instead found the patch to be too
complicated with a too large featureset. I thihk there is merit to both of
these arguments: being able to restart is a great feature; and being able to
reason about the correctness of a smaller patch is also great. As of this
submission I have removed the ability to restart to keep the scope of the patch
small (which is where the previous version was, which received no review after
the removal). The way I prefer to frame this is to first add scaffolding and
infrastructure (this patch) and leave refinements and add-on features
(restartability, but also others like parallel workers, optimizing rare cases,
etc) for follow-up patches.

* Complexity - it was brought up that this is a very complex patch for a niche
feature, and there is a lot of truth to that. It is inherently complex to
change a pg_control level state of a running cluster. There might be ways to
make the current patch less complex, while not sacrificing stability, and if so
that would be great. A lot of of the complexity came from being able to
restart processing, and that's not removed for this version, but it's clearly
not close to a one-line-diff even without it.

Other complaints were addressed, in part by the invention of procsignalbarriers
which makes this synchronization possible. In re-reading the threads I might
have missed something which is still left open, and if so I do apologize for
that.

Open TODO items:
================
* Immediate checkpoints - the code is currently using CHECKPOINT_IMMEDIATE in
order to be able to run the tests in a timely manner on it. This is overly
aggressive and dialling it back while still being able to run fast tests is a
TODO. Not sure what the best option is there.

* Monitoring - an insightful off-list reviewer asked how the current progress
of the operation is monitored. So far I've been using pg_stat_activity but I
don't disagree that it's not a very sharp tool for this. Maybe we need a
specific function or view or something? There clearly needs to be a way for a
user to query state and progress of a transition.

* Throttling - right now the patch uses the vacuum access strategy, with the
same cost options as vacuum, in order to implement throttling. This is in part
due to the patch starting out modelled around autovacuum as a worker, but it
may not be the right match for throttling checksums.

* Naming - the in-between states when data checksums are enabled or disabled
are called inprogress-on and inprogress-off. The reason for this is simply
that early on there were only three states: inprogress, on and off, and the
process of disabling wasn't labeled with a state. When this transition state
was added it seemed like a good idea to tack the end-goal onto the transition.
These state names make the code easily greppable but might not be the most
obvious choices for anything user facing. Is "Enabling" and "Disabling" better
terms to use (across the board or just user facing) or should we stick to the
current?

There are ways in which this processing can be optimized to achieve better
performance, but in order to keep goalposts in sight and patchsize down they
are left as future work.

--
Daniel Gustafsson

[0] https://www.postgresql.org/message-id/flat/CABUevExz9hUUOLnJVr2kpw9Cx%3Do4MCr1SVKwbupzuxP7ckNutA%40mail.gmail.com
[1] https://www.postgresql.org/message-id/flat/CABUevEwE3urLtwxxqdgd5O2oQz9J717ZzMbh%2BziCSa5YLLU_BA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/20181030051643.elbxjww5jjgnjaxg%40alap3.anarazel.de
[3] https://www.postgresql.org/message-id/flat/FF393672-5608-46D6-9224-6620EC532693%40endpoint.com
[4] https://www.postgresql.org/message-id/flat/CABUevEx8KWhZE_XkZQpzEkZypZmBp3GbM9W90JLp%3D-7OJWBbcg%40mail.gmail.com

Attachment

v1-0001-Support-checksum-enable-disable-in-a-running-clus.patch

Re: Changing the state of data checksums in a running cluster

From

Tomas Vondra

Date:

03 July 2024, 11:20:10

Hi Daniel,

Thanks for rebasing the patch and submitting it again!

On 7/3/24 08:41, Daniel Gustafsson wrote:
> After some off-list discussion about the desirability of this feature, where
> several hackers opined that it's something that we should have, I've decided to
> rebase this patch and submit it one more time.  There are several (long)
> threads covering the history of this patch [0][1], related work stemming from
> this [2] as well as earlier attempts and discussions [3][4].  Below I try to
> respond to a summary of points raised in those threads.
> 
> The mechanics of the patch hasn't changed since the last posted version, it has
> mainly been polished slightly.  A high-level overview of the processing is:
> It's using a launcher/worker model where the launcher will spawn a worker per
> database which will traverse all pages and dirty them in order to calculate and
> set the checksum on them.  During this inprogress state all backends calculated
> and write checksums but don't verify them on read.  Once all pages have been
> checksummed the state of the cluster will switch over to "on" synchronized
> across all backends with a procsignalbarrier.  At this point checksums are
> verified and processing is equal to checksums having been enabled initdb.  When
> a user disables checksums the cluster enters a state where all backends still
> write checksums until all backends have acknowledged that they have stopped
> verifying checksums (again using a procsignalbarrier).  At this point the
> cluster switches to "off" and checksums are neither written nor verified.  In
> case the cluster is restarted, voluntarily or via a crash, processing will have
> to be restarted (more on that further down).
> 
> The user facing controls for this are two SQL level functions, for enabling and
> disabling.  The existing data_checksums GUC remains but is expanded with more
> possible states (with on/off retained).
> 
> 
> Complaints against earlier versions
> ===================================
> Seasoned hackers might remember that this patch has been on -hackers before.
> There has been a lot of review, and AFAICT all specific comments have been
> addressed.  There are however a few larger more generic complaints:
> 
> * Restartability - the initial version of the patch did not support stateful
> restarts, a shutdown performed (or crash) before checksums were enabled would
> result in a need to start over from the beginning.  This was deemed the safe
> orchestration method.  The lack of this feature was seen as serious drawback,
> so it was added.  Subsequent review instead found the patch to be too
> complicated with a too large featureset.  I thihk there is merit to both of
> these arguments: being able to restart is a great feature; and being able to
> reason about the correctness of a smaller patch is also great.  As of this
> submission I have removed the ability to restart to keep the scope of the patch
> small (which is where the previous version was, which received no review after
> the removal).  The way I prefer to frame this is to first add scaffolding and
> infrastructure (this patch) and leave refinements and add-on features
> (restartability, but also others like parallel workers, optimizing rare cases,
> etc) for follow-up patches.
> 

I 100% support this approach.

Sure, I'd like to have a restartable tool, but clearly that didn't go
particularly well, and we still have nothing to enable checksums online.
That doesn't seem to benefit anyone - to me it seems reasonable to get
the non-restartable tool in, and then maybe later someone can improve
this to make it restartable. Thanks to the earlier work we know it's
doable, even if it was too complex.

This way it's at least possible to enable checksums online with some
additional care (e.g. to make sure no one restarts the cluster etc.).
I'd bet for vast majority of systems this will work just fine. Huge
systems with some occasional / forced restarts may not be able to make
this work - but then again, that's no worse than now.

> * Complexity - it was brought up that this is a very complex patch for a niche
> feature, and there is a lot of truth to that.  It is inherently complex to
> change a pg_control level state of a running cluster.  There might be ways to
> make the current patch less complex, while not sacrificing stability, and if so
> that would be great.  A lot of of the complexity came from being able to
> restart processing, and that's not removed for this version, but it's clearly
> not close to a one-line-diff even without it.
> 

I'd push back on this a little bit - the patch looks like this:

 50 files changed, 2691 insertions(+), 48 deletions(-)

and if we ignore the docs / perl tests, then the two parts that stand
out are

 src/backend/access/transam/xlog.c             |  455 +++++-
 src/backend/postmaster/datachecksumsworker.c  | 1353 +++++++++++++++++

I don't think the worker code is exceptionally complex. Yes, it's not
trivial, but a lot of the 1353 inserts is comments (which is good) or
generic infrastructure to start the worker etc.

> Other complaints were addressed, in part by the invention of procsignalbarriers
> which makes this synchronization possible.  In re-reading the threads I might
> have missed something which is still left open, and if so I do apologize for
> that.
> 
> 
> Open TODO items:
> ================
> * Immediate checkpoints - the code is currently using CHECKPOINT_IMMEDIATE in
> order to be able to run the tests in a timely manner on it.  This is overly
> aggressive and dialling it back while still being able to run fast tests is a
> TODO.  Not sure what the best option is there.
> 

Why not to add a parameter to pg_enable_data_checksums(), specifying
whether to do immediate checkpoint or wait for the next one? AFAIK
that's what we do in pg_backup_start, for example.

> * Monitoring - an insightful off-list reviewer asked how the current progress
> of the operation is monitored.  So far I've been using pg_stat_activity but I
> don't disagree that it's not a very sharp tool for this.  Maybe we need a
> specific function or view or something?  There clearly needs to be a way for a
> user to query state and progress of a transition.
> 

Yeah, I think a view like pg_stat_progress_checksums would work.

> * Throttling - right now the patch uses the vacuum access strategy, with the
> same cost options as vacuum, in order to implement throttling.  This is in part
> due to the patch starting out modelled around autovacuum as a worker, but it
> may not be the right match for throttling checksums.
> 

IMHO it's reasonable to reuse the vacuum throttling. Even if it's not
perfect, it does not seem great to invent something new and end up with
two different ways to throttle stuff.

> * Naming - the in-between states when data checksums are enabled or disabled
> are called inprogress-on and inprogress-off.  The reason for this is simply
> that early on there were only three states: inprogress, on and off, and the
> process of disabling wasn't labeled with a state.  When this transition state
> was added it seemed like a good idea to tack the end-goal onto the transition.
> These state names make the code easily greppable but might not be the most
> obvious choices for anything user facing.  Is "Enabling" and "Disabling" better
> terms to use (across the board or just user facing) or should we stick to the
> current?
> 

I think the naming is fine. In the worst case we can rename that later,
seems more like a detail.

> There are ways in which this processing can be optimized to achieve better
> performance, but in order to keep goalposts in sight and patchsize down they
> are left as future work.
> 

+1


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Changing the state of data checksums in a running cluster

From

Bruce Momjian

Date:

06 July 2024, 02:23:35

On Wed, Jul  3, 2024 at 01:20:10PM +0200, Tomas Vondra wrote:
> > * Restartability - the initial version of the patch did not support stateful
> > restarts, a shutdown performed (or crash) before checksums were enabled would
> > result in a need to start over from the beginning.  This was deemed the safe
> > orchestration method.  The lack of this feature was seen as serious drawback,
> > so it was added.  Subsequent review instead found the patch to be too
> > complicated with a too large featureset.  I thihk there is merit to both of
> > these arguments: being able to restart is a great feature; and being able to
> > reason about the correctness of a smaller patch is also great.  As of this
> > submission I have removed the ability to restart to keep the scope of the patch
> > small (which is where the previous version was, which received no review after
> > the removal).  The way I prefer to frame this is to first add scaffolding and
> > infrastructure (this patch) and leave refinements and add-on features
> > (restartability, but also others like parallel workers, optimizing rare cases,
> > etc) for follow-up patches.
> > 
> 
> I 100% support this approach.

Yes, I was very disappointed when restartability sunk the patch, and I
saw this as another case where saying "yes" to every feature improvement
can lead to failure.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.

Re: Changing the state of data checksums in a running cluster

From

Michael Banck

Date:

01 October 2024, 01:43:49

Hi,

On Mon, Sep 30, 2024 at 11:21:30PM +0200, Daniel Gustafsson wrote:
> > Yeah, I think a view like pg_stat_progress_checksums would work.
> 
> Added in the attached version.  It probably needs some polish (the docs for
> sure do) but it's at least a start.

Just a nitpick, but we call it data_checksums about everywhere, but the
new view is called pg_stat_progress_datachecksums - I think
pg_stat_progress_data_checksums would look better even if it gets quite
long.

Michael

Re: Changing the state of data checksums in a running cluster

From

Daniel Gustafsson

Date:

01 October 2024, 21:55:34

> On 1 Oct 2024, at 00:43, Michael Banck <mbanck@gmx.net> wrote:
>
> Hi,
>
> On Mon, Sep 30, 2024 at 11:21:30PM +0200, Daniel Gustafsson wrote:
>>> Yeah, I think a view like pg_stat_progress_checksums would work.
>>
>> Added in the attached version.  It probably needs some polish (the docs for
>> sure do) but it's at least a start.
>
> Just a nitpick, but we call it data_checksums about everywhere, but the
> new view is called pg_stat_progress_datachecksums - I think
> pg_stat_progress_data_checksums would look better even if it gets quite
> long.

That's a fair point, I'll make sure to switch for the next version of the
patch.

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

From

Tomas Vondra

Date:

07 October 2024, 17:46:25

Hi,

I did a quick review today. First a couple minor comments:

1) monitoring.sgml

typos: number of database -> number of databases
       calcuated -> calculated

2) unnecessary newline in heapam.c (no other changes)

3) unnecessary ListCell in DataChecksumsWorkerMain() on line 1345,
shadowing earlier variable

4) unnecessary comment change in bufmgr.c (no other changes)

5) unnecessary include and newline in bulk_write.c (no other changes)

6) unnecessary newline in pgstat.c (no other changes)

7) controldata.c - maybe this

   if (oldctrl->data_checksum_version == 2)

should use PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION instead of the magic
constant? For "off" we use "0" which seems somewhat acceptable, but for
other values it's less obvious what the meaning is.

8) xlog_internal.h - xl_checksum_state should be added to typedefs

9) system_functions.sql - Isn't it weird that this only creates the new
pg_enable_data_checksums function, but not pg_disable_data_checksums? It
also means it doesn't revoke EXECUTE from public on it, which I guess it
probably should? Or why should this be different for the two functions?

Also the error message seems to differ:

  test=> select pg_enable_data_checksums();
  ERROR:  permission denied for function pg_enable_data_checksums
  test=> select pg_disable_data_checksums();
  ERROR:  must be superuser

Probably harmless, but seems a bit strange.


But there also seems to be a more serious problem with recovery. I did a
simple script that does a loop of

* start a cluster
* initialize a small pgbench database (scale 1 - 100)
* run "long" pgbench
* call pg_enable_data_checksums(), wait for it to complete
* stop the cluster with "-m immediate"
* start the cluster

And this unfortunately hits this assert:

  bool
  AbsorbChecksumsOnBarrier(void)
  {
    Assert(LocalDataChecksumVersion ==
PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
    LocalDataChecksumVersion = PG_DATA_CHECKSUM_VERSION;
    return true;
  }

Based on our short discussion about this, the controlfile gets updated
right after pg_enable_data_checksums() completes. The immediate stop
however forces a recovery since the last checkpoint, which means we see
the XLOG_CHECKSUMS WAL message again, and set the barrier. And then we
exit recovery, try to start checkpointer and it trips over this, because
the control file already has the "on" value :-(

I'm not sure what's the best way to fix this. Would it be possible to
remember we saw the XLOG_CHECKSUMS during recovery, and make the assert
noop in that case? Or not set the barrier when exiting recovery. I'm not
sure the relaxed assert would remain meaningful, though. What would it
check on standbys, for example?

Maybe a better way would be to wait for a checkpoint before updating the
controlfile, similar to what we do at the beginning? Possibly even with
the same "fast=true/false" logic. That would prevent us from seeing the
XLOG_CHECKSUMS wal record with the updated flag. It would extend the
"window" where a crash would mean we have to redo the checksums, but I
don't think that matters much. For small databases who cares, and for
large databases it should not be a meaningful difference (setting the
checksums already ran over multiple checkpoints, so one checkpoint is
not a big difference).


regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

From

Dagfinn Ilmari Mannsåker

Date:

07 October 2024, 21:42:31

Tomas Vondra <tomas@vondra.me> writes:

> 3) unnecessary ListCell in DataChecksumsWorkerMain() on line 1345,
>    shadowing earlier variable

All the ListCell variables can be eliminated by using the foreach_ptr
and foreach_oid macros instead of plain foreach.

- ilmari

Re: Changing the state of data checksums in a running cluster

From

Daniel Gustafsson

Date:

08 October 2024, 23:39:57

> On 7 Oct 2024, at 20:42, Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> wrote:
>
> Tomas Vondra <tomas@vondra.me> writes:
>
>> 3) unnecessary ListCell in DataChecksumsWorkerMain() on line 1345,
>>   shadowing earlier variable
>
> All the ListCell variables can be eliminated by using the foreach_ptr
> and foreach_oid macros instead of plain foreach.

Fair point, done in the v4 attached upthread.

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

From

Tomas Vondra

Date:

09 October 2024, 13:41:02


On 10/8/24 22:38, Daniel Gustafsson wrote:
> 
>> 7) controldata.c - maybe this
>>
>>   if (oldctrl->data_checksum_version == 2)
>>
>> should use PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION instead of the magic
>> constant? For "off" we use "0" which seems somewhat acceptable, but for
>> other values it's less obvious what the meaning is.
> 
> It doesn't seem clean to include storage/bufpage.h in pg_upgrade, I wonder if
> we should move (or mirror) the checksum versions to storage/checksum_impl.h to
> make them available to frontend and backend tools?
> 

+1 to have checksum_impl.h

> 
>> But there also seems to be a more serious problem with recovery. I did a
>> simple script that does a loop of
>>
>> * start a cluster
>> * initialize a small pgbench database (scale 1 - 100)
>> * run "long" pgbench
>> * call pg_enable_data_checksums(), wait for it to complete
>> * stop the cluster with "-m immediate"
>> * start the cluster
>>
>> And this unfortunately hits this assert:
>>
>>  bool
>>  AbsorbChecksumsOnBarrier(void)
>>  {
>>    Assert(LocalDataChecksumVersion ==
>> PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
>>    LocalDataChecksumVersion = PG_DATA_CHECKSUM_VERSION;
>>    return true;
>>  }
>>
>> Based on our short discussion about this, the controlfile gets updated
>> right after pg_enable_data_checksums() completes. The immediate stop
>> however forces a recovery since the last checkpoint, which means we see
>> the XLOG_CHECKSUMS WAL message again, and set the barrier. And then we
>> exit recovery, try to start checkpointer and it trips over this, because
>> the control file already has the "on" value :-(
>>
>> I'm not sure what's the best way to fix this. Would it be possible to
>> remember we saw the XLOG_CHECKSUMS during recovery, and make the assert
>> noop in that case? Or not set the barrier when exiting recovery. I'm not
>> sure the relaxed assert would remain meaningful, though. What would it
>> check on standbys, for example?
>>
>> Maybe a better way would be to wait for a checkpoint before updating the
>> controlfile, similar to what we do at the beginning? Possibly even with
>> the same "fast=true/false" logic. That would prevent us from seeing the
>> XLOG_CHECKSUMS wal record with the updated flag. It would extend the
>> "window" where a crash would mean we have to redo the checksums, but I
>> don't think that matters much. For small databases who cares, and for
>> large databases it should not be a meaningful difference (setting the
>> checksums already ran over multiple checkpoints, so one checkpoint is
>> not a big difference).
> 
> The more I think about it the more I think that updating the control file is
> the wrong thing to do for this patch, it should only change the state in memory
> and let the checkpoints update the controlfile.  The attached fixes that and I
> can no longer reproduce the assertion failure you hit.
> 

I think leaving the update of controlfile to checkpointer is correct,
and probably the only way to make this correct (without race
conditions). We need to do that automatically with the checkpoint (which
updates the redo LSN, guaranteeing we won't see the XLOG_CHECKSUMS
record again).

I ran the tests with this new patch, and I haven't reproduced the
crashes. I'll let it run a bit longer, and improve it to test some more
stuff, but it looks good.

regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

From

Tomas Vondra

Date:

27 November 2024, 01:07:12

Hi,

I spent a bit more time doing some testing on the last version of the
patch from [1]. And I ran into this assert in PostmasterStateMachine()
when stopping the cluster:

  /* All types should be included in targetMask or remainMask */
  Assert((remainMask.mask | targetMask.mask) == BTYPE_MASK_ALL.mask);

At first I was puzzled as this happens on every shutdown, but that's
because these checks were introduced by a78af0427015 a week ago. So it's
more a matter of rebasing.

However, I also noticed the progress monitoring does not really work. I
get stuff like this:

    + psql -x test -c 'select * from pg_stat_progress_data_checksums'
    -[ RECORD 1 ]---------------------+---------
    pid                               | 56811
    datid                             | 0
    datname                           |
    phase                             | enabling
    databases_total                   | 4
    relations_total                   |
    databases_processed               | 0
    relations_processed               | 0
    databases_current                 | 16384
    relation_current                  | 0
    relation_current_blocks           | 0
    relation_current_blocks_processed | 0

But I've never seen any of the "processed" fields to be non-zero (and
relations is even NULL), and the same thing applies to relation_. Also
what is the datid/datname about? It's empty, not mentioned in sgml docs,
and we already have databases_current ...

The message [2] from 10/08 says:

> I did remove parts of the progress reporting for now since it can't be
> used from the dynamic backgroundworker it seems.  I need to regroup
> and figure out a better way there, but I wanted to address your above
> find sooner rather than wait for that.

And I guess that would explain why some of the fields are not updated.
But then the later patch versions seem to imply there are no outstanding
issues / missing stuff.


regards


[1]
https://www.postgresql.org/message-id/CA226DE1-DC9A-4675-A83C-32270C473F0B%40yesql.se

[2]
https://www.postgresql.org/message-id/DD25705F-E75F-4DCA-B49A-5578F4F55D94%40yesql.se

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

From

Tomas Vondra

Date:

09 March, 20:57:52

Here's a rebased version of the patch series, addressing the issues I've
pointed out in the last round of reviews. I've kept the changes in
separate patches for clarity, but it should be squashed into a single
patch in the end.


1) v20250309-0001-Online-enabling-and-disabling-of-data-chec.patch
------------------------------------------------------------------

Original patch, rebased, resolving merge conflicts.


2) v20250309-0002-simple-post-rebase-fixes.patch
------------------------------------------------

A couple minor fixes, addressing test failures due to stuff committed
since the previous patch version. Mostly mechanical, the main change is
I don't think the pgstat_bestart() call is necessary. Or is it?


3) v20250309-0003-sync-the-data_checksums-GUC-with-the-local.patch
------------------------------------------------------------------

This is the main change, fixing failures in 002_actions.pl - the short
version is that test does "-C data_checksums", but AFAICS that does not
quite work because it does not call show_data_checksums() that early,
and instead just grabs the variable backing the GUC. Which may be out of
sync, so this patch fixes that by updating them both.

That fixes the issue, but it's it a bit strange we now have three places
tracking the state of data checksums? We have data_checksum_version in
the control file, and then data_checksums and LocalDataChecksumVersion
in the backends.

Would it be possible to "unify" the latter two? That would also mean we
don't have the duplicate constants like PG_DATA_CHECKSUM_VERSION and
DATA_CHECKSUM_VERSION. Or why do we need that?


4) v20250309-0004-make-progress-reporting-work.patch
----------------------------------------------------

The progress reporting got "mostly disabled" in an earlier version, due
to issues with the bgworkers. AFAICS the issue is the "status" row can
be updated only by a single process, which does not quite work with the
launcher + per-db workers architecture.

I've considered a couple different approaches:


a) updating the status only from the launcher

This is mostly what CREATE INDEX does with parallel builds, and there
it's mostly sufficient. But for checksums it'd mean we only have the
number of databases to process/done, and that seems unsatisfactory,
considering large clusters often have only a single large database. So
not good enough, IMHO.


b) allowing workers to update the status row, created by the launcher

I guess this would be better, we'd know the relations/blocks counts. And
I haven't tried coding this, but there would need to be some locking so
that the workers don't overwrite increments from other workers, etc.

But I don't think it'd work nicely with parallel per-db workers (which
we don't do now, but we might).


c) having one status entry per worker

I ended up doing this, it didn't require any changes to the progress
infrastructure, and it will work naturally even with multiple workers.
There will always be one row for launcher status (which only has the
number of databases total/done), and then one row per worker, with
database-level info (datid, datname, #relations, #blocks).

I removed the "DONE" phase, because that's right before the launcher
exists, and I don't think we have that for similar cases. And I added
"waiting on checkpoint" state, because that's often a long wait when the
launcher seems to do nothing, so it seems useful to communicate the
reason for that wait.


5) v20250309-0005-update-docs.patch
-----------------------------------

Minor tweaks to docs, to reflect the changes to the progress reporting
changes, and also some corrections (no resume after restart, ...).


So far this passed all my tests - both chekc-world and stress testing
(no failures / assert crashes, ...). One thing that puzzles me is I
wasn't able to reproduce the failures reported in [1] - not even with
just the rebase + minimal fixes (0001 + 0002).  My best theory is this
is somehow machine-specific, and my laptop is too slow or something.
I'll try with the machine I used before once it gets available.


regards


[1]
https://www.postgresql.org/message-id/e4dbcb2c-e04a-4ba2-bff0-8d979f55960e%40vondra.me


-- 
Tomas Vondra

Attachment

Re: Changing the state of data checksums in a running cluster

From

Tomas Vondra

Date:

10 March, 02:35:29

Seems cfbot was unhappy with the patches, so here's an improved version,
fixing some minor issues in expected output and a compiler warning.

There however seems to be some issue with 003_standby_restarts, which
causes failures on freebsd and macos. I don't know what that is about,
but the test runs much longer than on debian.


regards

-- 
Tomas Vondra

Attachment

Re: Changing the state of data checksums in a running cluster

From

Tomas Vondra

Date:

10 March, 03:18:23

On 3/10/25 00:35, Tomas Vondra wrote:
> Seems cfbot was unhappy with the patches, so here's an improved version,
> fixing some minor issues in expected output and a compiler warning.
> 
> There however seems to be some issue with 003_standby_restarts, which
> causes failures on freebsd and macos. I don't know what that is about,
> but the test runs much longer than on debian.
> 

OK, turns out the failures were caused by the test creating a standby
from a backup, without a slot, so sometimes the primary removed the
necessary WAL. Fixed in the attached version.

There's still a failure on windows, though. I'd bet that's due to the
data_checksum/LocalDatachecksumVersion sync not working correctly on
builds with EXEC_BACKEND, or something like that, but it's too late so
I'll take a closer look tomorrow.

regards

-- 
Tomas Vondra

Attachment

Re: Changing the state of data checksums in a running cluster

From

Tomas Vondra

Date:

10 March, 12:46:32

On 3/10/25 01:18, Tomas Vondra wrote:
>
> ...
>
> There's still a failure on windows, though. I'd bet that's due to the
> data_checksum/LocalDatachecksumVersion sync not working correctly on
> builds with EXEC_BACKEND, or something like that, but it's too late so
> I'll take a closer look tomorrow.
> 

Just like I suspected, there was a bug in EXEC_BACKEND, although a bit
different from what I guessed - the worker state in shmem was zeroed
every time, not just once. And a second issue was child_process_kinds
got out of sync with BackendType (mea culpa).

For me, this passes all CI tests, hopefully cfbot will be happy too.

regards

-- 
Tomas Vondra

Attachment

Re: Changing the state of data checksums in a running cluster

From

Tomas Vondra

Date:

10 March, 14:17:58

On 3/10/25 10:46, Tomas Vondra wrote:
> On 3/10/25 01:18, Tomas Vondra wrote:
>>
>> ...
>>
>> There's still a failure on windows, though. I'd bet that's due to the
>> data_checksum/LocalDatachecksumVersion sync not working correctly on
>> builds with EXEC_BACKEND, or something like that, but it's too late so
>> I'll take a closer look tomorrow.
>>
> 
> Just like I suspected, there was a bug in EXEC_BACKEND, although a bit
> different from what I guessed - the worker state in shmem was zeroed
> every time, not just once. And a second issue was child_process_kinds
> got out of sync with BackendType (mea culpa).
> 
> For me, this passes all CI tests, hopefully cfbot will be happy too.
> 

A bit embarrassing, I did not notice updating child_process_kinds breaks
the stats regression test, so here's a version fixing that.


regards

-- 
Tomas Vondra

Attachment

Re: Changing the state of data checksums in a running cluster

From

Daniel Gustafsson

Date:

10 March, 16:27:06

> On 10 Mar 2025, at 12:17, Tomas Vondra <tomas@vondra.me> wrote:
>
> On 3/10/25 10:46, Tomas Vondra wrote:
>> On 3/10/25 01:18, Tomas Vondra wrote:

Thank you so much for picking up and fixing the blockers, it's highly appreciated!

>> For me, this passes all CI tests, hopefully cfbot will be happy too.

Confirmed, it compiles clean, builds docs and passes all tests for me as well.

A few comments from reading over your changes:

+   launcher worker has this value set, the other worker processes
+   have this <literal>NULL</literal>.
There seems to be a word or two missing (same in a few places), should this be
"have this set to NULL"?


+   The command is currently waiting for a checkpoint to update the checksum
+   state at the end.
s/at the end/before finishing/?


+ * XXX aren't PG_DATA_ and DATA_ constants the same? why do we need both?
They aren't mapping 1:1 as PG_DATA_ has the version numbers, and if checksums
aren't enabled there is no version and thus there is no PG_DATA_CHECKSUMS_OFF.
This could of course be remedied.  IIRC one reason for adding the enum was to
get compiler warnings on missing cases when switch()ing over the value, but I
don't think the current code has any switch.


+   /* XXX isn't it weird there's no wait between the phase updates? */
It is, I think we should skip PROGRESS_DATACHECKSUMS_PHASE_WAITING_BACKENDS in
favor of PROGRESS_DATACHECKSUMS_PHASE_ENABLING.


+   * When enabling checksums, we have to wait for a checkpoint for the
+   * checksums to e.
Seems to be missing the punchline, "for the checksum state to be moved from
in-progress to on" perhaps?


It also needs a pgindent and pgperltidy but there were only small trivial
changes there.

Thanks again for updating the patch!

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

From

Tomas Vondra

Date:

10 March, 18:01:35

On 3/10/25 14:27, Daniel Gustafsson wrote:
>> On 10 Mar 2025, at 12:17, Tomas Vondra <tomas@vondra.me> wrote:
>>
>> On 3/10/25 10:46, Tomas Vondra wrote:
>>> On 3/10/25 01:18, Tomas Vondra wrote:
> 
> Thank you so much for picking up and fixing the blockers, it's highly appreciated!
> 
>>> For me, this passes all CI tests, hopefully cfbot will be happy too.
> 
> Confirmed, it compiles clean, builds docs and passes all tests for me as well.
> 
> A few comments from reading over your changes:
> 
> +   launcher worker has this value set, the other worker processes
> +   have this <literal>NULL</literal>.
> There seems to be a word or two missing (same in a few places), should this be
> "have this set to NULL"?
> 

done

> 
> +   The command is currently waiting for a checkpoint to update the checksum
> +   state at the end.
> s/at the end/before finishing/?
> 

done

> 
> + * XXX aren't PG_DATA_ and DATA_ constants the same? why do we need both?
> They aren't mapping 1:1 as PG_DATA_ has the version numbers, and if checksums
> aren't enabled there is no version and thus there is no PG_DATA_CHECKSUMS_OFF.
> This could of course be remedied.  IIRC one reason for adding the enum was to
> get compiler warnings on missing cases when switch()ing over the value, but I
> don't think the current code has any switch.
> 

I haven't done anything about this. I'm not convinced it's an issue we
need to fix, and I haven't tried how much work would it be.

> 
> +   /* XXX isn't it weird there's no wait between the phase updates? */
> It is, I think we should skip PROGRESS_DATACHECKSUMS_PHASE_WAITING_BACKENDS in
> favor of PROGRESS_DATACHECKSUMS_PHASE_ENABLING.
> 

Removed the WAITING_BACKENDS phase.

> 
> +   * When enabling checksums, we have to wait for a checkpoint for the
> +   * checksums to e.
> Seems to be missing the punchline, "for the checksum state to be moved from
> in-progress to on" perhaps?
> 

done

> 
> It also needs a pgindent and pgperltidy but there were only small trivial
> changes there.
> 

done

Attached is an updated version.


-- 
Tomas Vondra

Attachment

Re: Changing the state of data checksums in a running cluster

From

Tomas Vondra

Date:

10 March, 18:16:02

One thing I forgot to mention is the progress reporting only updates
blocks for the FORK_MAIN. It wouldn't be difficult to report blocks for
each fork, but it'd be confusing - the relation counters would remain
the same, but the block counters would change for each fork.

I guess we could report the current_relation/fork, but it seems like an
overkill. The main fork is by far the largest one, so this seems OK.


regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

From

Tomas Vondra

Date:

10 March, 20:35:56

Hi,

I continued stress testing this, as I was rather unsure why the assert
failures reported in [1] disappeared. And I managed to reproduce that
again, and I think I actually understand why it happens.

I modified the test script (attached) to setup replication, not just a
single instance. And then it does a bit of work, flips the checksums,
restarts the instances (randomly, fast/immediate), verifies the checkums
and so on. And I can hit this assert in AbsorbChecksumsOnBarrier()
pretty easily:

  Assert(LocalDataChecksumVersion ==
         PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);

The reason is pretty simple - this happens on the standby:

1) standby receives XLOG_CHECKSUMS and applies it from 2 to 1 (i.e. it
sets ControlFile->data_checksum_version from "inprogress-on" to "on"),
and signals all other processes to refresh LocalDataChecksumVersion

2) the control file gets written to disk for whatever reason (redo does
this in a number of places)

3) standby gets restarted with "immediate" mode (I'm not sure if this
can happen with "fast" mode, I only recall seeing "immediate")

4) the standby receives the XLOG_CHECKSUMS record *again*, updates the
ControlFile->data_checksum_version (to the same value, no noop), and
then signals the other processes again

5) the other processes already have LocalDataChecksumVersion=1 (on), but
the assert says it should be 2 (inprogress-on) => kaboom

I believe this can happen for changes in either direction, although the
window while disabling checksums is more narrow.


I'm not sure what to do about this. Maybe we could relax the assert in
some way? But that seems a bit ... possibly risky. It's not necessarily
true we'll see the immediately preceding checksum state, we might see a
couple updates back (if the control file was not updated in between).

Could this affect checksum verification during recovery? Imagine we get
to the "on" state, the controlfile gets flushed, and then the standby
restarts and starts receiving older records again. The control file says
we should be verifying checksums, but couldn't some of the writes have
been lost (and so the pages may not have a valid checksum)?

The one idea I have is to create an "immediate" restartpoint in
xlog_redo() right after XLOG_CHECKSUMS updates the control file. AFAICS
a "spread" restartpoint would not be enough, because then we could get
into the same situation with a control file of sync (ahead of WAL) after
a restart. It'd not be cheap, but it should be a rare operation ...

I was wondering if the primary has the same issue, but AFAICS it does
not. It flushes the control file in only a couple places, I couldn't
think of a way to get it out of sync.


regards

[1]
https://www.postgresql.org/message-id/e4dbcb2c-e04a-4ba2-bff0-8d979f55960e%40vondra.me

-- 
Tomas Vondra

Attachment

test.sh

Re: Changing the state of data checksums in a running cluster

From

Dagfinn Ilmari Mannsåker

Date:

11 March, 16:07:35

As the resident perl style pedant, I'd just like to complain about the
below:

Tomas Vondra <tomas@vondra.me> writes:

> diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
> index 666bd2a2d4c..1c66360c16c 100644
> --- a/src/test/perl/PostgreSQL/Test/Cluster.pm
> +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
> @@ -3761,7 +3761,8 @@ sub checksum_enable_offline
>      my ($self) = @_;
>  
>      print "### Enabling checksums in \"$self->data_dir\"\n";
> -    PostgreSQL::Test::Utils::system_or_bail('pg_checksums', '-D', $self->data_dir, '-e');
> +    PostgreSQL::Test::Utils::system_or_bail('pg_checksums', '-D',
> +        $self->data_dir, '-e');
>      return;
>  }

This breaking between the command line options and its arguments is why
we're switching to using fat commas. We're also using long options for
improved self-documentation, so this should be written as:

    PostgreSQL::Test::Utils::system_or_bail('pg_checksums',
        '--pgdata' => $self->data_dir,
        '--enable');

And likewise below in the disable method.

- ilmari

Re: Changing the state of data checksums in a running cluster

From

Tomas Vondra

Date:

12 March, 15:28:50

On 3/11/25 14:07, Dagfinn Ilmari Mannsåker wrote:
> As the resident perl style pedant, I'd just like to complain about the
> below:
> 
> Tomas Vondra <tomas@vondra.me> writes:
> 
>> diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
>> index 666bd2a2d4c..1c66360c16c 100644
>> --- a/src/test/perl/PostgreSQL/Test/Cluster.pm
>> +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
>> @@ -3761,7 +3761,8 @@ sub checksum_enable_offline
>>      my ($self) = @_;
>>  
>>      print "### Enabling checksums in \"$self->data_dir\"\n";
>> -    PostgreSQL::Test::Utils::system_or_bail('pg_checksums', '-D', $self->data_dir, '-e');
>> +    PostgreSQL::Test::Utils::system_or_bail('pg_checksums', '-D',
>> +        $self->data_dir, '-e');
>>      return;
>>  }
> 
> 
> This breaking between the command line options and its arguments is why
> we're switching to using fat commas. We're also using long options for
> improved self-documentation, so this should be written as:
> 
>     PostgreSQL::Test::Utils::system_or_bail('pg_checksums',
>         '--pgdata' => $self->data_dir,
>         '--enable');
> 
> And likewise below in the disable method.
> 

I don't know what fat comma is, but that's simply what perltidy did. I
don't mind formatting it differently, if there's a better way.


thanks

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

From

Dagfinn Ilmari Mannsåker

Date:

12 March, 16:08:50

Tomas Vondra <tomas@vondra.me> writes:

> On 3/11/25 14:07, Dagfinn Ilmari Mannsåker wrote:
>> As the resident perl style pedant, I'd just like to complain about the
>> below:
>> 
>> Tomas Vondra <tomas@vondra.me> writes:
>> 
>>> diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
>>> index 666bd2a2d4c..1c66360c16c 100644
>>> --- a/src/test/perl/PostgreSQL/Test/Cluster.pm
>>> +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
>>> @@ -3761,7 +3761,8 @@ sub checksum_enable_offline
>>>      my ($self) = @_;
>>>  
>>>      print "### Enabling checksums in \"$self->data_dir\"\n";
>>> -    PostgreSQL::Test::Utils::system_or_bail('pg_checksums', '-D', $self->data_dir, '-e');
>>> +    PostgreSQL::Test::Utils::system_or_bail('pg_checksums', '-D',
>>> +        $self->data_dir, '-e');
>>>      return;
>>>  }
>> 
>> 
>> This breaking between the command line options and its arguments is why
>> we're switching to using fat commas. We're also using long options for
>> improved self-documentation, so this should be written as:
>> 
>>     PostgreSQL::Test::Utils::system_or_bail('pg_checksums',
>>         '--pgdata' => $self->data_dir,
>>         '--enable');
>> 
>> And likewise below in the disable method.
>> 
>
> I don't know what fat comma is, but that's simply what perltidy did. I
> don't mind formatting it differently, if there's a better way.

Fat comma is the perlish name for the => arrow, which is semantically
equivalent to a comma (except it auto-quotes any immediately preceding
bareword), but looks fatter.  Perltidy knows to not wrap lines around
them, keeping the key and value (or option and argument in this case)
together.  See commit ce1b0f9da03 for a large (but not complete, I have
more patches pending) conversion to this new style.

- ilmari

Re: Changing the state of data checksums in a running cluster

From

Tomas Vondra

Date:

12 March, 16:16:41

On 3/10/25 18:35, Tomas Vondra wrote:
> Hi,
> 
> I continued stress testing this, as I was rather unsure why the assert
> failures reported in [1] disappeared. And I managed to reproduce that
> again, and I think I actually understand why it happens.
> 
> I modified the test script (attached) to setup replication, not just a
> single instance. And then it does a bit of work, flips the checksums,
> restarts the instances (randomly, fast/immediate), verifies the checkums
> and so on. And I can hit this assert in AbsorbChecksumsOnBarrier()
> pretty easily:
> 
>   Assert(LocalDataChecksumVersion ==
>          PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
> 
> The reason is pretty simple - this happens on the standby:
> 
> 1) standby receives XLOG_CHECKSUMS and applies it from 2 to 1 (i.e. it
> sets ControlFile->data_checksum_version from "inprogress-on" to "on"),
> and signals all other processes to refresh LocalDataChecksumVersion
> 
> 2) the control file gets written to disk for whatever reason (redo does
> this in a number of places)
> 
> 3) standby gets restarted with "immediate" mode (I'm not sure if this
> can happen with "fast" mode, I only recall seeing "immediate")
> 
> 4) the standby receives the XLOG_CHECKSUMS record *again*, updates the
> ControlFile->data_checksum_version (to the same value, no noop), and
> then signals the other processes again
> 
> 5) the other processes already have LocalDataChecksumVersion=1 (on), but
> the assert says it should be 2 (inprogress-on) => kaboom
> 
> I believe this can happen for changes in either direction, although the
> window while disabling checksums is more narrow.
> 
> 
> I'm not sure what to do about this. Maybe we could relax the assert in
> some way? But that seems a bit ... possibly risky. It's not necessarily
> true we'll see the immediately preceding checksum state, we might see a
> couple updates back (if the control file was not updated in between).
> 
> Could this affect checksum verification during recovery? Imagine we get
> to the "on" state, the controlfile gets flushed, and then the standby
> restarts and starts receiving older records again. The control file says
> we should be verifying checksums, but couldn't some of the writes have
> been lost (and so the pages may not have a valid checksum)?
> 
> The one idea I have is to create an "immediate" restartpoint in
> xlog_redo() right after XLOG_CHECKSUMS updates the control file. AFAICS
> a "spread" restartpoint would not be enough, because then we could get
> into the same situation with a control file of sync (ahead of WAL) after
> a restart. It'd not be cheap, but it should be a rare operation ...
> 
> I was wondering if the primary has the same issue, but AFAICS it does
> not. It flushes the control file in only a couple places, I couldn't
> think of a way to get it out of sync.
> 

I continued investigating this and experimenting with alternative
approaches, and I think the way the patch relies on ControlFile is not
quite right. That is, it always sets data_checksum_version to the last
("current") value, but that's not what ControlFile is for ...

The ControlFile is meant to be a safe/consistent state, e.g. for crash
recovery. By setting data_checksum_version to the "last" value we've
seen, that's broken - if the control file gets persisted (haven't seen
this on primary, but pretty common on replica, per the report), the
recovery will start with a "future" data_checksum_version value. Which
is wrong - we'll read the XLOG_CHECKUMS record, triggering the assert. I
suspect it might also lead to confusion whether checksums should be
verified or not.

In my earlier message I suggested maybe this could be solved by forcing
a checkpoint every time we see the XLOG_CHECKUMS record (or rather a
restart point, as it'd be on the replica). Sure, that would have some
undesirable consequences (forcing an immediate checkpoint is not cheap,
and the redo would need to wait for that). But the assumption was it'd
be very rare (how often you enable checksums?), so this cost might be
acceptable.

But when I started experimenting with this, I realized it has a couple
other issues:

1) We can't do the checkpoint/restartpoint when handling XLOG_CHECKUMS,
because that'd mean we see this XLOG record again, which we don't want.
So the checkpoint would need to happen the *next* time we update the
control file.

2) But we can't trigger a checkpoint from UpdateControlFile, because of
locking (because CreateCheckPoint also calls UpdateControlFile). So this
would require much more invasive changes to all places updating the
control file.

3) It does not resolve the mismatch with using ControlFile to store
"current" data_checksums_version value.

4) ... probably more minor issues that I already forgot about.

In the end, I decided to try to rework this by storing the current value
elsewhere, and only updating the "persistent" value in the control file
when necessary.

XLogCtl seemed like a good place, so I used that - after all, it's a
value from XLOG. Maybe there's a better place? I'm open to suggestions,
but it does not really affect the overall approach.

So all the places now update XLogCtl->data_checksums_version instead of
the ControlFile, and also query this flag for *current* value.

The value is copied from XLogCtl to ControlFile when creating checkpoint
(or restartpoint), and the control file is persisted. This means (a) the
current value can't get written to the control file prematurely, and (b)
the value is consistent with the checkpoint (i.e. with the LSN where we
start crash recovery, if needed).

The attached 0005 patch implements this. It's a bit WIP and I'm sure it
can be improved, but I'm yet to see a single crash/failure with it. With
the original patch I've seen crashes after 5-10 loops (i.e. a couple
minutes), I'm now at loop 1000 and it's still OK.

I believe the approach is correct, but the number of possible states
(e.g. after a crash/restart) seems a bit complex. I wonder if there's a
better way to handle this, but I can't think of any. Ideas?

One issue I ran into is the postmaster does not seem to be processing
the barriers, and thus not getting info about the data_checksum_version
changes. That's fine until it needs to launch a child process (e.g. a
walreceiver), which will then see the LocalDataChecksumVersion as of the
start of the instance, not the "current" one. I fixed this by explicitly
refreshing the value in postmaster_child_launch(), but maybe I'm missing
something. (Also, EXEC_BACKEND may need to handle this too.)

regards

-- 
Tomas Vondra

Attachment

Re: Changing the state of data checksums in a running cluster

From

Daniel Gustafsson

Date:

13 March, 12:54:28

> On 12 Mar 2025, at 14:16, Tomas Vondra <tomas@vondra.me> wrote:

> I continued investigating this and experimenting with alternative
> approaches, and I think the way the patch relies on ControlFile is not
> quite right. That is, it always sets data_checksum_version to the last
> ("current") value, but that's not what ControlFile is for ...

Agreed, that's a thinko on my part.  Reading it makes it clear, but I had
failed to see that when hacking =/

> XLogCtl seemed like a good place, so I used that - after all, it's a
> value from XLOG. Maybe there's a better place? I'm open to suggestions,
> but it does not really affect the overall approach.

Seems like a good place for it.

> So all the places now update XLogCtl->data_checksums_version instead of
> the ControlFile, and also query this flag for *current* value.
>
> The value is copied from XLogCtl to ControlFile when creating checkpoint
> (or restartpoint), and the control file is persisted. This means (a) the
> current value can't get written to the control file prematurely, and (b)
> the value is consistent with the checkpoint (i.e. with the LSN where we
> start crash recovery, if needed).

+1

> The attached 0005 patch implements this. It's a bit WIP and I'm sure it
> can be improved, but I'm yet to see a single crash/failure with it. With
> the original patch I've seen crashes after 5-10 loops (i.e. a couple
> minutes), I'm now at loop 1000 and it's still OK.

Given how successful this test has been at stressing out errors that is indeed
comforting to hear.

> I believe the approach is correct, but the number of possible states
> (e.g. after a crash/restart) seems a bit complex. I wonder if there's a
> better way to handle this, but I can't think of any. Ideas?

Not sure if this moves the needle too far in terms of complexity wrt to the
previous version of the patch, there were already multiple copies.

> One issue I ran into is the postmaster does not seem to be processing
> the barriers, and thus not getting info about the data_checksum_version
> changes.

Makes sense, that seems like a pretty reasonable constraint for the barrier.

> That's fine until it needs to launch a child process (e.g. a
> walreceiver), which will then see the LocalDataChecksumVersion as of the
> start of the instance, not the "current" one. I fixed this by explicitly
> refreshing the value in postmaster_child_launch(), but maybe I'm missing
> something. (Also, EXEC_BACKEND may need to handle this too.)

The pg_checksums test is failing for me on this version due to the GUC not
being initialized, don't we need something like the below as well?  (With a
comment explaining why ReadControlFile wasn't enough.)

@@ -5319,6 +5319,7 @@ LocalProcessControlFile(bool reset)
        Assert(reset || ControlFile == NULL);
        ControlFile = palloc(sizeof(ControlFileData));
        ReadControlFile();
+       SetLocalDataChecksumVersion(ControlFile->data_checksum_version);

A few comments on the patchset:

+ * Local state fror Controlfile data_checksum_version. After initialization
s/fror/for/.  Also, this is no longer true as it's a local copy of the XlogCtl
value and not the Controlfile value (which may or may not be equal).


- if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)
+ if (XLogCtl->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)
     ereport(WARNING,
         (errmsg("data checksums are being enabled, but no worker is running"),
          errhint("If checksums were being enabled during shutdown then processing must be manually restarted.")));
Reading this made me realize what a terrible error message I had placed there,
the hint is good but the message says checksums are being enabled but they're
not being enabled.  Maybe "data checksums are marked as being in-progress, but
no worker is running"


+uint32
+GetLocalDataChecksumVersion(void)
+{
+ return LocalDataChecksumVersion;
+}
+
+/*
+ * Get the *current* data_checksum_version (might not be written to control
+ * file yet).
+ */
+uint32
+GetCurrentDataChecksumVersion(void)
+{
+ return XLogCtl->data_checksum_version;
+}
I wonder if CachedDataChecksumVersion would be more appropriate to distinguish
it from the Current value, and also to make appear less like actual copies of
controlfile values like LocalMinRecoveryPoint.  Another thought is if we should
have the GetLocalDataChecksumVersion() API?  GetCurrentDataChecksumVersion()
should be a better API no?

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

From

Tomas Vondra

Date:

13 March, 14:03:29

On 3/13/25 10:54, Daniel Gustafsson wrote:
>> On 12 Mar 2025, at 14:16, Tomas Vondra <tomas@vondra.me> wrote:
> 
>> I continued investigating this and experimenting with alternative
>> approaches, and I think the way the patch relies on ControlFile is not
>> quite right. That is, it always sets data_checksum_version to the last
>> ("current") value, but that's not what ControlFile is for ...
> 
> Agreed, that's a thinko on my part.  Reading it makes it clear, but I had
> failed to see that when hacking =/
> 

It wasn't obvious to me either, until I managed to trigger the failure
and investigated the root cause.

>> XLogCtl seemed like a good place, so I used that - after all, it's a
>> value from XLOG. Maybe there's a better place? I'm open to suggestions,
>> but it does not really affect the overall approach.
> 
> Seems like a good place for it.
> 

OK

>> So all the places now update XLogCtl->data_checksums_version instead of
>> the ControlFile, and also query this flag for *current* value.
>>
>> The value is copied from XLogCtl to ControlFile when creating checkpoint
>> (or restartpoint), and the control file is persisted. This means (a) the
>> current value can't get written to the control file prematurely, and (b)
>> the value is consistent with the checkpoint (i.e. with the LSN where we
>> start crash recovery, if needed).
> 
> +1
> 

OK. I still want to go over the places once more and double check it
sets the ControlFile value to the right data_checksum_version.

>> The attached 0005 patch implements this. It's a bit WIP and I'm sure it
>> can be improved, but I'm yet to see a single crash/failure with it. With
>> the original patch I've seen crashes after 5-10 loops (i.e. a couple
>> minutes), I'm now at loop 1000 and it's still OK.
> 
> Given how successful this test has been at stressing out errors that is indeed
> comforting to hear.
> 

It is. I plan to vary the stress test a bit more, and also run it on
another machine (rpi5, to get some non-x86 testing).

>> I believe the approach is correct, but the number of possible states
>> (e.g. after a crash/restart) seems a bit complex. I wonder if there's a
>> better way to handle this, but I can't think of any. Ideas?
> 
> Not sure if this moves the needle too far in terms of complexity wrt to the
> previous version of the patch, there were already multiple copies.
> 

It does add one more place (XLogCtl->data_checksum_version) to store the
current state, so it's not that much more complex, ofc. But I was not
really comparing this to the previous patch version, I meant the state
space in general - all possible combinations of all the flags (control
file, local + xlogct).

I wonder if it might be possible to have a more thorough validation of
the transitions. We already have that for the LocalDataChecksumVersion,
thanks to the asserts - and it was damn useful, otherwise we would not
have noticed this issue for a long time, I think.

I wonder if we can have similar checks for the other flags. I'm pretty
sure we can have the same checks for XLogCtl, right? I'm not quite sure
about ControlFile - can't that "skip" some of the changes, e.g. if we do
(enable->disable->enable) within a single checkpoint? Need to check.

This also reminds me I had a question about the barrier - can't it
happen a process gets to process multiple barriers at the same time? I
mean, let's say it gets stuck for a while, and the cluster happens to go
through disable+enable. Won't it then see both barriers? That'd be a
problem, because the core processes the barriers in the order determined
by the enum value, not in the order the barriers happened. Which means
it might break the expected state transitions again (and end with the
wrong local value). I haven't tried, though.

>> One issue I ran into is the postmaster does not seem to be processing
>> the barriers, and thus not getting info about the data_checksum_version
>> changes.
> 
> Makes sense, that seems like a pretty reasonable constraint for the barrier.
> 

Not sure I follow. What's a reasonable constraint?

>> That's fine until it needs to launch a child process (e.g. a
>> walreceiver), which will then see the LocalDataChecksumVersion as of the
>> start of the instance, not the "current" one. I fixed this by explicitly
>> refreshing the value in postmaster_child_launch(), but maybe I'm missing
>> something. (Also, EXEC_BACKEND may need to handle this too.)
> 
> The pg_checksums test is failing for me on this version due to the GUC not
> being initialized, don't we need something like the below as well?  (With a
> comment explaining why ReadControlFile wasn't enough.)
> 
> @@ -5319,6 +5319,7 @@ LocalProcessControlFile(bool reset)
>         Assert(reset || ControlFile == NULL);
>         ControlFile = palloc(sizeof(ControlFileData));
>         ReadControlFile();
> +       SetLocalDataChecksumVersion(ControlFile->data_checksum_version);
> 

Yeah, I think this (or something like it) is missing.

> A few comments on the patchset:
> 
> + * Local state fror Controlfile data_checksum_version. After initialization
> s/fror/for/.  Also, this is no longer true as it's a local copy of the XlogCtl
> value and not the Controlfile value (which may or may not be equal).
> 
> 
> - if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)
> + if (XLogCtl->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)
>      ereport(WARNING,
>          (errmsg("data checksums are being enabled, but no worker is running"),
>           errhint("If checksums were being enabled during shutdown then processing must be manually restarted.")));
> Reading this made me realize what a terrible error message I had placed there,
> the hint is good but the message says checksums are being enabled but they're
> not being enabled.  Maybe "data checksums are marked as being in-progress, but
> no worker is running"
> 

Makes sense, will reword.

> 
> +uint32
> +GetLocalDataChecksumVersion(void)
> +{
> + return LocalDataChecksumVersion;
> +}
> +
> +/*
> + * Get the *current* data_checksum_version (might not be written to control
> + * file yet).
> + */
> +uint32
> +GetCurrentDataChecksumVersion(void)
> +{
> + return XLogCtl->data_checksum_version;
> +}
> I wonder if CachedDataChecksumVersion would be more appropriate to distinguish
> it from the Current value, and also to make appear less like actual copies of
> controlfile values like LocalMinRecoveryPoint.  Another thought is if we should
> have the GetLocalDataChecksumVersion() API?  GetCurrentDataChecksumVersion()
> should be a better API no?
> 

FWIW those functions are for debug logging only, I needed to print the
values in a couple places outside xlog.c. I don't intend to make that
part of the patch.


regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

From

Daniel Gustafsson

Date:

13 March, 15:32:20

> On 13 Mar 2025, at 12:03, Tomas Vondra <tomas@vondra.me> wrote:
> On 3/13/25 10:54, Daniel Gustafsson wrote:
>>> On 12 Mar 2025, at 14:16, Tomas Vondra <tomas@vondra.me> wrote:

>>> I believe the approach is correct, but the number of possible states
>>> (e.g. after a crash/restart) seems a bit complex. I wonder if there's a
>>> better way to handle this, but I can't think of any. Ideas?
>>
>> Not sure if this moves the needle too far in terms of complexity wrt to the
>> previous version of the patch, there were already multiple copies.
>
> It does add one more place (XLogCtl->data_checksum_version) to store the
> current state, so it's not that much more complex, ofc. But I was not
> really comparing this to the previous patch version, I meant the state
> space in general - all possible combinations of all the flags (control
> file, local + xlogct).

Fair point.

> I wonder if it might be possible to have a more thorough validation of
> the transitions. We already have that for the LocalDataChecksumVersion,
> thanks to the asserts - and it was damn useful, otherwise we would not
> have noticed this issue for a long time, I think.
>
> I wonder if we can have similar checks for the other flags. I'm pretty
> sure we can have the same checks for XLogCtl, right?

I don't see why not, they should abide by the same rules.

> I'm not quite sure
> about ControlFile - can't that "skip" some of the changes, e.g. if we do
> (enable->disable->enable) within a single checkpoint? Need to check.

For enable->disable->enable within a single checkpoint then ControlFile should
never see the disable state.

> This also reminds me I had a question about the barrier - can't it
> happen a process gets to process multiple barriers at the same time? I
> mean, let's say it gets stuck for a while, and the cluster happens to go
> through disable+enable. Won't it then see both barriers? That'd be a
> problem, because the core processes the barriers in the order determined
> by the enum value, not in the order the barriers happened. Which means
> it might break the expected state transitions again (and end with the
> wrong local value). I haven't tried, though.

Interesting, that seems like a general deficiency in the barriers, surely
processing them in-order would be more intuitive?  That would probably require
some form of Lamport clock though.

>>> One issue I ran into is the postmaster does not seem to be processing
>>> the barriers, and thus not getting info about the data_checksum_version
>>> changes.
>>
>> Makes sense, that seems like a pretty reasonable constraint for the barrier.
>
> Not sure I follow. What's a reasonable constraint?

That the postmaster deosn't process them.

>>> That's fine until it needs to launch a child process (e.g. a
>>> walreceiver), which will then see the LocalDataChecksumVersion as of the
>>> start of the instance, not the "current" one. I fixed this by explicitly
>>> refreshing the value in postmaster_child_launch(), but maybe I'm missing
>>> something. (Also, EXEC_BACKEND may need to handle this too.)
>>
>> The pg_checksums test is failing for me on this version due to the GUC not
>> being initialized, don't we need something like the below as well?  (With a
>> comment explaining why ReadControlFile wasn't enough.)
>>
>> @@ -5319,6 +5319,7 @@ LocalProcessControlFile(bool reset)
>>        Assert(reset || ControlFile == NULL);
>>        ControlFile = palloc(sizeof(ControlFileData));
>>        ReadControlFile();
>> +       SetLocalDataChecksumVersion(ControlFile->data_checksum_version);
>>
>
> Yeah, I think this (or something like it) is missing.

Thanks for confirming.

>> +uint32
>> +GetLocalDataChecksumVersion(void)
>> +{
>> + return LocalDataChecksumVersion;
>> +}
>> +
>> +/*
>> + * Get the *current* data_checksum_version (might not be written to control
>> + * file yet).
>> + */
>> +uint32
>> +GetCurrentDataChecksumVersion(void)
>> +{
>> + return XLogCtl->data_checksum_version;
>> +}
>> I wonder if CachedDataChecksumVersion would be more appropriate to distinguish
>> it from the Current value, and also to make appear less like actual copies of
>> controlfile values like LocalMinRecoveryPoint.  Another thought is if we should
>> have the GetLocalDataChecksumVersion() API?  GetCurrentDataChecksumVersion()
>> should be a better API no?
>>
>
> FWIW those functions are for debug logging only, I needed to print the
> values in a couple places outside xlog.c. I don't intend to make that
> part of the patch.

Ah, gotcha, I never applied the debug patch from the patchset so I figured this
was a planned API.  The main question still stands though, if LocalDataCheckXX
can be confusing and CachedDataCheckXX would be better in order to
distinguish it from actual controlfile copies?

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

From

Tomas Vondra

Date:

13 March, 19:26:47

On 3/13/25 13:32, Daniel Gustafsson wrote:
>> On 13 Mar 2025, at 12:03, Tomas Vondra <tomas@vondra.me> wrote:
>> On 3/13/25 10:54, Daniel Gustafsson wrote:
>>>> On 12 Mar 2025, at 14:16, Tomas Vondra <tomas@vondra.me> wrote:
> 
>>>> I believe the approach is correct, but the number of possible states
>>>> (e.g. after a crash/restart) seems a bit complex. I wonder if there's a
>>>> better way to handle this, but I can't think of any. Ideas?
>>>
>>> Not sure if this moves the needle too far in terms of complexity wrt to the
>>> previous version of the patch, there were already multiple copies.
>>
>> It does add one more place (XLogCtl->data_checksum_version) to store the
>> current state, so it's not that much more complex, ofc. But I was not
>> really comparing this to the previous patch version, I meant the state
>> space in general - all possible combinations of all the flags (control
>> file, local + xlogct).
> 
> Fair point.
> 
>> I wonder if it might be possible to have a more thorough validation of
>> the transitions. We already have that for the LocalDataChecksumVersion,
>> thanks to the asserts - and it was damn useful, otherwise we would not
>> have noticed this issue for a long time, I think.
>>
>> I wonder if we can have similar checks for the other flags. I'm pretty
>> sure we can have the same checks for XLogCtl, right?
> 
> I don't see why not, they should abide by the same rules.
> 

OK, I'll add these asserts.

>> I'm not quite sure
>> about ControlFile - can't that "skip" some of the changes, e.g. if we do
>> (enable->disable->enable) within a single checkpoint? Need to check.
> 
> For enable->disable->enable within a single checkpoint then ControlFile should
> never see the disable state.
> 

Hmm, that means we can't have the same checks for the ControlFile
fields, but I don't think that's a problem. We've verified the "path" to
that (on the XLogCtl field), so that seems fine.

>> This also reminds me I had a question about the barrier - can't it
>> happen a process gets to process multiple barriers at the same time? I
>> mean, let's say it gets stuck for a while, and the cluster happens to go
>> through disable+enable. Won't it then see both barriers? That'd be a
>> problem, because the core processes the barriers in the order determined
>> by the enum value, not in the order the barriers happened. Which means
>> it might break the expected state transitions again (and end with the
>> wrong local value). I haven't tried, though.
> 
> Interesting, that seems like a general deficiency in the barriers, surely
> processing them in-order would be more intuitive?  That would probably require
> some form of Lamport clock though.
> 

Yeah, that seems non-trivial. What if we instead ensured there can't be
two barriers set at the same time? Say, if we (somehow) ensured all
processes saw the previous barrier before allowing a new one, we would
not have this issue, right?

But I don't know what would be a good way to ensure this. Is there a way
to check if all processes saw the barrier? Any ideas?

>>>> One issue I ran into is the postmaster does not seem to be processing
>>>> the barriers, and thus not getting info about the data_checksum_version
>>>> changes.
>>>
>>> Makes sense, that seems like a pretty reasonable constraint for the barrier.
>>
>> Not sure I follow. What's a reasonable constraint?
> 
> That the postmaster deosn't process them.
> 

OK, that means we need a way to "refresh" the value for new child
processses, similar to what my patch does. But I suspect there might be
a race condition - if the child process starts while processing the
XLOG_CHECKUMS record, it might happen to get the new value and then also
the barrier (if it does the "refresh" in between the XLogCtl update and
the barrier). Doesn't this need some sort of interlock, preventing this?

The child startup would need to do this:

1) acquire lock
2) reset barriers
3) refresh the LocalDataChecksumValue (from XLogCtl)
4) release lock

while the walreceiver would do this

1) acquire lock
2) update XLogCtl value
3) emit barrier
4) release lock

Or is there a reason why this would be unnecessary?

>>>> That's fine until it needs to launch a child process (e.g. a
>>>> walreceiver), which will then see the LocalDataChecksumVersion as of the
>>>> start of the instance, not the "current" one. I fixed this by explicitly
>>>> refreshing the value in postmaster_child_launch(), but maybe I'm missing
>>>> something. (Also, EXEC_BACKEND may need to handle this too.)
>>>
>>> The pg_checksums test is failing for me on this version due to the GUC not
>>> being initialized, don't we need something like the below as well?  (With a
>>> comment explaining why ReadControlFile wasn't enough.)
>>>
>>> @@ -5319,6 +5319,7 @@ LocalProcessControlFile(bool reset)
>>>        Assert(reset || ControlFile == NULL);
>>>        ControlFile = palloc(sizeof(ControlFileData));
>>>        ReadControlFile();
>>> +       SetLocalDataChecksumVersion(ControlFile->data_checksum_version);
>>>
>>
>> Yeah, I think this (or something like it) is missing.
> 
> Thanks for confirming.
> 
>>> +uint32
>>> +GetLocalDataChecksumVersion(void)
>>> +{
>>> + return LocalDataChecksumVersion;
>>> +}
>>> +
>>> +/*
>>> + * Get the *current* data_checksum_version (might not be written to control
>>> + * file yet).
>>> + */
>>> +uint32
>>> +GetCurrentDataChecksumVersion(void)
>>> +{
>>> + return XLogCtl->data_checksum_version;
>>> +}
>>> I wonder if CachedDataChecksumVersion would be more appropriate to distinguish
>>> it from the Current value, and also to make appear less like actual copies of
>>> controlfile values like LocalMinRecoveryPoint.  Another thought is if we should
>>> have the GetLocalDataChecksumVersion() API?  GetCurrentDataChecksumVersion()
>>> should be a better API no?
>>>
>>
>> FWIW those functions are for debug logging only, I needed to print the
>> values in a couple places outside xlog.c. I don't intend to make that
>> part of the patch.
> 
> Ah, gotcha, I never applied the debug patch from the patchset so I figured this
> was a planned API.  The main question still stands though, if LocalDataCheckXX
> can be confusing and CachedDataCheckXX would be better in order to
> distinguish it from actual controlfile copies?
> 

Yeah, I'll think about the naming.


regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

From

Tomas Vondra

Date:

14 March, 02:11:39

On 3/13/25 17:26, Tomas Vondra wrote:
> On 3/13/25 13:32, Daniel Gustafsson wrote:
>>> On 13 Mar 2025, at 12:03, Tomas Vondra <tomas@vondra.me> wrote:
>>>
>>> ...
>>>
>>> This also reminds me I had a question about the barrier - can't it
>>> happen a process gets to process multiple barriers at the same time? I
>>> mean, let's say it gets stuck for a while, and the cluster happens to go
>>> through disable+enable. Won't it then see both barriers? That'd be a
>>> problem, because the core processes the barriers in the order determined
>>> by the enum value, not in the order the barriers happened. Which means
>>> it might break the expected state transitions again (and end with the
>>> wrong local value). I haven't tried, though.
>>
>> Interesting, that seems like a general deficiency in the barriers, surely
>> processing them in-order would be more intuitive?  That would probably require
>> some form of Lamport clock though.
>>
> 
> Yeah, that seems non-trivial. What if we instead ensured there can't be
> two barriers set at the same time? Say, if we (somehow) ensured all
> processes saw the previous barrier before allowing a new one, we would
> not have this issue, right?
> 
> But I don't know what would be a good way to ensure this. Is there a way
> to check if all processes saw the barrier? Any ideas?
> 

Actually, scratch this. There already is a way to do this, by using
WaitForProcSignalBarrier. And the XLOG_CHECKSUMS processing already
calls this. So we should not see two barriers at the same time ...

>>>>> One issue I ran into is the postmaster does not seem to be processing
>>>>> the barriers, and thus not getting info about the data_checksum_version
>>>>> changes.
>>>>
>>>> Makes sense, that seems like a pretty reasonable constraint for the barrier.
>>>
>>> Not sure I follow. What's a reasonable constraint?
>>
>> That the postmaster deosn't process them.
>>
> 
> OK, that means we need a way to "refresh" the value for new child
> processses, similar to what my patch does. But I suspect there might be
> a race condition - if the child process starts while processing the
> XLOG_CHECKUMS record, it might happen to get the new value and then also
> the barrier (if it does the "refresh" in between the XLogCtl update and
> the barrier). Doesn't this need some sort of interlock, preventing this?
> 
> The child startup would need to do this:
> 
> 1) acquire lock
> 2) reset barriers
> 3) refresh the LocalDataChecksumValue (from XLogCtl)
> 4) release lock
> 
> while the walreceiver would do this
> 
> 1) acquire lock
> 2) update XLogCtl value
> 3) emit barrier
> 4) release lock
> 
> Or is there a reason why this would be unnecessary?
> 

I still think this might be a problem. I wonder if we could maybe
leverage the barrier generation, to detect that we don't need to process
this barrier, because we already got the value directly ...

FWIW we'd have this problem even if postmaster was processing barriers,
because there'd always be a "gap" between the fork and ProcSignalInit()
registering the new process into the procsignal array.


regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

From

Tomas Vondra

Date:

14 March, 15:20:37

On 3/14/25 00:11, Tomas Vondra wrote:
> ...
>>>>>> One issue I ran into is the postmaster does not seem to be processing
>>>>>> the barriers, and thus not getting info about the data_checksum_version
>>>>>> changes.
>>>>>
>>>>> Makes sense, that seems like a pretty reasonable constraint for the barrier.
>>>>
>>>> Not sure I follow. What's a reasonable constraint?
>>>
>>> That the postmaster deosn't process them.
>>>
>>
>> OK, that means we need a way to "refresh" the value for new child
>> processses, similar to what my patch does. But I suspect there might be
>> a race condition - if the child process starts while processing the
>> XLOG_CHECKUMS record, it might happen to get the new value and then also
>> the barrier (if it does the "refresh" in between the XLogCtl update and
>> the barrier). Doesn't this need some sort of interlock, preventing this?
>>
>> The child startup would need to do this:
>>
>> 1) acquire lock
>> 2) reset barriers
>> 3) refresh the LocalDataChecksumValue (from XLogCtl)
>> 4) release lock
>>
>> while the walreceiver would do this
>>
>> 1) acquire lock
>> 2) update XLogCtl value
>> 3) emit barrier
>> 4) release lock
>>
>> Or is there a reason why this would be unnecessary?
>>
> 
> I still think this might be a problem. I wonder if we could maybe
> leverage the barrier generation, to detect that we don't need to process
> this barrier, because we already got the value directly ...
> 
> FWIW we'd have this problem even if postmaster was processing barriers,
> because there'd always be a "gap" between the fork and ProcSignalInit()
> registering the new process into the procsignal array.
> 

I experimented with this a little bit, and unfortunately I ran into not
one, but two race conditions in this :-( I don't have reproducers, all
of this was done by manually adding sleep() calls / gdb breakpoints to
pause the processes for a while, but I'll try to explain what/why ...

1) race #1: SetDataChecksumsOn

The function (and all the other "SetDataChecksums" funcs) does this

    SpinLockAcquire(&XLogCtl->info_lck);
    XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
    SpinLockRelease(&XLogCtl->info_lck);

    barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);

Now, imagine there's a sleep() before the EmitProcSignalBarrier. A new
process may start during that, and it'll read the current checksum value
from XLogCtl. And then the SetDataChecksumsOn() wakes up, and emits the
barrier. So far so good.

But the new backend is already registered in ProcSignal, so it'll get
the barrier too, and will try to set the local version to "on" again.
And kaboom - that hits the assert in AbsorbChecksumsOnBarrier():

    Assert(LocalDataChecksumVersion ==
           PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);

The other "SetDataChecksums" have the same issue, except that in those
cases there are no asserts to trip. Only AbsorbChecksumsOnBarrier() has
such assert to check the state transition.

This is "ephemeral" in the sense that setting the value to "on" again
would be harmless, and indeed a non-assert build will run just fine.

2) race #2: InitPostgres

The InitPostgres does this:

    InitLocalControldata();

    ProcSignalInit(MyCancelKeyValid, MyCancelKey);

where InitLocalControldata gets the current checksum value from XLogCtl,
and ProcSignalInit registers the backend into the procsignal (which is
what barriers are based on).

Imagine there's a sleep() between these two calls, and the cluster does
not have checksums enabled. A backend will start, will read "off" from
XLogCtl, and then gets stuck on the sleep before it gets added to the
procsignal/barrier array.

Now, we enable checksums, and the instance goes through 'inprogress-on'
and 'on' states. This completes, and the backend wakes up and registers
itself into procsignal - but it won't get any barriers, of course.

So we end up with an instance with data_checksums="on", but this one
backend still believes data_checksums="on". This can cause a lot of
trouble, because it won't write blocks with checksums. I.e. this is
persistent data corruption.

I have been thinking about how to fix this. One way would be to
introduce some sort of locking, so that the two steps (update of the
XLogCtl version + barrier emit) and (local flag init + procsignal init)
would always happen atomically. So, something like this:

    SpinLockAcquire(&XLogCtl->info_lck);
    XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
    barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);
    SpinLockRelease(&XLogCtl->info_lck);

and

    SpinLockAcquire(&XLogCtl->info_lck);
    InitLocalControldata();
    ProcSignalInit(MyCancelKeyValid, MyCancelKey);
    SpinLockRelease(&XLogCtl->info_lck);

But that seems pretty heavy-handed, it's definitely much more work while
holding a spinlock than I'm comfortable with, and I wouldn't be
surprised if there were deadlock cases etc. (FWIW I believe it needs to
use XLogCtl->info_lck, to make the value consistent with checkpoints.)

Anyway, I think a much simpler solution would be to reorder InitPostgres
like this:

    ProcSignalInit(MyCancelKeyValid, MyCancelKey);

    InitLocalControldata();

i.e. to first register into procsignal, and then read the new value.
AFAICS this guarantees we won't lose any checksum version updates. It
does mean we still can get a barrier for a value we've already seen, but
I think we should simply ignore this for the very first update.

Opinions? Other ideas how to fix this?

regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

From

Daniel Gustafsson

Date:

14 March, 16:38:06

> On 14 Mar 2025, at 13:20, Tomas Vondra <tomas@vondra.me> wrote:

> I experimented with this a little bit, and unfortunately I ran into not
> one, but two race conditions in this :-( I don't have reproducers, all
> of this was done by manually adding sleep() calls / gdb breakpoints to
> pause the processes for a while, but I'll try to explain what/why ...

Ugh. Thanks for this!

> 1) race #1: SetDataChecksumsOn
>
> The function (and all the other "SetDataChecksums" funcs) does this
>
>    SpinLockAcquire(&XLogCtl->info_lck);
>    XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
>    SpinLockRelease(&XLogCtl->info_lck);
>
>    barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);
>
> Now, imagine there's a sleep() before the EmitProcSignalBarrier. A new
> process may start during that, and it'll read the current checksum value
> from XLogCtl. And then the SetDataChecksumsOn() wakes up, and emits the
> barrier. So far so good.
>
> But the new backend is already registered in ProcSignal, so it'll get
> the barrier too, and will try to set the local version to "on" again.
> And kaboom - that hits the assert in AbsorbChecksumsOnBarrier():
>
>    Assert(LocalDataChecksumVersion ==
>           PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
>
> The other "SetDataChecksums" have the same issue, except that in those
> cases there are no asserts to trip. Only AbsorbChecksumsOnBarrier() has
> such assert to check the state transition.
>
> This is "ephemeral" in the sense that setting the value to "on" again
> would be harmless, and indeed a non-assert build will run just fine.

As mentioned off-list, being able to loosen the restriction for the first
barrier seen seem like a good way to keep this assertion.  Removing it is of
course the alternative solution, as it's not causing any issues, but given how
handy it's been to find actual issues it would be good to be able to keep it.

> 2) race #2: InitPostgres
>
> The InitPostgres does this:
>
>    InitLocalControldata();
>
>    ProcSignalInit(MyCancelKeyValid, MyCancelKey);
>
> where InitLocalControldata gets the current checksum value from XLogCtl,
> and ProcSignalInit registers the backend into the procsignal (which is
> what barriers are based on).
>
> Imagine there's a sleep() between these two calls, and the cluster does
> not have checksums enabled. A backend will start, will read "off" from
> XLogCtl, and then gets stuck on the sleep before it gets added to the
> procsignal/barrier array.
>
> Now, we enable checksums, and the instance goes through 'inprogress-on'
> and 'on' states. This completes, and the backend wakes up and registers
> itself into procsignal - but it won't get any barriers, of course.
>
> So we end up with an instance with data_checksums="on", but this one
> backend still believes data_checksums="on". This can cause a lot of
> trouble, because it won't write blocks with checksums. I.e. this is
> persistent data corruption.
>
> I have been thinking about how to fix this. One way would be to
> introduce some sort of locking, so that the two steps (update of the
> XLogCtl version + barrier emit) and (local flag init + procsignal init)
> would always happen atomically. So, something like this:
>
>    SpinLockAcquire(&XLogCtl->info_lck);
>    XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
>    barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);
>    SpinLockRelease(&XLogCtl->info_lck);
>
> and
>
>    SpinLockAcquire(&XLogCtl->info_lck);
>    InitLocalControldata();
>    ProcSignalInit(MyCancelKeyValid, MyCancelKey);
>    SpinLockRelease(&XLogCtl->info_lck);
>
> But that seems pretty heavy-handed, it's definitely much more work while
> holding a spinlock than I'm comfortable with, and I wouldn't be
> surprised if there were deadlock cases etc. (FWIW I believe it needs to
> use XLogCtl->info_lck, to make the value consistent with checkpoints.)

Yeah, that seems quite likely to introduce a new set if issues.

> Anyway, I think a much simpler solution would be to reorder InitPostgres
> like this:
>
>    ProcSignalInit(MyCancelKeyValid, MyCancelKey);
>
>    InitLocalControldata();

Agreed.

> i.e. to first register into procsignal, and then read the new value.
> AFAICS this guarantees we won't lose any checksum version updates. It
> does mean we still can get a barrier for a value we've already seen, but
> I think we should simply ignore this for the very first update.

Calling functions with sideeffects in setting state seems like a bad idea
before ProcSignalInit has run, that's thinko on my part in this patch.  Your
solution of reordering seems like the right way to handle this.

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

From

Daniel Gustafsson

Date:

14 March, 17:06:52

> On 14 Mar 2025, at 14:38, Daniel Gustafsson <daniel@yesql.se> wrote:
>> On 14 Mar 2025, at 13:20, Tomas Vondra <tomas@vondra.me> wrote:

>> This is "ephemeral" in the sense that setting the value to "on" again
>> would be harmless, and indeed a non-assert build will run just fine.
>
> As mentioned off-list, being able to loosen the restriction for the first
> barrier seen seem like a good way to keep this assertion.  Removing it is of
> course the alternative solution, as it's not causing any issues, but given how
> handy it's been to find actual issues it would be good to be able to keep it.
>
>> i.e. to first register into procsignal, and then read the new value.
>> AFAICS this guarantees we won't lose any checksum version updates. It
>> does mean we still can get a barrier for a value we've already seen, but
>> I think we should simply ignore this for the very first update.
>
> Calling functions with sideeffects in setting state seems like a bad idea
> before ProcSignalInit has run, that's thinko on my part in this patch.  Your
> solution of reordering seems like the right way to handle this.

0006 in the attached version is what I have used when testing the above, along
with an update to the copyright year which I had missed doing earlier.  It also
contains the fix in LocalProcessControlFile which I had in my local tree, I
think we need something like that at least.

--
Daniel Gustafsson

Attachment

Re: Changing the state of data checksums in a running cluster

From

Tomas Vondra

Date:

15 March, 18:50:02

On 3/14/25 15:06, Daniel Gustafsson wrote:
>> On 14 Mar 2025, at 14:38, Daniel Gustafsson <daniel@yesql.se> wrote:
>>> On 14 Mar 2025, at 13:20, Tomas Vondra <tomas@vondra.me> wrote:
> 
>>> This is "ephemeral" in the sense that setting the value to "on" again
>>> would be harmless, and indeed a non-assert build will run just fine.
>>
>> As mentioned off-list, being able to loosen the restriction for the first
>> barrier seen seem like a good way to keep this assertion.  Removing it is of
>> course the alternative solution, as it's not causing any issues, but given how
>> handy it's been to find actual issues it would be good to be able to keep it.
>>
>>> i.e. to first register into procsignal, and then read the new value.
>>> AFAICS this guarantees we won't lose any checksum version updates. It
>>> does mean we still can get a barrier for a value we've already seen, but
>>> I think we should simply ignore this for the very first update.
>>
>> Calling functions with sideeffects in setting state seems like a bad idea
>> before ProcSignalInit has run, that's thinko on my part in this patch.  Your
>> solution of reordering seems like the right way to handle this.
> 
> 0006 in the attached version is what I have used when testing the above, along
> with an update to the copyright year which I had missed doing earlier.  It also
> contains the fix in LocalProcessControlFile which I had in my local tree, I
> think we need something like that at least.
>

Thanks, here's an updated patch version - I squashed all the earlier
parts, but I kept your changes and my adjustments separate, for clarity.
A couple comments:

1) I don't think the comment before InitialDataChecksumTransition was
entirely accurate, because it said we can see the duplicate state only
for "on" state. But AFAICS we can see duplicate values for any states,
except that we only have an assert for the "on" so we don't notice the
other cases. I wonder if we could strengthen this a bit, by adding some
asserts for the other states too.

2) I admit it's rather subjective, but I didn't like how you did the
assert in AbsorbChecksumsOnBarrier. But looking at it now in the diff,
maybe it was more readable ...

3) I renamed InitLocalControldata() to InitLocalDataChecksumVersion().
The name was entirely misleading, because it now initializes the flag in
XLogCtl, it has nothing to do with control file.

4) I realized AuxiliaryProcessMainCommon() may be a better place to
initialize the checksum flag for non-backend processes. In fact, doing
it in postmaster_child_launch() had the same race condition because it
happened before ProcSignalInit().

I'm sure there's cleanup possible in a bunch of places, but the really
bad thing is I realized the handling on a standby is not quite correct.
I don't know what exactly is happening, there's too many moving bits,
but here's what I see ...

Every now and then, after restarting the standby, it logs a bunch of
page verification failures. Stuff like this:

  WARNING:  page verification failed, calculated checksum 9856 but
            expected 0
  CONTEXT:  WAL redo at 0/3447BA8 for Heap2/VISIBLE:
            snapshotConflictHorizon: 0, flags: 0x03; blkref #0: rel
            1663/16384/16401, fork 2, blk 0 FPW; blkref #1: rel
            1663/16384/16401, blk 0
  WARNING:  page verification failed, LSN 0/CF54C10

This is after an immediate shutdown, but I've seen similar failures for
fast shutdowns too (the root causes may be different / may need a
different fix, not sure).

The instance restarts, and the "startup" process starts recovery

  LOG:  redo starts at 0/2000028

This matches LSN from the very first start of the standby - there were
no restart points since then, apparently. And since then the primary did
this with the checksums (per pg_waldump):

  lsn: 0/0ECCFC48, prev 0/0ECCFBA0, desc: CHECKSUMS inprogress-off
  lsn: 0/0ECD0168, prev 0/0ECD0128, desc: CHECKSUMS off

The instance already saw both records before the immediate shutdown (per
the logging in patch 0004), but after the restart the instance goes back
to having checksums enabled again

  data_checksum_version = 1

Which is correct, because it starts at 0/2000028, which is before either
of the XLOG_CHECKSUMS records. But then at 0/3447BA8 (which is *before*
either of the checksum changes) it tries to read a page from disk, and
hits a checksum error. That page is from the future (per the page LSN
logged by patch 0004), but it's still before both XLOG_CHECKSUMS
messages. So how come the page has pd_checksum 0?

I'd have understood if the page came "broken" from the primary, but I've
not seen a single page verification failure on that side (and it's
subject to the same fast/immediate restarts, etc).

I wonder if this might be related to how we enforce checkpoints only
when setting the checksums to "on" on the primary. Maybe that's safe on
primary but not on a standby?

FWIW I've seen similar issues for "fast" shutdowns too - at least the
symptoms are similar, but the mechanism might be a bit different. In
particular, I suspect there's some sort of thinko in updating the
data_checksum_version in the control file, but I can't put my finger on
it yet.

Another thing I noticed is this comment in CreateRestartPoint(), before
one of the early exits:

/*
 * If the last checkpoint record we've replayed is already our last
 * restartpoint, we can't perform a new restart point. We still update
 * minRecoveryPoint in that case, so that if this is a shutdown restart
 * point, we won't start up earlier than before. That's not strictly
 * necessary, but when hot standby is enabled, it would be rather weird
 * if the database opened up for read-only connections at a
 * point-in-time before the last shutdown. Such time travel is still
 * possible in case of immediate shutdown, though.
 * ...

I wonder if this "time travel backwards" might be an issue for this too,
because it might mean we end up picking the wrong data_checksum_version
from the control file. In any case, if this happens, we don't get to the
ControlFile->data_checksum_version update a bit further down. And
there's another condition that can skip that.

I'll continue investigating this next week, but at this point I'm quite
confused and would be grateful for any insights ...

regards

-- 
Tomas Vondra

Attachment

Re: Changing the state of data checksums in a running cluster

From

Andres Freund

Date:

15 March, 19:26:12

Jo.

On 2025-03-15 16:50:02 +0100, Tomas Vondra wrote:
> Thanks, here's an updated patch version

FWIW, this fails in CI;

https://cirrus-ci.com/build/4678473324691456
On all OSs:
[16:08:36.331] #   Failed test 'options --locale-provider=icu --locale=und --lc-*=C: no stderr'
[16:08:36.331] #   at /tmp/cirrus-ci-build/src/bin/initdb/t/001_initdb.pl line 132.
[16:08:36.331] #          got: '2025-03-15 16:08:26.216 UTC [63153] LOG:  XLogCtl->data_checksum_version 0
ControlFile->data_checksum_version0
 
[16:08:36.331] # 2025-03-15 16:08:26.216 UTC [63153] LOG:  XLogCtl->data_checksum_version 0
ControlFile->data_checksum_version0 (UPDATED)
 

Windows & Compiler warnings:
[16:05:08.723] ../src/backend/storage/page/bufpage.c(25): fatal error C1083: Cannot open include file: 'execinfo.h': No
suchfile or directory
 

[16:18:52.385] bufpage.c:25:10: fatal error: execinfo.h: No such file or directory
[16:18:52.385]    25 | #include <execinfo.h>
[16:18:52.385]       |          ^~~~~~~~~~~~

Greetings,

Andres Freund

Re: Changing the state of data checksums in a running cluster

From

Tomas Vondra

Date:

15 March, 20:32:57

On 3/15/25 17:26, Andres Freund wrote:
> Jo.
> 
> On 2025-03-15 16:50:02 +0100, Tomas Vondra wrote:
>> Thanks, here's an updated patch version
> 
> FWIW, this fails in CI;
> 
> https://cirrus-ci.com/build/4678473324691456
> On all OSs:
> [16:08:36.331] #   Failed test 'options --locale-provider=icu --locale=und --lc-*=C: no stderr'
> [16:08:36.331] #   at /tmp/cirrus-ci-build/src/bin/initdb/t/001_initdb.pl line 132.
> [16:08:36.331] #          got: '2025-03-15 16:08:26.216 UTC [63153] LOG:  XLogCtl->data_checksum_version 0
ControlFile->data_checksum_version0
 
> [16:08:36.331] # 2025-03-15 16:08:26.216 UTC [63153] LOG:  XLogCtl->data_checksum_version 0
ControlFile->data_checksum_version0 (UPDATED)
 
> 
> Windows & Compiler warnings:
> [16:05:08.723] ../src/backend/storage/page/bufpage.c(25): fatal error C1083: Cannot open include file: 'execinfo.h':
Nosuch file or directory
 
> 
> [16:18:52.385] bufpage.c:25:10: fatal error: execinfo.h: No such file or directory
> [16:18:52.385]    25 | #include <execinfo.h>
> [16:18:52.385]       |          ^~~~~~~~~~~~
> 
> Greetings,

Yeah, that's just the "debug stuff" - I don't expect any of that to be
included in the commit, I only posted it for convenience. It adds a lot
of debug logging, which I hope might help others to understand what the
problem with checksums on standby is.

regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

From

Alexander Korotkov

Date:

04 April, 10:55:32

Hi!

On Sat, Mar 15, 2025 at 7:33 PM Tomas Vondra <tomas@vondra.me> wrote:

On 3/15/25 17:26, Andres Freund wrote:
> Jo.
>
> On 2025-03-15 16:50:02 +0100, Tomas Vondra wrote:
>> Thanks, here's an updated patch version
>
> FWIW, this fails in CI;
>
> https://cirrus-ci.com/build/4678473324691456
> On all OSs:
> [16:08:36.331] # Failed test 'options --locale-provider=icu --locale=und --lc-*=C: no stderr'
> [16:08:36.331] # at /tmp/cirrus-ci-build/src/bin/initdb/t/001_initdb.pl line 132.
> [16:08:36.331] # got: '2025-03-15 16:08:26.216 UTC [63153] LOG: XLogCtl->data_checksum_version 0 ControlFile->data_checksum_version 0
> [16:08:36.331] # 2025-03-15 16:08:26.216 UTC [63153] LOG: XLogCtl->data_checksum_version 0 ControlFile->data_checksum_version 0 (UPDATED)
>
> Windows & Compiler warnings:
> [16:05:08.723] ../src/backend/storage/page/bufpage.c(25): fatal error C1083: Cannot open include file: 'execinfo.h': No such file or directory
>
> [16:18:52.385] bufpage.c:25:10: fatal error: execinfo.h: No such file or directory
> [16:18:52.385] 25 | #include <execinfo.h>
> [16:18:52.385] | ^~~~~~~~~~~~
>
> Greetings,

Yeah, that's just the "debug stuff" - I don't expect any of that to be
included in the commit, I only posted it for convenience. It adds a lot
of debug logging, which I hope might help others to understand what the
problem with checksums on standby is.

I took a look at this patch. I have following notes.

1) I think reporting of these errors could be better, more detailed. Especially the second one could be similar to some of other errors on checksums processing.

ereport(ERROR,
(errmsg("failed to start background worker to process data checksums")));
ereport(ERROR,
(errmsg("unable to enable data checksums in cluster")));

2) ProcessAllDatabases() contains loop, which repeats scanning the new databases for checkums. It continues while there are new database on each iteration. Could we just limit the number of iterations to 2? Given at each step we're calling WaitForAllTransactionsToFinish(), everything that gets created after first WaitForAllTransactionsToFinish() call should have checksums enabled in the beginning.

------
Regards,
Alexander Korotkov

Supabase