Re: Changing the state of data checksums in a running cluster - Mailing list pgsql-hackers

From	Daniel Gustafsson
Subject	Re: Changing the state of data checksums in a running cluster
Date	April 6 03:20:34
Msg-id	B627A8A6-0239-486A-8CD4-96130603FAAA@yesql.se
In response to	Re: Changing the state of data checksums in a running cluster (Andres Freund <andres@anarazel.de>)
List	pgsql-hackers

> On 5 Apr 2026, at 06:56, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2026-04-05 00:27:00 +0200, Daniel Gustafsson wrote:
>>> On 4 Apr 2026, at 02:35, Daniel Gustafsson <daniel@yesql.se> wrote:
>>>
>>>> On 4 Apr 2026, at 00:59, Daniel Gustafsson <daniel@yesql.se> wrote:
>>>>
>>>>> On 3 Apr 2026, at 23:46, Daniel Gustafsson <daniel@yesql.se> wrote:
>>>>>
>>>>> After many more runs on CI I ended up pushing this version, and I see BF
>>>>> members being angry due to the test not waiting for the launcher to exit.  I am
>>>>> working on a fix right now.
>>>>
>>>> 0036232ba8f seems to have made the failing animals slightly happier, I will
>>>> continue to monitor the buildfarm for other fallout.
>>>
>>> The intermittent failure on kestrel implies timing similar to the one fixed in
>>> 0036232ba8fb28, a tentative fix is to make it part of waiting for an endstate
>>> (on or off) to make sure the cluster is always in the right state for new
>>> operations.  Right now kestrel is the one which has been flapping, I'm waiting
>>> a bit to see if more will follow and give further clues.
>>
>> mylodon had the same failure, and I believe the bug is in my injection point
>> test code.  I have a tentative fix in the attached refactoring which moves over
>> to using the injection_point extension module.  It's still fairly rare so I'm
>> holding off for a little bit before pushing it to see if I can collect a little
>> bit more evidence.
>
> There are a lot of checksum-related errors on CI:
>
> https://cirrus-ci.com/task/4848298592305152

[22:35:56.818] # poll_query_until timed out executing this query:
[22:35:56.818] # SELECT setting FROM pg_catalog.pg_settings WHERE name = 'data_checksums';
[22:35:56.818] # expecting this output:
[22:35:56.818] # inprogress-on
[22:35:56.818] # last actual query output:
[22:35:56.818] # on

Another timing error, solved by allowing for on as well as inprogress-on and
expanding the wait fix already committed into a more generic one.
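
To illustrate the timing issue, here is a minimal sketch in Python (not the
committed Perl test code) of a poll that accepts a set of states instead of a
single one.  The names poll_until and make_fetch are hypothetical helpers made
up for this example; only the state names come from the log above.

```python
import time


def poll_until(fetch, accepted, timeout=5.0, interval=0.01):
    """Call fetch() until it returns a value in `accepted` or time runs out.

    Returns (ok, last_value) so the caller can report the last observed state.
    """
    deadline = time.monotonic() + timeout
    while True:
        last = fetch()
        if last in accepted:
            return True, last
        if time.monotonic() >= deadline:
            return False, last
        time.sleep(interval)


def make_fetch(states):
    """Hypothetical stand-in for querying pg_settings: replays `states` in
    order and then keeps returning the final one."""
    it = iter(states)
    last = [None]

    def fetch():
        last[0] = next(it, last[0])
        return last[0]

    return fetch


# The transient state is never observed because processing was too fast.
fast_cluster = ["off", "on"]

# Waiting only for the transient state times out with "on" as last output.
strict = poll_until(make_fetch(fast_cluster), {"inprogress-on"}, timeout=0.05)

# Accepting the end state as well makes the wait robust to fast processing.
relaxed = poll_until(make_fetch(fast_cluster), {"inprogress-on", "on"}, timeout=1.0)
```

With a fast cluster the strict wait fails exactly like the log above, while the
relaxed wait returns as soon as either state is seen.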

> https://cirrus-ci.com/task/5338691381493760

This one was interesting: it managed to hit a bug where the worker process
starts, and finishes, before the launcher manages to wait for it to start up.
The BGWH_STOPPED return was erroneously interpreted as a failure.
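
The race can be modelled roughly as follows.  This is a Python sketch of the
reasoning only, not the C code in the tree: the enum mirrors the status names
a startup wait can return, while startup_outcome and the worker_result
argument are assumptions invented for the example (some record of what the
worker reported before it exited).

```python
from enum import Enum, auto


class BgwStatus(Enum):
    """Simplified model of the statuses a startup wait can return."""
    STARTED = auto()
    STOPPED = auto()
    POSTMASTER_DIED = auto()


def startup_outcome(status, worker_result):
    """Classify what the launcher should conclude after waiting for startup.

    worker_result is hypothetical: "success", "failed", or None if the worker
    never reported anything.
    """
    if status is BgwStatus.STARTED:
        return "running"
    if status is BgwStatus.STOPPED:
        # STOPPED alone is not an error: a quick worker may already be done,
        # so consult what it reported rather than failing outright.
        return "finished" if worker_result == "success" else "failed"
    # Postmaster death is always a failure from the launcher's point of view.
    return "failed"
```

The point is only that a stopped worker has to be told apart from a worker
that never ran or ran and failed.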

> https://cirrus-ci.com/task/6271077241847808

Cheeky, the processing managed to finish between closing the connection that
was blocking progress and shutting down the cluster. Reordering to keep the

> https://cirrus-ci.com/task/6150048418889728

Seems like the same error as the first one.

I've pushed fixes for all of these as well as the intermittent failures that
were seen on some BF animals, and will stare at the buildfarm for a while now.
About 10 machines have built these green so far, so it looks decent.

--
Daniel Gustafsson
