Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt" - Mailing list pgsql-hackers

From Tom Lane
Subject Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
Date
Msg-id 3748783.1669241103@sss.pgh.pa.us
In response to Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
List pgsql-hackers
Thomas Munro <thomas.munro@gmail.com> writes:
> On Wed, Nov 23, 2022 at 11:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> I assume this is ext4.  Presumably anything that reads the
>> controlfile, like pg_ctl, pg_checksums, pg_resetwal,
>> pg_control_system(), ... by reading without interlocking against
>> writes could see garbage.  I have lost track of the versions and the
>> thread, but I worked out at some point by experimentation that this
>> only started relatively recently for concurrent read() and write(),
>> but always happened with concurrent pread() and pwrite().  The control
>> file uses the non-p variants, which didn't mash old/new data like
>> grated cheese under concurrency due to some implementation detail, but
>> now do.

Ugh.

> As for what to do about it, some ideas:
> 2.  Retry after a short time on checksum failure.  The probability is
> already minuscule, and becomes pretty close to 0 if we read thrice
> 100ms apart.

> First thought is that 2 is the appropriate level of complexity for this
> rare and stupid problem.

Yeah, I was thinking the same.  A variant could be "repeat until
we see the same calculated checksum twice".

            regards, tom lane
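
For illustration only, here is a minimal sketch of the two ideas
discussed above: idea 2 (re-read a few times, a short delay apart, on
checksum failure) and the "same calculated checksum twice" variant. It
is not taken from any PostgreSQL patch. The DemoControlFile layout, the
demo_* names, and the plain CRC-32 are all assumptions made up for the
example (the real control file has a different layout and CRC), and the
stopping rule in demo_read_until_stable() is one possible reading of
the variant: a mismatch that computes to the same value on two
consecutive reads is treated as real corruption rather than a torn read.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

#define DEMO_PAYLOAD_LEN 64     /* hypothetical size, not the real layout */

typedef struct DemoControlFile
{
    unsigned char payload[DEMO_PAYLOAD_LEN];
    uint32_t    crc;            /* checksum of payload, stored by the writer */
} DemoControlFile;

typedef enum { DEMO_OK, DEMO_CORRUPT, DEMO_UNSTABLE } DemoResult;

/* Bitwise CRC-32 (polynomial 0xEDB88320); a stand-in for the real CRC. */
static uint32_t
demo_crc32(const unsigned char *buf, size_t len)
{
    uint32_t    crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++)
    {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* One unlocked read of the whole file; false on I/O error or short read. */
static bool
demo_read_once(const char *path, DemoControlFile *out)
{
    int         fd = open(path, O_RDONLY);
    ssize_t     n;

    if (fd < 0)
        return false;
    n = read(fd, out, sizeof(*out));
    close(fd);
    return n == (ssize_t) sizeof(*out);
}

static void
demo_sleep_ms(long ms)
{
    struct timespec ts = {ms / 1000, (ms % 1000) * 1000000L};

    nanosleep(&ts, NULL);
}

/* Idea 2: accept the first of max_tries reads whose stored CRC matches. */
static bool
demo_read_with_retry(const char *path, DemoControlFile *out,
                     int max_tries, long delay_ms)
{
    assert(max_tries > 0);
    for (int attempt = 0; attempt < max_tries; attempt++)
    {
        if (attempt > 0)
            demo_sleep_ms(delay_ms);
        if (demo_read_once(path, out) &&
            demo_crc32(out->payload, sizeof(out->payload)) == out->crc)
            return true;
    }
    return false;
}

/* Variant: stop once two consecutive reads compute the same bad checksum. */
static DemoResult
demo_read_until_stable(const char *path, DemoControlFile *out,
                       int max_tries, long delay_ms)
{
    uint32_t    prev = 0;
    bool        have_prev = false;

    for (int attempt = 0; attempt < max_tries; attempt++)
    {
        uint32_t    calc;

        if (attempt > 0)
            demo_sleep_ms(delay_ms);
        if (!demo_read_once(path, out))
        {
            have_prev = false;  /* failed read: nothing to compare against */
            continue;
        }
        calc = demo_crc32(out->payload, sizeof(out->payload));
        if (calc == out->crc)
            return DEMO_OK;
        if (have_prev && calc == prev)
            return DEMO_CORRUPT;    /* stable mismatch, not a torn read */
        prev = calc;
        have_prev = true;
    }
    return DEMO_UNSTABLE;       /* contents kept changing between reads */
}
```

A caller in the spirit of the thread would use something like
demo_read_with_retry(path, &cf, 3, 100) for "thrice, 100ms apart". The
variant costs one extra read in the genuinely corrupt case but lets the
caller report corruption without waiting out the full retry budget.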




