Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt" - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
Date
Msg-id CA+hUKGKzODWydEc0ZyAYXyFscqSQAvxDWCkDEOggiP4OZdr9BQ@mail.gmail.com
Whole thread Raw
In response to Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"  ("Anton A. Melnikov" <aamelnikov@inbox.ru>)
Responses Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
List pgsql-hackers
On Tue, Jan 31, 2023 at 2:10 PM Anton A. Melnikov <aamelnikov@inbox.ru> wrote:
> Also checked for a few hours that the patch 0002 fixes this error,
> but there are some questions to its logical structure.

Hi Anton,

Thanks for looking!

> The equality between the previous and newly calculated crc is checked only
> if the last crc calculation was wrong, i.e not equal to the value stored in the file.
> It is very unlikely that in this case the previous and new crc can match, so, in fact,
> the loop will spin until crc is calculated correctly. In the other words,
> this algorithm is practically equivalent to an infinite loop of reading from a file
> and calculating crc while(EQ_CRC32C(crc, ControlFile->crc) != true).

Maybe it's unlikely that two samples will match while running that
torture test, because it's overwriting the file as fast as it can.
But the idea is that a real system isn't writing the control file most
of the time.

> But then it can be simplified significantly by removing checksums equality checks,
> bool fist_try and by limiting the maximum number of iterations
> with some constant in the e.g. for loop to avoid theoretically possible freeze.

Yeah, I was thinking that we should also put a limit on the loop, just
to be cautious.

Primary servers write the control file very infrequently.  Standybys
more frequently, while writing data out, maybe every few seconds on a
busy system writing out a lot of data.  UpdateMinRecoveryPoint() makes
some effort to avoid updating the file too often.  You definitely see
bursts of repeated flushes that might send this thing in a loop for a
while if the timings were exactly wrong, but usually with periodic
gaps; I haven't really studied the expected behaviour too closely.

> Or maybe use the variant suggested by Tom Lane, i.e, as far as i understand,
> repeat the file_open-read-close-calculate_crc sequence twice without a pause between
> them and check the both calculated crcs for the equality. If they match, exit and return
> the bool result of comparing between the last calculation with the value from the file,
> if not, take a pause and repeat everything from the beginning.

Hmm.  Would it be good enough to do two read() calls with no sleep in
between?  How sure are we that a concurrent write will manage to
change at least one bit that our second read can see?  I guess it's
likely, but maybe hypervisors, preemptible kernels, I/O interrupts or
a glitch in the matrix could decrease our luck?  I really don't know.
So I figured I should add a small sleep between the reads to change
the odds in our favour.  But we don't want to slow down all reads of
the control file with a sleep, do we?  So I thought we should only
bother doing this slow stuff if the actual CRC check fails, a low
probability event.

Clearly there is an element of speculation or superstition here.  I
don't know what else to do if both PostgreSQL and ext4 decided not to
add interlocking.  Maybe we should rethink that.  How bad would it
really be if control file access used POSIX file locking?  I mean, the
writer is going to *fsync* the file, so it's not like one more wafer
thin system call is going to hurt too much.



pgsql-hackers by date:

Previous
From: Justin Pryzby
Date:
Subject: Re: Improve logging when using Huge Pages
Next
From: Amin
Date:
Subject: Re: Scan buffercache for a table