Re: Online enabling of checksums - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Online enabling of checksums
Date
Msg-id 20180406235126.d4sg4dtgicdpucnj@alap3.anarazel.de
Whole thread Raw
In response to Re: Online enabling of checksums  (Daniel Gustafsson <daniel@yesql.se>)
List pgsql-hackers
On 2018-04-07 01:27:13 +0200, Daniel Gustafsson wrote:
> > On 07 Apr 2018, at 01:13, Andres Freund <andres@anarazel.de> wrote:
> > 
> > On 2018-04-07 01:04:50 +0200, Daniel Gustafsson wrote:
> >>> I'm fairly certain that the bug here is a simple race condition in the
> >>> test (not the main code!):
> >> 
> >> I wonder if it may perhaps be a case of both?
> > 
> > See my other message about the atomic fallback bit.
> 
> Yep, my MUA pulled it down just as I had sent this.  Thanks for confirming my
> suspicion.

But note it fails because the code using it is WRONG. There's a reason
there's "unlocked" in the name. But even leaving that aside, it probably
*still* be wrong if it were locked.

It seems *extremely* dubious that we'll allow to re-enable the checksums
while a worker is still doing stuff for the old cycle in the
background.  Consider what happens if the checksum helper is currently
doing RequestCheckpoint() (something that can certainly take a *LONG*)
while. Another process disables checksums. Pages get written out
without checksums. Yet another process re-enables checksums. Helper
process does SetDataChecksumsOn(). Which succeeds because

    if (ControlFile->data_checksum_version != PG_DATA_CHECKSUM_INPROGRESS_VERSION)
    {
        LWLockRelease(ControlFileLock);
        elog(ERROR, "Checksums not in inprogress mode");
    }

succeeds.  Boom. Cluster with partially set checksums but marked as
valid.


Greetings,

Andres Freund


pgsql-hackers by date:

Previous
From: Chapman Flack
Date:
Subject: [PATCH] Update README for Resource Owners
Next
From: Andres Freund
Date:
Subject: Re: [PATCH] Update README for Resource Owners