when should we set DB_IN_PRODUCTION? - Mailing list pgsql-hackers

From Robert Haas
Subject when should we set DB_IN_PRODUCTION?
Date
Msg-id CA+TgmobFX-6gsQKHN5T+iRUvA_UKZybM4t9xvHcPkB75kNUN-A@mail.gmail.com
Whole thread Raw
List pgsql-hackers
The following code doesn't make a lot of sense to me:

    /*
     * All done with end-of-recovery actions.
     *
     * Now allow backends to write WAL and update the control file status in
     * consequence.  SharedRecoveryState, that controls if backends can write
     * WAL, is updated while holding ControlFileLock to prevent other backends
     * to look at an inconsistent state of the control file in shared memory.
     * There is still a small window during which backends can write WAL and
     * the control file is still referring to a system not in DB_IN_PRODUCTION
     * state while looking at the on-disk control file.
     *
     * Also, we use info_lck to update SharedRecoveryState to ensure that
     * there are no race conditions concerning visibility of other recent
     * updates to shared memory.
     */
    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
    ControlFile->state = DB_IN_PRODUCTION;

    SpinLockAcquire(&XLogCtl->info_lck);
    XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
    SpinLockRelease(&XLogCtl->info_lck);

    UpdateControlFile();
    LWLockRelease(ControlFileLock);

Before ebdf5bf7d1c97a926e2b0cb6523344c2643623c7 (2016) we changed the
control file state first, then did a bunch of stuff like StartupCLOG()
and TrimCLOG() and RecoverPreparedTransactions(), and then set
RECOVERY_STATE_DONE. Judging by the commit message for the
aforementioned commit, some people didn't like the fact that
DB_IN_PRODUCTION would show up in the control file potentially some
time in advance of when the server was actually able to accept
read-write connections. However, it seems to me that we now have the
opposite problem: as soon as we set RECOVERY_STATE_DONE in shared
memory, some other backend can write WAL, which I think means that we
might start writing WAL before the control file says we're in
production. Which might not be an entirely cosmetic problem, because
there's code that looks at whether the control file state is
DB_SHUTDOWNED to figure out whether we crashed previously -- and it's
hard to imagine that it would be OK to write some WAL, crash before
updating the control file state, and then observe the control file
state on restart to still be indicative of a clean shutdown. I imagine
the race is rather narrow, but I don't see what prevents it in theory.

It seems reasonable to me to want these changes to happen as close
together as possible, and before
ebdf5bf7d1c97a926e2b0cb6523344c2643623c7 that wasn't the case, so I do
acknowledge that there was a legitimate reason to make a change. But I
don't think the change was correct in detail. True simultaneity is
impossible, and one change or the other must happen first, rather than
interleaving them as this code does. And I think the one that should
happen first is the control file update, including the on-disk copy,
because no WAL should be generated until that's fully complete. If
that's not good enough, we could update the control file twice, once
just before allowing connections to say that we're about to allow new
WAL, and once just after to confirm that we did. The first update
would be for the benefit of the server itself, so that it can be
certain whether any WAL might have been generated, and the second one
would just be for the benefit of observers.

Thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: Replace open mode with PG_BINARY_R/W/A macros
Next
From: Tom Lane
Date:
Subject: Re: Replace open mode with PG_BINARY_R/W/A macros