when should we set DB_IN_PRODUCTION? - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | when should we set DB_IN_PRODUCTION? |
Date | |
Msg-id | CA+TgmobFX-6gsQKHN5T+iRUvA_UKZybM4t9xvHcPkB75kNUN-A@mail.gmail.com Whole thread Raw |
List | pgsql-hackers |
The following code doesn't make a lot of sense to me: /* * All done with end-of-recovery actions. * * Now allow backends to write WAL and update the control file status in * consequence. SharedRecoveryState, that controls if backends can write * WAL, is updated while holding ControlFileLock to prevent other backends * to look at an inconsistent state of the control file in shared memory. * There is still a small window during which backends can write WAL and * the control file is still referring to a system not in DB_IN_PRODUCTION * state while looking at the on-disk control file. * * Also, we use info_lck to update SharedRecoveryState to ensure that * there are no race conditions concerning visibility of other recent * updates to shared memory. */ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); ControlFile->state = DB_IN_PRODUCTION; SpinLockAcquire(&XLogCtl->info_lck); XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE; SpinLockRelease(&XLogCtl->info_lck); UpdateControlFile(); LWLockRelease(ControlFileLock); Before ebdf5bf7d1c97a926e2b0cb6523344c2643623c7 (2016) we changed the control file state first, then did a bunch of stuff like StartupCLOG() and TrimCLOG() and RecoverPreparedTransactions(), and then set RECOVERY_STATE_DONE. Judging by the commit message for the aforementioned commit, some people didn't like the fact that DB_IN_PRODUCTION would show up in the control file potentially some time in advance of when the server was actually able to accept read-write connections. However, it seems to me that we now have the opposite problem: as soon as we set RECOVERY_STATE_DONE in shared memory, some other backend can write WAL, which I think means that we might start writing WAL before the control file says we're in production. Which might not be an entirely cosmetic problem, because there's code that looks at whether the control file state is DB_SHUTDOWNED to figure out whether we crashed previously -- and it's hard to imagine that it would be OK to write some WAL, crash before updating the control file state, and then observe the control file state on restart to still be indicative of a clean shutdown. I imagine the race is rather narrow, but I don't see what prevents it in theory. It seems reasonable to me to want these changes to happen as close together as possible, and before ebdf5bf7d1c97a926e2b0cb6523344c2643623c7 that wasn't the case, so I do acknowledge that there was a legitimate reason to make a change. But I don't think the change was correct in detail. True simultaneity is impossible, and one change or the other must happen first, rather than interleaving them as this code does. And I think the one that should happen first is the control file update, including the on-disk copy, because no WAL should be generated until that's fully complete. If that's not good enough, we could update the control file twice, once just before allowing connections to say that we're about to allow new WAL, and once just after to confirm that we did. The first update would be for the benefit of the server itself, so that it can be certain whether any WAL might have been generated, and the second one would just be for the benefit of observers. Thoughts? -- Robert Haas EDB: http://www.enterprisedb.com
pgsql-hackers by date: