On Thu, Jun 18, 2020 at 1:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It seems like this is the core decision that needs to be taken. If
> we're willing to have these state transitions include a server restart,
> then many things get simpler. If we're not, it's gonna cost us in
> code complexity and hence bugs. Maybe the usability gain is worth it,
> or maybe not.
>
> I think it would probably be worth the trouble to pursue both designs in
> parallel for awhile, so we can get a better handle on exactly how much
> complexity we're buying into with the more ambitious definition.
I wouldn't vote to reject a patch that performed a demotion by doing a
full server restart, because it's a useful incremental step, but I
wouldn't be excited about spending a lot of time on it, either,
because it's basically crippleware by design. Either you have to
checkpoint before restarting, or you have to run recovery after
restarting, and either of those can be extremely slow. You also break
all the connections, which can produce application errors unless the
applications are pretty robustly designed, and you lose the entire
contents of shared_buffers, which makes things run very slowly even
after the restart is completed, which can cause a lengthy slow period
even after the system is nominally back up. All of those things are
really bad, and AFAICT the first one is the worst by a considerable
margin. It can take 20 minutes to checkpoint and even longer to run
recovery, so doing this once per year already puts you outside of
five-nines territory, and we release critical updates that cannot be
applied without a server restart about four times per year. That means
that if you perform updates by using a switchover -- a common practice
-- and if you apply all of your updates in a timely fashion --
unfortunately, not such a common practice, but one we'd surely like to
encourage -- you can't even achieve four nines if a switchover
requires either a checkpoint or running recovery. And that's without
accounting for any switchovers that you may need to perform for
reasons unrelated to upgrades, and without accounting for any
unplanned downtime. Not many people in 2020 are interested in running
a system with three nines of availability, so I think it is clear that
we need to do better.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company