Re: Race condition in SyncRepGetSyncStandbysPriority - Mailing list pgsql-hackers
From: Tom Lane
Subject: Re: Race condition in SyncRepGetSyncStandbysPriority
Date:
Msg-id: 5637.1586644230@sss.pgh.pa.us
In response to: Re: Race condition in SyncRepGetSyncStandbysPriority (Masahiko Sawada <masahiko.sawada@2ndquadrant.com>)
Responses: Re: Race condition in SyncRepGetSyncStandbysPriority
List: pgsql-hackers
Masahiko Sawada <masahiko.sawada@2ndquadrant.com> writes:
> On Tue, 31 Mar 2020 at 23:16, Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
>>> Therefore, the band-aid fix seems to be to set the lowest priority to
>>> very large number at the beginning of SyncRepGetSyncStandbysPriority().

>> I think we can use max_wal_senders.

> Sorry, that's not true. We need another number large enough.

The buildfarm had another three failures of this type today, so that motivated me to look at it some more. I don't think this code needs a band-aid fix; I think "nuke from orbit" is more nearly the right level of response.

The point that I was trying to make originally is that it seems quite insane to imagine that a walsender's sync_standby_priority value is somehow more stable than the very existence of the process. Yet we only require a walsender to lock its own mutex while claiming or disowning its WalSnd entry (by setting or clearing the pid field). So I think it's nuts to define those fields as being protected by the global SyncRepLock.

Even without considering the possibility that a walsender has just started or stopped, we have the problem Fujii-san described: after a change in the synchronous_standby_names GUC setting, different walsenders will update their values of sync_standby_priority at different instants. (BTW, I now notice that Noah had previously identified this problem at [1].) Thus, even while holding SyncRepLock, we have no guarantee that we'll see a consistent set of sync_standby_priority values. In fact, we don't even know that the walsnd array entries still belong to the processes that last set those values.

This is what is breaking SyncRepGetSyncStandbysPriority, and what it means is that there is fundamentally no chance of that function producing trustworthy results. The band-aid fixes discussed here might avoid crashing on the Assert, but they won't fix the problems that (a) the result is possibly wrong and (b) it can become stale immediately even if it's right when returned.

Now, there are only two callers of SyncRepGetSyncStandbys: SyncRepGetSyncRecPtr and pg_stat_get_wal_senders. The latter is mostly cosmetic (which is a good thing, because to add insult to injury, it continues to use the list after releasing SyncRepLock; not that continuing to hold that lock would make things much safer). If I'm reading the code correctly, the former doesn't really care exactly which walsenders are sync standbys: all it cares about is collecting their WAL position pointers.

What I think we should do about this is, essentially, get rid of SyncRepGetSyncStandbys. Instead, let's have each walsender advertise whether *it* believes that it is a sync standby, based on its last evaluation of the relevant GUCs. This would be a bool that it'd compute and set alongside sync_standby_priority. (Hm, maybe we'd not even need that field anymore? Not sure.) We should also redefine that flag, and sync_standby_priority if it survives, as being protected by the per-walsender mutex, not SyncRepLock.

Then, what SyncRepGetSyncRecPtr would do is just sweep through the walsender array and collect WAL position pointers from the walsenders that claim to be sync standbys at the instant it inspects them. pg_stat_get_wal_senders could likewise use those flags instead of the list from SyncRepGetSyncStandbys.
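
To make the walsender side of that concrete, here is roughly the shape I'm imagining. This is only a sketch: "is_sync_standby" and both helper functions are placeholder names for things that don't exist today, and I haven't thought hard about where exactly this would get called from (GUC reload and startup, presumably):

    /*
     * Sketch only: re-evaluate this walsender's sync-standby status (e.g.
     * after absorbing SIGHUP) and advertise it under the per-walsender
     * spinlock rather than SyncRepLock.
     */
    static void
    SyncRepUpdateMySyncStatus(void)
    {
        WalSnd *walsnd = MyWalSnd;
        int     priority;
        bool    am_sync;

        /* evaluate synchronous_standby_names, however we factor that out */
        priority = SyncRepComputeMyPriority();      /* placeholder name */
        am_sync = SyncRepAmISyncStandby(priority);  /* placeholder name */

        SpinLockAcquire(&walsnd->mutex);
        walsnd->sync_standby_priority = priority;   /* if we keep this field */
        walsnd->is_sync_standby = am_sync;          /* proposed new flag */
        SpinLockRelease(&walsnd->mutex);
    }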
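
The consuming side would then reduce to a simple sweep, something like the below. Again just a sketch, and only for the priority case, where we want the oldest positions among the sync standbys; quorum mode would want the Nth-latest instead, but the locking pattern is the same:

    /*
     * Sketch only: collect the oldest write/flush/apply positions from
     * whichever walsenders claim, at this instant, to be sync standbys.
     */
    static void
    SyncRepGetOldestAdvertisedRecPtr(XLogRecPtr *writePtr,
                                     XLogRecPtr *flushPtr,
                                     XLogRecPtr *applyPtr)
    {
        int     i;

        *writePtr = *flushPtr = *applyPtr = InvalidXLogRecPtr;

        for (i = 0; i < max_wal_senders; i++)
        {
            WalSnd     *walsnd = &WalSndCtl->walsnds[i];
            XLogRecPtr  write,
                        flush,
                        apply;
            bool        is_sync;

            /* copy out everything we need while holding the spinlock */
            SpinLockAcquire(&walsnd->mutex);
            is_sync = (walsnd->pid != 0 && walsnd->is_sync_standby);
            write = walsnd->write;
            flush = walsnd->flush;
            apply = walsnd->apply;
            SpinLockRelease(&walsnd->mutex);

            if (!is_sync)
                continue;

            /* track the oldest positions among advertised sync standbys */
            if (XLogRecPtrIsInvalid(*writePtr) || *writePtr > write)
                *writePtr = write;
            if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > flush)
                *flushPtr = flush;
            if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
                *applyPtr = apply;
        }
    }

Note that there is no moment in this loop where we depend on two walsenders' state being mutually consistent: each entry is read atomically under its own spinlock and either counts or doesn't.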
It's likely that this definition would have slightly different behavior from the current implementation during the period where the system is absorbing a change in the set of synchronous walsenders. However, since the current implementation is visibly wrong during that period anyway, I'm not sure how this could be worse. And at least we can be certain that SyncRepGetSyncRecPtr will not include WAL positions from already-dead walsenders in its calculations, which *is* a hazard in the code as it stands.

I also estimate that this would be noticeably more efficient than the current code, since the logic to decide who's a sync standby would only run when we're dealing with walsender start/stop or SIGHUP, rather than every time SyncRepGetSyncRecPtr runs.

Don't especially want to code this myself, though. Anyone?

			regards, tom lane

[1] https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com