Re: Race condition in SyncRepGetSyncStandbysPriority - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: Race condition in SyncRepGetSyncStandbysPriority |
Date | |
Msg-id | CA+fd4k7rFQ+3fSA-gYho9FbOA3v+E9865jKSg2RDiBOTYMQPYQ@mail.gmail.com Whole thread Raw |
In response to | Re: Race condition in SyncRepGetSyncStandbysPriority (Masahiko Sawada <masahiko.sawada@2ndquadrant.com>) |
Responses |
Re: Race condition in SyncRepGetSyncStandbysPriority
|
List | pgsql-hackers |
On Tue, 31 Mar 2020 at 23:16, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Fri, 27 Mar 2020 at 13:54, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > > > > > > > > On 2020/03/27 10:26, Tom Lane wrote: > > > Twice in the past month [1][2], buildfarm member hoverfly has managed > > > to reach the "unreachable" Assert(false) at the end of > > > SyncRepGetSyncStandbysPriority. > > > > When I search the past discussions, I found that Noah Misch reported > > the same issue. > > https://www.postgresql.org/message-id/20200206074552.GB3326097@rfd.leadboat.com > > > > > What seems likely to me, after quickly eyeballing the code, is that > > > hoverfly is hitting the blatantly-obvious race condition in that function. > > > Namely, that the second loop supposes that the state of the walsender > > > array hasn't changed since the first loop. > > > > > > The minimum fix for this, I suppose, would have the first loop capture > > > the sync_standby_priority value for each walsender along with what it's > > > already capturing. But I wonder if the whole function shouldn't be > > > rewritten from scratch, because it seems like the algorithm is both > > > expensively brute-force and unintelligible, which is a sad combination. > > > It's likely that the number of walsenders would never be high enough > > > that efficiency could matter, but then couldn't we use an algorithm > > > that is less complicated and more obviously correct? > > > > +1 to rewrite the function with better algorithm. > > > > > (Because the > > > alternative conclusion, if you reject the theory that a race is happening, > > > is that the algorithm is just flat out buggy; something that's not too > > > easy to disprove either.) > > > > > > Another fairly dubious thing here is that whether or not *am_sync > > > gets set depends not only on whether MyWalSnd is claiming to be > > > synchronous but on how many lower-numbered walsenders are too. > > > Is that really the right thing? > > > > > > But worse than any of that is that the return value seems to be > > > a list of walsender array indexes, meaning that the callers cannot > > > use it without making even stronger assumptions about the array > > > contents not having changed since the start of this function. > > > > > > It sort of looks like the design is based on the assumption that > > > the array contents can't change while SyncRepLock is held ... but > > > if that's the plan then why bother with the per-walsender spinlocks? > > > In any case this assumption seems to be failing, suggesting either > > > that there's a caller that's not holding SyncRepLock when it calls > > > this function, or that somebody is failing to take that lock while > > > modifying the array. > > > > As far as I read the code, that assumption seems still valid. But the problem > > is that each walsender updates MyWalSnd->sync_standby_priority at each > > convenient timing, when SIGHUP is signaled. That is, at a certain moment, > > some walsenders (also their WalSnd entries in shmem) work based on > > the latest configuration but the others (also their WalSnd entries) work based > > on the old one. > > > > lowest_priority = SyncRepConfig->nmembers; > > next_highest_priority = lowest_priority + 1; > > > > SyncRepGetSyncStandbysPriority() calculates the lowest priority among > > all running walsenders as the above, by using the configuration info that > > this walsender is based on. But this calculated lowest priority would be > > invalid if other walsender is based on different (e.g., old) configuraiton. > > This can cause the (other) walsender to have lower priority than > > the calculated lowest priority and the second loop in > > SyncRepGetSyncStandbysPriority() to unexpectedly end. > > I have the same understanding. Since sync_standby_priroity is > protected by SyncRepLock these values of each walsender are not > changed through two loops in SyncRepGetSyncStandbysPriority(). > However, as Fujii-san already mentioned the true lowest priority can > be lower than lowest_priority, nmembers, when only part of walsenders > reloaded the configuration, which in turn could be the cause of > leaving entries in the pending list at the end of the function. > > > Therefore, the band-aid fix seems to be to set the lowest priority to > > very large number at the beginning of SyncRepGetSyncStandbysPriority(). > > I think we can use max_wal_senders. Sorry, that's not true. We need another number large enough. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: