Re: XID-wraparound hazards in LISTEN/NOTIFY - Mailing list pgsql-hackers
From: Mark Dilger
Subject: Re: XID-wraparound hazards in LISTEN/NOTIFY
Msg-id: d70a276e-7aa8-fb27-a342-838d96167701@gmail.com
In response to: XID-wraparound hazards in LISTEN/NOTIFY (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On 11/23/19 8:34 AM, Tom Lane wrote:
> In connection with a different issue, I wrote:
>
>> * The point of calling asyncQueueReadAllNotifications in
>> Exec_ListenPreCommit is to advance over already-committed queue entries
>> before we start sending any events to the client.  Without this, a
>> newly-listening client could be sent some very stale events.  (Note
>> that 51004c717 changed this from "somewhat stale" to "very stale".)
>
> It suddenly strikes me to worry that we have an XID wraparound hazard
> for entries in the notify queue.  The odds of seeing a live problem
> with that before 51004c717 were pretty minimal, but now that we don't
> aggressively advance the queue tail, I think it's a very real risk for
> low-notify-traffic installations.  In the worst case, imagine
>
> * Somebody sends one NOTIFY, maybe just as a test.
>
> * Nothing happens for a couple of weeks, during which the XID counter
> advances by 2 billion or so.
>
> * Newly-listening sessions will now think that that old event is
> "in the future", hence fail to advance over it, resulting in denial
> of service for new notify traffic.  There is no recourse short of
> a server restart or waiting another couple weeks for wraparound.

Is it worth checking for this condition in autovacuum?  Even for
installations with autovacuum disabled, would the anti-wraparound
vacuums happen frequently enough to also advance the tail, if they
were modified to test for this condition?

> I thought about fixing this by tracking the queue's oldest XID in
> the shared queue info, and forcing a tail advance when that got
> too old --- but if nobody actively uses any listen or notify
> features for awhile, no such code is going to execute, so the
> above scenario could happen anyway.
>
> The only bulletproof fix I can think of offhand is to widen the
> queue entries to 64-bit XIDs, which is a tad annoying from a
> space consumption standpoint.  Possibly we could compromise by
> storing the high-order bits just once per SLRU page (and then
> forcing an advance to a new page when those bits change).
>
> A somewhat less bulletproof fix is to detect far-in-the-future queue
> entries and discard them.  That would prevent the freezeup scenario,
> but there'd be a residual hazard of transmitting ancient
> notifications to clients (if a queue entry survived for 4G
> transactions, not just 2G).  But maybe that's OK, given the rather
> tiny probabilities involved, and the low guarantees around notify
> reliability in general.  It'd be a much more back-patchable answer
> than a queue format change, too.

There shouldn't be too much reason to back-patch any of this, since the
change in 51004c717 only applies to v13 and onward.  Or do you see the
risk you described as "pretty minimal" as still being large enough to
outweigh the risk of anything we might back-patch?

--
Mark Dilger