Thread: XID-wraparound hazards in LISTEN/NOTIFY

XID-wraparound hazards in LISTEN/NOTIFY

From
Tom Lane
Date:
In connection with a different issue, I wrote:

> * The point of calling asyncQueueReadAllNotifications in
> Exec_ListenPreCommit is to advance over already-committed queue entries
> before we start sending any events to the client.  Without this, a
> newly-listening client could be sent some very stale events.  (Note
> that 51004c717 changed this from "somewhat stale" to "very stale".

It suddenly strikes me to worry that we have an XID wraparound hazard
for entries in the notify queue.  The odds of seeing a live problem
with that before 51004c717 were pretty minimal, but now that we don't
aggressively advance the queue tail, I think it's a very real risk for
low-notify-traffic installations.  In the worst case, imagine

* Somebody sends one NOTIFY, maybe just as a test.

* Nothing happens for a couple of weeks, during which the XID counter
advances by 2 billion or so.

* Newly-listening sessions will now think that that old event is
"in the future", hence fail to advance over it, resulting in denial
of service for new notify traffic.  There is no recourse short of
a server restart or waiting another couple weeks for wraparound.

I thought about fixing this by tracking the queue's oldest XID in
the shared queue info, and forcing a tail advance when that got
too old --- but if nobody actively uses any listen or notify
features for awhile, no such code is going to execute, so the
above scenario could happen anyway.

The only bulletproof fix I can think of offhand is to widen the
queue entries to 64-bit XIDs, which is a tad annoying from a
space consumption standpoint.  Possibly we could compromise by
storing the high-order bits just once per SLRU page (and then
forcing an advance to a new page when those bits change).

A somewhat less bulletproof fix is to detect far-in-the-future queue
entries and discard them.  That would prevent the freezeup scenario,
but there'd be a residual hazard of transmitting ancient
notifications to clients (if a queue entry survived for 4G
transactions not just 2G).  But maybe that's OK, given the rather
tiny probabilities involved, and the low guarantees around notify
reliability in general.  It'd be a much more back-patchable answer
than a queue format change, too.

Thoughts?  I'm not really planning to work on this myself, but
somebody oughta.

            regards, tom lane



Re: XID-wraparound hazards in LISTEN/NOTIFY

From
Mark Dilger
Date:

On 11/23/19 8:34 AM, Tom Lane wrote:
> In connection with a different issue, I wrote:
> 
>> * The point of calling asyncQueueReadAllNotifications in
>> Exec_ListenPreCommit is to advance over already-committed queue entries
>> before we start sending any events to the client.  Without this, a
>> newly-listening client could be sent some very stale events.  (Note
>> that 51004c717 changed this from "somewhat stale" to "very stale".
> 
> It suddenly strikes me to worry that we have an XID wraparound hazard
> for entries in the notify queue.  The odds of seeing a live problem
> with that before 51004c717 were pretty minimal, but now that we don't
> aggressively advance the queue tail, I think it's a very real risk for
> low-notify-traffic installations.  In the worst case, imagine
> 
> * Somebody sends one NOTIFY, maybe just as a test.
> 
> * Nothing happens for a couple of weeks, during which the XID counter
> advances by 2 billion or so.
> 
> * Newly-listening sessions will now think that that old event is
> "in the future", hence fail to advance over it, resulting in denial
> of service for new notify traffic.  There is no recourse short of
> a server restart or waiting another couple weeks for wraparound.

Is it worth checking for this condition in autovacuum?  Even for
installations with autovacuum disabled, would the anti-wraparound
vacuums happen frequently enough to also advance the tail if modified
to test for this condition?

> I thought about fixing this by tracking the queue's oldest XID in
> the shared queue info, and forcing a tail advance when that got
> too old --- but if nobody actively uses any listen or notify
> features for awhile, no such code is going to execute, so the
> above scenario could happen anyway.
> 
> The only bulletproof fix I can think of offhand is to widen the
> queue entries to 64-bit XIDs, which is a tad annoying from a
> space consumption standpoint.  Possibly we could compromise by
> storing the high-order bits just once per SLRU page (and then
> forcing an advance to a new page when those bits change).
> 
> A somewhat less bulletproof fix is to detect far-in-the-future queue
> entries and discard them.  That would prevent the freezeup scenario,
> but there'd be a residual hazard of transmitting ancient
> notifications to clients (if a queue entry survived for 4G
> transactions not just 2G).  But maybe that's OK, given the rather
> tiny probabilities involved, and the low guarantees around notify
> reliability in general.  It'd be a much more back-patchable answer
> than a queue format change, too.

There shouldn't be too much reason to back-patch any of this, since
the change in 51004c717 only applies to v13 and onward.  Or do you
see the risk you described as "pretty minimal" as still being large
enough to outweigh the risk of anything we might back-patch?



-- 
Mark Dilger



Re: XID-wraparound hazards in LISTEN/NOTIFY

From
Tom Lane
Date:
Mark Dilger <hornschnorter@gmail.com> writes:
> On 11/23/19 8:34 AM, Tom Lane wrote:
>> It suddenly strikes me to worry that we have an XID wraparound hazard
>> for entries in the notify queue.

> Is it worth checking for this condition in autovacuum?

Dunno, maybe.  It's a different avenue to consider, at least.

> There shouldn't be too much reason to back-patch any of this, since
> the change in 51004c717 only applies to v13 and onward.  Or do you
> see the risk you described as "pretty minimal" as still being large
> enough to outweigh the risk of anything we might back-patch?

There may not be a risk large enough to worry about before 51004c717,
assuming that we discount cases like a single session staying
idle-in-transaction for long enough for the XID counter to wrap
(which'd cause problems for more than just LISTEN/NOTIFY).  I haven't
analyzed this carefully enough to be sure.  We'd have to consider
that, as well as the complexity of whatever fix we choose for HEAD,
while deciding if we need a back-patch.

            regards, tom lane



Re: XID-wraparound hazards in LISTEN/NOTIFY

From
Bruce Momjian
Date:
On Sat, Nov 23, 2019 at 12:10:56PM -0500, Tom Lane wrote:
> Mark Dilger <hornschnorter@gmail.com> writes:
> > On 11/23/19 8:34 AM, Tom Lane wrote:
> >> It suddenly strikes me to worry that we have an XID wraparound hazard
> >> for entries in the notify queue.
> 
> > Is it worth checking for this condition in autovacuum?
> 
> Dunno, maybe.  It's a different avenue to consider, at least.
> 
> > There shouldn't be too much reason to back-patch any of this, since
> > the change in 51004c717 only applies to v13 and onward.  Or do you
> > see the risk you described as "pretty minimal" as still being large
> > enough to outweigh the risk of anything we might back-patch?
> 
> There may not be a risk large enough to worry about before 51004c717,
> assuming that we discount cases like a single session staying
> idle-in-transaction for long enough for the XID counter to wrap
> (which'd cause problems for more than just LISTEN/NOTIFY).  I haven't
> analyzed this carefully enough to be sure.  We'd have to consider
> that, as well as the complexity of whatever fix we choose for HEAD,
> while deciding if we need a back-patch.

Is this still an open issue?  Should it be a TODO item?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: XID-wraparound hazards in LISTEN/NOTIFY

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> On Sat, Nov 23, 2019 at 12:10:56PM -0500, Tom Lane wrote:
>>> It suddenly strikes me to worry that we have an XID wraparound hazard
>>> for entries in the notify queue.

> Is this still an open issue?  Should it be a TODO item?

I don't think anyone's done anything about it, so yeah.

Realistically, if you've got NOTIFY messages that are going unread
for long enough to risk XID wraparound, your app is broken.  So
maybe it'd be sufficient to discard messages that are old enough
to approach the wrap horizon.  But still that's code that doesn't
exist.

            regards, tom lane



Re: XID-wraparound hazards in LISTEN/NOTIFY

From
Bruce Momjian
Date:
On Wed, Nov  8, 2023 at 02:52:16PM -0500, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > On Sat, Nov 23, 2019 at 12:10:56PM -0500, Tom Lane wrote:
> >>> It suddenly strikes me to worry that we have an XID wraparound hazard
> >>> for entries in the notify queue.
> 
> > Is this still an open issue?  Should it be a TODO item?
> 
> I don't think anyone's done anything about it, so yeah.
> 
> Realistically, if you've got NOTIFY messages that are going unread
> for long enough to risk XID wraparound, your app is broken.  So
> maybe it'd be sufficient to discard messages that are old enough
> to approach the wrap horizon.  But still that's code that doesn't
> exist.

Thanks, TODO added.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.