Re: LISTEN denial of service with aborted transaction - Mailing list pgsql-hackers

From Matt Newell
Subject Re: LISTEN denial of service with aborted transaction
Msg-id 4705384.iNuYeV9kCG@obsidian
In response to Re: LISTEN denial of service with aborted transaction  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: LISTEN denial of service with aborted transaction
List pgsql-hackers
On Tuesday, September 29, 2015 09:58:35 PM Tom Lane wrote:
> Matt Newell <newellm@blur.com> writes:
> > On Monday, September 28, 2015 07:22:26 PM Tom Lane wrote:
> >> Possibly.  sinval catchup notification works like that, so you could
> >> maybe
> >> invent a similar mechanism for notify.  Beware that we've had to fix
> >> performance issues with sinval catchup; you don't want to release a whole
> >> bunch of processes to do work at the same time.
> > 
> > I'll have to think about this more, but I don't envision it causing such a
> > scenario.
> > The extra processing that will happen at signaling time would have been
> > done anyway when the aborted transaction is finally rolled back, the only
> > extra work is copying the relevant notifications to local memory.
> 
> Hm.  An idle-in-transaction listening backend will eventually cause a more
> serious form of denial of service: once the pg_notify SLRU area fills up
> to its maximum of 8GB, no more new messages can be inserted, and every
> transaction that tries to do a NOTIFY will abort.  However, that's a
> documented hazard and we've not really had complaints about it.  In any
> case, trying to fix that by having such a backend absorb messages into
> local memory doesn't seem like a great idea: if it sits for long enough,
> its local memory usage will bloat.  Eventually you'd have a failure
> anyway.
> 
> > Regardless it might not be worthwhile since my patch for #1 seems to
> > address the symptom that bit me.
> 
> I looked at this patch for a bit and found it kind of ugly, particularly
> the business about "we allow one process at a time to advance
> firstUncommitted by using advancingFirstUncommitted as a mutex".
> That doesn't sound terribly safe or performant.
> 
It can be implemented without that, but then you'll have multiple backends 
trying to do the same work.  That might not matter in practice, but I couldn't 
tell at first glance how expensive the extra calls to 
TransactionIdIsInProgress would be.  Since no extra locks are taken, I figured 
ensuring the work is only done once was good for performance.

If the cluster only has one database generating notifies, then there is 
practically no extra work with the patch.

As far as safety is concerned, I think the worst case is that behavior reverts 
to what it is now, where firstUncommitted ends up tracking QUEUE_TAIL.

I originally used a boolean, then realized I could get some extra safety by 
storing the BackendId instead, so that the mutex is released automatically if 
the backend dies.
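
Roughly the idea (just a sketch to illustrate, not the actual patch code; 
advancingFirstUncommitted is the field Tom quoted from the patch):

    /*
     * Sketch only: claim the right to advance firstUncommitted.  Because the
     * claim records a BackendId instead of a bare bool, anyone can tell who
     * holds it, and it can be cleared if that backend is no longer around.
     */
    LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
    if (asyncQueueControl->advancingFirstUncommitted == InvalidBackendId)
        asyncQueueControl->advancingFirstUncommitted = MyBackendId;
    LWLockRelease(AsyncQueueLock);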

> After further thought, I wonder whether instead of having an incoming
> listener initialize its "pos" to QUEUE_TAIL, we couldn't have it
> initialize to the maximum "pos" among existing listeners (with a floor of
> QUEUE_TAIL of course).  If any other listener has advanced over a given
> message X, then X was committed (or aborted) earlier and the new listener
> is not required to return it; so it should be all right to take that
> listener's "pos" as a starting point.

That's a great idea, and I will try it.  It does need to be the maximum 
position of existing listeners *with matching db oid*, since 
asyncQueueReadAllNotifications doesn't check transaction status if the db oid 
doesn't match its own.  Another advantage of this approach is that a queued 
notify from an open transaction in one database won't incur additional cost on 
LISTENs coming from other databases, whereas with my patch such a situation 
would prevent firstUncommitted from advancing.
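
Something along these lines in Exec_ListenPreCommit, I think (just a sketch; 
QUEUE_BACKEND_DBOID and QUEUE_POS_MAX are made-up names, and the per-backend 
db oid would have to be added to the queue's backend status array):

    /*
     * Sketch: start the new listener at the furthest-ahead position of any
     * existing listener in our own database, never behind QUEUE_TAIL.
     */
    QueuePosition max = QUEUE_TAIL;
    int         i;

    LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
    for (i = 1; i <= MaxBackends; i++)
    {
        if (QUEUE_BACKEND_PID(i) != InvalidPid &&
            QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
            max = QUEUE_POS_MAX(max, QUEUE_BACKEND_POS(i));
    }
    QUEUE_BACKEND_POS(MyBackendId) = max;
    QUEUE_BACKEND_PID(MyBackendId) = MyProcPid;
    LWLockRelease(AsyncQueueLock);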

> 
> The minimum-code-change way to do that would be to compute the max pos
> the hard way while Exec_ListenPreCommit holds the AsyncQueueLock, which
> it would now have to do in exclusive mode.  That's a little bit annoying
> but it's surely not much worse than what SignalBackends has to do after
> every notification.
Yeah, I think it's going to be a win regardless.  We could also skip the whole 
thing if QUEUE_HEAD - QUEUE_TAIL is under a fixed amount, which would probably 
cover a lot of use cases.
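
i.e., with max initialized to QUEUE_TAIL as above, only bother scanning when 
the queue has grown past some small size (the one-page threshold here is 
arbitrary, just for the sketch):

    /* Only scan for a better starting point when the queue is non-trivial. */
    if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
                           QUEUE_POS_PAGE(QUEUE_TAIL)) > 0)
    {
        /* ... scan existing listeners for the max pos, as sketched above ... */
    }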

> 
> Alternatively, we could try to maintain a shared pointer that is always
> equal to the frontmost backend's "pos".  But I think that would require
> more locking during queue reading than exists now; so it's far from clear
> it would be a win, especially in systems where listening sessions last for
> a reasonable amount of time (so that Exec_ListenPreCommit is infrequent).
And that would be even more complicated, since it would have to be per db oid.

Matt Newell


