LISTEN denial of service with aborted transaction

From: Matt Newell

This morning I began receiving reports that our production database was
"down".

I checked the log and was surprised to see extremely long durations for the
LISTEN that our database library issues after each connection is made.  This
coincided with many (approximately 600) new connections arriving in a short
time window, because render nodes are turned on automatically when the first
job of the morning is submitted (idle nodes are powered off to save energy).
My initial hunch was that some code in Postgres resulted in exponential
execution time if enough LISTENs on new connections happened at the same
time.  As I was trying to gather more information the LISTEN times began to
decrease, and after about 20 minutes things were back to normal.

A few hours later the symptoms returned, but this time the LISTEN was taking
upwards of 15 minutes.  I did some more reading, checked the pg_notify
directory, and found that there were 49 files.  Having never checked that
directory before, I wasn't sure whether that was normal.  A short time later I
noticed a process sitting idle with an aborted transaction.  After killing
that process, things quickly returned to normal.

After doing a search for "idle in transaction (aborted)" I came upon 
http://stackoverflow.com/questions/15036438/postgres-connection-leaks-idle-in-transaction-aborted 

While this is a potential solution for dealing with the problem, it seems that
the PostgreSQL developers have decided to let connections stay in the "idle in
transaction (aborted)" state for a reason, most likely under the assumption
that it's relatively safe and only ties up the resources of a single
connection.  However, it's easy to demonstrate that doing:

listen "abc";
begin;
select bad_column from non_existant_table;

...will eventually cause a denial-of-service situation if the DBA hasn't set
up guards against connections sitting idle in an aborted transaction.
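
For example, a DBA could run something like the following from cron.  This is
only a minimal libpq sketch of one possible guard; the five-minute threshold
and the empty conninfo (which relies on the PG* environment variables) are
arbitrary choices for illustration:

/* guard_idle_aborted.c: terminate backends stuck idle in an aborted
 * transaction for more than five minutes.  Sketch only. */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn   *conn = PQconnectdb("");   /* use PG* environment defaults */
    PGresult *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* pg_stat_activity's "state" column exists in 9.2 and later. */
    res = PQexec(conn,
                 "SELECT pg_terminate_backend(pid) "
                 "FROM pg_stat_activity "
                 "WHERE state = 'idle in transaction (aborted)' "
                 "  AND state_change < now() - interval '5 minutes'");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));

    PQclear(res);
    PQfinish(conn);
    return 0;
}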

Looking at the code in src/backend/commands/async.c, I think there are a
couple of ways to eliminate this problem.

1. When a connection issues its first LISTEN command, Exec_ListenPreCommit
does:

QUEUE_BACKEND_POS(MyBackendId) = QUEUE_TAIL;

This causes the connection to iterate through every notification queued in
the SLRU, even though at that point I believe the connection can safely
ignore any notifications from transactions that are already committed, and if
I understand correctly notifications aren't put into the SLRU until
precommit.  So wouldn't it be safe to do:

QUEUE_BACKEND_POS(MyBackendId) = QUEUE_HEAD;

inside Async_Listen?  If that's not safe, could a new member be added to
AsyncQueueControl that points to the first uncommitted QueuePosition (it
wouldn't have to be kept completely up to date)?

This would solve the problem of slow initial LISTEN in the face of a growing 
pg_notify queue.
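
To make the second variant concrete, here is a minimal sketch of what such a
member might look like.  The firstUncommitted field and this simplified
layout are invented for illustration; they are not the actual async.c
definitions:

/* Hypothetical sketch of idea #1; not actual async.c code. */
typedef struct QueuePosition
{
    int page;       /* SLRU page number */
    int offset;     /* byte offset within that page */
} QueuePosition;

typedef struct AsyncQueueControl
{
    QueuePosition head;             /* next entry to be written */
    QueuePosition tail;             /* oldest entry still needed */
    QueuePosition firstUncommitted; /* proposed: oldest entry whose
                                     * transaction may still be open;
                                     * need not be kept exactly current */
    /* ... per-backend read positions and other fields elided ... */
} AsyncQueueControl;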

2. Would it be possible, when a backend is signaled, to check whether it is
idle inside an aborted transaction, and if so process the notifications and
put any that match what the backend is listening on into a local list?  This
would allow the SLRU to be cleaned up.  In practice I think most
notifications would either be disregarded or combined with a duplicate, so
the list would most likely stay very small.  If the backend-local list does
grow too large, the connection could be killed or handled in some other
appropriate way.
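
A rough sketch of how that signal-time drain could look.  Everything here is
a simplified, invented stand-in; none of these helpers exist in async.c:

/* Hypothetical sketch of idea #2; types and helpers are invented. */
#include <stdbool.h>
#include <stddef.h>

typedef struct PendingNotify
{
    const char           *channel;  /* channel name */
    const char           *payload;  /* notification payload */
    struct PendingNotify *next;     /* link for the backend-local list */
} PendingNotify;

static PendingNotify *local_pending = NULL;     /* backend-local list */

/* Invented stand-ins for the real checks and queue access. */
extern bool idle_in_aborted_transaction(void);
extern PendingNotify *read_next_queue_entry(void);
extern bool is_listening_on(const char *channel);

/* On signal: if idle in an aborted transaction, drain the shared queue
 * into local memory so our read position (and thus the SLRU tail) can
 * advance.  Non-matching channels are simply discarded. */
static void
absorb_notifies_while_aborted(void)
{
    PendingNotify *n;

    if (!idle_in_aborted_transaction())
        return;

    while ((n = read_next_queue_entry()) != NULL)
    {
        if (is_listening_on(n->channel))
        {
            n->next = local_pending;
            local_pending = n;
        }
        /* duplicates could be collapsed here to bound local memory */
    }
}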

I am happy to attempt coming up with a patch if the ideas are deemed 
worthwhile.

Thanks,
Matt Newell






Re: LISTEN denial of service with aborted transaction

From: Tom Lane
Date: Monday, September 28, 2015 07:22:26 PM

Matt Newell <newellm@blur.com> writes:
> 1. When a connection issues its first LISTEN command, in Exec_ListenPreCommit
> QUEUE_BACKEND_POS(MyBackendId) = QUEUE_TAIL;
> this causes the connection to iterate through every notify queued in the slru, 
> even though at that point I believe the connection can safely ignore any 
> notifications from transactions that are already committed, and if I 
> understand correctly notifications aren't put into the slru until precommit,  
> so wouldn't it be safe to do:
> QUEUE_BACKEND_POS(MyBackendId) = QUEUE_HEAD;
> inside Async_Listen?

Per the comment in Exec_ListenPreCommit, the goal here is to see entries
that have been made and not yet committed.  If you don't do it like this,
then a new listener has the possibility of receiving notifications from
one transaction, while not seeing notifications from another one that
committed later, even though longer-established listeners saw outputs from
both transactions.  We felt that that sort of application-visible
inconsistency would be a bad thing.

> If that's not safe, then could a new member be added to 
> AsyncQueueControl that points to the first uncommitted QueuePosition (wouldn't 
> have to be kept completely up to date).

Err ... isn't that QUEUE_TAIL already?  I've not studied this code in
detail recently, though.

> 2. Would it be possible when a backend is signaled to check if it is idle 
> inside an aborted transaction, and if so process the notifications and put any 
> that match what the backend is listening on into a local list.

Possibly.  sinval catchup notification works like that, so you could maybe
invent a similar mechanism for notify.  Beware that we've had to fix
performance issues with sinval catchup; you don't want to release a whole
bunch of processes to do work at the same time.
        regards, tom lane



Re: LISTEN denial of service with aborted transaction

From: Matt Newell

On Monday, September 28, 2015 07:22:26 PM Tom Lane wrote:
> Matt Newell <newellm@blur.com> writes:
> > 1. When a connection issues its first LISTEN command, in
> > Exec_ListenPreCommit QUEUE_BACKEND_POS(MyBackendId) = QUEUE_TAIL;
> > this causes the connection to iterate through every notify queued in the
> > slru, even though at that point I believe the connection can safely
> > ignore any notifications from transactions that are already committed,
> > and if I understand correctly notifications aren't put into the slru
> > until precommit, so wouldn't it be safe to do:
> > QUEUE_BACKEND_POS(MyBackendId) = QUEUE_HEAD;
> > inside Async_Listen?
>
> Per the comment in Exec_ListenPreCommit, the goal here is to see entries
> that have been made and not yet committed.  If you don't do it like this,
> then a new listener has the possibility of receiving notifications from
> one transaction, while not seeing notifications from another one that
> committed later, even though longer-established listeners saw outputs from
> both transactions.  We felt that that sort of application-visible
> inconsistency would be a bad thing.
>
Right,  QUEUE_HEAD can't be used, however...

> > If that's not safe, then could a new member be added to
> > AsyncQueueControl that points to the first uncommitted QueuePosition
> > (wouldn't have to be kept completely up to date).
>
> Err ... isn't that QUEUE_TAIL already?  I've not studied this code in
> detail recently, though.
>
QUEUE_TAIL is the oldest notification.  My idea is to keep a somewhat
up-to-date pointer to the oldest uncommitted notification, which can be used
as a starting point when a connection issues its first LISTEN, just as the
current code advances a backend's pointer past any committed notifications
when it calls asyncQueueReadAllNotifications in Exec_ListenPreCommit with no
registered listeners.

I came up with a proof of concept and it appears to work as expected.  With
500,000 notifications queued, a LISTEN is consistently under 0.5 ms, while my
9.4.4 install takes 70 ms, and the speedup should be much greater on a busy
server where there is more lock contention.

Attached is the patch and the ugly test script.


> > 2. Would it be possible when a backend is signaled to check if it is idle
> > inside an aborted transaction, and if so process the notifications and put
> > any that match what the backend is listening on into a local list.
>
> Possibly.  sinval catchup notification works like that, so you could maybe
> invent a similar mechanism for notify.  Beware that we've had to fix
> performance issues with sinval catchup; you don't want to release a whole
> bunch of processes to do work at the same time.

I'll have to think about this more, but I don't envision it causing such a
scenario.

The extra processing that will happen at signaling time would have been done
anyway when the aborted transaction is finally rolled back; the only extra
work is copying the relevant notifications to local memory.

Regardless, it might not be worthwhile, since my patch for #1 seems to address
the symptom that bit me.

Matt Newell

Attachment

Re: LISTEN denial of service with aborted transaction

From: Tom Lane
Date: Tuesday, September 29, 2015 09:58:35 PM

Matt Newell <newellm@blur.com> writes:
> On Monday, September 28, 2015 07:22:26 PM Tom Lane wrote:
>> Possibly.  sinval catchup notification works like that, so you could maybe
>> invent a similar mechanism for notify.  Beware that we've had to fix
>> performance issues with sinval catchup; you don't want to release a whole
>> bunch of processes to do work at the same time.

> I'll have to think about this more, but I don't envision it causing such a
> scenario.
> The extra processing that will happen at signaling time would have been done 
> anyway when the aborted transaction is finally rolled back; the only extra 
> work is copying the relevant notifications to local memory. 

Hm.  An idle-in-transaction listening backend will eventually cause a more
serious form of denial of service: once the pg_notify SLRU area fills up
to its maximum of 8GB, no more new messages can be inserted, and every
transaction that tries to do a NOTIFY will abort.  However, that's a
documented hazard and we've not really had complaints about it.  In any
case, trying to fix that by having such a backend absorb messages into
local memory doesn't seem like a great idea: if it sits for long enough,
its local memory usage will bloat.  Eventually you'd have a failure
anyway.

> Regardless it might not be worthwhile since my patch for #1 seems to address 
> the symptom that bit me.

I looked at this patch for a bit and found it kind of ugly, particularly
the business about "we allow one process at a time to advance
firstUncommitted by using advancingFirstUncommitted as a mutex".
That doesn't sound terribly safe or performant.

After further thought, I wonder whether instead of having an incoming
listener initialize its "pos" to QUEUE_TAIL, we couldn't have it
initialize to the maximum "pos" among existing listeners (with a floor of
QUEUE_TAIL of course).  If any other listener has advanced over a given
message X, then X was committed (or aborted) earlier and the new listener
is not required to return it; so it should be all right to take that
listener's "pos" as a starting point.

The minimum-code-change way to do that would be to compute the max pos
the hard way while Exec_ListenPreCommit holds the AsyncQueueLock, which
it would now have to do in exclusive mode.  That's a little bit annoying
but it's surely not much worse than what SignalBackends has to do after
every notification.
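
For concreteness, a self-contained sketch of that brute-force computation,
using simplified stand-ins for async.c's types; note that the real code must
also handle SLRU page wraparound, which is glossed over here:

/* Sketch only: compute a new listener's starting position as the
 * maximum existing listener position, floored at the queue tail.
 * The caller would hold AsyncQueueLock in exclusive mode. */
typedef struct { int page; int offset; } QueuePosition;
typedef struct { int pid; QueuePosition pos; } ListenerSlot;

/* Does a precede b?  Ignores page wraparound, unlike the real code. */
static int
pos_before(QueuePosition a, QueuePosition b)
{
    return a.page < b.page ||
           (a.page == b.page && a.offset < b.offset);
}

static QueuePosition
initial_listen_pos(QueuePosition tail, const ListenerSlot *slot, int nslots)
{
    QueuePosition max = tail;           /* floor of QUEUE_TAIL */

    for (int i = 0; i < nslots; i++)
    {
        /* pid == 0 marks an unused slot in this sketch */
        if (slot[i].pid != 0 && pos_before(max, slot[i].pos))
            max = slot[i].pos;          /* that listener read past max */
    }
    return max;
}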

Alternatively, we could try to maintain a shared pointer that is always
equal to the frontmost backend's "pos".  But I think that would require
more locking during queue reading than exists now; so it's far from clear
it would be a win, especially in systems where listening sessions last for
a reasonable amount of time (so that Exec_ListenPreCommit is infrequent).
        regards, tom lane



Re: LISTEN denial of service with aborted transaction

From: Matt Newell

On Tuesday, September 29, 2015 09:58:35 PM Tom Lane wrote:
> Matt Newell <newellm@blur.com> writes:
> > On Monday, September 28, 2015 07:22:26 PM Tom Lane wrote:
> >> Possibly.  sinval catchup notification works like that, so you could
> >> maybe
> >> invent a similar mechanism for notify.  Beware that we've had to fix
> >> performance issues with sinval catchup; you don't want to release a whole
> >> bunch of processes to do work at the same time.
> > 
> > I'll have to think about this more, but I don't envision it causing such a
> > scenario.
> > The extra processing that will happen at signaling time would have been
> > done anyway when the aborted transaction is finally rolled back, the only
> > extra work is copying the relevant notifications to local memory.
> 
> Hm.  An idle-in-transaction listening backend will eventually cause a more
> serious form of denial of service: once the pg_notify SLRU area fills up
> to its maximum of 8GB, no more new messages can be inserted, and every
> transaction that tries to do a NOTIFY will abort.  However, that's a
> documented hazard and we've not really had complaints about it.  In any
> case, trying to fix that by having such a backend absorb messages into
> local memory doesn't seem like a great idea: if it sits for long enough,
> its local memory usage will bloat.  Eventually you'd have a failure
> anyway.
> 
> > Regardless it might not be worthwhile since my patch for #1 seems to
> > address the symptom that bit me.
> 
> I looked at this patch for a bit and found it kind of ugly, particularly
> the business about "we allow one process at a time to advance
> firstUncommitted by using advancingFirstUncommitted as a mutex".
> That doesn't sound terribly safe or performant.
> 
It can be implemented without that, but then you'll have multiple backends 
trying to do the same work.  That might not be an issue at all, but I couldn't 
tell at first glance what the potential cost of the extra calls to 
TransactionIdIsInProgress would be.  Since no extra locks are taken, I figured 
ensuring the work is only done once is good for performance.

If the cluster has only one database generating notifies, then there is 
practically no extra work with the patch.

As far as safety is concerned I think the worst case is that behavior returns 
to what it is now, where firstUncommitted ends up tracking QUEUE_TAIL.

I originally used a boolean, then realized I could get some extra safety by 
using the backendId, so that the mutex is released automatically if the 
backend dies.

> After further thought, I wonder whether instead of having an incoming
> listener initialize its "pos" to QUEUE_TAIL, we couldn't have it
> initialize to the maximum "pos" among existing listeners (with a floor of
> QUEUE_TAIL of course).  If any other listener has advanced over a given
> message X, then X was committed (or aborted) earlier and the new listener
> is not required to return it; so it should be all right to take that
> listener's "pos" as a starting point.

That's a great idea and I will try it.  It does need to be the maximum 
position of existing listeners *with matching db oid*, since 
asyncQueueReadAllNotifications doesn't check transaction status if the db oid 
doesn't match its own.  Another advantage of this approach is that a queued 
notify from an open transaction in one database won't incur additional cost on 
listens coming from other databases, whereas with my patch such a situation 
will prevent firstUncommitted from advancing.

> 
> The minimum-code-change way to do that would be to compute the max pos
> the hard way while Exec_ListenPreCommit holds the AsyncQueueLock, which
> it would now have to do in exclusive mode.  That's a little bit annoying
> but it's surely not much worse than what SignalBackends has to do after
> every notification.
Yeah, I think it's going to be a win regardless.  We could also skip the whole 
thing if QUEUE_HEAD - QUEUE_TAIL is under a fixed amount, which would probably 
cover a lot of use cases.

> 
> Alternatively, we could try to maintain a shared pointer that is always
> equal to the frontmost backend's "pos".  But I think that would require
> more locking during queue reading than exists now; so it's far from clear
> it would be a win, especially in systems where listening sessions last for
> a reasonable amount of time (so that Exec_ListenPreCommit is infrequent).
And that would be even more complicated since it has to be per db oid.

Matt Newell



Re: LISTEN denial of service with aborted transaction

From: Matt Newell

>
> > After further thought, I wonder whether instead of having an incoming
> > listener initialize its "pos" to QUEUE_TAIL, we couldn't have it
> > initialize to the maximum "pos" among existing listeners (with a floor of
> > QUEUE_TAIL of course).  If any other listener has advanced over a given
> > message X, then X was committed (or aborted) earlier and the new listener
> > is not required to return it; so it should be all right to take that
> > listener's "pos" as a starting point.
>
> That's a great idea and I will try it.  It does need to be the maximum
> position of existing listeners *with matching db oid*, since
> asyncQueueReadAllNotifications doesn't check transaction status if the db
> oid doesn't match its own.  Another advantage of this approach is that a
> queued notify from an open transaction in one database won't incur
> additional cost on listens coming from other databases, whereas with my
> patch such a situation will prevent firstUncommitted from advancing.
>


The attached patch works great.  I added the dboid to the QueueBackendStatus
struct, but that might not be needed if there is an easy and fast way to get a
db oid from a backend pid.

I also skip the max pos calculation loop if QUEUE_HEAD is on the same page as
QUEUE_TAIL, with the thought that it's not desirable to increase the time that
AsyncQueueLock is held if the queue is small and
asyncQueueReadAllNotifications will execute quickly.
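
In sketch form, the new struct field looks roughly like this (simplified
stand-ins again, not the committed definitions); the max-position scan is as
sketched earlier, restricted to slots whose dboid matches the new listener's
database, and skipped entirely when QUEUE_HEAD and QUEUE_TAIL share a page:

/* Sketch only: QueueBackendStatus gains the backend's database OID so a
 * new listener inherits positions only from listeners whose
 * notifications get the same transaction-status treatment. */
typedef struct { int page; int offset; } QueuePosition;

typedef struct QueueBackendStatus
{
    int           pid;      /* 0 in this sketch when the slot is unused */
    unsigned int  dboid;    /* database OID -- the field the patch adds */
    QueuePosition pos;      /* next queue entry this backend will read */
} QueueBackendStatus;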

I then noticed that we take the same lock twice in a row to read the same
values, first in Exec_ListenPreCommit and then in
asyncQueueReadAllNotifications, and much of the time the latter will simply
return because pos will already be at QUEUE_HEAD.  I prepared a second patch
that splits asyncQueueReadAllNotifications; Exec_ListenPreCommit then calls
the worker version only when needed.  This avoids the duplicate lock.

Thanks,
Matt Newell
Attachment

Re: LISTEN denial of service with aborted transaction

From: Tom Lane

Matt Newell <newellm@blur.com> writes:
> Patch attached works great.  I added the dboid to the QueueBackendStatus 
> struct but that might not be needed if there is an easy and fast way to get a 
> db oid from a backend pid.

Not particularly; I agree with adding it to this data structure.  One
reason is it makes the array stride a power of 2, so array accesses may be
a shade cheaper than before.

> I also skip the max pos calculation loop if QUEUE_HEAD is on the same page as 
> QUEUE_TAIL, with the thought that it's not desirable to increase the time that 
> AsyncQueueLock is held if the queue is small and 
> asyncQueueReadAllNotifications will execute quickly.

Check.

There were a couple of minor bugs in this, but I fixed them and committed
it.

> I then noticed that we are taking the same lock twice in a row to read the 
> same values, first in Exec_ListenPreCommit then in 
> asyncQueueReadAllNotifications, and much of the time the latter will simply 
> return because pos will already be at QUEUE_HEAD.  I prepared a second patch
> that splits asyncQueueReadAllNotifications; Exec_ListenPreCommit then calls 
> the worker version only when needed.  This avoids the duplicate lock.

I was unimpressed with this; the refactoring of asyncQueueReadAllNotifications
seemed messy and definitely was undocumented.  I think the main actual
value is to skip doing anything during Exec_ListenPreCommit when
max == head, so I extracted that part and included it.
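
In sketch form (positions simplified as in the earlier sketches; not the
committed code), the surviving piece amounts to:

/* After computing the new listener's start position "max" under the
 * lock, skip reading entirely when nothing lies before the head. */
if (max.page != head.page || max.offset != head.offset)
    asyncQueueReadAllNotifications();
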
        regards, tom lane