Thread: LISTEN denial of service with aborted transaction
This morning with our production database I began receiving reports of the
database being "down". I checked the log and was surprised to see extremely
long durations for a LISTEN that happens after each connection is made by
our database library. This coincided with many (approx. 600) new connections
happening in a short time window due to render nodes automatically being
turned on when the first job of the morning was submitted (idle nodes are
turned off to save power). My initial hunch was that there was some code in
postgres that resulted in exponential execution time if enough LISTENs on
new connections happened at the same time. As I was trying to gather more
information the LISTEN times began to decrease, and after about 20 minutes
things were back to normal.

A few hours later the symptoms returned, but this time the LISTEN was taking
upwards of 15 minutes. I did some more reading and checked the pg_notify
directory and found that there were 49 files. Having never checked that
directory before, I wasn't sure if that was normal. A short time later I
noticed there was a process sitting idle with an aborted transaction. After
killing the process, things quickly returned to normal.

After doing a search for "idle in transaction (aborted)" I came upon
http://stackoverflow.com/questions/15036438/postgres-connection-leaks-idle-in-transaction-aborted
While that is a potential solution for dealing with the problem, it seems
that the PostgreSQL developers have decided to let connections stay in the
"idle in transaction (aborted)" state for a reason, most likely under the
assumption that it's relatively safe and only eats up the resources of a
single connection. However, it's easy to demonstrate that doing:

    listen "abc";
    begin;
    select bad_column from non_existant_table;

...will eventually cause a denial-of-service situation if the DBA hasn't
set up guards against connections sitting idle in an aborted transaction.

Looking at the code in src/backend/commands/async.c I think there are a
couple of ways to eliminate this problem.

1. When a connection issues its first LISTEN command, in Exec_ListenPreCommit
   the line

       QUEUE_BACKEND_POS(MyBackendId) = QUEUE_TAIL;

   causes the connection to iterate through every notify queued in the slru,
   even though at that point I believe the connection can safely ignore any
   notifications from transactions that are already committed, and if I
   understand correctly notifications aren't put into the slru until
   precommit, so wouldn't it be safe to do:

       QUEUE_BACKEND_POS(MyBackendId) = QUEUE_HEAD;

   inside Async_Listen? If that's not safe, then could a new member be added
   to AsyncQueueControl that points to the first uncommitted QueuePosition
   (it wouldn't have to be kept completely up to date)? This would solve the
   problem of slow initial LISTEN in the face of a growing pg_notify queue.

2. Would it be possible, when a backend is signaled, to check if it is idle
   inside an aborted transaction, and if so process the notifications and
   put any that match what the backend is listening on into a local list?
   This would allow the slru to be cleaned up. In practice I think most
   notifications would either be disregarded or combined with a duplicate,
   so the list would most likely end up staying very small. If the
   backend-local list does grow too large then the connection could be
   killed or handled in some other appropriate way.

I am happy to attempt coming up with a patch if the ideas are deemed
worthwhile.

Thanks,
Matt Newell
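As an aside, a guard of the sort described above can be as simple as a
query run periodically (from cron, say) that terminates sessions stuck in
an aborted transaction. This is not from the thread; the five-minute
threshold is an arbitrary choice, and later PostgreSQL releases also grew
an idle_in_transaction_session_timeout setting, though whether it covers
the aborted state in a given version is worth verifying.

    -- Terminate any session that has been sitting in an aborted
    -- transaction for more than five minutes (threshold is arbitrary).
    SELECT pid,
           now() - state_change AS stuck_for,
           pg_terminate_backend(pid) AS terminated
    FROM pg_stat_activity
    WHERE state = 'idle in transaction (aborted)'
      AND state_change < now() - interval '5 minutes';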
Matt Newell <newellm@blur.com> writes:
> 1. When a connection issues its first LISTEN command, in Exec_ListenPreCommit
> the line
>     QUEUE_BACKEND_POS(MyBackendId) = QUEUE_TAIL;
> causes the connection to iterate through every notify queued in the slru,
> even though at that point I believe the connection can safely ignore any
> notifications from transactions that are already committed, and if I
> understand correctly notifications aren't put into the slru until precommit,
> so wouldn't it be safe to do:
>     QUEUE_BACKEND_POS(MyBackendId) = QUEUE_HEAD;
> inside Async_Listen?

Per the comment in Exec_ListenPreCommit, the goal here is to see entries
that have been made and not yet committed. If you don't do it like this,
then a new listener has the possibility of receiving notifications from
one transaction, while not seeing notifications from another one that
committed later, even though longer-established listeners saw outputs
from both transactions. We felt that that sort of application-visible
inconsistency would be a bad thing.

> If that's not safe, then could a new member be added to
> AsyncQueueControl that points to the first uncommitted QueuePosition
> (it wouldn't have to be kept completely up to date)?

Err ... isn't that QUEUE_TAIL already? I've not studied this code in
detail recently, though.

> 2. Would it be possible, when a backend is signaled, to check if it is idle
> inside an aborted transaction, and if so process the notifications and put
> any that match what the backend is listening on into a local list?

Possibly. sinval catchup notification works like that, so you could maybe
invent a similar mechanism for notify. Beware that we've had to fix
performance issues with sinval catchup; you don't want to release a whole
bunch of processes to do work at the same time.

			regards, tom lane
On Monday, September 28, 2015 07:22:26 PM Tom Lane wrote:
> Matt Newell <newellm@blur.com> writes:
> > 1. When a connection issues its first LISTEN command, in
> > Exec_ListenPreCommit the line
> >     QUEUE_BACKEND_POS(MyBackendId) = QUEUE_TAIL;
> > causes the connection to iterate through every notify queued in the
> > slru, even though at that point I believe the connection can safely
> > ignore any notifications from transactions that are already committed,
> > and if I understand correctly notifications aren't put into the slru
> > until precommit, so wouldn't it be safe to do:
> >     QUEUE_BACKEND_POS(MyBackendId) = QUEUE_HEAD;
> > inside Async_Listen?
>
> Per the comment in Exec_ListenPreCommit, the goal here is to see entries
> that have been made and not yet committed. If you don't do it like this,
> then a new listener has the possibility of receiving notifications from
> one transaction, while not seeing notifications from another one that
> committed later, even though longer-established listeners saw outputs
> from both transactions. We felt that that sort of application-visible
> inconsistency would be a bad thing.

Right, QUEUE_HEAD can't be used, however...

> > If that's not safe, then could a new member be added to
> > AsyncQueueControl that points to the first uncommitted QueuePosition
> > (it wouldn't have to be kept completely up to date)?
>
> Err ... isn't that QUEUE_TAIL already? I've not studied this code in
> detail recently, though.

QUEUE_TAIL is the oldest notification. My idea is to keep a somewhat
up-to-date pointer to the oldest uncommitted notification, which can be
used as a starting point when a connection issues its first LISTEN, just
as the current code will advance a backend's pointer past any committed
notifications when it calls asyncQueueReadAllNotifications in
Exec_ListenPreCommit with no registered listeners.

I came up with a proof of concept and it appears to work as expected.
With 500,000 notifications queued, a LISTEN is consistently under 0.5ms,
while my 9.4.4 install is taking 70ms, and the speedup should be much
greater on a busy server where there is more lock contention. Attached
is the patch and the ugly test script.

> > 2. Would it be possible, when a backend is signaled, to check if it is
> > idle inside an aborted transaction, and if so process the notifications
> > and put any that match what the backend is listening on into a local
> > list?
>
> Possibly. sinval catchup notification works like that, so you could maybe
> invent a similar mechanism for notify. Beware that we've had to fix
> performance issues with sinval catchup; you don't want to release a whole
> bunch of processes to do work at the same time.

I'll have to think about this more, but I don't envision it causing such
a scenario. The extra processing that would happen at signaling time
would have been done anyway when the aborted transaction is finally
rolled back; the only extra work is copying the relevant notifications
to local memory. Regardless, it might not be worthwhile since my patch
for #1 seems to address the symptom that bit me.

Matt Newell
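As an aside, the effect being measured here can be reproduced roughly from
three psql sessions, independently of the attached script. This is only a
sketch: the channel name, payload, and count are arbitrary, and \gexec
requires psql 9.6 or later.

    -- Session A: a listener that ends up "idle in transaction (aborted)",
    -- which keeps the notification queue tail from advancing.
    LISTEN ch;
    BEGIN;
    SELECT bad_column FROM non_existant_table;  -- error; leave session open

    -- Session B (autocommit): queue many committed notifications, one
    -- transaction per NOTIFY. Adjust the count to taste; it takes a while.
    SELECT format('NOTIFY ch, ''payload %s''', i)
    FROM generate_series(1, 500000) AS i \gexec

    -- Session C: time how long a brand-new LISTEN takes; this is the
    -- duration that the patch is intended to shrink.
    \timing on
    LISTEN ch;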
Matt Newell <newellm@blur.com> writes:
> On Monday, September 28, 2015 07:22:26 PM Tom Lane wrote:
>> Possibly. sinval catchup notification works like that, so you could maybe
>> invent a similar mechanism for notify. Beware that we've had to fix
>> performance issues with sinval catchup; you don't want to release a whole
>> bunch of processes to do work at the same time.

> I'll have to think about this more, but I don't envision it causing such
> a scenario. The extra processing that would happen at signaling time
> would have been done anyway when the aborted transaction is finally
> rolled back; the only extra work is copying the relevant notifications
> to local memory.

Hm. An idle-in-transaction listening backend will eventually cause a more
serious form of denial of service: once the pg_notify SLRU area fills up
to its maximum of 8GB, no more new messages can be inserted, and every
transaction that tries to do a NOTIFY will abort. However, that's a
documented hazard and we've not really had complaints about it. In any
case, trying to fix that by having such a backend absorb messages into
local memory doesn't seem like a great idea: if it sits for long enough,
its local memory usage will bloat. Eventually you'd have a failure
anyway.

> Regardless, it might not be worthwhile since my patch for #1 seems to
> address the symptom that bit me.

I looked at this patch for a bit and found it kind of ugly, particularly
the business about "we allow one process at a time to advance
firstUncommitted by using advancingFirstUncommitted as a mutex".
That doesn't sound terribly safe or performant.

After further thought, I wonder whether instead of having an incoming
listener initialize its "pos" to QUEUE_TAIL, we couldn't have it
initialize to the maximum "pos" among existing listeners (with a floor of
QUEUE_TAIL of course). If any other listener has advanced over a given
message X, then X was committed (or aborted) earlier and the new listener
is not required to return it; so it should be all right to take that
listener's "pos" as a starting point.

The minimum-code-change way to do that would be to compute the max pos
the hard way while Exec_ListenPreCommit holds the AsyncQueueLock, which
it would now have to do in exclusive mode. That's a little bit annoying
but it's surely not much worse than what SignalBackends has to do after
every notification.

Alternatively, we could try to maintain a shared pointer that is always
equal to the frontmost backend's "pos". But I think that would require
more locking during queue reading than exists now; so it's far from clear
it would be a win, especially in systems where listening sessions last
for a reasonable amount of time (so that Exec_ListenPreCommit is
infrequent).

			regards, tom lane
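As an aside, the fill-level hazard described here can be watched from SQL:
pg_notification_queue_usage() (added in 9.5, if memory serves) reports the
fraction of the notification queue currently in use, so a steadily climbing
value is a hint that some listener has stopped reading, for example a
session stuck "idle in transaction (aborted)". The 0.5 threshold below is
an arbitrary choice, not anything from the thread.

    -- Fraction of the NOTIFY queue in use, plus an arbitrary alert flag.
    SELECT pg_notification_queue_usage() AS queue_fraction_used,
           pg_notification_queue_usage() > 0.5 AS needs_attention;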
On Tuesday, September 29, 2015 09:58:35 PM Tom Lane wrote:
> Matt Newell <newellm@blur.com> writes:
> > On Monday, September 28, 2015 07:22:26 PM Tom Lane wrote:
> >> Possibly. sinval catchup notification works like that, so you could
> >> maybe invent a similar mechanism for notify. Beware that we've had to
> >> fix performance issues with sinval catchup; you don't want to release
> >> a whole bunch of processes to do work at the same time.
> >
> > I'll have to think about this more, but I don't envision it causing
> > such a scenario. The extra processing that would happen at signaling
> > time would have been done anyway when the aborted transaction is
> > finally rolled back; the only extra work is copying the relevant
> > notifications to local memory.
>
> Hm. An idle-in-transaction listening backend will eventually cause a more
> serious form of denial of service: once the pg_notify SLRU area fills up
> to its maximum of 8GB, no more new messages can be inserted, and every
> transaction that tries to do a NOTIFY will abort. However, that's a
> documented hazard and we've not really had complaints about it. In any
> case, trying to fix that by having such a backend absorb messages into
> local memory doesn't seem like a great idea: if it sits for long enough,
> its local memory usage will bloat. Eventually you'd have a failure
> anyway.
>
> > Regardless, it might not be worthwhile since my patch for #1 seems to
> > address the symptom that bit me.
>
> I looked at this patch for a bit and found it kind of ugly, particularly
> the business about "we allow one process at a time to advance
> firstUncommitted by using advancingFirstUncommitted as a mutex".
> That doesn't sound terribly safe or performant.

It can be implemented without that, but then you'll have multiple backends
trying to do the same work. That might not be an issue at all, but I
couldn't tell at first glance what the potential cost of the extra calls
to TransactionIdIsInProgress would be. Since there are no extra locks
taken, I figured ensuring the work is only done once is good for
performance. If the cluster only has one database generating notifies,
there is practically no extra work with the patch.

As far as safety is concerned, I think the worst case is that behavior
returns to what it is now, where firstUncommitted ends up tracking
QUEUE_TAIL. I originally used a boolean, then realized I could get some
extra safety by using the backendId so that the mutex would be released
automatically if the backend dies.

> After further thought, I wonder whether instead of having an incoming
> listener initialize its "pos" to QUEUE_TAIL, we couldn't have it
> initialize to the maximum "pos" among existing listeners (with a floor of
> QUEUE_TAIL of course). If any other listener has advanced over a given
> message X, then X was committed (or aborted) earlier and the new listener
> is not required to return it; so it should be all right to take that
> listener's "pos" as a starting point.

That's a great idea and I will try it. It does need to be the maximum
position of existing listeners *with matching db oid*, since
asyncQueueReadAllNotifications doesn't check transaction status if the db
oid doesn't match its own. Another advantage of this approach is that a
queued notify from an open transaction in one database won't incur
additional cost on listens coming from other databases, whereas with my
patch such a situation will prevent firstUncommitted from advancing.

> The minimum-code-change way to do that would be to compute the max pos
> the hard way while Exec_ListenPreCommit holds the AsyncQueueLock, which
> it would now have to do in exclusive mode. That's a little bit annoying
> but it's surely not much worse than what SignalBackends has to do after
> every notification.

Yeah, I think it's going to be a win regardless. Could also skip the whole
thing if QUEUE_HEAD - QUEUE_TAIL is under a fixed amount, which would
probably cover a lot of use cases.

> Alternatively, we could try to maintain a shared pointer that is always
> equal to the frontmost backend's "pos". But I think that would require
> more locking during queue reading than exists now; so it's far from clear
> it would be a win, especially in systems where listening sessions last
> for a reasonable amount of time (so that Exec_ListenPreCommit is
> infrequent).

And that would be even more complicated since it has to be per db oid.

Matt Newell
> > After further thought, I wonder whether instead of having an incoming
> > listener initialize its "pos" to QUEUE_TAIL, we couldn't have it
> > initialize to the maximum "pos" among existing listeners (with a floor
> > of QUEUE_TAIL of course). If any other listener has advanced over a
> > given message X, then X was committed (or aborted) earlier and the new
> > listener is not required to return it; so it should be all right to
> > take that listener's "pos" as a starting point.
>
> That's a great idea and I will try it. It does need to be the maximum
> position of existing listeners *with matching db oid*, since
> asyncQueueReadAllNotifications doesn't check transaction status if the db
> oid doesn't match its own. Another advantage of this approach is that a
> queued notify from an open transaction in one database won't incur
> additional cost on listens coming from other databases, whereas with my
> patch such a situation will prevent firstUncommitted from advancing.

Patch attached; it works great. I added the dboid to the QueueBackendStatus
struct, but that might not be needed if there is an easy and fast way to
get a db oid from a backend pid.

I also skip the max pos calculation loop if QUEUE_HEAD is on the same page
as QUEUE_TAIL, with the thought that it's not desirable to increase the
time that AsyncQueueLock is held if the queue is small and
asyncQueueReadAllNotifications will execute quickly.

I then noticed that we are taking the same lock twice in a row to read the
same values, first in Exec_ListenPreCommit then in
asyncQueueReadAllNotifications, and much of the time the latter will simply
return because pos will already be at QUEUE_HEAD. I prepared a second patch
that splits asyncQueueReadAllNotifications; Exec_ListenPreCommit then calls
the worker version only when needed. This avoids the duplicate lock.

Thanks,
Matt Newell
Matt Newell <newellm@blur.com> writes:
> Patch attached; it works great. I added the dboid to the
> QueueBackendStatus struct, but that might not be needed if there is an
> easy and fast way to get a db oid from a backend pid.

Not particularly; I agree with adding it to this data structure. One
reason is it makes the array stride a power of 2, so array accesses may
be a shade cheaper than before.

> I also skip the max pos calculation loop if QUEUE_HEAD is on the same
> page as QUEUE_TAIL, with the thought that it's not desirable to increase
> the time that AsyncQueueLock is held if the queue is small and
> asyncQueueReadAllNotifications will execute quickly.

Check. There were a couple of minor bugs in this, but I fixed them and
committed it.

> I then noticed that we are taking the same lock twice in a row to read
> the same values, first in Exec_ListenPreCommit then in
> asyncQueueReadAllNotifications, and much of the time the latter will
> simply return because pos will already be at QUEUE_HEAD. I prepared a
> second patch that splits asyncQueueReadAllNotifications;
> Exec_ListenPreCommit then calls the worker version only when needed.
> This avoids the duplicate lock.

I was unimpressed with this; the refactoring of
asyncQueueReadAllNotifications seemed messy and definitely was
undocumented. I think the main actual value is to skip doing anything
during Exec_ListenPreCommit when max == head, so I extracted that part
and included it.

			regards, tom lane