Thread: PQcancel may hang in the recv call

PQcancel may hang in the recv call

From

Peter Juhasz

Date:

19 May 2016, 18:41:15

Hi all,

this is somewhat involved so please bear with me.

We've found a situation where canceling a query may cause the client to
hang, possibly indefinitely. This can happen if the network connection
fails in a specific way.

The reason for this lies in the way the PQcancel function (which
eventually gets called from the higher level interface's cancel
function) is implemented. It works by opening a second connection to
the postmaster (on the same host/port as the existing connection),
send()-ing a cancellation message via the newly opened connection, then
calling recv() to receive an indication that the message was processed.

However, if the network fails in a way that the connection appears to
have been established but subsequent packets are dropped silently,
this recv() call will block.
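
To make the failure mode concrete, here is a minimal Python sketch of the
pattern described above. It is not libpq's code. It connects, sends a
16-byte CancelRequest, then waits in recv() for the peer to close the
connection. The packet layout (length 16, request code 80877102, backend
PID, secret key) follows the PostgreSQL frontend/backend protocol. The
function names, the silent listener standing in for the broken network, and
the optional timeout are assumptions made for illustration.

```python
import socket
import struct

CANCEL_REQUEST_CODE = 80877102  # (1234 << 16) | 5678 in the wire protocol


def build_cancel_request(backend_pid, secret_key):
    """Pack a CancelRequest: length, request code, backend PID, secret key."""
    return struct.pack("!IIII", 16, CANCEL_REQUEST_CODE, backend_pid, secret_key)


def send_cancel(host, port, backend_pid, secret_key, timeout=None):
    """Send a cancel request and wait for the peer to close the socket.

    timeout=None reproduces the unbounded wait discussed in this thread;
    a number of seconds bounds connect(), send() and recv().
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_cancel_request(backend_pid, secret_key))
        try:
            data = sock.recv(1)  # b"" means the peer closed the connection
        except socket.timeout:
            return "timeout"
        return "closed" if data == b"" else "unexpected data"


def make_silent_listener():
    """Hypothetical stand-in for the failing network path: the kernel
    completes the TCP handshake, but nothing ever reads or replies."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    return srv
```

Calling send_cancel() against the port of make_silent_listener() with
timeout=0.5 gives up after half a second; with timeout=None the same call
stays in recv(), which is the hang reported here.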

My questions:

Is this known?
Is this a bug?
What can be done to fix or work around it, apart from applying a
timeout wrapper to the cancel operation as well?


The attached example program attempts to demonstrate the effect.
It simulates network outage by routing data through a local TCP proxy
that stops forwarding packets at a given point. It's written in Perl
for convenience, but the problem is not in the Perl part: running it
with strace will clearly show that it hangs at the recv() call in
fe-connect.c:internal_cancel().
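
For readers who do not want to open the attachment, the proxy idea can be
sketched in a few lines of Python. This is not the attached Perl program,
only an illustration of the same technique: a local TCP proxy that forwards
bytes in both directions until told to drop them, while keeping the
connections open. The class name DroppingProxy and its drop flag are
hypothetical.

```python
import socket
import threading


class DroppingProxy:
    """Forward traffic to an upstream address until `drop` is set."""

    def __init__(self, upstream_addr):
        self.upstream_addr = upstream_addr
        self.drop = threading.Event()  # call drop.set() to discard traffic
        self.listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.listener.bind(("127.0.0.1", 0))
        self.listener.listen(5)
        self.port = self.listener.getsockname()[1]
        threading.Thread(target=self._accept_loop, daemon=True).start()

    def _accept_loop(self):
        while True:
            try:
                client, _ = self.listener.accept()
            except OSError:
                return  # listener was closed
            upstream = socket.create_connection(self.upstream_addr)
            # One pump thread per direction.
            for src, dst in ((client, upstream), (upstream, client)):
                threading.Thread(
                    target=self._pump, args=(src, dst), daemon=True
                ).start()

    def _pump(self, src, dst):
        while True:
            try:
                data = src.recv(4096)
            except OSError:
                return
            if not data:
                # Propagate EOF only while forwarding; when dropping, the
                # other side sees nothing at all, not even a close.
                if not self.drop.is_set():
                    try:
                        dst.shutdown(socket.SHUT_WR)
                    except OSError:
                        pass
                return
            if self.drop.is_set():
                continue  # silently discard, keep both sockets open
            try:
                dst.sendall(data)
            except OSError:
                return
```

A client connects to 127.0.0.1 on proxy.port instead of the real server;
after proxy.drop.set() its sends still succeed locally but nothing comes
back, which is the condition under which the cancel hangs.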

The program assumes that you have a PostgreSQL server listening on port
5432 on localhost, and you can log in to a database called 'postgres'
with user 'postgres' (but edit either your local postgresql settings or
the connection string in the program if it doesn't work).

Run without any command line options, the program simulates a
long-running query with pg_sleep(), and prints the result, which should be
'ok'. This should take about 3 seconds.

Run with the -c option, it cancels the query after one second.

Run with the -d option, it instructs the proxy to drop packets, so the
main program will never receive the result and times out after 6
seconds.

With both the -c and -d options it drops packets, then attempts to
cancel, and this is where it gets interesting: it hangs for 60
seconds. 

With options -c -d -a 1, it doesn't allow the second connection to go
through, in which case it hangs (seemingly) forever.

The -v option can be added to print debug messages.

(For those unfamiliar with Perl, the program works by forking twice:
after the first fork the child process starts the proxy, then after the
second fork the parent process proceeds with the database connection,
while the second child sends signals to the proxy or the main process,
depending on the command line settings.)


Best regards,
Péter Juhász

Attachment

pg_cancel_bug.pl

Re: PQcancel may hang in the recv call

From

"David G. Johnston"

Date:

19 May 2016, 19:13:40

On Thu, May 19, 2016 at 10:37 AM, Peter Juhasz <pjuhasz@uhusystems.com> wrote:

> Hi all,
>
> this is somewhat involved so please bear with me.
>
> We've found a situation where canceling a query may cause the client to
> hang, possibly indefinitely. This can happen if the network connection
> fails in a specific way.
>
> The reason for this lies in the way the PQcancel function (which
> eventually gets called from the higher level interface's cancel
> function) is implemented. It works by opening a second connection to
> the postmaster (on the same host/port as the existing connection),
> send()-ing a cancellation message via the newly opened connection, then
> calling recv() to receive an indication that the message was processed.
>
> However, if the network fails in a way that the connection appears to
> have been established but subsequent packets are dropped silently,
> this recv() call will block.
>
> My questions:
>
> Is this known?
> Is this a bug?
> What can be done to fix or work around it, apart from applying a
> timeout wrapper to the cancel operation as well?

It does sound familiar. Providing the version number(s) on which you encountered this behavior would be helpful. Or HEAD if you have or are testing against current code.

David J.

Re: PQcancel may hang in the recv call

From

Tom Lane

Date:

19 May 2016, 19:32:15

Peter Juhasz <pjuhasz@uhusystems.com> writes:
> We've found a situation where canceling a query may cause the client to
> hang, possibly indefinitely. This can happen if the network connection
> fails in a specific way.
> ...
> However, if the network fails in a way that the connection appears to
> have been established but subsequent packets are dropped silently,
> this recv() call will block.

Hmm.  I would expect the recv to eventually fail based on TCP timeouts,
but I agree that that would be much longer than you'd typically wish
to wait.

> Is this known?

I do not recall anyone ever reporting something similar --- and that code
has been like that for a long time.

> Is this a bug?

I wouldn't call it that exactly.  There might be an opportunity for
improvement here, but it's not very clear what.  Just introducing a
timeout would likely create more problems than it fixes, considering the
evident rarity of the problem.  The race condition hazard that the recv()
is trying to prevent is definitely real: we used to not have that, and
we got bug reports, cf
http://www.postgresql.org/message-id/flat/20030915070801.GD23844@opencloud.com

            regards, tom lane

Re: PQcancel may hang in the recv call

From

"David G. Johnston"

Date:

19 May 2016, 19:39:35

On Thu, May 19, 2016 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Peter Juhasz <pjuhasz@uhusystems.com> writes:
>
>> Is this known?
>
> I do not recall anyone ever reporting something similar --- and that code
> has been like that for a long time.

I'd take Tom's word over mine :)

David J.

Re: PQcancel may hang in the recv call

From

Tom Lane

Date:

19 May 2016, 19:53:14

"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Thu, May 19, 2016 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I do not recall anyone ever reporting something similar --- and that code
>> has been like that for a long time.

> I'd take Tom's word over mine :)

Well, my memory is often faulty ;-).  But I did trawl the PG archives
for a bit, and didn't find anything quite like this.  There are complaints
about PQcancel not working if the network is down, but no reports that it
hangs, as far as I can find.

            regards, tom lane

Re: PQcancel may hang in the recv call

From

Peter Juhasz

Date:

20 May 2016, 21:11:57

On Thu, 2016-05-19 at 15:32 -0400, Tom Lane wrote:
> Peter Juhasz <pjuhasz@uhusystems.com> writes:
> >
> > We've found a situation where canceling a query may cause the
> > client to
> > hang, possibly indefinitely. This can happen if the network
> > connection
> > fails in a specific way.
> > ...
> > However, if the network fails in a way that the connection appears
> > to
> > have been established but subsequent packets are dropped silently,
> > this recv() call will block.
> Hmm.  I would expect the recv to eventually fail based on TCP
> timeouts,
> but I agree that that would be much longer than you'd typically wish
> to wait.
>

If the connection goes through, the recv call does return after 60
seconds (on Linux, where I'm trying this).

The problem is that in our home-grown framework we want to use cancel
to bail out of queries that have already run for too long. At that
point we've already waited long enough, and we don't want to wait even
longer.

The situation is even worse in an asynchronous, event-driven
application: in that case we must not block at all. Yet, with the
problem I've described, cancellation blocks just like in the
synchronous case, rendering the entire application unresponsive for
that period.
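
What an event-driven application would need instead is a cancel that never
blocks the loop. The following is a minimal Python asyncio sketch of that
idea, not something libpq or DBD::Pg provides. It assumes the application
already knows the backend PID and cancel key, and the function name
cancel_with_deadline and its deadline parameter are hypothetical. The whole
exchange (connect, send, wait for close) runs as a coroutine under a single
deadline, so other events keep being served while it waits.

```python
import asyncio
import struct

CANCEL_REQUEST_CODE = 80877102  # CancelRequest code in the wire protocol


async def cancel_with_deadline(host, port, backend_pid, secret_key, deadline):
    """Return True if the server closed the cancel connection in time,
    False if the attempt failed or the deadline expired."""

    async def attempt():
        reader, writer = await asyncio.open_connection(host, port)
        try:
            writer.write(
                struct.pack("!IIII", 16, CANCEL_REQUEST_CODE,
                            backend_pid, secret_key)
            )
            await writer.drain()
            await reader.read(1)  # completes with b"" once the peer closes
            return True
        finally:
            writer.close()

    try:
        return await asyncio.wait_for(attempt(), timeout=deadline)
    except (asyncio.TimeoutError, OSError):
        return False
```

A False result only says that no confirmation arrived before the deadline;
the request may or may not have reached the server, so the caller still has
to decide what to do with the original connection.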

(It's actually even worse than that, because DBD::Pg's support for
asynchronous operation is half-finished at best: their pg_cancel
function wants to read back the confirmation of the cancellation with
PQgetResult, which blocks indefinitely if the network connection has
failed in the way I've described.)

> >
> > Is this known?
> I do not recall anyone ever reporting something similar --- and that
> code
> has been like that for a long time.

I did forget to mention that I've observed this behavior with
PostgreSQL 9.5.3 and 9.4.8, but I don't think the actual version
matters much, because as you say, that part of the code has not changed
recently.

I find it strange that nobody has reported similar problems, though.
Does everyone else have perfect network connections that never drop
packets and never introduce random delays?

>
> >
> > Is this a bug?
> I wouldn't call it that exactly.  There might be an opportunity for
> improvement here, but it's not very clear what.  Just introducing a
> timeout would likely create more problems than it fixes, considering
> the
> evident rarity of the problem.  

In our framework we had to resort to exactly this, but we also mark the
connection as unreliable and unusable if even the cancellation times out.
The point is
that the application must remain responsive, and even in case of a
complete network failure (between the app server and the database) we
must be able to signal this state to the user.
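
The approach can be sketched roughly as follows. This is a simplified
Python illustration, not our framework's actual code: GuardedConnection,
cancel_fn and the usable flag are hypothetical names. The blocking cancel
runs in a daemon thread; if it does not finish within the timeout, the
caller stops waiting and the connection is marked unusable.

```python
import threading


class GuardedConnection:
    """Hypothetical wrapper that tracks whether a connection is trusted."""

    def __init__(self, cancel_fn):
        self.cancel_fn = cancel_fn  # blocking callable, e.g. a driver's cancel
        self.usable = True

    def cancel(self, timeout):
        """Run the blocking cancel in a daemon thread and wait at most
        `timeout` seconds. Returns True only if the cancel reported success
        in time; a timeout also marks the connection unusable."""
        done = threading.Event()
        result = {}

        def worker():
            try:
                result["ok"] = bool(self.cancel_fn())
            except Exception:
                result["ok"] = False
            finally:
                done.set()

        threading.Thread(target=worker, daemon=True).start()
        if not done.wait(timeout):
            # The worker may still be stuck in recv(); stop trusting the
            # connection and let the caller report the outage.
            self.usable = False
            return False
        return result["ok"]
```

The stuck thread is abandoned rather than interrupted, so this keeps the
application responsive but does not release the blocked call itself.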

Best regards,
Péter Juhász

PS. and now for something completely different: the menu on
http://yum.postgresql.org/ seems to be broken, the last two items are
wrapped around into a second line.