PQcancel may hang in the recv call - Mailing list pgsql-general

From Peter Juhasz
Subject PQcancel may hang in the recv call
Date
Msg-id 1463668668.3489.41.camel@uhusystems.com
Responses Re: PQcancel may hang in the recv call  ("David G. Johnston" <david.g.johnston@gmail.com>)
Re: PQcancel may hang in the recv call  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
Hi all,

this is somewhat involved so please bear with me.

We've found a situation where canceling a query may cause the client to
hang, possibly indefinitely. This can happen if the network connection
fails in a specific way.

The reason for this lies in the way the PQcancel function (which
eventually gets called from the higher level interface's cancel
function) is implemented. It works by opening a second connection to
the postmaster (on the same host/port as the existing connection),
send()-ing a cancellation message via the newly opened connection, then
calling recv() to receive an indication that the message was processed.
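
For readers unfamiliar with the wire protocol, the sequence above can be sketched roughly as follows. This is Python rather than libpq's C, and the function names are made up for illustration. The 16-byte CancelRequest layout (length, request code 80877102, backend PID, secret key) is the one given in the PostgreSQL frontend/backend protocol documentation. The final recv() deliberately has no timeout, to mirror the behaviour discussed here.

```python
import socket
import struct

# Request code from the protocol docs: 1234 in the high 16 bits, 5678 in the low.
CANCEL_REQUEST_CODE = 80877102


def build_cancel_request(backend_pid: int, secret_key: int) -> bytes:
    """Pack a CancelRequest: length (16), request code, PID, secret key."""
    return struct.pack(
        "!IIII",
        16,
        CANCEL_REQUEST_CODE,
        backend_pid & 0xFFFFFFFF,
        secret_key & 0xFFFFFFFF,
    )


def send_cancel(host: str, port: int, backend_pid: int, secret_key: int) -> None:
    """Hypothetical stand-in for the steps PQcancel performs (not libpq code)."""
    # Step 1: open a second connection to the same host/port.
    with socket.create_connection((host, port)) as sock:
        # Step 2: send the cancellation message.
        sock.sendall(build_cancel_request(backend_pid, secret_key))
        # Step 3: wait for the server to respond by closing the connection.
        # No timeout is set, so this call can block if packets are dropped.
        sock.recv(1)
```

The backend PID and secret key are the values the server hands to the client at connection startup; the sketch assumes the caller already has them.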

However, if the network fails in a way that the connection appears to
have been established but subsequent packets are dropped silently,
this recv() call will block.
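
The effect can be reproduced without PostgreSQL at all. The sketch below is an illustration only (the function name is invented): it connects to a local listener that completes the TCP handshake but never sends anything, which is roughly what the client sees when packets are dropped after the connection is set up. A socket timeout is used here purely so that the demonstration terminates.

```python
import socket


def recv_from_silent_peer(timeout_s: float) -> str:
    """Connect to a local listener that never replies, then try to recv()."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # any free port
    listener.listen(1)                # the kernel completes the handshake
    client = socket.create_connection(listener.getsockname())
    client.settimeout(timeout_s)      # without this, recv() would block
    try:
        client.sendall(b"ping")       # succeeds: data is queued at the peer
        client.recv(1)                # nothing ever arrives
        return "got data"
    except socket.timeout:
        return "timed out"
    finally:
        client.close()
        listener.close()


print(recv_from_silent_peer(0.5))
```

Note that the send succeeds even though nobody reads the data, so the sender gets no hint that anything is wrong until it waits for a reply.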

My questions:

Is this known?
Is this a bug?
What can be done to fix or work around it, apart from applying a
timeout wrapper to the cancel operation as well?
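
A generic timeout wrapper of the kind mentioned in the last question might look like the sketch below. It is an assumption about how such a wrapper could be built in Python, not what the attached program does, and call_with_timeout is a hypothetical helper name. The blocked call is abandoned rather than interrupted, so the underlying socket stays open until the process exits.

```python
import threading


def call_with_timeout(func, timeout_s, *args):
    """Run func(*args) in a daemon thread and wait at most timeout_s seconds.

    Returns (True, result) if func finished in time, (False, None) otherwise.
    An exception raised by func is re-raised in the caller.
    """
    outcome = {}

    def worker():
        try:
            outcome["value"] = func(*args)
        except Exception as exc:      # hand the error back to the caller
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        return False, None            # still blocked: give up waiting
    if "error" in outcome:
        raise outcome["error"]
    return True, outcome["value"]
```

A caller would pass its driver's cancel function as func and treat a (False, None) result as "cancel did not complete".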


The attached example program attempts to demonstrate the effect.
It simulates network outage by routing data through a local TCP proxy
that stops forwarding packets at a given point. It's written in Perl
for convenience, but the problem is not in the Perl part: running it
with strace will clearly show that it hangs at the recv() call in
fe-connect.c:internal_cancel().
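
The idea of a proxy that stops forwarding at a given point can be sketched in a few lines. The following is not the attached Perl program but a minimal Python illustration of the same technique; the class name DroppingProxy and its attributes are invented for this example. Setting the dropping event makes the proxy keep all sockets open while discarding whatever it reads.

```python
import socket
import threading


class DroppingProxy:
    """Forward TCP traffic to an upstream address until told to drop it."""

    def __init__(self, upstream):
        self.upstream = upstream              # (host, port) of the real server
        self.dropping = threading.Event()     # set() to start discarding data
        self.listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.listener.bind(("127.0.0.1", 0))  # any free local port
        self.listener.listen(5)
        self.address = self.listener.getsockname()
        threading.Thread(target=self._accept_loop, daemon=True).start()

    def _accept_loop(self):
        while True:
            try:
                client, _ = self.listener.accept()
            except OSError:
                return                        # listener was closed
            try:
                server = socket.create_connection(self.upstream)
            except OSError:
                client.close()
                continue
            # One pump thread per direction.
            for src, dst in ((client, server), (server, client)):
                threading.Thread(
                    target=self._pump, args=(src, dst), daemon=True
                ).start()

    def _pump(self, src, dst):
        while True:
            try:
                data = src.recv(4096)
            except OSError:
                return
            if not data:
                if not self.dropping.is_set():
                    try:
                        dst.shutdown(socket.SHUT_WR)  # pass the close along
                    except OSError:
                        pass
                return
            if self.dropping.is_set():
                continue                      # discard, keep sockets open
            try:
                dst.sendall(data)
            except OSError:
                return
```

To use it, point the proxy at the real server, for example DroppingProxy(("127.0.0.1", 5432)), connect the client to proxy.address, and call proxy.dropping.set() at the moment the outage should begin.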

The program assumes that you have a PostgreSQL server listening on port
5432 on localhost, and that you can log in to a database called
'postgres' with user 'postgres' (if that doesn't work, edit either your
local PostgreSQL settings or the connection string in the program).

Run without any command line options, the program simulates a
long-running query with pg_sleep() and prints the result, which should
be 'ok'. This should take about 3 seconds.

Run with the -c option, it cancels the query after one second.

Run with the -d option, it instructs the proxy to drop packets, so the
main program never receives the result and times out after 6
seconds.

With both the -c and -d options it drops packets, then attempts to
cancel, and this is where it gets interesting: it hangs for 60
seconds. 

With options -c -d -a 1, it doesn't allow the second connection to go
through, in which case it hangs (seemingly) forever.

The -v option can be added to print debug messages.

(For those unfamiliar with Perl, the program works by forking twice:
after the first fork the child process starts the proxy, then after the
second fork the parent process proceeds with the database connection,
while the second child sends signals to the proxy or the main process,
depending on the command line settings.)


Best regards,
Péter Juhász

Attachment
