Re: [INTERFACES] timeouts in libpq- can libpq requests block forever/a very long time? - Mailing list pgsql-interfaces

From Dirk Niggemann
Subject Re: [INTERFACES] timeouts in libpq- can libpq requests block forever/a very long time?
Date
Msg-id 37F6B68B.5ABC566F@dircon.co.uk
In response to Re: [INTERFACES] timeouts in libpq- can libpq requests block forever/a very long time?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-interfaces
Tom Lane wrote:
> 
> Dirk Niggemann <dirk-n@dircon.co.uk> writes:
> > Would there be any problems with changing pqWait so it uses a timeout?
Actually, I've just gone away and implemented it- silly, silly me.
The timeout can be completely disabled and is settable from PQconnect
and from some new environment variables (PGTIMEOUT and PGCONNTIMEOUT).
It remains off by default.
> 
> I think this would be rather risky: if I issue a query that the backend
> takes a long time to compute, I don't want libpq timing out on me,
> *especially* not if its response is to close the connection rather than
> some fail-soft recovery technique.
What kind of fail-soft could be implemented here?
I suppose I could be nice & stick a PQrequestCancel in there, but I'll
have to make sure it doesn't frazzle any other bit of the system. Hold
that thought: PQrequestCancel is _designed_ to be called asynchronously
to all the other requests, so it's unlikely to cause any major problems
(have I missed something?).
That doesn't, however, deal with the main concern about the timeout:
there is no 100% reliable way to tell the difference between "this
query is running a long time without returning results" and "the
network is down" or "the backend is hung".
The elegant solution would be to implement some sort of
backend-ping protocol. The problem with this is that the backend needs
to be prepared to service it even while it's doing other things. Not
the sort of thing the backend was designed to do.
BTW- as libpq doesn't set SO_KEEPALIVE, queries that run for > 15 minutes
w/o any data returned will disconnect anyway (provided things aren't too
confused).

From my POV, the purpose behind this is not to deal with
long-computation queries. It's to deal with unattended automatic
applications that run on remote sites over flaky networks and splat
relatively small sets of inserts (<150 rows, consisting of 30-40
integers max) into a table in a central DB.
Now if I really wanted to do this right, I would build some sort of
reliable queuing middleware... except all that costs, well, real money,
and this is only an experimental project.
The applications do these uploads at regular intervals. The applications
have to recover and perform the inserts at a later stage if the network
is not working properly. 
I expect the network not to be working properly.
The applications run 'unattended' and require no user interaction.
>  If the timeout is
> application-settable and defaults to infinity, it might be OK.
Yes, after having a closer look at PQrequestCancel I sort of see where
the problem lies.
libpq makes some effort at not closing the backend socket. 
That's why I left the original behaviour in place.
Can you describe to me what the backend does on socket close?
I should really grovel the code myself, but my main concern here is:
a) I guess the backend simply aborts the current query/transaction.
Nasty if you were running an enormous query that will take a long time
to run; otherwise not too catastrophic.
b) I'm concerned the backend might do some blocking somewhere w/o a
timeout- that is, can closing a socket at a bad time cause the backend to get
stuck?
c) can a failed network connection cause the backend to get stuck in a
wait/blocking I/O? 
Having hung backends lying around could be a serious problem, as that
would reduce the total number of backends available (max backends does not change
dynamically).  
d) Am I misunderstanding here, or is PostgreSQL 'single-threaded' with
respect to requests over a single client connection? libpq certainly
gives that impression. Would shoving an empty query at the backend
while it is returning rows to you confuse it very badly?
e) Just like PQrequestCancel, could you (me?) build a sort of
PQrequestPing to confirm that nothing is reachable or functioning at
the remote end before libpq closes the socket?

> 
> In my own applications using the asynchronous query features, I have not
> seen any actual hangups that could be improved by a timeout ... but I've
> mostly been working across LAN rather than WAN connections...
What hangs did you see?

The kind of problem I'm talking about is the sort where routers go and
disappear on you...
after you've opened the connection...
or someone chops a cable on a remote site...
all fun stuff...
try this sometime:
a) open connection to a remote lan over a single router (with no
backup/standby)...
b) start shoving data over it.
c) shoot router by preferred method (ax, fire, power plug, admin typo,
dripping aircon (my favourite)).
d) see how long your socket connection takes to work out the network's
gone away...

> 
> > I would also like to add a timeout for the connect operation on the
> > socket-
> > on a sufficiently frazzled WAN this could block for a long time-
> 
> This might be useful, but again I wonder whether libpq ought to be
> second-guessing the transport system about how long is reasonable to
> wait for a response.
done as well- second-guessing the transport system is a good idea
sometimes- the default TCP/IP timeouts are all rather long.
Adjusting them generally means fiddling with kernel parameters (might
break kernel, might break some other apps).
I certainly have seen problems on Solaris connecting to DEC (Oracle)
Rdb over a flaky WAN: the application would occasionally 'stick' for at
least 8-10 minutes when we simulated the problem by pulling some
cables. In reality it would regularly stick for 1/2 hour.
(Routers went up & down and let packets drip through on occasion, which
made both ends think the network was just extremely slow rather than
broken.)
Oracle's connection timeout solution is particularly insidious (I have
this by word of mouth only): it seems to fork children to asynchronously
handle the requests that time out. It crashed a machine when 30
concurrent client apps all suddenly lost connection to the same
database.

apologies for seeming a bit argumentative on this point...
BTW- just realized another problem:
if we get a timeout, we can't know for certain whether a commit
happened or not...
eek...
example (client actions on the left, backend on the right):

normal flow:
COMMIT ->
                        -> COMMIT executes
                        <- RESPONSE
<- RESPONSE

error flow:
COMMIT ->
                        -> COMMIT fails
                        <- ERROR
<- ERROR

timeout flow (ok):
COMMIT ->
timeout
close
                        -> close seen; backend ABORTs
(transaction assumed aborted, and really was)

timeout flow (bad):
COMMIT ->
                        -> COMMIT executes
                        <- RESPONSE
timeout
close
                        -> close
(transaction assumed aborted but has actually committed)

I can't seem to come up with the worst-case scenario
(e.g. a false positive on commit)...
maybe it's not possible...


or do we have a three-way handshake for commits?

> 
>                         regards, tom lane

Thanks, 
Dirk

