Re: Escaping a blocked sendto() syscall without causing a restart - Mailing list pgsql-admin

From Tom Lane
Subject Re: Escaping a blocked sendto() syscall without causing a restart
Date
Msg-id 17914.1358458720@sss.pgh.pa.us
In response to Escaping a blocked sendto() syscall without causing a restart  (Jerry Sievers <gsievers19@comcast.net>)
List pgsql-admin
Jerry Sievers <gsievers19@comcast.net> writes:
> Does anyone know if one of the signals below can be sent to break out
> of this state *without* the postmaster sensing a crashed backend?

> I've seen several times in the past at other companies, backends that
> will not respond to cancel nor SIGTERM due to syscall that's blocked
> on IO.

> Quite often though apparently the backend would notice the broken
> socket eventually and receive the signals and exit cleanly.

> I've got one that's been wedged like that for a couple days now.

> I recall trying several in a similar situation a while ago and of
> course one of them interrupted the syscall all right but it was an
> abort and we got the customary spontaneous postmaster restart.

Offhand it looks to me like most signals would kick the backend off the
send() call ... but it would loop right back and try again.  See
internal_flush() in pqcomm.c.  (If you're using SSL, this diagnosis
may or may not apply.)
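
For illustration, the retry pattern described above looks roughly like the
sketch below.  This is not the code in pqcomm.c; flush_all() is a
hypothetical helper, and it assumes a blocking socket and signal handlers
installed without SA_RESTART, so that an interrupted send() returns EINTR.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper, not PostgreSQL source: push len bytes to sock,
 * retrying whenever send() is interrupted by a signal.
 * Returns 0 on success, -1 on a hard error (errno is left set).
 */
static int
flush_all(int sock, const char *buf, size_t len)
{
    size_t      sent = 0;

    assert(buf != NULL || len == 0);

    while (sent < len)
    {
        ssize_t     r = send(sock, buf + sent, len - sent, 0);

        if (r < 0)
        {
            if (errno == EINTR)
                continue;       /* a signal arrived; loop back and try again */
            return -1;          /* real failure, e.g. EPIPE or ECONNRESET */
        }
        sent += (size_t) r;
    }
    return 0;
}
```

The point of the sketch is the "continue": a signal only knocks the process
out of send() for a moment, and the loop puts it straight back in.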

We can't do anything except repeat the send attempt if the client
connection is to be kept in a sane state.  It's possible that if the
interrupt was a SIGTERM (forced exit) we could mark the connection dead
and return early, but it would probably take some thought and
experimentation to get useful behavior that way.  And I'm not at all
sure if we could get it to work in SSL mode ...

So the short answer is no, you probably can't kill the session without
causing a restart.  Possibly we should add a TODO to make this better.

What you might consider instead, if this is a recurring problem, is
adjusting the postmaster-side TCP keepalive parameters so that dead
connections are noticed more quickly.  The default connection timeout
according to the TCP standards is on the order of hours, but you can
reduce that quite a lot if your network environment is at all reliable.
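
On the server side these are the tcp_keepalives_idle,
tcp_keepalives_interval and tcp_keepalives_count settings.  At the socket
level they correspond to something like the sketch below.  It assumes
Linux option names (other platforms spell them differently);
enable_keepalive() and the numbers in the usage note are made up for
illustration, not recommendations.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: turn on keepalive probing for a TCP socket.
 *   idle     - seconds of silence before the first probe is sent
 *   interval - seconds between successive probes
 *   count    - unanswered probes tolerated before the connection is dropped
 * TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are the Linux option names.
 * Returns 0 on success, -1 on failure (errno is left set).
 */
static int
enable_keepalive(int sock, int idle, int interval, int count)
{
    int         on = 1;

    assert(idle > 0 && interval > 0 && count > 0);

    if (setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0)
        return -1;
    if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval)) < 0)
        return -1;
    if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) < 0)
        return -1;
    return 0;
}
```

With, say, enable_keepalive(sock, 60, 10, 5), a peer that has vanished
would be noticed after roughly 60 + 10 * 5 = 110 seconds of silence.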

(But it's not clear to me why your stuck-for-a-couple-days case wouldn't
have timed out long since.  Are you sure this isn't a client-side
problem, i.e. the client is wedged?  If so, why not kill the client instead?)

            regards, tom lane

