Re: (Never?) Kill Postmaster? - Mailing list pgsql-general

From Tom Lane
Subject Re: (Never?) Kill Postmaster?
Date
Msg-id 5188.1194810530@sss.pgh.pa.us
In response to Re: (Never?) Kill Postmaster?  (Christian Schröder <cs@deriva.de>)
Responses Re: (Never?) Kill Postmaster?  (Christian Schröder <cs@deriva.de>)
List pgsql-general
Christian Schröder <cs@deriva.de> writes:
> (gdb) bt
> #0  0x00002b24aeee0a68 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
> #1  0x00002b24aeedde88 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
> #2  0x00002b24a5814e23 in _nl_find_msg () from /lib64/libc.so.6
> #3  0x00002b24a5815c83 in __dcigettext () from /lib64/libc.so.6
> #4  0x00002b24a585df0b in strerror_r () from /lib64/libc.so.6
> #5  0x00002b24a585dd33 in strerror () from /lib64/libc.so.6
> #6  0x00000000005f4daa in expand_fmt_string ()
> #7  0x00000000005f6d14 in errmsg ()
> #8  0x00000000005185f3 in pq_recvbuf ()
> #9  0x0000000000518987 in pq_getbyte ()
> #10 0x000000000057eb69 in PostgresMain ()
> #11 0x0000000000558218 in ServerLoop ()
> #12 0x0000000000558db8 in PostmasterMain ()
> #13 0x000000000051a213 in main ()

> Seems to be the same as for the processes that were stuck inside of a
> statement.

Well, the top of the stack is the same, but this is useful anyway
because it shows that an I/O error on the input side can trigger the
problem as well as one on the output side.  We're still left wondering
how a thread mutex down inside strerror() could be left in a "stuck"
state, when the process doesn't appear to contain more than one thread.

> I recompiled the server with debugging symbols enabled and then did the
> following experiment: I started a query which I knew would take some
> time. While the query executed I disconnected my dial-up line. After
> reconnecting the backend process was still there (still SELECTing).
> Meanwhile the query is finished and the process is idle, but it's still
> present.

That is probably not the same situation because (assuming the query
didn't produce a lot of output) the kernel does not yet think that the
network connection is lost irretrievably.  You'd have to wait for the
TCP timeout interval to elapse, at which point the kernel would report
the connection lost (EPIPE or ECONNRESET error) and we'd enter the code
path shown above.
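
As a rough illustration of the kernel-side behaviour described above (not
PostgreSQL code), the following Python sketch shows the error a process sees
once the kernel already knows the peer is gone. Everything in it is an
assumption made for the demonstration: a local socketpair stands in for the
client connection, the function name write_after_peer_close is hypothetical,
and a Unix-like platform is assumed. It exercises the output side (send);
with a dropped dial-up line the same kind of error would appear only after
the TCP timeout has expired.

```python
import errno
import os
import socket


def write_after_peer_close() -> int:
    """Return the errno reported when sending to a socket whose peer is gone."""
    # A connected pair of local stream sockets stands in for the
    # client/backend connection (assumption: Unix-like platform).
    backend_side, client_side = socket.socketpair()
    try:
        client_side.close()  # the "client" disappears
        try:
            backend_side.send(b"row data")
        except OSError as exc:
            return exc.errno  # the kernel's verdict on the connection
        return 0  # no error reported
    finally:
        backend_side.close()


if __name__ == "__main__":
    code = write_after_peer_close()
    # os.strerror() wraps the C library's strerror(), the call seen in
    # frame #5 of the backtrace, to turn the errno into a message.
    print(errno.errorcode.get(code, code), os.strerror(code))
```

The point of the sketch is only the ordering: the error is reported by the
kernel first, and the message lookup through strerror() happens afterwards,
which is where the quoted backtrace is stuck.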

One thing I'm suddenly thinking might be related: didn't you mention
that you have some process that goes around and SIGINT's backends that
it thinks are running too long?  I'm wondering if a SIGINT event is a
necessary component of producing the problem ...
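
One general way a signal can make a single-threaded process block on its own
lock is a handler that interrupts code holding a non-reentrant lock and then
asks for that same lock. Whether that is what happened here is not
established in this message; the Python sketch below only shows the general
mechanism. Its names (lock, on_sigint, interrupt_while_locked) are
hypothetical, threading.Lock merely stands in for the C library's internal
lock, and Python runs handlers between bytecode instructions rather than
truly asynchronously as C does.

```python
import signal
import threading
import time

lock = threading.Lock()   # non-reentrant, standing in for a plain mutex
handler_got_lock = []     # results recorded by the handler


def on_sigint(signum, frame):
    # Ask for the lock that the interrupted code already holds.
    # A blocking acquire here would never return; the timeout exists
    # only so that the demonstration terminates.
    got = lock.acquire(timeout=0.2)
    handler_got_lock.append(got)
    if got:
        lock.release()


def interrupt_while_locked() -> bool:
    """Return True if the handler obtained the lock, False if it was stuck."""
    handler_got_lock.clear()
    previous = signal.signal(signal.SIGINT, on_sigint)  # main thread only
    try:
        with lock:
            signal.raise_signal(signal.SIGINT)  # requires Python 3.8+
            # Stay inside the locked region until the handler has run.
            deadline = time.monotonic() + 2.0
            while not handler_got_lock and time.monotonic() < deadline:
                time.sleep(0.01)
    finally:
        signal.signal(signal.SIGINT, previous)
    return handler_got_lock[-1]


if __name__ == "__main__":
    print("handler obtained the lock:", interrupt_while_locked())
```

Without the timeout the handler would wait forever for a lock that only the
interrupted code can release, and that code cannot resume until the handler
returns.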

            regards, tom lane
