Re: (Never?) Kill Postmaster? - Mailing list pgsql-general

From Christian Schröder
Subject Re: (Never?) Kill Postmaster?
Date
Msg-id 47339573.5050202@deriva.de
Whole thread Raw
In response to Re: (Never?) Kill Postmaster?  (Alvaro Herrera <alvherre@commandprompt.com>)
Responses Re: (Never?) Kill Postmaster?  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: (Never?) Kill Postmaster?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
Hi all,
any news about this issue? Anything else that I can do to help you?
Meanwhile there are 4 connections in the same state. (I did not do the
whole investigation on all 4, but since they all do not respond on a
SIGINT I assume that they all have the same problem.)
It may also be interesting that the 4 processes belong to the same two
users that already caused this problem earlier. Maybe it really has
something to do with their unstable internet connection? (I mentioned
this in an earlier mail.)
I have also noticed that one of these two users has many open
connections which are all idle. I guess that those connections are in
fact dead, but the database did not close them for any reason. The
pg_stat_activity entries for the corresponding backend processes are as
follows:

 procpid | usename | current_query | waiting |
query_start          |         backend_start
---------+---------+---------------+---------+-------------------------------+-------------------------------
   26033 | dpyrek  | <IDLE>        | f       | 2007-11-08
10:21:01.555232+01 | 2007-11-08 09:55:01.59932+01
   18331 | dpyrek  | <IDLE>        | f       | 2007-11-07
11:34:24.968852+01 | 2007-11-07 11:08:29.043762+01
   18940 | dpyrek  | <IDLE>        | f       | 2007-11-07
14:29:52.987176+01 | 2007-11-07 13:14:48.609031+01
   25868 | dpyrek  | <IDLE>        | f       | 2007-11-08
09:47:46.938991+01 | 2007-11-08 09:13:34.101351+01
    6719 | dpyrek  | <IDLE>        | f       | 2007-11-06
12:06:14.875588+01 | 2007-11-06 11:10:00.566644+01
   17987 | dpyrek  | <IDLE>        | f       | 2007-11-07
10:31:50.517275+01 | 2007-11-07 10:11:07.310338+01
   31808 | dpyrek  | <IDLE>        | f       | 2007-11-08
22:55:03.931727+01 | 2007-11-08 22:55:03.766638+01
   25484 | dpyrek  | <IDLE>        | f       | 2007-11-08
08:32:57.265377+01 | 2007-11-08 07:44:30.845967+01
    5972 | dpyrek  | <IDLE>        | f       | 2007-11-06
08:51:54.57437+01  | 2007-11-06 08:14:03.560602+01
    6241 | dpyrek  | <IDLE>        | f       | 2007-11-06
09:59:02.018452+01 | 2007-11-06 09:20:49.092246+01
    6136 | dpyrek  | <IDLE>        | f       | 2007-11-06
09:14:40.729837+01 | 2007-11-06 08:57:29.55187+01
   12645 | dpyrek  | <IDLE>        | f       | 2007-11-02
10:08:24.856929+01 | 2007-11-02 09:35:37.640976+01
   25254 | dpyrek  | <IDLE>        | f       | 2007-11-08
07:29:04.547081+01 | 2007-11-08 06:33:47.707759+01
   20275 | dpyrek  | <IDLE>        | f       | 2007-11-03
09:14:12.73829+01  | 2007-11-03 08:57:05.555972+01
   20216 | dpyrek  | <IDLE>        | f       | 2007-11-03
08:46:40.555354+01 | 2007-11-03 08:40:31.756993+01
   12435 | dpyrek  | <IDLE>        | f       | 2007-11-02
09:28:53.361365+01 | 2007-11-02 08:48:11.589485+01
   19633 | dpyrek  | <IDLE>        | f       | 2007-11-03
08:34:16.263487+01 | 2007-11-03 05:46:16.811069+01
   12156 | dpyrek  | <IDLE>        | f       | 2007-11-02
08:10:11.558427+01 | 2007-11-02 07:49:03.442489+01
    4899 | dpyrek  | <IDLE>        | f       | 2007-11-01
12:42:30.880391+01 | 2007-11-01 10:56:18.513398+01
   11988 | dpyrek  | <IDLE>        | f       | 2007-11-02
07:38:10.315758+01 | 2007-11-02 07:02:52.438251+01
    4490 | dpyrek  | <IDLE>        | f       | 2007-11-01
09:51:42.216745+01 | 2007-11-01 09:34:18.63771+01

A ptrace of one of these connections yields the following result:

db2:/home/pgsql/data # strace -p 4899
Process 4899 attached - interrupt to quit
futex(0x994000, FUTEX_WAIT, 2, NULL

This looks identical (even with the same uaddr parameter) for the other
processes.

In the log file I find many lines like this:

<2007-11-08 22:56:19 CET - dpyrek> LOG:  could not receive data from
client: Die Wartezeit für die Verbindung ist abgelaufen
<2007-11-08 22:56:19 CET - dpyrek> LOG:  unexpected EOF on client connection

I'm not sure that these lines correspond to the dead connections, but at
least it's the same user.

Does this additional information help you in any way? I'm a little bit
afraid that eventually the maximum number of connections will be
exceeded. I don't want to "kill -9" the processes because the last time
I did this the database was in recovery mode for a substantial amount of
time.

Any help is still highly appreciated!

Regards,
    Christian

P.S.: If nothing else helps I could also grant one of you guys root
access to the database machine.

--
Deriva GmbH                         Tel.: +49 551 489500-42
Financial IT and Consulting         Fax:  +49 551 489500-91
Hans-Böckler-Straße 2                  http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer



pgsql-general by date:

Previous
From: Simon Riggs
Date:
Subject: Re: "Resurrected" data files - problem?
Next
From: Gregory Stark
Date:
Subject: Re: Optimal time series sampling.