Thread: (Never?) Kill Postmaster?
2007/10/24, Stefan Schwarzer <stefan.schwarzer@grid.unep.ch>:
> Hi there,
>
> I read dozens of times the "TIP 2: Don't 'kill -9' the postmaster"...
>
> Now, what am I supposed to do if I launched a query which takes ages, and
> which I want to interrupt?
>
> Thanks for any advice,
>
> Stef

Hello,

you have to use a gentler way:

select pg_cancel_backend(<pid of the backend>)

http://www.postgresql.org/docs/current/interactive/functions-admin.html

Pavel

> ____________________________________________________________________
>
> Stefan Schwarzer
>
> Lean Back and Relax - Enjoy some Nature Photography:
> http://photoblog.la-famille-schwarzer.de
>
> Appetite for Global Data? UNEP GEO Data Portal:
> http://geodata.grid.unep.ch
> ____________________________________________________________________
On Wed, 2007-10-24 at 07:57 +0200, Stefan Schwarzer wrote:
> Hi there,
>
> I read dozens of times the "TIP 2: Don't 'kill -9' the postmaster"...
>
> Now, what am I supposed to do if I launched a query which takes ages,
> and which I want to interrupt?

Look up the query's procpid and then issue a

select pg_cancel_backend('the_id')
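[Editor's note: to make the procpid recipe above concrete, here is a hedged sketch of the two steps from the command line. It assumes a reachable 8.2-era server and sufficient privileges; the database name "mydb" and the pid 12345 are placeholders, and note that pg_stat_activity's pid column was called "procpid" in this era (newer releases renamed it "pid").]

```shell
#!/bin/sh
# Sketch only: list non-idle backends, then cancel one by its procpid.
# "mydb" and 12345 are placeholders, not values from this thread.
psql -d mydb -c "SELECT procpid, usename, current_query, query_start
                   FROM pg_stat_activity
                  WHERE current_query <> '<IDLE>'"

psql -d mydb -c "SELECT pg_cancel_backend(12345)"   # 12345 = the procpid
```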
Pavel Stehule wrote:
> 2007/10/24, Stefan Schwarzer <stefan.schwarzer@grid.unep.ch>:
>> I read dozens of times the "TIP 2: Don't 'kill -9' the postmaster"...
>>
>> Now, what am I supposed to do if I launched a query which takes ages, and
>> which I want to interrupt?
>
> you have to use a gentler way
>
> select pg_cancel_backend()
> http://www.postgresql.org/docs/current/interactive/functions-admin.html

The problem is that pg_cancel_backend() is somewhat unreliable at
cancelling wayward queries. You can try other signals with kill, though;
-9 is something of a last resort.

--
Tommy Gildseth
Ow Mun Heng wrote:
> look for the query's procpid and then issue a select
> pg_cancel_backend('the_id')

Does it do any harm if I kill (with either signal 9 or signal 15) the
single backend process (and not the postmaster)?

Regards,
Christian

--
Deriva GmbH                     Tel.: +49 551 489500-42
Financial IT and Consulting     Fax:  +49 551 489500-91
Hans-Böckler-Straße 2           http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer
Hi,
First you need to identify the correct PostgreSQL process. PostgreSQL
spawns an individual server process for each database connection. They
look something like this:
postgres 27296 7089 9 08:00 ? 00:05:52 postgres: username databasename [local] idle
If a query was running then it would say 'SELECT' instead of 'idle'.
You can send a SIGINT (i.e., signal 2) to that process to cancel the
running query, e.g.
kill -2 27296
Note that the default signal for kill is actually SIGTERM (15), not
SIGINT, so a plain "kill <pid>" terminates the backend's whole session
rather than just cancelling the current query; spell out "kill -2 <pid>"
(or "kill -INT <pid>") when you only want to cancel the query.
The tip is about doing 'kill -9' on the postmaster, which has two
important differences from the scenario I just described:
1) kill -9 means the OS kills the process without allowing it to clean up after itself
2) The postmaster is the master PostgreSQL server process, the parent of
all the backend processes. If you want to kill a single query you would
not want to kill that.
Regards // Mike
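[Editor's note: the distinction Mike draws between signals a process can handle and 'kill -9' is easy to demonstrate outside PostgreSQL. A minimal sketch in plain POSIX shell, nothing Postgres-specific; the trapping child stands in for a backend only by analogy.]

```shell
#!/bin/sh
# A process can trap SIGTERM (the default signal of `kill`) and clean up
# before exiting, but SIGKILL (`kill -9`) can never be trapped, so no
# cleanup of any kind is possible.

sh -c 'trap "echo cleanup done; exit 0" TERM; while :; do sleep 1; done' &
pid=$!
sleep 1                     # give the child time to install its trap
kill "$pid"                 # default signal is SIGTERM (15): the trap runs
wait "$pid"
echo "after SIGTERM: exit status $?"        # 0, thanks to the trap

sh -c 'while :; do sleep 1; done' &
pid=$!
sleep 1
kill -9 "$pid"              # SIGKILL: the OS removes the process outright
wait "$pid" || echo "after SIGKILL: exit status $?"   # 128+9 = 137
```

This is why 'kill -9' on the postmaster is dangerous: it is the one signal that guarantees no shared-memory cleanup ever happens.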
From: Stefan Schwarzer [mailto:stefan.schwarzer@grid.unep.ch]
Sent: Wednesday, 24 October 2007 3:58 PM
To: pgsql-general@postgresql.org
Subject: (Never?) Kill Postmaster?
Hi there,
I read dozens of times the "TIP 2: Don't 'kill -9' the postmaster"...
Now, what am I supposed to do if I launched a query which takes ages, and which I want to interrupt?
Thanks for any advice,
Stef
2007/10/24, Christian Schröder <cs@deriva.de>:
> Ow Mun Heng wrote:
> > look for the query's procpid and then issue a select
> > pg_cancel_backend('the_id')
>
> Does it do any harm if I kill (either with signal 9 or signal 15) the
> single backend process (and not the postmaster)?

Shared memory can end up corrupted, and some garbage may be left behind;
it depends on the query.

Pavel

> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
"Michael Harris" <michael.harris@ericsson.com> writes:
> The tip is about 'kill -9' on the postmaster, which has two important
> differences from the scenario I just described:
> 1) kill -9 means the OS kills the process without allowing it to clean
> up after itself
> 2) The postmaster is the master PostgreSQL server process. If you want
> to kill a single query you would not want to kill that.

Right: the tip is to not kill -9 the parent process; it's not saying
anything about child server processes.

If you've got a child process that's unresponsive to SIGINT then you can
send it a SIGKILL instead; the downside is that this will force a
restart of the other children too, that is, you're interrupting all
database sessions, not only the one. But Postgres will recover
automatically, and I don't think I've ever heard of anyone getting data
corruption as a result of such a thing.

SIGKILL on the parent is an entirely different thing. You'll have to
manually restart Postgres, possibly do some manual cleanup, and there's
a small but nonzero chance of ensuing data corruption ... especially if
you fat-finger any of the manual steps. Plus there simply isn't any good
reason to do it. The postmaster should always respond to more gentle
shutdown signals, because it doesn't run any user-provided commands that
could send it off into the weeds. Hence the TIP.

			regards, tom lane
In response to Stefan Schwarzer <stefan.schwarzer@grid.unep.ch>:
> Now, what am I supposed to do if I launched a query which takes ages,
> and which I want to interrupt?

Tracing through the other threads, this is obviously not a regular
wayward query, but one that won't die by the normal methods.

Unfortunately, I came across this recently, and the only solution I
found was to do a "pg_ctl restart -m i" (yes, I tried -m f first).
Luckily, the db in question was such that the front ends didn't suffer
horribly from this and reconnected, and the database finished up its
recovery in a timely manner.

Hopefully, I can generate a reproducible example so I can file a bug,
but haven't gotten that far with it yet.

--
Bill Moran
http://www.potentialtech.com
Tom Lane wrote:
> Right: the tip is to not kill -9 the parent process; it's not saying
> anything about child server processes.
>
> If you've got a child process that's unresponsive to SIGINT then you
> can send it a SIGKILL instead; the downside is that this will force a
> restart of the other children too, that is you're interrupting all
> database sessions not only the one. But Postgres will recover
> automatically and I don't think I've ever heard of anyone getting data
> corruption as a result of such a thing.

I have been in exactly this situation today: One statement took several
hours to complete, so it had to be cancelled. I tried a
"pg_cancel_backend" and a "kill -2" (which means SIGINT on our Linux
box), but nothing happened. Since I remembered this thread, I tried a
"kill -9" on this child process. As you described, all other connections
were reset too, and this was the message in the server log:

<2007-10-31 22:48:28 CET - chschroe> WARNING:  terminating connection because of crash of another server process
<2007-10-31 22:48:28 CET - chschroe> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

But then, when I tried to reconnect to the database, I received the
following message:

<2007-10-31 22:50:01 CET - chschroe> FATAL:  the database system is in recovery mode

Ok, you wrote "Postgres will recover automatically", but could this take
several minutes? Is that what "recovery mode" means?
When nothing seemed to happen for several minutes, I performed a
(fortunately clean) restart of the whole server. The log messages for
the server restart looked normal to me:

<2007-10-31 22:53:15 CET - > LOG:  received smart shutdown request
<2007-10-31 22:53:21 CET - > LOG:  all server processes terminated; reinitializing
<2007-10-31 22:53:58 CET - > LOG:  database system was interrupted at 2007-10-31 22:46:46 CET
<2007-10-31 22:53:58 CET - > LOG:  checkpoint record is at 153/FE9FAF20
<2007-10-31 22:53:58 CET - > LOG:  redo record is at 153/FE9FAF20; undo record is at 0/0; shutdown FALSE
<2007-10-31 22:53:58 CET - > LOG:  next transaction ID: 0/128715865; next OID: 58311787
<2007-10-31 22:53:58 CET - > LOG:  next MultiXactId: 4704; next MultiXactOffset: 9414
<2007-10-31 22:53:58 CET - > LOG:  database system was not properly shut down; automatic recovery in progress
<2007-10-31 22:53:58 CET - > LOG:  redo starts at 153/FE9FAF70
<2007-10-31 22:53:58 CET - > LOG:  record with zero length at 153/FEA05E70
<2007-10-31 22:53:58 CET - > LOG:  redo done at 153/FEA05E40
<2007-10-31 22:53:58 CET - > LOG:  database system is ready

I hope that no data got corrupted. Is there any way to check this?

What is the conclusion of this experience? Is it, contrary to the above
statements, dangerous to kill (-9) a subprocess?

Regards,
Christian
Christian Schröder <cs@deriva.de> writes:
> But then, when I tried to reconnect to the database, I received the
> following message:
> <2007-10-31 22:50:01 CET - chschroe> FATAL:  the database system is in
> recovery mode
> Ok, you wrote "Postgres will recover automatically", but could this take
> several minutes?

Yeah, potentially. I don't suppose you have any idea how long it'd been
since your last checkpoint, but what do you have checkpoint_timeout and
checkpoint_segments set to?

> What is the conclusion of this experience?

That you have an itchy trigger finger ;-). It looks from the log that
you needed to wait only a few seconds longer for the recovery to have
completed.

What I'd like to know about is why the child process was unresponsive to
SIGINT in the first place. There's little we can do about long-running
plpython functions, for instance, but if it was looping in Postgres
proper then we should do something about that. Can you reproduce this
problem easily?

			regards, tom lane
Tom Lane wrote:
>> Ok, you wrote "Postgres will recover automatically", but could this take
>> several minutes?
>
> Yeah, potentially. I don't suppose you have any idea how long it'd been
> since your last checkpoint, but what do you have checkpoint_timeout and
> checkpoint_segments set to?

I did not change these parameters from their default values, so
checkpoint_timeout is 5 min and checkpoint_segments is 8.

> What I'd like to know about is why the child process was unresponsive to
> SIGINT in the first place. [...] Can you reproduce this problem easily?

Unfortunately not. I have tried the same query and it took only about
1 sec to complete. In fact, it's a simple seq scan with a single filter
condition. No user-defined functions are involved.

Maybe it has something to do with the users connecting from their
Windows machines to the PostgreSQL server using psqlODBC. On the other
hand, it is not the first time that such a user connection had to be
terminated, and we never experienced this problem before.

If I see the phenomenon again I will use strace or something similar to
find out what the backend process is doing.

Regards,
Christian
On 10/31/07, Christian Schröder <cs@deriva.de> wrote:
> Unfortunately not. I have tried the same query and it took only about 1
> sec to complete. In fact, it's a simple seq scan with a single filter
> condition. No user defined functions are involved.
>
> If I see the phenomenon again I will use strace or something similar to
> find out what the backend process is doing.

Tom, is it possible the backend was doing something that couldn't be
immediately interrupted, like a long wait on IO or something?
"Scott Marlowe" <scott.marlowe@gmail.com> writes:
> Tom, is it possible the backend was doing something that couldn't be
> immediately interrupted, like a long wait on IO or something?

Sherlock Holmes said that theorizing in advance of the data is a capital
mistake ...

What we can be reasonably certain of is that that backend wasn't
reaching any CHECK_FOR_INTERRUPTS() macros. Whether it was hung up
waiting for something, or caught in a tight loop somewhere, is
impossible to say without more data than we have. AFAIR the OP didn't
even mention whether the backend appeared to be consuming CPU cycles
(which'd be a pretty fair tip about which of those to believe, but still
not enough to guess *where* the problem is). A gdb backtrace would tell
us more.

			regards, tom lane
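[Editor's note: for readers unfamiliar with the mechanism Tom refers to: PostgreSQL's query-cancel handler does not abort the query directly; it sets a flag, and the query stops only when execution next passes a CHECK_FOR_INTERRUPTS() call. Below is a rough shell analogue of that pattern, an illustration and not PostgreSQL source; SIGTERM is used here only because it traps reliably in scripts, whereas the backend uses SIGINT for cancel.]

```shell
#!/bin/sh
# Rough analogue of the CHECK_FOR_INTERRUPTS() pattern: the signal
# handler does minimal work (sets a flag); the "query" loop honors the
# flag at the next safe point.

cancel_pending=0
trap 'cancel_pending=1' TERM          # handler only records the request

( sleep 1; kill -TERM $$ ) &          # simulate a cancel arriving later

i=0
while :; do
    if [ "$cancel_pending" = 1 ]; then     # the CHECK_FOR_INTERRUPTS point
        echo "query cancelled at iteration $i"
        break
    fi
    i=$((i + 1))
done
# If the loop were instead blocked in an uninterruptible call and never
# reached the check, the cancel request would simply never take effect,
# which is exactly the stuck-backend symptom in this thread.
```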
Tom Lane wrote:
> What we can be reasonably certain of is that that backend wasn't
> reaching any CHECK_FOR_INTERRUPTS() macros. [...] A gdb backtrace
> would tell us more.

It happened again! I'm not sure if I should be happy because we can now
maybe find the cause of the problem, or worried because it's our
production database ...

At least the process doesn't seem to consume CPU (it doesn't show up in
"top"), so I won't kill it this time, but instead try to get all the
information that you guys need. What I already did was an strace with
the following result:

db2:/home/pgsql/data # strace -p 7129
Process 7129 attached - interrupt to quit
futex(0x994000, FUTEX_WAIT, 2, NULL)    = -1 EINTR (Interrupted system call)
--- SIGINT (Interrupt) @ 0 (0) ---
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
futex(0x994000, FUTEX_WAIT, 2, NULL

That interrupt will have been the script that tries to remove
long-running queries. The same lines seem to repeat over and over again.

Then I attached gdb to the process and printed out a backtrace:

db2:/home/pgsql/data # gdb --pid=7129
GNU gdb 6.5
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
Attaching to process 7129
Reading symbols from /usr/local/pgsql_8.2.5/bin/postgres...done.
Using host libthread_db library "/lib64/libthread_db.so.1".
Reading symbols from /lib64/libcrypt.so.1...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/local/pgsql_8.2.5/lib/plpgsql.so...done.
Loaded symbols for /usr/local/pgsql_8.2.5/lib/plpgsql.so
Reading symbols from /usr/local/pgsql_8.2.5/lib/plperl.so...done.
Loaded symbols for /usr/local/pgsql_8.2.5/lib/plperl.so
Reading symbols from /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/CORE/libperl.so...done.
Loaded symbols for /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/CORE/libperl.so
Reading symbols from /lib64/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 47248855881456 (LWP 7129)]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/auto/Opcode/Opcode.so...done.
Loaded symbols for /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/auto/Opcode/Opcode.so
0x00002af904809a68 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00002af904809a68 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
#1  0x00002af904806e88 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#2  0x00002af8fb13de23 in _nl_find_msg () from /lib64/libc.so.6
#3  0x00002af8fb13ec83 in __dcigettext () from /lib64/libc.so.6
#4  0x00002af8fb186f0b in strerror_r () from /lib64/libc.so.6
#5  0x00002af8fb186d33 in strerror () from /lib64/libc.so.6
#6  0x00000000005f4daa in expand_fmt_string ()
#7  0x00000000005f6d14 in errmsg ()
#8  0x00000000005182cc in internal_flush ()
#9  0x00000000005183b6 in internal_putbytes ()
#10 0x000000000051841c in pq_putmessage ()
#11 0x00000000005199c4 in pq_endmessage ()
#12 0x0000000000440c6a in printtup ()
#13 0x00000000004fc1b8 in ExecutorRun ()
#14 0x0000000000580451 in PortalRunSelect ()
#15 0x0000000000581446 in PortalRun ()
#16 0x000000000057d625 in exec_simple_query ()
#17 0x000000000057ea72 in PostgresMain ()
#18 0x0000000000558218 in ServerLoop ()
#19 0x0000000000558db8 in PostmasterMain ()
#20 0x000000000051a213 in main ()

Do you need anything else? Can you still tell what's happening?

Regards,
Christian
Christian Schröder <cs@deriva.de> writes:
> db2:/home/pgsql/data # gdb --pid=7129
> [... symbol-loading output identical to the previous message snipped ...]
> Loaded symbols for /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/auto/Opcode/Opcode.so
> 0x00002af904809a68 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
> (gdb) bt
> #0  0x00002af904809a68 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
> #1  0x00002af904806e88 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
> #2  0x00002af8fb13de23 in _nl_find_msg () from /lib64/libc.so.6
> #3  0x00002af8fb13ec83 in __dcigettext () from /lib64/libc.so.6
> #4  0x00002af8fb186f0b in strerror_r () from /lib64/libc.so.6
> #5  0x00002af8fb186d33 in strerror () from /lib64/libc.so.6
> #6  0x00000000005f4daa in expand_fmt_string ()
> #7  0x00000000005f6d14 in errmsg ()
> #8  0x00000000005182cc in internal_flush ()
> #9  0x00000000005183b6 in internal_putbytes ()
> #10 0x000000000051841c in pq_putmessage ()
> #11 0x00000000005199c4 in pq_endmessage ()
> #12 0x0000000000440c6a in printtup ()
> #13 0x00000000004fc1b8 in ExecutorRun ()
> #14 0x0000000000580451 in PortalRunSelect ()
> #15 0x0000000000581446 in PortalRun ()
> #16 0x000000000057d625 in exec_simple_query ()
> #17 0x000000000057ea72 in PostgresMain ()
> #18 0x0000000000558218 in ServerLoop ()
> #19 0x0000000000558db8 in PostmasterMain ()
> #20 0x000000000051a213 in main ()

Too bad you don't have debug symbols :-(.  However, this does raise a
couple of interesting questions:

* The only place internal_flush would call errmsg is here:

	ereport(COMMERROR,
			(errcode_for_socket_access(),
			 errmsg("could not send data to client: %m")));

So why is it unable to send data to the client?

* How the heck would strerror() be blocking on a thread mutex, when the
backend is single-threaded?

We recently discovered that it was possible for pltcl to cause the
backend to become multithreaded:
http://archives.postgresql.org/pgsql-patches/2007-09/msg00194.php
I see from your dump that this session has been using plperl, so I
wonder whether plperl has the same sort of issue.
Can you determine exactly what's been done with plperl in this session? Also, can you confirm that there is actually more than one thread active in this process? On Linux "ps -Lw" would show threads. regards, tom lane
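[Editor's note: besides "ps -Lw", the per-process thread count is visible in the NLWP column of ps (or the "Threads:" line in /proc/<pid>/status on Linux). A quick sketch, run here against the current shell rather than a backend pid, which is a placeholder substitution you would make yourself:]

```shell
#!/bin/sh
# Show the thread count (NLWP) of a process without attaching gdb.
# A plain /bin/sh is single-threaded, so this prints an NLWP of 1;
# substitute a backend's pid to check whether plperl spawned threads.
ps -o pid=,nlwp=,comm= -p $$
```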
Tom Lane wrote:
> * The only place internal_flush would call errmsg is here:
>
> 	ereport(COMMERROR,
> 			(errcode_for_socket_access(),
> 			 errmsg("could not send data to client: %m")));
>
> So why is it unable to send data to the client?

The user accesses the database from his home office. He is connected to
the internet using a dial-in connection which seems to be disconnected
from time to time. The problem seems to be specific to this user, so it
may have something to do with his type of connection.

> I see from your dump that this session has been using plperl, so I
> wonder whether plperl has the same sort of issue. Can you determine
> exactly what's been done with plperl in this session?

Not exactly. There are several triggers whose functions are implemented
in Perl:

    elog(ERROR, 'set_serial needs at least 2 arguments.') if $_TD->{argc} < 2;
    my ($colname, $seqname) = @{$_TD->{args}};
    my $result = spi_exec_query("SELECT nextval('$seqname')");
    $_TD->{new}{$colname} = $result->{rows}[0]{nextval};
    return 'MODIFY';

and

    elog(ERROR, 'log_changes needs at least 2 arguments.') if $_TD->{argc} < 2;
    my @args = @{$_TD->{args}};
    shift(@args) =~ /([^.]*(?=\.))?\.?(.*)/;
    my $log_table = defined($1) ? qq("$1"."$2") : qq("$2");
    if ($_TD->{event} eq 'UPDATE') {
        my $id = join(',', @{$_TD->{old}}{@args});
        $id =~ s/'/\\'/g;
        while (my ($key, $value) = each %{$_TD->{old}}) {
            if ($value ne $_TD->{new}{$key}) {
                my $newvalue = $_TD->{new}{$key};
                $value    =~ s/'/\\'/g;
                $newvalue =~ s/'/\\'/g;
                my $query = sprintf(qq(INSERT INTO $log_table VALUES
                    (current_timestamp, session_user, '%s', '%s', '%s', '%s', '%s')),
                    $id, $_TD->{relname}, $key, $value, $newvalue);
                spi_exec_query($query);
            }
        }
    } else {
        my $id = join(',', @{$_TD->{$_TD->{event} eq 'INSERT' ? 'new' : 'old'}}{@args});
        $id =~ s/'/\\'/g;
        my $query = sprintf(qq(INSERT INTO $log_table VALUES
            (current_timestamp, session_user, '%s', '%s', '(%s)', null, null)),
            $id, $_TD->{relname}, $_TD->{event});
        spi_exec_query($query);
    }
    return;

> Also, can you confirm that there is actually more than one thread active
> in this process?  On Linux "ps -Lw" would show threads.

Since "ps -Lw" only showed me my own processes, I did
"ps -AT -o pid,ppid,lwp,nlwp,command" instead. This is the relevant
line:

  PID  PPID   LWP NLWP COMMAND
 7129  3934  7129    1 postgres: dpyrek uspi 88.70.241.205(10612) SELECT

I'm not sure if this data is correct, because NLWP (the number of
threads in the process) is 1 for all processes.

Is it necessary to recompile the server with debugging information
enabled? As I already mentioned, this is a production database, so I
would have to do it by night, which would be somewhat uncomfortable for
me ... not to mention the bad feeling when changing a production
database server ...

Regards,
Christian
Christian Schröder wrote:
> Since "ps -Lw" only showed me my own processes, I did "ps -AT -o
> pid,ppid,lwp,nlwp,command" instead. [...] I'm not sure if this data is
> correct, because NLWP (the number of threads in the process) is 1 for
> all processes.

Please try "thread apply all bt full" in gdb.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote:
> Please try "thread apply all bt full" in gdb.

The first lines, where the symbols are loaded, are of course identical.
The output of the command is in my opinion not very helpful:

(gdb) thread apply all bt full

Thread 1 (Thread 47248855881456 (LWP 7129)):
#0  0x00002af904809a68 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00002af904806e88 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
No symbol table info available.
#2  0x00002af8fb13de23 in _nl_find_msg () from /lib64/libc.so.6
No symbol table info available.
#3  0x00002af8fb13ec83 in __dcigettext () from /lib64/libc.so.6
No symbol table info available.
#4  0x00002af8fb186f0b in strerror_r () from /lib64/libc.so.6
No symbol table info available.
#5  0x00002af8fb186d33 in strerror () from /lib64/libc.so.6
No symbol table info available.
#6  0x00000000005f4daa in expand_fmt_string ()
No symbol table info available.
#7  0x00000000005f6d14 in errmsg ()
No symbol table info available.
#8  0x00000000005182cc in internal_flush ()
No symbol table info available.
#9  0x00000000005183b6 in internal_putbytes ()
No symbol table info available.
#10 0x000000000051841c in pq_putmessage ()
No symbol table info available.
#11 0x00000000005199c4 in pq_endmessage ()
No symbol table info available.
#12 0x0000000000440c6a in printtup ()
No symbol table info available.
#13 0x00000000004fc1b8 in ExecutorRun ()
No symbol table info available.
#14 0x0000000000580451 in PortalRunSelect ()
No symbol table info available.
#15 0x0000000000581446 in PortalRun ()
No symbol table info available.
#16 0x000000000057d625 in exec_simple_query ()
No symbol table info available.
#17 0x000000000057ea72 in PostgresMain ()
No symbol table info available.
#18 0x0000000000558218 in ServerLoop ()
No symbol table info available.
#19 0x0000000000558db8 in PostmasterMain ()
No symbol table info available.
#20 0x000000000051a213 in main ()
No symbol table info available.
#0  0x00002af904809a68 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0

What does this tell you?

Regards,
Christian
Christian Schröder wrote:
> Alvaro Herrera wrote:
>> Please try "thread apply all bt full" in gdb.
>
> The first lines, where the symbols are loaded, are of course identical.
> The output of the command is in my opinion not very helpful:

I was actually hoping that it would list the running threads in the
process. It would have sufficed to say "info threads", but the "bt full"
command would have given more detail in case there were more.

I find it a bit strange that it tells you that there is one thread; in a
simple try here, it doesn't say anything at all:

(gdb) info threads
(gdb)

In any case, I can see from the files from which symbols are loaded that
some of the Perl stuff seems to be multithreaded ... maybe that's where
the problem comes from.

Hmm, after creating a plperl function it is indeed different:

(gdb) info threads
* 1 Thread 0x2ba8945f3e20 (LWP 3606)  0x00002ba894370645 in recv () from /lib/libc.so.6
(gdb)

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Hi all,

any news about this issue? Anything else that I can do to help you?

Meanwhile there are 4 connections in the same state. (I did not do the
whole investigation on all 4, but since they all do not respond to a
SIGINT I assume that they all have the same problem.) It may also be
interesting that the 4 processes belong to the same two users that
already caused this problem earlier. Maybe it really has something to do
with their unstable internet connection? (I mentioned this in an earlier
mail.)

I have also noticed that one of these two users has many open
connections which are all idle. I guess that those connections are in
fact dead, but the database did not close them for some reason. The
pg_stat_activity entries for the corresponding backend processes are as
follows:

 procpid | usename | current_query | waiting |          query_start          |         backend_start
---------+---------+---------------+---------+-------------------------------+-------------------------------
   26033 | dpyrek  | <IDLE>        | f       | 2007-11-08 10:21:01.555232+01 | 2007-11-08 09:55:01.59932+01
   18331 | dpyrek  | <IDLE>        | f       | 2007-11-07 11:34:24.968852+01 | 2007-11-07 11:08:29.043762+01
   18940 | dpyrek  | <IDLE>        | f       | 2007-11-07 14:29:52.987176+01 | 2007-11-07 13:14:48.609031+01
   25868 | dpyrek  | <IDLE>        | f       | 2007-11-08 09:47:46.938991+01 | 2007-11-08 09:13:34.101351+01
    6719 | dpyrek  | <IDLE>        | f       | 2007-11-06 12:06:14.875588+01 | 2007-11-06 11:10:00.566644+01
   17987 | dpyrek  | <IDLE>        | f       | 2007-11-07 10:31:50.517275+01 | 2007-11-07 10:11:07.310338+01
   31808 | dpyrek  | <IDLE>        | f       | 2007-11-08 22:55:03.931727+01 | 2007-11-08 22:55:03.766638+01
   25484 | dpyrek  | <IDLE>        | f       | 2007-11-08 08:32:57.265377+01 | 2007-11-08 07:44:30.845967+01
    5972 | dpyrek  | <IDLE>        | f       | 2007-11-06 08:51:54.57437+01  | 2007-11-06 08:14:03.560602+01
    6241 | dpyrek  | <IDLE>        | f       | 2007-11-06 09:59:02.018452+01 | 2007-11-06 09:20:49.092246+01
    6136 | dpyrek  | <IDLE>        | f       | 2007-11-06 09:14:40.729837+01 | 2007-11-06 08:57:29.55187+01
   12645 | dpyrek  | <IDLE>        | f       | 2007-11-02 10:08:24.856929+01 | 2007-11-02 09:35:37.640976+01
   25254 | dpyrek  | <IDLE>        | f       | 2007-11-08 07:29:04.547081+01 | 2007-11-08 06:33:47.707759+01
   20275 | dpyrek  | <IDLE>        | f       | 2007-11-03 09:14:12.73829+01  | 2007-11-03 08:57:05.555972+01
   20216 | dpyrek  | <IDLE>        | f       | 2007-11-03 08:46:40.555354+01 | 2007-11-03 08:40:31.756993+01
   12435 | dpyrek  | <IDLE>        | f       | 2007-11-02 09:28:53.361365+01 | 2007-11-02 08:48:11.589485+01
   19633 | dpyrek  | <IDLE>        | f       | 2007-11-03 08:34:16.263487+01 | 2007-11-03 05:46:16.811069+01
   12156 | dpyrek  | <IDLE>        | f       | 2007-11-02 08:10:11.558427+01 | 2007-11-02 07:49:03.442489+01
    4899 | dpyrek  | <IDLE>        | f       | 2007-11-01 12:42:30.880391+01 | 2007-11-01 10:56:18.513398+01
   11988 | dpyrek  | <IDLE>        | f       | 2007-11-02 07:38:10.315758+01 | 2007-11-02 07:02:52.438251+01
    4490 | dpyrek  | <IDLE>        | f       | 2007-11-01 09:51:42.216745+01 | 2007-11-01 09:34:18.63771+01

An strace of one of these connections yields the following result:

db2:/home/pgsql/data # strace -p 4899
Process 4899 attached - interrupt to quit
futex(0x994000, FUTEX_WAIT, 2, NULL

This looks identical (even with the same uaddr parameter) for the other
processes. In the log file I find many lines like this:

<2007-11-08 22:56:19 CET - dpyrek> LOG:  could not receive data from client: Die Wartezeit für die Verbindung ist abgelaufen
<2007-11-08 22:56:19 CET - dpyrek> LOG:  unexpected EOF on client connection

(The German message translates to "the connection timed out".) I'm not
sure that these lines correspond to the dead connections, but at least
it's the same user.

Does this additional information help you in any way? I'm a little bit
afraid that eventually the maximum number of connections will be
exceeded. I don't want to "kill -9" the processes, because the last time
I did this the database was in recovery mode for a substantial amount of
time.

Any help is still highly appreciated!

Regards,
Christian

P.S.: If nothing else helps, I could also grant one of you guys root
access to the database machine.
=?ISO-8859-1?Q?Christian_Schr=F6der?= <cs@deriva.de> writes: > I don't want to "kill -9" the processes because the last time > I did this the database was in recovery mode for a substantial amount of > time. A useful tip on that: if you perform a manual CHECKPOINT just before issuing the kills, recovery time should be minimal. regards, tom lane
=?ISO-8859-1?Q?Christian_Schr=F6der?= <cs@deriva.de> writes:
> any news about this issue?

Not a lot. I've been able to confirm on my own Fedora 6 machine that the mere act of loading plperl.so into the backend causes the process to have one thread instead of no threads, as reported by gdb. How relevant that is isn't clear.

I experimented with executing a plperl function and then kill -9'ing the connected psql frontend process while executing a query that returned a lot of output. But all I could get out of that was

LOG: could not send data to client: Broken pipe
STATEMENT: select * from generate_series(1,1000000);
LOG: could not receive data from client: Connection reset by peer
LOG: unexpected EOF on client connection

The first of these lines shows that control passed unscathed through the place where your stack trace shows it was hung up. So there is something different between your situation and my test. It could be that the flaky-internet-connection case is different from my kill-the-client test, but it's pretty hard to see how that would matter in userland, especially not for the act of calling strerror() after control has already returned from the kernel.

What I think is that the perl stuff your session has done has included some action that changed the condition of the backend process ... exactly what, I have no idea. Can you show us the plperl functions that were used in these sessions?

regards, tom lane
Tom Lane wrote:
> =?ISO-8859-1?Q?Christian_Schr=F6der?= <cs@deriva.de> writes:
>> I don't want to "kill -9" the processes because the last time
>> I did this the database was in recovery mode for a substantial amount of
>> time.
>
> A useful tip on that: if you perform a manual CHECKPOINT just before
> issuing the kills, recovery time should be minimal.

Thanks for the tip. For the moment I will leave the processes in their current state so that I can perform further tests in case you need more data. But once I start feeling uncomfortable with the situation I will follow your advice.

Regards, Christian
Tom Lane wrote:
> control has already returned from the kernel. What I think is that the
> perl stuff your session has done has included some action that changed
> the condition of the backend process ... exactly what, I have no idea.
> Can you show us the plperl functions that were used in these sessions?

Considering the skill level of the users I don't think that they explicitly used any perl function. So the only remaining perl stuff is the triggers that I mentioned earlier. There are two triggers in the database to which the users were connected (see http://archives.postgresql.org/pgsql-general/2007-11/msg00245.php). There are three more triggers in other databases which might have an impact:

=============================
if ($_TD->{argc} != 1) {
    elog(ERROR, 'do_fast_export_log() needs the table\'s primary key as sole argument, multiple keys are not yet supported.');
}
my $table = $_TD->{table_name};
my $id = $_TD->{old}{$_TD->{args}[0]};
while (my ($col, $value) = each %{$_TD->{old}}) {
    my $oldVal = $_TD->{old}{$col};
    my $newVal = $_TD->{new}{$col};
    $oldVal =~ s/'/''/g;
    $oldVal =~ s/\t/\\\\t/g;
    $newVal =~ s/'/''/g;
    $newVal =~ s/\t/\\\\t/g;
    if ($oldVal ne $newVal) {
        my $data = "$id\t$table\t$col\t$oldVal\t$newVal";
        my $query = spi_exec_query("insert into fast_export(data) values(E'$data')");
    }
}
spi_exec_query("notify fast_export");
return;
=============================
elog(ERROR, 'store_session_user needs the field name as argument.') if $_TD->{argc} < 1;
my ($field) = @{$_TD->{args}};
elog(ERROR, "Field '$field' does not exist in target table.\n") unless exists $_TD->{new}{$field};
my $query = spi_exec_query('SELECT session_user', 1);
$_TD->{new}{$field} = $query->{rows}[0]{session_user};
return 'MODIFY';
=============================
if ($_TD->{argc} > 0) {
    foreach (@{$_TD->{args}}) {
        $_TD->{new}{$_} = $_TD->{old}{$_} if exists $_TD->{old}{$_};
    }
    return 'MODIFY';
} else {
    return;
}
=============================

Regards, Christian
=?ISO-8859-1?Q?Christian_Schr=F6der?= <cs@deriva.de> writes:
> [ ongoing saga ]

I don't think you ever mentioned exactly what platform you're running on; it seems to be some 64-bit Linux variant but you didn't say which. I've been futilely trying to reproduce the hang on a Xeon Fedora Core 6 machine, and just now I noticed what might be a significant discrepancy. You showed this in a gdb printout:

> Reading symbols from /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/CORE/libperl.so...done.
> Loaded symbols for /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/CORE/libperl.so
> Reading symbols from /lib64/libpthread.so.0...done.
> [Thread debugging using libthread_db enabled]
> [New Thread 47248855881456 (LWP 7129)]
> Loaded symbols for /lib64/libpthread.so.0
> Reading symbols from /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/auto/Opcode/Opcode.so...done.
> Loaded symbols for /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/auto/Opcode/Opcode.so

On my machine the corresponding lines point to /usr/lib64:

Reading symbols from /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/CORE/libperl.so...done.
Loaded symbols for /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/CORE/libperl.so
Reading symbols from /lib64/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 46912496247568 (LWP 344)]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/auto/Opcode/Opcode.so...done.
Loaded symbols for /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/auto/Opcode/Opcode.so

Maybe this is just a cross-distribution difference in file layouts, but I'm suddenly wondering if there's a 32-vs-64-bit issue here. Exactly which perl packages have you got installed?

regards, tom lane
Tom Lane wrote:
> I don't think you ever mentioned exactly what platform you're running
> on; it seems to be some 64-bit Linux variant but you didn't say which.

The machine has two dual-core Xeon 5130 CPUs. The OS is openSUSE 10.2 (x86-64). The output of uname -a is:

Linux db2 2.6.18.8-0.7-default #1 SMP Tue Oct 2 17:21:08 UTC 2007 x86_64 x86_64 x86_64 GNU/Linux

Did I mention the PostgreSQL version is 8.2.5?

> On my machine the corresponding lines point to /usr/lib64:

You are right. Of course the directory /usr/lib64 exists and contains lots of shared libraries and other stuff, but indeed no perl5 directory.

> Maybe this is just a cross-distribution difference in file layouts,
> but I'm suddenly wondering if there's a 32-vs-64-bit issue here.
> Exactly which perl packages have you got installed?

rpm says: "perl-5.8.8-32". The somewhat lengthy output of "perl -V" is:

Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=linux, osvers=2.6.18, archname=x86_64-linux-thread-multi
    uname='linux eisler 2.6.18 #1 smp tue nov 21 12:59:21 utc 2006 x86_64 x86_64 x86_64 gnulinux '
    config_args='-ds -e -Dprefix=/usr -Dvendorprefix=/usr -Dinstallusrbinperl -Dusethreads -Di_db -Di_dbm -Di_ndbm -Di_gdbm -Duseshrplib=true -Doptimize=-O2 -fmessage-length=0 -Wall -D_FORTIFY_SOURCE=2 -g -Wall -pipe'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=define use64bitall=define uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -pipe -Wdeclaration-after-statement -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -fmessage-length=0 -Wall -D_FORTIFY_SOURCE=2 -g -Wall -pipe',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -pipe -Wdeclaration-after-statement'
    ccversion='', gccversion='4.1.2 20061115 (prerelease) (SUSE Linux)', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib64'
    libpth=/lib64 /usr/lib64 /usr/local/lib64
    libs=-lm -ldl -lcrypt -lpthread
    perllibs=-lm -ldl -lcrypt -lpthread
    libc=/lib64/libc-2.5.so, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.5'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib64'

Characteristics of this binary (from libperl):
  Compile-time options: DEBUGGING MULTIPLICITY PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP THREADS_HAVE_PIDS USE_64_BIT_ALL USE_64_BIT_INT USE_ITHREADS USE_LARGE_FILES USE_PERLIO USE_REENTRANT_API
  Built under linux
  Compiled at Nov 25 2006 11:02:03
  @INC:
    /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi
    /usr/lib/perl5/5.8.8
    /usr/lib/perl5/site_perl/5.8.8/x86_64-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.8
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.8
    /usr/lib/perl5/vendor_perl
    .

One more piece of information about the current situation: Besides the processes that are stuck in a statement (meanwhile I have five, and the most recent belongs to yet another user) I also see many idle connections. I'm pretty sure that the users are out of the office at the moment and their workstations are shut down, so the connections should not exist at all. Since I now have a total of 69 connections and already touched the connection limit yesterday, I will now restart the server. Thus, I won't be able to provide any more detailed information until the problem arises again.
Of course, I can still give you general information about the system.

Regards, Christian
=?ISO-8859-1?Q?Christian_Schr=F6der?= <cs@deriva.de> writes:
> Tom Lane wrote:
>> Maybe this is just a cross-distribution difference in file layouts,
>> but I'm suddenly wondering if there's a 32-vs-64-bit issue here.
>> Exactly which perl packages have you got installed?

> rpm says: "perl-5.8.8-32". The somewhat lengthy output of "perl -V" is:
> [snip]

OK, it looks like it is a 64-bit build but they intentionally put the perl stuff under /usr/lib anyway. So that was a red herring.

One thing I notice in comparing your output against what I see on Fedora is:

> Compiler:
>     cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS
                                                    ^^^^^^^^^^^^^^^^^^^

There's no such #define mentioned in the Red Hat configuration ... I wonder what that does? I wouldn't normally think this significant, but given that we suspect the problem is somehow related to threading, it kind of draws the eyes ...

> Since I have now a total of 69 connections and already touched the
> connection limit yesterday I will now restart the server. Thus, I won't
> be able to provide any more detailed information until the problem
> arises again.

OK. For the moment I confess bafflement. You had offered access to your system to probe more carefully --- once you've built up two or three stuck processes again, I would like to take a look.

regards, tom lane
Tom Lane wrote:
> OK. For the moment I confess bafflement. You had offered access
> to your system to probe more carefully --- once you've built up
> two or three stuck processes again, I would like to take a look.

Although I do not yet have any processes that are stuck inside a statement, there are some that are idle but do not respond to SIGINT or even SIGTERM. Is this sufficient? Then I could change the root password and send all the login data to you.

Regards, Christian
=?ISO-8859-1?Q?Christian_Schr=F6der?= <cs@deriva.de> writes: > Although I do not yet have any processes that are stuck inside a > statement, there are some that are idle, but do not respond to SIGINT or > even SIGTERM. Is this sufficient? Dunno. Have you looked at their call stacks with gdb? regards, tom lane
Tom Lane wrote:
> =?ISO-8859-1?Q?Christian_Schr=F6der?= <cs@deriva.de> writes:
>> Although I do not yet have any processes that are stuck inside a
>> statement, there are some that are idle, but do not respond to SIGINT or
>> even SIGTERM. Is this sufficient?
>
> Dunno. Have you looked at their call stacks with gdb?

(gdb) bt
#0  0x00002b24aeee0a68 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
#1  0x00002b24aeedde88 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#2  0x00002b24a5814e23 in _nl_find_msg () from /lib64/libc.so.6
#3  0x00002b24a5815c83 in __dcigettext () from /lib64/libc.so.6
#4  0x00002b24a585df0b in strerror_r () from /lib64/libc.so.6
#5  0x00002b24a585dd33 in strerror () from /lib64/libc.so.6
#6  0x00000000005f4daa in expand_fmt_string ()
#7  0x00000000005f6d14 in errmsg ()
#8  0x00000000005185f3 in pq_recvbuf ()
#9  0x0000000000518987 in pq_getbyte ()
#10 0x000000000057eb69 in PostgresMain ()
#11 0x0000000000558218 in ServerLoop ()
#12 0x0000000000558db8 in PostmasterMain ()
#13 0x000000000051a213 in main ()

Seems to be the same as for the processes that were stuck inside a statement.

I recompiled the server with debugging symbols enabled and then did the following experiment: I started a query which I knew would take some time. While the query executed I disconnected my dial-up line. After reconnecting, the backend process was still there (still SELECTing). Meanwhile the query is finished and the process is idle, but it's still present. I tried to "kill -SIGINT" the process and it didn't respond. "pg_cancel_backend" also didn't work.
However, a "kill -SIGTERM" did kill the process (in contrast to the processes I saw before) and the call stack is different:

(gdb) bt
#0  0x00002ac5d1d0f0c5 in recv () from /lib64/libc.so.6
#1  0x0000000000514292 in secure_read (port=0x996730, ptr=0x917760, len=8192) at be-secure.c:313
#2  0x0000000000518574 in pq_recvbuf () at pqcomm.c:723
#3  0x0000000000518987 in pq_getbyte () at pqcomm.c:764
#4  0x000000000057eb69 in PostgresMain (argc=4, argv=0x97f830, username=0x97f6d0 "chschroe") at postgres.c:301
#5  0x0000000000558218 in ServerLoop () at postmaster.c:2934
#6  0x0000000000558db8 in PostmasterMain (argc=3, argv=0x97bfd0) at postmaster.c:966
#7  0x000000000051a213 in main (argc=3, argv=0x97bfd0) at main.c:188

I now did the same again and will wait some time. Maybe the process will come to the same state the other processes had been in. If not, I'm pretty sure that my users will be able to create some really persistent processes again ... ;-)

Regards, Christian
On Sun, Nov 11, 2007 at 06:34:40PM +0100, Christian Schröder wrote:
> I recompiled the server with debugging symbols enabled and then did the
> following experiment: I started a query which I knew would take some
> time. While the query executed I disconnected my dial-up line. After
> reconnecting the backend process was still there (still SELECTing).
> Meanwhile the query is finished and the process is idle, but it's still
> present. I tried to "kill -SIGINT" the process and it didn't respond.
> "pg_cancel_backend" also didn't work. However, a "kill -SIGTERM" did
> kill the process (in contrast to the processes I saw before) and the
> call stack is different:

Yeah, not being able to kill processes that are waiting for I/O is somewhat of a known problem in older versions; I'm unsure if it's still an issue. The issue is that the signal is set to restartable, which means that the SIGINT signal handler gets called and then the process goes right back to waiting, rather than interrupting the system call. In the libpq client we handle this by jumping out of the signal handler. For the server the best bet would probably be to disable the restartable flag for the period of socket I/O.

Not sure if this has anything to do with your main problem though.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
> -- John F Kennedy
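[Editor's note: Martijn's point — the handler runs, but the blocking call simply resumes — can be sketched outside the backend. The toy below is not PostgreSQL code; it relies on Python 3.5+ retrying system calls interrupted by a signal once the handler returns (PEP 475), which mimics the SA_RESTART behaviour described above. The pipe and the delayed writer stand in for the client connection.]

```python
import os
import signal
import threading
import time

r, w = os.pipe()
caught = []

def on_alarm(signum, frame):
    # Like the backend's SIGINT handler: make a note and return.
    caught.append(signum)

signal.signal(signal.SIGALRM, on_alarm)

# A "client" that only sends data well after the signal has fired.
writer = threading.Thread(target=lambda: (time.sleep(2.0), os.write(w, b"x")))
writer.start()

signal.alarm(1)        # fires while the read below is blocked
data = os.read(r, 1)   # handler runs at t=1s, read resumes and completes at t=2s
writer.join()
```

After this runs, `caught` shows the signal was delivered and handled, yet `data` shows the read was never abandoned: exactly the "goes right back to waiting" behaviour that makes SIGINT ineffective against a backend blocked in recv().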
=?ISO-8859-1?Q?Christian_Schr=F6der?= <cs@deriva.de> writes:
> (gdb) bt
> #0  0x00002b24aeee0a68 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
> #1  0x00002b24aeedde88 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
> #2  0x00002b24a5814e23 in _nl_find_msg () from /lib64/libc.so.6
> #3  0x00002b24a5815c83 in __dcigettext () from /lib64/libc.so.6
> #4  0x00002b24a585df0b in strerror_r () from /lib64/libc.so.6
> #5  0x00002b24a585dd33 in strerror () from /lib64/libc.so.6
> #6  0x00000000005f4daa in expand_fmt_string ()
> #7  0x00000000005f6d14 in errmsg ()
> #8  0x00000000005185f3 in pq_recvbuf ()
> #9  0x0000000000518987 in pq_getbyte ()
> #10 0x000000000057eb69 in PostgresMain ()
> #11 0x0000000000558218 in ServerLoop ()
> #12 0x0000000000558db8 in PostmasterMain ()
> #13 0x000000000051a213 in main ()

> Seems to be the same as for the processes that were stuck inside of a
> statement.

Well, the top of the stack is the same, but this is useful anyway because it shows that an I/O error on the input side can trigger the problem as well as one on the output side. We're still left wondering how a thread mutex down inside strerror() could be left in a "stuck" state, when the process doesn't appear to contain more than one thread.

> I recompiled the server with debugging symbols enabled and then did the
> following experiment: I started a query which I knew would take some
> time. While the query executed I disconnected my dial-up line. After
> reconnecting the backend process was still there (still SELECTing).
> Meanwhile the query is finished and the process is idle, but it's still
> present.

That is probably not the same situation because (assuming the query didn't produce a lot of output) the kernel does not yet think that the network connection is lost irretrievably. You'd have to wait for the TCP timeout interval to elapse, whereupon the kernel would report the connection lost (EPIPE or ECONNRESET error), whereupon we'd enter the code path shown above.
One thing I'm suddenly thinking might be related: didn't you mention that you have some process that goes around and SIGINT's backends that it thinks are running too long? I'm wondering if a SIGINT event is a necessary component of producing the problem ... regards, tom lane
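[Editor's note: one standard mitigation for connections that linger until the TCP timeout elapses is TCP keepalive, which PostgreSQL 8.1 and later expose as the tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count settings. The sketch below just shows the underlying socket options; the 60/10/3 values are illustrative, not recommendations.]

```python
import socket

# A server-side socket on which we want dead peers detected early.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Turn keepalive probing on (portable across platforms).
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-specific tuning knobs, guarded because not every platform has them.
if hasattr(socket, "TCP_KEEPIDLE"):
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle seconds before first probe
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # failed probes before the kernel drops the connection

keepalive_on = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
s.close()
```

With settings like these, the kernel would report the dial-up client's connection lost after roughly idle + interval * count seconds instead of waiting for the default (often two-hour) keepalive timer, so the backend would reach the error path much sooner.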
Tom Lane wrote:
>> I recompiled the server with debugging symbols enabled and then did the
>> following experiment: I started a query which I knew would take some
>> time. While the query executed I disconnected my dial-up line. After
>> reconnecting the backend process was still there (still SELECTing).
>> Meanwhile the query is finished and the process is idle, but it's still
>> present.
>
> That is probably not the same situation because (assuming the query
> didn't produce a lot of output) the kernel does not yet think that the
> network connection is lost irretrievably. You'd have to wait for the
> TCP timeout interval to elapse, whereupon the kernel would report the
> connection lost (EPIPE or ECONNRESET error), whereupon we'd enter the
> code path shown above.

That's what I also thought, but unfortunately it doesn't seem to be the same situation. After a while the connection disappears and the server log says (quite correctly, in my opinion):

<2007-11-11 18:45:43 CET - chschroe> LOG: could not receive data from client: Die Wartezeit für die Verbindung ist abgelaufen
<2007-11-11 18:45:43 CET - chschroe> LOG: unexpected EOF on client connection

(The German message means "the connection timed out".)

> One thing I'm suddenly thinking might be related: didn't you mention
> that you have some process that goes around and SIGINT's backends that
> it thinks are running too long? I'm wondering if a SIGINT event is a
> necessary component of producing the problem ...

Maybe. On the other hand, I sent a SIGINT to my process and it nonetheless didn't show the strange behaviour. To test this I will change the script so that it leaves the processes of one of the users alone.

I think the users will create some more unkillable processes tomorrow. As soon as I have some, I will send you the login data for the server so that you can have a look for yourself.
Regards, Christian
=?ISO-8859-1?Q?Christian_Schr=F6der?= <cs@deriva.de> writes: > [ ongoing saga ] Hah, I've got it. (Should have searched Red Hat's bugzilla sooner.) What you are hitting is a glibc bug, as explained here: http://sources.redhat.com/ml/libc-hacker/2007-10/msg00010.html If libpthread is loaded after first use of dcgettext, then subsequent uses are at risk of hanging because they start to use a mutex lock that was never initialized. And libperl brings in libpthread. I've confirmed that the test program given in the above page hangs on your machine, and does not hang on mine (with latest Fedora 6 glibc), which no doubt explains my lack of success in reproducing the problem. So you need to pester SuSE for a glibc with that fix in it ... regards, tom lane
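[Editor's note: the failure mode Tom describes is easier to see in miniature. The class below is only an analogy — none of these names exist in glibc. The lock is created only if "threading" was present when the catalog was first used, but it is *taken* whenever threading is present at call time, so loading libpthread between the two calls means taking a lock that was never initialized. In glibc that meant waiting forever on junk mutex state; here it is modelled as an error.]

```python
import threading

pthread_loaded = False  # stands in for "is libpthread mapped into the process?"

class ToyCatalog:
    """Crude analogy of the message-catalog locking bug described above."""

    def __init__(self):
        # Initialization path: the lock exists only if threading was
        # already loaded when the catalog was first set up.
        self._lock = threading.Lock() if pthread_loaded else None

    def lookup(self, msg):
        # Usage path: whether to take the lock is decided at call time.
        if pthread_loaded:
            self._lock.acquire()  # never initialized -> blows up (glibc: hangs)
            try:
                return msg
            finally:
                self._lock.release()
        return msg

catalog = ToyCatalog()               # first "dcgettext" call, before plperl is loaded
ok_before = catalog.lookup("hello")  # single-threaded fast path works fine

pthread_loaded = True                # loading plperl drags in libpthread
try:
    catalog.lookup("hello")          # now takes a lock that was never created
    hung = False
except AttributeError:               # the stand-in for the futex hang
    hung = True
```

The sequencing also matches the symptoms in this thread: backends hang only after an error message (an I/O failure on the flaky connections) forces a gettext lookup once plperl has already been loaded.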
Tom Lane wrote:
> Hah, I've got it. (Should have searched Red Hat's bugzilla sooner.)
> What you are hitting is a glibc bug, as explained here:
> http://sources.redhat.com/ml/libc-hacker/2007-10/msg00010.html
> If libpthread is loaded after first use of dcgettext, then subsequent
> uses are at risk of hanging because they start to use a mutex lock
> that was never initialized. And libperl brings in libpthread.

Many thanks for your efforts, which have been crowned with success at last!!

What remains inscrutable to me is why this problem did not arise earlier. I cannot remember any changes that I have made to the system recently. Or maybe it was a defective update of glibc? I did not see any hint in the bug report about when this bug was introduced. Obviously it has something to do with the users having disconnects, but we have been working with dial-up connections for many years and never saw this bug. By the way, does the "--enable-thread-safety" switch have anything to do with the problem?

> So you need to pester SuSE for a glibc with that fix in it ...

Phew, that might be hard work, but I will give it a try. Until I have success I will probably restart the database server once a week to remove those old connections.

Thanks a lot for your help!
Christian

P.S.: Can I change the root password again?
On Tue, Nov 13, 2007 at 02:08:47PM +0100, Christian Schröder wrote:
> By the way, does the "--enable-thread-safety" switch have anything to do
> with the problem?

No, that only affects the client library; it doesn't affect the server...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
=?ISO-8859-1?Q?Christian_Schr=F6der?= <cs@deriva.de> writes: > What remains inscrutable to me is why this problem did not arise > earlier. I cannot remember any changes that I have made to the system > recently. Or maybe it has been a defective update of the glibc? I did > not see in the bug report any hint about when this bug has been > introduced. It's quite recent actually --- according to http://sourceware.org/cgi-bin/cvsweb.cgi/libc/intl/dcigettext.c?cvsroot=glibc the problem was introduced into upstream sources on 24-Sep. I was a bit confused yesterday because I'd been looking at another lock used by the same routine that's been there much longer; but that one is initialized in a safe way so it doesn't have the problem. It's the new usage of the conversions_lock that puts _nl_find_msg at risk. So even though the underlying problem has existed for years, it didn't manifest, at least not right there. So I suppose that SuSE is tracking upstream closely and you recently updated glibc ... > P.S.: Can I change the root password again? Oh, yeah, I'm done. Thanks for letting me poke at it. regards, tom lane