Thread: (Never?) Kill Postmaster?
2007/10/24, Stefan Schwarzer <stefan.schwarzer@grid.unep.ch>:
> Hi there,
>
> I read dozens of times the "TIP 2: Don't 'kill -9' the postmaster"...
>
> Now, what am I supposed to do if I launched a query which takes ages, and
> which I want to interrupt?
>
> Thanks for any advice,
>
> Stef

Hello,

you have to use a gentler way:

select pg_cancel_backend(<pid of the backend>)

http://www.postgresql.org/docs/current/interactive/functions-admin.html

Pavel

> ____________________________________________________________________
>
> Stefan Schwarzer
>
> Lean Back and Relax - Enjoy some Nature Photography:
> http://photoblog.la-famille-schwarzer.de
>
> Appetite for Global Data? UNEP GEO Data Portal:
> http://geodata.grid.unep.ch
> ____________________________________________________________________
On Wed, 2007-10-24 at 07:57 +0200, Stefan Schwarzer wrote:
> Hi there,
>
> I read dozens of times the "TIP 2: Don't 'kill -9' the postmaster"...
>
> Now, what am I supposed to do if I launched a query which takes ages,
> and which I want to interrupt?

Look up the query's procpid and then issue a

select pg_cancel_backend('the_id')
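[Editor's note: to make the procpid recipe above concrete, here is a hedged sketch of the two steps from the command line. It assumes a reachable 8.2-era server and sufficient privileges; the database name "mydb" and the pid 12345 are placeholders, and note that pg_stat_activity's pid column was called "procpid" in this era (newer releases renamed it "pid").]

```shell
#!/bin/sh
# Sketch only: list non-idle backends, then cancel one by its procpid.
# "mydb" and 12345 are placeholders, not values from this thread.
psql -d mydb -c "SELECT procpid, usename, current_query, query_start
                   FROM pg_stat_activity
                  WHERE current_query <> '<IDLE>'"

psql -d mydb -c "SELECT pg_cancel_backend(12345)"   # 12345 = the procpid
```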
Pavel Stehule wrote:
> 2007/10/24, Stefan Schwarzer <stefan.schwarzer@grid.unep.ch>:
>> I read dozens of times the "TIP 2: Don't 'kill -9' the postmaster"...
>>
>> Now, what am I supposed to do if I launched a query which takes ages, and
>> which I want to interrupt?
>
> you have to use a gentler way
>
> select pg_cancel_backend()
> http://www.postgresql.org/docs/current/interactive/functions-admin.html

The problem is that pg_cancel_backend() is somewhat unreliable at
cancelling wayward queries. You can try other signals with kill, though;
-9 is something of a last resort.

--
Tommy Gildseth
Ow Mun Heng wrote:
> look for the query's procpid and then issue a select
> pg_cancel_backend('the_id')

Does it do any harm if I kill (with either signal 9 or signal 15) the
single backend process (and not the postmaster)?

Regards,
Christian

--
Deriva GmbH                     Tel.: +49 551 489500-42
Financial IT and Consulting     Fax:  +49 551 489500-91
Hans-Böckler-Straße 2           http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer
Hi,
First you need to identify the correct PostgreSQL process. PostgreSQL
spawns an individual server process for each database connection. They
look something like this:
postgres 27296 7089 9 08:00 ? 00:05:52 postgres: username databasename [local] idle
If a query was running then it would say 'SELECT' instead of 'idle'.
You can send a SIGINT (i.e., signal 2) to that process to cancel the
running query, e.g.
kill -2 27296
Note that the default signal for kill is actually SIGTERM (15), not
SIGINT, so a plain "kill <pid>" terminates the backend's whole session
rather than just cancelling the current query; spell out "kill -2 <pid>"
(or "kill -INT <pid>") when you only want to cancel the query.
The tip is about doing 'kill -9' on the postmaster, which has two
important differences from the scenario I just described:
1) kill -9 means the OS kills the process without allowing it to clean up after itself
2) The postmaster is the master PostgreSQL server process, the parent of
all the backend processes. If you want to kill a single query you would
not want to kill that.
Regards // Mike
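[Editor's note: the distinction Mike draws between signals a process can handle and 'kill -9' is easy to demonstrate outside PostgreSQL. A minimal sketch in plain POSIX shell, nothing Postgres-specific; the trapping child stands in for a backend only by analogy.]

```shell
#!/bin/sh
# A process can trap SIGTERM (the default signal of `kill`) and clean up
# before exiting, but SIGKILL (`kill -9`) can never be trapped, so no
# cleanup of any kind is possible.

sh -c 'trap "echo cleanup done; exit 0" TERM; while :; do sleep 1; done' &
pid=$!
sleep 1                     # give the child time to install its trap
kill "$pid"                 # default signal is SIGTERM (15): the trap runs
wait "$pid"
echo "after SIGTERM: exit status $?"        # 0, thanks to the trap

sh -c 'while :; do sleep 1; done' &
pid=$!
sleep 1
kill -9 "$pid"              # SIGKILL: the OS removes the process outright
wait "$pid" || echo "after SIGKILL: exit status $?"   # 128+9 = 137
```

This is why 'kill -9' on the postmaster is dangerous: it is the one signal that guarantees no shared-memory cleanup ever happens.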
From: Stefan Schwarzer [mailto:stefan.schwarzer@grid.unep.ch]
Sent: Wednesday, 24 October 2007 3:58 PM
To: pgsql-general@postgresql.org
Subject: (Never?) Kill Postmaster?
Hi there,
I read dozens of times the "TIP 2: Don't 'kill -9' the postmaster"...
Now, what am I supposed to do if I launched a query which takes ages, and which I want to interrupt?
Thanks for any advice,
Stef
2007/10/24, Christian Schröder <cs@deriva.de>:
> Ow Mun Heng wrote:
> > look for the query's procpid and then issue a select
> > pg_cancel_backend('the_id')
>
> Does it do any harm if I kill (either with signal 9 or signal 15) the
> single backend process (and not the postmaster)?

Shared memory can end up corrupted, and some garbage may be left behind;
it depends on the query.

Pavel

> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
"Michael Harris" <michael.harris@ericsson.com> writes:
> The tip is about 'kill -9' on the postmaster, which has two important
> differences from the scenario I just described:
> 1) kill -9 means the OS kills the process without allowing it to clean
> up after itself
> 2) The postmaster is the master PostgreSQL server process. If you want
> to kill a single query you would not want to kill that.

Right: the tip is to not kill -9 the parent process; it's not saying
anything about child server processes.

If you've got a child process that's unresponsive to SIGINT then you can
send it a SIGKILL instead; the downside is that this will force a
restart of the other children too, that is, you're interrupting all
database sessions, not only the one. But Postgres will recover
automatically, and I don't think I've ever heard of anyone getting data
corruption as a result of such a thing.

SIGKILL on the parent is an entirely different thing. You'll have to
manually restart Postgres, possibly do some manual cleanup, and there's
a small but nonzero chance of ensuing data corruption ... especially if
you fat-finger any of the manual steps. Plus there simply isn't any good
reason to do it. The postmaster should always respond to more gentle
shutdown signals, because it doesn't run any user-provided commands that
could send it off into the weeds. Hence the TIP.

			regards, tom lane
In response to Stefan Schwarzer <stefan.schwarzer@grid.unep.ch>:
> Now, what am I supposed to do if I launched a query which takes ages,
> and which I want to interrupt?

Tracing through the other threads, this is obviously not a regular
wayward query, but one that won't die by the normal methods.

Unfortunately, I came across this recently, and the only solution I
found was to do a "pg_ctl restart -m i" (yes, I tried -m f first).
Luckily, the db in question was such that the front ends didn't suffer
horribly from this and reconnected, and the database finished up its
recovery in a timely manner.

Hopefully, I can generate a reproducible example so I can file a bug,
but haven't gotten that far with it yet.

--
Bill Moran
http://www.potentialtech.com
Tom Lane wrote:
> Right: the tip is to not kill -9 the parent process; it's not saying
> anything about child server processes.
>
> If you've got a child process that's unresponsive to SIGINT then you
> can send it a SIGKILL instead; the downside is that this will force a
> restart of the other children too, that is you're interrupting all
> database sessions not only the one. But Postgres will recover
> automatically and I don't think I've ever heard of anyone getting data
> corruption as a result of such a thing.

I have been in exactly this situation today: One statement took several
hours to complete, so it had to be cancelled. I tried a
"pg_cancel_backend" and a "kill -2" (which means SIGINT on our Linux
box), but nothing happened. Since I remembered this thread, I tried a
"kill -9" on this child process. As you described, all other connections
were reset too, and this was the message in the server log:

<2007-10-31 22:48:28 CET - chschroe> WARNING:  terminating connection because of crash of another server process
<2007-10-31 22:48:28 CET - chschroe> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

But then, when I tried to reconnect to the database, I received the
following message:

<2007-10-31 22:50:01 CET - chschroe> FATAL:  the database system is in recovery mode

Ok, you wrote "Postgres will recover automatically", but could this take
several minutes? Is that what "recovery mode" means?
When nothing seemed to happen for several minutes, I performed a
(fortunately clean) restart of the whole server. The log messages for
the server restart looked normal to me:

<2007-10-31 22:53:15 CET - > LOG:  received smart shutdown request
<2007-10-31 22:53:21 CET - > LOG:  all server processes terminated; reinitializing
<2007-10-31 22:53:58 CET - > LOG:  database system was interrupted at 2007-10-31 22:46:46 CET
<2007-10-31 22:53:58 CET - > LOG:  checkpoint record is at 153/FE9FAF20
<2007-10-31 22:53:58 CET - > LOG:  redo record is at 153/FE9FAF20; undo record is at 0/0; shutdown FALSE
<2007-10-31 22:53:58 CET - > LOG:  next transaction ID: 0/128715865; next OID: 58311787
<2007-10-31 22:53:58 CET - > LOG:  next MultiXactId: 4704; next MultiXactOffset: 9414
<2007-10-31 22:53:58 CET - > LOG:  database system was not properly shut down; automatic recovery in progress
<2007-10-31 22:53:58 CET - > LOG:  redo starts at 153/FE9FAF70
<2007-10-31 22:53:58 CET - > LOG:  record with zero length at 153/FEA05E70
<2007-10-31 22:53:58 CET - > LOG:  redo done at 153/FEA05E40
<2007-10-31 22:53:58 CET - > LOG:  database system is ready

I hope that no data got corrupted. Is there any way to check this?

What is the conclusion of this experience? Is it, contrary to the above
statements, dangerous to kill (-9) a subprocess?

Regards,
Christian
Christian Schröder <cs@deriva.de> writes:
> But then, when I tried to reconnect to the database, I received the
> following message:
> <2007-10-31 22:50:01 CET - chschroe> FATAL:  the database system is in
> recovery mode
> Ok, you wrote "Postgres will recover automatically", but could this take
> several minutes?

Yeah, potentially. I don't suppose you have any idea how long it'd been
since your last checkpoint, but what do you have checkpoint_timeout and
checkpoint_segments set to?

> What is the conclusion of this experience?

That you have an itchy trigger finger ;-). It looks from the log that
you needed to wait only a few seconds longer for the recovery to have
completed.

What I'd like to know about is why the child process was unresponsive to
SIGINT in the first place. There's little we can do about long-running
plpython functions, for instance, but if it was looping in Postgres
proper then we should do something about that. Can you reproduce this
problem easily?

			regards, tom lane
Tom Lane wrote:
>> Ok, you wrote "Postgres will recover automatically", but could this take
>> several minutes?
>
> Yeah, potentially. I don't suppose you have any idea how long it'd been
> since your last checkpoint, but what do you have checkpoint_timeout and
> checkpoint_segments set to?

I did not change these parameters from their default values, so
checkpoint_timeout is 5 min and checkpoint_segments is 8.

> What I'd like to know about is why the child process was unresponsive to
> SIGINT in the first place. [...] Can you reproduce this problem easily?

Unfortunately not. I have tried the same query and it took only about
1 sec to complete. In fact, it's a simple seq scan with a single filter
condition. No user-defined functions are involved.

Maybe it has something to do with the users connecting from their
Windows machines to the PostgreSQL server using psqlODBC. On the other
hand, it is not the first time that such a user connection had to be
terminated, and we never experienced this problem before.

If I see the phenomenon again I will use strace or something similar to
find out what the backend process is doing.

Regards,
Christian
On 10/31/07, Christian Schröder <cs@deriva.de> wrote:
> Unfortunately not. I have tried the same query and it took only about 1
> sec to complete. In fact, it's a simple seq scan with a single filter
> condition. No user defined functions are involved.
>
> If I see the phenomenon again I will use strace or something similar to
> find out what the backend process is doing.

Tom, is it possible the backend was doing something that couldn't be
immediately interrupted, like a long wait on IO or something?
"Scott Marlowe" <scott.marlowe@gmail.com> writes:
> Tom, is it possible the backend was doing something that couldn't be
> immediately interrupted, like a long wait on IO or something?

Sherlock Holmes said that theorizing in advance of the data is a capital
mistake ...

What we can be reasonably certain of is that that backend wasn't
reaching any CHECK_FOR_INTERRUPTS() macros. Whether it was hung up
waiting for something, or caught in a tight loop somewhere, is
impossible to say without more data than we have. AFAIR the OP didn't
even mention whether the backend appeared to be consuming CPU cycles
(which'd be a pretty fair tip about which of those to believe, but still
not enough to guess *where* the problem is). A gdb backtrace would tell
us more.

			regards, tom lane
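[Editor's note: for readers unfamiliar with the mechanism Tom refers to: PostgreSQL's query-cancel handler does not abort the query directly; it sets a flag, and the query stops only when execution next passes a CHECK_FOR_INTERRUPTS() call. Below is a rough shell analogue of that pattern, an illustration and not PostgreSQL source; SIGTERM is used here only because it traps reliably in scripts, whereas the backend uses SIGINT for cancel.]

```shell
#!/bin/sh
# Rough analogue of the CHECK_FOR_INTERRUPTS() pattern: the signal
# handler does minimal work (sets a flag); the "query" loop honors the
# flag at the next safe point.

cancel_pending=0
trap 'cancel_pending=1' TERM          # handler only records the request

( sleep 1; kill -TERM $$ ) &          # simulate a cancel arriving later

i=0
while :; do
    if [ "$cancel_pending" = 1 ]; then     # the CHECK_FOR_INTERRUPTS point
        echo "query cancelled at iteration $i"
        break
    fi
    i=$((i + 1))
done
# If the loop were instead blocked in an uninterruptible call and never
# reached the check, the cancel request would simply never take effect,
# which is exactly the stuck-backend symptom in this thread.
```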
Tom Lane wrote:
> What we can be reasonably certain of is that that backend wasn't
> reaching any CHECK_FOR_INTERRUPTS() macros. [...] A gdb backtrace
> would tell us more.

It happened again! I'm not sure if I should be happy because we can now
maybe find the cause of the problem, or worried because it's our
production database ...

At least the process doesn't seem to consume CPU (it doesn't show up in
"top"), so I won't kill it this time, but instead try to get all the
information that you guys need. What I already did was an strace with
the following result:

db2:/home/pgsql/data # strace -p 7129
Process 7129 attached - interrupt to quit
futex(0x994000, FUTEX_WAIT, 2, NULL)    = -1 EINTR (Interrupted system call)
--- SIGINT (Interrupt) @ 0 (0) ---
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
futex(0x994000, FUTEX_WAIT, 2, NULL

That interrupt will have been the script that tries to remove
long-running queries. The same lines seem to repeat over and over again.

Then I attached gdb to the process and printed out a backtrace:

db2:/home/pgsql/data # gdb --pid=7129
GNU gdb 6.5
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
Attaching to process 7129
Reading symbols from /usr/local/pgsql_8.2.5/bin/postgres...done.
Using host libthread_db library "/lib64/libthread_db.so.1".
Reading symbols from /lib64/libcrypt.so.1...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/local/pgsql_8.2.5/lib/plpgsql.so...done.
Loaded symbols for /usr/local/pgsql_8.2.5/lib/plpgsql.so
Reading symbols from /usr/local/pgsql_8.2.5/lib/plperl.so...done.
Loaded symbols for /usr/local/pgsql_8.2.5/lib/plperl.so
Reading symbols from /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/CORE/libperl.so...done.
Loaded symbols for /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/CORE/libperl.so
Reading symbols from /lib64/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 47248855881456 (LWP 7129)]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/auto/Opcode/Opcode.so...done.
Loaded symbols for /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/auto/Opcode/Opcode.so
0x00002af904809a68 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00002af904809a68 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
#1  0x00002af904806e88 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#2  0x00002af8fb13de23 in _nl_find_msg () from /lib64/libc.so.6
#3  0x00002af8fb13ec83 in __dcigettext () from /lib64/libc.so.6
#4  0x00002af8fb186f0b in strerror_r () from /lib64/libc.so.6
#5  0x00002af8fb186d33 in strerror () from /lib64/libc.so.6
#6  0x00000000005f4daa in expand_fmt_string ()
#7  0x00000000005f6d14 in errmsg ()
#8  0x00000000005182cc in internal_flush ()
#9  0x00000000005183b6 in internal_putbytes ()
#10 0x000000000051841c in pq_putmessage ()
#11 0x00000000005199c4 in pq_endmessage ()
#12 0x0000000000440c6a in printtup ()
#13 0x00000000004fc1b8 in ExecutorRun ()
#14 0x0000000000580451 in PortalRunSelect ()
#15 0x0000000000581446 in PortalRun ()
#16 0x000000000057d625 in exec_simple_query ()
#17 0x000000000057ea72 in PostgresMain ()
#18 0x0000000000558218 in ServerLoop ()
#19 0x0000000000558db8 in PostmasterMain ()
#20 0x000000000051a213 in main ()

Do you need anything else? Can you still tell what's happening?

Regards,
Christian
Christian Schröder <cs@deriva.de> writes:
> db2:/home/pgsql/data # gdb --pid=7129
> [... symbol-loading output identical to the previous message snipped ...]
> Loaded symbols for /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/auto/Opcode/Opcode.so
> 0x00002af904809a68 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
> (gdb) bt
> #0  0x00002af904809a68 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
> #1  0x00002af904806e88 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
> #2  0x00002af8fb13de23 in _nl_find_msg () from /lib64/libc.so.6
> #3  0x00002af8fb13ec83 in __dcigettext () from /lib64/libc.so.6
> #4  0x00002af8fb186f0b in strerror_r () from /lib64/libc.so.6
> #5  0x00002af8fb186d33 in strerror () from /lib64/libc.so.6
> #6  0x00000000005f4daa in expand_fmt_string ()
> #7  0x00000000005f6d14 in errmsg ()
> #8  0x00000000005182cc in internal_flush ()
> #9  0x00000000005183b6 in internal_putbytes ()
> #10 0x000000000051841c in pq_putmessage ()
> #11 0x00000000005199c4 in pq_endmessage ()
> #12 0x0000000000440c6a in printtup ()
> #13 0x00000000004fc1b8 in ExecutorRun ()
> #14 0x0000000000580451 in PortalRunSelect ()
> #15 0x0000000000581446 in PortalRun ()
> #16 0x000000000057d625 in exec_simple_query ()
> #17 0x000000000057ea72 in PostgresMain ()
> #18 0x0000000000558218 in ServerLoop ()
> #19 0x0000000000558db8 in PostmasterMain ()
> #20 0x000000000051a213 in main ()

Too bad you don't have debug symbols :-(.  However, this does raise a
couple of interesting questions:

* The only place internal_flush would call errmsg is here:

	ereport(COMMERROR,
			(errcode_for_socket_access(),
			 errmsg("could not send data to client: %m")));

So why is it unable to send data to the client?

* How the heck would strerror() be blocking on a thread mutex, when the
backend is single-threaded?

We recently discovered that it was possible for pltcl to cause the
backend to become multithreaded:
http://archives.postgresql.org/pgsql-patches/2007-09/msg00194.php
I see from your dump that this session has been using plperl, so I
wonder whether plperl has the same sort of issue.
Can you determine exactly what's been done with plperl in this session? Also, can you confirm that there is actually more than one thread active in this process? On Linux "ps -Lw" would show threads. regards, tom lane
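[Editor's note: besides "ps -Lw", the per-process thread count is visible in the NLWP column of ps (or the "Threads:" line in /proc/<pid>/status on Linux). A quick sketch, run here against the current shell rather than a backend pid, which is a placeholder substitution you would make yourself:]

```shell
#!/bin/sh
# Show the thread count (NLWP) of a process without attaching gdb.
# A plain /bin/sh is single-threaded, so this prints an NLWP of 1;
# substitute a backend's pid to check whether plperl spawned threads.
ps -o pid=,nlwp=,comm= -p $$
```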
Tom Lane wrote:
> * The only place internal_flush would call errmsg is here:
>
> 	ereport(COMMERROR,
> 			(errcode_for_socket_access(),
> 			 errmsg("could not send data to client: %m")));
>
> So why is it unable to send data to the client?

The user accesses the database from his home office. He is connected to
the internet using a dial-in connection which seems to be disconnected
from time to time. The problem seems to be specific to this user, so it
may have something to do with his type of connection.

> I see from your dump that this session has been using plperl, so I
> wonder whether plperl has the same sort of issue. Can you determine
> exactly what's been done with plperl in this session?

Not exactly. There are several triggers whose functions are implemented
in Perl:

    elog(ERROR, 'set_serial needs at least 2 arguments.') if $_TD->{argc} < 2;
    my ($colname, $seqname) = @{$_TD->{args}};
    my $result = spi_exec_query("SELECT nextval('$seqname')");
    $_TD->{new}{$colname} = $result->{rows}[0]{nextval};
    return 'MODIFY';

and

    elog(ERROR, 'log_changes needs at least 2 arguments.') if $_TD->{argc} < 2;
    my @args = @{$_TD->{args}};
    shift(@args) =~ /([^.]*(?=\.))?\.?(.*)/;
    my $log_table = defined($1) ? qq("$1"."$2") : qq("$2");
    if ($_TD->{event} eq 'UPDATE') {
        my $id = join(',', @{$_TD->{old}}{@args});
        $id =~ s/'/\\'/g;
        while (my ($key, $value) = each %{$_TD->{old}}) {
            if ($value ne $_TD->{new}{$key}) {
                my $newvalue = $_TD->{new}{$key};
                $value    =~ s/'/\\'/g;
                $newvalue =~ s/'/\\'/g;
                my $query = sprintf(qq(INSERT INTO $log_table VALUES
                    (current_timestamp, session_user, '%s', '%s', '%s', '%s', '%s')),
                    $id, $_TD->{relname}, $key, $value, $newvalue);
                spi_exec_query($query);
            }
        }
    } else {
        my $id = join(',', @{$_TD->{$_TD->{event} eq 'INSERT' ? 'new' : 'old'}}{@args});
        $id =~ s/'/\\'/g;
        my $query = sprintf(qq(INSERT INTO $log_table VALUES
            (current_timestamp, session_user, '%s', '%s', '(%s)', null, null)),
            $id, $_TD->{relname}, $_TD->{event});
        spi_exec_query($query);
    }
    return;

> Also, can you confirm that there is actually more than one thread active
> in this process?  On Linux "ps -Lw" would show threads.

Since "ps -Lw" only showed me my own processes, I did
"ps -AT -o pid,ppid,lwp,nlwp,command" instead. This is the relevant
line:

  PID  PPID   LWP NLWP COMMAND
 7129  3934  7129    1 postgres: dpyrek uspi 88.70.241.205(10612) SELECT

I'm not sure if this data is correct, because NLWP (the number of
threads in the process) is 1 for all processes.

Is it necessary to recompile the server with debugging information
enabled? As I already mentioned, this is a production database, so I
would have to do it by night, which would be somewhat uncomfortable for
me ... not to mention the bad feeling when changing a production
database server ...

Regards,
Christian
Christian Schröder wrote:
> Since "ps -Lw" only showed me my own processes, I did "ps -AT -o
> pid,ppid,lwp,nlwp,command" instead. [...] I'm not sure if this data is
> correct, because NLWP (the number of threads in the process) is 1 for
> all processes.

Please try "thread apply all bt full" in gdb.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote:
> Please try "thread apply all bt full" in gdb.

The first lines, where the symbols are loaded, are of course identical.
The output of the command is in my opinion not very helpful:

(gdb) thread apply all bt full

Thread 1 (Thread 47248855881456 (LWP 7129)):
#0  0x00002af904809a68 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00002af904806e88 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
No symbol table info available.
#2  0x00002af8fb13de23 in _nl_find_msg () from /lib64/libc.so.6
No symbol table info available.
#3  0x00002af8fb13ec83 in __dcigettext () from /lib64/libc.so.6
No symbol table info available.
#4  0x00002af8fb186f0b in strerror_r () from /lib64/libc.so.6
No symbol table info available.
#5  0x00002af8fb186d33 in strerror () from /lib64/libc.so.6
No symbol table info available.
#6  0x00000000005f4daa in expand_fmt_string ()
No symbol table info available.
#7  0x00000000005f6d14 in errmsg ()
No symbol table info available.
#8  0x00000000005182cc in internal_flush ()
No symbol table info available.
#9  0x00000000005183b6 in internal_putbytes ()
No symbol table info available.
#10 0x000000000051841c in pq_putmessage ()
No symbol table info available.
#11 0x00000000005199c4 in pq_endmessage ()
No symbol table info available.
#12 0x0000000000440c6a in printtup ()
No symbol table info available.
#13 0x00000000004fc1b8 in ExecutorRun ()
No symbol table info available.
#14 0x0000000000580451 in PortalRunSelect ()
No symbol table info available.
#15 0x0000000000581446 in PortalRun ()
No symbol table info available.
#16 0x000000000057d625 in exec_simple_query ()
No symbol table info available.
#17 0x000000000057ea72 in PostgresMain ()
No symbol table info available.
#18 0x0000000000558218 in ServerLoop ()
No symbol table info available.
#19 0x0000000000558db8 in PostmasterMain ()
No symbol table info available.
#20 0x000000000051a213 in main ()
No symbol table info available.
#0  0x00002af904809a68 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0

What does this tell you?

Regards,
Christian
Christian Schröder wrote:
> Alvaro Herrera wrote:
>> Please try "thread apply all bt full" in gdb.
>
> The first lines, where the symbols are loaded, are of course identical.
> The output of the command is in my opinion not very helpful:

I was actually hoping that it would list the running threads in the
process. It would have sufficed to say "info threads", but the "bt full"
command would have given more detail in case there were more.

I find it a bit strange that it tells you that there is one thread; in a
simple try here, it doesn't say anything at all:

(gdb) info threads
(gdb)

In any case, I can see from the files from which symbols are loaded that
some of the Perl stuff seems to be multithreaded ... maybe that's where
the problem comes from.

Hmm, after creating a plperl function it is indeed different:

(gdb) info threads
* 1 Thread 0x2ba8945f3e20 (LWP 3606)  0x00002ba894370645 in recv () from /lib/libc.so.6
(gdb)

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Hi all,

any news about this issue? Anything else that I can do to help you?

Meanwhile there are 4 connections in the same state. (I did not do the
whole investigation on all 4, but since they all do not respond to a
SIGINT I assume that they all have the same problem.) It may also be
interesting that the 4 processes belong to the same two users that
already caused this problem earlier. Maybe it really has something to do
with their unstable internet connection? (I mentioned this in an earlier
mail.)

I have also noticed that one of these two users has many open
connections which are all idle. I guess that those connections are in
fact dead, but the database did not close them for some reason. The
pg_stat_activity entries for the corresponding backend processes are as
follows:

 procpid | usename | current_query | waiting |          query_start          |         backend_start
---------+---------+---------------+---------+-------------------------------+-------------------------------
   26033 | dpyrek  | <IDLE>        | f       | 2007-11-08 10:21:01.555232+01 | 2007-11-08 09:55:01.59932+01
   18331 | dpyrek  | <IDLE>        | f       | 2007-11-07 11:34:24.968852+01 | 2007-11-07 11:08:29.043762+01
   18940 | dpyrek  | <IDLE>        | f       | 2007-11-07 14:29:52.987176+01 | 2007-11-07 13:14:48.609031+01
   25868 | dpyrek  | <IDLE>        | f       | 2007-11-08 09:47:46.938991+01 | 2007-11-08 09:13:34.101351+01
    6719 | dpyrek  | <IDLE>        | f       | 2007-11-06 12:06:14.875588+01 | 2007-11-06 11:10:00.566644+01
   17987 | dpyrek  | <IDLE>        | f       | 2007-11-07 10:31:50.517275+01 | 2007-11-07 10:11:07.310338+01
   31808 | dpyrek  | <IDLE>        | f       | 2007-11-08 22:55:03.931727+01 | 2007-11-08 22:55:03.766638+01
   25484 | dpyrek  | <IDLE>        | f       | 2007-11-08 08:32:57.265377+01 | 2007-11-08 07:44:30.845967+01
    5972 | dpyrek  | <IDLE>        | f       | 2007-11-06 08:51:54.57437+01  | 2007-11-06 08:14:03.560602+01
    6241 | dpyrek  | <IDLE>        | f       | 2007-11-06 09:59:02.018452+01 | 2007-11-06 09:20:49.092246+01
    6136 | dpyrek  | <IDLE>        | f       | 2007-11-06 09:14:40.729837+01 | 2007-11-06 08:57:29.55187+01
   12645 | dpyrek  | <IDLE>        | f       | 2007-11-02 10:08:24.856929+01 | 2007-11-02 09:35:37.640976+01
   25254 | dpyrek  | <IDLE>        | f       | 2007-11-08 07:29:04.547081+01 | 2007-11-08 06:33:47.707759+01
   20275 | dpyrek  | <IDLE>        | f       | 2007-11-03 09:14:12.73829+01  | 2007-11-03 08:57:05.555972+01
   20216 | dpyrek  | <IDLE>        | f       | 2007-11-03 08:46:40.555354+01 | 2007-11-03 08:40:31.756993+01
   12435 | dpyrek  | <IDLE>        | f       | 2007-11-02 09:28:53.361365+01 | 2007-11-02 08:48:11.589485+01
   19633 | dpyrek  | <IDLE>        | f       | 2007-11-03 08:34:16.263487+01 | 2007-11-03 05:46:16.811069+01
   12156 | dpyrek  | <IDLE>        | f       | 2007-11-02 08:10:11.558427+01 | 2007-11-02 07:49:03.442489+01
    4899 | dpyrek  | <IDLE>        | f       | 2007-11-01 12:42:30.880391+01 | 2007-11-01 10:56:18.513398+01
   11988 | dpyrek  | <IDLE>        | f       | 2007-11-02 07:38:10.315758+01 | 2007-11-02 07:02:52.438251+01
    4490 | dpyrek  | <IDLE>        | f       | 2007-11-01 09:51:42.216745+01 | 2007-11-01 09:34:18.63771+01

An strace of one of these connections yields the following result:

db2:/home/pgsql/data # strace -p 4899
Process 4899 attached - interrupt to quit
futex(0x994000, FUTEX_WAIT, 2, NULL

This looks identical (even with the same uaddr parameter) for the other
processes. In the log file I find many lines like this:

<2007-11-08 22:56:19 CET - dpyrek> LOG:  could not receive data from client: Die Wartezeit für die Verbindung ist abgelaufen
<2007-11-08 22:56:19 CET - dpyrek> LOG:  unexpected EOF on client connection

(The German message translates to "the connection timed out".) I'm not
sure that these lines correspond to the dead connections, but at least
it's the same user.

Does this additional information help you in any way? I'm a little bit
afraid that eventually the maximum number of connections will be
exceeded. I don't want to "kill -9" the processes, because the last time
I did this the database was in recovery mode for a substantial amount of
time.

Any help is still highly appreciated!

Regards,
Christian

P.S.: If nothing else helps, I could also grant one of you guys root
access to the database machine.
=?ISO-8859-1?Q?Christian_Schr=F6der?= <cs@deriva.de> writes: > I don't want to "kill -9" the processes because the last time > I did this the database was in recovery mode for a substantial amount of > time. A useful tip on that: if you perform a manual CHECKPOINT just before issuing the kills, recovery time should be minimal. regards, tom lane
=?ISO-8859-1?Q?Christian_Schr=F6der?= <cs@deriva.de> writes:
> any news about this issue?

Not a lot. I've been able to confirm on my own Fedora 6 machine that the mere act of loading plperl.so into the backend causes the process to have one thread instead of no threads, as reported by gdb. How relevant that is isn't clear.

I experimented with executing a plperl function and then kill -9'ing the connected psql frontend process while executing a query that returned a lot of output. But all I could get out of that was

LOG: could not send data to client: Broken pipe
STATEMENT: select * from generate_series(1,1000000);
LOG: could not receive data from client: Connection reset by peer
LOG: unexpected EOF on client connection

The first of these lines shows that control passed unscathed through the place where your stack trace shows it was hung up. So there is something different between your situation and my test. It could be that the flaky-internet-connection case is different from my kill-the-client test, but it's pretty hard to see how that would matter in userland, especially not for the act of calling strerror() after control has already returned from the kernel.

What I think is that the perl stuff your session has done has included some action that changed the condition of the backend process ... exactly what, I have no idea. Can you show us the plperl functions that were used in these sessions?

regards, tom lane
Tom Lane wrote:
> =?ISO-8859-1?Q?Christian_Schr=F6der?= <cs@deriva.de> writes:
>> I don't want to "kill -9" the processes because the last time
>> I did this the database was in recovery mode for a substantial amount of
>> time.
>
> A useful tip on that: if you perform a manual CHECKPOINT just before
> issuing the kills, recovery time should be minimal.

Thanks for the tip. For the moment I will leave the processes in their current state so that I can perform further tests in case you need more data. But once I start feeling uncomfortable with the situation I will follow your advice.

Regards, Christian
Tom Lane wrote:
> control has already returned from the kernel. What I think is that the
> perl stuff your session has done has included some action that changed
> the condition of the backend process ... exactly what, I have no idea.
> Can you show us the plperl functions that were used in these sessions?

Considering the skill level of the users I don't think that they explicitly used any perl function. So the only remaining perl stuff is the triggers that I mentioned earlier. There are two triggers in the database to which the users were connected (see http://archives.postgresql.org/pgsql-general/2007-11/msg00245.php). There are three more triggers in other databases which might have an impact:

=============================
if ($_TD->{argc} != 1) {
    elog(ERROR, 'do_fast_export_log() needs the table\'s primary key as sole argument, multiple keys are not yet supported.');
}
my $table = $_TD->{table_name};
my $id = $_TD->{old}{$_TD->{args}[0]};
while (my ($col, $value) = each %{$_TD->{old}}) {
    my $oldVal = $_TD->{old}{$col};
    my $newVal = $_TD->{new}{$col};
    $oldVal =~ s/'/''/g;
    $oldVal =~ s/\t/\\\\t/g;
    $newVal =~ s/'/''/g;
    $newVal =~ s/\t/\\\\t/g;
    if ($oldVal ne $newVal) {
        my $data = "$id\t$table\t$col\t$oldVal\t$newVal";
        my $query = spi_exec_query("insert into fast_export(data) values(E'$data')");
    }
}
spi_exec_query("notify fast_export");
return;
=============================
elog(ERROR, 'store_session_user needs the field name as argument.') if $_TD->{argc} < 1;
my ($field) = @{$_TD->{args}};
elog(ERROR, "Field '$field' does not exist in target table.\n") unless exists $_TD->{new}{$field};
my $query = spi_exec_query('SELECT session_user', 1);
$_TD->{new}{$field} = $query->{rows}[0]{session_user};
return 'MODIFY';
=============================
if ($_TD->{argc} > 0) {
    foreach (@{$_TD->{args}}) {
        $_TD->{new}{$_} = $_TD->{old}{$_} if exists $_TD->{old}{$_};
    }
    return 'MODIFY';
} else {
    return;
}
=============================

Regards, Christian
=?ISO-8859-1?Q?Christian_Schr=F6der?= <cs@deriva.de> writes:
> [ ongoing saga ]

I don't think you ever mentioned exactly what platform you're running on; it seems to be some 64-bit Linux variant but you didn't say which. I've been futilely trying to reproduce the hang on a Xeon Fedora Core 6 machine, and just now I noticed what might be a significant discrepancy. You showed this in a gdb printout:

> Reading symbols from /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/CORE/libperl.so...done.
> Loaded symbols for /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/CORE/libperl.so
> Reading symbols from /lib64/libpthread.so.0...done.
> [Thread debugging using libthread_db enabled]
> [New Thread 47248855881456 (LWP 7129)]
> Loaded symbols for /lib64/libpthread.so.0
> Reading symbols from /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/auto/Opcode/Opcode.so...done.
> Loaded symbols for /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/auto/Opcode/Opcode.so

On my machine the corresponding lines point to /usr/lib64:

Reading symbols from /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/CORE/libperl.so...done.
Loaded symbols for /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/CORE/libperl.so
Reading symbols from /lib64/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 46912496247568 (LWP 344)]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/auto/Opcode/Opcode.so...done.
Loaded symbols for /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/auto/Opcode/Opcode.so

Maybe this is just a cross-distribution difference in file layouts, but I'm suddenly wondering if there's a 32-vs-64-bit issue here. Exactly which perl packages have you got installed?

regards, tom lane
Tom Lane wrote:
> I don't think you ever mentioned exactly what platform you're running
> on; it seems to be some 64-bit Linux variant but you didn't say which.

The machine has two dual-core Xeon 5130 CPUs. The OS is openSUSE 10.2 (x86-64). The output of uname -a is:

Linux db2 2.6.18.8-0.7-default #1 SMP Tue Oct 2 17:21:08 UTC 2007 x86_64 x86_64 x86_64 GNU/Linux

Did I mention the PostgreSQL version is 8.2.5?

> On my machine the corresponding lines point to /usr/lib64:

You are right. Of course the directory /usr/lib64 exists and contains lots of shared libraries and other stuff, but indeed no perl5 directory.

> Maybe this is just a cross-distribution difference in file layouts,
> but I'm suddenly wondering if there's a 32-vs-64-bit issue here.
> Exactly which perl packages have you got installed?

rpm says: "perl-5.8.8-32". The somewhat lengthy output of "perl -V" is:

Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=linux, osvers=2.6.18, archname=x86_64-linux-thread-multi
    uname='linux eisler 2.6.18 #1 smp tue nov 21 12:59:21 utc 2006 x86_64 x86_64 x86_64 gnulinux '
    config_args='-ds -e -Dprefix=/usr -Dvendorprefix=/usr -Dinstallusrbinperl -Dusethreads -Di_db -Di_dbm -Di_ndbm -Di_gdbm -Duseshrplib=true -Doptimize=-O2 -fmessage-length=0 -Wall -D_FORTIFY_SOURCE=2 -g -Wall -pipe'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=define use64bitall=define uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -pipe -Wdeclaration-after-statement -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -fmessage-length=0 -Wall -D_FORTIFY_SOURCE=2 -g -Wall -pipe',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -pipe -Wdeclaration-after-statement'
    ccversion='', gccversion='4.1.2 20061115 (prerelease) (SUSE Linux)', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib64'
    libpth=/lib64 /usr/lib64 /usr/local/lib64
    libs=-lm -ldl -lcrypt -lpthread
    perllibs=-lm -ldl -lcrypt -lpthread
    libc=/lib64/libc-2.5.so, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.5'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib64'

Characteristics of this binary (from libperl):
  Compile-time options: DEBUGGING MULTIPLICITY PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP THREADS_HAVE_PIDS USE_64_BIT_ALL USE_64_BIT_INT USE_ITHREADS USE_LARGE_FILES USE_PERLIO USE_REENTRANT_API
  Built under linux
  Compiled at Nov 25 2006 11:02:03
  @INC:
    /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi
    /usr/lib/perl5/5.8.8
    /usr/lib/perl5/site_perl/5.8.8/x86_64-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.8
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.8
    /usr/lib/perl5/vendor_perl
    .

One more piece of information about the current situation: Besides the processes that are stuck in a statement (meanwhile I have five, and the most recent belongs to yet another user) I also see many idle connections. I'm pretty sure that the users are out of the office at the moment and their workstations are shut down, so the connections should not exist at all. Since I now have a total of 69 connections and already touched the connection limit yesterday, I will now restart the server. Thus, I won't be able to provide any more detailed information until the problem arises again.
Of course, I can still give you general information about the system.

Regards, Christian
=?ISO-8859-1?Q?Christian_Schr=F6der?= <cs@deriva.de> writes:
> Tom Lane wrote:
>> Maybe this is just a cross-distribution difference in file layouts,
>> but I'm suddenly wondering if there's a 32-vs-64-bit issue here.
>> Exactly which perl packages have you got installed?

> rpm says: "perl-5.8.8-32". The somewhat lengthy output of "perl -V" is:
> [snip]

OK, it looks like it is a 64-bit build but they intentionally put the perl stuff under /usr/lib anyway. So that was a red herring.

One thing I notice in comparing your output against what I see on Fedora is:

> Compiler:
>     cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS
                                                    ^^^^^^^^^^^^^^^^^^^

There's no such #define mentioned in the Red Hat configuration ... I wonder what that does? I wouldn't normally think this significant, but given that we suspect the problem is somehow related to threading, it kind of draws the eyes ...

> Since I have now a total of 69 connections and already touched the
> connection limit yesterday I will now restart the server. Thus, I won't
> be able to provide any more detailed information until the problem
> arises again.

OK. For the moment I confess bafflement. You had offered access to your system to probe more carefully --- once you've built up two or three stuck processes again, I would like to take a look.

regards, tom lane
Tom Lane wrote:
> OK. For the moment I confess bafflement. You had offered access
> to your system to probe more carefully --- once you've built up
> two or three stuck processes again, I would like to take a look.

Although I do not yet have any processes that are stuck inside a statement, there are some that are idle but do not respond to SIGINT or even SIGTERM. Is this sufficient? Then I could change the root password and send all the login data to you.

Regards, Christian
=?ISO-8859-1?Q?Christian_Schr=F6der?= <cs@deriva.de> writes: > Although I do not yet have any processes that are stuck inside a > statement, there are some that are idle, but do not respond to SIGINT or > even SIGTERM. Is this sufficient? Dunno. Have you looked at their call stacks with gdb? regards, tom lane
Tom Lane wrote:
> =?ISO-8859-1?Q?Christian_Schr=F6der?= <cs@deriva.de> writes:
>> Although I do not yet have any processes that are stuck inside a
>> statement, there are some that are idle, but do not respond to SIGINT or
>> even SIGTERM. Is this sufficient?
>
> Dunno. Have you looked at their call stacks with gdb?

(gdb) bt
#0  0x00002b24aeee0a68 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
#1  0x00002b24aeedde88 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#2  0x00002b24a5814e23 in _nl_find_msg () from /lib64/libc.so.6
#3  0x00002b24a5815c83 in __dcigettext () from /lib64/libc.so.6
#4  0x00002b24a585df0b in strerror_r () from /lib64/libc.so.6
#5  0x00002b24a585dd33 in strerror () from /lib64/libc.so.6
#6  0x00000000005f4daa in expand_fmt_string ()
#7  0x00000000005f6d14 in errmsg ()
#8  0x00000000005185f3 in pq_recvbuf ()
#9  0x0000000000518987 in pq_getbyte ()
#10 0x000000000057eb69 in PostgresMain ()
#11 0x0000000000558218 in ServerLoop ()
#12 0x0000000000558db8 in PostmasterMain ()
#13 0x000000000051a213 in main ()

Seems to be the same as for the processes that were stuck inside a statement.

I recompiled the server with debugging symbols enabled and then did the following experiment: I started a query which I knew would take some time. While the query executed I disconnected my dial-up line. After reconnecting, the backend process was still there (still SELECTing). Meanwhile the query is finished and the process is idle, but it's still present. I tried to "kill -SIGINT" the process and it didn't respond. "pg_cancel_backend" also didn't work.
However, a "kill -SIGTERM" did kill the process (in contrast to the processes I saw before) and the call stack is different:

(gdb) bt
#0  0x00002ac5d1d0f0c5 in recv () from /lib64/libc.so.6
#1  0x0000000000514292 in secure_read (port=0x996730, ptr=0x917760, len=8192) at be-secure.c:313
#2  0x0000000000518574 in pq_recvbuf () at pqcomm.c:723
#3  0x0000000000518987 in pq_getbyte () at pqcomm.c:764
#4  0x000000000057eb69 in PostgresMain (argc=4, argv=0x97f830, username=0x97f6d0 "chschroe") at postgres.c:301
#5  0x0000000000558218 in ServerLoop () at postmaster.c:2934
#6  0x0000000000558db8 in PostmasterMain (argc=3, argv=0x97bfd0) at postmaster.c:966
#7  0x000000000051a213 in main (argc=3, argv=0x97bfd0) at main.c:188

I now did the same again and will wait some time. Maybe the process will come to the same state the other processes had been in. If not, I'm pretty sure that my users will be able to create some really persistent processes again ... ;-)

Regards, Christian
On Sun, Nov 11, 2007 at 06:34:40PM +0100, Christian Schröder wrote:
> I recompiled the server with debugging symbols enabled and then did the
> following experiment: I started a query which I knew would take some
> time. While the query executed I disconnected my dial-up line. After
> reconnecting the backend process was still there (still SELECTing).
> Meanwhile the query is finished and the process is idle, but it's still
> present. I tried to "kill -SIGINT" the process and it didn't respond.
> "pg_cancel_backend" also didn't work. However, a "kill -SIGTERM" did
> kill the process (in contrast to the processes I saw before) and the
> call stack is different:

Yeah, not being able to kill processes that are waiting for I/O is somewhat of a known problem in older versions; I'm unsure if it's still an issue. The issue is that the signal is set to restartable, which means that the SIGINT signal handler gets called and then the process goes right back to waiting, rather than interrupting the system call. In the libpq client we handle this by jumping out of the signal handler. For the server the best bet would probably be to disable the restartable flag for the period of socket I/O.

Not sure if this has anything to do with your main problem though.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
> -- John F Kennedy
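[Editor's note: Martijn's point — the handler runs, but the blocking call simply resumes — can be sketched outside the backend. The toy below is not PostgreSQL code; it relies on Python 3.5+ retrying system calls interrupted by a signal once the handler returns (PEP 475), which mimics the SA_RESTART behaviour described above. The pipe and the delayed writer stand in for the client connection.]

```python
import os
import signal
import threading
import time

r, w = os.pipe()
caught = []

def on_alarm(signum, frame):
    # Like the backend's SIGINT handler: make a note and return.
    caught.append(signum)

signal.signal(signal.SIGALRM, on_alarm)

# A "client" that only sends data well after the signal has fired.
writer = threading.Thread(target=lambda: (time.sleep(2.0), os.write(w, b"x")))
writer.start()

signal.alarm(1)        # fires while the read below is blocked
data = os.read(r, 1)   # handler runs at t=1s, read resumes and completes at t=2s
writer.join()
```

After this runs, `caught` shows the signal was delivered and handled, yet `data` shows the read was never abandoned: exactly the "goes right back to waiting" behaviour that makes SIGINT ineffective against a backend blocked in recv().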
=?ISO-8859-1?Q?Christian_Schr=F6der?= <cs@deriva.de> writes:
> (gdb) bt
> #0  0x00002b24aeee0a68 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
> #1  0x00002b24aeedde88 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
> #2  0x00002b24a5814e23 in _nl_find_msg () from /lib64/libc.so.6
> #3  0x00002b24a5815c83 in __dcigettext () from /lib64/libc.so.6
> #4  0x00002b24a585df0b in strerror_r () from /lib64/libc.so.6
> #5  0x00002b24a585dd33 in strerror () from /lib64/libc.so.6
> #6  0x00000000005f4daa in expand_fmt_string ()
> #7  0x00000000005f6d14 in errmsg ()
> #8  0x00000000005185f3 in pq_recvbuf ()
> #9  0x0000000000518987 in pq_getbyte ()
> #10 0x000000000057eb69 in PostgresMain ()
> #11 0x0000000000558218 in ServerLoop ()
> #12 0x0000000000558db8 in PostmasterMain ()
> #13 0x000000000051a213 in main ()

> Seems to be the same as for the processes that were stuck inside of a
> statement.

Well, the top of the stack is the same, but this is useful anyway because it shows that an I/O error on the input side can trigger the problem as well as one on the output side. We're still left wondering how a thread mutex down inside strerror() could be left in a "stuck" state, when the process doesn't appear to contain more than one thread.

> I recompiled the server with debugging symbols enabled and then did the
> following experiment: I started a query which I knew would take some
> time. While the query executed I disconnected my dial-up line. After
> reconnecting the backend process was still there (still SELECTing).
> Meanwhile the query is finished and the process is idle, but it's still
> present.

That is probably not the same situation because (assuming the query didn't produce a lot of output) the kernel does not yet think that the network connection is lost irretrievably. You'd have to wait for the TCP timeout interval to elapse, whereupon the kernel would report the connection lost (EPIPE or ECONNRESET error), whereupon we'd enter the code path shown above.
One thing I'm suddenly thinking might be related: didn't you mention that you have some process that goes around and SIGINT's backends that it thinks are running too long? I'm wondering if a SIGINT event is a necessary component of producing the problem ... regards, tom lane
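[Editor's note: one standard mitigation for connections that linger until the TCP timeout elapses is TCP keepalive, which PostgreSQL 8.1 and later expose as the tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count settings. The sketch below just shows the underlying socket options; the 60/10/3 values are illustrative, not recommendations.]

```python
import socket

# A server-side socket on which we want dead peers detected early.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Turn keepalive probing on (portable across platforms).
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-specific tuning knobs, guarded because not every platform has them.
if hasattr(socket, "TCP_KEEPIDLE"):
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle seconds before first probe
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # failed probes before the kernel drops the connection

keepalive_on = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
s.close()
```

With settings like these, the kernel would report the dial-up client's connection lost after roughly idle + interval * count seconds instead of waiting for the default (often two-hour) keepalive timer, so the backend would reach the error path much sooner.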
Tom Lane wrote:
>> I recompiled the server with debugging symbols enabled and then did the
>> following experiment: I started a query which I knew would take some
>> time. While the query executed I disconnected my dial-up line. After
>> reconnecting the backend process was still there (still SELECTing).
>> Meanwhile the query is finished and the process is idle, but it's still
>> present.
>
> That is probably not the same situation because (assuming the query
> didn't produce a lot of output) the kernel does not yet think that the
> network connection is lost irretrievably. You'd have to wait for the
> TCP timeout interval to elapse, whereupon the kernel would report the
> connection lost (EPIPE or ECONNRESET error), whereupon we'd enter the
> code path shown above.

That's what I also thought, but unfortunately it doesn't seem to be the same situation. After a while the connection disappears and the server log says (quite correctly, in my opinion):

<2007-11-11 18:45:43 CET - chschroe> LOG: could not receive data from client: Die Wartezeit für die Verbindung ist abgelaufen
<2007-11-11 18:45:43 CET - chschroe> LOG: unexpected EOF on client connection

(The German message means "the connection timed out".)

> One thing I'm suddenly thinking might be related: didn't you mention
> that you have some process that goes around and SIGINT's backends that
> it thinks are running too long? I'm wondering if a SIGINT event is a
> necessary component of producing the problem ...

Maybe. On the other hand, I sent a SIGINT to my process and it nonetheless didn't show the strange behaviour. To test this I will change the script so that it leaves the processes of one of the users alone.

I think the users will create some more unkillable processes tomorrow. As soon as I have some, I will send you the login data for the server so that you can have a look for yourself.
Regards, Christian
=?ISO-8859-1?Q?Christian_Schr=F6der?= <cs@deriva.de> writes: > [ ongoing saga ] Hah, I've got it. (Should have searched Red Hat's bugzilla sooner.) What you are hitting is a glibc bug, as explained here: http://sources.redhat.com/ml/libc-hacker/2007-10/msg00010.html If libpthread is loaded after first use of dcgettext, then subsequent uses are at risk of hanging because they start to use a mutex lock that was never initialized. And libperl brings in libpthread. I've confirmed that the test program given in the above page hangs on your machine, and does not hang on mine (with latest Fedora 6 glibc), which no doubt explains my lack of success in reproducing the problem. So you need to pester SuSE for a glibc with that fix in it ... regards, tom lane
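[Editor's note: the failure mode Tom describes is easier to see in miniature. The class below is only an analogy — none of these names exist in glibc. The lock is created only if "threading" was present when the catalog was first used, but it is *taken* whenever threading is present at call time, so loading libpthread between the two calls means taking a lock that was never initialized. In glibc that meant waiting forever on junk mutex state; here it is modelled as an error.]

```python
import threading

pthread_loaded = False  # stands in for "is libpthread mapped into the process?"

class ToyCatalog:
    """Crude analogy of the message-catalog locking bug described above."""

    def __init__(self):
        # Initialization path: the lock exists only if threading was
        # already loaded when the catalog was first set up.
        self._lock = threading.Lock() if pthread_loaded else None

    def lookup(self, msg):
        # Usage path: whether to take the lock is decided at call time.
        if pthread_loaded:
            self._lock.acquire()  # never initialized -> blows up (glibc: hangs)
            try:
                return msg
            finally:
                self._lock.release()
        return msg

catalog = ToyCatalog()               # first "dcgettext" call, before plperl is loaded
ok_before = catalog.lookup("hello")  # single-threaded fast path works fine

pthread_loaded = True                # loading plperl drags in libpthread
try:
    catalog.lookup("hello")          # now takes a lock that was never created
    hung = False
except AttributeError:               # the stand-in for the futex hang
    hung = True
```

The sequencing also matches the symptoms in this thread: backends hang only after an error message (an I/O failure on the flaky connections) forces a gettext lookup once plperl has already been loaded.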
Tom Lane wrote:
> Hah, I've got it. (Should have searched Red Hat's bugzilla sooner.)
> What you are hitting is a glibc bug, as explained here:
> http://sources.redhat.com/ml/libc-hacker/2007-10/msg00010.html
> If libpthread is loaded after first use of dcgettext, then subsequent
> uses are at risk of hanging because they start to use a mutex lock
> that was never initialized. And libperl brings in libpthread.

Many thanks for your efforts, which have been crowned with success at last!!

What remains inscrutable to me is why this problem did not arise earlier. I cannot remember any changes that I have made to the system recently. Or maybe it was a defective update of glibc? I did not see any hint in the bug report about when this bug was introduced. Obviously it has something to do with the users having disconnects, but we have been working with dial-up connections for many years and never saw this bug. By the way, does the "--enable-thread-safety" switch have anything to do with the problem?

> So you need to pester SuSE for a glibc with that fix in it ...

Phew, that might be hard work, but I will give it a try. Until I have success I will probably restart the database server once a week to remove those old connections.

Thanks a lot for your help!
Christian

P.S.: Can I change the root password again?
On Tue, Nov 13, 2007 at 02:08:47PM +0100, Christian Schröder wrote:
> By the way, does the "--enable-thread-safety" switch have anything to do
> with the problem?

No, that only affects the client library; it doesn't affect the server...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
=?ISO-8859-1?Q?Christian_Schr=F6der?= <cs@deriva.de> writes: > What remains inscrutable to me is why this problem did not arise > earlier. I cannot remember any changes that I have made to the system > recently. Or maybe it has been a defective update of the glibc? I did > not see in the bug report any hint about when this bug has been > introduced. It's quite recent actually --- according to http://sourceware.org/cgi-bin/cvsweb.cgi/libc/intl/dcigettext.c?cvsroot=glibc the problem was introduced into upstream sources on 24-Sep. I was a bit confused yesterday because I'd been looking at another lock used by the same routine that's been there much longer; but that one is initialized in a safe way so it doesn't have the problem. It's the new usage of the conversions_lock that puts _nl_find_msg at risk. So even though the underlying problem has existed for years, it didn't manifest, at least not right there. So I suppose that SuSE is tracking upstream closely and you recently updated glibc ... > P.S.: Can I change the root password again? Oh, yeah, I'm done. Thanks for letting me poke at it. regards, tom lane