Thread: trouble restarting a server
Hi all, I'm having some trouble restarting a long-running server. Basically, pg_ctl can't kill any processes that are waiting on a notify interrupt. Here's what happened after trying a "stop -m fast":
[root@vero ~]# ps axvw | grep post
3606 ? S 3:30 23 3260 41575 6644 0.1 /usr/bin/postgres -D /postgres/db-8.2
4293 ? Ss 0:17 0 3260 40551 27004 0.6 postgres: writer process
4294 ? Ss 0:00 0 3260 9363 1744 0.0 postgres: stats collector process
14508 ? Ss 0:01 0 3260 41195 29356 0.7 postgres: me csdb zim(48570) notify interrupt
20701 ? Ss 0:01 0 3260 41127 29384 0.7 postgres: others csdb chef(45882) notify interrupt
31179 ? Ss 0:01 0 3260 41127 29428 0.7 postgres: others csdb yfandes(33113) notify interrupt
7342 ? Ss 0:03 0 3260 41723 30584 0.7 postgres: others csdb reed(54823) notify interrupt
7343 ? Ss 0:02 0 3260 41127 29396 0.7 postgres: others csdb reed(54824) notify interrupt
7404 ? Ss 0:02 0 3260 41127 29392 0.7 postgres: others csdb reed(54830) notify interrupt
29537 ? Ss 0:02 0 3260 41127 29404 0.7 postgres: others csdb stupid(60028) notify interrupt
26671 pts/0 S+ 0:00 1 19 7976 1700 0.0 su postgres -c /s/postgresql-8.2/bin/pg_ctl stop -m fast -D /scratch.1/postgres/csdb-8.2 -l /dev/null
26672 pts/0 S+ 0:00 1 283 5556 1180 0.0 -csh -c /usr/bin/pg_ctl stop -m fast -D /scratch.1/postgres/csdb-8.2 -l /dev/null
26673 pts/0 S+ 0:00 0 23 6916 1440 0.0 /usr/bin/pg_ctl stop -m fast -D /postgres/db-8.2 -l /dev/null
26713 ttyS0 S+ 0:00 0 71 5016 644 0.0 grep post
Any idea how postgres gets into this state, or how it can be resolved short of a system reboot or a "stop -m immediate"? Thanks much.
Peter
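For reference, the stuck backends in a listing like the one above can be picked out mechanically. This is just a sketch, assuming the `postgres: user db client(port) notify interrupt` status format shown in the ps output:

```shell
# Extract PIDs of backends whose ps status line ends in "notify interrupt".
# Reads ps output on stdin and prints one PID per line.
list_notify_pids() {
    awk '/postgres:.*notify interrupt/ { print $1 }'
}

# Usage: ps axvw | list_notify_pids
```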
"Peter Koczan" <pjkoczan@gmail.com> writes:
> [ lots of processes stuck in "notify interrupt" code ]

That's weird. If it's still in that state, or if you can reproduce it, could you attach to a few of those processes with gdb and get stack traces?

Looking at the async.c code, an obvious candidate is that that routine tries to take ExclusiveLock on pg_listener --- so if something had managed to exit without releasing a lock on that table, hangups could be expected. But if that were the case, you'd think the process status lines would include "waiting". My guess is they're blocked on something lower-level than a table lock, but without a stack trace it's hard to guess what.

Which PG release is this exactly?

regards, tom lane
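The gdb suggestion above can be scripted to grab a backtrace from every stuck backend in one pass. A sketch; the binary path is the one used later in this thread, so substitute your own, and the `DRY_RUN` guard is just an illustrative convenience for previewing the invocations:

```shell
# Path to the postgres binary with symbols; adjust for your install.
POSTGRES_BIN=/s/postgresql/bin/postmaster

# Attach gdb to each PID in batch mode and print a backtrace.
# Set DRY_RUN=1 to only print the gdb invocations instead of running them.
collect_backtraces() {
    for pid in "$@"; do
        cmd="gdb -batch -ex bt $POSTGRES_BIN $pid"
        if [ -n "$DRY_RUN" ]; then
            echo "$cmd"
        else
            $cmd
        fi
    done
}

# Usage: collect_backtraces 14508 20701 31179
```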
The release is 8.2.4. I haven't been able to reproduce the condition yet, but I will send along stack traces as soon as I can. I have this strange feeling that it's only going to happen when I find a reason to make a restart-worthy config change.
Peter
It finally recurred. Here's what I got from attaching gdb to those processes. I attached using the postmaster binary; let me know if I should use something else.
vero(su): ps axvw | grep notify
24556 ? Ss 0:03 0 3265 41262 29672 0.7 postgres: jerel csdb chef(36275) notify interrupt
2889 ? Ss 0:04 0 3265 41270 29688 0.7 postgres: ela csdb newton(32777) notify interrupt
2943 ? Ss 0:04 0 3265 41270 29684 0.7 postgres: stefan csdb stupid(32788) notify interrupt
5866 ? Ss 0:04 0 3265 41270 29680 0.7 postgres: petska csdb brian(32786) notify interrupt
27850 ? Ss 0:03 0 3265 41270 29768 0.7 postgres: dparter csdb yfandes(35456) notify interrupt
18582 ? Ss 0:03 0 3265 41270 29732 0.7 postgres: timc csdb tornado(47047) notify interrupt
449 ? Ss 0:02 0 3265 41270 29764 0.7 postgres: archer csdb spoon(33141) notify interrupt
12731 pts/0 S+ 0:00 0 71 3828 664 0.0 grep notify
vero(su): gdb /s/postgresql/bin/postmaster 24556
[----- Begin lots of messages -----]
GNU gdb Red Hat Linux (6.3.0.0-1.132.EL4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...Using host libthread_db library "/lib/tls/libthread_db.so.1".
Attaching to program: /afs/cs.wisc.edu/s/postgresql-8.2.4/@sys/bin/postmaster, process 5866
Reading symbols from /lib/libpam.so.0...done.
Loaded symbols for /lib/libpam.so.0
Reading symbols from /afs/cs.wisc.edu/s/openssl-0.9.8d/@sys/lib/libssl.so.0.9.8d...done.
Loaded symbols for /s/openssl-0.9.8d/lib/libssl.so.0.9.8d
Reading symbols from /afs/cs.wisc.edu/s/openssl-0.9.8d/@sys/lib/libcrypto.so.0.9.8d...done.
Loaded symbols for /s/openssl-0.9.8d/lib/libcrypto.so.0.9.8d
Reading symbols from /afs/cs.wisc.edu/s/krb5-1.5.1/@sys/lib/libkrb5.so.3...done.
Loaded symbols for /s/krb5-1.5.1/lib/libkrb5.so.3
Reading symbols from /lib/libcrypt.so.1...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/tls/libm.so.6...done.
Loaded symbols for /lib/tls/libm.so.6
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /afs/cs.wisc.edu/s/krb5-1.5.1/@sys/lib/libcom_err.so.3...done.
Loaded symbols for /s/krb5-1.5.1/lib/libcom_err.so.3
Reading symbols from /lib/libaudit.so.0...done.
Loaded symbols for /lib/libaudit.so.0
Reading symbols from /afs/cs.wisc.edu/s/krb5-1.5.1/i386_cent40/lib/libk5crypto.so.3...done.
Loaded symbols for /s/krb5-1.5.1/i386_cent40/lib/libk5crypto.so.3
Reading symbols from /afs/cs.wisc.edu/s/krb5-1.5.1/i386_cent40/lib/libkrb5support.so.0...done.
Loaded symbols for /s/krb5-1.5.1/i386_cent40/lib/libkrb5support.so.0
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Reading symbols from /lib/libnss_dns.so.2...done.
Loaded symbols for /lib/libnss_dns.so.2
0x007ef7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
[----- End lots of messages -----]
(gdb) bt
#0 0x007ef7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x008c60f3 in __write_nocancel () from /lib/tls/libc.so.6
#2 0x0064d734 in sock_write () from /s/openssl-0.9.8d/lib/libcrypto.so.0.9.8d
#3 0x00000008 in ?? ()
#4 0x09ede1a2 in ?? ()
#5 0x00000038 in ?? ()
#6 0x09ee50c0 in ?? ()
#7 0x006e06c8 in ?? () from /s/openssl-0.9.8d/lib/libcrypto.so.0.9.8d
#8 0x09ed4148 in ?? ()
#9 0x00000000 in ?? ()
All the other processes show the same trace except for the addresses in frames #4, #6, and #8, but those are all within a few MB of each other (they're probably asynchronous interrupts).
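Since the trace shows the backend blocked in write() via OpenSSL's sock_write, i.e. writing to the client connection, one guess (not confirmed by the thread) is a client that has stopped reading, leaving the backend's TCP send queue full. A sketch for spotting such sockets, assuming the standard `netstat -tn` column layout (Proto, Recv-Q, Send-Q, Local Address, ...):

```shell
# Print local address and Send-Q for TCP sockets with a non-empty send queue.
# A backend stuck in write() usually pairs with a socket whose Send-Q never drains.
blocked_sockets() {
    awk '$1 ~ /^tcp/ && $3 > 0 { print $4, "Send-Q=" $3 }'
}

# Usage: netstat -tn | blocked_sockets
```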
Let me know if you need more info.
Peter