On 26/10/2009 5:28 PM, Karen Pease wrote:
> I did my best to follow the gdb instructions. I ran:
>
> gdb -p 2852
>
> Then connected entered the logging statements, then ran "cont", then
> ctrl-c'ed it a couple times. I got:
OK, so there's nothing shriekingly obviously wrong with what the
postmaster is up to. But what about the backend that's stopped
responding? Try connecting gdb to that "postgres" process once it's
stopped responding and get a backtrace from that.
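
A minimal sketch of grabbing that backtrace non-interactively, assuming
gdb is installed and you run it as root or as the user that owns the
backend. The function name "backend_backtrace" and the GDB override
variable are made up for this example; -p, -batch and -ex are ordinary
gdb options.

```shell
# Print a backtrace for one process, then detach again.
# Usage: backend_backtrace PID
# GDB may name a different gdb binary (an assumption made here for
# convenience; it defaults to plain "gdb").
backend_backtrace() {
    pid=$1
    if [ -z "$pid" ]; then
        echo "usage: backend_backtrace PID" >&2
        return 1
    fi
    # -batch makes gdb exit once the -ex commands have run, so the
    # backend is only paused for as long as the backtrace takes.
    "${GDB:-gdb}" -p "$pid" -batch -ex bt
}
```

Find the stuck backend's PID with ps first, then run
"backend_backtrace" on it. If the backend is in uninterruptible sleep
(state D), gdb may itself block while attaching; in that case the
kernel-side trace is likely to be more useful.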
> [root@chmmr dbscripts]# ps ax -o pid,ppid,stat,wchan:50,cmd | grep -i http
> 3376 1 D start_this_handle /usr/sbin/httpd
start_this_handle appears in common ext4 call paths, and in several
LKML bug reports over time:
http://lkml.org/lkml/2009/3/11/253
http://www.google.com.au/search?q=%22start_this_handle%22+ext4
Smells like a kernel bug. When two extremely stable pieces of software
(Pg and Apache) both have issues on a well-tested kernel (Linux) with a
new and fairly immature file system in use (ext4), it's probably not an
unreasonable assumption.
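
To check whether other processes are stuck in the same place, the ps
output can be filtered for state D. This is only a sketch: the function
name "blocked_procs" is invented here, and it assumes the exact column
order of the ps command quoted above (pid, ppid, stat, wchan, cmd).

```shell
# Read "ps ax -o pid,ppid,stat,wchan:50,cmd" output on stdin and print
# the PID, wait channel and command of every process whose state
# starts with D (uninterruptible sleep).
blocked_procs() {
    awk '$3 ~ /^D/ {
        # Re-join the command, which may contain spaces.
        cmd = $5
        for (i = 6; i <= NF; i++) cmd = cmd " " $i
        print $1, $4, cmd
    }'
}
```

Run it as "ps ax -o pid,ppid,stat,wchan:50,cmd | blocked_procs". If
postgres backends show up alongside httpd with the same wait channel,
that would fit the kernel/filesystem theory rather than a Pg problem.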
You can find out a bit more about what the kernel is doing using the
"magic" keyboard sequence Alt-SysRq-T from a virtual console (not under
X). If the results scroll past too fast you can page through them with
"less" on /var/log/kern.log (or /var/log/dmesg, depending on your
distro) or by using the "dmesg" command.
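
If you only have ssh access and no console, the same task dump can
usually be requested by writing to /proc/sysrq-trigger as root, assuming
your kernel was built with magic SysRq support. A rough sketch; the
function name and its optional path argument are my own invention for
illustration:

```shell
# Ask the kernel to log the state and stack of every task.
# Must be run as root. The optional argument exists only so the
# function can be pointed at some other file when trying it out; by
# default it writes to the real trigger file.
request_task_dump() {
    trigger=${1:-/proc/sysrq-trigger}
    # Writing "t" is the /proc equivalent of pressing Alt-SysRq-T.
    echo t > "$trigger"
}
```

After calling "request_task_dump", read the result with "dmesg | less"
or from the kernel log file as described above.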
I won't be too surprised if you see a kernel stack trace for your httpd
process(es) starting something like this:
schedule+0x18/0x40
start_this_handle+0x374/0x508
jbd2_journal_start+0xbc/0x11c
ext4_journal_start_sb+0x5c/0x84
ext4_dirty_inode+0xd4/0xf0
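
The task dump is long, so it can help to pull out just the frames that
mention the journalling paths above. A small sketch, with a made-up
function name and a pattern based only on the symbols in this example
trace (your trace may well show different ones):

```shell
# Read kernel log text on stdin and keep only the lines that mention
# the ext4 / jbd2 journalling functions named in the example trace.
journal_frames() {
    grep -E 'start_this_handle|jbd2_journal_|ext4_'
}
```

For instance "dmesg | journal_frames". Seeing these frames repeated for
several tasks would suggest they are all waiting on the ext4 journal.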
--
Craig Ringer