Thread: 100% CPU pg processes that don't die.
I'm load testing a machine, and I'm seeing "idle in transaction" processes that are no longer hooked to any outside client, that pull 100% CPU and can't be kill -9ed. I'm using pgbench -c 1000 -t 1000; postgresql.conf attached. This is on an 8 CPU AMD box with hardware RAID. I'll likely never see this many parallel connections in production, but who knows... I want to reboot the machine to test at a lower number of threads, but if there's more knowledge to be gained from it I'll leave it up.

They look like this in top:

 3552 postgres  20   0 8286m  82m  78m R  100  0.3 195:04.22 postgres: postgres postgres [local] idle in transaction
 3561 postgres  20   0 8286m  83m  79m R  100  0.3 195:04.20 postgres: postgres postgres [local] idle in transaction

And this in ps aux|grep postgres:

postgres  3561 95.2  0.2 8485376 85708 ?  Rs  09:45 197:17 postgres: postgres postgres [local] idle in transaction

The db is still up and accessible. I'm also getting this in my /var/log/messages:

Aug 9 13:13:21 engelberg kernel: [71242.734934] CPU 1:
Aug 9 13:13:21 engelberg kernel: [71242.734935] Modules linked in: iptable_filter ip_tables x_tables parport_pc lp parport loop ipv6 evdev i2c_nforce2 pcspkr shpchp button pci_hotplug i2c_core pata_amd ata_generic ext3 jbd mbcache sg sr_mod cdrom sd_mod e1000 floppy arcmsr pata_acpi libata ehci_hcd forcedeth ohci_hcd scsi_mod usbcore thermal processor fan fbcon tileblit font bitblit softcursor fuse
Aug 9 13:13:21 engelberg kernel: [71242.734972] Pid: 294, comm: kswapd0 Not tainted 2.6.24-19-server #1
Aug 9 13:13:21 engelberg kernel: [71242.734974] RIP: 0010:[floppy:_spin_lock_irqsave+0x12/0x30] [floppy:_spin_lock_irqsave+0x12/0x30] _spin_lock_irqsave+0x12/0x30
Aug 9 13:13:21 engelberg kernel: [71242.734980] RSP: 0018:ffff810415423df8 EFLAGS: 00000286
Aug 9 13:13:21 engelberg kernel: [71242.734982] RAX: 0000000000000246 RBX: ffff81000003137d RCX: 0000000000000003
Aug 9 13:13:21 engelberg kernel: [71242.734984] RDX: 0000000000000001 RSI: ffff810415423ea0 RDI: ffff81000003137d
Aug 9 13:13:21 engelberg kernel: [71242.734987] RBP: ffff810415423d60 R08: 0000000000000000 R09: 0000000000000000
Aug 9 13:13:21 engelberg kernel: [71242.734989] R10: 0000000000000000 R11: ffffffff881a46b0 R12: ffff810415423d60
Aug 9 13:13:21 engelberg kernel: [71242.734991] R13: ffffffff8028d11e R14: ffff81041f6b2670 R15: ffff810420168178
Aug 9 13:13:21 engelberg kernel: [71242.734994] FS: 00007f51096fd700(0000) GS:ffff8108171a2300(0000) knlGS:0000000000000000
Aug 9 13:13:21 engelberg kernel: [71242.734997] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Aug 9 13:13:21 engelberg kernel: [71242.734999] CR2: 00007f4f27ebffd0 CR3: 0000000000201000 CR4: 00000000000006e0
Aug 9 13:13:21 engelberg kernel: [71242.735001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 9 13:13:21 engelberg kernel: [71242.735003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Aug 9 13:13:21 engelberg kernel: [71242.735006]
Aug 9 13:13:21 engelberg kernel: [71242.735006] Call Trace:
Aug 9 13:13:21 engelberg kernel: [71242.735009] [usbcore:prepare_to_wait+0x23/0x80] prepare_to_wait+0x23/0x80
Aug 9 13:13:21 engelberg kernel: [71242.735013] [kswapd+0xfa/0x560] kswapd+0xfa/0x560
Aug 9 13:13:21 engelberg kernel: [71242.735020] [<ffffffff80254260>] autoremove_wake_function+0x0/0x30
Aug 9 13:13:21 engelberg kernel: [71242.735026] [kswapd+0x0/0x560] kswapd+0x0/0x560
Aug 9 13:13:21 engelberg kernel: [71242.735030] [kthread+0x4b/0x80] kthread+0x4b/0x80
Aug 9 13:13:21 engelberg kernel: [71242.735034] [child_rip+0xa/0x12] child_rip+0xa/0x12
Aug 9 13:13:21 engelberg kernel: [71242.735040] [kthread+0x0/0x80] kthread+0x0/0x80
Aug 9 13:13:21 engelberg kernel: [71242.735043] [child_rip+0x0/0x12] child_rip+0x0/0x12
Aug 9 13:13:21 engelberg kernel: [71242.735046]

Does this look like a kernel bug or a pgsql bug to most people?
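[A way to cross-check from the database side whether those backends really have no client attached: query pg_stat_activity. This is a sketch assuming 8.x-era column names (procpid, current_query) and a superuser `postgres` role/database; adjust for your setup.]

```shell
# List idle-in-transaction backends and whether a client socket is still
# attached. client_addr/client_port are NULL for local unix-socket
# connections, which is what a [local] pgbench run uses.
if command -v psql >/dev/null; then
    psql -U postgres -d postgres -c "
        SELECT procpid, client_addr, client_port, query_start
          FROM pg_stat_activity
         WHERE current_query = '<IDLE> in transaction';"
else
    echo "psql not installed on this host"
fi
```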
* Scott Marlowe (scott.marlowe@gmail.com) wrote:
> Aug 9 13:13:21 engelberg kernel: [71242.735046]
>
> Does this look like a kernel bug or a pgsql bug to most people?

It's certainly something kernel-related. It might be the OOM killer though.. You might want to disable that. Is the box running out of memory (run 'free')?

	Stephen
On Sat, Aug 9, 2008 at 1:31 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Scott Marlowe (scott.marlowe@gmail.com) wrote:
>> Aug 9 13:13:21 engelberg kernel: [71242.735046]
>>
>> Does this look like a kernel bug or a pgsql bug to most people?
>
> It's certainly something kernel-related. It might be the OOM killer
> though.. You might want to disable that. Is the box running out of
> memory (run 'free')?

Hehe, no. It's got 32 gig of RAM and is using 25 gig of that for kernel cache. There are no entries in any log for the OOM killer that I can see.

Hmmm. I'm running the latest Ubuntu LTS with the 2.6.24-19-server kernel. Just ran the latest apt-get upgrade and was only missing an update to pciutils.

btw, here's the output of free:

             total       used       free     shared    buffers     cached
Mem:      33080292   32983004      97288          0      87992   24882632
-/+ buffers/cache:    8012380   25067912
Swap:      7815580        144    7815436

Interesting thing: I've got a machine with the same configuration sitting next to it that's not doing this. But it seems odd that it could be hardware induced.
On Sat, Aug 9, 2008 at 1:38 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> Hehe, no. It's got 32 gig of RAM and is using 25 gig of that for kernel
> cache. There are no entries in any log for the OOM killer that I can see.
>
> Interesting thing: I've got a machine with the same configuration
> sitting next to it that's not doing this. But it seems odd that it could
> be hardware induced.

I take that back. This problem followed the RAID card from one machine to another.
* Scott Marlowe (scott.marlowe@gmail.com) wrote:
> I take that back. This problem followed the RAID card from one
> machine to another.

That's certainly curious. The kernel backtrace didn't seem to have anything terribly interesting in it (at least to me). Sure there aren't more detailed logs? With an actual "Kernel BUG" or 'OOPS' line? What does dmesg look like? Maybe try posting to lkml?

	Stephen
On Sat, Aug 9, 2008 at 1:47 PM, Stephen Frost <sfrost@snowman.net> wrote:
> That's certainly curious. The kernel backtrace didn't seem to have
> anything terribly interesting in it (at least to me). Sure there aren't
> more detailed logs? With an actual "Kernel BUG" or 'OOPS' line? What
> does dmesg look like? Maybe try posting to lkml?

That's what I'm thinking, but that machine just hung hard and isn't coming back up, and I don't feel like driving to the hosting center for a broken machine. We've already got a replacement card on the way, and the problem is NOT happening with the other card (so far, after a week of testing).
"Scott Marlowe" <scott.marlowe@gmail.com> writes:
> I'm load testing a machine, and i'm seeing idle in transaction
> processes that are no longer hooked to any outside client, that pull
> 100% CPU and can't be kill -9ed.

To my knowledge, the only way a process can't be kill -9'd is if it's stuck inside the kernel (typically, doing I/O to a nonresponsive disk). There's certainly no way for a userland process to defend itself against kill -9. So my immediate response would have been to look for a hardware problem, or failing that a kernel bug. I see from the subsequent thread that indeed hardware failure looks to be the answer, but that should have been your first assumption.

			regards, tom lane
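[Tom's point can be checked directly from a shell. A sketch, using PID 3561 from the top output earlier in the thread; the SysRq step requires root and sysrq support enabled:]

```shell
# A process that ignores SIGKILL is stuck inside the kernel; 'stat' shows
# its state (D = uninterruptible sleep, R = running) and 'wchan' the
# kernel function it is sleeping in, which hints at the driver involved.
pid=3561   # one of the stuck backends from the top output earlier
ps -o pid,stat,wchan:30,cmd -p "$pid" || echo "pid $pid is not running here"

# As root, SysRq-T dumps every task's kernel stack to the kernel log,
# which is the first thing lkml will ask for:
#   echo t > /proc/sysrq-trigger && dmesg | tail -100
```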
On Sat, Aug 9, 2008 at 2:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> To my knowledge, the only way a process can't be kill -9'd is if it's
> stuck inside the kernel (typically, doing I/O to a nonresponsive disk).
> There's certainly no way for a userland process to defend itself against
> kill -9. So my immediate response would have been to look for a
> hardware problem, or failing that a kernel bug. I see from the
> subsequent thread that indeed hardware failure looks to be the answer,
> but that should have been your first assumption.

It was, before this. That's why I'd swapped the RAID cards. It's just that this is the first time it's happened without killing the box, so I wanted to be sure it didn't look like something else to anybody.
On Sat, Aug 9, 2008 at 2:54 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> It was, before this. That's why I'd swapped the RAID cards. It's just
> that this is the first time it's happened without killing the box,
> so I wanted to be sure it didn't look like something else to anybody.

Just as a followup, several hours later the other machine started producing the same effects. I'm gonna go trawl through lkml to see if they have any info on this problem. The good news is that both CentOS 5.2 and Ubuntu 7.10 seem immune to this particular bug, and have been running 13 hours now without a hitch.
On Sun, 10 Aug 2008, Scott Marlowe wrote:
> The good news is that both CentOS 5.2 and Ubuntu 7.10 seem immune to
> this particular bug, and have been running 13 hours now without a
> hitch.

Not sure if it's relevant here, but you do know that I've been kicking back to lkml that pgbench has issues on the 2.6.24 kernel, right? I haven't tried simulating 1000 clients like your case, but it fails miserably to scale to even 10 with the select-only workload. The CFS scheduler used in 2.6.23+ is only about a year old right now, and it sure seems like it's still a baby going through its share of teething pains. Lest you think it's just me complaining, read https://bugs.launchpad.net/ubuntu/+source/linux/+bug/188226

Maybe your problem is somewhere else instead, but particularly given the known pgbench issues it may be worth spending a minute investigating that area. One bit of black magic I was told about is that you can turn off some of the new scheduling "features" (one of which was the main cause of the pgbench incompatibility) by updating a kernel tunable. You can turn off all the fancy features with:

echo 0 > /proc/sys/kernel/sched_features

See http://lkml.org/lkml/2008/5/26/288 for more details. Whether the problem still shows up with that change should help narrow down whether your issue is related to the scheduler changes or to something else that's different in the newer kernel. Your pgbench results should be higher after that change, too, but you shouldn't run the system like that normally: it's optimizing for something bad that pgbench does, and will hurt other, more realistic workloads.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
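[A scripted version of Greg's experiment, as a sketch: it assumes the 2.6.24-era sysctl path (later kernels moved this mask to /sys/kernel/debug/sched_features and changed its format), and the pgbench invocation is just an illustrative select-only run against a hypothetical `postgres` database.]

```shell
# Default to the 2.6.24 sysctl path; override via environment if needed.
SCHED_FEATURES=${SCHED_FEATURES:-/proc/sys/kernel/sched_features}

if [ -w "$SCHED_FEATURES" ]; then
    saved=$(cat "$SCHED_FEATURES")       # remember the current feature mask
    echo 0 > "$SCHED_FEATURES"           # turn off all the new CFS features
    pgbench -S -c 10 -t 10000 postgres   # select-only workload to compare
    echo "$saved" > "$SCHED_FEATURES"    # restore; don't leave it at 0
else
    echo "no writable $SCHED_FEATURES on this kernel; skipping"
fi
```

Running it once with the stock mask and once through this script, and comparing the two tps numbers, should show whether the scheduler is the variable, per Greg's note that the setting is for diagnosis only, not production.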