Thread: 100% CPU pg processes that don't die.
I'm load testing a machine, and I'm seeing "idle in transaction" processes that are no longer hooked to any outside client, that pull 100% CPU and can't be kill -9ed. I'm using pgbench -c 1000 -t 1000; postgresql.conf attached. This is on an 8 CPU AMD box with hardware RAID. I'll likely never see this many parallel connections in production, but who knows... I want to reboot the machine to test at a lower number of threads, but if there's more knowledge to be gained from it I'll leave it up.

They look like this in top:

 3552 postgres  20   0 8286m  82m  78m R  100  0.3 195:04.22 postgres: postgres postgres [local] idle in transaction
 3561 postgres  20   0 8286m  83m  79m R  100  0.3 195:04.20 postgres: postgres postgres [local] idle in transaction

And this in ps aux|grep postgres:

postgres  3561 95.2  0.2 8485376 85708 ?  Rs  09:45 197:17 postgres: postgres postgres [local] idle in transaction

The db is still up and accessible. I'm also getting this in my /var/log/messages:

Aug 9 13:13:21 engelberg kernel: [71242.734934] CPU 1:
Aug 9 13:13:21 engelberg kernel: [71242.734935] Modules linked in: iptable_filter ip_tables x_tables parport_pc lp parport loop ipv6 evdev i2c_nforce2 pcspkr shpchp button pci_hotplug i2c_core pata_amd ata_generic ext3 jbd mbcache sg sr_mod cdrom sd_mod e1000 floppy arcmsr pata_acpi libata ehci_hcd forcedeth ohci_hcd scsi_mod usbcore thermal processor fan fbcon tileblit font bitblit softcursor fuse
Aug 9 13:13:21 engelberg kernel: [71242.734972] Pid: 294, comm: kswapd0 Not tainted 2.6.24-19-server #1
Aug 9 13:13:21 engelberg kernel: [71242.734974] RIP: 0010:[floppy:_spin_lock_irqsave+0x12/0x30] [floppy:_spin_lock_irqsave+0x12/0x30] _spin_lock_irqsave+0x12/0x30
Aug 9 13:13:21 engelberg kernel: [71242.734980] RSP: 0018:ffff810415423df8 EFLAGS: 00000286
Aug 9 13:13:21 engelberg kernel: [71242.734982] RAX: 0000000000000246 RBX: ffff81000003137d RCX: 0000000000000003
Aug 9 13:13:21 engelberg kernel: [71242.734984] RDX: 0000000000000001 RSI: ffff810415423ea0 RDI: ffff81000003137d
Aug 9 13:13:21 engelberg kernel: [71242.734987] RBP: ffff810415423d60 R08: 0000000000000000 R09: 0000000000000000
Aug 9 13:13:21 engelberg kernel: [71242.734989] R10: 0000000000000000 R11: ffffffff881a46b0 R12: ffff810415423d60
Aug 9 13:13:21 engelberg kernel: [71242.734991] R13: ffffffff8028d11e R14: ffff81041f6b2670 R15: ffff810420168178
Aug 9 13:13:21 engelberg kernel: [71242.734994] FS: 00007f51096fd700(0000) GS:ffff8108171a2300(0000) knlGS:0000000000000000
Aug 9 13:13:21 engelberg kernel: [71242.734997] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Aug 9 13:13:21 engelberg kernel: [71242.734999] CR2: 00007f4f27ebffd0 CR3: 0000000000201000 CR4: 00000000000006e0
Aug 9 13:13:21 engelberg kernel: [71242.735001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 9 13:13:21 engelberg kernel: [71242.735003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Aug 9 13:13:21 engelberg kernel: [71242.735006]
Aug 9 13:13:21 engelberg kernel: [71242.735006] Call Trace:
Aug 9 13:13:21 engelberg kernel: [71242.735009] [usbcore:prepare_to_wait+0x23/0x80] prepare_to_wait+0x23/0x80
Aug 9 13:13:21 engelberg kernel: [71242.735013] [kswapd+0xfa/0x560] kswapd+0xfa/0x560
Aug 9 13:13:21 engelberg kernel: [71242.735020] [<ffffffff80254260>] autoremove_wake_function+0x0/0x30
Aug 9 13:13:21 engelberg kernel: [71242.735026] [kswapd+0x0/0x560] kswapd+0x0/0x560
Aug 9 13:13:21 engelberg kernel: [71242.735030] [kthread+0x4b/0x80] kthread+0x4b/0x80
Aug 9 13:13:21 engelberg kernel: [71242.735034] [child_rip+0xa/0x12] child_rip+0xa/0x12
Aug 9 13:13:21 engelberg kernel: [71242.735040] [kthread+0x0/0x80] kthread+0x0/0x80
Aug 9 13:13:21 engelberg kernel: [71242.735043] [child_rip+0x0/0x12] child_rip+0x0/0x12
Aug 9 13:13:21 engelberg kernel: [71242.735046]

Does this look like a kernel bug or a pgsql bug to most people?
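[A way to cross-check from the database side whether those backends really have no client attached: query pg_stat_activity. This is a sketch assuming 8.x-era column names (procpid, current_query) and a superuser `postgres` role/database; adjust for your setup.]

```shell
# List idle-in-transaction backends and whether a client socket is still
# attached. client_addr/client_port are NULL for local unix-socket
# connections, which is what a [local] pgbench run uses.
if command -v psql >/dev/null; then
    psql -U postgres -d postgres -c "
        SELECT procpid, client_addr, client_port, query_start
          FROM pg_stat_activity
         WHERE current_query = '<IDLE> in transaction';"
else
    echo "psql not installed on this host"
fi
```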
* Scott Marlowe (scott.marlowe@gmail.com) wrote:
> Aug 9 13:13:21 engelberg kernel: [71242.735046]
>
> Does this look like a kernel bug or a pgsql bug to most people?

It's certainly something kernel-related. It might be the OOM killer though.. You might want to disable that. Is the box running out of memory (run 'free')?

	Stephen
On Sat, Aug 9, 2008 at 1:31 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Scott Marlowe (scott.marlowe@gmail.com) wrote:
>> Aug 9 13:13:21 engelberg kernel: [71242.735046]
>>
>> Does this look like a kernel bug or a pgsql bug to most people?
>
> It's certainly something kernel-related. It might be the OOM killer
> though.. You might want to disable that. Is the box running out of
> memory (run 'free')?

Hehe, no. It's got 32 gig of RAM and is using 25 gig of that for kernel cache. There are no entries in any log for the OOM killer that I can see.

Hmmm. I'm running the latest Ubuntu LTS with the 2.6.24-19-server kernel. Just ran the latest apt-get upgrade and was only missing an update to pciutils.

btw, here's the output of free:

             total       used       free     shared    buffers     cached
Mem:      33080292   32983004      97288          0      87992   24882632
-/+ buffers/cache:    8012380   25067912
Swap:      7815580        144    7815436

Interesting thing: I've got a machine with the same configuration sitting next to it that's not doing this. But it seems odd that it could be hardware induced.
On Sat, Aug 9, 2008 at 1:38 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> Hehe, no. It's got 32 gig of RAM and is using 25 gig of that for kernel
> cache. There are no entries in any log for the OOM killer that I can see.
>
> Interesting thing: I've got a machine with the same configuration
> sitting next to it that's not doing this. But it seems odd that it could
> be hardware induced.

I take that back. This problem followed the RAID card from one machine to another.
* Scott Marlowe (scott.marlowe@gmail.com) wrote:
> I take that back. This problem followed the RAID card from one
> machine to another.

That's certainly curious. The kernel backtrace didn't seem to have anything terribly interesting in it (at least to me). Sure there aren't more detailed logs? With an actual "Kernel BUG" or 'OOPS' line? What does dmesg look like? Maybe try posting to lkml?

	Stephen
On Sat, Aug 9, 2008 at 1:47 PM, Stephen Frost <sfrost@snowman.net> wrote:
> That's certainly curious. The kernel backtrace didn't seem to have
> anything terribly interesting in it (at least to me). Sure there aren't
> more detailed logs? With an actual "Kernel BUG" or 'OOPS' line? What
> does dmesg look like? Maybe try posting to lkml?

That's what I'm thinking, but that machine just hung hard and isn't coming back up, and I don't feel like driving to the hosting center for a broken machine. We've already got a replacement card on the way, and the problem is NOT happening with the other card (so far, after a week of testing).
"Scott Marlowe" <scott.marlowe@gmail.com> writes:
> I'm load testing a machine, and i'm seeing idle in transaction
> processes that are no longer hooked to any outside client, that pull
> 100% CPU and can't be kill -9ed.

To my knowledge, the only way a process can't be kill -9'd is if it's stuck inside the kernel (typically, doing I/O to a nonresponsive disk). There's certainly no way for a userland process to defend itself against kill -9. So my immediate response would have been to look for a hardware problem, or failing that a kernel bug. I see from the subsequent thread that indeed hardware failure looks to be the answer, but that should have been your first assumption.

			regards, tom lane
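[Tom's point can be checked directly from a shell. A sketch, using PID 3561 from the top output earlier in the thread; the SysRq step requires root and sysrq support enabled:]

```shell
# A process that ignores SIGKILL is stuck inside the kernel; 'stat' shows
# its state (D = uninterruptible sleep, R = running) and 'wchan' the
# kernel function it is sleeping in, which hints at the driver involved.
pid=3561   # one of the stuck backends from the top output earlier
ps -o pid,stat,wchan:30,cmd -p "$pid" || echo "pid $pid is not running here"

# As root, SysRq-T dumps every task's kernel stack to the kernel log,
# which is the first thing lkml will ask for:
#   echo t > /proc/sysrq-trigger && dmesg | tail -100
```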
On Sat, Aug 9, 2008 at 2:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> To my knowledge, the only way a process can't be kill -9'd is if it's
> stuck inside the kernel (typically, doing I/O to a nonresponsive disk).
> There's certainly no way for a userland process to defend itself against
> kill -9. So my immediate response would have been to look for a
> hardware problem, or failing that a kernel bug. I see from the
> subsequent thread that indeed hardware failure looks to be the answer,
> but that should have been your first assumption.

It was, before this. That's why I'd swapped the RAID cards. It's just that this is the first time it's happened without killing the box, so I wanted to be sure it didn't look like something else to anybody.
On Sat, Aug 9, 2008 at 2:54 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> It was, before this. That's why I'd swapped the RAID cards. It's just
> that this is the first time it's happened without killing the box,
> so I wanted to be sure it didn't look like something else to anybody.

Just as a followup, several hours later the other machine started producing the same effects. I'm gonna go trawl through lkml to see if they have any info on this problem. The good news is that both CentOS 5.2 and Ubuntu 7.10 seem immune to this particular bug, and have been running 13 hours now without a hitch.
On Sun, 10 Aug 2008, Scott Marlowe wrote:
> The good news is that both CentOS 5.2 and Ubuntu 7.10 seem immune to
> this particular bug, and have been running 13 hours now without a
> hitch.

Not sure if it's relevant here, but you do know that I've been kicking back to lkml that pgbench has issues on the 2.6.24 kernel, right? I haven't tried simulating 1000 clients like your case, but it fails miserably to scale to even 10 with the select-only workload. The CFS scheduler used in 2.6.23+ is only about a year old right now, and it sure seems like it's still a baby going through its share of teething pains. Lest you think it's just me complaining, read https://bugs.launchpad.net/ubuntu/+source/linux/+bug/188226

Maybe your problem is somewhere else instead, but particularly given the known pgbench issues it may be worth spending a minute investigating that area. One bit of black magic I was told about is that you can turn off some of the new scheduling "features" (one of which was the main cause of the pgbench incompatibility) by updating a kernel tunable. You can turn off all the fancy features with:

echo 0 > /proc/sys/kernel/sched_features

See http://lkml.org/lkml/2008/5/26/288 for more details. Whether the problem still shows up with that change should help narrow down whether your issue is related to the scheduler changes or to something else that's different in the newer kernel. Your pgbench results should be higher after that change, too, but you shouldn't run the system like that normally: it's optimizing for something bad that pgbench does, and will hurt other, more realistic workloads.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
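[A scripted version of Greg's experiment, as a sketch: it assumes the 2.6.24-era sysctl path (later kernels moved this mask to /sys/kernel/debug/sched_features and changed its format), and the pgbench invocation is just an illustrative select-only run against a hypothetical `postgres` database.]

```shell
# Default to the 2.6.24 sysctl path; override via environment if needed.
SCHED_FEATURES=${SCHED_FEATURES:-/proc/sys/kernel/sched_features}

if [ -w "$SCHED_FEATURES" ]; then
    saved=$(cat "$SCHED_FEATURES")       # remember the current feature mask
    echo 0 > "$SCHED_FEATURES"           # turn off all the new CFS features
    pgbench -S -c 10 -t 10000 postgres   # select-only workload to compare
    echo "$saved" > "$SCHED_FEATURES"    # restore; don't leave it at 0
else
    echo "no writable $SCHED_FEATURES on this kernel; skipping"
fi
```

Running it once with the stock mask and once through this script, and comparing the two tps numbers, should show whether the scheduler is the variable, per Greg's note that the setting is for diagnosis only, not production.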