Thread: BUG #6650: CPU system time utilization rising few times a day

BUG #6650: CPU system time utilization rising few times a day

From
a.krawiec@focustelecom.pl
Date:
The following bug has been logged on the website:

Bug reference:      6650
Logged by:          Andrzej Krawiec
Email address:      a.krawiec@focustelecom.pl
PostgreSQL version: 8.4.11
Operating system:   CentOS 6.0 - 2.6.32-220.13.1.el6.x86_64
Description:=20=20=20=20=20=20=20=20

Primarily checked on PG 8.4.9 (same OS), problem also occurs. Few times a
day I get a situation where PostgreSQL stops running for 1-2 minutes. CPU is
running 99% in systime. IO is OK, only interrupts are extremely high (over
100k). System operates on 2 x Xeon 10 Core, 128 GB RAM, raid 10. Does anyone
have any idea?

Re: BUG #6650: CPU system time utilization rising few times a day

From
Robert Haas
Date:
On Fri, May 18, 2012 at 5:09 AM,  <a.krawiec@focustelecom.pl> wrote:
> The following bug has been logged on the website:
>
> Bug reference: =A0 =A0 =A06650
> Logged by: =A0 =A0 =A0 =A0 =A0Andrzej Krawiec
> Email address: =A0 =A0 =A0a.krawiec@focustelecom.pl
> PostgreSQL version: 8.4.11
> Operating system: =A0 CentOS 6.0 - 2.6.32-220.13.1.el6.x86_64
> Description:
>
> Primarily checked on PG 8.4.9 (same OS), problem also occurs. Few times a
> day I get a situation where PostgreSQL stops running for 1-2 minutes. CPU=
 is
> running 99% in systime. IO is OK, only interrupts are extremely high (over
> 100k). System operates on 2 x Xeon 10 Core, 128 GB RAM, raid 10. Does any=
one
> have any idea?

Try using strace to figure out where all that system time is going.
Sometimes the '-c' option is helpful.

It might also be helpful to connect gdb to the process and get a
backtrace, then continue, stop it again, get another backtrace.
Repeat that a few times and send us the backtrace that occurs most
frequently.

Is it a regular backend that is eating all that CPU time, or an
autovacuum worker?

--=20
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG #6650: CPU system time utilization rising few times a day

From
Andrzej Krawiec
Date:
Cannot strace or gdb on a production system under heavy load (about 100
transactions per second).
It's in kernel space not user, so we are unable to anything at this
particular moment (sometimes even the ssh connection seems to hang for a
while).
We suspect neither autovacuum (although suspected primarily) nor regular
backend. It is system time. The question is: what's the reasone for that?
We've dug through system and postgres logs, cleared out most of the long
query problems, idle in transaction, optimized queries, vacuumed, reindexed
and such.
For a while it seemed like the particular kernel version is causing
majority of problems. We have downgraded to 2.6.32.-71.29.1.el6.x86_64 and
those problems went mostly! away. For few days we had no situations, but it
happened again.

Regards
--
Andrzej Krawiec

2012/5/22 Robert Haas <robertmhaas@gmail.com>

> On Fri, May 18, 2012 at 5:09 AM,  <a.krawiec@focustelecom.pl> wrote:
> > The following bug has been logged on the website:
> >
> > Bug reference:      6650
> > Logged by:          Andrzej Krawiec
> > Email address:      a.krawiec@focustelecom.pl
> > PostgreSQL version: 8.4.11
> > Operating system:   CentOS 6.0 - 2.6.32-220.13.1.el6.x86_64
> > Description:
> >
> > Primarily checked on PG 8.4.9 (same OS), problem also occurs. Few times a
> > day I get a situation where PostgreSQL stops running for 1-2 minutes.
> CPU is
> > running 99% in systime. IO is OK, only interrupts are extremely high
> (over
> > 100k). System operates on 2 x Xeon 10 Core, 128 GB RAM, raid 10. Does
> anyone
> > have any idea?
>
> Try using strace to figure out where all that system time is going.
> Sometimes the '-c' option is helpful.
>
> It might also be helpful to connect gdb to the process and get a
> backtrace, then continue, stop it again, get another backtrace.
> Repeat that a few times and send us the backtrace that occurs most
> frequently.
>
> Is it a regular backend that is eating all that CPU time, or an
> autovacuum worker?
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

Re: BUG #6650: CPU system time utilization rising few times a day

From
Robert Haas
Date:
On Wed, May 23, 2012 at 5:29 AM, Andrzej Krawiec
<a.krawiec@focustelecom.pl> wrote:
> Cannot strace or gdb on a production system under heavy load (about 100
> transactions per second).
> It's in kernel space not user, so we are unable to anything at this
> particular moment (sometimes even the ssh connection seems to hang for a
> while).
> We suspect neither autovacuum (although suspected primarily) nor regular
> backend. It is system time. The question is: what's the reasone for that?
> We've dug through system and postgres logs, cleared out most of the long
> query problems, idle in transaction, optimized queries, vacuumed, reindexed
> and such.
> For a while it seemed like the particular kernel version is causing majority
> of problems. We have downgraded to 2.6.32.-71.29.1.el6.x86_64 and those
> problems went mostly! away. For few days we had no situations, but it
> happened again.

perf can tell you about problems in kernel-space, but I'm not sure it
exists that far back.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG #6650: CPU system time utilization rising few times a day

From
Robert Haas
Date:
On Thu, May 31, 2012 at 3:29 AM, Andrzej Krawiec
<a.krawiec@focustelecom.pl> wrote:
> Ok, we've managed to do strace -s during such a situation (see
> attached file). I have no clue what can it mean. Only errors count is
> quite strange.

How long was strace -s run for to generate this?

> Could this http://postgresql.1045698.n5.nabble.com/high-CPU-usage-for-stats-collector-in-8-2-td1962590.html
> affect our environment?

Not sure, but that's a much older version of PostgreSQL than the one
you're running, and I think there may have been some improvements
meanwhile.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG #6650: CPU system time utilization rising few times a day

From
Andrzej Krawiec
Date:
Ok, we've managed to do strace -s during such a situation (see
attached file). I have no clue what can it mean. Only errors count is
quite strange.
Could this http://postgresql.1045698.n5.nabble.com/high-CPU-usage-for-stats-collector-in-8-2-td1962590.html
affect our environment?

--
Andrzej Krawiec


>
> perf can tell you about problems in kernel-space, but I'm not sure it
> exists that far back.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Attachment

Re: BUG #6650: CPU system time utilization rising few times a day

From
Robert Haas
Date:
On Fri, Jun 1, 2012 at 2:17 AM, Andrzej Krawiec
<a.krawiec@focustelecom.pl> wrote:
> 2012/5/31 Robert Haas <robertmhaas@gmail.com>:
>> How long was strace -s run for to generate this?
>
> Strace - s was running for about 2 minutes.

Hmm, I'm sort of confused then.  This only shows a total of 1.815816
seconds of system time, which is a lot less than 99% utilization * 2
minutes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG #6650: CPU system time utilization rising few times a day

From
Andrzej Krawiec
Date:
It has happened again. Twice today. We have one perf dump. But it is over
600 MB of bzip2-ed data (3.5 GB of plain binary perf data).
We'll go deeper into that, and if we don't come to any conclusions we will
upload it to our anonymous FTP.

Don't know if this perf dump is ok, as it was again triggered by our sciprt
upon system load rise. Maybe it is too late...

Another question: is errors count from my strace log not high?

--
Andrzej Krawiec

2012/6/8 Robert Haas <robertmhaas@gmail.com>

> On Fri, Jun 1, 2012 at 2:17 AM, Andrzej Krawiec
> <a.krawiec@focustelecom.pl> wrote:
> > 2012/5/31 Robert Haas <robertmhaas@gmail.com>:
> >> How long was strace -s run for to generate this?
> >
> > Strace - s was running for about 2 minutes.
>
> Hmm, I'm sort of confused then.  This only shows a total of 1.815816
> seconds of system time, which is a lot less than 99% utilization * 2
> minutes.
>
>

Re: BUG #6650: CPU system time utilization rising few times a day

From
Andrzej Krawiec
Date:
Ok, we are 99% sure as for what is the reason for this situation. Let's say
we have a 1 GB of free memory. Any process (especially postgres, as it is
dedicated server) wants to allocate little more. Kernel has to look through
cached memory to determine which cache pages can be dropped and allocated
as user memory. If the server is idle (almost no queries) allocating huge
amounts of memory is almost invisible, but under production load (about
100-200 transactions per second) system halts for a while, to drop some of
cache pages.

We consider setting kernel value extra_free_kbytes to some reasonably high
value to check if it helps keeping free memory at certain level.

It seems that it's not a PostgreSQL bug.

Can anyone try to confirm this?

--
Regards
Andrzej Krawiec