Thread: BUG #6650: CPU system time utilization rising few times a day
The following bug has been logged on the website:

Bug reference:      6650
Logged by:          Andrzej Krawiec
Email address:      a.krawiec@focustelecom.pl
PostgreSQL version: 8.4.11
Operating system:   CentOS 6.0 - 2.6.32-220.13.1.el6.x86_64
Description:

Primarily checked on PG 8.4.9 (same OS); the problem occurs there as
well. A few times a day I get a situation where PostgreSQL stops running
for 1-2 minutes. The CPU is running at 99% system time. I/O is OK; only
interrupts are extremely high (over 100k). The system runs on 2 x Xeon
10-core, 128 GB RAM, RAID 10. Does anyone have any idea?
On Fri, May 18, 2012 at 5:09 AM, <a.krawiec@focustelecom.pl> wrote:
> The following bug has been logged on the website:
>
> Bug reference:      6650
> Logged by:          Andrzej Krawiec
> Email address:      a.krawiec@focustelecom.pl
> PostgreSQL version: 8.4.11
> Operating system:   CentOS 6.0 - 2.6.32-220.13.1.el6.x86_64
> Description:
>
> Primarily checked on PG 8.4.9 (same OS); the problem occurs there as
> well. A few times a day I get a situation where PostgreSQL stops
> running for 1-2 minutes. The CPU is running at 99% system time. I/O is
> OK; only interrupts are extremely high (over 100k). The system runs on
> 2 x Xeon 10-core, 128 GB RAM, RAID 10. Does anyone have any idea?

Try using strace to figure out where all that system time is going.
Sometimes the '-c' option is helpful.

It might also be helpful to connect gdb to the process and get a
backtrace, then continue, stop it again, and get another backtrace.
Repeat that a few times and send us the backtrace that occurs most
frequently.

Is it a regular backend that is eating all that CPU time, or an
autovacuum worker?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
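[For reference, the two diagnostics suggested above can be sketched as a
pair of shell commands. The PID value is hypothetical, and `timeout`
(coreutils) is assumed to be available:]

```shell
# Hypothetical backend pid; substitute the real one, e.g. from pg_stat_activity.
PID=12345

# Summarize syscall counts, error counts and system time for ~30 seconds;
# strace -c prints its summary table when it detaches.
timeout 30 strace -c -p "$PID"

# Poor man's profiler: attach with gdb, dump a backtrace, detach, wait,
# repeat. The backtrace that shows up most often is the interesting one.
for i in 1 2 3 4 5; do
  gdb -p "$PID" --batch -ex bt
  sleep 2
done
```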
We cannot strace or gdb on a production system under heavy load (about
100 transactions per second). The time is spent in kernel space, not
user space, so we are unable to do anything at that particular moment
(sometimes even the ssh connection seems to hang for a while).

We suspect neither autovacuum (although it was our first suspect) nor a
regular backend. It is system time. The question is: what is the reason
for that? We have dug through the system and postgres logs, cleared out
most of the long-query problems and idle-in-transaction sessions,
optimized queries, vacuumed, reindexed, and so on.

For a while it seemed like the particular kernel version was causing the
majority of the problems. We downgraded to 2.6.32-71.29.1.el6.x86_64 and
those problems mostly went away! For a few days we had no incidents, but
then it happened again.

Regards
--
Andrzej Krawiec

2012/5/22 Robert Haas <robertmhaas@gmail.com>:
> Try using strace to figure out where all that system time is going.
> Sometimes the '-c' option is helpful.
>
> Is it a regular backend that is eating all that CPU time, or an
> autovacuum worker?
On Wed, May 23, 2012 at 5:29 AM, Andrzej Krawiec
<a.krawiec@focustelecom.pl> wrote:
> We cannot strace or gdb on a production system under heavy load (about
> 100 transactions per second). The time is spent in kernel space, not
> user space, so we are unable to do anything at that particular moment
> (sometimes even the ssh connection seems to hang for a while).
>
> For a while it seemed like the particular kernel version was causing
> the majority of the problems. We downgraded to
> 2.6.32-71.29.1.el6.x86_64 and those problems mostly went away! For a
> few days we had no incidents, but then it happened again.

perf can tell you about problems in kernel space, but I'm not sure it
exists that far back.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
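[A minimal system-wide perf session looks something like the sketch
below. perf first appeared with kernel 2.6.31, so the RHEL 6 kernels in
this thread should carry a backport, though the exact feature set
available there is an assumption:]

```shell
# Record call graphs on all CPUs for 30 seconds while the stall is happening.
perf record -a -g -- sleep 30

# Summarize the samples; with a system-time problem, kernel symbols
# (marked [k]) should dominate the top of the report.
perf report --sort comm,symbol
```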
On Thu, May 31, 2012 at 3:29 AM, Andrzej Krawiec
<a.krawiec@focustelecom.pl> wrote:
> OK, we managed to run strace -s during such a situation (see the
> attached file). I have no clue what it can mean. Only the error counts
> are quite strange.

How long was strace -s run for to generate this?

> Could this http://postgresql.1045698.n5.nabble.com/high-CPU-usage-for-stats-collector-in-8-2-td1962590.html
> affect our environment?

Not sure, but that's a much older version of PostgreSQL than the one
you're running, and I think there may have been some improvements in
the meantime.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
OK, we managed to run strace -s during such a situation (see the
attached file). I have no clue what it can mean. Only the error counts
are quite strange.

Could this http://postgresql.1045698.n5.nabble.com/high-CPU-usage-for-stats-collector-in-8-2-td1962590.html
affect our environment?

--
Andrzej Krawiec

> perf can tell you about problems in kernel space, but I'm not sure it
> exists that far back.
Attachment
On Fri, Jun 1, 2012 at 2:17 AM, Andrzej Krawiec
<a.krawiec@focustelecom.pl> wrote:
> 2012/5/31 Robert Haas <robertmhaas@gmail.com>:
>> How long was strace -s run for to generate this?
>
> strace -s was running for about 2 minutes.

Hmm, I'm sort of confused then. This only shows a total of 1.815816
seconds of system time, which is a lot less than 99% utilization * 2
minutes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
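[One possible explanation for the undercount is that strace was attached
to a single process while the system time was spread across many
backends. A sketch that attaches to every running postgres process at
once, assuming pgrep is available (strace accepts multiple -p flags):]

```shell
# Build one "-p <pid>" flag per running postgres process.
PIDS=$(pgrep postgres | sed 's/^/-p /' | tr '\n' ' ')

# -f additionally follows children forked after attaching; the -c
# summary is printed for all traced pids on exit. Skip if no postgres
# processes were found.
if [ -n "$PIDS" ]; then
  timeout 120 strace -c -f $PIDS
fi
```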
It has happened again, twice today. We have one perf dump, but it is
over 600 MB of bzip2-ed data (3.5 GB of raw binary perf data). We'll dig
deeper into it, and if we don't come to any conclusions we will upload
it to our anonymous FTP. We don't know if this perf dump is OK, as it
was again triggered by our script upon a rise in system load. Maybe it
was too late...

Another question: aren't the error counts in my strace log high?

--
Andrzej Krawiec

2012/6/8 Robert Haas <robertmhaas@gmail.com>:
> Hmm, I'm sort of confused then. This only shows a total of 1.815816
> seconds of system time, which is a lot less than 99% utilization * 2
> minutes.
OK, we are 99% sure about the reason for this situation.

Let's say we have 1 GB of free memory. Some process (especially
postgres, as this is a dedicated server) wants to allocate a little
more. The kernel then has to scan the page cache to determine which
cached pages can be dropped and handed out as user memory. If the server
is idle (almost no queries), allocating huge amounts of memory is almost
invisible, but under production load (about 100-200 transactions per
second) the system halts for a while in order to drop some of the cache
pages.

We are considering setting the kernel value extra_free_kbytes to some
reasonably high value, to check whether it helps keep free memory at a
certain level. It seems that this is not a PostgreSQL bug. Can anyone
confirm this?

--
Regards
Andrzej Krawiec
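[The idea can be expressed as a sysctl sketch. Note that
vm.extra_free_kbytes exists only in patched kernels; mainline kernels
expose vm.min_free_kbytes, which serves the similar purpose of keeping a
reserve of free pages so that allocations do not stall in reclaim. The
1 GB figure below is an illustration, not a tested recommendation:]

```shell
# Inspect the current reclaim watermark and the free/cached memory.
cat /proc/sys/vm/min_free_kbytes
grep -E '^(MemFree|Cached):' /proc/meminfo

# Keep roughly 1 GB free on this 128 GB box (illustrative value;
# requires root, and too large a value wastes memory permanently).
sysctl -w vm.min_free_kbytes=1048576

# On kernels carrying the extra_free_kbytes patch, the analogous knob:
# sysctl -w vm.extra_free_kbytes=1048576
```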