Re: CPU spikes and transactions - Mailing list pgsql-performance

From Tony Kay
Subject Re: CPU spikes and transactions
Msg-id CAB=fRcqLZWmEW8omtc8+xzarAVPGrhnQmY1Lst0Edw1F92_HvQ@mail.gmail.com
In response to Re: CPU spikes and transactions  (Tomas Vondra <tv@fuzzy.cz>)
List pgsql-performance

On Mon, Oct 14, 2013 at 4:42 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 15.10.2013 01:00, Tony Kay wrote:
> Hi,
>
> I'm running 9.1.6 w/22GB shared buffers, and 32GB overall RAM on a
> 16-CPU Opteron 6276 box. We limit connections to roughly 120, but
> our webapp is configured to allocate a thread-local connection, so
> those connections are idle more than half the time.

Lower your shared buffers to about 20% of your RAM, unless you've tested
that the larger value actually helps in your particular case. It's unlikely
you'll get better performance from using more than that, especially on
older versions, so it's wiser to leave the rest for the page cache.

It might even be one of the causes of the performance issue you're
seeing, as shared buffers are not exactly overhead-free.

See this for more details on tuning:

   http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
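
If you want to check what's currently in effect, something along these
lines against pg_settings should work (the view and columns exist in 9.1):

   -- current values of the memory-related settings discussed above
   select name, setting, unit
     from pg_settings
    where name in ('shared_buffers', 'effective_cache_size', 'work_mem');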

I had followed the general directions from several sources years ago, which indicated using up to 40% of RAM. We've been running very large shared buffers for 4 years now, but it is difficult to generate a realistic load without testing against real users, so we have not felt the need to move the setting around. In general, I don't tend to tinker with a setting that has been fine for this long without good reason. I've been wanting to upgrade to one of the newer mmap-based versions of PostgreSQL, but was waiting to re-tune this when I did so.

Why do you suspect that shared_buffers would cause the behavior I'm seeing?
 


You're on a rather old 9.1.x version, BTW. The last version in this
branch is 9.1.10 and there are some important security fixes (e.g. in
9.1.9). Not sure if there are any fixes relevant to the performance
issue, though.

An upgrade to 9.1.10 is planned.
 
A few initial questions:

* What OS are we dealing with?

CentOS 6 (el6)
 

* So how many active connections are there on average (see
  pg_stat_activity for connections running queries)?

about 40-60
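
For reference, on 9.1 (where pg_stat_activity still has current_query
rather than query) a count like the following is roughly what that number
reflects -- a sketch, not necessarily the exact query used:

   -- backends currently executing something, excluding idle sessions (9.1 column names)
   select count(*)
     from pg_stat_activity
    where current_query not in ('<IDLE>', '<IDLE> in transaction');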
 

* How much data are we talking about? In total and in the imports?

80GB database. The imports are maybe 1-3 MB...often much smaller. 10k rows would probably be average.
 

> We have been running smoothly for over a year on this configuration,
> and recently started having huge CPU spikes that bring the system to
> its knees. Given that it is a multiuser system, it has been quite
> hard to pinpoint the exact cause, but I think we've narrowed it down
> to two data import jobs that were running in semi-long transactions
> (clusters of row inserts).
>
> The tables affected by these inserts are used in common queries.
>
> The imports will bring in a row count of perhaps 10k on average
> covering 4 tables.
>
> The insert transactions are at isolation level read committed (the
> default for the JDBC driver).
>
> When the import would run (again, theory...we have not been able to
> reproduce), we would end up maxed out on CPU, with a load average of
> 50 for 16 CPUs (our normal busy usage is a load average of 5 out of
> 16 CPUs).
>
> When looking at the active queries, most of them are against the
> tables that are affected by these imports.

Which processes consume most CPU time? Are those backends executing the
queries, or some background processes (checkpointer, autovacuum, ...)?


The backends executing the queries...most of the queries that appear hung normally run in a few ms.
 
Can you post a "top -c" output collected at the time of the CPU peak?


I don't have process accounting, so I cannot regenerate that; however, I can tell you what queries were active during one of the spikes.

There were 36 queries against table ind_event (which is one of the tables affected by the import). Those queries usually take 5-10ms, and we never see more than 2 active during normal operation. These had been active for _minutes_...a sample of the running queries:

time_active   |       datname       | procpid |   query                                      
-----------------+---------------------+---------+-------------------------------------------
 00:08:10.891105 | tudb                |    9058 | select * from mr_uss_ind_event_x where (tu
 00:08:10.981845 | tudb                |    8977 | select * from mr_uss_ind_event_x where (tu
 00:07:08.883347 | tudb                |    8930 | select * from mr_uss_ind_event_x where org
 00:07:15.266393 | tudb                |    8927 | select * from mr_uss_ind_event_x where org
 00:07:27.587133 | tudb                |   11867 | update msg_result set dt_result=$1,msg_id=
 00:08:06.458885 | tudb                |    8912 | select * from mr_uss_ind_event_x where org
 00:06:43.036487 | tudb                |    8887 | select * from mr_uss_ind_event_x where (tu
 00:07:01.992367 | tudb                |    8831 | select * from mr_uss_ind_event_x where (tu
 00:06:59.217721 | tudb                |    8816 | select * from mr_uss_ind_event_x where org
 00:07:07.558848 | tudb                |    8811 | update md_invoice set unbilled_amt=unbille
 00:07:30.636192 | tudb                |    8055 | select * from mr_uss_ind_event_x where (tu
 00:07:26.761053 | tudb                |    8053 | update msg_result set dt_result=$1,msg_id=
 00:06:46.021084 | tudb                |    8793 | select * from mr_uss_ind_event_x where (tu
 00:07:26.412781 | tudb                |    8041 | select * from mr_uss_ind_event_x where org
 00:07:43.315019 | tudb                |    8031 | select * from mr_uss_ind_event_x where org
 00:07:42.651143 | tudb                |    7990 | select * from mr_uss_ind_event_x where org
 00:06:45.258232 | tudb                |    7973 | select * from mr_uss_ind_event_x where (tu
 00:07:46.135027 | tudb                |    7961 | select * from mr_uss_ind_event_x where (tu
 00:07:31.814513 | tudb                |    7959 | select * from mr_uss_ind_event_x where (tu
 00:07:27.739959 | tudb                |    8221 | select * from mr_uss_ind_event_x where org
 00:07:21.554369 | tudb                |    8191 | select * from mr_uss_ind_event_x where org
 00:07:30.665133 | tudb                |    7953 | select * from mr_uss_ind_event_x where org
 00:07:17.727028 | tudb                |    7950 | select * from mr_uss_ind_event_x where org
 00:07:25.657611 | tudb                |    7948 | select * from mr_uss_ind_event_x where org
 00:07:28.118856 | tudb                |    7939 | select * from mr_uss_ind_event_x where org
 00:07:32.436705 | tudb                |    7874 | insert into mr_uss_ind_event (prelimtime_c
 00:08:12.090187 | tudb                |    7873 | select * from mr_uss_ind_event_x where (tu
 00:07:19.181981 | tudb                |    7914 | select * from mr_uss_ind_event_x where (tu
 00:07:04.234119 | tudb                |    7909 | select * from mr_uss_ind_event_x where (tu
 00:06:52.614609 | tudb                |    7856 | select * from mr_uss_ind_event_x where org
 00:07:18.667903 | tudb                |    7908 | select * from mr_uss_ind_event_x where (tu

The insert listed there is coming from that import...the others are querying a view that includes that table in a join.
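
For completeness, a pg_stat_activity query roughly like the one below would
produce a listing of that shape on 9.1; this is a reconstruction, not
necessarily the exact query we ran:

   -- long-running, non-idle backends (9.1 column names: procpid, current_query)
   select now() - query_start as time_active,
          datname,
          procpid,
          current_query as query
     from pg_stat_activity
    where current_query not like '<IDLE>%'
      and now() - query_start > interval '1 minute';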

Also, try to collect a few snapshots of the pg_stat_bgwriter catalog before
and during the loads. Don't forget to include the timestamp:

   select now(), * from pg_stat_bgwriter;
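
If it helps, one simple way to keep those snapshots around for diffing
later is to append them to a scratch table -- just a sketch, any table
name will do:

   -- one-time setup: capture the first snapshot into a scratch table
   create table bgwriter_snaps as
     select now() as ts, * from pg_stat_bgwriter;

   -- append further snapshots before/during/after the imports
   insert into bgwriter_snaps
     select now(), * from pg_stat_bgwriter;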


This is a live production system, and it will take me some doing to generate a load on a test server that triggers the condition. I'll be certain to gather this and the other stats if I can trigger it.
 
And while you're at it, pg_stat_database snapshots might be handy too
(not sure whether you're running a single database or multiple ones), so use
either

  select now(), * from pg_stat_database;

or

  select now(), * from pg_stat_database where datname = '..dbname..';

That should give us at least some insight into what's happening.

> Our workaround (that is holding at present) was to drop the
> transactions on those imports (which is not optimal, but fortunately
> is acceptable for this particular data). This workaround has
> prevented any further incidents, but is of course inconclusive.
>
> Does this sound familiar to anyone, and if so, please advise.

I'm wondering how this could be related to the transactions. IIRC the
stats (e.g. the number of inserted rows) are sent at commit time, and that
might trigger autovacuum. But without the transactions, autovacuum would
be triggered sooner ...

regards
Tomas


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
