Thread: High SYS CPU - need advise

High SYS CPU - need advise

From
Vlad
Date:
Hello everyone,

I'm seeking help in diagnosing / figuring out an issue we have with our DB server:

Under relatively light load (300-400 TPS), every 10-30 seconds the server drops into high system CPU usage (90%+ SYSTEM across all CPUs - it's pure SYS cpu, i.e. not I/O wait, not IRQ, not user). PostgreSQL itself is taking 10-15% at the same time. These periods last from a few seconds to minutes, or until PostgreSQL is restarted. Needless to say, the system is barely responsive, with the load average hitting over 100. We have mostly select statements (joins across a few tables), using indexes and returning a small number of records. If the rate of incoming requests drops a bit, the server does not fall into those HIGH-SYS-CPU periods. It all seems as if postgres runs out of some resource or fights for some locks, and that causes the kernel to go into la-la land trying to manage it.


So far we've checked:
- disk and NIC delays / errors / utilization
- WAL files (created rarely)
- tables are vacuumed OK; the periods of high SYS are not tied to the vacuum process (see the sketch after this list)
- kernel resource utilization (sufficient FS handles, shared MEM/SEM, VM)
- increased the log level, but nothing suspicious/different (to me) is reported there during the periods of high SYS CPU
- ran pgbench (could not reproduce the issue, even though it sustained over 40,000 TPS for a prolonged period)
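
For the vacuum point above, a minimal sketch of the kind of check involved, against the standard 9.1 statistics views (the exact query is illustrative):

  -- when were tables last (auto)vacuumed, and is the dead-tuple count growing?
  SELECT schemaname, relname, last_vacuum, last_autovacuum,
         n_dead_tup, n_live_tup
  FROM pg_stat_user_tables
  ORDER BY n_dead_tup DESC
  LIMIT 20;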

Basically, our symptoms are exactly as reported here over a year ago (though that was for postgres 8.3; we run 9.1):

I will be grateful for any ideas helping to resolve or diagnose this problem. 

Environment background:

Postgresql 9.1.6. 
Postgres usually has 400-500 connected clients, most of them are idle.
The database has over 1000 tables (across 5 namespaces), taking ~150Gb on disk.

Linux 3.5.5 (Fedora 17 x64) on 32Gb RAM / 8 cores

Default configuration changed: 
max_connections = 1200
shared_buffers = 3200MB
temp_buffers = 18MB
max_prepared_transactions = 500
work_mem = 16MB            
maintenance_work_mem = 64MB
max_files_per_process = 3000
wal_level = hot_standby
fsync = off
checkpoint_segments = 64  
checkpoint_timeout = 15min
effective_cache_size = 8GB
default_statistics_target = 500 




-- Vlad

Re: High SYS CPU - need advise

From
John R Pierce
Date:
On 11/14/12 1:13 PM, Vlad wrote:
> Postgresql 9.1.6.
> Postgres usually has 400-500 connected clients, most of them are idle.
> Database is over 1000 tables (across 5 namespaces), taking ~150Gb on disk.

that's a really high client connection count for an 8-core system.

I'd consider implementing a connection pool (like pgbouncer), and
rewriting your client applications to connect, do a transaction,
disconnect, so the actual number of postgres connections is much lower,
say in the 16-48 range.


--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast



Re: High SYS CPU - need advise

From
Vlad
Date:
John,

thanks for your feedback. While implementing connection pooling would make resource utilization more efficient, I don't think it's the root of my problem. Most of the connected clients are idle. When I do

select * from pg_stat_activity where current_query not like '%IDLE%';

I only see several active queries at any given time.
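
(A quick way to break that down, as a sketch using the 9.1 column names:)

  -- how many backends are idle / active / waiting on a lock
  SELECT waiting,
         current_query LIKE '<IDLE>%' AS idle,
         count(*)
  FROM pg_stat_activity
  GROUP BY 1, 2;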


-- Vlad


On Wed, Nov 14, 2012 at 3:23 PM, John R Pierce <pierce@hogranch.com> wrote:
On 11/14/12 1:13 PM, Vlad wrote:
Postgresql 9.1.6.
Postgres usually has 400-500 connected clients, most of them are idle.
Database is over 1000 tables (across 5 namespaces), taking ~150Gb on disk.

thats a really high client connection count for a 8 core system.

I'd consider implementing a connection pool (like pgbouncer), and rewriting your client applications to connect, do a transaction, disconnect, so the actual number of postgres connections is much lower, say in the 16-48 range.

Re: High SYS CPU - need advise

From
John R Pierce
Date:
On 11/14/12 1:34 PM, Vlad wrote:
> thanks for your feedback. While implementing connection pooling would
> make resources utilization more efficient, I don't think it's the root
> of my problem. Most of the connected clients are at IDLE. When I do
>
> select * from pg_stat_activity where current_query not like '%IDLE%';
>
> I only see several active queries at any given time.

what about during these spikes?

--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast



Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Wed, Nov 14, 2012 at 4:08 PM, John R Pierce <pierce@hogranch.com> wrote:
> On 11/14/12 1:34 PM, Vlad wrote:
>>
>> thanks for your feedback. While implementing connection pooling would make
>> resources utilization more efficient, I don't think it's the root of my
>> problem. Most of the connected clients are at IDLE. When I do
>>
>> select * from pg_stat_activity where current_query not like '%IDLE%';
>>
>> I only see several active queries at any given time.
>
>
> what about during these spikes?

Yeah -- if you are seeing a lot of queries pile up during these times,
it's time to explore a connection pooler, because it will keep the system
load manageable during these situations.

things to check:
*) blocked queries (pg_locks/pg_stat_activity) -- see the sketch after this list
*) i/o wait. in particular, is the linux page cache flushing?
*) query stampede: we will want to measure TPS leading into the stall
and out of it.
*) anything else running on the box?
*) you have a large number of tables -- maybe this is statistics-file related?
*) let's log checkpoints to see if there is correlation with the stall
*) nice to have vmstat/iostat during/before/after the stall.  for example,
a massive spike of context switches during the stall could point to an o/s
scheduler issue.
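
For the first item, a minimal sketch of the kind of check I mean, using the
9.1 catalog/view column names (illustrative, adjust as needed):

  -- backends waiting on a lock that has not been granted, and what they run
  SELECT a.procpid, a.waiting, l.locktype, l.mode,
         l.relation::regclass, a.current_query
  FROM pg_locks l
  JOIN pg_stat_activity a ON a.procpid = l.pid
  WHERE NOT l.granted;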

merlin


Re: High SYS CPU - need advise

From
Vlad
Date:
There is no big spike of queries that causes this; queries come in at a relatively stable pace. It's just that the higher the rate of incoming queries, the more likely this is to happen. Yes, when a stall happens the active queries pile up - but that's the result of the stall (the server reacts slowly even to a keypress, not to mention query execution), not the cause.


things to check:
*) blocked queries (pg_locks/pg_stat_activity)

nada
 
*) i/o wait. in particular, is linux page cache flushing.

no i/o wait, no IRQ (see below)
 
*) query stampede: we will want to measure TPS leading into the stall
and out of it.
*) anything else running on the box?

just bare linux + postgresql.
 
*) you have a large number of tables -- maybe this is statistics-file related?

over 1000 tables across 4 or 5 schemas in a single database. 

*) let's log checkpoints to see if there is correlation with stall

checked; checkpoints happen much more rarely and without relation to the high-sys periods
 
*) nice to have vmstat/iostat during/before/after stall.  for example,
massive spike of context switches during stall could point to o/s
scheduler issue.


checked that as well - nothing unusual. Context switches actually go lower during the stall.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          16.94    0.00    9.28    0.38    0.00   73.40

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               6.00        48.00         0.00         48          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          18.06    0.00   18.43    0.25    0.00   63.26

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              10.00       104.00         0.00        104          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.12    0.00   86.74    0.12    0.00    4.01

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.97         7.77         0.00          8          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.44    0.00   96.58    0.00    0.00    1.98

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               3.28        78.69         0.00        144          0



procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 279240  12016 14431964    0    0    32     0 197852 4299 15  9 76  0  0
 4  0      0 225984  12024 14419696    0    0     0    64 197711 5158 11  9 79  1  0
 0  0      0 260112  12024 14413636    0    0    48     0 196708 4618 17 10 73  0  0
 6  0      0 233936  12024 14375784    0    0   104     0 179861 4884 19 17 64  0  0
30  0      0 224904  12024 14354812    0    0     8     0 51088 1205  9 86  5  0  0
72  0      0 239144  12024 14333852    0    0   144     0 45601  542  2 98  0  0  0
78  0      0 224840  12024 14328536    0    0     0     0 38732  481  2 94  5  0  0
22  1      0 219072  12032 14250652    0    0   136   100 47323 1231  9 90  1  0  0







Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Thu, Nov 15, 2012 at 11:50 AM, Vlad <marchenko@gmail.com> wrote:
> there is no big spike of queries that cause that, queries come in relatively
> stable pace. It's just when the higher rate of queries coming, the more
> likely this to happen.  yes, when stall happens , the active queries pile up
> - but that's the result of a stall (the server reacts slow on a keypress,
> not to mention queries execution), not the cause.
>
> procs -----------memory---------- ---swap-- -----io---- --system--
> -----cpu-----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id
> wa st
>  1  0      0 279240  12016 14431964    0    0    32     0 197852 4299 15  9
> 76  0  0
>  4  0      0 225984  12024 14419696    0    0     0    64 197711 5158 11  9
> 79  1  0
>  0  0      0 260112  12024 14413636    0    0    48     0 196708 4618 17 10
> 73  0  0
>  6  0      0 233936  12024 14375784    0    0   104     0 179861 4884 19 17
> 64  0  0
> 30  0      0 224904  12024 14354812    0    0     8     0 51088 1205  9 86
> 5  0  0
> 72  0      0 239144  12024 14333852    0    0   144     0 45601  542  2 98
> 0  0  0
> 78  0      0 224840  12024 14328536    0    0     0     0 38732  481  2 94
> 5  0  0
> 22  1      0 219072  12032 14250652    0    0   136   100 47323 1231  9 90
> 1  0  0

hm.  well, we can definitely rule out i/o.   I reviewed your last
posting, and you said:
"Out of the top 50 processes in top, 48 of them are postmasters, one
is syslog, and one is psql. Each of the postmasters have a high %CPU,
the top ones being 80% and higher, the rest being anywhere between 30%
- 60%. Would postmaster 'queries' that are running attribute to the
sys CPU usage, or should they be under the 'us' CPU usage?"

Is this still true?  Can we capture an strace from one of the high-%CPU
postmasters to see if there are any clues there?   Maybe we've uncovered
some type of weird spinlock contention issue.  How large is your
database (or at least the portion of it that's commonly used)?  Would
you characterize your queries as mostly small lookups, scans, or a
mix?

merlin


Re: High SYS CPU - need advise

From
Vlad
Date:
Merlin,

that is not from my report - it's probably from the thread that I referenced as having common symptoms. Here is info about my db:


Postgresql 9.1.6. 
Postgres usually has 400-500 connected clients, most of them are idle.
Database is over 1000 tables (across 5 namespaces), taking ~150Gb on disk.
We have mostly select statements (joins across few tables), using indexes and resulting in a small number of records returned.

So mostly small lookups across a few tables, joining by a primary key / index.
Also, I just performed an experiment - I switched our app over to using the hot standby (PITR replica) instead of the master, and it experienced the same problem. Since the standby can only serve read-only queries, there are no write locks (except maybe while it is replaying WAL records from the master?), yet SYS CPU still jumped sky-high.
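
(As a side check on the standby, 9.1 keeps per-database recovery-conflict counters; a minimal sketch of that check, for what it's worth:)

  -- run on the standby: how often WAL replay has conflicted with queries
  SELECT datname, confl_lock, confl_snapshot, confl_bufferpin, confl_deadlock
  FROM pg_stat_database_conflicts
  WHERE datname = current_database();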

-- Vlad




Is this still true?  Can we capture strace from one of the high %
postmasters to see if there's any clues there.   Maybe we've uncovered
some type of weird spinlock contention issue.  How large is your
database (or at least the portion of it that's commonly used)?  Would
you characterize your queries as mostly small lookups, scans, or a
mix?

merlin

Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Thu, Nov 15, 2012 at 2:20 PM, Vlad <marchenko@gmail.com> wrote:
> Merlin,
>
> this is not my report, probably from a thread that I've referenced as having
> a common symptoms. Here is info about my db:
>
>
> Postgresql 9.1.6.
> Postgres usually has 400-500 connected clients, most of them are idle.
> Database is over 1000 tables (across 5 namespaces), taking ~150Gb on disk.
> We have mostly select statements (joins across few tables), using indexes
> and resulting in a small number of records returned.
>
> So mostly small lookups across few tables joining by a primary key / index.
> Also, I just perform an experiment - I switched our app over to using hot
> PITR instead of master and it experienced the same problem. So since PITR db
> can only perform read-only queries, there is no write-locks (except maybe
> when pitr is playing wal records from the master?), nevertheless SYS CPU
> jumped sky.

yeah.  ok, next steps:
*) can you confirm that the postgres process is using high cpu (according
to top) during stall time
*) if so, please strace that process and save some of the log
*) you're using a 'bleeding edge' kernel.  so we must be suspicious of
a regression there, particularly in the scheduler.
*) I am suspicious of a spinlock issue. so, if we can't isolate the
problem, is running a hand compiled postgres a possibility (for lock
stats)?
*) what is the output of this:
echo /proc/sys/vm/zone_reclaim_mode

merlin


Re: High SYS CPU - need advise

From
Vlad
Date:


yeah.  ok, nest steps:
*) can you confirm that postgres process is using high cpu (according
to top) during stall time

yes, CPU is spread across a lot of postmasters

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
29863 pgsql     20   0 3636m 102m  36m R 19.1  0.3   0:01.33 postmaster
30277 pgsql     20   0 3645m 111m  37m R 16.8  0.3   0:01.27 postmaster
11966 pgsql     20   0 3568m  22m  15m R 15.1  0.1   0:00.66 postmaster
 8073 pgsql     20   0 3602m  60m  26m S 13.6  0.2   0:00.77 postmaster
29780 pgsql     20   0 3646m 115m  43m R 13.6  0.4   0:01.13 postmaster
11865 pgsql     20   0 3606m  61m  23m S 12.8  0.2   0:01.87 postmaster
29379 pgsql     20   0 3603m  70m  30m R 12.8  0.2   0:00.80 postmaster
29727 pgsql     20   0 3616m  77m  31m R 12.5  0.2   0:00.81 postmaster


 
*) if, so, please strace that process and save some of the log

 
*) you're using a 'bleeding edge' kernel.  so we must be suspicious of
a regression there, particularly in the scheduler.

this has been observed for a while, during which period the system went from 3.4.* kernels to 3.5.*... but I do not deny such a possibility.
 
*) I am suspicious of spinlock issue. so, if we can't isolate the
problem, is running a hand complied postgres a possibility (for lock
stats)?


Yes, definitely possible - we run a manually compiled postgresql anyway. Please provide instructions.


 
*) what is the output of this:
echo /proc/sys/vm/zone_reclaim_mode


I presume you wanted cat instead of echo, and it shows 0.


-- vlad

Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Thu, Nov 15, 2012 at 2:44 PM, Vlad <marchenko@gmail.com> wrote:
>
>>
>> yeah.  ok, nest steps:
>> *) can you confirm that postgres process is using high cpu (according
>> to top) during stall time
>
>
> yes, CPU is spread across a lot of postmasters
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 29863 pgsql     20   0 3636m 102m  36m R 19.1  0.3   0:01.33 postmaster
> 30277 pgsql     20   0 3645m 111m  37m R 16.8  0.3   0:01.27 postmaster
> 11966 pgsql     20   0 3568m  22m  15m R 15.1  0.1   0:00.66 postmaster
>  8073 pgsql     20   0 3602m  60m  26m S 13.6  0.2   0:00.77 postmaster
> 29780 pgsql     20   0 3646m 115m  43m R 13.6  0.4   0:01.13 postmaster
> 11865 pgsql     20   0 3606m  61m  23m S 12.8  0.2   0:01.87 postmaster
> 29379 pgsql     20   0 3603m  70m  30m R 12.8  0.2   0:00.80 postmaster
> 29727 pgsql     20   0 3616m  77m  31m R 12.5  0.2   0:00.81 postmaster
>
>>
>> *) if, so, please strace that process and save some of the log
>
>
> https://dl.dropbox.com/u/109778/stall_postmaster.log

ok, excellent.   reviewing the log, this immediately caught my eye:

recvfrom(8, "\27\3\1\0@", 5, 0, NULL, NULL) = 5
recvfrom(8, "\327\327\nl\231LD\211\346\243@WW\254\244\363C\326\247\341\177\255\263~\327HDv-\3466\353"...,
64, 0, NULL, NULL) = 64
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 3000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 6000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 7000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 9000})  = 0 (Timeout)
semop(41713721, {{2, 1, 0}}, 1)         = 0
lseek(295, 0, SEEK_END)                 = 0
lseek(296, 0, SEEK_END)                 = 8192

this is definitely pointing to a spinlock issue (see s_lock.c). what you
are seeing here is a backend getting caught in a spin loop via a stuck
spinlock. 0-9 = 10 times.  did you by any chance check the logs? did
any backends restart? we are looking for this:

        elog(PANIC, "stuck spinlock (%p) detected at %s:%d",
                 lock, file, line);

Anything else going on in the log?   The way to enable lock statistics is
via the LWLOCK_STATS macro.  But before doing anything, please check the
logs for errors/panics.

merlin


Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Thu, Nov 15, 2012 at 3:49 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, Nov 15, 2012 at 2:44 PM, Vlad <marchenko@gmail.com> wrote:
>>
>>>
>>> yeah.  ok, nest steps:
>>> *) can you confirm that postgres process is using high cpu (according
>>> to top) during stall time
>>
>>
>> yes, CPU is spread across a lot of postmasters
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> 29863 pgsql     20   0 3636m 102m  36m R 19.1  0.3   0:01.33 postmaster
>> 30277 pgsql     20   0 3645m 111m  37m R 16.8  0.3   0:01.27 postmaster
>> 11966 pgsql     20   0 3568m  22m  15m R 15.1  0.1   0:00.66 postmaster
>>  8073 pgsql     20   0 3602m  60m  26m S 13.6  0.2   0:00.77 postmaster
>> 29780 pgsql     20   0 3646m 115m  43m R 13.6  0.4   0:01.13 postmaster
>> 11865 pgsql     20   0 3606m  61m  23m S 12.8  0.2   0:01.87 postmaster
>> 29379 pgsql     20   0 3603m  70m  30m R 12.8  0.2   0:00.80 postmaster
>> 29727 pgsql     20   0 3616m  77m  31m R 12.5  0.2   0:00.81 postmaster
>>
>>>
>>> *) if, so, please strace that process and save some of the log
>>
>>
>> https://dl.dropbox.com/u/109778/stall_postmaster.log
>
> ok, excellent.   reviewing the log, this immediately caught my eye:
>
> recvfrom(8, "\27\3\1\0@", 5, 0, NULL, NULL) = 5
> recvfrom(8, "\327\327\nl\231LD\211\346\243@WW\254\244\363C\326\247\341\177\255\263~\327HDv-\3466\353"...,
> 64, 0, NULL, NULL) = 64
> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 2000})  = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 3000})  = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 4000})  = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 6000})  = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 7000})  = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 8000})  = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 9000})  = 0 (Timeout)
> semop(41713721, {{2, 1, 0}}, 1)         = 0
> lseek(295, 0, SEEK_END)                 = 0
> lseek(296, 0, SEEK_END)                 = 8192
>
> this is definitely pointing to spinlock issue. see: slock.c what you
> are seeing here is a backend getting caught in a spin loop via a stuck
> spinlock. 0-9 = 10 times.  did you by any chance check the logs? did
> any backends restart? we are looking for this:
>
>         elog(PANIC, "stuck spinlock (%p) detected at %s:%d",
>                  lock, file, line);
>
> Anything else going on in the log?   The way to enable locks status is
> via LWLOCK_STATS macro.  But before doing any thing plz check logs for
> error/panic.

ah, scratch that.  reading the code again, s_lock.c allows for up
to 1 second (that is, you can delay 1000 times).  so the panic is
unlikely to be in the log.  also spinlock waits shouldn't increase by
exactly 1000 us. investigating...

merlin


Re: High SYS CPU - need advise

From
Alvaro Herrera
Date:
Merlin Moncure wrote:

> ok, excellent.   reviewing the log, this immediately caught my eye:
>
> recvfrom(8, "\27\3\1\0@", 5, 0, NULL, NULL) = 5
> recvfrom(8, "\327\327\nl\231LD\211\346\243@WW\254\244\363C\326\247\341\177\255\263~\327HDv-\3466\353"...,
> 64, 0, NULL, NULL) = 64
> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 2000})  = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 3000})  = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 4000})  = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 6000})  = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 7000})  = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 8000})  = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 9000})  = 0 (Timeout)
> semop(41713721, {{2, 1, 0}}, 1)         = 0
> lseek(295, 0, SEEK_END)                 = 0
> lseek(296, 0, SEEK_END)                 = 8192
>
> this is definitely pointing to spinlock issue.

I met Rik van Riel (Linux kernel hacker) recently and we chatted about
this briefly.  He strongly suggested that we should consider using
futexes on Linux instead of spinlocks; the big advantage being that
futexes sleep instead of spinning when contention is high.  That would
reduce the system load in this scenario.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: High SYS CPU - need advise

From
Vlad
Date:

sorry - no panics / errors in the log... 

-- Vlad


Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Thu, Nov 15, 2012 at 4:29 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Merlin Moncure escribió:
>
>> ok, excellent.   reviewing the log, this immediately caught my eye:
>>
>> recvfrom(8, "\27\3\1\0@", 5, 0, NULL, NULL) = 5
>> recvfrom(8, "\327\327\nl\231LD\211\346\243@WW\254\244\363C\326\247\341\177\255\263~\327HDv-\3466\353"...,
>> 64, 0, NULL, NULL) = 64
>> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 2000})  = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 3000})  = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 4000})  = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 6000})  = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 7000})  = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 8000})  = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 9000})  = 0 (Timeout)
>> semop(41713721, {{2, 1, 0}}, 1)         = 0
>> lseek(295, 0, SEEK_END)                 = 0
>> lseek(296, 0, SEEK_END)                 = 8192
>>
>> this is definitely pointing to spinlock issue.
>
> I met Rik van Riel (Linux kernel hacker) recently and we chatted about
> this briefly.  He strongly suggested that we should consider using
> futexes on Linux instead of spinlocks; the big advantage being that
> futexes sleep instead of spinning when contention is high.  That would
> reduce the system load in this scenario.

Well, so do postgres spinlocks, right?  When we overflow
spins_per_delay we go to pg_usleep, which proxies to select() --
postgres spinlocks are a hybrid implementation.  Moving to futexes is a
possible improvement (that's another discussion) in degenerate cases,
but I'm not sure that I've exactly zeroed in on the problem.  Or am I
missing something?

What I've been scratching my head over is what code exactly would
cause an iterative sleep like the above.  The code is here:

  pg_usleep(cur_delay * 1000L);

  /* increase delay by a random fraction between 1X and 2X */
  cur_delay += (int) (cur_delay *
        ((double) random() / (double) MAX_RANDOM_VALUE) + 0.5);
  /* wrap back to minimum delay when max is exceeded */
  if (cur_delay > MAX_DELAY_MSEC)
    cur_delay = MIN_DELAY_MSEC;

...so cur_delay is supposed to increase in a non-linear fashion.  I've
looked around the sleep, usleep, and latch calls and as of yet haven't
found anything that advances the timeout just like that (need to do
another pass). And we don't know for sure that this is directly related
to the OP's problem.

merlin


Re: High SYS CPU - need advise

From
Tom Lane
Date:
Merlin Moncure <mmoncure@gmail.com> writes:
> What I've been scratching my head over is what code exactly would
> cause an iterative sleep like the above.  The code is here:

>   pg_usleep(cur_delay * 1000L);

>   /* increase delay by a random fraction between 1X and 2X */
>   cur_delay += (int) (cur_delay *
>         ((double) random() / (double) MAX_RANDOM_VALUE) + 0.5);
>   /* wrap back to minimum delay when max is exceeded */
>   if (cur_delay > MAX_DELAY_MSEC)
>     cur_delay = MIN_DELAY_MSEC;

IIRC that coding is of relatively recent vintage.  The OP is probably
running some older version that increased cur_delay differently.

            regards, tom lane


Re: High SYS CPU - need advise

From
Vlad Marchenko
Date:
Tom,

I just checked the version I'm running (9.1.6), and the code is quite
similar (src/backend/storage/lmgr/s_lock.c)

             pg_usleep(cur_delay * 1000L);

#if defined(S_LOCK_TEST)
             fprintf(stdout, "*");
             fflush(stdout);
#endif

             /* increase delay by a random fraction between 1X and 2X */
             cur_delay += (int) (cur_delay *
                       ((double) random() / (double) MAX_RANDOM_VALUE) + 0.5);
             /* wrap back to minimum delay when max is exceeded */
             if (cur_delay > MAX_DELAY_MSEC)
                 cur_delay = MIN_DELAY_MSEC;

-- vlad




Re: High SYS CPU - need advise

From
Jeff Janes
Date:
On Thu, Nov 15, 2012 at 2:44 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

>>> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 2000})  = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 3000})  = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 4000})  = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 6000})  = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 7000})  = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 8000})  = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 9000})  = 0 (Timeout)

This is not entirely inconsistent with the spinlock.  Note that 1000
is repeated 3 times, and 5000 is missing.

This might just be a misleading random sample we got here.  I've seen
similar close spacing in some simulations I've run.

It is not clear to me why we use a resolution of 1 msec here.  If the
OS's implementation of select() eventually rounds to the nearest msec,
that is its business.  But why do we have to lose intermediate
precision due to its decision?

Cheers,

Jeff


Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Thu, Nov 15, 2012 at 6:07 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Nov 15, 2012 at 2:44 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>
>>>> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
>>>> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
>>>> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
>>>> select(0, NULL, NULL, NULL, {0, 2000})  = 0 (Timeout)
>>>> select(0, NULL, NULL, NULL, {0, 3000})  = 0 (Timeout)
>>>> select(0, NULL, NULL, NULL, {0, 4000})  = 0 (Timeout)
>>>> select(0, NULL, NULL, NULL, {0, 6000})  = 0 (Timeout)
>>>> select(0, NULL, NULL, NULL, {0, 7000})  = 0 (Timeout)
>>>> select(0, NULL, NULL, NULL, {0, 8000})  = 0 (Timeout)
>>>> select(0, NULL, NULL, NULL, {0, 9000})  = 0 (Timeout)
>
> This is not entirely inconsistent with the spinlock.  Note that 1000
> is repeated 3 times, and 5000 is missing.
>
> This might just be a misleading random sample we got here.  I've seen
> similar close spacing in some simulations I've run.
>
> It is not clear to me why we use a resolution of 1 msec here.  If the
> OS's implementation of select() eventually rounds to the nearest msec,
> that is its business.  But why do we have to lose intermediate
> precision due to its decision?

Yeah -- you're right, this is definitely a spinlock issue.  Next steps:

*) in mostly read workloads, we have a couple of known frequent
offenders.  In particular the 'BufFreelistLock'.  One way we can
influence that guy is to try and significantly lower/raise shared
buffers.  So this is one thing to try.

*) failing that, LWLOCK_STATS macro can be compiled in to give us some
information about the particular lock(s) we're binding on.  Hopefully
it's a lwlock -- this will make diagnosing the problem easier.

*) if we're not blocking on an lwlock, it's possibly a buffer pin related
issue? I've seen this before, for example on an index scan that is
dependent on a seq scan.  This long thread:

"http://postgresql.1045698.n5.nabble.com/9-2beta1-parallel-queries-ReleasePredicateLocks-CheckForSerializableConflictIn-in-the-oprofile-td5709812i100.html"
has a lot information about that case and deserves a review.

*) we can consider experimenting with futex
(http://archives.postgresql.org/pgsql-hackers/2012-06/msg01588.php)
to see if things improve.  This is dangerous, and could crash your
server/eat your data, so fair warning.

merlin


Re: High SYS CPU - need advise

From
Vlad
Date:
Merlin,

> Yeah -- you're right, this is definitely spinlock issue.  Next steps:
>
> *) in mostly read workloads, we have a couple of known frequent
> offenders.  In particular the 'BufFreelistLock'.  One way we can
> influence that guy is to try and significantly lower/raise shared
> buffers.  So this is one thing to try.

the server has 32gb with 3.2gb dedicated to shared buffers. I've increased
it to 13.2gb and the stall still happened. (I have a way of controlling the
number of queries hitting postgresql by shutting down our own memcache-based
app cache, so to test for a stall I temporarily shut down a few memcached
servers.)

> *) failing that, LWLOCK_STATS macro can be compiled in to give us some
> information about the particular lock(s) we're binding on.  Hopefully
> it's a lwlock -- this will make diagnosing the problem easier.

I've enabled that macro and I'm seeing lwlock messages flying by in the log
(see below), even when no high-sys-cpu stall is being observed at the moment.
Should I be looking for something in particular?


PID 17293 lwlock 5906: shacq 1 exacq 0 blk 0
PID 17293 lwlock 5932: shacq 1 exacq 0 blk 0
PID 17293 lwlock 5934: shacq 1 exacq 0 blk 0
PID 17293 lwlock 10854: shacq 1 exacq 0 blk 0
PID 17293 lwlock 10856: shacq 4 exacq 0 blk 0
PID 17293 lwlock 10858: shacq 15 exacq 0 blk 0
PID 17293 lwlock 16442: shacq 4 exacq 0 blk 0
PID 17293 lwlock 16596: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20458: shacq 3 exacq 0 blk 0
PID 17293 lwlock 20460: shacq 10 exacq 0 blk 0
PID 17293 lwlock 20464: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20466: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20480: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20482: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20484: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20508: shacq 12 exacq 0 blk 0
PID 17293 lwlock 20510: shacq 6 exacq 0 blk 0
PID 17293 lwlock 20938: shacq 11 exacq 0 blk 0
PID 17293 lwlock 20940: shacq 22 exacq 0 blk 0
PID 17293 lwlock 20942: shacq 28 exacq 0 blk 0
PID 17293 lwlock 20944: shacq 56 exacq 0 blk 0
PID 17293 lwlock 20946: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20948: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20950: shacq 11 exacq 0 blk 0
PID 17293 lwlock 20952: shacq 3 exacq 0 blk 0
PID 17293 lwlock 20954: shacq 7 exacq 0 blk 0
PID 17293 lwlock 20956: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20958: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20960: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20962: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20964: shacq 2 exacq 0 blk 0
PID 17293 lwlock 46624: shacq 24 exacq 0 blk 0
PID 17293 lwlock 68126: shacq 6 exacq 0 blk 0


> *) if we're not blocking on lwlock, it's possibly a buffer pin related
> issue? I've seen this before, for example on an index scan that is
> dependent on an seq scan.  This long thread:
>
"http://postgresql.1045698.n5.nabble.com/9-2beta1-parallel-queries-ReleasePredicateLocks-CheckForSerializableConflictIn-in-the-oprofile-td5709812i100.html"
> has a lot information about that case and deserves a review.
>
> *) we can consider experimenting with futex
> (http://archives.postgresql.org/pgsql-hackers/2012-06/msg01588.php)
> to see if things improve.  This is dangerous, and could crash your
> server/eat your data, so fair warning.
>
> merlin


-- vlad


Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Fri, Nov 16, 2012 at 9:52 AM, Vlad <marchenko@gmail.com> wrote:
> Merlin,
>
>
>> Yeah -- you're right, this is definitely spinlock issue.  Next steps:
>>
>> *) in mostly read workloads, we have a couple of known frequent
>> offenders.  In particular the 'BufFreelistLock'.  One way we can
>> influence that guy is to try and significantly lower/raise shared
>> buffers.  So this is one thing to try.
>
>
> server has 32gb with 3.2gb dedicated for share buffers. I've increased it to
> 13.2gb, the stall still happened (I have a way of controlling number of
> queries hitting postgresql by shutting down own memcache-based app cache, so
> to test for stall I temporary shut down few memcached servers).
>
>
>> *) failing that, LWLOCK_STATS macro can be compiled in to give us some
>> information about the particular lock(s) we're binding on.  Hopefully
>> it's a lwlock -- this will make diagnosing the problem easier.
>
>
> I've enabled that macro, seeing flying lwlock messages in the log (see
> below), even when there is no high-sys-cpu stall observed at the moment.
> Should I be looking for something in particular?

We're looking for spikes in 'blk', which represents when lwlocks bump.
If you're not seeing any, then this is suggesting a buffer pin related
issue -- this is also supported by the fact that raising shared
buffers didn't help.   If you're not seeing 'blk's, go ahead and
disable the stats macro.

So, what we need to know now is:
*) What happens when you drastically *lower* shared buffers?   Say, to
64mb?  Note, you may experience higher load for unrelated reasons and
have to scuttle the test.  Also, if you have to crank it higher to handle
internal server structures, do that.  This is a hail mary, but maybe
something interesting spits out.

*) How many specific query plans are needed to introduce the
condition?  Hopefully it's not too many.  If so, let's start
gathering the plans.  If you have a lot of plans to sift through, one
thing we can attempt to eliminate noise is to tweak
log_min_duration_statement so that during stall times (only) it logs
offending queries that are unexpectedly blocking.

*) Approximately how big is your 'working set' -- the data your
queries are routinely hitting?  (a rough way to gauge this is sketched
after this list)

*) Is the distribution of the *types* of queries uniform?  Or do you
have special processes that occur on intervals?
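
For the working-set question, a rough sketch against the statistics views
(it only approximates what is hot, but it helps put a number on it):

  -- most-hit tables and their on-disk size (approximate picture of the hot set)
  SELECT relname,
         pg_size_pretty(pg_relation_size(relid)) AS size,
         heap_blks_hit + coalesce(idx_blks_hit, 0) AS blks_hit,
         heap_blks_read + coalesce(idx_blks_read, 0) AS blks_read
  FROM pg_statio_user_tables
  ORDER BY heap_blks_hit + coalesce(idx_blks_hit, 0) DESC
  LIMIT 20;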

Thanks for your patience.

merlin


Re: High SYS CPU - need advise

From
Vlad
Date:

We're looking for spikes in 'blk' which represents when lwlocks bump.
If you're not seeing any then this is suggesting a buffer pin related
issue -- this is also supported by the fact that raising shared
buffers didn't help.   If you're not seeing 'bk's, go ahead and
disable the stats macro.

most blk values come in at 0, some at 1, a few hitting 100. I can't say that during stall times the ratio of blk 0 vs blk non-0 is very different.

 
So, what we need to know now is:
*) What happens when you drastically *lower* shared buffers?   Say, to
64mb?  Note, you may experience higher load for unrelated reasons and
have to scuttle the test.  Also, if you have to crank higher to handle
internal server structures, do that.  This is a hail mary, but maybe
something interesting spits out.

lowering shared_buffers didn't help. 
 
*) How many specific query plans are needed to introduce the
condition,  Hopefully, it's not too many.  If so, let's start
gathering the plans.  If you have a lot of plans to sift through, one
thing we can attempt to eliminate noise is to tweak
log_min_duration_statement so that during stall times (only) it logs
offending queries that are unexpectedly blocking.

unfortunately, there are quite a few query plans... also, I don't think setting log_min_duration_statement will help us, because when the server is hitting a high load average it reacts slowly even to a keypress. So even non-offending queries will take long to execute. I see all sorts of queries being executed slowly during a stall, spanning from the simple
LOG:  duration: 1131.041 ms  statement: SELECT 'DBD::Pg ping test'
to complex ones joining multiple tables.
We are still looking through all the logged queries in an attempt to find the ones that are causing the problem; I'll report if we find any clues.
 

*) Approximately how big is your 'working set' -- the data your
queries are routinely hitting?

I *think* it's in the range of a few hundred MB.
 

*) Is the distribution of the *types* of queries uniform?  Or do you
have special processes that occur on intervals?

it's pretty uniform.
 

Thanks for your patience.


oh no, thank you for trying to help me to resolve this issue.

-- vlad

Re: High SYS CPU - need advise

From
Jeff Janes
Date:
On Fri, Nov 16, 2012 at 8:21 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Fri, Nov 16, 2012 at 9:52 AM, Vlad <marchenko@gmail.com> wrote:
>>
>>> *) failing that, LWLOCK_STATS macro can be compiled in to give us some
>>> information about the particular lock(s) we're binding on.  Hopefully
>>> it's a lwlock -- this will make diagnosing the problem easier.
>>
>>
>> I've enabled that macro, seeing flying lwlock messages in the log (see
>> below), even when there is no high-sys-cpu stall observed at the moment.
>> Should I be looking for something in particular?
>
> We're looking for spikes in 'blk' which represents when lwlocks bump.

Unfortunately LWLock contention is only a weak indication of spinlock
contention.  For example, if backends are furiously acquiring and
releasing the same LWLock in shared mode and no one is getting an
exclusive mode, then there can be high spinlock contention but there
will never be a blk.

In 9.3 there is a new field that tells how many spin delays there were
on the mutex that is behind each lock.  That was  commit
b79ab00144e64217d41, maybe he can port that back to his version.

But that only tells you about LWLock mutexes, not about all the other
ones in PG.

The attached patch logs every spin delay with where in the source it comes from.

I don't think it will have too much of a performance impact as the
logs only happen when a sleep of at least 1ms is going to happen
anyway.  But it could generate massive amounts of logs.


Cheers,

Jeff

Attachment

Re: High SYS CPU - need advise

From
Vlad Marchenko
Date:
OK, so far I've settled on excluding connection caching on the app side
(Apache::DBI and prepare_cached) from the equation and adding pgbouncer as a
counter-measure. This seems to stabilize the situation - at least I'm
not able to push the server into a high-sys-cpu stall the way I used to.

I'm still interested in getting to the bottom of the issue rather than just
alleviating it with a connection pooler, so I might set up a server on
CentOS 6.3 (kernel 2.6.32) and give it a try.

Alternatively, if anyone has ideas on playing with sysctl parameters
related to the scheduler - please let me know.

-- vlad





Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Fri, Nov 16, 2012 at 11:19 AM, Vlad <marchenko@gmail.com> wrote:
>
>> We're looking for spikes in 'blk' which represents when lwlocks bump.
>> If you're not seeing any then this is suggesting a buffer pin related
>> issue -- this is also supported by the fact that raising shared
>> buffers didn't help.   If you're not seeing 'bk's, go ahead and
>> disable the stats macro.
>
>
> most blk comes with 0, some with 1, few hitting 100. I can't say that during
> stall times the number of blk 0 vs blk non-0 are very different.

right.  this is feeling more and more like a buffer pin issue.  but
even then we can't be certain -- it could be a symptom, not the cause.
to prove it we need to demonstrate that everyone is spinning and
waiting, which we haven't done.  classic spinlock contention manifests
as high user cpu. we are bogging down in the kernel, so I wonder if it's
all the select() calls.  we haven't yet ruled out a kernel regression.

If I were you, I'd be investigating pgbouncer to see if your app is
compliant with transaction mode processing, if for no other reason
than it will mitigate high load issues.

>> *) How many specific query plans are needed to introduce the
>> condition,  Hopefully, it's not too many.  If so, let's start
>> gathering the plans.  If you have a lot of plans to sift through, one
>> thing we can attempt to eliminate noise is to tweak
>> log_min_duration_statement so that during stall times (only) it logs
>> offending queries that are unexpectedly blocking.
>
>
> unfortunately, there are quite a few query plans... also, I don't think
> setting log_min_duration_statement will help us, cause when server is
> hitting high load average, it reacts slowly even on a key press. So even
> non-offending queries will be taking long to execute. I see all sorts of
> queries a being executed long during stall: spanning from simple
> LOG:  duration: 1131.041 ms  statement: SELECT 'DBD::Pg ping test'
> to complex ones, joining multiple tables.
> We are still looking into all the logged queries in attempt to find the ones
> that are causing the problem, I'll report if we find any clues.

right.

merlin


Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Fri, Nov 16, 2012 at 12:26 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Fri, Nov 16, 2012 at 8:21 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Fri, Nov 16, 2012 at 9:52 AM, Vlad <marchenko@gmail.com> wrote:
>>>
>>>> *) failing that, LWLOCK_STATS macro can be compiled in to give us some
>>>> information about the particular lock(s) we're binding on.  Hopefully
>>>> it's a lwlock -- this will make diagnosing the problem easier.
>>>
>>>
>>> I've enabled that macro, seeing flying lwlock messages in the log (see
>>> below), even when there is no high-sys-cpu stall observed at the moment.
>>> Should I be looking for something in particular?
>>
>> We're looking for spikes in 'blk' which represents when lwlocks bump.
>
> Unfortunately LWLock contention is only a weak indication of spinlock
> contention.  For example, if backends are furiously acquiring and
> releasing the same LWLock in shared mode and no one is getting an
> exclusive mode, then there can be high spinlock contention but there
> will never be a blk.

yup.  this problem doesn't smell like lwlock issues.  typically there
the problem manifests as low cpu performance, everybody waits.
classic spinlock contention (at least from what i've seen) is very
high *userspace* cpu utilization and low work output.  this time it's
different -- OP is bogging in the kernel so it's not impossible we're
troubleshooting the symptom, not the cause.

> In 9.3 there is a new field that tells how many spin delays there were
> on the mutex that is behind each lock.  That was  commit
> b79ab00144e64217d41, maybe he can port that back to his version.
>
> But that only tells you about LWLock mutexes, not about all the other
> ones in PG.
>
> The attached patch logs every spin delay with where in the source it comes from.

yeah, OP should fire this off. good stuff.  I'll bet lunch (if we ever
happen to meet) it's on buffer pin.

merlin


Re: High SYS CPU - need advise

From
Vlad
Date:
ok, I've applied that patch and ran it. The stall started around 13:50:45...50 and lasted until the end


the actual log has more data (including the statement following each 'spin delay' record), but there is some sensitive info inside which I can't share publicly.


-- Vlad


yup.  this problem doesn't smell like lwlock issues.  typically there
the problem manifests as low cpu performance, everybody waits.
classic spinlock contention (at least from what i've seen) is very
high *userspace* cpu utilization and low work output.  this time it's
different -- OP is bogging in the kernel so it's not impossible we're
troubleshooting the symptom, not the cause.

> In 9.3 there is a new field that tells how many spin delays there were
> on the mutex that is behind each lock.  That was  commit
> b79ab00144e64217d41, maybe he can port that back to his version.
>
> But that only tells you about LWLock mutexes, not about all the other
> ones in PG.
>
> The attached patch logs every spin delay with where in the source it comes from.

yeah, OP should fire this off. good stuff.  I'll bet lunch (if we ever
happen to meet) it's on buffer pin.

merlin

Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Fri, Nov 16, 2012 at 2:13 PM, Vlad <marchenko@gmail.com> wrote:
> ok, I've applied that patch and ran. The stall started around 13:50:45...50
> and lasted until the end
>
> https://dl.dropbox.com/u/109778/postgresql-2012-11-16_134904-stripped.log
>
> the actual log has more data (including statement following each 'spin
> delay' record), but there is some sensitive info inside which I can't share
> with public.

first thoughts:
no single thing really stands out -- contention is all over the place.
lwlock, pinbuffer, dynahash (especially).  I am again suspicious of
bad scheduler interaction.  any chance we can fire up pgbouncer?

merlin


Re: High SYS CPU - need advise

From
Vlad
Date:
What would pgbouncer do in my case? The number of connections will decrease, but the number of active clients won't be smaller. As I understand it, the latter is what's important.

-- Vlad


On Fri, Nov 16, 2012 at 2:31 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

first thoughts:
no single thing really stands out -- contention is all over the place.
lwlock, pinbuffer, dynahash (especially).  I am again suspicious of
bad scheduler interaction.  any chance we can fire up pgbouncer?

merlin

Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Mon, Nov 19, 2012 at 12:02 PM, Vlad <marchenko@gmail.com> wrote:
>
> Some additional observation and food for thoughts. Our app uses connection
> caching (Apache::DBI). By disabling Apache::DBI and forcing client
> re-connection for every (http) request processed I eliminated the stall. The
> user cpu usage jumped (mostly cause prepared sql queries are no longer
> available, and some additional overhead on re-connection), but no single
> case of high-sys-cpu stall.
>
> I can not completely rule out the possibility of some left-overs (unfinished
> transaction?) remain after serving http request, which, in the absence of
> connection caching, are discarded for sure....

well, that is good.  note for pgbouncer transaction mode you would
have had to quit using prepared statements anyways (see list of
features here: http://wiki.postgresql.org/wiki/PgBouncer).  you can
use session mode though, and that would be a good test.  faster speed
could be because of application issues or maybe just spreading out the
queries reduced contention.  hard to say.
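
as a quick audit, a session can list its own server-side prepared
statements (only its own; this is just a sketch):

  -- prepared statements held by the current session only
  SELECT name, statement, prepare_time FROM pg_prepared_statements;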

well, what happens next is up to you -- we can spend more time chasing
this down if you want, or you can hold back and see how things run for
a while.   what is your user load up to in the worst case?

merlin


Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Mon, Nov 19, 2012 at 10:50 AM, Vlad <marchenko@gmail.com> wrote:
> I just did a little experiment: extracted top four queries that were
> executed the longest during stall times and launched  pgbench test with 240
> clients. Yet I wasn't able to put the server into a stall with that. Also
> load average was hitting 120+, it was all user cpu, single digit % system.
> The system remained pretty responsive (on a keypress), in contrary to when
> high-sys-cpu stall happens, showing similar LA numbers.
>
> This makes me think that there is probably a different (than ones I tried)
> query or condition that is responsible for creating high-sys-cpu. Also, from
> my limited knowledge of postgresql and kernel internals, this experiment
> puts under question idea that linux scheduler might be the corner stone....
>
> Here is a longer log, first column is PID. High-sys-cpu stall happened
> around 10:42:37
> https://dl.dropbox.com/u/109778/postgresql-2012-11-19_103425-stripped.log
>
> I'll try to put pgbouncer in place and repeat, also though this will take
> longer time to setup.


yeah.  interesting -- contention was much higher this time and that
changes things.  strange how it was missed earlier.

you're getting bounced around a lot in lwlock especially
(unfortunately we don't know which one).  I'm going to hazard another
guess:  maybe the trigger here is that when the number of contending
backends exceeds some critical number (probably based on the number of
cores), you see a quick cpu spike (causing more backends to lock and
pile up) as cache line bouncing sets in.  That spike doesn't last
long, because the spinlocks quickly accumulate delay counts then punt
to the scheduler which is unable to cope.  The exact reason why this
is happening to you in exactly this way (I've never seen it) is
unclear.  Also the line between symptom and cause is difficult to
draw.

unfortunately, in your case spinlock re-scheduling isn't helping.  log
entries like this one:
18764 [2012-11-19 10:43:50.124 CST] LOG:  JJ spin delay from file
sinvaladt.c line 512 delay 212, pointer 0x7f514959a394 at character 29

are suggesting major problems.  you're dangerously close to a stuck
spinlock which is lights out for the database.

merlin


Re: High SYS CPU - need advise

From
Vlad
Date:

Some additional observations and food for thought. Our app uses connection caching (Apache::DBI). By disabling Apache::DBI and forcing a client re-connection for every (http) request processed, I eliminated the stall. The user cpu usage jumped (mostly because prepared sql queries are no longer available, plus some additional overhead on re-connection), but not a single case of a high-sys-cpu stall.

I cannot completely rule out the possibility that some leftovers (an unfinished transaction?) remain after serving an http request, which, in the absence of connection caching, are discarded for sure....
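
(The kind of leftover I mean would show up as sessions idle inside an open transaction; a minimal sketch of that check, using the 9.1 activity columns:)

  -- backends sitting idle inside an open transaction, oldest first
  SELECT procpid, usename, xact_start, query_start
  FROM pg_stat_activity
  WHERE current_query = '<IDLE> in transaction'
  ORDER BY xact_start;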

-- Vlad


On Mon, Nov 19, 2012 at 11:19 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

yeah.  interesting -- contention was much higher this time and that
changes things.  strange how it was missed earlier.

you're getting bounced around a lot in lwlock especially
(unfortunately we don't know which one).  I'm going to hazard another
guess:  maybe the trigger here is when the number of contending
backends exceeds some critical number (probably based on the number of
cores) you see a quick cpu spike (causing more backends to lock and
pile up) as cache line bouncing sets in.  That spike doesn't last
long, because the spinlocks quickly accumulate delay counts then punt
to the scheduler which is unable to cope.  The exact reason why this
is happening to you in exactly this way (I've never seen it) is
unclear.  Also the line between symptom and cause is difficult to
draw.

unfortunately, in your case spinlock re-scheduling isn't helping.  log
entries like this one:
18764 [2012-11-19 10:43:50.124 CST] LOG:  JJ spin delay from file
sinvaladt.c line 512 delay 212, pointer 0x7f514959a394 at character 29

are suggesting major problems.  you're dangerously close to a stuck
spinlock which is lights out for the database.

merlin

Re: High SYS CPU - need advise

From
Vlad
Date:
I just did a little experiment: I extracted the top four queries that executed the longest during stall times and launched a pgbench test with 240 clients. Yet I wasn't able to put the server into a stall with that. Even though the load average was hitting 120+, it was all user cpu, single-digit % system. The system remained pretty responsive (to a keypress), in contrast to when a high-sys-cpu stall happens with similar LA numbers.

This makes me think that there is probably a different query or condition (than the ones I tried) that is responsible for creating the high sys cpu. Also, from my limited knowledge of postgresql and kernel internals, this experiment calls into question the idea that the linux scheduler is the culprit....

Here is a longer log, first column is PID. The high-sys-cpu stall happened around 10:42:37

I'll try to put pgbouncer in place and repeat, though this will take a longer time to set up.


-- Vlad

Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Sun, Nov 18, 2012 at 4:24 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Fri, Nov 16, 2012 at 12:13 PM, Vlad <marchenko@gmail.com> wrote:
>> ok, I've applied that patch and ran. The stall started around 13:50:45...50
>> and lasted until the end
>>
>> https://dl.dropbox.com/u/109778/postgresql-2012-11-16_134904-stripped.log
>
> That isn't as much log as I expected.  But I guess only the tip of the
> iceberg gets logged, it could be that most spinlocks are contended but
> then get acquired just before the sleep (and log) would have happened.
>
> I'm not sure how to figure out just how much spinning is going on
> below the surface, but my gut feeling is there isn't enough of it to
> explain the amount of slowdown you are seeing (and it would probably
> be accounted for as user time rather than system time)


yes. classic spinlock contention symptoms are high user cpu and almost
no work getting done.   something else is going on here.

> Maybe we could whip up a dummy standalone  program that uses PG's
> spinlock code in a intentionally contended  way and see how your
> system reports resource usage for that.  Maybe your system reports
> fighting for cache lines or whatever is going on behind the scenes as
> system time rather than user time, or something.

I don't think this is the case.  It's looking more and more like
scheduler issues, although it's unclear if that is due to some kind of
regression or just symptoms of backends piling up.  I'm also starting
to wonder if the spinlock contention we are seeing (which according to
your patch really isn't all that much) is just symptomatic of
scheduler issues.

> It would be a little easier to figure out what was going on there if
> the log included the pid (%p) so we can see if it is the same process
> doing a bunch of consecutive sleeps, or a bunch of different
> processes.  Also, logging the millisecond (%m rather than %t) would be
> helpful.
>
> But the way that the point of contention jumps around a lot points to
> a scheduler issue rather than a postgres-specific issue.  A backend
> gets pre-empted while holding a spinlock, and then the kernel is
> resistant to rescheduling it even though all other processes are piled
> up behind it.  But again, that theory would lead to more log lines
> than we see, I would think.

yeah.  I asked for some longer strace logs but then everyone went home
for the weekend.

> Anyway, pgbouncer might help by ensuring that a connection gets
> pre-empted by pgbouncer at a point where it has no locks, rather than
> by the kernel at a point where it does hold locks.

yes -- I think pgbouncer is the key here because it's going to tell us
if the scheduler is the culprit.  If you look carefully at the
strace logs, you'll see that system call times become unpredictable as
they are getting scheduled out...particularly yielding calls like
select().  So we need to determine if this is normal high-load
behavior or if there is an underlying mechanism.

> Any chance you could try an older kernel and see if that reduces the problem?

This is one thing to try.  Personally I think pgbouncer/transaction
mode is low-hanging fruit if it 'drops in' to the OP's app. If the problem
goes away, this will pretty much point the finger at Linux AFAICT (in
which case we can start messing around with the kernel).

merlin


Re: High SYS CPU - need advise

From
Jeff Janes
Date:
On Fri, Nov 16, 2012 at 12:13 PM, Vlad <marchenko@gmail.com> wrote:
> ok, I've applied that patch and ran. The stall started around 13:50:45...50
> and lasted until the end
>
> https://dl.dropbox.com/u/109778/postgresql-2012-11-16_134904-stripped.log

That isn't as much log as I expected.  But I guess only the tip of the
iceberg gets logged, it could be that most spinlocks are contended but
then get acquired just before the sleep (and log) would have happened.

I'm not sure how to figure out just how much spinning is going on
below the surface, but my gut feeling is there isn't enough of it to
explain the amount of slowdown you are seeing (and it would probably
be accounted for as user time rather than system time)

Maybe we could whip up a dummy standalone  program that uses PG's
spinlock code in an intentionally contended way and see how your
system reports resource usage for that.  Maybe your system reports
fighting for cache lines or whatever is going on behind the scenes as
system time rather than user time, or something.
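
Something along these lines is what I have in mind -- just a sketch with
made-up constants, not PG's actual s_lock.c:

/* toy_spin.c -- hypothetical standalone contention tester (a sketch only,
 * not PostgreSQL's s_lock.c).  Forks NWORKERS processes that hammer a
 * single lock word in shared memory, spinning and then backing off with
 * select() much like a PG spinlock does.  Watch top/vmstat while it runs
 * to see whether the contention shows up as user or system time. */
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/wait.h>

#define NWORKERS   8
#define ITERATIONS 1000000      /* adjust to taste */

static volatile int *lock;      /* one shared lock word */

static void
spin_acquire(void)
{
    int delay_us = 1000;

    while (__sync_lock_test_and_set(lock, 1))
    {
        int i;

        /* spin a bit in user space first */
        for (i = 0; i < 1000 && *lock; i++)
            ;
        if (*lock)
        {
            /* still held: sleep with an escalating select() timeout */
            struct timeval tv = {0, delay_us};

            select(0, NULL, NULL, NULL, &tv);
            if (delay_us < 10000)
                delay_us += 1000;
        }
    }
}

static void
spin_release(void)
{
    __sync_lock_release(lock);
}

int
main(void)
{
    int w;

    lock = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    *lock = 0;

    for (w = 0; w < NWORKERS; w++)
    {
        if (fork() == 0)
        {
            long i;

            for (i = 0; i < ITERATIONS; i++)
            {
                spin_acquire();
                spin_release();
            }
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;
    return 0;
}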

It would be a little easier to figure out what was going on there if
the log included the pid (%p) so we can see if it is the same process
doing a bunch of consecutive sleeps, or a bunch of different
processes.  Also, logging the millisecond (%m rather than %t) would be
helpful.
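
For example, something like this in postgresql.conf (just one possible
prefix -- merge it with whatever you already log):

    log_line_prefix = '%m [%p] '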

But the way that the point of contention jumps around a lot points to
a scheduler issue rather than a postgres-specific issue.  A backend
gets pre-empted while holding a spinlock, and then the kernel is
resistant to rescheduling it even though all other processes are piled
up behind it.  But again, that theory would lead to more log lines
than we see, I would think.

Anyway, pgbouncer might help by ensuring that a connection gets
pre-empted by pgbouncer at a point where it has no locks, rather than
by the kernel at a point where it does hold locks.

Any chance you could try an older kernel and see if that reduces the problem?

One thing that does stand out is that it looks like partitioned hash
tables in dynahash.c could benefit by having their freelist
partitioned as well, assuming we can give up the hash_get_num_entries
function.

Cheers,

Jeff


Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Fri, Nov 16, 2012 at 3:21 PM, Vlad <marchenko@gmail.com> wrote:
> what would pgbouncer do in my case? Number of connections will decrease, but
> number of active clients won't be smaller. As I understand it, the latter is
> what matters.

Well, one thing that struck me was how little spinlock contention
there actually was.   Yeah, backends are delaying here and there,
which isn't great, but a few dozen delays per second across
several hundred backends doesn't seem like it should be pegging sys
cpu%.  This is all pointing to the problem not being in postgres, but
in linux.

pgbouncer would do two things:
1) perhaps guard you against some o/s issue
2) keep system more responsive during stall (since by controlling the
pool you control the number of queries piling up).

of course, this comes at the expense of some overhead.

But, before going through all that, how about timing strace-recorded
calls (strace -T) both in stall and non-stall conditions?   I was
assuming select(), but maybe it's something else... for example the
recently fixed lseek.
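
something along these lines against one of the busy backend pids (the pid
and output path are placeholders):

    # -T = time spent inside each call, -tt = microsecond timestamps
    strace -T -tt -p 12345 -o /tmp/backend-12345.trace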

merlin


Re: High SYS CPU - need advise

From
Shaun Thomas
Date:
On 11/16/2012 02:31 PM, Merlin Moncure wrote:

> no single thing really stands out -- contention is all over the place.
> lwlock, pinbuffer, dynahash (especially).  I am again suspicious of
> bad scheduler interaction.  any chance we can fire up pgbouncer?

Just want to throw it out there, but we've been having really bad luck
with the scheduler recently. But only when we use 8GB (on our 72GB
system) for shared_buffers. Cut that down to 4GB, and everything is fine
and dandy.

I think the kernel devs have added in some overzealous scheduler code on us.


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com


Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Tue, Nov 20, 2012 at 9:02 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> On 11/16/2012 02:31 PM, Merlin Moncure wrote:
>
>> no single thing really stands out -- contention is all over the place.
>> lwlock, pinbuffer, dynahash (especially).  I am again suspicious of
>> bad scheduler interaction.  any chance we can fire up pgbouncer?
>
>
> Just want to throw it out there, but we've been having really bad luck with
> the scheduler recently. But only when we use 8GB (on our 72GB system) for
> shared_buffers. Cut that down to 4GB, and everything is fine and dandy.
>
> I think the kernel devs have added in some overzealous scheduler code on us.

Shared buffer manipulation changing contention is suggesting you're
running into free list lock issues.  How many active backends/cores?

merlin


Re: High SYS CPU - need advise

From
Shaun Thomas
Date:
On 11/20/2012 10:03 AM, Merlin Moncure wrote:

> Shared buffer manipulation changing contention is suggesting you're
> running into free list lock issues.  How many active backends/cores?

Oh, the reason I wanted to point it out was that we see a lot more than
CPU contention with higher shared_buffer settings. The scheduler starts
aggressively clearing out the file cache *and* freeing memory, driving
load way higher than it should be. That doesn't result in a very high
iowait, but all that page swapping sure isn't free.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com


Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Tue, Nov 20, 2012 at 10:12 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> On 11/20/2012 10:03 AM, Merlin Moncure wrote:
>
>> Shared buffer manipulation changing contention is suggesting you're
>> running into free list lock issues.  How many active backends/cores?
>
>
> Oh, the reason I wanted to point it out was that we see a lot more than CPU
> contention with higher shared_buffer settings. The scheduler starts
> aggressively clearing out the file cache *and* freeing memory, driving load
> way higher than it should be. That doesn't result in a very high iowait, but
> all that page swapping sure isn't free.

have you ruled out numa issues?
(http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html)
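
quick things to check (from memory, so double-check on your box):

    cat /proc/sys/vm/zone_reclaim_mode    # 0 = zone reclaim off
    numactl --hardware                    # prints node distances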

merlin


Re: High SYS CPU - need advise

From
Shaun Thomas
Date:
On 11/20/2012 10:13 AM, Merlin Moncure wrote:

> have you ruled out numa issues?
> (http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html)

Haha. Yeah. Our zone reclaim mode is off, and node distance is 10 or 20.
Zone reclaim is only enabled by default if distance is > 20, unless there's
some kernel bug triggering it even when zone reclaim reports as off.

I'll also note that my tests with interleave made no difference at all.
At least not with pgbench. There was a small amount of flux with TPS,
but the page swap storms came regardless of NUMA tweaks. The formula
worked like this:

High connection count + high shared_buffers = page swap storm. I'll note
that 8GB -> 4GB immediately stopped the paging everywhere, and the OS
went from using 13GB for active file cache, to 45GB. I can't see how PG
would cause something like that by itself.

I only chimed in because it's very adversely affecting our CPU load in a
similarly inexplicable-but-seemingly-the-scheduler way.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com


Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Thu, Nov 15, 2012 at 4:29 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Merlin Moncure escribió:
>
>> ok, excellent.   reviewing the log, this immediately caught my eye:
>>
>> recvfrom(8, "\27\3\1\0@", 5, 0, NULL, NULL) = 5
>> recvfrom(8, "\327\327\nl\231LD\211\346\243@WW\254\244\363C\326\247\341\177\255\263~\327HDv-\3466\353"...,
>> 64, 0, NULL, NULL) = 64
>> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 2000})  = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 3000})  = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 4000})  = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 6000})  = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 7000})  = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 8000})  = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 9000})  = 0 (Timeout)
>> semop(41713721, {{2, 1, 0}}, 1)         = 0
>> lseek(295, 0, SEEK_END)                 = 0
>> lseek(296, 0, SEEK_END)                 = 8192
>>
>> this is definitely pointing to spinlock issue.
>
> I met Rik van Riel (Linux kernel hacker) recently and we chatted about
> this briefly.  He strongly suggested that we should consider using
> futexes on Linux instead of spinlocks; the big advantage being that
> futexes sleep instead of spinning when contention is high.  That would
> reduce the system load in this scenario.

By the way, I think you (and Rik) are 100% right.  We're slamming select()
where futex has a much smarter wait mechanism.  There are reports from
users who have switched to pthreads and report better scalability.
(https://groups.google.com/forum/?fromgroups=#!topic/pgsql.hackers/v8V82gznxs8)
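
Just to illustrate the difference, here's a toy sketch (mine, nothing like
an actual patch -- the helper names are made up): the waiter sleeps in the
kernel until it is explicitly woken, instead of polling select() with
escalating timeouts.

/* futex_demo.c -- toy illustration of sleeping in the kernel on a futex
 * instead of polling with select().  The waiter blocks until the other
 * process flips the flag and wakes it; no timed polling loop needed. */
#include <linux/futex.h>
#include <sys/syscall.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <stdint.h>
#include <unistd.h>

static long
futex_wait(uint32_t *addr, uint32_t expected)
{
    /* sleep only while *addr still equals 'expected' */
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

static long
futex_wake(uint32_t *addr, int nwaiters)
{
    /* wake up to nwaiters sleepers on addr */
    return syscall(SYS_futex, addr, FUTEX_WAKE, nwaiters, NULL, NULL, 0);
}

int
main(void)
{
    uint32_t *flag = mmap(NULL, sizeof(uint32_t), PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    *flag = 0;
    if (fork() == 0)
    {
        sleep(1);                       /* pretend to hold the lock a while */
        __sync_lock_test_and_set(flag, 1);
        futex_wake(flag, 1);
        _exit(0);
    }
    while (*flag == 0)
        futex_wait(flag, 0);            /* sleeps in the kernel, burns no CPU */
    wait(NULL);
    return 0;
}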

merlin


Re: High SYS CPU - need advise

From
Jeff Janes
Date:
On Tue, Nov 20, 2012 at 8:03 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Tue, Nov 20, 2012 at 9:02 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:
>> On 11/16/2012 02:31 PM, Merlin Moncure wrote:
>>
>>> no single thing really stands out -- contention is all over the place.
>>> lwlock, pinbuffer, dynahash (especially).  I am again suspicious of
>>> bad scheduler interaction.  any chance we can fire up pgbouncer?
>>
>>
>> Just want to throw it out there, but we've been having really bad luck with
>> the scheduler recently. But only when we use 8GB (on our 72GB system) for
>> shared_buffers. Cut that down to 4GB, and everything is fine and dandy.
>>
>> I think the kernel devs have added in some overzealous scheduler code on us.
>
> Shared buffer manipulation changing contention is suggesting you're
> running into free list lock issues.

I wouldn't expect so.  Increasing shared_buffers should either fix
free list lock contention, or leave it unchanged, not make it worse.
Sounds more like that NUMA/interleave/zone reclaim stuff I could never
wrap my head around.

Cheers,

Jeff


Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Tue, Nov 20, 2012 at 10:50 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Nov 20, 2012 at 8:03 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Tue, Nov 20, 2012 at 9:02 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:
>>> On 11/16/2012 02:31 PM, Merlin Moncure wrote:
>>>
>>>> no single thing really stands out -- contention is all over the place.
>>>> lwlock, pinbuffer, dynahash (especially).  I am again suspicious of
>>>> bad scheduler interaction.  any chance we can fire up pgbouncer?
>>>
>>>
>>> Just want to throw it out there, but we've been having really bad luck with
>>> the scheduler recently. But only when we use 8GB (on our 72GB system) for
>>> shared_buffers. Cut that down to 4GB, and everything is fine and dandy.
>>>
>>> I think the kernel devs have added in some overzealous scheduler code on us.
>>
>> Shared buffer manipulation changing contention is suggesting you're
>> running into free list lock issues.
>
> I wouldn't expect so.  Increasing shared_buffers should either fix
> free list lock contention, or leave it unchanged, not make it worse.

AIUI, that is simply not true (unless you raise it to the point you're
not churning them).   I'm looking at StrategyGetBuffer() for non-scan
cases.  It locks "BufFreelistLock" then loops the free list, and, if
it finds nothing, engages a clock sweep.  Both of those operations are
dependent on the number of buffers being managed and so it's
reasonable to expect some workloads to increase contention with more
buffers.

merlin


Re: High SYS CPU - need advise

From
Shaun Thomas
Date:
On 11/20/2012 04:08 PM, Jeff Janes wrote:

> Shaun Thomas reports one that is (I assume) not read intensive, but
> his diagnosis is that this is a kernel bug where a larger
> shared_buffers for no good reason causes the kernel to kill off its
> page cache.

We're actually very read intensive. According to pg_stat_statements, we
regularly top out at 42k queries per second, and pg_stat_database says
we're pushing 7k TPS.

But I'm still sure this is a kernel bug. Moving from 4GB to 6GB or 8GB
causes the kernel to cut the active page cache in half, in addition to
freeing 1/4 of RAM to just sit around doing nothing. That in turn causes
kswapd to work constantly, while our IO drivers work to undo the damage.
It's a positive feedback loop in which I can reliably drive the load up to
800+ on an 800-client pgbench with two threads, all while having 0% CPU
free.

Make that 4GB, and not only does the problem completely disappear, but
the load settles down to around 9, and the machine becomes about 60%
idle. Something in there is fantastically broken, but I can't point a
finger at what.

I was just piping in because, in absence of an obvious PG-related
culprit, the problem could be the OS itself. It certainly was in our case.

That, or PG has a memory leak that only appears at > 4GB of shared buffers.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com


Re: High SYS CPU - need advise

From
Jeff Janes
Date:
On Tue, Nov 20, 2012 at 2:26 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> On 11/20/2012 04:08 PM, Jeff Janes wrote:
>
>> Shaun Thomas reports one that is (I assume) not read intensive, but
>> his diagnosis is that this is a kernel bug where a larger
>> shared_buffers for no good reason causes the kernel to kill off its
>> page cache.
>
>
> We're actually very read intensive.

Sorry, that is what I meant.  I wrote "not write intensive", then
decided it would be clearer to change "not write" to "read" and then
forgot to remove the "not".  I hate it when that happens.

Atomic update commit failure in the meatware :)

Cheers,

Jeff


Re: High SYS CPU - need advise

From
Shaun Thomas
Date:
On 11/20/2012 04:35 PM, Jeff Janes wrote:

> Atomic update commit failure in the meatware :)

Ha.

What's actually funny is that one of the affected machines started
*swapping* earlier today. With 15GB free, and 12GB of inactive cache,
and vm.swappiness set to 0, it somehow decided there was enough memory
pressure to push 1.5GB of *something* into swap.

Frankly, the kernel we're using is clearly deranged. There's really no
other way to explain this. But we were blaming various elements of PG
for several weeks following our upgrade. After further examination, our
query profile isn't any different, and the load spikes were effectively
random.

All I can say is... be wary of the 3.2 kernel. It's... unique.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com


Re: High SYS CPU - need advise

From
Gavin Flower
Date:
On 21/11/12 11:41, Shaun Thomas wrote:
> On 11/20/2012 04:35 PM, Jeff Janes wrote:
>
>> Atomic update commit failure in the meatware :)
>
> Ha.
>
> What's actually funny is that one of the affected machines started
> *swapping* earlier today. With 15GB free, and 12GB of inactive cache,
> and vm.swappiness set to 0, it somehow decided there was enough memory
> pressure to push 1.5GB of *something* into swap.
>
> Frankly, the kernel we're using is clearly deranged. There's really no
> other way to explain this. But we were blaming various elements of PG
> for several weeks following our upgrade. After further examination,
> our query profile isn't any different, and the load spikes were
> effectively random.
>
> All I can say is... be wary of the 3.2 kernel. It's... unique.
>
You'll just have to upgrade to Microsoft!
<Ducks, and runs away _VERY_ quickly!>



Could you not install a later kernel?


Cheers,
Gavin


Re: High SYS CPU - need advise

From
Vlad Marchenko
Date:
update on my problem: despite pgbouncer, the problem still occurs on my
end.

Also, an interesting observation - I ran several tests with pgbench, using
queries that I think are prone to triggering the high-sys-cpu stall. What I
noticed is that when pgbench is started in prepared mode, the system
behaves fine during the stress test (high user cpu - 85-90%, low sys cpu -
5-7%), with high TPS. Though when I used non-prepared modes, the sys cpu
portion jumps to 40% (and tps drops dramatically as well, but this is
understandable).  The test queries are pretty long (10kb+), with a couple
of outer joins across 1000-record tables with indexes.

Maybe we are looking in the wrong place and the issue is somewhere within
the planner/parser? Is there some extensive locking used in there?

Another observation - it's harder to trigger a high-sys-cpu stall on a
freshly restarted postgresql. Though if it has been running for a while,
then it's much easier to do.

-- vlad

On 11/19/12 8:33 AM, Merlin Moncure wrote:
> On Sun, Nov 18, 2012 at 4:24 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> On Fri, Nov 16, 2012 at 12:13 PM, Vlad <marchenko@gmail.com> wrote:
>>> ok, I've applied that patch and ran. The stall started around 13:50:45...50
>>> and lasted until the end
>>>
>>> https://dl.dropbox.com/u/109778/postgresql-2012-11-16_134904-stripped.log
>> That isn't as much log as I expected.  But I guess only the tip of the
>> iceberg gets logged, it could be that most spinlocks are contended but
>> then get acquired just before the sleep (and log) would have happened.
>>
>> I'm not sure how to figure out just how much spinning is going on
>> below the surface, but my gut feeling is there isn't enough of it to
>> explain the amount of slowdown you are seeing (and it would probably
>> be accounted for as user time rather than system time)
>
> yes. classic spinlock contention symptoms are high user cpu and almost
> no work getting done.   something else is going on here?
>
>> Maybe we could whip up a dummy standalone  program that uses PG's
>> spinlock code in an intentionally contended way and see how your
>> system reports resource usage for that.  Maybe your system reports
>> fighting for cache lines or whatever is going on behind the scenes as
>> system time rather than user time, or something.
> I don't think this is the case.  It's looking more and more like
> scheduler issues, although it's unclear if that is due to some kind of
> regression or just symptoms of backends piling up.  I'm also starting
> to wonder if the spinlock contention we are seeing (which according to
> your patch really isn't all that much) is just symptomatic of
> scheduler issues.
>
>> It would be a little easier to figure out what was going on there if
>> the log included the pid (%p) so we can see if it is the same process
>> doing a bunch of consecutive sleeps, or a bunch of different
>> processes.  Also, logging the millisecond (%m rather than %t) would be
>> helpful.
>>
>> But the way that the point of contention jumps around a lot points to
>> a scheduler issue rather than a postgres-specific issue.  A backend
>> gets pre-empted while holding a spinlock, and then the kernel is
>> resistant to rescheduling it even though all other processes are piled
>> up behind it.  But again, that theory would lead to more log lines
>> than we see, I would think.
> yeah.  I asked for some longer strace logs but then everyone went home
> for the weekend.
>
>> Anyway, pgbouncer might help by ensuring that a connection gets
>> pre-empted by pgbouncer at a point where it has no locks, rather than
>> by the kernel at a point where it does hold locks.
> yes -- I think pgbouncer is the key here because it's going to tell us
> if the scheduler is the culprit here.  If you look carefully at the
> strace logs, you'll see that system call times become unpredictable as
> they are getting scheduled out...particularly yielding calls like
> select().  So we need to determine if this is normal high load
> behavior or if there is an underlying mechanism.
>
>> Any chance you could try an older kernel and see if that reduces the problem?
> This is one thing to try.  Personally I think pgbouncer/transaction
> mode is low hanging fruit if it 'drops in' to OP's app. If problem
> goes away this will pretty much point the finger at Linux AFAICT (in
> which case we can start messing around with the kernel).
>
> merlin



Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Wed, Nov 21, 2012 at 9:29 AM, Vlad Marchenko <marchenko@gmail.com> wrote:
> update on my problem: despite pgbouncer, the problem still occurs on my
> end.
>
> Also, interesting observation - I ran several tests with pgbench, using
> queries that I think are prone to trigger high-sys-cpu-stall. What I noticed
> is when pgbench is started with prepared mode, the system behaves fine
> during stress-test (high user cpu - 85-90%, low sys cpu - 5-7%), high TPS.
> Though when I used non-prepared modes, the sys cpu portion jumps to 40% (and
> tps drops dramatically as well, but this is understandable).  The test
> queries are pretty long (10kb+), with a couple of outer joins across
> 1000-record tables with indexes.
>
> Maybe we are looking in the wrong place and the issue is somewhere within
> the planner/parser? Is there some extensive locking used in there?
>
> Another observation - it's harder to trigger a high-sys-cpu stall on a freshly
> restarted postgresql. Though if it has been running for a while, then it's much
> easier to do.

what pgbouncer mode, and how large is your pool.

merlin


Re: High SYS CPU - need advise

From
Jeff Janes
Date:
On Wed, Nov 21, 2012 at 7:29 AM, Vlad Marchenko <marchenko@gmail.com> wrote:

> update on my problem: despite pgbouncer, the problem still occurs on my
> end.

As Merlin asked, how big is the pool?  Maybe you are using a large
enough pool so as to defeat the purpose of restricting the number of
connections.


> Also, interesting observation - I ran several tests with pgbench, using
> queries that I think are prone to trigger high-sys-cpu-stall. What I noticed
> is when pgbench is started with prepared mode, the system behaves fine
> during stress-test (high user cpu - 85-90%, low sys cpu - 5-7%), high TPS.
> Though when I used non-prepared modes, the sys cpu portion jumps to 40% (and
> tps drops dramatically as well, but this is understandable).  The test
> queries are pretty long (10kb+), with a couple of outer joins across
> 1000-record tables with indexes.

Could you sanitize the queries (and some statements to generate dummy
data) enough to share?

>
> Maybe we are looking in the wrong place and the issue is somewhere within
> the planner/parser? Is there some extensive locking used in there?

I don't think the locking is particularly extensive, but it could be
enough extra to drive something over the edge.

But it would be the same nature of locking as elsewhere (spinlocks and
lwlocks), so it doesn't really change the nature of the problem, which
is still "Why do these user-space locks turn into high SYS cpu?"

> Another observation - it's harder to trigger a high-sys-cpu stall on a freshly
> restarted postgresql. Though if it has been running for a while, then it's much
> easier to do.

Maybe the long running time has built up enough resource usage to
cause the kernel scheduler to get into a snit, so it decides to
preempt the process while it holds a spinlock, and then refuses to run
it again for a while.

Cheers,

Jeff


Re: High SYS CPU - need advise

From
Vlad
Date:
Merlin,

On Wed, Nov 21, 2012 at 2:17 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Wed, Nov 21, 2012 at 12:17 PM, Vlad <marchenko@gmail.com> wrote:
> It turned out we can't use transaction mode, because prepared statements
> are used a lot within the code, while processing a single http request.

prepared statements can be fudged within some constraints.  if prepared
statements are explicitly named via PREPARE, you can simply prepare
them all on server connection via the connect_query setting and disable
the manual preparation.  you then change the server_reset_query so
that they are not discarded.  some basic experimentation might confirm
if this is a viable strategy.  automatic protocol-level statements can
be an issue though.

We have 350k+ lines of code in our app, so this is not quite as feasible as I'd wish.
 
> Also, I can't 100% rule out that there will be long-running
> (statistical) queries launched (even though such requests should not come to
> this database), which would occupy a connection for a longer time but do not
> create any race condition... So having a pool size of 8 may be too slim.

there are a number of simple tricks to deal with this:
1) move long running queries to their own pool (by changing login user
or connection string)
2) bypass pgbouncer in those cases
3) increase pool size


It's pretty much already set up so that long-running queries should not hit the same DB cluster as those with (potentially) high connection/query rates, but I still can't rule out that long-running queries will be issued via pgbouncer.

In either case - it seems that the combination of pool size = 200 / pool mode = session / server_lifetime = 30 makes things stable for now.
 
I'm planning to repeat my case on a 2.6.x kernel, but it will be a while before I have a chance to do that.


Thanks.

-- vlad

Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Wed, Nov 21, 2012 at 12:17 PM, Vlad <marchenko@gmail.com> wrote:
> It turned out we can't use transaction mode, because prepared statements
> are used a lot within the code, while processing a single http request.

prepared statements can be fudged within some constraints.  if prepared
statements are explicitly named via PREPARE, you can simply prepare
them all on server connection via the connect_query setting and disable
the manual preparation.  you then change the server_reset_query so
that they are not discarded.  some basic experimentation might confirm
if this is a viable strategy.  automatic protocol-level statements can
be an issue though.
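
roughly, in pgbouncer.ini terms, something like this (the statement name
and query text are made up -- treat it as a sketch to experiment with):

    [databases]
    ; connect_query runs once on every fresh server connection
    appdb = host=127.0.0.1 dbname=appdb connect_query='PREPARE fetch_order (int) AS SELECT * FROM orders WHERE id = $1'

    [pgbouncer]
    pool_mode = transaction
    ; leave the reset query empty so prepared statements are not discarded
    server_reset_query =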

> Also, I can't 100% rule out that there will be long-running
> (statistical) queries launched (even though such requests should not come to
> this database), which would occupy a connection for a longer time but do not
> create any race condition... So having a pool size of 8 may be too slim.

there are a number of simple tricks to deal with this:
1) move long running queries to their own pool (by changing login user
or connection string)
2) bypass pgbouncer in those cases
3) increase pool size

merlin


Re: High SYS CPU - need advise

From
Vlad
Date:
It turned out we can't use transaction mode, because prepared statements are used a lot within the code while processing a single http request.

Also, I can't 100% rule out that there will be long-running (statistical) queries launched (even though such requests should not come to this database), which would occupy a connection for a longer time but do not create any race condition... So having a pool size of 8 may be too slim.

It's a single database.

-- Vlad


On Wed, Nov 21, 2012 at 12:04 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Wed, Nov 21, 2012 at 11:56 AM, Vlad <marchenko@gmail.com> wrote:
> ok, understood.
> I need to give some more thoughts to if it's possible for us to switch to
> transaction mode from app standpoint of view.
>
> if yes, then setting pool size to 20 (for 8 cores)  sounds OK?

If it was me, I would be starting with exactly 8 and be tuning from there.

Don't forget: each database role gets its own pool.

merlin

Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Wed, Nov 21, 2012 at 11:56 AM, Vlad <marchenko@gmail.com> wrote:
> ok, understood.
> I need to give some more thoughts to if it's possible for us to switch to
> transaction mode from app standpoint of view.
>
> if yes, then setting pool size to 20 (for 8 cores)  sounds OK?

If it was me, I would be starting with exactly 8 and be tuning from there.

Don't forget: each database role gets its own pool.

merlin


Re: High SYS CPU - need advise

From
Vlad
Date:
ok, understood. 
I need to give some more thoughts to if it's possible for us to switch to transaction mode from app standpoint of view.

if yes, then setting pool size to 20 (for 8 cores)  sounds OK?

-- Vlad


Re: High SYS CPU - need advise

From
Jeff Janes
Date:
On Tue, Nov 20, 2012 at 9:05 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Tue, Nov 20, 2012 at 10:50 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>> I wouldn't expect so.  Increasing shared_buffers should either fix
>> free list lock contention, or leave it unchanged, not make it worse.
>
> AIUI, that is simply not true (unless you raise it to the point you're
> not churning them).   I'm looking at StrategyGetBuffer() for non-scan
> cases.  It locks "BufFreelistLock" then loops the free list, and, if
> it finds nothing, engages a clock sweep.

The freelist should never loop.  It is written as a loop, but I think
there is currently no code path which ends up with valid buffers being
on the freelist, so that loop will never, or at least rarely, execute
more than once.

> Both of those operations are
> dependent on the number of buffers being managed and so it's
> reasonable to expect some workloads to increase contention with more
> buffers.

The clock sweep can depend on the number of buffers being managed in a
worst-case sense, but I've never seen any evidence (nor analysis) that
this worst case can be obtained in reality on an ongoing basis.  By
constructing two pathological workloads which get switched between, I
can get the worst-case to happen, but when it does happen the
consequences are mild compared to the amount of time needed to set up
the necessary transition.  In other words, the worst-case can't be
triggered often enough to make a meaningful difference.

Cheers,

Jeff


Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Tue, Nov 20, 2012 at 12:16 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Nov 20, 2012 at 9:05 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Tue, Nov 20, 2012 at 10:50 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>>
>>> I wouldn't expect so.  Increasing shared_buffers should either fix
>>> free list lock contention, or leave it unchanged, not make it worse.
>>
>> AIUI, that is simply not true (unless you raise it to the point you're
>> not churning them).   I'm looking at StrategyGetBuffer() for non-scan
>> cases.  It locks "BufFreelistLock" then loops the free list, and, if
>> it finds nothing, engages a clock sweep.
>
> The freelist should never loop.  It is written as a loop, but I think
> there is currently no code path which ends up with valid buffers being
> on the freelist, so that loop will never, or at least rarely, execute
> more than once.
>
>> Both of those operations are
>> dependent on the number of buffers being managed and so it's
>> reasonable to expect some workloads to increase contention with more
>> buffers.
>
> The clock sweep can depend on the number of buffers being managed in a
> worst-case sense, but I've never seen any evidence (nor analysis) that
> this worst case can be obtained in reality on an ongoing basis.  By
> constructing two pathological workloads which get switched between, I
> can get the worst-case to happen, but when it does happen the
> consequences are mild compared to the amount of time needed to set up
> the necessary transition.  In other words, the worst-case can't be
> triggered often enough to make a meaningful difference.

Yeah, good points;  but (getting off topic here) : there have been
several documented cases of lowering shared buffers improving
performance under contention...the  'worst case' might be happening
more than expected.  In particular, what happens when a substantial
percentage of the buffer pool is set with a non-zero usage count?
This seems unlikely, but possible?  Take note:

        if (buf->refcount == 0)
        {
            if (buf->usage_count > 0)
            {
                buf->usage_count--;
                trycounter = NBuffers;  /* emphasis */
            }

ISTM time spent here isn't bounded except that, as more time is spent
sweeping (more backends are thus waiting and not marking pages), the
usage counts decrease faster until you hit a steady state.  A smaller
buffer pool naturally would help in that scenario as your usage counts
would drop faster.  It strikes me as cavalier to be resetting the
trycounter while sitting under the #1 known contention point for
read-only workloads.  Shouldn't SBF() work on an advisory basis and
try to force a buffer after N failed usage-count attempts?

merlin


Re: High SYS CPU - need advise

From
Jeff Janes
Date:
On Tue, Nov 20, 2012 at 12:00 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Tue, Nov 20, 2012 at 12:16 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>> The freelist should never loop.  It is written as a loop, but I think
>> there is currently no code path which ends up with valid buffers being
>> on the freelist, so that loop will never, or at least rarely, execute
>> more than once.
>>
>>> Both of those operations are
>>> dependent on the number of buffers being managed and so it's
>>> reasonable to expect some workloads to increase contention with more
>>> buffers.
>>
>> The clock sweep can depend on the number of buffers being managed in a
>> worst-case sense, but I've never seen any evidence (nor analysis) that
>> this worst case can be obtained in reality on an ongoing basis.  By
>> constructing two pathological workloads which get switched between, I
>> can get the worst-case to happen, but when it does happen the
>> consequences are mild compared to the amount of time needed to set up
>> the necessary transition.  In other words, the worst-case can't be
>> triggered often enough to make a meaningful difference.
>
> Yeah, good points;  but (getting off topic here) : there have been
> several documented cases of lowering shared buffers improving
> performance under contention...the  'worst case' might be happening
> more than expected.

The ones that I am aware of (mostly Greg Smith's case studies) this
has been for write-intensive work loads and are related to
writes/fsyncs getting gummed up.

Shaun Thomas reports one that is (I assume) not read intensive, but
his diagnosis is that this is a kernel bug where a larger
shared_buffers for no good reason causes the kernel to kill off its
page cache.  From the kernel's perspective, the freelist lock doesn't
look any different from any other lwlock, so I doubt that issue is
related to the freelist lock.

> In particular, what happens when a substantial
> percentage of the buffer pool is set with a non-zero usage count?

The current clock sweep algorithm is an extraordinary usage-count
decrementing machine.  From what I know, the only way to get much more
than half of the buffers to have a non-zero usage count is for the
clock-sweep to rarely run (in which case, it is hard for it to be the
bottleneck), or for most of the buffer-cache to be
pinned simultaneously.

> This seems unlikely, but possible?  Take note:
>
>                 if (buf->refcount == 0)
>                 {
>                         if (buf->usage_count > 0)
>                         {
>                                 buf->usage_count--;
>                                 trycounter = NBuffers;  /* emphasis */
>                         }
>
> ISTM time spent here isn't bounded except that as more time is spent
> sweeping (more backends are thus waiting and not marking pages) the
> usage counts decrease faster until you hit steady state.

But that is a one time thing.  Once you hit the steady state, how do
you get away from it again, such that a large amount of work is needed
again?

> Smaller
> buffer pool naturally would help in that scenario as your usage counts
> would drop faster.

They would drop at the same rate in absolute numbers, barring the
smaller buffer_cache fitting entirely in the on-board CPU cache.

They would drop faster in percentage terms, but they would also
increase faster in percentage terms once a candidate is found and a
new page read into it.

> It strikes me as cavalier to be resetting
> trycounter while sitting under the #1 known contention point for read
> only workloads.

The only use for the trycounter is to know when to ERROR out with "no
unpinned buffers available", so not resetting that seems entirely
wrong.

I would contest "the #1 known contention point" claim.  We know that
the freelist lock is a point of contention under certain conditions,
but we (or at least I) also know that it is the mere acquisition of
this lock, and not the work done while it is held, that is important.

If I add a spurious
"LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE); LWLockRelease(BufFreelistLock);"
to each execution of StrategyGetBuffer, then contention kicks in twice as fast.

But when I instead hacked the clock sweep to run twice as far (ignore the
first eligible buffer it finds, and go find another one), all under
the cover of a single BufFreelistLock acquisition, there was no
meaningful increase in contention.

This was all on a 4 socket x 2 core/socket opteron machine which I no
longer have access to.  Using a more modern 8 core on a single socket,
I can't get it to collapse on BufFreelistLock at all, presumably
because the cache coherence mechanisms are so much faster.

>  Shouldn't SBF() work on advisory basis and try to
> force a buffer after N failed usage count attempts?

I believe Simon tried that a couple commit-fests ago, and no one could
show that it made a difference.

Cheers,

Jeff


Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Tue, Nov 20, 2012 at 4:08 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> It strikes me as cavalier to be resetting
>> trycounter while sitting under the #1 known contention point for read
>> only workloads.
>
> The only use for the trycounter is to know when to ERROR out with "no
> unpinned buffers available", so not resetting that seems entirely
> wrong.
>
> I would contest "the #1 known contention point" claim.  We know that
> the freelist lock is a point of contention under certain conditions,
> but we (or at least I) also know that it is the mere acquisition of
> this lock, and not the work done while it is held, that is important.

Well, I'm speculating (without evidence) that that might not in fact
always be true.  It certainly is hard to demonstrate with synthetic
testing though.  I guess the first step would be to whip up a
diagnostic patch to confirm malfeasance in suspicious cases.

merlin


Re: High SYS CPU - need advise

From
Jeff Janes
Date:
On Wed, Nov 21, 2012 at 9:05 AM, Vlad <marchenko@gmail.com> wrote:
> it's session mode and the pool size is 1200 (because I need to guarantee that in
> the worst case we have enough slots for all possible clients),

Wouldn't the clients prefer to wait 100ms to get a connection if that
means their query finishes in 100ms,
rather than waiting 0ms to get a connection but having the query take
1000ms to finish?

> however even
> at the times preceding the high-cpu-sys stall, the number of postmasters is like
> 15-20. When a stall happens, it starts to rise but that's the result of the DB
> not answering in a timely manner.

That is like the "beatings will continue until morale improves" joke.
The theory about why pgbouncer would be beneficial is that it prevents
hundreds of connections from clogging up the server.  By allowing
hundreds of connections to be made when the server gets into distress,
you are increasing that distress and making it harder to recover from
it.  That is the theory, anyway.

Cheers,

Jeff


Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Wed, Nov 21, 2012 at 11:05 AM, Vlad <marchenko@gmail.com> wrote:
> it's session mode and the pool size is 1200 (because I need to guarantee that in
> the worst case we have enough slots for all possible clients), however even
> at the times preceding the high-cpu-sys stall, the number of postmasters is like
> 15-20. When a stall happens, it starts to rise but that's the result of the DB
> not answering in a timely manner.

That is completely defeating the point.  Mode should be transaction
mode and somewhere between 1-2x the number of cores for the pool size.
Note the # of connected clients does not have to equal the size of
the pool -- that's the whole point of pgbouncer -- it forces the
queries through a fixed number of backend processes.
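
in pgbouncer.ini terms that's roughly (numbers are just a starting point
for your 8 cores):

    [pgbouncer]
    pool_mode = transaction
    ; ~1-2x the number of cores
    default_pool_size = 16
    ; clients can still all connect; they just share the small pool
    max_client_conn = 1200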

merlin


Re: High SYS CPU - need advise

From
Vlad
Date:
it's session mode and the pool size is 1200 (because I need to guarantee that in the worst case we have enough slots for all possible clients), however even at the times preceding the high-cpu-sys stall, the number of postmasters is like 15-20. When a stall happens, it starts to rise, but that's the result of the DB not answering in a timely manner.

as far as data - yes, I can probably clean up names and provide some data with the same queries, though - we are not able to cause a stall just with those queries & pgbench; we are able to make the situation worse and a stall much more likely when it's overlapped with regular queries coming from our app.

-- Vlad


On Wed, Nov 21, 2012 at 10:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Wed, Nov 21, 2012 at 10:43 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, Nov 21, 2012 at 7:29 AM, Vlad Marchenko <marchenko@gmail.com> wrote:
>
>> update on my problem: despite pgbouncer, the problem still occurs on my
>> end.
>
> As Merlin asked, how big is the pool?  Maybe you are using a large
> enough pool so as to defeat the purpose of restricting the number of
> connections.

also, what mode.

merlin

Re: High SYS CPU - need advise

From
Merlin Moncure
Date:
On Wed, Nov 21, 2012 at 10:43 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, Nov 21, 2012 at 7:29 AM, Vlad Marchenko <marchenko@gmail.com> wrote:
>
>> update on my problem: despite pgbouncer, the problem still occurs on my
>> end.
>
> As Merlin asked, how big is the pool?  Maybe you are using a large
> enough pool so as to defeat the purpose of restricting the number of
> connections.

also, what mode.

merlin


Re: High SYS CPU - need advise

From
Vlad
Date:
nothing changes if I increase the number of threads.
pgbouncer doesn't change much.

also, I think the nature of the high-sys-cpu during a stall and when I run pgbench is different.
During pgbench it's constantly at 30-40%, while during a stall it sits at a low 5-15% and then spikes to 90% after a while, for no apparent reason (i.e. no increased load or anything), then goes down in a few secs, or stays until postgres is restarted.

at this point the only thing that does help on the app side is when I force frequent re-connection to postgresql (i.e. not letting a postmaster live too long)

-- Vlad


On Wed, Nov 21, 2012 at 10:18 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Nov 21, 2012 at 8:14 AM, Vlad <marchenko@gmail.com> wrote:
>
>
> '-M prepared' produces normal results, while '-M simple' results in 40% sys
> cpu.  '-M extended' is somewhere in between.
> I'm running it as 60 clients, 2 threads.

2 threads is pretty low for 60 clients.  What happens if you increase
-j to either 30 or 60?  (What I'm wondering here is if perhaps in this
case it is pgbench which is triggering the scheduling issue)

Also, what happens if you channel pgbench through pgbouncer?

Cheers,

Jeff

Re: High SYS CPU - need advise

From
Jeff Janes
Date:
On Wed, Nov 21, 2012 at 8:14 AM, Vlad <marchenko@gmail.com> wrote:
>
>
> '-M prepared' produces normal results, while '-M simple' results in 40% sys
> cpu.  '-M extended' is somewhere in between.
> I'm running it as 60 clients, 2 threads.

2 threads is pretty low for 60 clients.  What happens if you increase
-j to either 30 or 60?  (What I'm wondering here is if perhaps in this
case it is pgbench which is triggering the scheduling issue)

Also, what happens if you channel pgbench through pgbouncer?
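
For example (the script and database names are placeholders):

    pgbench -n -c 60 -j 30 -M simple -T 300 -f your_query.sql yourdb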

Cheers,

Jeff


Re: High SYS CPU - need advise

From
Vlad
Date:



what pgbouncer mode, and how large is your pool.


'-M prepared' produces normal results, while '-M simple' results in 40% sys cpu.  '-M extended' is somewhere in between.
I'm running it as 60 clients, 2 threads.


-- Vlad  

Re: High SYS CPU - need advise

From
"Kevin Grittner"
Date:
Vlad wrote:

> it's session mode and the pool size is 1200 (because I need to
> guarantee that in the worst case we have enough slots for all
> possible clients),

We found that the real-world production performance of a web
application servicing millions of web hits per day with thousands of
concurrent users improved when we reconfigured our database
connection pool to be about 35 instead of 55, on a 16 core box with a
40 drive RAID.

http://wiki.postgresql.org/wiki/Number_Of_Database_Connections

-Kevin


Re: High SYS CPU - need advise

From
Gavin Flower
Date:
On 25/11/12 09:30, Kevin Grittner wrote:
> Vlad wrote:
>
>> it's session mode and the pool size is 1200 (because I need to
>> guarantee that in the worst case we have enough slots for all
>> possible clients),
> We found that the real-world production performance of a web
> application servicing millions of web hits per day with thousands of
> concurrent users improved when we reconfigured our database
> connection pool to be about 35 instead of 55, on a 16 core box with a
> 40 drive RAID.
>
> http://wiki.postgresql.org/wiki/Number_Of_Database_Connections
>
> -Kevin
>
>
Curious,  what is your RAID configuration?


Cheers,
Gavin


Re: High SYS CPU - need advise

From
"Kevin Grittner"
Date:
Gavin Flower wrote:

>> We found that the real-world production performance of a web
>> application servicing millions of web hits per day with thousands
>> of concurrent users improved when we reconfigured our database
>> connection pool to be about 35 instead of 55, on a 16 core box
>> with a 40 drive RAID.

> Curious, what is your RAID configuration?

We have the OS on RAID 1 on one RAID controller, WAL on another RAID
1 array on its own controller, and the PostgreSQL data directory on
RAID 5 on a pair of controllers which each has a path to every drive.
(Don't ask me how that works, I just trust the hardware guys to get
that part right.) The active portion of the database is cached, the
read-to-write ratio is about 10 to 1, and the controllers all have
BBU cache with write-back. It is actually one pool of about 30 for
the read-only web app with so many hits, and a separate pool of 6
database connections for logical replication from about 80 source
databases handling an OLTP load of 3000 directly connected users.

We were seeing some performance degradation at peak loads, and
improved both user response time and overall throughput by reducing
the database connections used by the connection pool. Peak load was
handled much better that way.

I cringe every time I see someone say they need a large number of
database connections in order to support that many users. Having one
database connection for each user is, in my experience, guaranteed to
make things fall apart under load.

-Kevin


Re: High SYS CPU - need advise

From
Gavin Flower
Date:
On 25/11/12 11:11, Kevin Grittner wrote:
> Gavin Flower wrote:
>
>>> We found that the real-world production performance of a web
>>> application servicing millions of web hits per day with thousands
>>> of concurrent users improved when we reconfigured our database
>>> connection pool to be about 35 instead of 55, on a 16 core box
>>> with a 40 drive RAID.
>
>> Curious, what is your RAID configuration?
>
> We have the OS on RAID 1 on one RAID controller, WAL on another RAID
> 1 array on its own controller, and the PostgreSQL data directory on
> RAID 5 on a pair of controllers which each has a path to every drive.
> (Don't ask me how that works, I just trust the hardware guys to get
> that part right.) The active portion of the database is cached, the
> read-to-write ratio is about 10 to 1, and the controllers all have
> BBU cache with write-back. It is actually one pool of about 30 for
> the read-only web app with so many hits, and a separate pool of 6
> database connections for logical replication from about 80 source
> databases handling an OLTP load of 3000 directly connected users.
>
> We were seeing some performance degradation at peak loads, and
> improved both user response time and overall throughput by reducing
> the database connections used by the connection pool. Peak load was
> handled much better that way.
>
> I cringe every time I see someone say they need a large number of
> database connections in order to support that many users. Having one
> database connection for each user is, in my experience, guaranteed to
> make things fall apart under load.
>
> -Kevin

Thanks for a very comprehensive answer! (I almost simply asked what 'type' of RAID you had - I am glad I rephrased that!) I learnt far more than I expected, which is good - more than merely satisfied my curiosity!

I use RAID-6; but in development, reliability (for me, I've had 5 hard disk crashes in 20 years) is orders of magnitude more important than performance, with rare exceptions!

I can understand the mentality of wanting one connection per user, though I fully know why it is the wrong approach - I would also want to use connection pooling in your situation.

Back in the 1980's I was a Mainframe COBOL analyst/programmer just starting to use databases. Initially I thought of a transaction in terms of the user wanting to ensure their display and update were in the same transaction. It took me a while to fully appreciate and accept that transactions should be short to protect the database while not blocking other users too much (multi-user access was also a whole new concept), and that keeping a transaction open for user actions was not the 'done thing'! It requires a whole new change of mindset – something a lot of people find difficult, as it conflicts with part of their implicit belief system (just try discussing even the most obvious flaws in Creationism with a Believer!).

Now I'm quite comfortable with the idea of multi-user and am much more database centric while still appreciating the needs of users. Now my weapons of choice are Java & Linux, with Postgres being my preferred database.


Cheers,
Gavin





Re: High SYS CPU - need advise

From
Vlad
Date:
RAID10

-- vlad

On 11/24/2012 3:17 PM, Gavin Flower wrote:
>
> Curious,  what is your RAID configuration?
>