Thread: performance config help

performance config help

From
Bob Dusek
Date:
Hello,

We're running Postgres 8.4.2 on Red Hat 5, on pretty hefty hardware...

4X E7420 Xeon, Four cores (for a total of 16 cores)
2.13 GHz, 8M Cache, 1066 MHz FSB
32 Gigs of RAM
15 K RPM drives in striped raid

Things run fine, but when we get a lot of concurrent queries running, we see a pretty good slow down.

We don't have much experience with this sort of hardware.   Does anyone have an example config file we could use as a good starting point for this sort of hardware?

We have a fair amount of row-level inserts and deletes going on (probably as much data is inserted and deleted in a day as permanently resides in the db).

Bob

Re: performance config help

From
Scott Marlowe
Date:
On Mon, Jan 11, 2010 at 6:44 AM, Bob Dusek <redusek@gmail.com> wrote:
> Hello,
>
> We're running Postgres 8.4.2 on Red Hat 5, on pretty hefty hardware...
>
> 4X E7420 Xeon, Four cores (for a total of 16 cores)
> 2.13 GHz, 8M Cache, 1066 MHz FSB
> 32 Gigs of RAM
> 15 K RPM drives in striped raid

What method of striped RAID?  RAID-5?  RAID-10?  RAID-4?  RAID-0?

> Things run fine, but when we get a lot of concurrent queries running, we see
> a pretty good slow down.

Define "a lot".

> We don't have much experience with this sort of hardware.   Does anyone have
> an example config file we could use as a good starting point for this sort
> of hardware?
>
> We have a fair amount of row-level inserts and deletes going on (probably as
> much data is inserted and deleted in a day as permanently resides in the
> db).

What do the following commands tell you?

iostat -x 10 (first iteration doesn't count)
vmstat 10 (again, first iteration doesn't count)
top

What you're looking for is iowait / utilization.
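
For example, one way to capture a couple of minutes of all three at once
(a sketch; sample counts and output file names are arbitrary):

# 12 samples, 10 seconds apart; discard the first sample of each
iostat -x 10 12 > iostat.out &
vmstat 10 12    > vmstat.out &
top -b -n 3     > top.out        # batch-mode snapshots of top
wait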

Re: performance config help

From
"A. Kretschmer"
Date:
In response to Bob Dusek:
> Hello,
>
> We're running Postgres 8.4.2 on Red Hat 5, on pretty hefty hardware...
>
> 4X E7420 Xeon, Four cores (for a total of 16 cores)
> 2.13 GHz, 8M Cache, 1066 MHz FSB
> 32 Gigs of RAM
> 15 K RPM drives in striped raid
>
> Things run fine, but when we get a lot of concurrent queries running, we see a
> pretty good slow down.
>
> We don't have much experience with this sort of hardware.   Does anyone have an
> example config file we could use as a good starting point for this sort of
> hardware?

Have you tuned your postgresql.conf? (memory-parameter)

Here are some links for you:

15:07 < akretschmer> ??performance
15:07 < rtfm_please> For information about performance
15:07 < rtfm_please> see http://revsys.com/writings/postgresql-performance.html
15:07 < rtfm_please> or http://wiki.postgresql.org/wiki/Performance_Optimization
15:07 < rtfm_please> or http://www.depesz.com/index.php/2007/07/05/how-to-insert-data-to-database-as-fast-as-possible/


HTH, Andreas
--
Andreas Kretschmer
Kontakt:  Heynitz: 035242/47150,   D1: 0160/7141639 (mehr: -> Header)
GnuPG: 0x31720C99, 1006 CCB4 A326 1D42 6431  2EB0 389D 1DC2 3172 0C99

Re: performance config help

From
Bob Dusek
Date:
On Mon, Jan 11, 2010 at 8:50 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
On Mon, Jan 11, 2010 at 6:44 AM, Bob Dusek <redusek@gmail.com> wrote:
> Hello,
>
> We're running Postgres 8.4.2 on Red Hat 5, on pretty hefty hardware...
>
> 4X E7420 Xeon, Four cores (for a total of 16 cores)
> 2.13 GHz, 8M Cache, 1066 MHz FSB
> 32 Gigs of RAM
> 15 K RPM drives in striped raid

What method of striped RAID?  RAID-5?  RAID-10?  RAID-4?  RAID-0?

RAID-0
 

> Things run fine, but when we get a lot of concurrent queries running, we see
> a pretty good slow down.

Define "a lot".


We have an application server that is processing requests.  Each request consists of a combination of selects, inserts, and deletes.  We actually see degradation when we get more than 40 concurrent requests.  The exact number of queries executed by each request isn't known.  It varies per request.  But, an example request would be about 16 inserts and 113 selects.  Any given request can't execute more than a single query at a time.

To be more specific about the degradation, we've set the "log_min_duration_statement=200", and when we run with 40 concurrent requests, we don't see queries showing up in there.  When we run with 60 concurrent requests, we start seeing queries show up, and when we run 200+ requests, we see multi-second queries.

This is to be expected, to some extent, as we would expect some performance degradation with higher utilization.  But, the hardware doesn't appear to be very busy, and that's where we're hoping for some help.

We want to have Postgres eating up as many resources as possible to chug through our queries faster.  Right now, it's running slower with more utilization, but there's still too much idle time for the CPUs. 

> We don't have much experience with this sort of hardware.   Does anyone have
> an example config file we could use as a good starting point for this sort
> of hardware?
>
> We have a fair amount of row-level inserts and deletes going on (probably as
> much data is inserted and deleted in a day as permanently resides in the
> db).

What do the following commands tell you?

iostat -x 10 (first iteration doesn't count)

Here's some iostat output (from the 3rd data point from iostat -x 10)... this was taken while we were processing 256 simultaneous requests.

 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          34.29    0.00    7.09    0.03    0.00   58.58

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00   112.20  0.00 133.40     0.00  1964.80    14.73     0.42    3.17   0.04   0.48
sda1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda3              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda4              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda5              0.00   112.20  0.00 133.40     0.00  1964.80    14.73     0.42    3.17   0.04   0.48
sdb               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-0              0.00     0.00  0.00  0.40     0.00     3.20     8.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00  0.00 99.90     0.00   799.20     8.00     0.15    1.50   0.01   0.12
dm-3              0.00     0.00  0.00  0.60     0.00     4.80     8.00     0.00    0.33   0.17   0.01
dm-4              0.00     0.00  0.00 144.70     0.00  1157.60     8.00     0.46    3.17   0.02   0.35
dm-5              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

The iowait seems pretty low, doesn't it?

vmstat 10 (again, first iteration doesn't count)

Here's the vmstat output, with the 6th data element clipped out, which seems representative of the whole... (also taken during 256 simultaneous requests)

[root@ecpe1 pg_log]# vmstat 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
8  0      0 21818136 696692 3871172    0    0     0  1042 2261 48418 34  7 59  0  0
 
That's a lot of free mem, which is to be expected.  Our database is not very large.

top

(taken during 256 simultaneous requests)

top - 10:11:43 up 12 days, 20:48,  6 users,  load average: 10.48, 4.16, 2.83
Tasks: 798 total,   8 running, 790 sleeping,   0 stopped,   0 zombie
Cpu0  : 33.3%us,  7.6%sy,  0.0%ni, 59.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 31.6%us,  5.9%sy,  0.0%ni, 62.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 33.0%us,  6.6%sy,  0.0%ni, 60.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 35.4%us,  6.2%sy,  0.0%ni, 58.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu4  : 36.3%us,  5.6%sy,  0.0%ni, 58.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  : 37.4%us,  6.2%sy,  0.0%ni, 56.1%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu6  : 38.1%us,  6.0%sy,  0.0%ni, 56.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 39.2%us,  7.5%sy,  0.0%ni, 52.9%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu8  : 35.5%us,  7.2%sy,  0.0%ni, 56.9%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu9  : 37.8%us,  7.6%sy,  0.0%ni, 54.3%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu10 : 39.5%us,  5.9%sy,  0.0%ni, 54.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 : 34.5%us,  7.2%sy,  0.0%ni, 58.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 : 41.1%us,  6.9%sy,  0.0%ni, 50.3%id,  0.0%wa,  0.0%hi,  1.6%si,  0.0%st
Cpu13 : 38.0%us,  7.3%sy,  0.0%ni, 54.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 : 36.2%us,  6.2%sy,  0.0%ni, 57.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 : 36.8%us,  8.2%sy,  0.0%ni, 54.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  28817004k total,  8008372k used, 20808632k free,   705772k buffers   
Swap: 30867448k total,        0k used, 30867448k free,  4848376k cached   

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
2641 postgres  16   0 8524m  21m  16m S  3.3  0.1   0:00.25 postmaster
2650 postgres  15   0 8524m  21m  16m S  3.3  0.1   0:00.25 postmaster
2706 postgres  16   0 8524m  21m  15m S  3.3  0.1   0:00.20 postmaster
2814 postgres  15   0 8523m  18m  14m S  3.3  0.1   0:00.10 postmaster
2829 postgres  15   0 8523m  18m  14m S  3.3  0.1   0:00.10 postmaster
2618 postgres  15   0 8524m  21m  16m S  3.0  0.1   0:00.25 postmaster
2639 postgres  15   0 8524m  21m  16m R  3.0  0.1   0:00.25 postmaster
2671 postgres  15   0 8524m  21m  16m S  3.0  0.1   0:00.23 postmaster
2675 postgres  16   0 8524m  21m  16m S  3.0  0.1   0:00.23 postmaster
2694 postgres  15   0 8524m  21m  15m S  3.0  0.1   0:00.23 postmaster
2698 postgres  15   0 8524m  21m  15m S  3.0  0.1   0:00.21 postmaster
2702 postgres  15   0 8524m  21m  15m S  3.0  0.1   0:00.19 postmaster
2767 postgres  15   0 8524m  20m  14m S  3.0  0.1   0:00.13 postmaster
2776 postgres  15   0 8524m  20m  14m S  3.0  0.1   0:00.14 postmaster
2812 postgres  15   0 8523m  18m  14m S  3.0  0.1   0:00.11 postmaster
2819 postgres  15   0 8523m  18m  14m S  3.0  0.1   0:00.09 postmaster
2823 postgres  16   0 8523m  18m  14m S  3.0  0.1   0:00.09 postmaster
2828 postgres  15   0 8523m  18m  14m S  3.0  0.1   0:00.09 postmaster 
2631 postgres  15   0 8524m  21m  15m S  2.6  0.1   0:00.24 postmaster  
2643 postgres  15   0 8524m  21m  15m S  2.6  0.1   0:00.23 postmaster
2656 postgres  15   0 8524m  21m  15m S  2.6  0.1   0:00.22 postmaster
2658 postgres  16   0 8524m  21m  15m S  2.6  0.1   0:00.22 postmaster
2664 postgres  16   0 8524m  21m  16m S  2.6  0.1   0:00.24 postmaster
2674 postgres  15   0 8524m  21m  15m S  2.6  0.1   0:00.23 postmaster 
2679 postgres  15   0 8524m  21m  15m S  2.6  0.1   0:00.22 postmaster
2684 postgres  15   0 8524m  21m  15m S  2.6  0.1   0:00.21 postmaster
2695 postgres  15   0 8524m  20m  15m S  2.6  0.1   0:00.18 postmaster
2699 postgres  15   0 8524m  20m  15m S  2.6  0.1   0:00.18 postmaster
2703 postgres  15   0 8524m  20m  15m S  2.6  0.1   0:00.18 postmaster 
2704 postgres  15   0 8524m  20m  15m R  2.6  0.1   0:00.17 postmaster
2713 postgres  15   0 8524m  20m  15m S  2.6  0.1   0:00.18 postmaster
2734 postgres  15   0 8524m  20m  14m S  2.6  0.1   0:00.14 postmaster
2738 postgres  15   0 8524m  20m  15m S  2.6  0.1   0:00.14 postmaster

 
What you're looking for is iowait / utilization.

Re: performance config help

From
Bob Dusek
Date:
On Mon, Jan 11, 2010 at 9:07 AM, A. Kretschmer <andreas.kretschmer@schollglas.com> wrote:
In response to Bob Dusek:
> Hello,
>
> We're running Postgres 8.4.2 on Red Hat 5, on pretty hefty hardware...
>
> 4X E7420 Xeon, Four cores (for a total of 16 cores)
> 2.13 GHz, 8M Cache, 1066 MHz FSB
> 32 Gigs of RAM
> 15 K RPM drives in striped raid
>
> Things run fine, but when we get a lot of concurrent queries running, we see a
> pretty good slow down.
>
> We don't have much experience with this sort of hardware.   Does anyone have an
> example config file we could use as a good starting point for this sort of
> hardware?

Have you tuned your postgresql.conf? (memory-parameter)


Here's a list of what I consider to be key changes we've made to the config file (from default)...

for comparison purposes, the diff command was "diff postgresql.conf.dist postgresql.conf.mod"
64c64
< max_connections = 100            # (change requires restart)
---
> max_connections = 300            # (change requires restart)
78c78
< ssl = true                # (change requires restart)
---
> #ssl = off                # (change requires restart)
106c106,107
< shared_buffers = 32MB            # min 128kB
---
> #shared_buffers = 32MB            # min 128kB
> shared_buffers = 8GB            # min 128kB (rdk)
115a117
> work_mem = 64MB                # min 64kB (vrk) (rdk)
117c119,121
< #max_stack_depth = 2MB            # min 100kB
---
> maintenance_work_mem = 2GB        # min 1MB (rdk)
> #max_stack_depth = 1MB            # min 100kB
> max_stack_depth = 9MB            # min 100kB (vrk)
127a132
> vacuum_cost_delay = 15ms        # 0-100 milliseconds (rdk)
150c155
< #fsync = on                # turns forced synchronization on or off
---
> fsync = off                # turns forced synchronization on or off (rdk)

Please note, I've been reading this list a bit lately, and I'm aware of the kind of advice that some offer with respect to fsync.  I understand that with 8.4 we can leave fsync on and shut off "synchronous_commit" instead (see the sketch below the diff).  I would be interested in more information on that.  But the bottom line is that we've gotten in the habit of shutting fsync off (on production servers) since Postgres 7.4, as the performance gain is enormous, and with fsync=on, we couldn't get the performance we needed.

151a157
> synchronous_commit = off        # immediate fsync at commit
152a159
> wal_sync_method = open_sync        # the default is the first option (vrk)
159c166
< #full_page_writes = on            # recover from partial page writes
---
> full_page_writes = off            # recover from partial page writes (rdk)
160a168
> wal_buffers = 8MB            # min 32kB (rdk)
164c172
< #commit_delay = 0            # range 0-100000, in microseconds
---
> commit_delay = 10            # range 0-100000, in microseconds (vrk)
169a178
> checkpoint_segments = 256        # in logfile segments, min 1, 16MB each (rdk)
170a180
> checkpoint_timeout = 15min        # range 30s-1h (rdk)
171a182
> checkpoint_completion_target = 0.7    # checkpoint target duration, 0.0 - 1.0 (rdk)
206a218
> effective_cache_size = 24GB # (rdk)

I would be willing to send our entire config file to someone if that would help... I didn't want to attach it to this email, because I'm not sure about the etiquette of attaching files to emails on this list.  
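
On the fsync point above: with 8.4, a middle ground is to keep fsync on and relax only synchronous_commit, which keeps the database consistent after a crash at the cost of possibly losing the last few hundred milliseconds of commits.  A minimal postgresql.conf sketch (values are illustrative, not a tested recommendation):

fsync = on                  # WAL is still flushed; no corruption on crash
synchronous_commit = off    # commits return before the WAL flush; a crash can
                            # lose recent commits, but the db stays consistent
full_page_writes = on       # guards against torn pages without a BBU controller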
 
Here are some links for you:

15:07 < akretschmer> ??performance
15:07 < rtfm_please> For information about performance
15:07 < rtfm_please> see http://revsys.com/writings/postgresql-performance.html
15:07 < rtfm_please> or http://wiki.postgresql.org/wiki/Performance_Optimization
15:07 < rtfm_please> or http://www.depesz.com/index.php/2007/07/05/how-to-insert-data-to-database-as-fast-as-possible/

We will spend more time on this and see if we can learn more. 

But, we're hoping someone on this list can offer us some quick tips to help us use up more of the 16 cpus we have available. 

Thanks for pointing all of that out. 
 


HTH, Andreas
--
Andreas Kretschmer
Kontakt:  Heynitz: 035242/47150,   D1: 0160/7141639 (mehr: -> Header)
GnuPG: 0x31720C99, 1006 CCB4 A326 1D42 6431  2EB0 389D 1DC2 3172 0C99


Re: performance config help

From
Ivan Voras
Date:
Bob Dusek wrote:
> On Mon, Jan 11, 2010 at 8:50 AM, Scott Marlowe <scott.marlowe@gmail.com
> <mailto:scott.marlowe@gmail.com>> wrote:
>
>     On Mon, Jan 11, 2010 at 6:44 AM, Bob Dusek <redusek@gmail.com
>     <mailto:redusek@gmail.com>> wrote:
>      > Hello,
>      >
>      > We're running Postgres 8.4.2 on Red Hat 5, on pretty hefty
>     hardware...
>      >
>      > 4X E7420 Xeon, Four cores (for a total of 16 cores)
>      > 2.13 GHz, 8M Cache, 1066 Mhz FSB
>      > 32 Gigs of RAM
>      > 15 K RPM drives in striped raid
>
>     What method of striped RAID?  RAID-5?  RAID-10?  RAID-4?  RAID-0?
>
>
> RAID-0

And how many drives?

>
>      > Things run fine, but when we get a lot of concurrent queries
>     running, we see
>      > a pretty good slow down.
>
>     Define "a lot".
>
>
> We have an application server that is processing requests.  Each request
> consists of a combination of selects, inserts, and deletes.  We actually
> see degradation when we get more than 40 concurrent requests.  The exact
> number of queries executed by each request isn't known.  It varies per
> request.  But, an example request would be about 16 inserts and 113
> selects.  Any given request can't execute more than a single query at a
> time.

So, you are concurrently trying to achieve more than around 640 writes
and 4520 reads (40 requests x 16 inserts and 40 x 113 selects -- worst
case figures). This should be interesting. For
40 concurrent requests you will probably need at least 4 drives in
RAID-0 to sustain the write rates (and I'll guess 5-6 to be sure to
cover read requests also, together with plenty of RAM).

>  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           34.29    0.00    7.09    0.03    0.00   58.58
>
> Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz
> avgqu-sz   await  svctm  %util
> sda               0.00   112.20  0.00 133.40     0.00  1964.80
> 14.73     0.42    3.17   0.04   0.48
>
> The iowait seems pretty low, doesn't it?

Yes, but you are issuing 133 write operations per second per drive -
this is nearly the limit of what you can get with 15k RPM drives
(actually, the limit should be somewhere around 200..250 IOPS but 133
isn't that far).

> top - 10:11:43 up 12 days, 20:48,  6 users,  load average: 10.48, 4.16, 2.83
> Tasks: 798 total,   8 running, 790 sleeping,   0 stopped,   0 zombie
> Cpu0  : 33.3%us,  7.6%sy,  0.0%ni, 59.1%id,  0.0%wa,  0.0%hi,  0.0%si,
> 0.0%st

There is one other possibility - since the CPUs are not very loaded, are
you sure the client application with which you are testing is fast
enough to issue enough requests to saturate the database?

Re: performance config help

From
"Kevin Grittner"
Date:
Bob Dusek <redusek@gmail.com> wrote:
> Scott Marlowe <scott.marlowe@gmail.com>wrote:
>> Bob Dusek <redusek@gmail.com> wrote:

>>> 4X E7420 Xeon, Four cores (for a total of 16 cores)

>> What method of striped RAID?
>
> RAID-0

I hope you have a plan for what to do when any one drive in this
array fails, and the entire array is unusable.

Anyway, my benchmarks tend to show that best throughput occurs at
about (CPU_count * 2) plus effective_spindle_count.  Since you seem
to be fully cached, effective_spindle_count would be zero, so I
would expect performance to start to degrade when you have more than
about 32 sessions active.

> We actually see degradation when we get more than 40 concurrent
> requests.

Close enough.

> when we run 200+ requests, we see multi-second queries.

No surprise there.  Your vmstat output suggests that context
switches are becoming a problem, and I wouldn't be surprised if I
heard that the network is an issue.  You might want to have someone
take a look at the network side to check.

You want to use some connection pooling which queues requests when
more than some configurable number of connections is already active
with a request.  You probably want to run that on the server side.
As for the postgresql.conf, could you show what you have right now,
excluding all comments?
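
On the pooling point, as an illustration only: a drop-in pooler such as
pgbouncer can do exactly this kind of queuing.  A minimal pgbouncer.ini
sketch (database name, auth, and sizes are placeholders; auth setup via
userlist.txt is omitted):

[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
pool_mode = transaction    ; a server connection is held only per transaction
max_client_conn = 300      ; clients beyond the pool size queue here
default_pool_size = 32     ; ~ (CPU_count * 2) + effective_spindle_count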

-Kevin

Re: performance config help

From
Ivan Voras
Date:
Ivan Voras wrote:

> Yes, but you are issuing 133 write operations per second per drive -
> this is nearly the limit of what you can get with 15k RPM drives
> (actually, the limit should be somewhere around 200..250 IOPS but 133
> isn't that far).

I saw in your other post you have fsync turned off so ignore this, it's
not an IO problem in your case.

Re: performance config help

From
Scott Marlowe
Date:
On Mon, Jan 11, 2010 at 9:42 AM, Bob Dusek <redusek@gmail.com> wrote:
>> What method of striped RAID?  RAID-5?  RAID-10?  RAID-4?  RAID-0?
>
> RAID-0

Just wondering how many drives?

> To be more specific about the degradation, we've set the
> "log_min_duration_statement=200", and when we run with 40 concurrent
> requests, we don't see queries showing up in there.  When we run with 60
> concurrent requests, we start seeing queries show up, and when we run 200+
> requests, we see multi-second queries.
>
> This is to be expected, to some extent, as we would expect some performance
> degradation with higher utilization.  But, the hardware doesn't appear to be
> very busy, and that's where we're hoping for some help.

It's likely in io wait.

>> What do the following commands tell you?
>>
>> iostat -x 10 (first iteration doesn't count)
>
> Here's some iostat output (from the 3rd data point from iostat -x 10)...
> this was taken while we were processing 256 simultaneous requests.
>
>  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           34.29    0.00    7.09    0.03    0.00   58.58
>
> Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz
> avgqu-sz   await  svctm  %util
> sda               0.00   112.20  0.00 133.40     0.00  1964.80    14.73
> 0.42    3.17   0.04   0.48
> sda1              0.00     0.00  0.00  0.00     0.00     0.00     0.00
> 0.00    0.00   0.00   0.00
> sda2              0.00     0.00  0.00  0.00     0.00     0.00     0.00
> 0.00    0.00   0.00   0.00
> sda3              0.00     0.00  0.00  0.00     0.00     0.00     0.00
> 0.00    0.00   0.00   0.00
> sda4              0.00     0.00  0.00  0.00     0.00     0.00     0.00
> 0.00    0.00   0.00   0.00
> sda5              0.00   112.20  0.00 133.40     0.00  1964.80    14.73
> 0.42    3.17   0.04   0.48
> sdb               0.00     0.00  0.00  0.00     0.00     0.00     0.00
> 0.00    0.00   0.00   0.00
> dm-0              0.00     0.00  0.00  0.40     0.00     3.20     8.00
> 0.00    0.00   0.00   0.00
> dm-1              0.00     0.00  0.00  0.00     0.00     0.00     0.00
> 0.00    0.00   0.00   0.00
> dm-2              0.00     0.00  0.00 99.90     0.00   799.20     8.00
> 0.15    1.50   0.01   0.12
> dm-3              0.00     0.00  0.00  0.60     0.00     4.80     8.00
> 0.00    0.33   0.17   0.01
> dm-4              0.00     0.00  0.00 144.70     0.00  1157.60     8.00
> 0.46    3.17   0.02   0.35
> dm-5              0.00     0.00  0.00  0.00     0.00     0.00     0.00
> 0.00    0.00   0.00   0.00
>
> The iowait seems pretty low, doesn't it?

Depends, is that the first iteration of output?  if so, ignore it and
show me the second and further on.  Same for vmstat...  In fact let
them run for a minute or two and attach the results...  OTOH, if that
is the second or later set of output, then you're definitely not IO
bound, and I don't see why the CPUs are not being better utilized.

How many concurrent queries are you running when you take these
measurements?  Can you take them with lower and higher numbers of
concurrent users and compare the two?  normally I'd be looking for
context switching taking more and more time in a heavily loaded
system, but I'm not seeing it in your vmstat numbers either.

What are your settings for

effective_cache_size
random_page_cost
work_mem

with your machine and the extra memory, you can probably uptune the
work_mem to 8 Megs safely if it's at the default of 1MB.  With a
database that fits in RAM, you can often turn down random_page_cost to
near 1.0 (1.2 to 1.4 is common for such scenarios.)  And effective
cache size being larger (in the 20G range) will hint the planner that
it's likely to find everything it needs in ram somewhere and not on
the disk.
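
Expressed as postgresql.conf lines, the kind of values described above
would look something like this (a sketch for a small, fully cached
database, not a drop-in recommendation):

effective_cache_size = 24GB   # planner hint only; allocates no memory
random_page_cost = 1.2        # random reads cost ~ sequential when cached
work_mem = 8MB                # per sort/hash, per backend; scales with connections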

There are several common bottlenecks you can try to tune away from.
IO doesn't look like a problem for you.  Neither does CPU load.  So,
then we're left with context switching time and memory to CPU
bandwidth.  If your CPUs are basically becoming data pumps then the
speed of your FSB becomes VERY critical, and some older Intel mobos
didn't have a lot of CPU to Memory bandwidth and adding CPUs made it
worse, not better.  More modern Intel chipsets have much faster CPU to
Memory BW, since they're using the same kind of fabric switching that
AMD uses on highly parallel machines.

If your limit is your hardware, then the only solution is a faster
machine.  It may well be that a machine with dual fast Nehalem
(2.4GHz+) quad core CPUs will be faster.  Or 4 or 8 AMD CPUs with
their faster fabric.

Re: performance config help

From
Scott Marlowe
Date:
We may want to start looking at query plans for the slowest queries.
Use explain analyze to find them and attach them here.  I kinda have a
feeling you're running into a limit on the speed of your memory
though, and there's no real cure for that.  You can buy a little time
with some query or db tuning, but 250 or more concurrent users is a
LOT.
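
For example (table and column names here are hypothetical, standing in
for one of the slow statements from the logs):

EXPLAIN ANALYZE
SELECT *
  FROM requests r
  JOIN request_items ri ON ri.request_id = r.id
 WHERE r.status = 'pending';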

Re: performance config help

From
Bob Dusek
Date:

> This is to be expected, to some extent, as we would expect some performance
> degradation with higher utilization.  But, the hardware doesn't appear to be
> very busy, and that's where we're hoping for some help.

It's likely in io wait.

>> What do the following commands tell you?
>>
>> iostat -x 10 (first iteration doesn't count)
>
> Here's some iostat output (from the 3rd data point from iostat -x 10)...
> this was taken while we were processing 256 simultaneous requests.
>
>  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           34.29    0.00    7.09    0.03    0.00   58.58
>
> Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz
> avgqu-sz   await  svctm  %util
> sda               0.00   112.20  0.00 133.40     0.00  1964.80    14.73
> 0.42    3.17   0.04   0.48
> sda1              0.00     0.00  0.00  0.00     0.00     0.00     0.00
> 0.00    0.00   0.00   0.00
> sda2              0.00     0.00  0.00  0.00     0.00     0.00     0.00
> 0.00    0.00   0.00   0.00
> sda3              0.00     0.00  0.00  0.00     0.00     0.00     0.00
> 0.00    0.00   0.00   0.00
> sda4              0.00     0.00  0.00  0.00     0.00     0.00     0.00
> 0.00    0.00   0.00   0.00
> sda5              0.00   112.20  0.00 133.40     0.00  1964.80    14.73
> 0.42    3.17   0.04   0.48
> sdb               0.00     0.00  0.00  0.00     0.00     0.00     0.00
> 0.00    0.00   0.00   0.00
> dm-0              0.00     0.00  0.00  0.40     0.00     3.20     8.00
> 0.00    0.00   0.00   0.00
> dm-1              0.00     0.00  0.00  0.00     0.00     0.00     0.00
> 0.00    0.00   0.00   0.00
> dm-2              0.00     0.00  0.00 99.90     0.00   799.20     8.00
> 0.15    1.50   0.01   0.12
> dm-3              0.00     0.00  0.00  0.60     0.00     4.80     8.00
> 0.00    0.33   0.17   0.01
> dm-4              0.00     0.00  0.00 144.70     0.00  1157.60     8.00
> 0.46    3.17   0.02   0.35
> dm-5              0.00     0.00  0.00  0.00     0.00     0.00     0.00
> 0.00    0.00   0.00   0.00
>
> The iowait seems pretty low, doesn't it?

Depends, is that the first iteration of output?  if so, ignore it and
show me the second and further on.  Same for vmstat...  In fact let
them run for a minute or two and attach the results...  OTOH, if that
is the second or later set of output, then you're definitely not IO
bound, and I don't see why the CPUs are not being better utilized.

I was probably not clear... the output I pasted was from the third iteration of output from iostat.  And, the vmstat output I pasted was from the sixth iteration of output.

How many concurrent queries are you running when you take these
measurements?  Can you take them with lower and higher numbers of
concurrent users and compare the two?  normally I'd be looking for
context switching taking more and more time in a heavily loaded
system, but I'm not seeing it in your vmstat numbers either.

We took those measurements with 256 concurrent requests being processed.  So, at most, we have 256 concurrent queries executed by our application.  There aren't other applications using the db in our tests. 

We can take some measurements at 40 concurrent requests and see where we stand.
 
What are your settings for

effective_cache_size
 
 effective_cache_size = 24GB 
 
random_page_cost

Using the default...

#random_page_cost = 4.0   
 
work_mem
 
 work_mem = 64MB
 
with your machine and the extra memory, you can probably uptune the
work_mem to 8 Megs safely if it's at the default of 1MB.  With a
database that fits in RAM, you can often turn down random_page_cost to
near 1.0 (1.2 to 1.4 is common for such scenarios.)  And effective
cache size being larger (in the 20G range) will hint the planner that
it's likely to find everything it needs in ram somewhere and not on
the disk.

So, we should probably try cranking our random_page_cost value down.  When we dump our db with "pg_dump --format=t", it's about 15 MB.  We should be able to keep the thing in memory.

There are several common bottlenecks you can try to tune away from.
IO doesn't look like a problem for you.  Neither does CPU load.  So,
then we're left with context switching time and memory to CPU
bandwidth.  If your CPUs are basically becoming data pumps then the
speed of your FSB becomes VERY critical, and some older Intel mobos
didn't have a lot of CPU to Memory bandwidth and adding CPUs made it
worse, not better.  More modern Intel chipsets have much faster CPU to
Memory BW, since they're using the same kind of fabric switching that
AMD uses on highly parallel machines.

Each CPU is 2.13 GHz, with 8MB Cache, and the FSB is 1066 MHz.  Does that bus speed seem slow? 

It's hard to go to the money tree and say "we're only using about half of your CPUs, but you need to get better ones."
 
If your limit is your hardware, then the only solution is a faster
machine.  It may well be that a machine with dual fast Nehalem
(2.4GHz+) quad core CPUs will be faster.  Or 4 or 8 AMD CPUs with
their faster fabric.

It sounds like we could spend less money on memory and more on faster hard drives and faster CPUs. 

But, man, that's a tough sell.  This box is a giant, relative to anything else we've worked with. 

Re: performance config help

From
Bob Dusek
Date:


On Mon, Jan 11, 2010 at 12:17 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
Bob Dusek <redusek@gmail.com> wrote:
> Scott Marlowe <scott.marlowe@gmail.com>wrote:
>> Bob Dusek <redusek@gmail.com> wrote:

>>> 4X E7420 Xeon, Four cores (for a total of 16 cores)

>> What method of striped RAID?
>
> RAID-0

I hope you have a plan for what to do when any one drive in this
array fails, and the entire array is unusable.

Point noted.
 
Anyway, my benchmarks tend to show that best throughput occurs at
about (CPU_count * 2) plus effective_spindle_count.  Since you seem
to be fully cached, effective_spindle_count would be zero, so I
would expect performance to start to degrade when you have more than
about 32 sessions active.

That's a little disheartening for a single or dual CPU system. 


> We actually see degradation when we get more than 40 concurrent
> requests.

Close enough.

> when we run 200+ requests, we see multi-second queries.

No surprise there.  Your vmstat output suggests that context
switches are becoming a problem, and I wouldn't be surprised if I
heard that the network is an issue.  You might want to have someone
take a look at the network side to check.

This is all happening on a LAN, and network throughput doesn't seem to be an issue.  It may be a busy network, but I'm not sure about a problem.  Can you elaborate on your suspicion, based on the vmstat?  I haven't used vmstat much.

You want to use some connection pooling which queues requests when
more than some configurable number of connections is already active
with a request.  You probably want to run that on the server side.
As for the postgresql.conf, could you show what you have right now,
excluding all comments?

The problem with connection pooling is that we actually have to achieve more than 40 per second, which happens to be the sweet spot with our current config. 

I posted our changes from the default in a reply to another message.  I don't want to duplicate.  Can you check those out? 

-Kevin

Re: performance config help

From
Scott Marlowe
Date:
On Mon, Jan 11, 2010 at 10:49 AM, Bob Dusek <redusek@gmail.com> wrote:
>> Depends, is that the first iteration of output?  if so, ignore it and
>> show me the second and further on.  Same for vmstat...  In fact let
>> them run for a minute or two and attach the results...  OTOH, if that
>> is the second or later set of output, then you're definitely not IO
>> bound, and I don't see why the CPUs are not being better utilized.
>>
> I was probably not clear... the output I pasted was from the third iteration
> of output from iostat.  And, the vmstat output I pasted was from the sixth
> iteration of output

Yeah, you're definitely CPU/Memory bound it seems.

> We can take some measurements at 40 concurrent requests and see where we
> stand.

We'll probably not see much difference, if you're waiting on memory.

> So, we should probably try cranking our random_page_cost value down.  When
> we dump our db with "pg_dump --format=t", it's about 15 MB.  We should be
> able to keep the thing in memory.

Yeah, I doubt that changing it will make a huge difference given how
small your db is.

>> There are several common bottlenecks you can try to tune away from.
>> IO doesn't look like a problem for you.  Neither does CPU load.  So,
>> then we're left with context switching time and memory to CPU
>> bandwidth.  If your CPUs are basically becoming data pumps then the
>> speed of your FSB becomes VERY critical, and some older Intel mobos
>> didn't have a lot of CPU to Memory bandwidth and adding CPUs made it
>> worse, not better.  More modern Intel chipsets have much faster CPU to
>> Memory BW, since they're using the same kind of fabric switching that
>> AMD uses on highly parallel machines.
>
> Each CPU is 2.13 GHz, with 8MB Cache, and the FSB is 1066 MHz.  Does that
> bus speed seem slow?

When 16 cores are all sharing the same bus (which a lot of older
designs do) then yes.  I'm not that familiar with the chipset you're
running, but I don't think that series CPU has an integrated memory
controller.  Does it break the memory into separate chunks that
different cpus can access without stepping on each other's toes?

Later Intel and all AMDs since the Opteron have built in memory
controllers.  This meant that going from 2 to 4 cpus in an AMD server
doubled your memory bandwidth, while going from 2 to 4 cpus on older
intel designs left it the same so that each cpu got 1/2 as much
bandwidth as it had before when there were 2.

> It's hard to go to the money tree  and say "we're only using about half of
> your CPUs, but you need to get better ones."

Well, if the problem is that you've got a chipset that can't utilize
all your CPUs because of memory bw starvation, it's your only fix.
You should set up some streaming read / write to memory tests you can
run singly, then on 2, 4, 8, 16 cores and see how fast memory access is
as you add more threads.  I'm betting throughput will top out well
before all 16 cores are working on the problem.
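
A sketch of that test using John McCalpin's STREAM benchmark (the URL,
compiler flags, and default array size may need adjusting):

# fetch and build the STREAM memory-bandwidth benchmark
wget http://www.cs.virginia.edu/stream/FTP/Code/stream.c
gcc -O3 -fopenmp stream.c -o stream

# run with increasing thread counts; watch whether Triad MB/s keeps scaling
for t in 1 2 4 8 16; do
    echo "threads=$t"
    OMP_NUM_THREADS=$t ./stream | grep Triad
done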

>> If your limit is your hardware, then the only solution is a faster
>> machine.  It may well be that a machine with dual fast Nehalem
>> (2.4GHz+) quad core CPUs will be faster.  Or 4 or 8 AMD CPUs with
>> their faster fabric.
>
> It sounds like we could spend less money on memory and more on faster hard
> drives and faster CPUs.

I'm pretty sure you could live with slower hard drives here, and fsync
on as well possibly.  It looks like it's all cpu <-> memory bandwidth.
 But I'm just guessing.

> But, man, that's a tough sell.  This box is a giant, relative to anything
> else we've worked with.

Yeah, I understand.  We're looking at having to upgrade our dual cpu /
quad core AMD 2.1GHz machine to 4 hex core cpus this summer, possibly
dodecacore cpus even.

So, I took a break from writing and searched for some more info on the
74xx series CPUs, and from reading lots of articles, including this
one:

http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3414

It seems apparent that the 74xx series is a great CPU, as long as
you're not memory bound.

Re: performance config help

From
"Kevin Grittner"
Date:
Bob Dusek <redusek@gmail.com> wrote:
> Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
>> Bob Dusek <redusek@gmail.com> wrote:

>> Anyway, my benchmarks tend to show that best throughput occurs at
>> about (CPU_count * 2) plus effective_spindle_count.  Since you
>> seem to be fully cached, effective_spindle_count would be zero,
>> so I would expect performance to start to degrade when you have
>> more than about 32 sessions active.
>>
> That's a little disheartening for a single or dual CPU system.

Not at all.  You only have so many resources to keep busy at any one
moment.  It is generally more efficient to only context switch
between as many processes as can keep those resources relatively
busy; otherwise valuable resources are spent switching among the
various processes rather than doing useful work.

[Regular readers of this list might want to skip ahead while I run
through my usual "thought experiment" on the topic.  ;-) ]

Imagine this hypothetical environment -- you have one CPU running
requests.  There are no other resources to worry about and no
latency to the clients.  Let's say that the requests take one second
each.  The client suddenly has 100 requests to run.  Assuming
context switching is free, you could submit all at once, and 100
seconds later, you get 100 responses, with an average response time
of 100 seconds.  Let's put a (again free) connection pooler in
there.  You submit those 100 requests, but they are fed to the
database one at a time.  You get one response back in one second,
the next in two seconds, the last in 100 seconds.  No request took
any longer, and the average response time was 50.5 seconds -- almost
a 50% reduction.

Now context switching is not free, and you had tens of thousands of
them per second.  Besides the hit on CPU availability during each
switch, you're reducing the value of the L1 and L2 caches.  So in
reality, you could expect your "request storm" to perform
significantly worse in comparison to the connection pooled
configuration.  In reality, you have more than one resource to keep
busy, so the pool should be sized greater than one; but it's still
true that there is some point at which getting a request to the
database server delays the response to that request more than
queuing it for later execution would.  Some database products build
in a way to manage this; in PostgreSQL it's on you to do so.

>> Your vmstat output suggests that context switches are becoming a
>> problem, and I wouldn't be surprised if I heard that the network
>> is an issue.  You might want to have someone take a look at the
>> network side to check.
>>
> This is all happening on a LAN, and network throughput doesn't
> seem to be an issue.  It may be a busy network, but I'm not sure
> about a problem.  Can you elaborate on your suspicion, based on
> the vmstat?  I haven't used vmstat much.

It was simply this: all that CPU idle time while it was swamped with
requests suggests that there might be a bottleneck outside the
database server.  That could be, as another post suggests, the
client software.  It could also be the network.  (It could also be
contention on locks within PostgreSQL from the large number of
requests, but that's covered by the connection pooling suggestion.)

> The problem with connection pooling is that we actually have to
> achieve more than 40 per second, which happens to be the sweet
> spot with our current config.

Well, if you're considering a connection pool which can only submit
one request per second, you're looking at the wrong technology.  We
use a custom connection pool built into our software, so I'm not
very familiar with the "drop in" packages out there, but we start
the next queued request based on the completion of a request --
there's no polling involved.

Between the RAID 0, fsync = off, and full_page_writes = off -- you
really had better not be staking anything important on this data.
This configuration would make The Flying Wallendas break out in a
sweat.  It suggests to me that you might want to look into a better
RAID controller -- a high quality controller with battery-backup
(BBU) cache, configured for write-back, might allow you to change
all these to safe settings.  If you also switch to a RAID
configuration with some redundancy, you'll be much safer....

-Kevin


Re: performance config help

From
Bob Dusek
Date:


RAID-0

And how many drives?

Just two.

We have an application server that is processing requests.  Each request consists of a combination of selects, inserts, and deletes.  We actually see degradation when we get more than 40 concurrent requests.  The exact number of queries executed by each request isn't known.  It varies per request.  But, an example request would be about 16 inserts and 113 selects.  Any given request can't execute more than a single query at a time.

So, you are concurrently trying to achieve more than around 640 writes and 4520 reads (worst case figures...). This should be interesting. For 40 concurrent requests you will probably need at least 4 drives in RAID-0 to sustain the write rates (and I'll guess 5-6 to be sure to cover read requests also, together with plenty of RAM).

 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
         34.29    0.00    7.09    0.03    0.00   58.58

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00   112.20  0.00 133.40     0.00  1964.80    14.73     0.42    3.17   0.04   0.48

The iowait seems pretty low, doesn't it?

Yes, but you are issuing 133 write operations per second per drive - this is nearly the limit of what you can get with 15k RPM drives (actually, the limit should be somewhere around 200..250 IOPS but 133 isn't that far).

 
As you mentioned in a separate response, we have fsync shut off.  Regardless, we shut off a lot of logging in our app and reduced that number to approx 20 per second.  So, a lot of those writes were coming from outside the db.  We do a lot of logging.  We should consider turning some off, it seems.


top - 10:11:43 up 12 days, 20:48,  6 users,  load average: 10.48, 4.16, 2.83
Tasks: 798 total,   8 running, 790 sleeping,   0 stopped,   0 zombie
Cpu0  : 33.3%us,  7.6%sy,  0.0%ni, 59.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

There is one other possibility - since the CPUs are not very loaded, are you sure the client application with which you are testing is fast enough to issue enough requests to saturate the database?

Each of the 256 requests was being processed by a php process.  So, it could certainly be faster.  But, the fact that we're seeing the db performance degrade would seem to indicate that our application is fast enough to punish the db.  Isn't that true? 

 



Re: performance config help

From
Scott Marlowe
Date:
On Mon, Jan 11, 2010 at 10:54 AM, Bob Dusek <redusek@gmail.com> wrote:
>> You want to use some connection pooling which queues requests when
>> more than some configurable number of connections is already active
>> with a request.  You probably want to run that on the server side.
>> As for the postgresql.conf, could you show what you have right now,
>> excluding all comments?
>
> The problem with connection pooling is that we actually have to achieve more
> than 40 per second, which happens to be the sweet spot with our current
> config.

Number of parallel processes doesn't equal # reqs/second.  If your
maximum throughput occurs at 40 parallel requests, you'll get more
done reducing the maximum number of concurrent processes to 40 and
letting them stack up in a queue waiting for a spot to run.

Re: performance config help

From
Bob Dusek
Date:
On Mon, Jan 11, 2010 at 1:20 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
Bob Dusek <redusek@gmail.com> wrote:
> Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
>> Bob Dusek <redusek@gmail.com> wrote:

>> Anyway, my benchmarks tend to show that best throughput occurs at
>> about (CPU_count * 2) plus effective_spindle_count.  Since you
>> seem to be fully cached, effective_spindle_count would be zero,
>> so I would expect performance to start to degrade when you have
>> more than about 32 sessions active.
>>
> That's a little disheartening for a single or dual CPU system.

Not at all.  You only have so many resources to keep busy at any one
moment.  It is generally more efficient to only context switch
between as many processes as can keep those resources relatively
busy; otherwise valuable resources are spent switching among the
various processes rather than doing useful work.

[Regular readers of this list might want to skip ahead while I run
through my usual "thought experiment" on the topic.  ;-) ]

Imagine this hypothetical environment -- you have one CPU running
requests.  There are no other resources to worry about and no
latency to the clients.  Let's say that the requests take one second
each.  The client suddenly has 100 requests to run.  Assuming
context switching is free, you could submit all at once, and 100
seconds later, you get 100 responses, with an average response time
of 100 seconds.  Let's put a (again free) connection pooler in
there.  You submit those 100 requests, but they are fed to the
database one at a time.  You get one response back in one second,
the next in two seconds, the last in 100 seconds.  No request took
any longer, and the average response time was 50.5 seconds -- almost
a 50% reduction.

Now context switching is not free, and you had tens of thousands of
them per second.  Besides the hit on CPU availability during each
switch, you're reducing the value of the L1 and L2 caches.  So in
reality, you could expect your "request storm" to perform
significantly worse in comparison to the connection pooled
configuration.  In reality, you have more than one resource to keep
busy, so the pool should be sized greater than one; but it's still
true that there is some point at which getting a request to the
database server delays the response to that request more than
queuing it for later execution would.  Some database products build
in a way to manage this; in PostgreSQL it's on you to do so.

I appreciate the explanation.  We were thinking that since we have so much CPU available, we weren't hitting Postgres' peak and that maybe a config change would help.  But, thus far, it sounds like we're hardware-bound, and an application connection pool seems inevitable. 

>> Your vmstat output suggests that context switches are becoming a
>> problem, and I wouldn't be surprised if I heard that the network
>> is an issue.  You might want to have someone take a look at the
>> network side to check.
>>
> This is all happening on a LAN, and network throughput doesn't
> seem to be an issue.  It may be a busy network, but I'm not sure
> about a problem.  Can you elaborate on your suspicion, based on
> the vmstat?  I haven't used vmstat much.

It was simply this: all that CPU idle time while it was swamped with
requests suggests that there might be a bottleneck outside the
database server.  That could be, as another post suggests, the
client software.  It could also be the network.  (It could also be
contention on locks within PostgreSQL from the large number of
requests, but that's covered by the connection pooling suggestion.)

I'm curious if it would be worth our effort to enable the pg_stat stuff and try to analyze the system that way.  We don't have a lot of experience with that, but if we could learn something critical from it, we will do it.
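
One option along those lines: 8.4 ships pg_stat_statements as a contrib
module (loaded via shared_preload_libraries plus the contrib SQL install
script).  A sketch of the kind of query it enables, assuming the module
is installed:

-- top ten statements by total runtime
SELECT calls, total_time, rows, query
  FROM pg_stat_statements
 ORDER BY total_time DESC
 LIMIT 10;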
 
> The problem with connection pooling is that we actually have to
> achieve more than 40 per second, which happens to be the sweet
> spot with our current config.

Well, if you're considering a connection pool which can only submit
one request per second, you're looking at the wrong technology.  We
use a custom connection pool built into our software, so I'm not
very familiar with the "drop in" packages out there, but we start
the next queued request based on the completion of a request --
there's no polling involved.

I'm thinking we'll have to roll our own.  In a way, we have already done the connection pooling.  We're experimenting with a new architecture with much more demanding performance requirements.  We were emboldened by the hardware specs. 


Between the RAID 0, fsync = off, and full_page_writes = off -- you
really had better not be staking anything important on this data.
This configuration would make The Flying Wallendas break out in a
sweat.  It suggests to me that you might want to look into a better
RAID controller -- a high quality controller with battery-backup
(BBU) cache, configured for write-back, might allow you to change
all these to safe settings.  If you also switch to a RAID
configuration with some redundancy, you'll be much safer....

Yeah :)  We haven't run into much trouble.  But, we cut our teeth doing performance analysis of our app using PG 7.4.  And, people on this list seem to be adamantly against this config these days.  Is this safer in older versions of PG?  Or, are the risks the same? 

We have some asynchronous communications processes that communicate permanent db changes to an enterprise-level data warehouse.  And, we can recover that data back down to the server if the server goes belly-up.  If something does go belly up, we really only lose the bit of data that hasn't been communicated yet.  It's true that this data is important.  However, it's also true that it's very costly to guarantee that this very small amount of data isn't lost.  And, practically speaking (for our purposes), it seems that the data's not worth the cost.

-Kevin


Re: performance config help

From
Scott Marlowe
Date:
On Mon, Jan 11, 2010 at 11:20 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Bob Dusek <redusek@gmail.com> wrote:
>> Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
>>> Bob Dusek <redusek@gmail.com> wrote:
>
>>> Anyway, my benchmarks tend to show that best throughput occurs at
>>> about (CPU_count * 2) plus effective_spindle_count.  Since you
>>> seem to be fully cached, effective_spindle_count would be zero,
>>> so I would expect performance to start to degrade when you have
>>> more than about 32 sessions active.
>>>
>> That's a little disheartening for a single or dual CPU system.
>
> Not at all.  You only have so many resources to keep busy at any one
> moment.  It is generally more efficient to only context switch
> between as many processes as can keep those resources relatively
> busy; otherwise valuable resources are spent switching among the
> various processes rather than doing useful work.
>
> [Regular readers of this list might want to skip ahead while I run
> through my usual "thought experiment" on the topic.  ;-) ]
>
> Imagine this hypothetical environment -- you have one CPU running
> requests.  There are no other resources to worry about and no
> latency to the clients.  Let's say that the requests take one second
> each.  The client suddenly has 100 requests to run.  Assuming
> context switching is free, you could submit all at once, and 100
> seconds later, you get 100 responses, with an average response time
> of 100 seconds.  Let's put a (again free) connection pooler in
> there.  You submit those 100 requests, but they are fed to the
> database one at a time.  You get one response back in one second,
> the next in two seconds, the last in 100 seconds.  No request took
> any longer, and the average response time was 50.5 seconds -- almost
> a 50% reduction.
>
> Now context switching is not free, and you had tens of thousands of
> them per second.

FYI, on an 8 or 16 core machine, 10k to 30k context switches per
second aren't that much.  If you're climbing past 100k you might want
to look out.

The more I read up on the 74xx CPUs and look at the numbers here the
more I think it's just that this machine has X bandwidth and it's
using it all up.  You could put 1,000 cores in it, and it wouldn't go
any faster.  My guess is that a 4x6 core AMD machine or even a 2x6
Nehalem would be much faster at this job.  Only way to tell is to run
something like the stream benchmark and see how it scales,
memory-wise, as you add cores to the benchmark.

Re: performance config help

From
Scott Marlowe
Date:
On Mon, Jan 11, 2010 at 12:36 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> FYI, on an 8 or 16 core machine, 10k to 30k context switches per
> second aren't that much.  If you're climbing past 100k you might want
> to look out.
>
> The more I read up on the 74xx CPUs and look at the numbers here the
> more I think it's just that this machine has X bandwidth and it's
> using it all up.  You could put 1,000 cores in it, and it wouldn't go
> any faster.  My guess is that a 4x6 core AMD machine or even a 2x6
> Nehalem would be much faster at this job.  Only way to tell is to run
> something like the stream benchmark and see how it scales,
> memory-wise, as you add cores to the benchmark.

Also I'm guessing that query profiling may help, if we can get the
queries to request less data to trundle through then we might be able
to get Bob the performance needed to keep up.

But at some point he's gonna have to look at partitioning his database
onto multiple machines somehow.

Re: performance config help

From
Scott Marlowe
Date:
On Mon, Jan 11, 2010 at 12:34 PM, Bob Dusek <redusek@gmail.com> wrote:
> Yeah :)  We haven't run into much trouble.  But, we cut our teeth doing
> performance analysis of our app using PG 7.4.  And, people on this list seem
> to be adamantly against this config these days.  Is this safer in older
> versions of PG?  Or, are the risks the same?

It's always been unsafe.  Just that 7.4 was so slow that sometimes you
didn't really get to choose.

> We have some asynchronous communications processes that communicate
> permanent db changes to an enterprise-level data warehouse.  And, we can
> recover that data back down to the server if the server goes belly-up.  If
> something does go belly up, we really only lose the bit of data that hasn't
> been communicated yet.  It's true, that this data is important.  However,
> it's also true that it's very costly to guarantee that this very small
> amount of data isn't lost.  And, practically speaking (for our purposes) it
> seems that the data's not worth the cost.

I have slave dbs running on four 7200RPM SATA drives with fsync off.
They only get updated from the master db so if they go boom, I just
recreate their node.  There's times fsync off is ok, you just have to
know that that db is now considered "disposable".

However, I'd suggest doing some benchmarking to PROVE that you're
seeing an improvement from fsync being off.  If there's no
improvement, then you might as well leave it on and save yourself some
headache later on when the machine gets powered off suddenly etc.
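
A quick way to run that comparison is contrib/pgbench (a sketch; scale,
client count, and duration are arbitrary):

# initialize a throwaway test database
createdb pgbenchtest
pgbench -i -s 10 pgbenchtest

# run once with fsync = on and once with fsync = off (restart in
# between); compare the reported tps figures
pgbench -c 16 -T 60 pgbenchtest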

Re: performance config help

From
"Dusek, Bob"
Date:
> I have slave dbs running on four 7200RPM SATA drives with fsync off.
> They only get updated from the master db so if they go boom, I just
> recreate their node.  There's times fsync off is ok, you just have to
> know that that db is now considered "disposable".
>
> However, I'd suggest doing some benchmarking to PROVE that you're
> seeing an improvement from fsync being off.  If there's no
> improvement, then you might as well leave it on and save yourself some
> headache later on when the machine gets powered off suddenly etc.

I haven't been involved in any benchmarking of PG 8 with fsync=off, but we certainly did it with PG 7.4.  fsync=off, for
our purposes, was MUCH faster.


Re: performance config help

From
Scott Marlowe
Date:
On Mon, Jan 11, 2010 at 1:13 PM, Dusek, Bob <rd185032@ncr.com> wrote:
> I haven't been involved in any benchmarking of PG8 with fsync=off, but we certainly did it with PG 7.4.  fsync=off, for our purposes, was MUCH faster.

And many changes have been made since then to make fsyncing much
faster.  You may be grinding the valves on a 2009 Ferrari because
pappy used to have to do it on his 1958 pickup truck here.

Re: performance config help

From
Tom Lane
Date:
Scott Marlowe <scott.marlowe@gmail.com> writes:
> On Mon, Jan 11, 2010 at 1:13 PM, Dusek, Bob <rd185032@ncr.com> wrote:
>> I haven't been involved in any benchmarking of PG8 with fsync=off, but we certainly did it with PG 7.4.  fsync=off, for our purposes, was MUCH faster.

> And many changes have been made since then to make fsyncing much
> faster.  You may be grinding the valves on a 2009 Ferrari because
> pappy used to have to do it on his 1958 pickup truck here.

Perhaps more to the point, synchronous_commit can get most of the same
speedup with much less risk to your database.  You really owe it to
yourself to redo that benchmarking with a recent PG release.
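
For example (a sketch; with synchronous_commit off a crash can lose up
to about three times wal_writer_delay worth of committed transactions,
but the database itself stays consistent):

  fsync = on
  synchronous_commit = off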

            regards, tom lane

Re: performance config help

From
"Kevin Grittner"
Date:
Scott Marlowe <scott.marlowe@gmail.com> wrote:

> FYI, on an 8 or 16 core machine, 10k to 30k context switches per
> second aren't that much.

Yeah, on our 16 core machines under heavy load we hover around 30k.
He was around 50k, which is why I said it looked like it was
"becoming a problem."

> If you're climbing past 100k you might want to look out.

We hit that at one point; cutting our connection pool size brought
it down and improved performance dramatically.  I don't think I'd
wait for 100k to address it next time.

> The more I read up on the 74xx CPUs and look at the numbers here
> the more I think it's just that this machine has X bandwidth and
> it's using it all up.  You could put 1,000 cores in it, and it
> wouldn't go any faster.  My guess is that a 4x6 core AMD machine
> or even a 2x6 Nehalem would be much faster at this job.  Only way
> to tell is to run something like the stream benchmark and see how
> it scales, memory-wise, as you add cores to the benchmark.

I haven't been keeping up on the hardware, so I defer to you on
that.  It certainly seems like it would fit with the symptoms. On
the other hand, I haven't seen anything yet to convince me that it
*couldn't* be a client-side or network bottleneck, or the sort of
lock contention bottleneck that showed up in some of Sun's
benchmarks.  If it were my problem, I'd be trying to rule out
whichever one of those could be tested most easily, iteratively.

Also, as you suggested, identifying what queries are taking most of
the time and trying to optimize them is a route that might help,
regardless.

-Kevin

Re: performance config help

From
Bob Dusek
Date:

> I haven't been keeping up on the hardware, so I defer to you on
> that.  It certainly seems like it would fit with the symptoms. On
> the other hand, I haven't seen anything yet to convince me that it
> *couldn't* be a client-side or network bottleneck, or the sort of
> lock contention bottleneck that showed up in some of Sun's
> benchmarks.  If it were my problem, I'd be trying to rule out
> whichever one of those could be tested most easily, iteratively.

How do I learn more about the actual lock contention in my db?   Lock contention makes some sense.  Each of the 256 requests are relatively similar.  So, I don't doubt that lock contention could be an issue.  I just don't know how to observe it or correct it.  It seems like if we have processes that are contending for locks, there's not much we can do about it.

> Also, as you suggested, identifying what queries are taking most of
> the time and trying to optimize them is a route that might help,
> regardless.

We often undertake query optimization.  And, we often learn things about our app or make small performance gains from it.  Sometimes, we are even able to make big changes to the application to make large gains based on how we see queries performing.

So, I agree that it's a good thing.  However, query optimizing is tough, since you can't necessarily predict the sizes of your tables in a real-time system that is used differently by different users.


-Kevin

Re: performance config help

From
"Kevin Grittner"
Date:
Bob Dusek <redusek@gmail.com> wrote:

> How do I learn more about the actual lock contention in my db?
> Lock contention makes some sense.  Each of the 256 requests are
> relatively similar.  So, I don't doubt that lock contention could
> be an issue.  I just don't know how to observe it or correct it.
> It seems like if we have processes that are contending for locks,
> there's not much we can do about it.

I'm not sure what the best way would be to measure it, but in prior
discussions the general mood seemed to be that if you had so many
active sessions that you were running into the issue, the best
solution was to use a connection pool to avoid it.

-Kevin

Re: performance config help

From
Greg Smith
Date:
Scott Marlowe wrote:
> So, I took a break from writing and searched for some more info on the
> 74xx series CPUs, and from reading lots of articles, including this
> one:
> http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3414
> It seems apparent that the 74xx series is a great CPU, as long as
> you're not memory bound.
>

This is why I regularly end up recommending people consider Intel's
designs here so often, the ones that use triple-channel DDR3, instead of
any of the AMD ones.  It's extremely easy to end up with a memory-bound
workload nowadays, at which point all the CPU power in the world doesn't
help you anymore.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: performance config help

From
"Dusek, Bob"
Date:
> > How do I learn more about the actual lock contention in my db?
> > Lock contention makes some sense.  Each of the 256 requests are
> > relatively similar.  So, I don't doubt that lock contention could
> > be an issue.  I just don't know how to observe it or correct it.
> > It seems like if we have processes that are contending for locks,
> > there's not much we can do about it.
>
> I'm not sure what the best way would be to measure it, but in prior
> discussions the general mood seemed to be that if you had so many
> active sessions that you were running into the issue, the best
> solution was to use a connection pool to avoid it.

Sorry.. by "not much we can do about it", I meant from a query perspective.  I mean, we can't use locking hints or anything like that in Postgres that I know of.

I do understand that the connection pool will help this.
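
(From what I've read, the closest things Postgres has are explicit
LOCK TABLE and SELECT ... FOR UPDATE, but those add locking rather
than relax it, so they wouldn't help us here.)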


>
> -Kevin

Re: performance config help

From
Greg Smith
Date:
Bob Dusek wrote:
>
> How do I learn more about the actual lock contention in my db?

There's a page with a sample query and links to more info at
http://wiki.postgresql.org/wiki/Lock_Monitoring
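
The heart of it is just looking for ungranted locks, something along
these lines (a simplified version of the wiki's query):

  SELECT locktype, relation::regclass, mode, pid
  FROM pg_locks
  WHERE NOT granted;

If that regularly returns rows under load, you have real lock waits to
chase down.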

One other little thing:  when you're running "top", try using "top -c"
instead.  That should show you exactly what all the postmaster backends
are actually doing, which is really handy to sort out issues in this area.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: performance config help

From
Scott Marlowe
Date:
On Mon, Jan 11, 2010 at 3:04 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Scott Marlowe wrote:
>>
>> So, I took a break from writing and searched for some more info on the
>> 74xx series CPUs, and from reading lots of articles, including this
>> one:
>> http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3414
>> It seems apparent that the 74xx series is a great CPU, as long as
>> you're not memory bound.
>>
>
> This is why I regularly end up recommending people consider Intel's designs
> here so often, the ones that use triple-channel DDR3, instead of any of the
> AMD ones.  It's extremely easy to end up with a memory-bound workload
> nowadays, at which point all the CPU power in the world doesn't help you
> anymore.

The DDR3 Nehalem and DDR2 AMD are both actually pretty close in real
world use on 4 or more socket machines.  Most benchmarks on memory
bandwidth give no huge advantage to either one or the other.  They
both max out at about 25GB/s.

It's the older 74xx Xeons, with no integrated memory controllers, that
seem to have such horrible bandwidth, because their chipsets aren't
multi-channel.

For dual socket the Nehalem is pretty much the king.  By the time you
get to 8 sockets AMD is still ahead.  Just avoid anything older than
Nehalem or Istanbul.

Re: performance config help

From
Greg Smith
Date:
Scott Marlowe wrote:
> The DDR3 Nehalem and DDR2 AMD are both actually pretty close in real
> world use on 4 or more socket machines.  Most benchmarks on memory
> bandwidth give no huge advantage to either one or the other.  They
> both max out at about 25GB/s.
>
The most fair comparison I've seen so far is
http://www.advancedclustering.com/company-blog/stream-benchmarking.html
which puts the faster Intel solutions at 37GB/s, while the Opterons bog
down at 20GB/s.  That matches my own tests pretty well too--Intel's got
at least a 50% lead here in many cases.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: performance config help

From
Scott Marlowe
Date:
On Mon, Jan 11, 2010 at 4:17 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Scott Marlowe wrote:
>>
>> The DDR3 Nehalem and DDR2 AMD are both actually pretty close in real
>> world use on 4 or more socket machines.  Most benchmarks on memory
>> bandwidth give no huge advantage to either one or the other.  They
>> both max out at about 25GB/s.
>>
>
> The most fair comparison I've seen so far is
> http://www.advancedclustering.com/company-blog/stream-benchmarking.html
> which puts the faster Intel solutions at 37GB/s, while the Opterons bog down
> at 20GB/s.  That matches my own tests pretty well too--Intel's got at least
> a 50% lead here in many cases.

But that's with only 2 sockets.  I'd like to see something comparing 4
or 8 socket machines.  Hmmm, off to googol.

Re: performance config help

From
Pierre Frédéric Caillaud
Date:
> Each of the 256 requests was being processed by a php process.  So, it
> could
> certainly be faster.  But, the fact that we're seeing the db performance
> degrade would seem to indicate that our application is fast enough to
> punish
> the db.  Isn't that true?

    Not necessarily. Your DB still has lots of idle CPU, so perhaps it's your
client that is the bottleneck. Or you have locking problems in your
DB.
    Things to test:

    - vmstat on the benchmark client
    - iptraf on the network link
    - monitor ping times between client and server during load test

    Some time ago, I made a benchmark simulating a forum. Postgres was
saturating the gigabit ethernet between server and client...

    If those PHP processes run inside Apache, I'd suggest switching to
lighttpd/fastcgi, which has higher performance, and uses a limited,
controllable set of PHP processes (and therefore DB connections), which in
turn uses much less memory.

    PS: try these settings:

fsync = on
wal_sync_method = fdatasync
wal_buffers = 64MB
wal_writer_delay = 2ms
synchronous_commit = off (risking at most about the last second of commits)

Re: performance config help

From
Matthew Wakeling
Date:
On Mon, 11 Jan 2010, Bob Dusek wrote:
> How do I learn more about the actual lock contention in my db?   Lock contention makes
> some sense.  Each of the 256 requests are relatively similar.  So, I don't doubt that
> lock contention could be an issue.  I just don't know how to observe it or correct it. 
> It seems like if we have processes that are contending for locks, there's not much we can
> do about it. 

To me:

1. This doesn't look like an IO bandwidth issue, as the database is small.
2. This doesn't look like a CPU speed issue, as usage is low.
3. This doesn't look like a memory bandwidth issue, as that would count as
    CPU active in top.
4. This doesn't look like a memory size problem either.

So, what's left? It could be a network bandwidth problem, if your client
is on a separate server. You haven't really given much detail about the
nature of the queries, so it is difficult for us to tell if the size of
the results means that you are maxing out your network. However, it
doesn't sound like it is likely to me that this is the problem.

It could be a client bottleneck problem - maybe your server is performing
really well, but your client can't keep up. You may be able to determine
this by switching on logging of long-running queries in Postgres, and
comparing that with what your client says. Also, look at the resource
usage on the client machine.

It could be a lock contention problem. To me, this feels like the most
likely. You say that the queries are similar. If you are reading and
writing from a small set of the same objects in each of the transactions,
then you will suffer badly from lock contention, as only one backend can
be in a read-modify-write cycle on a given object at a time. We don't know
enough about the data and queries to say whether this is the case.
However, if you have a common object that every request touches (like a
status line or something), then try to engineer that out of the system.

Hope this helps. Synchronising forty processes around accessing a single
object for high performance is really hard, and Postgres does it
incredibly well, but avoiding the contention completely, if possible, is
still by far the best option.

>       -Kevin

It'd really help us reading your emails if you could make sure that it is
easy to distinguish your words from words you are quoting. It can be very
confusing reading some of your emails, trying to remember which bits I
have seen before written by someone else. This is one of the few lines
that I know you didn't write - you're a Bob, not a Kevin. A few ">"
characters at the beginning of lines, which most mail readers will add
automatically, make all the difference.

Matthew

--
 Me... a skeptic?  I trust you have proof?

Re: performance config help

From
"Kevin Grittner"
Date:
Matthew Wakeling <matthew@flymine.org> wrote:

>>       -Kevin
>
> It'd really help us reading your emails if you could make sure
> that it is easy to distinguish your words from words you are
> quoting. It can be very confusing reading some of your emails,
> trying to remember which bits I have seen before written by
> someone else. This is one of the few lines that I know you didn't
> write - you're a Bob, not a Kevin. A few ">" characters at the
> beginning of lines, which most mail readers will add
> automatically, make all the difference.

That took me by surprise, because outside of that one line, where
Bob apparently lost the leading character, I've been seeing his
messages properly quoted.  I went back and looked at Bob's old
messages and found that he's sending them in multiple mime formats,
text/plain with the '>' characters and the following:

--0016e6d77e63233088047ce8a128
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

I hadn't noticed, because I have my email reader set up to default
to text format if available.  Your reader must be looking at the
html format and not handling this stuff:

<blockquote class=3D"gmail_quote" style=3D"border-left: 1px solid=
 rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left:
1ex;"><div><d=
iv class=3D"h5">

You might want to adjust your reader.

Bob, you might want to just send plain text, to avoid such problems.

-Kevin


Re: performance config help

From
Bob Dusek
Date:

On Tue, Jan 12, 2010 at 12:12 PM, Matthew Wakeling <matthew@flymine.org> wrote:
> On Mon, 11 Jan 2010, Bob Dusek wrote:
>> How do I learn more about the actual lock contention in my db?   Lock
>> contention makes some sense.  Each of the 256 requests are relatively
>> similar.  So, I don't doubt that lock contention could be an issue.  I
>> just don't know how to observe it or correct it.  It seems like if we
>> have processes that are contending for locks, there's not much we can
>> do about it.
>
> To me:
>
> 1. This doesn't look like an IO bandwidth issue, as the database is small.
> 2. This doesn't look like a CPU speed issue, as usage is low.
> 3. This doesn't look like a memory bandwidth issue, as that would count as
>    CPU active in top.
> 4. This doesn't look like a memory size problem either.
>
> So, what's left? It could be a network bandwidth problem, if your client
> is on a separate server. You haven't really given much detail about the
> nature of the queries, so it is difficult for us to tell if the size of
> the results means that you are maxing out your network. However, it
> doesn't sound like it is likely to me that this is the problem.

The connections to postgres are happening on the localhost.  Our application server accepts connections from the network, and the application queries Postgres using a standard pg_pconnect on the localhost.

> It could be a client bottleneck problem - maybe your server is performing
> really well, but your client can't keep up. You may be able to determine
> this by switching on logging of long-running queries in Postgres, and
> comparing that with what your client says. Also, look at the resource
> usage on the client machine.

We've been logging long-running queries (200 ms).  That's how we know Postgres is degrading.  We don't see any queries showing up when we have 40 clients running.  But, we start seeing quite a bit show up after that.

> It could be a lock contention problem. To me, this feels like the most
> likely. You say that the queries are similar. If you are reading and
> writing from a small set of the same objects in each of the transactions,
> then you will suffer badly from lock contention, as only one backend can
> be in a read-modify-write cycle on a given object at a time. We don't know
> enough about the data and queries to say whether this is the case.
> However, if you have a common object that every request touches (like a
> status line or something), then try to engineer that out of the system.
>
> Hope this helps. Synchronising forty processes around accessing a single
> object for high performance is really hard, and Postgres does it
> incredibly well, but avoiding the contention completely, if possible, is
> still by far the best option.

Each of the concurrent clients does a series of selects, inserts, updates, and deletes.  The requests would generally never update or delete the same rows in a table.  However, the requests do generally write to the same tables.  And, they are all reading from the same tables that they're writing to.  For the inserts, I imagine they are blocking on access to the sequence that controls the primary keys for the insert tables.

But, I'm not sure about locking beyond that.  When we delete from the tables, we generally delete where "clientid=X", which deletes all of the rows that a particular client inserted (each client cleans up its own rows after it finishes what it's doing).  Would that be blocking inserts on that table for other clients?

>>       -Kevin
>
> It'd really help us reading your emails if you could make sure that it is
> easy to distinguish your words from words you are quoting. It can be very
> confusing reading some of your emails, trying to remember which bits I
> have seen before written by someone else. This is one of the few lines
> that I know you didn't write - you're a Bob, not a Kevin. A few ">"
> characters at the beginning of lines, which most mail readers will add
> automatically, make all the difference.

I'm really sorry.  I'm using gmail's interface.   I just saw the "<< Plain Text" formatter at the top of this compose message.  But, if I convert it to Plain Text now, I may lose my portion of the message.  I'll use Plain Text when posting future messages.

Sorry for the hassle.

> Matthew
>
> --
>  Me... a skeptic?  I trust you have proof?

Re: performance config help

From
Bob Dusek
Date:
> Bob, you might want to just send plain text, to avoid such problems.

Will do.  Looks like gmail's interface does it nicely.

>
> -Kevin

Re: performance config help

From
Matthew Wakeling
Date:
On Tue, 12 Jan 2010, Bob Dusek wrote:
> Each of the concurrent clients does a series of selects, inserts, updates,
> and deletes.  The requests would generally never update or delete the same
> rows in a table.  However, the requests do generally write to the same
> tables.  And, they are all reading from the same tables that they're writing
> to.  For the inserts, I imagine they are blocking on access to the sequence
> that controls the primary keys for the insert tables.

I'm going to have to bow out at this stage, and let someone else who knows
more about what gets locked in a transaction help instead. The sequence
may be significant, but I would have thought it would have to be something
a bit bigger that is gumming up the works.

> I'm really sorry.  I'm using gmail's interface.

Actually, you weren't doing anything particularly wrong as it turns out.
It is partly a case of alpine being too clever for its own good, just as
Kevin pointed out. My mail reader is taking the "most preferred" mime
alternative, which is the HTML version, and interpreting it to its best
ability, which isn't very well. It is the email that says which
alternative is preferred, by the way. I have just forced alpine to view
the plain text version instead, and it is much better.

> I just saw the "<< Plain Text" formatter at the top of this compose
> message.  But, if I convert it to Plain Text now, I may lose my portion
> of the message.  I'll use the Plain Text when posting future messages.

To be honest, that's always a good idea, although you didn't actually do
anything wrong. I do know people whose spam filters immediately discard
emails that contain an HTML alternative - that's taking it to the extreme!

Matthew

--
 Beware of bugs in the above code; I have only proved it correct, not
 tried it.                                               --Donald Knuth

Re: performance config help

From
Craig Ringer
Date:
On 13/01/2010 2:01 AM, Bob Dusek wrote:

> The connections to postgres are happening on the localhost.  Our
> application server accepts connections from the network, and the
> application queries Postgres using a standard pg_pconnect on the localhost.

Well, that's a good reason to have all those CPUs - if your app server
runs on the same host as Pg.

> We've been logging long-running queries (200 ms).  That's how we know
> Postgres is degrading.  We don't see any queries showing up when we have
> 40 clients running.  But, we start seeing quite a bit show up after that.

It might be informative to see how fast query times are increasing with
client count. You can probably figure this out by progressively lowering
your query time logging threshold.
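
That's a one-line change each time, e.g. (a sketch):

  log_min_duration_statement = 100   # in milliseconds, down from your 200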

> Each of the concurrent clients does a series of selects, inserts,
> updates, and deletes.  The requests would generally never update or
> delete the same rows in a table.  However, the requests do generally
> write to the same tables.  And, they are all reading from the same
> tables that they're writing to.

AFAIK none of that should interfere with each other, so long as they're
not working with the same sets of tuples.

> For the inserts, I imagine they are
> blocking on access to the sequence that controls the primary keys for
> the insert tables.

I doubt it. Sequences are outside transactional rules for that reason.
It takes an incredibly short time for nextval(...) to obtain the next
value for the sequence, and after that the sequence is unlocked and
ready for the next use.
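
(If sequence contention ever did show up, something like

  ALTER SEQUENCE some_table_id_seq CACHE 32;

-- with your real sequence name, of course -- would let each backend
preallocate a batch of values, at the cost of gaps in the key space.)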

> But, I'm not sure about locking beyond that.  When we delete from the
> tables, we generally delete where "clientid=X", which deletes all of the
> rows that a particular client inserted (each client cleans up its own
> rows after it finishes what it's doing).  Would that be blocking inserts
> on that table for other clients?

Now would be a good time to start watching pg_locks - rather than
guessing, try to actually observe if there's lock contention.

I'd also consider looking into a connection pool so that as the number
of clients to your appserver increases you can keep the number of active
Pg connections at the "sweet spot" for your server to maximise overall
throughput.

--
Craig Ringer

Re: performance config help

From
Bob Dusek
Date:
FYI - We have implemented a number of changes...

a) some query and application optimizations
b) connection pool (on the cheap: set max number of clients on
Postgres server and created a blocking wrapper to pg_pconnect that
will block until it gets a connection -- sketched below)
c) moved the application server to a separate box

And, we pretty much doubled our capacity... from approx 40 "requests"
per second to approx 80.
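
For reference, the wrapper is essentially this (a simplified, untested
sketch; the real code has error handling, and the function name is just
illustrative):

  function get_pooled_connection($conninfo) {
      // pg_pconnect() fails once Postgres hits its client limit,
      // so poll until a persistent connection slot frees up
      while (($conn = @pg_pconnect($conninfo)) === false) {
          usleep(500000); // retry every 1/2 second
      }
      return $conn;
  }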

The problem with our "cheap" connection pool is that the persistent
connections don't seem to be available immediately after they're
released by the previous process.   pg_close doesn't seem to help the
situation.  We understand that pg_close doesn't really close a
persistent connection, but we were hoping that it would cleanly
release it for another client to use.  Curious.

We've also tried third-party connection pools and they don't seem to
be real fast.

Thanks for all of your input.  We really appreciate it.

Bob

Re: performance config help

From
Scott Marlowe
Date:
On Wed, Jan 13, 2010 at 1:10 PM, Bob Dusek <redusek@gmail.com> wrote:
> And, we pretty much doubled our capacity... from approx 40 "requests"
> per second to approx 80.

Excellent!

> The problem with our "cheap" connection pool is that the persistent
> connections don't seem to be available immediately after they're
> released by the previous process.   pg_close doesn't seem to help the
> situation.  We understand that pg_close doesn't really close a
> persistent connection, but we were hoping that it would cleanly
> release it for another client to use.  Curious.

Yeah, the persistent connects in php are kinda as dangerous as they
are useful..  Have you tried using regular connects just to compare
performance?  On Linux they're not too bad, but on Windows (the pg
server that is) it's pretty horrible performance-wise.

> We've also tried third-party connection pools and they don't seem to
> be real fast.

What have you tried?  Would pgbouncer work for you?

Re: performance config help

From
Bob Dusek
Date:
>> The problem with our "cheap" connection pool is that the persistent
>> connections don't seem to be available immediately after they're
>> released by the previous process.   pg_close doesn't seem to help the
>> situation.  We understand that pg_close doesn't really close a
>> persistent connection, but we were hoping that it would cleanly
>> release it for another client to use.  Curious.
>
> Yeah, the persistent connects in php are kinda as dangerous as they
> are useful..  Have you tried using regular connects just to compare
> performance?  On Linux they're not too bad, but on Windows (the pg
> server that is) it's pretty horrible performance-wise.

Yes we have.  Regular connections are pretty slow, even when our
application server is on the same box as the db server.

>> We've also tried third-party connection pools and they don't seem to
>> be real fast.
>
> What have you tried?  Would pgbouncer work for you?

We've tried pgbouncer.  It's pretty good.

Here are more details on what we're running:

We have three servers: A, B, and C.  All of them are on the same rack,
sharing a gb switch.

We have a test application (Apache bench) running on A.  The test app
sends 5000 requests to our application server.  We can control how
many requests it will send concurrently.  For purposes of explanation,
I'll refer to the test app's concurrency parameter as TCON.

The application server is (now) running on B.  It's basically Apache
with the PHP5 module.

And, good ol' Postgres is running on C.

We have two basic configurations.

The first configuration is with the application server using the
"cheap" connection pooling.  Like I said before, we do this by
configuring Postgres to only allow 40 clients, and the application
server uses a pconnect wrapper that blocks until it gets a db
connection (I guess you'd call this a "polling connection pool").  We
have run the first configuration using persistent and non-persistent
connections.

When we run it with persistent connections using a TCON of 40, Apache
Bench tells us that we are processing ~100 requests per second and our
CPU utilization is up to about 80%.

When we run it with non-persistent connections using the same TCON, we
process ~30 requests per second, and our CPU utilization is at
about 30% (sort of a surprising drop).

If we change TCON to 200 using the persistent connection
configuration, we're only able to process ~23 per second.  It seems
like lots of failing connections from our pconnect wrapper are killing
db performance.

The second configuration is with pgBouncer.  We configure pgBouncer to
run on the same server as Postgres, and we configure it to allow an
infinite number of incoming connections and only 40 connections to the
actual Postgres db.  We change the Postgres configuration to allow up
to 60 clients (just to set it higher than what pgBouncer should be
using).  Using this configuration, with TCON set to any number >= 40,
we can process ~83 requests per second.
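
The pgbouncer.ini for that boils down to something like this (a
sketch; "app" stands in for our real database name, and the auth
settings are elided):

  [databases]
  app = host=127.0.0.1 dbname=app

  [pgbouncer]
  listen_addr = 127.0.0.1
  listen_port = 6432
  pool_mode = session
  max_client_conn = 10000
  default_pool_size = 40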

So, pgBouncer is pretty good.  It doesn't appear to be as good as
limiting TCON and using pconnect, but since we can't limit TCON in a
production environment, we may not have a choice.

Does anyone know why failed db connections would have such a drastic
performance hit on the system?  I suppose it matters how many
connections were being attempted.  Maybe we're killing ourselves with a
denial of service attack of sorts (hence, polling is bad).  But, I'm
told that we were only checking every 1/2 second (so, basically 160
processes attempting to connect every 1/2 second).   I suppose 320
attempts per second could cause a lot of interrupts and context
switches.  I don't think we have the context switch numbers handy for
all of those runs.

Maybe we can get those numbers tomorrow.

Anyway, thanks again for your help thus far.

Re: performance config help

From
Craig Ringer
Date:
Bob Dusek wrote:

> So, pgBouncer is pretty good.  It doesn't appear to be as good as
> limiting TCON and using pconnect, but since we can't limit TCON in a
> production environment, we may not have a choice.

It may be worth looking into pgpool, as well. If you have a very
cheap-to-connect-to local pool you can use non-persistent connections
(for quick release) and the local pool takes care of maintaining and
sharing out the expensive-to-establish real connections to Pg itself.

If you find you still can't get the throughput you need, an alternative
to adding more hardware capacity and/or more server tuning is to look
into using memcached to satisfy many of the read requests for your app
server. Use some of that 16GB of RAM on the app server to populate a
memcached instance with less-frequently-changing data, and prefer to
fetch things from memcached rather than from Pg. With a bit of work on
data access indirection and on invalidating things in memcached when
they're changed in Pg, you can get truly insane boosts to performance
... and get more real work done in Pg by getting rid of repetitive
queries of relatively constant data.
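
In PHP that's a small read-through helper, something like this (a
sketch using the old Memcache extension; the names are illustrative,
and empty-result handling is ignored):

  function fetch_cached($mc, $db, $key, $sql) {
      // try the cache first; fall back to Postgres and repopulate
      $rows = $mc->get($key);
      if ($rows === false) {
          $rows = pg_fetch_all(pg_query($db, $sql));
          $mc->set($key, $rows, 0, 60); // cache for 60 seconds
      }
      return $rows;
  }

The hard part, as noted above, is invalidating $key whenever the
underlying rows change in Pg.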

--
Craig Ringer

Re: performance config help

From
Craig Ringer
Date:
Bob Dusek wrote:
>>> The problem with our "cheap" connection pool is that the persistent
>>> connections don't seem to be available immediately after they're
>>> released by the previous process.   pg_close doesn't seem to help the
>>> situation.  We understand that pg_close doesn't really close a
>>> persistent connection, but we were hoping that it would cleanly
>>> release it for another client to use.  Curious.
>> Yeah, the persistent connects in php are kinda as dangerous as they
>> are useful..  Have you tried using regular connects just to compare
>> performance?  On Linux they're not too bad, but on Windows (the pg
>> server that is) it's pretty horrible performance-wise.
>
> Yes we have.  Regular connections are pretty slow, even when our
> application server is on the same box as the db server.
>
>>> We've also tried third-party connection pools and they don't seem to
>>> be real fast.
>> What have you tried?  Would pgbouncer work for you?
>
> We've tried pgbouncer.  It's pretty good.

Oh, also look into mod_dbd.  With the threaded MPM it can apparently
provide excellent in-Apache connection pooling.

--
Craig Ringer


Re: performance config help

From
Dimitri Fontaine
Date:
Bob Dusek <redusek@gmail.com> writes:
> So, pgBouncer is pretty good.  It doesn't appear to be as good as
> limiting TCON and using pconnect, but since we can't limit TCON in a
> production environment, we may not have a choice.

You can still use pconnect() with pgbouncer, in transaction mode, if
your application is compatible with that (no advisory locks or other
session level tricks).
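
In pgbouncer.ini that is just:

  pool_mode = transaction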

Regards,
--
dim

Re: performance config help

From
Pierre Frédéric Caillaud
Date:
> So, pgBouncer is pretty good.  It doesn't appear to be as good as
> limiting TCON and using pconnect, but since we can't limit TCON in a
> production environment, we may not have a choice.

    Actually, you can: use lighttpd and php/fastcgi.

    Lighttpd handles the network stuff, and funnels/queues any number of
client connections into a limited number of PHP fastcgi processes. You can
configure this process pool to your tastes.

    Rather than instantiating 1 PHP interpreter (and 1 postgres connection)
per client connection, you can set it up for a max of N PHP procs. If PHP
waits a lot on IO (you use url fopen, that kind of thing) you can set
N=5..10 per core, but if you don't use that, N=2-3 per core is good. It
needs to be tuned to your application's needs.
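
    For example, with mod_fastcgi enabled, something like this in
lighttpd.conf (from memory, untested; paths will differ on your system):

fastcgi.server = ( ".php" => ((
    "bin-path" => "/usr/bin/php-cgi",
    "socket"   => "/tmp/php-fastcgi.socket",
    "max-procs" => 4,
    "bin-environment" => ( "PHP_FCGI_CHILDREN" => "8" )
)))

    That gives 4 * 8 = 32 PHP workers in total.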

    The idea is that if you've got enough processes to keep your CPU busy,
adding more will just fill your RAM, trash your CPU cache, add more
context switches, and generally lower your total throughput. Same is true
for Postgres, too.

    I've switched from apache to lighttpd on a rather busy community site and
the difference in performance and memory usage was quite noticeable.
Also, this site used MySQL (argh) so the occasional locking on some MyISAM
tables would become really itchy unless the number of concurrent processes
was kept to a manageable level.

    When you bring down your number of postgres processes to some manageable
level (plot a curve of throughput versus processes and select the
maximum), if postgres still spends idle time waiting for locks, you'll
need to do some exploration:

    - use the lock view facility in postgres
    - check your triggers: are you using some trigger that updates a count
as rows are modified? This can be a point of contention.
    - check your FKs too.
    - try fsync=off
    - try to put the WAL and tables on a ramdisk.
    If you have even a few % iowait, maybe that hides the fact that 1
postmaster is fsyncing and perhaps 10 others are waiting on it to finish,
which doesn't count as iowait...

    - recompile postgres and enable lwlock timing