Re: FW: Queries becoming slow under heavy load - Mailing list pgsql-performance

From Anne Rosset
Subject Re: FW: Queries becoming slow under heavy load
Date
Msg-id 945629628BB0174D86709AFE6D1CDEF501634373@SP-EXCHMBC.sp.corp.collab.net
Whole thread Raw
In response to Re: FW: Queries becoming slow under heavy load  (Scott Marlowe <scott.marlowe@gmail.com>)
Responses Re: FW: Queries becoming slow under heavy load
List pgsql-performance
Scott,
Thanks for your response.
We are over NFS for our storage ...

Here is what we see during our performance testing:
This is about 7 seconds after the query was sent to postgres:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7090 root      25   0  689m 399m  10m R 89.9  5.0   3868:44 java
 1846 postgres  16   0  474m 198m 103m R 75.2  2.5   0:28.69 postmaster
 2170 postgres  15   0  391m 203m 188m R 44.0  2.6   0:17.63 postmaster
 2555 httpd     18   0  298m  15m 4808 R 22.0  0.2   0:00.12 httpd
 2558 root      15   0 29056 2324 1424 R  1.8  0.0   0:00.01 top
 1207 httpd     15   0  337m  20m 7064 R  0.0  0.3   0:00.69 httpd
28312 postgres  16   0  396m 183m 162m D  0.0  2.3   0:50.82 postmaster  <---- this is the query here

Notice the 0% CPU, also, notice the 183m RES memory.

Ten seconds later:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7090 root      25   0  689m 399m  10m R 92.9  5.0   3868:53 java
 2657 root      15   0 29056 2328 1424 R  1.9  0.0   0:00.01 top
28312 postgres  16   0  396m 184m 162m D  0.0  2.3   0:50.84 postmaster  <---- here

Ten seconds after that:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7090 root      25   0  689m 399m  10m R 88.7  5.0   3869:02 java
 1845 postgres  16   0  473m 223m 127m D 22.6  2.8   0:26.39 postmaster
 2412 httpd     15   0 2245m 1.4g  16m R 18.9 17.8   0:02.48 java
  966 postgres  15   0  395m 242m 221m D  0.0  3.0   1:02.31 postmaster
 2680 root      15   0 29056 2336 1424 R  0.0  0.0   0:00.01 top
28312 postgres  16   0  396m 184m 163m D  0.0  2.3   0:50.85 postmaster  <--- here

etc....

and it's not until around the 221 second mark that we see catch it consuming CPU:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7090 root      25   0  689m 399m  10m R 93.4  5.0   3872:07 java
28312 postgres  16   0  396m 225m 204m R  5.7  2.8   0:51.52 postmaster   <----- here
 3391 root      15   0 29056 2348 1424 R  1.9  0.0   0:00.01 top
 4297 root      16   0 10228  740  632 D  0.0  0.0  12:53.66 hald-addon-stor
26885 httpd     15   0 2263m 1.5g  16m R  0.0 19.0   0:00.01 java

Note that the load average is fine during this timeframe, ~4 out of 8, so plenty of CPU.

Looks like this is true "halting".

Further, or worse yet, this same behavior expands out to multiple processes, producing a true "back up". It can look
something like this. Notice the 0% cpu consumption:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7090 root      22   0  689m 399m  10m R 91.1  5.0   3874:32 java
 4139 root      15   0 29080 2344 1424 R  1.9  0.0   0:00.01 top
 1555 postgres  16   0  474m 258m 162m D  0.0  3.2   0:17.32 postmaster
 1846 postgres  16   0  474m 285m 189m D  0.0  3.6   0:47.43 postmaster
 2713 postgres  16   0  404m 202m 179m D  0.0  2.5   0:33.54 postmaster
 2801 postgres  16   0  391m 146m 131m D  0.0  1.8   0:04.48 postmaster
 2804 postgres  16   0  419m 172m 133m D  0.0  2.2   0:09.41 postmaster
 2825 postgres  16   0  473m 142m  49m D  0.0  1.8   0:04.12 postmaster

Thanks for any additional explanation/advice,
Anne

-----Original Message-----
From: Scott Marlowe [mailto:scott.marlowe@gmail.com]
Sent: Wednesday, January 26, 2011 8:19 PM
To: Anne Rosset
Cc: pgsql-performance@postgresql.org
Subject: Re: FW: [PERFORM] Queries becoming slow under heavy load

On Wed, Jan 26, 2011 at 9:04 AM, Anne Rosset <arosset@collab.net> wrote:

<HUGE LIST OF SETTINGS DELETED>

PLEASE post just the settings you changed.  I'm not searching through a list that big for the interesting bits.

> Today we did more analysis and observed  postgress processes that
> continually reported status 'D' in top.

Full stop.  The most likely problem here is that the query is now hitting the disks and waiting.  If you have 1 disk
andtwo users, the access speed will drop by factors, usually much higher than 2. 

To put it very simply, you need as many mirror pairs in your RAID-10 or as many disks in your RAID5 or RAID 6 as you
haveusers reading the disk drives.  If you're writing you need more and more disks too. 
Mediating this issue we find things like SSD cache in ZFS or battery backed RAID controllers.  They allow the reads and
writesto be streamlined quite a bit to the spinning disks, making it appear the RAID array underneath it was much
faster,had better access, and all 
the sectors were near each other.   To an extent.

If you have the answer to the previous poster's question "can you tell us what sort of IO you have (sata, scsi, raid, #
ofdisks, etc)." you should provide it.  If you've got a pair of 5k RPM SATA drives in a 
RAID-1 you might need more hardware.

So, instead of just settings, show us a few carefully selected lines of output from vmstat or iostat while this is
happening. Don't tell us what you see, show us. 

pgsql-performance by date:

Previous
From: david@lang.hm
Date:
Subject: Re: How to best use 32 15k.7 300GB drives?
Next
From: Michael Kohl
Date:
Subject: Re: High load,