Thread: understanding postgres issues/bottlenecks

understanding postgres issues/bottlenecks

From
Stefano Nichele
Date:
Hi list,
I would like to ask your help in order to understand if my postgresql
server (ver. 8.2.9) is well configured.
It's a quad-proc system (32 bit) with a 6 disk 1+0 RAID array and 2
separate disks for the OS and write-ahead logs with 4GB of RAM.

I don't know what is the best info to help me and so I start with some
vmstat information:

 > vmstat -n 30
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0 23     84 129968  25060 3247860    0    0    78    50    0    2 17  5 33 45  0
 1 24     84 124204  25136 3257100    0    0  3037  1154 3359 7253 25  7  6 62  0
 2 30     84 124704  25136 3256344    0    0  3004  1269 3553 7906 33  9  7 51  0
 0 25     84 125784  24956 3253344    0    0  3357   773 3163 6454 17  4 10 68  0
 0  2     84 125744  25236 3258996    0    0  3186   567 3125 6425 24  6 21 50  0
 3  7     84 124948  25500 3260088    0    0  1829   535 2706 4625 18  3 54 25  0
 5  0     84 124976  25624 3259112    0    0  2067   647 3050 6163 26  6 41 27  0
 0  7     84 123836  25644 3260760    0    0  2239  1065 3289 8654 27  7 38 28  0

These are gathered loadavg info for the same period:
29.57 29.53 33.52 1/231 12641
29.54 29.63 33.31 1/226 12678
24.43 28.45 32.69 1/223 12696
12.31 24.17 30.95 4/223 12706

On average there are about 120-150 connections at the moment, and this is
my postgresql.conf file:

listen_addresses = '*'                  # what IP address(es) to listen on;
port = 5432                             # (change requires restart)
max_connections = 400                   # (change requires restart)
ssl = off                               # (change requires restart)
password_encryption = on
shared_buffers = 32MB                   # min 128kB or max_connections*16kB
max_prepared_transactions = 0           # can be 0 or more
work_mem = 1MB                          # min 64kB
maintenance_work_mem = 256MB            # min 1MB
max_fsm_pages = 204800                  # min max_fsm_relations*16, 6 bytes each
fsync = on                              # turns forced synchronization on or off
full_page_writes = on                   # recover from partial page writes
wal_buffers = 8MB                       # min 32kB
checkpoint_segments = 56                # in logfile segments, min 1, 16MB each
enable_bitmapscan = on
enable_hashagg = on
enable_hashjoin = on
enable_indexscan = on
enable_mergejoin = on
enable_nestloop = on
enable_seqscan = on
enable_sort = on
enable_tidscan = on
cpu_tuple_cost = 0.003                  # same scale as above
cpu_index_tuple_cost = 0.001            # same scale as above
cpu_operator_cost = 0.0005              # same scale as above
effective_cache_size = 3GB
geqo_threshold = 14
log_destination = 'stderr'              # Valid values are combinations of
redirect_stderr = on                    # Enable capturing of stderr into log
log_directory = 'pg_log'                # Directory where log files are written
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' # Log file name pattern.
log_truncate_on_rotation = on           # If on, any existing log file of the same
log_rotation_age = 1d                   # Automatic rotation of logfiles will
log_rotation_size = 0                   # Automatic rotation of logfiles will
log_min_duration_statement = -1         # -1 is disabled, 0 logs all statements
log_statement = 'none'                  # none, ddl, mod, all

autovacuum = on                         # enable autovacuum subprocess?
stats_start_collector = on              # must be 'on' for autovacuum
stats_row_level = on                    # must be 'on' for autovacuum

statement_timeout = 150000
datestyle = 'iso, mdy'
lc_messages = 'en_US.UTF-8'                     # locale for system error message
lc_monetary = 'en_US.UTF-8'                     # locale for monetary formatting
lc_numeric = 'en_US.UTF-8'                      # locale for number formatting
lc_time = 'en_US.UTF-8'                         # locale for time formatting
deadlock_timeout = 5s
escape_string_warning = off
standard_conforming_strings = on


Cheers and thanks a lot in advance.
Let me know if other info is useful.
Ste

--
Stefano Nichele

Funambol Chief Architect
Funambol :: Open Source Mobile 'We' for the Mass Market :: http://www.funambol.com


Re: understanding postgres issues/bottlenecks

From
"Merlin Moncure"
Date:
On Tue, Jan 6, 2009 at 11:50 AM, Stefano Nichele
<stefano.nichele@gmail.com> wrote:
> Hi list,
> I would like to ask your help in order to understand if my postgresql server
> (ver. 8.2.9) is well configured.
> It's a quad-proc system (32 bit) with a 6 disk 1+0 RAID array and 2 separate
> disks for the OS and write-ahead logs with 4GB of RAM.
>
> I don't know what is the best info to help me and so I start with some
> vmstat information:
>
>> vmstat -n 30
> procs -----------memory---------- ---swap-- -----io---- --system--
> -----cpu------
> r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
> st
> 0 23     84 129968  25060 3247860    0    0    78    50    0    2 17  5 33
> 45  0

clearly i/o bound. can you throw an iostat to give some more detailed info?
also,
need o/s version, etc
disk info (speed, raid controller, etc)
my guess is your disk system is underpowered for the transaction load you
are giving it.
how many tps is the database doing?
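(a rough way to get that last number, assuming the stats collector is on:
sample the commit counters twice, a known number of seconds apart, and
divide the difference by the interval. something like

psql -c "SELECT sum(xact_commit + xact_rollback) FROM pg_stat_database;" yourdb

run twice, sixty seconds apart -- the delta divided by 60 is an approximate
tps. the database name is just a placeholder.)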

merlin

Re: understanding postgres issues/bottlenecks

From
"Scott Marlowe"
Date:
I concur with Merlin you're I/O bound.

Adding to his post, what RAID controller are you running, does it have
cache, does the cache have battery backup, is the cache set to write
back or write through?

Also, what do you get for this (need contrib module pgbench installed)

pgbench -i -s 100
pgbench -c 50 -n 10000

? Specifically transactions per second?

Re: failure notice

From
"Scott Marlowe"
Date:
I got this bounce message from your account,
stefano.nichele@gmail.com.  I'm on gmail too, but don't get a lot of
-perform messages into my spam folder.

Just in case you've got eliminatecc turned on on the mailing list
server, I'm resending it through the mail server without your email
address in it.


I concur with Merlin you're I/O bound.

Adding to his post, what RAID controller are you running, does it have
cache, does the cache have battery backup, is the cache set to write
back or write through?

Also, what do you get for this (need contrib module pgbench installed)

pgbench -i -s 100
pgbench -c 50 -n 10000

? Specifically transactions per second?

Re: understanding postgres issues/bottlenecks

From
Stefano Nichele
Date:
Thanks for your help. I'll give you the info you asked for as soon as I
have it (I also have to install iostat, but I don't have enough privileges
to do that).

BTW, why did you say I/O bound? Which parameters highlight that? Sorry
for my ignorance....

ste

Merlin Moncure wrote:
> On Tue, Jan 6, 2009 at 11:50 AM, Stefano Nichele
> <stefano.nichele@gmail.com> wrote:
>
>> Hi list,
>> I would like to ask your help in order to understand if my postgresql server
>> (ver. 8.2.9) is well configured.
>> It's a quad-proc system (32 bit) with a 6 disk 1+0 RAID array and 2 separate
>> disks for the OS and write-ahead logs with 4GB of RAM.
>>
>> I don't know what is the best info to help me and so I start with some
>> vmstat information:
>>
>>
>>> vmstat -n 30
>>>
>> procs -----------memory---------- ---swap-- -----io---- --system--
>> -----cpu------
>> r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>> st
>> 0 23     84 129968  25060 3247860    0    0    78    50    0    2 17  5 33
>> 45  0
>>
>
> clearly i/o bound. can you throw an iostat to give some more detailed info?
> also,
> need o/s version, etc
> disk info (speed, raid controller, etc)
> my guess is your disk system is underpowered for transaction load you
> are giving.
> how many tps is the database doing?
>
> merlin
>
>


--
Stefano Nichele

Funambol Chief Architect
Funambol :: Open Source Mobile 'We' for the Mass Market :: http://www.funambol.com


Re: understanding postgres issues/bottlenecks

From
"Scott Marlowe"
Date:
On Tue, Jan 6, 2009 at 12:02 PM, Stefano Nichele
<stefano.nichele@gmail.com> wrote:
> Thanks for your help. I'll give you the info you asked as soon as I'll have
> it (i have also to install iostat but I don't have enough privilege to do
> that).
>
> BTW, why did you said I/O bound ? Which are the parameters that highlight
> that ? Sorry for my ignorance....

In this output, the second-to-last column is wa, which stands for I/O
wait, and the third-to-last is id, which is CPU idle.

r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
1 24     84 124204  25136 3257100    0    0  3037  1154 3359 7253 25  7  6 62  0
2 30     84 124704  25136 3256344    0    0  3004  1269 3553 7906 33  9  7 51  0
0 25     84 125784  24956 3253344    0    0  3357   773 3163 6454 17  4 10 68  0

So here the first line reports
25% user
7% system
6% idle
62% wait

So it's mostly waiting.

Re: understanding postgres issues/bottlenecks

From
"David Rees"
Date:
On Tue, Jan 6, 2009 at 11:02 AM, Stefano Nichele
<stefano.nichele@gmail.com> wrote:
> BTW, why did you said I/O bound ? Which are the parameters that highlight
> that ? Sorry for my ignorance....

In addition to the percentage of time spent in wait as Scott said, you
can also see the number of processes which are blocked (b column on
the left). Those are the processes which would like to run but are
waiting for IO to complete.

From your earlier vmstat output, you can see that often quite a large
number of processes were waiting on IO - 20-30 of them. This is
confirmed by your load average. Since you only have 6 spindles, request
service latency was likely very high during those periods.

Anyway, I would start with significantly increasing the amount of
memory you have allocated to shared_buffers.

32MB is _way_ too low. For a dedicated PostgreSQL machine with 4GB
RAM, 1GB is going to be a lot closer to optimal, with the optimal setting
probably somewhere between 500MB and 2GB.
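Something along these lines in postgresql.conf would be a reasonable
starting point (it needs a server restart, and since you're on a 32-bit
build don't push it much past the 1-2GB range):

shared_buffers = 1GB                    # up from 32MB, requires restart

Treat the exact number as something to experiment with.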

And please post the rest of the information that the others have asked for...

-Dave

Re: understanding postgres issues/bottlenecks

From
"M. Edward (Ed) Borasky"
Date:
David Rees wrote:
> On Tue, Jan 6, 2009 at 11:02 AM, Stefano Nichele
> <stefano.nichele@gmail.com> wrote:
>> BTW, why did you said I/O bound ? Which are the parameters that highlight
>> that ? Sorry for my ignorance....
>
> In addition to the percentage of time spent in wait as Scott said, you
> can also see the number of processes which are blocked (b column on
> the left). Those are the processes which would like to run but are
> waiting for IO to complete.
>
> From your earlier vmstat output, you can see that often quite a large
> number of processes were waiting on IO - 20-30 of them. This is
> confirmed by your load average. Since you only have 6 spindles, during
> those periods, request service latency was likely very high during
> those periods.
>
> Anyway, I would start with significantly increasing the amount of
> memory you have allocated to shared_buffers.
>
> 32MB is _way_ too low. For a dedicated PostgreSQL machine with 4GB
> RAM, 1GB going to be a lot closer to optimal, with the optimal setting
> probably somewhere between 500MB and 2GB.
>
> And please post the rest of the information that the others have asked for...
>
> -Dave
>

A couple of notes on "iostat":

1. The package it lives in is called "sysstat". Most Linux distros do
*not* install "sysstat" by default. Somebody should beat up on them
about that. :)

2. It does not require admin privileges to install or execute if you
want to have your own personal copy. You will, however, need the Gnu C
compiler, "make", and some other things that also may not be installed.
In any event, you can build "sysstat" from source in your home directory
if you can't get the sysadmin to install it globally.

3. Once you do get it installed, the command line to run it that gives
the most information is

$ iostat -cdmtx <sampling interval> <number of samples>

That will give you CPU utilization, throughputs in megabytes, time
stamps and extended statistics like device utilizations, queue lengths
and average service times. And if you're running 2.6.25 or later, it
should give you all the statistics for all partitions, not just devices!

Somewhere buried in one of my open source projects is a Perl script that
will parse the "iostat" output into a CSV file. By the way, if you need
to, you can be fairly aggressive with small sampling intervals. "iostat"
is very efficient, and chances are, you're I/O bound anyhow so your
processors will have plenty of bandwidth to format numbers out of /proc.
:) I normally run with samples every ten seconds, but you can probably
go even smaller if you need to.
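
In case it helps, building it under your home directory is usually just the
stock autoconf dance, roughly (tarball name, version and install prefix are
only examples -- grab the current source from the sysstat home page):

$ tar xzf sysstat-x.y.z.tar.gz
$ cd sysstat-x.y.z
$ ./configure --prefix=$HOME/sysstat
$ make && make install
$ ~/sysstat/bin/iostat -cdmtx 10 60    # ten-second samples for ten minutes

No root access required as long as everything stays under your own prefix.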

Re: understanding postgres issues/bottlenecks

From
Simon Waters
Date:
On Wednesday 07 January 2009 04:17:10 M. Edward (Ed) Borasky wrote:
>
> 1. The package it lives in is called "sysstat". Most Linux distros do
> *not* install "sysstat" by default. Somebody should beat up on them
> about that. :)

Hehe, although sysstat and friends did have issues on Linux for a long time.
Nothing worse than misleading stats, so I suspect it lost a lot of friends
back then. It is a lot better these days when most of the "Unix" software
targets Linux first, and other kernels second.

Aside from all the advice here about system tuning, as a system admin I'd also
ask is the box doing the job you need? And are you looking at the Postgres
log (with logging of slow queries) to see that queries perform in a sensible
time? I'd assume with the current performance figure there is an issue
somewhere, but I've been to places where it was as simple as adding one
index, or even modifying an index so it does what the application developer
intended instead of what they ask for ;)

Re: understanding postgres issues/bottlenecks

From
"M. Edward (Ed) Borasky"
Date:
Simon Waters wrote:
> On Wednesday 07 January 2009 04:17:10 M. Edward (Ed) Borasky wrote:
>> 1. The package it lives in is called "sysstat". Most Linux distros do
>> *not* install "sysstat" by default. Somebody should beat up on them
>> about that. :)
>
> Hehe, although sysstat and friends did have issues on Linux for a long time.
> Nothing worse than misleading stats, so I suspect it lost a lot of friends
> back then. It is a lot better these days when most of the "Unix" software
> targets Linux first, and other kernels second.

I'm unfortunately familiar with that "episode", which turned out to be
bugs in the Red Hat Linux kernels / drivers. I don't remember the exact
versions, although there was at least one of them shipped at one time as
part of RHEL 3. I may be wrong, but I don't think Sebastien ever had to
change any of *his* code in "sysstat" -- I think he had to wait for Red
Hat. :)

>
> Aside from all the advice here about system tuning, as a system admin I'd also
> ask is the box doing the job you need? And are you looking at the Postgres
> log (with logging of slow queries) to see that queries perform in a sensible
> time? I'd assume with the current performance figure there is an issue
> somewhere, but I've been to places where it was as simple as adding one
> index, or even modifying an index so it does what the application developer
> intended instead of what they ask for ;)
>

Yeah ... the issues of application response time and throughput, cost of
adding hardware to the box(es) vs. cost of changing the application,
changes in demand for the server over time, etc., are what make capacity
planning "fun". Squeezing the last megabyte per second out of a RAID
array is the easy stuff. :)

Re: understanding postgres issues/bottlenecks

From
Stefano Nichele
Date:
Ok, here is some information:

OS: Centos 5.x (Linux 2.6.18-53.1.21.el5 #1 SMP Tue May 20 09:34:18 EDT
2008 i686 i686 i386 GNU/Linux)
RAID: it's a hardware RAID controller
The disks are 9600rpm SATA drives

(6 disk 1+0 RAID array and 2 separate disks for the OS).


About iostat (on sdb I have pg_xlog, on sdc I have data)

 > iostat -k
Linux 2.6.18-53.1.21.el5 (*******)       01/07/2009

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          17.27    0.00    5.13   45.08    0.00   32.52

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              30.42        38.50       170.48  182600516  808546589
sdb              46.16         0.23        52.10    1096693  247075617
sdc             269.26       351.51       451.00 1667112043 2138954833



 > iostat -x -k -d 2 5
Linux 2.6.18-53.1.21.el5 (*******)       01/07/2009

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.17    12.68  0.47 29.95    38.51   170.51    13.74     0.03    0.86   0.19   0.57
sdb               0.01    80.11  0.05 46.11     0.23    52.01     2.26     0.01    0.22   0.22   1.01
sdc               7.50    64.57 222.55 46.69   350.91   450.98     5.96     0.57    2.05   3.13  84.41

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb               0.00   196.00  1.00 117.00     4.00  1252.00    21.29     0.02    0.19   0.19   2.30
sdc               1.50    66.00 277.00 66.50  3100.00   832.00    22.89    50.84  242.30   2.91 100.10

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb               0.00   264.50  0.00 176.50     0.00  1764.00    19.99     0.04    0.21   0.21   3.70
sdc               3.50   108.50 291.50 76.00  3228.00   752.00    21.66    89.42  239.39   2.72 100.05

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     4.98  0.00  1.00     0.00    23.88    48.00     0.00    0.00   0.00   0.00
sdb               0.00    23.88  0.00  9.45     0.00   133.33    28.21     0.00    0.21   0.21   0.20
sdc               1.00   105.97 274.13 53.73  3297.51   612.94    23.85    67.99  184.58   3.04  99.55

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb               0.00    79.00  0.00 46.00     0.00   500.00    21.74     0.01    0.25   0.25   1.15
sdc               2.50   141.00 294.00 43.50  3482.00   528.00    23.76    51.33  170.46   2.96 100.05


vmstat in the same time:
 > vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0 27     80 126380  27304 3253016    0    0    98    55    0    1 17  5 33 45  0
 0 26     80 124516  27300 3255456    0    0  3438  1724 2745 4011 11  2  8 78  0
 1 25     80 124148  27276 3252548    0    0  3262  2806 3572 7007 33 11  3 53  0
 1 28     80 128272  27244 3248516    0    0  2816  1006 2926 5624 12  3 12 73  0


I will run pgbench in the next days.


>
> Aside from all the advice here about system tuning, as a system admin I'd also
> ask is the box doing the job you need? And are you looking at the Postgres
> log (with logging of slow queries) to see that queries perform in a sensible
> time? I'd assume with the current performance figure there is an issue
> somewhere, but I've been to places where it was as simple as adding one
> index, or even modifying an index so it does what the application developer
> intended instead of what they ask for ;)
>


I already checked the postgres log and resolved the index/slow-query issues.
Right now I have queries that are sometimes really fast and sometimes hit
the timeout, even though all the required indexes are there. For sure there
is still room to improve performance on that side, but I would also like to
investigate the issue from other points of view (in order to understand how
to monitor the server as well).


Cheers and thanks a lot.
ste


Re: understanding postgres issues/bottlenecks

From
"Stefano Nichele"
Date:

On Tue, Jan 6, 2009 at 7:45 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
I concur with Merlin you're I/O bound.

Adding to his post, what RAID controller are you running, does it have
cache, does the cache have battery backup, is the cache set to write
back or write through?
 
At the moment I don't have that information. It's a "standard" RAID controller that came with a DELL server. Is there any information I can get by asking the OS?



Also, what do you get for this (need contrib module pgbench installed)

pgbench -i -s 100
pgbench -c 50 -n 10000

? Specifically transactions per second?

I'll run pgbench in the next days.

Cheers,
ste

Re: understanding postgres issues/bottlenecks

From
"Scott Marlowe"
Date:
On Wed, Jan 7, 2009 at 2:02 PM, Stefano Nichele
<stefano.nichele@gmail.com> wrote:
>
> On Tue, Jan 6, 2009 at 7:45 PM, Scott Marlowe <scott.marlowe@gmail.com>
> wrote:
>>
>> I concur with Merlin you're I/O bound.
>>
>> Adding to his post, what RAID controller are you running, does it have
>> cache, does the cache have battery backup, is the cache set to write
>> back or write through?
>
>
> At the moment I don't have such information. It's a "standard" RAID
> controller coming with a DELL server. Is there any information I can have
> asking to the SO ?

You can run lshw to see what flavor controller it is.  Dell RAID
controllers are pretty much either total crap, or mediocre at best.
The latest one, the Perc 6 series are squarely in the same performance
realm as a 4 or 5 year old LSI megaraid.  The perc 5 series and before
are total performance dogs.  The really bad news is that you can't
generally plug in a real RAID controller on a Dell.  We put an Areca
168-LP PCI-x8 in one of our 1950s and it wouldn't even turn on, got a
CPU Error.
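
For example, something like

lshw -class storage

run as root, or failing that "lspci | grep -i raid", should spit out the
controller model.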

Dells are fine for web servers and such.  For database servers they're
a total loss.  The best you can do with one is to put a generic SCSI
card in it and connect to an external array with its own controller.

We have a perc6e and a perc5e in two different servers, and no matter
how we configure them, we can't get even 1/10th the performance of an
Areca controller with the same number of drives on another machine of
the same basic class as the 1950s.

>> Also, what do you get for this (need contrib module pgbench installed)
>>
>> pgbench -i -s 100
>> pgbench -c 50 -n 10000
>>
>> ? Specifically transactions per second?
>
> I'll run pgbench in the next days.

Cool.  That pgbench is a "best case scenario" benchmark.  Lots of
small transactions on a db that should fit into memory.  If you can't
pull off a decent number there (at least a few hundred tps) then you
can't expect better performance from real world usage.

Oh, and that should be:

pgbench -c 50 -t 10000

not -n... not enough sleep I guess.
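
So the whole thing, against a throwaway database (the name is just an
example; the -s 100 init step creates roughly 1.5GB of test data), would
look something like:

createdb pgbench_test
pgbench -i -s 100 pgbench_test
pgbench -c 50 -t 10000 pgbench_test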

Re: understanding postgres issues/bottlenecks

From
"Scott Marlowe"
Date:
Just to elaborate on the horror that is a Dell perc5e.  We have one in
a 1950 with battery backed cache (256 Meg I think).  It has an 8 disk
500Gig SATA drive RAID-10 array and 4 1.6GHz cpus and 10 Gigs ram.
This server currently serves as a mnogo search server.  Here's what
vmstat 1 looks like during the day:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  5  38748  60400  43172 9227632    0    0  5676     0  668  703  4  2 12 82
 0  5  38748  58888  43176 9229360    0    0  5548     0  672  792  2  0 15 83
 0  4  38748  64460  43184 9230476    0    0  5964    72  773  947  1  0 31 67
 0  5  38748  61884  43272 9241564    0    0  5896  1112  674 1028  1  2 23 74
 0  5  38748  56612  43276 9247376    0    0  5660     0  603  795  0  0 21 79
 0  5  38748  56472  43268 9247480    0    0  5700     0  603  790  0  0 22 77

Note 4 or 5 blocking, and reading in data at 5M/sec and bursting small writes.
75 to 80% wait state
user and sys time around 1 or 2%
rest is idle.

This is without adding any load from pgbench.  When I add pgbench, the
numbers from vmstat look the same pretty much, slight increase of
maybe 20% bi and bo and a rise in the blocked processes.

Running vacuum on the pgbench db on this machine takes well over a
minute.  Even a small pgbench test runs slowly, getting 20 to 30 tps
during the day.  During off hours I can get a max of about 80 tps.

----------------------------------------------------------------------------------------------

Now, here's my primary production server.  It is NOT a Dell, or an HP,
or an IBM.  It is a high end white box machine with a Tyan mobo.
Primary difference here is 12 disk SAS RAID-10 and a much faster RAID
controller.  It's got 8 opteron 2.1GHz cores but honestly, it hardly
uses any of them.  This is it before I run pgbench.  It almost looks
like it's sleeping.  It is handling about 250 queries per second right
now, most serving from memory.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0   2464 614156 626448 28658476    0    0     0   688 2515 2615 12  1 87  0  0
 2  0   2464 638972 626448 28658484    0    0     8   656 2401 2454 16  2 82  0  0
 1  0   2464 631852 626448 28658516    0    0     0   552 1939 1984 14  1 85  0  0
 4  0   2464 617900 626448 28658532    0    0    32   500 4925 5276 19  2 78  0  0
 1  0   2464 617656 626448 28658560    0    0     0   492 3363 3428 14  1 85  0  0
 1  0   2464 618648 626448 28658560    0    0     0   752 3391 3579 12  2 85  0  0

It's not reading anything in because it fits in memory (it's got 32
Gig ram) and it's writing out a small amount.  I'll run pgbench to
give it a heavier load and force more writes.

pgbench -c 10 -t 25000
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 10
number of transactions per client: 25000
number of transactions actually processed: 250000/250000
tps = 2866.549792 (including connections establishing)
tps = 2868.010947 (excluding connections establishing)

Here's vmstat during that period:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  2   2408 190548 626584 28770448    0    0     0 66200 8210 84391 44 14 35  6  0
 0  2   2408 207156 626584 28770900    0    0     0 23832 6688  6426 20  4 58 18  0
 7  0   2408 295792 626588 28771904    0    0     0 34372 6831 62873 35 12 43 11  0
 3  2   2408 308904 626588 28773128    0    0     0 60040 6711 78235 44 13 40  3  0
 4  0   2408 310408 626588 28773660    0    0     0 54780 7779 37399 28  8 50 15  0
 6  1   2408 325808 626592 28775912    0    0    16 43588 5345 105348 43 15 39  3  0
10  0   2408 324304 626592 28778188    0    0    16 60984 6582 125105 52 18 29  1  0
 8  1   2408 339204 626596 28780248    0    0     8 47956 5203 113799 48 18 33  1  0
 4  2   2408 337856 626596 28782840    0    0     0 108096 12132 90391 46 16 33  4  0

Note that wait is generally around 5 to 10% on this machine.

Now, I know it's a much more powerful machine, but the difference in
performance is not some percentage faster.  It's many, many factors
faster.  Twenty or more times as fast as the Dell.  That Dell
machine cost us somewhere in the range of $11,000.  The MD-1000 was a
large part of that cost.

The Aberdeen white box that is replacing it cost us about $11,500.

I cannot understand how Dell stays in business.  It certainly isn't on merit.

Re: understanding postgres issues/bottlenecks

From
Greg Smith
Date:
On Wed, 7 Jan 2009, Scott Marlowe wrote:

> I cannot understand how Dell stays in business.

There's a continuous stream of people who expect RAID5 to perform well,
too, yet this surprises you?

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: understanding postgres issues/bottlenecks

From
"Scott Marlowe"
Date:
On Wed, Jan 7, 2009 at 3:34 PM, Greg Smith <gsmith@gregsmith.com> wrote:
> On Wed, 7 Jan 2009, Scott Marlowe wrote:
>
>> I cannot understand how Dell stays in business.
>
> There's a continuous stream of people who expect RAID5 to perform well, too,
> yet this surprises you?

I guess I've underestimated the human capacity for stupidity yet again.

-- An optimist sees the glass as half full, a pessimist as half empty,
and an engineer as having redundant storage capacity.

Re: understanding postgres issues/bottlenecks

From
Glyn Astill
Date:
--- On Wed, 7/1/09, Scott Marlowe <scott.marlowe@gmail.com> wrote:

 > The really bad news is that
> you can't
> generally plug in a real RAID controller on a Dell.  We put
> an Areca
> 168-LP PCI-x8 in one of our 1950s and it wouldn't even
> turn on, got a
> CPU Error.
>

Hmm, I had to pull the perc5i's out of our dell servers to get them to boot off of our Adaptec 5805's

Anyway that reminds me, I must remember to bring my lump hammer into work at some point...




Re: understanding postgres issues/bottlenecks

From
Glyn Astill
Date:
--- On Wed, 7/1/09, Scott Marlowe <scott.marlowe@gmail.com> wrote:

> Just to elaborate on the horror that is a Dell perc5e.  We
> have one in
> a 1950 with battery backed cache (256 Meg I think).  It has
> an 8 disk
> 500Gig SATA drive RAID-10 array and 4 1.6GHz cpus and 10
> Gigs ram.

Our perc5i controllers performed better in raid 5 than 10. Sounds like the comment you made when I was wasting my time
with that perc3 fits all dell cards perfectly; "brain damaged"




Re: understanding postgres issues/bottlenecks

From
"Scott Marlowe"
Date:
On Wed, Jan 7, 2009 at 4:36 PM, Glyn Astill <glynastill@yahoo.co.uk> wrote:
> --- On Wed, 7/1/09, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>
>> Just to elaborate on the horror that is a Dell perc5e.  We
>> have one in
>> a 1950 with battery backed cache (256 Meg I think).  It has
>> an 8 disk
>> 500Gig SATA drive RAID-10 array and 4 1.6GHz cpus and 10
>> Gigs ram.
>
> Our perc5i controllers performed better in raid 5 than 10. Sounds like the comment you made when I was wasting my
> time with that perc3 fits all dell cards perfectly; "brain damaged"

One of the beauties of the whole Dell RAID card naming scheme is that
a Perc5i and a Perc5e can be made by different manufacturers and have
completely different performance characteristics.  The only common
factor seems to be the high level of suck they managed to generate.

Re: understanding postgres issues/bottlenecks

From
Bborie Park
Date:
Since the discussion involves Dell PERC controllers, does anyone know if
LSI cards (those with the same chipsets as the Dells) also have similarly
poor performance?

I have an LSI 8888ELP card, so would like to know what other people's
experiences are...

-bborie

Scott Marlowe wrote:
> On Wed, Jan 7, 2009 at 2:02 PM, Stefano Nichele
> <stefano.nichele@gmail.com> wrote:
>> On Tue, Jan 6, 2009 at 7:45 PM, Scott Marlowe <scott.marlowe@gmail.com>
>> wrote:
>>> I concur with Merlin you're I/O bound.
>>>
>>> Adding to his post, what RAID controller are you running, does it have
>>> cache, does the cache have battery backup, is the cache set to write
>>> back or write through?
>>
>> At the moment I don't have such information. It's a "standard" RAID
>> controller coming with a DELL server. Is there any information I can have
>> asking to the SO ?
>
> You can run lshw to see what flavor controller it is.  Dell RAID
> controllers are pretty much either total crap, or mediocre at best.
> The latest one, the Perc 6 series are squarely in the same performance
> realm as a 4 or 5 year old LSI megaraid.  The perc 5 series and before
> are total performance dogs.  The really bad news is that you can't
> generally plug in a real RAID controller on a Dell.  We put an Areca
> 168-LP PCI-x8 in one of our 1950s and it wouldn't even turn on, got a
> CPU Error.
>
> Dells are fine for web servers and such.  For database servers they're
> a total loss.  The best you can do with one is to put a generic SCSI
> card in it and connect to an external array with its own controller.
>
> We have a perc6e and a perc5e in two different servers, and no matter
> how we configure them, we can't get even 1/10th the performance of an
> Areca controller with the same number of drives on another machine of
> the same basic class as the 1950s.
>
>>> Also, what do you get for this (need contrib module pgbench installed)
>>>
>>> pgbench -i -s 100
>>> pgbench -c 50 -n 10000
>>>
>>> ? Specifically transactions per second?
>> I'll run pgbench in the next days.
>
> Cool.  That pgbench is a "best case scenario" benchmark.  Lots of
> small transactions on a db that should fit into memory.  If you can't
> pull off a decent number there (at least a few hundred tps) then can't
> expect better performance from real world usage.
>
> Oh, and that should be:
>
> pgbench -c 50 -t 10000
>
> not -n... not enough sleep I guess.
>

--
Bborie Park
Programmer
Center for Vectorborne Diseases
UC Davis
530-752-8380
bkpark@ucdavis.edu

Re: understanding postgres issues/bottlenecks

From
Scott Carey
Date:
If you’re stuck with a Dell, the Adaptec 5 series works, I’m using 5085’s in a pair and get 1200 MB/sec streaming reads best case with 20 SATA drives in RAID 10 (2 sets of 10, software raid 0 on top).   Of course, Dell doesn’t like you putting in somebody else’s RAID card, but they support the rest of the system when you add in a third party PCIe card.
Sure, they are a bit pricey, but they are also very good performers with drivers for a lot of stuff, including OpenSolaris.  I tested a PERC 6 versus this with the same drives, but only 10 total.  The Adaptec was 70% faster out of the box, and still 35% faster after tuning the linux OS read-ahead and other parameters.
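
The read-ahead tuning mentioned above is the usual block device knob, something like

blockdev --setra 16384 /dev/sdX

with the value and device name only as examples -- measure before and after.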

PERC 5 / 6 cards are LSI re-badged MegaRaid cards, for those interested.  With special firmware modifications that make them ‘interesting’.


On 1/7/09 1:31 PM, "Scott Marlowe" <scott.marlowe@gmail.com> wrote:

On Wed, Jan 7, 2009 at 2:02 PM, Stefano Nichele
<stefano.nichele@gmail.com> wrote:
>
> On Tue, Jan 6, 2009 at 7:45 PM, Scott Marlowe <scott.marlowe@gmail.com>
> wrote:
>>
>> I concur with Merlin you're I/O bound.
>>
>> Adding to his post, what RAID controller are you running, does it have
>> cache, does the cache have battery backup, is the cache set to write
>> back or write through?
>
>
> At the moment I don't have such information. It's a "standard" RAID
> controller coming with a DELL server. Is there any information I can have
> asking to the SO ?

You can run lshw to see what flavor controller it is.  Dell RAID
controllers are pretty much either total crap, or mediocre at best.
The latest one, the Perc 6 series are squarely in the same performance
realm as a 4 or 5 year old LSI megaraid.  The perc 5 series and before
are total performance dogs.  The really bad news is that you can't
generally plug in a real RAID controller on a Dell.  We put an Areca
168-LP PCI-x8 in one of our 1950s and it wouldn't even turn on, got a
CPU Error.

Dells are fine for web servers and such.  For database servers they're
a total loss.  The best you can do with one is to put a generic SCSI
card in it and connect to an external array with its own controller.

We have a perc6e and a perc5e in two different servers, and no matter
how we configure them, we can't get even 1/10th the performance of an
Areca controller with the same number of drives on another machine of
the same basic class as the 1950s.

>> Also, what do you get for this (need contrib module pgbench installed)
>>
>> pgbench -i -s 100
>> pgbench -c 50 -n 10000
>>
>> ? Specifically transactions per second?
>
> I'll run pgbench in the next days.

Cool.  That pgbench is a "best case scenario" benchmark.  Lots of
small transactions on a db that should fit into memory.  If you can't
pull off a decent number there (at least a few hundred tps) then can't
expect better performance from real world usage.

Oh, and that should be:

pgbench -c 50 -t 10000

not -n... not enough sleep I guess.


Re: understanding postgres issues/bottlenecks

From
"Merlin Moncure"
Date:
On Wed, Jan 7, 2009 at 7:11 PM, Scott Carey <scott@richrelevance.com> wrote:
> If you're stuck with a Dell, the Adaptec 5 series works, I'm using 5085's in
> a pair and get 1200 MB/sec streaming reads best case with 20 SATA drives in
> RAID 10 (2 sets of 10, software raid 0 on top).   Of course, Dell doesn't
> like you putting in somebody else's RAID card, but they support the rest of
> the system when you add in a third party PCIe card.
> Sure, they are a bit pricey, but they are also very good performers with
> drivers for a lot of stuff, including OpenSolaris.  I tested a PERC 6 versus
> this with the same drives, but only 10 total.  The Adaptec was 70% faster
> out of the box, and still 35% faster after tuning the linux OS read-ahead
> and other parameters.

Sequential read performance means precisely squat for most database
loads.  The dell stuff is ok....decent RAID 5 performance and mediocre
raid 10.  Unfortunately switching the disks to jbod and going software
raid doesn't seem to help much.  The biggest problem with dell
hardware that I see is that overflowing the raid cache causes the
whole system to spectacularly grind to a halt, causing random delays.

To the OP, it looks like you are getting about 300 or so tps out of
sdc (80% read), which is where I'm assuming the data is.  I'm guessing
most of that is random traffic.  Here's the bad news: while this is on
the low side for a 6 disk raid 10 7200 rpm, it's probably about what
your particular hardware can do.  I have some general suggestions for
you:
*) upgrade hardware: more/faster disks, etc
*) disable fsync (dangerous!) can risk data loss, but maybe you have
redundancy built in a different place.  This will let linux reorganize
i/o on top of what the hardware is doing.
*) upgrade to postgres 8.3.  Numerous efficiency advantages, and has
the synchronous_commit setting, which is 'fsync lite'...most of the
advantages and a lot less risk.
*) tune the app
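
(for the 8.3 option, synchronous_commit is literally a one-line change in
postgresql.conf:

synchronous_commit = off

data files stay consistent after a crash, you just risk losing the last
instant of commits.)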

merlin

Re: understanding postgres issues/bottlenecks

From
"Scott Marlowe"
Date:
On Wed, Jan 7, 2009 at 6:19 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

RE: Perc raid controllers
>  Unfortunately switching the disks to jbod and going software
> raid doesn't seem to help much.  The biggest problem with dell

Yeah, I noticed that too when I was trying to get a good config from
the perc 5e.  Also, running software RAID0 over hardware RAID1 sets
gave no appreciable speed boost.

Re: understanding postgres issues/bottlenecks

From
Scott Carey
Date:

> Sequential read performance means precisely squat for most database
> loads.  

Depends on the database workload.  Many queries for me may scan 50GB of data for aggregation.
Besides, it is a good test for making sure your RAID card doesn’t suck.  Especially running tests with sequential access CONCURRENT with random access.
A well-tuned raid setup will be able to handle a good chunk of sequential access while doing random reads concurrently.  A bad one will grind to a halt.
The same can be said for concurrent writes and fsyncs with concurrent reads.  Bad cards tend to struggle with this, good ones don’t.

$ sar -b
12:00:01 AM       tps      rtps      wtps   bread/s   bwrtn/s
01:10:01 AM   1913.22   1903.74      9.48 561406.70    326.67
01:20:02 AM   2447.71   2439.97      7.74 930357.08    148.86
01:30:01 AM   1769.77   1740.41     29.35 581015.86   3729.37
01:40:01 AM   1762.05   1659.06    102.99 477730.70  26137.96

And disk utilization did not go past 85% or so during the peak load, usually much less (8 cores, 12.5% would reflect a CPU).

12:00:01 AM       CPU     %user     %nice   %system   %iowait    %steal     %idle
01:10:01 AM       all     47.92      0.00     12.92     10.22      0.00     28.94
01:20:02 AM       all     67.97      0.00     17.93      3.47      0.00     10.63
01:30:01 AM       all     46.67      0.00     10.60      7.43      0.00     35.29
01:40:01 AM       all     59.22      0.03      9.88      5.67      0.00     25.21

The workload regularly bursts to 900MB/sec with concurrent sequential scans.


> The dell stuff is ok....decent RAID 5 performance and mediocre
> raid 10.  Unfortunately switching the disks to jbod and going software
> raid doesn't seem to help much.  The biggest problem with dell
> hardware that I see is that overflowing the raid cache causes the
> whole system to spectacularly grind to a halt, causing random delays.

The Adaptec stuff doesn’t have the issues with cache overflow.  For pure random access stuff the Dell Perc 6 is pretty good, but mix read/write it freaks out and has inconsistent performance.  A PERC 6 does perform better than a 3Ware 9650 for me though.  Those are both on my crap list, with 3Ware 9550 and PERC 5 both much worse.
Both got about 200 iops per drive on random access.


> To the OP, it looks like you are getting about 300 or so tps out of
> sdc (80% read), which is where I'm assuming the data is.  I'm guessing
> most of that is random traffic.  Here's the bad news: while this is on
> the low side for a 6 disk raid 10 7200 rpm, it's probably about what
> your particular hardware can do.  I have some general suggestions for
> you:
> *) upgrade hardware: more/faster disks, etc
> *) disable fsync (dangerous!) can risk data loss, but maybe you have
> redundancy built in a different place.  This will let linux reorganize
> i/o on top of what the hardware is doing.
> *) upgrade to postgres 8.3.  Numerous efficiency advantages, and has
> the synchronous_commit setting, which is 'fsync lite'...most of the
> advantages and a lot less risk.
> *) tune the app
>
> merlin

Agree with all of the above.
The xlogs are on sdb, which is not I/O bound, so I am not sure how much changing fsync will help.
I second upgrading to 8.3 which is generally faster and will reduce random i/o if any sequential scans are kicking out random access data from shared_buffers.

If there is budget for an upgrade, how big is the data set, and how much will it grow?
For  $1200 get two Intel X25-M SSD’s and all random iops issues will be gone (8k iops in raid 1).  Double that if 4 drives in raid 10.  Unfortunately, each pair only stores 80GB.  But for many, that is plenty.

Re: understanding postgres issues/bottlenecks

From
"M. Edward (Ed) Borasky"
Date:
Scott Marlowe wrote:
> On Wed, Jan 7, 2009 at 3:34 PM, Greg Smith <gsmith@gregsmith.com> wrote:
>> On Wed, 7 Jan 2009, Scott Marlowe wrote:
>>
>>> I cannot understand how Dell stays in business.
>> There's a continuous stream of people who expect RAID5 to perform well, too,
>> yet this surprises you?
>
> I guess I've underestimated the human capacity for stupidity yet again.
>
> -- An optimist sees the glass as half full, a pessimist as half empty,
> and engineer as having redundant storage capacity.

Ah, but the half-empty glass has a fly in it. -- Kehlog Albran :)



Re: understanding postgres issues/bottlenecks

From
"Stefano Nichele"
Date:
Found it!

Dell CERC SATA RAID 2 PCI SATA 6ch

Running lspci -v:

03:09.0 RAID bus controller: Adaptec AAC-RAID (rev 01)
        Subsystem: Dell CERC SATA RAID 2 PCI SATA 6ch (DellCorsair)
        Flags: bus master, 66MHz, slow devsel, latency 32, IRQ 209
        Memory at f8000000 (32-bit, prefetchable) [size=64M]
        Expansion ROM at fe800000 [disabled] [size=32K]
        Capabilities: [80] Power Management version 2


Any thoughts from looking at the iostat output?

Cheers and thanks to all!

ste


On Wed, Jan 7, 2009 at 10:31 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
On Wed, Jan 7, 2009 at 2:02 PM, Stefano Nichele
> On Tue, Jan 6, 2009 at 7:45 PM, Scott Marlowe <scott.marlowe@gmail.com>
> wrote:
>>
>> I concur with Merlin you're I/O bound.
>>
>> Adding to his post, what RAID controller are you running, does it have
>> cache, does the cache have battery backup, is the cache set to write
>> back or write through?
>
>
> At the moment I don't have such information. It's a "standard" RAID
> controller coming with a DELL server. Is there any information I can have
> asking to the SO ?

You can run lshw to see what flavor controller it is.  Dell RAID
controllers are pretty much either total crap, or mediocre at best.
The latest one, the Perc 6 series are squarely in the same performance
realm as a 4 or 5 year old LSI megaraid.  The perc 5 series and before
are total performance dogs.  The really bad news is that you can't
generally plug in a real RAID controller on a Dell.  We put an Areca
168-LP PCI-x8 in one of our 1950s and it wouldn't even turn on, got a
CPU Error.

Dells are fine for web servers and such.  For database servers they're
a total loss.  The best you can do with one is to put a generic SCSI
card in it and connect to an external array with its own controller.

We have a perc6e and a perc5e in two different servers, and no matter
how we configure them, we can't get even 1/10th the performance of an
Areca controller with the same number of drives on another machine of
the same basic class as the 1950s.

>> Also, what do you get for this (need contrib module pgbench installed)
>>
>> pgbench -i -s 100
>> pgbench -c 50 -n 10000
>>
>> ? Specifically transactions per second?
>
> I'll run pgbench in the next days.

Cool.  That pgbench is a "best case scenario" benchmark.  Lots of
small transactions on a db that should fit into memory.  If you can't
pull off a decent number there (at least a few hundred tps) then can't
expect better performance from real world usage.

Oh, and that should be:

pgbench -c 50 -t 10000

not -n... not enough sleep I guess.

Re: understanding postgres issues/bottlenecks

From
"Merlin Moncure"
Date:
On Thu, Jan 8, 2009 at 3:36 AM, Stefano Nichele
<stefano.nichele@gmail.com> wrote:
> Find !
>
> Dell CERC SATA RAID 2 PCI SATA 6ch
>
> Running lspci -v:
>
> 03:09.0 RAID bus controller: Adaptec AAC-RAID (rev 01)
>         Subsystem: Dell CERC SATA RAID 2 PCI SATA 6ch (DellCorsair)

IIRC that's the 'perc 6ir' card...no write caching.  You are getting
killed with syncs. If you can restart the database, you can test with
fsync=off comparing load to confirm this.  (another way is to compare
select only vs regular transactions on pgbench).
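
roughly: set fsync = off in postgresql.conf, restart, and re-run the same
load; or, without touching fsync, compare something like

pgbench -c 50 -t 10000 yourdb        # regular read/write transactions
pgbench -S -c 50 -t 10000 yourdb     # -S: select-only, nothing to sync

(the database name is whatever you initialized with pgbench -i). if the
select-only run is dramatically faster, it's the syncs.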

merlin

Re: understanding postgres issues/bottlenecks

From
Glyn Astill
Date:
--- On Thu, 8/1/09, Stefano Nichele <stefano.nichele@gmail.com> wrote:

> From: Stefano Nichele <stefano.nichele@gmail.com>
> Subject: Re: [PERFORM] understanding postgres issues/bottlenecks
> To: "Scott Marlowe" <scott.marlowe@gmail.com>
> Cc: pgsql-performance@postgresql.org
> Date: Thursday, 8 January, 2009, 8:36 AM
> Find !
>
> Dell CERC SATA RAID 2 PCI SATA 6ch
>
> Running lspci -v:
>
> 03:09.0 RAID bus controller: Adaptec AAC-RAID (rev 01)
>         Subsystem: Dell CERC SATA RAID 2 PCI SATA 6ch
> (DellCorsair)
>         Flags: bus master, 66MHz, slow devsel, latency 32,
> IRQ 209
>         Memory at f8000000 (32-bit, prefetchable)
> [size=64M]
>         Expansion ROM at fe800000 [disabled] [size=32K]
>         Capabilities: [80] Power Management version 2
>

Hmm, the 64M / 6ch makes it sound like this card

http://accessories.us.dell.com/sna/products/Controllers/productdetail.aspx?c=us&l=en&s=bsd&cs=04&sku=310-5975

Which is a 6ch dell version of

http://www.adaptec.com/en-US/support/raid/sata/AAR-2410SA/

I have one on my smash pile.  The only thing that makes me think otherwise is the 2 in "CERC SATA RAID 2" ...





Re: understanding postgres issues/bottlenecks

From
Stefano Nichele
Date:
Glyn Astill wrote:
> --- On Thu, 8/1/09, Stefano Nichele <stefano.nichele@gmail.com> wrote:
>
>
>> From: Stefano Nichele <stefano.nichele@gmail.com>
>> Subject: Re: [PERFORM] understanding postgres issues/bottlenecks
>> To: "Scott Marlowe" <scott.marlowe@gmail.com>
>> Cc: pgsql-performance@postgresql.org
>> Date: Thursday, 8 January, 2009, 8:36 AM
>> Find !
>>
>> Dell CERC SATA RAID 2 PCI SATA 6ch
>>
>> Running lspci -v:
>>
>> 03:09.0 RAID bus controller: Adaptec AAC-RAID (rev 01)
>>         Subsystem: Dell CERC SATA RAID 2 PCI SATA 6ch
>> (DellCorsair)
>>         Flags: bus master, 66MHz, slow devsel, latency 32,
>> IRQ 209
>>         Memory at f8000000 (32-bit, prefetchable)
>> [size=64M]
>>         Expansion ROM at fe800000 [disabled] [size=32K]
>>         Capabilities: [80] Power Management version 2
>>
>>
>
> Hmm, the 64M / 6ch makes it sound like this card
>
> http://accessories.us.dell.com/sna/products/Controllers/productdetail.aspx?c=us&l=en&s=bsd&cs=04&sku=310-5975
>

Indeed I think it's this one. The server is a Power Edge 1800 Server.

> Which is a 6ch dell version of
>
> http://www.adaptec.com/en-US/support/raid/sata/AAR-2410SA/
>
> I have one on my smash pile.  The only thing that makes me think otherwise is the 2 in "CERC SATA RAID 2" ...
>
>
>
>
>
>



Re: understanding postgres issues/bottlenecks

From
Stefano Nichele
Date:
Merlin Moncure wrote:
> On Thu, Jan 8, 2009 at 3:36 AM, Stefano Nichele
> <stefano.nichele@gmail.com> wrote:
>
>> Find !
>>
>> Dell CERC SATA RAID 2 PCI SATA 6ch
>>
>> Running lspci -v:
>>
>> 03:09.0 RAID bus controller: Adaptec AAC-RAID (rev 01)
>>         Subsystem: Dell CERC SATA RAID 2 PCI SATA 6ch (DellCorsair)
>>
>
> IIRC that's the 'perc 6ir' card...no write caching.  You are getting
> killed with syncs. If you can restart the database, you can test with
> fsync=off comparing load to confirm this.  (another way is to compare
> select only vs regular transactions on pgbench).
>
> merlin
>
>

I'll try next Saturday.

thanks
ste

Re: understanding postgres issues/bottlenecks

From
Scott Carey
Date:
If it is a PowerEdge 1800 with a low-end non-write-cache controller, then 100 iops per SATA drive isn’t too far off from what you should expect.

Contrary to what others are saying, I don’t think that turning fsync off is killing you, you might get 25% more throughput at best.

The iostat output would indicate that you are getting just about all you can get out of it without getting faster drives or controllers.

Since it is mostly random read bound, more RAM in the system would certainly help reduce the load, as would an I/O subsystem with high random i/o throughput (quality flash drives, the best quality ones currently being Intel X-25 M until the next gen Micron, Samsung, Mtron, and SanDisk ones come out — do NOT use any that don’t have high random write perf).

I’m guessing that the cheapest thing to do is increase the RAM on the server — I think you can fit 16GB in an 1800, if not at least 8GB, and increase shared_buffers to between 10% and 25% of that.  Upgrading to 8.3.x will further help.


On 1/8/09 6:41 AM, "Stefano Nichele" <stefano.nichele@gmail.com> wrote:

Glyn Astill wrote:
> --- On Thu, 8/1/09, Stefano Nichele <stefano.nichele@gmail.com> wrote:
>
>
>> From: Stefano Nichele <stefano.nichele@gmail.com>
>> Subject: Re: [PERFORM] understanding postgres issues/bottlenecks
>> To: "Scott Marlowe" <scott.marlowe@gmail.com>
>> Cc: pgsql-performance@postgresql.org
>> Date: Thursday, 8 January, 2009, 8:36 AM
>> Find !
>>
>> Dell CERC SATA RAID 2 PCI SATA 6ch
>>
>> Running lspci -v:
>>
>> 03:09.0 RAID bus controller: Adaptec AAC-RAID (rev 01)
>>         Subsystem: Dell CERC SATA RAID 2 PCI SATA 6ch
>> (DellCorsair)
>>         Flags: bus master, 66MHz, slow devsel, latency 32,
>> IRQ 209
>>         Memory at f8000000 (32-bit, prefetchable)
>> [size=64M]
>>         Expansion ROM at fe800000 [disabled] [size=32K]
>>         Capabilities: [80] Power Management version 2
>>
>>
>
> Hmm, the 64M / 6ch makes it sound like this card
>
> http://accessories.us.dell.com/sna/products/Controllers/productdetail.aspx?c=us&l=en&s=bsd&cs=04&sku=310-5975
>

Indeed I think it's this one. The server is a Power Edge 1800 Server.

> Which is a 6ch dell version of
>
> http://www.adaptec.com/en-US/support/raid/sata/AAR-2410SA/
>
> I have one on my smash pile.  The only thing that makes me think otherwise is the 2 in "CERC SATA RAID 2" ...
>
>
>
>
>
>




Re: understanding postgres issues/bottlenecks

From
"Merlin Moncure"
Date:
On Thu, Jan 8, 2009 at 9:42 AM, Stefano Nichele
<stefano.nichele@gmail.com> wrote:
> Merlin Moncure wrote:
>> IIRC that's the 'perc 6ir' card...no write caching.  You are getting
>> killed with syncs. If you can restart the database, you can test with
>> fsync=off comparing load to confirm this.  (another way is to compare
>> select only vs regular transactions on pgbench).
>
> I'll try next Saturday.
>

just be aware of the danger.  A hard reset (power off) class of failure
when fsync = off means you are loading from backups.

merlin

Re: understanding postgres issues/bottlenecks

From
Ron
Date:
At 03:28 PM 1/8/2009, Merlin Moncure wrote:
>On Thu, Jan 8, 2009 at 9:42 AM, Stefano Nichele
><stefano.nichele@gmail.com> wrote:
> > Merlin Moncure wrote:
> >> IIRC that's the 'perc 6ir' card...no write caching.  You are getting
> >> killed with syncs. If you can restart the database, you can test with
> >> fsync=off comparing load to confirm this.  (another way is to compare
> >> select only vs regular transactions on pgbench).
> >
> > I'll try next Saturday.
> >
>
>just be aware of the danger .  hard reset (power off) class of failure
>when fsync = off means you are loading from backups.
>
>merlin
That's what redundant power conditioning UPS's are supposed to help prevent ;-)

Merlin is of course absolutely correct that you are taking a bigger
risk if you turn fsync off.

I would not recommend fsync = off if you do not have other safety
measures in place to protect against data loss because of a power event.
(At least for most DB applications.)

...and of course, those lucky few with bigger budgets can use SSD's
and not care what fsync is set to.

Ron


Re: understanding postgres issues/bottlenecks

From
"Scott Marlowe"
Date:
On Sat, Jan 10, 2009 at 5:40 AM, Ron <rjpeace@earthlink.net> wrote:
> At 03:28 PM 1/8/2009, Merlin Moncure wrote:
>> just be aware of the danger .  hard reset (power off) class of failure
>> when fsync = off means you are loading from backups.
>
> That's what redundant power conditioning UPS's are supposed to help prevent
> ;-)

But of course they can't prevent them, only reduce the likelihood of
their occurrence.  Everyone who's working in large hosting
environments has at least one horror story to tell about a power
outage that never should have happened.

> I would not recommend fysnc = off if you do not have other safety measures
> in place to protect against data loss because of a power event..
> (At least for most DB applications.)

Agreed.   Keep in mind that you'll be losing whatever wasn't
transferred to the backup machines.

> ...and of course, those lucky few with bigger budgets can use SSD's and not
> care what fsync is set to.

Would that prevent any corruption if the writes got out of order
because of lack of fsync?  Or partial writes?  Or wouldn't fsync still
need to be turned on to keep the data safe?

Re: understanding postgres issues/bottlenecks

From
Gregory Stark
Date:
"Scott Marlowe" <scott.marlowe@gmail.com> writes:

> On Sat, Jan 10, 2009 at 5:40 AM, Ron <rjpeace@earthlink.net> wrote:
>> At 03:28 PM 1/8/2009, Merlin Moncure wrote:
>>> just be aware of the danger .  hard reset (power off) class of failure
>>> when fsync = off means you are loading from backups.
>>
>> That's what redundant power conditioning UPS's are supposed to help prevent
>> ;-)
>
> But of course, they can't prevent them, but only reduce the likelihood
> of their occurrance.  Everyone who's working in large hosting
> environments has at least one horror story to tell about a power
> outage that never should have happened.

Or a system crash. If the kernel panics for any reason when it has dirty
buffers in memory the database will need to be restored.

>> ...and of course, those lucky few with bigger budgets can use SSD's and not
>> care what fsync is set to.
>
> Would that prevent any corruption if the writes got out of order
> because of lack of fsync?  Or partial writes?  Or wouldn't fsync still
> need to be turned on to keep the data safe.

I think the idea is that with SSDs or a RAID with a battery backed cache you
can leave fsync on and not have any significant performance hit since the seek
times are very fast for SSD. They have limited bandwidth but bandwidth to the
WAL is rarely an issue -- just latency.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!

Re: understanding postgres issues/bottlenecks

From
david@lang.hm
Date:
On Sat, 10 Jan 2009, Gregory Stark wrote:

>>> ...and of course, those lucky few with bigger budgets can use SSD's and not
>>> care what fsync is set to.
>>
>> Would that prevent any corruption if the writes got out of order
>> because of lack of fsync?  Or partial writes?  Or wouldn't fsync still
>> need to be turned on to keep the data safe.
>
> I think the idea is that with SSDs or a RAID with a battery backed cache you
> can leave fsync on and not have any significant performance hit since the seek
> times are very fast for SSD. They have limited bandwidth but bandwidth to the
> WAL is rarely an issue -- just latency.

I don't think that this is true. Even if your SSD is battery-backed RAM
(as opposed to the flash-based devices that have slower writes than
high-end hard drives), you can complete 'writes' to system RAM faster
than the OS can get the data to the drive, so if you don't do an fsync you
can still lose a lot in a power outage.

RAID controllers with battery-backed RAM cache will make the fsyncs very
cheap (until the cache fills up, anyway).

With SSDs having extremely good read speeds but poor (at least by
comparison) write speeds, I wonder if any of the RAID controllers are going
to get a mode where they cache writes but don't cache reads, leaving all
of your cache to handle writes.

David Lang

Re: understanding postgres issues/bottlenecks

From
Markus Wanner
Date:
Hi,

david@lang.hm wrote:
> On Sat, 10 Jan 2009, Gregory Stark wrote:
>> I think the idea is that with SSDs or a RAID with a battery backed
>> cache you
>> can leave fsync on and not have any significant performance hit since
>> the seek
>> times are very fast for SSD. They have limited bandwidth but bandwidth
>> to the
>> WAL is rarely an issue -- just latency.

That's also my understanding.

> with SSDs having extremely good read speeds, but poor (at least by
> comparison) write speeds I wonder if any of the RAID controllers are
> going to get a mode where they cache writes, but don't cache reads,
> leaving all ofyour cache to handle writes.

My understanding of SSDs so far is that they are not that bad at
writing *on average*, but to perform wear-leveling they sometimes have
to shuffle around multiple blocks at once. So there are pretty awful
spikes in write latency (IIRC more than 100ms has been measured on
cheaper disks).

A battery-backed cache could theoretically flatten those, as long as
your avg. WAL throughput is below the SSD's avg. write throughput.

Regards

Markus Wanner

Re: understanding postgres issues/bottlenecks

From
"Scott Marlowe"
Date:
On Sat, Jan 10, 2009 at 12:00 PM, Markus Wanner <markus@bluegap.ch> wrote:
> Hi,
>
> david@lang.hm wrote:
>> On Sat, 10 Jan 2009, Gregory Stark wrote:
>>> I think the idea is that with SSDs or a RAID with a battery backed
>>> cache you
>>> can leave fsync on and not have any significant performance hit since
>>> the seek
>>> times are very fast for SSD. They have limited bandwidth but bandwidth
>>> to the
>>> WAL is rarely an issue -- just latency.
>
> That's also my understanding.
>
>> with SSDs having extremely good read speeds, but poor (at least by
>> comparison) write speeds I wonder if any of the RAID controllers are
>> going to get a mode where they cache writes, but don't cache reads,
>> leaving all ofyour cache to handle writes.
>
> My understanding of SSDs so far is, that they are not that bad at
> writing *on average*, but to perform wear-leveling, they sometimes have
> to shuffle around multiple blocks at once. So there are pretty awful
> spikes for writing latency (IIRC more than 100ms has been measured on
> cheaper disks).

Multiply it by 10 and apply it to both reads and writes for most cheap
SSDs when doing random writes and reads mixed together.  Which is why
so many discussions specifically mention the Intel X25-M series, because
they don't suck like that.  They keep good access times even under
several random read / write threads.

A review of the others was posted here a while back, and it was
astounding how slow they became in a mixed read / write
benchmark.

Re: understanding postgres issues/bottlenecks

From
david@lang.hm
Date:
On Sat, 10 Jan 2009, Markus Wanner wrote:

> david@lang.hm wrote:
>> On Sat, 10 Jan 2009, Gregory Stark wrote:
>>> I think the idea is that with SSDs or a RAID with a battery backed
>>> cache you
>>> can leave fsync on and not have any significant performance hit since
>>> the seek
>>> times are very fast for SSD. They have limited bandwidth but bandwidth
>>> to the
>>> WAL is rarely an issue -- just latency.
>
> That's also my understanding.
>
>> with SSDs having extremely good read speeds, but poor (at least by
>> comparison) write speeds I wonder if any of the RAID controllers are
>> going to get a mode where they cache writes, but don't cache reads,
>> leaving all ofyour cache to handle writes.
>
> My understanding of SSDs so far is, that they are not that bad at
> writing *on average*, but to perform wear-leveling, they sometimes have
> to shuffle around multiple blocks at once. So there are pretty awful
> spikes for writing latency (IIRC more than 100ms has been measured on
> cheaper disks).

Well, I have one of those cheap disks.

Brand new out of the box, I formatted the 32G drive, then copied large files to
it (~1G per file). This should do almost no wear-leveling, but its write
performance is still poor and it has occasional 1-second pauses.

For my initial tests I hooked it up to a USB->SATA adapter, and the write
speed is showing about half of what I can get on a 1.5TB SATA drive hooked
to the same system.

The write speed is fairly comparable to what you can do with slow laptop
drives (even ignoring the pauses).

Read speed is much better (and I think limited by the USB).

The key thing with any new storage technology (including RAID controllers)
is that you need to do your own testing; treat the manufacturer's specs as
ideal conditions or 'we guarantee that the product will never do better
than this' specs.

Imation has a white paper on their site about solid state drive
performance that is interesting. Among other things it shows that
high-speed SCSI drives are still a significant win in
random-write workloads.


At this point, if I were spec'ing out a new high-end system I would be
looking at and testing something like the following:

SSD for read-mostly items (OS, possibly some indexes)
15K SCSI drives for heavy writing (WAL, indexes, temp tables, etc)
SATA drives for storage capacity (table contents)

David Lang

Re: understanding postgres issues/bottlenecks

From
Gregory Stark
Date:
david@lang.hm writes:

> On Sat, 10 Jan 2009, Markus Wanner wrote:
>
>> My understanding of SSDs so far is, that they are not that bad at
>> writing *on average*, but to perform wear-leveling, they sometimes have
>> to shuffle around multiple blocks at once. So there are pretty awful
>> spikes for writing latency (IIRC more than 100ms has been measured on
>> cheaper disks).

That would be fascinating. And frightening. A lot of people have been
recommending these for WAL disks, and this would make them actually *worse*
than regular drives.

> well, I have one of those cheap disks.
>
> brand new out of the box, format the 32G drive, then copy large files to it
> (~1G per file). this should do almost no wear-leveling, but it's write
> performance is still poor and it has occasional 1 second pauses.

This isn't similar to the way WAL behaves though. What you're testing is the
behaviour when the bandwidth to the SSD is saturated. At that point, some layer
in the stack, whether in the SSD, the USB hardware or driver, or the OS buffer
cache, can start to queue up writes. The stalls you see could be the behaviour
when that queue fills up and it needs to push back to higher layers.

To simulate WAL you want to transfer smaller volumes of data, well below the
bandwidth limit of the drive, fsync the data, then pause a bit and repeat. Time
each fsync and see whether the time they take is proportional to the amount of
data written in the meantime or whether they randomly spike upwards.
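
(For reference, a minimal sketch of that kind of test, written in Python and
not taken from this thread: it writes small WAL-sized chunks well below the
drive's bandwidth, fsyncs each one, pauses, and reports per-fsync latency.
The target path is a placeholder for a file on the device under test.)

#!/usr/bin/env python
# Minimal fsync-latency probe: write a small WAL-sized chunk, time the fsync,
# pause, repeat. Occasional large spikes would match the wear-leveling-stall
# theory; times proportional to the data written in the meantime would not.
import os, time

PATH = "/mnt/ssd/fsync_test.dat"   # placeholder: a file on the drive under test
CHUNK = b"\x00" * 8192             # 8 kB, roughly one WAL page
ITERATIONS = 1000
PAUSE = 0.05                       # 50 ms between writes, far below drive bandwidth

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o600)
latencies = []
try:
    for _ in range(ITERATIONS):
        os.write(fd, CHUNK)
        start = time.time()
        os.fsync(fd)               # time only the fsync itself
        latencies.append(time.time() - start)
        time.sleep(PAUSE)
finally:
    os.close(fd)

latencies.sort()
print("fsync latency: min %.4fs  median %.4fs  max %.4fs"
      % (latencies[0], latencies[len(latencies) // 2], latencies[-1]))

Run against a file on the device being tested; a tight median with occasional
large maxima would point at wear-leveling stalls rather than saturation.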

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!

Re: understanding postgres issues/bottlenecks

From
david@lang.hm
Date:
On Sat, 10 Jan 2009, Gregory Stark wrote:

> david@lang.hm writes:
>
>> On Sat, 10 Jan 2009, Markus Wanner wrote:
>>
>>> My understanding of SSDs so far is, that they are not that bad at
>>> writing *on average*, but to perform wear-leveling, they sometimes have
>>> to shuffle around multiple blocks at once. So there are pretty awful
>>> spikes for writing latency (IIRC more than 100ms has been measured on
>>> cheaper disks).
>
> That would be fascinating. And frightening. A lot of people have been
> recommending these for WAL disks and this would be make them actually *worse*
> than regular drives.
>
>> well, I have one of those cheap disks.
>>
>> brand new out of the box, format the 32G drive, then copy large files to it
>> (~1G per file). this should do almost no wear-leveling, but it's write
>> performance is still poor and it has occasional 1 second pauses.
>
> This isn't similar to the way WAL behaves though. What you're testing is the
> behaviour when the bandwidth to the SSD is saturated. At that point some point
> in the stack, whether in the SSD, the USB hardware or driver, or OS buffer
> cache can start to queue up writes. The stalls you see could be the behaviour
> when that queue fills up and it needs to push back to higher layers.
>
> To simulate WAL you want to transfer smaller volumes of data, well below the
> bandwidth limit of the drive, fsync the data, then pause a bit repeat. Time
> each fsync and see whether the time they take is proportional to the amount of
> data written in the meantime or whether they randomly spike upwards.

if you have a specific benchmark for me to test I would be happy to do
this.

The test that I did is basically the best case for the SSD (more-or-less
sequential writes, where the vendors claim that the drives match or
slightly outperform the traditional disks). For random writes the vendors
put SSDs at fewer IOPS than 5400 rpm drives, let alone 15K rpm drives.

take a look at this paper
http://www.imation.com/PageFiles/83/Imation-SSD-Performance-White-Paper.pdf

This is not one of the low-performance drives; they include a SanDisk
drive in the paper that shows significantly less performance (but the same
basic pattern) than the Imation drives.

David Lang

Re: understanding postgres issues/bottlenecks

From
Ron
Date:
At 10:36 AM 1/10/2009, Gregory Stark wrote:
>"Scott Marlowe" <scott.marlowe@gmail.com> writes:
>
> > On Sat, Jan 10, 2009 at 5:40 AM, Ron <rjpeace@earthlink.net> wrote:
> >> At 03:28 PM 1/8/2009, Merlin Moncure wrote:
> >>> just be aware of the danger .  hard reset (power off) class of failure
> >>> when fsync = off means you are loading from backups.
> >>
> >> That's what redundant power conditioning UPS's are supposed to
> help prevent
> >> ;-)
> >
> > But of course, they can't prevent them, but only reduce the likelihood
> > of their occurrance.  Everyone who's working in large hosting
> > environments has at least one horror story to tell about a power
> > outage that never should have happened.
>
>Or a system crash. If the kernel panics for any reason when it has dirty
>buffers in memory the database will need to be restored.
A power conditioning UPS should prevent a building-wide or circuit-level
bad power event, caused by either dirty power or a power loss,
from affecting the host.  Within the design limits of the UPS in
question, of course.

So the real worry with fsync = off in an environment with redundant,
decent UPS's is pretty much limited to host-level HW failures, SW
crashes, and unlikely catastrophes like building collapses, lightning
strikes, floods, etc.
Not that your fsync setting is going to matter much in the event of
catastrophes in the physical environment...

Like anything else, there is usually more than one way to reduce risk
while at the same time meeting (realistic) performance goals.
If you need the performance implied by fsync off, then you have to
take other steps to reduce the risk of data corruption down to about
the same statistical level as running with fsync on.  Or you have to
decide that you are willing to live with the increased risk (NOT my
recommendation for most DB hosting scenarios).


> >> ...and of course, those lucky few with bigger budgets can use
> SSD's and not
> >> care what fsync is set to.
> >
> > Would that prevent any corruption if the writes got out of order
> > because of lack of fsync?  Or partial writes?  Or wouldn't fsync still
> > need to be turned on to keep the data safe.
>
>I think the idea is that with SSDs or a RAID with a battery backed cache you
>can leave fsync on and not have any significant performance hit since the seek
>times are very fast for SSD. They have limited bandwidth but bandwidth to the
>WAL is rarely an issue -- just latency.
Yes, Greg understands what I meant here.  In the case of SSDs, the
performance hit of fsync = on is essentially zero.  In the case of
battery backed RAM caches for RAID arrays, the efficacy is dependent
on how the size of the cache compares with the working set of the
disk access pattern.

Ron


Re: understanding postgres issues/bottlenecks

From
david@lang.hm
Date:
On Sat, 10 Jan 2009, Ron wrote:

> At 10:36 AM 1/10/2009, Gregory Stark wrote:
>> "Scott Marlowe" <scott.marlowe@gmail.com> writes:
>>
>> > On Sat, Jan 10, 2009 at 5:40 AM, Ron <rjpeace@earthlink.net> wrote:
>> >> At 03:28 PM 1/8/2009, Merlin Moncure wrote:
>> >>> just be aware of the danger .  hard reset (power off) class of failure
>> >>> when fsync = off means you are loading from backups.
>> >>
>> >> That's what redundant power conditioning UPS's are supposed to help
>> prevent
>> >> ;-)
>> >
>> > But of course, they can't prevent them, but only reduce the likelihood
>> > of their occurrance.  Everyone who's working in large hosting
>> > environments has at least one horror story to tell about a power
>> > outage that never should have happened.
>>
>> Or a system crash. If the kernel panics for any reason when it has dirty
>> buffers in memory the database will need to be restored.
> A power conditioning UPS should prevent a building wide or circuit level bad
> power event, caused by either dirty power or a power loss, from affecting the
> host.  Within the design limits of the UPS in question of course.
>
> So the real worry with fsync = off in a environment with redundant decent
> UPS's is pretty much limited to host level HW failures, SW crashes, and
> unlikely catastrophes like building collapses, lightning strikes, floods,
> etc.

I've seen datacenters with redundant UPSs go dark unexpectedly. It's less
common, but it does happen.

> Not that your fsync setting is going to matter much in the event of
> catastrophes in the physical environment...

Questionable, but sometimes true. In physical-environment disasters
you will lose access to your data for a while, but after the drives are
dug out of the rubble (or dried out from the flood) the data can probably
be recovered.

For crying out loud, they were able to recover most of the data from the
hard drives in the latest shuttle disaster.

> Like anything else, there is usually more than one way to reduce risk while
> at the same time meeting (realistic) performance goals.

very true.

>> >> ...and of course, those lucky few with bigger budgets can use SSD's and
>> not
>> >> care what fsync is set to.
>> >
>> > Would that prevent any corruption if the writes got out of order
>> > because of lack of fsync?  Or partial writes?  Or wouldn't fsync still
>> > need to be turned on to keep the data safe.
>>
>> I think the idea is that with SSDs or a RAID with a battery backed cache
>> you
>> can leave fsync on and not have any significant performance hit since the
>> seek
>> times are very fast for SSD. They have limited bandwidth but bandwidth to
>> the
>> WAL is rarely an issue -- just latency.
> Yes, Greg understands what I meant here.  In the case of SSDs, the
> performance hit of fsync = on is essentially zero.

This is definitely not the case.

With fsync off, the data stays in memory and may never end up being sent to the
drive. RAM speeds are several orders of magnitude faster than the
interfaces to the drives (or even to the RAID controllers in high-speed
slots).

It may be that it's fast enough (see the other posts disputing that), but
don't think that it's the same.

David Lang

> In the case of battery
> backed RAM caches for RAID arrays, the efficacy is dependent on how the size
> of the cache compares with the working set of the disk access pattern.
>
> Ron
>
>

Re: understanding postgres issues/bottlenecks

From
Gregory Stark
Date:
Ron <rjpeace@earthlink.net> writes:

> At 10:36 AM 1/10/2009, Gregory Stark wrote:
>>
>> Or a system crash. If the kernel panics for any reason when it has dirty
>> buffers in memory the database will need to be restored.
>
> A power conditioning UPS should prevent a building wide or circuit level bad
> power event

Except of course those caused *by* a faulty UPS. Or for that matter by the
power supply in the computer or drive array, or someone just accidentally
hitting the wrong power button.

I'm surprised people are so confident in their kernels though. I know some
computers with uptimes measured in years but I know far more which don't.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!

Re: understanding postgres issues/bottlenecks

From
david@lang.hm
Date:
On Sat, 10 Jan 2009, Luke Lonergan wrote:

> The new MLC based SSDs have better wear leveling tech and don't suffer
> the pauses.  Intel X25-M 80 and 160 GB SSDs are both pause-free.  See
> Anandtech's test results for details.

They don't suffer the pauses, but they still don't have fantastic write
speeds.

David Lang

> Intel's SLC SSDs should also be good enough but they're smaller.
>
> - Luke
>
> ----- Original Message -----
> From: pgsql-performance-owner@postgresql.org <pgsql-performance-owner@postgresql.org>
> To: Gregory Stark <stark@enterprisedb.com>
> Cc: Markus Wanner <markus@bluegap.ch>; Scott Marlowe <scott.marlowe@gmail.com>; Ron <rjpeace@earthlink.net>;
pgsql-performance@postgresql.org<pgsql-performance@postgresql.org> 
> Sent: Sat Jan 10 14:40:51 2009
> Subject: Re: [PERFORM] understanding postgres issues/bottlenecks
>
> On Sat, 10 Jan 2009, Gregory Stark wrote:
>
>> david@lang.hm writes:
>>
>>> On Sat, 10 Jan 2009, Markus Wanner wrote:
>>>
>>>> My understanding of SSDs so far is, that they are not that bad at
>>>> writing *on average*, but to perform wear-leveling, they sometimes have
>>>> to shuffle around multiple blocks at once. So there are pretty awful
>>>> spikes for writing latency (IIRC more than 100ms has been measured on
>>>> cheaper disks).
>>
>> That would be fascinating. And frightening. A lot of people have been
>> recommending these for WAL disks and this would be make them actually *worse*
>> than regular drives.
>>
>>> well, I have one of those cheap disks.
>>>
>>> brand new out of the box, format the 32G drive, then copy large files to it
>>> (~1G per file). this should do almost no wear-leveling, but it's write
>>> performance is still poor and it has occasional 1 second pauses.
>>
>> This isn't similar to the way WAL behaves though. What you're testing is the
>> behaviour when the bandwidth to the SSD is saturated. At that point some point
>> in the stack, whether in the SSD, the USB hardware or driver, or OS buffer
>> cache can start to queue up writes. The stalls you see could be the behaviour
>> when that queue fills up and it needs to push back to higher layers.
>>
>> To simulate WAL you want to transfer smaller volumes of data, well below the
>> bandwidth limit of the drive, fsync the data, then pause a bit repeat. Time
>> each fsync and see whether the time they take is proportional to the amount of
>> data written in the meantime or whether they randomly spike upwards.
>
> if you have a specific benchmark for me to test I would be happy to do
> this.
>
> the test that I did is basicly the best-case for the SSD (more-or-less
> sequential writes where the vendors claim that the drives match or
> slightly outperform the traditional disks). for random writes the vendors
> put SSDs at fewer IOPS than 5400 rpm drives, let along 15K rpm drives.
>
> take a look at this paper
> http://www.imation.com/PageFiles/83/Imation-SSD-Performance-White-Paper.pdf
>
> this is not one of the low-performance drives, they include a sandisk
> drive in the paper that shows significantly less performance (but the same
> basic pattern) than the imation drives.
>
> David Lang
>
>

Re: understanding postgres issues/bottlenecks

From
Luke Lonergan
Date:
The new MLC-based SSDs have better wear-leveling tech and don't suffer the pauses.  Intel X25-M 80 and 160 GB SSDs are
both pause-free.  See Anandtech's test results for details.
 

Intel's SLC SSDs should also be good enough but they're smaller.

- Luke

----- Original Message -----
From: pgsql-performance-owner@postgresql.org <pgsql-performance-owner@postgresql.org>
To: Gregory Stark <stark@enterprisedb.com>
Cc: Markus Wanner <markus@bluegap.ch>; Scott Marlowe <scott.marlowe@gmail.com>; Ron <rjpeace@earthlink.net>;
pgsql-performance@postgresql.org <pgsql-performance@postgresql.org>
 
Sent: Sat Jan 10 14:40:51 2009
Subject: Re: [PERFORM] understanding postgres issues/bottlenecks

On Sat, 10 Jan 2009, Gregory Stark wrote:

> david@lang.hm writes:
>
>> On Sat, 10 Jan 2009, Markus Wanner wrote:
>>
>>> My understanding of SSDs so far is, that they are not that bad at
>>> writing *on average*, but to perform wear-leveling, they sometimes have
>>> to shuffle around multiple blocks at once. So there are pretty awful
>>> spikes for writing latency (IIRC more than 100ms has been measured on
>>> cheaper disks).
>
> That would be fascinating. And frightening. A lot of people have been
> recommending these for WAL disks and this would be make them actually *worse*
> than regular drives.
>
>> well, I have one of those cheap disks.
>>
>> brand new out of the box, format the 32G drive, then copy large files to it
>> (~1G per file). this should do almost no wear-leveling, but it's write
>> performance is still poor and it has occasional 1 second pauses.
>
> This isn't similar to the way WAL behaves though. What you're testing is the
> behaviour when the bandwidth to the SSD is saturated. At that point some point
> in the stack, whether in the SSD, the USB hardware or driver, or OS buffer
> cache can start to queue up writes. The stalls you see could be the behaviour
> when that queue fills up and it needs to push back to higher layers.
>
> To simulate WAL you want to transfer smaller volumes of data, well below the
> bandwidth limit of the drive, fsync the data, then pause a bit repeat. Time
> each fsync and see whether the time they take is proportional to the amount of
> data written in the meantime or whether they randomly spike upwards.

if you have a specific benchmark for me to test I would be happy to do
this.

the test that I did is basicly the best-case for the SSD (more-or-less
sequential writes where the vendors claim that the drives match or
slightly outperform the traditional disks). for random writes the vendors
put SSDs at fewer IOPS than 5400 rpm drives, let along 15K rpm drives.

take a look at this paper
http://www.imation.com/PageFiles/83/Imation-SSD-Performance-White-Paper.pdf

this is not one of the low-performance drives, they include a sandisk
drive in the paper that shows significantly less performance (but the same
basic pattern) than the imation drives.

David Lang


Re: understanding postgres issues/bottlenecks

From
Luke Lonergan
Date:
I believe they write at 200MB/s, which is outstanding for sequential BW.  Not sure about the write latency, though the
Anandtech benchmark results showed high detail and IIRC the write latencies were very good.
 

- Luke

----- Original Message -----
From: david@lang.hm <david@lang.hm>
To: Luke Lonergan
Cc: stark@enterprisedb.com <stark@enterprisedb.com>; markus@bluegap.ch <markus@bluegap.ch>; scott.marlowe@gmail.com
<scott.marlowe@gmail.com>; rjpeace@earthlink.net <rjpeace@earthlink.net>; pgsql-performance@postgresql.org
<pgsql-performance@postgresql.org>
Sent: Sat Jan 10 16:03:32 2009
Subject: Re: [PERFORM] understanding postgres issues/bottlenecks

On Sat, 10 Jan 2009, Luke Lonergan wrote:

> The new MLC based SSDs have better wear leveling tech and don't suffer
> the pauses.  Intel X25-M 80 and 160 GB SSDs are both pause-free.  See
> Anandtech's test results for details.

they don't suffer the pauses, but they still don't have fantasic write
speeds.

David Lang

> Intel's SLC SSDs should also be good enough but they're smaller.
>
> - Luke
>
> ----- Original Message -----
> From: pgsql-performance-owner@postgresql.org <pgsql-performance-owner@postgresql.org>
> To: Gregory Stark <stark@enterprisedb.com>
> Cc: Markus Wanner <markus@bluegap.ch>; Scott Marlowe <scott.marlowe@gmail.com>; Ron <rjpeace@earthlink.net>;
pgsql-performance@postgresql.org<pgsql-performance@postgresql.org>
 
> Sent: Sat Jan 10 14:40:51 2009
> Subject: Re: [PERFORM] understanding postgres issues/bottlenecks
>
> On Sat, 10 Jan 2009, Gregory Stark wrote:
>
>> david@lang.hm writes:
>>
>>> On Sat, 10 Jan 2009, Markus Wanner wrote:
>>>
>>>> My understanding of SSDs so far is, that they are not that bad at
>>>> writing *on average*, but to perform wear-leveling, they sometimes have
>>>> to shuffle around multiple blocks at once. So there are pretty awful
>>>> spikes for writing latency (IIRC more than 100ms has been measured on
>>>> cheaper disks).
>>
>> That would be fascinating. And frightening. A lot of people have been
>> recommending these for WAL disks and this would be make them actually *worse*
>> than regular drives.
>>
>>> well, I have one of those cheap disks.
>>>
>>> brand new out of the box, format the 32G drive, then copy large files to it
>>> (~1G per file). this should do almost no wear-leveling, but it's write
>>> performance is still poor and it has occasional 1 second pauses.
>>
>> This isn't similar to the way WAL behaves though. What you're testing is the
>> behaviour when the bandwidth to the SSD is saturated. At that point some point
>> in the stack, whether in the SSD, the USB hardware or driver, or OS buffer
>> cache can start to queue up writes. The stalls you see could be the behaviour
>> when that queue fills up and it needs to push back to higher layers.
>>
>> To simulate WAL you want to transfer smaller volumes of data, well below the
>> bandwidth limit of the drive, fsync the data, then pause a bit repeat. Time
>> each fsync and see whether the time they take is proportional to the amount of
>> data written in the meantime or whether they randomly spike upwards.
>
> if you have a specific benchmark for me to test I would be happy to do
> this.
>
> the test that I did is basicly the best-case for the SSD (more-or-less
> sequential writes where the vendors claim that the drives match or
> slightly outperform the traditional disks). for random writes the vendors
> put SSDs at fewer IOPS than 5400 rpm drives, let along 15K rpm drives.
>
> take a look at this paper
> http://www.imation.com/PageFiles/83/Imation-SSD-Performance-White-Paper.pdf
>
> this is not one of the low-performance drives, they include a sandisk
> drive in the paper that shows significantly less performance (but the same
> basic pattern) than the imation drives.
>
> David Lang
>
>

Re: understanding postgres issues/bottlenecks

From
Mark Kirkwood
Date:
david@lang.hm wrote:
> On Sat, 10 Jan 2009, Luke Lonergan wrote:
>
>> The new MLC based SSDs have better wear leveling tech and don't
>> suffer the pauses.  Intel X25-M 80 and 160 GB SSDs are both
>> pause-free.  See Anandtech's test results for details.
>
> they don't suffer the pauses, but they still don't have fantasic write
> speeds.
>
> David Lang
>
>> Intel's SLC SSDs should also be good enough but they're smaller.
>>

From what I can see, SLC SSDs are still quite superior for reliability
and (write) performance. However they are too small and too expensive
right now. Hopefully the various manufacturers are working on improving
the size/price issue for SLC, as well as improving the
performance/reliability area for the MLC products.

regards

Mark

Re: understanding postgres issues/bottlenecks

From
david@lang.hm
Date:
On Sun, 11 Jan 2009, Mark Kirkwood wrote:

> david@lang.hm wrote:
>> On Sat, 10 Jan 2009, Luke Lonergan wrote:
>>
>>> The new MLC based SSDs have better wear leveling tech and don't suffer the
>>> pauses.  Intel X25-M 80 and 160 GB SSDs are both pause-free.  See
>>> Anandtech's test results for details.
>>
>> they don't suffer the pauses, but they still don't have fantasic write
>> speeds.
>>
>> David Lang
>>
>>> Intel's SLC SSDs should also be good enough but they're smaller.
>>>
>
> From what I can see, SLC SSDs are still quite superior for reliability and
> (write) performance. However they are too small and too expensive right now.
> Hopefully the various manufacturers are working on improving the size/price
> issue for SLC, as well as improving the performance/reliability area for the
> MLC products.

the very nature of the technology means that SLC will never be as cheap as
MLC and MLC will never be as reliable as SLC

take a look at
http://www.imation.com/PageFiles/83/SSD-Reliability-Lifetime-White-Paper.pdf
for a good writeup of the technology.

for both technologies, the price will continue to drop, and the
reliability and performance will continue to climb, but I don't see
anything that would improve one without the other (well, I could see MLC
gaining a 50% capacity boost if they can get to 3 bits per cell vs the
current 2, but that would come at the cost of reliability again)

for write performance I don't think there is as much of a difference
between the two technologies. today there is a huge difference in most of
the shipping products, but Intel has now demonstrated that it's mostly due
to the controller chip, so I expect much of that difference to vanish in
the next year or so (as new generations of controller chips ship)

David Lang

Re: understanding postgres issues/bottlenecks

From
James Mansion
Date:
Ron wrote:
>> I think the idea is that with SSDs or a RAID with a battery backed
>> cache you
>> can leave fsync on and not have any significant performance hit since
>> the seek
>> times are very fast for SSD. They have limited bandwidth but
>> bandwidth to the
>> WAL is rarely an issue -- just latency.
> Yes, Greg understands what I meant here.  In the case of SSDs, the
> performance hit of fsync = on is essentially zero.  In the case of
> battery backed RAM caches for RAID arrays, the efficacy is dependent
> on how the size of the cache compares with the working set of the disk
> access pattern.
Out of interest, if we take a scenario where the working set of updates
exceeds the size of the RAID card cache, has anyone tested the relative
performance of using the battery-backed RAID for WAL only, with
non-cached access to the other drives?

And perhaps the similar scenario with (hot) indices and WAL on a
battery-backed device and the data on uncached devices?

It seems to me that if you're going to thrash the cache with data
updates (presumably courtesy of full_page_writes), then you might do
better to partition the cache - and a thrashed cache can be hardly any
better than no cache (so why have one?).


Re: understanding postgres issues/bottlenecks

From
"Stefano Nichele"
Date:
Hi All,
I ran pgbench. Here are some results:

-bash-3.1$ pgbench -c 50 -t 1000
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 50
number of transactions per client: 1000
number of transactions actually processed: 50000/50000
tps = 377.351354 (including connections establishing)
tps = 377.788377 (excluding connections establishing)

Some vmstat samplings in the meantime:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
0  4     92 127880   8252 3294512    0    0   458 12399 2441 14903 22  9 34 35  0
11 49     92 125336   8288 3297016    0    0   392 11071 2390 11568 17  7 51 24  0
 0  2     92 124548   8304 3297764    0    0   126  8249 2291 3829  5  3 64 28  0
 0  1     92 127268   7796 3295672    0    0   493 11387 2323 14221 23  9 47 21  0
 0  2     92 127256   7848 3295492    0    0   501 10654 2215 14599 24  9 42 24  0
 0  2     92 125772   7892 3295656    0    0    34  7541 2311  327  0  1 59 40  0
 0  1     92 127188   7952 3294084    0    0   537 11039 2274 15460 23 10 43 24  0
 7  4     92 123816   7996 3298620    0    0   253  8946 2284 7310 11  5 52 32  0
 0  2     92 126652   8536 3294220    0    0   440  9563 2307 9036 13  6 56 25  0
 0 10     92 125268   8584 3296116    0    0   426 10696 2285 11034 20  9 39 32  0
 0  2     92 124168   8604 3297252    0    0   104  8385 2319 4162  3  3 40 54  0
 0  8     92 123780   8648 3296456    0    0   542 11498 2298 16613 25 10 16 48  0
 

-bash-3.1$ pgbench -t 10000 -c 50
starting vacuum...end.
transaction type: SELECT only
scaling factor: 100
number of clients: 50
number of transactions per client: 10000
number of transactions actually processed: 500000/500000
tps = 8571.573651 (including connections establishing)
tps = 8594.357138 (excluding connections establishing)


-bash-3.1$ pgbench -t 10000 -c 50 -S
starting vacuum...end.
transaction type: SELECT only
scaling factor: 100
number of clients: 50
number of transactions per client: 10000
number of transactions actually processed: 500000/500000
tps = 8571.573651 (including connections establishing)
tps = 8594.357138 (excluding connections establishing)


(next test is with scaling factor 1)

-bash-3.1$ pgbench -t 20000 -c 8 -S pgbench
starting vacuum...end.
transaction type: SELECT only
scaling factor: 1
number of clients: 8
number of transactions per client: 20000
number of transactions actually processed: 160000/160000
tps = 11695.895318 (including connections establishing)
tps = 11715.603720 (excluding connections establishing)

Any comments?

I can also give you some details about the database usage of my application:
- number of active connections: about 60
- number of idle connections: about 60

Here are some numbers from an old pgfouine report of mine:
- query peak: 378 queries/s
- select: 53,1%, insert 3,8%, update 2,2 %, delete 2,8 %

The application is basically a web application and the db size is 37 GB.

Is there a way to get the number of queries per second and the percentages of select/update/insert/delete without pgfouine?
What is the performance impact of stats_start_collector = on and stats_row_level = on (they are on since I use autovacuum)?
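
(As an aside, and not an answer given in the thread: one rough way to
approximate those rates from the stats collector is to sample the cumulative
counters twice and divide by the interval. The sketch below is Python and
assumes the psycopg2 driver plus a connection string adjusted to the local
setup; xact_commit counts transactions rather than individual statements, so
the first figure is only an approximation, and the n_tup_* columns need
stats_row_level = on, which is already the case here.)

#!/usr/bin/env python
# Rough sketch: sample PostgreSQL's cumulative statistics twice and derive
# per-second rates for transactions, inserts, updates and deletes.
import time
import psycopg2

DSN = "dbname=mydb user=postgres"   # placeholder connection string
INTERVAL = 60                       # seconds between the two samples

def sample(cur):
    cur.execute("SELECT sum(xact_commit + xact_rollback) FROM pg_stat_database")
    xacts = cur.fetchone()[0]
    cur.execute("SELECT sum(n_tup_ins), sum(n_tup_upd), sum(n_tup_del)"
                " FROM pg_stat_user_tables")
    ins, upd, dele = cur.fetchone()
    return xacts, ins, upd, dele

conn = psycopg2.connect(DSN)
conn.autocommit = True              # so the second sample sees a fresh stats snapshot
cur = conn.cursor()
before = sample(cur)
time.sleep(INTERVAL)
after = sample(cur)
conn.close()

for name, b, a in zip(("transactions", "inserts", "updates", "deletes"),
                      before, after):
    print("%s/s: %.1f" % (name, (float(a) - float(b)) / INTERVAL))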

Thanks a lot for your help.

ste


On Wed, Jan 7, 2009 at 8:05 PM, Stefano Nichele <stefano.nichele@gmail.com> wrote:
Ok, here some information:

OS: Centos 5.x (Linux 2.6.18-53.1.21.el5 #1 SMP Tue May 20 09:34:18 EDT 2008 i686 i686 i386 GNU/Linux)
RAID: it's a hardware RAID controller
The disks are 9600rpm SATA drives

(6 disk 1+0 RAID array and 2 separate disks for the OS).


About iostat (on sdb I have pg_xlog, on sdc I have data)

> iostat -k
Linux 2.6.18-53.1.21.el5 (*******)       01/07/2009

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
        17.27    0.00    5.13   45.08    0.00   32.52

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              30.42        38.50       170.48  182600516  808546589
sdb              46.16         0.23        52.10    1096693  247075617
sdc             269.26       351.51       451.00 1667112043 2138954833



> iostat -x -k -d 2 5
Linux 2.6.18-53.1.21.el5 (*******)       01/07/2009

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.17    12.68  0.47 29.95    38.51   170.51    13.74     0.03    0.86   0.19   0.57
sdb               0.01    80.11  0.05 46.11     0.23    52.01     2.26     0.01    0.22   0.22   1.01
sdc               7.50    64.57 222.55 46.69   350.91   450.98     5.96     0.57    2.05   3.13  84.41

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb               0.00   196.00  1.00 117.00     4.00  1252.00    21.29     0.02    0.19   0.19   2.30
sdc               1.50    66.00 277.00 66.50  3100.00   832.00    22.89    50.84  242.30   2.91 100.10

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb               0.00   264.50  0.00 176.50     0.00  1764.00    19.99     0.04    0.21   0.21   3.70
sdc               3.50   108.50 291.50 76.00  3228.00   752.00    21.66    89.42  239.39   2.72 100.05

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     4.98  0.00  1.00     0.00    23.88    48.00     0.00    0.00   0.00   0.00
sdb               0.00    23.88  0.00  9.45     0.00   133.33    28.21     0.00    0.21   0.21   0.20
sdc               1.00   105.97 274.13 53.73  3297.51   612.94    23.85    67.99  184.58   3.04  99.55

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb               0.00    79.00  0.00 46.00     0.00   500.00    21.74     0.01    0.25   0.25   1.15
sdc               2.50   141.00 294.00 43.50  3482.00   528.00    23.76    51.33  170.46   2.96 100.05


vmstat in the same time:
> vmstat 2

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
0 27     80 126380  27304 3253016    0    0    98    55    0    1 17  5 33 45  0
0 26     80 124516  27300 3255456    0    0  3438  1724 2745 4011 11  2  8 78  0
1 25     80 124148  27276 3252548    0    0  3262  2806 3572 7007 33 11  3 53  0
1 28     80 128272  27244 3248516    0    0  2816  1006 2926 5624 12  3 12 73  0


I will run pgbench in the next days.



Aside from all the advice here about system tuning, as a system admin I'd also ask is the box doing the job you need? And are you looking at the Postgres log (with logging of slow queries) to see that queries perform in a sensible time? I'd assume with the current performance figure there is an issue somewhere, but I've been to places where it was as simple as adding one index, or even modifying an index so it does what the application developer intended instead of what they ask for ;)



I already checked the postgres log and resolved the index/slow-query issues. Actually I have queries that are sometimes really fast and sometimes time out.
But all the required indexes are there. For sure, there is room to improve performance in that way too, but I would also like to investigate the issue from other points of view (in order to understand how to monitor the server as well).


Cheers and thanks a lot.
ste



Re: understanding postgres issues/bottlenecks

From
"Stefano Nichele"
Date:

Here are some numbers from an old pgfouine report of mine:
- query peak: 378 queries/s
- select: 53,1%, insert 3,8%, update 2,2 %, delete 2,8 %


Actually the percentages are wrong (I think pgfouine also counts other types of query, like SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE;):
These are the right ones:

SELECT   num: 658,5  perc:  85,7 %
INSERT    num: 47,66  perc:  6,2 %
UPDATE  num: 27,49  perc:   3,6 %
DELETE  num: 34,56  perc:  4,5 %


ste

Re: understanding postgres issues/bottlenecks

From
"Scott Marlowe"
Date:
On Sat, Jan 10, 2009 at 2:56 PM, Ron <rjpeace@earthlink.net> wrote:
> At 10:36 AM 1/10/2009, Gregory Stark wrote:
>>
>> "Scott Marlowe" <scott.marlowe@gmail.com> writes:
>>
>> > On Sat, Jan 10, 2009 at 5:40 AM, Ron <rjpeace@earthlink.net> wrote:
>> >> At 03:28 PM 1/8/2009, Merlin Moncure wrote:
>> >>> just be aware of the danger .  hard reset (power off) class of failure
>> >>> when fsync = off means you are loading from backups.
>> >>
>> >> That's what redundant power conditioning UPS's are supposed to help
>> >> prevent
>> >> ;-)
>> >
>> > But of course, they can't prevent them, but only reduce the likelihood
>> > of their occurrance.  Everyone who's working in large hosting
>> > environments has at least one horror story to tell about a power
>> > outage that never should have happened.
>>
>> Or a system crash. If the kernel panics for any reason when it has dirty
>> buffers in memory the database will need to be restored.
>
> A power conditioning UPS should prevent a building wide or circuit level bad
> power event, caused by either dirty power or a power loss, from affecting
> the host.  Within the design limits of the UPS in question of course.

We had an electrician working who was supposed to have a tray
underneath their work.  They didn't.  A tiny bit of copper flew into a
power conditioner.  The power conditioner blew out and fed back to the
other two power conditioners, which blew; they fed back to the UPSs
and blew them up, and the power surge blew out the switch that lets the
diesel generator take over.  We were running 100ft extension cables
from dirty wall power sockets all over the building to get the hosting
center back up.  There were about 12 or so database servers.  The only
one that came back up without data loss or corruption was mine,
running pgsql.  The others, running Oracle, DB2, Ingres and a few
other databases, all came back up with corrupted data on their drives
and forced nearly day-long restores.

There is no protection against a kernel crash or a power loss that is
absolute.  And don't ever believe there is.  Human error is always a
possibility you have to be prepared to deal with.

> So the real worry with fsync = off in a environment with redundant decent
> UPS's is pretty much limited to host level HW failures, SW crashes, and
> unlikely catastrophes like building collapses, lightning strikes, floods,
> etc.
> Not that your fsync setting is going to matter much in the event of
> catastrophes in the physical environment...

Sure it will.  SCSI cable gets pulled out, power supply fails, mobo
just dies outright, the above mentioned situation with the power being
lost to the data center.  Meteor strikes, not so much.

Re: understanding postgres issues/bottlenecks

From
"Scott Marlowe"
Date:
On Sun, Jan 11, 2009 at 11:07 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> running pgsql.  The others, running Oracle, db2, Ingress and a few
> other databases all came back up with corrupted data on their drives
> and forced nearly day long restores.

Before anyone thinks I'm slagging all other databases here, the
problem wasn't the db software, but the sysadmins / DBAs who
administered those machines.  They stared in horror when I tested my
server by pulling the power cords out the back during acceptance
testing.

They also told me we could never lose power in the hosting center
because it was so wonderful and redundant and that I was wasting my time.
We lost power on a Friday afternoon.  I went skiing that weekend with
a friend, and my system was up and running when it got power again.  I'm
pretty sure the other DBAs at that company weren't skiing that
weekend. :)

Re: understanding postgres issues/bottlenecks

From
Glyn Astill
Date:
--- On Sun, 11/1/09, Scott Marlowe <scott.marlowe@gmail.com> wrote:

> They also told me we could never lose power in the hosting
> center
> because it was so wonder and redundant and that I was
> wasting my time.

Well, that's just plain silly; at the very least there's always going to be some breakers / fuses in between the power
and the machines.

In fact in our building there are quite a few breaker points between our comms room on the 3rd floor and the UPS /
generator in the basement. It's a crappy implementation actually.





Re: understanding postgres issues/bottlenecks

From
david@lang.hm
Date:
On Sun, 11 Jan 2009, Glyn Astill wrote:

> --- On Sun, 11/1/09, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>
>> They also told me we could never lose power in the hosting
>> center
>> because it was so wonder and redundant and that I was
>> wasting my time.
>
> We'll that's just plain silly, at the very least there's always going to
> be some breakers / fuzes in between the power and the machines.
>
> In fact in our building there's quite a few breaker points between our
> comms room on the 3rd floor and the ups / generator in the basement.
> It's a crappy implementation actually.

The response I get from people is that they give their servers redundant
power supplies and put them on separate circuits, so they must be safe from
that.

But as commented, that's not enough in the real world.

David Lang

Re: understanding postgres issues/bottlenecks

From
Luke Lonergan
Date:
Not to mention the #1 cause of server faults in my experience: OS kernel bug causes a crash.  Battery backup doesn't
help you much there.
 

Fsync of log is necessary IMO.

That said, you could use a replication/backup strategy to get a consistent snapshot in the past if you don't mind
losing some data or can recreate it from backup elsewhere.
 

I think a strategy that puts the WAL on an SLC SSD is a very good one as of Jan/09 and will get much better in short
order.

- Luke

----- Original Message -----
From: pgsql-performance-owner@postgresql.org <pgsql-performance-owner@postgresql.org>
To: Glyn Astill <glynastill@yahoo.co.uk>
Cc: Ron <rjpeace@earthlink.net>; Scott Marlowe <scott.marlowe@gmail.com>; pgsql-performance@postgresql.org
<pgsql-performance@postgresql.org>
Sent: Sun Jan 11 15:35:22 2009
Subject: Re: [PERFORM] understanding postgres issues/bottlenecks

On Sun, 11 Jan 2009, Glyn Astill wrote:

> --- On Sun, 11/1/09, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>
>> They also told me we could never lose power in the hosting
>> center
>> because it was so wonder and redundant and that I was
>> wasting my time.
>
> We'll that's just plain silly, at the very least there's always going to
> be some breakers / fuzes in between the power and the machines.
>
> In fact in our building there's quite a few breaker points between our
> comms room on the 3rd floor and the ups / generator in the basement.
> It's a crappy implementation actually.

the response I get from people is that they give their servers redundant
poewr supplies and put them on seperate circuits so they must be safe from
that.

but as commented, that's not enough in the real world.

David Lang


Re: understanding postgres issues/bottlenecks

From
"Scott Marlowe"
Date:
On Sun, Jan 11, 2009 at 4:16 PM, Luke Lonergan <LLonergan@greenplum.com> wrote:
> Not to mention the #1 cause of server faults in my experience: OS kernel bug causes a crash.  Battery backup doesn't
helpyou much there. 

I've been using pgsql since way back, in a lot of projects, and almost
all of them on some flavor of Linux, usually Red Hat.  I have had
zero kernel crashes on production dbs in that time.  I'd be interested
to know which historic kernels caused you the crashes.

Re: understanding postgres issues/bottlenecks

From
"M. Edward (Ed) Borasky"
Date:
Luke Lonergan wrote:
> Not to mention the #1 cause of server faults in my experience: OS kernel bug causes a crash.  Battery backup doesn't
helpyou much there. 

Well now ... that very much depends on where you *got* the server OS and
how you administer it. If you're talking a correctly-maintained Windows
2003 Server installation, or a correctly-maintained Red Hat Enterprise
Linux installation, or any other "branded" OS from Novell, Sun, HP,
etc., I'm guessing such crashes are much rarer than what you've
experienced.

And you're probably in pretty good shape with Debian stable and the RHEL
respins like CentOS. I can't comment on Ubuntu server or any of the BSD
family -- I've never worked with them. But you should be able to keep a
"branded" server up for months, with the exception of applying security
patches that require a reboot. And *those* can be *planned* outages!

Where you *will* have some major OS risk is with testing-level software
or "bleeding edge" Linux distros like Fedora. Quite frankly, I don't
know why people run Fedora servers -- if it's Red Hat compatibility you
want, there's CentOS.


--
M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P), WOM

I've never met a happy clam. In fact, most of them were pretty steamed.

Re: understanding postgres issues/bottlenecks

From
"Robert Haas"
Date:
> Where you *will* have some major OS risk is with testing-level software
> or "bleeding edge" Linux distros like Fedora. Quite frankly, I don't
> know why people run Fedora servers -- if it's Red Hat compatibility you
> want, there's CentOS.

I've had no stability problems with Fedora.  The worst experience I've
had with that distribution is that half the time the CD-burning
utilities seem to be flaky.  As for why that and not CentOS...  I like
having modern versions of all of my packages.  5 years is a long time
to get nothing but bugfixes.

...Robert

Re: understanding postgres issues/bottlenecks

From
"M. Edward (Ed) Borasky"
Date:
Robert Haas wrote:
>> Where you *will* have some major OS risk is with testing-level software
>> or "bleeding edge" Linux distros like Fedora. Quite frankly, I don't
>> know why people run Fedora servers -- if it's Red Hat compatibility you
>> want, there's CentOS.
>
> I've had no stability problems with Fedora.  The worst experience I've
> had with that distribution is that half the time the CD-burning
> utilities seem to be flaky.  As for why that and not CentOS...  I like
> having modern versions of all of my packages.  5 years is a long time
> to get nothing but bugfixes.
Fedora is not the only distro with "flaky CD-burning utilities". I'm
still hunting down issues with k3b and brasero on my openSUSE 11.1
system. "RW" media are your friends. :)

Five years may be a long time to get nothing but bugfixes, but having to
do extensive upgrade testing and an OS upgrade every six months, which
is where most "community" distros' release schedules are these days,
also seems extreme. I personally think for the money you pay for Windows
2003 Server or RHEL, you ought to be able to go a year or more between
unplanned reboots, and they really should minimize the number of reboots
you have to eat for security fixes too. Otherwise, what's the point of
spending money for an OS? :)
--
M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P), WOM

I've never met a happy clam. In fact, most of them were pretty steamed.

Re: understanding postgres issues/bottlenecks

From
Greg Smith
Date:
On Sun, 11 Jan 2009, M. Edward (Ed) Borasky wrote:

> And you're probably in pretty good shape with Debian stable and the RHEL
> respins like CentOS.

No one is in good shape until they've done production-level load testing
on the system and have run the sort of "unplug it under load" tests that
Scott was praising.  Since so many of the kernel bugs you might run into
are in device drivers, the subset of Debian/RHEL you're using is only as
stable as the worst supported driver running on your system.

> But you should be able to keep a "branded" server up for months, with
> the exception of applying security patches that require a reboot.

Right, this is why I only rely on Linux deployments using a name I trust:
Dell.

Returning to reality, the idea that there are brands you can buy that make
all your problems go away is rather optimistic.  The number of "branded"
servers I've seen that are just nearly or completely worthless for
database use is rather depressing.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: understanding postgres issues/bottlenecks

From
"M. Edward (Ed) Borasky"
Date:
Greg Smith wrote:
> Right, this is why I only rely on Linux deployments using a name I
> trust: Dell.
>
> Returning to reality, the idea that there are brands you can buy that
> make all your problems go away is rather optimistic.  The number of
> "branded" servers I've seen that are just nearly or completely worthless
> for database use is rather depressing.

Well, of course, they won't make *all* your problems go away. But still,
I'd much rather have an IBM or HP or Dell server running Windows 2003 or
RHEL or SLES than some no-name hardware running Fedora or Ubuntu.
--
M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P), WOM

I've never met a happy clam. In fact, most of them were pretty steamed.

Re: understanding postgres issues/bottlenecks

From
Mark Kirkwood
Date:
M. Edward (Ed) Borasky wrote:
> Greg Smith wrote:
>
>> Right, this is why I only rely on Linux deployments using a name I
>> trust: Dell.
>>
>> Returning to reality, the idea that there are brands you can buy that
>> make all your problems go away is rather optimistic.  The number of
>> "branded" servers I've seen that are just nearly or completely worthless
>> for database use is rather depressing.
>>
>
> Well, of course, they won't make *all* your problems go away. But still,
> I'd much rather have an IBM or HP or Dell server running Windows 2003 or
> RHEL or SLES than some no-name hardware running Fedora or Ubuntu.
>
If you use no-name hardware it all depends on how reliable your supplier
is. The no-name white box providers I've had experience with have always
supplied hardware that was reliable and fast. And as they were small
companies they would work with you to give you hardware that you
preferred (e.g. RAID cards, etc.).

Conversely, I've found big name-brand suppliers would often change
critical parts (network or RAID cards) midway through a shipment - leaving
you with the odd-man-out server to debug silly issues with (e.g. won't
get on the network, disk array not recognized by the installation media,
etc.). So I'm not entirely convinced by the 'name brand is good' argument.

I'd agree that Fedora is probably not the best choice for a
deployment(!). My experience of Ubuntu has been better; however, using
the LTS release might be a wise move if one wants to use this distro.

regards

Mark

Re: understanding postgres issues/bottlenecks

From
"Scott Marlowe"
Date:
On Sun, Jan 11, 2009 at 8:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Where you *will* have some major OS risk is with testing-level software
>> or "bleeding edge" Linux distros like Fedora. Quite frankly, I don't
>> know why people run Fedora servers -- if it's Red Hat compatibility you
>> want, there's CentOS.
>
> I've had no stability problems with Fedora.  The worst experience I've
> had with that distribution is that half the time the CD-burning
> utilities seem to be flaky.  As for why that and not CentOS...  I like
> having modern versions of all of my packages.  5 years is a long time
> to get nothing but bugfixes.

We basically do the same thing with our app-tier servers, chasing the
next-to-latest Ubuntu releases so we aren't running really old bits there.

For the db level it's RHEL with pgsql.  Ubuntu has been pretty stable
on the app tier, but we don't push it in the same ways we push our
databases either.

Re: understanding postgres issues/bottlenecks

From
Matthew Wakeling
Date:
On Sun, 11 Jan 2009, M. Edward (Ed) Borasky wrote:
> Where you *will* have some major OS risk is with testing-level software
> or "bleeding edge" Linux distros like Fedora.

I just ran "uptime" on my home machine, and it said 144 days. Debian
unstable, on no-name hardware. I guess the last time I rebooted was
when I changed the graphics drivers. I can't remember the last time it
actually crashed.

I guess the moral is, you find a combination of hardware and software that
works well, and you may as well stick with it. The most likely things to
destabilise the system are drivers for "interesting" pieces of hardware,
like graphics cards and (unfortunately) some high-performance RAID cards.

Matthew

--
 A good programmer is one who looks both ways before crossing a one-way street.
 Considering the quality and quantity of one-way streets in Cambridge, it
 should be no surprise that there are so many good programmers there.

Re: understanding postgres issues/bottlenecks

From
Jeff
Date:
On Jan 11, 2009, at 9:43 PM, M. Edward (Ed) Borasky wrote:

> Luke Lonergan wrote:
>> Not to mention the #1 cause of server faults in my experience: OS
>> kernel bug causes a crash.  Battery backup doesn't help you much
>> there.
>

Not that long ago (a month or so) we ran into a problem where hpacucli
(a utility for configuring/inspecting/etc. HP SmartArray controllers)
would tickle the cciss driver in such a way that it would cause a
kernel panic. KABLAMMO (No data loss! weeeee!). The box had run for
a long time without crashes, but it seems that when we added more
disks and the array started building the new logical drive, some
magical things happened.

Bugs happen.  The [bad word] of it is catching the culprit with its
fingers in the cookie jar.

--
Jeff Trout <jeff@jefftrout.com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/




Re: understanding postgres issues/bottlenecks

From
Jean-David Beyer
Date:
M. Edward (Ed) Borasky wrote:
| Luke Lonergan wrote:
|> Not to mention the #1 cause of server faults in my experience: OS
|> kernel bug causes a crash.  Battery backup doesn't help you much there.
|>
|
| Well now ... that very much depends on where you *got* the server OS and
| how you administer it. If you're talking a correctly-maintained Windows
| 2003 Server installation, or a correctly-maintained Red Hat Enterprise
| Linux installation, or any other "branded" OS from Novell, Sun, HP, etc.,
| I'm guessing such crashes are much rarer than what you've experienced.
|
| And you're probably in pretty good shape with Debian stable and the RHEL
| respins like CentOS. I can't comment on Ubuntu server or any of the BSD
| family -- I've never worked with them. But you should be able to keep a
| "branded" server up for months, with the exception of applying security
| patches that require a reboot. And *those* can be *planned* outages!
|
| Where you *will* have some major OS risk is with testing-level software
| or "bleeding edge" Linux distros like Fedora. Quite frankly, I don't know
| why people run Fedora servers -- if it's Red Hat compatibility you want,
| there's CentOS.
|
Linux kernels seem to be pretty good these days. I ran Red Hat Linux 7.3
24/7 for over 6 months, and it was discontinued years ago. I recognize that
this is by no means a record. It had not crashed after 6 months, but I
upgraded that box to CentOS 4, and it has been running that for a long time.
That box has minor hardware problems that do not happen often enough to find
the real cause. But it stays up months at a time. All that box does is run
BOINC and a printer server (CUPS).

This machine does not crash, but it gets rebooted whenever a new kernel
comes out, and has been up almost a month. It runs RHEL5.

I would think Fedora's kernel would probably be OK, but the other bleeding
edge stuff I would not risk a serious server on.

--
~  .~.  Jean-David Beyer          Registered Linux User 85642.
~  /V\  PGP-Key: 9A2FC99A         Registered Machine   241939.
~ /( )\ Shrewsbury, New Jersey    http://counter.li.org
~ ^^-^^ 14:10:01 up 30 days, 1:55, 3 users, load average: 4.18, 4.26, 4.24

Re: understanding postgres issues/bottlenecks

From
"M. Edward (Ed) Borasky"
Date:
Jean-David Beyer wrote:
> M. Edward (Ed) Borasky wrote:
> | Luke Lonergan wrote:
> |> Not to mention the #1 cause of server faults in my experience: OS
> |> kernel bug causes a crash.  Battery backup doesn't help you much there.
> |>
> |
> | Well now ... that very much depends on where you *got* the server OS and
> | how you administer it. If you're talking a correctly-maintained Windows
> | 2003 Server installation, or a correctly-maintained Red Hat Enterprise
> | Linux installation, or any other "branded" OS from Novell, Sun, HP, etc.,
> | I'm guessing such crashes are much rarer than what you've experienced.
> |
> | And you're probably in pretty good shape with Debian stable and the RHEL
> | respins like CentOS. I can't comment on Ubuntu server or any of the BSD
> | family -- I've never worked with them. But you should be able to keep a
> | "branded" server up for months, with the exception of applying security
> | patches that require a reboot. And *those* can be *planned* outages!
> |
> | Where you *will* have some major OS risk is with testing-level software
> | or "bleeding edge" Linux distros like Fedora. Quite frankly, I don't know
> | why people run Fedora servers -- if it's Red Hat compatibility you want,
> | there's CentOS.
> |
> Linux kernels seem to be pretty good these days. I ran Red Hat Linux 7.3
> 24/7 for over 6 months, and it was discontinued years ago. I recognize that
> this is by no means a record. It did not crash after 6 months, but I
> upgraded that box to CentOS 4 and it has been running that a long time.
> That
> box has minor hardware problems that do not happen often enough to find the
> real cause. But it stays up months at a time. All that box does is run
> BOINC
> and a printer server (CUPS).
>
> This machine does not crash, but it gets rebooted whenever a new kernel
> comes out, and has been up almost a month. It runs RHEL5.
>
> I would think Fedora's kernel would probably be OK, but the other bleeding
> edge stuff I would not risk a serious server on.

I haven't heard much one way or the other about Fedora's kernels.
Because of the way their release cycle is staggered against the other major
community distros, they tend to be a kernel version behind, say, openSUSE.
openSUSE 11.1, for example, just came out with 2.6.27, while Fedora came
out with 2.6.26 only a couple of weeks before that.

The things I care the most about -- kernel-level performance metrics --
pretty much guarantee that I'm going to run a bleeding edge kernel as
soon as it's good enough to live through a couple of days without losing
data on the hard drive. And on my laptop, anything that recognizes my
wireless and sound and leaves some of my 512 MB for applications is OK. :)

--
M. Edward (Ed) Borasky

I've never met a happy clam. In fact, most of them were pretty steamed.

Re: understanding postgres issues/bottlenecks

From
david@lang.hm
Date:
On Thu, 15 Jan 2009, Jean-David Beyer wrote:

> M. Edward (Ed) Borasky wrote:
> | Luke Lonergan wrote:
> |> Not to mention the #1 cause of server faults in my experience: OS
> |> kernel bug causes a crash.  Battery backup doesn't help you much there.
> |>
> |
> | Well now ... that very much depends on where you *got* the server OS and
> | how you administer it. If you're talking a correctly-maintained Windows
> | 2003 Server installation, or a correctly-maintained Red Hat Enterprise
> | Linux installation, or any other "branded" OS from Novell, Sun, HP, etc.,
> | I'm guessing such crashes are much rarer than what you've experienced.
> |
> | And you're probably in pretty good shape with Debian stable and the RHEL
> | respins like CentOS. I can't comment on Ubuntu server or any of the BSD
> | family -- I've never worked with them. But you should be able to keep a
> | "branded" server up for months, with the exception of applying security
> | patches that require a reboot. And *those* can be *planned* outages!
> |
> | Where you *will* have some major OS risk is with testing-level software
> | or "bleeding edge" Linux distros like Fedora. Quite frankly, I don't know
> | why people run Fedora servers -- if it's Red Hat compatibility you want,
> | there's CentOS.
> |
> Linux kernels seem to be pretty good these days. I ran Red Hat Linux 7.3
> 24/7 for over 6 months, and it was discontinued years ago. I recognize that
> this is by no means a record. It did not crash after 6 months, but I
> upgraded that box to CentOS 4 and it has been running that a long time. That
> box has minor hardware problems that do not happen often enough to find the
> real cause. But it stays up months at a time. All that box does is run BOINC
> and a printer server (CUPS).
>
> This machine does not crash, but it gets rebooted whenever a new kernel
> comes out, and has been up almost a month. It runs RHEL5.
>
> I would think Fedora's kernel would probably be OK, but the other bleeding
> edge stuff I would not risk a serious server on.

I have been running kernel.org kernels in production for about 12 years
now (on what has now grown to a couple hundred servers), and I routinely
run from upgrade to upgrade with no crashes. I tend to upgrade every year
or so.

That being said, things happen. I have a set of firewalls running the
Checkpoint Secure Platform Linux distribution that locked up solidly a
couple of weeks after we put them in place (the iptables firewalls they
replaced had been humming along just fine under much heavier loads for
months).

The more mainstream your hardware is, the safer you are (unfortunately,
very few RAID cards are mainstream), but I've also found that compiling a
minimal kernel that supports only the stuff I need contributes to
reliability.

But even with my experience, I would never architect anything with the
expectation that system crashes don't happen. I actually see more crashes
due to overheating (fans fail, AC units fail, etc.) than I do from kernel
crashes.

Not everything needs reliability. I am getting ready to build a pair of
postgres servers that will have all safety disabled. I will get the
redundancy I need by replicating between the pair, and if they both go
down (a datacenter outage), it is entirely acceptable to lose the whole
contents of the system and reinitialize from scratch (in fact, every boot
of the system will do this).
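
To make "all safety disabled" concrete, a minimal sketch of the relevant
postgresql.conf settings might look like the following (illustrative only:
which parameters exist depends on the PostgreSQL version, and with these
settings a crash can corrupt or lose the entire cluster, which is tolerable
here only because the data is rebuilt from scratch on every boot anyway):

fsync = off                  # never force WAL writes to disk; a crash can corrupt data
full_page_writes = off       # skip full-page images after checkpoints; unsafe against partial page writes
synchronous_commit = off     # (8.3+) report commit before WAL is flushed; recent commits may be lost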

But you need to think carefully about what you are doing when you disable
the protection.

David Lang