Thread: understanding postgres issues/bottlenecks
Hi list,
I would like to ask your help in order to understand if my postgresql server (ver. 8.2.9) is well configured.
It's a quad-proc system (32 bit) with a 6 disk 1+0 RAID array and 2 separate disks for the OS and write-ahead logs, with 4GB of RAM.

I don't know what is the best info to help me, so I'll start with some vmstat information:

> vmstat -n 30
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff   cache    si   so    bi    bo   in   cs us sy id wa st
 0 23     84 129968  25060 3247860     0    0    78    50    0    2 17  5 33 45  0
 1 24     84 124204  25136 3257100     0    0  3037  1154 3359 7253 25  7  6 62  0
 2 30     84 124704  25136 3256344     0    0  3004  1269 3553 7906 33  9  7 51  0
 0 25     84 125784  24956 3253344     0    0  3357   773 3163 6454 17  4 10 68  0
 0  2     84 125744  25236 3258996     0    0  3186   567 3125 6425 24  6 21 50  0
 3  7     84 124948  25500 3260088     0    0  1829   535 2706 4625 18  3 54 25  0
 5  0     84 124976  25624 3259112     0    0  2067   647 3050 6163 26  6 41 27  0
 0  7     84 123836  25644 3260760     0    0  2239  1065 3289 8654 27  7 38 28  0

This is the load average info gathered for the same period:

29.57 29.53 33.52 1/231 12641
29.54 29.63 33.31 1/226 12678
24.43 28.45 32.69 1/223 12696
12.31 24.17 30.95 4/223 12706

At the moment there are on average about 120-150 connections, and this is my postgresql.conf file:

listen_addresses = '*'                  # what IP address(es) to listen on;
port = 5432                             # (change requires restart)
max_connections = 400                   # (change requires restart)
ssl = off                               # (change requires restart)
password_encryption = on
shared_buffers = 32MB                   # min 128kB or max_connections*16kB
max_prepared_transactions = 0           # can be 0 or more
work_mem = 1MB                          # min 64kB
maintenance_work_mem = 256MB            # min 1MB
max_fsm_pages = 204800                  # min max_fsm_relations*16, 6 bytes each
fsync = on                              # turns forced synchronization on or off
full_page_writes = on                   # recover from partial page writes
wal_buffers = 8MB                       # min 32kB
checkpoint_segments = 56                # in logfile segments, min 1, 16MB each
enable_bitmapscan = on
enable_hashagg = on
enable_hashjoin = on
enable_indexscan = on
enable_mergejoin = on
enable_nestloop = on
enable_seqscan = on
enable_sort = on
enable_tidscan = on
cpu_tuple_cost = 0.003                  # same scale as above
cpu_index_tuple_cost = 0.001            # same scale as above
cpu_operator_cost = 0.0005              # same scale as above
effective_cache_size = 3GB
geqo_threshold = 14
log_destination = 'stderr'              # Valid values are combinations of
redirect_stderr = on                    # Enable capturing of stderr into log
log_directory = 'pg_log'                # Directory where log files are written
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' # Log file name pattern.
log_truncate_on_rotation = on           # If on, any existing log file of the same
log_rotation_age = 1d                   # Automatic rotation of logfiles will
log_rotation_size = 0                   # Automatic rotation of logfiles will
log_min_duration_statement = -1         # -1 is disabled, 0 logs all statements
log_statement = 'none'                  # none, ddl, mod, all
autovacuum = on                         # enable autovacuum subprocess?
stats_start_collector = on              # must be 'on' for autovacuum
stats_row_level = on                    # must be 'on' for autovacuum
statement_timeout = 150000
datestyle = 'iso, mdy'
lc_messages = 'en_US.UTF-8'             # locale for system error message
lc_monetary = 'en_US.UTF-8'             # locale for monetary formatting
lc_numeric = 'en_US.UTF-8'              # locale for number formatting
lc_time = 'en_US.UTF-8'                 # locale for time formatting
deadlock_timeout = 5s
escape_string_warning = off
standard_conforming_strings = on

Cheers and thanks a lot in advance. Let me know if other info is useful.
Ste

--
Stefano Nichele
Funambol Chief Architect
Funambol :: Open Source Mobile 'We' for the Mass Market :: http://www.funambol.com
On Tue, Jan 6, 2009 at 11:50 AM, Stefano Nichele <stefano.nichele@gmail.com> wrote:
> Hi list,
> I would like to ask your help in order to understand if my postgresql server
> (ver. 8.2.9) is well configured.
> It's a quad-proc system (32 bit) with a 6 disk 1+0 RAID array and 2 separate
> disks for the OS and write-ahead logs with 4GB of RAM.
>
> I don't know what is the best info to help me and so I start with some
> vmstat information:
>
>> vmstat -n 30
>
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>  r  b   swpd   free   buff   cache    si   so    bi    bo   in   cs us sy id wa st
>  0 23     84 129968  25060 3247860     0    0    78    50    0    2 17  5 33 45  0

clearly i/o bound. can you throw an iostat to give some more detailed info?

also, need:
o/s version, etc
disk info (speed, raid controller, etc)

my guess is your disk system is underpowered for transaction load you are giving. how many tps is the database doing?

merlin
I concur with Merlin you're I/O bound.

Adding to his post, what RAID controller are you running, does it have cache, does the cache have battery backup, is the cache set to write back or write through?

Also, what do you get for this (need contrib module pgbench installed)

pgbench -i -s 100
pgbench -c 50 -n 10000

? Specifically transactions per second?
I got this bounce message from your account, stefano.nichele@gmail.com. I'm on gmail too, but don't get a lot of -perform messages into my spam folder. Just in case you've got eliminatecc turned on on the mailing list server, I'm resending it through the mail server without your email address in it.

I concur with Merlin you're I/O bound.

Adding to his post, what RAID controller are you running, does it have cache, does the cache have battery backup, is the cache set to write back or write through?

Also, what do you get for this (need contrib module pgbench installed)

pgbench -i -s 100
pgbench -c 50 -n 10000

? Specifically transactions per second?
Thanks for your help. I'll give you the info you asked for as soon as I have it (I also have to install iostat, but I don't have enough privileges to do that).

BTW, why did you say I/O bound? Which are the parameters that highlight that? Sorry for my ignorance....

ste

Merlin Moncure wrote:
> On Tue, Jan 6, 2009 at 11:50 AM, Stefano Nichele
> <stefano.nichele@gmail.com> wrote:
>
>> Hi list,
>> I would like to ask your help in order to understand if my postgresql server
>> (ver. 8.2.9) is well configured.
>> It's a quad-proc system (32 bit) with a 6 disk 1+0 RAID array and 2 separate
>> disks for the OS and write-ahead logs with 4GB of RAM.
>>
>> I don't know what is the best info to help me and so I start with some
>> vmstat information:
>>
>>> vmstat -n 30
>>
>> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>>  r  b   swpd   free   buff   cache    si   so    bi    bo   in   cs us sy id wa st
>>  0 23     84 129968  25060 3247860     0    0    78    50    0    2 17  5 33 45  0
>
> clearly i/o bound. can you throw an iostat to give some more detailed info?
> also, need:
> o/s version, etc
> disk info (speed, raid controller, etc)
> my guess is your disk system is underpowered for transaction load you
> are giving. how many tps is the database doing?
>
> merlin

--
Stefano Nichele
Funambol Chief Architect
Funambol :: Open Source Mobile 'We' for the Mass Market :: http://www.funambol.com
On Tue, Jan 6, 2009 at 12:02 PM, Stefano Nichele <stefano.nichele@gmail.com> wrote:
> Thanks for your help. I'll give you the info you asked for as soon as I have
> it (I also have to install iostat, but I don't have enough privileges to do
> that).
>
> BTW, why did you say I/O bound? Which are the parameters that highlight
> that? Sorry for my ignorance....

In this output, the second to the last column is wa, which stands for wait, and the third to last is id, which is CPU idle.

 r  b   swpd   free   buff   cache    si   so    bi    bo   in   cs us sy id wa st
 1 24     84 124204  25136 3257100     0    0  3037  1154 3359 7253 25  7  6 62  0
 2 30     84 124704  25136 3256344     0    0  3004  1269 3553 7906 33  9  7 51  0
 0 25     84 125784  24956 3253344     0    0  3357   773 3163 6454 17  4 10 68  0

So here the first line reports 25% user, 7% system, 6% idle, 62% wait. So it's mostly waiting.
On Tue, Jan 6, 2009 at 11:02 AM, Stefano Nichele <stefano.nichele@gmail.com> wrote:
> BTW, why did you say I/O bound? Which are the parameters that highlight
> that? Sorry for my ignorance....

In addition to the percentage of time spent in wait as Scott said, you can also see the number of processes which are blocked (the b column on the left). Those are the processes which would like to run but are waiting for IO to complete.

From your earlier vmstat output, you can see that often quite a large number of processes were waiting on IO - 20-30 of them. This is confirmed by your load average. Since you only have 6 spindles, request service latency was likely very high during those periods.

Anyway, I would start with significantly increasing the amount of memory you have allocated to shared_buffers.

32MB is _way_ too low. For a dedicated PostgreSQL machine with 4GB RAM, 1GB is going to be a lot closer to optimal, with the optimal setting probably somewhere between 500MB and 2GB.

And please post the rest of the information that the others have asked for...

-Dave
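To put a concrete number on that, the change is a single postgresql.conf edit plus a restart (only a sketch - the exact value is something to experiment with, and on a 32-bit kernel you may also need to raise the shared memory limit):

# postgresql.conf - hypothetical values for a dedicated 4GB machine
shared_buffers = 1GB          # was 32MB; try values between 500MB and 2GB

# possibly needed in /etc/sysctl.conf so the bigger shared memory segment can be allocated
# kernel.shmmax = 1610612736

$ pg_ctl restart -D /path/to/data     # shared_buffers changes need a full restart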
David Rees wrote:
> On Tue, Jan 6, 2009 at 11:02 AM, Stefano Nichele
> <stefano.nichele@gmail.com> wrote:
>> BTW, why did you say I/O bound? Which are the parameters that highlight
>> that? Sorry for my ignorance....
>
> In addition to the percentage of time spent in wait as Scott said, you
> can also see the number of processes which are blocked (the b column on
> the left). Those are the processes which would like to run but are
> waiting for IO to complete.
>
> From your earlier vmstat output, you can see that often quite a large
> number of processes were waiting on IO - 20-30 of them. This is
> confirmed by your load average. Since you only have 6 spindles,
> request service latency was likely very high during those periods.
>
> Anyway, I would start with significantly increasing the amount of
> memory you have allocated to shared_buffers.
>
> 32MB is _way_ too low. For a dedicated PostgreSQL machine with 4GB
> RAM, 1GB is going to be a lot closer to optimal, with the optimal setting
> probably somewhere between 500MB and 2GB.
>
> And please post the rest of the information that the others have asked for...
>
> -Dave

A couple of notes on "iostat":

1. The package it lives in is called "sysstat". Most Linux distros do *not* install "sysstat" by default. Somebody should beat up on them about that. :)

2. It does not require admin privileges to install or execute if you want to have your own personal copy. You will, however, need the Gnu C compiler, "make", and some other things that also may not be installed. In any event, you can build "sysstat" from source in your home directory if you can't get the sysadmin to install it globally.

3. Once you do get it installed, the command line to run it that gives the most information is

$ iostat -cdmtx <sampling interval> <number of samples>

That will give you CPU utilization, throughputs in megabytes, time stamps and extended statistics like device utilizations, queue lengths and average service times. And if you're running 2.6.25 or later, it should give you all the statistics for all partitions, not just devices!

Somewhere buried in one of my open source projects is a Perl script that will parse the "iostat" output into a CSV file.

By the way, if you need to, you can be fairly aggressive with small sampling intervals. "iostat" is very efficient, and chances are, you're I/O bound anyhow so your processors will have plenty of bandwidth to format numbers out of /proc. :) I normally run with samples every ten seconds, but you can probably go even smaller if you need to.
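Following up on point 2, a rough sketch of a local non-root build (the version placeholder and install path are just examples - grab the current tarball from the sysstat home page and check its INSTALL notes, since the exact layout may differ):

$ tar xzf sysstat-x.y.z.tar.gz && cd sysstat-x.y.z
$ ./configure --prefix=$HOME/sysstat      # install under your home directory
$ make && make install
$ $HOME/sysstat/bin/iostat -cdmtx 10 60   # 10-second samples, 10 minutes worth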
On Wednesday 07 January 2009 04:17:10 M. Edward (Ed) Borasky wrote:
> 1. The package it lives in is called "sysstat". Most Linux distros do
> *not* install "sysstat" by default. Somebody should beat up on them
> about that. :)

Hehe, although sysstat and friends did have issues on Linux for a long time. Nothing worse than misleading stats, so I suspect it lost a lot of friends back then. It is a lot better these days when most of the "Unix" software targets Linux first, and other kernels second.

Aside from all the advice here about system tuning, as a system admin I'd also ask is the box doing the job you need? And are you looking at the Postgres log (with logging of slow queries) to see that queries perform in a sensible time? I'd assume with the current performance figure there is an issue somewhere, but I've been to places where it was as simple as adding one index, or even modifying an index so it does what the application developer intended instead of what they ask for ;)
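For reference, slow-query logging is a couple of postgresql.conf settings (the threshold below is only an example - pick whatever "sensible time" means for this application), and on 8.2 they only need a reload, not a restart:

log_min_duration_statement = 500   # log any statement taking longer than 500ms (currently -1, i.e. disabled)
log_line_prefix = '%t [%p] '       # timestamps make the log easy to line up with vmstat/iostat samples

$ pg_ctl reload -D /path/to/data   # or from psql: SELECT pg_reload_conf();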
Simon Waters wrote:
> On Wednesday 07 January 2009 04:17:10 M. Edward (Ed) Borasky wrote:
>> 1. The package it lives in is called "sysstat". Most Linux distros do
>> *not* install "sysstat" by default. Somebody should beat up on them
>> about that. :)
>
> Hehe, although sysstat and friends did have issues on Linux for a long time.
> Nothing worse than misleading stats, so I suspect it lost a lot of friends
> back then. It is a lot better these days when most of the "Unix" software
> targets Linux first, and other kernels second.

I'm unfortunately familiar with that "episode", which turned out to be bugs in the Red Hat Linux kernels / drivers. I don't remember the exact versions, although there was at least one of them shipped at one time as part of RHEL 3. I may be wrong, but I don't think Sebastien ever had to change any of *his* code in "sysstat" -- I think he had to wait for Red Hat. :)

> Aside from all the advice here about system tuning, as a system admin I'd also
> ask is the box doing the job you need? And are you looking at the Postgres
> log (with logging of slow queries) to see that queries perform in a sensible
> time? I'd assume with the current performance figure there is an issue
> somewhere, but I've been to places where it was as simple as adding one
> index, or even modifying an index so it does what the application developer
> intended instead of what they ask for ;)

Yeah ... the issues of application response time and throughput, cost of adding hardware to the box(es) vs. cost of changing the application, changes in demand for the server over time, etc., are what make capacity planning "fun". Squeezing the last megabyte per second out of a RAID array is the easy stuff. :)
Ok, here is some information:

OS: Centos 5.x (Linux 2.6.18-53.1.21.el5 #1 SMP Tue May 20 09:34:18 EDT 2008 i686 i686 i386 GNU/Linux)
RAID: it's a hardware RAID controller
The disks are 9600rpm SATA drives (6 disk 1+0 RAID array and 2 separate disks for the OS).

About iostat (on sdb I have pg_xlog, on sdc I have data):

> iostat -k
Linux 2.6.18-53.1.21.el5 (*******)      01/07/2009

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          17.27    0.00    5.13   45.08    0.00   32.52

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              30.42        38.50       170.48  182600516  808546589
sdb              46.16         0.23        52.10    1096693  247075617
sdc             269.26       351.51       451.00 1667112043 2138954833

> iostat -x -k -d 2 5
Linux 2.6.18-53.1.21.el5 (*******)      01/07/2009

Device: rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz  await svctm  %util
sda       0.17   12.68   0.47  29.95   38.51  170.51    13.74     0.03   0.86  0.19   0.57
sdb       0.01   80.11   0.05  46.11    0.23   52.01     2.26     0.01   0.22  0.22   1.01
sdc       7.50   64.57 222.55  46.69  350.91  450.98     5.96     0.57   2.05  3.13  84.41

Device: rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz  await svctm  %util
sda       0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00   0.00  0.00   0.00
sdb       0.00  196.00   1.00 117.00    4.00 1252.00    21.29     0.02   0.19  0.19   2.30
sdc       1.50   66.00 277.00  66.50 3100.00  832.00    22.89    50.84 242.30  2.91 100.10

Device: rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz  await svctm  %util
sda       0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00   0.00  0.00   0.00
sdb       0.00  264.50   0.00 176.50    0.00 1764.00    19.99     0.04   0.21  0.21   3.70
sdc       3.50  108.50 291.50  76.00 3228.00  752.00    21.66    89.42 239.39  2.72 100.05

Device: rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz  await svctm  %util
sda       0.00    4.98   0.00   1.00    0.00   23.88    48.00     0.00   0.00  0.00   0.00
sdb       0.00   23.88   0.00   9.45    0.00  133.33    28.21     0.00   0.21  0.21   0.20
sdc       1.00  105.97 274.13  53.73 3297.51  612.94    23.85    67.99 184.58  3.04  99.55

Device: rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz  await svctm  %util
sda       0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00   0.00  0.00   0.00
sdb       0.00   79.00   0.00  46.00    0.00  500.00    21.74     0.01   0.25  0.25   1.15
sdc       2.50  141.00 294.00  43.50 3482.00  528.00    23.76    51.33 170.46  2.96 100.05

vmstat in the same time:

> vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff   cache    si   so    bi    bo   in   cs us sy id wa st
 0 27     80 126380  27304 3253016     0    0    98    55    0    1 17  5 33 45  0
 0 26     80 124516  27300 3255456     0    0  3438  1724 2745 4011 11  2  8 78  0
 1 25     80 124148  27276 3252548     0    0  3262  2806 3572 7007 33 11  3 53  0
 1 28     80 128272  27244 3248516     0    0  2816  1006 2926 5624 12  3 12 73  0

I will run pgbench in the next days.

> Aside from all the advice here about system tuning, as a system admin I'd also
> ask is the box doing the job you need? And are you looking at the Postgres
> log (with logging of slow queries) to see that queries perform in a sensible
> time? I'd assume with the current performance figure there is an issue
> somewhere, but I've been to places where it was as simple as adding one
> index, or even modifying an index so it does what the application developer
> intended instead of what they ask for ;)

I already checked the postgres log and resolved index/slow query issues. Actually I have queries that sometimes are really fast and sometimes go into timeout, but all the required indexes are there. For sure, there is space to improve performance also in that way, but I would also like to investigate the issue from other points of view (in order to understand also how to monitor the server).

Cheers and thanks a lot.
ste
On Tue, Jan 6, 2009 at 7:45 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> I concur with Merlin you're I/O bound.
>
> Adding to his post, what RAID controller are you running, does it have
> cache, does the cache have battery backup, is the cache set to write
> back or write through?

At the moment I don't have such information. It's a "standard" RAID controller coming with a DELL server. Is there any information I can have asking to the SO ?

> Also, what do you get for this (need contrib module pgbench installed)
>
> pgbench -i -s 100
> pgbench -c 50 -n 10000
>
> ? Specifically transactions per second?

I'll run pgbench in the next days.

Cheers,
ste
On Wed, Jan 7, 2009 at 2:02 PM, Stefano Nichele <stefano.nichele@gmail.com> wrote:
> On Tue, Jan 6, 2009 at 7:45 PM, Scott Marlowe <scott.marlowe@gmail.com>
> wrote:
>>
>> I concur with Merlin you're I/O bound.
>>
>> Adding to his post, what RAID controller are you running, does it have
>> cache, does the cache have battery backup, is the cache set to write
>> back or write through?
>
> At the moment I don't have such information. It's a "standard" RAID
> controller coming with a DELL server. Is there any information I can have
> asking to the SO ?

You can run lshw to see what flavor controller it is. Dell RAID controllers are pretty much either total crap, or mediocre at best. The latest one, the Perc 6 series, are squarely in the same performance realm as a 4 or 5 year old LSI megaraid. The perc 5 series and before are total performance dogs. The really bad news is that you can't generally plug in a real RAID controller on a Dell. We put an Areca 168-LP PCI-x8 in one of our 1950s and it wouldn't even turn on, got a CPU Error.

Dells are fine for web servers and such. For database servers they're a total loss. The best you can do with one is to put a generic SCSI card in it and connect to an external array with its own controller.

We have a perc6e and a perc5e in two different servers, and no matter how we configure them, we can't get even 1/10th the performance of an Areca controller with the same number of drives on another machine of the same basic class as the 1950s.

>> Also, what do you get for this (need contrib module pgbench installed)
>>
>> pgbench -i -s 100
>> pgbench -c 50 -n 10000
>>
>> ? Specifically transactions per second?
>
> I'll run pgbench in the next days.

Cool. That pgbench is a "best case scenario" benchmark. Lots of small transactions on a db that should fit into memory. If you can't pull off a decent number there (at least a few hundred tps) then you can't expect better performance from real world usage.

Oh, and that should be:

pgbench -c 50 -t 10000

not -n... not enough sleep I guess.
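Spelled out, a full run against a scratch database would look something like this (the database name is just an example; at scale factor 100 the test database is roughly 1.5GB, so it fits in RAM on a 4GB box):

$ createdb pgbench_test
$ pgbench -i -s 100 pgbench_test           # initialize at scale factor 100
$ pgbench -c 50 -t 10000 pgbench_test      # 50 clients, 10000 transactions each
$ pgbench -c 50 -t 10000 -S pgbench_test   # same load but SELECT-only, useful for comparison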
Just to elaborate on the horror that is a Dell perc5e. We have one in a 1950 with battery backed cache (256 Meg I think). It has an 8 disk 500Gig SATA drive RAID-10 array and 4 1.6GHz cpus and 10 Gigs ram. This server currently serves as a mnogo search server. Here's what vmstat 1 looks like during the day:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff    cache   si   so    bi    bo   in   cs us sy id wa
 1  5  38748  60400  43172  9227632    0    0  5676     0  668   703  4  2 12 82
 0  5  38748  58888  43176  9229360    0    0  5548     0  672   792  2  0 15 83
 0  4  38748  64460  43184  9230476    0    0  5964    72  773   947  1  0 31 67
 0  5  38748  61884  43272  9241564    0    0  5896  1112  674  1028  1  2 23 74
 0  5  38748  56612  43276  9247376    0    0  5660     0  603   795  0  0 21 79
 0  5  38748  56472  43268  9247480    0    0  5700     0  603   790  0  0 22 77

Note 4 or 5 blocking, reading in data at 5M/sec and bursting small writes. 75 to 80% wait state, user and sys time around 1 or 2%, the rest idle. This is without adding any load from pgbench. When I add pgbench, the numbers from vmstat look pretty much the same, with a slight increase of maybe 20% in bi and bo and a rise in the blocked processes. Running vacuum on the pgbench db on this machine takes well over a minute. Even a small pgbench test runs slowly, getting 20 to 30 tps during the day. During off hours I can get a max of about 80 tps.

----------------------------------------------------------------------------------------------

Now, here's my primary production server. It is NOT a Dell, or an HP, or an IBM. It is a high end white box machine with a Tyan mobo. The primary difference here is a 12 disk SAS RAID-10 and a much faster RAID controller. It's got 8 opteron 2.1GHz cores but honestly, it hardly uses any of them. This is it before I run pgbench. It almost looks like it's sleeping. It is handling about 250 queries per second right now, most serving from memory.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b  swpd   free   buff    cache   si   so    bi    bo   in   cs us sy id wa st
 0  0  2464 614156 626448 28658476    0    0     0   688 2515 2615 12  1 87  0  0
 2  0  2464 638972 626448 28658484    0    0     8   656 2401 2454 16  2 82  0  0
 1  0  2464 631852 626448 28658516    0    0     0   552 1939 1984 14  1 85  0  0
 4  0  2464 617900 626448 28658532    0    0    32   500 4925 5276 19  2 78  0  0
 1  0  2464 617656 626448 28658560    0    0     0   492 3363 3428 14  1 85  0  0
 1  0  2464 618648 626448 28658560    0    0     0   752 3391 3579 12  2 85  0  0

It's not reading anything in because it fits in memory (it's got 32 Gig ram) and it's writing out a small amount. I'll run pgbench to give it a heavier load and force more writes.

pgbench -c 10 -t 25000
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 10
number of transactions per client: 25000
number of transactions actually processed: 250000/250000
tps = 2866.549792 (including connections establishing)
tps = 2868.010947 (excluding connections establishing)

Here's vmstat during that period:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b  swpd   free   buff    cache   si   so    bi     bo    in     cs us sy id wa st
 1  2  2408 190548 626584 28770448    0    0     0  66200  8210  84391 44 14 35  6  0
 0  2  2408 207156 626584 28770900    0    0     0  23832  6688   6426 20  4 58 18  0
 7  0  2408 295792 626588 28771904    0    0     0  34372  6831  62873 35 12 43 11  0
 3  2  2408 308904 626588 28773128    0    0     0  60040  6711  78235 44 13 40  3  0
 4  0  2408 310408 626588 28773660    0    0     0  54780  7779  37399 28  8 50 15  0
 6  1  2408 325808 626592 28775912    0    0    16  43588  5345 105348 43 15 39  3  0
10  0  2408 324304 626592 28778188    0    0    16  60984  6582 125105 52 18 29  1  0
 8  1  2408 339204 626596 28780248    0    0     8  47956  5203 113799 48 18 33  1  0
 4  2  2408 337856 626596 28782840    0    0     0 108096 12132  90391 46 16 33  4  0

Note that wait is generally around 5 to 10% on this machine. Now, I know it's a much more powerful machine, but the difference in performance is not some percentage faster. It's many many factors faster - twenty or more times as fast as the Dell. That Dell machine cost us somewhere in the range of $11,000. The MD-1000 was a large part of that cost. The Aberdeen white box that is replacing it cost us about $11,500. I cannot understand how Dell stays in business. It certainly isn't on merit.
On Wed, 7 Jan 2009, Scott Marlowe wrote:

> I cannot understand how Dell stays in business.

There's a continuous stream of people who expect RAID5 to perform well, too, yet this surprises you?

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Wed, Jan 7, 2009 at 3:34 PM, Greg Smith <gsmith@gregsmith.com> wrote:
> On Wed, 7 Jan 2009, Scott Marlowe wrote:
>
>> I cannot understand how Dell stays in business.
>
> There's a continuous stream of people who expect RAID5 to perform well, too,
> yet this surprises you?

I guess I've underestimated the human capacity for stupidity yet again.

--
An optimist sees the glass as half full, a pessimist as half empty, and an engineer as having redundant storage capacity.
--- On Wed, 7/1/09, Scott Marlowe <scott.marlowe@gmail.com> wrote:

> The really bad news is that you can't
> generally plug in a real RAID controller on a Dell. We put an Areca
> 168-LP PCI-x8 in one of our 1950s and it wouldn't even turn on, got a
> CPU Error.

Hmm, I had to pull the perc5i's out of our dell servers to get them to boot off of our Adaptec 5805's.

Anyway that reminds me, I must remember to bring my lump hammer into work at some point...
--- On Wed, 7/1/09, Scott Marlowe <scott.marlowe@gmail.com> wrote:

> Just to elaborate on the horror that is a Dell perc5e. We have one in
> a 1950 with battery backed cache (256 Meg I think). It has an 8 disk
> 500Gig SATA drive RAID-10 array and 4 1.6GHz cpus and 10 Gigs ram.

Our perc5i controllers performed better in raid 5 than 10. Sounds like the comment you made when I was wasting my time with that perc3 fits all dell cards perfectly: "brain damaged".
On Wed, Jan 7, 2009 at 4:36 PM, Glyn Astill <glynastill@yahoo.co.uk> wrote:
> --- On Wed, 7/1/09, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>
>> Just to elaborate on the horror that is a Dell perc5e. We have one in
>> a 1950 with battery backed cache (256 Meg I think). It has an 8 disk
>> 500Gig SATA drive RAID-10 array and 4 1.6GHz cpus and 10 Gigs ram.
>
> Our perc5i controllers performed better in raid 5 than 10. Sounds like the
> comment you made when I was wasting my time with that perc3 fits all dell
> cards perfectly: "brain damaged".

One of the beauties of the whole Dell RAID card naming scheme is that a Perc5i and a Perc5e can be made by different manufacturers and have completely different performance characteristics. The only common factor seems to be the high level of suck they managed to generate.
Since the discussion involves Dell PERC controllers, does anyone know if LSI cards (those with the same chipsets as Dell) also have similarly poor performance? I have an LSI 8888ELP card, so would like to know what other people's experiences are...

-bborie

Scott Marlowe wrote:
> On Wed, Jan 7, 2009 at 2:02 PM, Stefano Nichele
> <stefano.nichele@gmail.com> wrote:
>> On Tue, Jan 6, 2009 at 7:45 PM, Scott Marlowe <scott.marlowe@gmail.com>
>> wrote:
>>> I concur with Merlin you're I/O bound.
>>>
>>> Adding to his post, what RAID controller are you running, does it have
>>> cache, does the cache have battery backup, is the cache set to write
>>> back or write through?
>>
>> At the moment I don't have such information. It's a "standard" RAID
>> controller coming with a DELL server. Is there any information I can have
>> asking to the SO ?
>
> You can run lshw to see what flavor controller it is. Dell RAID
> controllers are pretty much either total crap, or mediocre at best.
> The latest one, the Perc 6 series, are squarely in the same performance
> realm as a 4 or 5 year old LSI megaraid. The perc 5 series and before
> are total performance dogs. The really bad news is that you can't
> generally plug in a real RAID controller on a Dell. We put an Areca
> 168-LP PCI-x8 in one of our 1950s and it wouldn't even turn on, got a
> CPU Error.
>
> Dells are fine for web servers and such. For database servers they're
> a total loss. The best you can do with one is to put a generic SCSI
> card in it and connect to an external array with its own controller.
>
> We have a perc6e and a perc5e in two different servers, and no matter
> how we configure them, we can't get even 1/10th the performance of an
> Areca controller with the same number of drives on another machine of
> the same basic class as the 1950s.
>
>>> Also, what do you get for this (need contrib module pgbench installed)
>>>
>>> pgbench -i -s 100
>>> pgbench -c 50 -n 10000
>>>
>>> ? Specifically transactions per second?
>>
>> I'll run pgbench in the next days.
>
> Cool. That pgbench is a "best case scenario" benchmark. Lots of
> small transactions on a db that should fit into memory. If you can't
> pull off a decent number there (at least a few hundred tps) then you can't
> expect better performance from real world usage.
>
> Oh, and that should be:
>
> pgbench -c 50 -t 10000
>
> not -n... not enough sleep I guess.

--
Bborie Park
Programmer
Center for Vectorborne Diseases
UC Davis
530-752-8380
bkpark@ucdavis.edu
If you’re stuck with a Dell, the Adaptec 5 series works, I’m using 5085’s in a pair and get 1200 MB/sec streaming reads best case with 20 SATA drives in RAID 10 (2 sets of 10, software raid 0 on top). Of course, Dell doesn’t like you putting in somebody else’s RAID card, but they support the rest of the system when you add in a third party PCIe card.
Sure, they are a bit pricey, but they are also very good performers with drivers for a lot of stuff, including OpenSolaris. I tested a PERC 6 versus this with the same drives, but only 10 total. The Adaptec was 70% faster out of the box, and still 35% faster after tuning the linux OS read-ahead and other parameters.
PERC 5 / 6 cards are LSI re-badged MegaRaid cards, for those interested. With special firmware modifications that make them ‘interesting’.
On 1/7/09 1:31 PM, "Scott Marlowe" <scott.marlowe@gmail.com> wrote:
On Wed, Jan 7, 2009 at 2:02 PM, Stefano Nichele
<stefano.nichele@gmail.com> wrote:
>
> On Tue, Jan 6, 2009 at 7:45 PM, Scott Marlowe <scott.marlowe@gmail.com>
> wrote:
>>
>> I concur with Merlin you're I/O bound.
>>
>> Adding to his post, what RAID controller are you running, does it have
>> cache, does the cache have battery backup, is the cache set to write
>> back or write through?
>
>
> At the moment I don't have such information. It's a "standard" RAID
> controller coming with a DELL server. Is there any information I can have
> asking to the SO ?
You can run lshw to see what flavor controller it is. Dell RAID
controllers are pretty much either total crap, or mediocre at best.
The latest one, the Perc 6 series are squarely in the same performance
realm as a 4 or 5 year old LSI megaraid. The perc 5 series and before
are total performance dogs. The really bad news is that you can't
generally plug in a real RAID controller on a Dell. We put an Areca
168-LP PCI-x8 in one of our 1950s and it wouldn't even turn on, got a
CPU Error.
Dells are fine for web servers and such. For database servers they're
a total loss. The best you can do with one is to put a generic SCSI
card in it and connect to an external array with its own controller.
We have a perc6e and a perc5e in two different servers, and no matter
how we configure them, we can't get even 1/10th the performance of an
Areca controller with the same number of drives on another machine of
the same basic class as the 1950s.
>> Also, what do you get for this (need contrib module pgbench installed)
>>
>> pgbench -i -s 100
>> pgbench -c 50 -n 10000
>>
>> ? Specifically transactions per second?
>
> I'll run pgbench in the next days.
Cool. That pgbench is a "best case scenario" benchmark. Lots of
small transactions on a db that should fit into memory. If you can't
pull off a decent number there (at least a few hundred tps) then can't
expect better performance from real world usage.
Oh, and that should be:
pgbench -c 50 -t 10000
not -n... not enough sleep I guess.
On Wed, Jan 7, 2009 at 7:11 PM, Scott Carey <scott@richrelevance.com> wrote:
> If you're stuck with a Dell, the Adaptec 5 series works, I'm using 5085's in
> a pair and get 1200 MB/sec streaming reads best case with 20 SATA drives in
> RAID 10 (2 sets of 10, software raid 0 on top). Of course, Dell doesn't
> like you putting in somebody else's RAID card, but they support the rest of
> the system when you add in a third party PCIe card.
>
> Sure, they are a bit pricey, but they are also very good performers with
> drivers for a lot of stuff, including OpenSolaris. I tested a PERC 6 versus
> this with the same drives, but only 10 total. The Adaptec was 70% faster
> out of the box, and still 35% faster after tuning the linux OS read-ahead
> and other parameters.

Sequential read performance means precisely squat for most database loads. The dell stuff is ok....decent RAID 5 performance and mediocre raid 10. Unfortunately switching the disks to jbod and going software raid doesn't seem to help much. The biggest problem with dell hardware that I see is that overflowing the raid cache causes the whole system to spectacularly grind to a halt, causing random delays.

To the OP, it looks like you are getting about 300 or so tps out of sdc (80% read), which is where I'm assuming the data is. I'm guessing most of that is random traffic. Here's the bad news: while this is on the low side for a 6 disk raid 10 7200 rpm, it's probably about what your particular hardware can do. I have some general suggestions for you:

*) upgrade hardware: more/faster disks, etc
*) disable fsync (dangerous!) can risk data loss, but maybe you have redundancy built in a different place. This will let linux reorganize i/o on top of what the hardware is doing.
*) upgrade to postgres 8.3. Numerous efficiency advantages, and has the synchronous_commit setting, which is 'fsync lite'...most of the advantages and a lot less risk.
*) tune the app

merlin
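In postgresql.conf terms the third option boils down to something like this once on 8.3 (just a sketch - synchronous_commit does not exist in 8.2, and fsync = off is the variant that risks a corrupt cluster after a power loss):

# 8.3+: relax per-commit durability, but no corruption risk on crash
synchronous_commit = off
#fsync = on                 # leave fsync itself alone

# 8.2 only has the all-or-nothing, risky version:
#fsync = off                # only if you can rebuild from somewhere else after a failure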
On Wed, Jan 7, 2009 at 6:19 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

RE: Perc raid controllers
> Unfortunately switching the disks to jbod and going software
> raid doesn't seem to help much. The biggest problem with dell

Yeah, I noticed that too when I was trying to get a good config from the perc 5e. Also, running software RAID0 over hardware RAID1 sets gave no appreciable speed boost.
Depends on the database workload. Many queries for me may scan 50GB of data for aggregation.
> Sequential read performance means precisely squat for most database
> loads.
Besides, it is a good test for making sure your RAID card doesn’t suck. Especially running tests with sequential access CONCURRENT with random access.
A good tuned raid setup will be able to handle a good chunk of sequential access while doing random reads concurrently. A bad one will grind to a halt.
The same can be said for concurrent writes and fsyncs with concurrent reads. Bad cards tend to struggle with this, good ones don’t.
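A crude way to generate that kind of mixed load for a sanity check (paths and sizes below are made up - use a file bigger than RAM, and don't do this on a production box):

$ dd if=/path/to/bigfile of=/dev/null bs=1M &    # sequential reader in the background
$ pgbench -c 20 -t 5000 -S pgbench_test          # random read load at the same time
$ iostat -x -k 10 6                              # in another terminal: do await/%util fall apart?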
$ sar -b
12:00:01 AM tps rtps wtps bread/s bwrtn/s
01:10:01 AM 1913.22 1903.74 9.48 561406.70 326.67
01:20:02 AM 2447.71 2439.97 7.74 930357.08 148.86
01:30:01 AM 1769.77 1740.41 29.35 581015.86 3729.37
01:40:01 AM 1762.05 1659.06 102.99 477730.70 26137.96
And disk utilization did not go past 85% or so during the peak load, usually much less (8 cores, 12.5% would reflect a CPU).
12:00:01 AM CPU %user %nice %system %iowait %steal %idle
01:10:01 AM all 47.92 0.00 12.92 10.22 0.00 28.94
01:20:02 AM all 67.97 0.00 17.93 3.47 0.00 10.63
01:30:01 AM all 46.67 0.00 10.60 7.43 0.00 35.29
01:40:01 AM all 59.22 0.03 9.88 5.67 0.00 25.21
The workload regularly bursts to 900MB/sec with concurrent sequential scans.
> The dell stuff is ok....decent RAID 5 performance and mediocre
> raid 10. Unfortunately switching the disks to jbod and going software
> raid doesn't seem to help much. The biggest problem with dell
> hardware that I see is that overflowing the raid cache causes the
> whole system to spectacularly grind to a halt, causing random delays.
The Adaptec stuff doesn’t have the issues with cache overflow. For pure random access stuff the Dell Perc 6 is pretty good, but with mixed read/write it freaks out and has inconsistent performance. A PERC 6 does perform better than a 3Ware 9650 for me though. Those are both on my crap list, with the 3Ware 9550 and PERC 5 both much worse.
Both got about 200 iops per drive on random access.
Agree with all of the above.
> To the OP, it looks like you are getting about 300 or so tps out of
> sdc (80% read), which is where I'm assuming the data is. I'm guessing
> most of that is random traffic. Here's the bad news: while this is on
> the low side for a 6 disk raid 10 7200 rpm, it's probably about what
> your particular hardware can do. I have some general suggestions for
> you:
> *) upgrade hardware: more/faster disks, etc
> *) disable fsync (dangerous!) can risk data loss, but maybe you have
> redundancy built in a different place. This will let linux reorganize
> i/o on top of what the hardware is doing.
> *) upgrade to postgres 8.3. Numerous efficiency advantages, and has
> the synchronous_commit setting, which is 'fsync lite'...most of the
> advantages and a lot less risk.
> *) tune the app
>
> merlin
The xlogs are on sdb, which is not I/O bound, so I am not sure how much changing fsync will help.
I second upgrading to 8.3 which is generally faster and will reduce random i/o if any sequential scans are kicking out random access data from shared_buffers.
If there is budget for an upgrade, how big is the data set, and how much will it grow?
For $1200 get two Intel X25-M SSD’s and all random iops issues will be gone (8k iops in raid 1). Double that if 4 drives in raid 10. Unfortunately, each pair only stores 80GB. But for many, that is plenty.
Scott Marlowe wrote:
> On Wed, Jan 7, 2009 at 3:34 PM, Greg Smith <gsmith@gregsmith.com> wrote:
>> On Wed, 7 Jan 2009, Scott Marlowe wrote:
>>
>>> I cannot understand how Dell stays in business.
>>
>> There's a continuous stream of people who expect RAID5 to perform well, too,
>> yet this surprises you?
>
> I guess I've underestimated the human capacity for stupidity yet again.
>
> --
> An optimist sees the glass as half full, a pessimist as half empty, and an
> engineer as having redundant storage capacity.

Ah, but the half-empty glass has a fly in it. -- Kehlog Albran

:)
Find !
Dell CERC SATA RAID 2 PCI SATA 6ch
Running lspci -v:
03:09.0 RAID bus controller: Adaptec AAC-RAID (rev 01)
Subsystem: Dell CERC SATA RAID 2 PCI SATA 6ch (DellCorsair)
Flags: bus master, 66MHz, slow devsel, latency 32, IRQ 209
Memory at f8000000 (32-bit, prefetchable) [size=64M]
Expansion ROM at fe800000 [disabled] [size=32K]
Capabilities: [80] Power Management version 2
Any consideration looking at iostat output ?
Cheers and thanks to all!
ste
On Thu, Jan 8, 2009 at 3:36 AM, Stefano Nichele <stefano.nichele@gmail.com> wrote:
> Find !
>
> Dell CERC SATA RAID 2 PCI SATA 6ch
>
> Running lspci -v:
>
> 03:09.0 RAID bus controller: Adaptec AAC-RAID (rev 01)
> Subsystem: Dell CERC SATA RAID 2 PCI SATA 6ch (DellCorsair)

IIRC that's the 'perc 6ir' card...no write caching. You are getting killed with syncs. If you can restart the database, you can test with fsync=off comparing load to confirm this. (another way is to compare select only vs regular transactions on pgbench).

merlin
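Concretely, that second comparison is just the two pgbench modes against the same test database (the database name is only an example):

$ pgbench -c 10 -t 10000 pgbench_test       # default TPC-B-like run: every commit forces a WAL sync
$ pgbench -c 10 -t 10000 -S pgbench_test    # -S = SELECT-only: no commits, so no sync penalty

If the select-only number is enormously higher than the regular one, that points at commit/sync overhead, i.e. the missing write cache, as Merlin says.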
--- On Thu, 8/1/09, Stefano Nichele <stefano.nichele@gmail.com> wrote:

> From: Stefano Nichele <stefano.nichele@gmail.com>
> Subject: Re: [PERFORM] understanding postgres issues/bottlenecks
> To: "Scott Marlowe" <scott.marlowe@gmail.com>
> Cc: pgsql-performance@postgresql.org
> Date: Thursday, 8 January, 2009, 8:36 AM
> Find !
>
> Dell CERC SATA RAID 2 PCI SATA 6ch
>
> Running lspci -v:
>
> 03:09.0 RAID bus controller: Adaptec AAC-RAID (rev 01)
> Subsystem: Dell CERC SATA RAID 2 PCI SATA 6ch (DellCorsair)
> Flags: bus master, 66MHz, slow devsel, latency 32, IRQ 209
> Memory at f8000000 (32-bit, prefetchable) [size=64M]
> Expansion ROM at fe800000 [disabled] [size=32K]
> Capabilities: [80] Power Management version 2

Hmm, the 64M / 6ch makes it sound like this card

http://accessories.us.dell.com/sna/products/Controllers/productdetail.aspx?c=us&l=en&s=bsd&cs=04&sku=310-5975

Which is a 6ch dell version of

http://www.adaptec.com/en-US/support/raid/sata/AAR-2410SA/

I have one on my smash pile. The only thing that makes me think otherwise is the 2 in "CERC SATA RAID 2" ...
Glyn Astill wrote:
> --- On Thu, 8/1/09, Stefano Nichele <stefano.nichele@gmail.com> wrote:
>
>> From: Stefano Nichele <stefano.nichele@gmail.com>
>> Subject: Re: [PERFORM] understanding postgres issues/bottlenecks
>> To: "Scott Marlowe" <scott.marlowe@gmail.com>
>> Cc: pgsql-performance@postgresql.org
>> Date: Thursday, 8 January, 2009, 8:36 AM
>> Find !
>>
>> Dell CERC SATA RAID 2 PCI SATA 6ch
>>
>> Running lspci -v:
>>
>> 03:09.0 RAID bus controller: Adaptec AAC-RAID (rev 01)
>> Subsystem: Dell CERC SATA RAID 2 PCI SATA 6ch (DellCorsair)
>> Flags: bus master, 66MHz, slow devsel, latency 32, IRQ 209
>> Memory at f8000000 (32-bit, prefetchable) [size=64M]
>> Expansion ROM at fe800000 [disabled] [size=32K]
>> Capabilities: [80] Power Management version 2
>
> Hmm, the 64M / 6ch makes it sound like this card
>
> http://accessories.us.dell.com/sna/products/Controllers/productdetail.aspx?c=us&l=en&s=bsd&cs=04&sku=310-5975

Indeed I think it's this one. The server is a Power Edge 1800 Server.

> Which is a 6ch dell version of
>
> http://www.adaptec.com/en-US/support/raid/sata/AAR-2410SA/
>
> I have one on my smash pile. The only thing that makes me think otherwise is the 2 in "CERC SATA RAID 2" ...
Merlin Moncure wrote:
> On Thu, Jan 8, 2009 at 3:36 AM, Stefano Nichele
> <stefano.nichele@gmail.com> wrote:
>
>> Find !
>>
>> Dell CERC SATA RAID 2 PCI SATA 6ch
>>
>> Running lspci -v:
>>
>> 03:09.0 RAID bus controller: Adaptec AAC-RAID (rev 01)
>> Subsystem: Dell CERC SATA RAID 2 PCI SATA 6ch (DellCorsair)
>
> IIRC that's the 'perc 6ir' card...no write caching. You are getting
> killed with syncs. If you can restart the database, you can test with
> fsync=off comparing load to confirm this. (another way is to compare
> select only vs regular transactions on pgbench).
>
> merlin

I'll try next Saturday.

thanks
ste
If it is PowerEdge 1800 with a low end non-write cache controller, then 100 iops per SATA drive isn’t too far off from what you should expect.
Contrary to what others are saying, I don’t think that turning fsync off is killing you, you might get 25% more throughput at best.
The iostat output would indicate that you are getting just about all you can get out of it without getting faster drives or controllers.
Since it is mostly random read bound, more RAM in the system would certainly help reduce the load, as would an I/O subsystem with high random i/o throughput (quality flash drives, the best quality ones currently being Intel X-25 M until the next gen Micron, Samsung, Mtron, and SanDisk ones come out — do NOT use any that don’t have high random write perf).
I’m guessing that the cheapest thing to do is increase the RAM on the server — I think you can fit 16GB in an 1800, if not at least 8GB, and increase shared_buffers to between 10% and 25% of that. Upgrading to 8.3.x will further help.
On 1/8/09 6:41 AM, "Stefano Nichele" <stefano.nichele@gmail.com> wrote:
Glyn Astill wrote:
> --- On Thu, 8/1/09, Stefano Nichele <stefano.nichele@gmail.com> wrote:
>
>
>> From: Stefano Nichele <stefano.nichele@gmail.com>
>> Subject: Re: [PERFORM] understanding postgres issues/bottlenecks
>> To: "Scott Marlowe" <scott.marlowe@gmail.com>
>> Cc: pgsql-performance@postgresql.org
>> Date: Thursday, 8 January, 2009, 8:36 AM
>> Find !
>>
>> Dell CERC SATA RAID 2 PCI SATA 6ch
>>
>> Running lspci -v:
>>
>> 03:09.0 RAID bus controller: Adaptec AAC-RAID (rev 01)
>> Subsystem: Dell CERC SATA RAID 2 PCI SATA 6ch
>> (DellCorsair)
>> Flags: bus master, 66MHz, slow devsel, latency 32,
>> IRQ 209
>> Memory at f8000000 (32-bit, prefetchable)
>> [size=64M]
>> Expansion ROM at fe800000 [disabled] [size=32K]
>> Capabilities: [80] Power Management version 2
>>
>>
>
> Hmm, the 64M / 6ch makes it sound like this card
>
> http://accessories.us.dell.com/sna/products/Controllers/productdetail.aspx?c=us&l=en&s=bsd&cs=04&sku=310-5975
>
Indeed I think it's this one. The server is a Power Edge 1800 Server.
> Which is a 6ch dell version of
>
> http://www.adaptec.com/en-US/support/raid/sata/AAR-2410SA/
>
> I have one on my smash pile. The only thing that makes me think otherwise is the 2 in "CERC SATA RAID 2" ...
>
>
>
>
>
>
On Thu, Jan 8, 2009 at 9:42 AM, Stefano Nichele <stefano.nichele@gmail.com> wrote:
> Merlin Moncure wrote:
>> IIRC that's the 'perc 6ir' card...no write caching. You are getting
>> killed with syncs. If you can restart the database, you can test with
>> fsync=off comparing load to confirm this. (another way is to compare
>> select only vs regular transactions on pgbench).
>
> I'll try next Saturday.

just be aware of the danger: hard reset (power off) class of failure when fsync = off means you are loading from backups.

merlin
At 03:28 PM 1/8/2009, Merlin Moncure wrote:
> On Thu, Jan 8, 2009 at 9:42 AM, Stefano Nichele
> <stefano.nichele@gmail.com> wrote:
>> Merlin Moncure wrote:
>>> IIRC that's the 'perc 6ir' card...no write caching. You are getting
>>> killed with syncs. If you can restart the database, you can test with
>>> fsync=off comparing load to confirm this. (another way is to compare
>>> select only vs regular transactions on pgbench).
>>
>> I'll try next Saturday.
>
> just be aware of the danger: hard reset (power off) class of failure
> when fsync = off means you are loading from backups.
>
> merlin

That's what redundant power conditioning UPS's are supposed to help prevent ;-)

Merlin is of course absolutely correct that you are taking a bigger risk if you turn fsync off. I would not recommend fsync = off if you do not have other safety measures in place to protect against data loss because of a power event. (At least for most DB applications.)

...and of course, those lucky few with bigger budgets can use SSD's and not care what fsync is set to.

Ron
On Sat, Jan 10, 2009 at 5:40 AM, Ron <rjpeace@earthlink.net> wrote:
> At 03:28 PM 1/8/2009, Merlin Moncure wrote:
>> just be aware of the danger: hard reset (power off) class of failure
>> when fsync = off means you are loading from backups.
>
> That's what redundant power conditioning UPS's are supposed to help prevent
> ;-)

But of course, they can't prevent them, only reduce the likelihood of their occurrence. Everyone who's working in large hosting environments has at least one horror story to tell about a power outage that never should have happened.

> I would not recommend fsync = off if you do not have other safety measures
> in place to protect against data loss because of a power event.
> (At least for most DB applications.)

Agreed. Keep in mind that you'll be losing whatever wasn't transferred to the backup machines.

> ...and of course, those lucky few with bigger budgets can use SSD's and not
> care what fsync is set to.

Would that prevent any corruption if the writes got out of order because of lack of fsync? Or partial writes? Or wouldn't fsync still need to be turned on to keep the data safe?
"Scott Marlowe" <scott.marlowe@gmail.com> writes: > On Sat, Jan 10, 2009 at 5:40 AM, Ron <rjpeace@earthlink.net> wrote: >> At 03:28 PM 1/8/2009, Merlin Moncure wrote: >>> just be aware of the danger . hard reset (power off) class of failure >>> when fsync = off means you are loading from backups. >> >> That's what redundant power conditioning UPS's are supposed to help prevent >> ;-) > > But of course, they can't prevent them, but only reduce the likelihood > of their occurrance. Everyone who's working in large hosting > environments has at least one horror story to tell about a power > outage that never should have happened. Or a system crash. If the kernel panics for any reason when it has dirty buffers in memory the database will need to be restored. >> ...and of course, those lucky few with bigger budgets can use SSD's and not >> care what fsync is set to. > > Would that prevent any corruption if the writes got out of order > because of lack of fsync? Or partial writes? Or wouldn't fsync still > need to be turned on to keep the data safe. I think the idea is that with SSDs or a RAID with a battery backed cache you can leave fsync on and not have any significant performance hit since the seek times are very fast for SSD. They have limited bandwidth but bandwidth to the WAL is rarely an issue -- just latency. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
On Sat, 10 Jan 2009, Gregory Stark wrote:

>>> ...and of course, those lucky few with bigger budgets can use SSD's and not
>>> care what fsync is set to.
>>
>> Would that prevent any corruption if the writes got out of order
>> because of lack of fsync? Or partial writes? Or wouldn't fsync still
>> need to be turned on to keep the data safe?
>
> I think the idea is that with SSDs or a RAID with a battery backed cache you
> can leave fsync on and not have any significant performance hit since the seek
> times are very fast for SSD. They have limited bandwidth but bandwidth to the
> WAL is rarely an issue -- just latency.

I don't think that this is true. Even if your SSD is battery backed RAM (as opposed to the flash based devices that have slower writes than high-end hard drives), you can complete 'writes' to the system RAM faster than the OS can get the data to the drive, so if you don't do an fsync you can still lose a lot in a power outage.

raid controllers with battery backed ram cache will make the fsyncs very cheap (until the cache fills up anyway)

with SSDs having extremely good read speeds, but poor (at least by comparison) write speeds, I wonder if any of the RAID controllers are going to get a mode where they cache writes, but don't cache reads, leaving all of your cache to handle writes.

David Lang
Hi,

david@lang.hm wrote:
> On Sat, 10 Jan 2009, Gregory Stark wrote:
>> I think the idea is that with SSDs or a RAID with a battery backed
>> cache you
>> can leave fsync on and not have any significant performance hit since
>> the seek
>> times are very fast for SSD. They have limited bandwidth but bandwidth
>> to the
>> WAL is rarely an issue -- just latency.

That's also my understanding.

> with SSDs having extremely good read speeds, but poor (at least by
> comparison) write speeds I wonder if any of the RAID controllers are
> going to get a mode where they cache writes, but don't cache reads,
> leaving all of your cache to handle writes.

My understanding of SSDs so far is that they are not that bad at writing *on average*, but to perform wear-leveling, they sometimes have to shuffle around multiple blocks at once. So there are pretty awful spikes for writing latency (IIRC more than 100ms has been measured on cheaper disks). A battery backed cache could theoretically flatten those, as long as your avg. WAL throughput is below the SSD's avg. writing throughput.

Regards

Markus Wanner
On Sat, Jan 10, 2009 at 12:00 PM, Markus Wanner <markus@bluegap.ch> wrote:
> Hi,
>
> david@lang.hm wrote:
>> On Sat, 10 Jan 2009, Gregory Stark wrote:
>>> I think the idea is that with SSDs or a RAID with a battery backed
>>> cache you
>>> can leave fsync on and not have any significant performance hit since
>>> the seek
>>> times are very fast for SSD. They have limited bandwidth but bandwidth
>>> to the
>>> WAL is rarely an issue -- just latency.
>
> That's also my understanding.
>
>> with SSDs having extremely good read speeds, but poor (at least by
>> comparison) write speeds I wonder if any of the RAID controllers are
>> going to get a mode where they cache writes, but don't cache reads,
>> leaving all of your cache to handle writes.
>
> My understanding of SSDs so far is that they are not that bad at
> writing *on average*, but to perform wear-leveling, they sometimes have
> to shuffle around multiple blocks at once. So there are pretty awful
> spikes for writing latency (IIRC more than 100ms has been measured on
> cheaper disks).

Multiply it by 10 and apply it to both reads and writes for most cheap SSDs when doing random writes and reads mixed together. Which is why so many discussions specifically mention the Intel XM series, because they don't suck like that. They keep good access times even under several random read / write threads. A review of the others was posted here a while back and it was astounding how slow the others became in a mixed read / write benchmark.
On Sat, 10 Jan 2009, Markus Wanner wrote: > david@lang.hm wrote: >> On Sat, 10 Jan 2009, Gregory Stark wrote: >>> I think the idea is that with SSDs or a RAID with a battery backed >>> cache you >>> can leave fsync on and not have any significant performance hit since >>> the seek >>> times are very fast for SSD. They have limited bandwidth but bandwidth >>> to the >>> WAL is rarely an issue -- just latency. > > That's also my understanding. > >> with SSDs having extremely good read speeds, but poor (at least by >> comparison) write speeds I wonder if any of the RAID controllers are >> going to get a mode where they cache writes, but don't cache reads, >> leaving all ofyour cache to handle writes. > > My understanding of SSDs so far is, that they are not that bad at > writing *on average*, but to perform wear-leveling, they sometimes have > to shuffle around multiple blocks at once. So there are pretty awful > spikes for writing latency (IIRC more than 100ms has been measured on > cheaper disks). well, I have one of those cheap disks. brand new out of the box, format the 32G drive, then copy large files to it (~1G per file). this should do almost no wear-leveling, but its write performance is still poor and it has occasional 1 second pauses. for my initial tests I hooked it up to a USB->SATA adapter and the write speed is showing about half of what I can get on a 1.5TB SATA drive hooked to the same system. the write speed is fairly comparable to what you can do with slow laptop drives (even ignoring the pauses). read speed is much better (and I think limited by the USB). the key thing with any new storage technology (including RAID controllers) is that you need to do your own testing; treat the manufacturers' specs as ideal conditions, or 'we guarantee that the product will never do better than this' specs. Imation has a white paper on their site about solid state drive performance that is interesting. among other things it shows that high-speed SCSI drives are still a significant win in random-write workloads at this point. if I was specing out a new high-end system I would be looking at and testing something like the following: SSD for read-mostly items (OS, possibly some indexes); 15K SCSI drives for heavy writing (WAL, indexes, temp tables, etc); SATA drives for storage capacity (table contents). David Lang
david@lang.hm writes: > On Sat, 10 Jan 2009, Markus Wanner wrote: > >> My understanding of SSDs so far is, that they are not that bad at >> writing *on average*, but to perform wear-leveling, they sometimes have >> to shuffle around multiple blocks at once. So there are pretty awful >> spikes for writing latency (IIRC more than 100ms has been measured on >> cheaper disks). That would be fascinating. And frightening. A lot of people have been recommending these for WAL disks and this would make them actually *worse* than regular drives. > well, I have one of those cheap disks. > > brand new out of the box, format the 32G drive, then copy large files to it > (~1G per file). this should do almost no wear-leveling, but it's write > performance is still poor and it has occasional 1 second pauses. This isn't similar to the way WAL behaves though. What you're testing is the behaviour when the bandwidth to the SSD is saturated. At that point, somewhere in the stack (whether in the SSD, the USB hardware or driver, or the OS buffer cache), writes can start to queue up. The stalls you see could be the behaviour when that queue fills up and it needs to push back to higher layers. To simulate WAL you want to transfer smaller volumes of data, well below the bandwidth limit of the drive, fsync the data, then pause a bit, and repeat. Time each fsync and see whether the time they take is proportional to the amount of data written in the meantime or whether they randomly spike upwards. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
On Sat, 10 Jan 2009, Gregory Stark wrote: > david@lang.hm writes: > >> On Sat, 10 Jan 2009, Markus Wanner wrote: >> >>> My understanding of SSDs so far is, that they are not that bad at >>> writing *on average*, but to perform wear-leveling, they sometimes have >>> to shuffle around multiple blocks at once. So there are pretty awful >>> spikes for writing latency (IIRC more than 100ms has been measured on >>> cheaper disks). > > That would be fascinating. And frightening. A lot of people have been > recommending these for WAL disks and this would be make them actually *worse* > than regular drives. > >> well, I have one of those cheap disks. >> >> brand new out of the box, format the 32G drive, then copy large files to it >> (~1G per file). this should do almost no wear-leveling, but it's write >> performance is still poor and it has occasional 1 second pauses. > > This isn't similar to the way WAL behaves though. What you're testing is the > behaviour when the bandwidth to the SSD is saturated. At that point some point > in the stack, whether in the SSD, the USB hardware or driver, or OS buffer > cache can start to queue up writes. The stalls you see could be the behaviour > when that queue fills up and it needs to push back to higher layers. > > To simulate WAL you want to transfer smaller volumes of data, well below the > bandwidth limit of the drive, fsync the data, then pause a bit repeat. Time > each fsync and see whether the time they take is proportional to the amount of > data written in the meantime or whether they randomly spike upwards. if you have a specific benchmark for me to test I would be happy to do this. the test that I did is basically the best-case for the SSD (more-or-less sequential writes where the vendors claim that the drives match or slightly outperform the traditional disks). for random writes the vendors put SSDs at fewer IOPS than 5400 rpm drives, let alone 15K rpm drives. take a look at this paper http://www.imation.com/PageFiles/83/Imation-SSD-Performance-White-Paper.pdf this is not one of the low-performance drives; they include a sandisk drive in the paper that shows significantly less performance (but the same basic pattern) than the imation drives. David Lang
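A minimal sketch of the kind of test Stark describes might look like the following; it is not from the thread, and the file path, chunk size, pause, and 50 ms threshold are arbitrary assumptions:

import os, time

PATH = "/mnt/ssd/fsync_test.dat"   # hypothetical mount point for the drive under test
CHUNK = b"x" * 8192                # small write, well below the drive's bandwidth
ITERATIONS = 1000

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o600)
try:
    for i in range(ITERATIONS):
        os.write(fd, CHUNK)
        start = time.time()
        os.fsync(fd)                         # force the chunk out to the device
        elapsed_ms = (time.time() - start) * 1000.0
        if elapsed_ms > 50:                  # flag suspicious latency spikes
            print("iteration %d: fsync took %.1f ms" % (i, elapsed_ms))
        time.sleep(0.05)                     # pause so bandwidth is never saturated
finally:
    os.close(fd)
    os.unlink(PATH)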
At 10:36 AM 1/10/2009, Gregory Stark wrote: >"Scott Marlowe" <scott.marlowe@gmail.com> writes: > > > On Sat, Jan 10, 2009 at 5:40 AM, Ron <rjpeace@earthlink.net> wrote: > >> At 03:28 PM 1/8/2009, Merlin Moncure wrote: > >>> just be aware of the danger . hard reset (power off) class of failure > >>> when fsync = off means you are loading from backups. > >> > >> That's what redundant power conditioning UPS's are supposed to > help prevent > >> ;-) > > > > But of course, they can't prevent them, but only reduce the likelihood > > of their occurrance. Everyone who's working in large hosting > > environments has at least one horror story to tell about a power > > outage that never should have happened. > >Or a system crash. If the kernel panics for any reason when it has dirty >buffers in memory the database will need to be restored. A power conditioning UPS should prevent a building wide or circuit level bad power event, caused by either dirty power or a power loss, from affecting the host. Within the design limits of the UPS in question of course. So the real worry with fsync = off in an environment with redundant decent UPS's is pretty much limited to host level HW failures, SW crashes, and unlikely catastrophes like building collapses, lightning strikes, floods, etc. Not that your fsync setting is going to matter much in the event of catastrophes in the physical environment... Like anything else, there is usually more than one way to reduce risk while at the same time meeting (realistic) performance goals. If you need the performance implied by fsync off, then you have to take other steps to reduce the risk of data corruption down to about the same statistical level as running with fsync on. Or you have to decide that you are willing to live with the increased risk (NOT my recommendation for most DB hosting scenarios.) > >> ...and of course, those lucky few with bigger budgets can use > SSD's and not >> care what fsync is set to. > > > > Would that prevent any corruption if the writes got out of order > > because of lack of fsync? Or partial writes? Or wouldn't fsync still > > need to be turned on to keep the data safe. > >I think the idea is that with SSDs or a RAID with a battery backed cache you >can leave fsync on and not have any significant performance hit since the seek >times are very fast for SSD. They have limited bandwidth but bandwidth to the >WAL is rarely an issue -- just latency. Yes, Greg understands what I meant here. In the case of SSDs, the performance hit of fsync = on is essentially zero. In the case of battery backed RAM caches for RAID arrays, the efficacy is dependent on how the size of the cache compares with the working set of the disk access pattern. Ron
On Sat, 10 Jan 2009, Ron wrote: > At 10:36 AM 1/10/2009, Gregory Stark wrote: >> "Scott Marlowe" <scott.marlowe@gmail.com> writes: >> >> > On Sat, Jan 10, 2009 at 5:40 AM, Ron <rjpeace@earthlink.net> wrote: >> >> At 03:28 PM 1/8/2009, Merlin Moncure wrote: >> >>> just be aware of the danger . hard reset (power off) class of failure >> >>> when fsync = off means you are loading from backups. >> >> >> >> That's what redundant power conditioning UPS's are supposed to help >> prevent >> >> ;-) >> > >> > But of course, they can't prevent them, but only reduce the likelihood >> > of their occurrance. Everyone who's working in large hosting >> > environments has at least one horror story to tell about a power >> > outage that never should have happened. >> >> Or a system crash. If the kernel panics for any reason when it has dirty >> buffers in memory the database will need to be restored. > A power conditioning UPS should prevent a building wide or circuit level bad > power event, caused by either dirty power or a power loss, from affecting the > host. Within the design limits of the UPS in question of course. > > So the real worry with fsync = off in a environment with redundant decent > UPS's is pretty much limited to host level HW failures, SW crashes, and > unlikely catastrophes like building collapses, lightning strikes, floods, > etc. I've seen datacenters with redundant UPSs go dark unexpectedly. it's less common, but it does happen. > Not that your fsync setting is going to matter much in the event of > catastrophes in the physical environment... questionable, but sometimes true. in the physical environment disasters you will lose access to your data for a while, but after the drives are dug out of the rubble (or dried out from the flood) the data can probably be recovered. for crying out loud, they were able to recover most of the data from the hard drives in the latest shuttle disaster. > Like anything else, there is usually more than one way to reduce risk while > at the same time meeting (realistic) performance goals. very true. >> >> ...and of course, those lucky few with bigger budgets can use SSD's and >> not >> >> care what fsync is set to. >> > >> > Would that prevent any corruption if the writes got out of order >> > because of lack of fsync? Or partial writes? Or wouldn't fsync still >> > need to be turned on to keep the data safe. >> >> I think the idea is that with SSDs or a RAID with a battery backed cache >> you >> can leave fsync on and not have any significant performance hit since the >> seek >> times are very fast for SSD. They have limited bandwidth but bandwidth to >> the >> WAL is rarely an issue -- just latency. > Yes, Greg understands what I meant here. In the case of SSDs, the > performance hit of fsync = on is essentially zero. this is definitely not the case. with fsync off the data stays in memory and may never end up being sent to the drive. RAM speeds are several orders of magnitude faster than the interfaces to the drives (or even to the RAID controllers in high-speed slots). it may be that it's fast enough (see the other posts disputing that), but don't think that it's the same. David Lang > In the case of battery > backed RAM caches for RAID arrays, the efficacy is dependent on how the size > of the cache compares with the working set of the disk access pattern. > > Ron > >
Ron <rjpeace@earthlink.net> writes: > At 10:36 AM 1/10/2009, Gregory Stark wrote: >> >> Or a system crash. If the kernel panics for any reason when it has dirty >> buffers in memory the database will need to be restored. > > A power conditioning UPS should prevent a building wide or circuit level bad > power event Except of course those caused *by* a faulty UPS. Or for that matter by the power supply in the computer or drive array, or someone just accidentally hitting the wrong power button. I'm surprised people are so confident in their kernels though. I know some computers with uptimes measured in years but I know far more which don't. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
On Sat, 10 Jan 2009, Luke Lonergan wrote: > The new MLC based SSDs have better wear leveling tech and don't suffer > the pauses. Intel X25-M 80 and 160 GB SSDs are both pause-free. See > Anandtech's test results for details. they don't suffer the pauses, but they still don't have fantasic write speeds. David Lang > Intel's SLC SSDs should also be good enough but they're smaller. > > - Luke > > ----- Original Message ----- > From: pgsql-performance-owner@postgresql.org <pgsql-performance-owner@postgresql.org> > To: Gregory Stark <stark@enterprisedb.com> > Cc: Markus Wanner <markus@bluegap.ch>; Scott Marlowe <scott.marlowe@gmail.com>; Ron <rjpeace@earthlink.net>; pgsql-performance@postgresql.org<pgsql-performance@postgresql.org> > Sent: Sat Jan 10 14:40:51 2009 > Subject: Re: [PERFORM] understanding postgres issues/bottlenecks > > On Sat, 10 Jan 2009, Gregory Stark wrote: > >> david@lang.hm writes: >> >>> On Sat, 10 Jan 2009, Markus Wanner wrote: >>> >>>> My understanding of SSDs so far is, that they are not that bad at >>>> writing *on average*, but to perform wear-leveling, they sometimes have >>>> to shuffle around multiple blocks at once. So there are pretty awful >>>> spikes for writing latency (IIRC more than 100ms has been measured on >>>> cheaper disks). >> >> That would be fascinating. And frightening. A lot of people have been >> recommending these for WAL disks and this would be make them actually *worse* >> than regular drives. >> >>> well, I have one of those cheap disks. >>> >>> brand new out of the box, format the 32G drive, then copy large files to it >>> (~1G per file). this should do almost no wear-leveling, but it's write >>> performance is still poor and it has occasional 1 second pauses. >> >> This isn't similar to the way WAL behaves though. What you're testing is the >> behaviour when the bandwidth to the SSD is saturated. At that point some point >> in the stack, whether in the SSD, the USB hardware or driver, or OS buffer >> cache can start to queue up writes. The stalls you see could be the behaviour >> when that queue fills up and it needs to push back to higher layers. >> >> To simulate WAL you want to transfer smaller volumes of data, well below the >> bandwidth limit of the drive, fsync the data, then pause a bit repeat. Time >> each fsync and see whether the time they take is proportional to the amount of >> data written in the meantime or whether they randomly spike upwards. > > if you have a specific benchmark for me to test I would be happy to do > this. > > the test that I did is basicly the best-case for the SSD (more-or-less > sequential writes where the vendors claim that the drives match or > slightly outperform the traditional disks). for random writes the vendors > put SSDs at fewer IOPS than 5400 rpm drives, let along 15K rpm drives. > > take a look at this paper > http://www.imation.com/PageFiles/83/Imation-SSD-Performance-White-Paper.pdf > > this is not one of the low-performance drives, they include a sandisk > drive in the paper that shows significantly less performance (but the same > basic pattern) than the imation drives. > > David Lang > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance >
The new MLC based SSDs have better wear leveling tech and don't suffer the pauses. Intel X25-M 80 and 160 GB SSDs are bothpause-free. See Anandtech's test results for details. Intel's SLC SSDs should also be good enough but they're smaller. - Luke ----- Original Message ----- From: pgsql-performance-owner@postgresql.org <pgsql-performance-owner@postgresql.org> To: Gregory Stark <stark@enterprisedb.com> Cc: Markus Wanner <markus@bluegap.ch>; Scott Marlowe <scott.marlowe@gmail.com>; Ron <rjpeace@earthlink.net>; pgsql-performance@postgresql.org<pgsql-performance@postgresql.org> Sent: Sat Jan 10 14:40:51 2009 Subject: Re: [PERFORM] understanding postgres issues/bottlenecks On Sat, 10 Jan 2009, Gregory Stark wrote: > david@lang.hm writes: > >> On Sat, 10 Jan 2009, Markus Wanner wrote: >> >>> My understanding of SSDs so far is, that they are not that bad at >>> writing *on average*, but to perform wear-leveling, they sometimes have >>> to shuffle around multiple blocks at once. So there are pretty awful >>> spikes for writing latency (IIRC more than 100ms has been measured on >>> cheaper disks). > > That would be fascinating. And frightening. A lot of people have been > recommending these for WAL disks and this would be make them actually *worse* > than regular drives. > >> well, I have one of those cheap disks. >> >> brand new out of the box, format the 32G drive, then copy large files to it >> (~1G per file). this should do almost no wear-leveling, but it's write >> performance is still poor and it has occasional 1 second pauses. > > This isn't similar to the way WAL behaves though. What you're testing is the > behaviour when the bandwidth to the SSD is saturated. At that point some point > in the stack, whether in the SSD, the USB hardware or driver, or OS buffer > cache can start to queue up writes. The stalls you see could be the behaviour > when that queue fills up and it needs to push back to higher layers. > > To simulate WAL you want to transfer smaller volumes of data, well below the > bandwidth limit of the drive, fsync the data, then pause a bit repeat. Time > each fsync and see whether the time they take is proportional to the amount of > data written in the meantime or whether they randomly spike upwards. if you have a specific benchmark for me to test I would be happy to do this. the test that I did is basicly the best-case for the SSD (more-or-less sequential writes where the vendors claim that the drives match or slightly outperform the traditional disks). for random writes the vendors put SSDs at fewer IOPS than 5400 rpm drives, let along 15K rpm drives. take a look at this paper http://www.imation.com/PageFiles/83/Imation-SSD-Performance-White-Paper.pdf this is not one of the low-performance drives, they include a sandisk drive in the paper that shows significantly less performance (but the same basic pattern) than the imation drives. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
I believe they write at 200MB/s which is outstanding for sequential BW. Not sure about the write latency, though the Anandtechbenchmark results showed high detail and IIRC the write latencies were very good. - Luke ----- Original Message ----- From: david@lang.hm <david@lang.hm> To: Luke Lonergan Cc: stark@enterprisedb.com <stark@enterprisedb.com>; markus@bluegap.ch <markus@bluegap.ch>; scott.marlowe@gmail.com <scott.marlowe@gmail.com>;rjpeace@earthlink.net <rjpeace@earthlink.net>; pgsql-performance@postgresql.org <pgsql-performance@postgresql.org> Sent: Sat Jan 10 16:03:32 2009 Subject: Re: [PERFORM] understanding postgres issues/bottlenecks On Sat, 10 Jan 2009, Luke Lonergan wrote: > The new MLC based SSDs have better wear leveling tech and don't suffer > the pauses. Intel X25-M 80 and 160 GB SSDs are both pause-free. See > Anandtech's test results for details. they don't suffer the pauses, but they still don't have fantasic write speeds. David Lang > Intel's SLC SSDs should also be good enough but they're smaller. > > - Luke > > ----- Original Message ----- > From: pgsql-performance-owner@postgresql.org <pgsql-performance-owner@postgresql.org> > To: Gregory Stark <stark@enterprisedb.com> > Cc: Markus Wanner <markus@bluegap.ch>; Scott Marlowe <scott.marlowe@gmail.com>; Ron <rjpeace@earthlink.net>; pgsql-performance@postgresql.org<pgsql-performance@postgresql.org> > Sent: Sat Jan 10 14:40:51 2009 > Subject: Re: [PERFORM] understanding postgres issues/bottlenecks > > On Sat, 10 Jan 2009, Gregory Stark wrote: > >> david@lang.hm writes: >> >>> On Sat, 10 Jan 2009, Markus Wanner wrote: >>> >>>> My understanding of SSDs so far is, that they are not that bad at >>>> writing *on average*, but to perform wear-leveling, they sometimes have >>>> to shuffle around multiple blocks at once. So there are pretty awful >>>> spikes for writing latency (IIRC more than 100ms has been measured on >>>> cheaper disks). >> >> That would be fascinating. And frightening. A lot of people have been >> recommending these for WAL disks and this would be make them actually *worse* >> than regular drives. >> >>> well, I have one of those cheap disks. >>> >>> brand new out of the box, format the 32G drive, then copy large files to it >>> (~1G per file). this should do almost no wear-leveling, but it's write >>> performance is still poor and it has occasional 1 second pauses. >> >> This isn't similar to the way WAL behaves though. What you're testing is the >> behaviour when the bandwidth to the SSD is saturated. At that point some point >> in the stack, whether in the SSD, the USB hardware or driver, or OS buffer >> cache can start to queue up writes. The stalls you see could be the behaviour >> when that queue fills up and it needs to push back to higher layers. >> >> To simulate WAL you want to transfer smaller volumes of data, well below the >> bandwidth limit of the drive, fsync the data, then pause a bit repeat. Time >> each fsync and see whether the time they take is proportional to the amount of >> data written in the meantime or whether they randomly spike upwards. > > if you have a specific benchmark for me to test I would be happy to do > this. > > the test that I did is basicly the best-case for the SSD (more-or-less > sequential writes where the vendors claim that the drives match or > slightly outperform the traditional disks). for random writes the vendors > put SSDs at fewer IOPS than 5400 rpm drives, let along 15K rpm drives. 
> > take a look at this paper > http://www.imation.com/PageFiles/83/Imation-SSD-Performance-White-Paper.pdf > > this is not one of the low-performance drives, they include a sandisk > drive in the paper that shows significantly less performance (but the same > basic pattern) than the imation drives. > > David Lang > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance >
david@lang.hm wrote: > On Sat, 10 Jan 2009, Luke Lonergan wrote: > >> The new MLC based SSDs have better wear leveling tech and don't >> suffer the pauses. Intel X25-M 80 and 160 GB SSDs are both >> pause-free. See Anandtech's test results for details. > > they don't suffer the pauses, but they still don't have fantasic write > speeds. > > David Lang > >> Intel's SLC SSDs should also be good enough but they're smaller. >> From what I can see, SLC SSDs are still quite superior for reliability and (write) performance. However they are too small and too expensive right now. Hopefully the various manufacturers are working on improving the size/price issue for SLC, as well as improving the performance/reliability area for the MLC products. regards Mark
On Sun, 11 Jan 2009, Mark Kirkwood wrote: > david@lang.hm wrote: >> On Sat, 10 Jan 2009, Luke Lonergan wrote: >> >>> The new MLC based SSDs have better wear leveling tech and don't suffer the >>> pauses. Intel X25-M 80 and 160 GB SSDs are both pause-free. See >>> Anandtech's test results for details. >> >> they don't suffer the pauses, but they still don't have fantasic write >> speeds. >> >> David Lang >> >>> Intel's SLC SSDs should also be good enough but they're smaller. >>> > > From what I can see, SLC SSDs are still quite superior for reliability and > (write) performance. However they are too small and too expensive right now. > Hopefully the various manufacturers are working on improving the size/price > issue for SLC, as well as improving the performance/reliability area for the > MLC products. the very nature of the technology means that SLC will never be as cheap as MLC and MLC will never be as reliable as SLC take a look at http://www.imation.com/PageFiles/83/SSD-Reliability-Lifetime-White-Paper.pdf for a good writeup of the technology. for both technologies, the price will continue to drop, and the reliability and performance will continue to climb, but I don't see anything that would improve one without the other (well, I could see MLC gaining a 50% capacity boost if they can get to 3 bits per cell vs the current 2, but that would come at the cost of reliability again) for write performance I don't think there is as much of a difference between the two technologies. today there is a huge difference in most of the shipping products, but Intel has now demonstrated that it's mostly due to the controller chip, so I expect much of that difference to vanish in the next year or so (as new generations of controller chips ship) David Lang
Ron wrote: >> I think the idea is that with SSDs or a RAID with a battery backed >> cache you >> can leave fsync on and not have any significant performance hit since >> the seek >> times are very fast for SSD. They have limited bandwidth but >> bandwidth to the >> WAL is rarely an issue -- just latency. > Yes, Greg understands what I meant here. In the case of SSDs, the > performance hit of fsync = on is essentially zero. In the case of > battery backed RAM caches for RAID arrays, the efficacy is dependent > on how the size of the cache compares with the working set of the disk > access pattern. Out of interest, if we take a scenario where the working set of updates exceeds the size of the RAID card cache, has anyone tested the relative performance of using the battery backed RAID on WAL only and non-cached access to other drives? And perhaps the similar scenario with (hot) indices and WAL on a battery-backed device on the data on uncached devices? It seems to me that if you're going to thrash the cache from data updates (presumably courtesy of full-page-write), then you might be better to partition the cache - and a thrashed cache can be hardly any better than no cache (so why have one?).
Hi All,
I ran pgbench. Here are some results:
-bash-3.1$ pgbench -c 50 -t 1000
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 50
number of transactions per client: 1000
number of transactions actually processed: 50000/50000
tps = 377.351354 (including connections establishing)
tps = 377.788377 (excluding connections establishing)
Some vmstat samplings in the meantime:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 4 92 127880 8252 3294512 0 0 458 12399 2441 14903 22 9 34 35 0
11 49 92 125336 8288 3297016 0 0 392 11071 2390 11568 17 7 51 24 0
0 2 92 124548 8304 3297764 0 0 126 8249 2291 3829 5 3 64 28 0
0 1 92 127268 7796 3295672 0 0 493 11387 2323 14221 23 9 47 21 0
0 2 92 127256 7848 3295492 0 0 501 10654 2215 14599 24 9 42 24 0
0 2 92 125772 7892 3295656 0 0 34 7541 2311 327 0 1 59 40 0
0 1 92 127188 7952 3294084 0 0 537 11039 2274 15460 23 10 43 24 0
7 4 92 123816 7996 3298620 0 0 253 8946 2284 7310 11 5 52 32 0
0 2 92 126652 8536 3294220 0 0 440 9563 2307 9036 13 6 56 25 0
0 10 92 125268 8584 3296116 0 0 426 10696 2285 11034 20 9 39 32 0
0 2 92 124168 8604 3297252 0 0 104 8385 2319 4162 3 3 40 54 0
0 8 92 123780 8648 3296456 0 0 542 11498 2298 16613 25 10 16 48 0
-bash-3.1$ pgbench -t 10000 -c 50
starting vacuum...end.
transaction type: SELECT only
scaling factor: 100
number of clients: 50
number of transactions per client: 10000
number of transactions actually processed: 500000/500000
tps = 8571.573651 (including connections establishing)
tps = 8594.357138 (excluding connections establishing)
-bash-3.1$ pgbench -t 10000 -c 50 -S
starting vacuum...end.
transaction type: SELECT only
scaling factor: 100
number of clients: 50
number of transactions per client: 10000
number of transactions actually processed: 500000/500000
tps = 8571.573651 (including connections establishing)
tps = 8594.357138 (excluding connections establishing)
(next test is with scaling factor 1)
-bash-3.1$ pgbench -t 20000 -c 8 -S pgbench
starting vacuum...end.
transaction type: SELECT only
scaling factor: 1
number of clients: 8
number of transactions per client: 20000
number of transactions actually processed: 160000/160000
tps = 11695.895318 (including connections establishing)
tps = 11715.603720 (excluding connections establishing)
Any comments?
I can also give you some details about the database usage of my application:
- number of active connections: about 60
- number of idle connections: about 60
Here are some numbers from an old pgfouine report of mine:
- query peak: 378 queries/s
- select: 53.1%, insert: 3.8%, update: 2.2%, delete: 2.8%
The application is basically a web application and the db size is 37 GB.
Is there a way to get the number of queries per second and the percentages of select/update/insert/delete without pgfouine?
What is the performance impact of stats_start_collector = on and stats_row_level = on? (They are on since I use autovacuum.)
Thanks a lot for your help.
ste
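One way to approximate those numbers without pgfouine is to sample the statistics views twice and diff the counters (stats_start_collector and stats_row_level are already on). The sketch below is not from the thread: the connection string and 60-second window are assumptions, the first query counts transactions rather than individual queries, and the table counters report rows (not statements) inserted, updated, and deleted.

import time
import psycopg2   # any driver that can run these two queries would work as well

INTERVAL = 60
TXN_SQL = ("SELECT sum(xact_commit + xact_rollback) FROM pg_stat_database "
           "WHERE datname = current_database()")
WRITE_SQL = ("SELECT sum(n_tup_ins), sum(n_tup_upd), sum(n_tup_del) "
             "FROM pg_stat_user_tables")

conn = psycopg2.connect("dbname=mydb user=postgres")   # hypothetical DSN
cur = conn.cursor()

def snapshot():
    cur.execute(TXN_SQL)
    txns = cur.fetchone()[0]
    cur.execute(WRITE_SQL)
    ins, upd, dele = cur.fetchone()
    conn.rollback()   # end the transaction so the next sample sees fresh statistics
    return txns, ins, upd, dele

before = snapshot()
time.sleep(INTERVAL)
after = snapshot()
rates = [float(b - a) / INTERVAL for a, b in zip(before, after)]
print("transactions/sec: %.1f" % rates[0])
print("rows inserted/sec: %.1f  updated/sec: %.1f  deleted/sec: %.1f" % tuple(rates[1:]))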
On Wed, Jan 7, 2009 at 8:05 PM, Stefano Nichele <stefano.nichele@gmail.com> wrote:
Ok, here is some information:
OS: Centos 5.x (Linux 2.6.18-53.1.21.el5 #1 SMP Tue May 20 09:34:18 EDT 2008 i686 i686 i386 GNU/Linux)
RAID: it's a hardware RAID controller
The disks are 9600rpm SATA drives
(6 disk 1+0 RAID array and 2 separate disks for the OS).
About iostat (on sdb I have pg_xlog, on sdc I have data)
> iostat -k
Linux 2.6.18-53.1.21.el5 (*******) 01/07/2009
avg-cpu: %user %nice %system %iowait %steal %idle
17.27 0.00 5.13 45.08 0.00 32.52
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 30.42 38.50 170.48 182600516 808546589
sdb 46.16 0.23 52.10 1096693 247075617
sdc 269.26 351.51 451.00 1667112043 2138954833
> iostat -x -k -d 2 5
Linux 2.6.18-53.1.21.el5 (*******) 01/07/2009
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.17 12.68 0.47 29.95 38.51 170.51 13.74 0.03 0.86 0.19 0.57
sdb 0.01 80.11 0.05 46.11 0.23 52.01 2.26 0.01 0.22 0.22 1.01
sdc 7.50 64.57 222.55 46.69 350.91 450.98 5.96 0.57 2.05 3.13 84.41
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 196.00 1.00 117.00 4.00 1252.00 21.29 0.02 0.19 0.19 2.30
sdc 1.50 66.00 277.00 66.50 3100.00 832.00 22.89 50.84 242.30 2.91 100.10
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 264.50 0.00 176.50 0.00 1764.00 19.99 0.04 0.21 0.21 3.70
sdc 3.50 108.50 291.50 76.00 3228.00 752.00 21.66 89.42 239.39 2.72 100.05
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 4.98 0.00 1.00 0.00 23.88 48.00 0.00 0.00 0.00 0.00
sdb 0.00 23.88 0.00 9.45 0.00 133.33 28.21 0.00 0.21 0.21 0.20
sdc 1.00 105.97 274.13 53.73 3297.51 612.94 23.85 67.99 184.58 3.04 99.55
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 79.00 0.00 46.00 0.00 500.00 21.74 0.01 0.25 0.25 1.15
sdc 2.50 141.00 294.00 43.50 3482.00 528.00 23.76 51.33 170.46 2.96 100.05
vmstat at the same time:
> vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 27 80 126380 27304 3253016 0 0 98 55 0 1 17 5 33 45 0
0 26 80 124516 27300 3255456 0 0 3438 1724 2745 4011 11 2 8 78 0
1 25 80 124148 27276 3252548 0 0 3262 2806 3572 7007 33 11 3 53 0
1 28 80 128272 27244 3248516 0 0 2816 1006 2926 5624 12 3 12 73 0
I will run pgbench in the next few days.
Aside from all the advice here about system tuning, as a system admin I'd also ask is the box doing the job you need? And are you looking at the Postgres log (with logging of slow queries) to see that queries perform in a sensible time? I'd assume with the current performance figure there is an issue somewhere, but I've been to places where it was as simple as adding one index, or even modifying an index so it does what the application developer intended instead of what they ask for ;)
I already checked the postgres log and resolved index/slow-query issues. Actually I have queries that are sometimes really fast and sometimes time out.
But all the required indexes are there. For sure, there is room to improve performance in that way too, but I would also like to investigate the issue from other points of view (in order to understand how to monitor the server as well).
Cheers and thanks a lot.
ste
Here are some numbers from an old pgfouine report of mine:
- query peak: 378 queries/s
- select: 53.1%, insert: 3.8%, update: 2.2%, delete: 2.8%
Actually the percentages are wrong (I think pgfouine also counts other types of queries, like SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE;):
These are the right ones:
SELECT num: 658.5 perc: 85.7 %
INSERT num: 47.66 perc: 6.2 %
UPDATE num: 27.49 perc: 3.6 %
DELETE num: 34.56 perc: 4.5 %
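As a quick arithmetic check (not part of the original mail), the corrected percentages do follow from the per-second counts above:

counts = {"SELECT": 658.5, "INSERT": 47.66, "UPDATE": 27.49, "DELETE": 34.56}
total = sum(counts.values())                   # roughly 768 queries/s in total
for kind in ("SELECT", "INSERT", "UPDATE", "DELETE"):
    print("%s: %.1f%%" % (kind, 100.0 * counts[kind] / total))
# prints 85.7, 6.2, 3.6 and 4.5 percent, matching the figures quoted above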
On Sat, Jan 10, 2009 at 2:56 PM, Ron <rjpeace@earthlink.net> wrote: > At 10:36 AM 1/10/2009, Gregory Stark wrote: >> >> "Scott Marlowe" <scott.marlowe@gmail.com> writes: >> >> > On Sat, Jan 10, 2009 at 5:40 AM, Ron <rjpeace@earthlink.net> wrote: >> >> At 03:28 PM 1/8/2009, Merlin Moncure wrote: >> >>> just be aware of the danger . hard reset (power off) class of failure >> >>> when fsync = off means you are loading from backups. >> >> >> >> That's what redundant power conditioning UPS's are supposed to help >> >> prevent >> >> ;-) >> > >> > But of course, they can't prevent them, but only reduce the likelihood >> > of their occurrance. Everyone who's working in large hosting >> > environments has at least one horror story to tell about a power >> > outage that never should have happened. >> >> Or a system crash. If the kernel panics for any reason when it has dirty >> buffers in memory the database will need to be restored. > > A power conditioning UPS should prevent a building wide or circuit level bad > power event, caused by either dirty power or a power loss, from affecting > the host. Within the design limits of the UPS in question of course. We had an electrician working who was supposed to have a tray underneath their work. They didn't. A tiny bit of copper flew into a power conditioner. The power conditioner blew out, fed back to the other two power conditioners, which blew; they fed back to the UPSs and blew them up, and the power surge blew out the switch that allows the diesel generator to take over. We were running 100ft extension cables from dirty wall power sockets all over the building to get the hosting center back up. There were about 12 or so database servers. The only one that came back up without data loss or corruption was mine, running pgsql. The others, running Oracle, db2, Ingres and a few other databases, all came back up with corrupted data on their drives and forced nearly day long restores. There is no protection against a kernel crash or a power loss that is absolute. And don't ever believe there is. Human error is always a possibility you have to be prepared to deal with. > So the real worry with fsync = off in a environment with redundant decent > UPS's is pretty much limited to host level HW failures, SW crashes, and > unlikely catastrophes like building collapses, lightning strikes, floods, > etc. > Not that your fsync setting is going to matter much in the event of > catastrophes in the physical environment... Sure it will. SCSI cable gets pulled out, power supply fails, mobo just dies outright, the above mentioned situation with the power being lost to the data center. Meteor strikes, not so much.
On Sun, Jan 11, 2009 at 11:07 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote: > running pgsql. The others, running Oracle, db2, Ingress and a few > other databases all came back up with corrupted data on their drives > and forced nearly day long restores. Before anyone thinks I'm slagging all other databases here, the problem wasn't the db software, but the sysadmin / dbas who administered those machines. They stared in horror when I tested my server by pulling the power cords out the back during acceptance testing. They also told me we could never lose power in the hosting center because it was so wonderful and redundant and that I was wasting my time. We lost power on a Friday afternoon. I went skiing that weekend with a friend; my system was up and running when it got power again. I'm pretty sure the other dbas at that company weren't skiing that weekend. :)
--- On Sun, 11/1/09, Scott Marlowe <scott.marlowe@gmail.com> wrote: > They also told me we could never lose power in the hosting > center > because it was so wonder and redundant and that I was > wasting my time. Well, that's just plain silly; at the very least there's always going to be some breakers / fuses in between the power and the machines. In fact in our building there's quite a few breaker points between our comms room on the 3rd floor and the ups / generator in the basement. It's a crappy implementation actually.
On Sun, 11 Jan 2009, Glyn Astill wrote: > --- On Sun, 11/1/09, Scott Marlowe <scott.marlowe@gmail.com> wrote: > >> They also told me we could never lose power in the hosting >> center >> because it was so wonder and redundant and that I was >> wasting my time. > > We'll that's just plain silly, at the very least there's always going to > be some breakers / fuzes in between the power and the machines. > > In fact in our building there's quite a few breaker points between our > comms room on the 3rd floor and the ups / generator in the basement. > It's a crappy implementation actually. the response I get from people is that they give their servers redundant power supplies and put them on separate circuits so they must be safe from that. but as commented, that's not enough in the real world. David Lang
Not to mention the #1 cause of server faults in my experience: OS kernel bug causes a crash. Battery backup doesn't help you much there. Fsync of log is necessary IMO. That said, you could use a replication/backup strategy to get a consistent snapshot in the past if you don't mind losing some data or can recreate it from backup elsewhere. I think a strategy that puts the WAL on an SLC SSD is a very good one as of Jan/09 and will get much better in short order. - Luke ----- Original Message ----- From: pgsql-performance-owner@postgresql.org <pgsql-performance-owner@postgresql.org> To: Glyn Astill <glynastill@yahoo.co.uk> Cc: Ron <rjpeace@earthlink.net>; Scott Marlowe <scott.marlowe@gmail.com>; pgsql-performance@postgresql.org <pgsql-performance@postgresql.org> Sent: Sun Jan 11 15:35:22 2009 Subject: Re: [PERFORM] understanding postgres issues/bottlenecks On Sun, 11 Jan 2009, Glyn Astill wrote: > --- On Sun, 11/1/09, Scott Marlowe <scott.marlowe@gmail.com> wrote: > >> They also told me we could never lose power in the hosting >> center >> because it was so wonder and redundant and that I was >> wasting my time. > > We'll that's just plain silly, at the very least there's always going to > be some breakers / fuzes in between the power and the machines. > > In fact in our building there's quite a few breaker points between our > comms room on the 3rd floor and the ups / generator in the basement. > It's a crappy implementation actually. the response I get from people is that they give their servers redundant poewr supplies and put them on seperate circuits so they must be safe from that. but as commented, that's not enough in the real world. David Lang -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
On Sun, Jan 11, 2009 at 4:16 PM, Luke Lonergan <LLonergan@greenplum.com> wrote: > Not to mention the #1 cause of server faults in my experience: OS kernel bug causes a crash. Battery backup doesn't help you much there. I've been using pgsql since way back, in a lot of projects, and almost all of them on some flavor of linux, usually redhat. I have had zero kernel crashes on production dbs in that time. I'd be interested to know which historic kernels caused you the crashes.
Luke Lonergan wrote: > Not to mention the #1 cause of server faults in my experience: OS kernel bug causes a crash. Battery backup doesn't helpyou much there. Well now ... that very much depends on where you *got* the server OS and how you administer it. If you're talking a correctly-maintained Windows 2003 Server installation, or a correctly-maintained Red Hat Enterprise Linux installation, or any other "branded" OS from Novell, Sun, HP, etc., I'm guessing such crashes are much rarer than what you've experienced. And you're probably in pretty good shape with Debian stable and the RHEL respins like CentOS. I can't comment on Ubuntu server or any of the BSD family -- I've never worked with them. But you should be able to keep a "branded" server up for months, with the exception of applying security patches that require a reboot. And *those* can be *planned* outages! Where you *will* have some major OS risk is with testing-level software or "bleeding edge" Linux distros like Fedora. Quite frankly, I don't know why people run Fedora servers -- if it's Red Hat compatibility you want, there's CentOS. -- M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P), WOM I've never met a happy clam. In fact, most of them were pretty steamed.
> Where you *will* have some major OS risk is with testing-level software > or "bleeding edge" Linux distros like Fedora. Quite frankly, I don't > know why people run Fedora servers -- if it's Red Hat compatibility you > want, there's CentOS. I've had no stability problems with Fedora. The worst experience I've had with that distribution is that half the time the CD-burning utilities seem to be flaky. As for why that and not CentOS... I like having modern versions of all of my packages. 5 years is a long time to get nothing but bugfixes. ...Robert
Robert Haas wrote: >> Where you *will* have some major OS risk is with testing-level software >> or "bleeding edge" Linux distros like Fedora. Quite frankly, I don't >> know why people run Fedora servers -- if it's Red Hat compatibility you >> want, there's CentOS. > > I've had no stability problems with Fedora. The worst experience I've > had with that distribution is that half the time the CD-burning > utilities seem to be flaky. As for why that and not CentOS... I like > having modern versions of all of my packages. 5 years is a long time > to get nothing but bugfixes. Fedora is not the only distro with "flaky CD-burning utilities". I'm still hunting down issues with k3b and brasero on my openSUSE 11.1 system. "RW" media are your friends. :) Five years may be a long time to get nothing but bugfixes, but having to do extensive upgrade testing and an OS upgrade every six months, which is where most "community" distros' release schedules are these days, also seems extreme. I personally think for the money you pay for Windows 2003 Server or RHEL, you ought to be able to go a year or more between unplanned reboots, and they really should minimize the number of reboots you have to eat for security fixes too. Otherwise, what's the point of spending money for an OS? :) -- M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P), WOM I've never met a happy clam. In fact, most of them were pretty steamed.
On Sun, 11 Jan 2009, M. Edward (Ed) Borasky wrote: > And you're probably in pretty good shape with Debian stable and the RHEL > respins like CentOS. No one is in good shape until they've done production-level load testing on the system and have run the sort of "unplug it under load" tests that Scott was praising. Since so many of the kernel bugs you might run into are in device drivers, the subset of Debian/RHEL you're using is only as stable as the worst supported driver running on your system. > But you should be able to keep a "branded" server up for months, with > the exception of applying security patches that require a reboot. Right, this is why I only rely on Linux deployments using a name I trust: Dell. Returning to reality, the idea that there are brands you can buy that make all your problems go away is rather optimistic. The number of "branded" servers I've seen that are just nearly or completely worthless for database use is rather depressing. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith wrote: > Right, this is why I only rely on Linux deployments using a name I > trust: Dell. > > Returning to reality, the idea that there are brands you can buy that > make all your problems go away is rather optimistic. The number of > "branded" servers I've seen that are just nearly or completely worthless > for database use is rather depressing. Well, of course, they won't make *all* your problems go away. But still, I'd much rather have an IBM or HP or Dell server running Windows 2003 or RHEL or SLES than some no-name hardware running Fedora or Ubuntu. -- M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P), WOM I've never met a happy clam. In fact, most of them were pretty steamed.
M. Edward (Ed) Borasky wrote: > Greg Smith wrote: > >> Right, this is why I only rely on Linux deployments using a name I >> trust: Dell. >> >> Returning to reality, the idea that there are brands you can buy that >> make all your problems go away is rather optimistic. The number of >> "branded" servers I've seen that are just nearly or completely worthless >> for database use is rather depressing. >> > > Well, of course, they won't make *all* your problems go away. But still, > I'd much rather have an IBM or HP or Dell server running Windows 2003 or > RHEL or SLES than some no-name hardware running Fedora or Ubuntu. > If you use no-name hardware it all depends on how reliable your supplier is. The no-name white box providers I've had experience with have always supplied hardware that was reliable and fast. And as they were small companies they would work with you to give you hardware that you preferred (e.g. raid cards etc). Conversely I've found big name brand suppliers would often change critical parts (network or raid cards) midway through shipment - leaving you with the odd-man-out server to debug silly issues with (e.g. won't get on the network, disk array not recognized by the installation media etc). So I'm not entirely convinced by the 'name brand is good' argument. I'd agree that Fedora is probably not the best choice for a deployment(!). My experience of Ubuntu has been better, however using the LTS release might be a wise move if one wants to use this distro. regards Mark
On Sun, Jan 11, 2009 at 8:08 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Where you *will* have some major OS risk is with testing-level software >> or "bleeding edge" Linux distros like Fedora. Quite frankly, I don't >> know why people run Fedora servers -- if it's Red Hat compatibility you >> want, there's CentOS. > > I've had no stability problems with Fedora. The worst experience I've > had with that distribution is that half the time the CD-burning > utilities seem to be flaky. As for why that and not CentOS... I like > having modern versions of all of my packages. 5 years is a long time > to get nothing but bugfixes. We basically do the same thing with our app tier servers, chasing the next to latest ubuntus so we aren't running really old bits there. For the db level it's RHEL with pgsql. Ubuntu has been pretty stable on the app tier, but we don't push it in the same ways we push our databases either.
On Sun, 11 Jan 2009, M. Edward (Ed) Borasky wrote: > Where you *will* have some major OS risk is with testing-level software > or "bleeding edge" Linux distros like Fedora. I just ran "uptime" on my home machine, and it said 144 days. Debian unstable, on no-name hardware. I guess the last time I rebooted was when I changed the graphics drivers. I can't remember the last time it actually crashed. I guess the moral is, you find a combination of hardware and software that works well, and you may as well stick with it. The most likely things to destabilise the system are drivers for "interesting" pieces of hardware, like graphics cards and (unfortunately) some high-performance RAID cards. Matthew -- A good programmer is one who looks both ways before crossing a one-way street. Considering the quality and quantity of one-way streets in Cambridge, it should be no surprise that there are so many good programmers there.
On Jan 11, 2009, at 9:43 PM, M. Edward (Ed) Borasky wrote: > Luke Lonergan wrote: >> Not to mention the #1 cause of server faults in my experience: OS >> kernel bug causes a crash. Battery backup doesn't help you much >> there. > Not that long ago (a month or so) we ran into a problem where hpacucli (Utility for configuring/inspecting/etc HP smartarray controllers) would tickle the cciss driver in such a way that it would cause a kernel panic. KABLAMMO (No data loss! weeeee!). The box had run for a long time without crashes, but it seems that when we added more disks and started the array building the new logical drive some magical things happened. Bugs happen. The [bad word] of it is catching the culprit with its fingers in the cookie jar. -- Jeff Trout <jeff@jefftrout.com> http://www.stuarthamm.net/ http://www.dellsmartexitin.com/
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 M. Edward (Ed) Borasky wrote: | Luke Lonergan wrote: |> Not to mention the #1 cause of server faults in my experience: OS |> kernel bug causes a crash. Battery backup doesn't help you much there. |> | | Well now ... that very much depends on where you *got* the server OS and | how you administer it. If you're talking a correctly-maintained Windows | 2003 Server installation, or a correctly-maintained Red Hat Enterprise | Linux installation, or any other "branded" OS from Novell, Sun, HP, etc., | I'm guessing such crashes are much rarer than what you've experienced. | | And you're probably in pretty good shape with Debian stable and the RHEL | respins like CentOS. I can't comment on Ubuntu server or any of the BSD | family -- I've never worked with them. But you should be able to keep a | "branded" server up for months, with the exception of applying security | patches that require a reboot. And *those* can be *planned* outages! | | Where you *will* have some major OS risk is with testing-level software | or "bleeding edge" Linux distros like Fedora. Quite frankly, I don't know | why people run Fedora servers -- if it's Red Hat compatibility you want, | there's CentOS. | Linux kernels seem to be pretty good these days. I ran Red Hat Linux 7.3 24/7 for over 6 months, and it was discontinued years ago. I recognize that this is by no means a record. It did not crash after 6 months, but I upgraded that box to CentOS 4 and it has been running that a long time. That box has minor hardware problems that do not happen often enough to find the real cause. But it stays up months at a time. All that box does is run BOINC and a printer server (CUPS). This machine does not crash, but it gets rebooted whenever a new kernel comes out, and has been up almost a month. It run RHEL5. I would think Fedora's kernel would probably be OK, but the other bleeding edge stuff I would not risk a serious server on. - -- ~ .~. Jean-David Beyer Registered Linux User 85642. ~ /V\ PGP-Key: 9A2FC99A Registered Machine 241939. ~ /( )\ Shrewsbury, New Jersey http://counter.li.org ~ ^^-^^ 14:10:01 up 30 days, 1:55, 3 users, load average: 4.18, 4.26, 4.24 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (GNU/Linux) Comment: Using GnuPG with CentOS - http://enigmail.mozdev.org iD8DBQFJb4zmPtu2XpovyZoRAn9TAKDFoEZ0JtoTi7T0qs9ZlI7rLxs9lACeJjDZ XL9rGZqzw0LjrszD1DaAhp4= =LdVq -----END PGP SIGNATURE-----
Jean-David Beyer wrote: > M. Edward (Ed) Borasky wrote: > | Luke Lonergan wrote: > |> Not to mention the #1 cause of server faults in my experience: OS > |> kernel bug causes a crash. Battery backup doesn't help you much there. > |> > | > | Well now ... that very much depends on where you *got* the server OS and > | how you administer it. If you're talking a correctly-maintained Windows > | 2003 Server installation, or a correctly-maintained Red Hat Enterprise > | Linux installation, or any other "branded" OS from Novell, Sun, HP, etc., > | I'm guessing such crashes are much rarer than what you've experienced. > | > | And you're probably in pretty good shape with Debian stable and the RHEL > | respins like CentOS. I can't comment on Ubuntu server or any of the BSD > | family -- I've never worked with them. But you should be able to keep a > | "branded" server up for months, with the exception of applying security > | patches that require a reboot. And *those* can be *planned* outages! > | > | Where you *will* have some major OS risk is with testing-level software > | or "bleeding edge" Linux distros like Fedora. Quite frankly, I don't know > | why people run Fedora servers -- if it's Red Hat compatibility you want, > | there's CentOS. > | > Linux kernels seem to be pretty good these days. I ran Red Hat Linux 7.3 > 24/7 for over 6 months, and it was discontinued years ago. I recognize that > this is by no means a record. It did not crash after 6 months, but I > upgraded that box to CentOS 4 and it has been running that a long time. > That > box has minor hardware problems that do not happen often enough to find the > real cause. But it stays up months at a time. All that box does is run > BOINC > and a printer server (CUPS). > > This machine does not crash, but it gets rebooted whenever a new kernel > comes out, and has been up almost a month. It run RHEL5. > > I would think Fedora's kernel would probably be OK, but the other bleeding > edge stuff I would not risk a serious server on. I haven't heard much one way or the other about Fedora's kernels. Because of the way their release cycle is staggered with the other major community distros, they tend to be a number behind, say, openSUSE. openSUSE 11.1, for example, just came out with 2.6.27, while Fedora came out with 2.6.26 only a couple of weeks before that. The things I care the most about -- kernel-level performance metrics -- pretty much guarantee that I'm going to run a bleeding edge kernel as soon as it's good enough to live through a couple of days without losing data on the hard drive. And on my laptop, anything that recognizes my wireless and sound and leaves some of my 512 MB for applications is OK. :) > -- M. Edward (Ed) Borasky I've never met a happy clam. In fact, most of them were pretty steamed.
On Thu, 15 Jan 2009, Jean-David Beyer wrote:
> Linux kernels seem to be pretty good these days.
> [...]
> I would think Fedora's kernel would probably be OK, but I would not risk a
> serious server on the other bleeding-edge stuff.

I have been running kernel.org kernels in production for about 12 years now
(on what has now grown to a couple hundred servers), and I routinely run
from upgrade to upgrade with no crashes. I tend to upgrade every year or so.

That being said, things happen. I have a set of firewalls running the
Checkpoint Secure Platform Linux distribution that locked up solidly a
couple of weeks after I put them in place (the iptables firewalls that they
replaced had been humming along just fine under much heavier loads for
months).

The more mainstream your hardware is, the safer you are (unfortunately, very
few RAID cards are mainstream), but I've also found that compiling a minimal
kernel that supports only the things I need contributes to reliability.

Even with my experience, though, I would never architect anything with the
expectation that system crashes don't happen. I actually see more crashes
due to overheating (fans fail, AC units fail, etc.) than I do from kernel
crashes.

Not everything needs that kind of reliability. I am getting ready to build a
pair of postgres servers that will have all safety disabled. I will get the
redundancy I need by replicating between the pair, and if they both go down
(a datacenter outage) it is entirely appropriate to lose the entire contents
of the system and reinitialize from scratch (in fact, every boot of the
system will do this). But you need to think carefully about what you are
doing when you disable the protection.

David Lang
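For concreteness, the sort of postgresql.conf changes implied by "all safety
disabled" would look roughly like the sketch below. This is only an
illustration, assuming the node can always be reinitialized from scratch or
resynced from its partner; it is not David's actual configuration.

fsync = off              # do not force WAL writes to disk; an OS or power
                         # crash can corrupt the entire cluster
full_page_writes = off   # skip full-page images in the WAL; only acceptable
                         # if you can afford to rebuild the cluster
# On 8.3 and later, synchronous_commit = off is a milder alternative: after a
# crash it can lose the last few transactions, but it does not corrupt data.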