Re: Scaling further up - Mailing list pgsql-performance
From:           Aaron Werman
Subject:        Re: Scaling further up
Date:
Msg-id:         Law10-OE17LcVq0N1ml000479de@hotmail.com
In response to: Re: Scaling further up (Marty Scholes <marty@outputservices.com>)
List:           pgsql-performance
Sorry about not chiming in before - I've been too swamped to think. I agree
with most of the points, but a lot of these posts are interesting and seem to
describe systems from an SA perspective, in contrast to my DBA-centric view.

----- Original Message -----
From: "Marty Scholes" <marty@outputservices.com>
To: <pgsql-performance@postgresql.org>
Sent: Wednesday, March 10, 2004 6:29 PM
Subject: Re: [PERFORM] Scaling further up

> I have some suggestions based on my anecdotal experience.
>
> 1. This is a relatively small DB -- the working set will likely be in
> RAM at any moment in time, making read I/O time mostly irrelevant.
>
> 2. The killer will be write times -- specifically log writes. Small and
> heavily synchronized writes, log and data writes, will drag down an
> impressive hardware RAID setup. We run mirrored hardware RAID 5 arrays
> with write back cache and are constantly seeking ways to improve write
> performance. We do a lot of batch processing, though, so we do a lot of
> write I/Os.

My experience with RAID5 for streaming sequential writes is bad. This is
sometimes helped by the hardware caching to cover the cost of the additional
I/Os for striping (write through RAID5 + big cache acts like RAID 1+0 until
you run out of cache). Batch processing is different from high concurrency
transactions because it needs faster volume streaming, while TP is dependent
on the speed of ack'ing (few big writes with fewer synchronous waits vs. lots
of small writes which serialize everyone). A quick way to measure that
ack'ing cost is sketched after point 5 below. (RAID 3 worked for me in the
past for logging, but I haven't used it in years.)

> 3. Be very careful with "battery backed write cache." It usually works
> as advertised. More than once in the past decade I have seen
> spontaneous cache corruption after power loss. The corruption usually
> happens when some admin, including me, has assumed that the cache will
> ALWAYS survive a power failure unblemished and has no "plan B." Make
> sure you have a contingency plan for corruption, or don't enable the cache.

I agree strongly. The same problem also exists with disk write back cache,
and even with SCSI controllers with write through enabled. PITR would help
here. A lot of these problems are due to procedural error after the crash.

> 4. RAID 10 will likely have bigger stripe sizes on the RAID 0 portion of
> the setup, and might hinder, not help, small write I/O performance.

In a high volume system without write caching you are almost always going to
see queuing, which can make the larger buffer mostly irrelevant if it's not
huge. Write caching thrives on big block sizes (which is a key reason why
Symmetrix doesn't do worse than it does) by reducing I/O counts. Most shops
I've set up or seen use mirroring or RAID 10 for logs. Note also that many
RAID 10 controllers in a non-write cached setup allow a race between the two
writers, acknowledging when the first of the two completes - increasing
throughput by about 1/4.

> 5. Most (almost all) of the I/O time will be due to the access time
> (head seek + head settle + rotational latency) and very little of the
> I/O time will be due to data transfer time. In other words, getting drives
> that provide faster transfer rates will barely improve performance. The
> secret is lowering the access time.

True. This is very much a latency story. Even in high-volume batch work, the
access times you see often point to some other system configuration
bottleneck that is hit elsewhere before you ever reach I/O capacity.
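Some rough arithmetic behind point 5 (the drive figures here are just typical
assumptions, not measurements from the system being discussed):

    # Why access time, not transfer rate, dominates random I/O.
    # Drive parameters are illustrative assumptions for a 7200 RPM disk.
    rpm = 7200
    avg_seek_ms = 8.5            # assumed average seek
    transfer_mb_per_s = 50.0     # assumed sustained media rate
    page_kb = 8                  # one 8 kB database page

    rotational_latency_ms = 0.5 * 60000.0 / rpm    # half a revolution, ~4.2 ms
    transfer_ms = page_kb / 1024.0 / transfer_mb_per_s * 1000.0
    access_ms = avg_seek_ms + rotational_latency_ms

    print("rotational latency: %.1f ms" % rotational_latency_ms)
    print("access time:        %.1f ms" % access_ms)
    print("8 kB transfer:      %.3f ms" % transfer_ms)
    print("access/transfer:    %.0fx" % (access_ms / transfer_ms))

Halving the transfer time buys you almost nothing; halving the access time
roughly doubles random I/O throughput.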
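Going back to point 2 and the cost of ack'ing: a throwaway script along these
lines (file name and sizes are arbitrary; run it on the volume that would
hold the log) gives a feel for what each synchronous log flush costs:

    import os, time

    # Time a series of small synchronous writes - roughly the wait every
    # commit pays for its log flush.
    PATH = "fsync_test.dat"            # place on the volume under test
    WRITES = 200
    BLOCK = b"x" * 8192                # one 8 kB block per "commit"

    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    start = time.time()
    for _ in range(WRITES):
        os.write(fd, BLOCK)
        os.fsync(fd)                   # synchronous ack, like a log flush
    elapsed = time.time() - start
    os.close(fd)
    os.remove(PATH)

    print("avg synchronous 8 kB write: %.2f ms" % (elapsed / WRITES * 1000.0))

On a write-back cached controller this should come back well under a
millisecond; on bare disks it will be dominated by the access time discussed
in point 5.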
> 6. A relatively cheap way to drastically drop the access time is to get
> large drive(s) and only use a portion of them for storage. The less
> space used on the drive, the less area the heads need to cover for
> seeks. At one extreme, you could make the partition the size of a
> single cylinder. This would make access time (ignoring OS and
> controller overhead) identical to rotational latency, which is as low as
> 4.2 ms for a cheap 7200 RPM drive.

This is a good strategy for VLDB, and may not be relevant in this case. Also -
big sequential writes and 15K rpm drives, in the case of write through, are a
beautiful thing - they look like a manufacturer's demo. A primary performance
role of an RDBMS is to convert random I/O to sequential (by buffering reads
and using a streaming log to defer random writes to checkpoints). RDBMSs are
the prime beneficiaries of drive speed improvements, since logging, backups,
and copies are about the only things (ignoring bad perl scripts and find
commands) that generate loads of 50+ MB/sec.

/Aaron
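PS - to make the "streaming log defers random writes" point concrete, here is
a toy sketch of the general write-ahead pattern (my own illustration of the
shape of it, nothing like the real WAL code):

    import os

    # Toy write-ahead pattern: updates are appended sequentially to a log
    # (one cheap fsync per commit); the random writes into the data file
    # are deferred to checkpoint time.
    PAGE = 8192

    class ToyStore:
        def __init__(self, data_path, log_path):
            mode = "r+b" if os.path.exists(data_path) else "w+b"
            self.data = open(data_path, mode)
            self.log = open(log_path, "ab")
            self.dirty = {}                       # page number -> payload

        def update(self, page_no, payload):
            self.log.write(b"%d:" % page_no + payload.ljust(PAGE, b"\0"))
            self.log.flush()
            os.fsync(self.log.fileno())           # the only synchronous wait
            self.dirty[page_no] = payload         # random write deferred

        def checkpoint(self):
            for page_no, payload in self.dirty.items():
                self.data.seek(page_no * PAGE)    # random I/O, but batched
                self.data.write(payload.ljust(PAGE, b"\0"))
            self.data.flush()
            os.fsync(self.data.fileno())
            self.dirty.clear()                    # log up to here can be recycled

    store = ToyStore("toy.dat", "toy.log")
    store.update(3, b"hello")
    store.update(7, b"world")
    store.checkpoint()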