Thread: Advice configuring ServeRAID 8k for performance
I am using PostgreSQL 8.3.7 on a dedicated IBM 3660 with 24GB RAM running CentOS 5.4 x86_64. I have a ServeRAID 8k controller with 6 SATA 7500RPM disks in RAID 6, and for the OLAP workload it feels* slow. I have 6 more disks to add, and the RAID has to be rebuilt in any case, but first I would like to solicit general advice. I know that's little data to go on, and I believe in the scientific method, but in this case I don't have the time to make many iterations.

My questions are simple, but in my reading I have not been able to find definitive answers:

1) Should I switch to RAID 10 for performance? I see things like "RAID 5 is bad for a DB" and "RAID 5 is slow with <= 6 drives" but I see little on RAID 6. RAID 6 was the original choice for more usable space with good redundancy. My current performance is 85MB/s write, 151 MB/s reads (using dd of 2xRAM per http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm).

2) Should I configure the ext3 file system with noatime and/or data=writeback or data=ordered? My controller has a battery, the logical drive has write cache enabled (write-back), and the physical devices have write cache disabled (write-through).

3) Do I just need to spend more time configuring postgresql? My non-default settings were largely generated by pgtune-0.9.3:

max_locks_per_transaction = 128  # manual; avoiding "out of shared memory"
default_statistics_target = 100
maintenance_work_mem = 1GB
constraint_exclusion = on
checkpoint_completion_target = 0.9
effective_cache_size = 16GB
work_mem = 352MB
wal_buffers = 32MB
checkpoint_segments = 64
shared_buffers = 2316MB
max_connections = 32

I am happy to take informed opinion. If you don't have the time to properly cite all your sources but have suggestions, please send them.

Thanks in advance,
Ken

* I know "feels slow" is not scientific. What I mean is that any single query on a fact table, or any 'rm -rf' of a big directory, sends disk utilization to 100% (measured with iostat -x 3).
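For concreteness, the dd test from that page at 2x RAM (48GB here) looks roughly like this; the file path is only a placeholder for somewhere on the RAID volume:

  # Sequential write: 48GB (2x 24GB RAM) in 8kB blocks, then flush dirty pages
  time sh -c "dd if=/dev/zero of=/raid/ddtest.bin bs=8k count=6291456 && sync"

  # Sequential read of the same file
  time dd if=/raid/ddtest.bin of=/dev/null bs=8k

  # Watch per-device utilization while a query or 'rm -rf' runs
  iostat -x 3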
On Thursday, August 05, 2010, "Kenneth Cox" <kenstir@gmail.com> wrote:
> 1) Should I switch to RAID 10 for performance? I see things like "RAID 5
> is bad for a DB" and "RAID 5 is slow with <= 6 drives" but I see little
> on RAID 6. RAID 6 was the original choice for more usable space with
> good redundancy. My current performance is 85MB/s write, 151 MB/s reads
> (using dd of 2xRAM per
> http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm).

If you can spare the drive space, go to RAID 10. RAID 5/6 usually look fine on single-threaded sequential tests (unless your controller really sucks), but in the real world, with multiple processes doing random I/O, RAID 10 will go a lot further on the same drives. Plus your recovery time from disk failures will be a lot faster.

If you can't spare the drive space ... you should buy more drives.

> 2) Should I configure the ext3 file system with noatime and/or
> data=writeback or data=ordered? My controller has a battery, the logical
> drive has write cache enabled (write-back), and the physical devices have
> write cache disabled (write-through).

noatime is fine but really minor; filesystem options rarely show much impact. My best performance comes from XFS filesystems created with stripe options matching the underlying RAID array. Anything else is just a bonus.

> * I know "feels slow" is not scientific. What I mean is that any single
> query on a fact table, or any 'rm -rf' of a big directory sends disk
> utilization to 100% (measured with iostat -x 3).

... and it should. Any modern system can peg a small disk array without much effort. Disks are slow.

--
"No animals were harmed in the recording of this episode. We tried but that damn monkey was just too fast."
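A minimal sketch of matching XFS stripe geometry to the array; the chunk size (su), data-disk count (sw), device, and mount point below are assumptions and must be adjusted to the controller's actual configuration:

  # Example: 256kB controller chunk size, 6 data spindles (e.g. a 12-disk RAID 10)
  mkfs.xfs -d su=256k,sw=6 /dev/sdb1
  mount -o noatime /dev/sdb1 /var/lib/pgsql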
On Thu, Aug 5, 2010 at 12:28 PM, Kenneth Cox <kenstir@gmail.com> wrote:
> I am using PostgreSQL 8.3.7 on a dedicated IBM 3660 with 24GB RAM running
> CentOS 5.4 x86_64. I have a ServeRAID 8k controller with 6 SATA 7500RPM
> disks in RAID 6, and for the OLAP workload it feels* slow. I have 6 more
> disks to add, and the RAID has to be rebuilt in any case, but first I would
> like to solicit general advice. I know that's little data to go on, and I
> believe in the scientific method, but in this case I don't have the time to
> make many iterations.
>
> My questions are simple, but in my reading I have not been able to find
> definitive answers:
>
> 1) Should I switch to RAID 10 for performance? I see things like "RAID 5 is
> bad for a DB" and "RAID 5 is slow with <= 6 drives" but I see little on RAID
> 6. RAID 6 was the original choice for more usable space with good
> redundancy. My current performance is 85MB/s write, 151 MB/s reads (using
> dd of 2xRAM per
> http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm).

Sequential read / write is not very useful for a database benchmark. It does kind of give you a baseline for throughput, but most db access is mixed enough that random access becomes the important measurement.

RAID6 is basically RAID5 with a hot spare already built into the array. This makes rebuild less of an issue, since you can reduce the spare I/O used to rebuild the array to something really small. However, it's in the same performance ballpark as RAID 5, with the accompanying write performance penalty. RAID-10 is pretty much the only way to go for a DB, and if you need more space, you need more or bigger drives, not RAID-5/6.

--
To understand recursion, one must first understand recursion.
Kenneth Cox wrote:
> 1) Should I switch to RAID 10 for performance? I see things like
> "RAID 5 is bad for a DB" and "RAID 5 is slow with <= 6 drives" but I
> see little on RAID 6. RAID 6 was the original choice for more usable
> space with good redundancy. My current performance is 85MB/s write,
> 151 MB/s reads

RAID6 is no better than RAID5 performance-wise; it just has better fault tolerance. And the ServeRAID 8k is a pretty underpowered card as RAID controllers go, so it would not be impossible for computing RAID parity and the like to be the bottleneck here. I'd expect a 6-disk RAID10 with 7200RPM drives to be closer to 120MB/s on writes, so you're not getting ideal performance there. Your read figure is more competitive, but that's usually the RAID5 pattern--decent on reads, sluggish on writes.

> 2) Should I configure the ext3 file system with noatime and/or
> data=writeback or data=ordered? My controller has a battery, the
> logical drive has write cache enabled (write-back), and the physical
> devices have write cache disabled (write-through).

data=ordered is the ext3 default and usually a reasonable choice. Using writeback instead can be dangerous; I wouldn't advise starting there. noatime is certainly a good thing, but the speedup is pretty minor if you have a battery-backed write cache.

> 3) Do I just need to spend more time configuring postgresql? My
> non-default settings were largely generated by pgtune-0.9.3

Those look reasonable enough, except there's no reason to make wal_buffers bigger than 16MB. That work_mem figure might be high too; that's a known concern with pgtune I need to knock out of it one day soon.

When you are hitting high I/O wait periods, is the system running out of RAM and swapping? That can cause really nasty I/O wait.

Your basic hardware is off a bit, but not so badly that I'd start there. Have you turned on slow query logging to see what is hammering the system when the iowait climbs? Often tuning those by looking at the EXPLAIN ANALYZE output can be much more effective than hardware/server configuration tuning.

> * I know "feels slow" is not scientific. What I mean is that any
> single query on a fact table, or any 'rm -rf' of a big directory sends
> disk utilization to 100% (measured with iostat -x 3).

"rm -rf" is really slow on ext3 on just about any hardware. If your fact tables aren't in RAM and you run a query against them, paging them back in again will hammer the disks until it's done. That's not necessarily indicative of a misconfiguration on its own.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
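As a sketch of the slow query logging suggested here (the one-second threshold is just an example), in postgresql.conf:

  # Log any statement that runs longer than 1 second (value is in milliseconds)
  log_min_duration_statement = 1000
  # Also handy for correlating I/O spikes with checkpoints
  log_checkpoints = on

Then run EXPLAIN ANALYZE on whatever shows up in the log against the fact tables in question and tune from there.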
> 1) Should I switch to RAID 10 for performance? I see things like "RAID
> 5 is bad for a DB" and "RAID 5 is slow with <= 6 drives" but I see
> little on RAID 6.

As others said, RAID6 is RAID5 + a hot spare.

Basically when you UPDATE a row, at some point postgres will write the page which contains that row.

RAID10 : write the page to all mirrors.
RAID5/6 : write the page to the relevant disk. Read the corresponding page from all disks (minus one), compute parity, write parity.

As you can see, one small write will need to hog all drives in the array. RAID5/6 performance for small random writes is really, really bad.

Databases like RAID10 for reads too, because when you need some random data you can get it from any of the mirrors, so you get increased parallelism on reads too.

> with good redundancy. My current performance is 85MB/s write, 151 MB/s
> reads

FYI, I get 200 MB/s sequential out of the software RAID5 of 3 cheap desktop consumer SATA drives in my home multimedia server...
On 8/5/10 11:28 AM, Kenneth Cox wrote:
> I am using PostgreSQL 8.3.7 on a dedicated IBM 3660 with 24GB RAM
> running CentOS 5.4 x86_64. I have a ServeRAID 8k controller with 6 SATA
> 7500RPM disks in RAID 6, and for the OLAP workload it feels* slow....
> My current performance is 85MB/s write, 151 MB/s reads

I get 193MB/sec write and 450MB/sec read on a RAID10 of 8 SATA 7200 RPM disks. RAID10 seems to scale linearly -- add disks, get more speed, to the limit of your controller.

Craig
On Thu, Aug 5, 2010 at 4:27 PM, Pierre C <lists@peufeu.com> wrote:
>> 1) Should I switch to RAID 10 for performance? I see things like "RAID 5
>> is bad for a DB" and "RAID 5 is slow with <= 6 drives" but I see little on
>> RAID 6.
>
> As others said, RAID6 is RAID5 + a hot spare.
>
> Basically when you UPDATE a row, at some point postgres will write the page
> which contains that row.
>
> RAID10 : write the page to all mirrors.
> RAID5/6 : write the page to the relevant disk. Read the corresponding page
> from all disks (minus one), compute parity, write parity.

Actually it's not quite that bad. You only have to read from two disks, the data disk and the parity disk, then compute new parity and write to both disks. Still 2 reads / 2 writes for every write.

> As you can see one small write will need to hog all drives in the array.
> RAID5/6 performance for small random writes is really, really bad.
>
> Databases like RAID10 for reads too because when you need some random data
> you can get it from any of the mirrors, so you get increased parallelism on
> reads too.

Also for sequential access RAID-10 can read both drives in a pair interleaved, so you get 50% of the data you need from each drive and double the read rate there. This is even true for Linux software md RAID.

>> with good redundancy. My current performance is 85MB/s write, 151 MB/s
>> reads
>
> FYI, I get 200 MB/s sequential out of the software RAID5 of 3 cheap desktop
> consumer SATA drives in my home multimedia server...

On a machine NOT configured for max seq throughput (it's used for mostly OLTP stuff) I get 325M/s both read and write speed with a 26-disk RAID-10. OTOH, that setup gets ~6000 to 7000 transactions per second with multi-day runs of pgbench.
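A minimal pgbench run along those lines; the scale factor, client count, and database name are placeholders:

  # Initialize the test tables at scale factor 100 (10 million accounts rows)
  pgbench -i -s 100 pgbenchdb
  # 16 concurrent clients, 10,000 transactions each
  pgbench -c 16 -t 10000 pgbenchdb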
Definitely switch to RAID-10 .... it's not merely that it's a fair bit faster on normal operations (less seek contention), it's **WAY** faster than any parity based RAID (RAID-2 through RAID-6) in degraded mode when you lose a disk and have to rebuild it. This is something many people don't test for, and then get bitten badly when they lose a drive under production loads.
Use higher capacity drives if necessary to make your data fit in the number of spindles your controller supports ... the difference in cost is modest compared to an overall setup, especially with SATA. Make sure you still leave at least one hot spare!
In normal operation, RAID-5 has to read and write 2 drives for every write ... not sure about RAID-6 but I suspect it needs to read the entire stripe to recalculate the Hamming parity, and it definitely has to write to 3 drives for each write, which means seeking all 3 of those drives to that position. In degraded mode (a disk rebuilding) with either of those levels, ALL the drives have to seek to that point for every write, and for any reads of the failed drive, so seek contention is horrendous.
RAID-5 and RAID-6 are designed for optimum capacity and protection at the cost of write performance, which is fine for a general file server.
Parity RAID simply isn't suitable for database use .... anyone who claims otherwise either (a) doesn't understand the failure modes of RAID, or (b) is running in a situation where performance simply doesn't matter.
Cheers
Dave
On Thu, Aug 5, 2010 at 5:13 PM, Dave Crooke <dcrooke@gmail.com> wrote:
> Definitely switch to RAID-10 .... it's not merely that it's a fair bit
> faster on normal operations (less seek contention), it's **WAY** faster than
> any parity based RAID (RAID-2 through RAID-6) in degraded mode when you lose
> a disk and have to rebuild it. This is something many people don't test for,
> and then get bitten badly when they lose a drive under production loads.

Had a friend with a 600G x 5 disk RAID-5 and one drive died. It took nearly 48 hours to rebuild the array.

> Use higher capacity drives if necessary to make your data fit in the number
> of spindles your controller supports ... the difference in cost is modest
> compared to an overall setup, especially with SATA. Make sure you still
> leave at least one hot spare!

Yeah, a lot of chassis hold an even number of drives, and I wind up with 2 hot spares because of it.

> Parity RAID simply isn't suitable for database use .... anyone who claims
> otherwise either (a) doesn't understand the failure modes of RAID, or (b) is
> running in a situation where performance simply doesn't matter.

The only time it's acceptable is when you're running something like low write volume report generation / batch processing, where you're mostly sequentially reading and writing 100s of gigabytes at a time in one or maybe two threads.

--
To understand recursion, one must first understand recursion.
On 06/08/10 06:28, Kenneth Cox wrote:
> I am using PostgreSQL 8.3.7 on a dedicated IBM 3660 with 24GB RAM
> running CentOS 5.4 x86_64. I have a ServeRAID 8k controller with 6
> SATA 7500RPM disks in RAID 6, and for the OLAP workload it feels*
> slow. I have 6 more disks to add, and the RAID has to be rebuilt in
> any case, but first I would like to solicit general advice.
>
> 1) Should I switch to RAID 10 for performance? I see things like
> "RAID 5 is bad for a DB" and "RAID 5 is slow with <= 6 drives" but I
> see little on RAID 6. RAID 6 was the original choice for more usable
> space with good redundancy. My current performance is 85MB/s write,
> 151 MB/s reads (using dd of 2xRAM per
> http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm).

Normally I'd agree with the others and recommend RAID10 - but you say you have an OLAP workload - if it is *heavily* read biased you may get better performance with RAID5 (more effective disks to read from). Having said that, your sequential read performance right now is pretty low (151 MB/s - should be double this), which may point to an issue with this controller. Unfortunately this *may* be important for an OLAP workload (seq scans of big tables).

> 2) Should I configure the ext3 file system with noatime and/or
> data=writeback or data=ordered? My controller has a battery, the
> logical drive has write cache enabled (write-back), and the physical
> devices have write cache disabled (write-through).

Probably wise to use noatime. If you have a heavy write workload (i.e. so what I just wrote above does *not* apply), then you might find adjusting the ext3 commit interval upwards from its default of 5 seconds can help (I'm doing some testing at the moment and commit=20 seemed to improve performance by about 5-10%).

> 3) Do I just need to spend more time configuring postgresql? My
> non-default settings were largely generated by pgtune-0.9.3:
>
> max_locks_per_transaction = 128 # manual; avoiding "out of shared memory"
> default_statistics_target = 100
> maintenance_work_mem = 1GB
> constraint_exclusion = on
> checkpoint_completion_target = 0.9
> effective_cache_size = 16GB
> work_mem = 352MB
> wal_buffers = 32MB
> checkpoint_segments = 64
> shared_buffers = 2316MB
> max_connections = 32

Possibly higher checkpoint_segments and lower wal_buffers (I recall someone - maybe Greg - suggesting that there was no benefit in having the latter > 10MB). I wonder about setting shared_buffers higher - how large is the database?

Cheers

Mark
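A sketch of what that mount tuning looks like; the device and mount point are placeholders:

  # /etc/fstab entry for the data volume
  /dev/sdb1  /var/lib/pgsql  ext3  noatime,data=ordered,commit=20  0 0

  # or applied to an already-mounted filesystem
  mount -o remount,noatime,commit=20 /var/lib/pgsql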
On Thursday, August 05, 2010, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
> Normally I'd agree with the others and recommend RAID10 - but you say
> you have an OLAP workload - if it is *heavily* read biased you may get
> better performance with RAID5 (more effective disks to read from).
> Having said that, your sequential read performance right now is pretty
> low (151 MB/s - should be double this), which may point to an issue
> with this controller. Unfortunately this *may* be important for an OLAP
> workload (seq scans of big tables).

Probably a low (default) readahead limitation. ext3 doesn't help, but it can usually get up over 400MB/sec. Doubt it's the controller.

--
"No animals were harmed in the recording of this episode. We tried but that damn monkey was just too fast."
On 06/08/10 11:58, Alan Hodgson wrote:
> On Thursday, August 05, 2010, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
>> Normally I'd agree with the others and recommend RAID10 - but you say
>> you have an OLAP workload - if it is *heavily* read biased you may get
>> better performance with RAID5 (more effective disks to read from).
>> Having said that, your sequential read performance right now is pretty
>> low (151 MB/s - should be double this), which may point to an issue
>> with this controller. Unfortunately this *may* be important for an OLAP
>> workload (seq scans of big tables).
>
> Probably a low (default) readahead limitation. ext3 doesn't help but it can
> usually get up over 400MB/sec. Doubt it's the controller.

Yeah - good suggestion, so cranking up readahead (man blockdev) and retesting is recommended.

Cheers

Mark
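A sketch of checking and raising readahead on the array device; the device name and the 4096-sector value are examples only:

  # Show current readahead in 512-byte sectors
  blockdev --getra /dev/sdb
  # Raise it to 4096 sectors (2MB), then rerun the dd read test
  blockdev --setra 4096 /dev/sdb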
On 06/08/10 12:31, Mark Kirkwood wrote:
> On 06/08/10 11:58, Alan Hodgson wrote:
>> On Thursday, August 05, 2010, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
>>> Normally I'd agree with the others and recommend RAID10 - but you say
>>> you have an OLAP workload - if it is *heavily* read biased you may get
>>> better performance with RAID5 (more effective disks to read from).
>>> Having said that, your sequential read performance right now is pretty
>>> low (151 MB/s - should be double this), which may point to an issue
>>> with this controller. Unfortunately this *may* be important for an OLAP
>>> workload (seq scans of big tables).
>>
>> Probably a low (default) readahead limitation. ext3 doesn't help but
>> it can usually get up over 400MB/sec. Doubt it's the controller.
>
> Yeah - good suggestion, so cranking up readahead (man blockdev) and
> retesting is recommended.

... sorry, it just occurred to me to wonder about the stripe or chunk size used in the array, as making this too small can also severely hamper sequential performance.

Cheers

Mark
On Thu, 5 Aug 2010, Scott Marlowe wrote:
> RAID6 is basically RAID5 with a hot spare already built into the
> array.

On Fri, 6 Aug 2010, Pierre C wrote:
> As others said, RAID6 is RAID5 + a hot spare.

No. RAID6 is NOT RAID5 plus a hot spare.

RAID5 uses a single parity datum (XOR) to ensure protection against data loss if one drive fails. RAID6 uses two different sets of parity (Reed-Solomon) to ensure protection against data loss if two drives fail simultaneously.

If you have a RAID5 set with a hot spare, and you lose two drives, then you have data loss. If the same happens to a RAID6 set, then there is no data loss.

Matthew

--
And the lexer will say "Oh look, there's a null string. Oooh, there's another. And another.", and will fall over spectacularly when it realises there are actually rather a lot.
  - Computer Science Lecturer (edited)
On Fri, Aug 6, 2010 at 3:17 AM, Matthew Wakeling <matthew@flymine.org> wrote:
> On Thu, 5 Aug 2010, Scott Marlowe wrote:
>>
>> RAID6 is basically RAID5 with a hot spare already built into the
>> array.
>
> On Fri, 6 Aug 2010, Pierre C wrote:
>>
>> As others said, RAID6 is RAID5 + a hot spare.
>
> No. RAID6 is NOT RAID5 plus a hot spare.

The original phrase was that RAID 6 was like RAID 5 with a hot spare ALREADY BUILT IN.
On Fri, Aug 6, 2010 at 11:32 AM, Justin Pitts <justinpitts@gmail.com> wrote:
>>>> As others said, RAID6 is RAID5 + a hot spare.
>>>
>>> No. RAID6 is NOT RAID5 plus a hot spare.
>>
>> The original phrase was that RAID 6 was like RAID 5 with a hot spare
>> ALREADY BUILT IN.
>
> Built-in, or not - it is neither. It is more than that, actually. RAID
> 6 is like RAID 5 in that it uses parity for redundancy and pays a
> write cost for maintaining those parity blocks, but will maintain data
> integrity in the face of 2 simultaneous drive failures.

Yes, I know that. I am very familiar with how RAID6 works. RAID5 with the hot spare already rebuilt / built in is a good enough answer for management where big words like parity might scare some PHBs.

> In terms of storage cost, it IS like paying for RAID5 + a hot spare,
> but the protection is better.
>
> A RAID 5 with a hot spare built in could not survive 2 simultaneous
> drive failures.

Exactly. Which is why I had said with the hot spare already built in / rebuilt. Geeze, pedant much?

--
To understand recursion, one must first understand recursion.
> Yes, I know that. I am very familiar with how RAID6 works. RAID5
> with the hot spare already rebuilt / built in is a good enough answer
> for management where big words like parity might scare some PHBs.
>
>> In terms of storage cost, it IS like paying for RAID5 + a hot spare,
>> but the protection is better.
>>
>> A RAID 5 with a hot spare built in could not survive 2 simultaneous
>> drive failures.
>
> Exactly. Which is why I had said with the hot spare already built in
> / rebuilt.

My apologies. The 'rebuilt' slant escaped me. That's a fair way to cast it.

> Geeze, pedant much?

Of course!
On Aug 5, 2010, at 4:09 PM, Scott Marlowe wrote:
> On Thu, Aug 5, 2010 at 4:27 PM, Pierre C <lists@peufeu.com> wrote:
>>
>>> 1) Should I switch to RAID 10 for performance? I see things like "RAID 5
>>> is bad for a DB" and "RAID 5 is slow with <= 6 drives" but I see little on
>>> RAID 6.
>>
>> As others said, RAID6 is RAID5 + a hot spare.
>>
>> Basically when you UPDATE a row, at some point postgres will write the page
>> which contains that row.
>>
>> RAID10 : write the page to all mirrors.
>> RAID5/6 : write the page to the relevant disk. Read the corresponding page
>> from all disks (minus one), compute parity, write parity.
>
> Actually it's not quite that bad. You only have to read from two
> disks, the data disk and the parity disk, then compute new parity and
> write to both disks. Still 2 reads / 2 writes for every write.
>
>> As you can see one small write will need to hog all drives in the array.
>> RAID5/6 performance for small random writes is really, really bad.
>>
>> Databases like RAID10 for reads too because when you need some random data
>> you can get it from any of the mirrors, so you get increased parallelism on
>> reads too.
>
> Also for sequential access RAID-10 can read both drives in a pair
> interleaved so you get 50% of the data you need from each drive and
> double the read rate there. This is even true for linux software md
> RAID.

My experience is that it is ONLY true for software RAID and ZFS. Most hardware raid controllers read both mirrors and validate that the data is equal, and thus writing is about as fast as read. Tested with Adaptec, 3Ware, Dell PERC 4/5/6, and LSI MegaRaid hardware-wise. In all cases it was clear that the hardware raid was not using data from the two mirrors to improve read performance for sequential or random I/O.

>>> with good redundancy. My current performance is 85MB/s write, 151 MB/s
>>> reads
>>
>> FYI, I get 200 MB/s sequential out of the software RAID5 of 3 cheap desktop
>> consumer SATA drives in my home multimedia server...
>
> On a machine NOT configured for max seq throughput (it's used for
> mostly OLTP stuff) I get 325M/s both read and write speed with a 26
> disk RAID-10. OTOH, that setup gets ~6000 to 7000 transactions per
> second with multi-day runs of pgbench.
On Sun, Aug 8, 2010 at 12:46 AM, Scott Carey <scott@richrelevance.com> wrote:
> On Aug 5, 2010, at 4:09 PM, Scott Marlowe wrote:
>> On Thu, Aug 5, 2010 at 4:27 PM, Pierre C <lists@peufeu.com> wrote:
>>>
>>>> 1) Should I switch to RAID 10 for performance? I see things like "RAID 5
>>>> is bad for a DB" and "RAID 5 is slow with <= 6 drives" but I see little on
>>>> RAID 6.
>>>
>>> As others said, RAID6 is RAID5 + a hot spare.
>>>
>>> Basically when you UPDATE a row, at some point postgres will write the page
>>> which contains that row.
>>>
>>> RAID10 : write the page to all mirrors.
>>> RAID5/6 : write the page to the relevant disk. Read the corresponding page
>>> from all disks (minus one), compute parity, write parity.
>>
>> Actually it's not quite that bad. You only have to read from two
>> disks, the data disk and the parity disk, then compute new parity and
>> write to both disks. Still 2 reads / 2 writes for every write.
>>
>>> As you can see one small write will need to hog all drives in the array.
>>> RAID5/6 performance for small random writes is really, really bad.
>>>
>>> Databases like RAID10 for reads too because when you need some random data
>>> you can get it from any of the mirrors, so you get increased parallelism on
>>> reads too.
>>
>> Also for sequential access RAID-10 can read both drives in a pair
>> interleaved so you get 50% of the data you need from each drive and
>> double the read rate there. This is even true for linux software md
>> RAID.
>
> My experience is that it is ONLY true for software RAID and ZFS. Most hardware
> raid controllers read both mirrors and validate that the data is equal, and thus
> writing is about as fast as read. Tested with Adaptec, 3Ware, Dell PERC 4/5/6,
> and LSI MegaRaid hardware-wise. In all cases it was clear that the hardware raid
> was not using data from the two mirrors to improve read performance for
> sequential or random I/O.

Interesting. I'm using an Areca, I'll have to run some tests and see if a mirror is reading at > 100% read speed of a single drive or not.
>>> As others said, RAID6 is RAID5 + a hot spare.
>>
>> No. RAID6 is NOT RAID5 plus a hot spare.
>
> The original phrase was that RAID 6 was like RAID 5 with a hot spare
> ALREADY BUILT IN.

Built-in, or not - it is neither. It is more than that, actually. RAID 6 is like RAID 5 in that it uses parity for redundancy and pays a write cost for maintaining those parity blocks, but it will maintain data integrity in the face of 2 simultaneous drive failures.

In terms of storage cost, it IS like paying for RAID5 + a hot spare, but the protection is better.

A RAID 5 with a hot spare built in could not survive 2 simultaneous drive failures.
Greg Smith wrote:
> > 2) Should I configure the ext3 file system with noatime and/or
> > data=writeback or data=ordered? My controller has a battery, the
> > logical drive has write cache enabled (write-back), and the physical
> > devices have write cache disabled (write-through).
>
> data=ordered is the ext3 default and usually a reasonable choice. Using
> writeback instead can be dangerous, I wouldn't advise starting there.
> noatime is certainly a good thing, but the speedup is pretty minor if
> you have a battery-backed write cache.

We recommend 'data=writeback' for ext3 in our docs:

    http://www.postgresql.org/docs/9.0/static/wal-intro.html

    Tip: Because WAL restores database file contents after a crash,
    journaled file systems are not necessary for reliable storage of the
    data files or WAL files. In fact, journaling overhead can reduce
    performance, especially if journaling causes file system data to be
    flushed to disk. Fortunately, data flushing during journaling can
    often be disabled with a file system mount option, e.g. data=writeback
    on a Linux ext3 file system. Journaled file systems do improve boot
    speed after a crash.

Should this be changed?

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com
+ It's impossible for everything to be true. +
Bruce Momjian wrote:
> We recommend 'data=writeback' for ext3 in our docs

Only for the WAL though, which is fine, and I think spelled out clearly enough in the doc section you quoted. Ken's system has one big RAID volume, which means he'd be mounting the data files with 'writeback' too; that's the thing to avoid.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
Don't ever have WAL and data on the same OS volume as ext3.

If data=writeback, performance will be fine, data integrity will be ok for WAL, but data integrity will not be sufficient for the data partition. If data=ordered, performance will be very bad, but data integrity will be OK.

This is because an fsync on ext3 flushes _all dirty pages in the file system_ to disk, not just those for the file being fsync'd.

One partition for WAL, one for data. If using ext3 this is essentially a performance requirement no matter how your array is set up underneath.

On Aug 13, 2010, at 11:41 AM, Greg Smith wrote:
> Bruce Momjian wrote:
>> We recommend 'data=writeback' for ext3 in our docs
>
> Only for the WAL though, which is fine, and I think spelled out clearly
> enough in the doc section you quoted. Ken's system has one big RAID
> volume, which means he'd be mounting the data files with 'writeback'
> too; that's the thing to avoid.
>
> --
> Greg Smith  2ndQuadrant US  Baltimore, MD
> PostgreSQL Training, Services and Support
> greg@2ndQuadrant.com   www.2ndQuadrant.us
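A hedged sketch of that split layout; the devices and mount points are placeholders, with pg_xlog given its own filesystem (via a separate mount or a symlink into it):

  # /etc/fstab: data files on data=ordered, WAL on data=writeback
  /dev/sdb1  /var/lib/pgsql/data           ext3  noatime,data=ordered    0 0
  /dev/sdc1  /var/lib/pgsql/data/pg_xlog   ext3  noatime,data=writeback  0 0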
Scott Carey wrote:
> This is because an fsync on ext3 flushes _all dirty pages in the file
> system_ to disk, not just those for the file being fsync'd.
> One partition for WAL, one for data. If using ext3 this is essentially
> a performance requirement no matter how your array is set up underneath.

Unless you want the opposite of course. Some systems split out the WAL onto a second disk, only to discover checkpoint I/O spikes become a problem all of the sudden after that. The fsync calls for the WAL writes keep the write cache for the data writes from ever getting too big. This slows things down on average, but makes the worst case less stressful. Free lunches are so hard to find nowadays...

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
On Mon, Aug 16, 2010 at 01:46:21PM -0400, Greg Smith wrote:
> Scott Carey wrote:
> > This is because an fsync on ext3 flushes _all dirty pages in the file
> > system_ to disk, not just those for the file being fsync'd.
> > One partition for WAL, one for data. If using ext3 this is
> > essentially a performance requirement no matter how your array is
> > set up underneath.
>
> Unless you want the opposite of course. Some systems split out the
> WAL onto a second disk, only to discover checkpoint I/O spikes
> become a problem all of the sudden after that. The fsync calls for
> the WAL writes keep the write cache for the data writes from ever
> getting too big. This slows things down on average, but makes the
> worst case less stressful. Free lunches are so hard to find
> nowadays...

Or use -o sync. Or configure a ridiculously low dirty_memory amount (which has a problem on large systems because 1% can still be too much. Argh.)...

Andres
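A sketch of the knobs being referred to; the values are examples only, and the *_bytes variants only exist on kernels newer than CentOS 5 ships (2.6.29+), which is part of the problem mentioned above:

  # Percentage-based limits: even 1% of 24GB is still ~240MB of dirty data
  sysctl -w vm.dirty_background_ratio=1
  sysctl -w vm.dirty_ratio=2

  # Byte-based limits on newer kernels allow much tighter control
  sysctl -w vm.dirty_background_bytes=268435456
  sysctl -w vm.dirty_bytes=536870912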
Andres Freund wrote:
> Or use -o sync. Or configure a ridiculously low dirty_memory amount
> (which has a problem on large systems because 1% can still be too
> much. Argh.)...

-o sync completely trashes performance, and trying to set the dirty_ratio values to even 1% doesn't really work due to things like the "congestion avoidance" code in the kernel. If you sync a lot more often, which putting the WAL on the same disk as the database accidentally does for you, that works surprisingly well at avoiding this whole class of problem on ext3. A really good solution is going to take a full rewrite of the PostgreSQL checkpoint logic though, which will get sorted out during 9.1 development. (cue dramatic foreshadowing music here)

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
On Mon, Aug 16, 2010 at 04:13:22PM -0400, Greg Smith wrote:
> Andres Freund wrote:
> > Or use -o sync. Or configure a ridiculously low dirty_memory amount
> > (which has a problem on large systems because 1% can still be too
> > much. Argh.)...
>
> -o sync completely trashes performance, and trying to set the
> dirty_ratio values to even 1% doesn't really work due to things like
> the "congestion avoidance" code in the kernel. If you sync a lot
> more often, which putting the WAL on the same disk as the database
> accidentally does for you, that works surprisingly well at avoiding
> this whole class of problem on ext3. A really good solution is
> going to take a full rewrite of the PostgreSQL checkpoint logic
> though, which will get sorted out during 9.1 development. (cue
> dramatic foreshadowing music here)

-o sync works ok enough for the data partition (surely not the wal) if you make the background writer less aggressive.

But yes. A new checkpointing logic + a new syncing logic (prepare_fsync() earlier and then fsync() later) would be a nice thing. Do you plan to work on that?

Andres
Andres Freund wrote:
> A new checkpointing logic + a new syncing logic
> (prepare_fsync() earlier and then fsync() later) would be a nice
> thing. Do you plan to work on that?

The background writer already caches fsync calls into a queue, so the prepare step you're thinking of is already there. The problem is that the actual fsync calls happen in a tight loop. That we're busy fixing.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
On Mon, Aug 16, 2010 at 04:54:19PM -0400, Greg Smith wrote:
> Andres Freund wrote:
> > A new checkpointing logic + a new syncing logic
> > (prepare_fsync() earlier and then fsync() later) would be a nice
> > thing. Do you plan to work on that?
>
> The background writer already caches fsync calls into a queue, so
> the prepare step you're thinking of is already there. The
> problem is that the actual fsync calls happen in a tight loop. That
> we're busy fixing.

That doesn't help that much on many systems with a somewhat deep queue. An fsync() equals a barrier, so it has the effect of stopping reordering around it - especially on systems with larger multi-disk arrays that's pretty expensive.

You can achieve surprising speedups, at least in my experience, by forcing the kernel to start writing out pages *without enforcing barriers* first and then later enforce a barrier to be sure it's actually written out. Which, in a simplified case, turns the earlier needed multiple barriers into a single one (in practise you want to call fsync() anyway, but that's not a big problem if it's already written out).

Andres
Scott Carey wrote:
> Don't ever have WAL and data on the same OS volume as ext3.
>
> If data=writeback, performance will be fine, data integrity will be ok
> for WAL, but data integrity will not be sufficient for the data
> partition. If data=ordered, performance will be very bad, but data
> integrity will be OK.
>
> This is because an fsync on ext3 flushes _all dirty pages in the file
> system_ to disk, not just those for the file being fsync'd.
>
> One partition for WAL, one for data. If using ext3 this is essentially
> a performance requirement no matter how your array is set up underneath.

Do we need to document this?

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com
+ It's impossible for everything to be true. +
Andres Freund wrote:
> An fsync() equals a barrier so it has the effect of stopping
> reordering around it - especially on systems with larger multi-disk
> arrays thats pretty expensive.
> You can achieve surprising speedups, at least in my experience, by
> forcing the kernel to start writing out pages *without enforcing
> barriers* first and then later enforce a barrier to be sure its
> actually written out.

Standard practice on high performance systems with good filesystems and a battery-backed controller is to turn off barriers anyway. That's one of the first things to tune on XFS for example, when you have a reliable controller. I don't have enough data on ext4 to comment on tuning for it yet.

The sole purpose for the whole Linux write barrier implementation in my world is to flush the drive's cache, when the database does writes onto cheap SATA drives that will otherwise cache dangerously. Barriers don't have any place on a serious system that I can see. The battery-backed RAID controller you have to use to make fsync calls fast anyway can do some simple write reordering, but the operating system doesn't ever have enough visibility into what it's doing to make intelligent decisions about that anyway.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
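As a sketch of that XFS tuning, assuming a battery-backed controller cache (the device and mount point are placeholders):

  # Only safe when the controller's write cache is battery-backed
  mount -o noatime,nobarrier /dev/sdb1 /var/lib/pgsql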
Bruce Momjian wrote:
> Scott Carey wrote:
>> Don't ever have WAL and data on the same OS volume as ext3.
>> ...
>> One partition for WAL, one for data. If using ext3 this is essentially
>> a performance requirement no matter how your array is set up underneath.
>
> Do we need to document this?

Not for 9.0. What Scott is suggesting is often the case, but not always; I can produce a counter example at will now that I know exactly which closets have the skeletons in them. The underlying situation is more complicated due to some limitations to the whole "spread checkpoint" code that is turning really sour on newer hardware with large amounts of RAM.

I have about 5 pages of written notes on this specific issue so far, and that keeps growing every week. That's all leading toward a proposed 9.1 change to the specific fsync behavior. And I expect to dump a large stack of documentation to support that patch that will address this whole area. I'll put the whole thing onto the wiki as soon as my 9.0 related work settles down.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
On Tuesday 17 August 2010 10:29:10 Greg Smith wrote:
> Andres Freund wrote:
> > An fsync() equals a barrier so it has the effect of stopping
> > reordering around it - especially on systems with larger multi-disk
> > arrays thats pretty expensive.
> > You can achieve surprising speedups, at least in my experience, by
> > forcing the kernel to start writing out pages *without enforcing
> > barriers* first and then later enforce a barrier to be sure its
> > actually written out.
>
> Standard practice on high performance systems with good filesystems and
> a battery-backed controller is to turn off barriers anyway. That's one
> of the first things to tune on XFS for example, when you have a reliable
> controller. I don't have enough data on ext4 to comment on tuning for
> it yet.
>
> The sole purpose for the whole Linux write barrier implementation in my
> world is to flush the drive's cache, when the database does writes onto
> cheap SATA drives that will otherwise cache dangerously. Barriers don't
> have any place on a serious system that I can see. The battery-backed
> RAID controller you have to use to make fsync calls fast anyway can do
> some simple write reordering, but the operating system doesn't ever have
> enough visibility into what it's doing to make intelligent decisions
> about that anyway.

Even if we're not talking about a write barrier in an "ensure it's written out of the cache" way, it still stops the io-scheduler from reordering. I benchmarked it (custom app) and it was very noticeable on a bunch of different systems (with a good BBUed RAID).

Andres