Thread: What exactly is postgres doing during INSERT/UPDATE ?
If I run " dd if=/dev/zero bs=1024k of=file count=1000 " iostat shows me: Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 671.50 88.00 113496.00 176 226992 However postgres 8.3.7 doing a bulk data write (a slony slave, doing inserts and updates) doesn't go nearly as fast: Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 418.41 648.76 7052.74 1304 14176 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 237.50 44.00 3668.00 88 7336 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 221.50 444.00 3832.00 888 7664 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 166.00 248.00 3360.00 496 6720 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 163.00 480.00 3184.00 960 6368 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 102.50 724.00 1736.00 1448 3472 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 295.50 712.00 6004.00 1424 12008 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 109.45 433.83 2260.70 872 4544 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 180.00 640.00 3512.00 1280 7024 top shows the cpu usage of the pg process ranges from zero to never more than ten percent of a cpu, and that one cpu is always ninety some odd percent in iowait. So what is postgres doing (with fsync off) that causes the cpu to spend so much time in iowait? This is a 64 bit amd linux system with ext3 filesystem. free shows: total used free shared buffers cached Mem: 8116992 8085848 31144 0 103016 3098568 -/+ buffers/cache: 4884264 3232728 Swap: 6697296 2035508 4661788
> top shows the cpu usage of the pg process ranges from zero to never more
> than ten percent of a cpu, and that one cpu is always ninety some odd
> percent in iowait. So what is postgres doing (with fsync off) that
> causes the cpu to spend so much time in iowait?

Updating indexes?
On Fri, 28 Aug 2009, Joseph S wrote:

> If I run "dd if=/dev/zero bs=1024k of=file count=1000" iostat shows me:
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             671.50        88.00    113496.00        176     226992

That's the sequential transfer rate of your drive. It's easier to present these numbers if you use "vmstat 1" instead; that shows the I/O in more useful units, and with the CPU stats on the same line.

> However postgres 8.3.7 doing a bulk data write (a slony slave, doing inserts
> and updates) doesn't go nearly as fast:

In PostgreSQL, an update is:

1) A read of the old data
2) Writing out the updated data
3) Marking the original data as dead
4) Updating any indexes involved
5) Later cleaning up after the now-dead row

On top of that, Slony may need to do its own metadata updates.

This sort of workload involves random I/O rather than sequential. On regular hard drives this normally happens at a tiny fraction of the speed because of how the disk has to seek around. Typically a single drive capable of 50-100MB/s on sequential I/O will only do 1-2MB/s on a completely random workload. You look like you're getting somewhere in the middle there, on the low side which doesn't surprise me.

The main two things you can do to improve this on the database side:

-Increase checkpoint_segments, which reduces how often updated data has to be flushed to disk

-Increase shared_buffers in order to hold more of the working set of data in RAM, so that more reads are satisfied by the database cache and less data gets evicted to disk.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
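A rough sketch of where those two settings live; the parameter names are real 8.3 settings, but the values below are placeholders for illustration, not recommendations from the thread:

    # check the current values
    psql -c "SHOW checkpoint_segments"
    psql -c "SHOW shared_buffers"

    # postgresql.conf -- larger values spread checkpoints out and cache more of
    # the working set (shared_buffers needs a server restart to take effect)
    #   checkpoint_segments = 32        # default is 3
    #   shared_buffers = 2GB            # default is far smaller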
On Fri, Aug 28, 2009 at 2:08 AM, Greg Smith<gsmith@gregsmith.com> wrote:
>
> This sort of workload involves random I/O rather than sequential. On
> regular hard drives this normally happens at a tiny fraction of the speed
> because of how the disk has to seek around. Typically a single drive
> capable of 50-100MB/s on sequential I/O will only do 1-2MB/s on a completely
> random workload. You look like you're getting somewhere in the middle
> there, on the low side which doesn't surprise me.
>
> The main two things you can do to improve this on the database side:
>
> -Increase checkpoint_segments, which reduces how often updated data has to
> be flushed to disk
>
> -Increase shared_buffers in order to hold more of the working set of data in
> RAM, so that more reads are satisfied by the database cache and less data
> gets evicted to disk.

After that you have to start looking at hardware. Something as simple as a different drive for indexes, another for WAL, and another for the base tables can make a big difference.
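One common way to express the table/index split is with tablespaces; a sketch only, with made-up mount point, tablespace, table and index names:

    # a separate array mounted at /mnt/pg_index (hypothetical path)
    mkdir -p /mnt/pg_index/tblspc
    chown postgres:postgres /mnt/pg_index/tblspc

    psql -c "CREATE TABLESPACE indexspace LOCATION '/mnt/pg_index/tblspc'"
    psql -c "CREATE INDEX foo_bar_idx ON foo (bar) TABLESPACE indexspace"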
Greg Smith wrote:
> The main two things you can do to improve this on the database side:
>
> -Increase checkpoint_segments, which reduces how often updated data has
> to be flushed to disk

If fsync is turned off, does this matter so much?
Scott Marlowe wrote:
> After that you have to start looking at hardware. Something as
> simple as a different drive for indexes and another for WAL, and
> another for the base tables can make a big difference.

If I have 14 drives in a RAID 10 to split between data tables and indexes, what would be the best way to allocate the drives for performance?
---------- Forwarded message ----------
From: Joseph S <jks@selectacast.net>
To: Greg Smith <gsmith@gregsmith.com>, pgsql-performance@postgresql.org
Date: Fri, 28 Aug 2009 10:25:10 -0400
Subject: Re: What exactly is postgres doing during INSERT/UPDATE ?
Greg Smith wrote:
> The main two things you can do to improve this on the database side:
>
> -Increase checkpoint_segments, which reduces how often updated data has to be flushed to disk
If fsync is turned off, does this matter so much?
It still matters. The kernel is only willing to have so much dirty data sitting in the disk cache. Once it reaches that limit, user processes doing writes start blocking while the kernel flushes stuff on their behalf.
Jeff
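The limit Jeff is describing is the kernel's dirty-page threshold; a sketch of how to inspect it on Linux of that era (the sysctl names are standard, the watch interval is an arbitrary choice):

    # how much dirty data the kernel will buffer before throttling writers
    sysctl vm.dirty_ratio vm.dirty_background_ratio

    # watch dirty data pile up while the bulk load runs
    watch -n1 'grep -E "Dirty|Writeback" /proc/meminfo'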
Joseph S wrote
> If I have 14 drives in a RAID 10 to split between data tables
> and indexes what would be the best way to allocate the drives
> for performance?

RAID-5 can be much faster than RAID-10 for random reads and writes. It is much slower than RAID-10 for sequential writes, but about the same for sequential reads. For typical access patterns, I would put the data and indexes on RAID-5 unless you expect there to be lots of sequential scans.

If you do this, you can drop the random_page_cost from the default 4.0 to 1.0. That should also encourage postgres to use the index more often. I think the default costs for postgres assume that the data is on a RAID-1 array. Either that, or they are a compromise that isn't quite right for any system. On a plain old disk the random_page_cost should be 8.0 or 10.0.

The division of the drives into two arrays would depend on how much space will be occupied by the tables vs the indexes. This is very specific to your database. For example, if indexes take half as much space as tables, then you want 2/3rds for tables and 1/3rd for indexes: 8 drives for tables, 5 drives for indexes, and 1 for a hot standby. The smaller array may be a bit slower for some operations due to reduced parallelism. This also depends on the intelligence of your RAID controller.

Always put the transaction logs (WAL Files) on RAID-10 (or RAID-1 if you don't want to dedicate so many drives to the logs). The only significant performance difference between RAID-10 and RAID-1 is that RAID-1 is much slower (factor of 4 or 5) for random reads. I think the ratio of random reads from the transaction logs would typically be quite low. They are written sequentially and during checkpoint they are read sequentially. In the interim, the data is probably still in shared memory if it needs to be read.

You don't want your transaction logs or any swapfiles on RAID-5. The slow sequential write performance can be a killer.

-Luke
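Anyone experimenting with random_page_cost can try it per-session and compare plans before committing anything to postgresql.conf; a sketch only, with an invented value, table and column:

    # 1.5, some_table and some_indexed_col are placeholders, not recommendations
    psql -c "SET random_page_cost = 1.5; EXPLAIN ANALYZE SELECT * FROM some_table WHERE some_indexed_col = 42"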
On Sat, Aug 29, 2009 at 5:20 AM, Luke Koops<luke.koops@entrust.com> wrote:
> Joseph S wrote
>> If I have 14 drives in a RAID 10 to split between data tables
>> and indexes what would be the best way to allocate the drives
>> for performance?
>
> RAID-5 can be much faster than RAID-10 for random reads and writes. It is
> much slower than RAID-10 for sequential writes, but about the same for
> sequential reads. For typical access patterns, I would put the data and
> indexes on RAID-5 unless you expect there to be lots of sequential scans.

That's pretty much exactly backwards. RAID-5 will at best be slightly slower than RAID-0 or RAID-10 for sequential reads or random reads. For writes it performs *terribly*, especially for random writes. The only write pattern where it performs ok sometimes is sequential writes of large chunks.

> Always put the transaction logs (WAL Files) on RAID-10 (or RAID-1 if you
> don't want to dedicate so many drives to the logs). The only significant
> performance difference between RAID-10 and RAID-1 is that RAID-1 is much
> slower (factor of 4 or 5) for random reads.

no, RAID-10 and RAID-1 should perform the same for reads. RAID-10 will be slower at writes by about a factor equal to the number of mirror sides.

> I think the ratio of random reads from the transaction logs would typically
> be quite low.

During normal operation the logs are *never* read, neither randomly nor sequentially.

> You don't want your transaction logs or any swapfiles on RAID-5. The slow
> sequential write performance can be a killer.

As I mentioned, sequential writes are the only write case where RAID-5 is sometimes ok. However the picture is complicated by transaction syncing, which would make RAID-5 see it more as random i/o. In any case wal normally doesn't take much disk space so there's not much reason to use anything but RAID-1.

--
greg
http://mit.edu/~gsstark/resume.pdf
On Fri, Aug 28, 2009 at 8:19 PM, Jeff Janes<jeff.janes@gmail.com> wrote:
>> ---------- Forwarded message ----------
>> From: Joseph S <jks@selectacast.net>
>> To: Greg Smith <gsmith@gregsmith.com>, pgsql-performance@postgresql.org
>> Date: Fri, 28 Aug 2009 10:25:10 -0400
>> Subject: Re: What exactly is postgres doing during INSERT/UPDATE ?
>> Greg Smith wrote:
>>
>>> The main two things you can do to improve this on the database side:
>>>
>>> -Increase checkpoint_segments, which reduces how often updated data has
>>> to be flushed to disk
>>
>> If fsync is turned off, does this matter so much?
>
> It still matters. The kernel is only willing to have so much dirty data
> sitting in the disk cache. Once it reaches that limit, user processes doing
> writes start blocking while the kernel flushes stuff on their behalf.

it doesn't matter nearly as much though. if you are outrunning the
o/s write cache with fsync off, then it's time to start looking at new
hardware.

merlin
On Sat, Aug 29, 2009 at 2:46 AM, Greg Stark<gsstark@mit.edu> wrote:
> On Sat, Aug 29, 2009 at 5:20 AM, Luke Koops<luke.koops@entrust.com> wrote:
>> Joseph S wrote
>>> If I have 14 drives in a RAID 10 to split between data tables
>>> and indexes what would be the best way to allocate the drives
>>> for performance?
>>
>> RAID-5 can be much faster than RAID-10 for random reads and writes. It is
>> much slower than RAID-10 for sequential writes, but about the same for
>> sequential reads. For typical access patterns, I would put the data and
>> indexes on RAID-5 unless you expect there to be lots of sequential scans.
>
> That's pretty much exactly backwards. RAID-5 will at best be slightly
> slower than RAID-0 or RAID-10 for sequential reads or random reads.
> For writes it performs *terribly*, especially for random writes. The
> only write pattern where it performs ok sometimes is sequential writes
> of large chunks.

Note that while RAID-10 is theoretically always better than RAID-5, I've run into quite a few cheapie controllers that were heavily optimised for RAID-5 and de-optimised for RAID-10. However, if it's got battery backed cache and can run in JBOD mode, linux software RAID-10 or hybrid RAID-1 in hardware RAID-0 in software will almost always beat hardware RAID-5 on the same controller.
On Sat, Aug 29, 2009 at 1:46 AM, Greg Stark<gsstark@mit.edu> wrote:
> On Sat, Aug 29, 2009 at 5:20 AM, Luke Koops<luke.koops@entrust.com> wrote:
>> RAID-5 can be much faster than RAID-10 for random reads and writes. It is much slower than
>> RAID-10 for sequential writes, but about the same for sequential reads. For typical access
>> patterns, I would put the data and indexes on RAID-5 unless you expect there to be lots of
>> sequential scans.
>
> That's pretty much exactly backwards. RAID-5 will at best be slightly
> slower than RAID-0 or RAID-10 for sequential reads or random reads.
> For writes it performs *terribly*, especially for random writes. The
> only write pattern where it performs ok sometimes is sequential writes
> of large chunks.

Also note that how terribly RAID5 performs on those small random writes depends a LOT on the implementation. A good controller with a large BBU cache will be able to mitigate the performance penalty of having to read stripes before small writes to calculate parity (of course, if the writes are really random enough, it's still not going to help much).

>> Always put the transaction logs (WAL Files) on RAID-10 (or RAID-1 if you don't want to dedicate
>> so many drives to the logs). The only significant performance difference between RAID-10 and
>> RAID-1 is that RAID-1 is much slower (factor of 4 or 5) for random reads.
>
> no, RAID-10 and RAID-1 should perform the same for reads. RAID-10 will
> be slower at writes by about a factor equal to the number of mirror
> sides.

Let's keep in mind that a 2-disk RAID-10 is really the same as a 2-disk RAID-1, it just doesn't have any mirrors to stripe over. So since you really need 4 disks for a "true" RAID-10, the performance of a RAID-10 array compared to a RAID1 array is pretty much proportional to the number of disks in the array (more disks = more performance).

The "far" RAID-10 layout that is available when using Linux software raid is interesting. It will lay the data out on the disks so that you can get the streaming read performance of a RAID-0 array, but streaming write performance will suffer a bit since now the disk will have to seek to perform those writes. You can also use this layout with just 2 disks instead of RAID1. Some claim that the performance hit isn't noticeable due to write caching/IO ordering, but I have not tested its performance using PostgreSQL. Might be a nice thing for someone to try.

http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10

-Dave
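For reference, the "far" layout Dave describes is selected with mdadm's --layout option; a sketch with placeholder device names (and it destroys whatever is on them):

    # two-disk md RAID-10 using the "far 2" (f2) layout
    mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=2 /dev/sdb1 /dev/sdc1
    mkfs.ext3 /dev/md0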
On Sat, Aug 29, 2009 at 6:26 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Fri, Aug 28, 2009 at 8:19 PM, Jeff Janes<jeff.janes@gmail.com> wrote:
>>> ---------- Forwarded message ----------
>>> From: Joseph S <jks@selectacast.net>
>>> To: Greg Smith <gsmith@gregsmith.com>, pgsql-performance@postgresql.org
>>> Date: Fri, 28 Aug 2009 10:25:10 -0400
>>> Subject: Re: What exactly is postgres doing during INSERT/UPDATE ?
>>> Greg Smith wrote:
>>>
>>>> The main two things you can do to improve this on the database side:
>>>>
>>>> -Increase checkpoint_segments, which reduces how often updated data has
>>>> to be flushed to disk
>>>
>>> If fsync is turned off, does this matter so much?
>>
>> It still matters. The kernel is only willing to have so much dirty data
>> sitting in the disk cache. Once it reaches that limit, user processes doing
>> writes start blocking while the kernel flushes stuff on their behalf.
>
> it doesn't matter nearly as much though.

True, but it matters enough that it ought not be ignored. I've run into it more than once, and I haven't been at this very long.

> if you are outrunning the o/s write cache with fsync off, then it's
> time to start looking at new hardware.
Or to start looking at tweaking the kernel VM settings. The kernel doesn't always handle these situations as gracefully as it could, and might produce a practical throughput that is much less than the theoretical one. But reducing the frequency of checkpoints is easier than either of these, and cheaper than buying new hardware. I don't see why the hardest and most expensive option would be the first choice.
Jeff
On Sat, Aug 29, 2009 at 9:59 AM, Scott Marlowe<scott.marlowe@gmail.com> wrote:
> On Sat, Aug 29, 2009 at 2:46 AM, Greg Stark<gsstark@mit.edu> wrote:
>> On Sat, Aug 29, 2009 at 5:20 AM, Luke Koops<luke.koops@entrust.com> wrote:
>>> Joseph S wrote
>>>> If I have 14 drives in a RAID 10 to split between data tables
>>>> and indexes what would be the best way to allocate the drives
>>>> for performance?
>>>
>>> RAID-5 can be much faster than RAID-10 for random reads and writes. It is
>>> much slower than RAID-10 for sequential writes, but about the same for
>>> sequential reads. For typical access patterns, I would put the data and
>>> indexes on RAID-5 unless you expect there to be lots of sequential scans.
>>
>> That's pretty much exactly backwards. RAID-5 will at best be slightly
>> slower than RAID-0 or RAID-10 for sequential reads or random reads.
>> For writes it performs *terribly*, especially for random writes. The
>> only write pattern where it performs ok sometimes is sequential writes
>> of large chunks.
>
> Note that while RAID-10 is theoretically always better than RAID-5,
> I've run into quite a few cheapie controllers that were heavily
> optimised for RAID-5 and de-optimised for RAID-10. However, if it's
> got battery backed cache and can run in JBOD mode, linux software
> RAID-10 or hybrid RAID-1 in hardware RAID-0 in software will almost
> always beat hardware RAID-5 on the same controller.

raid 5 can outperform raid 10 on sequential writes in theory. if you are writing 100mb of actual data on, say, an 8 drive array, the raid 10 system has to write 200mb of data and the raid 5 system has to write 100 * (8/7) or about 114mb. Of course, the raid 5 system has to do parity, etc.

For random writes, raid 5 has to write a minimum of two drives, the data being written and parity. Raid 10 also has to write two drives minimum. A lot of people think parity is a big deal in terms of raid 5 performance penalty, but I don't -- relative to the what's going on in the drive, xor calculation costs (one of the fastest operations in computing) are basically zero, and off-lined if you have a hardware raid controller.

I bet part of the problem with raid 5 is actually contention. since your write to a stripe can conflict with other writes to a different stripe. The other problem with raid 5 that I see is that you don't get very much extra protection -- it's pretty scary doing a rebuild even with a hot spare (and then you should probably be doing raid 6). On read performance RAID 10 wins all day long because more drives can be involved.

merlin
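The 200mb-vs-114mb figures above are just stripe arithmetic; a trivial check of them (8-drive array, one drive's worth of parity per stripe):

    # bytes hitting the platters for 100mb of application data, ignoring caches
    echo "raid 10: $((100 * 2)) mb      (every block written twice)"
    echo "raid 5:  $((100 * 8 / 7)) mb  (7 data blocks + 1 parity per stripe)"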
On Sun, Aug 30, 2009 at 4:40 PM, Merlin Moncure<mmoncure@gmail.com> wrote: > For random writes, raid 5 has to write a minimum of two drives, the > data being written and parity. Raid 10 also has to write two drives > minimum. A lot of people think parity is a big deal in terms of raid > 5 performance penalty, but I don't -- relative to the what's going on > in the drive, xor calculation costs (one of the fastest operations in > computing) are basically zero, and off-lined if you have a hardware > raid controller. The cost is that in order to calculate the parity block the RAID controller has to *read* in either the old data block being overwritten and the old parity block or all the other data blocks which participate in that parity block. So every random write becomes not just two writes but two reads + two writes. If you're always writing large sequential hunks at a time then this is minimized because the RAID controller can just calculate the new parity block for the whole new hunk. But if you often just seek to random places in the file and overwrite 8k at a time then things go very very poorly. -- greg http://mit.edu/~gsstark/resume.pdf
On 08/30/2009 11:40 AM, Merlin Moncure wrote: > For random writes, raid 5 has to write a minimum of two drives, the > data being written and parity. Raid 10 also has to write two drives > minimum. A lot of people think parity is a big deal in terms of raid > 5 performance penalty, but I don't -- relative to the what's going on > in the drive, xor calculation costs (one of the fastest operations in > computing) are basically zero, and off-lined if you have a hardware > raid controller. > > I bet part of the problem with raid 5 is actually contention. since > your write to a stripe can conflict with other writes to a different > stripe. The other problem with raid 5 that I see is that you don't > get very much extra protection -- it's pretty scary doing a rebuild > even with a hot spare (and then you should probably be doing raid 6). > On read performance RAID 10 wins all day long because more drives can > be involved. > In real life, with real life writes (i.e. not sequential from the start of the disk to the end of the disk), where the stripes on the disk being written are not already in RAM (to allow for XOR to be cheap), RAID 5 is horrible. I still recall naively playing with software RAID 5 on a three disk system and finding write performance to be 20% - 50% less than a single drive on its own. People need to realize that the cost of maintaining parity is not the XOR itself - XOR is cheap - the cost is having knowledge of all drives in the stripe in order to write the parity. This implies it is already in cache (requires a very large cache, or a very localized load such that the load all fits in cache), or it requires 1 or more reads before 2 or more writes. Latency is a killer here - latency is already the slowest part of the disk, so to effectively multiply latency x 2 has a huge impact. I will never use RAID 5 again unless I have a huge memory backed cache for it to cache writes against. By huge, I mean something approximately the size of the data normally read and written. Having 1 Gbytes of RAM dedicated to RAID 5 for a 1 Tbyte drive may not be enough. RAID 1+0 on the other hand, has never disappointed me yet. Disks are cheap, and paying x2 for single disk redundancy is an acceptable price. Cheers, mark -- Mark Mielke<mark@mielke.cc>
I've already learned my lesson and will never use raid 5 again. The question is what I do with my 14 drives. Should I use only 1 pair for indexes or should I use 4 drives? The wal logs are already slated for an SSD.

Scott Marlowe wrote:
> On Sat, Aug 29, 2009 at 2:46 AM, Greg Stark<gsstark@mit.edu> wrote:
>> On Sat, Aug 29, 2009 at 5:20 AM, Luke Koops<luke.koops@entrust.com> wrote:
>>> Joseph S wrote
>>>> If I have 14 drives in a RAID 10 to split between data tables
>>>> and indexes what would be the best way to allocate the drives
>>>> for performance?
>>> RAID-5 can be much faster than RAID-10 for random reads and writes. It is
>>> much slower than RAID-10 for sequential writes, but about the same for
>>> sequential reads. For typical access patterns, I would put the data and
>>> indexes on RAID-5 unless you expect there to be lots of sequential scans.
>> That's pretty much exactly backwards. RAID-5 will at best be slightly
>> slower than RAID-0 or RAID-10 for sequential reads or random reads.
>> For writes it performs *terribly*, especially for random writes. The
>> only write pattern where it performs ok sometimes is sequential writes
>> of large chunks.
>
> Note that while RAID-10 is theoretically always better than RAID-5,
> I've run into quite a few cheapie controllers that were heavily
> optimised for RAID-5 and de-optimised for RAID-10. However, if it's
> got battery backed cache and can run in JBOD mode, linux software
> RAID-10 or hybrid RAID-1 in hardware RAID-0 in software will almost
> always beat hardware RAID-5 on the same controller.
On Sun, Aug 30, 2009 at 1:36 PM, Mark Mielke<mark@mark.mielke.cc> wrote:
> On 08/30/2009 11:40 AM, Merlin Moncure wrote:
>> For random writes, raid 5 has to write a minimum of two drives, the
>> data being written and parity. Raid 10 also has to write two drives
>> minimum. A lot of people think parity is a big deal in terms of raid
>> 5 performance penalty, but I don't -- relative to the what's going on
>> in the drive, xor calculation costs (one of the fastest operations in
>> computing) are basically zero, and off-lined if you have a hardware
>> raid controller.
>>
>> I bet part of the problem with raid 5 is actually contention. since
>> your write to a stripe can conflict with other writes to a different
>> stripe. The other problem with raid 5 that I see is that you don't
>> get very much extra protection -- it's pretty scary doing a rebuild
>> even with a hot spare (and then you should probably be doing raid 6).
>> On read performance RAID 10 wins all day long because more drives can
>> be involved.
>
> In real life, with real life writes (i.e. not sequential from the start of
> the disk to the end of the disk), where the stripes on the disk being
> written are not already in RAM (to allow for XOR to be cheap), RAID 5 is
> horrible. I still recall naively playing with software RAID 5 on a three
> disk system and finding write performance to be 20% - 50% less than a single
> drive on its own.
>
> People need to realize that the cost of maintaining parity is not the XOR
> itself - XOR is cheap - the cost is having knowledge of all drives in the
> stripe in order to write the parity. This implies it is already in cache
> (requires a very large cache, or a very localized load such that the load
> all fits in cache), or it requires 1 or more reads before 2 or more writes.
> Latency is a killer here - latency is already the slowest part of the disk,
> so to effectively multiply latency x 2 has a huge impact.

This is not necessarily correct. As long as the data you are writing is less than the raid stripe size (say 64kb), then you only need the old data for that stripe (which is stored on one disk only), the parity (also stored on one disk only), and the data being written to recalculate the parity. A raid stripe is usually on one disk. So a raid 5 random write will only involve two drives if it's less than stripe size (and three drives if it's up to 2x stripe size, etc). IOW, if your stripe size is 64k:

64k written:
raid 10: two writes
raid 5: two writes, one read (but the read and one of the writes is same physical location)

128k written:
raid 10: four writes
raid 5: three writes, one read (but the read and one of the writes is same physical location)

192k written:
raid 10: six writes
raid 5: four writes, one read (but the read and one of the writes is same physical location)

now, by 'same physical' location, that may mean that the drive head has to move if the data is not in cache.

I realize that many raid 5 implementations tend to suck. That said, raid 5 should offer higher theoretical performance for writing than raid 10, both for sequential and random. (many, many online descriptions of raid get this wrong and stupidly blame the overhead of parity calculation). raid 10 wins on read all day long. Of course, on a typical system with lots of things going on, it gets a lot more complicated...

(just for the record, I use raid 10 on my databases always) :-)

merlin
On Sun, Aug 30, 2009 at 11:56 PM, Merlin Moncure<mmoncure@gmail.com> wrote: > 192k written > raid 10: six writes > raid 5: four writes, one read (but the read and one of the writes is > same physical location) > > now, by 'same physical' location, that may mean that the drive head > has to move if the data is not in cache. > > I realize that many raid 5 implementations tend to suck. That said, > raid 5 should offer higher theoretical performance for writing than > raid 10, both for sequential and random. In the above there are two problems. 1) 192kB is not a random access pattern. Any time you're writing a whole raid stripe or more then RAID5 can start performing reasonably but that's not random, that's sequential i/o. The relevant random i/o pattern is writing 8kB chunks at random offsets into a multi-terabyte storage which doesn't fit in cache. 2) It's not clear but I think you're saying "but the read and one of the writes is same physical location" on the basis that this mitigates the costs. In fact it's the worst case. It means after doing the read and calculating the parity block the drive must then spin a full rotation before being able to write it back out. So instead of an average latency of 1/2 of a rotation you have that plus a full rotation, or 3x as much latency before the write can be performed as without raid5. It's not a fault of the implementations, it's a fundamental problem with RAId5. Even a spectacular implementation of RAID5 will be awful for random access writes. The only saving grace some hardware implementations have is having huge amounts of battery backed cache which mean that they can usually buffer all the writes for long enough that the access patterns no longer look random. If you buffer enough then you can hope you'll eventually overwrite the whole stripe and can write out the new parity without reading the old data. Or failing that you can perform the reads of the old data when it's convenient because you're reading nearby data effectively turning it into sequential i/o. -- greg http://mit.edu/~gsstark/resume.pdf
On Sun, Aug 30, 2009 at 7:38 PM, Greg Stark<gsstark@mit.edu> wrote: > On Sun, Aug 30, 2009 at 11:56 PM, Merlin Moncure<mmoncure@gmail.com> wrote: >> 192k written >> raid 10: six writes >> raid 5: four writes, one read (but the read and one of the writes is >> same physical location) >> >> now, by 'same physical' location, that may mean that the drive head >> has to move if the data is not in cache. >> >> I realize that many raid 5 implementations tend to suck. That said, >> raid 5 should offer higher theoretical performance for writing than >> raid 10, both for sequential and random. > > In the above there are two problems. > > 1) 192kB is not a random access pattern. Any time you're writing a > whole raid stripe or more then RAID5 can start performing reasonably > but that's not random, that's sequential i/o. The relevant random i/o > pattern is writing 8kB chunks at random offsets into a multi-terabyte > storage which doesn't fit in cache. > > 2) It's not clear but I think you're saying "but the read and one of > the writes is same physical location" on the basis that this mitigates > the costs. In fact it's the worst case. It means after doing the read > and calculating the parity block the drive must then spin a full > rotation before being able to write it back out. So instead of an > average latency of 1/2 of a rotation you have that plus a full > rotation, or 3x as much latency before the write can be performed as > without raid5. > > It's not a fault of the implementations, it's a fundamental problem > with RAId5. Even a spectacular implementation of RAID5 will be awful > for random access writes. The only saving grace some hardware > implementations have is having huge amounts of battery backed cache > which mean that they can usually buffer all the writes for long enough > that the access patterns no longer look random. If you buffer enough > then you can hope you'll eventually overwrite the whole stripe and can > write out the new parity without reading the old data. Or failing that > you can perform the reads of the old data when it's convenient because > you're reading nearby data effectively turning it into sequential i/o. I agree, that's good analysis. The main point I was making was that if you have say a 10 disk raid 5, you don't involve 10 disks, only two...a very common misconception. I made another mistake that you didn't catch: you need to read *both* the data drive and the parity drive before writing, not just the parity drive. I wonder if flash SSD are a better fit for raid 5 since the reads are much cheaper than writes and there is no rotational latency. (also, $/gb is different, and so are the failure cases). merlin
* Merlin Moncure <mmoncure@gmail.com> [090831 10:38]: > I agree, that's good analysis. The main point I was making was that > if you have say a 10 disk raid 5, you don't involve 10 disks, only > two...a very common misconception. I made another mistake that you > didn't catch: you need to read *both* the data drive and the parity > drive before writing, not just the parity drive. > > I wonder if flash SSD are a better fit for raid 5 since the reads are > much cheaper than writes and there is no rotational latency. (also, > $/gb is different, and so are the failure cases). The other thing that scares me about raid-5 is the write-hole, and the possible delayed inconsistency that brings... Again, hopefully mitigated by a dependable controller w/ BBU... -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Sun, Aug 30, 2009 at 1:01 PM, Joseph S <jks@selectacast.net> wrote:
> I've already learned my lesson and will never use raid 5 again. The
> question is what I do with my 14 drives. Should I use only 1 pair for
> indexes or should I use 4 drives? The wal logs are already slated for an SSD.
Why not just spread all your index data over 14 spindles, and do the same with your table data? I haven't encountered this debate in the pgsql world, but from the Oracle world it seems to me the "Stripe And Mirror Everything" people had the better argument than the "separate tables and indexes" people.
Jeff
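A sketch of the "spread everything" approach at the array level (device names are placeholders, and mdadm --create wipes them):

    # one RAID-10 array across all 14 drives; tables and indexes then simply
    # live in the default tablespace on top of it
    mdadm --create /dev/md0 --level=10 --raid-devices=14 /dev/sd[b-o]1
    mkfs.ext3 /dev/md0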
Jeff Janes <jeff.janes@gmail.com> wrote: > Joseph S <jks@selectacast.net> wrote: >> The question is what I do with my 14 drives. Should I use only 1 >> pair for indexes or should I use 4 drives? The wal logs are >> already slated for an SSD. > Why not just spread all your index data over 14 spindles, and do the > same with your table data? If you have the luxury of being able to test more than one configuration with something resembling your actual workload, I would strongly recommend including this as one of your configurations. Spreading everything over the larger number of spindles might well out-perform your most carefully hand-crafted tuning of object placement on smaller spindle sets. -Kevin
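A sketch of the kind of side-by-side test Kevin suggests, using pgbench; the scale factor, client count and database name are arbitrary examples:

    # build a test database, then run the same write-heavy load against each
    # candidate disk layout and compare the transactions-per-second numbers
    pgbench -i -s 100 testdb
    pgbench -c 16 -t 10000 testdb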
On Mon, Aug 31, 2009 at 10:31 AM, Kevin Grittner<Kevin.Grittner@wicourts.gov> wrote:
> Jeff Janes <jeff.janes@gmail.com> wrote:
>> Joseph S <jks@selectacast.net> wrote:
>
>>> The question is what I do with my 14 drives. Should I use only 1
>>> pair for indexes or should I use 4 drives? The wal logs are
>>> already slated for an SSD.
>
>> Why not just spread all your index data over 14 spindles, and do the
>> same with your table data?
>
> If you have the luxury of being able to test more than one
> configuration with something resembling your actual workload, I would
> strongly recommend including this as one of your configurations.
> Spreading everything over the larger number of spindles might well
> out-perform your most carefully hand-crafted tuning of object
> placement on smaller spindle sets.

The first thing I'd test would be if having a separate mirror set for pg_xlog helps. If you have a high-write environment, moving pg_xlog off of the main data set can help a lot.
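For the pg_xlog test, the usual 8.3-era procedure is a symlink; a sketch with placeholder paths (the server must be stopped cleanly first):

    pg_ctl -D /var/lib/pgsql/data stop
    mv /var/lib/pgsql/data/pg_xlog /mnt/wal/pg_xlog
    ln -s /mnt/wal/pg_xlog /var/lib/pgsql/data/pg_xlog
    pg_ctl -D /var/lib/pgsql/data start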