Thread: Optimal settings for RAID controller - optimized for writes
Hi, I’m kind of a noob when it comes to setting up RAID controllers and tweaking them, so I need some advice here.

I’m just about to set up my newly rented Dell R720 12th-gen server. It’s running a single Intel Xeon E5-2620 v2 processor and 64GB ECC RAM. I have installed 8x 300GB SSDs in it. It has a PERC H710 RAID controller (based on the LSI SAS 2208 dual-core ROC).

Now my database should be optimized for writing. UPDATEs are by far my biggest bottleneck.

Firstly: Should I just put all 8 drives in one single RAID 10 array, or would it be better to have six of them in one RAID 10 array, and then the remaining two in a separate RAID 1 array, e.g. for having the WAL log dir on its own drives?

Secondly: What settings should I pay attention to when setting this up, if I want it to have optimal write performance (cache behavior, write back etc.)?

THANKS!
Hi,
I configured a similar architecture some months ago, and this is the best choice I found after some pgbench and Bonnie++ tests.
Server: DELL R720d
CPU: dual Xeon 8-core
RAM: 32GB ECC
Controller PERC H710
Disks:
2xSSD (MLC) Raid1 for Operating System (CentOS 6.4)
4xSSD (SLC) Raid10 for WAL archive and a dedicated "fast tablespace", where we have most UPDATE actions (+ Hot spare).
10xHDD 15kRPM Raid5 for "default tablespace" (optimized for space, instead of Raid10) (+ Hot spare).
Our application does more than 200 UPDATEs/sec (on the "fast tablespace") and writes more than 15GB of records per day (on the "default tablespace").
After the testing phase I came to the following conclusions:
4xSSD (SLC) RAID 10 vs. 10xHDD RAID 5: comparable I/O performance in sequential read and write, but much better performance on random access (obviously!!). BUT, as far as I know, the PostgreSQL I/O processes are not heavily involved in random I/O, so at the same price I would prefer to buy 10xHDD instead of 4xSSD (SLC), getting about 10x the available space for the same money!!
10xHDD RAID 10 vs. 10xHDD RAID 5: with Bonnie++ I noticed a very small difference in I/O performance, so I decided to use RAID 5 + a dedicated hot spare instead of RAID 10.
If I could go back, I would have spent the money for the SLC drives on more HDDs.
regards.
2014-02-17 16:03 GMT+01:00 Niels Kristian Schjødt <nielskristian@autouncle.com>:
Hi,
I’m kind of a noob when it comes to setting up RAID controllers and tweaking them so I need some advice here.
I’m just about to set up my newly rented Dell R720 12th-gen server. It’s running a single Intel Xeon E5-2620 v2 processor and 64GB ECC RAM. I have installed 8x 300GB SSDs in it. It has a PERC H710 RAID controller (based on the LSI SAS 2208 dual-core ROC).
Now my database should be optimized for writing. UPDATEs are by far my biggest bottleneck.
Firstly: Should I just put all 8 drives in one single RAID 10 array, or would it be better to have six of them in one RAID 10 array, and then the remaining two in a separate RAID 1 array, e.g. for having the WAL log dir on its own drives?
Secondly: What settings should I pay attention to when setting this up, if I want it to have optimal write performance (cache behavior, write back etc.)?
THANKS!
--
On 17 February 2014, 16:03, Niels Kristian Schjødt wrote:
> Now my database should be optimized for writing. UPDATEs are by far my
> biggest bottleneck.

I think it's pretty difficult to answer this without a clear idea of how much data the UPDATEs modify, etc. Locating the data may require a lot of reads too.

> Firstly: Should I just put all 8 drives in one single RAID 10 array, or
> would it be better to have six of them in one RAID 10 array, and then
> the remaining two in a separate RAID 1 array, e.g. for having the WAL
> log dir on its own drives?

This used to be done to separate WAL and data files onto separate disks, as the workloads are very different (WAL is almost entirely sequential writes, access to data files is often random). With spinning drives this made a huge difference, as the WAL drives were doing just sequential writes, but with SSDs it's not that important anymore. If you can do some testing, do it. I'd probably create a RAID 10 on all 8 disks.

> Secondly: What settings should I pay attention to when setting this up,
> if I want it to have optimal write performance (cache behavior, write
> back etc.)?

I'm wondering whether the controller (H710) actually handles TRIM well or not. I know a lot of hardware controllers tend not to pass TRIM to the drives, which results in poor write performance after some time, but I've been unable to google anything about TRIM on the H710.

Tomas
The thing is, it's difficult to transfer these experiences without a clear idea of the workloads.

For example, I wouldn't say 200 updates/second is a write-heavy workload. A single 15k drive should handle that just fine, assuming the data fit into RAM (which seems to be the case, but maybe I got that wrong).

Niels, what amounts of data are we talking about? What is the total database size? How much data are you updating? Are those updates random, or are you updating a lot of data in a sequential manner? How did you determine that UPDATEs are the bottleneck?

Tomas

On 17 February 2014, 16:29, DFE wrote:
> Our application does more than 200 UPDATEs/sec (on the "fast tablespace")
> and writes more than 15GB of records per day (on the "default tablespace").
Hi,

I don't have a PERC H710 RAID controller, but I think he would like to know which RAID striping/chunk size and which read/write cache ratio in the write-back cache setting are best. I'd like to know that too :)

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

(2014/02/18 0:54), Tomas Vondra wrote:
> The thing is, it's difficult to transfer these experiences without a
> clear idea of the workloads.
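For reference, the H710 is an LSI-based controller, so the cache and stripe settings being asked about can usually be inspected from the OS. A minimal sketch, assuming the LSI MegaCli64 tool (or Dell's perccli equivalent) is installed at its usual path; exact flags and paths vary by version, so treat this as illustrative only:

    # show logical drives: strip size, current cache policy, state
    /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -LAll -aAll

    # show just the cache policy (WriteBack/WriteThrough, ReadAhead, Direct/Cached)
    /opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -Cache -LAll -aAll

    # enable write-back caching (only safe with a healthy BBU / flash-backed cache)
    /opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp WB -LAll -aAll

    # disable controller read-ahead and use direct I/O for reads
    /opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp NORA -LAll -aAll
    /opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp -Direct -LAll -aAll

    # check the battery before trusting write-back
    /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -aAll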
On Mon, Feb 17, 2014 at 8:03 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:
> Firstly: Should I just put all 8 drives in one single RAID 10 array, or
> would it be better to have six of them in one RAID 10 array, and then
> the remaining two in a separate RAID 1 array, e.g. for having the WAL
> log dir on its own drives?
>
> Secondly: What settings should I pay attention to when setting this up,
> if I want it to have optimal write performance (cache behavior, write
> back etc.)?

Pick the base configuration that's the simplest, i.e. all 8 in a RAID-10. Benchmark it to get a baseline, using a load similar to your own. You can use pgbench's ability to run scripts to make some pretty realistic benchmarks. Once you've got your baseline, start experimenting. If you can't prove that moving two drives to a RAID-1 for xlog makes it faster, then don't do it.

Recently I was testing MLC FusionIO cards (1.2TB), and no matter how I sliced things up, one big partition was just as fast as or faster than any other configuration (separate spinners for xlog, etc.) I could come up with. On this machine, sequential I/O to a RAID-1 pair of those was ~1GB/s. Random access during various pgbench runs was limited to ~200MB/s random throughput. Moving half of that (pg_xlog) onto other media didn't make things any faster and just made setup more complicated.

I'll be testing 6x 600GB SSDs in the next few weeks under an LSI card, and we'll likely have a spinning-drive RAID-1 for pg_xlog there, at least to compare. If you want, I can post what I see from that benchmark next week.

So how many updates/second do you need to push through this thing?
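A minimal sketch of that baseline-then-experiment approach with pgbench on a 9.x-era cluster; the scale factor, client counts and the custom UPDATE script below are placeholders to be adapted to the real workload:

    # build a test database roughly sized to the real one (scale 1000 ~ 15GB)
    createdb pgbench
    pgbench -i -s 1000 pgbench

    # baseline: standard TPC-B-like run, 10 minutes, 32 clients
    pgbench -c 32 -j 8 -T 600 pgbench

    # something closer to an UPDATE-heavy workload: custom script
    cat > update_only.sql <<'EOF'
    \setrandom aid 1 100000000
    UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :aid;
    EOF
    pgbench -n -c 32 -j 8 -T 600 -f update_only.sql pgbench

    # re-run the same commands after each controller/array change and compare tps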
On Mon, Feb 17, 2014 at 04:29:10PM +0100, DFE wrote:
> 2xSSD (MLC) Raid1 for Operating System (CentOS 6.4)
> 4xSSD (SLC) Raid10 for WAL archive and a dedicated "fast tablespace",
> where we have most UPDATE actions (+ Hot spare).
> 10xHDD 15kRPM Raid5 for "default tablespace" (optimized for space,
> instead of Raid10) (+ Hot spare).

That's basically backwards. The WAL is basically a sequential write-only workload, and there's generally no particular advantage to having it on an SSD, unless you've got a workload that's writing WAL faster than the sequential write speed of your spinning disk array (fairly unusual). Putting indexes on the SSD and the WAL on the spinning disks would probably give more bang for the buck.

One thing I've found to help performance in some workloads is to move the xlog to a simple ext2 partition. There's no reason for that data to be on a journaled fs, and isolating it can keep the synchronous xlog operations from interfering with the async table operations. (E.g., it enforces separate per-filesystem queues, metadata flushes, etc.; note that there will be essentially no metadata updates on the xlog if there are sufficient log segments allocated.)

Mike Stone
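A minimal sketch of moving pg_xlog onto a dedicated ext2 partition as described above. The device name, service name and data directory are placeholders for whatever the installation actually uses, and the server must be stopped while the files are moved:

    # format the dedicated partition as ext2 (no journal) and mount it
    mkfs.ext2 /dev/sdb1
    mkdir -p /pg_xlog
    mount -o noatime /dev/sdb1 /pg_xlog
    echo '/dev/sdb1 /pg_xlog ext2 noatime 0 2' >> /etc/fstab

    # relocate the WAL directory and leave a symlink behind
    service postgresql stop
    mv /var/lib/pgsql/data/pg_xlog/* /pg_xlog/
    rmdir /var/lib/pgsql/data/pg_xlog
    ln -s /pg_xlog /var/lib/pgsql/data/pg_xlog
    chown -R postgres:postgres /pg_xlog
    service postgresql start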
On 18.2.2014 02:23, KONDO Mitsumasa wrote:
> I don't have a PERC H710 RAID controller, but I think he would like to
> know which RAID striping/chunk size and which read/write cache ratio in
> the write-back cache setting are best. I'd like to know that too :)

We do have dozens of H710 controllers, but not with SSDs. I've been unable to find reliable answers about how it handles TRIM, and how that works with wearout reporting (using SMART).

The stripe size is actually a very good question. On spinning drives it usually does not matter too much - unless you have a very specialized workload, the 'medium size' is the right choice (AFAIK we're using 64kB on the H710, which is the default).

With SSDs this might actually matter much more, as SSDs work with "erase blocks" (mostly 512kB), and I suspect using a small stripe might result in repeated writes to the same block - overwriting one block repeatedly and thus increasing wearout. But maybe the controller will handle that just fine, e.g. by coalescing the writes and sending them to the drive as a single write. Or maybe the drive can do that in its local write cache (all SSDs have that).

The other thing is filesystem alignment - a few years ago this was a major issue causing poor write performance. Nowadays this works fine, most tools are able to create partitions properly aligned to the 512kB automatically. But if the controller discards this information, it might be worth messing with the stripe size a bit to get it right.

But those are mostly speculations / curious questions I've been asking myself recently, as we've been considering SSDs with the H710/H710p too.

As for the controller cache - my opinion is that using this for caching reads is just plain wrong. If you need to cache reads, buy more RAM - it's much cheaper, so you can buy more of it. Cache on the controller (with a BBU) is designed especially for caching writes safely. (And maybe it could help with some of the erase-block issues too?)

regards
Tomas
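A small sketch of how partition alignment against the stripe / erase-block size might be checked on a recent Linux; the device name is a placeholder, and the values reported depend on what the controller exposes to the kernel:

    # what the block layer reports as preferred I/O sizes / alignment
    cat /sys/block/sda/queue/minimum_io_size
    cat /sys/block/sda/queue/optimal_io_size
    cat /sys/block/sda/alignment_offset

    # ask parted whether partition 1 starts on an optimally aligned boundary;
    # partitions started at a 1MiB boundary are aligned to both 64kB stripes
    # and 512kB erase blocks
    parted /dev/sda align-check optimal 1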
(2014/02/19 5:41), Tomas Vondra wrote:
> The stripe size is actually a very good question. On spinning drives it
> usually does not matter too much - unless you have a very specialized
> workload, the 'medium size' is the right choice (AFAIK we're using 64kB
> on the H710, which is the default).

It is interesting that the RAID stripe size of the PERC H710 is 64kB. On HP RAID cards the default chunk size is 256kB; if we use two disks in RAID 0, the stripe size will be 512kB. I thought that might be too big, but perhaps it is optimized inside the RAID card - in practice it isn't bad with those settings. I'm interested in the RAID card's internal behavior. Fortunately, the Linux RAID card driver is open source, so we might have a look at the source code when we have time.

> With SSDs this might actually matter much more, as SSDs work with "erase
> blocks" (mostly 512kB), and I suspect using a small stripe might result
> in repeated writes to the same block - overwriting one block repeatedly
> and thus increasing wearout. But maybe the controller will handle that
> just fine, e.g. by coalescing the writes and sending them to the drive
> as a single write. Or maybe the drive can do that in its local write
> cache (all SSDs have that).

I have heard that a genuine RAID card with genuine SSDs is optimized for those SSDs. It is important to use SSDs that are compatible with the card for performance; in the worst case the lifetime of the SSD will be short and performance will be bad.

> As for the controller cache - my opinion is that using this for caching
> reads is just plain wrong. If you need to cache reads, buy more RAM -
> it's much cheaper, so you can buy more of it. Cache on the controller
> (with a BBU) is designed especially for caching writes safely.

I'm wondering about the effectiveness of readahead in the OS vs. in the RAID card. In general, data read ahead by the RAID card is stored in the RAID cache and not in the OS cache; data read ahead by the OS is stored in the OS cache. I'd like to use all of the RAID cache as write cache only, because fsync() becomes faster - but then the RAID card can't do much readahead. If we want to use it more effectively we have to sort this out, but it seems difficult :(

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
On Tue, Feb 18, 2014 at 2:41 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> We do have dozens of H710 controllers, but not with SSDs. I've been
> unable to find reliable answers about how it handles TRIM, and how that
> works with wearout reporting (using SMART).

AFAIK (I haven't looked for a few months), they don't support TRIM. The only hardware RAID vendor that has even basic TRIM support is Intel, and that's no accident; I have a theory that enterprise storage vendors are deliberately holding back SSDs. SSDs (at least the newer, better ones) destroy the business model for "enterprise storage equipment" in a large percentage of applications. A 2U server with, say, 10 S3700 drives gives *far* superior performance to most SANs that cost under $100k. For about 1/10th of the price.

If you have a server that is I/O-constrained as opposed to storage-constrained (AKA a database), hard drives make zero economic sense. If your vendor is jerking you around by charging large multiples of market rates for storage and/or disallowing drives that actually perform well in their storage gear, choose a new vendor. And consider using software RAID.

merlin
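A quick sketch of how one might verify from the OS whether TRIM/discard actually reaches the drives; device and mount-point names are placeholders, and behind most hardware RAID controllers these checks will simply report no support:

    # for a directly attached SATA SSD: does the drive advertise TRIM?
    hdparm -I /dev/sda | grep -i trim

    # what discard granularity the kernel sees for each block device
    # (all zeroes in DISC-GRAN/DISC-MAX means no discard support exposed)
    lsblk --discard

    # try an explicit trim of a mounted filesystem; fails with
    # "the discard operation is not supported" if it cannot get through
    fstrim -v /data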
On Wed, Feb 19, 2014 at 8:13 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> If you have a server that is I/O-constrained as opposed to
> storage-constrained (AKA a database), hard drives make zero economic
> sense. If your vendor is jerking you around by charging large multiples
> of market rates for storage and/or disallowing drives that actually
> perform well in their storage gear, choose a new vendor. And consider
> using software RAID.

You can also do the old trick of underprovisioning and/or underutilizing all the space on the SSDs. I.e. put 10x 600GB SSDs under a HW RAID controller in RAID-10, then only partition out half the storage you get from that. So you get 1.5TB of storage and the drives are underutilized enough to have spare space.

Right now I'm testing on a machine with 2x Intel E5-2690s (http://ark.intel.com/products/64596/intel-xeon-processor-e5-2690-20m-cache-2_90-ghz-8_00-gts-intel-qpi), 512GB RAM and 6x 600GB Intel SSDs (not sure which ones) under an LSI MegaRAID 9266. I'm able to crank out 6500 to 7200 TPS under pgbench on a scale 1000 db at 8 to 60 clients on that machine. It's not cheap, but storage-wise it's WAY cheaper than most SANs and very fast. pg_xlog is on a pair of nondescript SATA spinners, btw.
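A minimal sketch of that underprovisioning trick on the controller's virtual disk; /dev/sdb is a placeholder for the array device. Note that leaving half the space unpartitioned only helps wear leveling if those areas were never written (or were trimmed / secure-erased beforehand), which a controller without TRIM passthrough cannot guarantee:

    # partition only the first half of the virtual disk, leave the rest untouched
    parted -s /dev/sdb mklabel gpt
    parted -s -a optimal /dev/sdb mkpart primary 1MiB 50%

    # filesystem on the provisioned half only
    mkfs.ext4 /dev/sdb1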
Hi, On 19.2.2014 03:45, KONDO Mitsumasa wrote: > (2014/02/19 5:41), Tomas Vondra wrote: >> On 18.2.2014 02:23, KONDO Mitsumasa wrote: >>> Hi, >>> >>> I don't have PERC H710 raid controller, but I think he would like to >>> know raid striping/chunk size or read/write cache ratio in >>> writeback-cache setting is the best. I'd like to know it, too:) >> >> The stripe size is actually a very good question. On spinning drives it >> usually does not matter too much - unless you have a very specialized >> workload, the 'medium size' is the right choice (AFAIK we're using 64kB >> on H710, which is the default). > > I am interested that raid stripe size of PERC H710 is 64kB. In HP > raid card, default chunk size is 256kB. If we use two disks with raid > 0, stripe size will be 512kB. I think that it might too big, but it > might be optimized in raid card... In actually, it isn't bad in that > settings. With HP controllers this depends on RAID level (and maybe even controller). Which HP controller are you talking about? I have some basic experience with P400/P800, and those have 16kB (RAID6), 64kB (RAID5) or 128kB (RAID10) defaults. None of them has 256kB. See http://bit.ly/1bN3gIs (P800) and http://bit.ly/MdsEKN (P400). > I'm interested in raid card internal behavior. Fortunately, linux raid > card driver is open souce, so we might good at looking the source code > when we have time. What do you mean by "linux raid card driver"? Afaik the admin tools may be available, but the interesting stuff happens inside the controller, and that's still proprietary. >> With SSDs this might actually matter much more, as the SSDs work with >> "erase blocks" (mostly 512kB), and I suspect using small stripe might >> result in repeated writes to the same block - overwriting one block >> repeatedly and thus increased wearout. But maybe the controller will >> handle that just fine, e.g. by coalescing the writes and sending them to >> the drive as a single write. Or maybe the drive can do that in local >> write cache (all SSDs have that). > > I have heard that genuine raid card with genuine ssds are optimized in > these ssds. It is important that using compatible with ssd for > performance. If the worst case, life time of ssd is be short, and will > be bad performance. Well, that's the main question here, right? Because if the "worst case" actually happens to be true, then what's the point of SSDs? You have a disk that does not provite the performance you expected, died much sooner than you expected and maybe suddenly so it interrupted the operation. So instead of paying more for higher performance, you paid more for bad performance and much shorter life of the disk. Coincidentally we're currently trying to find the answer to this question too. That is - how long will the SSD endure in that particular RAID level? Does that pay off? BTW what you mean by "genuine raid card" and "genuine ssds"? > I'm wondering about effective of readahead in OS and raid card. In > general, readahead data by raid card is stored in raid cache, and > not stored in OS caches. Readahead data by OS is stored in OS cache. > I'd like to use all raid cache for only write cache, because fsync() > becomes faster. But then, it cannot use readahead very much by raid > card.. 
If we hope to use more effectively, we have to clear it, but > it seems difficult:( I've done a lot of testing of this on H710 in 2012 (~18 months ago), measuring combinations of * read-ahead on controller (adaptive, enabled, disabled) * read-ahead in kernel (with various sizes) * scheduler The test was the simplest and most suitable workload for this - just "dd" with 1MB block size (AFAIK, would have to check the scripts). In short, my findings are that: * read-ahead in kernel matters - tweak this * read-ahead on controller sucks - either makes no difference, or actually harms performance (adaptive with small values set for kernel read-ahead) * scheduler made no difference (at least for this workload) So we disable readahead on the controller, use 24576 for kernel and it works fine. I've done the same test with fusionio iodrive (attached to PCIe, not through controller) - absolutely no difference. Tomas
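A short sketch of adjusting and testing kernel readahead as described above; the device name is a placeholder, and 24576 sectors (24576 x 512B = 12MB) is simply the value mentioned in this thread, not a general recommendation:

    # current readahead, in 512-byte sectors
    blockdev --getra /dev/sda

    # set it to 24576 sectors = 12MB
    # (equivalently: echo 12288 > /sys/block/sda/queue/read_ahead_kb)
    blockdev --setra 24576 /dev/sda

    # crude sequential-read check: drop caches, then time a large sequential read
    echo 3 > /proc/sys/vm/drop_caches
    dd if=/dev/sda of=/dev/null bs=1M count=16384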
On Wed, Feb 19, 2014 at 12:09 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> You can also do the old trick of underprovisioning and/or underutilizing
> all the space on the SSDs.
>
> Right now I'm testing on a machine with 512GB RAM and 6x 600GB Intel
> SSDs (not sure which ones) under an LSI MegaRAID 9266.

Yeah -- underprovisioning certainly helps, but for any write-heavy configuration, all else being equal, TRIM support will perform faster and have less wear. Those drives are likely the older 320 600GB. The newer S3700 are much faster, although they cost around twice as much.

merlin
On 19.2.2014 16:13, Merlin Moncure wrote:
> AFAIK (I haven't looked for a few months), they don't support TRIM. The
> only hardware RAID vendor that has even basic TRIM support is Intel, and
> that's no accident; I have a theory that enterprise storage vendors are
> deliberately holding back SSDs. SSDs (at least the newer, better ones)
> destroy the business model for "enterprise storage equipment" in a large
> percentage of applications. A 2U server with, say, 10 S3700 drives gives
> *far* superior performance to most SANs that cost under $100k. For about
> 1/10th of the price.

Yeah, maybe. I'm generally a bit skeptical when it comes to conspiracy theories like this, but for about a year we've all known that they might easily happen to be true. So maybe ...

Nevertheless, I'd guess this is another case of "nobody ever got fired for buying X", where X is a storage product based on spinning drives, proven to be reliable, with known operational statistics and a pretty good understanding of how it works. While "Y" is a new thing based on SSDs, which got a rather bad reputation initially because of hype and premature use of consumer-grade products for unsuitable workloads. Also, each vendor of Y uses different tricks, which makes applying experience across vendors (or even across generations of drives from the same vendor) very difficult.

Factor in how conservative DBAs happen to be, and I think it might be this particular feedback loop forcing the vendors not to push this.

> If you have a server that is I/O-constrained as opposed to
> storage-constrained (AKA a database), hard drives make zero economic
> sense. If your vendor is jerking you around by charging large multiples
> of market rates for storage and/or disallowing drives that actually
> perform well in their storage gear, choose a new vendor. And consider
> using software RAID.

Yeah, exactly.

Tomas
On 19.2.2014 19:09, Scott Marlowe wrote:
> You can also do the old trick of underprovisioning and/or underutilizing
> all the space on the SSDs. I.e. put 10x 600GB SSDs under a HW RAID
> controller in RAID-10, then only partition out half the storage you get
> from that. So you get 1.5TB of storage and the drives are underutilized
> enough to have spare space.

Yeah. AFAIK that's basically what Intel did with S3500 -> S3700.

What I'm trying to find is the 'sweet spot' considering lifespan, capacity, performance and price. That's why I'm still wondering if there are some experiences with the current generation of SSDs and RAID controllers, with RAID levels other than RAID-10.

Say I have 8x 400GB SSD, 75k/32k read/write IOPS each (i.e. it's basically the S3700 from Intel). Assuming the writes are ~25% of the workload, this is what I get for RAID10 vs. RAID6 (math done using http://www.wmarow.com/strcalc/):

            | capacity GB | bandwidth MB/s | IOPS
  -----------------------------------------------
  RAID-10   |        1490 |           2370 | 300k
  RAID-6    |        2230 |           1070 | 130k

Let's say the app can't really generate 130k IOPS (we'll saturate the CPU way before that), so even if the real-world numbers are less than 50% of this, we're not going to hit the disks as the main bottleneck. So let's assume there's no observable performance difference between RAID10 and RAID6 in our case. But we could put 1.5x the amount of data on the RAID6, making it much cheaper (we're talking about non-trivial numbers of such machines).

The question is - how long will it last before the SSDs die because of wearout? Will the RAID controller make it worse due to (not) handling TRIM? Will we know how much time we have left, i.e. will the controller provide the info the drives provide through SMART?

> Right now I'm testing on a machine with 2x Intel E5-2690s
> (http://ark.intel.com/products/64596/intel-xeon-processor-e5-2690-20m-cache-2_90-ghz-8_00-gts-intel-qpi)
> 512GB RAM and 6x 600GB Intel SSDs (not sure which ones) under an LSI
> MegaRAID 9266. I'm able to crank out 6500 to 7200 TPS under pgbench on
> a scale 1000 db at 8 to 60 clients on that machine. It's not cheap, but
> storage-wise it's WAY cheaper than most SANs and very fast. pg_xlog is
> on a pair of nondescript SATA spinners, btw.

Most likely S3500. S3700 are not offered with 600GB capacity.

Nice. I've done some testing with a FusionIO ioDrive Duo (2 devices in RAID0) about a year ago, and I got 12k TPS (or ~15k with WAL on a SAS RAID). So considering the price, the 7.2k TPS is really good IMHO.

regards
Tomas
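For the curious, a rough sketch of the kind of model behind numbers like those: a plain harmonic-mean mixed-workload estimate with write penalties of 2 for RAID-10 and 6 for RAID-6. It lands near the calculator's 300k/130k figures, but it is not the calculator's exact method:

    awk 'BEGIN {
      n = 8; riops = 75000; wiops = 32000; wfrac = 0.25
      split("RAID-10 RAID-6", lvl, " "); pen[1] = 2; pen[2] = 6
      for (i = 1; i <= 2; i++) {
        rcap = n * riops              # reads can hit all drives
        wcap = n * wiops / pen[i]     # writes pay the RAID write penalty
        mixed = 1 / ((1 - wfrac) / rcap + wfrac / wcap)
        printf "%-8s ~%dk mixed IOPS (read cap %dk, write cap %dk)\n",
               lvl[i], mixed / 1000, rcap / 1000, wcap / 1000
      }
    }'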
On Wed, Feb 19, 2014 at 6:10 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> Most likely S3500. S3700 are not offered with 600GB capacity.
>
> Nice. I've done some testing with a FusionIO ioDrive Duo (2 devices in
> RAID0) about a year ago, and I got 12k TPS (or ~15k with WAL on a SAS
> RAID). So considering the price, the 7.2k TPS is really good IMHO.

The part number reported by the LSI is SSDSC2BB600G4, so I'm assuming it's an SLC drive. I've done some further testing: I keep well over 6k TPS right up to 128 clients. At no time is there any iowait under vmstat, and if I turn off fsync, speed goes up by some tiny amount, so I'm guessing I'm CPU-bound at this point. This machine has dual 8-core HT Intel CPUs.

We have another class of machine running on FusionIO ioDrive2 MLC cards in RAID-1 and 4x 6-core non-HT CPUs. It's a bit slower (1366 versus 1600MHz memory, slower CPU clocks and interconnects etc.) and it can do about 5k TPS, and again, like the other machine, no iowait, all CPU-bound.

I'd say once you get to a certain level of I/O subsystem it gets harder and harder to max it out. I'd love to have a 64-core 4-socket AMD top-of-the-line system to compare here. But honestly both classes of machine are more than fast enough for what we need, and our major load is from SELECT statements, so fitting the db into RAM is more important than IOPS for what we do.
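As an aside, the drive model and wear indicators behind an LSI/PERC controller can often be read with smartmontools; a sketch, where the megaraid device ID and block device name are placeholders that depend on the setup:

    # list devices the controller exposes, with their megaraid IDs
    smartctl --scan

    # identity of the physical drive behind device ID 0 (shows the model,
    # e.g. the Intel SSDSC2BB600G4 = DC S3500 600GB mentioned in this thread)
    smartctl -i -d megaraid,0 /dev/sda

    # SMART attributes, including wear/endurance counters on Intel SSDs
    smartctl -a -d megaraid,0 /dev/sda | grep -i -e wear -e lifetime -e media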
Hi,

(2014/02/20 9:13), Tomas Vondra wrote:
> With HP controllers this depends on the RAID level (and maybe even the
> controller). Which HP controller are you talking about? I have some
> basic experience with the P400/P800, and those have 16kB (RAID6), 64kB
> (RAID5) or 128kB (RAID10) defaults. None of them has 256kB.
> See http://bit.ly/1bN3gIs (P800) and http://bit.ly/MdsEKN (P400).

I use the P410 and P420, which come in the DL360 gen7 and DL360 gen8. They are relatively recent. I checked the RAID stripe size (RAID 1+0) using the hpacucli tool, and it is indeed a 256kB chunk size. The P420 also allows setting larger or smaller chunk sizes, in a range of 8kB - 1024kB or more. But I don't know the best parameter for Postgres :(

> What do you mean by "Linux RAID card driver"? AFAIK the admin tools may
> be available, but the interesting stuff happens inside the controller,
> and that's still proprietary.

I meant the open source driver; the HP drivers are at the following URL: http://cciss.sourceforge.net/ However, having only read the driver source roughly, the core part of the RAID card programming is in the firmware, as you say; the driver just drives it from the OS side. I was interested in the elevator algorithm when I read the driver source code, but the detailed algorithm is probably in the firmware.

> Well, that's the main question here, right? Because if the "worst case"
> actually happens to be true, then what's the point of SSDs?

Sorry, the thread topic is SSD striping size tuning. I'm especially interested in magnetic disks, but also in SSDs.

> So instead of paying more for higher performance, you paid more for bad
> performance and a much shorter life of the disk.

I'm interested in whether changing the RAID chunk size really shortens the life. I had not considered this point; it might be true. I'd like to test it using a SMART checker if we have time.

> Coincidentally we're currently trying to find the answer to this
> question too. That is - how long will the SSD endure in that particular
> RAID level? Does that pay off?
>
> BTW what do you mean by "genuine raid card" and "genuine ssds"?

By "genuine" I mean from the same manufacturer or vendor.

> In short, my findings are that:
>
> * read-ahead in the kernel matters - tweak this
> * read-ahead on the controller sucks - it either makes no difference, or
>   actually harms performance (adaptive, with small values set for kernel
>   read-ahead)
> * the scheduler made no difference (at least for this workload)
>
> So we disable readahead on the controller, use 24576 for the kernel, and
> it works fine.
>
> I've done the same test with a FusionIO ioDrive (attached to PCIe, not
> through a controller) - absolutely no difference.

I'd like to know the random access (8kB) performance; this test doesn't seem to show it. But this is interesting data. What command did you use to set the kernel readahead parameter? If you use blockdev, a value of 256 means 256 * 512B (sector size) = 128kB of readahead, and the 24576 you set means 24576 * 512B = 12MB of readahead. I think that is quite big, but it is probably optimized for your environment. At the end of the day, is a very large readahead better than a small one or none at all? If we have a lot of RAM it seems so, but otherwise is it? It is a difficult problem.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
On 20.2.2014 02:47, Scott Marlowe wrote:
> The part number reported by the LSI is SSDSC2BB600G4, so I'm assuming
> it's an SLC drive. I've done some further testing: I keep well over 6k
> TPS right up to 128 clients.

No, it's the S3500, which is an MLC drive:
http://ark.intel.com/products/74944/Intel-SSD-DC-S3500-Series-600GB-2_5in-SATA-6Gbs-20nm-MLC

Tomas