Thread: Hardware vs Software RAID
Hi

Has anyone done some benchmarks between hardware RAID vs Linux MD software RAID? I'm keen to know the result.

--
Adrian Moisey
Systems Administrator | CareerJunction | Your Future Starts Here.
Web: www.careerjunction.co.za | Email: adrian@careerjunction.co.za
Phone: +27 21 686 6820 | Mobile: +27 82 858 7830 | Fax: +27 21 686 6842
On Wed, Jun 25, 2008 at 7:05 AM, Adrian Moisey <adrian@careerjunction.co.za> wrote:
> Has anyone done some benchmarks between hardware RAID vs Linux MD software RAID?
>
> I'm keen to know the result.

I have here:
http://merlinmoncure.blogspot.com/2007/08/following-are-results-of-our-testing-of.html

I also did some pgbench tests which I unfortunately did not record. The upshot is I don't really see a difference in performance. I mainly prefer software raid because it's flexible and you can use the same set of tools across different hardware. One annoying thing about software raid that comes up periodically is that you can't grow raid 0 volumes.

merlin
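
For anyone who wants to repeat that kind of comparison, a minimal pgbench run looks roughly like the following, executed once against each array; the database name, scale factor and client counts are just placeholders to adjust to the hardware under test:

    # build a test database (scale 100 is roughly 1.5GB of data)
    pgbench -i -s 100 pgbench
    # mixed read/write TPC-B-like load: 8 clients, 10000 transactions each
    pgbench -c 8 -t 10000 pgbench
    # read-only variant for comparison
    pgbench -S -c 8 -t 10000 pgbench
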
On Wed, 25 Jun 2008, Merlin Moncure wrote:
>> Has anyone done some benchmarks between hardware RAID vs Linux MD software RAID?
>
> I have here:
> http://merlinmoncure.blogspot.com/2007/08/following-are-results-of-our-testing-of.html
>
> The upshot is I don't really see a difference in performance.

The main difference is that you can get hardware RAID with battery-backed-up cache, which means small writes will be much quicker than software RAID. Postgres does a lot of small writes under some use cases.

Without a BBU cache, it is sensible to put the transaction logs on a disc system separate from the main database, to make the transaction log writes fast (there is no seeking on those discs). With a BBU cache, however, that advantage is irrelevant, as the cache will absorb the writes.

However, not all hardware RAID will have such a battery-backed-up cache, and those that do tend to have a hefty price tag.

Matthew

--
$ rm core
Segmentation Fault (core dumped)
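
For reference, the usual way to get the transaction log onto its own discs is simply to move pg_xlog and leave a symlink behind; the data directory and mount point below are examples, and the server must be stopped while you do it:

    pg_ctl -D /var/lib/pgsql/data stop
    mv /var/lib/pgsql/data/pg_xlog /wal_disks/pg_xlog
    ln -s /wal_disks/pg_xlog /var/lib/pgsql/data/pg_xlog
    pg_ctl -D /var/lib/pgsql/data start
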
"Also sprach Matthew Wakeling:" > >> Has anyone done some benchmarks between hardware RAID vs Linux MD software > >> RAID? ... > > The upshot is I don't really see a difference in performance. > > The main difference is that you can get hardware RAID with > battery-backed-up cache, which means small writes will be much quicker > than software RAID. Postgres does a lot of small writes under some use It doesn't "mean" that, I'm afraid. You can put the log/bitmap wherever you want in software raid, including on a battery-backed local ram disk if you feel so inclined. So there is no intrinsic advantage to be gained there at all. > However, not all hardware RAID will have such a battery-backed-up cache, > and those that do tend to have a hefty price tag. Whereas software raid and a firewire-attached log device does not. Peter
On Wed, 25 Jun 2008, Peter T. Breuer wrote:
> You can put the log/bitmap wherever you want in software raid, including on a battery-backed local ram disk if you feel so inclined. So there is no intrinsic advantage to be gained there at all.

You are technically correct but this is irrelevant. There are zero mainstream battery-backed local RAM disk setups appropriate for database use that don't cost substantially more than the upgrade cost to just getting a good hardware RAID controller with cache integrated and using regular disks.

What I often do is get a hardware RAID controller, just to accelerate disk writes, but configure it in JBOD mode and use Linux or other software RAID on that platform.

Advantages of using software RAID, in general and in some cases even with a hardware disk controller:

-Your CPU is inevitably faster than the one on the controller, so this can give better performance than having RAID calculations done on the controller itself.

-If the RAID controller dies, you can move everything to another machine and know that the RAID setup will transfer. Usually hardware RAID controllers use a formatting process such that you can't read the array without such a controller, so you're stuck with having a replacement controller around if you're paranoid. As long as I've got any hardware that can read the disks, I can get a software RAID back again.

-There is a transparency to having the disks directly attached to the OS that you lose with most hardware RAID. Often with hardware RAID you lose the ability to do things like monitor drive status and temperature without using a special utility to read SMART and similar data.

Disadvantages:

-Maintenance like disk replacement rebuilds will be using up your main CPU and its resources (like I/O bus bandwidth) that might otherwise be offloaded onto the hardware RAID controller.

-It's harder to set up a redundant boot volume with software RAID that works right with a typical PC BIOS. If you use hardware RAID it tends to insulate you from the BIOS quirks.

-If a disk fails, I've found a full hardware RAID setup is less likely to result in an OS crash than a software RAID is. The same transparency and visibility into what the individual disks are doing can be a problem when a disk goes crazy and starts spewing junk the OS has to listen to. Hardware controllers tend to do a better job planning for that sort of failure, and some of that is lost even by putting them into JBOD mode.

>> However, not all hardware RAID will have such a battery-backed-up cache, and those that do tend to have a hefty price tag.
>
> Whereas software raid and a firewire-attached log device do not.

A firewire-attached log device is an extremely bad idea. First off, you're at the mercy of the firewire bridge's write guarantees, which may or may not be sensible. It's not hard to find reports of people whose disks were corrupted when the disk was accidentally disconnected, or of buggy drive controller firmware causing problems. I stopped using Firewire years ago because it seems you need to do some serious QA to figure out which combinations are reliable and which aren't, and I don't use external disks enough to spend that kind of time with them.

Second, there are few if any Firewire setups where the host gets to read SMART error data from the disk. This means that you can continue to use a flaky disk long past the point where a direct-connected drive would have been kicked out of an array for being unreliable.
SMART doesn't detect 100% of drive failures in advance, but you'd be silly to setup a database system where you don't get to take advantage of the ~50% it does catch before you lose any data. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
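
As a concrete starting point, smartmontools covers the monitoring Greg describes; the device name and mail address below are examples:

    # one-off health check and full attribute/error-log dump
    smartctl -H /dev/sda
    smartctl -a /dev/sda
    # in /etc/smartd.conf: monitor all attributes, mail on trouble,
    # and run a short self-test daily at 02:00
    /dev/sda -a -m admin@example.com -s S/../.././02
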
On Wed, Jun 25, 2008 at 11:24 AM, Greg Smith <gsmith@gregsmith.com> wrote: > SMART doesn't detect 100% of drive failures in advance, but you'd be silly > to setup a database system where you don't get to take advantage of the > ~50% it does catch before you lose any data. Can't argue with that one. -- Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 499 Thornall Street, 2nd Floor | jonah.harris@enterprisedb.com Edison, NJ 08837 | http://www.enterprisedb.com/
On Wed, 2008-06-25 at 11:30 -0400, Jonah H. Harris wrote: > On Wed, Jun 25, 2008 at 11:24 AM, Greg Smith <gsmith@gregsmith.com> wrote: > > SMART doesn't detect 100% of drive failures in advance, but you'd be silly > > to setup a database system where you don't get to take advantage of the > > ~50% it does catch before you lose any data. > > Can't argue with that one. SMART has certainly saved our butts more than once. Joshua D. Drake
On Wed, 25 Jun 2008, Greg Smith wrote: > A firewire-attached log device is an extremely bad idea. Anyone have experience with IDE, SATA, or SAS-connected flash devices like the Samsung MCBQE32G5MPP-0VA? I mean, it seems lovely - 32GB, at a transfer rate of 100MB/s, and doesn't degrade much in performance when writing small random blocks. But what's it actually like, and is it reliable? Matthew -- Terrorists evolve but security is intelligently designed? -- Jake von Slatt
On Wed, Jun 25, 2008 at 5:05 AM, Adrian Moisey <adrian@careerjunction.co.za> wrote:
> Hi
>
> Has anyone done some benchmarks between hardware RAID vs Linux MD software RAID?
>
> I'm keen to know the result.

I've had good performance from sw RAID-10 in later kernels, especially if it was handling a mostly-read load, like a reporting server. The problem with hw RAID is that the actual performance delivered doesn't always match up to the promise, due to issues like driver bugs, mediocre implementations, etc. Years ago when the first megaraid v2 drivers were coming out they were pretty buggy. Once a stable driver was out they worked quite well.

I'm currently having a problem with a "well known very large server manufacturer who shall remain unnamed" and their semi-custom RAID controller firmware not getting along with the driver for ubuntu. The machine we're ordering to replace it will have a much beefier RAID controller with a better driver / OS match, and I expect better behavior from that setup.
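
For anyone starting from scratch, the kind of md RAID-10 Scott describes is only a couple of commands; the device names here are examples:

    # four-disk RAID-10 out of one partition per disk
    mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
    # watch the initial sync and check health afterwards
    cat /proc/mdstat
    mdadm --detail /dev/md0
    # record the array so it assembles at boot (config path varies by distribution)
    mdadm --detail --scan >> /etc/mdadm.conf
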
On Wed, 2008-06-25 at 09:53 -0600, Scott Marlowe wrote:
> On Wed, Jun 25, 2008 at 5:05 AM, Adrian Moisey <adrian@careerjunction.co.za> wrote:
>> Hi
>
> I'm currently having a problem with a "well known very large server manufacturer who shall remain unnamed" and their semi-custom RAID controller firmware not getting along with the driver for ubuntu.

/me waves to Dell.

Joshua D. Drake
"Also sprach Greg Smith:" > On Wed, 25 Jun 2008, Peter T. Breuer wrote: > > > You can put the log/bitmap wherever you want in software raid, including > > on a battery-backed local ram disk if you feel so inclined. So there is > > no intrinsic advantage to be gained there at all. > > You are technically correct but this is irrelevant. There are zero > mainstream battery-backed local RAM disk setups appropriate for database > use that don't cost substantially more than the upgrade cost to just I refrained from saying in my reply that I would set up a firewire-based link to ram in a spare old portable (which comes with a battery) if I wanted to do this cheaply. One reason I refrained was because I did not want to enter into a discussion of transport speeds vs latency vs block request size. GE, for example, would have horrendous performance at 1KB i/o blocks. Mind you, it still would be over 20MB/s (I measure 70MB/s to a real scsi remote disk across GE at 64KB blocksize). > getting a good hardware RAID controller with cache integrated and using > regular disks. > > What I often do is get a hardware RAID controller, just to accelerate disk > writes, but configure it in JBOD mode and use Linux or other software RAID > on that platform. I wonder what "JBOD mode" is ... :) Journaled block over destiny? Oh .. "Just a Bunch of Disks". So you use the linux software raid driver instead of the hardware or firmware driver on the raid assembly. Fair enough. > Advantages of using software RAID, in general and in some cases even with > a hardware disk controller: > > -Your CPU is inevitably faster than the one on the controller, so this can > give better performance than having RAID calcuations done on the > controller itself. It's not clear. You take i/o bandwidth out of the rest of your system, and cpu time too. In a standard dual core machine which is not a workstation, it's OK. On my poor ol' 1GHz P3 TP x24 laptop, doing two things at once is definitely a horrible strain on my X responsiveness. On a risc machine (ARM, 250MHz) I have seen horrible cpu loads from software raid. > -If the RAID controllers dies, you can move everything to another machine > and know that the RAID setup will transfer. Usually hardware RAID Oh, I agree with that. You're talking about the proprietary formatting in hw raid assemblies, I take it? Yah. > -There is a transparency to having the disks directly attached to the OS Agreed. "It's alright until it goes wrong". > Disadvantages: > > -Maintenance like disk replacement rebuilds will be using up your main CPU Agreed (above). > > -It's harder to setup a redundant boot volume with software RAID that Yeah. I don't bother. A small boot volume in readonly mode with a copy on another disk is what I use. > works right with a typical PC BIOS. If you use hardware RAID it tends to > insulate you from the BIOS quirks. Until the machine dies? (and fries a disk or two on the way down .. happens, has happend to me). > -If a disk fails, I've found a full hardware RAID setup is less likely to > result in an OS crash than a software RAID is. The same transparency and Not sure. > >> However, not all hardware RAID will have such a battery-backed-up cache, > >> and those that do tend to have a hefty price tag. > > > > Whereas software raid and a firewire-attached log device does not. > > A firewire-attached log device is an extremely bad idea. First off, > you're at the mercy of the firewire bridge's write guarantees, which may > or may not be sensible. The log is sync. 
Therefore it doesn't matter what the guarantees are, or at least I assume you are worrying about acks coming back before the write has been sent, etc. Only an actual net write will be acked by the firewire transport as far as I know.

If OTOH you are thinking of "a firewire attached disk" as a complete black box, then yes, I agree, you are at the mercy of the driver writer for that black box. But I was not thinking of that. I was only choosing firewire as a transport because of its relatively good behaviour with small requests, as opposed to GE as a transport, or 100BT as a transport, or whatever else as a transport...

> It's not hard to find reports of people whose disks were corrupted when the disk was accidentally disconnected, or of buggy drive controller firmware causing problems. I stopped using Firewire years ago because it seems you need to do some serious QA to figure out which combinations are reliable and which aren't, and I don't use external disks enough to spend that kind of time with them.

Sync operation of the disk should make you immune to any quirks, even if you are thinking of "firewire plus disk" as a black-box unit.

> Second, there are few if any Firewire setups where the host gets to read SMART error data from the disk.

An interesting point, but I really was considering firewire only as the transport (I'm the author of the ENBD - enhanced network block device - driver, which makes any remote block device available over any transport, so I guess that accounts for the different assumption).

Peter
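
The "sync operation" being relied on here only holds if the drive's own volatile write cache is out of the picture. A quick check and fix, with the device name as an example (the sdparm line is the SCSI/SAS analogue, clearing the WCE bit in the caching mode page):

    # report the current write-caching setting
    hdparm -W /dev/sda
    # turn the volatile write cache off so an ack really means "on the platter"
    hdparm -W 0 /dev/sda
    # SCSI/SAS equivalent
    sdparm --clear=WCE /dev/sda
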
On Wed, Jun 25, 2008 at 11:55 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
> On Wed, 2008-06-25 at 09:53 -0600, Scott Marlowe wrote:
>> On Wed, Jun 25, 2008 at 5:05 AM, Adrian Moisey <adrian@careerjunction.co.za> wrote:
>
>> I'm currently having a problem with a "well known very large server manufacturer who shall remain unnamed" and their semi-custom RAID controller firmware not getting along with the driver for ubuntu.
>
> /me waves to Dell.

not just ubuntu...the dell perc/x line software utilities also explicitly check the hardware platform so they only run on dell hardware. However, the lsi logic command line utilities run just fine. As for ubuntu sas support, ubuntu supports the mpt fusion/sas line directly through the kernel. In fact, installing ubuntu server fixed an unrelated issue with a qlogic fibre hba that was causing reboots under heavy load with a pci-x fibre controller on centos.

So, based on this and other experiences, i'm starting to be more partial to linux distributions with faster moving kernels, mainly because i trust the kernel drivers more than the vendor provided drivers. The in-place distribution upgrade is also very nice.

merlin
On Wed, Jun 25, 2008 at 9:03 AM, Matthew Wakeling <matthew@flymine.org> wrote:
> On Wed, 25 Jun 2008, Merlin Moncure wrote:
>>> Has anyone done some benchmarks between hardware RAID vs Linux MD software RAID?
>>
>> I have here:
>> http://merlinmoncure.blogspot.com/2007/08/following-are-results-of-our-testing-of.html
>>
>> The upshot is I don't really see a difference in performance.
>
> The main difference is that you can get hardware RAID with battery-backed-up cache, which means small writes will be much quicker than software RAID. Postgres does a lot of small writes under some use cases.

As discussed down thread, software raid still gets the benefits of write-back caching on the raid controller...but there are a couple of things I'd like to add. First, if your server is extremely busy, the write-back cache will eventually get overrun and performance will degrade to more typical ('write through') performance. Secondly, many hardware raid controllers have really nasty behavior in this scenario. Linux software raid has decent degradation in overload conditions but many popular raid controllers (dell perc/lsi logic sas for example) become unpredictable and very bursty in sustained high load conditions.

As greg mentioned, I trust the linux kernel software raid much more than the black box hw controllers. Also, contrary to vast popular mythology, the 'overhead' of sw raid in most cases is zero except in very particular conditions.

merlin
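
One rough way to see the overrun Merlin describes is to hammer the array with back-to-back sequential writes and watch the reported rate fall off once the controller cache fills; the target path is an example:

    # five 2GB write passes, each forced to disk before dd reports its rate
    for i in 1 2 3 4 5; do
        dd if=/dev/zero of=/data/ddtest.$i bs=8k count=262144 conv=fdatasync 2>&1 | tail -n 1
    done
    rm -f /data/ddtest.*
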
On Wed, Jun 25, 2008 at 01:35:49PM -0400, Merlin Moncure wrote: > experiences, i'm starting to be more partial to linux distributions > with faster moving kernels, mainly because i trust the kernel drivers > more than the vendor provided drivers. While I have some experience that agrees with this, I'll point out that I've had the opposite experience, too: upgrading the kernel made a perfectly stable system both unstable and prone to data loss. I think this is a blade that cuts both ways, and the key thing to do is to ensure you have good testing infrastructure in place to check that things will work before you deploy to production. (The other way to say that, of course, is "Linux is only free if your time is worth nothing." Substitute your favourite free software for "Linux", of course. ;-) ) A -- Andrew Sullivan ajs@commandprompt.com +1 503 667 4564 x104 http://www.commandprompt.com/
>>> Andrew Sullivan <ajs@commandprompt.com> wrote: > this is a blade that cuts both ways, and the key thing to do is > to ensure you have good testing infrastructure in place to check that > things will work before you deploy to production. (The other way to > say that, of course, is "Linux is only free if your time is worth > nothing." Substitute your favourite free software for "Linux", of > course. ;-) ) It doesn't have to be free software to cut that way. I've actually found the free software to waste less of my time. If you depend on your systems, though, you should never deploy any change, no matter how innocuous it seems, without testing. -Kevin
On Wed, Jun 25, 2008 at 01:07:25PM -0500, Kevin Grittner wrote: > > It doesn't have to be free software to cut that way. I've actually > found the free software to waste less of my time. No question. But one of the unfortunate facts of the no-charge-for-licenses world is that many people expect the systems to be _really free_. It appears that some people think, because they've already paid $smallfortune for a license, it's therefore ok to pay another amount in operation costs and experts to run the system. Free systems, for some reason, are expected also magically to run themselves. This tendency is getting better, but hasn't gone away. It's partly because the budget for the administrators is often buried in the overall large system budget, so nobody balks when there's a big figure attached there. When you present a budget for "free software" that includes the cost of a few administrators, the accounting people want to know why the free software costs so much. > If you depend on your systems, though, you should never deploy any > change, no matter how innocuous it seems, without testing. I agree completely. -- Andrew Sullivan ajs@commandprompt.com +1 503 667 4564 x104 http://www.commandprompt.com/
On Wed, 25 Jun 2008, Peter T. Breuer wrote:
> I refrained from saying in my reply that I would set up a firewire-based link to ram in a spare old portable (which comes with a battery) if I wanted to do this cheaply.

Maybe, but this is kind of a weird setup. Not many people are going to run a production database that way, and us wandering into the details too much risks confusing everybody else.

> The log is sync. Therefore it doesn't matter what the guarantees are, or at least I assume you are worrying about acks coming back before the write has been sent, etc. Only an actual net write will be acked by the firewire transport as far as I know.

That's exactly the issue; it's critical for database use that a disk not lie to you about writes being done if they're actually sitting in a cache somewhere. (S)ATA disks do that, so you have to turn that off for them to be safe to use. Since the firewire enclosure is a black box, it's difficult to know exactly what it's doing here, and history says that every type of (S)ATA disk does the wrong thing in the default case. I expect that for any Firewire/USB device, if I write to the disk, then issue an fsync, it will return success once the data has been written to the disk's cache--which is crippling behavior from the database's perspective the day you get a crash.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
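
A crude way to test whether a given drive or enclosure is lying about flushes is to time synchronous 8kB writes; the target path is an example:

    # 1000 blocks of 8kB, each one synced before the next is issued;
    # an honest lone 7200rpm disk manages on the order of 100 of these per
    # second (~1MB/s), so a result of tens of MB/s means something in the
    # path is acknowledging writes out of cache
    dd if=/dev/zero of=/path/on/suspect/disk/synctest bs=8k count=1000 oflag=dsync
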
On Wed, 25 Jun 2008, Merlin Moncure wrote: > So, based on this and other experiences, i'm starting to be more partial > to linux distributions with faster moving kernels, mainly because i > trust the kernel drivers more than the vendor provided drivers. Depends on how fast. I find it takes a minimum of 3-6 months before any new kernel release stabilizes (somewhere around 2.6.X-5 to -10), and some distributions push them out way before that. Also, after major changes, it can be a year or more before a new kernel is not a regression either in reliability, performance, or worst-case behavior. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Wed, 25 Jun 2008, Andrew Sullivan wrote:
> the key thing to do is to ensure you have good testing infrastructure in place to check that things will work before you deploy to production.

This is true whether you're using Linux or completely closed source software. There are two main differences from my view:

-OSS software lets you look at the code before a typical closed-source company would have pushed a product out the door at all. The downside is that you need to recognize that's what you're getting: Linux kernels, for example, need a significant amount of encounters with the real world after release before they're ready for most people.

-If your OSS program doesn't work, you can potentially find the problem yourself. I don't actually fix issues when I come across them very often, but being able to browse the source code for something that isn't working frequently makes it easier to understand what's going on as part of troubleshooting.

It's not like closed source software doesn't have the same kinds of bugs. The way commercial software (and projects like PostgreSQL) gets organized into a smaller number of official releases tends to focus the QA process a bit better though, so that regular customers don't see as many rough edges. Linux used to do a decent job of this with their development vs. stable kernels, which I really miss. Unfortunately there's just not enough time for the top-level developers to manage that while still keeping up with the pace needed just for new work. Sorting out which are the stable kernel releases seems to have become the job of the distributors (RedHat, SuSE, Debian, etc.) instead of the core kernel developers.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Jun 25, 2008, at 11:35 AM, Matthew Wakeling wrote: > On Wed, 25 Jun 2008, Greg Smith wrote: >> A firewire-attached log device is an extremely bad idea. > > Anyone have experience with IDE, SATA, or SAS-connected flash > devices like the Samsung MCBQE32G5MPP-0VA? I mean, it seems lovely - > 32GB, at a transfer rate of 100MB/s, and doesn't degrade much in > performance when writing small random blocks. But what's it actually > like, and is it reliable? None of these manufacturers rates these drives for massive amounts of writes. They're sold as suitable for laptop/desktop use, which normally is not a heavy wear and tear operation like a DB. Once they claim suitability for this purpose, be sure that I and a lot of others will dive into it to see how well it really works. Until then, it will just be an expensive brick-making experiment, I'm sure.
"Also sprach Merlin Moncure:" > As discussed down thread, software raid still gets benefits of > write-back caching on the raid controller...but there are a couple of (I wish I knew what write-back caching was!) Well, if you mean the Linux software raid driver, no, there's no extra caching (buffering). Every request arriving at the device is duplicated (for RAID1), using a local finite cache of buffer head structures and real extra muffers from the kernel's general resources. Every arriving request is dispatched two its subtargets as it arrives (as two or more new requests). On reception of both (or more) acks, the original request is acked, and not before. This imposes a considerable extra resource burden. It's a mystery to me why the driver doesn't deadlock against other resource eaters that it may depend on. Writing to a device that also needs extra memory per request in its driver should deadlock it, in theory. Against a network device as component, it's a problem (tcp needs buffers). However the lack of extra buffering is really deliberate (double buffering is a horrible thing in many ways, not least because of the probable memory deadlock against some component driver's requirement). The driver goes to the lengths of replacing the kernel's generic make_request function just for itself in order to make sure full control resides in the driver. This is required, among other things, to make sure that request order is preserved, and that requests. It has the negative that standard kernel contiguous request merging does not take place. But that's really required for sane coding in the driver. Getting request pages into general kernel buffers ... may happen. > things I'd like to add. First, if your sever is extremely busy, the > write back cache will eventually get overrun and performance will > eventually degrade to more typical ('write through') performance. I'd like to know where this 'write back cache' �s! (not to mention what it is :). What on earth does `write back' mean? Peraps you mean the kernel's general memory system, which has the effect of buffering and caching requests on the way to drivers like raid. Yes, if you write to a device, any device, you will only write to the kernel somwhere, which may or may not decide now or later to send the dirty buffers thus created on to the driver in question, either one by one or merged. But as I said, raid replaces most of the kernel's mechanisms in that area (make_request, plug) to avoid losing ordering. I would be surprised if the raw device exhibited any buffering at all after getting rid of the generic kernel mechanisms. Any buffering you see would likely be happening at file system level (and be a darn nuisance). Reads from the device are likely to hit the kernel's existing buffers first, thus making them act as a "cache". > Secondly, many hardware raid controllers have really nasty behavior in > this scenario. Linux software raid has decent degradation in overload I wouldn't have said so! If there is any, it's sort of accidental. On memory starvation, the driver simply couldn't create and despatch component requests. Dunno what happens then. It won't run out of buffer head structs though, since it's pretty well serialised on those, per device, in order to maintain request order, and it has its own cache. > conditions but many popular raid controllers (dell perc/lsi logic sas > for example) become unpredictable and very bursty in sustained high > load conditions. 
Well, that's because they can't tell the linux memory manager to quit holding data destined for them in memory and let them have it NOW (a general problem .. how one gets feedback on the mm state, I don't know). Maybe one could .. one can control buffer aging pretty much per device nowadays. Perhaps one can set the limit to zero for buffer age in memory before being sent to the device. That would help. Also one can lower the bdflush limit at which the device goes sync. All that would help against bursty performance, but it would slow ordinary operation towards sync behaviour.

> As greg mentioned, I trust the linux kernel software raid much more than the black box hw controllers. Also, contrary to vast popular

Well, it's readable code. That's the basis for my comments!

> mythology, the 'overhead' of sw raid in most cases is zero except in very particular conditions.

It's certainly very small. It would be smaller still if we could avoid needing new buffers per device. Perhaps the dm multipathing allows that.

Peter
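
On a 2.6 kernel the writeback thresholds being gestured at here are exposed as vm sysctls; a sketch of lowering them so dirty data is pushed out sooner (the values shown are illustrative only, and these knobs are global rather than per-device):

    # current thresholds, as a percentage of RAM
    sysctl vm.dirty_background_ratio vm.dirty_ratio
    # start background writeback almost immediately, expire dirty pages after ~10s
    sysctl -w vm.dirty_background_ratio=1
    sysctl -w vm.dirty_expire_centisecs=1000
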
On Thu, 26 Jun 2008, Vivek Khera wrote: >> Anyone have experience with IDE, SATA, or SAS-connected flash devices like >> the Samsung MCBQE32G5MPP-0VA? I mean, it seems lovely - 32GB, at a transfer >> rate of 100MB/s, and doesn't degrade much in performance when writing small >> random blocks. But what's it actually like, and is it reliable? > > None of these manufacturers rates these drives for massive amounts of writes. > They're sold as suitable for laptop/desktop use, which normally is not a > heavy wear and tear operation like a DB. Once they claim suitability for > this purpose, be sure that I and a lot of others will dive into it to see how > well it really works. Until then, it will just be an expensive brick-making > experiment, I'm sure. It claims a MTBF of 2,000,000 hours, but no further reliability information seems forthcoming. I thought the idea that flash couldn't cope with many writes was no longer true these days? Matthew -- I work for an investment bank. I have dealt with code written by stock exchanges. I have seen how the computer systems that store your money are run. If I ever make a fortune, I will store it in gold bullion under my bed. -- Matthew Crosby
On Thu, Jun 26, 2008 at 10:14 AM, Matthew Wakeling <matthew@flymine.org> wrote:
> On Thu, 26 Jun 2008, Vivek Khera wrote:
>>> Anyone have experience with IDE, SATA, or SAS-connected flash devices like the Samsung MCBQE32G5MPP-0VA? I mean, it seems lovely - 32GB, at a transfer rate of 100MB/s, and doesn't degrade much in performance when writing small random blocks. But what's it actually like, and is it reliable?
>>
>> None of these manufacturers rates these drives for massive amounts of writes. They're sold as suitable for laptop/desktop use, which normally is not a heavy wear and tear operation like a DB. Once they claim suitability for this purpose, be sure that I and a lot of others will dive into it to see how well it really works. Until then, it will just be an expensive brick-making experiment, I'm sure.
>
> It claims a MTBF of 2,000,000 hours, but no further reliability information seems forthcoming. I thought the idea that flash couldn't cope with many writes was no longer true these days?

What's mainly happened is that a great increase in storage capacity has allowed flash-based devices to spread their writes out over so many cells that the time it takes to overwrite cells often enough to kill them is measured in much longer intervals. Instead of dying in weeks or months, they'll now die, for most workloads, in years or more.

However, I've tested a few less expensive solid state storage devices, and for some transactional loads they were much faster, but then for things like report queries scanning whole tables they were factors slower than a sw RAID-10 array of just 4 spinning disks. But pg_bench was quite snappy using the solid state storage for pg_xlog.
On Thu, Jun 26, 2008 at 9:49 AM, Peter T. Breuer <ptb@inv.it.uc3m.es> wrote:
> "Also sprach Merlin Moncure:"
>> As discussed down thread, software raid still gets the benefits of write-back caching on the raid controller...but there are a couple of
>
> (I wish I knew what write-back caching was!)

hardware raid controllers generally have some dedicated memory for caching. the controllers can be configured in one of two modes (the jargon is so common it's almost standard):

write back: raid controller can lie to the host o/s. when the o/s asks the controller to sync, the controller can hold the data in cache (for a time)

write through: raid controller can not lie. all sync requests must pass through to disk

The thinking is, the bbu on the controller can hold scheduled writes in memory (for a time) and replay them to disk when the server restarts in the event of a power failure. This is a reasonable compromise between data integrity and performance. 'write back' caching provides insane burst IOPS (because you are writing to controller cache) and somewhat improved sustained IOPS because the controller is reorganizing writes on the fly in (hopefully) optimal fashion.

> This imposes a considerable extra resource burden. It's a mystery to me
> However the lack of extra buffering is really deliberate (double buffering is a horrible thing in many ways, not least because of the
<snip>

completely unconvincing. the overhead of various cache layers is completely minute compared to a full fault to disk that requires a seek, which is several orders of magnitude slower. The linux software raid algorithms are highly optimized, and run on a presumably (much faster) cpu than what the controller supports. However, there is still some extra oomph you can get out of letting the raid controller do what the software raid can't...namely delay sync for a time.

merlin
On Thu, Jun 26, 2008 at 12:14 PM, Matthew Wakeling <matthew@flymine.org> wrote: >> None of these manufacturers rates these drives for massive amounts of >> writes. They're sold as suitable for laptop/desktop use, which normally is >> not a heavy wear and tear operation like a DB. Once they claim suitability >> for this purpose, be sure that I and a lot of others will dive into it to >> see how well it really works. Until then, it will just be an expensive >> brick-making experiment, I'm sure. > > It claims a MTBF of 2,000,000 hours, but no further reliability information > seems forthcoming. I thought the idea that flash couldn't cope with many > writes was no longer true these days? Flash and disks have completely different failure modes, and you can't do apples to apples MTBF comparisons. In addition there are many different types of flash (MLC/SLC) and the flash cells themselves can be organized in particular ways involving various trade-offs. The best flash drives combined with smart wear leveling are anecdotally believed to provide lifetimes that are good enough to warrant use in high duty server environments. The main issue is lousy random write performance that basically makes them useless for any kind of OLTP operation. There are a couple of software (hacks?) out there which may address this problem if the technology doesn't get there first. If the random write problem were solved, a single ssd would provide the equivalent of a stack of 15k disks in a raid 10. see: http://www.bigdbahead.com/?p=44 http://feedblog.org/2008/01/30/24-hours-with-an-ssd-and-mysql/ merlin
"Also sprach Merlin Moncure:" > write back: raid controller can lie to host o/s. when o/s asks This is not what the linux software raid controller does, then. It does not queue requests internally at all, nor ack requests that have not already been acked by the components (modulo the fact that one can deliberately choose to have a slow component not be sync by allowing "write-behind" on it, in which case the "controller" will ack the incoming request after one of the compionents has been serviced, without waiting for both). > integrity and performance. 'write back' caching provides insane burst > IOPS (because you are writing to controller cache) and somewhat > improved sustained IOPS because the controller is reorganizing writes > on the fly in (hopefully) optimal fashion. This is what is provided by Linux file system and (ordinary) block device driver subsystem. It is deliberately eschewed by the soft raid driver, because any caching will already have been done above and below the driver, either in the FS or in the components. > > However the lack of extra buffering is really deliberate (double > > buffering is a horrible thing in many ways, not least because of the > > <snip> > completely unconvincing. But true. Therefore the problem in attaining conviction must be at your end. Double buffering just doubles the resources dedicated to a single request, without doing anything for it! It doubles the frequency with which one runs out of resources, it doubles the frequency of the burst limit being reached. It's deadly (deadlockly :) in the situation where the receiving component device also needs resources in order to service the request, such as when the transport is network tcp (and I have my suspicions about scsi too). > the overhead of various cache layers is > completely minute compared to a full fault to disk that requires a > seek which is several orders of magnitude slower. That's aboslutely true when by "overhead" you mean "computation cycles" and absolutely false when by overhead you mean "memory resources", as I do. Double buffering is a killer. > The linux software raid algorithms are highly optimized, and run on a I can confidently tell you that that's balderdash both as a Linux author and as a software RAID linux author (check the attributions in the kernel source, or look up something like "Raiding the Noosphere" on google). > presumably (much faster) cpu than what the controller supports. > However, there is still some extra oomph you can get out of letting > the raid controller do what the software raid can't...namely delay > sync for a time. There are several design problems left in software raid in the linux kernel. One of them is the need for extra memory to dispatch requests with and as (i.e. buffer heads and buffers, both). bhs should be OK since the small cache per device won't be exceeded while the raid driver itself serialises requests, which is essentially the case (it does not do any buffering, queuing, whatever .. and tries hard to avoid doing so). The need for extra buffers for the data is a problem. On different platforms different aspects of that problem are important (would you believe that on ARM mere copying takes so much cpu time that one wants to avoid it at all costs, whereas on intel it's a forgettable trivium). I also wouldn't aboslutely swear that request ordering is maintained under ordinary circumstances. But of course we try. Peter
On Thu, Jun 26, 2008 at 1:03 AM, Peter T. Breuer <ptb@inv.it.uc3m.es> wrote:
> "Also sprach Merlin Moncure:"
>> write back: raid controller can lie to the host o/s. when the o/s asks
>
> This is not what the linux software raid controller does, then. It does not queue requests internally at all, nor ack requests that have not already been acked by the components (modulo the fact that one can deliberately choose to have a slow component not be sync by allowing "write-behind" on it, in which case the "controller" will ack the incoming request after one of the components has been serviced, without waiting for both).
>
>> integrity and performance. 'write back' caching provides insane burst IOPS (because you are writing to controller cache) and somewhat improved sustained IOPS because the controller is reorganizing writes on the fly in (hopefully) optimal fashion.
>
> This is what is provided by the Linux file system and (ordinary) block device driver subsystem. It is deliberately eschewed by the soft raid driver, because any caching will already have been done above and below the driver, either in the FS or in the components.
>
>>> However the lack of extra buffering is really deliberate (double buffering is a horrible thing in many ways, not least because of the
>> <snip>
>> completely unconvincing.
>
> But true. Therefore the problem in attaining conviction must be at your end. Double buffering just doubles the resources dedicated to a single request, without doing anything for it! It doubles the frequency with which one runs out of resources, it doubles the frequency of the burst limit being reached. It's deadly (deadlockly :) in the situation where

Only if those resources are drawn from the same pool. You are oversimplifying a calculation that has many variables such as cost. CPUs for example are introducing more cache levels (l1, l2, l3), etc. Also, the different levels of cache have different capabilities. Only the hardware controller cache is (optionally) allowed to delay acknowledgment of a sync. In postgresql terms, we get roughly the same effect with the computer's entire working memory with fsync disabled...so that we are trusting, rightly or wrongly, that all writes will eventually make it to disk. In this case, the raid controller cache is redundant and marginally useful.

> the receiving component device also needs resources in order to service the request, such as when the transport is network tcp (and I have my suspicions about scsi too).
>
>> the overhead of various cache layers is completely minute compared to a full fault to disk that requires a seek, which is several orders of magnitude slower.
>
> That's absolutely true when by "overhead" you mean "computation cycles" and absolutely false when by "overhead" you mean "memory resources", as I do. Double buffering is a killer.

Double buffering is most certainly _not_ a killer (or at least, _the_ killer) in practical terms. Most database systems that do any amount of writing (that is, interesting databases) are bound by the ability to randomly read and write to the storage medium, and only that. This is why raid controllers come with a relatively small amount of cache...there are diminishing returns from reorganizing writes. This is also why up and coming storage technologies (like flash) are so interesting. Disk drives have made only marginal improvements in speed since the early 80's.
>> The linux software raid algorithms are highly optimized, and run on a
>
> I can confidently tell you that that's balderdash both as a Linux author

I'm just saying here that there is little/no cpu overhead for using software raid on modern hardware.

> believe that on ARM mere copying takes so much cpu time that one wants to avoid it at all costs, whereas on intel it's a forgettable trivium).

This is a database list. The main area of interest is in dealing with server class hardware.

merlin
On Thu, 26 Jun 2008, Peter T. Breuer wrote:
> "Also sprach Merlin Moncure:"
>> The linux software raid algorithms are highly optimized, and run on a
>
> I can confidently tell you that that's balderdash both as a Linux author and as a software RAID linux author (check the attributions in the kernel source, or look up something like "Raiding the Noosphere" on google).
>
>> presumably (much faster) cpu than what the controller supports. However, there is still some extra oomph you can get out of letting the raid controller do what the software raid can't...namely delay sync for a time.
>
> There are several design problems left in software raid in the linux kernel. One of them is the need for extra memory to dispatch requests with and as (i.e. buffer heads and buffers, both). bhs should be OK since the small cache per device won't be exceeded while the raid driver itself serialises requests, which is essentially the case (it does not do any buffering, queuing, whatever .. and tries hard to avoid doing so). The need for extra buffers for the data is a problem. On different platforms different aspects of that problem are important (would you believe that on ARM mere copying takes so much cpu time that one wants to avoid it at all costs, whereas on intel it's a forgettable trivium).
>
> I also wouldn't absolutely swear that request ordering is maintained under ordinary circumstances.

Which flavor of linux raid are you talking about? (The two main families I am aware of are the md and dm ones.)

David Lang
On Thu, 26 Jun 2008, Peter T. Breuer wrote:
> Double buffering is a killer.

No, it isn't; it's a completely trivial bit of overhead. It only exists during the time when blocks are queued to write but haven't been written yet. On any database system, in those cases I/O congestion at the disk level (probably things backed up behind seeks) is going to block writes way before the memory used or the bit of CPU time making the extra copy becomes a factor on anything but minimal platforms.

You seem to know quite a bit about the RAID implementation, but you are a) extrapolating from that knowledge into areas of database performance you need to spend some more time researching first, and b) extrapolating based on results from trivial hardware, relative to what the average person on this list is running a database server on in 2008. The weakest platform I deploy PostgreSQL on and consider relevant today has two cores and 2GB of RAM, for a single-user development system that only has to handle a small amount of data relative to what the real servers handle. If you note the kind of hardware people ask about here, that's pretty typical.

You have some theories here; Merlin and I have positions that come from running benchmarks, and watching theories suffer a brutal smack-down from the real world is one of those things that happens every day. There is absolutely some overhead from paths through the Linux software RAID that consume resources. But you can't even measure that in database-oriented comparisons against hardware setups that don't use those resources, which means that for practical purposes the overhead doesn't exist in this context.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Wednesday 25 June 2008 11:24:23 Greg Smith wrote:
> What I often do is get a hardware RAID controller, just to accelerate disk writes, but configure it in JBOD mode and use Linux or other software RAID on that platform.

JBOD + RAIDZ2 FTW ;-)

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
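
For the curious, a ZFS double-parity pool over JBOD discs is a one-liner; the pool name and device names are examples:

    # six-disc raidz2 pool (survives any two disc failures)
    zpool create tank raidz2 da0 da1 da2 da3 da4 da5
    zpool status tank
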
On Thu, 26 Jun 2008, Merlin Moncure wrote:
> In addition there are many different types of flash (MLC/SLC) and the flash cells themselves can be organized in particular ways involving various trade-offs.

Yeah, I wouldn't go for MLC, given it has a tenth the lifespan of SLC.

> The main issue is lousy random write performance that basically makes them useless for any kind of OLTP operation.

For the mentioned device, they claim a sequential read speed of 100MB/s, sequential write speed of 80MB/s, random read speed of 80MB/s and random write speed of 30MB/s. This is *much* better than figures quoted for many other devices, but of course unless they publish the block size they used for the random speed tests, the figures are completely useless.

Matthew

--
sed -e '/^[when][coders]/!d;/^...[discover].$/d;/^..[real].[code]$/!d ' <`locate dict/words`
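
One way to take the block size out of the vendor's hands is to measure it yourself; fio makes the geometry explicit. A sketch, with the file path, size and runtime as placeholders:

    # 8kB random writes with plain synchronous I/O, bypassing the page cache
    fio --name=randwrite-8k --filename=/mnt/ssd/fio.tmp --size=1g \
        --rw=randwrite --bs=8k --ioengine=sync --direct=1 --runtime=60
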
On Fri, Jun 27, 2008 at 7:00 AM, Matthew Wakeling <matthew@flymine.org> wrote:
> On Thu, 26 Jun 2008, Merlin Moncure wrote:
>> In addition there are many different types of flash (MLC/SLC) and the flash cells themselves can be organized in particular ways involving various trade-offs.
>
> Yeah, I wouldn't go for MLC, given it has a tenth the lifespan of SLC.
>
>> The main issue is lousy random write performance that basically makes them useless for any kind of OLTP operation.
>
> For the mentioned device, they claim a sequential read speed of 100MB/s, sequential write speed of 80MB/s, random read speed of 80MB/s and random write speed of 30MB/s. This is *much* better than figures quoted for many other devices, but of course unless they publish the block size they used for the random speed tests, the figures are completely useless.

right. not likely completely truthful. here's why: a 15k drive can deliver around 200 seeks/sec (under worst-case conditions that translates to 1-2mb/sec with an 8k block size). 30mb/sec of random writes would then be roughly equivalent to around 40 15k drives configured in a raid 10. Of course, I'm assuming the block size :-). Unless there were some other mitigating factors (lifetime, etc), this would demonstrate that flash ssd would crush disks in any reasonable cost/performance metric. It's probably not so cut and dried, otherwise we'd be hearing more about them (pure speculation on my part).

merlin