Thread: SSD performance
I spotted an interesting new SSD review. It's a $379 5.25" drive bay device that holds up to 8 DDR2 DIMMs (up to 8G per DIMM) and appears to the system as a SATA drive (or a pair of SATA drives that you can RAID-0 to get past the 300MB/s SATA bottleneck).

The best review I've seen only ran it on Windows (and a relatively old hardware platform at that); I suspect its performance would be even better under Linux and with a top-notch controller card (especially with the RAID option).

It has a battery backup (good for 4 hours or so) and a CF card slot that it can back the RAM up to (~20 min to save 32G and ~15 min to restore, so not something you really want to make regular use of, but a good safety net).

The review also includes the Intel X-25E and X-25M drives (along with a variety of SCSI and SATA drives):

http://techreport.com/articles.x/16255/1

Equipped with 16G the street price should be ~$550, with 32G ~$1200, and with 64G even more expensive, but the performance is very good. There are times when the X-25E matches it or edges it out in these tests, so there is room for additional improvement, but as I noted above it may do better with a better controller and a non-Windows OS. Power consumption is slightly higher than normal hard drives at about 12W (_much_ higher than the X-25).

They also have a review of the X-25E vs the X-25M:

http://techreport.com/articles.x/15931/1

One thing that both of these reviews show is that if you are doing a significant amount of writing, the X-25M is no better than a normal hard drive (and much of the time in the middle to bottom of the pack compared to normal hard drives).

David Lang
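For anyone wanting to try the dual-SATA-port trick under Linux, a minimal software RAID-0 sketch with mdadm; the /dev/sdb, /dev/sdc and /mnt/ramdrive names are assumptions for illustration, not anything from the review:

  # Stripe the device's two SATA interfaces together (device names are hypothetical)
  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
  # Put a filesystem on the stripe and mount it like any other drive
  mkfs -t xfs /dev/md0
  mkdir -p /mnt/ramdrive
  mount /dev/md0 /mnt/ramdrive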
> I spotted an interesting new SSD review. It's a $379
> 5.25" drive bay device that holds up to 8 DDR2 DIMMs
> (up to 8G per DIMM) and appears to the system as a SATA
> drive (or a pair of SATA drives that you can RAID-0 to get
> past the 300MB/s SATA bottleneck)

Sounds very similar to the Gigabyte i-RAM drives of a few years ago:

http://en.wikipedia.org/wiki/I-RAM
On Fri, 23 Jan 2009, Glyn Astill wrote:

>> I spotted an interesting new SSD review. It's a $379
>> 5.25" drive bay device that holds up to 8 DDR2 DIMMs
>> (up to 8G per DIMM) and appears to the system as a SATA
>> drive (or a pair of SATA drives that you can RAID-0 to get
>> past the 300MB/s SATA bottleneck)
>
> Sounds very similar to the Gigabyte i-RAM drives of a few years ago
>
> http://en.wikipedia.org/wiki/I-RAM

Similar concept, but there are some significant differences.

The i-RAM was limited to 4G, used DDR RAM, and used a PCI slot for power (which can be in short supply nowadays).

This new drive can go to 64G, uses DDR2 RAM (cheaper than DDR nowadays), gets powered like a normal SATA drive, can use two SATA channels (to be able to get past the throughput limits of a single SATA interface), and has a CF card slot to back the data up to if the system powers down.

Plus the performance appears to be significantly better (even without using the second SATA interface).

David Lang
Why not simply plug your server into a UPS and get 10-20x the performance using the same approach (with OS IO cache)?

In fact, with the server it's more robust, as you don't have to transit several intervening physical devices to get to the RAM.

If you want a file interface, declare a RAMDISK.

Cheaper/faster/improved reliability.

- Luke
On Fri, 23 Jan 2009, Luke Lonergan wrote:

> Why not simply plug your server into a UPS and get 10-20x the
> performance using the same approach (with OS IO cache)?
>
> In fact, with the server it's more robust, as you don't have to transit
> several intervening physical devices to get to the RAM.
>
> If you want a file interface, declare a RAMDISK.
>
> Cheaper/faster/improved reliability.

You can also disable fsync to avoid waiting for your disks, if you trust your system to never go down. Personally, I don't trust any system to not go down.

If you have a system crash or reboot, your RAMDISK will lose its contents; this device won't.

Also, you are limited by how many DIMMs you can put on your motherboard (for the dual-socket systems I am buying nowadays, I'm limited to 32G of RAM), and going to a different motherboard that can support additional RAM can be quite expensive.

This isn't for everyone, but for people who need the performance and data reliability, this looks like a very interesting option.

David Lang
On Fri, 23 Jan 2009, Luke Lonergan wrote:

> Why not simply plug your server into a UPS and get 10-20x the
> performance using the same approach (with OS IO cache)?
>
> In fact, with the server it's more robust, as you don't have to transit
> several intervening physical devices to get to the RAM.
>
> If you want a file interface, declare a RAMDISK.
>
> Cheaper/faster/improved reliability.

I'm sure we have gone over that one before. With that method, your data is at the mercy of the *entire system*. Any fault in any part of the computer (hardware or software) will result in the loss of all your data. In contrast, a RAM-based SSD is isolated from such failures, especially if it backs up to another device on power fail. You can completely trash the computer, remove the SSD and put it into another machine, and boot it up as normal.

Computers break. Nothing is going to stop that from happening. Except VMS maybe.

Not arguing that your method is faster, though.

Matthew

--
"Finger to spiritual emptiness underlying everything."
-- How a foreign C manual referred to a "pointer to void."
Hmm - I wonder what OS it runs ;-)

- Luke
Luke Lonergan wrote:
> Why not simply plug your server into a UPS and get 10-20x the performance using the same approach (with OS IO cache)?

A big reason is that your machine may already have as much RAM as is currently economical to install. Hardware with LOTS of RAM slots can cost quite a bit.

Another reason is that these devices won't lose data because of an unexpected OS reboot. If they're fitted with a battery backup and CF media for emergency write-out, they won't lose data if your UPS runs out of juice either.

I'd be much more confident with something like those devices than I would with an OS ramdisk plus startup/shutdown scripts to initialize it from a file and write it out to a file. Wouldn't it be a pain if the UPS didn't give the OS enough warning to write the RAM disk out before losing power...

In any case, you're very rarely better off dedicating host memory to a ramdisk rather than using the normal file system and letting the host cache it. A ramdisk really only seems to help when you're using it to bypass safeties like the effects of fsync() and ordered journaling. There are other ways to avoid those if you really don't care about your data.

These devices would be interesting for a few uses, IMO. One is temp table space and sort space in Pg. Another is scratch space for apps (like Photoshop) that do their own VM management. There's also potential for use as first-priority OS swap space, though at least on Linux I think the CPU overhead involved in swapping is so awful you wouldn't benefit from it much.

I've been hoping this sort of thing would turn up again in a new incarnation with battery backup and CF/SD backup for when the battery goes flat.

--
Craig Ringer
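As a concrete illustration of the temp-space idea, a minimal sketch of pointing PostgreSQL (8.3+, which has temp_tablespaces) at such a device; the /mnt/ramdrive mount point and /var/lib/pgsql/data data directory are assumptions, not anything from this thread:

  # A directory on the RAM-backed device, owned by the postgres user
  mkdir -p /mnt/ramdrive/pg_temp
  chown postgres:postgres /mnt/ramdrive/pg_temp

  # Register it as a tablespace and make it the default for temp tables and sort spill files
  psql -U postgres -c "CREATE TABLESPACE fast_temp LOCATION '/mnt/ramdrive/pg_temp'"
  echo "temp_tablespaces = 'fast_temp'" >> /var/lib/pgsql/data/postgresql.conf
  pg_ctl reload -D /var/lib/pgsql/data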
* Craig Ringer:

> I'd be much more confident with something like those devices than I
> would with an OS ramdisk plus startup/shutdown scripts to initialize it
> from a file and write it out to a file. Wouldn't it be a pain if the UPS
> didn't give the OS enough warning to write the RAM disk out before
> losing power...

The cache warm-up time can also be quite annoying. Of course, with flash-backed DRAM, this is a concern as long as you use the cheaper, slower variants for the backing storage.

--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
On 1/23/09, david@lang.hm <david@lang.hm> wrote: > the review also includes the Intel X-25E and X-25M drives (along with a > variety of SCSI and SATA drives) > The x25-e is a game changer for database storage. It's still a little pricey for what it does but who can argue with these numbers? http://techreport.com/articles.x/15931/9 merlin
On Fri, 23 Jan 2009, Merlin Moncure wrote:

> On 1/23/09, david@lang.hm <david@lang.hm> wrote:
>> the review also includes the Intel X-25E and X-25M drives (along with a
>> variety of SCSI and SATA drives)
>
> The x25-e is a game changer for database storage. It's still a little
> pricey for what it does but who can argue with these numbers?
> http://techreport.com/articles.x/15931/9

Take a look at this RAM-based drive, and specifically look at the numbers here:

http://techreport.com/articles.x/16255/9

They are about as far above the X25-E as the X25-E is above normal drives.

David Lang
david@lang.hm wrote:

> this isn't for everyone, but for people who need the performance and data
> reliability, this looks like a very interesting option.

Can I call a time out here? :)

There are "always" going to be memory hierarchies -- registers on the processors, multiple levels of caches, RAM used for programs / data / I/O caches, and non-volatile rotating magnetic storage. And there are "always" going to be new hardware technologies cropping up at various levels in the hierarchy. There are always going to be cost / reliability / performance trade-offs, leading to "interesting" though perhaps not really business-relevant "optimizations".

The equations are there for anyone to use should they want to optimize for a given workload at a given point in time with given business / service level constraints. See http://www.amazon.com/Storage-Network-Performance-Analysis-Huseyin/dp/076451685X for all the details.

I question, however, whether there's much point in seeking an optimum.
As was noted long ago by Nobel laureate Herbert Simon, in actual fact managers / businesses rarely optimize. Instead, they satisfice. They do what is "good enough", not what is best. And my own personal opinion in the current context -- PostgreSQL running on an open-source operating system -- is that

* large-capacity inexpensive rotating disks,
* a hardware RAID controller containing a battery-backed cache,
* as much RAM as one can afford and the chassis will hold, and
* enough cores to keep the workload from becoming processor-bound

are good enough. And given that, a moderate amount of software tweaking and balancing will get you close to a local optimum.

--
M. Edward (Ed) Borasky

I've never met a happy clam. In fact, most of them were pretty steamed.
On Fri, 2009-01-23 at 09:22 -0800, M. Edward (Ed) Borasky wrote:

> I question, however, whether there's much point in seeking an optimum.
> As was noted long ago by Nobel laureate Herbert Simon, in actual fact
> managers / businesses rarely optimize. Instead, they satisfice. They do
> what is "good enough", not what is best. And my own personal opinion in
> the current context -- PostgreSQL running on an open-source operating
> system -- is that

This community is notorious for "optimum". MySQL is notorious for "satisfy". Which one would you rather store your financial information in?

I actually agree with you to a degree. A loud faction of this community spends a little too much time mentally masturbating, but without that we wouldn't have a lot of the very interesting features we have now.

There is no correct in left.
There is no correct in right.
Correctness is the result of friction caused by the mingling of the two.

Sincerely,

Joshua D. Drake

--
PostgreSQL - XMPP: jdrake@jabber.postgresql.org
  Consulting, Development, Support, Training
  503-667-4564 - http://www.commandprompt.com/
  The PostgreSQL Company, serving since 1997
On Fri, 23 Jan 2009, M. Edward (Ed) Borasky wrote:

> * large-capacity inexpensive rotating disks,
> * a hardware RAID controller containing a battery-backed cache,
> * as much RAM as one can afford and the chassis will hold, and
> * enough cores to keep the workload from becoming processor-bound
>
> are good enough. And given that, a moderate amount of software tweaking
> and balancing will get you close to a local optimum.

That's certainly the case for very large-scale (in terms of data quantity) databases. However, these solid state devices do have quite an advantage when what you want to scale is the performance rather than the data quantity.

The thing is, it isn't just a matter of the storage hierarchy. There's the volatility matter there as well. What you have in these SSDs is a device which is non-volatile, like a disc, but fast, like RAM.

Matthew

--
Anyone who goes to a psychiatrist ought to have his head examined.
Joshua D. Drake wrote:

> This community is notorious for "optimum". MySQL is notorious for "satisfy".

Within *this* community, MySQL is just plain notorious. Let's face it -- we are *not* dolphin-safe. <ducking>

> Which one would you rather store your financial information in?

The one that had the best data integrity, taking into account the RDBMS *and* the hardware and other software.

> I actually agree with you to a degree. A loud faction of this community
> spends a little too much time mentally masturbating but without that we
> wouldn't have a lot of the very interesting features we have now.

Yes -- you will never hear *me* say "Premature optimization is the root of all evil." I don't know why Hoare or Dijkstra or Knuth or Wirth or whoever coined that phrase, but it's been used too many times as an excuse for not doing any performance engineering, forcing the deployed "solution" to throw hardware at performance issues.

> There is no correct in left.
> There is no correct in right.
> Correctness is the result of friction caused by the mingling of the two.

"The only good I/O is a dead I/O" -- Mark Friedman

--
M. Edward (Ed) Borasky

I've never met a happy clam. In fact, most of them were pretty steamed.
On Fri, 23 Jan 2009, david@lang.hm wrote:

> take a look at this RAM-based drive, and specifically look at the numbers here:
> http://techreport.com/articles.x/16255/9
> They are about as far above the X25-E as the X25-E is above normal drives.

They're so close to having a killer product with that one. All they need to do is make the backup to the CF card automatic once the battery backup power drops low (but not so low there's not enough power to do said backup) and it would actually be a reasonable solution. The whole battery-backed cache approach is risky enough when the battery is expected to last a day or two; with this product only giving 4 hours, it's not hard to imagine situations where you'd lose everything on there.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
On Sun, 25 Jan 2009, Greg Smith wrote:

> On Fri, 23 Jan 2009, david@lang.hm wrote:
>
>> take a look at this RAM-based drive, and specifically look at the numbers here:
>> http://techreport.com/articles.x/16255/9
>> They are about as far above the X25-E as the X25-E is above normal drives.
>
> They're so close to having a killer product with that one. All they need to
> do is make the backup to the CF card automatic once the battery backup power
> drops low (but not so low there's not enough power to do said backup) and it
> would actually be a reasonable solution. The whole battery-backed cache
> approach is risky enough when the battery is expected to last a day or two;
> with this product only giving 4 hours, it's not hard to imagine situations
> where you'd lose everything on there.

They currently have it do a backup immediately on power loss (which is a safe choice, as the contents won't be changing without power), but it then powers off (which is not good for startup time afterwards).

David Lang
david@lang.hm writes:

> they currently have it do a backup immediately on power loss (which is a safe
> choice, as the contents won't be changing without power), but it then powers
> off (which is not good for startup time afterwards)

So if you have a situation where it's power cycling rapidly, each iteration drains the battery by the time it takes to save the state but only charges it for the time the power is on. I wonder how many iterations that gives you.

--
Gregory Stark
EnterpriseDB  http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
On Sun, 25 Jan 2009, Gregory Stark wrote:

> david@lang.hm writes:
>
>> they currently have it do a backup immediately on power loss (which is a safe
>> choice, as the contents won't be changing without power), but it then powers
>> off (which is not good for startup time afterwards)
>
> So if you have a situation where it's power cycling rapidly, each iteration
> drains the battery by the time it takes to save the state but only charges it
> for the time the power is on. I wonder how many iterations that gives you.

Good question. Assuming that it's smart enough not to start a save if it didn't finish doing a restore, and going from the timings in the article (~20 min save, ~15 min load and 4 hour battery life), you would get ~12 cycles from the initial battery, plus whatever you could get from the battery charging (~3 hours of mains power during those initial cycles). If the battery could be fully charged in 3 hours it could keep doing this indefinitely; if it takes 6 hours it would get a half charge each time, so 12+6+3+1 = 22 cycles. But even the initial 12 cycles is long enough that you should probably be taking action by then.

In most situations you are going to have a UPS on your system anyway, and it will have the same type of problem (but usually with _much_ less than 4 hours worth of operation to start with). So while you could lose data from intermittent power, I think you would be far more likely to lose data due to a defective battery or the CF card not being fully seated or something like that.

David Lang
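A quick back-of-the-envelope check of those cycle numbers, using only the ~20 min save, ~15 min restore and 4 hour battery figures from the review:

  # saves possible on one full 4-hour (240 min) battery, at ~20 min per save
  echo $(( 240 / 20 ))      # 12
  # each save is preceded by a ~15 min restore on mains power,
  # so charging time accumulated over those 12 cycles:
  echo $(( 12 * 15 / 60 ))  # ~3 hours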
Craig Ringer wrote:

> These devices would be interesting for a few uses, IMO. One is temp
> table space and sort space in Pg. Another is scratch space for apps
> (like Photoshop) that do their own VM management. There's also potential

Surely temp tables and sort space aren't subject to fsync and won't gain that much, since they should stay in the OS cache? The device will surely help seek- or sync-bound tasks.

Doesn't that make it a good candidate for WAL and hot tables?

James
On Tue, 27 Jan 2009, James Mansion wrote:

> Craig Ringer wrote:
>> These devices would be interesting for a few uses, IMO. One is temp
>> table space and sort space in Pg. Another is scratch space for apps
>> (like Photoshop) that do their own VM management. There's also potential
>
> Surely temp tables and sort space aren't subject to fsync and won't gain
> that much, since they should stay in the OS cache? The device will surely
> help seek- or sync-bound tasks.
>
> Doesn't that make it a good candidate for WAL and hot tables?

It doesn't just gain on fsync speed, but also raw transfer speed. If everything stays in the OS buffers then you are right, but when you start to exceed those buffers is when fast storage like this is very useful.

David Lang
On 1/23/09 3:35 AM, "david@lang.hm" <david@lang.hm> wrote:
http://techreport.com/articles.x/15931/1
one thing that both of these reviews show is that if you are doing a
significant amount of writing the X-25M is no better than a normal hard
drive (and much of the time in the middle to bottom of the pack compared
to normal hard drives)
David Lang
The X25-M may not have write STR (sustained transfer) rates that high compared to normal disks, but for write latency it is FAR superior to a normal disk, and for random writes it will demolish most small and medium sized RAID arrays by itself. It will push 30MB to 60MB/sec of random 8k writes, or ~2000 to 12000 8k fsyncs/sec. The -E is definitely a lot better, but the -M can get you pretty far.
For any Postgres installation where you don't expect to write to the WAL at more than 30MB/sec (the vast majority), it is good enough to use (mirrored) as a WAL device, without a battery backup, with very good performance. A normal disk cannot do that.
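A minimal sketch of putting the WAL on such a mirrored pair on an 8.3-era install; the /mnt/ssd_wal mount point and /var/lib/pgsql/data data directory are assumptions, not anything from this thread:

  # Stop the cluster, relocate pg_xlog to the SSD mirror, and symlink it back
  pg_ctl stop -D /var/lib/pgsql/data
  mv /var/lib/pgsql/data/pg_xlog /mnt/ssd_wal/pg_xlog
  ln -s /mnt/ssd_wal/pg_xlog /var/lib/pgsql/data/pg_xlog
  pg_ctl start -D /var/lib/pgsql/data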
Also, it can be used very well for the OS swap, and some other temp space to prevent swap storms from severely impacting the system.
For anyone worried about the X25-M's ability to withstand lots of write cycles ... Calculate how long it would take you to write 800TB to the drive at a typical rate. For most use cases that's going to be > 5 years. For the 160GB version, it will take 2x as much data and time to wear it down.
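For a rough feel of that claim, a back-of-the-envelope calculation; the 5MB/sec average write rate is an assumed "typical" figure for illustration, not something from the thread:

  # 800 TB of total writes at an average of 5 MB/sec, around the clock
  echo "scale=1; 800 * 1024 * 1024 / 5 / 3600 / 24 / 365" | bc   # ~5.3 years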
Samsung, SanDisk, Toshiba, Micron, and several others are expected to ship next-gen SSDs with low random write latency this year. A few of these are claiming > 150MB/sec for writes, even for MLC-based drives.
A RAM-based device is intriguing, but an ordinary SSD will be enough to make most Postgres databases CPU-bound, and with those there is no concern about data loss on power failure. The Intel X25 series does not even use the RAM on it for write cache! (It uses some SRAM on the controller chip for that, and it's fsync-safe.) The RAM is working memory for the controller chip to cache the LBA-to-physical-flash-block mappings and other data needed for wear leveling, contrary to what many reviews may claim.
I somehow managed to convince the powers that be to let me get a couple of X25-Es. I tossed them in my Mac Pro (8 cores), fired up Ubuntu 8.10 and did some testing.

Raw numbers are very impressive. I was able to get 3700 random seek+reads a second. In a RAID-1 config it stayed at 3700, but if I added another process it went up to 7000, and eventually settled into the 4000s. If I added in some random writing with fsyncs to it, it settled at 2200 (to be specific, I had 3 instances going - 2 read-only and 1 read/20% write to get that). These numbers were obtained running a slightly modified version of pgiosim (which is on pgfoundry) - it randomly seeks to a "block" in a file and reads 8kB of data, optionally writing the block back out.

Now, moving into reality, I compiled 8.3.latest and gave it a whirl. Running against a software RAID-1 of the 2 X25-Es I got the following pgbench results (note config tweaks: work_mem => 4MB, shared_buffers => 1GB; should probably have tweaked checkpoint_segments, as it was emitting lots of notices about that, but I didn't).

(multiple runs, avg tps)

Scalefactor 50, 10 clients: 1700tps

At that point I realized write caching on the drives was ON. So I turned it off at this point:

Scalefactor 50, 10 clients: 900tps

At scalefactor 50 the dataset fits well within memory, so I scaled it up.

Scalefactor 1500, 10 clients: 420tps

While some of us have arrays that can smash those numbers, that is crazy impressive for a plain old mirror pair. I also did not do much tweaking of PG itself.

While I'm in the testing mood, are there some other tests folks would like me to try out?

--
Jeff Trout <jeff@jefftrout.com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/
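For anyone wanting to reproduce runs like these, a minimal sketch of the pgbench invocations; the database name and transaction count are assumptions, while the scale factors and client count are the ones above:

  # Initialize a pgbench database at scale factor 50 (roughly 750MB of data)
  createdb pgbench
  pgbench -i -s 50 pgbench
  # Run with 10 concurrent clients; average the tps over multiple runs
  pgbench -c 10 -t 10000 pgbench

  # Repeat at a scale factor that blows well past RAM
  pgbench -i -s 1500 pgbench
  pgbench -c 10 -t 10000 pgbench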
On Tue, Feb 3, 2009 at 9:54 AM, Jeff <threshar@torgo.978.org> wrote:

> Scalefactor 50, 10 clients: 900tps
>
> At scalefactor 50 the dataset fits well within memory, so I scaled it up.
>
> Scalefactor 1500, 10 clients: 420tps
>
> While some of us have arrays that can smash those numbers, that is crazy
> impressive for a plain old mirror pair. I also did not do much tweaking of
> PG itself.
>
> While I'm in the testing mood, are there some other tests folks would like
> me to try out?

How do the same benchmarks fare on regular rotating discs on the same system? Ideally we'd have numbers for 7.2k and 10k disks to give us some sort of idea of exactly how much faster we're talking here.

Hey, since you asked, right? ;-)

-Dave
I don’t think write caching on the disks is a risk to data integrity if you are configured correctly.
Furthermore, these drives don't use the RAM for write cache; they only use a bit of SRAM on the controller chip for that (and they respect fsync), so write caching should be fine.
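To check or flip the volatile write cache setting Jeff toggled above (the 1700 vs 900 tps difference), something along these lines; the /dev/sdb device name is hypothetical:

  # Show the drive's current write-cache setting
  hdparm -W /dev/sdb
  # Disable the volatile write cache (0), or re-enable it (1)
  hdparm -W 0 /dev/sdb
  hdparm -W 1 /dev/sdb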
Confirm that NCQ is on (a quick check in dmesg); I have seen degraded performance when the wrong SATA driver is in use on some Linux configs, but your results indicate it's probably fine.
How much RAM is in that machine?
Some suggested tests if you are looking for more things to try :D
-- What effect does the following tuning have:
Turn the I/O scheduler to 'noop' (echo noop > /sys/block/<device>/queue/scheduler). I'm assuming the current one was cfq; deadline may also be interesting; anticipatory would have comically horrible results.
Tune upward the readahead value (blockdev --setra <value> /dev/<device>) -- try 16384 (8MB). This probably won't help that much for a pgbench tune; it's more for large sequential scans in other workload types, and more important for rotating media.
Generally speaking, with SSDs, tuning the above values does less than with hard drives.
File system effects would also be interesting. If you’re in need of more tests to try, compare XFS to EXT3 (I am assuming the below is ext3).
On 2/3/09 9:54 AM, "Jeff" <threshar@torgo.978.org> wrote:
I somehow managed to convince the powers that be to let me get a
couple X25-E's.
I tossed them in my macpro (8 cores), fired up Ubuntu 8.10 and did
some testing.
Raw numbers are very impressive. I was able to get 3700 random seek
+read's a second. In a R1 config it stayed at 3700, but if I added
another process it went up to 7000, and eventually settled into the
4000s. If I added in some random writing with fsyncs to it, it
settled at 2200 (to be specific, I had 3 instances going - 2 read-only
and 1 read-20% write to get that). These numbers were obtained
running a slightly modified version of pgiosim (which is on
pgfoundry) - it randomly seeks to a "block" in a file and reads 8kB
of data, optionally writing the block back out.
Now, moving into reality I compiled 8.3.latest and gave it a whirl.
Running against a software R1 of the 2 x25-e's I got the following
pgbench results:
(note config tweaks: work_mem=>4mb, shared_buffers=>1gb, should
probably have tweaked checkpoint_segs, as it was emitting lots of
notices about that, but I didn't).
(multiple runs, avg tps)
Scalefactor 50, 10 clients: 1700tps
At that point I realized write caching on the drives was ON. So I
turned it off at this point:
Scalefactor 50, 10 clients: 900tps
At scalefactor 50 the dataset fits well within memory, so I scaled it
up.
Scalefactor 1500: 10 clients: 420tps
While some of us have arrays that can smash those numbers, that is
crazy impressive for a plain old mirror pair. I also did not do much
tweaking of PG itself.
While I'm in the testing mood, are there some other tests folks would
like me to try out?
--
Jeff Trout <jeff@jefftrout.com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/
On Tue, Feb 3, 2009 at 10:54 AM, Jeff <threshar@torgo.978.org> wrote:

> Now, moving into reality, I compiled 8.3.latest and gave it a whirl. Running
> against a software RAID-1 of the 2 X25-Es I got the following pgbench results
> (note config tweaks: work_mem => 4MB, shared_buffers => 1GB; should probably
> have tweaked checkpoint_segments, as it was emitting lots of notices about
> that, but I didn't).

You may find you get better numbers with a lower shared_buffers value, and definitely try cranking up the number of checkpoint segments to something in the 50 to 100 range.

> (multiple runs, avg tps)
>
> Scalefactor 50, 10 clients: 1700tps
>
> At that point I realized write caching on the drives was ON. So I turned it
> off at this point:
>
> Scalefactor 50, 10 clients: 900tps
>
> At scalefactor 50 the dataset fits well within memory, so I scaled it up.
>
> Scalefactor 1500, 10 clients: 420tps
>
> While some of us have arrays that can smash those numbers, that is crazy
> impressive for a plain old mirror pair. I also did not do much tweaking of
> PG itself.

On a scale factor of 100, my 12-disk 15k.5 Seagate SAS drives on an Areca get somewhere in the 2800 to 3200 tps range on sustained tests for anywhere from 8 to 32 or so concurrent clients. I get similar performance falloffs as I increase the test db scaling factor. But for a pair of disks in a mirror with no caching controller, that's impressive. I've already told my boss our next servers will likely have Intel's SSDs in them.

> While I'm in the testing mood, are there some other tests folks would like
> me to try out?

How about varying the number of clients with a static scalefactor?

--
When fascism comes to America, it will be the intolerant selling fascism as diversity.
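A minimal sketch of those two postgresql.conf tweaks; the specific values are illustrative and the data directory path is an assumption:

  # Append the settings (the last occurrence in the file wins) and restart;
  # shared_buffers changes need a full restart, not just a reload
  echo "shared_buffers = 512MB" >> /var/lib/pgsql/data/postgresql.conf
  echo "checkpoint_segments = 64" >> /var/lib/pgsql/data/postgresql.conf
  pg_ctl restart -D /var/lib/pgsql/data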
On Wed, 4 Feb 2009, Jeff wrote:

> On Feb 3, 2009, at 1:43 PM, Scott Carey wrote:
>
>> I don't think write caching on the disks is a risk to data integrity if you
>> are configured correctly.
>> Furthermore, these drives don't use the RAM for write cache; they only use
>> a bit of SRAM on the controller chip for that (and they respect fsync), so
>> write caching should be fine.
>>
>> Confirm that NCQ is on (a quick check in dmesg); I have seen degraded
>> performance when the wrong SATA driver is in use on some Linux configs, but
>> your results indicate it's probably fine.
>
> As it turns out, there's a bug/problem/something with the controller in the
> Mac Pro vs the Ubuntu drivers where the controller goes into "works, but not
> as super as it could" mode, so NCQ is effectively disabled; haven't seen a
> workaround yet. Not sure if this problem exists on other distros (used Ubuntu
> because I just wanted to try a live CD). I read some stuff from Intel on NCQ,
> and in a lot of cases it won't make that much difference because the thing
> can respond so fast.

Actually, what I've heard is that NCQ is a win on the Intel drives because it avoids having the drive wait while the OS prepares and sends the next write.

>> Some suggested tests if you are looking for more things to try :D
>> -- What effect does the following tuning have:
>>
>> Turn the I/O scheduler to 'noop' (echo noop >
>> /sys/block/<device>/queue/scheduler). I'm assuming the current one was cfq;
>> deadline may also be interesting; anticipatory would have comically
>> horrible results.
>
> I only tested noop; if you think about it, it is the most logical one, as an
> SSD really does not need an elevator at all. There is no rotational latency
> or moving of the arm that the elevator was designed to cope with.

You would think so, but that isn't necessarily the case. Here's a post where noop lost to CFQ by ~24% when there were multiple processes competing for the drive (not on Intel drives):

http://www.alphatek.info/2009/02/02/io-scheduler-and-ssd-part-2/

David Lang
On Feb 3, 2009, at 1:43 PM, Scott Carey wrote:

> I don't think write caching on the disks is a risk to data integrity
> if you are configured correctly.
> Furthermore, these drives don't use the RAM for write cache; they
> only use a bit of SRAM on the controller chip for that (and they
> respect fsync), so write caching should be fine.
>
> Confirm that NCQ is on (a quick check in dmesg); I have seen
> degraded performance when the wrong SATA driver is in use on some
> Linux configs, but your results indicate it's probably fine.

As it turns out, there's a bug/problem/something with the controller in the Mac Pro vs the Ubuntu drivers where the controller goes into "works, but not as super as it could" mode, so NCQ is effectively disabled; haven't seen a workaround yet. Not sure if this problem exists on other distros (used Ubuntu because I just wanted to try a live CD). I read some stuff from Intel on NCQ, and in a lot of cases it won't make that much difference because the thing can respond so fast.

> How much RAM is in that machine?

8GB

> Some suggested tests if you are looking for more things to try :D
> -- What effect does the following tuning have:
>
> Turn the I/O scheduler to 'noop' (echo noop >
> /sys/block/<device>/queue/scheduler). I'm assuming the current one
> was cfq; deadline may also be interesting; anticipatory would have
> comically horrible results.

I only tested noop; if you think about it, it is the most logical one, as an SSD really does not need an elevator at all. There is no rotational latency or moving of the arm that the elevator was designed to cope with.

But here are the results:

scale 50, 100 clients, 10x txns: 1600tps (a noticeable improvement!)
scale 1500, 100 clients, 10x txns: 434tps

I'm going to try to get some results for Raptors, but there was another post earlier today that got higher (though not ridiculously higher) tps, and it required 14 15k disks instead of 2.

> Tune upward the readahead value (blockdev --setra <value>
> /dev/<device>) -- try 16384 (8MB). This probably won't help that much
> for a pgbench tune; it's more for large sequential scans in other
> workload types, and more important for rotating media.
> Generally speaking, with SSDs, tuning the above values does less
> than with hard drives.

Yeah, I don't think RA will help pgbench, and for my workloads it is rather useless as they tend to be tons of random IO.

I've got some Raptors here too; I'll post numbers Wed or Thu.

--
Jeff Trout <jeff@jefftrout.com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/
On Fri, 30 Jan 2009, Scott Carey wrote:

> For anyone worried about the X25-M's ability to withstand lots of write
> cycles ... Calculate how long it would take you to write 800TB to the
> drive at a typical rate. For most use cases that's going to be > 5
> years. For the 160GB version, it will take 2x as much data and time to
> wear it down.

This article just came out:

http://www.theregister.co.uk/2009/02/20/intel_x25emmental/

and

http://www.pcper.com/article.php?aid=669

It seems that the performance of the X25-M degrades over time, as the write levelling algorithm fragments the device into little bits. Especially under database-like access patterns.

Matthew

--
I quite understand I'm doing algebra on the blackboard and the usual
response is to throw objects... If you're going to freak out... wait
until party time and invite me along  -- Computer Science Lecturer