Thread: Fusion-io ioDrive
I recently got my hands on a device called ioDrive from a company called
Fusion-io. The ioDrive is essentially 80GB of flash on a PCI card. It has its
own driver for Linux, completely outside of the normal scsi/sata/sas/fc block
device stack, but from the user's perspective it behaves like a block device.
I put the ioDrive in an ordinary PC with 1GB of memory, a single 2.2GHz AMD
CPU, and an existing Areca RAID with 6 SATA disks and a 128MB cache. I tested
the device with PostgreSQL 8.3.3 on CentOS 5.3 x86_64 (Linux 2.6.18).

The pgbench database was initialized with scale factor 100. Test runs were
performed with 8 parallel connections (-c 8), both read-only (-S) and
read-write. PostgreSQL itself was configured with 256MB of shared buffers and
32 checkpoint segments. Otherwise the configuration was all defaults.

In the following table, the "RAID" configuration has the xlogs on a RAID 0 of
2 10krpm disks with ext2, and the heap is on a RAID 0 of 4 7200rpm disks with
ext3. The "Fusion" configuration has everything on the ioDrive with xfs. I
tried the ioDrive with ext2 and ext3, but it didn't seem to make any
difference.

                            Service Time Percentile, millis
          R/W TPS   R-O TPS    50th    80th    90th    95th
RAID          182       673      18      32      42      64
Fusion        971      4792       8       9      10      11

Basically the ioDrive is smoking the RAID. The only real problem with this
benchmark is that the machine became CPU-limited rather quickly. During the
runs with the ioDrive, iowait was pretty well zero, with user CPU being about
75% and system getting about 20%.

Now, I will say a couple of other things. The Linux driver for this piece of
hardware is pretty dodgy. Sub-alpha quality, actually. But they seem to be
working on it. Also, there's no driver for OpenSolaris, Mac OS X, or Windows
right now. In fact, there's not even anything available for Debian or other
respectable Linux distros, only Red Hat and its clones. The other problem is
that the 80GB model is too small to hold my entire DB, although it could be
used as a tablespace for some critical tables. But hey, it's fast.

I'm going to put this board into my 8-way Xeon to see if it goes any faster
with more CPU available.

I'd be interested in hearing experiences with other flash storage devices,
SSDs, and that type of thing. So far, this is the fastest hardware I've seen
for the price.

-jwb
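For anyone wanting to reproduce the setup, here is a minimal sketch of the
pgbench steps described above; the database name and the per-client
transaction count are assumptions, not the values actually used:

    #!/usr/bin/env python3
    # Sketch of the benchmark procedure described above. The database name and
    # the per-client transaction count are assumptions; the scale factor, the
    # client count, and the -S flag match the post.
    import subprocess

    DB = "pgbench"      # hypothetical database name
    SCALE = 100         # scale factor used in the post
    CLIENTS = 8         # -c 8 as in the post
    TXNS = 10000        # per-client transactions (assumed, not stated in the post)

    def run(cmd):
        print(">>", " ".join(cmd))
        subprocess.check_call(cmd)

    # initialize the pgbench tables at scale factor 100
    run(["pgbench", "-i", "-s", str(SCALE), DB])

    # read/write run with 8 parallel connections
    run(["pgbench", "-c", str(CLIENTS), "-t", str(TXNS), DB])

    # read-only (SELECT-only) run
    run(["pgbench", "-c", str(CLIENTS), "-t", str(TXNS), "-S", DB])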
On 02/07/2008, Jeffrey Baker <jwbaker@gmail.com> wrote:
> Red Hat and its clones. The other problem is that the 80GB model is too
> small to hold my entire DB, although it could be used as a tablespace
> for some critical tables. But hey, it's fast.

And when/if it dies, please give us a rough guesstimate of its life-span
in terms of read/write cycles. Sounds exciting, though!

Cheers,
Andrej

--
Please don't top post, and don't use HTML e-Mail :} Make your quotes concise.
http://www.american.edu/econ/notes/htmlmail.htm
On Tue, Jul 1, 2008 at 6:17 PM, Andrej Ricnik-Bay <andrej.groups@gmail.com> wrote:
> On 02/07/2008, Jeffrey Baker <jwbaker@gmail.com> wrote:
>> Red Hat and its clones. The other problem is that the 80GB model is too
>> small to hold my entire DB, although it could be used as a tablespace
>> for some critical tables. But hey, it's fast.
>
> And when/if it dies, please give us a rough guesstimate of its
> life-span in terms of read/write cycles. Sounds exciting, though!

Yeah. The manufacturer rates it for 5 years in constant use. I remain
skeptical.
On Tue, 1 Jul 2008, Jeffrey Baker wrote:

> The only real problem with this benchmark is that the machine became
> CPU-limited rather quickly. During the runs with the ioDrive, iowait was
> pretty well zero, with user CPU being about 75% and system getting about
> 20%.

You might try reducing the number of clients; with a single CPU like yours
I'd expect peak throughput here would be closer to 4 clients rather than 8,
and possibly as low as 2. What I normally do is run a quick scan of a few
client loads before running a long test to figure out where the general area
of peak throughput is. For your 8-way box, it will be closer to 32 clients.

Well done test, though. When you try again with the faster system, the only
other postgresql.conf parameter I'd suggest bumping up is wal_buffers; that
can limit best pgbench scores a bit, and it only needs a MB or so to make
that go away.

It's also worth noting that the gap between the two types of storage will go
up if you increase scale further; scale=100 is only making a 1.5GB or so
database. If you collected a second data point at a scale of 500, I'd expect
the standard disk results would halve by then, but I don't know what the
Fusion device would do, and I'm kind of curious. You may need to increase
this regardless because the bigger box has more RAM, and you want the
database to be larger than RAM to get interesting results in this type of
test.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
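A rough sketch of that kind of client-load scan, assuming pgbench is on the
PATH and the test database is already initialized; the regex just picks up the
"tps = ..." line pgbench prints, and the short per-step transaction count is
only for scouting:

    import re
    import subprocess

    DB = "pgbench"   # hypothetical database name

    def quick_tps(clients, txns=2000):
        """Run a short pgbench step and pull the reported tps out of its output."""
        out = subprocess.run(
            ["pgbench", "-c", str(clients), "-t", str(txns), DB],
            capture_output=True, text=True, check=True,
        ).stdout
        m = re.search(r"tps = ([\d.]+)", out)
        return float(m.group(1)) if m else None

    # scan a few client counts to find the general area of peak throughput
    for c in (1, 2, 4, 8, 16, 32):
        print(c, quick_tps(c))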
On 02/07/2008, Jeffrey Baker <jwbaker@gmail.com> wrote:
> Yeah. The manufacturer rates it for 5 years in constant use. I
> remain skeptical.

I read in one of their spec-sheets that with continuous writes it should
survive roughly 3.4 years ... I'd be a tad more conservative, I guess, and
drop 20-30% off that figure if I were considering something like it for
production use.

And I'll be very indiscreet and ask: "How much do they go for?" :} I couldn't
find anyone actually offering them in 5 minutes of googling, just some
ball-park figure of 2400US$ ...

Cheers,
Andrej

--
Please don't top post, and don't use HTML e-Mail :} Make your quotes concise.
http://www.american.edu/econ/notes/htmlmail.htm
On Tue, Jul 1, 2008 at 8:18 PM, Jeffrey Baker <jwbaker@gmail.com> wrote:
> I recently got my hands on a device called ioDrive from a company
> called Fusion-io. The ioDrive is essentially 80GB of flash on a PCI
> card. It has its own driver for Linux, completely outside of the
> normal scsi/sata/sas/fc block device stack, but from the user's
> perspective it behaves like a block device. I put the ioDrive in an
> ordinary PC with 1GB of memory, a single 2.2GHz AMD CPU, and an
> existing Areca RAID with 6 SATA disks and a 128MB cache. I tested the
> device with PostgreSQL 8.3.3 on CentOS 5.3 x86_64 (Linux 2.6.18).
>
> The pgbench database was initialized with scale factor 100. Test runs
> were performed with 8 parallel connections (-c 8), both read-only (-S)
> and read-write. PostgreSQL itself was configured with 256MB of shared
> buffers and 32 checkpoint segments. Otherwise the configuration was
> all defaults.
>
> In the following table, the "RAID" configuration has the xlogs on a
> RAID 0 of 2 10krpm disks with ext2, and the heap is on a RAID 0 of 4
> 7200rpm disks with ext3. The "Fusion" configuration has everything on
> the ioDrive with xfs. I tried the ioDrive with ext2 and ext3, but it
> didn't seem to make any difference.
>
>                             Service Time Percentile, millis
>           R/W TPS   R-O TPS    50th    80th    90th    95th
> RAID          182       673      18      32      42      64
> Fusion        971      4792       8       9      10      11
>
> Basically the ioDrive is smoking the RAID. The only real problem with
> this benchmark is that the machine became CPU-limited rather quickly.
> During the runs with the ioDrive, iowait was pretty well zero, with
> user CPU being about 75% and system getting about 20%.
>
> Now, I will say a couple of other things. The Linux driver for this
> piece of hardware is pretty dodgy. Sub-alpha quality, actually. But
> they seem to be working on it. Also, there's no driver for
> OpenSolaris, Mac OS X, or Windows right now. In fact, there's not even
> anything available for Debian or other respectable Linux distros, only
> Red Hat and its clones. The other problem is that the 80GB model is too
> small to hold my entire DB, although it could be used as a tablespace
> for some critical tables. But hey, it's fast.
>
> I'm going to put this board into my 8-way Xeon to see if it goes any
> faster with more CPU available.
>
> I'd be interested in hearing experiences with other flash storage
> devices, SSDs, and that type of thing. So far, this is the fastest
> hardware I've seen for the price.

Any chance of getting bonnie results?

How long are your pgbench runs?

Are you sure that you are seeing proper syncs to the device? (This is my
largest concern, actually.)

merlin
On Tue, Jul 1, 2008 at 8:18 PM, Jeffrey Baker <jwbaker@gmail.com> wrote:
> Basically the ioDrive is smoking the RAID. The only real problem with
> this benchmark is that the machine became CPU-limited rather quickly.

That's traditionally the problem with everything being in memory. Unless the
database algorithms are designed to exploit L1/L2 cache and RAM, which is not
the case for a disk-based DBMS, you generally lose some concurrency due to
the additional CPU overhead of playing only with memory. This is generally
acceptable if you're going to trade off higher concurrency for faster service
times. And, it isn't only evidenced in single systems where a disk-based DBMS
is 100% cached, but also in most shared-memory clustering architectures.

In most cases, when you're waiting on disk I/O, you can generally support
higher concurrency because the OS can utilize the CPU's free cycles (during
the wait) to handle other users. In short, sometimes, disk I/O is a good
thing; it just depends on what you need.

--
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation                | fax: 732.331.1301
499 Thornall Street, 2nd Floor          | jonah.harris@enterprisedb.com
Edison, NJ 08837                        | http://www.enterprisedb.com/
On Wednesday 02 July 2008, Jonah H. Harris wrote:
> On Tue, Jul 1, 2008 at 8:18 PM, Jeffrey Baker <jwbaker@gmail.com> wrote:
> > Basically the ioDrive is smoking the RAID. The only real problem with
> > this benchmark is that the machine became CPU-limited rather quickly.
>
> That's traditionally the problem with everything being in memory.
> Unless the database algorithms are designed to exploit L1/L2 cache and
> RAM, which is not the case for a disk-based DBMS, you generally lose
> some concurrency due to the additional CPU overhead of playing only
> with memory. This is generally acceptable if you're going to trade
> off higher concurrency for faster service times. And, it isn't only
> evidenced in single systems where a disk-based DBMS is 100% cached,
> but also in most shared-memory clustering architectures.

My experience is that using an i-RAM for replication (on the slave) is very
good. I am unfortunately unable to provide any numbers or benchmarks :/
(I'll try to get some, but it won't be easy.)

I would probably use some flash/memory disk once PostgreSQL gets warm
standby at the transaction level (and is up for read-only queries)...

> In most cases, when you're waiting on disk I/O, you can generally
> support higher concurrency because the OS can utilize the CPU's free
> cycles (during the wait) to handle other users. In short, sometimes,
> disk I/O is a good thing; it just depends on what you need.

--
Cédric Villemain
Database Administrator
Cel: +33 (0)6 74 15 56 53
http://dalibo.com - http://dalibo.org
On Tue, Jul 1, 2008 at 5:18 PM, Jeffrey Baker <jwbaker@gmail.com> wrote:
> I recently got my hands on a device called ioDrive from a company
> called Fusion-io. The ioDrive is essentially 80GB of flash on a PCI
> card.
[...]
>                             Service Time Percentile, millis
>           R/W TPS   R-O TPS    50th    80th    90th    95th
> RAID          182       673      18      32      42      64
> Fusion        971      4792       8       9      10      11

Essentially the same benchmark, but on a quad Xeon 2GHz with 3GB main memory
and a scale factor of 300. Really all we learn from this exercise is the
sheer futility of throwing CPU at PostgreSQL.

R/W TPS: 1168
R-O TPS: 6877

Quadrupling the CPU resources and tripling the RAM results in a 20% or 44%
performance increase on read/write and read-only loads, respectively. The
system loafs along with 2-3 CPUs completely idle, although oddly iowait is
0%. I think the system is constrained by context switching, which is running
at tens of thousands per second. This is a problem with the ioDrive software,
not with pg.

Someone asked for bonnie++ output:

Block output:  495MB/s, 81% CPU
Block input:   676MB/s, 93% CPU
Block rewrite: 262MB/s, 59% CPU

Pretty respectable. In the same ballpark as an HP MSA70 + P800 with 25
spindles.

-jwb
On Sat, Jul 5, 2008 at 2:41 AM, Jeffrey Baker <jwbaker@gmail.com> wrote:
>>                             Service Time Percentile, millis
>>           R/W TPS   R-O TPS    50th    80th    90th    95th
>> RAID          182       673      18      32      42      64
>> Fusion        971      4792       8       9      10      11
>
> Someone asked for bonnie++ output:
>
> Block output:  495MB/s, 81% CPU
> Block input:   676MB/s, 93% CPU
> Block rewrite: 262MB/s, 59% CPU
>
> Pretty respectable. In the same ballpark as an HP MSA70 + P800 with
> 25 spindles.

You left off the 'seeks' portion of the bonnie++ results -- this is actually
the most important portion of the test. Based on your tps numbers, I'm
expecting a seeks figure equivalent to about 10 10k drives configured in a
RAID 10, or around 1000-1500. They didn't publish any prices, so it's hard to
say if this is 'cost competitive'. These numbers are indeed fantastic,
disruptive even.

If I were testing the device for consideration in high-duty server
environments, I would be doing durability testing right now... I would slam
the database with transactions (fsync on, etc.) and then power off the
device. I would do this several times... making sure the software layer isn't
doing some mojo that is technically cheating.

I'm not particularly enamored of having a storage device be stuck directly in
a PCI slot -- although I understand it's probably necessary in the short term
as flash changes all the rules and you can't expect it to run well using
mainstream hardware RAID controllers. By using their own device they have
complete control of the I/O stack up to the O/S driver level.

I've been thinking for a while now that flash is getting ready to explode
into use in server environments. The outstanding questions I see are:

*) is the write endurance problem truly solved (giving at least a 5-10 year
   lifetime)
*) what are the true odds of catastrophic device failure (industry claims
   less, we'll see)
*) is the flash random write problem going to be solved in hardware or with
   specialized solid-state write caching techniques. At least currently, it
   seems like software is filling the role.
*) do the software solutions really work (unproven)
*) when are the major hardware vendors going to get involved. They make a lot
   of money selling disks and supporting hardware (SAN, etc).

merlin
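One way to run that kind of power-cut check is the diskchecker-style approach:
commit monotonically increasing values from a second machine, cut power to the
server mid-run, then compare what the database kept against the last commit it
acknowledged. A rough sketch, assuming psycopg2 and a pre-created one-column
table; the table name and connection string are placeholders:

    import psycopg2

    # Run this from a machine other than the one being power-cycled.
    conn = psycopg2.connect("dbname=plugtest host=server-under-test")  # placeholder DSN
    cur = conn.cursor()

    i = 0
    try:
        while True:
            i += 1
            cur.execute("INSERT INTO pull_the_plug (id) VALUES (%s)", (i,))
            conn.commit()             # must not return before the WAL is on stable storage
            print("acknowledged", i)  # the last value printed must survive the power cut
    except Exception as exc:
        print("stopped at", i, exc)

    # After power comes back: SELECT max(id) FROM pull_the_plug;
    # If it is lower than the last acknowledged value, something is cheating on fsync.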
On Wed, Jul 2, 2008 at 7:41 AM, Jonah H. Harris <jonah.harris@gmail.com> wrote:
> On Tue, Jul 1, 2008 at 8:18 PM, Jeffrey Baker <jwbaker@gmail.com> wrote:
>> Basically the ioDrive is smoking the RAID. The only real problem with
>> this benchmark is that the machine became CPU-limited rather quickly.
>
> That's traditionally the problem with everything being in memory.
> Unless the database algorithms are designed to exploit L1/L2 cache and
> RAM, which is not the case for a disk-based DBMS, you generally lose
> some concurrency due to the additional CPU overhead of playing only
> with memory. This is generally acceptable if you're going to trade
> off higher concurrency for faster service times. And, it isn't only
> evidenced in single systems where a disk-based DBMS is 100% cached,
> but also in most shared-memory clustering architectures.
>
> In most cases, when you're waiting on disk I/O, you can generally
> support higher concurrency because the OS can utilize the CPU's free
> cycles (during the wait) to handle other users. In short, sometimes,
> disk I/O is a good thing; it just depends on what you need.

I have a lot of problems with your statements.

First of all, we are not really talking about 'RAM' storage... I think your
comments would be more on point if we were talking about mounting database
storage directly from server memory, for example. Server memory and CPU are
involved only to the extent that the O/S uses them for caching and filesystem
work, and inside the device driver.

Also, your comments seem to indicate that having a slower device leads to
higher concurrency because it allows the process to yield and do other
things. This is IMO simply false. With faster storage, CPU load will
increase, but only because the overall system throughput increases and
CPU/memory 'work' increases in terms of overall system activity. Presumably,
as storage approaches the speed of main system memory, the algorithms for
dealing with it will become simpler (not having to go through acrobatics to
try to make everything sequential) and thus faster.

I also find the remarks about software 'optimizing' for strict hardware
assumptions (L1/L2 cache) a little suspicious. In some old programs I
remember keeping a giant C 'union' of critical structures that was exactly
8k to fit in the 486 CPU cache. In modern terms I think that type of
programming (sans some specialized environments) is usually
counter-productive... I think PostgreSQL's approach of deferring as much work
as possible to the O/S is a great approach.

merlin
> *) is the flash random write problem going to be solved in hardware or
> specialized solid state write caching techniques. At least
> currently, it seems like software is filling the role.

Those flash chips are page-based, not unlike a harddisk, i.e. you cannot
erase and write a byte, you must erase and write a full page. The size of
said page depends on the chip implementation. I don't know which chips they
used, so I cannot comment there, but you can easily imagine that smaller
pages yield faster random write throughput. For reads, you must first select
a page and then access it. Thus, it is not like RAM at all. It is much more
similar to a harddisk with an almost zero seek time (on reads) and a very
small, but significant, seek time (on writes), because a page must be erased
before being written.

Big flash chips include ECC inside to improve reliability. Basically the
chips include a small static RAM buffer. When you want to read a page, it is
first copied to SRAM and ECC-checked. When you want to write a page, you
first write it to SRAM and then order the chip to write it to flash. Usually
you can't erase a single page; you must erase a block which contains many
pages (this is probably why most flash SSDs suck at random writes).

NAND flash will never replace SDRAM because of these restrictions (NOR flash
acts like RAM, but it is slow and has less capacity). However, NAND flash is
well suited to replace harddisks. When writing a page, you write it to the
small static RAM buffer on the chip (fast) and tell the chip to write it to
flash (slow). While the chip is busy erasing or writing you cannot do
anything else with it, but you can still talk to the other chips. Since the
ioDrive has many chips, I'd bet they use this feature.

I don't know about the ioDrive implementation, but you can see that the
paging and erasing requirements mean some tricks have to be applied, and the
thing will probably need some smart buffering in RAM in order to be fast.
Since the data in a flash doesn't need to be sequential (read seek time being
close to zero), it is possible they use a system which makes all writes
sequential (for instance), which would suit the block erasing requirements
very well, with the information about block mapping stored in RAM; or perhaps
they use some form of copy-on-write. It would be interesting to dissect this
algorithm, especially the part which allows the block mappings to be stored
permanently, since they cannot be kept in a constant, known sector (it would
wear out pretty quickly).

Ergo, in order to benchmark this thing and get relevant results, I would tend
to think that you'd need to fill it to, say, 80% of capacity and bombard it
with small random writes, with the total amount of data written being many
times the total capacity of the drive, in order to exercise the remapping
algorithms which are the weak point of such a device.

> *) do the software solutions really work (unproven)
> *) when are the major hardware vendors going to get involved. they
> make a lot of money selling disks and supporting hardware (san, etc).

Looking at the pictures of the "drive", I see a bunch of flash chips, which
probably make up the bulk of the cost, a switching power supply, a small BGA
chip which is probably DDR memory for buffering, and the mystery ASIC, which
is probably an FPGA; I would tend to think a Virtex-4, from the shape of the
package seen from the side in one of the pictures. A team of talented
engineers can design and produce such a board, and assembly would only use
standard PCB processes.
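To make the "all writes sequential" idea above concrete, here is a toy sketch
of such a remapping scheme: a logical page is never overwritten in place,
every write goes to the next free physical page, and a RAM-resident table
records where each logical page currently lives (wear levelling, garbage
collection of stale pages, and persisting the map are all left out):

    class ToyRemapper:
        """Log-structured remapping: random logical writes become sequential
        physical writes; a RAM map tracks the current home of each logical page."""

        def __init__(self, physical_pages):
            self.flash = [None] * physical_pages   # simulated NAND pages
            self.mapping = {}                      # logical page -> physical page
            self.next_free = 0                     # append point, always sequential

        def write(self, logical_page, data):
            if self.next_free >= len(self.flash):
                raise IOError("device full; a real device would garbage-collect here")
            phys = self.next_free
            self.flash[phys] = data                 # program the next erased page
            stale = self.mapping.get(logical_page)  # the old copy is now garbage
            self.mapping[logical_page] = phys
            self.next_free += 1
            return stale                            # reclaimable later

        def read(self, logical_page):
            phys = self.mapping.get(logical_page)
            return None if phys is None else self.flash[phys]

    ftl = ToyRemapper(physical_pages=16)
    ftl.write(7, b"hello")       # scattered logical addresses...
    ftl.write(3, b"world")
    ftl.write(7, b"hello v2")    # ...all land on consecutive physical pages
    print(ftl.read(7), ftl.mapping)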
This is unlike harddisks, which need a huge investment and a specialized
factory because of the complex mechanical parts and very tight tolerances. In
the case of the ioDrive, most of the value is in the intellectual property:
software on the PC CPU (the driver), embedded software, and the FPGA
programming.

All this points to a very different economic model for storage. I could
design and build a scaled-down version of the ioDrive in my "garage", for
instance (well, the PCI Express licensing fees are hefty, so I'd use PCI, but
you get the idea). This means I think we are about to see a flood of these
devices coming from many small companies. This is very good for the end user,
because there will be competition, natural selection, and fast evolution.
Interesting times ahead!

> I'm not particularly enamored of having a storage device be stuck
> directly in a PCI slot -- although I understand it's probably
> necessary in the short term as flash changes all the rules and you
> can't expect it to run well using mainstream hardware RAID
> controllers. By using their own device they have complete control of
> the I/O stack up to the O/S driver level.

Well, SATA is great for harddisks: small cables, less clutter, less
failure-prone than 80-conductor cables, faster, cheaper, etc. Basically,
serial LVDS (low-voltage differential signalling) point-to-point links (SATA,
PCI Express, etc.) are replacing parallel busses (PCI, IDE) everywhere,
except where you need extremely low latency combined with extremely high
throughput (like RAM). Point-to-point is much better because there is no
contention.

SATA is too slow for flash, though, because it has only 2 lanes. This only
leaves PCI Express. However, the humongous data rates this "drive" puts out
are not going to go through a cable that is going to be cheap. Therefore we
are probably going to see a lot more PCI Express flash drives until a
standard comes along to allow the RAID-card + "drives" paradigm. But it
probably won't involve cables and bays; most likely flash sticks just like we
have RAM sticks now, and a RAID controller on the mobo or a PCI Express card.
Or perhaps it will just be software RAID.

As for the reliability of this device, I'd say the failure point is the flash
chips, as stated by the manufacturer. Wear-levelling algorithms are going to
matter a lot.
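A bare-bones version of the fill-it-and-bombard-it test suggested above might
look like this; the file path and sizes are placeholders, O_SYNC is used so
each write actually reaches the device, and a serious test would use O_DIRECT
with aligned buffers and also record latency percentiles over the whole run:

    import os
    import random

    PATH = "/mnt/iodrive/torture.dat"     # placeholder path on the device under test
    FILE_SIZE = 64 * 1024**3              # roughly 80% of an 80GB device (assumed)
    BLOCK = 4096
    TOTAL_TO_WRITE = 5 * FILE_SIZE        # several times the capacity of the drive

    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
    os.ftruncate(fd, FILE_SIZE)           # reserve the logical space (a real run would
                                          # pre-fill it rather than leave it sparse)
    buf = os.urandom(BLOCK)
    written = 0
    while written < TOTAL_TO_WRITE:
        # one small synchronous write at a random block-aligned offset,
        # forcing the drive's remapping machinery to do real work
        offset = random.randrange(FILE_SIZE // BLOCK) * BLOCK
        os.pwrite(fd, buf, offset)
        written += BLOCK
    os.close(fd)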
On Mon, Jul 7, 2008 at 9:23 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> I have a lot of problems with your statements. First of all, we are
> not really talking about 'RAM' storage... I think your comments would
> be more on point if we were talking about mounting database storage
> directly from server memory, for example. Server memory and CPU are
> involved only to the extent that the O/S uses them for caching and
> filesystem work, and inside the device driver.

I'm not sure how those cards work, but my guess is that the CPU will go 100%
busy (with a near-zero I/O wait) on any sizable workload. In this case, the
current pgbench configuration being used is quite small and probably won't
resemble this.

> Also, your comments seem to indicate that having a slower device leads
> to higher concurrency because it allows the process to yield and do
> other things. This is IMO simply false.

Argue all you want, but this is a fairly well-known (20+ year-old) behavior.

> With faster storage, CPU load will increase, but only because the
> overall system throughput increases and CPU/memory 'work' increases
> in terms of overall system activity.

Again, I said that response times (throughput) would improve. I'd like to see
your argument for explaining how you can handle more CPU-only operations when
0% of the CPU is free for use.

> Presumably, as storage approaches the speed of main system memory,
> the algorithms for dealing with it will become simpler (not having to
> go through acrobatics to try to make everything sequential)
> and thus faster.

We'll have to see.

> I also find the remarks about software 'optimizing' for strict hardware
> assumptions (L1/L2 cache) a little suspicious. In some old programs I
> remember keeping a giant C 'union' of critical structures that was
> exactly 8k to fit in the 486 CPU cache. In modern terms I think that
> type of programming (sans some specialized environments) is usually
> counter-productive... I think PostgreSQL's approach of deferring as
> much work as possible to the O/S is a great approach.

All of the major database vendors still see immense value in optimizing their
algorithms and memory structures for specific platforms and CPU caches.
Hence, if they're *paying* money for very specialized industry professionals
to optimize in this way, I would hesitate to say there isn't any value in it.
As a fact, Postgres doesn't have those low-level resources, so for the most
part, I have to agree that they have to rely on the OS.

-Jonah
On Mon, Jul 7, 2008 at 6:08 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Sat, Jul 5, 2008 at 2:41 AM, Jeffrey Baker <jwbaker@gmail.com> wrote:
>>>                             Service Time Percentile, millis
>>>           R/W TPS   R-O TPS    50th    80th    90th    95th
>>> RAID          182       673      18      32      42      64
>>> Fusion        971      4792       8       9      10      11
>>
>> Someone asked for bonnie++ output:
>>
>> Block output:  495MB/s, 81% CPU
>> Block input:   676MB/s, 93% CPU
>> Block rewrite: 262MB/s, 59% CPU
>>
>> Pretty respectable. In the same ballpark as an HP MSA70 + P800 with
>> 25 spindles.
>
> You left off the 'seeks' portion of the bonnie++ results -- this is
> actually the most important portion of the test. Based on your tps
> numbers, I'm expecting a seeks figure equivalent to about 10 10k drives
> configured in a RAID 10, or around 1000-1500. They didn't publish any
> prices, so it's hard to say if this is 'cost competitive'.

I left it out because bonnie++ reports it as "+++++", i.e. greater than or
equal to 100000 per second.

-jwb
> PFC, I have to say these kinds of posts make me a fan of yours. I've
> read many of your storage-related replies and have found them all very
> educational. I just want to let you know I found your assessment of the
> impact of flash storage perfectly worded and unbelievably insightful.
> Thanks a million for sharing your knowledge with the list. -Dan

Hehe, thanks.

There was a time when you had to be a big company full of cash to build a
computer, and then suddenly people did it in garages, like Wozniak and Jobs,
out of off-the-shelf parts. I feel the ioDrive guys are the same kind of
hackers, except today's hackers have much more powerful tools. Perhaps, and I
hope it's true, storage is about to undergo a revolution like the personal
computer had 20-30 years ago, when the IBMs of the time were eaten from the
roots up.

IMHO the key is that you can build an ioDrive from off-the-shelf parts, but
you can't do that with a disk drive. Flash manufacturers are smelling blood;
they profit from USB keys and digicams, but imagine the market for solid
state drives! And in this case the hardware is simple: flash, RAM, an FPGA,
some chips, nothing out of the ordinary. It is the brain juice in the
software (which includes the FPGAs) which will sort the high-performance and
reliability winners out from the rest.

Lowering the barrier of entry is good for innovation. I believe Linux will
benefit, too, since the target is (for now) high-performance servers, and, as
shown by the ioDrive, innovating hackers prefer to write Linux drivers rather
than Vista (argh) drivers.
Hi,

Jonah H. Harris wrote:
> I'm not sure how those cards work, but my guess is that the CPU will
> go 100% busy (with a near-zero I/O wait) on any sizable workload. In
> this case, the current pgbench configuration being used is quite small
> and probably won't resemble this.

I'm not sure how they work either, but why should they require more CPU
cycles than any other PCIe SAS controller?

I think they are doing a clever thing by directly attaching the NAND chips to
PCIe, instead of piping all the data through SAS or (S)ATA (and then through
PCIe as well). And if the controller chip on the card isn't absolutely bogus,
that certainly has the potential to reduce latency and improve throughput,
compared to other SSDs.

Or am I missing something?

Regards

Markus
Well, what does a revolution like this require of Postgres? That is the question.
I have looked at the I/O drive, and it could increase our DB throughput significantly over a RAID array.
Ideally, I would put a few key tables, the WAL, etc. on it. I'd also want all the sort or hash overflow from work_mem to go to this device. Some of our tables / indexes are heavily written to for short periods of time and then more infrequently later -- these are partitioned by date. I would put the fresh ones on such a device and then move them to the hard drives later.
Ideally, we would then need a few changes in Postgres to take full advantage of this:
#1 Per-Tablespace optimizer tuning parameters. Arguably, this is already needed. The tablespaces on such a solid state device would have random and sequential access at equal (low) cost. Any one-size-fits-all set of optimizer variables is bound to cause performance issues when two tablespaces have dramatically different performance profiles.
#2 Optimally, work_mem could be shrunk, and the optimizer would have to stop preferring a sort + GroupAggregate whenever it suspected that a hash_agg would not fit in work_mem. With such a device, a disk-based hash_agg will pretty much win every time over a sort (in memory or not) once the number of rows to aggregate goes above a moderate threshold of a couple hundred thousand or so.
In fact, I have several examples with 8.3.3 and a standard RAID array where a hash_agg that spilled to disk (poor, or purposely distorted, statistics cause this) was a lot faster than the sort the optimizer wants to do instead. Whatever mechanism calculates the cost of doing sorts or hashes on disk will need to be tunable per tablespace.
I suppose both of the above may be one task -- I don't know enough about the Postgres internals.
#3 Being able to move tables / indexes from one tablespace to another as efficiently as possible.
There are probably other enhancements that will help such a setup. These were the first that came to mind.
On Tue, Jul 8, 2008 at 2:49 AM, Markus Wanner <markus@bluegap.ch> wrote:
> Hi,
>
> Jonah H. Harris wrote:
>> I'm not sure how those cards work, but my guess is that the CPU will
>> go 100% busy (with a near-zero I/O wait) on any sizable workload. In
>> this case, the current pgbench configuration being used is quite small
>> and probably won't resemble this.
>
> I'm not sure how they work either, but why should they require more CPU
> cycles than any other PCIe SAS controller?
>
> I think they are doing a clever thing by directly attaching the NAND
> chips to PCIe, instead of piping all the data through SAS or (S)ATA (and
> then through PCIe as well). And if the controller chip on the card isn't
> absolutely bogus, that certainly has the potential to reduce latency and
> improve throughput, compared to other SSDs.
>
> Or am I missing something?
>
> Regards
>
> Markus
Scott Carey wrote:
> Well, what does a revolution like this require of Postgres? That is the
> question.
[...]
> #1 Per-Tablespace optimizer tuning parameters.

... automatically measured?

Cheers,
Jeremy