Re: Fusion-io ioDrive - Mailing list pgsql-performance
From: PFC
Subject: Re: Fusion-io ioDrive
Date:
Msg-id: op.udxfvlzrcigqcu@apollo13.peufeu.com
In response to: Re: Fusion-io ioDrive ("Merlin Moncure" <mmoncure@gmail.com>)
List: pgsql-performance
> *) is the flash random write problem going to be solved in hardware or
> specialized solid state write caching techniques. At least
> currently, it seems like software is filling the role.

Those flash chips are page-based, not unlike a harddisk, i.e. you cannot erase and write a single byte, you must erase and write a full page. The size of said page depends on the chip implementation. I don't know which chips they used, so I cannot comment there, but you can easily imagine that smaller pages yield faster random write throughput. For reads, you must first select a page and then access it. Thus, it is not like RAM at all. It is much more like a harddisk with an almost zero seek time on reads and a very small, but significant, seek time on writes, because a page must be erased before it can be written.

Big flash chips include ECC inside to improve reliability. Basically the chips include a small static RAM buffer. When you want to read a page, it is first copied to SRAM and ECC-checked. When you want to write a page, you first write it to SRAM and then order the chip to write it to flash. Usually you can't erase a single page; you must erase a block which contains many pages (this is probably why most flash SSDs suck at random writes).

NAND flash will never replace SDRAM because of these restrictions (NOR flash acts like RAM, but it is slow and has less capacity). However, NAND flash is well suited to replacing harddisks.

When writing a page, you write it to the small static RAM buffer on the chip (fast) and tell the chip to write it to flash (slow). While the chip is busy erasing or writing you cannot do anything else with it, but you can still talk to the other chips. Since the ioDrive has many chips, I'd bet they use this feature.

I don't know about the ioDrive implementation, but you can see that the paging and erasing requirements mean some tricks have to be applied, and the thing will probably need some smart buffering in RAM in order to be fast. Since the data in flash doesn't need to be sequential (read seek time being close to zero), it is possible they use a scheme which makes all writes sequential (for instance), which would suit the block-erasing requirements very well, with the block-mapping information kept in RAM; or perhaps they use some form of copy-on-write. It would be interesting to dissect this algorithm, especially the part which stores the block mappings permanently, since they cannot live in one fixed, well-known sector, which would wear out pretty quickly. (I've sketched the kind of remapping scheme I mean at the end of this message.)

Ergo, in order to benchmark this thing and get relevant results, I would tend to think you'd need to fill it to, say, 80% of capacity and then bombard it with small random writes, with the total amount of data written being many times the capacity of the drive, in order to exercise the remapping algorithms, which are the weak point of such a device. (Also roughly sketched at the end of this message.)

> *) do the software solutions really work (unproven)
> *) when are the major hardware vendors going to get involved. they
> make a lot of money selling disks and supporting hardware (san, etc).

Looking at the pictures of the "drive", I see a bunch of Flash chips, which probably make up the bulk of the cost, a switching power supply, a small BGA chip which is probably DDR memory for buffering, and the mystery ASIC, which is probably an FPGA; from the shape of the package seen from the side in one of the pictures, I would guess a Virtex4. A team of talented engineers can design and produce such a board, and assembly only needs standard PCB processes.
This is unlike harddisks, which need a huge investment and a specialized factory because of the complex mechanical parts and very tight tolerances. In the case of the ioDrive, most of the value is in the intellectual property: software on the PC CPU (the driver), embedded software, and the FPGA programming.

All this points to a very different economic model for storage. I could design and build a scaled-down version of the ioDrive in my "garage", for instance (well, the PCI Express licensing fees are hefty, so I'd use PCI, but you get the idea). This means I think we are about to see a flood of these devices coming from many small companies, which is very good for the end user, because there will be competition, natural selection, and fast evolution. Interesting times ahead!

> I'm not particularly enamored of having a storage device be stuck
> directly in a pci slot -- although I understand it's probably
> necessary in the short term as flash changes all the rules and you
> can't expect it to run well using mainstream hardware raid
> controllers. By using their own device they have complete control of
> the i/o stack up to the o/s driver level.

Well, SATA is great for harddisks: small cables, less clutter, less failure-prone than 80-conductor cables, faster, cheaper, etc. Basically, serial LVDS (low-voltage differential signalling) point-to-point links (SATA, PCI-Express, etc.) are replacing parallel busses (PCI, IDE) everywhere, except where you need extremely low latency combined with extremely high throughput (like RAM). Point-to-point is much better because there is no contention.

SATA is too slow for Flash, though: it has only 2 lanes (one each way, 3 Gbit/s at best). That only leaves PCI-Express. However, the humongous data rates this "drive" puts out are not going to go through a cable that is also cheap. Therefore we are probably going to see a lot more PCI-Express flash drives until a standard comes along that allows the RAID-card + "drives" paradigm. But it probably won't involve cables and bays; most likely Flash sticks, just like we have RAM sticks now, with a RAID controller on the mobo or on a PCI-Express card. Or perhaps it will just be software RAID.

As for the reliability of this device, I'd say the failure point is the Flash chips, as stated by the manufacturer. Wear levelling algorithms are going to matter a lot (see the toy example at the end of this message).
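To make the "all writes sequential" idea above a bit more concrete, here is a minimal sketch in C of a log-structured remapping table. Everything in it (constants, names) is invented for illustration; it is certainly not Fusion-io's actual algorithm, and it leaves out garbage collection and the problem of persisting the map itself, which is exactly the interesting part.

/*
 * Toy remapping layer: logical pages map to physical pages, every
 * write goes to the next free physical page, and the old copy is
 * just marked stale (to be reclaimed by a garbage collector that
 * is omitted here).  Purely illustrative.
 */
#include <stdio.h>
#include <string.h>

#define PAGES_PER_BLOCK 64                      /* erase granularity */
#define NUM_BLOCKS      16
#define NUM_PHYS_PAGES  (PAGES_PER_BLOCK * NUM_BLOCKS)
#define NUM_LOGICAL     (NUM_PHYS_PAGES / 2)    /* keep spare space  */

static int  map[NUM_LOGICAL];       /* logical -> physical page      */
static char stale[NUM_PHYS_PAGES];  /* 1 = old copy, reclaimable     */
static int  write_ptr = 0;          /* next physical page to program */

/* Every logical write becomes a sequential physical write. */
static int ftl_write(int logical_page)
{
    if (write_ptr >= NUM_PHYS_PAGES)
        return -1;                    /* would need GC / erase here  */

    if (map[logical_page] >= 0)
        stale[map[logical_page]] = 1; /* invalidate the old copy     */

    map[logical_page] = write_ptr;
    /* a real driver would DMA the data into the chip's SRAM buffer
     * here and issue the program command */
    return write_ptr++;
}

int main(void)
{
    memset(map, -1, sizeof(map));     /* -1 = logical page unmapped  */

    /* random-looking logical writes end up physically sequential */
    int logicals[] = { 7, 123, 7, 42, 7, 99 };
    for (unsigned i = 0; i < sizeof(logicals) / sizeof(logicals[0]); i++)
        printf("logical %3d -> physical %3d\n",
               logicals[i], ftl_write(logicals[i]));
    return 0;
}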
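And since I mentioned benchmarking, here is a crude sketch of the torture test I have in mind. The device path, capacity, block size and pass count are all placeholders, and a serious test would use O_DIRECT, several threads and better timing; this only shows the shape of it (fill the drive first, then write several times its capacity in small random blocks and watch the sustained rate).

/* Crude random-write torture test sketch; parameters are made up. */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define DEV_PATH  "/dev/fioa"       /* hypothetical device node      */
#define DEV_SIZE  (80ULL << 30)     /* pretend 80 GB capacity        */
#define BLOCK     8192              /* small random writes           */
#define PASSES    4                 /* write 4x the capacity         */

int main(void)
{
    int fd = open(DEV_PATH, O_WRONLY | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(BLOCK);
    memset(buf, 'x', BLOCK);

    unsigned long long nblocks = DEV_SIZE / BLOCK;
    unsigned long long nwrites = PASSES * nblocks;
    time_t start = time(NULL);

    /* hammer random block-aligned offsets over the whole device */
    for (unsigned long long i = 0; i < nwrites; i++) {
        off_t off = (off_t)(((unsigned long long)rand() % nblocks) * BLOCK);
        if (pwrite(fd, buf, BLOCK, off) != BLOCK) { perror("pwrite"); break; }
    }

    double secs = difftime(time(NULL), start);
    if (secs < 1)
        secs = 1;
    printf("%.1f MB/s sustained random write\n",
           nwrites * (double)BLOCK / (1 << 20) / secs);
    close(fd);
    free(buf);
    return 0;
}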
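Finally, a toy illustration of what I mean by wear levelling (again with made-up names and sizes, nothing to do with the real ioDrive firmware): the allocator always hands out the least-erased free block, so no single block wears out long before the others.

/* Toy greedy wear levelling: pick the least-worn free erase block. */
#include <stdio.h>

#define NUM_BLOCKS 16

static unsigned erase_count[NUM_BLOCKS];
static char     in_use[NUM_BLOCKS];

/* pick the free block erased the fewest times, erase it, return it */
static int alloc_block(void)
{
    int best = -1;
    for (int i = 0; i < NUM_BLOCKS; i++)
        if (!in_use[i] && (best < 0 || erase_count[i] < erase_count[best]))
            best = i;
    if (best >= 0) {
        erase_count[best]++;     /* the erase is what wears the cells */
        in_use[best] = 1;
    }
    return best;
}

static void free_block(int b)
{
    if (b >= 0)
        in_use[b] = 0;
}

int main(void)
{
    /* allocate and release many blocks; wear stays roughly even */
    for (int i = 0; i < 1000; i++)
        free_block(alloc_block());

    for (int i = 0; i < NUM_BLOCKS; i++)
        printf("block %2d erased %u times\n", i, erase_count[i]);
    return 0;
}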