Re: Fusion-io ioDrive - Mailing list pgsql-performance

From: PFC
Subject: Re: Fusion-io ioDrive
Msg-id: op.udxfvlzrcigqcu@apollo13.peufeu.com
In response to: Re: Fusion-io ioDrive ("Merlin Moncure" <mmoncure@gmail.com>)
List: pgsql-performance

> *) is the flash random write problem going to be solved in hardware or
> specialized solid state write caching techniques.   At least
> currently, it seems like software is filling the role.

    Those flash chips are page-based, not unlike a harddisk, i.e. you cannot
erase and write a single byte; you must erase and write a full page. The
size of said page depends on the chip implementation. I don't know which
chips they used, so I cannot comment there, but you can easily imagine that
smaller pages yield faster random write throughput. For reads, you must
first select a page and then access it. Thus, it is not like RAM at all. It
is much more similar to a harddisk with an almost-zero seek time (on reads)
and a very small but significant seek time (on writes), because a page
must be erased before being written.
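    The "seek time" analogy can be put in numbers with a toy timing model.
The latencies below are assumed, order-of-magnitude figures for SLC NAND of
that era, not ioDrive measurements:

```python
# Toy timing model for a NAND flash page: reads are near-instant,
# programs are slower, and erases (per block, not per page) are
# slowest. All latencies are assumed ballpark figures, in microseconds.
READ_US = 25      # select + read a page into the on-chip buffer
PROGRAM_US = 200  # program a (pre-erased) page into the array
ERASE_US = 2000   # erase, performed on whole blocks

# Random read: near-zero "seek" -- just the page read.
random_read_us = READ_US

# Random write to an already-used page: must be preceded by an
# erase, which is the small-but-significant "write seek time".
random_write_us = ERASE_US + PROGRAM_US

print(random_read_us, random_write_us, random_write_us / random_read_us)
```

Even with these generous assumptions, a random write costs nearly two
orders of magnitude more than a random read, which is exactly the
read/write asymmetry described above.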

    Big flash chips include ECC inside to improve reliability. Basically the
chips include a small static RAM buffer. When you want to read a page it
is first copied to SRAM and ECC checked. When you want to write a page you
first write it to SRAM and then order the chip to write it to flash.

    Usually you can't erase a page, you must erase a block which contains
many pages (this is probably why most flash SSDs suck at random writes).
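    To see why block-granular erases hurt random writes so badly, consider
a naive controller that rewrites the whole erase block for every page
update. The page and block sizes below are assumptions (real chips vary):

```python
# Write amplification of a naive read-erase-rewrite controller:
# updating one page forces the whole erase block to be reprogrammed.
PAGE_BYTES = 4096       # assumed page size
PAGES_PER_BLOCK = 64    # assumed pages per erase block

def write_amplification(pages_updated: int) -> float:
    """Bytes physically programmed per byte logically written,
    assuming all updates land inside a single erase block."""
    physical = PAGES_PER_BLOCK * PAGE_BYTES   # whole block rewritten
    logical = pages_updated * PAGE_BYTES      # what the host asked for
    return physical / logical

print(write_amplification(1))   # one random page update
print(write_amplification(64))  # a full sequential block write
```

A single random page update is amplified 64x, while sequential
full-block writes are not amplified at all, which matches the observed
behaviour of early flash SSDs.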

    NAND flash will never replace SDRAM because of these restrictions (NOR
flash acts like RAM but it is slow and has less capacity).
    However NAND flash is well suited to replace harddisks.

    When writing a page you write it to the small static RAM buffer on the
chip (fast) and tell the chip to write it to flash (slow). While the chip
is busy erasing or writing you cannot do anything with it, but you can
still talk to the other chips. Since the ioDrive has many chips, I'd bet
they use this feature.
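    That trick is plain pipelining: while one chip grinds through a page
program, the controller issues programs to its siblings. A sketch of how
aggregate write throughput would scale with chip count (the program time,
page size, and bus ceiling are all assumed figures, chosen only to show
the shape of the curve):

```python
# Interleaving page programs across chips: throughput scales with
# chip count until the shared controller/bus saturates.
PROGRAM_US = 200   # assumed page-program time per chip
PAGE_BYTES = 4096  # assumed page size
BUS_MB_S = 400     # assumed controller/bus ceiling

def write_throughput_mb_s(n_chips: int) -> float:
    """Aggregate MB/s when programs are interleaved across n chips."""
    per_chip = PAGE_BYTES / (PROGRAM_US * 1e-6) / 1e6  # MB/s per chip
    return min(n_chips * per_chip, BUS_MB_S)

for n in (1, 4, 16, 64):
    print(n, write_throughput_mb_s(n))
```

One chip manages only ~20 MB/s under these assumptions; with dozens of
chips the drive is limited by the bus rather than by flash program time,
which would explain why a many-chip board like the ioDrive can be fast.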

    I don't know about the ioDrive implementation, but you can see that the
paging and erasing requirements mean some tricks have to be applied, and
the thing will probably need some smart buffering in RAM in order to be
fast. Since the data in flash doesn't need to be laid out sequentially
(read seek time being close to zero), it is possible they use a system
that makes all writes sequential (for instance), which would suit the
block-erasing requirements very well, with the block-mapping information
kept in RAM; or perhaps they use some form of copy-on-write. It would be
interesting to dissect this algorithm, especially the part that stores the
block mappings permanently, which cannot be kept in a fixed, known sector
since that sector would wear out pretty quickly.
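    A "make all writes sequential" scheme is essentially a log-structured
translation layer: every logical write appends to the next free page, and
a RAM-resident table remembers where each logical block now lives. A
minimal sketch of the general technique (my guess at the approach, not
Fusion-io's actual algorithm; garbage collection of stale pages is
omitted):

```python
class LogStructuredFTL:
    """Minimal log-structured remapper: all writes are sequential
    appends; a RAM mapping table tracks logical -> physical pages."""

    def __init__(self, total_pages: int):
        self.flash = [None] * total_pages  # simulated physical pages
        self.mapping = {}                  # logical page -> physical page
        self.head = 0                      # next free physical page

    def write(self, logical: int, data: bytes) -> None:
        if self.head >= len(self.flash):
            raise RuntimeError("log full -- garbage collection needed")
        self.flash[self.head] = data        # sequential append, never in place
        self.mapping[logical] = self.head   # old copy becomes stale garbage
        self.head += 1

    def read(self, logical: int) -> bytes:
        return self.flash[self.mapping[logical]]

ftl = LogStructuredFTL(total_pages=8)
ftl.write(3, b"v1")
ftl.write(3, b"v2")           # rewrite: new page, old copy left stale
print(ftl.read(3), ftl.head)  # b'v2' 2
```

Note that rewriting logical page 3 never touches the old physical page,
so random host writes become sequential flash writes; the hard parts this
sketch hides are reclaiming the stale pages and persisting the mapping
table itself.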

    Ergo, in order to benchmark this thing and get relevant results, I would
tend to think that you'd need to fill it to say, 80% of capacity and
bombard it with small random writes, the total amount of data being
written being many times more than the total capacity of the drive, in
order to test the remapping algorithms which are the weak point of such a
device.
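    That test recipe can be sketched as a simple driver loop. The fill
ratio, write size, and pass count below are the assumptions from the
paragraph above; a real benchmark would also open the device with
O_DIRECT and fsync per write to bypass the OS page cache:

```python
import os
import random
import time

def stress_random_writes(path: str, capacity: int,
                         fill_ratio: float = 0.8,
                         passes: int = 3,
                         page: int = 4096) -> float:
    """Fill a device/file to fill_ratio of capacity, then issue small
    random overwrites totaling `passes` times the capacity; return the
    elapsed seconds for the random-write phase. (Sketch only: no
    O_DIRECT, so the OS page cache is not bypassed here.)"""
    fill = int(capacity * fill_ratio) // page * page
    with open(path, "r+b") as f:
        f.write(os.urandom(fill))             # sequential fill phase
        n_writes = passes * capacity // page  # many times the capacity
        start = time.perf_counter()
        for _ in range(n_writes):
            f.seek(random.randrange(0, fill, page))
            f.write(os.urandom(page))         # small random overwrite
        f.flush()
        os.fsync(f.fileno())
        return time.perf_counter() - start
```

Writing several times the device's capacity is the key part: it forces
the remapping layer to run out of pre-erased blocks and fall back on its
garbage collection, which is exactly the weak point being probed.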

> *) do the software solutions really work (unproven)
> *) when are the major hardware vendors going to get involved.  they
> make a lot of money selling disks and supporting hardware (san, etc).

    Looking at the pictures of the "drive", I see a bunch of flash chips,
which probably make up the bulk of the cost, a switching power supply, a
small BGA chip which is probably DDR memory for buffering, and the mystery
ASIC, which is probably an FPGA; I would guess a Virtex-4 from the shape
of the package seen from the side in one of the pictures.

    A team of talented engineers can design and produce such a board, and
assembly would only use standard PCB processes. This is unlike harddisks,
which need a huge investment and a specialized factory because of the
complex mechanical parts and very tight tolerances. In the case of the
ioDrive, most of the value is in the intellectual property: software on
the PC CPU (the driver), embedded software, and the FPGA programming.

    All this points to a very different economic model for storage. I could
design and build a scaled down version of the ioDrive in my "garage", for
instance (well, the PCI Express licensing fees are hefty, so I'd use PCI,
but you get the idea).

    This means I think we are about to see a flood of these devices coming
from many small companies. This is very good for the end user, because
there will be competition, natural selection, and fast evolution.

    Interesting times ahead!

> I'm not particularly enamored of having a storage device be stuck
> directly in a pci slot -- although I understand it's probably
> necessary in the short term as flash changes all the rules and you
> can't expect it to run well using mainstream hardware raid
> controllers.  By using their own device they have complete control of
> the i/o stack up to the o/s driver level.

    Well, SATA is great for harddisks: small cables, less clutter, less
failure-prone than 80-conductor ribbon cables, faster, cheaper, etc.

    Basically, serial LVDS (low-voltage differential signalling)
point-to-point links (SATA, PCI-Express, etc.) are replacing parallel
buses (PCI, IDE) everywhere, except where you need extremely low latency
combined with extremely high throughput (like RAM). Point-to-point is much
better because there is no contention. SATA is too slow for flash, though:
a single link tops out at a few gigabits per second. That only leaves
PCI-Express. And the humongous data rates this "drive" puts out are not
going to go through any cheap cable.

    Therefore we are probably going to see a lot more PCI-Express flash
drives until a standard comes up to allow the RAID-Card + "drives"
paradigm. But it probably won't involve cables and bays, most likely Flash
sticks just like we have RAM sticks now, and a RAID controller on the mobo
or a PCI-Express card. Or perhaps it will just be software RAID.

    As for reliability of this device, I'd say the failure point is the Flash
chips, as stated by the manufacturer. Wear levelling algorithms are going
to matter a lot.
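    A wear-levelling policy can be as simple as keeping a per-block erase
counter and always allocating the least-worn free block. A minimal sketch
of that greedy policy (a generic textbook approach, not the ioDrive's
documented algorithm):

```python
import heapq

class WearLeveler:
    """Allocate erase blocks greedily by lowest erase count, so wear
    spreads evenly instead of burning out a few hot blocks."""

    def __init__(self, n_blocks: int):
        # min-heap of (erase_count, block_id) for all free blocks
        self.free = [(0, b) for b in range(n_blocks)]
        heapq.heapify(self.free)

    def allocate(self) -> int:
        count, block = heapq.heappop(self.free)  # least-worn block wins
        return block

    def release(self, block: int, erase_count: int) -> None:
        # the block re-enters the pool after one more erase cycle
        heapq.heappush(self.free, (erase_count + 1, block))

wl = WearLeveler(4)
first = wl.allocate()   # a never-erased block
wl.release(first, 0)    # it now carries one erase
second = wl.allocate()  # a different, still-unworn block is preferred
print(first, second)
```

The point of the greedy choice is that a hot logical address keeps
landing on whichever physical block has been erased the least, so the
write endurance of the whole chip is consumed roughly uniformly.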
