Thread: Fusion-io ioDrive

Fusion-io ioDrive

From: "Jeffrey Baker"
I recently got my hands on a device called ioDrive from a company
called Fusion-io.  The ioDrive is essentially 80GB of flash on a PCI
card.  It has its own driver for Linux completely outside of the
normal scsi/sata/sas/fc block device stack, but from the user's
perspective it behaves like a block device.  I put the ioDrive in an
ordinary PC with 1GB of memory, a single 2.2GHz AMD CPU, and an
existing Areca RAID with 6 SATA disks and a 128MB cache.  I tested the
device with PostgreSQL 8.3.3 on CentOS 5.3 x86_64 (Linux 2.6.18).

The pgbench database was initialized with scale factor 100.  Test runs
were performed with 8 parallel connections (-c 8), both read-only (-S)
and read-write.  PostgreSQL itself was configured with 256MB of shared
buffers and 32 checkpoint segments.  Otherwise the configuration was
all defaults.
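
For anyone who wants to reproduce this, the invocations look roughly
like the following; the database name and -t counts are placeholders,
since I haven't given the exact run lengths above:

    pgbench -i -s 100 pgbench          # initialize, ~1.5GB of data
    pgbench -c 8 -t 10000 pgbench      # read-write run
    pgbench -c 8 -t 10000 -S pgbench   # read-only (SELECT-only) run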

In the following table, the "RAID" configuration has the xlogs on a
RAID 0 of 2 10krpm disks with ext2, and the heap is on a RAID 0 of 4
7200rpm disks with ext3.  The "Fusion" configuration has everything on
the ioDrive with xfs.  I tried the ioDrive with ext2 and ext3 but it
didn't seem to make any difference.
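
(Filesystem setup was nothing special; something like the following,
where /dev/fioa is the block device the Fusion-io driver creates and
the mount point is an arbitrary placeholder:)

    mkfs.xfs /dev/fioa
    mkdir -p /mnt/iodrive
    mount /dev/fioa /mnt/iodrive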

                            Service Time Percentile, millis
        R/W TPS   R-O TPS      50th   80th   90th   95th
RAID      182       673         18     32     42     64
Fusion    971      4792          8      9     10     11

Basically the ioDrive is smoking the RAID.  The only real problem with
this benchmark is that the machine became CPU-limited rather quickly.
During the runs with the ioDrive, iowait was pretty well zero, with
user CPU being about 75% and system getting about 20%.

Now, I will say a couple of other things.  The Linux driver for this
piece of hardware is pretty dodgy.  Sub-alpha quality actually.  But
they seem to be working on it.  Also there's no driver for
OpenSolaris, Mac OS X, or Windows right now.  In fact there's not even
anything available for Debian or other respectable Linux distros, only
Red Hat and its clones.  The other problem is the 80GB model is too
small to hold my entire DB, although it could be used as a tablespace
for some critical tables.  But hey, it's fast.

I'm going to put this board into my 8-way Xeon to see if it goes any
faster with more CPU available.

I'd be interested in hearing experiences with other flash storage
devices, SSDs, and that type of thing.  So far, this is the fastest
hardware I've seen for the price.

-jwb

Re: Fusion-io ioDrive

From: "Andrej Ricnik-Bay"
On 02/07/2008, Jeffrey Baker <jwbaker@gmail.com> wrote:

>  Red Hat and its clones.  The other problem is the 80GB model is too
>  small to hold my entire DB, although it could be used as a tablespace
>  for some critical tables.  But hey, it's fast.
And when/if it dies, please give us a rough guesstimate of its
life-span in terms of read/write cycles.  Sounds exciting, though!


Cheers,
Andrej

--
Please don't top post, and don't use HTML e-Mail :}  Make your quotes concise.

http://www.american.edu/econ/notes/htmlmail.htm

Re: Fusion-io ioDrive

From: "Jeffrey Baker"
On Tue, Jul 1, 2008 at 6:17 PM, Andrej Ricnik-Bay
<andrej.groups@gmail.com> wrote:
> On 02/07/2008, Jeffrey Baker <jwbaker@gmail.com> wrote:
>
>>  Red Hat and its clones.  The other problem is the 80GB model is too
>>  small to hold my entire DB, although it could be used as a tablespace
>>  for some critical tables.  But hey, it's fast.
> And when/if it dies, please give us a rough guesstimate of its
> life-span in terms of read/write cycles.  Sounds exciting, though!

Yeah.  The manufacturer rates it for 5 years in constant use.  I
remain skeptical.

Re: Fusion-io ioDrive

From: Greg Smith
On Tue, 1 Jul 2008, Jeffrey Baker wrote:

> The only real problem with this benchmark is that the machine became
> CPU-limited rather quickly. During the runs with the ioDrive, iowait was
> pretty well zero, with user CPU being about 75% and system getting about
> 20%.

You might try reducing the number of clients; with a single CPU like yours
I'd expect peak throughput here at closer to 4 clients rather
than 8, and possibly as low as 2.  What I normally do is run a quick scan
of a few client loads before running a long test to figure out where the
general area of peak throughput is.  For your 8-way box, it will be closer
to 32 clients.
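
Such a sweep is easy to script; e.g. short read-only runs just to find
the knee (the transaction count here is only illustrative):

    for c in 1 2 4 8 16 32; do
        echo "=== $c clients ==="
        pgbench -c $c -t 2000 -S pgbench
    done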

Well done test though.  When you try again with the faster system, the
only other postgresql.conf parameter I'd suggest bumping up is
wal_buffers; that can limit best pgbench scores a bit and it only needs a
MB or so to make that go away.
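
That is, something like this in postgresql.conf:

    wal_buffers = 1MB        # the 8.3 default is 64kB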

It's also worth noting that the gap between the two types of storage will
go up if you increase the scale further; scale=100 only makes a 1.5GB or
so database.  If you collected a second data point at a scale of 500
I'd expect the standard disk results would halve by then, but I don't know
what the fusion device would do and I'm kind of curious.  You may need to
increase this regardless because the bigger box has more RAM, and you want
the database to be larger than RAM to get interesting results in this type
of test.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Fusion-io ioDrive

From: "Andrej Ricnik-Bay"
On 02/07/2008, Jeffrey Baker <jwbaker@gmail.com> wrote:
> Yeah.  The manufacturer rates it for 5 years in constant use.  I
>  remain skeptical.
I read in one of their spec-sheets that with continuous writes it
should survive roughly 3.4 years ... I'd be a tad more conservative,
I guess, and knock 20-30% off that figure if I were considering
something like it for production use.

And I'll be very indiscreet and ask: "How much do they go for?" :}
I couldn't find anyone actually offering them in 5 minutes of
googling, just a ball-park figure of US$2,400 ...



Cheers,
Andrej

--
Please don't top post, and don't use HTML e-Mail :}  Make your quotes concise.

http://www.american.edu/econ/notes/htmlmail.htm

Re: Fusion-io ioDrive

From: "Merlin Moncure"
On Tue, Jul 1, 2008 at 8:18 PM, Jeffrey Baker <jwbaker@gmail.com> wrote:
> I recently got my hands on a device called ioDrive from a company
> called Fusion-io.  The ioDrive is essentially 80GB of flash on a PCI
> card.  It has its own driver for Linux completely outside of the
> normal scsi/sata/sas/fc block device stack, but from the user's
> perspective it behaves like a block device.  I put the ioDrive in an
> ordinary PC with 1GB of memory, a single 2.2GHz AMD CPU, and an
> existing Areca RAID with 6 SATA disks and a 128MB cache.  I tested the
> device with PostgreSQL 8.3.3 on CentOS 5.3 x86_64 (Linux 2.6.18).
>
> The pgbench database was initialized with scale factor 100.  Test runs
> were performed with 8 parallel connections (-c 8), both read-only (-S)
> and read-write.  PostgreSQL itself was configured with 256MB of shared
> buffers and 32 checkpoint segments.  Otherwise the configuration was
> all defaults.
>
> In the following table, the "RAID" configuration has the xlogs on a
> RAID 0 of 2 10krpm disks with ext2, and the heap is on a RAID 0 of 4
> 7200rpm disks with ext3.  The "Fusion" configuration has everything on
> the ioDrive with xfs.  I tried the ioDrive with ext2 and ext3 but it
> didn't seem to make any difference.
>
>                            Service Time Percentile, millis
>        R/W TPS   R-O TPS      50th   80th   90th   95th
> RAID      182       673         18     32     42     64
> Fusion    971      4792          8      9     10     11
>
> Basically the ioDrive is smoking the RAID.  The only real problem with
> this benchmark is that the machine became CPU-limited rather quickly.
> During the runs with the ioDrive, iowait was pretty well zero, with
> user CPU being about 75% and system getting about 20%.
>
> Now, I will say a couple of other things.  The Linux driver for this
> piece of hardware is pretty dodgy.  Sub-alpha quality actually.  But
> they seem to be working on it.  Also there's no driver for
> OpenSolaris, Mac OS X, or Windows right now.  In fact there's not even
> anything available for Debian or other respectable Linux distros, only
> Red Hat and its clones.  The other problem is the 80GB model is too
> small to hold my entire DB, although it could be used as a tablespace
> for some critical tables.  But hey, it's fast.
>
> I'm going to put this board into my 8-way Xeon to see if it goes any
> faster with more CPU available.
>
> I'd be interested in hearing experiences with other flash storage
> devices, SSDs, and that type of thing.  So far, this is the fastest
> hardware I've seen for the price.

Any chance of getting bonnie++ results?  How long are your pgbench
runs?  Are you sure that you are seeing proper syncs to the device?
(This is my largest concern, actually.)
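
At a minimum I'd confirm something like:

    psql pgbench -c "SHOW fsync;"            # should be 'on'
    psql pgbench -c "SHOW wal_sync_method;"  # fdatasync is the Linux default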

merlin.

Re: Fusion-io ioDrive

From: "Jonah H. Harris"
On Tue, Jul 1, 2008 at 8:18 PM, Jeffrey Baker <jwbaker@gmail.com> wrote:
> Basically the ioDrive is smoking the RAID.  The only real problem with
> this benchmark is that the machine became CPU-limited rather quickly.

That's traditionally the problem with everything being in memory.
Unless the database algorithms are designed to exploit L1/L2 cache and
RAM, which is not the case for a disk-based DBMS, you generally lose
some concurrency due to the additional CPU overhead of playing only
with memory.  This is generally acceptable if you're going to trade
off higher concurrency for faster service times.  And, it isn't only
evidenced in single systems where a disk-based DBMS is 100% cached,
but also in most shared-memory clustering architectures.

In most cases, when you're waiting on disk I/O, you can generally
support higher concurrency because the OS can utilize the CPU's free
cycles (during the wait) to handle other users.  In short, sometimes,
disk I/O is a good thing; it just depends on what you need.

--
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation | fax: 732.331.1301
499 Thornall Street, 2nd Floor | jonah.harris@enterprisedb.com
Edison, NJ 08837 | http://www.enterprisedb.com/

Re: Fusion-io ioDrive

From: Cédric Villemain
On Wednesday 02 July 2008, Jonah H. Harris wrote:
> On Tue, Jul 1, 2008 at 8:18 PM, Jeffrey Baker <jwbaker@gmail.com> wrote:
> > Basically the ioDrive is smoking the RAID.  The only real problem with
> > this benchmark is that the machine became CPU-limited rather quickly.
>
> That's traditionally the problem with everything being in memory.
> Unless the database algorithms are designed to exploit L1/L2 cache and
> RAM, which is not the case for a disk-based DBMS, you generally lose
> some concurrency due to the additional CPU overhead of playing only
> with memory.  This is generally acceptable if you're going to trade
> off higher concurrency for faster service times.  And, it isn't only
> evidenced in single systems where a disk-based DBMS is 100% cached,
> but also in most shared-memory clustering architectures.

My experience is that using an i-RAM for replication (on the slave) is very
good.  I am unfortunately unable to provide any numbers or benchmarks :/
(I'll try to get some, but it won't be easy.)

I would probably use some flash/memory disk once PostgreSQL gets warm
standby at the transaction level (and is up for read-only queries)...

>
> In most cases, when you're waiting on disk I/O, you can generally
> support higher concurrency because the OS can utilize the CPU's free
> cycles (during the wait) to handle other users.  In short, sometimes,
> disk I/O is a good thing; it just depends on what you need.
>
> --
> Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
> EnterpriseDB Corporation | fax: 732.331.1301
> 499 Thornall Street, 2nd Floor | jonah.harris@enterprisedb.com
> Edison, NJ 08837 | http://www.enterprisedb.com/



--
Cédric Villemain
Administrateur de Base de Données
Cel: +33 (0)6 74 15 56 53
http://dalibo.com - http://dalibo.org


Re: Fusion-io ioDrive

From: "Jeffrey Baker"
On Tue, Jul 1, 2008 at 5:18 PM, Jeffrey Baker <jwbaker@gmail.com> wrote:
> I recently got my hands on a device called ioDrive from a company
> called Fusion-io.  The ioDrive is essentially 80GB of flash on a PCI
> card.

[...]

>                            Service Time Percentile, millis
>        R/W TPS   R-O TPS      50th   80th   90th   95th
> RAID      182       673         18     32     42     64
> Fusion    971      4792          8      9     10     11

Essentially the same benchmark, but on a quad Xeon 2GHz with 3GB main
memory and a scale factor of 300.  Really all we learn from this
exercise is the sheer futility of throwing CPU at PostgreSQL.

R/W TPS: 1168
R-O TPS: 6877

Quadrupling the CPU resources and tripling the RAM results in a 20% or
44% performance increase on read/write and read-only loads,
respectively.  The system loafs along with 2-3 CPUs completely idle,
although oddly iowait is 0%.  I think the system is constrained by
context switches, which run into the tens of thousands per second.  This
is a problem with the ioDrive software, not with pg.
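
For reference, the context switch figure is just the "cs" column from
running vmstat alongside the benchmark:

    vmstat 1      # watch the "cs" column, context switches per second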

Someone asked for bonnie++ output:

Block output: 495MB/s, 81% CPU
Block input: 676MB/s, 93% CPU
Block rewrite: 262MB/s, 59% CPU

Pretty respectable.  In the same ballpark as an HP MSA70 + P800 with
25 spindles.
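
For anyone comparing, the invocation was along these lines (the path is
the same placeholder as before; the file size just has to exceed RAM so
the page cache can't flatter the device):

    bonnie++ -d /mnt/iodrive -s 8192 -u postgres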

-jwb

Re: Fusion-io ioDrive

From: "Merlin Moncure"
On Sat, Jul 5, 2008 at 2:41 AM, Jeffrey Baker <jwbaker@gmail.com> wrote:
>>                            Service Time Percentile, millis
>>        R/W TPS   R-O TPS      50th   80th   90th   95th
>> RAID      182       673         18     32     42     64
>> Fusion    971      4792          8      9     10     11
>
> Someone asked for bonnie++ output:
>
> Block output: 495MB/s, 81% CPU
> Block input: 676MB/s, 93% CPU
> Block rewrite: 262MB/s, 59% CPU
>
> Pretty respectable.  In the same ballpark as an HP MSA70 + P800 with
> 25 spindles.

You left off the 'seeks' portion of the bonnie++ results -- this is
actually the most important portion of the test.  Based on your tps
#s, I'm expecting seeks equivalent to about 10 10k drives configured in
a raid 10, or around 1000-1500.  They didn't publish any prices, so
it's hard to say if this is 'cost competitive'.

These numbers are indeed fantastic, disruptive even.  If I were testing
the device for consideration in high duty server environments, I would
be doing durability testing right now...I would be slamming the database
with transactions (fsync on, etc.) and then powering off the device.  I
would do this several times...making sure the software layer isn't
doing some mojo that is technically cheating.
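
A crude sketch of what I mean, where sysrq-b reboots instantly without
syncing -- a software stand-in for yanking the power cord (paths and
counts made up):

    # postgresql.conf must have fsync = on for this to prove anything
    pgbench -i -s 100 pgbench
    pgbench -c 8 -t 1000000 pgbench &   # hammer it with r/w transactions
    sleep 60
    echo 1 > /proc/sys/kernel/sysrq
    echo b > /proc/sysrq-trigger        # instant reboot, nothing synced
    # after reboot: postgres must replay WAL cleanly; then audit the data
    # (the pgbench branch/teller/account balances must still be consistent)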

I'm not particularly enamored of having a storage device be stuck
directly in a pci slot -- although I understand it's probably
necessary in the short term as flash changes all the rules and you
can't expect it to run well using mainstream hardware raid
controllers.  By using their own device they have complete control of
the i/o stack up to the o/s driver level.

I've been thinking for a while now that flash is getting ready to
explode into use in server environments.  The outstanding questions I
see are:
*) is the write endurance problem truly solved (giving at least a 5-10
year lifetime)?
*) what are the true odds of catastrophic device failure? (industry
claims less; we'll see)
*) is the flash random write problem going to be solved in hardware or
by specialized solid state write caching techniques?  At least
currently, it seems like software is filling the role.
*) do the software solutions really work? (unproven)
*) when are the major hardware vendors going to get involved?  They
make a lot of money selling disks and supporting hardware (san, etc.).

merlin

Re: Fusion-io ioDrive

From: "Merlin Moncure"
On Wed, Jul 2, 2008 at 7:41 AM, Jonah H. Harris <jonah.harris@gmail.com> wrote:
> On Tue, Jul 1, 2008 at 8:18 PM, Jeffrey Baker <jwbaker@gmail.com> wrote:
>> Basically the ioDrive is smoking the RAID.  The only real problem with
>> this benchmark is that the machine became CPU-limited rather quickly.
>
> That's traditionally the problem with everything being in memory.
> Unless the database algorithms are designed to exploit L1/L2 cache and
> RAM, which is not the case for a disk-based DBMS, you generally lose
> some concurrency due to the additional CPU overhead of playing only
> with memory.  This is generally acceptable if you're going to trade
> off higher concurrency for faster service times.  And, it isn't only
> evidenced in single systems where a disk-based DBMS is 100% cached,
> but also in most shared-memory clustering architectures.
>
> In most cases, when you're waiting on disk I/O, you can generally
> support higher concurrency because the OS can utilize the CPU's free
> cycles (during the wait) to handle other users.  In short, sometimes,
> disk I/O is a good thing; it just depends on what you need.

I have a lot of problems with your statements.  First of all, we are
not really talking about 'RAM' storage...I think your comments would
be more on point if we were talking about mounting database storage
directly from server memory, for example.  Server memory and cpu are
involved only to the extent that the o/s uses them for caching and
filesystem things, and inside the device driver.

Also, your comments seem to indicate that having a slower device leads
to higher concurrency because it allows the process to yield and do
other things.  This is IMO simply false.  With faster storage cpu
loads will increase but only because the overall system throughput
increases and cpu/memory 'work' increases in terms of overall system
activity.  Presumably as storage approaches speeds of main system
memory the algorithms of dealing with it will become simpler (not
having to go through acrobatics to try to make everything
sequential) and thus faster.

I also find the remarks about software 'optimizing' for strict hardware
assumptions (L1+L2 cache) a little suspicious.  In some old programs I
remember keeping a giant C 'union' of critical structures that was
exactly 8k to fit in the 486 cpu cache.  In modern terms I think that
type of programming (sans some specialized environments) is usually
counter-productive...I think PostgreSQL's approach of deferring as
much work as possible to the o/s is a great approach.

merlin

Re: Fusion-io ioDrive

From: PFC

> *) is the flash random write problem going to be solved in hardware or
> by specialized solid state write caching techniques?  At least
> currently, it seems like software is filling the role.

    Those flash chips are page-based, not unlike a harddisk, i.e. you cannot
erase and write a byte, you must erase and write a full page.  The size of
said page depends on the chip implementation.  I don't know which chips they
used, so I cannot comment there, but you can easily imagine that smaller
pages yield faster random IO write throughput.  For reads, you must first
select a page and then access it.  Thus, it is not like RAM at all.  It is
much more similar to a harddisk with an almost zero seek time (on reads)
and a very small, but significant, seek time (on writes), because a page
must be erased before being written.

    Big flash chips include ECC inside to improve reliability. Basically the
chips include a small static RAM buffer. When you want to read a page it
is first copied to SRAM and ECC checked. When you want to write a page you
first write it to SRAM and then order the chip to write it to flash.

    Usually you can't erase a page, you must erase a block which contains
many pages (this is probably why most flash SSDs suck at random writes).

    NAND flash will never replace SDRAM because of these restrictions (NOR
flash acts like RAM but it is slow and has less capacity).
    However NAND flash is well suited to replace harddisks.

    When writing a page you write it to the small static RAM buffer on the
chip (fast) and tell the chip to write it to flash (slow). When the chip
is busy erasing or writing you can not do anything with it, but you can
still talk to the other chips. Since the ioDrive has many chips I'd bet
they use this feature.

    I don't know about the ioDrive implementation but you can see that the
paging and erasing requirements mean some tricks have to be applied and
the thing will probably need some smart buffering in RAM in order to be
fast. Since the data in a flash doesn't need to be sequential (read seek
time being close to zero) it is possible they use a system which makes all
writes sequential (for instance) which would suit the block erasing
requirements very well, with the information about block mapping stored in
RAM, or perhaps they use some form of copy-on-write.  It would be
interesting to dissect this algorithm, especially the part which allows
the block mappings to be stored permanently; they cannot be stored in a
fixed, known sector, since it would wear out pretty quickly.

    Ergo, in order to benchmark this thing and get relevant results, I would
tend to think that you'd need to fill it to say, 80% of capacity and
bombard it with small random writes, the total amount of data being
written being many times more than the total capacity of the drive, in
order to test the remapping algorithms which are the weak point of such a
device.
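
    A crude loop like this would approximate that workload (a real test
would use a proper benchmark tool; this just illustrates the access
pattern, and the mount point is made up):

    dd if=/dev/zero of=/mnt/iodrive/fill bs=1M count=64000   # ~80% of 80GB
    BLOCKS=$((64000 * 256))              # number of 4kB blocks in the file
    while true; do
        dd if=/dev/zero of=/mnt/iodrive/fill bs=4k count=1 \
           seek=$((RANDOM * RANDOM % BLOCKS)) \
           conv=notrunc oflag=direct 2>/dev/null
    done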

> *) do the software solutions really work? (unproven)
> *) when are the major hardware vendors going to get involved?  They
> make a lot of money selling disks and supporting hardware (san, etc.).

    Looking at the pictures of the "drive" I see a bunch of Flash chips which
probably make up the bulk of the cost, a switching power supply, a small BGA
chip which is probably DDR memory for buffering, and the mystery ASIC, which
is probably an FPGA; I would guess a Virtex4, from the shape of the package
seen from the side in one of the pictures.

    A team of talented engineers can design and produce such a board, and
assembly would only use standard PCB processes. This is unlike harddisks,
which need a huge investment and a specialized factory because of the
complex mechanical parts and very tight tolerances. In the case of the
ioDrive, most of the value is in the intellectual property : software on
the PC CPU (driver), embedded software, and programming the FPGA.

    All this points to a very different economic model for storage. I could
design and build a scaled down version of the ioDrive in my "garage", for
instance (well, the PCI Express licensing fees are hefty, so I'd use PCI,
but you get the idea).

    This means I think we are about to see a flood of these devices coming
from many small companies.  This is very good for the end user, because
there will be competition, natural selection, and fast evolution.

    Interesting times ahead !

> I'm not particularly enamored of having a storage device be stuck
> directly in a pci slot -- although I understand it's probably
> necessary in the short term as flash changes all the rules and you
> can't expect it to run well using mainstream hardware raid
> controllers.  By using their own device they have complete control of
> the i/o stack up to the o/s driver level.

    Well, SATA is great for harddisks : small cables, less clutter, less
failure prone than 80 conductor cables, faster, cheaper, etc...

    Basically serial LVDS (low voltage differential signalling) point to
point links (SATA, PCI-Express, etc) are replacing parallel busses (PCI,
IDE) everywhere, except where you need extremely low latency combined with
extremely high throughput (like RAM). Point-to-point is much better
because there is no contention. SATA is too slow for Flash, though,
because it has only 2 lanes. This only leaves PCI-Express. However the
humongous data rates this "drive" puts out are not going to go through a
cable that is going to be cheap.

    Therefore we are probably going to see a lot more PCI-Express flash
drives until a standard comes up to allow the RAID-Card + "drives"
paradigm. But it probably won't involve cables and bays, most likely Flash
sticks just like we have RAM sticks now, and a RAID controller on the mobo
or a PCI-Express card. Or perhaps it will just be software RAID.

    As for reliability of this device, I'd say the failure point is the Flash
chips, as stated by the manufacturer. Wear levelling algorithms are going
to matter a lot.

Re: Fusion-io ioDrive

From: "Jonah H. Harris"
On Mon, Jul 7, 2008 at 9:23 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> I have a lot of problems with your statements.  First of all, we are
> not really talking about 'RAM' storage...I think your comments would
> be more on point if we were talking about mounting database storage
> directly from server memory, for example.  Server memory and cpu are
> involved only to the extent that the o/s uses them for caching and
> filesystem things, and inside the device driver.

I'm not sure how those cards work, but my guess is that the CPU will
go 100% busy (with a near-zero I/O wait) on any sizable workload.  In
this case, the current pgbench configuration being used is quite small
and probably won't resemble this.

> Also, your comments seem to indicate that having a slower device leads
> to higher concurrency because it allows the process to yield and do
> other things.  This is IMO simply false.

Argue all you want, but this is a fairly well known (20+ year-old) behavior.

> With faster storage cpu loads will increase but only because the overall
> system throughput increases and cpu/memory 'work' increases in terms
> of overall system activity.

Again, I said that response times (throughput) would improve.  I'd
like to see your argument for explaining how you can handle more
CPU-only operations when 0% of the CPU is free for use.

> Presumably as storage approaches speeds of main system memory
> the algorithms of dealing with it will become simpler (not having to
> go through acrobatics to try to make everything sequential)
> and thus faster.

We'll have to see.

> I also find the remarks about software 'optimizing' for strict hardware
> assumptions (L1+L2 cache) a little suspicious.  In some old programs I
> remember keeping a giant C 'union' of critical structures that was
> exactly 8k to fit in the 486 cpu cache.  In modern terms I think that
> type of programming (sans some specialized environments) is usually
> counter-productive...I think PostgreSQL's approach of deferring as
> much work as possible to the o/s is a great approach.

All of the major database vendors still see immense value in
optimizing their algorithms and memory structures for specific
platforms and CPU caches.  Hence, if they're *paying* money for
very specialized industry professionals to optimize in this way, I
would hesitate to say there isn't any value in it.  The fact is,
Postgres doesn't have those low-level resources, so for the most part,
I have to agree that it has to rely on the OS.

-Jonah

Re: Fusion-io ioDrive

From: "Jeffrey Baker"
On Mon, Jul 7, 2008 at 6:08 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Sat, Jul 5, 2008 at 2:41 AM, Jeffrey Baker <jwbaker@gmail.com> wrote:
>>>                            Service Time Percentile, millis
>>>        R/W TPS   R-O TPS      50th   80th   90th   95th
>>> RAID      182       673         18     32     42     64
>>> Fusion    971      4792          8      9     10     11
>>
>> Someone asked for bonnie++ output:
>>
>> Block output: 495MB/s, 81% CPU
>> Block input: 676MB/s, 93% CPU
>> Block rewrite: 262MB/s, 59% CPU
>>
>> Pretty respectable.  In the same ballpark as an HP MSA70 + P800 with
>> 25 spindles.
>
> You left off the 'seeks' portion of the bonnie++ results -- this is
> actually the most important portion of the test.  Based on your tps
> #s, I'm expecting seeks equivalent to about 10 10k drives configured in
> a raid 10, or around 1000-1500.  They didn't publish any prices, so
> it's hard to say if this is 'cost competitive'.

I left it out because bonnie++ reports it as "+++++" i.e. greater than
or equal to 100000 per second.

-jwb

Re: Fusion-io ioDrive

From: PFC
> PFC, I have to say these kinds of posts make me a fan of yours.  I've
> read many of your storage-related replies and have found them all very
> educational.  I just want to let you know I found your assessment of the
> impact of Flash storage perfectly worded and unbelievably insightful.
> Thanks a million for sharing your knowledge with the list. -Dan

    Hehe, thanks.

    There was a time when you had to be a big company full of cash to build a
computer, and then suddenly people did it in garages, like Wozniak and
Jobs, out of off-the-shelf parts.

    I feel the ioDrive guys are the same kind of hackers, except today's
hackers have much more powerful tools. Perhaps, and I hope it's true,
storage is about to undergo a revolution like the personal computer had
20-30 years ago, when the IBMs of the time were eaten from the roots up.

    IMHO the key is that you can build an ioDrive from off-the-shelf parts,
but you can't do that with a disk drive.
    Flash manufacturers are smelling blood; they profit from USB keys and
digicams, but imagine the market for solid state drives !
    And in this case the hardware is simple : flash, ram, an fpga, some chips,
nothing out of the ordinary; it is the brain juice in the software (which
includes FPGAs) which will sort out the high performance and reliability
winners from the rest.

    Lowering the barrier of entry is good for innovation.  I believe Linux
will benefit, too, since the target is (for now) high-performance servers,
and as shown by the ioDrive, innovative hackers prefer to write Linux
drivers rather than Vista (argh) drivers.

Re: Fusion-io ioDrive

From: Markus Wanner
Hi,

Jonah H. Harris wrote:
> I'm not sure how those cards work, but my guess is that the CPU will
> go 100% busy (with a near-zero I/O wait) on any sizable workload.  In
> this case, the current pgbench configuration being used is quite small
> and probably won't resemble this.

I'm not sure how they work either, but why should they require more CPU
cycles than any other PCIe SAS controller?

I think attaching the NAND chips directly to PCIe, instead of piping all
the data through SAS or (S)ATA (and then through PCIe as well), is a
clever step.  And if the controller chip on the card isn't absolutely
bogus, that certainly has the potential to reduce latency and improve
throughput - compared to other SSDs.

Or am I missing something?

Regards

Markus


Re: Fusion-io ioDrive

From: "Scott Carey"
Well, what does a revolution like this require of Postgres?   That is the question.

I have looked at the ioDrive, and it could increase our DB throughput significantly over a RAID array.

Ideally, I would put a few key tables and the WAL, etc., on it.  I'd also want all the sort or hash overflow from work_mem to go to this device.  Some of our tables / indexes are heavily written to for short periods of time, then more infrequently later -- these are partitioned by date.  I would put the fresh ones on such a device, then move them to the hard drives later.

Ideally, we would then need a few changes in Postgres to take full advantage of this:

#1  Per-Tablespace optimizer tuning parameters.  Arguably, this is already needed.  The tablespaces on such a solid state device would have random and sequential access at equal (low) cost.   Any one-size-fits-all set of optimizer variables is bound to cause performance issues when two tablespaces have dramatically different performance profiles.
#2  Optimally, work_mem could be shrunk, and the optimizer would have to stop preferring a sort + group_aggregate whenever it suspected that a hash_agg would overflow work_mem.  A disk-based hash_agg will pretty much win every time with such a device over a sort (in memory or not) once the number of rows to aggregate goes above a moderate threshold of a couple hundred thousand or so.
In fact, I have several examples with 8.3.3 and a standard RAID array where a hash_agg that spilled to disk (poor or purposely-distorted statistics cause this) was a lot faster than the sort the optimizer wanted to do instead.  Whatever mechanism calculates the cost of doing sorts or hashes on disk will need to be tunable per tablespace.

I suppose both of the above may be one task -- I don't know enough about the Postgres internals.

#3  Being able to move tables / indexes from one tablespace to another as efficiently as possible.

There are probably other enhancements that will help such a setup.  These were the first that came to mind.  (A rough sketch of where #1 and #3 stand today follows below.)
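
For what it's worth, #3 is already possible today, though ALTER TABLE ... SET TABLESPACE physically copies the table under an exclusive lock, and #1 can only be faked globally in 8.3.  The table and tablespace names below are made up:

    psql -c "ALTER TABLE hot_table SET TABLESPACE iodrive_space;"
    psql -c "ALTER INDEX hot_table_pkey SET TABLESPACE iodrive_space;"

    # What you'd want per tablespace, but 8.3 only offers globally in
    # postgresql.conf:
    #   seq_page_cost    = 1.0
    #   random_page_cost = 1.0    # random ~ sequential on flash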

On Tue, Jul 8, 2008 at 2:49 AM, Markus Wanner <markus@bluegap.ch> wrote:
> Hi,
>
> Jonah H. Harris wrote:
> > I'm not sure how those cards work, but my guess is that the CPU will
> > go 100% busy (with a near-zero I/O wait) on any sizable workload.  In
> > this case, the current pgbench configuration being used is quite small
> > and probably won't resemble this.
>
> I'm not sure how they work either, but why should they require more CPU
> cycles than any other PCIe SAS controller?
>
> I think attaching the NAND chips directly to PCIe, instead of piping all
> the data through SAS or (S)ATA (and then through PCIe as well), is a
> clever step.  And if the controller chip on the card isn't absolutely
> bogus, that certainly has the potential to reduce latency and improve
> throughput - compared to other SSDs.
>
> Or am I missing something?
>
> Regards
>
> Markus

Re: Fusion-io ioDrive

From: Jeremy Harris
Scott Carey wrote:
> Well, what does a revolution like this require of Postgres?   That is the
> question.
[...]
> #1  Per-Tablespace optimizer tuning parameters.

... automatically measured?


Cheers,
   Jeremy