Re: Question: BlockSize > 8192 with FusionIO - Mailing list pgsql-performance

From Scott Carey
Subject Re: Question: BlockSize > 8192 with FusionIO
Date
Msg-id 42AF139A-0385-4226-B81C-9569FB64873E@richrelevance.com
In response to Re: Question: BlockSize > 8192 with FusionIO  (Merlin Moncure <mmoncure@gmail.com>)
List pgsql-performance
On Jan 4, 2011, at 8:48 AM, Merlin Moncure wrote:

> On Mon, Jan 3, 2011 at 9:13 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> Strange, John W wrote:
>>>
>>> Has anyone had a chance to recompile and try a larger blocksize
>>> than 8192 with pSQL 8.4.x?
>>
>> While I haven't done the actual experiment you're asking about, the problem
>> working against you here is how WAL data is used to protect against partial
>> database writes.  See the documentation for full_page_writes at
>> http://www.postgresql.org/docs/current/static/runtime-config-wal.html
>>  Because full size copies of the blocks have to get written there, attempts
>> to chunk writes into larger pieces end up requiring a correspondingly larger
>> volume of writes to protect against partial writes to those pages.  You
>> might get a nice efficiency gain on the read side, but the situation when
>> under a heavy write load (the main thing you have to be careful about with
>> these SSDs) is much less clear.
>
> most flash drives, especially mlc flash, use huge blocks anyways on
> physical level.  the numbers claimed here
> (http://www.fusionio.com/products/iodrive/)  (141k write iops) are
> simply not believable without write buffering.  i didn't see any note
> of how fault tolerance is maintained through the buffer (anyone
> know?).


Flash may have very large erase blocks -- 4k to 16M, but you can write to it at much smaller block sizes sequentially.

It has to erase a block in bulk, but it can write to an erased block piece by piece, sequentially (typically 512 or 4096
bytes at a time, but some flash uses 8k or 16k).

Older MLC NAND flash could be written to a couple of bytes at a time -- but drives today incorporate too much ECC and
use larger chunks, so they cannot do that.  The minimum write size is now set by the ECC requirements and not by the
physical NAND flash requirements.

So, buffering isn't that big of a requirement with the current LBA -> physical translations, which change all writes --
random or not -- to sequential writes in one erase block.
But performance will not be all that good if you wait for each write to complete, especially with MLC.  Turn off the
buffer on an Intel SLC drive, for example, and write IOPS is cut by 1/3 or more -- to 'only' 1000 or so iops.
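
To make the LBA -> physical remapping concrete, here is a toy sketch in Python. It is only an illustration of the idea and not how any particular drive's firmware works: the class name ToyFTL, the constant PAGES_PER_ERASE_BLOCK and the page counts are all made up for the example, and it leaves out garbage collection, wear leveling and ECC entirely. Every write goes to the next erased page regardless of which logical address it targets, so random logical writes come out as sequential physical writes.

```python
# Toy flash translation layer (FTL). Illustrative assumptions only:
# names and sizes are hypothetical, not taken from any real drive.
PAGES_PER_ERASE_BLOCK = 4  # real erase blocks hold far more pages


class ToyFTL:
    def __init__(self, num_erase_blocks):
        self.num_pages = num_erase_blocks * PAGES_PER_ERASE_BLOCK
        self.flash = [None] * self.num_pages  # physical pages, None = erased
        self.lba_to_phys = {}                 # logical address -> physical page
        self.next_free = 0                    # append point: next erased page

    def write(self, lba, data):
        """Store data at the append point, whatever the logical address is."""
        if self.next_free >= self.num_pages:
            # A real FTL would garbage-collect stale pages and erase a block.
            raise RuntimeError("out of erased pages")
        phys = self.next_free
        self.flash[phys] = data
        self.lba_to_phys[lba] = phys          # any older copy becomes stale
        self.next_free += 1
        return phys

    def read(self, lba):
        """Follow the mapping to the most recent copy of this logical block."""
        return self.flash[self.lba_to_phys[lba]]


ftl = ToyFTL(num_erase_blocks=2)
# Scattered logical addresses, with LBA 12 overwritten at the end.
writes = [(907, "a"), (12, "b"), (500, "c"), (12, "d")]
physical = [ftl.write(lba, data) for lba, data in writes]
print(physical)      # physical page used by each write, in order
print(ftl.read(12))  # latest data for LBA 12
```

The overwrite of LBA 12 does not touch the old page; it only updates the map, which is why the old copy has to be cleaned up later by a bulk erase.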
