Re: SSD + RAID - Mailing list pgsql-performance

From Scott Carey
Subject Re: SSD + RAID
Date
Msg-id B9BC5B98-5128-49F8-9CB9-11DD5AE983DD@richrelevance.com
In response to Re: SSD + RAID  ("Pierre C" <lists@peufeu.com>)
List pgsql-performance
On Feb 23, 2010, at 3:49 AM, Pierre C wrote:
> Now I wonder about something. SSDs use wear-leveling which means the
> information about which block was written where must be kept somewhere.
> Which means this information must be updated. I wonder how crash-safe and
> how atomic these updates are, in the face of a power loss.  This is just
> like a filesystem. You've been talking only about data, but the block
> layout information (metadata) is subject to the same concerns. If the
> drive says it's written, not only the data must have been written, but
> also the information needed to locate that data...
>
> Therefore I think the yank-the-power-cord test should be done with random
> writes happening on an aged and mostly-full SSD... and afterwards, I'd be
> interested to know if not only the last txn really committed, but if some
> random parts of other stuff weren't "wear-leveled" into oblivion at the
> power loss...
>

A couple of years ago I postulated that SSDs could do random writes fast if they remapped blocks.  Microsoft's SSD
whitepaper at the time hinted at this too.
Persisting the remap data is not hard.  It goes in the same location as the data, or in a separate area that can be
written to linearly.

Each block may contain its LBA and a transaction ID or other atomic counter.  Or another block can hold that info.  When
the SSD powers up, it can build its table of LBA -> block by reading that data, inverting it, and keeping the highest
transaction ID for duplicate LBA claims.
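
To make the power-up scan concrete, here is a minimal sketch in Python.  It is only an illustration of the idea above, not how any real drive firmware works; the names BlockTag and rebuild_map are made up for this example, and it assumes every physical block carries exactly one (LBA, txID) tag.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class BlockTag(NamedTuple):
    """Hypothetical per-block metadata: where the block is and what it claims to hold."""
    physical: int  # physical block number on the flash
    lba: int       # logical block address this block claims
    txid: int      # transaction ID (higher means written later)


def rebuild_map(tags: Iterable[BlockTag]) -> Dict[int, int]:
    """Invert physical -> (lba, txid) into lba -> physical, newest claim wins."""
    best: Dict[int, Tuple[int, int]] = {}  # lba -> (txid, physical)
    for tag in tags:
        current = best.get(tag.lba)
        if current is None or tag.txid > current[0]:
            best[tag.lba] = (tag.txid, tag.physical)
    return {lba: physical for lba, (_txid, physical) in best.items()}


# Two blocks claim LBA 7; the one with the higher txID is the live copy.
tags = [BlockTag(0, 7, 1), BlockTag(1, 9, 2), BlockTag(2, 7, 3)]
lba_map = rebuild_map(tags)
```

Here lba_map sends LBA 7 to physical block 2 and LBA 9 to physical block 1; physical block 0 is a stale copy that can be erased later.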

Although SSDs have to ERASE data a large block at a time (256K to 2M typically), they can write linearly to an
erased block in much smaller chunks.
Thus, to commit a write, either:
Data, LBA tag, and txID go in the same block (may require oddly sized blocks),
or
Data is written to one block (not committed yet), then the LBA tag and txID are written elsewhere (which commits the write).
Since it's all copy-on-write, partial writes can't happen.
If a block is being moved or compressed when power fails, data should never be lost, since the old data still exists; the
new version just didn't commit.  But new data that is being written may not be committed yet in the case of a power
failure unless other measures are taken.
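
The second variant (data first, commit record second) can be sketched as a toy model in Python.  Everything here is hypothetical: ToyFlash, its fields, and the crash_before_commit flag exist only to simulate a power loss between the two steps, and the model ignores erasing and garbage collection entirely.

```python
class ToyFlash:
    """Toy copy-on-write store: a write counts only once its commit record exists."""

    def __init__(self):
        self.data = {}       # physical block -> payload
        self.log = []        # append-only commit records: (lba, txid, physical)
        self.next_block = 0  # next never-used physical block
        self.next_txid = 1   # monotonically increasing transaction ID

    def write(self, lba, payload, crash_before_commit=False):
        physical = self.next_block
        self.next_block += 1
        self.data[physical] = payload  # step 1: data goes to a fresh block
        if crash_before_commit:
            return  # simulated power loss: the commit record never lands
        self.log.append((lba, self.next_txid, physical))  # step 2: commit
        self.next_txid += 1

    def recover(self):
        """Rebuild lba -> payload from commit records, highest txID wins."""
        table = {}  # lba -> (txid, physical)
        for lba, txid, physical in self.log:
            if lba not in table or txid > table[lba][0]:
                table[lba] = (txid, physical)
        return {lba: self.data[physical] for lba, (_txid, physical) in table.items()}


flash = ToyFlash()
flash.write(5, "old")
flash.write(5, "new", crash_before_commit=True)  # power fails mid-write
recovered = flash.recover()
```

After the simulated crash, recovered still maps LBA 5 to "old": the old copy was never overwritten, and the new one is simply not visible because it was never committed.  That is the "old data survives, new data may be lost" behavior described above.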


