Re: SSD filesystem aligned to DBMS - Mailing list pgsql-general

From George Neuner
Subject Re: SSD filesystem aligned to DBMS
Msg-id bhls5dt30t4rd7q54r2fpif9rt15raijlh@4ax.com
In response to SSD filesystem aligned to DBMS  (Neto pr <netoprbr9@gmail.com>)
List pgsql-general
On Tue, 16 Jan 2018 16:50:28 +0000, Michael Loftis <mloftis@wgops.com>
wrote:

>Alignment definitely makes a difference for writes. It can also make a
>difference for random reads, since the underlying read may not line up
>with the hardware; add in read-ahead (at drive or OS level) and you’re
>reading far more data from the drive than the OS asks for.

Best performance will be when the filesystem block size matches the
SSD's writeable *data* block size.  The SSD also has a separate erase
sector size which is some (large) multiple of the data block size.


<background>
Recall that an SSD doesn't overwrite existing data blocks.  When you
update a file, the updates are written out to *new* "clean" data
blocks, and the file's block index is updated to reflect the new
structure.  

The old data blocks are marked "free+dirty".  They must be erased
(become "free+clean") before reuse.  Depending on the drive size, the
SSD's erase sectors may be anywhere from 64MB..512MB in size, and so a
single erase sector will hold many individually writeable data blocks.

When an erase sector is cleaned, ALL the data blocks it contains are
erased.  If any still contain good data, they must be relocated before
the erase can be done.
</background>
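
The background above can be sketched as a toy model.  This is only an
illustration: the class name, the four-blocks-per-sector figure, and
the "first clean block" placement policy are all assumptions made for
the example, not how any real drive controller works.

```python
# Toy model only: names, sizes and placement policy are illustrative
# assumptions, not real drive parameters or firmware behaviour.
BLOCKS_PER_ERASE_SECTOR = 4


class ToySSD:
    def __init__(self, n_sectors):
        # Each physical block is "clean", "dirty", or holds a logical block id.
        self.sectors = [["clean"] * BLOCKS_PER_ERASE_SECTOR
                        for _ in range(n_sectors)]
        self.index = {}  # logical block id -> (sector, slot)

    def _find_clean(self, skip=None):
        # Return the first clean block, optionally avoiding one sector.
        for s, sector in enumerate(self.sectors):
            if s == skip:
                continue
            for b, state in enumerate(sector):
                if state == "clean":
                    return s, b
        raise RuntimeError("no clean block available")

    def write(self, logical):
        # Never overwrite in place: the old copy becomes free+dirty and
        # the data goes to a new clean block; then the index is updated.
        if logical in self.index:
            s, b = self.index[logical]
            self.sectors[s][b] = "dirty"
        s, b = self._find_clean()
        self.sectors[s][b] = logical
        self.index[logical] = (s, b)

    def erase(self, s):
        # Live blocks must be relocated before the whole sector is erased.
        moved = 0
        for state in self.sectors[s]:
            if state not in ("clean", "dirty"):
                ns, nb = self._find_clean(skip=s)
                self.sectors[ns][nb] = state
                self.index[state] = (ns, nb)
                moved += 1
        self.sectors[s] = ["clean"] * BLOCKS_PER_ERASE_SECTOR
        return moved


ssd = ToySSD(n_sectors=2)
for blk in (0, 1, 2):
    ssd.write(blk)
ssd.write(0)               # update: old copy of block 0 is now dirty
relocated = ssd.erase(0)   # three live blocks must move out first
```

Reclaiming one dirty block here costs three extra block writes, which
is the relocation overhead described above.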


You don't want your filesystem block to be smaller than the SSD data
block, because then you are subject to *unnecessary* write
amplification: the drive controller has to read/modify/write a whole
data block to change any part of it.

But, conversely, filesystem blocks that are larger than the SSD write
block typically are not a problem because ... unless you do something
really stupid [with really low level code] ... the large filesystem
blocks will end up being an exact multiple of data blocks.
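
As a rough worked example of the two cases above, here is a small
sketch.  The function name is hypothetical, and it assumes each
filesystem block is updated on its own and starts on a data block
boundary; the 4K/16K sizes are made up for illustration.

```python
def write_amplification(fs_block, ssd_block):
    """Bytes the drive must write per byte the filesystem asked to write.

    Assumes a single, isolated filesystem-block update that starts on an
    SSD data block boundary (an assumption for this sketch).
    """
    # Number of whole SSD data blocks touched (ceiling division).
    touched = -(-fs_block // ssd_block)
    return touched * ssd_block / fs_block


small_fs = write_amplification(fs_block=4096, ssd_block=16384)   # fs < data block
matched = write_amplification(fs_block=16384, ssd_block=16384)   # fs == data block
large_fs = write_amplification(fs_block=16384, ssd_block=4096)   # exact multiple
```

With these illustrative sizes, the too-small filesystem block writes
four times the data requested, while the matched and exact-multiple
cases write only what was asked for.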


Much of the literature re: alignment actually is related to the erase
sectors rather than the data blocks and is targeted at embedded
systems that are not using conventional filesystems but rather are
accessing the raw SSD.

You do want your partitions to start on erase sector boundaries, but
that usually is trivial to do.
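
A minimal sketch of that boundary check follows.  The helper names
are hypothetical, the 512-byte logical sector is an assumption, and
the 64MB erase size is just the low end of the range mentioned in the
background above; substitute whatever applies to your drive.

```python
def is_aligned(start_sector, erase_size, sector_size=512):
    # A partition is aligned when its byte offset is a multiple of the
    # erase sector size.
    return (start_sector * sector_size) % erase_size == 0


def next_aligned_sector(start_sector, erase_size, sector_size=512):
    # Round the byte offset up to the next erase sector boundary and
    # convert back to logical sectors.  Assumes erase_size is a
    # multiple of sector_size.
    offset = start_sector * sector_size
    aligned = -(-offset // erase_size) * erase_size
    return aligned // sector_size


ERASE_SIZE = 64 * 1024 * 1024  # assumed for the example

misaligned = is_aligned(63, ERASE_SIZE)          # False
suggested = next_aligned_sector(63, ERASE_SIZE)  # 131072
```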


>Stupidly a lot of this isn’t published by a lot of SSD manufacturers, but
>through benchmarks it shows up.

Yes.  The advice to match your filesystem to the data block size is
not often given.


>Another potential difference here with SAS vs SATA is the maximum queue
>depth supported by the protocol and drive.

Yes. The interface, and how it is configured, matters greatly.


>SSD drives also do internal housekeeping tasks for wear leveling on writing.

The biggest of which is always writing to a new location.  Enterprise
grade SSDs sometimes perform erases ahead of time during idle
periods, but cheap drives often wait until the free+dirty space is to
be reused.


>I’ve seen SSD drives benchmark with 80-90MB sequential read or write,
>change the alignment, and you’ll get 400+ on the same drive with sequential
>reads (changing nothing else)
>
>A specific example
>https://www.servethehome.com/ssd-alignment-quickly-benchmark-ssd/

I believe you have seen it, but if the read performance changed that
drastically, then the controller/driver was doing something awfully
stupid ... e.g., re-reading the same data block for each filesystem
block it contains.


YMMV.
George


