Re: Setting BLCKSZ 4kB - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Setting BLCKSZ 4kB
Date
Msg-id 39f9fcb4-33e9-52bd-0c44-aa1b5d2fcd21@2ndquadrant.com
In response to Re: Setting BLCKSZ 4kB  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers

On 01/27/2018 05:01 AM, Bruce Momjian wrote:
> On Fri, Jan 26, 2018 at 11:53:33PM +0100, Tomas Vondra wrote:
>>
>> ...
>>
>> FWIW even if it's not safe in general, it would be useful to
>> understand what the requirements are to make it work. I mean,
>> conditions that need to be met on various levels (sector size of
>> the storage device, page size of the file system, filesystem
>> alignment, ...).
> 
> I think you are fine as soon as the data arrives at the durable
> storage, and assuming the data can't be partially written to durable
> storage. I was thinking more of a case where you have a file system,
> a RAID card without a BBU, and then magnetic disks. In that case,
> even if the file system were to write in 4k chunks, the RAID
> controller would also need to do the same, and with the same
> alignment. Of course, that's probably a silly example since there is
> probably no way to atomically write 4k to a magnetic disk.
> 
> Actually, what happens if a 4k write is being written to an SSD and
> the server crashes. Is the entire write discarded?
> 

AFAIK it's not possible to end up with a partial write, and in
particular not one that contains a mix of old and new data - that's
because SSDs can't overwrite a block without erasing it first.

So the write should either succeed or fail as a whole, depending on when
exactly the server crashes - it might be right before the flush is
confirmed back to the client, for example. That assumes the drive has
4kB sectors (internal pages), and it applies to drives with a volatile
write cache that support write barriers and cache flushes. On drives
with a non-volatile write cache (i.e. with a battery/capacitor) the
write should always succeed and never get discarded.
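
To make the sector-size and alignment conditions concrete, here is a
minimal sketch of the arithmetic only. The function name, the parameters
and the partition_offset idea are made up for illustration, they are not
anything from PostgreSQL. It checks whether a database block lands on
exactly one device sector; it cannot tell you whether the drive actually
writes that sector atomically.

```python
def block_maps_to_one_sector(block_no, blcksz=4096, sector_size=4096,
                             partition_offset=0):
    """Return True if the block covers exactly one device sector.

    All names here are hypothetical. partition_offset stands in for any
    byte shift introduced by the partition or file system layout.
    """
    if blcksz <= 0 or sector_size <= 0 or block_no < 0:
        return False
    # Byte position of the block on the device, under the assumption that
    # blocks are laid out contiguously after partition_offset.
    device_offset = partition_offset + block_no * blcksz
    # The block must start on a sector boundary and be one sector long.
    return device_offset % sector_size == 0 and blcksz == sector_size


if __name__ == "__main__":
    print(block_maps_to_one_sector(10))                        # aligned 4kB block
    print(block_maps_to_one_sector(10, partition_offset=512))  # shifted by 512 bytes
    print(block_maps_to_one_sector(0, blcksz=8192))            # default 8kB block
```

A misaligned partition or a block larger than the sector means a single
block write touches more than one sector, which is the case where a torn
page becomes possible again.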

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

