Thread: Setting BLCKSZ 4kB

Setting BLCKSZ 4kB

From
sanyam jain
Date: 2018-01-16 07:50

Hi,

I am trying to solve WAL flooding due to FPWs.


What are the cons of setting BLCKSZ to 4kB?


Looking at the results published at
http://blog.coelho.net/database/2014/08/17/postgresql-page-size-for-SSD-2.html,
a 4kB page gives better performance than an 8kB page, except when tested
with a 15kB row size.


Will turning off FPWs be safe if BLCKSZ is set to 4kB, given that the file system page size is 4kB?
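
For reference, BLCKSZ can only be changed at build time (e.g. via
configure's --with-blocksize option), so this is roughly how I am
checking the current values - a minimal sketch assuming psycopg2 and a
reachable server:

import psycopg2

conn = psycopg2.connect("dbname=postgres")  # hypothetical connection string
cur = conn.cursor()
cur.execute("SHOW block_size")              # compile-time BLCKSZ, in bytes
print("block_size =", cur.fetchone()[0])
cur.execute("SHOW full_page_writes")
print("full_page_writes =", cur.fetchone()[0])
conn.close()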


Thanks,

Sanyam Jain 

Re: Setting BLCKSZ 4kB

From
Giuseppe Broccolo
Date: 2018-01-16 11:17
Hi Sanyam,

Interesting topic!

2018-01-16 7:50 GMT+01:00 sanyam jain <sanyamjain22@live.in>:

> Hi,
>
> I am trying to solve WAL flooding due to FPWs.
>
> What are the cons of setting BLCKSZ to 4kB?
>
> Looking at the results published at
> http://blog.coelho.net/database/2014/08/17/postgresql-page-size-for-SSD-2.html,
> a 4kB page gives better performance than an 8kB page, except when
> tested with a 15kB row size.
>
> Will turning off FPWs be safe if BLCKSZ is set to 4kB, given that the
> file system page size is 4kB?


There is this interesting article by Tomas Vondra:

https://blog.2ndquadrant.com/on-the-impact-of-full-page-writes/

that explains some consequences of turning off full_page_writes. If I
understood correctly, turning off full_page_writes with BLCKSZ set to 4kB
can significantly reduce the amount of WAL produced, but you cannot be
sure that you are completely safe just because a PostgreSQL page can be
completely contained in a 4kB file system page, though modern file
systems are less vulnerable to partial writes.

In the article, Tomas focuses on the fact that most full page writes
happen right after a checkpoint: proper checkpoint tuning can help reduce
the amount of writes to the storage while safely keeping full_page_writes
enabled.
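
A quick way to quantify the effect of such tuning is to measure the WAL
produced by the same workload before and after changing the checkpoint
settings. A minimal sketch, assuming PostgreSQL 10 (for pg_wal_lsn_diff)
and psycopg2; run_workload() is a hypothetical stand-in for the load you
want to measure:

import psycopg2

conn = psycopg2.connect("dbname=postgres")  # hypothetical connection string
conn.autocommit = True
cur = conn.cursor()

def run_workload():
    # hypothetical stand-in: replace with the workload to be measured
    cur.execute("CREATE TABLE IF NOT EXISTS t (a int)")
    cur.execute("INSERT INTO t SELECT generate_series(1, 100000)")

cur.execute("SELECT pg_current_wal_lsn()")
start_lsn = cur.fetchone()[0]

run_workload()

cur.execute("SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), %s)", (start_lsn,))
print("WAL bytes generated:", cur.fetchone()[0])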

Giuseppe.

Re: Setting BLCKSZ 4kB

From
Fabien COELHO
Date: 2018-01-17 14:10
Hello,

> What are the cons of setting BLCKSZ to 4kB? Looking at the results 
> published at [...].

There have been other posts and publications that consistently point in 
the same direction.

This matches my deep belief that the postgres default block size is a 
reasonable compromise for HDDs, but is less pertinent for SSDs under most 
OLTP loads.

For OLAP, I do not think it would lose much, but I have not tested it.

> Will turning off FPWs be safe if BLCKSZ is set to 4kB, given that the 
> file system page size is 4kB?

FPW = Full Page Write. I would not bet on turning off FPW: ISTM that SSDs 
can have "page" sizes as low as 512 bytes, but they are typically 2 kB or 
4 kB, and the information is not easily available anyway.

-- 
Fabien.


Re: Setting BLCKSZ 4kB

From
Tomas Vondra
Date:

On 01/16/2018 11:17 AM, Giuseppe Broccolo wrote:
> Hi Sanyam,
> 
> Interesting topic!
> 
> 2018-01-16 7:50 GMT+01:00 sanyam jain <sanyamjain22@live.in>:
> 
>     Hi,
> 
>     I am trying to solve WAL flooding due to FPWs.
> 
>     What are the cons of setting BLCKSZ to 4kB?
> 
>     Looking at the results published at
>     http://blog.coelho.net/database/2014/08/17/postgresql-page-size-for-SSD-2.html,
>     a 4kB page gives better performance than an 8kB page, except when
>     tested with a 15kB row size.
> 
>     Will turning off FPWs be safe if BLCKSZ is set to 4kB, given that
>     the file system page size is 4kB?
> 
> 
> There is this interesting article by Tomas Vondra:
> 
> https://blog.2ndquadrant.com/on-the-impact-of-full-page-writes/
> 
> that explains some consequences of turning off full_page_writes. If I 
> understood correctly, turning off full_page_writes with BLCKSZ set
> to 4kB can significantly reduce the amount of WAL produced, but you
> cannot be sure that you are completely safe just because a PostgreSQL
> page can be completely contained in a 4kB file system page, though
> modern file systems are less vulnerable to partial writes.
> 

Actually, I don't have a definitive answer to that. I think using 4kB
pages might be safe assuming

(1) it's on a filesystem with 4kB pages

(2) it's on a platform with 4kB memory pages

(3) it's on storage with atomic 4kB writes (e.g. 4kB sectors or BBWC)

But unfortunately that's only something I *think* and I'm still looking
for someone with a deeper knowledge of this topic, who could confirm
that's the case.
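
Conditions (1) and (2) are at least easy to inspect from user space - a
minimal Linux-only sketch, where the data directory path is a placeholder:

import os
import resource

st = os.statvfs("/var/lib/postgresql/data")          # hypothetical path
print("filesystem block size:", st.f_bsize)          # condition (1)
print("memory page size:", resource.getpagesize())   # condition (2)

Condition (3) is the hard one: the kernel only reports what the drive
claims, and that says nothing about write atomicity.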

>
> In the article, Tomas focuses on the fact that most full page writes
> happen right after a checkpoint: proper checkpoint tuning can help
> reduce the amount of writes to the storage while safely keeping
> full_page_writes enabled.
> 

Right, and in most cases that's a very effective way of reducing the
amount of WAL. Unfortunately, the "right after checkpoint" WAL spikes
are still there, and some workloads are particularly susceptible to that
(inserts with generated UUID values are a good example).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Setting BLCKSZ 4kB

From
Bruce Momjian
Date: 2018-01-26 14:56
On Wed, Jan 17, 2018 at 02:10:10PM +0100, Fabien COELHO wrote:
> 
> Hello,
> 
> >What are the cons of setting BLCKSZ to 4kB? Looking at the results
> >published at [...].
> 
> There have been other posts and publications that consistently point in
> the same direction.
> 
> This matches my deep belief that the postgres default block size is a
> reasonable compromise for HDDs, but is less pertinent for SSDs under most
> OLTP loads.
> 
> For OLAP, I do not think it would lose much, but I have not tested it.
> 
> >Will turning off FPWs be safe if BLCKSZ is set to 4kB, given that the
> >file system page size is 4kB?
> 
> FPW = Full Page Write. I would not bet on turning off FPW: ISTM that SSDs
> can have "page" sizes as low as 512 bytes, but they are typically 2 kB or
> 4 kB, and the information is not easily available anyway.

Yes, that is the hard part, making sure you have 4k granularity of
write, and matching write alignment.  pg_test_fsync and diskchecker.pl
(which we mention in our docs) will not help here.  A specific alignment
test based on diskchecker.pl would have to be written.  However, if you
look at the kernel code you might be able to verify quickly that the 4k
atomicity is not guaranteed.
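
To illustrate what such a test would exercise, the write path would look
roughly like this (a minimal Linux-only sketch; O_DIRECT needs an aligned
buffer, which an anonymous mmap provides, and a real test would have to
power-cycle the machine mid-write and verify the pattern afterward, the
way diskchecker.pl does):

import mmap
import os

BLOCK = 4096

buf = mmap.mmap(-1, BLOCK)   # anonymous mapping gives page-aligned memory
buf[:] = b"\xab" * BLOCK     # known pattern to verify after the crash

# O_DIRECT bypasses the page cache, so offset, length, and buffer
# alignment all have to match the device's requirements.
fd = os.open("/tmp/atomicity-test", os.O_WRONLY | os.O_CREAT | os.O_DIRECT)
os.write(fd, buf)            # a single aligned 4k write at offset 0
os.fsync(fd)                 # force it through the drive cache
os.close(fd)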

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Re: Setting BLCKSZ 4kB

From
Tomas Vondra
Date: 2018-01-26 23:53

On 01/26/2018 02:56 PM, Bruce Momjian wrote:
> On Wed, Jan 17, 2018 at 02:10:10PM +0100, Fabien COELHO wrote:
>>
>> Hello,
>>
>>> What are the cons of setting BLCKSZ to 4kB? Looking at the results
>>> published at [...].
>>
>> There have been other posts and publications that consistently point in
>> the same direction.
>>
>> This matches my deep belief that the postgres default block size is a
>> reasonable compromise for HDDs, but is less pertinent for SSDs under most
>> OLTP loads.
>>
>> For OLAP, I do not think it would lose much, but I have not tested it.
>>
>>> Will turning off FPWs be safe if BLCKSZ is set to 4kB, given that the
>>> file system page size is 4kB?
>>
>> FPW = Full Page Write. I would not bet on turning off FPW: ISTM 
>> that SSDs can have "page" sizes as low as 512 bytes, but they are 
>> typically 2 kB or 4 kB, and the information is not easily available
>> anyway.
> 

Is this referring to sector size or the internal SSD page size?

AFAIK there are only 512B and 4096B sectors, so I assume you must be
talking about the latter. I don't think I've ever heard about an SSD
with 512B pages though (generally the page sizes are 2kB to 16kB).

But more importantly, I don't see why the size of the internal page
would matter here at all? SSDs have non-volatile write cache (DRAM with
battery), protecting all the internal writes to pages. If your SSD does
not do that correctly, it's already broken no matter what page size it
uses even with full_page_writes=on.

On spinning rust the caches would be disabled and replaced by write
cache on a RAID controller with battery, but that's not possible on SSDs
where the on-disk cache is baked into the whole design.

What I think does matter here is the sector size (i.e. either 512B or
4096B) used to communicate with the disk. Obviously, if the kernel
writes a 4kB page as a series of independent 512B writes, that would be
unreliable. If it sends one 4kB write, why wouldn't that work?
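
FWIW, the sector sizes the kernel believes in can be read from sysfs on
Linux ("sda" is a placeholder device name) - though whether the drive
honors them atomically is exactly the open question:

def queue_attr(dev, attr):
    with open("/sys/block/%s/queue/%s" % (dev, attr)) as f:
        return int(f.read())

print("logical sector size: ", queue_attr("sda", "logical_block_size"))
print("physical sector size:", queue_attr("sda", "physical_block_size"))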

> Yes, that is the hard part, making sure you have 4k granularity of 
> write, and matching write alignment. pg_test_fsync and diskchecker.pl
> (which we mention in our docs) will not help here. A specific
> alignment test based on diskchecker.pl would have to be written.
> However, if you look at the kernel code you might be able to verify
> quickly that the 4k atomicity is not guaranteed.
> 

Are you suggesting there's a part of the kernel code clearly showing
it's not atomic? Can you point us to that part of the kernel sources?

FWIW even if it's not safe in general, it would be useful to understand
what the requirements are to make it work. I mean, conditions that need
to be met on various levels (sector size of the storage device, page
size of the file system, filesystem alignment, ...).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Setting BLCKSZ 4kB

From
Andres Freund
Date: 2018-01-27 00:06
Hi,

On 2018-01-26 23:53:33 +0100, Tomas Vondra wrote:
> But more importantly, I don't see why the size of the internal page
> would matter here at all? SSDs have non-volatile write cache (DRAM with
> battery), protecting all the internal writes to pages. If your SSD does
> not do that correctly, it's already broken no matter what page size it
> uses even with full_page_writes=on.

Far far from all SSDs have non-volatile write caches. And if they
respect barrier requests (i.e. flush before returning), they're not
broken.

Greetings,

Andres Freund


Re: Setting BLCKSZ 4kB

From
Tomas Vondra
Date: 2018-01-27 00:28

On 01/27/2018 12:06 AM, Andres Freund wrote:
> Hi,
> 
> On 2018-01-26 23:53:33 +0100, Tomas Vondra wrote:
>> But more importantly, I don't see why the size of the internal page
>> would matter here at all? SSDs have non-volatile write cache (DRAM with
>> battery), protecting all the internal writes to pages. If your SSD does
>> not do that correctly, it's already broken no matter what page size it
>> uses even with full_page_writes=on.
> 
> Far far from all SSDs have non-volatile write caches. And if they
> respect barrier requests (i.e. flush before returning), they're not
> broken.
>

That is true, thanks for the correction.

But does that make the internal page size relevant to the atomicity
question? For example, let's say we write 4kB on a drive with 2kB
internal pages, and the power goes out after writing the first 2kB of
data (so the second 2kB gets lost). The disk, however, never confirmed
the 4kB write, exactly because of the write barrier ...

I have to admit I'm not sure what happens at this point - whether the
drive will produce a torn page (with the first 2kB updated and the
second 2kB old), or if it's smart enough to realize the write barrier
was not reached.

But perhaps this (non-volatile write cache) is one of the requirements
for disabling full page writes?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Setting BLCKSZ 4kB

From
Andres Freund
Date:
Hi,

On 2018-01-27 00:28:07 +0100, Tomas Vondra wrote:
> But does that make the internal page size relevant to the atomicity
> question? For example, let's say we write 4kB on a drive with 2kB
> internal pages, and the power goes out after writing the first 2kB of
> data (so the second 2kB gets lost). The disk, however, never
> confirmed the 4kB write, exactly because of the write barrier ...

That would be problematic, yes. That's *precisely* the torn page issue
we're worried about re full page writes.  Consider, as just one of many
examples, crashing during WAL apply: the first half of the page might be
new, the other half old - we'd skip the record the next time we try to
apply it, because the LSN in the page would indicate it's new enough.
With FPWs that doesn't happen, because the first time through we'll
reapply the whole write.
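
To spell out the failure mode (an illustrative toy model, not actual
PostgreSQL code - the real logic lives in the C recovery code):

class Page:
    def __init__(self, lsn):
        self.lsn = lsn          # LSN stored in the page header

class Record:
    def __init__(self, lsn):
        self.lsn = lsn

    def apply(self, page):
        page.lsn = self.lsn     # stand-in for the actual page change

def redo(record, page):
    if page.lsn >= record.lsn:
        return                  # page looks new enough -> skip replay
    record.apply(page)

# Torn write: the first half (holding the header and its LSN) is new,
# the second half is stale. The header already carries the record's LSN,
# so redo() skips the record and the stale half is never repaired.
torn_page = Page(lsn=200)         # header claims change 200 was applied
redo(Record(lsn=200), torn_page)  # skipped -> corruption persists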


> I have to admit I'm not sure what happens at this point - whether the
> drive will produce a torn page (with the first 2kB updated and the
> second 2kB old), or if it's smart enough to realize the write barrier
> was not reached.

I don't think you can rely on anything.


> But perhaps this (non-volatile write cache) is one of the requirements
> for disabling full page writes?

I don't think that's reliably doable due to the limited knowledge about
what exactly happens inside each and every model of drive.

Greetings,

Andres Freund


Re: Setting BLCKSZ 4kB

From
Bruce Momjian
Date:
On Fri, Jan 26, 2018 at 11:53:33PM +0100, Tomas Vondra wrote:
> 
> 
> On 01/26/2018 02:56 PM, Bruce Momjian wrote:
> > Yes, that is the hard part, making sure you have 4k granularity of 
> > write, and matching write alignment. pg_test_fsync and diskchecker.pl
> > (which we mention in our docs) will not help here. A specific
> > alignment test based on diskchecker.pl would have to be written.
> > However, if you look at the kernel code you might be able to verify
> > quickly that the 4k atomicity is not guaranteed.
> > 
> 
> Are you suggesting there's a part of the kernel code clearly showing
> it's not atomic? Can you point us to that part of the kernel sources?

Well, my point is that you would either need to repeatedly test that the
file system writes to some durable storage in 4k chunks or check the
file system source code to see it does that.  I don't know how to check
the file system source code myself.  The other issue is that it has to
write 4k chunks using the same alignment as the file itself.

> FWIW even if it's not safe in general, it would be useful to understand
> what the requirements are to make it work. I mean, conditions that need
> to be met on various levels (sector size of the storage device, page
> size of the file system, filesystem alignment, ...).

I think you are fine as soon as the data arrives at the durable storage,
and assuming the data can't be partially written to durable storage.  I
was thinking more of a case where you have a file system, a RAID card
without a BBU, and then magnetic disks.  In that case, even if the file
system were to write in 4k chunks, the RAID controller would also need
to do the same, and with the same alignment.  Of course, that's probably
a silly example, since there is probably no way to atomically write 4k to
a magnetic disk.

Actually, what happens if a 4k write is being written to an SSD and the
server crashes?  Is the entire write discarded?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Re: Setting BLCKSZ 4kB

From
Tomas Vondra
Date:

On 01/27/2018 05:01 AM, Bruce Momjian wrote:
> On Fri, Jan 26, 2018 at 11:53:33PM +0100, Tomas Vondra wrote:
>>
>> ...
>>
>> FWIW even if it's not safe in general, it would be useful to
>> understand what the requirements are to make it work. I mean,
>> conditions that need to be met on various levels (sector size of
>> the storage device, page size of the file system, filesystem
>> alignment, ...).
> 
> I think you are fine as soon as the data arrives at the durable
> storage, and assuming the data can't be partially written to durable
> storage. I was thinking more of a case where you have a file system,
> a RAID card without a BBU, and then magnetic disks. In that case,
> even if the file system were to write in 4k chunks, the RAID
> controller would also need to do the same, and with the same
> alignment. Of course, that's probably a silly example, since there is
> probably no way to atomically write 4k to a magnetic disk.
> 
> Actually, what happens if a 4k write is being written to an SSD and
> the server crashes?  Is the entire write discarded?
> 

AFAIK it's not possible to end up with a partial write, particularly not
one that would contain a mix of old and new data - that's because SSDs
can't overwrite a block without erasing it first.

So the write should either succeed or fail as a whole, depending on when
exactly the server crashes - it might be right before confirming the
flush back to the client, for example. That assumes the drive has 4kB
sectors (internal pages), and applies to drives with a volatile write
cache that support write barriers and cache flushes. On drives with a
non-volatile write cache (i.e. with a battery/capacitor) the write should
always succeed and never get discarded.
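
To make that concrete, here is a toy model of that copy-on-write
behaviour (purely illustrative - real flash translation layers are far
more complex): new data is programmed into a fresh, erased page, and only
then does the mapping table flip, so a lost write leaves the old page
intact rather than a mix of old and new:

# Toy model of a flash translation layer (illustrative only).
mapping = {0: "phys_A"}                 # logical page 0 -> physical page A
flash = {"phys_A": b"old data"}

def ftl_write(logical_page, data):
    flash["phys_B"] = data              # program a fresh, erased page
    mapping[logical_page] = "phys_B"    # single mapping-table update

# If power fails before the mapping update, reads still return the old
# page in full - there is no page holding half old, half new data.
ftl_write(0, b"new data")
print(flash[mapping[0]])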

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services