Thread: Setting BLCKSZ 4kB
Hi,
I am trying to solve WAL flooding due to FPWs.

What are the cons of setting BLCKSZ to 4kB?

When I saw the results published on
http://blog.coelho.net/database/2014/08/17/postgresql-page-size-for-SSD-2.html,
a 4kB page gives better performance than 8kB, except when tested with a
15kB row size.

Will turning off FPWs be safe if BLCKSZ is set to 4kB, given that the
file system page size is 4kB?
Thanks,
Sanyam Jain
Hello,

> What are the cons of setting BLCKSZ to 4kB? When I saw the results
> published on [...].

There were other posts and publications which point in the same
direction consistently.

This matches my deep belief that the postgres default block size is a
reasonable compromise for HDD, but is less pertinent for SSD for most
OLTP loads.

For OLAP, I do not think it would lose much, but I have not tested it.

> Will turning off FPWs be safe if BLCKSZ is set to 4kB, given that the
> file system page size is 4kB?

FPW = Full Page Write. I would not bet on turning off FPW: ISTM that
SSDs can have "page" sizes as low as 512 bytes, but are typically 2 kB
or 4 kB, and the information is not easily available anyway.

--
Fabien.
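For reference, the sector sizes a drive advertises to the kernel can be
queried directly on Linux. A minimal sketch in C, assuming the Linux
BLKSSZGET/BLKPBSZGET ioctls are available; note that the internal flash
page size mentioned above is generally not exposed at all:

    /* sector_sizes.c - print the sector sizes a block device reports.
     * Minimal sketch, assuming Linux.
     * Build: cc -o sector_sizes sector_sizes.c
     * Usage: ./sector_sizes /dev/sda   (typically needs root)
     */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>   /* BLKSSZGET, BLKPBSZGET */

    int main(int argc, char **argv)
    {
        if (argc != 2)
        {
            fprintf(stderr, "usage: %s <block-device>\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        int logical = 0, physical = 0;
        ioctl(fd, BLKSSZGET, &logical);    /* logical sector size, bytes */
        ioctl(fd, BLKPBSZGET, &physical);  /* physical sector size, bytes */

        printf("logical sector:  %d bytes\n", logical);
        printf("physical sector: %d bytes\n", physical);

        /* The device only advertises what it emulates at the block
         * layer; the SSD's internal flash page size is not reported. */
        close(fd);
        return 0;
    }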
On 01/16/2018 11:17 AM, Giuseppe Broccolo wrote:
> Hi Sanyam,
>
> Interesting topic!
>
> 2018-01-16 7:50 GMT+01:00 sanyam jain <sanyamjain22@live.in>:
>
>     Hi,
>
>     I am trying to solve WAL flooding due to FPWs.
>
>     What are the cons of setting BLCKSZ to 4kB?
>
>     When I saw the results published on
>     http://blog.coelho.net/database/2014/08/17/postgresql-page-size-for-SSD-2.html,
>     a 4kB page gives better performance than 8kB, except when tested
>     with a 15kB row size.
>
>     Will turning off FPWs be safe if BLCKSZ is set to 4kB, given that
>     the file system page size is 4kB?
>
> There is this interesting article by Tomas Vondra:
>
> https://blog.2ndquadrant.com/on-the-impact-of-full-page-writes/
>
> that explains some consequences of turning off full_page_writes. If I
> understood correctly, turning off full_page_writes with BLCKSZ set to
> 4kB can significantly reduce the amount of produced WAL, but you
> cannot be sure that you are completely safe just because a PostgreSQL
> page is completely contained in a 4kB file system page, though modern
> file systems are less vulnerable to partial writes.

Actually, I don't have a definitive answer to that. I think using 4kB
pages might be safe assuming

(1) it's on a filesystem with 4kB pages

(2) it's on a platform with 4kB memory pages

(3) it's on storage with atomic 4kB writes (e.g. 4kB sectors or BBWC)

But unfortunately that's only something I *think*, and I'm still looking
for someone with deeper knowledge of this topic who could confirm that's
the case.

> In the article, Tomas focuses attention on the fact that most full
> page writes happen right after a checkpoint: proper checkpoint tuning
> can help reduce the amount of writes on the storage while safely
> keeping full_page_writes enabled.

Right, and in most cases that's a very effective way of reducing the
amount of WAL. Unfortunately, the "right after checkpoint" WAL spikes
are still there, and many workloads are sensitive to that (inserts with
generated UUID values are a good example).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
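Conditions (1) and (2) are easy to check on a given server. A minimal
sketch in C, assuming a POSIX system with statvfs() and sysconf();
condition (3), atomic 4kB writes in the storage, cannot be probed this
way:

    /* check_pages.c - check conditions (1) and (2) above: filesystem
     * block size and memory page size. Minimal sketch assuming POSIX.
     * Usage: ./check_pages /path/to/pgdata
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/statvfs.h>

    int main(int argc, char **argv)
    {
        if (argc != 2)
        {
            fprintf(stderr, "usage: %s <path>\n", argv[0]);
            return 1;
        }

        struct statvfs vfs;
        if (statvfs(argv[1], &vfs) != 0) { perror("statvfs"); return 1; }

        long mem_page = sysconf(_SC_PAGESIZE);

        printf("filesystem block size: %lu bytes\n",
               (unsigned long) vfs.f_bsize);
        printf("memory page size:      %ld bytes\n", mem_page);

        if (vfs.f_bsize == 4096 && mem_page == 4096)
            printf("conditions (1) and (2) hold; (3) still unverified\n");
        else
            printf("condition (1) and/or (2) does not hold\n");

        return 0;
    }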
On Wed, Jan 17, 2018 at 02:10:10PM +0100, Fabien COELHO wrote:
> Hello,
>
>> What are the cons of setting BLCKSZ to 4kB? When I saw the results
>> published on [...].
>
> There were other posts and publications which point in the same
> direction consistently.
>
> This matches my deep belief that the postgres default block size is a
> reasonable compromise for HDD, but is less pertinent for SSD for most
> OLTP loads.
>
> For OLAP, I do not think it would lose much, but I have not tested it.
>
>> Will turning off FPWs be safe if BLCKSZ is set to 4kB, given that the
>> file system page size is 4kB?
>
> FPW = Full Page Write. I would not bet on turning off FPW: ISTM that
> SSDs can have "page" sizes as low as 512 bytes, but are typically 2 kB
> or 4 kB, and the information is not easily available anyway.

Yes, that is the hard part: making sure you have 4k granularity of
writes, and matching write alignment. pg_test_fsync and diskchecker.pl
(which we mention in our docs) will not help here. A specific alignment
test based on diskchecker.pl would have to be written. However, if you
look at the kernel code you might be able to verify quickly that 4k
atomicity is not guaranteed.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
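No such alignment test exists yet; purely as an illustration of what is
described above, here is a sketch in C of the writer half, borrowing the
idea from diskchecker.pl. The file name and stamping scheme are invented
for the example, and a real test would report each confirmed sequence
number to a second machine before the power is pulled:

    /* torn_write_probe.c - illustrative sketch of a 4k atomicity probe
     * (this tool does not exist; the names and pattern scheme are made
     * up). Idea, after diskchecker.pl: stamp every 512-byte slice of an
     * aligned 4kB block with the same sequence number, fsync, and record
     * the last confirmed value. After pulling power, a checker re-reads
     * the block: slices with different stamps mean the write was torn.
     */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define BLOCK 4096
    #define SLICE 512

    int main(void)
    {
        /* O_DIRECT bypasses the page cache; the buffer and the file
         * offset must then be 4kB-aligned. */
        int fd = open("probe.dat", O_CREAT | O_WRONLY | O_DIRECT, 0600);
        if (fd < 0) { perror("open"); return 1; }

        unsigned char *buf;
        if (posix_memalign((void **) &buf, BLOCK, BLOCK) != 0) return 1;

        for (unsigned long seq = 1;; seq++)
        {
            /* same sequence number into all eight 512B slices */
            for (int s = 0; s < BLOCK / SLICE; s++)
                memcpy(buf + s * SLICE, &seq, sizeof(seq));

            if (pwrite(fd, buf, BLOCK, 0) != BLOCK) { perror("pwrite"); break; }
            if (fsync(fd) != 0) { perror("fsync"); break; }

            /* a real test sends this over the network, so the
             * confirmation survives the power pull */
            printf("confirmed seq %lu\n", seq);
        }
        return 1;
    }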
On 01/26/2018 02:56 PM, Bruce Momjian wrote:
> On Wed, Jan 17, 2018 at 02:10:10PM +0100, Fabien COELHO wrote:
>> Hello,
>>
>>> What are the cons of setting BLCKSZ to 4kB? When I saw the results
>>> published on [...].
>>
>> There were other posts and publications which point in the same
>> direction consistently.
>>
>> This matches my deep belief that the postgres default block size is a
>> reasonable compromise for HDD, but is less pertinent for SSD for most
>> OLTP loads.
>>
>> For OLAP, I do not think it would lose much, but I have not tested it.
>>
>>> Will turning off FPWs be safe if BLCKSZ is set to 4kB, given that
>>> the file system page size is 4kB?
>>
>> FPW = Full Page Write. I would not bet on turning off FPW: ISTM that
>> SSDs can have "page" sizes as low as 512 bytes, but are typically
>> 2 kB or 4 kB, and the information is not easily available anyway.

Is this referring to the sector size or the internal SSD page size?
AFAIK there are only 512B and 4096B sectors, so I assume you must be
talking about the latter. I don't think I've ever heard of an SSD with
512B pages, though (generally the page sizes are 2kB to 16kB).

But more importantly, I don't see why the size of the internal page
would matter here at all. SSDs have a non-volatile write cache (DRAM
with battery), protecting all the internal writes to pages. If your SSD
does not do that correctly, it's already broken no matter what page size
it uses, even with full_page_writes=on. On spinning rust the caches
would be disabled and replaced by a write cache on a RAID controller
with battery, but that's not possible on SSDs, where the on-disk cache
is baked into the whole design.

What I think does matter here is the sector size (i.e. either 512B or
4096B) used to communicate with the disk. Obviously, if the kernel
writes a 4kB page as a series of independent 512B writes, that would be
unreliable. If it sends one 4kB write, why wouldn't that work?

> Yes, that is the hard part: making sure you have 4k granularity of
> writes, and matching write alignment. pg_test_fsync and diskchecker.pl
> (which we mention in our docs) will not help here. A specific
> alignment test based on diskchecker.pl would have to be written.
> However, if you look at the kernel code you might be able to verify
> quickly that 4k atomicity is not guaranteed.

Are you suggesting there's a part of the kernel code clearly showing
it's not atomic? Can you point us to that part of the kernel sources?

FWIW even if it's not safe in general, it would be useful to understand
what the requirements are to make it work. I mean, conditions that need
to be met on various levels (sector size of the storage device, page
size of the file system, filesystem alignment, ...).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2018-01-26 23:53:33 +0100, Tomas Vondra wrote:
> But more importantly, I don't see why the size of the internal page
> would matter here at all. SSDs have a non-volatile write cache (DRAM
> with battery), protecting all the internal writes to pages. If your
> SSD does not do that correctly, it's already broken no matter what
> page size it uses, even with full_page_writes=on.

Far far from all SSDs have non-volatile write caches. And if they
respect barrier requests (i.e. flush before returning), they're not
broken.

Greetings,

Andres Freund
On 01/27/2018 12:06 AM, Andres Freund wrote:
> Hi,
>
> On 2018-01-26 23:53:33 +0100, Tomas Vondra wrote:
>> But more importantly, I don't see why the size of the internal page
>> would matter here at all. SSDs have a non-volatile write cache (DRAM
>> with battery), protecting all the internal writes to pages. If your
>> SSD does not do that correctly, it's already broken no matter what
>> page size it uses, even with full_page_writes=on.
>
> Far far from all SSDs have non-volatile write caches. And if they
> respect barrier requests (i.e. flush before returning), they're not
> broken.

That is true, thanks for the correction.

But does that make the internal page size relevant to the atomicity
question? For example, let's say we write 4kB on a drive with 2kB
internal pages, and the power goes out after writing the first 2kB of
data (so the second 2kB gets lost). The disk however never confirmed the
4kB write, exactly because of the write barrier ...

I have to admit I'm not sure what happens at this point - whether the
drive will produce a torn page (with the first 2kB updated and the
second 2kB old), or if it's smart enough to realize the write barrier
was not reached.

But perhaps this (non-volatile write cache) is one of the requirements
for disabling full page writes?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2018-01-27 00:28:07 +0100, Tomas Vondra wrote:
> But does that make the internal page size relevant to the atomicity
> question? For example, let's say we write 4kB on a drive with 2kB
> internal pages, and the power goes out after writing the first 2kB of
> data (so the second 2kB gets lost). The disk however never confirmed
> the 4kB write, exactly because of the write barrier ...

That would be problematic, yes. That's *precisely* the torn page issue
we're worried about re full page writes. Consider, as just one of many
examples, crashing during WAL apply: the first half of the page might be
new, the other old - we'd skip the record the next time we try to apply
it, because the LSN in the page would indicate it's new enough. With
FPWs that doesn't happen, because the first time through we'll reapply
the whole write.

> I have to admit I'm not sure what happens at this point - whether the
> drive will produce a torn page (with the first 2kB updated and the
> second 2kB old), or if it's smart enough to realize the write barrier
> was not reached.

I don't think you can rely on anything.

> But perhaps this (non-volatile write cache) is one of the requirements
> for disabling full page writes?

I don't think that's reliably doable, due to the limited knowledge about
what exactly happens inside each and every model of drive.

Greetings,

Andres Freund
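The skip described above comes from recovery comparing each WAL record's
LSN against the LSN stored in the page header. A simplified sketch in C
of that interlock - illustrative only, not the actual server code:

    /* lsn_skip.c - simplified sketch of the LSN interlock in WAL redo.
     * Each page stores the LSN of the last record applied to it. */
    #include <stdio.h>

    typedef unsigned long long XLogRecPtr;

    typedef struct PageHeader
    {
        XLogRecPtr pd_lsn;  /* LSN of last WAL record applied to page */
        /* ... rest of the page header and tuple data ... */
    } PageHeader;

    /* Decide whether a WAL record must be (re)applied to a page. On a
     * torn page, the first half (holding the header and its new LSN)
     * may have reached disk while the second half did not - recovery
     * then skips the record even though half the page is stale. A full
     * page image avoids this by rewriting the page unconditionally. */
    static int
    redo_needs_apply(const PageHeader *page, XLogRecPtr record_lsn)
    {
        return page->pd_lsn < record_lsn;
    }

    int main(void)
    {
        PageHeader page = { 0x300 };  /* page already stamped with 0x300 */

        printf("record 0x200: %s\n",
               redo_needs_apply(&page, 0x200) ? "apply" : "skip");
        printf("record 0x400: %s\n",
               redo_needs_apply(&page, 0x400) ? "apply" : "skip");
        return 0;
    }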
On Fri, Jan 26, 2018 at 11:53:33PM +0100, Tomas Vondra wrote:
> On 01/26/2018 02:56 PM, Bruce Momjian wrote:
>> Yes, that is the hard part: making sure you have 4k granularity of
>> writes, and matching write alignment. pg_test_fsync and
>> diskchecker.pl (which we mention in our docs) will not help here. A
>> specific alignment test based on diskchecker.pl would have to be
>> written. However, if you look at the kernel code you might be able to
>> verify quickly that 4k atomicity is not guaranteed.
>
> Are you suggesting there's a part of the kernel code clearly showing
> it's not atomic? Can you point us to that part of the kernel sources?

Well, my point is that you would either need to repeatedly test that the
file system writes to some durable storage in 4k chunks, or check the
file system source code to see that it does. I don't know how to check
the file system source code myself. The other issue is that it has to
write 4k chunks using the same alignment as the file itself.

> FWIW even if it's not safe in general, it would be useful to
> understand what the requirements are to make it work. I mean,
> conditions that need to be met on various levels (sector size of the
> storage device, page size of the file system, filesystem alignment,
> ...).

I think you are fine as soon as the data arrives at the durable storage,
assuming the data can't be partially written to durable storage. I was
thinking more of a case where you have a file system, a RAID card
without a BBU, and then magnetic disks. In that case, even if the file
system were to write in 4k chunks, the RAID controller would also need
to do the same, and with the same alignment. Of course, that's probably
a silly example, since there is probably no way to atomically write 4k
to a magnetic disk.

Actually, what happens if a 4k write is being written to an SSD and the
server crashes? Is the entire write discarded?

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
On 01/27/2018 05:01 AM, Bruce Momjian wrote:
> On Fri, Jan 26, 2018 at 11:53:33PM +0100, Tomas Vondra wrote:
>>
>> ...
>>
>> FWIW even if it's not safe in general, it would be useful to
>> understand what the requirements are to make it work. I mean,
>> conditions that need to be met on various levels (sector size of
>> the storage device, page size of the file system, filesystem
>> alignment, ...).
>
> I think you are fine as soon as the data arrives at the durable
> storage, assuming the data can't be partially written to durable
> storage. I was thinking more of a case where you have a file system,
> a RAID card without a BBU, and then magnetic disks. In that case,
> even if the file system were to write in 4k chunks, the RAID
> controller would also need to do the same, and with the same
> alignment. Of course, that's probably a silly example, since there is
> probably no way to atomically write 4k to a magnetic disk.
>
> Actually, what happens if a 4k write is being written to an SSD and
> the server crashes? Is the entire write discarded?

AFAIK it's not possible to end up with a partial write, particularly not
one that would contain a mix of old and new data - that's because SSDs
can't overwrite a block without erasing it first. So the write should
either succeed or fail as a whole, depending on when exactly the server
crashes - it might be right before confirming the flush back to the
client, for example. That assumes the drive has 4kB sectors (internal
pages) - on drives with a volatile write cache but supporting write
barriers and cache flushes. On drives with a non-volatile write cache
(so with a battery/capacitor) it should always succeed and never get
discarded.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services