Thread: Large block sizes support in Linux
Hello, My team and I have been working on adding Large block size (LBS) support to XFS in Linux[1]. Once this feature lands upstream, we will be able to create an XFS filesystem with an FS block size > the page size of the system on Linux. We also gave a talk about it at the Linux Plumbers Conference recently[2] for more context. The initial support is only for XFS but more FSs will follow later. On an x86_64 system, the fs block size has been limited to 4k, but traditionally Postgres uses 8k as its default internal page size. With LBS support, the fs block size can be set to 8K, thereby matching the Postgres page size. If the file system block size == DB page size, then Postgres has a guarantee that a single DB page will be written as a single unit during kernel writeback and not be split. My knowledge of Postgres internals is limited, so I'm wondering if there are any optimizations or potential optimizations that Postgres could leverage once we have LBS support on Linux? [1] https://lore.kernel.org/linux-xfs/20240313170253.2324812-1-kernel@pankajraghav.com/ [2] https://www.youtube.com/watch?v=ar72r5Xf7x4 -- Pankaj Raghav
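As a concrete illustration of the "file system block size == DB page size" condition above, here is a minimal standalone C sketch (not part of the patch series; the 8192 constant and the directory argument are assumptions for the example) that reports whether a directory's filesystem block size matches an 8k DB page size:

    /* fsblockcheck.c -- report whether the filesystem block size of a
     * directory matches an assumed 8k database page size.  Illustrative
     * sketch only; 8192 mirrors Postgres's default BLCKSZ. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/statvfs.h>

    #define DB_PAGE_SIZE 8192   /* assumed DB page size for the example */

    int main(int argc, char **argv)
    {
        struct statvfs sv;
        const char *dir = (argc > 1) ? argv[1] : ".";

        if (statvfs(dir, &sv) != 0) {
            perror("statvfs");
            return 1;
        }
        /* f_bsize is the filesystem block size as reported by the kernel */
        printf("filesystem block size: %lu\n", (unsigned long) sv.f_bsize);
        if (sv.f_bsize == DB_PAGE_SIZE)
            printf("matches the %d-byte DB page size\n", DB_PAGE_SIZE);
        else
            printf("does not match the %d-byte DB page size\n", DB_PAGE_SIZE);
        return 0;
    }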
On Thu, Mar 21, 2024 at 06:46:19PM +0100, Pankaj Raghav (Samsung) wrote: > Hello, > > My team and I have been working on adding Large block size(LBS) > support to XFS in Linux[1]. Once this feature lands upstream, we will be > able to create XFS with FS block size > page size of the system on Linux. > We also gave a talk about it in Linux Plumbers conference recently[2] > for more context. The initial support is only for XFS but more FSs will > follow later. > > On an x86_64 system, fs block size was limited to 4k, but traditionally > Postgres uses 8k as its default internal page size. With LBS support, > fs block size can be set to 8K, thereby matching the Postgres page size. > > If the file system block size == DB page size, then Postgres can have > guarantees that a single DB page will be written as a single unit during > kernel write back and not split. > > My knowledge of Postgres internals is limited, so I'm wondering if there > are any optimizations or potential optimizations that Postgres could > leverage once we have LBS support on Linux? We have discussed this in the past, and in fact in the early years we thought we didn't need fsync since the BSD file system was 8k at the time. What we later realized is that we have no guarantee that the file system will write to the device in the specified block size, and even if it does, the I/O layers between the OS and the device might not, since many devices use 512-byte blocks or other sizes. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
On 3/22/24 19:46, Bruce Momjian wrote: > On Thu, Mar 21, 2024 at 06:46:19PM +0100, Pankaj Raghav (Samsung) wrote: >> Hello, >> >> My team and I have been working on adding Large block size(LBS) >> support to XFS in Linux[1]. Once this feature lands upstream, we will be >> able to create XFS with FS block size > page size of the system on Linux. >> We also gave a talk about it in Linux Plumbers conference recently[2] >> for more context. The initial support is only for XFS but more FSs will >> follow later. >> >> On an x86_64 system, fs block size was limited to 4k, but traditionally >> Postgres uses 8k as its default internal page size. With LBS support, >> fs block size can be set to 8K, thereby matching the Postgres page size. >> >> If the file system block size == DB page size, then Postgres can have >> guarantees that a single DB page will be written as a single unit during >> kernel write back and not split. >> >> My knowledge of Postgres internals is limited, so I'm wondering if there >> are any optimizations or potential optimizations that Postgres could >> leverage once we have LBS support on Linux? > > We have discussed this in the past, and in fact in the early years we > thought we didn't need fsync since the BSD file system was 8k at the > time. > > What we later realized is that we have no guarantee that the file system > will write to the device in the specified block size, and even if it > does, the I/O layers between the OS and the device might not, since many > devices use 512-byte blocks or other sizes. > Right, but things change over time - current storage devices support much larger sectors (LBA format), usually 4K. And if you do I/O with this size, it's usually atomic. AFAIK if you built Postgres with 4K pages, on a device with 4K LBA format, that would not need full-page writes - we always do I/O in 4k pages, and block layer does I/O (during writeback from page cache) with minimum guaranteed size = logical block size. 4K are great for OLTP systems in general, it'd be even better if we didn't need to worry about torn pages (but the tricky part is to be confident it's safe to disable them on a particular system). I did watch the talk linked by Pankaj, and IIUC the promise of the LBS patches is that this benefit would apply even to larger page sizes (= fs page size). Which right now you can't even mount, but the patches allow that. So for example it would be possible to create an XFS filesystem with 8kB pages, and then we'd read/write 8kB pages as usual, and we'd know that the page cache always writes out either the whole page or none of it. Which right now is not guaranteed to happen, it's possible to e.g. write the page as two 4K requests, even if all other things are set properly (drive has 4K logical/physical sectors). At least that's my understanding ... Pankaj, could you clarify what the guarantees provided by LBS are going to be? The talk uses wording like "should be" and "hint" in a couple places, and there's also stuff I'm not 100% familiar with. If we create a filesystem with 8K blocks, and we only ever do writes (and reads) in 8K chunks (our default page size), what guarantees does that give us? What if the underlying device has LBA format with only 4K (or perhaps even just 512B), how would that affect the guarantees? The other thing is - is there a reliable way to say when the guarantees actually apply?
I mean, how would the administrator *know* it's safe to set full_page_writes=off, or even better how could we verify this when the database starts (and complain if it's not safe to disable FPW)? It's easy to e.g. take a backup on one filesystem and restore it on another one, and forget those may have different block sizes etc. I'm not sure it's possible in a 100% reliable way (tablespaces?). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
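To make the startup-check idea above concrete, here is a rough sketch (illustrative only, not an actual Postgres patch; it assumes the process's working directory is the data directory) that walks the tablespace links under pg_tblspc and complains when any filesystem's block size does not match the 8k page size. Note it says nothing about the device's own atomic-write guarantee, which is the other half of the problem:

    /* Rough sketch of a startup-time sanity check: verify that the data
     * directory and every tablespace under pg_tblspc sit on filesystems
     * whose block size matches the DB page size.  Illustrative only; it
     * does not (and cannot, by itself) prove torn-page safety. */
    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/statvfs.h>

    #define DB_PAGE_SIZE 8192

    static int check_dir(const char *path)
    {
        struct statvfs sv;

        if (statvfs(path, &sv) != 0) {
            perror(path);
            return -1;
        }
        if (sv.f_bsize != DB_PAGE_SIZE) {
            fprintf(stderr,
                    "%s: fs block size %lu does not match the %d-byte page size\n",
                    path, (unsigned long) sv.f_bsize, DB_PAGE_SIZE);
            return -1;
        }
        return 0;
    }

    int main(void)
    {
        DIR *d;
        struct dirent *de;
        char path[4096];
        int rc = check_dir(".");            /* the data directory itself */

        if ((d = opendir("pg_tblspc")) != NULL) {
            while ((de = readdir(d)) != NULL) {
                if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
                    continue;
                snprintf(path, sizeof(path), "pg_tblspc/%s", de->d_name);
                if (check_dir(path) != 0)   /* statvfs follows the tablespace symlink */
                    rc = -1;
            }
            closedir(d);
        }
        return rc ? 1 : 0;
    }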
On Fri, Mar 22, 2024 at 10:31:11PM +0100, Tomas Vondra wrote: > Right, but things change over time - current storage devices support > much larger sectors (LBA format), usually 4K. And if you do I/O with > this size, it's usually atomic. > > AFAIK if you built Postgres with 4K pages, on a device with 4K LBA > format, that would not need full-page writes - we always do I/O in 4k > pages, and block layer does I/O (during writeback from page cache) with > minimum guaranteed size = logical block size. 4K are great for OLTP > systems in general, it'd be even better if we didn't need to worry about > torn pages (but the tricky part is to be confident it's safe to disable > them on a particular system). Yes, even if the file system is 8k, and the storage is 8k, we only know that torn pages are impossible if the file system never overwrites existing 8k pages, but writes new ones and then makes them active. I think ZFS does that to handle snapshots. > The other thing is - is there a reliable way to say when the guarantees > actually apply? I mean, how would the administrator *know* it's safe to > set full_page_writes=off, or even better how could we verify this when > the database starts (and complain if it's not safe to disable FPW)? Yes, this is quite hard to know. Our docs have:
https://www.postgresql.org/docs/current/wal-reliability.html
Another risk of data loss is posed by the disk platter write operations themselves. Disk platters are divided into sectors, commonly 512 bytes each. Every physical read or write operation processes a whole sector. When a write request arrives at the drive, it might be for some multiple of 512 bytes (PostgreSQL typically writes 8192 bytes, or 16 sectors, at a time), and the process of writing could fail due to power loss at any time, meaning some of the 512-byte sectors were written while others were not. To guard against such failures, PostgreSQL periodically writes full page images to permanent WAL storage before modifying the actual page on disk. By doing this, during crash recovery PostgreSQL can
--> restore partially-written pages from WAL. If you have file-system
--> software that prevents partial page writes (e.g., ZFS), you can turn off
--> this page imaging by turning off the full_page_writes parameter.
--> Battery-Backed Unit (BBU) disk controllers do not prevent partial page
--> writes unless they guarantee that data is written to the BBU as full
--> (8kB) pages.
-- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
On Fri, Mar 22, 2024 at 10:56 PM Pankaj Raghav (Samsung) <kernel@pankajraghav.com> wrote: > My team and I have been working on adding Large block size(LBS) > support to XFS in Linux[1]. Once this feature lands upstream, we will be > able to create XFS with FS block size > page size of the system on Linux. > We also gave a talk about it in Linux Plumbers conference recently[2] > for more context. The initial support is only for XFS but more FSs will > follow later. Very cool! (I used XFS on IRIX in the 90s, and it had large blocks then, a feature lost in the port to Linux AFAIK.) > On an x86_64 system, fs block size was limited to 4k, but traditionally > Postgres uses 8k as its default internal page size. With LBS support, > fs block size can be set to 8K, thereby matching the Postgres page size. > > If the file system block size == DB page size, then Postgres can have > guarantees that a single DB page will be written as a single unit during > kernel write back and not split. > > My knowledge of Postgres internals is limited, so I'm wondering if there > are any optimizations or potential optimizations that Postgres could > leverage once we have LBS support on Linux? FWIW here are a couple of things I wrote about our storage atomicity problem, for non-PostgreSQL hackers who may not understand our project jargon: https://wiki.postgresql.org/wiki/Full_page_writes https://freebsdfoundation.org/wp-content/uploads/2023/02/munro_ZFS.pdf The short version is that we (and MySQL, via a different scheme with different tradeoffs) could avoid writing all our stuff out twice if we could count on atomic writes of a suitable size on power failure, so the benefits are very large. As far as I know, there are two things we need from the kernel and storage to do that on "overwrite" filesystems like XFS: 1. The disk must promise that its atomicity-on-power-failure is a multiple of our block size -- something like NVMe AWUPF, right? My devices seem to say 0 :-( Or I guess the filesystem has to compensate, but then it's not exactly an overwrite filesystem anymore... 2. The kernel must promise that there is no code path in either buffered I/O or direct I/O that will arbitrarily chop up our 8KB (or other configured block size) writes on some smaller boundary, most likely sector I guess, on their way to the device, as you were saying. Not just in happy cases, but even under memory pressure, if interrupted, etc etc. Sounds like you're working on problem #2 which is great news. I've been wondering for a while how a Unixoid kernel should report these properties to userspace where it knows them, especially on non-overwrite filesystems like ZFS where this sort of thing works already, without stuff like AWUPF working the way one might hope. Here was one throw-away idea on the back of a napkin about that, for what little it's worth: https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO
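For point 1 above, here is a rough sketch of pulling AWUPF out of an NVMe controller's Identify Controller data from userspace. The 0x06 opcode, CNS=1 and the 528-529 byte offset follow my reading of the NVMe base spec, and /dev/nvme0 is just an example device; treat this as illustrative, not authoritative:

    /* Rough sketch: read AWUPF (atomic write unit, power fail) from an NVMe
     * controller via the Identify Controller admin command.  Offsets are per
     * my reading of the NVMe base spec; illustrative only.
     * AWUPF is 0-based: 0 means one logical block. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/nvme_ioctl.h>

    int main(void)
    {
        uint8_t id[4096];
        struct nvme_admin_cmd cmd;
        int fd = open("/dev/nvme0", O_RDONLY);   /* example controller device */

        if (fd < 0) {
            perror("open");
            return 1;
        }
        memset(&cmd, 0, sizeof(cmd));
        cmd.opcode = 0x06;                  /* Identify */
        cmd.addr = (uint64_t)(uintptr_t) id;
        cmd.data_len = sizeof(id);
        cmd.cdw10 = 1;                      /* CNS=1: Identify Controller */

        if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) != 0) {
            perror("NVME_IOCTL_ADMIN_CMD");
            return 1;
        }
        /* bytes 528-529: AWUPF, per the Identify Controller layout */
        uint16_t awupf = (uint16_t) id[528] | ((uint16_t) id[529] << 8);
        printf("AWUPF = %u (i.e., %u logical block(s) atomic on power fail)\n",
               awupf, awupf + 1);
        close(fd);
        return 0;
    }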
Hi Tomas and Bruce, >>> My knowledge of Postgres internals is limited, so I'm wondering if there >>> are any optimizations or potential optimizations that Postgres could >>> leverage once we have LBS support on Linux? >> >> We have discussed this in the past, and in fact in the early years we >> thought we didn't need fsync since the BSD file system was 8k at the >> time. >> >> What we later realized is that we have no guarantee that the file system >> will write to the device in the specified block size, and even if it >> does, the I/O layers between the OS and the device might not, since many >> devices use 512-byte blocks or other sizes. >> > > Right, but things change over time - current storage devices support > much larger sectors (LBA format), usually 4K. And if you do I/O with > this size, it's usually atomic. > > AFAIK if you built Postgres with 4K pages, on a device with 4K LBA > format, that would not need full-page writes - we always do I/O in 4k > pages, and block layer does I/O (during writeback from page cache) with > minimum guaranteed size = logical block size. 4K are great for OLTP > systems in general, it'd be even better if we didn't need to worry about > torn pages (but the tricky part is to be confident it's safe to disable > them on a particular system). > > I did watch the talk linked by Pankaj, and IIUC the promise of the LBS > patches is that this benefit would apply even to larger > page sizes (= fs page size). Which right now you can't even mount, but > the patches allow that. So for example it would be possible to create an > XFS filesystem with 8kB pages, and then we'd read/write 8kB pages as > usual, and we'd know that the page cache always writes out either the > whole page or none of it. Which right now is not guaranteed to happen, > it's possible to e.g. write the page as two 4K requests, even if all > other things are set properly (drive has 4K logical/physical sectors). > > At least that's my understanding ... > Pankaj, could you clarify what the guarantees provided by LBS are going > to be? The talk uses wording like "should be" and "hint" in a couple > places, and there's also stuff I'm not 100% familiar with. > > If we create a filesystem with 8K blocks, and we only ever do writes > (and reads) in 8K chunks (our default page size), what guarantees does that > give us? What if the underlying device has LBA format with only 4K (or > perhaps even just 512B), how would that affect the guarantees? > Yes, the whole FS block is managed as one unit (it is also on a physically contiguous page), so we send the whole fs block while performing writeback. This is not guaranteed when FS block size = 4k and the DB page size is 8k, as it might be sent as two different requests as you have indicated. The LBA format will not affect the guarantee of sending the whole FS block without splitting as long as the FS block size is less than the maximum IO transfer size*. But another issue is that even though the host has done its job, the device might have a smaller atomic guarantee, thereby making it not powerfail safe. > The other thing is - is there a reliable way to say when the guarantees > actually apply? I mean, how would the administrator *know* it's safe to > set full_page_writes=off, or even better how could we verify this when > the database starts (and complain if it's not safe to disable FPW)? > This is an excellent question that needs a bit of community discussion to expose a device-agnostic value that userspace can trust.
There might be a talk this year at LSFMM about untorn writes[1] in the buffered IO path. I will make sure to bring this question up. At the moment, Linux exposes the physical block size taking atomic guarantees into account as well; for NVMe in particular it uses NAWUPF and AWUPF when setting the physical block size (/sys/block/<dev>/queue/physical_block_size). A system admin could use the value exposed by physical_block_size as a hint for setting full_page_writes=off. Of course, this also requires the device to give atomic guarantees. The optimal setup would be DB page size == FS block size == device atomic size. > It's easy to e.g. take a backup on one filesystem and restore it on > another one, and forget those may have different block sizes etc. I'm > not sure it's possible in a 100% reliable way (tablespaces?). > > > regards > [1] https://lore.kernel.org/linux-fsdevel/20240228061257.GA106651@mit.edu/ * A small caveat: I am most familiar with NVMe, so my answers might be based on my experience with NVMe.
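To make that hint concrete, here is a small sketch (illustrative only; nvme0n1 is an example device name) of reading the logical and physical block sizes the kernel exposes, where physical_block_size on NVMe already folds in NAWUPF/AWUPF as described above:

    /* Sketch: read the block sizes the kernel exposes for a device.  On NVMe,
     * physical_block_size reflects NAWUPF/AWUPF as described above, so an 8k
     * value here is a hint (not a proof) that 8k writes won't be torn by the
     * device on power failure.  "nvme0n1" is just an example device name. */
    #include <stdio.h>

    static long read_sysfs_long(const char *path)
    {
        long val = -1;
        FILE *f = fopen(path, "r");

        if (f != NULL) {
            if (fscanf(f, "%ld", &val) != 1)
                val = -1;
            fclose(f);
        }
        return val;
    }

    int main(void)
    {
        long lbs = read_sysfs_long("/sys/block/nvme0n1/queue/logical_block_size");
        long pbs = read_sysfs_long("/sys/block/nvme0n1/queue/physical_block_size");

        printf("logical_block_size  = %ld\n", lbs);
        printf("physical_block_size = %ld\n", pbs);
        if (pbs >= 8192)
            printf("device hints that 8k writes may be atomic on power failure\n");
        return 0;
    }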
Hi Thomas, On 23/03/2024 05:53, Thomas Munro wrote: > On Fri, Mar 22, 2024 at 10:56 PM Pankaj Raghav (Samsung) > <kernel@pankajraghav.com> wrote: >> My team and I have been working on adding Large block size(LBS) >> support to XFS in Linux[1]. Once this feature lands upstream, we will be >> able to create XFS with FS block size > page size of the system on Linux. >> We also gave a talk about it in Linux Plumbers conference recently[2] >> for more context. The initial support is only for XFS but more FSs will >> follow later. > > Very cool! > > (I used XFS on IRIX in the 90s, and it had large blocks then, a > feature lost in the port to Linux AFAIK.) > Yes, I also heard from the XFS maintainer that they had to drop this functionality when they did the port. :) >> On an x86_64 system, fs block size was limited to 4k, but traditionally >> Postgres uses 8k as its default internal page size. With LBS support, >> fs block size can be set to 8K, thereby matching the Postgres page size. >> >> If the file system block size == DB page size, then Postgres can have >> guarantees that a single DB page will be written as a single unit during >> kernel write back and not split. >> >> My knowledge of Postgres internals is limited, so I'm wondering if there >> are any optimizations or potential optimizations that Postgres could >> leverage once we have LBS support on Linux? > > FWIW here are a couple of things I wrote about our storage atomicity > problem, for non-PostgreSQL hackers who may not understand our project > jargon: > > https://wiki.postgresql.org/wiki/Full_page_writes > https://freebsdfoundation.org/wp-content/uploads/2023/02/munro_ZFS.pdf > This is very useful, thanks a lot. > The short version is that we (and MySQL, via a different scheme with > different tradeoffs) could avoid writing all our stuff out twice if we > could count on atomic writes of a suitable size on power failure, so > the benefits are very large. As far as I know, there are two things > we need from the kernel and storage to do that on "overwrite" > filesystems like XFS: > > 1. The disk must promise that its atomicity-on-power-failure is a > multiple of our block size -- something like NVMe AWUPF, right? My > devices seem to say 0 :-( Or I guess the filesystem has to > compensate, but then it's not exactly an overwrite filesystem > anymore... > 0 means 1 logical block, which might be 4k in your case. Typically device vendors have to add extra hardware to guarantee bigger atomic block sizes. > 2. The kernel must promise that there is no code path in either > buffered I/O or direct I/O that will arbitrarily chop up our 8KB (or > other configured block size) writes on some smaller boundary, most > likely sector I guess, on their way to the device, as you were saying. > Not just in happy cases, but even under memory pressure, if > interrupted, etc etc. > > Sounds like you're working on problem #2 which is great news. > Yes, you are spot on. :) > I've been wondering for a while how a Unixoid kernel should report > these properties to userspace where it knows them, especially on > non-overwrite filesystems like ZFS where this sort of thing works So it looks like ZFS (or any other CoW filesystem that supports larger block sizes) is doing what Postgres will do anyway with FPW=on, making it safe to turn off FPW. One question: Does ZFS do something like a FUA request to force the device to flush its cache before it can update the node to point to the new page?
If it doesn't do that, there is no guarantee from the device that the data is updated atomically, unless it has bigger atomic guarantees? > already, without stuff like AWUPF working the way one might hope. > Here was one throw-away idea on the back of a napkin about that, for > what little it's worth: > > https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO As I replied in the previous mail to Tomas, we might have a talk about untorn writes[1] at LSFMM this year. I hope to bring up some of the discussions from here. Thanks! [1] https://lore.kernel.org/linux-fsdevel/20240228061257.GA106651@mit.edu/
On 23/03/2024 03:41, Bruce Momjian wrote: > On Fri, Mar 22, 2024 at 10:31:11PM +0100, Tomas Vondra wrote: >> Right, but things change over time - current storage devices support >> much larger sectors (LBA format), usually 4K. And if you do I/O with >> this size, it's usually atomic. >> >> AFAIK if you built Postgres with 4K pages, on a device with 4K LBA >> format, that would not need full-page writes - we always do I/O in 4k >> pages, and block layer does I/O (during writeback from page cache) with >> minimum guaranteed size = logical block size. 4K are great for OLTP >> systems in general, it'd be even better if we didn't need to worry about >> torn pages (but the tricky part is to be confident it's safe to disable >> them on a particular system). > > Yes, even if the file system is 8k, and the storage is 8k, we only know > that torn pages are impossible if the file system never overwrites > existing 8k pages, but writes new ones and then makes them active. I > think ZFS does that to handle snapshots. > I think we can also avoid torn writes if:
- the filesystem's data path always writes in multiples of 8k (with alignment), and
- the device supports 8k atomic writes.
Then we might be able to push the responsibility to the device without having the overhead of a CoW FS or FPW=on. Of course, the performance here depends on the vendor-specific implementation of atomics. We are trying to enable the former by adding LBS support to XFS in Linux. -- Pankaj
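For what it's worth, a minimal sketch of the first condition as seen from userspace -- one 8k-sized, 8k-aligned direct write (illustrative only; whether it is actually untorn on power failure still depends on the filesystem's writeback path and the device's atomic guarantee):

    /* Sketch: issue one 8k write from an 8k-aligned buffer at an 8k-aligned
     * offset using O_DIRECT.  This satisfies the "multiples of 8k, aligned"
     * condition from userspace; it does NOT by itself guarantee the device
     * completes it atomically on power failure. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define DB_PAGE_SIZE 8192

    int main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "testfile";
        void *buf;
        int fd;

        if (posix_memalign(&buf, DB_PAGE_SIZE, DB_PAGE_SIZE) != 0)
            return 1;
        memset(buf, 0xab, DB_PAGE_SIZE);

        fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        /* one page-sized, page-aligned write at a page-aligned offset */
        if (pwrite(fd, buf, DB_PAGE_SIZE, 0) != DB_PAGE_SIZE) {
            perror("pwrite");
            return 1;
        }
        if (fsync(fd) != 0) {       /* still need to persist file metadata */
            perror("fsync");
            return 1;
        }
        close(fd);
        free(buf);
        return 0;
    }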
On Mon, Mar 25, 2024 at 02:53:56PM +0100, Pankaj Raghav wrote: > This is an excellent question that needs a bit of community discussion to > expose a device agnostic value that userspace can trust. > > There might be a talk this year at LSFMM about untorn writes[1] in buffered IO > path. I will make sure to bring this question up. > > At the moment, Linux exposes the physical blocksize by taking also atomic guarantees > into the picture, especially for NVMe it uses the NAWUPF and AWUPF while setting > physical blocksize (/sys/block/<dev>/queue/physical_block_size). > > A system admin could use value exposed by phy_bs as a hint to disable full_page_write=off. > Of course this requires also the device to give atomic guarantees. > > The most optimal would be DB page size == FS block size == Device atomic size. One other thing I remember is that some people modified the ZFS file system parameters enough that they made Postgres non-durable and corrupted their database. This is a very hard thing to get right because the user has very little feedback when they break things. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
On Tue, Mar 26, 2024 at 3:34 AM Pankaj Raghav <kernel@pankajraghav.com> wrote: > One question: Does ZFS do something like a FUA request to force the device > to flush its cache before it can update the node to point to the new page? > > If it doesn't do that, there is no guarantee from the device that the data > is updated atomically, unless it has bigger atomic guarantees? It flushes the whole disk write cache (unless you turn that off). AFAIK it can't use FUA instead yet (it knows some things about it, there are mentions under the Linux-specific parts of the tree but that may be more to do with understanding and implementing it when exporting a virtual block device, or something like that (?), but I don't believe it knows how to use it for its own underlying log or ordering). FUA would clearly be better, no waiting for random extra data to be flushed.
Greetings, * Pankaj Raghav (kernel@pankajraghav.com) wrote: > On 23/03/2024 05:53, Thomas Munro wrote: > > On Fri, Mar 22, 2024 at 10:56 PM Pankaj Raghav (Samsung) > > <kernel@pankajraghav.com> wrote: > >> My team and I have been working on adding Large block size(LBS) > >> support to XFS in Linux[1]. Once this feature lands upstream, we will be > >> able to create XFS with FS block size > page size of the system on Linux. > >> We also gave a talk about it in Linux Plumbers conference recently[2] > >> for more context. The initial support is only for XFS but more FSs will > >> follow later. > > > > Very cool! Yes, this is very cool sounding and could make a real difference for PG. > > (I used XFS on IRIX in the 90s, and it had large blocks then, a > > feature lost in the port to Linux AFAIK.) > > Yes, I also heard from the XFS maintainer that they had to drop > this functionality when they did the port. :) I also recall the days of XFS on IRIX... Many moons ago. > > The short version is that we (and MySQL, via a different scheme with > > different tradeoffs) could avoid writing all our stuff out twice if we > > could count on atomic writes of a suitable size on power failure, so > > the benefits are very large. As far as I know, there are two things > > we need from the kernel and storage to do that on "overwrite" > > filesystems like XFS: > > > > 1. The disk must promise that its atomicity-on-power-failure is a > > multiple of our block size -- something like NVMe AWUPF, right? My > > devices seem to say 0 :-( Or I guess the filesystem has to > > compensate, but then it's not exactly an overwrite filesystem > > anymore... > > 0 means 1 logical block, which might be 4k in your case. Typically device > vendors have to add extra hardware to guarantee bigger atomic block sizes. If I'm following correctly, this would mean that PG with FPW=off (assuming everything else works) would be safe on more systems if PG supported a 4K block size than if PG only supported 8K blocks, right? There's been discussion and even some patches posted around the idea of having run-time support in PG for different block sizes. Currently, it's a compile-time option with the default being 8K, meaning that's the only option on a huge number of the deployed PG environments out there. Moving it to run-time has some challenges and there are concerns about the performance ... but if it meant we could run safely with FPW=off, that's a pretty big deal. On the other hand, if the expectation is that basically everything will support atomic 8K, then we might be able to simply keep that and not deal with supporting different page sizes at run-time (of course, this is only one of the considerations in play, but it could be particularly key, if I'm following correctly). Appreciate any insights you can share on this. Thanks! Stephen