Thread: [PING] fallocate() causes btrfs to never compress postgresql files
Hello,

sorry for mass sending this, but I didn't get any response to my first email [1], so I'm now CC'ing the author and the reviewers of commit 4d330a6 [2]. I think it's an important issue, because I need to custom-compile PostgreSQL to have what I had before: a transparently compressed database.

[1] https://www.postgresql.org/message-id/d0f4fc11-969d-7b3a-aacf-00f86450e738@gmx.net
[2] https://github.com/postgres/postgres/commit/4d330a61bb1969df31f2cebfe1ba9d1d004346d8

My previous message follows:

Hi,

this is just a heads-up about files generated by PostgreSQL 17 not being compressed by Btrfs, even when mounted with the compress-force mount option. I see this occurring aggressively when restoring a database via pg_restore. I think this is caused by mdzeroextend() calling FileFallocate(), which in turn invokes posix_fallocate().

I also verified that turning off the use of fallocate causes the database to write compressed files again, like it did in older versions. Unfortunately the only way I found was to configure with a "hack" so that autoconf thinks the feature is not available:

./configure ac_cv_func_posix_fallocate=no

There have been discussions on the btrfs mailing list about why it does that. The summary is that it is very difficult to guarantee that compressed writes will not fail with ENOSPC on a CoW filesystem, thus files with fallocate()d ranges are treated as being marked NOCOW, effectively disabling compression.

Should PostgreSQL provide a setting to avoid the use of fallocate()? Or is it the filesystem at fault for not returning EOPNOTSUPP, in which case postgres would use its fallback code?

BTW even in the last case, PostgreSQL would not notice the lack of fallocate() support, as glibc implements a userspace fallback in posix_fallocate(). That fallback has its own issues that hopefully will not affect postgres (see CAVEATS in man 3 posix_fallocate).

Regards,
Dimitris
On 5/28/25 16:22, Dimitrios Apostolou wrote:
> Hello, sorry for mass sending this, but I didn't get any response to my
> first email [1] so I'm now CC'ing the commit's 4d330a6 [2] author and
> the reviewers. I think it's an important issue, because I need to
> custom-compile postgresql to have what I had before: a transparently
> compressed database.
>

That message arrived a couple of days before the feature freeze, so everyone was busy getting PG18 patches over the line. I assume that's why no one responded to a message about an issue that already affects PG17. We're now in the quieter part of the dev cycle, people are recovering etc. Hence the delay.

> [1] https://www.postgresql.org/message-id/d0f4fc11-969d-7b3a-aacf-00f86450e738@gmx.net
> [2] https://github.com/postgres/postgres/commit/4d330a61bb1969df31f2cebfe1ba9d1d004346d8
>
> My previous message follows:
>
> Hi,
>
> this is just a heads-up about files being generated by PostgreSQL 17 not
> being compressed by Btrfs, even when mounted with the compress-force mount
> option. I have this occurring aggressively when restoring a database via
> pg_restore. I think this is caused by mdzeroextend() calling FileFallocate(),
> which in turn invokes posix_fallocate().
>

Right, I don't think we're really using posix_fallocate() in other places, or at least not in places that would matter. And this code comes from commit 4d330a61bb in PG17:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=4d330a61bb1969df31f2cebfe1ba9d1d004346d8

The commit message explains why we do that: it has advantages when allocating a large number of blocks. FWIW it's general code, used whenever we need to add space to a relation, not just for pg_restore.

> I also verified that turning off the use of fallocate causes the database
> to write compressed files again, like it did in older versions.
> Unfortunately the only way I found was to configure with a "hack" so that
> autoconf thinks the feature is not available:
>
> ./configure ac_cv_func_posix_fallocate=no
>

Unfortunately, that seems pretty heavy-handed, because it will affect the whole build, no matter which filesystem it gets used with. And I guess we don't want to disable posix_fallocate() just because one filesystem does something ... strange.

> There have been discussions on the btrfs mailing list about why it does
> that, the summary is that it is very difficult to guarantee that
> compressed writes will not fail with ENOSPC on a CoW filesystem, thus
> files with fallocate()d ranges are treated as being marked NOCOW,
> effectively disabling compression.
>

Isn't guaranteeing the success of a write a general issue with compressed filesystems? Why is posix_fallocate() special in this regard? Shouldn't the filesystem be defensive and assume the data is not compressible? Or maybe just return EOPNOTSUPP when in doubt.

> Should PostgreSQL provide a setting to avoid the use of fallocate()? Or is
> it the filesystem at fault for not returning EOPNOTSUPP, in which case
> postgres would use its fallback code?
>

I don't have a clear opinion on whether it's a filesystem issue. Maybe we should be handling this differently, not sure.

> BTW even in the last case, PostgreSQL would not notice the lack of
> fallocate() support as glibc implements a userspace fallback in
> posix_fallocate(). That fallback has its own issues that hopefully will
> not affect postgres (see CAVEATS in man 3 posix_fallocate).
>

Well, if btrfs starts returning EOPNOTSUPP, and glibc switches to the userspace fallback, we wouldn't notice. But it's up to btrfs to decide whether they want to support fallocate. We still need our fallback anyway, because of other OSes.

regards

--
Tomas Vondra
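The "try fallocate, otherwise fall back" behaviour discussed in this exchange can be sketched roughly as below. This is a minimal Python illustration and not PostgreSQL's actual C code: the function name extend_with_fallback, the 8192-byte block size and the choice to treat only EOPNOTSUPP as "unsupported" are assumptions made for the example. As noted in the thread, with glibc the fallback branch may rarely be reached, because posix_fallocate() emulates the allocation in userspace.

```python
import errno
import os
import tempfile

BLOCK_SIZE = 8192  # assumed block size, for illustration only


def extend_with_fallback(fd: int, offset: int, length: int) -> str:
    """Extend a file by `length` bytes starting at `offset`.

    Tries posix_fallocate() first; if the filesystem reports that the
    operation is unsupported, writes zero-filled blocks instead.
    Returns the name of the method that was used.
    """
    try:
        os.posix_fallocate(fd, offset, length)
        return "fallocate"
    except OSError as exc:
        # Assumption for this sketch: only EOPNOTSUPP triggers the fallback;
        # anything else (for example ENOSPC) is reported to the caller.
        if exc.errno != errno.EOPNOTSUPP:
            raise

    zeros = bytes(BLOCK_SIZE)
    written = 0
    while written < length:
        chunk = min(BLOCK_SIZE, length - written)
        # pwrite() may write fewer bytes than requested, so advance by its result.
        written += os.pwrite(fd, zeros[:chunk], offset + written)
    return "write"


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    try:
        method = extend_with_fallback(fd, 0, 16 * BLOCK_SIZE)
        print(method, os.fstat(fd).st_size)
    finally:
        os.close(fd)
        os.unlink(path)
```

Which branch runs depends on the filesystem and the C library, so the printed method will differ between systems.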
Thomas Munro <thomas.munro@gmail.com> writes:
> It's slightly tricky to get smgr to behave differently because of the
> contents of a system catalogue!

The mere thought makes me blanch. I'm okay with the GUC part, but I do not think we should put in 0002 --- the odds of causing serious problems greatly outweigh the value, IMO. Fundamental layering violations tend to bite you on tender parts of your anatomy.

regards, tom lane
Re: [PING] fallocate() causes btrfs to never compress postgresql files
From: Dimitrios Apostolou
On Sun, 1 Jun 2025, Thomas Munro wrote:
> Or for a completely different approach: I wonder if ftruncate() would
> be more efficient on COW systems anyway. The minimum thing we need is
> for the file system to remember the new size, 'cause, erm, we don't.
> All the rest is probably a waste of cycles, since they reserve real
> space (or fail to) later in the checkpointer or whatever process
> eventually writes the data out.

FWIW I asked the btrfs devs. From
https://github.com/kdave/btrfs-progs/pull/976
I quote Qu Wenruo:

> Only for falloc(), not ftruncate().
>
> The PREALLOC inode flag is added for any preallocated file extent,
> meanwhile truncate only creates holes.
>
> truncate is fast but it's really different from fallocate by there is
> nothing really allocated.
>
> This means the later writes will need to allocate their own data
> extents. This is fine and even preferred for btrfs, but may lead to
> performance drop for more traditional fses.
>
> We're in an era that fs features are not longer that generic, fallocate
> is just one example, in fact fallocate will cause more problems more
> than no compression.
>
> It's really a deep rabbit hole, and is not something simple true or
> false questions.

In other words, btrfs will not try to allocate anything with ftruncate(); it will just mark the new space as a "hole". As such, the file is not marked as "PREALLOC", which is what disables compression. Of course there is no guarantee that further writes will succeed, and as quoted above, other (non-COW) filesystems might be slower writing the ftruncate()-allocated space.

Regards,
Dimitris
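The difference between a hole and a preallocated extent described above can be observed from userspace by comparing a file's logical size with the blocks actually allocated to it. The following is a small Python sketch under stated assumptions: the helper name grow_and_stat and the 1 MiB size are made up for the example, and the allocated-block figure depends entirely on the filesystem, so no particular numbers should be expected.

```python
import os
import tempfile

SIZE = 1024 * 1024  # 1 MiB, an arbitrary size for the demonstration


def grow_and_stat(method: str) -> os.stat_result:
    """Grow an empty temporary file to SIZE bytes and return its stat result."""
    fd, path = tempfile.mkstemp()
    try:
        if method == "ftruncate":
            # Only records the new length; typically leaves a hole.
            os.ftruncate(fd, SIZE)
        elif method == "fallocate":
            # Asks the filesystem to reserve space for the whole range.
            os.posix_fallocate(fd, 0, SIZE)
        else:
            raise ValueError(f"unknown method: {method}")
        return os.fstat(fd)
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    for method in ("ftruncate", "fallocate"):
        st = grow_and_stat(method)
        # st_blocks is counted in 512-byte units on Linux.
        print(f"{method}: size={st.st_size} allocated={st.st_blocks * 512}")
```

Both methods report the same logical size; only the allocated figure distinguishes them, and how it behaves on a given filesystem is exactly what the thread suggests testing.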
On Mon, Jun 2, 2025 at 10:14 PM Dimitrios Apostolou <jimis@gmx.net> wrote:
> On Sun, 1 Jun 2025, Thomas Munro wrote:
> > Or for a completely different approach: I wonder if ftruncate() would
> > be more efficient on COW systems anyway. The minimum thing we need is
> > for the file system to remember the new size, 'cause, erm, we don't.
> > All the rest is probably a waste of cycles, since they reserve real
> > space (or fail to) later in the checkpointer or whatever process
> > eventually writes the data out.
>
> FWIW I asked the btrfs devs. From
> https://github.com/kdave/btrfs-progs/pull/976
> I quote Qu Wenruo:
>
> > Only for falloc(), not ftruncate().
> >
> > The PREALLOC inode flag is added for any preallocated file extent,
> > meanwhile truncate only creates holes.
> >
> > truncate is fast but it's really different from fallocate by there is
> > nothing really allocated.
> >
> > This means the later writes will need to allocate their own data
> > extents. This is fine and even preferred for btrfs, but may lead to
> > performance drop for more traditional fses.
> >
> > We're in an era that fs features are not longer that generic, fallocate
> > is just one example, in fact fallocate will cause more problems more
> > than no compression.
> >
> > It's really a deep rabbit hole, and is not something simple true or
> > false questions.
>
>
> In other words, btrfs will not try to allocate anything with ftruncate(),
> it will just mark the new space as a "hole". As such, the file is not
> marked as "PREALLOC" which is what disables compression. Of course there
> is no guarantee that further writes will succeed, and as quoted above,
> other (non-COW) filesystems might be slower writing the
> ftruncate()-allocated space.

Yeah, right, I know. But PostgreSQL has at least two different goals when extending a relation:

1. Remember the new size of the relation somewhere*.
2. Reserve space now, so that we can report ENOSPC and roll back the transaction that wants to extend the relation when the disk is full, instead of causing a checkpoint or buffer eviction to fail later (see https://wiki.postgresql.org/wiki/ENOSPC for longer version).

But the second thing just can't work on a COW system by definition, so the whole notion is bogus, which is why I wondered if ftruncate() is actually a reasonable option to have, even though it just creates holes (on Unixen). I also know of another completely different reason to want to use ftruncate(): NTFS, which *doesn't* create holes (NTFS supports holes via other syscalls, but ftruncate() or rather _chsize_s() as they spell it doesn't make them), making it more like posix_fallocate() in this usage.

So I was beginning to wonder if we might want to experiment with a patch that adds file_extend_method=fallocate,ftruncate,write. Perhaps accompanied by a threshold setting below which it always writes. Then we could experiment with various COW file systems (zfs, btrfs, apfs, refs, ???) and NTFS to see how that speculation works out in reality.

Wild speculation: To actually achieve the second thing on a COW file system, you'd probably need some totally new kind of interface, because that POSIX interface has the wrong shape. I have wondered about a new fcntl() or whatever that would let you reserve the right to write N blocks (ie just once!) without ENOSPC on a given descriptor, that a database could conceptually acquire when dirtying buffers, since that's the point at which we know that a write must eventually happen (then probably amortise that accounting a lot), including but not limited to this relation-extension case, and that way you could achieve goal #2, ie transferring ENOSPC errors to transaction time. But that's just a daydream about vapourware. One problem is that PostgreSQL has many processes with separate file descriptors, so that'd make the bookkeeping trickier but not impossible.

(*That has a few known issues...)
On Tue, Jun 3, 2025 at 1:58 AM Dimitrios Apostolou <jimis@gmx.net> wrote:
> This sounds like the best solution IMO. People can then experiment with
> different settings and filesystems, and that way we also learn in the
> process. Thank you for the effort and patches so far.

OK, here's a basic patch to experiment with. You can set:

file_extend_method = fallocate,ftruncate,write
file_extend_method_threshold = 8 # (below 8 always write, 0 means never write)

To really make COPY fly we also need to get write combining and AIO going (we've had this working with various prototypes, but it all missed the boat for v18, which can only do that stuff for reads). Then you'll have concurrent 128kB or up to 1MB writes trundling along in the background, which I guess should work pretty nicely for stuff like BTRFS/ZFS and compression and all that jazz.
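To make the interaction of the two proposed settings concrete, here is a rough Python sketch of how a method and a threshold could be combined. It is not the patch itself (which is C code inside PostgreSQL): the function names choose_method and extend_file, the 8192-byte block size, and the reading of "below 8 always write" as a strict less-than comparison are assumptions for illustration only.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # assumed block size, for illustration only
METHODS = ("fallocate", "ftruncate", "write")


def choose_method(nblocks: int, method: str, threshold: int) -> str:
    """Pick how to extend by `nblocks`, given the configured method and threshold.

    Assumed semantics: extensions smaller than `threshold` blocks are always
    written; a threshold of 0 therefore never forces a write.
    """
    if method not in METHODS:
        raise ValueError(f"unknown method: {method}")
    if nblocks < threshold:
        return "write"
    return method


def write_all(fd: int, data: bytes, offset: int) -> None:
    """Write all of `data` at `offset`, retrying after short writes."""
    view = memoryview(data)
    while view:
        n = os.pwrite(fd, view, offset)
        view = view[n:]
        offset += n


def extend_file(fd: int, old_blocks: int, nblocks: int,
                method: str = "fallocate", threshold: int = 8) -> str:
    """Extend the file from `old_blocks` by `nblocks`; return the method used."""
    offset = old_blocks * BLOCK_SIZE
    length = nblocks * BLOCK_SIZE
    chosen = choose_method(nblocks, method, threshold)
    if chosen == "fallocate":
        os.posix_fallocate(fd, offset, length)
    elif chosen == "ftruncate":
        os.ftruncate(fd, offset + length)
    else:
        write_all(fd, bytes(length), offset)
    return chosen


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    try:
        print(extend_file(fd, 0, 16, method="ftruncate"))  # large extension
        print(extend_file(fd, 16, 2, method="ftruncate"))  # small extension
        print(os.fstat(fd).st_size)
    finally:
        os.close(fd)
        os.unlink(path)
```

Keeping the decision in a separate pure function makes the threshold rule easy to check on its own, independently of what any particular filesystem does with the resulting system call.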
Re: [PING] fallocate() causes btrfs to never compress postgresql files
From: Dimitrios Apostolou
On Thu, 12 Jun 2025, Dimitrios Apostolou wrote:
> On Mon, 9 Jun 2025, Thomas Munro wrote:
>
>> On Tue, Jun 3, 2025 at 1:58 AM Dimitrios Apostolou <jimis@gmx.net> wrote:
>>> This sounds like the best solution IMO. People can then experiment with
>>> different settings and filesystems, and that way we also learn in the
>>> process. Thank you for the effort and patches so far.
>>
>> OK, here's a basic patch to experiment with. You can set:
>>
>> file_extend_method = fallocate,ftruncate,write
>> file_extend_method_threshold = 8 # (below 8 always write, 0 means never
>> write)
>>
>
> I applied the patch on PostgreSQL v17 and am testing it now. I chose the
> ftruncate method and I see ftruncate in action using strace while doing
> pg_restore of a big database. Nothing unexpected has happened so far. I also
> verified that files are being compressed, obeying Btrfs's mount option
> compress=zstd.
>
> Thanks for the patch! What are the odds of committing it to v17?

Ping. :-)
The patch behaves well for me. Any chance of applying it and backporting it?

>
> Dimitris
On Fri, Jul 11, 2025 at 5:39 AM Dimitrios Apostolou <jimis@gmx.net> wrote:
> > I applied the patch on PostgreSQL v17 and am testing it now. I chose the
> > ftruncate method and I see ftruncate in action using strace while doing
> > pg_restore of a big database. Nothing unexpected has happened so far. I also
> > verified that files are being compressed, obeying Btrfs's mount option
> > compress=zstd.
> >
> > Thanks for the patch! What are the odds of committing it to v17?
>
> Ping. :-)
> The patch behaves well for me. Any chance of applying it and backporting it?

Yeah, this seems to make sense, as it is a pretty bad regression for people who are counting on BTRFS compression for their large databases. Not so sure about the threshold bit -- I'd probably leave that out of the backport in the interest of stable branch-minimalism.

Anyone have any better ideas, better naming, or objections?