Thread: FileFallocate misbehaving on XFS
Hello PG Hackers,

Our application has recently migrated to PG16, and we have experienced some failed upgrades. The upgrades are performed using pg_upgrade and have failed during the phase where the schema is restored into the new cluster, with the following error:

pg_restore: error: could not execute query: ERROR: could not extend file "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with FileFallocate(): No space left on device
HINT: Check free disk space.

This has happened multiple times on different servers, and in each case there was plenty of free space available.

We found this thread describing similar issues:

https://www.postgresql.org/message-id/flat/AS1PR05MB91059AC8B525910A5FCD6E699F9A2%40AS1PR05MB9105.eurprd05.prod.outlook.com

As is the case in that thread, all of the affected databases are using XFS.

One of my colleagues built postgres from source with HAVE_POSIX_FALLOCATE not defined, and using that build he was able to complete the pg_upgrade, and then switched to a stock postgres build after the upgrade. However, as you might expect, after the upgrade we have experienced similar errors during regular operation. We make heavy use of COPY, which is mentioned in the above discussion as pre-allocating files.

We have seen this on both Rocky Linux 8 (kernel 4.18.0) and Rocky Linux 9 (kernel 5.14.0).

I am wondering if this bug might be related:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1791323

> When given an offset of 0 and a length, fallocate (man 2 fallocate) reports ENOSPC if the size of the file + the length to be allocated is greater than the available space.

There is a reproduction procedure at the bottom of the above Ubuntu thread, and using that procedure I get the same results on both kernel 4.18.0 and 5.14.0. When calling fallocate with offset zero on an existing file, I get ENOSPC even if I am only requesting the same amount of space as the file already has. If I repeat the experiment with ext4 I don't get that behaviour.

On a surface examination of the code paths leading to the FileFallocate call, it does not look like it should be trying to allocate already allocated space, but I might have missed something there.

Is this already being looked into?

Thanks in advance,

Cheers
Mike
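PS: For anyone who wants to poke at this directly, here is a minimal C sketch of the behaviour described in the Ubuntu report (the mount point and test file are placeholders; this exercises the raw posix_fallocate() call on an already-populated file, it is not PostgreSQL code):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /*
     * Re-request exactly the space an existing file already occupies.
     * On the affected kernels/XFS this is reported to fail with ENOSPC
     * whenever file size + requested length exceeds the free space,
     * even though no new blocks should be needed; on ext4 it succeeds.
     */
    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "/mnt/xfs/testfile"; /* placeholder */
        struct stat st;
        int fd, rc;

        fd = open(path, O_RDWR);
        if (fd < 0) { perror("open"); return 1; }
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        rc = posix_fallocate(fd, 0, st.st_size);
        if (rc != 0)
            fprintf(stderr, "posix_fallocate(0, %lld): %s\n",
                    (long long) st.st_size, strerror(rc));
        else
            printf("posix_fallocate(0, %lld) succeeded\n", (long long) st.st_size);

        close(fd);
        return rc != 0;
    }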
On Mon, 9 Dec 2024 at 10:19, Michael Harris <harmic@gmail.com> wrote:
> Is this already being looked into?
Funny, I guess it's the same reason I randomly see the WhatsApp web interface complain, on Chrome, since I switched to XFS. It says something like "no more space on disk" and logs me out, with more than 300GB available.
Anyway, just a simple hint: I would try writing to the XFS mailing list. There you can reach the Red Hat XFS maintainers and the usual historical developers, of course!
On 12/9/24 08:34, Michael Harris wrote:
> Hello PG Hackers
>
> Our application has recently migrated to PG16, and we have experienced
> some failed upgrades. The upgrades are performed using pg_upgrade and
> have failed during the phase where the schema is restored into the new
> cluster, with the following error:
>
> pg_restore: error: could not execute query: ERROR: could not extend
> file "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with
> FileFallocate(): No space left on device
> HINT: Check free disk space.
>
> This has happened multiple times on different servers, and in each
> case there was plenty of free space available.
>
> We found this thread describing similar issues:
>
> https://www.postgresql.org/message-id/flat/AS1PR05MB91059AC8B525910A5FCD6E699F9A2%40AS1PR05MB9105.eurprd05.prod.outlook.com
>
> As is the case in that thread, all of the affected databases are using XFS.
>
> One of my colleagues built postgres from source with
> HAVE_POSIX_FALLOCATE not defined, and using that build he was able to
> complete the pg_upgrade, and then switched to a stock postgres build
> after the upgrade. However, as you might expect, after the upgrade we
> have experienced similar errors during regular operation. We make
> heavy use of COPY, which is mentioned in the above discussion as
> pre-allocating files.
>
> We have seen this on both Rocky Linux 8 (kernel 4.18.0) and Rocky
> Linux 9 (kernel 5.14.0).
>
> I am wondering if this bug might be related:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1791323
>
>> When given an offset of 0 and a length, fallocate (man 2 fallocate) reports ENOSPC if the size of the file + the length to be allocated is greater than the available space.
>
> There is a reproduction procedure at the bottom of the above ubuntu
> thread, and using that procedure I get the same results on both kernel
> 4.18.0 and 5.14.0.
> When calling fallocate with offset zero on an existing file, I get
> ENOSPC even if I am only requesting the same amount of space as the
> file already has.
> If I repeat the experiment with ext4 I don't get that behaviour.
>
> On a surface examination of the code paths leading to the
> FileFallocate call, it does not look like it should be trying to
> allocate already allocated space, but I might have missed something
> there.
>
> Is this already being looked into?

Sounds more like an XFS bug/behavior, so it's not clear to me what we could do about it. I mean, if the filesystem reports bogus out-of-space, is there even something we can do?

What is not clear to me is why would this affect pg_upgrade at all. We have the data files split into 1GB segments, and the copy/clone/... goes one by one. So there shouldn't be more than 1GB "extra" space needed. Surely you have more free space on the system?

regards

--
Tomas Vondra
On 12/9/24 10:47, Andrea Gelmini wrote:
> On Mon, 9 Dec 2024 at 10:19, Michael Harris <harmic@gmail.com> wrote:
>
>     Is this already being looked into?
>
> Funny, i guess it's the same reason I see randomly complain of WhatsApp
> web interface, on Chrome, since I switched to XFS. It says something
> like "no more space on disk" and logout, with more than 300GB available.

If I understand the fallocate issue correctly, it essentially ignores the offset, so

  fallocate -o 0 -l LENGTH

fails if

  LENGTH + CURRENT_LENGTH > FREE_SPACE

But if you have 300GB available, that'd mean you have a file that's close to that size already. But is that likely for WhatsApp?

> Anyway, just a stupid hint, I would try to write to XFS mailing list.
> There you can reach XFS maintainers of Red Hat and the usual historical
> developers, of course!!!

Yes, I think that's a better place to report this. I don't think we're doing anything particularly weird / wrong with fallocate().

regards

--
Tomas Vondra
On Mon, Dec 9, 2024 at 10:19 AM Michael Harris <harmic@gmail.com> wrote:
Hi Michael,
> We found this thread describing similar issues:
>
> https://www.postgresql.org/message-id/flat/AS1PR05MB91059AC8B525910A5FCD6E699F9A2%40AS1PR05MB9105.eurprd05.prod.outlook.com
We've had some cases in the past here at EDB where an OS vendor has blamed XFS AG fragmentation (too many AGs, and if one AG does not have enough space -> error). Could you perhaps show us the output of the following on that LUN:
1. xfs_info
2. run that script from https://www.suse.com/support/kb/doc/?id=000018219 for your AG range
-J.
On 12/9/24 11:27, Jakub Wartak wrote:
> On Mon, Dec 9, 2024 at 10:19 AM Michael Harris <harmic@gmail.com> wrote:
>
> Hi Michael,
>
>     We found this thread describing similar issues:
>
>     https://www.postgresql.org/message-id/flat/AS1PR05MB91059AC8B525910A5FCD6E699F9A2%40AS1PR05MB9105.eurprd05.prod.outlook.com
>
> We've got some case in the past here in EDB, where an OS vendor has
> blamed XFS AG fragmentation (too many AGs, and if one AG is not having
> enough space -> error). Could You perhaps show us output of on that LUN:
> 1. xfs_info
> 2. run that script from https://www.suse.com/support/kb/doc/?id=000018219
> for Your AG range

But this can be reproduced on a brand new filesystem - I just tried creating a 1GB image, creating XFS on it, mounting it, and fallocating a 600MB file twice. That fails, and there can't be any real fragmentation.

regards

--
Tomas Vondra
Hi,

On 2024-12-09 15:47:55 +0100, Tomas Vondra wrote:
> On 12/9/24 11:27, Jakub Wartak wrote:
> > On Mon, Dec 9, 2024 at 10:19 AM Michael Harris <harmic@gmail.com> wrote:
> >
> > Hi Michael,
> >
> >     We found this thread describing similar issues:
> >
> >     https://www.postgresql.org/message-id/flat/AS1PR05MB91059AC8B525910A5FCD6E699F9A2%40AS1PR05MB9105.eurprd05.prod.outlook.com
> >
> > We've got some case in the past here in EDB, where an OS vendor has
> > blamed XFS AG fragmentation (too many AGs, and if one AG is not having
> > enough space -> error). Could You perhaps show us output of on that LUN:
> > 1. xfs_info
> > 2. run that script from https://www.suse.com/support/kb/doc/?id=000018219
> > for Your AG range
>
> But this can be reproduced on a brand new filesystem - I just tried
> creating a 1GB image, create XFS on it, mount it, and fallocate a 600MB
> file twice. Which that fails, and there can't be any real fragmentation.

If I understand correctly, xfs, before even looking at the file's current layout, checks if there's enough free space for the fallocate() to succeed. Here's an explanation for why:

https://www.spinics.net/lists/linux-xfs/msg55429.html

  The real problem with preallocation failing part way through due to
  overcommit of space is that we can't go back and undo the allocation(s)
  made by fallocate because when we get ENOSPC we have lost all the state
  of the previous allocations made. If fallocate is filling holes between
  unwritten extents already in the file, then we have no way of knowing
  where the holes we filled were and hence cannot reliably free the space
  we've allocated before ENOSPC was hit.

I.e. reserving space as you go would leave you open to ending up with some, but not all, of those allocations having been made. Whereas pre-reserving the worst case space needed, ahead of time, ensures that you have enough space to go through it all. You can't just go through the file [range] and compute how much free space you will need to allocate and then do a second pass through the file, because the file layout might have changed concurrently...

This issue seems independent of the issue Michael is having though. Postgres, afaik, won't fallocate huge ranges with already allocated space.

Greetings,

Andres Freund
Hi,

On 2024-12-09 18:34:22 +1100, Michael Harris wrote:
> Our application has recently migrated to PG16, and we have experienced
> some failed upgrades. The upgrades are performed using pg_upgrade and
> have failed during the phase where the schema is restored into the new
> cluster, with the following error:
>
> pg_restore: error: could not execute query: ERROR: could not extend
> file "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with
> FileFallocate(): No space left on device
> HINT: Check free disk space.

Were those pg_upgrades done with pg_upgrade --clone? Or has --clone been used, on the same filesystem, in the past?

The reflink stuff in xfs (which is used to implement copy-on-write for files) is somewhat newer, and you're using somewhat old kernels:

> We have seen this on both Rocky Linux 8 (kernel 4.18.0) and Rocky
> Linux 9 (kernel 5.14.0).

I found some references for bugs that were fixed in 5.13. But I think at least some of this would persist if the filesystem ran into the issue with a kernel before those fixes. Did you upgrade "in-place" from Rocky Linux 8?

> I am wondering if this bug might be related:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1791323

Doubt it, we never do this as far as I am aware.

Greetings,

Andres Freund
Hi Andres

On Tue, 10 Dec 2024 at 03:31, Andres Freund <andres@anarazel.de> wrote:
> Were those pg_upgrades done with pg_upgrade --clone? Or have been, on the same
> filesystem, in the past?

No, our procedure is to use --link.

> I found some references for bugs that were fixed in 5.13. But I think at least
> some of this would persist if the filesystem ran into the issue with a kernel
> before those fixes. Did you upgrade "in-place" from Rocky Linux 8?

We generally don't use "in place" OS upgrades - however we would usually have the databases on separate filesystem(s) to the OS, and those filesystem(s) would be preserved through the upgrade, while the root fs would be scratched.

A lot of the cases reported are on RL8. I will try to find out the history of the RL9 cases to see if the filesystems started on RL8.

Could you please provide me links for the kernel bugs you are referring to?

Cheers
Mike.
Hi Tomas

On Mon, 9 Dec 2024 at 21:06, Tomas Vondra <tomas@vondra.me> wrote:
> Sounds more like an XFS bug/behavior, so it's not clear to me what we
> could do about it. I mean, if the filesystem reports bogus out-of-space,
> is there even something we can do?

I don't disagree that it's most likely an XFS issue. However, XFS is pretty widely used - it's the default FS for RHEL & the default in SUSE for non-root partitions - so maybe some action should be taken.

Some things we could consider:

- Providing a way to configure PG not to use posix_fallocate at runtime

- Detecting the use of XFS (probably nasty and complex to do in a platform independent way) and disable posix_fallocate

- In the case of posix_fallocate failing with ENOSPC, fall back to FileZero (worst case that will fail as well, in which case we will know that we really are out of space) - see the sketch at the end of this message

- Documenting that XFS might not be a good choice, at least for some kernel versions

> What is not clear to me is why would this affect pg_upgrade at all. We
> have the data files split into 1GB segments, and the copy/clone/... goes
> one by one. So there shouldn't be more than 1GB "extra" space needed.
> Surely you have more free space on the system?

Yes, that also confused me. It actually fails during the schema restore phase - where pg_upgrade calls pg_restore to restore a schema-only dump that it takes earlier in the process. At this stage it is only trying to restore the schema, not any actual table data. Note that we use the --link option to pg_upgrade, so it should not be using much space even when the table data is being upgraded.

The filesystems have >1TB free space when this has occurred.

It does continue to give this error after the upgrade, at apparently random intervals, when data is being loaded into the DB using COPY commands, so it might be best not to focus too much on the fact that we first encounter it during the upgrade.

Cheers
Mike.
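PS: Roughly, the FileZero fallback idea could look like the sketch below. This is only an illustration (a hypothetical helper, loosely modelled on the FileFallocate()/FileZero() calls that md.c already makes); the real mdzeroextend() has more to it (wait events, dirty segment registration, etc.):

    /*
     * Hypothetical helper sketching the fallback idea only; not a patch.
     */
    static void
    mdzeroextend_with_fallback(MdfdVec *v, off_t seekpos, int numblocks)
    {
        int ret;

        /* Try the fallocate path first, as PG16 already does for larger extensions. */
        ret = FileFallocate(v->mdfd_vfd, seekpos,
                            (off_t) BLCKSZ * numblocks,
                            WAIT_EVENT_DATA_FILE_EXTEND);

        if (ret != 0 && errno == ENOSPC)
        {
            /*
             * The filesystem claims to be out of space.  If that report is
             * spurious (as suspected in this thread), writing explicit zeroes
             * may still succeed; if we really are out of space, FileZero()
             * will fail too and we report that instead.
             */
            ret = FileZero(v->mdfd_vfd, seekpos,
                           (off_t) BLCKSZ * numblocks,
                           WAIT_EVENT_DATA_FILE_EXTEND);
        }

        if (ret != 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not extend file \"%s\": %m",
                            FilePathName(v->mdfd_vfd)),
                     errhint("Check free disk space.")));
    }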
Hi,

On 2024-12-10 09:34:08 +1100, Michael Harris wrote:
> On Tue, 10 Dec 2024 at 03:31, Andres Freund <andres@anarazel.de> wrote:
> > I found some references for bugs that were fixed in 5.13. But I think at least
> > some of this would persist if the filesystem ran into the issue with a kernel
> > before those fixes. Did you upgrade "in-place" from Rocky Linux 8?
>
> We generally don't use "in place" OS upgrades - however we would
> usually have the databases on separate filesystem(s) to the OS, and
> those filesystem(s) would be preserved through the upgrade, while the
> root fs would be scratched.

Makes sense.

> A lot of the cases reported are on RL8. I will try to find out the
> history of the RL9 cases to see if the filesystems started on RL8.

That'd be helpful....

> Could you please provide me links for the kernel bugs you are referring to?

I unfortunately closed most of the tabs, the only one I could quickly find again is the one referenced at the bottom of:
https://www.spinics.net/lists/linux-xfs/msg55445.html

Greetings,

Andres
Hi,

On 2024-12-10 10:00:43 +1100, Michael Harris wrote:
> On Mon, 9 Dec 2024 at 21:06, Tomas Vondra <tomas@vondra.me> wrote:
> > Sounds more like an XFS bug/behavior, so it's not clear to me what we
> > could do about it. I mean, if the filesystem reports bogus out-of-space,
> > is there even something we can do?
>
> I don't disagree that it's most likely an XFS issue. However, XFS is
> pretty widely used - it's the default FS for RHEL & the default in
> SUSE for non-root partitions - so maybe some action should be taken.
>
> Some things we could consider:
>
> - Providing a way to configure PG not to use posix_fallocate at runtime
>
> - Detecting the use of XFS (probably nasty and complex to do in a
> platform independent way) and disable posix_fallocate
>
> - In the case of posix_fallocate failing with ENOSPC, fall back to
> FileZero (worst case that will fail as well, in which case we will
> know that we really are out of space)
>
> - Documenting that XFS might not be a good choice, at least for some
> kernel versions

Pretty unexcited about all of these - XFS is fairly widely used for PG, but this problem doesn't seem very common. It seems to me that we're missing something that causes this to only happen in a small subset of cases.

I think the source of this needs to be debugged further before we try to apply workarounds in postgres.

Are you using any filesystem quotas?

It'd be useful to get the xfs_info output that Jakub asked for. Perhaps also

  xfs_spaceman -c 'freesp -s' /mountpoint
  xfs_spaceman -c 'health' /mountpoint

and df.

What kind of storage is this on?

Was the filesystem ever grown from a smaller size?

Have you checked the filesystem's internal consistency? I.e. something like xfs_repair -n /dev/nvme2n1. It does require the filesystem to be read-only or unmounted though. But corrupted filesystem datastructures certainly could cause spurious ENOSPC.

> > What is not clear to me is why would this affect pg_upgrade at all. We
> > have the data files split into 1GB segments, and the copy/clone/... goes
> > one by one. So there shouldn't be more than 1GB "extra" space needed.
> > Surely you have more free space on the system?
>
> Yes, that also confused me. It actually fails during the schema
> restore phase - where pg_upgrade calls pg_restore to restore a
> schema-only dump that it takes earlier in the process. At this stage
> it is only trying to restore the schema, not any actual table data.
> Note that we use the --link option to pg_upgrade, so it should not be
> using much space even when the table data is being upgraded.

Are you using pg_upgrade -j? I'm asking because looking at linux's git tree I found this interesting recent commit: https://git.kernel.org/linus/94a0333b9212 - but IIUC it'd actually cause file creation, not fallocate, to fail.

> The filesystems have >1TB free space when this has occurred.
>
> It does continue to give this error after the upgrade, at apparently
> random intervals, when data is being loaded into the DB using COPY
> commands, so it might be best not to focus too much on the fact that
> we first encounter it during the upgrade.

I assume the file that actually errors out changes over time? It's always fallocate() that fails?

Can you tell us anything about the workload / data? Lots of tiny tables, lots of big tables, write heavy, ...?

Greetings,

Andres Freund
Hi Andres

Following up on the earlier question about OS upgrade paths - all the cases reported so far are either on RL8 (kernel 4.18.0) or were upgraded to RL9 (kernel 5.14.0) and the affected filesystems were preserved.

In fact the RL9 systems were initially built as CentOS 7, and then when that went EOL they were upgraded to RL9. The process was as I described - the /var/opt filesystem which contained the database was preserved, and the root and other OS filesystems were scratched.

The majority of systems where we have this problem are on RL8.

On Tue, 10 Dec 2024 at 11:31, Andres Freund <andres@anarazel.de> wrote:
> Are you using any filesystem quotas?

No.

> It'd be useful to get the xfs_info output that Jakub asked for. Perhaps also
> xfs_spaceman -c 'freesp -s' /mountpoint
> xfs_spaceman -c 'health' /mountpoint
> and df.

I gathered this info from one of the systems that is currently on RL9. This system is relatively small compared to some of the others that have exhibited this issue, but it is the only one I can access right now.

# uname -a
Linux 5.14.0-503.14.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 15 12:04:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

# xfs_info /dev/mapper/ippvg-ipplv
meta-data=/dev/mapper/ippvg-ipplv isize=512    agcount=4, agsize=262471424 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0, rmapbt=0
         =                       reflink=0    bigtime=0 inobtcount=0 nrext64=0
data     =                       bsize=4096   blocks=1049885696, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=512639, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# for agno in `seq 0 3`; do xfs_spaceman -c "freesp -s -a $agno" /var/opt; done
   from      to  extents    blocks    pct
      1       1    37502     37502   0.15
      2       3    62647    148377   0.59
      4       7    87793    465950   1.85
      8      15   135529   1527172   6.08
     16      31   184811   3937459  15.67
     32      63   165979   7330339  29.16
     64     127   101674   8705691  34.64
    128     255    15123   2674030  10.64
    256     511      973    307655   1.22
total free extents 792031
total free blocks 25134175
average free extent size 31.7338
   from      to  extents    blocks    pct
      1       1    43895     43895   0.22
      2       3    59312    141693   0.70
      4       7    83406    443964   2.20
      8      15   120804   1362108   6.75
     16      31   133140   2824317  14.00
     32      63   118619   5188474  25.71
     64     127    77960   6751764  33.46
    128     255    16383   2876626  14.26
    256     511     1763    546506   2.71
total free extents 655282
total free blocks 20179347
average free extent size 30.7949
   from      to  extents    blocks    pct
      1       1    72034     72034   0.26
      2       3    98158    232135   0.83
      4       7   126228    666187   2.38
      8      15   169602   1893007   6.77
     16      31   180286   3818527  13.65
     32      63   164529   7276833  26.01
     64     127   109687   9505160  33.97
    128     255    22113   3921162  14.02
    256     511     1901    592052   2.12
total free extents 944538
total free blocks 27977097
average free extent size 29.6199
   from      to  extents    blocks    pct
      1       1    51462     51462   0.21
      2       3    98993    233204   0.93
      4       7   131578    697655   2.79
      8      15   178151   1993062   7.97
     16      31   175718   3680535  14.72
     32      63   145310   6372468  25.48
     64     127    89518   7749021  30.99
    128     255    18926   3415768  13.66
    256     511     2640    813586   3.25
total free extents 892296
total free blocks 25006761
average free extent size 28.0252

# xfs_spaceman -c 'health' /var/opt
Health status has not been collected for this filesystem.

> What kind of storage is this on?

As mentioned, there are quite a few systems in different sites, so a number of different storage solutions in use, some with directly attached disks, others with some SAN solutions. The instance I got the printout above from is a VM, but in the other site they are all bare metal.

> Was the filesystem ever grown from a smaller size?

I can't say for sure that none of them were, but given the number of different systems that have this issue I am confident that would not be a common factor.

> Have you checked the filesystem's internal consistency? I.e. something like
> xfs_repair -n /dev/nvme2n1. It does require the filesystem to be read-only or
> unmounted though. But corrupted filesystem datastructures certainly could
> cause spurious ENOSPC.

I executed this on the same system as the above prints came from. It did not report any issues.

> Are you using pg_upgrade -j?

Yes, we use -j `nproc`.

> I assume the file that actually errors out changes over time? It's always
> fallocate() that fails?

Yes, correct, on both counts.

> Can you tell us anything about the workload / data? Lots of tiny tables, lots
> of big tables, write heavy, ...?

It is a write heavy application which stores mostly time series data.

The time series data is partitioned by time; the application writes constantly into the 'current' partition, and data is expired by removing the oldest partition. Most of the data is written once and not updated.

There are quite a lot of these partitioned tables (in the 1000's or 10000's) depending on how the application is configured. Individual partitions range in size from a few MB to 10s of GB.

Cheers
Mike.
Hi again

One extra piece of information: I had said that all the machines were Rocky Linux 8 or Rocky Linux 9, but actually a large number of them are RHEL8. Sorry for the confusion.

Of course RL8 is a rebuild of RHEL8, so it is not surprising they would be behaving similarly.

Cheers
Mike
Hi,

On 2024-12-10 17:28:21 +1100, Michael Harris wrote:
> On Tue, 10 Dec 2024 at 11:31, Andres Freund <andres@anarazel.de> wrote:
> > It'd be useful to get the xfs_info output that Jakub asked for. Perhaps also
> > xfs_spaceman -c 'freesp -s' /mountpoint
> > xfs_spaceman -c 'health' /mountpoint
> > and df.
>
> I gathered this info from one of the systems that is currently on RL9.
> This system is relatively small compared to some of the others that
> have exhibited this issue, but it is the only one I can access right
> now.

I think it's implied, but I just want to be sure: this was one of the affected systems?

> # uname -a
> Linux 5.14.0-503.14.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 15
> 12:04:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
>
> # xfs_info /dev/mapper/ippvg-ipplv
> meta-data=/dev/mapper/ippvg-ipplv isize=512 agcount=4, agsize=262471424 blks
>          = sectsz=512 attr=2, projid32bit=1
>          = crc=1 finobt=0, sparse=0, rmapbt=0
>          = reflink=0 bigtime=0 inobtcount=0 nrext64=0
> data     = bsize=4096 blocks=1049885696, imaxpct=5
>          = sunit=0 swidth=0 blks
> naming   =version 2 bsize=4096 ascii-ci=0, ftype=1
> log      =internal log bsize=4096 blocks=512639, version=2
>          = sectsz=512 sunit=0 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0

It might be interesting that finobt=0, sparse=0 and nrext64=0. Those all affect space allocation to some degree, and more recently created filesystems will have them set to different values, which could explain why you, but not that many others, hit this issue.

Any chance to get df output? I'm mainly curious about the number of used inodes.

Could you show the mount options that end up being used?

  grep /var/opt /proc/mounts

I rather doubt it is, but it'd sure be interesting if inode32 were used.

I assume you have never set XFS options for the PG directory or files within it?

Could you show

  xfs_io -r -c lsattr -c stat -c statfs /path/to/directory/with/enospc

?

> # for agno in `seq 0 3`; do xfs_spaceman -c "freesp -s -a $agno" /var/opt; done
> from to extents blocks pct
> 1 1 37502 37502 0.15
> 2 3 62647 148377 0.59
> 4 7 87793 465950 1.85
> 8 15 135529 1527172 6.08
> 16 31 184811 3937459 15.67
> 32 63 165979 7330339 29.16
> 64 127 101674 8705691 34.64
> 128 255 15123 2674030 10.64
> 256 511 973 307655 1.22
> total free extents 792031
> total free blocks 25134175
> average free extent size 31.7338
> from to extents blocks pct
> 1 1 43895 43895 0.22
> 2 3 59312 141693 0.70
> 4 7 83406 443964 2.20
> 8 15 120804 1362108 6.75
> 16 31 133140 2824317 14.00
> 32 63 118619 5188474 25.71
> 64 127 77960 6751764 33.46
> 128 255 16383 2876626 14.26
> 256 511 1763 546506 2.71
> total free extents 655282
> total free blocks 20179347
> average free extent size 30.7949
> from to extents blocks pct
> 1 1 72034 72034 0.26
> 2 3 98158 232135 0.83
> 4 7 126228 666187 2.38
> 8 15 169602 1893007 6.77
> 16 31 180286 3818527 13.65
> 32 63 164529 7276833 26.01
> 64 127 109687 9505160 33.97
> 128 255 22113 3921162 14.02
> 256 511 1901 592052 2.12
> total free extents 944538
> total free blocks 27977097
> average free extent size 29.6199
> from to extents blocks pct
> 1 1 51462 51462 0.21
> 2 3 98993 233204 0.93
> 4 7 131578 697655 2.79
> 8 15 178151 1993062 7.97
> 16 31 175718 3680535 14.72
> 32 63 145310 6372468 25.48
> 64 127 89518 7749021 30.99
> 128 255 18926 3415768 13.66
> 256 511 2640 813586 3.25
> total free extents 892296
> total free blocks 25006761
> average free extent size 28.0252

So there's *some*, but not a lot, of imbalance in AG usage. Of course that's as of this moment, and as you say below, you expire old partitions on a regular basis...

My understanding of XFS's space allocation is that by default it continues to use the same AG for allocations within one directory, until that AG is full. For a write heavy postgres workload that's of course not optimal, as all activity will focus on one AG.

I'd try monitoring the per-AG free space over time and see if the ENOSPC issue is correlated with one AG getting full. 'freesp' is probably too expensive for that, but it looks like

  xfs_db -r -c agresv /dev/nvme6n1

should work? Actually that output might be interesting to see, even when you don't hit the issue.

> > Can you tell us anything about the workload / data? Lots of tiny tables, lots
> > of big tables, write heavy, ...?
>
> It is a write heavy application which stores mostly time series data.
>
> The time series data is partitioned by time; the application writes
> constantly into the 'current' partition, and data is expired by
> removing the oldest partition. Most of the data is written once and
> not updated.
>
> There are quite a lot of these partitioned tables (in the 1000's or
> 10000's) depending on how the application is configured. Individual
> partitions range in size from a few MB to 10s of GB.

So there are 1000s of tables that are concurrently being appended to, but only into one partition each. That does make it plausible that there's a significant amount of fragmentation, possibly transient due to the expiration.

How many partitions are there for each of the tables? Mainly wondering because of the number of inodes being used.

Are all of the active tables within one database? That could be relevant due to per-directory behaviour of free space allocation.

Greetings,

Andres Freund
Hi,

On 2024-12-10 12:36:33 +0100, Jakub Wartak wrote:
> On Tue, Dec 10, 2024 at 7:34 AM Michael Harris <harmic@gmail.com> wrote:
> 1. Well it doesn't look like XFS AG fragmentation to me (we had a customer
> with a huge number of AGs with small space in them) reporting such errors
> after upgrading to 16, but not for earlier versions (somehow
> posix_fallocate() had to be the culprit).

Given that the workload expires old partitions, I'm not sure we can conclude a whole lot from the current state :/

> 2.
>
> > # xfs_info /dev/mapper/ippvg-ipplv
> > meta-data=/dev/mapper/ippvg-ipplv isize=512 agcount=4, agsize=262471424 blks
> >          = sectsz=512 attr=2, projid32bit=1
> >          = crc=1 finobt=0, sparse=0, rmapbt=0
> >          = reflink=0 bigtime=0 inobtcount=0 nrext64=0
>
> Yay, reflink=0, that's pretty old fs ?!

I think that only started to default to on more recently (2019, plus time to percolate into RHEL). The more curious cases are finobt=0 (turned on by default since 2015) and, to a lesser degree, sparse=0 (turned on by default since 2018).

> > ERROR: could not extend file
> > "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with FileFallocate(): No
> > space left on device
>
> 2. This indicates it was allocating 1GB for such a table (".1"), on
> tablespace that was created more than a year ago. Could you get us maybe
> those below commands too? (or from any other directory exhibiting such
> errors)

The date in the directory is the catversion of the server, which is just determined by the major version being used, not the creation time of the tablespace.

  andres@awork3:~/src/postgresql$ git grep CATALOG_VERSION_NO upstream/REL_16_STABLE src/include/catalog/catversion.h
  upstream/REL_16_STABLE:src/include/catalog/catversion.h:#define CATALOG_VERSION_NO 202307071

Greetings,

Andres Freund
On 2024-12-10 11:34:15 -0500, Andres Freund wrote:
> On 2024-12-10 12:36:33 +0100, Jakub Wartak wrote:
> > On Tue, Dec 10, 2024 at 7:34 AM Michael Harris <harmic@gmail.com> wrote:
> > 2.
> >
> > > # xfs_info /dev/mapper/ippvg-ipplv
> > > meta-data=/dev/mapper/ippvg-ipplv isize=512 agcount=4, agsize=262471424 blks
> > >          = sectsz=512 attr=2, projid32bit=1
> > >          = crc=1 finobt=0, sparse=0, rmapbt=0
> > >          = reflink=0 bigtime=0 inobtcount=0 nrext64=0
> >
> > Yay, reflink=0, that's pretty old fs ?!
>
> I think that only started to default to on more recently (2019, plus time to
> percolate into RHEL). The more curious cases is finobt=0 (turned on by default
> since 2015) and to a lesser degree sparse=0 (turned on by default since 2018).

One thing that might be interesting is to compare the xfs_info of affected and non-affected servers...
On Mon, Dec 9, 2024 at 7:31 PM Andres Freund <andres@anarazel.de> wrote:
> Pretty unexcited about all of these - XFS is fairly widely used for PG, but
> this problem doesn't seem very common. It seems to me that we're missing
> something that causes this to only happen in a small subset of cases.

I wonder if this is actually pretty common on XFS. I mean, we've already hit this with at least one EDB customer, and Michael's report is, as far as I know, independent of that; and he points to a pgsql-general thread which, AFAIK, is also independent. We don't get three (or more?) independent reports of that many bugs, so I think it's not crazy to think that the problem is actually pretty common. It's probably workload dependent somehow, but for all we know today it seems like the workload could be as simple as "do enough file extension and you'll get into trouble eventually" or maybe "do enough file extension with some level of concurrency and you'll get into trouble eventually".

> I think the source of this needs to be debugged further before we try to apply
> workarounds in postgres.

Why? It seems to me that this has to be a filesystem bug, and we should almost certainly adopt one of these ideas from Michael Harris:

- Providing a way to configure PG not to use posix_fallocate at runtime

- In the case of posix_fallocate failing with ENOSPC, fall back to FileZero (worst case that will fail as well, in which case we will know that we really are out of space)

Maybe we need some more research to figure out which of those two things we should do -- I suspect the second one is better but if that fails then we might need to do the first one -- but I doubt that we can wait for XFS to fix whatever the issue is here. Our usage of posix_fallocate doesn't look to be anything more than plain vanilla, so as between these competing hypotheses:

(1) posix_fallocate is and always has been buggy and you can't rely on it, or

(2) we use posix_fallocate in a way that nobody else has and have hit an incredibly obscure bug as a result, which will be swiftly patched

...the first seems much more likely.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2024-12-10 12:36:40 -0500, Robert Haas wrote:
> On Mon, Dec 9, 2024 at 7:31 PM Andres Freund <andres@anarazel.de> wrote:
> > Pretty unexcited about all of these - XFS is fairly widely used for PG, but
> > this problem doesn't seem very common. It seems to me that we're missing
> > something that causes this to only happen in a small subset of cases.
>
> I wonder if this is actually pretty common on XFS. I mean, we've
> already hit this with at least one EDB customer, and Michael's report
> is, as far as I know, independent of that; and he points to a
> pgsql-general thread which, AFAIK, is also independent. We don't get
> three (or more?) independent reports of that many bugs, so I think
> it's not crazy to think that the problem is actually pretty common.

Maybe. I think we would have gotten a lot more reports if it were common. I know of quite a few very busy installs using xfs.

I think there must be some as-of-yet-unknown condition gating it. E.g. that the filesystem has been created a while ago and has some now-on-by-default options disabled.

> > I think the source of this needs to be debugged further before we try to apply
> > workarounds in postgres.
>
> Why? It seems to me that this has to be a filesystem bug,

Adding workarounds for half-understood problems tends to lead to code that we can't evolve in the future, as we a) don't understand b) can't reproduce the problem.

Workarounds could also mask some bigger / worse issues. We e.g. have blamed ext4 for a bunch of bugs that then turned out to be ours in the past. But we didn't look for a long time, because it was convenient to just blame ext4.

> and we should almost certainly adopt one of these ideas from Michael Harris:
>
> - Providing a way to configure PG not to use posix_fallocate at runtime

I'm not strongly opposed to that. That's testable without access to an affected system. I wouldn't want to automatically do that when detecting an affected system though, that'll make behaviour way less predictable.

> - In the case of posix_fallocate failing with ENOSPC, fall back to
> FileZero (worst case that will fail as well, in which case we will
> know that we really are out of space)

I doubt that that's a good idea. What if fallocate failing is an indicator of a problem? What if you turn on AIO + DIO and suddenly get a much more fragmented file?

Greetings,

Andres Freund
Hi Andres

On Wed, 11 Dec 2024 at 03:09, Andres Freund <andres@anarazel.de> wrote:
> I think it's implied, but I just want to be sure: This was one of the affected
> systems?

Yes, correct.

> Any chance to get df output? I'm mainly curious about the number of used
> inodes.

Sorry, I could swear I had included that already! Here it is:

# df /var/opt
Filesystem              1K-blocks       Used  Available Use% Mounted on
/dev/mapper/ippvg-ipplv 4197492228 3803866716  393625512  91% /var/opt

# df -i /var/opt
Filesystem                Inodes   IUsed     IFree IUse% Mounted on
/dev/mapper/ippvg-ipplv 419954240 1568137 418386103    1% /var/opt

> Could you show the mount options that end up being used?
> grep /var/opt /proc/mounts

/dev/mapper/ippvg-ipplv /var/opt xfs rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0

These seem to be the defaults.

> I assume you have never set XFS options for the PG directory or files within
> it?

Correct.

> Could you show
> xfs_io -r -c lsattr -c stat -c statfs /path/to/directory/with/enospc

-p--------------X pg_tblspc/16402/PG_16_202307071/49163/1132925906.4
fd.path = "pg_tblspc/16402/PG_16_202307071/49163/1132925906.4"
fd.flags = non-sync,non-direct,read-only
stat.ino = 4320612794
stat.type = regular file
stat.size = 201211904
stat.blocks = 393000
fsxattr.xflags = 0x80000002 [-p--------------X]
fsxattr.projid = 0
fsxattr.extsize = 0
fsxattr.cowextsize = 0
fsxattr.nextents = 165
fsxattr.naextents = 0
dioattr.mem = 0x200
dioattr.miniosz = 512
dioattr.maxiosz = 2147483136
fd.path = "pg_tblspc/16402/PG_16_202307071/49163/1132925906.4"
statfs.f_bsize = 4096
statfs.f_blocks = 1049373057
statfs.f_bavail = 98406378
statfs.f_files = 419954240
statfs.f_ffree = 418386103
statfs.f_flags = 0x1020
geom.bsize = 4096
geom.agcount = 4
geom.agblocks = 262471424
geom.datablocks = 1049885696
geom.rtblocks = 0
geom.rtextents = 0
geom.rtextsize = 1
geom.sunit = 0
geom.swidth = 0
counts.freedata = 98406378
counts.freertx = 0
counts.freeino = 864183
counts.allocino = 2432320

> I'd try monitoring the per-ag free space over time and see if the ENOSPC
> issue is correlated with one AG getting full. 'freesp' is probably too
> expensive for that, but it looks like
> xfs_db -r -c agresv /dev/nvme6n1
> should work?
>
> Actually that output might be interesting to see, even when you don't hit the
> issue.

I will see if I can set that up.

> How many partitions are there for each of the tables? Mainly wondering because
> of the number of inodes being used.

It is configurable and varies from site to site. It could range from 7 up to maybe 60.

> Are all of the active tables within one database? That could be relevant due
> to per-directory behaviour of free space allocation.

Each pg instance may have one or more application databases. Typically data is being written into all of them (although sometimes a database will be archived, with no new data going into it).

You might be onto something though. The system I got the above prints from is only experiencing this issue in one directory - that might not mean very much though, it only has 2 databases and one of them looks like it is not receiving imports. But another system I can access has multiple databases with ongoing imports, yet all the errors bar one relate to one directory. I will collect some data from that system and post it shortly.

Cheers
Mike
Hi Jakub

On Tue, 10 Dec 2024 at 22:36, Jakub Wartak <jakub.wartak@enterprisedb.com> wrote:
> Yay, reflink=0, that's pretty old fs ?!

This particular filesystem was created on CentOS 7, and retained when the system was upgraded to RL9. So yes, probably pretty old!

> Could you get us maybe those below commands too? (or from any other directory exhibiting such errors)
>
> stat pg_tblspc/16401/PG_16_202307071/17643/
> ls -1 pg_tblspc/16401/PG_16_202307071/17643/ | wc -l
> time ls -1 pg_tblspc/16401/PG_16_202307071/17643/ | wc -l # to assess timing of getdents() call as that may say something about that directory indirectly

# stat pg_tblspc/16402/PG_16_202307071/49163/
  File: pg_tblspc/16402/PG_16_202307071/49163/
  Size: 5177344    Blocks: 14880    IO Block: 4096    directory
Device: fd02h/64770d    Inode: 4299946593    Links: 2
Access: (0700/drwx------)    Uid: (   26/postgres)    Gid: (   26/postgres)
Access: 2024-12-11 09:39:42.467802419 +0900
Modify: 2024-12-11 09:51:19.813948673 +0900
Change: 2024-12-11 09:51:19.813948673 +0900
 Birth: 2024-11-25 17:37:11.812374672 +0900

# time ls -1 pg_tblspc/16402/PG_16_202307071/49163/ | wc -l
179000

real 0m0.474s
user 0m0.439s
sys 0m0.038s

> 3. Maybe somehow there is a bigger interaction between posix_fallocate() and delayed XFS's dynamic speculative preallocation from many processes all writing into different partitions ? Maybe try "allocsize=1m" mount option for that /fs and see if that helps. I'm going to speculate about XFS speculative :) pre allocations, but if we have fdcache and are *not* closing fds, how XFS might know to abort its own speculation about streaming write ? (multiply that up to potentially the number of opened fds to get an avalanche of "preallocations").

I will try to organize that. They are production systems so it might take some time.

> 4. You can also try compiling with patch from Alvaro from [2] "0001-Add-some-debugging-around-mdzeroextend.patch", so we might end up having more clarity in offsets involved. If not then you could use 'strace -e fallocate -p <pid>' to get the exact syscall.

I'll take a look at Alvaro's patch. strace sounds good, but how to arrange to start it on the correct PG backends? There will be a large-ish number of PG backends going at a time, only some of which are performing imports, and they will be coming and going every so often as the ETL application scales up and down with the load.

> 5. Another idea could be catching the kernel side stacktrace of fallocate() when it is hitting ENOSPC. E.g. with XFS fs and attached bpftrace eBPF tracer I could get the source of the problem in my artificial reproducer, e.g

OK, I will look into that also.

Cheers
Mike
On Wed, Dec 11, 2024 at 4:00 AM Michael Harris <harmic@gmail.com> wrote:
> Hi Jakub
>
> On Tue, 10 Dec 2024 at 22:36, Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> [..]
> > 3. Maybe somehow there is a bigger interaction between posix_fallocate() and delayed XFS's dynamic speculative preallocation from many processes all writing into different partitions ? Maybe try "allocsize=1m" mount option for that /fs and see if that helps. I'm going to speculate about XFS speculative :) pre allocations, but if we have fdcache and are *not* closing fds, how XFS might know to abort its own speculation about streaming write ? (multiply that up to potentially the number of opened fds to get an avalanche of "preallocations").
>
> I will try to organize that. They are production systems so it might
> take some time.
Cool.
> > 4. You can also try compiling with patch from Alvaro from [2] "0001-Add-some-debugging-around-mdzeroextend.patch", so we might end up having more clarity in offsets involved. If not then you could use 'strace -e fallocate -p <pid>' to get the exact syscall.
>
> I'll take a look at Alvaro's patch. strace sounds good, but how to
> arrange to start it on the correct PG backends? There will be a
> large-ish number of PG backends going at a time, only some of which
> are performing imports, and they will be coming and going every so
> often as the ETL application scales up and down with the load.
Yes, it sounds like mission impossible. Is there any chance you can get the reproduction down to one or a small number of postgres backends doing the writes?
> > 5. Another idea could be catching the kernel side stacktrace of fallocate() when it is hitting ENOSPC. E.g. with XFS fs and attached bpftrace eBPF tracer I could get the source of the problem in my artificial reproducer, e.g
>
> OK, I will look into that also.
Hopefully that reveals some more. Unfortunately, UNIX error reporting squeezes a big pile of different problems into the single ENOSPC category, which is not helpful at all (inode, extent and block allocation problems all end up as the same error).
So, summarizing the observations so far:
- works in <PG16, but fails with >=PG16, due to posix_fallocate() being used rather than multiple separate (but adjacent) iovectors to pg_writev(). It is only used in mdzeroextend() with numblocks > 8
- 179k or 414k files in single directory (0.3s - 0.5s just to list those)
- OS/FS upgraded from earlier release
- one AG with extremely low extent sizes compared to the other AGs (I bet that the 2->3 22.73% below means small 8192b pg files in $PGDATA, but there are no large extents in that AG)
from to extents blocks pct
1 1 4949 4949 0.65
2 3 86113 173452 22.73
4 7 19399 94558 12.39
8 15 23233 248602 32.58
16 31 12425 241421 31.64
total free extents 146119
total free blocks 762982
average free extent size 5.22165 (!)
- note that the max extent size above (31) is very low when compared to the other AGs, which have 1024-8192. Therefore it looks like there are no contiguous blocks for request sizes above 31*4096 = 126976 bytes within that AG (??).
- we have logic of `extend_by_pages += extend_by_pages * waitcount;` capped at a maximum of 64 pg blocks (and that's higher than the above) - see the sketch after this list
- but the failures were also observed when using pg_upgrade --link -j/pg_restore -j (also concurrent posix_fallocate() to many independent files sharing the same AG, but that's 1 backend:1 file, so no contention for waitcount in RelationAddBlocks())
- so maybe it's lots of backends doing independent concurrent posix_fallocate() that end up somehow coalesced? Or, hypothetically, let's say 16-32 fallocate()s hit the same AG initially; maybe it's some form of concurrency semi-race-condition inside XFS where one of the fallocate calls fails to find space in that one AG, although according to [1] it should fall back to other AGs.
- and there's also additional XFS dynamic speculative preallocation that might cause space pressure during our normal writes.
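For context, the extension-size logic referred to above lives in RelationAddBlocks() in src/backend/access/heap/hio.c; a simplified sketch of it (illustrative helper only, not the actual code) is:

    #define MAX_BUFFERS_TO_EXTEND_BY 64

    /*
     * Simplified sketch of the sizing logic: scale the extension by the
     * number of backends waiting on the relation extension lock, capped
     * at 64 blocks.
     */
    static int
    choose_extend_by_pages(int needed_pages, int lock_waiters)
    {
        int extend_by_pages = needed_pages;

        /*
         * If other backends are waiting on the relation extension lock,
         * extend by proportionally more pages so each waiter can get a
         * fresh block without having to extend again right away.
         */
        extend_by_pages += extend_by_pages * lock_waiters;

        /* ... but never by more than 64 blocks (512kB at BLCKSZ=8kB). */
        if (extend_by_pages > MAX_BUFFERS_TO_EXTEND_BY)
            extend_by_pages = MAX_BUFFERS_TO_EXTEND_BY;

        return extend_by_pages;
    }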
Another workaround idea/test: create tablespace on the same XFS fs (but in a somewhat different directory if possible) and see if it still fails.
-J.
Hi,

On 2024-12-10 16:33:06 -0500, Andres Freund wrote:
> Maybe. I think we would have gotten a lot more reports if it were common. I
> know of quite a few very busy installs using xfs.
>
> I think there must be some as-of-yet-unknown condition gating it. E.g. that
> the filesystem has been created a while ago and has some now-on-by-default
> options disabled.
>
> > > I think the source of this needs to be debugged further before we try to apply
> > > workarounds in postgres.
> >
> > Why? It seems to me that this has to be a filesystem bug,
>
> Adding workarounds for half-understood problems tends to lead to code that we
> can't evolve in the future, as we a) don't understand b) can't reproduce the
> problem.
>
> Workarounds could also mask some bigger / worse issues. We e.g. have blamed
> ext4 for a bunch of bugs that then turned out to be ours in the past. But we
> didn't look for a long time, because it was convenient to just blame ext4.
>
> > and we should almost certainly adopt one of these ideas from Michael Harris:
> >
> > - Providing a way to configure PG not to use posix_fallocate at runtime
>
> I'm not strongly opposed to that. That's testable without access to an
> affected system. I wouldn't want to automatically do that when detecting an
> affected system though, that'll make behaviour way less predictable.
>
> > - In the case of posix_fallocate failing with ENOSPC, fall back to
> > FileZero (worst case that will fail as well, in which case we will
> > know that we really are out of space)
>
> I doubt that that's a good idea. What if fallocate failing is an indicator of
> a problem? What if you turn on AIO + DIO and suddenly get a much more
> fragmented file?

One thing that I think we should definitely do is to include more detail in the error message. mdzeroextend()'s error messages don't include how many blocks the relation was to be extended by. Neither mdextend() nor mdzeroextend() include the offset at which the extension failed.

I'm not entirely sure about the phrasing though; we have a somewhat confusing mix of blocks and bytes in messages. Perhaps some of the information should be in an errdetail, but I admit I'm a bit hesitant about doing so for crucial details. I find that often only the primary error message is available when debugging problems encountered by others.

Maybe something like

  /* translator: second %s is a function name like FileAllocate() */
  could not extend file \"%s\" by %u blocks, from %llu to %llu bytes, using %s: %m

or

  could not extend file \"%s\" using %s by %u blocks, from its current size of %u blocks: %m

or

  could not extend file \"%s\" using %s by %u blocks/%llu bytes from its current size of %llu bytes: %m

If we want to use errdetail() judiciously, we could go for something like

  errmsg("could not extend file \"%s\" by %u blocks, using %s: %m", ...
  errdetail("Failed to extend file from %u blocks/%llu bytes to %u blocks / %llu bytes.", ...)

I think it might also be good - this is a slightly more complicated project - to report the amount of free space the filesystem reports when we hit ENOSPC. I have dealt with cases of the FS transiently filling up way too many times, and it's always a pain to figure that out.

Greetings,

Andres Freund
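PS: Purely as an illustration of the kind of report I mean (using the values mdzeroextend() already has locally; the exact wording and errcode choice are placeholders):

    /* Illustrative only; "v", "seekpos" and "numblocks" are the values
     * mdzeroextend() already has, and the wording is just a placeholder. */
    ereport(ERROR,
            (errcode(ERRCODE_DISK_FULL),
             errmsg("could not extend file \"%s\" by %d blocks, from %lld to %lld bytes, using FileFallocate(): %m",
                    FilePathName(v->mdfd_vfd),
                    numblocks,
                    (long long) seekpos,
                    (long long) (seekpos + (off_t) BLCKSZ * numblocks)),
             errhint("Check free disk space.")));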
Hi, FWIW, I tried fairly hard to reproduce this. An extended cycle of 80 backends copying into relations and occasionally truncating them (to simulate the partitions being dropped and new ones created). For this I ran a 4TB filesystem very close to fully filled (peaking at 99.998 % full). I did not see any ENOSPC errors unless the filesystem really was full at that time. To check that, I made mdzeroextend() do a statfs() when encountering ENOSPC, printed statfs.f_blocks and made that case PANIC. What I do see is that after - intentionally - hitting an out-of-disk-space error, the available disk space would occasionally increase a small amount after a few seconds. Regardless of whether using the fallocate and non-fallocate path. From what I can tell this small increase in free space has a few reasons: - Checkpointer might not have gotten around to unlinking files, keeping the inode alive. - Occasionally bgwriter or a backend would have relation segments that were unlinked open, so the inode (not the actual file space, because the segment to prevent that) could not yet be removed from the filesystem - It looks like xfs does some small amount of work to reclaim space in the background. Which makes sense, otherwise each unlink would have to be a flush to disk. But that's not in any way enough amount of space to explain what you're seeing. The most I've were 6MB, when ramping up the truncation frequency a lot. Of course this was on a newer kernel, not on RHEL / RL 8/9. Just to make sure - you're absolutely certain that you actually have space at the time of the errors? E.g. a checkpoint could free up a lot of WAL after a checkpoint that's soon after an ENOSPC, due to removing now-unneeded WAL files. That can be 100s of gigabytes. If I were to provide you with a patch that showed the amount of free disk space at the time of an error, the size of the relation etc, could you reproduce the issue with it applied? Or is that unrealistic? On 2024-12-11 13:05:21 +0100, Jakub Wartak wrote: > - one AG with extreme low extent sizes compared to the others AGs (I bet > that 2->3 22.73% below means small 8192b pg files in $PGDATA, but there are > no large extents in that AG) > from to extents blocks pct > 1 1 4949 4949 0.65 > 2 3 86113 173452 22.73 > 4 7 19399 94558 12.39 > 8 15 23233 248602 32.58 > 16 31 12425 241421 31.64 > total free extents 146119 > total free blocks 762982 > average free extent size 5.22165 (!) Note that this does not mean that all extents in the AG are that small, just that the *free* extents are of that size. I think this might primarily be because this AG has the smallest amount of free blocks (2.9GB). However, the fact that it *does* have less, could be interesting. It might be the AG associated with the directory for the busiest database or such. The next least-space AG is: from to extents blocks pct 1 1 1021 1021 0.10 2 3 48748 98255 10.06 4 7 9840 47038 4.81 8 15 13648 146779 15.02 16 31 15818 323022 33.06 32 63 584 27932 2.86 64 127 147 14286 1.46 128 255 253 49047 5.02 256 511 229 87173 8.92 512 1023 139 102456 10.49 1024 2047 51 72506 7.42 2048 4095 3 7422 0.76 total free extents 90481 total free blocks 976937 It seems plausible it'd would look similar if more of the free blocks were used. 
> - we have logic of `extend_by_pages += extend_by_pages * waitcount;` capped
> up to 64 pg blocks maximum (and that's higher than the above)
> - but the fails where observed also using pg_upgrade --link -j/pg_restore
> -j (also concurrent posix_fallocate() to many independent files sharing the
> same AG, but that's 1 backend:1 file so no contention for waitcount in
> RelationAddBlocks())

We also extend by more than one page, even without concurrency, if
bulk-insertion is used, and I think we do use that for e.g. pg_attribute.
Which is actually the table where pg_restore encountered the issue:

  pg_restore: error: could not execute query: ERROR: could not extend
  file "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with
  FileFallocate(): No space left on device

1249 is the initial relfilenode for pg_attribute.

There could also be some parallelism leading to bulk extension, due to the
parallel restore. I don't remember which commands pg_restore actually
executes in parallel.

Greetings,

Andres Freund
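For reference, the statfs()-on-ENOSPC instrumentation described above could
look roughly like the following. This is only a sketch, not the actual
debugging patch; the surrounding variable names mirror mdzeroextend()
loosely, and statvfs() is used as the portable spelling of statfs():

  /* needs #include <sys/statvfs.h>; sketch of the ENOSPC diagnostics */
  if (errno == ENOSPC)
  {
      int             save_errno = errno;
      struct statvfs  svfs;

      if (statvfs(FilePathName(v->mdfd_vfd), &svfs) == 0)
          elog(LOG, "mdzeroextend FileFallocate failing with ENOSPC: "
               "free space for filesystem containing \"%s\" "
               "f_blocks: %llu, f_bfree: %llu, f_bavail: %llu, "
               "f_files: %llu, f_ffree: %llu",
               FilePathName(v->mdfd_vfd),
               (unsigned long long) svfs.f_blocks,
               (unsigned long long) svfs.f_bfree,
               (unsigned long long) svfs.f_bavail,
               (unsigned long long) svfs.f_files,
               (unsigned long long) svfs.f_ffree);
      errno = save_errno;         /* statvfs() may have clobbered errno */
  }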
Hi Andres

On Thu, 12 Dec 2024 at 10:50, Andres Freund <andres@anarazel.de> wrote:
> Just to make sure - you're absolutely certain that you actually have space at
> the time of the errors?

As sure as I can be. The RHEL8 system that I took prints from yesterday
has >1.5TB free. I can't see it varying by that much.

It does look as though the system needs to be quite full to provoke this
problem. The systems I have looked at so far have >90% full filesystems.

Another interesting snippet: the application has a number of ETL workers going
at once. The actual number varies depending on a number of factors but might
be somewhere from 10 - 150. Each worker will have a single postgres backend
that they are feeding data to.

At the time of the error, it is not the case that all ETL workers strike it at
once - it looks like a lot of the time only a single worker is affected, or at
most a handful of workers. I can't see for sure what the other workers were
doing at the time, but I would expect they were all importing data as well.

> If I were to provide you with a patch that showed the amount of free disk
> space at the time of an error, the size of the relation etc, could you
> reproduce the issue with it applied? Or is that unrealistic?

I have not been able to reproduce it on demand, and so far it has only
happened in production systems. As long as the patch doesn't degrade normal
performance it should be possible to deploy it to one of the systems that is
regularly reporting the error, although it might take a while to get approval
to do that.

Cheers
Mike
Hi Andres

On Fri, 13 Dec 2024 at 08:38, Andres Freund <andres@anarazel.de> wrote:
> > Another interesting snippet: the application has a number of ETL
> > workers going at once. The actual number varies depending on a number
> > of factors but might be somewhere from 10 - 150. Each worker will have
> > a single postgres backend that they are feeding data to.
>
> Are they all inserting into distinct tables/partitions or into shared tables?

The set of tables they are writing into is the same, but we do take some
effort to randomize the order of the tables that each worker is writing into
so as to reduce contention. Even so it is quite likely that multiple processes
will be writing into a table at a time.

Also worth noting that I have only seen this error triggered by COPY
statements (other than the upgrade case). There are some other cases in our
code that use INSERT, but so far I have not seen that end in an out of space
error.

> When you say that they're not "all striking it at once", do you mean that some
> of them aren't interacting with the database at the time, or that they're not
> erroring out?

Sorry, I meant erroring out.

Thanks for the patch!

Cheers
Mike
On 2024-Dec-11, Andres Freund wrote:

> One thing that I think we should definitely do is to include more detail in
> the error message. mdzeroextend()'s error messages don't include how many
> blocks the relation was to be extended by. Neither mdextend() nor
> mdzeroextend() include the offset at which the extension failed.

I proposed a patch at
https://postgr.es/m/202409110955.6njbwzm4ocus@alvherre.pgsql

FileFallocate failure:

  errmsg("could not allocate additional %lld bytes from position %lld in file \"%s\": %m",
         (long long) addbytes, (long long) seekpos,
         FilePathName(v->mdfd_vfd)),

FileZero failure:

  errmsg("could not zero additional %lld bytes from position %lld file \"%s\": %m",
         (long long) addbytes, (long long) seekpos,
         FilePathName(v->mdfd_vfd)),

I'm not sure that we need to talk about blocks, given that the underlying
syscalls don't work in blocks anyway. IMO we should just report bytes.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"No hay ausente sin culpa ni presente sin disculpa" (Prov. francés)
On Sat, Dec 14, 2024 at 9:29 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2024-Dec-11, Andres Freund wrote:
> > One thing that I think we should definitely do is to include more detail in
> > the error message. mdzeroextend()'s error messages don't include how many
> > blocks the relation was to be extended by. Neither mdextend() nor
> > mdzeroextend() include the offset at which the extension failed.
>
> I proposed a patch at
> https://postgr.es/m/202409110955.6njbwzm4ocus@alvherre.pgsql

If adding more logging, I wonder why FileAccess()'s "re-open failed" case is
not considered newsworthy. I've suspected it as a candidate source of an
unexplained and possibly misattributed error in other cases. I'm not saying
it's at all likely in this case, but it seems like just the sort of rare
unexpected failure that we'd want to know more about when trying to solve
mysteries.
On Sat, Dec 14, 2024 at 4:20 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, Dec 14, 2024 at 9:29 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > On 2024-Dec-11, Andres Freund wrote:
> > > One thing that I think we should definitely do is to include more detail in
> > > the error message. mdzeroextend()'s error messages don't include how many
> > > blocks the relation was to be extended by. Neither mdextend() nor
> > > mdzeroextend() include the offset at which the extension failed.
> >
> > I proposed a patch at
> > https://postgr.es/m/202409110955.6njbwzm4ocus@alvherre.pgsql
>
> If adding more logging, I wonder why FileAccess()'s "re-open failed"
> case is not considered newsworthy. I've suspected it as a candidate
> source of an unexplained and possibly misattributed error in other
> cases. I'm not saying it's at all likely in this case, but it seems
> like just the sort of rare unexpected failure that we'd want to know
> more about when trying to solve mysteries.

Wow. That's truly abominable. It doesn't seem likely to explain this case,
because I don't think trying to reopen an existing file could result in
LruInsert() returning ENOSPC. But this code desperately needs refactoring to
properly report the open() failure as such, instead of conflating it with a
failure of whatever syscall we were contemplating before we realized we needed
to open().

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Dec 12, 2024 at 12:50 AM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> FWIW, I tried fairly hard to reproduce this.
Same here, but without PG and also without much success. I've also tried to push the AGs (with just one or two AGs created via mkfs) to contain only small extents (by creating hundreds of thousands of 8kB files), then deleting every Nth one, and then trying a couple of bigger fallocate()s/writes to see if that would blow up on the original CentOS 7.9 / 3.10.x kernel, but no - it did not blow up. It only failed when df -h was at exactly 100% in multiple scenarios like that (and yes, a little space sometimes appeared out of the blue there too). So my take is that it is something related to state (having an fd open) and concurrency.
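For illustration, a minimal standalone probe for this kind of test might look like the following (a sketch, not the exact program used; it simply tries to grow an existing file from its current end with posix_fallocate() and reports the result):

  /* sketch of a standalone probe: extend an existing file with posix_fallocate() */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int
  main(int argc, char **argv)
  {
      struct stat st;
      long long   add;
      int         fd,
                  rc;

      if (argc != 3)
      {
          fprintf(stderr, "usage: %s <existing-file> <bytes-to-add>\n", argv[0]);
          return 1;
      }

      fd = open(argv[1], O_RDWR);
      if (fd < 0 || fstat(fd, &st) != 0)
      {
          perror(argv[1]);
          return 1;
      }

      /* extend from the current EOF, the way mdzeroextend() does */
      add = atoll(argv[2]);
      rc = posix_fallocate(fd, st.st_size, add);
      if (rc != 0)
          fprintf(stderr, "posix_fallocate(offset=%lld, len=%lld) failed: %s\n",
                  (long long) st.st_size, add, strerror(rc));
      else
          printf("ok, file grown from %lld to %lld bytes\n",
                 (long long) st.st_size, (long long) st.st_size + add);

      close(fd);
      return rc != 0;
  }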
An interesting thing I've observed is that the per-directory AG affinity for big directories (think $PGDATA) is lost once the AG is full, and extents are then allocated from different AGs (one can use xfs_bmap -vv to see the AG affinity of a directory vs. the files in it).
> An extended cycle of 80 backends copying into relations and occasionally
> truncating them (to simulate the partitions being dropped and new ones
> created). For this I ran a 4TB filesystem very close to fully filled (peaking
> at 99.998 % full).
I could only think of the question: how many files were involved there? Maybe it is some kind of race between other (or the same) backends frequently churning their fd caches with open()/close() [defeating speculative preallocation] -> XFS ending up fragmented, and only then posix_fallocate() having issues for larger allocations (>> 8kB)? My take is that if we send N write I/O vectors this seems to be handled fine, but when we throw one big fallocate it is not -- so maybe the posix_fallocate() was in the process of finding space while some other activity happened to that inode -- like a close() -- but then that doesn't seem to match the pg_upgrade scenario.
Well IMHO we are stuck till Michael provides some more data (patch outcome, bpf and maybe other hints and tests).
-J.
Hi,

On 2024-12-14 09:29:12 +0100, Alvaro Herrera wrote:
> On 2024-Dec-11, Andres Freund wrote:
>
> > One thing that I think we should definitely do is to include more detail in
> > the error message. mdzeroextend()'s error messages don't include how many
> > blocks the relation was to be extended by. Neither mdextend() nor
> > mdzeroextend() include the offset at which the extension failed.
>
> I proposed a patch at
> https://postgr.es/m/202409110955.6njbwzm4ocus@alvherre.pgsql
>
> FileFallocate failure:
> errmsg("could not allocate additional %lld bytes from position %lld in file \"%s\": %m",
>        (long long) addbytes, (long long) seekpos,
>        FilePathName(v->mdfd_vfd)),
>
> FileZero failure:
> errmsg("could not zero additional %lld bytes from position %lld file \"%s\": %m",
>        (long long) addbytes, (long long) seekpos,
>        FilePathName(v->mdfd_vfd)),

Personally I don't like the obfuscation of "allocate" and "zero" vs just
naming the function names. But I guess that's just a taste thing.

> I'm not sure that we need to talk about blocks, given that the
> underlying syscalls don't work in blocks anyway. IMO we should just
> report bytes.

When looking for problems it's considerably more work with bytes, because - at
least for me - the large number is hard to compare quickly, and knowing how
aggressively we extended also requires translating to blocks.

Greetings,

Andres Freund
On Mon, Dec 16, 2024 at 9:12 AM Andres Freund <andres@anarazel.de> wrote:
> Personally I don't like the obfuscation of "allocate" and "zero" vs just
> naming the function names. But I guess that's just a taste thing.
>
> When looking for problems it's considerably more work with bytes, because - at
> least for me - the large number is hard to compare quickly, and knowing how
> aggressively we extended also requires translating to blocks.

FWIW, I think that what we report in the error should hew as closely to the
actual system call as possible. Hence, I agree with your first complaint and
would prefer to simply see the system calls named, but I disagree with your
second complaint and would prefer to see the byte count.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2024-12-16 14:45:37 +0100, Jakub Wartak wrote:
> On Thu, Dec 12, 2024 at 12:50 AM Andres Freund <andres@anarazel.de> wrote:
> > An extended cycle of 80 backends copying into relations and occasionally
> > truncating them (to simulate the partitions being dropped and new ones
> > created). For this I ran a 4TB filesystem very close to fully filled
> > (peaking at 99.998 % full).
>
> I could only think of the question: how many files were involved there?

I varied the number heavily. From dozens to 10s of thousands. No meaningful
difference.

> Well IMHO we are stuck till Michael provides some more data (patch outcome,
> bpf and maybe other hints and tests).

Yea.

Greetings,

Andres Freund
On 2024-Dec-16, Robert Haas wrote:

> On Mon, Dec 16, 2024 at 9:12 AM Andres Freund <andres@anarazel.de> wrote:
> > Personally I don't like the obfuscation of "allocate" and "zero" vs just
> > naming the function names. But I guess that's just a taste thing.
> >
> > When looking for problems it's considerably more work with bytes, because - at
> > least for me - the large number is hard to compare quickly, and knowing how
> > aggressively we extended also requires translating to blocks.
>
> FWIW, I think that what we report in the error should hew as closely
> to the actual system call as possible. Hence, I agree with your first
> complaint and would prefer to simply see the system calls named, but I
> disagree with your second complaint and would prefer to see the byte
> count.

Maybe we can add errdetail("The system call was FileFallocate( ... %u ...)")
with the number of bytes, and leave the errmsg() mentioning the general
operation being done (allocate, zero, etc) with the number of blocks.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"The eagle never lost so much time, as when he submitted to learn of the
crow." (William Blake)
Hi,

On 2024-12-16 18:05:59 +0100, Alvaro Herrera wrote:
> On 2024-Dec-16, Robert Haas wrote:
>
> > On Mon, Dec 16, 2024 at 9:12 AM Andres Freund <andres@anarazel.de> wrote:
> > > Personally I don't like the obfuscation of "allocate" and "zero" vs just
> > > naming the function names. But I guess that's just a taste thing.
> > >
> > > When looking for problems it's considerably more work with bytes, because - at
> > > least for me - the large number is hard to compare quickly, and knowing how
> > > aggressively we extended also requires translating to blocks.
> >
> > FWIW, I think that what we report in the error should hew as closely
> > to the actual system call as possible. Hence, I agree with your first
> > complaint and would prefer to simply see the system calls named, but I
> > disagree with your second complaint and would prefer to see the byte
> > count.
>
> Maybe we can add errdetail("The system call was FileFallocate( ... %u ...)")
> with the number of bytes, and leave the errmsg() mentioning the general
> operation being done (allocate, zero, etc) with the number of blocks.

I don't see what we gain by requiring guesswork (what does allocating vs
zeroing mean, zeroing also allocates disk space after all) to interpret the
main error message. My experience is that it's often harder to get the DETAIL
than the actual error message (grepping becomes harder due to separate line,
terse verbosity is commonly used).

I think we're going too far towards not mentioning the actual problems in too
many error messages in general.

Greetings,

Andres Freund
On Mon, Dec 16, 2024 at 12:52 PM Andres Freund <andres@anarazel.de> wrote:
> I don't see what we gain by requiring guesswork (what does allocating vs
> zeroing mean, zeroing also allocates disk space after all) to interpret the
> main error message. My experience is that it's often harder to get the DETAIL
> than the actual error message (grepping becomes harder due to separate line,
> terse verbosity is commonly used).

I feel like the normal way that we do this is basically:

  could not {name of system call} file \"%s\": %m

e.g.

  could not read file \"%s\": %m

I don't know why we should do anything else in this type of case.

--
Robert Haas
EDB: http://www.enterprisedb.com
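For instance, a message following that convention for this case might look
something like the following (hypothetical wording, not an actual proposal;
addbytes and seekpos are assumed to be the values passed to the syscall):

  errmsg("could not fallocate file \"%s\" (%lld bytes at offset %lld): %m",
         FilePathName(v->mdfd_vfd),
         (long long) addbytes, (long long) seekpos)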
On Thu, Dec 19, 2024 at 7:49 AM Michael Harris <harmic@gmail.com> wrote:
> Hello,
>
> I finally managed to get the patched version installed in a production
> database where the error is occurring very regularly.
>
> Here is a sample of the output:
>
> 2024-12-19 01:08:50 CET [2533222]: LOG: mdzeroextend FileFallocate
> failing with ENOSPC: free space for filesystem containing
> "pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" f_blocks:
> 2683831808, f_bfree: 205006167, f_bavail: 205006167 f_files:
> 1073741376, f_ffree: 1069933796
>
> [..]
>
> I have attached a file containing all the errors I collected. The
> error is happening pretty regularly - over 400 times in a ~6 hour
> period. The number of blocks being extended varies from ~9 to ~15, and
> the statfs result shows plenty of available space & inodes at the
> time. The errors do seem to come in bursts.
No one else has responded, so I'll try. My take is that we have had a very limited number of reports (2-3) of this happening, and it always seems to involve >90% space used. The adoption of PG16 is rising, so we may or may not see more errors of this kind, but on the other hand it is so rare that it's really wild we don't see more reports like this one. Lots of OS upgrades in the wild are performed by building new standbys (which maybe lowers the fs fragmentation) rather than upgrading the OS in place. To me it sounds like a new, rare bug in XFS. You can probably live with #undef HAVE_POSIX_FALLOCATE as a way to survive; another option would be to try running xfs_fsr to defragment the fs.
Longer-term: other than collecting the eBPF data to start digging into where this is really triggered, I don't see a way forward. It would be suboptimal to just abandon the fallocate() optimizations from commit 31966b151e6ab7a6284deab6e8fe5faddaf2ae4c because of a very unusual combination of factors (an XFS bug).
Well, we could have some kludge along the lines of the pseudo-code if (posix_fallocate() == ENOSPC && statfs().free_space_pct >= 1) fallback_to_pwrites(), but it is ugly (a rough sketch follows below). Another option is a GUC (or even two -- how far to extend, and whether to use posix_fallocate() at all), but people do not like more GUCs...
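Spelled out a little more (purely illustrative, not proposed code; the ~1%
threshold and the surrounding variable names are assumptions, and it needs
#include <sys/statvfs.h>):

  ret = FileFallocate(v->mdfd_vfd, seekpos, addbytes,
                      WAIT_EVENT_DATA_FILE_EXTEND);
  if (ret != 0 && errno == ENOSPC)
  {
      struct statvfs svfs;

      /* distrust ENOSPC if the filesystem still reports >= ~1% free */
      if (statvfs(FilePathName(v->mdfd_vfd), &svfs) == 0 &&
          svfs.f_bavail * 100 >= svfs.f_blocks)
      {
          /* looks like a spurious ENOSPC -- fall back to writing zeroes */
          ret = FileZero(v->mdfd_vfd, seekpos, addbytes,
                         WAIT_EVENT_DATA_FILE_EXTEND);
      }
  }
  if (ret != 0)
      ereport(ERROR,
              (errcode_for_file_access(),
               errmsg("could not extend file \"%s\": %m",
                      FilePathName(v->mdfd_vfd)),
               errhint("Check free disk space.")));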
> I have so far not installed the bpftrace that Jakub suggested before -
> as I say this is a production machine and I am wary of triggering a
> kernel panic or worse (even though it seems like the risk for that
> would be low?). While a kernel stack trace would no doubt be helpful
> to the XFS developers, from a postgres point of view, would that be
> likely to help us decide what to do about this?[..]
Well, you could try to reproduce this outside of production, or even clone the storage - not via backup/restore, but by literally cloning the XFS LUNs on the storage array itself - and connect a separate VM to it as a safe testbed (or even dd(1) a smaller XFS filesystem exhibiting this behaviour to some other place).
As for eBPF/bpftrace: it is safe (it's sandboxed anyway), lots of customers are using it, but as always YMMV.
There's also xfs_fsr, which might help overcome the fragmentation.
You can also experiment with whether -o allocsize helps, or even try -o allocsize=0 (though that would probably have some negative performance impact).
-J.
Hi,

On 2024-12-19 17:47:13 +1100, Michael Harris wrote:
> I finally managed to get the patched version installed in a production
> database where the error is occurring very regularly.

Thanks!

> Here is a sample of the output:
>
> 2024-12-19 01:08:50 CET [2533222]: LOG: mdzeroextend FileFallocate
> failing with ENOSPC: free space for filesystem containing
> "pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" f_blocks:
> 2683831808, f_bfree: 205006167, f_bavail: 205006167 f_files:
> 1073741376, f_ffree: 1069933796

That's ~700 GB of free space... It'd be interesting to see filefrag -v for
that segment.

> This is a different system to those I previously provided logs from.
> It is also running RHEL8 with a similar configuration to the other
> system.

Given it's a RHEL system, have you raised this as an issue with RH? They
probably have somebody with actual XFS hacking experience on staff. RH's
kernels are *heavily* patched, so it's possible the issue is actually RH
specific.

> I have so far not installed the bpftrace that Jakub suggested before -
> as I say this is a production machine and I am wary of triggering a
> kernel panic or worse (even though it seems like the risk for that
> would be low?). While a kernel stack trace would no doubt be helpful
> to the XFS developers, from a postgres point of view, would that be
> likely to help us decide what to do about this?

Well, I'm personally wary of installing workarounds for a problem I don't
understand and can't reproduce, which might be specific to old filesystems
and/or heavily patched kernels. This clearly is an FS bug.

That said, if we learn that somehow this is a fundamental XFS issue that can
be triggered on every XFS filesystem, with current kernels, it becomes more
reasonable to implement a workaround in PG.

Another thing I've been wondering about is if we could reduce the frequency of
hitting problems by rounding up the number of blocks we extend by to powers of
two. That would probably reduce fragmentation, and the extra space would be
quickly used in workloads where we extend by a bunch of blocks at once, anyway.

Greetings,

Andres Freund
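A sketch of that rounding idea, for concreteness (hypothetical, not a
proposal; pg_nextpower2_32() already exists in src/include/port/pg_bitutils.h,
and 64 mirrors the existing bulk-extension cap):

  /* round a bulk-extension request up to the next power of two, keeping the cap */
  #include "port/pg_bitutils.h"

  static uint32
  round_extend_by(uint32 extend_by_pages)
  {
      /* assumes extend_by_pages >= 1 */
      uint32      rounded = pg_nextpower2_32(extend_by_pages);

      return Min(rounded, 64);
  }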