Thread: FileFallocate misbehaving on XFS

FileFallocate misbehaving on XFS

From
Michael Harris
Date:
Hello PG Hackers

Our application has recently migrated to PG16, and we have experienced
some failed upgrades. The upgrades are performed using pg_upgrade and
have failed during the phase where the schema is restored into the new
cluster, with the following error:

pg_restore: error: could not execute query: ERROR:  could not extend
file "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with
FileFallocate(): No space left on device
HINT:  Check free disk space.

This has happened multiple times on different servers, and in each
case there was plenty of free space available.

We found this thread describing similar issues:


https://www.postgresql.org/message-id/flat/AS1PR05MB91059AC8B525910A5FCD6E699F9A2%40AS1PR05MB9105.eurprd05.prod.outlook.com

As is the case in that thread, all of the affected databases are using XFS.

One of my colleagues built postgres from source with
HAVE_POSIX_FALLOCATE not defined, and using that build he was able to
complete the pg_upgrade, and then switched to a stock postgres build
after the upgrade. However, as you might expect, after the upgrade we
have experienced similar errors during regular operation. We make
heavy use of COPY, which is mentioned in the above discussion as
pre-allocating files.

We have seen this on both Rocky Linux 8 (kernel 4.18.0) and Rocky
Linux 9 (Kernel 5.14.0).

I am wondering if this bug might be related:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1791323

> When given an offset of 0 and a length, fallocate (man 2 fallocate) reports ENOSPC if the size of the file + the length to be allocated is greater than the available space.

There is a reproduction procedure at the bottom of the above ubuntu
thread, and using that procedure I get the same results on both kernel
4.18.0 and 5.14.0.
When calling fallocate with offset zero on an existing file, I get
ENOSPC even if I am only requesting the same amount of space as the
file already has.
If I repeat the experiment with ext4 I don't get that behaviour.
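
For reference, a minimal sketch of the behaviour described (not the exact
procedure from the Ubuntu thread; it assumes a small test XFS filesystem of
roughly 1GB mounted at /mnt/xfs, and the paths are placeholders):

  # create a 600 MiB file, fully allocated
  dd if=/dev/zero of=/mnt/xfs/testfile bs=1M count=600
  # ask for the same range again, starting at offset 0
  fallocate -o 0 -l 600MiB /mnt/xfs/testfile
  # on the affected XFS kernels the second step fails with
  # "No space left on device", even though the whole range is
  # already allocated; on ext4 it succeeds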

On a surface examination of the code paths leading to the
FileFallocate call, it does not look like it should be trying to
allocate already allocated space, but I might have missed something
there.

Is this already being looked into?

Thanks in advance,

Cheers
Mike



Re: FileFallocate misbehaving on XFS

From
Andrea Gelmini
Date:


Il Lun 9 Dic 2024, 10:19 Michael Harris <harmic@gmail.com> ha scritto:

Is this already being looked into?

Funny, I guess it's the same reason I randomly see the WhatsApp web interface complain, on Chrome, since I switched to XFS. It says something like "no more space on disk" and logs out, with more than 300GB available.

Anyway, just a stupid hint, I would try writing to the XFS mailing list. There you can reach the XFS maintainers at Red Hat and the usual historical developers, of course!!!


Re: FileFallocate misbehaving on XFS

From
Tomas Vondra
Date:

On 12/9/24 08:34, Michael Harris wrote:
> Hello PG Hackers
> 
> Our application has recently migrated to PG16, and we have experienced
> some failed upgrades. The upgrades are performed using pg_upgrade and
> have failed during the phase where the schema is restored into the new
> cluster, with the following error:
> 
> pg_restore: error: could not execute query: ERROR:  could not extend
> file "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with
> FileFallocate(): No space left on device
> HINT:  Check free disk space.
> 
> This has happened multiple times on different servers, and in each
> case there was plenty of free space available.
> 
> We found this thread describing similar issues:
> 
>
https://www.postgresql.org/message-id/flat/AS1PR05MB91059AC8B525910A5FCD6E699F9A2%40AS1PR05MB9105.eurprd05.prod.outlook.com
> 
> As is the case in that thread, all of the affected databases are using XFS.
> 
> One of my colleagues built postgres from source with
> HAVE_POSIX_FALLOCATE not defined, and using that build he was able to
> complete the pg_upgrade, and then switched to a stock postgres build
> after the upgrade. However, as you might expect, after the upgrade we
> have experienced similar errors during regular operation. We make
> heavy use of COPY, which is mentioned in the above discussion as
> pre-allocating files.
> 
> We have seen this on both Rocky Linux 8 (kernel 4.18.0) and Rocky
> Linux 9 (Kernel 5.14.0).
> 
> I am wondering if this bug might be related:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1791323
> 
>> When given an offset of 0 and a length, fallocate (man 2 fallocate) reports ENOSPC if the size of the file + the length to be allocated is greater than the available space.
> 
> There is a reproduction procedure at the bottom of the above ubuntu
> thread, and using that procedure I get the same results on both kernel
> 4.18.0 and 5.14.0.
> When calling fallocate with offset zero on an existing file, I get
> ENOSPC even if I am only requesting the same amount of space as the
> file already has.
> If I repeat the experiment with ext4 I don't get that behaviour.
> 
> On a surface examination of the code paths leading to the
> FileFallocate call, it does not look like it should be trying to
> allocate already allocated space, but I might have missed something
> there.
> 
> Is this already being looked into?
> 

Sounds more like an XFS bug/behavior, so it's not clear to me what we
could do about it. I mean, if the filesystem reports bogus out-of-space,
is there even something we can do?

What is not clear to me is why this would affect pg_upgrade at all. We
have the data files split into 1GB segments, and the copy/clone/... goes
one by one. So there shouldn't be more than 1GB "extra" space needed.
Surely you have more free space on the system?


regards

-- 
Tomas Vondra




Re: FileFallocate misbehaving on XFS

From
Tomas Vondra
Date:
On 12/9/24 10:47, Andrea Gelmini wrote:
> 
> 
> Il Lun 9 Dic 2024, 10:19 Michael Harris <harmic@gmail.com
> <mailto:harmic@gmail.com>> ha scritto:
> 
> 
>     Is this already being looked into?
> 
> 
> Funny, I guess it's the same reason I randomly see the WhatsApp web
> interface complain, on Chrome, since I switched to XFS. It says something
> like "no more space on disk" and logs out, with more than 300GB available.
> 

If I understand the fallocate issue correctly, it essentially ignores
the offset, so "fallocate -o 0 -l LENGTH" fails if

    LENGTH + CURRENT_LENGTH > FREE_SPACE

But if you have 300GB available, that'd mean you have a file that's
close to that size already. But is that likely for WhatsApp?

> Anyway, just a stupid hint, I would try writing to the XFS mailing list.
> There you can reach the XFS maintainers at Red Hat and the usual historical
> developers, of course!!!
> 

Yes, I think that's a better place to report this. I don't think we're
doing anything particularly weird / wrong with fallocate().


regards

-- 
Tomas Vondra




Re: FileFallocate misbehaving on XFS

From
Jakub Wartak
Date:
On Mon, Dec 9, 2024 at 10:19 AM Michael Harris <harmic@gmail.com> wrote:

Hi Michael,

We found this thread describing similar issues:

https://www.postgresql.org/message-id/flat/AS1PR05MB91059AC8B525910A5FCD6E699F9A2%40AS1PR05MB9105.eurprd05.prod.outlook.com

We've had a case in the past here at EDB where an OS vendor blamed XFS AG fragmentation (too many AGs, and if one AG does not have enough space -> error). Could you perhaps show us the output of the following on that LUN:
1. xfs_info
2. the script from https://www.suse.com/support/kb/doc/?id=000018219, run for your AG range

-J.

Re: FileFallocate misbehaving on XFS

From
Tomas Vondra
Date:

On 12/9/24 11:27, Jakub Wartak wrote:
> On Mon, Dec 9, 2024 at 10:19 AM Michael Harris <harmic@gmail.com
> <mailto:harmic@gmail.com>> wrote:
> 
> Hi Michael,
> 
>     We found this thread describing similar issues:
> 
>     https://www.postgresql.org/message-id/flat/
>     AS1PR05MB91059AC8B525910A5FCD6E699F9A2%40AS1PR05MB9105.eurprd05.prod.outlook.com
<https://www.postgresql.org/message-id/flat/AS1PR05MB91059AC8B525910A5FCD6E699F9A2%40AS1PR05MB9105.eurprd05.prod.outlook.com>
> 
> 
> We've got some case in the past here in EDB, where an OS vendor has
> blamed XFS AG fragmentation (too many AGs, and if one AG is not having
> enough space -> error). Could You perhaps show us output of on that LUN:
> 1. xfs_info
> 2. run that script from https://www.suse.com/support/kb/doc/?
> id=000018219 <https://www.suse.com/support/kb/doc/?id=000018219> for
> Your AG range
> 

But this can be reproduced on a brand new filesystem - I just tried
creating a 1GB image, creating XFS on it, mounting it, and fallocating a
600MB file twice. The second call fails, and there can't be any real
fragmentation.
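
A sketch of that test, for anyone who wants to repeat it (run as root; the
image path and mount point are placeholders):

  truncate -s 1G /tmp/xfs.img
  mkfs.xfs -f /tmp/xfs.img
  mkdir -p /mnt/xfstest
  mount -o loop /tmp/xfs.img /mnt/xfstest
  fallocate -l 600MiB /mnt/xfstest/f   # succeeds
  fallocate -l 600MiB /mnt/xfstest/f   # fails with ENOSPC on XFS, although
                                       # the 600MB are already allocated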

regards

-- 
Tomas Vondra




Re: FileFallocate misbehaving on XFS

From
Andres Freund
Date:
Hi,

On 2024-12-09 15:47:55 +0100, Tomas Vondra wrote:
> On 12/9/24 11:27, Jakub Wartak wrote:
> > On Mon, Dec 9, 2024 at 10:19 AM Michael Harris <harmic@gmail.com
> > <mailto:harmic@gmail.com>> wrote:
> > 
> > Hi Michael,
> > 
> >     We found this thread describing similar issues:
> > 
> >     https://www.postgresql.org/message-id/flat/
> >     AS1PR05MB91059AC8B525910A5FCD6E699F9A2%40AS1PR05MB9105.eurprd05.prod.outlook.com
<https://www.postgresql.org/message-id/flat/AS1PR05MB91059AC8B525910A5FCD6E699F9A2%40AS1PR05MB9105.eurprd05.prod.outlook.com>
> > 
> > 
> > We've got some case in the past here in EDB, where an OS vendor has
> > blamed XFS AG fragmentation (too many AGs, and if one AG is not having
> > enough space -> error). Could You perhaps show us output of on that LUN:
> > 1. xfs_info
> > 2. run that script from https://www.suse.com/support/kb/doc/?
> > id=000018219 <https://www.suse.com/support/kb/doc/?id=000018219> for
> > Your AG range
> > 
> 
> But this can be reproduced on a brand new filesystem - I just tried
> creating a 1GB image, creating XFS on it, mounting it, and fallocating a
> 600MB file twice. The second call fails, and there can't be any real
> fragmentation.

If I understand correctly, xfs, before even looking at the file's current
layout, checks whether there's enough free space for the fallocate() to
succeed.  Here's an explanation for why:
https://www.spinics.net/lists/linux-xfs/msg55429.html

  The real problem with preallocation failing part way through due to
  overcommit of space is that we can't go back an undo the
  allocation(s) made by fallocate because when we get ENOSPC we have
  lost all the state of the previous allocations made. If fallocate is
  filling holes between unwritten extents already in the file, then we
  have no way of knowing where the holes we filled were and hence
  cannot reliably free the space we've allocated before ENOSPC was
  hit.

I.e. reserving space as you go would leave you open to ending up with some,
but not all, of those allocations having been made. Whereas pre-reserving the
worst case space needed, ahead of time, ensures that you have enough space to
go through it all.

You can't just go through the file [range] and compute how much free space you
will need to allocate and then do a second pass through the file, because the
file layout might have changed concurrently...


This issue seems independent of the issue Michael is having though. Postgres,
afaik, won't fallocate huge ranges with already allocated space.

Greetings,

Andres Freund



Re: FileFallocate misbehaving on XFS

From
Andres Freund
Date:
Hi,

On 2024-12-09 18:34:22 +1100, Michael Harris wrote:
> Our application has recently migrated to PG16, and we have experienced
> some failed upgrades. The upgrades are performed using pg_upgrade and
> have failed during the phase where the schema is restored into the new
> cluster, with the following error:
>
> pg_restore: error: could not execute query: ERROR:  could not extend
> file "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with
> FileFallocate(): No space left on device
> HINT:  Check free disk space.

Were those pg_upgrades done with pg_upgrade --clone? Or have any been, on the
same filesystem, in the past?

The reflink stuff in xfs (which is used to implement copy-on-write for files)
is somewhat newer and you're using somewhat old kernels:


> We have seen this on both Rocky Linux 8 (kernel 4.18.0) and Rocky
> Linux 9 (Kernel 5.14.0).

I found some references for bugs that were fixed in 5.13. But I think at least
some of this would persist if the filesystem ran into the issue with a kernel
before those fixes. Did you upgrade "in-place" from Rocky Linux 8?


> I am wondering if this bug might be related:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1791323

Doubt it, we never do this as far as I am aware.

Greetings,

Andres Freund



Re: FileFallocate misbehaving on XFS

From
Michael Harris
Date:
Hi Andres

On Tue, 10 Dec 2024 at 03:31, Andres Freund <andres@anarazel.de> wrote:
> Were those pg_upgrades done with pg_upgrade --clone? Or have any been, on the
> same filesystem, in the past?

No, our procedure is to use --link.

> I found some references for bugs that were fixed in 5.13. But I think at least
> some of this would persist if the filesystem ran into the issue with a kernel
> before those fixes. Did you upgrade "in-place" from Rocky Linux 8?

We generally don't use "in place" OS upgrades - however we would
usually have the databases on separate filesystem(s) to the OS, and
those filesystem(s) would be preserved through the upgrade, while the
root fs would be scratched.
A lot of the cases reported are on RL8. I will try to find out the
history of the RL9 cases to see if the filesystems started on RL8.

Could you please provide me links for the kernel bugs you are referring to?

Cheers
Mike.



Re: FileFallocate misbehaving on XFS

From
Michael Harris
Date:
Hi Tomas

On Mon, 9 Dec 2024 at 21:06, Tomas Vondra <tomas@vondra.me> wrote:
> Sounds more like an XFS bug/behavior, so it's not clear to me what we
> could do about it. I mean, if the filesystem reports bogus out-of-space,
> is there even something we can do?

I don't disagree that it's most likely an XFS issue. However, XFS is
pretty widely used - it's the default FS for RHEL & the default in
SUSE for non-root partitions - so maybe some action should be taken.

Some things we could consider:

 - Providing a way to configure PG not to use posix_fallocate at runtime

 - Detecting the use of XFS (probably nasty and complex to do in a
platform independent way) and disable posix_fallocate

 - In the case of posix_fallocate failing with ENOSPC, fall back to
FileZero (worst case that will fail as well, in which case we will
know that we really are out of space)

 - Documenting that XFS might not be a good choice, at least for some
kernel versions

> What is not clear to me is why this would affect pg_upgrade at all. We
> have the data files split into 1GB segments, and the copy/clone/... goes
> one by one. So there shouldn't be more than 1GB "extra" space needed.
> Surely you have more free space on the system?

Yes, that also confused me. It actually fails during the schema
restore phase - where pg_upgrade calls pg_restore to restore a
schema-only dump that it takes earlier in the process. At this stage
it is only trying to restore the schema, not any actual table data.
Note that we use the --link  option to pg_upgrade, so it should not be
using much space even when the table data is being upgraded.

The filesystems have >1TB free space when this has occurred.

It does continue to give this error after the upgrade, at apparently
random intervals, when data is being loaded into the DB using COPY
commands, so it might be best not to focus too much on the fact that
we first encounter it during the upgrade.

Cheers
Mike.



Re: FileFallocate misbehaving on XFS

From
Andres Freund
Date:
Hi,

On 2024-12-10 09:34:08 +1100, Michael Harris wrote:
> On Tue, 10 Dec 2024 at 03:31, Andres Freund <andres@anarazel.de> wrote:
> > I found some references for bugs that were fixed in 5.13. But I think at least
> > some of this would persist if the filesystem ran into the issue with a kernel
> > before those fixes. Did you upgrade "in-place" from Rocky Linux 8?
> 
> We generally don't use "in place" OS upgrades - however we would
> usually have the databases on separate filesystem(s) to the OS, and
> those filesystem(s) would be preserved through the upgrade, while the
> root fs would be scratched.

Makes sense.


> A lot of the cases reported are on RL8. I will try to find out the
> history of the RL9 cases to see if the filesystems started on RL8.

That'd be helpful....


> Could you please provide me links for the kernel bugs you are referring to?

I unfortunately closed most of the tabs, the only one I could quickly find
again is the one referenced at the bottom of:
https://www.spinics.net/lists/linux-xfs/msg55445.html

Greetings,

Andres



Re: FileFallocate misbehaving on XFS

From
Andres Freund
Date:
Hi,

On 2024-12-10 10:00:43 +1100, Michael Harris wrote:
> On Mon, 9 Dec 2024 at 21:06, Tomas Vondra <tomas@vondra.me> wrote:
> > Sounds more like an XFS bug/behavior, so it's not clear to me what we
> > could do about it. I mean, if the filesystem reports bogus out-of-space,
> > is there even something we can do?
> 
> I don't disagree that it's most likely an XFS issue. However, XFS is
> pretty widely used - it's the default FS for RHEL & the default in
> SUSE for non-root partitions - so maybe some action should be taken.
> 
> Some things we could consider:
> 
>  - Providing a way to configure PG not to use posix_fallocate at runtime
> 
>  - Detecting the use of XFS (probably nasty and complex to do in a
> platform independent way) and disable posix_fallocate
> 
>  - In the case of posix_fallocate failing with ENOSPC, fall back to
> FileZero (worst case that will fail as well, in which case we will
> know that we really are out of space)
> 
>  - Documenting that XFS might not be a good choice, at least for some
> kernel versions

Pretty unexcited about all of these - XFS is fairly widely used for PG, but
this problem doesn't seem very common. It seems to me that we're missing
something that causes this to only happen in a small subset of cases.

I think the source of this needs to be debugged further before we try to apply
workarounds in postgres.

Are you using any filesystem quotas?

It'd be useful to get the xfs_info output that Jakub asked for. Perhaps also
xfs_spaceman -c 'freesp -s' /mountpoint
xfs_spaceman -c 'health' /mountpoint
and df.

What kind of storage is this on?

Was the filesystem ever grown from a smaller size?

Have you checked the filesystem's internal consistency? I.e. something like
xfs_repair -n /dev/nvme2n1. It does require the filesystem to be read-only or
unmounted though. But corrupted filesystem datastructures certainly could
cause spurious ENOSPC.


> > What is not clear to me is why this would affect pg_upgrade at all. We
> > have the data files split into 1GB segments, and the copy/clone/... goes
> > one by one. So there shouldn't be more than 1GB "extra" space needed.
> > Surely you have more free space on the system?
> 
> Yes, that also confused me. It actually fails during the schema
> restore phase - where pg_upgrade calls pg_restore to restore a
> schema-only dump that it takes earlier in the process. At this stage
> it is only trying to restore the schema, not any actual table data.
> Note that we use the --link  option to pg_upgrade, so it should not be
> using much space even when the table data is being upgraded.

Are you using pg_upgrade -j?

I'm asking because looking at linux's git tree I found this interesting recent
commit: https://git.kernel.org/linus/94a0333b9212 - but IIUC it'd actually
cause file creation, not fallocate to fail.



> The filesystems have >1TB free space when this has occurred.
> 
> It does continue to give this error after the upgrade, at apparently
> random intervals, when data is being loaded into the DB using COPY
> commands, so it might be best not to focus too much on the fact that
> we first encounter it during the upgrade.

I assume the file that actually errors out changes over time? It's always
fallocate() that fails?

Can you tell us anything about the workload / data? Lots of tiny tables, lots
of big tables, write heavy, ...?

Greetings,

Andres Freund



Re: FileFallocate misbehaving on XFS

From
Michael Harris
Date:
Hi Andres

Following up on the earlier question about OS upgrade paths - all the
cases reported so far are either on RL8 (Kernel 4.18.0) or were
upgraded to RL9 (kernel 5.14.0) and the affected filesystems were
preserved.
In fact the RL9 systems were initially built as Centos 7, and then
when that went EOL they were upgraded to RL9. The process was as I
described - the /var/opt filesystem which contained the database was
preserved, and the root and other OS filesystems were scratched.
The majority of systems where we have this problem are on RL8.

On Tue, 10 Dec 2024 at 11:31, Andres Freund <andres@anarazel.de> wrote:
> Are you using any filesystem quotas?

No.

> It'd be useful to get the xfs_info output that Jakub asked for. Perhaps also
> xfs_spaceman -c 'freesp -s' /mountpoint
> xfs_spaceman -c 'health' /mountpoint
> and df.

I gathered this info from one of the systems that is currently on RL9.
This system is relatively small compared to some of the others that
have exhibited this issue, but it is the only one I can access right
now.

# uname -a
Linux 5.14.0-503.14.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 15
12:04:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

# xfs_info /dev/mapper/ippvg-ipplv
meta-data=/dev/mapper/ippvg-ipplv isize=512    agcount=4, agsize=262471424 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0, rmapbt=0
         =                       reflink=0    bigtime=0 inobtcount=0 nrext64=0
data     =                       bsize=4096   blocks=1049885696, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=512639, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# for agno in `seq 0 3`; do xfs_spaceman -c "freesp -s -a $agno" /var/opt; done
   from      to extents  blocks    pct
      1       1   37502   37502   0.15
      2       3   62647  148377   0.59
      4       7   87793  465950   1.85
      8      15  135529 1527172   6.08
     16      31  184811 3937459  15.67
     32      63  165979 7330339  29.16
     64     127  101674 8705691  34.64
    128     255   15123 2674030  10.64
    256     511     973  307655   1.22
total free extents 792031
total free blocks 25134175
average free extent size 31.7338
   from      to extents  blocks    pct
      1       1   43895   43895   0.22
      2       3   59312  141693   0.70
      4       7   83406  443964   2.20
      8      15  120804 1362108   6.75
     16      31  133140 2824317  14.00
     32      63  118619 5188474  25.71
     64     127   77960 6751764  33.46
    128     255   16383 2876626  14.26
    256     511    1763  546506   2.71
total free extents 655282
total free blocks 20179347
average free extent size 30.7949
   from      to extents  blocks    pct
      1       1   72034   72034   0.26
      2       3   98158  232135   0.83
      4       7  126228  666187   2.38
      8      15  169602 1893007   6.77
     16      31  180286 3818527  13.65
     32      63  164529 7276833  26.01
     64     127  109687 9505160  33.97
    128     255   22113 3921162  14.02
    256     511    1901  592052   2.12
total free extents 944538
total free blocks 27977097
average free extent size 29.6199
   from      to extents  blocks    pct
      1       1   51462   51462   0.21
      2       3   98993  233204   0.93
      4       7  131578  697655   2.79
      8      15  178151 1993062   7.97
     16      31  175718 3680535  14.72
     32      63  145310 6372468  25.48
     64     127   89518 7749021  30.99
    128     255   18926 3415768  13.66
    256     511    2640  813586   3.25
total free extents 892296
total free blocks 25006761
average free extent size 28.0252

# xfs_spaceman -c 'health' /var/opt
Health status has not been collected for this filesystem.

> What kind of storage is this on?

As mentioned, there are quite a few systems in different sites, so a
number of different storage solutions in use, some with directly
attached disks, others with some SAN solutions.
The instance I got the printout above from is a VM, but in the other
site they are all bare metal.

> Was the filesystem ever grown from a smaller size?

I can't say for sure that none of them were, but given the number of
different systems that have this issue I am confident that would not
be a common factor.

> Have you checked the filesystem's internal consistency? I.e. something like
> xfs_repair -n /dev/nvme2n1. It does require the filesystem to be read-only or
> unmounted though. But corrupted filesystem datastructures certainly could
> cause spurious ENOSPC.

I executed this on the same system as the above prints came from. It
did not report any issues.

> Are you using pg_upgrade -j?

Yes, we use -j `nproc`

> I assume the file that actually errors out changes over time? It's always
> fallocate() that fails?

Yes, correct, on both counts.

> Can you tell us anything about the workload / data? Lots of tiny tables, lots
> of big tables, write heavy, ...?

It is a write heavy application which stores mostly time series data.

The time series data is partitioned by time; the application writes
constantly into the 'current' partition, and data is expired by
removing the oldest partition. Most of the data is written once and
not updated.

There are quite a lot of these partitioned tables (in the 1000's or
10000's) depending on how the application is configured. Individual
partitions range in size from a few MB to 10s of GB.

Cheers
Mike.



Re: FileFallocate misbehaving on XFS

From
Michael Harris
Date:
Hi again

One extra piece of information: I had said that all the machines were
Rocky Linux 8 or Rocky Linux 9, but actually a large number of them
are RHEL8.

Sorry for the confusion.

Of course RL8 is a rebuild of RHEL8 so it is not surprising they would
be behaving similarly.

Cheers
Mike

On Tue, 10 Dec 2024 at 17:28, Michael Harris <harmic@gmail.com> wrote:
>
> Hi Andres
>
> Following up on the earlier question about OS upgrade paths - all the
> cases reported so far are either on RL8 (Kernel 4.18.0) or were
> upgraded to RL9 (kernel 5.14.0) and the affected filesystems were
> preserved.
> In fact the RL9 systems were initially built as Centos 7, and then
> when that went EOL they were upgraded to RL9. The process was as I
> described - the /var/opt filesystem which contained the database was
> preserved, and the root and other OS filesystems were scratched.
> The majority of systems where we have this problem are on RL8.
>
> On Tue, 10 Dec 2024 at 11:31, Andres Freund <andres@anarazel.de> wrote:
> > Are you using any filesystem quotas?
>
> No.
>
> > It'd be useful to get the xfs_info output that Jakub asked for. Perhaps also
> > xfs_spaceman -c 'freesp -s' /mountpoint
> > xfs_spaceman -c 'health' /mountpoint
> > and df.
>
> I gathered this info from one of the systems that is currently on RL9.
> This system is relatively small compared to some of the others that
> have exhibited this issue, but it is the only one I can access right
> now.
>
> # uname -a
> Linux 5.14.0-503.14.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 15
> 12:04:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
>
> # xfs_info /dev/mapper/ippvg-ipplv
> meta-data=/dev/mapper/ippvg-ipplv isize=512    agcount=4, agsize=262471424 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=0, sparse=0, rmapbt=0
>          =                       reflink=0    bigtime=0 inobtcount=0 nrext64=0
> data     =                       bsize=4096   blocks=1049885696, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=512639, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> # for agno in `seq 0 3`; do xfs_spaceman -c "freesp -s -a $agno" /var/opt; done
>    from      to extents  blocks    pct
>       1       1   37502   37502   0.15
>       2       3   62647  148377   0.59
>       4       7   87793  465950   1.85
>       8      15  135529 1527172   6.08
>      16      31  184811 3937459  15.67
>      32      63  165979 7330339  29.16
>      64     127  101674 8705691  34.64
>     128     255   15123 2674030  10.64
>     256     511     973  307655   1.22
> total free extents 792031
> total free blocks 25134175
> average free extent size 31.7338
>    from      to extents  blocks    pct
>       1       1   43895   43895   0.22
>       2       3   59312  141693   0.70
>       4       7   83406  443964   2.20
>       8      15  120804 1362108   6.75
>      16      31  133140 2824317  14.00
>      32      63  118619 5188474  25.71
>      64     127   77960 6751764  33.46
>     128     255   16383 2876626  14.26
>     256     511    1763  546506   2.71
> total free extents 655282
> total free blocks 20179347
> average free extent size 30.7949
>    from      to extents  blocks    pct
>       1       1   72034   72034   0.26
>       2       3   98158  232135   0.83
>       4       7  126228  666187   2.38
>       8      15  169602 1893007   6.77
>      16      31  180286 3818527  13.65
>      32      63  164529 7276833  26.01
>      64     127  109687 9505160  33.97
>     128     255   22113 3921162  14.02
>     256     511    1901  592052   2.12
> total free extents 944538
> total free blocks 27977097
> average free extent size 29.6199
>    from      to extents  blocks    pct
>       1       1   51462   51462   0.21
>       2       3   98993  233204   0.93
>       4       7  131578  697655   2.79
>       8      15  178151 1993062   7.97
>      16      31  175718 3680535  14.72
>      32      63  145310 6372468  25.48
>      64     127   89518 7749021  30.99
>     128     255   18926 3415768  13.66
>     256     511    2640  813586   3.25
> total free extents 892296
> total free blocks 25006761
> average free extent size 28.0252
>
> # xfs_spaceman -c 'health' /var/opt
> Health status has not been collected for this filesystem.
>
> > What kind of storage is this on?
>
> As mentioned, there are quite a few systems in different sites, so a
> number of different storage solutions in use, some with directly
> attached disks, others with some SAN solutions.
> The instance I got the printout above from is a VM, but in the other
> site they are all bare metal.
>
> > Was the filesystem ever grown from a smaller size?
>
> I can't say for sure that none of them were, but given the number of
> different systems that have this issue I am confident that would not
> be a common factor.
>
> > Have you checked the filesystem's internal consistency? I.e. something like
> > xfs_repair -n /dev/nvme2n1. It does require the filesystem to be read-only or
> > unmounted though. But corrupted filesystem datastructures certainly could
> > cause spurious ENOSPC.
>
> I executed this on the same system as the above prints came from. It
> did not report any issues.
>
> > Are you using pg_upgrade -j?
>
> Yes, we use -j `nproc`
>
> > I assume the file that actually errors out changes over time? It's always
> > fallocate() that fails?
>
> Yes, correct, on both counts.
>
> > Can you tell us anything about the workload / data? Lots of tiny tables, lots
> > of big tables, write heavy, ...?
>
> It is a write heavy application which stores mostly time series data.
>
> The time series data is partitioned by time; the application writes
> constantly into the 'current' partition, and data is expired by
> removing the oldest partition. Most of the data is written once and
> not updated.
>
> There are quite a lot of these partitioned tables (in the 1000's or
> 10000's) depending on how the application is configured. Individual
> partitions range in size from a few MB to 10s of GB.
>
> Cheers
> Mike.



Re: FileFallocate misbehaving on XFS

From
Andres Freund
Date:
Hi,

On 2024-12-10 17:28:21 +1100, Michael Harris wrote:
> On Tue, 10 Dec 2024 at 11:31, Andres Freund <andres@anarazel.de> wrote:
> > It'd be useful to get the xfs_info output that Jakub asked for. Perhaps also
> > xfs_spaceman -c 'freesp -s' /mountpoint
> > xfs_spaceman -c 'health' /mountpoint
> > and df.
>
> I gathered this info from one of the systems that is currently on RL9.
> This system is relatively small compared to some of the others that
> have exhibited this issue, but it is the only one I can access right
> now.

I think it's implied, but I just want to be sure: This was one of the affected
systems?


> # uname -a
> Linux 5.14.0-503.14.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 15
> 12:04:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
>
> # xfs_info /dev/mapper/ippvg-ipplv
> meta-data=/dev/mapper/ippvg-ipplv isize=512    agcount=4, agsize=262471424 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=0, sparse=0, rmapbt=0
>          =                       reflink=0    bigtime=0 inobtcount=0 nrext64=0
> data     =                       bsize=4096   blocks=1049885696, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=512639, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0

It might be interesting that finobt=0, sparse=0 and nrext64=0. Those all
affect space allocation to some degree, and more recently created filesystems
will have them set to different values, which could explain why you, but not
that many others, hit this issue.

Any chance to get df output? I'm mainly curious about the number of used
inodes.

Could you show the mount options that end up being used?
   grep /var/opt /proc/mounts

I rather doubt it is, but it'd sure be interesting if inode32 were used.


I assume you have never set XFS options for the PG directory or files within
it?  Could you show
  xfs_io -r -c lsattr -c stat -c statfs /path/to/directory/with/enospc
?


> # for agno in `seq 0 3`; do xfs_spaceman -c "freesp -s -a $agno" /var/opt; done
>    from      to extents  blocks    pct
>       1       1   37502   37502   0.15
>       2       3   62647  148377   0.59
>       4       7   87793  465950   1.85
>       8      15  135529 1527172   6.08
>      16      31  184811 3937459  15.67
>      32      63  165979 7330339  29.16
>      64     127  101674 8705691  34.64
>     128     255   15123 2674030  10.64
>     256     511     973  307655   1.22
> total free extents 792031
> total free blocks 25134175
> average free extent size 31.7338
>    from      to extents  blocks    pct
>       1       1   43895   43895   0.22
>       2       3   59312  141693   0.70
>       4       7   83406  443964   2.20
>       8      15  120804 1362108   6.75
>      16      31  133140 2824317  14.00
>      32      63  118619 5188474  25.71
>      64     127   77960 6751764  33.46
>     128     255   16383 2876626  14.26
>     256     511    1763  546506   2.71
> total free extents 655282
> total free blocks 20179347
> average free extent size 30.7949
>    from      to extents  blocks    pct
>       1       1   72034   72034   0.26
>       2       3   98158  232135   0.83
>       4       7  126228  666187   2.38
>       8      15  169602 1893007   6.77
>      16      31  180286 3818527  13.65
>      32      63  164529 7276833  26.01
>      64     127  109687 9505160  33.97
>     128     255   22113 3921162  14.02
>     256     511    1901  592052   2.12
> total free extents 944538
> total free blocks 27977097
> average free extent size 29.6199
>    from      to extents  blocks    pct
>       1       1   51462   51462   0.21
>       2       3   98993  233204   0.93
>       4       7  131578  697655   2.79
>       8      15  178151 1993062   7.97
>      16      31  175718 3680535  14.72
>      32      63  145310 6372468  25.48
>      64     127   89518 7749021  30.99
>     128     255   18926 3415768  13.66
>     256     511    2640  813586   3.25
> total free extents 892296
> total free blocks 25006761
> average free extent size 28.0252

So there's *some*, but not a lot, of imbalance in AG usage. Of course that's
as of this moment, and as you say below, you expire old partitions on a
regular basis...

My understanding of XFS's space allocation is that by default it continues to
use the same AG for allocations within one directory, until that AG is full.
For a write heavy postgres workload that's of course not optimal, as all
activity will focus on one AG.

I'd try monitoring the per-ag free space over time and see if the ENOSPC
issue is correlated with one AG getting full.  'freesp' is probably too
expensive for that, but it looks like
   xfs_db -r -c agresv /dev/nvme6n1
should work?

Actually that output might be interesting to see, even when you don't hit the
issue.
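
For what it's worth, a crude way to collect that over time would be a loop
along these lines (the device is the one shown earlier in the thread; the
interval and log path are placeholders):

  while true; do
      echo "=== $(date -Is) ==="
      xfs_db -r -c 'agresv' /dev/mapper/ippvg-ipplv
      sleep 60
  done >> /var/log/xfs_agresv.log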


> > Can you tell us anything about the workload / data? Lots of tiny tables, lots
> > of big tables, write heavy, ...?
>
> It is a write heavy application which stores mostly time series data.
>
> The time series data is partitioned by time; the application writes
> constantly into the 'current' partition, and data is expired by
> removing the oldest partition. Most of the data is written once and
> not updated.
>
> There are quite a lot of these partitioned tables (in the 1000's or
> 10000's) depending on how the application is configured. Individual
> partitions range in size from a few MB to 10s of GB.

So there are 1000s of tables that are concurrently being appended to, but only
into one partition each. That does make it plausible that there's a
significant amount of fragmentation. Possibly transient due to the expiration.

How many partitions are there for each of the tables? Mainly wondering because
of the number of inodes being used.

Are all of the active tables within one database? That could be relevant due
to per-directory behaviour of free space allocation.

Greetings,

Andres Freund



Re: FileFallocate misbehaving on XFS

From
Andres Freund
Date:
Hi,

On 2024-12-10 12:36:33 +0100, Jakub Wartak wrote:
>  On Tue, Dec 10, 2024 at 7:34 AM Michael Harris <harmic@gmail.com> wrote:
> 1. Well it doesn't look like XFS AG fragmentation to me (we had a customer
> with a huge number of AGs with small space in them) reporting such errors
> after upgrading to 16, but not for earlier versions (somehow
> posix_fallocate() had to be the culprit).

Given that the workload expires old partitions, I'm not sure we can conclude a
whole lot from the current state :/


> 2.
> 
> > # xfs_info /dev/mapper/ippvg-ipplv
> > meta-data=/dev/mapper/ippvg-ipplv isize=512    agcount=4,
> agsize=262471424 blks
> >         =                       sectsz=512   attr=2, projid32bit=1
> >         =                       crc=1        finobt=0, sparse=0, rmapbt=0
> >         =                       reflink=0    bigtime=0 inobtcount=0
> nrext64=0
> 
> Yay, reflink=0, that's pretty old fs ?!

I think that only started to default to on more recently (2019, plus time to
percolate into RHEL). The more curious cases are finobt=0 (turned on by default
since 2015) and, to a lesser degree, sparse=0 (turned on by default since 2018).


> > ERROR:  could not extend file
> "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with FileFallocate(): No
> space left on device
> 
> 2. This indicates it was allocating 1GB for such a table (".1"), on
> tablespace that was created more than a year ago. Could you get us maybe
> those below commands too? (or from any other directory exhibiting such
> errors)

The date in the directory is the catversion of the server, which is just
determined by the major version being used, not the creation time of the
tablespace.

andres@awork3:~/src/postgresql$ git grep CATALOG_VERSION_NO upstream/REL_16_STABLE src/include/catalog/catversion.h
upstream/REL_16_STABLE:src/include/catalog/catversion.h:#define CATALOG_VERSION_NO   202307071

Greetings,

Andres Freund



Re: FileFallocate misbehaving on XFS

From
Andres Freund
Date:
On 2024-12-10 11:34:15 -0500, Andres Freund wrote:
> On 2024-12-10 12:36:33 +0100, Jakub Wartak wrote:
> >  On Tue, Dec 10, 2024 at 7:34 AM Michael Harris <harmic@gmail.com> wrote:
> > 2.
> > 
> > > # xfs_info /dev/mapper/ippvg-ipplv
> > > meta-data=/dev/mapper/ippvg-ipplv isize=512    agcount=4,
> > agsize=262471424 blks
> > >         =                       sectsz=512   attr=2, projid32bit=1
> > >         =                       crc=1        finobt=0, sparse=0, rmapbt=0
> > >         =                       reflink=0    bigtime=0 inobtcount=0
> > nrext64=0
> > 
> > Yay, reflink=0, that's pretty old fs ?!
> 
> I think that only started to default to on more recently (2019, plus time to
> percolate into RHEL). The more curious cases are finobt=0 (turned on by default
> since 2015) and, to a lesser degree, sparse=0 (turned on by default since 2018).

One thing that might be interesting is to compare xfs_info of affected and
non-affected servers...



Re: FileFallocate misbehaving on XFS

From
Robert Haas
Date:
On Mon, Dec 9, 2024 at 7:31 PM Andres Freund <andres@anarazel.de> wrote:
> Pretty unexcited about all of these - XFS is fairly widely used for PG, but
> this problem doesn't seem very common. It seems to me that we're missing
> something that causes this to only happen in a small subset of cases.

I wonder if this is actually pretty common on XFS. I mean, we've
already hit this with at least one EDB customer, and Michael's report
is, as far as I know, independent of that; and he points to a
pgsql-general thread which, AFAIK, is also independent. We don't get
three (or more?) independent reports of that many bugs, so I think
it's not crazy to think that the problem is actually pretty common.
It's probably workload dependent somehow, but for all we know today it
seems like the workload could be as simple as "do enough file
extension and you'll get into trouble eventually" or maybe "do enough
file extension with some level of concurrency and you'll get into
trouble eventually".

> I think the source of this needs to be debugged further before we try to apply
> workarounds in postgres.

Why? It seems to me that this has to be a filesystem bug, and we
should almost certainly adopt one of these ideas from Michael Harris:

 - Providing a way to configure PG not to use posix_fallocate at runtime

 - In the case of posix_fallocate failing with ENOSPC, fall back to
FileZero (worst case that will fail as well, in which case we will
know that we really are out of space)

Maybe we need some more research to figure out which of those two
things we should do -- I suspect the second one is better but if that
fails then we might need to do the first one -- but I doubt that we
can wait for XFS to fix whatever the issue is here. Our usage of
posix_fallocate doesn't look to be anything more than plain vanilla,
so as between these competing hypotheses:

(1) posix_fallocate is and always has been buggy and you can't rely on it, or
(2) we use posix_fallocate in a way that nobody else has and have hit
an incredibly obscure bug as a result, which will be swiftly patched

...the first seems much more likely.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: FileFallocate misbehaving on XFS

From
Andres Freund
Date:
Hi,

On 2024-12-10 12:36:40 -0500, Robert Haas wrote:
> On Mon, Dec 9, 2024 at 7:31 PM Andres Freund <andres@anarazel.de> wrote:
> > Pretty unexcited about all of these - XFS is fairly widely used for PG, but
> > this problem doesn't seem very common. It seems to me that we're missing
> > something that causes this to only happen in a small subset of cases.
>
> I wonder if this is actually pretty common on XFS. I mean, we've
> already hit this with at least one EDB customer, and Michael's report
> is, as far as I know, independent of that; and he points to a
> pgsql-general thread which, AFAIK, is also independent. We don't get
> three (or more?) independent reports of that many bugs, so I think
> it's not crazy to think that the problem is actually pretty common.

Maybe. I think we would have gotten a lot more reports if it were common. I
know of quite a few very busy installs using xfs.

I think there must be some as-of-yet-unknown condition gating it. E.g. that
the filesystem has been created a while ago and has some now-on-by-default
options disabled.


> > I think the source of this needs to be debugged further before we try to apply
> > workarounds in postgres.
>
> Why? It seems to me that this has to be a filesystem bug,

Adding workarounds for half-understood problems tends to lead to code that we
can't evolve in the future, as we a) don't understand b) can't reproduce the
problem.

Workarounds could also mask some bigger / worse issues.  We e.g. have blamed
ext4 for a bunch of bugs that then turned out to be ours in the past. But we
didn't look for a long time, because it was convenient to just blame ext4.


> and we should almost certainly adopt one of these ideas from Michael Harris:
>
>  - Providing a way to configure PG not to use posix_fallocate at runtime

I'm not strongly opposed to that. That's testable without access to an
affected system.  I wouldn't want to automatically do that when detecting an
affected system though, that'll make behaviour way less predictable.


>  - In the case of posix_fallocate failing with ENOSPC, fall back to
> FileZero (worst case that will fail as well, in which case we will
> know that we really are out of space)

I doubt that that's a good idea. What if fallocate failing is an indicator of
a problem? What if you turn on AIO + DIO and suddenly get a much more
fragmented file?

Greetings,

Andres Freund



Re: FileFallocate misbehaving on XFS

From
Michael Harris
Date:
Hi Andres

On Wed, 11 Dec 2024 at 03:09, Andres Freund <andres@anarazel.de> wrote:
> I think it's implied, but I just want to be sure: This was one of the affected
> systems?

Yes, correct.

> Any chance to get df output? I'm mainly curious about the number of used
> inodes.

Sorry, I could swear I had included that already! Here it is:

# df /var/opt
Filesystem               1K-blocks       Used Available Use% Mounted on
/dev/mapper/ippvg-ipplv 4197492228 3803866716 393625512  91% /var/opt

# df -i /var/opt
Filesystem                 Inodes   IUsed     IFree IUse% Mounted on
/dev/mapper/ippvg-ipplv 419954240 1568137 418386103    1% /var/opt

> Could you show the mount options that end up being used?
>    grep /var/opt /proc/mounts

/dev/mapper/ippvg-ipplv /var/opt xfs
rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0

These seem to be the defaults.

> I assume you have never set XFS options for the PG directory or files within
> it?

Correct.

>  Could you show
>   xfs_io -r -c lsattr -c stat -c statfs /path/to/directory/with/enospc

-p--------------X pg_tblspc/16402/PG_16_202307071/49163/1132925906.4
fd.path = "pg_tblspc/16402/PG_16_202307071/49163/1132925906.4"
fd.flags = non-sync,non-direct,read-only
stat.ino = 4320612794
stat.type = regular file
stat.size = 201211904
stat.blocks = 393000
fsxattr.xflags = 0x80000002 [-p--------------X]
fsxattr.projid = 0
fsxattr.extsize = 0
fsxattr.cowextsize = 0
fsxattr.nextents = 165
fsxattr.naextents = 0
dioattr.mem = 0x200
dioattr.miniosz = 512
dioattr.maxiosz = 2147483136
fd.path = "pg_tblspc/16402/PG_16_202307071/49163/1132925906.4"
statfs.f_bsize = 4096
statfs.f_blocks = 1049373057
statfs.f_bavail = 98406378
statfs.f_files = 419954240
statfs.f_ffree = 418386103
statfs.f_flags = 0x1020
geom.bsize = 4096
geom.agcount = 4
geom.agblocks = 262471424
geom.datablocks = 1049885696
geom.rtblocks = 0
geom.rtextents = 0
geom.rtextsize = 1
geom.sunit = 0
geom.swidth = 0
counts.freedata = 98406378
counts.freertx = 0
counts.freeino = 864183
counts.allocino = 2432320

> I'd try monitoring the per-ag free space over time and see if the ENOSPC
> issue is correlated with one AG getting full.  'freesp' is probably too
> expensive for that, but it looks like
>    xfs_db -r -c agresv /dev/nvme6n1
> should work?
>
> Actually that output might be interesting to see, even when you don't hit the
> issue.

I will see if I can set that up.

> How many partitions are there for each of the tables? Mainly wondering because
> of the number of inodes being used.

It is configurable and varies from site to site. It could range from 7
up to maybe 60.

> Are all of the active tables within one database? That could be relevant due
> to per-directory behaviour of free space allocation.

Each pg instance may have one or more application databases. Typically
data is being written into all of them (although sometimes a database
will be archived, with no new data going into it).

You might be onto something though. The system I got the above prints
from is only experiencing this issue in one directory - that might not
mean very much though, it only has 2 databases and one of them looks
like it is not receiving imports.
But another system I can access has multiple databases with ongoing
imports, yet all the errors bar one relate to one directory.
I will collect some data from that system and post it shortly.

Cheers
Mike



Re: FileFallocate misbehaving on XFS

From
Michael Harris
Date:
Hi Jakub

On Tue, 10 Dec 2024 at 22:36, Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> Yay, reflink=0, that's pretty old fs ?!

This particular filesystem was created on Centos 7, and retained when
the system was upgraded to RL9. So yes probably pretty old!

> Could you get us maybe those below commands too? (or from any other directory exhibiting such errors)
>
> stat pg_tblspc/16401/PG_16_202307071/17643/
> ls -1 pg_tblspc/16401/PG_16_202307071/17643/ | wc -l
> time ls -1 pg_tblspc/16401/PG_16_202307071/17643/ | wc -l # to assess timing of getdents() call as that may say
> something about that directory indirectly

# stat pg_tblspc/16402/PG_16_202307071/49163/
  File: pg_tblspc/16402/PG_16_202307071/49163/
  Size: 5177344         Blocks: 14880      IO Block: 4096   directory
Device: fd02h/64770d    Inode: 4299946593  Links: 2
Access: (0700/drwx------)  Uid: (   26/postgres)   Gid: (   26/postgres)
Access: 2024-12-11 09:39:42.467802419 +0900
Modify: 2024-12-11 09:51:19.813948673 +0900
Change: 2024-12-11 09:51:19.813948673 +0900
 Birth: 2024-11-25 17:37:11.812374672 +0900

# time ls -1 pg_tblspc/16402/PG_16_202307071/49163/ | wc -l
179000

real    0m0.474s
user    0m0.439s
sys     0m0.038s

> 3. Maybe somehow there is a bigger interaction between posix_fallocate() and delayed XFS's dynamic speculative
> preallocation from many processes all writing into different partitions ? Maybe try "allocsize=1m" mount option for
> that /fs and see if that helps.  I'm going to speculate about XFS speculative :) pre allocations, but if we have
> fdcache and are *not* closing fds, how XFS might know to abort its own speculation about streaming write ? (multiply
> that up to potentially the number of opened fds to get an avalanche of "preallocations").

I will try to organize that. They are production systems so it might
take some time.
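
For reference, one way to try that (a sketch only; it uses a full
unmount/mount rather than a remount, so PostgreSQL has to be stopped first,
and the device path is the one shown earlier in the thread):

  umount /var/opt
  mount -o allocsize=1m /dev/mapper/ippvg-ipplv /var/opt
  grep /var/opt /proc/mounts   # allocsize should now show up in the options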

> 4. You can also try compiling with patch from Alvaro from [2] "0001-Add-some-debugging-around-mdzeroextend.patch",
> so we might end up having more clarity in offsets involved. If not then you could use 'strace -e fallocate -p <pid>'
> to get the exact syscall.

I'll take a look at Alvaro's patch. strace sounds good, but how to
arrange to start it on the correct PG backends? There will be a
large-ish number of PG backends going at a time, only some of which
are performing imports, and they will be coming and going every so
often as the ETL application scales up and down with the load.
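
(One way to avoid picking individual backends, as a sketch only, and noting
that attaching strace to a busy production postmaster adds noticeable
overhead: attach to the postmaster and follow all of its children, where
$PGDATA stands for the data directory.)

  # the first line of postmaster.pid is the postmaster PID
  PM_PID=$(head -n1 "$PGDATA/postmaster.pid")
  # -f/-ff follow child processes and write one trace file per PID;
  # only fallocate calls (and their errnos) are recorded
  strace -f -ff -ttt -e trace=fallocate -o /tmp/pg_fallocate -p "$PM_PID"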

> 5. Another idea could be catching the kernel side stacktrace of fallocate() when it is hitting ENOSPC. E.g. with XFS
> fs and attached bpftrace eBPF tracer I could get the source of the problem in my artificial reproducer, e.g

OK, I will look into that also.

Cheers
Mike



Re: FileFallocate misbehaving on XFS

From
Jakub Wartak
Date:


On Wed, Dec 11, 2024 at 4:00 AM Michael Harris <harmic@gmail.com> wrote:
Hi Jakub

On Tue, 10 Dec 2024 at 22:36, Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
 [..]

> 3. Maybe somehow there is a bigger interaction between posix_fallocate() and delayed XFS's dynamic speculative preallocation from many processes all writing into different partitions ? Maybe try "allocsize=1m" mount option for that /fs and see if that helps.  I'm going to speculate about XFS speculative :) pre allocations, but if we have fdcache and are *not* closing fds, how XFS might know to abort its own speculation about streaming write ? (multiply that up to potentially the number of opened fds to get an avalanche of "preallocations").

I will try to organize that. They are production systems so it might
take some time.

Cool.

> 4. You can also try compiling with patch from Alvaro from [2] "0001-Add-some-debugging-around-mdzeroextend.patch", so we might end up having more clarity in offsets involved. If not then you could use 'strace -e fallocate -p <pid>' to get the exact syscall.

I'll take a look at Alvaro's patch. strace sounds good, but how to
arrange to start it on the correct PG backends? There will be a
large-ish number of PG backends going at a time, only some of which
are performing imports, and they will be coming and going every so
often as the ETL application scales up and down with the load.

Yes, it sounds like mission impossible. Is there any chance you can get it reproduced down to one or a small number of postgres backends doing the writes?
 

> 5. Another idea could be catching the kernel side stacktrace of fallocate() when it is hitting ENOSPC. E.g. with XFS fs and attached bpftrace eBPF tracer I could get the source of the problem in my artificial reproducer, e.g

OK, I will look into that also.


Hopefully that reveals some more. Somehow UNIX error reporting lumps one big pile of errors into the single ENOSPC category, and that's not helpful at all (inode/extent/block allocation problems are all squeezed into one error).

Anyway, in case it helps others, here are my notes so far on this thread, including that useful file from the subthread; hopefully I have not misinterpreted anything:

- works in <PG16, but fails with >= PG16 due to posix_fallocate() being used rather than multiple separate (but adjacent) iovectors to pg_writev. It is only used when mdzeroextend() is called with numblocks > 8
- 179k or 414k files in a single directory (0.3s - 0.5s just to list those)
- OS/FS upgraded from an earlier release
- one AG with extremely low extent sizes compared to the other AGs (I bet the 2->3 bucket at 22.73% below means small 8192-byte pg files in $PGDATA, but there are no large extents in that AG)
   from      to extents  blocks    pct
      1       1    4949    4949   0.65
      2       3   86113  173452  22.73
      4       7   19399   94558  12.39
      8      15   23233  248602  32.58
     16      31   12425  241421  31.64
   total free extents 146119
   total free blocks 762982
   average free extent size 5.22165 (!)
- note that the max extent size above (31) is very low when compared to the other AGs, which have 1024-8192. Therefore it looks like there are no contiguous runs of blocks for request sizes above 31*4096 = 126976 bytes within that AG (??).
- we have logic of `extend_by_pages += extend_by_pages * waitcount;` capped at 64 pg blocks maximum (and that's higher than the above)
- but the failures were also observed using pg_upgrade --link -j/pg_restore -j (also concurrent posix_fallocate() to many independent files sharing the same AG, but that's 1 backend:1 file, so no contention for waitcount in RelationAddBlocks())
- so maybe it's lots of backends doing independent concurrent posix_fallocate() that end up somehow coalesced? Or, hypothetically, say 16-32 fallocate() calls hit the same AG initially; maybe it's some form of concurrency semi race-condition inside XFS where one of the fallocate calls fails to find space in that one AG, but according to [1] it should fall back to other AGs.
- and there's also the additional XFS dynamic speculative preallocation that might cause space pressure during our normal writes...

Another workaround idea/test: create a tablespace on the same XFS fs (but in a somewhat different directory if possible) and see if it still fails.
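
A sketch of that test (the directory and tablespace names are placeholders;
the directory must exist and be owned by the postgres OS user):

  mkdir /var/opt/pg_ts_test
  chown postgres:postgres /var/opt/pg_ts_test
  psql -U postgres -c "CREATE TABLESPACE ts_test LOCATION '/var/opt/pg_ts_test'"
  # then move one of the write-heavy partitions onto it, e.g.:
  #   ALTER TABLE some_partition SET TABLESPACE ts_test;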

-J.

Re: FileFallocate misbehaving on XFS

From
Andres Freund
Date:
Hi,

On 2024-12-10 16:33:06 -0500, Andres Freund wrote:
> Maybe. I think we would have gotten a lot more reports if it were common. I
> know of quite a few very busy installs using xfs.
>
> I think there must be some as-of-yet-unknown condition gating it. E.g. that
> the filesystem has been created a while ago and has some now-on-by-default
> options disabled.
>
>
> > > I think the source of this needs to be debugged further before we try to apply
> > > workarounds in postgres.
> >
> > Why? It seems to me that this has to be a filesystem bug,
>
> Adding workarounds for half-understood problems tends to lead to code that we
> can't evolve in the future, as we a) don't understand b) can't reproduce the
> problem.
>
> Workarounds could also mask some bigger / worse issues.  We e.g. have blamed
> ext4 for a bunch of bugs that then turned out to be ours in the past. But we
> didn't look for a long time, because it was convenient to just blame ext4.

>
> > and we should almost certainly adopt one of these ideas from Michael Harris:
> >
> >  - Providing a way to configure PG not to use posix_fallocate at runtime
>
> I'm not strongly opposed to that. That's testable without access to an
> affected system.  I wouldn't want to automatically do that when detecting an
> affected system though, that'll make behaviour way less predictable.
>
>
> >  - In the case of posix_fallocate failing with ENOSPC, fall back to
> > FileZero (worst case that will fail as well, in which case we will
> > know that we really are out of space)
>
> I doubt that that's a good idea. What if fallocate failing is an indicator of
> a problem? What if you turn on AIO + DIO and suddenly get a much more
> fragmented file?

One thing that I think we should definitely do is to include more detail in
the error message. mdzeroextend()'s error messages don't include how many
blocks the relation was to be extended by. Neither mdextend() nor
mdzeroextend() include the offset at which the extension failed.

I'm not entirely sure about the phrasing, though; we have a somewhat confusing
mix of blocks and bytes in messages.

Perhaps some of this information should be in an errdetail, but I admit I'm a bit
hesitant about doing so for crucial details. I find that often only the
primary error message is available when debugging problems encountered by
others.

Maybe something like
  /* translator: second %s is a function name like FileFallocate() */
  could not extend file \"%s\" by %u blocks, from %llu to %llu bytes, using %s: %m
or
  could not extend file \"%s\" using %s by %u blocks, from its current size of %u blocks: %m
or
  could not extend file \"%s\" using %s by %u blocks/%llu bytes from its current size of %llu bytes: %m

If we want to use errdetail() judiciously, we could go for something like
  errmsg("could not extend file \"%s\" by %u blocks, using %s: %m", ...
  errdetail("Failed to extend file from %u blocks/%llu bytes to %u blocks / %llu bytes.", ...)
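
In code form, that last variant might look roughly like this (sketch only; curblocks/curbytes are placeholder variables, not from an actual patch):

    ereport(ERROR,
            (errcode(ERRCODE_DISK_FULL),
             errmsg("could not extend file \"%s\" by %u blocks, using %s: %m",
                    FilePathName(v->mdfd_vfd), numblocks, "FileFallocate()"),
             errdetail("Failed to extend file from %u blocks/%llu bytes to %u blocks/%llu bytes.",
                       curblocks, (unsigned long long) curbytes,
                       curblocks + numblocks,
                       (unsigned long long) (curbytes + (uint64) numblocks * BLCKSZ)),
             errhint("Check free disk space.")));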



I think it might also be good - this is a slightly more complicated project -
to report the amount of free space the filesystem reports when we hit
ENOSPC. I have dealt with cases of the FS transiently filling up way too many
times, and it's always a pain to figure that out.
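
A minimal sketch of what that could look like in the ENOSPC path (fstatfs() is Linux-specific, FileGetRawDesc() is assumed as the way to get at the kernel fd, and the wording is illustrative):

    #include <sys/statfs.h>      /* Linux-specific; portable code would need fstatvfs() */

    if (ret != 0 && errno == ENOSPC)
    {
        int         save_errno = errno;
        struct statfs fst;

        if (fstatfs(FileGetRawDesc(v->mdfd_vfd), &fst) == 0)
            ereport(LOG,
                    (errmsg("filesystem containing \"%s\" reports %llu of %llu blocks free at ENOSPC",
                            FilePathName(v->mdfd_vfd),
                            (unsigned long long) fst.f_bavail,
                            (unsigned long long) fst.f_blocks)));
        errno = save_errno;
    }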

Greetings,

Andres Freund



Re: FileFallocate misbehaving on XFS

From
Andres Freund
Date:
Hi,

FWIW, I tried fairly hard to reproduce this.

I ran an extended cycle of 80 backends copying into relations and occasionally
truncating them (to simulate the partitions being dropped and new ones
created). For this I kept a 4TB filesystem very close to fully filled (peaking
at 99.998 % full).

I did not see any ENOSPC errors unless the filesystem really was full at that
time. To check that, I made mdzeroextend() do a statfs() when encountering
ENOSPC, printed statfs.f_blocks and made that case PANIC.


What I do see is that after - intentionally - hitting an out-of-disk-space
error, the available disk space would occasionally increase a small amount
after a few seconds. Regardless of whether using the fallocate and
non-fallocate path.

From what I can tell this small increase in free space has a few reasons:

- Checkpointer might not have gotten around to unlinking files, keeping the
  inode alive.

- Occasionally bgwriter or a backend would have relation segments that were
  unlinked but still open, so the inode (though not the actual file space,
  because the segment is truncated to prevent exactly that) could not yet be
  removed from the filesystem

- It looks like xfs does some small amount of work to reclaim space in the
  background. Which makes sense, otherwise each unlink would have to be a
  flush to disk.

But that's not anywhere near enough space to explain what you're seeing. The
most I've seen was 6MB, when ramping up the truncation frequency a lot.

Of course this was on a newer kernel, not on RHEL / RL 8/9.


Just to make sure - you're absolutely certain that you actually have space at
the time of the errors?  E.g. a checkpoint soon after an ENOSPC could free up
a lot of WAL, due to removing now-unneeded WAL files. That can be 100s of
gigabytes.

If I were to provide you with a patch that showed the amount of free disk
space at the time of an error, the size of the relation etc, could you
reproduce the issue with it applied? Or is that unrealistic?




On 2024-12-11 13:05:21 +0100, Jakub Wartak wrote:
> - one AG with extreme low extent sizes compared to the others AGs (I bet
> that 2->3 22.73% below means small 8192b pg files in $PGDATA, but there are
> no large extents in that AG)
>    from      to extents  blocks    pct
>       1       1    4949    4949   0.65
>       2       3   86113  173452  22.73
>       4       7   19399   94558  12.39
>       8      15   23233  248602  32.58
>      16      31   12425  241421  31.64
>    total free extents 146119
>    total free blocks 762982
>    average free extent size 5.22165 (!)

Note that this does not mean that all extents in the AG are that small, just
that the *free* extents are of that size.

I think this might primarily be because this AG has the smallest amount of
free blocks (2.9GB). However, the fact that it *does* have less could be
interesting. It might be the AG associated with the directory for the busiest
database or such.

The next least-space AG is:

   from      to extents  blocks    pct
      1       1    1021    1021   0.10
      2       3   48748   98255  10.06
      4       7    9840   47038   4.81
      8      15   13648  146779  15.02
     16      31   15818  323022  33.06
     32      63     584   27932   2.86
     64     127     147   14286   1.46
    128     255     253   49047   5.02
    256     511     229   87173   8.92
    512    1023     139  102456  10.49
   1024    2047      51   72506   7.42
   2048    4095       3    7422   0.76
total free extents 90481
total free blocks 976937

It seems plausible it would look similar if more of the free blocks were
used.


> - we have logic of `extend_by_pages += extend_by_pages * waitcount;` capped
> up to 64 pg blocks maximum (and that's higher than the above)
> - but the fails where observed also using pg_upgrade --link -j/pg_restore
> -j (also concurrent posix_fallocate() to many independent files sharing the
> same AG, but that's 1 backend:1 file so no contention for waitcount in
> RelationAddBlocks())

We also extend by more than one page, even without concurrency, if
bulk-insertion is used, and I think we do use that for
e.g. pg_attribute, which is actually the table where pg_restore encountered
the issue:

pg_restore: error: could not execute query: ERROR:  could not extend
file "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with
FileFallocate(): No space left on device

1249 is the initial relfilenode for pg_attribute.

There could also be some parallelism leading to bulk extension, due to the
parallel restore. I don't remember which commands pg_restore actually executes
in parallel.

Greetings,

Andres Freund



Re: FileFallocate misbehaving on XFS

From
Michael Harris
Date:
Hi Andres

On Thu, 12 Dec 2024 at 10:50, Andres Freund <andres@anarazel.de> wrote:
> Just to make sure - you're absolutely certain that you actually have space at
> the time of the errors?

As sure as I can be. The RHEL8 system that I took prints from
yesterday has > 1.5TB free. I can't see it varying by that much.

It does look as though the system needs to be quite full to provoke
this problem. The systems I have looked at so far have >90% full
filesystems.

Another interesting snippet: the application has a number of ETL
workers going at once. The actual number varies depending on a number
of factors but might be somewhere from 10 - 150. Each worker will have
a single postgres backend that they are feeding data to.

At the time of the error, it is not the case that all ETL workers
strike it at once - it looks like a lot of the time only a single
worker is affected, or at most a handful of workers. I can't see for
sure what the other workers were doing at the time, but I would expect
they were all importing data as well.

> If I were to provide you with a patch that showed the amount of free disk
> space at the time of an error, the size of the relation etc, could you
> reproduce the issue with it applied? Or is that unrealistic?

I have not been able to reproduce it on demand, and so far it has only
happened in production systems.

As long as the patch doesn't degrade normal performance it should be
possible to deploy it to one of the systems that is regularly
reporting the error, although it might take a while to get approval to
do that.

Cheers
Mike



Re: FileFallocate misbehaving on XFS

From
Michael Harris
Date:
Hi Andres

On Fri, 13 Dec 2024 at 08:38, Andres Freund <andres@anarazel.de> wrote:
> > Another interesting snippet: the application has a number of ETL
> > workers going at once. The actual number varies depending on a number
> > of factors but might be somewhere from 10 - 150. Each worker will have
> > a single postgres backend that they are feeding data to.
>
> Are they all inserting into distinct tables/partitions or into shared tables?

The set of tables they are writing into is the same, but we do take
some effort to randomize the order of the tables that each worker
is writing into so as to reduce contention. Even so it is quite likely
that multiple processes will be writing into a table at a time.
Also worth noting that I have only seen this error triggered by COPY
statements (other than the upgrade case). There are some other cases
in our code that use INSERT but so far I have not seen that end in an
out of space error.

> When you say that they're not "all striking it at once", do you mean that some
> of them aren't interacting with the database at the time, or that they're not
> erroring out?

Sorry, I meant erroring out.

Thanks for the patch!

Cheers
Mike



Re: FileFallocate misbehaving on XFS

From
Alvaro Herrera
Date:
On 2024-Dec-11, Andres Freund wrote:

> One thing that I think we should definitely do is to include more detail in
> the error message. mdzeroextend()'s error messages don't include how many
> blocks the relation was to be extended by. Neither mdextend() nor
> mdzeroextend() include the offset at which the extension failed.

I proposed a patch at
https://postgr.es/m/202409110955.6njbwzm4ocus@alvherre.pgsql

FileFallocate failure:
        errmsg("could not allocate additional %lld bytes from position %lld in file \"%s\": %m",
               (long long) addbytes, (long long) seekpos,
               FilePathName(v->mdfd_vfd)),

FileZero failure:
        errmsg("could not zero additional %lld bytes from position %lld file \"%s\": %m",
               (long long) addbytes, (long long) seekpos,
               FilePathName(v->mdfd_vfd)),

I'm not sure that we need to talk about blocks, given that the
underlying syscalls don't work in blocks anyway.  IMO we should just
report bytes.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"No hay ausente sin culpa ni presente sin disculpa" (Prov. francés)



Re: FileFallocate misbehaving on XFS

From
Thomas Munro
Date:
On Sat, Dec 14, 2024 at 9:29 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2024-Dec-11, Andres Freund wrote:
> > One thing that I think we should definitely do is to include more detail in
> > the error message. mdzeroextend()'s error messages don't include how many
> > blocks the relation was to be extended by. Neither mdextend() nor
> > mdzeroextend() include the offset at which the extension failed.
>
> I proposed a patch at
> https://postgr.es/m/202409110955.6njbwzm4ocus@alvherre.pgsql

If adding more logging, I wonder why FileAccess()'s "re-open failed"
case is not considered newsworthy.  I've suspected it as a candidate
source of an unexplained and possibly misattributed error in other
cases.  I'm not saying it's at all likely in this case, but it seems
like just the sort of rare unexpected failure that we'd want to know
more about when trying to solve mysteries.



Re: FileFallocate misbehaving on XFS

From
Robert Haas
Date:
On Sat, Dec 14, 2024 at 4:20 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, Dec 14, 2024 at 9:29 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > On 2024-Dec-11, Andres Freund wrote:
> > > One thing that I think we should definitely do is to include more detail in
> > > the error message. mdzeroextend()'s error messages don't include how many
> > > blocks the relation was to be extended by. Neither mdextend() nor
> > > mdzeroextend() include the offset at which the extension failed.
> >
> > I proposed a patch at
> > https://postgr.es/m/202409110955.6njbwzm4ocus@alvherre.pgsql
>
> If adding more logging, I wonder why FileAccess()'s "re-open failed"
> case is not considered newsworthy.  I've suspected it as a candidate
> source of an unexplained and possibly misattributed error in other
> cases.  I'm not saying it's at all likely in this case, but it seems
> like just the sort of rare unexpected failure that we'd want to know
> more about when trying to solve mysteries.

Wow. That's truly abominable. It doesn't seem likely to explain this
case, because I don't think trying to reopen an existing file could
result in LruInsert() returning ENOSPC. But this code desperately
needs refactoring to properly report the open() failure as such,
instead of conflating it with a failure of whatever syscall we were
contemplating before we realized we needed to open().

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: FileFallocate misbehaving on XFS

From
Jakub Wartak
Date:

On Thu, Dec 12, 2024 at 12:50 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

FWIW, I tried fairly hard to reproduce this.

Same, but without PG and also without much success. I've also tried to push the AGs (with just one or two AGs created via mkfs) to contain only small extents (by creating hundreds of thousands of 8kb files), then deleting every Nth file, and then trying a couple of bigger fallocates/writes to see if that would blow up on the original CentOS 7.9 / 3.10.x kernel, but no - it did not blow up. It only failed when df -h was exactly at 100% in multiple scenarios like that (and yes, it sometimes added a little space out of the blue too). So my take is that it is something related to state (having an fd open) and concurrency.
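
For anyone wanting to poke at this outside production, a standalone sketch along those lines (paths and sizes are illustrative; point it at a nearly-full test XFS mount, never at production):

    /* Try increasingly large posix_fallocate() calls on one file and compare
     * any ENOSPC against what statvfs() claims is still free. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/statvfs.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "/mnt/xfstest/bigfile";
        int         fd = open(path, O_RDWR | O_CREAT, 0644);
        struct statvfs sv;

        if (fd < 0)
        {
            perror("open");
            return 1;
        }

        for (off_t len = 128 * 1024; len <= 64 * 1024 * 1024; len *= 2)
        {
            int         rc = posix_fallocate(fd, 0, len);

            statvfs(path, &sv);
            printf("fallocate(0, %lld): %s (f_bavail=%llu blocks)\n",
                   (long long) len, rc ? strerror(rc) : "ok",
                   (unsigned long long) sv.f_bavail);
            if (rc == ENOSPC)
                break;
        }
        close(fd);
        return 0;
    }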

An interesting thing I've observed is that the per-directory AG affinity for big directories (think $PGDATA) is lost once the AG is full, and extents are then allocated from different AGs (one can use xfs_bmap -vv to see the allocated AG affinity for a directory vs. the files in it).

An extended cycle of 80 backends copying into relations and occasionally
truncating them (to simulate the partitions being dropped and new ones
created). For this I ran a 4TB filesystem very close to fully filled (peaking
at 99.998 % full).

The only question I could think of is: how many files were involved there? Maybe it is some kind of race between other (or the same) backends frequently churning their fd caches with open()/close() [defeating speculative preallocation], so that XFS ends up fragmented and only then does posix_fallocate() have issues with larger allocations (>> 8kB)? My take is that if we send N write iovectors this seems to be handled fine, but when we throw one big fallocate it is not; so maybe posix_fallocate() was in the process of finding space while some other activity happened to that inode, like a close(), but then that doesn't seem to match the pg_upgrade scenario.

Well IMHO we are stuck till Michael provides some more data (patch outcome, bpf and maybe other hints and tests).

-J.

Re: FileFallocate misbehaving on XFS

From
Andres Freund
Date:
Hi,

On 2024-12-14 09:29:12 +0100, Alvaro Herrera wrote:
> On 2024-Dec-11, Andres Freund wrote:
> 
> > One thing that I think we should definitely do is to include more detail in
> > the error message. mdzeroextend()'s error messages don't include how many
> > blocks the relation was to be extended by. Neither mdextend() nor
> > mdzeroextend() include the offset at which the extension failed.
> 
> I proposed a patch at
> https://postgr.es/m/202409110955.6njbwzm4ocus@alvherre.pgsql
> 
> FileFallocate failure:
>         errmsg("could not allocate additional %lld bytes from position %lld in file \"%s\": %m",
>                (long long) addbytes, (long long) seekpos,
>                FilePathName(v->mdfd_vfd)),
> 
> FileZero failure:
>         errmsg("could not zero additional %lld bytes from position %lld file \"%s\": %m",
>                (long long) addbytes, (long long) seekpos,
>                FilePathName(v->mdfd_vfd)),

Personally I don't like the obfuscation of "allocate" and "zero" vs just
naming the functions. But I guess that's just a taste thing.


> I'm not sure that we need to talk about blocks, given that the
> underlying syscalls don't work in blocks anyway.  IMO we should just
> report bytes.

When looking for problems it's considerably more work with bytes, because - at
least for me - the large numbers are hard to compare quickly, and knowing how
aggressively we extended also requires translating to blocks.

Greetings,

Andres Freund



Re: FileFallocate misbehaving on XFS

From
Robert Haas
Date:
On Mon, Dec 16, 2024 at 9:12 AM Andres Freund <andres@anarazel.de> wrote:
> Personally I don't like the obfuscation of "allocate" and "zero" vs just
> naming the function names. But I guess that's just taste thing.
>
> When looking for problems it's considerably more work with bytes, because - at
> least for me - the large number is hard to compare quickly and to know how
> aggressively we extended also requires to translate to blocks.

FWIW, I think that what we report in the error should hew as closely
to the actual system call as possible. Hence, I agree with your first
complaint and would prefer to simply see the system calls named, but I
disagree with your second complaint and would prefer to see the byte
count.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: FileFallocate misbehaving on XFS

From
Andres Freund
Date:
Hi,

On 2024-12-16 14:45:37 +0100, Jakub Wartak wrote:
> On Thu, Dec 12, 2024 at 12:50 AM Andres Freund <andres@anarazel.de> wrote:
> > An extended cycle of 80 backends copying into relations and occasionally
> > truncating them (to simulate the partitions being dropped and new ones
> > created). For this I ran a 4TB filesystem very close to fully filled
> > (peaking at 99.998 % full).
> >
> 
> I could only think of the question: how many files were involved there ?

I varied the number heavily. From dozens to 10s of thousands. No meaningful
difference.


> Well IMHO we are stuck till Michael provides some more data (patch outcome,
> bpf and maybe other hints and tests).

Yea.

Greetings,

Andres Freund



Re: FileFallocate misbehaving on XFS

From
Alvaro Herrera
Date:
On 2024-Dec-16, Robert Haas wrote:

> On Mon, Dec 16, 2024 at 9:12 AM Andres Freund <andres@anarazel.de> wrote:
> > Personally I don't like the obfuscation of "allocate" and "zero" vs just
> > naming the function names. But I guess that's just taste thing.
> >
> > When looking for problems it's considerably more work with bytes, because - at
> > least for me - the large number is hard to compare quickly and to know how
> > aggressively we extended also requires to translate to blocks.
> 
> FWIW, I think that what we report in the error should hew as closely
> to the actual system call as possible. Hence, I agree with your first
> complaint and would prefer to simply see the system calls named, but I
> disagree with your second complaint and would prefer to see the byte
> count.

Maybe we can add errdetail("The system call was FileFallocate( ... %u ...)")
with the number of bytes, and leave the errmsg() mentioning the general
operation being done (allocate, zero, etc) with the number of blocks.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"The eagle never lost so much time, as
when he submitted to learn of the crow." (William Blake)



Re: FileFallocate misbehaving on XFS

From
Andres Freund
Date:
Hi,

On 2024-12-16 18:05:59 +0100, Alvaro Herrera wrote:
> On 2024-Dec-16, Robert Haas wrote:
> 
> > On Mon, Dec 16, 2024 at 9:12 AM Andres Freund <andres@anarazel.de> wrote:
> > > Personally I don't like the obfuscation of "allocate" and "zero" vs just
> > > naming the function names. But I guess that's just taste thing.
> > >
> > > When looking for problems it's considerably more work with bytes, because - at
> > > least for me - the large number is hard to compare quickly and to know how
> > > aggressively we extended also requires to translate to blocks.
> > 
> > FWIW, I think that what we report in the error should hew as closely
> > to the actual system call as possible. Hence, I agree with your first
> > complaint and would prefer to simply see the system calls named, but I
> > disagree with your second complaint and would prefer to see the byte
> > count.
> 
> Maybe we can add errdetail("The system call was FileFallocate( ... %u ...)")
> with the number of bytes, and leave the errmsg() mentioning the general
> operation being done (allocate, zero, etc) with the number of blocks.

I don't see what we gain by requiring guesswork (what does allocating vs
zeroing mean? zeroing also allocates disk space, after all) to interpret the
main error message. My experience is that it's often harder to get the DETAIL
than the actual error message (grepping becomes harder because it's on a
separate line, and terse verbosity is commonly used).

I think we're going too far towards not mentioning the actual problems in too
many error messages in general.

Greetings,

Andres Freund



Re: FileFallocate misbehaving on XFS

From
Robert Haas
Date:
On Mon, Dec 16, 2024 at 12:52 PM Andres Freund <andres@anarazel.de> wrote:
> I don't see what we gain by requiring guesswork (what does allocating vs
> zeroing mean, zeroing also allocates disk space after all) to interpret the
> main error message. My experience is that it's often harder to get the DETAIL
> than the actual error message (grepping becomes harder due to separate line,
> terse verbosity is commonly used).

I feel like the normal way that we do this is basically:

could not {name of system call} file \"%s\": %m

e.g.

could not read file \"%s\": %m

I don't know why we should do anything else in this type of case.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: FileFallocate misbehaving on XFS

From
Jakub Wartak
Date:

On Thu, Dec 19, 2024 at 7:49 AM Michael Harris <harmic@gmail.com> wrote:
Hello,

I finally managed to get the patched version installed in a production
database where the error is occurring very regularly.

Here is a sample of the output:

2024-12-19 01:08:50 CET [2533222]:  LOG:  mdzeroextend FileFallocate
failing with ENOSPC: free space for filesystem containing
"pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" f_blocks:
2683831808, f_bfree: 205006167, f_bavail: 205006167 f_files:
1073741376, f_ffree: 1069933796
 
[..]
I have attached a file containing all the errors I collected. The
error is happening pretty regularly - over 400 times in a ~6 hour
period. The number of blocks being extended varies from ~9 to ~15, and
the statfs result shows plenty of available space & inodes at the
time. The errors do seem to come in bursts.

I couldn't resist: you seem to have entered the quantum realm of free disk space, AKA Schrödinger's free space: you both have the space and don't have it... ;)

No one else has responded, so I'll try. My take is that we have only a very limited number of reports (2-3) of this happening, and it always seems to be at >90% space used; yet the adoption of PG16 is rising, so we may or may not see more errors of this kind. On the other hand, its frequency is so low that it's really wild we don't see more reports like this one. Lots of OS upgrades in the wild are performed by building new standbys (maybe that lowers the fs fragmentation) rather than by in-place OS upgrades. To me it sounds like a new, rare bug in XFS. You can probably live with #undef HAVE_POSIX_FALLOCATE as a way to survive; another option would be to try running xfs_fsr to defragment the fs.

Longer-term: other than collecting the eBPF data to start digging into where it is really triggered, I don't see a way forward. It would be suboptimal to just abandon the fallocate() optimizations from commit 31966b151e6ab7a6284deab6e8fe5faddaf2ae4c because of a very unusual combination of factors (an XFS bug).

Well, we could have some kludge like the pseudo-code if (posix_fallocate() == ENOSPC && statfs().free_space_pct >= 1) fallback_to_pwrites(), but it is ugly (see the sketch below). Another option is a GUC (or even two: how much to extend by, and whether to use posix_fallocate() at all), but people do not like more GUCs...
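
Spelled out a bit more, such a kludge might look like this (sketch only; fs_is_not_really_full() is a made-up helper wrapping the statfs() check, and error handling is elided):

    /* If fallocate claims ENOSPC but the filesystem still reports plenty of
     * free space, fall back to writing zeroes instead of giving up. */
    ret = FileFallocate(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * numblocks,
                        WAIT_EVENT_DATA_FILE_EXTEND);
    if (ret != 0 && errno == ENOSPC &&
        fs_is_not_really_full(FilePathName(v->mdfd_vfd)))
        ret = FileZero(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * numblocks,
                       WAIT_EVENT_DATA_FILE_EXTEND);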

>  I have so far not installed the bpftrace that Jakub suggested before -
> as I say this is a production machine and I am wary of triggering a
> kernel panic or worse (even though it seems like the risk for that
> would be low?). While a kernel stack trace would no doubt be helpful
> to the XFS developers, from a postgres point of view, would that be
> likely to help us decide what to do about this?[..]

Well, you could try getting a reproduction outside of production, or even clone the storage (not via backup/restore, but by literally cloning the XFS LUNs on the storage array itself) and connect it to a separate VM to get a safe testbed (or even dd(1) some smaller XFS fs exhibiting the behaviour to some other place).

As for eBPF/bpftrace: it is safe (it's sandboxed anyway), lots of customers are using it, but as always YMMV.

There's also xfs_fsr, which might help by defragmenting the fs.

You can also experiment with whether -o allocsize helps, or even just try -o allocsize=0 (though that will probably have some negative performance effects).

-J.

Re: FileFallocate misbehaving on XFS

From
Andres Freund
Date:
Hi,

On 2024-12-19 17:47:13 +1100, Michael Harris wrote:
> I finally managed to get the patched version installed in a production
> database where the error is occurring very regularly.

Thanks!


> Here is a sample of the output:
> 
> 2024-12-19 01:08:50 CET [2533222]:  LOG:  mdzeroextend FileFallocate
> failing with ENOSPC: free space for filesystem containing
> "pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" f_blocks:
> 2683831808, f_bfree: 205006167, f_bavail: 205006167 f_files:
> 1073741376, f_ffree: 1069933796

That's ~700 GB of free space...

It'd be interesting to see filefrag -v for that segment.


> This is a different system to those I previously provided logs from.
> It is also running RHEL8 with a similar configuration to the other
> system.

Given it's a RHEL system, have you raised this as an issue with RH? They
probably have somebody with actual XFS hacking experience on staff.

RH's kernels are *heavily* patched, so it's possible the issue is actually RH
specific.


> I have so far not installed the bpftrace that Jakub suggested before -
> as I say this is a production machine and I am wary of triggering a
> kernel panic or worse (even though it seems like the risk for that
> would be low?). While a kernel stack trace would no doubt be helpful
> to the XFS developers, from a postgres point of view, would that be
> likely to help us decide what to do about this?

Well, I'm personally wary of installing workarounds for a problem I don't
understand and can't reproduce, which might be specific to old filesystems
and/or heavily patched kernels.  This clearly is an FS bug.

That said, if we learn that somehow this is a fundamental XFS issue that can
be triggered on every XFS filesystem, with current kernels, it becomes more
reasonable to implement a workaround in PG.


Another thing I've been wondering about is if we could reduce the frequency of
hitting problems by rounding up the number of blocks we extend by to powers of
two. That would probably reduce fragmentation, and the extra space would be
quickly used in workloads where we extend by a bunch of blocks at once,
anyway.
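
For what it's worth, a sketch of that rounding (assuming pg_nextpower2_32() from pg_bitutils.h; the cap is the existing 64-block maximum):

    /* Round the bulk-extension size up to the next power of two, hoping to
     * leave the free space less fragmented.  Sketch only. */
    extend_by_pages = Min(pg_nextpower2_32(extend_by_pages), 64);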

Greetings,

Andres Freund