Hello PG Hackers,
Our application has recently migrated to PG16, and we have experienced
some failed upgrades. The upgrades are performed using pg_upgrade and
have failed during the phase where the schema is restored into the new
cluster, with the following error:
    pg_restore: error: could not execute query: ERROR: could not extend file "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with FileFallocate(): No space left on device
    HINT: Check free disk space.
This has happened multiple times on different servers, and in each
case there was plenty of free space available.
We found this thread describing similar issues:
https://www.postgresql.org/message-id/flat/AS1PR05MB91059AC8B525910A5FCD6E699F9A2%40AS1PR05MB9105.eurprd05.prod.outlook.com
As is the case in that thread, all of the affected databases are using XFS.
One of my colleagues built postgres from source with
HAVE_POSIX_FALLOCATE not defined, and using that build he was able to
complete the pg_upgrade, and then switched to a stock postgres build
after the upgrade. However, as you might expect, after the upgrade we
have experienced similar errors during regular operation. We make
heavy use of COPY, which is mentioned in the above discussion as
pre-allocating files.
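To illustrate the idea behind that workaround, here is a minimal sketch. It assumes that a build without HAVE_POSIX_FALLOCATE ends up extending files by writing zeros rather than preallocating; this is an illustration of that fallback, not PostgreSQL's actual code. The helper name extend_file and the 8192-byte chunk size are hypothetical.

```python
import errno
import os

# Errors on which this sketch gives up on preallocation and writes zeros.
# ENOSPC is the case from this report; the other two cover filesystems
# that do not support the call at all.
FALLBACK_ERRNOS = {errno.ENOSPC, errno.EOPNOTSUPP, errno.EINVAL}


def extend_file(fd: int, offset: int, length: int) -> str:
    """Reserve `length` bytes at `offset`, zero-filling if fallocate fails.

    Hypothetical helper for illustration only. Returns which path was used.
    """
    try:
        os.posix_fallocate(fd, offset, length)
        return "fallocate"
    except OSError as exc:
        if exc.errno not in FALLBACK_ERRNOS:
            raise

    # Fallback: write zeros explicitly. If the device really is full,
    # pwrite raises OSError(ENOSPC) and the caller sees a genuine error.
    chunk = b"\0" * 8192
    written = 0
    while written < length:
        n = min(len(chunk), length - written)
        written += os.pwrite(fd, chunk[:n], offset + written)
    return "zero-fill"
```

With this shape, a spurious ENOSPC from the preallocation call is not fatal, while a real out-of-space condition still surfaces from the write path.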
We have seen this on both Rocky Linux 8 (kernel 4.18.0) and Rocky
Linux 9 (kernel 5.14.0).
I am wondering if this bug might be related:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1791323
> When given an offset of 0 and a length, fallocate (man 2 fallocate) reports ENOSPC if the size of the file + the
> length to be allocated is greater than the available space.
There is a reproduction procedure at the bottom of the above Ubuntu
bug report, and using that procedure I get the same results on both
kernel 4.18.0 and kernel 5.14.0.
When calling fallocate with offset zero on an existing file, I get
ENOSPC even if I am only requesting the same amount of space as the
file already has.
If I repeat the experiment with ext4 I don't get that behaviour.
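The experiment above can be sketched roughly as follows. This is not the procedure from the Ubuntu report; it is a small stand-in that uses Python's os.posix_fallocate (which, as far as I know, maps to fallocate with mode 0 on Linux with glibc). The helper names fallocate_existing and describe are made up for this example.

```python
import errno
import os


def fallocate_existing(path: str) -> int:
    """Call posix_fallocate(offset=0, len=current size) on an existing file.

    Returns 0 on success, otherwise the errno reported by the call.
    Note that an empty file yields EINVAL, since the length must be > 0.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        size = os.fstat(fd).st_size
        try:
            os.posix_fallocate(fd, 0, size)
        except OSError as exc:
            return exc.errno
        return 0
    finally:
        os.close(fd)


def describe(err: int) -> str:
    """Turn a result from fallocate_existing into a readable name."""
    return "ok" if err == 0 else errno.errorcode.get(err, str(err))
```

Running describe(fallocate_existing(path)) against a file on XFS and then against a file on ext4 should show the difference described above. Going by the bug report, the file presumably has to be large relative to the remaining free space for ENOSPC to appear.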
On a surface examination of the code paths leading to the
FileFallocate call, it does not look like it should be trying to
allocate already allocated space, but I might have missed something
there.
Is this already being looked into?
Thanks in advance,
Cheers
Mike