On Thu, Dec 12, 2024 at 12:50 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
> FWIW, I tried fairly hard to reproduce this.
Same here, but without PG and also without much success. I also tried to push the AGs (with just one or two AGs created via mkfs) into containing only small extents, by creating hundreds of thousands of 8 kB files, deleting every Nth one, and then issuing a couple of larger fallocate()s/writes to see whether that would blow up on the original CentOS 7.9 / 3.10.x kernel, but no, it did not. It only failed when df -h showed exactly 100% in several scenarios like that (and yes, sometimes a little space did appear out of the blue, too). So my take is that it is something related to state (having the fd open) and concurrency.
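A minimal sketch of that kind of attempt in plain C (the file names, counts, modulo and the final 1 GB allocation size below are just illustrative):

/*
 * Fragment the AGs with lots of small files, free every Nth one, then
 * try a single large posix_fallocate() and see whether it reports ENOSPC.
 * Run it inside a directory on the XFS filesystem under test.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define NFILES      200000                 /* "hundreds of thousands" of files */
#define SMALL_SIZE  (8 * 1024)             /* 8 kB each */
#define DELETE_MOD  4                      /* unlink every 4th file */
#define BIG_ALLOC   ((off_t) 1 << 30)      /* one big 1 GB allocation */

int main(void)
{
    char    name[64];
    char    buf[SMALL_SIZE];

    memset(buf, 'x', sizeof(buf));

    /* Phase 1: fill the filesystem with many small files. */
    for (long i = 0; i < NFILES; i++)
    {
        int     fd;

        snprintf(name, sizeof(name), "frag_%ld", i);
        fd = open(name, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0)
        {
            perror("open");     /* e.g. ENOSPC once the fs is nearly full */
            break;
        }
        if (write(fd, buf, sizeof(buf)) != sizeof(buf))
            perror("write");
        close(fd);
    }

    /* Phase 2: delete every DELETE_MOD-th file, leaving small free fragments. */
    for (long i = 0; i < NFILES; i += DELETE_MOD)
    {
        snprintf(name, sizeof(name), "frag_%ld", i);
        unlink(name);
    }

    /*
     * Phase 3: one big allocation, roughly the pattern PostgreSQL issues when
     * extending a relation by many blocks at once.
     */
    int     fd = open("bigfile", O_CREAT | O_WRONLY, 0644);

    if (fd < 0)
    {
        perror("open bigfile");
        return 1;
    }

    int     rc = posix_fallocate(fd, 0, BIG_ALLOC);

    if (rc != 0)
        fprintf(stderr, "posix_fallocate failed: %s\n", strerror(rc));
    else
        fprintf(stderr, "posix_fallocate succeeded\n");

    close(fd);
    return 0;
}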
An interesting thing I've observed is that the per-directory AG affinity for big directories (think $PGDATA) is lost once the AG is full, and extents are then allocated from other AGs (one can use xfs_bmap -vv to compare the AGs backing the directory vs. the files in it).
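For regular files one can get roughly the same per-extent AG picture programmatically; a sketch (not exact xfs_bmap behaviour) using the generic FIEMAP ioctl, with the AG size in bytes passed in from xfs_info output (agsize blocks * block size) rather than queried from XFS itself:

/*
 * Print each extent of a regular file and the AG its physical start
 * falls into (AG = physical byte offset / AG size in bytes).
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
    if (argc != 3)
    {
        fprintf(stderr, "usage: %s <file-on-xfs> <ag-size-in-bytes>\n", argv[0]);
        return 1;
    }

    unsigned long long ag_bytes = strtoull(argv[2], NULL, 10);
    int     fd = open(argv[1], O_RDONLY);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* Ask for up to 256 extents; real code would loop until FIEMAP_EXTENT_LAST. */
    unsigned int next = 256;
    struct fiemap *fm = calloc(1, sizeof(*fm) + next * sizeof(struct fiemap_extent));

    fm->fm_start = 0;
    fm->fm_length = ~0ULL;
    fm->fm_extent_count = next;

    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
    {
        perror("FS_IOC_FIEMAP");
        return 1;
    }

    for (unsigned int i = 0; i < fm->fm_mapped_extents; i++)
    {
        struct fiemap_extent *fe = &fm->fm_extents[i];

        printf("extent %u: logical %llu physical %llu len %llu -> AG %llu\n",
               i,
               (unsigned long long) fe->fe_logical,
               (unsigned long long) fe->fe_physical,
               (unsigned long long) fe->fe_length,
               (unsigned long long) (fe->fe_physical / ag_bytes));
    }

    free(fm);
    close(fd);
    return 0;
}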
> An extended cycle of 80 backends copying into relations and occasionally truncating them (to simulate the partitions being dropped and new ones created). For this I ran a 4TB filesystem very close to fully filled (peaking at 99.998 % full).
The only question I could think of: how many files were involved there? Maybe it is some kind of race between other (or the same) backends frequently churning their fd caches with open()/close() [defeating speculative preallocation], so that XFS ends up fragmented and only then does posix_fallocate() run into trouble for larger allocations (>> 8 kB)? My take is that if we send N I/O write vectors this seems to be handled fine, but when we throw one big fallocate at it, it is not; so maybe posix_fallocate() was in the middle of finding space while some other activity hit that inode, such as a close(). But then that does not seem to match the pg_upgrade scenario.
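To make the distinction concrete, a minimal sketch contrasting the two extension strategies (one posix_fallocate() covering the whole range vs. the same range pushed as a batch of 8 kB zero-filled iovecs via pwritev(); file name and sizes are illustrative):

/*
 * Extend a file by NBLKS * 8 kB either with one big posix_fallocate()
 * or with a single pwritev() carrying NBLKS zero-filled 8 kB vectors.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>
#include <errno.h>

#define BLKSZ   8192
#define NBLKS   128                 /* extend by 128 * 8 kB = 1 MB per call */

static int
extend_with_fallocate(int fd, off_t offset)
{
    /* One request covering the whole range. */
    return posix_fallocate(fd, offset, (off_t) NBLKS * BLKSZ);
}

static int
extend_with_pwritev(int fd, off_t offset)
{
    static char zeros[BLKSZ];       /* zero-filled block */
    struct iovec iov[NBLKS];
    ssize_t     n;

    for (int i = 0; i < NBLKS; i++)
    {
        iov[i].iov_base = zeros;
        iov[i].iov_len = BLKSZ;
    }

    /* N vectors in one syscall. */
    n = pwritev(fd, iov, NBLKS, offset);
    if (n < 0)
        return errno;
    if (n != (ssize_t) NBLKS * BLKSZ)
        return EIO;                 /* short write; good enough for a sketch */
    return 0;
}

int main(void)
{
    int     fd = open("extend_test", O_CREAT | O_RDWR, 0644);
    int     rc;

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    rc = extend_with_fallocate(fd, 0);
    fprintf(stderr, "posix_fallocate: %s\n", rc ? strerror(rc) : "ok");

    rc = extend_with_pwritev(fd, (off_t) NBLKS * BLKSZ);
    fprintf(stderr, "pwritev:         %s\n", rc ? strerror(rc) : "ok");

    close(fd);
    return 0;
}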
Well, IMHO we are stuck until Michael provides some more data (the outcome of the patch, bpf traces, and maybe other hints and tests).