Re: FileFallocate misbehaving on XFS - Mailing list pgsql-hackers

From Jakub Wartak
Subject Re: FileFallocate misbehaving on XFS
Msg-id CAKZiRmzO=ZetYm2xvXJVmmiSeyZJcwC9oHYEwSjsV7ifT4cn=g@mail.gmail.com
In response to Re: FileFallocate misbehaving on XFS (Michael Harris <harmic@gmail.com>)
List pgsql-hackers


On Wed, Dec 11, 2024 at 4:00 AM Michael Harris <harmic@gmail.com> wrote:
> Hi Jakub
>
> On Tue, 10 Dec 2024 at 22:36, Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> [..]
>
> > 3. Maybe somehow there is a bigger interaction between posix_fallocate() and XFS's delayed dynamic speculative preallocation from many processes all writing into different partitions? Maybe try the "allocsize=1m" mount option for that fs and see if that helps. I'm going to speculate about XFS speculative :) preallocations, but if we have the fd cache and are *not* closing fds, how would XFS know to abort its own speculation about a streaming write? (Multiply that by potentially the number of opened fds to get an avalanche of "preallocations".)
>
> I will try to organize that. They are production systems so it might
> take some time.

Cool.
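
In case it helps when you get to it, a rough sketch of applying that mount option on a test box (the device and mount point below are placeholders, so adjust for the real system):

    # placeholders - adjust device and mount point for the real system
    umount /var/lib/pgsql/data_fs
    mount -t xfs -o allocsize=1m /dev/mapper/vg_data-lv_pgdata /var/lib/pgsql/data_fs
    # confirm the option took effect
    grep allocsize /proc/mounts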

> > 4. You can also try compiling with the patch from Alvaro in [2], "0001-Add-some-debugging-around-mdzeroextend.patch", so we might end up having more clarity on the offsets involved. If not, then you could use 'strace -e fallocate -p <pid>' to get the exact syscall.
>
> I'll take a look at Alvaro's patch. strace sounds good, but how do I
> arrange to start it on the correct PG backends? There will be a
> large-ish number of PG backends going at a time, only some of which
> are performing imports, and they will be coming and going every so
> often as the ETL application scales up and down with the load.

Yes, it sounds like mission impossible. Is there any chance you can get the reproduction down to one, or a small number of, postgres backends doing the writes?
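
If not, one rough way to attach strace only to the import backends, assuming they can be told apart in pg_stat_activity (application_name = 'etl' below is just a made-up filter):

    # assumption: the ETL backends set application_name = 'etl'; adjust the filter as needed
    for pid in $(psql -XAtc "SELECT pid FROM pg_stat_activity WHERE application_name = 'etl'")
    do
        strace -f -tt -e trace=fallocate -p "$pid" -o "strace.fallocate.$pid" &
    done
    # each strace detaches when its backend exits; kill the strace processes to stop earlier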
 

> > 5. Another idea could be catching the kernel-side stack trace of fallocate() when it is hitting ENOSPC. E.g. with an XFS fs and the attached bpftrace eBPF tracer I could get to the source of the problem in my artificial reproducer.
>
> OK, I will look into that also.


Hopefully that reveals some more. Somehow UNIX error reporting lumps one big pile of failures into the single ENOSPC category, and that's not helpful at all (inode, extent and block allocation problems are all squeezed into one error).
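
Something in the spirit of the one-liner below (not the attached script itself; xfs_file_fallocate as the probe point is my assumption about the running kernel, and -28 is -ENOSPC):

    # print the kernel stack whenever an XFS fallocate call returns ENOSPC
    bpftrace -e 'kretprobe:xfs_file_fallocate /retval == -28/ {
        printf("ENOSPC in fallocate, pid %d (%s)\n%s\n", pid, comm, kstack);
    }'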

Anyway, in case it helps others, here are my notes so far on this thread, including that useful file from the subthread; hopefully I have not misinterpreted anything:

- works in < PG16, but fails with >= PG16 because the relation is extended with posix_fallocate() rather than with multiple separate (but adjacent) iovectors passed to pg_pwritev(); posix_fallocate() is used only from mdzeroextend() when numblocks > 8
- 179k or 414k files in a single directory (0.3s-0.5s just to list them)
- OS/FS upgraded from an earlier release
- one AG with extremely low extent sizes compared to the other AGs (I bet the 2->3 bucket at 22.73% below means small 8192-byte pg files in $PGDATA, but there are no large extents in that AG; see the sketch after this list for how to pull these numbers)
   from      to extents  blocks    pct
      1       1    4949    4949   0.65
      2       3   86113  173452  22.73
      4       7   19399   94558  12.39
      8      15   23233  248602  32.58
     16      31   12425  241421  31.64
   total free extents 146119
   total free blocks 762982
   average free extent size 5.22165 (!)
- note that the max extent size above (31) is very low when compared to the other AGs, which have 1024-8192. Therefore it looks like there are no contiguous free extents for request sizes above 31*4096 = 126976 bytes within that AG (??).
- we have the logic `extend_by_pages += extend_by_pages * waitcount;`, capped at a maximum of 64 pg blocks (and that's higher than the above)
- but the failures were also observed using pg_upgrade --link -j / pg_restore -j (also concurrent posix_fallocate() calls to many independent files sharing the same AG, but that's 1 backend:1 file, so no contention for waitcount in RelationAddBlocks())
- so maybe it's lots of backends doing independent concurrent posix_fallocate() calls that somehow end up coalesced? Or, hypothetically, say 16-32 fallocate() calls hit the same AG initially; maybe it's some form of concurrency semi race condition inside XFS where one of the fallocate calls fails to find space in that one AG, although according to [1] it should fall back to another AG.
- and there's also the additional XFS dynamic speculative preallocation that might cause space pressure during our normal writes.
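
For anyone wanting to pull the same per-AG numbers as above, a rough sketch (device path, AG number and test paths are placeholders; on a mounted fs the numbers may be slightly stale):

    # per-AG free space histogram for AG 2, read-only
    xfs_db -r -c "freesp -s -a 2" /dev/mapper/vg_data-lv_pgdata

    # quick probe: does a plain 512kB preallocation on that fs succeed right now?
    fallocate -l 512k /path/on/that/fs/fallocate_probe && echo OK
    rm -f /path/on/that/fs/fallocate_probe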

Another workaround idea/test: create a tablespace on the same XFS fs (but in a somewhat different directory if possible) and see if it still fails.
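
A rough sketch of that test, with made-up paths and names (the directory must be empty and owned by the postgres OS user):

    # same XFS filesystem as $PGDATA, different directory (placeholder path)
    install -d -o postgres -g postgres /srv/pg_xfs/ts_test
    psql -c "CREATE TABLESPACE ts_test LOCATION '/srv/pg_xfs/ts_test'"
    # then point the import target tables (or just new partitions) at it, e.g.
    psql -c "ALTER TABLE my_partition SET TABLESPACE ts_test"    # hypothetical table name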

-J.
