Re: FileFallocate misbehaving on XFS - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: FileFallocate misbehaving on XFS
Msg-id: nq4ayqhjmipxahpjtj6jqog3hlk5mfztpvvax62rrmpjjlblrt@42gcpw2cldhv
In response to: Re: FileFallocate misbehaving on XFS (Jakub Wartak <jakub.wartak@enterprisedb.com>)
List: pgsql-hackers
Hi,

FWIW, I tried fairly hard to reproduce this: an extended cycle of 80
backends copying into relations and occasionally truncating them (to
simulate partitions being dropped and new ones created). For this I ran a
4TB filesystem very close to completely full (peaking at 99.998% full).

I did not see any ENOSPC errors unless the filesystem really was full at
that time. To check that, I made mdzeroextend() do a statfs() when
encountering ENOSPC, printed statfs.f_blocks and made that case PANIC.

What I do see is that after - intentionally - hitting an out-of-disk-space
error, the available disk space would occasionally increase by a small
amount after a few seconds, regardless of whether the fallocate or the
non-fallocate path was used. From what I can tell this small increase in
free space has a few reasons:

- Checkpointer might not have gotten around to unlinking files, keeping the
  inode alive.

- Occasionally bgwriter or a backend would still have unlinked relation
  segments open, so the inode (not the actual file space, since the segment
  is truncated to prevent that) could not yet be removed from the
  filesystem.

- It looks like xfs does some small amount of work to reclaim space in the
  background. Which makes sense, otherwise each unlink would have to be a
  flush to disk.

But that's nowhere near enough space to explain what you're seeing. The
most I've seen was 6MB, when ramping up the truncation frequency a lot. Of
course this was on a newer kernel, not on RHEL / RL 8/9.

Just to make sure - you're absolutely certain that you actually had free
space at the time of the errors? E.g. a checkpoint soon after an ENOSPC
could remove now-unneeded WAL files and thereby free up a lot of space -
that can be 100s of gigabytes.

If I were to provide you with a patch that showed the amount of free disk
space at the time of an error, the size of the relation etc. (a rough
sketch of what I mean is in the PS below), could you reproduce the issue
with it applied? Or is that unrealistic?

On 2024-12-11 13:05:21 +0100, Jakub Wartak wrote:
> - one AG with extreme low extent sizes compared to the others AGs (I bet
> that 2->3 22.73% below means small 8192b pg files in $PGDATA, but there are
> no large extents in that AG)
>
>    from      to extents  blocks    pct
>       1       1    4949    4949   0.65
>       2       3   86113  173452  22.73
>       4       7   19399   94558  12.39
>       8      15   23233  248602  32.58
>      16      31   12425  241421  31.64
> total free extents 146119
> total free blocks 762982
> average free extent size 5.22165 (!)

Note that this does not mean that all extents in the AG are that small,
just that the *free* extents are of that size.

I think this might primarily be because this AG has the smallest amount of
free blocks (2.9GB). However, the fact that it *does* have less could be
interesting - it might be the AG associated with the directory for the
busiest database or such.

The next least-space AG is:

   from      to extents  blocks    pct
      1       1    1021    1021   0.10
      2       3   48748   98255  10.06
      4       7    9840   47038   4.81
      8      15   13648  146779  15.02
     16      31   15818  323022  33.06
     32      63     584   27932   2.86
     64     127     147   14286   1.46
    128     255     253   49047   5.02
    256     511     229   87173   8.92
    512    1023     139  102456  10.49
   1024    2047      51   72506   7.42
   2048    4095       3    7422   0.76
total free extents 90481
total free blocks 976937

It seems plausible it'd look similar if more of the free blocks were used.
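In case it's useful for comparing more AGs over time: those histograms look
like xfs_db "freesp" output. From memory, a per-AG summary like the ones
above can be produced with something along the lines of

  xfs_db -r -c 'freesp -s -a 2' /dev/nvme0n1p1

where the device path is obviously just a placeholder; on a mounted
filesystem the numbers can be somewhat stale.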
> - we have logic of `extend_by_pages += extend_by_pages * waitcount;` capped
> up to 64 pg blocks maximum (and that's higher than the above)

> - but the fails where observed also using pg_upgrade --link -j/pg_restore
> -j (also concurrent posix_fallocate() to many independent files sharing the
> same AG, but that's 1 backend:1 file so no contention for waitcount in
> RelationAddBlocks())

We also extend by more than one page, even without concurrency, if
bulk-insertion is used, and I think we do use that for e.g. pg_attribute.
Which is actually the table where pg_restore encountered the issue:

pg_restore: error: could not execute query: ERROR: could not extend file
"pg_tblspc/16401/PG_16_202307071/17643/1249.1" with FileFallocate(): No
space left on device

1249 is the initial relfilenode for pg_attribute.

There could also be some parallelism leading to bulk extension, due to the
parallel restore. I don't remember which commands pg_restore actually
executes in parallel.

Greetings,

Andres Freund
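PS: To be a bit more concrete about the patch offered above, the
instrumentation I have in mind is roughly along these lines. This is an
entirely untested sketch, the helper name is made up, and the real patch
would also report the relation's size etc.:

#include "postgres.h"

#include <sys/vfs.h>			/* statfs(), Linux-specific */

/*
 * Untested sketch of a hypothetical helper: on ENOSPC while extending a
 * segment, log what the filesystem itself believes is still free, before
 * raising the error.
 */
static void
report_fs_free_space(const char *path)
{
	struct statfs fs;

	if (statfs(path, &fs) == 0)
		elog(LOG, "ENOSPC while extending \"%s\": "
			 "%llu of %llu blocks free (%llu available to non-root), block size %lu",
			 path,
			 (unsigned long long) fs.f_bfree,
			 (unsigned long long) fs.f_blocks,
			 (unsigned long long) fs.f_bavail,
			 (unsigned long) fs.f_bsize);
	else
		elog(LOG, "statfs(\"%s\") failed: %m", path);
}

That would then get called from the ENOSPC branches in mdzeroextend(),
presumably with the segment path obtained via FilePathName().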