
From Andres Freund
Subject Re: FileFallocate misbehaving on XFS
Date
Msg-id nq4ayqhjmipxahpjtj6jqog3hlk5mfztpvvax62rrmpjjlblrt@42gcpw2cldhv
In response to Re: FileFallocate misbehaving on XFS  (Jakub Wartak <jakub.wartak@enterprisedb.com>)
List pgsql-hackers
Hi,

FWIW, I tried fairly hard to reproduce this.

I ran an extended cycle of 80 backends copying into relations and occasionally
truncating them (to simulate partitions being dropped and new ones being
created). For this I kept a 4TB filesystem very close to completely full
(peaking at 99.998% full).

I did not see any ENOSPC errors unless the filesystem really was full at that
time. To check that, I made mdzeroextend() do a statfs() when encountering
ENOSPC, printed statfs.f_blocks and made that case PANIC.
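
For reference, a minimal standalone sketch of that kind of check could look
like this on Linux (this is not the actual mdzeroextend() code; the file name,
the 1MB extension size and the abort() standing in for PANIC are just
illustrative):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/statfs.h>
#include <sys/types.h>
#include <unistd.h>

/* Extend the file; if we get ENOSPC despite apparently free space, bail out hard. */
static void
extend_or_die(int fd, off_t offset, off_t len)
{
    int         rc = posix_fallocate(fd, offset, len);

    if (rc == 0)
        return;

    if (rc == ENOSPC)
    {
        struct statfs sf;

        if (fstatfs(fd, &sf) == 0)
        {
            fprintf(stderr,
                    "ENOSPC: blocks total=%llu free=%llu avail=%llu bsize=%ld\n",
                    (unsigned long long) sf.f_blocks,
                    (unsigned long long) sf.f_bfree,
                    (unsigned long long) sf.f_bavail,
                    (long) sf.f_bsize);

            /* the interesting case: ENOSPC even though space appears free */
            if ((long long) sf.f_bavail * (long long) sf.f_bsize >= (long long) len)
                abort();        /* stand-in for PANIC, keeps state around for inspection */
        }
    }

    fprintf(stderr, "posix_fallocate: %s\n", strerror(rc));
    exit(1);
}

int
main(void)
{
    int         fd = open("fallocate_probe.tmp", O_CREAT | O_RDWR, 0600);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    extend_or_die(fd, 0, 1024 * 1024);  /* try to extend by 1MB */
    close(fd);
    return 0;
}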


What I do see is that after intentionally hitting an out-of-disk-space error,
the available disk space would occasionally increase by a small amount a few
seconds later, regardless of whether the fallocate or the non-fallocate path
was used.

From what I can tell this small increase in free space has a few reasons:

- Checkpointer might not have gotten around to unlinking files, keeping the
  inode alive.

- Occasionally bgwriter or a backend would still have already-unlinked
  relation segments open, so the inode (not the actual file space, since the
  segment is truncated before the unlink precisely to prevent that) could not
  yet be removed from the filesystem. A small standalone demo of this effect
  follows the list.

- It looks like xfs does some small amount of work to reclaim space in the
  background. Which makes sense, otherwise each unlink would have to be a
  flush to disk.
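
To make the second point concrete, here is a small standalone demo (not
PostgreSQL code; the scratch file name and the 64MB size are arbitrary)
showing that the space of an unlinked file only becomes reclaimable once the
last open descriptor is closed, and even then the filesystem may hand it back
a moment later:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/statvfs.h>
#include <unistd.h>

/* Bytes currently available on the filesystem containing 'path'. */
static unsigned long long
avail_bytes(const char *path)
{
    struct statvfs sv;

    if (statvfs(path, &sv) != 0)
    {
        perror("statvfs");
        exit(1);
    }
    return (unsigned long long) sv.f_bavail * sv.f_frsize;
}

int
main(void)
{
    const char *path = "unlink_demo.tmp";
    char        buf[1 << 20];
    int         fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    memset(buf, 'x', sizeof(buf));
    for (int i = 0; i < 64; i++)        /* write ~64MB */
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
        {
            perror("write");
            return 1;
        }
    fsync(fd);

    unlink(path);                       /* the name is gone, the inode is not */
    printf("after unlink, fd still open: %llu bytes avail\n", avail_bytes("."));

    close(fd);                          /* last reference dropped */
    printf("after close:                 %llu bytes avail\n", avail_bytes("."));
    /* on xfs the second number may lag a little, matching the last point above */
    return 0;
}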

But that's nowhere near enough space to explain what you're seeing. The most
I saw come back was 6MB, after ramping up the truncation frequency a lot.

Of course this was on a newer kernel, not on RHEL / RL 8/9.


Just to make sure - you're absolutely certain that you actually had free
space at the time of the errors?  E.g. a checkpoint that runs soon after an
ENOSPC could free up a lot of space by removing now-unneeded WAL files. That
can be 100s of gigabytes.

If I were to provide you with a patch that reports the amount of free disk
space at the time of such an error, the size of the relation, etc., could you
reproduce the issue with it applied? Or is that unrealistic?




On 2024-12-11 13:05:21 +0100, Jakub Wartak wrote:
> - one AG with extreme low extent sizes compared to the others AGs (I bet
> that 2->3 22.73% below means small 8192b pg files in $PGDATA, but there are
> no large extents in that AG)
>    from      to extents  blocks    pct
>       1       1    4949    4949   0.65
>       2       3   86113  173452  22.73
>       4       7   19399   94558  12.39
>       8      15   23233  248602  32.58
>      16      31   12425  241421  31.64
>    total free extents 146119
>    total free blocks 762982
>    average free extent size 5.22165 (!)

Note that this does not mean that all extents in the AG are that small, just
that the *free* extents are of that size.

I think this might primarily be because this AG has the smallest amount of
free blocks (2.9GB). However, the fact that it *does* have less could be
interesting. It might be the AG associated with the directory for the busiest
database or such.

The next least-space AG is:

   from      to extents  blocks    pct
      1       1    1021    1021   0.10
      2       3   48748   98255  10.06
      4       7    9840   47038   4.81
      8      15   13648  146779  15.02
     16      31   15818  323022  33.06
     32      63     584   27932   2.86
     64     127     147   14286   1.46
    128     255     253   49047   5.02
    256     511     229   87173   8.92
    512    1023     139  102456  10.49
   1024    2047      51   72506   7.42
   2048    4095       3    7422   0.76
   total free extents 90481
   total free blocks 976937

It seems plausible it would look similar if more of the free blocks were
used.


> - we have logic of `extend_by_pages += extend_by_pages * waitcount;` capped
> up to 64 pg blocks maximum (and that's higher than the above)
> - but the fails where observed also using pg_upgrade --link -j/pg_restore
> -j (also concurrent posix_fallocate() to many independent files sharing the
> same AG, but that's 1 backend:1 file so no contention for waitcount in
> RelationAddBlocks())

We also extend by more than one page, even without concurrency, if
bulk-insertion is used, and I think we do use that for e.g. pg_attribute,
which is actually the table where pg_restore encountered the issue:

pg_restore: error: could not execute query: ERROR:  could not extend
file "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with
FileFallocate(): No space left on device

1249 is the initial relfilenode for pg_attribute.
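
For illustration, here is a simplified model of that sizing behaviour, built
around the extend_by_pages logic quoted above (the 8-page bulk-insert lot
size and the function name are made up for the example; this is not the
actual hio.c code). The point is that even a single, uncontended backend can
end up requesting a multi-block FileFallocate() at once:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_BUFFERS_TO_EXTEND_BY 64     /* the 64-page cap mentioned above */

/* How many pages a single extension request might grow to. */
static uint32_t
pages_to_extend_by(uint32_t needed_pages, uint32_t waitcount, bool bulk_insert)
{
    uint32_t    extend_by_pages = needed_pages;

    /* scale with the number of backends waiting on the extension lock */
    extend_by_pages += extend_by_pages * waitcount;

    /*
     * Bulk insertion (e.g. COPY, or catalog loading during restore) also
     * asks for a larger lot even with no concurrency; the 8 here is purely
     * illustrative.
     */
    if (bulk_insert && extend_by_pages < 8)
        extend_by_pages = 8;

    if (extend_by_pages > MAX_BUFFERS_TO_EXTEND_BY)
        extend_by_pages = MAX_BUFFERS_TO_EXTEND_BY;

    return extend_by_pages;
}

int
main(void)
{
    printf("single backend, no bulk insert: %u page(s)\n",
           pages_to_extend_by(1, 0, false));
    printf("single backend, bulk insert:    %u page(s)\n",
           pages_to_extend_by(1, 0, true));
    printf("10 waiters:                     %u page(s)\n",
           pages_to_extend_by(1, 10, false));
    printf("100 waiters (hits the cap):     %u page(s)\n",
           pages_to_extend_by(1, 100, false));
    return 0;
}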

There could also be some parallelism leading to bulk extension, due to the
parallel restore. I don't remember which commands pg_restore actually executes
in parallel.

Greetings,

Andres Freund


