Re: FileFallocate misbehaving on XFS - Mailing list pgsql-hackers

From: Michael Harris
Subject: Re: FileFallocate misbehaving on XFS
Msg-id: CADofcAWphm3uMtXZVCwko15E47HVhksR5YZ2pWhUpEjNz6Hbmw@mail.gmail.com
In response to: Re: FileFallocate misbehaving on XFS (Andres Freund <andres@anarazel.de>)
Hi Andres

Following up on the earlier question about OS upgrade paths: all the
cases reported so far are either on RL8 (kernel 4.18.0) or were
upgraded to RL9 (kernel 5.14.0) with the affected filesystems
preserved.

In fact the RL9 systems were initially built as CentOS 7, and when
that went EOL they were upgraded to RL9. The process was as I
described: the /var/opt filesystem containing the database was
preserved, while the root and other OS filesystems were scratched.

The majority of systems where we have this problem are on RL8.

On Tue, 10 Dec 2024 at 11:31, Andres Freund <andres@anarazel.de> wrote:
> Are you using any filesystem quotas?

No.

> It'd be useful to get the xfs_info output that Jakub asked for. Perhaps also
> xfs_spaceman -c 'freesp -s' /mountpoint
> xfs_spaceman -c 'health' /mountpoint
> and df.

I gathered this info from one of the systems that is currently on RL9.
This system is relatively small compared to some of the others that
have exhibited this issue, but it is the only one I can access right
now.

# uname -a
Linux 5.14.0-503.14.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 15
12:04:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

# xfs_info /dev/mapper/ippvg-ipplv
meta-data=/dev/mapper/ippvg-ipplv isize=512    agcount=4, agsize=262471424 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0, rmapbt=0
         =                       reflink=0    bigtime=0 inobtcount=0 nrext64=0
data     =                       bsize=4096   blocks=1049885696, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=512639, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# for agno in `seq 0 3`; do xfs_spaceman -c "freesp -s -a $agno" /var/opt; done
   from      to extents  blocks    pct
      1       1   37502   37502   0.15
      2       3   62647  148377   0.59
      4       7   87793  465950   1.85
      8      15  135529 1527172   6.08
     16      31  184811 3937459  15.67
     32      63  165979 7330339  29.16
     64     127  101674 8705691  34.64
    128     255   15123 2674030  10.64
    256     511     973  307655   1.22
total free extents 792031
total free blocks 25134175
average free extent size 31.7338
   from      to extents  blocks    pct
      1       1   43895   43895   0.22
      2       3   59312  141693   0.70
      4       7   83406  443964   2.20
      8      15  120804 1362108   6.75
     16      31  133140 2824317  14.00
     32      63  118619 5188474  25.71
     64     127   77960 6751764  33.46
    128     255   16383 2876626  14.26
    256     511    1763  546506   2.71
total free extents 655282
total free blocks 20179347
average free extent size 30.7949
   from      to extents  blocks    pct
      1       1   72034   72034   0.26
      2       3   98158  232135   0.83
      4       7  126228  666187   2.38
      8      15  169602 1893007   6.77
     16      31  180286 3818527  13.65
     32      63  164529 7276833  26.01
     64     127  109687 9505160  33.97
    128     255   22113 3921162  14.02
    256     511    1901  592052   2.12
total free extents 944538
total free blocks 27977097
average free extent size 29.6199
   from      to extents  blocks    pct
      1       1   51462   51462   0.21
      2       3   98993  233204   0.93
      4       7  131578  697655   2.79
      8      15  178151 1993062   7.97
     16      31  175718 3680535  14.72
     32      63  145310 6372468  25.48
     64     127   89518 7749021  30.99
    128     255   18926 3415768  13.66
    256     511    2640  813586   3.25
total free extents 892296
total free blocks 25006761
average free extent size 28.0252

# xfs_spaceman -c 'health' /var/opt
Health status has not been collected for this filesystem.

> What kind of storage is this on?

As mentioned, there are quite a few systems at different sites, so
there are a number of different storage solutions in use: some have
directly attached disks, others use SAN storage.
The instance the printout above came from is a VM, but at the other
site they are all bare metal.

> Was the filesystem ever grown from a smaller size?

I can't say for sure that none of them were, but given the number of
different systems that have this issue I am confident that would not
be a common factor.

> Have you checked the filesystem's internal consistency? I.e. something like
> xfs_repair -n /dev/nvme2n1. It does require the filesystem to be read-only or
> unmounted though. But corrupted filesystem datastructures certainly could
> cause spurious ENOSPC.

I executed this on the same system the above output came from. It
did not report any issues.

> Are you using pg_upgrade -j?

Yes, we use -j `nproc`

> I assume the file that actually errors out changes over time? It's always
> fallocate() that fails?

Yes, correct, on both counts.
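In case it helps anyone reproduce this outside of Postgres: here is a
minimal sketch of the same kind of call via Python's os.posix_fallocate
wrapper. The helper name and path are just placeholders of mine, not
anything from PG, and this only approximates what FileFallocate() does:

```python
import os

def try_fallocate(path, nbytes):
    """Preallocate nbytes for path, roughly as a posix_fallocate()-based
    FileFallocate() would. Returns 0 on success, otherwise the errno
    value (e.g. errno.ENOSPC when the filesystem reports no space)."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # posix_fallocate raises OSError on failure; errno carries the cause
        os.posix_fallocate(fd, 0, nbytes)
        return 0
    except OSError as e:
        return e.errno
    finally:
        os.close(fd)
```

Running something like that in a loop against the affected filesystem
while the errors are occurring might help show whether Postgres itself
is a factor at all.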

> Can you tell us anything about the workload / data? Lots of tiny tables, lots
> of big tables, write heavy, ...?

It is a write-heavy application which stores mostly time series data.

The time series data is partitioned by time; the application writes
constantly into the 'current' partition, and data is expired by
removing the oldest partition. Most of the data is written once and
not updated.

There are quite a lot of these partitioned tables (in the thousands or
tens of thousands, depending on how the application is configured).
Individual partitions range in size from a few MB to tens of GB.

Cheers
Mike.


