Re: FileFallocate misbehaving on XFS - Mailing list pgsql-hackers

From Michael Harris
Subject Re: FileFallocate misbehaving on XFS
Date
Msg-id CADofcAUOqdrEhZj6-3h3GKz2k7J1pJe4pQ0W-PEibOj2=vrScA@mail.gmail.com
In response to Re: FileFallocate misbehaving on XFS  (Michael Harris <harmic@gmail.com>)
List pgsql-hackers
Hi again

One extra piece of information: I had said that all the machines were
Rocky Linux 8 or Rocky Linux 9, but actually a large number of them
are RHEL8.

Sorry for the confusion.

Of course, RL8 is a rebuild of RHEL8, so it is not surprising that they
behave similarly.

Cheers
Mike

On Tue, 10 Dec 2024 at 17:28, Michael Harris <harmic@gmail.com> wrote:
>
> Hi Andres
>
> Following up on the earlier question about OS upgrade paths - all the
> cases reported so far are either on RL8 (kernel 4.18.0) or were
> upgraded to RL9 (kernel 5.14.0) with the affected filesystems
> preserved.
> In fact, the RL9 systems were initially built as CentOS 7, and then
> when that went EOL they were upgraded to RL9. The process was as I
> described - the /var/opt filesystem containing the database was
> preserved, while the root and other OS filesystems were scratched.
> The majority of systems where we have this problem are on RL8.
>
> On Tue, 10 Dec 2024 at 11:31, Andres Freund <andres@anarazel.de> wrote:
> > Are you using any filesystem quotas?
>
> No.
>
> > It'd be useful to get the xfs_info output that Jakub asked for. Perhaps also
> > xfs_spaceman -c 'freesp -s' /mountpoint
> > xfs_spaceman -c 'health' /mountpoint
> > and df.
>
> I gathered this info from one of the systems that is currently on RL9.
> This system is relatively small compared to some of the others that
> have exhibited this issue, but it is the only one I can access right
> now.
>
> # uname -a
> Linux 5.14.0-503.14.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 15
> 12:04:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
>
> # xfs_info /dev/mapper/ippvg-ipplv
> meta-data=/dev/mapper/ippvg-ipplv isize=512    agcount=4, agsize=262471424 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=0, sparse=0, rmapbt=0
>          =                       reflink=0    bigtime=0 inobtcount=0 nrext64=0
> data     =                       bsize=4096   blocks=1049885696, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=512639, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> # for agno in `seq 0 3`; do xfs_spaceman -c "freesp -s -a $agno" /var/opt; done
>    from      to extents  blocks    pct
>       1       1   37502   37502   0.15
>       2       3   62647  148377   0.59
>       4       7   87793  465950   1.85
>       8      15  135529 1527172   6.08
>      16      31  184811 3937459  15.67
>      32      63  165979 7330339  29.16
>      64     127  101674 8705691  34.64
>     128     255   15123 2674030  10.64
>     256     511     973  307655   1.22
> total free extents 792031
> total free blocks 25134175
> average free extent size 31.7338
>    from      to extents  blocks    pct
>       1       1   43895   43895   0.22
>       2       3   59312  141693   0.70
>       4       7   83406  443964   2.20
>       8      15  120804 1362108   6.75
>      16      31  133140 2824317  14.00
>      32      63  118619 5188474  25.71
>      64     127   77960 6751764  33.46
>     128     255   16383 2876626  14.26
>     256     511    1763  546506   2.71
> total free extents 655282
> total free blocks 20179347
> average free extent size 30.7949
>    from      to extents  blocks    pct
>       1       1   72034   72034   0.26
>       2       3   98158  232135   0.83
>       4       7  126228  666187   2.38
>       8      15  169602 1893007   6.77
>      16      31  180286 3818527  13.65
>      32      63  164529 7276833  26.01
>      64     127  109687 9505160  33.97
>     128     255   22113 3921162  14.02
>     256     511    1901  592052   2.12
> total free extents 944538
> total free blocks 27977097
> average free extent size 29.6199
>    from      to extents  blocks    pct
>       1       1   51462   51462   0.21
>       2       3   98993  233204   0.93
>       4       7  131578  697655   2.79
>       8      15  178151 1993062   7.97
>      16      31  175718 3680535  14.72
>      32      63  145310 6372468  25.48
>      64     127   89518 7749021  30.99
>     128     255   18926 3415768  13.66
>     256     511    2640  813586   3.25
> total free extents 892296
> total free blocks 25006761
> average free extent size 28.0252
>
> # xfs_spaceman -c 'health' /var/opt
> Health status has not been collected for this filesystem.
>
> > What kind of storage is this on?
>
> As mentioned, there are quite a few systems at different sites, so a
> number of different storage solutions are in use: some with directly
> attached disks, others with SAN solutions.
> The instance the printout above came from is a VM, but at the other
> site they are all bare metal.
>
> > Was the filesystem ever grown from a smaller size?
>
> I can't say for sure that none of them were, but given the number of
> different systems that have this issue I am confident that would not
> be a common factor.
>
> > Have you checked the filesystem's internal consistency? I.e. something like
> > xfs_repair -n /dev/nvme2n1. It does require the filesystem to be read-only or
> > unmounted though. But corrupted filesystem datastructures certainly could
> > cause spurious ENOSPC.
>
> I executed this on the same system that the above output came from. It
> did not report any issues.
>
> > Are you using pg_upgrade -j?
>
> Yes, we use -j `nproc`
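> 
> For reference, the invocation is along these lines (the bin and data
> directories below are placeholders, not our actual paths):
> 
> # pg_upgrade -j "$(nproc)" \
>     -b /path/to/old/bin -B /path/to/new/bin \
>     -d /path/to/old/data -D /path/to/new/data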
>
> > I assume the file that actually errors out changes over time? It's always
> > fallocate() that fails?
>
> Yes, correct, on both counts.
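> 
> (For what it's worth, a plain preallocation can also be exercised
> outside Postgres with the util-linux fallocate tool, to see whether it
> hits ENOSPC on the affected mount as well - the path and size here are
> only illustrative:
> 
> # fallocate -l 1G /var/opt/fallocate_test.bin && echo preallocation ok
> # rm /var/opt/fallocate_test.bin
> )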
>
> > Can you tell us anything about the workload / data? Lots of tiny tables, lots
> > of big tables, write heavy, ...?
>
> It is a write-heavy application which stores mostly time series data.
>
> The time series data is partitioned by time; the application writes
> constantly into the 'current' partition, and data is expired by
> removing the oldest partition. Most of the data is written once and
> not updated.
>
> There are quite a lot of these partitioned tables (in the 1000s or
> 10000s), depending on how the application is configured. Individual
> partitions range in size from a few MB to tens of GB.
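> 
> As a very rough sketch of the shape (table and column names here are
> made up for illustration, not our actual schema), each of these tables
> looks something like:
> 
> # psql -d ourdb <<'SQL'
> -- parent table, partitioned by timestamp
> CREATE TABLE samples (
>     ts  timestamptz NOT NULL,
>     val double precision
> ) PARTITION BY RANGE (ts);
> -- 'current' partition that the application writes into
> CREATE TABLE samples_2024_12_10 PARTITION OF samples
>     FOR VALUES FROM ('2024-12-10') TO ('2024-12-11');
> -- expiry: the oldest partition is simply dropped
> DROP TABLE IF EXISTS samples_2024_11_10;
> SQL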
>
> Cheers
> Mike.


