Re: The ability of postgres to determine loss of files of the main fork - Mailing list pgsql-hackers

From Jakub Wartak
Subject Re: The ability of postgres to determine loss of files of the main fork
Date
Msg-id CAKZiRmzGgOY5iQ=CFHJVE2df2Bz599bACqoWP3V_e-3Rw+XJhA@mail.gmail.com
In response to Re: The ability of postgres to determine loss of files of the main fork  (Michael Banck <mbanck@gmx.net>)
List pgsql-hackers
On Wed, Oct 1, 2025 at 2:15 PM Michael Banck <mbanck@gmx.net> wrote:

> > > That might be a lot of files to open.
> >
> > I was afraid of that, but let's say a modern high-end DB is 200TB;
> > that's like 200*1024 1GB files, yet I'm getting the following time(1)
> > timings for 204k files on ext4:
> > $ time ./createfiles                      # real 0m2.157s, it's open(O_CREAT)+close()
> > $ time ls -l many_files_dir/ > /dev/null  # real 0m0.734s
> > $ time ./openfiles                        # real 0m0.297s, for already existing ones (hot)
> > $ time ./openfiles                        # real 0m1.456s, for already existing ones (cold, echo 3 > drop_caches sysctl)
> >
> > Not bad in my book as a one-time activity. It could potentially pose
> > a problem with some high-latency open() calls, maybe NFS or something
> > remote, I guess.
>
> Yeah, did you try on SAN as well? I am doubtful that will be performant.


OK, I guess you wanted some latency plus a big fs size. I do not have a
real SAN here, but I've tried it on noisy Azure Cloud storage: 4x
Standard __HDD__ disks (probably SSD/flash underneath, just throttled
way down), each LUN 32TB (the max), giving me a 128TB striped fs in
total:
- lvcreate -L 128G -n lv -i 4 -I 1M vgbig + mkfs.xfs
- ioping there reports the following latencies for direct I/O:
min/avg/max/mdev = 3.25 ms / 18.0 ms / 193.8 ms / 39.5 ms (yay!)
- so, as per the above, it doesn't have low latency at all
- a simple fio run (fio --name=rand_read_iops
--filename=/xfs/many_files_dir/test1 --rw=randread --bs=4k --direct=1
--ioengine=io_uring --iodepth=64 --size=4G --runtime=60 --time_based
--group_reporting) reports just: IOPS=959, BW=3837KiB/s

Yet, to create/open 204k empty files to simulate that 200TB cluster on
that 128TB fs:

$ time ./createfiles                           # real 0m2.140s
$ time ls -l /xfs/many_files_dir/ > /dev/null  # real 0m0.697s
$ time ./openfiles                             # real 0m0.440s (hot)
$ time ./openfiles                             # real 0m29.872s (cold, after echo 3 > drop_caches)
$ time ./openfiles                             # real 0m0.443s (hot)
$ time ./openfiles                             # real 0m31.050s (cold, after echo 3 > drop_caches)

$ echo 3 > /proc/sys/vm/drop_caches
$ time ls -l /xfs/many_files_dir/ > /dev/null  # real 1m2.345s
$ time ./openfiles                             # real 0m0.437s (hot)
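
For completeness, the createfiles/openfiles helpers are nothing fancy.
The actual programs are not attached here, so the sketch below is only
an approximation of what they boil down to: a single open()+close()
loop over ~204k path names, with creation vs. plain open toggled by a
command-line flag; time(1) does the measuring.

/* Approximate sketch of the benchmark helper (not the actual program):
 * with the "create" argument it creates the empty files, otherwise it
 * just open()s and close()s the existing ones. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NFILES 204800            /* ~200TB worth of 1GB segments */

int
main(int argc, char **argv)
{
    int     create = (argc > 1 && strcmp(argv[1], "create") == 0);
    char    path[64];

    for (int i = 0; i < NFILES; i++)
    {
        int     fd;

        snprintf(path, sizeof(path), "many_files_dir/%d", i);
        fd = open(path, create ? (O_CREAT | O_WRONLY) : O_RDONLY, 0644);
        if (fd < 0)
        {
            perror(path);
            exit(1);
        }
        close(fd);
    }
    return 0;
}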

Anyway, I do not know whether opening all the files at startup (or just
during crash recovery?) is the proper way, but given the option I would
personally take +30s to open a 200TB database any day over the risk of
silent data loss and wrong results. I am sure, however, that silently
ignoring missing files on access is not how things should look, and I
would speculate that files could even go missing while the cluster is
up and running without you being notified in any way (due to human
error: imagine someone accidentally rsyncing in the wrong [opposite]
direction, or something like that).
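
To be explicit about what I'm arguing for (a hand-wavy sketch, not a
patch): assuming the server already has the list of main-fork segment
paths it expects to exist (however that list is obtained), the per-file
cost is just one open(), and an ENOENT is what should be turned into a
loud error instead of being ignored:

/* Hypothetical standalone check: try to open each expected segment path
 * passed on the command line and complain about the missing ones. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int
check_segment_exists(const char *path)
{
    int     fd = open(path, O_RDONLY);

    if (fd < 0)
    {
        if (errno == ENOENT)
            fprintf(stderr, "missing main-fork segment: %s\n", path);
        else
            perror(path);
        return 0;
    }
    close(fd);
    return 1;
}

int
main(int argc, char **argv)
{
    int     missing = 0;

    for (int i = 1; i < argc; i++)
        if (!check_segment_exists(argv[i]))
            missing++;
    return missing > 0;
}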

-J.


