Re: pg_upgrade reflink support on OpenZFS - Mailing list pgsql-general

From Marcel Menzel
Subject Re: pg_upgrade reflink support on OpenZFS
Date
Msg-id 5fd60425-db26-4700-b716-5be3762acd33@menzel.de
In response to Re: pg_upgrade reflink support on OpenZFS  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-general
On 15/11/2025 05:17, Thomas Munro wrote:
> On Sat, Nov 15, 2025 at 7:16 AM Marcel Menzel <marcel@menzel.de> wrote:
>> For the PostgreSQL upgrade to version 18, I took the opportunity to test
>> the reflink support in pg_upgrade (with --clone) on OpenZFS 2.3.4 /
>> Linux 6.15.11 and it worked flawlessly, being a huge time saver here.
> 
> Nice!
> 
>> I've looked into the documentation for pg_upgrade and it only
>> mentions btrfs and XFS on Linux and not FreeBSD at all, so I thought
>> it'd be an interesting heads-up to report that Linux gained a 3rd FS
>> and, I think, FreeBSD in general gained the ability to do reflink copies.
> 
> It does mention both Linux and FreeBSD under --copy-file-range.  I
> didn't try to list all the relevant file systems there though, partly
> because I didn't feel like documenting all the quirks (only works if
> you created your XFS file system with the feature enabled, might need
> to frobnicate ZFS sysctl, which NFS clients and servers can push it
> down, likewise for non-COW file systems and device drivers, etc etc).
> It might be nice to find a decent reference for all that stuff
> somewhere else and point to it, but I don't think we can maintain that
> accurately ourselves.
> 
> I was actually surprised to hear that ioctl(dest_fd, FICLONE, src_fd)
> worked for you.  I knew that it was really BTRFS's ioctl and XFS
> accepted it too, but I didn't know that ZFS also understood it[1] in
> 2.3.  They apparently didn't really expect anyone to call it, and
> since ZFS 2.4 is apparently about to ship without it[2], it seems like
> a bad time to add it to the documentation for --clone.

Oh, I haven't looked at upcoming versions yet, but yeah, it doesn't
make sense to mention this then.
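
For anyone following along, the "clone or fail" call discussed above,
ioctl(dest_fd, FICLONE, src_fd), can be sketched in Python roughly as
below. This is only an illustration and not what pg_upgrade does
internally. The numeric value of FICLONE is an assumption that holds on
common Linux architectures, and clone_or_fail / try_clone are
hypothetical helper names.

```python
import fcntl
import os

# Assumption: FICLONE is _IOW(0x94, 9, int), i.e. 0x40049409 on x86-64 and
# arm64 Linux. Use the constant from the fcntl module if this Python has it.
FICLONE = getattr(fcntl, "FICLONE", 0x40049409)


def clone_or_fail(src_path, dst_path):
    """Make dst share src's blocks, or raise OSError if that is impossible."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Same argument order as the C call: ioctl(dest_fd, FICLONE, src_fd)
        fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())


def try_clone(src_path, dst_path):
    """Return True if the file was cloned, False if the ioctl was refused."""
    try:
        clone_or_fail(src_path, dst_path)
        return True
    except OSError as exc:
        print(f"clone refused: {exc}")
        # Remove the empty destination left behind by the failed attempt.
        if os.path.exists(dst_path):
            os.unlink(dst_path)
        return False
```

On a file system without reflink support, try_clone() should simply
return False, since there is no silent fallback to a regular copy.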

>> OpenZFS has supported this since 2.2 but had it disabled due to data
>> corruption bugs. Since 2.3 the sysctl (zfs_bclone_enabled on
>> Linux, vfs.zfs.bclone_enabled on FreeBSD) has been enabled by default, so
>> only the zpool feature "block_cloning" has to be enabled, which might
>> already be the case after running "zpool upgrade".
> 
> Yeah, those data corruption reports (which turned out to be
> misattributed IIRC?) provided one reason to keep the old BTRFS ioctl()
> under --clone but add the new behaviour under --copy-file-range.
> --copy-file-range should work for all COW filesystems on Linux via
> proper VFS entrypoints, and is the official way to do this from user
> space.  Perhaps we should eventually harmonise this under a single
> option and drop the ioctl() stuff.  One semantic change would be that
> copy_file_range() means "copy with your best trick" (could be cloning,
> network/driver pushdown or user space buffer copy, silently selecting
> the behaviour), while the BTRFS ioctl() means "clone or fail" IIRC, so
> that was another reason to want a separate option for now.

I haven't looked closely at the copy_file_range() syscall and how tools
interact with it yet, but I've found this interesting GitHub
comment[3], which gives me a clearer picture now. It's totally
understandable why OpenZFS is removing the compat for those BTRFS
ioctls, since they now have a proper replacement.

The OpenZFS docs[4][5] also mention the copy_file_range() syscall
going through the BRT (Block Reference Table), so I guess I'll use
pg_upgrade with --copy-file-range the next time.
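
To illustrate the "copy with your best trick" semantics described
above, here is a minimal Python sketch built on os.copy_file_range().
The function name, chunk size and the user space fallback are
assumptions made for the example; it is not pg_upgrade's code. Note that
the caller cannot tell whether the kernel cloned blocks, pushed the copy
down, or copied through buffers.

```python
import os

CHUNK = 1 << 20  # 1 MiB per call; an arbitrary choice for this sketch


def copy_best_effort(src_path, dst_path):
    """Copy src to dst and return the name of the mechanism that was used."""
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        size = os.fstat(src).st_size
        try:
            copied = 0
            while copied < size:
                # Explicit offsets leave both file positions untouched.
                n = os.copy_file_range(src, dst, min(CHUNK, size - copied),
                                       copied, copied)
                if n == 0:
                    break
                copied += n
            return "copy_file_range"
        except (AttributeError, OSError):
            # Syscall missing or refused (for example ENOSYS or EXDEV):
            # start over with a plain user space copy.
            os.ftruncate(dst, 0)
        offset = 0
        while True:
            buf = memoryview(os.pread(src, CHUNK, offset))
            if not buf:
                break
            while buf:
                written = os.pwrite(dst, buf, offset)
                buf = buf[written:]
                offset += written
        return "read/write"
    finally:
        os.close(src)
        os.close(dst)
```

Unlike the ioctl() route, this never fails just because the file system
cannot clone, which is the semantic difference mentioned above.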

> For reference, the macOS copyfile() call used for --clone has flags
> that should cause it to fail if it can't clone IIUC, while the Windows
> CopyFile() call used for --copy might even clone blocks on ReFS even
> if you don't specify --clone... huh.
> 
>> I haven't had the possibility to check this on FreeBSD yet, but I don't
>> see why this should not work as I also can't spot anything in the
>> OpenZFS docs regarding reflink / block cloning limitations on FreeBSD.
>> Also I saw one of the OpenZFS devs writing on Reddit about block cloning
>> being supported on FreeBSD v14.
> 
> It always succeeds on FreeBSD, but it only actually clones if you set
> vfs.zfs.bclone_enabled=1. I've tested all our "clone" features with
> that and they work nicely.  The sysctl wasn't on by default in FreeBSD
> 14.x, but 15 is about to ship and the "experimental" label was removed
> in man 4 zfs.
> 
> If you haven't seen them yet, you might also like these COW tricks:
> 
> Shared storage of basic catalog tables when you have a lot of databases:
> SET file_copy_method = CLONE;
> CREATE DATABASE ... STRATEGY=FILE_COPY;
> 
> Fast database clone/snapshot of very large databases (caveats: users
> can't be connected to source, checkpoint forced):
> SET file_copy_method = CLONE;
> CREATE DATABASE ... STRATEGY=FILE_COPY TEMPLATE=source_db;
> 
> Combine a chain of incremental backups and a full backup to produce a
> new full backup, sharing disk blocks with the ancestor backups:
> pg_combinebackup --copy-file-range
> 
> That last one is a really powerful use of copy_file_range()'s subfile
> cloning powers.  Another subfile cloning trick I've proposed before is
> making relation segment size user-controllable, and then allowing
> pg_upgrade to migrate between segment sizes by splicing them together.

Oh, those are really handy commands, especially the last one, yes. Many 
thanks for pointing these out!
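
As a rough illustration of the subfile cloning idea (joining several
files into one by copying byte ranges at chosen offsets), a Python
sketch could look like the following. This is not how pg_combinebackup
or pg_upgrade are implemented; append_range and splice_segments are
hypothetical names, and the pread/pwrite fallback is an assumption added
so the example also runs where copy_file_range() is unavailable.

```python
import os


def append_range(src_fd, dst_fd, src_off, dst_off, length):
    """Copy length bytes from src_off in src to dst_off in dst."""
    done = 0
    while done < length:
        try:
            n = os.copy_file_range(src_fd, dst_fd, length - done,
                                   src_off + done, dst_off + done)
        except (AttributeError, OSError):
            # Fallback for this sketch: ordinary positional read and write.
            data = os.pread(src_fd, min(length - done, 1 << 20),
                            src_off + done)
            n = os.pwrite(dst_fd, data, dst_off + done) if data else 0
        if n == 0:
            break
        done += n
    return done


def splice_segments(segment_paths, dst_path):
    """Concatenate the given files into dst_path; return total bytes."""
    dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        dst_off = 0
        for path in segment_paths:
            src_fd = os.open(path, os.O_RDONLY)
            try:
                size = os.fstat(src_fd).st_size
                # Each segment lands right after the previous one.
                dst_off += append_range(src_fd, dst_fd, 0, dst_off, size)
            finally:
                os.close(src_fd)
        return dst_off
    finally:
        os.close(dst_fd)
```

Whether the ranges end up sharing blocks with the source files is up to
the kernel and the file system; the code only expresses the request.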

> [1] https://github.com/openzfs/zfs/commit/9927f219f1e9f4ee886d426190500abf5b1d602e
> [2] https://github.com/openzfs/zfs/commit/4800181b3b950d67a62aca7c9e28d34c8b303242

[3] https://github.com/openzfs/zfs/pull/13392#issuecomment-1742172842
[4] https://openzfs.github.io/openzfs-docs/man/master/7/zpool-features.7.html#block_cloning
[5] https://openzfs.github.io/openzfs-docs/man/master/7/zfsconcepts.7.html#Block_cloning



