Re: pg_upgrade --copy-file-range - Mailing list pgsql-hackers

From Jakub Wartak
Subject Re: pg_upgrade --copy-file-range
Msg-id CAKZiRmyQ_F+OxHUi0+po9wnM=iwB0XUd=-ZT0ry_mOQJRnwmfA@mail.gmail.com
In response to Re: pg_upgrade --copy-file-range  (Michael Paquier <michael@paquier.xyz>)
Responses Re: pg_upgrade --copy-file-range
List pgsql-hackers
Hi Thomas, Michael, Peter and -hackers,

On Sun, Dec 24, 2023 at 3:57 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Sat, Dec 23, 2023 at 09:52:59AM +1300, Thomas Munro wrote:
> > As it happens I was just thinking about this particular patch because
> > I suddenly had a strong urge to teach pg_combinebackup to use
> > copy_file_range.  I wonder if you had the same idea...
>
> Yeah, +1.  That would make copy_file_blocks() more efficient where the
> code is copying 50 blocks in batches because it needs to reassign
> checksums to the blocks copied.

I've tried to achieve what you were discussing. This was actually my
first thought when using pg_combinebackup with larger (realistic)
backup sizes back in December. Attached is a set of very DIRTY (!)
patches that add CoW options (--clone/--copy-file-range) to
pg_combinebackup (just like pg_upgrade, to keep the two in sync), while
also refactoring some related bits of code to avoid duplication.

With XFS (with reflink=1 which is default) on Linux with kernel 5.10
and ~210GB backups, I'm getting:

root@jw-test-1:/xfs# du -sm *
210229  full
250     incr.1

Today in master, with the classic read()/write() loop and no
CoW/reflink optimization:

root@jw-test-1:/xfs# rm -rf outtest; sync; sync; sync; \
    echo 3 | sudo tee /proc/sys/vm/drop_caches; \
    time /usr/pgsql17/bin/pg_combinebackup \
    --manifest-checksums=NONE -o outtest full incr.1
3

real    49m43.963s
user    0m0.887s
sys     2m52.697s

Versus the patch with "--clone":

root@jw-test-1:/xfs# rm -rf outtest; sync; sync; sync; \
    echo 3 | sudo tee /proc/sys/vm/drop_caches; \
    time /usr/pgsql17/bin/pg_combinebackup \
    --manifest-checksums=NONE --clone -o outtest full incr.1
3

real    0m39.812s
user    0m0.325s
sys     0m2.401s

So that is roughly 49 minutes down to 40 seconds(!), +/-10s over 3
tries, when the FS supports CoW/reflinks (XFS, BTRFS, upcoming
bcachefs?). To me this suggests that anyone who actually wants to use
incremental backups (to get minimal RTO) would be wise to use a CoW
filesystem from the start, so that RTO is as low as possible.
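
To illustrate the mechanisms being compared, here is a minimal sketch in
C of a CoW-aware file copy on Linux. It is NOT taken from the attached
patches: copy_file_cow() and copy_by_blocks() are hypothetical names,
and the automatic try-clone-then-fall-back order is an assumption made
for illustration (the patches make the method opt-in through a switch).
It tries ioctl(FICLONE) first, then copy_file_range(2), and finally a
plain read()/write() loop.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for copy_file_range() in glibc */
#endif
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/fs.h>           /* FICLONE */

/* Hypothetical helper: the classic read()/write() loop, no CoW. */
static int
copy_by_blocks(int src_fd, int dst_fd)
{
    char        buf[128 * 1024];
    ssize_t     nread;

    while ((nread = read(src_fd, buf, sizeof(buf))) > 0)
    {
        ssize_t     done = 0;

        while (done < nread)    /* handle short writes */
        {
            ssize_t     n = write(dst_fd, buf + done, nread - done);

            if (n < 0)
                return -1;
            done += n;
        }
    }
    return nread < 0 ? -1 : 0;
}

/*
 * Hypothetical helper: copy src to dst, preferring CoW-friendly methods.
 * Returns 0 on success, -1 on failure.  On success *method names the
 * mechanism that was used: "clone", "copy_file_range" or "copy".
 */
int
copy_file_cow(const char *src, const char *dst, const char **method)
{
    int         src_fd, dst_fd, rc = -1;

    if ((src_fd = open(src, O_RDONLY)) < 0)
        return -1;
    if ((dst_fd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600)) < 0)
    {
        close(src_fd);
        return -1;
    }

    if (ioctl(dst_fd, FICLONE, src_fd) == 0)
    {
        /* Reflink: dst shares src's extents, no data is copied. */
        *method = "clone";
        rc = 0;
    }
    else
    {
        ssize_t     n;
        off_t       total = 0;

        /* NULL offsets: use and advance both file positions. */
        while ((n = copy_file_range(src_fd, NULL, dst_fd, NULL,
                                    1024 * 1024, 0)) > 0)
            total += n;

        if (n == 0)
        {
            *method = "copy_file_range";
            rc = 0;
        }
        else if (total == 0)
        {
            /* Not supported here and nothing copied yet: classic copy. */
            *method = "copy";
            rc = copy_by_blocks(src_fd, dst_fd);
        }
    }

    close(src_fd);
    if (close(dst_fd) != 0)
        rc = -1;
    return rc;
}
```

Calling copy_file_cow("full/somefile", "outtest/somefile", &method)
(paths are placeholders) on a reflink-capable filesystem should report
"clone"; elsewhere it degrades to one of the other two methods. A real
tool would likely report an error rather than silently fall back when
the user explicitly asked for --clone.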

Random patch notes:
- the main meat is in v3-0002*; I hope I did not seriously screw
something up
- in the worst case: it is opt-in through a switch, so the user can
always stick to the classic copy
- no docs so far
- pg_copyfile_offload_supported() should actually be fixed if this is a
good path forward
- pgindent actually reindents larger areas of code than I would like;
any ideas, or is that OK?
- not tested on Win32/macOS/FreeBSD
- I've tested pg_upgrade manually and it seems to work and issue the
correct syscalls, however some tests are failing(?). I haven't
investigated why yet due to lack of time.

Any help is appreciated.

-J.
