Re: pg_combinebackup --copy-file-range - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: pg_combinebackup --copy-file-range
Date
Msg-id add23637-03a2-4fd0-890b-5fa64bb2530a@enterprisedb.com
In response to Re: pg_combinebackup --copy-file-range  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: pg_combinebackup --copy-file-range
List pgsql-hackers
On 3/31/24 03:03, Thomas Munro wrote:
> On Sun, Mar 31, 2024 at 1:37 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> So I decided to take a stab at Thomas' idea, i.e. reading the data to
>> ...
>> I'll see how this works on EXT4/ZFS next ...
> 
> Wow, very cool!  A couple of very quick thoughts/notes:
> 
> ZFS: the open source version only gained per-file block cloning in
> 2.2, so if you're on an older release I expect copy_file_range() to
> work but not really do the magic thing.  On the FreeBSD version you
> also have to turn cloning on with a sysctl because people were worried
> about bugs in early versions so by default you still get actual
> copying, not sure if you need something like that on the Linux
> version...  (Obviously ZFS is always completely COW-based, but before
> the new block cloning stuff it could only share blocks by its own
> magic deduplication if enabled, or by cloning/snapshotting a whole
> dataset/mountpoint; there wasn't a way to control it explicitly like
> this.)
> 
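
For reference, a sketch of the knobs involved, to the best of my knowledge (names taken from the OpenZFS 2.2 sources; worth double-checking on the actual systems):

```shell
# FreeBSD: block cloning is gated behind a sysctl (off by default):
sysctl vfs.zfs.bclone_enabled=1

# Linux (OpenZFS 2.2.x): there appears to be an analogous module
# parameter; if it reads 0, copy_file_range() falls back to plain
# copying instead of cloning:
cat /sys/module/zfs/parameters/zfs_bclone_enabled
echo 1 > /sys/module/zfs/parameters/zfs_bclone_enabled
```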

I'm on 2.2.2 (on Linux). But there's something wrong, because the
pg_combinebackup run that took ~150s on XFS/BTRFS takes ~900s on ZFS.

I'm not sure it's a ZFS config issue, though, because it's not CPU or
I/O bound, and I see this on both machines. And some simple dd tests
show the zpool can do about 10x that throughput. Could this be due to
the file header / pool alignment?
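
For context, the dd tests were along these lines (a sketch; the real runs wrote to the zpool mountpoint rather than a temp file):

```shell
# Rough sequential-throughput sanity check, similar to the "simple dd
# tests" above. Point the scratch file at the zpool mountpoint to test
# the pool itself; mktemp is used here only to keep the sketch runnable.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1M count=256 conv=fdatasync   # write path
dd if="$f" of=/dev/null bs=1M                            # read path
rm -f "$f"
```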

> Alignment: block sharing on any fs requires it.  I haven't re-checked
> recently but IIRC the incremental file format might have a
> non-block-sized header?  That means that if you copy_file_range() from
> both the older backup and also the incremental backup, only the former
> will share blocks, and the latter will silently be done by copying to
> newly allocated blocks.  If that's still true, I'm not sure how hard
> it would be to tweak the format to inject some padding and to make
> sure that there isn't any extra header before each block.

I admit I'm not very familiar with the format, but you're probably right
that there's a header, and header_length does not seem to consider
alignment. make_incremental_rfile simply does this:

    /* Remember length of header. */
    rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
        sizeof(rf->truncation_block_length) +
        sizeof(BlockNumber) * rf->num_blocks;

and sendFile() does the same thing when creating an incremental
basebackup. I guess it wouldn't be too difficult to make sure this is
aligned to BLCKSZ or something like that. I wonder if the file format is
documented somewhere ... It'd certainly be nicer to tweak the format
before v18, if necessary.

Anyway, is that really a problem? I mean, in my tests the CoW stuff
seemed to work quite fine - at least on XFS/BTRFS. Although, maybe
that's why it took longer on XFS ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


