Re: pg_combinebackup --copy-file-range - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: pg_combinebackup --copy-file-range
Date
Msg-id CA+hUKGJw-+S+BaON0yoS10iUC1mcnNWs7Wiaugxfd4Vy8d8HMw@mail.gmail.com
In response to Re: pg_combinebackup --copy-file-range  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On Sun, Mar 31, 2024 at 5:33 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> I'm on 2.2.2 (on Linux). But there's something wrong, because the
> pg_combinebackup that took ~150s on xfs/btrfs, takes ~900s on ZFS.
>
> I'm not sure it's a ZFS config issue, though, because it's not CPU or
> I/O bound, and I see this on both machines. And some simple dd tests
> show the zpool can do 10x the throughput. Could this be due to the file
> header / pool alignment?

Could ZFS recordsize > 8kB be making it worse, repeatedly dealing with
the same 128kB record as you copy_file_range 16 x 8kB blocks?
(Guessing you might be using the default recordsize?)

> I admit I'm not very familiar with the format, but you're probably right
> there's a header, and header_length does not seem to consider alignment.
> make_incremental_rfile simply does this:
>
>     /* Remember length of header. */
>     rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
>         sizeof(rf->truncation_block_length) +
>         sizeof(BlockNumber) * rf->num_blocks;
>
> and sendFile() does the same thing when creating incremental basebackup.
> I guess it wouldn't be too difficult to make sure to align this to
> BLCKSZ or something like this. I wonder if the file format is documented
> somewhere ... It'd certainly be nicer to tweak before v18, if necessary.
>
> Anyway, is that really a problem? I mean, in my tests the CoW stuff
> seemed to work quite fine - at least on the XFS/BTRFS. Although, maybe
> that's why it took longer on XFS ...

Yeah, I'm not sure; I assume it did more allocating and copying because
of that.  It doesn't matter much: it would be fine if a first version
weren't as good as possible, and fine if we tune the format later once
we know more, i.e. leaving improvements on the table for now.  I just
wanted to share the observation.  I wouldn't be surprised if the
block-at-a-time coding makes it slower and maybe makes the on-disk data
structures worse, but I'm just guessing.

It would also be interesting, though not required right now, to figure
out how to tune ZFS well for this purpose...


