Re: pg_combinebackup --copy-file-range - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: pg_combinebackup --copy-file-range
Msg-id: add23637-03a2-4fd0-890b-5fa64bb2530a@enterprisedb.com
In response to: Re: pg_combinebackup --copy-file-range (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers
On 3/31/24 03:03, Thomas Munro wrote:
> On Sun, Mar 31, 2024 at 1:37 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> So I decided to take a stab at Thomas' idea, i.e. reading the data to
>> ...
>> I'll see how this works on EXT4/ZFS next ...
>
> Wow, very cool! A couple of very quick thoughts/notes:
>
> ZFS: the open source version only gained per-file block cloning in
> 2.2, so if you're on an older release I expect copy_file_range() to
> work but not really do the magic thing. On the FreeBSD version you
> also have to turn cloning on with a sysctl because people were worried
> about bugs in early versions, so by default you still get actual
> copying; not sure if you need something like that on the Linux
> version... (Obviously ZFS is always completely COW-based, but before
> the new block cloning stuff it could only share blocks by its own
> magic deduplication if enabled, or by cloning/snapshotting a whole
> dataset/mountpoint; there wasn't a way to control it explicitly like
> this.)
>

I'm on 2.2.2 (on Linux). But there's something wrong, because the
pg_combinebackup run that took ~150s on XFS/BTRFS takes ~900s on ZFS. I'm
not sure it's a ZFS config issue, though, because it's not CPU- or
I/O-bound, and I see this on both machines. And some simple dd tests show
the zpool can do 10x the throughput. Could this be due to the file header /
pool alignment?

> Alignment: block sharing on any fs requires it. I haven't re-checked
> recently, but IIRC the incremental file format might have a
> non-block-sized header? That means that if you copy_file_range() from
> both the older backup and also the incremental backup, only the former
> will share blocks, and the latter will silently be done by copying to
> newly allocated blocks. If that's still true, I'm not sure how hard
> it would be to tweak the format to inject some padding and to make
> sure that there isn't any extra header before each block.
I admit I'm not very familiar with the format, but you're probably right
there's a header, and header_length does not seem to consider alignment.
make_incremental_rfile simply does this:

    /* Remember length of header. */
    rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
        sizeof(rf->truncation_block_length) +
        sizeof(BlockNumber) * rf->num_blocks;

and sendFile() does the same thing when creating an incremental basebackup.
I guess it wouldn't be too difficult to make sure this is aligned to BLCKSZ
or something like that. I wonder if the file format is documented somewhere ...
It'd certainly be nicer to tweak it before v18, if necessary.

Anyway, is that really a problem? I mean, in my tests the CoW stuff seemed
to work quite fine - at least on XFS/BTRFS. Although, maybe that's why it
took longer on XFS ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company