Re: pg_combinebackup --copy-file-range - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: pg_combinebackup --copy-file-range
Msg-id: d9d115a7-c5e2-4258-9fc3-7514ab33f373@enterprisedb.com
In response to: Re: pg_combinebackup --copy-file-range (Jakub Wartak <jakub.wartak@enterprisedb.com>)
List: pgsql-hackers
Hi,

Here's a much more polished and cleaned-up version of the patches, fixing all the issues I've been aware of, and with various pieces merged into much more cohesive parts (instead of keeping them separate to make the changes/evolution more obvious).

I decided to reorder the changes like this:

1) block alignment - As I said earlier, I think we should definitely do this, even if only to make future improvements possible. After chatting about this with Robert off-list a bit, he pointed out I actually forgot to not align the headers for files without any blocks, so this version fixes that.

2) add clone/copy_file_range for the case that copies whole files. This is pretty simple, but it also adds the --clone/--copy-file-range options, and similar infrastructure. The one slightly annoying bit is that we now have the ifdef stuff in two places - when parsing the options, and then in the copy_file_XXX methods - and the pg_fatal() calls should not be reachable in practice. But that doesn't seem harmful, and might be a useful protection against someone calling a function that does nothing.

This also merges the original two parts, where the first one only did this cloning/CoW stuff when checksums did not need to be calculated, and the second extended it to those cases too (by also reading the data, but still doing the copy the old way). I think this is the right way - if that's not desirable, it's easy to either add --no-manifest or not use the CoW options. Or just not change the checksum type. There's no other way.

3) add copy_file_range to write_reconstructed_file, using roughly the same logic/reasoning as (2). If --copy-file-range is specified and a checksum should be calculated, the data is read for the checksum, but the copy is done using copy_file_range.
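The whole-file path in (2) boils down to something like the following (a minimal sketch only, not the actual pg_combinebackup code; the function name is illustrative and error handling is simplified):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* copy_whole_file is an illustrative name, not pg_combinebackup's */
static void
copy_whole_file(const char *src, const char *dst)
{
	int			srcfd = open(src, O_RDONLY);
	int			dstfd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0666);
	off_t		remaining;
	ssize_t		nbytes;

	if (srcfd < 0 || dstfd < 0)
	{
		perror("open");
		exit(1);
	}

	remaining = lseek(srcfd, 0, SEEK_END);
	lseek(srcfd, 0, SEEK_SET);

	/*
	 * Let the kernel move (or, on CoW filesystems such as XFS/btrfs,
	 * share) the blocks - no data passes through user space, which is
	 * where the space savings come from.
	 */
	while (remaining > 0)
	{
		nbytes = copy_file_range(srcfd, NULL, dstfd, NULL,
								 (size_t) remaining, 0);
		if (nbytes <= 0)
		{
			perror("copy_file_range");
			exit(1);
		}
		remaining -= nbytes;
	}

	close(srcfd);
	close(dstfd);
}
```

The real code of course has to fall back to plain read()/write() when copy_file_range() is unavailable, which is what the ifdef stuff mentioned above deals with.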
I did rework the write_reconstructed_file() flow a bit, because tracking what exactly needs to be read/written in each of the cases (which copy method, zeroed block, checksum calculated) made the flow really difficult to follow. Instead I introduced a function to read/write a block, and call it from different places. I think this is much better, and it also makes the following part dealing with batches of blocks a much easier / smaller change.

4) prefetching - This is mostly unrelated to the CoW stuff, but it has tremendous benefits, especially for ZFS. I haven't been able to tune ZFS to get decent performance, and ISTM it'd be rather unusable for backup purposes without this.

5) batching in write_reconstructed_file - This benefits especially the copy_file_range case, where I've seen it yield massive speedups (see the message from Monday for better data).

6) batching for prefetch - Similar logic to (5), but for fadvise. This is the only part where I'm not entirely sure whether it actually helps or not. Needs some more analysis, but I'm including it for completeness.

I do think parts (1)-(4) are in pretty good shape, good enough to make them committable in a day or two. I see it mostly as a matter of testing and comment improvements rather than code changes. (5) is in pretty good shape too, but I'd like to maybe simplify and refine the write_reconstructed_file changes a bit more. I don't think it's broken, but it feels a bit too cumbersome. Not sure about (6) yet.

I changed how I think about this a bit - I don't really see the CoW copy methods as necessarily faster than the regular copy (even though they can be with (5)). I think the main benefits are the space savings, enabled by patches (1)-(3). If (4) and (5) get in, that's a bonus, but even without that I don't think the performance is an issue - everything has a cost.
On 4/3/24 15:39, Jakub Wartak wrote:
> On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Hi,
>>
>> I've been running some benchmarks and experimenting with various stuff,
>> trying to improve the poor performance on ZFS, and the regression on XFS
>> when using copy_file_range. And oh boy, did I find interesting stuff ...
>
> [..]
>
> Congratulations on great results!
>
>> 4) after each pg_combinebackup run to pg_verifybackup, start the cluster
>> to finish recovery, run pg_checksums --check (to check the patches don't
>> produce something broken)
>
> I've performed some follow-up small testing on all patches mentioned
> here (1..7), with the earlier developed nano-incremental-backup-tests
> that helped detect some issues for Robert earlier during original
> development. They all went fine in both cases:
> - no special options when using pg_combinebackup
> - using pg_combinebackup --copy-file-range --manifest-checksums=NONE
>
> Those were:
> test_across_wallevelminimal.sh
> test_full_pri__incr_stby__restore_on_pri.sh
> test_full_pri__incr_stby__restore_on_stby.sh
> test_full_stby__incr_stby__restore_on_pri.sh
> test_full_stby__incr_stby__restore_on_stby.sh
> test_incr_after_timelineincrease.sh
> test_incr_on_standby_after_promote.sh
> test_many_incrementals_dbcreate_duplicateOID.sh
> test_many_incrementals_dbcreate_filecopy_NOINCR.sh
> test_many_incrementals_dbcreate_filecopy.sh
> test_many_incrementals_dbcreate.sh
> test_many_incrementals.sh
> test_multixact.sh
> test_pending_2pc.sh
> test_reindex_and_vacuum_full.sh
> test_repro_assert_RP.sh
> test_repro_assert.sh
> test_standby_incr_just_backup.sh
> test_stuck_walsum.sh
> test_truncaterollback.sh
> test_unlogged_table.sh
>
>> Now to the findings ....
>>

Thanks. Would be great if you could run this on the attached version of
the patches, ideally for each of them independently, to make sure it
doesn't get broken+fixed somewhere on the way.

>>
>> 1) block alignment
>
> [..]
>
>> And I think we probably want to do this now, because this affects all
>> tools dealing with incremental backups - even if someone writes a custom
>> version of pg_combinebackup, it will have to deal with misaligned data.
>> Perhaps there might be something like pg_basebackup that "transforms"
>> the data received from the server (and also the backup manifest), but
>> that does not seem like a great direction.
>
> If anything is on the table, then I think in the far future
> pg_refresh_standby_using_incremental_backup_from_primary would be the
> only other tool using the format ?
>

Possibly, but I was thinking more about backup solutions using the same
format, but doing the client side differently. Essentially, something
that would still use the server side to generate incremental backups,
but replace pg_combinebackup to do this differently (stream the data
somewhere else, index it somehow, or whatever).

>> 2) prefetch
>> -----------
>
> [..]
>
>> I think this means we may need a "--prefetch" option, that'd force
>> prefetching, probably both before pread and copy_file_range. Otherwise
>> people on ZFS are doomed and will have poor performance.
>
> Right, we could optionally cover in the docs later on various options
> to get the performance (on XFS use $this, but without $that and so
> on). It's kind of madness dealing with all those performance
> variations.
>

Yeah, something like that. I'm not sure we want to talk too much about
individual filesystems in our docs, because those things evolve over
time too. And this also depends on how large the increment is. If it
only modifies 1% of the blocks, then 99% will come from the full backup,
and the sequential prefetch should do OK (well, maybe not on ZFS). But
as the incremental backup gets larger / more random, the prefetch
becomes more and more important.

> Another idea: remove that 128 posix_fadvise() hardcode in 0002 and a
> getopt variant like: --prefetch[=HOWMANY] with 128 being as default ?
>

I did think about that, but there's a dependency on the batching. If
we're prefetching ~1MB of data, we may need to prefetch up to ~1MB
ahead. Because otherwise we might want to read 1MB and only a tiny part
of that would be prefetched. I was thinking maybe we could skip the
sequential parts, but ZFS does need that. So I don't think we can just
allow users to set arbitrary values, at least not without also tweaking
the batch size. Or maybe 1MB batches are too large, and we should use
something smaller? I need to think about this a bit more ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
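PS: the batch/prefetch dependency above reduces to simple arithmetic
(an illustrative sketch only - the name and shape are not from the
patches): with batched reads of up to batch_bytes, the prefetch
distance in blocks must cover at least one full batch, or a batched
read overruns the advised range.

```c
#include <assert.h>

/*
 * Minimum prefetch distance (in blocks) required to cover one full
 * batched read; purely illustrative arithmetic, names not from the
 * patches.
 */
static long
min_prefetch_blocks(long batch_bytes, long block_size)
{
	/* round the batch up to whole blocks */
	return (batch_bytes + block_size - 1) / block_size;
}
```

Which is why the hardcoded 128 (= 1MB of 8kB blocks) and the ~1MB batch size go together, and why exposing --prefetch[=HOWMANY] alone isn't quite enough.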
Attachment
- v20240403-0001-Properly-align-blocks-in-incremental-backu.patch
- v20240403-0002-Allow-copying-files-using-clone-copy_file_.patch
- v20240403-0003-Use-copy_file_range-in-write_reconstructed.patch
- v20240403-0004-Prefetch-blocks-read-by-write_reconstructe.patch
- v20240403-0005-Try-copying-larger-chunks-of-data-from-the.patch
- v20240403-0006-Try-prefetching-larger-chunks-of-data-from.patch