Re: pg_combinebackup --copy-file-range - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: pg_combinebackup --copy-file-range
Msg-id: 5ac425a4-6201-4f24-912e-8eed6905790a@enterprisedb.com
In response to: Re: pg_combinebackup --copy-file-range (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: pg_combinebackup --copy-file-range
           Re: pg_combinebackup --copy-file-range
List: pgsql-hackers
Hi,

I've been running some benchmarks and experimenting with various stuff, trying to improve the poor performance on ZFS and the regression on XFS when using copy_file_range. And oh boy, did I find interesting stuff ...

Attached is a PDF with the results of my benchmark for ZFS/XFS/BTRFS, on my two machines. I already briefly described what the benchmark does, but to clarify:

1) generate backups: initialize pgbench scale 5000, do a full backup, update roughly 1%, 10% and 20% of the blocks and do an incremental backup after each of those steps

2) combine backups: full + 1%, full + 1% + 10%, full + 1% + 10% + 20%

3) measure how long it takes and how much more disk space is used (to see how well the CoW stuff works)

4) after each pg_combinebackup run, run pg_verifybackup, start the cluster to finish recovery, and run pg_checksums --check (to check the patches don't produce something broken)

There's a lot of interesting stuff to discuss, some of which was already mentioned in this thread earlier - in particular, I want to talk about block alignment, prefetching and processing larger chunks of blocks.

Also attached are all the patches, including the ugly WIP parts discussed later, the complete results if you want to do your own analysis, and the scripts used to generate/restore the backups.

FWIW I'm not claiming the patches are commit-ready (especially not the new WIP parts), but they should be correct and good enough for discussion (that applies especially to 0007). I think I could get them ready in a day or two, but I'd like some feedback on my findings, and if someone has objections to getting this in so shortly before the feature freeze, I'd prefer to know about it.

The patches are numbered the same as in the benchmark results, i.e. 0001 is "1", 0002 is "2" etc. The "0-baseline" option is current master without any patches.

Now to the findings ....


1) block alignment
------------------

This was mentioned by Thomas a couple days ago, when he pointed out that the incremental files have a variable-length header (to record which blocks are stored in the file), followed by the block data, which means the block data is not aligned to the fs block. I hadn't realized this, I just used whatever the reconstruction function received, but Thomas pointed out this may interfere with CoW, which needs the blocks to be aligned.

And I think he's right, and my tests confirm this. I did a trivial patch to align the blocks to an 8K boundary, by forcing the header to be a multiple of 8K (I think 4K alignment would be enough). See the 0001 patch that does this.

And if I measure the disk space used by pg_combinebackup, and compare the results with results without the patch ([1] from a couple days back), I see this:

   pct     not aligned    aligned
  -------------------------------------
    1%         689M          19M
   10%        3172M          22M
   20%       13797M          27M

Yes, those numbers are correct. I didn't believe this at first, but the backups are valid/verified, checksums are OK, etc. BTRFS has similar numbers (e.g. a drop from 20GB to 600MB).

If you look at the charts in the PDF, the charts for on-disk space are on the right side. It might seem like copy_file_range/CoW has no impact, but that's just an illusion - the bars for the last three cases are so small it's difficult to see them (especially on XFS). These charts don't show the impact of alignment itself (because all of the cases in these runs have the blocks aligned), but they do show how tiny the backups can be made - and per the table above, the alignment impact is significant.
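To make the 0001 idea a bit more concrete, here's a minimal sketch of the padding logic (made-up names, not the actual patch): the variable-length header is simply rounded up to the next BLCKSZ boundary with zero bytes, so the block data that follows starts at an 8K offset.

    /*
     * Sketch only: pad the incremental file header with \0 bytes up to
     * the next BLCKSZ boundary, so the block data that follows is
     * aligned to 8K.
     */
    #include <stdio.h>
    #include <string.h>

    #define BLCKSZ 8192

    static void
    pad_header_to_blcksz(FILE *f, size_t header_bytes)
    {
        size_t      padding = (BLCKSZ - (header_bytes % BLCKSZ)) % BLCKSZ;
        char        zeroes[BLCKSZ];

        memset(zeroes, 0, sizeof(zeroes));

        if (padding > 0)
            fwrite(zeroes, 1, padding, f);
    }

With 4K alignment the same calculation would just use 4096 instead of 8192.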
Alignment also affects the prefetching, which I'm going to talk about next. Having the blocks misaligned (spanning multiple 4K pages) forces the system to prefetch more pages than necessary. I don't know how big the impact is, because the prefetch patch is 0002, so I only have results for prefetching on aligned blocks, but I don't see how it could not have a cost.

I do think we should just align the blocks properly. The 0001 patch does that simply by adding a bunch of \0 bytes up to the next 8K boundary. Yes, this has a cost - if you have tiny files with only one or two blocks changed, the incremental file will be a bit larger. Files without any blocks don't need alignment/padding, and as the number of blocks increases, the overhead gets negligible pretty quickly. Also, files use a multiple of fs blocks anyway, so if we align to 4K blocks it wouldn't actually need more space at all. And even if it does, it's all \0, so pretty damn compressible (and I'm sorry, but if you care about the tiny amount of data added by alignment, yet refuse to use compression ...).

I think we absolutely need to align the blocks in the incremental files, and I think we should do that now. I think 8K would work, but maybe we should add an alignment parameter to basebackup & the manifest? The reason why I think this should perhaps be a basebackup parameter is the recent discussion about large fs blocks - it seems to be in the works, so maybe it's better to be ready and not assume all filesystems have 4K blocks.

And I think we probably want to do this now, because this affects all tools dealing with incremental backups - even if someone writes a custom version of pg_combinebackup, it will have to deal with misaligned data. Perhaps there might be something like pg_basebackup that "transforms" the data received from the server (and also the backup manifest), but that does not seem like a great direction.

Note: Of course, these space savings only exist thanks to sharing blocks with the input backups, because the blocks in the combined backup point to one of the other backups. If those old backups are removed, the "saved space" disappears because there's only a single copy left.


2) prefetch
-----------

I was very puzzled by the awful performance on ZFS. When every other fs (EXT4/XFS/BTRFS) took 150-200 seconds to run pg_combinebackup, it took 900-1000 seconds on ZFS, no matter what I did. I tried all the tuning advice I could think of, with almost no effect.

Ultimately I decided that it probably is the "no readahead" behavior I've observed on ZFS. I assume it's because it doesn't use the page cache, where the regular readahead is detected etc. And there's no prefetching in pg_combinebackup, so I decided to do an experiment and added a trivial explicit prefetch when reconstructing the file - every time we read data from a file, we do posix_fadvise for up to 128 blocks ahead (similar to what the bitmap heap scan code does). See 0002.

And tadaaa - the duration dropped from 900-1000 seconds to only about 250-300 seconds, so an improvement of a factor of 3-4x. I think this is pretty massive.

There are a couple more interesting ZFS details - the prefetching seems to be necessary even when using copy_file_range() and we don't need to read the data (to calculate checksums). This is why the "manifest=off" chart has the strange group of high bars at the end - the copy cases are fast because prefetch happens, but if we switch to copy_file_range() there are no prefetches and it gets slow. This is a bit bizarre, especially because the manifest=on cases are still fast, exactly because the pread + prefetching still happens. I'm sure users would find this puzzling.
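For reference, the explicit prefetch is conceptually as simple as this sketch (simplified, hypothetical names, not the actual 0002 code): before reading a block, issue POSIX_FADV_WILLNEED for up to 128 blocks ahead.

    /*
     * Sketch only: ask the kernel to start reading up to
     * PREFETCH_DISTANCE blocks ahead of the block we're about to read,
     * so ZFS doesn't have to wait for each synchronous pread().
     */
    #include <fcntl.h>

    #define BLCKSZ              8192
    #define PREFETCH_DISTANCE   128

    static void
    prefetch_blocks(int fd, unsigned int blkno, unsigned int nblocks)
    {
        unsigned int last = blkno + PREFETCH_DISTANCE;

        if (last > nblocks)
            last = nblocks;

        if (last > blkno)
            (void) posix_fadvise(fd,
                                 (off_t) blkno * BLCKSZ,
                                 (off_t) (last - blkno) * BLCKSZ,
                                 POSIX_FADV_WILLNEED);
    }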
Unfortunately, the prefetching is not beneficial for all filesystems. For XFS it does not seem to make any difference, but on BTRFS it seems to cause a regression.

I think this means we may need a "--prefetch" option, that'd force prefetching, probably both before pread and copy_file_range. Otherwise people on ZFS are doomed and will have poor performance.


3) bulk operations
------------------

Another thing suggested by Thomas last week was that maybe we should try detecting longer runs of blocks coming from the same file, and operate on them as a single chunk of data. If you see e.g. 32 such blocks, then instead of doing a read/write or copy_file_range for each of them, we could simply do one call for all those blocks at once.

I think such runs are pretty likely, especially for small incremental backups where most of the blocks will come from the full backup. And I was suspecting the XFS regression (where copy_file_range was up to 30-50% slower in some cases, see [1]) is related to this, because the perf profiles had stuff like this:

    97.28%  2.10%  pg_combinebacku  [kernel.vmlinux]  [k]
            |
            |--95.18%--entry_SYSCALL_64
            |          |
            |           --94.99%--do_syscall_64
            |                     |
            |                     |--74.13%--__do_sys_copy_file_range
            |                     |          |
            |                     |           --73.72%--vfs_copy_file_range
            |                     |                     |
            |                     |                      --73.14%--xfs_file_remap_range
            |                     |                                |
            |                     |                                |--70.65%--xfs_reflink_remap_blocks
            |                     |                                |          |
            |                     |                                |           --69.86%--xfs_reflink_remap_extent

So I took a stab at this in 0007, which detects runs of blocks coming from the same source file (limited to 128 blocks, i.e. 1MB). I only did this for the copy_file_range() calls in 0007, and the results for XFS look like this (complete results are in the PDF):

           old (block-by-block)    new (batches)
  ------------------------------------------------------
    1%           150s                    4s
   10%        150-200s                  46s
   20%        150-200s                  65s

Yes, once again, those results are real, the backups are valid etc. So not only does it take much less space (thanks to block alignment), it also takes much less time (thanks to bulk operations).

The cases with "manifest=on" improve too, but not nearly as much. I believe this is simply because the read/write still happens block by block. But it shouldn't be difficult to do that in a bulk manner too (we already have the range detected, I was just lazy).
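To illustrate the batching idea, here's a rough sketch of the run detection (hypothetical data structures, not the actual 0007 patch; it assumes output block i lives at offset i * BLCKSZ): walk the output blocks, extend the run while consecutive blocks come from the same source file at consecutive offsets (capped at 128 blocks), and then issue a single copy_file_range() for the whole run.

    /*
     * Sketch only: srcfd[i] is the file the i-th output block comes
     * from, srcoff[i] its offset in that file.
     */
    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ          8192
    #define MAX_RUN_BLOCKS  128

    static void
    copy_block_runs(int outfd, int *srcfd, off_t *srcoff, unsigned int nblocks)
    {
        unsigned int i = 0;

        while (i < nblocks)
        {
            unsigned int run = 1;
            off_t       off_in;
            off_t       off_out;
            size_t      remaining;

            /* extend the run while blocks are contiguous in the same source */
            while (i + run < nblocks &&
                   run < MAX_RUN_BLOCKS &&
                   srcfd[i + run] == srcfd[i] &&
                   srcoff[i + run] == srcoff[i] + (off_t) run * BLCKSZ)
                run++;

            off_in = srcoff[i];
            off_out = (off_t) i * BLCKSZ;
            remaining = (size_t) run * BLCKSZ;

            /* copy_file_range() may copy less than requested, so loop */
            while (remaining > 0)
            {
                ssize_t     copied = copy_file_range(srcfd[i], &off_in,
                                                     outfd, &off_out,
                                                     remaining, 0);

                if (copied <= 0)
                    break;      /* real code would report the error */

                remaining -= (size_t) copied;
            }

            i += run;
        }
    }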
[1] https://www.postgresql.org/message-id/0e27835d-dab5-49cd-a3ea-52cf6d9ef59e%40enterprisedb.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Attachment
- v20240401-0001-WIP-block-alignment.patch
- v20240401-0002-WIP-prefetch-blocks-when-reconstructing-fi.patch
- v20240401-0003-use-clone-copy_file_range-to-copy-whole-fi.patch
- v20240401-0004-use-copy_file_range-in-write_reconstructed.patch
- v20240401-0005-use-copy_file_range-with-checksums.patch
- v20240401-0006-allow-cloning-with-checksum-calculation.patch
- v20240401-0007-WIP-copy-larger-chunks-from-the-same-file.patch
- xeon-nvme-xfs.csv
- i5-ssd-zfs.csv
- xeon-nvme-btrfs.csv
- benchmark-results.pdf
- generate-backups.sh
- restore-backups.sh