Re: pg_combinebackup --copy-file-range - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: pg_combinebackup --copy-file-range
Msg-id: 4b197636-257e-4c4d-ae36-b037a3033118@enterprisedb.com
In response to: Re: pg_combinebackup --copy-file-range (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: pg_combinebackup --copy-file-range
List: pgsql-hackers
Hi,

I have pushed the three patches of this series - the one that aligns blocks, and the two adding clone/copy_file_range to pg_combinebackup. The committed versions are pretty much the 2024/04/03 version, with various minor cleanups (e.g. I noticed the docs were still claiming the copy methods work only without checksum calculations, but that's no longer true). I also changed the parameter order to keep the dry_run and debug parameters last - it seems nicer this way.

The buildfarm reported two compile-time problems, both of them entirely avoidable (reported by cfbot, but I failed to notice that). Should have known better ...

Anyway, with these patches committed, pg_combinebackup can use CoW stuff to combine backups cheaply (especially in disk-space terms).

The first patch (block alignment) however turned out to be important even for non-CoW filesystems, in some cases. I did a lot of benchmarks with the standard block-by-block copying of data, and on a machine with SSD RAID storage the duration went from ~400 seconds for some runs to only about 150 seconds (with aligned blocks). My explanation is that with misaligned blocks the RAID often has to access two devices to read a block, and the alignment makes that go away.

In the attached PDF with results (duration.pdf), showing the duration of pg_combinebackup on an increment of a particular size (1%, 10% or 20%), this is visible as a green square on the right. Those columns are results relative to a baseline - which for "copy" is master before the block alignment patch, and for "copy_file_range" it's the 3-reconstruct patch (adding copy_file_range to combining blocks from increments). FWIW the last three columns are a comparison with prefetching enabled.

There are a couple of interesting observations from this, based on which I'm not going to try to get the remaining patches (batching and prefetching) into v17. This clearly needs more analysis to make the right tradeoff. From the table, I think it's clear that:

0) The block alignment matters for RAID storage, even with regular copy.

1) The batching (original patch 0005) either does not help the regular copy, or it actually makes it slower. The PDF is a bit misleading because it seems to suggest the i5 machine is unaffected, while the xeon gets ~30% slower. But that's just an illusion - the comparison is to master, but the alignment patch made the i5 about 2x faster. So it's 200% slower when compared to "current master" with the alignment patch.

That's not great :-/ And also a bit strange - I would have expected the batching to help the simple copy too. I haven't looked into why this happens, so there's a chance I made some silly mistake, who knows.

For the copy_file_range case the batching is usually very beneficial, sometimes reducing the duration to a fraction of the non-batched case. My interpretation is that (unless there's a bug in the patch) we may need two variants of that code - a non-batched one for regular copy, and a batched variant for copy_file_range.

2) The prefetching is not a huge improvement, at least not for these three filesystems (btrfs, ext4, xfs). From the color scale it might seem like it helps, but those values are relative to the baseline, so when the non-prefetching value is 5% and the prefetching one is 10%, that means the prefetching makes it slower. And that's very often true.

This is visible more clearly in prefetching.pdf, comparing the non-prefetching and prefetching results for each patch, not to baseline.
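To make it clear what I mean by explicit prefetching: hinting to the kernel which blocks of the source files we're about to read, so the I/O can start before the reconstruction loop actually gets to each block - something along the lines of posix_fadvise(POSIX_FADV_WILLNEED). A simplified sketch (just an illustration of the mechanism, not the actual patch code):

    /*
     * Hint the kernel about the blocks we are going to read from a backup
     * file, before the reconstruction loop reads them one by one.  This is
     * an illustration of the technique only, not the actual patch code.
     */
    #include <sys/types.h>
    #include <fcntl.h>

    #define BLCKSZ 8192

    static void
    prefetch_blocks(int fd, const off_t *offsets, int noffsets)
    {
    #ifdef POSIX_FADV_WILLNEED
        for (int i = 0; i < noffsets; i++)
            (void) posix_fadvise(fd, offsets[i], BLCKSZ, POSIX_FADV_WILLNEED);
    #endif
    }

Whether issuing such hints actually pays off is what the comparison shows.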
The prefetching comparison makes it quite clear there's a lot of "red", where prefetching makes it slower. It certainly does help for larger increments (which makes sense, because the modified blocks are distributed randomly, and thus come from random files, making long streaks unlikely).

I imagine the prefetching could be made a bit smarter to ignore the streaks (= sequential patterns), but once again - this only matters with the batching, which we don't have. And without the batching it looks like a net loss (that's the first column in the prefetching PDF).

I did start thinking about prefetching because of ZFS, where it was necessary to get decent performance. And that's still true. But (a) even with the explicit prefetching it's still 2-3x slower than any of these filesystems, so I assume performance-sensitive use cases won't use it. And (b) the prefetching seems necessary in all cases, no matter how large the increment is. Which goes directly against the idea of looking at how random the blocks are and prefetching only the sufficiently random patterns. That doesn't seem like a great thing.

3) There's also the question of disk space usage. The size.pdf shows how the patches affect the space needed for the pg_combinebackup result. It does depend a bit on the internal fs cleanup for each run, but it seems the batching makes a difference - clearly, copying 1MB blocks instead of 8kB allows lower overhead for some filesystems (e.g. btrfs, where we go from ~1.5GB to a couple of MBs). But the space savings are quite negligible compared to just using the --copy-file-range option (where we go from 75GB to 1.5GB). I think the batching is interesting mostly because of the substantial duration reduction.

I'm also attaching the benchmarking script I used (warning: ugly!), and results for the three filesystems. For ZFS I only have partial results so far, because it's so slow, but in general - without prefetching it's slow (~1000s), with prefetching it's better but still slow (~250s).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company