From: Tomas Vondra
Subject: Re: pg_combinebackup --copy-file-range
Msg-id: 4b197636-257e-4c4d-ae36-b037a3033118@enterprisedb.com
List: pgsql-hackers
Hi,

I have pushed the three patches of this series - the one that aligns
blocks, and the two adding clone/copy_file_range to pg_combinebackup.
The committed versions are pretty much the 2024/04/03 versions, with
various minor cleanups (e.g. I noticed the docs still claimed the copy
methods work only without checksum calculations, but that's no longer
true). I also changed the parameter order to keep the dry_run and debug
parameters last; it seems nicer this way.

The buildfarm reported two compile-time problems, both of them entirely
avoidable (reported by cfbot but I failed to notice that). Should have
known better ...

Anyway, with these patches committed, pg_combinebackup can use CoW stuff
to combine backups cheaply (especially in disk-space terms).
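For context, the CoW copying boils down to copy_file_range(), which lets
the filesystem share the underlying blocks instead of physically
duplicating the data. A minimal sketch of such a copy loop (illustrative
only, error handling trimmed - not the actual pg_combinebackup code):

    #define _GNU_SOURCE
    #include <unistd.h>

    /* Copy "len" bytes from srcfd to dstfd using copy_file_range(),
     * allowing CoW filesystems (btrfs, XFS with reflink) to share
     * blocks instead of duplicating the data. */
    static int
    copy_range_cow(int srcfd, int dstfd, size_t len)
    {
        while (len > 0)
        {
            /* NULL offsets - use/advance the current file positions */
            ssize_t nbytes = copy_file_range(srcfd, NULL, dstfd, NULL,
                                             len, 0);

            if (nbytes <= 0)
                return -1;      /* error handling omitted */

            len -= nbytes;
        }

        return 0;
    }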

The first patch (block alignment), however, turned out to be important
in some cases even for non-CoW filesystems. I did a lot of benchmarks
with the standard block-by-block copying of data, and on a machine with
SSD RAID storage the duration went from ~400 seconds for some runs to
only about 150 seconds with aligned blocks. My explanation is that with
misaligned blocks the RAID often has to access two devices to read a
single block, and the alignment makes that go away.
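To illustrate the alignment issue with made-up numbers (a sketch of the
general idea, not the actual incremental file format):

    #include <sys/types.h>

    #define BLCKSZ      8192
    #define HDR_LEN     4567    /* hypothetical variable-length header */

    /* Misaligned: blocks stored immediately after the header, so block
     * boundaries land at arbitrary offsets and an 8kB read can straddle
     * a RAID chunk boundary, touching two devices. */
    static off_t
    block_offset_unaligned(int blkno)
    {
        return HDR_LEN + (off_t) blkno * BLCKSZ;
    }

    /* Aligned: pad the header to the next BLCKSZ boundary, so every
     * block starts at a multiple of the block size. */
    static off_t
    block_offset_aligned(int blkno)
    {
        off_t   datastart = ((HDR_LEN + BLCKSZ - 1) / BLCKSZ) * BLCKSZ;

        return datastart + (off_t) blkno * BLCKSZ;
    }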

The attached PDF with results (duration.pdf) shows the duration of
pg_combinebackup on an increment of a particular size (1%, 10% or 20%),
and this effect is visible as the green square on the right. Those
columns are results relative to a baseline - which for "copy" is master
before the block alignment patch, and for "copy_file_range" it's the
3-reconstruct patch (adding copy_file_range to combining blocks from
increments).

FWIW the last three columns are a comparison with prefetching enabled.

There are a couple of interesting observations here, based on which I'm
not going to try to get the remaining patches (batching and prefetching)
into v17. They clearly need more analysis to make the right tradeoff.

From the table, I think it's clear that:

0) Block alignment has a significant impact on RAID storage with
regular copy, as described above.

1) The batching (original patch 0005) either does not help the regular
copy, or it actually makes it slower. The PDF is a bit misleading,
because it seems to suggest the i5 machine is unaffected while the xeon
gets ~30% slower. But that's an illusion - the comparison is to old
master, and the alignment patch made the i5 about 2x faster. So compared
to "current master" (with the alignment patch) it's roughly 2x slower.

That's not great :-/ And also a bit strange - I would have expected the
batching to help the simple copy too. I haven't looked into why this
happens, so there's a chance I made some silly mistake, who knows.

For the copy_file_range case the batching is usually very beneficial,
sometimes reducing the duration to a fraction of the non-batched case.

My interpretation is that (unless there's a bug in the patch) we may
need two variants of that code - a non-batched one for regular copy, and
a batched variant for copy_file_range.
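For the record, by "batching" I mean collapsing runs of adjacent blocks
that come from the same source file into a single copy_file_range()
call, roughly like this (made-up struct and names, not the actual patch
code):

    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ  8192

    /* Where each output block comes from (hypothetical struct). */
    typedef struct BlockSource
    {
        int     srcfd;      /* source file descriptor */
        off_t   srcoff;     /* offset of the block in the source file */
    } BlockSource;

    static void
    copy_blocks_batched(BlockSource *blocks, int nblocks, int dstfd)
    {
        int     i = 0;

        while (i < nblocks)
        {
            int     j = i + 1;

            /* Extend the run while blocks are adjacent in the same
             * source file - each such run becomes one syscall. */
            while (j < nblocks &&
                   blocks[j].srcfd == blocks[i].srcfd &&
                   blocks[j].srcoff == blocks[i].srcoff +
                                       (off_t) (j - i) * BLCKSZ)
                j++;

            off_t   srcoff = blocks[i].srcoff;
            off_t   dstoff = (off_t) i * BLCKSZ;
            size_t  len = (size_t) (j - i) * BLCKSZ;

            while (len > 0)
            {
                ssize_t n = copy_file_range(blocks[i].srcfd, &srcoff,
                                            dstfd, &dstoff, len, 0);

                if (n <= 0)
                    break;      /* error handling omitted */

                len -= n;
            }

            i = j;
        }
    }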

2) The prefetching is not a huge improvement, at least not for these
three filesystems (btrfs, ext4, xfs). From the color scale it might seem
like it helps, but those values are relative to the baseline, so when
the non-prefetching value is 5% and the prefetching value is 10%, the
prefetching actually makes it slower (e.g. against a 100-second
baseline, that's 5s vs. 10s - twice as long). And that's very often
true.

This is visible more clearly in prefetching.pdf, which compares the
non-prefetching and prefetching results for each patch, rather than to
the baseline. That makes it quite clear there's a lot of "red" where
prefetching makes it slower. It certainly does help for larger
increments (which makes sense, because the modified blocks are
distributed randomly, and thus come from random files, making long
sequential streaks unlikely).

I had imagined the prefetching could be made a bit smarter, to ignore
the streaks (i.e. sequential patterns), but once again - this only
matters with the batching, which we don't have. And without the batching
it looks like a net loss (that's the first column in the prefetching
PDF).
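For completeness, the prefetching discussed here is of this general
shape - issuing posix_fadvise(POSIX_FADV_WILLNEED) on source blocks a
bit ahead of the read position (the distance constant and the streak
idea in the comment are made-up details, not the patch's actual
heuristics):

    #include <fcntl.h>
    #include <sys/types.h>

    #define BLCKSZ          8192
    #define PREFETCH_DIST   32      /* hypothetical distance, in blocks */

    /* Hint the kernel that we'll soon read the block at "blkoff". */
    static void
    prefetch_block(int fd, off_t blkoff)
    {
        (void) posix_fadvise(fd, blkoff, BLCKSZ, POSIX_FADV_WILLNEED);
    }

    /* A "smarter" variant might skip blocks that merely continue a
     * sequential streak, assuming kernel readahead already covers
     * those - that's the idea mentioned above, not implemented here. */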

I did start thinking about prefetching because of ZFS, where it was
necessary to get decent performance. And that's still true. But (a) even
with the explicit prefetching ZFS is still 2-3x slower than any of these
filesystems, so I assume performance-sensitive use cases won't use it.
And (b) the prefetching seems necessary in all cases, no matter how
large the increment is - which goes directly against the idea of looking
at how random the blocks are and prefetching only the sufficiently
random patterns. That doesn't seem like a great thing.

3) There's also the question of disk space usage. The attached size.pdf
shows how the patches affect the space needed for the pg_combinebackup
result. It depends a bit on the internal fs cleanup for each run, but it
seems the batching makes a difference - clearly, copying 1MB chunks
instead of 8kB blocks allows lower overhead on some filesystems (e.g. on
btrfs we go from ~1.5GB to a couple MBs). But the space savings are
negligible compared to just using the --copy-file-range option (where we
go from 75GB to 1.5GB). I think the batching is interesting mostly
because of the substantial duration reduction.

I'm also attaching the benchmarking script I used (warning: ugly!), and
results for the three filesystems. For ZFS I only have partial results
so far, because it's so slow, but in general: without prefetching it's
slow (~1000s); with prefetching it's better, but still slow (~250s).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
