Re: Add -k/--link option to pg_combinebackup - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Add -k/--link option to pg_combinebackup
Date
Msg-id CA+TgmoZkf+WRLvLF1MHJC3NF1YAdeMmKdVKVwQpknEdgkEEkTg@mail.gmail.com
Whole thread Raw
List pgsql-hackers
On Wed, Jan 15, 2025 at 12:42 PM Israel Barth Rubio
<barthisrael@gmail.com> wrote:
> With the current implementation of pg_combinebackup, we have a few
> copy methods: --clone, --copy and --copy-file-range. By using either of
> them, it implicates an actual file copy in the file system, i.e. among
> other things, disk usage.
>
> While discussing with some people, e.g. Robert Haas and Martín Marqués,
> about possible ways to improve pg_combinebackup performance and/or
> reduce disk usage taken by the synthetic backup, there was a thought
> of using hard links.
>
> I'm submitting a patch in this thread which introduces the -k/--link
> option to pg_combinebackup, making it similar to the options exposed
> by pg_upgrade.

In general, +1, although I think that the patch needs polishing in a
bunch of areas.

Originally, I thought what we wanted was something like a --in-place
option to pg_combinebackup, allowing the output directory to be the
same as one of the input directories. However, I now think that what
this patch does is likely better, and this patch is a lot simpler to
write than that one would be. With the --in-place option, if I'm
combine backup1 and backup2, I must write either "pg_combinebackup
backup1 backup2 -o backup1" or "pg_combinebackup backup1 backup2 -o
backup2". In either case, I can skip whole-file copies from one of the
two source directories -- whichever one happens to be the output
directory. However, if I write "pg_combinebackup --link backup1
backup2 -o result", I can skip *all* whole-file copies from *every*
input directory, which seems a lot nicer.

Furthermore, if all the input directories are staging directories,
basically copies of original backups stored elsewhere, then the fact
that those directories get trashed is of no concern to me. In fact,
they don't even get trashed if pg_combinebackup is interrupted partway
through, because I can just remove the output directory and recreate
it and everything is fine. With an --in-place option, that would be
trickier.

Regarding the patch itself, I think we need to rethink the test case
changes in particular. As written, the patch won't do any testing of
link mode unless the user sets PG_TEST_PG_COMBINEBACKUP_MODE=--link;
and if they do set that option, then we won't test anything else.
Also, even with that environment variable set, we'll only test --link
mode a bit ... accidentally. The patch doesn't really do anything to
make sure that link mode actually does what it's intended to do. It
only adapts the existing tests not to fail. I think it would be better
to decide that you're not supposed to set
PG_TEST_PG_COMBINEBACKUP_MODE=--link, and then write some new test,
not depending on PG_TEST_PG_COMBINEBACKUP_MODE, that specifically
verifies that link mode behaves as expected.

After all, link mode is noticeably different from the other copy
modes. Those should all produce equivalent results, or fail outright,
we suppose. This produces a different user-visible result, so we
probably ought to test that we get that result. For example, we might
check that certain files end up with a link count of 1 or 2, as
appropriate.

Does link() work on Windows?

I'm not sure what to do about the issue that using --link trashes the
input cluster if you subsequently start the database or otherwise
modify any hard-linked files. Keep in mind that, for command-line
arguments other than the first, these are incremental backups, and you
already can't run postgres on those directories. Only the first input
directory, which is a full backup not an incremental, is a potential
target for an unexpected database start. I'm tentatively inclined to
think we shouldn't modify the input directories and just emit a
warning like:

warning: --link mode was used; any modifications to the output
directory may destructively modify input directories

...but maybe that's not the right idea. Happy to hear other opinions.

--
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: RFC: Additional Directory for Extensions
Next
From: Robert Haas
Date:
Subject: Re: Eagerly scan all-visible pages to amortize aggressive vacuum