Re: Add -k/--link option to pg_combinebackup - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Add -k/--link option to pg_combinebackup |
Date | |
Msg-id | CA+TgmoZkf+WRLvLF1MHJC3NF1YAdeMmKdVKVwQpknEdgkEEkTg@mail.gmail.com Whole thread Raw |
List | pgsql-hackers |
On Wed, Jan 15, 2025 at 12:42 PM Israel Barth Rubio <barthisrael@gmail.com> wrote: > With the current implementation of pg_combinebackup, we have a few > copy methods: --clone, --copy and --copy-file-range. By using either of > them, it implicates an actual file copy in the file system, i.e. among > other things, disk usage. > > While discussing with some people, e.g. Robert Haas and Martín Marqués, > about possible ways to improve pg_combinebackup performance and/or > reduce disk usage taken by the synthetic backup, there was a thought > of using hard links. > > I'm submitting a patch in this thread which introduces the -k/--link > option to pg_combinebackup, making it similar to the options exposed > by pg_upgrade. In general, +1, although I think that the patch needs polishing in a bunch of areas. Originally, I thought what we wanted was something like a --in-place option to pg_combinebackup, allowing the output directory to be the same as one of the input directories. However, I now think that what this patch does is likely better, and this patch is a lot simpler to write than that one would be. With the --in-place option, if I'm combine backup1 and backup2, I must write either "pg_combinebackup backup1 backup2 -o backup1" or "pg_combinebackup backup1 backup2 -o backup2". In either case, I can skip whole-file copies from one of the two source directories -- whichever one happens to be the output directory. However, if I write "pg_combinebackup --link backup1 backup2 -o result", I can skip *all* whole-file copies from *every* input directory, which seems a lot nicer. Furthermore, if all the input directories are staging directories, basically copies of original backups stored elsewhere, then the fact that those directories get trashed is of no concern to me. In fact, they don't even get trashed if pg_combinebackup is interrupted partway through, because I can just remove the output directory and recreate it and everything is fine. With an --in-place option, that would be trickier. Regarding the patch itself, I think we need to rethink the test case changes in particular. As written, the patch won't do any testing of link mode unless the user sets PG_TEST_PG_COMBINEBACKUP_MODE=--link; and if they do set that option, then we won't test anything else. Also, even with that environment variable set, we'll only test --link mode a bit ... accidentally. The patch doesn't really do anything to make sure that link mode actually does what it's intended to do. It only adapts the existing tests not to fail. I think it would be better to decide that you're not supposed to set PG_TEST_PG_COMBINEBACKUP_MODE=--link, and then write some new test, not depending on PG_TEST_PG_COMBINEBACKUP_MODE, that specifically verifies that link mode behaves as expected. After all, link mode is noticeably different from the other copy modes. Those should all produce equivalent results, or fail outright, we suppose. This produces a different user-visible result, so we probably ought to test that we get that result. For example, we might check that certain files end up with a link count of 1 or 2, as appropriate. Does link() work on Windows? I'm not sure what to do about the issue that using --link trashes the input cluster if you subsequently start the database or otherwise modify any hard-linked files. Keep in mind that, for command-line arguments other than the first, these are incremental backups, and you already can't run postgres on those directories. Only the first input directory, which is a full backup not an incremental, is a potential target for an unexpected database start. I'm tentatively inclined to think we shouldn't modify the input directories and just emit a warning like: warning: --link mode was used; any modifications to the output directory may destructively modify input directories ...but maybe that's not the right idea. Happy to hear other opinions. -- Robert Haas EDB: http://www.enterprisedb.com
pgsql-hackers by date: