Re: pg_combinebackup --incremental - Mailing list pgsql-hackers

From Jakub Wartak
Subject Re: pg_combinebackup --incremental
Date
Msg-id CAKZiRmyarJqnw126a9v0A8ZEYs3JQ=1tMLKRVffJEHb8CoeCBw@mail.gmail.com
List pgsql-hackers
On Mon, Nov 4, 2024 at 6:53 PM Robert Haas <robertmhaas@gmail.com> wrote:

Hi Robert,

[..]

1. I hit the same bug as Bertrand, which seems to stem from macOS having been used for development rather than Linux (macOS doesn't have copy_file_range(2), so HAVE_COPY_FILE_RANGE is not defined there). In my case I simply fell back to undef HAVE_COPY_FILE_RANGE, but the patch needs to be fixed. I haven't run any longer or more data-intensive tests, since copy_file_range() support seems to be missing and from my point of view it is crucial.

2. While interleaving several incremental backups with pgbench, I've noticed something strange by accident:

This will work:

$ pg_combinebackup -o /var/lib/postgresql/18/main fulldbbackup incr1 incr2 incr3 incr4

This will also work (new):

$ pg_combinebackup -i incr1 incr2 incr3 incr4 -o good_incr1_4
$ rm -rf /var/lib/postgresql/18/main && pg_combinebackup -o /var/lib/postgresql/18/main fulldbbackup good_incr1_4

This will also work (new):

$ pg_combinebackup -i incr1 incr2 -o incr_12 #ok
$ pg_combinebackup -i incr_12 incr3 -o incr_13 #ok
$ rm -rf /var/lib/postgresql/18/main && pg_combinebackup -o /var/lib/postgresql/18/main fulldbbackup incr_13

BUT, if I make an intentional user error and merge the same incr2 into two different sets of incremental backups, it won't reconstruct from them:

$ pg_combinebackup -i incr1 incr2 -o incr1_2 # contains 1 + 2
$ pg_combinebackup -i incr2 incr3 -o incr2_3 # contains 2(!) + 3
$ rm -rf /var/lib/postgresql/18/main && pg_combinebackup -o /var/lib/postgresql/18/main fulldbbackup incr1_2 # ofc works
$ rm -rf /var/lib/postgresql/18/main && pg_combinebackup -o /var/lib/postgresql/18/main fulldbbackup incr1_2 incr2_3 # fails?
pg_combinebackup: error: backup at "incr1_2" starts at LSN 0/B000028, but expected 0/9000028

It seems to be taking the LSN from incr1_2 and ignoring incr2_3?

$ find incr1 incr2 incr3 incr1_2 incr2_3 fulldbbackup -name backup_manifest -exec grep -H LSN {} \;
incr1/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/9000028", "End-LSN": "0/9000120" }
incr2/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/B000028", "End-LSN": "0/B000120" }
incr3/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/D000028", "End-LSN": "0/D000120" }
incr1_2/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/B000028", "End-LSN": "0/B000120" }
incr2_3/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/D000028", "End-LSN": "0/D000120" }
fulldbbackup/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/70000D8", "End-LSN": "0/70001D0" }

So I'm not sure whether we should cover that scenario or not.

$ rm -rf /var/lib/postgresql/18/main && pg_combinebackup -o /var/lib/postgresql/18/main fulldbbackup incr1_2 incr3 # works
$ rm -rf /var/lib/postgresql/18/main && pg_combinebackup -o /var/lib/postgresql/18/main fulldbbackup incr1_2 incr3_4 # works
$ rm -rf /var/lib/postgresql/18/main && pg_combinebackup -o /var/lib/postgresql/18/main fulldbbackup incr1_2 incr3_4 # two combined sets - also work

4. The space-saving feature seems to work, which I take to be the primary objective of the patch. I tried merging up to ~40 backups in a rolling fashion, merging the incremental backups after each new incremental backup:

$ du -sm incr? incr?? incr1_38
38      incr1
25      incr2
[..]
24      incr37
24      incr38
[..above would be ~38*~25M = ~950MB]
87      incr1_38 # instead of ~950MB just 87MB
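
For reference, a rolling merge of this kind can be sketched roughly as below. This is only a sketch under assumptions, not the exact script used: the directories are assumed to be named incr1..incrN, rolling_merge is a hypothetical helper name, and -i is the switch from the patch under discussion.

```shell
# Hypothetical helper: merge incr1..incrN one backup at a time, so that
# each step combines the previous merged result with the next incremental.
# Assumes the directories are named incr1, incr2, ... incrN.
rolling_merge() {
    n=$1
    prev=incr1
    i=2
    while [ "$i" -le "$n" ]; do
        next="incr1_$i"
        # -i is the switch proposed by the patch under discussion
        pg_combinebackup -i "$prev" "incr$i" -o "$next" || return 1
        prev=$next
        i=$((i + 1))
    done
    # Print the name of the directory holding the final merged backup
    echo "$prev"
}
```

Under those assumptions, "rolling_merge 38" would leave the final result in incr1_38. Intermediate incr1_K directories are kept, so removing them is left to the caller.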

5. I accidentally ran into an independent problem when using "pg_combinebackup -i `ls -1vd incr*` -o incr_ALL" (when dealing with dozens of incrementals to merge, I bet this is going to be a commonly used pattern): pg_combinebackup was failing with
$ pg_combinebackup -i `ls -1vd incr*` -o incr_ALL
pg_combinebackup: error: incr26/global/pg_control: expected system identifier 7436752340350991865, but found 7436753510269539237

The issue, to me, is that the check that the output directory does not exist should come first. Here ls(1) picks up an existing incr_ALL as one of the inputs, and pg_combinebackup blows up with an error while comparing System-Identifiers, before checking whether the -o directory exists:

$ grep System-Id ./incr_ALL/backup_manifest
"System-Identifier": 7436752340350991865,

So the issue is sequencing: shouldn't it first check that incr_ALL does not exist, and only then start inspecting the backups to be combined?
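
Until the ordering of the checks changes, a user-side guard along these lines could avoid the confusing error. It is a minimal sketch: require_absent is a hypothetical helper name, and the pg_combinebackup invocation in the comment is simply the one from above.

```shell
# Hypothetical helper: fail early when the intended output directory is
# already present, before any glob such as incr* can pick it up as input.
require_absent() {
    if [ -e "$1" ]; then
        echo "output directory \"$1\" already exists" >&2
        return 1
    fi
}

# Intended use (not executed here):
#   require_absent incr_ALL && pg_combinebackup -i `ls -1vd incr*` -o incr_ALL
```

This only moves the existence check in front of the glob expansion on the caller's side; it does not change what pg_combinebackup itself verifies.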

6. Not sure if that's critical, but it seems to require the incremental backups to be listed in order for the merge to work. Is that a limitation by design or not? (note -r, i.e. --reverse, in the second ls(1)):

$ rm -rf incr_ALL && pg_combinebackup -i `ls -1vd incr*` -o incr_ALL
$ rm -rf incr_ALL && pg_combinebackup -i `ls -1rvd incr*` -o incr_ALL
pg_combinebackup: error: backup at "incr2" starts at LSN 0/B000028, but expected 0/70000D8

simpler:
$ rm -rf incr_ALL && pg_combinebackup -i incr1 incr2 incr3 -o incr_ALL
$ rm -rf incr_ALL && pg_combinebackup -i incr3 incr2 incr1 -o incr_ALL
pg_combinebackup: error: backup at "incr2" starts at LSN 0/B000028, but expected 0/70000D8
$ find incr1 incr2 incr3 -name backup_manifest -exec grep -H LSN {} \;   | sort -nk 1
incr1/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/9000028", "End-LSN": "0/9000120" }
incr2/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/B000028", "End-LSN": "0/B000120" }
incr3/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/D000028", "End-LSN": "0/D000120" }
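
If the ordering requirement stays, the caller could order the directories by the Start-LSN recorded in each backup_manifest instead of relying on names. A minimal sketch, assuming the manifest contains a "Start-LSN" entry formatted as in the grep output above; the function names are hypothetical:

```shell
# Convert an LSN such as "0/B000028" into a single decimal integer.
lsn_to_int() {
    hi=${1%%/*}
    lo=${1##*/}
    echo $(( (0x$hi << 32) | 0x$lo ))
}

# Print the first Start-LSN found in <backup_dir>/backup_manifest.
manifest_start_lsn() {
    sed -n 's/.*"Start-LSN": *"\([0-9A-Fa-f]*\/[0-9A-Fa-f]*\)".*/\1/p' \
        "$1/backup_manifest" | head -n 1
}

# Print the given backup directories, one per line, by ascending Start-LSN.
sort_backups_by_lsn() {
    for dir in "$@"; do
        printf '%s %s\n' "$(lsn_to_int "$(manifest_start_lsn "$dir")")" "$dir"
    done | sort -n | cut -d' ' -f2-
}
```

With that, "pg_combinebackup -i `sort_backups_by_lsn incr3 incr2 incr1` -o incr_ALL" would receive incr1 incr2 incr3, provided the directory names contain no whitespace.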

Nitpicking and other possibly not important things:

a. I'm still a fan of `--merge-incremental[-backups]` over `--incremental` switch in pg_combinebackup and disabling the short `-i` switch :^)

b. pg_combinebackup help message has:
>  -i, --incremental         combine incrementals without a full backup
Maybe s/combine incrementals/merge incremental backups/, as "incrementals" on its own omits the "incremental of <what>".

c. While we are at copy_file_blocks(): couldn't we simply also report strerror(errno) as one of the parameters to pg_fatal on a short write? I bet an ENOSPC error message would be less vague than:

pg_combinebackup: error: could not write to file "incr1_39/base/5/2613", offset 9011200: wrote 327680 of 409600
pg_combinebackup: removing output directory "incr1_39"

-J.
