Re: Proposal: Incremental Backup - Mailing list pgsql-hackers

From Claudio Freire
Subject Re: Proposal: Incremental Backup
Msg-id CAGTBQpaMV6+BvL_N66BuyBD8JsauW-p5o-FajNyVoKvsshiv2Q@mail.gmail.com
In response to Re: Proposal: Incremental Backup  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Fri, Jul 25, 2014 at 3:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jul 25, 2014 at 2:21 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Fri, Jul 25, 2014 at 10:14 AM, Marco Nenciarini
>> <marco.nenciarini@2ndquadrant.it> wrote:
>>> 1. Proposal
>>> =================================
>>> Our proposal is to introduce the concept of a backup profile. The backup
>>> profile consists of a file with one line per file detailing tablespace,
>>> path, modification time, size and checksum.
>>> Using that file the BASE_BACKUP command can decide which file needs to
>>> be sent again and which is not changed. The algorithm should be very
>>> similar to rsync, but since our files are never bigger than 1 GB per
>>> file that is probably granular enough not to worry about copying parts
>>> of files, just whole files.
>>
>> That wouldn't be nearly as useful as the LSN-based approach mentioned before.
>>
>> I've had my share of rsyncing live databases (when resizing
>> filesystems, not for backup, but the anecdotal evidence applies
>> anyhow) and with moderately write-heavy databases, even if you only
>> modify a tiny portion of the records, you end up modifying a huge
>> portion of the segments, because the free space choice is random.
>>
>> There have been patches going around to change the random nature of
>> that choice, but none are very likely to make a huge difference for
>> this application. In essence, file-level comparisons get you only a
>> mild speed-up, and are not worth the effort.
>>
>> I'd go for the hybrid file+lsn method, or nothing. The hybrid avoids
>> the I/O of inspecting the LSN of entire segments (a necessary
>> optimization for huge multi-TB databases) and backs up only the
>> modified portions when segments do contain changes, so it's the best
>> of both worlds. Any partial implementation would either require lots
>> of I/O (LSN only) or save very little (file only) unless it's an
>> almost read-only database.
>
> I agree with much of that.  However, I'd question whether we can
> really seriously expect to rely on file modification times for
> critical data-integrity operations.  I wouldn't like it if somebody
> ran ntpdate to fix the time while the base backup was running, and it
> set the time backward, and the next differential backup consequently
> omitted some blocks that had been modified during the base backup.

I was thinking the same. But that timestamp could be stored in the file
itself, or in some other catalog, as a kind of "trusted metadata"
maintained by pg itself, and it could really be an LSN range instead of
a timestamp.
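
To make the hybrid file+LSN idea concrete, here is a rough sketch, not a
proposal for the actual implementation. It assumes a hypothetical
per-segment "trusted metadata" record holding the highest LSN that touched
the segment, the default 8192-byte block size, and a page header whose
first 8 bytes hold the page LSN as two native-endian 32-bit halves. The
names SegmentInfo, changed_pages and incremental_plan are made up for
illustration.

```python
import struct
from dataclasses import dataclass

PAGE_SIZE = 8192  # assumption: default block size


@dataclass
class SegmentInfo:
    """Hypothetical 'trusted metadata' entry kept per segment file."""
    path: str
    max_lsn: int  # highest LSN known to have modified this segment


def page_lsn(page: bytes) -> int:
    """Read the LSN stored in the first 8 bytes of a page header.

    Assumes two native-endian 32-bit halves: high word, then low word.
    """
    hi, lo = struct.unpack_from("=II", page, 0)
    return (hi << 32) | lo


def changed_pages(segment: bytes, since_lsn: int):
    """Yield (block_number, page) for pages whose LSN is newer than since_lsn."""
    for off in range(0, len(segment), PAGE_SIZE):
        page = segment[off:off + PAGE_SIZE]
        if page_lsn(page) > since_lsn:
            yield off // PAGE_SIZE, page


def incremental_plan(segments, read_segment, since_lsn):
    """Hybrid selection: skip whole segments by metadata, then filter pages."""
    plan = {}
    for seg in segments:
        if seg.max_lsn <= since_lsn:
            continue  # file-level check: segment is never read
        blocks = [n for n, _ in changed_pages(read_segment(seg.path), since_lsn)]
        if blocks:
            plan[seg.path] = blocks
    return plan
```

Here read_segment is whatever callable returns a segment's bytes, and
since_lsn is the start LSN of the previous backup. Segments whose recorded
max_lsn is not newer are skipped without any I/O, which is the file-level
half; the page scan is the LSN half. Because both checks compare LSNs and
not wall-clock times, setting the system clock backward cannot cause a
modified block to be omitted.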


