Re: Tracking of page changes for backup purposes. PTRACK [POC] - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: Tracking of page changes for backup purposes. PTRACK [POC]
Msg-id: d2842a0f-d858-4e41-2805-818974646560@2ndquadrant.com
In response to: Tracking of page changes for backup purposes. PTRACK [POC]  (Anastasia Lubennikova <a.lubennikova@postgrespro.ru>)
Responses: Re: Tracking of page changes for backup purposes. PTRACK [POC]  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers

Hi,

A couple of months ago there was a proposal / patch with similar
goals, from Andrey Borodin. See these two threads:

[1] https://www.postgresql.org/message-id/flat/843D96CC-7C55-4296-ADE0-622A7ACD4978%40yesql.se#843D96CC-7C55-4296-ADE0-622A7ACD4978@yesql.se

[2] https://www.postgresql.org/message-id/flat/449A7A9D-DB58-40F8-B80E-4C4EE7DB47FD%40yandex-team.ru#449A7A9D-DB58-40F8-B80E-4C4EE7DB47FD@yandex-team.ru

I recall there was a long discussion regarding which of the approaches
is the *right* one (although that certainly depends on other factors).

On 12/18/2017 11:18 AM, Anastasia Lubennikova wrote:
> In this thread I would like to raise the issue of incremental backups.
> What I suggest in this thread, is to choose one direction, so we can
> concentrate our community efforts.
> There is already a number of tools, which provide incremental backup.
> And we can see five principle techniques they use:
> 
> 1. Use file modification time as a marker that the file has changed.
> 2. Compute file checksums and compare them.
> 3. LSN-based mechanisms. Backup pages with LSN >= last backup LSN.
> 4. Scan all WAL files in the archive since the previous backup and
> collect information about changed pages.
> 5. Track page changes on the fly. (ptrack)
> 
> They can also be combined to achieve better performance.
> 
> My personal candidate is the last one, since it provides page-level
> granularity, while most of the others approaches can only do file-level
> incremental backups or require additional reads or calculations.
> 

I share the opinion that options 1 and 2 are not particularly
attractive, due to either unreliability, or not really saving that much
CPU and I/O.

I'm not quite sure about 3, because it doesn't really explain how it
would be done - it seems to assume we'd have to reread the files. I'll
get back to this.

Option 4 has some very interesting features. Firstly, it relies on WAL
and so should not require any new code in the server (and it could, in
theory, support even older PostgreSQL releases, for example). Secondly,
the work can be offloaded to a different machine. And it even supports
additional workflows - e.g. "given these two full backups and the WAL,
generate an incremental backup between them".
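
To illustrate the idea behind option 4, here is a toy sketch in Python.
The record layout (LSN, relation name, block number) is a hypothetical
simplification of mine, not the actual WAL format; a real tool would
have to decode WAL records, which can reference several blocks each.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class WalRecord(NamedTuple):
    # Hypothetical, simplified view of a WAL record (an assumption made
    # for illustration; real records carry far more information).
    lsn: int
    relation: str
    block: int


def changed_blocks(records: Iterable[WalRecord],
                   start_lsn: int,
                   stop_lsn: int) -> dict[str, set[int]]:
    """Collect blocks touched by records with start_lsn <= lsn < stop_lsn."""
    changed = defaultdict(set)
    for rec in records:
        if start_lsn <= rec.lsn < stop_lsn:
            changed[rec.relation].add(rec.block)
    return dict(changed)


wal = [
    WalRecord(100, "t1", 0),
    WalRecord(150, "t1", 7),
    WalRecord(200, "t2", 3),
    WalRecord(250, "t1", 7),
    WalRecord(300, "t2", 9),
]

# Blocks to copy for an incremental backup between two full backups
# assumed to have been taken at LSN 150 and LSN 300.
print(changed_blocks(wal, 150, 300))
```

Note that nothing here needs access to the running server: only the
archived WAL and the two LSNs, which is why this could run on a
different machine.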

So I'm somewhat hesitant to proclaim option 5 as the clear winner, here.


> In a nutshell, using ptrack patch, PostgreSQL can track page changes on
> the fly. Each time a relation page is updated, this page is marked in a
> special PTRACK bitmap fork for this relation. As one page requires just
> one bit in the PTRACK fork, such bitmaps are quite small. Tracking
> implies some minor overhead on the database server operation but speeds
> up incremental backups significantly.
> 

That makes sense, although I find the "ptrack" acronym somewhat
cryptic, and we should probably look for something more descriptive. But
the naming can wait, I guess.

My main question is whether a bitmap is the right data structure. It
seems to cause a lot of complexity later, because it needs to be reset
once in a while, you have to be careful about failed incremental
backups, etc.

What if we tracked the LSN for each page instead? Sure, it'd require,
say, 64x more space (1 bit -> 8 bytes per page), but it would not
require resets, you could take incremental backups from an arbitrary
point in time, and so on. That seems like a significant improvement to
me, so perhaps the space requirements are justified (still just 1MB for
a 1GB segment, with the default 8kB pages).
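
A quick sketch in Python of the arithmetic and of what such a map might
look like. The LsnMap class and its method names are hypothetical, made
up for illustration; this is not how the ptrack patch stores anything.

```python
PAGE_SIZE = 8 * 1024        # default PostgreSQL block size (8kB)
SEGMENT_SIZE = 1024 ** 3    # one 1GB relation segment

pages = SEGMENT_SIZE // PAGE_SIZE   # pages per segment
bitmap_bytes = pages // 8           # 1 bit per page
lsn_map_bytes = pages * 8           # 8 bytes (one LSN) per page


class LsnMap:
    """Hypothetical per-page map of the last LSN that modified a page."""

    def __init__(self, npages: int) -> None:
        self.last_lsn = [0] * npages

    def mark(self, block: int, lsn: int) -> None:
        # Remember only the newest LSN seen for the page.
        if lsn > self.last_lsn[block]:
            self.last_lsn[block] = lsn

    def changed_since(self, lsn: int) -> list[int]:
        # ">=" mirrors the "LSN >= last backup LSN" rule from option 3.
        return [b for b, page_lsn in enumerate(self.last_lsn)
                if page_lsn >= lsn]


m = LsnMap(8)
m.mark(2, 100)
m.mark(5, 200)
m.mark(2, 300)

print(pages, bitmap_bytes, lsn_map_bytes)
print(m.changed_since(150), m.changed_since(250))
```

The map is never reset: each backup just asks for pages changed since
its own starting LSN, so a failed incremental backup does not leave any
state behind that needs cleaning up.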

> Detailed overview of the implementation with all pros and cons,
> patches and links to the related threads you can find here:
> 
> https://wiki.postgresql.org/index.php?title=PTRACK_incremental_backups.
> 
> Patches for v 10.1 and v 9.6 are attached.
> Since ptrack is basically just an API for use in backup tools, it is
> impossible to test the patch independently.
> Now it is integrated with our backup utility, called pg_probackup. You can
> find it here: https://github.com/postgrespro/pg_probackup
> Let me know if you find the documentation too long and complicated, I'll
> write a brief How-to for ptrack backups.
> 
> Spoiler: Please consider this patch and README as a proof of concept. It
> can be improved in some ways, but in its current state PTRACK is a
> stable prototype, reviewed and tested well enough to find many
> non-trivial corner cases and subtle problems. And any discussion of
> change track algorithm must be aware of them. Feel free to share your
> concerns and point out any shortcomings of the idea or the implementation.
> 

Thanks for the proposal and patch!

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

