Re: block-level incremental backup - Mailing list pgsql-hackers

From Andrey Borodin
Subject Re: block-level incremental backup
Date
Msg-id D72B7E92-2F1A-4F55-B1F2-6374E18C6C28@yandex-team.ru
In response to block-level incremental backup  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: block-level incremental backup
List pgsql-hackers
Hi!

> On 9 Apr 2019, at 20:48, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Thoughts?
Thanks for this long and thoughtful post!

At Yandex, we have been using incremental backups for some years now. Initially we used a patched pgbarman, then we
implemented this functionality in WAL-G. There are still many things to be done. We have more than 1 PB of clusters
backed up with this technology.
Most of the time we use this technology as part of the HA setup in our managed PostgreSQL service. So, for us, the main
goals are to operate backups cheaply and to restore a new node quickly. Here's what I see from our perspective.

1. Yes, this feature is important.

2. Its importance does not come from reduced disk storage: magnetic disks and object storage are very cheap.

3. Incremental backups save a lot of network bandwidth. It is non-trivial for the storage system to ingest hundreds of
TB daily.

4. Incremental backups duplicate the WAL, in a form intended for parallel application. An incremental backup applied
sequentially is not very useful: in many cases it will not be much faster than simple WAL replay.

5. Since increments duplicate WAL functionality, it is not worth pursuing tradeoffs for the sake of reduced storage
utilization. We scan WAL during archiving, extract the numbers of changed blocks, and store a changemap for a group of
WAL files in the archive.

6. These changemaps can be used for the increment of the visibility map (if I recall correctly). But you cannot compare
LSNs on a page of the visibility map: some operations do not bump them.

7. We use changemaps during backups and during WAL replay: we know which blocks will change far in advance and prefetch
them into the page cache, like pg_prefaulter does.

8. There is similar functionality in RMAN for one well-known database. They used to store 8 sets of change maps. That
database also has cool functionality "increment for catchup".

9. We call an incremental backup a "delta backup". This wording describes the purpose more precisely: it is not the
"next version of the DB", it is the "difference between two DB states". But the choice of wording does not matter much.
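
To illustrate points 5 and 7, here is a minimal sketch of the changemap idea in Python. It is not WAL-G's actual code or
on-disk format. It assumes the WAL has already been parsed into (relation file, block number) pairs, and the names
build_changemap, merge_changemaps and blocks_to_copy are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def build_changemap(wal_block_refs):
    """Collect changed block numbers per relation file.

    wal_block_refs is an iterable of (relfile, block_no) pairs, assumed to
    come from some WAL parser that is out of scope for this sketch.
    """
    changemap = defaultdict(set)
    for relfile, block_no in wal_block_refs:
        changemap[relfile].add(block_no)
    return dict(changemap)


def merge_changemaps(changemaps):
    """Union several per-group changemaps into one covering a longer WAL range."""
    merged = defaultdict(set)
    for changemap in changemaps:
        for relfile, blocks in changemap.items():
            merged[relfile] |= blocks
    return dict(merged)


def blocks_to_copy(changemap, relfile):
    """Blocks of relfile that a delta backup (or a prefetcher) needs to touch."""
    return sorted(changemap.get(relfile, ()))


# Two groups of WAL files, each summarized by its own changemap.
group_a = build_changemap([("base/1/100", 3), ("base/1/100", 3), ("base/1/200", 0)])
group_b = build_changemap([("base/1/100", 7)])
merged = merge_changemaps([group_a, group_b])
print(blocks_to_copy(merged, "base/1/100"))
```

Because the maps are plain sets of block numbers, maps for adjacent WAL groups can be unioned in any order, and the
same merged map can drive both the delta backup and the prefetch during replay.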


Here are slides from my talk at PgConf.APAC[0]. I've proposed a talk on this matter to PgCon, but it was not accepted.
I will try next year :)

> On 9 Apr 2019, at 20:48, Robert Haas <robertmhaas@gmail.com> wrote:
> - This is just a design proposal at this point; there is no code.  If
> this proposal, or some modified version of it, seems likely to be
> acceptable, I and/or my colleagues might try to implement it.

I'll be happy to help with code, discussion and patch review.

Best regards, Andrey Borodin.

[0] https://yadi.sk/i/Y_S1iqNN5WxS6A

