Re: block-level incremental backup - Mailing list pgsql-hackers

From Anastasia Lubennikova
Subject Re: block-level incremental backup
Msg-id 3e15314c-b8de-2c81-6722-80c33423bc85@postgrespro.ru
In response to block-level incremental backup  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: block-level incremental backup
List pgsql-hackers
09.04.2019 18:48, Robert Haas writes:
> Thoughts?
Hi,
Thank you for bringing that up.
In-core support of incremental backups is a long-awaited feature.
Hopefully, this take will end up committed in PG13.

Speaking of UI:
1) I agree that it should be implemented as a new replication command.

2) There should be a command to get only a map of changes without actual 
data.

Most backup tools establish a server connection, so they can use this 
protocol to get the list of changed blocks.
Then they can use this information for any purpose. For example, 
distribute files between parallel workers to copy the data,
or estimate backup size before data is sent, or store metadata 
separately from the data itself.
Most methods (except straightforward LSN comparison) consist of two 
steps: get a map of changes and read blocks.
So it won't add much extra work.

example commands:
GET_FILELIST [lsn]
returning JSON (or whatever) with filenames and maps of changed blocks

The map format is also open for discussion.
Currently in pg_probackup we reuse code from pg_rewind/datapagemap,
though I'm not sure this format is well suited for sending data via the protocol.
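
To illustrate how a backup tool might consume such a map, here is a minimal Python sketch. The JSON layout (file name mapped to a list of changed block numbers), the file names, and the helper names are assumptions made up for illustration; GET_FILELIST is only a proposal and no reply format has been agreed on.

```python
import json

BLCKSZ = 8192  # default PostgreSQL block size

# Hypothetical GET_FILELIST reply: relative file name -> changed block numbers.
reply = json.dumps({
    "base/16384/16385": [0, 3, 4, 17],
    "base/16384/16390": [2],
    "base/16384/16402": [0, 1, 2, 3, 4, 5],
})

def estimate_backup_size(filemap):
    """Bytes of page data the incremental backup would have to transfer."""
    return sum(len(blocks) for blocks in filemap.values()) * BLCKSZ

def distribute(filemap, nworkers):
    """Greedy split: hand each file (largest first) to the least loaded worker."""
    workers = [{"files": [], "blocks": 0} for _ in range(nworkers)]
    for name, blocks in sorted(filemap.items(), key=lambda kv: -len(kv[1])):
        target = min(workers, key=lambda w: w["blocks"])
        target["files"].append(name)
        target["blocks"] += len(blocks)
    return workers

filemap = json.loads(reply)
size = estimate_backup_size(filemap)
plan = distribute(filemap, 2)
```

Because only the map is needed here, the size estimate and the work split can both be computed before any page data is sent.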

3) The API should provide functions to request data at file and block 
granularity.
It will be useful for parallelism and for various future projects.

example commands:
GET_DATAFILE [filename [map of blocks] ]
GET_DATABLOCK [filename] [blkno]
returning data in some format
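
As a rough sketch of what serving such requests could involve, the Python below reads selected blocks from a data file given a list of block numbers. The function names and the (blkno, page) return shape are hypothetical, and the demo file is synthetic; this is not how any existing server command works.

```python
import os
import tempfile

BLCKSZ = 8192  # default PostgreSQL block size

def read_blocks(path, blknos):
    """Yield (blkno, page_bytes) for each requested block present in the file."""
    with open(path, "rb") as f:
        for blkno in sorted(blknos):
            f.seek(blkno * BLCKSZ)
            page = f.read(BLCKSZ)
            if len(page) < BLCKSZ:
                break  # past end of file or partial page: stop here
            yield blkno, page

def read_block(path, blkno):
    """Single-block variant, in the spirit of the GET_DATABLOCK idea."""
    for _, page in read_blocks(path, [blkno]):
        return page
    return None

# Demo on a synthetic four-block file; block i is filled with byte value i.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for i in range(4):
        tmp.write(bytes([i]) * BLCKSZ)
    demo_path = tmp.name

fetched = list(read_blocks(demo_path, [3, 1]))
missing = read_block(demo_path, 10)
os.unlink(demo_path)
```

Per-file and per-block entry points like these are what would let parallel workers fetch disjoint pieces independently.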

4) The algorithm for collecting changed blocks is a separate topic,
though its API should be discussed here:

Do we want to have multiple implementations?
Personally, I think that it's good to provide several strategies,
since they have different requirements and fit different workloads.

Maybe we can add a hook to allow custom implementations.

Do we want to allow the backup client to specify which block collection 
method to use?
example commands:
GET_FILELIST [lsn] [METHOD lsn | page | ptrack | etc]
Or should it be a server-side cost-based decision?
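
One way to picture pluggable strategies with a client-selectable METHOD is a small registry, sketched below in Python. Everything here is hypothetical: the registry, the decorator, the placeholder collectors, and the choice of "lsn" as the fallback are illustration only, and a real cost-based decision would replace the fixed default.

```python
COLLECTORS = {}  # METHOD name -> block-collection function

def register_collector(name):
    """Decorator that registers a strategy under a METHOD name (a stand-in for a hook)."""
    def wrap(fn):
        COLLECTORS[name] = fn
        return fn
    return wrap

@register_collector("lsn")
def collect_by_lsn(start_lsn):
    # Placeholder: a real implementation would scan page headers.
    return {"method": "lsn", "start_lsn": start_lsn}

@register_collector("ptrack")
def collect_by_ptrack(start_lsn):
    # Placeholder: a real implementation would read a change-tracking map.
    return {"method": "ptrack", "start_lsn": start_lsn}

def get_filelist(start_lsn, method=None):
    """Use the client's METHOD if given, otherwise a server-side default."""
    name = method or "lsn"  # assumed default, standing in for a cost-based choice
    if name not in COLLECTORS:
        raise ValueError("unknown block collection method: " + name)
    return COLLECTORS[name](start_lsn)
```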

5) The method based on LSN comparison stands out - it can be done in one 
pass.
So it probably requires special protocol commands.
for example:
GET_DATAFILES [lsn]
GET_DATAFILE [filename] [lsn]

This is pretty simple to implement and pg_basebackup can use this method,
at least until we have something more advanced in-core.
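
The one-pass nature of the LSN method can be sketched in Python as follows. It relies on the page LSN being stored in the first 8 bytes of each page header as two 32-bit halves; the helper names, the synthetic pages, and the strict "greater than" comparison are assumptions, and the sketch assumes it runs on a machine with the same byte order as the server. All-zero (new) pages carry no LSN and would need separate handling in a real tool.

```python
import struct

BLCKSZ = 8192  # default PostgreSQL block size

def page_lsn(page):
    """Read pd_lsn from the start of the page header (high half, then low half)."""
    hi, lo = struct.unpack_from("=II", page, 0)
    return (hi << 32) | lo

def parse_lsn(text):
    """Convert the textual 'X/Y' LSN form into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def changed_blocks(data, start_lsn):
    """Single pass over a file image: keep block numbers with LSN above start_lsn."""
    out = []
    for off in range(0, len(data) - BLCKSZ + 1, BLCKSZ):
        if page_lsn(data[off:off + BLCKSZ]) > start_lsn:
            out.append(off // BLCKSZ)
    return out

def fake_page(lsn_text):
    """Build a synthetic page whose header carries the given LSN (demo only)."""
    lsn = parse_lsn(lsn_text)
    return struct.pack("=II", lsn >> 32, lsn & 0xFFFFFFFF).ljust(BLCKSZ, b"\0")

image = fake_page("0/1000") + fake_page("0/3000") + fake_page("1/0")
result = changed_blocks(image, parse_lsn("0/2000"))
```

Here the map and the data are discovered in the same scan, which is why this method does not split naturally into "get map" and "read blocks" steps.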

I'll be happy to help with design, code, review, and testing.
Hope that my experience with pg_probackup will be useful.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



