Thread: Tracking of page changes for backup purposes. PTRACK [POC]

Tracking of page changes for backup purposes. PTRACK [POC]

From
Anastasia Lubennikova
Date:
In this thread I would like to raise the issue of incremental backups.
What I suggest in this thread, is to choose one direction, so we can
concentrate our community efforts.
There is already a number of tools, which provide incremental backup.
And we can see five principle techniques they use:

1. Use file modification time as a marker that the file has changed.
2. Compute file checksums and compare them.
3. LSN-based mechanisms. Backup pages with LSN >= last backup LSN.
4. Scan all WAL files in the archive since the previous backup and
collect information about changed pages.
5. Track page changes on the fly. (ptrack)

They can also be combined to achieve better performance.

My personal candidate is the last one, since it provides page-level
granularity, while most of the others approaches can only do file-level
incremental backups or require additional reads or calculations.

In a nutshell, using ptrack patch, PostgreSQL can track page changes on
the fly. Each time a relation page is updated, this page is marked in a
special PTRACK bitmap fork for this relation. As one page requires just
one bit in the PTRACK fork, such bitmaps are quite small. Tracking
implies some minor overhead on the database server operation but speeds
up incremental backups significantly.

Detailed overview of the implementation with all pros and cons,
patches and links to the related threads you can find here:

https://wiki.postgresql.org/index.php?title=PTRACK_incremental_backups.

Patches for v 10.1 and v 9.6 are attached.
Since ptrack is basically just an API for use in backup tools, it is
impossible to test the patch independently.
Now it is integrated with our backup utility, called pg_probackup. You can
find it herehttps://github.com/postgrespro/pg_probackup
Let me know if you find the documentation too long and complicated, I'll
write a brief How-to for ptrack backups.

Spoiler: Please consider this patch and README as a proof of concept. It
can be improved in some ways, but in its current state PTRACK is a
stable prototype, reviewed and tested well enough to find many
non-trivial corner cases and subtle problems. And any discussion of
change track algorithm must be aware of them. Feel free to share your
concerns and point out any shortcomings of the idea or the implementation.

-- 
Anastasia Lubennikova
Postgres Professional:http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: Tracking of page changes for backup purposes. PTRACK [POC]

From
Aleksander Alekseev
Date:
Hello Anastasia,

Personally I'm very glad PTRACK is finally proposed to be ported to the
community PostgreSQL.

> Since ptrack is basically just an API for use in backup tools, it is
> impossible to test the patch independently.

I believe it's worth to create an extension that will provide access to
the PTRACK's public API. Even if there will be not many users interested
in this extension, we should have at least some basic tests like "if we
write to the table, its pages change", "if we update a tuple,
corresponding indexes change", "if we clean up a bitmap, PTRACK says there
are no changed pages", etc.

--
Best regards,
Aleksander Alekseev

Attachment

Re: Tracking of page changes for backup purposes. PTRACK [POC]

From
Aleksander Alekseev
Date:
Hello Anastasia,

> ptrack_9.6.6_v1.4.patch

Also I'm surprised you proposed a patch for 9.6. Since PTRACK is a new
feature I don't believe we are going to backport it.

--
Best regards,
Aleksander Alekseev

Attachment

Re: Tracking of page changes for backup purposes. PTRACK [POC]

From
Andrey Borodin
Date:
Hello!

Thanks for sharing this work! I think this is important feature to make backups more efficient.
> 18 дек. 2017 г., в 15:18, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> написал(а):
>
> Patches for v 10.1 and v 9.6 are attached.
> Since ptrack is basically just an API for use in backup tools, it is
> impossible to test the patch independently.
> Now it is integrated with our backup utility, called pg_probackup. You can
> find it herehttps://github.com/postgrespro/pg_probackup
I can add experimental support for WAL-G too. We have QA tools for delta backups too.
>
> Spoiler: Please consider this patch and README as a proof of concept. It
> can be improved in some ways, but in its current state PTRACK is a
> stable prototype, reviewed and tested well enough to find many
> non-trivial corner cases and subtle problems. And any discussion of
> change track algorithm must be aware of them. Feel free to share your
> concerns and point out any shortcomings of the idea or the implementation.
This version of the patch is quite big - basically it accompanies most of START_CRIT_SECTION() calls with PTRACK calls.
I have two concerns about this:
1. Does this affect the performance of the database when PTRACK is not enabled?
2. What is the cost of having PTRACK enabled?

Best regards, Andrey Borodin.

Re: Tracking of page changes for backup purposes. PTRACK [POC]

From
Aleksander Alekseev
Date:
Hello hackers,

> I have two concerns about this:
> 1. Does this affect the performance of the database when PTRACK is not enabled?
> 2. What is the cost of having PTRACK enabled?

I played with this patch a bit and did a simple benchmark on my laptop.

Configuration:

```
make distclean && ./configure prefix=/some/path/ && make -s -j4
```

postgresql.conf:

```
max_prepared_transactions = 100
wal_level = logical
wal_keep_segments = 128
max_connections = 100
wal_log_hints = on
max_wal_senders = 8
wal_keep_segments = 64
listen_addresses = '*'
hot_standby = on
log_statement = all
max_locks_per_transaction = 256
shared_buffers = 1GB
```

The benchmark:

```
pgbench -i && pgbench -j 4 -c 4 -T 300 -P 10
```

Here are the results.

10.1, ptrack_enable=false

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 4
duration: 300 s
number of transactions actually processed: 28772
latency average = 41.705 ms
latency stddev = 18.274 ms
tps = 95.895605 (including connections establishing)
tps = 95.906434 (excluding connections establishing)

10.1, ptrack_enable=true

scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 4
duration: 300 s
number of transactions actually processed: 28607
latency average = 41.951 ms
latency stddev = 18.454 ms
tps = 95.344363 (including connections establishing)
tps = 95.345290 (excluding connections establishing)


10.1, without ptrack

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 4
duration: 300 s
number of transactions actually processed: 28622
latency average = 41.928 ms
latency stddev = 18.238 ms
tps = 95.396155 (including connections establishing)
tps = 95.397406 (excluding connections establishing)


At first glance PTRACK doesn't seem to affect the overall performance
significantly.

There are a few minor issues with the patch. There is a missing '/'
symbol in the comment before ptrack_get_and_clear procedure:

```
 * Get ptrack file as bytea and clear it */
bytea *
ptrack_get_and_clear(Oid tablespace_oid, Oid table_oid)
{
```

Also I believe the patch should include some changes of
postgresql.conf.sample.

I suggest to add this patch to the closest commitfest. Otherwise it can
be lost.

--
Best regards,
Aleksander Alekseev

Attachment

Re: Tracking of page changes for backup purposes. PTRACK [POC]

From
Robert Haas
Date:
On Mon, Dec 18, 2017 at 8:34 AM, Aleksander Alekseev
<a.alekseev@postgrespro.ru> wrote:
> 10.1, without ptrack
>
> transaction type: <builtin: TPC-B (sort of)>
> scaling factor: 1
> query mode: simple
> number of clients: 4
> number of threads: 4
> duration: 300 s
> number of transactions actually processed: 28622
> latency average = 41.928 ms
> latency stddev = 18.238 ms
> tps = 95.396155 (including connections establishing)
> tps = 95.397406 (excluding connections establishing)
>
>
> At first glance PTRACK doesn't seem to affect the overall performance
> significantly.

I think this doesn't really show much because it's apparently limited
by the speed of fsync() on your filesystem.  You might try running the
test with synchronous_commit=off.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Tracking of page changes for backup purposes. PTRACK [POC]

From
Michael Paquier
Date:
On Mon, Dec 18, 2017 at 7:18 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> Patches for v 10.1 and v 9.6 are attached.

Why no patch for HEAD? If you are planning to show some performance
numbers of some kind you had better run on the latest development
version, which would also avoid interference with any load bottleneck
that could have been removed during the development of v10 or v11. At
quick glance the patch proposed does not interfere with such areas.
-- 
Michael


Re: Tracking of page changes for backup purposes. PTRACK [POC]

From
Michael Paquier
Date:
On Tue, Dec 19, 2017 at 2:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think this doesn't really show much because it's apparently limited
> by the speed of fsync() on your filesystem.  You might try running the
> test with synchronous_commit=off.

You may want to run Postgres on scissors as much as possible by
decreasing checkpoint frequency. etc.

+    /*
+     * Do not track changes for unlogged and temp relations,
+     * since we are not going to backup them anyway.
+     */
+    if (rel->rd_rel->relpersistence != RELPERSISTENCE_PERMANENT)
+        return;
This is not true for init forks of unlogged tables, which are logged
and included in backups.
-- 
Michael


Re: Tracking of page changes for backup purposes. PTRACK [POC]

From
Aleksander Alekseev
Date:
Hello Robert,

> I think this doesn't really show much because it's apparently limited
> by the speed of fsync() on your filesystem.  You might try running the
> test with synchronous_commit=off.

You are right, synchronous_commit=off revealed a noticeable performance
degradation. Also I realized that using log_statement=all was not very
smart as well. Here are the results.

10.1, ptrack_enable=false, synchronous_commit = off

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 4
duration: 300 s
number of transactions actually processed: 1713550
latency average = 0.700 ms
latency stddev = 0.434 ms
tps = 5711.822110 (including connections establishing)
tps = 5712.251807 (excluding connections establishing)

10.1, ptrack_enable=true, synchronous_commit = off

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 4
duration: 300 s
number of transactions actually processed: 1691011
latency average = 0.710 ms
latency stddev = 0.380 ms
tps = 5636.691378 (including connections establishing)
tps = 5636.730514 (excluding connections establishing)

10.1, without ptrack, synchronous_commit = off

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 4
duration: 300 s
number of transactions actually processed: 1843623
latency average = 0.651 ms
latency stddev = 0.589 ms
tps = 6145.395486 (including connections establishing)
tps = 6145.441431 (excluding connections establishing)

--
Best regards,
Aleksander Alekseev

Attachment

Re: Tracking of page changes for backup purposes. PTRACK [POC]

From
Tomas Vondra
Date:
Hi,

a couple of months ago there was proposal / patch with the similar
goals, from Andrey Borodin. See these two threads

[1]

https://www.postgresql.org/message-id/flat/843D96CC-7C55-4296-ADE0-622A7ACD4978%40yesql.se#843D96CC-7C55-4296-ADE0-622A7ACD4978@yesql.se

[2]

https://www.postgresql.org/message-id/flat/449A7A9D-DB58-40F8-B80E-4C4EE7DB47FD%40yandex-team.ru#449A7A9D-DB58-40F8-B80E-4C4EE7DB47FD@yandex-team.ru

I recall there was a long discussion regarding which of the approaches
is the *right* one (although that certainly depends on other factors).

On 12/18/2017 11:18 AM, Anastasia Lubennikova wrote:
> In this thread I would like to raise the issue of incremental backups.
> What I suggest in this thread, is to choose one direction, so we can
> concentrate our community efforts.
> There is already a number of tools, which provide incremental backup.
> And we can see five principle techniques they use:
> 
> 1. Use file modification time as a marker that the file has changed.
> 2. Compute file checksums and compare them.
> 3. LSN-based mechanisms. Backup pages with LSN >= last backup LSN.
> 4. Scan all WAL files in the archive since the previous backup and
> collect information about changed pages.
> 5. Track page changes on the fly. (ptrack)
> 
> They can also be combined to achieve better performance.
> 
> My personal candidate is the last one, since it provides page-level
> granularity, while most of the others approaches can only do file-level
> incremental backups or require additional reads or calculations.
> 

I share the opinion that options 1 and 2 are not particularly
attractive, due to either unreliability, or not really saving that much
CPU and I/O.

I'm not quite sure about 3, because it doesn't really explain how would
it be done - it seems to assume we'd have to reread the files. I'll get
back to this.

Option 4 has some very interesting features. Firstly, relies on WAL and
so should not require any new code (and it could, in theory, support
even older PostgreSQL releases, for example). Secondly, this can be
offloaded to a different machine. And it does even support additional
workflows - e.g. "given these two full backups and the WAL, generate an
incremental backup between them".

So I'm somewhat hesitant to proclaim option 5 as the clear winner, here.


> In a nutshell, using ptrack patch, PostgreSQL can track page changes on
> the fly. Each time a relation page is updated, this page is marked in a
> special PTRACK bitmap fork for this relation. As one page requires just
> one bit in the PTRACK fork, such bitmaps are quite small. Tracking
> implies some minor overhead on the database server operation but speeds
> up incremental backups significantly.
> 

That makes sense, I guess, although I find the "ptrack" acronym somewhat
cryptic, and we should probably look for something more descriptive. But
the naming can wait, I guess.

My main question is if bitmap is the right data type. It seems to cause
a lot of complexity later, because it needs to be reset once in a while,
you have to be careful about failed incremental backups etc.

What if we tracked the LSN for each page instead? Sure, it'd require so,
64x more space (1 bit -> 8 bytes per page), but it would not require
resets, you could take incremental backups from arbitrary point in time,
and so on. That seems like a significant improvement to me, so perhaps
the space requirements are justified (still just 1MB for 1GB segment,
with the default 8kB pages).

> Detailed overview of the implementation with all pros and cons,
> patches and links to the related threads you can find here:
> 
> https://wiki.postgresql.org/index.php?title=PTRACK_incremental_backups.
> 
> Patches for v 10.1 and v 9.6 are attached.
> Since ptrack is basically just an API for use in backup tools, it is
> impossible to test the patch independently.
> Now it is integrated with our backup utility, called pg_probackup. You can
> find it herehttps://github.com/postgrespro/pg_probackup
> Let me know if you find the documentation too long and complicated, I'll
> write a brief How-to for ptrack backups.
> 
> Spoiler: Please consider this patch and README as a proof of concept. It
> can be improved in some ways, but in its current state PTRACK is a
> stable prototype, reviewed and tested well enough to find many
> non-trivial corner cases and subtle problems. And any discussion of
> change track algorithm must be aware of them. Feel free to share your
> concerns and point out any shortcomings of the idea or the implementation.
> 

Thanks for the proposal and patch!

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Tracking of page changes for backup purposes. PTRACK [POC]

From
Robert Haas
Date:
On Tue, Dec 19, 2017 at 5:37 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 12/18/2017 11:18 AM, Anastasia Lubennikova wrote:
>> 1. Use file modification time as a marker that the file has changed.
>> 2. Compute file checksums and compare them.
>> 3. LSN-based mechanisms. Backup pages with LSN >= last backup LSN.
>> 4. Scan all WAL files in the archive since the previous backup and
>> collect information about changed pages.
>> 5. Track page changes on the fly. (ptrack)
>
> I share the opinion that options 1 and 2 are not particularly
> attractive, due to either unreliability, or not really saving that much
> CPU and I/O.
>
> I'm not quite sure about 3, because it doesn't really explain how would
> it be done - it seems to assume we'd have to reread the files. I'll get
> back to this.
>
> Option 4 has some very interesting features. Firstly, relies on WAL and
> so should not require any new code (and it could, in theory, support
> even older PostgreSQL releases, for example). Secondly, this can be
> offloaded to a different machine. And it does even support additional
> workflows - e.g. "given these two full backups and the WAL, generate an
> incremental backup between them".
>
> So I'm somewhat hesitant to proclaim option 5 as the clear winner, here.

I agree.  I think (4) is better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Tracking of page changes for backup purposes. PTRACK [POC]

From
Pavel Stehule
Date:


2017-12-20 21:11 GMT+01:00 Robert Haas <robertmhaas@gmail.com>:
On Tue, Dec 19, 2017 at 5:37 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 12/18/2017 11:18 AM, Anastasia Lubennikova wrote:
>> 1. Use file modification time as a marker that the file has changed.
>> 2. Compute file checksums and compare them.
>> 3. LSN-based mechanisms. Backup pages with LSN >= last backup LSN.
>> 4. Scan all WAL files in the archive since the previous backup and
>> collect information about changed pages.
>> 5. Track page changes on the fly. (ptrack)
>
> I share the opinion that options 1 and 2 are not particularly
> attractive, due to either unreliability, or not really saving that much
> CPU and I/O.
>
> I'm not quite sure about 3, because it doesn't really explain how would
> it be done - it seems to assume we'd have to reread the files. I'll get
> back to this.
>
> Option 4 has some very interesting features. Firstly, relies on WAL and
> so should not require any new code (and it could, in theory, support
> even older PostgreSQL releases, for example). Secondly, this can be
> offloaded to a different machine. And it does even support additional
> workflows - e.g. "given these two full backups and the WAL, generate an
> incremental backup between them".
>
> So I'm somewhat hesitant to proclaim option 5 as the clear winner, here.

I agree.  I think (4) is better.

Can depends on load? For smaller intensive updated databases the 5 can be optimal, for large less updated databases the 4 can be better.

Regards

Pavel


--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Tracking of page changes for backup purposes. PTRACK [POC]

From
Robert Haas
Date:
On Wed, Dec 20, 2017 at 3:15 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>> > So I'm somewhat hesitant to proclaim option 5 as the clear winner, here.
>>
>> I agree.  I think (4) is better.
>
> Can depends on load? For smaller intensive updated databases the 5 can be
> optimal, for large less updated databases the 4 can be better.

It seems to me that the difference is that (4) tracks which pages have
changed in the background, and (5) does it in the foreground.  Why
would we want the latter?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Tracking of page changes for backup purposes. PTRACK [POC]

From
Pavel Stehule
Date:


2017-12-20 21:18 GMT+01:00 Robert Haas <robertmhaas@gmail.com>:
On Wed, Dec 20, 2017 at 3:15 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>> > So I'm somewhat hesitant to proclaim option 5 as the clear winner, here.
>>
>> I agree.  I think (4) is better.
>
> Can depends on load? For smaller intensive updated databases the 5 can be
> optimal, for large less updated databases the 4 can be better.

It seems to me that the difference is that (4) tracks which pages have
changed in the background, and (5) does it in the foreground.  Why
would we want the latter?

Isn't more effective hold this info in Postgres than in backup sw? Then any backup sw can use this implementation.


--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Tracking of page changes for backup purposes. PTRACK [POC]

From
Tomas Vondra
Date:
On 12/20/2017 09:29 PM, Pavel Stehule wrote:
> 
> 
> 2017-12-20 21:18 GMT+01:00 Robert Haas <robertmhaas@gmail.com
> <mailto:robertmhaas@gmail.com>>:
> 
>     On Wed, Dec 20, 2017 at 3:15 PM, Pavel Stehule
>     <pavel.stehule@gmail.com <mailto:pavel.stehule@gmail.com>> wrote:
>     >> > So I'm somewhat hesitant to proclaim option 5 as the clear winner, here.
>     >>
>     >> I agree.  I think (4) is better.
>     >
>     > Can depends on load? For smaller intensive updated databases the 5 can be
>     > optimal, for large less updated databases the 4 can be better.
> 
>     It seems to me that the difference is that (4) tracks which pages have
>     changed in the background, and (5) does it in the foreground.  Why
>     would we want the latter?
> 
> 
> Isn't more effective hold this info in Postgres than in backup sw?
> Then any backup sw can use this implementation.
> 

I don't think it means it can't be implemented in Postgres, but does it
need to be done in backend?

For example, it might be a command-line tool similar to pg_waldump,
which processes WAL segments and outputs list of modified blocks,
possibly with the matching LSN. Or perhaps something like pg_receivewal,
doing that in streaming mode.

This part of the solution can still be part of PostgreSQL codebase, and
the rest has to be part of backup solution anyway.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Tracking of page changes for backup purposes. PTRACK [POC]

From
Robert Haas
Date:
On Wed, Dec 20, 2017 at 3:45 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>> Isn't more effective hold this info in Postgres than in backup sw?
>> Then any backup sw can use this implementation.
>
> I don't think it means it can't be implemented in Postgres, but does it
> need to be done in backend?
>
> For example, it might be a command-line tool similar to pg_waldump,
> which processes WAL segments and outputs list of modified blocks,
> possibly with the matching LSN. Or perhaps something like pg_receivewal,
> doing that in streaming mode.
>
> This part of the solution can still be part of PostgreSQL codebase, and
> the rest has to be part of backup solution anyway.

I agree with all of that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Tracking of page changes for backup purposes. PTRACK [POC]

From
Michael Paquier
Date:
On Thu, Dec 21, 2017 at 7:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Dec 20, 2017 at 3:45 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>> Isn't more effective hold this info in Postgres than in backup sw?
>>> Then any backup sw can use this implementation.
>>
>> I don't think it means it can't be implemented in Postgres, but does it
>> need to be done in backend?
>>
>> For example, it might be a command-line tool similar to pg_waldump,
>> which processes WAL segments and outputs list of modified blocks,
>> possibly with the matching LSN. Or perhaps something like pg_receivewal,
>> doing that in streaming mode.
>>
>> This part of the solution can still be part of PostgreSQL codebase, and
>> the rest has to be part of backup solution anyway.
>
> I agree with all of that.

+1. This summarizes a bunch of concerns about all kinds of backend
implementations proposed. Scanning for a list of blocks modified via
streaming gives more availability, but knowing that you will need to
switch to a new segment anyway when finishing a backup, does it really
matter? Doing it once a segment has finished would be cheap enough,
and you can even do it in parallel with a range of segments.

Also, since 9.4 and the introduction of the new WAL API to track
modified blocks, you don't need to know about the record types to know
which blocks are being changed. Here is an example of tool I hacked up
in a couple of hours that does actually what you are looking for, aka
a scanner of the blocks modified per record for a given WAL segment
using xlogreader.c:
https://github.com/michaelpq/pg_plugins/tree/master/pg_wal_blocks

You could just use that and shape the data in the way you want and you
would be good to go.
-- 
Michael


Re: Tracking of page changes for backup purposes. PTRACK [POC]

From
Andrey Borodin
Date:
Hi!

> 21 дек. 2017 г., в 5:51, Michael Paquier <michael.paquier@gmail.com> написал(а):
>
> On Thu, Dec 21, 2017 at 7:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Dec 20, 2017 at 3:45 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>>> Isn't more effective hold this info in Postgres than in backup sw?
>>>> Then any backup sw can use this implementation
>> [Skipped]
>> I agree with all of that.
>
> +1. This summarizes a bunch of concerns about all kinds of backend
> implementations proposed. Scanning for a list of blocks modified via
> streaming gives more availability, but knowing that you will need to
> switch to a new segment anyway when finishing a backup, does it really
> matter? Doing it once a segment has finished would be cheap enough,
> and you can even do it in parallel with a range of segments.
>
> Also, since 9.4 and the introduction of the new WAL API to track
> modified blocks, you don't need to know about the record types to know
> which blocks are being changed. Here is an example of tool I hacked up
> in a couple of hours that does actually what you are looking for, aka
> a scanner of the blocks modified per record for a given WAL segment
> using xlogreader.c:
> https://github.com/michaelpq/pg_plugins/tree/master/pg_wal_blocks
>
> You could just use that and shape the data in the way you want and you
> would be good to go.

Michael, that's almost what I want. I've even filed GSoC proposal for this [0].
But can we have something like this in Postgres?
The tool I'm hacking is in Go, I cannot just embed bunch of Postgres C into it. That is why API, like PTRACK, suits my
needsbetter. Not because it uses any superior mechanics, but because it is API, ready for external 3rd party backup
software(having backup software in Pg would be even better). 


Anastasia, I've implemented PTRACK support in WAL-G.
First, there are few minor issues with patch:
1. There is malformed comment
2. Function pg_ptrack_version() is absent

Then, I think that API is far from perfect: pg_ptrack_get_and_clear() changes global ptrack_clear_lsn, which introduces
someweakness (for paranoids). May be use something like "pg_ptrack_get_and_clear(oid,oid,previous_lsn)" which will fail
ifprevious_lsn do not match? Also, function pg_ptrack_get_and_clear() do not return errors when there is no table with
thisoid. Finally, I had to interpret any empty map as absence of map. From my POV, function must fail on errors like:
invaidoid passed, no table found, no PTRACK map exists, et c. 
I use external file-tracking mechanics, so function pg_ptrack_init_get_and_clear() was of no use for me.

Last, but most important for me: my tests showed lost page updates. Probably, it is bug or paranoia in my test
software.But may I ask you to check this [1] code, which converts PTRACK map to number of block numbers. Do I get
meaningof PTRACK map right? Thank you very much. 

[0] https://wiki.postgresql.org/index.php?title=GSoC_2018#WAL-G_delta_backups_with_WAL_scanning_.282018.29
[1] https://github.com/wal-g/wal-g/blob/ptrack/pagefile.go#L167-L173



Re: Tracking of page changes for backup purposes. PTRACK [POC]

From
Magnus Hagander
Date:


On Thu, Dec 21, 2017 at 1:51 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Thu, Dec 21, 2017 at 7:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Dec 20, 2017 at 3:45 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>> Isn't more effective hold this info in Postgres than in backup sw?
>>> Then any backup sw can use this implementation.
>>
>> I don't think it means it can't be implemented in Postgres, but does it
>> need to be done in backend?
>>
>> For example, it might be a command-line tool similar to pg_waldump,
>> which processes WAL segments and outputs list of modified blocks,
>> possibly with the matching LSN. Or perhaps something like pg_receivewal,
>> doing that in streaming mode.
>>
>> This part of the solution can still be part of PostgreSQL codebase, and
>> the rest has to be part of backup solution anyway.
>
> I agree with all of that.

+1. This summarizes a bunch of concerns about all kinds of backend
implementations proposed. Scanning for a list of blocks modified via
streaming gives more availability, but knowing that you will need to
switch to a new segment anyway when finishing a backup, does it really
matter? Doing it once a segment has finished would be cheap enough,
and you can even do it in parallel with a range of segments.

There's definitely a lot of value to that, in particular the being able to do out entirely outside the backend making it possible to extract the data from an existing log archive.

Just to throw another option out  there, it could also be implemented at least partially as a walsender command. That way you can get it out through a replication connection and piggyback on things like replication slots to make sure you have the data you need, without having to send the full volume of data. And it would make it possible to do incremental/differential *without* having a WAL archive in the first place.

--