Thread: block-level incremental backup

block-level incremental backup

From
Robert Haas
Date:
Hi,

Several companies, including EnterpriseDB, NTT, and Postgres Pro, have
developed technology that permits a block-level incremental backup to
be taken from a PostgreSQL server.  I believe the idea in all of those
cases is that non-relation files should be backed up in their
entirety, but for relation files, only those blocks that have been
changed need to be backed up.  I would like to propose that we should
have a solution for this problem in core, rather than leaving it to
each individual PostgreSQL company to develop and maintain their own
solution. Generally my idea is:

1. There should be a way to tell pg_basebackup to request from the
server only those blocks where LSN >= threshold_value.  There are
several possible ways for the server to implement this, the simplest
of which is to just scan all the blocks and send only the ones that
satisfy that criterion.  That might sound dumb, but it does still save
network bandwidth, and it works even without any prior setup. It will
probably be more efficient in many cases to instead scan all the WAL
generated since that LSN and extract block references from it, but
that is only possible if the server has all of that WAL available or
can somehow get it from the archive.  We could also, as several people
have proposed previously, have some kind of additional relation fork
that stores either a single is-modified bit -- which only helps if the
reference LSN for the is-modified bit is older than the requested LSN
but not too much older -- or the highest LSN for each range of K
blocks, or something like that.  I am at the moment not too concerned
with the exact strategy we use here. I believe we may want to
eventually support more than one, since they have different
trade-offs.
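
As a rough illustration (nothing here is settled), the "just scan all the
blocks" strategy boils down to reading each 8kB page of a relation segment
and checking pd_lsn, the first eight bytes of the page header.  A minimal
standalone sketch, assuming the standard block size and native byte order,
and taking the threshold as a plain 64-bit number rather than the usual
%X/%X notation:

/*
 * Hypothetical sketch: scan one relation segment file and report blocks
 * whose page LSN is >= a threshold.  Server-side code would of course go
 * through the buffer manager / smgr instead of reading files directly.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLCKSZ 8192

int
main(int argc, char **argv)
{
    if (argc != 3)
    {
        fprintf(stderr, "usage: %s segment-file threshold-lsn\n", argv[0]);
        return 1;
    }

    uint64_t    threshold = strtoull(argv[2], NULL, 0);
    FILE       *f = fopen(argv[1], "rb");
    char        page[BLCKSZ];
    uint32_t    blkno = 0;

    if (f == NULL)
    {
        perror("fopen");
        return 1;
    }

    while (fread(page, 1, BLCKSZ, f) == BLCKSZ)
    {
        /* pd_lsn is the first page header field: {xlogid, xrecoff} */
        uint32_t    xlogid,
                    xrecoff;

        memcpy(&xlogid, page, 4);
        memcpy(&xrecoff, page + 4, 4);

        if ((((uint64_t) xlogid << 32) | xrecoff) >= threshold)
            printf("block %u (lsn %X/%X): include in incremental backup\n",
                   blkno, (unsigned) xlogid, (unsigned) xrecoff);
        blkno++;
    }

    fclose(f);
    return 0;
}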

2. When you use pg_basebackup in this way, each relation file that is
not sent in its entirety is replaced by a file with a different name.
For example, instead of base/16384/16417, you might get
base/16384/partial.16417 or however we decide to name them.  Each such
file will store near the beginning of the file a list of all the
blocks contained in that file, and the blocks themselves will follow
at offsets that can be predicted from the metadata at the beginning of
the file.  The idea is that you shouldn't have to read the whole file
to figure out which blocks it contains, and if you know specifically
what blocks you want, you should be able to reasonably efficiently
read just those blocks.  A backup taken in this manner should also
probably create some kind of metadata file in the root directory that
stops the server from starting and lists other salient details of the
backup.  In particular, you need the threshold LSN for the backup
(i.e. contains blocks newer than this) and the start LSN for the
backup (i.e. the LSN that would have been returned from
pg_start_backup).
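
To make the partial-file idea concrete, here is one possible layout,
sketched with invented names; the only real requirements from the
description above are the up-front block list and predictable offsets:

/* Hypothetical on-disk layout for a "partial" relation file. */
#include <stdint.h>

#define BLCKSZ 8192

typedef struct PartialFileHeader
{
    uint32_t    magic;          /* identifies the partial-file format */
    uint32_t    version;
    uint64_t    threshold_lsn;  /* only blocks newer than this are stored */
    uint32_t    nblocks;        /* number of block numbers that follow */
    /* uint32_t block_numbers[nblocks] immediately follows the header */
} PartialFileHeader;

/* Offset of the i'th stored block image within the partial file. */
static inline uint64_t
partial_block_offset(const PartialFileHeader *hdr, uint32_t i)
{
    uint64_t    data_start = sizeof(PartialFileHeader) +
                             (uint64_t) hdr->nblocks * sizeof(uint32_t);

    /* round up to a block boundary -- purely an assumption here */
    data_start = (data_start + BLCKSZ - 1) & ~((uint64_t) BLCKSZ - 1);
    return data_start + (uint64_t) i * BLCKSZ;
}

With something like this, a reader can learn which blocks are present by
reading only the header and block list, and can seek directly to any block
it wants.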

3. There should be a new tool that knows how to merge a full backup
with any number of incremental backups and produce a complete data
directory with no remaining partial files.  The tool should check that
the threshold LSN for each incremental backup is less than or equal to
the start LSN of the previous backup; if not, there may be changes
that happened in between which would be lost, so combining the backups
is unsafe.  Running this tool can be thought of either as restoring
the backup or as producing a new synthetic backup from any number of
incremental backups.  This would allow for a strategy of unending
incremental backups.  For instance, on day 1, you take a full backup.
On every subsequent day, you take an incremental backup.  On day 9,
you run pg_combinebackup day1 day2 -o full; rm -rf day1 day2; mv full
day2.  On each subsequent day you do something similar.  Now you can
always roll back to any of the last seven days by combining the oldest
backup you have (which is always a synthetic full backup) with as many
newer incrementals as you want, up to the point where you want to
stop.
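
The safety check described above is simple; a sketch, with a hypothetical
manifest structure (none of these names exist anywhere yet):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct BackupManifest
{
    uint64_t    start_lsn;      /* LSN at which this backup started */
    uint64_t    threshold_lsn;  /* blocks newer than this are included;
                                 * 0 for a full backup */
} BackupManifest;

/* backups[0] is the (synthetic) full backup, then incrementals, oldest first */
static bool
backup_chain_is_safe(const BackupManifest *backups, int n)
{
    for (int i = 1; i < n; i++)
    {
        if (backups[i].threshold_lsn > backups[i - 1].start_lsn)
        {
            fprintf(stderr,
                    "backup %d has threshold LSN %llX, newer than the previous backup's start LSN %llX; changes may be missing\n",
                    i,
                    (unsigned long long) backups[i].threshold_lsn,
                    (unsigned long long) backups[i - 1].start_lsn);
            return false;
        }
    }
    return true;
}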

Other random points:
- If the server has multiple ways of finding blocks with an LSN
greater than or equal to the threshold LSN, it could make a cost-based
decision between those methods, or it could allow the client to
specify the method to be used.
- I imagine that the server would offer this functionality through a
new replication command or a syntax extension to an existing command,
so it could also be used by tools other than pg_basebackup if they
wished.
- Combining backups could also be done destructively rather than, as
proposed above, non-destructively, but you have to be careful about
what happens in case of a failure.
- The pg_combinebackup tool (or whatever we call it) should probably
have an option to exploit hard links to save disk space; this could in
particular make construction of a new synthetic full backup much
cheaper.  However you'd better be careful not to use this option when
actually trying to restore, because if you start the server and run
recovery, you don't want to change the copies of those same files that
are in your backup directory.  I guess the server could be taught to
complain about st_nlink > 1 but I'm not sure we want to go there.
- It would also be possible to collapse multiple incremental backups
into a single incremental backup, without combining with a full
backup.  In the worst case, size(i1+i2) = size(i1) + size(i2), but if
the same data is modified repeatedly, collapsing backups would save
lots of space.  This doesn't seem like a must-have for v1, though.
- If you have a SAN and are taking backups using filesystem snapshots,
then you don't need this, because your SAN probably already uses
copy-on-write magic for those snapshots, and so you are already
getting all of the same benefits in terms of saving storage space that
you would get from something like this.  But not everybody has a SAN.
- I know that there have been several previous efforts in this area,
but none of them have gotten to the point of being committed.  I
intend no disrespect to those efforts.  I believe I'm taking a
slightly different view of the problem here than what has been done
previously, trying to focus on the user experience rather than, e.g.,
the technology that is used to decide which blocks need to be sent.
However it's possible I've missed a promising patch that takes an
approach very similar to what I'm outlining here, and if so, I don't
mind a bit having that pointed out to me.
- This is just a design proposal at this point; there is no code.  If
this proposal, or some modified version of it, seems likely to be
acceptable, I and/or my colleagues might try to implement it.
- It would also be nice to support *parallel* backup, both for full
backups as we can do them today and for incremental backups.  But that
sounds like a separate effort.  pg_combinebackup could potentially
support parallel operation as well, although that might be too
ambitious for v1.
- It would also be nice if pg_basebackup could write backups to places
other than the local disk, like an object store, a tape drive, etc.
But that also sounds like a separate effort.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Arthur Zakirov
Date:
Hello,

On 09.04.2019 18:48, Robert Haas wrote:
> - It would also be nice if pg_basebackup could write backups to places
> other than the local disk, like an object store, a tape drive, etc.
> But that also sounds like a separate effort.
> 
> Thoughts? 

(Just thinking out loud) It might also be useful to have a remote restore 
facility (i.e. if pg_combinebackup could write to non-local storage), so 
you don't need to restore the instance into a local place and then copy/move 
it to the remote machine. But it seems to me that this is the most nontrivial 
feature and requires much more effort than the other points.

In pg_probackup we have remote restore via SSH in the beta state. But 
SSH isn't an option for an in-core approach, I think.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: block-level incremental backup

From
Andres Freund
Date:
Hi,

On 2019-04-09 11:48:38 -0400, Robert Haas wrote:
> 2. When you use pg_basebackup in this way, each relation file that is
> not sent in its entirety is replaced by a file with a different name.
> For example, instead of base/16384/16417, you might get
> base/16384/partial.16417 or however we decide to name them.

Hm. But that means that files that are shipped nearly in their entirety,
need to be fully rewritten. Wonder if it's better to ship them as files
with holes, and have the metadata in a separate file. That'd then allow
to just fill in the holes with data from the older version.  I'd assume
that there's a lot of workloads where some significantly sized relations
will get updated in nearly their entirety between backups.


> Each such file will store near the beginning of the file a list of all the
> blocks contained in that file, and the blocks themselves will follow
> at offsets that can be predicted from the metadata at the beginning of
> the file.  The idea is that you shouldn't have to read the whole file
> to figure out which blocks it contains, and if you know specifically
> what blocks you want, you should be able to reasonably efficiently
> read just those blocks.  A backup taken in this manner should also
> probably create some kind of metadata file in the root directory that
> stops the server from starting and lists other salient details of the
> backup.  In particular, you need the threshold LSN for the backup
> (i.e. contains blocks newer than this) and the start LSN for the
> backup (i.e. the LSN that would have been returned from
> pg_start_backup).

I wonder if we shouldn't just integrate that into pg_control or such. So
that:

> 3. There should be a new tool that knows how to merge a full backup
> with any number of incremental backups and produce a complete data
> directory with no remaining partial files.

Could just be part of server startup?


> - I imagine that the server would offer this functionality through a
> new replication command or a syntax extension to an existing command,
> so it could also be used by tools other than pg_basebackup if they
> wished.

Would this logic somehow be usable from tools that don't want to copy
the data directory via pg_basebackup (e.g. for parallelism, to directly
send to some backup service / SAN / whatnot)?


> - It would also be nice if pg_basebackup could write backups to places
> other than the local disk, like an object store, a tape drive, etc.
> But that also sounds like a separate effort.

Indeed seems separate. But worthwhile.

Greetings,

Andres Freund



Re: block-level incremental backup

From
Gary M
Date:
Having worked in the data storage industry since the '80s, I think backup is an important capability. Having said that, the ideas should be expanded to an overall data management strategy combining local and remote storage including cloud.

From my experience, record and transaction consistency is critical to any replication action, including backup.  The approach commonly includes a starting baseline (a snapshot, if you prefer) and a set of incremental changes to the snapshot.  I always used the transaction logs for both backup and remote replication to other DBMSs. In standard ECMA-208 @94, you will note a file object with a transaction property. Although the language specifies files, a file may be any set of records.

SAN-based snapshots usually occur on the SAN storage device, meaning that cached data (not yet written to disk) will not be snapshotted, or will be inconsistently referenced, likely resulting in a corrupted database on restore.

Snapshots are point-in-time states of storage objects. Between snapshot periods, any number of changes may occur.  If a record of "all changes" is required, snapshot methods must be augmented with a historical record: the transaction log.

Delta block methods for backups have been in practice for many years; ZFS adopted the practice for block management. The usability of incremental backups, whether based on blocks, transactions or other methods, depends on prior data. Like primary storage, backup media can fail, become lost, or be inadvertently corrupted. If incremental backup data is lost, any data restored from after the point of loss is likely corrupted.

cheers, 
garym



Re: block-level incremental backup

From
Robert Haas
Date:
On Tue, Apr 9, 2019 at 12:35 PM Andres Freund <andres@anarazel.de> wrote:
> Hm. But that means that files that are shipped nearly in their entirety,
> need to be fully rewritten. Wonder if it's better to ship them as files
> with holes, and have the metadata in a separate file. That'd then allow
> to just fill in the holes with data from the older version.  I'd assume
> that there's a lot of workloads where some significantly sized relations
> will get updated in nearly their entirety between backups.

I don't want to rely on holes at the FS level.  I don't want to have
to worry about what Windows does and what every Linux filesystem does
and what NetBSD and FreeBSD and Dragonfly BSD and MacOS do.  And I
don't want to have to write documentation for the fine manual
explaining to people that they need to use a hole-preserving tool when
they copy an incremental backup around.  And I don't want to have to
listen to complaints from $USER that their backup tool, $THING, is not
hole-aware.  Just - no.

But what we could do is have some threshold (as git does), beyond
which you just send the whole file.  For example if >90% of the blocks
have changed, or >80% or whatever, then you just send everything.
That way, if you have a database where you have lots and lots of 1GB
segments with low churn (so that you can't just use full backups) and
lots and lots of 1GB segments with high churn (to create the problem
you're describing) you'll still be OK.

> > 3. There should be a new tool that knows how to merge a full backup
> > with any number of incremental backups and produce a complete data
> > directory with no remaining partial files.
>
> Could just be part of server startup?

Yes, but I think that sucks.  You might not want to start the server
but rather just create a new synthetic backup.  And realistically,
it's hard to imagine the server doing anything but synthesizing the
backup first and then proceeding as normal.  In theory there's no
reason why it couldn't be smart enough to construct the files it needs
"on demand" in the background, but that sounds really hard and I don't
think there's enough value to justify that level of effort.  YMMV, of
course.

> > - I imagine that the server would offer this functionality through a
> > new replication command or a syntax extension to an existing command,
> > so it could also be used by tools other than pg_basebackup if they
> > wished.
>
> Would this logic somehow be usable from tools that don't want to copy
> the data directory via pg_basebackup (e.g. for parallelism, to directly
> send to some backup service / SAN / whatnot)?

Well, I'm imagining it as a piece of server-side functionality that
can figure out what has changed using one of several possible methods,
and then send that stuff to you.  So I think if you don't have a
server connection you are out of luck.  If you have a server
connection but just want to be told what has changed rather than
actually being given that data, that might be something that could be
worked into the design.  I'm not sure whether that's a real need,
though, or just extra work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Robert Haas
Date:
On Tue, Apr 9, 2019 at 12:32 PM Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
> In pg_probackup we have remote restore via SSH in the beta state. But
> SSH isn't an option for an in-core approach, I think.

That's a little off-topic for this thread, but I think we should have
some kind of extensible mechanism for pg_basebackup and maybe other
tools, so that you can teach it to send backups to AWS or your
teletype or etch them on stone tablets or whatever without having to
modify core code.  But let's not design that mechanism on this thread,
'cuz that will distract from what I want to talk about here.  Feel
free to start a new thread for it, though, and I'll jump in.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Peter Eisentraut
Date:
On 2019-04-09 17:48, Robert Haas wrote:
> It will
> probably be more efficient in many cases to instead scan all the WAL
> generated since that LSN and extract block references from it, but
> that is only possible if the server has all of that WAL available or
> can somehow get it from the archive.

This could be a variant of a replication slot that preserves WAL between
incremental backup runs.

> 3. There should be a new tool that knows how to merge a full backup
> with any number of incremental backups and produce a complete data
> directory with no remaining partial files.

Are there by any chance standard file formats and tools that describe a
binary difference between directories?  That would be really useful here.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: block-level incremental backup

From
Alvaro Herrera
Date:
On 2019-Apr-09, Peter Eisentraut wrote:

> On 2019-04-09 17:48, Robert Haas wrote:

> > 3. There should be a new tool that knows how to merge a full backup
> > with any number of incremental backups and produce a complete data
> > directory with no remaining partial files.
> 
> Are there by any chance standard file formats and tools that describe a
> binary difference between directories?  That would be really useful here.

VCDIFF? https://tools.ietf.org/html/rfc3284

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: block-level incremental backup

From
Andrey Borodin
Date:
Hi!

> On 9 Apr 2019, at 20:48, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Thoughts?
Thanks for this long and thoughtful post!

At Yandex, we have been using incremental backups for some years now. Initially, we used a patched pgbarman, then we
implemented this functionality in WAL-G. And there are many things to be done yet. We have more than 1Pb of clusters
backed up with this technology.
Most of the time we use this technology as part of an HA setup in our managed PostgreSQL service. So, for us the main
goals are to operate backups cheaply and restore a new node quickly. Here's what I see from our perspective.

1. Yes, this feature is important.

2. This importance does not come from reduced disk storage; magnetic disks and object storage are very cheap.

3. Incremental backups save a lot of network bandwidth. It is non-trivial for the storage system to ingest hundreds of
TB daily.

4. Incremental backups are a redundancy of WAL, intended for parallel application. An incremental backup applied
sequentially is not very useful; it will not be much faster than simple WAL replay in many cases.

5. As long as increments duplicate WAL functionality, it is not worth pursuing tradeoffs that reduce storage
utilization. We scan WAL during archiving, extract the numbers of changed blocks, and store a changemap for a group of
WALs in the archive.

6. These changemaps can be used for the increment of the visibility map (if I recall correctly). But you cannot compare
LSNs on a page of the visibility map: some operations do not bump them.

7. We use changemaps during backups and during WAL replay - we know the blocks that will change far in advance and
prefetch them into the page cache like pg_prefaulter does.

8. There is similar functionality in RMAN for one well-known database. They used to store 8 sets of change maps. That
database also has cool "increment for catchup" functionality.

9. We call an incremental backup a "delta backup". This wording describes the purpose more precisely: it is not the
"next version of the DB", it is the "difference between two DB states". But the wording choice does not matter much.


Here are slides from my talk at PgConf.APAC[0]. I've proposed a talk on this matter to PgCon, but it was not accepted.
I will try next year :)

> On 9 Apr 2019, at 20:48, Robert Haas <robertmhaas@gmail.com> wrote:
> - This is just a design proposal at this point; there is no code.  If
> this proposal, or some modified version of it, seems likely to be
> acceptable, I and/or my colleagues might try to implement it.

I'll be happy to help with code, discussion and patch review.

Best regards, Andrey Borodin.

[0] https://yadi.sk/i/Y_S1iqNN5WxS6A


Re: block-level incremental backup

From
Konstantin Knizhnik
Date:

On 09.04.2019 18:48, Robert Haas wrote:
> 1. There should be a way to tell pg_basebackup to request from the
> server only those blocks where LSN >= threshold_value.

Some time ago I implemented an alternative version of the ptrack utility 
(not the one used in pg_probackup)
which detects updated blocks at the file level. It is very simple and maybe 
it can someday be integrated into master.
I have attached a patch against vanilla to this mail.
Right now it contains just two GUCs:

ptrack_map_size: Size of ptrack map (number of elements) used for 
incremental backup; 0 means disabled.
ptrack_block_log: Logarithm of ptrack block size (number of pages)

and one function:

pg_ptrack_get_changeset(startlsn pg_lsn) returns 
{relid,relfilenode,reltablespace,forknum,blocknum,segsize,updlsn,path}

The idea is very simple: it creates a hash map of fixed size (ptrack_map_size) 
and stores the LSN of written pages in this map.
Since the default Postgres page size seems to be too small for a ptrack 
block (requiring either a too-large hash map or an increased number of 
collisions, as well as more random reads), it is possible to configure a 
ptrack block to consist of multiple pages (a power of 2).

This patch uses a memory mapping mechanism. Unfortunately there is no 
portable wrapper for it in Postgres, so I had to provide my own 
implementations for Unix/Windows. Certainly this is not good and should be 
rewritten.
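
For readers without the patch at hand, the core of the idea reduces to
something like the following sketch; the names and the hash function here
are illustrative only and are not taken from the attached patch:

#include <stdint.h>

#define PTRACK_MAP_SIZE  1000003    /* the ptrack_map_size GUC */
#define PTRACK_BLOCK_LOG 7          /* the ptrack_block_log GUC: 2^7 pages = 1MB groups */

/* In the patch this lives in memory-mapped storage, not a plain array. */
static uint64_t ptrack_map[PTRACK_MAP_SIZE];

static uint32_t
ptrack_hash(uint32_t relfilenode, uint32_t block_group)
{
    uint64_t    h = (uint64_t) relfilenode * 2654435761u + block_group;

    return (uint32_t) (h % PTRACK_MAP_SIZE);
}

/* Called when a page is written: remember the newest LSN per block group. */
static void
ptrack_mark_block(uint32_t relfilenode, uint32_t blocknum, uint64_t lsn)
{
    uint32_t    slot = ptrack_hash(relfilenode, blocknum >> PTRACK_BLOCK_LOG);

    if (ptrack_map[slot] < lsn)
        ptrack_map[slot] = lsn;
}

pg_ptrack_get_changeset() can then report every block group whose slot
holds an LSN >= the requested start LSN, plus whatever false positives
hash collisions produce.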

How to use?

1. Define ptrack_map_size in postgres.conf, for example (use a prime 
number for more uniform hashing):

ptrack_map_size = 1000003

2.  Remember current lsn.

psql postgres -c "select pg_current_wal_lsn()"
  pg_current_wal_lsn
--------------------
  0/224A268
(1 row)

3. Do some updates.

$ pgbench -T 10 postgres

4. Select changed blocks.

  select * from pg_ptrack_get_changeset('0/224A268');
   relid | relfilenode | reltablespace | forknum | blocknum | segsize |  updlsn   |       path
  -------+-------------+---------------+---------+----------+---------+-----------+------------------
   16390 |       16396 |          1663 |       0 |     1640 |       1 | 0/224FD88 | base/12710/16396
   16390 |       16396 |          1663 |       0 |     1641 |       1 | 0/2258680 | base/12710/16396
   16390 |       16396 |          1663 |       0 |     1642 |       1 | 0/22615A0 | base/12710/16396
  ...

Certainly ptrack should be used as part of some backup tool (such as 
pg_basebackup or pg_probackup).


-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: block-level incremental backup

From
Jehan-Guillaume de Rorthais
Date:
Hi,

On Tue, 9 Apr 2019 11:48:38 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

> Several companies, including EnterpriseDB, NTT, and Postgres Pro, have
> developed technology that permits a block-level incremental backup to
> be taken from a PostgreSQL server.  I believe the idea in all of those
> cases is that non-relation files should be backed up in their
> entirety, but for relation files, only those blocks that have been
> changed need to be backed up.  I would like to propose that we should
> have a solution for this problem in core, rather than leaving it to
> each individual PostgreSQL company to develop and maintain their own
> solution. Generally my idea is:
> 
> 1. There should be a way to tell pg_basebackup to request from the
> server only those blocks where LSN >= threshold_value.  There are
> several possible ways for the server to implement this, the simplest
> of which is to just scan all the blocks and send only the ones that
> satisfy that criterion.  That might sound dumb, but it does still save
> network bandwidth, and it works even without any prior setup.

+1 this is a simple design and probably a first easy step bringing a lot of
benefices already.

> It will probably be more efficient in many cases to instead scan all the WAL
> generated since that LSN and extract block references from it, but
> that is only possible if the server has all of that WAL available or
> can somehow get it from the archive.

I seize the opportunity to discuss this on the fly.

I've been playing with the idea of producing incremental backups from
archives for many years. But I've only started PoC'ing it this year.

My idea would be to create a new tool working on archived WAL. No burden
on the server side. The basic concept is:

* parse archives
* record latest relevant FPW for the incr backup
* write new WALs with the recorded FPWs, removing/rewriting duplicated WAL records.

It's just a PoC and I haven't finished the WAL writing part... not even talking
about the replay part. I'm not even sure this project is a good idea, but it is
a good educational exercise for me in the meantime.

Anyway, using real life OLTP production archives, my stats were:

  # WAL   xlogrec kept     Size WAL kept
    127            39%               50%
    383            22%               38%
    639            20%               29%

Based on these stats, I expect this would save a lot of time during recovery as
a first step. If it gets mature, it might even save a lot of archive space or
extend the retention period with degraded granularity. It would even help with
taking full backups at a lower frequency.

Any thoughts about this design would be much appreciated. I suppose this should
be offlist or in a new thread to avoid polluting this thread as this is a
slightly different subject.

Regards,


PS: I was surprised to still find some existing pieces of code related to
pglesslog in core. That project has been discontinued and the WAL format has
changed in the meantime.



Re: block-level incremental backup

From
Robert Haas
Date:
On Tue, Apr 9, 2019 at 5:28 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2019-Apr-09, Peter Eisentraut wrote:
> > On 2019-04-09 17:48, Robert Haas wrote:
> > > 3. There should be a new tool that knows how to merge a full backup
> > > with any number of incremental backups and produce a complete data
> > > directory with no remaining partial files.
> >
> > Are there by any chance standard file formats and tools that describe a
> > binary difference between directories?  That would be really useful here.
>
> VCDIFF? https://tools.ietf.org/html/rfc3284

I don't understand VCDIFF very well, but I see some potential problems
with going in this direction.

First, suppose we take a full backup on Monday.  Then, on Tuesday, we
want to take an incremental backup.  In my proposal, the backup server
only needs to provide the database with one piece of information: the
start-LSN of the previous backup.  The server determines which blocks
are recently modified and sends them to the client, which stores them.
The end.  On the other hand, storing a maximally compact VCDIFF seems
to require that, for each block modified in the Tuesday backup, we go
read the corresponding block as it existed on Monday.  Assuming that
the server is using some efficient method of locating modified blocks,
this will approximately double the amount of read I/O required to
complete the backup: either the server or the client must now read not
only the current version of the block but the previous versions.  If
the previous backup is an incremental backup that does not contain
full block images but only VCDIFF content, whoever is performing the
VCDIFF calculation will need to walk the entire backup chain and
reconstruct the previous contents of the previous block so that it can
compute the newest VCDIFF.  A customer who does an incremental backup
every day and maintains a synthetic full backup from 1 week prior will
see a roughly eightfold increase in read I/O compared to the design I
proposed.

The same problem exists at restore time.  In my design, the total read
I/O required is equal to the size of the database, plus however much
metadata needs to be read from older delta files -- and that should be
fairly small compared to the actual data being read, at least in
normal, non-extreme cases.  But if we are going to proceed by applying
a series of delta files, we're going to need to read every older
backup in its entirety.  If the turnover percentage is significant,
say 20%/day, and if the backup chain is say 7 backups long to get back
to a full backup, this is a huge difference.  Instead of having to
read ~100% of the database size, as in my proposal, we'll need to read
100% + (6 * 20%) = 220% of the database size.

Since VCDIFF uses an add-copy-run language to describe differences,
we could try to work around the problem that I just described by
describing each changed data block as an 8192-byte add, and unchanged
blocks as an 8192-byte copy.  If we did that, then I think that the
problem at backup time goes away: we can write out a VCDIFF-format
file for the changed blocks based just on knowing that those are the
blocks that have changed, without needing to access the older file. Of
course, if we do it this way, the file will be larger than it would be
if we actually compared the old and new block contents and wrote out a
minimal VCDIFF, but it does make taking a backup a lot simpler.  Even
with this proposal, though, I think we still have trouble with restore
time.  I proposed putting the metadata about which blocks are included
in a delta file at the beginning of the file, which allows a restore
of a new incremental backup to relatively efficiently flip through
older backups to find just the blocks that it needs, without having to
read the whole file.  But I think (although I am not quite sure) that
in the VCDIFF format, the payload for an ADD instruction is stored
near the instruction itself.  The result would be that you'd have to basically
read the whole file at restore time to figure out which blocks were
available from that file and which ones needed to be retrieved from an
older backup.  So while this approach would fix the backup-time
problem, I believe that it would still require significantly more read
I/O at restore time than my proposal.

Furthermore, if, at backup time, we have to do anything that requires
access to the old data, either the client or the server needs to have
access to that data.  Notwithstanding the costs of reading it, that
doesn't seem very desirable.  The server is quite unlikely to have
access to the backups, because most users want to back up to a
different server in order to guard against a hardware failure.  The
client is more likely to be running on a machine where it has access
to the data, because many users back up to the same machine every day,
so the machine that is taking the current backup probably has the
older one.  However, accessing that old backup might not be cheap.  It
could be located in an object store in the cloud someplace, or it
could have been written out to a tape drive and the tape removed from
the drive.  In the design I'm proposing, that stuff doesn't matter,
but if you want to run diffs, then it does.  Even if the client has
efficient access to the data and even if it has so much read I/O
bandwidth that the costs of reading that old data to run diffs doesn't
matter, it's still pretty awkward for a tar-format backup.  The client
would have to take the tar archive sent by the server apart and form a
new one.

Another advantage of storing whole blocks in the incremental backup is
that there's no tight coupling between the full backup and the
incremental backup.  Suppose you take a full backup A on S1, and then
another full backup B, and then an incremental backup C based on A,
and then an incremental backup D based on B.  If backup B is destroyed
beyond retrieval, you can restore the chain A-C-D and get back to the
same place that restoring B-D would have gotten you.  Backup D doesn't
really know or care that it happens to be based on B.  It just knows
that it can only give you those blocks that have LSN >= LSN_B.  You
can get those blocks from anywhere that you like.  If D instead stored
deltas between the blocks as they exist in backup B, then those deltas
would have to be applied specifically to backup B, not some
possibly-later version.

I think the way to think about this problem, or at least the way I
think about this problem, is that we need to decide whether we want
file-level incremental backup, block-level incremental backup, or
byte-level incremental backup.  pgbackrest implements file-level
incremental backup: if the file has changed, copy the whole thing.
That has an appealing simplicity but risks copying 1GB of data for a
1-byte change. What I'm proposing here is block-level incremental
backup, which is more complicated and still risks copying 8kB of data
for a 1-byte change.  Using VCDIFF would, I think, give us byte-level
incremental backup.  That would probably do an excellent job of making
incremental backups as small as they can possibly be, because we would
not need to include in the backup image even a single byte of
unmodified data.  It also seems like it does some other compression
tricks which could shrink incremental backups further.  However, my
intuition is that we won't gain enough in terms of backup size to make
up for the downsides listed above.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Robert Haas
Date:
On Wed, Apr 10, 2019 at 10:57 AM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:
> My idea would be to create a new tool working on archived WAL. No burden
> on the server side. The basic concept is:
>
> * parse archives
> * record latest relevant FPW for the incr backup
> * write new WALs with the recorded FPWs, removing/rewriting duplicated WAL records.
>
> It's just a PoC and I haven't finished the WAL writing part... not even talking
> about the replay part. I'm not even sure this project is a good idea, but it is
> a good educational exercise for me in the meantime.
>
> Anyway, using real life OLTP production archives, my stats were:
>
>   # WAL   xlogrec kept     Size WAL kept
>     127            39%               50%
>     383            22%               38%
>     639            20%               29%
>
> Based on these stats, I expect this would save a lot of time during recovery as
> a first step. If it gets mature, it might even save a lot of archive space or
> extend the retention period with degraded granularity. It would even help with
> taking full backups at a lower frequency.
>
> Any thoughts about this design would be much appreciated. I suppose this should
> be offlist or in a new thread to avoid polluting this thread as this is a
> slightly different subject.

Interesting idea, but I don't see how it can work if you only deal
with the FPWs and not the other records.  For instance, suppose that
you take a full backup at time T0, and then at time T1 there are two
modifications to a certain block in quick succession.  That block is
then never touched again.  Since no checkpoint intervenes between the
modifications, the first one emits an FPI and the second does not.
Capturing the FPI is fine as far as it goes, but unless you also do
something with the non-FPI change, you lose that second modification.
You could fix that by having your tool replicate the effects of WAL
apply outside the server, but that sounds like a ton of work and a ton
of possible bugs.

I have a related idea, though.  Suppose that, as Peter says upthread,
you have a replication slot that prevents old WAL from being removed.
You also have a background worker that is connected to that slot.  It
decodes WAL and produces summary files containing all block-references
extracted from those WAL records and the associated LSN (or maybe some
approximation of the LSN instead of the exact value, to allow for
compression and combining of nearby references).  Then you hold onto
those summary files after the actual WAL is removed.  Now, when
somebody asks the server for all blocks changed since a certain LSN,
it can use those summary files to figure out which blocks to send
without having to read all the pages in the database.  Although I
believe that a simple system that finds modified blocks by reading
them all is good enough for a first version of this feature and useful
in its own right, a more efficient system will be a lot more useful,
and something like this seems to me to be probably the best way to
implement it.
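
As a sketch of what one entry in such a summary file might look like (the
layout and names are hypothetical, and the LSN could be rounded to allow
combining nearby references):

#include <stdint.h>

typedef struct BlockRefSummaryEntry
{
    uint32_t    tablespace_oid;
    uint32_t    database_oid;
    uint32_t    relfilenode;
    uint8_t     forknum;
    uint32_t    blocknum;
    uint64_t    lsn;            /* LSN of, or approximating, the change */
} BlockRefSummaryEntry;

A summary file covering one span of WAL could then be a small header
(start LSN, end LSN, entry count) followed by a sorted, deduplicated array
of these entries, so the backup code can merge or binary-search them
cheaply instead of rereading the WAL itself.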

The reason why I think this is likely to be superior to other possible
approaches, such as the ptrack approach Konstantin suggests elsewhere
on this thread, is because it pushes the work of figuring out which
blocks have been modified into the background.  With a ptrack-type
approach, the server has to do some non-zero amount of extra work in
the foreground every time it modifies a block.  With an approach based
on WAL-scanning, the work is done in the background and nobody has to
wait for it.  It's possible that there are other considerations which
aren't occurring to me right now, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Robert Haas
Date:
On Wed, Apr 10, 2019 at 10:22 AM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> Some time ago I implemented an alternative version of the ptrack utility
> (not the one used in pg_probackup)
> which detects updated blocks at the file level. It is very simple and maybe
> it can someday be integrated into master.

I don't think this is completely crash-safe.  It looks like it
arranges to msync() the ptrack file at appropriate times (although I
haven't exhaustively verified the logic), but it uses MS_ASYNC, so
it's possible that the ptrack file could get updated on disk either
before or after the relation file itself.  I think before is probably
OK -- it just risks having some blocks look modified when they aren't
really -- but after seems like it is very much not OK.  And changing
this to use MS_SYNC would probably be really expensive.  Likely a
better approach would be to hook into the new fsync queue machinery
that Thomas Munro added to PostgreSQL 12.

It looks like your system maps all the blocks in the system into a
fixed-size map using hashing.  If the number of modified blocks
between the full backup and the incremental backup is large compared
to the size of the ptrack map, you'll start to get a lot of
false-positives.  It will look as if much of the database needs to be
backed up.  For example, in your sample configuration, you have
ptrack_map_size = 1000003. If you've got a 100GB database with 20%
daily turnover, that's about 2.6 million blocks.  If you bump a
random entry ~2.6 million times in a map with 1000003 entries, on the
average ~92% of the entries end up getting bumped, so you will get
very little benefit from incremental backup.  This problem drops off
pretty fast if you raise the size of the map, but it's pretty critical
that your map is large enough for the database you've got, or you may
as well not bother.
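
For reference, that ~92% figure follows from the expected fraction of
occupied slots after k random bumps into a map of N slots, which is
1 - (1 - 1/N)^k; a few lines of C confirm it:

/* build with: cc fillcheck.c -lm */
#include <math.h>
#include <stdio.h>

int
main(void)
{
    double      N = 1000003.0;  /* ptrack_map_size from the sample config */
    double      k = 2600000.0;  /* ~20% daily churn of a 100GB database, in 8kB blocks */
    double      occupied = 1.0 - pow(1.0 - 1.0 / N, k);

    printf("expected fraction of map entries bumped: %.1f%%\n", occupied * 100.0);
    return 0;                   /* prints roughly 92.6% */
}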

It also appears that your system can't really handle resizing of the
map in any friendly way.  So if your data size grows, you may be faced
with either letting the map become progressively less effective, or
throwing it out and losing all the data you have.

None of that is to say that what you're presenting here has no value,
but I think it's possible to do better (and I think we should try).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Ashwin Agrawal
Date:

On Wed, Apr 10, 2019 at 9:21 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I have a related idea, though.  Suppose that, as Peter says upthread,
> you have a replication slot that prevents old WAL from being removed.
> You also have a background worker that is connected to that slot.  It
> decodes WAL and produces summary files containing all block-references
> extracted from those WAL records and the associated LSN (or maybe some
> approximation of the LSN instead of the exact value, to allow for
> compression and combining of nearby references).  Then you hold onto
> those summary files after the actual WAL is removed.  Now, when
> somebody asks the server for all blocks changed since a certain LSN,
> it can use those summary files to figure out which blocks to send
> without having to read all the pages in the database.  Although I
> believe that a simple system that finds modified blocks by reading
> them all is good enough for a first version of this feature and useful
> in its own right, a more efficient system will be a lot more useful,
> and something like this seems to me to be probably the best way to
> implement it.

Not to fork the conversation from incremental backups, but a similar approach is what we have been thinking about for pg_rewind. Currently, pg_rewind requires all the WAL to be present on the source side from the point of divergence to rewind. Instead, just parse the WAL and keep the changed blocks around on the source. Then you don't need to retain the WAL but can still rewind using the changed block map. So, rewind becomes much like the incremental backup proposed here, after performing the rewind activity using target-side WAL only.

Re: block-level incremental backup

From
Robert Haas
Date:
On Wed, Apr 10, 2019 at 7:51 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > On 9 Apr 2019, at 20:48, Robert Haas <robertmhaas@gmail.com> wrote:
> > - This is just a design proposal at this point; there is no code.  If
> > this proposal, or some modified version of it, seems likely to be
> > acceptable, I and/or my colleagues might try to implement it.
>
> I'll be happy to help with code, discussion and patch review.

That would be great!

We should probably give this discussion some more time before we
plunge into the implementation phase, but I'd love to have some help
with that, whether it's with coding or review or whatever.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Robert Haas
Date:
On Wed, Apr 10, 2019 at 12:56 PM Ashwin Agrawal <aagrawal@pivotal.io> wrote:
> Not to fork the conversation from incremental backups, but a similar approach is what we have been thinking about for
> pg_rewind. Currently, pg_rewind requires all the WAL to be present on the source side from the point of divergence to
> rewind. Instead, just parse the WAL and keep the changed blocks around on the source. Then you don't need to retain the
> WAL but can still rewind using the changed block map. So, rewind becomes much like the incremental backup proposed here,
> after performing the rewind activity using target-side WAL only.

Interesting.  So if we build a system like this for incremental
backup, or for pg_rewind, the other one can use the same
infrastructure.  That sounds excellent.  I'll start a new thread to
talk about that, and hopefully you and Heikki and others will chime in
with thoughts.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Jehan-Guillaume de Rorthais
Date:
Hi,

First thank you for your answer!

On Wed, 10 Apr 2019 12:21:03 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

> On Wed, Apr 10, 2019 at 10:57 AM Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote:
> > My idea would be to create a new tool working on archived WAL. No burden
> > on the server side. The basic concept is:
> >
> > * parse archives
> > * record latest relevant FPW for the incr backup
> > * write new WALs with the recorded FPWs, removing/rewriting duplicated
> > WAL records.
> >
> > It's just a PoC and I haven't finished the WAL writing part... not even
> > talking about the replay part. I'm not even sure this project is a good
> > idea, but it is a good educational exercise for me in the meantime.
> >
> > Anyway, using real life OLTP production archives, my stats were:
> >
> >   # WAL   xlogrec kept     Size WAL kept
> >     127            39%               50%
> >     383            22%               38%
> >     639            20%               29%
> >
> > Based on these stats, I expect this would save a lot of time during recovery
> > as a first step. If it gets mature, it might even save a lot of archive
> > space or extend the retention period with degraded granularity. It would
> > even help with taking full backups at a lower frequency.
> >
> > Any thoughts about this design would be much appreciated. I suppose this
> > should be offlist or in a new thread to avoid polluting this thread as this
> > is a slightly different subject.  
> 
> Interesting idea, but I don't see how it can work if you only deal
> with the FPWs and not the other records.  For instance, suppose that
> you take a full backup at time T0, and then at time T1 there are two
> modifications to a certain block in quick succession.  That block is
> then never touched again.  Since no checkpoint intervenes between the
> modifications, the first one emits an FPI and the second does not.
> Capturing the FPI is fine as far as it goes, but unless you also do
> something with the non-FPI change, you lose that second modification.
> You could fix that by having your tool replicate the effects of WAL
> apply outside the server, but that sounds like a ton of work and a ton
> of possible bugs.

In my current design, the scan is done backward from end to start and I keep all
the records appearing after the last occurrence of their respective FPI.

The next challenge is to deal with multi-block records
where some blocks need to be removed and others are FPIs to keep (e.g. UPDATE).

> I have a related idea, though.  Suppose that, as Peter says upthread,
> you have a replication slot that prevents old WAL from being removed.
> You also have a background worker that is connected to that slot.  It
> decodes WAL and produces summary files containing all block-references
> extracted from those WAL records and the associated LSN (or maybe some
> approximation of the LSN instead of the exact value, to allow for
> compression and combining of nearby references).  Then you hold onto
> those summary files after the actual WAL is removed.  Now, when
> somebody asks the server for all blocks changed since a certain LSN,
> it can use those summary files to figure out which blocks to send
> without having to read all the pages in the database.  Although I
> believe that a simple system that finds modified blocks by reading
> them all is good enough for a first version of this feature and useful
> in its own right, a more efficient system will be a lot more useful,
> and something like this seems to me to be probably the best way to
> implement it.

Summary files look like what Andrey Borodin described as delta-files and
change maps.

> With an approach based
> on WAL-scanning, the work is done in the background and nobody has to
> wait for it.

Agree with this.



Re: block-level incremental backup

From
Robert Haas
Date:
On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:
> In my current design, the scan is done backward from end to start and I keep all
> the records appearing after the last occurrence of their respective FPI.

Oh, interesting.  That seems like it would require pretty major
surgery on the WAL stream.

> Summary files looks like what Andrey Borodin described as delta-files and
> change maps.

Yep.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Andres Freund
Date:
Hi,

On 2019-04-10 14:38:43 -0400, Robert Haas wrote:
> On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote:
> > In my current design, the scan is done backward from end to start and I keep all
> > the records appearing after the last occurrence of their respective FPI.
> 
> Oh, interesting.  That seems like it would require pretty major
> surgery on the WAL stream.

Can't you just read each segment forward, and then reverse? That's not
that much memory? And sure, there are some inefficient cases where records
span many segments, but that's rare enough that reading a few segments
several times doesn't strike me as particularly bad?
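
A sketch of that control flow, with the actual WAL reader and record
contents elided (read_next and handle are placeholders for whatever the
tool already uses):

#include <stdint.h>
#include <stdlib.h>

typedef struct RecordRef
{
    uint64_t    lsn;            /* start LSN of the record */
    /* ... block references, FPI flag, whatever the backward pass needs ... */
} RecordRef;

static void
process_segment_backwards(RecordRef *(*read_next) (void *state), void *state,
                          void (*handle) (const RecordRef *))
{
    RecordRef **buf = NULL;
    size_t      n = 0,
                cap = 0;
    RecordRef  *rec;

    /* forward pass: buffer every record of this segment */
    while ((rec = read_next(state)) != NULL)
    {
        if (n == cap)
        {
            cap = cap ? cap * 2 : 1024;
            buf = realloc(buf, cap * sizeof(*buf));
            if (buf == NULL)
                abort();        /* keep the sketch simple */
        }
        buf[n++] = rec;
    }

    /* backward pass: newest record first */
    while (n > 0)
        handle(buf[--n]);

    free(buf);
}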

Greetings,

Andres Freund



Re: block-level incremental backup

From
Peter Eisentraut
Date:
On 2019-04-10 17:31, Robert Haas wrote:
> I think the way to think about this problem, or at least the way I
> think about this problem, is that we need to decide whether want
> file-level incremental backup, block-level incremental backup, or
> byte-level incremental backup.

That is a great analysis.  Seems like block-level is the preferred way
forward.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: block-level incremental backup

From
Konstantin Knizhnik
Date:

On 10.04.2019 19:51, Robert Haas wrote:
> On Wed, Apr 10, 2019 at 10:22 AM Konstantin Knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
>> Some time ago I implemented an alternative version of the ptrack utility
>> (not the one used in pg_probackup)
>> which detects updated blocks at the file level. It is very simple and maybe
>> it can someday be integrated into master.
> I don't think this is completely crash-safe.  It looks like it
> arranges to msync() the ptrack file at appropriate times (although I
> haven't exhaustively verified the logic), but it uses MS_ASYNC, so
> it's possible that the ptrack file could get updated on disk either
> before or after the relation file itself.  I think before is probably
> OK -- it just risks having some blocks look modified when they aren't
> really -- but after seems like it is very much not OK.  And changing
> this to use MS_SYNC would probably be really expensive.  Likely a
> better approach would be to hook into the new fsync queue machinery
> that Thomas Munro added to PostgreSQL 12.

I do not think that MS_SYNC or the fsync queue is needed here.
If a power failure or OS crash causes the loss of some writes to the ptrack 
map, then in any case Postgres will perform recovery, and updating pages from 
WAL will mark them in the ptrack map once again. So, as in the case of CLOG 
and many other Postgres files, it is not critical to lose some writes, 
because they will be restored from WAL. And before truncating WAL, 
Postgres performs a checkpoint which flushes all changes to disk, 
including ptrack map updates.


> It looks like your system maps all the blocks in the system into a
> fixed-size map using hashing.  If the number of modified blocks
> between the full backup and the incremental backup is large compared
> to the size of the ptrack map, you'll start to get a lot of
> false-positives.  It will look as if much of the database needs to be
> backed up.  For example, in your sample configuration, you have
> ptrack_map_size = 1000003. If you've got a 100GB database with 20%
> daily turnover, that's about 2.6 million blocks.  If you bump a
> random entry ~2.6 million times in a map with 1000003 entries, on the
> average ~92% of the entries end up getting bumped, so you will get
> very little benefit from incremental backup.  This problem drops off
> pretty fast if you raise the size of the map, but it's pretty critical
> that your map is large enough for the database you've got, or you may
> as well not bother.
This is why the ptrack block size should be larger than the page size.
Assume that it is 1MB. 1MB is considered to be the optimal unit of disk 
IO, at which frequent seeks do not degrade read speed (this is most 
critical for HDDs). In other words, reading 10 random pages (20%) from 
this 1MB block takes almost the same amount of time (or even 
longer) as reading the whole 1MB in one operation.

There will be just 100000 used entries in the ptrack map, with a very small 
probability of collision.
Actually, I chose this size (1000003) for the ptrack map because, with a 
1MB block size, it allows mapping a 1TB database without a noticeable 
number of collisions, which seems to be enough for most Postgres 
installations. But increasing the ptrack map size 10 or even 100 times 
should not cause problems with modern RAM sizes either.

>
> It also appears that your system can't really handle resizing of the
> map in any friendly way.  So if your data size grows, you may be faced
> with either letting the map become progressively less effective, or
> throwing it out and losing all the data you have.
>
> None of that is to say that what you're presenting here has no value,
> but I think it's possible to do better (and I think we should try).
>
I definitely don't consider the proposed patch a perfect solution, and 
certainly it requires improvements (and maybe a complete redesign).
I just want to present this approach of maintaining a hash of block LSNs in 
mapped memory and keeping track of modified blocks at the file level 
(unlike the current ptrack implementation, which logs changes in all places 
in the Postgres code where data is updated).

Also, despite the fact that this patch may be considered a raw 
prototype, I have spent some time thinking about all aspects of this 
approach, including fault tolerance and false positives.




Re: block-level incremental backup

From
Jehan-Guillaume de Rorthais
Date:
On Wed, 10 Apr 2019 14:38:43 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

> On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote:
> > In my current design, the scan is done backward from end to start and I
> > keep all the records appearing after the last occurrence of their
> > respective FPI.  
> 
> Oh, interesting.  That seems like it would require pretty major
> surgery on the WAL stream.

Indeed.

Presently, the surgery in my code is replacing redundant xlogrecords
with noops.

I now have to deal with multi-block records.  So far, I tried to mark
unneeded blocks with !BKPBLOCK_HAS_DATA and made a simple patch in core
to ignore such marked blocks, but it doesn't play well with dependencies
between xlogrecords, e.g. during UPDATE.  So my plan is to rewrite those
records to remove the unneeded blocks, using e.g. XLOG_FPI.

As I wrote, this is mainly a hobby project right now for my own
education.  Not sure where it leads me, but I am learning a lot while
working on it.



Re: block-level incremental backup

From
Jehan-Guillaume de Rorthais
Date:
On Wed, 10 Apr 2019 11:55:51 -0700
Andres Freund <andres@anarazel.de> wrote:

> Hi,
> 
> On 2019-04-10 14:38:43 -0400, Robert Haas wrote:
> > On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais
> > <jgdr@dalibo.com> wrote:  
> > > In my current design, the scan is done backward from end to start and I
> > > keep all the records appearing after the last occurrence of their
> > > respective FPI.  
> > 
> > Oh, interesting.  That seems like it would require pretty major
> > surgery on the WAL stream.  
> 
> Can't you just read each segment forward, and then reverse?

Not sure what you mean.

I first look for the very last XLOG record by jumping to the last WAL
segment and scanning it forward.

Then I scan backward from there to record the LSNs of the xlogrecords
to keep.

Finally, I clone each WAL segment and edit it as needed (as described in
my previous email).  This is my current WIP, though.

> That's not that much memory?

I don't know yet.  I have not measured it.



Re: block-level incremental backup

From
Michael Paquier
Date:
On Wed, Apr 10, 2019 at 09:42:47PM +0200, Peter Eisentraut wrote:
> That is a great analysis.  Seems like block-level is the preferred way
> forward.

All of the incremental backup solutions I have seen from the community
tend to prefer block-level backups because of the filtering that is
possible based on the LSN in the page header.  The holes in the middle
of a page are also easier to handle, so the per-page size in the actual
backup is reduced.  My preference tends toward a block-level approach if
we were to do something in this area, though I fear that performance
will be bad if we begin to scan all the relation files to fetch the set
of blocks changed since a past LSN.  Hence we need some kind of LSN map
so that it is possible to skip a block or a group of blocks at once (say
one LSN for every 8/16 blocks, for example) for a given relation if the
relation is mostly read-only.
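
For illustration only, such a map could be as simple as one LSN per
group of K consecutive blocks of a relation fork; the names below are
hypothetical and not taken from any existing patch:

#include <stdbool.h>
#include <stdint.h>

#define LSNMAP_GROUP_SIZE 16        /* one tracked LSN per 16 blocks */

typedef struct LsnMap
{
    uint32_t ngroups;               /* number of block groups covered */
    uint64_t group_lsn[];           /* highest LSN seen in each group */
} LsnMap;

/* On page write: bump the group's LSN if this page's LSN is newer. */
static inline void
lsnmap_note_write(LsnMap *map, uint32_t blkno, uint64_t page_lsn)
{
    uint32_t group = blkno / LSNMAP_GROUP_SIZE;

    if (group < map->ngroups && page_lsn > map->group_lsn[group])
        map->group_lsn[group] = page_lsn;
}

/* A whole group can be skipped when taking an incremental backup if
 * nothing in it has changed since the requested threshold LSN. */
static inline bool
lsnmap_group_unchanged(const LsnMap *map, uint32_t group,
                       uint64_t threshold_lsn)
{
    return map->group_lsn[group] < threshold_lsn;
}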
--
Michael


Re: block-level incremental backup

From
Robert Haas
Date:
On Thu, Apr 11, 2019 at 12:22 AM Michael Paquier <michael@paquier.xyz> wrote:
> incremental page size is reduced in the actual backup.  My preference
> tends toward a block-level approach if we were to do something in this
> area, though I fear that performance will be bad if we begin to scan
> all the relation files to fetch a set of blocks since a past LSN.
> Hence we need some kind of LSN map so as it is possible to skip a
> one block or a group of blocks (say one LSN every 8/16 blocks for
> example) at once for a given relation if the relation is mostly
> read-only.

So, in this thread, I want to focus on the UI and how the incremental
backup is stored on disk.  Making the process of identifying modified
blocks efficient is the subject of
http://postgr.es/m/CA+TgmoahOeuuR4pmDP1W=JnRyp4fWhynTOsa68BfxJq-qB_53A@mail.gmail.com

Over there, the merits of what you are describing here and the
competing approaches are under discussion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Anastasia Lubennikova
Date:
09.04.2019 18:48, Robert Haas writes:
> Thoughts?
Hi,
Thank you for bringing that up.
In-core support of incremental backups is a long-awaited feature.
Hopefully, this take will end up committed in PG13.

Speaking of UI:
1) I agree that it should be implemented as a new replication command.

2) There should be a command to get only a map of changes without actual 
data.

Most backup tools establish a server connection, so they can use this
protocol to get the list of changed blocks.
Then they can use this information for any purpose: for example, to
distribute files between parallel workers copying the data, to estimate
the backup size before data is sent, or to store metadata separately
from the data itself.
Most methods (except straightforward LSN comparison) consist of two
steps: get a map of changes, then read the blocks.
So it won't add much extra work.

example commands:
GET_FILELIST [lsn]
returning json (or whatever) with filenames and maps of changed blocks
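
Purely as an illustration (neither the field names nor the shape are
settled, and a binary format may turn out to be better), such a response
could look something like:

{
  "files": [
    {"path": "base/16384/16417", "changed_blocks": [0, 1, 7, 42]},
    {"path": "base/16384/16423", "whole_file": true}
  ]
}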

The map format is also a subject of discussion.
Currently in pg_probackup we reuse code from pg_rewind/datapagemap;
I am not sure that format is a good fit for sending data via the
protocol, though.

3) The API should provide functions to request data with a granularity 
of file and block.
It will be useful for parallelism and for various future projects.

example commands:
GET_DATAFILE [filename [map of blocks] ]
GET_DATABLOCK [filename] [blkno]
returning data in some format

4) The algorithm for collecting changed blocks is another topic,
though its API should be discussed here:

Do we want to have multiple implementations?
Personally, I think that it's good to provide several strategies,
since they have different requirements and fit different workloads.

Maybe we can add a hook to allow custom implementations.

Do we want to allow the backup client to specify which block collection
method to use?
example commands:
GET_FILELIST [lsn] [METHOD lsn | page | ptrack | etc]
Or should it be a server-side cost-based decision?

5) The method based on LSN comparison stands out - it can be done in one 
pass.
So it probably requires special protocol commands.
for example:
GET_DATAFILES [lsn]
GET_DATAFILE [filename] [lsn]

This is pretty simple to implement and pg_basebackup can use this method,
at least until we have something more advanced in-core.
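
To illustrate the one-pass nature of this method, here is a rough,
non-server sketch of the per-page test (the page LSN sits in the first
eight bytes of the page header, stored as xlogid then xrecoff); the
function name is made up for this example:

#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

/* Return nonzero if this page must go into the incremental backup. */
static int
page_changed_since(const unsigned char page[BLCKSZ],
                   uint64_t threshold_lsn)
{
    uint32_t xlogid;
    uint32_t xrecoff;
    uint64_t page_lsn;

    memcpy(&xlogid, page, 4);        /* high half of pd_lsn */
    memcpy(&xrecoff, page + 4, 4);   /* low half of pd_lsn */
    page_lsn = ((uint64_t) xlogid << 32) | xrecoff;

    return page_lsn >= threshold_lsn;
}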

I'll be happy to help with design, code, review, and testing.
Hope that my experience with pg_probackup will be useful.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: block-level incremental backup

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> Several companies, including EnterpriseDB, NTT, and Postgres Pro, have
> developed technology that permits a block-level incremental backup to
> be taken from a PostgreSQL server.  I believe the idea in all of those
> cases is that non-relation files should be backed up in their
> entirety, but for relation files, only those blocks that have been
> changed need to be backed up.

I love the general idea of having additional facilities in core to
support block-level incremental backups.  I've long been unhappy that
any such approach ends up being limited to a subset of the files which
need to be included in the backup, meaning the rest of the files have to
be backed up in their entirety.  I don't think we have to solve for that
as part of this, but I'd like to see a discussion for how to deal with
the other files which are being backed up to avoid needing to just
wholesale copy them.

> I would like to propose that we should
> have a solution for this problem in core, rather than leaving it to
> each individual PostgreSQL company to develop and maintain their own
> solution.

I'm certainly a fan of improving our in-core backup solutions.

I'm quite concerned that trying to graft this on to pg_basebackup
(which, as you note later, is missing an awful lot of what users expect
from a real backup solution already- retention handling, parallel
capabilities, WAL archive management, and many more... but also is just
not nearly as developed a tool as the external solutions) is going to
make things unnecessarily difficult when what we really want here is
better support from core for block-level incremental backup for the
existing external tools to leverage.

Perhaps there's something here which can be done with pg_basebackup to
have it work with the block-level approach, but I certainly don't see
it as a natural next step for it, and it really does seem like limiting the
way this is implemented to something that pg_basebackup can easily
digest might make it less useful for the more developed tools.

As an example, I believe all of the other tools mentioned (at least,
those that are open source I'm pretty sure all do) support parallel
backup and therefore having a way to get the block-level changes in a
parallel fashion would be a pretty big thing that those tools will want
and pg_basebackup is single-threaded today and this proposal doesn't
seem to be contemplating changing that, implying that a serial-based
block-level protocol would be fine but that'd be a pretty awful
restriction for the other tools.

> Generally my idea is:
>
> 1. There should be a way to tell pg_basebackup to request from the
> server only those blocks where LSN >= threshold_value.  There are
> several possible ways for the server to implement this, the simplest
> of which is to just scan all the blocks and send only the ones that
> satisfy that criterion.  That might sound dumb, but it does still save
> network bandwidth, and it works even without any prior setup. It will
> probably be more efficient in many cases to instead scan all the WAL
> generated since that LSN and extract block references from it, but
> that is only possible if the server has all of that WAL available or
> can somehow get it from the archive.  We could also, as several people
> have proposed previously, have some kind of additional relation for
> that stores either a single is-modified bit -- which only helps if the
> reference LSN for the is-modified bit is older than the requested LSN
> but not too much older -- or the highest LSN for each range of K
> blocks, or something like that.  I am at the moment not too concerned
> with the exact strategy we use here. I believe we may want to
> eventually support more than one, since they have different
> trade-offs.

This part of the discussion is another example of how we're limiting
ourselves in this implementation to the "pg_basebackup can work with
this" case- by only considering the options of "scan all the files" or
"use the WAL- if the request is for WAL we have available on the
server."  The other backup solutions mentioned in your initial email,
and others that weren't, have a WAL archive which includes a lot more
WAL than just what the primary currently has.  When I've thought about
how WAL could be used to build a differential or incremental backup, the
question of "do we have all the WAL we need" hasn't ever been a
consideration- because the backup tool manages the WAL archive and has
WAL going back across, most likely, weeks or even months.  Having a tool
which can essentially "compress" WAL would be fantastic and would be
able to be leveraged by all of the different backup solutions.

> 2. When you use pg_basebackup in this way, each relation file that is
> not sent in its entirety is replaced by a file with a different name.
> For example, instead of base/16384/16417, you might get
> base/16384/partial.16417 or however we decide to name them.  Each such
> file will store near the beginning of the file a list of all the
> blocks contained in that file, and the blocks themselves will follow
> at offsets that can be predicted from the metadata at the beginning of
> the file.  The idea is that you shouldn't have to read the whole file
> to figure out which blocks it contains, and if you know specifically
> what blocks you want, you should be able to reasonably efficiently
> read just those blocks.  A backup taken in this manner should also
> probably create some kind of metadata file in the root directory that
> stops the server from starting and lists other salient details of the
> backup.  In particular, you need the threshold LSN for the backup
> (i.e. contains blocks newer than this) and the start LSN for the
> backup (i.e. the LSN that would have been returned from
> pg_start_backup).

Two things here- having some file that "stops the server from starting"
is just going to cause a lot of pain, in my experience.  Users do a lot
of really rather.... curious things, and then come asking questions
about them, and removing the file that stopped the server from starting
is going to quickly become one of those questions on stack overflow that
people just follow the highest-ranked question for, even though everyone
who follows this list will know that doing so results in corruption of
the database.

An alternative approach in developing this feature would be to have
pg_basebackup have an option to run against an *existing* backup, with
the entire point being that the existing backup is updated with these
incremental changes, instead of having some independent tool which takes
the result of multiple pg_basebackup runs and then combines them.

An alternative tool might be one which simply reads the WAL and keeps
track of the FPIs and the updates and then eliminates any duplication
which exists in the set of WAL provided (that is, multiple FPIs for the
same page would be merged into one, and only the delta changes to that
page are preserved, across the entire set of WAL being combined).  Of
course, that's complicated by having to deal with the other files in the
database, so it wouldn't really work on its own.

> 3. There should be a new tool that knows how to merge a full backup
> with any number of incremental backups and produce a complete data
> directory with no remaining partial files.  The tool should check that
> the threshold LSN for each incremental backup is less than or equal to
> the start LSN of the previous backup; if not, there may be changes
> that happened in between which would be lost, so combining the backups
> is unsafe.  Running this tool can be thought of either as restoring
> the backup or as producing a new synthetic backup from any number of
> incremental backups.  This would allow for a strategy of unending
> incremental backups.  For instance, on day 1, you take a full backup.
> On every subsequent day, you take an incremental backup.  On day 9,
> you run pg_combinebackup day1 day2 -o full; rm -rf day1 day2; mv full
> day2.  On each subsequent day you do something similar.  Now you can
> always roll back to any of the last seven days by combining the oldest
> backup you have (which is always a synthetic full backup) with as many
> newer incrementals as you want, up to the point where you want to
> stop.

I'd really prefer that we avoid adding in another low-level tool like
the one described here.  Users, imv anyway, don't want to deal with
*more* tools for handling this aspect of backup/recovery.  If we had a
tool in core today which managed multiple backups, kept track of them,
and all of the WAL during and between them, then we could add options to
that tool to do what's being described here in a way that makes sense
and provides a good interface to users.  I don't know that we're going
to be able to do that with pg_basebackup when, really, the goal here
isn't actually to make pg_basebackup into an enterprise backup tool,
it's to make things easier for the external tools to do block-level
backups.

Thanks!

Stephen


Re: block-level incremental backup

From
Bruce Momjian
Date:
On Mon, Apr 15, 2019 at 09:01:11AM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Robert Haas (robertmhaas@gmail.com) wrote:
> > Several companies, including EnterpriseDB, NTT, and Postgres Pro, have
> > developed technology that permits a block-level incremental backup to
> > be taken from a PostgreSQL server.  I believe the idea in all of those
> > cases is that non-relation files should be backed up in their
> > entirety, but for relation files, only those blocks that have been
> > changed need to be backed up.
> 
> I love the general idea of having additional facilities in core to
> support block-level incremental backups.  I've long been unhappy that
> any such approach ends up being limited to a subset of the files which
> need to be included in the backup, meaning the rest of the files have to
> be backed up in their entirety.  I don't think we have to solve for that
> as part of this, but I'd like to see a discussion for how to deal with
> the other files which are being backed up to avoid needing to just
> wholesale copy them.

I assume you are talking about non-heap/index files.  Which of those are
large enough to benefit from incremental backup?

> > I would like to propose that we should
> > have a solution for this problem in core, rather than leaving it to
> > each individual PostgreSQL company to develop and maintain their own
> > solution. 
> 
> I'm certainly a fan of improving our in-core backup solutions.
> 
> I'm quite concerned that trying to graft this on to pg_basebackup
> (which, as you note later, is missing an awful lot of what users expect
> from a real backup solution already- retention handling, parallel
> capabilities, WAL archive management, and many more... but also is just
> not nearly as developed a tool as the external solutions) is going to
> make things unnecessairly difficult when what we really want here is
> better support from core for block-level incremental backup for the
> existing external tools to leverage.

I think there is some interesting complexity brought up in this thread. 
Which options are going to minimize storage I/O and network I/O, have only
background overhead, allow parallel operation, and integrate with
pg_basebackup?  Eventually we will need to evaluate the incremental
backup options against these criteria.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: block-level incremental backup

From
Robert Haas
Date:
On Thu, Apr 11, 2019 at 1:29 PM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> 2) There should be a command to get only a map of changes without actual
> data.

Good idea.

> 4) The algorithm of collecting changed blocks is another topic.
> Though, it's API should be discussed here:
>
> Do we want to have multiple implementations?
> Personally, I think that it's good to provide several strategies,
> since they have different requirements and fit for different workloads.
>
> Maybe we can add a hook to allow custom implementations.

I'm not sure a hook is going to be practical, but I do think we want
more than one strategy.

> I'll be happy to help with design, code, review, and testing.
> Hope that my experience with pg_probackup will be useful.

Great, thanks!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Robert Haas
Date:
On Mon, Apr 15, 2019 at 9:01 AM Stephen Frost <sfrost@snowman.net> wrote:
> I love the general idea of having additional facilities in core to
> support block-level incremental backups.  I've long been unhappy that
> any such approach ends up being limited to a subset of the files which
> need to be included in the backup, meaning the rest of the files have to
> be backed up in their entirety.  I don't think we have to solve for that
> as part of this, but I'd like to see a discussion for how to deal with
> the other files which are being backed up to avoid needing to just
> wholesale copy them.

Ideas?  Generally, I don't think that anything other than the main
forks of relations is worth worrying about, because the files are too
small to really matter.  Even if they're big, the main forks of
relations will be much bigger.  I think.

> I'm quite concerned that trying to graft this on to pg_basebackup
> (which, as you note later, is missing an awful lot of what users expect
> from a real backup solution already- retention handling, parallel
> capabilities, WAL archive management, and many more... but also is just
> not nearly as developed a tool as the external solutions) is going to
> make things unnecessairly difficult when what we really want here is
> better support from core for block-level incremental backup for the
> existing external tools to leverage.
>
> Perhaps there's something here which can be done with pg_basebackup to
> have it work with the block-level approach, but I certainly don't see
> it as a natural next step for it and really does seem like limiting the
> way this is implemented to something that pg_basebackup can easily
> digest might make it less useful for the more developed tools.

I agree that there are a bunch of things that pg_basebackup does not
do, such as backup management.  I think a lot of users do not want
PostgreSQL to do backup management for them.  They have an existing
solution that they use to manage backups, and they want PostgreSQL to
interoperate with it. I think it makes sense for pg_basebackup to be
in charge of taking the backup, and then other tools can either use it
as a building block or use the streaming replication protocol to send
approximately the same commands to the server.  I certainly would not
want to expose server capabilities that let you take an incremental
backup and NOT teach pg_basebackup to use them -- then we'd be in a
situation of saying that PostgreSQL has incremental backup, but you
have to get external tool XYZ to use it.  That will be perceived as
PostgreSQL does NOT have incremental backup and this external tool
adds it.

> As an example, I believe all of the other tools mentioned (at least,
> those that are open source I'm pretty sure all do) support parallel
> backup and therefore having a way to get the block-level changes in a
> parallel fashion would be a pretty big thing that those tools will want
> and pg_basebackup is single-threaded today and this proposal doesn't
> seem to be contemplating changing that, implying that a serial-based
> block-level protocol would be fine but that'd be a pretty awful
> restriction for the other tools.

I mentioned this exact issue in my original email.  I spoke positively
of it.  But I think it is different from what is being proposed here.
We could have parallel backup without incremental backup, and that
would be a good feature.  We could have parallel backup without full
backup, and that would also be a good feature.  We could also have
both, which would be best of all.  I don't see that my proposal throws
up any architectural obstacle to parallelism.  I assume parallel
backup, whether full or incremental, would be implemented by dividing
up the files that need to be sent across the available connections; if
incremental backup exists, each connection then has to decide whether
to send the whole file or only part of it.

> This part of the discussion is a another example of how we're limiting
> ourselves in this implementation to the "pg_basebackup can work with
> this" case- by only consideration the options of "scan all the files" or
> "use the WAL- if the request is for WAL we have available on the
> server."  The other backup solutions mentioned in your initial email,
> and others that weren't, have a WAL archive which includes a lot more
> WAL than just what the primary currently has.  When I've thought about
> how WAL could be used to build a differential or incremental backup, the
> question of "do we have all the WAL we need" hasn't ever been a
> consideration- because the backup tool manages the WAL archive and has
> WAL going back across, most likely, weeks or even months.  Having a tool
> which can essentially "compress" WAL would be fantastic and would be
> able to be leveraged by all of the different backup solutions.

I don't think this is a case of limiting ourselves; I think it's a
case of keeping separate considerations properly separate.  As I said
in my original email, the client doesn't really need to know how the
server is identifying the blocks that have been modified.  That is the
server's job.  I started a separate thread on the WAL-scanning
approach, so we should take that part of the discussion over there.  I
see no reason why the server couldn't be taught to reach back into an
available archive for WAL that it no longer has locally, but that's
really independent of the design ideas being discussed on this thread.

> Two things here- having some file that "stops the server from starting"
> is just going to cause a lot of pain, in my experience.  Users do a lot
> of really rather.... curious things, and then come asking questions
> about them, and removing the file that stopped the server from starting
> is going to quickly become one of those questions on stack overflow that
> people just follow the highest-ranked question for, even though everyone
> who follows this list will know that doing so results in corruption of
> the database.

Wait, you want to make it maximally easy for users to start the server
in a state that is 100% certain to result in a corrupted and unusable
database?  Why?? I'd like to make that a tiny bit difficult.  If
they really want a corrupted database, they can remove the file.

> An alternative approach in developing this feature would be to have
> pg_basebackup have an option to run against an *existing* backup, with
> the entire point being that the existing backup is updated with these
> incremental changes, instead of having some independent tool which takes
> the result of multiple pg_basebackup runs and then combines them.

That would be really unsafe, because if the tool is interrupted before
it finishes (and fsyncs everything), you no longer have any usable
backup.  It also doesn't lend itself to several of the scenarios I
described in my original email -- like endless incrementals that are
merged into the full backup after some number of days -- a capability
upon which others have already remarked positively.

> An alternative tool might be one which simply reads the WAL and keeps
> track of the FPIs and the updates and then eliminates any duplication
> which exists in the set of WAL provided (that is, multiple FPIs for the
> same page would be merged into one, and only the delta changes to that
> page are preserved, across the entire set of WAL being combined).  Of
> course, that's complicated by having to deal with the other files in the
> database, so it wouldn't really work on its own.

You've jumped back to solving the server's problem (which blocks
should I send?) rather than the client's problem (what does an
incremental backup look like once I've taken it and how do I manage
and restore them?).  It does seem possible to figure out the contents
of modified blocks strictly from looking at the WAL, without any
examination of the current database contents.  However, it also seems
very complicated, because the tool that is figuring out the current
block contents just by looking at the WAL would have to know how to
apply any type of WAL record, not just one that contains an FPI.  And
I really don't want to build a client-side tool that knows how to
apply WAL.

> I'd really prefer that we avoid adding in another low-level tool like
> the one described here.  Users, imv anyway, don't want to deal with
> *more* tools for handling this aspect of backup/recovery.  If we had a
> tool in core today which managed multiples backups, kept track of them,
> and all of the WAL during and between them, then we could add options to
> that tool to do what's being described here in a way that makes sense
> and provides a good interface to users.  I don't know that we're going
> to be able to do that with pg_basebackup when, really, the goal here
> isn't actually to make pg_basebackup into an enterprise backup tool,
> it's to make things easier for the external tools to do block-level
> backups.

Well, I agree with you that the goal is not to make pg_basebackup an
enterprise backup tool.  However, I don't see teaching it to take
incremental backups as being opposed to that goal.  I think backup
management and retention should remain firmly outside the purview of
pg_basebackup and be left either to some other in-core tool or maybe
even to out-of-core tools.  However, I don't see any reason why the
task of taking an incremental and/or parallel backup should also be
left to another tool.

There is a very close relationship between the thing that
pg_basebackup already does (copy everything) and the thing that we
want to do here (copy everything except blocks that we know haven't
changed). If we made it the job of some other tool to take parallel
and/or incremental backups, that other tool would need to reimplement
a lot of things that pg_basebackup has already got, like tar vs. plain
format, fast vs. spread checkpoint, rate-limiting, compression levels,
etc.  That seems like a waste.  Better to give pg_basebackup the
capability to do those things, and then any backup management tool
that anyone writes can take advantage of those capabilities.

I come at this, BTW, from the perspective of having just spent a bunch
of time working on EDB's Backup And Recovery Tool (BART).  That tool
works in exactly the manner you seem to be advocating: it knows how to
do incremental and parallel full backups, and it also does backup
management.  However, this has not turned out to be the best division
of labor.  People who don't want to use the backup management
capabilities may still want the parallel or incremental backup
capabilities, and if all of that is within the envelope of an
"enterprise backup tool," they don't have that option.  So I want to
split it up.  I want pg_basebackup to take all the kinds of backups
that PostgreSQL supports -- full, incremental, parallel, serial,
whatever -- and I want some other tool -- pgBackRest, BART, barman, or
some yet-to-be-invented core thing to do the management of those
backups.  Then everybody can use exactly the bits they want.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Mon, Apr 15, 2019 at 09:01:11AM -0400, Stephen Frost wrote:
> > * Robert Haas (robertmhaas@gmail.com) wrote:
> > > Several companies, including EnterpriseDB, NTT, and Postgres Pro, have
> > > developed technology that permits a block-level incremental backup to
> > > be taken from a PostgreSQL server.  I believe the idea in all of those
> > > cases is that non-relation files should be backed up in their
> > > entirety, but for relation files, only those blocks that have been
> > > changed need to be backed up.
> >
> > I love the general idea of having additional facilities in core to
> > support block-level incremental backups.  I've long been unhappy that
> > any such approach ends up being limited to a subset of the files which
> > need to be included in the backup, meaning the rest of the files have to
> > be backed up in their entirety.  I don't think we have to solve for that
> > as part of this, but I'd like to see a discussion for how to deal with
> > the other files which are being backed up to avoid needing to just
> > wholesale copy them.
>
> I assume you are talking about non-heap/index files.  Which of those are
> large enough to benefit from incremental backup?

Based on discussions I had with Andrey, specifically the visibility map
is an issue for them with WAL-G.  I haven't spent a lot of time thinking
about it, but I can understand how that could be an issue.

> > I'm quite concerned that trying to graft this on to pg_basebackup
> > (which, as you note later, is missing an awful lot of what users expect
> > from a real backup solution already- retention handling, parallel
> > capabilities, WAL archive management, and many more... but also is just
> > not nearly as developed a tool as the external solutions) is going to
> > make things unnecessairly difficult when what we really want here is
> > better support from core for block-level incremental backup for the
> > existing external tools to leverage.
>
> I think there is some interesting complexity brought up in this thread.
> Which options are going to minimize storage I/O, network I/O, have only
> background overhead, allow parallel operation, integrate with
> pg_basebackup.  Eventually we will need to evaluate the incremental
> backup options against these criteria.

This presumes that we're going to have multiple competing incremental
backup options presented, doesn't it?  Are you aware of another effort
going on which aims for inclusion in core?  There's been past attempts
made, but I don't believe there's anyone else currently planning to or
working on something for inclusion in core.

Just to be clear- we're not currently working on one, but I'd really
like to see core provide good support for incremental block-level backup
so that we can leverage it when it is there.

Thanks!

Stephen


Re: block-level incremental backup

From
Robert Haas
Date:
On Tue, Apr 16, 2019 at 5:44 PM Stephen Frost <sfrost@snowman.net> wrote:
> > > I love the general idea of having additional facilities in core to
> > > support block-level incremental backups.  I've long been unhappy that
> > > any such approach ends up being limited to a subset of the files which
> > > need to be included in the backup, meaning the rest of the files have to
> > > be backed up in their entirety.  I don't think we have to solve for that
> > > as part of this, but I'd like to see a discussion for how to deal with
> > > the other files which are being backed up to avoid needing to just
> > > wholesale copy them.
> >
> > I assume you are talking about non-heap/index files.  Which of those are
> > large enough to benefit from incremental backup?
>
> Based on discussions I had with Andrey, specifically the visibility map
> is an issue for them with WAL-G.  I haven't spent a lot of time thinking
> about it, but I can understand how that could be an issue.

If I understand correctly, the VM contains 1 byte per 4 heap pages and
the FSM contains 1 byte per heap page (plus some overhead for higher
levels of the tree).  Since the FSM is not WAL-logged, I'm not sure
there's a whole lot we can do to avoid having to back it up, although
maybe there's some clever idea I'm not quite seeing.  The VM is
WAL-logged, albeit with some strange warts that I have the honor of
inventing, so there's more possibilities there.

Before worrying about it too much, it would be useful to hear more
about the concerns related to these forks, so that we make sure we're
solving the right problem.  It seems difficult for a single relation
to be big enough for these to be much of an issue.  For example, on a
1TB relation, we have 2^40 bytes = 2^27 pages = ~2^25 bytes of VM fork
= 32MB.  Not nothing, but 32MB of useless overhead every time you back
up a 1TB database probably isn't going to break the bank.  It might be
more of a concern for users with many small tables.  For example, if
somebody has got a million tables with 1 page in each one, they'll
have a million data pages, a million VM pages, and 3 million FSM pages
(unless the new don't-create-the-FSM-for-small-tables stuff in v12
kicks in).  I don't know if it's worth going to a lot of trouble to
optimize that case.  Creating a million tables with 100 tuples (or
whatever) in each one sounds like terrible database design to me.
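
Just to spell out that arithmetic (8kB pages; this is only the
calculation above written down, not measured numbers):

#include <stdio.h>

int main(void)
{
    long long heap_bytes = 1LL << 40;          /* 1TB heap relation */
    long long heap_pages = heap_bytes / 8192;  /* 2^27 pages */
    long long vm_bytes   = heap_pages / 4;     /* 1 byte per 4 heap pages */
    long long fsm_bytes  = heap_pages;         /* ~1 byte per heap page */

    printf("VM  ~ %lld MB\n", vm_bytes / (1024 * 1024));   /* 32 MB */
    printf("FSM ~ %lld MB\n", fsm_bytes / (1024 * 1024));  /* 128 MB */
    return 0;
}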

> > > I'm quite concerned that trying to graft this on to pg_basebackup
> > > (which, as you note later, is missing an awful lot of what users expect
> > > from a real backup solution already- retention handling, parallel
> > > capabilities, WAL archive management, and many more... but also is just
> > > not nearly as developed a tool as the external solutions) is going to
> > > make things unnecessairly difficult when what we really want here is
> > > better support from core for block-level incremental backup for the
> > > existing external tools to leverage.
> >
> > I think there is some interesting complexity brought up in this thread.
> > Which options are going to minimize storage I/O, network I/O, have only
> > background overhead, allow parallel operation, integrate with
> > pg_basebackup.  Eventually we will need to evaluate the incremental
> > backup options against these criteria.
>
> This presumes that we're going to have multiple competeing incremental
> backup options presented, doesn't it?  Are you aware of another effort
> going on which aims for inclusion in core?  There's been past attempts
> made, but I don't believe there's anyone else currently planning to or
> working on something for inclusion in core.

Yeah, I really hope we don't end up with dueling patches.  I want to
come up with an approach that can be widely-endorsed and then have
everybody rowing in the same direction.  On the other hand, I do think
that we may support multiple options in certain places which may have
the kinds of trade-offs that Bruce mentions.  For instance,
identifying changed blocks by scanning the whole cluster and checking
the LSN of each block has an advantage in that it requires no prior
setup or extra configuration.  Like a sequential scan, it always
works, and that is an advantage.  Of course, for many people, the
competing advantage of a WAL-scanning approach that can save a lot of
I/O will appear compelling, but maybe not for everyone.  I think
there's room for two or three approaches there -- not in the sense of
competing patches, but in the sense of giving users a choice based on
their needs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Bruce Momjian
Date:
On Tue, Apr 16, 2019 at 06:40:44PM -0400, Robert Haas wrote:
> Yeah, I really hope we don't end up with dueling patches.  I want to
> come up with an approach that can be widely-endorsed and then have
> everybody rowing in the same direction.  On the other hand, I do think
> that we may support multiple options in certain places which may have
> the kinds of trade-offs that Bruce mentions.  For instance,
> identifying changed blocks by scanning the whole cluster and checking
> the LSN of each block has an advantage in that it requires no prior
> setup or extra configuration.  Like a sequential scan, it always
> works, and that is an advantage.  Of course, for many people, the
> competing advantage of a WAL-scanning approach that can save a lot of
> I/O will appear compelling, but maybe not for everyone.  I think
> there's room for two or three approaches there -- not in the sense of
> competing patches, but in the sense of giving users a choice based on
> their needs.

Well, by having a separate modblock file for each WAL file, you can keep
both WAL and modblock files and use the modblock list to pull pages from
each WAL file, or from the heap/index files, and it can be done in
parallel.  Having WAL and modblock files in the same directory makes
retention simpler.

In fact, you can do an incremental backup just using the modblock files
and the heap/index files, so you don't even need the WAL.

Also, instead of storing the file name and block number in the modblock
file, using the database oid, relfilenode, and block number (3 int32
values) should be sufficient.
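
For example, each entry could be as small as 12 bytes (the field names
here are only illustrative):

#include <stdint.h>

/* One modified-block entry, as sketched above. */
typedef struct ModBlockEntry
{
    uint32_t dboid;        /* database OID */
    uint32_t relfilenode;  /* relation file node */
    uint32_t blkno;        /* block number within the relation */
} ModBlockEntry;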

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: block-level incremental backup

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Tue, Apr 16, 2019 at 5:44 PM Stephen Frost <sfrost@snowman.net> wrote:
> > > > I love the general idea of having additional facilities in core to
> > > > support block-level incremental backups.  I've long been unhappy that
> > > > any such approach ends up being limited to a subset of the files which
> > > > need to be included in the backup, meaning the rest of the files have to
> > > > be backed up in their entirety.  I don't think we have to solve for that
> > > > as part of this, but I'd like to see a discussion for how to deal with
> > > > the other files which are being backed up to avoid needing to just
> > > > wholesale copy them.
> > >
> > > I assume you are talking about non-heap/index files.  Which of those are
> > > large enough to benefit from incremental backup?
> >
> > Based on discussions I had with Andrey, specifically the visibility map
> > is an issue for them with WAL-G.  I haven't spent a lot of time thinking
> > about it, but I can understand how that could be an issue.
>
> If I understand correctly, the VM contains 1 byte per 4 heap pages and
> the FSM contains 1 byte per heap page (plus some overhead for higher
> levels of the tree).  Since the FSM is not WAL-logged, I'm not sure
> there's a whole lot we can do to avoid having to back it up, although
> maybe there's some clever idea I'm not quite seeing.  The VM is
> WAL-logged, albeit with some strange warts that I have the honor of
> inventing, so there's more possibilities there.
>
> Before worrying about it too much, it would be useful to hear more
> about the concerns related to these forks, so that we make sure we're
> solving the right problem.  It seems difficult for a single relation
> to be big enough for these to be much of an issue.  For example, on a
> 1TB relation, we have 2^40 bytes = 2^27 pages = ~2^25 bits of VM fork
> = 32MB.  Not nothing, but 32MB of useless overhead every time you back
> up a 1TB database probably isn't going to break the bank.  It might be
> more of a concern for users with many small tables.  For example, if
> somebody has got a million tables with 1 page in each one, they'll
> have a million data pages, a million VM pages, and 3 million FSM pages
> (unless the new don't-create-the-FSM-for-small-tables stuff in v12
> kicks in).  I don't know if it's worth going to a lot of trouble to
> optimize that case.  Creating a million tables with 100 tuples (or
> whatever) in each one sounds like terrible database design to me.

As I understand it, the problem is not with backing up an individual
database or cluster, but rather dealing with backing up thousands of
individual clusters with thousands of tables in each, leading to an
awful lot of tables with lots of FSMs/VMs, all of which end up having to
get copied and stored wholesale.  I'll point this thread out to him and
hopefully he'll have a chance to share more specific information.

> > > > I'm quite concerned that trying to graft this on to pg_basebackup
> > > > (which, as you note later, is missing an awful lot of what users expect
> > > > from a real backup solution already- retention handling, parallel
> > > > capabilities, WAL archive management, and many more... but also is just
> > > > not nearly as developed a tool as the external solutions) is going to
> > > > make things unnecessairly difficult when what we really want here is
> > > > better support from core for block-level incremental backup for the
> > > > existing external tools to leverage.
> > >
> > > I think there is some interesting complexity brought up in this thread.
> > > Which options are going to minimize storage I/O, network I/O, have only
> > > background overhead, allow parallel operation, integrate with
> > > pg_basebackup.  Eventually we will need to evaluate the incremental
> > > backup options against these criteria.
> >
> > This presumes that we're going to have multiple competeing incremental
> > backup options presented, doesn't it?  Are you aware of another effort
> > going on which aims for inclusion in core?  There's been past attempts
> > made, but I don't believe there's anyone else currently planning to or
> > working on something for inclusion in core.
>
> Yeah, I really hope we don't end up with dueling patches.  I want to
> come up with an approach that can be widely-endorsed and then have
> everybody rowing in the same direction.  On the other hand, I do think
> that we may support multiple options in certain places which may have
> the kinds of trade-offs that Bruce mentions.  For instance,
> identifying changed blocks by scanning the whole cluster and checking
> the LSN of each block has an advantage in that it requires no prior
> setup or extra configuration.  Like a sequential scan, it always
> works, and that is an advantage.  Of course, for many people, the
> competing advantage of a WAL-scanning approach that can save a lot of
> I/O will appear compelling, but maybe not for everyone.  I think
> there's room for two or three approaches there -- not in the sense of
> competing patches, but in the sense of giving users a choice based on
> their needs.

I can agree with the idea of having multiple options for how to collect
up the set of changed blocks, though I continue to feel that a
WAL-scanning approach isn't something that we'd have implemented in the
backend at all since it doesn't require the backend and a given backend
might not even have all of the WAL that is relevant.  I certainly don't
think it makes sense to have a backend go get WAL from the archive to
then merge the WAL to provide the result to a client asking for it-
that's adding entirely unnecessary load to the database server.

As such, only the LSN-based scanning of relation files to produce the
set of changed blocks seems to make sense to me to implement in the
backend.

Just to be clear- I don't have any problem with a tool being implemented
in core to support the scanning of WAL to produce a changeset, I just
don't think that's something we'd have built into the *backend*, nor do
I think it would make sense to add that functionality to the replication
(or any other) protocol, at least not with support for arbitrary LSN
starting and ending points.

A thought that occurs to me is to have the functions supporting WAL
merging included in libcommon and available both to the independent
executable for doing WAL merging and to the backend, so that it could do
the WAL merging itself- but for a specific purpose: having a way to
reduce the amount of WAL that needs to be sent
to a replica which has a replication slot but that's been disconnected
for a while.  Of course, there'd have to be some way to handle the other
files for that to work to update a long out-of-date replica.  Now, if we
taught the backup tool about having a replication slot then perhaps we
could have the backend effectively have the same capability proposed
above, but without the need to go get the WAL from the archive
repository.

I'm still not entirely sure that this makes sense to do in the backend
due to the additional load; this is really just some brainstorming.

Thanks!

Stephen


Re: block-level incremental backup

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Apr 15, 2019 at 9:01 AM Stephen Frost <sfrost@snowman.net> wrote:
> > I love the general idea of having additional facilities in core to
> > support block-level incremental backups.  I've long been unhappy that
> > any such approach ends up being limited to a subset of the files which
> > need to be included in the backup, meaning the rest of the files have to
> > be backed up in their entirety.  I don't think we have to solve for that
> > as part of this, but I'd like to see a discussion for how to deal with
> > the other files which are being backed up to avoid needing to just
> > wholesale copy them.
>
> Ideas?  Generally, I don't think that anything other than the main
> forks of relations are worth worrying about, because the files are too
> small to really matter.  Even if they're big, the main forks of
> relations will be much bigger.  I think.

Sadly, I haven't got any great ideas today.  I do know that the WAL-G
folks have specifically mentioned issues with the visibility map being
large enough across enough of their systems that it kinda sucks to deal
with.  Perhaps we could do something like the rsync binary-diff protocol
for non-relation files?  This is clearly just hand-waving but maybe
there's something reasonable in that idea.

> > I'm quite concerned that trying to graft this on to pg_basebackup
> > (which, as you note later, is missing an awful lot of what users expect
> > from a real backup solution already- retention handling, parallel
> > capabilities, WAL archive management, and many more... but also is just
> > not nearly as developed a tool as the external solutions) is going to
> > make things unnecessairly difficult when what we really want here is
> > better support from core for block-level incremental backup for the
> > existing external tools to leverage.
> >
> > Perhaps there's something here which can be done with pg_basebackup to
> > have it work with the block-level approach, but I certainly don't see
> > it as a natural next step for it and really does seem like limiting the
> > way this is implemented to something that pg_basebackup can easily
> > digest might make it less useful for the more developed tools.
>
> I agree that there are a bunch of things that pg_basebackup does not
> do, such as backup management.  I think a lot of users do not want
> PostgreSQL to do backup management for them.  They have an existing
> solution that they use to manage backups, and they want PostgreSQL to
> interoperate with it. I think it makes sense for pg_basebackup to be
> in charge of taking the backup, and then other tools can either use it
> as a building block or use the streaming replication protocol to send
> approximately the same commands to the server.

There's something like 6 different backup tools, at least, for
PostgreSQL that provide backup management, so I have a really hard time
agreeing with this idea that users don't want a PG backup management
system.  Maybe that's not what you're suggesting here, but that's what
came across to me.

Yes, there are some users who have an existing backup solution and
they'd like a better way to integrate PostgreSQL into that solution,
but that's usually something like filesystem snapshots or an enterprise
backup tool which has a PostgreSQL agent or similar to do the start/stop
and collect up the WAL, not something that's just calling pg_basebackup.

Those are typically not things we have any visibility into though and
aren't open source either (and, at least as often as not, they don't
seem to be very well thought through, based on my experience with those
tools...).

Unless maybe I'm misunderstanding and what you're suggesting here is
that the "existing solution" is something like the external PG-specific
backup tools?  But then the rest doesn't seem to make sense, as only
maybe one or two of those tools use pg_basebackup internally.

> I certainly would not
> want to expose server capabilities that let you take an incremental
> backup and NOT teach pg_basebackup to use them -- then we'd be in a
> situation of saying that PostgreSQL has incremental backup, but you
> have to get external tool XYZ to use it.  That will be perceived as
> PostgreSQL does NOT have incremental backup and this external tool
> adds it.

... but this is exactly the situation we're in already with all of the
*other* features around backup (parallel backup, backup management, WAL
management, etc).  Users want those features, pg_basebackup/PG core
doesn't provide it, and therefore there's a bunch of other tools which
have been written that do.  In addition, saying that PG has incremental
backup but no built-in management of those full-vs-incremental backups
and telling users that they basically have to build that themselves
really feels a lot like we're trying to address a check-box requirement
rather than making something that our users are going to be happy with.

> > As an example, I believe all of the other tools mentioned (at least,
> > those that are open source I'm pretty sure all do) support parallel
> > backup and therefore having a way to get the block-level changes in a
> > parallel fashion would be a pretty big thing that those tools will want
> > and pg_basebackup is single-threaded today and this proposal doesn't
> > seem to be contemplating changing that, implying that a serial-based
> > block-level protocol would be fine but that'd be a pretty awful
> > restriction for the other tools.
>
> I mentioned this exact issue in my original email.  I spoke positively
> of it.  But I think it is different from what is being proposed here.
> We could have parallel backup without incremental backup, and that
> would be a good feature.  We could have parallel backup without full
> backup, and that would also be a good feature.  We could also have
> both, which would be best of all.  I don't see that my proposal throws
> up any architectural obstacle to parallelism.  I assume parallel
> backup, whether full or incremental, would be implemented by dividing
> up the files that need to be sent across the available connections; if
> incremental backup exists, each connection then has to decide whether
> to send the whole file or only part of it.

I don't think that I was very clear in what my specific concern here
was.  I'm not asking for pg_basebackup to have parallel backup (at
least, not in this part of the discussion), I'm asking for the
incremental block-based protocol that's going to be built-in to core to
be able to be used in a parallel fashion.

The existing protocol that pg_basebackup uses is basically, connect to
the server and then say "please give me a tarball of the data directory"
and that is then streamed on that connection, making that protocol
impossible to use for parallel backup.  That's fine as far as it goes
because only pg_basebackup actually uses that protocol (note that nearly
all of the other tools for doing backups of PostgreSQL don't...).  If
we're expecting the external tools to use the block-level incremental
protocol then that protocol really needs to have a way to be
parallelized, otherwise we're just going to end up with all of the
individual tools doing their own thing for block-level incremental
(though perhaps they'd reimplement whatever is done in core but in a way
that they could parallelize it...), if possible (which I add just in
case there's some idea that we end up in a situation where the
block-level incremental backup has to coordinate with the backend in
some fashion to work...  which would mean that *everyone* has to use the
protocol even if it isn't parallel and that would be really bad, imv).

> > This part of the discussion is a another example of how we're limiting
> > ourselves in this implementation to the "pg_basebackup can work with
> > this" case- by only consideration the options of "scan all the files" or
> > "use the WAL- if the request is for WAL we have available on the
> > server."  The other backup solutions mentioned in your initial email,
> > and others that weren't, have a WAL archive which includes a lot more
> > WAL than just what the primary currently has.  When I've thought about
> > how WAL could be used to build a differential or incremental backup, the
> > question of "do we have all the WAL we need" hasn't ever been a
> > consideration- because the backup tool manages the WAL archive and has
> > WAL going back across, most likely, weeks or even months.  Having a tool
> > which can essentially "compress" WAL would be fantastic and would be
> > able to be leveraged by all of the different backup solutions.
>
> I don't think this is a case of limiting ourselves; I think it's a
> case of keeping separate considerations properly separate.  As I said
> in my original email, the client doesn't really need to know how the
> server is identifying the blocks that have been modified.  That is the
> server's job.  I started a separate thread on the WAL-scanning
> approach, so we should take that part of the discussion over there.  I
> see no reason why the server couldn't be taught to reach back into an
> available archive for WAL that it no longer has locally, but that's
> really independent of the design ideas being discussed on this thread.

I've provided thoughts on that other thread, I'm happy to discuss
further there.

> > Two things here- having some file that "stops the server from starting"
> > is just going to cause a lot of pain, in my experience.  Users do a lot
> > of really rather.... curious things, and then come asking questions
> > about them, and removing the file that stopped the server from starting
> > is going to quickly become one of those questions on stack overflow that
> > people just follow the highest-ranked answer for, even though everyone
> > who follows this list will know that doing so results in corruption of
> > the database.
>
> Wait, you want to make it maximally easy for users to start the server
> in a state that is 100% certain to result in a corrupted and unusable
> database?  Why?? I'd like to make that a tiny bit difficult.  If
> they really want a corrupted database, they can remove the file.

No, I don't want it to be easy for users to start the server in a state
that's going to result in a corrupted cluster.  That's basically the
complete opposite of what I was going for- having a file that can be
trivially removed to start up the cluster is *going* to result in people
having corrupted clusters, no matter how much we tell them "don't do
that".  This is exactly the problem with have with backup_label today.
I'd really rather not double-down on that.

> > An alternative approach in developing this feature would be to have
> > pg_basebackup have an option to run against an *existing* backup, with
> > the entire point being that the existing backup is updated with these
> > incremental changes, instead of having some independent tool which takes
> > the result of multiple pg_basebackup runs and then combines them.
>
> That would be really unsafe, because if the tool is interrupted before
> it finishes (and fsyncs everything), you no longer have any usable
> backup.  It also doesn't lend itself to several of the scenarios I
> described in my original email -- like endless incrementals that are
> merged into the full backup after some number of days -- a capability
> upon which others have already remarked positively.

There's really two things here- the first is that I agree with the
concern about potentially destroying the existing backup if the
pg_basebackup doesn't complete, but there's some ways to address that
(such as filesystem snapshotting), so I'm not sure that the idea is
quite that bad, but it would need to be more than just what
pg_basebackup does in this case in order to be trustworthy (at least,
for most).

The other part here is the idea of endless incrementals where the blocks
which don't appear to have changed are never re-validated against what's
in the backup.  Unfortunately, latent corruption happens and you really
want to have a way to check for that.  In past discussions that I've had
with David, there's been some idea to check some percentage of the
blocks that didn't appear to change for each backup against what's in
the backup.

I share this just to point out that there's some risk to that approach,
not to say that we shouldn't do it or that we should discourage the
development of such a feature.

> > An alternative tool might be one which simply reads the WAL and keeps
> > track of the FPIs and the updates and then eliminates any duplication
> > which exists in the set of WAL provided (that is, multiple FPIs for the
> > same page would be merged into one, and only the delta changes to that
> > page are preserved, across the entire set of WAL being combined).  Of
> > course, that's complicated by having to deal with the other files in the
> > database, so it wouldn't really work on its own.
>
> You've jumped back to solving the server's problem (which blocks
> should I send?) rather than the client's problem (what does an
> incremental backup look like once I've taken it and how do I manage
> and restore them?).  It does seem possible to figure out the contents
> of modified blocks strictly from looking at the WAL, without any
> examination of the current database contents.  However, it also seems
> very complicated, because the tool that is figuring out the current
> block contents just by looking at the WAL would have to know how to
> apply any type of WAL record, not just one that contains an FPI.  And
> I really don't want to build a client-side tool that knows how to
> apply WAL.

Wow.  I have to admit that I feel completely opposite of that- I'd
*love* to have an independent tool (which ideally uses the same code
through the common library, or similar) that can be run to apply WAL.

In other words, I don't agree that it's the server's problem at all to
solve that, or, at least, I don't believe that it needs to be.

> > I'd really prefer that we avoid adding in another low-level tool like
> > the one described here.  Users, imv anyway, don't want to deal with
> > *more* tools for handling this aspect of backup/recovery.  If we had a
> > tool in core today which managed multiples backups, kept track of them,
> > and all of the WAL during and between them, then we could add options to
> > that tool to do what's being described here in a way that makes sense
> > and provides a good interface to users.  I don't know that we're going
> > to be able to do that with pg_basebackup when, really, the goal here
> > isn't actually to make pg_basebackup into an enterprise backup tool,
> > it's to make things easier for the external tools to do block-level
> > backups.
>
> Well, I agree with you that the goal is not to make pg_basebackup an
> enterprise backup tool.  However, I don't see teaching it to take
> incremental backups as opposed to that goal.  I think backup
> management and retention should remain firmly outside the purview of
> pg_basebackup and left either to some other in-core tool or maybe even
> to out-of-core tools.  However, I don't see any reason why the
> task of taking an incremental and/or parallel backup should also be
> left to another tool.

I've tried to outline how the incremental backup capability and backup
management are really very closely related and having those be
implemented by independent tools is not a good interface for our users
to have to live with.

> There is a very close relationship between the thing that
> pg_basebackup already does (copy everything) and the thing that we
> want to do here (copy everything except blocks that we know haven't
> changed). If we made it the job of some other tool to take parallel
> and/or incremental backups, that other tool would need to reimplement
> a lot of things that pg_basebackup has already got, like tar vs. plain
> format, fast vs. spread checkpoint, rate-limiting, compression levels,
> etc.  That seems like a waste.  Better to give pg_basebackup the
> capability to do those things, and then any backup management tool
> that anyone writes can take advantage of those capabilities.

I don't believe any of the external tools which do backups of PostgreSQL
support tar format.  Fast-vs-spread checkpointing isn't in the purview
of the external tools; they just have to accept the option and pass it
to pg_start_backup(), which they already know how to do.  Rate-limiting
and compression are implemented by those other tools already, where it's
been desired.

Most of the external tools don't use pg_basebackup, nor the base backup
protocol (or, if they do, it's only as an option among others).  In my
opinion, that's pretty clear indication that pg_basebackup and the base
backup protocol aren't sufficient to cover any but the simplest of
use-cases (though those simple use-cases are handled rather well).
We're talking about adding on a capability that's much more complicated
and is one that a lot of tools have already taken a stab at, let's try
to do it in a way that those tools can leverage it and avoid having to
implement it themselves.

> I come at this, BTW, from the perspective of having just spent a bunch
> of time working on EDB's Backup And Recovery Tool (BART).  That tool
> works in exactly the manner you seem to be advocating: it knows how to
> do incremental and parallel full backups, and it also does backup
> management.  However, this has not turned out to be the best division
> of labor.  People who don't want to use the backup management
> capabilities may still want the parallel or incremental backup
> capabilities, and if all of that is within the envelope of an
> "enterprise backup tool," they don't have that option.  So I want to
> split it up.  I want pg_basebackup to take all the kinds of backups
> that PostgreSQL supports -- full, incremental, parallel, serial,
> whatever -- and I want some other tool -- pgBackRest, BART, barman, or
> some yet-to-be-invented core thing to do the management of those
> backups.  Then everybody can use exactly the bits they want.

I come at this from years of working with David on pgBackRest, listening
to what users want, what features they like, what they'd like to see
added, and what they don't like about how it works today.

It's an interesting idea to add in everything to pg_basebackup that
users doing backups would like to see, but that's quite a list:

- full backups
- differential backups
- incremental backups / block-level backups
- (server-side) compression
- (server-side) encryption
- page-level checksum validation
- calculating checksums (on the whole file)
- External object storage (S3, et al)
- more things...

I'm really not convinced that I agree with the division of labor as
you've outlined it, where all of the above is done by pg_basebackup,
where just archiving and backup retention are handled by some external
tool (except that we already have pg_receivewal, so archiving isn't
really an externally handled thing either, unless you want features like
parallel archive-push or parallel archive-get...).

What would really help me, at least, understand the idea here would be
to understand exactly what the existing tools do that the subset of
users you're thinking about doesn't like/want, but which pg_basebackup,
today, does.  Is the issue that there's a repository instead of just a
plain PG directory or set of tar files, like what pg_basebackup produces
today?  But how would we do things like have compression, or encryption,
or block-based incremental backups without some kind of repository or
directory that doesn't actually look exactly like a PG data directory?

Another thing I really don't understand from this discussion, and part of
why it's taken me a while to respond, is this, from above:

> I think a lot of users do not want
> PostgreSQL to do backup management for them.

Followed by:

> I come at this, BTW, from the perspective of having just spent a bunch
> of time working on EDB's Backup And Recovery Tool (BART).  That tool
> works in exactly the manner you seem to be advocating: it knows how to
> do incremental and parallel full backups, and it also does backup
> management.

I certainly can understand that there are PostgreSQL users who want to
leverage incremental backups without having to use BART or another tool
outside of whatever enterprise backup system they've got, but surely
there's also a large pool of users who *do* want a PG backup tool that manages
backups, or you wouldn't have spent a considerable amount of your very
valuable time hacking on BART.  I've certainly seen a fair share of both
and I don't think we should set out to exclude either.

Perhaps that's what we're both saying too and just talking past each
other, but I feel like the approach here is "make it work just for the
simple pg_basebackup case and not worry too much about the other tools,
since what we do for pg_basebackup will work for them too" while where
I'm coming from is "focus on what the other tools need first, and then
make pg_basebackup work with that if there's a sensible way to do so."

A third possibility is that it's just too early to be talking about this
since it means we've gotta be awfully vague about it.

Thanks!

Stephen


Re: block-level incremental backup

From
David Fetter
Date:
On Wed, Apr 17, 2019 at 11:57:35AM -0400, Bruce Momjian wrote:
> On Tue, Apr 16, 2019 at 06:40:44PM -0400, Robert Haas wrote:
> > Yeah, I really hope we don't end up with dueling patches.  I want to
> > come up with an approach that can be widely-endorsed and then have
> > everybody rowing in the same direction.  On the other hand, I do think
> > that we may support multiple options in certain places which may have
> > the kinds of trade-offs that Bruce mentions.  For instance,
> > identifying changed blocks by scanning the whole cluster and checking
> > the LSN of each block has an advantage in that it requires no prior
> > setup or extra configuration.  Like a sequential scan, it always
> > works, and that is an advantage.  Of course, for many people, the
> > competing advantage of a WAL-scanning approach that can save a lot of
> > I/O will appear compelling, but maybe not for everyone.  I think
> > there's room for two or three approaches there -- not in the sense of
> > competing patches, but in the sense of giving users a choice based on
> > their needs.
> 
> Well, by having a separate modblock file for each WAL file, you can keep
> both WAL and modblock files and use the modblock list to pull pages from
> each WAL file, or from the heap/index files, and it can be done in
> parallel.  Having WAL and modblock files in the same directory makes
> retention simpler.
> 
> In fact, you can do an incremental backup just using the modblock files
> and the heap/index files, so you don't even need the WAL.
> 
> Also, instead of storing the file name and block number in the modblock
> file, using the database oid, relfilenode, and block number (3 int32
> values) should be sufficient.
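
For concreteness, I'm picturing an entry along these lines -- purely
illustrative, not a proposal for an actual on-disk layout:

    #include <stdint.h>

    /*
     * One entry in a hypothetical "modblock" file: a modified block
     * identified by database OID, relfilenode, and block number,
     * i.e. three 32-bit values as suggested above.
     */
    typedef struct ModBlockEntry
    {
        uint32_t    dboid;          /* database OID */
        uint32_t    relfilenode;    /* relation file node */
        uint32_t    blocknum;       /* block number within the relation */
    } ModBlockEntry;                /* 12 bytes per modified block */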

Would doing it that way constrain the design of new table access
methods in some meaningful way?

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: block-level incremental backup

From
Bruce Momjian
Date:
On Thu, Apr 18, 2019 at 05:32:57PM +0200, David Fetter wrote:
> On Wed, Apr 17, 2019 at 11:57:35AM -0400, Bruce Momjian wrote:
> > Also, instead of storing the file name and block number in the modblock
> > file, using the database oid, relfilenode, and block number (3 int32
> > values) should be sufficient.
> 
> Would doing it that way constrain the design of new table access
> methods in some meaningful way?

I think these are the values used in WAL, so I assume table access
methods already have to map to those, unless they use their own.
I actually don't know.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: block-level incremental backup

From
Robert Haas
Date:
On Wed, Apr 17, 2019 at 5:20 PM Stephen Frost <sfrost@snowman.net> wrote:
> As I understand it, the problem is not with backing up an individual
> database or cluster, but rather dealing with backing up thousands of
> individual clusters with thousands of tables in each, leading to an
> awful lot of tables with lots of FSMs/VMs, all of which end up having to
> get copied and stored wholesale.  I'll point this thread out to him and
> hopefully he'll have a chance to share more specific information.

Sounds good.

> I can agree with the idea of having multiple options for how to collect
> up the set of changed blocks, though I continue to feel that a
> WAL-scanning approach isn't something that we'd have implemented in the
> backend at all since it doesn't require the backend and a given backend
> might not even have all of the WAL that is relevant.  I certainly don't
> think it makes sense to have a backend go get WAL from the archive to
> then merge the WAL to provide the result to a client asking for it-
> that's adding entirely unnecessary load to the database server.

My motivation for wanting to include it in the database server was twofold:

1. I was hoping to leverage the background worker machinery.  The
WAL-scanner would just run all the time in the background, and start
up and shut down along with the server.  If it's a standalone tool,
then it can run on a different server or when the server is down, both
of which are nice.  The downside though is that now you probably have
to put it in crontab or under systemd or something, instead of just
setting a couple of GUCs and letting the server handle the rest.  For
me that downside seems rather significant, but YMMV.

2. In order for the information produced by the WAL-scanner to be
useful, it's got to be available to the server when the server is
asked for an incremental backup.  If the information is constructed by
a standalone frontend tool, and stored someplace other than under
$PGDATA, then the server won't have convenient access to it.  I guess
we could make it the client's job to provide that information to the
server, but I kind of liked the simplicity of not needing to give the
server anything more than an LSN.

> A thought that occurs to me is to have the functions for supporting the
> WAL merging be included in libcommon and available to both the
> independent executable that's available for doing WAL merging, and to
> the backend to be able to do WAL merging itself-

Yeah, that might be possible.

> but for a specific
> purpose: having a way to reduce the amount of WAL that needs to be sent
> to a replica which has a replication slot but that's been disconnected
> for a while.  Of course, there'd have to be some way to handle the other
> files for that to work to update a long out-of-date replica.  Now, if we
> taught the backup tool about having a replication slot then perhaps we
> could have the backend effectively have the same capability proposed
> above, but without the need to go get the WAL from the archive
> repository.

Hmm, but you can't just skip over WAL records or segments because
there are checksums and previous-record pointers and things....

> I'm still not entirely sure that this makes sense to do in the backend
> due to the additional load, this is really just some brainstorming.

Would it really be that much load?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Andres Freund
Date:
Hi,

On 2019-04-18 11:34:32 -0400, Bruce Momjian wrote:
> On Thu, Apr 18, 2019 at 05:32:57PM +0200, David Fetter wrote:
> > On Wed, Apr 17, 2019 at 11:57:35AM -0400, Bruce Momjian wrote:
> > > Also, instead of storing the file name and block number in the modblock
> > > file, using the database oid, relfilenode, and block number (3 int32
> > > values) should be sufficient.
> > 
> > Would doing it that way constrain the design of new table access
> > methods in some meaningful way?
> 
> I think these are the values used in WAL, so I assume table access
> methods already have to map to those, unless they use their own.
> I actually don't know.

I don't think it'd be a meaningful restriction. Given that we use those
for shared_buffer descriptors, WAL etc.

Greetings,

Andres Freund



Re: block-level incremental backup

From
Robert Haas
Date:
On Wed, Apr 17, 2019 at 6:43 PM Stephen Frost <sfrost@snowman.net> wrote:
> Sadly, I haven't got any great ideas today.  I do know that the WAL-G
> folks have specifically mentioned issues with the visibility map being
> large enough across enough of their systems that it kinda sucks to deal
> with.  Perhaps we could do something like the rsync binary-diff protocol
> for non-relation files?  This is clearly just hand-waving but maybe
> there's something reasonable in that idea.

I guess it all comes down to how complicated you're willing to make
the client-server protocol.  With the very simple protocol that I
proposed -- client provides a threshold LSN and server sends blocks
modified since then -- the client need not have access to the old
incremental backup to take a new one.  Of course, if it happens to
have access to the old backup then it can delta-compress however it
likes after-the-fact, but that doesn't help with the amount of network
transfer.  That problem could be solved by doing something like what
you're talking about (with some probably-negligible false match rate)
but I have no intention of trying to implement anything that
complicated, and I don't really think it's necessary, at least not for
a first version.  What I proposed would already allow, for most users,
a large reduction in transfer and storage costs; what you are talking
about here would help more, but also be a lot more work and impose
some additional requirements on the system.  I don't object to you
implementing the more complex system, but I'll pass.
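
To be concrete about how simple the simple approach is, the server-side
test is really just "is this page's LSN at or past the threshold?".  A
rough standalone sketch (my own illustration, not the proposed patch --
it assumes 8kB blocks and reads the LSN straight out of the page
header):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BLCKSZ 8192

    /* The page LSN is the first 8 bytes of the page header: xlogid, xrecoff. */
    static uint64_t
    page_lsn(const unsigned char *page)
    {
        uint32_t    xlogid, xrecoff;

        memcpy(&xlogid, page, sizeof(uint32_t));
        memcpy(&xrecoff, page + sizeof(uint32_t), sizeof(uint32_t));
        return ((uint64_t) xlogid << 32) | xrecoff;
    }

    int
    main(int argc, char **argv)
    {
        /* usage: lsnscan <relation segment file> <threshold LSN, as hex> */
        FILE           *fp;
        unsigned char   page[BLCKSZ];
        uint64_t        threshold;
        unsigned        blkno = 0;

        if (argc != 3 || (fp = fopen(argv[1], "rb")) == NULL)
            return 1;
        threshold = strtoull(argv[2], NULL, 16);

        while (fread(page, 1, BLCKSZ, fp) == BLCKSZ)
        {
            /* Only blocks modified at or after the threshold get sent. */
            if (page_lsn(page) >= threshold)
                printf("block %u: LSN %llX, would be sent\n",
                       blkno, (unsigned long long) page_lsn(page));
            blkno++;
        }
        fclose(fp);
        return 0;
    }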

> There's something like 6 different backup tools, at least, for
> PostgreSQL that provide backup management, so I have a really hard time
> agreeing with this idea that users don't want a PG backup management
> system.  Maybe that's not what you're suggesting here, but that's what
> came across to me.

Let me be a little more clear.  Different users want different things.
Some people want a canned PostgreSQL backup solution, while other
people just want access to a reasonable set of facilities from which
they can construct their own solution.  I believe that the proposal I
am making here could be used either by backup tool authors to enhance
their offerings, or by individuals who want to build up their own
solution using facilities provided by core.

> Unless maybe I'm misunderstanding and what you're suggesting here is
> that the "existing solution" is something like the external PG-specific
> backup tools?  But then the rest doesn't seem to make sense, as only
> maybe one or two of those tools use pg_basebackup internally.

Well, what I'm really talking about is in two pieces: providing some
new facilities via the replication protocol, and making pg_basebackup
able to use those facilities.  Nothing would stop other tools from
using those facilities directly if they wish.

> ... but this is exactly the situation we're in already with all of the
> *other* features around backup (parallel backup, backup management, WAL
> management, etc).  Users want those features, pg_basebackup/PG core
> doesn't provide it, and therefore there's a bunch of other tools which
> have been written that do.  In addition, saying that PG has incremental
> backup but no built-in management of those full-vs-incremental backups
> and telling users that they basically have to build that themselves
> really feels a lot like we're trying to address a check-box requirement
> rather than making something that our users are going to be happy with.

I disagree.  Yes, parallel backup, like incremental backup, needs to
go in core.  And pg_basebackup should be able to do a parallel backup.
I will fight tooth, nail, and claw any suggestion that the server
should know how to do a parallel backup but pg_basebackup should not
have an option to exploit that capability.  And similarly for
incremental.

> I don't think that I was very clear in what my specific concern here
> was.  I'm not asking for pg_basebackup to have parallel backup (at
> least, not in this part of the discussion), I'm asking for the
> incremental block-based protocol that's going to be built-in to core to
> be able to be used in a parallel fashion.
>
> The existing protocol that pg_basebackup uses is basically, connect to
> the server and then say "please give me a tarball of the data directory"
> and that is then streamed on that connection, making that protocol
> impossible to use for parallel backup.  That's fine as far as it goes
> because only pg_basebackup actually uses that protocol (note that nearly
> all of the other tools for doing backups of PostgreSQL don't...).  If
> we're expecting the external tools to use the block-level incremental
> protocol then that protocol really needs to have a way to be
> parallelized, otherwise we're just going to end up with all of the
> individual tools doing their own thing for block-level incremental
> (though perhaps they'd reimplement whatever is done in core but in a way
> that they could parallelize it...), if possible (which I add just in
> case there's some idea that we end up in a situation where the
> block-level incremental backup has to coordinate with the backend in
> some fashion to work...  which would mean that *everyone* has to use the
> protocol even if it isn't parallel and that would be really bad, imv).

The obvious way of extending this system to parallel backup is to have
N connections each streaming a separate tarfile such that when you
combine them all you recreate the original data directory.  That would
be perfectly compatible with what I'm proposing for incremental
backup.  Maybe you have another idea in mind, but I don't know what it
is exactly.
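
In other words, the server would just need some deterministic way of
splitting the file list across the N requested connections; a trivial
illustration of the shape of it (nothing more than that):

    #include <stdio.h>

    /*
     * Illustration only: divide the data directory's files across
     * nworkers connections, each of which streams its own tarfile.
     * A real implementation would presumably balance by size.
     */
    static void
    assign_files(const char **files, int nfiles, int nworkers)
    {
        for (int i = 0; i < nfiles; i++)
            printf("connection %d sends %s\n", i % nworkers, files[i]);
    }

    int
    main(void)
    {
        const char *files[] = {"base/16384/16417", "base/16384/16418",
                               "pg_xact/0000", "postgresql.conf"};

        assign_files(files, 4, 2);
        return 0;
    }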

> > Wait, you want to make it maximally easy for users to start the server
> > in a state that is 100% certain to result in a corrupted and unusable
> > database?  Why?? I'd like to make that a tiny bit difficult.  If
> > they really want a corrupted database, they can remove the file.
>
> No, I don't want it to be easy for users to start the server in a state
> that's going to result in a corrupted cluster.  That's basically the
> complete opposite of what I was going for- having a file that can be
> trivially removed to start up the cluster is *going* to result in people
> having corrupted clusters, no matter how much we tell them "don't do
> that".  This is exactly the problem with have with backup_label today.
> I'd really rather not double-down on that.

Well, OK, but short of scanning the entire directory tree on startup,
I don't see how to achieve that.

> There's really two things here- the first is that I agree with the
> concern about potentially destroying the existing backup if the
> pg_basebackup doesn't complete, but there's some ways to address that
> (such as filesystem snapshotting), so I'm not sure that the idea is
> quite that bad, but it would need to be more than just what
> pg_basebackup does in this case in order to be trustworthy (at least,
> for most).

Well, I did mention in my original email that there could be a
combine-backups-destructively option.  I guess this is just taking
that to the next level: merge a backup being taken into an existing
backup on-the-fly.  Given you remarks above, it is worth noting that
this GREATLY increases the chances of people accidentally causing
corruption in ways that are almost undetectable.  All they have to do
is kill -9 the backup tool half way through and then start postgres on
the resulting directory.

> The other part here is the idea of endless incrementals where the blocks
> which don't appear to have changed are never re-validated against what's
> in the backup.  Unfortunately, latent corruption happens and you really
> want to have a way to check for that.  In past discussions that I've had
> with David, there's been some idea to check some percentage of the
> blocks that didn't appear to change for each backup against what's in
> the backup.

Sure, I'm not trying to block anybody from developing something like
that, and I acknowledge that there is risk in a system like this,
but...

> I share this just to point out that there's some risk to that approach,
> not to say that we shouldn't do it or that we should discourage the
> development of such a feature.

...it seems we are viewing this, at least, from the same perspective.

> Wow.  I have to admit that I feel completely opposite of that- I'd
> *love* to have an independent tool (which ideally uses the same code
> through the common library, or similar) that can be run to apply WAL.
>
> In other words, I don't agree that it's the server's problem at all to
> solve that, or, at least, I don't believe that it needs to be.

I mean, I guess I'd love to have that if I could get it by waving a
magic wand, but I wouldn't love it if I had to write the code or
maintain it.  The routines for applying WAL currently all assume that
you have a whole bunch of server infrastructure present; that code
wouldn't run in a frontend environment, I think.  I wouldn't want to
have a second copy of every WAL apply routine that might have its own
set of bugs.

> I've tried to outline how the incremental backup capability and backup
> management are really very closely related and having those be
> implemented by independent tools is not a good interface for our users
> to have to live with.

I disagree.  I think the "existing backup tools don't use
pg_basebackup" argument isn't very compelling, because the reason
those tools don't use pg_basebackup is because it can't do what they
need.  If it did, they'd probably use it.  People don't write a whole
separate engine for running backups just because it's fun to not reuse
code -- they do it because there's no other way to get what they want.

> Most of the external tools don't use pg_basebackup, nor the base backup
> protocol (or, if they do, it's only as an option among others).  In my
> opinion, that's pretty clear indication that pg_basebackup and the base
> backup protocol aren't sufficient to cover any but the simplest of
> use-cases (though those simple use-cases are handled rather well).
> We're talking about adding on a capability that's much more complicated
> and is one that a lot of tools have already taken a stab at, let's try
> to do it in a way that those tools can leverage it and avoid having to
> implement it themselves.

I mean, again, if it were part of pg_basebackup and available via the
replication protocol, they could do exactly that, through either
method.  I don't get it.  You seem to be arguing that we shouldn't add
the necessary capabilities to the replication protocol or
pg_basebackup, but at the same time arguing that pg_basebackup is
inadequate because it's missing important capabilities.  This confuses
me.

> It's an interesting idea to add in everything to pg_basebackup that
> users doing backups would like to see, but that's quite a list:
>
> - full backups
> - differential backups
> - incremental backups / block-level backups
> - (server-side) compression
> - (server-side) encryption
> - page-level checksum validation
> - calculating checksums (on the whole file)
> - External object storage (S3, et al)
> - more things...
>
> I'm really not convinced that I agree with the division of labor as
> you've outlined it, where all of the above is done by pg_basebackup,
> where just archiving and backup retention are handled by some external
> tool (except that we already have pg_receivewal, so archiving isn't
> really an externally handled thing either, unless you want features like
> parallel archive-push or parallel archive-get...).

Yeah, if it were up to me, I'd choose put most of that in the server
and make it available via the replication protocol, and then give
pg_basebackup able to use that functionality.  And external tools
could use that functionality via pg_basebackup or by using the
replication protocol directly.  I actually don't really understand
what the alternative is.  If you want server-side compression, for
example, that really has to be done on the server.  And how would the
server expose that, except through the replication protocol?  Sure, we
could design a new protocol for it. Call it... say... the
shmeplication protocol.  And then you could use the replication
protocol for what it does today and the shmeplication protocol for all
the cool bits.  But why would that be better?

> What would really help me, at least, understand the idea here would be
> to understand exactly what the existing tools do that the subset of
> users you're thinking about doesn't like/want, but which pg_basebackup,
> today, does.  Is the issue that there's a repository instead of just a
> plain PG directory or set of tar files, like what pg_basebackup produces
> today?  But how would we do things like have compression, or encryption,
> or block-based incremental backups without some kind of repository or
> directory that doesn't actually look exactly like a PG data directory?

I guess we're still wallowing in the same confusion here.
pg_basebackup, for me, is just a convenient place to stick this
functionality.  If the server has the ability to construct and send an
incremental backup by some means, then it needs a client on the other
end to receive and store that backup, and since pg_basebackup already
knows how to do that for full backups, extending it to incremental
backups (and/or parallel, encrypted, compressed, and validated
backups) seems very natural to me.  Otherwise I add server-side
functionality to allow $X and then have to  write an entirely new
client to interact with that instead of just using the client I've
already got.  That's more work, and I'm lazy.

Now it's true that if we wanted to build something like the rsync
protocol into PostgreSQL, jamming that into pg_basebackup might well
be a bridge too far.  That would involve taking backups via a method
so different from what we're currently doing that it would probably
make sense to at least consider creating a whole new tool for that
purpose.  But that wasn't my proposal...

> I certainly can understand that there are PostgreSQL users who want to
> leverage incremental backups without having to use BART or another tool
> outside of whatever enterprise backup system they've got, but surely
> there's also a large pool of users who *do* want a PG backup tool that manages
> backups, or you wouldn't have spent a considerable amount of your very
> valuable time hacking on BART.  I've certainly seen a fair share of both
> and I don't think we should set out to exclude either.

Sure, I agree.

> Perhaps that's what we're both saying too and just talking past each
> other, but I feel like the approach here is "make it work just for the
> simple pg_basebackup case and not worry too much about the other tools,
> since what we do for pg_basebackup will work for them too" while where
> I'm coming from is "focus on what the other tools need first, and then
> make pg_basebackup work with that if there's a sensible way to do so."

I think perhaps the disconnect is that I just don't see how it can
fail to work for the external tools if it works for pg_basebackup.
Any given piece of functionality is either available in the
replication stream, or it's not.  I suspect that for both BART and
pg_backrest, they won't be able to completely give up on having their
own backup engines solely because core has incremental backup, but I
don't know what the alternative to adding features to core one at a
time is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Andres Freund
Date:
Hi,

> > Wow.  I have to admit that I feel completely opposite of that- I'd
> > *love* to have an independent tool (which ideally uses the same code
> > through the common library, or similar) that can be run to apply WAL.
> >
> > In other words, I don't agree that it's the server's problem at all to
> > solve that, or, at least, I don't believe that it needs to be.
> 
> I mean, I guess I'd love to have that if I could get it by waving a
> magic wand, but I wouldn't love it if I had to write the code or
> maintain it.  The routines for applying WAL currently all assume that
> you have a whole bunch of server infrastructure present; that code
> wouldn't run in a frontend environment, I think.  I wouldn't want to
> have a second copy of every WAL apply routine that might have its own
> set of bugs.

I'll fight tooth and nail not to have a second implementation of replay,
even if it's just portions.  The code we have is complicated and fragile
enough, having a [partial] second version would be way worse.  There's
already plenty of improvements we need to make to speed up replay, and a
lot of them require multiple execution threads (be it processes or OS
threads), something not easily feasible in a standalone tool. And
without the already existing concurrent work during replay (primarily
checkpointer doing a lot of the necessary IO), it'd also be pretty
unattractive to use any separate tool.

Unless you just define the server binary as that "independent tool".
Which I think is entirely reasonable. With the 'consistent' and LSN
recovery targets one already can get most of what's needed from such a
tool, anyway.  I'd argue the biggest issue there is that there's no
equivalent to starting postgres with a private socket directory on
Windows, and perhaps an option or two making it easier to start postgres
in a "private" mode for things like this.

Greetings,

Andres Freund



Re: block-level incremental backup

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Apr 17, 2019 at 5:20 PM Stephen Frost <sfrost@snowman.net> wrote:
> > As I understand it, the problem is not with backing up an individual
> > database or cluster, but rather dealing with backing up thousands of
> > individual clusters with thousands of tables in each, leading to an
> > awful lot of tables with lots of FSMs/VMs, all of which end up having to
> > get copied and stored wholesale.  I'll point this thread out to him and
> > hopefully he'll have a chance to share more specific information.
>
> Sounds good.

Ok, done.

> > I can agree with the idea of having multiple options for how to collect
> > up the set of changed blocks, though I continue to feel that a
> > WAL-scanning approach isn't something that we'd have implemented in the
> > backend at all since it doesn't require the backend and a given backend
> > might not even have all of the WAL that is relevant.  I certainly don't
> > think it makes sense to have a backend go get WAL from the archive to
> > then merge the WAL to provide the result to a client asking for it-
> > that's adding entirely unnecessary load to the database server.
>
> My motivation for wanting to include it in the database server was twofold:
>
> 1. I was hoping to leverage the background worker machinery.  The
> WAL-scanner would just run all the time in the background, and start
> up and shut down along with the server.  If it's a standalone tool,
> then it can run on a different server or when the server is down, both
> of which are nice.  The downside though is that now you probably have
> to put it in crontab or under systemd or something, instead of just
> setting a couple of GUCs and letting the server handle the rest.  For
> me that downside seems rather significant, but YMMV.

Background workers can be used to do pretty much anything.  I'm not
suggesting that's a bad thing- just that it's such a completely generic
tool that could be used to put anything/everything into the backend, so
I'm not sure how much it makes sense as an argument when it comes to
designing a new capability/feature.  Yes, there's an advantage there
when it comes to configuration since that means we don't need to set up
a cronjob and can, instead, just set a few GUCs...  but it also means
that it *must* be done on the server and there's no option to do it
elsewhere, as you say.

When it comes to "this is something that I can do on the DB server or on
some other server", the usual preference is to use another system for
it, to reduce load on the server.

If it comes down to something that needs to, or should, be an ongoing
process, then packagers can package it as a daemon-type tool that
handles the systemd integration, assuming the stand-alone tool supports
that, which it hopefully would.

> 2. In order for the information produced by the WAL-scanner to be
> useful, it's got to be available to the server when the server is
> asked for an incremental backup.  If the information is constructed by
> a standalone frontend tool, and stored someplace other than under
> $PGDATA, then the server won't have convenient access to it.  I guess
> we could make it the client's job to provide that information to the
> server, but I kind of liked the simplicity of not needing to give the
> server anything more than an LSN.

If the WAL-scanner tool is a stand-alone tool, and it handles picking
out all of the FPIs and incremental page changes for each relation, then
what does the tool that builds out the "new" backup really need to tell the
backend?  I feel like it mainly needs to ask the backend for the
non-relation files, which gets into at least one approach that I've
thought about for redesigning the backup protocol:

1. Ask for a list of files and metadata about them
2. Allow asking for individual files
3. Support multiple connections asking for individual files

Quite a few of the existing backup tools for PG use a model along these
lines (or use tools underneath which do).
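
As a hand-wavy sketch of what that shape looks like from a client's
point of view (LIST_FILES and SEND_FILE are made-up command names, used
only to illustrate the list-then-fetch model; they don't exist in the
protocol today):

    #include <libpq-fe.h>
    #include <stdio.h>

    int
    main(void)
    {
        PGconn     *conn = PQconnectdb("replication=database dbname=postgres");
        PGresult   *res;

        if (PQstatus(conn) != CONNECTION_OK)
            return 1;

        /* 1. Ask for the list of files plus metadata (size, mtime, ...). */
        res = PQexec(conn, "LIST_FILES");       /* hypothetical command */
        for (int i = 0; i < PQntuples(res); i++)
            printf("%s  %s bytes\n",
                   PQgetvalue(res, i, 0), PQgetvalue(res, i, 1));
        PQclear(res);

        /*
         * 2. Ask for an individual file.  3. A parallel client would open
         * several connections and issue these fetches concurrently.
         */
        res = PQexec(conn, "SEND_FILE 'base/16384/16417'"); /* hypothetical */
        PQclear(res);

        PQfinish(conn);
        return 0;
    }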

> > A thought that occurs to me is to have the functions for supporting the
> > WAL merging be included in libcommon and available to both the
> > independent executable that's available for doing WAL merging, and to
> > the backend to be able to do WAL merging itself-
>
> Yeah, that might be possible.

I feel like this would be necessary, as it's certainly delicate and
critical code and having multiple implementations of it will be
difficult to manage.

That said...  we already have independent work going on to do WAL
merging (WAL-G, at least), and if we insist that the WAL replay code
only exists in the backend, I strongly suspect we'll end up with
independent implementations of that too.  Sure, we can distance
ourselves from that and say that we don't have to deal with any bugs
from it... but it seems like the better approach would be to have a
common library that provides it.

> > but for a specific
> > purpose: having a way to reduce the amount of WAL that needs to be sent
> > to a replica which has a replication slot but that's been disconnected
> > for a while.  Of course, there'd have to be some way to handle the other
> > files for that to work to update a long out-of-date replica.  Now, if we
> > taught the backup tool about having a replication slot then perhaps we
> > could have the backend effectively have the same capability proposed
> > above, but without the need to go get the WAL from the archive
> > repository.
>
> Hmm, but you can't just skip over WAL records or segments because
> there are checksums and previous-record pointers and things....

Those aren't what I would be worried about, I'd think?  Maybe we're
talking about different things, but if there's a way to scan/compress
WAL so that we have less work to do when replaying, then we should
leverage that for replicas that have been disconnected for a while too.

One important bit here is that the replica wouldn't be able to answer
queries while it's working through this compressed WAL, since it
wouldn't reach a consistent state until more-or-less the end of WAL, but
I am not sure that's a bad thing; who wants to get responses back from a
very out-of-date replica?

> > I'm still not entirely sure that this makes sense to do in the backend
> > due to the additional load, this is really just some brainstorming.
>
> Would it really be that much load?

Well, it'd clearly be more than zero.  There may be an argument to be
made that it's worth it to reduce the overall throughput of the system
in order to add this capability, but I don't think we've got enough
information at this point to know.  My gut feeling, at least, is that
tracking enough information to do WAL-compression on a high-write system
is going to be pretty expensive as you'd need to have a data structure
that makes it easy to identify every page in the system, and be able to
find each of them later on in the stream, and then throw away the old
FPI in favor of the new one, and then track all the incremental page
updates to that page, more-or-less, right?
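
Roughly, I'm imagining per-page state along these lines (illustrative
only -- a real implementation needs a proper hash table, fork numbers,
spill-to-disk, and so on):

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct PageKey
    {
        uint32_t    dboid;          /* database OID */
        uint32_t    relfilenode;    /* relation file node */
        uint32_t    blocknum;       /* block number */
    } PageKey;

    typedef struct PageState
    {
        PageKey     key;
        uint64_t    fpi_lsn;        /* LSN of the newest full-page image seen */
        uint64_t   *delta_lsns;     /* records touching the page after that FPI */
        int         ndeltas;
        struct PageState *next;     /* naive chaining; stands in for a hash table */
    } PageState;

    /* A newer FPI for the page makes all previously tracked state irrelevant. */
    static void
    note_new_fpi(PageState *ps, uint64_t lsn)
    {
        ps->fpi_lsn = lsn;
        free(ps->delta_lsns);
        ps->delta_lsns = NULL;
        ps->ndeltas = 0;
    }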

On a large system, given how much information has to be tracked, it
seems like it could be a fair bit of load, but perhaps you've got some
ideas as to how to reduce it..?

Thanks!

Stephen


Re: block-level incremental backup

From
Stephen Frost
Date:
Greetings,

I wanted to respond to this point specifically as I feel like it'll
really help clear things up when it comes to the point of view I'm
seeing this from.

* Robert Haas (robertmhaas@gmail.com) wrote:
> > Perhaps that's what we're both saying too and just talking past each
> > other, but I feel like the approach here is "make it work just for the
> > simple pg_basebackup case and not worry too much about the other tools,
> > since what we do for pg_basebackup will work for them too" while where
> > I'm coming from is "focus on what the other tools need first, and then
> > make pg_basebackup work with that if there's a sensible way to do so."
>
> I think perhaps the disconnect is that I just don't see how it can
> fail to work for the external tools if it works for pg_basebackup.

The existing backup protocol that pg_basebackup uses *does* *not* *work*
for the external backup tools.  If it worked, they'd use it, but they
don't and that's because you can't do things like a parallel backup,
which we *know* users want because there's a number of tools which
implement that exact capability.

I do *not* want another piece of functionality added in this space which
is limited in the same way because it does *not* help the external
backup tools at all.

> Any given piece of functionality is either available in the
> replication stream, or it's not.  I suspect that for both BART and
> pg_backrest, they won't be able to completely give up on having their
> own backup engines solely because core has incremental backup, but I
> don't know what the alternative to adding features to core one at a
> time is.

This idea that it's either "in the replication system" or "not in the
replication system" is really bad, in my view, because it can be "in the
replication system" and at the same time not at all useful to the
existing external backup tools, but users and others will see the
"checkbox" as ticked and assume that it's available in a useful fashion
by the backend and then get upset when they discover the limitations.

The existing base backup/replication protocol that's used by
pg_basebackup is *not* useful to most of the backup tools, that's quite
clear since they *don't* use it.  Building on to that an incremental
backup solution that is similarly limited isn't going to make things
easier for the external tools.

If the goal is to make things easier for the external tools by providing
capability in the backend / replication protocol then we need to be
looking at what those tools require and not at what would be minimally
sufficient for pg_basebackup.  If we don't care about the external tools
and *just* care about making it work for pg_basebackup, then let's be
clear about that, and accept that it'll have to be, most likely, ripped
out and rewritten when we go to add parallel capabilities, for example,
to pg_basebackup down the road.  That's clearly the case for the
existing "base backup" protocol, so I don't see why it'd be different
for an incremental backup system that is similarly designed and
implemented.

To be clear, I'm all for adding feature to core one at a time, but
there's different ways to implement features and that's really what
we're talking about here- what's the best way to implement this
feature, ideally in a way that it's useful, practically, to both
pg_basebackup and the other external backup utilities.

Thanks!

Stephen


Re: block-level incremental backup

From
Stephen Frost
Date:
Greetings,

Ok, responding to the rest of this email.

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Apr 17, 2019 at 6:43 PM Stephen Frost <sfrost@snowman.net> wrote:
> > Sadly, I haven't got any great ideas today.  I do know that the WAL-G
> > folks have specifically mentioned issues with the visibility map being
> > large enough across enough of their systems that it kinda sucks to deal
> > with.  Perhaps we could do something like the rsync binary-diff protocol
> > for non-relation files?  This is clearly just hand-waving but maybe
> > there's something reasonable in that idea.
>
> I guess it all comes down to how complicated you're willing to make
> the client-server protocol.  With the very simple protocol that I
> proposed -- client provides a threshold LSN and server sends blocks
> modified since then -- the client need not have access to the old
> incremental backup to take a new one.

Where is the client going to get the threshold LSN from?

> Of course, if it happens to
> have access to the old backup then it can delta-compress however it
> likes after-the-fact, but that doesn't help with the amount of network
> transfer.

If it doesn't have access to the old backup, then I'm a bit confused as
to how an incremental backup would be possible?  Isn't that a requirement
here?

> That problem could be solved by doing something like what
> you're talking about (with some probably-negligible false match rate)
> but I have no intention of trying to implement anything that
> complicated, and I don't really think it's necessary, at least not for
> a first version.  What I proposed would already allow, for most users,
> a large reduction in transfer and storage costs; what you are talking
> about here would help more, but also be a lot more work and impose
> some additional requirements on the system.  I don't object to you
> implementing the more complex system, but I'll pass.

I was talking about the rsync binary-diff specifically for the files
that aren't easy to deal with in the WAL stream.  I wouldn't think we'd
use it for other files, and there is definitely a question there of if
there's a way to do better than a binary-diff approach for those files.

> > There's something like 6 different backup tools, at least, for
> > PostgreSQL that provide backup management, so I have a really hard time
> > agreeing with this idea that users don't want a PG backup management
> > system.  Maybe that's not what you're suggesting here, but that's what
> > came across to me.
>
> Let me be a little more clear.  Different users want different things.
> Some people want a canned PostgreSQL backup solution, while other
> people just want access to a reasonable set of facilities from which
> they can construct their own solution.  I believe that the proposal I
> am making here could be used either by backup tool authors to enhance
> their offerings, or by individuals who want to build up their own
> solution using facilities provided by core.

The last thing that I think users really want is to build up their own
solution.  There may be some organizations who would like to provide
their own tool, but that's a bit different.  Personally, I'd *really*
like PG to have a good tool in this area and I've been working, as I've
said before, to try to get to a point where we at least have the option
to add in such a tool that meets our various requirements.

Further, I'm concerned that the approach being presented here won't be
interesting to most of the external tools because it's limited and can't
be used in a parallel fashion.

> > Unless maybe I'm misunderstanding and what you're suggesting here is
> > that the "existing solution" is something like the external PG-specific
> > backup tools?  But then the rest doesn't seem to make sense, as only
> > maybe one or two of those tools use pg_basebackup internally.
>
> Well, what I'm really talking about is in two pieces: providing some
> new facilities via the replication protocol, and making pg_basebackup
> able to use those facilities.  Nothing would stop other tools from
> using those facilities directly if they wish.

If those facilities are developed and implemented in the same way as the
protocol used by pg_basebackup works, then I strongly suspect that the
existing backup tools will treat it similarly- which is to say, they'll
largely end up ignoring it.

> > ... but this is exactly the situation we're in already with all of the
> > *other* features around backup (parallel backup, backup management, WAL
> > management, etc).  Users want those features, pg_basebackup/PG core
> > doesn't provide it, and therefore there's a bunch of other tools which
> > have been written that do.  In addition, saying that PG has incremental
> > backup but no built-in management of those full-vs-incremental backups
> > and telling users that they basically have to build that themselves
> > really feels a lot like we're trying to address a check-box requirement
> > rather than making something that our users are going to be happy with.
>
> I disagree.  Yes, parallel backup, like incremental backup, needs to
> go in core.  And pg_basebackup should be able to do a parallel backup.
> I will fight tooth, nail, and claw any suggestion that the server
> should know how to do a parallel backup but pg_basebackup should not
> have an option to exploit that capability.  And similarly for
> incremental.

These aren't independent things though, the way it seems like you're
portraying them.  There are ways we could implement incremental backup
that would support being parallelized, and ways we could implement it
that wouldn't work with parallelism at all.  All I'm arguing for is that
we add this feature in a way that it can be parallelized (since that's
what most of the external tools do today...), even though pg_basebackup
can't be, while still letting pg_basebackup use it (albeit in a serial
fashion).

> > I don't think that I was very clear in what my specific concern here
> > was.  I'm not asking for pg_basebackup to have parallel backup (at
> > least, not in this part of the discussion), I'm asking for the
> > incremental block-based protocol that's going to be built-in to core to
> > be able to be used in a parallel fashion.
> >
> > The existing protocol that pg_basebackup uses is basically, connect to
> > the server and then say "please give me a tarball of the data directory"
> > and that is then streamed on that connection, making that protocol
> > impossible to use for parallel backup.  That's fine as far as it goes
> > because only pg_basebackup actually uses that protocol (note that nearly
> > all of the other tools for doing backups of PostgreSQL don't...).  If
> > we're expecting the external tools to use the block-level incremental
> > protocol then that protocol really needs to have a way to be
> > parallelized, otherwise we're just going to end up with all of the
> > individual tools doing their own thing for block-level incremental
> > (though perhaps they'd reimplement whatever is done in core but in a way
> > that they could parallelize it...), if possible (which I add just in
> > case there's some idea that we end up in a situation where the
> > block-level incremental backup has to coordinate with the backend in
> > some fashion to work...  which would mean that *everyone* has to use the
> > protocol even if it isn't parallel and that would be really bad, imv).
>
> The obvious way of extending this system to parallel backup is to have
> N connections each streaming a separate tarfile such that when you
> combine them all you recreate the original data directory.  That would
> be perfectly compatible with what I'm proposing for incremental
> backup.  Maybe you have another idea in mind, but I don't know what it
> is exactly.

So, while that's an obvious approach, it isn't the most sensible- and
we know that from experience in actually implementing parallel backup of
PG files.  I'm happy to discuss the approach we use in pgBackRest if
you'd like to discuss this further, but it seems a bit far afield from
the topic of discussion here and it seems like you're not interested or
offering to work on supporting parallel backup in core.

I'm not saying that what you're proposing here wouldn't, technically,
work for the various external tools; what I'm saying is that they aren't
going to actually use it, which means that you're really implementing it
*only* for pg_basebackup's benefit... and only for as long as
pg_basebackup is serial in nature.

> > > Wait, you want to make it maximally easy for users to start the server
> > > in a state that is 100% certain to result in a corrupted and unusable
> > > database?  Why?? I'd like to make that a tiny bit difficult.  If
> > > they really want a corrupted database, they can remove the file.
> >
> > No, I don't want it to be easy for users to start the server in a state
> > that's going to result in a corrupted cluster.  That's basically the
> > complete opposite of what I was going for- having a file that can be
> > trivially removed to start up the cluster is *going* to result in people
> > having corrupted clusters, no matter how much we tell them "don't do
> > that".  This is exactly the problem with have with backup_label today.
> > I'd really rather not double-down on that.
>
> Well, OK, but short of scanning the entire directory tree on startup,
> I don't see how to achieve that.

Ok, so, this is a bit of spit-balling, just to be clear, but we
currently track things like "where we know the heap files are
consistent" by storing a checkpoint LSN in the control file, and
then we have a backup_label file to say where we need to get to in order
to be consistent from a backup.  Perhaps there's a way to use those to
cross-validate while we are updating a data directory to be consistent?
Maybe we update those files as we go, and add a cross-check flag between
them, so that we know from two places that we're restoring from a backup
(incremental or full), and then also know where we need to start from
and where we need to get to, in order to be consistent.

Of course, users can still get past this by hacking these files around
and maybe we can provide a tool along the lines of pg_resetwal which
lets them force the files to agree, but then we can at least throw big
glaring warnings and tell users "this is really bad, type YES to
continue".

> > There's really two things here- the first is that I agree with the
> > concern about potentially destroying the existing backup if the
> > pg_basebackup doesn't complete, but there's some ways to address that
> > (such as filesystem snapshotting), so I'm not sure that the idea is
> > quite that bad, but it would need to be more than just what
> > pg_basebackup does in this case in order to be trustworthy (at least,
> > for most).
>
> Well, I did mention in my original email that there could be a
> combine-backups-destructively option.  I guess this is just taking
> that to the next level: merge a backup being taken into an existing
> backup on-the-fly.  Given your remarks above, it is worth noting that
> this GREATLY increases the chances of people accidentally causing
> corruption in ways that are almost undetectable.  All they have to do
> is kill -9 the backup tool half way through and then start postgres on
> the resulting directory.

Right, we need to come up with a way to detect if that happens and
complain loudly, and not continue to move forward unless and until the
user explicitly insists that it's the right thing to do.

> > The other part here is the idea of endless incrementals where the blocks
> > which don't appear to have changed are never re-validated against what's
> > in the backup.  Unfortunately, latent corruption happens and you really
> > want to have a way to check for that.  In past discussions that I've had
> > with David, there's been some idea to check some percentage of the
> > blocks that didn't appear to change for each backup against what's in
> > the backup.
>
> Sure, I'm not trying to block anybody from developing something like
> that, and I acknowledge that there is risk in a system like this,
> but...
>
> > I share this just to point out that there's some risk to that approach,
> > not to say that we shouldn't do it or that we should discourage the
> > development of such a feature.
>
> ...it seems we are viewing this, at least, from the same perspective.

Great, but I feel like the question here is whether we're comfortable
putting out this capability *without* some mechanism to verify that the
existing blocks are clean/not corrupted/changed, or whether this risk is
enough that we want to include a check of the existing blocks, in some
fashion, as part of the incremental backup feature.

Personally, and in discussion with David, we've generally felt like we
don't want this feature until we have a way to verify the blocks that
aren't being backed up every time and that we are assuming are
clean/correct (at least some portion of them on each backup, with a way
to make sure we eventually check them all), because we are concerned
that users will get bit by latent corruption and then be quite unhappy
with us for not picking up on it.
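As a very rough sketch of the sort of check I mean (the helper functions
and sampling scheme here are hypothetical), each incremental backup
would re-read some fraction of the blocks it would otherwise skip and
compare them against the copies already stored, so that over a series of
backups every block eventually gets re-verified:

/*
 * Sketch: re-verify a sample of "unchanged" blocks against the prior backup.
 * read_cluster_block() and read_backup_block() are hypothetical helpers that
 * return one BLCKSZ-sized block from the live cluster and from the backup.
 */
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define BLCKSZ 8192

extern bool read_cluster_block(const char *relpath, unsigned blkno, char *buf);
extern bool read_backup_block(const char *relpath, unsigned blkno, char *buf);

/* Returns false if a sampled block differs from what the backup contains. */
static bool
verify_unchanged_sample(const char *relpath, unsigned nblocks,
                        double sample_fraction)
{
    char        live[BLCKSZ];
    char        stored[BLCKSZ];

    for (unsigned blkno = 0; blkno < nblocks; blkno++)
    {
        /* Skip most blocks; over many backups we eventually cover them all. */
        if (drand48() >= sample_fraction)
            continue;

        if (!read_cluster_block(relpath, blkno, live) ||
            !read_backup_block(relpath, blkno, stored))
            return false;       /* treat a read failure as a failure to verify */

        if (memcmp(live, stored, BLCKSZ) != 0)
            return false;       /* latent corruption, or an unexpected change */
    }

    return true;
}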

> > Wow.  I have to admit that I feel completely opposite of that- I'd
> > *love* to have an independent tool (which ideally uses the same code
> > through the common library, or similar) that can be run to apply WAL.
> >
> > In other words, I don't agree that it's the server's problem at all to
> > solve that, or, at least, I don't believe that it needs to be.
>
> I mean, I guess I'd love to have that if I could get it by waving a
> magic wand, but I wouldn't love it if I had to write the code or
> maintain it.  The routines for applying WAL currently all assume that
> you have a whole bunch of server infrastructure present; that code
> wouldn't run in a frontend environment, I think.  I wouldn't want to
> have a second copy of every WAL apply routine that might have its own
> set of bugs.

I agree that we don't want to have multiple implementations or copies of
the WAL apply routines.  On the other hand, while I agree that there's
some server infrastructure they depend on today, I feel like a lot of
that infrastructure is things that we'd actually like to have in at
least some of the client tools (and likely pg_basebackup specifically).
I understand that it's not trivial to implement, of course, or to pull
out into a common library.  We are already seeing some efforts to
consolidate common routines in the client libraries (Peter E's recent
work around the error messaging being a good example) and I feel like
that's something we should encourage and expect to see happening more in
the future as we add more sophisticated client utilities.

> > I've tried to outline how the incremental backup capability and backup
> > management are really very closely related and having those be
> > implemented by independent tools is not a good interface for our users
> > to have to live with.
>
> I disagree.  I think the "existing backup tools don't use
> pg_basebackup" argument isn't very compelling, because the reason
> those tools don't use pg_basebackup is because it can't do what they
> need.  If it did, they'd probably use it.  People don't write a whole
> separate engine for running backups just because it's fun to not reuse
> code -- they do it because there's no other way to get what they want.

I understand that you disagree, but I don't clearly understand the
justification for why.  As I understand it, you disagree that an
incremental backup capability and backup management are closely related
because the existing tools don't leverage pg_basebackup (or the backup
protocol), but aren't those two pretty distinct things?  Perhaps it's my
fault for conflating these topics in the emails I've sent while replying
to various parts of a discussion which has traveled across a number of
subjects, some related and some not.  I see incremental backups and
backup management as related because, in part, of expiration- if you
expire out a 'full' backup then you must expire out any incremental or
differential backups based on it.  Just generally, that association of
which incremental depends on which full (or prior differential, or prior
incremental) is extremely important and necessary to avoid corrupt
systems (consider that you might apply an incremental to a full backup
when the incremental was actually based on another incremental, and not
on the full, or variations of that...).

In short, I don't think I could confidently trust any incremental backup
that's taken without having a clear link to the backup it's based on,
and having it be expired when the backup it depends on is expired.
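To sketch the kind of association I mean (the metadata structure here is
hypothetical; it just captures the rule that an incremental is only
usable against the backup it was actually taken from), a restore or
expiration tool would walk the chain and refuse to proceed when the LSNs
don't line up:

/*
 * Sketch: validate an incremental backup chain before combining or expiring.
 * BackupInfo is a hypothetical stand-in for whatever per-backup metadata a
 * manifest would carry.  The rule is that each incremental's threshold LSN
 * must not be newer than the start LSN of the backup it claims to be based
 * on, or changes made in between could be silently lost.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t lsn_t;

typedef struct BackupInfo
{
    lsn_t       start_lsn;          /* LSN at which this backup started */
    lsn_t       threshold_lsn;      /* 0 for a full backup; otherwise only
                                     * blocks with LSN >= this were included */
    struct BackupInfo *based_on;    /* previous backup in the chain, or NULL */
} BackupInfo;

static bool
chain_is_valid(const BackupInfo *newest)
{
    for (const BackupInfo *b = newest; b->threshold_lsn != 0; b = b->based_on)
    {
        if (b->based_on == NULL)
            return false;       /* incremental with no known parent: unusable */
        if (b->threshold_lsn > b->based_on->start_lsn)
            return false;       /* gap between parent and child: unsafe */
    }

    return true;                /* reached a full backup with no gaps */
}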

> > Most of the external tools don't use pg_basebackup, nor the base backup
> > protocol (or, if they do, it's only as an option among others).  In my
> > opinion, that's pretty clear indication that pg_basebackup and the base
> > backup protocol aren't sufficient to cover any but the simplest of
> > use-cases (though those simple use-cases are handled rather well).
> > We're talking about adding on a capability that's much more complicated
> > and is one that a lot of tools have already taken a stab at, let's try
> > to do it in a way that those tools can leverage it and avoid having to
> > implement it themselves.
>
> I mean, again, if it were part of pg_basebackup and available via the
> replication protocol, they could do exactly that, through either
> method.  I don't get it.

No, they can't.  Today there exists *exactly* this situation:
pg_basebackup uses the base backup protocol for doing backups, and the
external tools don't use it.

Why?

Because it can't be used in a parallel manner, making it largely
uninteresting as a mechanism for doing backups of systems at any scale.

Yes, sure, they *could* technically use it, but from a *practical*
standpoint they don't because it *sucks*.  Let's not do that for
incremental backups.

> You seem to be arguing that we shouldn't add
> the necessary capabilities to the replication protocol or
> pg_basebackup, but at the same time arguing that pg_basebackup is
> inadequate because it's missing important capabilities.  This confuses
> me.

I'm sorry for not being clear.  I'm not arguing that we *shouldn't* add
such capabilities.  I *want* these capabilities to be added, but I want
them added in a way that's actually useful to the external tools and not
something that only works for pg_basebackup (which is currently
single-threaded).

I hope that's the kind of feedback you've been looking for on this
thread.

> > It's an interesting idea to add in everything to pg_basebackup that
> > users doing backups would like to see, but that's quite a list:
> >
> > - full backups
> > - differential backups
> > - incremental backups / block-level backups
> > - (server-side) compression
> > - (server-side) encryption
> > - page-level checksum validation
> > - calculating checksums (on the whole file)
> > - External object storage (S3, et al)
> > - more things...
> >
> > I'm really not convinced that I agree with the division of labor as
> > you've outlined it, where all of the above is done by pg_basebackup,
> > where just archiving and backup retention are handled by some external
> > tool (except that we already have pg_receivewal, so archiving isn't
> > really an externally handled thing either, unless you want features like
> > parallel archive-push or parallel archive-get...).
>
> Yeah, if it were up to me, I'd choose put most of that in the server
> and make it available via the replication protocol, and then give
> pg_basebackup able to use that functionality.

I'm all about that.  I don't know that the client-side tool would still
be called 'pg_basebackup' at that point, but I definitely want to get to
a point where we have all of these capabilities available in core.

> And external tools
> could use that functionality via pg_basebackup or by using the
> replication protocol directly.  I actually don't really understand
> what the alternative is.  If you want server-side compression, for
> example, that really has to be done on the server.  And how would the
> server expose that, except through the replication protocol?  Sure, we
> could design a new protocol for it. Call it... say... the
> shmeplication protocol.  And then you could use the replication
> protocol for what it does today and the shmeplication protocol for all
> the cool bits.  But why would that be better?

The replication protocol (or base backup protocol, really..) is what we
make it, in the end.  Of course server-side compression needs to be done
on the server and we need a way to tell the server "please compress this
for us before sending it".  I'm not suggesting there's some alternative
to that.  What I'm suggesting is that when we go to implement the
incremental backup protocol, we have a way for it to be
parallelized (at least...  maybe other things too), because that's what
the external tools would really like.

Even pg_dump works this way: it connects, builds a list of things to
process, and then farms that list out to the parallel worker processes,
so we already have an example of how this is done in core today.
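A minimal sketch of that same division of labor, with the actual
fetching left as hypothetical stubs, looks something like this:

/*
 * Sketch of the pg_dump-style model: one process builds the work list, then
 * N workers each take a share.  build_file_list() and fetch_file() are
 * hypothetical placeholders for "ask the server what exists" and "open a
 * connection and ask the server for this file".
 */
#include <sys/wait.h>
#include <unistd.h>

extern int  build_file_list(char ***files);     /* hypothetical: list step */
extern void fetch_file(const char *path);       /* hypothetical: per-file step */

int
main(void)
{
    char      **files;
    int         nfiles = build_file_list(&files);
    int         nworkers = 4;

    for (int w = 0; w < nworkers; w++)
    {
        if (fork() == 0)
        {
            /* Each worker handles every nworkers-th file from the list. */
            for (int i = w; i < nfiles; i += nworkers)
                fetch_file(files[i]);
            _exit(0);
        }
    }

    while (wait(NULL) > 0)      /* wait for all of the workers to finish */
        ;

    return 0;
}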

> > What would really help me, at least, understand the idea here would be
> > to understand exactly what the existing tools do that the subset of
> > users you're thinking about doesn't like/want, but which pg_basebackup,
> > today, does.  Is the issue that there's a repository instead of just a
> > plain PG directory or set of tar files, like what pg_basebackup produces
> > today?  But how would we do things like have compression, or encryption,
> > or block-based incremental backups without some kind of repository or
> > directory that doesn't actually look exactly like a PG data directory?
>
> I guess we're still wallowing in the same confusion here.
> pg_basebackup, for me, is just a convenient place to stick this
> functionality.  If the server has the ability to construct and send an
> incremental backup by some means, then it needs a client on the other
> end to receive and store that backup, and since pg_basebackup already
> knows how to do that for full backups, extending it to incremental
> backups (and/or parallel, encrypted, compressed, and validated
> backups) seems very natural to me.  Otherwise I add server-side
> functionality to allow $X and then have to  write an entirely new
> client to interact with that instead of just using the client I've
> already got.  That's more work, and I'm lazy.

I'm not suggesting that we don't add this functionality to
pg_basebackup, I'm just saying that we should be thinking about how the
external tools will want to leverage this new capability because it's
materially different from the basic minimum that pg_basebackup requires.
Yes, it'd be a bit more work and a somewhat more complicated protocol
than the simple approach needed by pg_basebackup, but that's what those
other tools will want.  If we don't care about them, ok, I get that, but
I thought the idea here was to build something that's useful to both the
external tools and pg_basebackup.  We won't get that if we focus on just
implementing a protocol for pg_basebackup to use.

> Now it's true that if we wanted to build something like the rsync
> protocol into PostgreSQL, jamming that into pg_basebackup might well
> be a bridge too far.  That would involve taking backups via a method
> so different from what we're currently doing that it would probably
> make sense to at least consider creating a whole new tool for that
> purpose.  But that wasn't my proposal...

The idea around the rsync binary-diff protocol was *specifically* for
things that we can't do through block-level updates with WAL scanning,
just to be clear.  I wasn't thinking that would be good for the relation
files since we have more information for those in the LSN, et al.

Thanks!

Stephen


Re: block-level incremental backup

From
Stephen Frost
Date:
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> > > Wow.  I have to admit that I feel completely opposite of that- I'd
> > > *love* to have an independent tool (which ideally uses the same code
> > > through the common library, or similar) that can be run to apply WAL.
> > >
> > > In other words, I don't agree that it's the server's problem at all to
> > > solve that, or, at least, I don't believe that it needs to be.
> >
> > I mean, I guess I'd love to have that if I could get it by waving a
> > magic wand, but I wouldn't love it if I had to write the code or
> > maintain it.  The routines for applying WAL currently all assume that
> > you have a whole bunch of server infrastructure present; that code
> > wouldn't run in a frontend environment, I think.  I wouldn't want to
> > have a second copy of every WAL apply routine that might have its own
> > set of bugs.
>
> I'll fight tooth and nail not to have a second implementation of replay,
> even if it's just portions.  The code we have is complicated and fragile
> enough, having a [partial] second version would be way worse.  There's
> already plenty improvements we need to make to speed up replay, and a
> lot of them require multiple execution threads (be it processes or OS
> threads), something not easily feasible in a standalone tool. And
> without the already existing concurrent work during replay (primarily
> checkpointer doing a lot of the necessary IO), it'd also be pretty
> unattractive to use any separate tool.

I agree that we don't want another implementation and that there's a lot
that we want to do to improve replay performance.  We've already got
frontend tools which work with multiple execution threads, so I'm not
sure I get the "not easily feasible" bit, and the argument about the
checkpointer seems largely related to that (as in- if we didn't have
multiple threads/processes then things would perform quite badly...  but
we can and do have multiple threads/processes in frontend tools today,
even in pg_basebackup).

You certainly bring up some good concerns though and they make me think
of other bits that would seem like they'd possibly be larger issues for
a frontend tool- like having a large pool of memory for caching the
changes (aka shared buffers).  If what we're talking about here is *just*
replay though, without having the system available for reads, I wonder
if we might want a different solution there.

> Unless you just define the server binary as that "independent tool".

That's certainly an interesting idea.

> Which I think is entirely reasonable. With the 'consistent' and LSN
> recovery targets one already can get most of what's needed from such a
> tool, anyway.  I'd argue the biggest issue there is that there's no
> equivalent to starting postgres with a private socket directory on
> windows, and perhaps an option or two making it easier to start postgres
> in a "private" mode for things like this.

This would mean building in a way to do parallel WAL replay into the
server binary though, as discussed above, and it seems like making that
work in a way that allows us to still be available as a read-only
standby would be quite a bit more difficult.  We could possibly support
parallel WAL replay only when we aren't a replica but from the same
binary.  The concerns mentioned about making it easier to start PG in a
private mode don't seem too bad but I am not entirely sure that the
tools which want to leverage that kind of capability would want to have
to exec out to the PG binary to use it.

A lot of this part of the discussion feels like a tangent though, unless
I'm missing something.  The "WAL compression" tool contemplated
previously would be much simpler and not the full-blown WAL replay
capability, which would be left to the server, unless you're suggesting
that even that should be exclusively the purview of the backend?  Though
that ship's already sailed, given that external projects have
implemented it.  Having a library to provide that which external
projects could leverage would be nicer than having everyone write their
own version.

Thanks!

Stephen


Re: block-level incremental backup

From
Robert Haas
Date:
On Thu, Apr 18, 2019 at 6:39 PM Stephen Frost <sfrost@snowman.net> wrote:
> Where is the client going to get the threshold LSN from?
>
> If it doesn't have access to the old backup, then I'm a bit confused as
> to how a incremental backup would be possible?  Isn't that a requirement
> here?

I explained this in the very first email that I wrote on this thread,
and then wrote a very extensive further reply on this exact topic to
Peter Eisentraut.  It's a bit disheartening to see you arguing against
my ideas when it's not clear that you've actually read and understood
them.

> > The obvious way of extending this system to parallel backup is to have
> > N connections each streaming a separate tarfile such that when you
> > combine them all you recreate the original data directory.  That would
> > be perfectly compatible with what I'm proposing for incremental
> > backup.  Maybe you have another idea in mind, but I don't know what it
> > is exactly.
>
> So, while that's an obvious approach, it isn't the most sensible- and
> we know that from experience in actually implementing parallel backup of
> PG files.  I'm happy to discuss the approach we use in pgBackRest if
> you'd like to discuss this further, but it seems a bit far afield from
> the topic of discussion here and it seems like you're not interested or
> offering to work on supporting parallel backup in core.

If there's some way of modifying my proposal so that it makes life
better for external backup tools, I'm certainly willing to consider
that, but you're going to have to tell me what you have in mind.  If
that means describing what pgbackrest does, then do it.

My concern here is that you seem to want a lot of complicated stuff
that will require *significant* setup in order for people to be able
to use it.  From what I am able to gather from your remarks so far,
you think people should archive their WAL to a separate machine, and
then the WAL-summarizer should run there, and then data from that
should be fed back to the backup client, which should then give the
server a list of modified files (and presumably, someday, blocks) and
the server then returns that data, which the client then
cross-verifies with checksums and awesome sauce.

Which is all fine, but actually requires quite a bit of set-up and
quite a bit of buy-in to the tool.  And I have no problem with people
having that level of buy-in to the tool.  EnterpriseDB offers a number
of tools which require similar levels of setup and configuration, and
it's not inappropriate for an enterprise-grade backup tool to have all
that stuff.  However, for those who may not want to do all that, my
original proposal lets you take an incremental backup by doing the
following list of steps:

1. Take an incremental backup.

If you'd like, you can also:

0. Enable the WAL-scanning background worker to make incremental
backups much faster.

You do not need a WAL archive, and you do not need EITHER the backup
tool or the server to have access to previous backups, and you do not
need the client to have any access to archived WAL or the summary
files produced from it.  The only thing you need is the
start-of-backup LSN for the previous backup.

I expect you to reply with a long complaint about how my proposal is
totally inadequate, but actually I think for most people, most of the
time, it would not only be adequate, but extremely convenient.  And
despite your protestations to the contrary, it does not block
parallelism, checksum verification, or any other cool features that
somebody may want to add later.  It'll work just fine with those
things.

And for the record, I am willing to put some effort into parallelism.
I just think that it makes more sense to do the incremental part
first.  I think that incremental backup is likely to have less effect
on parallel backup than the other way around.  What I'm NOT willing to
do is build a whole bunch of infrastructure that will help pgbackrest
do amazing things but will not provide a simple and convenient way of
taking incremental backups using only core tools.  I do care about
having something that's good for pgbackrest and other out-of-core
tools.  I just care about it MUCH LESS than I care about making
PostgreSQL core awesome.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> What I'm NOT willing to
> do is build a whole bunch of infrastructure that will help pgbackrest
> do amazing things but will not provide a simple and convenient way of
> taking incremental backups using only core tools.  I do care about
> having something that's good for pgbackrest and other out-of-core
> tools.  I just care about it MUCH LESS than I care about making
> PostgreSQL core awesome.

Then I misunderstood your original proposal where you talked about
providing something that the various external tools could use.  If you'd
like to *just* provide a mechanism for pg_basebackup to be able to do a
trivial incremental backup, great, but it's not going to be useful or
used by the external tools, just like the existing base backup protocol
isn't used by the external tools because it can't be used in a parallel
fashion.

As such, and with all the other missing bits from pg_basebackup, it
looks likely to me that such a feature is going to be lackluster, at
best, and end up being only marginally interesting, when it could have
been much more and leveraged by all of the existing tools.  I agree that
making a parallel-supporting protocol work is harder but I actually
don't think it would be *that* much more difficult to do.

That's frankly discouraging, but I'm not going to tell you where to
spend your time.

Making PG core awesome when it comes to backup is going to involve so
much more than just marginal improvements to pg_basebackup, but it's
also something that I'm very much supportive of and have invested a
great deal in, by spending time and resources working to build a tool
that gets closer to what an in-core solution would look like than
anything that exists today.

Thanks,

Stephen


Re: block-level incremental backup

From
Andrey Borodin
Date:
Hi!

Sorry for the delay.

> 18 апр. 2019 г., в 21:56, Robert Haas <robertmhaas@gmail.com> написал(а):
>
> On Wed, Apr 17, 2019 at 5:20 PM Stephen Frost <sfrost@snowman.net> wrote:
>> As I understand it, the problem is not with backing up an individual
>> database or cluster, but rather dealing with backing up thousands of
>> individual clusters with thousands of tables in each, leading to an
>> awful lot of tables with lots of FSMs/VMs, all of which end up having to
>> get copied and stored wholesale.  I'll point this thread out to him and
>> hopefully he'll have a chance to share more specific information.
>
> Sounds good.

During the introduction of WAL-delta backups, we faced two things:
1. A heavy spike in network load. We shift the beginning of each backup randomly, but the variation is not very big:
the night is short and we want to make big backups during low-rps time. This low variation in the start times of the
small backups creates a big network spike.
2. Incremental backups became very cheap if measured in resources used on a single cluster.

The 1st is not a big problem, actually, but we realized that we can do incremental backups not just at night, but, for
example, 4 times a day. Or every hour. Or every minute. Why not, if they are cheap enough?

An incremental backup of a 1Tb DB made with a distance of a few minutes (a small change set) is a few Gbs. All of this
size is made of FSM (no LSN) and VM (hard to use LSN).
Sure, this overhead size is fine if we make a daily backup. But at some frequency of backups it will be too much.

I think that the problem of incrementing FSM and VM is too distant now.
But if I had to implement it right now I'd choose the following way: do not back up FSM and VM, recreate them during
restore. Looks like it is possible, but too much AM-specific.
It is hard when you write a backup tool in Go and cannot simply link with PG.

> 15 апр. 2019 г., в 18:01, Stephen Frost <sfrost@snowman.net> написал(а):
> ...the goal here
> isn't actually to make pg_basebackup into an enterprise backup tool,
> ...

BTW, I'm all for extensibility and "hackability". But, personally, I'd be happy if pg_basebackup were ubiquitous and
sufficient, and tools like WAL-G and others became part of history. There is no fundamental reason why an external
backup tool can be better than a backup tool in core. (Unlike many PLs, data types, hooks, tuners, etc.)


Here's 53 mentions of "parallel backup". I want to note that there may be parallel reads from disk and parallel network
transmission. The things between these two are negligible and can be single-threaded. From my POV, it's not about
threads, it's about saturated IO controllers.
Also I think parallel restore matters more than parallel backup. Backups themselves can be slow; on many clusters we
even throttle disk IO. But users may want parallel backup to catch up a standby.

Thanks.

Best regards, Andrey Borodin.


Re: block-level incremental backup

From
Robert Haas
Date:
On Sat, Apr 20, 2019 at 12:19 AM Stephen Frost <sfrost@snowman.net> wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
> > What I'm NOT willing to
> > do is build a whole bunch of infrastructure that will help pgbackrest
> > do amazing things but will not provide a simple and convenient way of
> > taking incremental backups using only core tools.  I do care about
> > having something that's good for pgbackrest and other out-of-core
> > tools.  I just care about it MUCH LESS than I care about making
> > PostgreSQL core awesome.
>
> Then I misunderstood your original proposal where you talked about
> providing something that the various external tools could use.  If you'd
> like to *just* provide a mechanism for pg_basebackup to be able to do a
> trivial incremental backup, great, but it's not going to be useful or
> used by the external tools, just like the existing base backup protocol
> isn't used by the external tools because it can't be used in a parallel
> fashion.

Well, what I meant - and perhaps I wasn't clear enough about this - is
that it could be used by an external solution for *managing* backups,
not so much an external engine for *taking* backups.  But actually, I
really don't see any reason why the latter wouldn't also be possible.
It was already suggested upthread by Anastasia that there should be a
way to ask the server to give only the identity of the modified blocks
without the contents of those blocks; if we provide that, then a tool
can get those and do whatever it likes with them, including fetching
them in parallel by some other means.  Another obvious extension would
be to add a command that says 'give me this file' or 'give me this
file but only this list of blocks' which would give clients lots of
options: they could provide their own lists of blocks to fetch
computed by whatever internal magic they have, or they could request
the server's modified-block map information first and then schedule
fetching those blocks in parallel using this new command.  So it seems
like with some pretty straightforward extensions this can be made
usable by and valuable to people wanting to build external backup
engines, too.  I do not necessarily feel obliged to implement every
feature that might help with that kind of thing just because I've
expressed an interest in this general area, but I might do some of
them, and maybe people like you or Anastasia who want to make these
facilities available to external tools can help with some of the work,
too.
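To sketch what I mean - the command names and the syntax here are
invented purely for illustration and are not part of any existing
protocol - a tool could drive something like this over an ordinary libpq
replication connection, one connection per worker if it wants
parallelism:

/*
 * Sketch only: the SEND_FILE_CONTENTS command shown here does not exist; it
 * illustrates how a client might consume a per-file, per-block-list fetch
 * command over an ordinary replication connection.
 */
#include <stdio.h>
#include <libpq-fe.h>

static void
fetch_blocks(const char *conninfo, const char *path, const char *blocklist)
{
    PGconn     *conn = PQconnectdb(conninfo);   /* e.g. "... replication=database" */
    char        command[1024];
    char       *buf;
    int         len;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return;
    }

    /* Hypothetical command: ask for just these blocks of just this file. */
    snprintf(command, sizeof(command),
             "SEND_FILE_CONTENTS '%s' BLOCKS %s", path, blocklist);

    PGresult   *res = PQexec(conn, command);
    if (PQresultStatus(res) == PGRES_COPY_OUT)
    {
        /* The data would come back as a COPY stream, one chunk at a time. */
        while ((len = PQgetCopyData(conn, &buf, 0)) > 0)
        {
            /* write buf[0..len) into the local backup here */
            PQfreemem(buf);
        }
    }

    PQclear(res);
    PQfinish(conn);
}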

That being said, as long as there is significant demand for
value-added backup features over and above what is in core, there are
probably going to be non-core backup tools that do things their own
way instead of just leaning on whatever the server provides natively.
In a certain sense that's regrettable, because it means that somebody
- or perhaps multiple somebodys - goes to the trouble of doing
something outside core and then somebody else puts something in core
that obsoletes it and therein lies duplication of effort.  On the
other hand, it also allows people to innovate way faster than can be
done in core, it allows competition among different possible designs,
and it's just kinda the way we roll around here.  I can't get very
worked up about it.

One thing I'm definitely not going to do here is abandon my goal of
producing a *simple* incremental backup solution that can be deployed
*easily* by users. I understand from your remarks that such a solution
will not suit everybody.  However, unlike you, I do not believe that
pg_basebackup was a failure.  I certainly agree that it has some
limitations that mean that it is hard to use in large deployments, but
it's also *extremely* convenient for people with a fairly small
database when they just need a quick and easy backup.  Adding some
more features to it - such as incremental backup - will make it useful
to more people in more cases.  There will doubtless still be people
who need more, and that's OK: those people can use a third-party tool.
I will not get anywhere trying to solve every problem at once.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Robert Haas
Date:
On Sat, Apr 20, 2019 at 12:44 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Incremental backup of 1Tb DB made with distance of few minutes (small change set) is few Gbs. All of this size is
made of FSM (no LSN) and VM (hard to use LSN).
> Sure, this overhead size is fine if we make daily backup. But at some frequency of backups it will be too much.

It seems like if the backups are only a few minutes apart, PITR might
be a better choice than super-frequent incremental backups.  What do
you think about that?

> I think that problem of incrementing FSM and VM is too distant now.
> But if I had to implement it right now I'd choose following way: do not backup FSM and VM, recreate it during
restore. Looks like it is possible, but too much AM-specific.

Interesting idea - that's worth some more thought.

> BTW, I'm all hands for extensibility and "hackability". But, personally, I'd be happy if pg_basebackup would be
ubiquitous and sufficient. And tools like WAL-G and others became part of a history. There is not fundamental reason why
an external backup tool can be better than backup tool in core. (Unlike many PLs, data types, hooks, tuners etc)

+1

> Here's 53 mentions of "parallel backup". I want to note that there may be parallel read from disk and parallel
network transmission. Things between these two are neglectable and can be single-threaded. From my POV, it's not about
threads, it's about saturated IO controllers. 
> Also I think parallel restore matters more than parallel backup. Backups themself can be slow, on many clusters we
even throttle disk IO. But users may want parallel backup to catch-up standby.

I'm not sure I entirely understand your point here -- are you saying
that parallel backup is important, or that it's not important, or
something in between?  Do you think it's more or less important than
incremental backup?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Sat, Apr 20, 2019 at 12:19 AM Stephen Frost <sfrost@snowman.net> wrote:
> > * Robert Haas (robertmhaas@gmail.com) wrote:
> > > What I'm NOT willing to
> > > do is build a whole bunch of infrastructure that will help pgbackrest
> > > do amazing things but will not provide a simple and convenient way of
> > > taking incremental backups using only core tools.  I do care about
> > > having something that's good for pgbackrest and other out-of-core
> > > tools.  I just care about it MUCH LESS than I care about making
> > > PostgreSQL core awesome.
> >
> > Then I misunderstood your original proposal where you talked about
> > providing something that the various external tools could use.  If you'd
> > like to *just* provide a mechanism for pg_basebackup to be able to do a
> > trivial incremental backup, great, but it's not going to be useful or
> > used by the external tools, just like the existing base backup protocol
> > isn't used by the external tools because it can't be used in a parallel
> > fashion.
>
> Well, what I meant - and perhaps I wasn't clear enough about this - is
> that it could be used by an external solution for *managing* backups,
> not so much an external engine for *taking* backups.  But actually, I
> really don't see any reason why the latter wouldn't also be possible.
> It was already suggested upthread by Anastasia that there should be a
> way to ask the server to give only the identity of the modified blocks
> without the contents of those blocks; if we provide that, then a tool
> can get those and do whatever it likes with them, including fetching
> them in parallel by some other means.  Another obvious extension would
> be to add a command that says 'give me this file' or 'give me this
> file but only this list of blocks' which would give clients lots of
> options: they could provide their own lists of blocks to fetch
> computed by whatever internal magic they have, or they could request
> the server's modified-block map information first and then schedule
> fetching those blocks in parallel using this new command.  So it seems
> like with some pretty straightforward extensions this can be made
> usable by and valuable to people wanting to build external backup
> engines, too.  I do not necessarily feel obliged to implement every
> feature that might help with that kind of thing just because I've
> expressed an interest in this general area, but I might do some of
> them, and maybe people like you or Anastasia who want to make these
> facilities available to external tools can help with some of the work,
> too.

Yes, if we spend a bit of time thinking about how this could be
implemented in a way that could be used by multiple connections
concurrently, then we could provide something that both pg_basebackup
and the external tools could use.  Getting a list first and then
supporting a 'give me this file' API, or 'give me these blocks from this
file', would be very similar to what many of the external tools do
today.  I don't think it'd be hard to do.  I'm suggesting that we do
that instead of doing, at a protocol level, something similar to what
was done for pg_basebackup, which prevents parallelism.

I don't really agree that implementing "give me a list of files" and
"give me this file" is really somehow an 'extension' to the tar-based
approach that pg_basebackup uses today, it's really a rather different
thing, and I mention that as a parallel (hah!) to what we're discussing
here regarding the incremental backup approach.

Having been around for a while working on backup-related things, if I
was to implement the protocol for pg_basebackup today, I'd definitely
implement "give me a list" and "give me this file" rather than the
tar-based approach, because I've learned that people want to be
able to do parallel backups and that's a decent way to do that.  I
wouldn't set out and implement something new that there's just no hope
of making parallel.  Maybe the first write of pg_basebackup would still
be simple and serial since it's certainly more work to make a frontend
tool like that work in parallel, but at least the protocol would be
ready to support a parallel option being added later without being
rewritten.

And that's really what I was trying to get at here- if we've got the
choice now to decide what this is going to look like from a protocol
level, it'd be great if we could make it able to support being used in a
parallel fashion, even if pg_basebackup is still single-threaded.

> That being said, as long as there is significant demand for
> value-added backup features over and above what is in core, there are
> probably going to be non-core backup tools that do things their own
> way instead of just leaning on whatever the server provides natively.
> In a certain sense that's regrettable, because it means that somebody
> - or perhaps multiple somebodys - goes to the trouble of doing
> something outside core and then somebody else puts something in core
> that obsoletes it and therein lies duplication of effort.  On the
> other hand, it also allows people to innovate way faster than can be
> done in core, it allows competition among different possible designs,
> and it's just kinda the way we roll around here.  I can't get very
> worked up about it.

Yes, that's largely the tact we've taken with it- build something
outside of core, where we can move a lot faster with the implementation
and innovate quickly, until we get to a stable system that's as portable
and in a compatible language to what's in core today.  I don't have any
problem with new things going into core, in fact, I'm all for it, but if
someone asks me "I'd like to do this thing in core and I'd like it to be
useful for external tools" then I'll do my best to share my experiences
with what's been done in core vs. what's been done in this space outside
of core and what some lessons learned from that have been and ways that
we could at least try to make it so that external tools will be able to
use whatever is implemented in core.

> One thing I'm definitely not going to do here is abandon my goal of
> producing a *simple* incremental backup solution that can be deployed
> *easily* by users. I understand from your remarks that such a solution
> will not suit everybody.  However, unlike you, I do not believe that
> pg_basebackup was a failure.  I certainly agree that it has some
> limitations that mean that it is hard to use in large deployments, but
> it's also *extremely* convenient for people with a fairly small
> database when they just need a quick and easy backup.  Adding some
> more features to it - such as incremental backup - will make it useful
> to more people in more cases.  There will doubtless still be people
> who need more, and that's OK: those people can use a third-party tool.
> I will not get anywhere trying to solve every problem at once.

I don't get this at all.  What I've really been focused on has been the
protocol-level questions of what this is going to look like, because
that's what I see the external tools potentially using.  pg_basebackup
itself could remain single-threaded and could provide exactly the same
interface, no matter if the protocol is "give me all the blocks across
the entire cluster as a single compressed stream" or the protocol is
"give me a list of files that changed" and "give me a list of these
blocks in this file" or even "give me all the blocks that changed in
this file".

I also don't think pg_basebackup is a failure, and I didn't mean to
imply that, and I'm sorry for some of the hyperbole which led to that
impression coming across.  pg_basebackup is great, for what it is, and I
regularly recommend it in certain use-cases as being a simple tool that
does one thing and does it pretty well, for smaller clusters.  The
protocol it uses is unfortunately only useful in a single-threaded
manner though and it'd be great if we could avoid implementing similar
things in the protocol in the future.

Thanks,

Stephen


Re: block-level incremental backup

From
Andrey Borodin
Date:

> 21 апр. 2019 г., в 1:13, Robert Haas <robertmhaas@gmail.com> написал(а):
>
> On Sat, Apr 20, 2019 at 12:44 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> Incremental backup of 1Tb DB made with distance of few minutes (small change set) is few Gbs. All of this size is
made of FSM (no LSN) and VM (hard to use LSN).
>> Sure, this overhead size is fine if we make daily backup. But at some frequency of backups it will be too much.
>
> It seems like if the backups are only a few minutes apart, PITR might
> be a better choice than super-frequent incremental backups.  What do
> you think about that?
PITR is painfully slow on heavily loaded clusters. I observed restorations where 5 seconds of WAL were restored in 4
seconds. The backup was only a few hours behind the primary node, but could catch up only at night.
And during this process only one of 56 CPU cores was used. And SSD RAID throughput was not 100% utilized.

Block-level delta backups can be restored very efficiently: if we restore from the newest step back to the oldest, we
write no more than the cluster size as of the last backup.
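A sketch of why that bound holds (the helper functions are hypothetical): restore the chain newest-first and never
overwrite a block that a newer backup already supplied, so the number of block writes cannot exceed the number of
blocks in the final image:

/*
 * Sketch: restore a chain of block-level deltas from newest to oldest.
 * backup_blocks(), block_already_restored(), mark_restored() and
 * write_block() are hypothetical helpers; the point is only that a block
 * already written from a newer backup is never overwritten by an older one.
 */
#include <stdbool.h>
#include <stddef.h>

typedef struct BlockRef { const char *relpath; unsigned blkno; } BlockRef;

extern size_t backup_blocks(int backup_id, BlockRef **blocks);
extern bool   block_already_restored(BlockRef ref);
extern void   mark_restored(BlockRef ref);
extern void   write_block(BlockRef ref, int backup_id);

static void
restore_chain(const int *backup_ids, int nbackups)      /* newest first */
{
    for (int i = 0; i < nbackups; i++)
    {
        BlockRef   *blocks;
        size_t      n = backup_blocks(backup_ids[i], &blocks);

        for (size_t j = 0; j < n; j++)
        {
            if (block_already_restored(blocks[j]))
                continue;               /* a newer backup already supplied it */
            write_block(blocks[j], backup_ids[i]);
            mark_restored(blocks[j]);
        }
    }
}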

>> I think that problem of incrementing FSM and VM is too distant now.
>> But if I had to implement it right now I'd choose following way: do not backup FSM and VM, recreate it during
restore. Looks like it is possible, but too much AM-specific.
>
> Interesting idea - that's worth some more thought.

Core routines to recreate VM and FSM would be cool :) But this needs to be done without extra IO, which is not an easy trick.

>> Here's 53 mentions of "parallel backup". I want to note that there may be parallel read from disk and parallel
network transmission. Things between these two are neglectable and can be single-threaded. From my POV, it's not about
threads, it's about saturated IO controllers. 
>> Also I think parallel restore matters more than parallel backup. Backups themself can be slow, on many clusters we
even throttle disk IO. But users may want parallel backup to catch-up standby.
>
> I'm not sure I entirely understand your point here -- are you saying
> that parallel backup is important, or that it's not important, or
> something in between?  Do you think it's more or less important than
> incremental backup?
I think that there is no such thing as parallel backup per se. Backup creation is a composite process of many subprocesses.

In my experience, parallel network transmission is cool and very important; it makes uploads 3 times faster. But my
experience is limited to cloud storage. Would this hold if the storage backend is a local FS? I have no idea.
Parallel reading from disk has the same effect. Compression and encryption can be single-threaded; I think they will
not be the bottleneck (unless one uses lzma's neighborhood of the Pareto frontier).

For me, I think the most important thing is incremental backups (with parallel merging of steps), and then parallel
backup. But there is a huge fraction of users who can benefit from parallel backup and do not need incremental backup
at all.


Best regards, Andrey Borodin.


Re: block-level incremental backup

From
Robert Haas
Date:
On Sat, Apr 20, 2019 at 4:32 PM Stephen Frost <sfrost@snowman.net> wrote:
> Having been around for a while working on backup-related things, if I
> was to implement the protocol for pg_basebackup today, I'd definitely
> implement "give me a list" and "give me this file" rather than the
> tar-based approach, because I've learned that people want to be
> able to do parallel backups and that's a decent way to do that.  I
> wouldn't set out and implement something new that there's just no hope
> of making parallel.  Maybe the first write of pg_basebackup would still
> be simple and serial since it's certainly more work to make a frontend
> tool like that work in parallel, but at least the protocol would be
> ready to support a parallel option being added later without being
> rewritten.
>
> And that's really what I was trying to get at here- if we've got the
> choice now to decide what this is going to look like from a protocol
> level, it'd be great if we could make it able to support being used in a
> parallel fashion, even if pg_basebackup is still single-threaded.

I think we're getting closer to a meeting of the minds here, but I
don't think it's intrinsically necessary to rewrite the whole method
of operation of pg_basebackup to implement incremental backup in a
sensible way.  One could instead just do a straightforward extension
to the existing BASE_BACKUP command to enable incremental backup.
Then, to enable parallel full backup and all sorts of out-of-core
hacking, one could expand the command language to allow tools to
access individual steps: START_BACKUP, SEND_FILE_LIST,
SEND_FILE_CONTENTS, STOP_BACKUP, or whatever.  The second thing makes
for an appealing project, but I do not think there is a technical
reason why it has to be done first.  Or for that matter why it has to
be done second.  As I keep saying, incremental backup and full backup
are separate projects and I believe it's completely reasonable for
whoever is doing the work to decide on the order in which they would
like to do the work.
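Just to make that division of steps concrete, a parallel client using a
command language like that might drive an exchange along the following
lines (the payload syntax is entirely hypothetical; only the command
names themselves are meant seriously):

    -- connection 1 (coordinator)
    START_BACKUP
    SEND_FILE_LIST                        -> returns the list of files to copy

    -- connections 2..N (workers), each given a share of that list
    SEND_FILE_CONTENTS 'base/16384/16385'
    SEND_FILE_CONTENTS 'base/16384/16402'
    ...

    -- connection 1, once all of the workers have finished
    STOP_BACKUP                           -> returns the end-of-backup position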

Having said that, I'm curious what people other than Stephen (and
other pgbackrest hackers) think about the relative value of parallel
backup vs. incremental backup.  Stephen appears quite convinced that
parallel backup is full of win and incremental backup is a bit of a
yawn by comparison, and while I certainly would not want to discount
the value of his experience in this area, it sometimes happens on this
mailing list that [ drum roll please ] not everybody agrees about
everything.  So, what do other people think?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Konstantin Knizhnik
Date:

On 22.04.2019 2:02, Robert Haas wrote:
> On Sat, Apr 20, 2019 at 4:32 PM Stephen Frost <sfrost@snowman.net> wrote:
>> Having been around for a while working on backup-related things, if I
>> was to implement the protocol for pg_basebackup today, I'd definitely
>> implement "give me a list" and "give me this file" rather than the
>> tar-based approach, because I've learned that people want to be
>> able to do parallel backups and that's a decent way to do that.  I
>> wouldn't set out and implement something new that there's just no hope
>> of making parallel.  Maybe the first write of pg_basebackup would still
>> be simple and serial since it's certainly more work to make a frontend
>> tool like that work in parallel, but at least the protocol would be
>> ready to support a parallel option being added later without being
>> rewritten.
>>
>> And that's really what I was trying to get at here- if we've got the
>> choice now to decide what this is going to look like from a protocol
>> level, it'd be great if we could make it able to support being used in a
>> parallel fashion, even if pg_basebackup is still single-threaded.
> I think we're getting closer to a meeting of the minds here, but I
> don't think it's intrinsically necessary to rewrite the whole method
> of operation of pg_basebackup to implement incremental backup in a
> sensible way.  One could instead just do a straightforward extension
> to the existing BASE_BACKUP command to enable incremental backup.
> Then, to enable parallel full backup and all sorts of out-of-core
> hacking, one could expand the command language to allow tools to
> access individual steps: START_BACKUP, SEND_FILE_LIST,
> SEND_FILE_CONTENTS, STOP_BACKUP, or whatever.  The second thing makes
> for an appealing project, but I do not think there is a technical
> reason why it has to be done first.  Or for that matter why it has to
> be done second.  As I keep saying, incremental backup and full backup
> are separate projects and I believe it's completely reasonable for
> whoever is doing the work to decide on the order in which they would
> like to do the work.
>
> Having said that, I'm curious what people other than Stephen (and
> other pgbackrest hackers) think about the relative value of parallel
> backup vs. incremental backup.  Stephen appears quite convinced that
> parallel backup is full of win and incremental backup is a bit of a
> yawn by comparison, and while I certainly would not want to discount
> the value of his experience in this area, it sometimes happens on this
> mailing list that [ drum roll please ] not everybody agrees about
> everything.  So, what do other people think?
>

Based on the experience of pg_probackup users I can say that there is
no 100% winner; depending on the use case, either parallel or
incremental backups are preferable.
- If the size of the database is not so large and the update rate is
high enough, then parallel backup within one data center is definitely
the more efficient solution.
- If the database is very large and data is rarely updated, or the
database is mostly append-only, then incremental backup is preferable.
- Some customers need to collect, at a central server, backups of
databases installed at many nodes with slow and unreliable connections
(think of a DBMS installed on locomotives).  Parallelism definitely can
not help here, unlike support for incremental backup.
- Parallel backup consumes system resources more aggressively,
interfering with the normal work of the application, so performing a
parallel backup may cause significant degradation of application speed.

pg_probackup supports both features, parallel and incremental backups,
and it is up to the user to decide how to use them most efficiently for
a particular configuration.



-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: block-level incremental backup

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Sat, Apr 20, 2019 at 4:32 PM Stephen Frost <sfrost@snowman.net> wrote:
> > Having been around for a while working on backup-related things, if I
> > was to implement the protocol for pg_basebackup today, I'd definitely
> > implement "give me a list" and "give me this file" rather than the
> > tar-based approach, because I've learned that people want to be
> > able to do parallel backups and that's a decent way to do that.  I
> > wouldn't set out and implement something new that there's just no hope
> > of making parallel.  Maybe the first write of pg_basebackup would still
> > be simple and serial since it's certainly more work to make a frontend
> > tool like that work in parallel, but at least the protocol would be
> > ready to support a parallel option being added later without being
> > rewritten.
> >
> > And that's really what I was trying to get at here- if we've got the
> > choice now to decide what this is going to look like from a protocol
> > level, it'd be great if we could make it able to support being used in a
> > parallel fashion, even if pg_basebackup is still single-threaded.
>
> I think we're getting closer to a meeting of the minds here, but I
> don't think it's intrinsically necessary to rewrite the whole method
> of operation of pg_basebackup to implement incremental backup in a
> sensible way.

It wasn't my intent to imply that the whole method of operation of
pg_basebackup would have to change for this.

> One could instead just do a straightforward extension
> to the existing BASE_BACKUP command to enable incremental backup.

Ok, how do you envision that?  As I mentioned up-thread, I am concerned
that we're talking too high-level here and it's making the discussion
more difficult than it would be if we were to put together specific
ideas and then discuss them.

One way I can imagine to extend BASE_BACKUP is by adding LSN as an
optional parameter and then having the database server scan the entire
cluster and send a tarball which contains essentially a 'diff' file of
some kind for each file where we can construct a diff based on the LSN,
and then the complete contents of the file for everything else that
needs to be in the backup.

So, sure, that would work, but it wouldn't be able to be parallelized
and I don't think it'd end up being very exciting for the external tools
because of that, but it would be fine for pg_basebackup.

On the other hand, if you added new commands for 'list of files changed
since this LSN' and 'give me this file' and 'give me this file with the
changes in it since this LSN', then pg_basebackup could work with that
pretty easily in a single-threaded model (maybe with two connections to
the backend, but still in a single process, or maybe just by slurping up
the file list and then asking for each one) and the external tools could
leverage those new capabilities too for their backups, both full backups
and incremental ones.  This also wouldn't have to change how
pg_basebackup does full backups today one bit, so what we're really
talking about here is the direction to take the new code that's being
written, not about rewriting existing code.  I agree that it'd be a bit
more work...  but hopefully not *that* much more, and it would mean we
could later add parallel backup to pg_basebackup more easily too, if we
wanted to.

> Then, to enable parallel full backup and all sorts of out-of-core
> hacking, one could expand the command language to allow tools to
> access individual steps: START_BACKUP, SEND_FILE_LIST,
> SEND_FILE_CONTENTS, STOP_BACKUP, or whatever.  The second thing makes
> for an appealing project, but I do not think there is a technical
> reason why it has to be done first.  Or for that matter why it has to
> be done second.  As I keep saying, incremental backup and full backup
> are separate projects and I believe it's completely reasonable for
> whoever is doing the work to decide on the order in which they would
> like to do the work.

I didn't mean to imply that one had to be done before the other from a
technical standpoint.  I agree that they don't depend on each other.

You're certainly welcome to do what you would like, I simply wanted to
share my experiences and try to help move this in a direction that would
involve less code rewrite in the future and to have a feature that would
be more appealing to the external tools.

> Having said that, I'm curious what people other than Stephen (and
> other pgbackrest hackers)

While David and I do talk, we haven't really discussed this proposal all
that much, so please don't assume that he shares my thoughts here.  I'd
also like to hear what others think, particularly those who have been
working in this area.

> think about the relative value of parallel
> backup vs. incremental backup.  Stephen appears quite convinced that
> parallel backup is full of win and incremental backup is a bit of a
> yawn by comparison, and while I certainly would not want to discount
> the value of his experience in this area, it sometimes happens on this
> mailing list that [ drum roll please ] not everybody agrees about
> everything.  So, what do other people think?

I'm afraid this is painting my position here with an extremely broad
brush and so I'd like to clarify a bit: I'm *all* for incremental
backups.  Incremental and differential backups were supported by
pgBackRest very early on and are used extensively.  Today's pgBackRest
does that at a file level, but I would very much like to get to a block
level shortly after we finish rewriting it into C and porting it to
Windows (and probably the other platforms PG runs on today), which isn't
very far off now.  I'd like to make sure that whatever core ends up with
as an incremental backup solution also matches very closely what we do
with pgBackRest too, but everything that's been discussed here seems
pretty reasonable when it comes to the bits around how the blocks are
detected and the files get stitched back together, so I don't expect
there to be too much of an issue there.

What I'm afraid will be lackluster is adding block-level incremental
backup support to pg_basebackup without any support for managing
backups or anything else.  I'm also concerned that it's going to mean
that people who want to use incremental backup with pg_basebackup are
going to have to write a lot of their own management code (probably in
shell scripts and such...) around that and if they get anything wrong
there then people are going to end up with bad backups that they can't
restore from, or they'll have corrupted clusters if they do manage to
get them restored.

It'd also be nice to have as much exposed through the common library as
possible when it comes to, well, everything being discussed, so that the
external tools could leverage that code and avoid having to write their
own.  This would probably apply more to the WAL-scanning discussion, but
figured I'd mention it here too.

If the protocol were implemented in a way that external tools could
leverage it in a parallel fashion, then I'd be more excited about the
overall body of work.  That said, thinking about it a bit more, I have
to admit that I'm not sure pgBackRest would end up using it in any
case, no matter how it's implemented, since it wouldn't support
compression or encryption, both of which we support doing in-stream
before the data leaves the server.  The external tools which don't
support those options likely would find the parallel option more
appealing, though.

Thanks,

Stephen


Re: block-level incremental backup

From
Andres Freund
Date:
Hi,

On 2019-04-19 20:04:41 -0400, Stephen Frost wrote:
> I agree that we don't want another implementation and that there's a lot
> that we want to do to improve replay performance.  We've already got
> frontend tools which work with multiple execution threads, so I'm not
> sure I get the "not easily feasible" bit, and the argument about the
> checkpointer seems largely related to that (as in- if we didn't have
> multiple threads/processes then things would perform quite badly...  but
> we can and do have multiple threads/processes in frontend tools today,
> even in pg_basebackup).

You need not just multiple execution threads, but basically a new
implementation of shared buffers, locking, process monitoring, with most
of the related infrastructure. You're literally talking about
reimplementing a very substantial portion of the backend.  I'm not sure
I can convey in written words - via a public medium - how bad an idea
it would be to go there.


> You certainly bring up some good concerns though and they make me think
> of other bits that would seem like they'd possibly be larger issues for
> a frontend tool- like having a large pool of memory for cacheing (aka
> shared buffers) the changes.  If what we're talking about here is *just*
> replay though, without having the system available for reads, I wonder
> if we might want a different solution there.

No.


> > Which I think is entirely reasonable. With the 'consistent' and LSN
> > recovery targets one already can get most of what's needed from such a
> > tool, anyway.  I'd argue the biggest issue there is that there's no
> > equivalent to starting postgres with a private socket directory on
> > windows, and perhaps an option or two making it easier to start postgres
> > in a "private" mode for things like this.
> 
> This would mean building in a way to do parallel WAL replay into the
> server binary though, as discussed above, and it seems like making that
> work in a way that allows us to still be available as a read-only
> standby would be quite a bit more difficult.  We could possibly support
> parallel WAL replay only when we aren't a replica but from the same
> binary.

I'm doubtful that we should try to implement parallel WAL apply that
can't support HS - a substantial portion of the logic to avoid
issues around relfilenode reuse, consistency, etc. is going to be
necessary for non-HS aware apply anyway.  But if somebody had a concrete
proposal for something that's fundamentally only doable without HS, I
could be convinced.


> The concerns mentioned about making it easier to start PG in a
> private mode don't seem too bad but I am not entirely sure that the
> tools which want to leverage that kind of capability would want to have
> to exec out to the PG binary to use it.

Tough luck.  But even leaving infeasibility aside, it seems like a quite
bad idea to do this in-process inside a tool that manages backup &
recovery. Creating threads / sub-processes with complicated needs (like
any pared down version of pg to do just recovery would have) from within
a library has substantial complications. So you'd not want to do this
in-process anyway.


> A lot of this part of the discussion feels like a tangent though, unless
> I'm missing something.

I'm replying to:

On 2019-04-17 18:43:10 -0400, Stephen Frost wrote:
> Wow.  I have to admit that I feel completely opposite of that- I'd
> *love* to have an independent tool (which ideally uses the same code
> through the common library, or similar) that can be run to apply WAL.

And I'm basically saying that anything that starts from this premise is
fatally flawed (in the ex falso quodlibet kind of sense ;)).


> The "WAL compression" tool contemplated
> previously would be much simpler and not the full-blown WAL replay
> capability, which would be left to the server, unless you're suggesting
> that even that should be exclusively the purview of the backend?  Though
> that ship's already sailed, given that external projects have
> implemented it.

I'm extremely doubtful of such tools (but it's not what I was responding
to, see above). I'd be extremely surprised if even one of them came
close to being correct. The old FPI removal tool had data corrupting
bugs left and right.


> Having a library to provide that which external
> projects could leverage would be nicer than having everyone write their
> own version.

No, I don't think that's necessarily true. Something complicated that's
hard to get right doesn't have to be provided by core, even if other
projects decide that their risk/reward assessment is different from
core postgres'. We don't have to take on all kinds of work and
complexity for external tools.

Greetings,

Andres Freund



Re: block-level incremental backup

From
Robert Haas
Date:
On Mon, Apr 22, 2019 at 1:08 PM Stephen Frost <sfrost@snowman.net> wrote:
> > I think we're getting closer to a meeting of the minds here, but I
> > don't think it's intrinsically necessary to rewrite the whole method
> > of operation of pg_basebackup to implement incremental backup in a
> > sensible way.
>
> It wasn't my intent to imply that the whole method of operation of
> pg_basebackup would have to change for this.

Cool.

> > One could instead just do a straightforward extension
> > to the existing BASE_BACKUP command to enable incremental backup.
>
> Ok, how do you envision that?  As I mentioned up-thread, I am concerned
> that we're talking too high-level here and it's making the discussion
> more difficult than it would be if we were to put together specific
> ideas and then discuss them.
>
> One way I can imagine to extend BASE_BACKUP is by adding LSN as an
> optional parameter and then having the database server scan the entire
> cluster and send a tarball which contains essentially a 'diff' file of
> some kind for each file where we can construct a diff based on the LSN,
> and then the complete contents of the file for everything else that
> needs to be in the backup.

/me scratches head.  Isn't that pretty much what I described in my
original post?  I even described what that "'diff' file of some kind"
would look like in some detail in the paragraph of that email
numbered "2.", and I described the reasons for that choice at length
in http://postgr.es/m/CA+TgmoZrqdV-tB8nY9P+1pQLqKXp5f1afghuoHh5QT6ewdkJ6g@mail.gmail.com

I can't figure out how I'm managing to be so unclear about things
about which I thought I'd been rather explicit.

> So, sure, that would work, but it wouldn't be able to be parallelized
> and I don't think it'd end up being very exciting for the external tools
> because of that, but it would be fine for pg_basebackup.

Stop being such a pessimist.  Yes, if we only add the option to the
BASE_BACKUP command, it won't directly be very exciting for external
tools, but a lot of the work that is needed to do things that ARE
exciting for external tools will have been done.  For instance, if the
work to figure out which blocks have been modified via WAL-scanning
gets done, and initially that's only exposed via BASE_BACKUP, it won't
be much work for somebody to write code that exposes
that information directly through some new replication command.
There's a difference between something that's going in the wrong
direction and something that's going in the right direction but not as
far or as fast as you'd like.  And I'm 99% sure that everything I'm
proposing here falls in the latter category rather than the former.

> On the other hand, if you added new commands for 'list of files changed
> since this LSN' and 'give me this file' and 'give me this file with the
> changes in it since this LSN', then pg_basebackup could work with that
> pretty easily in a single-threaded model (maybe with two connections to
> the backend, but still in a single process, or maybe just by slurping up
> the file list and then asking for each one) and the external tools could
> leverage those new capabilities too for their backups, both full backups
> and incremental ones.  This also wouldn't have to change how
> pg_basebackup does full backups today one bit, so what we're really
> talking about here is the direction to take the new code that's being
> written, not about rewriting existing code.  I agree that it'd be a bit
> more work...  but hopefully not *that* much more, and it would mean we
> could later add parallel backup to pg_basebackup more easily too, if we
> wanted to.

For purposes of implementing parallel pg_basebackup, it would probably
be better if the server rather than the client decided which files to
send via which connection.  If the client decides, then every time the
server finishes sending a file, the client has to request another
file, and that introduces some latency: after the server finishes
sending each file, it has to wait for the client to finish receiving
the data, and it has to wait for the client to tell it what file to
send next.  If the server decides, then it can just send data at top
speed without a break.  So the ideal interface for pg_basebackup would
really be something like:

START_PARALLEL_BACKUP blah blah PARTICIPANTS 4;

...returning a cookie that can then be used by each participant as
an argument to a new command:

JOIN_PARALLEL_BACKUP 'cookie';

However, that is obviously extremely inconvenient for third-party
tools.  It's possible we need both an interface like this -- for use
by parallel pg_basebackup -- and a
START_BACKUP/SEND_FILE_LIST/SEND_FILE_CONTENTS/STOP_BACKUP type
interface for use by external tools.  On the other hand, maybe the
additional overhead caused by managing the list of files to be fetched
on the client side is negligible.  It'd be interesting to see, though,
how busy the server is when running an incremental backup managed by
an external tool like BART or pgbackrest on a cluster with a gazillion
little-tiny relations.  I wonder if we'd find that it spends most of
its time waiting for the client.

> What I'm afraid will be lackluster is adding block-level incremental
> backup support to pg_basebackup without any support for managing
> backups or anything else.  I'm also concerned that it's going to mean
> that people who want to use incremental backup with pg_basebackup are
> going to have to write a lot of their own management code (probably in
> shell scripts and such...) around that and if they get anything wrong
> there then people are going to end up with bad backups that they can't
> restore from, or they'll have corrupted clusters if they do manage to
> get them restored.

I think that this is another complaint that basically falls into the
category of saying that this proposal might not fix everything for
everybody, but that complaint could be levied against any reasonable
development proposal.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Stephen Frost
Date:
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2019-04-19 20:04:41 -0400, Stephen Frost wrote:
> > I agree that we don't want another implementation and that there's a lot
> > that we want to do to improve replay performance.  We've already got
> > frontend tools which work with multiple execution threads, so I'm not
> > sure I get the "not easily feasible" bit, and the argument about the
> > checkpointer seems largely related to that (as in- if we didn't have
> > multiple threads/processes then things would perform quite badly...  but
> > we can and do have multiple threads/processes in frontend tools today,
> > even in pg_basebackup).
>
> You need not just multiple execution threads, but basically a new
> implementation of shared buffers, locking, process monitoring, with most
> of the related infrastructure. You're literally talking about
> reimplementing a very substantial portion of the backend.  I'm not sure
> I can convey in written words - via a public medium - how bad an idea
> it would be to go there.

Yes, there'd be some need for locking and process monitoring, though if
we aren't supporting ongoing read queries at the same time, there's a
whole bunch of things that we don't need from the existing backend.

> > > Which I think is entirely reasonable. With the 'consistent' and LSN
> > > recovery targets one already can get most of what's needed from such a
> > > tool, anyway.  I'd argue the biggest issue there is that there's no
> > > equivalent to starting postgres with a private socket directory on
> > > windows, and perhaps an option or two making it easier to start postgres
> > > in a "private" mode for things like this.
> >
> > This would mean building in a way to do parallel WAL replay into the
> > server binary though, as discussed above, and it seems like making that
> > work in a way that allows us to still be available as a read-only
> > standby would be quite a bit more difficult.  We could possibly support
> > parallel WAL replay only when we aren't a replica but from the same
> > binary.
>
> I'm doubtful that we should try to implement parallel WAL apply that
> can't support HS - a substantial portion of the logic to avoid
> issues around relfilenode reuse, consistency, etc. is going to be
> necessary for non-HS aware apply anyway.  But if somebody had a concrete
> proposal for something that's fundamentally only doable without HS, I
> could be convinced.

I'd certainly prefer that we support parallel WAL replay *with* HS, that
just seems like a much larger problem, but I'd be quite happy to be told
that it wouldn't be that much harder.

> > A lot of this part of the discussion feels like a tangent though, unless
> > I'm missing something.
>
> I'm replying to:
>
> On 2019-04-17 18:43:10 -0400, Stephen Frost wrote:
> > Wow.  I have to admit that I feel completely opposite of that- I'd
> > *love* to have an independent tool (which ideally uses the same code
> > through the common library, or similar) that can be run to apply WAL.
>
> And I'm basically saying that anything that starts from this premise is
> fatally flawed (in the ex falso quodlibet kind of sense ;)).

I'd just say that it'd be... difficult. :)

> > The "WAL compression" tool contemplated
> > previously would be much simpler and not the full-blown WAL replay
> > capability, which would be left to the server, unless you're suggesting
> > that even that should be exclusively the purview of the backend?  Though
> > that ship's already sailed, given that external projects have
> > implemented it.
>
> I'm extremely doubtful of such tools (but it's not what I was responding
> to, see above). I'd be extremely surprised if even one of them came
> close to being correct. The old FPI removal tool had data corrupting
> bugs left and right.

I have concerns about it myself, which is why I'd actually really like
to see something in core that does it, and does it the right way, that
other projects could then leverage (ideally by just linking into the
library without having to rewrite what's in core, though that might not
be an option for things like WAL-G that are in Go and possibly don't
want to link in some C library).

> > Having a library to provide that which external
> > projects could leverage would be nicer than having everyone write their
> > own version.
>
> No, I don't think that's necessarily true. Something complicated that's
> hard to get right doesn't have to be provided by core, even if other
> projects decide that their risk/reward assessment is different from
> core postgres'. We don't have to take on all kinds of work and
> complexity for external tools.

No, it doesn't have to be provided by core, but I sure would like it to
be and I'd be much more comfortable if it was because then we'd also
take care to not break whatever assumptions are made (or to do so in a
way that can be detected and/or handled) as new code is written.  As
discussed above, as long as it isn't provided by core, it's not going to
be trusted, likely will have bugs, and probably will be broken by things
happening in core moving forward.  The only option left is "well, we
just won't have that capability at all".  Maybe that's what you're
getting at here, but I'm not sure I agree with that as the result.

Thanks,

Stephen


Re: block-level incremental backup

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Apr 22, 2019 at 1:08 PM Stephen Frost <sfrost@snowman.net> wrote:
> > > One could instead just do a straightforward extension
> > > to the existing BASE_BACKUP command to enable incremental backup.
> >
> > Ok, how do you envision that?  As I mentioned up-thread, I am concerned
> > that we're talking too high-level here and it's making the discussion
> > more difficult than it would be if we were to put together specific
> > ideas and then discuss them.
> >
> > One way I can imagine to extend BASE_BACKUP is by adding LSN as an
> > optional parameter and then having the database server scan the entire
> > cluster and send a tarball which contains essentially a 'diff' file of
> > some kind for each file where we can construct a diff based on the LSN,
> > and then the complete contents of the file for everything else that
> > needs to be in the backup.
>
> /me scratches head.  Isn't that pretty much what I described in my
> original post?  I even described what that "'diff' file of some kind"
> would look like in some detail in the paragraph of that email
> numbered "2.", and I described the reasons for that choice at length
> in http://postgr.es/m/CA+TgmoZrqdV-tB8nY9P+1pQLqKXp5f1afghuoHh5QT6ewdkJ6g@mail.gmail.com
>
> I can't figure out how I'm managing to be so unclear about things
> about which I thought I'd been rather explicit.

There was basically zero discussion about what things would look like at
a protocol level (I went back and skimmed over the thread before sending
my last email to specifically see if I was going to get this response
back..).  I get the idea behind the diff file, the contents of which I
wasn't getting into above.

> > So, sure, that would work, but it wouldn't be able to be parallelized
> > and I don't think it'd end up being very exciting for the external tools
> > because of that, but it would be fine for pg_basebackup.
>
> Stop being such a pessimist.  Yes, if we only add the option to the
> BASE_BACKUP command, it won't directly be very exciting for external
> tools, but a lot of the work that is needed to do things that ARE
> exciting for external tools will have been done.  For instance, if the
> work to figure out which blocks have been modified via WAL-scanning
> gets done, and initially that's only exposed via BASE_BACKUP, it won't
> be much work for somebody to write code that exposes
> that information directly through some new replication command.
> There's a difference between something that's going in the wrong
> direction and something that's going in the right direction but not as
> far or as fast as you'd like.  And I'm 99% sure that everything I'm
> proposing here falls in the latter category rather than the former.

I didn't mean to imply that you're going in the wrong direction here and
I thought I said somewhere in my last email more-or-less exactly the
same, that a great deal of the work needed for block-level incremental
backup would be done, but specifically that this proposal wouldn't allow
external tools to leverage that.  It sounds like what you're suggesting
now is that you're happy to implement the backend code, expose it in a
way that works just for pg_basebackup, and that if someone else wants to
add things to the protocol to make it easier for external tools to
leverage, great.  All I can say is that that's basically how we ended up
in the situation we're in today where pg_basebackup doesn't support
parallel backup but a bunch of external tools do and they don't go
through the backend to get there, even though they'd probably prefer to.

> > On the other hand, if you added new commands for 'list of files changed
> > since this LSN' and 'give me this file' and 'give me this file with the
> > changes in it since this LSN', then pg_basebackup could work with that
> > pretty easily in a single-threaded model (maybe with two connections to
> > the backend, but still in a single process, or maybe just by slurping up
> > the file list and then asking for each one) and the external tools could
> > leverage those new capabilities too for their backups, both full backups
> > and incremental ones.  This also wouldn't have to change how
> > pg_basebackup does full backups today one bit, so what we're really
> > talking about here is the direction to take the new code that's being
> > written, not about rewriting existing code.  I agree that it'd be a bit
> > more work...  but hopefully not *that* much more, and it would mean we
> > could later add parallel backup to pg_basebackup more easily too, if we
> > wanted to.
>
> For purposes of implementing parallel pg_basebackup, it would probably
> be better if the server rather than the client decided which files to
> send via which connection.  If the client decides, then every time the
> server finishes sending a file, the client has to request another
> file, and that introduces some latency: after the server finishes
> sending each file, it has to wait for the client to finish receiving
> the data, and it has to wait for the client to tell it what file to
> send next.  If the server decides, then it can just send data at top
> speed without a break.  So the ideal interface for pg_basebackup would
> really be something like:
>
> START_PARALLEL_BACKUP blah blah PARTICIPANTS 4;
>
> ...returning a cookie that can then be used by each participant as
> an argument to a new command:
>
> JOIN_PARALLEL_BACKUP 'cookie';
>
> However, that is obviously extremely inconvenient for third-party
> tools.  It's possible we need both an interface like this -- for use
> by parallel pg_basebackup -- and a
> START_BACKUP/SEND_FILE_LIST/SEND_FILE_CONTENTS/STOP_BACKUP type
> interface for use by external tools.  On the other hand, maybe the
> additional overhead caused by managing the list of files to be fetched
> on the client side is negligible.  It'd be interesting to see, though,
> how busy the server is when running an incremental backup managed by
> an external tool like BART or pgbackrest on a cluster with a gazillion
> little-tiny relations.  I wonder if we'd find that it spends most of
> its time waiting for the client.

Thanks for sharing your thoughts on that, certainly having the backend
able to be more intelligent about streaming files to avoid latency is
good and possibly the best approach.  Another alternative to reducing
the latency would be to have a way for the client to request a set of
files, but I don't know that it'd be better.

I'm not really sure why the above is extremely inconvenient for
third-party tools, beyond just that they've already been written to work
with an assumption that the server-side of things isn't as intelligent
as PG is.

> > What I'm afraid will be lackluster is adding block-level incremental
> > backup support to pg_basebackup without any support for managing
> > backups or anything else.  I'm also concerned that it's going to mean
> > that people who want to use incremental backup with pg_basebackup are
> > going to have to write a lot of their own management code (probably in
> > shell scripts and such...) around that and if they get anything wrong
> > there then people are going to end up with bad backups that they can't
> > restore from, or they'll have corrupted clusters if they do manage to
> > get them restored.
>
> I think that this is another complaint that basically falls into the
> category of saying that this proposal might not fix everything for
> everybody, but that complaint could be levied against any reasonable
> development proposal.

I'm disappointed that the concerns about the trouble that end users are
likely to have with this didn't garner more discussion.

Thanks,

Stephen


Re: block-level incremental backup

From
Andres Freund
Date:
Hi,

On 2019-04-22 14:26:40 -0400, Stephen Frost wrote:
> I'm disappointed that the concerns about the trouble that end users are
> likely to have with this didn't garner more discussion.

My impression is that end users are having a lot more trouble due to
important backup/restore features not being in core/pg_basebackup, than
due to external tools having a harder time to implement certain
features. Focusing on external tools being able to provide all those
features, because core hasn't yet, is imo entirely the wrong thing to
concentrate upon.  And it's not like things largely haven't been
implemented in pg_basebackup for fundamental architectural reasons.
It's because we've built like 5 different external tools with randomly
differing feature sets and licenses.

Greetings,

Andres Freund



Re: block-level incremental backup

From
Stephen Frost
Date:
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2019-04-22 14:26:40 -0400, Stephen Frost wrote:
> > I'm disappointed that the concerns about the trouble that end users are
> > likely to have with this didn't garner more discussion.
>
> My impression is that end users are having a lot more trouble due to
> important backup/restore features not being in core/pg_basebackup, than
> due to external tools having a harder time to implement certain
> features.

I had been referring specifically to the concern I raised about
incremental block-level backups being added to pg_basebackup and how
that'll make using pg_basebackup more complicated and therefore more
difficult for end-users to get right, particularly if the end user is
having to handle management of the association between the full backup
and the incremental backups.  I wasn't referring to anything regarding
external tools.

> Focusing on external tools being able to provide all those
> features, because core hasn't yet, is imo entirely the wrong thing to
> concentrate upon.  And it's not like things largely haven't been
> implemented in pg_basebackup for fundamental architectural reasons.
> It's because we've built like 5 different external tools with randomly
> differing feature sets and licenses.

There are a few challenges when it comes to adding backup features to
core.  One of the reasons is that core naturally moves slower when it
comes to development than external projects do, as was discussed
earlier on this thread.  Another is that, when it comes to backup,
specifically, people want to back up their *existing* systems, which
means that they need a backup tool that's going to work with whatever
version of PG they've currently got deployed and that's often a few
years old already.  Certainly when I've thought about features that we'd
like to see and considered if there's something that could be
implemented in core vs. implemented outside of core, the answer often
ends up being "well, if we do it ourselves then we can make it work for
PG 9.2 and above, and have it working for existing users, but if we work
it in as part of core, it won't be available until next year and only
for version 12 and above, and users can only use it once they've
upgraded.."

Thanks,

Stephen


Re: block-level incremental backup

From
Robert Haas
Date:
On Mon, Apr 22, 2019 at 2:26 PM Stephen Frost <sfrost@snowman.net> wrote:
> There was basically zero discussion about what things would look like at
> a protocol level (I went back and skimmed over the thread before sending
> my last email to specifically see if I was going to get this response
> back..).  I get the idea behind the diff file, the contents of which I
> wasn't getting into above.

Well, I wrote:

"There should be a way to tell pg_basebackup to request from the
server only those blocks where LSN >= threshold_value."

I guess I assumed that people interested in the details would take
that to mean "and therefore the protocol would grow an option for this
type of request in whatever way is the most straightforward possible
extension of the current functionality," which is indeed how you
eventually interpreted it when you said we could extend BASE_BACKUP
"by adding LSN as an optional parameter."

I could have been more explicit, but sometimes people tell me that my
emails are too long.

> external tools to leverage that.  It sounds like what you're suggesting
> now is that you're happy to implement the backend code, expose it in a
> way that works just for pg_basebackup, and that if someone else wants to
> add things to the protocol to make it easier for external tools to
> leverage, great.

Yep, that's more or less it, although I am potentially willing to do
some modest amount of that other work along the way.  I just don't
want to prioritize it higher than getting the actual thing I want to
build built, which I think is a pretty fair position for me to take.

> All I can say is that that's basically how we ended up
> in the situation we're in today where pg_basebackup doesn't support
> parallel backup but a bunch of external tools do and they don't go
> through the backend to get there, even though they'd probably prefer to.

I certainly agree that core should try to do things in a way that is
useful to external tools when that can be done without undue effort,
but only if it can actually be done without undue effort.  Let's see
whether that's the case here:

- Anastasia wants a command added that dumps out whatever the server
knows about what files have changed, which I already agreed was a
reasonable extension of my initial proposal.

- You said that for this to be useful to pgbackrest, it'd have to use
a whole different mechanism that includes commands to request
individual files and blocks within those files, which would be a
significant rewrite of pg_basebackup that you agreed is more closely
related to parallel backup than to the project under discussion on
this thread.  And that even then pgbackrest probably wouldn't use it
because it also does server-side compression and encryption which are
not included in this proposal.

It seems to me that the first one falls into the category of a reasonable
additional effort and the second one falls into the category of lots
of extra and unrelated work that wouldn't even get used.

> Thanks for sharing your thoughts on that, certainly having the backend
> able to be more intelligent about streaming files to avoid latency is
> good and possibly the best approach.  Another alternative to reducing
> the latency would be to have a way for the client to request a set of
> files, but I don't know that it'd be better.

I don't know either.  This is an area that needs more thought, I
think, although as discussed, it's more related to parallel backup
than $SUBJECT.

> I'm not really sure why the above is extremely inconvenient for
> third-party tools, beyond just that they've already been written to work
> with an assumption that the server-side of things isn't as intelligent
> as PG is.

Well, one thing you might want to do is have a tool that connects to
the server, enters backup mode, requests information on what blocks
have changed, copies those blocks via direct filesystem access, and
then exits backup mode.  Such a tool would really benefit from a
START_BACKUP / SEND_FILE_LIST / SEND_FILE_CONTENTS / STOP_BACKUP
command language, because it would just skip ever issuing the
SEND_FILE_CONTENTS command in favor of doing that part of the work via
other means.  On the other hand, a START_PARALLEL_BACKUP LSN '1/234'
command is useless to such a tool.

Contrariwise, a tool that has its own magic - perhaps based on
WAL-scanning or something like ptrack - to know which files currently
exist and which blocks are modified could use SEND_FILE_CONTENTS but
not SEND_FILE_LIST.  And a filesystem-snapshot based technique might
use START_BACKUP and STOP_BACKUP but nothing else.

In short, providing granular commands like this lets the client be
really intelligent even if the server isn't, and lets the client have
fine-grained control of the process.  This is very good if you're an
out-of-core tool maintainer and your tool is trying to be smarter than
- or even just differently-designed than - core.

But if what you really want is just a maximally-efficient parallel
backup, you don't need the commands to be fine-grained like this.  You
don't even really *want* the commands to be fine-grained like this,
because it's better if the server works it all out so as to avoid
unnecessary network round-trips.  You just want to tell the server
"hey, I want to do a parallel backup with 5 participants - hit me!"
and have it do that in the most efficient way that it knows how,
without forcing the client to make any decisions that can be made just
as well, and perhaps more efficiently, on the server.

On the third hand, one advantage of having the fine-grained commands
is that it would not only make it easier for out-of-core tools to do
cool things, but also in-core tools.  For instance, you can imagine
being able to do something like:

pg_basebackup -D outputdir -d conninfo --copy-files-from=$PGDATA

If the client is using what I'm calling fine-grained commands, this is
easy to implement.  If it's just calling a piece of server side
functionality that sends back a tarball as a blob, it's not.

So each approach has some pros and cons.

> I'm disappointed that the concerns about the trouble that end users are
> likely to have with this didn't garner more discussion.

Well, we can keep discussing things.  I've tried to reply to as many
of your concerns as I can, but I believe you've written more email on
this thread than everyone else combined, so perhaps I haven't entirely
been able to keep up.

That being said, as far as I can tell, those concerns were not
seconded by anyone else.  Also, if I understand correctly, when I
asked how we could avoid that problem, you said that you didn't know.
And I said it seemed like we would need to do a very expensive
operation at
server startup, or magic.  So I feel that perhaps it is a problem that
(1) is not of great general concern and (2) to which no really
superior engineering solution is possible.

I may, however, be mistaken.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Anastasia Lubennikova
Date:
22.04.2019 2:02, Robert Haas wrote:
> I think we're getting closer to a meeting of the minds here, but I
> don't think it's intrinsically necessary to rewrite the whole method
> of operation of pg_basebackup to implement incremental backup in a
> sensible way.  One could instead just do a straightforward extension
> to the existing BASE_BACKUP command to enable incremental backup.
> Then, to enable parallel full backup and all sorts of out-of-core
> hacking, one could expand the command language to allow tools to
> access individual steps: START_BACKUP, SEND_FILE_LIST,
> SEND_FILE_CONTENTS, STOP_BACKUP, or whatever.  The second thing makes
> for an appealing project, but I do not think there is a technical
> reason why it has to be done first.  Or for that matter why it has to
> be done second.  As I keep saying, incremental backup and full backup
> are separate projects and I believe it's completely reasonable for
> whoever is doing the work to decide on the order in which they would
> like to do the work.
>
> Having said that, I'm curious what people other than Stephen (and
> other pgbackrest hackers) think about the relative value of parallel
> backup vs. incremental backup.  Stephen appears quite convinced that
> parallel backup is full of win and incremental backup is a bit of a
> yawn by comparison, and while I certainly would not want to discount
> the value of his experience in this area, it sometimes happens on this
> mailing list that [ drum roll please ] not everybody agrees about
> everything.  So, what do other people think?
>
Personally, I believe that incremental backups are more useful to
implement first since they benefit both backup speed and the space
taken by a backup.  Frankly speaking, I'm a bit surprised that the
discussion of parallel backups took up so much of this thread.
Of course, we must keep it in mind while designing the API, to avoid
introducing any architectural obstacles, but any further discussion of
parallelism is a subject for another topic.


I understand Stephen's concerns about the difficulties of incremental
backup management.  Even with the assumption that the user is ready to
manage backup chains, retention, and other stuff, we must consider the
format of backup metadata that will allow us to perform some primitive
commands (a rough sketch of such a metadata file follows the list
below):

1) Tell whether this backup is full or incremental.

2) Tell what backup is the parent of this incremental backup.
Probably, we can limit it to just returning "start_lsn", which later
can be compared to the "stop_lsn" of the parent backup.

3) Take an incremental backup based on this backup.
Here we must help a backup manager retrieve the LSN to pass to
pg_basebackup.

4) Restore an incremental backup into a directory (on top of an
already restored full backup).
One may use it to perform a "merge" or a "restore" of the incremental
backup, depending on the destination directory.
I wonder if it is possible to integrate it into any existing tool, or
whether we end up with something like pg_basebackup/pg_baserestore, as
in the case of pg_dump/pg_restore.
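
To make the discussion less abstract, here is one possible shape for
such a metadata file, modeled loosely on the existing backup_label
format (all field names and values below are invented for illustration
only, not a worked-out proposal):

BACKUP TYPE: incremental
START WAL LOCATION: 16/B374D848
STOP WAL LOCATION: 16/B3D4E908
INCREMENTAL FROM LSN: 16/9A2C3D10
PARENT BACKUP LABEL: 20190421T010203_full

From such a file a backup manager could answer (1) and (2) directly,
derive the LSN needed for (3), and validate the chain before
attempting (4).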

Have you designed these? I can only recall "pg_combinebackup" from the
very first message in this thread, which looks more like a sketch to
explain the idea rather than a thought-out feature design. I also
found a page https://wiki.postgresql.org/wiki/Incremental_backup that
raises the same questions.
I'm volunteering to write a draft patch or, more likely, a set of
patches, which will allow us to discuss the subject in more detail.
And to do that, I wish we would agree on the API and data format (at
least broadly).
Looking forward to hearing your thoughts.


As I see it, ideally the backup management tools should concentrate
more on managing multiple backups, while all the logic of taking a
single backup (of any kind) should be integrated into the core. It
means that any out-of-core client won't have to walk the PGDATA
directory and care about all the postgres-specific knowledge of data
files consisting of blocks with headers and LSNs and so on. It simply
requests data and gets it.
Understandably, it won't be implemented in one take and, what is more
probable, it is not fully reachable.
Still, it will be great to do our best to provide such tools (both
existing and future) with conveniently formatted data and an API to
get it.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: block-level incremental backup

From
Adam Brusselback
Date:

I hope it's alright to throw in my $0.02 as a user. I've been following this (and the other thread on reading WAL to find modified blocks, prefaulting, whatever else) since the start with great excitement and would love to see the built-in backup capabilities in Postgres greatly improved. I know this is not completely on-topic for just incremental backups, so I apologize in advance. It just seemed like the most apt place to chime in.


Just to preface where I am coming from, I have been using pgBackRest for the past couple of years and used wal-e prior to that.  I am not a big *nix user other than on my servers; I do all my development on Windows and primarily use Java. The command line is not where I feel most comfortable despite my best efforts over the last 5-6 years. Prior to Postgres, I used SQL Server for quite a few years at previous companies, but had more of a junior / intermediate skill set back then. I just wanted to put that out there so you can see where my biases are.


 

With all that said, I would not be comfortable using pg_basebackup as my main backup tool simply because I’d have to cobble together numerous tools to get backups stored in a safe (not on the same server) location, I’d have to manage expiring backups and the WAL which is no longer needed, and I’d have to handle the rest of the stuff that makes these backup management tools useful.


The command line scares me, and even if I were able to get all that working, I would not feel warm and fuzzy that I hadn’t messed something up horribly, and I might hit an edge case which destroys backups, silently corrupts data, etc.

I love that there are tools that manage all of it: backups, WAL archiving, remote storage, integration with cloud storage (S3 and the like), retention of these backups with all their dependencies, and all the restore options necessary built in as well.


Block-level incremental backup would be amazing for my use case. I have small updates / deletes that happen to data all over some of my largest tables. With pgBackRest, since the diff/incremental backups are at the file level, a single update / delete which touched a random spot in a table requires that whole 1GB file to be backed up again. That said, even if pg_basebackup were the only tool that did incremental block-level backup tomorrow, I still wouldn’t start using it directly. I went into the issues I’d have to deal with if I used pg_basebackup above, and incremental backups without a management tool make me think using it correctly would be even harder.


I know this thread is just about incremental backup, and that pretty much everything in core is built up from small features into larger more complex ones. I understand that and am not trying to dump on any efforts, I am super excited to see work being done in this area! I just wanted to share my perspective on how crucial good backup management is to me (and I’m sure a few others may share my sentiment considering how popular all the external tools are).

I would never put a system in production unless I have some backup management in place. If core builds a backup management tool which uses pg_basebackup as a building block for its solution…awesome! That may be something I’d use.  If pg_basebackup can be improved so it can be used as the basis that most external backup management tools build on top of, that’s also great. All the external tools which practically every Postgres company has built show that it’s obviously a need for a lot of users. Core will never solve every single problem for all users, I know that. It would just be great to see some of the fundamental features of backup management baked into core in an extensible way.

With that, there could be a recommended way to set up backups (full/incremental, parallel, compressed), point-in-time recovery, backup retention, and restores (to a point in time, on a replica server, etc.) with just the tooling within core, with a nice and simple user interface and great performance.

If those features core supports in the internal tooling are built in an extensible way (as has been discussed), there could be much less duplication of work implementing the same base features over and over for each external tool. Those companies can focus on more value-added features to their own products that core would never support, or on improving the tooling/performance/features core provides.


Well, this is way longer and a lot less coherent than I was hoping, so I apologize for that. Hopefully my stream of thoughts made a little bit of sense to someone.


-Adam


Re: block-level incremental backup

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Apr 22, 2019 at 2:26 PM Stephen Frost <sfrost@snowman.net> wrote:
> > There was basically zero discussion about what things would look like at
> > a protocol level (I went back and skimmed over the thread before sending
> > my last email to specifically see if I was going to get this response
> > back..).  I get the idea behind the diff file, the contents of which I
> > wasn't getting into above.
>
> Well, I wrote:
>
> "There should be a way to tell pg_basebackup to request from the
> server only those blocks where LSN >= threshold_value."
>
> I guess I assumed that people interested in the details would take
> that to mean "and therefore the protocol would grow an option for this
> type of request in whatever way is the most straightforward possible
> extension of the current functionality," which is indeed how you
> eventually interpreted it when you said we could extend BASE_BACKUP
> "by adding LSN as an optional parameter."

Looking at it from where I'm sitting, I brought up two ways that we
could extend the protocol to "request from the server only those blocks
where LSN >= threshold_value" with one being the modification to
BASE_BACKUP and the other being a new set of commands that could be
parallelized.  If I had assumed that you'd be thinking the same way I am
about extending the backup protocol, I wouldn't have said anything now
and then would have complained after you wrote a patch that just
extended the BASE_BACKUP command, at which point I likely would have
been told that it's now been done and that I should have mentioned it
earlier.

> > external tools to leverage that.  It sounds like what you're suggesting
> > now is that you're happy to implement the backend code, expose it in a
> > way that works just for pg_basebackup, and that if someone else wants to
> > add things to the protocol to make it easier for external tools to
> > leverage, great.
>
> Yep, that's more or less it, although I am potentially willing to do
> some modest amount of that other work along the way.  I just don't
> want to prioritize it higher than getting the actual thing I want to
> build built, which I think is a pretty fair position for me to take.

At least in part then it seems like we're viewing the level of effort
around what I'm talking about quite differently, and I feel like that's
largely because every time I mention parallel anything there's this
assumption that I'm asking you to parallelize pg_basebackup or write a
whole bunch more code to provide a fully optimized server-side parallel
implementation for backups.  That really wasn't what I was going for.  I
was thinking it would be a modest amount of additional work to add
incremental backup via a few new commands, instead of through the
BASE_BACKUP protocol command, that would make parallelization possible.

Now, through this discussion, you've brought up some really good points
about how the initial thoughts I had around how we could add some
relatively simple commands, as part of this work, to make it easier for
someone to later add parallel support to pg_basebackup (either full or
incremental), or for external tools to leverage, might not be the best
solution when it comes to having parallel backup in core, and therefore
wouldn't actually end up being useful towards that end.  That's
certainly a fair point and possibly enough to justify not spending even
the modest time I was thinking it'd need, but I'm not convinced.  Now,
that said, if you are convinced that's the case, and you're doing the
work, then it's certainly your prerogative to go in the direction you're
convinced of.  I don't mean any of this discussion to imply that I'd
object to a commit that extended BASE_BACKUP in the way outlined above,
but I understood the question to be "what do people think of this idea?"
and to that I'm still of the opinion that spending a modest amount of
time to provide a way to parallelize an incremental backup is worth it,
even if it isn't optimal and isn't the direct goal of this effort.

There's a tangent on all of this that's pretty key though, which is the
question around just how the blocks are identified.  If the WAL scanning
is done to figure out the blocks, then that's quite a bit different from
the other idea of "open this relation and scan it, but only give me the
blocks after this LSN".  It's the latter case that I've been mostly
thinking about in this thread, which is part of why I was thinking it'd
be a modest amount of work to have protocol commands that accepted a
file (or perhaps a relation..) to scan and return blocks from instead of
baking this into BASE_BACKUP which by definition just serially scans the
data directory and returns things as it finds them.  For the case where
we have WAL scanning happening and modfiles which are being read and
used to figure out the blocks to send, it seems like it might be more
complicated and therefore potentially quite a bit more work to have a
parallel version of that.
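
Just so we're picturing the same thing, here's a very rough standalone
sketch of the kind of scan I mean (illustration only; the real thing
would live in the backend and stream the selected blocks over the
protocol rather than printing them, and it ignores details like
checksum verification and concurrent writes):

/*
 * Standalone sketch: scan one relation segment file and report the
 * blocks whose page LSN is at or past a threshold.  Assumes 8kB pages
 * and reads pd_lsn (the first 8 bytes of the page header, stored as
 * two 32-bit values in the server's native byte order).
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192

int
main(int argc, char **argv)
{
    unsigned int hi, lo;
    char        page[BLCKSZ];
    FILE       *f;
    long        blkno = 0;

    if (argc != 3 || sscanf(argv[2], "%X/%X", &hi, &lo) != 2)
    {
        fprintf(stderr, "usage: %s segment-file threshold-lsn\n", argv[0]);
        return 1;
    }
    uint64_t    threshold = ((uint64_t) hi << 32) | lo;

    if ((f = fopen(argv[1], "rb")) == NULL)
    {
        perror(argv[1]);
        return 1;
    }

    while (fread(page, 1, BLCKSZ, f) == BLCKSZ)
    {
        /* pd_lsn is the first 8 bytes of the page header: two uint32s. */
        uint32_t    xlogid, xrecoff;

        memcpy(&xlogid, page, sizeof(xlogid));
        memcpy(&xrecoff, page + sizeof(xlogid), sizeof(xrecoff));
        uint64_t    page_lsn = ((uint64_t) xlogid << 32) | xrecoff;

        if (page_lsn >= threshold)
            printf("block %ld changed (page LSN %X/%X)\n",
                   blkno, (unsigned) xlogid, (unsigned) xrecoff);
        blkno++;
    }

    fclose(f);
    return 0;
}

The point being that this kind of per-file scan is trivially easy to
hand out to multiple workers, one file (or segment) at a time, whereas
a WAL-scanning/modfile-based approach needs a bit more coordination.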

> > All I can say is that that's basically how we ended up
> > in the situation we're in today where pg_basebackup doesn't support
> > parallel backup but a bunch of external tools do and they don't go
> > through the backend to get there, even though they'd probably prefer to.
>
> I certainly agree that core should try to do things in a way that is
> useful to external tools when that can be done without undue effort,
> but only if it can actually be done without undue effort.  Let's see
> whether that's the case here:
>
> - Anastasia wants a command added that dumps out whatever the server
> knows about what files have changed, which I already agreed was a
> reasonable extension of my initial proposal.

That seems like a useful thing to have, I agree.

> - You said that for this to be useful to pgbackrest, it'd have to use
> a whole different mechanism that includes commands to request
> individual files and blocks within those files, which would be a
> significant rewrite of pg_basebackup that you agreed is more closely
> related to parallel backup than to the project under discussion on
> this thread.  And that even then pgbackrest probably wouldn't use it
> because it also does server-side compression and encryption which are
> not included in this proposal.

Yes, having thought about it a bit more, without adding in the other
features that we already support in pgBackRest, it's unlikely we'd use
it in the form that I was contemplating.  That said, it'd at least be
closer to something we could use and adding those other features, such
as compression and encryption, would almost certainly be simpler and
easier if there were already protocol commands like those we discussed
for parallel work.

> > Thanks for sharing your thoughts on that, certainly having the backend
> > able to be more intelligent about streaming files to avoid latency is
> > good and possibly the best approach.  Another alternative to reducing
> > the latency would be to have a way for the client to request a set of
> > files, but I don't know that it'd be better.
>
> I don't know either.  This is an area that needs more thought, I
> think, although as discussed, it's more related to parallel backup
> than $SUBJECT.

Yes, I agree with that.

> > I'm not really sure why the above is extremely inconvenient for
> > third-party tools, beyond just that they've already been written to work
> > with an assumption that the server-side of things isn't as intelligent
> > as PG is.
>
> Well, one thing you might want to do is have a tool that connects to
> the server, enters backup mode, requests information on what blocks
> have changed, copies those blocks via direct filesystem access, and
> then exits backup mode.  Such a tool would really benefit from a
> START_BACKUP / SEND_FILE_LIST / SEND_FILE_CONTENTS / STOP_BACKUP
> command language, because it would just skip ever issuing the
> SEND_FILE_CONTENTS command in favor of doing that part of the work via
> other means.  On the other hand, a START_PARALLEL_BACKUP LSN '1/234'
> command is useless to such a tool.

That's true, but I hardly ever hear people talking about how wonderful
it is that pgBackRest uses SSH to grab the data.  What I hear, often, is
that people would really like backups to be done over the PG protocol on
the same port that replication is done on.  A possible compromise is
having a dedicated port for the backup agent to use, but it's definitely
not the preference.

> Contrariwise, a tool that has its own magic - perhaps based on
> WAL-scanning or something like ptrack - to know which files currently
> exist and which blocks are modified could use SEND_FILE_CONTENTS but
> not SEND_FILE_LIST.  And a filesystem-snapshot based technique might
> use START_BACKUP and STOP_BACKUP but nothing else.
>
> In short, providing granular commands like this lets the client be
> really intelligent even if the server isn't, and lets the client have
> fine-grained control of the process.  This is very good if you're an
> out-of-core tool maintainer and your tool is trying to be smarter than
> - or even just differently-designed than - core.
>
> But if what you really want is just a maximally-efficient parallel
> backup, you don't need the commands to be fine-grained like this.  You
> don't even really *want* the commands to be fine-grained like this,
> because it's better if the server works it all out so as to avoid
> unnecessary network round-trips.  You just want to tell the server
> "hey, I want to do a parallel backup with 5 participants - hit me!"
> and have it do that in the most efficient way that it knows how,
> without forcing the client to make any decisions that can be made just
> as well, and perhaps more efficiently, on the server.
>
> On the third hand, one advantage of having the fine-grained commands
> is that it would not only make it easier for out-of-core tools to do
> cool things, but also in-core tools.  For instance, you can imagine
> being able to do something like:
>
> pg_basebackup -D outputdir -d conninfo --copy-files-from=$PGDATA
>
> If the client is using what I'm calling fine-grained commands, this is
> easy to implement.  If it's just calling a piece of server side
> functionality that sends back a tarball as a blob, it's not.
>
> So each approach has some pros and cons.

I agree that each has some pros and cons.  Certainly one of the big
'cons' here is that it'd be a lot more backend work to implement the
'maximally-efficient parallel backup', while the fine-grained commands
wouldn't require nearly as much but would still allow a great deal of
the benefit for both in-core and out-of-core tools, potentially.

> > I'm disappointed that the concerns about the trouble that end users are
> > likely to have with this didn't garner more discussion.
>
> Well, we can keep discussing things.  I've tried to reply to as many
> of your concerns as I can, but I believe you've written more email on
> this thread than everyone else combined, so perhaps I haven't entirely
> been able to keep up.
>
> That being said, as far as I can tell, those concerns were not
> seconded by anyone else.  Also, if I understand correctly, when I
> asked how we could avoid that problem, you said that you didn't know.  And
> I said it seemed like we would need to do a very expensive operation at
> server startup, or magic.  So I feel that perhaps it is a problem that
> (1) is not of great general concern and (2) to which no really
> superior engineering solution is possible.

The comments that Anastasia had around the issues with being able to
identify the full backup that goes with a given incremental backup, et
al, certainly echoed some of my concerns regarding this part of the
discussion.

As for the concerns about trying to avoid corruption from starting up an
invalid cluster, I didn't see much discussion about the idea of some
kind of cross-check between pg_control and backup_label.  That was all
very hand-wavy, so I'm not too surprised, but I don't think it's
completely impossible to have something better than "well, if you just
remove this one file, then you get a non-obviously corrupt cluster that
you can happily start up".  I'll certainly accept that it requires more
thought though and if we're willing to continue a discussion around
that, great.

Thanks,

Stephen

Attachment

Re: block-level incremental backup

From
Robert Haas
Date:
On Wed, Apr 24, 2019 at 9:28 AM Stephen Frost <sfrost@snowman.net> wrote:
> Looking at it from where I'm sitting, I brought up two ways that we
> could extend the protocol to "request from the server only those blocks
> where LSN >= threshold_value" with one being the modification to
> BASE_BACKUP and the other being a new set of commands that could be
> parallelized.  If I had assumed that you'd be thinking the same way I am
> about extending the backup protocol, I wouldn't have said anything now
> and then would have complained after you wrote a patch that just
> extended the BASE_BACKUP command, at which point I likely would have
> been told that it's now been done and that I should have mentioned it
> earlier.

Fair enough.

> At least in part then it seems like we're viewing the level of effort
> around what I'm talking about quite differently, and I feel like that's
> largely because every time I mention parallel anything there's this
> assumption that I'm asking you to parallelize pg_basebackup or write a
> whole bunch more code to provide a fully optimized server-side parallel
> implementation for backups.  That really wasn't what I was going for.  I
> was thinking it would be a modest amount of additional work to add
> incremental backup via a few new commands, instead of through the
> BASE_BACKUP protocol command, that would make parallelization possible.

I'm not sure about that.  It doesn't seem crazy difficult, but there
are a few wrinkles.  One is that if the client is requesting files one
at a time, it's got to have a list of all the files that it needs to
request, and that means that it has to ask the server to make a
preparatory pass over the whole PGDATA directory to get a list of all
the files that exist.  That overhead is not otherwise needed.  Another
is that the list of files might be really large, and that means that
the client would either use a lot of memory to hold that great big
list, or need to deal with spilling the list to a spool file
someplace, or else have a server protocol that lets the list be
fetched incrementally in chunks.  A third is that, as you mention
further on, it means that the client has to care a lot more about
exactly how the server is figuring out which blocks have been
modified.  If it just says BASE_BACKUP ..., the server can be
internally reading each block and checking the LSN, or using
WAL-scanning or ptrack or whatever and the client doesn't need to know
or care.  But if the client is asking for a list of modified files or
blocks, then that presumes the information is available, and not too
expensively, without actually reading the files.  Fourth, MAX_RATE
probably won't actually limit to the correct rate overall if the limit
is applied separately to each file.

I'd be afraid that a patch that tried to handle all that as part of
this project would get rejected on the grounds that it was trying to
solve too many unrelated problems.  Also, though not everybody has to
agree on what constitutes a "modest amount of additional work," I
would not describe solving all of those problems as a modest effort,
but rather a pretty substantial one.

> There's a tangent on all of this that's pretty key though, which is the
> question around just how the blocks are identified.  If the WAL scanning
> is done to figure out the blocks, then that's quite a bit different from
> the other idea of "open this relation and scan it, but only give me the
> blocks after this LSN".  It's the latter case that I've been mostly
> thinking about in this thread, which is part of why I was thinking it'd
> be a modest amount of work to have protocol commands that accepted a
> file (or perhaps a relation..) to scan and return blocks from instead of
> baking this into BASE_BACKUP which by definition just serially scans the
> data directory and returns things as it finds them.  For the case where
> we have WAL scanning happening and modfiles which are being read and
> used to figure out the blocks to send, it seems like it might be more
> complicated and therefore potentially quite a bit more work to have a
> parallel version of that.

Yeah.  I don't entirely agree that the first one is simple, as per the
above, but I definitely agree that the second one is more complicated
than the first one.

> > Well, one thing you might want to do is have a tool that connects to
> > the server, enters backup mode, requests information on what blocks
> > have changed, copies those blocks via direct filesystem access, and
> > then exits backup mode.  Such a tool would really benefit from a
> > START_BACKUP / SEND_FILE_LIST / SEND_FILE_CONTENTS / STOP_BACKUP
> > command language, because it would just skip ever issuing the
> > SEND_FILE_CONTENTS command in favor of doing that part of the work via
> > other means.  On the other hand, a START_PARALLEL_BACKUP LSN '1/234'
> > command is useless to such a tool.
>
> That's true, but I hardly ever hear people talking about how wonderful
> it is that pgBackRest uses SSH to grab the data.  What I hear, often, is
> that people would really like backups to be done over the PG protocol on
> the same port that replication is done on.  A possible compromise is
> having a dedicated port for the backup agent to use, but it's definitely
> not the preference.

If you happen to be on the same system where the backup is running,
reading straight from the data directory might be a lot faster.
Otherwise, I tend to agree with you that using libpq is probably best.

> I agree that each has some pros and cons.  Certainly one of the big
> 'cons' here is that it'd be a lot more backend work to implement the
> 'maximally-efficient parallel backup', while the fine-grained commands
> wouldn't require nearly as much but would still allow a great deal of
> the benefit for both in-core and out-of-core tools, potentially.

I agree.

> The comments that Anastasia had around the issues with being able to
> identify the full backup that goes with a given incremental backup, et
> al, certainly echoed some of my concerns regarding this part of the
> discussion.
>
> As for the concerns about trying to avoid corruption from starting up an
> invalid cluster, I didn't see much discussion about the idea of some
> kind of cross-check between pg_control and backup_label.  That was all
> very hand-wavy, so I'm not too surprised, but I don't think it's
> completely impossible to have something better than "well, if you just
> remove this one file, then you get a non-obviously corrupt cluster that
> you can happily start up".  I'll certainly accept that it requires more
> thought though and if we're willing to continue a discussion around
> that, great.

I think there are three different issues here that need to be
considered separately.

Issue #1: If you manually add files to your backup, remove files from
your backup, or change files in your backup, bad things will happen.
There is fundamentally nothing we can do to prevent this completely,
but it may be possible to make the system more resilient against
ham-handed modifications, at least to the extent of detecting them.
That's maybe a topic for another thread, but it's an interesting one:
Andres and I were brainstorming about it at some point.

Issue #2: You can only restore an LSN-based incremental backup
correctly if you have a base backup whose start-of-backup LSN is
greater than or equal to the threshold LSN used to take the
incremental backup.  If #1 is not in play, this is just a simple
cross-check at restoration time: retrieve the 'START WAL LOCATION'
from the prior backup's backup_label file and the threshold LSN for
the incremental backup from wherever you decide to store it and
compare them; if they do not have the right relationship, ERROR.  As
to whether #1 might end up in play here, anything's possible, but
wouldn't manually editing LSNs in backup metadata files be pretty
obviously a bad idea?  (Then again, I didn't really think the whole
backup_label thing was that confusing either, and obviously I was
wrong about that.  Still, editing a file requires a little more work
than removing it... you have to not only lie to the system, you have
to decide which lie to tell!)
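
For illustration, that restoration-time cross-check could be as small as the
following (a simplified, self-contained sketch; the type and function names
are invented and do not come from any existing patch):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint64_t XLogRecPtr;    /* simplified stand-in for the server type */

/*
 * prior_start_lsn:    START WAL LOCATION from the prior backup's backup_label
 * incr_threshold_lsn: threshold LSN recorded in the incremental backup's metadata
 */
static void
check_chain_link(XLogRecPtr prior_start_lsn, XLogRecPtr incr_threshold_lsn)
{
    if (prior_start_lsn < incr_threshold_lsn)
    {
        /* changes between the two backups could be missing: refuse to combine */
        fprintf(stderr,
                "ERROR: incremental backup needs a base backup started at or after %X/%X\n",
                (unsigned int) (incr_threshold_lsn >> 32),
                (unsigned int) incr_threshold_lsn);
        exit(1);
    }
}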

Issue #3: Even if you clearly understand the rule articulated in #2,
you might find it hard to follow in practice.  If you take a full
backup on Sunday and an incremental against Sunday's backup or against
the previous day's backup on each subsequent day, it's not really that
hard to understand.  But in more complex scenarios it could be hard to
get right.  For example, if you've been removing your backups when they
are a month old and then you keep doing the same thing once you
add incrementals to the picture, you might easily remove a full backup
upon which a newer incremental depends.  I see the need for good tools
to manage this kind of complexity, but have no plan as part of this
project to provide them.  I think that just requires too many
assumptions about where those backups are being stored and how they
are being catalogued and managed; I don't believe I currently am
knowledgeable enough to design something that would be good enough to
meet core standards for inclusion, and I don't want to waste energy
trying.  If someone else wants to try, that's OK with me, but I think
it's probably better to let this be a thing that people experiment
with outside of core for a while until we see what ends up being a
winner.  I realize that this is a debatable position, but as I'm sure
you realize by now, I have a strong desire to limit the scope of this
project in such a way that I can get it done, 'cuz a bird in the hand
is worth two in the bush.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Apr 24, 2019 at 9:28 AM Stephen Frost <sfrost@snowman.net> wrote:
> > At least in part then it seems like we're viewing the level of effort
> > around what I'm talking about quite differently, and I feel like that's
> > largely because every time I mention parallel anything there's this
> > assumption that I'm asking you to parallelize pg_basebackup or write a
> > whole bunch more code to provide a fully optimized server-side parallel
> > implementation for backups.  That really wasn't what I was going for.  I
> > was thinking it would be a modest amount of additional work to add
> > incremental backup via a few new commands, instead of through the
> > BASE_BACKUP protocol command, that would make parallelization possible.
>
> I'm not sure about that.  It doesn't seem crazy difficult, but there
> are a few wrinkles.  One is that if the client is requesting files one
> at a time, it's got to have a list of all the files that it needs to
> request, and that means that it has to ask the server to make a
> preparatory pass over the whole PGDATA directory to get a list of all
> the files that exist.  That overhead is not otherwise needed.  Another
> is that the list of files might be really large, and that means that
> the client would either use a lot of memory to hold that great big
> list, or need to deal with spilling the list to a spool file
> someplace, or else have a server protocol that lets the list be
> fetched incrementally in chunks.

So, I had a thought about that when I was composing the last email and
while I'm still unsure about it, maybe it'd be useful to mention it
here- do we really need a list of every *file*, or could we reduce that
down to a list of relations + forks for the main data directory, and
then always include whatever other directories/files are appropriate?

When it comes to operating in chunks, well, if we're getting a list of
relations instead of files, we do have this thing called cursors..

> A third is that, as you mention
> further on, it means that the client has to care a lot more about
> exactly how the server is figuring out which blocks have been
> modified.  If it just says BASE_BACKUP ..., the server can be
> internally reading each block and checking the LSN, or using
> WAL-scanning or ptrack or whatever and the client doesn't need to know
> or care.  But if the client is asking for a list of modified files or
> blocks, then that presumes the information is available, and not too
> expensively, without actually reading the files.

I would think the client would be able to just ask for the list of
modified files, when it comes to building up the list of files to ask
for, which could potentially be done based on mtime instead of by WAL
scanning or by scanning the files themselves.  Don't get me wrong, I'd
prefer that we work based on the WAL, since I have more confidence in
that, but certainly quite a few of the tools do work off mtime these
days and while it's not perfect, the risk/reward there is pretty
palatable to a lot of people.

> Fourth, MAX_RATE
> probably won't actually limit to the correct rate overall if the limit
> is applied separately to each file.

Sure, I hadn't been thinking about MAX_RATE and that would certainly
complicate things if we're offering to provide MAX_RATE-type
capabilities as part of this new set of commands.

> I'd be afraid that a patch that tried to handle all that as part of
> this project would get rejected on the grounds that it was trying to
> solve too many unrelated problems.  Also, though not everybody has to
> agree on what constitutes a "modest amount of additional work," I
> would not describe solving all of those problems as a modest effort,
> but rather a pretty substantial one.

I suspect some of that's driven by how they get solved and if we decide
we have to solve all of them.  With things like MAX_RATE + incremental
backups, I wonder how that's going to end up working, when you have the
option to apply the limit to the network, or to the disk I/O.  You might
have addressed that elsewhere, I've not looked, and I'm not too
particular about it personally either, but a definition could be "max
rate at which we'll read the file you asked for on this connection" and
that would be pretty straight-forward, I'd think.

> > > Well, one thing you might want to do is have a tool that connects to
> > > the server, enters backup mode, requests information on what blocks
> > > have changed, copies those blocks via direct filesystem access, and
> > > then exits backup mode.  Such a tool would really benefit from a
> > > START_BACKUP / SEND_FILE_LIST / SEND_FILE_CONTENTS / STOP_BACKUP
> > > command language, because it would just skip ever issuing the
> > > SEND_FILE_CONTENTS command in favor of doing that part of the work via
> > > other means.  On the other hand, a START_PARALLEL_BACKUP LSN '1/234'
> > > command is useless to such a tool.
> >
> > That's true, but I hardly ever hear people talking about how wonderful
> > it is that pgBackRest uses SSH to grab the data.  What I hear, often, is
> > that people would really like backups to be done over the PG protocol on
> > the same port that replication is done on.  A possible compromise is
> > having a dedicated port for the backup agent to use, but it's definitely
> > not the preference.
>
> If you happen to be on the same system where the backup is running,
> reading straight from the data directory might be a lot faster.

Yes, that's certainly true.

> > The comments that Anastasia had around the issues with being able to
> > identify the full backup that goes with a given incremental backup, et
> > al, certainly echoed some of my concerns regarding this part of the
> > discussion.
> >
> > As for the concerns about trying to avoid corruption from starting up an
> > invalid cluster, I didn't see much discussion about the idea of some
> > kind of cross-check between pg_control and backup_label.  That was all
> > very hand-wavy, so I'm not too surprised, but I don't think it's
> > completely impossible to have something better than "well, if you just
> > remove this one file, then you get a non-obviously corrupt cluster that
> > you can happily start up".  I'll certainly accept that it requires more
> > thought though and if we're willing to continue a discussion around
> > that, great.
>
> I think there are three different issues here that need to be
> considered separately.
>
> Issue #1: If you manually add files to your backup, remove files from
> your backup, or change files in your backup, bad things will happen.
> There is fundamentally nothing we can do to prevent this completely,
> but it may be possible to make the system more resilient against
> ham-handed modifications, at least to the extent of detecting them.
> That's maybe a topic for another thread, but it's an interesting one:
> Andres and I were brainstorming about it at some point.

I'd certainly be interested in hearing about ways we can improve on
that.  I'm alright with it being on another thread as it's a broader
concern than just what we're talking about here.

> Issue #2: You can only restore an LSN-based incremental backup
> correctly if you have a base backup whose start-of-backup LSN is
> greater than or equal to the threshold LSN used to take the
> incremental backup.  If #1 is not in play, this is just a simple
> cross-check at restoration time: retrieve the 'START WAL LOCATION'
> from the prior backup's backup_label file and the threshold LSN for
> the incremental backup from wherever you decide to store it and
> compare them; if they do not have the right relationship, ERROR.  As
> to whether #1 might end up in play here, anything's possible, but
> wouldn't manually editing LSNs in backup metadata files be pretty
> obviously a bad idea?  (Then again, I didn't really think the whole
> backup_label thing was that confusing either, and obviously I was
> wrong about that.  Still, editing a file requires a little more work
> than removing it... you have to not only lie to the system, you have
> to decide which lie to tell!)

Yes, that'd certainly be at least one cross-check, but what if you've
got an incremental backup based on a prior incremental backup that's
based on a prior full, and you skip the incremental backup in between
somehow?  Or are we just going to state outright that we don't support
incremental-on-incremental (in which case, all backups would actually be
either 'full' or 'differential' in the pgBackRest parlance, anyway, and
that parlance comes from my recollection of how other tools describe the
different backup types, but that was from many moons ago and might be
entirely wrong)?

> Issue #3: Even if you clearly understand the rule articulated in #2,
> you might find it hard to follow in practice.  If you take a full
> backup on Sunday and an incremental against Sunday's backup or against
> the previous day's backup on each subsequent day, it's not really that
> hard to understand.  But in more complex scenarios it could be hard to
> get right.  For example if you've been removing your backups when they
> are a month old and then you start doing the same thing once you
> add incrementals to the picture you might easily remove a full backup
> upon which a newer incremental depends.  I see the need for good tools
> to manage this kind of complexity, but have no plan as part of this
> project to provide them.  I think that just requires too many
> assumptions about where those backups are being stored and how they
> are being catalogued and managed; I don't believe I currently am
> knowledgeable enough to design something that would be good enough to
> meet core standards for inclusion, and I don't want to waste energy
> trying.  If someone else wants to try, that's OK with me, but I think
> it's probably better to let this be a thing that people experiment
> with outside of core for a while until we see what ends up being a
> winner.  I realize that this is a debatable position, but as I'm sure
> you realize by now, I have a strong desire to limit the scope of this
> project in such a way that I can get it done, 'cuz a bird in the hand
> is worth two in the bush.

Even if what we're talking about here is really only "differentials", or
backups where the incremental contains all the changes from a prior full
backup, if the only check is "full LSN is greater than or equal to the
incremental backup LSN", then you have a potential problem that's larger
than just the incrementals no longer being valid because you removed the
full backup on which they were taken- you might think that an *earlier*
full backup is the one for a given incremental and perform a restore
with the wrong full/incremental matchup and end up with a corrupted
database.

These are exactly the kind of issues that make me really wonder if this
is the right direction for pg_basebackup, or any backup tool, to go
in.  Maybe there are some additional things we can do to make it harder
for someone to end up with a corrupted database when they restore, but
it's really hard to get things like expiration correct.  We see users
already ending up with problems because they don't manage expiration of
their WAL correctly, and now we're adding another level of serious
complication to the expiration requirements that, as we've seen even on
this thread, some users are just not going to ever feel comfortable
with doing on their own.

Perhaps it's not relevant and I get that you want to build this cool
incremental backup capability into pg_basebackup and I'm not going to
stop you from doing it, but if I was going to build a backup tool,
adding support for block-level incremental backup wouldn't be where I'd
start, and, in fact, I might not even get to it even after investing
over 5 years in the project and even after building in proper backup
management.  The idea of implementing block-level incrementals while
pushing the backup management, expiration, and dependency between
incrementals and fulls on to the user to figure out just strikes me as
entirely backwards and, frankly, to be gratuitously 'itch scratching' at
the expense of what users really want and need here.

One of the great things about pg_basebackup is its simplicity and
ability to be a one-time "give me a snapshot of the database" and this
is building in a complicated feature to it that *requires* users to
build their own basic capabilities externally in order to be able to use
it.  I've tried to avoid getting into that here and I won't go on about
it, since it's your time to do with as you feel appropriate, but I do
worry that it makes us, as a project, look a bit more cavalier about
what users are asking for vs. what cool new thing we want to play with
than I, at least, would like us to be (so, I'll caveat that with "in
this area anyway", since I suspect saying this will probably come back
to bite me in some other discussion later ;).

Thanks,

Stephen

Attachment

Re: block-level incremental backup

From
Robert Haas
Date:
On Wed, Apr 24, 2019 at 12:57 PM Stephen Frost <sfrost@snowman.net> wrote:
> So, I had a thought about that when I was composing the last email and
> while I'm still unsure about it, maybe it'd be useful to mention it
> here- do we really need a list of every *file*, or could we reduce that
> down to a list of relations + forks for the main data directory, and
> then always include whatever other directories/files are appropriate?

I'm not quite sure what the difference is here.  I agree that we could
try to compact the list of file names by saying 16384 (24 segments)
instead of 16384, 16384.1, ..., 16384.23, but I doubt that saves
anything meaningful.  I don't see how we can leave anything out
altogether.  If there's a filename called boaty.mcboatface in the
server directory, I think we've got to back it up, and that won't
happen unless the client knows that it is there, and it won't know
unless we include it in a list.

> When it comes to operating in chunks, well, if we're getting a list of
> relations instead of files, we do have this thing called cursors..

Sure... but they don't work for replication commands and I am
definitely not volunteering to change that.

> I would think the client would be able to just ask for the list of
> modified files, when it comes to building up the list of files to ask
> for, which could potentially be done based on mtime instead of by WAL
> scanning or by scanning the files themselves.  Don't get me wrong, I'd
> prefer that we work based on the WAL, since I have more confidence in
> that, but certainly quite a few of the tools do work off mtime these
> days and while it's not perfect, the risk/reward there is pretty
> palatable to a lot of people.

That approach, as with a few others that have been suggested, requires
that the client have access to the previous backup, which makes me
uninterested in implementing it.  I want a version of incremental
backup where the client needs to know the LSN of the previous backup
and nothing else.  That way, if you store your actual backups on a
tape drive in an airless vault at the bottom of the Pacific Ocean, you
can still take incremental backup against them, as long as you
remember to note the LSNs before you ship the backups to the vault.
Woohoo!  It also allows for the wire protocol to be very simple and
the client to be very simple; neither of those things is essential,
but both are nice.

Also, I think using mtimes is just asking to get burned.  Yeah, almost
nobody will, but an LSN-based approach is more granular (block level)
and more reliable (can't be fooled by resetting a clock backward, or
by a filesystem being careless with file metadata), so I think it
makes sense to focus on getting that to work.  It's worth keeping in
mind that there may be somewhat different expectations for an external
tool vs. a core feature.  Stupid as it may sound, I think people using
an external tool are more likely to do things like read the directions, and
those directions can say things like "use a reasonable filesystem and
don't set your clock backward."  When stuff goes into core, people
assume that they should be able to run it on any filesystem on any
hardware where they can get it to work and it should just work.  And
you also get a lot more users, so even if the percentage of people not
reading the directions were to stay constant, the actual number of
such people will go up a lot. So picking what we seem to both agree to
be the most robust way of detecting changes seems like the way to go
from here.

> I suspect some of that's driven by how they get solved and if we decide
> we have to solve all of them.  With things like MAX_RATE + incremental
> backups, I wonder how that's going to end up working, when you have the
> option to apply the limit to the network, or to the disk I/O.  You might
> have addressed that elsewhere, I've not looked, and I'm not too
> particular about it personally either, but a definition could be "max
> rate at which we'll read the file you asked for on this connection" and
> that would be pretty straight-forward, I'd think.

I mean, it's just so people can tell pg_basebackup what rate they want
via a command-line option and have it happen like that.  They don't
care about the rates for individual files.

> > Issue #1: If you manually add files to your backup, remove files from
> > your backup, or change files in your backup, bad things will happen.
> > There is fundamentally nothing we can do to prevent this completely,
> > but it may be possible to make the system more resilient against
> > ham-handed modifications, at least to the extent of detecting them.
> > That's maybe a topic for another thread, but it's an interesting one:
> > Andres and I were brainstorming about it at some point.
>
> I'd certainly be interested in hearing about ways we can improve on
> that.  I'm alright with it being on another thread as it's a broader
> concern than just what we're talking about here.

Might be a good topic to chat about at PGCon.

> > Issue #2: You can only restore an LSN-based incremental backup
> > correctly if you have a base backup whose start-of-backup LSN is
> > greater than or equal to the threshold LSN used to take the
> > incremental backup.  If #1 is not in play, this is just a simple
> > cross-check at restoration time: retrieve the 'START WAL LOCATION'
> > from the prior backup's backup_label file and the threshold LSN for
> > the incremental backup from wherever you decide to store it and
> > compare them; if they do not have the right relationship, ERROR.  As
> > to whether #1 might end up in play here, anything's possible, but
> > wouldn't manually editing LSNs in backup metadata files be pretty
> > obviously a bad idea?  (Then again, I didn't really think the whole
> > backup_label thing was that confusing either, and obviously I was
> > wrong about that.  Still, editing a file requires a little more work
> > than removing it... you have to not only lie to the system, you have
> > to decide which lie to tell!)
>
> Yes, that'd certainly be at least one cross-check, but what if you've
> got an incremental backup based on a prior incremental backup that's
> based on a prior full, and you skip the incremental backup in between
> somehow?  Or are we just going to state outright that we don't support
> incremental-on-incremental (in which case, all backups would actually be
> either 'full' or 'differential' in the pgBackRest parlance, anyway, and
> that parlance comes from my recollection of how other tools describe the
> different backup types, but that was from many moons ago and might be
> entirely wrong)?

I have every intention of supporting that case, just as I described in
my original email, and the algorithm that I just described handles it.
You just have to repeat the checks for every backup in the chain.   If
you have a backup A, and a backup B intended as an incremental vs. A,
and a backup C intended as an incremental vs. B, then the threshold
LSN for C is presumably the starting LSN for B, and the threshold LSN
for B is presumably the starting LSN for A.  If you try to restore
A-B-C you'll check C vs. B and find that all is well and similarly for
B vs. A.  If you try to restore A-C, you'll find out that A's start
LSN precedes C's threshold LSN and error out.
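
Spelled out as code, repeating that check over a chain might look roughly
like this (a hypothetical sketch; the arrays stand in for metadata read from
each backup, oldest first):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* simplified stand-in for the server type */

/*
 * start_lsn[i]     - start-of-backup LSN of backup i (index 0 = the full backup)
 * threshold_lsn[i] - threshold LSN of backup i (unused for the full backup)
 */
static bool
chain_is_restorable(const XLogRecPtr *start_lsn,
                    const XLogRecPtr *threshold_lsn,
                    int nbackups)
{
    for (int i = 1; i < nbackups; i++)
    {
        /* backup i must not assume changes older than the previous backup covers */
        if (start_lsn[i - 1] < threshold_lsn[i])
            return false;   /* e.g. restoring A-C with B missing */
    }
    return true;
}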

> Even if what we're talking about here is really only "differentials", or
> backups where the incremental contains all the changes from a prior full
> backup, if the only check is "full LSN is greater than or equal to the
> incremental backup LSN", then you have a potential problem that's larger
> than just the incrementals no longer being valid because you removed the
> full backup on which they were taken- you might think that an *earlier*
> full backup is the one for a given incremental and perform a restore
> with the wrong full/incremental matchup and end up with a corrupted
> database.

No, the proposed check is explicitly designed to prevent that.  You'd
get a restore failure (which is not great either, of course).

> management.  The idea of implementing block-level incrementals while
> pushing the backup management, expiration, and dependency between
> incrementals and fulls on to the user to figure out just strikes me as
> entirely backwards and, frankly, to be gratuitously 'itch scratching' at
> the expense of what users really want and need here.

Well, not everybody needs or wants the same thing.  I wouldn't be
proposing it if my employer didn't think it was gonna solve a real
problem...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Anastasia Lubennikova
Date:
23.04.2019 14:08, Anastasia Lubennikova wrote:
> I'm volunteering to write a draft patch or, more likely, set of 
> patches, which
> will allow us to discuss the subject in more detail.
> And to do that I wish we agree on the API and data format (at least 
> broadly).
> Looking forward to hearing your thoughts. 

Though the previous discussion stalled,
I still hope that we could agree on basic points such as a map file 
format and protocol extension,
which is necessary to start implementing the feature.

--------- Proof Of Concept patch ---------

In the attachments, you can find a prototype of incremental pg_basebackup,
which consists of 2 features:

1) To perform an incremental backup, one should call pg_basebackup with a
new argument:

pg_basebackup -D 'basedir' --prev-backup-start-lsn 'lsn'

where lsn is the start_lsn of the parent backup (it can be found in the
"backup_label" file).

It calls the BASE_BACKUP replication command with a new argument,
PREV_BACKUP_START_LSN 'lsn'.

For datafiles, only pages with LSN > prev_backup_start_lsn will be
included in the backup.
They are saved into a 'filename.partial' file, and a 'filename.blockmap'
file contains an array of BlockNumbers.
For example, if we backed up blocks 1, 3, and 5, filename.partial will
contain those 3 blocks, and 'filename.blockmap' will contain the array {1,3,5}.

Non-datafiles use the same format as before.

2) To merge an incremental backup into a full backup, call

pg_basebackup -D 'basedir' --incremental-pgdata 'incremental_basedir' 
--merge-backups

It will move all files from 'incremental_basedir' to 'basedir', handling
'.partial' files correctly.
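
As an illustration of that merge step for one datafile, handling a
'.partial'/'.blockmap' pair could look roughly like this (a sketch under the
assumptions above: the block map is a bare array of BlockNumbers, the partial
file holds the corresponding blocks in the same order, and the default 8K
block size is used; none of these names come from the actual patch):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BLCKSZ 8192
typedef uint32_t BlockNumber;

/* Overwrite the listed blocks of an existing full copy with their newer versions. */
static void
apply_partial(FILE *target, FILE *partial, FILE *blockmap)
{
    BlockNumber blkno;
    char        buf[BLCKSZ];

    while (fread(&blkno, sizeof(blkno), 1, blockmap) == 1)
    {
        if (fread(buf, BLCKSZ, 1, partial) != 1)
        {
            fprintf(stderr, "partial file is shorter than its block map\n");
            exit(1);
        }
        /* relation segments are at most 1 GB, so a long offset is sufficient */
        fseek(target, (long) blkno * BLCKSZ, SEEK_SET);
        fwrite(buf, BLCKSZ, 1, target);
    }
}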


--------- Questions to discuss ---------

Please note that this is just a proof-of-concept patch and it can be
optimized in many ways.
Let's concentrate on issues that affect the protocol or data format.

1) Whether we collect block maps using a simple "read everything page by
page" approach, WAL scanning, or any other page-tracking algorithm, we must
choose a map format.
I implemented the simplest one, while there are more ideas:

- We can have a map not per file, but per relation or maybe per tablespace,
which will make the implementation more complex, but probably more optimal.
The only problem I see with the existing implementation is that even if only
a few blocks changed, we still must pad the map to 512 bytes per the tar
format requirements.

- We can save LSNs into the block map.

typedef struct BlockMapItem {
     BlockNumber blkno;
     XLogRecPtr lsn;
} BlockMapItem;

In my implementation, an invalid prev_backup_start_lsn means falling back to
a regular basebackup without any block maps. Alternatively, we can define
another meaning for this value and send a block map for all files.
Backup utilities can use these maps to speed up backup merge or restore.

2) We can implement a BASE_BACKUP SEND_FILELIST replication command,
which will return a list of filenames with file sizes, and block maps if an
lsn was provided.

To avoid changing the format, we can simply send tar headers for each file:
- tarHeader("filename.blockmap") followed by the block map for relation files
if prev_backup_start_lsn is provided;
- tarHeader("filename") without the actual file content for non-relation
files, or for all files in a "FULL" backup.

The caller can parse these messages and use them for any purpose, for example,
to perform a parallel backup.
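
For example, a caller could walk such a stream using nothing more than the
standard tar header layout (512-byte records, member name at offset 0, size
as octal text at offset 124).  This is only a sketch of how the proposal
might be consumed, not an existing protocol or API:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TAR_BLOCK 512

/* Read one proposed SEND_FILELIST entry; returns 0 at end of stream. */
static int
read_filelist_entry(FILE *stream, char *name, size_t namelen, long *payload_size)
{
    char        header[TAR_BLOCK];

    if (fread(header, TAR_BLOCK, 1, stream) != 1 || header[0] == '\0')
        return 0;

    snprintf(name, namelen, "%.100s", header);          /* member name field */
    *payload_size = strtol(&header[124], NULL, 8);      /* size field, octal */

    /* skip the payload (e.g. a block map), padded to a 512-byte boundary */
    long padded = (*payload_size + TAR_BLOCK - 1) / TAR_BLOCK * TAR_BLOCK;
    fseek(stream, padded, SEEK_CUR);
    return 1;
}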

Thoughts?

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: block-level incremental backup

From
Jeevan Chalke
Date:
Hi Anastasia,

On Wed, Jul 10, 2019 at 11:47 PM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
23.04.2019 14:08, Anastasia Lubennikova wrote:
> I'm volunteering to write a draft patch or, more likely, set of
> patches, which
> will allow us to discuss the subject in more detail.
> And to do that I wish we agree on the API and data format (at least
> broadly).
> Looking forward to hearing your thoughts.

Though the previous discussion stalled,
I still hope that we could agree on basic points such as a map file
format and protocol extension,
which is necessary to start implementing the feature.

It's great that you too came up with a PoC patch. I didn't look at your changes in much detail, but we at EnterpriseDB are also working on this feature and have started implementing it.

Attached is the series of patches I have so far... (which need further optimization and adjustments, though)

Here is the overall design (as proposed by Robert) we are trying to implement:

1. Extend the BASE_BACKUP command that can be used with replication connections. Add a new [ LSN 'lsn' ] option.

2. Extend pg_basebackup with a new --lsn=LSN option that causes it to send the option added to the server in #1.

Here are the implementation details when we have a valid LSN

sendFile() in basebackup.c is the function which does most of the work for us. If the filename looks like a relation file, then we'll need to consider sending only a partial file. The way to do that is probably:

A. Read the whole file into memory.

B. Check the LSN of each block. Build a bitmap indicating which blocks have an LSN greater than or equal to the threshold LSN.

C. If more than 90% of the bits in the bitmap are set, send the whole file just as if this were a full backup. This 90% is a constant now; we might make it a GUC later.

D. Otherwise, send a file with .partial added to the name. The .partial file contains an indication of which blocks were changed at the beginning, followed by the data blocks. It also includes a checksum/CRC.
Currently, a .partial file format looks like:
 - start with a 4-byte magic number
 - then store a 4-byte CRC covering the header
 - then a 4-byte count of the number of blocks included in the file
 - then the block numbers, each as a 4-byte quantity
 - then the data blocks
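
Written out as a struct, that header layout would look something like this
(for illustration only; the field names and the magic value are invented,
only the order and sizes follow the description above):

#include <stdint.h>

#define PARTIAL_MAGIC 0x50415254    /* hypothetical value */

typedef struct PartialFileHeader
{
    uint32_t    magic;      /* 4-byte magic number */
    uint32_t    crc;        /* 4-byte CRC covering the header */
    uint32_t    nblocks;    /* 4-byte count of blocks included in the file */
    /* followed by nblocks 4-byte block numbers, then the data blocks */
} PartialFileHeader;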


We are also working on combining these incremental backups with the full backup and, for that, we are planning to add a new utility called pg_combinebackup. Will post the details on that later once we are on the same page about taking the backup.

Thanks
--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation

Attachment

Re: block-level incremental backup

From
Jeevan Chalke
Date:


On Thu, Jul 11, 2019 at 5:00 PM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
Hi Anastasia,

On Wed, Jul 10, 2019 at 11:47 PM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
23.04.2019 14:08, Anastasia Lubennikova wrote:
> I'm volunteering to write a draft patch or, more likely, set of
> patches, which
> will allow us to discuss the subject in more detail.
> And to do that I wish we agree on the API and data format (at least
> broadly).
> Looking forward to hearing your thoughts.

Though the previous discussion stalled,
I still hope that we could agree on basic points such as a map file
format and protocol extension,
which is necessary to start implementing the feature.

It's great that you too come up with the PoC patch. I didn't look at your changes in much details but we at EnterpriseDB too working on this feature and started implementing it.

Attached series of patches I had so far... (which needed further optimization and adjustments though)

Here is the overall design (as proposed by Robert) we are trying to implement:

1. Extend the BASE_BACKUP command that can be used with replication connections. Add a new [ LSN 'lsn' ] option.

2. Extend pg_basebackup with a new --lsn=LSN option that causes it to send the option added to the server in #1.

Here are the implementation details when we have a valid LSN

sendFile() in basebackup.c is the function which mostly does the thing for us. If the filename looks like a relation file, then we'll need to consider sending only a partial file. The way to do that is probably:

A. Read the whole file into memory.

B. Check the LSN of each block. Build a bitmap indicating which blocks have an LSN greater than or equal to the threshold LSN.

C. If more than 90% of the bits in the bitmap are set, send the whole file just as if this were a full backup. This 90% is a constant now; we might make it a GUC later.

D. Otherwise, send a file with .partial added to the name. The .partial file contains an indication of which blocks were changed at the beginning, followed by the data blocks. It also includes a checksum/CRC.
Currently, a .partial file format looks like:
 - start with a 4-byte magic number
 - then store a 4-byte CRC covering the header
 - then a 4-byte count of the number of blocks included in the file
 - then the block numbers, each as a 4-byte quantity
 - then the data blocks


We are also working on combining these incremental back-ups with the full backup and for that, we are planning to add a new utility called pg_combinebackup. Will post the details on that later once we have on the same page for taking backup.

For combining a full backup with one or more incremental backups, we are adding
a new utility called pg_combinebackup in src/bin.

Here is the overall design as proposed by Robert.

pg_combinebackup starts from the LAST backup specified and works backward. It
must NOT start with the full backup and work forward. This is important both
for reasons of efficiency and of correctness. For example, if you start by
copying over the full backup and then later apply the incremental backups on
top of it then you'll copy data and later end up overwriting it or removing
it. Any files that are leftover at the end that aren't in the final
incremental backup even as .partial files need to be removed, or the result is
wrong. We should aim for a system where every block in the output directory is
written exactly once and nothing ever has to be created and then removed.

To make that work, we should start by examining the final incremental backup.
We should proceed with one file at a time. For each file:

1. If the complete file is present in the incremental backup, then just copy it
to the output directory - and move on to the next file.

2. Otherwise, we have a .partial file. Work backward through the backup chain
until we find a complete version of the file. That might happen when we get
back to the full backup at the start of the chain, but it might also happen
sooner - at which point we do not need to and should not look at earlier
backups for that file. During this phase, we should read only the HEADER of
each .partial file, building a map of which blocks we're ultimately going to
need to read from each backup. We can also compute the offset within each file
where that block is stored at this stage, again using the header information.

3. Now, we can write the output file - reading each block in turn from the
correct backup and writing it to the output file, using the map we
constructed in the previous step. We should probably keep all of the input
files open over steps 2 and 3 and then close them at the end because
repeatedly closing and opening them is going to be expensive. When that's done,
go on to the next file and start over at step 1.
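
To make steps 2 and 3 concrete, the per-file planning pass could be sketched
like this (hypothetical names and simplified structures; tablespaces,
non-relation files, and error handling are ignored, and this is not the
actual tool):

#include <stdint.h>
#include <stdlib.h>

typedef uint32_t BlockNumber;

typedef struct BackupFile
{
    int          complete;      /* is a complete copy of the file present? */
    uint32_t     nblocks;       /* blocks stored for this file in this backup */
    BlockNumber *blocknos;      /* their block numbers (NULL if complete) */
} BackupFile;

/*
 * backups[] is ordered newest first (the final incremental at index 0, the
 * full backup last).  For each block of the output file, record which backup
 * supplies it; the caller then reads each block from its source and writes
 * the output file exactly once.  (Allocation checks omitted.)
 */
static int *
build_source_map(const BackupFile *backups, int nbackups, uint32_t total_blocks)
{
    int        *source = malloc(sizeof(int) * total_blocks);

    for (uint32_t b = 0; b < total_blocks; b++)
        source[b] = -1;

    for (int i = 0; i < nbackups; i++)
    {
        if (backups[i].complete)
        {
            /* complete copy found: older backups are not consulted at all */
            for (uint32_t b = 0; b < total_blocks; b++)
                if (source[b] == -1)
                    source[b] = i;
            break;
        }
        for (uint32_t j = 0; j < backups[i].nblocks; j++)
        {
            BlockNumber b = backups[i].blocknos[j];

            if (b < total_blocks && source[b] == -1)
                source[b] = i;  /* newest version of this block wins */
        }
    }
    return source;
}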


We have already started working on this design.

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation

Re: block-level incremental backup

From
Ibrar Ahmed
Date:

On Wed, Jul 17, 2019 at 10:22 AM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:


On Thu, Jul 11, 2019 at 5:00 PM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
Hi Anastasia,

On Wed, Jul 10, 2019 at 11:47 PM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
23.04.2019 14:08, Anastasia Lubennikova wrote:
> I'm volunteering to write a draft patch or, more likely, set of
> patches, which
> will allow us to discuss the subject in more detail.
> And to do that I wish we agree on the API and data format (at least
> broadly).
> Looking forward to hearing your thoughts.

Though the previous discussion stalled,
I still hope that we could agree on basic points such as a map file
format and protocol extension,
which is necessary to start implementing the feature.

It's great that you too come up with the PoC patch. I didn't look at your changes in much details but we at EnterpriseDB too working on this feature and started implementing it.

Attached series of patches I had so far... (which needed further optimization and adjustments though)

Here is the overall design (as proposed by Robert) we are trying to implement:

1. Extend the BASE_BACKUP command that can be used with replication connections. Add a new [ LSN 'lsn' ] option.

2. Extend pg_basebackup with a new --lsn=LSN option that causes it to send the option added to the server in #1.

Here are the implementation details when we have a valid LSN

sendFile() in basebackup.c is the function which mostly does the thing for us. If the filename looks like a relation file, then we'll need to consider sending only a partial file. The way to do that is probably:

A. Read the whole file into memory.

B. Check the LSN of each block. Build a bitmap indicating which blocks have an LSN greater than or equal to the threshold LSN.

C. If more than 90% of the bits in the bitmap are set, send the whole file just as if this were a full backup. This 90% is a constant now; we might make it a GUC later.

D. Otherwise, send a file with .partial added to the name. The .partial file contains an indication of which blocks were changed at the beginning, followed by the data blocks. It also includes a checksum/CRC.
Currently, a .partial file format looks like:
 - start with a 4-byte magic number
 - then store a 4-byte CRC covering the header
 - then a 4-byte count of the number of blocks included in the file
 - then the block numbers, each as a 4-byte quantity
 - then the data blocks


We are also working on combining these incremental back-ups with the full backup and for that, we are planning to add a new utility called pg_combinebackup. Will post the details on that later once we have on the same page for taking backup.

For combining a full backup with one or more incremental backup, we are adding
a new utility called pg_combinebackup in src/bin.

Here is the overall design as proposed by Robert.

pg_combinebackup starts from the LAST backup specified and work backward. It
must NOT start with the full backup and work forward. This is important both
for reasons of efficiency and of correctness. For example, if you start by
copying over the full backup and then later apply the incremental backups on
top of it then you'll copy data and later end up overwriting it or removing
it. Any files that are leftover at the end that aren't in the final
incremental backup even as .partial files need to be removed, or the result is
wrong. We should aim for a system where every block in the output directory is
written exactly once and nothing ever has to be created and then removed.

To make that work, we should start by examining the final incremental backup.
We should proceed with one file at a time. For each file:

1. If the complete file is present in the incremental backup, then just copy it
to the output directory - and move on to the next file.

2. Otherwise, we have a .partial file. Work backward through the backup chain
until we find a complete version of the file. That might happen when we get
back to the full backup at the start of the chain, but it might also happen
sooner - at which point we do not need to and should not look at earlier
backups for that file. During this phase, we should read only the HEADER of
each .partial file, building a map of which blocks we're ultimately going to
need to read from each backup. We can also compute the offset within each file
where that block is stored at this stage, again using the header information.

3. Now, we can write the output file - reading each block in turn from the
correct backup and writing it to the write output file, using the map we
constructed in the previous step. We should probably keep all of the input
files open over steps 2 and 3 and then close them at the end because
repeatedly closing and opening them is going to be expensive. When that's done,
go on to the next file and start over at step 1.


At what stage will you apply the WAL generated in between the START/STOP backup?
 
We are already started working on this design.

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation



--
Ibrar Ahmed

Re: block-level incremental backup

From
Jeevan Chalke
Date:
On Wed, Jul 17, 2019 at 2:15 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:

At what stage you will apply the WAL generated in between the START/STOP backup. 

In this design, we are not touching any WAL-related code. The WAL files will
get copied with each backup, either full or incremental. And thus, the last
incremental backup will have the final WAL files, which will be copied as-is
into the combined full backup, and they will get applied automatically when
that data directory is used to start the server.
 
--
Ibrar Ahmed

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation

Re: block-level incremental backup

From
Ibrar Ahmed
Date:


On Wed, Jul 17, 2019 at 6:43 PM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
On Wed, Jul 17, 2019 at 2:15 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:

At what stage you will apply the WAL generated in between the START/STOP backup. 

In this design, we are not touching any WAL related code. The WAL files will
get copied with each backup either full or incremental. And thus, the last
incremental backup will have the final WAL files which will be copied as-is
in the combined full-backup and they will get apply automatically if that
the data directory is used to start the server.

Ok, so you keep all the WAL files since the first backup, right?   
 
--
Ibrar Ahmed

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation



--
Ibrar Ahmed

Re: block-level incremental backup

From
Jeevan Chalke
Date:


On Wed, Jul 17, 2019 at 7:38 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:


On Wed, Jul 17, 2019 at 6:43 PM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
On Wed, Jul 17, 2019 at 2:15 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:

At what stage you will apply the WAL generated in between the START/STOP backup. 

In this design, we are not touching any WAL related code. The WAL files will
get copied with each backup either full or incremental. And thus, the last
incremental backup will have the final WAL files which will be copied as-is
in the combined full-backup and they will get apply automatically if that
the data directory is used to start the server.

Ok, so you keep all the WAL files since the first backup, right?   

The WAL files will anyway be copied while taking a backup (full or incremental),
but only the last incremental backup's WAL files are copied to the combined
synthetic full backup.

 
--
Ibrar Ahmed

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation



--
Ibrar Ahmed


--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation

Re: block-level incremental backup

From
vignesh C
Date:
Hi Jeevan,

The idea is very nice.
When insert/update/delete and truncate/drop happen in various
combinations, how does the incremental backup handle the copying of the
blocks?


On Wed, Jul 17, 2019 at 8:12 PM Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
>
>
> On Wed, Jul 17, 2019 at 7:38 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
>>
>>
>>
>> On Wed, Jul 17, 2019 at 6:43 PM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
>>>
>>> On Wed, Jul 17, 2019 at 2:15 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
>>>>
>>>>
>>>> At what stage you will apply the WAL generated in between the START/STOP backup.
>>>
>>>
>>> In this design, we are not touching any WAL related code. The WAL files will
>>> get copied with each backup either full or incremental. And thus, the last
>>> incremental backup will have the final WAL files which will be copied as-is
>>> in the combined full-backup and they will get apply automatically if that
>>> the data directory is used to start the server.
>>
>>
>> Ok, so you keep all the WAL files since the first backup, right?
>
>
> The WAL files will anyway be copied while taking a backup (full or incremental),
> but only last incremental backup's WAL files are copied to the combined
> synthetic full backup.
>
>>>
>>>>
>>>> --
>>>> Ibrar Ahmed
>>>
>>>
>>> --
>>> Jeevan Chalke
>>> Technical Architect, Product Development
>>> EnterpriseDB Corporation
>>>
>>
>>
>> --
>> Ibrar Ahmed
>
>
>
> --
> Jeevan Chalke
> Technical Architect, Product Development
> EnterpriseDB Corporation
>


--
Regards,
vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: block-level incremental backup

From
Jeevan Ladhe
Date:
Hi Vignesh,

This backup technology extends pg_basebackup itself, which means we can
still take online backups. This is internally done using pg_start_backup and
pg_stop_backup. pg_start_backup performs a checkpoint, and this checkpoint is
used in the recovery process while starting the cluster from a backup image. What
incremental backup changes (as compared to traditional pg_basebackup) is this:
after doing the checkpoint, instead of copying the entire relation files,
it takes an input LSN, scans all the blocks in all relation files, and stores
the blocks having LSN >= InputLSN. This means it considers all the changes
that are already written into relation files, including inserts/updates/deletes
etc., up to the checkpoint performed internally by pg_start_backup, and, as
Jeevan Chalke mentioned upthread, the incremental backup will also contain a
copy of the WAL files.
Once this incremental backup is combined with the parent backup by means of the
new combine process (which will be introduced as part of this feature itself),
the result should ideally look like a full pg_basebackup. Note that any changes
done by insert/delete/update operations while the incremental backup was being
taken will still be available via the WAL files and, as in the normal restore
process, will be replayed from the checkpoint onwards up to a consistent point.
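
For concreteness, the block-selection rule amounts to roughly the following
(a simplified stand-in that mirrors what the server-side page LSN check would
do, without using server headers; not actual patch code):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192
typedef uint64_t XLogRecPtr;    /* simplified stand-in for the server type */

/*
 * The page LSN sits at the very start of the page header as two 32-bit
 * halves (high half first); this assumes the page comes from a machine
 * with the same byte order as the reader.
 */
static XLogRecPtr
page_lsn(const char *page)
{
    uint32_t    hi;
    uint32_t    lo;

    memcpy(&hi, page, sizeof(hi));
    memcpy(&lo, page + sizeof(hi), sizeof(lo));
    return ((XLogRecPtr) hi << 32) | lo;
}

/* A block goes into the incremental backup if it changed at or after the input LSN. */
static bool
block_needs_backup(const char *page, XLogRecPtr input_lsn)
{
    return page_lsn(page) >= input_lsn;
}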

My two cents!

Regards,
Jeevan Ladhe

On Sat, Jul 20, 2019 at 11:22 PM vignesh C <vignesh21@gmail.com> wrote:
Hi Jeevan,

The idea is very nice.
When Insert/update/delete and truncate/drop happens at various
combinations, How the incremental backup handles the copying of the
blocks?


On Wed, Jul 17, 2019 at 8:12 PM Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
>
>
> On Wed, Jul 17, 2019 at 7:38 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
>>
>>
>>
>> On Wed, Jul 17, 2019 at 6:43 PM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
>>>
>>> On Wed, Jul 17, 2019 at 2:15 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
>>>>
>>>>
>>>> At what stage you will apply the WAL generated in between the START/STOP backup.
>>>
>>>
>>> In this design, we are not touching any WAL related code. The WAL files will
>>> get copied with each backup either full or incremental. And thus, the last
>>> incremental backup will have the final WAL files which will be copied as-is
>>> in the combined full-backup and they will get apply automatically if that
>>> the data directory is used to start the server.
>>
>>
>> Ok, so you keep all the WAL files since the first backup, right?
>
>
> The WAL files will anyway be copied while taking a backup (full or incremental),
> but only last incremental backup's WAL files are copied to the combined
> synthetic full backup.
>
>>>
>>>>
>>>> --
>>>> Ibrar Ahmed
>>>
>>>
>>> --
>>> Jeevan Chalke
>>> Technical Architect, Product Development
>>> EnterpriseDB Corporation
>>>
>>
>>
>> --
>> Ibrar Ahmed
>
>
>
> --
> Jeevan Chalke
> Technical Architect, Product Development
> EnterpriseDB Corporation
>


--
Regards,
vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: block-level incremental backup

From
vignesh C
Date:
Thanks Jeevan.

1) If relation file has changed due to truncate or vacuum.
    During incremental backup the new files will be copied.
    There are chances that both the old  file and new file
    will be present. I'm not sure if cleaning up of the
    old file is handled.
2) Just a small thought on building the bitmap,
    can the bitmap be built and maintained as
    and when the changes are happening in the system.
    If we are building the bitmap while doing the incremental backup,
    Scanning through each file might take more time.
    This can be a configurable parameter, the system can run
    without capturing this information by default, but if there are some
    of them who will be taking incremental backup frequently this
    configuration can be enabled which should track the modified blocks.

    What is your thought on this?
-- 
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


On Tue, Jul 23, 2019 at 11:19 PM Jeevan Ladhe
<jeevan.ladhe@enterprisedb.com> wrote:
>
> Hi Vignesh,
>
> This backup technology extends pg_basebackup itself, which means we can still
> take online backups. This is internally done using pg_start_backup and
> pg_stop_backup. pg_start_backup performs a checkpoint, and this checkpoint is
> used in the recovery process while starting the cluster from a backup image.
> What an incremental backup changes (as compared to traditional pg_basebackup)
> is this: after doing the checkpoint, instead of copying the entire relation
> files, it takes an input LSN, scans all the blocks in all relation files, and
> stores the blocks having LSN >= InputLSN. This means it considers all the
> changes that are already written into relation files, including
> insert/update/delete etc., up to the checkpoint performed internally by
> pg_start_backup, and, as Jeevan Chalke mentioned upthread, the incremental
> backup will also contain a copy of the WAL files. Once this incremental backup
> is combined with the parent backup by means of the new combine process (to be
> introduced as part of this feature), the result should ideally look like a full
> pg_basebackup. Note that any changes done by insert/delete/update operations
> while the incremental backup was being taken will still be available via the
> WAL files and, as in the normal restore process, will be replayed from the
> checkpoint onwards up to a consistent point.
>
> My two cents!
>
> Regards,
> Jeevan Ladhe
>
> On Sat, Jul 20, 2019 at 11:22 PM vignesh C <vignesh21@gmail.com> wrote:
>>
>> Hi Jeevan,
>>
>> The idea is very nice.
>> When Insert/update/delete and truncate/drop happens at various
>> combinations, How the incremental backup handles the copying of the
>> blocks?
>>
>>
>> On Wed, Jul 17, 2019 at 8:12 PM Jeevan Chalke
>> <jeevan.chalke@enterprisedb.com> wrote:
>> >
>> >
>> >
>> > On Wed, Jul 17, 2019 at 7:38 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
>> >>
>> >>
>> >>
>> >> On Wed, Jul 17, 2019 at 6:43 PM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
>> >>>
>> >>> On Wed, Jul 17, 2019 at 2:15 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
>> >>>>
>> >>>>
>> >>>> At what stage you will apply the WAL generated in between the START/STOP backup.
>> >>>
>> >>>
>> >>> In this design, we are not touching any WAL related code. The WAL files will
>> >>> get copied with each backup either full or incremental. And thus, the last
>> >>> incremental backup will have the final WAL files which will be copied as-is
>> >>> in the combined full-backup and they will get apply automatically if that
>> >>> the data directory is used to start the server.
>> >>
>> >>
>> >> Ok, so you keep all the WAL files since the first backup, right?
>> >
>> >
>> > The WAL files will anyway be copied while taking a backup (full or incremental),
>> > but only last incremental backup's WAL files are copied to the combined
>> > synthetic full backup.
>> >
>> >>>
>> >>>>
>> >>>> --
>> >>>> Ibrar Ahmed
>> >>>
>> >>>
>> >>> --
>> >>> Jeevan Chalke
>> >>> Technical Architect, Product Development
>> >>> EnterpriseDB Corporation
>> >>>
>> >>
>> >>
>> >> --
>> >> Ibrar Ahmed
>> >
>> >
>> >
>> > --
>> > Jeevan Chalke
>> > Technical Architect, Product Development
>> > EnterpriseDB Corporation
>> >
>>
>>
>> --
>> Regards,
>> vignesh
>>
>>
>>



Re: block-level incremental backup

From
Jeevan Ladhe
Date:

Hi Vignesh,

Please find my comments inline below:

1) If relation file has changed due to truncate or vacuum.
    During incremental backup the new files will be copied.
    There are chances that both the old  file and new file
    will be present. I'm not sure if cleaning up of the
    old file is handled.

When an incremental backup is taken, it either copies a file in its entirety if
the file has changed by more than 90%, or writes a .partial file with the
changed-blocks map and the actual data. For files that are unchanged, it writes
no block data but still creates a .partial file, so there is a .partial file for
every file that has to be looked up in the full backup.
While composing a synthetic backup from an incremental backup, the
pg_combinebackup tool will only look for those relation files in the full
(parent) backup that have .partial files in the incremental backup. So, if a
vacuum/truncate happened between the full and the incremental backup, the
incremental backup image will not have a .partial file for that relation at
all, and so the synthetic backup restored using pg_combinebackup will not have
that file either.
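
To illustrate that lookup rule, here is a rough per-file sketch of what
pg_combinebackup could do; the helper routines are assumptions of mine for the
illustration and not the actual patch code.

/*
 * Sketch: decide how one entry from the incremental backup ends up in the
 * combined (synthetic full) backup.
 */
#include <string.h>

extern void copy_whole_file(const char *src, const char *dst);
extern void reconstruct_from_partial(const char *partial_path,
                                     const char *parent_full_path,
                                     const char *dst);

static void
combine_one_entry(const char *incr_path, const char *parent_path,
                  const char *out_path)
{
    size_t      len = strlen(incr_path);

    if (len > 8 && strcmp(incr_path + len - 8, ".partial") == 0)
    {
        /*
         * Relation file tracked by the incremental backup: merge the blocks
         * recorded in the .partial file over the copy found in the parent
         * (full) backup.
         */
        reconstruct_from_partial(incr_path, parent_path, out_path);
    }
    else
    {
        /* Non-relation file, or a file that was re-sent in full: copy as-is. */
        copy_whole_file(incr_path, out_path);
    }

    /*
     * Files that do not appear in the incremental backup at all (for example,
     * relations dropped before the incremental backup) are simply never
     * emitted, which is how stale files disappear from the combined result.
     */
}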
 
2) Just a small thought on building the bitmap,
    can the bitmap be built and maintained as
    and when the changes are happening in the system.
    If we are building the bitmap while doing the incremental backup,
    Scanning through each file might take more time.
    This can be a configurable parameter, the system can run
    without capturing this information by default, but if there are some
    of them who will be taking incremental backup frequently this
    configuration can be enabled which should track the modified blocks.

IIUC, this will need changes in the backend. Honestly, I think backup is a
maintenance task and hampering the backend for this does not look like a good
idea. But, having said that even if we have to provide this as a switch for some
of the users, it will need a different infrastructure than what we are building
here for constructing bitmap, where we scan all the files one by one. Maybe for
the initial version, we can go with the current proposal that Robert has suggested,
and add this switch at a later point as an enhancement.
- My thoughts.

Regards,
Jeevan Ladhe

Re: block-level incremental backup

From
vignesh C
Date:
On Fri, Jul 26, 2019 at 11:21 AM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote:
Hi Vignesh,

Please find my comments inline below:

1) If relation file has changed due to truncate or vacuum.
    During incremental backup the new files will be copied.
    There are chances that both the old  file and new file
    will be present. I'm not sure if cleaning up of the
    old file is handled.

When an incremental backup is taken, it either copies a file in its entirety if
the file has changed by more than 90%, or writes a .partial file with the
changed-blocks map and the actual data. For files that are unchanged, it writes
no block data but still creates a .partial file, so there is a .partial file for
every file that has to be looked up in the full backup.
While composing a synthetic backup from an incremental backup, the
pg_combinebackup tool will only look for those relation files in the full
(parent) backup that have .partial files in the incremental backup. So, if a
vacuum/truncate happened between the full and the incremental backup, the
incremental backup image will not have a .partial file for that relation at
all, and so the synthetic backup restored using pg_combinebackup will not have
that file either.
Thanks Jeevan for the update; I feel this logic is good.
It will handle the case of deleting the old relation files.
 
2) Just a small thought on building the bitmap,
    can the bitmap be built and maintained as
    and when the changes are happening in the system.
    If we are building the bitmap while doing the incremental backup,
    Scanning through each file might take more time.
    This can be a configurable parameter, the system can run
    without capturing this information by default, but if there are some
    of them who will be taking incremental backup frequently this
    configuration can be enabled which should track the modified blocks.

IIUC, this will need changes in the backend. Honestly, I think backup is a
maintenance task and hampering the backend for this does not look like a good
idea. But, having said that even if we have to provide this as a switch for some
of the users, it will need a different infrastructure than what we are building
here for constructing bitmap, where we scan all the files one by one. Maybe for
the initial version, we can go with the current proposal that Robert has suggested,
and add this switch at a later point as an enhancement. 
That sounds fair to me.


Regards,
vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: block-level incremental backup

From
Robert Haas
Date:
On Wed, Jul 10, 2019 at 2:17 PM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> In attachments, you can find a prototype of incremental pg_basebackup,
> which consists of 2 features:
>
> 1) To perform incremental backup one should call pg_basebackup with a
> new argument:
>
> pg_basebackup -D 'basedir' --prev-backup-start-lsn 'lsn'
>
> where lsn is a start_lsn of parent backup (can be found in
> "backup_label" file)
>
> It calls BASE_BACKUP replication command with a new argument
> PREV_BACKUP_START_LSN 'lsn'.
>
> For datafiles, only pages with LSN > prev_backup_start_lsn will be
> included in the backup.
> They are saved into 'filename.partial' file, 'filename.blockmap' file
> contains an array of BlockNumbers.
> For example, if we backuped blocks 1,3,5, filename.partial will contain
> 3 blocks, and 'filename.blockmap' will contain array {1,3,5}.

I think it's better to keep both the information about changed blocks
and the contents of the changed blocks in a single file.  The list of
changed blocks is probably quite short, and I don't really want to
double the number of files in the backup if there's no real need. I
suspect it's just overall a bit simpler to keep everything together.
I don't think this is a make-or-break thing, and welcome contrary
arguments, but that's my preference.

> 2) To merge incremental backup into a full backup call
>
> pg_basebackup -D 'basedir' --incremental-pgdata 'incremental_basedir'
> --merge-backups
>
> It will move all files from 'incremental_basedir' to 'basedir' handling
> '.partial' files correctly.

This, to me, looks like it's much worse than the design that I
proposed originally.  It means that:

1. You can't take an incremental backup without having the full backup
available at the time you want to take the incremental backup.

2. You're always storing a full backup, which means that you need more
disk space, and potentially much more I/O while taking the backup.
You save on transfer bandwidth, but you add a lot of disk reads and
writes, costs which have to be paid even if the backup is never
restored.

> 1) Whether we collect block maps using simple "read everything page by
> page" approach
> or WAL scanning or any other page tracking algorithm, we must choose a
> map format.
> I implemented the simplest one, while there are more ideas:

I think we should start simple.

I haven't had a chance to look at Jeevan's patch at all, or yours in
any detail, as yet, so these are just some very preliminary comments.
It will be good, however, if we can agree on who is going to do what
part of this as we try to drive this forward together.  I'm sorry that
I didn't communicate EDB's plans to work on this more clearly;
duplicated effort serves nobody well.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Jeevan Ladhe
Date:

Hi Jeevan,

I reviewed the first two patches:

0001-Add-support-for-command-line-option-to-pass-LSN.patch and
0002-Add-TAP-test-to-test-LSN-option.patch

from the set of incremental backup patches, and the changes look good to me.

I had some concerns about the way we work around the fact that pg_lsn_in()
accepts an LSN of 0 as a valid LSN, which I think is itself contradictory to
the definition of InvalidXLogRecPtr. I have started a separate new thread[1]
for the same.

Also, I observe that commit 21f428eb has already moved the LSN decoding logic
to a separate function, pg_lsn_in_internal(), so the function
decode_lsn_internal() from patch 0001 will go away and the dependent code needs
to be modified.

I shall review the rest of the patches and post my comments.

Regards,
Jeevan Ladhe

[1] https://www.postgresql.org/message-id/CAOgcT0NOM9oR0Hag_3VpyW0uF3iCU=BDUFSPfk9JrWXRcWQHqw@mail.gmail.com


On Thu, Jul 11, 2019 at 5:00 PM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
Hi Anastasia,

On Wed, Jul 10, 2019 at 11:47 PM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
23.04.2019 14:08, Anastasia Lubennikova wrote:
> I'm volunteering to write a draft patch or, more likely, set of
> patches, which
> will allow us to discuss the subject in more detail.
> And to do that I wish we agree on the API and data format (at least
> broadly).
> Looking forward to hearing your thoughts.

Though the previous discussion stalled,
I still hope that we could agree on basic points such as a map file
format and protocol extension,
which is necessary to start implementing the feature.

It's great that you too came up with a PoC patch. I didn't look at your changes in much detail, but we at EnterpriseDB are also working on this feature and have started implementing it.

Attached is the series of patches I have so far (they need further optimization and adjustments, though).

Here is the overall design (as proposed by Robert) we are trying to implement:

1. Extend the BASE_BACKUP command that can be used with replication connections. Add a new [ LSN 'lsn' ] option.

2. Extend pg_basebackup with a new --lsn=LSN option that causes it to send the option added to the server in #1.

Here are the implementation details when we have a valid LSN

sendFile() in basebackup.c is the function that does most of the work for us. If the filename looks like a relation file, then we'll need to consider sending only a partial file. The way to do that is probably:

A. Read the whole file into memory.

B. Check the LSN of each block. Build a bitmap indicating which blocks have an LSN greater than or equal to the threshold LSN.

C. If more than 90% of the bits in the bitmap are set, send the whole file just as if this were a full backup. This 90% is a constant now; we might make it a GUC later.

D. Otherwise, send a file with .partial added to the name. The .partial file contains an indication of which blocks were changed at the beginning, followed by the data blocks. It also includes a checksum/CRC.
Currently, a .partial file format looks like:
 - start with a 4-byte magic number
 - then store a 4-byte CRC covering the header
 - then a 4-byte count of the number of blocks included in the file
 - then the block numbers, each as a 4-byte quantity
 - then the data blocks
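
For reference, a header laid out as described above could be declared roughly
like this; it is only a sketch, and the exact field names and types in the
patch may differ.

#include <stddef.h>
#include <stdint.h>

/* Sketch of the .partial file header; the data blocks follow the header, in
 * the same order as the block numbers listed in it. */
typedef struct partial_file_header
{
    uint32_t    magic;          /* 4-byte magic number */
    uint32_t    checksum;       /* 4-byte CRC covering the header */
    uint32_t    nblocks;        /* number of blocks included in the file */
    uint32_t    blocknumbers[]; /* nblocks block numbers, 4 bytes each */
} partial_file_header;

/* Size of the header for a given block count; block data starts right after. */
#define PARTIAL_HEADER_SIZE(nblocks) \
    (offsetof(partial_file_header, blocknumbers) + (nblocks) * sizeof(uint32_t))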


We are also working on combining these incremental backups with the full backup, and for that we are planning to add a new utility called pg_combinebackup. We will post the details on that later, once we are on the same page about taking the backup.

Thanks
--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation

Re: block-level incremental backup

From
Jeevan Chalke
Date:


On Tue, Jul 30, 2019 at 1:58 AM Robert Haas <robertmhaas@gmail.com> wrote:

I haven't had a chance to look at Jeevan's patch at all, or yours in
any detail, as yet, so these are just some very preliminary comments.
It will be good, however, if we can agree on who is going to do what
part of this as we try to drive this forward together.  I'm sorry that
I didn't communicate EDB's plans to work on this more clearly;
duplicated effort serves nobody well.

I had a look over Anastasia's PoC patch to understand the approach she has
taken and here are my observations.

1.
The patch first creates a .blockmap file for each relation file, containing an
array of all modified block numbers. This is done by reading all blocks from the
file (in chunks of 4 blocks, 32kb in total, in a loop) and comparing each page
LSN with the given LSN. Later, to create the .partial file, the relation file is
opened again and all blocks are read in chunks of 4 in a loop. Blocks found to
be modified are copied into another buffer, and after scanning all 4 blocks, the
copied blocks are written to the .partial file.

In this approach, each file is opened and read twice, which looks more expensive
to me. In my patch, I do that just once; however, I read the entire file into
memory to check which blocks are modified, whereas in Anastasia's design at most
TAR_SEND_SIZE (32kb) is read at a time, in a loop. I need to do that because we
want to know how heavily the file was modified, so that we can send the entire
file if it was modified beyond the threshold (currently 90%).
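
As a rough illustration of that 90% decision (the constant and names here are
only placeholders; the real check sits inside sendFile() and may differ in
detail):

/* Sketch: send the whole file if the changed fraction exceeds the threshold. */
#define CHANGE_THRESHOLD 0.90

static int
should_send_whole_file(int nblocks_total, int nblocks_changed)
{
    if (nblocks_total == 0)
        return 1;               /* empty segment: nothing to gain from .partial */
    return (double) nblocks_changed / nblocks_total > CHANGE_THRESHOLD;
}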

2.
Also, while sending modified blocks, they are copied into another buffer;
instead, they could be sent directly from the file contents already read (in
BLCKSZ-sized blocks). Here, the .blockmap created earlier is not used. In my
implementation, we send just a .partial file with a header containing all the
required details, such as the number of blocks changed and the block numbers,
plus a CRC, followed by the blocks themselves.

3.
I tried compiling Anastasia's patch, but I am getting an error, so I could not
see or test how it goes. Also, like the normal backup option, the incremental
backup option needs to verify checksums if requested.

4.
While combining full and incremental backups, files from the incremental backup
are just copied into the full backup directory, whereas in the design I posted
earlier we go the other way round, to avoid overwriting and the other issues I
explained earlier.

I am almost done writing the patch for pg_combinebackup and will post soon.
 

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Thanks
--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: block-level incremental backup

From
Ibrar Ahmed
Date:


On Tue, Jul 30, 2019 at 1:28 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jul 10, 2019 at 2:17 PM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> In attachments, you can find a prototype of incremental pg_basebackup,
> which consists of 2 features:
>
> 1) To perform incremental backup one should call pg_basebackup with a
> new argument:
>
> pg_basebackup -D 'basedir' --prev-backup-start-lsn 'lsn'
>
> where lsn is a start_lsn of parent backup (can be found in
> "backup_label" file)
>
> It calls BASE_BACKUP replication command with a new argument
> PREV_BACKUP_START_LSN 'lsn'.
>
> For datafiles, only pages with LSN > prev_backup_start_lsn will be
> included in the backup.
> They are saved into 'filename.partial' file, 'filename.blockmap' file
> contains an array of BlockNumbers.
> For example, if we backuped blocks 1,3,5, filename.partial will contain
> 3 blocks, and 'filename.blockmap' will contain array {1,3,5}.

I think it's better to keep both the information about changed blocks
and the contents of the changed blocks in a single file.  The list of
changed blocks is probably quite short, and I don't really want to
double the number of files in the backup if there's no real need. I
suspect it's just overall a bit simpler to keep everything together.
I don't think this is a make-or-break thing, and welcome contrary
arguments, but that's my preference.

I have experience working on a similar product, and I agree with Robert that
keeping the changed-block info and the changed blocks in a single file makes more sense.
+1

> 2) To merge incremental backup into a full backup call
>
> pg_basebackup -D 'basedir' --incremental-pgdata 'incremental_basedir'
> --merge-backups
>
> It will move all files from 'incremental_basedir' to 'basedir' handling
> '.partial' files correctly.

This, to me, looks like it's much worse than the design that I
proposed originally.  It means that:

1. You can't take an incremental backup without having the full backup
available at the time you want to take the incremental backup.

2. You're always storing a full backup, which means that you need more
disk space, and potentially much more I/O while taking the backup.
You save on transfer bandwidth, but you add a lot of disk reads and
writes, costs which have to be paid even if the backup is never
restored.

> 1) Whether we collect block maps using simple "read everything page by
> page" approach
> or WAL scanning or any other page tracking algorithm, we must choose a
> map format.
> I implemented the simplest one, while there are more ideas:

I think we should start simple.

I haven't had a chance to look at Jeevan's patch at all, or yours in
any detail, as yet, so these are just some very preliminary comments.
It will be good, however, if we can agree on who is going to do what
part of this as we try to drive this forward together.  I'm sorry that
I didn't communicate EDB's plans to work on this more clearly;
duplicated effort serves nobody well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




--
Ibrar Ahmed

Re: block-level incremental backup

From
vignesh C
Date:
On Tue, Jul 30, 2019 at 1:58 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jul 10, 2019 at 2:17 PM Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
> > In attachments, you can find a prototype of incremental pg_basebackup,
> > which consists of 2 features:
> >
> > 1) To perform incremental backup one should call pg_basebackup with a
> > new argument:
> >
> > pg_basebackup -D 'basedir' --prev-backup-start-lsn 'lsn'
> >
> > where lsn is a start_lsn of parent backup (can be found in
> > "backup_label" file)
> >
> > It calls BASE_BACKUP replication command with a new argument
> > PREV_BACKUP_START_LSN 'lsn'.
> >
> > For datafiles, only pages with LSN > prev_backup_start_lsn will be
> > included in the backup.
>>
One thought: if the file is not modified, there is no need to check the LSN.
>>
> > They are saved into 'filename.partial' file, 'filename.blockmap' file
> > contains an array of BlockNumbers.
> > For example, if we backuped blocks 1,3,5, filename.partial will contain
> > 3 blocks, and 'filename.blockmap' will contain array {1,3,5}.
>
> I think it's better to keep both the information about changed blocks
> and the contents of the changed blocks in a single file.  The list of
> changed blocks is probably quite short, and I don't really want to
> double the number of files in the backup if there's no real need. I
> suspect it's just overall a bit simpler to keep everything together.
> I don't think this is a make-or-break thing, and welcome contrary
> arguments, but that's my preference.
>
I feel Robert's suggestion is good.
We can probably keep one meta file for each backup with some basic information
of all the files being backed up, this metadata file will be useful in the
below case:
Table dropped before incremental backup
Table truncated and Insert/Update/Delete operations before incremental backup

I feel if we have the metadata, we can add some optimization to decide the
above scenario with the metadata information to identify the file deletion
and avoiding write and delete for pg_combinebackup which Jeevan has told in
his previous mail.

Probably it can also help us to decide which work the worker needs to do
if we are planning to backup in parallel.

Regards,
vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: block-level incremental backup

From
Robert Haas
Date:
On Wed, Jul 31, 2019 at 1:59 PM vignesh C <vignesh21@gmail.com> wrote:
> I feel Robert's suggestion is good.
> We can probably keep one meta file for each backup with some basic information
> of all the files being backed up, this metadata file will be useful in the
> below case:
> Table dropped before incremental backup
> Table truncated and Insert/Update/Delete operations before incremental backup

There's really no need for this with the design I proposed.  The files
that should exist when you restore in incremental backup are exactly
the set of files that exist in the final incremental backup, except
that any .partial files need to be replaced with a correct
reconstruction of the underlying file.  You don't need to know what
got dropped or truncated; you only need to know what's supposed to be
there at the end.

You may be thinking, as I once did, that restoring an incremental
backup would consist of restoring the full backup first and then
layering the incrementals over it, but if you read what I proposed, it
actually works the other way around: you restore the files that are
present in the incremental, and as needed, pull pieces of them from
earlier incremental and/or full backups.  I think this is a *much*
better design than doing it the other way; it avoids any risk of
getting the wrong answer due to truncations or drops, and it also is
faster, because you only read older backups to the extent that you
actually need their contents.
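
As a hedged sketch of that direction (all names here are hypothetical, not the
proposed tool's API): start from the newest backup in the chain and consult
older backups only for the blocks the newer ones do not carry.

#include <stdbool.h>

typedef struct Backup Backup;   /* one backup in a newest-to-oldest chain */

extern bool backup_has_block(Backup *b, const char *relfile,
                             unsigned blkno, char *out_block);
extern Backup *older_backup(Backup *b);

/* Fetch one block of 'relfile', reading older backups only as needed. */
static bool
fetch_block(Backup *newest, const char *relfile, unsigned blkno, char *buf)
{
    for (Backup *b = newest; b != NULL; b = older_backup(b))
    {
        if (backup_has_block(b, relfile, blkno, buf))
            return true;        /* found in this (possibly partial) copy */
    }
    return false;               /* should not happen for a valid backup chain */
}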

I think it's a good idea to try to keep all the information about a
single file being backup in one place. It's just less confusing.  If,
for example, you have a metadata file that tells you which files are
dropped - that is, which files you DON'T have - then what happen if
one of those files is present in the data directory after all?  Well,
then you have inconsistent information and are confused, and maybe
your code won't even notice the inconsistency.  Similarly, if the
metadata file is separate from the block data, then what happens if
one file is missing, or isn't from the same backup as the other file?
That shouldn't happen, of course, but if it does, you'll get confused.
There's no perfect solution to these kinds of problems: if we suppose
that the backup can be corrupted by having missing or extra files, why
not also corruption within a single file? Still, on balance I tend to
think that keeping related stuff together minimizes the surface area
for bugs.  I realize that's arguable, though.

One consideration that goes the other way: if you have a manifest file
that says what files are supposed to be present in the backup, then
you can detect a disappearing file, which is impossible with the
design I've proposed (and with the current full backup machinery).
That might be worth fixing, but it's a separate feature that has
little to do with incremental backup.

> Probably it can also help us to decide which work the worker needs to do
> if we are planning to backup in parallel.

I don't think we need a manifest file for parallel backup.  One
process or thread can scan the directory tree, make a list of which
files are present, and then hand individual files off to other
processes or threads. In short, the directory listing serves as the
manifest.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Jeevan Chalke
Date:


On Tue, Jul 30, 2019 at 9:39 AM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:



I am almost done writing the patch for pg_combinebackup and will post soon.

Attached patch which implements the pg_combinebackup utility used to combine
full basebackup with one or more incremental backups.

I have tested it manually and it works for all best cases.

Let me know if you have any inputs/suggestions/review comments?

Thanks
--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: block-level incremental backup

From
vignesh C
Date:
On Thu, Aug 1, 2019 at 5:06 PM Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
> On Tue, Jul 30, 2019 at 9:39 AM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
>>
>> I am almost done writing the patch for pg_combinebackup and will post soon.
>
>
> Attached patch which implements the pg_combinebackup utility used to combine
> full basebackup with one or more incremental backups.
>
> I have tested it manually and it works for all best cases.
>
> Let me know if you have any inputs/suggestions/review comments?
>
Some comments:
1) There will be some link files created for tablespace, we might
require some special handling for it

2)
+ while (numretries <= maxretries)
+ {
+ rc = system(copycmd);
+ if (rc == 0)
+ return;
+
+ pg_log_info("could not copy, retrying after %d seconds",
+ sleeptime);
+ pg_usleep(numretries++ * sleeptime * 1000000L);
+ }
Retry functionality is handled only for copying of full files; should
we handle retry for copying of partial files?

3)
+ maxretries = atoi(optarg);
+ if (maxretries < 0)
+ {
+ pg_log_error("invalid value for maxretries");
+ fprintf(stderr, _("%s: -r maxretries must be >= 0\n"), progname);
+ exit(1);
+ }
+ break;
+ case 's':
+ sleeptime = atoi(optarg);
+ if (sleeptime <= 0 || sleeptime > 60)
+ {
+ pg_log_error("invalid value for sleeptime");
+ fprintf(stderr, _("%s: -s sleeptime must be between 1 and 60\n"), progname);
+ exit(1);
+ }
+ break;
we can have some range for maxretries similar to sleeptime

4)
+ fp = fopen(filename, "r");
+ if (fp == NULL)
+ {
+ pg_log_error("could not read file \"%s\": %m", filename);
+ exit(1);
+ }
+
+ labelfile = malloc(statbuf.st_size + 1);
+ if (fread(labelfile, 1, statbuf.st_size, fp) != statbuf.st_size)
+ {
+ pg_log_error("corrupted file \"%s\": %m", filename);
+ free(labelfile);
+ exit(1);
+ }
Should we check for malloc failure

5) Should we add display of progress as backup may take some time,
this can be added as enhancement. We can get other's opinion on this.

6)
+ if (nIncrDir == MAX_INCR_BK_COUNT)
+ {
+ pg_log_error("too many incremental backups to combine");
+ fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+ exit(1);
+ }
+
+ IncrDirs[nIncrDir] = optarg;
+ nIncrDir++;
+ break;

If the backup count increases, providing the input may be difficult.
Should the user provide all the incremental backups from a parent folder,
and can we handle the ordering of the incremental backups internally?

7)
+ if (isPartialFile)
+ {
+ if (verbose)
+ pg_log_info("combining partial file \"%s.partial\"", fn);
+
+ combine_partial_files(fn, IncrDirs, nIncrDir, subdirpath, outfn);
+ }
+ else
+ copy_whole_file(infn, outfn);

Add verbose for copying whole file

8) We can also check if approximate space is available in disk before
starting combine backup, this can be added as enhancement. We can get
other's opinion on this.

9)
+ printf(_("  -i, --incr-backup=DIRECTORY incremental backup directory
(maximum %d)\n"), MAX_INCR_BK_COUNT);
+ printf(_("  -o, --output-dir=DIRECTORY  combine backup into directory\n"));
+ printf(_("\nGeneral options:\n"));
+ printf(_("  -n, --no-clean              do not clean up after errors\n"));

Combine backup into directory can be combine backup directory

10)
+/* Max number of incremental backups to be combined. */
+#define MAX_INCR_BK_COUNT 10
+
+/* magic number in incremental backup's .partial file */

MAX_INCR_BK_COUNT can be increased little, some applications use 1
full backup at the beginning of the month and use 30 incremental
backups rest of the days in the month

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: block-level incremental backup

From
Robert Haas
Date:
On Fri, Aug 2, 2019 at 9:13 AM vignesh C <vignesh21@gmail.com> wrote:
> + rc = system(copycmd);

I don't think this patch should be calling system() in the first place.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Fri, Aug 2, 2019 at 9:13 AM vignesh C <vignesh21@gmail.com> wrote:
> > + rc = system(copycmd);
>
> I don't think this patch should be calling system() in the first place.

+1.

Thanks,

Stephen

Attachment

Re: block-level incremental backup

From
Ibrar Ahmed
Date:

I have not looked at the patch in detail, but just some nits from my side.

On Fri, Aug 2, 2019 at 6:13 PM vignesh C <vignesh21@gmail.com> wrote:
On Thu, Aug 1, 2019 at 5:06 PM Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
> On Tue, Jul 30, 2019 at 9:39 AM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
>>
>> I am almost done writing the patch for pg_combinebackup and will post soon.
>
>
> Attached patch which implements the pg_combinebackup utility used to combine
> full basebackup with one or more incremental backups.
>
> I have tested it manually and it works for all best cases.
>
> Let me know if you have any inputs/suggestions/review comments?
>
Some comments:
1) There will be some link files created for tablespace, we might
require some special handling for it

2)
+ while (numretries <= maxretries)
+ {
+ rc = system(copycmd);
+ if (rc == 0)
+ return;

Use API to copy the file instead of "system", better to use the secure copy.
 
+ pg_log_info("could not copy, retrying after %d seconds",
+ sleeptime);
+ pg_usleep(numretries++ * sleeptime * 1000000L);
+ }
Retry functionality is handled only for copying of full files; should
we handle retry for copying of partial files?

The log message and the sleep time do not match: you are multiplying sleeptime by numretries++ but logging only "sleeptime".

Why are we retrying here? Capture the actual copy error and act accordingly; blindly retrying does not make sense.

3)
+ maxretries = atoi(optarg);
+ if (maxretries < 0)
+ {
+ pg_log_error("invalid value for maxretries");
+ fprintf(stderr, _("%s: -r maxretries must be >= 0\n"), progname);
+ exit(1);
+ }
+ break;
+ case 's':
+ sleeptime = atoi(optarg);
+ if (sleeptime <= 0 || sleeptime > 60)
+ {
+ pg_log_error("invalid value for sleeptime");
+ fprintf(stderr, _("%s: -s sleeptime must be between 1 and 60\n"), progname);
+ exit(1);
+ }
+ break;
we can have some range for maxretries similar to sleeptime

4)
+ fp = fopen(filename, "r");
+ if (fp == NULL)
+ {
+ pg_log_error("could not read file \"%s\": %m", filename);
+ exit(1);
+ }
+
+ labelfile = malloc(statbuf.st_size + 1);
+ if (fread(labelfile, 1, statbuf.st_size, fp) != statbuf.st_size)
+ {
+ pg_log_error("corrupted file \"%s\": %m", filename);
+ free(labelfile);
+ exit(1);
+ }
Should we check for malloc failure

Use pg_malloc instead of malloc
 
5) Should we add display of progress as backup may take some time,
this can be added as enhancement. We can get other's opinion on this.

Yes, we should, but this is not the right time to do that.
 
6)
+ if (nIncrDir == MAX_INCR_BK_COUNT)
+ {
+ pg_log_error("too many incremental backups to combine");
+ fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+ exit(1);
+ }
+
+ IncrDirs[nIncrDir] = optarg;
+ nIncrDir++;
+ break;

If the backup count increases, providing the input may be difficult.
Should the user provide all the incremental backups from a parent folder,
and can we handle the ordering of the incremental backups internally?

Why do we have that limit in the first place?
  
7)
+ if (isPartialFile)
+ {
+ if (verbose)
+ pg_log_info("combining partial file \"%s.partial\"", fn);
+
+ combine_partial_files(fn, IncrDirs, nIncrDir, subdirpath, outfn);
+ }
+ else
+ copy_whole_file(infn, outfn);

Add verbose for copying whole file

8) We can also check if approximate space is available in disk before
starting combine backup, this can be added as enhancement. We can get
other's opinion on this.

9)
+ printf(_("  -i, --incr-backup=DIRECTORY incremental backup directory
(maximum %d)\n"), MAX_INCR_BK_COUNT);
+ printf(_("  -o, --output-dir=DIRECTORY  combine backup into directory\n"));
+ printf(_("\nGeneral options:\n"));
+ printf(_("  -n, --no-clean              do not clean up after errors\n"));

Combine backup into directory can be combine backup directory

10)
+/* Max number of incremental backups to be combined. */
+#define MAX_INCR_BK_COUNT 10
+
+/* magic number in incremental backup's .partial file */

MAX_INCR_BK_COUNT can be increased little, some applications use 1
full backup at the beginning of the month and use 30 incremental
backups rest of the days in the month

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




--
Ibrar Ahmed

Re: block-level incremental backup

From
Ibrar Ahmed
Date:


On Tue, Aug 6, 2019 at 11:31 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:

I have not looked at the patch in detail, but just some nits from my side.

On Fri, Aug 2, 2019 at 6:13 PM vignesh C <vignesh21@gmail.com> wrote:
On Thu, Aug 1, 2019 at 5:06 PM Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
> On Tue, Jul 30, 2019 at 9:39 AM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
>>
>> I am almost done writing the patch for pg_combinebackup and will post soon.
>
>
> Attached patch which implements the pg_combinebackup utility used to combine
> full basebackup with one or more incremental backups.
>
> I have tested it manually and it works for all best cases.
>
> Let me know if you have any inputs/suggestions/review comments?
>
Some comments:
1) There will be some link files created for tablespace, we might
require some special handling for it

2)
+ while (numretries <= maxretries)
+ {
+ rc = system(copycmd);
+ if (rc == 0)
+ return;

Use API to copy the file instead of "system", better to use the secure copy.
Ah, it is a local copy; a simple copy API is enough.
 
+ pg_log_info("could not copy, retrying after %d seconds",
+ sleeptime);
+ pg_usleep(numretries++ * sleeptime * 1000000L);
+ }
Retry functionality is handled only for copying of full files; should
we handle retry for copying of partial files?

The log message and the sleep time do not match: you are multiplying sleeptime by numretries++ but logging only "sleeptime".

Why are we retrying here? Capture the actual copy error and act accordingly; blindly retrying does not make sense.

3)
+ maxretries = atoi(optarg);
+ if (maxretries < 0)
+ {
+ pg_log_error("invalid value for maxretries");
+ fprintf(stderr, _("%s: -r maxretries must be >= 0\n"), progname);
+ exit(1);
+ }
+ break;
+ case 's':
+ sleeptime = atoi(optarg);
+ if (sleeptime <= 0 || sleeptime > 60)
+ {
+ pg_log_error("invalid value for sleeptime");
+ fprintf(stderr, _("%s: -s sleeptime must be between 1 and 60\n"), progname);
+ exit(1);
+ }
+ break;
we can have some range for maxretries similar to sleeptime

4)
+ fp = fopen(filename, "r");
+ if (fp == NULL)
+ {
+ pg_log_error("could not read file \"%s\": %m", filename);
+ exit(1);
+ }
+
+ labelfile = malloc(statbuf.st_size + 1);
+ if (fread(labelfile, 1, statbuf.st_size, fp) != statbuf.st_size)
+ {
+ pg_log_error("corrupted file \"%s\": %m", filename);
+ free(labelfile);
+ exit(1);
+ }
Should we check for malloc failure

Use pg_malloc instead of malloc
 
5) Should we add display of progress as backup may take some time,
this can be added as enhancement. We can get other's opinion on this.

Yes, we should, but this is not the right time to do that.
 
6)
+ if (nIncrDir == MAX_INCR_BK_COUNT)
+ {
+ pg_log_error("too many incremental backups to combine");
+ fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+ exit(1);
+ }
+
+ IncrDirs[nIncrDir] = optarg;
+ nIncrDir++;
+ break;

If the backup count increases, providing the input may be difficult.
Should the user provide all the incremental backups from a parent folder,
and can we handle the ordering of the incremental backups internally?

Why do we have that limit in the first place?
  
7)
+ if (isPartialFile)
+ {
+ if (verbose)
+ pg_log_info("combining partial file \"%s.partial\"", fn);
+
+ combine_partial_files(fn, IncrDirs, nIncrDir, subdirpath, outfn);
+ }
+ else
+ copy_whole_file(infn, outfn);

Add verbose for copying whole file

8) We can also check if approximate space is available in disk before
starting combine backup, this can be added as enhancement. We can get
other's opinion on this.

9)
+ printf(_("  -i, --incr-backup=DIRECTORY incremental backup directory
(maximum %d)\n"), MAX_INCR_BK_COUNT);
+ printf(_("  -o, --output-dir=DIRECTORY  combine backup into directory\n"));
+ printf(_("\nGeneral options:\n"));
+ printf(_("  -n, --no-clean              do not clean up after errors\n"));

Combine backup into directory can be combine backup directory

10)
+/* Max number of incremental backups to be combined. */
+#define MAX_INCR_BK_COUNT 10
+
+/* magic number in incremental backup's .partial file */

MAX_INCR_BK_COUNT can be increased little, some applications use 1
full backup at the beginning of the month and use 30 incremental
backups rest of the days in the month

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




--
Ibrar Ahmed


--
Ibrar Ahmed

Re: block-level incremental backup

From
Jeevan Chalke
Date:


On Mon, Aug 5, 2019 at 7:13 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Aug 2, 2019 at 9:13 AM vignesh C <vignesh21@gmail.com> wrote:
> + rc = system(copycmd);

I don't think this patch should be calling system() in the first place.

So, do you mean we should just do fread() and fwrite() for the whole file?

I thought it is better if it was done by the OS itself instead of reading 1GB
into the memory and writing the same to the file.


--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: block-level incremental backup

From
Ibrar Ahmed
Date:


On Wed, Aug 7, 2019 at 2:47 PM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:


On Mon, Aug 5, 2019 at 7:13 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Aug 2, 2019 at 9:13 AM vignesh C <vignesh21@gmail.com> wrote:
> + rc = system(copycmd);

I don't think this patch should be calling system() in the first place.

So, do you mean we should just do fread() and fwrite() for the whole file?

I thought it is better if it was done by the OS itself instead of reading 1GB
into the memory and writing the same to the file.

It is not necessary to read the whole 1GB into RAM.
 

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company



--
Ibrar Ahmed

Re: block-level incremental backup

From
Jeevan Ladhe
Date:
Hi Jeevan,

I have reviewed the backup part at the code level and am still looking into the
restore (combine) part and the functional aspects of it. But here are my comments so far:

The patches need rebase.
----------------------------------------------------
+       if (!XLogRecPtrIsInvalid(previous_lsn))
+           appendStringInfo(labelfile, "PREVIOUS WAL LOCATION: %X/%X\n",
+                            (uint32) (previous_lsn >> 32), (uint32) previous_lsn);

May be we should rename to something like:
"INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP START LOCATION"
to make it more intuitive?

----------------------------------------------------

+typedef struct                                                                                   
+{                                                                                                
+   uint32      magic;                                                                            
+   pg_crc32c   checksum;                                                                         
+   uint32      nblocks;                                                                          
+   uint32      blocknumbers[FLEXIBLE_ARRAY_MEMBER];                                              
+} partial_file_header;                                                                           

File header structure is defined in both the files basebackup.c and
pg_combinebackup.c. I think it is better to move this to replication/basebackup.h.

----------------------------------------------------

+   bool        isrelfile = false;

I think we can avoid having flag isrelfile in sendFile(). 
Something like this:

if (startincrptr && OidIsValid(dboid) && looks_like_rel_name(filename))
{
//include the code here that is under "if (isrelfile)" block.
}
else
{
_tarWriteHeader(tarfilename, NULL, statbuf, false);
while ((cnt = fread(buf, 1, Min(sizeof(buf), statbuf->st_size - len), fp)) > 0)
{
...
}
}

----------------------------------------------------

Also, having isrelfile as part of following condition:
{code}
+   while (!isrelfile &&
+          (cnt = fread(buf, 1, Min(sizeof(buf), statbuf->st_size - len), fp)) > 0)
{code}

is confusing, because even the relation files in full backup are going to be
backed up by this loop only, but still, the condition reads '(!isrelfile &&...)'.

----------------------------------------------------

verify_page_checksum()
{
while(1)
{
....
break;
}
}

IMHO, while labels are not advisable in general, it may be better to use a label
here rather than a while(1) loop, so that we can move to the label in case we
want to retry once. I think here it opens doors for future bugs if someone
happens to add code here, ending up adding some condition and then the
break becomes conditional. That will leave us in an infinite loop.

----------------------------------------------------

+/* magic number in incremental backup's .partial file */
+#define INCREMENTAL_BACKUP_MAGIC   0x494E4352

Similar to structure partial_file_header, I think above macro can also be moved
to basebackup.h instead of defining it twice.

----------------------------------------------------

In sendFile():

+       buf = (char *) malloc(RELSEG_SIZE * BLCKSZ);

I think this is a huge memory request (1GB) and may fail on a busy/loaded server
at times. We should check for malloc failure, and maybe throw a proper error on
getting ENOMEM as errno.

----------------------------------------------------

+       /* Perform incremenatl backup stuff here. */
+       if ((cnt = fread(buf, 1, Min(RELSEG_SIZE * BLCKSZ, statbuf->st_size), fp)) > 0)
+       {

Here, should we not expect statbuf->st_size < (RELSEG_SIZE * BLCKSZ), in which
case it should always be safe to read just statbuf->st_size, I guess? But I am
OK with having this extra guard here.

----------------------------------------------------

In sendFile(), I am sorry if I am missing something, but I am not able to
understand why 'cnt' and 'i' should have different values when they are being
passed to verify_page_checksum(). I think passing only one of them should be
sufficient.

----------------------------------------------------

+               XLogRecPtr  pglsn;
+
+               for (i = 0; i < cnt / BLCKSZ; i++)
+               {

Maybe we should just have a variable no_of_blocks to store the number of blocks,
rather than calculating this, say, RELSEG_SIZE (i.e. 131072) times in the worst
case.

----------------------------------------------------
+               len += cnt;
+               throttle(cnt);
+           }    

Sorry if I am missing something, but, should not it be just:

len = cnt;

----------------------------------------------------

As I said in my previous email, we no longer need decode_lsn_internal(), as this
is already taken care of by the introduction of the function pg_lsn_in_internal().

Regards,
Jeevan Ladhe

Re: block-level incremental backup

From
Robert Haas
Date:
On Wed, Aug 7, 2019 at 5:46 AM Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> So, do you mean we should just do fread() and fwrite() for the whole file?
>
> I thought it is better if it was done by the OS itself instead of reading 1GB
> into the memory and writing the same to the file.

Well, 'cp' is just a C program.  If they can write code to copy a
file, so can we, and then we're not dependent on 'cp' being installed,
working properly, being in the user's path or at the hard-coded
pathname we expect, etc.  There's an existing copy_file() function in
src/backend/storage/file/copydir.c which I'd probably look into
adapting for frontend use.  I'm not sure whether it would be important
to adapt the data-flushing code that's present in that routine or
whether we could get by with just the loop to read() and write() data.
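
For illustration, a minimal frontend copy loop of the kind being discussed
might look like the sketch below; error handling is abbreviated, and this is
not the actual copy_file() from copydir.c.

#include <stdio.h>
#include <stdlib.h>

#define COPY_BUF_SIZE (128 * 1024)

/* Copy src to dst in fixed-size chunks, never holding a whole 1GB segment in
 * memory at once. */
static void
copy_file_simple(const char *src, const char *dst)
{
    FILE       *in = fopen(src, "rb");
    FILE       *out = fopen(dst, "wb");
    char       *buf = malloc(COPY_BUF_SIZE);
    size_t      nread;

    if (in == NULL || out == NULL || buf == NULL)
    {
        fprintf(stderr, "could not open files or allocate copy buffer\n");
        exit(1);
    }

    while ((nread = fread(buf, 1, COPY_BUF_SIZE, in)) > 0)
    {
        if (fwrite(buf, 1, nread, out) != nread)
        {
            fprintf(stderr, "could not write to \"%s\"\n", dst);
            exit(1);
        }
    }

    free(buf);
    fclose(out);
    fclose(in);
}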

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Robert Haas
Date:
On Thu, Aug 8, 2019 at 8:37 PM Jeevan Ladhe
<jeevan.ladhe@enterprisedb.com> wrote:
> +       if (!XLogRecPtrIsInvalid(previous_lsn))
> +           appendStringInfo(labelfile, "PREVIOUS WAL LOCATION: %X/%X\n",
> +                            (uint32) (previous_lsn >> 32), (uint32) previous_lsn);
>
> May be we should rename to something like:
> "INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP START LOCATION"
> to make it more intuitive?

So, I think that you are right that PREVIOUS WAL LOCATION might not be
entirely clear, but at least in my view, INCREMENTAL BACKUP START WAL
LOCATION is definitely not clear.  This backup is an incremental
backup, and it has a start WAL location, so you'd end up with START
WAL LOCATION and INCREMENTAL BACKUP START WAL LOCATION and those sound
like they ought to both be the same thing, but they're not.  Perhaps
something like REFERENCE WAL LOCATION or REFERENCE WAL LOCATION FOR
INCREMENTAL BACKUP would be clearer.

> File header structure is defined in both the files basebackup.c and
> pg_combinebackup.c. I think it is better to move this to replication/basebackup.h.

Or some other header, but yeah, definitely don't duplicate the struct
definition (or any other kind of definition).

> IMHO, while labels are not advisable in general, it may be better to use a label
> here rather than a while(1) loop, so that we can move to the label in case we
> want to retry once. I think here it opens doors for future bugs if someone
> happens to add code here, ending up adding some condition and then the
> break becomes conditional. That will leave us in an infinite loop.

I'm not sure which style is better here, but I don't really buy this argument.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Jeevan Ladhe
Date:
Hi Robert,

On Fri, Aug 9, 2019 at 6:40 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Aug 8, 2019 at 8:37 PM Jeevan Ladhe
<jeevan.ladhe@enterprisedb.com> wrote:
> +       if (!XLogRecPtrIsInvalid(previous_lsn))
> +           appendStringInfo(labelfile, "PREVIOUS WAL LOCATION: %X/%X\n",
> +                            (uint32) (previous_lsn >> 32), (uint32) previous_lsn);
>
> May be we should rename to something like:
> "INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP START LOCATION"
> to make it more intuitive?

So, I think that you are right that PREVIOUS WAL LOCATION might not be
entirely clear, but at least in my view, INCREMENTAL BACKUP START WAL
LOCATION is definitely not clear.  This backup is an incremental
backup, and it has a start WAL location, so you'd end up with START
WAL LOCATION and INCREMENTAL BACKUP START WAL LOCATION and those sound
like they ought to both be the same thing, but they're not.  Perhaps
something like REFERENCE WAL LOCATION or REFERENCE WAL LOCATION FOR
INCREMENTAL BACKUP would be clearer.

Agree, how about INCREMENTAL BACKUP REFERENCE WAL LOCATION ?
 
> File header structure is defined in both the files basebackup.c and
> pg_combinebackup.c. I think it is better to move this to replication/basebackup.h.

Or some other header, but yeah, definitely don't duplicate the struct
definition (or any other kind of definition).

Thanks.
 
> IMHO, while labels are not advisable in general, it may be better to use a label
> here rather than a while(1) loop, so that we can move to the label in case we
> want to retry once. I think here it opens doors for future bugs if someone
> happens to add code here, ending up adding some condition and then the
> break becomes conditional. That will leave us in an infinite loop.

I'm not sure which style is better here, but I don't really buy this argument.

No issues. I am ok either way.

Regards,
Jeevan Ladhe

Re: block-level incremental backup

From
Jeevan Chalke
Date:


On Fri, Aug 9, 2019 at 6:36 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Aug 7, 2019 at 5:46 AM Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> So, do you mean we should just do fread() and fwrite() for the whole file?
>
> I thought it is better if it was done by the OS itself instead of reading 1GB
> into the memory and writing the same to the file.

Well, 'cp' is just a C program.  If they can write code to copy a
file, so can we, and then we're not dependent on 'cp' being installed,
working properly, being in the user's path or at the hard-coded
pathname we expect, etc.  There's an existing copy_file() function in
src/backend/storage/file/copydir.c which I'd probably look into
adapting for frontend use.  I'm not sure whether it would be important
to adapt the data-flushing code that's present in that routine or
whether we could get by with just the loop to read() and write() data.

Agree that we can certainly use open(), read(), write(), and close() here, but
given that pg_basebackup.c and basebackup.c are using file operations, I think
using fopen(), fread(), fwrite(), and fclose() will be better here, at least
for consistency.

Let me know if we still want to go with native OS calls.
 

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: block-level incremental backup

From
Jeevan Chalke
Date:


On Fri, Aug 9, 2019 at 11:56 PM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote:
Hi Robert,

On Fri, Aug 9, 2019 at 6:40 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Aug 8, 2019 at 8:37 PM Jeevan Ladhe
<jeevan.ladhe@enterprisedb.com> wrote:
> +       if (!XLogRecPtrIsInvalid(previous_lsn))
> +           appendStringInfo(labelfile, "PREVIOUS WAL LOCATION: %X/%X\n",
> +                            (uint32) (previous_lsn >> 32), (uint32) previous_lsn);
>
> May be we should rename to something like:
> "INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP START LOCATION"
> to make it more intuitive?

So, I think that you are right that PREVIOUS WAL LOCATION might not be
entirely clear, but at least in my view, INCREMENTAL BACKUP START WAL
LOCATION is definitely not clear.  This backup is an incremental
backup, and it has a start WAL location, so you'd end up with START
WAL LOCATION and INCREMENTAL BACKUP START WAL LOCATION and those sound
like they ought to both be the same thing, but they're not.  Perhaps
something like REFERENCE WAL LOCATION or REFERENCE WAL LOCATION FOR
INCREMENTAL BACKUP would be clearer.

Agree, how about INCREMENTAL BACKUP REFERENCE WAL LOCATION ?

+1 for INCREMENTAL BACKUP REFERENCE WA.

 

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: block-level incremental backup

From
Jeevan Chalke
Date:


On Mon, Aug 12, 2019 at 5:29 PM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:


On Fri, Aug 9, 2019 at 11:56 PM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote:
Hi Robert,

On Fri, Aug 9, 2019 at 6:40 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Aug 8, 2019 at 8:37 PM Jeevan Ladhe
<jeevan.ladhe@enterprisedb.com> wrote:
> +       if (!XLogRecPtrIsInvalid(previous_lsn))
> +           appendStringInfo(labelfile, "PREVIOUS WAL LOCATION: %X/%X\n",
> +                            (uint32) (previous_lsn >> 32), (uint32) previous_lsn);
>
> May be we should rename to something like:
> "INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP START LOCATION"
> to make it more intuitive?

So, I think that you are right that PREVIOUS WAL LOCATION might not be
entirely clear, but at least in my view, INCREMENTAL BACKUP START WAL
LOCATION is definitely not clear.  This backup is an incremental
backup, and it has a start WAL location, so you'd end up with START
WAL LOCATION and INCREMENTAL BACKUP START WAL LOCATION and those sound
like they ought to both be the same thing, but they're not.  Perhaps
something like REFERENCE WAL LOCATION or REFERENCE WAL LOCATION FOR
INCREMENTAL BACKUP would be clearer.

Agree, how about INCREMENTAL BACKUP REFERENCE WAL LOCATION ?

+1 for INCREMENTAL BACKUP REFERENCE WA.

Sorry for the typo:
+1 for the INCREMENTAL BACKUP REFERENCE WAL LOCATION.


 

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: block-level incremental backup

From
Robert Haas
Date:
On Mon, Aug 12, 2019 at 7:57 AM Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> Agree that we can certainly use open(), read(), write(), and close() here, but
> given that pg_basebackup.c and basbackup.c are using file operations, I think
> using fopen(), fread(), fwrite(), and fclose() will be better here, at-least
> for consistetncy.

Oh, that's fine.  Whatever's more consistent with the pre-existing
code. Just, let's not use system().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Ibrar Ahmed
Date:


On Mon, Aug 12, 2019 at 4:57 PM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:


On Fri, Aug 9, 2019 at 6:36 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Aug 7, 2019 at 5:46 AM Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> So, do you mean we should just do fread() and fwrite() for the whole file?
>
> I thought it is better if it was done by the OS itself instead of reading 1GB
> into the memory and writing the same to the file.

Well, 'cp' is just a C program.  If they can write code to copy a
file, so can we, and then we're not dependent on 'cp' being installed,
working properly, being in the user's path or at the hard-coded
pathname we expect, etc.  There's an existing copy_file() function in
src/backed/storage/file/copydir.c which I'd probably look into
adapting for frontend use.  I'm not sure whether it would be important
to adapt the data-flushing code that's present in that routine or
whether we could get by with just the loop to read() and write() data.

Agree that we can certainly use open(), read(), write(), and close() here, but
given that pg_basebackup.c and basbackup.c are using file operations, I think
using fopen(), fread(), fwrite(), and fclose() will be better here, at-least
for consistetncy.
 
+1 for using  fopen(), fread(), fwrite(), and fclose()


Let me know if we still want to go with native OS calls.
 

-1 for OS call
 

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company



--
Ibrar Ahmed

Re: block-level incremental backup

From
Jeevan Chalke
Date:


On Fri, Aug 2, 2019 at 6:43 PM vignesh C <vignesh21@gmail.com> wrote:
Some comments:
1) There will be some link files created for tablespace, we might
require some special handling for it

Yep. I have that in my ToDo.
Will start working on that soon.
 
2)
Retry functionality is handled only for copying of full files; should
we handle retry for copying of partial files?
3)
we can have some range for maxretries similar to sleeptime

I took help from the pg_standby code related to maxretries and sleeptime.

However, as we don't want to use system() call now, I have
removed all this kludge and just used fread/fwrite as discussed.
 
4)
Should we check for malloc failure

Used pg_malloc() instead. Same is also suggested by Ibrar.
 

5) Should we add display of progress as backup may take some time,
this can be added as enhancement. We can get other's opinion on this.

Can be done afterward once we have the functionality in place.
 

6)
If the backup count increases, providing the input may be difficult.
Shall the user provide all the incremental backups from a parent folder,
and can we handle the ordering of incremental backups internally?

I am not sure of this yet. We need to provide the tablespace mapping too.
But thanks for putting a point here. Will keep that in mind when I revisit this.
 

7)
Add verbose for copying whole file
Done
 

8) We can also check if approximate space is available in disk before
starting combine backup, this can be added as enhancement. We can get
other's opinion on this.

Hmm... will leave it for now. User will get an error anyway.
 

9)
Combine backup into directory can be combine backup directory
Done
 

10)
MAX_INCR_BK_COUNT can be increased a little; some applications use 1
full backup at the beginning of the month and 30 incremental
backups for the rest of the days in the month.

Yeah, agree. But using any number here is debatable.
Let's see others opinion too.


Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Attached is a new set of patches with the refactoring done separately.
The incremental backup patch became smaller and is hopefully more
readable than the first version.

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: block-level incremental backup

From
Jeevan Chalke
Date:


On Fri, Aug 9, 2019 at 6:07 AM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote:
Hi Jeevan,

I have reviewed the backup part at code level and still looking into the
restore(combine) and functional part of it. But, here are my comments so far:

Thank you Jeevan Ladhe for reviewing the changes.
 

The patches need rebase.

Done.
 
May be we should rename to something like:
"INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP START LOCATION"
to make it more intuitive?

As discussed, used "INCREMENTAL BACKUP REFERENCE WAL LOCATION".
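
So the label-writing hunk quoted earlier presumably now reads along these lines (a sketch, not the exact patch hunk):

    if (!XLogRecPtrIsInvalid(previous_lsn))
        appendStringInfo(labelfile,
                         "INCREMENTAL BACKUP REFERENCE WAL LOCATION: %X/%X\n",
                         (uint32) (previous_lsn >> 32), (uint32) previous_lsn);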

File header structure is defined in both the files basebackup.c and
pg_combinebackup.c. I think it is better to move this to replication/basebackup.h.

Yep, that was on my cleanup list. Done now.
 
I think we can avoid having flag isrelfile in sendFile(). 
Something like this:
Also, having isrelfile as part of following condition:
is confusing, because even the relation files in full backup are going to be
backed up by this loop only, but still, the condition reads '(!isrelfile &&...)'.

In the refactored patch I have moved full backup code in a separate function.
And now all incremental backup code is also done in its own function.
Hopefully, the code is now more readable.
 

IMHO, while labels are not advisable in general, it may be better to use a label
here rather than a while(1) loop, so that we can move to the label in case we
want to retry once. I think here it opens doors for future bugs if someone
happens to add code here, ending up adding some condition and then the
break becomes conditional. That will leave us in an infinite loop.

I kept it as is as I don't see any correctness issue here.

Similar to structure partial_file_header, I think above macro can also be moved
to basebackup.h instead of defining it twice.

Yes. Done.
 
I think this is a huge memory request (1GB) and may fail on busy/loaded server at
times. We should check for failures of malloc, maybe throw some error on
getting ENOMEM as errno.

Agree. Done.
 
Here, should we not expect statbuf->st_size < (RELSEG_SIZE * BLCKSZ), and it
should be safe to read just statbuf->st_size always, I guess? But I am OK with
having this extra guard here.

Yes, we can do it this way. Added an Assert() before that and used just statbuf->st_size.

In sendFile(), I am sorry if I am missing something, but I am not able to
understand why 'cnt' and 'i' should have different values when they are being
passed to verify_page_checksum(). I think passing only one of them should be
sufficient.

As discussed offline, you meant to say i and blkno.
These two are different. i represents the current block's offset within the read
buffer, whereas blkno is the block's offset from the start of the file. For an
incremental backup they are the same, as we read the whole file, but they differ
in the case of a regular full backup, where we read 4 blocks at a time; the value
of i there will be between 0 and 3.
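
To spell that relationship out (variable names here are illustrative, not necessarily the patch's):

    /*
     * The read buffer holds one or more blocks.  For the j-th block in the
     * buffer, the block number within the file is
     *
     *     blkno = blocks_already_read + j;
     *
     * In the incremental-backup path the whole segment is read at once, so
     * blocks_already_read is 0 and blkno == j.  In the full-backup path the
     * buffer holds 4 blocks, so j only ranges over 0..3 while blkno keeps
     * counting from the start of the file.
     */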
 
Maybe we should just have a variable no_of_blocks to store the number of blocks,
rather than calculating this, say, RELSEG_SIZE (i.e. 131072) times in the worst
case.

OK. Done.
 
Sorry if I am missing something, but, should not it be just:

len = cnt;

Yeah. Done.
 
As I said earlier in my previous email, we now do not need +decode_lsn_internal()
as it is already taken care of by the introduction of the function pg_lsn_in_internal().

Yes. Done that and rebased on latest HEAD.
 

Regards,
Jeevan Ladhe

Patches attached in the previous reply.

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: block-level incremental backup

From
Ibrar Ahmed
Date:




On Fri, Aug 16, 2019 at 3:24 PM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:


On Fri, Aug 2, 2019 at 6:43 PM vignesh C <vignesh21@gmail.com> wrote:
Some comments:
1) There will be some link files created for tablespace, we might
require some special handling for it

Yep. I have that in my ToDo.
Will start working on that soon.
 
2)
Retry functionality is handled only for copying of full files; should
we handle retry for copying of partial files?
3)
we can have some range for maxretries similar to sleeptime

I took help from the pg_standby code related to maxretries and sleeptime.

However, as we don't want to use system() call now, I have
removed all this kludge and just used fread/fwrite as discussed.
 
4)
Should we check for malloc failure

Used pg_malloc() instead. Same is also suggested by Ibrar.
 

5) Should we add display of progress as backup may take some time,
this can be added as enhancement. We can get other's opinion on this.

Can be done afterward once we have the functionality in place.
 

6)
If the backup count increases, providing the input may be difficult.
Shall the user provide all the incremental backups from a parent folder,
and can we handle the ordering of incremental backups internally?

I am not sure of this yet. We need to provide the tablespace mapping too.
But thanks for putting a point here. Will keep that in mind when I revisit this.
 

7)
Add verbose for copying whole file
Done
 

8) We can also check if approximate space is available in disk before
starting combine backup, this can be added as enhancement. We can get
other's opinion on this.

Hmm... will leave it for now. User will get an error anyway.
 

9)
Combine backup into directory can be combine backup directory
Done
 

10)
MAX_INCR_BK_COUNT can be increased a little; some applications use 1
full backup at the beginning of the month and 30 incremental
backups for the rest of the days in the month.

Yeah, agree. But using any number here is debatable.
Let's see others opinion too.
Why not use a list?
 


Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Attached is a new set of patches with the refactoring done separately.
The incremental backup patch became smaller and is hopefully more
readable than the first version.

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

 

+       buf = (char *) malloc(statbuf->st_size);
+       if (buf == NULL)
+               ereport(ERROR,
+                               (errcode(ERRCODE_OUT_OF_MEMORY),
+                                errmsg("out of memory")));


Why are you using malloc? You can use palloc here.




--
Ibrar Ahmed

Re: block-level incremental backup

From
Ibrar Ahmed
Date:




On Fri, Aug 16, 2019 at 4:12 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:




On Fri, Aug 16, 2019 at 3:24 PM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:


On Fri, Aug 2, 2019 at 6:43 PM vignesh C <vignesh21@gmail.com> wrote:
Some comments:
1) There will be some link files created for tablespace, we might
require some special handling for it

Yep. I have that in my ToDo.
Will start working on that soon.
 
2)
Retry functionality is handled only for copying of full files; should
we handle retry for copying of partial files?
3)
we can have some range for maxretries similar to sleeptime

I took help from the pg_standby code related to maxretries and sleeptime.

However, as we don't want to use system() call now, I have
removed all this kludge and just used fread/fwrite as discussed.
 
4)
Should we check for malloc failure

Used pg_malloc() instead. Same is also suggested by Ibrar.
 

5) Should we add display of progress as backup may take some time,
this can be added as enhancement. We can get other's opinion on this.

Can be done afterward once we have the functionality in place.
 

6)
If the backup count increases, providing the input may be difficult.
Shall the user provide all the incremental backups from a parent folder,
and can we handle the ordering of incremental backups internally?

I am not sure of this yet. We need to provide the tablespace mapping too.
But thanks for putting a point here. Will keep that in mind when I revisit this.
 

7)
Add verbose for copying whole file
Done
 

8) We can also check if approximate space is available in disk before
starting combine backup, this can be added as enhancement. We can get
other's opinion on this.

Hmm... will leave it for now. User will get an error anyway.
 

9)
Combine backup into directory can be combine backup directory
Done
 

10)
MAX_INCR_BK_COUNT can be increased a little; some applications use 1
full backup at the beginning of the month and 30 incremental
backups for the rest of the days in the month.

Yeah, agree. But using any number here is debatable.
Let's see others opinion too.
Why not use a list?
 


Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Attached is a new set of patches with the refactoring done separately.
The incremental backup patch became smaller and is hopefully more
readable than the first version.

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

 

+       buf = (char *) malloc(statbuf->st_size);
+       if (buf == NULL)
+               ereport(ERROR,
+                               (errcode(ERRCODE_OUT_OF_MEMORY),
+                                errmsg("out of memory")));


Why are you using malloc? You can use palloc here.



Hi, I gave another look at the patch and have some quick comments.


-
> char       *extptr = strstr(fn, ".partial");

I think there should be a better and stricter way to check the file extension.
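
For example, a stricter check would insist that ".partial" is a true suffix rather than just a substring somewhere in the name (a sketch; the helper name is made up):

    /* Sketch: accept the name only if ".partial" is a real suffix. */
    static bool
    has_partial_suffix(const char *fn)
    {
        size_t      len = strlen(fn);
        const char *suffix = ".partial";
        size_t      suflen = strlen(suffix);

        return len > suflen && strcmp(fn + len - suflen, suffix) == 0;
    }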

-
> +               extptr = strstr(outfn, ".partial");
> +               Assert (extptr != NULL);

Why are you checking that again? You just appended that in the above statement.

-
> +       if (verbose && statbuf.st_size > (RELSEG_SIZE * BLCKSZ))
> +               pg_log_info("found big file \"%s\" (size: %.2lfGB): %m", fromfn,
> +                                       (double) statbuf.st_size / (RELSEG_SIZE * BLCKSZ));

This is not just a log; a file bigger than this surely indicates some problem.

-
> +        * We do read entire 1GB file in memory while taking incremental backup; so
> +        * I don't see any reason why can't we do that here.  Also, copying data in
> +        * chunks is expensive.  However, for bigger files, we still slice at 1GB
> +        * border.


What do you mean by a bigger file, a file greater than 1GB? In which case would you get a file > 1GB?
 

--
Ibrar Ahmed

Re: block-level incremental backup

From
vignesh C
Date:
On Fri, Aug 16, 2019 at 8:07 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
>
> What do you mean by a bigger file, a file greater than 1GB? In which case would you get a file > 1GB?
>
>
>
Few comments:
Comment:
+ buf = (char *) malloc(statbuf->st_size);
+ if (buf == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory")));
+
+ if ((cnt = fread(buf, 1, statbuf->st_size, fp)) > 0)
+ {
+ Bitmapset  *mod_blocks = NULL;
+ int nmodblocks = 0;
+
+ if (cnt % BLCKSZ != 0)
+ {

We can use the same size as the full page size.
After pg_start_backup, full page writes will be enabled.
We can use the same file size to maintain data consistency.

Comment:
/* Validate given LSN and convert it into XLogRecPtr. */
+ opt->lsn = pg_lsn_in_internal(strVal(defel->arg), &have_error);
+ if (XLogRecPtrIsInvalid(opt->lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid value for LSN")));

Validate input lsn is less than current system lsn.
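
Something along these lines, perhaps (a sketch only; the error code and message wording are illustrative, GetXLogInsertRecPtr() is the existing backend function):

    if (opt->lsn >= GetXLogInsertRecPtr())
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("reference LSN %X/%X is ahead of the current WAL insert location",
                        (uint32) (opt->lsn >> 32), (uint32) opt->lsn)));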

Comment:
/* Validate given LSN and convert it into XLogRecPtr. */
+ opt->lsn = pg_lsn_in_internal(strVal(defel->arg), &have_error);
+ if (XLogRecPtrIsInvalid(opt->lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid value for LSN")));

Should we check if it is same timeline as the system's timeline.

Comment:
+ if (fread(blkdata, 1, BLCKSZ, infp) != BLCKSZ)
+ {
+ pg_log_error("could not read from file \"%s\": %m", outfn);
+ cleanup_filemaps(filemaps, fmindex + 1);
+ exit(1);
+ }
+
+ /* Finally write one block to the output file */
+ if (fwrite(blkdata, 1, BLCKSZ, outfp) != BLCKSZ)
+ {
+ pg_log_error("could not write to file \"%s\": %m", outfn);
+ cleanup_filemaps(filemaps, fmindex + 1);
+ exit(1);
+ }

Should we support compression formats supported by pg_basebackup.
This can be an enhancement after the functionality is completed.

Comment:
We should provide some mechanism to validate the backup. To identify
if some backup is corrupt or some file is missing(deleted) in a
backup.

Comment:
+ ofp = fopen(tofn, "wb");
+ if (ofp == NULL)
+ {
+ pg_log_error("could not create file \"%s\": %m", tofn);
+ exit(1);
+ }

ifp should be closed in the error flow.

Comment:
+ fp = fopen(filename, "r");
+ if (fp == NULL)
+ {
+ pg_log_error("could not read file \"%s\": %m", filename);
+ exit(1);
+ }
+
+ labelfile = pg_malloc(statbuf.st_size + 1);
+ if (fread(labelfile, 1, statbuf.st_size, fp) != statbuf.st_size)
+ {
+ pg_log_error("corrupted file \"%s\": %m", filename);
+ pg_free(labelfile);
+ exit(1);
+ }

fclose can be moved above.

Comment:
+ if (!modifiedblockfound)
+ {
+ copy_whole_file(fm->filename, outfn);
+ cleanup_filemaps(filemaps, fmindex + 1);
+ return;
+ }
+
+ /* Write all blocks to the output file */
+
+ if (fstat(fileno(fm->fp), &statbuf) != 0)
+ {
+ pg_log_error("could not stat file \"%s\": %m", fm->filename);
+ pg_free(filemaps);
+ exit(1);
+ }

Some error flow, cleanup_filemaps need to be called to close the file
descriptors that are opened.

Comment:
+/*
+ * When to send the whole file, % blocks modified (90%)
+ */
+#define WHOLE_FILE_THRESHOLD 0.9
+

This can be a user-configured value.
This can be an enhancement after the functionality is completed.
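
For context, the threshold presumably gates a decision of roughly this shape (a sketch; the variable names follow the review comments in this thread, the exact patch code may differ):

    /* If ~90% or more of the file's blocks were modified, send it whole. */
    if ((double) nmodblocks / (double) nblocks >= WHOLE_FILE_THRESHOLD)
        sendwholefile = true;
    else
        sendwholefile = false;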


Comment:
We can add a readme file with all the details regarding incremental
backup and combine backup.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: block-level incremental backup

From
Robert Haas
Date:
On Fri, Aug 16, 2019 at 6:23 AM Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> [ patches ]

Reviewing 0002 and 0003:

- Commit message for 0003 claims magic number and checksum are 0, but
that (fortunately) doesn't seem to be the case.

- looks_like_rel_name actually checks whether it looks like a
*non-temporary* relation name; suggest adjusting the function name.

- The names do_full_backup and do_incremental_backup are quite
confusing because you're really talking about what to do with one
file.  I suggest sendCompleteFile() and sendPartialFile().

- Is there any good reason to have 'refptr' as a global variable, or
could we just pass the LSN around via function arguments?  I know it's
just mimicking startptr, but storing startptr in a global variable
doesn't seem like a great idea either, so if it's not too annoying,
let's pass it down via function arguments instead.  Also, refptr is a
crappy name (even worse than startptr); whether we end up with a
global variable or a bunch of local variables, let's make the name(s)
clear and unambiguous, like incremental_reference_lsn.  Yeah, I know
that's long, but I still think it's better than being unclear.

- do_incremental_backup looks like it can never report an error from
fread(), which is bad.  But I see that this is just copied from the
existing code which has the same problem, so I started a separate
thread about that.

- I think that passing cnt and blkindex to verify_page_checksum()
doesn't look very good from an abstraction point of view.  Granted,
the existing code isn't great either, but I think this makes the
problem worse.  I suggest passing "int backup_distance" to this
function, computed as cnt - BLCKSZ * blkindex.  Then, you can
fseek(-backup_distance), fread(BLCKSZ), and then fseek(backup_distance
- BLCKSZ).
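
In other words, roughly this (a sketch of the suggested re-read dance; error handling omitted):

    int     backup_distance = cnt - BLCKSZ * blkindex;
    char    pagebuf[BLCKSZ];

    /* Step back to the block's position in the file and re-read it... */
    fseek(fp, -(long) backup_distance, SEEK_CUR);
    fread(pagebuf, 1, BLCKSZ, fp);
    /* ...then return to where the caller's read loop left off. */
    fseek(fp, (long) (backup_distance - BLCKSZ), SEEK_CUR);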

- While I generally support the use of while and for loops rather than
goto for flow control, a while (1) loop that ends with a break is
functionally a goto anyway.  I think there are several ways this could
be revised.  The most obvious one is probably to use goto, but I vote
for inverting the sense of the test: if (PageIsNew(page) ||
PageGetLSN(page) >= startptr) break; This approach also saves a level
of indentation for more than half of the function.

- I am not sure that it's a good idea for sendwholefile = true to
result in dumping the entire file onto the wire in a single CopyData
message.  I don't know of a concrete problem in typical
configurations, but someone who increases RELSEG_SIZE might be able to
overflow CopyData's length word.  At 2GB the length word would be
negative, which might break, and at 4GB it would wrap around, which
would certainly break.  See CopyData in
https://www.postgresql.org/docs/12/protocol-message-formats.html  To
avoid this issue, and maybe some others, I suggest defining a
reasonably large chunk size, say 1MB as a constant in this file
someplace, and sending the data as a series of chunks of that size.
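
A sketch of what the chunked version might look like, assuming the existing pq_putmessage()-based send path (the constant name and message wording are illustrative):

    #define SEND_CHUNK_SIZE (1024 * 1024)   /* 1MB per CopyData message */

    while (len > 0)
    {
        size_t  thischunk = (len > SEND_CHUNK_SIZE) ? SEND_CHUNK_SIZE : len;

        if (pq_putmessage('d', buf, thischunk))
            ereport(ERROR,
                    (errmsg("base backup could not send data, aborting backup")));

        buf += thischunk;
        len -= thischunk;
    }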

- I don't think that the way concurrent truncation is handled is
correct for partial files.  Right now it just falls through to code
which appends blocks of zeroes in either the complete-file or
partial-file case.  I think that logic should be moved into the
function that handles the complete-file case.  In the partial-file
case, the blocks that we actually send need to match the list of block
numbers we promised to send.  We can't just send the promised blocks
and then tack a bunch of zero-filled blocks onto the end that the file
header doesn't know about.

- For reviewer convenience, please use the -v option to git
format-patch when posting and reposting a patch series.  Using -v2,
-v3, etc. on successive versions really helps.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Jeevan Ladhe
Date:
Due to the inherent nature of pg_basebackup, the incremental backup also
allows taking backup in tar and compressed format. But, pg_combinebackup
does not understand how to restore this. I think we should either make
pg_combinebackup support restoration of tar incremental backup or restrict
taking the incremental backup in tar format until pg_combinebackup
supports the restoration by making option '--lsn' and '-Ft' exclusive.

It is arguable that one can take the incremental backup in tar format, extract
that manually and then give the resultant directory as input to the
pg_combinebackup, but I think that kills the purpose of having
pg_combinebackup utility.

Thoughts?

Regards,
Jeevan Ladhe

Re: block-level incremental backup

From
Rajkumar Raghuwanshi
Date:
Hi,

I am doing some testing on the pg_basebackup and pg_combinebackup patches. I have also tried to create a TAP test for pg_combinebackup by taking reference from the pg_basebackup TAP cases.
Attaching a first draft test patch.

I have done some testing with the compression options; both -z and -Z <level> are working with incremental backup.

A minor comment: it is mentioned in the pg_combinebackup help that a maximum of 10 incremental backups can be given with the -i option, but I found that a maximum of 9 incremental backup directories can be given at a time.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


On Thu, Aug 29, 2019 at 10:06 PM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote:
Due to the inherent nature of pg_basebackup, the incremental backup also
allows taking backup in tar and compressed format. But, pg_combinebackup
does not understand how to restore this. I think we should either make
pg_combinebackup support restoration of tar incremental backup or restrict
taking the incremental backup in tar format until pg_combinebackup
supports the restoration by making option '--lsn' and '-Ft' exclusive.

It is arguable that one can take the incremental backup in tar format, extract
that manually and then give the resultant directory as input to the
pg_combinebackup, but I think that kills the purpose of having
pg_combinebackup utility.

Thoughts?

Regards,
Jeevan Ladhe
Attachment

Re: block-level incremental backup

From
Jeevan Ladhe
Date:
Here are some comments:


+/* The reference XLOG position for the incremental backup. */                                    
+static XLogRecPtr refptr;    

As Robert already pointed out, we may want to pass this around as a parameter instead
of a global variable. Also, it can be renamed to something like incr_backup_refptr.
I see that in your earlier version of the patch this was named startincrptr, which I
think was more meaningful.

---------

    /*                                                                                          
     * If incremental backup, see whether the filename is a relation filename                   
     * or not.                                                                                  
     */

Can be reworded something like:
"If incremental backup, check if it is relation file and can be sent partially."

---------

+           if (verify_checksum)
+           {
+               ereport(WARNING,
+                       (errmsg("cannot verify checksum in file \"%s\", block "
+                               "%d: read buffer size %d and page size %d "
+                               "differ",
+                               readfilename, blkno, (int) cnt, BLCKSZ)));
+               verify_checksum = false;
+           }

For do_incremental_backup() it does not make sense to show the block number in
warning as it is always going to be 0 when we throw this warning.
Further, I think this can be rephrased as:
"cannot verify checksum in file \"%s\", read file size %d is not multiple of
page size %d".

Or maybe we can just say:
"cannot verify checksum in file \"%s\"" if checksum requested, disable the
checksum and leave it to the following message:

+           ereport(WARNING,
+                   (errmsg("file size (%d) not in multiple of page size (%d), sending whole file",
+                           (int) cnt, BLCKSZ))); 

---------

If you agree with the above comment for blkno, then we can shift the declaration of blkno
inside the condition "       if (!sendwholefile)" in do_incremental_backup(), or
avoid it altogether, and just pass "i" as blkindex, as well as blkno, to
verify_page_checksum(). Maybe add a comment on why they are the same in the case of
incremental backup.

---------

I think we should give the user a hint about where to read the input
LSN for an incremental backup, in the --help output as well as the documentation.
Something like: "To take an incremental backup, please provide the value of "--lsn"
as the "START WAL LOCATION" of the previously taken full or incremental
backup, from its backup_label file."

---------

pg_combinebackup:

+static bool made_new_outputdata = false;
+static bool found_existing_outputdata = false;

Both of these are global, I understand that we need them global so that they are
accessible in cleanup_directories_atexit(). But they are passed to
verify_dir_is_empty_or_create() as parameters, which I think is not needed.
Instead verify_dir_is_empty_or_create() can directly change the globals.

---------

I see that checksum_failure is never set and always remains false. Maybe
it is something that you wanted to set in combine_partial_files() when a
corrupted partial file is detected?

---------

I think the logic for verifying the backup chain should be moved out from main()
function to a separate function.

---------

+ /*
+ * Verify the backup chain.  INCREMENTAL BACKUP REFERENCE WAL LOCATION of
+ * the incremental backup must match with the START WAL LOCATION of the
+ * previous backup, until we reach a full backup in which there is no
+ * INCREMENTAL BACKUP REFERENCE WAL LOCATION.
+ */

The current logic assumes the incremental backup directories are to be provided
as input in the serial order the backups were taken. This is a bit confusing
unless clarified in the pg_combinebackup help output or documentation. I think we
should clarify it in both places.
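
A sketch of that chain check, assuming the backup directories are given oldest-first (all names here are hypothetical):

    /* backups[0] is the full backup, backups[nbackups - 1] the newest increment. */
    for (i = 1; i < nbackups; i++)
    {
        if (backups[i].reference_lsn != backups[i - 1].start_lsn)
        {
            pg_log_error("backup \"%s\" does not follow on from backup \"%s\"",
                         backups[i].dir, backups[i - 1].dir);
            exit(1);
        }
    }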

---------

I think scan_directory() should be rather renamed as do_combinebackup().

Regards,
Jeevan Ladhe

On Thu, Aug 29, 2019 at 8:11 PM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote:
Due to the inherent nature of pg_basebackup, the incremental backup also
allows taking backup in tar and compressed format. But, pg_combinebackup
does not understand how to restore this. I think we should either make
pg_combinebackup support restoration of tar incremental backup or restrict
taking the incremental backup in tar format until pg_combinebackup
supports the restoration by making option '--lsn' and '-Ft' exclusive.

It is arguable that one can take the incremental backup in tar format, extract
that manually and then give the resultant directory as input to the
pg_combinebackup, but I think that kills the purpose of having
pg_combinebackup utility.

Thoughts?

Regards,
Jeevan Ladhe

Re: block-level incremental backup

From
Robert Haas
Date:
On Thu, Aug 29, 2019 at 10:41 AM Jeevan Ladhe
<jeevan.ladhe@enterprisedb.com> wrote:
> Due to the inherent nature of pg_basebackup, the incremental backup also
> allows taking backup in tar and compressed format. But, pg_combinebackup
> does not understand how to restore this. I think we should either make
> pg_combinebackup support restoration of tar incremental backup or restrict
> taking the incremental backup in tar format until pg_combinebackup
> supports the restoration by making option '--lsn' and '-Ft' exclusive.
>
> It is arguable that one can take the incremental backup in tar format, extract
> that manually and then give the resultant directory as input to the
> pg_combinebackup, but I think that kills the purpose of having
> pg_combinebackup utility.

I don't agree. You're right that you would have to untar (and
uncompress) the backup to run pg_combinebackup, but you would also
have to do that to restore a non-incremental backup, so it doesn't
seem much different.  It's true that for an incremental backup you
might need to untar and uncompress multiple prior backups rather than
just one, but that's just the nature of an incremental backup.  And,
on a practical level, if you want compression, which is pretty likely
if you're thinking about incremental backups, the way to get that is
to use tar format with -z or -Z.

It might be interesting to teach pg_combinebackup to be able to read
tar-format backups, but I think that there are several variants of the
tar format, and I suspect it would need to read them all.  If someone
un-tars and re-tars a backup with a different tar tool, we don't want
it to become unreadable.  So we'd either have to write our own
de-tarring library or add an external dependency on one.  I don't
think it's worth doing that at this point; I definitely don't think it
needs to be part of the first patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Ibrar Ahmed
Date:


On Sat, Aug 31, 2019 at 7:59 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Aug 29, 2019 at 10:41 AM Jeevan Ladhe
<jeevan.ladhe@enterprisedb.com> wrote:
> Due to the inherent nature of pg_basebackup, the incremental backup also
> allows taking backup in tar and compressed format. But, pg_combinebackup
> does not understand how to restore this. I think we should either make
> pg_combinebackup support restoration of tar incremental backup or restrict
> taking the incremental backup in tar format until pg_combinebackup
> supports the restoration by making option '--lsn' and '-Ft' exclusive.
>
> It is arguable that one can take the incremental backup in tar format, extract
> that manually and then give the resultant directory as input to the
> pg_combinebackup, but I think that kills the purpose of having
> pg_combinebackup utility.

I don't agree. You're right that you would have to untar (and
uncompress) the backup to run pg_combinebackup, but you would also
have to do that to restore a non-incremental backup, so it doesn't
seem much different.  It's true that for an incremental backup you
might need to untar and uncompress multiple prior backups rather than
just one, but that's just the nature of an incremental backup.  And,
on a practical level, if you want compression, which is pretty likely
if you're thinking about incremental backups, the way to get that is
to use tar format with -z or -Z.

It might be interesting to teach pg_combinebackup to be able to read
tar-format backups, but I think that there are several variants of the
tar format, and I suspect it would need to read them all.  If someone
un-tars and re-tars a backup with a different tar tool, we don't want
it to become unreadable.  So we'd either have to write our own
de-tarring library or add an external dependency on one. 

Are we using any tar library in pg_basebackup.c? We already have the capability
in pg_basebackup to do that. 

 
I don't
think it's worth doing that at this point; I definitely don't think it
needs to be part of the first patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




--
Ibrar Ahmed

Re: block-level incremental backup

From
Jeevan Ladhe
Date:
Hi Robert,

On Sat, Aug 31, 2019 at 8:29 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Aug 29, 2019 at 10:41 AM Jeevan Ladhe
<jeevan.ladhe@enterprisedb.com> wrote:
> Due to the inherent nature of pg_basebackup, the incremental backup also
> allows taking backup in tar and compressed format. But, pg_combinebackup
> does not understand how to restore this. I think we should either make
> pg_combinebackup support restoration of tar incremental backup or restrict
> taking the incremental backup in tar format until pg_combinebackup
> supports the restoration by making option '--lsn' and '-Ft' exclusive.
>
> It is arguable that one can take the incremental backup in tar format, extract
> that manually and then give the resultant directory as input to the
> pg_combinebackup, but I think that kills the purpose of having
> pg_combinebackup utility.

I don't agree. You're right that you would have to untar (and
uncompress) the backup to run pg_combinebackup, but you would also
have to do that to restore a non-incremental backup, so it doesn't
seem much different. 
 
Thanks. Yes I agree about the similarity between restoring non-incremental
and incremental backup in this case.
 
 I don't think it's worth doing that at this point; I definitely don't think it
needs to be part of the first patch.

Makes sense.

Regards,
Jeevan Ladhe 

Re: block-level incremental backup

From
Dilip Kumar
Date:
On Fri, Aug 16, 2019 at 3:54 PM Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
0003:
+/*
+ * When to send the whole file, % blocks modified (90%)
+ */
+#define WHOLE_FILE_THRESHOLD 0.9

How is this threshold selected?  Is it by some test?


- magic number, currently 0 (4 bytes)
I think in the patch we are using  (#define INCREMENTAL_BACKUP_MAGIC
0x494E4352) as a magic number, not 0


+ Assert(statbuf->st_size <= (RELSEG_SIZE * BLCKSZ));
+
+ buf = (char *) malloc(statbuf->st_size);
+ if (buf == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory")));
+
+ if ((cnt = fread(buf, 1, statbuf->st_size, fp)) > 0)
+ {
+ Bitmapset  *mod_blocks = NULL;
+ int nmodblocks = 0;
+
+ if (cnt % BLCKSZ != 0)
+ {

It will be good to add some comments for the if block and also for the
assert. Actually, it's not very clear from the code.

0004:
+#include <time.h>
+#include <sys/stat.h>
+#include <unistd.h>
Header file include order (sys/stat.h should be before time.h)



+ printf(_("%s combines full backup with incremental backup.\n\n"), progname);
/backup/backups


+ * scan_file
+ *
+ * Checks whether given file is partial file or not.  If partial, then combines
+ * it into a full backup file, else copies as is to the output directory.
+ */

/If partial, then combines/ If partial, then combine



+static void
+combine_partial_files(const char *fn, char **IncrDirs, int nIncrDir,
+   const char *subdirpath, const char *outfn)
+ /*
+ * Open all files from all incremental backup directories and create a file
+ * map.
+ */
+ basefilefound = false;
+ for (i = (nIncrDir - 1), fmindex = 0; i >= 0; i--, fmindex++)
+ {
+ fm = &filemaps[fmindex];
+
.....
+ }
+
+
+ /* Process all opened files. */
+ lastblkno = 0;
+ modifiedblockfound = false;
+ for (i = 0; i < fmindex; i++)
+ {
+ char    *buf;
+ int hsize;
+ int k;
+ int blkstartoffset;
......
+ }
+
+ for (i = 0; i <= lastblkno; i++)
+ {
+ char blkdata[BLCKSZ];
+ FILE    *infp;
+ int offset;
...
+ }
}

Can we break this function down into 2-3 functions?  At least creating the
file map can go directly into a separate function.

I have read 0003 and 0004 patch and there are few cosmetic comments.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: block-level incremental backup

From
Robert Haas
Date:
On Sat, Aug 31, 2019 at 3:41 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
> Are we using any tar library in pg_basebackup.c? We already have the capability
> in pg_basebackup to do that.

I think pg_basebackup is using homebrew code to generate tar files,
but I'm reluctant to do that for reading tar files.  For generating a
file, you can always emit the newest and "best" tar format, but for
reading a file, you probably want to be prepared for older or cruftier
variants.  Maybe not -- I'm not super-familiar with the tar on-disk
format.  But I think there must be a reason why tar libraries exist,
and I don't want to write a new one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Ibrar Ahmed
Date:


On Tue, Sep 3, 2019 at 6:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Aug 31, 2019 at 3:41 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
> Are we using any tar library in pg_basebackup.c? We already have the capability
> in pg_basebackup to do that.

I think pg_basebackup is using homebrew code to generate tar files,
but I'm reluctant to do that for reading tar files.  For generating a
file, you can always emit the newest and "best" tar format, but for
reading a file, you probably want to be prepared for older or cruftier
variants.  Maybe not -- I'm not super-familiar with the tar on-disk
format.  But I think there must be a reason why tar libraries exist,
and I don't want to write a new one.
+1 using the library to tar. But I think reason not using tar library is TAR is
one of the most simple file format. What is the best/newest format of TAR?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Ibrar Ahmed

Re: block-level incremental backup

From
Robert Haas
Date:
On Tue, Sep 3, 2019 at 10:05 AM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
> +1 using the library to tar. But I think reason not using tar library is TAR is
> one of the most simple file format. What is the best/newest format of TAR?

So, I don't really want to go down this path at all, as I already
said.  You can certainly do your own research on this topic if you
wish.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Tom Lane
Date:
Ibrar Ahmed <ibrar.ahmad@gmail.com> writes:
> +1 using the library to tar.

Uh, *what* library?

pg_dump's pg_backup_tar.c is about 1300 lines, a very large fraction
of which is boilerplate for interfacing to pg_backup_archiver's APIs.
The stuff that actually knows specifically about tar looks to be maybe
a couple hundred lines, plus there's another couple hundred lines of
(rather duplicative?) code in src/port/tar.c.  None of it is rocket
science.

I can't believe that it'd be a good tradeoff to create a new external
dependency to replace that amount of code.  In case you haven't noticed,
our luck with depending on external libraries has been abysmal.

Possibly there's an argument for refactoring things so that there's
more stuff in tar.c and less elsewhere, but let's not go looking
for external code to depend on.

            regards, tom lane



Re: block-level incremental backup

From
Ibrar Ahmed
Date:


On Tue, Sep 3, 2019 at 8:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Ibrar Ahmed <ibrar.ahmad@gmail.com> writes:
> +1 using the library to tar.

Uh, *what* library?

I was just replying the Robert that he said  

"But I think there must be a reason why tar libraries exist,
and I don't want to write a new one."

I said I am OK to use a library (whatever he is proposing/thinking),
but explained to him that TAR is a very simple format, which is
why PG has its own code.


pg_dump's pg_backup_tar.c is about 1300 lines, a very large fraction
of which is boilerplate for interfacing to pg_backup_archiver's APIs.
The stuff that actually knows specifically about tar looks to be maybe
a couple hundred lines, plus there's another couple hundred lines of
(rather duplicative?) code in src/port/tar.c.  None of it is rocket
science.

I can't believe that it'd be a good tradeoff to create a new external
dependency to replace that amount of code.  In case you haven't noticed,
our luck with depending on external libraries has been abysmal.

Possibly there's an argument for refactoring things so that there's
more stuff in tar.c and less elsewhere, but let's not go looking
for external code to depend on.

                        regards, tom lane


--
Ibrar Ahmed

Re: block-level incremental backup

From
Ibrar Ahmed
Date:


On Tue, Sep 3, 2019 at 7:39 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Sep 3, 2019 at 10:05 AM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
> +1 using the library to tar. But I think reason not using tar library is TAR is
> one of the most simple file format. What is the best/newest format of TAR?

So, I don't really want to go down this path at all, as I already
said.  You can certainly do your own research on this topic if you
wish.

I did that and have experience working on the TAR format.  I was curious about what
"best/newest" format you were talking about.

  
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Ibrar Ahmed

Re: block-level incremental backup

From
Dilip Kumar
Date:
On Tue, Sep 3, 2019 at 12:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Aug 16, 2019 at 3:54 PM Jeevan Chalke
> <jeevan.chalke@enterprisedb.com> wrote:
> >
> 0003:
> +/*
> + * When to send the whole file, % blocks modified (90%)
> + */
> +#define WHOLE_FILE_THRESHOLD 0.9
>
> How is this threshold selected?  Is it by some test?
>
>
> - magic number, currently 0 (4 bytes)
> I think in the patch we are using  (#define INCREMENTAL_BACKUP_MAGIC
> 0x494E4352) as a magic number, not 0
>
>
> + Assert(statbuf->st_size <= (RELSEG_SIZE * BLCKSZ));
> +
> + buf = (char *) malloc(statbuf->st_size);
> + if (buf == NULL)
> + ereport(ERROR,
> + (errcode(ERRCODE_OUT_OF_MEMORY),
> + errmsg("out of memory")));
> +
> + if ((cnt = fread(buf, 1, statbuf->st_size, fp)) > 0)
> + {
> + Bitmapset  *mod_blocks = NULL;
> + int nmodblocks = 0;
> +
> + if (cnt % BLCKSZ != 0)
> + {
>
> It will be good to add some comments for the if block and also for the
> assert. Actually, it's not very clear from the code.
>
> 0004:
> +#include <time.h>
> +#include <sys/stat.h>
> +#include <unistd.h>
> Header file include order (sys/stat.h should be before time.h)
>
>
>
> + printf(_("%s combines full backup with incremental backup.\n\n"), progname);
> /backup/backups
>
>
> + * scan_file
> + *
> + * Checks whether given file is partial file or not.  If partial, then combines
> + * it into a full backup file, else copies as is to the output directory.
> + */
>
> /If partial, then combines/ If partial, then combine
>
>
>
> +static void
> +combine_partial_files(const char *fn, char **IncrDirs, int nIncrDir,
> +   const char *subdirpath, const char *outfn)
> + /*
> + * Open all files from all incremental backup directories and create a file
> + * map.
> + */
> + basefilefound = false;
> + for (i = (nIncrDir - 1), fmindex = 0; i >= 0; i--, fmindex++)
> + {
> + fm = &filemaps[fmindex];
> +
> .....
> + }
> +
> +
> + /* Process all opened files. */
> + lastblkno = 0;
> + modifiedblockfound = false;
> + for (i = 0; i < fmindex; i++)
> + {
> + char    *buf;
> + int hsize;
> + int k;
> + int blkstartoffset;
> ......
> + }
> +
> + for (i = 0; i <= lastblkno; i++)
> + {
> + char blkdata[BLCKSZ];
> + FILE    *infp;
> + int offset;
> ...
> + }
> }
>
> Can we break this function down into 2-3 functions?  At least creating the
> file map can go directly into a separate function.
>
> I have read 0003 and 0004 patch and there are few cosmetic comments.
>
I have not yet completed the review for 0004, but I have a few more
comments.  Tomorrow I will try to complete the review and some testing
as well.

1. It seems that the output full backup generated with
pg_combinebackup also contains the "INCREMENTAL BACKUP REFERENCE WAL
LOCATION".  It seems confusing
because now this is a full backup, not the incremental backup.

2.
+ FILE    *outfp;
+ FileOffset outblocks[RELSEG_SIZE];
+ int i;
+ FileMap    *filemaps;
+ int fmindex;
+ bool basefilefound;
+ bool modifiedblockfound;
+ uint32 lastblkno;
+ FileMap    *fm;
+ struct stat statbuf;
+ uint32 nblocks;
+
+ memset(outblocks, 0, sizeof(FileOffset) * RELSEG_SIZE);

I don't think you need to memset this explicitly as you can initialize
the array itself no?
FileOffset outblocks[RELSEG_SIZE] = {{0}}

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: block-level incremental backup

From
Robert Haas
Date:
On Tue, Sep 3, 2019 at 12:46 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
> I did that and have experience working on the TAR format.  I was curious about what
> "best/newest" format you were talking about.

Well, why not go look it up?

On my MacBook, tar is documented to understand three different tar
formats: gnutar, ustar, and v7, and two sets of extensions to the tar
format: numeric extensions required by POSIX, and Solaris extensions.
It also understands the pax and restricted-pax formats which are
derived from the ustar format.  I don't know what your system
supports, but it's probably not hugely different; the fact that there
are multiple tar formats has been documented in the tar man page on
every machine I've checked for the past 20 years.  Here, 'man tar'
refers the reader to 'man libarchive-formats', which contains the
details mentioned above.

A quick Google search for 'multiple tar formats' also finds
https://en.wikipedia.org/wiki/Tar_(computing)#File_format and
https://www.gnu.org/software/tar/manual/html_chapter/tar_8.html each
of which explains a good deal of the complexity in this area.

I don't really understand why I have to explain to you what I mean
when I say there are multiple tar formats when you can look it up on
Google and find that there are multiple tar formats.  Again, the point
is that the current code only generates tar archives and therefore
only needs to generate one format, but if we add code that reads a tar
archive, it probably needs to read several formats, because there are
several formats that are popular enough to be widely-supported.

It's possible that somebody else here knows more about this topic and
could make better judgements than I can, but my view at present is
that if we want to read tar archives, we probably would want to do it
by depending on libarchive.  And I don't think we should do that for
this project because I don't think it would provide much value.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Michael Paquier
Date:
On Tue, Sep 03, 2019 at 08:59:53AM -0400, Robert Haas wrote:
> I think pg_basebackup is using homebrew code to generate tar files,
> but I'm reluctant to do that for reading tar files.

Yes.  This code has not actually changed since its introduction.
Please note that we also have code which reads directly data from a
tarball in pg_basebackup.c when appending the recovery parameters to
postgresql.auto.conf for -R.  There could be some consolidation here
with what you are doing.

> For generating a
> file, you can always emit the newest and "best" tar format, but for
> reading a file, you probably want to be prepared for older or cruftier
> variants.  Maybe not -- I'm not super-familiar with the tar on-disk
> format.  But I think there must be a reason why tar libraries exist,
> and I don't want to write a new one.

We need to be sure as well that the library chosen does not block
access to a feature in all the various platforms we have.
--
Michael

Attachment

Re: block-level incremental backup

From
Robert Haas
Date:
On Wed, Sep 4, 2019 at 10:08 PM Michael Paquier <michael@paquier.xyz> wrote:
> > For generating a
> > file, you can always emit the newest and "best" tar format, but for
> > reading a file, you probably want to be prepared for older or cruftier
> > variants.  Maybe not -- I'm not super-familiar with the tar on-disk
> > format.  But I think there must be a reason why tar libraries exist,
> > and I don't want to write a new one.
>
> We need to be sure as well that the library chosen does not block
> access to a feature in all the various platforms we have.

Well, again, my preference is to just not make this particular feature
work natively with tar files.  Then I don't need to choose a library,
so the question is moot.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Jeevan Chalke
Date:
Hi,

Attached is a new set of patches adding support for tablespace handling.

This patchset also fixes the issues reported by Vignesh, Robert, Jeevan Ladhe,
and Dilip Kumar.

Please have a look and let me know if I missed addressing any comments.

Thanks
--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: block-level incremental backup

From
Jeevan Chalke
Date:


On Tue, Aug 27, 2019 at 4:46 PM vignesh C <vignesh21@gmail.com> wrote:
Few comments:
Comment:
+ buf = (char *) malloc(statbuf->st_size);
+ if (buf == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory")));
+
+ if ((cnt = fread(buf, 1, statbuf->st_size, fp)) > 0)
+ {
+ Bitmapset  *mod_blocks = NULL;
+ int nmodblocks = 0;
+
+ if (cnt % BLCKSZ != 0)
+ {

We can use the same size as the full page size.
After pg_start_backup, full page writes will be enabled.
We can use the same file size to maintain data consistency.

Can you please explain which size?
The aim here is to read the entire file in memory, and thus statbuf->st_size is used.

Comment:
Should we check if it is same timeline as the system's timeline.

At the time of taking the incremental backup, we can't check that.
However, while combining, I made sure that the timeline is the same for all backups.
 

Comment:

Should we support compression formats supported by pg_basebackup.
This can be an enhancement after the functionality is completed.

For the incremental backup, it just works out of the box.
For combining backups, as discussed up-thread, the user has to
uncompress first, combine them, and compress again if required.


Comment:
We should provide some mechanism to validate the backup. To identify
if some backup is corrupt or some file is missing(deleted) in a
backup.

Maybe, but not for the first version.
 
Comment:
+/*
+ * When to send the whole file, % blocks modified (90%)
+ */
+#define WHOLE_FILE_THRESHOLD 0.9
+
This can be a user-configured value.
This can be an enhancement after the functionality is completed.

Yes.
 
Comment:
We can add a readme file with all the details regarding incremental
backup and combine backup.

Will have a look.
 

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Thanks
--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: block-level incremental backup

From
Jeevan Chalke
Date:


On Tue, Aug 27, 2019 at 11:59 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Aug 16, 2019 at 6:23 AM Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> [ patches ]

Reviewing 0002 and 0003:

- Commit message for 0003 claims magic number and checksum are 0, but
that (fortunately) doesn't seem to be the case.

Oops, updated commit message.
 

- looks_like_rel_name actually checks whether it looks like a
*non-temporary* relation name; suggest adjusting the function name.

- The names do_full_backup and do_incremental_backup are quite
confusing because you're really talking about what to do with one
file.  I suggest sendCompleteFile() and sendPartialFile().

Changed function names.
 

- Is there any good reason to have 'refptr' as a global variable, or
could we just pass the LSN around via function arguments?  I know it's
just mimicking startptr, but storing startptr in a global variable
doesn't seem like a great idea either, so if it's not too annoying,
let's pass it down via function arguments instead.  Also, refptr is a
crappy name (even worse than startptr); whether we end up with a
global variable or a bunch of local variables, let's make the name(s)
clear and unambiguous, like incremental_reference_lsn.  Yeah, I know
that's long, but I still think it's better than being unclear.

Renamed variable.
However, I have kept it as a global only because otherwise many functions would need to
change their signatures, like sendFile(), sendDir(), sendTablespace(), etc.


- do_incremental_backup looks like it can never report an error from
fread(), which is bad.  But I see that this is just copied from the
existing code which has the same problem, so I started a separate
thread about that.

- I think that passing cnt and blkindex to verify_page_checksum()
doesn't look very good from an abstraction point of view.  Granted,
the existing code isn't great either, but I think this makes the
problem worse.  I suggest passing "int backup_distance" to this
function, computed as cnt - BLCKSZ * blkindex.  Then, you can
fseek(-backup_distance), fread(BLCKSZ), and then fseek(backup_distance
- BLCKSZ).

Yep. Done these changes in the refactoring patch.
 

- While I generally support the use of while and for loops rather than
goto for flow control, a while (1) loop that ends with a break is
functionally a goto anyway.  I think there are several ways this could
be revised.  The most obvious one is probably to use goto, but I vote
for inverting the sense of the test: if (PageIsNew(page) ||
PageGetLSN(page) >= startptr) break; This approach also saves a level
of indentation for more than half of the function.

I have used this new inverted condition, but we still need a while(1) loop.


- I am not sure that it's a good idea for sendwholefile = true to
result in dumping the entire file onto the wire in a single CopyData
message.  I don't know of a concrete problem in typical
configurations, but someone who increases RELSEG_SIZE might be able to
overflow CopyData's length word.  At 2GB the length word would be
negative, which might break, and at 4GB it would wrap around, which
would certainly break.  See CopyData in
https://www.postgresql.org/docs/12/protocol-message-formats.html  To
avoid this issue, and maybe some others, I suggest defining a
reasonably large chunk size, say 1MB as a constant in this file
someplace, and sending the data as a series of chunks of that size.

OK. Done as per the suggestions.
 

- I don't think that the way concurrent truncation is handled is
correct for partial files.  Right now it just falls through to code
which appends blocks of zeroes in either the complete-file or
partial-file case.  I think that logic should be moved into the
function that handles the complete-file case.  In the partial-file
case, the blocks that we actually send need to match the list of block
numbers we promised to send.  We can't just send the promised blocks
and then tack a bunch of zero-filled blocks onto the end that the file
header doesn't know about.

Well, in the partial-file case we never reach that block, so we never end up
sending zeroes at the end of a partial file.


- For reviewer convenience, please use the -v option to git
format-patch when posting and reposting a patch series.  Using -v2,
-v3, etc. on successive versions really helps.

Sure. Thanks for letting me know about this option.
 

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Thanks
--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: block-level incremental backup

From
Jeevan Chalke
Date:


On Fri, Aug 30, 2019 at 6:52 PM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote:
Here are some comments:
Or maybe we can just say:
"cannot verify checksum in file \"%s\"" if checksum requested, disable the
checksum and leave it to the following message:

+           ereport(WARNING,
+                   (errmsg("file size (%d) not in multiple of page size (%d), sending whole file",
+                           (int) cnt, BLCKSZ))); 


Opted for the above suggestion.
 

I think we should give the user a hint about where to read the input LSN for an
incremental backup, in the --help output as well as the documentation.
Something like: "To take an incremental backup, please provide the value of
--lsn as the START WAL LOCATION of the previously taken full or incremental
backup, from its backup_label file."

Added this to the documentation. In --help it would be too crowded.
 
pg_combinebackup:

+static bool made_new_outputdata = false;
+static bool found_existing_outputdata = false;

Both of these are global; I understand that we need them global so that they are
accessible in cleanup_directories_atexit(). But they are also passed to
verify_dir_is_empty_or_create() as parameters, which I think is not needed.
Instead, verify_dir_is_empty_or_create() can change the globals directly.

After adding support for tablespaces, these two variables take different values depending upon the context.


The current logic assumes the incremental backup directories are provided as
input in the serial order in which the backups were taken. This is a bit
confusing unless it is clarified in the pg_combinebackup help output or
documentation. I think we should clarify it in both places.

Added in doc.
 

I think scan_directory() should rather be renamed to do_combinebackup().

I am not sure about this renaming. scan_directory() is called recursively
to scan each subdirectory too. If we rename it, then it is not actually doing
a combine backup recursively; combining the backups is a single, whole process.

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: block-level incremental backup

From
Jeevan Chalke
Date:


On Tue, Sep 3, 2019 at 12:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Aug 16, 2019 at 3:54 PM Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
0003:
+/*
+ * When to send the whole file, % blocks modified (90%)
+ */
+#define WHOLE_FILE_THRESHOLD 0.9

How was this threshold selected?  Is it based on some test?

Currently, it is set arbitrarily. If required, we will make it a GUC.
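
(Just to illustrate how such a threshold would typically be applied -- the
function and variable names here are assumptions, not taken from the patch:)

static bool
should_send_whole_file(int nmodblocks, int total_blocks)
{
    /*
     * If 90% or more of the blocks changed, the partial-file format saves
     * little, so fall back to sending the file in its entirety.
     */
    return nmodblocks >= total_blocks * WHOLE_FILE_THRESHOLD;
}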



- magic number, currently 0 (4 bytes)
I think in the patch we are using  (#define INCREMENTAL_BACKUP_MAGIC
0x494E4352) as a magic number, not 0

Yes. Robert too reported this. Updated the commit message.
 

Can we break this function down into 2-3 functions?  At least creating the
file map could go directly into a separate function.

Separated the filemap changes out into their own function. The rest is kept as is to make the follow-up easy.
 

I have read the 0003 and 0004 patches and have a few cosmetic comments.

Can you please post those too?

Other comments are fixed.
 


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Thanks
--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: block-level incremental backup

From
Jeevan Chalke
Date:


On Wed, Sep 4, 2019 at 5:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

 I have not yet completed the review for 0004, but I have a few more
comments.  Tomorrow I will try to complete the review and some testing
as well.

1. It seems that the output full backup generated with
pg_combinebackup also contains the "INCREMENTAL BACKUP REFERENCE WAL
LOCATION".  That seems confusing
because this is now a full backup, not an incremental backup.

Yes, that was still pending and was on my TODO list.
Done in the new patch set. Also, --label is now taken as an input, like pg_basebackup.
 

2.
+ memset(outblocks, 0, sizeof(FileOffset) * RELSEG_SIZE);

I don't think you need to memset this explicitly as you can initialize
the array itself no?
FileOffset outblocks[RELSEG_SIZE] = {{0}}

I didn't see any issue with memset either but changed this per your suggestion.
 

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: block-level incremental backup

From
Jeevan Chalke
Date:
Hi,

One of my colleagues at EDB, Rajkumar Raghuwanshi, reported an issue while
testing this feature. He reported that if a full base backup is taken, then a
database is created, and then an incremental backup is taken, combining the
full backup with the incremental backup fails.
I had a look over this issue and observed that when the new database is
created, the catalog files are copied as-is into the new directory
corresponding to the newly created database. And as they are just copied,
the LSNs on those pages are not changed. Due to this, the incremental backup
thinks these are existing files and thus does not copy the blocks from
these new files, leading to the failure.

I was surprised to learn that even though we are creating new files from
old files, we keep the LSN unmodified. I didn't see any other parameter
in basebackup which indicates that this is a new file since the last LSN
or something similar.

I tried looking for any other DDL doing similar things, i.e. creating a new
page with an existing LSN, but I could not find any commands other than
CREATE DATABASE and ALTER DATABASE .. SET TABLESPACE.

Suggestions/thoughts?

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: block-level incremental backup

From
vignesh C
Date:
On Mon, Sep 9, 2019 at 4:51 PM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
>
>
>
> On Tue, Aug 27, 2019 at 4:46 PM vignesh C <vignesh21@gmail.com> wrote:
>>
>> Few comments:
>> Comment:
>> + buf = (char *) malloc(statbuf->st_size);
>> + if (buf == NULL)
>> + ereport(ERROR,
>> + (errcode(ERRCODE_OUT_OF_MEMORY),
>> + errmsg("out of memory")));
>> +
>> + if ((cnt = fread(buf, 1, statbuf->st_size, fp)) > 0)
>> + {
>> + Bitmapset  *mod_blocks = NULL;
>> + int nmodblocks = 0;
>> +
>> + if (cnt % BLCKSZ != 0)
>> + {
>>
>> We can use same size as full page size.
>> After pg start backup full page write will be enabled.
>> We can use the same file size to maintain data consistency.
>
>
> Can you please explain which size?
> The aim here is to read entire file in-memory and thus used statbuf->st_size.
>
Instead of reading the whole file here, we can read the file page by page. There is a possibility of data inconsistency if the data is not read page by page; the data will be consistent if read page by page, as full page writes will be enabled at this time.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: block-level incremental backup

From
Robert Haas
Date:
On Fri, Sep 13, 2019 at 1:08 PM vignesh C <vignesh21@gmail.com> wrote:
> Instead of reading the whole file here, we can read the file page by page. There is a possibility of data
> inconsistency if the data is not read page by page; the data will be consistent if read page by page, as full
> page writes will be enabled at this time.
 

I think you are confused about what "full page writes" means. It has
to do with what gets written to the write-ahead log, not the way that the
pages themselves are written. There is no portable way to ensure that
an 8kB read or write is atomic, and generally it isn't.

It shouldn't matter whether the file is read all at once, page by
page, or byte by byte, except for performance. Recovery is going to
run when that backup is restored, and any inconsistencies should get
fixed up at that time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Robert Haas
Date:
On Thu, Sep 12, 2019 at 9:13 AM Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> I had a look over this issue and observed that when the new database is
> created, the catalog files are copied as-is into the new directory
> corresponding to the newly created database. And as they are just copied,
> the LSNs on those pages are not changed. Due to this, the incremental backup
> thinks these are existing files and thus does not copy the blocks from
> these new files, leading to the failure.

*facepalm*

Well, this shoots a pretty big hole in my design for this feature. I
don't know why I didn't think of this when I wrote out that design
originally. Ugh.

Unless we change the way that CREATE DATABASE and any similar
operations work so that they always stamp pages with new LSNs, I think
we have to give up on the idea of being able to take an incremental
backup by just specifying an LSN. We'll instead need to get a list of
files from the server first, and then request the entirety of any that
we don't have, plus the changed blocks from the ones that we do have.
I guess that will make Stephen happy, since it's more like the design
he wanted originally (and should generalize more simply to parallel
backup).

One question I have is: is there any scenario in which an existing
page gets modified after the full backup and before the incremental
backup but does not end up with an LSN that follows the full backup's
start LSN? If there is, then the whole concept of using LSNs to tell
which blocks have been modified doesn't really work. I can't think of
a way that can happen off-hand, but then, I thought my last design was
good, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Amit Kapila
Date:
On Mon, Sep 16, 2019 at 7:22 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Sep 12, 2019 at 9:13 AM Jeevan Chalke
> <jeevan.chalke@enterprisedb.com> wrote:
> > I had a look over this issue and observed that when the new database is
> > created, the catalog files are copied as-is into the new directory
> > corresponding to the newly created database. And as they are just copied,
> > the LSNs on those pages are not changed. Due to this, the incremental backup
> > thinks these are existing files and thus does not copy the blocks from
> > these new files, leading to the failure.
>
> *facepalm*
>
> Well, this shoots a pretty big hole in my design for this feature. I
> don't know why I didn't think of this when I wrote out that design
> originally. Ugh.
>
> Unless we change the way that CREATE DATABASE and any similar
> operations work so that they always stamp pages with new LSNs, I think
> we have to give up on the idea of being able to take an incremental
> backup by just specifying an LSN.
>

This seems to be a blocking problem for the LSN based design.  Can we
think of using creation time for file?  Basically, if the file
creation time is later than backup-labels "START TIME:", then include
that file entirely.  I think one big point against this is clock skew
like what if somebody tinkers with the clock.  And also, this can
cover cases like
what Jeevan has pointed but might not cover other cases which we found
problematic.

>  We'll instead need to get a list of
> files from the server first, and then request the entirety of any that
> we don't have, plus the changed blocks from the ones that we do have.
> I guess that will make Stephen happy, since it's more like the design
> he wanted originally (and should generalize more simply to parallel
> backup).
>
> One question I have is: is there any scenario in which an existing
> page gets modified after the full backup and before the incremental
> backup but does not end up with an LSN that follows the full backup's
> start LSN?
>

I think the operations covered by WAL flag XLR_SPECIAL_REL_UPDATE will
have similar problems.

One related point is how do incremental backups handle the case where
vacuum truncates the relation partially?  Basically, with current
patch/design, it doesn't appear that such information can be passed
via incremental backup.  I am not sure if this is a problem, but it
would be good if we can somehow handle this.

Isn't some operations where at the end we directly call heap_sync
without writing WAL will have a similar problem as well?  Similarly,
it is not very clear if unlogged relations are handled in some way if
not, the same could be documented.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: block-level incremental backup

From
Robert Haas
Date:
On Mon, Sep 16, 2019 at 4:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> This seems to be a blocking problem for the LSN based design.

Well, only the simplest version of it, I think.

> Can we think of using creation time for file?  Basically, if the file
> creation time is later than backup-labels "START TIME:", then include
> that file entirely.  I think one big point against this is clock skew
> like what if somebody tinkers with the clock.  And also, this can
> cover cases like
> what Jeevan has pointed but might not cover other cases which we found
> problematic.

Well that would mean, for example, that if you copied the data
directory from one machine to another, the next "incremental" backup
would turn into a full backup. That sucks. And in other situations,
like resetting the clock, it could mean that you end up with a corrupt
backup without any real ability for PostgreSQL to detect it. I'm not
saying that it is impossible to create a practically useful system
based on file time stamps, but I really don't like it.

> I think the operations covered by WAL flag XLR_SPECIAL_REL_UPDATE will
> have similar problems.

I'm not sure quite what you mean by that.  Can you elaborate? It
appears to me that the XLR_SPECIAL_REL_UPDATE operations are all
things that create files, remove files, or truncate files, and the
sketch in my previous email would handle the first two of those cases
correctly.  See below for the third.

> One related point is how do incremental backups handle the case where
> vacuum truncates the relation partially?  Basically, with current
> patch/design, it doesn't appear that such information can be passed
> via incremental backup.  I am not sure if this is a problem, but it
> would be good if we can somehow handle this.

As to this, if you're taking a full backup of a particular file,
there's no problem.  If you're taking a partial backup of a particular
file, you need to include the current length of the file and the
identity and contents of each modified block.  Then you're fine.
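
(To make that concrete, here is a purely hypothetical sketch of the metadata
a partial file could carry -- this is not the format from the patch, just an
illustration of recording the truncated length alongside the changed blocks:)

typedef struct PartialFileHeader
{
    uint32      magic;          /* identifies the partial-file format */
    uint32      nblocks;        /* number of changed blocks that follow */
    uint64      file_length;    /* current length of the source file, in bytes */
    /* followed by nblocks block numbers, then nblocks * BLCKSZ of block data */
} PartialFileHeader;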

> Isn't some operations where at the end we directly call heap_sync
> without writing WAL will have a similar problem as well?

Maybe.  Can you give an example?

> Similarly,
> it is not very clear if unlogged relations are handled in some way if
> not, the same could be documented.

I think that we don't need to back up the contents of unlogged
relations at all, right? Restoration from an online backup always
involves running recovery, and so unlogged relations will anyway get
zapped.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Sep 16, 2019 at 4:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Can we think of using creation time for file?  Basically, if the file
> > creation time is later than backup-labels "START TIME:", then include
> > that file entirely.  I think one big point against this is clock skew
> > like what if somebody tinkers with the clock.  And also, this can
> > cover cases like
> > what Jeevan has pointed but might not cover other cases which we found
> > problematic.
>
> Well that would mean, for example, that if you copied the data
> directory from one machine to another, the next "incremental" backup
> would turn into a full backup. That sucks. And in other situations,
> like resetting the clock, it could mean that you end up with a corrupt
> backup without any real ability for PostgreSQL to detect it. I'm not
> saying that it is impossible to create a practically useful system
> based on file time stamps, but I really don't like it.

In a number of cases, trying to make sure that on a failover or copy of
the backup the next 'incremental' is really an 'incremental' is
dangerous.  A better strategy to address this, and the other issues
realized on this thread recently, is to:

- Have a manifest of every file in each backup
- Always back up new files that weren't in the prior backup
- Keep a checksum of each file
- Track the timestamp of each file as of when it was backed up
- Track the file size of each file
- Track the starting timestamp of each backup
- Always include files with a modification time after the starting
  timestamp of the prior backup, or if the file size has changed
- In the event of any anomalies (which includes things like a timeline
  switch), use checksum matching (aka 'delta checksum backup') to
  perform the backup instead of using timestamps (or just always do that
  if you want to be particularly careful- having an option for it is
  great)
- Probably other things I'm not thinking of off-hand, but this is at
  least a good start.  Make sure to checksum this information too.  (See
  the illustrative manifest-entry sketch just after this list.)
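
(Purely illustrative -- a sketch of what a single per-file manifest entry
might record; none of these names come from an actual patch, only MAXPGPATH
and the basic types are from the PostgreSQL headers:)

typedef struct ManifestEntry
{
    char        path[MAXPGPATH];    /* path relative to the data directory */
    uint64      size;               /* file size at backup time, in bytes */
    time_t      mtime;              /* modification time as of the backup */
    char        checksum[65];       /* hex-encoded checksum of the file contents */
} ManifestEntry;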

I agree entirely that it is dangerous to simply rely on creation time as
compared to some other time, or to rely on modification time of a given
file across multiple backups (which has been shown to reliably cause
corruption, at least with rsync and its 1-second granularity on
modification time).

By having a manifest for each backed up file for each backup, you also
gain the ability to validate that a backup in the repository hasn't been
corrupted post-backup, a feature that at least some other database
backup and restore systems have (referring specifically to the big O in
this particular case, but I bet others do too).

Having a system of keeping track of which backups are full and which are
differential in an overall system also gives you the ability to do
things like expiration in a sensible way, including handling WAL
expiration.

As also mentioned up-thread, this likely also allows you to have a
simpler approach to parallelizing the overall backup.

I'd like to clarify that while I would like to have an easier way to
parallelize backups, that's a relatively minor complaint- the much
bigger issue that I have with this feature is that trying to address
everything correctly while having only the amount of information that
could be passed on the command-line about the prior full/incremental is
going to be extremely difficult, complicated, and likely to lead to
subtle bugs in the actual code, and probably less than subtle bugs in
how users end up using it, since they'll have to implement the
expiration and tracking of information between backups themselves
(unless something's changed in that part during this discussion- I admit
that I've not read every email in this thread).

> > One related point is how do incremental backups handle the case where
> > vacuum truncates the relation partially?  Basically, with current
> > patch/design, it doesn't appear that such information can be passed
> > via incremental backup.  I am not sure if this is a problem, but it
> > would be good if we can somehow handle this.
>
> As to this, if you're taking a full backup of a particular file,
> there's no problem.  If you're taking a partial backup of a particular
> file, you need to include the current length of the file and the
> identity and contents of each modified block.  Then you're fine.

I would also expect this to be fine but if there's an example of where
this is an issue, please share.  The only issue that I can think of
off-hand is orphaned-file risk, whereby you have something like CREATE
DATABASE or perhaps ALTER TABLE .. SET TABLESPACE or such, take a
backup while that's happening, but that doesn't complete during the
backup (or recovery, or perhaps even in some other scenarios, it's
unfortunately quite complicated).  This orphaned file risk isn't newly
discovered but fixing it is pretty complicated- would love to discuss
ideas around how to handle it.

> > Isn't some operations where at the end we directly call heap_sync
> > without writing WAL will have a similar problem as well?
>
> Maybe.  Can you give an example?

I'd be curious to hear what the concern is here also.

> > Similarly,
> > it is not very clear if unlogged relations are handled in some way if
> > not, the same could be documented.
>
> I think that we don't need to back up the contents of unlogged
> relations at all, right? Restoration from an online backup always
> involves running recovery, and so unlogged relations will anyway get
> zapped.

Unlogged relations shouldn't be in the backup at all, since, yes, they
get zapped at the start of recovery.  We recently taught pg_basebackup
how to avoid backing them up so this shouldn't be an issue, as they
should be skipped for incrementals as well as fulls.  I expect the
orphaned file problem also exists for UNLOGGED->LOGGED transitions.

Thanks,

Stephen


Re: block-level incremental backup

From
Robert Haas
Date:
On Mon, Sep 16, 2019 at 9:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > Isn't some operations where at the end we directly call heap_sync
> > without writing WAL will have a similar problem as well?
>
> Maybe.  Can you give an example?

Looking through the code, I found two cases where we do this.  One is
a bulk insert operation with wal_level = minimal, and the other is
CLUSTER or VACUUM FULL with wal_level = minimal. In both of these
cases we are generating new blocks whose LSNs will be 0/0. So, I think
we need a rule that if the server is asked to back up all blocks in a
file with LSNs > some threshold LSN, it must also include any blocks
whose LSN is 0/0. Those blocks are either uninitialized or are
populated without WAL logging, so they always need to be copied.
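
(A sketch of the proposed inclusion rule -- illustrative only, using the
standard page and LSN macros; the function name is an assumption:)

static bool
block_needs_backup(Page page, XLogRecPtr threshold_lsn)
{
    XLogRecPtr  lsn = PageGetLSN(page);

    /*
     * Blocks never stamped with an LSN (0/0) are either uninitialized or
     * were populated without WAL logging, so always include them; otherwise
     * include the block only if it changed after the threshold LSN.
     */
    return XLogRecPtrIsInvalid(lsn) || lsn >= threshold_lsn;
}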

Outside of unlogged and temporary tables, I don't know of any case
where we make a critical modification to an already-existing block
without bumping the LSN. I hope there is no such case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Robert Haas
Date:
On Mon, Sep 16, 2019 at 10:38 AM Stephen Frost <sfrost@snowman.net> wrote:
> In a number of cases, trying to make sure that on a failover or copy of
> the backup the next 'incremental' is really an 'incremental' is
> dangerous.  A better strategy to address this, and the other issues
> realized on this thread recently, is to:
>
> - Have a manifest of every file in each backup
> - Always back up new files that weren't in the prior backup
> - Keep a checksum of each file
> - Track the timestamp of each file as of when it was backed up
> - Track the file size of each file
> - Track the starting timestamp of each backup
> - Always include files with a modification time after the starting
>   timestamp of the prior backup, or if the file size has changed
> - In the event of any anomalies (which includes things like a timeline
>   switch), use checksum matching (aka 'delta checksum backup') to
>   perform the backup instead of using timestamps (or just always do that
>   if you want to be particularly careful- having an option for it is
>   great)
> - Probably other things I'm not thinking of off-hand, but this is at
>   least a good start.  Make sure to checksum this information too.

I agree with some of these ideas but not all of them.  I think having
a backup manifest is a good idea; that would allow taking a new
incremental backup to work from the manifest rather than the data
directory, which could be extremely useful, because it might be a lot
faster and the manifest could also be copied to a machine other than
the one where the entire backup is stored. If the backup itself has
been pushed off to S3 or whatever, you can't access it quickly, but
you could keep the manifest around.

I also agree that backing up all files that weren't in the previous
backup is a good strategy.  I proposed that fairly explicitly a few
emails back; but also, the contrary is obviously nonsense. And I also
agree with, and proposed, that we record the size along with the file.

I don't really agree with your comments about checksums and
timestamps.  I think that, if possible, there should be ONE method of
determining whether a block has changed in some important way, and I
think if we can make LSN work, that would be for the best. If you use
multiple methods of detecting changes without any clearly-defined
reason for so doing, maybe what you're saying is that you don't really
believe that any of the methods are reliable but if we throw the
kitchen sink at the problem it should come out OK. Any bugs in one
mechanism are likely to be masked by one of the others, but that's not
as good as one method that is known to be altogether reliable.

> By having a manifest for each backed up file for each backup, you also
> gain the ability to validate that a backup in the repository hasn't been
> corrupted post-backup, a feature that at least some other database
> backup and restore systems have (referring specifically to the big O in
> this particular case, but I bet others do too).

Agreed. The manifest only lets you validate to a limited extent, but
that's still useful.

> Having a system of keeping track of which backups are full and which are
> differential in an overall system also gives you the ability to do
> things like expiration in a sensible way, including handling WAL
> expiration.

True, but I'm not sure that functionality belongs in core. It
certainly needs to be possible for out-of-core code to do this part of
the work if desired, because people want to integrate with enterprise
backup systems, and we can't come in and say, well, you back up
everything else using Netbackup or Tivoli, but for PostgreSQL you have
to use pg_backrest. I mean, maybe you can win that argument, but I
know I can't.

> I'd like to clarify that while I would like to have an easier way to
> parallelize backups, that's a relatively minor complaint- the much
> bigger issue that I have with this feature is that trying to address
> everything correctly while having only the amount of information that
> could be passed on the command-line about the prior full/incremental is
> going to be extremely difficult, complicated, and likely to lead to
> subtle bugs in the actual code, and probably less than subtle bugs in
> how users end up using it, since they'll have to implement the
> expiration and tracking of information between backups themselves
> (unless something's changed in that part during this discussion- I admit
> that I've not read every email in this thread).

Well, the evidence seems to show that you are right, at least to some
extent. I consider it a positive good if the client needs to give the
server only a limited amount of information. After all, you could
always take an incremental backup by shipping every byte of the
previous backup to the server, having it compare everything to the
current contents, and having it then send you back the stuff that is
new or different. But that would be dumb, because most of the point of
an incremental backup is to save on sending lots of data over the
network unnecessarily. Now, it seems that I took that goal to an
unhealthy extreme, because as we've now realized, sending only an LSN
and nothing else isn't enough to get a correct backup. So we need to
send more, and it doesn't have to be the absolutely most
stripped-down, bare-bones version of what could be sent. But it should
be fairly minimal, I think; that's kinda the point of the feature.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Sep 16, 2019 at 10:38 AM Stephen Frost <sfrost@snowman.net> wrote:
> > In a number of cases, trying to make sure that on a failover or copy of
> > the backup the next 'incremental' is really an 'incremental' is
> > dangerous.  A better strategy to address this, and the other issues
> > realized on this thread recently, is to:
> >
> > - Have a manifest of every file in each backup
> > - Always back up new files that weren't in the prior backup
> > - Keep a checksum of each file
> > - Track the timestamp of each file as of when it was backed up
> > - Track the file size of each file
> > - Track the starting timestamp of each backup
> > - Always include files with a modification time after the starting
> >   timestamp of the prior backup, or if the file size has changed
> > - In the event of any anomalies (which includes things like a timeline
> >   switch), use checksum matching (aka 'delta checksum backup') to
> >   perform the backup instead of using timestamps (or just always do that
> >   if you want to be particularly careful- having an option for it is
> >   great)
> > - Probably other things I'm not thinking of off-hand, but this is at
> >   least a good start.  Make sure to checksum this information too.
>
> I agree with some of these ideas but not all of them.  I think having
> a backup manifest is a good idea; that would allow taking a new
> incremental backup to work from the manifest rather than the data
> directory, which could be extremely useful, because it might be a lot
> faster and the manifest could also be copied to a machine other than
> the one where the entire backup is stored. If the backup itself has
> been pushed off to S3 or whatever, you can't access it quickly, but
> you could keep the manifest around.

Yes, those are also good reasons for having a manifest.

> I also agree that backing up all files that weren't in the previous
> backup is a good strategy.  I proposed that fairly explicitly a few
> emails back; but also, the contrary is obviously nonsense. And I also
> agree with, and proposed, that we record the size along with the file.

Sure, I didn't mean to imply that there was something wrong with that.
Including the checksum and other metadata is also valuable, both for
helping to identify corruption in the backup archive and for forensics,
if not for other reasons.

> I don't really agree with your comments about checksums and
> timestamps.  I think that, if possible, there should be ONE method of
> determining whether a block has changed in some important way, and I
> think if we can make LSN work, that would be for the best. If you use
> multiple methods of detecting changes without any clearly-defined
> reason for so doing, maybe what you're saying is that you don't really
> believe that any of the methods are reliable but if we throw the
> kitchen sink at the problem it should come out OK. Any bugs in one
> mechanism are likely to be masked by one of the others, but that's not
> as good as one method that is known to be altogether reliable.

I disagree with this on a couple of levels.  The first is pretty simple-
we don't have all of the information.  The user may have some reason to
believe that timestamp-based is a bad idea, for example, and therefore
having an option to perform a checksum-based backup makes sense.  rsync
is a pretty good tool in my view and it has a very similar option-
because there are trade-offs to be made.  LSN is great, if you don't
mind reading every file of your database start-to-finish every time, but
in a running system which hasn't suffered from clock skew or other odd
issues (some of which we can also detect), it's pretty painful to scan
absolutely everything like that for an incremental.

Perhaps the discussion has already moved on to having some way of our
own to track if a given file has changed without having to scan all of
it- if so, that's a discussion I'd be interested in.  I'm not against
other approaches here besides timestamps if there's a solid reason why
they're better and they're also able to avoid scanning the entire
database.

> > By having a manifest for each backed up file for each backup, you also
> > gain the ability to validate that a backup in the repository hasn't been
> > corrupted post-backup, a feature that at least some other database
> > backup and restore systems have (referring specifically to the big O in
> > this particular case, but I bet others do too).
>
> Agreed. The manifest only lets you validate to a limited extent, but
> that's still useful.

If you track the checksum of the file in the manifest then it's a pretty
strong validation that the backup repo hasn't been corrupted between the
backup and the restore.  Of course, the database could have been
corrupted at the source, and perhaps that's what you were getting at
with your 'limited extent' but that isn't what I was referring to.

Claiming that the backup has been 'validated' by only looking at file
sizes certainly wouldn't be acceptable.  I can't imagine you were
suggesting that as you're certainly capable of realizing that, but I got
the feeling you weren't agreeing that having the checksum of the file
made sense to include in the manifest, so I feel like I'm missing
something here.

> > Having a system of keeping track of which backups are full and which are
> > differential in an overall system also gives you the ability to do
> > things like expiration in a sensible way, including handling WAL
> > expiration.
>
> True, but I'm not sure that functionality belongs in core. It
> certainly needs to be possible for out-of-core code to do this part of
> the work if desired, because people want to integrate with enterprise
> backup systems, and we can't come in and say, well, you back up
> everything else using Netbackup or Tivoli, but for PostgreSQL you have
> to use pg_backrest. I mean, maybe you can win that argument, but I
> know I can't.

I'm pretty baffled by this argument, particularly in this context.  We
already have tooling around trying to manage WAL archives in core- see
pg_archivecleanup.  Further, we're talking about pg_basebackup here, not
about Netbackup or Tivoli, and the results of a pg_basebackup (that is,
a set of tar files, or a data directory) could happily be backed up
using whatever Enterprise tool folks want to use- in much the same way
that a pgbackrest repo is also able to be backed up using whatever
Enterprise tools someone wishes to use.  We designed it quite carefully
to work with exactly that use-case, so the distinction here is quite
lost on me.  Perhaps you could clarify what use-case these changes to
pg_basebackup solve, when working with a Netbackup or Tivoli system,
that pgbackrest doesn't, since you bring it up here?

> > I'd like to clarify that while I would like to have an easier way to
> > parallelize backups, that's a relatively minor complaint- the much
> > bigger issue that I have with this feature is that trying to address
> > everything correctly while having only the amount of information that
> > could be passed on the command-line about the prior full/incremental is
> > going to be extremely difficult, complicated, and likely to lead to
> > subtle bugs in the actual code, and probably less than subtle bugs in
> > how users end up using it, since they'll have to implement the
> > expiration and tracking of information between backups themselves
> > (unless something's changed in that part during this discussion- I admit
> > that I've not read every email in this thread).
>
> Well, the evidence seems to show that you are right, at least to some
> extent. I consider it a positive good if the client needs to give the
> server only a limited amount of information. After all, you could
> always take an incremental backup by shipping every byte of the
> previous backup to the server, having it compare everything to the
> current contents, and having it then send you back the stuff that is
> new or different. But that would be dumb, because most of the point of
> an incremental backup is to save on sending lots of data over the
> network unnecessarily. Now, it seems that I took that goal to an
> unhealthy extreme, because as we've now realized, sending only an LSN
> and nothing else isn't enough to get a correct backup. So we need to
> send more, and it doesn't have to be the absolutely most
> stripped-down, bare-bones version of what could be sent. But it should
> be fairly minimal, I think; that's kinda the point of the feature.

Right- much of the point of an incremental backup feature is to try and
minimize the amount of work that's done while still getting a good
backup.  I don't agree that we should focus solely on network bandwidth
as there are also trade-offs to be made around disk bandwidth to
consider, see above discussion regarding timestamps vs. checksum'ing
every file.

As for if we should be sending more to the server, or asking the server
to send more to us, I don't really have a good feel for what's "best".
At least one implementation I'm familiar with builds a manifest on the
PG server side and then compares the results of that to the manifest
stored with the backup (where that comparison is actually done is on
whatever system the "backup" was started from, typically a backup
server).  Perhaps there's an argument for sending the manifest from the
backup repository to PostgreSQL for it to then compare against the data
directory but I'm not really sure how it could possibly do that more
efficiently and that's moving work to the PG server that it doesn't
really need to do.

Thanks,

Stephen


Re: block-level incremental backup

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Sep 16, 2019 at 9:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > Isn't some operations where at the end we directly call heap_sync
> > > without writing WAL will have a similar problem as well?
> >
> > Maybe.  Can you give an example?
>
> Looking through the code, I found two cases where we do this.  One is
> a bulk insert operation with wal_level = minimal, and the other is
> CLUSTER or VACUUM FULL with wal_level = minimal. In both of these
> cases we are generating new blocks whose LSNs will be 0/0. So, I think
> we need a rule that if the server is asked to back up all blocks in a
> file with LSNs > some threshold LSN, it must also include any blocks
> whose LSN is 0/0. Those blocks are either uninitialized or are
> populated without WAL logging, so they always need to be copied.

I'm not sure I see a way around it but this seems pretty unfortunate-
every single incremental backup will have all of those included even
though the full backup likely also does (I say likely since someone
could do a full backup, set the WAL to minimal, load a bunch of data,
and then restart back to a WAL level where we can do a new backup, and
then do an incremental, so we don't *know* that the full includes those
blocks unless we also track a block-level checksum or similar).  Then
again, doing these kinds of server bounces to change the WAL level
around is, hopefully, relatively rare..

> Outside of unlogged and temporary tables, I don't know of any case
> where we make a critical modification to an already-existing block
> without bumping the LSN. I hope there is no such case.

I believe we all do. :)

Thanks,

Stephen


Re: block-level incremental backup

From
Robert Haas
Date:
On Mon, Sep 16, 2019 at 1:10 PM Stephen Frost <sfrost@snowman.net> wrote:
> I disagree with this on a couple of levels.  The first is pretty simple-
> we don't have all of the information.  The user may have some reason to
> believe that timestamp-based is a bad idea, for example, and therefore
> having an option to perform a checksum-based backup makes sense.  rsync
> is a pretty good tool in my view and it has a very similar option-
> because there are trade-offs to be made.  LSN is great, if you don't
> mind reading every file of your database start-to-finish every time, but
> in a running system which hasn't suffered from clock skew or other odd
> issues (some of which we can also detect), it's pretty painful to scan
> absolutely everything like that for an incremental.

There's a separate thread on using WAL-scanning to avoid having to
scan all the data every time. I pointed it out to you early in this
thread, too.

> If you track the checksum of the file in the manifest then it's a pretty
> strong validation that the backup repo hasn't been corrupted between the
> backup and the restore.  Of course, the database could have been
> corrupted at the source, and perhaps that's what you were getting at
> with your 'limited extent' but that isn't what I was referring to.

Yeah, that all seems fair. Without the checksum, you can only validate
that you have the right files and that they are the right sizes, which
is not bad, but the checksums certainly make it stronger. But,
wouldn't having to checksum all of the files add significantly to the
cost of taking the backup? If so, I can imagine that some people might
want to pay that cost but others might not. If it's basically free to
checksum the data while we have it in memory anyway, then I guess
there's little to be lost.

> I'm pretty baffled by this argument, particularly in this context.  We
> already have tooling around trying to manage WAL archives in core- see
> pg_archivecleanup.  Further, we're talking about pg_basebackup here, not
> about Netbackup or Tivoli, and the results of a pg_basebackup (that is,
> a set of tar files, or a data directory) could happily be backed up
> using whatever Enterprise tool folks want to use- in much the same way
> that a pgbackrest repo is also able to be backed up using whatever
> Enterprise tools someone wishes to use.  We designed it quite carefully
> to work with exactly that use-case, so the distinction here is quite
> lost on me.  Perhaps you could clarify what use-case these changes to
> pg_basebackup solve, when working with a Netbackup or Tivoli system,
> that pgbackrest doesn't, since you bring it up here?

I'm not an expert on any of those systems, but I doubt that
everybody's OK with backing everything up to a pgbackrest repository
and then separately backing up that repository to some other system.
That sounds like a pretty large storage cost.

> As for if we should be sending more to the server, or asking the server
> to send more to us, I don't really have a good feel for what's "best".
> At least one implementation I'm familiar with builds a manifest on the
> PG server side and then compares the results of that to the manifest
> stored with the backup (where that comparison is actually done is on
> whatever system the "backup" was started from, typically a backup
> server).  Perhaps there's an argument for sending the manifest from the
> backup repository to PostgreSQL for it to then compare against the data
> directory but I'm not really sure how it could possibly do that more
> efficiently and that's moving work to the PG server that it doesn't
> really need to do.

I agree with all that, but... if the server builds a manifest on the
PG server that is to be compared with the backup's manifest, the one
the PG server builds can't really include checksums, I think. To get
the checksums, it would have to read the entire cluster while building
the manifest, which sounds insane. Presumably it would have to build a
checksum-free version of the manifest, and then the client could
checksum the files as they're streamed down and write out a revised
manifest that adds the checksums.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Sep 16, 2019 at 1:10 PM Stephen Frost <sfrost@snowman.net> wrote:
> > I disagree with this on a couple of levels.  The first is pretty simple-
> > we don't have all of the information.  The user may have some reason to
> > believe that timestamp-based is a bad idea, for example, and therefore
> > having an option to perform a checksum-based backup makes sense.  rsync
> > is a pretty good tool in my view and it has a very similar option-
> > because there are trade-offs to be made.  LSN is great, if you don't
> > mind reading every file of your database start-to-finish every time, but
> > in a running system which hasn't suffered from clock skew or other odd
> > issues (some of which we can also detect), it's pretty painful to scan
> > absolutely everything like that for an incremental.
>
> There's a separate thread on using WAL-scanning to avoid having to
> scan all the data every time. I pointed it out to you early in this
> thread, too.

As discussed nearby, not everything that needs to be included in the
backup is actually going to be in the WAL though, right?  How would that
ever be able to handle the case where someone starts the server under
wal_level = logical, takes a full backup, then restarts with wal_level =
minimal, writes out a bunch of new data, and then restarts back to
wal_level = logical and takes an incremental?

How would we even detect that such a thing happened?

> > If you track the checksum of the file in the manifest then it's a pretty
> > strong validation that the backup repo hasn't been corrupted between the
> > backup and the restore.  Of course, the database could have been
> > corrupted at the source, and perhaps that's what you were getting at
> > with your 'limited extent' but that isn't what I was referring to.
>
> Yeah, that all seems fair. Without the checksum, you can only validate
> that you have the right files and that they are the right sizes, which
> is not bad, but the checksums certainly make it stronger. But,
> wouldn't having to checksum all of the files add significantly to the
> cost of taking the backup? If so, I can imagine that some people might
> want to pay that cost but others might not. If it's basically free to
> checksum the data while we have it in memory anyway, then I guess
> there's little to be lost.

On larger systems, so many of the files are 1GB in size that checking
the file size is quite close to meaningless.  Yes, having to checksum
all of the files definitely adds to the cost of taking the backup, but
to avoid it we need strong assurances that a given file hasn't been
changed since our last full backup.  WAL, today at least, isn't quite
that, and timestamps can possibly be fooled with, so if you'd like to be
particularly careful, there doesn't seem to be a lot of alternatives.

> > I'm pretty baffled by this argument, particularly in this context.  We
> > already have tooling around trying to manage WAL archives in core- see
> > pg_archivecleanup.  Further, we're talking about pg_basebackup here, not
> > about Netbackup or Tivoli, and the results of a pg_basebackup (that is,
> > a set of tar files, or a data directory) could happily be backed up
> > using whatever Enterprise tool folks want to use- in much the same way
> > that a pgbackrest repo is also able to be backed up using whatever
> > Enterprise tools someone wishes to use.  We designed it quite carefully
> > to work with exactly that use-case, so the distinction here is quite
> > lost on me.  Perhaps you could clarify what use-case these changes to
> > pg_basebackup solve, when working with a Netbackup or Tivoli system,
> > that pgbackrest doesn't, since you bring it up here?
>
> I'm not an expert on any of those systems, but I doubt that
> everybody's OK with backing everything up to a pgbackrest repository
> and then separately backing up that repository to some other system.
> That sounds like a pretty large storage cost.

I'm not asking you to be an expert on those systems, just to help me
understand the statements you're making.  How is backing up to a
pgbackrest repo different than running a pg_basebackup in the context of
using some other Enterprise backup system?  In both cases, you'll have a
full copy of the backup (presumably compressed) somewhere out on a disk
or filesystem which is then backed up by the Enterprise tool.

> > As for if we should be sending more to the server, or asking the server
> > to send more to us, I don't really have a good feel for what's "best".
> > At least one implementation I'm familiar with builds a manifest on the
> > PG server side and then compares the results of that to the manifest
> > stored with the backup (where that comparison is actually done is on
> > whatever system the "backup" was started from, typically a backup
> > server).  Perhaps there's an argument for sending the manifest from the
> > backup repository to PostgreSQL for it to then compare against the data
> > directory but I'm not really sure how it could possibly do that more
> > efficiently and that's moving work to the PG server that it doesn't
> > really need to do.
>
> I agree with all that, but... if the server builds a manifest on the
> PG server that is to be compared with the backup's manifest, the one
> the PG server builds can't really include checksums, I think. To get
> the checksums, it would have to read the entire cluster while building
> the manifest, which sounds insane. Presumably it would have to build a
> checksum-free version of the manifest, and then the client could
> checksum the files as they're streamed down and write out a revised
> manifest that adds the checksums.

Unless files can be excluded based on some relatively strong criteria,
then yes, the approach would be to use checksums of the files and would
necessarily include all files, meaning that you'd have to read them all.

That's not great, of course, which is why there are trade-offs to be
made, one of which typically involves using timestamps, but doing so
quite carefully, to perform the file exclusion.  Other ideas are great
but it seems like WAL isn't really a great idea unless we make some
changes there and we, as in PG, haven't got a robust "we know this file
changed as of this point" to work from.  I worry that we're putting too
much faith into a system to do something independent of what it was
actually built and designed to do, and thinking that because we could
trust it for X, we can trust it for Y.

Thanks,

Stephen


Re: block-level incremental backup

From
Amit Kapila
Date:
On Mon, Sep 16, 2019 at 11:09 PM Stephen Frost <sfrost@snowman.net> wrote:
>
> Greetings,
>
> * Robert Haas (robertmhaas@gmail.com) wrote:
> > On Mon, Sep 16, 2019 at 9:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > > Isn't some operations where at the end we directly call heap_sync
> > > > without writing WAL will have a similar problem as well?
> > >
> > > Maybe.  Can you give an example?
> >
> > Looking through the code, I found two cases where we do this.  One is
> > a bulk insert operation with wal_level = minimal, and the other is
> > CLUSTER or VACUUM FULL with wal_level = minimal. In both of these
> > cases we are generating new blocks whose LSNs will be 0/0. So, I think
> > we need a rule that if the server is asked to back up all blocks in a
> > file with LSNs > some threshold LSN, it must also include any blocks
> > whose LSN is 0/0. Those blocks are either uninitialized or are
> > populated without WAL logging, so they always need to be copied.
>
> I'm not sure I see a way around it but this seems pretty unfortunate-
> every single incremental backup will have all of those included even
> though the full backup likely also does
>

Yeah, this is quite unfortunate.  One more thing to note is that the
same is true for other operations like 'create index' (e.g. nbtree
bypasses the buffer manager while creating the index, doesn't write WAL
for wal_level = minimal, and then syncs at the end).

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: block-level incremental backup

From
Amit Kapila
Date:
On Mon, Sep 16, 2019 at 7:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Sep 16, 2019 at 4:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > This seems to be a blocking problem for the LSN based design.
>
> Well, only the simplest version of it, I think.
>
> > Can we think of using creation time for file?  Basically, if the file
> > creation time is later than backup-labels "START TIME:", then include
> > that file entirely.  I think one big point against this is clock skew
> > like what if somebody tinkers with the clock.  And also, this can
> > cover cases like
> > what Jeevan has pointed but might not cover other cases which we found
> > problematic.
>
> Well that would mean, for example, that if you copied the data
> directory from one machine to another, the next "incremental" backup
> would turn into a full backup. That sucks. And in other situations,
> like resetting the clock, it could mean that you end up with a corrupt
> backup without any real ability for PostgreSQL to detect it. I'm not
> saying that it is impossible to create a practically useful system
> based on file time stamps, but I really don't like it.
>
> > I think the operations covered by WAL flag XLR_SPECIAL_REL_UPDATE will
> > have similar problems.
>
> I'm not sure quite what you mean by that.  Can you elaborate? It
> appears to me that the XLR_SPECIAL_REL_UPDATE operations are all
> things that create files, remove files, or truncate files, and the
> sketch in my previous email would handle the first two of those cases
> correctly.  See below for the third.
>
> > One related point is how do incremental backups handle the case where
> > vacuum truncates the relation partially?  Basically, with current
> > patch/design, it doesn't appear that such information can be passed
> > via incremental backup.  I am not sure if this is a problem, but it
> > would be good if we can somehow handle this.
>
> As to this, if you're taking a full backup of a particular file,
> there's no problem.  If you're taking a partial backup of a particular
> file, you need to include the current length of the file and the
> identity and contents of each modified block.  Then you're fine.
>

Right, this should address that point.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: block-level incremental backup

From
Robert Haas
Date:
On Mon, Sep 16, 2019 at 3:38 PM Stephen Frost <sfrost@snowman.net> wrote:
> As discussed nearby, not everything that needs to be included in the
> backup is actually going to be in the WAL though, right?  How would that
> ever be able to handle the case where someone starts the server under
> wal_level = logical, takes a full backup, then restarts with wal_level =
> minimal, writes out a bunch of new data, and then restarts back to
> wal_level = logical and takes an incremental?

Fair point. I think the WAL-scanning approach can only work if
wal_level > minimal. But, I also think that few people run with
wal_level = minimal in this era where the default has been changed to
replica; and I think we can detect the WAL level in use while scanning
WAL. It can only change at a checkpoint.
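
(A hedged sketch of how a WAL scanner might notice that wal_level dropped
below 'replica' -- the record-type names are from pg_control.h and
xlogreader.h, but the function itself is just an assumption:)

static bool
wal_level_was_insufficient(XLogReaderState *record)
{
    if (XLogRecGetRmid(record) == RM_XLOG_ID &&
        (XLogRecGetInfo(record) & ~XLR_INFO_MASK) == XLOG_PARAMETER_CHANGE)
    {
        xl_parameter_change xlrec;

        memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));

        /* Below 'replica', not every block change is WAL-logged. */
        return xlrec.wal_level < WAL_LEVEL_REPLICA;
    }

    return false;
}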

> On larger systems, so many of the files are 1GB in size that checking
> the file size is quite close to meaningless.  Yes, having to checksum
> all of the files definitely adds to the cost of taking the backup, but
> to avoid it we need strong assurances that a given file hasn't been
> changed since our last full backup.  WAL, today at least, isn't quite
> that, and timestamps can possibly be fooled with, so if you'd like to be
> particularly careful, there doesn't seem to be a lot of alternatives.

I see your points, but it feels like you're trying to talk down the
WAL-based approach over what seem to me to be fairly manageable corner
cases.

> I'm not asking you to be an expert on those systems, just to help me
> understand the statements you're making.  How is backing up to a
> pgbackrest repo different than running a pg_basebackup in the context of
> using some other Enterprise backup system?  In both cases, you'll have a
> full copy of the backup (presumably compressed) somewhere out on a disk
> or filesystem which is then backed up by the Enterprise tool.

Well, I think that what people really want is to be able to backup
straight into the enterprise tool, without an intermediate step.

My basic point here is: As with practically all PostgreSQL
development, I think we should try to expose capabilities and avoid
making policy on behalf of users.

I'm not objecting to the idea of having tools that can help users
figure out how much WAL they need to retain -- but insofar as we can
do it, such tools should work regardless of where that WAL is actually
stored. I dislike the idea that PostgreSQL would provide something
akin to a "pgbackrest repository" in core, or at least I think it
would be important that we're careful about how much functionality
gets tied to the presence and use of such a thing, because, at least
based on my experience working at EnterpriseDB, larger customers often
don't want to do it that way.

> That's not great, of course, which is why there are trade-offs to be
> made, one of which typically involves using timestamps, but doing so
> quite carefully, to perform the file exclusion.  Other ideas are great
> but it seems like WAL isn't really a great idea unless we make some
> changes there and we, as in PG, haven't got a robust "we know this file
> changed as of this point" to work from.  I worry that we're putting too
> much faith into a system to do something independent of what it was
> actually built and designed to do, and thinking that because we could
> trust it for X, we can trust it for Y.

That seems like a considerable overreaction to me based on the
problems reported thus far. The fact is, WAL was originally intended
for crash recovery and has subsequently been generalized to be usable
for point-in-time recovery, standby servers, and logical decoding.
It's clearly established at this point as the canonical way that you
know what in the database has changed, which is the same need that we
have for incremental backup.

At any rate, the same criticism can be leveled - IMHO with a lot more
validity - at timestamps. Last-modification timestamps are completely
outside of our control; they are owned by the OS and various operating
systems can and do have varying behavior. They can go backwards when
things have changed; they can go forwards when things have not
changed. They were clearly not intended to meet this kind of
requirement. And they were intended for that purpose even less than
WAL, which was actually designed for a requirement in this general
ballpark, if not this thing precisely.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Sep 16, 2019 at 3:38 PM Stephen Frost <sfrost@snowman.net> wrote:
> > As discussed nearby, not everything that needs to be included in the
> > backup is actually going to be in the WAL though, right?  How would that
> > ever be able to handle the case where someone starts the server under
> > wal_level = logical, takes a full backup, then restarts with wal_level =
> > minimal, writes out a bunch of new data, and then restarts back to
> > wal_level = logical and takes an incremental?
>
> Fair point. I think the WAL-scanning approach can only work if
> wal_level > minimal. But, I also think that few people run with
> wal_level = minimal in this era where the default has been changed to
> replica; and I think we can detect the WAL level in use while scanning
> WAL. It can only change at a checkpoint.

We need to be sure that we can detect if the WAL level has ever been set
to minimal between a full and an incremental and, if so, either refuse
to run the incremental, or promote it to a full, or make it a
checksum-based incremental instead of trusting the WAL stream.

I'm also glad that we ended up changing the default, though, and I do
hope that there are relatively few people running with minimal and even
fewer who play around with flipping it back and forth.
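
As a rough illustration of that check, a backup tool with access to all
of the WAL between the two backups could do something like the
following.  This assumes pg_waldump is on the PATH and that its
PARAMETER_CHANGE descriptions include the new wal_level setting; the
exact output strings are an assumption here, so treat this as a sketch
only.

    import subprocess

    def wal_level_dropped_to_minimal(waldir, start_lsn, end_lsn):
        """Scan WAL with pg_waldump and return True if any
        PARAMETER_CHANGE record reports wal_level=minimal between the
        prior backup's start LSN and the current one."""
        out = subprocess.run(
            ["pg_waldump", "--path", waldir,
             "--start", start_lsn, "--end", end_lsn],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in out.splitlines():
            if "PARAMETER_CHANGE" in line and "wal_level=minimal" in line:
                return True
        return False

    def choose_incremental_mode(waldir, prior_start_lsn, current_lsn):
        """Pick one of the three options described above."""
        if wal_level_dropped_to_minimal(waldir, prior_start_lsn, current_lsn):
            # Either refuse, promote to a full backup, or fall back to a
            # checksum-based incremental; this sketch falls back.
            return "checksum-incremental"
        return "wal-based-incremental"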

> > On larger systems, so many of the files are 1GB in size that checking
> > the file size is quite close to meaningless.  Yes, having to checksum
> > all of the files definitely adds to the cost of taking the backup, but
> > to avoid it we need strong assurances that a given file hasn't been
> > changed since our last full backup.  WAL, today at least, isn't quite
> > that, and timestamps can possibly be fooled with, so if you'd like to be
> > particularly careful, there doesn't seem to be a lot of alternatives.
>
> I see your points, but it feels like you're trying to talk down the
> WAL-based approach over what seem to me to be fairly manageable corner
> cases.

Just to be clear, I see your points and I like the general idea of
finding solutions, but it seems like the issues are likely to be pretty
complex and I'm not sure that's being appreciated very well.

> > I'm not asking you to be an expert on those systems, just to help me
> > understand the statements you're making.  How is backing up to a
> > pgbackrest repo different than running a pg_basebackup in the context of
> > using some other Enterprise backup system?  In both cases, you'll have a
> > full copy of the backup (presumably compressed) somewhere out on a disk
> > or filesystem which is then backed up by the Enterprise tool.
>
> Well, I think that what people really want is to be able to backup
> straight into the enterprise tool, without an intermediate step.

Ok..  I can understand that but I don't get how these changes to
pg_basebackup will help facilitate that.  If they don't and what you're
talking about here is independent, then great, that clarifies things,
but if you're saying that these changes to pg_basebackup are to help
with backing up directly into those Enterprise systems then I'm just
asking for some help in understanding how- what's the use-case here that
we're adding to pg_basebackup that makes it work with these Enterprise
systems?

I'm not trying to be difficult here, I'm just trying to understand.

> My basic point here is: As with practically all PostgreSQL
> development, I think we should try to expose capabilities and avoid
> making policy on behalf of users.
>
> I'm not objecting to the idea of having tools that can help users
> figure out how much WAL they need to retain -- but insofar as we can
> do it, such tools should work regardless of where that WAL is actually
> stored.

How would that tool work, if it's to be able to work regardless of where
the WAL is actually stored..?  Today, pg_archivecleanup just works
against a POSIX filesystem- are you thinking that the tool would have a
pluggable storage system, so that it could work with, say, a POSIX
filesystem, or a CIFS mount, or a s3-like system?
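
To make the question concrete, the kind of pluggability I have in mind
could be as small as an interface like the one below, which a POSIX,
CIFS-mounted, or s3-like backend could each implement.  This is a
hypothetical sketch, not a proposal for actual pg_archivecleanup
changes.

    import os
    from abc import ABC, abstractmethod

    class WalStore(ABC):
        """Hypothetical minimal interface a WAL-retention tool could
        target, so the same cleanup logic works against different
        storage backends."""

        @abstractmethod
        def list_segments(self):
            """Return the WAL segment file names currently stored."""

        @abstractmethod
        def remove_segment(self, name):
            """Delete one WAL segment."""

    class PosixWalStore(WalStore):
        def __init__(self, directory):
            self.directory = directory

        def list_segments(self):
            return sorted(os.listdir(self.directory))

        def remove_segment(self, name):
            os.unlink(os.path.join(self.directory, name))

    def cleanup_before(store, oldest_needed_segment):
        """Remove every segment that sorts before the oldest one still
        needed, roughly what pg_archivecleanup does for a directory.
        The 24-character check is a simplification that skips .history,
        .backup, and .partial files."""
        for name in store.list_segments():
            if len(name) == 24 and name < oldest_needed_segment:
                store.remove_segment(name)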

> I dislike the idea that PostgreSQL would provide something
> akin to a "pgbackrest repository" in core, or at least I think it
> would be important that we're careful about how much functionality
> gets tied to the presence and use of such a thing, because, at least
> based on my experience working at EnterpriseDB, larger customers often
> don't want to do it that way.

This seems largely independent of the above discussion, but since we're
discussing it, I've certainly had various experiences in this area too.
Some larger customers would like to use an s3-like store (which
pgbackrest already supports, and it will support others going forward
since it has a pluggable storage mechanism for the repo...).  Then there
are customers who would like to point their Enterprise backup solution
at a directory on disk to back it up (which pgbackrest also supports, as
mentioned previously).  And lastly there are customers who really want
to just back up the PG data directory and would like it to "just work",
thank you, and no, they don't have any thought or concern about how to
handle WAL, but surely it can't be that important, can it?

The last is tongue-in-cheek and I'm half-kidding there, but this is why
I was trying to understand the comments above about what the use-case is
here that we're trying to solve for that answers the call for the
Enterprise software crowd, and ideally what distinguishes that from
pgbackrest.  Even just a clear-cut "this is what this change will do to
make pg_basebackup work for Enterprise customers" would be great, or a
"well, pg_basebackup today works for them because it does X, and it'll
continue to be able to do X even after this change."

I'll take a wild shot in the dark to try to help move us through this-
is it that pg_basebackup can stream out to stdout in some cases..?
Though that's quite limited since it means you can't have additional
tablespaces and you can't stream the WAL, and how would that work with
the manifest idea that's being discussed..?  If there's a directory
that's got manifest files in it for each backup, so we have the file
sizes for them, those would need to be accessible when we go to do the
incremental backup and couldn't be stored off somewhere else, I wouldn't
think..

> > That's not great, of course, which is why there are trade-offs to be
> > made, one of which typically involves using timestamps, but doing so
> > quite carefully, to perform the file exclusion.  Other ideas are great
> > but it seems like WAL isn't really a great idea unless we make some
> > changes there and we, as in PG, haven't got a robust "we know this file
> > changed as of this point" to work from.  I worry that we're putting too
> > much faith into a system to do something independent of what it was
> > actually built and designed to do, and thinking that because we could
> > trust it for X, we can trust it for Y.
>
> That seems like a considerable overreaction to me based on the
> problems reported thus far. The fact is, WAL was originally intended
> for crash recovery and has subsequently been generalized to be usable
> for point-in-time recovery, standby servers, and logical decoding.
> It's clearly established at this point as the canonical way that you
> know what in the database has changed, which is the same need that we
> have for incremental backup.

Provided the WAL level is at the level that you need it to be, that
will be true for things which are actually supported with PITR,
replication to standby servers, et al.  I can see how it might come
across as an overreaction, but this strikes me as a pretty glaring
issue, and I worry that if it was overlooked until now, there'll be
other more subtle issues.  Backups are just plain complicated to get
right to begin with, something that I don't think people appreciate
until they've been dealing with them for quite a while.

Not that this would be the first time we've had issues in this area, and
we'd likely work through them over time, but I'm sure we'd all prefer to
get it as close to right as possible the first time around, and that's
going to require some pretty in depth review.

> At any rate, the same criticism can be leveled - IMHO with a lot more
> validity - at timestamps. Last-modification timestamps are completely
> outside of our control; they are owned by the OS and various operating
> systems can and do have varying behavior. They can go backwards when
> things have changed; they can go forwards when things have not
> changed. They were clearly not intended to meet this kind of
> requirement. And they were intended for that purpose even less than
> WAL, which was actually designed for a requirement in this general
> ballpark, if not this thing precisely.

While I understand that timestamps may be used for a lot of things and
that the time on a system could go forward or backward, the actual
requirement is:

- If the file was modified after the backup was done, the timestamp (or
  the size) needs to be different.  Doesn't actually matter if it's
  forwards, or backwards, different is all that's needed.  The timestamp
  also needs to be before the backup started for it to be considered an
  option to skip it.

Is it possible for that to be fooled?  Yes, of course, but it isn't as
easily fooled as the typical "just copy files newer than X" approach
that other tools take, at least if you're keeping a manifest of all of
the files, et al, as discussed earlier.
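
Spelled out as code, the rule above looks roughly like the following
sketch; the manifest entry structure and the prior-backup start time
parameter are hypothetical illustrations of what such a manifest would
need to record.

    import os
    from dataclasses import dataclass

    @dataclass
    class ManifestEntry:
        # What the prior backup recorded for this file (hypothetical).
        size: int
        mtime: float  # modification time seen at the prior backup

    def can_skip(path, entry, prior_backup_start_time):
        """Skip re-copying a file only if its size and timestamp are
        unchanged from the manifest AND its timestamp predates the start
        of the prior backup; any difference, forwards or backwards,
        forces a re-copy."""
        st = os.stat(path)
        if st.st_size != entry.size:
            return False
        if st.st_mtime != entry.mtime:
            return False
        if st.st_mtime >= prior_backup_start_time:
            return False
        return True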

Thanks,

Stephen


Re: block-level incremental backup

From
Robert Haas
Date:
On Tue, Sep 17, 2019 at 12:09 PM Stephen Frost <sfrost@snowman.net> wrote:
> We need to be sure that we can detect if the WAL level has ever been set
> to minimal between a full and an incremental and, if so, either refuse
> to run the incremental, or promote it to a full, or make it a
> checksum-based incremental instead of trusting the WAL stream.

Sure. What about checksum collisions?

> Just to be clear, I see your points and I like the general idea of
> finding solutions, but it seems like the issues are likely to be pretty
> complex and I'm not sure that's being appreciated very well.

Definitely possible, but it's more helpful if you can point out the
actual issues.

> Ok..  I can understand that but I don't get how these changes to
> pg_basebackup will help facilitate that.  If they don't and what you're
> talking about here is independent, then great, that clarifies things,
> but if you're saying that these changes to pg_basebackup are to help
> with backing up directly into those Enterprise systems then I'm just
> asking for some help in understanding how- what's the use-case here that
> we're adding to pg_basebackup that makes it work with these Enterprise
> systems?
>
> I'm not trying to be difficult here, I'm just trying to understand.

Man, I feel like we're totally drifting off into the weeds here.  I'm
not arguing that these changes to pg_basebackup will help enterprise
users except insofar as those users want incremental backup.  All of
this discussion started with this comment from you:

"Having a system of keeping track of which backups are full and which
are differential in an overall system also gives you the ability to do
things like expiration in a sensible way, including handling WAL
expiration."

All I was doing was saying that for an enterprise user, the overall
system might be something entirely outside of our control, like
NetBackup or Tivoli. Therefore, whatever functionality we provide to
do that kind of thing should be able to be used in such contexts. That
hardly seems like a controversial proposition.

> How would that tool work, if it's to be able to work regardless of where
> the WAL is actually stored..?  Today, pg_archivecleanup just works
> against a POSIX filesystem- are you thinking that the tool would have a
> pluggable storage system, so that it could work with, say, a POSIX
> filesystem, or a CIFS mount, or a s3-like system?

Again, I was making a general statement about design goals -- "we
should try to work nicely with enterprise backup products" -- not
proposing a specific design for a specific thing. I don't think the
idea of some pluggability in that area is a bad one, but it's not even
slightly what this thread is about.

> Provided the WAL level is at the level that you need it to be, that
> will be true for things which are actually supported with PITR,
> replication to standby servers, et al.  I can see how it might come
> across as an overreaction, but this strikes me as a pretty glaring
> issue, and I worry that if it was overlooked until now, there'll be
> other more subtle issues.  Backups are just plain complicated to get
> right to begin with, something that I don't think people appreciate
> until they've been dealing with them for quite a while.

Permit me to be unpersuaded. If it was such a glaring issue, and if
experience is the key to spotting such issues, then why didn't YOU
spot it?

I'm not arguing that this stuff isn't hard. It is. Nor am I arguing
that I didn't screw up. I did. But designs need to be accepted or
rejected based on facts, not FUD. You've raised some good technical
points and if you've got more concerns, I'd like to hear them, but I
don't think arguing vaguely that a certain approach will probably run
into trouble gets us anywhere.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Tue, Sep 17, 2019 at 12:09 PM Stephen Frost <sfrost@snowman.net> wrote:
> > We need to be sure that we can detect if the WAL level has ever been set
> > to minimal between a full and an incremental and, if so, either refuse
> > to run the incremental, or promote it to a full, or make it a
> > checksum-based incremental instead of trusting the WAL stream.
>
> Sure. What about checksum collisions?

Certainly possible, of course, but a sha256 of each file is at least
somewhat better than, say, our page-level checksums.  I do agree that
having the option to just say "promote it to a full", or "do a
byte-by-byte comparison against the prior backed up file" would be
useful for those who are concerned about sha256 collision probabilities.

Having a cross-check of "does this X% of the files that we decided not
to back up, for whatever reason, really still match what we think is in
the backup?" is definitely a valuable feature and one which I'd hope we
get to at some point.
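
As a sketch of that kind of spot-check (the manifest contents and the
sampling fraction here are hypothetical), the tool would re-hash a
sample of the skipped files and compare against what the manifest says:

    import hashlib
    import random

    def sha256_of(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def cross_check(skipped_files, sample_fraction=0.10):
        """Re-hash a random sample of the files we decided not to back
        up and report any whose contents no longer match the manifest.
        skipped_files: list of (path, expected_sha256) pairs."""
        if not skipped_files:
            return []
        sample_size = max(1, int(len(skipped_files) * sample_fraction))
        mismatches = []
        for path, expected in random.sample(skipped_files, sample_size):
            if sha256_of(path) != expected:
                mismatches.append(path)
        return mismatches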

> > Ok..  I can understand that but I don't get how these changes to
> > pg_basebackup will help facilitate that.  If they don't and what you're
> > talking about here is independent, then great, that clarifies things,
> > but if you're saying that these changes to pg_basebackup are to help
> > with backing up directly into those Enterprise systems then I'm just
> > asking for some help in understanding how- what's the use-case here that
> > we're adding to pg_basebackup that makes it work with these Enterprise
> > systems?
> >
> > I'm not trying to be difficult here, I'm just trying to understand.
>
> Man, I feel like we're totally drifting off into the weeds here.  I'm
> not arguing that these changes to pg_basebackup will help enterprise
> users except insofar as those users want incremental backup.  All of
> this discussion started with this comment from you:
>
> "Having a system of keeping track of which backups are full and which
> are differential in an overall system also gives you the ability to do
> things like expiration in a sensible way, including handling WAL
> expiration."
>
> All I was doing was saying that for an enterprise user, the overall
> system might be something entirely outside of our control, like
> NetBackup or Tivoli. Therefore, whatever functionality we provide to
> do that kind of thing should be able to be used in such contexts. That
> hardly seems like a controversial proposition.

And all I was trying to understand was how what pg_basebackup does in
this context is really different from what can be done with pgbackrest,
since you brought it up:

"True, but I'm not sure that functionality belongs in core. It
certainly needs to be possible for out-of-core code to do this part of
the work if desired, because people want to integrate with enterprise
backup systems, and we can't come in and say, well, you back up
everything else using Netbackup or Tivoli, but for PostgreSQL you have
to use pg_backrest. I mean, maybe you can win that argument, but I
know I can't."

What it sounds like you're arguing here is that what pg_basebackup
"has" going for it is that it specifically doesn't include any kind of
expiration management, and that's somehow helpful to people
who want to use Enterprise backup solutions.  Maybe that's what you were
getting at, in which case, I'm sorry for misunderstanding and dragging
it out, and thanks for helping me understand.

> > How would that tool work, if it's to be able to work regardless of where
> > the WAL is actually stored..?  Today, pg_archivecleanup just works
> > against a POSIX filesystem- are you thinking that the tool would have a
> > pluggable storage system, so that it could work with, say, a POSIX
> > filesystem, or a CIFS mount, or a s3-like system?
>
> Again, I was making a general statement about design goals -- "we
> should try to work nicely with enterprise backup products" -- not
> proposing a specific design for a specific thing. I don't think the
> idea of some pluggability in that area is a bad one, but it's not even
> slightly what this thread is about.

Well, I agree with you, as I said up-thread, that this seemed to be
going in a different and perhaps not entirely relevant direction.

> > Provided the WAL level is at the level that you need it to be, that
> > will be true for things which are actually supported with PITR,
> > replication to standby servers, et al.  I can see how it might come
> > across as an overreaction, but this strikes me as a pretty glaring
> > issue, and I worry that if it was overlooked until now, there'll be
> > other more subtle issues.  Backups are just plain complicated to get
> > right to begin with, something that I don't think people appreciate
> > until they've been dealing with them for quite a while.
>
> Permit me to be unpersuaded. If it was such a glaring issue, and if
> experience is the key to spotting such issues, then why didn't YOU
> spot it?

I'm not designing the feature..?  Sure, I agreed earlier with the
general idea that we might be able to use WAL scanning and/or the LSN to
figure out if a page had changed, but the next step would have been, I
would have thought anyway, for someone to do the analysis that has only
recently been started: looking at where and when we do, and don't, write
WAL, and actually building up confidence that this approach isn't
missing anything.  Instead, we seem to have come a long way in the
development of this without having done that, and that does shake my
confidence in this effort.

> I'm not arguing that this stuff isn't hard. It is. Nor am I arguing
> that I didn't screw up. I did. But designs need to be accepted or
> rejected based on facts, not FUD. You've raised some good technical
> points and if you've got more concerns, I'd like to hear them, but I
> don't think arguing vaguely that a certain approach will probably run
> into trouble gets us anywhere.

This just gets back to what I was saying earlier.  It seems like we're
presuming this is going to 'just work' because, say, replication works
great, or crash recovery works great, and those are based on WAL.  I'm
still hopeful that we can do something based on WAL or LSN here, but it
needs a careful review of when we are, and when we aren't, writing out
WAL for basically everything we do.  That's an effort I'm glad to see
may be starting to happen, but a quick "oh, this is why in this one case
with this one thing, and we're all good now" doesn't instill confidence
in me, at least.

Thanks,

Stephen
