Thread: block-level incremental backup
Hi,

Several companies, including EnterpriseDB, NTT, and Postgres Pro, have developed technology that permits a block-level incremental backup to be taken from a PostgreSQL server. I believe the idea in all of those cases is that non-relation files should be backed up in their entirety, but for relation files, only those blocks that have been changed need to be backed up. I would like to propose that we should have a solution for this problem in core, rather than leaving it to each individual PostgreSQL company to develop and maintain their own solution. Generally my idea is:

1. There should be a way to tell pg_basebackup to request from the server only those blocks where LSN >= threshold_value. There are several possible ways for the server to implement this, the simplest of which is to just scan all the blocks and send only the ones that satisfy that criterion. That might sound dumb, but it does still save network bandwidth, and it works even without any prior setup. It will probably be more efficient in many cases to instead scan all the WAL generated since that LSN and extract block references from it, but that is only possible if the server has all of that WAL available or can somehow get it from the archive. We could also, as several people have proposed previously, have some kind of additional relation fork that stores either a single is-modified bit -- which only helps if the reference LSN for the is-modified bit is older than the requested LSN but not too much older -- or the highest LSN for each range of K blocks, or something like that. I am at the moment not too concerned with the exact strategy we use here. I believe we may want to eventually support more than one, since they have different trade-offs.

2. When you use pg_basebackup in this way, each relation file that is not sent in its entirety is replaced by a file with a different name. For example, instead of base/16384/16417, you might get base/16384/partial.16417 or however we decide to name them. Each such file will store near the beginning of the file a list of all the blocks contained in that file, and the blocks themselves will follow at offsets that can be predicted from the metadata at the beginning of the file. The idea is that you shouldn't have to read the whole file to figure out which blocks it contains, and if you know specifically what blocks you want, you should be able to reasonably efficiently read just those blocks. A backup taken in this manner should also probably create some kind of metadata file in the root directory that stops the server from starting and lists other salient details of the backup. In particular, you need the threshold LSN for the backup (i.e. it contains blocks newer than this) and the start LSN for the backup (i.e. the LSN that would have been returned from pg_start_backup).

3. There should be a new tool that knows how to merge a full backup with any number of incremental backups and produce a complete data directory with no remaining partial files. The tool should check that the threshold LSN for each incremental backup is less than or equal to the start LSN of the previous backup; if not, there may be changes that happened in between which would be lost, so combining the backups is unsafe. Running this tool can be thought of either as restoring the backup or as producing a new synthetic backup from any number of incremental backups. This would allow for a strategy of unending incremental backups. For instance, on day 1, you take a full backup.
On every subsequent day, you take an incremental backup. On day 9, you run pg_combinebackup day1 day2 -o full; rm -rf day1 day2; mv full day2. On each subsequent day you do something similar. Now you can always roll back to any of the last seven days by combining the oldest backup you have (which is always a synthetic full backup) with as many newer incrementals as you want, up to the point where you want to stop.

Other random points:

- If the server has multiple ways of finding blocks with an LSN greater than or equal to the threshold LSN, it could make a cost-based decision between those methods, or it could allow the client to specify the method to be used.

- I imagine that the server would offer this functionality through a new replication command or a syntax extension to an existing command, so it could also be used by tools other than pg_basebackup if they wished.

- Combining backups could also be done destructively rather than, as proposed above, non-destructively, but you have to be careful about what happens in case of a failure.

- The pg_combinebackup tool (or whatever we call it) should probably have an option to exploit hard links to save disk space; this could in particular make construction of a new synthetic full backup much cheaper. However, you'd better be careful not to use this option when actually trying to restore, because if you start the server and run recovery, you don't want to change the copies of those same files that are in your backup directory. I guess the server could be taught to complain about st_nlink > 1, but I'm not sure we want to go there.

- It would also be possible to collapse multiple incremental backups into a single incremental backup, without combining with a full backup. In the worst case, size(i1+i2) = size(i1) + size(i2), but if the same data is modified repeatedly, collapsing backups would save lots of space. This doesn't seem like a must-have for v1, though.

- If you have a SAN and are taking backups using filesystem snapshots, then you don't need this, because your SAN probably already uses copy-on-write magic for those snapshots, and so you are already getting all of the same benefits in terms of saving storage space that you would get from something like this. But not everybody has a SAN.

- I know that there have been several previous efforts in this area, but none of them have gotten to the point of being committed. I intend no disrespect to those efforts. I believe I'm taking a slightly different view of the problem here than what has been done previously, trying to focus on the user experience rather than, e.g., the technology that is used to decide which blocks need to be sent. However, it's possible I've missed a promising patch that takes an approach very similar to what I'm outlining here, and if so, I don't mind a bit having that pointed out to me.

- This is just a design proposal at this point; there is no code. If this proposal, or some modified version of it, seems likely to be acceptable, I and/or my colleagues might try to implement it.

- It would also be nice to support *parallel* backup, both for full backups as we can do them today and for incremental backups. But that sounds like a separate effort. pg_combinebackup could potentially support parallel operation as well, although that might be too ambitious for v1.

- It would also be nice if pg_basebackup could write backups to places other than the local disk, like an object store, a tape drive, etc. But that also sounds like a separate effort.

Thoughts?
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
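To make point 2 above concrete, here is a minimal sketch in Python of what writing and reading such a "partial" relation file could look like. The header layout (magic string, block count, sorted block numbers, then 8kB images at predictable offsets) is purely hypothetical -- the proposal does not fix an on-disk format -- but it illustrates why a reader never has to scan the payloads to learn which blocks are present.

import struct

BLCKSZ = 8192
MAGIC = b"PGINCR01"   # hypothetical 8-byte magic, not something the proposal specifies

def write_partial(path, blocks):
    # blocks: {block_number: 8kB page image} for the blocks with LSN >= threshold
    blknos = sorted(blocks)
    with open(path, "wb") as f:
        f.write(MAGIC)
        f.write(struct.pack("<I", len(blknos)))
        for blkno in blknos:              # block list near the beginning of the file
            f.write(struct.pack("<I", blkno))
        for blkno in blknos:              # payloads at offsets predictable from the header
            f.write(blocks[blkno])

def read_block(path, blkno):
    # Return the stored image of blkno, or None if this partial file doesn't contain it.
    with open(path, "rb") as f:
        if f.read(8) != MAGIC:
            raise ValueError("not a partial file")
        (nblocks,) = struct.unpack("<I", f.read(4))
        blknos = struct.unpack("<%dI" % nblocks, f.read(4 * nblocks))
        if blkno not in blknos:
            return None
        f.seek(8 + 4 + 4 * nblocks + blknos.index(blkno) * BLCKSZ)
        return f.read(BLCKSZ)

Under this assumed layout, a combine tool only needs read_block() plus the corresponding file from the previous (full or synthetic full) backup to reconstruct any relation segment.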
Hello,

On 09.04.2019 18:48, Robert Haas wrote:
> - It would also be nice if pg_basebackup could write backups to places
> other than the local disk, like an object store, a tape drive, etc.
> But that also sounds like a separate effort.
>
> Thoughts?

(Just thinking out loud) It might also be useful to have a remote restore facility (i.e. if pg_combinebackup could write to non-local storage), so you don't need to restore the instance into a local place and then copy/move it to the remote machine. But it seems to me that this is the most nontrivial feature and requires much more effort than the other points.

In pg_probackup we have remote restore via SSH in a beta state. But SSH isn't an option for an in-core approach, I think.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Hi,
On 2019-04-09 11:48:38 -0400, Robert Haas wrote:
> 2. When you use pg_basebackup in this way, each relation file that is
> not sent in its entirety is replaced by a file with a different name.
> For example, instead of base/16384/16417, you might get
> base/16384/partial.16417 or however we decide to name them.
Hm. But that means that files that are shipped nearly in their entirety,
need to be fully rewritten. Wonder if it's better to ship them as files
with holes, and have the metadata in a separate file. That'd then allow
to just fill in the holes with data from the older version. I'd assume
that there's a lot of workloads where some significantly sized relations
will get updated in nearly their entirety between backups.
> Each such file will store near the beginning of the file a list of all the
> blocks contained in that file, and the blocks themselves will follow
> at offsets that can be predicted from the metadata at the beginning of
> the file. The idea is that you shouldn't have to read the whole file
> to figure out which blocks it contains, and if you know specifically
> what blocks you want, you should be able to reasonably efficiently
> read just those blocks. A backup taken in this manner should also
> probably create some kind of metadata file in the root directory that
> stops the server from starting and lists other salient details of the
> backup. In particular, you need the threshold LSN for the backup
> (i.e. contains blocks newer than this) and the start LSN for the
> backup (i.e. the LSN that would have been returned from
> pg_start_backup).
I wonder if we shouldn't just integrate that into pg_control or such. So
that:
> 3. There should be a new tool that knows how to merge a full backup
> with any number of incremental backups and produce a complete data
> directory with no remaining partial files.
Could just be part of server startup?
> - I imagine that the server would offer this functionality through a
> new replication command or a syntax extension to an existing command,
> so it could also be used by tools other than pg_basebackup if they
> wished.
Would this logic somehow be usable from tools that don't want to copy
the data directory via pg_basebackup (e.g. for parallelism, to directly
send to some backup service / SAN / whatnot)?
> - It would also be nice if pg_basebackup could write backups to places
> other than the local disk, like an object store, a tape drive, etc.
> But that also sounds like a separate effort.
Indeed seems separate. But worthwhile.
Greetings,
Andres Freund
On Tue, Apr 9, 2019 at 12:35 PM Andres Freund <andres@anarazel.de> wrote: > Hm. But that means that files that are shipped nearly in their entirety, > need to be fully rewritten. Wonder if it's better to ship them as files > with holes, and have the metadata in a separate file. That'd then allow > to just fill in the holes with data from the older version. I'd assume > that there's a lot of workloads where some significantly sized relations > will get updated in nearly their entirety between backups. I don't want to rely on holes at the FS level. I don't want to have to worry about what Windows does and what every Linux filesystem does and what NetBSD and FreeBSD and Dragonfly BSD and MacOS do. And I don't want to have to write documentation for the fine manual explaining to people that they need to use a hole-preserving tool when they copy an incremental backup around. And I don't want to have to listen to complaints from $USER that their backup tool, $THING, is not hole-aware. Just - no. But what we could do is have some threshold (as git does), beyond which you just send the whole file. For example if >90% of the blocks have changed, or >80% or whatever, then you just send everything. That way, if you have a database where you have lots and lots of 1GB segments with low churn (so that you can't just use full backups) and lots and lots of 1GB segments with high churn (to create the problem you're describing) you'll still be OK. > > 3. There should be a new tool that knows how to merge a full backup > > with any number of incremental backups and produce a complete data > > directory with no remaining partial files. > > Could just be part of server startup? Yes, but I think that sucks. You might not want to start the server but rather just create a new synthetic backup. And realistically, it's hard to imagine the server doing anything but synthesizing the backup first and then proceeding as normal. In theory there's no reason why it couldn't be smart enough to construct the files it needs "on demand" in the background, but that sounds really hard and I don't think there's enough value to justify that level of effort. YMMV, of course. > > - I imagine that the server would offer this functionality through a > > new replication command or a syntax extension to an existing command, > > so it could also be used by tools other than pg_basebackup if they > > wished. > > Would this logic somehow be usable from tools that don't want to copy > the data directory via pg_basebackup (e.g. for parallelism, to directly > send to some backup service / SAN / whatnot)? Well, I'm imagining it as a piece of server-side functionality that can figure out what has changed using one of several possible methods, and then send that stuff to you. So I think if you don't have a server connection you are out of luck. If you have a server connection but just want to be told what has changed rather than actually being given that data, that might be something that could be worked into the design. I'm not sure whether that's a real need, though, or just extra work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
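The whole-file fallback described above is cheap to state precisely; a small sketch follows, where the 90% figure is just the example used above, not a settled default:

def send_whole_file(nchanged, nblocks, threshold=0.90):
    # Past the threshold, per-block bookkeeping buys little, so ship the whole
    # segment (similar in spirit to git giving up on deltas that don't pay off).
    return nblocks > 0 and nchanged / nblocks > threshold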
On Tue, Apr 9, 2019 at 12:32 PM Arthur Zakirov <a.zakirov@postgrespro.ru> wrote: > In pg_probackup we have remote restore via SSH in the beta state. But > SSH isn't an option for in-core approach I think. That's a little off-topic for this thread, but I think we should have some kind of extensible mechanism for pg_basebackup and maybe other tools, so that you can teach it to send backups to AWS or your teletype or etch them on stone tablets or whatever without having to modify core code. But let's not design that mechanism on this thread, 'cuz that will distract from what I want to talk about here. Feel free to start a new thread for it, though, and I'll jump in. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2019-04-09 17:48, Robert Haas wrote: > It will > probably be more efficient in many cases to instead scan all the WAL > generated since that LSN and extract block references from it, but > that is only possible if the server has all of that WAL available or > can somehow get it from the archive. This could be a variant of a replication slot that preserves WAL between incremental backup runs. > 3. There should be a new tool that knows how to merge a full backup > with any number of incremental backups and produce a complete data > directory with no remaining partial files. Are there by any chance standard file formats and tools that describe a binary difference between directories? That would be really useful here. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-Apr-09, Peter Eisentraut wrote: > On 2019-04-09 17:48, Robert Haas wrote: > > 3. There should be a new tool that knows how to merge a full backup > > with any number of incremental backups and produce a complete data > > directory with no remaining partial files. > > Are there by any chance standard file formats and tools that describe a > binary difference between directories? That would be really useful here. VCDIFF? https://tools.ietf.org/html/rfc3284 -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi!

> On 9 Apr 2019, at 20:48, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Thoughts?

Thanks for this long and thoughtful post!

At Yandex, we have been using incremental backups for some years now. Initially, we used a patched pgbarman, then we implemented this functionality in WAL-G. And there are many things to be done yet. We have more than 1Pb of clusters backed up with this technology. Most of the time we use this technology as part of the HA setup in our managed PostgreSQL service. So, for us the main goals are to operate backups cheaply and restore a new node quickly.

Here's what I see from our perspective.

1. Yes, this feature is important.
2. This importance comes not from reduced disk storage; magnetic disks and object storages are very cheap.
3. Incremental backups save a lot of network bandwidth. It is non-trivial for the storage system to ingest hundreds of Tb daily.
4. Incremental backups are a redundancy of WAL, intended for parallel application. An incremental backup applied sequentially is not very useful; it will not be much faster than simple WAL replay in many cases.
5. As long as increments duplicate WAL functionality, it is not worth pursuing tradeoffs of storage utilization reduction. We scan WAL during archival, extract the numbers of changed blocks and store a changemap for a group of WALs in the archive.
6. These changemaps can be used for an increment of the visibility map (if I recall correctly). But you cannot compare LSNs on a page of the visibility map: some operations do not bump them.
7. We use changemaps during backups and during WAL replay - we know the blocks that will change far in advance and prefetch them to the page cache like pg_prefaulter does.
8. There is similar functionality in RMAN for one well-known database. They used to store 8 sets of change maps. That database also has cool "increment for catchup" functionality.
9. We call an incremental backup a "delta backup". This wording describes the purpose more precisely: it is not the "next version of the DB", it is the "difference between two DB states". But the wording choice does not matter much.

Here are slides from my talk at PgConf.APAC[0]. I've proposed a talk on this matter to PgCon, but it was not accepted. I will try next year :)

> On 9 Apr 2019, at 20:48, Robert Haas <robertmhaas@gmail.com> wrote:
> - This is just a design proposal at this point; there is no code. If
> this proposal, or some modified version of it, seems likely to be
> acceptable, I and/or my colleagues might try to implement it.

I'll be happy to help with code, discussion and patch review.

Best regards, Andrey Borodin.

[0] https://yadi.sk/i/Y_S1iqNN5WxS6A
On 09.04.2019 18:48, Robert Haas wrote:
> 1. There should be a way to tell pg_basebackup to request from the
> server only those blocks where LSN >= threshold_value.

Some time ago I implemented an alternative version of the ptrack utility (not the one used in pg_probackup) which detects updated blocks at the file level. It is very simple and maybe it can eventually be integrated into master. I attached the patch against vanilla to this mail. Right now it contains just two GUCs:

ptrack_map_size: Size of ptrack map (number of elements) used for incremental backup: 0 disabled.
ptrack_block_log: Logarithm of ptrack block size (number of pages)

and one function:

pg_ptrack_get_changeset(startlsn pg_lsn) returns {relid,relfilenode,reltablespace,forknum,blocknum,segsize,updlsn,path}

The idea is very simple: it creates a hash map of fixed size (ptrack_map_size) and stores the LSN of written pages in this map. Since the default Postgres page size seems to be too small for a ptrack block (requiring too large a hash map or increasing the number of conflicts, as well as increasing the number of random reads), it is possible to configure a ptrack block to consist of multiple pages (a power of 2).

This patch uses a memory mapping mechanism. Unfortunately there is no portable wrapper for it in Postgres, so I had to provide my own implementations for Unix/Windows. Certainly that is not good and should be rewritten.

How to use?

1. Define ptrack_map_size in postgres.conf, for example (use a prime number for more uniform hashing):

ptrack_map_size = 1000003

2. Remember the current lsn.

psql postgres -c "select pg_current_wal_lsn()"
 pg_current_wal_lsn
--------------------
 0/224A268
(1 row)

3. Do some updates.

$ pgbench -T 10 postgres

4. Select changed blocks.

select * from pg_ptrack_get_changeset('0/224A268');
 relid | relfilenode | reltablespace | forknum | blocknum | segsize |  updlsn   |       path
-------+-------------+---------------+---------+----------+---------+-----------+------------------
 16390 |       16396 |          1663 |       0 |     1640 |       1 | 0/224FD88 | base/12710/16396
 16390 |       16396 |          1663 |       0 |     1641 |       1 | 0/2258680 | base/12710/16396
 16390 |       16396 |          1663 |       0 |     1642 |       1 | 0/22615A0 | base/12710/16396
...

Certainly ptrack should be used as part of some backup tool (such as pg_basebackup or pg_probackup).

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
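A minimal sketch of the idea described above, with the map as a plain Python list and simplified hashing; the real patch keeps the map in memory-mapped shared storage and does its own hashing of relation/block identifiers, so the names here are illustrative only.

PTRACK_MAP_SIZE = 1000003   # number of entries, as in the example configuration
PTRACK_BLOCK_LOG = 7        # e.g. 2**7 pages = 1MB per ptrack block

ptrack_map = [0] * PTRACK_MAP_SIZE   # highest LSN recorded for each slot

def slot_for(relfilenode, blkno):
    # Group pages into ptrack blocks, then hash into the fixed-size map.
    return hash((relfilenode, blkno >> PTRACK_BLOCK_LOG)) % PTRACK_MAP_SIZE

def ptrack_mark(relfilenode, blkno, lsn):
    s = slot_for(relfilenode, blkno)
    if lsn > ptrack_map[s]:
        ptrack_map[s] = lsn

def ptrack_changed_since(relfilenode, blkno, start_lsn):
    # May return false positives: unrelated block groups can share a slot.
    return ptrack_map[slot_for(relfilenode, blkno)] >= start_lsn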
Hi,

On Tue, 9 Apr 2019 11:48:38 -0400
Robert Haas <robertmhaas@gmail.com> wrote:
> Several companies, including EnterpriseDB, NTT, and Postgres Pro, have
> developed technology that permits a block-level incremental backup to
> be taken from a PostgreSQL server. I believe the idea in all of those
> cases is that non-relation files should be backed up in their
> entirety, but for relation files, only those blocks that have been
> changed need to be backed up. I would like to propose that we should
> have a solution for this problem in core, rather than leaving it to
> each individual PostgreSQL company to develop and maintain their own
> solution. Generally my idea is:
>
> 1. There should be a way to tell pg_basebackup to request from the
> server only those blocks where LSN >= threshold_value. There are
> several possible ways for the server to implement this, the simplest
> of which is to just scan all the blocks and send only the ones that
> satisfy that criterion. That might sound dumb, but it does still save
> network bandwidth, and it works even without any prior setup.

+1. This is a simple design and probably an easy first step that already brings a lot of benefits.

> It will probably be more efficient in many cases to instead scan all the WAL
> generated since that LSN and extract block references from it, but
> that is only possible if the server has all of that WAL available or
> can somehow get it from the archive.

I seize the opportunity to discuss this on the fly. I've been playing with the idea of producing incremental backups from archives for many years, but I've only started PoC'ing on it this year.

My idea would be to create a new tool working on archived WAL. No burden server side. The basic concept is:

* parse archives
* record the latest relevant FPW for the incremental backup
* write new WALs with the recorded FPW, removing/rewriting duplicated walrecords.

It's just a PoC and I haven't finished the WAL writing part... not even talking about the replay part. I'm not even sure this project is a good idea, but it is a good educational exercise for me in the meantime.

Anyway, using real-life OLTP production archives, my stats were:

  # WAL    xlogrec kept    Size WAL kept
    127       39%              50%
    383       22%              38%
    639       20%              29%

Based on these stats, I expect this would save a lot of time during recovery as a first step. If it gets mature, it might even save a lot of archive space or extend the retention period with degraded granularity. It would even help taking full backups with a lower frequency.

Any thoughts about this design would be much appreciated. I suppose this should be offlist or in a new thread to avoid polluting this thread, as this is a slightly different subject.

Regards,

PS: I was surprised to still find some existing pieces of code related to pglesslog in core. That project has been discontinued and the WAL format has changed in the meantime.
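A rough sketch of the filtering rule described above -- keep, for each block, only the latest full-page image and whatever follows it -- scanning the stream backward. The record representation is invented purely for illustration; a real implementation would read records via the WAL-reader infrastructure.

def filter_wal(records):
    # records: list of (block_id, is_fpi) tuples in WAL order (oldest first).
    # Returns the indexes of the records worth keeping: for every block, its
    # latest full-page image and anything that comes after it.
    keep = []
    blocks_done = set()
    for i in range(len(records) - 1, -1, -1):   # scan backward, end to start
        block_id, is_fpi = records[i]
        if block_id in blocks_done:
            continue                 # older than the FPI already kept: redundant
        keep.append(i)
        if is_fpi:
            blocks_done.add(block_id)
    return sorted(keep)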
On Tue, Apr 9, 2019 at 5:28 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > On 2019-Apr-09, Peter Eisentraut wrote: > > On 2019-04-09 17:48, Robert Haas wrote: > > > 3. There should be a new tool that knows how to merge a full backup > > > with any number of incremental backups and produce a complete data > > > directory with no remaining partial files. > > > > Are there by any chance standard file formats and tools that describe a > > binary difference between directories? That would be really useful here. > > VCDIFF? https://tools.ietf.org/html/rfc3284 I don't understand VCDIFF very well, but I see some potential problems with going in this direction. First, suppose we take a full backup on Monday. Then, on Tuesday, we want to take an incremental backup. In my proposal, the backup server only needs to provide the database with one piece of information: the start-LSN of the previous backup. The server determines which blocks are recently modified and sends them to the client, which stores them. The end. On the other hand, storing a maximally compact VCDIFF seems to require that, for each block modified in the Tuesday backup, we go read the corresponding block as it existed on Monday. Assuming that the server is using some efficient method of locating modified blocks, this will approximately double the amount of read I/O required to complete the backup: either the server or the client must now read not only the current version of the block but the previous versions. If the previous backup is an incremental backup that does not contain full block images but only VCDIFF content, whoever is performing the VCDIFF calculation will need to walk the entire backup chain and reconstruct the previous contents of the previous block so that it can compute the newest VCDIFF. A customer who does an incremental backup every day and maintains a synthetic full backup from 1 week prior will see a roughly eightfold increase in read I/O compared to the design I proposed. The same problem exists at restore time. In my design, the total read I/O required is equal to the size of the database, plus however much metadata needs to be read from older delta files -- and that should be fairly small compared to the actual data being read, at least in normal, non-extreme cases. But if we are going to proceed by applying a series of delta files, we're going to need to read every older backup in its entirety. If the turnover percentage is significant, say 20%/day, and if the backup chain is say 7 backups long to get back to a full backup, this is a huge difference. Instead of having to read ~100% of the database size, as in my proposal, we'll need to read 100% + (6 * 20%) = 220% of the database size. Since VCDIFF uses an add-copy-run language to described differences, we could try to work around the problem that I just described by describing each changed data block as an 8192-byte add, and unchanged blocks as an 8192-byte copy. If we did that, then I think that the problem at backup time goes away: we can write out a VCDIFF-format file for the changed blocks based just on knowing that those are the blocks that have changed, without needing to access the older file. Of course, if we do it this way, the file will be larger than it would be if we actually compared the old and new block contents and wrote out a minimal VCDIFF, but it does make taking a backup a lot simpler. Even with this proposal, though, I think we still have trouble with restore time. 
I proposed putting the metadata about which blocks are included in a delta file at the beginning of the file, which allows a restore of a new incremental backup to relatively efficiently flip through older backups to find just the blocks that it needs, without having to read the whole file. But I think (although I am not quite sure) that in the VCDIFF format, the payload for an ADD instruction is stored near the instruction itself. The result would be that you'd have to basically read the whole file at restore time to figure out which blocks were available from that file and which ones needed to be retrieved from an older backup. So while this approach would fix the backup-time problem, I believe that it would still require significantly more read I/O at restore time than my proposal.

Furthermore, if, at backup time, we have to do anything that requires access to the old data, either the client or the server needs to have access to that data. Notwithstanding the costs of reading it, that doesn't seem very desirable. The server is quite unlikely to have access to the backups, because most users want to back up to a different server in order to guard against a hardware failure. The client is more likely to be running on a machine where it has access to the data, because many users back up to the same machine every day, so the machine that is taking the current backup probably has the older one. However, accessing that old backup might not be cheap. It could be located in an object store in the cloud someplace, or it could have been written out to a tape drive and the tape removed from the drive. In the design I'm proposing, that stuff doesn't matter, but if you want to run diffs, then it does. Even if the client has efficient access to the data and even if it has so much read I/O bandwidth that the costs of reading that old data to run diffs doesn't matter, it's still pretty awkward for a tar-format backup. The client would have to take the tar archive sent by the server apart and form a new one.

Another advantage of storing whole blocks in the incremental backup is that there's no tight coupling between the full backup and the incremental backup. Suppose you take a full backup A on S1, and then another full backup B, and then an incremental backup C based on A, and then an incremental backup D based on B. If backup B is destroyed beyond retrieval, you can restore the chain A-C-D and get back to the same place that restoring B-D would have gotten you. Backup D doesn't really know or care that it happens to be based on B. It just knows that it can only give you those blocks that have LSN >= LSN_B. You can get those blocks from anywhere that you like. If D instead stored deltas between the blocks as they exist in backup B, then those deltas would have to be applied specifically to backup B, not some possibly-later version.

I think the way to think about this problem, or at least the way I think about this problem, is that we need to decide whether we want file-level incremental backup, block-level incremental backup, or byte-level incremental backup. pgbackrest implements file-level incremental backup: if the file has changed, copy the whole thing. That has an appealing simplicity but risks copying 1GB of data for a 1-byte change. What I'm proposing here is block-level incremental backup, which is more complicated and still risks copying 8kB of data for a 1-byte change. Using VCDIFF would, I think, give us byte-level incremental backup.
That would probably do an excellent job of making incremental backups as small as they can possibly be, because we would not need to include in the backup image even a single byte of unmodified data. It also seems like it does some other compression tricks which could shrink incremental backups further. However, my intuition is that we won't gain enough in terms of backup size to make up for the downsides listed above. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
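The two properties that make whole-block storage composable -- chain ordering and newest-copy-wins -- can be restated as a small sketch. The backup representation (a dict carrying the threshold LSN, start LSN, and stored blocks) is hypothetical, used only to show the checks a combine tool would perform.

def check_chain(backups):
    # backups: oldest (full) first, then incrementals.  Each incremental's
    # threshold LSN must not be newer than the start LSN of the backup before
    # it, or changes made in between could be lost.
    for prev, cur in zip(backups, backups[1:]):
        if cur["threshold_lsn"] > prev["start_lsn"]:
            raise ValueError("LSN gap between backups; combining them is unsafe")

def resolve_block(backups, blkno):
    # Newest copy wins; anything older is shadowed.  The full backup at the
    # start of the chain is the final fallback.
    for b in reversed(backups):
        if blkno in b["blocks"]:
            return b["blocks"][blkno]
    raise KeyError("block not present even in the full backup")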
On Wed, Apr 10, 2019 at 10:57 AM Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote: > My idea would be create a new tool working on archived WAL. No burden > server side. Basic concept is: > > * parse archives > * record latest relevant FPW for the incr backup > * write new WALs with recorded FPW and removing/rewriting duplicated walrecords. > > It's just a PoC and I hadn't finished the WAL writing part...not even talking > about the replay part. I'm not even sure this project is a good idea, but it is > a good educational exercice to me in the meantime. > > Anyway, using real life OLTP production archives, my stats were: > > # WAL xlogrec kept Size WAL kept > 127 39% 50% > 383 22% 38% > 639 20% 29% > > Based on this stats, I expect this would save a lot of time during recovery in > a first step. If it get mature, it might even save a lot of archives space or > extend the retention period with degraded granularity. It would even help > taking full backups with a lower frequency. > > Any thoughts about this design would be much appreciated. I suppose this should > be offlist or in a new thread to avoid polluting this thread as this is a > slightly different subject. Interesting idea, but I don't see how it can work if you only deal with the FPWs and not the other records. For instance, suppose that you take a full backup at time T0, and then at time T1 there are two modifications to a certain block in quick succession. That block is then never touched again. Since no checkpoint intervenes between the modifications, the first one emits an FPI and the second does not. Capturing the FPI is fine as far as it goes, but unless you also do something with the non-FPI change, you lose that second modification. You could fix that by having your tool replicate the effects of WAL apply outside the server, but that sounds like a ton of work and a ton of possible bugs. I have a related idea, though. Suppose that, as Peter says upthread, you have a replication slot that prevents old WAL from being removed. You also have a background worker that is connected to that slot. It decodes WAL and produces summary files containing all block-references extracted from those WAL records and the associated LSN (or maybe some approximation of the LSN instead of the exact value, to allow for compression and combining of nearby references). Then you hold onto those summary files after the actual WAL is removed. Now, when somebody asks the server for all blocks changed since a certain LSN, it can use those summary files to figure out which blocks to send without having to read all the pages in the database. Although I believe that a simple system that finds modified blocks by reading them all is good enough for a first version of this feature and useful in its own right, a more efficient system will be a lot more useful, and something like this seems to me to be probably the best way to implement it. The reason why I think this is likely to be superior to other possible approaches, such as the ptrack approach Konstantin suggests elsewhere on this thread, is because it pushes the work of figuring out which blocks have been modified into the background. With a ptrack-type approach, the server has to do some non-zero amount of extra work in the foreground every time it modifies a block. With an approach based on WAL-scanning, the work is done in the background and nobody has to wait for it. It's possible that there are other considerations which aren't occurring to me right now, though. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
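A sketch of how the WAL-summary idea above might be consumed on the server side. The shape of a summary (a mapping from block reference to the highest, possibly approximate, LSN recorded for it) is assumed here for illustration; no file format is implied.

def blocks_changed_since(summaries, threshold_lsn):
    # summaries: one dict per retained summary file, mapping a block reference
    # (relfilenode, forknum, blkno) to the highest LSN recorded for it.
    changed = set()
    for summary in summaries:
        for block_ref, lsn in summary.items():
            if lsn >= threshold_lsn:
                changed.add(block_ref)
    return changed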
On Wed, Apr 10, 2019 at 10:22 AM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
> Some time ago I implemented an alternative version of the ptrack utility
> (not the one used in pg_probackup)
> which detects updated blocks at the file level. It is very simple and maybe
> it can eventually be integrated into master.

I don't think this is completely crash-safe. It looks like it arranges to msync() the ptrack file at appropriate times (although I haven't exhaustively verified the logic), but it uses MS_ASYNC, so it's possible that the ptrack file could get updated on disk either before or after the relation file itself. I think before is probably OK -- it just risks having some blocks look modified when they aren't really -- but after seems like it is very much not OK. And changing this to use MS_SYNC would probably be really expensive. Likely a better approach would be to hook into the new fsync queue machinery that Thomas Munro added to PostgreSQL 12.

It looks like your system maps all the blocks in the system into a fixed-size map using hashing. If the number of modified blocks between the full backup and the incremental backup is large compared to the size of the ptrack map, you'll start to get a lot of false-positives. It will look as if much of the database needs to be backed up. For example, in your sample configuration, you have ptrack_map_size = 1000003. If you've got a 100GB database with 20% daily turnover, that's about 2.6 million blocks. If you bump a random entry ~2.6 million times in a map with 1000003 entries, on the average ~92% of the entries end up getting bumped, so you will get very little benefit from incremental backup. This problem drops off pretty fast if you raise the size of the map, but it's pretty critical that your map is large enough for the database you've got, or you may as well not bother.

It also appears that your system can't really handle resizing of the map in any friendly way. So if your data size grows, you may be faced with either letting the map become progressively less effective, or throwing it out and losing all the data you have.

None of that is to say that what you're presenting here has no value, but I think it's possible to do better (and I think we should try).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
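The ~92% figure above follows from the standard occupancy estimate: after k random bumps into a map of N slots, the expected fraction of touched slots is 1 - (1 - 1/N)**k, roughly 1 - exp(-k/N) for large N. A quick check with the numbers from the example:

N = 1000003                              # ptrack_map_size from the sample configuration
k = int(100 * 1024**3 / 8192 * 0.20)     # ~2.6 million modified blocks per day

# Expected fraction of map entries bumped at least once:
print(1 - (1 - 1 / N) ** k)              # ~0.93, i.e. the ~92% figure above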
On Wed, Apr 10, 2019 at 7:51 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote: > > 9 апр. 2019 г., в 20:48, Robert Haas <robertmhaas@gmail.com> написал(а): > > - This is just a design proposal at this point; there is no code. If > > this proposal, or some modified version of it, seems likely to be > > acceptable, I and/or my colleagues might try to implement it. > > I'll be happy to help with code, discussion and patch review. That would be great! We should probably give this discussion some more time before we plunge into the implementation phase, but I'd love to have some help with that, whether it's with coding or review or whatever. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Apr 10, 2019 at 12:56 PM Ashwin Agrawal <aagrawal@pivotal.io> wrote:
> Not to fork the conversation from incremental backups, but a similar approach is what we have been thinking about for pg_rewind. Currently, pg_rewind requires all the WAL logs to be present on the source side from the point of divergence to rewind. Instead, just parse the WAL and keep the changed blocks around on the source. Then there is no need to retain the WAL, but you can still rewind using the changed block map. So, rewind becomes much more similar to the incremental backup proposed here after performing the rewind activity using target-side WAL only.

Interesting. So if we build a system like this for incremental backup, or for pg_rewind, the other one can use the same infrastructure. That sounds excellent. I'll start a new thread to talk about that, and hopefully you and Heikki and others will chime in with thoughts.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

First, thank you for your answer!

On Wed, 10 Apr 2019 12:21:03 -0400
Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Apr 10, 2019 at 10:57 AM Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote:
> > My idea would be to create a new tool working on archived WAL. No burden
> > server side. The basic concept is:
> >
> > * parse archives
> > * record the latest relevant FPW for the incremental backup
> > * write new WALs with the recorded FPW, removing/rewriting duplicated
> > walrecords.
> >
> > It's just a PoC and I haven't finished the WAL writing part... not even
> > talking about the replay part. I'm not even sure this project is a good
> > idea, but it is a good educational exercise for me in the meantime.
> >
> > Anyway, using real-life OLTP production archives, my stats were:
> >
> >   # WAL    xlogrec kept    Size WAL kept
> >     127       39%              50%
> >     383       22%              38%
> >     639       20%              29%
> >
> > Based on these stats, I expect this would save a lot of time during recovery
> > as a first step. If it gets mature, it might even save a lot of archive
> > space or extend the retention period with degraded granularity. It would
> > even help taking full backups with a lower frequency.
> >
> > Any thoughts about this design would be much appreciated. I suppose this
> > should be offlist or in a new thread to avoid polluting this thread as this
> > is a slightly different subject.
>
> Interesting idea, but I don't see how it can work if you only deal
> with the FPWs and not the other records. For instance, suppose that
> you take a full backup at time T0, and then at time T1 there are two
> modifications to a certain block in quick succession. That block is
> then never touched again. Since no checkpoint intervenes between the
> modifications, the first one emits an FPI and the second does not.
> Capturing the FPI is fine as far as it goes, but unless you also do
> something with the non-FPI change, you lose that second modification.
> You could fix that by having your tool replicate the effects of WAL
> apply outside the server, but that sounds like a ton of work and a ton
> of possible bugs.

In my current design, the scan is done backward from end to start and I keep all the records appearing after the last occurrence of their respective FPI. The next challenge is to deal with multi-block records where some blocks need to be removed and others are FPIs to keep (e.g. UPDATE).

> I have a related idea, though. Suppose that, as Peter says upthread,
> you have a replication slot that prevents old WAL from being removed.
> You also have a background worker that is connected to that slot. It
> decodes WAL and produces summary files containing all block-references
> extracted from those WAL records and the associated LSN (or maybe some
> approximation of the LSN instead of the exact value, to allow for
> compression and combining of nearby references). Then you hold onto
> those summary files after the actual WAL is removed. Now, when
> somebody asks the server for all blocks changed since a certain LSN,
> it can use those summary files to figure out which blocks to send
> without having to read all the pages in the database. Although I
> believe that a simple system that finds modified blocks by reading
> them all is good enough for a first version of this feature and useful
> in its own right, a more efficient system will be a lot more useful,
> and something like this seems to me to be probably the best way to
> implement it.

Summary files look like what Andrey Borodin described as delta-files and change maps.
> With an approach based > on WAL-scanning, the work is done in the background and nobody has to > wait for it. Agree with this.
On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote: > In my current design, the scan is done backward from end to start and I keep all > the records appearing after the last occurrence of their respective FPI. Oh, interesting. That seems like it would require pretty major surgery on the WAL stream. > Summary files looks like what Andrey Borodin described as delta-files and > change maps. Yep. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2019-04-10 14:38:43 -0400, Robert Haas wrote: > On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais > <jgdr@dalibo.com> wrote: > > In my current design, the scan is done backward from end to start and I keep all > > the records appearing after the last occurrence of their respective FPI. > > Oh, interesting. That seems like it would require pretty major > surgery on the WAL stream. Can't you just read each segment forward, and then reverse? That's not that much memory? And sure, there's some inefficient cases where records span many segments, but that's rare enough that reading a few segments several times doesn't strike me as particularly bad? Greetings, Andres Freund
On 2019-04-10 17:31, Robert Haas wrote: > I think the way to think about this problem, or at least the way I > think about this problem, is that we need to decide whether want > file-level incremental backup, block-level incremental backup, or > byte-level incremental backup. That is a great analysis. Seems like block-level is the preferred way forward. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 10.04.2019 19:51, Robert Haas wrote:
> On Wed, Apr 10, 2019 at 10:22 AM Konstantin Knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
>> Some time ago I implemented an alternative version of the ptrack utility
>> (not the one used in pg_probackup)
>> which detects updated blocks at the file level. It is very simple and maybe
>> it can eventually be integrated into master.
> I don't think this is completely crash-safe. It looks like it
> arranges to msync() the ptrack file at appropriate times (although I
> haven't exhaustively verified the logic), but it uses MS_ASYNC, so
> it's possible that the ptrack file could get updated on disk either
> before or after the relation file itself. I think before is probably
> OK -- it just risks having some blocks look modified when they aren't
> really -- but after seems like it is very much not OK. And changing
> this to use MS_SYNC would probably be really expensive. Likely a
> better approach would be to hook into the new fsync queue machinery
> that Thomas Munro added to PostgreSQL 12.

I do not think that MS_SYNC or the fsync queue is needed here. If a power failure or OS crash causes the loss of some writes to the ptrack map, then in any case Postgres will perform recovery, and updating pages from WAL will mark them in the ptrack map once again. So, as in the case of CLOG and many other Postgres files, it is not critical to lose some writes because they will be restored from WAL. And before truncating WAL, Postgres performs a checkpoint which flushes all changes to the disk, including ptrack map updates.

> It looks like your system maps all the blocks in the system into a
> fixed-size map using hashing. If the number of modified blocks
> between the full backup and the incremental backup is large compared
> to the size of the ptrack map, you'll start to get a lot of
> false-positives. It will look as if much of the database needs to be
> backed up. For example, in your sample configuration, you have
> ptrack_map_size = 1000003. If you've got a 100GB database with 20%
> daily turnover, that's about 2.6 million blocks. If you bump a
> random entry ~2.6 million times in a map with 1000003 entries, on the
> average ~92% of the entries end up getting bumped, so you will get
> very little benefit from incremental backup. This problem drops off
> pretty fast if you raise the size of the map, but it's pretty critical
> that your map is large enough for the database you've got, or you may
> as well not bother.

This is why the ptrack block size should be larger than the page size. Assume that it is 1Mb. 1MB is considered to be the optimal amount of disk IO, where frequent seeks do not degrade read speed (this is most critical for HDD). In other words, reading 10 random pages (20%) from this 1Mb block will take almost the same amount of time (or even longer) than reading the whole 1Mb in one operation. There will be just 100000 used entries in the ptrack map, with a very small probability of collision. Actually, I chose this size (1000003) for the ptrack map because with a 1Mb block size it allows mapping a 1Tb database without a noticeable number of collisions, which seems to be enough for most Postgres installations. But increasing the ptrack map size 10 or even 100 times should not cause problems with modern RAM sizes either.

>
> It also appears that your system can't really handle resizing of the
> map in any friendly way. So if your data size grows, you may be faced
> with either letting the map become progressively less effective, or
> throwing it out and losing all the data you have.
>
> None of that is to say that what you're presenting here has no value,
> but I think it's possible to do better (and I think we should try).
>

I definitely didn't consider the proposed patch a perfect solution, and it certainly requires improvements (and maybe a complete redesign). I just wanted to present this approach (maintaining a hash of block LSNs in mapped memory) and the idea of keeping track of modified blocks at the file level (unlike the current ptrack implementation, which logs changes in all the places in the Postgres code where data is updated). Also, despite the fact that this patch may be considered a raw prototype, I have spent some time thinking about all aspects of this approach, including fault tolerance and false positives.
On Wed, 10 Apr 2019 14:38:43 -0400
Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote:
> > In my current design, the scan is done backward from end to start and I
> > keep all the records appearing after the last occurrence of their
> > respective FPI.
>
> Oh, interesting. That seems like it would require pretty major
> surgery on the WAL stream.

Indeed. Presently, the surgery in my code is replacing redundant xlogrecords with noops. I now have to deal with multi-block records. So far, I tried to mark non-needed blocks with !BKPBLOCK_HAS_DATA and made a simple patch in core to ignore such marked blocks, but it doesn't play well with dependencies between xlogrecords, e.g. during UPDATE. So my plan is to rewrite them to remove non-needed blocks using e.g. XLOG_FPI.

As I wrote, this is mainly a hobby project right now for my own education. I'm not sure where it leads me, but I'm learning a lot while working on it.
On Wed, 10 Apr 2019 11:55:51 -0700
Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2019-04-10 14:38:43 -0400, Robert Haas wrote:
> > On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais
> > <jgdr@dalibo.com> wrote:
> > > In my current design, the scan is done backward from end to start and I
> > > keep all the records appearing after the last occurrence of their
> > > respective FPI.
> >
> > Oh, interesting. That seems like it would require pretty major
> > surgery on the WAL stream.
>
> Can't you just read each segment forward, and then reverse?

Not sure what you mean. I first look for the very last XLOG record by jumping to the last WAL and scanning it forward. Then, I do a backward scan from there to record the LSNs of the xlogrecords to keep. Finally, I clone each WAL and edit them as needed (as described in my previous email). This is my current WIP, though.

> That's not that much memory?

I don't know yet. I did not measure it.
On Wed, Apr 10, 2019 at 09:42:47PM +0200, Peter Eisentraut wrote:
> That is a great analysis. Seems like block-level is the preferred way
> forward.

All the solutions related to incremental backups that I have seen from the community tend to prefer block-level backups, because of the filtering that is possible based on the LSN of the page header. The holes in the middle of a page are also easier to handle, so the size of an incremental page is reduced in the actual backup. My preference tends toward a block-level approach if we were to do something in this area, though I fear that performance will be bad if we begin to scan all the relation files to fetch a set of blocks changed since a past LSN. Hence we need some kind of LSN map so that it is possible to skip one block or a group of blocks (say one LSN every 8/16 blocks, for example) at once for a given relation if the relation is mostly read-only.

--
Michael
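A sketch of the coarse LSN map suggested above: track only the highest LSN per group of blocks (16 here), so a mostly read-only relation can be skipped in large chunks without reading its pages. The structure and names are hypothetical.

GROUP_SIZE = 16   # one stored LSN per 16 blocks, as in the example above

def note_page_write(lsn_map, blkno, page_lsn):
    g = blkno // GROUP_SIZE
    if page_lsn > lsn_map.get(g, 0):
        lsn_map[g] = page_lsn

def groups_needing_backup(lsn_map, nblocks, threshold_lsn):
    # A group can be skipped wholesale when nothing in it was written at or
    # after the threshold; mostly read-only relations skip almost everything.
    ngroups = (nblocks + GROUP_SIZE - 1) // GROUP_SIZE
    return [g for g in range(ngroups) if lsn_map.get(g, 0) >= threshold_lsn]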
On Thu, Apr 11, 2019 at 12:22 AM Michael Paquier <michael@paquier.xyz> wrote: > incremental page size is reduced in the actual backup. My preference > tends toward a block-level approach if we were to do something in this > area, though I fear that performance will be bad if we begin to scan > all the relation files to fetch a set of blocks since a past LSN. > Hence we need some kind of LSN map so as it is possible to skip a > one block or a group of blocks (say one LSN every 8/16 blocks for > example) at once for a given relation if the relation is mostly > read-only. So, in this thread, I want to focus on the UI and how the incremental backup is stored on disk. Making the process of identifying modified blocks efficient is the subject of http://postgr.es/m/CA+TgmoahOeuuR4pmDP1W=JnRyp4fWhynTOsa68BfxJq-qB_53A@mail.gmail.com Over there, the merits of what you are describing here and the competing approaches are under discussion. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
09.04.2019 18:48, Robert Haas writes:
> Thoughts?

Hi,

Thank you for bringing that up. In-core support of incremental backups is a long-awaited feature. Hopefully, this take will end up committed in PG13.

Speaking of UI:

1) I agree that it should be implemented as a new replication command.

2) There should be a command to get only a map of changes without the actual data. Most backup tools establish a server connection, so they can use this protocol to get the list of changed blocks. Then they can use this information for any purpose. For example, distribute files between parallel workers to copy the data, or estimate backup size before data is sent, or store metadata separately from the data itself. Most methods (except straightforward LSN comparison) consist of two steps: get a map of changes and read blocks. So it won't add much extra work.

example commands:
GET_FILELIST [lsn]
returning json (or whatever) with filenames and maps of changed blocks

The map format is also a subject of discussion. Now in pg_probackup we reuse code from pg_rewind/datapagemap; I'm not sure this format is good for sending data via the protocol, though.

3) The API should provide functions to request data with a granularity of file and block. It will be useful for parallelism and for various future projects.

example commands:
GET_DATAFILE [filename [map of blocks] ]
GET_DATABLOCK [filename] [blkno]
returning data in some format

4) The algorithm for collecting changed blocks is another topic, though its API should be discussed here: Do we want to have multiple implementations? Personally, I think that it's good to provide several strategies, since they have different requirements and fit different workloads. Maybe we can add a hook to allow custom implementations. Do we want to allow the backup client to tell what block collection method to use?

example commands:
GET_FILELIST [lsn] [METHOD lsn | page | ptrack | etc]

Or should it be a server-side cost-based decision?

5) The method based on LSN comparison stands out - it can be done in one pass. So it probably requires special protocol commands.

for example:
GET_DATAFILES [lsn]
GET_DATAFILE [filename] [lsn]

This is pretty simple to implement and pg_basebackup can use this method, at least until we have something more advanced in-core.

I'll be happy to help with design, code, review, and testing. Hope that my experience with pg_probackup will be useful.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
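One possible shape for the GET_FILELIST response sketched above, written as a Python literal purely for illustration; nothing here is a settled wire format, and the field names and example paths are hypothetical.

get_filelist_response = {
    "threshold_lsn": "0/224A268",       # changes are relative to this LSN
    "files": [
        {"path": "base/12710/16396",
         "changed_blocks": [1640, 1641, 1642]},   # could equally be a compressed bitmap
        {"path": "base/12710/16396_fsm",
         "changed_blocks": []},                   # file present but unchanged
    ],
}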
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > Several companies, including EnterpriseDB, NTT, and Postgres Pro, have > developed technology that permits a block-level incremental backup to > be taken from a PostgreSQL server. I believe the idea in all of those > cases is that non-relation files should be backed up in their > entirety, but for relation files, only those blocks that have been > changed need to be backed up. I love the general idea of having additional facilities in core to support block-level incremental backups. I've long been unhappy that any such approach ends up being limited to a subset of the files which need to be included in the backup, meaning the rest of the files have to be backed up in their entirety. I don't think we have to solve for that as part of this, but I'd like to see a discussion for how to deal with the other files which are being backed up to avoid needing to just wholesale copy them. > I would like to propose that we should > have a solution for this problem in core, rather than leaving it to > each individual PostgreSQL company to develop and maintain their own > solution. I'm certainly a fan of improving our in-core backup solutions. I'm quite concerned that trying to graft this on to pg_basebackup (which, as you note later, is missing an awful lot of what users expect from a real backup solution already- retention handling, parallel capabilities, WAL archive management, and many more... but also is just not nearly as developed a tool as the external solutions) is going to make things unnecessarily difficult when what we really want here is better support from core for block-level incremental backup for the existing external tools to leverage. Perhaps there's something here which can be done with pg_basebackup to have it work with the block-level approach, but I certainly don't see it as a natural next step for it, and it really does seem like limiting the way this is implemented to something that pg_basebackup can easily digest might make it less useful for the more developed tools. As an example, I believe all of the other tools mentioned (at least, those that are open source- I'm pretty sure all of them do) support parallel backup, and therefore having a way to get the block-level changes in a parallel fashion would be a pretty big thing that those tools will want. pg_basebackup is single-threaded today, and this proposal doesn't seem to be contemplating changing that, implying that a serial-based block-level protocol would be fine- but that'd be a pretty awful restriction for the other tools. > Generally my idea is: > > 1. There should be a way to tell pg_basebackup to request from the > server only those blocks where LSN >= threshold_value. There are > several possible ways for the server to implement this, the simplest > of which is to just scan all the blocks and send only the ones that > satisfy that criterion. That might sound dumb, but it does still save > network bandwidth, and it works even without any prior setup. It will > probably be more efficient in many cases to instead scan all the WAL > generated since that LSN and extract block references from it, but > that is only possible if the server has all of that WAL available or > can somehow get it from the archive.
> We could also, as several people > have proposed previously, have some kind of additional relation for > that stores either a single is-modified bit -- which only helps if the > reference LSN for the is-modified bit is older than the requested LSN > but not too much older -- or the highest LSN for each range of K > blocks, or something like that. I am at the moment not too concerned > with the exact strategy we use here. I believe we may want to > eventually support more than one, since they have different > trade-offs. This part of the discussion is another example of how we're limiting ourselves in this implementation to the "pg_basebackup can work with this" case- by only considering the options of "scan all the files" or "use the WAL- if the request is for WAL we have available on the server." The other backup solutions mentioned in your initial email, and others that weren't, have a WAL archive which includes a lot more WAL than just what the primary currently has. When I've thought about how WAL could be used to build a differential or incremental backup, the question of "do we have all the WAL we need" hasn't ever been a consideration- because the backup tool manages the WAL archive and has WAL going back across, most likely, weeks or even months. Having a tool which can essentially "compress" WAL would be fantastic and would be able to be leveraged by all of the different backup solutions. > 2. When you use pg_basebackup in this way, each relation file that is > not sent in its entirety is replaced by a file with a different name. > For example, instead of base/16384/16417, you might get > base/16384/partial.16417 or however we decide to name them. Each such > file will store near the beginning of the file a list of all the > blocks contained in that file, and the blocks themselves will follow > at offsets that can be predicted from the metadata at the beginning of > the file. The idea is that you shouldn't have to read the whole file > to figure out which blocks it contains, and if you know specifically > what blocks you want, you should be able to reasonably efficiently > read just those blocks. A backup taken in this manner should also > probably create some kind of metadata file in the root directory that > stops the server from starting and lists other salient details of the > backup. In particular, you need the threshold LSN for the backup > (i.e. contains blocks newer than this) and the start LSN for the > backup (i.e. the LSN that would have been returned from > pg_start_backup). Two things here- having some file that "stops the server from starting" is just going to cause a lot of pain, in my experience. Users do a lot of really rather.... curious things, and then come asking questions about them, and removing the file that stopped the server from starting is going to quickly become one of those questions on stack overflow that people just follow the highest-ranked question for, even though everyone who follows this list will know that doing so results in corruption of the database. An alternative approach in developing this feature would be to have pg_basebackup have an option to run against an *existing* backup, with the entire point being that the existing backup is updated with these incremental changes, instead of having some independent tool which takes the result of multiple pg_basebackup runs and then combines them.
An alternative tool might be one which simply reads the WAL and keeps track of the FPIs and the updates and then eliminates any duplication which exists in the set of WAL provided (that is, multiple FPIs for the same page would be merged into one, and only the delta changes to that page are preserved, across the entire set of WAL being combined). Of course, that's complicated by having to deal with the other files in the database, so it wouldn't really work on its own. > 3. There should be a new tool that knows how to merge a full backup > with any number of incremental backups and produce a complete data > directory with no remaining partial files. The tool should check that > the threshold LSN for each incremental backup is less than or equal to > the start LSN of the previous backup; if not, there may be changes > that happened in between which would be lost, so combining the backups > is unsafe. Running this tool can be thought of either as restoring > the backup or as producing a new synthetic backup from any number of > incremental backups. This would allow for a strategy of unending > incremental backups. For instance, on day 1, you take a full backup. > On every subsequent day, you take an incremental backup. On day 9, > you run pg_combinebackup day1 day2 -o full; rm -rf day1 day2; mv full > day2. On each subsequent day you do something similar. Now you can > always roll back to any of the last seven days by combining the oldest > backup you have (which is always a synthetic full backup) with as many > newer incrementals as you want, up to the point where you want to > stop. I'd really prefer that we avoid adding in another low-level tool like the one described here. Users, imv anyway, don't want to deal with *more* tools for handling this aspect of backup/recovery. If we had a tool in core today which managed multiple backups, kept track of them, and all of the WAL during and between them, then we could add options to that tool to do what's being described here in a way that makes sense and provides a good interface to users. I don't know that we're going to be able to do that with pg_basebackup when, really, the goal here isn't actually to make pg_basebackup into an enterprise backup tool, it's to make things easier for the external tools to do block-level backups. Thanks! Stephen
On Mon, Apr 15, 2019 at 09:01:11AM -0400, Stephen Frost wrote: > Greetings, > > * Robert Haas (robertmhaas@gmail.com) wrote: > > Several companies, including EnterpriseDB, NTT, and Postgres Pro, have > > developed technology that permits a block-level incremental backup to > > be taken from a PostgreSQL server. I believe the idea in all of those > > cases is that non-relation files should be backed up in their > > entirety, but for relation files, only those blocks that have been > > changed need to be backed up. > > I love the general idea of having additional facilities in core to > support block-level incremental backups. I've long been unhappy that > any such approach ends up being limited to a subset of the files which > need to be included in the backup, meaning the rest of the files have to > be backed up in their entirety. I don't think we have to solve for that > as part of this, but I'd like to see a discussion for how to deal with > the other files which are being backed up to avoid needing to just > wholesale copy them. I assume you are talking about non-heap/index files. Which of those are large enough to benefit from incremental backup? > > I would like to propose that we should > > have a solution for this problem in core, rather than leaving it to > > each individual PostgreSQL company to develop and maintain their own > > solution. > > I'm certainly a fan of improving our in-core backup solutions. > > I'm quite concerned that trying to graft this on to pg_basebackup > (which, as you note later, is missing an awful lot of what users expect > from a real backup solution already- retention handling, parallel > capabilities, WAL archive management, and many more... but also is just > not nearly as developed a tool as the external solutions) is going to > make things unnecessairly difficult when what we really want here is > better support from core for block-level incremental backup for the > existing external tools to leverage. I think there is some interesting complexity brought up in this thread. Which options are going to minimize storage I/O, network I/O, have only background overhead, allow parallel operation, integrate with pg_basebackup. Eventually we will need to evaluate the incremental backup options against these criteria. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Thu, Apr 11, 2019 at 1:29 PM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > 2) There should be a command to get only a map of changes without actual > data. Good idea. > 4) The algorithm of collecting changed blocks is another topic. > Though, it's API should be discussed here: > > Do we want to have multiple implementations? > Personally, I think that it's good to provide several strategies, > since they have different requirements and fit for different workloads. > > Maybe we can add a hook to allow custom implementations. I'm not sure a hook is going to be practical, but I do think we want more than one strategy. > I'll be happy to help with design, code, review, and testing. > Hope that my experience with pg_probackup will be useful. Great, thanks! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Apr 15, 2019 at 9:01 AM Stephen Frost <sfrost@snowman.net> wrote: > I love the general idea of having additional facilities in core to > support block-level incremental backups. I've long been unhappy that > any such approach ends up being limited to a subset of the files which > need to be included in the backup, meaning the rest of the files have to > be backed up in their entirety. I don't think we have to solve for that > as part of this, but I'd like to see a discussion for how to deal with > the other files which are being backed up to avoid needing to just > wholesale copy them. Ideas? Generally, I don't think that anything other than the main forks of relations is worth worrying about, because the files are too small to really matter. Even if they're big, the main forks of relations will be much bigger. I think. > I'm quite concerned that trying to graft this on to pg_basebackup > (which, as you note later, is missing an awful lot of what users expect > from a real backup solution already- retention handling, parallel > capabilities, WAL archive management, and many more... but also is just > not nearly as developed a tool as the external solutions) is going to > make things unnecessairly difficult when what we really want here is > better support from core for block-level incremental backup for the > existing external tools to leverage. > > Perhaps there's something here which can be done with pg_basebackup to > have it work with the block-level approach, but I certainly don't see > it as a natural next step for it and really does seem like limiting the > way this is implemented to something that pg_basebackup can easily > digest might make it less useful for the more developed tools. I agree that there are a bunch of things that pg_basebackup does not do, such as backup management. I think a lot of users do not want PostgreSQL to do backup management for them. They have an existing solution that they use to manage backups, and they want PostgreSQL to interoperate with it. I think it makes sense for pg_basebackup to be in charge of taking the backup, and then other tools can either use it as a building block or use the streaming replication protocol to send approximately the same commands to the server. I certainly would not want to expose server capabilities that let you take an incremental backup and NOT teach pg_basebackup to use them -- then we'd be in a situation of saying that PostgreSQL has incremental backup, but you have to get external tool XYZ to use it. That will be perceived as PostgreSQL does NOT have incremental backup and this external tool adds it. > As an example, I believe all of the other tools mentioned (at least, > those that are open source I'm pretty sure all do) support parallel > backup and therefore having a way to get the block-level changes in a > parallel fashion would be a pretty big thing that those tools will want > and pg_basebackup is single-threaded today and this proposal doesn't > seem to be contemplating changing that, implying that a serial-based > block-level protocol would be fine but that'd be a pretty awful > restriction for the other tools. I mentioned this exact issue in my original email. I spoke positively of it. But I think it is different from what is being proposed here. We could have parallel backup without incremental backup, and that would be a good feature. We could have incremental backup without parallel backup, and that would also be a good feature. We could also have both, which would be best of all.
I don't see that my proposal throws up any architectural obstacle to parallelism. I assume parallel backup, whether full or incremental, would be implemented by dividing up the files that need to be sent across the available connections; if incremental backup exists, each connection then has to decide whether to send the whole file or only part of it. > This part of the discussion is a another example of how we're limiting > ourselves in this implementation to the "pg_basebackup can work with > this" case- by only consideration the options of "scan all the files" or > "use the WAL- if the request is for WAL we have available on the > server." The other backup solutions mentioned in your initial email, > and others that weren't, have a WAL archive which includes a lot more > WAL than just what the primary currently has. When I've thought about > how WAL could be used to build a differential or incremental backup, the > question of "do we have all the WAL we need" hasn't ever been a > consideration- because the backup tool manages the WAL archive and has > WAL going back across, most likely, weeks or even months. Having a tool > which can essentially "compress" WAL would be fantastic and would be > able to be leveraged by all of the different backup solutions. I don't think this is a case of limiting ourselves; I think it's a case of keeping separate considerations properly separate. As I said in my original email, the client doesn't really need to know how the server is identifying the blocks that have been modified. That is the server's job. I started a separate thread on the WAL-scanning approach, so we should take that part of the discussion over there. I see no reason why the server couldn't be taught to reach back into an available archive for WAL that it no longer has locally, but that's really independent of the design ideas being discussed on this thread. > Two things here- having some file that "stops the server from starting" > is just going to cause a lot of pain, in my experience. Users do a lot > of really rather.... curious things, and then come asking questions > about them, and removing the file that stopped the server from starting > is going to quickly become one of those questions on stack overflow that > people just follow the highest-ranked question for, even though everyone > who follows this list will know that doing so results in corruption of > the database. Wait, you want to make it maximally easy for users to start the server in a state that is 100% certain to result in a corrupted and unusable database? Why?? I'd like to make that a tiny bit difficult. If they really want a corrupted database, they can remove the file. > An alternative approach in developing this feature would be to have > pg_basebackup have an option to run against an *existing* backup, with > the entire point being that the existing backup is updated with these > incremental changes, instead of having some independent tool which takes > the result of multiple pg_basebackup runs and then combines them. That would be really unsafe, because if the tool is interrupted before it finishes (and fsyncs everything), you no longer have any usable backup. It also doesn't lend itself to several of the scenarios I described in my original email -- like endless incrementals that are merged into the full backup after some number of days -- a capability upon which others have already remarked positively.
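To make the earlier point about dividing files across the available connections a bit more concrete, here is a rough standalone sketch of one possible assignment strategy (greedy: largest file first, to the least-loaded connection). The file names and sizes are made up, and this is not a claim about how pg_basebackup or any other tool would actually do it; it only shows that dividing the work is independent of whether each file is then sent whole or in part.

/*
 * Sketch only: assign files to backup connections, biggest first, each
 * to whichever connection currently has the fewest bytes to send.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    const char *path;
    long long   size;
} BackupFile;

static int
by_size_desc(const void *a, const void *b)
{
    long long   sa = ((const BackupFile *) a)->size;
    long long   sb = ((const BackupFile *) b)->size;

    return (sa < sb) - (sa > sb);   /* descending by size */
}

int
main(void)
{
    BackupFile  files[] = {
        {"base/16384/16417", 1073741824},   /* hypothetical sizes */
        {"base/16384/16420", 8192},
        {"base/16384/16423", 536870912},
        {"base/16384/16426", 73728},
    };
    int         nfiles = sizeof(files) / sizeof(files[0]);
    int         nconn = 2;
    long long   assigned[2] = {0, 0};

    qsort(files, nfiles, sizeof(BackupFile), by_size_desc);

    for (int i = 0; i < nfiles; i++)
    {
        int         best = 0;

        for (int c = 1; c < nconn; c++)
            if (assigned[c] < assigned[best])
                best = c;
        assigned[best] += files[i].size;
        printf("connection %d gets %s\n", best, files[i].path);
    }
    return 0;
}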
> An alternative tool might be one which simply reads the WAL and keeps > track of the FPIs and the updates and then eliminates any duplication > which exists in the set of WAL provided (that is, multiple FPIs for the > same page would be merged into one, and only the delta changes to that > page are preserved, across the entire set of WAL being combined). Of > course, that's complicated by having to deal with the other files in the > database, so it wouldn't really work on its own. You've jumped back to solving the server's problem (which blocks should I send?) rather than the client's problem (what does an incremental backup look like once I've taken it and how do I manage and restore them?). It does seem possible to figure out the contents of modified blocks strictly from looking at the WAL, without any examination of the current database contents. However, it also seems very complicated, because the tool that is figuring out the current block contents just by looking at the WAL would have to know how to apply any type of WAL record, not just one that contains an FPI. And I really don't want to build a client-side tool that knows how to apply WAL. > I'd really prefer that we avoid adding in another low-level tool like > the one described here. Users, imv anyway, don't want to deal with > *more* tools for handling this aspect of backup/recovery. If we had a > tool in core today which managed multiples backups, kept track of them, > and all of the WAL during and between them, then we could add options to > that tool to do what's being described here in a way that makes sense > and provides a good interface to users. I don't know that we're going > to be able to do that with pg_basebackup when, really, the goal here > isn't actually to make pg_basebackup into an enterprise backup tool, > it's to make things easier for the external tools to do block-level > backups. Well, I agree with you that the goal is not to make pg_basebackup an enterprise backup tool. However, I don't see teaching it to take incremental backups as opposed to that goal. I think backup management and retention should remain firmly outside the purview of pg_basebackup and left either to some other in-core tool or maybe even to out-of-core tools. However, I don't see any reason why the task of taking an incremental and/or parallel backup should also be left to another tool. There is a very close relationship between the thing that pg_basebackup already does (copy everything) and the thing that we want to do here (copy everything except blocks that we know haven't changed). If we made it the job of some other tool to take parallel and/or incremental backups, that other tool would need to reimplement a lot of things that pg_basebackup has already got, like tar vs. plain format, fast vs. spread checkpoint, rate-limiting, compression levels, etc. That seems like a waste. Better to give pg_basebackup the capability to do those things, and then any backup management tool that anyone writes can take advantage of those capabilities. I come at this, BTW, from the perspective of having just spent a bunch of time working on EDB's Backup And Recovery Tool (BART). That tool works in exactly the manner you seem to be advocating: it knows how to do incremental and parallel full backups, and it also does backup management. However, this has not turned out to be the best division of labor.
People who don't want to use the backup management capabilities may still want the parallel or incremental backup capabilities, and if all of that is within the envelope of an "enterprise backup tool," they don't have that option. So I want to split it up. I want pg_basebackup to take all the kinds of backups that PostgreSQL supports -- full, incremental, parallel, serial, whatever -- and I want some other tool -- pgBackRest, BART, barman, or some yet-to-be-invented core thing to do the management of those backups. Then everybody can use exactly the bits they want. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greetings, * Bruce Momjian (bruce@momjian.us) wrote: > On Mon, Apr 15, 2019 at 09:01:11AM -0400, Stephen Frost wrote: > > * Robert Haas (robertmhaas@gmail.com) wrote: > > > Several companies, including EnterpriseDB, NTT, and Postgres Pro, have > > > developed technology that permits a block-level incremental backup to > > > be taken from a PostgreSQL server. I believe the idea in all of those > > > cases is that non-relation files should be backed up in their > > > entirety, but for relation files, only those blocks that have been > > > changed need to be backed up. > > > > I love the general idea of having additional facilities in core to > > support block-level incremental backups. I've long been unhappy that > > any such approach ends up being limited to a subset of the files which > > need to be included in the backup, meaning the rest of the files have to > > be backed up in their entirety. I don't think we have to solve for that > > as part of this, but I'd like to see a discussion for how to deal with > > the other files which are being backed up to avoid needing to just > > wholesale copy them. > > I assume you are talking about non-heap/index files. Which of those are > large enough to benefit from incremental backup? Based on discussions I had with Andrey, specifically the visibility map is an issue for them with WAL-G. I haven't spent a lot of time thinking about it, but I can understand how that could be an issue. > > I'm quite concerned that trying to graft this on to pg_basebackup > > (which, as you note later, is missing an awful lot of what users expect > > from a real backup solution already- retention handling, parallel > > capabilities, WAL archive management, and many more... but also is just > > not nearly as developed a tool as the external solutions) is going to > > make things unnecessairly difficult when what we really want here is > > better support from core for block-level incremental backup for the > > existing external tools to leverage. > > I think there is some interesting complexity brought up in this thread. > Which options are going to minimize storage I/O, network I/O, have only > background overhead, allow parallel operation, integrate with > pg_basebackup. Eventually we will need to evaluate the incremental > backup options against these criteria. This presumes that we're going to have multiple competing incremental backup options presented, doesn't it? Are you aware of another effort going on which aims for inclusion in core? There have been past attempts made, but I don't believe there's anyone else currently planning to or working on something for inclusion in core. Just to be clear- we're not currently working on one, but I'd really like to see core provide good support for incremental block-level backup so that we can leverage it when it is there. Thanks! Stephen
On Tue, Apr 16, 2019 at 5:44 PM Stephen Frost <sfrost@snowman.net> wrote: > > > I love the general idea of having additional facilities in core to > > > support block-level incremental backups. I've long been unhappy that > > > any such approach ends up being limited to a subset of the files which > > > need to be included in the backup, meaning the rest of the files have to > > > be backed up in their entirety. I don't think we have to solve for that > > > as part of this, but I'd like to see a discussion for how to deal with > > > the other files which are being backed up to avoid needing to just > > > wholesale copy them. > > > > I assume you are talking about non-heap/index files. Which of those are > > large enough to benefit from incremental backup? > > Based on discussions I had with Andrey, specifically the visibility map > is an issue for them with WAL-G. I haven't spent a lot of time thinking > about it, but I can understand how that could be an issue. If I understand correctly, the VM contains 1 byte per 4 heap pages and the FSM contains 1 byte per heap page (plus some overhead for higher levels of the tree). Since the FSM is not WAL-logged, I'm not sure there's a whole lot we can do to avoid having to back it up, although maybe there's some clever idea I'm not quite seeing. The VM is WAL-logged, albeit with some strange warts that I have the honor of inventing, so there's more possibilities there. Before worrying about it too much, it would be useful to hear more about the concerns related to these forks, so that we make sure we're solving the right problem. It seems difficult for a single relation to be big enough for these to be much of an issue. For example, on a 1TB relation, we have 2^40 bytes = 2^27 pages = ~2^25 bytes of VM fork = 32MB. Not nothing, but 32MB of useless overhead every time you back up a 1TB database probably isn't going to break the bank. It might be more of a concern for users with many small tables. For example, if somebody has got a million tables with 1 page in each one, they'll have a million data pages, a million VM pages, and 3 million FSM pages (unless the new don't-create-the-FSM-for-small-tables stuff in v12 kicks in). I don't know if it's worth going to a lot of trouble to optimize that case. Creating a million tables with 100 tuples (or whatever) in each one sounds like terrible database design to me. > > > I'm quite concerned that trying to graft this on to pg_basebackup > > > (which, as you note later, is missing an awful lot of what users expect > > > from a real backup solution already- retention handling, parallel > > > capabilities, WAL archive management, and many more... but also is just > > > not nearly as developed a tool as the external solutions) is going to > > > make things unnecessairly difficult when what we really want here is > > > better support from core for block-level incremental backup for the > > > existing external tools to leverage. > > > > I think there is some interesting complexity brought up in this thread. > > Which options are going to minimize storage I/O, network I/O, have only > > background overhead, allow parallel operation, integrate with > > pg_basebackup. Eventually we will need to evaluate the incremental > > backup options against these criteria. > > This presumes that we're going to have multiple competeing incremental > backup options presented, doesn't it? Are you aware of another effort > going on which aims for inclusion in core?
There's been past attempts > made, but I don't believe there's anyone else currently planning to or > working on something for inclusion in core. Yeah, I really hope we don't end up with dueling patches. I want to come up with an approach that can be widely-endorsed and then have everybody rowing in the same direction. On the other hand, I do think that we may support multiple options in certain places which may have the kinds of trade-offs that Bruce mentions. For instance, identifying changed blocks by scanning the whole cluster and checking the LSN of each block has an advantage in that it requires no prior setup or extra configuration. Like a sequential scan, it always works, and that is an advantage. Of course, for many people, the competing advantage of a WAL-scanning approach that can save a lot of I/O will appear compelling, but maybe not for everyone. I think there's room for two or three approaches there -- not in the sense of competing patches, but in the sense of giving users a choice based on their needs. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
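To illustrate what the simplest scan-everything strategy amounts to, here is a standalone sketch that walks one relation segment file and reports blocks whose page LSN is at or past a threshold. It assumes the default 8kB block size and a file written by a machine of the same byte order, and it ignores details a real server-side implementation would have to handle (zeroed pages, checksums, concurrent writes); the file name and threshold are just command-line arguments, not anything defined by the proposal.

/*
 * lsnscan.c -- illustrative only.  Report blocks whose page LSN
 * (pd_lsn, the first 8 bytes of the standard page header) is >= a
 * threshold.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

int
main(int argc, char **argv)
{
    if (argc != 3)
    {
        fprintf(stderr, "usage: %s segment-file threshold-lsn\n", argv[0]);
        return 1;
    }

    uint64_t    threshold = strtoull(argv[2], NULL, 0);
    FILE       *fp = fopen(argv[1], "rb");

    if (fp == NULL)
    {
        perror(argv[1]);
        return 1;
    }

    unsigned char page[BLCKSZ];
    long        blkno = 0;

    while (fread(page, 1, BLCKSZ, fp) == BLCKSZ)
    {
        uint32_t    xlogid;
        uint32_t    xrecoff;
        uint64_t    lsn;

        /* pd_lsn is stored as two 32-bit halves at the start of the page */
        memcpy(&xlogid, page, 4);
        memcpy(&xrecoff, page + 4, 4);
        lsn = ((uint64_t) xlogid << 32) | xrecoff;

        if (lsn >= threshold)
            printf("block %ld: LSN %X/%X\n", blkno,
                   (unsigned) xlogid, (unsigned) xrecoff);
        blkno++;
    }

    fclose(fp);
    return 0;
}

As noted above, this costs a full read of every file but needs no prior setup; the WAL-scanning and block-map approaches trade that simplicity for less read I/O.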
On Tue, Apr 16, 2019 at 06:40:44PM -0400, Robert Haas wrote: > Yeah, I really hope we don't end up with dueling patches. I want to > come up with an approach that can be widely-endorsed and then have > everybody rowing in the same direction. On the other hand, I do think > that we may support multiple options in certain places which may have > the kinds of trade-offs that Bruce mentions. For instance, > identifying changed blocks by scanning the whole cluster and checking > the LSN of each block has an advantage in that it requires no prior > setup or extra configuration. Like a sequential scan, it always > works, and that is an advantage. Of course, for many people, the > competing advantage of a WAL-scanning approach that can save a lot of > I/O will appear compelling, but maybe not for everyone. I think > there's room for two or three approaches there -- not in the sense of > competing patches, but in the sense of giving users a choice based on > their needs. Well, by having a separate modblock file for each WAL file, you can keep both WAL and modblock files and use the modblock list to pull pages from each WAL file, or from the heap/index files, and it can be done in parallel. Having WAL and modblock files in the same directory makes retention simpler. In fact, you can do an incremental backup just using the modblock files and the heap/index files, so you don't even need the WAL. Also, instead of storing the file name and block number in the modblock file, using the database oid, relfilenode, and block number (3 int32 values) should be sufficient. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
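As a minimal sketch of the record layout being suggested here (the names are invented, and a real format would presumably also need the tablespace OID and fork number, plus some framing such as a header naming the WAL range a file covers):

/*
 * Illustrative only: one fixed-width entry in a hypothetical per-WAL-file
 * "modblock" file, following the three-int32 idea above.
 */
#include <stdint.h>

typedef struct ModBlockEntry
{
    uint32_t    dboid;          /* database OID */
    uint32_t    relfilenode;    /* relation file node */
    uint32_t    blockno;        /* modified block number */
} ModBlockEntry;

Fixed-width entries would keep such files easy to sort, merge across segments, and search, which seems to matter if retention is handled simply by keeping or dropping modblock files alongside their WAL segments.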
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Tue, Apr 16, 2019 at 5:44 PM Stephen Frost <sfrost@snowman.net> wrote: > > > > I love the general idea of having additional facilities in core to > > > > support block-level incremental backups. I've long been unhappy that > > > > any such approach ends up being limited to a subset of the files which > > > > need to be included in the backup, meaning the rest of the files have to > > > > be backed up in their entirety. I don't think we have to solve for that > > > > as part of this, but I'd like to see a discussion for how to deal with > > > > the other files which are being backed up to avoid needing to just > > > > wholesale copy them. > > > > > > I assume you are talking about non-heap/index files. Which of those are > > > large enough to benefit from incremental backup? > > > > Based on discussions I had with Andrey, specifically the visibility map > > is an issue for them with WAL-G. I haven't spent a lot of time thinking > > about it, but I can understand how that could be an issue. > > If I understand correctly, the VM contains 1 byte per 4 heap pages and > the FSM contains 1 byte per heap page (plus some overhead for higher > levels of the tree). Since the FSM is not WAL-logged, I'm not sure > there's a whole lot we can do to avoid having to back it up, although > maybe there's some clever idea I'm not quite seeing. The VM is > WAL-logged, albeit with some strange warts that I have the honor of > inventing, so there's more possibilities there. > > Before worrying about it too much, it would be useful to hear more > about the concerns related to these forks, so that we make sure we're > solving the right problem. It seems difficult for a single relation > to be big enough for these to be much of an issue. For example, on a > 1TB relation, we have 2^40 bytes = 2^27 pages = ~2^25 bits of VM fork > = 32MB. Not nothing, but 32MB of useless overhead every time you back > up a 1TB database probably isn't going to break the bank. It might be > more of a concern for users with many small tables. For example, if > somebody has got a million tables with 1 page in each one, they'll > have a million data pages, a million VM pages, and 3 million FSM pages > (unless the new don't-create-the-FSM-for-small-tables stuff in v12 > kicks in). I don't know if it's worth going to a lot of trouble to > optimize that case. Creating a million tables with 100 tuples (or > whatever) in each one sounds like terrible database design to me. As I understand it, the problem is not with backing up an individual database or cluster, but rather dealing with backing up thousands of individual clusters with thousands of tables in each, leading to an awful lot of tables with lots of FSMs/VMs, all of which end up having to get copied and stored wholesale. I'll point this thread out to him and hopefully he'll have a chance to share more specific information. > > > > I'm quite concerned that trying to graft this on to pg_basebackup > > > > (which, as you note later, is missing an awful lot of what users expect > > > > from a real backup solution already- retention handling, parallel > > > > capabilities, WAL archive management, and many more... but also is just > > > > not nearly as developed a tool as the external solutions) is going to > > > > make things unnecessairly difficult when what we really want here is > > > > better support from core for block-level incremental backup for the > > > > existing external tools to leverage. 
> > > > > > I think there is some interesting complexity brought up in this thread. > > > Which options are going to minimize storage I/O, network I/O, have only > > > background overhead, allow parallel operation, integrate with > > > pg_basebackup. Eventually we will need to evaluate the incremental > > > backup options against these criteria. > > > > This presumes that we're going to have multiple competeing incremental > > backup options presented, doesn't it? Are you aware of another effort > > going on which aims for inclusion in core? There's been past attempts > > made, but I don't believe there's anyone else currently planning to or > > working on something for inclusion in core. > > Yeah, I really hope we don't end up with dueling patches. I want to > come up with an approach that can be widely-endorsed and then have > everybody rowing in the same direction. On the other hand, I do think > that we may support multiple options in certain places which may have > the kinds of trade-offs that Bruce mentions. For instance, > identifying changed blocks by scanning the whole cluster and checking > the LSN of each block has an advantage in that it requires no prior > setup or extra configuration. Like a sequential scan, it always > works, and that is an advantage. Of course, for many people, the > competing advantage of a WAL-scanning approach that can save a lot of > I/O will appear compelling, but maybe not for everyone. I think > there's room for two or three approaches there -- not in the sense of > competing patches, but in the sense of giving users a choice based on > their needs. I can agree with the idea of having multiple options for how to collect up the set of changed blocks, though I continue to feel that a WAL-scanning approach isn't something that we'd have implemented in the backend at all since it doesn't require the backend and a given backend might not even have all of the WAL that is relevant. I certainly don't think it makes sense to have a backend go get WAL from the archive to then merge the WAL to provide the result to a client asking for it- that's adding entirely unnecessary load to the database server. As such, only the LSN-based scanning of relation files to produce the set of changed blocks seems to make sense to me to implement in the backend. Just to be clear- I don't have any problem with a tool being implemented in core to support the scanning of WAL to produce a changeset, I just don't think that's something we'd have built into the *backend*, nor do I think it would make sense to add that functionality to the replication (or any other) protocol, at least not with support for arbitrary LSN starting and ending points. A thought that occurs to me is to have the functions for supporting the WAL merging be included in libcommon and available to both the independent executable that's available for doing WAL merging, and to the backend to be able to do WAL merging itself- but for a specific purpose: having a way to reduce the amount of WAL that needs to be sent to a replica which has a replication slot but that's been disconnected for a while. Of course, there'd have to be some way to handle the other files for that to work to update a long out-of-date replica. Now, if we taught the backup tool about having a replication slot then perhaps we could have the backend effectively have the same capability proposed above, but without the need to go get the WAL from the archive repository.
I'm still not entirely sure that this makes sense to do in the backend due to the additional load; this is really just some brainstorming. Thanks! Stephen
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Mon, Apr 15, 2019 at 9:01 AM Stephen Frost <sfrost@snowman.net> wrote: > > I love the general idea of having additional facilities in core to > > support block-level incremental backups. I've long been unhappy that > > any such approach ends up being limited to a subset of the files which > > need to be included in the backup, meaning the rest of the files have to > > be backed up in their entirety. I don't think we have to solve for that > > as part of this, but I'd like to see a discussion for how to deal with > > the other files which are being backed up to avoid needing to just > > wholesale copy them. > > Ideas? Generally, I don't think that anything other than the main > forks of relations are worth worrying about, because the files are too > small to really matter. Even if they're big, the main forks of > relations will be much bigger. I think. Sadly, I haven't got any great ideas today. I do know that the WAL-G folks have specifically mentioned issues with the visibility map being large enough across enough of their systems that it kinda sucks to deal with. Perhaps we could do something like the rsync binary-diff protocol for non-relation files? This is clearly just hand-waving but maybe there's something reasonable in that idea. > > I'm quite concerned that trying to graft this on to pg_basebackup > > (which, as you note later, is missing an awful lot of what users expect > > from a real backup solution already- retention handling, parallel > > capabilities, WAL archive management, and many more... but also is just > > not nearly as developed a tool as the external solutions) is going to > > make things unnecessairly difficult when what we really want here is > > better support from core for block-level incremental backup for the > > existing external tools to leverage. > > > > Perhaps there's something here which can be done with pg_basebackup to > > have it work with the block-level approach, but I certainly don't see > > it as a natural next step for it and really does seem like limiting the > > way this is implemented to something that pg_basebackup can easily > > digest might make it less useful for the more developed tools. > > I agree that there are a bunch of things that pg_basebackup does not > do, such as backup management. I think a lot of users do not want > PostgreSQL to do backup management for them. They have an existing > solution that they use to manage backups, and they want PostgreSQL to > interoperate with it. I think it makes sense for pg_basebackup to be > in charge of taking the backup, and then other tools can either use it > as a building block or use the streaming replication protocol to send > approximately the same commands to the server. There's something like 6 different backup tools, at least, for PostgreSQL that provide backup management, so I have a really hard time agreeing with this idea that users don't want a PG backup management system. Maybe that's not what you're suggesting here, but that's what came across to me. Yes, there are some users who have an existing backup solution and they'd like a better way to integrate PostgreSQL into that solution, but that's usually something like filesystem snapshots or an enterprise backup tool which has a PostgreSQL agent or similar to do the start/stop and collect up the WAL, not something that's just calling pg_basebackup. 
Those are typically not things we have any visibility into though and aren't open source either (and, at least as often as not, they don't seem to be very well thought through, based on my experience with those tools...). Unless maybe I'm misunderstanding and what you're suggesting here is that the "existing solution" is something like the external PG-specific backup tools? But then the rest doesn't seem to make sense, as only maybe one or two of those tools use pg_basebackup internally. > I certainly would not > want to expose server capabilities that let you take an incremental > backup and NOT teach pg_basebackup to use them -- then we'd be in a > situation of saying that PostgreSQL has incremental backup, but you > have to get external tool XYZ to use it. That will be perceived as > PostgreSQL does NOT have incremental backup and this external tool > adds it. ... but this is exactly the situation we're in already with all of the *other* features around backup (parallel backup, backup management, WAL management, etc). Users want those features, pg_basebackup/PG core doesn't provide it, and therefore there's a bunch of other tools which have been written that do. In addition, saying that PG has incremental backup but no built-in management of those full-vs-incremental backups and telling users that they basically have to build that themselves really feels a lot like we're trying to address a check-box requirement rather than making something that our users are going to be happy with. > > As an example, I believe all of the other tools mentioned (at least, > > those that are open source I'm pretty sure all do) support parallel > > backup and therefore having a way to get the block-level changes in a > > parallel fashion would be a pretty big thing that those tools will want > > and pg_basebackup is single-threaded today and this proposal doesn't > > seem to be contemplating changing that, implying that a serial-based > > block-level protocol would be fine but that'd be a pretty awful > > restriction for the other tools. > > I mentioned this exact issue in my original email. I spoke positively > of it. But I think it is different from what is being proposed here. > We could have parallel backup without incremental backup, and that > would be a good feature. We could have parallel backup without full > backup, and that would also be a good feature. We could also have > both, which would be best of all. I don't see that my proposal throws > up any architectural obstacle to parallelism. I assume parallel > backup, whether full or incremental, would be implemented by dividing > up the files that need to be sent across the available connections; if > incremental backup exists, each connection then has to decide whether > to send the whole file or only part of it. I don't think that I was very clear in what my specific concern here was. I'm not asking for pg_basebackup to have parallel backup (at least, not in this part of the discussion), I'm asking for the incremental block-based protocol that's going to be built-in to core to be able to be used in a parallel fashion. The existing protocol that pg_basebackup uses is basically, connect to the server and then say "please give me a tarball of the data directory" and that is then streamed on that connection, making that protocol impossible to use for parallel backup. That's fine as far as it goes because only pg_basebackup actually uses that protocol (note that nearly all of the other tools for doing backups of PostgreSQL don't...). 
If we're expecting the external tools to use the block-level incremental protocol then that protocol really needs to have a way to be parallelized, otherwise we're just going to end up with all of the individual tools doing their own thing for block-level incremental (though perhaps they'd reimplement whatever is done in core but in a way that they could parallelize it...), if possible (which I add just in case there's some idea that we end up in a situation where the block-level incremental backup has to coordinate with the backend in some fashion to work... which would mean that *everyone* has to use the protocol even if it isn't parallel and that would be really bad, imv). > > This part of the discussion is a another example of how we're limiting > > ourselves in this implementation to the "pg_basebackup can work with > > this" case- by only consideration the options of "scan all the files" or > > "use the WAL- if the request is for WAL we have available on the > > server." The other backup solutions mentioned in your initial email, > > and others that weren't, have a WAL archive which includes a lot more > > WAL than just what the primary currently has. When I've thought about > > how WAL could be used to build a differential or incremental backup, the > > question of "do we have all the WAL we need" hasn't ever been a > > consideration- because the backup tool manages the WAL archive and has > > WAL going back across, most likely, weeks or even months. Having a tool > > which can essentially "compress" WAL would be fantastic and would be > > able to be leveraged by all of the different backup solutions. > > I don't think this is a case of limiting ourselves; I think it's a > case of keeping separate considerations properly separate. As I said > in my original email, the client doesn't really need to know how the > server is identifying the blocks that have been modified. That is the > server's job. I started a separate thread on the WAL-scanning > approach, so we should take that part of the discussion over there. I > see no reason why the server couldn't be taught to reach back into an > available archive for WAL that it no longer has locally, but that's > really independent of the design ideas being discussed on this thread. I've provided thoughts on that other thread, I'm happy to discuss further there. > > Two things here- having some file that "stops the server from starting" > > is just going to cause a lot of pain, in my experience. Users do a lot > > of really rather.... curious things, and then come asking questions > > about them, and removing the file that stopped the server from starting > > is going to quickly become one of those questions on stack overflow that > > people just follow the highest-ranked question for, even though everyone > > who follows this list will know that doing so results in corruption of > > the database. > > Wait, you want to make it maximally easy for users to start the server > in a state that is 100% certain to result in a corrupted and unusable > database? Why?? I'd l like to make that a tiny bit difficult. If > they really want a corrupted database, they can remove the file. No, I don't want it to be easy for users to start the server in a state that's going to result in a corrupted cluster. That's basically the complete opposite of what I was going for- having a file that can be trivially removed to start up the cluster is *going* to result in people having corrupted clusters, no matter how much we tell them "don't do that". 
This is exactly the problem we have with backup_label today. I'd really rather not double-down on that. > > An alternative approach in developing this feature would be to have > > pg_basebackup have an option to run against an *existing* backup, with > > the entire point being that the existing backup is updated with these > > incremental changes, instead of having some independent tool which takes > > the result of multiple pg_basebackup runs and then combines them. > That would be really unsafe, because if the tool is interrupted before > it finishes (and fsyncs everything), you no longer have any usable > backup. It also doesn't lend itself to several of the scenarios I > described in my original email -- like endless incrementals that are > merged into the full backup after some number of days -- a capability > upon which others have already remarked positively. There's really two things here- the first is that I agree with the concern about potentially destroying the existing backup if the pg_basebackup doesn't complete, but there's some ways to address that (such as filesystem snapshotting), so I'm not sure that the idea is quite that bad, but it would need to be more than just what pg_basebackup does in this case in order to be trustworthy (at least, for most). The other part here is the idea of endless incrementals where the blocks which don't appear to have changed are never re-validated against what's in the backup. Unfortunately, latent corruption happens and you really want to have a way to check for that. In past discussions that I've had with David, there's been some idea to check some percentage of the blocks that didn't appear to change for each backup against what's in the backup. I share this just to point out that there's some risk to that approach, not to say that we shouldn't do it or that we should discourage the development of such a feature. > > An alternative tool might be one which simply reads the WAL and keeps > > track of the FPIs and the updates and then eliminates any duplication > > which exists in the set of WAL provided (that is, multiple FPIs for the > > same page would be merged into one, and only the delta changes to that > > page are preserved, across the entire set of WAL being combined). Of > > course, that's complicated by having to deal with the other files in the > > database, so it wouldn't really work on its own. > You've jumped back to solving the server's problem (which blocks > should I send?) rather than the client's problem (what does an > incremental backup look like once I've taken it and how do I manage > and restore them?). It does seem possible to figure out the contents > of modified blocks strictly from looking at the WAL, without any > examination of the current database contents. However, it also seems > very complicated, because the tool that is figuring out the current > block contents just by looking at the WAL would have to know how to > apply any type of WAL record, not just one that contains an FPI. And > I really don't want to build a client-side tool that knows how to > apply WAL. Wow. I have to admit that I feel completely opposite of that- I'd *love* to have an independent tool (which ideally uses the same code through the common library, or similar) that can be run to apply WAL. In other words, I don't agree that it's the server's problem at all to solve that, or, at least, I don't believe that it needs to be. > > I'd really prefer that we avoid adding in another low-level tool like > > the one described here.
Users, imv anyway, don't want to deal with > > *more* tools for handling this aspect of backup/recovery. If we had a > > tool in core today which managed multiples backups, kept track of them, > > and all of the WAL during and between them, then we could add options to > > that tool to do what's being described here in a way that makes sense > > and provides a good interface to users. I don't know that we're going > > to be able to do that with pg_basebackup when, really, the goal here > > isn't actually to make pg_basebackup into an enterprise backup tool, > > it's to make things easier for the external tools to do block-level > > backups. > > Well, I agree with you that the goal is not to make pg_basebackup an > enterprise backup tool. However, I don't see teaching it to take > incremental backups as opposed to that goal. I think backup > management and retention should remain firmly outside the purview of > pg_basebackup and left either to some other in-core tool or maybe even > to out-of-core tools. However, I don't see any reason why that the > task of taking an incremental and/or parallel backup should also be > left to another tool. I've tried to outline how the incremental backup capability and backup management are really very closely related and having those be implemented by independent tools is not a good interface for our users to have to live with. > There is a very close relationship between the thing that > pg_basebackup already does (copy everything) and the thing that we > want to do here (copy everything except blocks that we know haven't > changed). If we made it the job of some other tool to take parallel > and/or incremental backups, that other tool would need to reimplement > a lot of things that pg_basebackup has already got, like tar vs. plain > format, fast vs. spread checkpoint, rate-limiting, compression levels, > etc. That seems like a waste. Better to give pg_basebackup the > capability to do those things, and then any backup management tool > that anyone writes can take advantage of those capabilities. I don't believe any of the external tools which do backups of PostgreSQL support tar format. Fast-vs-spread checkpointing isn't in the purview of the external tools, they just have to accept the option and pass it to pg_start_backup(), which they already know how to do. Rate-limiting and compression are implemented by those other tools already, where it's been desired. Most of the external tools don't use pg_basebackup, nor the base backup protocol (or, if they do, it's only as an option among others). In my opinion, that's pretty clear indication that pg_basebackup and the base backup protocol aren't sufficient to cover any but the simplest of use-cases (though those simple use-cases are handled rather well). We're talking about adding on a capability that's much more complicated and is one that a lot of tools have already taken a stab at, let's try to do it in a way that those tools can leverage it and avoid having to implement it themselves. > I come at this, BTW, from the perspective of having just spent a bunch > of time working on EDB's Backup And Recovery Tool (BART). That tool > works in exactly the manner you seem to be advocating: it knows how to > do incremental and parallel full backups, and it also does backup > management. However, this has not turned out to be the best division > of labor. 
People who don't want to use the backup management > capabilities may still want the parallel or incremental backup > capabilities, and if all of that is within the envelope of an > "enterprise backup tool," they don't have that option. So I want to > split it up. I want pg_basebackup to take all the kinds of backups > that PostgreSQL supports -- full, incremental, parallel, serial, > whatever -- and I want some other tool -- pgBackRest, BART, barman, or > some yet-to-be-invented core thing to do the management of those > backups. Then everybody can use exactly the bits they want. I come at this from years of working with David on pgBackRest, listening to what users want, what features they like, what they'd like to see added, and what they don't like about how it works today. It's an interesting idea to add in everything to pg_basebackup that users doing backups would like to see, but that's quite a list:
- full backups
- differential backups
- incremental backups / block-level backups
- (server-side) compression
- (server-side) encryption
- page-level checksum validation
- calculating checksums (on the whole file)
- External object storage (S3, et al)
- more things...
I'm really not convinced that I agree with the division of labor as you've outlined it, where all of the above is done by pg_basebackup, where just archiving and backup retention are handled by some external tool (except that we already have pg_receivewal, so archiving isn't really an externally handled thing either, unless you want features like parallel archive-push or parallel archive-get...). What would really help me, at least, understand the idea here would be to understand exactly what the existing tools do that the subset of users you're thinking about doesn't like/want, but which pg_basebackup, today, does. Is the issue that there's a repository instead of just a plain PG directory or set of tar files, like what pg_basebackup produces today? But how would we do things like have compression, or encryption, or block-based incremental backups without some kind of repository or directory that doesn't actually look exactly like a PG data directory? Another thing I really don't understand from this discussion, and part of why it's taken me a while to respond, is this, from above: > I think a lot of users do not want > PostgreSQL to do backup management for them. Followed by: > I come at this, BTW, from the perspective of having just spent a bunch > of time working on EDB's Backup And Recovery Tool (BART). That tool > works in exactly the manner you seem to be advocating: it knows how to > do incremental and parallel full backups, and it also does backup > management. I certainly can understand that there are PostgreSQL users who want to leverage incremental backups without having to use BART or another tool outside of whatever enterprise backup system they've got, but surely that's a large pool of users who *do* want a PG backup tool that manages backups, or you wouldn't have spent a considerable amount of your very valuable time hacking on BART. I've certainly seen a fair share of both and I don't think we should set out to exclude either. 
Perhaps that's what we're both saying too and just talking past each other, but I feel like the approach here is "make it work just for the simple pg_basebackup case and not worry too much about the other tools, since what we do for pg_basebackup will work for them too" while where I'm coming from is "focus on what the other tools need first, and then make pg_basebackup work with that if there's a sensible way to do so." A third possibility is that it's just too early to be talking about this since it means we've gotta be awfully vague about it. Thanks! Stephen
On Wed, Apr 17, 2019 at 11:57:35AM -0400, Bruce Momjian wrote: > On Tue, Apr 16, 2019 at 06:40:44PM -0400, Robert Haas wrote: > > Yeah, I really hope we don't end up with dueling patches. I want to > > come up with an approach that can be widely-endorsed and then have > > everybody rowing in the same direction. On the other hand, I do think > > that we may support multiple options in certain places which may have > > the kinds of trade-offs that Bruce mentions. For instance, > > identifying changed blocks by scanning the whole cluster and checking > > the LSN of each block has an advantage in that it requires no prior > > setup or extra configuration. Like a sequential scan, it always > > works, and that is an advantage. Of course, for many people, the > > competing advantage of a WAL-scanning approach that can save a lot of > > I/O will appear compelling, but maybe not for everyone. I think > > there's room for two or three approaches there -- not in the sense of > > competing patches, but in the sense of giving users a choice based on > > their needs. > > Well, by having a separate modblock file for each WAL file, you can keep > both WAL and modblock files and use the modblock list to pull pages from > each WAL file, or from the heap/index files, and it can be done in > parallel. Having WAL and modblock files in the same directory makes > retention simpler. > > In fact, you can do an incremental backup just using the modblock files > and the heap/index files, so you don't even need the WAL. > > Also, instead of storing the file name and block number in the modblock > file, using the database oid, relfilenode, and block number (3 int32 > values) should be sufficient. Would doing it that way constrain the design of new table access methods in some meaningful way? Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Thu, Apr 18, 2019 at 05:32:57PM +0200, David Fetter wrote: > On Wed, Apr 17, 2019 at 11:57:35AM -0400, Bruce Momjian wrote: > > Also, instead of storing the file name and block number in the modblock > > file, using the database oid, relfilenode, and block number (3 int32 > > values) should be sufficient. > > Would doing it that way constrain the design of new table access > methods in some meaningful way? I think these are the values used in WAL, so I assume table access methods already have to map to those, unless they use their own. I actually don't know. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Wed, Apr 17, 2019 at 5:20 PM Stephen Frost <sfrost@snowman.net> wrote: > As I understand it, the problem is not with backing up an individual > database or cluster, but rather dealing with backing up thousands of > individual clusters with thousands of tables in each, leading to an > awful lot of tables with lots of FSMs/VMs, all of which end up having to > get copied and stored wholesale. I'll point this thread out to him and > hopefully he'll have a chance to share more specific information. Sounds good. > I can agree with the idea of having multiple options for how to collect > up the set of changed blocks, though I continue to feel that a > WAL-scanning approach isn't something that we'd have implemented in the > backend at all since it doesn't require the backend and a given backend > might not even have all of the WAL that is relevant. I certainly don't > think it makes sense to have a backend go get WAL from the archive to > then merge the WAL to provide the result to a client asking for it- > that's adding entirely unnecessary load to the database server. My motivation for wanting to include it in the database server was twofold: 1. I was hoping to leverage the background worker machinery. The WAL-scanner would just run all the time in the background, and start up and shut down along with the server. If it's a standalone tool, then it can run on a different server or when the server is down, both of which are nice. The downside though is that now you probably have to put it in crontab or under systemd or something, instead of just setting a couple of GUCs and letting the server handle the rest. For me that downside seems rather significant, but YMMV. 2. In order for the information produced by the WAL-scanner to be useful, it's got to be available to the server when the server is asked for an incremental backup. If the information is constructed by a standalone frontend tool, and stored someplace other than under $PGDATA, then the server won't have convenient access to it. I guess we could make it the client's job to provide that information to the server, but I kind of liked the simplicity of not needing to give the server anything more than an LSN. > A thought that occurs to me is to have the functions for supporting the > WAL merging be included in libcommon and available to both the > independent executable that's available for doing WAL merging, and to > the backend to be able to WAL merging itself- Yeah, that might be possible. > but for a specific > purpose: having a way to reduce the amount of WAL that needs to be sent > to a replica which has a replication slot but that's been disconnected > for a while. Of course, there'd have to be some way to handle the other > files for that to work to update a long out-of-date replica. Now, if we > taught the backup tool about having a replication slot then perhaps we > could have the backend effectively have the same capability proposed > above, but without the need to go get the WAL from the archive > repository. Hmm, but you can't just skip over WAL records or segments because there are checksums and previous-record pointers and things.... > I'm still not entirely sure that this makes sense to do in the backend > due to the additional load, this is really just some brainstorming. Would it really be that much load? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
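[For illustration of point 1 above: if the WAL-scanner did live in the server, it would presumably be wired up through the existing background worker API, roughly as sketched below. The worker, library, and function names here are invented for this example; the API calls themselves are the ones the server already provides.]

    #include "postgres.h"
    #include "fmgr.h"
    #include "postmaster/bgworker.h"

    PG_MODULE_MAGIC;

    /*
     * Hypothetical module entry point registering a WAL-scanning worker
     * that starts once recovery finishes and is restarted if it crashes.
     */
    void
    _PG_init(void)
    {
        BackgroundWorker worker;

        memset(&worker, 0, sizeof(worker));
        worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
        worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
        worker.bgw_restart_time = 60;   /* seconds to wait before restart */
        snprintf(worker.bgw_name, BGW_MAXLEN, "wal block scanner");
        snprintf(worker.bgw_library_name, BGW_MAXLEN, "wal_scanner");
        snprintf(worker.bgw_function_name, BGW_MAXLEN, "wal_scanner_main");
        RegisterBackgroundWorker(&worker);
    }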
Hi, On 2019-04-18 11:34:32 -0400, Bruce Momjian wrote: > On Thu, Apr 18, 2019 at 05:32:57PM +0200, David Fetter wrote: > > On Wed, Apr 17, 2019 at 11:57:35AM -0400, Bruce Momjian wrote: > > > Also, instead of storing the file name and block number in the modblock > > > file, using the database oid, relfilenode, and block number (3 int32 > > > values) should be sufficient. > > > > Would doing it that way constrain the design of new table access > > methods in some meaningful way? > > I think these are the values used in WAL, so I assume table access > methods already have to map to those, unless they use their own. > I actually don't know. I don't think it'd be a meaningful restriction. Given that we use those for shared_buffer descriptors, WAL etc. Greetings, Andres Freund
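[Purely as an illustration of the identifiers being discussed: an entry in such a "modblock" file might be little more than the triple Bruce suggests, plus the fork number that WAL block references also carry. The struct name and the idea of a fixed-size record are hypothetical; RelFileNode, ForkNumber, and BlockNumber are the existing server types.]

    #include "postgres.h"
    #include "common/relpath.h"
    #include "storage/block.h"
    #include "storage/relfilenode.h"

    /* Hypothetical fixed-size record in a per-WAL-segment modblock file. */
    typedef struct ModBlockEntry
    {
        RelFileNode rnode;      /* tablespace OID, database OID, relfilenode */
        ForkNumber  forknum;    /* main fork, FSM, visibility map, ... */
        BlockNumber blkno;      /* block within that fork */
    } ModBlockEntry;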
On Wed, Apr 17, 2019 at 6:43 PM Stephen Frost <sfrost@snowman.net> wrote: > Sadly, I haven't got any great ideas today. I do know that the WAL-G > folks have specifically mentioned issues with the visibility map being > large enough across enough of their systems that it kinda sucks to deal > with. Perhaps we could do something like the rsync binary-diff protocol > for non-relation files? This is clearly just hand-waving but maybe > there's something reasonable in that idea. I guess it all comes down to how complicated you're willing to make the client-server protocol. With the very simple protocol that I proposed -- client provides a threshold LSN and server sends blocks modified since then -- the client need not have access to the old incremental backup to take a new one. Of course, if it happens to have access to the old backup then it can delta-compress however it likes after-the-fact, but that doesn't help with the amount of network transfer. That problem could be solved by doing something like what you're talking about (with some probably-negligible false match rate) but I have no intention of trying to implement anything that complicated, and I don't really think it's necessary, at least not for a first version. What I proposed would already allow, for most users, a large reduction in transfer and storage costs; what you are talking about here would help more, but also be a lot more work and impose some additional requirements on the system. I don't object to you implementing the more complex system, but I'll pass. > There's something like 6 different backup tools, at least, for > PostgreSQL that provide backup management, so I have a really hard time > agreeing with this idea that users don't want a PG backup management > system. Maybe that's not what you're suggesting here, but that's what > came across to me. Let me be a little more clear. Different users want different things. Some people want a canned PostgreSQL backup solution, while other people just want access to a reasonable set of facilities from which they can construct their own solution. I believe that the proposal I am making here could be used either by backup tool authors to enhance their offerings, or by individuals who want to build up their own solution using facilities provided by core. > Unless maybe I'm misunderstanding and what you're suggesting here is > that the "existing solution" is something like the external PG-specific > backup tools? But then the rest doesn't seem to make sense, as only > maybe one or two of those tools use pg_basebackup internally. Well, what I'm really talking about is in two pieces: providing some new facilities via the replication protocol, and making pg_basebackup able to use those facilities. Nothing would stop other tools from using those facilities directly if they wish. > ... but this is exactly the situation we're in already with all of the > *other* features around backup (parallel backup, backup management, WAL > management, etc). Users want those features, pg_basebackup/PG core > doesn't provide it, and therefore there's a bunch of other tools which > have been written that do. In addition, saying that PG has incremental > backup but no built-in management of those full-vs-incremental backups > and telling users that they basically have to build that themselves > really feels a lot like we're trying to address a check-box requirement > rather than making something that our users are going to be happy with. I disagree. 
Yes, parallel backup, like incremental backup, needs to go in core. And pg_basebackup should be able to do a parallel backup. I will fight tooth, nail, and claw any suggestion that the server should know how to do a parallel backup but pg_basebackup should not have an option to exploit that capability. And similarly for incremental. > I don't think that I was very clear in what my specific concern here > was. I'm not asking for pg_basebackup to have parallel backup (at > least, not in this part of the discussion), I'm asking for the > incremental block-based protocol that's going to be built-in to core to > be able to be used in a parallel fashion. > > The existing protocol that pg_basebackup uses is basically, connect to > the server and then say "please give me a tarball of the data directory" > and that is then streamed on that connection, making that protocol > impossible to use for parallel backup. That's fine as far as it goes > because only pg_basebackup actually uses that protocol (note that nearly > all of the other tools for doing backups of PostgreSQL don't...). If > we're expecting the external tools to use the block-level incremental > protocol then that protocol really needs to have a way to be > parallelized, otherwise we're just going to end up with all of the > individual tools doing their own thing for block-level incremental > (though perhaps they'd reimplement whatever is done in core but in a way > that they could parallelize it...), if possible (which I add just in > case there's some idea that we end up in a situation where the > block-level incremental backup has to coordinate with the backend in > some fashion to work... which would mean that *everyone* has to use the > protocol even if it isn't parallel and that would be really bad, imv). The obvious way of extending this system to parallel backup is to have N connections each streaming a separate tarfile such that when you combine them all you recreate the original data directory. That would be perfectly compatible with what I'm proposing for incremental backup. Maybe you have another idea in mind, but I don't know what it is exactly. > > Wait, you want to make it maximally easy for users to start the server > > in a state that is 100% certain to result in a corrupted and unusable > > database? Why?? I'd l like to make that a tiny bit difficult. If > > they really want a corrupted database, they can remove the file. > > No, I don't want it to be easy for users to start the server in a state > that's going to result in a corrupted cluster. That's basically the > complete opposite of what I was going for- having a file that can be > trivially removed to start up the cluster is *going* to result in people > having corrupted clusters, no matter how much we tell them "don't do > that". This is exactly the problem with have with backup_label today. > I'd really rather not double-down on that. Well, OK, but short of scanning the entire directory tree on startup, I don't see how to achieve that. > There's really two things here- the first is that I agree with the > concern about potentially destorying the existing backup if the > pg_basebackup doesn't complete, but there's some ways to address that > (such as filesystem snapshotting), so I'm not sure that the idea is > quite that bad, but it would need to be more than just what > pg_basebackup does in this case in order to be trustworthy (at least, > for most). Well, I did mention in my original email that there could be a combine-backups-destructively option. 
I guess this is just taking that to the next level: merge a backup being taken into an existing backup on-the-fly. Given your remarks above, it is worth noting that this GREATLY increases the chances of people accidentally causing corruption in ways that are almost undetectable. All they have to do is kill -9 the backup tool halfway through and then start postgres on the resulting directory. > The other part here is the idea of endless incrementals where the blocks > which don't appear to have changed are never re-validated against what's > in the backup. Unfortunately, latent corruption happens and you really > want to have a way to check for that. In past discussions that I've had > with David, there's been some idea to check some percentage of the > blocks that didn't appear to change for each backup against what's in > the backup. Sure, I'm not trying to block anybody from developing something like that, and I acknowledge that there is risk in a system like this, but... > I share this just to point out that there's some risk to that approach, > not to say that we shouldn't do it or that we should discourage the > development of such a feature. ...it seems we are viewing this, at least, from the same perspective. > Wow. I have to admit that I feel completely opposite of that- I'd > *love* to have an independent tool (which ideally uses the same code > through the common library, or similar) that can be run to apply WAL. > > In other words, I don't agree that it's the server's problem at all to > solve that, or, at least, I don't believe that it needs to be. I mean, I guess I'd love to have that if I could get it by waving a magic wand, but I wouldn't love it if I had to write the code or maintain it. The routines for applying WAL currently all assume that you have a whole bunch of server infrastructure present; that code wouldn't run in a frontend environment, I think. I wouldn't want to have a second copy of every WAL apply routine that might have its own set of bugs. > I've tried to outline how the incremental backup capability and backup > management are really very closely related and having those be > implemented by independent tools is not a good interface for our users > to have to live with. I disagree. I think the "existing backup tools don't use pg_basebackup" argument isn't very compelling, because the reason those tools don't use pg_basebackup is because it can't do what they need. If it did, they'd probably use it. People don't write a whole separate engine for running backups just because it's fun to not reuse code -- they do it because there's no other way to get what they want. > Most of the external tools don't use pg_basebackup, nor the base backup > protocol (or, if they do, it's only as an option among others). In my > opinion, that's pretty clear indication that pg_basebackup and the base > backup protocol aren't sufficient to cover any but the simplest of > use-cases (though those simple use-cases are handled rather well). > We're talking about adding on a capability that's much more complicated > and is one that a lot of tools have already taken a stab at, let's try > to do it in a way that those tools can leverage it and avoid having to > implement it themselves. I mean, again, if it were part of pg_basebackup and available via the replication protocol, they could do exactly that, through either method. I don't get it. 
You seem to be arguing that we shouldn't add the necessary capabilities to the replication protocol or pg_basebackup, but at the same time arguing that pg_basebackup is inadequate because it's missing important capabilities. This confuses me. > It's an interesting idea to add in everything to pg_basebackup that > users doing backups would like to see, but that's quite a list: > > - full backups > - differential backups > - incremental backups / block-level backups > - (server-side) compression > - (server-side) encryption > - page-level checksum validation > - calculating checksums (on the whole file) > - External object storage (S3, et al) > - more things... > > I'm really not convinced that I agree with the division of labor as > you've outlined it, where all of the above is done by pg_basebackup, > where just archiving and backup retention are handled by some external > tool (except that we already have pg_receivewal, so archiving isn't > really an externally handled thing either, unless you want features like > parallel archive-push or parallel archive-get...). Yeah, if it were up to me, I'd choose put most of that in the server and make it available via the replication protocol, and then give pg_basebackup able to use that functionality. And external tools could use that functionality via pg_basebackup or by using the replication protocol directly. I actually don't really understand what the alternative is. If you want server-side compression, for example, that really has to be done on the server. And how would the server expose that, except through the replication protocol? Sure, we could design a new protocol for it. Call it... say... the shmeplication protocol. And then you could use the replication protocol for what it does today and the shmeplication protocol for all the cool bits. But why would that be better? > What would really help me, at least, understand the idea here would be > to understand exactly what the existing tools do that the subset of > users you're thinking about doesn't like/want, but which pg_basebackup, > today, does. Is the issue that there's a repository instead of just a > plain PG directory or set of tar files, like what pg_basebackup produces > today? But how would we do things like have compression, or encryption, > or block-based incremental backups without some kind of repository or > directory that doesn't actually look exactly like a PG data directory? I guess we're still wallowing in the same confusion here. pg_basebackup, for me, is just a convenient place to stick this functionality. If the server has the ability to construct and send an incremental backup by some means, then it needs a client on the other end to receive and store that backup, and since pg_basebackup already knows how to do that for full backups, extending it to incremental backups (and/or parallel, encrypted, compressed, and validated backups) seems very natural to me. Otherwise I add server-side functionality to allow $X and then have to write an entirely new client to interact with that instead of just using the client I've already got. That's more work, and I'm lazy. Now it's true that if we wanted to build something like the rsync protocol into PostgreSQL, jamming that into pg_basebackup might well be a bridge too far. That would involve taking backups via a method so different from what we're currently doing that it would probably make sense to at least consider creating a whole new tool for that purpose. But that wasn't my proposal... 
> I certainly can understand that there are PostgreSQL users who want to > leverage incremental backups without having to use BART or another tool > outside of whatever enterprise backup system they've got, but surely > that's a large pool of users who *do* want a PG backup tool that manages > backups, or you wouldn't have spent a considerable amount of your very > valuable time hacking on BART. I've certainly seen a fair share of both > and I don't think we should set out to exclude either. Sure, I agree. > Perhaps that's what we're both saying too and just talking past each > other, but I feel like the approach here is "make it work just for the > simple pg_basebackup case and not worry too much about the other tools, > since what we do for pg_basebackup will work for them too" while where > I'm coming from is "focus on what the other tools need first, and then > make pg_basebackup work with that if there's a sensible way to do so." I think perhaps the disconnect is that I just don't see how it can fail to work for the external tools if it works for pg_basebackup. Any given piece of functionality is either available in the replication stream, or it's not. I suspect that for both BART and pg_backrest, they won't be able to completely give up on having their own backup engines solely because core has incremental backup, but I don't know what the alternative to adding features to core one at a time is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
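[As a concrete illustration of the simplest server-side strategy discussed upthread -- scan every block and send only those whose page LSN is at or past the client-supplied threshold -- the per-block test could be as small as the sketch below. The function name is invented; PageIsNew and PageGetLSN are the existing page macros. Reading the relation file and all error handling are omitted.]

    #include "postgres.h"
    #include "access/xlogdefs.h"
    #include "storage/bufpage.h"

    /*
     * Return true if this block must be included in an incremental backup
     * taken relative to threshold_lsn.  New (all-zero) pages carry no LSN,
     * so send them rather than guess.
     */
    static bool
    block_needs_sending(Page page, XLogRecPtr threshold_lsn)
    {
        if (PageIsNew(page))
            return true;
        return PageGetLSN(page) >= threshold_lsn;
    }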
Hi, > > Wow. I have to admit that I feel completely opposite of that- I'd > > *love* to have an independent tool (which ideally uses the same code > > through the common library, or similar) that can be run to apply WAL. > > > > In other words, I don't agree that it's the server's problem at all to > > solve that, or, at least, I don't believe that it needs to be. > > I mean, I guess I'd love to have that if I could get it by waving a > magic wand, but I wouldn't love it if I had to write the code or > maintain it. The routines for applying WAL currently all assume that > you have a whole bunch of server infrastructure present; that code > wouldn't run in a frontend environment, I think. I wouldn't want to > have a second copy of every WAL apply routine that might have its own > set of bugs. I'll fight tooth and nail not to have a second implementation of replay, even if it's just portions. The code we have is complicated and fragile enough; having a [partial] second version would be way worse. There's already plenty of improvements we need to make to speed up replay, and a lot of them require multiple execution threads (be it processes or OS threads), something not easily feasible in a standalone tool. And without the already existing concurrent work during replay (primarily checkpointer doing a lot of the necessary IO), it'd also be pretty unattractive to use any separate tool. Unless you just define the server binary as that "independent tool". Which I think is entirely reasonable. With the 'consistent' and LSN recovery targets one already can get most of what's needed from such a tool, anyway. I'd argue the biggest issue there is that there's no equivalent to starting postgres with a private socket directory on Windows, and perhaps an option or two making it easier to start postgres in a "private" mode for things like this. Greetings, Andres Freund
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Wed, Apr 17, 2019 at 5:20 PM Stephen Frost <sfrost@snowman.net> wrote: > > As I understand it, the problem is not with backing up an individual > > database or cluster, but rather dealing with backing up thousands of > > individual clusters with thousands of tables in each, leading to an > > awful lot of tables with lots of FSMs/VMs, all of which end up having to > > get copied and stored wholesale. I'll point this thread out to him and > > hopefully he'll have a chance to share more specific information. > > Sounds good. Ok, done. > > I can agree with the idea of having multiple options for how to collect > > up the set of changed blocks, though I continue to feel that a > > WAL-scanning approach isn't something that we'd have implemented in the > > backend at all since it doesn't require the backend and a given backend > > might not even have all of the WAL that is relevant. I certainly don't > > think it makes sense to have a backend go get WAL from the archive to > > then merge the WAL to provide the result to a client asking for it- > > that's adding entirely unnecessary load to the database server. > > My motivation for wanting to include it in the database server was twofold: > > 1. I was hoping to leverage the background worker machinery. The > WAL-scanner would just run all the time in the background, and start > up and shut down along with the server. If it's a standalone tool, > then it can run on a different server or when the server is down, both > of which are nice. The downside though is that now you probably have > to put it in crontab or under systemd or something, instead of just > setting a couple of GUCs and letting the server handle the rest. For > me that downside seems rather significant, but YMMV. Background workers can be used to do pretty much anything. I'm not suggesting that's a bad thing- just that it's such a completely generic tool that could be used to put anything/everything into the backend, so I'm not sure how much it makes sense as an argument when it comes to designing a new capability/feature. Yes, there's an advantage there when it comes to configuration since that means we don't need to set up a cronjob and can, instead, just set a few GUCs... but it also means that it *must* be done on the server and there's no option to do it elsewhere, as you say. When it comes to "this is something that I can do on the DB server or on some other server", the usual preference is to use another system for it, to reduce load on the server. If it comes down to something that needs to/should be an ongoing process, then the packaging can package that as a daemon-type tool which handles the systemd component to it, assuming the stand-alone tool supports that, which it hopefully would. > 2. In order for the information produced by the WAL-scanner to be > useful, it's got to be available to the server when the server is > asked for an incremental backup. If the information is constructed by > a standalone frontend tool, and stored someplace other than under > $PGDATA, then the server won't have convenient access to it. I guess > we could make it the client's job to provide that information to the > server, but I kind of liked the simplicity of not needing to give the > server anything more than an LSN. 
If the WAL-scanner tool is a stand-alone tool, and it handles picking out all of the FPIs and incremental page changes for each relation, then what does the tool to build out the "new" backup really need to tell the backend? I feel like it mainly needs to ask the backend for the non-relation files, which gets into at least one approach that I've thought about for redesigning the backup protocol:
1. Ask for a list of files and metadata about them
2. Allow asking for individual files
3. Support multiple connections asking for individual files
Quite a few of the existing backup tools for PG use a model along these lines (or use tools underneath which do). > > A thought that occurs to me is to have the functions for supporting the > > WAL merging be included in libcommon and available to both the > > independent executable that's available for doing WAL merging, and to > > the backend to be able to WAL merging itself- > > Yeah, that might be possible. I feel like this would be necessary, as it's certainly delicate and critical code and having multiple implementations of it will be difficult to manage. That said... we already have independent work going on to do WAL merging (WAL-G, at least), and if we insist that the WAL replay code only exists in the backend, I strongly suspect we'll end up with independent implementations of that too. Sure, we can distance ourselves from that and say that we don't have to deal with any bugs from it... but it seems like the better approach would be to have a common library that provides it. > > but for a specific > > purpose: having a way to reduce the amount of WAL that needs to be sent > > to a replica which has a replication slot but that's been disconnected > > for a while. Of course, there'd have to be some way to handle the other > > files for that to work to update a long out-of-date replica. Now, if we > > taught the backup tool about having a replication slot then perhaps we > > could have the backend effectively have the same capability proposed > > above, but without the need to go get the WAL from the archive > > repository. > > Hmm, but you can't just skip over WAL records or segments because > there are checksums and previous-record pointers and things.... Those aren't what I would be worried about, I'd think? Maybe we're talking about different things, but if there's a way to scan/compress WAL so that we have less work to do when replaying, then we should leverage that for replicas that have been disconnected for a while too. One important bit here is that the replica wouldn't be able to answer queries while it's working through this compressed WAL, since it wouldn't reach a consistent state until more-or-less the end of WAL, but I am not sure that's a bad thing; who wants to get responses back from a very out-of-date replica? > > I'm still not entirely sure that this makes sense to do in the backend > > due to the additional load, this is really just some brainstorming. > > Would it really be that much load? Well, it'd clearly be more than zero. There may be an argument to be made that it's worth it to reduce the overall throughput of the system in order to add this capability, but I don't think we've got enough information at this point to know. 
My gut feeling, at least, is that tracking enough information to do WAL-compression on a high-write system is going to be pretty expensive as you'd need to have a data structure that makes it easy to identify every page in the system, and be able to find each of them later on in the stream, and then throw away the old FPI in favor of the new one, and then track all the incremental page updates to that page, more-or-less, right? On a large system, given how much information has to be tracked, it seems like it could be a fair bit of load, but perhaps you've got some ideas as to how to reduce it..? Thanks! Stephen
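[To picture the kind of bookkeeping being described here, one plausible -- and purely hypothetical -- shape for it is a hash table keyed by block identity whose entry remembers where the newest full-page image for that block was seen, so older FPIs can be dropped and later non-FPI changes counted. The struct names are invented; the types used for the key are the existing server ones.]

    #include "postgres.h"
    #include "access/xlogdefs.h"
    #include "common/relpath.h"
    #include "storage/block.h"
    #include "storage/relfilenode.h"

    /* Hypothetical hash key identifying one block in the cluster. */
    typedef struct TrackedBlockKey
    {
        RelFileNode rnode;
        ForkNumber  forknum;
        BlockNumber blkno;
    } TrackedBlockKey;

    /* Hypothetical hash entry for that block. */
    typedef struct TrackedBlockEntry
    {
        TrackedBlockKey key;        /* hash key; kept at the start of the entry */
        XLogRecPtr  latest_fpi_lsn; /* WAL location of the newest FPI seen */
        uint32      updates_since;  /* incremental records seen after that FPI */
    } TrackedBlockEntry;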
Greetings, I wanted to respond to this point specifically as I feel like it'll really help clear things up when it comes to the point of view I'm seeing this from. * Robert Haas (robertmhaas@gmail.com) wrote: > > Perhaps that's what we're both saying too and just talking past each > > other, but I feel like the approach here is "make it work just for the > > simple pg_basebackup case and not worry too much about the other tools, > > since what we do for pg_basebackup will work for them too" while where > > I'm coming from is "focus on what the other tools need first, and then > > make pg_basebackup work with that if there's a sensible way to do so." > > I think perhaps the disconnect is that I just don't see how it can > fail to work for the external tools if it works for pg_basebackup. The existing backup protocol that pg_basebackup uses *does* *not* *work* for the external backup tools. If it worked, they'd use it, but they don't and that's because you can't do things like a parallel backup, which we *know* users want because there's a number of tools which implement that exact capability. I do *not* want another piece of functionality added in this space which is limited in the same way because it does *not* help the external backup tools at all. > Any given piece of functionality is either available in the > replication stream, or it's not. I suspect that for both BART and > pg_backrest, they won't be able to completely give up on having their > own backup engines solely because core has incremental backup, but I > don't know what the alternative to adding features to core one at a > time is. This idea that it's either "in the replication system" or "not in the replication system" is really bad, in my view, because it can be "in the replication system" and at the same time not at all useful to the existing external backup tools, but users and others will see the "checkbox" as ticked and assume that it's available in a useful fashion by the backend and then get upset when they discover the limitations. The existing base backup/replication protocol that's used by pg_basebackup is *not* useful to most of the backup tools, that's quite clear since they *don't* use it. Building on to that an incremental backup solution that is similarly limited isn't going to make things easier for the external tools. If the goal is to make things easier for the external tools by providing capability in the backend / replication protocol then we need to be looking at what those tools require and not at what would be minimally sufficient for pg_basebackup. If we don't care about the external tools and *just* care about making it work for pg_basebackup, then let's be clear about that, and accept that it'll have to be, most likely, ripped out and rewritten when we go to add parallel capabilities, for example, to pg_basebackup down the road. That's clearly the case for the existing "base backup" protocol, so I don't see why it'd be different for an incremental backup system that is similarly designed and implemented. To be clear, I'm all for adding features to core one at a time, but there's different ways to implement features and that's really what we're talking about here- what's the best way to implement this feature, ideally in a way that it's useful, practically, to both pg_basebackup and the other external backup utilities. Thanks! Stephen
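[To illustrate the interaction pattern being argued for here -- a manifest step followed by per-file requests that any number of connections can issue -- a hypothetical client-side flow using ordinary libpq calls might look like the sketch below. The LIST_BACKUP_FILES and SEND_BACKUP_FILE commands do not exist in any released or proposed protocol; they are made up purely to show the shape of the exchange, and result checking beyond the connection itself is omitted.]

    #include <stdio.h>
    #include <stdlib.h>
    #include "libpq-fe.h"

    int
    main(void)
    {
        PGconn     *conn;
        PGresult   *manifest;
        PGresult   *file;

        /* A replication connection, as pg_basebackup already makes today. */
        conn = PQconnectdb("replication=database dbname=postgres");
        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return EXIT_FAILURE;
        }

        /* Step 1 (hypothetical command): fetch the list of files and metadata. */
        manifest = PQexec(conn, "LIST_BACKUP_FILES");

        /*
         * Steps 2 and 3: each worker connection asks for individual files
         * from the manifest; only one request on one connection is shown.
         */
        file = PQexec(conn, "SEND_BACKUP_FILE 'base/16384/16417'");

        PQclear(manifest);
        PQclear(file);
        PQfinish(conn);
        return EXIT_SUCCESS;
    }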
Greetings, Ok, responding to the rest of this email. * Robert Haas (robertmhaas@gmail.com) wrote: > On Wed, Apr 17, 2019 at 6:43 PM Stephen Frost <sfrost@snowman.net> wrote: > > Sadly, I haven't got any great ideas today. I do know that the WAL-G > > folks have specifically mentioned issues with the visibility map being > > large enough across enough of their systems that it kinda sucks to deal > > with. Perhaps we could do something like the rsync binary-diff protocol > > for non-relation files? This is clearly just hand-waving but maybe > > there's something reasonable in that idea. > > I guess it all comes down to how complicated you're willing to make > the client-server protocol. With the very simple protocol that I > proposed -- client provides a threshold LSN and server sends blocks > modified since then -- the client need not have access to the old > incremental backup to take a new one. Where is the client going to get the threshold LSN from? > Of course, if it happens to > have access to the old backup then it can delta-compress however it > likes after-the-fact, but that doesn't help with the amount of network > transfer. If it doesn't have access to the old backup, then I'm a bit confused as to how an incremental backup would be possible? Isn't that a requirement here? > That problem could be solved by doing something like what > you're talking about (with some probably-negligible false match rate) > but I have no intention of trying to implement anything that > complicated, and I don't really think it's necessary, at least not for > a first version. What I proposed would already allow, for most users, > a large reduction in transfer and storage costs; what you are talking > about here would help more, but also be a lot more work and impose > some additional requirements on the system. I don't object to you > implementing the more complex system, but I'll pass. I was talking about the rsync binary-diff specifically for the files that aren't easy to deal with in the WAL stream. I wouldn't think we'd use it for other files, and there is definitely a question there of whether there's a way to do better than a binary-diff approach for those files. > > There's something like 6 different backup tools, at least, for > > PostgreSQL that provide backup management, so I have a really hard time > > agreeing with this idea that users don't want a PG backup management > > system. Maybe that's not what you're suggesting here, but that's what > > came across to me. > > Let me be a little more clear. Different users want different things. > Some people want a canned PostgreSQL backup solution, while other > people just want access to a reasonable set of facilities from which > they can construct their own solution. I believe that the proposal I > am making here could be used either by backup tool authors to enhance > their offerings, or by individuals who want to build up their own > solution using facilities provided by core. The last thing that I think users really want is to build up their own solution. There may be some organizations who would like to provide their own tool, but that's a bit different. Personally, I'd *really* like PG to have a good tool in this area and I've been working, as I've said before, to try to get to a point where we at least have the option to add in such a tool that meets our various requirements. 
Further, I'm concerned that the approach being presented here won't be interesting to most of the external tools because it's limited and can't be used in a parallel fashion. > > Unless maybe I'm misunderstanding and what you're suggesting here is > > that the "existing solution" is something like the external PG-specific > > backup tools? But then the rest doesn't seem to make sense, as only > > maybe one or two of those tools use pg_basebackup internally. > > Well, what I'm really talking about is in two pieces: providing some > new facilities via the replication protocol, and making pg_basebackup > able to use those facilities. Nothing would stop other tools from > using those facilities directly if they wish. If those facilities are developed and implemented in the same way as the protocol used by pg_basebackup works, then I strongly suspect that the existing backup tools will treat it similarly- which is to say, they'll largely end up ignoring it. > > ... but this is exactly the situation we're in already with all of the > > *other* features around backup (parallel backup, backup management, WAL > > management, etc). Users want those features, pg_basebackup/PG core > > doesn't provide it, and therefore there's a bunch of other tools which > > have been written that do. In addition, saying that PG has incremental > > backup but no built-in management of those full-vs-incremental backups > > and telling users that they basically have to build that themselves > > really feels a lot like we're trying to address a check-box requirement > > rather than making something that our users are going to be happy with. > > I disagree. Yes, parallel backup, like incremental backup, needs to > go in core. And pg_basebackup should be able to do a parallel backup. > I will fight tooth, nail, and claw any suggestion that the server > should know how to do a parallel backup but pg_basebackup should not > have an option to exploit that capability. And similarly for > incremental. These aren't independent things though, the way it seems like you're portraying them, because there are ways we can implement incremental backup that would support it being parallelized, and ways we can implement it that wouldn't work with parallelism at all, and all I'm arguing for is that we add in this feature in a way that it can be parallelized (since that's what most of the external tools do today...), even though pg_basebackup can't be, but in a way that pg_basebackup can also use it (albeit in a serial fashion). > > I don't think that I was very clear in what my specific concern here > > was. I'm not asking for pg_basebackup to have parallel backup (at > > least, not in this part of the discussion), I'm asking for the > > incremental block-based protocol that's going to be built-in to core to > > be able to be used in a parallel fashion. > > > > The existing protocol that pg_basebackup uses is basically, connect to > > the server and then say "please give me a tarball of the data directory" > > and that is then streamed on that connection, making that protocol > > impossible to use for parallel backup. That's fine as far as it goes > > because only pg_basebackup actually uses that protocol (note that nearly > > all of the other tools for doing backups of PostgreSQL don't...). 
If > > we're expecting the external tools to use the block-level incremental > > protocol then that protocol really needs to have a way to be > > parallelized, otherwise we're just going to end up with all of the > > individual tools doing their own thing for block-level incremental > > (though perhaps they'd reimplement whatever is done in core but in a way > > that they could parallelize it...), if possible (which I add just in > > case there's some idea that we end up in a situation where the > > block-level incremental backup has to coordinate with the backend in > > some fashion to work... which would mean that *everyone* has to use the > > protocol even if it isn't parallel and that would be really bad, imv). > > The obvious way of extending this system to parallel backup is to have > N connections each streaming a separate tarfile such that when you > combine them all you recreate the original data directory. That would > be perfectly compatible with what I'm proposing for incremental > backup. Maybe you have another idea in mind, but I don't know what it > is exactly. So, while that's an obvious approach, it isn't the most sensible- and we know that from experience in actually implementing parallel backup of PG files. I'm happy to discuss the approach we use in pgBackRest if you'd like to discuss this further, but it seems a bit far afield from the topic of discussion here and it seems like you're not interested or offering to work on supporting parallel backup in core. I'm not saying that what you're proposing here wouldn't, technically, work for the various external tools; what I'm saying is that they aren't going to actually use it, which means that you're really implementing it *only* for pg_basebackup's benefit... and only for as long as pg_basebackup is serial in nature. > > > Wait, you want to make it maximally easy for users to start the server > > > in a state that is 100% certain to result in a corrupted and unusable > > > database? Why?? I'd l like to make that a tiny bit difficult. If > > > they really want a corrupted database, they can remove the file. > > > > No, I don't want it to be easy for users to start the server in a state > > that's going to result in a corrupted cluster. That's basically the > > complete opposite of what I was going for- having a file that can be > > trivially removed to start up the cluster is *going* to result in people > > having corrupted clusters, no matter how much we tell them "don't do > > that". This is exactly the problem with have with backup_label today. > > I'd really rather not double-down on that. > > Well, OK, but short of scanning the entire directory tree on startup, > I don't see how to achieve that. Ok, so, this is a bit of spit-balling, just to be clear, but we currently track things like "where we know the heap files are consistent" by storing it in the control file as a checkpoint LSN, and then we have a backup_label file to say where we need to get to in order to be consistent from a backup. Perhaps there's a way to use those to cross-validate while we are updating a data directory to be consistent? Maybe we update those files as we go, and add a cross-check flag between them, so that we know from two places that we're restoring from a backup (incremental or full), and then also know where we need to start from and where we need to get to, in order to be consistent. 
Of course, users can still get past this by hacking these files around and maybe we can provide a tool along the lines of pg_resetwal which lets them force the files to agree, but then we can at least throw big glaring warnings and tell users "this is really bad, type YES to continue". > > There's really two things here- the first is that I agree with the > > concern about potentially destorying the existing backup if the > > pg_basebackup doesn't complete, but there's some ways to address that > > (such as filesystem snapshotting), so I'm not sure that the idea is > > quite that bad, but it would need to be more than just what > > pg_basebackup does in this case in order to be trustworthy (at least, > > for most). > > Well, I did mention in my original email that there could be a > combine-backups-destructively option. I guess this is just taking > that to the next level: merge a backup being taken into an existing > backup on-the-fly. Given you remarks above, it is worth noting that > this GREATLY increases the chances of people accidentally causing > corruption in ways that are almost undetectable. All they have to do > is kill -9 the backup tool half way through and then start postgres on > the resulting directory. Right, we need to come up with a way to detect if that happens and complain loudly, and not continue to move forward unless and until the user explicitly insists that it's the right thing to do. > > The other part here is the idea of endless incrementals where the blocks > > which don't appear to have changed are never re-validated against what's > > in the backup. Unfortunately, latent corruption happens and you really > > want to have a way to check for that. In past discussions that I've had > > with David, there's been some idea to check some percentage of the > > blocks that didn't appear to change for each backup against what's in > > the backup. > > Sure, I'm not trying to block anybody from developing something like > that, and I acknowledge that there is risk in a system like this, > but... > > > I share this just to point out that there's some risk to that approach, > > not to say that we shouldn't do it or that we should discourage the > > development of such a feature. > > ...it seems we are viewing this, at least, from the same perspective. Great, but I feel like the question here is if we're comfortable putting out this capability *without* some mechanism to verify that the existing blocks are clean/not corrupted/changed, or if we feel like this risk is enough that we want to include a check of the existing blocks, in some fashion, as part of the incremental backup feature. Personally, and in discussion with David, we've generally felt like we don't want this feature until we have a way to verify the blocks that aren't being backed up every time and we are assuming are clean/correct, (at least some portion of them anyway, with a way to make sure we eventually check them all) because we are concerned that users will get bit by latent corruption and then be quite unhappy with us for not picking up on that. > > Wow. I have to admit that I feel completely opposite of that- I'd > > *love* to have an independent tool (which ideally uses the same code > > through the common library, or similar) that can be run to apply WAL. > > > > In other words, I don't agree that it's the server's problem at all to > > solve that, or, at least, I don't believe that it needs to be. 
> > I mean, I guess I'd love to have that if I could get it by waving a > magic wand, but I wouldn't love it if I had to write the code or > maintain it. The routines for applying WAL currently all assume that > you have a whole bunch of server infrastructure present; that code > wouldn't run in a frontend environment, I think. I wouldn't want to > have a second copy of every WAL apply routine that might have its own > set of bugs. I agree that we don't want to have multiple implementations or copies of the WAL apply routines. On the other hand, while I agree that there's some server infrastructure they depend on today, I feel like a lot of that infrastructure is things that we'd actually like to have in at least some of the client tools (and likely pg_basebackup specifically). I understand that it's not trivial to implement, of course, or to pull out into a common library. We are already seeing some efforts to consolidate common routines in the client libraries (Peter E's recent work around the error messaging being a good example) and I feel like that's something we should encourage and expect to see happening more in the future as we add more sophisticated client utilities. > > I've tried to outline how the incremental backup capability and backup > > management are really very closely related and having those be > > implemented by independent tools is not a good interface for our users > > to have to live with. > > I disagree. I think the "existing backup tools don't use > pg_basebackup" argument isn't very compelling, because the reason > those tools don't use pg_basebackup is because it can't do what they > need. If it did, they'd probably use it. People don't write a whole > separate engine for running backups just because it's fun to not reuse > code -- they do it because there's no other way to get what they want. I understand that you disagree but I don't clearly understand the subsequent justification for why you disagree. As I understand it, you disagree that an incremental backup capability and backup management are closely related, but that's because the existing tools don't leverage pg_basebackup (or the backup protocol), but aren't those pretty distinct things? I accept that perhaps it's my fault for implying that these topics were related in the emails I've sent, and while replying to various parts of the discussion which has traveled across a number of topics, some related and some not. I see incremental backups and backup management as related because, in part, of expiration- if you expire out a 'full' backup then you must expire out any incremental or differential backups based on it. Just generally that association of which incremental depends on which full (or prior differential, or prior incremental) is extremely important and necessary to avoid corrupt systems (consider that you might apply an incremental to a full backup, but the incremental taken was actually based on another incremental and not based on the full, or variations of that...). In short, I don't think I could confidently trust any incremental backup that's taken without having a clear link to the backup it's based on, and having it be expired when the backup it depends on is expired. > > Most of the external tools don't use pg_basebackup, nor the base backup > > protocol (or, if they do, it's only as an option among others). 
In my > > opinion, that's pretty clear indication that pg_basebackup and the base > > backup protocol aren't sufficient to cover any but the simplest of > > use-cases (though those simple use-cases are handled rather well). > > We're talking about adding on a capability that's much more complicated > > and is one that a lot of tools have already taken a stab at, let's try > > to do it in a way that those tools can leverage it and avoid having to > > implement it themselves. > > I mean, again, if it were part of pg_basebackup and available via the > replication protocol, they could do exactly that, through either > method. I don't get it. No, they can't. Today there exists *exactly* this situation: pg_basebackup uses the base backup protocol for doing backups, and the external tools don't use it. Why? Because it can't be used in a parallel manner, making it largely uninteresting as a mechanism for doing backups of systems at any scale. Yes, sure, they *could* technically use it, but from a *practical* standpoint they don't because it *sucks*. Let's not do that for incremental backups. > You seem to be arguing that we shouldn't add > the necessary capabilities to the replication protocol or > pg_basebackup, but at the same time arguing that pg_basebackup is > inadequate because it's missing important capabilities. This confuses > me. I'm sorry for not being clear. I'm not arguing that we *shouldn't* add such capabilities. I *want* these capabilities to be added, but I want them added in a way that's actually useful to the external tools and not something that only works for pg_basebackup (which is currently single-threaded). I hope that's the kind of feedback you've been looking for on this thread. > > It's an interesting idea to add in everything to pg_basebackup that > > users doing backups would like to see, but that's quite a list: > > > > - full backups > > - differential backups > > - incremental backups / block-level backups > > - (server-side) compression > > - (server-side) encryption > > - page-level checksum validation > > - calculating checksums (on the whole file) > > - External object storage (S3, et al) > > - more things... > > > > I'm really not convinced that I agree with the division of labor as > > you've outlined it, where all of the above is done by pg_basebackup, > > where just archiving and backup retention are handled by some external > > tool (except that we already have pg_receivewal, so archiving isn't > > really an externally handled thing either, unless you want features like > > parallel archive-push or parallel archive-get...). > > Yeah, if it were up to me, I'd choose put most of that in the server > and make it available via the replication protocol, and then give > pg_basebackup able to use that functionality. I'm all about that. I don't know that the client-side tool would still be called 'pg_basebackup' at that point, but I definitely want to get to a point where we have all of these capabilities available in core. > And external tools > could use that functionality via pg_basebackup or by using the > replication protocol directly. I actually don't really understand > what the alternative is. If you want server-side compression, for > example, that really has to be done on the server. And how would the > server expose that, except through the replication protocol? Sure, we > could design a new protocol for it. Call it... say... the > shmeplication protocol. 
And then you could use the replication > protocol for what it does today and the shmeplication protocol for all > the cool bits. But why would that be better? The replication protocol (or base backup protocol, really..) is what we make it, in the end. Of course server-side compression needs to be done on the server and we need a way to tell the server "please compress this for us before sending it". I'm not suggesting there's some alternative to that. What I'm suggesting is that when we go to implement the incremental backup protocol that we have a way for that to be parallelized (at least... maybe other things too) because that's what the external tools would really like. Even pg_dump works in the way that it connects and builds a list of things to run against and then farms that out to the parallel processes, so we have an example of how this is done in core today. > > What would really help me, at least, understand the idea here would be > > to understand exactly what the existing tools do that the subset of > > users you're thinking about doesn't like/want, but which pg_basebackup, > > today, does. Is the issue that there's a repository instead of just a > > plain PG directory or set of tar files, like what pg_basebackup produces > > today? But how would we do things like have compression, or encryption, > > or block-based incremental backups without some kind of repository or > > directory that doesn't actually look exactly like a PG data directory? > > I guess we're still wallowing in the same confusion here. > pg_basebackup, for me, is just a convenient place to stick this > functionality. If the server has the ability to construct and send an > incremental backup by some means, then it needs a client on the other > end to receive and store that backup, and since pg_basebackup already > knows how to do that for full backups, extending it to incremental > backups (and/or parallel, encrypted, compressed, and validated > backups) seems very natural to me. Otherwise I add server-side > functionality to allow $X and then have to write an entirely new > client to interact with that instead of just using the client I've > already got. That's more work, and I'm lazy. I'm not suggesting that we don't add this functionality to pg_basebackup, I'm just saying that we should be thinking about how the external tools will want to leverage this new capability because it's materially different from the basic minimum that pg_basebackup requires. Yes, it'd be a bit more work and a somewhat more complicated protocol than the simple approach needed by pg_basebackup, but that's what those other tools will want. If we don't care about them, ok, I get that, but I thought the idea here was to build something that's useful to both the external tools and pg_basebackup. We won't get that if we focus on just implementing a protocol for pg_basebackup to use. > Now it's true that if we wanted to build something like the rsync > protocol into PostgreSQL, jamming that into pg_basebackup might well > be a bridge too far. That would involve taking backups via a method > so different from what we're currently doing that it would probably > make sense to at least consider creating a whole new tool for that > purpose. But that wasn't my proposal... The idea around the rsync binary-diff protocol was *specifically* for things that we can't do through block-level updates with WAL scanning, just to be clear. 
I wasn't thinking that would be good for the relation files since we have more information for those in the LSN, et al. Thanks! Stephen
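To make the pg_dump comparison concrete, the client-side shape being described could be as small as the sketch below: fetch the list of files once, then farm the entries out to a pool of worker connections. The command names (LIST_MODIFIED_FILES, SEND_FILE) are invented placeholders for whatever the protocol might end up calling them, and error handling is omitted.

/*
 * Hypothetical sketch only: LIST_MODIFIED_FILES and SEND_FILE are
 * placeholder replication commands that do not exist today.  The point is
 * the dispatch pattern: build the work list once, then let N worker
 * connections pull files from it, the way pg_dump farms out its entries.
 */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>
#include <libpq-fe.h>

static void
fetch_one_file(PGconn *conn, const char *path)
{
    char        command[1024];

    /* Placeholder command; assume the server returns the file as one field. */
    snprintf(command, sizeof(command), "SEND_FILE '%s'", path);

    PGresult   *res = PQexec(conn, command);

    if (PQresultStatus(res) != PGRES_TUPLES_OK)
        fprintf(stderr, "worker: %s", PQerrorMessage(conn));
    /* ... otherwise write PQgetvalue(res, 0, 0) into the backup directory ... */
    PQclear(res);
}

int
main(void)
{
    const char *conninfo = "replication=database dbname=postgres";
    int         nworkers = 4;

    /* Step 1: one connection builds the list of files to copy. */
    PGconn     *listconn = PQconnectdb(conninfo);
    PGresult   *list = PQexec(listconn, "LIST_MODIFIED_FILES LSN '0/5000000'");
    int         nfiles = PQntuples(list);

    /* Step 2: farm the list out to nworkers child processes. */
    for (int w = 0; w < nworkers; w++)
    {
        if (fork() == 0)
        {
            PGconn     *conn = PQconnectdb(conninfo);

            for (int i = w; i < nfiles; i += nworkers)
                fetch_one_file(conn, PQgetvalue(list, i, 0));
            PQfinish(conn);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;

    PQclear(list);
    PQfinish(listconn);
    return 0;
}

Whether the workers are processes, threads, or asynchronous connections is a client implementation detail; the only thing the protocol has to allow is several connections fetching different pieces of the same backup.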
Greetings, * Andres Freund (andres@anarazel.de) wrote: > > > Wow. I have to admit that I feel completely opposite of that- I'd > > > *love* to have an independent tool (which ideally uses the same code > > > through the common library, or similar) that can be run to apply WAL. > > > > > > In other words, I don't agree that it's the server's problem at all to > > > solve that, or, at least, I don't believe that it needs to be. > > > > I mean, I guess I'd love to have that if I could get it by waving a > > magic wand, but I wouldn't love it if I had to write the code or > > maintain it. The routines for applying WAL currently all assume that > > you have a whole bunch of server infrastructure present; that code > > wouldn't run in a frontend environment, I think. I wouldn't want to > > have a second copy of every WAL apply routine that might have its own > > set of bugs. > > I'll fight tooth and nail not to have a second implementation of replay, > even if it's just portions. The code we have is complicated and fragile > enough, having a [partial] second version would be way worse. There's > already plenty improvements we need to make to speed up replay, and a > lot of them require multiple execution threads (be it processes or OS > threads), something not easily feasible in a standalone tool. And > without the already existing concurrent work during replay (primarily > checkpointer doing a lot of the necessary IO), it'd also be pretty > unattractive to use any separate tool. I agree that we don't want another implementation and that there's a lot that we want to do to improve replay performance. We've already got frontend tools which work with multiple execution threads, so I'm not sure I get the "not easily feasible" bit, and the argument about the checkpointer seems largely related to that (as in- if we didn't have multiple threads/processes then things would perform quite badly... but we can and do have multiple threads/processes in frontend tools today, even in pg_basebackup). You certainly bring up some good concerns though and they make me think of other bits that would seem like they'd possibly be larger issues for a frontend tool- like having a large pool of memory for cacheing (aka shared buffers) the changes. If what we're talking about here is *just* replay though, without having the system available for reads, I wonder if we might want a different solution there. > Unless you just define the server binary as that "independent tool". That's certainly an interesting idea. > Which I think is entirely reasonable. With the 'consistent' and LSN > recovery targets one already can get most of what's needed from such a > tool, anyway. I'd argue the biggest issue there is that there's no > equivalent to starting postgres with a private socket directory on > windows, and perhaps an option or two making it easier to start postgres > in a "private" mode for things like this. This would mean building in a way to do parallel WAL replay into the server binary though, as discussed above, and it seems like making that work in a way that allows us to still be available as a read-only standby would be quite a bit more difficult. We could possibly support parallel WAL replay only when we aren't a replica but from the same binary. The concerns mentioned about making it easier to start PG in a private mode don't seem too bad but I am not entirely sure that the tools which want to leverage that kind of capability would want to have to exec out to the PG binary to use it. 
A lot of this part of the discussion feels like a tangent though, unless I'm missing something. The "WAL compression" tool contemplated previously would be much simpler and not the full-blown WAL replay capability, which would be left to the server, unless you're suggesting that even that should be exclusively the purview of the backend? Though that ship's already sailed, given that external projects have implemented it. Having a library to provide that which external projects could leverage would be nicer than having everyone write their own version. Thanks! Stephen
On Thu, Apr 18, 2019 at 6:39 PM Stephen Frost <sfrost@snowman.net> wrote: > Where is the client going to get the threshold LSN from? > > If it doesn't have access to the old backup, then I'm a bit confused as > to how a incremental backup would be possible? Isn't that a requirement > here? I explained this in the very first email that I wrote on this thread, and then wrote a very extensive further reply on this exact topic to Peter Eisentraut. It's a bit disheartening to see you arguing against my ideas when it's not clear that you've actually read and understood them. > > The obvious way of extending this system to parallel backup is to have > > N connections each streaming a separate tarfile such that when you > > combine them all you recreate the original data directory. That would > > be perfectly compatible with what I'm proposing for incremental > > backup. Maybe you have another idea in mind, but I don't know what it > > is exactly. > > So, while that's an obvious approach, it isn't the most sensible- and > we know that from experience in actually implementing parallel backup of > PG files. I'm happy to discuss the approach we use in pgBackRest if > you'd like to discuss this further, but it seems a bit far afield from > the topic of discussion here and it seems like you're not interested or > offering to work on supporting parallel backup in core. If there's some way of modifying my proposal so that it makes life better for external backup tools, I'm certainly willing to consider that, but you're going to have to tell me what you have in mind. If that means describing what pgbackrest does, then do it. My concern here is that you seem to want a lot of complicated stuff that will require *significant* setup in order for people to be able to use it. From what I am able to gather from your remarks so far, you think people should archive their WAL to a separate machine, and then the WAL-summarizer should run there, and then data from that should be fed back to the backup client, which should then give the server a list of modified files (and presumably, someday, blocks) and the server then returns that data, which the client then cross-verifies with checksums and awesome sauce. Which is all fine, but actually requires quite a bit of set-up and quite a bit of buy-in to the tool. And I have no problem with people having that level of buy-in to the tool. EnterpriseDB offers a number of tools which require similar levels of setup and configuration, and it's not inappropriate for an enterprise-grade backup tool to have all that stuff. However, for those who may not want to do all that, my original proposal lets you take an incremental backup by doing the following list of steps: 1. Take an incremental backup. If you'd like, you can also: 0. Enable the WAL-scanning background worker to make incremental backups much faster. You do not need a WAL archive, and you do not need EITHER the backup tool or the server to have access to previous backups, and you do not need the client to have any access to archived WAL or the summary files produced from it. The only thing you need to know the start-of-backup LSN for the previous backup. I expect you to reply with a long complaint about how my proposal is totally inadequate, but actually I think for most people, most of the time, it would not only be adequate, but extremely convenient. And despite your protestations to the contrary, it does not block parallelism, checksum verification, or any other cool features that somebody may want to add later. 
It'll work just fine with those things. And for the record, I am willing to put some effort into parallelism. I just think that it makes more sense to do the incremental part first. I think that incremental backup is likely to have less effect on parallel backup than the other way around. What I'm NOT willing to do is build a whole bunch of infrastructure that will help pgbackrest do amazing things but will not provide a simple and convenient way of taking incremental backups using only core tools. I do care about having something that's good for pgbackrest and other out-of-core tools. I just care about it MUCH LESS than I care about making PostgreSQL core awesome. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
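For what it's worth, the short recipe described above could boil down to something like this on the client side, assuming purely for illustration that BASE_BACKUP grows an LSN option naming the threshold (no such option exists today); the only extra input is the start-of-backup LSN recorded for the previous backup.

/*
 * Hypothetical sketch: "LSN '1/2A000028'" stands in for the previous
 * backup's start LSN, and the extended BASE_BACKUP syntax is invented.
 * There is no dependence on a WAL archive or on the prior backup's files.
 */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("replication=true dbname=postgres");
    PGresult   *res = PQexec(conn,
                             "BASE_BACKUP LABEL 'incr1' LSN '1/2A000028'");

    if (PQresultStatus(res) == PGRES_COPY_OUT)
    {
        /*
         * ... drain the stream with PQgetCopyData(), much as for a full
         * BASE_BACKUP (which in reality returns a couple of ordinary result
         * sets before the COPY data; glossed over here) ...
         */
    }
    else
        fprintf(stderr, "BASE_BACKUP failed: %s", PQerrorMessage(conn));

    PQclear(res);
    PQfinish(conn);
    return 0;
}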
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > What I'm NOT willing to > do is build a whole bunch of infrastructure that will help pgbackrest > do amazing things but will not provide a simple and convenient way of > taking incremental backups using only core tools. I do care about > having something that's good for pgbackrest and other out-of-core > tools. I just care about it MUCH LESS than I care about making > PostgreSQL core awesome. Then I misunderstood your original proposal where you talked about providing something that the various external tools could use. If you'd like to *just* provide a mechanism for pg_basebackup to be able to do a trivial incremental backup, great, but it's not going to be useful or used by the external tools, just like the existing base backup protocol isn't used by the external tools because it can't be used in a parallel fashion. As such, and with all the other missing bits from pg_basebackup, it looks likely to me that such a feature is going to be lackluster, at best, and end up being only marginally interesting, when it could have been much more and leveraged by all of the existing tools. I agree that making a parallel-supporting protocol work is harder but I actually don't think it would be *that* much more difficult to do. That's frankly discouraging, but I'm not going to tell you where to spend your time. Making PG core awesome when it comes to backup is going to involve so much more than just marginal improvements to pg_basebackup, but it's also something that I'm very much supportive of and have invested a great deal in, by spending time and resources working to build a tool that gets closer to what an in-core solution would look like than anything that exists today. Thanks, Stephen
Hi! Sorry for the delay. > On 18 Apr 2019, at 21:56, Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Apr 17, 2019 at 5:20 PM Stephen Frost <sfrost@snowman.net> wrote: >> As I understand it, the problem is not with backing up an individual >> database or cluster, but rather dealing with backing up thousands of >> individual clusters with thousands of tables in each, leading to an >> awful lot of tables with lots of FSMs/VMs, all of which end up having to >> get copied and stored wholesale. I'll point this thread out to him and >> hopefully he'll have a chance to share more specific information. > > Sounds good. During introduction of WAL-delta backups, we faced two things: 1. Heavy spike in network load. We shift beginning of backup randomly, but variation is not very big: night is short and we want to make big backups during low rps time. This low variation of time of starts of small backups creates big network spike. 2. Incremental backups became very cheap if measured in used resources of a single cluster. 1st is not a big problem, actually, but we realized that we can do incremental backups not just at night, but, for example, 4 times a day. Or every hour. Or every minute. Why not, if they are cheap enough? Incremental backup of 1Tb DB made with distance of few minutes (small change set) is few Gbs. All of this size is made of FSM (no LSN) and VM (hard to use LSN). Sure, this overhead size is fine if we make daily backup. But at some frequency of backups it will be too much. I think that problem of incrementing FSM and VM is too distant now. But if I had to implement it right now I'd choose following way: do not backup FSM and VM, recreate it during restore. Looks like it is possible, but too much AM-specific. It is hard when you write backup tool in Go and cannot simply link with PG. > On 15 Apr 2019, at 18:01, Stephen Frost <sfrost@snowman.net> wrote: > ...the goal here > isn't actually to make pg_basebackup into an enterprise backup tool, > ... BTW, I'm all hands for extensibility and "hackability". But, personally, I'd be happy if pg_basebackup would be ubiquitous and sufficient. And tools like WAL-G and others became part of a history. There is not fundamental reason why external backup tool can be better than backup tool in core. (Unlike many PLs, data types, hooks, tuners etc) Here's 53 mentions of "parallel backup". I want to note that there may be parallel read from disk and parallel network transmission. Things between these two are neglectable and can be single-threaded. From my POV, it's not about threads, it's about saturated IO controllers. Also I think parallel restore matters more than parallel backup. Backups themself can be slow, on many clusters we even throttle disk IO. But users may want parallel backup to catch-up standby. Thanks. Best regards, Andrey Borodin.
On Sat, Apr 20, 2019 at 12:19 AM Stephen Frost <sfrost@snowman.net> wrote: > * Robert Haas (robertmhaas@gmail.com) wrote: > > What I'm NOT willing to > > do is build a whole bunch of infrastructure that will help pgbackrest > > do amazing things but will not provide a simple and convenient way of > > taking incremental backups using only core tools. I do care about > > having something that's good for pgbackrest and other out-of-core > > tools. I just care about it MUCH LESS than I care about making > > PostgreSQL core awesome. > > Then I misunderstood your original proposal where you talked about > providing something that the various external tools could use. If you'd > like to *just* provide a mechanism for pg_basebackup to be able to do a > trivial incremental backup, great, but it's not going to be useful or > used by the external tools, just like the existing base backup protocol > isn't used by the external tools because it can't be used in a parallel > fashion. Well, what I meant - and perhaps I wasn't clear enough about this - is that it could be used by an external solution for *managing* backups, not so much an external engine for *taking* backups. But actually, I really don't see any reason why the latter wouldn't also be possible. It was already suggested upthread by Anastasia that there should be a way to ask the server to give only the identity of the modified blocks without the contents of those blocks; if we provide that, then a tool can get those and do whatever it likes with them, including fetching them in parallel by some other means. Another obvious extension would be to add a command that says 'give me this file' or 'give me this file but only this list of blocks' which would give clients lots of options: they could provide their own lists of blocks to fetch computed by whatever internal magic they have, or they could request the server's modified-block map information first and then schedule fetching those blocks in parallel using this new command. So it seems like with some pretty straightforward extensions this can be made usable by and valuable to people wanting to build external backup engines, too. I do not necessarily feel obliged to implement every feature that might help with that kind of thing just because I've expressed an interest in this general area, but I might do some of them, and maybe people like you or Anastasia who want to make these facilities available to external tools can help with some of the work, too. That being said, as long as there is significant demand for value-added backup features over and above what is in core, there are probably going to be non-core backup tools that do things their own way instead of just leaning on whatever the server provides natively. In a certain sense that's regrettable, because it means that somebody - or perhaps multiple somebodys - goes to the trouble of doing something outside core and then somebody else puts something in core that obsoletes it and therein lies duplication of effort. On the other hand, it also allows people to innovate way faster than can be done in core, it allows competition among different possible designs, and it's just kinda the way we roll around here. I can't get very worked up about it. One thing I'm definitely not going to do here is abandon my goal of producing a *simple* incremental backup solution that can be deployed *easily* by users. I understand from your remarks that such a solution will not suit everybody. 
However, unlike you, I do not believe that pg_basebackup was a failure. I certainly agree that it has some limitations that mean that it is hard to use in large deployments, but it's also *extremely* convenient for people with a fairly small database when they just need a quick and easy backup. Adding some more features to it - such as incremental backup - will make it useful to more people in more cases. There will doubtless still be people who need more, and that's OK: those people can use a third-party tool. I will not get anywhere trying to solve every problem at once. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Apr 20, 2019 at 12:44 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote: > Incremental backup of 1Tb DB made with distance of few minutes (small change set) is few Gbs. All of this size is made of FSM (no LSN) and VM (hard to use LSN). > Sure, this overhead size is fine if we make daily backup. But at some frequency of backups it will be too much. It seems like if the backups are only a few minutes apart, PITR might be a better choice than super-frequent incremental backups. What do you think about that? > I think that problem of incrementing FSM and VM is too distant now. > But if I had to implement it right now I'd choose following way: do not backup FSM and VM, recreate it during restore. Looks like it is possible, but too much AM-specific. Interesting idea - that's worth some more thought. > BTW, I'm all hands for extensibility and "hackability". But, personally, I'd be happy if pg_basebackup would be ubiquitous and sufficient. And tools like WAL-G and others became part of a history. There is not fundamental reason why external backup tool can be better than backup tool in core. (Unlike many PLs, data types, hooks, tuners etc) +1 > Here's 53 mentions of "parallel backup". I want to note that there may be parallel read from disk and parallel network transmission. Things between these two are neglectable and can be single-threaded. From my POV, it's not about threads, it's about saturated IO controllers. > Also I think parallel restore matters more than parallel backup. Backups themself can be slow, on many clusters we even throttle disk IO. But users may want parallel backup to catch-up standby. I'm not sure I entirely understand your point here -- are you saying that parallel backup is important, or that it's not important, or something in between? Do you think it's more or less important than incremental backup? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Sat, Apr 20, 2019 at 12:19 AM Stephen Frost <sfrost@snowman.net> wrote: > > * Robert Haas (robertmhaas@gmail.com) wrote: > > > What I'm NOT willing to > > > do is build a whole bunch of infrastructure that will help pgbackrest > > > do amazing things but will not provide a simple and convenient way of > > > taking incremental backups using only core tools. I do care about > > > having something that's good for pgbackrest and other out-of-core > > > tools. I just care about it MUCH LESS than I care about making > > > PostgreSQL core awesome. > > > > Then I misunderstood your original proposal where you talked about > > providing something that the various external tools could use. If you'd > > like to *just* provide a mechanism for pg_basebackup to be able to do a > > trivial incremental backup, great, but it's not going to be useful or > > used by the external tools, just like the existing base backup protocol > > isn't used by the external tools because it can't be used in a parallel > > fashion. > > Well, what I meant - and perhaps I wasn't clear enough about this - is > that it could be used by an external solution for *managing* backups, > not so much an external engine for *taking* backups. But actually, I > really don't see any reason why the latter wouldn't also be possible. > It was already suggested upthread by Anastasia that there should be a > way to ask the server to give only the identity of the modified blocks > without the contents of those blocks; if we provide that, then a tool > can get those and do whatever it likes with them, including fetching > them in parallel by some other means. Another obvious extension would > be to add a command that says 'give me this file' or 'give me this > file but only this list of blocks' which would give clients lots of > options: they could provide their own lists of blocks to fetch > computed by whatever internal magic they have, or they could request > the server's modified-block map information first and then schedule > fetching those blocks in parallel using this new command. So it seems > like with some pretty straightforward extensions this can be made > usable by and valuable to people wanting to build external backup > engines, too. I do not necessarily feel obliged to implement every > feature that might help with that kind of thing just because I've > expressed an interest in this general area, but I might do some of > them, and maybe people like you or Anastasia who want to make these > facilities available to external tools can help with some of the work, > too. Yes, if we spend a bit of time thinking about how this could be implemented in a way that could be used by multiple connections concurrently then we could provide something that both pg_basebackup and the external tools could use. Getting a list first and then supporting a 'give me this file' API, or 'give me these blocks from this file' would be very similar to what many of the external tools today. I agree that I don't think it'd be hard to do. I'm suggesting that we do that instead of, at a protocol level, something similar to what was done with pg_basebackup which prevents that. I don't really agree that implementing "give me a list of files" and "give me this file" is really somehow an 'extension' to the tar-based approach that pg_basebackup uses today, it's really a rather different thing, and I mention that as a parallel (hah!) to what we're discussing here regarding the incremental backup approach. 
Having been around for a while working on backup-related things, if I was to implement the protocol for pg_basebackup today, I'd definitely implement "give me a list" and "give me this file" rather than the tar-based approach, because I've learned that people want to be able to do parallel backups and that's a decent way to do that. I wouldn't set out and implement something new that there's just no hope of making parallel. Maybe the first write of pg_basebackup would still be simple and serial since it's certainly more work to make a frontend tool like that work in parallel, but at least the protocol would be ready to support a parallel option being added later without being rewritten. And that's really what I was trying to get at here- if we've got the choice now to decide what this is going to look like from a protocol level, it'd be great if we could make it able to support being used in a parallel fashion, even if pg_basebackup is still single-threaded. > That being said, as long as there is significant demand for > value-added backup features over and above what is in core, there are > probably going to be non-core backup tools that do things their own > way instead of just leaning on whatever the server provides natively. > In a certain sense that's regrettable, because it means that somebody > - or perhaps multiple somebodys - goes to the trouble of doing > something outside core and then somebody else puts something in core > that obsoletes it and therein lies duplication of effort. On the > other hand, it also allows people to innovate way faster than can be > done in core, it allows competition among different possible designs, > and it's just kinda the way we roll around here. I can't get very > worked up about it. Yes, that's largely the tack we've taken with it- build something outside of core, where we can move a lot faster with the implementation and innovate quickly, until we get to a stable system that's as portable and in a compatible language to what's in core today. I don't have any problem with new things going into core, in fact, I'm all for it, but if someone asks me "I'd like to do this thing in core and I'd like it to be useful for external tools" then I'll do my best to share my experiences with what's been done in core vs. what's been done in this space outside of core and what some lessons learned from that have been and ways that we could at least try to make it so that external tools will be able to use whatever is implemented in core.
pg_basebackup itself could remain single-threaded and could provide exactly the same interface, no matter if the protocol is "give me all the blocks across the entire cluster as a single compressed stream" or the protocol is "give me a list of files that changed" and "give me a list of these blocks in this file" or even "give me all the blocks that changed in this file". I also don't think pg_basebackup is a failure, and I didn't mean to imply that, and I'm sorry for some of the hyperbole which led to that impression coming across. pg_basebackup is great, for what it is, and I regularly recommend it in certain use-cases as being a simple tool that does one thing and does it pretty well, for smaller clusters. The protocol it uses is unfortunately only useful in a single-threaded manner though and it'd be great if we could avoid implementing similar things in the protocol in the future. Thanks, Stephen
> On 21 Apr 2019, at 1:13, Robert Haas <robertmhaas@gmail.com> wrote: > > On Sat, Apr 20, 2019 at 12:44 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote: >> Incremental backup of 1Tb DB made with distance of few minutes (small change set) is few Gbs. All of this size is made of FSM (no LSN) and VM (hard to use LSN). >> Sure, this overhead size is fine if we make daily backup. But at some frequency of backups it will be too much. > > It seems like if the backups are only a few minutes apart, PITR might > be a better choice than super-frequent incremental backups. What do > you think about that? PITR is painfully slow on heavily loaded clusters. I observed restorations when 5 seconds of WAL were restored in 4 seconds. Backup was only few hours past primary node, but could catch up only at night. And during this process only one of 56 cpu cores was used. And SSD RAID throughput was not 100% utilized. Block level delta backups can be restored very efficiently: if we restore from newest to past steps, we write no more than cluster size at last backup. >> I think that problem of incrementing FSM and VM is too distant now. >> But if I had to implement it right now I'd choose following way: do not backup FSM and VM, recreate it during restore. Looks like it is possible, but too much AM-specific. > > Interesting idea - that's worth some more thought. Core routines to recreate VM and FSM would be cool :) But this need to be done without extra IO, not an easy trick. >> Here's 53 mentions of "parallel backup". I want to note that there may be parallel read from disk and parallel network transmission. Things between these two are neglectable and can be single-threaded. From my POV, it's not about threads, it's about saturated IO controllers. >> Also I think parallel restore matters more than parallel backup. Backups themself can be slow, on many clusters we even throttle disk IO. But users may want parallel backup to catch-up standby. > > I'm not sure I entirely understand your point here -- are you saying > that parallel backup is important, or that it's not important, or > something in between? Do you think it's more or less important than > incremental backup? I think that there is no such thing as parallel backup. Backup creation is composite process of many subprocesses. In my experience, parallel network transmission is cool and very important, it makes upload 3 times faster. But my experience is limited to cloud storages. Would this hold if storage backend is local FS? I have no idea. Parallel reading from disk has the same effect. Compression and encryption can be single threaded, I think it will not be a bottleneck (unless one uses lzma's neighborhood on Pareto frontier). For me, I think the most important thing is incremental backups (with parallel steps merge) and then parallel backup. But there is huge fraction of users, who can benefit from parallel backup and do not need incremental backup at all. Best regards, Andrey Borodin.
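To spell out the newest-to-oldest point: if the chain is applied starting from the latest incremental and working backwards, each block number is written at most once, so the total write volume is bounded by the cluster's size at the time of the last backup. A self-contained sketch of that merge, with an invented in-memory representation standing in for the on-disk partial files:

/*
 * Illustration only: merges a chain of (hypothetical) per-relation partial
 * backups, newest first, writing each block at most once.  The PartialFile
 * layout is invented for the example and is not any proposed format.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLCKSZ 8192
typedef uint32_t BlockNumber;

typedef struct
{
    int         nblocks;        /* number of blocks this backup carries */
    BlockNumber *blknos;        /* which blocks those are */
    char       *blockdata;      /* nblocks * BLCKSZ bytes, same order */
} PartialFile;

/*
 * chain[0] is the latest incremental, chain[nchain - 1] the full backup.
 * 'result' must hold relsize blocks.
 */
static void
merge_chain(PartialFile *chain, int nchain, BlockNumber relsize, char *result)
{
    bool       *written = calloc(relsize, sizeof(bool));

    for (int b = 0; b < nchain; b++)
    {
        for (int i = 0; i < chain[b].nblocks; i++)
        {
            BlockNumber blkno = chain[b].blknos[i];

            if (blkno >= relsize || written[blkno])
                continue;       /* truncated away, or a newer copy won */
            memcpy(result + (size_t) blkno * BLCKSZ,
                   chain[b].blockdata + (size_t) i * BLCKSZ, BLCKSZ);
            written[blkno] = true;
        }
    }
    free(written);
}

int
main(void)
{
    /* Toy demo: a 3-block relation, one full backup plus one incremental. */
    static char full_data[3 * BLCKSZ], incr_data[1 * BLCKSZ], result[3 * BLCKSZ];
    BlockNumber full_blknos[] = {0, 1, 2};
    BlockNumber incr_blknos[] = {1};

    memset(full_data, 'F', sizeof(full_data));
    memset(incr_data, 'I', sizeof(incr_data));

    PartialFile chain[] = {
        {1, incr_blknos, incr_data},    /* newest first */
        {3, full_blknos, full_data},
    };

    merge_chain(chain, 2, 3, result);
    printf("block 1 came from the %s backup\n",
           result[BLCKSZ] == 'I' ? "incremental" : "full");
    return 0;
}

Restoring oldest-to-newest, by contrast, can rewrite the same block once per backup in the chain, which is exactly the extra write volume being avoided here.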
On Sat, Apr 20, 2019 at 4:32 PM Stephen Frost <sfrost@snowman.net> wrote: > Having been around for a while working on backup-related things, if I > was to implement the protocol for pg_basebackup today, I'd definitely > implement "give me a list" and "give me this file" rather than the > tar-based approach, because I've learned that people want to be > able to do parallel backups and that's a decent way to do that. I > wouldn't set out and implement something new that's there's just no hope > of making parallel. Maybe the first write of pg_basebackup would still > be simple and serial since it's certainly more work to make a frontend > tool like that work in parallel, but at least the protocol would be > ready to support a parallel option being added alter without being > rewritten. > > And that's really what I was trying to get at here- if we've got the > choice now to decide what this is going to look like from a protocol > level, it'd be great if we could make it able to support being used in a > parallel fashion, even if pg_basebackup is still single-threaded. I think we're getting closer to a meeting of the minds here, but I don't think it's intrinsically necessary to rewrite the whole method of operation of pg_basebackup to implement incremental backup in a sensible way. One could instead just do a straightforward extension to the existing BASE_BACKUP command to enable incremental backup. Then, to enable parallel full backup and all sorts of out-of-core hacking, one could expand the command language to allow tools to access individual steps: START_BACKUP, SEND_FILE_LIST, SEND_FILE_CONTENTS, STOP_BACKUP, or whatever. The second thing makes for an appealing project, but I do not think there is a technical reason why it has to be done first. Or for that matter why it has to be done second. As I keep saying, incremental backup and full backup are separate projects and I believe it's completely reasonable for whoever is doing the work to decide on the order in which they would like to do the work. Having said that, I'm curious what people other than Stephen (and other pgbackrest hackers) think about the relative value of parallel backup vs. incremental backup. Stephen appears quite convinced that parallel backup is full of win and incremental backup is a bit of a yawn by comparison, and while I certainly would not want to discount the value of his experience in this area, it sometimes happens on this mailing list that [ drum roll please ] not everybody agrees about everything. So, what do other people think? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 22.04.2019 2:02, Robert Haas wrote: > On Sat, Apr 20, 2019 at 4:32 PM Stephen Frost <sfrost@snowman.net> wrote: >> Having been around for a while working on backup-related things, if I >> was to implement the protocol for pg_basebackup today, I'd definitely >> implement "give me a list" and "give me this file" rather than the >> tar-based approach, because I've learned that people want to be >> able to do parallel backups and that's a decent way to do that. I >> wouldn't set out and implement something new that's there's just no hope >> of making parallel. Maybe the first write of pg_basebackup would still >> be simple and serial since it's certainly more work to make a frontend >> tool like that work in parallel, but at least the protocol would be >> ready to support a parallel option being added alter without being >> rewritten. >> >> And that's really what I was trying to get at here- if we've got the >> choice now to decide what this is going to look like from a protocol >> level, it'd be great if we could make it able to support being used in a >> parallel fashion, even if pg_basebackup is still single-threaded. > I think we're getting closer to a meeting of the minds here, but I > don't think it's intrinsically necessary to rewrite the whole method > of operation of pg_basebackup to implement incremental backup in a > sensible way. One could instead just do a straightforward extension > to the existing BASE_BACKUP command to enable incremental backup. > Then, to enable parallel full backup and all sorts of out-of-core > hacking, one could expand the command language to allow tools to > access individual steps: START_BACKUP, SEND_FILE_LIST, > SEND_FILE_CONTENTS, STOP_BACKUP, or whatever. The second thing makes > for an appealing project, but I do not think there is a technical > reason why it has to be done first. Or for that matter why it has to > be done second. As I keep saying, incremental backup and full backup > are separate projects and I believe it's completely reasonable for > whoever is doing the work to decide on the order in which they would > like to do the work. > > Having said that, I'm curious what people other than Stephen (and > other pgbackrest hackers) think about the relative value of parallel > backup vs. incremental backup. Stephen appears quite convinced that > parallel backup is full of win and incremental backup is a bit of a > yawn by comparison, and while I certainly would not want to discount > the value of his experience in this area, it sometimes happens on this > mailing list that [ drum roll please ] not everybody agrees about > everything. So, what do other people think? > Based on the experience of pg_probackup users I can say that there is no 100% winner and depending on use case either parallel or incremental backups are preferable. - If size of database is not so large and intensity of updates is high enough, then parallel backup within one data center is definitely more efficient solution. - If size of database is very large and data is rarely updated or database is mostly append-only, then incremental backup is preferable. - Some customers need to collect at central server backups of databases installed at many nodes with slow and unreliable connection (assume DBMS installed at locomotives). Definitely parallelism can not help here, unlike support of incremental backup. - Parallel backup more aggressively consumes resources of the system, interfering with normal work of application.
So performing parallel backup may cause significant degradation of application speed. pg_probackup supports both features: parallel and incremental backups and it is up to user how to use it in more efficient way for particular configuration. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Sat, Apr 20, 2019 at 4:32 PM Stephen Frost <sfrost@snowman.net> wrote: > > Having been around for a while working on backup-related things, if I > > was to implement the protocol for pg_basebackup today, I'd definitely > > implement "give me a list" and "give me this file" rather than the > > tar-based approach, because I've learned that people want to be > > able to do parallel backups and that's a decent way to do that. I > > wouldn't set out and implement something new that's there's just no hope > > of making parallel. Maybe the first write of pg_basebackup would still > > be simple and serial since it's certainly more work to make a frontend > > tool like that work in parallel, but at least the protocol would be > > ready to support a parallel option being added alter without being > > rewritten. > > > > And that's really what I was trying to get at here- if we've got the > > choice now to decide what this is going to look like from a protocol > > level, it'd be great if we could make it able to support being used in a > > parallel fashion, even if pg_basebackup is still single-threaded. > > I think we're getting closer to a meeting of the minds here, but I > don't think it's intrinsically necessary to rewrite the whole method > of operation of pg_basebackup to implement incremental backup in a > sensible way. It wasn't my intent to imply that the whole method of operation of pg_basebackup would have to change for this. > One could instead just do a straightforward extension > to the existing BASE_BACKUP command to enable incremental backup. Ok, how do you envision that? As I mentioned up-thread, I am concerned that we're talking too high-level here and it's making the discussion more difficult than it would be if we were to put together specific ideas and then discuss them. One way I can imagine to extend BASE_BACKUP is by adding LSN as an optional parameter and then having the database server scan the entire cluster and send a tarball which contains essentially a 'diff' file of some kind for each file where we can construct a diff based on the LSN, and then the complete contents of the file for everything else that needs to be in the backup. So, sure, that would work, but it wouldn't be able to be parallelized and I don't think it'd end up being very exciting for the external tools because of that, but it would be fine for pg_basebackup. On the other hand, if you added new commands for 'list of files changed since this LSN' and 'give me this file' and 'give me this file with the changes in it since this LSN', then pg_basebackup could work with that pretty easily in a single-threaded model (maybe with two connections to the backend, but still in a single process, or maybe just by slurping up the file list and then asking for each one) and the external tools could leverage those new capabilities too for their backups, both full backups and incremental ones. This also wouldn't have to change how pg_basebackup does full backups today one bit, so what we're really talking about here is the direction to take the new code that's being written, not about rewriting existing code. I agree that it'd be a bit more work... but hopefully not *that* much more, and it would mean we could later add parallel backup to pg_basebackup more easily too, if we wanted to. 
> Then, to enable parallel full backup and all sorts of out-of-core > hacking, one could expand the command language to allow tools to > access individual steps: START_BACKUP, SEND_FILE_LIST, > SEND_FILE_CONTENTS, STOP_BACKUP, or whatever. The second thing makes > for an appealing project, but I do not think there is a technical > reason why it has to be done first. Or for that matter why it has to > be done second. As I keep saying, incremental backup and full backup > are separate projects and I believe it's completely reasonable for > whoever is doing the work to decide on the order in which they would > like to do the work. I didn't mean to imply that one had to be done before the other from a technical standpoint. I agree that they don't depend on each other. You're certainly welcome to do what you would like, I simply wanted to share my experiences and try to help move this in a direction that would involve less code rewrite in the future and to have a feature that would be more appealing to the external tools. > Having said that, I'm curious what people other than Stephen (and > other pgbackrest hackers) While David and I do talk, we haven't really discussed this proposal all that much, so please don't assume that he shares my thoughts here. I'd also like to hear what others think, particularly those who have been working in this area. > think about the relative value of parallel > backup vs. incremental backup. Stephen appears quite convinced that > parallel backup is full of win and incremental backup is a bit of a > yawn by comparison, and while I certainly would not want to discount > the value of his experience in this area, it sometimes happens on this > mailing list that [ drum roll please ] not everybody agrees about > everything. So, what do other people think? I'm afraid this is painting my position here with an extremely broad brush and so I'd like to clarify a bit: I'm *all* for incremental backups. Incremental and differential backups were supported by pgBackRest very early on and are used extensively. Today's pgBackRest does that at a file level, but I would very much like to get to a block level shortly after we finish rewriting it into C and porting it to Windows (and probably the other platforms PG runs on today), which isn't very far off now. I'd like to make sure that whatever core ends up with as an incremental backup solution also matches very closely what we do with pgBackRest too, but everything that's been discussed here seems pretty reasonable when it comes to the bits around how the blocks are detected and the files get stitched back together, so I don't expect there to be too much of an issue there. What I'm afraid will be lackluster is adding block-level incremental backup support to pg_basebackup without any support for managing backups or anything else. I'm also concerned that it's going to mean that people who want to use incremental backup with pg_basebackup are going to have to write a lot of their own management code (probably in shell scripts and such...) around that and if they get anything wrong there then people are going to end up with bad backups that they can't restore from, or they'll have corrupted clusters if they do manage to get them restored. It'd also be nice to have as much exposed through the common library as possible when it comes to, well, everything being discussed, so that the external tools could leverage that code and avoid having to write their own. 
This would probably apply more to the WAL-scanning discussion, but figured I'd mention it here too. If the protocol was implemented in a way that we could leverage it from external tools in a parallel fashion then I'd be more excited about the overall body of work, although, thinking about it a bit more, I have to admit that I'm not sure that pgBackRest would end up using it in any case, no matter how it's implemented, since it wouldn't support compression or encryption, both of which we support doing in-stream before the data leaves the server, though the external tools which don't support those options likely would find the parallel option more appealing. Thanks, Stephen
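As a concrete strawman for the per-relation 'diff' file that keeps coming up in this discussion (whether it arrives inside a BASE_BACKUP stream or via a per-file command), something like the layout below would be enough. Every name and field here is invented for illustration; the only property that matters is that block offsets are computable from the header alone, so a reader can fetch individual blocks without scanning the whole file.

/*
 * Invented layout for a hypothetical "partial" relation file: a small
 * header, then the list of block numbers it contains, then the block
 * images in the same order.  Not a committed or even proposed format.
 */
#include <stdint.h>
#include <stdio.h>

#define BLCKSZ 8192
typedef uint32_t BlockNumber;

typedef struct PartialFileHeader
{
    uint32_t    magic;          /* identifies a partial relation file */
    uint32_t    version;
    uint64_t    threshold_lsn;  /* blocks with LSN >= this were included */
    uint32_t    nblocks;        /* how many block numbers follow */
    /* BlockNumber blknos[nblocks] follows, then nblocks * BLCKSZ of data */
} PartialFileHeader;

/* Byte offset of the i-th stored block, derivable from the header alone. */
static inline uint64_t
partial_block_offset(const PartialFileHeader *hdr, uint32_t i)
{
    return sizeof(PartialFileHeader)
        + (uint64_t) hdr->nblocks * sizeof(BlockNumber)
        + (uint64_t) i * BLCKSZ;
}

int
main(void)
{
    PartialFileHeader hdr = {0x50415254, 1, 0, 100};

    printf("block slot 7 of 100 starts at byte %llu\n",
           (unsigned long long) partial_block_offset(&hdr, 7));
    return 0;
}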
Hi, On 2019-04-19 20:04:41 -0400, Stephen Frost wrote: > I agree that we don't want another implementation and that there's a lot > that we want to do to improve replay performance. We've already got > frontend tools which work with multiple execution threads, so I'm not > sure I get the "not easily feasible" bit, and the argument about the > checkpointer seems largely related to that (as in- if we didn't have > multiple threads/processes then things would perform quite badly... but > we can and do have multiple threads/processes in frontend tools today, > even in pg_basebackup). You need not just multiple execution threads, but basically a new implementation of shared buffers, locking, process monitoring, with most of the related infrastructure. You're literally talking about reimplementing a very substantial portion of the backend. I'm not sure I can transport in written words - via a public medium - how bad an idea it would be to go there. > You certainly bring up some good concerns though and they make me think > of other bits that would seem like they'd possibly be larger issues for > a frontend tool- like having a large pool of memory for cacheing (aka > shared buffers) the changes. If what we're talking about here is *just* > replay though, without having the system available for reads, I wonder > if we might want a different solution there. No. > > Which I think is entirely reasonable. With the 'consistent' and LSN > > recovery targets one already can get most of what's needed from such a > > tool, anyway. I'd argue the biggest issue there is that there's no > > equivalent to starting postgres with a private socket directory on > > windows, and perhaps an option or two making it easier to start postgres > > in a "private" mode for things like this. > > This would mean building in a way to do parallel WAL replay into the > server binary though, as discussed above, and it seems like making that > work in a way that allows us to still be available as a read-only > standby would be quite a bit more difficult. We could possibly support > parallel WAL replay only when we aren't a replica but from the same > binary. I'm doubtful that we should try to implement parallel WAL apply that can't support HS - a substantial portion of the the logic to avoid issues around relfilenode reuse, consistency etc is going to be to be necessary for non-HS aware apply anyway. But if somebody had a concrete proposal for something that's fundamentally only doable without HS, I could be convinced. > The concerns mentioned about making it easier to start PG in a > private mode don't seem too bad but I am not entirely sure that the > tools which want to leverage that kind of capability would want to have > to exec out to the PG binary to use it. Tough luck. But even leaving infeasability aside, it seems like a quite bad idea to do this in-process inside a tool that manages backup & recovery. Creating threads / sub-processes with complicated needs (like any pared down version of pg to do just recovery would have) from within a library has substantial complications. So you'd not want to do this in-process anyway. > A lot of this part of the discussion feels like a tangent though, unless > I'm missing something. I'm replying to: On 2019-04-17 18:43:10 -0400, Stephen Frost wrote: > Wow. I have to admit that I feel completely opposite of that- I'd > *love* to have an independent tool (which ideally uses the same code > through the common library, or similar) that can be run to apply WAL. 
And I'm basically saying that anything that starts from this premise is fatally flawed (in the ex falso quodlibet kind of sense ;)). > The "WAL compression" tool contemplated > previously would be much simpler and not the full-blown WAL replay > capability, which would be left to the server, unless you're suggesting > that even that should be exclusively the purview of the backend? Though > that ship's already sailed, given that external projects have > implemented it. I'm extremely doubtful of such tools (but it's not what I was responding to, see above). I'd be extremely surprised if even one of them came close to being correct. The old FPI removal tool had data corrupting bugs left and right. > Having a library to provide that which external > projects could leverage would be nicer than having everyone write their > own version. No, I don't think that's necessarily true. Something complicated that's hard to get right doesn't have to be provided by core. Even if other projects decide that their risk/reward assessment is different than core postgres'. We don't have to take on all kind of work and complexity for external tools. Greetings, Andres Freund
On Mon, Apr 22, 2019 at 1:08 PM Stephen Frost <sfrost@snowman.net> wrote: > > I think we're getting closer to a meeting of the minds here, but I > > don't think it's intrinsically necessary to rewrite the whole method > > of operation of pg_basebackup to implement incremental backup in a > > sensible way. > > It wasn't my intent to imply that the whole method of operation of > pg_basebackup would have to change for this. Cool. > > One could instead just do a straightforward extension > > to the existing BASE_BACKUP command to enable incremental backup. > > Ok, how do you envision that? As I mentioned up-thread, I am concerned > that we're talking too high-level here and it's making the discussion > more difficult than it would be if we were to put together specific > ideas and then discuss them. > > One way I can imagine to extend BASE_BACKUP is by adding LSN as an > optional parameter and then having the database server scan the entire > cluster and send a tarball which contains essentially a 'diff' file of > some kind for each file where we can construct a diff based on the LSN, > and then the complete contents of the file for everything else that > needs to be in the backup. /me scratches head. Isn't that pretty much what I described in my original post? I even described what that "'diff' file of some kind" would look like in some detail in the paragraph of that emailed numbered "2.", and I described the reasons for that choice at length in http://postgr.es/m/CA+TgmoZrqdV-tB8nY9P+1pQLqKXp5f1afghuoHh5QT6ewdkJ6g@mail.gmail.com I can't figure out how I'm managing to be so unclear about things about which I thought I'd been rather explicit. > So, sure, that would work, but it wouldn't be able to be parallelized > and I don't think it'd end up being very exciting for the external tools > because of that, but it would be fine for pg_basebackup. Stop being such a pessimist. Yes, if we only add the option to the BASE_BACKUP command, it won't directly be very exciting for external tools, but a lot of the work that is needed to do things that ARE exciting for external tools will have been done. For instance, if the work to figure out which blocks have been modified via WAL-scanning gets done, and initially that's only exposed via BASE_BACKUP, it won't be much work for somebody to write code for a new code that exposes that information directly through some new replication command. There's a difference between something that's going in the wrong direction and something that's going in the right direction but not as far or as fast as you'd like. And I'm 99% sure that everything I'm proposing here falls in the latter category rather than the former. > On the other hand, if you added new commands for 'list of files changed > since this LSN' and 'give me this file' and 'give me this file with the > changes in it since this LSN', then pg_basebackup could work with that > pretty easily in a single-threaded model (maybe with two connections to > the backend, but still in a single process, or maybe just by slurping up > the file list and then asking for each one) and the external tools could > leverage those new capabilities too for their backups, both full backups > and incremental ones. This also wouldn't have to change how > pg_basebackup does full backups today one bit, so what we're really > talking about here is the direction to take the new code that's being > written, not about rewriting existing code. I agree that it'd be a bit > more work... 
but hopefully not *that* much more, and it would mean we > could later add parallel backup to pg_basebackup more easily too, if we > wanted to. For purposes of implementing parallel pg_basebackup, it would probably be better if the server rather than the client decided which files to send via which connection. If the client decides, then every time the server finishes sending a file, the client has to request another file, and that introduces some latency: after the server finishes sending each file, it has to wait for the client to finish receiving the data, and it has to wait for the client to tell it what file to send next. If the server decides, then it can just send data at top speed without a break. So the ideal interface for pg_basebackup would really be something like: START_PARALLEL_BACKUP blah blah PARTICIPANTS 4; ...returning a cookie that can then be used by each participant as an argument to a new command: JOIN_PARALLEL_BACKUP 'cookie'; However, that is obviously extremely inconvenient for third-party tools. It's possible we need both an interface like this -- for use by parallel pg_basebackup -- and a START_BACKUP/SEND_FILE_LIST/SEND_FILE_CONTENTS/STOP_BACKUP type interface for use by external tools. On the other hand, maybe the additional overhead caused by managing the list of files to be fetched on the client side is negligible. It'd be interesting to see, though, how busy the server is when running an incremental backup managed by an external tool like BART or pgbackrest on a cluster with a gazillion little-tiny relations. I wonder if we'd find that it spends most of its time waiting for the client. > What I'm afraid will be lackluster is adding block-level incremental > backup support to pg_basebackup without any support for managing > backups or anything else. I'm also concerned that it's going to mean > that people who want to use incremental backup with pg_basebackup are > going to have to write a lot of their own management code (probably in > shell scripts and such...) around that and if they get anything wrong > there then people are going to end up with bad backups that they can't > restore from, or they'll have corrupted clusters if they do manage to > get them restored. I think that this is another complaint that basically falls into the category of saying that this proposal might not fix everything for everybody, but that complaint could be levied against any reasonable development proposal. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
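For comparison with the client-driven dispatch sketched earlier in the thread, a participant in the server-decides scheme might look like the following. START_PARALLEL_BACKUP and JOIN_PARALLEL_BACKUP are the hypothetical commands from the message above; the option syntax, cookie handling, and result shapes are all made up, and error handling is omitted.

/*
 * Hypothetical sketch of the "server decides" variant: one coordinator
 * connection starts the backup and receives a cookie, N participants join
 * with that cookie and simply consume whatever the server streams at them.
 */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>
#include <libpq-fe.h>

static void
run_participant(const char *conninfo, const char *cookie)
{
    PGconn     *conn = PQconnectdb(conninfo);
    char        command[256];
    char       *buf;
    int         len;

    snprintf(command, sizeof(command), "JOIN_PARALLEL_BACKUP '%s'", cookie);

    PGresult   *res = PQexec(conn, command);

    if (PQresultStatus(res) == PGRES_COPY_OUT)
    {
        /* The server chooses what to send next; we only write it out. */
        while ((len = PQgetCopyData(conn, &buf, 0)) > 0)
        {
            /* ... write 'len' bytes of 'buf' into the backup ... */
            PQfreemem(buf);
        }
    }
    PQclear(res);
    PQfinish(conn);
}

int
main(void)
{
    const char *conninfo = "replication=database dbname=postgres";
    int         nparticipants = 4;
    char        cookie[64] = "";

    PGconn     *coord = PQconnectdb(conninfo);
    PGresult   *res = PQexec(coord,
                             "START_PARALLEL_BACKUP LABEL 'b1' PARTICIPANTS 4");

    if (PQresultStatus(res) == PGRES_TUPLES_OK)
        snprintf(cookie, sizeof(cookie), "%s", PQgetvalue(res, 0, 0));
    PQclear(res);

    for (int i = 0; i < nparticipants; i++)
        if (fork() == 0)
        {
            run_participant(conninfo, cookie);
            _exit(0);
        }
    while (wait(NULL) > 0)
        ;

    PQfinish(coord);
    return 0;
}

The difference from the earlier sketch is that no file list ever crosses the wire: the participants never ask for anything, so the server never stalls waiting for the next request.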
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2019-04-19 20:04:41 -0400, Stephen Frost wrote: > > I agree that we don't want another implementation and that there's a lot > > that we want to do to improve replay performance. We've already got > > frontend tools which work with multiple execution threads, so I'm not > > sure I get the "not easily feasible" bit, and the argument about the > > checkpointer seems largely related to that (as in- if we didn't have > > multiple threads/processes then things would perform quite badly... but > > we can and do have multiple threads/processes in frontend tools today, > > even in pg_basebackup). > > You need not just multiple execution threads, but basically a new > implementation of shared buffers, locking, process monitoring, with most > of the related infrastructure. You're literally talking about > reimplementing a very substantial portion of the backend. I'm not sure > I can transport in written words - via a public medium - how bad an idea > it would be to go there. Yes, there'd be some need for locking and process monitoring, though if we aren't supporting ongoing read queries at the same time, there's a whole bunch of things that we don't need from the existing backend. > > > Which I think is entirely reasonable. With the 'consistent' and LSN > > > recovery targets one already can get most of what's needed from such a > > > tool, anyway. I'd argue the biggest issue there is that there's no > > > equivalent to starting postgres with a private socket directory on > > > windows, and perhaps an option or two making it easier to start postgres > > > in a "private" mode for things like this. > > > > This would mean building in a way to do parallel WAL replay into the > > server binary though, as discussed above, and it seems like making that > > work in a way that allows us to still be available as a read-only > > standby would be quite a bit more difficult. We could possibly support > > parallel WAL replay only when we aren't a replica but from the same > > binary. > > I'm doubtful that we should try to implement parallel WAL apply that > can't support HS - a substantial portion of the the logic to avoid > issues around relfilenode reuse, consistency etc is going to be to be > necessary for non-HS aware apply anyway. But if somebody had a concrete > proposal for something that's fundamentally only doable without HS, I > could be convinced. I'd certainly prefer that we support parallel WAL replay *with* HS, that just seems like a much larger problem, but I'd be quite happy to be told that it wouldn't be that much harder. > > A lot of this part of the discussion feels like a tangent though, unless > > I'm missing something. > > I'm replying to: > > On 2019-04-17 18:43:10 -0400, Stephen Frost wrote: > > Wow. I have to admit that I feel completely opposite of that- I'd > > *love* to have an independent tool (which ideally uses the same code > > through the common library, or similar) that can be run to apply WAL. > > And I'm basically saying that anything that starts from this premise is > fatally flawed (in the ex falso quodlibet kind of sense ;)). I'd just say that it'd be... difficult. :) > > The "WAL compression" tool contemplated > > previously would be much simpler and not the full-blown WAL replay > > capability, which would be left to the server, unless you're suggesting > > that even that should be exclusively the purview of the backend? Though > > that ship's already sailed, given that external projects have > > implemented it. 
> > I'm extremely doubtful of such tools (but it's not what I was responding > too, see above). I'd be extremely surprised if even one of them came > close to being correct. The old FPI removal tool had data corrupting > bugs left and right. I have concerns about it myself, which is why I'd actually really like to see something in core that does it, and does it the right way, that other projects could then leverage (ideally by just linking into the library without having to rewrite what's in core, though that might not be an option for things like WAL-G that are in Go and possibly don't want to link in some C library). > > Having a library to provide that which external > > projects could leverage would be nicer than having everyone write their > > own version. > > No, I don't think that's necessarily true. Something complicated that's > hard to get right doesn't have to be provided by core. Even if other > projects decide that their risk/reward assesment is different than core > postgres'. We don't have to take on all kind of work and complexity for > external tools. No, it doesn't have to be provided by core, but I sure would like it to be and I'd be much more comfortable if it was because then we'd also take care to not break whatever assumptions are made (or to do so in a way that can be detected and/or handled) as new code is written. As discussed above, as long as it isn't provided by core, it's not going to be trusted, likely will have bugs, and probably will be broken by things happening in core moving forward. The only option left is "well, we just won't have that capability at all". Maybe that's what you're getting at here, but not sure I agree with that as the result. Thanks, Stephen
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Mon, Apr 22, 2019 at 1:08 PM Stephen Frost <sfrost@snowman.net> wrote: > > > One could instead just do a straightforward extension > > > to the existing BASE_BACKUP command to enable incremental backup. > > > > Ok, how do you envision that? As I mentioned up-thread, I am concerned > > that we're talking too high-level here and it's making the discussion > > more difficult than it would be if we were to put together specific > > ideas and then discuss them. > > > > One way I can imagine to extend BASE_BACKUP is by adding LSN as an > > optional parameter and then having the database server scan the entire > > cluster and send a tarball which contains essentially a 'diff' file of > > some kind for each file where we can construct a diff based on the LSN, > > and then the complete contents of the file for everything else that > > needs to be in the backup. > > /me scratches head. Isn't that pretty much what I described in my > original post? I even described what that "'diff' file of some kind" > would look like in some detail in the paragraph of that emailed > numbered "2.", and I described the reasons for that choice at length > in http://postgr.es/m/CA+TgmoZrqdV-tB8nY9P+1pQLqKXp5f1afghuoHh5QT6ewdkJ6g@mail.gmail.com > > I can't figure out how I'm managing to be so unclear about things > about which I thought I'd been rather explicit. There was basically zero discussion about what things would look like at a protocol level (I went back and skimmed over the thread before sending my last email to specifically see if I was going to get this response back..). I get the idea behind the diff file, the contents of which I wasn't getting into above. > > So, sure, that would work, but it wouldn't be able to be parallelized > > and I don't think it'd end up being very exciting for the external tools > > because of that, but it would be fine for pg_basebackup. > > Stop being such a pessimist. Yes, if we only add the option to the > BASE_BACKUP command, it won't directly be very exciting for external > tools, but a lot of the work that is needed to do things that ARE > exciting for external tools will have been done. For instance, if the > work to figure out which blocks have been modified via WAL-scanning > gets done, and initially that's only exposed via BASE_BACKUP, it won't > be much work for somebody to write code for a new code that exposes > that information directly through some new replication command. > There's a difference between something that's going in the wrong > direction and something that's going in the right direction but not as > far or as fast as you'd like. And I'm 99% sure that everything I'm > proposing here falls in the latter category rather than the former. I didn't mean to imply that you're doing in the wrong direction here and I thought I said somewhere in my last email more-or-less exactly the same, that a great deal of the work needed for block-level incremental backup would be done, but specifically that this proposal wouldn't allow external tools to leverage that. It sounds like what you're suggesting now is that you're happy to implement the backend code, expose it in a way that works just for pg_basebackup, and that if someone else wants to add things to the protocol to make it easier for external tools to leverage, great. 
All I can say is that that's basically how we ended up in the situation we're in today where pg_basebackup doesn't support parallel backup but a bunch of external tools do and they don't go through the backend to get there, even though they'd probably prefer to. > > On the other hand, if you added new commands for 'list of files changed > > since this LSN' and 'give me this file' and 'give me this file with the > > changes in it since this LSN', then pg_basebackup could work with that > > pretty easily in a single-threaded model (maybe with two connections to > > the backend, but still in a single process, or maybe just by slurping up > > the file list and then asking for each one) and the external tools could > > leverage those new capabilities too for their backups, both full backups > > and incremental ones. This also wouldn't have to change how > > pg_basebackup does full backups today one bit, so what we're really > > talking about here is the direction to take the new code that's being > > written, not about rewriting existing code. I agree that it'd be a bit > > more work... but hopefully not *that* much more, and it would mean we > > could later add parallel backup to pg_basebackup more easily too, if we > > wanted to. > > For purposes of implementing parallel pg_basebackup, it would probably > be better if the server rather than the client decided which files to > send via which connection. If the client decides, then every time the > server finishes sending a file, the client has to request another > file, and that introduces some latency: after the server finishes > sending each file, it has to wait for the client to finish receiving > the data, and it has to wait for the client to tell it what file to > send next. If the server decides, then it can just send data at top > speed without a break. So the ideal interface for pg_basebackup would > really be something like: > > START_PARALLEL_BACKUP blah blah PARTICIPANTS 4; > > ...returning a cookie that can be then be used by each participant for > an argument to a new commands: > > JOIN_PARALLLEL_BACKUP 'cookie'; > > However, that is obviously extremely inconvenient for third-party > tools. It's possible we need both an interface like this -- for use > by parallel pg_basebackup -- and a > START_BACKUP/SEND_FILE_LIST/SEND_FILE_CONTENTS/STOP_BACKUP type > interface for use by external tools. On the other hand, maybe the > additional overhead caused by managing the list of files to be fetched > on the client side is negligible. It'd be interesting to see, though, > how busy the server is when running an incremental backup managed by > an external tool like BART or pgbackrest on a cluster with a gazillion > little-tiny relations. I wonder if we'd find that it spends most of > its time waiting for the client. Thanks for sharing your thoughts on that, certainly having the backend able to be more intelligent about streaming files to avoid latency is good and possibly the best approach. Another alternative to reducing the latency would be to have a way for the client to request a set of files, but I don't know that it'd be better. I'm not really sure why the above is extremely inconvenient for third-party tools, beyond just that they've already been written to work with an assumption that the server-side of things isn't as intelligent as PG is. > > What I'm afraid will be lackluster is adding block-level incremental > > backup support to pg_basebackup without any support for managing > > backups or anything else. 
I'm also concerned that it's going to mean > > that people who want to use incremental backup with pg_basebackup are > > going to have to write a lot of their own management code (probably in > > shell scripts and such...) around that and if they get anything wrong > > there then people are going to end up with bad backups that they can't > > restore from, or they'll have corrupted clusters if they do manage to > > get them restored. > > I think that this is another complaint that basically falls into the > category of saying that this proposal might not fix everything for > everybody, but that complaint could be levied against any reasonable > development proposal. I'm disappointed that the concerns about the trouble that end users are likely to have with this didn't garner more discussion. Thanks, Stephen
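For readers trying to picture what "extend BASE_BACKUP by adding LSN as an optional parameter" could mean at the protocol level, the smallest possible version is just the existing replication-command grammar with one extra clause, for example (the LSN keyword and its placement are purely illustrative; nothing of the sort has been agreed here):

    BASE_BACKUP LABEL 'nightly incremental' FAST WAL NOWAIT MAX_RATE 32768 LSN '5/A3000028';

Everything before LSN is what pg_basebackup can already send today; the hypothetical new clause would carry the threshold discussed above, with relation files containing no newer blocks coming back in the per-file 'diff' representation rather than in full.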
Hi, On 2019-04-22 14:26:40 -0400, Stephen Frost wrote: > I'm disappointed that the concerns about the trouble that end users are > likely to have with this didn't garner more discussion. My impression is that endusers are having a lot more trouble due to important backup/restore features not being in core/pg_basebackup, than due to external tools having a harder time to implement certain features. Focusing on external tools being able to provide all those features, because core hasn't yet, is imo entirely the wrong thing to concentrate upon. And it's not like things largely haven't been implemented in pg_basebackup for fundamental architectural reasons. It's because we've built like 5 different external tools with randomly differing featureset and licenses. Greetings, Andres Freund
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2019-04-22 14:26:40 -0400, Stephen Frost wrote: > > I'm disappointed that the concerns about the trouble that end users are > > likely to have with this didn't garner more discussion. > > My impression is that endusers are having a lot more trouble due to > important backup/restore features not being in core/pg_basebackup, than > due to external tools having a harder time to implement certain > features. I had been referring specifically to the concern I raised about incremental block-level backups being added to pg_basebackup and how that'll make using pg_basebackup more complicated and therefore more difficult for end-users to get right, particularly if the end user is having to handle management of the association between the full backup and the incremental backups. I wasn't referring to anything regarding external tools. > Focusing on external tools being able to provide all those > features, because core hasn't yet, is imo entirely the wrong thing to > concentrate upon. And it's not like things largely haven't been > implemented in pg_basebackup for fundamental architectural reasons. > It's because we've built like 5 different external tools with randomly > differing featureset and licenses. There's a few challenges when it comes to adding backup features to core. One of the reasons is that core naturally moves slower when it comes to development than external projects do, as was discusssed earlier on this thread. Another is that, when it comes to backup, specifically, people want to back up their *existing* systems, which means that they need a backup tool that's going to work with whatever version of PG they've currently got deployed and that's often a few years old already. Certainly when I've thought about features that we'd like to see and considered if there's something that could be implemented in core vs. implemented outside of core, the answer often ends up being "well, if we do it ourselves then we can make it work for PG 9.2 and above, and have it working for existing users, but if we work it in as part of core, it won't be available until next year and only for version 12 and above, and users can only use it once they've upgraded.." Thanks, Stephen
On Mon, Apr 22, 2019 at 2:26 PM Stephen Frost <sfrost@snowman.net> wrote: > There was basically zero discussion about what things would look like at > a protocol level (I went back and skimmed over the thread before sending > my last email to specifically see if I was going to get this response > back..). I get the idea behind the diff file, the contents of which I > wasn't getting into above. Well, I wrote: "There should be a way to tell pg_basebackup to request from the server only those blocks where LSN >= threshold_value." I guess I assumed that people interested in the details would take that to mean "and therefore the protocol would grow an option for this type of request in whatever way is the most straightforward possible extension of the current functionality," which is indeed how you eventually interpreted it when you said we could "extend BASE_BACKUP is by adding LSN as an optional parameter." I could have been more explicit, but sometimes people tell me that my emails are too long. > external tools to leverage that. It sounds like what you're suggesting > now is that you're happy to implement the backend code, expose it in a > way that works just for pg_basebackup, and that if someone else wants to > add things to the protocol to make it easier for external tools to > leverage, great. Yep, that's more or less it, although I am potentially willing to do some modest amount of that other work along the way. I just don't want to prioritize it higher than getting the actual thing I want to build built, which I think is a pretty fair position for me to take. > All I can say is that that's basically how we ended up > in the situation we're in today where pg_basebackup doesn't support > parallel backup but a bunch of external tools do and they don't go > through the backend to get there, even though they'd probably prefer to. I certainly agree that core should try to do things in a way that is useful to external tools when that can be done without undue effort, but only if it can actually be done without undue effort. Let's see whether that's the case here: - Anastasia wants a command added that dumps out whatever the server knows about what files have changed, which I already agreed was a reasonable extension of my initial proposal. - You said that for this to be useful to pgbackrest, it'd have to use a whole different mechanism that includes commands to request individual files and blocks within those files, which would be a significant rewrite of pg_basebackup that you agreed is more closely related to parallel backup than to the project under discussion on this thread. And that even then pgbackrest probably wouldn't use it because it also does server-side compression and encryption which are not included in this proposal. It seems to me that the first one falls into the category of a reasonable additional effort and the second one falls into the category of lots of extra and unrelated work that wouldn't even get used. > Thanks for sharing your thoughts on that, certainly having the backend > able to be more intelligent about streaming files to avoid latency is > good and possibly the best approach. Another alternative to reducing > the latency would be to have a way for the client to request a set of > files, but I don't know that it'd be better. I don't know either. This is an area that needs more thought, I think, although as discussed, it's more related to parallel backup than $SUBJECT.
> I'm not really sure why the above is extremely inconvenient for > third-party tools, beyond just that they've already been written to work > with an assumption that the server-side of things isn't as intelligent > as PG is. Well, one thing you might want to do is have a tool that connects to the server, enters backup mode, requests information on what blocks have changed, copies those blocks via direct filesystem access, and then exits backup mode. Such a tool would really benefit from a START_BACKUP / SEND_FILE_LIST / SEND_FILE_CONTENTS / STOP_BACKUP command language, because it would just skip ever issuing the SEND_FILE_CONTENTS command in favor of doing that part of the work via other means. On the other hand, a START_PARALLEL_BACKUP LSN '1/234' command is useless to such a tool. Contrariwise, a tool that has its own magic - perhaps based on WAL-scanning or something like ptrack - to know which files currently exist and which blocks are modified could use SEND_FILE_CONTENTS but not SEND_FILE_LIST. And a filesystem-snapshot based technique might use START_BACKUP and STOP_BACKUP but nothing else. In short, providing granular commands like this lets the client be really intelligent even if the server isn't, and lets the client have fine-grained control of the process. This is very good if you're an out-of-core tool maintainer and your tool is trying to be smarter than - or even just differently-designed than - core. But if what you really want is just a maximally-efficient parallel backup, you don't need the commands to be fine-grained like this. You don't even really *want* the commands to be fine-grained like this, because it's better if the server works it all out so as to avoid unnecessary network round-trips. You just want to tell the server "hey, I want to do a parallel backup with 5 participants - hit me!" and have it do that in the most efficient way that it knows how, without forcing the client to make any decisions that can be made just as well, and perhaps more efficiently, on the server. On the third hand, one advantage of having the fine-grained commands is that it would not only make it easier for out-of-core tools to do cool things, but also in-core tools. For instance, you can imagine being able to do something like: pg_basebackup -D outputdir -d conninfo --copy-files-from=$PGDATA If the client is using what I'm calling fine-grained commands, this is easy to implement. If it's just calling a piece of server side functionality that sends back a tarball as a blob, it's not. So each approach has some pros and cons. > I'm disappointed that the concerns about the trouble that end users are > likely to have with this didn't garner more discussion. Well, we can keep discussing things. I've tried to reply to as many of your concerns as I can, but I believe you've written more email on this thread than everyone else combined, so perhaps I haven't entirely been able to keep up. That being said, as far as I can tell, those concerns were not seconded by anyone else. Also, if I understand correctly, when I asked how we could avoid that problem, you said that you didn't know. And I said it seemed like we would need to do a very expensive operation at server startup, or magic. So I feel that perhaps it is a problem that (1) is not of great general concern and (2) to which no really superior engineering solution is possible. I may, however, be mistaken. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
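As a sketch of how the first of those hypothetical tools might drive such an interface -- all four verbs are placeholders from the discussion above, and none of the result formats has been designed -- the conversation on the replication connection might look something like:

    START_BACKUP;                        -- enter backup mode, returns the start LSN
    SEND_FILE_LIST LSN '5/A3000028';     -- which files (or blocks) changed since the threshold
    -- ... the tool copies the listed blocks itself via direct filesystem access ...
    STOP_BACKUP;                         -- leave backup mode, returns the stop LSN / required WAL

SEND_FILE_CONTENTS is deliberately never issued in this scenario, while the WAL-scanning, ptrack, and snapshot-based tools described above would each pick a different subset of the same verbs.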
22.04.2019 2:02, Robert Haas wrote: > I think we're getting closer to a meeting of the minds here, but I > don't think it's intrinsically necessary to rewrite the whole method > of operation of pg_basebackup to implement incremental backup in a > sensible way. One could instead just do a straightforward extension > to the existing BASE_BACKUP command to enable incremental backup. > Then, to enable parallel full backup and all sorts of out-of-core > hacking, one could expand the command language to allow tools to > access individual steps: START_BACKUP, SEND_FILE_LIST, > SEND_FILE_CONTENTS, STOP_BACKUP, or whatever. The second thing makes > for an appealing project, but I do not think there is a technical > reason why it has to be done first. Or for that matter why it has to > be done second. As I keep saying, incremental backup and full backup > are separate projects and I believe it's completely reasonable for > whoever is doing the work to decide on the order in which they would > like to do the work. > > Having said that, I'm curious what people other than Stephen (and > other pgbackrest hackers) think about the relative value of parallel > backup vs. incremental backup. Stephen appears quite convinced that > parallel backup is full of win and incremental backup is a bit of a > yawn by comparison, and while I certainly would not want to discount > the value of his experience in this area, it sometimes happens on this > mailing list that [ drum roll please ] not everybody agrees about > everything. So, what do other people think? > Personally, I believe that incremental backups are more useful to implement first since they benefit both backup speed and the space taken by a backup. Frankly speaking, I'm a bit surprised that the discussion of parallel backups took up so much of this thread. Of course, we must keep it in mind while designing the API, to avoid introducing any architectural obstacles, but any further discussion of parallelism belongs in another topic. I understand Stephen's concerns about the difficulties of incremental backup management. Even assuming that the user is ready to manage backup chains, retention, and other such things, we must consider a format of backup metadata that will allow us to perform some primitive commands:
1) Tell whether this backup is full or incremental.
2) Tell what backup is the parent of this incremental backup. Probably, we can limit it to just returning "start_lsn", which can later be compared to the "stop_lsn" of the parent backup.
3) Take an incremental backup based on this backup. Here we must help a backup manager retrieve the LSN to pass to pg_basebackup.
4) Restore an incremental backup into a directory (on top of an already restored full backup). One may use it to perform a "merge" or a "restore" of the incremental backup, depending on the destination directory.
I wonder if it is possible to integrate this into any existing tool, or whether we will end up with something like pg_basebackup/pg_baserestore, as in the case of pg_dump/pg_restore. Have you designed these? I may only recall "pg_combinebackup" from the very first message in this thread, which looks more like a sketch to explain the idea, rather than a thought-out feature design. I also found a page https://wiki.postgresql.org/wiki/Incremental_backup that raises the same questions. I'm volunteering to write a draft patch or, more likely, a set of patches, which will allow us to discuss the subject in more detail. And to do that I'd like us to agree on the API and data format (at least broadly).
Looking forward to hearing your thoughts. As I see it, ideally the backup management tools should concentrate on managing multiple backups, while all the logic of taking a single backup (of any kind) should be integrated into the core. It means that any out-of-core client wouldn't have to walk the PGDATA directory or carry all the Postgres-specific knowledge of data files consisting of blocks with headers and LSNs and so on. It would simply request data and get it. Understandably, it won't be implemented in one go, and more probably it is not fully reachable at all. Still, it will be great to do our best to provide such tools (both existing and future) with conveniently formatted data and an API to get it. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
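One way to ground that discussion is to sketch the kind of per-backup metadata file that would let the four primitives above be answered without opening any data files. The field names and the LSN values below are invented purely for illustration; no on-disk format has been proposed on this thread:

    BACKUP TYPE: incremental
    START WAL LOCATION: 5/B8000060
    STOP WAL LOCATION: 5/C1000138
    INCREMENTAL THRESHOLD LSN: 5/A3000028
    PARENT START WAL LOCATION: 5/A3000028

With something like this in the root of every backup directory, questions 1-3 become a matter of reading one small file, and the safety check behind question 4 -- is this really a valid parent for that incremental? -- reduces to comparing the threshold LSN against the parent's start WAL location, as discussed later in the thread.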
I hope it's alright to throw in my $0.02 as a user. I've been following this (and the other thread on reading WAL to find modified blocks, prefaulting, whatever else) since the start with great excitement and would love to see the built-in backup capabilities in Postgres greatly improved. I know this is not completely on-topic for just incremental backups, so I apologize in advance. It just seemed like the most apt place to chime in.
Just to preface where I am coming from, I have been using pgBackRest for the past couple of years and used wal-e prior to that. I am not a big *nix user other than on my servers; I do all my development on Windows and primarily use Java. The command line is not where I feel most comfortable despite my best efforts over the last 5-6 years. Prior to Postgres, I used SQL Server for quite a few years at previous companies, but had more of a junior / intermediate skill set back then. I just wanted to put that out there so you can see where my biases are.
With all that said, I would not be comfortable using pg_basebackup as my main backup tool simply because I’d have to cobble together numerous tools to get backups stored in a safe (not on the same server) location, I’d have to manage expiring backups and the WAL which is no longer needed, along with the rest of the stuff that makes these backup management tools useful.
The command line scares me, and even if I was able to get all that working, I would not feel warm and fuzzy I didn’t mess something up horribly and I may hit an edge case which destroys backups, silently corrupts data, etc.
I love that there are tools that manage all of it: backups, WAL archiving, remote storage, integration with cloud storage (S3 and the like), retention of these backups with all their dependencies, and all the restore options necessary built in as well.
Block level incremental backup would be amazing for my use case. I have small updates / deletes that happen to data all over some of my largest tables. With pgBackRest, since the diff/incremental backups are at the file level, I can have a single update / delete which touched a random spot in a table and now requires that whole 1gb file to be backed up again. That said, even if pg_basebackup was the only tool that did incremental block level backup tomorrow, I still wouldn’t start using it directly. I went into the issues I’d have to deal with if I used pg_basebackup above, and incremental backups without a management tool make me think using it correctly would be much harder.
I know this thread is just about incremental backup, and that pretty much everything in core is built up from small features into larger more complex ones. I understand that and am not trying to dump on any efforts, I am super excited to see work being done in this area! I just wanted to share my perspective on how crucial good backup management is to me (and I’m sure a few others may share my sentiment considering how popular all the external tools are).
I would never put a system in production unless I have some backup management in place. If core builds a backup management tool which uses pg_basebackup as a building block for its solution…awesome! That may be something I’d use. If pg_basebackup can be improved so it can be used as the basis that most external backup management tools can build on top of, that’s also great. All the external tools which practically every Postgres company has built show that it’s obviously a need for a lot of users. Core will never solve every single problem for all users, I know that. It would just be great to see some of the fundamental features of backup management baked into core in an extensible way.
With that, there could be a recommended way to set up backups (full/incremental, parallel, compressed), point in time recovery, backup retention, and perform restores (to a point in time, on a replica server, etc) with just the tooling within core with a nice and simple user interface, and great performance.
If those features core supports in the internal tooling are built in an extensible way (as has been discussed), there could be much less duplication of work implementing the same base features over and over for each external tool. Those companies can focus on more value-added features to their own products that core would never support, or on improving the tooling/performance/features core provides.
Well, this is way longer and a lot less coherent than I was hoping, so I apologize for that. Hopefully my stream of thoughts made a little bit of sense to someone.
-Adam
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Mon, Apr 22, 2019 at 2:26 PM Stephen Frost <sfrost@snowman.net> wrote: > > There was basically zero discussion about what things would look like at > > a protocol level (I went back and skimmed over the thread before sending > > my last email to specifically see if I was going to get this response > > back..). I get the idea behind the diff file, the contents of which I > > wasn't getting into above. > > Well, I wrote: > > "There should be a way to tell pg_basebackup to request from the > server only those blocks where LSN >= threshold_value." > > I guess I assumed that people would interested in the details take > that to mean "and therefore the protocol would grow an option for this > type of request in whatever way is the most straightforward possible > extension of the current functionality is," which is indeed how you > eventually interpreted it when you said we could "extend BASE_BACKUP > is by adding LSN as an optional parameter." Looking at it from where I'm sitting, I brought up two ways that we could extend the protocol to "request from the server only those blocks where LSN >= threshold_value" with one being the modification to BASE_BACKUP and the other being a new set of commands that could be parallelized. If I had assumed that you'd be thinking the same way I am about extending the backup protocol, I wouldn't have said anything now and then would have complained after you wrote a patch that just extended the BASE_BACKUP command, at which point I likely would have been told that it's now been done and that I should have mentioned it earlier. > > external tools to leverage that. It sounds like what you're suggesting > > now is that you're happy to implement the backend code, expose it in a > > way that works just for pg_basebackup, and that if someone else wants to > > add things to the protocol to make it easier for external tools to > > leverage, great. > > Yep, that's more or less it, although I am potentially willing to do > some modest amount of that other work along the way. I just don't > want to prioritize it higher than getting the actual thing I want to > build built, which I think is a pretty fair position for me to take. At least in part then it seems like we're viewing the level of effort around what I'm talking about quite differently, and I feel like that's largely because every time I mention parallel anything there's this assumption that I'm asking you to parallelize pg_basebackup or write a whole bunch more code to provide a fully optimized server-side parallel implementation for backups. That really wasn't what I was going for. I was thinking it would be a modest amount of additional work to add incremental backup via a few new commands, instead of through the BASE_BACKUP protocol command, that would make parallelization possible. Now, through this discussion, you've brought up some really good points about how the initial thoughts I had around how we could add some relatively simple commands, as part of this work, to make it easier for someone to later add parallel support to pg_basebackup (either full or incremental), or for external tools to leverage, might not be the best solution when it comes to having parallel backup in core, and therefore wouldn't actually end up being useful towards that end. That's certainly a fair point and possibly enough to justify not spending even the modest time I was thinking it'd need, but I'm not convinced.
Now, that said, if you are convinced that's the case, and you're doing the work, then it's certainly your prerogative to go in the direction you're convinced of. I don't mean any of this discussion to imply that I'd object to a commit that extended BASE_BACKUP in the way outlined above, but I understood the question to be "what do people think of this idea?" and to that I'm still of the opinion that spending a modest amount of time to provide a way to parallelize an incremental backup is worth it, even if it isn't optimal and isn't the direct goal of this effort. There's a tangent on all of this that's pretty key though, which is the question around just how the blocks are identified. If the WAL scanning is done to figure out the blocks, then that's quite a bit different from the other idea of "open this relation and scan it, but only give me the blocks after this LSN". It's the latter case that I've been mostly thinking about in this thread, which is part of why I was thinking it'd be a modest amount of work to have protocol commands that accepted a file (or perhaps a relation..) to scan and return blocks from instead of baking this into BASE_BACKUP which by definition just serially scans the data directory and returns things as it finds them. For the case where we have WAL scanning happening and modfiles which are being read and used to figure out the blocks to send, it seems like it might be more complicated and therefore potentially quite a bit more work to have a parallel version of that. > > All I can say is that that's basically how we ended up > > in the situation we're in today where pg_basebackup doesn't support > > parallel backup but a bunch of external tools do and they don't go > > through the backend to get there, even though they'd probably prefer to. > > I certainly agree that core should try to do things in a way that is > useful to external tools when that can be done without undue effort, > but only if it can actually be done without undo effort. Let's see > whether that's the case here: > > - Anastasia wants a command added that dumps out whatever the server > knows about what files have changed, which I already agreed was a > reasonable extension of my initial proposal. That seems like a useful thing to have, I agree. > - You said that for this to be useful to pgbackrest, it'd have to use > a whole different mechanism that includes commands to request > individual files and blocks within those files, which would be a > significant rewrite of pg_basebackup that you agreed is more closely > related to parallel backup than to the project under discussion on > this thread. And that even then pgbackrest probably wouldn't use it > because it also does server-side compression and encryption which are > not included in this proposal. Yes, having thought about it a bit more, without adding in the other features that we already support in pgBackRest, it's unlikely we'd use it in the form that I was contemplating. That said, it'd at least be closer to something we could use and adding those other features, such as compression and encryption, would almost certainly be simpler and easier if there were already protocol commands like those we discussed for parallel work. > > Thanks for sharing your thoughts on that, certainly having the backend > > able to be more intelligent about streaming files to avoid latency is > > good and possibly the best approach. 
Another alternative to reducing > > the latency would be to have a way for the client to request a set of > > files, but I don't know that it'd be better. > > I don't know either. This is an area that needs more thought, I > think, although as discussed, it's more related to parallel backup > than $SUBJECT. Yes, I agree with that. > > I'm not really sure why the above is extremely inconvenient for > > third-party tools, beyond just that they've already been written to work > > with an assumption that the server-side of things isn't as intelligent > > as PG is. > > Well, one thing you might want to do is have a tool that connects to > the server, enters backup mode, requests information on what blocks > have changed, copies those blocks via direct filesystem access, and > then exits backup mode. Such a tool would really benefit from a > START_BACKUP / SEND_FILE_LIST / SEND_FILE_CONTENTS / STOP_BACKUP > command language, because it would just skip ever issuing the > SEND_FILE_CONTENTS command in favor of doing that part of the work via > other means. On the other hand, a START_PARALLEL_BACKUP LSN '1/234' > command is useless to such a tool. That's true, but I hardly ever hear people talking about how wonderful it is that pgBackRest uses SSH to grab the data. What I hear, often, is that people would really like backups to be done over the PG protocol on the same port that replication is done on. A possible compromise is having a dedicated port for the backup agent to use, but it's definitely not the preference. > Contrariwise, a tool that has its own magic - perhaps based on > WAL-scanning or something like ptrack - to know which files currently > exist and which blocks are modified could use SEND_FILE_CONTENTS but > not SEND_FILE_LIST. And a filesystem-snapshot based technique might > use START_BACKUP and STOP_BACKUP but nothing else. > > In short, providing granular commands like this lets the client be > really intelligent even if the server isn't, and lets the client have > fine-grained control of the process. This is very good if you're an > out-of-core tool maintainer and your tool is trying to be smarter than > - or even just differently-designed than - core. > > But if what you really want is just a maximally-efficient parallel > backup, you don't need the commands to be fine-grained like this. You > don't even really *want* the commands to be fine-grained like this, > because it's better if the server works it all out so as to avoid > unnecessary network round-trips. You just want to tell the server > "hey, I want to do a parallel backup with 5 participants - hit me!" > and have it do that in the most efficient way that it knows how, > without forcing the client to make any decisions that can be made just > as well, and perhaps more efficiently, on the server. > > On the third hand, one advantage of having the fine-grained commands > is that it would not only make it easier for out-of-core tools to do > cool things, but also in-core tools. For instance, you can imagine > being able to do something like: > > pg_basebackup -D outputdir -d conninfo --copy-files-from=$PGDATA > > If the client is using what I'm calling fine-grained commands, this is > easy to implement. If it's just calling a piece of server side > functionality that sends back a tarball as a blob, it's not. > > So each approach has some pros and cons. I agree that each has some pros and cons. 
Certainly one of the big 'cons' here is that it'd be a lot more backend work to implement the 'maximally-efficient parallel backup', while the fine-grained commands wouldn't require nearly as much but would still allow a great deal of the benefit for both in-core and out-of-core tools, potentially. > > I'm disappointed that the concerns about the trouble that end users are > > likely to have with this didn't garner more discussion. > > Well, we can keep discussing things. I've tried to reply to as many > of your concerns as I can, but I believe you've written more email on > this thread than everyone else combined, so perhaps I haven't entirely > been able to keep up. > > That being said, as far as I can tell, those concerns were not > seconded by anyone else. Also, if I understand correctly, when I > asked how we could avoid that problem, you that you didn't know. And > I said it seemed like we would need to a very expensive operation at > server startup, or magic. So I feel that perhaps it is a problem that > (1) is not of great general concern and (2) to which no really > superior engineering solution is possible. The comments that Anastasia had around the issues with being able to identify the full backup that goes with a given incremental backup, et al, certainly echoed some of my concerns regarding this part of the discussion. As for the concerns about trying to avoid corruption from starting up an invalid cluster, I didn't see much discussion about the idea of some kind of cross-check between pg_control and backup_label. That was all very hand-wavy, so I'm not too surprised, but I don't think it's completely impossible to have something better than "well, if you just remove this one file, then you get a non-obviously corrupt cluster that you can happily start up". I'll certainly accept that it requires more thought though and if we're willing to continue a discussion around that, great. Thanks, Stephen
On Wed, Apr 24, 2019 at 9:28 AM Stephen Frost <sfrost@snowman.net> wrote: > Looking at it from what I'm sitting, I brought up two ways that we > could extend the protocol to "request from the server only those blocks > where LSN >= threshold_value" with one being the modification to > BASE_BACKUP and the other being a new set of commands that could be > parallelized. If I had assumed that you'd be thinking the same way I am > about extending the backup protocol, I wouldn't have said anything now > and then would have complained after you wrote a patch that just > extended the BASE_BACKUP command, at which point I likely would have > been told that it's now been done and that I should have mentioned it > earlier. Fair enough. > At least in part then it seems like we're viewing the level of effort > around what I'm talking about quite differently, and I feel like that's > largely because every time I mention parallel anything there's this > assumption that I'm asking you to parallelize pg_basebackup or write a > whole bunch more code to provide a fully optimized server-side parallel > implementation for backups. That really wasn't what I was going for. I > was thinking it would be a modest amount of additional work add > incremental backup via a few new commands, instead of through the > BASE_BACKUP protocol command, that would make parallelization possible. I'm not sure about that. It doesn't seem crazy difficult, but there are a few wrinkles. One is that if the client is requesting files one at a time, it's got to have a list of all the files that it needs to request, and that means that it has to ask the server to make a preparatory pass over the whole PGDATA directory to get a list of all the files that exist. That overhead is not otherwise needed. Another is that the list of files might be really large, and that means that the client would either use a lot of memory to hold that great big list, or need to deal with spilling the list to a spool file someplace, or else have a server protocol that lets the list be fetched incrementally in chunks. A third is that, as you mention further on, it means that the client has to care a lot more about exactly how the server is figuring out which blocks have been modified. If it just says BASE_BACKUP ..., the server can be internally reading each block and checking the LSN, or using WAL-scanning or ptrack or whatever and the client doesn't need to know or care. But if the client is asking for a list of modified files or blocks, then that presumes the information is available, and not too expensively, without actually reading the files. Fourth, MAX_RATE probably won't actually limit to the correct rate overall if the limit is applied separately to each file. I'd be afraid that a patch that tried to handle all that as part of this project would get rejected on the grounds that it was trying to solve too many unrelated problems. Also, though not everybody has to agree on what constitutes a "modest amount of additional work," I would not describe solving all of those problems as a modest effort, but rather a pretty substantial one. > There's a tangent on all of this that's pretty key though, which is the > question around just how the blocks are identified. If the WAL scanning > is done to figure out the blocks, then that's quite a bit different from > the other idea of "open this relation and scan it, but only give me the > blocks after this LSN".
It's the latter case that I've been mostly > thinking about in this thread, which is part of why I was thinking it'd > be a modest amount of work to have protocol commands that accepted a > file (or perhaps a relation..) to scan and return blocks from instead of > baking this into BASE_BACKUP which by definition just serially scans the > data directory and returns things as it finds them. For the case where > we have WAL scanning happening and modfiles which are being read and > used to figure out the blocks to send, it seems like it might be more > complicated and therefore potentially quite a bit more work to have a > parallel version of that. Yeah. I don't entirely agree that the first one is simple, as per the above, but I definitely agree that the second one is more complicated than the first one. > > Well, one thing you might want to do is have a tool that connects to > > the server, enters backup mode, requests information on what blocks > > have changed, copies those blocks via direct filesystem access, and > > then exits backup mode. Such a tool would really benefit from a > > START_BACKUP / SEND_FILE_LIST / SEND_FILE_CONTENTS / STOP_BACKUP > > command language, because it would just skip ever issuing the > > SEND_FILE_CONTENTS command in favor of doing that part of the work via > > other means. On the other hand, a START_PARALLEL_BACKUP LSN '1/234' > > command is useless to such a tool. > > That's true, but I hardly ever hear people talking about how wonderful > it is that pgBackRest uses SSH to grab the data. What I hear, often, is > that people would really like backups to be done over the PG protocol on > the same port that replication is done on. A possible compromise is > having a dedicated port for the backup agent to use, but it's definitely > not the preference. If you happen to be on the same system where the backup is running, reading straight from the data directory might be a lot faster. Otherwise, I tend to agree with you that using libpq is probably best. > I agree that each has some pros and cons. Certainly one of the big > 'cons' here is that it'd be a lot more backend work to implement the > 'maximally-efficient parallel backup', while the fine-grained commands > wouldn't require nearly as much but would still allow a great deal of > the benefit for both in-core and out-of-core tools, potentially. I agree. > The comments that Anastasia had around the issues with being able to > identify the full backup that goes with a given incremental backup, et > al, certainly echoed some my concerns regarding this part of the > discussion. > > As for the concerns about trying to avoid corruption from starting up an > invalid cluster, I didn't see much discussion about the idea of some > kind of cross-check between pg_control and backup_label. That was all > very hand-wavy, so I'm not too surprised, but I don't think it's > completely impossible to have something better than "well, if you just > remove this one file, then you get a non-obviously corrupt cluster that > you can happily start up". I'll certainly accept that it requires more > thought though and if we're willing to continue a discussion around > that, great. I think there are three different issues here that need to be considered separately. Issue #1: If you manually add files to your backup, remove files from your backup, or change files in your backup, bad things will happen. 
There is fundamentally nothing we can do to prevent this completely, but it may be possible to make the system more resilient against ham-handed modifications, at least to the extent of detecting them. That's maybe a topic for another thread, but it's an interesting one: Andres and I were brainstorming about it at some point. Issue #2: You can only restore an LSN-based incremental backup correctly if you have a base backup whose start-of-backup LSN is greater than or equal to the threshold LSN used to take the incremental backup. If #1 is not in play, this is just a simple cross-check at restoration time: retrieve the 'START WAL LOCATION' from the prior backup's backup_label file and the threshold LSN for the incremental backup from wherever you decide to store it and compare them; if they do not have the right relationship, ERROR. As to whether #1 might end up in play here, anything's possible, but wouldn't manually editing LSNs in backup metadata files be pretty obviously a bad idea? (Then again, I didn't really think the whole backup_label thing was that confusing either, and obviously I was wrong about that. Still, editing a file requires a little more work than removing it... you have to not only lie to the system, you have to decide which lie to tell!) Issue #3: Even if you clearly understand the rule articulated in #2, you might find it hard to follow in practice. If you take a full backup on Sunday and an incremental against Sunday's backup or against the previous day's backup on each subsequent day, it's not really that hard to understand. But in more complex scenarios it could be hard to get right. For example, if you've been removing your backups when they are a month old and then you start doing the same thing once you add incrementals to the picture, you might easily remove a full backup upon which a newer incremental depends. I see the need for good tools to manage this kind of complexity, but have no plan as part of this project to provide them. I think that just requires too many assumptions about where those backups are being stored and how they are being catalogued and managed; I don't believe I currently am knowledgeable enough to design something that would be good enough to meet core standards for inclusion, and I don't want to waste energy trying. If someone else wants to try, that's OK with me, but I think it's probably better to let this be a thing that people experiment with outside of core for a while until we see what ends up being a winner. I realize that this is a debatable position, but as I'm sure you realize by now, I have a strong desire to limit the scope of this project in such a way that I can get it done, 'cuz a bird in the hand is worth two in the bush. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
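The Issue #2 cross-check is mechanical enough to show in a few lines of C. In this sketch, 'START WAL LOCATION' is the field that really exists in backup_label today, while the incremental backup's metadata file and its 'INCREMENTAL FROM LSN' line are invented placeholders, since no on-disk format has been settled on this thread:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Find a line starting with "prefix" in the given file and parse the
 * "%X/%X"-style LSN that follows it. Returns 1 on success, 0 otherwise. */
static int
read_lsn(const char *path, const char *prefix, uint64_t *lsn)
{
    FILE       *fp = fopen(path, "r");
    char        line[512];
    unsigned int hi, lo;
    int         found = 0;

    if (fp == NULL)
        return 0;
    while (fgets(line, sizeof(line), fp) != NULL)
    {
        if (strncmp(line, prefix, strlen(prefix)) == 0 &&
            sscanf(line + strlen(prefix), "%X/%X", &hi, &lo) == 2)
        {
            *lsn = ((uint64_t) hi << 32) | lo;
            found = 1;
            break;
        }
    }
    fclose(fp);
    return found;
}

int
main(int argc, char **argv)
{
    uint64_t    base_start;
    uint64_t    threshold;

    if (argc != 3)
    {
        fprintf(stderr, "usage: %s <base>/backup_label <incremental>/backup_metadata\n", argv[0]);
        return 1;
    }
    if (!read_lsn(argv[1], "START WAL LOCATION: ", &base_start) ||
        !read_lsn(argv[2], "INCREMENTAL FROM LSN: ", &threshold))
    {
        fprintf(stderr, "could not find the LSNs to compare\n");
        return 1;
    }

    /* The incremental is only usable if its threshold does not lie beyond
     * the point at which the base backup started. */
    if (threshold > base_start)
    {
        fprintf(stderr, "ERROR: threshold %X/%X is newer than base backup start %X/%X\n",
                (unsigned int) (threshold >> 32), (unsigned int) threshold,
                (unsigned int) (base_start >> 32), (unsigned int) base_start);
        return 1;
    }
    printf("OK to combine: threshold %X/%X <= base start %X/%X\n",
           (unsigned int) (threshold >> 32), (unsigned int) threshold,
           (unsigned int) (base_start >> 32), (unsigned int) base_start);
    return 0;
}

A pg_combinebackup-style tool would presumably run exactly this kind of comparison across every adjacent pair in the chain before touching any files.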
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Wed, Apr 24, 2019 at 9:28 AM Stephen Frost <sfrost@snowman.net> wrote: > > At least in part then it seems like we're viewing the level of effort > > around what I'm talking about quite differently, and I feel like that's > > largely because every time I mention parallel anything there's this > > assumption that I'm asking you to parallelize pg_basebackup or write a > > whole bunch more code to provide a fully optimized server-side parallel > > implementation for backups. That really wasn't what I was going for. I > > was thinking it would be a modest amount of additional work add > > incremental backup via a few new commands, instead of through the > > BASE_BACKUP protocol command, that would make parallelization possible. > > I'm not sure about that. It doesn't seem crazy difficult, but there > are a few wrinkles. One is that if the client is requesting files one > at a time, it's got to have a list of all the files that it needs to > request, and that means that it has to ask the server to make a > preparatory pass over the whole PGDATA directory to get a list of all > the files that exist. That overhead is not otherwise needed. Another > is that the list of files might be really large, and that means that > the client would either use a lot of memory to hold that great big > list, or need to deal with spilling the list to a spool file > someplace, or else have a server protocol that lets the list be > fetched in incrementally in chunks. So, I had a thought about that when I was composing the last email and while I'm still unsure about it, maybe it'd be useful to mention it here- do we really need a list of every *file*, or could we reduce that down to a list of relations + forks for the main data directory, and then always include whatever other directories/files are appropriate? When it comes to operating in chunks, well, if we're getting a list of relations instead of files, we do have this thing called cursors.. > A third is that, as you mention > further on, it means that the client has to care a lot more about > exactly how the server is figuring out which blocks have been > modified. If it just says BASE_BACKUP ..., the server an be > internally reading each block and checking the LSN, or using > WAL-scanning or ptrack or whatever and the client doesn't need to know > or care. But if the client is asking for a list of modified files or > blocks, then that presumes the information is available, and not too > expensively, without actually reading the files. I would think the client would be able to just ask for the list of modified files, when it comes to building up the list of files to ask for, which could potentially be done based on mtime instead of by WAL scanning or by scanning the files themselves. Don't get me wrong, I'd prefer that we work based on the WAL, since I have more confidence in that, but certainly quite a few of the tools do work off mtime these days and while it's not perfect, the risk/reward there is pretty palatable to a lot of people. > Fourth, MAX_RATE > probably won't actually limit to the correct rate overall if the limit > is applied separately to each file. Sure, I hadn't been thinking about MAX_RATE and that would certainly complicate things if we're offering to provide MAX_RATE-type capabilities as part of this new set of commands. 
> I'd be afraid that a patch that tried to handle all that as part of > this project would get rejected on the grounds that it was trying to > solve too many unrelated problems. Also, though not everybody has to > agree on what constitutes a "modest amount of additional work," I > would not describe solving all of those problems as a modest effort, > but rather a pretty substantial one. I suspect some of that's driven by how they get solved and if we decide we have to solve all of them. With things like MAX_RATE + incremental backups, I wonder how that's going to end up working, when you have the option to apply the limit to the network, or to the disk I/O. You might have addressed that elsewhere, I've not looked, and I'm not too particular about it personally either, but a definition could be "max rate at which we'll read the file you asked for on this connection" and that would be pretty straight-forward, I'd think. > > > Well, one thing you might want to do is have a tool that connects to > > > the server, enters backup mode, requests information on what blocks > > > have changed, copies those blocks via direct filesystem access, and > > > then exits backup mode. Such a tool would really benefit from a > > > START_BACKUP / SEND_FILE_LIST / SEND_FILE_CONTENTS / STOP_BACKUP > > > command language, because it would just skip ever issuing the > > > SEND_FILE_CONTENTS command in favor of doing that part of the work via > > > other means. On the other hand, a START_PARALLEL_BACKUP LSN '1/234' > > > command is useless to such a tool. > > > > That's true, but I hardly ever hear people talking about how wonderful > > it is that pgBackRest uses SSH to grab the data. What I hear, often, is > > that people would really like backups to be done over the PG protocol on > > the same port that replication is done on. A possible compromise is > > having a dedicated port for the backup agent to use, but it's definitely > > not the preference. > > If you happen to be on the same system where the backup is running, > reading straight from the data directory might be a lot faster. Yes, that's certainly true. > > The comments that Anastasia had around the issues with being able to > > identify the full backup that goes with a given incremental backup, et > > al, certainly echoed some my concerns regarding this part of the > > discussion. > > > > As for the concerns about trying to avoid corruption from starting up an > > invalid cluster, I didn't see much discussion about the idea of some > > kind of cross-check between pg_control and backup_label. That was all > > very hand-wavy, so I'm not too surprised, but I don't think it's > > completely impossible to have something better than "well, if you just > > remove this one file, then you get a non-obviously corrupt cluster that > > you can happily start up". I'll certainly accept that it requires more > > thought though and if we're willing to continue a discussion around > > that, great. > > I think there are three different issues here that need to be > considered separately. > > Issue #1: If you manually add files to your backup, remove files from > your backup, or change files in your backup, bad things will happen. > There is fundamentally nothing we can do to prevent this completely, > but it may be possible to make the system more resilient against > ham-handed modifications, at least to the extent of detecting them. > That's maybe a topic for another thread, but it's an interesting one: > Andres and I were brainstorming about it at some point. 
I'd certainly be interested in hearing about ways we can improve on that. I'm alright with it being on another thread as it's a broader concern than just what we're talking about here. > Issue #2: You can only restore an LSN-based incremental backup > correctly if you have a base backup whose start-of-backup LSN is > greater than or equal to the threshold LSN used to take the > incremental backup. If #1 is not in play, this is just a simple > cross-check at restoration time: retrieve the 'START WAL LOCATION' > from the prior backup's backup_label file and the threshold LSN for > the incremental backup from wherever you decide to store it and > compare them; if they do not have the right relationship, ERROR. As > to whether #1 might end up in play here, anything's possible, but > wouldn't manually editing LSNs in backup metadata files be pretty > obviously a bad idea? (Then again, I didn't really think the whole > backup_label thing was that confusing either, and obviously I was > wrong about that. Still, editing a file requires a little more work > than removing it... you have to not only lie to the system, you have > to decide which lie to tell!) Yes, that'd certainly be at least one cross-check, but what if you've got an incremental backup based on a prior incremental backup that's based on a prior full, and you skip the incremental backup inbetween somehow? Or are we just going to state outright that we don't support incremental-on-incremental (in which case, all backups would actually be either 'full' or 'differential' in the pgBackRest parlance, anyway, and that parlance comes from my recollection of how other tools describe the different backup types, but that was from many moons ago and might be entirely wrong)? > Issue #3: Even if you clearly understand the rule articulated in #2, > you might find it hard to follow in practice. If you take a full > backup on Sunday and an incremental against Sunday's backup or against > the previous day's backup on each subsequent day, it's not really that > hard to understand. But in more complex scenarios it could be hard to > get right. For example if you've been removing your backups when they > are a month old and and then you start doing the same thing once you > add incrementals to the picture you might easily remove a full backup > upon which a newer incremental depends. I see the need for good tools > to manage this kind of complexity, but have no plan as part of this > project to provide them. I think that just requires too many > assumptions about where those backups are being stored and how they > are being catalogued and managed; I don't believe I currently am > knowledgeable enough to design something that would be good enough to > meet core standards for inclusion, and I don't want to waste energy > trying. If someone else wants to try, that's OK with me, but I think > it's probably better to let this be a thing that people experiment > with outside of core for a while until we see what ends up being a > winner. I realize that this is a debatable position, but as I'm sure > you realize by now, I have a strong desire to limit the scope of this > project in such a way that I can get it done, 'cuz a bird in the hand > is worth two in the bush. 
Even if what we're talking about here is really only "differentials", or backups where the incremental contains all the changes from a prior full backup, if the only check is "full LSN is greater than or equal to the incremental backup LSN", then you have a potential problem that's larger than just the incrementals no longer being valid because you removed the full backup on which they were taken- you might think that an *earlier* full backup is the one for a given incremental and perform a restore with the wrong full/incremental matchup and end up with a corrupted database. These are exactly the kind of issues that make me really wonder if this is the right natural progression for pg_basebackup or any backup tool to go in. Maybe there's some additional things we can do to make it harder for someone to end up with a corrupted database when they restore, but it's really hard to get things like expiration correct. We see users already ending up with problems because they don't manage expiration of their WAL correctly, and now we're adding another level of serious complication to the expiration requirements that, as we've seen even on this thread, some users are just not going to ever feel comfortable with doing on their own. Perhaps it's not relevant and I get that you want to build this cool incremental backup capability into pg_basebackup and I'm not going to stop you from doing it, but if I was going to build a backup tool, adding support for block-level incremental backup wouldn't be where I'd start, and, in fact, I might not even get to it even after investing over 5 years in the project and even after building in proper backup management. The idea of implementing block-level incrementals while pushing the backup management, expiration, and dependency between incrementals and fulls on to the user to figure out just strikes me as entirely backwards and, frankly, to be gratuitously 'itch scratching' at the expense of what users really want and need here. One of the great things about pg_basebackup is its simplicity and ability to be a one-time "give me a snapshot of the database" and this is building in a complicated feature to it that *requires* users to build their own basic capabilities externally in order to be able to use it. I've tried to avoid getting into that here and I won't go on about it, since it's your time to do with as you feel appropriate, but I do worry that it makes us, as a project, look a bit more cavalier about what users are asking for vs. what cool new thing we want to play with than I, at least, would like us to be (so, I'll caveat that with "in this area anyway", since I suspect saying this will probably come back to bite me in some other discussion later ;). Thanks, Stephen
On Wed, Apr 24, 2019 at 12:57 PM Stephen Frost <sfrost@snowman.net> wrote: > So, I had a thought about that when I was composing the last email and > while I'm still unsure about it, maybe it'd be useful to mention it > here- do we really need a list of every *file*, or could we reduce that > down to a list of relations + forks for the main data directory, and > then always include whatever other directories/files are appropriate? I'm not quite sure what the difference is here. I agree that we could try to compact the list of file names by saying 16384 (24 segments) instead of 16384, 16384.1, ..., 16384.23, but I doubt that saves anything meaningful. I don't see how we can leave anything out altogether. If there's a filename called boaty.mcboatface in the server directory, I think we've got to back it up, and that won't happen unless the client knows that it is there, and it won't know unless we include it in a list. > When it comes to operating in chunks, well, if we're getting a list of > relations instead of files, we do have this thing called cursors.. Sure... but they don't work for replication commands and I am definitely not volunteering to change that. > I would think the client would be able to just ask for the list of > modified files, when it comes to building up the list of files to ask > for, which could potentially be done based on mtime instead of by WAL > scanning or by scanning the files themselves. Don't get me wrong, I'd > prefer that we work based on the WAL, since I have more confidence in > that, but certainly quite a few of the tools do work off mtime these > days and while it's not perfect, the risk/reward there is pretty > palatable to a lot of people. That approach, as with a few others that have been suggested, requires that the client have access to the previous backup, which makes me uninterested in implementing it. I want a version of incremental backup where the client needs to know the LSN of the previous backup and nothing else. That way, if you store your actual backups on a tape drive in an airless vault at the bottom of the Pacific Ocean, you can still take incremental backup against them, as long as you remember to note the LSNs before you ship the backups to the vault. Woohoo! It also allows for the wire protocol to be very simple and the client to be very simple; neither of those things is essential, but both are nice. Also, I think using mtimes is just asking to get burned. Yeah, almost nobody will, but an LSN-based approach is more granular (block level) and more reliable (can't be fooled by resetting a clock backward, or by a filesystem being careless with file metadata), so I think it makes sense to focus on getting that to work. It's worth keeping in mind that there may be somewhat different expectations for an external tool vs. a core feature. Stupid as it may sound, I think people using an external tool are more likely to do things read the directions, and those directions can say things like "use a reasonable filesystem and don't set your clock backward." When stuff goes into core, people assume that they should be able to run it on any filesystem on any hardware where they can get it to work and it should just work. And you also get a lot more users, so even if the percentage of people not reading the directions were to stay constant, the actual number of such people will go up a lot. So picking what we seem to both agree to be the most robust way of detecting changes seems like the way to go from here. 
> I suspect some of that's driven by how they get solved and if we decide > we have to solve all of them. With things like MAX_RATE + incremental > backups, I wonder how that's going to end up working, when you have the > option to apply the limit to the network, or to the disk I/O. You might > have addressed that elsewhere, I've not looked, and I'm not too > particular about it personally either, but a definition could be "max > rate at which we'll read the file you asked for on this connection" and > that would be pretty straight-forward, I'd think. I mean, it's just so people can tell pg_basebackup what rate they want via a command-line option and have it happen like that. They don't care about the rates for individual files. > > Issue #1: If you manually add files to your backup, remove files from > > your backup, or change files in your backup, bad things will happen. > > There is fundamentally nothing we can do to prevent this completely, > > but it may be possible to make the system more resilient against > > ham-handed modifications, at least to the extent of detecting them. > > That's maybe a topic for another thread, but it's an interesting one: > > Andres and I were brainstorming about it at some point. > > I'd certainly be interested in hearing about ways we can improve on > that. I'm alright with it being on another thread as it's a broader > concern than just what we're talking about here. Might be a good topic to chat about at PGCon. > > Issue #2: You can only restore an LSN-based incremental backup > > correctly if you have a base backup whose start-of-backup LSN is > > greater than or equal to the threshold LSN used to take the > > incremental backup. If #1 is not in play, this is just a simple > > cross-check at restoration time: retrieve the 'START WAL LOCATION' > > from the prior backup's backup_label file and the threshold LSN for > > the incremental backup from wherever you decide to store it and > > compare them; if they do not have the right relationship, ERROR. As > > to whether #1 might end up in play here, anything's possible, but > > wouldn't manually editing LSNs in backup metadata files be pretty > > obviously a bad idea? (Then again, I didn't really think the whole > > backup_label thing was that confusing either, and obviously I was > > wrong about that. Still, editing a file requires a little more work > > than removing it... you have to not only lie to the system, you have > > to decide which lie to tell!) > > Yes, that'd certainly be at least one cross-check, but what if you've > got an incremental backup based on a prior incremental backup that's > based on a prior full, and you skip the incremental backup inbetween > somehow? Or are we just going to state outright that we don't support > incremental-on-incremental (in which case, all backups would actually be > either 'full' or 'differential' in the pgBackRest parlance, anyway, and > that parlance comes from my recollection of how other tools describe the > different backup types, but that was from many moons ago and might be > entirely wrong)? I have every intention of supporting that case, just as I described in my original email, and the algorithm that I just described handles it. You just have to repeat the checks for every backup in the chain. If you have a backup A, and a backup B intended as an incremental vs. A, and a backup C intended as an incremental vs. B, then the threshold LSN for C is presumably the starting LSN for B, and the threshold LSN for B is presumably the starting LSN for A. 
If you try to restore A-B-C you'll check C vs. B and find that all is well and similarly for B vs. A. If you try to restore A-C, you'll find out that A's start LSN precedes C's threshold LSN and error out. > Even if what we're talking about here is really only "differentials", or > backups where the incremental contains all the changes from a prior full > backup, if the only check is "full LSN is greater than or equal to the > incremental backup LSN", then you have a potential problem that's larger > than just the incrementals no longer being valid because you removed the > full backup on which they were taken- you might think that an *earlier* > full backup is the one for a given incremental and perform a restore > with the wrong full/incremental matchup and end up with a corrupted > database. No, the proposed check is explicitly designed to prevent that. You'd get a restore failure (which is not great either, of course). > management. The idea of implementing block-level incrementals while > pushing the backup management, expiration, and dependency between > incrementals and fulls on to the user to figure out just strikes me as > entirely backwards and, frankly, to be gratuitously 'itch scratching' at > the expense of what users really want and need here. Well, not everybody needs or wants the same thing. I wouldn't be proposing it if my employer didn't think it was gonna solve a real problem... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
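To make that chain cross-check concrete, here is a minimal sketch of the validation, with the chain ordered oldest first. The struct and function names are illustrative only and are not taken from any posted patch.

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;        /* same 64-bit representation PostgreSQL uses */

typedef struct BackupInfo
{
    XLogRecPtr  start_lsn;          /* 'START WAL LOCATION' from backup_label */
    XLogRecPtr  threshold_lsn;      /* 0 (invalid) for a full backup */
} BackupInfo;

/*
 * 'chain' is ordered oldest first: chain[0] is the full backup, and each later
 * entry is an incremental taken against the one before it.  Returns false if
 * any incremental's threshold LSN is newer than the start LSN of the backup it
 * is supposed to sit on top of; the caller should ERROR in that case.
 */
static bool
backup_chain_is_restorable(const BackupInfo *chain, int nbackups)
{
    for (int i = 1; i < nbackups; i++)
    {
        if (chain[i].threshold_lsn > chain[i - 1].start_lsn)
            return false;
    }
    return true;
}

With this check, restoring A-B-C passes (each threshold matches the previous start LSN), while restoring A-C fails because C's threshold LSN is newer than A's start LSN.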
23.04.2019 14:08, Anastasia Lubennikova wrote:
> I'm volunteering to write a draft patch or, more likely, set of
> patches, which
> will allow us to discuss the subject in more detail.
> And to do that I wish we agree on the API and data format (at least
> broadly).
> Looking forward to hearing your thoughts.
Though the previous discussion stalled, I still hope that we could agree on basic points such as a map file format and protocol extension, which is necessary to start implementing the feature.

--------- Proof Of Concept patch ---------

In attachments, you can find a prototype of incremental pg_basebackup, which consists of 2 features:

1) To perform incremental backup one should call pg_basebackup with a new argument:

pg_basebackup -D 'basedir' --prev-backup-start-lsn 'lsn'

where lsn is the start_lsn of the parent backup (it can be found in the "backup_label" file).

It calls the BASE_BACKUP replication command with a new argument PREV_BACKUP_START_LSN 'lsn'. For datafiles, only pages with LSN > prev_backup_start_lsn will be included in the backup. They are saved into a 'filename.partial' file, while a 'filename.blockmap' file contains an array of BlockNumbers. For example, if we backed up blocks 1, 3, and 5, filename.partial will contain those 3 blocks, and 'filename.blockmap' will contain the array {1,3,5}. Non-datafiles use the same format as before.

2) To merge an incremental backup into a full backup call

pg_basebackup -D 'basedir' --incremental-pgdata 'incremental_basedir' --merge-backups

It will move all files from 'incremental_basedir' to 'basedir', handling '.partial' files correctly.

--------- Questions to discuss ---------

Please note that it is just a proof-of-concept patch and it can be optimized in many ways. Let's concentrate on issues that affect the protocol or data format.

1) Whether we collect block maps using the simple "read everything page by page" approach, WAL scanning, or any other page tracking algorithm, we must choose a map format. I implemented the simplest one, while there are more ideas:

- We can have a map not per file, but per relation or maybe per tablespace, which will make the implementation more complex, but probably more optimal. The only problem I see with the existing implementation is that even if only a few blocks changed, we still must pad the map to 512 bytes per tar format requirements.

- We can save LSNs into the block map:

typedef struct BlockMapItem {
    BlockNumber blkno;
    XLogRecPtr lsn;
} BlockMapItem;

In my implementation, an invalid prev_backup_start_lsn means falling back to a regular basebackup without any block maps. Alternatively, we can define another meaning for this value and send a block map for all files. Backup utilities can use these maps to speed up backup merge or restore.

2) We can implement a BASE_BACKUP SEND_FILELIST replication command, which will return a list of filenames with file sizes, and block maps if an lsn was provided. To avoid changing the format, we can simply send tar headers for each file:

- tarHeader("filename.blockmap") followed by the blockmap, for relation files, if prev_backup_start_lsn is provided;
- tarHeader("filename") without actual file content, for non-relation files or for all files in a "FULL" backup.

The caller can parse these messages and use them for any purpose, for example, to perform a parallel backup.

Thoughts?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
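As a side note on how the two-file layout above would be consumed, a minimal sketch of locating one block inside 'filename.partial' using the array read from 'filename.blockmap' could look like this, assuming blocks are stored back to back in map order. The names and constants here are illustrative and are not taken from the PoC patch.

#include <sys/types.h>

#define BLCKSZ 8192                 /* PostgreSQL default block size */
typedef unsigned int BlockNumber;

/*
 * Given the BlockNumber array read from 'filename.blockmap', return the byte
 * offset of 'target' inside 'filename.partial', or -1 if the block was not
 * backed up (i.e. it did not change since the threshold LSN).
 */
static off_t
partial_block_offset(const BlockNumber *blockmap, int nblocks,
                     BlockNumber target)
{
    for (int i = 0; i < nblocks; i++)
    {
        if (blockmap[i] == target)
            return (off_t) i * BLCKSZ;
    }
    return (off_t) -1;
}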
23.04.2019 14:08, Anastasia Lubennikova wrote:
> I'm volunteering to write a draft patch or, more likely, set of
> patches, which
> will allow us to discuss the subject in more detail.
> And to do that I wish we agree on the API and data format (at least
> broadly).
> Looking forward to hearing your thoughts.
> Though the previous discussion stalled,
> I still hope that we could agree on basic points such as a map file
> format and protocol extension,
> which is necessary to start implementing the feature.
It's great that you too came up with a PoC patch. I didn't look at your changes in much detail, but we at EnterpriseDB are also working on this feature and have started implementing it.
Attached is the series of patches I have so far (which need further optimization and adjustment, though).
Here is the overall design (as proposed by Robert) we are trying to implement:
1. Extend the BASE_BACKUP command that can be used with replication connections. Add a new [ LSN 'lsn' ] option.
2. Extend pg_basebackup with a new --lsn=LSN option that causes it to send the option added to the server in #1.
Here are the implementation details when we have a valid LSN
sendFile() in basebackup.c is the function which mostly does the thing for us. If the filename looks like a relation file, then we'll need to consider sending only a partial file. The way to do that is probably:
A. Read the whole file into memory.
B. Check the LSN of each block. Build a bitmap indicating which blocks have an LSN greater than or equal to the threshold LSN.
C. If more than 90% of the bits in the bitmap are set, send the whole file just as if this were a full backup. This 90% is a constant now; we might make it a GUC later.
D. Otherwise, send a file with .partial added to the name. The .partial file contains an indication of which blocks were changed at the beginning, followed by the data blocks. It also includes a checksum/CRC.
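To make steps B and C a bit more concrete before getting to the on-disk format below, here is a rough sketch, not lifted from the attached patches, of how the per-block LSN check and the 90% decision could look. It relies on the page LSN sitting at the very start of each page header, stored as two 32-bit halves; the function names and bitmap representation are assumptions for illustration.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

typedef uint64_t XLogRecPtr;

/* The page LSN is the first field of every page header (two 32-bit halves). */
static XLogRecPtr
page_lsn(const char *page)
{
    uint32_t    xlogid;
    uint32_t    xrecoff;

    memcpy(&xlogid, page, sizeof(uint32_t));
    memcpy(&xrecoff, page + sizeof(uint32_t), sizeof(uint32_t));
    return ((XLogRecPtr) xlogid << 32) | xrecoff;
}

/*
 * 'buf' holds the whole segment read in step A; nblocks is its length in
 * BLCKSZ units.  Sets one bit per changed block in 'bitmap' and returns true
 * if more than 90% of the blocks changed, i.e. the whole file should be sent
 * instead of a .partial file.
 */
static bool
build_changed_block_bitmap(const char *buf, uint32_t nblocks,
                           XLogRecPtr threshold_lsn, unsigned char *bitmap)
{
    uint32_t    nchanged = 0;

    for (uint32_t blkno = 0; blkno < nblocks; blkno++)
    {
        if (page_lsn(buf + (size_t) blkno * BLCKSZ) >= threshold_lsn)
        {
            bitmap[blkno / 8] |= (unsigned char) (1 << (blkno % 8));
            nchanged++;
        }
    }
    return nchanged * 10 > (uint64_t) nblocks * 9;
}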
Currently, a .partial file format looks like:
- start with a 4-byte magic number
- then store a 4-byte CRC covering the header
- then a 4-byte count of the number of blocks included in the file
- then the block numbers, each as a 4-byte quantity
- then the data blocks
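Taken together, the layout above means the header is fixed-size apart from the block-number array, so the offset of any stored block can be computed without reading the data itself. A minimal sketch of that arithmetic follows; the struct definition, field names, and magic semantics are illustrative, not the patch's actual definitions.

#include <stdint.h>

#define BLCKSZ 8192

typedef struct PartialFileHeader
{
    uint32_t    magic;          /* identifies a .partial file */
    uint32_t    header_crc;     /* CRC covering the header fields */
    uint32_t    nblocks;        /* number of data blocks that follow */
    /* followed by nblocks uint32_t block numbers, then the data blocks */
} PartialFileHeader;

/* Byte offset of the i-th stored data block within the .partial file. */
static inline uint64_t
partial_data_offset(uint32_t nblocks, uint32_t i)
{
    uint64_t    header_bytes =
        sizeof(PartialFileHeader) + (uint64_t) nblocks * sizeof(uint32_t);

    return header_bytes + (uint64_t) i * BLCKSZ;
}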
We are also working on combining these incremental backups with the full backup, and for that we are planning to add a new utility called pg_combinebackup. We will post the details on that later, once we are on the same page about taking the backup.
Thanks
Technical Architect, Product Development
EnterpriseDB Corporation
Hi Anastasia,
For combining a full backup with one or more incremental backup, we are adding
a new utility called pg_combinebackup in src/bin.
Here is the overall design as proposed by Robert.
pg_combinebackup starts from the LAST backup specified and works backward. It
must NOT start with the full backup and work forward. This is important both
for reasons of efficiency and of correctness. For example, if you start by
copying over the full backup and then later apply the incremental backups on
top of it then you'll copy data and later end up overwriting it or removing
it. Any files that are leftover at the end that aren't in the final
incremental backup even as .partial files need to be removed, or the result is
wrong. We should aim for a system where every block in the output directory is
written exactly once and nothing ever has to be created and then removed.
To make that work, we should start by examining the final incremental backup.
We should proceed with one file at a time. For each file:
1. If the complete file is present in the incremental backup, then just copy it
to the output directory - and move on to the next file.
2. Otherwise, we have a .partial file. Work backward through the backup chain
until we find a complete version of the file. That might happen when we get
back to the full backup at the start of the chain, but it might also happen
sooner - at which point we do not need to and should not look at earlier
backups for that file. During this phase, we should read only the HEADER of
each .partial file, building a map of which blocks we're ultimately going to
need to read from each backup. We can also compute the offset within each file
where that block is stored at this stage, again using the header information.
3. Now, we can write the output file - reading each block in turn from the
correct backup and writing it to the output file, using the map we
constructed in the previous step. We should probably keep all of the input
files open over steps 2 and 3 and then close them at the end because
repeatedly closing and opening them is going to be expensive. When that's done,
go on to the next file and start over at step 1.
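A rough sketch of step 2 for a single relation file, using the .partial layout described earlier in the thread (magic, header CRC, block count, block numbers, then the data). All of the names here are illustrative; this is not the actual pg_combinebackup code.

#include <stdint.h>
#include <stdio.h>

#define BLCKSZ 8192

typedef struct BlockSource
{
    int         backup_idx;     /* which backup in the chain supplies it */
    uint64_t    offset;         /* byte offset within that backup's file */
    int         valid;
} BlockSource;

/*
 * 'partials' holds one open FILE * per backup, newest first; a NULL entry
 * means that backup has the complete (non-.partial) file, which ends the
 * walk.  Returns the index of the backup holding the complete base copy, or
 * -1 on a malformed chain.  Blocks still unmapped after the walk come from
 * that complete file.
 */
static int
build_block_map(FILE **partials, int nbackups,
                BlockSource *map, uint32_t relation_nblocks)
{
    for (int b = 0; b < nbackups; b++)
    {
        uint32_t    hdr[3];     /* magic, header CRC, block count */

        if (partials[b] == NULL)
            return b;           /* complete file found; older backups unused */

        if (fread(hdr, sizeof(uint32_t), 3, partials[b]) != 3)
            return -1;          /* truncated header */

        uint32_t    nblocks = hdr[2];
        uint64_t    data_start = 3 * sizeof(uint32_t)
            + (uint64_t) nblocks * sizeof(uint32_t);

        for (uint32_t i = 0; i < nblocks; i++)
        {
            uint32_t    blkno;

            if (fread(&blkno, sizeof(uint32_t), 1, partials[b]) != 1)
                return -1;
            /* a newer backup already supplies this block; keep that copy */
            if (blkno < relation_nblocks && !map[blkno].valid)
            {
                map[blkno].backup_idx = b;
                map[blkno].offset = data_start + (uint64_t) i * BLCKSZ;
                map[blkno].valid = 1;
            }
        }
    }
    return -1;
}

Step 3 then walks the map block by block, seeking into the indicated backup's file and writing each block exactly once to the output file.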
We have already started working on this design.
--
Technical Architect, Product Development
EnterpriseDB Corporation
On Wed, Jul 17, 2019 at 2:15 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
> At what stage you will apply the WAL generated in between the START/STOP backup.
In this design, we are not touching any WAL-related code. The WAL files will
get copied with each backup, either full or incremental. And thus, the last
incremental backup will have the final WAL files, which will be copied as-is
into the combined full backup, and they will get applied automatically if that
data directory is used to start the server.
--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
On Wed, Jul 17, 2019 at 7:38 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
> Ok, so you keep all the WAL files since the first backup, right?
The WAL files will anyway be copied while taking a backup (full or incremental),
but only the last incremental backup's WAL files are copied to the combined
synthetic full backup.
--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
Hi Jeevan,
The idea is very nice.
When insert/update/delete and truncate/drop happen in various
combinations, how does the incremental backup handle the copying of the
blocks?
--
Regards,
vignesh
EnterpriseDB: http://www.enterprisedb.com
Thanks Jeevan. 1) If relation file has changed due to truncate or vacuum. During incremental backup the new files will be copied. There are chances that both the old file and new file will be present. I'm not sure if cleaning up of the old file is handled. 2) Just a small thought on building the bitmap, can the bitmap be built and maintained as and when the changes are happening in the system. If we are building the bitmap while doing the incremental backup, Scanning through each file might take more time. This can be a configurable parameter, the system can run without capturing this information by default, but if there are some of them who will be taking incremental backup frequently this configuration can be enabled which should track the modified blocks. What is your thought on this? -- Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com On Tue, Jul 23, 2019 at 11:19 PM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote: > > Hi Vignesh, > > This backup technology is extending the pg_basebackup itself, which means we can > still take online backups. This is internally done using pg_start_backup and > pg_stop_backup. pg_start_backup performs a checkpoint, and this checkpoint is > used in the recovery process while starting the cluster from a backup image. What > incremental backup will just modify (as compared to traditional pg_basebackup) > is - After doing the checkpoint, instead of copying the entire relation files, > it takes an input LSN and scan all the blocks in all relation files, and store > the blocks having LSN >= InputLSN. This means it considers all the changes > that are already written into relation files including insert/update/delete etc > up to the checkpoint performed by pg_start_backup internally, and as Jeevan Chalke > mentioned upthread the incremental backup will also contain copy of WAL files. > Once this incremental backup is combined with the parent backup by means of new > combine process (that will be introduced as part of this feature itself) should > ideally look like a full pg_basebackup. Note that any changes done by these > insert/delete/update operations while the incremental backup was being taken > will be still available via WAL files and as normal restore process, will be > replayed from the checkpoint onwards up to a consistent point. > > My two cents! > > Regards, > Jeevan Ladhe > > On Sat, Jul 20, 2019 at 11:22 PM vignesh C <vignesh21@gmail.com> wrote: >> >> Hi Jeevan, >> >> The idea is very nice. >> When Insert/update/delete and truncate/drop happens at various >> combinations, How the incremental backup handles the copying of the >> blocks? >> >> >> On Wed, Jul 17, 2019 at 8:12 PM Jeevan Chalke >> <jeevan.chalke@enterprisedb.com> wrote: >> > >> > >> > >> > On Wed, Jul 17, 2019 at 7:38 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote: >> >> >> >> >> >> >> >> On Wed, Jul 17, 2019 at 6:43 PM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote: >> >>> >> >>> On Wed, Jul 17, 2019 at 2:15 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote: >> >>>> >> >>>> >> >>>> At what stage you will apply the WAL generated in between the START/STOP backup. >> >>> >> >>> >> >>> In this design, we are not touching any WAL related code. The WAL files will >> >>> get copied with each backup either full or incremental. And thus, the last >> >>> incremental backup will have the final WAL files which will be copied as-is >> >>> in the combined full-backup and they will get apply automatically if that >> >>> the data directory is used to start the server. 
>> >> >> >> >> >> Ok, so you keep all the WAL files since the first backup, right? >> > >> > >> > The WAL files will anyway be copied while taking a backup (full or incremental), >> > but only last incremental backup's WAL files are copied to the combined >> > synthetic full backup. >> > >> >>> >> >>>> >> >>>> -- >> >>>> Ibrar Ahmed >> >>> >> >>> >> >>> -- >> >>> Jeevan Chalke >> >>> Technical Architect, Product Development >> >>> EnterpriseDB Corporation >> >>> >> >> >> >> >> >> -- >> >> Ibrar Ahmed >> > >> > >> > >> > -- >> > Jeevan Chalke >> > Technical Architect, Product Development >> > EnterpriseDB Corporation >> > >> >> >> -- >> Regards, >> vignesh >> >> >>
Hi Vignesh,
Please find my comments inline below:
> 1) If relation file has changed due to truncate or vacuum.
> During incremental backup the new files will be copied.
> There are chances that both the old file and new file
> will be present. I'm not sure if cleaning up of the
> old file is handled.
When an incremental backup is taken, it either copies the file in its entirety if a file is changed more than 90%, or writes a .partial file with the changed-block bitmap and the actual data. For the files that are unchanged, it writes 0 bytes but still creates a .partial file, so there is a .partial file for every file that is to be looked up in the full backup. While composing a synthetic backup from an incremental backup, the pg_combinebackup tool will only look for those relation files in the full (parent) backup which have .partial files in the incremental backup. So, if a vacuum/truncate happened between the full and the incremental backup, the incremental backup image will not have a 0-length .partial file for that relation, and so the synthetic backup that is restored using pg_combinebackup will not have that file as well.
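As an illustration of that convention (purely hypothetical code, not from any of the posted patches; copy_whole_file() and merge_partial_file() are assumed helpers), the combine step could walk the incremental directory and resolve only the names that carry a .partial suffix, so a relation dropped before the incremental simply never reaches the output:

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helpers, assumed to exist elsewhere in the tool. */
extern void copy_whole_file(const char *src, const char *dst);
extern void merge_partial_file(const char *parent, const char *partial,
                               const char *dst);

/*
 * Walk one database directory of the incremental backup.  Only names with a
 * ".partial" suffix are resolved against the parent backup, so a relation
 * that was dropped before the incremental never appears in 'out_dir'.
 */
static void
combine_directory(const char *incr_dir, const char *parent_dir,
                  const char *out_dir)
{
    DIR        *dir = opendir(incr_dir);
    struct dirent *de;

    if (dir == NULL)
        return;

    while ((de = readdir(dir)) != NULL)
    {
        char        src[4096], parent[4096], dst[4096];
        struct stat st;
        size_t      len = strlen(de->d_name);

        if (len <= 8 || strcmp(de->d_name + len - 8, ".partial") != 0)
            continue;           /* complete and non-relation files are copied elsewhere */

        snprintf(src, sizeof(src), "%s/%s", incr_dir, de->d_name);
        snprintf(dst, sizeof(dst), "%s/%.*s", out_dir, (int) (len - 8),
                 de->d_name);
        snprintf(parent, sizeof(parent), "%s/%.*s", parent_dir,
                 (int) (len - 8), de->d_name);

        if (stat(src, &st) == 0 && st.st_size == 0)
            copy_whole_file(parent, dst);       /* file was unchanged */
        else
            merge_partial_file(parent, src, dst);
    }
    closedir(dir);
}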
> 2) Just a small thought on building the bitmap,
> can the bitmap be built and maintained as
> and when the changes are happening in the system.
> If we are building the bitmap while doing the incremental backup,
> Scanning through each file might take more time.
> This can be a configurable parameter, the system can run
> without capturing this information by default, but if there are some
> of them who will be taking incremental backup frequently this
> configuration can be enabled which should track the modified blocks.
IIUC, this will need changes in the backend. Honestly, I think backup is a maintenance task and hampering the backend for this does not look like a good idea. But, having said that, even if we have to provide this as a switch for some users, it will need a different infrastructure than what we are building here for constructing the bitmap, where we scan all the files one by one. Maybe for the initial version, we can go with the current proposal that Robert has suggested, and add this switch at a later point as an enhancement.
On Wed, Jul 10, 2019 at 2:17 PM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > In attachments, you can find a prototype of incremental pg_basebackup, > which consists of 2 features: > > 1) To perform incremental backup one should call pg_basebackup with a > new argument: > > pg_basebackup -D 'basedir' --prev-backup-start-lsn 'lsn' > > where lsn is a start_lsn of parent backup (can be found in > "backup_label" file) > > It calls BASE_BACKUP replication command with a new argument > PREV_BACKUP_START_LSN 'lsn'. > > For datafiles, only pages with LSN > prev_backup_start_lsn will be > included in the backup. > They are saved into 'filename.partial' file, 'filename.blockmap' file > contains an array of BlockNumbers. > For example, if we backuped blocks 1,3,5, filename.partial will contain > 3 blocks, and 'filename.blockmap' will contain array {1,3,5}. I think it's better to keep both the information about changed blocks and the contents of the changed blocks in a single file. The list of changed blocks is probably quite short, and I don't really want to double the number of files in the backup if there's no real need. I suspect it's just overall a bit simpler to keep everything together. I don't think this is a make-or-break thing, and welcome contrary arguments, but that's my preference. > 2) To merge incremental backup into a full backup call > > pg_basebackup -D 'basedir' --incremental-pgdata 'incremental_basedir' > --merge-backups > > It will move all files from 'incremental_basedir' to 'basedir' handling > '.partial' files correctly. This, to me, looks like it's much worse than the design that I proposed originally. It means that: 1. You can't take an incremental backup without having the full backup available at the time you want to take the incremental backup. 2. You're always storing a full backup, which means that you need more disk space, and potentially much more I/O while taking the backup. You save on transfer bandwidth, but you add a lot of disk reads and writes, costs which have to be paid even if the backup is never restored. > 1) Whether we collect block maps using simple "read everything page by > page" approach > or WAL scanning or any other page tracking algorithm, we must choose a > map format. > I implemented the simplest one, while there are more ideas: I think we should start simple. I haven't had a chance to look at Jeevan's patch at all, or yours in any detail, as yet, so these are just some very preliminary comments. It will be good, however, if we can agree on who is going to do what part of this as we try to drive this forward together. I'm sorry that I didn't communicate EDB's plans to work on this more clearly; duplicated effort serves nobody well. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi Jeevan
I reviewed the first two patches -
0001-Add-support-for-command-line-option-to-pass-LSN.patch and
0002-Add-TAP-test-to-test-LSN-option.patch
from the set of incremental backup patches, and the changes look good to me.
I had some concerns about the way we are working around the fact that
pg_lsn_in() accepts an lsn with 0 as a valid lsn, and I think that itself is
contradictory to the definition of InvalidXLogRecPtr. I have started a separate
new thread[1] for the same.
Also, I observe that commit 21f428eb has already moved the lsn decoding
logic to a separate function pg_lsn_in_internal(), so the function
decode_lsn_internal() from patch 0001 will go away and the dependent code needs
to be modified.
I shall review the rest of the patches, and post the comments.
Regards,
Jeevan Ladhe
On Tue, Jul 30, 2019 at 1:58 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I haven't had a chance to look at Jeevan's patch at all, or yours in
> any detail, as yet, so these are just some very preliminary comments.
> It will be good, however, if we can agree on who is going to do what
> part of this as we try to drive this forward together. I'm sorry that
> I didn't communicate EDB's plans to work on this more clearly;
> duplicated effort serves nobody well.
I had a look over Anastasia's PoC patch to understand the approach she has
taken and here are my observations.
1.
The patch first creates a .blockmap file for each relation file containing
an array of all modified block numbers. This is done by reading all blocks
(in a chunk of 4 (32kb in total) in a loop) from a file and checking the page
LSN with given LSN. Later, to create .partial file, a relation file is opened
again and all blocks are read in a chunk of 4 in a loop. If found modified,
it is copied into another memory and after scanning all 4 blocks, all copied
blocks are sent to the .partial file.
In this approach, each file is opened and read twice, which looks more expensive
to me, whereas in my patch I do that just once. However, I read the entire
file into memory to check which blocks are modified, while in Anastasia's design
at most TAR_SEND_SIZE (32kb) is read at a time, in a loop. I need to do
that because we want to know how heavily the file got modified, so that we can
send the entire file if it was modified beyond the threshold (currently 90%).
2.
Also, while sending modified blocks, they are copied into another buffer; instead,
they could just be sent from the already-read file contents (in BLCKSZ-sized chunks).
Here, the .blockmap created earlier was not used. In my implementation, we are
sending just a .partial file with a header containing all required details like
the number of blocks changes along with the block numbers including CRC
followed by the blocks itself.
3.
I tried compiling Anastasia's patch, but got an error, so I could not see or
test how it behaves. Also, like a normal backup, the incremental backup
option needs to verify checksums if requested.
4.
While combining full and incremental backups, files from the incremental backup
are just copied into the full backup directory. In the design I posted
earlier, we are going the other way round to avoid overwriting and the other issues
I explained earlier.
I am almost done writing the patch for pg_combinebackup and will post soon.
--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
On Tue, Jul 30, 2019 at 1:58 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jul 10, 2019 at 2:17 PM Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
> > For datafiles, only pages with LSN > prev_backup_start_lsn will be
> > included in the backup.

One thought: if the file is not modified, there is no need to check the LSN.

> I think it's better to keep both the information about changed blocks
> and the contents of the changed blocks in a single file.

I feel Robert's suggestion is good. We can probably keep one meta file for
each backup with some basic information about all the files being backed up.
This metadata file will be useful in the below cases:
Table dropped before incremental backup
Table truncated and Insert/Update/Delete operations before incremental backup
If we have the metadata, we can add some optimization to detect the above
scenarios, using the metadata information to identify file deletions and to
avoid the extra write and delete work in pg_combinebackup which Jeevan
mentioned in his previous mail.
Probably it can also help us decide which work each worker needs to do if we
are planning to back up in parallel.

Regards,
vignesh
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 31, 2019 at 1:59 PM vignesh C <vignesh21@gmail.com> wrote:
> I feel Robert's suggestion is good.
> We can probably keep one meta file for each backup with some basic information
> of all the files being backed up, this metadata file will be useful in the
> below case:
> Table dropped before incremental backup
> Table truncated and Insert/Update/Delete operations before incremental backup

There's really no need for this with the design I proposed. The files that
should exist when you restore an incremental backup are exactly the set of
files that exist in the final incremental backup, except that any .partial
files need to be replaced with a correct reconstruction of the underlying
file. You don't need to know what got dropped or truncated; you only need to
know what's supposed to be there at the end.

You may be thinking, as I once did, that restoring an incremental backup
would consist of restoring the full backup first and then layering the
incrementals over it, but if you read what I proposed, it actually works the
other way around: you restore the files that are present in the incremental,
and as needed, pull pieces of them from earlier incremental and/or full
backups. I think this is a *much* better design than doing it the other way;
it avoids any risk of getting the wrong answer due to truncations or drops,
and it also is faster, because you only read older backups to the extent
that you actually need their contents.

I think it's a good idea to try to keep all the information about a single
file being backed up in one place. It's just less confusing. If, for example,
you have a metadata file that tells you which files are dropped - that is,
which files you DON'T have - then what happens if one of those files is
present in the data directory after all? Well, then you have inconsistent
information and are confused, and maybe your code won't even notice the
inconsistency. Similarly, if the metadata file is separate from the block
data, then what happens if one file is missing, or isn't from the same backup
as the other file? That shouldn't happen, of course, but if it does, you'll
get confused. There's no perfect solution to these kinds of problems: if we
suppose that the backup can be corrupted by having missing or extra files,
why not also corruption within a single file? Still, on balance I tend to
think that keeping related stuff together minimizes the surface area for
bugs. I realize that's arguable, though.

One consideration that goes the other way: if you have a manifest file that
says what files are supposed to be present in the backup, then you can detect
a disappearing file, which is impossible with the design I've proposed (and
with the current full backup machinery). That might be worth fixing, but it's
a separate feature that has little to do with incremental backup.

> Probably it can also help us to decide which work the worker needs to do
> if we are planning to backup in parallel.

I don't think we need a manifest file for parallel backup. One process or
thread can scan the directory tree, make a list of which files are present,
and then hand individual files off to other processes or threads. In short,
the directory listing serves as the manifest.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
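To make the restore direction described above concrete, here is a minimal
sketch of the "start from the newest backup and only reach back as needed"
idea; all of the types, names, and in-memory representation are illustrative
only, not taken from any posted patch:

#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical in-memory view of one backed-up relation file. */
typedef struct BackupFile
{
    bool        is_partial;     /* .partial file vs. complete file */
    uint32_t    nblocks;        /* number of blocks stored (if partial) */
    uint32_t   *blocknos;       /* block numbers present (if partial) */
    char      **blocks;         /* BLCKSZ-sized block images */
} BackupFile;

/*
 * Find block 'blkno' by walking from the newest backup (index 0) toward the
 * oldest. The first backup that either stores the block in its .partial file
 * or contains the complete file supplies the block's contents; older backups
 * are never read unless they are actually needed.
 */
static const char *
find_block(BackupFile **chain, int nbackups, uint32_t blkno)
{
    for (int i = 0; i < nbackups; i++)
    {
        const BackupFile *bf = chain[i];

        if (!bf->is_partial)
            return bf->blocks[blkno];   /* complete file: block must be here */

        for (uint32_t j = 0; j < bf->nblocks; j++)
        {
            if (bf->blocknos[j] == blkno)
                return bf->blocks[j];   /* this incremental stores the block */
        }
    }
    return NULL;                        /* not found in any backup */
}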
On Tue, Jul 30, 2019 at 9:39 AM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
> I am almost done writing the patch for pg_combinebackup and will post soon.

Attached patch which implements the pg_combinebackup utility used to combine
full basebackup with one or more incremental backups.

I have tested it manually and it works for all best cases.

Let me know if you have any inputs/suggestions/review comments?

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment
On Fri, Aug 2, 2019 at 9:13 AM vignesh C <vignesh21@gmail.com> wrote:
> + rc = system(copycmd);

I don't think this patch should be calling system() in the first place.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Fri, Aug 2, 2019 at 9:13 AM vignesh C <vignesh21@gmail.com> wrote:
> > + rc = system(copycmd);
>
> I don't think this patch should be calling system() in the first place.

+1.

Thanks,

Stephen
Attachment
On Thu, Aug 1, 2019 at 5:06 PM Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
> On Tue, Jul 30, 2019 at 9:39 AM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
>>
>> I am almost done writing the patch for pg_combinebackup and will post soon.
>
>
> Attached patch which implements the pg_combinebackup utility used to combine
> full basebackup with one or more incremental backups.
>
> I have tested it manually and it works for all best cases.
>
> Let me know if you have any inputs/suggestions/review comments?
>
Some comments:
1) There will be some link files created for tablespace, we might
require some special handling for it
2)
+ while (numretries <= maxretries)
+ {
+ rc = system(copycmd);
+ if (rc == 0)
+ return;
+ pg_log_info("could not copy, retrying after %d seconds",
+ sleeptime);
+ pg_usleep(numretries++ * sleeptime * 1000000L);
+ }
Retry functionality is handled only for copying of full files, should
we handle retry for copying of partial files
3)
+ maxretries = atoi(optarg);
+ if (maxretries < 0)
+ {
+ pg_log_error("invalid value for maxretries");
+ fprintf(stderr, _("%s: -r maxretries must be >= 0\n"), progname);
+ exit(1);
+ }
+ break;
+ case 's':
+ sleeptime = atoi(optarg);
+ if (sleeptime <= 0 || sleeptime > 60)
+ {
+ pg_log_error("invalid value for sleeptime");
+ fprintf(stderr, _("%s: -s sleeptime must be between 1 and 60\n"), progname);
+ exit(1);
+ }
+ break;
we can have some range for maxretries similar to sleeptime
4)
+ fp = fopen(filename, "r");
+ if (fp == NULL)
+ {
+ pg_log_error("could not read file \"%s\": %m", filename);
+ exit(1);
+ }
+
+ labelfile = malloc(statbuf.st_size + 1);
+ if (fread(labelfile, 1, statbuf.st_size, fp) != statbuf.st_size)
+ {
+ pg_log_error("corrupted file \"%s\": %m", filename);
+ free(labelfile);
+ exit(1);
+ }
Should we check for malloc failure
5) Should we add display of progress as backup may take some time,
this can be added as enhancement. We can get other's opinion on this.
6)
+ if (nIncrDir == MAX_INCR_BK_COUNT)
+ {
+ pg_log_error("too many incremental backups to combine");
+ fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+ exit(1);
+ }
+
+ IncrDirs[nIncrDir] = optarg;
+ nIncrDir++;
+ break;
If the backup count increases providing the input may be difficult,
Shall user provide all the incremental backups from a parent folder
and can we handle the ordering of incremental backup internally
7)
+ if (isPartialFile)
+ {
+ if (verbose)
+ pg_log_info("combining partial file \"%s.partial\"", fn);
+
+ combine_partial_files(fn, IncrDirs, nIncrDir, subdirpath, outfn);
+ }
+ else
+ copy_whole_file(infn, outfn);
Add verbose for copying whole file
8) We can also check if approximate space is available in disk before
starting combine backup, this can be added as enhancement. We can get
other's opinion on this.
9)
+ printf(_(" -i, --incr-backup=DIRECTORY incremental backup directory
(maximum %d)\n"), MAX_INCR_BK_COUNT);
+ printf(_(" -o, --output-dir=DIRECTORY combine backup into directory\n"));
+ printf(_("\nGeneral options:\n"));
+ printf(_(" -n, --no-clean do not clean up after errors\n"));
Combine backup into directory can be combine backup directory
10)
+/* Max number of incremental backups to be combined. */
+#define MAX_INCR_BK_COUNT 10
+
+/* magic number in incremental backup's .partial file */
MAX_INCR_BK_COUNT can be increased little, some applications use 1
full backup at the beginning of the month and use 30 incremental
backups rest of the days in the month
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
I have not looked at the patch in detail, but just some nits from my side.

On Fri, Aug 2, 2019 at 6:13 PM vignesh C <vignesh21@gmail.com> wrote:
> 2)
> + while (numretries <= maxretries)
> + {
> + rc = system(copycmd);
> + if (rc == 0)
> + return;

Use an API to copy the file instead of "system"; better to use a secure copy.

> + pg_log_info("could not copy, retrying after %d seconds",
> + sleeptime);
> + pg_usleep(numretries++ * sleeptime * 1000000L);
> + }
> Retry functionality is handled only for copying of full files, should
> we handle retry for copying of partial files

The log and the sleep time do not match: you are multiplying sleeptime by
numretries++ but logging only "sleeptime".

Why are we retrying here at all? Capture the proper copy error and act
accordingly; blindly retrying does not make sense.

> 3)
> we can have some range for maxretries similar to sleeptime
> 4)
> Should we check for malloc failure

Use pg_malloc instead of malloc.

> 5) Should we add display of progress as backup may take some time,
> this can be added as enhancement. We can get other's opinion on this.

Yes, we should, but this is not the right time to do that.

> 6)
> If the backup count increases providing the input may be difficult,
> Shall user provide all the incremental backups from a parent folder
> and can we handle the ordering of incremental backup internally

Why do we have that limit in the first place?

--
Ibrar Ahmed
On Mon, Aug 5, 2019 at 7:13 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Aug 2, 2019 at 9:13 AM vignesh C <vignesh21@gmail.com> wrote:
> > + rc = system(copycmd);
>
> I don't think this patch should be calling system() in the first place.

So, do you mean we should just do fread() and fwrite() for the whole file?

I thought it is better if it was done by the OS itself instead of reading 1GB
into memory and writing the same to the file.

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
On Wed, Aug 7, 2019 at 5:46 AM Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> So, do you mean we should just do fread() and fwrite() for the whole file?
>
> I thought it is better if it was done by the OS itself instead of reading 1GB
> into the memory and writing the same to the file.

Well, 'cp' is just a C program. If they can write code to copy a
file, so can we, and then we're not dependent on 'cp' being installed,
working properly, being in the user's path or at the hard-coded
pathname we expect, etc. There's an existing copy_file() function in
src/backend/storage/file/copydir.c which I'd probably look into
adapting for frontend use. I'm not sure whether it would be important
to adapt the data-flushing code that's present in that routine or
whether we could get by with just the loop to read() and write() data.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Aug 8, 2019 at 8:37 PM Jeevan Ladhe
<jeevan.ladhe@enterprisedb.com> wrote:
> + if (!XLogRecPtrIsInvalid(previous_lsn))
> + appendStringInfo(labelfile, "PREVIOUS WAL LOCATION: %X/%X\n",
> + (uint32) (previous_lsn >> 32), (uint32) previous_lsn);
>
> May be we should rename to something like:
> "INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP START LOCATION"
> to make it more intuitive?
So, I think that you are right that PREVIOUS WAL LOCATION might not be
entirely clear, but at least in my view, INCREMENTAL BACKUP START WAL
LOCATION is definitely not clear. This backup is an incremental
backup, and it has a start WAL location, so you'd end up with START
WAL LOCATION and INCREMENTAL BACKUP START WAL LOCATION and those sound
like they ought to both be the same thing, but they're not. Perhaps
something like REFERENCE WAL LOCATION or REFERENCE WAL LOCATION FOR
INCREMENTAL BACKUP would be clearer.
> File header structure is defined in both the files basebackup.c and
> pg_combinebackup.c. I think it is better to move this to replication/basebackup.h.
Or some other header, but yeah, definitely don't duplicate the struct
definition (or any other kind of definition).
> IMHO, while labels are not advisable in general, it may be better to use a label
> here rather than a while(1) loop, so that we can move to the label in case we
> want to retry once. I think here it opens doors for future bugs if someone
> happens to add code here, ending up adding some condition and then the
> break becomes conditional. That will leave us in an infinite loop.
I'm not sure which style is better here, but I don't really buy this argument.
On Fri, Aug 9, 2019 at 6:36 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Well, 'cp' is just a C program. If they can write code to copy a
> file, so can we, and then we're not dependent on 'cp' being installed,
> working properly, being in the user's path or at the hard-coded
> pathname we expect, etc. There's an existing copy_file() function in
> src/backend/storage/file/copydir.c which I'd probably look into
> adapting for frontend use.
Agree that we can certainly use open(), read(), write(), and close() here, but
given that pg_basebackup.c and basebackup.c are using file operations, I think
using fopen(), fread(), fwrite(), and fclose() will be better here, at least
for consistency.
Let me know if we still want to go with native OS calls.
--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Hi Robert,

On Fri, Aug 9, 2019 at 6:40 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Perhaps something like REFERENCE WAL LOCATION or REFERENCE WAL LOCATION FOR
> INCREMENTAL BACKUP would be clearer.

Agree, how about INCREMENTAL BACKUP REFERENCE WAL LOCATION ?

Regards,
Jeevan Ladhe
On Fri, Aug 9, 2019 at 11:56 PM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote:
> Agree, how about INCREMENTAL BACKUP REFERENCE WAL LOCATION ?

+1 for INCREMENTAL BACKUP REFERENCE WAL LOCATION.

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
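If the name settles on that, the backup_label line under discussion would
presumably end up looking something like this (illustrative only, adapted
from the hunk quoted earlier in the thread):

if (!XLogRecPtrIsInvalid(previous_lsn))
    appendStringInfo(labelfile, "INCREMENTAL BACKUP REFERENCE WAL LOCATION: %X/%X\n",
                     (uint32) (previous_lsn >> 32), (uint32) previous_lsn);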
On Mon, Aug 12, 2019 at 7:57 AM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
> Agree that we can certainly use open(), read(), write(), and close() here, but
> given that pg_basebackup.c and basebackup.c are using file operations, I think
> using fopen(), fread(), fwrite(), and fclose() will be better here, at least
> for consistency.

Oh, that's fine. Whatever's more consistent with the pre-existing code.
Just, let's not use system().

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
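To make that concrete, a frontend copy loop of the kind being discussed might
look roughly like the following; this is only an illustrative sketch (names,
buffer size, and error handling are arbitrary), not code from the posted patch:

#include <stdio.h>
#include <stdlib.h>

#define COPY_BUF_SIZE (64 * 1024)   /* hypothetical buffer size */

static void
copy_whole_file_sketch(const char *fromfn, const char *tofn)
{
    FILE   *ifp = fopen(fromfn, "rb");
    FILE   *ofp = fopen(tofn, "wb");
    char    buf[COPY_BUF_SIZE];
    size_t  cnt;

    if (ifp == NULL || ofp == NULL)
    {
        fprintf(stderr, "could not open \"%s\" or \"%s\"\n", fromfn, tofn);
        exit(1);
    }

    /* read a chunk at a time rather than the whole (up to 1GB) file */
    while ((cnt = fread(buf, 1, sizeof(buf), ifp)) > 0)
    {
        if (fwrite(buf, 1, cnt, ofp) != cnt)
        {
            fprintf(stderr, "could not write to \"%s\"\n", tofn);
            exit(1);
        }
    }

    if (ferror(ifp))
    {
        fprintf(stderr, "could not read from \"%s\"\n", fromfn);
        exit(1);
    }

    fclose(ifp);
    fclose(ofp);
}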
Hi Jeevan,

I have reviewed the backup part at code level and am still looking into the
restore (combine) and functional part of it. But here are my comments so far:

The patches need a rebase.

Maybe we should rename to something like:
"INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP START LOCATION"
to make it more intuitive?

File header structure is defined in both the files basebackup.c and
pg_combinebackup.c. I think it is better to move this to replication/basebackup.h.

I think we can avoid having the flag isrelfile in sendFile().

Also, having isrelfile as part of the following condition is confusing,
because even the relation files in a full backup are going to be backed up
by this loop only, but still, the condition reads '(!isrelfile &&...)'.

IMHO, while labels are not advisable in general, it may be better to use a
label here rather than a while(1) loop, so that we can move to the label in
case we want to retry once. I think here it opens doors for future bugs if
someone happens to add code here, ending up adding some condition and then
the break becomes conditional. That will leave us in an infinite loop.

Similar to the structure partial_file_header, I think the above macro can
also be moved to basebackup.h instead of defining it twice.

I think this is a huge memory request (1GB) and may fail on a busy/loaded
server at times. We should check for failures of malloc, maybe throw some
error on getting ENOMEM as errno.

Here, should not we expect statbuf->st_size < (RELSEG_SIZE * BLCKSZ), and it
should be safe to read just statbuf->st_size always, I guess? But I am ok
with having this extra guard here.

In sendFile(), I am sorry if I am missing something, but I am not able to
understand why 'cnt' and 'i' should have different values when they are
being passed to verify_page_checksum(). I think passing only one of them
should be sufficient.
buffer whereas blkno is the offset from the start of the page. For incremental
backup, they are same as we read the whole file but they are different in case
of regular full backup where we read 4 blocks at a time. i value there will be
between 0 and 3.
Maybe we should just have a variable no_of_blocks to store the number of
blocks, rather than calculating this, say, RELSEG_SIZE (i.e. 131072) times in
the worst case.

Sorry if I am missing something, but should it not be just: len = cnt;

As I said earlier in my previous email, we now do not need
decode_lsn_internal(), as it is already taken care of by the introduction of
the function pg_lsn_in_internal().

Regards,
Jeevan Ladhe
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
On Fri, Aug 2, 2019 at 6:43 PM vignesh C <vignesh21@gmail.com> wrote:
> Some comments:
> 1) There will be some link files created for tablespace, we might
> require some special handling for it

Yep. I have that in my ToDo. Will start working on that soon.

> 2)
> Retry functionality is handled only for copying of full files, should
> we handle retry for copying of partial files
> 3)
> we can have some range for maxretries similar to sleeptime

I took help from pg_standby code related to maxretries and sleeptime.
However, as we don't want to use a system() call now, I have removed all
this kludge and just used fread/fwrite as discussed.

> 4)
> Should we check for malloc failure

Used pg_malloc() instead. Same is also suggested by Ibrar.

> 5) Should we add display of progress as backup may take some time,
> this can be added as enhancement. We can get other's opinion on this.

Can be done afterward once we have the functionality in place.

> 6)
> If the backup count increases providing the input may be difficult,
> Shall user provide all the incremental backups from a parent folder
> and can we handle the ordering of incremental backup internally

I am not sure of this yet. We need to provide the tablespace mapping too.
But thanks for putting a point here. Will keep that in mind when I revisit this.

> 7)
> Add verbose for copying whole file

Done

> 8) We can also check if approximate space is available in disk before
> starting combine backup, this can be added as enhancement. We can get
> other's opinion on this.

Hmm... will leave it for now. User will get an error anyway.

> 9)
> Combine backup into directory can be combine backup directory

Done

> 10)
> MAX_INCR_BK_COUNT can be increased little, some applications use 1
> full backup at the beginning of the month and use 30 incremental
> backups rest of the days in the month

Yeah, agree. But using any number here is debatable.

> Let's see others opinion too.

Attached new sets of patches with refactoring done separately.
Incremental backup patch became small now and hopefully more
readable than the first version.

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
On Fri, Aug 16, 2019 at 3:24 PM Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> > 10)
> > MAX_INCR_BK_COUNT can be increased little, some applications use 1
> > full backup at the beginning of the month and use 30 incremental
> > backups rest of the days in the month
>
> Yeah, agree. But using any number here is debatable.

Why not use a list?

> + buf = (char *) malloc(statbuf->st_size);
> + if (buf == NULL)
> + ereport(ERROR,
> + (errcode(ERRCODE_OUT_OF_MEMORY),
> + errmsg("out of memory")));

Why are you using malloc, you can use palloc here.
> char *extptr = strstr(fn, ".partial");
I think there should be a better and stricter way to check the file extension.
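One stricter alternative, for illustration only (a hypothetical helper, not
code from the patch), is to require that ".partial" appear as a suffix rather
than anywhere in the name:

#include <string.h>
#include <stdbool.h>

/* Return true only if 'fn' ends with ".partial", not merely contains it. */
static bool
has_partial_suffix(const char *fn)
{
    const char *suffix = ".partial";
    size_t      fnlen = strlen(fn);
    size_t      suffixlen = strlen(suffix);

    return fnlen > suffixlen &&
           strcmp(fn + fnlen - suffixlen, suffix) == 0;
}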
-
> + extptr = strstr(outfn, ".partial");
> + Assert (extptr != NULL);
Why are you checking that again, you just appended that in the above statement?
-
> + if (verbose && statbuf.st_size > (RELSEG_SIZE * BLCKSZ))
> + pg_log_info("found big file \"%s\" (size: %.2lfGB): %m", fromfn,
> + (double) statbuf.st_size / (RELSEG_SIZE * BLCKSZ));
This is not just a log; finding a file that is bigger than expected surely indicates some problem.
-
> + * We do read entire 1GB file in memory while taking incremental backup; so
> + * I don't see any reason why can't we do that here. Also, copying data in
> + * chunks is expensive. However, for bigger files, we still slice at 1GB
> + * border.
What do you mean by a bigger file, a file greater than 1GB? In which case do you get a file > 1GB?
On Fri, Aug 16, 2019 at 8:07 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
> What do you mean by a bigger file, a file greater than 1GB? In which case do
> you get a file > 1GB?

Few comments:

Comment:
+ buf = (char *) malloc(statbuf->st_size);
+ if (buf == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory")));
+
+ if ((cnt = fread(buf, 1, statbuf->st_size, fp)) > 0)
+ {
+ Bitmapset *mod_blocks = NULL;
+ int nmodblocks = 0;
+
+ if (cnt % BLCKSZ != 0)
+ {
We can use the same size as the full page size. After pg_start_backup, full
page writes will be enabled. We can use the same file size to maintain data
consistency.

Comment:
/* Validate given LSN and convert it into XLogRecPtr. */
+ opt->lsn = pg_lsn_in_internal(strVal(defel->arg), &have_error);
+ if (XLogRecPtrIsInvalid(opt->lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid value for LSN")));
Validate that the input lsn is less than the current system lsn.

Comment:
/* Validate given LSN and convert it into XLogRecPtr. */
+ opt->lsn = pg_lsn_in_internal(strVal(defel->arg), &have_error);
+ if (XLogRecPtrIsInvalid(opt->lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid value for LSN")));
Should we check if it is the same timeline as the system's timeline?

Comment:
+ if (fread(blkdata, 1, BLCKSZ, infp) != BLCKSZ)
+ {
+ pg_log_error("could not read from file \"%s\": %m", outfn);
+ cleanup_filemaps(filemaps, fmindex + 1);
+ exit(1);
+ }
+
+ /* Finally write one block to the output file */
+ if (fwrite(blkdata, 1, BLCKSZ, outfp) != BLCKSZ)
+ {
+ pg_log_error("could not write to file \"%s\": %m", outfn);
+ cleanup_filemaps(filemaps, fmindex + 1);
+ exit(1);
+ }
Should we support the compression formats supported by pg_basebackup?
This can be an enhancement after the functionality is completed.

Comment:
We should provide some mechanism to validate the backup, to identify if some
backup is corrupt or some file is missing (deleted) in a backup.

Comment:
+ ofp = fopen(tofn, "wb");
+ if (ofp == NULL)
+ {
+ pg_log_error("could not create file \"%s\": %m", tofn);
+ exit(1);
+ }
ifp should be closed in the error flow.

Comment:
+ fp = fopen(filename, "r");
+ if (fp == NULL)
+ {
+ pg_log_error("could not read file \"%s\": %m", filename);
+ exit(1);
+ }
+
+ labelfile = pg_malloc(statbuf.st_size + 1);
+ if (fread(labelfile, 1, statbuf.st_size, fp) != statbuf.st_size)
+ {
+ pg_log_error("corrupted file \"%s\": %m", filename);
+ pg_free(labelfile);
+ exit(1);
+ }
fclose can be moved above.

Comment:
+ if (!modifiedblockfound)
+ {
+ copy_whole_file(fm->filename, outfn);
+ cleanup_filemaps(filemaps, fmindex + 1);
+ return;
+ }
+
+ /* Write all blocks to the output file */
+
+ if (fstat(fileno(fm->fp), &statbuf) != 0)
+ {
+ pg_log_error("could not stat file \"%s\": %m", fm->filename);
+ pg_free(filemaps);
+ exit(1);
+ }
In some error flows, cleanup_filemaps needs to be called to close the file
descriptors that were opened.

Comment:
+/*
+ * When to send the whole file, % blocks modified (90%)
+ */
+#define WHOLE_FILE_THRESHOLD 0.9
+
This can be a user-configured value. This can be an enhancement after the
functionality is completed.

Comment:
We can add a README file with all the details regarding incremental backup
and combine backup.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Fri, Aug 16, 2019 at 6:23 AM Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> [ patches ]

Reviewing 0002 and 0003:

- Commit message for 0003 claims magic number and checksum are 0, but
that (fortunately) doesn't seem to be the case.

- looks_like_rel_name actually checks whether it looks like a
*non-temporary* relation name; suggest adjusting the function name.

- The names do_full_backup and do_incremental_backup are quite
confusing because you're really talking about what to do with one
file. I suggest sendCompleteFile() and sendPartialFile().

- Is there any good reason to have 'refptr' as a global variable, or
could we just pass the LSN around via function arguments? I know it's
just mimicking startptr, but storing startptr in a global variable
doesn't seem like a great idea either, so if it's not too annoying,
let's pass it down via function arguments instead. Also, refptr is a
crappy name (even worse than startptr); whether we end up with a
global variable or a bunch of local variables, let's make the name(s)
clear and unambiguous, like incremental_reference_lsn. Yeah, I know
that's long, but I still think it's better than being unclear.

- do_incremental_backup looks like it can never report an error from
fread(), which is bad. But I see that this is just copied from the
existing code which has the same problem, so I started a separate
thread about that.

- I think that passing cnt and blkindex to verify_page_checksum()
doesn't look very good from an abstraction point of view. Granted,
the existing code isn't great either, but I think this makes the
problem worse. I suggest passing "int backup_distance" to this
function, computed as cnt - BLCKSZ * blkindex. Then, you can
fseek(-backup_distance), fread(BLCKSZ), and then
fseek(backup_distance - BLCKSZ).

- While I generally support the use of while and for loops rather
than goto for flow control, a while (1) loop that ends with a break is
functionally a goto anyway. I think there are several ways this could
be revised. The most obvious one is probably to use goto, but I vote
for inverting the sense of the test: if (PageIsNew(page) ||
PageGetLSN(page) >= startptr) break; This approach also saves a level
of indentation for more than half of the function.

- I am not sure that it's a good idea for sendwholefile = true to
result in dumping the entire file onto the wire in a single CopyData
message. I don't know of a concrete problem in typical
configurations, but someone who increases RELSEG_SIZE might be able to
overflow CopyData's length word. At 2GB the length word would be
negative, which might break, and at 4GB it would wrap around, which
would certainly break. See CopyData in
https://www.postgresql.org/docs/12/protocol-message-formats.html
To avoid this issue, and maybe some others, I suggest defining a
reasonably large chunk size, say 1MB as a constant in this file
someplace, and sending the data as a series of chunks of that size.

- I don't think that the way concurrent truncation is handled is
correct for partial files. Right now it just falls through to code
which appends blocks of zeroes in either the complete-file or
partial-file case. I think that logic should be moved into the
function that handles the complete-file case. In the partial-file
case, the blocks that we actually send need to match the list of block
numbers we promised to send. We can't just send the promised blocks
and then tack a bunch of zero-filled blocks onto the end that the file
header doesn't know about.

- For reviewer convenience, please use the -v option to git
format-patch when posting and reposting a patch series. Using -v2,
-v3, etc. on successive versions really helps.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
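For illustration, the chunked-send idea mentioned above might look roughly
like this; the constant and function names are made up for the sketch, while
pq_putmessage() is the existing backend routine for sending a protocol
message and 'd' is the CopyData message type:

/*
 * Hypothetical: send a large buffer as a series of CopyData messages of at
 * most SEND_CHUNK_SIZE bytes each, rather than one huge message whose length
 * could overflow the protocol's 32-bit length word.
 */
#define SEND_CHUNK_SIZE (1024 * 1024)

static void
send_file_in_chunks(const char *buf, size_t total)
{
    size_t      sent = 0;

    while (sent < total)
    {
        size_t      thischunk = Min(SEND_CHUNK_SIZE, total - sent);

        pq_putmessage('d', buf + sent, thischunk);
        sent += thischunk;
    }
}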
Due to the inherent nature of pg_basebackup, the incremental backup also
allows taking backup in tar and compressed format. But, pg_combinebackup
does not understand how to restore this. I think we should either make
pg_combinebackup support restoration of tar incremental backup or restrict
taking the incremental backup in tar format until pg_combinebackup
supports the restoration, by making the options '--lsn' and '-Ft' exclusive.

It is arguable that one can take the incremental backup in tar format, extract
that manually and then give the resultant directory as input to
pg_combinebackup, but I think that kills the purpose of having a
pg_combinebackup utility.

Thoughts?

Regards,
Jeevan Ladhe
Attachment
On Thu, Aug 29, 2019 at 10:41 AM Jeevan Ladhe
<jeevan.ladhe@enterprisedb.com> wrote:
> Due to the inherent nature of pg_basebackup, the incremental backup also
> allows taking backup in tar and compressed format. But, pg_combinebackup
> does not understand how to restore this. I think we should either make
> pg_combinebackup support restoration of tar incremental backup or restrict
> taking the incremental backup in tar format until pg_combinebackup
> supports the restoration by making option '--lsn' and '-Ft' exclusive.
>
> It is arguable that one can take the incremental backup in tar format, extract
> that manually and then give the resultant directory as input to the
> pg_combinebackup, but I think that kills the purpose of having
> pg_combinebackup utility.
I don't agree. You're right that you would have to untar (and
uncompress) the backup to run pg_combinebackup, but you would also
have to do that to restore a non-incremental backup, so it doesn't
seem much different. It's true that for an incremental backup you
might need to untar and uncompress multiple prior backups rather than
just one, but that's just the nature of an incremental backup. And,
on a practical level, if you want compression, which is pretty likely
if you're thinking about incremental backups, the way to get that is
to use tar format with -z or -Z.
It might be interesting to teach pg_combinebackup to be able to read
tar-format backups, but I think that there are several variants of the
tar format, and I suspect it would need to read them all. If someone
un-tars and re-tars a backup with a different tar tool, we don't want
it to become unreadable. So we'd either have to write our own
de-tarring library or add an external dependency on one.
I don't
think it's worth doing that at this point; I definitely don't think it
needs to be part of the first patch.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 16, 2019 at 3:54 PM Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:

0003:
+/*
+ * When to send the whole file, % blocks modified (90%)
+ */
+#define WHOLE_FILE_THRESHOLD 0.9

How is this threshold selected? Is it by some test?

- magic number, currently 0 (4 bytes)
I think in the patch we are using (#define INCREMENTAL_BACKUP_MAGIC
0x494E4352) as the magic number, not 0.

+ Assert(statbuf->st_size <= (RELSEG_SIZE * BLCKSZ));
+
+ buf = (char *) malloc(statbuf->st_size);
+ if (buf == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory")));
+
+ if ((cnt = fread(buf, 1, statbuf->st_size, fp)) > 0)
+ {
+ Bitmapset *mod_blocks = NULL;
+ int nmodblocks = 0;
+
+ if (cnt % BLCKSZ != 0)
+ {

It will be good to add some comments for the if block and also for the
assert. Actually, it's not very clear from the code.

0004:
+#include <time.h>
+#include <sys/stat.h>
+#include <unistd.h>

Header file include order (sys/stat.h should be before time.h).

+ printf(_("%s combines full backup with incremental backup.\n\n"), progname);
/backup/backups

+ * scan_file
+ *
+ * Checks whether given file is partial file or not. If partial, then combines
+ * it into a full backup file, else copies as is to the output directory.
+ */
/If partial, then combines/ If partial, then combine

+static void
+combine_partial_files(const char *fn, char **IncrDirs, int nIncrDir,
+ const char *subdirpath, const char *outfn)
+ /*
+ * Open all files from all incremental backup directories and create a file
+ * map.
+ */
+ basefilefound = false;
+ for (i = (nIncrDir - 1), fmindex = 0; i >= 0; i--, fmindex++)
+ {
+ fm = &filemaps[fmindex];
+ .....
+ }
+
+ /* Process all opened files. */
+ lastblkno = 0;
+ modifiedblockfound = false;
+ for (i = 0; i < fmindex; i++)
+ {
+ char *buf;
+ int hsize;
+ int k;
+ int blkstartoffset;
+ ......
+ }
+
+ for (i = 0; i <= lastblkno; i++)
+ {
+ char blkdata[BLCKSZ];
+ FILE *infp;
+ int offset;
+ ...
+ }
}

Can we break down this function into 2-3 functions? At least creating the
file map can directly go to a separate function.

I have read the 0003 and 0004 patches and these are a few cosmetic comments.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Sat, Aug 31, 2019 at 3:41 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
> Are we using any tar library in pg_basebackup.c? We already have the capability
> in pg_basebackup to do that.
I think pg_basebackup is using homebrew code to generate tar files,
but I'm reluctant to do that for reading tar files. For generating a
file, you can always emit the newest and "best" tar format, but for
reading a file, you probably want to be prepared for older or cruftier
variants. Maybe not -- I'm not super-familiar with the tar on-disk
format. But I think there must be a reason why tar libraries exist,
and I don't want to write a new one.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Ibrar Ahmed <ibrar.ahmad@gmail.com> writes:
> +1 using the library to tar.
Uh, *what* library?
pg_dump's pg_backup_tar.c is about 1300 lines, a very large fraction
of which is boilerplate for interfacing to pg_backup_archiver's APIs.
The stuff that actually knows specifically about tar looks to be maybe
a couple hundred lines, plus there's another couple hundred lines of
(rather duplicative?) code in src/port/tar.c. None of it is rocket
science.
I can't believe that it'd be a good tradeoff to create a new external
dependency to replace that amount of code. In case you haven't noticed,
our luck with depending on external libraries has been abysmal.
Possibly there's an argument for refactoring things so that there's
more stuff in tar.c and less elsewhere, but let's not go looking
for external code to depend on.
regards, tom lane
On Tue, Sep 3, 2019 at 10:05 AM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
> +1 using the library to tar. But I think reason not using tar library is TAR is
> one of the most simple file format. What is the best/newest format of TAR?
So, I don't really want to go down this path at all, as I already
said. You can certainly do your own research on this topic if you
wish.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Sep 3, 2019 at 12:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Aug 16, 2019 at 3:54 PM Jeevan Chalke
> <jeevan.chalke@enterprisedb.com> wrote:
>
> [...]

I have not yet completed the review for 0004, but I have a few more
comments. Tomorrow I will try to complete the review and some testing
as well.

1. It seems that the output full backup generated with
pg_combinebackup also contains the "INCREMENTAL BACKUP REFERENCE WAL
LOCATION". It seems confusing because now this is a full backup, not
the incremental backup.

2.
+ FILE *outfp;
+ FileOffset outblocks[RELSEG_SIZE];
+ int i;
+ FileMap *filemaps;
+ int fmindex;
+ bool basefilefound;
+ bool modifiedblockfound;
+ uint32 lastblkno;
+ FileMap *fm;
+ struct stat statbuf;
+ uint32 nblocks;
+
+ memset(outblocks, 0, sizeof(FileOffset) * RELSEG_SIZE);

I don't think you need to memset this explicitly, as you can initialize
the array itself, no?
FileOffset outblocks[RELSEG_SIZE] = {{0}}

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Sep 3, 2019 at 12:46 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote: > I did that and have experience working on the TAR format. I was curious about what > "best/newest" you are talking. Well, why not go look it up? On my MacBook, tar is documented to understand three different tar formats: gnutar, ustar, and v7, and two sets of extensions to the tar format: numeric extensions required by POSIX, and Solaris extensions. It also understands the pax and restricted-pax formats which are derived from the ustar format. I don't know what your system supports, but it's probably not hugely different; the fact that there are multiple tar formats has been documented in the tar man page on every machine I've checked for the past 20 years. Here, 'man tar' refers the reader to 'man libarchive-formats', which contains the details mentioned above. A quick Google search for 'multiple tar formats' also finds https://en.wikipedia.org/wiki/Tar_(computing)#File_format and https://www.gnu.org/software/tar/manual/html_chapter/tar_8.html each of which explains a good deal of the complexity in this area. I don't really understand why I have to explain to you what I mean when I say there are multiple tar formats when you can look it up on Google and find that there are multiple tar formats. Again, the point is that the current code only generates tar archives and therefore only needs to generate one format, but if we add code that reads a tar archive, it probably needs to read several formats, because there are several formats that are popular enough to be widely-supported. It's possible that somebody else here knows more about this topic and could make better judgements than I can, but my view at present is that if we want to read tar archives, we probably would want to do it by depending on libarchive. And I don't think we should do that for this project because I don't think it would provide much value. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Sep 03, 2019 at 08:59:53AM -0400, Robert Haas wrote: > I think pg_basebackup is using homebrew code to generate tar files, > but I'm reluctant to do that for reading tar files. Yes. This code has not actually changed since its introduction. Please note that we also have code which reads directly data from a tarball in pg_basebackup.c when appending the recovery parameters to postgresql.auto.conf for -R. There could be some consolidation here with what you are doing. > For generating a > file, you can always emit the newest and "best" tar format, but for > reading a file, you probably want to be prepared for older or cruftier > variants. Maybe not -- I'm not super-familiar with the tar on-disk > format. But I think there must be a reason why tar libraries exist, > and I don't want to write a new one. We need to be sure as well that the library chosen does not block access to a feature in all the various platforms we have. -- Michael
On Wed, Sep 4, 2019 at 10:08 PM Michael Paquier <michael@paquier.xyz> wrote: > > For generating a > > file, you can always emit the newest and "best" tar format, but for > > reading a file, you probably want to be prepared for older or cruftier > > variants. Maybe not -- I'm not super-familiar with the tar on-disk > > format. But I think there must be a reason why tar libraries exist, > > and I don't want to write a new one. > > We need to be sure as well that the library chosen does not block > access to a feature in all the various platforms we have. Well, again, my preference is to just not make this particular feature work natively with tar files. Then I don't need to choose a library, so the question is moot. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
This patchset also fixes the issues reported by Vignesh, Robert, Jeevan Ladhe,
and Dilip Kumar.
Please have a look and let me know if I missed addressing any comments.
Thanks
--
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Few comments:
Comment:
+ buf = (char *) malloc(statbuf->st_size);
+ if (buf == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory")));
+
+ if ((cnt = fread(buf, 1, statbuf->st_size, fp)) > 0)
+ {
+ Bitmapset *mod_blocks = NULL;
+ int nmodblocks = 0;
+
+ if (cnt % BLCKSZ != 0)
+ {
We can use the same size as the full page size.
After pg_start_backup, full-page writes will be enabled.
We can use the same file size to maintain data consistency.
The aim here is to read the entire file in memory, and thus statbuf->st_size is used.
Comment:
Should we check whether it is the same timeline as the system's timeline?
Comment:
Should we support the compression formats supported by pg_basebackup?
This can be an enhancement after the functionality is completed:
uncompress first, combine the backups, then compress if required.
Comment:
We should provide some mechanism to validate the backup, to identify
whether a backup is corrupt or some file is missing (deleted) in a
backup.
Comment:
+/*
+ * When to send the whole file, % blocks modified (90%)
+ */
+#define WHOLE_FILE_THRESHOLD 0.9
+
This could be a user-configurable value (a small sketch of applying such
a threshold follows at the end of these comments).
This can be an enhancement after the functionality is completed.
Comment:
We can add a README file with all the details regarding incremental
backup and combining backups.
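
A small sketch of how such a threshold (whether hard-coded or exposed as
a setting) would be applied; the function and variable names here are
illustrative only, not taken from the patch:

static bool
should_send_whole_file(int nmodblocks, int nblocks, double threshold)
{
    /* Send the complete file once this fraction of its blocks changed. */
    return nblocks > 0 &&
        (double) nmodblocks / (double) nblocks >= threshold;
}

For example, with threshold = 0.9 a 131072-block (1GB) segment would be
sent in full once 117965 or more of its blocks were modified.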
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
On Fri, Aug 16, 2019 at 6:23 AM Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> [ patches ]
Reviewing 0002 and 0003:
- Commit message for 0003 claims magic number and checksum are 0, but
that (fortunately) doesn't seem to be the case.
- looks_like_rel_name actually checks whether it looks like a
*non-temporary* relation name; suggest adjusting the function name.
- The names do_full_backup and do_incremental_backup are quite
confusing because you're really talking about what to do with one
file. I suggest sendCompleteFile() and sendPartialFile().
- Is there any good reason to have 'refptr' as a global variable, or
could we just pass the LSN around via function arguments? I know it's
just mimicking startptr, but storing startptr in a global variable
doesn't seem like a great idea either, so if it's not too annoying,
let's pass it down via function arguments instead. Also, refptr is a
crappy name (even worse than startptr); whether we end up with a
global variable or a bunch of local variables, let's make the name(s)
clear and unambiguous, like incremental_reference_lsn. Yeah, I know
that's long, but I still think it's better than being unclear.
change their signature, like sendFile(), sendDir(), sendTablespace(), etc.
- do_incremental_backup looks like it can never report an error from
fread(), which is bad. But I see that this is just copied from the
existing code which has the same problem, so I started a separate
thread about that.
- I think that passing cnt and blkindex to verify_page_checksum()
doesn't look very good from an abstraction point of view. Granted,
the existing code isn't great either, but I think this makes the
problem worse. I suggest passing "int backup_distance" to this
function, computed as cnt - BLCKSZ * blkindex. Then, you can
fseek(-backup_distance), fread(BLCKSZ), and then fseek(backup_distance
- BLCKSZ). A sketch of that seek/read/seek pattern appears at the end
of this list.
- While I generally support the use of while and for loops rather than
goto for flow control, a while (1) loop that ends with a break is
functionally a goto anyway. I think there are several ways this could
be revised. The most obvious one is probably to use goto, but I vote
for inverting the sense of the test: if (PageIsNew(page) ||
PageGetLSN(page) >= startptr) break; This approach also saves a level
of indentation for more than half of the function.
- I am not sure that it's a good idea for sendwholefile = true to
result in dumping the entire file onto the wire in a single CopyData
message. I don't know of a concrete problem in typical
configurations, but someone who increases RELSEG_SIZE might be able to
overflow CopyData's length word. At 2GB the length word would be
negative, which might break, and at 4GB it would wrap around, which
would certainly break. See CopyData in
https://www.postgresql.org/docs/12/protocol-message-formats.html To
avoid this issue, and maybe some others, I suggest defining a
reasonably large chunk size, say 1MB as a constant in this file
someplace, and sending the data as a series of chunks of that size.
A sketch of such chunked sending appears at the end of this list.
- I don't think that the way concurrent truncation is handled is
correct for partial files. Right now it just falls through to code
which appends blocks of zeroes in either the complete-file or
partial-file case. I think that logic should be moved into the
function that handles the complete-file case. In the partial-file
case, the blocks that we actually send need to match the list of block
numbers we promised to send. We can't just send the promised blocks
and then tack a bunch of zero-filled blocks onto the end that the file
header doesn't know about.
never sending zeroes at the end in case of partial file.
- For reviewer convenience, please use the -v option to git
format-patch when posting and reposting a patch series. Using -v2,
-v3, etc. on successive versions really helps.
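
Two of the suggestions above may be easier to see in code. First, the
"backup_distance" idea: the sketch below is purely illustrative (the
function name and structure are not taken from the patch), and it assumes
the backend's BLCKSZ plus a stdio FILE * that is currently positioned
just past the "cnt" bytes that were read into memory.

/*
 * Re-read a single block from disk for checksum verification, then
 * restore the original file position.  Illustrative sketch only.
 */
static bool
reread_block(FILE *fp, int cnt, int blkindex, char *pagebuf)
{
    int         backup_distance = cnt - BLCKSZ * blkindex;

    /* Step back to the start of the block we want to re-read. */
    if (fseek(fp, (long) -backup_distance, SEEK_CUR) != 0)
        return false;

    /* Read just that one block. */
    if (fread(pagebuf, 1, BLCKSZ, fp) != BLCKSZ)
        return false;

    /* Step forward again so the caller's position is unchanged. */
    if (fseek(fp, (long) (backup_distance - BLCKSZ), SEEK_CUR) != 0)
        return false;

    return true;
}

Second, the chunked CopyData idea. pq_putmessage() is the existing
backend call for emitting a protocol message, but the helper, the
constant, and the error handling here are assumptions for illustration,
not the patch's code.

/*
 * Send a large buffer as a series of bounded CopyData ('d') messages
 * instead of one huge message, so the message length cannot overflow
 * even with a very large RELSEG_SIZE.
 */
#define SEND_CHUNK_SIZE (1024 * 1024)   /* 1MB per CopyData message */

static void
send_buffer_in_chunks(const char *buf, size_t len)
{
    size_t      offset = 0;

    while (offset < len)
    {
        size_t      thislen = Min(SEND_CHUNK_SIZE, len - offset);

        if (pq_putmessage('d', buf + offset, thislen))
            ereport(ERROR,
                    (errmsg("base backup could not send data, aborting backup")));
        offset += thislen;
    }
}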
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Here are some comments:

Or maybe we can just say:
"cannot verify checksum in file \"%s\""
if checksum requested, disable the checksum and leave it to the
following message:

+ ereport(WARNING,
+ (errmsg("file size (%d) not in multiple of page size (%d), sending whole file",
+ (int) cnt, BLCKSZ)));
I think we should give the user a hint about where to read the input LSN
for an incremental backup, in the --help option as well as in the
documentation. Something like: "To take an incremental backup, please
provide the value of "--lsn" as the "START WAL LOCATION" of the previously
taken full or incremental backup, from its backup_label file."
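
For reference, the backup_label file written for a base backup begins
with lines like the following (the values here are made up):

START WAL LOCATION: 0/9000028 (file 000000010000000000000009)
CHECKPOINT LOCATION: 0/9000060
BACKUP METHOD: streamed
BACKUP FROM: master
START TIME: 2019-09-03 12:00:00 IST
LABEL: pg_basebackup base backup

so the proposed hint would point the user at the value after "START WAL
LOCATION:" (0/9000028 above) as what to pass via --lsn for the next
incremental backup.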
pg_combinebackup:

+static bool made_new_outputdata = false;
+static bool found_existing_outputdata = false;

Both of these are global. I understand that we need them global so that
they are accessible in cleanup_directories_atexit(). But they are passed
to verify_dir_is_empty_or_create() as parameters, which I think is not
needed. Instead, verify_dir_is_empty_or_create() can directly change the
globals.
The current logic assumes the incremental backup directories are to be
provided as input in the serial order the backups were taken. This is a
bit confusing unless clarified in the pg_combinebackup help output and
documentation. I think we should clarify it in both places.
I think scan_directory() should rather be renamed to do_combinebackup().
I am not sure about this renaming. scan_directory() is called recursively
to scan each sub-directory too. If we rename it, then it is not actually
doing a combine-backup recursively; combining backups is a single whole
process.
--
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
On Fri, Aug 16, 2019 at 3:54 PM Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
0003:
+/*
+ * When to send the whole file, % blocks modified (90%)
+ */
+#define WHOLE_FILE_THRESHOLD 0.9
How is this threshold selected? Is it based on some test?
- magic number, currently 0 (4 bytes)
I think in the patch we are using (#define INCREMENTAL_BACKUP_MAGIC
0x494E4352) as a magic number, not 0
Can we break this function down into 2-3 functions? At least creating the
file map can go directly into a separate function.
I have read the 0003 and 0004 patches and these are a few cosmetic comments.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
I have not yet completed the review for 0004, but I have a few more
comments. Tomorrow I will try to complete the review and some testing
as well.
1. It seems that the output full backup generated with
pg_combinebackup also contains the "INCREMENTAL BACKUP REFERENCE WAL
LOCATION". It seems confusing
because now this is a full backup, not the incremental backup.
2.
+ memset(outblocks, 0, sizeof(FileOffset) * RELSEG_SIZE);
I don't think you need to memset this explicitly as you can initialize
the array itself, no?
FileOffset outblocks[RELSEG_SIZE] = {{0}}
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
--
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
One of my colleagues at EDB, Rajkumar Raghuwanshi, while testing this
feature, reported an issue: if a full base backup is taken, a database is
then created, and an incremental backup is then taken, then combining the
full backup with the incremental backup fails.
I had a look over this issue and observed that when the new database is
created, the catalog files are copied as-is into the new directory
corresponding to a newly created database. And as they are just copied,
the LSNs on those pages are not changed. Due to this, the incremental backup
thinks that it is an existing file and thus does not copy the blocks from
these new files, leading to the failure.
I was surprised to learn that even though we are creating new files from
old files, we keep the LSNs unmodified. I didn't see any other parameter
in basebackup which tells that this is a new file since the last LSN or
anything similar.
I tried looking for any other DDL that does something similar, i.e.
creates a new page with an existing LSN, but I could not find any
commands other than CREATE DATABASE and ALTER DATABASE .. SET TABLESPACE.
Suggestions/thoughts?
--
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
>
>
>
> On Tue, Aug 27, 2019 at 4:46 PM vignesh C <vignesh21@gmail.com> wrote:
>>
>> Few comments:
>> Comment:
>> + buf = (char *) malloc(statbuf->st_size);
>> + if (buf == NULL)
>> + ereport(ERROR,
>> + (errcode(ERRCODE_OUT_OF_MEMORY),
>> + errmsg("out of memory")));
>> +
>> + if ((cnt = fread(buf, 1, statbuf->st_size, fp)) > 0)
>> + {
>> + Bitmapset *mod_blocks = NULL;
>> + int nmodblocks = 0;
>> +
>> + if (cnt % BLCKSZ != 0)
>> + {
>>
>> We can use same size as full page size.
>> After pg start backup full page write will be enabled.
>> We can use the same file size to maintain data consistency.
>
>
> Can you please explain which size?
> The aim here is to read entire file in-memory and thus used statbuf->st_size.
>
Instead of reading the whole file here, we can read the file page by page. There is a possibility of data inconsistency if the data is not read page by page; the data will be consistent if it is read page by page, as full-page writes will be enabled at this time.
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Fri, Sep 13, 2019 at 1:08 PM vignesh C <vignesh21@gmail.com> wrote: > Instead of reading the whole file here, we can read the file page by page. There is a possibility of data inconsistencyif data is not read page by page, data will be consistent if read page by page as full page write will be enabledat this time. I think you are confused about what "full page writes" means. It has to do what gets written to the write-ahead log, not the way that the pages themselves are written. There is no portable way to ensure that an 8kB read or write is atomic, and generally it isn't. It shouldn't matter whether the file is read all at once, page by page, or byte by byte, except for performance. Recovery is going to run when that backup is restored, and any inconsistencies should get fixed up at that time. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Sep 12, 2019 at 9:13 AM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote: > I had a look over this issue and observed that when the new database is > created, the catalog files are copied as-is into the new directory > corresponding to a newly created database. And as they are just copied, > the LSN on those pages are not changed. Due to this incremental backup > thinks that its an existing file and thus do not copy the blocks from > these new files, leading to the failure. *facepalm* Well, this shoots a pretty big hole in my design for this feature. I don't know why I didn't think of this when I wrote out that design originally. Ugh. Unless we change the way that CREATE DATABASE and any similar operations work so that they always stamp pages with new LSNs, I think we have to give up on the idea of being able to take an incremental backup by just specifying an LSN. We'll instead need to get a list of files from the server first, and then request the entirety of any that we don't have, plus the changed blocks from the ones that we do have. I guess that will make Stephen happy, since it's more like the design he wanted originally (and should generalize more simply to parallel backup). One question I have is: is there any scenario in which an existing page gets modified after the full backup and before the incremental backup but does not end up with an LSN that follows the full backup's start LSN? If there is, then the whole concept of using LSNs to tell which blocks have been modified doesn't really work. I can't think of a way that can happen off-hand, but then, I thought my last design was good, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Sep 16, 2019 at 7:22 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Sep 12, 2019 at 9:13 AM Jeevan Chalke > <jeevan.chalke@enterprisedb.com> wrote: > > I had a look over this issue and observed that when the new database is > > created, the catalog files are copied as-is into the new directory > > corresponding to a newly created database. And as they are just copied, > > the LSN on those pages are not changed. Due to this incremental backup > > thinks that its an existing file and thus do not copy the blocks from > > these new files, leading to the failure. > > *facepalm* > > Well, this shoots a pretty big hole in my design for this feature. I > don't know why I didn't think of this when I wrote out that design > originally. Ugh. > > Unless we change the way that CREATE DATABASE and any similar > operations work so that they always stamp pages with new LSNs, I think > we have to give up on the idea of being able to take an incremental > backup by just specifying an LSN. > This seems to be a blocking problem for the LSN based design. Can we think of using creation time for file? Basically, if the file creation time is later than backup-labels "START TIME:", then include that file entirely. I think one big point against this is clock skew like what if somebody tinkers with the clock. And also, this can cover cases like what Jeevan has pointed but might not cover other cases which we found problematic. > We'll instead need to get a list of > files from the server first, and then request the entirety of any that > we don't have, plus the changed blocks from the ones that we do have. > I guess that will make Stephen happy, since it's more like the design > he wanted originally (and should generalize more simply to parallel > backup). > > One question I have is: is there any scenario in which an existing > page gets modified after the full backup and before the incremental > backup but does not end up with an LSN that follows the full backup's > start LSN? > I think the operations covered by WAL flag XLR_SPECIAL_REL_UPDATE will have similar problems. One related point is how do incremental backups handle the case where vacuum truncates the relation partially? Basically, with current patch/design, it doesn't appear that such information can be passed via incremental backup. I am not sure if this is a problem, but it would be good if we can somehow handle this. Isn't some operations where at the end we directly call heap_sync without writing WAL will have a similar problem as well? Similarly, it is not very clear if unlogged relations are handled in some way if not, the same could be documented. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 16, 2019 at 4:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > This seems to be a blocking problem for the LSN based design. Well, only the simplest version of it, I think. > Can we think of using creation time for file? Basically, if the file > creation time is later than backup-labels "START TIME:", then include > that file entirely. I think one big point against this is clock skew > like what if somebody tinkers with the clock. And also, this can > cover cases like > what Jeevan has pointed but might not cover other cases which we found > problematic. Well that would mean, for example, that if you copied the data directory from one machine to another, the next "incremental" backup would turn into a full backup. That sucks. And in other situations, like resetting the clock, it could mean that you end up with a corrupt backup without any real ability for PostgreSQL to detect it. I'm not saying that it is impossible to create a practically useful system based on file time stamps, but I really don't like it. > I think the operations covered by WAL flag XLR_SPECIAL_REL_UPDATE will > have similar problems. I'm not sure quite what you mean by that. Can you elaborate? It appears to me that the XLR_SPECIAL_REL_UPDATE operations are all things that create files, remove files, or truncate files, and the sketch in my previous email would handle the first two of those cases correctly. See below for the third. > One related point is how do incremental backups handle the case where > vacuum truncates the relation partially? Basically, with current > patch/design, it doesn't appear that such information can be passed > via incremental backup. I am not sure if this is a problem, but it > would be good if we can somehow handle this. As to this, if you're taking a full backup of a particular file, there's no problem. If you're taking a partial backup of a particular file, you need to include the current length of the file and the identity and contents of each modified block. Then you're fine. > Isn't some operations where at the end we directly call heap_sync > without writing WAL will have a similar problem as well? Maybe. Can you give an example? > Similarly, > it is not very clear if unlogged relations are handled in some way if > not, the same could be documented. I think that we don't need to back up the contents of unlogged relations at all, right? Restoration from an online backup always involves running recovery, and so unlogged relations will anyway get zapped. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
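
As a purely illustrative sketch of what "the current length of the file
plus the identity and contents of each modified block" could look like
on disk (this is not the layout used by the posted patches, and the type
and field names are made up), a partial-file header might carry:

typedef struct PartialFileHeader
{
    uint32      magic;          /* e.g. INCREMENTAL_BACKUP_MAGIC */
    uint32      checksum;       /* checksum of this header */
    uint32      nblocks;        /* current length of the file, in blocks */
    uint32      nmodblocks;     /* number of modified blocks included */
    /*
     * Followed by uint32 blocknos[nmodblocks] and then nmodblocks * BLCKSZ
     * bytes of block contents, at offsets computable from this header.
     */
} PartialFileHeader;

With the truncated-to length recorded explicitly, the combine tool can
shorten the reconstructed file to nblocks rather than padding it with
zero blocks that the header doesn't know about.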
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Sep 16, 2019 at 4:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Can we think of using creation time for file? Basically, if the file
> > creation time is later than backup-labels "START TIME:", then include
> > that file entirely. I think one big point against this is clock skew
> > like what if somebody tinkers with the clock. And also, this can
> > cover cases like what Jeevan has pointed but might not cover other
> > cases which we found problematic.
>
> Well that would mean, for example, that if you copied the data
> directory from one machine to another, the next "incremental" backup
> would turn into a full backup. That sucks. And in other situations,
> like resetting the clock, it could mean that you end up with a corrupt
> backup without any real ability for PostgreSQL to detect it. I'm not
> saying that it is impossible to create a practically useful system
> based on file time stamps, but I really don't like it.

In a number of cases, trying to make sure that on a failover or copy of
the backup the next 'incremental' is really an 'incremental' is
dangerous. A better strategy to address this, and the other issues
realized on this thread recently, is to:

- Have a manifest of every file in each backup
- Always back up new files that weren't in the prior backup
- Keep a checksum of each file
- Track the timestamp of each file as of when it was backed up
- Track the file size of each file
- Track the starting timestamp of each backup
- Always include files with a modification time after the starting
  timestamp of the prior backup, or if the file size has changed
- In the event of any anomalies (which includes things like a timeline
  switch), use checksum matching (aka 'delta checksum backup') to
  perform the backup instead of using timestamps (or just always do that
  if you want to be particularly careful- having an option for it is
  great)
- Probably other things I'm not thinking of off-hand, but this is at
  least a good start. Make sure to checksum this information too.

(A sketch of what a per-file manifest entry along these lines might
record follows at the end of this message.)

I agree entirely that it is dangerous to simply rely on creation time as
compared to some other time, or to rely on modification time of a given
file across multiple backups (which has been shown to reliably cause
corruption, at least with rsync and its 1-second granularity on
modification time).

By having a manifest for each backed up file for each backup, you also
gain the ability to validate that a backup in the repository hasn't been
corrupted post-backup, a feature that at least some other database
backup and restore systems have (referring specifically to the big O in
this particular case, but I bet others do too).

Having a system of keeping track of which backups are full and which are
differential in an overall system also gives you the ability to do
things like expiration in a sensible way, including handling WAL
expiration.

As also mentioned up-thread, this likely also allows you to have a
simpler approach to parallelizing the overall backup.
I'd like to clarify that while I would like to have an easier way to parallelize backups, that's a relatively minor complaint- the much bigger issue that I have with this feature is that trying to address everything correctly while having only the amount of information that could be passed on the command-line about the prior full/incremental is going to be extremely difficult, complicated, and likely to lead to subtle bugs in the actual code, and probably less than subtle bugs in how users end up using it, since they'll have to implement the expiration and tracking of information between backups themselves (unless something's changed in that part during this discussion- I admit that I've not read every email in this thread). > > One related point is how do incremental backups handle the case where > > vacuum truncates the relation partially? Basically, with current > > patch/design, it doesn't appear that such information can be passed > > via incremental backup. I am not sure if this is a problem, but it > > would be good if we can somehow handle this. > > As to this, if you're taking a full backup of a particular file, > there's no problem. If you're taking a partial backup of a particular > file, you need to include the current length of the file and the > identity and contents of each modified block. Then you're fine. I would also expect this to be fine but if there's an example of where this is an issue, please share. The only issue that I can think of off-hand is orphaned-file risk, whereby you have something like CREATE DATABASE or perhaps ALTER TABLE .. SET TABLESPACE or such, take a backup while that's happening, but that doesn't complete during the backup (or recovery, or perhaps even in some other scenarios, it's unfortunately quite complicated). This orphaned file risk isn't newly discovered but fixing it is pretty complicated- would love to discuss ideas around how to handle it. > > Isn't some operations where at the end we directly call heap_sync > > without writing WAL will have a similar problem as well? > > Maybe. Can you give an example? I'd be curious to hear what the concern is here also. > > Similarly, > > it is not very clear if unlogged relations are handled in some way if > > not, the same could be documented. > > I think that we don't need to back up the contents of unlogged > relations at all, right? Restoration from an online backup always > involves running recovery, and so unlogged relations will anyway get > zapped. Unlogged relations shouldn't be in the backup at all, since, yes, they get zapped at the start of recovery. We recently taught pg_basebackup how to avoid backing them up so this shouldn't be an issue, as they should be skipped for incrementals as well as fulls. I expect the orphaned file problem also exists for UNLOGGED->LOGGED transitions. Thanks, Stephen
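
For illustration only, a per-file manifest entry along the lines listed
above might record something like this (the type and field names are
made up, not taken from any existing tool):

typedef struct ManifestEntry
{
    char        path[MAXPGPATH];    /* path relative to the data directory */
    uint64      size;               /* file size at backup time */
    time_t      mtime;              /* last-modification time at backup time */
    char        checksum[65];       /* e.g. hex-encoded SHA-256 of the contents */
} ManifestEntry;

A manifest would then be the list of such entries plus a header with the
backup's start LSN and start timestamp, and a checksum over the manifest
itself so that the manifest, too, can be validated.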
On Mon, Sep 16, 2019 at 9:30 AM Robert Haas <robertmhaas@gmail.com> wrote: > > Isn't some operations where at the end we directly call heap_sync > > without writing WAL will have a similar problem as well? > > Maybe. Can you give an example? Looking through the code, I found two cases where we do this. One is a bulk insert operation with wal_level = minimal, and the other is CLUSTER or VACUUM FULL with wal_level = minimal. In both of these cases we are generating new blocks whose LSNs will be 0/0. So, I think we need a rule that if the server is asked to back up all blocks in a file with LSNs > some threshold LSN, it must also include any blocks whose LSN is 0/0. Those blocks are either uninitialized or are populated without WAL logging, so they always need to be copied. Outside of unlogged and temporary tables, I don't know of any case where make a critical modification to an already-existing block without bumping the LSN. I hope there is no such case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
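
A minimal sketch of that rule, using the existing page-inspection macros
(the function itself is hypothetical, named here only for illustration):

static bool
block_needs_backup(Page page, XLogRecPtr ref_lsn)
{
    XLogRecPtr  page_lsn = PageGetLSN(page);

    /* Uninitialized pages always get copied. */
    if (PageIsNew(page))
        return true;

    /* LSN 0/0: populated without WAL logging, e.g. under wal_level=minimal. */
    if (XLogRecPtrIsInvalid(page_lsn))
        return true;

    /* Otherwise copy only blocks changed since the reference LSN. */
    return page_lsn >= ref_lsn;
}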
On Mon, Sep 16, 2019 at 10:38 AM Stephen Frost <sfrost@snowman.net> wrote: > In a number of cases, trying to make sure that on a failover or copy of > the backup the next 'incremental' is really an 'incremental' is > dangerous. A better strategy to address this, and the other issues > realized on this thread recently, is to: > > - Have a manifest of every file in each backup > - Always back up new files that weren't in the prior backup > - Keep a checksum of each file > - Track the timestamp of each file as of when it was backed up > - Track the file size of each file > - Track the starting timestamp of each backup > - Always include files with a modification time after the starting > timestamp of the prior backup, or if the file size has changed > - In the event of any anomolies (which includes things like a timeline > switch), use checksum matching (aka 'delta checksum backup') to > perform the backup instead of using timestamps (or just always do that > if you want to be particularly careful- having an option for it is > great) > - Probably other things I'm not thinking of off-hand, but this is at > least a good start. Make sure to checksum this information too. I agree with some of these ideas but not all of them. I think having a backup manifest is a good idea; that would allow taking a new incremental backup to work from the manifest rather than the data directory, which could be extremely useful, because it might be a lot faster and the manifest could also be copied to a machine other than the one where the entire backup is stored. If the backup itself has been pushed off to S3 or whatever, you can't access it quickly, but you could keep the manifest around. I also agree that backing up all files that weren't in the previous backup is a good strategy. I proposed that fairly explicitly a few emails back; but also, the contrary is obviously nonsense. And I also agree with, and proposed, that we record the size along with the file. I don't really agree with your comments about checksums and timestamps. I think that, if possible, there should be ONE method of determining whether a block has changed in some important way, and I think if we can make LSN work, that would be for the best. If you use multiple methods of detecting changes without any clearly-defined reason for so doing, maybe what you're saying is that you don't really believe that any of the methods are reliable but if we throw the kitchen sink at the problem it should come out OK. Any bugs in one mechanism are likely to be masked by one of the others, but that's not as as good as one method that is known to be altogether reliable. > By having a manifest for each backed up file for each backup, you also > gain the ability to validate that a backup in the repository hasn't been > corrupted post-backup, a feature that at least some other database > backup and restore systems have (referring specifically to the big O in > this particular case, but I bet others do too). Agreed. The manifest only lets you validate to a limited extent, but that's still useful. > Having a system of keeping track of which backups are full and which are > differential in an overall system also gives you the ability to do > things like expiration in a sensible way, including handling WAL > expiration. True, but I'm not sure that functionality belongs in core. 
It certainly needs to be possible for out-of-core code to do this part of the work if desired, because people want to integrate with enterprise backup systems, and we can't come in and say, well, you back up everything else using Netbackup or Tivoli, but for PostgreSQL you have to use pg_backrest. I mean, maybe you can win that argument, but I know I can't. > I'd like to clarify that while I would like to have an easier way to > parallelize backups, that's a relatively minor complaint- the much > bigger issue that I have with this feature is that trying to address > everything correctly while having only the amount of information that > could be passed on the command-line about the prior full/incremental is > going to be extremely difficult, complicated, and likely to lead to > subtle bugs in the actual code, and probably less than subtle bugs in > how users end up using it, since they'll have to implement the > expiration and tracking of information between backups themselves > (unless something's changed in that part during this discussion- I admit > that I've not read every email in this thread). Well, the evidence seems to show that you are right, at least to some extent. I consider it a positive good if the client needs to give the server only a limited amount of information. After all, you could always take an incremental backup by shipping every byte of the previous backup to the server, having it compare everything to the current contents, and having it then send you back the stuff that is new or different. But that would be dumb, because most of the point of an incremental backup is to save on sending lots of data over the network unnecessarily. Now, it seems that I took that goal to an unhealthy extreme, because as we've now realized, sending only an LSN and nothing else isn't enough to get a correct backup. So we need to send more, and it doesn't have to be the absolutely most stripped-down, bear-bones version of what could be sent. But it should be fairly minimal, I think; that's kinda the point of the feature. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Mon, Sep 16, 2019 at 10:38 AM Stephen Frost <sfrost@snowman.net> wrote: > > In a number of cases, trying to make sure that on a failover or copy of > > the backup the next 'incremental' is really an 'incremental' is > > dangerous. A better strategy to address this, and the other issues > > realized on this thread recently, is to: > > > > - Have a manifest of every file in each backup > > - Always back up new files that weren't in the prior backup > > - Keep a checksum of each file > > - Track the timestamp of each file as of when it was backed up > > - Track the file size of each file > > - Track the starting timestamp of each backup > > - Always include files with a modification time after the starting > > timestamp of the prior backup, or if the file size has changed > > - In the event of any anomolies (which includes things like a timeline > > switch), use checksum matching (aka 'delta checksum backup') to > > perform the backup instead of using timestamps (or just always do that > > if you want to be particularly careful- having an option for it is > > great) > > - Probably other things I'm not thinking of off-hand, but this is at > > least a good start. Make sure to checksum this information too. > > I agree with some of these ideas but not all of them. I think having > a backup manifest is a good idea; that would allow taking a new > incremental backup to work from the manifest rather than the data > directory, which could be extremely useful, because it might be a lot > faster and the manifest could also be copied to a machine other than > the one where the entire backup is stored. If the backup itself has > been pushed off to S3 or whatever, you can't access it quickly, but > you could keep the manifest around. Yes, those are also good reasons for having a manifest. > I also agree that backing up all files that weren't in the previous > backup is a good strategy. I proposed that fairly explicitly a few > emails back; but also, the contrary is obviously nonsense. And I also > agree with, and proposed, that we record the size along with the file. Sure, I didn't mean to imply that there was something wrong with that. Including the checksum and other metadata is also valuable, both for helping to identify corruption in the backup archive and for forensics, if not for other reasons. > I don't really agree with your comments about checksums and > timestamps. I think that, if possible, there should be ONE method of > determining whether a block has changed in some important way, and I > think if we can make LSN work, that would be for the best. If you use > multiple methods of detecting changes without any clearly-defined > reason for so doing, maybe what you're saying is that you don't really > believe that any of the methods are reliable but if we throw the > kitchen sink at the problem it should come out OK. Any bugs in one > mechanism are likely to be masked by one of the others, but that's not > as as good as one method that is known to be altogether reliable. I disagree with this on a couple of levels. The first is pretty simple- we don't have all of the information. The user may have some reason to believe that timestamp-based is a bad idea, for example, and therefore having an option to perform a checksum-based backup makes sense. rsync is a pretty good tool in my view and it has a very similar option- because there are trade-offs to be made. 
LSN is great, if you don't mind reading every file of your database start-to-finish every time, but in a running system which hasn't suffered from clock skew or other odd issues (some of which we can also detect), it's pretty painful to scan absolutely everything like that for an incremental. Perhaps the discussion has already moved on to having some way of our own to track if a given file has changed without having to scan all of it- if so, that's a discussion I'd be interested in. I'm not against other approaches here besides timestamps if there's a solid reason why they're better and they're also able to avoid scanning the entire database. > > By having a manifest for each backed up file for each backup, you also > > gain the ability to validate that a backup in the repository hasn't been > > corrupted post-backup, a feature that at least some other database > > backup and restore systems have (referring specifically to the big O in > > this particular case, but I bet others do too). > > Agreed. The manifest only lets you validate to a limited extent, but > that's still useful. If you track the checksum of the file in the manifest then it's a pretty strong validation that the backup repo hasn't been corrupted between the backup and the restore. Of course, the database could have been corrupted at the source, and perhaps that's what you were getting at with your 'limited extent' but that isn't what I was referring to. Claiming that the backup has been 'validated' by only looking at file sizes certainly wouldn't be acceptable. I can't imagine you were suggesting that as you're certainly capable of realizing that, but I got the feeling you weren't agreeing that having the checksum of the file made sense to include in the manifest, so I feel like I'm missing something here. > > Having a system of keeping track of which backups are full and which are > > differential in an overall system also gives you the ability to do > > things like expiration in a sensible way, including handling WAL > > expiration. > > True, but I'm not sure that functionality belongs in core. It > certainly needs to be possible for out-of-core code to do this part of > the work if desired, because people want to integrate with enterprise > backup systems, and we can't come in and say, well, you back up > everything else using Netbackup or Tivoli, but for PostgreSQL you have > to use pg_backrest. I mean, maybe you can win that argument, but I > know I can't. I'm pretty baffled by this argument, particularly in this context. We already have tooling around trying to manage WAL archives in core- see pg_archivecleanup. Further, we're talking about pg_basebackup here, not about Netbackup or Tivoli, and the results of a pg_basebackup (that is, a set of tar files, or a data directory) could happily be backed up using whatever Enterprise tool folks want to use- in much the same way that a pgbackrest repo is also able to be backed up using whatever Enterprise tools someone wishes to use. We designed it quite carefully to work with exactly that use-case, so the distinction here is quite lost on me. Perhaps you could clarify what use-case these changes to pg_basebackup solve, when working with a Netbackup or Tivoli system, that pgbackrest doesn't, since you bring it up here? 
> > I'd like to clarify that while I would like to have an easier way to > > parallelize backups, that's a relatively minor complaint- the much > > bigger issue that I have with this feature is that trying to address > > everything correctly while having only the amount of information that > > could be passed on the command-line about the prior full/incremental is > > going to be extremely difficult, complicated, and likely to lead to > > subtle bugs in the actual code, and probably less than subtle bugs in > > how users end up using it, since they'll have to implement the > > expiration and tracking of information between backups themselves > > (unless something's changed in that part during this discussion- I admit > > that I've not read every email in this thread). > > Well, the evidence seems to show that you are right, at least to some > extent. I consider it a positive good if the client needs to give the > server only a limited amount of information. After all, you could > always take an incremental backup by shipping every byte of the > previous backup to the server, having it compare everything to the > current contents, and having it then send you back the stuff that is > new or different. But that would be dumb, because most of the point of > an incremental backup is to save on sending lots of data over the > network unnecessarily. Now, it seems that I took that goal to an > unhealthy extreme, because as we've now realized, sending only an LSN > and nothing else isn't enough to get a correct backup. So we need to > send more, and it doesn't have to be the absolutely most > stripped-down, bear-bones version of what could be sent. But it should > be fairly minimal, I think; that's kinda the point of the feature. Right- much of the point of an incremental backup feature is to try and minimize the amount of work that's done while still getting a good backup. I don't agree that we should focus solely on network bandwidth as there are also trade-offs to be made around disk bandwidth to consider, see above discussion regarding timestamps vs. checksum'ing every file. As for if we should be sending more to the server, or asking the server to send more to us, I don't really have a good feel for what's "best". At least one implementation I'm familiar with builds a manifest on the PG server side and then compares the results of that to the manifest stored with the backup (where that comparison is actually done is on whatever system the "backup" was started from, typically a backup server). Perhaps there's an argument for sending the manifest from the backup repository to PostgreSQL for it to then compare against the data directory but I'm not really sure how it could possibly do that more efficiently and that's moving work to the PG server that it doesn't really need to do. Thanks, Stephen
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Mon, Sep 16, 2019 at 9:30 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > Isn't some operations where at the end we directly call heap_sync > > > without writing WAL will have a similar problem as well? > > > > Maybe. Can you give an example? > > Looking through the code, I found two cases where we do this. One is > a bulk insert operation with wal_level = minimal, and the other is > CLUSTER or VACUUM FULL with wal_level = minimal. In both of these > cases we are generating new blocks whose LSNs will be 0/0. So, I think > we need a rule that if the server is asked to back up all blocks in a > file with LSNs > some threshold LSN, it must also include any blocks > whose LSN is 0/0. Those blocks are either uninitialized or are > populated without WAL logging, so they always need to be copied. I'm not sure I see a way around it but this seems pretty unfortunate- every single incremental backup will have all of those included even though the full backup likely also does (I say likely since someone could do a full backup, set the WAL to minimal, load a bunch of data, and then restart back to a WAL level where we can do a new backup, and then do an incremental, so we don't *know* that the full includes those blocks unless we also track a block-level checksum or similar). Then again, doing these kinds of server bounces to change the WAL level around is, hopefully, relatively rare.. > Outside of unlogged and temporary tables, I don't know of any case > where make a critical modification to an already-existing block > without bumping the LSN. I hope there is no such case. I believe we all do. :) Thanks, Stephen
On Mon, Sep 16, 2019 at 1:10 PM Stephen Frost <sfrost@snowman.net> wrote: > I disagree with this on a couple of levels. The first is pretty simple- > we don't have all of the information. The user may have some reason to > believe that timestamp-based is a bad idea, for example, and therefore > having an option to perform a checksum-based backup makes sense. rsync > is a pretty good tool in my view and it has a very similar option- > because there are trade-offs to be made. LSN is great, if you don't > mind reading every file of your database start-to-finish every time, but > in a running system which hasn't suffered from clock skew or other odd > issues (some of which we can also detect), it's pretty painful to scan > absolutely everything like that for an incremental. There's a separate thread on using WAL-scanning to avoid having to scan all the data every time. I pointed it out to you early in this thread, too. > If you track the checksum of the file in the manifest then it's a pretty > strong validation that the backup repo hasn't been corrupted between the > backup and the restore. Of course, the database could have been > corrupted at the source, and perhaps that's what you were getting at > with your 'limited extent' but that isn't what I was referring to. Yeah, that all seems fair. Without the checksum, you can only validate that you have the right files and that they are the right sizes, which is not bad, but the checksums certainly make it stronger. But, wouldn't having to checksum all of the files add significantly to the cost of taking the backup? If so, I can imagine that some people might want to pay that cost but others might not. If it's basically free to checksum the data while we have it in memory anyway, then I guess there's little to be lost. > I'm pretty baffled by this argument, particularly in this context. We > already have tooling around trying to manage WAL archives in core- see > pg_archivecleanup. Further, we're talking about pg_basebackup here, not > about Netbackup or Tivoli, and the results of a pg_basebackup (that is, > a set of tar files, or a data directory) could happily be backed up > using whatever Enterprise tool folks want to use- in much the same way > that a pgbackrest repo is also able to be backed up using whatever > Enterprise tools someone wishes to use. We designed it quite carefully > to work with exactly that use-case, so the distinction here is quite > lost on me. Perhaps you could clarify what use-case these changes to > pg_basebackup solve, when working with a Netbackup or Tivoli system, > that pgbackrest doesn't, since you bring it up here? I'm not an expert on any of those systems, but I doubt that everybody's OK with backing everything up to a pgbackrest repository and then separately backing up that repository to some other system. That sounds like a pretty large storage cost. > As for if we should be sending more to the server, or asking the server > to send more to us, I don't really have a good feel for what's "best". > At least one implementation I'm familiar with builds a manifest on the > PG server side and then compares the results of that to the manifest > stored with the backup (where that comparison is actually done is on > whatever system the "backup" was started from, typically a backup > server). 
Perhaps there's an argument for sending the manifest from the > backup repository to PostgreSQL for it to then compare against the data > directory but I'm not really sure how it could possibly do that more > efficiently and that's moving work to the PG server that it doesn't > really need to do. I agree with all that, but... if the server builds a manifest on the PG server that is to be compared with the backup's manifest, the one the PG server builds can't really include checksums, I think. To get the checksums, it would have to read the entire cluster while building the manifest, which sounds insane. Presumably it would have to build a checksum-free version of the manifest, and then the client could checksum the files as they're streamed down and write out a revised manifest that adds the checksums. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
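
A toy sketch of that client-side idea: keep a running hash per file and
feed it each chunk as it arrives, so no second pass over the data is
needed. A real implementation would presumably use a cryptographic hash;
64-bit FNV-1a is used here only to keep the sketch self-contained.

#include <stdint.h>
#include <stddef.h>

#define FNV_OFFSET_BASIS    UINT64_C(14695981039346656037)
#define FNV_PRIME           UINT64_C(1099511628211)

/* Fold one received chunk into the running hash. */
static uint64_t
hash_update(uint64_t hash, const unsigned char *data, size_t len)
{
    for (size_t i = 0; i < len; i++)
    {
        hash ^= data[i];
        hash *= FNV_PRIME;
    }
    return hash;
}

/*
 * Per file: start with hash = FNV_OFFSET_BASIS, call hash_update() for
 * every chunk received from the server, and record the final value in
 * the client-side copy of the manifest.
 */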
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Mon, Sep 16, 2019 at 1:10 PM Stephen Frost <sfrost@snowman.net> wrote: > > I disagree with this on a couple of levels. The first is pretty simple- > > we don't have all of the information. The user may have some reason to > > believe that timestamp-based is a bad idea, for example, and therefore > > having an option to perform a checksum-based backup makes sense. rsync > > is a pretty good tool in my view and it has a very similar option- > > because there are trade-offs to be made. LSN is great, if you don't > > mind reading every file of your database start-to-finish every time, but > > in a running system which hasn't suffered from clock skew or other odd > > issues (some of which we can also detect), it's pretty painful to scan > > absolutely everything like that for an incremental. > > There's a separate thread on using WAL-scanning to avoid having to > scan all the data every time. I pointed it out to you early in this > thread, too. As discussed nearby, not everything that needs to be included in the backup is actually going to be in the WAL though, right? How would that ever be able to handle the case where someone starts the server under wal_level = logical, takes a full backup, then restarts with wal_level = minimal, writes out a bunch of new data, and then restarts back to wal_level = logical and takes an incremental? How would we even detect that such a thing happened? > > If you track the checksum of the file in the manifest then it's a pretty > > strong validation that the backup repo hasn't been corrupted between the > > backup and the restore. Of course, the database could have been > > corrupted at the source, and perhaps that's what you were getting at > > with your 'limited extent' but that isn't what I was referring to. > > Yeah, that all seems fair. Without the checksum, you can only validate > that you have the right files and that they are the right sizes, which > is not bad, but the checksums certainly make it stronger. But, > wouldn't having to checksum all of the files add significantly to the > cost of taking the backup? If so, I can imagine that some people might > want to pay that cost but others might not. If it's basically free to > checksum the data while we have it in memory anyway, then I guess > there's little to be lost. On larger systems, so many of the files are 1GB in size that checking the file size is quite close to meaningless. Yes, having to checksum all of the files definitely adds to the cost of taking the backup, but to avoid it we need strong assurances that a given file hasn't been changed since our last full backup. WAL, today at least, isn't quite that, and timestamps can possibly be fooled with, so if you'd like to be particularly careful, there doesn't seem to be a lot of alternatives. > > I'm pretty baffled by this argument, particularly in this context. We > > already have tooling around trying to manage WAL archives in core- see > > pg_archivecleanup. Further, we're talking about pg_basebackup here, not > > about Netbackup or Tivoli, and the results of a pg_basebackup (that is, > > a set of tar files, or a data directory) could happily be backed up > > using whatever Enterprise tool folks want to use- in much the same way > > that a pgbackrest repo is also able to be backed up using whatever > > Enterprise tools someone wishes to use. We designed it quite carefully > > to work with exactly that use-case, so the distinction here is quite > > lost on me. 
Perhaps you could clarify what use-case these changes to > > pg_basebackup solve, when working with a Netbackup or Tivoli system, > > that pgbackrest doesn't, since you bring it up here? > > I'm not an expert on any of those systems, but I doubt that > everybody's OK with backing everything up to a pgbackrest repository > and then separately backing up that repository to some other system. > That sounds like a pretty large storage cost. I'm not asking you to be an expert on those systems, just to help me understand the statements you're making. How is backing up to a pgbackrest repo different than running a pg_basebackup in the context of using some other Enterprise backup system? In both cases, you'll have a full copy of the backup (presumably compressed) somewhere out on a disk or filesystem which is then backed up by the Enterprise tool. > > As for if we should be sending more to the server, or asking the server > > to send more to us, I don't really have a good feel for what's "best". > > At least one implementation I'm familiar with builds a manifest on the > > PG server side and then compares the results of that to the manifest > > stored with the backup (where that comparison is actually done is on > > whatever system the "backup" was started from, typically a backup > > server). Perhaps there's an argument for sending the manifest from the > > backup repository to PostgreSQL for it to then compare against the data > > directory but I'm not really sure how it could possibly do that more > > efficiently and that's moving work to the PG server that it doesn't > > really need to do. > > I agree with all that, but... if the server builds a manifest on the > PG server that is to be compared with the backup's manifest, the one > the PG server builds can't really include checksums, I think. To get > the checksums, it would have to read the entire cluster while building > the manifest, which sounds insane. Presumably it would have to build a > checksum-free version of the manifest, and then the client could > checksum the files as they're streamed down and write out a revised > manifest that adds the checksums. Unless files can be excluded based on some relatively strong criteria, then yes, the approach would be to use checksums of the files and would necessarily include all files, meaning that you'd have to read them all. That's not great, of course, which is why there are trade-offs to be made, one of which typically involves using timestamps, but doing so quite carefully, to perform the file exclusion. Other ideas are great but it seems like WAL isn't really a great idea unless we make some changes there and we, as in PG, haven't got a robust "we know this file changed as of this point" to work from. I worry that we're putting too much faith into a system to do something independent of what it was actually built and designed to do, and thinking that because we could trust it for X, we can trust it for Y. Thanks, Stephen
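For what it's worth, the after-the-fact validation being discussed (has the backup repo been corrupted between backup and restore?) might look roughly like this, assuming the hypothetical manifest layout from the sketch above:

    import hashlib
    import json
    import os

    def verify_backup(backup_dir, manifest_path, chunk_size=64 * 1024):
        # Check every manifest entry for presence, size, and checksum.
        # Returns (path, problem) pairs; an empty list means the backup matches.
        problems = []
        with open(manifest_path) as f:
            manifest = json.load(f)
        for entry in manifest["files"]:
            p = os.path.join(backup_dir, entry["path"])
            if not os.path.exists(p):
                problems.append((entry["path"], "missing"))
                continue
            if os.path.getsize(p) != entry["size"]:
                problems.append((entry["path"], "wrong size"))
                continue
            h = hashlib.sha256()
            with open(p, "rb") as data:
                for chunk in iter(lambda: data.read(chunk_size), b""):
                    h.update(chunk)
            if h.hexdigest() != entry.get("sha256"):
                problems.append((entry["path"], "checksum mismatch"))
        return problems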
On Mon, Sep 16, 2019 at 11:09 PM Stephen Frost <sfrost@snowman.net> wrote: > > Greetings, > > * Robert Haas (robertmhaas@gmail.com) wrote: > > On Mon, Sep 16, 2019 at 9:30 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > Aren't there some operations where at the end we directly call heap_sync > > > > without writing WAL that will have a similar problem as well? > > > > > > Maybe. Can you give an example? > > > > Looking through the code, I found two cases where we do this. One is > > a bulk insert operation with wal_level = minimal, and the other is > > CLUSTER or VACUUM FULL with wal_level = minimal. In both of these > > cases we are generating new blocks whose LSNs will be 0/0. So, I think > > we need a rule that if the server is asked to back up all blocks in a > > file with LSNs > some threshold LSN, it must also include any blocks > > whose LSN is 0/0. Those blocks are either uninitialized or are > > populated without WAL logging, so they always need to be copied. > > I'm not sure I see a way around it but this seems pretty unfortunate- > every single incremental backup will have all of those included even > though the full backup likely also does > Yeah, this is quite unfortunate. One more thing to note is that the same is true for other operations like 'create index' (e.g., nbtree bypasses the buffer manager while creating the index, doesn't write WAL for wal_level = minimal, and then syncs at the end). -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
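A minimal sketch of the block-selection rule described above (illustrative only; it assumes the default 8kB block size and that pd_lsn occupies the first 8 bytes of each page header, stored in native byte order): a block is included if its page LSN is at or above the threshold, or if the page LSN is 0/0.

    import struct

    BLCKSZ = 8192  # assumes the default block size

    def page_lsn(page):
        # pd_lsn is stored as two 32-bit halves at the start of the page header.
        xlogid, xrecoff = struct.unpack("=II", page[:8])
        return (xlogid << 32) | xrecoff

    def blocks_to_send(relfile_path, threshold_lsn):
        # Yield (block number, block data) for blocks an incremental backup
        # would have to include: changed since the threshold, or LSN 0/0
        # (uninitialized pages, or pages populated without WAL logging).
        with open(relfile_path, "rb") as f:
            blkno = 0
            while True:
                page = f.read(BLCKSZ)
                if len(page) < BLCKSZ:
                    break
                lsn = page_lsn(page)
                if lsn == 0 or lsn >= threshold_lsn:
                    yield blkno, page
                blkno += 1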
On Mon, Sep 16, 2019 at 7:00 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Sep 16, 2019 at 4:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > This seems to be a blocking problem for the LSN based design. > > Well, only the simplest version of it, I think. > > > Can we think of using creation time for the file? Basically, if the file > > creation time is later than the backup label's "START TIME:", then include > > that file entirely. I think one big point against this is clock skew, > > like what if somebody tinkers with the clock. And also, this can > > cover cases like > > what Jeevan has pointed out but might not cover other cases which we found > > problematic. > > Well that would mean, for example, that if you copied the data > directory from one machine to another, the next "incremental" backup > would turn into a full backup. That sucks. And in other situations, > like resetting the clock, it could mean that you end up with a corrupt > backup without any real ability for PostgreSQL to detect it. I'm not > saying that it is impossible to create a practically useful system > based on file time stamps, but I really don't like it. > > > I think the operations covered by WAL flag XLR_SPECIAL_REL_UPDATE will > > have similar problems. > > I'm not sure quite what you mean by that. Can you elaborate? It > appears to me that the XLR_SPECIAL_REL_UPDATE operations are all > things that create files, remove files, or truncate files, and the > sketch in my previous email would handle the first two of those cases > correctly. See below for the third. > > > One related point is how do incremental backups handle the case where > > vacuum truncates the relation partially? Basically, with current > > patch/design, it doesn't appear that such information can be passed > > via incremental backup. I am not sure if this is a problem, but it > > would be good if we can somehow handle this. > > As to this, if you're taking a full backup of a particular file, > there's no problem. If you're taking a partial backup of a particular > file, you need to include the current length of the file and the > identity and contents of each modified block. Then you're fine. > Right, this should address that point. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
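To make the truncation point concrete, here is a rough sketch of a partial file that records the relation's current length in blocks alongside the modified blocks (a hypothetical on-disk layout, not the format actually being proposed); reconstruction overlays the blocks on the prior copy and then truncates to the recorded length, which covers the vacuum-truncation case.

    import struct

    BLCKSZ = 8192
    HDR = struct.Struct("=II")  # (number of blocks stored, relation length in blocks)

    def write_partial(path, rel_len_blocks, blocks):
        # blocks: dict mapping block number -> BLCKSZ bytes of page data.
        with open(path, "wb") as f:
            blknos = sorted(blocks)
            f.write(HDR.pack(len(blknos), rel_len_blocks))
            f.write(struct.pack("=%dI" % len(blknos), *blknos))
            for blkno in blknos:
                f.write(blocks[blkno])

    def apply_partial(partial_path, full_path):
        # Merge a partial file into a copy of the prior full relation file.
        with open(partial_path, "rb") as f:
            nblocks, rel_len = HDR.unpack(f.read(HDR.size))
            blknos = struct.unpack("=%dI" % nblocks, f.read(4 * nblocks))
            with open(full_path, "r+b") as out:
                for blkno in blknos:
                    out.seek(blkno * BLCKSZ)
                    out.write(f.read(BLCKSZ))
                out.truncate(rel_len * BLCKSZ)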
On Mon, Sep 16, 2019 at 3:38 PM Stephen Frost <sfrost@snowman.net> wrote: > As discussed nearby, not everything that needs to be included in the > backup is actually going to be in the WAL though, right? How would that > ever be able to handle the case where someone starts the server under > wal_level = logical, takes a full backup, then restarts with wal_level = > minimal, writes out a bunch of new data, and then restarts back to > wal_level = logical and takes an incremental? Fair point. I think the WAL-scanning approach can only work if wal_level > minimal. But, I also think that few people run with wal_level = minimal in this era where the default has been changed to replica; and I think we can detect the WAL level in use while scanning WAL. It can only change at a checkpoint. > On larger systems, so many of the files are 1GB in size that checking > the file size is quite close to meaningless. Yes, having to checksum > all of the files definitely adds to the cost of taking the backup, but > to avoid it we need strong assurances that a given file hasn't been > changed since our last full backup. WAL, today at least, isn't quite > that, and timestamps can possibly be fooled with, so if you'd like to be > particularly careful, there doesn't seem to be a lot of alternatives. I see your points, but it feels like you're trying to talk down the WAL-based approach over what seem to me to be fairly manageable corner cases. > I'm not asking you to be an expert on those systems, just to help me > understand the statements you're making. How is backing up to a > pgbackrest repo different than running a pg_basebackup in the context of > using some other Enterprise backup system? In both cases, you'll have a > full copy of the backup (presumably compressed) somewhere out on a disk > or filesystem which is then backed up by the Enterprise tool. Well, I think that what people really want is to be able to back up straight into the enterprise tool, without an intermediate step. My basic point here is: As with practically all PostgreSQL development, I think we should try to expose capabilities and avoid making policy on behalf of users. I'm not objecting to the idea of having tools that can help users figure out how much WAL they need to retain -- but insofar as we can do it, such tools should work regardless of where that WAL is actually stored. I dislike the idea that PostgreSQL would provide something akin to a "pgbackrest repository" in core, or at least I think it would be important that we're careful about how much functionality gets tied to the presence and use of such a thing, because, at least based on my experience working at EnterpriseDB, larger customers often don't want to do it that way. > That's not great, of course, which is why there are trade-offs to be > made, one of which typically involves using timestamps, but doing so > quite carefully, to perform the file exclusion. Other ideas are great > but it seems like WAL isn't really a great idea unless we make some > changes there and we, as in PG, haven't got a robust "we know this file > changed as of this point" to work from. I worry that we're putting too much faith into a system to do something independent of what it was > actually built and designed to do, and thinking that because we could > trust it for X, we can trust it for Y. That seems like a considerable overreaction to me based on the problems reported thus far. 
The fact is, WAL was originally intended for crash recovery and has subsequently been generalized to be usable for point-in-time recovery, standby servers, and logical decoding. It's clearly established at this point as the canonical way that you know what in the database has changed, which is the same need that we have for incremental backup. At any rate, the same criticism can be leveled - IMHO with a lot more validity - at timestamps. Last-modification timestamps are completely outside of our control; they are owned by the OS and various operating systems can and do have varying behavior. They can go backwards when things have changed; they can go forwards when things have not changed. They were clearly not intended to meet this kind of requirement. If anything, they were intended for that purpose much less than WAL, which was actually designed for a requirement in this general ballpark, if not this thing precisely. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
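On the point about detecting the WAL level while scanning WAL: one rough way a tool might check whether wal_level was ever dropped to minimal between two backups is to scan the intervening WAL for parameter-change records, along these lines (a sketch only; it assumes the WAL for that interval is still available, and the exact record description text printed by pg_waldump varies across versions, so the string matching is illustrative).

    import subprocess

    def wal_level_dropped_to_minimal(wal_dir, start_lsn, end_lsn):
        # Scan XLOG-resource-manager records between the two LSNs and look for
        # a parameter-change record reporting wal_level=minimal.
        out = subprocess.run(
            ["pg_waldump", "--path", wal_dir,
             "--start", start_lsn, "--end", end_lsn,
             "--rmgr", "XLOG"],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in out.splitlines():
            if "PARAMETER_CHANGE" in line and "wal_level=minimal" in line:
                return True
        return False

    # The caller could then refuse the incremental, promote it to a full,
    # or fall back to a checksum-based comparison if this returns True.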
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Mon, Sep 16, 2019 at 3:38 PM Stephen Frost <sfrost@snowman.net> wrote: > > As discussed nearby, not everything that needs to be included in the > > backup is actually going to be in the WAL though, right? How would that > > ever be able to handle the case where someone starts the server under > > wal_level = logical, takes a full backup, then restarts with wal_level = > > minimal, writes out a bunch of new data, and then restarts back to > > wal_level = logical and takes an incremental? > > Fair point. I think the WAL-scanning approach can only work if > wal_level > minimal. But, I also think that few people run with > wal_level = minimal in this era where the default has been changed to > replica; and I think we can detect the WAL level in use while scanning > WAL. It can only change at a checkpoint. We need to be sure that we can detect if the WAL level has ever been set to minimal between a full and an incremental and, if so, either refuse to run the incremental, or promote it to a full, or make it a checksum-based incremental instead of trusting the WAL stream. I'm also glad that we ended up changing the default though and I do hope that there's relatively few people running with minimal and that there's even fewer who play around with flipping it back and forth. > > On larger systems, so many of the files are 1GB in size that checking > > the file size is quite close to meaningless. Yes, having to checksum > > all of the files definitely adds to the cost of taking the backup, but > > to avoid it we need strong assurances that a given file hasn't been > > changed since our last full backup. WAL, today at least, isn't quite > > that, and timestamps can possibly be fooled with, so if you'd like to be > > particularly careful, there doesn't seem to be a lot of alternatives. > > I see your points, but it feels like you're trying to talk down the > WAL-based approach over what seem to me to be fairly manageable corner > cases. Just to be clear, I see your points and I like the general idea of finding solutions, but it seems like the issues are likely to be pretty complex and I'm not sure that's being appreciated very well. > > I'm not asking you to be an expert on those systems, just to help me > > understand the statements you're making. How is backing up to a > > pgbackrest repo different than running a pg_basebackup in the context of > > using some other Enterprise backup system? In both cases, you'll have a > > full copy of the backup (presumably compressed) somewhere out on a disk > > or filesystem which is then backed up by the Enterprise tool. > > Well, I think that what people really want is to be able to backup > straight into the enterprise tool, without an intermediate step. Ok.. I can understand that but I don't get how these changes to pg_basebackup will help facilitate that. If they don't and what you're talking about here is independent, then great, that clarifies things, but if you're saying that these changes to pg_basebackup are to help with backing up directly into those Enterprise systems then I'm just asking for some help in understanding how- what's the use-case here that we're adding to pg_basebackup that makes it work with these Enterprise systems? I'm not trying to be difficult here, I'm just trying to understand. > My basic point here is: As with practically all PostgreSQL > development, I think we should try to expose capabilities and avoid > making policy on behalf of users. 
> > I'm not objecting to the idea of having tools that can help users > figure out how much WAL they need to retain -- but insofar as we can > do it, such tools should work regardless of where that WAL is actually > stored. How would that tool work, if it's to be able to work regardless of where the WAL is actually stored..? Today, pg_archivecleanup just works against a POSIX filesystem- are you thinking that the tool would have a pluggable storage system, so that it could work with, say, a POSIX filesystem, or a CIFS mount, or a s3-like system? > I dislike the idea that PostgreSQL would provide something > akin to a "pgbackrest repository" in core, or at least I think it > would be important that we're careful about how much functionality > gets tied to the presence and use of such a thing, because, at least > based on my experience working at EnterpriseDB, larger customers often > don't want to do it that way. This seems largely independent of the above discussion, but since we're discussing it, I've certainly had various experiences in this area too- some larger customers would like to use an s3-like store (which pgbackrest already supports and will be supporting others going forward as it has a pluggable storage mechanism for the repo...), and then there are customers who would like to point their Enterprise backup solution at a directory on disk to back it up (which pgbackrest also supports, as mentioned previously), and lastly there are customers who really want to just back up the PG data directory and they'd like it to "just work", thank you, and no they don't have any thought or concern about how to handle WAL, but surely it can't be that important, can it? The last is tongue-in-cheek and I'm half-kidding there, but this is why I was trying to understand the comments above about what the use-case is here that we're trying to solve for that answers the call for the Enterprise software crowd, and ideally what distinguishes that from pgbackrest, but just the clear cut "this is what this change will do to make pg_basebackup work for Enterprise customers" would be great, or even a "well, pg_basebackup today works for them because it does X and it'll continue to be able to do X even after this change." I'll take a wild shot in the dark to try to help move us through this- is it that pg_basebackup can stream out to stdout in some cases..? Though that's quite limited since it means you can't have additional tablespaces and you can't stream the WAL, and how would that work with the manifest idea that's being discussed..? If there's a directory that's got manifest files in it for each backup, so we have the file sizes for them, those would need to be accessible when we go to do the incremental backup and couldn't be stored off somewhere else, I wouldn't think.. > > That's not great, of course, which is why there are trade-offs to be > > made, one of which typically involves using timestamps, but doing so > > quite carefully, to perform the file exclusion. Other ideas are great > > but it seems like WAL isn't really a great idea unless we make some > > changes there and we, as in PG, haven't got a robust "we know this file > > changed as of this point" to work from. I worry that we're putting too > > much faith into a system to do something independent of what it was > > actually built and designed to do, and thinking that because we could > > trust it for X, we can trust it for Y. > > That seems like a considerable overreaction to me based on the > problems reported thus far. 
The fact is, WAL was originally intended > for crash recovery and has subsequently been generalized to be usable > for point-in-time recovery, standby servers, and logical decoding. > It's clearly established at this point as the canonical way that you > know what in the database has changed, which is the same need that we > have for incremental backup. Provided the WAL level is at the level that you need it to be, that will be true for things which are actually supported with PITR, replication to standby servers, et al. I can see how it might come across as an overreaction but this strikes me as a pretty glaring issue and I worry that, if it was overlooked until now, there'll be other more subtle issues, and backups are just plain complicated to get right to begin with, something that I don't think people appreciate until they've been dealing with them for quite a while. Not that this would be the first time we've had issues in this area, and we'd likely work through them over time, but I'm sure we'd all prefer to get it as close to right as possible the first time around, and that's going to require some pretty in-depth review. > At any rate, the same criticism can be leveled - IMHO with a lot more > validity - at timestamps. Last-modification timestamps are completely > outside of our control; they are owned by the OS and various operating > systems can and do have varying behavior. They can go backwards when > things have changed; they can go forwards when things have not > changed. They were clearly not intended to meet this kind of > requirement. If anything, they were intended for that purpose much less > than WAL, which was actually designed for a requirement in this > general ballpark, if not this thing precisely. While I understand that timestamps may be used for a lot of things and that the time on a system could go forward or backward, the actual requirement is: - If the file was modified after the backup was done, the timestamp (or the size) needs to be different. Doesn't actually matter if it's forwards, or backwards, different is all that's needed. The timestamp also needs to be before the backup started for it to be considered an option to skip it. Is it possible for that to be fooled? Yes, of course, but it isn't as easily fooled as your typical "just copy files newer than X" issue that other tools have, at least, if you're keeping a manifest of all of the files, et al, as discussed earlier. Thanks, Stephen
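Expressed as code, the skip rule being described might look something like this (hypothetical manifest fields): a file may be skipped only if its size and mtime both match what the manifest recorded and the mtime predates the start of the prior backup.

    import os

    def can_skip(path, manifest_entry, prior_backup_start_time):
        # Skip only when nothing observable has changed and the file was last
        # modified before the prior backup began; anything else is re-copied
        # (or re-checksummed).
        st = os.stat(path)
        return (st.st_size == manifest_entry["size"]
                and st.st_mtime == manifest_entry["mtime"]
                and st.st_mtime < prior_backup_start_time)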
On Tue, Sep 17, 2019 at 12:09 PM Stephen Frost <sfrost@snowman.net> wrote: > We need to be sure that we can detect if the WAL level has ever been set > to minimal between a full and an incremental and, if so, either refuse > to run the incremental, or promote it to a full, or make it a > checksum-based incremental instead of trusting the WAL stream. Sure. What about checksum collisions? > Just to be clear, I see your points and I like the general idea of > finding solutions, but it seems like the issues are likely to be pretty > complex and I'm not sure that's being appreciated very well. Definitely possible, but it's more helpful if you can point out the actual issues. > Ok.. I can understand that but I don't get how these changes to > pg_basebackup will help facilitate that. If they don't and what you're > talking about here is independent, then great, that clarifies things, > but if you're saying that these changes to pg_basebackup are to help > with backing up directly into those Enterprise systems then I'm just > asking for some help in understanding how- what's the use-case here that > we're adding to pg_basebackup that makes it work with these Enterprise > systems? > > I'm not trying to be difficult here, I'm just trying to understand. Man, I feel like we're totally drifting off into the weeds here. I'm not arguing that these changes to pg_basebackup will help enterprise users except insofar as those users want incremental backup. All of this discussion started with this comment from you: "Having a system of keeping track of which backups are full and which are differential in an overall system also gives you the ability to do things like expiration in a sensible way, including handling WAL expiration." All I was doing was saying that for an enterprise user, the overall system might be something entirely outside of our control, like NetBackup or Tivoli. Therefore, whatever functionality we provide to do that kind of thing should be able to be used in such contexts. That hardly seems like a controversial proposition. > How would that tool work, if it's to be able to work regardless of where > the WAL is actually stored..? Today, pg_archivecleanup just works > against a POSIX filesystem- are you thinking that the tool would have a > pluggable storage system, so that it could work with, say, a POSIX > filesystem, or a CIFS mount, or a s3-like system? Again, I was making a general statement about design goals -- "we should try to work nicely with enterprise backup products" -- not proposing a specific design for a specific thing. I don't think the idea of some pluggability in that area is a bad one, but it's not even slightly what this thread is about. > Provided the WAL level is at the level that you need it to be that will > be true for things which are actually supported with PITR, replication > to standby servers, et al. I can see how it might come across as an > overreaction but this strikes me as a pretty glaring issue and I worry > that if it was overlooked until now that there'll be other more subtle > issues, and backups are just plain complicated to get right, just to > begin with already, something that I don't think people appreciate until > they've been dealing with them for quite a while. Permit me to be unpersuaded. If it was such a glaring issue, and if experience is the key to spotting such issues, then why didn't YOU spot it? I'm not arguing that this stuff isn't hard. It is. Nor am I arguing that I didn't screw up. I did. 
But designs need to be accepted or rejected based on facts, not FUD. You've raised some good technical points and if you've got more concerns, I'd like to hear them, but I don't think arguing vaguely that a certain approach will probably run into trouble gets us anywhere. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Tue, Sep 17, 2019 at 12:09 PM Stephen Frost <sfrost@snowman.net> wrote: > > We need to be sure that we can detect if the WAL level has ever been set > > to minimal between a full and an incremental and, if so, either refuse > > to run the incremental, or promote it to a full, or make it a > > checksum-based incremental instead of trusting the WAL stream. > > Sure. What about checksum collisions? Certainly possible, of course, but a sha256 of each file is at least somewhat better than, say, our page-level checksums. I do agree that having the option to just say "promote it to a full", or "do a byte-by-byte comparison against the prior backed up file" would be useful for those who are concerned about sha256 collision probabilities. Having a cross-check of "does this X% of files that we decided not to back up due to whatever really still match what we think is in the backup?" is definitely a valuable feature and one which I'd hope we get to at some point. > > Ok.. I can understand that but I don't get how these changes to > > pg_basebackup will help facilitate that. If they don't and what you're > > talking about here is independent, then great, that clarifies things, > > but if you're saying that these changes to pg_basebackup are to help > > with backing up directly into those Enterprise systems then I'm just > > asking for some help in understanding how- what's the use-case here that > > we're adding to pg_basebackup that makes it work with these Enterprise > > systems? > > > > I'm not trying to be difficult here, I'm just trying to understand. > > Man, I feel like we're totally drifting off into the weeds here. I'm > not arguing that these changes to pg_basebackup will help enterprise > users except insofar as those users want incremental backup. All of > this discussion started with this comment from you: > > "Having a system of keeping track of which backups are full and which > are differential in an overall system also gives you the ability to do > things like expiration in a sensible way, including handling WAL > expiration." > > All I was doing was saying that for an enterprise user, the overall > system might be something entirely outside of our control, like > NetBackup or Tivoli. Therefore, whatever functionality we provide to > do that kind of thing should be able to be used in such contexts. That > hardly seems like a controversial proposition. And all I was trying to understand was how what pg_basebackup does in this context is really different from what can be done with pgbackrest, since you brought it up: "True, but I'm not sure that functionality belongs in core. It certainly needs to be possible for out-of-core code to do this part of the work if desired, because people want to integrate with enterprise backup systems, and we can't come in and say, well, you back up everything else using Netbackup or Tivoli, but for PostgreSQL you have to use pg_backrest. I mean, maybe you can win that argument, but I know I can't." What it sounds like you're arguing here is that what pg_basebackup "has" in it is that it specifically doesn't include any kind of expiration management, and that's somehow helpful to people who want to use Enterprise backup solutions. Maybe that's what you were getting at, in which case, I'm sorry for misunderstanding and dragging it out, and thanks for helping me understand. > > How would that tool work, if it's to be able to work regardless of where > > the WAL is actually stored..? 
Today, pg_archivecleanup just works > > against a POSIX filesystem- are you thinking that the tool would have a > > pluggable storage system, so that it could work with, say, a POSIX > > filesystem, or a CIFS mount, or a s3-like system? > > Again, I was making a general statement about design goals -- "we > should try to work nicely with enterprise backup products" -- not > proposing a specific design for a specific thing. I don't think the > idea of some pluggability in that area is a bad one, but it's not even > slightly what this thread is about. Well, I agree with you, as I said up-thread, that this seemed to be going in a different and perhaps not entirely relevant direction. > > Provided the WAL level is at the level that you need it to be that will > > be true for things which are actually supported with PITR, replication > > to standby servers, et al. I can see how it might come across as an > > overreaction but this strikes me as a pretty glaring issue and I worry > > that if it was overlooked until now that there'll be other more subtle > > issues, and backups are just plain complicated to get right, just to > > begin with already, something that I don't think people appreciate until > > they've been dealing with them for quite a while. > > Permit me to be unpersuaded. If it was such a glaring issue, and if > experience is the key to spotting such issues, then why didn't YOU > spot it? I'm not designing the feature..? Sure, I agreed earlier with the general idea that we might be able to use WAL scanning and/or the LSN to figure out if a page had changed, but the next step would have been, I would have thought anyway, for someone to go do the analysis that has only recently been started to look at the places when we write and the cases where we write the WAL and actually build up confidence that this approach isn't missing anything. Instead, we seem to have come a long way in the development of this without having done that, and that does shake my confidence in this effort. > I'm not arguing that this stuff isn't hard. It is. Nor am I arguing > that I didn't screw up. I did. But designs need to be accepted or > rejected based on facts, not FUD. You've raised some good technical > points and if you've got more concerns, I'd like to hear them, but I > don't think arguing vaguely that a certain approach will probably run > into trouble gets us anywhere. This just gets back to what I was saying earlier. It seems like we're presuming this is going to 'just work' because, say, replication works great, or crash recovery works great, and those are based on WAL. I'm still hopeful that we can do something based on WAL or LSN here, but it needs a careful review of when we are, and when we aren't, writing out WAL for basically everything we do, an effort that I'm glad to see might be starting to happen, but a quick "oh, this is why in this one case with this one thing, and we're all good now" doesn't instill confidence in me, at least. Thanks, Stephen
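The cross-check mentioned above (re-verify some fraction of the files that the incremental chose not to copy) could be sketched roughly like this, with a hypothetical helper that maps a cluster path to its counterpart in the backup repository.

    import filecmp
    import random

    def spot_check_skipped(skipped_paths, repo_path_for, fraction=0.05, seed=None):
        # Sample a fraction of the skipped files and compare them byte-for-byte
        # against the copies already in the repository; return any that differ.
        skipped = list(skipped_paths)
        if not skipped:
            return []
        rng = random.Random(seed)
        sample = rng.sample(skipped, max(1, int(len(skipped) * fraction)))
        return [p for p in sample
                if not filecmp.cmp(p, repo_path_for(p), shallow=False)]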