Re: refactoring basebackup.c - Mailing list pgsql-hackers
From | Robert Haas
---|---
Subject | Re: refactoring basebackup.c
Date |
Msg-id | CA+TgmoY5zpS=gJj2O0PXi6YGao5YR89v0qN_xZzWNvff2Tug+Q@mail.gmail.com
In response to | Re: refactoring basebackup.c (Robert Haas <robertmhaas@gmail.com>)
Responses | Re: refactoring basebackup.c
 | Re: refactoring basebackup.c
List | pgsql-hackers
On Fri, May 8, 2020 at 4:55 PM Robert Haas <robertmhaas@gmail.com> wrote:
> So it might be good if I'd remembered to attach the patches. Let's try
> that again.

Here's an updated patch set. This is now rebased over master and includes as 0001 the patch I posted separately at http://postgr.es/m/CA+TgmobAczXDRO_Gr2euo_TxgzaH1JxbNxvFx=HYvBinefNH8Q@mail.gmail.com but drops some other patches that were committed meanwhile. 0002-0009 of this series are basically the same as 0004-0011 from the previous series, except for rebasing and fixing a bug I discovered in what's now 0006. 0010 does a refactoring of pg_basebackup along similar lines to the server-side refactoring from patches earlier in the series. 0011 is a really terrible, hacky, awful demonstration of how this infrastructure can support server-side compression. If you apply it and take a tar-format backup without -R, you will get .tar files that are actually .tar.gz files. You can rename them, decompress them, and use pg_verifybackup to check that everything is OK. If you try to do anything else with 0011 applied, everything will break.

In the process of working on this, I learned a lot about how pg_basebackup actually works, and found out about a number of things that, with the benefit of hindsight, seem like they might not have been the best way to go.

1. pg_basebackup -R injects recovery.conf (on older versions) or injects standby.signal and appends to postgresql.auto.conf (on newer versions) by parsing the tar file sent by the server and editing it on the fly. From the point of view of server-side compression, this is not ideal, because if you want to make these kinds of changes when server-side compression is in use, you'd have to decompress the stream on the client side in order to figure out where in the stream you ought to inject your changes. But having to do that is a major expense.
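[Editor's note: to make the archive-editing in point 1 concrete, here is a rough Python sketch of appending a member such as standby.signal to a tar stream. This is illustrative only; the real pg_basebackup code is C and edits the stream incrementally rather than rewriting it in memory.]

```python
import io
import tarfile

def inject_member(tar_bytes: bytes, name: str, content: bytes) -> bytes:
    """Rewrite a tar stream with one extra member appended, roughly what
    pg_basebackup -R does when it adds standby.signal to the archive.
    Simplified sketch: assumes the archive contains only regular files."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as src, \
         tarfile.open(fileobj=out, mode="w") as dst:
        for member in src.getmembers():          # copy each existing member
            dst.addfile(member, src.extractfile(member))
        info = tarfile.TarInfo(name)             # the injected file
        info.size = len(content)
        dst.addfile(info, io.BytesIO(content))
    return out.getvalue()
```

The key point: if the server had compressed the archive first, this edit would require a full decompress/re-compress pass on the client, which is exactly the expense described above.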
If the client instead told the server what to change when generating the archive, and the server did it, this expense could be avoided. It would have the additional advantage that the backup manifest could reflect the effects of those changes; right now it doesn't, and pg_verifybackup just knows to expect differences in those files.

2. According to the comments, some tar programs require two tar blocks (i.e. 512-byte blocks) of zero bytes at the end of an archive. The server does not generate these blocks of zero bytes, so it basically creates a tar file that works fine with my copy of tar but might break with somebody else's. Instead, the client appends 1024 zero bytes to the end of every file it receives from the server. That is an odd way of fixing this problem, and it makes things rather inflexible. If the server sends you any kind of a file OTHER THAN a tar file with the last 1024 zero bytes stripped off, then adding 1024 zero bytes will be the wrong thing to do. It would be better if the server just generated fully correct tar files (whatever we think that means) and the client wrote out exactly what it got from the server. Then, we could have the server generate cpio archives or zip files or gzip-compressed tar files or lz4-compressed tar files or anything we like, and the client wouldn't really need to care as long as it didn't need to extract those archives. That seems a lot cleaner.

3. The way that progress reporting works relies on the server knowing exactly how large the archive sent to the client is going to be. Progress as reckoned by the client is equal to the number of archive payload bytes the client has received. This works OK with a tar because we know how big the tar file is going to be based on the size of the input files we intend to send, but it's unsuitable for any sort of compressed archive (tar.gz, zip, whatever) because the compression ratio cannot be predicted in advance.
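[Editor's note: the unpredictability in point 3 is easy to demonstrate. This small sketch compresses two equal-sized payloads and gets wildly different wire sizes, so "payload bytes received" tells the client little about server-side progress.]

```python
import os
import zlib

# Two payloads of identical size: one highly compressible, one incompressible.
repetitive = b"base/1234/5678\x00" * 65536     # ~1 MB of repeated text
random_ish = os.urandom(len(repetitive))       # ~1 MB of random bytes

for label, payload in (("repetitive", repetitive), ("random", random_ish)):
    compressed = zlib.compress(payload)
    print(label, len(payload), "->", len(compressed))
```

The repetitive payload shrinks by orders of magnitude while the random one barely changes, even though both represent the same amount of server-side work.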
It would be better if the server sent the payload bytes (possibly compressed) interleaved with progress indicators, so that the client could correctly indicate that, say, the backup is 30% complete because 30GB of 100GB has been processed on the server side, even though the amount of data actually received by the client might be 25GB or 20GB or 10GB or whatever because it got compressed before transmission.

4. A related consideration is that we might want to have an option to do something with the backup other than send it to the client. For example, it might be useful to have an option for pg_basebackup to tell the server to write the backup files to some specified server directory, or to, say, S3. There are security concerns there, and I'm not proposing to do anything about this immediately, but it seems like something we might eventually want to have. In such a case, we are not going to send any payload to the client, but the client probably still wants progress indicators, so the current system of coupling progress to the number of bytes received by the client breaks down for that reason also.

5. As things stand today, the client must know exactly how many archives it should expect to receive from the server and what each one is. It can do that, because it knows to expect one archive per tablespace, and the archive must be an uncompressed tarfile, so there is no ambiguity. But, if the server could send archives to other places, or send other kinds of archives to the client, then this would become more complex. There is no intrinsic reason why the logic on the client side can't simply be made more complicated in order to cope, but it doesn't seem like great design, because then every time you enhance the server, you've also got to enhance the client, and that limits cross-version compatibility, and also seems more fragile.
I would rather that the server advertise the number of archives and the names of each archive to the client explicitly, allowing the client to be dumb unless it needs to post-process (e.g. extract) those archives.

Putting all of the above together, what I propose - but have not yet tried to implement - is a new COPY sub-protocol for taking base backups. Instead of sending a COPY stream per archive, the server would send a single COPY stream where the first byte of each message is a type indicator, like we do with the replication sub-protocol today. For example, if the first byte is 'a' that could indicate that we're beginning a new archive and the rest of the message would indicate the archive name and perhaps some flags or options. If the first byte is 'p' that could indicate that we're sending archive payload, perhaps with the first four bytes of the message being progress, i.e. the number of newly-processed bytes on the server side prior to any compression, and the remaining bytes being payload. On receipt of such a message, the client would increment the progress indicator by the value indicated in those first four bytes, and then process the remaining bytes by writing them to a file or whatever behavior the user selected via -Fp, -Ft, -Z, etc. To be clear, I'm not saying that this specific thing is the right thing, just something of this sort.

The server would need to continue supporting the current multi-copy protocol for compatibility with older pg_basebackup versions, and pg_basebackup would need to continue to support it for compatibility with older server versions, but we could use the new approach going forward. (Or, we could break compatibility, but that would probably be unpopular and seems unnecessary and even risky to me at this point.)
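[Editor's note: with the caveat that none of these byte layouts is settled - the 'a'/'p' type bytes, the 4-byte progress field, everything here is hypothetical - the client side of such a sub-protocol might look like this sketch:]

```python
import struct

# Hypothetical single-COPY-stream messages, per the proposal above:
#   'a' + archive name                      -> a new archive begins
#   'p' + 4-byte progress delta + payload   -> archive payload; the delta is
#        uncompressed bytes processed server-side, so progress stays accurate
#        even when the payload itself is compressed.

def new_archive_msg(name: str) -> bytes:
    return b"a" + name.encode()

def payload_msg(progress_delta: int, payload: bytes) -> bytes:
    return b"p" + struct.pack("!I", progress_delta) + payload

def handle(msg: bytes, state: dict) -> None:
    kind, rest = msg[:1], msg[1:]
    if kind == b"a":
        state["archives"].append(rest.decode())
    elif kind == b"p":
        (delta,) = struct.unpack("!I", rest[:4])
        state["progress"] += delta           # server-side uncompressed bytes
        state["received"] += len(rest) - 4   # bytes actually on the wire
```

Note how the two counters diverge by design: "progress" tracks server-side work for the user-visible percentage, while "received" is merely how much (possibly compressed) data arrived.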
The ideas in the previous paragraph would address #3-#5 directly, but they also indirectly address #2 because while we're switching protocols we could easily move the padding with zero bytes to the server side, and I think we should.

#1 is a bit of a separate consideration. To tackle #1 along the lines proposed above, the client needs a way to send the recovery.conf contents to the server so that the server can inject them into the tar file. It's not exactly clear to me what the best way of permitting this is, and maybe there's a totally different approach that would be altogether better. One thing to consider is that we might well want the client to be able to send *multiple* chunks of data to the server at the start of a backup.

For instance, suppose we want to support incremental backups. I think the right approach is for the client to send the backup_manifest file from the previous full backup to the server. What exactly the server does with it afterward depends on your preferred approach, but the necessary information is there. Maybe incremental backup is based on comparing cryptographic checksums, so the server looks at all the files and sends to the client those where the checksum (hopefully SHA-something!) does not match. I wouldn't favor this approach myself, but I know some people like it. Or maybe it's based on finding blocks modified since the LSN of the previous backup; the manifest has enough information for that to work, too. In such an approach, there can be altogether new files with old LSNs, because files can be flat-copied without changing block LSNs, so it's important to have the complete list of files from the previous backup, and that too is in the manifest. There are even timestamps for the bold among you.
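[Editor's note: a sketch of the checksum-based variant. Simplified: a manifest here is just a path -> (size, sha256) mapping, which is not the real backup_manifest format, and the paths are made up.]

```python
import hashlib

def entry(data: bytes):
    """Manifest entry for one file: (size, checksum)."""
    return (len(data), hashlib.sha256(data).hexdigest())

def plan_incremental(previous: dict, current: dict):
    """Given last backup's manifest and the current file set, decide what to
    resend: anything new or whose checksum changed. Files present before but
    absent now are deletions the client must also learn about."""
    resend = {p for p, meta in current.items() if previous.get(p) != meta}
    removed = set(previous) - set(current)
    return resend, removed
```

An LSN-based variant would work the same way structurally, just comparing block LSNs against the previous backup's start LSN instead of comparing checksums; either way the previous manifest supplies the complete file list needed to detect flat-copied files and deletions.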
Anyway, my point is to advocate for a design where the client says (1) I want a backup with these options and then (2) here are 0, 1, or >1 files (recovery parameters and/or backup manifest and/or other things) in support of that and then the server hands back a stream of archives which the client may or may not choose to post-process.

It's tempting to think about solving this problem by appealing to CopyBoth, but I think that might be the wrong idea. The reason we use CopyBoth for the replication subprotocol is that there are periodic messages flowing in both directions that are only loosely coupled to each other. Apart from reading frequently enough to avoid a deadlock because both sides have full write buffers, each end of the connection can kind of do whatever it wants. But for the kinds of use cases I'm talking about here, that's not so. First the client talks and the server acknowledges, then the reverse. That doesn't mean we couldn't use CopyBoth, but maybe a CopyIn followed by a CopyOut would be more straightforward. I am not sure of the details here and am happy to hear the ideas of others.

One final thought is that the options framework for pg_basebackup is a little unfortunate. As of today, what the client receives, always, is a series of tar files. If you say -Fp, it doesn't change the backup format; it just extracts the tar files. If you say -Ft, it doesn't extract them. If you say -Ft but also -Z, it compresses the tar files.

Thinking just about server-side compression and ignoring for the moment more remote features like alternate archive formats (e.g. zip) or things like storing the backup to an alternate location rather than returning it to the client, you probably want the client to be able to specify at least (1) server-side compression (perhaps with one of several algorithms) where the client just writes the results, (2) server-side compression (still with a choice of algorithm) where the client decompresses but does not extract, (3) server-side compression (still with a choice of algorithms) where the client decompresses and extracts, (4) client-side compression (with a choice of algorithms), and (5) client-side extraction. You might also want (6) server-side compression (with a choice of algorithms) where the client decompresses and then re-compresses with a different algorithm (e.g. lz4 on the server to save bandwidth at moderate CPU cost into parallel bzip2 on the client for minimum archival storage). Or, as also discussed upthread, you might want (7) server-side compression of each file individually, so that you get a seekable archive of individually compressed files (e.g. to support fast delta restore).

I think that with these refactoring patches - and the wire protocol redesign mentioned above - it is very reasonable to offer as many of these options as we believe users will find useful, but it is not very clear how to extend the current command-line option framework to support them. It probably would have been better if pg_basebackup, instead of having -Fp and -Ft, just had an --extract option that you could either specify or omit, because that would not have presumed anything about the archive format, but the existing structure is well-baked at this point.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
- v2-0001-Flexible-options-for-BASE_BACKUP-and-CREATE_REPLI.patch
- v2-0002-Recast-_tarWriteDirectory-as-convert_link_to_dire.patch
- v2-0003-Introduce-bbsink-abstraction.patch
- v2-0004-Convert-libpq-related-code-to-a-bbsink.patch
- v2-0005-Convert-throttling-related-code-to-a-bbsink.patch
- v2-0006-Convert-progress-reporting-code-to-a-bbsink.patch
- v2-0007-Introduce-bbarchiver-abstraction.patch
- v2-0008-Create-and-use-bbarchiver-implementations-for-tar.patch
- v2-0009-WIP-Convert-backup-manifest-generation-to-a-bbarc.patch
- v2-0010-WIP-Introduce-bbstreamer-abstration-and-adapt-pg_.patch
- v2-0011-POC-Embarrassingly-bad-server-side-compression-pa.patch