Re: refactoring basebackup.c - Mailing list pgsql-hackers

From Robert Haas
Subject Re: refactoring basebackup.c
Msg-id CA+TgmoY5zpS=gJj2O0PXi6YGao5YR89v0qN_xZzWNvff2Tug+Q@mail.gmail.com
In response to Re: refactoring basebackup.c  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: refactoring basebackup.c
Re: refactoring basebackup.c
List pgsql-hackers
On Fri, May 8, 2020 at 4:55 PM Robert Haas <robertmhaas@gmail.com> wrote:
> So it might be good if I'd remembered to attach the patches. Let's try
> that again.

Here's an updated patch set. This is now rebased over master and
includes as 0001 the patch I posted separately at
http://postgr.es/m/CA+TgmobAczXDRO_Gr2euo_TxgzaH1JxbNxvFx=HYvBinefNH8Q@mail.gmail.com
but drops some other patches that were committed meanwhile. 0002-0009
of this series are basically the same as 0004-0011 from the previous
series, except for rebasing and fixing a bug I discovered in what's
now 0006. 0012 does a refactoring of pg_basebackup along similar lines
to the server-side refactoring from patches earlier in the series.
0012 is a really terrible, hacky, awful demonstration of how this
infrastructure can support server-side compression. If you apply it
and take a tar-format backup without -R, you will get .tar files that
are actually .tar.gz files. You can rename them, decompress them, and
use pg_verifybackup to check that everything is OK. If you try to do
anything else with 0012 applied, everything will break.

In the process of working on this, I learned a lot about how
pg_basebackup actually works, and found out about a number of things
that, with the benefit of hindsight, seem like they might not have
been the best way to go.

1. pg_basebackup -R injects recovery.conf (on older versions) or
injects standby.signal and appends to postgresql.auto.conf (on newer
versions) by parsing the tar file sent by the server and editing it on
the fly. From the point of view of server-side compression, this is
not ideal, because if you want to make these kinds of changes when
server-side compression is in use, you'd have to decompress the stream
on the client side in order to figure out where in the stream you ought
to inject your changes. But having to do that is a major expense. If
the client instead told the server what to change when generating the
archive, and the server did it, this expense could be avoided. It
would have the additional advantage that the backup manifest could
reflect the effects of those changes; right now it doesn't, and
pg_verifybackup just knows to expect differences in those files.

2. According to the comments, some tar programs require two tar blocks
(i.e. 512-byte blocks) of zero bytes at the end of an archive. The
server does not generate these blocks of zero bytes, so it basically
creates a tar file that works fine with my copy of tar but might break
with somebody else's. Instead, the client appends 1024 zero bytes to
the end of every file it receives from the server. That is an odd way
of fixing this problem, and it makes things rather inflexible. If the
server sends you any kind of a file OTHER THAN a tar file with the
last 1024 zero bytes stripped off, then adding 1024 zero bytes will be
the wrong thing to do. It would be better if the server just generated
fully correct tar files (whatever we think that means) and the client
wrote out exactly what it got from the server. Then, we could have the
server generate cpio archives or zip files or gzip-compressed tar
files or lz4-compressed tar files or anything we like, and the client
wouldn't really need to care as long as it didn't need to extract
those archives. That seems a lot cleaner.
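For reference, the trailer in question is just the standard end-of-archive
marker from the POSIX ustar format: two 512-byte blocks of zero bytes. A
trivial sketch of emitting it on the server side (tar_write_trailer is a
made-up helper for illustration, not anything in basebackup.c):

```c
#include <string.h>

#define TAR_BLOCK_SIZE 512

/*
 * Emit the tar end-of-archive marker: two 512-byte blocks of zero
 * bytes. Returns the number of bytes written into buf (always 1024);
 * buf must have room for 2 * TAR_BLOCK_SIZE bytes.
 */
static size_t
tar_write_trailer(unsigned char *buf)
{
	memset(buf, 0, 2 * TAR_BLOCK_SIZE);
	return 2 * TAR_BLOCK_SIZE;
}
```

If the server always appended this itself, the client could write whatever
bytes it receives verbatim, regardless of archive format.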

3. The way that progress reporting works relies on the server knowing
exactly how large the archive sent to the client is going to be.
Progress as reckoned by the client is equal to the number of archive
payload bytes the client has received. This works OK with a tar file
because we know how big it is going to be based on the sizes of the
input files we intend to send, but it's unsuitable for any sort
of compressed archive (tar.gz, zip, whatever) because the compression
ratio cannot be predicted in advance. It would be better if the server
sent the payload bytes (possibly compressed) interleaved with progress
indicators, so that the client could correctly indicate that, say, the
backup is 30% complete because 30GB of 100GB has been processed on the
server side, even though the amount of data actually received by the
client might be 25GB or 20GB or 10GB or whatever because it got
compressed before transmission.

4. A related consideration is that we might want to have an option to
do something with the backup other than send it to the client. For
example, it might be useful to have an option for pg_basebackup to
tell the server to write the backup files to some specified server
directory, or to, say, S3. There are security concerns there, and I'm
not proposing to do anything about this immediately, but it seems like
something we might eventually want to have. In such a case, we are not
going to send any payload to the client, but the client probably still
wants progress indicators, so the current system of coupling progress
to the number of bytes received by the client breaks down for that
reason also.

5. As things stand today, the client must know exactly how many
archives it should expect to receive from the server and what each one
is. It can do that, because it knows to expect one archive per
tablespace, and the archive must be an uncompressed tarfile, so there
is no ambiguity. But, if the server could send archives to other
places, or send other kinds of archives to the client, then this would
become more complex. There is no intrinsic reason why the logic on the
client side can't simply be made more complicated in order to cope,
but it doesn't seem like great design, because then every time you
enhance the server, you've also got to enhance the client, and that
limits cross-version compatibility, and also seems more fragile. I
would rather that the server advertise the number of archives and the
names of each archive to the client explicitly, allowing the client to
be dumb unless it needs to post-process (e.g. extract) those archives.

Putting all of the above together, what I propose - but have not yet
tried to implement - is a new COPY sub-protocol for taking base
backups. Instead of sending a COPY stream per archive, the server
would send a single COPY stream where the first byte of each message
is a type indicator, like we do with the replication sub-protocol
today. For example, if the first byte is 'a' that could indicate that
we're beginning a new archive and the rest of the message would
indicate the archive name and perhaps some flags or options. If the
first byte is 'p' that could indicate that we're sending archive
payload, perhaps with the first four bytes of the message being
progress, i.e. the number of newly-processed bytes on the server side
prior to any compression, and the remaining bytes being payload. On
receipt of such a message, the client would increment the progress
indicator by the value indicated in those first four bytes, and then
process the remaining bytes by writing them to a file or whatever
behavior the user selected via -Fp, -Ft, -Z, etc. To be clear, I'm not
saying that this specific thing is the right thing, just something of
this sort. The server would need to continue supporting the current
multi-copy protocol for compatibility with older pg_basebackup
versions, and pg_basebackup would need to continue to support it for
compatibility with older server versions, but we could use the new
approach going forward. (Or, we could break compatibility, but that
would probably be unpopular and seems unnecessary and even risky to me
at this point.)
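To make the shape of the idea concrete, here is a rough sketch of what
parsing such a message might look like on the client side. The layout (an
'a'/'p' type byte, then a 4-byte big-endian progress delta for 'p'
messages) is exactly the "just something of this sort" strawman above;
none of these names or formats exist anywhere yet:

```c
#include <stdint.h>
#include <stddef.h>

/* One message in the hypothetical base-backup sub-protocol. */
typedef struct BackupMsg
{
	char		type;			/* 'a' = new archive, 'p' = payload */
	uint32_t	progress_delta; /* uncompressed bytes processed ('p' only) */
	const uint8_t *payload;		/* archive name ('a') or archive bytes ('p') */
	size_t		payload_len;
} BackupMsg;

/* Returns 0 on success, -1 if the buffer is too short to be valid. */
static int
parse_backup_msg(const uint8_t *buf, size_t len, BackupMsg *out)
{
	if (len < 1)
		return -1;
	out->type = (char) buf[0];
	if (out->type == 'p')
	{
		if (len < 5)
			return -1;
		/* 4-byte big-endian integer, as elsewhere in the wire protocol */
		out->progress_delta = ((uint32_t) buf[1] << 24) |
			((uint32_t) buf[2] << 16) |
			((uint32_t) buf[3] << 8) |
			(uint32_t) buf[4];
		out->payload = buf + 5;
		out->payload_len = len - 5;
	}
	else
	{
		out->progress_delta = 0;
		out->payload = buf + 1;
		out->payload_len = len - 1;
	}
	return 0;
}
```

On each 'p' message the client would add progress_delta to its progress
counter and then hand the payload to whatever sink -Fp/-Ft/-Z selected;
a real implementation would presumably build on the existing CopyData
framing rather than hand-rolled parsing like this.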

The ideas in the previous paragraph would address #3-#5 directly, but
they also indirectly address #2 because while we're switching
protocols we could easily move the padding with zero bytes to the
server side, and I think we should. #1 is a bit of a separate
consideration. To tackle #1 along the lines proposed above, the client
needs a way to send the recovery.conf contents to the server so that
the server can inject them into the tar file. It's not exactly clear
to me what the best way of permitting this is, and maybe there's a
totally different approach that would be altogether better. One thing
to consider is that we might well want the client to be able to send
*multiple* chunks of data to the server at the start of a backup. For
instance, suppose we want to support incremental backups. I think the
right approach is for the client to send the backup_manifest file from
the previous full backup to the server. What exactly the server does
with it afterward depends on your preferred approach, but the
necessary information is there. Maybe incremental backup is based on
comparing cryptographic checksums, so the server looks at all the
files and sends to the client those where the checksum (hopefully
SHA-something!) does not match. I wouldn't favor this approach myself,
but I know some people like it. Or maybe it's based on finding blocks
modified since the LSN of the previous backup; the manifest has enough
information for that to work, too. In such an approach, there can be
altogether new files with old LSNs, because files can be flat-copied
without changing block LSNs, so it's important to have the complete
list of files from the previous backup, and that too is in the
manifest. There are even timestamps for the bold among you. Anyway, my
point is to advocate for a design where the client says (1) I want a
backup with these options and then (2) here are 0, 1, or >1 files
(recovery parameters and/or backup manifest and/or other things) in
support of that and then the server hands back a stream of archives
which the client may or may not choose to post-process.
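As a tiny illustration of the LSN-based decision described above, under
the assumption (mine, purely for the sketch) that the server has the
previous manifest's file list in hand; every name here is invented:

```c
#include <stdbool.h>
#include <string.h>

/* File list carried over from the previous backup's manifest. */
typedef struct PrevManifest
{
	const char **paths;
	int			npaths;
} PrevManifest;

typedef enum FileAction
{
	SEND_WHOLE_FILE,			/* new file: block LSNs cannot be trusted */
	SEND_CHANGED_BLOCKS			/* known file: send blocks past the old LSN */
} FileAction;

static bool
manifest_contains(const PrevManifest *m, const char *path)
{
	for (int i = 0; i < m->npaths; i++)
		if (strcmp(m->paths[i], path) == 0)
			return true;
	return false;
}

/*
 * A file absent from the previous manifest must be sent whole even if
 * its block LSNs are old, because files can be flat-copied without
 * changing block LSNs; only files the old manifest knows about can be
 * handled by comparing block LSNs against the previous backup's LSN.
 */
static FileAction
choose_file_action(const PrevManifest *m, const char *path)
{
	return manifest_contains(m, path) ? SEND_CHANGED_BLOCKS : SEND_WHOLE_FILE;
}
```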

It's tempting to think about solving this problem by appealing to
CopyBoth, but I think that might be the wrong idea. The reason we use
CopyBoth for the replication sub-protocol is that there are periodic
messages flowing in both directions that are only loosely coupled to
each other. Apart from reading frequently enough to avoid a deadlock
because both sides have full write buffers, each end of the connection
can kind of do whatever it wants. But for the kinds of use cases I'm
talking about here, that's not so. First the client talks and the
server acknowledges, then the reverse. That doesn't mean we couldn't
use CopyBoth, but maybe a CopyIn followed by a CopyOut would be more
straightforward. I am not sure of the details here and am happy to
hear the ideas of others.

One final thought is that the options framework for pg_basebackup is a
little unfortunate. As of today, what the client receives, always, is
a series of tar files. If you say -Fp, it doesn't change the backup
format; it just extracts the tar files. If you say -Ft, it doesn't
extract them. If you say -Ft but also -Z, it compresses the tar files.
Thinking just
about server-side compression and ignoring for the moment more remote
features like alternate archive formats (e.g. zip) or things like
storing the backup to an alternate location rather than returning it
to the client, you probably want the client to be able to specify at
least (1) server-side compression (perhaps with one of several
algorithms) and the client just writes the results, (2) server-side
compression (still with a choice of algorithm) and the client
decompresses but does not extract, (3) server-side compression (still
with a choice of algorithms) and the client decompresses and extracts,
(4) client-side compression (with a choice of algorithms), and (5)
client-side extraction. You might also want (6) server-side
compression (with a choice of algorithms) and client-side decompresses
and then re-compresses with a different algorithm (e.g. lz4 on the
server to save bandwidth at moderate CPU cost into parallel bzip2 on
the client for minimum archival storage). Or, as also discussed
upthread, you might want (7) server-side compression of each file
individually, so that you get a seekable archive of individually
compressed files (e.g. to support fast delta restore). I think that
with these refactoring patches - and the wire protocol redesign
mentioned above - it is very reasonable to offer as many of these
options as we believe users will find useful, but it is not very clear
how to extend the current command-line option framework to support
them. It probably would have been better if pg_basebackup, instead of
having -Fp and -Ft, just had an --extract option that you could either
specify or omit, because that would not have presumed anything about
the archive format, but the existing structure is well-baked at this
point.
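Just to show how much wider that option space is than -Fp/-Ft/-Z can
express, here is one hypothetical way the combinations (1)-(7) above
might be modeled internally; every identifier is invented for the sketch:

```c
#include <stdbool.h>

typedef enum CompressWhere
{
	COMPRESS_NONE,
	COMPRESS_SERVER,
	COMPRESS_CLIENT
} CompressWhere;

typedef struct BackupOptions
{
	CompressWhere where;		/* who compresses the archive stream */
	const char *server_algo;	/* e.g. "lz4", when where == COMPRESS_SERVER */
	const char *client_algo;	/* recompression target on the client, if any */
	bool		decompress_on_client;	/* undo server-side compression */
	bool		extract_on_client;		/* roughly what -Fp means today */
	bool		per_file_compression;	/* case (7): seekable archive */
} BackupOptions;
```

Case (6) from the list, for example, would be COMPRESS_SERVER with
server_algo = "lz4", decompress_on_client = true, and client_algo set to
the archival algorithm; it's hard to see how to spell that with the
existing -F/-Z flags.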

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

