walmethods.c is kind of a mess (was Re: refactoring basebackup.c) - Mailing list pgsql-hackers

From Robert Haas
Subject walmethods.c is kind of a mess (was Re: refactoring basebackup.c)
Msg-id CA+TgmoYYT7j38T6f+f-YFZKJvBGZ9wOtfRZuKQZAm-2gSbx6sg@mail.gmail.com
In response to Re: refactoring basebackup.c  (Dipesh Pandit <dipesh.pandit@gmail.com>)
List pgsql-hackers
On Fri, Mar 4, 2022 at 3:32 AM Dipesh Pandit <dipesh.pandit@gmail.com> wrote:
> GZIP manages to overcome this problem as it provides an option to turn on/off
> compression on the fly while writing a compressed archive with the help of zlib
> library function deflateParams(). The current gzip implementation for
> CreateWalTarMethod uses this library function to turn off compression just before
> step #1 and it writes the uncompressed header of size equal to TAR_BLOCK_SIZE.
> It uses the same library function to turn on the compression for writing the contents
> of the WAL file as part of step #2. It again turns off the compression just before step
> #3 to overwrite the header. The header is overwritten at the same offset with size
> equal to TAR_BLOCK_SIZE.

This is a real mess. To me, it seems like a pretty big hack to use
deflateParams() to shut off compression in the middle of the
compressed data stream so that we can go back and overwrite that part
of the data later. It appears that the only reason we need that hack
is because we don't know the file size starting out. Except we kind of
do know the size, because pad_to_size specifies a minimum size for the
file. It's true that the maximum file size is unbounded, but I'm not
sure why that's important. I wonder if anyone else has an idea why we
didn't just set the file size to pad_to_size exactly when we write the
tar header the first time, instead of this IMHO kind of nutty approach
where we back up. I'd try to figure it out from the comments, but
there basically aren't any. I also had a look at the relevant commit
messages and didn't see anything relevant there either. If I'm missing
something, please point it out.

While I'm complaining, I noticed while looking at this code that it is
documented that "The caller must ensure that only one method is
instantiated in any given program, and that it's only instantiated
once!" As far as I can see, this is because somebody thought about
putting all of the relevant data into a struct and then decided on an
alternative strategy of storing some of it there, and the rest in a
global variable. I can't quite imagine why anyone would think that was
a good idea. There may be some reason that I can't see right now, but
here again there appear to be no relevant code comments.

I'm somewhat inclined to wonder whether we could just get rid of
walmethods.c entirely and use the new bbstreamer stuff instead. That
code also knows how to write plain files into a directory, and write
tar archives, and compress stuff, but in my totally biased opinion as
the author of most of that code, it's better code. It imposes no
restriction limiting a program to one method, or to instantiating that
method only once, and it already has LZ4 support,
and there's a pending patch for ZSTD support that I intend to get
committed soon as well. It also has, and I know I might be beating a
dead horse here, comments. Now, admittedly, it does need to know the
size of each archive member up front in order to work, so if we can't
solve the problem then we can't go this route. But if we can't solve
that problem, then we also can't add LZ4 and ZSTD support to
walmethods.c, because random access to compressed data is not really a
thing, even if we hacked it to work for gzip.

Thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com


