Re: pg_waldump: support decoding of WAL inside tarfile - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: pg_waldump: support decoding of WAL inside tarfile
Msg-id: CA+Tgmoardk4VuthHc23vov+AVkhq7eT0mFUs-2ctAnP1uiTaog@mail.gmail.com
In response to: Re: pg_waldump: support decoding of WAL inside tarfile (Amul Sul <sulamul@gmail.com>)
List: pgsql-hackers
On Thu, Oct 16, 2025 at 7:49 AM Amul Sul <sulamul@gmail.com> wrote:
> astreamer reads the archive in fixed-size chunks (here it is 128KB).
> Sometimes, a single read can contain data from two WAL files --
> specifically, the tail end of one file and the start of the next --
> because of how they're physically stored in the archive. astreamer
> knows where one file ends and another begins through tags like
> ASTREAMER_MEMBER_HEADER, ASTREAMER_MEMBER_CONTENTS, and
> ASTREAMER_MEMBER_TRAILER. However, once one file ends, it can't pause
> mid-chunk to hold back data from the next file for the caller; it
> pushes the entire chunk it has read to the target buffer.

Right, this makes sense.

> So, if we put the reordering logic outside the streamer, we'd
> sometimes be receiving buffers containing mixed data from two WAL
> files. The caller would then need to correctly identify WAL file
> boundaries within those buffers. This would require passing extra
> metadata -- like segment numbers for the WAL files in the buffer,
> plus start and end offsets of those segments within the buffer.
> While not impossible, it feels a bit hacky and I'm unsure if that's
> the best approach.

I agree that we need that kind of metadata, but I don't see why our
need for it depends on where we do the reordering. That is, if we do
the reordering above the astreamer layer, we need to keep track of the
origin of each chunk of WAL bytes, and if we do the reordering within
the astreamer layer, we still need to keep track of the origin of the
WAL bytes. Doing the reordering properly requires that tracking, but
that requirement doesn't say anything about where the tracking has to
be performed.

I think it might be better if we didn't write to the astreamer's
buffer at all. For example, suppose we create a struct that looks
approximately like this:

struct ChunkOfDecodedWAL {
    XLogSegNo segno; // could also be XLogRecPtr start_lsn or char *walfilename or whatever
    StringInfoData buffer;
    char *spillfilename; // or whatever we use to identify the temporary files
    bool already_removed;
    // potentially other metadata
};

Then, create a hash table and key it on the segno, or whatever. Have
the astreamer write to the hash table: when it gets a chunk of WAL, it
looks up or creates the relevant hash table entry and appends the data
to the buffer. At any convenient point in the code, you can decide to
write the data from the buffer to a spill file, after which you
resetStringInfo() on the buffer and populate the spill file name. When
you've used up the data, you remove the spill file and set the
already_removed flag.
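[Editor's note: for concreteness, here is a rough, untested sketch of
what such a table could look like using simplehash with a frontend
allocator, along the same lines as src/common/blkreftable.c. The
walchunk names and the append helper are invented for illustration,
not taken from the patch.]

#include "postgres_fe.h"

#include "access/xlogdefs.h"
#include "common/hashfn.h"
#include "lib/stringinfo.h"

typedef struct ChunkOfDecodedWAL
{
    XLogSegNo   segno;           /* hash key */
    char        status;          /* required by simplehash */
    StringInfoData buffer;       /* in-memory WAL bytes, if any */
    char       *spillfilename;   /* set once the buffer is spilled */
    bool        already_removed; /* spill file already unlinked? */
} ChunkOfDecodedWAL;

#define SH_PREFIX walchunk
#define SH_ELEMENT_TYPE ChunkOfDecodedWAL
#define SH_KEY_TYPE XLogSegNo
#define SH_KEY segno
#define SH_HASH_KEY(tb, key) \
    hash_bytes((const unsigned char *) &key, sizeof(XLogSegNo))
#define SH_EQUAL(tb, a, b) ((a) == (b))
#define SH_SCOPE static inline
#define SH_RAW_ALLOCATOR pg_malloc0
#define SH_DECLARE
#define SH_DEFINE
#include "lib/simplehash.h"

/* Append one run of decoded bytes belonging to a single segment. */
static void
walchunk_append(walchunk_hash *tab, XLogSegNo segno,
                const char *data, int len)
{
    bool        found;
    ChunkOfDecodedWAL *entry = walchunk_insert(tab, segno, &found);

    if (!found)
    {
        initStringInfo(&entry->buffer);
        entry->spillfilename = NULL;
        entry->already_removed = false;
    }
    appendBinaryStringInfo(&entry->buffer, data, len);
}

[Spilling, spill-file cleanup, and the end-of-archive error-reporting
pass would then all be simple operations over entries of this one
table.]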
I think this could also help with the error reporting stuff. When you
get to the end of the file, you'll know all the files you saw and how
much data you read from each of them. So you could possibly do
something like:

ERROR: LSN %08X/%08X not found in archive "%s"
DETAIL: WAL segment %s is not present in the archive
-or-
DETAIL: WAL segment %s was expected to be %u bytes, but was only %u bytes
-or-
DETAIL: whatever else can go wrong

The point is that every file you've ever seen has a hash table entry,
and in that hash table entry you can store everything about that file
that you need to know, whether that's the file data, the disk file
that contains the file data, the fact that we already threw the data
away, or any other fact that you can imagine wanting to know.

Said differently, the astreamer buffer is not really a great place to
write data. It exists because when we're just forwarding data from one
astreamer to the next, we will often need to buffer a small amount of
data to avoid terrible performance. However, it's only there to be
used when we don't have something better. I don't think any astreamer
that is intended to be the last one in the chain currently writes to
the buffer -- they write to the output file, or whatever, because
using an in-memory buffer as your final output destination is not a
real good plan.

> > While I'm on the topic of astreamer_wal_read(), here are a few
> > other problems I noticed:
> >
> > * The return value is not documented, and it seems to always be
> > count, in which case it might as well return void. The caller
> > already has the value they passed for count.
>
> The caller will be xlogreader, and I believe we shouldn't change
> that. For the same reason, WALDumpReadPage() also returns the same.

OK, but then you can make that clear via a brief comment.

> The loop isn't needed because the caller always requests 8KB of
> data, while READ_CHUNK_SIZE is 128KB. It's assumed that the
> astreamer has already created the file with some initial data. For
> example, if only a few bytes have been written so far, when we reach
> TarWALDumpReadPage(), it detects that we're reading the same file
> that the astreamer is still writing to and hasn't finished. It then
> requests to append 128KB of data by calling
> astreamer_archive_read(), even though we only need 8KB at a time.
> This process repeats each time the next 8KB chunk is requested:
> astreamer_archive_read() appends another 128KB, and continues until
> the file has been fully read and written.

Sure, but you don't know how much data is going to come out the other
end of the astreamer pipeline. Since the data is (possibly)
compressed, you expect at least as many bytes to emerge from the
output end as you add to the input end, but sometimes compressors end
up making the data slightly larger instead of smaller. It's unlikely
that the effect would be so dramatic that adding 128kB to one end of
the pipeline would make less than 8kB emerge from the other end, but
it's not a good idea to rely on assumptions like that. Not that this
is a real thing, but imagine that the compressed file had something in
the middle of it that behaved like a comment in C code, i.e. it didn't
generate any output.
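[Editor's note: the defensive shape being suggested is to keep feeding
raw archive bytes through the pipeline until the requested page has
actually materialized, rather than assuming one 128kB read must
produce at least 8kB of output. A purely hypothetical sketch --
wal_bytes_available() is invented here, and it assumes
astreamer_archive_read() returns the number of raw bytes consumed and
zero at end of archive, as discussed further below:]

/*
 * Pump the archive until the requested page is available.
 * wal_bytes_available() stands in for "do we already hold bytes
 * [startoff, startoff + count) of this segment, in memory or in a
 * spill file?"
 */
while (!wal_bytes_available(state, segno, startoff, count))
{
    if (astreamer_archive_read(state) == 0)
        pg_fatal("archive ended before WAL segment %s was complete",
                 fname);
}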
> In the case where the astreamer is exporting a file to disk but
> hasn't finished writing it, and we call TarWALDumpReadPage() to
> request block(s) from that WAL file, we can read only up to the
> existing blocks in the file. Since the file is incomplete, reading
> may fail later. To handle this, astreamer_archive_read() is invoked
> to append more data -- usually more than the requested amount, as
> explained earlier. That is the race condition I am trying to handle.

That's not what a race condition is:

https://en.wikipedia.org/wiki/Race_condition

> Now, regarding the concern of astreamer_archive_read() returning
> zero without reading or appending any data: this can happen only if
> the WAL is shorter than expected -- i.e., an incomplete segment. In
> that case, WALDumpReadPage() will raise the appropriate error, so I
> think we don't have to check at that point.

I'm not going to accept that kind of justification -- it is too
fragile to assume that you don't need to check for an error because it
"can't happen". Sometimes that is reasonable, but there is quite a lot
of action-at-a-distance here, so it does not feel safe.

> > Another thing I noticed is that astreamer_archive_read() makes
> > reference to decrypting, but there's no cryptography involved in
> > any of this.
>
> I think that was a typo -- I meant decompression.

I figured as much, but it still needs fixing.

--
Robert Haas
EDB: http://www.enterprisedb.com