Re: pg_waldump: support decoding of WAL inside tarfile - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: pg_waldump: support decoding of WAL inside tarfile
Msg-id: CA+Tgmoardk4VuthHc23vov+AVkhq7eT0mFUs-2ctAnP1uiTaog@mail.gmail.com
In response to: Re: pg_waldump: support decoding of WAL inside tarfile (Amul Sul <sulamul@gmail.com>)
List: pgsql-hackers
On Thu, Oct 16, 2025 at 7:49 AM Amul Sul <sulamul@gmail.com> wrote:
> astreamer reads the archive in fixed-size chunks (here it is 128KB).
> Sometimes, a single read can contain data from two WAL files --
> specifically, the tail end of one file and the start of the next --
> because of how they're physically stored in the archive. astreamer
> knows where one file ends and another begins through tags like
> ASTREAMER_MEMBER_HEADER, ASTREAMER_MEMBER_CONTENTS, and
> ASTREAMER_MEMBER_TRAILER. However, once one file ends, it can't pause
> mid-chunk to hold back data from the next file for the caller; it
> pushes the entire chunk it has read to the target buffer.

Right, this makes sense.

> So, if we put the reordering logic outside the streamer, we'd
> sometimes be receiving buffers containing mixed data from two WAL
> files. The caller would then need to correctly identify WAL file
> boundaries within those buffers. This would require passing extra
> metadata -- like segment numbers for the WAL files in the buffer,
> plus start and end offsets of those segments within the buffer.
> While not impossible, it feels a bit hacky and I'm unsure if that's
> the best approach.

I agree that we need that kind of metadata, but I don't see why our
need for it depends on where we do the reordering. That is, if we do
the reordering above the astreamer layer, we need to keep track of the
origin of each chunk of WAL bytes, and if we do the reordering within
the astreamer layer, we still need to keep track of the origin of the
WAL bytes. Doing the reordering properly requires that tracking, but
that requirement doesn't say anything about where the tracking has to
be performed.

I think it might be better if we didn't write to the astreamer's
buffer at all. For example, suppose we create a struct that looks
approximately like this:

struct ChunkOfDecodedWAL {
    XLogSegNo segno; // could also be XLogRecPtr start_lsn or char *walfilename or whatever
    StringInfoData buffer;
    char *spillfilename; // or whatever we use to identify the temporary files
    bool already_removed;
    // potentially other metadata
};

Then, create a hash table and key it on the segno, or whatever. Have
the astreamer write to the hash table: when it gets a chunk of WAL, it
looks up or creates the relevant hash table entry and appends the data
to the buffer. At any convenient point in the code, you can decide to
write the data from the buffer to a spill file, after which you
resetStringInfo() on the buffer and populate the spill file name. When
you've used up the data, you remove the spill file and set the
already_removed flag.
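[Editor's note: for concreteness, here is a rough, untested sketch of
what such a table could look like using simplehash with a frontend
allocator, along the same lines as src/common/blkreftable.c. The
walchunk names and the append helper are invented for illustration,
not taken from the patch.]

#include "postgres_fe.h"

#include "access/xlogdefs.h"
#include "common/hashfn.h"
#include "lib/stringinfo.h"

typedef struct ChunkOfDecodedWAL
{
    XLogSegNo   segno;           /* hash key */
    char        status;          /* required by simplehash */
    StringInfoData buffer;       /* in-memory WAL bytes, if any */
    char       *spillfilename;   /* set once the buffer is spilled */
    bool        already_removed; /* spill file already unlinked? */
} ChunkOfDecodedWAL;

#define SH_PREFIX walchunk
#define SH_ELEMENT_TYPE ChunkOfDecodedWAL
#define SH_KEY_TYPE XLogSegNo
#define SH_KEY segno
#define SH_HASH_KEY(tb, key) \
    hash_bytes((const unsigned char *) &key, sizeof(XLogSegNo))
#define SH_EQUAL(tb, a, b) ((a) == (b))
#define SH_SCOPE static inline
#define SH_RAW_ALLOCATOR pg_malloc0
#define SH_DECLARE
#define SH_DEFINE
#include "lib/simplehash.h"

/* Append one run of decoded bytes belonging to a single segment. */
static void
walchunk_append(walchunk_hash *tab, XLogSegNo segno,
                const char *data, int len)
{
    bool        found;
    ChunkOfDecodedWAL *entry = walchunk_insert(tab, segno, &found);

    if (!found)
    {
        initStringInfo(&entry->buffer);
        entry->spillfilename = NULL;
        entry->already_removed = false;
    }
    appendBinaryStringInfo(&entry->buffer, data, len);
}

[Spilling, spill-file cleanup, and the end-of-archive error-reporting
pass would then all be simple operations over entries of this one
table.]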
I think this could also help with the error reporting stuff. When you
get to the end of the file, you'll know all the files you saw and how
much data you read from each of them. So you could possibly do
something like:

ERROR: LSN %08X/%08X not found in archive "%s"
DETAIL: WAL segment %s is not present in the archive
-or-
DETAIL: WAL segment %s was expected to be %u bytes, but was only %u bytes
-or-
DETAIL: whatever else can go wrong

The point is that every file you've ever seen has a hash table entry,
and in that hash table entry you can store everything about that file
that you need to know, whether that's the file data, the disk file
that contains the file data, the fact that we already threw the data
away, or any other fact that you can imagine wanting to know.

Said differently, the astreamer buffer is not really a great place to
write data. It exists because when we're just forwarding data from one
astreamer to the next, we will often need to buffer a small amount of
data to avoid terrible performance. However, it's only there to be
used when we don't have something better. I don't think any astreamer
that is intended to be the last one in the chain currently writes to
the buffer -- they write to the output file, or whatever, because
using an in-memory buffer as your final output destination is not a
real good plan.

> > While I'm on the topic of astreamer_wal_read(), here are a few
> > other problems I noticed:
> >
> > * The return value is not documented, and it seems to always be
> > count, in which case it might as well return void. The caller
> > already has the value they passed for count.
>
> The caller will be xlogreader, and I believe we shouldn't change
> that. For the same reason, WALDumpReadPage() also returns the same.

OK, but then you can make that clear via a brief comment.

> The loop isn't needed because the caller always requests 8KB of
> data, while READ_CHUNK_SIZE is 128KB. It's assumed that the
> astreamer has already created the file with some initial data. For
> example, if only a few bytes have been written so far, when we reach
> TarWALDumpReadPage(), it detects that we're reading the same file
> that the astreamer is still writing to and hasn't finished. It then
> requests to append 128KB of data by calling
> astreamer_archive_read(), even though we only need 8KB at a time.
> This process repeats each time the next 8KB chunk is requested:
> astreamer_archive_read() appends another 128KB, and continues until
> the file has been fully read and written.

Sure, but you don't know how much data is going to come out the other
end of the astreamer pipeline. Since the data is (possibly)
compressed, you expect at least as many bytes to emerge from the
output end as you add to the input end, but sometimes compressors end
up making the data slightly larger instead of smaller. It's unlikely
that the effect would be so dramatic that adding 128kB to one end of
the pipeline would make less than 8kB emerge from the other end, but
it's not a good idea to rely on assumptions like that. Not that this
is a real thing, but imagine that the compressed file had something in
the middle of it that behaved like a comment in C code, i.e. it didn't
generate any output.
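[Editor's note: the defensive shape being suggested is to keep feeding
raw archive bytes through the pipeline until the requested page has
actually materialized, rather than assuming one 128kB read must
produce at least 8kB of output. A purely hypothetical sketch --
wal_bytes_available() is invented here, and it assumes
astreamer_archive_read() returns the number of raw bytes consumed and
zero at end of archive, as discussed further below:]

/*
 * Pump the archive until the requested page is available.
 * wal_bytes_available() stands in for "do we already hold bytes
 * [startoff, startoff + count) of this segment, in memory or in a
 * spill file?"
 */
while (!wal_bytes_available(state, segno, startoff, count))
{
    if (astreamer_archive_read(state) == 0)
        pg_fatal("archive ended before WAL segment %s was complete",
                 fname);
}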
> In the case where the astreamer is exporting a file to disk but
> hasn't finished writing it, and we call TarWALDumpReadPage() to
> request block(s) from that WAL file, we can read only up to the
> existing blocks in the file. Since the file is incomplete, reading
> may fail later. To handle this, astreamer_archive_read() is invoked
> to append more data -- usually more than the requested amount, as
> explained earlier. That is the race condition I am trying to handle.

That's not what a race condition is:

https://en.wikipedia.org/wiki/Race_condition

> Now, regarding the concern of astreamer_archive_read() returning
> zero without reading or appending any data: this can happen only if
> the WAL is shorter than expected -- i.e., an incomplete segment. In
> that case, WALDumpReadPage() will raise the appropriate error, so I
> think we don't have to check at that point.

I'm not going to accept that kind of justification -- it is too
fragile to assume that you don't need to check for an error because it
"can't happen". Sometimes that is reasonable, but there is quite a lot
of action-at-a-distance here, so it does not feel safe.

> > Another thing I noticed is that astreamer_archive_read() makes
> > reference to decrypting, but there's no cryptography involved in
> > any of this.
>
> I think that was a typo -- I meant decompression.

I figured as much, but it still needs fixing.

--
Robert Haas
EDB: http://www.enterprisedb.com