Re: pg_waldump: support decoding of WAL inside tarfile - Mailing list pgsql-hackers

From Tom Lane
Subject Re: pg_waldump: support decoding of WAL inside tarfile
Date
Msg-id 2790913.1774200584@sss.pgh.pa.us
Whole thread Raw
In response to Re: pg_waldump: support decoding of WAL inside tarfile  (Andrew Dunstan <andrew@dunslane.net>)
Responses Re: pg_waldump: support decoding of WAL inside tarfile
List pgsql-hackers
Andrew Dunstan <andrew@dunslane.net> writes:
> OK, patch 5 of this set does that. I reworked your previous patches 2 and 3
> slightly - mostly additional comments, and fixing a bug in use
> of sizeof(XLogLongPageHeader). Patch 4 here tries to fix the wrong use of
> cur_file in get_archive_wal_entry()

A few nits:

0001: I think we need to rewrite the commit message now that we've
run the unknowns to ground.  Perhaps like

Send the correct amount of data to the next astreamer, not the
whole allocated buffer size.  This bug escaped detection because
in present uses the next astreamer is always a tar-file parser
which is insensitive to trailing garbage.  But that may not
always be true.

Also, I think we ought to back-patch 0001, just in case we have
misunderstood the scope of usage or somebody tries to make use of
back-branch code for a new purpose.  This is a bit tedious because
before f80b09bac/3c9056981 the code was somewhere else and used
different names.  But the bugs are clearly there, back to v15.

0002, 0003, 0005: LGTM.

0004: Don't like this at all.  It fails to respond to the point
I was trying to make: the code will only consider spilling hash
entries that are the cur_file at some read_archive_file boundary.
Also, it seems to be doubling down on the already over-complicated
and probably under-correct goal of spilling entries that are still
in progress of being read.  We can make this function far simpler
and more obviously correct if we just accept that we'll read a
WAL file completely before spilling it.  See my proposed
alternative to 0004, attached.

            regards, tom lane

From 7c46a2f2d70c0f103ac574d29a44a6d98fa2aba6 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sun, 22 Mar 2026 13:18:05 -0400
Subject: [PATCH v6 4/4] Fix get_archive_wal_entry() to handle spilling
 correctly.

get_archive_wal_entry() relied on tracking cur_file to identify
WAL hash table entries that need to be spilled to disk.  However,
this can't work for entries that are read completely within a
single read_archive_file call: the caller will never see cur_file
pointing at such an entry.  Instead, scan the WAL hash table to
find entries we should spill.

Also, simplify the logic tremendously by not attempting to spill
entries that haven't been read fully.  I am not convinced that the old
logic handled that correctly in every path, and it's really not worth
the complication and risk of bugs to try to spill entries on the fly.
We can just write them in a single go once they are no longer the
cur_file.

Fix a buglet: if the initial loop in init_archive_reader loaded
multiple WAL files before identifying the segment size, only
one of those hash table entries was subject to being freed as
irrelevant, and none of them were subject to getting spilled.

Fix a rather critical performance problem: the code thought that
resetStringInfo() will reclaim storage, but it doesn't.  So by the
end of the run we'd have consumed storage space equal to the total
amount of WAL read, negating all the effort of the spill logic.
---
 src/bin/pg_waldump/archive_waldump.c | 120 +++++++++++++--------------
 1 file changed, 58 insertions(+), 62 deletions(-)

diff --git a/src/bin/pg_waldump/archive_waldump.c b/src/bin/pg_waldump/archive_waldump.c
index 20ca4547ba7..650aa8c73ea 100644
--- a/src/bin/pg_waldump/archive_waldump.c
+++ b/src/bin/pg_waldump/archive_waldump.c
@@ -229,7 +229,10 @@ init_archive_reader(XLogDumpPrivate *privateInfo,
      * relevance check against the user-provided range now; if the WAL falls
      * outside this range, remove it from the hash table. Subsequent WAL will
      * be filtered automatically by the archive streamer using the updated
-     * start_segno and end_segno values.
+     * start_segno and end_segno values.  Note that there could be other
+     * irrelevant records already loaded into the hash table too, but if so we
+     * will clean them out during the first iteration of
+     * get_archive_wal_entry().
      */
     XLogFromFileName(entry->fname, &timeline, &segno, privateInfo->segsize);
     if (privateInfo->timeline != timeline ||
@@ -458,30 +461,25 @@ free_archive_wal_entry(const char *fname, XLogDumpPrivate *privateInfo)
 /*
  * Returns the archived WAL entry from the hash table if it already exists.
  * Otherwise, reads more data from the archive until the requested entry is
- * found.  If the archive streamer is reading a WAL file from the archive that
+ * found.  If the archive streamer reads a WAL file from the archive that
  * is not currently needed, that data is spilled to a temporary file for later
  * retrieval.
+ *
+ * Note that the returned entry might not have been completely read from
+ * the archive yet.
  */
 static ArchivedWALFile *
 get_archive_wal_entry(const char *fname, XLogDumpPrivate *privateInfo)
 {
-    ArchivedWALFile *entry = NULL;
-    FILE       *write_fp = NULL;
-
-    /*
-     * Search the hash table first. If the entry is found, return it.
-     * Otherwise, the requested WAL entry hasn't been read from the archive
-     * yet; invoke the archive streamer to fetch it.
-     */
     while (1)
     {
+        ArchivedWALFile *entry;
+        ArchivedWAL_iterator iter;
+
         /*
-         * Search hash table.
-         *
-         * We perform the search inside the loop because a single iteration of
-         * the archive reader may decompress and extract multiple files into
-         * the hash table. One of these newly added files could be the one we
-         * are seeking.
+         * Search the hash table first.  If the entry is found, return it.
+         * Otherwise, the requested WAL entry hasn't been read from the
+         * archive yet; we must invoke the archive streamer to fetch it.
          */
         entry = ArchivedWAL_lookup(privateInfo->archive_wal_htab, fname);

@@ -489,64 +487,62 @@ get_archive_wal_entry(const char *fname, XLogDumpPrivate *privateInfo)
             return entry;

         /*
-         * Capture the current entry before calling read_archive_file(),
-         * because cur_file may advance to a new segment during streaming. We
-         * hold this reference so we can flush any remaining buffer data and
-         * close the write handle once we detect that cur_file has moved on.
+         * Before loading more data, scan the hash table to see if we have
+         * loaded any files we don't need, or don't need yet.  Files we don't
+         * need can be summarily deleted.  If we find one we will need later,
+         * spill its data to disk to conserve memory space.
          */
-        entry = privateInfo->cur_file;
-
-        /*
-         * Fetch more data either when no current file is being tracked or
-         * when its buffer has been fully flushed to the temporary file.
-         */
-        if (entry == NULL || entry->buf->len == 0)
+        ArchivedWAL_start_iterate(privateInfo->archive_wal_htab, &iter);
+        while ((entry = ArchivedWAL_iterate(privateInfo->archive_wal_htab,
+                                            &iter)) != NULL)
         {
-            if (!read_archive_file(privateInfo, READ_CHUNK_SIZE))
-                break;            /* archive file ended */
-        }
+            TimeLineID    timeline;
+            XLogSegNo    segno;
+            FILE       *write_fp;

-        /*
-         * Archive streamer is reading a non-WAL file or an irrelevant WAL
-         * file.
-         */
-        if (entry == NULL)
-            continue;
+            /*
+             * Check to see if this entry is irrelevant.  (This can only
+             * succeed if multiple WAL files were loaded before
+             * init_archive_reader identified the WAL segment size, but that's
+             * possible with well-compressed WAL data.)
+             */
+            XLogFromFileName(entry->fname, &timeline, &segno,
+                             privateInfo->segsize);
+            if (privateInfo->timeline != timeline ||
+                privateInfo->start_segno > segno ||
+                privateInfo->end_segno < segno)
+            {
+                free_archive_wal_entry(entry->fname, privateInfo);
+                continue;
+            }

-        /*
-         * The streamer is producing a WAL segment that isn't the one asked
-         * for; it must be arriving out of order.  Spill its data to disk so
-         * it can be read back when needed.
-         */
-        Assert(strcmp(fname, entry->fname) != 0);
+            /*
+             * Otherwise, it's relevant but not the one we want right now.
+             * Spill to disk if it's complete; otherwise, wait till it is.
+             */
+            if (entry->spilled)
+                continue;        /* already spilled */
+            if (entry == privateInfo->cur_file)
+                continue;        /* still being read */

-        /* Create a temporary file if one does not already exist */
-        if (!entry->spilled)
-        {
+            /* Write out the completed WAL file contents to a temp file. */
             write_fp = prepare_tmp_write(entry->fname, privateInfo);
+            perform_tmp_write(entry->fname, entry->buf, write_fp);
+            fclose(write_fp);
+
+            /* resetStringInfo won't release storage, so delete/recreate. */
+            destroyStringInfo(entry->buf);
+            entry->buf = makeStringInfo();
             entry->spilled = true;
         }

-        /* Flush data from the buffer to the file */
-        perform_tmp_write(entry->fname, entry->buf, write_fp);
-        resetStringInfo(entry->buf);
-
         /*
-         * If cur_file changed since we captured entry above, the archive
-         * streamer has finished this segment and moved on.  Close its spill
-         * file handle so data is flushed to disk before the next segment
-         * starts writing to a different handle.
+         * Read more data.  If we reach EOF, the desired file is not present.
          */
-        if (entry != privateInfo->cur_file && write_fp != NULL)
-        {
-            fclose(write_fp);
-            write_fp = NULL;
-        }
+        if (!read_archive_file(privateInfo, READ_CHUNK_SIZE))
+            pg_fatal("could not find WAL \"%s\" in archive \"%s\"",
+                     fname, privateInfo->archive_name);
     }
-
-    /* Requested WAL segment not found */
-    pg_fatal("could not find WAL \"%s\" in archive \"%s\"",
-             fname, privateInfo->archive_name);
 }

 /*
--
2.43.7


pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Compress prune/freeze records with Delta Frame of Reference algorithm
Next
From: "Greg Burd"
Date:
Subject: Re: Add RISC-V Zbb popcount optimization