Home > mailing lists
Re: pg_waldump: support decoding of WAL inside tarfile - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: pg_waldump: support decoding of WAL inside tarfile
Date	March 22 21:17:53
Msg-id	2880042.1774203473@sss.pgh.pa.us Whole thread Raw
In response to	Re: pg_waldump: support decoding of WAL inside tarfile (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: pg_waldump: support decoding of WAL inside tarfile
List	pgsql-hackers
Tree view
I wrote:
> ... We can make this function far simpler
> and more obviously correct if we just accept that we'll read a
> WAL file completely before spilling it.  See my proposed
> alternative to 0004, attached.

Actually, we can make that better yet by not expecting
get_archive_wal_entry to clean up after init_archive's
failure to free all irrelevant hashtable entries.
Better version attached.

            regards, tom lane

From 1f4c82839847658351c372e85f7fb8c58957fc75 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sun, 22 Mar 2026 14:14:34 -0400
Subject: [PATCH v7 4/4] Fix get_archive_wal_entry() to handle spilling
 correctly.

get_archive_wal_entry() relied on tracking cur_file to identify
WAL hash table entries that need to be spilled to disk.  However,
this can't work for entries that are read completely within a
single read_archive_file call: the caller will never see cur_file
pointing at such an entry.  Instead, scan the WAL hash table to
find entries we should spill.  This also fixes a buglet that any
hash table entries completely loaded during init_archive_reader
were never considered for spilling.

Also, simplify the logic tremendously by not attempting to spill
entries that haven't been read fully.  I am not convinced that the old
logic handled that correctly in every path, and it's really not worth
the complication and risk of bugs to try to spill entries on the fly.
We can just write them in a single go once they are no longer the
cur_file.

Fix a rather critical performance problem: the code thought that
resetStringInfo() will reclaim storage, but it doesn't.  So by the
end of the run we'd have consumed storage space equal to the total
amount of WAL read, negating all the effort of the spill logic.

Fix a second buglet: if the initial loop in init_archive_reader loaded
multiple WAL files before identifying the segment size, only one of
those hash table entries was subject to being freed as irrelevant.
Scan the whole hash table for that operation too.
---
 src/bin/pg_waldump/archive_waldump.c | 130 +++++++++++----------------
 1 file changed, 53 insertions(+), 77 deletions(-)

diff --git a/src/bin/pg_waldump/archive_waldump.c b/src/bin/pg_waldump/archive_waldump.c
index 3fce2183099..067cb85c36b 100644
--- a/src/bin/pg_waldump/archive_waldump.c
+++ b/src/bin/pg_waldump/archive_waldump.c
@@ -128,8 +128,7 @@ init_archive_reader(XLogDumpPrivate *privateInfo,
     astreamer  *streamer;
     ArchivedWALFile *entry = NULL;
     XLogLongPageHeader longhdr;
-    XLogSegNo    segno;
-    TimeLineID    timeline;
+    ArchivedWAL_iterator iter;

     /* Open tar archive and store its file descriptor */
     fd = open_file_in_directory(privateInfo->archive_dir,
@@ -183,8 +182,6 @@ init_archive_reader(XLogDumpPrivate *privateInfo,
      */
     while (entry == NULL)
     {
-        ArchivedWAL_iterator iter;
-
         if (!read_archive_file(privateInfo, XLOG_BLCKSZ))
             pg_fatal("could not find WAL in archive \"%s\"",
                      privateInfo->archive_name);
@@ -226,18 +223,25 @@ init_archive_reader(XLogDumpPrivate *privateInfo,
                     privateInfo->segsize);

     /*
-     * This WAL record was fetched before the filtering parameters
-     * (start_segno and end_segno) were fully initialized. Perform the
-     * relevance check against the user-provided range now; if the WAL falls
-     * outside this range, remove it from the hash table. Subsequent WAL will
-     * be filtered automatically by the archive streamer using the updated
-     * start_segno and end_segno values.
+     * Now that we have initialized the filtering parameters (start_segno and
+     * end_segno), we can discard any already-loaded WAL hash table entries
+     * for segments we don't actually need.  Subsequent WAL will be filtered
+     * automatically by the archive streamer using the updated start_segno and
+     * end_segno values.
      */
-    XLogFromFileName(entry->fname, &timeline, &segno, privateInfo->segsize);
-    if (privateInfo->timeline != timeline ||
-        privateInfo->start_segno > segno ||
-        privateInfo->end_segno < segno)
-        free_archive_wal_entry(entry->fname, privateInfo);
+    ArchivedWAL_start_iterate(privateInfo->archive_wal_htab, &iter);
+    while ((entry = ArchivedWAL_iterate(privateInfo->archive_wal_htab,
+                                        &iter)) != NULL)
+    {
+        XLogSegNo    segno;
+        TimeLineID    timeline;
+
+        XLogFromFileName(entry->fname, &timeline, &segno, privateInfo->segsize);
+        if (privateInfo->timeline != timeline ||
+            privateInfo->start_segno > segno ||
+            privateInfo->end_segno < segno)
+            free_archive_wal_entry(entry->fname, privateInfo);
+    }
 }

 /*
@@ -460,30 +464,25 @@ free_archive_wal_entry(const char *fname, XLogDumpPrivate *privateInfo)
 /*
  * Returns the archived WAL entry from the hash table if it already exists.
  * Otherwise, reads more data from the archive until the requested entry is
- * found.  If the archive streamer is reading a WAL file from the archive that
+ * found.  If the archive streamer reads a WAL file from the archive that
  * is not currently needed, that data is spilled to a temporary file for later
  * retrieval.
+ *
+ * Note that the returned entry might not have been completely read from
+ * the archive yet.
  */
 static ArchivedWALFile *
 get_archive_wal_entry(const char *fname, XLogDumpPrivate *privateInfo)
 {
-    ArchivedWALFile *entry = NULL;
-    FILE       *write_fp = NULL;
-
-    /*
-     * Search the hash table first. If the entry is found, return it.
-     * Otherwise, the requested WAL entry hasn't been read from the archive
-     * yet; invoke the archive streamer to fetch it.
-     */
     while (1)
     {
+        ArchivedWALFile *entry;
+        ArchivedWAL_iterator iter;
+
         /*
-         * Search hash table.
-         *
-         * We perform the search inside the loop because a single iteration of
-         * the archive reader may decompress and extract multiple files into
-         * the hash table. One of these newly added files could be the one we
-         * are seeking.
+         * Search the hash table first.  If the entry is found, return it.
+         * Otherwise, the requested WAL entry hasn't been read from the
+         * archive yet; we must invoke the archive streamer to fetch it.
          */
         entry = ArchivedWAL_lookup(privateInfo->archive_wal_htab, fname);

@@ -491,64 +490,41 @@ get_archive_wal_entry(const char *fname, XLogDumpPrivate *privateInfo)
             return entry;

         /*
-         * Capture the current entry before calling read_archive_file(),
-         * because cur_file may advance to a new segment during streaming. We
-         * hold this reference so we can flush any remaining buffer data and
-         * close the write handle once we detect that cur_file has moved on.
+         * Before loading more data, scan the hash table to see if we have
+         * loaded any files we don't need yet.  If so, spill their data to
+         * disk to conserve memory space.  But don't try to spill a
+         * partially-read file; it's not worth the complication.
          */
-        entry = privateInfo->cur_file;
-
-        /*
-         * Fetch more data either when no current file is being tracked or
-         * when its buffer has been fully flushed to the temporary file.
-         */
-        if (entry == NULL || entry->buf->len == 0)
+        ArchivedWAL_start_iterate(privateInfo->archive_wal_htab, &iter);
+        while ((entry = ArchivedWAL_iterate(privateInfo->archive_wal_htab,
+                                            &iter)) != NULL)
         {
-            if (!read_archive_file(privateInfo, READ_CHUNK_SIZE))
-                break;            /* archive file ended */
-        }
+            FILE       *write_fp;

-        /*
-         * Archive streamer is reading a non-WAL file or an irrelevant WAL
-         * file.
-         */
-        if (entry == NULL)
-            continue;
-
-        /*
-         * The streamer is producing a WAL segment that isn't the one asked
-         * for; it must be arriving out of order.  Spill its data to disk so
-         * it can be read back when needed.
-         */
-        Assert(strcmp(fname, entry->fname) != 0);
+            /* OK to spill? */
+            if (entry->spilled)
+                continue;        /* already spilled */
+            if (entry == privateInfo->cur_file)
+                continue;        /* still being read */

-        /* Create a temporary file if one does not already exist */
-        if (!entry->spilled)
-        {
+            /* Write out the completed WAL file contents to a temp file. */
             write_fp = prepare_tmp_write(entry->fname, privateInfo);
+            perform_tmp_write(entry->fname, entry->buf, write_fp);
+            fclose(write_fp);
+
+            /* resetStringInfo won't release storage, so delete/recreate. */
+            destroyStringInfo(entry->buf);
+            entry->buf = makeStringInfo();
             entry->spilled = true;
         }

-        /* Flush data from the buffer to the file */
-        perform_tmp_write(entry->fname, entry->buf, write_fp);
-        resetStringInfo(entry->buf);
-
         /*
-         * If cur_file changed since we captured entry above, the archive
-         * streamer has finished this segment and moved on.  Close its spill
-         * file handle so data is flushed to disk before the next segment
-         * starts writing to a different handle.
+         * Read more data.  If we reach EOF, the desired file is not present.
          */
-        if (entry != privateInfo->cur_file && write_fp != NULL)
-        {
-            fclose(write_fp);
-            write_fp = NULL;
-        }
+        if (!read_archive_file(privateInfo, READ_CHUNK_SIZE))
+            pg_fatal("could not find WAL \"%s\" in archive \"%s\"",
+                     fname, privateInfo->archive_name);
     }
-
-    /* Requested WAL segment not found */
-    pg_fatal("could not find WAL \"%s\" in archive \"%s\"",
-             fname, privateInfo->archive_name);
 }

 /*
--
2.43.7
pgsql-hackers by date:
From: Lukas Fittl
Date: 22 March, 21:13:11
Subject: Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
From: "Greg Burd"
Date: 22 March, 22:13:22
Subject: Re: Add RISC-V Zbb popcount optimization
Re: pg_waldump: support decoding of WAL inside tarfile - Mailing list pgsql-hackers

Previous

Next