Re: pg_waldump: support decoding of WAL inside tarfile - Mailing list pgsql-hackers

From Tom Lane
Subject Re: pg_waldump: support decoding of WAL inside tarfile
Date
Msg-id 2431968.1774121470@sss.pgh.pa.us
Whole thread Raw
In response to Re: pg_waldump: support decoding of WAL inside tarfile  (Amul Sul <sulamul@gmail.com>)
List pgsql-hackers
I don't like the v3 patches too much: in particular, they do nothing
for the failure-to-finalize bug I identified yesterday.  v3-0003
is on the right track but seems overcomplicated.  Here is my own
set of proposed patches, which I think will fix what we are seeing
in the buildfarm:

v4-0001 is Andrew's fix for incorrect decompressor finalization.

v4-0002 fixes read_archive_file() to do the missing finalize step.

v4-0003 fixes init_archive_reader() to not depend on cur_file.
This is closely allied to v3-0003 but simpler.  I also added
some commentary to pg_waldump.h about what it's safe to do with
cur_file.

get_archive_wal_entry() violates that advice and is pretty much
utterly broken IMO, because it still believes that it can use cur_file
in an incorrect way.  However, the impact of that is that it may fail
to flush some hashtable entries out to temp files (in case a single
read_archive_file() step reads more than one WAL file, which is
entirely possible with compression).  That is a performance issue but
it's not causing our buildfarm problems, so I left it untouched here.
But I don't think any of the patches proposed so far fix it properly.
What it should do should look more like the revised version of
init_archive_reader's loop: call read_archive_file(), then scan the
hash table for WAL entries we need to flush to files, then finally
return the desired WAL entry if it's present, else loop around.

            regards, tom lane

From a89c0fc8c0fc2c368bc49089344394471c470a79 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat, 21 Mar 2026 15:14:40 -0400
Subject: [PATCH v4 1/3] Fix finalization of decompressor astreamers.

Send the correct amount of data to the next astreamer, not the
whole allocated buffer size.  It's unclear how we missed this bug;
perhaps the use-cases so far are insensitive to trailing garbage.

Author: Andrew Dunstan <andrew@dunslane.net>
---
 src/fe_utils/astreamer_gzip.c | 9 +++++----
 src/fe_utils/astreamer_lz4.c  | 9 +++++----
 src/fe_utils/astreamer_zstd.c | 2 +-
 3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/src/fe_utils/astreamer_gzip.c b/src/fe_utils/astreamer_gzip.c
index 2e080c37a58..df392f67cab 100644
--- a/src/fe_utils/astreamer_gzip.c
+++ b/src/fe_utils/astreamer_gzip.c
@@ -347,10 +347,11 @@ astreamer_gzip_decompressor_finalize(astreamer *streamer)
      * End of the stream, if there is some pending data in output buffers then
      * we must forward it to next streamer.
      */
-    astreamer_content(mystreamer->base.bbs_next, NULL,
-                      mystreamer->base.bbs_buffer.data,
-                      mystreamer->base.bbs_buffer.maxlen,
-                      ASTREAMER_UNKNOWN);
+    if (mystreamer->bytes_written > 0)
+        astreamer_content(mystreamer->base.bbs_next, NULL,
+                          mystreamer->base.bbs_buffer.data,
+                          mystreamer->bytes_written,
+                          ASTREAMER_UNKNOWN);

     astreamer_finalize(mystreamer->base.bbs_next);
 }
diff --git a/src/fe_utils/astreamer_lz4.c b/src/fe_utils/astreamer_lz4.c
index 2bc32b42879..605c188007b 100644
--- a/src/fe_utils/astreamer_lz4.c
+++ b/src/fe_utils/astreamer_lz4.c
@@ -397,10 +397,11 @@ astreamer_lz4_decompressor_finalize(astreamer *streamer)
      * End of the stream, if there is some pending data in output buffers then
      * we must forward it to next streamer.
      */
-    astreamer_content(mystreamer->base.bbs_next, NULL,
-                      mystreamer->base.bbs_buffer.data,
-                      mystreamer->base.bbs_buffer.maxlen,
-                      ASTREAMER_UNKNOWN);
+    if (mystreamer->bytes_written > 0)
+        astreamer_content(mystreamer->base.bbs_next, NULL,
+                          mystreamer->base.bbs_buffer.data,
+                          mystreamer->bytes_written,
+                          ASTREAMER_UNKNOWN);

     astreamer_finalize(mystreamer->base.bbs_next);
 }
diff --git a/src/fe_utils/astreamer_zstd.c b/src/fe_utils/astreamer_zstd.c
index f26abcfd0fa..4b43ab795e3 100644
--- a/src/fe_utils/astreamer_zstd.c
+++ b/src/fe_utils/astreamer_zstd.c
@@ -347,7 +347,7 @@ astreamer_zstd_decompressor_finalize(astreamer *streamer)
     if (mystreamer->zstd_outBuf.pos > 0)
         astreamer_content(mystreamer->base.bbs_next, NULL,
                           mystreamer->base.bbs_buffer.data,
-                          mystreamer->base.bbs_buffer.maxlen,
+                          mystreamer->zstd_outBuf.pos,
                           ASTREAMER_UNKNOWN);

     astreamer_finalize(mystreamer->base.bbs_next);
--
2.43.7

From 8b9e9c7395a56e1a2beb95f36182c8f0e1c3f3ac Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat, 21 Mar 2026 15:15:00 -0400
Subject: [PATCH v4 2/3] Fix failure to finalize the decompression pipeline at
 archive EOF.

archive_waldump.c called astreamer_finalize() nowhere.  This meant
that any data retained in decompression buffers at the moment we
detect archive EOF would never reach astreamer_waldump_content(),
resulting in surprising failures if we actually need the last few
bytes of the archive file.

To fix, make read_archive_file() do the finalize once it detects
EOF.  Change its API to return a boolean "yes there's more data"
rather than the entirely-misleading raw count of bytes read.

(I'm not exactly convinced that it should have a count parameter
rather than just always reading archive_read_buf_size worth,
either.  But I didn't change that here.)
---
 src/bin/pg_waldump/archive_waldump.c | 42 ++++++++++++++++++++++------
 src/bin/pg_waldump/pg_waldump.h      |  1 +
 2 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/src/bin/pg_waldump/archive_waldump.c b/src/bin/pg_waldump/archive_waldump.c
index b078c2d6960..16e5cd58e41 100644
--- a/src/bin/pg_waldump/archive_waldump.c
+++ b/src/bin/pg_waldump/archive_waldump.c
@@ -89,7 +89,7 @@ typedef struct astreamer_waldump

 static ArchivedWALFile *get_archive_wal_entry(const char *fname,
                                               XLogDumpPrivate *privateInfo);
-static int    read_archive_file(XLogDumpPrivate *privateInfo, Size count);
+static bool read_archive_file(XLogDumpPrivate *privateInfo, Size count);
 static void setup_tmpwal_dir(const char *waldir);
 static void cleanup_tmpwal_dir_atexit(void);

@@ -139,6 +139,7 @@ init_archive_reader(XLogDumpPrivate *privateInfo,
         pg_fatal("could not open file \"%s\"", privateInfo->archive_name);

     privateInfo->archive_fd = fd;
+    privateInfo->archive_fd_eof = false;

     streamer = astreamer_waldump_new(privateInfo);

@@ -178,7 +179,7 @@ init_archive_reader(XLogDumpPrivate *privateInfo,
      */
     while (entry == NULL || entry->buf->len < XLOG_BLCKSZ)
     {
-        if (read_archive_file(privateInfo, XLOG_BLCKSZ) == 0)
+        if (!read_archive_file(privateInfo, XLOG_BLCKSZ))
             pg_fatal("could not find WAL in archive \"%s\"",
                      privateInfo->archive_name);

@@ -236,9 +237,10 @@ free_archive_reader(XLogDumpPrivate *privateInfo)
     /*
      * NB: Normally, astreamer_finalize() is called before astreamer_free() to
      * flush any remaining buffered data or to ensure the end of the tar
-     * archive is reached.  However, when decoding WAL, once we hit the end
-     * LSN, any remaining buffered data or unread portion of the archive can
-     * be safely ignored.
+     * archive is reached.  read_archive_file() may have done so.  However,
+     * when decoding WAL we can stop once we hit the end LSN, so we may never
+     * have read all of the input file.  In that case any remaining buffered
+     * data or unread portion of the archive can be safely ignored.
      */
     astreamer_free(privateInfo->archive_streamer);

@@ -384,7 +386,7 @@ read_archive_wal_page(XLogDumpPrivate *privateInfo, XLogRecPtr targetPagePtr,
                          fname, privateInfo->archive_name,
                          (long long int) (count - nbytes),
                          (long long int) count);
-            if (read_archive_file(privateInfo, READ_CHUNK_SIZE) == 0)
+            if (!read_archive_file(privateInfo, READ_CHUNK_SIZE))
                 pg_fatal("unexpected end of archive \"%s\" while reading \"%s\": read %lld of %lld bytes",
                          privateInfo->archive_name, fname,
                          (long long int) (count - nbytes),
@@ -490,7 +492,7 @@ get_archive_wal_entry(const char *fname, XLogDumpPrivate *privateInfo)
          */
         if (entry == NULL || entry->buf->len == 0)
         {
-            if (read_archive_file(privateInfo, READ_CHUNK_SIZE) == 0)
+            if (!read_archive_file(privateInfo, READ_CHUNK_SIZE))
                 break;            /* archive file ended */
         }

@@ -540,8 +542,14 @@ get_archive_wal_entry(const char *fname, XLogDumpPrivate *privateInfo)
 /*
  * Reads a chunk from the archive file and passes it through the streamer
  * pipeline for decompression (if needed) and tar member extraction.
+ *
+ * count is the maximum amount to try to read this time.  Note that it's
+ * measured in raw file bytes, and may have little to do with how much
+ * comes out of decompression/extraction.
+ *
+ * Returns true if successful, false if there is no more data.
  */
-static int
+static bool
 read_archive_file(XLogDumpPrivate *privateInfo, Size count)
 {
     int            rc;
@@ -549,6 +557,11 @@ read_archive_file(XLogDumpPrivate *privateInfo, Size count)
     /* The read request must not exceed the allocated buffer size. */
     Assert(privateInfo->archive_read_buf_size >= count);

+    /* Fail if we already reached EOF in a prior call. */
+    if (privateInfo->archive_fd_eof)
+        return false;
+
+    /* Try to read some more data. */
     rc = read(privateInfo->archive_fd, privateInfo->archive_read_buf, count);
     if (rc < 0)
         pg_fatal("could not read file \"%s\": %m",
@@ -562,8 +575,19 @@ read_archive_file(XLogDumpPrivate *privateInfo, Size count)
         astreamer_content(privateInfo->archive_streamer, NULL,
                           privateInfo->archive_read_buf, rc,
                           ASTREAMER_UNKNOWN);
+    else
+    {
+        /*
+         * We reached EOF, but there is probably still data queued in the
+         * astreamer pipeline's buffers.  Flush it out to ensure that we
+         * process everything.
+         */
+        astreamer_finalize(privateInfo->archive_streamer);
+        /* Set flag to ensure we don't finalize more than once. */
+        privateInfo->archive_fd_eof = true;
+    }

-    return rc;
+    return true;
 }

 /*
diff --git a/src/bin/pg_waldump/pg_waldump.h b/src/bin/pg_waldump/pg_waldump.h
index 36893624f53..cde7c6ca3f2 100644
--- a/src/bin/pg_waldump/pg_waldump.h
+++ b/src/bin/pg_waldump/pg_waldump.h
@@ -35,6 +35,7 @@ typedef struct XLogDumpPrivate
     char       *archive_dir;
     char       *archive_name;    /* Tar archive filename */
     int            archive_fd;        /* File descriptor for the open tar file */
+    bool        archive_fd_eof; /* Have we reached EOF on archive_fd? */

     astreamer  *archive_streamer;
     char       *archive_read_buf;    /* Reusable read buffer for archive I/O */
--
2.43.7

From b28034bae725549cceacb92b7ed631072d4a5665 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat, 21 Mar 2026 15:15:16 -0400
Subject: [PATCH v4 3/3] Fix init_archive_reader() to not depend on cur_file.

Relying on privateInfo->cur_file is a mistake: it can only work
if read_archive_file stops at a point where some WAL segment has
been partially read.  That might not happen, notably if we reach
the end of the archive before satisfying the loop.  This appears
to explain not-very-reproducible "could not find WAL in archive"
failures we're seeing in the buildfarm.

Instead, after calling read_archive_file, scan the archive_wal_htab
to see if there is any cached WAL segment that has enough data.

While at it, fix a minor thinko: we don't have to insist on having
collected XLOG_BLCKSZ worth of WAL from that first WAL segment.
It's enough if we have the first page header.
---
 src/bin/pg_waldump/archive_waldump.c | 20 +++++++++++++++-----
 src/bin/pg_waldump/pg_waldump.h      |  9 ++++++++-
 2 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_waldump/archive_waldump.c b/src/bin/pg_waldump/archive_waldump.c
index 16e5cd58e41..20ca4547ba7 100644
--- a/src/bin/pg_waldump/archive_waldump.c
+++ b/src/bin/pg_waldump/archive_waldump.c
@@ -173,17 +173,27 @@ init_archive_reader(XLogDumpPrivate *privateInfo,
     privateInfo->archive_wal_htab = ArchivedWAL_create(8, NULL);

     /*
-     * Read until we have at least one full WAL page (XLOG_BLCKSZ bytes) from
-     * the first WAL segment in the archive so we can extract the WAL segment
-     * size from the long page header.
+     * Read the archive until we've found at least one WAL segment and
+     * obtained enough bytes from it to let us extract the WAL segment size
+     * from the long page header.
      */
-    while (entry == NULL || entry->buf->len < XLOG_BLCKSZ)
+    while (entry == NULL)
     {
+        ArchivedWAL_iterator iter;
+
+        /* Read more data, fail if there is no more. */
         if (!read_archive_file(privateInfo, XLOG_BLCKSZ))
             pg_fatal("could not find WAL in archive \"%s\"",
                      privateInfo->archive_name);

-        entry = privateInfo->cur_file;
+        /* Search the hash table for a WAL segment with enough data. */
+        ArchivedWAL_start_iterate(privateInfo->archive_wal_htab, &iter);
+        while ((entry = ArchivedWAL_iterate(privateInfo->archive_wal_htab,
+                                            &iter)) != NULL)
+        {
+            if (entry->read_len >= sizeof(XLogLongPageHeader))
+                break;
+        }
     }

     /* Extract the WAL segment size from the long page header */
diff --git a/src/bin/pg_waldump/pg_waldump.h b/src/bin/pg_waldump/pg_waldump.h
index cde7c6ca3f2..ca0dfd97168 100644
--- a/src/bin/pg_waldump/pg_waldump.h
+++ b/src/bin/pg_waldump/pg_waldump.h
@@ -44,7 +44,14 @@ typedef struct XLogDumpPrivate
     Size        archive_read_buf_size;
 #endif

-    /* What the archive streamer is currently reading */
+    /*
+     * The buffer for the WAL file the archive streamer is currently reading,
+     * or NULL if none.  It is quite risky to examine this anywhere except in
+     * astreamer_waldump_content(), since it can change multiple times during
+     * a single read_archive_file() call.  However, it is safe to assume that
+     * if cur_file is different from a particular ArchivedWALFile of interest,
+     * then the archive streamer has finished reading that file.
+     */
     struct ArchivedWALFile *cur_file;

     /*
--
2.43.7


pgsql-hackers by date:

Previous
From: "Greg Burd"
Date:
Subject: Re: Add RISC-V Zbb popcount optimization
Next
From: Andres Freund
Date:
Subject: Re: index prefetching