Re: pg_waldump: support decoding of WAL inside tarfile - Mailing list pgsql-hackers
| From | Tom Lane |
|---|---|
| Subject | Re: pg_waldump: support decoding of WAL inside tarfile |
| Date | |
| Msg-id | 374225.1774459521@sss.pgh.pa.us Whole thread Raw |
| In response to | Re: pg_waldump: support decoding of WAL inside tarfile (Michael Paquier <michael@paquier.xyz>) |
| List | pgsql-hackers |
Michael Paquier <michael@paquier.xyz> writes: > The buildfarm has switched mostly to green, except on this one: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoatzin&dt=2026-03-23%2006%3A00%3A42 Yeah, there are some other weird failures on other machines too. There's also the problem we already knew about of FD leakage breaking cleanup of the temp file directory on Windows. I wrote a patch to fix the FD leakage problem (v8-0001 attached). I don't have Windows at hand, but I tested it by dint of having the atexit callback invoke "lsof" to see if there were any open files in the temp directory. I also occasionally saw some of the weird errors mentioned above. After much debugging, I believe that the issue is that archive_waldump.c is unaware that inserting or deleting entries in a simplehash.h hash table can cause other entries to move. That can break privateInfo->cur_file, and it can also break read_archive_wal_page which thought it could just re-use its entry pointer after calling read_archive_file. v8-0002 attached fixes that, and I'm not seeing weird failures anymore. Two additional thoughts: 1. The amount of data that pg_waldump's TAP tests use is not sufficient to trigger these problems with any degree of reliability. I'm hesitant to make the tests run longer, but clearly we do not have adequate coverage now. 2. I didn't do it here, but I urgently think we should rip out read_archive_wal_page's stanza that truncates the entry's "buf" string (the "if (privateInfo->decoding_started)" part). My faith in this code in general is at rock bottom, and my faith in the extent to which we've tested it is somewhere below ground level. I don't think we need rickety optimizations that serve only to keep the active hashtable entry to less than 16MB, when we're going to reclaim that space altogether as soon as we've finished dumping that segment. This truncation scares me because it adds a whole 'nother level of poorly-documented complexity to the invariants around what is in entry->buf. Also, while we theoretically should not need to spill the entry after this point, if we did we would write a corrupted spill file. regards, tom lane From d21895df67eea1d624e1ffc6eaa7a2c86ef386ff Mon Sep 17 00:00:00 2001 From: Tom Lane <tgl@sss.pgh.pa.us> Date: Wed, 25 Mar 2026 10:46:44 -0400 Subject: [PATCH v8 1/2] Fix file descriptor leakages in pg_waldump. TarWALDumpCloseSegment was of the opinion that it didn't need to do anything. It was mistaken: it has to close the open file if any, because nothing else will. In addition, we failed to ensure that any file being read by the XLogReader machinery gets closed before the atexit callback tries to cleanup the temporary directory holding spilled WAL files. While the file would have been closed already in case of a success exit, this doesn't happen in case of pg_fatal() exits. The least messy way to fix that is to move the atexit function into pg_waldump.c, where it has easier access to the XLogReaderState pointer and to WALDumpCloseSegment. These FD leakages are pretty insignificant on Unix-ish platforms, but they're a bug on Windows, because they prevent successful cleanup of the temporary directory for extracted WAL files. (Windows can't delete a directory that holds a deleted-but-still-open file.) This is visible in occasional buildfarm failures. --- src/bin/pg_waldump/archive_waldump.c | 26 ++-------------- src/bin/pg_waldump/pg_waldump.c | 44 ++++++++++++++++++++++++++-- 2 files changed, 44 insertions(+), 26 deletions(-) diff --git a/src/bin/pg_waldump/archive_waldump.c b/src/bin/pg_waldump/archive_waldump.c index 96f4d94449f..b8e9754b729 100644 --- a/src/bin/pg_waldump/archive_waldump.c +++ b/src/bin/pg_waldump/archive_waldump.c @@ -91,7 +91,6 @@ static ArchivedWALFile *get_archive_wal_entry(const char *fname, XLogDumpPrivate *privateInfo); static bool read_archive_file(XLogDumpPrivate *privateInfo); static void setup_tmpwal_dir(const char *waldir); -static void cleanup_tmpwal_dir_atexit(void); static FILE *prepare_tmp_write(const char *fname, XLogDumpPrivate *privateInfo); static void perform_tmp_write(const char *fname, StringInfo buf, FILE *file); @@ -607,23 +606,10 @@ setup_tmpwal_dir(const char *waldir) pg_log_debug("created directory \"%s\"", TmpWalSegDir); } -/* - * Remove temporary directory at exit, if any. - */ -static void -cleanup_tmpwal_dir_atexit(void) -{ - Assert(TmpWalSegDir != NULL); - - rmtree(TmpWalSegDir, true); - - TmpWalSegDir = NULL; -} - /* * Open a file in the temporary spill directory for writing an out-of-order - * WAL segment, creating the directory and registering the cleanup callback - * if not already done. Returns the open file handle. + * WAL segment, creating the directory if not already done. + * Returns the open file handle. */ static FILE * prepare_tmp_write(const char *fname, XLogDumpPrivate *privateInfo) @@ -631,15 +617,9 @@ prepare_tmp_write(const char *fname, XLogDumpPrivate *privateInfo) char fpath[MAXPGPATH]; FILE *file; - /* - * Setup temporary directory to store WAL segments and set up an exit - * callback to remove it upon completion if not already. - */ + /* Setup temporary directory to store WAL segments, if we didn't already */ if (unlikely(TmpWalSegDir == NULL)) - { setup_tmpwal_dir(privateInfo->archive_dir); - atexit(cleanup_tmpwal_dir_atexit); - } snprintf(fpath, MAXPGPATH, "%s/%s", TmpWalSegDir, fname); diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c index 630d9859882..6bda3c12aa3 100644 --- a/src/bin/pg_waldump/pg_waldump.c +++ b/src/bin/pg_waldump/pg_waldump.c @@ -42,6 +42,8 @@ static const char *progname; static volatile sig_atomic_t time_to_stop = false; +static XLogReaderState *xlogreader_state_cleanup = NULL; + static const RelFileLocator emptyRelFileLocator = {0, 0, 0}; typedef struct XLogDumpConfig @@ -454,13 +456,14 @@ TarWALDumpOpenSegment(XLogReaderState *state, XLogSegNo nextSegNo, /* * pg_waldump's XLogReaderRoutine->segment_close callback to support dumping - * WAL files from tar archives. Segment tracking is handled by - * TarWALDumpReadPage, so no action is needed here. + * WAL files from tar archives. Same as wal_segment_close. */ static void TarWALDumpCloseSegment(XLogReaderState *state) { - /* No action needed */ + close(state->seg.ws_file); + /* need to check errno? */ + state->seg.ws_file = -1; } /* @@ -865,6 +868,27 @@ XLogDumpDisplayStats(XLogDumpConfig *config, XLogStats *stats) total_len, "[100%]"); } +/* + * Remove temporary directory at exit, if any. + */ +static void +cleanup_tmpwal_dir_atexit(void) +{ + /* + * Before calling rmtree, we must close any open file we have in the temp + * directory; else rmdir fails on Windows. + */ + if (xlogreader_state_cleanup != NULL && + xlogreader_state_cleanup->seg.ws_file >= 0) + WALDumpCloseSegment(xlogreader_state_cleanup); + + if (TmpWalSegDir != NULL) + { + rmtree(TmpWalSegDir, true); + TmpWalSegDir = NULL; + } +} + static void usage(void) { @@ -1383,6 +1407,14 @@ main(int argc, char **argv) if (!xlogreader_state) pg_fatal("out of memory while allocating a WAL reading processor"); + /* + * Set up atexit cleanup of temporary directory. This must happen before + * archive_waldump.c could possibly create the temporary directory. Also + * arm the callback to cleanup the xlogreader state. + */ + atexit(cleanup_tmpwal_dir_atexit); + xlogreader_state_cleanup = xlogreader_state; + /* first find a valid recptr to start from */ first_record = XLogFindNextRecord(xlogreader_state, private.startptr, &errormsg); @@ -1495,6 +1527,12 @@ main(int argc, char **argv) LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr), errormsg); + /* + * Disarm atexit cleanup of open WAL file; XLogReaderFree will close it, + * and we don't want the atexit callback trying to touch freed memory. + */ + xlogreader_state_cleanup = NULL; + XLogReaderFree(xlogreader_state); if (private.archive_name) -- 2.43.7 From f28cd200bd5db855164aa06be5c3e960a5ab2f0f Mon Sep 17 00:00:00 2001 From: Tom Lane <tgl@sss.pgh.pa.us> Date: Wed, 25 Mar 2026 13:00:54 -0400 Subject: [PATCH v8 2/2] Fix misuse of simplehash.h hash operations in pg_waldump. Both ArchivedWAL_insert and ArchivedWAL_delete_item can cause existing hashtable entries to move. The code didn't account for this and could leave privateInfo->cur_file pointing at a dead or incorrect entry, with hilarity ensuing. Likewise, read_archive_wal_page calls read_archive_file which could result in movement of the hashtable entry it is working with (due to hashtable expansion for insertion of new entries). I believe these bugs explain some odd buildfarm failures, although the amount of data we use in pg_waldump's TAP tests isn't enough to trigger them reliably. --- src/bin/pg_waldump/archive_waldump.c | 36 ++++++++++++++++++++++++++++ 1 file changed, 36 insertions(+) diff --git a/src/bin/pg_waldump/archive_waldump.c b/src/bin/pg_waldump/archive_waldump.c index b8e9754b729..4348e192f19 100644 --- a/src/bin/pg_waldump/archive_waldump.c +++ b/src/bin/pg_waldump/archive_waldump.c @@ -307,6 +307,7 @@ read_archive_wal_page(XLogDumpPrivate *privateInfo, XLogRecPtr targetPagePtr, XLByteToSeg(targetPagePtr, segno, segsize); XLogFileName(fname, privateInfo->timeline, segno, segsize); entry = get_archive_wal_entry(fname, privateInfo); + Assert(!entry->spilled); while (nbytes > 0) { @@ -403,6 +404,14 @@ read_archive_wal_page(XLogDumpPrivate *privateInfo, XLogRecPtr targetPagePtr, privateInfo->archive_name, fname, (long long int) (count - nbytes), (long long int) count); + + /* + * Loading more data may have moved hash table entries, so we must + * re-look-up the one we are reading from. + */ + entry = ArchivedWAL_lookup(privateInfo->archive_wal_htab, fname); + /* ... it had better still be there */ + Assert(entry != NULL); } } @@ -429,6 +438,7 @@ void free_archive_wal_entry(const char *fname, XLogDumpPrivate *privateInfo) { ArchivedWALFile *entry; + const char *oldfname; entry = ArchivedWAL_lookup(privateInfo->archive_wal_htab, fname); @@ -454,7 +464,23 @@ free_archive_wal_entry(const char *fname, XLogDumpPrivate *privateInfo) if (privateInfo->cur_file == entry) privateInfo->cur_file = NULL; + /* + * ArchivedWAL_delete_item may cause other hash table entries to move. + * Therefore, if cur_file isn't NULL now, we have to be prepared to look + * that entry up again after the deletion. Fortunately, the entry's fname + * string won't move. + */ + oldfname = privateInfo->cur_file ? privateInfo->cur_file->fname : NULL; + ArchivedWAL_delete_item(privateInfo->archive_wal_htab, entry); + + if (oldfname) + { + privateInfo->cur_file = ArchivedWAL_lookup(privateInfo->archive_wal_htab, + oldfname); + /* ... it had better still be there */ + Assert(privateInfo->cur_file != NULL); + } } /* @@ -700,6 +726,9 @@ astreamer_waldump_content(astreamer *streamer, astreamer_member *member, ArchivedWALFile *entry; bool found; + /* Shouldn't see MEMBER_HEADER in the middle of a file */ + Assert(privateInfo->cur_file == NULL); + pg_log_debug("reading \"%s\"", member->pathname); if (!member_is_wal_file(mystreamer, member, &fname)) @@ -728,6 +757,13 @@ astreamer_waldump_content(astreamer *streamer, astreamer_member *member, } } + /* + * Note: ArchivedWAL_insert may cause existing hash table + * entries to move. While cur_file is known to be NULL right + * now, read_archive_wal_page may have a live hash entry + * pointer, which it needs to take care to update after + * read_archive_file completes. + */ entry = ArchivedWAL_insert(privateInfo->archive_wal_htab, fname, &found); -- 2.43.7
pgsql-hackers by date: