Re: standby recovery fails (tablespace related) (tentative patch and discussion) - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Date
Msg-id 20220307.173927.1648940908020465758.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: standby recovery fails (tablespace related) (tentative patch and discussion)  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: standby recovery fails (tablespace related) (tentative patch and discussion)  (Robert Haas <robertmhaas@gmail.com>)
Re: standby recovery fails (tablespace related) (tentative patch and discussion)  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
List pgsql-hackers
So the new framework has been dropped in this version.
The second test is removed as it is irrelevant to this bug.

In this version the patch is a single file that contains the test.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 43bb3ba8900edd53a1feb0acb1a72bdc22bb1627 Mon Sep 17 00:00:00 2001
From: P <apraveen@pivotal.io>
Date: Mon, 7 Mar 2022 17:10:07 +0900
Subject: [PATCH v20] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch adds mechanism similar to invalid page hash table, to track
missing directories during crash recovery.  If all the missing
directory references are matched with corresponding drop records at
the end of crash recovery, the standby can safely enter archive
recovery.

Bug identified by Paul Guo.

Authored by Paul Guo, Kyotaro Horiguchi and Asim R P.
---
 src/backend/access/transam/xlogrecovery.c   |   6 +
 src/backend/access/transam/xlogutils.c      | 145 ++++++++++++++++++++
 src/backend/commands/dbcommands.c           |  56 ++++++++
 src/backend/commands/tablespace.c           |  16 +++
 src/include/access/xlogutils.h              |   4 +
 src/test/recovery/t/029_replay_tsp_drops.pl |  62 +++++++++
 6 files changed, 289 insertions(+)
 create mode 100644 src/test/recovery/t/029_replay_tsp_drops.pl

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f9f212680b..97fed1e04d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2043,6 +2043,12 @@ CheckRecoveryConsistency(void)
          */
         XLogCheckInvalidPages();
 
+        /*
+         * Check if the XLOG sequence contained any unresolved references to
+         * missing directories.
+         */
+        XLogCheckMissingDirs();
+
         reachedConsistency = true;
         ereport(LOG,
                 (errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 54d5f20734..3f8f7dadac 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -79,6 +79,151 @@ typedef struct xl_invalid_page
 
 static HTAB *invalid_page_tab = NULL;
 
+/*
+ * If a create database WAL record is being replayed more than once during
+ * crash recovery on a standby, it is possible that either the tablespace
+ * directory or the template database directory is missing.  This happens when
+ * the directories are removed by replay of subsequent drop records.  Note
+ * that this problem happens only on standby and not on master.  On master, a
+ * checkpoint is created at the end of create database operation. On standby,
+ * however, such a strategy (creating restart points during replay) is not
+ * viable because it will slow down WAL replay.
+ *
+ * The alternative is to track references to each missing directory
+ * encountered when performing crash recovery in the following hash table.
+ * Similar to invalid page table above, the expectation is that each missing
+ * directory entry should be matched with a drop database or drop tablespace
+ * WAL record by the end of crash recovery.
+ */
+typedef struct xl_missing_dir_key
+{
+    Oid spcNode;
+    Oid dbNode;
+} xl_missing_dir_key;
+
+typedef struct xl_missing_dir
+{
+    xl_missing_dir_key key;
+    char path[MAXPGPATH];
+} xl_missing_dir;
+
+static HTAB *missing_dir_tab = NULL;
+
+void
+XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path)
+{
+    xl_missing_dir_key key;
+    bool found;
+    xl_missing_dir *entry;
+
+    /*
+     * Database OID may be invalid but tablespace OID must be valid.  If
+     * dbNode is InvalidOid, we are logging a missing tablespace directory,
+     * otherwise we are logging a missing database directory.
+     */
+    Assert(OidIsValid(spcNode));
+
+    if (missing_dir_tab == NULL)
+    {
+        /* create hash table when first needed */
+        HASHCTL        ctl;
+
+        memset(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(xl_missing_dir_key);
+        ctl.entrysize = sizeof(xl_missing_dir);
+
+        missing_dir_tab = hash_create("XLOG missing directory table",
+                                       100,
+                                       &ctl,
+                                       HASH_ELEM | HASH_BLOBS);
+    }
+
+    key.spcNode = spcNode;
+    key.dbNode = dbNode;
+
+    entry = hash_search(missing_dir_tab, &key, HASH_ENTER, &found);
+
+    if (found)
+    {
+        if (dbNode == InvalidOid)
+            elog(DEBUG2, "missing directory %s (tablespace %d) already exists: %s",
+                 path, spcNode, entry->path);
+        else
+            elog(DEBUG2, "missing directory %s (tablespace %d database %d) already exists: %s",
+                 path, spcNode, dbNode, entry->path);
+    }
+    else
+    {
+        strlcpy(entry->path, path, sizeof(entry->path));
+        if (dbNode == InvalidOid)
+            elog(DEBUG2, "logged missing dir %s (tablespace %d)",
+                 path, spcNode);
+        else
+            elog(DEBUG2, "logged missing dir %s (tablespace %d database %d)",
+                 path, spcNode, dbNode);
+    }
+}
+
+void
+XLogForgetMissingDir(Oid spcNode, Oid dbNode)
+{
+    xl_missing_dir_key key;
+
+    key.spcNode = spcNode;
+    key.dbNode = dbNode;
+
+    /* Database OID may be invalid but tablespace OID must be valid. */
+    Assert(OidIsValid(spcNode));
+
+    if (missing_dir_tab == NULL)
+        return;
+
+    if (hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL) != NULL)
+    {
+        if (dbNode == InvalidOid)
+        {
+            elog(DEBUG2, "forgot missing dir (tablespace %d)", spcNode);
+        }
+        else
+        {
+            char *path = GetDatabasePath(dbNode, spcNode);
+
+            elog(DEBUG2, "forgot missing dir %s (tablespace %d database %d)",
+                 path, spcNode, dbNode);
+            pfree(path);
+        }
+    }
+}
+
+/*
+ * This is called at the end of crash recovery, before entering archive
+ * recovery on a standby.  PANIC if the hash table is not empty.
+ */
+void
+XLogCheckMissingDirs(void)
+{
+    HASH_SEQ_STATUS status;
+    xl_missing_dir *hentry;
+    bool        foundone = false;
+
+    if (missing_dir_tab == NULL)
+        return;                    /* nothing to do */
+
+    hash_seq_init(&status, missing_dir_tab);
+
+    while ((hentry = (xl_missing_dir *) hash_seq_search(&status)) != NULL)
+    {
+        elog(WARNING, "missing directory \"%s\" tablespace %d database %d",
+             hentry->path, hentry->key.spcNode, hentry->key.dbNode);
+        foundone = true;
+    }
+
+    if (foundone)
+        elog(PANIC, "WAL contains references to missing directories");
+
+    hash_destroy(missing_dir_tab);
+    missing_dir_tab = NULL;
+}
 
 /* Report a reference to an invalid page */
 static void
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index c37e3c9a9a..8994e9da99 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -30,6 +30,7 @@
 #include "access/tableam.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
+#include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@@ -2382,7 +2383,9 @@ dbase_redo(XLogReaderState *record)
         xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
         char       *src_path;
         char       *dst_path;
+        char       *parent_path;
         struct stat st;
+        bool        skip = false;
 
         src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
         dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
@@ -2400,6 +2403,55 @@ dbase_redo(XLogReaderState *record)
                         (errmsg("some useless files may be left behind in old database directory \"%s\"",
                                 dst_path)));
         }
+        else if (!reachedConsistency)
+        {
+            /*
+             * It is possible that drop tablespace record appearing later in
+             * the WAL as already been replayed.  That means we are replaying
+             * the create database record second time, as part of crash
+             * recovery.  In that case, the tablespace directory has already
+             * been removed and the create database operation cannot be
+             * replayed.  We should skip the replay but remember the missing
+             * tablespace directory, to be matched with a drop tablespace
+             * record later.
+             */
+            parent_path = pstrdup(dst_path);
+            get_parent_directory(parent_path);
+            if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+            {
+                XLogReportMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);
+                skip = true;
+                ereport(WARNING,
+                        (errmsg("skipping create database WAL record"),
+                         errdetail("Target tablespace \"%s\" not found. We "
+                                   "expect to encounter a WAL record that "
+                                   "removes this directory before reaching "
+                                   "consistent state.", parent_path)));
+            }
+            pfree(parent_path);
+        }
+
+        /*
+         * Source directory may be missing.  E.g. the template database used
+         * for creating this database may have been dropped, due to reasons
+         * noted above.  Moving a database from one tablespace may also be a
+         * partner in the crime.
+         */
+        if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)) &&
+            !reachedConsistency)
+        {
+            XLogReportMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
+            skip = true;
+            ereport(WARNING,
+                    (errmsg("skipping create database WAL record"),
+                     errdetail("Source database \"%s\" not found. We expect "
+                               "to encounter a WAL record that removes this "
+                               "directory before reaching consistent state.",
+                               src_path)));
+        }
+
+        if (skip)
+            return;
 
         /*
          * Force dirty buffers out to disk, to ensure source database is
@@ -2462,6 +2514,10 @@ dbase_redo(XLogReaderState *record)
                 ereport(WARNING,
                         (errmsg("some useless files may be left behind in old database directory \"%s\"",
                                 dst_path)));
+
+            if (!reachedConsistency)
+                XLogForgetMissingDir(xlrec->tablespace_ids[i], xlrec->db_id);
+
             pfree(dst_path);
         }
 
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 40514ab550..66bd28fc74 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -57,6 +57,7 @@
 #include "access/tableam.h"
 #include "access/xact.h"
 #include "access/xloginsert.h"
+#include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@@ -1574,6 +1575,21 @@ tblspc_redo(XLogReaderState *record)
     {
         xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+        if (!reachedConsistency)
+            XLogForgetMissingDir(xlrec->ts_id, InvalidOid);
+
+        /*
+         * Before we remove the tablespace directory, update minimum recovery
+         * point to cover this WAL record. Once the tablespace is removed,
+         * there's no going back.  This manually enforces the WAL-first rule.
+         * Doing this before the removal means that if the removal fails for
+         * some reason, the directory is left alone and needs to be manually
+         * removed.  Alternatively you could update the minimum recovery point
+         * after removal, but that would leave a small window where the
+         * WAL-first rule could be violated.
+         */
+        XLogFlush(record->EndRecPtr);
+
         /*
          * If we issued a WAL record for a drop tablespace it implies that
          * there were no files in it at all when the DROP was done. That means
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 64708949db..5d9c20cae7 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -65,6 +65,10 @@ extern void XLogDropDatabase(Oid dbid);
 extern void XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
                                  BlockNumber nblocks);
 
+extern void XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path);
+extern void XLogForgetMissingDir(Oid spcNode, Oid dbNode);
+extern void XLogCheckMissingDirs(void);
+
 /* Result codes for XLogReadBufferForRedo[Extended] */
 typedef enum
 {
diff --git a/src/test/recovery/t/029_replay_tsp_drops.pl b/src/test/recovery/t/029_replay_tsp_drops.pl
new file mode 100644
index 0000000000..de2a92661c
--- /dev/null
+++ b/src/test/recovery/t/029_replay_tsp_drops.pl
@@ -0,0 +1,62 @@
+# Copyright (c) 2022, PostgreSQL Global Development Group
+
+# Test recovery involving tablespace droppings.  If recovery stops
+# after once tablespace is removed, the next recovery should properly
+# ignore the operations within the removed tablespaces.
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+#use File::Compare;
+
+my $node_primary = PostgreSQL::Test::Cluster->new('primary1');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+$node_primary->psql('postgres',
+qq[
+    SET allow_in_place_tablespaces=on;
+    CREATE TABLESPACE dropme_ts1 LOCATION '';
+    CREATE TABLESPACE dropme_ts2 LOCATION '';
+    CREATE TABLESPACE source_ts  LOCATION '';
+    CREATE TABLESPACE target_ts  LOCATION '';
+    CREATE DATABASE template_db IS_TEMPLATE = true;
+]);
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+my $node_standby = PostgreSQL::Test::Cluster->new('standby1');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->start;
+
+# Make sure connection is made
+$node_primary->poll_query_until(
+    'postgres', 'SELECT count(*) = 1 FROM pg_stat_replication');
+
+$node_standby->safe_psql('postgres', 'CHECKPOINT');
+
+# Do immediate shutdown just after a sequence of CREAT DATABASE / DROP
+# DATABASE / DROP TABLESPACE. This causes CREATE DATABASE WAL records
+# to be applied to already-removed directories.
+$node_primary->safe_psql('postgres',
+                        q[CREATE DATABASE dropme_db1 WITH TABLESPACE dropme_ts1;
+                          CREATE DATABASE dropme_db2 WITH TABLESPACE dropme_ts2;
+                          CREATE DATABASE moveme_db TABLESPACE source_ts;
+                          ALTER DATABASE moveme_db SET TABLESPACE target_ts;
+                          CREATE DATABASE newdb TEMPLATE template_db;
+                          ALTER DATABASE template_db IS_TEMPLATE = false;
+                          DROP DATABASE dropme_db1;
+                          DROP DATABASE dropme_db2; DROP TABLESPACE dropme_ts2;
+                          DROP TABLESPACE source_ts;
+                          DROP DATABASE template_db;]);
+
+$node_primary->wait_for_catchup($node_standby, 'replay',
+                               $node_primary->lsn('replay'));
+$node_standby->stop('immediate');
+
+# Should restart ignoring directory creation error.
+is($node_standby->start(fail_ok => 1), 1);
+
+# Ensure that a missing tablespace directory during create database
+done_testing();
-- 
2.27.0


pgsql-hackers by date:

Previous
From: Peter Smith
Date:
Subject: Re: Handle infinite recursion in logical replication setup
Next
From: Amit Kapila
Date:
Subject: Re: Optionally automatically disable logical replication subscriptions on error