Re: Startup process on a hot standby crashes with an error "invalid memory alloc request size 1073741824" while replaying "Standby/LOCK" records - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Startup process on a hot standby crashes with an error "invalid memory alloc request size 1073741824" while replaying "Standby/LOCK" records
Date
Msg-id 2138765.1664924048@sss.pgh.pa.us
Whole thread Raw
Responses Re: Startup process on a hot standby crashes with an error "invalid memory alloc request size 1073741824" while replaying "Standby/LOCK" records  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
[ redirecting to -hackers because patch attached ]

David Rowley <dgrowleyml@gmail.com> writes:
> So that confirms there were 950k relations in the xl_standby_locks.
> The contents of that message seem to be produced by standby_desc().
> That should be the same WAL record that's processed by standby_redo()
> which adds the 950k locks to the RecoveryLockListsEntry.

> I'm not seeing why 950k becomes 134m.

I figured out what the problem is.  The standby's startup process
retains knowledge of all these locks in standby.c's RecoveryLockLists
data structure, which *has no de-duplication capability*.  It'll add
another entry to the per-XID list any time it's told about a given
exclusive lock.  And checkpoints cause us to regurgitate the entire
set of currently-held exclusive locks into the WAL.  So if you have
a process holding a lot of exclusive locks, and sitting on them
across multiple checkpoints, standby startup processes will bloat.
It's not a true leak, in that we know where the memory is and
we'll release it whenever we see that XID commit/abort.  And I doubt
that this is a common usage pattern, which probably explains the
lack of previous complaints.  Still, bloat bad.

PFA a quick-hack fix that solves this issue by making per-transaction
subsidiary hash tables.  That's overkill perhaps; I'm a little worried
about whether this slows down normal cases more than it's worth.
But we ought to do something about this, because aside from the
duplication aspect the current storage of these lists seems mighty
space-inefficient.

            regards, tom lane

diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 9dab931990..14b86cd76a 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -42,7 +42,26 @@ int            max_standby_archive_delay = 30 * 1000;
 int            max_standby_streaming_delay = 30 * 1000;
 bool        log_recovery_conflict_waits = false;

-static HTAB *RecoveryLockLists;
+/*
+ * Keep track of all the exclusive locks owned by original transactions.
+ * For each original transaction that is known to have any such locks,
+ * there is a RecoveryLockHashEntry in the RecoveryLockHash hash table,
+ * pointing to a subsidiary hash table containing RecoveryLockEntry items.
+ */
+typedef struct RecoveryLockHashEntry
+{
+    TransactionId xid;            /* hash key -- must be first */
+    HTAB       *hashtab;        /* table of locks belonging to xid */
+} RecoveryLockHashEntry;
+
+typedef struct RecoveryLockEntry
+{
+    /* Both Oids are included in the hash key; there is no payload per se */
+    Oid            dbOid;            /* DB containing table */
+    Oid            relOid;            /* OID of table */
+} RecoveryLockEntry;
+
+static HTAB *RecoveryLockHash = NULL;

 /* Flags set by timeout handlers */
 static volatile sig_atomic_t got_standby_deadlock_timeout = false;
@@ -58,15 +77,6 @@ static XLogRecPtr LogCurrentRunningXacts(RunningTransactions CurrRunningXacts);
 static void LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks);
 static const char *get_recovery_conflict_desc(ProcSignalReason reason);

-/*
- * Keep track of all the locks owned by a given transaction.
- */
-typedef struct RecoveryLockListsEntry
-{
-    TransactionId xid;
-    List       *locks;
-} RecoveryLockListsEntry;
-
 /*
  * InitRecoveryTransactionEnvironment
  *        Initialize tracking of our primary's in-progress transactions.
@@ -85,16 +95,18 @@ InitRecoveryTransactionEnvironment(void)
     VirtualTransactionId vxid;
     HASHCTL        hash_ctl;

+    Assert(RecoveryLockHash == NULL);    /* don't run this twice */
+
     /*
      * Initialize the hash table for tracking the list of locks held by each
      * transaction.
      */
     hash_ctl.keysize = sizeof(TransactionId);
-    hash_ctl.entrysize = sizeof(RecoveryLockListsEntry);
-    RecoveryLockLists = hash_create("RecoveryLockLists",
-                                    64,
-                                    &hash_ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+    hash_ctl.entrysize = sizeof(RecoveryLockHashEntry);
+    RecoveryLockHash = hash_create("RecoveryLockHash",
+                                   64,
+                                   &hash_ctl,
+                                   HASH_ELEM | HASH_BLOBS);

     /*
      * Initialize shared invalidation management for Startup process, being
@@ -140,12 +152,12 @@ void
 ShutdownRecoveryTransactionEnvironment(void)
 {
     /*
-     * Do nothing if RecoveryLockLists is NULL because which means that
-     * transaction tracking has not been yet initialized or has been already
-     * shutdowned. This prevents transaction tracking from being shutdowned
-     * unexpectedly more than once.
+     * Do nothing if RecoveryLockHash is NULL because that means that
+     * transaction tracking has not yet been initialized or has already been
+     * shut down.  This makes it safe to have possibly-redundant calls of this
+     * function during process exit.
      */
-    if (RecoveryLockLists == NULL)
+    if (RecoveryLockHash == NULL)
         return;

     /* Mark all tracked in-progress transactions as finished. */
@@ -155,8 +167,8 @@ ShutdownRecoveryTransactionEnvironment(void)
     StandbyReleaseAllLocks();

     /* Destroy the hash table of locks. */
-    hash_destroy(RecoveryLockLists);
-    RecoveryLockLists = NULL;
+    hash_destroy(RecoveryLockHash);
+    RecoveryLockHash = NULL;

     /* Cleanup our VirtualTransaction */
     VirtualXactLockTableCleanup();
@@ -932,12 +944,13 @@ StandbyLockTimeoutHandler(void)
  * We only keep track of AccessExclusiveLocks, which are only ever held by
  * one transaction on one relation.
  *
- * We keep a hash table of lists of locks in local memory keyed by xid,
- * RecoveryLockLists, so we can keep track of the various entries made by
- * the Startup process's virtual xid in the shared lock table.
- *
- * List elements use type xl_standby_lock, since the WAL record type exactly
- * matches the information that we need to keep track of.
+ * We keep a hash table of hashes of locks in local memory keyed by xid,
+ * RecoveryLockHash, so we can keep track of the various entries made by
+ * the Startup process's virtual xid in the shared lock table.  This data
+ * structure allows us to efficiently find all the locks held by a given
+ * original transaction.  It also lets us efficiently de-duplicate locks,
+ * which is important because checkpoints will re-report the same locks
+ * already held.
  *
  * We use session locks rather than normal locks so we don't need
  * ResourceOwners.
@@ -947,9 +960,8 @@ StandbyLockTimeoutHandler(void)
 void
 StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid)
 {
-    RecoveryLockListsEntry *entry;
-    xl_standby_lock *newlock;
-    LOCKTAG        locktag;
+    RecoveryLockHashEntry *entry;
+    RecoveryLockEntry lockent;
     bool        found;

     /* Already processed? */
@@ -964,62 +976,84 @@ StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid)
     /* dbOid is InvalidOid when we are locking a shared relation. */
     Assert(OidIsValid(relOid));

-    /* Create a new list for this xid, if we don't have one already. */
-    entry = hash_search(RecoveryLockLists, &xid, HASH_ENTER, &found);
+    /* Create a new hash table for this xid, if we don't have one already. */
+    entry = hash_search(RecoveryLockHash, &xid, HASH_ENTER, &found);
     if (!found)
     {
-        entry->xid = xid;
-        entry->locks = NIL;
+        HASHCTL        hash_ctl;
+        char        hash_label[64];
+
+        Assert(entry->xid == xid);    /* dynahash should have set this */
+
+        hash_ctl.keysize = sizeof(RecoveryLockEntry);
+        hash_ctl.entrysize = sizeof(RecoveryLockEntry);
+        snprintf(hash_label, sizeof(hash_label), "RecoveryLockHash_%u", xid);
+        entry->hashtab = hash_create(hash_label,
+                                     64,
+                                     &hash_ctl,
+                                     HASH_ELEM | HASH_BLOBS);
     }

-    newlock = palloc(sizeof(xl_standby_lock));
-    newlock->xid = xid;
-    newlock->dbOid = dbOid;
-    newlock->relOid = relOid;
-    entry->locks = lappend(entry->locks, newlock);
+    /* Create a new hash entry for this lock, unless we have one already. */
+    lockent.dbOid = dbOid;
+    lockent.relOid = relOid;
+    (void) hash_search(entry->hashtab, &lockent, HASH_ENTER, &found);
+    if (!found)
+    {
+        /* It's new, so acquire the lock during replay. */
+        LOCKTAG        locktag;

-    SET_LOCKTAG_RELATION(locktag, newlock->dbOid, newlock->relOid);
+        SET_LOCKTAG_RELATION(locktag, dbOid, relOid);

-    (void) LockAcquire(&locktag, AccessExclusiveLock, true, false);
+        (void) LockAcquire(&locktag, AccessExclusiveLock, true, false);
+    }
 }

+/*
+ * Release all the locks recorded in this RecoveryLockHashEntry,
+ * and destroy the subsidiary hash table.
+ */
 static void
-StandbyReleaseLockList(List *locks)
+StandbyReleaseLockHash(RecoveryLockHashEntry *entry)
 {
-    ListCell   *lc;
+    HASH_SEQ_STATUS status;
+    RecoveryLockEntry *lockent;

-    foreach(lc, locks)
+    hash_seq_init(&status, entry->hashtab);
+    while ((lockent = hash_seq_search(&status)))
     {
-        xl_standby_lock *lock = (xl_standby_lock *) lfirst(lc);
         LOCKTAG        locktag;

         elog(trace_recovery(DEBUG4),
              "releasing recovery lock: xid %u db %u rel %u",
-             lock->xid, lock->dbOid, lock->relOid);
-        SET_LOCKTAG_RELATION(locktag, lock->dbOid, lock->relOid);
+             entry->xid, lockent->dbOid, lockent->relOid);
+        SET_LOCKTAG_RELATION(locktag, lockent->dbOid, lockent->relOid);
         if (!LockRelease(&locktag, AccessExclusiveLock, true))
         {
             elog(LOG,
-                 "RecoveryLockLists contains entry for lock no longer recorded by lock manager: xid %u database %u
relation%u", 
-                 lock->xid, lock->dbOid, lock->relOid);
+                 "RecoveryLockHash contains entry for lock no longer recorded by lock manager: xid %u database %u
relation%u", 
+                 entry->xid, lockent->dbOid, lockent->relOid);
             Assert(false);
         }
     }

-    list_free_deep(locks);
+    hash_destroy(entry->hashtab);
 }

+/*
+ * Release locks for specific XID, or all locks if it's InvalidXid.
+ */
 static void
 StandbyReleaseLocks(TransactionId xid)
 {
-    RecoveryLockListsEntry *entry;
+    RecoveryLockHashEntry *entry;

     if (TransactionIdIsValid(xid))
     {
-        if ((entry = hash_search(RecoveryLockLists, &xid, HASH_FIND, NULL)))
+        if ((entry = hash_search(RecoveryLockHash, &xid, HASH_FIND, NULL)))
         {
-            StandbyReleaseLockList(entry->locks);
-            hash_search(RecoveryLockLists, entry, HASH_REMOVE, NULL);
+            StandbyReleaseLockHash(entry);
+            hash_search(RecoveryLockHash, entry, HASH_REMOVE, NULL);
         }
     }
     else
@@ -1028,7 +1062,7 @@ StandbyReleaseLocks(TransactionId xid)

 /*
  * Release locks for a transaction tree, starting at xid down, from
- * RecoveryLockLists.
+ * RecoveryLockHash.
  *
  * Called during WAL replay of COMMIT/ROLLBACK when in hot standby mode,
  * to remove any AccessExclusiveLocks requested by a transaction.
@@ -1051,15 +1085,15 @@ void
 StandbyReleaseAllLocks(void)
 {
     HASH_SEQ_STATUS status;
-    RecoveryLockListsEntry *entry;
+    RecoveryLockHashEntry *entry;

     elog(trace_recovery(DEBUG2), "release all standby locks");

-    hash_seq_init(&status, RecoveryLockLists);
+    hash_seq_init(&status, RecoveryLockHash);
     while ((entry = hash_seq_search(&status)))
     {
-        StandbyReleaseLockList(entry->locks);
-        hash_search(RecoveryLockLists, entry, HASH_REMOVE, NULL);
+        StandbyReleaseLockHash(entry);
+        hash_search(RecoveryLockHash, entry, HASH_REMOVE, NULL);
     }
 }

@@ -1072,9 +1106,9 @@ void
 StandbyReleaseOldLocks(TransactionId oldxid)
 {
     HASH_SEQ_STATUS status;
-    RecoveryLockListsEntry *entry;
+    RecoveryLockHashEntry *entry;

-    hash_seq_init(&status, RecoveryLockLists);
+    hash_seq_init(&status, RecoveryLockHash);
     while ((entry = hash_seq_search(&status)))
     {
         Assert(TransactionIdIsValid(entry->xid));
@@ -1088,8 +1122,8 @@ StandbyReleaseOldLocks(TransactionId oldxid)
             continue;

         /* Remove all locks and hash table entry. */
-        StandbyReleaseLockList(entry->locks);
-        hash_search(RecoveryLockLists, entry, HASH_REMOVE, NULL);
+        StandbyReleaseLockHash(entry);
+        hash_search(RecoveryLockHash, entry, HASH_REMOVE, NULL);
     }
 }


pgsql-hackers by date:

Previous
From: Nathan Bossart
Date:
Subject: Re: [PATCH] Expand character set for ltree labels
Next
From: Nathan Bossart
Date:
Subject: Re: Move backup-related code to xlogbackup.c/.h