Memory context can be its own parent and child in replication command - Mailing list pgsql-hackers

From Anthonin Bonnefoy
Subject Memory context can be its own parent and child in replication command
Date
Msg-id CAO6_XqoJA7-_G6t7Uqe5nWF3nj+QBGn4F6Ptp=rUGDr0zo+KvA@mail.gmail.com
Whole thread Raw
List pgsql-hackers
Hi,

While running some tests with logical replication, I've run in a
situation where a walsender process was stuck in an infinite loop with
the following backtrace:

#0  in MemoryContextDelete (...) at ../src/backend/utils/mmgr/mcxt.c:474
#1  in exec_replication_command (cmd_string=... "BEGIN") at
../src/backend/replication/walsender.c:2005

Which matches the following while loop:
while (curr->firstchild != NULL)
    curr = curr->firstchild;

Inspecting the memory context, I have the following:

$9 = (MemoryContext) 0xafb741c35360
(gdb) p *curr
$10 = {type = T_AllocSetContext, isReset = false, allowInCritSection =
false, mem_allocated = 8192, methods = 0xafb7278a82d8
<mcxt_methods+216>, parent = 0xafb741c35360, firstchild =
0xafb741c35360, prevchild = 0x0, nextchild = 0x0, name =
0xafb7275f68a8 "Replication command context", ident = 0x0, reset_cbs =
0x0}

So the memory context is 0xafb741c35360, which is also the same value
for parent and firstchild. This explains the infinite loop as
MemoryContextDelete tries to find the leaf with no children.

I was able to get a rr recording of the issue and trace how this
happened. This can be reproduced by triggering 2 replication commands,
with the first one doing a snapshot export:

CREATE_REPLICATION_SLOT "test_slot" LOGICAL "test_decoding" ( SNAPSHOT
"export");
DROP_REPLICATION_SLOT "test_slot";

- CreateReplicationSlot will start a new transaction to handle the
snapshot export.
- This transaction will save the replication command context (let's
call it ctx1) in its state.
- ctx1 is deleted at the end of exec_replication_command
- During the next replication command, the transaction is aborted in
SnapBuildClearExportedSnapshot
- The transaction restores ctx1 as the CurrentMemoryContext
- AllocSetContextCreate is called with ctx1 as a parent, and it will
pull ctx1 from the freelist
- ctx1's parent and child will be set to ctx1 and returned
- During ctx1 deletion, it will be stuck in an infinite loop

I've added a tap test to reproduce the issue, along with an assertion
during context creation to check the parent and returned context are
not the same so the test would immediately abort and not stay stuck.

To fix this, it seems like saving and restoring the memory context
after the call AbortCurrentTransaction was the best approach. It is
similar to what's done with CurrentResourceOwner. I've thought of
switching to the TopMemoryContext before exporting the snapshot so
aborting the transaction will switch back to TopMemoryContext, but
this would still require restoring the memory context after the
transaction is aborted.

Regards,
Anthonin Bonnefoy

Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: making EXPLAIN extensible
Next
From: Alexander Korotkov
Date:
Subject: Re: Get rid of WALBufMappingLock