Thread: Memory context can be its own parent and child in replication command

Memory context can be its own parent and child in replication command

From

Anthonin Bonnefoy

Date:

07 March, 18:07:06

Hi,

While running some tests with logical replication, I've run in a
situation where a walsender process was stuck in an infinite loop with
the following backtrace:

#0  in MemoryContextDelete (...) at ../src/backend/utils/mmgr/mcxt.c:474
#1  in exec_replication_command (cmd_string=... "BEGIN") at
../src/backend/replication/walsender.c:2005

Which matches the following while loop:
while (curr->firstchild != NULL)
    curr = curr->firstchild;

Inspecting the memory context, I have the following:

$9 = (MemoryContext) 0xafb741c35360
(gdb) p *curr
$10 = {type = T_AllocSetContext, isReset = false, allowInCritSection =
false, mem_allocated = 8192, methods = 0xafb7278a82d8
<mcxt_methods+216>, parent = 0xafb741c35360, firstchild =
0xafb741c35360, prevchild = 0x0, nextchild = 0x0, name =
0xafb7275f68a8 "Replication command context", ident = 0x0, reset_cbs =
0x0}

So the memory context is 0xafb741c35360, which is also the same value
for parent and firstchild. This explains the infinite loop as
MemoryContextDelete tries to find the leaf with no children.

I was able to get a rr recording of the issue and trace how this
happened. This can be reproduced by triggering 2 replication commands,
with the first one doing a snapshot export:

CREATE_REPLICATION_SLOT "test_slot" LOGICAL "test_decoding" ( SNAPSHOT
"export");
DROP_REPLICATION_SLOT "test_slot";

- CreateReplicationSlot will start a new transaction to handle the
snapshot export.
- This transaction will save the replication command context (let's
call it ctx1) in its state.
- ctx1 is deleted at the end of exec_replication_command
- During the next replication command, the transaction is aborted in
SnapBuildClearExportedSnapshot
- The transaction restores ctx1 as the CurrentMemoryContext
- AllocSetContextCreate is called with ctx1 as a parent, and it will
pull ctx1 from the freelist
- ctx1's parent and child will be set to ctx1 and returned
- During ctx1 deletion, it will be stuck in an infinite loop

I've added a tap test to reproduce the issue, along with an assertion
during context creation to check the parent and returned context are
not the same so the test would immediately abort and not stay stuck.

To fix this, it seems like saving and restoring the memory context
after the call AbortCurrentTransaction was the best approach. It is
similar to what's done with CurrentResourceOwner. I've thought of
switching to the TopMemoryContext before exporting the snapshot so
aborting the transaction will switch back to TopMemoryContext, but
this would still require restoring the memory context after the
transaction is aborted.

Regards,
Anthonin Bonnefoy

Attachment

v01-0001-Avoid-using-deleted-context-with-replication-com.patch

Re: Memory context can be its own parent and child in replication command

From

Anthonin Bonnefoy

Date:

10 March, 11:07:59

The issue seems to have been introduced by 2129d184a206c when
transactionState->priorContext was added and is only present in HEAD.

Re: Memory context can be its own parent and child in replication command

From

Michael Paquier

Date:

11 March, 03:09:16

On Mon, Mar 10, 2025 at 09:07:59AM +0100, Anthonin Bonnefoy wrote:
> The issue seems to have been introduced by 2129d184a206c when
> transactionState->priorContext was added and is only present in HEAD.

It seems to me that you mean 1afe31f03cd2, no?
--
Michael

Attachment

signature.asc

Re: Memory context can be its own parent and child in replication command

From

Tom Lane

Date:

11 March, 04:13:17

Michael Paquier <michael@paquier.xyz> writes:
> On Mon, Mar 10, 2025 at 09:07:59AM +0100, Anthonin Bonnefoy wrote:
>> The issue seems to have been introduced by 2129d184a206c when
>> transactionState->priorContext was added and is only present in HEAD.

> It seems to me that you mean 1afe31f03cd2, no?

I dunno about this patch: it seems to me it's doing things exactly
backwards, namely trying to restore the CurrentMemoryContext after
a transaction abort to what it was just beforehand.  With only
a slightly different set of suppositions, this is introducing
use of a deleted context rather than removing it.

I'm inclined to suspect that the real problem is somebody installed
the wrong context as current just before the transaction was started.
I don't have the energy to look closer right now, though.

Independently of where the actual bug is there, it seems not
nice that it's so easy to bollix the memory context data
structures.  I wonder if we can make that more detectable
at reasonable cost.

            regards, tom lane

Re: Memory context can be its own parent and child in replication command

From

Anthonin Bonnefoy

Date:

11 March, 12:26:32

On Tue, Mar 11, 2025 at 1:09 AM Michael Paquier <michael@paquier.xyz> wrote:
> It seems to me that you mean 1afe31f03cd2, no?

Yes, that was a bad copy/paste from me.

On Tue, Mar 11, 2025 at 2:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I dunno about this patch: it seems to me it's doing things exactly
> backwards, namely trying to restore the CurrentMemoryContext after
> a transaction abort to what it was just beforehand.  With only
> a slightly different set of suppositions, this is introducing
> use of a deleted context rather than removing it.

Yeah, I'm not super happy about the fix either. This leaves the issue
that CurrentMemoryContext is set to a deleted context, even if
temporarily.

> I'm inclined to suspect that the real problem is somebody installed
> the wrong context as current just before the transaction was started.
> I don't have the energy to look closer right now, though.

The context installed is the replication command context created at
the beginning of exec_replication_command. This context is deleted at
the end of the function while the transaction is still running. Maybe
a better fix would be to not delete the Replication command context
when a transaction was started to handle the export, and switch back
to this context at the start of the next exec_replication_command?
This way, the memory context referenced by priorContext would still be
valid when the transaction is aborted.

This would also require the Replication command context to not be
attached under the MessageContext to avoid being deleted at the end of
the query cycle.

> Independently of where the actual bug is there, it seems not
> nice that it's so easy to bollix the memory context data
> structures.  I wonder if we can make that more detectable
> at reasonable cost.

One thing that made it hard to trace the issue was that there's no way
to know if a MemoryContext was deleted or not. Setting the
MemoryContext's node type to T_Invalid during reset could be a way to
tag the context as deleted. And this will be able to leverage the
existing MemoryContextIsValid check and additional
MemoryContextIsValid checks could be added before restoring the
priorContext.

--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -527,6 +527,7 @@ MemoryContextDeleteOnly(MemoryContext context)

        context->methods->delete_context(context);

+       context->type = T_Invalid;
        VALGRIND_DESTROY_MEMPOOL(context);
 }

However, when testing this on my mac, it seems to trigger a heap
corruption during initdb.

postgres(9057,0x200690840) malloc: Heap corruption detected, free list
is damaged at 0x60000322bb10
*** Incorrect guard value: 276707563012096
postgres(9057,0x200690840) malloc: *** set a breakpoint in
malloc_error_break to debug

While it ran fine on linux. I didn't have time to check why yet.

Re: Memory context can be its own parent and child in replication command

From

Anthonin Bonnefoy

Date:

11 March, 16:04:27

On Tue, Mar 11, 2025 at 10:26 AM Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:
> --- a/src/backend/utils/mmgr/mcxt.c
> +++ b/src/backend/utils/mmgr/mcxt.c
> @@ -527,6 +527,7 @@ MemoryContextDeleteOnly(MemoryContext context)
>
>         context->methods->delete_context(context);
>
> +       context->type = T_Invalid;
>         VALGRIND_DESTROY_MEMPOOL(context);
>  }
>
> However, when testing this on my mac, it seems to trigger a heap
> corruption during initdb.

Which is normal since this would only work with AllocSet as other
delelte_context methods would just free the context... A better spot
would be just before putting the context in the freelist in
AllocSetDelete.

Re: Memory context can be its own parent and child in replication command

From

Anthonin Bonnefoy

Date:

12 March, 12:13:06

I've updated the patch with another approach:

0001: This sets the MemoryContext type to T_Invalid just before adding
it to the aset freelist, allowing to distinguish between a
MemoryContext that was placed in the freelist and shouldn't be used,
and a valid one. The existing MemoryContextIsValid checks will be
triggered if a MemoryContext in the freelist is used.
I've also added additional MemoryContextIsValid to make sure the
MemoryContext restored by a transaction is valid.

0002: Keep the replication command alive when there's an ongoing
snapshot export. This way, the context restored is valid and can be
reused. This requires the replication command context to be created
under the TopMemoryContext.

Regards,
Anthonin Bonnefoy

Attachment

Re: Memory context can be its own parent and child in replication command

From

Tom Lane

Date:

20 April, 21:18:27

Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> writes:
> I've updated the patch with another approach:

I got back to looking at this today.  I still don't like very much
about it.  After thinking for awhile, I concur that we want to create
the cmd_context under TopMemoryContext, but I think we can make things
a great deal simpler and less fragile by just keeping it in existence
indefinitely, as attached.  The fundamental problem here is that
exec_replication_command is creating and deleting the context without
regard to where transaction boundaries are.  I don't think the best
way to deal with that is to require it to know where those boundaries
are.

I also don't like the proposed test script.  The test case alone would
be fine, but spinning up an entire new instance seems a rather huge
overhead to carry forevermore for a single test that likely will never
find anything again.  I tried to cram the test case into one of the
existing scripts, but it was hard to find one where it would fit
naturally.  An obvious candidate is subscription/t/100_bugs.pl, but
the test failed there for lack of access to the test_decoding plugin.
Perhaps we could use another plugin, but I didn't try hard.

As for the memory-context-bug-catching changes, I'm not very convinced
by the idea of setting context->type = T_Invalid.  That would help
if someone tries to reference the context while it's still in the
freelist, but not if the reference is from re-use of a stale pointer
after the context has been reassigned (which IIUC is the case here).
I also think that decorating a few MemoryContextSwitchTos with
assertions looks pretty random.  I wonder if we should content
ourselves with adding some assertions that catch attempts to make
a context be its own parent.  It looks like MemoryContextSetParent
already does that, so the only new assert would be in
MemoryContextCreate.

            regards, tom lane

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 987a7336ec8..9fa8beb6103 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1973,8 +1973,10 @@ exec_replication_command(const char *cmd_string)
     int            parse_rc;
     Node       *cmd_node;
     const char *cmdtag;
-    MemoryContext cmd_context;
-    MemoryContext old_context;
+    MemoryContext old_context = CurrentMemoryContext;
+
+    /* We save and re-use the cmd_context across calls */
+    static MemoryContext cmd_context = NULL;

     /*
      * If WAL sender has been told that shutdown is getting close, switch its
@@ -2003,11 +2005,30 @@ exec_replication_command(const char *cmd_string)

     /*
      * Prepare to parse and execute the command.
+     *
+     * Because replication command execution can involve beginning or ending
+     * transactions, we need a working context that will survive that, so we
+     * make it a child of TopMemoryContext.  That in turn creates a hazard of
+     * long-lived memory leaks if we lose track of the working context.  We
+     * deal with that by creating it only once per walsender, and resetting it
+     * for each new command.  (Normally this reset is a no-op, but if the
+     * prior exec_replication_command call failed with an error, it won't be.)
+     *
+     * This is subtler than it looks.  The transactions we manage can extend
+     * across replication commands, indeed SnapBuildClearExportedSnapshot
+     * might have just ended one.  Because transaction exit will revert to the
+     * memory context that was current at transaction start, we need to be
+     * sure that that context is still valid.  That motivates re-using the
+     * same cmd_context rather than making a new one each time.
      */
-    cmd_context = AllocSetContextCreate(CurrentMemoryContext,
-                                        "Replication command context",
-                                        ALLOCSET_DEFAULT_SIZES);
-    old_context = MemoryContextSwitchTo(cmd_context);
+    if (cmd_context == NULL)
+        cmd_context = AllocSetContextCreate(TopMemoryContext,
+                                            "Replication command context",
+                                            ALLOCSET_DEFAULT_SIZES);
+    else
+        MemoryContextReset(cmd_context);
+
+    MemoryContextSwitchTo(cmd_context);

     replication_scanner_init(cmd_string, &scanner);

@@ -2020,7 +2041,7 @@ exec_replication_command(const char *cmd_string)
         replication_scanner_finish(scanner);

         MemoryContextSwitchTo(old_context);
-        MemoryContextDelete(cmd_context);
+        MemoryContextReset(cmd_context);

         /* XXX this is a pretty random place to make this check */
         if (MyDatabaseId == InvalidOid)
@@ -2180,9 +2201,12 @@ exec_replication_command(const char *cmd_string)
                  cmd_node->type);
     }

-    /* done */
+    /*
+     * Done.  Revert to caller's memory context, and clean out the cmd_context
+     * to recover memory right away.
+     */
     MemoryContextSwitchTo(old_context);
-    MemoryContextDelete(cmd_context);
+    MemoryContextReset(cmd_context);

     /*
      * We need not update ps display or pg_stat_activity, because PostgresMain

Re: Memory context can be its own parent and child in replication command

From

Tom Lane

Date:

20 April, 22:28:51

I wrote:
> ... An obvious candidate is subscription/t/100_bugs.pl, but
> the test failed there for lack of access to the test_decoding plugin.
> Perhaps we could use another plugin, but I didn't try hard.

Ah, I realized we could just use pgoutput, since the test doesn't
actually do anything that would exercise the plugin.  So here's
a v4 with the test restored.

            regards, tom lane

From cd8ff75cf45cb9515e9bf3ad1696f80901b2035f Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sun, 20 Apr 2025 15:24:14 -0400
Subject: [PATCH v4] Use the same cmd_context throughout a walsender's
 lifetime.

exec_replication_command created a cmd_context to work in and
then deleted it on exit.  This is pretty dangerous because
some replication commands start/finish transactions.  In the
wake of commit 1afe31f03, that could lead to restoring a
CurrentMemoryContext that's already been deleted (and maybe
even reassigned, leading to hilarity such as a memory context
that is its own parent).

To fix, let's make the cmd_context persist across
exec_replication_command calls; instead of deleting it, we'll just
reset it each time.  In this way it retains the same identity and
there's no problem if transaction abort restores it as the working
context.  It probably even saves a few microseconds to do this.

This fix also ensures that exec_replication_command returns to the
caller (PostgresMain) with the same context active that had been
when it was called (probably MessageContext).  The previous
coding could get that wrong too.

Reported-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Author: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAO6_XqoJA7-_G6t7Uqe5nWF3nj+QBGn4F6Ptp=rUGDr0zo+KvA@mail.gmail.com
---
 src/backend/replication/walsender.c | 42 ++++++++++++++++++++++-------
 src/test/subscription/t/100_bugs.pl | 18 +++++++++++++
 2 files changed, 51 insertions(+), 9 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 987a7336ec8..9fa8beb6103 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1973,8 +1973,10 @@ exec_replication_command(const char *cmd_string)
     int            parse_rc;
     Node       *cmd_node;
     const char *cmdtag;
-    MemoryContext cmd_context;
-    MemoryContext old_context;
+    MemoryContext old_context = CurrentMemoryContext;
+
+    /* We save and re-use the cmd_context across calls */
+    static MemoryContext cmd_context = NULL;

     /*
      * If WAL sender has been told that shutdown is getting close, switch its
@@ -2003,11 +2005,30 @@ exec_replication_command(const char *cmd_string)

     /*
      * Prepare to parse and execute the command.
+     *
+     * Because replication command execution can involve beginning or ending
+     * transactions, we need a working context that will survive that, so we
+     * make it a child of TopMemoryContext.  That in turn creates a hazard of
+     * long-lived memory leaks if we lose track of the working context.  We
+     * deal with that by creating it only once per walsender, and resetting it
+     * for each new command.  (Normally this reset is a no-op, but if the
+     * prior exec_replication_command call failed with an error, it won't be.)
+     *
+     * This is subtler than it looks.  The transactions we manage can extend
+     * across replication commands, indeed SnapBuildClearExportedSnapshot
+     * might have just ended one.  Because transaction exit will revert to the
+     * memory context that was current at transaction start, we need to be
+     * sure that that context is still valid.  That motivates re-using the
+     * same cmd_context rather than making a new one each time.
      */
-    cmd_context = AllocSetContextCreate(CurrentMemoryContext,
-                                        "Replication command context",
-                                        ALLOCSET_DEFAULT_SIZES);
-    old_context = MemoryContextSwitchTo(cmd_context);
+    if (cmd_context == NULL)
+        cmd_context = AllocSetContextCreate(TopMemoryContext,
+                                            "Replication command context",
+                                            ALLOCSET_DEFAULT_SIZES);
+    else
+        MemoryContextReset(cmd_context);
+
+    MemoryContextSwitchTo(cmd_context);

     replication_scanner_init(cmd_string, &scanner);

@@ -2020,7 +2041,7 @@ exec_replication_command(const char *cmd_string)
         replication_scanner_finish(scanner);

         MemoryContextSwitchTo(old_context);
-        MemoryContextDelete(cmd_context);
+        MemoryContextReset(cmd_context);

         /* XXX this is a pretty random place to make this check */
         if (MyDatabaseId == InvalidOid)
@@ -2180,9 +2201,12 @@ exec_replication_command(const char *cmd_string)
                  cmd_node->type);
     }

-    /* done */
+    /*
+     * Done.  Revert to caller's memory context, and clean out the cmd_context
+     * to recover memory right away.
+     */
     MemoryContextSwitchTo(old_context);
-    MemoryContextDelete(cmd_context);
+    MemoryContextReset(cmd_context);

     /*
      * We need not update ps display or pg_stat_activity, because PostgresMain
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 83120f1cb6f..9c9701d58a8 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -477,6 +477,24 @@ $result =
 is( $result, qq(2|f
 3|t), 'check replicated update on subscriber');

+# Test create and immediate drop of replication slot via replication commands
+# (this exposed a memory-management bug in v18)
+my $publisher_host = $node_publisher->host;
+my $publisher_port = $node_publisher->port;
+my $connstr_db =
+  "host=$publisher_host port=$publisher_port replication=database dbname=postgres";
+
+is( $node_publisher->psql(
+        'postgres',
+        qq[
+        CREATE_REPLICATION_SLOT "test_slot" LOGICAL pgoutput ( SNAPSHOT "export");
+        DROP_REPLICATION_SLOT "test_slot";
+    ],
+        timeout => $PostgreSQL::Test::Utils::timeout_default,
+        extra_params => [ '-d', $connstr_db ]),
+    0,
+    'create and immediate drop of replication slot');
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');

--
2.43.5

Re: Memory context can be its own parent and child in replication command

From

Rahila Syed

Date:

21 April, 12:47:41

Hi,

So here's a v4 with the test restored.

I tested this patch, it fixes the issue reported. It passes Github CI tests.

already does that, so the only new assert would be in
MemoryContextCreate.

+1 for adding the assertion to increase the chances of this bug being

caught by memory context infrastructure.

I had the following comment.

Why do we do this:

- MemoryContext old_context;
+ MemoryContext old_context = CurrentMemoryContext;

Instead of implementing it as done in the previous version of this code, i.e.
old_context = MemoryContextSwitchTo(cmd_context);

Thank you,

Rahila Syed

Re: Memory context can be its own parent and child in replication command

From

Tom Lane

Date:

21 April, 19:23:23

Rahila Syed <rahilasyed90@gmail.com> writes:
> +1 for adding the assertion to increase the chances of this bug being
> caught by memory context infrastructure.

I verified that an assertion inside MemoryContextCreate would catch
this bug, so I added one (in 5ec8b01c3), and then pushed the main
fix at 80b727eb9.

> I had the following comment.

> Why do we do this:
> -       MemoryContext old_context;
> +       MemoryContext old_context = CurrentMemoryContext;
> Instead of implementing it as done in the previous version of this code,
> i.e.
>     old_context = MemoryContextSwitchTo(cmd_context);

Because capturing old_context at that point is too late.  The crux
of the bug is exactly that SnapBuildClearExportedSnapshot may have
changed CurrentMemoryContext.  We could have avoided that by
reordering the code, or done something like what Anthonin did in v02.
But this way is perfectly idiomatic IMO.  There are some dozens of
similar usages elsewhere in our code, and it makes very sure that
old_context captures the caller's context even if we rearrange things
some more inside exec_replication_command.

            regards, tom lane