Re: logical replication worker accesses catalogs in error context callback - Mailing list pgsql-hackers

From Andres Freund
Subject Re: logical replication worker accesses catalogs in error context callback
Date
Msg-id 20210204195459.ve7wdhmc75vy6df7@alap3.anarazel.de
Whole thread Raw
In response to Re: logical replication worker accesses catalogs in error context callback  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
List pgsql-hackers
Hi,

On 2021-02-04 15:08:42 +0530, Bharath Rupireddy wrote:
> On Thu, Feb 4, 2021 at 5:16 AM Andres Freund <andres@anarazel.de> wrote:
> > > > > About 0001, have we tried to reproduce the actual bug here which means
> > > > > when the error_callback is called we should face some problem? I feel
> > > > > with the correct testcase we should hit the Assert
> > > > > (Assert(IsTransactionState());) in SearchCatCacheInternal because
> > > > > there we expect the transaction to be in a valid state. I understand
> > > > > that the transaction is in a broken state at that time but having a
> > > > > testcase to hit the actual bug makes it easy to test the fix.
> > > >
> > > > I think it's easy to hit - add a trivial DEBUG message to XLogWrite(),
> > > > and run anything involving logical rep using a low enough log
> > > > level. There might even be enough messages with a low enough level
> > > > already somewhere in the relevant paths.
> > >
> > > I'm not sure how adding a DEBUG message to XLogWrite() would ensure
> > > the logical replication worker on the subscriber system enters
> > > slot_store_error_callback and hits the Assert(IsTransactionState());?
> > > Could you please elaborate? Or I may be missing something here to
> > > understand.
> >
> > XLogWrite() is in a critical section, the DEBUG messages triggers error
> > context callbacks to be called, the callbacks allocate memory, which
> > triggers an assertion.
>
> I see that XLogWrite() can be called from logical replication
>
worker(apply_handle_commit->apply_handle_commit_internal->CommitTransactionCommand->CommitTransaction->RecordTransactionCommit->XLogFlush->XLogWrite
> --> by now the txn state is not TRANS_INPROGRESS, so
> IsTransactionState() can return false. Adding a DEBUG message there
> can call the error context callback.
>
> But the error context callback slot_store_error_callback, gets used in
> slot_store_data and slot_modify_data. In both places we just register
> the callback before an attribute processing for loop and deregister
> after it. So, when we are in those for loops, we expect to be in
> TRANS_INPROGRESS, see ensure_transaction before slot_store_data and
> the same is true for slot_modify_data. So, to really hit the assert
> failure in the catalogue access from slot_store_error_callback, first,
> we shouldn't be in TRANS_INPROGRESS in those for loops and have a
> DEBUG message. I'm not exactly sure how we achieve this.

I think you're focussing too much on IsTransactionState(). It's easier
to hit the critical section issues.

What makes the issue a bit hard to hit is that to hit it you need a
syscache miss while slot_store_error_callback is set up, and that cache
miss then needs to trigger a read from disk (to potentially trigger a
WAL flush), and then the slot_store_error_callback() needs to actually
do something.

With the below patch

$ cat /tmp/tmp_config.conf
log_min_messages=debug5
shared_buffers=1MB

make TEMP_CONFIG=/tmp/extra_config.conf -C src/test/subscription/ check

triggers an assertion most of the time. The catcache.c changes aren't
necessary for this test, but we should have them...

Regards,

Andres

diff --git c/src/backend/access/transam/xlog.c i/src/backend/access/transam/xlog.c
index f03bd473e2b..7f422b4f5f2 100644
--- c/src/backend/access/transam/xlog.c
+++ i/src/backend/access/transam/xlog.c
@@ -2432,6 +2432,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
     int         startidx;
     uint32      startoffset;
 
+    elog(DEBUG1, "writing!");
+
     /* We should always be inside a critical section here */
     Assert(CritSectionCount > 0);
 
@@ -2885,6 +2887,8 @@ XLogFlush(XLogRecPtr record)
 
     START_CRIT_SECTION();
 
+    elog(DEBUG1, "checking for flushing");
+
     /*
      * Since fsync is usually a horribly expensive operation, we try to
      * piggyback as much data as we can on each fsync: if we see any more data
diff --git c/src/backend/replication/logical/worker.c i/src/backend/replication/logical/worker.c
index eb7db89cef7..19aa76af889 100644
--- c/src/backend/replication/logical/worker.c
+++ i/src/backend/replication/logical/worker.c
@@ -435,6 +435,8 @@ slot_store_error_callback(void *arg)
     Oid         remotetypoid,
                 localtypoid;
 
+    Assert(IsTransactionState() && CritSectionCount == 0);
+
     /* Nothing to do if remote attribute number is not set */
     if (errarg->remote_attnum < 0)
         return;
diff --git c/src/backend/utils/cache/catcache.c i/src/backend/utils/cache/catcache.c
index fa2b49c676e..c7335c08570 100644
--- c/src/backend/utils/cache/catcache.c
+++ i/src/backend/utils/cache/catcache.c
@@ -1210,7 +1210,7 @@ SearchCatCacheInternal(CatCache *cache,
     CatCTup    *ct;
 
     /* Make sure we're in an xact, even if this ends up being a cache hit */
-    Assert(IsTransactionState());
+    Assert(IsTransactionState() && CritSectionCount == 0);
 
     Assert(cache->cc_nkeys == nkeys);
 
@@ -1523,6 +1523,9 @@ SearchCatCacheList(CatCache *cache,
     MemoryContext oldcxt;
     int         i;
 
+    /* Make sure we're in an xact, even if this ends up being a cache hit */
+    Assert(IsTransactionState() && CritSectionCount == 0);
+
     /*
      * one-time startup overhead for each cache
      */



pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: Bug in COPY FROM backslash escaping multi-byte chars
Next
From: Robert Haas
Date:
Subject: Re: new heapcheck contrib module