Re: logical replication worker accesses catalogs in error context callback - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: logical replication worker accesses catalogs in error context callback |
Date | |
Msg-id | 20210204195459.ve7wdhmc75vy6df7@alap3.anarazel.de Whole thread Raw |
In response to | Re: logical replication worker accesses catalogs in error context callback (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>) |
List | pgsql-hackers |
Hi, On 2021-02-04 15:08:42 +0530, Bharath Rupireddy wrote: > On Thu, Feb 4, 2021 at 5:16 AM Andres Freund <andres@anarazel.de> wrote: > > > > > About 0001, have we tried to reproduce the actual bug here which means > > > > > when the error_callback is called we should face some problem? I feel > > > > > with the correct testcase we should hit the Assert > > > > > (Assert(IsTransactionState());) in SearchCatCacheInternal because > > > > > there we expect the transaction to be in a valid state. I understand > > > > > that the transaction is in a broken state at that time but having a > > > > > testcase to hit the actual bug makes it easy to test the fix. > > > > > > > > I think it's easy to hit - add a trivial DEBUG message to XLogWrite(), > > > > and run anything involving logical rep using a low enough log > > > > level. There might even be enough messages with a low enough level > > > > already somewhere in the relevant paths. > > > > > > I'm not sure how adding a DEBUG message to XLogWrite() would ensure > > > the logical replication worker on the subscriber system enters > > > slot_store_error_callback and hits the Assert(IsTransactionState());? > > > Could you please elaborate? Or I may be missing something here to > > > understand. > > > > XLogWrite() is in a critical section, the DEBUG messages triggers error > > context callbacks to be called, the callbacks allocate memory, which > > triggers an assertion. > > I see that XLogWrite() can be called from logical replication > worker(apply_handle_commit->apply_handle_commit_internal->CommitTransactionCommand->CommitTransaction->RecordTransactionCommit->XLogFlush->XLogWrite > --> by now the txn state is not TRANS_INPROGRESS, so > IsTransactionState() can return false. Adding a DEBUG message there > can call the error context callback. > > But the error context callback slot_store_error_callback, gets used in > slot_store_data and slot_modify_data. In both places we just register > the callback before an attribute processing for loop and deregister > after it. So, when we are in those for loops, we expect to be in > TRANS_INPROGRESS, see ensure_transaction before slot_store_data and > the same is true for slot_modify_data. So, to really hit the assert > failure in the catalogue access from slot_store_error_callback, first, > we shouldn't be in TRANS_INPROGRESS in those for loops and have a > DEBUG message. I'm not exactly sure how we achieve this. I think you're focussing too much on IsTransactionState(). It's easier to hit the critical section issues. What makes the issue a bit hard to hit is that to hit it you need a syscache miss while slot_store_error_callback is set up, and that cache miss then needs to trigger a read from disk (to potentially trigger a WAL flush), and then the slot_store_error_callback() needs to actually do something. With the below patch $ cat /tmp/tmp_config.conf log_min_messages=debug5 shared_buffers=1MB make TEMP_CONFIG=/tmp/extra_config.conf -C src/test/subscription/ check triggers an assertion most of the time. The catcache.c changes aren't necessary for this test, but we should have them... Regards, Andres diff --git c/src/backend/access/transam/xlog.c i/src/backend/access/transam/xlog.c index f03bd473e2b..7f422b4f5f2 100644 --- c/src/backend/access/transam/xlog.c +++ i/src/backend/access/transam/xlog.c @@ -2432,6 +2432,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) int startidx; uint32 startoffset; + elog(DEBUG1, "writing!"); + /* We should always be inside a critical section here */ Assert(CritSectionCount > 0); @@ -2885,6 +2887,8 @@ XLogFlush(XLogRecPtr record) START_CRIT_SECTION(); + elog(DEBUG1, "checking for flushing"); + /* * Since fsync is usually a horribly expensive operation, we try to * piggyback as much data as we can on each fsync: if we see any more data diff --git c/src/backend/replication/logical/worker.c i/src/backend/replication/logical/worker.c index eb7db89cef7..19aa76af889 100644 --- c/src/backend/replication/logical/worker.c +++ i/src/backend/replication/logical/worker.c @@ -435,6 +435,8 @@ slot_store_error_callback(void *arg) Oid remotetypoid, localtypoid; + Assert(IsTransactionState() && CritSectionCount == 0); + /* Nothing to do if remote attribute number is not set */ if (errarg->remote_attnum < 0) return; diff --git c/src/backend/utils/cache/catcache.c i/src/backend/utils/cache/catcache.c index fa2b49c676e..c7335c08570 100644 --- c/src/backend/utils/cache/catcache.c +++ i/src/backend/utils/cache/catcache.c @@ -1210,7 +1210,7 @@ SearchCatCacheInternal(CatCache *cache, CatCTup *ct; /* Make sure we're in an xact, even if this ends up being a cache hit */ - Assert(IsTransactionState()); + Assert(IsTransactionState() && CritSectionCount == 0); Assert(cache->cc_nkeys == nkeys); @@ -1523,6 +1523,9 @@ SearchCatCacheList(CatCache *cache, MemoryContext oldcxt; int i; + /* Make sure we're in an xact, even if this ends up being a cache hit */ + Assert(IsTransactionState() && CritSectionCount == 0); + /* * one-time startup overhead for each cache */
pgsql-hackers by date: