Re: logical replication worker accesses catalogs in error context callback - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: logical replication worker accesses catalogs in error context callback
Date
Msg-id CALj2ACVF=JD5KSDf1uyKREcDnPH01fuecNGE+vZ0qVa-6Ktz3g@mail.gmail.com
In response to Re: logical replication worker accesses catalogs in error context callback  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: logical replication worker accesses catalogs in error context callback
List pgsql-hackers
On Thu, Feb 4, 2021 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > About 0001, have we tried to reproduce the actual bug here which means
> > > when the error_callback is called we should face some problem? I feel
> > > with the correct testcase we should hit the Assert
> > > (Assert(IsTransactionState());) in SearchCatCacheInternal because
> > > there we expect the transaction to be in a valid state. I understand
> > > that the transaction is in a broken state at that time but having a
> > > testcase to hit the actual bug makes it easy to test the fix.
> >
> > I have not tried hitting the Assert(IsTransactionState() in
> > SearchCatCacheInternal. To do that, I need to figure out hitting
> > "incorrect binary data format in logical replication column" error in
> > either slot_modify_data or slot_store_data so that we will enter the
> > error callback slot_store_error_callback and then IsTransactionState()
> > should return false i.e. txn shouldn't be in TRANS_INPROGRESS.
> >
>
> Even, if you hit that via debugger it will be sufficient or you can
> write another elog/ereport there to achieve the same. The exact test
> case to hit that error is not mandatory.

Thanks Amit. I verified it with gdb. I attached gdb to the logical
replication worker. In slot_store_data's for loop, I intentionally set
CurrentTransactionState->state = TRANS_DEFAULT, and jumped to the
existing error "incorrect binary data format in logical replication
column", so that the slot_store_error_callback is called. While we are
in the error context callback:

On master: since slot_store_error_callback accesses the system
catalogues, the Assert(IsTransactionState()) in SearchCatCacheInternal
fails. The error we intend to see is not logged; instead, we see the
messages below in the subscriber server log and the session in the
subscriber gets restarted.
2021-02-04 17:26:27.517 IST [2269230] ERROR:  could not send data to
WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2021-02-04 17:26:27.518 IST [2269190] LOG:  background worker "logical
replication worker" (PID 2269230) exited with exit code 1

With patch: since slot_store_error_callback no longer accesses the
system catalogues, the error that we intentionally jumped to is logged
in the subscriber server log.
2021-02-04 17:27:37.542 IST [2269424] ERROR:  incorrect binary data
format in logical replication column 1
2021-02-04 17:27:37.542 IST [2269424] CONTEXT:  processing remote data
for replication target relation "public.t1" column "a1", remote type
integer, local type integer
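
To illustrate the idea behind the fix, here is a minimal standalone
sketch, not the actual patch. It assumes the names needed for the
CONTEXT line are resolved while the transaction is still valid, so that
the callback itself only formats strings. All identifiers
(in_transaction, lookup_type_name, ErrCallbackArg, prepare_callback_arg,
error_context_callback) are hypothetical stand-ins and not PostgreSQL
APIs; the type id 23 is used purely as an example value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for "are we in a valid transaction state?". */
static bool in_transaction = true;

/*
 * Hypothetical stand-in for a catalog lookup. Like the real catalog cache
 * search, it asserts that a valid transaction is in progress.
 */
static const char *
lookup_type_name(int typid)
{
    assert(in_transaction);
    return (typid == 23) ? "integer" : "unknown";
}

/* Everything the callback needs, resolved ahead of time. */
typedef struct ErrCallbackArg
{
    const char *nspname;
    const char *relname;
    const char *colname;
    const char *remote_typname;
    const char *local_typname;
} ErrCallbackArg;

/* Do the lookups up front, while the transaction is still valid. */
static void
prepare_callback_arg(ErrCallbackArg *arg,
                     const char *nspname, const char *relname,
                     const char *colname,
                     int remote_typid, int local_typid)
{
    arg->nspname = nspname;
    arg->relname = relname;
    arg->colname = colname;
    arg->remote_typname = lookup_type_name(remote_typid);
    arg->local_typname = lookup_type_name(local_typid);
}

/* The callback only formats cached strings; it performs no lookups. */
static void
error_context_callback(const ErrCallbackArg *arg, char *buf, size_t buflen)
{
    snprintf(buf, buflen,
             "processing remote data for replication target relation "
             "\"%s.%s\" column \"%s\", remote type %s, local type %s",
             arg->nspname, arg->relname, arg->colname,
             arg->remote_typname, arg->local_typname);
}
```

In this sketch, calling prepare_callback_arg() first and then
error_context_callback() after in_transaction has been set to false
still produces the context string, because the callback never reaches
the asserting lookup.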

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


