RE: [HACKERS] logical decoding of two-phase transactions - Mailing list pgsql-hackers

From tanghy.fnst@fujitsu.com
Subject RE: [HACKERS] logical decoding of two-phase transactions
Date
Msg-id OS0PR01MB611332752AFF6ADF580AB849FB259@OS0PR01MB6113.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: [HACKERS] logical decoding of two-phase transactions  (Peter Smith <smithpb2250@gmail.com>)
Responses Re: [HACKERS] logical decoding of two-phase transactions
List pgsql-hackers
> > 13.
> > @@ -507,7 +558,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt,
> > bool isTopLevel)
> >   {
> >   Assert(slotname);
> >
> > - walrcv_create_slot(wrconn, slotname, false,
> > + /*
> > + * Even if two_phase is set, don't create the slot with
> > + * two-phase enabled. Will enable it once all the tables are
> > + * synced and ready. This avoids race-conditions like prepared
> > + * transactions being skipped due to changes not being applied
> > + * due to checks in should_apply_changes_for_rel() when
> > + * tablesync for the corresponding tables are in progress. See
> > + * comments atop worker.c.
> > + */
> > + walrcv_create_slot(wrconn, slotname, false, false,
> >
> > Can't we enable two_phase if copy_data is false? Because in that case,
> > all relations will be in a READY state. If we do that then we should
> > also set two_phase state as 'enabled' during createsubscription. I
> > think we need to be careful to check that connect option is given and
> > copy_data is false before setting such a state. Now, I guess we may
> > not be able to optimize this to not set 'enabled' state when the
> > subscription has no rels.
> >
> 
> Fixed in v77-0001

I noticed this modification in v77-0001 and executed "CREATE SUBSCRIPTION ... WITH (two_phase = on, copy_data =
false)",but it crashed.
 
-------------
postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres' PUBLICATION pub WITH(two_phase = on, copy_data =
false);
WARNING:  relcache reference leak: relation "pg_subscription" not closed
WARNING:  snapshot 0x34278d0 still active
NOTICE:  created replication slot "sub" on publisher
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?>
-------------

There are two warnings and a segmentation fault in subscriber log:
-------------
2021-05-24 15:08:32.435 CST [2848572] WARNING:  relcache reference leak: relation "pg_subscription" not closed
2021-05-24 15:08:32.435 CST [2848572] WARNING:  snapshot 0x32ce8b0 still active
2021-05-24 15:08:33.012 CST [2848555] LOG:  server process (PID 2848572) was terminated by signal 11: Segmentation
fault
2021-05-24 15:08:33.012 CST [2848555] DETAIL:  Failed process was running: CREATE SUBSCRIPTION sub CONNECTION
'dbname=postgres'PUBLICATION pub WITH(two_phase = on, copy_data = false);
 
-------------

The backtrace about segmentation fault is attached. It happened in table_close function, we got it because
"CurrentResourceOwner"was NULL.
 

I think it was related with the first warning, which reported "relcache reference leak". The backtrace information is
attached,too. When updating two-phase state in CreateSubscription function, it released resource owner and set
"CurrentResourceOwner"as NULL in CommitTransaction function.
 

The second warning about "snapshot still active" was also happened in CommitTransaction function. It called
AtEOXact_Snapshotfunction, checked leftover snapshots and reported the warning.
 
I debugged and found the snapshot was added in function PortalRunUtility by calling PushActiveSnapshot function, the
addressof "ActiveSnapshot" at this time was same as the address in warning.
 

In summary, when creating subscription with two_phase = on and copy_data = false, it calls UpdateTwoPhaseState function
inCreateSubscription function to set two_phase state as 'enabled', and it checked and released relcache and snapshot
tooearly so the NG happened. I think some change should be made to avoid it. Thought?
 

FYI
 I also tested the new released V78* at [1], the above NG still exists.
[1] https://www.postgresql.org/message-id/CAFPTHDab56twVmC%2B0a%3DRNcRw4KuyFdqzW0JAcvJdS63n_fRnOQ%40mail.gmail.com

Regards
Tang


Attachment

pgsql-hackers by date:

Previous
From: Guillaume Lelarge
Date:
Subject: Re: Issue on catalogs.sgml
Next
From: Dilip Kumar
Date:
Subject: Re: Assertion failure while streaming toasted data