RE: Truncate in synchronous logical replication failed - Mailing list pgsql-hackers

From osumi.takamichi@fujitsu.com
Subject RE: Truncate in synchronous logical replication failed
Date
Msg-id OSBPR01MB48886AFC51035D44B2792E34ED709@OSBPR01MB4888.jpnprd01.prod.outlook.com
In response to Re: Truncate in synchronous logical replication failed  (Japin Li <japinli@hotmail.com>)
Responses Re: Truncate in synchronous logical replication failed  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
Hi


On Saturday, April 10, 2021 11:52 PM Japin Li <japinli@hotmail.com> wrote:
> On Thu, 08 Apr 2021 at 19:20, Japin Li <japinli@hotmail.com> wrote:
> > On Wed, 07 Apr 2021 at 16:34, tanghy.fnst@fujitsu.com
> <tanghy.fnst@fujitsu.com> wrote:
> >> On Wednesday, April 7, 2021 5:28 PM Amit Kapila
> >> <amit.kapila16@gmail.com> wrote
> >>
> >>>Can you please check if the behavior is the same for PG-13? This is
> >>>just to ensure that we have not introduced any bug in PG-14.
> >>
> >> Yes, same failure happens at PG-13, too.
> >>
> >
> > I found that when we truncate a table in synchronous logical
> > replication,
> > LockAcquireExtended() [1] will try to take a lock via fast path and it
> > failed (FastPathStrongRelationLocks->count[fasthashcode] = 1).
> > However, it can acquire the lock when in asynchronous logical replication.
> > I'm not familiar with the locks, any suggestions? What the difference
> > between sync and async logical replication for locks?
> >
>
> After some analyze, I find that when the TRUNCATE finish, it will call
> SyncRepWaitForLSN(), for asynchronous logical replication, it will exit early,
> and then it calls ResourceOwnerRelease(RESOURCE_RELEASE_LOCKS) to
> release the locks, so the walsender can acquire the lock.
>
> But for synchronous logical replication, SyncRepWaitForLSN() will wait for
> specified LSN to be confirmed, so it cannot release the lock, and the
> walsender try to acquire the lock.  Obviously, it cannot acquire the lock,
> because the lock hold by the process which performs TRUNCATE command.
> This is why the TRUNCATE in synchronous logical replication is blocked.
Yeah, the TRUNCATE waits in SyncRepWaitForLSN() while
the walsender is blocked by the AccessExclusiveLock that the TRUNCATE holds,
which prevents the subscriber from receiving the change and leads to a sort of deadlock.
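
To make the wait cycle concrete, here is a toy Python model of it. This is only a sketch and not PostgreSQL code: simulate_truncate, table_lock, commit_written and confirmed are hypothetical names, and the timeouts exist only so that the demo terminates (the real backend waits without a timeout).

```python
import threading


def simulate_truncate(synchronous, timeout=0.2):
    """Toy model of the TRUNCATE backend and the walsender (hypothetical names)."""
    table_lock = threading.Lock()        # stands in for the AccessExclusiveLock
    commit_written = threading.Event()   # the change is available for decoding
    confirmed = threading.Event()        # the subscriber has confirmed the change
    result = {"delivered": False, "backend_confirmed": None}

    def backend():
        with table_lock:
            commit_written.set()
            if synchronous:
                # Like SyncRepWaitForLSN(): wait for confirmation
                # while the lock is still held.
                result["backend_confirmed"] = confirmed.wait(timeout * 2)
        # The lock is released only after leaving the block above.

    def walsender():
        commit_written.wait()
        # The walsender needs the table lock before it can send the change.
        if table_lock.acquire(timeout=timeout):
            try:
                result["delivered"] = True
                confirmed.set()
            finally:
                table_lock.release()

    threads = [threading.Thread(target=backend),
               threading.Thread(target=walsender)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


if __name__ == "__main__":
    print("sync :", simulate_truncate(synchronous=True))
    print("async:", simulate_truncate(synchronous=False))
```

In the synchronous case neither side makes progress until the artificial timeouts fire, while in the asynchronous case the backend releases the lock right away and the walsender delivers the change.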


On Wednesday, April 7, 2021 3:56 PM tanghy.fnst@fujitsu.com <tanghy.fnst@fujitsu.com> wrote:
> I checked the PG-DOC, found it says that “Replication of TRUNCATE
> commands is supported”[1], so maybe TRUNCATE is not supported in
> synchronous logical replication?
>
> If my understanding is right, maybe PG-DOC can be modified like this. Any
> thought?
> Replication of TRUNCATE commands is supported
> ->
> Replication of TRUNCATE commands is supported in asynchronous mode
I'm not sure whether this will be the final solution,
but if we do fix the doc, we have to be careful with the description,
because when we remove the primary keys of the 'test' tables in the scenario in [1], this issue does not occur.
That means TRUNCATE in synchronous logical replication is not always blocked.

Having the primary key on the publisher alone is enough to cause the hang.
I can also observe the same hang with REPLICA IDENTITY USING INDEX and no primary key on the publisher,
while I cannot reproduce the problem with REPLICA IDENTITY FULL and no primary key.
This difference comes from logicalrep_write_attrs(), which has a branch that calls RelationGetIndexAttrBitmap().
Therefore, strictly speaking, I think the proposed description above is not correct.
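
The observations above can be summarized with a small Python predicate. It is a rough sketch under my own assumptions, not PostgreSQL code: walsender_would_block is a hypothetical name, and the rule that the index bitmap lookup only conflicts with the TRUNCATE when the table has at least one index is an assumption chosen to match the four cases I tested.

```python
def walsender_would_block(replica_identity, index_count):
    """Toy predicate (hypothetical): does the walsender hang behind TRUNCATE?

    replica_identity: "default", "index" or "full"
    index_count: number of indexes on the published table
    """
    if replica_identity == "full":
        # The branch calling RelationGetIndexAttrBitmap() is not taken.
        return False
    # Assumption: the lookup only has to touch indexes that actually exist.
    return index_count > 0


# The four observed cases from this thread.
cases = [
    ("default", 1),  # primary key on the publisher: hang
    ("index", 1),    # REPLICA IDENTITY USING INDEX, no primary key: hang
    ("full", 0),     # REPLICA IDENTITY FULL, no primary key: no hang
    ("default", 0),  # primary key removed: no hang
]

if __name__ == "__main__":
    for identity, indexes in cases:
        print(identity, indexes, walsender_would_block(identity, indexes))
```

Any combination outside these four cases is only a prediction of the sketch, not something I have tested.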

I'll share my analysis when I have a better idea of how to address this.

[1] -
https://www.postgresql.org/message-id/OS0PR01MB6113C2499C7DC70EE55ADB82FB759%40OS0PR01MB6113.jpnprd01.prod.outlook.com

Best Regards,
    Takamichi Osumi



