RE: Slow catchup of 2PC (twophase) transactions on replica in LR - Mailing list pgsql-hackers

From Hayato Kuroda (Fujitsu)
Subject RE: Slow catchup of 2PC (twophase) transactions on replica in LR
Msg-id OSBPR01MB25528F4B0B8178D3AA8DE2BFF5082@OSBPR01MB2552.jpnprd01.prod.outlook.com
In response to Re: Slow catchup of 2PC (twophase) transactions on replica in LR  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Slow catchup of 2PC (twophase) transactions on replica in LR
List pgsql-hackers
Dear Amit,

> > FYI - We also considered the idea which walsender waits until all prepared
> > transactions are resolved before decoding and sending changes, but it did
> > not work well - the restarted walsender sent only COMMIT PREPARED record
> > for transactions which have been prepared before disabling the
> > subscription. This happened because
> > 1) if the two_phase option of slots is false, the confirmed_flush can be ahead of
> >    PREPARE record, and
> > 2) after the altering and restarting, start_decoding_at becomes same as
> >    confirmed_flush and records behind this won't be decoded.
> >
> 
> I don't understand the exact problem you are facing. IIUC, if the
> commit is after start_decoding_at point and prepare was before it, we
> expect to send the entire transaction followed by a commit record. The
> restart_lsn should be before the start of such a transaction and we
> should have recorded the changes in the reorder buffer.

This behavior is correct for the two_phase = false case. But if the parameter is
altered between PREPARE and COMMIT PREPARED, it is possible that only
COMMIT PREPARED is sent. First, the workload I executed is shown below.

1. created a subscription with (two_phase = false)
2. prepared a transaction on publisher
3. disabled the subscription once
4. altered the subscription to two_phase = true
5. enabled the subscription again
6. did COMMIT PREPARED on the publisher

-> The apply worker would raise an ERROR while applying the COMMIT PREPARED record:
ERROR:  prepared transaction with identifier "pg_gid_XXX_YYY" does not exist

The part below describes why the ERROR occurred.

======

### Regarding 1) the confirmed_flush can be ahead of the PREPARE record

If two_phase is off, as you might know, confirmed_flush can get ahead of the
PREPARE record because of the keepalive mechanism.

The walsender sometimes sends a keepalive message in WalSndKeepalive(). The
message carries an LSN, which is that of the last decoded record. Since the
PREPARE record is skipped (it is just handled by ReorderBufferProcessXid()), the
LSN written in the message can be ahead of the PREPARE record. If the WAL records
are laid out as below, the LSN can point to CHECKPOINT_ONLINE.

...
INSERT
PREPARE txn1
CHECKPOINT_ONLINE
...

On the worker side, when it receives the keepalive, it compares the LSN in the
message with the last received LSN and advances last_received. Then the worker
replies to the walsender, and at that time it reports that the last_received
record has been flushed on the subscriber. See send_feedback().
 
On the publisher, when the walsender receives the message from the subscriber, it
reads the message and advances confirmed_flush to the reported value. If the
walsender has sent an LSN that is ahead of PREPARE, confirmed_flush is advanced
to it as well.
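
To make the sequence above concrete, here is a toy model in Python. It is only
a sketch of the behavior described in this mail, not PostgreSQL code: the LSN
values, the class names (Slot, ApplyWorker) and run_walsender() are made up for
illustration, and the real logic lives in WalSndKeepalive(), send_feedback()
and the walsender's reply handling.

```python
from dataclasses import dataclass

# Made-up LSNs; the record order follows the WAL layout shown above.
WAL = [(100, "INSERT"), (110, "PREPARE txn1"), (120, "CHECKPOINT_ONLINE")]
PREPARE_LSN = 110


@dataclass
class Slot:
    confirmed_flush: int


@dataclass
class ApplyWorker:
    last_received: int = 0

    def on_keepalive(self, lsn: int) -> int:
        """Advance last_received and return the LSN reported as flushed."""
        if lsn > self.last_received:
            self.last_received = lsn
        # Nothing was sent for txn1, so nothing is pending on the subscriber
        # and the worker reports last_received itself as flushed.
        return self.last_received


def run_walsender(slot: Slot, worker: ApplyWorker) -> None:
    last_decoded = slot.confirmed_flush
    for lsn, _record in WAL:
        # two_phase = false: PREPARE is skipped and no change is sent before
        # the commit, but the decoding position still moves forward.
        last_decoded = lsn
    flushed = worker.on_keepalive(last_decoded)  # keepalive carries the LSN
    slot.confirmed_flush = max(slot.confirmed_flush, flushed)


slot = Slot(confirmed_flush=90)
run_walsender(slot, ApplyWorker())
print(slot.confirmed_flush, slot.confirmed_flush > PREPARE_LSN)
```

In this model the slot's confirmed_flush ends up at the CHECKPOINT_ONLINE
record, that is, past the PREPARE record, although nothing of txn1 has reached
the subscriber.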

### Regarding 2) after the altering, records behind the confirmed_flush are not decoded

Next, the decoding phase. The snapshot builder determines the point where decoding
resumes, as start_decoding_at. After the restart, this value is the same as the
confirmed_flush of the slot. Since confirmed_flush is ahead of PREPARE,
start_decoding_at is ahead of it as well, so the prepared transactions are not
decoded at all.
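
The effect can be sketched with another small Python model. Again this is an
illustration under assumptions, not the actual decoding code: the LSNs are
invented, decode() and apply() are hypothetical helpers, and LookupError stands
in for the ERROR raised by the apply worker.

```python
# Made-up LSNs for the workload: PREPARE happens before the subscription is
# altered, COMMIT PREPARED after it is re-enabled with two_phase = true.
WAL = [
    (100, "INSERT", "txn1"),
    (110, "PREPARE", "txn1"),
    (120, "CHECKPOINT_ONLINE", None),
    (130, "COMMIT PREPARED", "txn1"),
]


def decode(wal, start_decoding_at, two_phase):
    """Return the two-phase messages the walsender would send."""
    sent = []
    for lsn, kind, xid in wal:
        if lsn < start_decoding_at:
            continue  # records behind start_decoding_at are not decoded
        if two_phase and kind in ("PREPARE", "COMMIT PREPARED"):
            sent.append((kind, xid))
    return sent


def apply(messages):
    """Apply messages on the subscriber; return still-prepared transactions."""
    prepared = set()
    for kind, xid in messages:
        if kind == "PREPARE":
            prepared.add(xid)
        elif xid not in prepared:
            raise LookupError(f"prepared transaction for {xid} does not exist")
        else:
            prepared.discard(xid)
    return prepared


confirmed_flush = 120  # already ahead of PREPARE, as explained in 1)
messages = decode(WAL, start_decoding_at=confirmed_flush, two_phase=True)
print(messages)
```

With start_decoding_at equal to confirmed_flush, only the COMMIT PREPARED
message is produced, and apply(messages) fails because the subscriber never saw
the PREPARE. Starting from an LSN at or before the PREPARE record (for example
100 in this model) yields both messages and applies cleanly.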

======

The attached zip file contains the PoC and the script I used. You can refer to it
to see what I actually did.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 

