RE: Potential data loss due to race condition during logical replication slot creation - Mailing list pgsql-bugs

From Hayato Kuroda (Fujitsu)
Subject RE: Potential data loss due to race condition during logical replication slot creation
Date
Msg-id TYCPR01MB12077ADB13450B422B6161A01F5342@TYCPR01MB12077.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Potential data loss due to race condition during logical replication slot creation  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: Potential data loss due to race condition during logical replication slot creation
List pgsql-bugs
Dear Sawada-san,

Thanks for giving comments and sorry for late reply.

> > If so, one idea to
> > achieve could be that we maintain the highest_running_xid while
> > serailizing the snapshot and then during restore if that
> > highest_running_xid is <= builder->initial_xmin_horizon, then we
> > ignore restoring the snapshot. We already have few such cases handled
> > in SnapBuildRestore().
> 
> I think that builder->initial_xmin_horizon could be older than
> highest_running_xid, for example, when there is a logical replication
> slot whose catalog_xmin is old. However, even in this case, we might
> need to ignore restoring the snapshot. For example, a slightly
> modified test case still can cause the same problem.
> 
> The test case in the Kuroda-san's v2 patch:
> permutation "s0_init" "s0_begin" "s0_insert1" "s1_init"
> "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit"
> "s1_get_changes_slot0"\   "s1_get_changes_slot1"
> 
> Modified-version test case (add "s0_insert1" between "s0_init" and "s0_begin"):
> permutation "s0_init"  "s0_insert1" "s0_begin" "s0_insert1" "s1_init"
> "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit"
> "s1_get_changes_slot0\  " "s1_get_changes_slot1"

Good catch, I confirmed that the variant still partially output transactions:

```
step s1_get_changes_slot1: SELECT data FROM pg_logical_slot_get_changes('slot1', NULL, NULL, 'skip-empty-xacts', '1',
'include-xids','0');
 
data                                     
-----------------------------------------
BEGIN                                    
table public.tbl: INSERT: val1[integer]:2
COMMIT                                   
(3 rows)
```

/////

I analyzed a bit. In below descriptions, I uses these notations:

txn0 - has a single insert
txn1 - has begin-insert-commit operations.
       slot1 is created in parallel

While creating slot0, initial_xmin_horizon was also set, and it come from
GetOldestSafeDecodingTransactionId().
Since there were no slots in the database before, the value was just a next xid.

Then, the pinned xid was assigned to the txn0.
It is correct behavior because slot0 must decode txn0.

Finally, while creating slot1 after the txn1, the snapshot serialized by slot0 was
restored once. txn1 was the recorded as the highest_running, but initial_xmin was
older than it. So the snapshot was restored and txn1 was partially decoded.

initial_xmin_horizon (txn0) < highest_running_xid (txn1)

/////

So, how should we fix? One idea is based on (c), and adds a flag which indicates
an existence of concurrent transactions. The restoration is skipped if both two
flags are true. PSA the PoC patch.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


Attachment

pgsql-bugs by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: BUG #18379: LDAP bind password exposed
Next
From: PG Bug reporting form
Date:
Subject: BUG #18409: After my windows update, I can not run postgre 16 server