Re: Potential data loss due to race condition during logical replication slot creation - Mailing list pgsql-bugs

From Masahiko Sawada
Subject Re: Potential data loss due to race condition during logical replication slot creation
Date
Msg-id CAD21AoCDv_1TCKMZUds=f-ywEM2w+751ztTMgtOR1w8jEOy1dQ@mail.gmail.com
In response to Re: Potential data loss due to race condition during logical replication slot creation  (vignesh C <vignesh21@gmail.com>)
List pgsql-bugs
On Sat, Feb 3, 2024 at 10:48 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Fri, 2 Feb 2024 at 06:24, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Hi,
> >
> > Thank you for the report!
> >
> > On Fri, Feb 2, 2024 at 7:28 AM Callahan, Drew <callaan@amazon.com> wrote:
> > >
> > > Hello,
> > >
> > > We discovered a race condition during logical replication slot
> > > creation that can result in the changes for transactions running at
> > > the time of the slot creation being only partially replicated. We
> > > found the cause to be the slot transitioning from an inconsistent or
> > > partially consistent state to a fully consistent state when restoring
> > > a snapshot that had been persisted to disk by a different logical
> > > slot. We provide a simple reproduction of this issue below:
> > >
> > > Session 1:
> > >
> > > SELECT pg_create_logical_replication_slot('slot1', 'test_decoding');
> > >
> > > CREATE TABLE test (a int);
> > > BEGIN;
> > > INSERT INTO test VALUES (1);
> > >
> > > Session 2:
> > >
> > > SELECT pg_create_logical_replication_slot('slot2', 'test_decoding');
> > >
> > > <query hangs>
> > >
> > > Session 3:
> > >
> > > CHECKPOINT;
> > >
> > > select pg_logical_slot_get_changes('slot1', NULL, NULL);
> > >
> > > <should return nothing of interest>
> > >
> > > Session 1:
> > >
> > > INSERT INTO test VALUES (2);
> > > COMMIT;
> > >
> > > <Session 2 query no longer hangs and successfully creates the slot2>
> > >
> > > Session 2:
> > >
> > > select pg_logical_slot_get_changes('slot1', NULL, NULL);
> > >
> > > select pg_logical_slot_get_changes('slot2', NULL, NULL);
> > >
> > > <expected: no rows of the txn are returned for slot2>
> > > <actual: The 2nd row of the txn is returned for slot2>
> > >
> >
> > I've confirmed this happens on HEAD and all supported back
> > branches. And I think it's a (long-standing) bug.
> >
> > I think we can prepare an isolation test of the above steps and
> > include it in contrib/test_decoding/specs.
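
Such a spec might look roughly like the sketch below (illustrative only; step names and the exact permutation are my own, not the committed test, and the isolation tester may need hang markers for the blocking step):

```
setup
{
    SELECT 'init' FROM pg_create_logical_replication_slot('slot1', 'test_decoding');
    CREATE TABLE test (a int);
}

teardown
{
    DROP TABLE test;
    SELECT pg_drop_replication_slot('slot1');
}

session "s1"
step "s1_begin"   { BEGIN; }
step "s1_insert1" { INSERT INTO test VALUES (1); }
step "s1_insert2" { INSERT INTO test VALUES (2); }
step "s1_commit"  { COMMIT; }

session "s2"
step "s2_create_slot" { SELECT 'create' FROM pg_create_logical_replication_slot('slot2', 'test_decoding'); }
step "s2_get_slot2"   { SELECT data FROM pg_logical_slot_get_changes('slot2', NULL, NULL); }

session "s3"
step "s3_checkpoint" { CHECKPOINT; }
step "s3_get_slot1"  { SELECT data FROM pg_logical_slot_get_changes('slot1', NULL, NULL); }

# s2_create_slot blocks on s1's open transaction; on a fixed build,
# s2_get_slot2 should stream none of that transaction's rows.
permutation "s1_begin" "s1_insert1" "s2_create_slot" "s3_checkpoint" "s3_get_slot1" "s1_insert2" "s1_commit" "s2_get_slot2"
```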
> >
> > > Newly created logical replication slots initialize their restart
> > > LSN to the current insert position within the WAL and also force a
> > > checkpoint to get the current state of the running transactions on
> > > the system. This create process will then wait for all of the
> > > transactions within that running xact record to complete before
> > > being able to transition to the next snapbuild state. During this
> > > time period, if another running xact record is written and then a
> > > different logical replication process decodes this running xact
> > > record, a globally accessible snapshot will be persisted to disk.
> > >
> > > Once all of the transactions from the initial running xact have
> > > finished, the process performing the slot creation will become
> > > unblocked and will then consume the new running xact record. The
> > > process will see a valid snapshot, restore that snapshot from disk,
> > > and then transition immediately to the consistent state. The slot
> > > will then set the confirmed flush LSN of the slot to the start of
> > > the next record after that running xact.
> >
> > My analysis is the same. After the slot creation, the slot's LSNs become like:
> >
> > (tx-begin) < (tx's some changes) < slot's restart_lsn < (tx's
> > further some changes) < slot's confirmed_flush_lsn < (tx-commit)
> >
> > > We now have a logical replication slot whose restart LSN is after
> > > the changes of transactions that will commit after our confirmed
> > > flush LSN in any case where the two running xact records are not
> > > fully disjoint. Once those transactions commit, we will then
> > > partially stream their changes.
> > >
> > > The attached fix addresses the issue by providing the snapshot
> > > builder with enough context to understand whether it is building a
> > > snapshot for a slot for the first time or if this is a previously
> > > existing slot. This is similar to the "need_full_snapshot"
> > > parameter, which is already set by the caller to control when and
> > > how the snapbuilder is allowed to become consistent.
> > >
> > > With this context, the snapshot builder can always skip performing
> > > snapshot restore in order to become fully consistent. Since this
> > > issue only occurs when the logical replication slot consumes a
> > > persisted snapshot to become fully consistent, we can prevent the
> > > issue by disallowing this behavior.
> >
> > I basically agree with the basic idea of skipping snapshot restore
> > while finding the initial start point. But I'd like to hear other
> > opinions too.
> >
> > Another simple way to fix this is to always set need_full_snapshot
> > to true during slot creation. But that would then make the snapshot
> > builder include non-catalog-change transactions too while finding
> > the initial start point, which is not necessary.
> >
> > As for the proposed patch, it uses a global variable, InCreate, and
> > a memory context callback to reset it. While I believe this is to
> > avoid breaking any ABI, since this bug exists on back branches too,
> > I think it's better to consider how to fix it on HEAD first and then
> > how to backpatch it. We don't necessarily need to have the same fix
> > for all branches all the time.
>
> I agree with having a separate patch for HEAD in this case. How
> about a fix like the attached patch?

Thank you for the patch!

I've just sent some comments on how to fix this issue and updated the
isolation test patch[1]. While I basically agree with adding the flag
in SnapBuild, I think we can set the flag in AllocateSnapshotBuilder().
Also, we need to bump SNAPBUILD_VERSION.

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoCsSkwcAYgcUMi2zKp8F%3DmoJL4K1beGD6AV_%3DpHhfAM9Q%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


