Re: Potential data loss due to race condition during logical replication slot creation - Mailing list pgsql-bugs
From: Masahiko Sawada
Subject: Re: Potential data loss due to race condition during logical replication slot creation
Date:
Msg-id: CAD21AoC8cKeRyQWtF0GCXtf=B_ry7EvDvnH1ZRuEm9Ld79er9Q@mail.gmail.com
In response to: Potential data loss due to race condition during logical replication slot creation ("Callahan, Drew" <callaan@amazon.com>)
Responses: Re: Potential data loss due to race condition during logical replication slot creation
List: pgsql-bugs
Hi,

Thank you for the report!

On Fri, Feb 2, 2024 at 7:28 AM Callahan, Drew <callaan@amazon.com> wrote:
>
> Hello,
>
> We discovered a race condition during logical replication slot creation
> that can result in the changes for transactions running at the time of
> the slot creation being only partially replicated. We found the cause
> was the slot transitioning from an inconsistent or partially consistent
> state to a fully consistent state when restoring a snapshot that had
> been persisted to disk by a different logical slot. We provide a simple
> reproduction of this issue below:
>
> Session 1:
>
> SELECT pg_create_logical_replication_slot('slot1', 'test_decoding');
>
> CREATE TABLE test (a int);
> BEGIN;
> INSERT INTO test VALUES (1);
>
> Session 2:
>
> SELECT pg_create_logical_replication_slot('slot2', 'test_decoding');
>
> <query hangs>
>
> Session 3:
>
> CHECKPOINT;
>
> select pg_logical_slot_get_changes('slot1', NULL, NULL);
>
> <should return nothing of interest>
>
> Session 1:
>
> INSERT INTO test VALUES (2);
> COMMIT;
>
> <Session 2 query no longer hangs and successfully creates slot2>
>
> Session 2:
>
> select pg_logical_slot_get_changes('slot1', NULL, NULL);
>
> select pg_logical_slot_get_changes('slot2', NULL, NULL);
>
> <expected: no rows of the txn are returned for slot2>
> <actual: the 2nd row of the txn is returned for slot2>

I've confirmed this happens on HEAD and all supported back branches, and
I think it's a (long-standing) bug.

I think we can prepare an isolation test of the above steps and include
it in contrib/test_decoding/specs.
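Something like the following sketch could cover it (untested; written in
the style of the existing ondisk_startup.spec, with file and step names
as placeholders). It relies on the isolation tester noticing that the
slot-creation step blocks on s1's transaction and continuing with the
other sessions:

# Slot creation must not become consistent by restoring a snapshot
# serialized by another slot while transactions are still in progress.

setup
{
    CREATE TABLE test (a int);
    SELECT 'init' FROM pg_create_logical_replication_slot('slot1', 'test_decoding');
}

teardown
{
    DROP TABLE test;
    SELECT 'stop' FROM pg_drop_replication_slot('slot1');
    SELECT 'stop' FROM pg_drop_replication_slot('slot2');
}

session "s1"
step "s1_begin"   { BEGIN; }
step "s1_insert1" { INSERT INTO test VALUES (1); }
step "s1_insert2" { INSERT INTO test VALUES (2); }
step "s1_commit"  { COMMIT; }

session "s2"
# Blocks on s1's in-progress transaction until "s1_commit".
step "s2_create_slot2" { SELECT 'init' FROM pg_create_logical_replication_slot('slot2', 'test_decoding'); }
step "s2_get_slot2"    { SELECT data FROM pg_logical_slot_get_changes('slot2', NULL, NULL); }

session "s3"
step "s3_checkpoint" { CHECKPOINT; }
# Decoding slot1 past the checkpoint's running-xacts record serializes
# a snapshot to disk, which slot2's creation must not restore.
step "s3_get_slot1"  { SELECT data FROM pg_logical_slot_get_changes('slot1', NULL, NULL); }

permutation "s1_begin" "s1_insert1" "s2_create_slot2" "s3_checkpoint" "s3_get_slot1" "s1_insert2" "s1_commit" "s2_get_slot2"

With the bug, "s2_get_slot2" returns the second INSERT of s1's
transaction; after a fix it should return no changes from that
transaction at all.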
> Newly created logical replication slots initialize their restart LSN
> to the current insert position within the WAL and also force a
> checkpoint to get the current state of the running transactions on the
> system. This create process will then wait for all of the transactions
> within that running xact record to complete before being able to
> transition to the next snapbuild state. During this time period, if
> another running xact record is written and then a different logical
> replication process decodes this running xact record, a globally
> accessible snapshot will be persisted to disk.
>
> Once all of the transactions from the initial running xact have
> finished, the process performing the slot creation will become
> unblocked and will then consume the new running xact record. The
> process will see a valid snapshot, restore that snapshot from disk,
> and then transition immediately to the consistent state. The slot will
> then set the confirmed flush LSN of the slot to the start of the next
> record after that running xact.

My analysis is the same. After the slot creation, the slot's LSNs end up
ordered like:

(tx begin) < (some of tx's changes) < slot's restart_lsn
  < (more of tx's changes) < slot's confirmed_flush_lsn < (tx commit)

> We now have a logical replication slot that has a restart LSN after
> the changes of transactions that will commit after our confirmed flush
> LSN in any case where the two running xact records are not fully
> disjoint. Once those transactions commit, we will then partially
> stream their changes.
>
> The attached fix addresses the issue by providing the snapshot builder
> with enough context to understand whether it is building a snapshot
> for a slot for the first time or if this is a previously existing
> slot. This is similar to the "need_full_snapshot" parameter, which is
> already set by the caller to control when and how the snapbuilder is
> allowed to become consistent.
>
> With this context, the snapshot builder can always skip performing
> snapshot restore in order to become fully consistent. Since this issue
> only occurs when the logical replication slot consumes a persisted
> snapshot to become fully consistent, we can prevent the issue by
> disallowing this behavior.

I basically agree with the idea of skipping snapshot restore while
finding the initial start point, but I'd like to hear other opinions
too.

Another simple way to fix this would be to always set need_full_snapshot
to true during slot creation. But it would then make the snapshot
builder track non-catalog-change transactions too while finding the
initial start point, which is not necessary.

As for the proposed patch, it uses a global variable, InCreate, and a
memory context callback to reset it. While I believe this is to avoid
breaking any ABI, as this bug exists on back branches too, I think it's
better to consider how to fix it on HEAD first and then how to backpatch
it. We don't necessarily need to have the same fix for all branches all
the time.

IIUC we need to skip snapshot restore only when we're finding the
initial start point. So probably we don't need to set InCreate when
calling create_logical_replication_slot() with find_startpoint = false.
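For illustration, the HEAD-only shape of that idea could look roughly
like the sketch below. The in_create field and the extra
AllocateSnapshotBuilder() parameter are hypothetical names, not taken
from the posted patch:

/* snapbuild.c (sketch, not the actual patch) */

struct SnapBuild
{
    /* ... existing fields ... */

    /*
     * Hypothetical: true while building the initial snapshot for a slot
     * that is being created with find_startpoint = true; set from a new
     * AllocateSnapshotBuilder() parameter, like building_full_snapshot.
     */
    bool        in_create;
};

/* In SnapBuildFindSnapshot(), when deciding whether to restore a
 * serialized snapshot: */

    /*
     * c) Valid on-disk state, and we are neither building a full
     * snapshot nor finding the initial start point of a slot being
     * created. Restoring a snapshot serialized by another slot during
     * creation can move us straight to CONSISTENT even though
     * transactions that started before restart_lsn are still running,
     * which would later stream only part of their changes.
     */
    else if (!builder->building_full_snapshot &&
             !builder->in_create &&
             SnapBuildRestore(builder, lsn))
    {
        /* there won't be any state to cleanup */
        return false;
    }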
Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com