Re: Potential data loss due to race condition during logical replication slot creation - Mailing list pgsql-bugs

From Masahiko Sawada
Subject Re: Potential data loss due to race condition during logical replication slot creation
Date
Msg-id CAD21AoC8cKeRyQWtF0GCXtf=B_ry7EvDvnH1ZRuEm9Ld79er9Q@mail.gmail.com
In response to Potential data loss due to race condition during logical replication slot creation  ("Callahan, Drew" <callaan@amazon.com>)
Responses Re: Potential data loss due to race condition during logical replication slot creation  (vignesh C <vignesh21@gmail.com>)
List pgsql-bugs
Hi,

Thank you for the report!

On Fri, Feb 2, 2024 at 7:28 AM Callahan, Drew <callaan@amazon.com> wrote:
>
> Hello,
>
> We discovered a race condition during logical replication slot creation that can result in the changes for
> transactions running at the time of the slot creation to only be partially replicated. We found the cause was due to
> the slot transitioning from an inconsistent or partially consistent state to a fully consistent state when restoring
> a snapshot that had been persisted to disk by a different logical slot. We provide a simple reproduction of this
> issue below:
>
> Session 1:
>
> SELECT pg_create_logical_replication_slot('slot1', 'test_decoding');
>
> CREATE TABLE test (a int);
> BEGIN;
> INSERT INTO test VALUES (1);
>
> Session 2:
>
> SELECT pg_create_logical_replication_slot('slot2', 'test_decoding');
>
> <query hangs>
>
> Session 3:
>
> CHECKPOINT;
>
> select pg_logical_slot_get_changes('slot1', NULL, NULL);
>
> <should return nothing of interest>
>
> Session 1:
>
> INSERT INTO test VALUES (2);
> COMMIT;
>
> <Session 2 query no longer hangs and successfully creates the slot2>
>
> Session 2:
>
> select pg_logical_slot_get_changes('slot1', NULL, NULL);
>
> select pg_logical_slot_get_changes('slot2', NULL, NULL);
>
> <expected: no rows of the txn are returned for slot2>
> <actual: The 2nd row of the txn is returned for slot2>
>

I've confirmed this happens on HEAD and all supported back branches.
I think it's a (long-standing) bug.

I think we can prepare an isolation test of the above steps and
include it in contrib/test_decoding/specs.
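
Something like the following spec could work. This is an untested
sketch (the step names are mine and the choreography may need
adjustment); note that the isolation tester should detect that
"s2_slot" blocks on the in-progress transaction and continue with the
remaining steps:

setup
{
    CREATE TABLE test (a int);
}

teardown
{
    DROP TABLE test;
    SELECT pg_drop_replication_slot('slot1');
    SELECT pg_drop_replication_slot('slot2');
}

session "s1"
step "s1_slot" { SELECT 'init' FROM pg_create_logical_replication_slot('slot1', 'test_decoding'); }
step "s1_begin" { BEGIN; }
step "s1_insert1" { INSERT INTO test VALUES (1); }
step "s1_insert2" { INSERT INTO test VALUES (2); }
step "s1_commit" { COMMIT; }

session "s2"
step "s2_slot" { SELECT 'init' FROM pg_create_logical_replication_slot('slot2', 'test_decoding'); }
step "s2_get" { SELECT data FROM pg_logical_slot_get_changes('slot2', NULL, NULL, 'skip-empty-xacts', '1'); }

session "s3"
step "s3_ckpt" { CHECKPOINT; }
step "s3_get" { SELECT data FROM pg_logical_slot_get_changes('slot1', NULL, NULL, 'skip-empty-xacts', '1'); }

permutation "s1_slot" "s1_begin" "s1_insert1" "s2_slot" "s3_ckpt" "s3_get" "s1_insert2" "s1_commit" "s2_get"

With the bug, the final "s2_get" step returns the second INSERT of the
still-open transaction; the expected output is no rows for that
transaction.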

> Newly created logical replication slots initialize their restart LSN to the current insert position within the WAL
> and also force a checkpoint to get the current state of the running transactions on the system. This create process
> will then wait for all of the transactions within that running xact record to complete before being able to
> transition to the next snapbuild state. During this time period, if another running xact record is written and then
> a different logical replication process decodes this running xact record, a globally accessible snapshot will be
> persisted to disk.
>
> Once all of the transactions from the initial running xact have finished, the process performing the slot creation
> will become unblocked and will then consume the new running xact record. The process will see a valid snapshot,
> restore that snapshot from disk, and then transition immediately to the consistent state. The slot will then set
> the confirmed flush LSN of the slot to the start of the next record after that running xact.

My analysis is the same. After the slot creation, the slot's LSNs are
ordered like:

(tx-begin) < (some of tx's changes) < slot's restart_lsn < (more of
tx's changes) < slot's confirmed_flush_lsn < (tx-commit)

> We now have a logical replication slot that has a restart LSN after the changes of transactions that will commit
> after our confirmed flush LSN in any case where the two running xact records are not fully disjoint. Once those
> transactions commit, we will then partially stream their changes.
>
> The attached fix addresses the issue by providing the snapshot builder with enough context to understand whether it
> is building a snapshot for a slot for the first time or if this is a previously existing slot. This is similar to
> the "need_full_snapshot" parameter which is already set by the caller to control when and how the snapbuilder is
> allowed to become consistent.
>
> With this context, the snapshot builder can always skip performing snapshot restore in order to become fully
> consistent. Since this issue only occurs when the logical replication slot consumes a persisted snapshot to become
> fully consistent, we can prevent the issue by disallowing this behavior.

I agree with the basic idea of skipping the snapshot restore while
finding the initial start point. But I'd like to hear other opinions
too.

Another simple way to fix this is to always set need_full_snapshot to
true during slot creation. But that would make the snapshot builder
also include non-catalog-change transactions while finding the
initial start point, which is not necessary.
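
For reference, that flag originates in
create_logical_replication_slot() in slotfuncs.c, which passes it down
when creating the decoding context. Roughly (paraphrased from memory,
so the surrounding arguments are approximate):

    ctx = CreateInitDecodingContext(plugin, NIL,
                                    false,  /* need_full_snapshot: track only
                                             * catalog-modifying xacts */
                                    InvalidXLogRecPtr,
                                    XL_ROUTINE(.page_read = read_local_xlog_page,
                                               .segment_open = wal_segment_open,
                                               .segment_close = wal_segment_close),
                                    NULL, NULL, NULL);

The "simple" fix would amount to passing true here, at the cost of
having the snapshot builder track all running transactions rather
than only catalog-modifying ones.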

As for the proposed patch, it uses a global variable, InCreate, and a
memory context callback to reset it. While I believe this is to avoid
breaking any ABI, since this bug also exists on back branches, I
think it's better to first consider how to fix it on HEAD and then
how to backpatch it. We don't necessarily need the same fix for all
branches.

IIUC we need to skip snapshot restore only when we're finding the
initial start point. So probably we don't need to set InCreate when
calling create_logical_replication_slot() with find_startpoint =
false.
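
On HEAD, one way to pass that context down explicitly would be a new
AllocateSnapshotBuilder() parameter remembered in SnapBuild and
checked where the builder currently attempts the restore in
SnapBuildFindSnapshot(). A rough sketch only; "in_slot_creation" is a
placeholder name and the surrounding code is paraphrased:

    /* snapbuild.c: remember whether we're creating a new slot */
    SnapBuild *
    AllocateSnapshotBuilder(ReorderBuffer *reorder,
                            TransactionId xmin_horizon,
                            XLogRecPtr start_lsn,
                            bool need_full_snapshot,
                            bool in_slot_creation,  /* new */
                            XLogRecPtr two_phase_at)
    {
        ...
        builder->in_slot_creation = in_slot_creation;
        ...
    }

    /* in SnapBuildFindSnapshot(): never restore a serialized snapshot
     * while finding the initial start point for a new slot */
    else if (!builder->building_full_snapshot &&
             !builder->in_slot_creation &&
             SnapBuildRestore(builder, lsn))
    {
        /* there won't be any state to cleanup */
        return false;
    }

Callers creating a slot without finding a start point could then pass
false and keep the current behavior.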

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


