Thread: Potential data loss due to race condition during logical replication slot creation

Potential data loss due to race condition during logical replication slot creation

From

"Callahan, Drew"

Date:

01 February 2024, 22:28:44

Hello,

We discovered a race condition during logical replication slot creation that can result in the changes for transactions running at the time of the slot creation to only be partially replicated. We found the cause was due to the slot transitioning from an inconsistent or partially consistent state to a fully consistent state when restoring a snapshot that had been persisted to disk by a different logical slot. We provide a simple reproduction of this issue below:

Session 1:

SELECT pg_create_logical_replication_slot('slot1', 'test_decoding');

CREATE TABLE test (a int);
BEGIN;
INSERT INTO test VALUES (1);

Session 2:

SELECT pg_create_logical_replication_slot('slot2', 'test_decoding');

<query hangs>

Session 3:

CHECKPOINT;

select pg_logical_slot_get_changes('slot1', NULL, NULL);

<should return nothing of interest>

Session 1:

INSERT INTO test VALUES (2);
COMMIT;

<Session 2 query no longer hangs and successfully creates the slot2>

Session 2:

select pg_logical_slot_get_changes('slot1', NULL, NULL);

select pg_logical_slot_get_changes('slot2', NULL, NULL);

<expected: no rows of the txn are returned for slot2>
<actual: The 2nd row of the txn is returned for slot2>

Newly created logical replication slots initialize their restart LSN to the current insert position within the WAL and also force a checkpoint to get the current state of the running transactions on the system. This create process will then wait for all of the transactions within that running xact record to complete before being able to transition to the next snapbuild state. During this time period, if another running xact record is written and then a different logical replication process decodes this running xact record, a globally accessible snapshot will be persisted to disk.

Once all of the transactions from the initial running xact have finished, the process performing the slot creation will become unblocked and will then consume the new running xact record. The process will see a valid snapshot, restore that snapshot from disk, and then transition immediately to the consistent state. The slot will then set the confirmed flush LSN of the slot to the start of the next record after that running xact.

We now have a logical replication slot that has a restart LSN after the changes of transactions that will commit after our confirmed flushed LSN in any case where the two running xact records are not fully disjointed. Once those transactions commit, we will then partially stream their changes.

The attached fix addresses the issue by providing the snapshot builder with enough context to understand whether it is building a snapshot for a slot for the first time or if this is a previously existing slot. This is similar to the “need_full_snapshot” parameter which is already set by the caller to control when and how the snapbuilder is allowed to become consistent.

With this context, the snapshot builder can always skip performing snapshot restore in order to become fully consistent. Since this issue only occurs when the the logical replication slot consumes a persisted snapshot to become fully consistent we can prevent the issue by disallowing this behavior.

Thanks,

Drew

Attachment

skip_snap_restore.patch

Re: Potential data loss due to race condition during logical replication slot creation

From

Masahiko Sawada

Date:

02 February 2024, 00:53:33

Hi,

Thank you for the report!

On Fri, Feb 2, 2024 at 7:28 AM Callahan, Drew <callaan@amazon.com> wrote:
>
> Hello,
>
> We discovered a race condition during logical replication slot creation that can result in the changes for
transactionsrunning at the time of the slot creation to only be partially replicated. We found the cause was due to the
slottransitioning from an inconsistent or partially consistent state to a fully consistent state when restoring a
snapshotthat had been persisted to disk by a different logical slot. We provide a simple reproduction of this issue
below:
>
> Session 1:
>
> SELECT pg_create_logical_replication_slot('slot1', 'test_decoding');
>
> CREATE TABLE test (a int);
> BEGIN;
> INSERT INTO test VALUES (1);
>
> Session 2:
>
> SELECT pg_create_logical_replication_slot('slot2', 'test_decoding');
>
> <query hangs>
>
> Session 3:
>
> CHECKPOINT;
>
> select pg_logical_slot_get_changes('slot1', NULL, NULL);
>
> <should return nothing of interest>
>
> Session 1:
>
> INSERT INTO test VALUES (2);
> COMMIT;
>
> <Session 2 query no longer hangs and successfully creates the slot2>
>
> Session 2:
>
> select pg_logical_slot_get_changes('slot1', NULL, NULL);
>
> select pg_logical_slot_get_changes('slot2', NULL, NULL);
>
> <expected: no rows of the txn are returned for slot2>
> <actual: The 2nd row of the txn is returned for slot2>
>

I've confirmed this happens HEAD and all supported backbranches. And I
think it's a (long-standing) bug.

I think we can prepare an isolation test of the above steps and
include it in contrib/test_decoding/specs.

> Newly created logical replication slots initialize their restart LSN to the current insert position within the WAL
andalso force a checkpoint to get the current state of the running transactions on the system. This create process will
thenwait for all of the transactions within that running xact record to complete before being able to transition to the
nextsnapbuild state. During this time period, if another running xact record is written and then a different logical
replicationprocess decodes this running xact record, a globally accessible snapshot will be persisted to disk. 
>
> Once all of the transactions from the initial running xact have finished, the process performing the slot creation
willbecome unblocked and will then consume the new running xact record. The process will see a valid snapshot, restore
thatsnapshot from disk, and then transition immediately to the consistent state. The slot will then set the confirmed
flushLSN of the slot to the start of the next record after that running xact. 

My analysis is the same. After the slot creation, the slot's LSNs become like:

(tx-begin) < (tx's some changes) < slot's resetart_lsn < (fx's further
some changes) < slot's confirmed_flush_lsn < (tx-commit)

> We now have a logical replication slot that has a restart LSN after the changes of transactions that will commit
afterour confirmed flushed LSN in any case where the two running xact records are not fully disjointed. Once those
transactionscommit, we will then partially stream their changes. 
>
> The attached fix addresses the issue by providing the snapshot builder with enough context to understand whether it
isbuilding a snapshot for a slot for the first time or if this is a previously existing slot. This is similar to the
“need_full_snapshot”parameter which is already set by the caller to control when and how the snapbuilder is allowed to
becomeconsistent. 
>
> With this context, the snapshot builder can always skip performing snapshot restore in order to become fully
consistent.Since this issue only occurs when the the logical replication slot consumes a persisted snapshot to become
fullyconsistent we can prevent the issue by disallowing this behavior. 

I basically agree with the basic idea of skipping snapshot restore
while finding the initial start point. But I'd like to hear other
opinions too.

Another simple way to fix this is that we always set
need_full_snapshot to true when slot creation. But it would then make
the snapshot builder include non-catalog-change transactions too while
finding the initial startpoint. Which is not necessary.

As for the proposed patch, it uses a global variable, InCreate, and
memory context callback to reset it. While I believe this is for not
breaking any ABI as this bug exists on back branches too, I think it's
better to consider how to fix it in HEAD and then how to backpatch it.
We don't necessarily need to have the same fixes for all branches all
the time.

IIUC we need to skip snapshot restore only when we're finding the
initial start point. So probably we don't need to set InCreate when
calling create_logical_replication_slot() with find_startpoint =
false.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Potential data loss due to race condition during logical replication slot creation

From

"Callahan, Drew"

Date:

03 February 2024, 03:12:15

> I think we can prepare an isolation test of the above steps and
> include it in contrib/test_decoding/specs.

Thanks for the suggestion. Included an isolation test for the
exact repro and a couple other tests for fully before and fully after.
These two tests don't add a whole lot of value since this
is already heavily tested elsewhere, so I'm good with removing.
However, thought it made sense to include from a perspective of
totality given the test name.

> Another simple way to fix this is that we always set
> need_full_snapshot to true when slot creation. But it would then make
> the snapshot builder include non-catalog-change transactions too while
> finding the initial startpoint. Which is not necessary.

Since we're not accumulating changes anyway and we never distribute snapshots
while inconsistent it wouldn't cause any memory pressure in the ReorderBuffer.
This seems pretty contained and wouldn't require the use of a global variable
and memory context callbacks which is a bit nicer from a readability standpoint.
It's a little inefficient with memory and CPU, especially in the presence of
long running transactions, but likely would not be noticeable to most users.

> As for the proposed patch, it uses a global variable, InCreate, and
> memory context callback to reset it. While I believe this is for not
> breaking any ABI as this bug exists on back branches too, I think it's
> better to consider how to fix it in HEAD and then how to backpatch it.
> We don't necessarily need to have the same fixes for all branches all
> the time.

Attached a patch that is ABI breaking, but removes the need for a global variable.
This relies on pushing the context of whether we are in a create mode to the
logical decoding context, which is cleaner, but not as clean as just pushing
all the way down to the snapbuilder. I would push down to the snap builder,
but the entire object is persisted to disk. This is problematic to me since the
create flag would be persisted causing exisiting slots to restore a snapbuild
that could have the create flag set to true. This is not problematic as is,
since the variable would only be leveraged in SnapBuildRestore() which is only
called while inconsistent, but seems like a future bug waiting to happen.

We could always clear the flag when persisting, but that seems a bit hacky.
We could also fork the snap builder into on-disk and in-memory structs, which
would be convenient since we would not need to invalidate exisiting snapshots.
However, not sure how people feel about the fork.

> IIUC we need to skip snapshot restore only when we're finding the
> initial start point. So probably we don't need to set InCreate when
> calling create_logical_replication_slot() with find_startpoint =
> false.

Yep, that's correct. We can safely skip setting InCreate. We could move
the setter function slotfuncs.c as one way of doing this. I lean towards
setting it no matter what since it is harmless if we never try to find
a decoding point and removes the need for future callers of
CreateInitDecodingContext() from having to determine if it's safe to be false.
Though I agree that InCreate may a bigger hammer than needed which may need
to be modified in the future.

Thanks for the quick response,
Drew Callahan

Attachment

Re: Potential data loss due to race condition during logical replication slot creation

From

vignesh C

Date:

03 February 2024, 13:48:19

On Fri, 2 Feb 2024 at 06:24, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> Thank you for the report!
>
> On Fri, Feb 2, 2024 at 7:28 AM Callahan, Drew <callaan@amazon.com> wrote:
> >
> > Hello,
> >
> > We discovered a race condition during logical replication slot creation that can result in the changes for
transactionsrunning at the time of the slot creation to only be partially replicated. We found the cause was due to the
slottransitioning from an inconsistent or partially consistent state to a fully consistent state when restoring a
snapshotthat had been persisted to disk by a different logical slot. We provide a simple reproduction of this issue
below:
> >
> > Session 1:
> >
> > SELECT pg_create_logical_replication_slot('slot1', 'test_decoding');
> >
> > CREATE TABLE test (a int);
> > BEGIN;
> > INSERT INTO test VALUES (1);
> >
> > Session 2:
> >
> > SELECT pg_create_logical_replication_slot('slot2', 'test_decoding');
> >
> > <query hangs>
> >
> > Session 3:
> >
> > CHECKPOINT;
> >
> > select pg_logical_slot_get_changes('slot1', NULL, NULL);
> >
> > <should return nothing of interest>
> >
> > Session 1:
> >
> > INSERT INTO test VALUES (2);
> > COMMIT;
> >
> > <Session 2 query no longer hangs and successfully creates the slot2>
> >
> > Session 2:
> >
> > select pg_logical_slot_get_changes('slot1', NULL, NULL);
> >
> > select pg_logical_slot_get_changes('slot2', NULL, NULL);
> >
> > <expected: no rows of the txn are returned for slot2>
> > <actual: The 2nd row of the txn is returned for slot2>
> >
>
> I've confirmed this happens HEAD and all supported backbranches. And I
> think it's a (long-standing) bug.
>
> I think we can prepare an isolation test of the above steps and
> include it in contrib/test_decoding/specs.
>
> > Newly created logical replication slots initialize their restart LSN to the current insert position within the WAL
andalso force a checkpoint to get the current state of the running transactions on the system. This create process will
thenwait for all of the transactions within that running xact record to complete before being able to transition to the
nextsnapbuild state. During this time period, if another running xact record is written and then a different logical
replicationprocess decodes this running xact record, a globally accessible snapshot will be persisted to disk. 
> >
> > Once all of the transactions from the initial running xact have finished, the process performing the slot creation
willbecome unblocked and will then consume the new running xact record. The process will see a valid snapshot, restore
thatsnapshot from disk, and then transition immediately to the consistent state. The slot will then set the confirmed
flushLSN of the slot to the start of the next record after that running xact. 
>
> My analysis is the same. After the slot creation, the slot's LSNs become like:
>
> (tx-begin) < (tx's some changes) < slot's resetart_lsn < (fx's further
> some changes) < slot's confirmed_flush_lsn < (tx-commit)
>
> > We now have a logical replication slot that has a restart LSN after the changes of transactions that will commit
afterour confirmed flushed LSN in any case where the two running xact records are not fully disjointed. Once those
transactionscommit, we will then partially stream their changes. 
> >
> > The attached fix addresses the issue by providing the snapshot builder with enough context to understand whether it
isbuilding a snapshot for a slot for the first time or if this is a previously existing slot. This is similar to the
“need_full_snapshot”parameter which is already set by the caller to control when and how the snapbuilder is allowed to
becomeconsistent. 
> >
> > With this context, the snapshot builder can always skip performing snapshot restore in order to become fully
consistent.Since this issue only occurs when the the logical replication slot consumes a persisted snapshot to become
fullyconsistent we can prevent the issue by disallowing this behavior. 
>
> I basically agree with the basic idea of skipping snapshot restore
> while finding the initial start point. But I'd like to hear other
> opinions too.
>
> Another simple way to fix this is that we always set
> need_full_snapshot to true when slot creation. But it would then make
> the snapshot builder include non-catalog-change transactions too while
> finding the initial startpoint. Which is not necessary.
>
> As for the proposed patch, it uses a global variable, InCreate, and
> memory context callback to reset it. While I believe this is for not
> breaking any ABI as this bug exists on back branches too, I think it's
> better to consider how to fix it in HEAD and then how to backpatch it.
> We don't necessarily need to have the same fixes for all branches all
> the time.

I agree to having a separate patch for HEAD in this case. How about a
fix like a patch attached.

Regards,
Vignesh

Attachment

v2-0001-While-a-new-logical-replication-slot-is-created-a.patch

Re: Potential data loss due to race condition during logical replication slot creation

From

Masahiko Sawada

Date:

05 February 2024, 02:06:34

On Sat, Feb 3, 2024 at 12:12 PM Callahan, Drew <callaan@amazon.com> wrote:
>
> > I think we can prepare an isolation test of the above steps and
> > include it in contrib/test_decoding/specs.
>
> Thanks for the suggestion. Included an isolation test for the
> exact repro and a couple other tests for fully before and fully after.
> These two tests don't add a whole lot of value since this
> is already heavily tested elsewhere, so I'm good with removing.
> However, thought it made sense to include from a perspective of
> totality given the test name.

Thank you for the patch for the isolation test! Since Vignesh also
shared the isolation patch I've merged them and modified some
comments. I've attached the patch. I'm going to merge the isolation
test patch to the patch to fix the issue after we get a consensus on
how to fix it.

>
> > Another simple way to fix this is that we always set
> > need_full_snapshot to true when slot creation. But it would then make
> > the snapshot builder include non-catalog-change transactions too while
> > finding the initial startpoint. Which is not necessary.
>
> Since we're not accumulating changes anyway and we never distribute snapshots
> while inconsistent it wouldn't cause any memory pressure in the ReorderBuffer.
> This seems pretty contained and wouldn't require the use of a global variable
> and memory context callbacks which is a bit nicer from a readability standpoint.
> It's a little inefficient with memory and CPU, especially in the presence of
> long running transactions, but likely would not be noticeable to most users.
>
> > As for the proposed patch, it uses a global variable, InCreate, and
> > memory context callback to reset it. While I believe this is for not
> > breaking any ABI as this bug exists on back branches too, I think it's
> > better to consider how to fix it in HEAD and then how to backpatch it.
> > We don't necessarily need to have the same fixes for all branches all
> > the time.
>
> Attached a patch that is ABI breaking, but removes the need for a global variable.
> This relies on pushing the context of whether we are in a create mode to the
> logical decoding context, which is cleaner, but not as clean as just pushing
> all the way down to the snapbuilder. I would push down to the snap builder,
> but the entire object is persisted to disk. This is problematic to me since the
> create flag would be persisted causing exisiting slots to restore a snapbuild
> that could have the create flag set to true. This is not problematic as is,
> since the variable would only be leveraged in SnapBuildRestore() which is only
> called while inconsistent, but seems like a future bug waiting to happen.

But IIUC in SnapBuildRestore(), we selectively restore variables from
serialized SnapBuild. For instance, building_full_snapshot and
initial_xmin_horizon are serialized to the disk but ignored when
restoring a snapshot. I think that even if we add the in_create flag
to SnapBuild, it would be ignored on snapshot restore.

I think that it's cleaner if we could pass in_creat flag to
AllocateSnapshotBuilder() than passing the flag down to
SnapBuildFindSnapshot() from standby_decode().

>
> We could always clear the flag when persisting, but that seems a bit hacky.
> We could also fork the snap builder into on-disk and in-memory structs, which
> would be convenient since we would not need to invalidate exisiting snapshots.
> However, not sure how people feel about the fork.
>
> > IIUC we need to skip snapshot restore only when we're finding the
> > initial start point. So probably we don't need to set InCreate when
> > calling create_logical_replication_slot() with find_startpoint =
> > false.
>
> Yep, that's correct. We can safely skip setting InCreate. We could move
> the setter function slotfuncs.c as one way of doing this. I lean towards
> setting it no matter what since it is harmless if we never try to find
> a decoding point and removes the need for future callers of
> CreateInitDecodingContext() from having to determine if it's safe to be false.
> Though I agree that InCreate may a bigger hammer than needed which may need
> to be modified in the future.

I agree to set the flag no matter what. The flag is needed for finding
the initial startpoint but it would be harmless to set it even if the
caller doesn't call DecodingContextFindStartpoint(). If the caller
skips to call DecodingContextFindStartpoint() it's the caller's
responsibility to set a sane LSNs to the slots before staring the
logical decoding. So it makes sense to me to remove the need for
callers to determine to set the flag, as you mentioned.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

isolation_test_v2.patch

Re: Potential data loss due to race condition during logical replication slot creation

From

Masahiko Sawada

Date:

05 February 2024, 02:12:31

On Sat, Feb 3, 2024 at 10:48 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Fri, 2 Feb 2024 at 06:24, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Hi,
> >
> > Thank you for the report!
> >
> > On Fri, Feb 2, 2024 at 7:28 AM Callahan, Drew <callaan@amazon.com> wrote:
> > >
> > > Hello,
> > >
> > > We discovered a race condition during logical replication slot creation that can result in the changes for
transactionsrunning at the time of the slot creation to only be partially replicated. We found the cause was due to the
slottransitioning from an inconsistent or partially consistent state to a fully consistent state when restoring a
snapshotthat had been persisted to disk by a different logical slot. We provide a simple reproduction of this issue
below:
> > >
> > > Session 1:
> > >
> > > SELECT pg_create_logical_replication_slot('slot1', 'test_decoding');
> > >
> > > CREATE TABLE test (a int);
> > > BEGIN;
> > > INSERT INTO test VALUES (1);
> > >
> > > Session 2:
> > >
> > > SELECT pg_create_logical_replication_slot('slot2', 'test_decoding');
> > >
> > > <query hangs>
> > >
> > > Session 3:
> > >
> > > CHECKPOINT;
> > >
> > > select pg_logical_slot_get_changes('slot1', NULL, NULL);
> > >
> > > <should return nothing of interest>
> > >
> > > Session 1:
> > >
> > > INSERT INTO test VALUES (2);
> > > COMMIT;
> > >
> > > <Session 2 query no longer hangs and successfully creates the slot2>
> > >
> > > Session 2:
> > >
> > > select pg_logical_slot_get_changes('slot1', NULL, NULL);
> > >
> > > select pg_logical_slot_get_changes('slot2', NULL, NULL);
> > >
> > > <expected: no rows of the txn are returned for slot2>
> > > <actual: The 2nd row of the txn is returned for slot2>
> > >
> >
> > I've confirmed this happens HEAD and all supported backbranches. And I
> > think it's a (long-standing) bug.
> >
> > I think we can prepare an isolation test of the above steps and
> > include it in contrib/test_decoding/specs.
> >
> > > Newly created logical replication slots initialize their restart LSN to the current insert position within the
WALand also force a checkpoint to get the current state of the running transactions on the system. This create process
willthen wait for all of the transactions within that running xact record to complete before being able to transition
tothe next snapbuild state. During this time period, if another running xact record is written and then a different
logicalreplication process decodes this running xact record, a globally accessible snapshot will be persisted to disk. 
> > >
> > > Once all of the transactions from the initial running xact have finished, the process performing the slot
creationwill become unblocked and will then consume the new running xact record. The process will see a valid snapshot,
restorethat snapshot from disk, and then transition immediately to the consistent state. The slot will then set the
confirmedflush LSN of the slot to the start of the next record after that running xact. 
> >
> > My analysis is the same. After the slot creation, the slot's LSNs become like:
> >
> > (tx-begin) < (tx's some changes) < slot's resetart_lsn < (fx's further
> > some changes) < slot's confirmed_flush_lsn < (tx-commit)
> >
> > > We now have a logical replication slot that has a restart LSN after the changes of transactions that will commit
afterour confirmed flushed LSN in any case where the two running xact records are not fully disjointed. Once those
transactionscommit, we will then partially stream their changes. 
> > >
> > > The attached fix addresses the issue by providing the snapshot builder with enough context to understand whether
itis building a snapshot for a slot for the first time or if this is a previously existing slot. This is similar to the
“need_full_snapshot”parameter which is already set by the caller to control when and how the snapbuilder is allowed to
becomeconsistent. 
> > >
> > > With this context, the snapshot builder can always skip performing snapshot restore in order to become fully
consistent.Since this issue only occurs when the the logical replication slot consumes a persisted snapshot to become
fullyconsistent we can prevent the issue by disallowing this behavior. 
> >
> > I basically agree with the basic idea of skipping snapshot restore
> > while finding the initial start point. But I'd like to hear other
> > opinions too.
> >
> > Another simple way to fix this is that we always set
> > need_full_snapshot to true when slot creation. But it would then make
> > the snapshot builder include non-catalog-change transactions too while
> > finding the initial startpoint. Which is not necessary.
> >
> > As for the proposed patch, it uses a global variable, InCreate, and
> > memory context callback to reset it. While I believe this is for not
> > breaking any ABI as this bug exists on back branches too, I think it's
> > better to consider how to fix it in HEAD and then how to backpatch it.
> > We don't necessarily need to have the same fixes for all branches all
> > the time.
>
> I agree to having a separate patch for HEAD in this case. How about a
> fix like a patch attached.

Thank you for the patch!

I've just sent some comments on how to fix this issue and updated the
isolation test patch[1]. While I basically agree to add the flag in
SnapBuild, I think we can set the flag in AllocateSnapshotBuilder().
Also, we need to bump SNAPBUILD_VERSION.

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoCsSkwcAYgcUMi2zKp8F%3DmoJL4K1beGD6AV_%3DpHhfAM9Q%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Potential data loss due to race condition during logical replication slot creation

From

Masahiko Sawada

Date:

08 February 2024, 05:54:52

On Sat, Feb 3, 2024 at 12:12 PM Callahan, Drew <callaan@amazon.com> wrote:
>
> > I think we can prepare an isolation test of the above steps and
> > include it in contrib/test_decoding/specs.
>
> Thanks for the suggestion. Included an isolation test for the
> exact repro and a couple other tests for fully before and fully after.
> These two tests don't add a whole lot of value since this
> is already heavily tested elsewhere, so I'm good with removing.
> However, thought it made sense to include from a perspective of
> totality given the test name.
>
> > Another simple way to fix this is that we always set
> > need_full_snapshot to true when slot creation. But it would then make
> > the snapshot builder include non-catalog-change transactions too while
> > finding the initial startpoint. Which is not necessary.
>
> Since we're not accumulating changes anyway and we never distribute snapshots
> while inconsistent it wouldn't cause any memory pressure in the ReorderBuffer.
> This seems pretty contained and wouldn't require the use of a global variable
> and memory context callbacks which is a bit nicer from a readability standpoint.

Sorry I missed this comment.

I think it's a good point that this fix doesn't lead to any on-disk
compatibility while not affecting existing users much, especially for
back-patching. If we want to choose this approach for bank branches,
we need to carefully consider the side impacts.

> It's a little inefficient with memory and CPU, especially in the presence of
> long running transactions, but likely would not be noticeable to most users.

This matches my analysis.

Here is the summary of several proposals we've discussed:

a) Have CreateInitDecodingContext() always pass need_full_snapshot =
true to AllocateSnapshotBuilder().

This is the simplest approach and doesn't break the on-disk
SnapBuildOnDisck compatibility. A downside is that it's a little
inefficient with memory and CPU since it includes non-catalog change
transactions too. Also, passing need_full_snapshot to
CreateInitDecodingContext() will no longer have any meaning.

b) Have snapbuild.c being able to handle multiple SnapBuildOnDisk versions.

This bumps SNAPBUILD_VERSION and therefore breaks the on-disk
SnapBuildOnDisk compatibility, but snapbuild.c can handle multiple
versions. But since different branches are using different versions of
SnapBuildOnDisk, it would probably be hard to have the SnapBuildOnDisk
version that is consistent across all versions.

c) Add a global variable, say in_create, to snapbuild.c

it would also require adding a setter function for in_create. There
are several ideas where to set/reset the flag. One idea is that we
reset the flag in AllocateSnapshotBuilder() and set the flag anywhere
before starting to find the start point, for example at the beginning
of DecodingContextFindStartpoint(). It probably won't require a memory
context callback to make sure to clear the flag. This idea doesn't
have a downside from users and extensions perspects. But I'm slightly
hesitant to add a global variable.

What do you think? and any other ideas?

As a separate topic, I think that this backpatching complexity comes
from the fact that we're serializing a whole SnapBuild to the disk
although we don't need to serialize all of its fields. In order to
make future back-patching easy, it might be worth considering
decoupling the fields that need to be serialized from SnapBuild
struct.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Potential data loss due to race condition during logical replication slot creation

From

Masahiko Sawada

Date:

19 February 2024, 09:12:23

On Thu, Feb 8, 2024 at 2:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sat, Feb 3, 2024 at 12:12 PM Callahan, Drew <callaan@amazon.com> wrote:
> >
> > > I think we can prepare an isolation test of the above steps and
> > > include it in contrib/test_decoding/specs.
> >
> > Thanks for the suggestion. Included an isolation test for the
> > exact repro and a couple other tests for fully before and fully after.
> > These two tests don't add a whole lot of value since this
> > is already heavily tested elsewhere, so I'm good with removing.
> > However, thought it made sense to include from a perspective of
> > totality given the test name.
> >
> > > Another simple way to fix this is that we always set
> > > need_full_snapshot to true when slot creation. But it would then make
> > > the snapshot builder include non-catalog-change transactions too while
> > > finding the initial startpoint. Which is not necessary.
> >
> > Since we're not accumulating changes anyway and we never distribute snapshots
> > while inconsistent it wouldn't cause any memory pressure in the ReorderBuffer.
> > This seems pretty contained and wouldn't require the use of a global variable
> > and memory context callbacks which is a bit nicer from a readability standpoint.
>
> Sorry I missed this comment.
>
> I think it's a good point that this fix doesn't lead to any on-disk
> compatibility while not affecting existing users much, especially for
> back-patching. If we want to choose this approach for bank branches,
> we need to carefully consider the side impacts.
>
> > It's a little inefficient with memory and CPU, especially in the presence of
> > long running transactions, but likely would not be noticeable to most users.
>
> This matches my analysis.
>
> Here is the summary of several proposals we've discussed:
>
> a) Have CreateInitDecodingContext() always pass need_full_snapshot =
> true to AllocateSnapshotBuilder().
>
> This is the simplest approach and doesn't break the on-disk
> SnapBuildOnDisck compatibility. A downside is that it's a little
> inefficient with memory and CPU since it includes non-catalog change
> transactions too. Also, passing need_full_snapshot to
> CreateInitDecodingContext() will no longer have any meaning.
>
> b) Have snapbuild.c being able to handle multiple SnapBuildOnDisk versions.
>
> This bumps SNAPBUILD_VERSION and therefore breaks the on-disk
> SnapBuildOnDisk compatibility, but snapbuild.c can handle multiple
> versions. But since different branches are using different versions of
> SnapBuildOnDisk, it would probably be hard to have the SnapBuildOnDisk
> version that is consistent across all versions.
>
> c) Add a global variable, say in_create, to snapbuild.c
>
> it would also require adding a setter function for in_create. There
> are several ideas where to set/reset the flag. One idea is that we
> reset the flag in AllocateSnapshotBuilder() and set the flag anywhere
> before starting to find the start point, for example at the beginning
> of DecodingContextFindStartpoint(). It probably won't require a memory
> context callback to make sure to clear the flag. This idea doesn't
> have a downside from users and extensions perspects. But I'm slightly
> hesitant to add a global variable.
>
> What do you think? and any other ideas?

I've drafted the idea (c) for discussion (for master and v16 for now).
I also liked the idea (a) but I'm concerned a bit about future impact.

>
> As a separate topic, I think that this backpatching complexity comes
> from the fact that we're serializing a whole SnapBuild to the disk
> although we don't need to serialize all of its fields. In order to
> make future back-patching easy, it might be worth considering
> decoupling the fields that need to be serialized from SnapBuild
> struct.

If we implement this idea, we won't need to bump SNAPBUILD_VERSION for HEAD.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

RE: Potential data loss due to race condition during logical replication slot creation

From

"Hayato Kuroda (Fujitsu)"

Date:

13 March 2024, 09:34:22

Dear hackers,

While analyzing another failure [1], I found here. I think they occurred by the
same reason.

The reported failure occurred when the replication slot is created in the middle
of the transaction and it reuses the snapshot from other slot. The reproducer is:

```
Session0

SELECT pg_create_logical_replication_slot('slot0', 'test_decoding');
BEGIN;
INSERT INTO foo ...

Session1

SELECT pg_create_logical_replication_slot('slot1', 'test_decoding');

Session2

CHECKPOINT;
SELECT pg_logical_slot_get_changes('slot0', NULL, NULL);

Session0

INSERT INTO var ... // var is defined with (user_catalog_table = true)
COMMIT;

Session1
SELECT pg_logical_slot_get_changes('slot1', NULL, NULL);
-> Assertion failure.
```

> Here is the summary of several proposals we've discussed:
> a) Have CreateInitDecodingContext() always pass need_full_snapshot =
> true to AllocateSnapshotBuilder().

> b) Have snapbuild.c being able to handle multiple SnapBuildOnDisk versions.

> c) Add a global variable, say in_create, to snapbuild.c

Regarding three options raised by Sawada-san, I preferred the approach a).
Since the issue could happen for all supported branches, we should choose the
conservative approach. Also, it is quite painful if there are some codes for
handling the same issue.

Attached patch implemented the approach a) since no one made. I also added
the test which can do assertion failure, but not sure it should be included.

[1]:
https://www.postgresql.org/message-id/TYCPR01MB1207717063D701F597EF98A0CF5272%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

Attachment

master_0001-fix-snapbuild-bug-by-approach-a.patch

RE: Potential data loss due to race condition during logical replication slot creation

From

"Hayato Kuroda (Fujitsu)"

Date:

13 March 2024, 09:47:01

> Attached patch implemented the approach a) since no one made. I also added
> the test which can do assertion failure, but not sure it should be included.

Attached one had unnecessary changes. PSA the corrected version.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

Attachment

master_0001-fix-snapbuild-bug-by-approach-a.patch

Re: Potential data loss due to race condition during logical replication slot creation

From

Amit Kapila

Date:

18 March 2024, 09:08:06

On Wed, Mar 13, 2024 at 3:17 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> > Attached patch implemented the approach a) since no one made. I also added
> > the test which can do assertion failure, but not sure it should be included.
>

I feel setting "needs_full_snapshot" to true for decoding means the
snapshot will start tracking non-catalog committed xacts as well which
is costly. See SnapBuildCommitTxn(). Can we avoid this problem if we
would have list of all running xacts when we serialize the snapshot by
not decoding any xact whose xid lies in that list? If so, one idea to
achieve could be that we maintain the highest_running_xid while
serailizing the snapshot and then during restore if that
highest_running_xid is <= builder->initial_xmin_horizon, then we
ignore restoring the snapshot. We already have few such cases handled
in SnapBuildRestore().

--
With Regards,
Amit Kapila.

Re: Potential data loss due to race condition during logical replication slot creation

From

Amit Kapila

Date:

18 March 2024, 09:18:56

On Mon, Feb 19, 2024 at 2:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Feb 8, 2024 at 2:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Sat, Feb 3, 2024 at 12:12 PM Callahan, Drew <callaan@amazon.com> wrote:
> > >
> > > > I think we can prepare an isolation test of the above steps and
> > > > include it in contrib/test_decoding/specs.
> > >
> > > Thanks for the suggestion. Included an isolation test for the
> > > exact repro and a couple other tests for fully before and fully after.
> > > These two tests don't add a whole lot of value since this
> > > is already heavily tested elsewhere, so I'm good with removing.
> > > However, thought it made sense to include from a perspective of
> > > totality given the test name.
> > >
> > > > Another simple way to fix this is that we always set
> > > > need_full_snapshot to true when slot creation. But it would then make
> > > > the snapshot builder include non-catalog-change transactions too while
> > > > finding the initial startpoint. Which is not necessary.
> > >
> > > Since we're not accumulating changes anyway and we never distribute snapshots
> > > while inconsistent it wouldn't cause any memory pressure in the ReorderBuffer.
> > > This seems pretty contained and wouldn't require the use of a global variable
> > > and memory context callbacks which is a bit nicer from a readability standpoint.
> >
> > Sorry I missed this comment.
> >
> > I think it's a good point that this fix doesn't lead to any on-disk
> > compatibility while not affecting existing users much, especially for
> > back-patching. If we want to choose this approach for bank branches,
> > we need to carefully consider the side impacts.
> >
> > > It's a little inefficient with memory and CPU, especially in the presence of
> > > long running transactions, but likely would not be noticeable to most users.
> >
> > This matches my analysis.
> >
> > Here is the summary of several proposals we've discussed:
> >
> > a) Have CreateInitDecodingContext() always pass need_full_snapshot =
> > true to AllocateSnapshotBuilder().
> >
> > This is the simplest approach and doesn't break the on-disk
> > SnapBuildOnDisck compatibility. A downside is that it's a little
> > inefficient with memory and CPU since it includes non-catalog change
> > transactions too. Also, passing need_full_snapshot to
> > CreateInitDecodingContext() will no longer have any meaning.
> >
> > b) Have snapbuild.c being able to handle multiple SnapBuildOnDisk versions.
> >
> > This bumps SNAPBUILD_VERSION and therefore breaks the on-disk
> > SnapBuildOnDisk compatibility, but snapbuild.c can handle multiple
> > versions. But since different branches are using different versions of
> > SnapBuildOnDisk, it would probably be hard to have the SnapBuildOnDisk
> > version that is consistent across all versions.
> >
> > c) Add a global variable, say in_create, to snapbuild.c
> >
> > it would also require adding a setter function for in_create. There
> > are several ideas where to set/reset the flag. One idea is that we
> > reset the flag in AllocateSnapshotBuilder() and set the flag anywhere
> > before starting to find the start point, for example at the beginning
> > of DecodingContextFindStartpoint(). It probably won't require a memory
> > context callback to make sure to clear the flag. This idea doesn't
> > have a downside from users and extensions perspects. But I'm slightly
> > hesitant to add a global variable.
> >
> > What do you think? and any other ideas?
>
> I've drafted the idea (c) for discussion (for master and v16 for now).
> I also liked the idea (a) but I'm concerned a bit about future impact.
>

I agree that (c) would be better than (a) as it avoids collecting
non-catalog xacts in snapshot. However, I think we shouldn't avoid
restoring the snapshot unless really required. See my previous
response.

--
With Regards,
Amit Kapila.

RE: Potential data loss due to race condition during logical replication slot creation

From

"Hayato Kuroda (Fujitsu)"

Date:

19 March 2024, 02:16:35

Dear Amit,

> I feel setting "needs_full_snapshot" to true for decoding means the
> snapshot will start tracking non-catalog committed xacts as well which
> is costly.

I think the approach was most conservative one which does not have to change
the version of the snapshot. However, I understood that you wanted to consider
the optimized solution for HEAD first.

> See SnapBuildCommitTxn(). Can we avoid this problem if we
> would have list of all running xacts when we serialize the snapshot by
> not decoding any xact whose xid lies in that list? If so, one idea to
> achieve could be that we maintain the highest_running_xid while
> serailizing the snapshot and then during restore if that
> highest_running_xid is <= builder->initial_xmin_horizon, then we
> ignore restoring the snapshot. We already have few such cases handled
> in SnapBuildRestore().

Based on the idea, I made a prototype. It can pass tests added by others and me.
How do other think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

Attachment

0001-Serialize-running-xacts.patch

Re: Potential data loss due to race condition during logical replication slot creation

From

Amit Kapila

Date:

19 March 2024, 02:33:06

On Tue, Mar 19, 2024 at 7:46 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> I think the approach was most conservative one which does not have to change
> the version of the snapshot. However, I understood that you wanted to consider
> the optimized solution for HEAD first.
>

Right, let's see if we can have a solution other than always avoiding
restoring snapshots during slot creation even if that is for just
HEAD.

> > See SnapBuildCommitTxn(). Can we avoid this problem if we
> > would have list of all running xacts when we serialize the snapshot by
> > not decoding any xact whose xid lies in that list? If so, one idea to
> > achieve could be that we maintain the highest_running_xid while
> > serailizing the snapshot and then during restore if that
> > highest_running_xid is <= builder->initial_xmin_horizon, then we
> > ignore restoring the snapshot. We already have few such cases handled
> > in SnapBuildRestore().
>
> Based on the idea, I made a prototype. It can pass tests added by others and me.
> How do other think?
>

Won't it be possible to achieve the same thing if we just save
(serialize) the highest xid among all running xacts?

--
With Regards,
Amit Kapila.

RE: Potential data loss due to race condition during logical replication slot creation

From

"Hayato Kuroda (Fujitsu)"

Date:

19 March 2024, 04:39:30

Dear Amit,

> 
> Won't it be possible to achieve the same thing if we just save
> (serialize) the highest xid among all running xacts?
>

Indeed, here is an updated version.
Since the array in xl_running_xact is not ordered, entries of it must be seeked
and found the highest one.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

Attachment

v2-0001-Serialize-running-xacts.patch

Re: Potential data loss due to race condition during logical replication slot creation

From

Masahiko Sawada

Date:

19 March 2024, 16:42:15

On Mon, Mar 18, 2024 at 6:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> If so, one idea to
> achieve could be that we maintain the highest_running_xid while
> serailizing the snapshot and then during restore if that
> highest_running_xid is <= builder->initial_xmin_horizon, then we
> ignore restoring the snapshot. We already have few such cases handled
> in SnapBuildRestore().

I think that builder->initial_xmin_horizon could be older than
highest_running_xid, for example, when there is a logical replication
slot whose catalog_xmin is old. However, even in this case, we might
need to ignore restoring the snapshot. For example, a slightly
modified test case still can cause the same problem.

The test case in the Kuroda-san's v2 patch:
permutation "s0_init" "s0_begin" "s0_insert1" "s1_init"
"s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit"
"s1_get_changes_slot0"\   "s1_get_changes_slot1"

Modified-version test case (add "s0_insert1" between "s0_init" and "s0_begin"):
permutation "s0_init"  "s0_insert1" "s0_begin" "s0_insert1" "s1_init"
"s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit"
"s1_get_changes_slot0\  " "s1_get_changes_slot1"

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

RE: Potential data loss due to race condition during logical replication slot creation

From

"Hayato Kuroda (Fujitsu)"

Date:

27 March 2024, 07:54:12

Dear Sawada-san,

Thanks for giving comments and sorry for late reply.

> > If so, one idea to
> > achieve could be that we maintain the highest_running_xid while
> > serailizing the snapshot and then during restore if that
> > highest_running_xid is <= builder->initial_xmin_horizon, then we
> > ignore restoring the snapshot. We already have few such cases handled
> > in SnapBuildRestore().
> 
> I think that builder->initial_xmin_horizon could be older than
> highest_running_xid, for example, when there is a logical replication
> slot whose catalog_xmin is old. However, even in this case, we might
> need to ignore restoring the snapshot. For example, a slightly
> modified test case still can cause the same problem.
> 
> The test case in the Kuroda-san's v2 patch:
> permutation "s0_init" "s0_begin" "s0_insert1" "s1_init"
> "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit"
> "s1_get_changes_slot0"\   "s1_get_changes_slot1"
> 
> Modified-version test case (add "s0_insert1" between "s0_init" and "s0_begin"):
> permutation "s0_init"  "s0_insert1" "s0_begin" "s0_insert1" "s1_init"
> "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit"
> "s1_get_changes_slot0\  " "s1_get_changes_slot1"

Good catch, I confirmed that the variant still partially output transactions:

```
step s1_get_changes_slot1: SELECT data FROM pg_logical_slot_get_changes('slot1', NULL, NULL, 'skip-empty-xacts', '1',
'include-xids','0');
 
data                                     
-----------------------------------------
BEGIN                                    
table public.tbl: INSERT: val1[integer]:2
COMMIT                                   
(3 rows)
```

/////

I analyzed a bit. In below descriptions, I uses these notations:

txn0 - has a single insert
txn1 - has begin-insert-commit operations.
       slot1 is created in parallel

While creating slot0, initial_xmin_horizon was also set, and it come from
GetOldestSafeDecodingTransactionId().
Since there were no slots in the database before, the value was just a next xid.

Then, the pinned xid was assigned to the txn0.
It is correct behavior because slot0 must decode txn0.

Finally, while creating slot1 after the txn1, the snapshot serialized by slot0 was
restored once. txn1 was the recorded as the highest_running, but initial_xmin was
older than it. So the snapshot was restored and txn1 was partially decoded.

initial_xmin_horizon (txn0) < highest_running_xid (txn1)

/////

So, how should we fix? One idea is based on (c), and adds a flag which indicates
an existence of concurrent transactions. The restoration is skipped if both two
flags are true. PSA the PoC patch.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

Attachment

0001-Serialize-a-flag-and-use-finding_start_point.patch

Re: Potential data loss due to race condition during logical replication slot creation

From

Masahiko Sawada

Date:

27 March 2024, 08:26:05

On Wed, Mar 27, 2024 at 4:54 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> Thanks for giving comments and sorry for late reply.
>
> > > If so, one idea to
> > > achieve could be that we maintain the highest_running_xid while
> > > serailizing the snapshot and then during restore if that
> > > highest_running_xid is <= builder->initial_xmin_horizon, then we
> > > ignore restoring the snapshot. We already have few such cases handled
> > > in SnapBuildRestore().
> >
> > I think that builder->initial_xmin_horizon could be older than
> > highest_running_xid, for example, when there is a logical replication
> > slot whose catalog_xmin is old. However, even in this case, we might
> > need to ignore restoring the snapshot. For example, a slightly
> > modified test case still can cause the same problem.
> >
> > The test case in the Kuroda-san's v2 patch:
> > permutation "s0_init" "s0_begin" "s0_insert1" "s1_init"
> > "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit"
> > "s1_get_changes_slot0"\   "s1_get_changes_slot1"
> >
> > Modified-version test case (add "s0_insert1" between "s0_init" and "s0_begin"):
> > permutation "s0_init"  "s0_insert1" "s0_begin" "s0_insert1" "s1_init"
> > "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit"
> > "s1_get_changes_slot0\  " "s1_get_changes_slot1"
>
> Good catch, I confirmed that the variant still partially output transactions:
>
> ```
> step s1_get_changes_slot1: SELECT data FROM pg_logical_slot_get_changes('slot1', NULL, NULL, 'skip-empty-xacts', '1',
'include-xids','0'); 
> data
> -----------------------------------------
> BEGIN
> table public.tbl: INSERT: val1[integer]:2
> COMMIT
> (3 rows)
> ```
>
> /////
>
> I analyzed a bit. In below descriptions, I uses these notations:
>
> txn0 - has a single insert
> txn1 - has begin-insert-commit operations.
>        slot1 is created in parallel
>
> While creating slot0, initial_xmin_horizon was also set, and it come from
> GetOldestSafeDecodingTransactionId().
> Since there were no slots in the database before, the value was just a next xid.
>
> Then, the pinned xid was assigned to the txn0.
> It is correct behavior because slot0 must decode txn0.
>
> Finally, while creating slot1 after the txn1, the snapshot serialized by slot0 was
> restored once. txn1 was the recorded as the highest_running, but initial_xmin was
> older than it. So the snapshot was restored and txn1 was partially decoded.
>
> initial_xmin_horizon (txn0) < highest_running_xid (txn1)
>
> /////
>
> So, how should we fix? One idea is based on (c), and adds a flag which indicates
> an existence of concurrent transactions. The restoration is skipped if both two
> flags are true. PSA the PoC patch.
>

With the PoC patch, we check ondisk.builder.is_there_running_xact in
SnapBuildRestore(), but can we just check running->xcnt in
SnapBuildFindSnapshot() to skip calling SnapBuildRestore()? That is,
if builder->initial_xmin_horizon is valid (or
builder->finding_start_point is true) and running->xcnt > 0, we skip
the snapshot restore. However, I think there are still cases where we
unnecessarily skip snapshot restores.

Probably, what we would like to avoid is, we compute
initial_xmin_horizon and start to find the initial start point while
there is a concurrently running transaction, and then jump to the
consistent state  by restoring the consistent snapshot before the
concurrent transaction commits. So we can ignore snapshot restores if
(oldest XID among transactions running at the time of
CreateInitDecodingContext()) >= (OldestRunningXID in
xl_running_xacts).

I've drafted this idea in the attached patch just for discussion.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

0001-draft-fix.patch

RE: Potential data loss due to race condition during logical replication slot creation

From

"Hayato Kuroda (Fujitsu)"

Date:

27 March 2024, 10:37:02

Dear Sawada-san,

> 
> With the PoC patch, we check ondisk.builder.is_there_running_xact in
> SnapBuildRestore(),

Yes, the PoC requires that the state of snapshot in the file must be read.

> but can we just check running->xcnt in
> SnapBuildFindSnapshot() to skip calling SnapBuildRestore()? That is,
> if builder->initial_xmin_horizon is valid (or
> builder->finding_start_point is true) and running->xcnt > 0, we skip
> the snapshot restore.

IIUC, it does not require modifications of API. It may be an advantage.

> However, I think there are still cases where we
> unnecessarily skip snapshot restores
>
> Probably, what we would like to avoid is, we compute
> initial_xmin_horizon and start to find the initial start point while
> there is a concurrently running transaction, and then jump to the
> consistent state  by restoring the consistent snapshot before the
> concurrent transaction commits.

Yeah, information before concurrent txns are committed should not be used. I think
that's why SnapBuildWaitSnapshot() waits until listed transactions are finished.

> So we can ignore snapshot restores if
> (oldest XID among transactions running at the time of
> CreateInitDecodingContext()) >= (OldestRunningXID in
> xl_running_xacts).
> 
> I've drafted this idea in the attached patch just for discussion.

Thanks for sharing the patch. At least I confirmed all tests and workload you
pointed out in [1] were passed. I will post here if I found other issues.

[1]: https://www.postgresql.org/message-id/CAD21AoDzLY9vRpo%2Bxb2qPtfn46ikiULPXDpT94sPyFH4GE8bYg%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

Re: Potential data loss due to race condition during logical replication slot creation

From

Masahiko Sawada

Date:

16 April 2024, 03:56:30

On Mon, Mar 18, 2024 at 6:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Feb 19, 2024 at 2:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Feb 8, 2024 at 2:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Sat, Feb 3, 2024 at 12:12 PM Callahan, Drew <callaan@amazon.com> wrote:
> > > >
> > > > > I think we can prepare an isolation test of the above steps and
> > > > > include it in contrib/test_decoding/specs.
> > > >
> > > > Thanks for the suggestion. Included an isolation test for the
> > > > exact repro and a couple other tests for fully before and fully after.
> > > > These two tests don't add a whole lot of value since this
> > > > is already heavily tested elsewhere, so I'm good with removing.
> > > > However, thought it made sense to include from a perspective of
> > > > totality given the test name.
> > > >
> > > > > Another simple way to fix this is that we always set
> > > > > need_full_snapshot to true when slot creation. But it would then make
> > > > > the snapshot builder include non-catalog-change transactions too while
> > > > > finding the initial startpoint. Which is not necessary.
> > > >
> > > > Since we're not accumulating changes anyway and we never distribute snapshots
> > > > while inconsistent it wouldn't cause any memory pressure in the ReorderBuffer.
> > > > This seems pretty contained and wouldn't require the use of a global variable
> > > > and memory context callbacks which is a bit nicer from a readability standpoint.
> > >
> > > Sorry I missed this comment.
> > >
> > > I think it's a good point that this fix doesn't lead to any on-disk
> > > compatibility while not affecting existing users much, especially for
> > > back-patching. If we want to choose this approach for bank branches,
> > > we need to carefully consider the side impacts.
> > >
> > > > It's a little inefficient with memory and CPU, especially in the presence of
> > > > long running transactions, but likely would not be noticeable to most users.
> > >
> > > This matches my analysis.
> > >
> > > Here is the summary of several proposals we've discussed:
> > >
> > > a) Have CreateInitDecodingContext() always pass need_full_snapshot =
> > > true to AllocateSnapshotBuilder().
> > >
> > > This is the simplest approach and doesn't break the on-disk
> > > SnapBuildOnDisck compatibility. A downside is that it's a little
> > > inefficient with memory and CPU since it includes non-catalog change
> > > transactions too. Also, passing need_full_snapshot to
> > > CreateInitDecodingContext() will no longer have any meaning.
> > >
> > > b) Have snapbuild.c being able to handle multiple SnapBuildOnDisk versions.
> > >
> > > This bumps SNAPBUILD_VERSION and therefore breaks the on-disk
> > > SnapBuildOnDisk compatibility, but snapbuild.c can handle multiple
> > > versions. But since different branches are using different versions of
> > > SnapBuildOnDisk, it would probably be hard to have the SnapBuildOnDisk
> > > version that is consistent across all versions.
> > >
> > > c) Add a global variable, say in_create, to snapbuild.c
> > >
> > > it would also require adding a setter function for in_create. There
> > > are several ideas where to set/reset the flag. One idea is that we
> > > reset the flag in AllocateSnapshotBuilder() and set the flag anywhere
> > > before starting to find the start point, for example at the beginning
> > > of DecodingContextFindStartpoint(). It probably won't require a memory
> > > context callback to make sure to clear the flag. This idea doesn't
> > > have a downside from users and extensions perspects. But I'm slightly
> > > hesitant to add a global variable.
> > >
> > > What do you think? and any other ideas?
> >
> > I've drafted the idea (c) for discussion (for master and v16 for now).
> > I also liked the idea (a) but I'm concerned a bit about future impact.
> >
>
> I agree that (c) would be better than (a) as it avoids collecting
> non-catalog xacts in snapshot. However, I think we shouldn't avoid
> restoring the snapshot unless really required.

Pondering further, I came across the question; in what case we would
need to restore the snapshot and jump to the consistent state when
finding the initial start point?

When creating a slot, in ReplicationSlotReserveWal(), we determine the
restart_lsn and write a RUNNING_XACT record. This record would be the
first RUNNING_XACT record that the logical decoding decodes in most
cases. In SnapBuildFindSnapshot(), if the record satisfies (a) (i.e.
running->oldestRunningXid == running->nextXid), it can jump to the
consistent state. If not, it means there were running transactions at
that time of the RUNNING_XACT record being produced. Therefore, we
must not restore the snapshot and jump to the consistent state.
Because otherwise, we end up deciding the start point in the middle of
a transaction that started before the RUNNING_XACT record. Probably
the same is true for all subsequent snapbuild states.

I might be missing something but I could not find the case where we
can or want to restore the serialized snapshot when finding the
initial start point. If my analysis is correct, we can go with either
(a) or (c) I proposed before[1]. Between these two options, it also
could be an option that (a) is for backbranches for safety and (c) is
for master.

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoBH4NWdRcEjXpBFHSKuVO6wia-vHHHaKuEi-h4i4wbi8A%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Potential data loss due to race condition during logical replication slot creation

From

Amit Kapila

Date:

20 June 2024, 06:19:25

On Tue, Apr 16, 2024 at 9:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Pondering further, I came across the question; in what case we would
> need to restore the snapshot and jump to the consistent state when
> finding the initial start point?
>
> When creating a slot, in ReplicationSlotReserveWal(), we determine the
> restart_lsn and write a RUNNING_XACT record. This record would be the
> first RUNNING_XACT record that the logical decoding decodes in most
> cases.
>

This probably won't be true on standby as we don't LOG RUNNING_XACT
record in that case.

> In SnapBuildFindSnapshot(), if the record satisfies (a) (i.e.
> running->oldestRunningXid == running->nextXid), it can jump to the
> consistent state. If not, it means there were running transactions at
> that time of the RUNNING_XACT record being produced. Therefore, we
> must not restore the snapshot and jump to the consistent state.
> Because otherwise, we end up deciding the start point in the middle of
> a transaction that started before the RUNNING_XACT record. Probably
> the same is true for all subsequent snapbuild states.
>

One thing to consider on this is that we can only restore the snapshot
if by that time some other get changes (in your case, step
"s2_get_changes_slot0" as in your draft patch) would have serialized
the consistent snapshot at that LSN. One could question if we can't
reach a consistent state at a particular LSN then why in the first
place the snapshot has been serialized at that LSN? The answer could
be that it is okay to use such a serialized snapshot after initial
snapshot creation because we know that the restart_lsn of a slot in
such cases would be a location where we won't see the data for partial
transactions. This is because, after the very first time (after the
initdecodingcontext), the restart_lsn would be set to a location after
we reach the consistent point.

> I might be missing something but I could not find the case where we
> can or want to restore the serialized snapshot when finding the
> initial start point. If my analysis is correct, we can go with either
> (a) or (c) I proposed before[1]. Between these two options, it also
> could be an option that (a) is for backbranches for safety and (c) is
> for master.
>

The approach (a) has a downside, it will lead to tracking more
transactions (non-catalog) than required without any benefit for the
user. Considering that is true, I wouldn't prefer that approach.

--
With Regards,
Amit Kapila.

Re: Potential data loss due to race condition during logical replication slot creation

From

Masahiko Sawada

Date:

21 June 2024, 06:46:18

On Thu, Jun 20, 2024 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Apr 16, 2024 at 9:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Pondering further, I came across the question; in what case we would
> > need to restore the snapshot and jump to the consistent state when
> > finding the initial start point?
> >
> > When creating a slot, in ReplicationSlotReserveWal(), we determine the
> > restart_lsn and write a RUNNING_XACT record. This record would be the
> > first RUNNING_XACT record that the logical decoding decodes in most
> > cases.
> >
>
> This probably won't be true on standby as we don't LOG RUNNING_XACT
> record in that case.

Yes, on the standby we wait for the RUNNING_XACT record before
starting actual decoding.

>
> > In SnapBuildFindSnapshot(), if the record satisfies (a) (i.e.
> > running->oldestRunningXid == running->nextXid), it can jump to the
> > consistent state. If not, it means there were running transactions at
> > that time of the RUNNING_XACT record being produced. Therefore, we
> > must not restore the snapshot and jump to the consistent state.
> > Because otherwise, we end up deciding the start point in the middle of
> > a transaction that started before the RUNNING_XACT record. Probably
> > the same is true for all subsequent snapbuild states.
> >
>
> One thing to consider on this is that we can only restore the snapshot
> if by that time some other get changes (in your case, step
> "s2_get_changes_slot0" as in your draft patch) would have serialized
> the consistent snapshot at that LSN. One could question if we can't
> reach a consistent state at a particular LSN then why in the first
> place the snapshot has been serialized at that LSN? The answer could
> be that it is okay to use such a serialized snapshot after initial
> snapshot creation because we know that the restart_lsn of a slot in
> such cases would be a location where we won't see the data for partial
> transactions.

True. After the CreateInitDecodingContext() the restart_lsn is just an
arbitrary LSN in a sense; either the latest replay LSN or the current
insertion LSN. We need to find an LSN for the start point that
corresponds to that restart_lsn. So I think we need to find the start
point while disabling snapshot restores.

>
> > I might be missing something but I could not find the case where we
> > can or want to restore the serialized snapshot when finding the
> > initial start point. If my analysis is correct, we can go with either
> > (a) or (c) I proposed before[1]. Between these two options, it also
> > could be an option that (a) is for backbranches for safety and (c) is
> > for master.
> >
>
> The approach (a) has a downside, it will lead to tracking more
> transactions (non-catalog) than required without any benefit for the
> user. Considering that is true, I wouldn't prefer that approach.

Yes, it will lead to tracking non-catalog-change transactions as well.
If there are many subtransactions, the overhead could be noticeable.
But it happens only once when creating a slot.

Another variant of (a) is that we skip snapshot restores if the
initial_xmin_hirizon is a valid transaction id. The
initia_xmin_horizon is always set to a valida transaction id when
initializing the decoding context, e.g. during
CreateInitDecodingContext(). That way, we don't need to track
non-catalog-change transctions. A downside is that this approach
assumes that DecodingContextFindStartpoint() is called with the
decoding context created by CreateInitDecodingContxt(), which is true
in the core codes, but might not be true in third party extensions.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Potential data loss due to race condition during logical replication slot creation

From

Amit Kapila

Date:

24 June 2024, 03:54:46

On Fri, Jun 21, 2024 at 12:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > The approach (a) has a downside, it will lead to tracking more
> > transactions (non-catalog) than required without any benefit for the
> > user. Considering that is true, I wouldn't prefer that approach.
>
> Yes, it will lead to tracking non-catalog-change transactions as well.
> If there are many subtransactions, the overhead could be noticeable.
> But it happens only once when creating a slot.
>

True, but it doesn't seem advisable to add such an overhead even
during create time without any concrete reason.

> Another variant of (a) is that we skip snapshot restores if the
> initial_xmin_hirizon is a valid transaction id. The
> initia_xmin_horizon is always set to a valida transaction id when
> initializing the decoding context, e.g. during
> CreateInitDecodingContext(). That way, we don't need to track
> non-catalog-change transctions. A downside is that this approach
> assumes that DecodingContextFindStartpoint() is called with the
> decoding context created by CreateInitDecodingContxt(), which is true
> in the core codes, but might not be true in third party extensions.
>

I think it is better to be explicit in this case rather than relying
on initia_xmin_horizon. So, we can store in_create/create_in_progress
flag in the Snapbuild in HEAD and store it in LogicalDecodingContext
in back branches. I think changing SnapBuild means we have to update
SNAPBUILD_VERSION, right? Is that a good idea to do at this point of
time or shall we wait new branch to open and change it there? Anyway,
it would be a few days away and in the meantime, we can review and
keep the patches ready.

--
With Regards,
Amit Kapila.

Re: Potential data loss due to race condition during logical replication slot creation

From

Masahiko Sawada

Date:

24 June 2024, 05:01:49

On Mon, Jun 24, 2024 at 12:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jun 21, 2024 at 12:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > The approach (a) has a downside, it will lead to tracking more
> > > transactions (non-catalog) than required without any benefit for the
> > > user. Considering that is true, I wouldn't prefer that approach.
> >
> > Yes, it will lead to tracking non-catalog-change transactions as well.
> > If there are many subtransactions, the overhead could be noticeable.
> > But it happens only once when creating a slot.
> >
>
> True, but it doesn't seem advisable to add such an overhead even
> during create time without any concrete reason.
>
> > Another variant of (a) is that we skip snapshot restores if the
> > initial_xmin_hirizon is a valid transaction id. The
> > initia_xmin_horizon is always set to a valida transaction id when
> > initializing the decoding context, e.g. during
> > CreateInitDecodingContext(). That way, we don't need to track
> > non-catalog-change transctions. A downside is that this approach
> > assumes that DecodingContextFindStartpoint() is called with the
> > decoding context created by CreateInitDecodingContxt(), which is true
> > in the core codes, but might not be true in third party extensions.
> >
>
> I think it is better to be explicit in this case rather than relying
> on initia_xmin_horizon. So, we can store in_create/create_in_progress
> flag in the Snapbuild in HEAD and store it in LogicalDecodingContext
> in back branches.

I think we cannot access the flag in LogicalDecodingContext from
snapbuild.c at least in backbranches. I've discussed adding such a
flag in snapbuild.c as a global variable, but I'm slightly hesitant to
add a global variable besides InitialRunningXacts.

>  I think changing SnapBuild means we have to update
> SNAPBUILD_VERSION, right? Is that a good idea to do at this point of
> time or shall we wait new branch to open and change it there? Anyway,
> it would be a few days away and in the meantime, we can review and
> keep the patches ready.

I think we should wait to add such changes that break on-disk
compatibility until a new branch opens. On HEAD, I think we can add a
new flag in SnapBuild and set it during say
DecodingContextFindStartpoint().

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Potential data loss due to race condition during logical replication slot creation

From

Amit Kapila

Date:

25 June 2024, 04:24:10

On Mon, Jun 24, 2024 at 10:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jun 24, 2024 at 12:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Jun 21, 2024 at 12:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > The approach (a) has a downside, it will lead to tracking more
> > > > transactions (non-catalog) than required without any benefit for the
> > > > user. Considering that is true, I wouldn't prefer that approach.
> > >
> > > Yes, it will lead to tracking non-catalog-change transactions as well.
> > > If there are many subtransactions, the overhead could be noticeable.
> > > But it happens only once when creating a slot.
> > >
> >
> > True, but it doesn't seem advisable to add such an overhead even
> > during create time without any concrete reason.
> >
> > > Another variant of (a) is that we skip snapshot restores if the
> > > initial_xmin_hirizon is a valid transaction id. The
> > > initia_xmin_horizon is always set to a valida transaction id when
> > > initializing the decoding context, e.g. during
> > > CreateInitDecodingContext(). That way, we don't need to track
> > > non-catalog-change transctions. A downside is that this approach
> > > assumes that DecodingContextFindStartpoint() is called with the
> > > decoding context created by CreateInitDecodingContxt(), which is true
> > > in the core codes, but might not be true in third party extensions.
> > >
> >
> > I think it is better to be explicit in this case rather than relying
> > on initia_xmin_horizon. So, we can store in_create/create_in_progress
> > flag in the Snapbuild in HEAD and store it in LogicalDecodingContext
> > in back branches.
>
> I think we cannot access the flag in LogicalDecodingContext from
> snapbuild.c at least in backbranches. I've discussed adding such a
> flag in snapbuild.c as a global variable, but I'm slightly hesitant to
> add a global variable besides InitialRunningXacts.
>

I agree that adding a global variable is not advisable. Can we pass
the flag stored in LogicalDecodingContext to snapbuild.c? That might
not be elegant but I don't have any better ideas.

> >  I think changing SnapBuild means we have to update
> > SNAPBUILD_VERSION, right? Is that a good idea to do at this point of
> > time or shall we wait new branch to open and change it there? Anyway,
> > it would be a few days away and in the meantime, we can review and
> > keep the patches ready.
>
> I think we should wait to add such changes that break on-disk
> compatibility until a new branch opens. On HEAD, I think we can add a
> new flag in SnapBuild and set it during say
> DecodingContextFindStartpoint().
>

Fair enough.

--
With Regards,
Amit Kapila.

Re: Potential data loss due to race condition during logical replication slot creation

From

Masahiko Sawada

Date:

25 June 2024, 08:26:11

On Tue, Jun 25, 2024 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 24, 2024 at 10:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Jun 24, 2024 at 12:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Jun 21, 2024 at 12:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > >
> > > > > The approach (a) has a downside, it will lead to tracking more
> > > > > transactions (non-catalog) than required without any benefit for the
> > > > > user. Considering that is true, I wouldn't prefer that approach.
> > > >
> > > > Yes, it will lead to tracking non-catalog-change transactions as well.
> > > > If there are many subtransactions, the overhead could be noticeable.
> > > > But it happens only once when creating a slot.
> > > >
> > >
> > > True, but it doesn't seem advisable to add such an overhead even
> > > during create time without any concrete reason.
> > >
> > > > Another variant of (a) is that we skip snapshot restores if the
> > > > initial_xmin_hirizon is a valid transaction id. The
> > > > initia_xmin_horizon is always set to a valida transaction id when
> > > > initializing the decoding context, e.g. during
> > > > CreateInitDecodingContext(). That way, we don't need to track
> > > > non-catalog-change transctions. A downside is that this approach
> > > > assumes that DecodingContextFindStartpoint() is called with the
> > > > decoding context created by CreateInitDecodingContxt(), which is true
> > > > in the core codes, but might not be true in third party extensions.
> > > >
> > >
> > > I think it is better to be explicit in this case rather than relying
> > > on initia_xmin_horizon. So, we can store in_create/create_in_progress
> > > flag in the Snapbuild in HEAD and store it in LogicalDecodingContext
> > > in back branches.
> >
> > I think we cannot access the flag in LogicalDecodingContext from
> > snapbuild.c at least in backbranches. I've discussed adding such a
> > flag in snapbuild.c as a global variable, but I'm slightly hesitant to
> > add a global variable besides InitialRunningXacts.
> >
>
> I agree that adding a global variable is not advisable. Can we pass
> the flag stored in LogicalDecodingContext to snapbuild.c?

Ah, I found a good path: snapbuild->reorder->private_data (storing a
pointer to a LogicalDecodingContext). This assumes private_data always
stores a pointer to a LogicalDecodingContext but I think that's find
at least for backbranches.

I've attached the patch for this idea for PG16.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

0001-Fix-possibility-of-logical-decoding-partial-of-trans.patch

Re: Potential data loss due to race condition during logical replication slot creation

From

Amit Kapila

Date:

26 June 2024, 09:17:03

On Tue, Jun 25, 2024 at 1:56 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Jun 25, 2024 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jun 24, 2024 at 10:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Mon, Jun 24, 2024 at 12:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Fri, Jun 21, 2024 at 12:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > > >
> > > > > > The approach (a) has a downside, it will lead to tracking more
> > > > > > transactions (non-catalog) than required without any benefit for the
> > > > > > user. Considering that is true, I wouldn't prefer that approach.
> > > > >
> > > > > Yes, it will lead to tracking non-catalog-change transactions as well.
> > > > > If there are many subtransactions, the overhead could be noticeable.
> > > > > But it happens only once when creating a slot.
> > > > >
> > > >
> > > > True, but it doesn't seem advisable to add such an overhead even
> > > > during create time without any concrete reason.
> > > >
> > > > > Another variant of (a) is that we skip snapshot restores if the
> > > > > initial_xmin_hirizon is a valid transaction id. The
> > > > > initia_xmin_horizon is always set to a valida transaction id when
> > > > > initializing the decoding context, e.g. during
> > > > > CreateInitDecodingContext(). That way, we don't need to track
> > > > > non-catalog-change transctions. A downside is that this approach
> > > > > assumes that DecodingContextFindStartpoint() is called with the
> > > > > decoding context created by CreateInitDecodingContxt(), which is true
> > > > > in the core codes, but might not be true in third party extensions.
> > > > >
> > > >
> > > > I think it is better to be explicit in this case rather than relying
> > > > on initia_xmin_horizon. So, we can store in_create/create_in_progress
> > > > flag in the Snapbuild in HEAD and store it in LogicalDecodingContext
> > > > in back branches.
> > >
> > > I think we cannot access the flag in LogicalDecodingContext from
> > > snapbuild.c at least in backbranches. I've discussed adding such a
> > > flag in snapbuild.c as a global variable, but I'm slightly hesitant to
> > > add a global variable besides InitialRunningXacts.
> > >
> >
> > I agree that adding a global variable is not advisable. Can we pass
> > the flag stored in LogicalDecodingContext to snapbuild.c?
>
> Ah, I found a good path: snapbuild->reorder->private_data (storing a
> pointer to a LogicalDecodingContext). This assumes private_data always
> stores a pointer to a LogicalDecodingContext but I think that's find
> at least for backbranches.
>

+1. This approach looks good to me.

--
With Regards,
Amit Kapila.

Re: Potential data loss due to race condition during logical replication slot creation

From

Masahiko Sawada

Date:

05 July 2024, 14:52:52

On Wed, Jun 26, 2024 at 6:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jun 25, 2024 at 1:56 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Jun 25, 2024 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Jun 24, 2024 at 10:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Mon, Jun 24, 2024 at 12:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Fri, Jun 21, 2024 at 12:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > > > >
> > > > > > > The approach (a) has a downside, it will lead to tracking more
> > > > > > > transactions (non-catalog) than required without any benefit for the
> > > > > > > user. Considering that is true, I wouldn't prefer that approach.
> > > > > >
> > > > > > Yes, it will lead to tracking non-catalog-change transactions as well.
> > > > > > If there are many subtransactions, the overhead could be noticeable.
> > > > > > But it happens only once when creating a slot.
> > > > > >
> > > > >
> > > > > True, but it doesn't seem advisable to add such an overhead even
> > > > > during create time without any concrete reason.
> > > > >
> > > > > > Another variant of (a) is that we skip snapshot restores if the
> > > > > > initial_xmin_hirizon is a valid transaction id. The
> > > > > > initia_xmin_horizon is always set to a valida transaction id when
> > > > > > initializing the decoding context, e.g. during
> > > > > > CreateInitDecodingContext(). That way, we don't need to track
> > > > > > non-catalog-change transctions. A downside is that this approach
> > > > > > assumes that DecodingContextFindStartpoint() is called with the
> > > > > > decoding context created by CreateInitDecodingContxt(), which is true
> > > > > > in the core codes, but might not be true in third party extensions.
> > > > > >
> > > > >
> > > > > I think it is better to be explicit in this case rather than relying
> > > > > on initia_xmin_horizon. So, we can store in_create/create_in_progress
> > > > > flag in the Snapbuild in HEAD and store it in LogicalDecodingContext
> > > > > in back branches.
> > > >
> > > > I think we cannot access the flag in LogicalDecodingContext from
> > > > snapbuild.c at least in backbranches. I've discussed adding such a
> > > > flag in snapbuild.c as a global variable, but I'm slightly hesitant to
> > > > add a global variable besides InitialRunningXacts.
> > > >
> > >
> > > I agree that adding a global variable is not advisable. Can we pass
> > > the flag stored in LogicalDecodingContext to snapbuild.c?
> >
> > Ah, I found a good path: snapbuild->reorder->private_data (storing a
> > pointer to a LogicalDecodingContext). This assumes private_data always
> > stores a pointer to a LogicalDecodingContext but I think that's find
> > at least for backbranches.
> >
>
> +1. This approach looks good to me.
>

I've attached updated patches for HEAD and pg17 for now (I will create
the patch for other backbranches).

In the patches, I used a different approach in between HEAD and
backbranches. On HEAD, we store a flag indicating whether or not we
should skip snapshot restores into the SnapBuild struct and set it
only while finding the start point. Therefore we have to bump
SNAPBUILD_VERSION. On backbranches, I used the approach we discussed
above; store the flag in LogicalDecodingContext and set it when
creating the LogicalDecodingContext for a new logical slot. A possible
downside of the approach taken for backbranches is that we implicitly
require for users to use the same LogicalDecodingContext for  both
initializing the context for a new slot and finding its start point.
IIUC it was not strictly required. This restriction would not be a
problem at least in the core, but I'm not sure if there are no
external extensions that create a logical slot in that way. This is
the main reason why I used a different approach on HEAD and
backbranches. Therefore, if it does not matter, I think we can use the
same approach on all branches, which is better in terms of
maintainability.

Also, I excluded the test case for the problem that Kuroda-san
reported[1] since the reported problem occurred due to the same cause
of the problem originally reported on this thread. The bug that
decodes only partial transactions could lead to various symptoms. IIUC
these test cases test the same behavior.

Regards,

[1]
https://www.postgresql.org/message-id/TYCPR01MB1207719C811F580A8774C79B7F52A2%40TYCPR01MB12077.jpnprd01.prod.outlook.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Potential data loss due to race condition during logical replication slot creation

From

Michael Paquier

Date:

07 July 2024, 23:56:42

On Fri, Jul 05, 2024 at 11:52:52PM +0900, Masahiko Sawada wrote:
> I've attached updated patches for HEAD and pg17 for now (I will create
> the patch for other backbranches).
>
> In the patches, I used a different approach in between HEAD and
> backbranches. On HEAD, we store a flag indicating whether or not we
> should skip snapshot restores into the SnapBuild struct and set it
> only while finding the start point. Therefore we have to bump
> SNAPBUILD_VERSION. On backbranches, I used the approach we discussed
> above; store the flag in LogicalDecodingContext and set it when
> creating the LogicalDecodingContext for a new logical slot. A possible
> downside of the approach taken for backbranches is that we implicitly
> require for users to use the same LogicalDecodingContext for  both
> initializing the context for a new slot and finding its start point.
> IIUC it was not strictly required. This restriction would not be a
> problem at least in the core, but I'm not sure if there are no
> external extensions that create a logical slot in that way. This is
> the main reason why I used a different approach on HEAD and
> backbranches. Therefore, if it does not matter, I think we can use the
> same approach on all branches, which is better in terms of
> maintainability.

--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -189,6 +189,9 @@ struct SnapBuild
     /* Indicates if we are building full snapshot or just catalog one. */
     bool        building_full_snapshot;

+    /* Indicates if we are finding the start point to extract changes */
+    bool        finding_start_point;
+

FYI, I think that it is still OK to bump SNAPBUILD_VERSION on
REL_17_STABLE.  That will reduce by 1 year the time window required to
maintain the tweaks implemented for the versions in the back-branches.
So I'd suggest to do what the v17 version of the patch does for ~16,
and use the snapshot format changes in 17~.
--
Michael

Attachment

signature.asc

Re: Potential data loss due to race condition during logical replication slot creation

From

Masahiko Sawada

Date:

08 July 2024, 03:01:23

On Mon, Jul 8, 2024 at 8:56 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Jul 05, 2024 at 11:52:52PM +0900, Masahiko Sawada wrote:
> > I've attached updated patches for HEAD and pg17 for now (I will create
> > the patch for other backbranches).
> >
> > In the patches, I used a different approach in between HEAD and
> > backbranches. On HEAD, we store a flag indicating whether or not we
> > should skip snapshot restores into the SnapBuild struct and set it
> > only while finding the start point. Therefore we have to bump
> > SNAPBUILD_VERSION. On backbranches, I used the approach we discussed
> > above; store the flag in LogicalDecodingContext and set it when
> > creating the LogicalDecodingContext for a new logical slot. A possible
> > downside of the approach taken for backbranches is that we implicitly
> > require for users to use the same LogicalDecodingContext for  both
> > initializing the context for a new slot and finding its start point.
> > IIUC it was not strictly required. This restriction would not be a
> > problem at least in the core, but I'm not sure if there are no
> > external extensions that create a logical slot in that way. This is
> > the main reason why I used a different approach on HEAD and
> > backbranches. Therefore, if it does not matter, I think we can use the
> > same approach on all branches, which is better in terms of
> > maintainability.
>
> --- a/src/backend/replication/logical/snapbuild.c
> +++ b/src/backend/replication/logical/snapbuild.c
> @@ -189,6 +189,9 @@ struct SnapBuild
>         /* Indicates if we are building full snapshot or just catalog one. */
>         bool            building_full_snapshot;
>
> +       /* Indicates if we are finding the start point to extract changes */
> +       bool            finding_start_point;
> +
>
> FYI, I think that it is still OK to bump SNAPBUILD_VERSION on
> REL_17_STABLE.  That will reduce by 1 year the time window required to
> maintain the tweaks implemented for the versions in the back-branches.
> So I'd suggest to do what the v17 version of the patch does for ~16,
> and use the snapshot format changes in 17~.

Thank you for the comment.

Agreed. If we get consensus on this approach to go, I'll use the patch
changing the SnapBuild struct for 17 and HEAD.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

RE: Potential data loss due to race condition during logical replication slot creation

From

"Hayato Kuroda (Fujitsu)"

Date:

08 July 2024, 09:01:47

Dear Sawada-san,

Thanks for creating the patch!

> I've attached updated patches for HEAD and pg17 for now (I will create
> the patch for other backbranches).
> 
> In the patches, I used a different approach in between HEAD and
> backbranches. On HEAD, we store a flag indicating whether or not we
> should skip snapshot restores into the SnapBuild struct and set it
> only while finding the start point. Therefore we have to bump
> SNAPBUILD_VERSION. On backbranches, I used the approach we discussed
> above; store the flag in LogicalDecodingContext and set it when
> creating the LogicalDecodingContext for a new logical slot. A possible
> downside of the approach taken for backbranches is that we implicitly
> require for users to use the same LogicalDecodingContext for  both
> initializing the context for a new slot and finding its start point.
> IIUC it was not strictly required. This restriction would not be a
> problem at least in the core, but I'm not sure if there are no
> external extensions that create a logical slot in that way. This is
> the main reason why I used a different approach on HEAD and
> backbranches. Therefore, if it does not matter, I think we can use the
> same approach on all branches, which is better in terms of
> maintainability.

I want to confirm your point. You meant that you wanted to unify appraoches,
especially you wanted to store the flag in LogicalDecodingContext, rigth? I
briefly grepped github repos with "DecodingContextFindStartpoint", and cannot
find such pattern. E.g., [1]. But this point must be pointed by extension developers.

> 
> Also, I excluded the test case for the problem that Kuroda-san
> reported[1] since the reported problem occurred due to the same cause
> of the problem originally reported on this thread. The bug that
> decodes only partial transactions could lead to various symptoms. IIUC
> these test cases test the same behavior.

Yes, I also think they are caused by the same root cause, so we can skip.

Comments for HEAD patch:
1.
You must modify test_decoding/meson.build. It was also missing in patches for
backbranches.

2.
The test code needs cleanup. E.g.,
- Steps "s0_insert_cat" and "s0_savepoint" is not used
- Table user_cat is not used
It was also missing in patches for backbranches.

3.
Just to confirm - In SnapshotRestore(), the added attribute is not restored
from the disk. This is intentional because 1) it has been set to false when it
was serilizing to disk and 2) the destination (SnapBuild *builder) is initialized
by palloc0() so it was set to false. Is it right?

Comments for backblanches patch:

1.
According to wikipage [2], new attribute must be at the end of struct.

[1]: https://github.com/cybertec-postgresql/pg_squeeze/blob/master/worker.c#L1614
[2]: https://wiki.postgresql.org/wiki/Committing_checklist

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

Re: Potential data loss due to race condition during logical replication slot creation

From

Masahiko Sawada

Date:

08 July 2024, 14:02:44

On Mon, Jul 8, 2024 at 6:01 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> Thanks for creating the patch!
>
> > I've attached updated patches for HEAD and pg17 for now (I will create
> > the patch for other backbranches).
> >
> > In the patches, I used a different approach in between HEAD and
> > backbranches. On HEAD, we store a flag indicating whether or not we
> > should skip snapshot restores into the SnapBuild struct and set it
> > only while finding the start point. Therefore we have to bump
> > SNAPBUILD_VERSION. On backbranches, I used the approach we discussed
> > above; store the flag in LogicalDecodingContext and set it when
> > creating the LogicalDecodingContext for a new logical slot. A possible
> > downside of the approach taken for backbranches is that we implicitly
> > require for users to use the same LogicalDecodingContext for  both
> > initializing the context for a new slot and finding its start point.
> > IIUC it was not strictly required. This restriction would not be a
> > problem at least in the core, but I'm not sure if there are no
> > external extensions that create a logical slot in that way. This is
> > the main reason why I used a different approach on HEAD and
> > backbranches. Therefore, if it does not matter, I think we can use the
> > same approach on all branches, which is better in terms of
> > maintainability.
>
> I want to confirm your point. You meant that you wanted to unify appraoches,
> especially you wanted to store the flag in LogicalDecodingContext, rigth?

Yes. Ideally I'd like to use the same approach in all branches
regardless of how to fix it for better maintainability.

> I
> briefly grepped github repos with "DecodingContextFindStartpoint", and cannot
> find such pattern. E.g., [1]. But this point must be pointed by extension developers.

Thank you for searching on github. It will also affect the future
extension developments so I'm slightly hesitant to do that.

> >
> > Also, I excluded the test case for the problem that Kuroda-san
> > reported[1] since the reported problem occurred due to the same cause
> > of the problem originally reported on this thread. The bug that
> > decodes only partial transactions could lead to various symptoms. IIUC
> > these test cases test the same behavior.
>
> Yes, I also think they are caused by the same root cause, so we can skip.
>
> Comments for HEAD patch:
> 1.
> You must modify test_decoding/meson.build. It was also missing in patches for
> backbranches.
>
> 2.
> The test code needs cleanup. E.g.,
> - Steps "s0_insert_cat" and "s0_savepoint" is not used
> - Table user_cat is not used
> It was also missing in patches for backbranches.

Will fix the above points in the next version patch.

>
> 3.
> Just to confirm - In SnapshotRestore(), the added attribute is not restored
> from the disk. This is intentional because 1) it has been set to false when it
> was serilizing to disk and 2) the destination (SnapBuild *builder) is initialized
> by palloc0() so it was set to false. Is it right?

Right.

> Comments for backblanches patch:
>
> 1.
> According to wikipage [2], new attribute must be at the end of struct.

Good point. Will fix.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Potential data loss due to race condition during logical replication slot creation

From

Masahiko Sawada

Date:

09 July 2024, 05:29:52

On Mon, Jul 8, 2024 at 11:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jul 8, 2024 at 6:01 PM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Sawada-san,
> >
> > Thanks for creating the patch!
> >
> > > I've attached updated patches for HEAD and pg17 for now (I will create
> > > the patch for other backbranches).
> > >
> > > In the patches, I used a different approach in between HEAD and
> > > backbranches. On HEAD, we store a flag indicating whether or not we
> > > should skip snapshot restores into the SnapBuild struct and set it
> > > only while finding the start point. Therefore we have to bump
> > > SNAPBUILD_VERSION. On backbranches, I used the approach we discussed
> > > above; store the flag in LogicalDecodingContext and set it when
> > > creating the LogicalDecodingContext for a new logical slot. A possible
> > > downside of the approach taken for backbranches is that we implicitly
> > > require for users to use the same LogicalDecodingContext for  both
> > > initializing the context for a new slot and finding its start point.
> > > IIUC it was not strictly required. This restriction would not be a
> > > problem at least in the core, but I'm not sure if there are no
> > > external extensions that create a logical slot in that way. This is
> > > the main reason why I used a different approach on HEAD and
> > > backbranches. Therefore, if it does not matter, I think we can use the
> > > same approach on all branches, which is better in terms of
> > > maintainability.
> >
> > I want to confirm your point. You meant that you wanted to unify appraoches,
> > especially you wanted to store the flag in LogicalDecodingContext, rigth?
>
> Yes. Ideally I'd like to use the same approach in all branches
> regardless of how to fix it for better maintainability.
>
> > I
> > briefly grepped github repos with "DecodingContextFindStartpoint", and cannot
> > find such pattern. E.g., [1]. But this point must be pointed by extension developers.
>
> Thank you for searching on github. It will also affect the future
> extension developments so I'm slightly hesitant to do that.
>
> > >
> > > Also, I excluded the test case for the problem that Kuroda-san
> > > reported[1] since the reported problem occurred due to the same cause
> > > of the problem originally reported on this thread. The bug that
> > > decodes only partial transactions could lead to various symptoms. IIUC
> > > these test cases test the same behavior.
> >
> > Yes, I also think they are caused by the same root cause, so we can skip.
> >
> > Comments for HEAD patch:
> > 1.
> > You must modify test_decoding/meson.build. It was also missing in patches for
> > backbranches.
> >
> > 2.
> > The test code needs cleanup. E.g.,
> > - Steps "s0_insert_cat" and "s0_savepoint" is not used
> > - Table user_cat is not used
> > It was also missing in patches for backbranches.
>
> Will fix the above points in the next version patch.
>
> >
> > 3.
> > Just to confirm - In SnapshotRestore(), the added attribute is not restored
> > from the disk. This is intentional because 1) it has been set to false when it
> > was serilizing to disk and 2) the destination (SnapBuild *builder) is initialized
> > by palloc0() so it was set to false. Is it right?
>
> Right.
>
> > Comments for backblanches patch:
> >
> > 1.
> > According to wikipage [2], new attribute must be at the end of struct.
>
> Good point. Will fix.
>

I've attached the new version patches for all branches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Potential data loss due to race condition during logical replication slot creation

From

Amit Kapila

Date:

09 July 2024, 09:27:30

On Tue, Jul 9, 2024 at 11:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jul 8, 2024 at 11:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Jul 8, 2024 at 6:01 PM Hayato Kuroda (Fujitsu)
> > <kuroda.hayato@fujitsu.com> wrote:
> > >
> > > Dear Sawada-san,
> > >
> > > Thanks for creating the patch!
> > >
> > > > I've attached updated patches for HEAD and pg17 for now (I will create
> > > > the patch for other backbranches).
> > > >
> > > > In the patches, I used a different approach in between HEAD and
> > > > backbranches. On HEAD, we store a flag indicating whether or not we
> > > > should skip snapshot restores into the SnapBuild struct and set it
> > > > only while finding the start point. Therefore we have to bump
> > > > SNAPBUILD_VERSION. On backbranches, I used the approach we discussed
> > > > above; store the flag in LogicalDecodingContext and set it when
> > > > creating the LogicalDecodingContext for a new logical slot. A possible
> > > > downside of the approach taken for backbranches is that we implicitly
> > > > require for users to use the same LogicalDecodingContext for  both
> > > > initializing the context for a new slot and finding its start point.
> > > > IIUC it was not strictly required. This restriction would not be a
> > > > problem at least in the core, but I'm not sure if there are no
> > > > external extensions that create a logical slot in that way. This is
> > > > the main reason why I used a different approach on HEAD and
> > > > backbranches. Therefore, if it does not matter, I think we can use the
> > > > same approach on all branches, which is better in terms of
> > > > maintainability.
> > >
> > > I want to confirm your point. You meant that you wanted to unify appraoches,
> > > especially you wanted to store the flag in LogicalDecodingContext, rigth?
> >
> > Yes. Ideally I'd like to use the same approach in all branches
> > regardless of how to fix it for better maintainability.
> >

The difference is minor so using slightly different approaches should
be okay. In the ideal case, I agree that using the same approach makes
sense but for future extendability, it is better to keep this new
variable in SnapBuild at least in HEAD and PG17.

>
> I've attached the new version patches for all branches.
>

Few comments:
1.
@@ -650,6 +650,9 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 {
  ReplicationSlot *slot = ctx->slot;

+ /* Let snapshot builder start to find the start point */
+ SnapBuildSetFindStartPoint(ctx->snapshot_builder, true);
+
  /* Initialize from where to start reading WAL. */
  XLogBeginRead(ctx->reader, slot->data.restart_lsn);

@@ -683,6 +686,9 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
  if (slot->data.two_phase)
  slot->data.two_phase_at = ctx->reader->EndRecPtr;
  SpinLockRelease(&slot->mutex);
+
+ /* Complete to find the start point */
+ SnapBuildSetFindStartPoint(ctx->snapshot_builder, false);

I wonder why you didn't choose to set this variable in
AllocateSnapshotBuilder()? If we do so, then we may not need set/reset
in DecodingContextFindStartpoint(). The one advantage of using
set/reset for the minimal time as done here is that we can avoid any
impact of this new variable but I still feel setting it in
AllocateSnapshotBuilder() seems more natural.

2.
Since in this case
+ *   the restart LSN could be in the middle of transactions we need to
+ *   find the start point where we won't see the data for partial
+ *   transactions.

There is a connecting word missing between *transactions* and *we*.
Can we use a slightly different wording like: "Can't use this method
while finding the start point for decoding changes as the restart LSN
would be an arbitrary LSN but we need to find the start point to
extract changes where we won't see the data for partial
transactions."?

--
With Regards,
Amit Kapila.

Re: Potential data loss due to race condition during logical replication slot creation

From

Masahiko Sawada

Date:

10 July 2024, 01:15:37

On Tue, Jul 9, 2024 at 6:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jul 9, 2024 at 11:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Jul 8, 2024 at 11:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Mon, Jul 8, 2024 at 6:01 PM Hayato Kuroda (Fujitsu)
> > > <kuroda.hayato@fujitsu.com> wrote:
> > > >
> > > > Dear Sawada-san,
> > > >
> > > > Thanks for creating the patch!
> > > >
> > > > > I've attached updated patches for HEAD and pg17 for now (I will create
> > > > > the patch for other backbranches).
> > > > >
> > > > > In the patches, I used a different approach in between HEAD and
> > > > > backbranches. On HEAD, we store a flag indicating whether or not we
> > > > > should skip snapshot restores into the SnapBuild struct and set it
> > > > > only while finding the start point. Therefore we have to bump
> > > > > SNAPBUILD_VERSION. On backbranches, I used the approach we discussed
> > > > > above; store the flag in LogicalDecodingContext and set it when
> > > > > creating the LogicalDecodingContext for a new logical slot. A possible
> > > > > downside of the approach taken for backbranches is that we implicitly
> > > > > require for users to use the same LogicalDecodingContext for  both
> > > > > initializing the context for a new slot and finding its start point.
> > > > > IIUC it was not strictly required. This restriction would not be a
> > > > > problem at least in the core, but I'm not sure if there are no
> > > > > external extensions that create a logical slot in that way. This is
> > > > > the main reason why I used a different approach on HEAD and
> > > > > backbranches. Therefore, if it does not matter, I think we can use the
> > > > > same approach on all branches, which is better in terms of
> > > > > maintainability.
> > > >
> > > > I want to confirm your point. You meant that you wanted to unify appraoches,
> > > > especially you wanted to store the flag in LogicalDecodingContext, rigth?
> > >
> > > Yes. Ideally I'd like to use the same approach in all branches
> > > regardless of how to fix it for better maintainability.
> > >
>
> The difference is minor so using slightly different approaches should
> be okay. In the ideal case, I agree that using the same approach makes
> sense but for future extendability, it is better to keep this new
> variable in SnapBuild at least in HEAD and PG17.
>
> >
> > I've attached the new version patches for all branches.
> >
>
> Few comments:
> 1.
> @@ -650,6 +650,9 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
>  {
>   ReplicationSlot *slot = ctx->slot;
>
> + /* Let snapshot builder start to find the start point */
> + SnapBuildSetFindStartPoint(ctx->snapshot_builder, true);
> +
>   /* Initialize from where to start reading WAL. */
>   XLogBeginRead(ctx->reader, slot->data.restart_lsn);
>
> @@ -683,6 +686,9 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
>   if (slot->data.two_phase)
>   slot->data.two_phase_at = ctx->reader->EndRecPtr;
>   SpinLockRelease(&slot->mutex);
> +
> + /* Complete to find the start point */
> + SnapBuildSetFindStartPoint(ctx->snapshot_builder, false);
>
> I wonder why you didn't choose to set this variable in
> AllocateSnapshotBuilder()? If we do so, then we may not need set/reset
> in DecodingContextFindStartpoint(). The one advantage of using
> set/reset for the minimal time as done here is that we can avoid any
> impact of this new variable but I still feel setting it in
> AllocateSnapshotBuilder() seems more natural.

Initially what I thought was; if we set the flag in
AllocateSnapshotBuilder(), we need to pass true through like
CreateInitDecodingContext() -> StartupDecodingContext() ->
AllocateSnapshotBuilder(), meaning that the only
LogicalDecodingContext created via CreateInitDecodingContext() can be
used in DecodingContextFindStartpoint(). IOW if we use the
LogicalDecodingContext created via CreateDecodingContext() (i.e.,
setting the flag as false) in DecodingContextFindStartpoint() we would
end up having the same problem. IIUC we haven't had such a usage
restriction before. But thinking on this approach further, probably
the same is true for initial_xmin_horizon. The field is set only when
the LogicalDecodingContext is created via CreateInitDecodingContext(),
and is used only while finding the start point. So I'm fine with
setting the flag in AllocateSnapshotBuilder(). If we go this approach,
I think we should check if the flag is set before finding the start
point.

>
> 2.
> Since in this case
> + *   the restart LSN could be in the middle of transactions we need to
> + *   find the start point where we won't see the data for partial
> + *   transactions.
>
> There is a connecting word missing between *transactions* and *we*.
> Can we use a slightly different wording like: "Can't use this method
> while finding the start point for decoding changes as the restart LSN
> would be an arbitrary LSN but we need to find the start point to
> extract changes where we won't see the data for partial
> transactions."?

Looks good. Will fix.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Potential data loss due to race condition during logical replication slot creation

From

Masahiko Sawada

Date:

10 July 2024, 02:21:48

On Wed, Jul 10, 2024 at 10:15 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>  If we go this approach,
> I think we should check if the flag is set before finding the start
> point.

Checking if the flag is set in DecodingContextFindStartpoint()
requires introducing a new function in snapbuild.c and the function
would be used only in an Assertion. So the attached updated patch
doesn't introduce such a function. If it's really worth it, I can add
it, but I think we can live without that.

Overall I think the patches are in good shape, so I'm going to push
them tomorrow, barring any objections and further comments.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Potential data loss due to race condition during logical replication slot creation

From

Amit Kapila

Date:

10 July 2024, 03:43:27

On Wed, Jul 10, 2024 at 7:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Overall I think the patches are in good shape, so I'm going to push
> them tomorrow, barring any objections and further comments.
>

Agreed. Below are a few minor comments that you might want to consider:

1.
@@ -76,6 +77,7 @@ extern SnapBuildState
SnapBuildCurrentState(SnapBuild *builder);
 extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder);

 extern bool SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr);
+extern void SnapBuildSetFindStartPoint(SnapBuild *builder, bool
find_start_point);

This is not required in the latest version.

2.
+ /*
+ * Indicates if we are using the snapshot builder for the initial creation
+ * of a logical replication slot.

The word 'initial' in the above comment is not required. If you
consider this then a similar wording change is required in lower
branches as well.

3.
HEAD and v17
---------------------
- /* b) valid on disk state and not building full snapshot */
+
+ /*
+ * b) valid on disk state and while neither building full snapshot nor
+ * finding the start point.
+ */
  else if (!builder->building_full_snapshot &&
+ !builder->in_slot_creation &&

V16 and below
---------------------
- /* b) valid on disk state and not building full snapshot */
+
+ /*
+ * b) valid on disk state and neither building full snapshot nor while
+ * creating a slot.
+ */
  else if (!builder->building_full_snapshot &&
+ !ctx->in_create &&

Isn't it better to use the same comment in both places?

--
With Regards,
Amit Kapila.

Re: Potential data loss due to race condition during logical replication slot creation

From

Masahiko Sawada

Date:

10 July 2024, 05:30:55

On Wed, Jul 10, 2024 at 12:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jul 10, 2024 at 7:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Overall I think the patches are in good shape, so I'm going to push
> > them tomorrow, barring any objections and further comments.
> >
>
> Agreed. Below are a few minor comments that you might want to consider:
>
> 1.
> @@ -76,6 +77,7 @@ extern SnapBuildState
> SnapBuildCurrentState(SnapBuild *builder);
>  extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder);
>
>  extern bool SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr);
> +extern void SnapBuildSetFindStartPoint(SnapBuild *builder, bool
> find_start_point);
>
> This is not required in the latest version.
>
> 2.
> + /*
> + * Indicates if we are using the snapshot builder for the initial creation
> + * of a logical replication slot.
>
> The word 'initial' in the above comment is not required. If you
> consider this then a similar wording change is required in lower
> branches as well.
>
> 3.
> HEAD and v17
> ---------------------
> - /* b) valid on disk state and not building full snapshot */
> +
> + /*
> + * b) valid on disk state and while neither building full snapshot nor
> + * finding the start point.
> + */
>   else if (!builder->building_full_snapshot &&
> + !builder->in_slot_creation &&
>
> V16 and below
> ---------------------
> - /* b) valid on disk state and not building full snapshot */
> +
> + /*
> + * b) valid on disk state and neither building full snapshot nor while
> + * creating a slot.
> + */
>   else if (!builder->building_full_snapshot &&
> + !ctx->in_create &&
>
> Isn't it better to use the same comment in both places?

Thank you for reviewing the patches! I agreed with all the points.
I've attached the updated patches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Potential data loss due to race condition during logical replication slot creation

From

Masahiko Sawada

Date:

12 July 2024, 02:52:20

On Wed, Jul 10, 2024 at 2:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Jul 10, 2024 at 12:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Jul 10, 2024 at 7:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > Overall I think the patches are in good shape, so I'm going to push
> > > them tomorrow, barring any objections and further comments.
> > >
> >
> > Agreed. Below are a few minor comments that you might want to consider:
> >
> > 1.
> > @@ -76,6 +77,7 @@ extern SnapBuildState
> > SnapBuildCurrentState(SnapBuild *builder);
> >  extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder);
> >
> >  extern bool SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr);
> > +extern void SnapBuildSetFindStartPoint(SnapBuild *builder, bool
> > find_start_point);
> >
> > This is not required in the latest version.
> >
> > 2.
> > + /*
> > + * Indicates if we are using the snapshot builder for the initial creation
> > + * of a logical replication slot.
> >
> > The word 'initial' in the above comment is not required. If you
> > consider this then a similar wording change is required in lower
> > branches as well.
> >
> > 3.
> > HEAD and v17
> > ---------------------
> > - /* b) valid on disk state and not building full snapshot */
> > +
> > + /*
> > + * b) valid on disk state and while neither building full snapshot nor
> > + * finding the start point.
> > + */
> >   else if (!builder->building_full_snapshot &&
> > + !builder->in_slot_creation &&
> >
> > V16 and below
> > ---------------------
> > - /* b) valid on disk state and not building full snapshot */
> > +
> > + /*
> > + * b) valid on disk state and neither building full snapshot nor while
> > + * creating a slot.
> > + */
> >   else if (!builder->building_full_snapshot &&
> > + !ctx->in_create &&
> >
> > Isn't it better to use the same comment in both places?
>
> Thank you for reviewing the patches! I agreed with all the points.
> I've attached the updated patches.
>

Pushed, bb19b70081 (and all supported branches).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Potential data loss due to race condition during logical replication slot creation

From

Michael Paquier

Date:

12 July 2024, 03:42:07

On Fri, Jul 12, 2024 at 11:52:20AM +0900, Masahiko Sawada wrote:
> Pushed, bb19b70081 (and all supported branches).

Thanks, Sawada-san.
--
Michael

Attachment

signature.asc