Thread: Potential data loss due to race condition during logical replication slot creation

Hello,

We discovered a race condition during logical replication slot creation that can result in the changes of transactions running at the time of the slot creation being only partially replicated. We found that the cause is the slot transitioning from an inconsistent or partially consistent state to a fully consistent state by restoring a snapshot that had been persisted to disk by a different logical slot. We provide a simple reproduction of the issue below:

Session 1:

SELECT pg_create_logical_replication_slot('slot1', 'test_decoding');

CREATE TABLE test (a int);
BEGIN;
INSERT INTO test VALUES (1);

Session 2:

SELECT pg_create_logical_replication_slot('slot2', 'test_decoding');

<query hangs>

Session 3:

CHECKPOINT;

select pg_logical_slot_get_changes('slot1', NULL, NULL);

<should return nothing of interest>

Session 1:

INSERT INTO test VALUES (2);
COMMIT;

<Session 2 query no longer hangs and successfully creates the slot2>

Session 2:

select pg_logical_slot_get_changes('slot1', NULL, NULL);

select pg_logical_slot_get_changes('slot2', NULL, NULL);

<expected: no rows of the txn are returned for slot2>
<actual: The 2nd row of the txn is returned for slot2>

Newly created logical replication slots initialize their restart LSN to the current insert position within the WAL and also force a running-xacts record to be logged, capturing the current state of the running transactions on the system. The creation process then waits for all of the transactions within that running-xacts record to complete before it can transition to the next snapbuild state. During this window, if another running-xacts record is written and a different logical decoding process decodes that record, a globally accessible snapshot will be persisted to disk.

Once all of the transactions from the initial running-xacts record have finished, the process performing the slot creation will become unblocked and will then consume the new running-xacts record. The process will see a valid snapshot on disk, restore that snapshot, and immediately transition to the consistent state. It will then set the slot's confirmed flush LSN to the start of the record following that running-xacts record.

We now have a logical replication slot whose restart LSN is past some changes of transactions that will commit after its confirmed flush LSN, in any case where the two running-xacts records are not fully disjoint. Once those transactions commit, their changes will be only partially streamed.

The attached fix addresses the issue by providing the snapshot builder with enough context to know whether it is building a snapshot for a slot that is being created for the first time or for a previously existing slot. This is similar to the "need_full_snapshot" parameter, which is already set by the caller to control when and how the snapbuilder is allowed to become consistent.

With this context, the snapshot builder can always skip restoring a persisted snapshot in order to become fully consistent. Since the issue only occurs when the logical replication slot consumes a persisted snapshot to become fully consistent, we can prevent it by disallowing that behavior during slot creation.
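
For illustration only, the shape of the change is roughly as follows (stand-in names, not the code in the attached patch):

```
/*
 * Illustration only: stand-in names rather than the real snapbuild.c
 * structures from the attached patch.
 */
#include <stdbool.h>

typedef struct SnapBuildSketch
{
    bool    in_slot_creation;   /* hypothetical flag: set while the slot is
                                 * still being created */
    /* ... remaining snapshot-builder state elided ... */
} SnapBuildSketch;

static bool
may_restore_serialized_snapshot(const SnapBuildSketch *builder)
{
    /*
     * A snapshot serialized by another slot can cover transactions that were
     * already running when this slot picked its restart LSN; becoming
     * consistent from it would let us stream those transactions from their
     * middle.  So never take the restore shortcut during slot creation.
     */
    return !builder->in_slot_creation;
}
```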

 

Thanks,

Drew

 

Attachment
Hi,

Thank you for the report!

On Fri, Feb 2, 2024 at 7:28 AM Callahan, Drew <callaan@amazon.com> wrote:
>
> Hello,
>
> We discovered a race condition during logical replication slot creation that can result in the changes of transactions running at the time of the slot creation being only partially replicated. We found that the cause is the slot transitioning from an inconsistent or partially consistent state to a fully consistent state by restoring a snapshot that had been persisted to disk by a different logical slot. We provide a simple reproduction of the issue below:
>
> Session 1:
>
> SELECT pg_create_logical_replication_slot('slot1', 'test_decoding');
>
> CREATE TABLE test (a int);
> BEGIN;
> INSERT INTO test VALUES (1);
>
> Session 2:
>
> SELECT pg_create_logical_replication_slot('slot2', 'test_decoding');
>
> <query hangs>
>
> Session 3:
>
> CHECKPOINT;
>
> select pg_logical_slot_get_changes('slot1', NULL, NULL);
>
> <should return nothing of interest>
>
> Session 1:
>
> INSERT INTO test VALUES (2);
> COMMIT;
>
> <Session 2 query no longer hangs and successfully creates the slot2>
>
> Session 2:
>
> select pg_logical_slot_get_changes('slot1', NULL, NULL);
>
> select pg_logical_slot_get_changes('slot2', NULL, NULL);
>
> <expected: no rows of the txn are returned for slot2>
> <actual: The 2nd row of the txn is returned for slot2>
>

I've confirmed this happens on HEAD and all supported back branches. And I
think it's a (long-standing) bug.

I think we can prepare an isolation test of the above steps and
include it in contrib/test_decoding/specs.

> Newly created logical replication slots initialize their restart LSN to the current insert position within the WAL and also force a running-xacts record to be logged, capturing the current state of the running transactions on the system. The creation process then waits for all of the transactions within that running-xacts record to complete before it can transition to the next snapbuild state. During this window, if another running-xacts record is written and a different logical decoding process decodes that record, a globally accessible snapshot will be persisted to disk.
>
> Once all of the transactions from the initial running-xacts record have finished, the process performing the slot creation will become unblocked and will then consume the new running-xacts record. The process will see a valid snapshot on disk, restore that snapshot, and immediately transition to the consistent state. It will then set the slot's confirmed flush LSN to the start of the record following that running-xacts record.

My analysis is the same. After the slot creation, the slot's LSNs become like:

(tx-begin) < (tx's some changes) < slot's restart_lsn < (tx's further some changes) < slot's confirmed_flush_lsn < (tx-commit)

> We now have a logical replication slot whose restart LSN is past some changes of transactions that will commit after its confirmed flush LSN, in any case where the two running-xacts records are not fully disjoint. Once those transactions commit, their changes will be only partially streamed.
>
> The attached fix addresses the issue by providing the snapshot builder with enough context to know whether it is building a snapshot for a slot that is being created for the first time or for a previously existing slot. This is similar to the "need_full_snapshot" parameter, which is already set by the caller to control when and how the snapbuilder is allowed to become consistent.
>
> With this context, the snapshot builder can always skip restoring a persisted snapshot in order to become fully consistent. Since the issue only occurs when the logical replication slot consumes a persisted snapshot to become fully consistent, we can prevent it by disallowing that behavior during slot creation.

I basically agree with the idea of skipping snapshot restore
while finding the initial start point. But I'd like to hear other
opinions too.

Another simple way to fix this is to always set
need_full_snapshot to true during slot creation. But it would then make
the snapshot builder include non-catalog-change transactions too while
finding the initial startpoint, which is not necessary.
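
As a toy model of that cost (stand-in names, not the actual snapbuild.c code):

```
/*
 * Toy model only: with need_full_snapshot forced to true, the builder has to
 * track every committed transaction while finding the start point, not only
 * catalog-modifying ones.
 */
#include <stdbool.h>

static bool
must_track_committed_xact(bool need_full_snapshot, bool modified_catalog)
{
    if (modified_catalog)
        return true;            /* always needed for logical decoding */

    return need_full_snapshot;  /* the extra memory/CPU cost of this approach */
}
```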

As for the proposed patch, it uses a global variable, InCreate, and
a memory context callback to reset it. While I believe this is to avoid
breaking any ABI since this bug exists on back branches too, I think it's
better to consider how to fix it in HEAD first and then how to backpatch it.
We don't necessarily need to have the same fixes for all branches all
the time.

IIUC we need to skip snapshot restore only when we're finding the
initial start point. So probably we don't need to set InCreate when
calling create_logical_replication_slot() with find_startpoint =
false.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



> I think we can prepare an isolation test of the above steps and
> include it in contrib/test_decoding/specs.

Thanks for the suggestion. I included an isolation test for the
exact repro and a couple of other tests for fully-before and fully-after.
These two tests don't add a whole lot of value since this
is already heavily tested elsewhere, so I'm good with removing them.
However, I thought it made sense to include them for completeness
given the test name.

> Another simple way to fix this is that we always set
> need_full_snapshot to true when slot creation. But it would then make
> the snapshot builder include non-catalog-change transactions too while
> finding the initial startpoint. Which is not necessary.

Since we're not accumulating changes anyway, and we never distribute snapshots
while inconsistent, it wouldn't cause any memory pressure in the ReorderBuffer.
This seems pretty contained and wouldn't require the use of a global variable
and memory context callbacks, which is a bit nicer from a readability standpoint.
It's a little inefficient with memory and CPU, especially in the presence of
long-running transactions, but likely would not be noticeable to most users.

> As for the proposed patch, it uses a global variable, InCreate, and
> memory context callback to reset it. While I believe this is for not
> breaking any ABI as this bug exists on back branches too, I think it's
> better to consider how to fix it in HEAD and then how to backpatch it.
> We don't necessarily need to have the same fixes for all branches all
> the time.

Attached is a patch that is ABI breaking, but removes the need for a global variable.
It relies on pushing the context of whether we are in create mode into the
logical decoding context, which is cleaner, but not as clean as pushing it
all the way down to the snapbuilder. I would push it down to the snapbuilder,
but the entire object is persisted to disk. This is problematic to me since the
create flag would be persisted, causing existing slots to restore a snapbuild
that could have the create flag set to true. This is not problematic as is,
since the variable would only be leveraged in SnapBuildRestore(), which is only
called while inconsistent, but it seems like a future bug waiting to happen.

We could always clear the flag when persisting, but that seems a bit hacky.
We could also fork the snapbuilder into on-disk and in-memory structs, which
would be convenient since we would not need to invalidate existing snapshots.
However, I'm not sure how people feel about the fork.

> IIUC we need to skip snapshot restore only when we're finding the
> initial start point. So probably we don't need to set InCreate when
> calling create_logical_replication_slot() with find_startpoint =
> false.

Yep, that's correct. We can safely skip setting InCreate. We could move
the setter function to slotfuncs.c as one way of doing this. I lean towards
setting it no matter what, since it is harmless if we never try to find
a decoding point and removes the need for future callers of
CreateInitDecodingContext() to determine whether it's safe to be false.
Though I agree that InCreate may be a bigger hammer than needed and may need
to be modified in the future.

Thanks for the quick response,
Drew Callahan


Attachment
On Fri, 2 Feb 2024 at 06:24, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> Thank you for the report!
>
> On Fri, Feb 2, 2024 at 7:28 AM Callahan, Drew <callaan@amazon.com> wrote:
> >
> > Hello,
> >
> > We discovered a race condition during logical replication slot creation that can result in the changes of transactions running at the time of the slot creation being only partially replicated. We found that the cause is the slot transitioning from an inconsistent or partially consistent state to a fully consistent state by restoring a snapshot that had been persisted to disk by a different logical slot. We provide a simple reproduction of the issue below:
> >
> > Session 1:
> >
> > SELECT pg_create_logical_replication_slot('slot1', 'test_decoding');
> >
> > CREATE TABLE test (a int);
> > BEGIN;
> > INSERT INTO test VALUES (1);
> >
> > Session 2:
> >
> > SELECT pg_create_logical_replication_slot('slot2', 'test_decoding');
> >
> > <query hangs>
> >
> > Session 3:
> >
> > CHECKPOINT;
> >
> > select pg_logical_slot_get_changes('slot1', NULL, NULL);
> >
> > <should return nothing of interest>
> >
> > Session 1:
> >
> > INSERT INTO test VALUES (2);
> > COMMIT;
> >
> > <Session 2 query no longer hangs and successfully creates the slot2>
> >
> > Session 2:
> >
> > select pg_logical_slot_get_changes('slot1', NULL, NULL);
> >
> > select pg_logical_slot_get_changes('slot2', NULL, NULL);
> >
> > <expected: no rows of the txn are returned for slot2>
> > <actual: The 2nd row of the txn is returned for slot2>
> >
>
> I've confirmed this happens HEAD and all supported backbranches. And I
> think it's a (long-standing) bug.
>
> I think we can prepare an isolation test of the above steps and
> include it in contrib/test_decoding/specs.
>
> > Newly created logical replication slots initialize their restart LSN to the current insert position within the WAL and also force a running-xacts record to be logged, capturing the current state of the running transactions on the system. The creation process then waits for all of the transactions within that running-xacts record to complete before it can transition to the next snapbuild state. During this window, if another running-xacts record is written and a different logical decoding process decodes that record, a globally accessible snapshot will be persisted to disk.
> >
> > Once all of the transactions from the initial running-xacts record have finished, the process performing the slot creation will become unblocked and will then consume the new running-xacts record. The process will see a valid snapshot on disk, restore that snapshot, and immediately transition to the consistent state. It will then set the slot's confirmed flush LSN to the start of the record following that running-xacts record.
>
> My analysis is the same. After the slot creation, the slot's LSNs become like:
>
> (tx-begin) < (tx's some changes) < slot's restart_lsn < (tx's further some changes) < slot's confirmed_flush_lsn < (tx-commit)
>
> > We now have a logical replication slot whose restart LSN is past some changes of transactions that will commit after its confirmed flush LSN, in any case where the two running-xacts records are not fully disjoint. Once those transactions commit, their changes will be only partially streamed.
> >
> > The attached fix addresses the issue by providing the snapshot builder with enough context to know whether it is building a snapshot for a slot that is being created for the first time or for a previously existing slot. This is similar to the "need_full_snapshot" parameter, which is already set by the caller to control when and how the snapbuilder is allowed to become consistent.
> >
> > With this context, the snapshot builder can always skip restoring a persisted snapshot in order to become fully consistent. Since the issue only occurs when the logical replication slot consumes a persisted snapshot to become fully consistent, we can prevent it by disallowing that behavior during slot creation.
>
> I basically agree with the basic idea of skipping snapshot restore
> while finding the initial start point. But I'd like to hear other
> opinions too.
>
> Another simple way to fix this is that we always set
> need_full_snapshot to true when slot creation. But it would then make
> the snapshot builder include non-catalog-change transactions too while
> finding the initial startpoint. Which is not necessary.
>
> As for the proposed patch, it uses a global variable, InCreate, and
> memory context callback to reset it. While I believe this is for not
> breaking any ABI as this bug exists on back branches too, I think it's
> better to consider how to fix it in HEAD and then how to backpatch it.
> We don't necessarily need to have the same fixes for all branches all
> the time.

I agree with having a separate patch for HEAD in this case. How about a
fix like the attached patch?

Regards,
Vignesh

Attachment
On Sat, Feb 3, 2024 at 12:12 PM Callahan, Drew <callaan@amazon.com> wrote:
>
> > I think we can prepare an isolation test of the above steps and
> > include it in contrib/test_decoding/specs.
>
> Thanks for the suggestion. Included an isolation test for the
> exact repro and a couple other tests for fully before and fully after.
> These two tests don't add a whole lot of value since this
> is already heavily tested elsewhere, so I'm good with removing.
> However, thought it made sense to include from a perspective of
> totality given the test name.

Thank you for the patch for the isolation test! Since Vignesh also
shared an isolation test patch, I've merged them and modified some
comments. I've attached the patch. I'm going to merge the isolation
test patch into the fix patch once we get a consensus on how to fix
the issue.

>
> > Another simple way to fix this is that we always set
> > need_full_snapshot to true when slot creation. But it would then make
> > the snapshot builder include non-catalog-change transactions too while
> > finding the initial startpoint. Which is not necessary.
>
> Since we're not accumulating changes anyway and we never distribute snapshots
> while inconsistent it wouldn't cause any memory pressure in the ReorderBuffer.
> This seems pretty contained and wouldn't require the use of a global variable
> and memory context callbacks which is a bit nicer from a readability standpoint.
> It's a little inefficient with memory and CPU, especially in the presence of
> long running transactions, but likely would not be noticeable to most users.
>
> > As for the proposed patch, it uses a global variable, InCreate, and
> > memory context callback to reset it. While I believe this is for not
> > breaking any ABI as this bug exists on back branches too, I think it's
> > better to consider how to fix it in HEAD and then how to backpatch it.
> > We don't necessarily need to have the same fixes for all branches all
> > the time.
>
> Attached a patch that is ABI breaking, but removes the need for a global variable.
> This relies on pushing the context of whether we are in a create mode to the
> logical decoding context, which is cleaner, but not as clean as just pushing
> all the way down to the snapbuilder. I would push down to the snap builder,
> but the entire object is persisted to disk. This is problematic to me since the
> create flag would be persisted causing exisiting slots to restore a snapbuild
> that could have the create flag set to true. This is not problematic as is,
> since the variable would only be leveraged in SnapBuildRestore() which is only
> called while inconsistent, but seems like a future bug waiting to happen.

But IIUC in SnapBuildRestore(), we selectively restore variables from
the serialized SnapBuild. For instance, building_full_snapshot and
initial_xmin_horizon are serialized to disk but ignored when
restoring a snapshot. I think that even if we add the in_create flag
to SnapBuild, it would be ignored on snapshot restore.
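
To illustrate the pattern with made-up names (this is not the actual SnapBuildRestore() code):

```
/*
 * Illustration of the selective-restore pattern, with stand-in names: only
 * snapshot-describing fields are copied back from the on-disk image, while
 * per-slot fields keep their in-memory values.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct BuilderSketch
{
    uint32_t state;                      /* restored from disk */
    uint32_t xmin;                       /* restored from disk */
    uint32_t xmax;                       /* restored from disk */
    bool     building_full_snapshot;     /* ignored on restore */
    bool     in_create;                  /* hypothetical flag: also ignored */
} BuilderSketch;

static void
restore_selected_fields(BuilderSketch *builder, const BuilderSketch *ondisk)
{
    builder->state = ondisk->state;
    builder->xmin = ondisk->xmin;
    builder->xmax = ondisk->xmax;
    /* building_full_snapshot and in_create are deliberately not copied */
}
```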

I think that it's cleaner if we could pass the in_create flag to
AllocateSnapshotBuilder() rather than passing the flag down to
SnapBuildFindSnapshot() from standby_decode().

>
> We could always clear the flag when persisting, but that seems a bit hacky.
> We could also fork the snap builder into on-disk and in-memory structs, which
> would be convenient since we would not need to invalidate exisiting snapshots.
> However, not sure how people feel about the fork.
>
> > IIUC we need to skip snapshot restore only when we're finding the
> > initial start point. So probably we don't need to set InCreate when
> > calling create_logical_replication_slot() with find_startpoint =
> > false.
>
> Yep, that's correct. We can safely skip setting InCreate. We could move
> the setter function slotfuncs.c as one way of doing this. I lean towards
> setting it no matter what since it is harmless if we never try to find
> a decoding point and removes the need for future callers of
> CreateInitDecodingContext() from having to determine if it's safe to be false.
> Though I agree that InCreate may a bigger hammer than needed which may need
> to be modified in the future.

I agree with setting the flag no matter what. The flag is needed for finding
the initial startpoint, but it would be harmless to set it even if the
caller doesn't call DecodingContextFindStartpoint(). If the caller
skips calling DecodingContextFindStartpoint(), it's the caller's
responsibility to set sane LSNs on the slot before starting
logical decoding. So it makes sense to me to remove the need for
callers to determine whether to set the flag, as you mentioned.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment
On Sat, Feb 3, 2024 at 10:48 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Fri, 2 Feb 2024 at 06:24, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Hi,
> >
> > Thank you for the report!
> >
> > On Fri, Feb 2, 2024 at 7:28 AM Callahan, Drew <callaan@amazon.com> wrote:
> > >
> > > Hello,
> > >
> > > We discovered a race condition during logical replication slot creation that can result in the changes of transactions running at the time of the slot creation being only partially replicated. We found that the cause is the slot transitioning from an inconsistent or partially consistent state to a fully consistent state by restoring a snapshot that had been persisted to disk by a different logical slot. We provide a simple reproduction of the issue below:
> > >
> > > Session 1:
> > >
> > > SELECT pg_create_logical_replication_slot('slot1', 'test_decoding');
> > >
> > > CREATE TABLE test (a int);
> > > BEGIN;
> > > INSERT INTO test VALUES (1);
> > >
> > > Session 2:
> > >
> > > SELECT pg_create_logical_replication_slot('slot2', 'test_decoding');
> > >
> > > <query hangs>
> > >
> > > Session 3:
> > >
> > > CHECKPOINT;
> > >
> > > select pg_logical_slot_get_changes('slot1', NULL, NULL);
> > >
> > > <should return nothing of interest>
> > >
> > > Session 1:
> > >
> > > INSERT INTO test VALUES (2);
> > > COMMIT;
> > >
> > > <Session 2 query no longer hangs and successfully creates the slot2>
> > >
> > > Session 2:
> > >
> > > select pg_logical_slot_get_changes('slot1', NULL, NULL);
> > >
> > > select pg_logical_slot_get_changes('slot2', NULL, NULL);
> > >
> > > <expected: no rows of the txn are returned for slot2>
> > > <actual: The 2nd row of the txn is returned for slot2>
> > >
> >
> > I've confirmed this happens HEAD and all supported backbranches. And I
> > think it's a (long-standing) bug.
> >
> > I think we can prepare an isolation test of the above steps and
> > include it in contrib/test_decoding/specs.
> >
> > > Newly created logical replication slots initialize their restart LSN to the current insert position within the WAL and also force a running-xacts record to be logged, capturing the current state of the running transactions on the system. The creation process then waits for all of the transactions within that running-xacts record to complete before it can transition to the next snapbuild state. During this window, if another running-xacts record is written and a different logical decoding process decodes that record, a globally accessible snapshot will be persisted to disk.
> > >
> > > Once all of the transactions from the initial running-xacts record have finished, the process performing the slot creation will become unblocked and will then consume the new running-xacts record. The process will see a valid snapshot on disk, restore that snapshot, and immediately transition to the consistent state. It will then set the slot's confirmed flush LSN to the start of the record following that running-xacts record.
> >
> > My analysis is the same. After the slot creation, the slot's LSNs become like:
> >
> > (tx-begin) < (tx's some changes) < slot's restart_lsn < (tx's further some changes) < slot's confirmed_flush_lsn < (tx-commit)
> >
> > > We now have a logical replication slot whose restart LSN is past some changes of transactions that will commit after its confirmed flush LSN, in any case where the two running-xacts records are not fully disjoint. Once those transactions commit, their changes will be only partially streamed.
> > >
> > > The attached fix addresses the issue by providing the snapshot builder with enough context to know whether it is building a snapshot for a slot that is being created for the first time or for a previously existing slot. This is similar to the "need_full_snapshot" parameter, which is already set by the caller to control when and how the snapbuilder is allowed to become consistent.
> > >
> > > With this context, the snapshot builder can always skip restoring a persisted snapshot in order to become fully consistent. Since the issue only occurs when the logical replication slot consumes a persisted snapshot to become fully consistent, we can prevent it by disallowing that behavior during slot creation.
> >
> > I basically agree with the basic idea of skipping snapshot restore
> > while finding the initial start point. But I'd like to hear other
> > opinions too.
> >
> > Another simple way to fix this is that we always set
> > need_full_snapshot to true when slot creation. But it would then make
> > the snapshot builder include non-catalog-change transactions too while
> > finding the initial startpoint. Which is not necessary.
> >
> > As for the proposed patch, it uses a global variable, InCreate, and
> > memory context callback to reset it. While I believe this is for not
> > breaking any ABI as this bug exists on back branches too, I think it's
> > better to consider how to fix it in HEAD and then how to backpatch it.
> > We don't necessarily need to have the same fixes for all branches all
> > the time.
>
> I agree to having a separate patch for HEAD in this case. How about a
> fix like a patch attached.

Thank you for the patch!

I've just sent some comments on how to fix this issue and updated the
isolation test patch [1]. While I basically agree with adding the flag to
SnapBuild, I think we can set the flag in AllocateSnapshotBuilder().
Also, we need to bump SNAPBUILD_VERSION.

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoCsSkwcAYgcUMi2zKp8F%3DmoJL4K1beGD6AV_%3DpHhfAM9Q%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Sat, Feb 3, 2024 at 12:12 PM Callahan, Drew <callaan@amazon.com> wrote:
>
> > I think we can prepare an isolation test of the above steps and
> > include it in contrib/test_decoding/specs.
>
> Thanks for the suggestion. Included an isolation test for the
> exact repro and a couple other tests for fully before and fully after.
> These two tests don't add a whole lot of value since this
> is already heavily tested elsewhere, so I'm good with removing.
> However, thought it made sense to include from a perspective of
> totality given the test name.
>
> > Another simple way to fix this is that we always set
> > need_full_snapshot to true when slot creation. But it would then make
> > the snapshot builder include non-catalog-change transactions too while
> > finding the initial startpoint. Which is not necessary.
>
> Since we're not accumulating changes anyway and we never distribute snapshots
> while inconsistent it wouldn't cause any memory pressure in the ReorderBuffer.
> This seems pretty contained and wouldn't require the use of a global variable
> and memory context callbacks which is a bit nicer from a readability standpoint.

Sorry I missed this comment.

I think it's a good point that this fix doesn't break any on-disk
compatibility while not affecting existing users much, especially for
back-patching. If we want to choose this approach for back branches,
we need to carefully consider the side effects.

> It's a little inefficient with memory and CPU, especially in the presence of
> long running transactions, but likely would not be noticeable to most users.

This matches my analysis.

Here is the summary of several proposals we've discussed:

a) Have CreateInitDecodingContext() always pass need_full_snapshot =
true to AllocateSnapshotBuilder().

This is the simplest approach and doesn't break the on-disk
SnapBuildOnDisk compatibility. A downside is that it's a little
inefficient with memory and CPU since it includes non-catalog-change
transactions too. Also, passing need_full_snapshot to
CreateInitDecodingContext() will no longer have any meaning.

b) Have snapbuild.c be able to handle multiple SnapBuildOnDisk versions.

This bumps SNAPBUILD_VERSION and therefore breaks the on-disk
SnapBuildOnDisk compatibility, but snapbuild.c can handle multiple
versions. However, since different branches are using different versions of
SnapBuildOnDisk, it would probably be hard to have a SnapBuildOnDisk
version that is consistent across all branches.

c) Add a global variable, say in_create, to snapbuild.c

It would also require adding a setter function for in_create. There
are several ideas of where to set/reset the flag. One idea is that we
reset the flag in AllocateSnapshotBuilder() and set the flag anywhere
before starting to find the start point, for example at the beginning
of DecodingContextFindStartpoint(). It probably won't require a memory
context callback to make sure to clear the flag. This idea doesn't
have a downside from the users' and extensions' perspective. But I'm slightly
hesitant to add a global variable.
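
For illustration, idea (c) would look roughly like this (hypothetical names, not an actual patch):

```
/*
 * Rough sketch of idea (c), with hypothetical names: a file-local flag in
 * snapbuild.c plus a setter; the flag is cleared when a builder is allocated
 * and set just before the start-point search begins.
 */
#include <stdbool.h>

static bool in_create = false;

/* hypothetical setter, called e.g. at the start of the start-point search */
void
SnapBuildSetInCreate(bool value)
{
    in_create = value;
}

/* consulted before attempting to restore a serialized snapshot */
static bool
may_restore_serialized_snapshot(void)
{
    return !in_create;
}
```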

What do you think? and any other ideas?

As a separate topic, I think that this backpatching complexity comes
from the fact that we're serializing the whole SnapBuild to disk
although we don't need to serialize all of its fields. In order to
make future back-patching easier, it might be worth considering
decoupling the fields that need to be serialized from the SnapBuild
struct.
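
Roughly, with illustrative names only (not a patch):

```
/*
 * Rough sketch of the decoupling idea, with stand-in names: keep the
 * serialized state in its own struct embedded in the in-memory builder, so
 * on-disk layout changes and in-memory-only additions stay independent.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct SerializedSnapStateSketch
{
    uint32_t state;
    uint32_t xmin;
    uint32_t xmax;
    /* ... only what genuinely needs to survive on disk ... */
} SerializedSnapStateSketch;

typedef struct SnapBuilderSketch
{
    SerializedSnapStateSketch persistent;   /* written to / read from disk */
    uint32_t initial_xmin_horizon;          /* in-memory only */
    bool     in_create;                     /* in-memory only */
} SnapBuilderSketch;
```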

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Thu, Feb 8, 2024 at 2:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sat, Feb 3, 2024 at 12:12 PM Callahan, Drew <callaan@amazon.com> wrote:
> >
> > > I think we can prepare an isolation test of the above steps and
> > > include it in contrib/test_decoding/specs.
> >
> > Thanks for the suggestion. Included an isolation test for the
> > exact repro and a couple other tests for fully before and fully after.
> > These two tests don't add a whole lot of value since this
> > is already heavily tested elsewhere, so I'm good with removing.
> > However, thought it made sense to include from a perspective of
> > totality given the test name.
> >
> > > Another simple way to fix this is that we always set
> > > need_full_snapshot to true when slot creation. But it would then make
> > > the snapshot builder include non-catalog-change transactions too while
> > > finding the initial startpoint. Which is not necessary.
> >
> > Since we're not accumulating changes anyway and we never distribute snapshots
> > while inconsistent it wouldn't cause any memory pressure in the ReorderBuffer.
> > This seems pretty contained and wouldn't require the use of a global variable
> > and memory context callbacks which is a bit nicer from a readability standpoint.
>
> Sorry I missed this comment.
>
> I think it's a good point that this fix doesn't lead to any on-disk
> compatibility while not affecting existing users much, especially for
> back-patching. If we want to choose this approach for bank branches,
> we need to carefully consider the side impacts.
>
> > It's a little inefficient with memory and CPU, especially in the presence of
> > long running transactions, but likely would not be noticeable to most users.
>
> This matches my analysis.
>
> Here is the summary of several proposals we've discussed:
>
> a) Have CreateInitDecodingContext() always pass need_full_snapshot =
> true to AllocateSnapshotBuilder().
>
> This is the simplest approach and doesn't break the on-disk
> SnapBuildOnDisck compatibility. A downside is that it's a little
> inefficient with memory and CPU since it includes non-catalog change
> transactions too. Also, passing need_full_snapshot to
> CreateInitDecodingContext() will no longer have any meaning.
>
> b) Have snapbuild.c being able to handle multiple SnapBuildOnDisk versions.
>
> This bumps SNAPBUILD_VERSION and therefore breaks the on-disk
> SnapBuildOnDisk compatibility, but snapbuild.c can handle multiple
> versions. But since different branches are using different versions of
> SnapBuildOnDisk, it would probably be hard to have the SnapBuildOnDisk
> version that is consistent across all versions.
>
> c) Add a global variable, say in_create, to snapbuild.c
>
> it would also require adding a setter function for in_create. There
> are several ideas where to set/reset the flag. One idea is that we
> reset the flag in AllocateSnapshotBuilder() and set the flag anywhere
> before starting to find the start point, for example at the beginning
> of DecodingContextFindStartpoint(). It probably won't require a memory
> context callback to make sure to clear the flag. This idea doesn't
> have a downside from users and extensions perspects. But I'm slightly
> hesitant to add a global variable.
>
> What do you think? and any other ideas?

I've drafted the idea (c) for discussion (for master and v16 for now).
I also liked the idea (a), but I'm a bit concerned about its future impact.

>
> As a separate topic, I think that this backpatching complexity comes
> from the fact that we're serializing a whole SnapBuild to the disk
> although we don't need to serialize all of its fields. In order to
> make future back-patching easy, it might be worth considering
> decoupling the fields that need to be serialized from SnapBuild
> struct.

If we implement this idea, we won't need to bump SNAPBUILD_VERSION for HEAD.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

RE: Potential data loss due to race condition during logical replication slot creation

From: "Hayato Kuroda (Fujitsu)"
Dear hackers,

While analyzing another failure [1], I found this thread. I think they occurred
for the same reason.

The reported failure occurs when a replication slot is created in the middle
of a transaction and reuses the snapshot from another slot. The reproducer is:

```
Session0

SELECT pg_create_logical_replication_slot('slot0', 'test_decoding');
BEGIN;
INSERT INTO foo ...

Session1

SELECT pg_create_logical_replication_slot('slot1', 'test_decoding');

Session2

CHECKPOINT;
SELECT pg_logical_slot_get_changes('slot0', NULL, NULL);

Session0

INSERT INTO var ... // var is defined with (user_catalog_table = true)
COMMIT;

Session1
SELECT pg_logical_slot_get_changes('slot1', NULL, NULL);
-> Assertion failure.
```

> Here is the summary of several proposals we've discussed:
> a) Have CreateInitDecodingContext() always pass need_full_snapshot =
> true to AllocateSnapshotBuilder().

> b) Have snapbuild.c being able to handle multiple SnapBuildOnDisk versions.

> c) Add a global variable, say in_create, to snapbuild.c

Regarding the three options raised by Sawada-san, I prefer approach a).
Since the issue can happen on all supported branches, we should choose the
conservative approach. Also, it is quite painful if there are different pieces
of code handling the same issue.

The attached patch implements approach a) since no one had done so yet. I also added
a test which can trigger the assertion failure, but I'm not sure it should be included.

[1]:
https://www.postgresql.org/message-id/TYCPR01MB1207717063D701F597EF98A0CF5272%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


Attachment

RE: Potential data loss due to race condition during logical replication slot creation

From: "Hayato Kuroda (Fujitsu)"
> Attached patch implemented the approach a) since no one made. I also added
> the test which can do assertion failure, but not sure it should be included.

The previously attached one had unnecessary changes. PSA the corrected version.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


Attachment
On Wed, Mar 13, 2024 at 3:17 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> > Attached patch implemented the approach a) since no one made. I also added
> > the test which can do assertion failure, but not sure it should be included.
>

I feel that setting "need_full_snapshot" to true for decoding means the
snapshot will start tracking non-catalog committed xacts as well, which
is costly. See SnapBuildCommitTxn(). Can we avoid this problem if we
had the list of all running xacts when we serialize the snapshot, by
not decoding any xact whose xid lies in that list? If so, one idea to
achieve this could be that we maintain the highest_running_xid while
serializing the snapshot, and then during restore, if that
highest_running_xid is <= builder->initial_xmin_horizon, we
ignore restoring the snapshot. We already have a few such cases handled
in SnapBuildRestore().
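
In toy form (illustrative names; real code would have to use the TransactionId comparison macros to be wraparound-safe):

```
/*
 * Toy rendering of the proposed check: the serializing slot records the
 * highest XID running at serialization time, and the restoring slot ignores
 * the on-disk snapshot when that XID does not exceed its own
 * initial_xmin_horizon.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t XidSketch;     /* stand-in for TransactionId */

static bool
ignore_serialized_snapshot(XidSketch serialized_highest_running_xid,
                           XidSketch initial_xmin_horizon)
{
    return serialized_highest_running_xid <= initial_xmin_horizon;
}
```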

--
With Regards,
Amit Kapila.



On Mon, Feb 19, 2024 at 2:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Feb 8, 2024 at 2:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Sat, Feb 3, 2024 at 12:12 PM Callahan, Drew <callaan@amazon.com> wrote:
> > >
> > > > I think we can prepare an isolation test of the above steps and
> > > > include it in contrib/test_decoding/specs.
> > >
> > > Thanks for the suggestion. Included an isolation test for the
> > > exact repro and a couple other tests for fully before and fully after.
> > > These two tests don't add a whole lot of value since this
> > > is already heavily tested elsewhere, so I'm good with removing.
> > > However, thought it made sense to include from a perspective of
> > > totality given the test name.
> > >
> > > > Another simple way to fix this is that we always set
> > > > need_full_snapshot to true when slot creation. But it would then make
> > > > the snapshot builder include non-catalog-change transactions too while
> > > > finding the initial startpoint. Which is not necessary.
> > >
> > > Since we're not accumulating changes anyway and we never distribute snapshots
> > > while inconsistent it wouldn't cause any memory pressure in the ReorderBuffer.
> > > This seems pretty contained and wouldn't require the use of a global variable
> > > and memory context callbacks which is a bit nicer from a readability standpoint.
> >
> > Sorry I missed this comment.
> >
> > I think it's a good point that this fix doesn't lead to any on-disk
> > compatibility while not affecting existing users much, especially for
> > back-patching. If we want to choose this approach for bank branches,
> > we need to carefully consider the side impacts.
> >
> > > It's a little inefficient with memory and CPU, especially in the presence of
> > > long running transactions, but likely would not be noticeable to most users.
> >
> > This matches my analysis.
> >
> > Here is the summary of several proposals we've discussed:
> >
> > a) Have CreateInitDecodingContext() always pass need_full_snapshot =
> > true to AllocateSnapshotBuilder().
> >
> > This is the simplest approach and doesn't break the on-disk
> > SnapBuildOnDisck compatibility. A downside is that it's a little
> > inefficient with memory and CPU since it includes non-catalog change
> > transactions too. Also, passing need_full_snapshot to
> > CreateInitDecodingContext() will no longer have any meaning.
> >
> > b) Have snapbuild.c being able to handle multiple SnapBuildOnDisk versions.
> >
> > This bumps SNAPBUILD_VERSION and therefore breaks the on-disk
> > SnapBuildOnDisk compatibility, but snapbuild.c can handle multiple
> > versions. But since different branches are using different versions of
> > SnapBuildOnDisk, it would probably be hard to have the SnapBuildOnDisk
> > version that is consistent across all versions.
> >
> > c) Add a global variable, say in_create, to snapbuild.c
> >
> > it would also require adding a setter function for in_create. There
> > are several ideas where to set/reset the flag. One idea is that we
> > reset the flag in AllocateSnapshotBuilder() and set the flag anywhere
> > before starting to find the start point, for example at the beginning
> > of DecodingContextFindStartpoint(). It probably won't require a memory
> > context callback to make sure to clear the flag. This idea doesn't
> > have a downside from users and extensions perspects. But I'm slightly
> > hesitant to add a global variable.
> >
> > What do you think? and any other ideas?
>
> I've drafted the idea (c) for discussion (for master and v16 for now).
> I also liked the idea (a) but I'm concerned a bit about future impact.
>

I agree that (c) would be better than (a) as it avoids collecting
non-catalog xacts in the snapshot. However, I think we shouldn't avoid
restoring the snapshot unless really required. See my previous
response.

--
With Regards,
Amit Kapila.



RE: Potential data loss due to race condition during logical replication slot creation

From: "Hayato Kuroda (Fujitsu)"
Dear Amit,

> I feel setting "needs_full_snapshot" to true for decoding means the
> snapshot will start tracking non-catalog committed xacts as well which
> is costly.

I think that approach was the most conservative one, which does not require changing
the version of the snapshot. However, I understand that you want to consider
an optimized solution for HEAD first.

> See SnapBuildCommitTxn(). Can we avoid this problem if we
> would have list of all running xacts when we serialize the snapshot by
> not decoding any xact whose xid lies in that list? If so, one idea to
> achieve could be that we maintain the highest_running_xid while
> serailizing the snapshot and then during restore if that
> highest_running_xid is <= builder->initial_xmin_horizon, then we
> ignore restoring the snapshot. We already have few such cases handled
> in SnapBuildRestore().

Based on the idea, I made a prototype. It passes the tests added by others and by me.
What do others think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



Attachment
On Tue, Mar 19, 2024 at 7:46 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> I think the approach was most conservative one which does not have to change
> the version of the snapshot. However, I understood that you wanted to consider
> the optimized solution for HEAD first.
>

Right, let's see if we can have a solution other than always avoiding
restoring snapshots during slot creation even if that is for just
HEAD.

> > See SnapBuildCommitTxn(). Can we avoid this problem if we
> > would have list of all running xacts when we serialize the snapshot by
> > not decoding any xact whose xid lies in that list? If so, one idea to
> > achieve could be that we maintain the highest_running_xid while
> > serailizing the snapshot and then during restore if that
> > highest_running_xid is <= builder->initial_xmin_horizon, then we
> > ignore restoring the snapshot. We already have few such cases handled
> > in SnapBuildRestore().
>
> Based on the idea, I made a prototype. It can pass tests added by others and me.
> How do other think?
>

Won't it be possible to achieve the same thing if we just save
(serialize) the highest xid among all running xacts?

--
With Regards,
Amit Kapila.



RE: Potential data loss due to race condition during logical replication slot creation

From: "Hayato Kuroda (Fujitsu)"
Dear Amit,

> 
> Won't it be possible to achieve the same thing if we just save
> (serialize) the highest xid among all running xacts?
>

Indeed; here is an updated version.
Since the xids array in xl_running_xacts is not ordered, its entries must be
scanned to find the highest one.
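
Roughly, the scan looks like this (a toy sketch with stand-in types, not the patch itself):

```
/*
 * Toy sketch: since the xids array is not sorted, find the highest running
 * XID with a linear scan.  Real code must compare with TransactionIdFollows()
 * to stay correct across XID wraparound; plain '>' is only for this example.
 */
#include <stdint.h>

typedef uint32_t XidSketch;             /* stand-in for TransactionId */

typedef struct RunningXactsSketch       /* stand-in for xl_running_xacts */
{
    int       xcnt;                     /* number of running top-level xids */
    XidSketch xids[16];
} RunningXactsSketch;

static XidSketch
highest_running_xid(const RunningXactsSketch *running)
{
    XidSketch highest = 0;

    for (int i = 0; i < running->xcnt; i++)
        if (running->xids[i] > highest) /* toy comparison only */
            highest = running->xids[i];

    return highest;
}
```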

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


Attachment
On Mon, Mar 18, 2024 at 6:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> If so, one idea to
> achieve could be that we maintain the highest_running_xid while
> serailizing the snapshot and then during restore if that
> highest_running_xid is <= builder->initial_xmin_horizon, then we
> ignore restoring the snapshot. We already have few such cases handled
> in SnapBuildRestore().

I think that builder->initial_xmin_horizon could be older than
highest_running_xid, for example, when there is a logical replication
slot whose catalog_xmin is old. However, even in this case, we might
need to ignore restoring the snapshot. For example, a slightly
modified test case can still cause the same problem.

The test case in Kuroda-san's v2 patch:
permutation "s0_init" "s0_begin" "s0_insert1" "s1_init" "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit" "s1_get_changes_slot0" "s1_get_changes_slot1"

Modified-version test case (add "s0_insert1" between "s0_init" and "s0_begin"):
permutation "s0_init" "s0_insert1" "s0_begin" "s0_insert1" "s1_init" "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit" "s1_get_changes_slot0" "s1_get_changes_slot1"

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Potential data loss due to race condition during logical replication slot creation

From: "Hayato Kuroda (Fujitsu)"
Dear Sawada-san,

Thanks for the comments, and sorry for the late reply.

> > If so, one idea to
> > achieve could be that we maintain the highest_running_xid while
> > serailizing the snapshot and then during restore if that
> > highest_running_xid is <= builder->initial_xmin_horizon, then we
> > ignore restoring the snapshot. We already have few such cases handled
> > in SnapBuildRestore().
> 
> I think that builder->initial_xmin_horizon could be older than
> highest_running_xid, for example, when there is a logical replication
> slot whose catalog_xmin is old. However, even in this case, we might
> need to ignore restoring the snapshot. For example, a slightly
> modified test case still can cause the same problem.
> 
> The test case in Kuroda-san's v2 patch:
> permutation "s0_init" "s0_begin" "s0_insert1" "s1_init" "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit" "s1_get_changes_slot0" "s1_get_changes_slot1"
>
> Modified-version test case (add "s0_insert1" between "s0_init" and "s0_begin"):
> permutation "s0_init" "s0_insert1" "s0_begin" "s0_insert1" "s1_init" "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit" "s1_get_changes_slot0" "s1_get_changes_slot1"

Good catch. I confirmed that the variant still partially outputs the transaction:

```
step s1_get_changes_slot1: SELECT data FROM pg_logical_slot_get_changes('slot1', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');

data                                     
-----------------------------------------
BEGIN                                    
table public.tbl: INSERT: val1[integer]:2
COMMIT                                   
(3 rows)
```

/////

I analyzed it a bit. In the descriptions below, I use these notations:

txn0 - has a single insert
txn1 - has begin-insert-commit operations;
       slot1 is created in parallel with it

While creating slot0, initial_xmin_horizon was also set, and it came from
GetOldestSafeDecodingTransactionId().
Since there were no slots in the database before, the value was just the next xid.

Then, that pinned xid was assigned to txn0.
This is correct behavior because slot0 must decode txn0.

Finally, while creating slot1 after txn1 had started, the snapshot serialized by slot0 was
restored once. txn1 was recorded as the highest_running_xid, but initial_xmin_horizon was
older than it. So the snapshot was restored and txn1 was partially decoded.

initial_xmin_horizon (txn0) < highest_running_xid (txn1)

/////

So, how should we fix this? One idea is based on (c): add a flag which indicates
the existence of concurrent transactions. The restoration is skipped if both
flags are true. PSA the PoC patch.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


Attachment
On Wed, Mar 27, 2024 at 4:54 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> Thanks for giving comments and sorry for late reply.
>
> > > If so, one idea to
> > > achieve could be that we maintain the highest_running_xid while
> > > serailizing the snapshot and then during restore if that
> > > highest_running_xid is <= builder->initial_xmin_horizon, then we
> > > ignore restoring the snapshot. We already have few such cases handled
> > > in SnapBuildRestore().
> >
> > I think that builder->initial_xmin_horizon could be older than
> > highest_running_xid, for example, when there is a logical replication
> > slot whose catalog_xmin is old. However, even in this case, we might
> > need to ignore restoring the snapshot. For example, a slightly
> > modified test case still can cause the same problem.
> >
> > The test case in Kuroda-san's v2 patch:
> > permutation "s0_init" "s0_begin" "s0_insert1" "s1_init" "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit" "s1_get_changes_slot0" "s1_get_changes_slot1"
> >
> > Modified-version test case (add "s0_insert1" between "s0_init" and "s0_begin"):
> > permutation "s0_init" "s0_insert1" "s0_begin" "s0_insert1" "s1_init" "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit" "s1_get_changes_slot0" "s1_get_changes_slot1"
>
> Good catch, I confirmed that the variant still partially output transactions:
>
> ```
> step s1_get_changes_slot1: SELECT data FROM pg_logical_slot_get_changes('slot1', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
> data
> -----------------------------------------
> BEGIN
> table public.tbl: INSERT: val1[integer]:2
> COMMIT
> (3 rows)
> ```
>
> /////
>
> I analyzed a bit. In below descriptions, I uses these notations:
>
> txn0 - has a single insert
> txn1 - has begin-insert-commit operations.
>        slot1 is created in parallel
>
> While creating slot0, initial_xmin_horizon was also set, and it come from
> GetOldestSafeDecodingTransactionId().
> Since there were no slots in the database before, the value was just a next xid.
>
> Then, the pinned xid was assigned to the txn0.
> It is correct behavior because slot0 must decode txn0.
>
> Finally, while creating slot1 after the txn1, the snapshot serialized by slot0 was
> restored once. txn1 was the recorded as the highest_running, but initial_xmin was
> older than it. So the snapshot was restored and txn1 was partially decoded.
>
> initial_xmin_horizon (txn0) < highest_running_xid (txn1)
>
> /////
>
> So, how should we fix? One idea is based on (c), and adds a flag which indicates
> an existence of concurrent transactions. The restoration is skipped if both two
> flags are true. PSA the PoC patch.
>

With the PoC patch, we check ondisk.builder.is_there_running_xact in
SnapBuildRestore(), but can we just check running->xcnt in
SnapBuildFindSnapshot() to skip calling SnapBuildRestore()? That is,
if builder->initial_xmin_horizon is valid (or
builder->finding_start_point is true) and running->xcnt > 0, we skip
the snapshot restore. However, I think there are still cases where we
unnecessarily skip snapshot restores.

Probably, what we would like to avoid is this: we compute
initial_xmin_horizon and start to find the initial start point while
there is a concurrently running transaction, and then jump to the
consistent state by restoring a consistent snapshot before the
concurrent transaction commits. So we can ignore snapshot restores if
(the oldest XID among transactions running at the time of
CreateInitDecodingContext()) >= (oldestRunningXid in
xl_running_xacts).

I've drafted this idea in the attached patch just for discussion.
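
In toy form, the condition is roughly the following (stand-in names; real code needs wraparound-safe XID comparisons):

```
/*
 * Toy form of the condition: skip the restore when a transaction that was
 * already running when CreateInitDecodingContext() computed its horizon may
 * still be listed as running in the xl_running_xacts record.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t XidSketch;     /* stand-in for TransactionId */

static bool
skip_snapshot_restore(XidSketch oldest_xid_running_at_slot_creation,
                      XidSketch oldest_running_xid_in_record)
{
    return oldest_xid_running_at_slot_creation >= oldest_running_xid_in_record;
}
```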

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

RE: Potential data loss due to race condition during logical replication slot creation

From: "Hayato Kuroda (Fujitsu)"
Dear Sawada-san,

> 
> With the PoC patch, we check ondisk.builder.is_there_running_xact in
> SnapBuildRestore(),

Yes, the PoC requires that the snapshot state in the file be read.

> but can we just check running->xcnt in
> SnapBuildFindSnapshot() to skip calling SnapBuildRestore()? That is,
> if builder->initial_xmin_horizon is valid (or
> builder->finding_start_point is true) and running->xcnt > 0, we skip
> the snapshot restore.

IIUC, it does not require API modifications, which may be an advantage.

> However, I think there are still cases where we
> unnecessarily skip snapshot restores
>
> Probably, what we would like to avoid is, we compute
> initial_xmin_horizon and start to find the initial start point while
> there is a concurrently running transaction, and then jump to the
> consistent state  by restoring the consistent snapshot before the
> concurrent transaction commits.

Yeah, information from before concurrent txns have committed should not be used. I think
that's why SnapBuildWaitSnapshot() waits until the listed transactions have finished.

> So we can ignore snapshot restores if
> (oldest XID among transactions running at the time of
> CreateInitDecodingContext()) >= (OldestRunningXID in
> xl_running_xacts).
> 
> I've drafted this idea in the attached patch just for discussion.

Thanks for sharing the patch. At least, I confirmed that all the tests and the workload you
pointed out in [1] pass. I will post here if I find other issues.

[1]: https://www.postgresql.org/message-id/CAD21AoDzLY9vRpo%2Bxb2qPtfn46ikiULPXDpT94sPyFH4GE8bYg%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


On Mon, Mar 18, 2024 at 6:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Feb 19, 2024 at 2:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Feb 8, 2024 at 2:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Sat, Feb 3, 2024 at 12:12 PM Callahan, Drew <callaan@amazon.com> wrote:
> > > >
> > > > > I think we can prepare an isolation test of the above steps and
> > > > > include it in contrib/test_decoding/specs.
> > > >
> > > > Thanks for the suggestion. Included an isolation test for the
> > > > exact repro and a couple other tests for fully before and fully after.
> > > > These two tests don't add a whole lot of value since this
> > > > is already heavily tested elsewhere, so I'm good with removing.
> > > > However, thought it made sense to include from a perspective of
> > > > totality given the test name.
> > > >
> > > > > Another simple way to fix this is that we always set
> > > > > need_full_snapshot to true when slot creation. But it would then make
> > > > > the snapshot builder include non-catalog-change transactions too while
> > > > > finding the initial startpoint. Which is not necessary.
> > > >
> > > > Since we're not accumulating changes anyway and we never distribute snapshots
> > > > while inconsistent it wouldn't cause any memory pressure in the ReorderBuffer.
> > > > This seems pretty contained and wouldn't require the use of a global variable
> > > > and memory context callbacks which is a bit nicer from a readability standpoint.
> > >
> > > Sorry I missed this comment.
> > >
> > > I think it's a good point that this fix doesn't lead to any on-disk
> > > compatibility while not affecting existing users much, especially for
> > > back-patching. If we want to choose this approach for bank branches,
> > > we need to carefully consider the side impacts.
> > >
> > > > It's a little inefficient with memory and CPU, especially in the presence of
> > > > long running transactions, but likely would not be noticeable to most users.
> > >
> > > This matches my analysis.
> > >
> > > Here is the summary of several proposals we've discussed:
> > >
> > > a) Have CreateInitDecodingContext() always pass need_full_snapshot =
> > > true to AllocateSnapshotBuilder().
> > >
> > > This is the simplest approach and doesn't break the on-disk
> > > SnapBuildOnDisck compatibility. A downside is that it's a little
> > > inefficient with memory and CPU since it includes non-catalog change
> > > transactions too. Also, passing need_full_snapshot to
> > > CreateInitDecodingContext() will no longer have any meaning.
> > >
> > > b) Have snapbuild.c being able to handle multiple SnapBuildOnDisk versions.
> > >
> > > This bumps SNAPBUILD_VERSION and therefore breaks the on-disk
> > > SnapBuildOnDisk compatibility, but snapbuild.c can handle multiple
> > > versions. But since different branches are using different versions of
> > > SnapBuildOnDisk, it would probably be hard to have the SnapBuildOnDisk
> > > version that is consistent across all versions.
> > >
> > > c) Add a global variable, say in_create, to snapbuild.c
> > >
> > > it would also require adding a setter function for in_create. There
> > > are several ideas where to set/reset the flag. One idea is that we
> > > reset the flag in AllocateSnapshotBuilder() and set the flag anywhere
> > > before starting to find the start point, for example at the beginning
> > > of DecodingContextFindStartpoint(). It probably won't require a memory
> > > context callback to make sure to clear the flag. This idea doesn't
> > > have a downside from users and extensions perspects. But I'm slightly
> > > hesitant to add a global variable.
> > >
> > > What do you think? and any other ideas?
> >
> > I've drafted the idea (c) for discussion (for master and v16 for now).
> > I also liked the idea (a) but I'm concerned a bit about future impact.
> >
>
> I agree that (c) would be better than (a) as it avoids collecting
> non-catalog xacts in snapshot. However, I think we shouldn't avoid
> restoring the snapshot unless really required.

Pondering further, I came across the question: in what case would we
need to restore the snapshot and jump to the consistent state when
finding the initial start point?

When creating a slot, in ReplicationSlotReserveWal(), we determine the
restart_lsn and write a RUNNING_XACT record. This record would be the
first RUNNING_XACT record that the logical decoding decodes in most
cases. In SnapBuildFindSnapshot(), if the record satisfies (a) (i.e.
running->oldestRunningXid == running->nextXid), it can jump to the
consistent state. If not, it means there were running transactions at
the time the RUNNING_XACT record was produced. Therefore, we
must not restore the snapshot and jump to the consistent state,
because otherwise we would end up deciding the start point in the middle of
a transaction that started before the RUNNING_XACT record. Probably
the same is true for all subsequent snapbuild states.
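
As a toy model of that observation (stand-in names, not the actual SnapBuildFindSnapshot() code):

```
/*
 * Toy model: the first RUNNING_XACT record either shows nothing running, in
 * which case we can become consistent immediately, or something is running,
 * in which case jumping to consistency via a restored snapshot could start
 * us in the middle of one of those transactions.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t XidSketch;     /* stand-in for TransactionId */

static bool
can_become_consistent_immediately(XidSketch oldest_running_xid,
                                  XidSketch next_xid)
{
    return oldest_running_xid == next_xid;  /* no transactions were running */
}
```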

I might be missing something, but I could not find a case where we
can or want to restore the serialized snapshot when finding the
initial start point. If my analysis is correct, we can go with either
(a) or (c) that I proposed before [1]. Between these two options, it could
also be an option to use (a) for back branches for safety and (c)
for master.

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoBH4NWdRcEjXpBFHSKuVO6wia-vHHHaKuEi-h4i4wbi8A%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com