Thread: Potential data loss due to race condition during logical replication slot creation
Potential data loss due to race condition during logical replication slot creation
Hello,
We discovered a race condition during logical replication slot creation that can result in the changes of transactions running at the time of slot creation being only partially replicated. The cause is that the slot transitions from an inconsistent or partially consistent state to a fully consistent state by restoring a snapshot that had been persisted to disk by a different logical slot. A simple reproduction of the issue follows:
Session 1:
SELECT pg_create_logical_replication_slot('slot1', 'test_decoding');
CREATE TABLE test (a int);
BEGIN;
INSERT INTO test VALUES (1);
Session 2:
SELECT pg_create_logical_replication_slot('slot2', 'test_decoding');
<query hangs>
Session 3:
CHECKPOINT;
select pg_logical_slot_get_changes('slot1', NULL, NULL);
<should return nothing of interest>
Session 1:
INSERT INTO test VALUES (2);
COMMIT;
<Session 2's query no longer hangs and slot2 is created successfully>
Session 2:
select pg_logical_slot_get_changes('slot1', NULL, NULL);
select pg_logical_slot_get_changes('slot2', NULL, NULL);
<expected: no rows of the txn are returned for slot2>
<actual: The 2nd row of the txn is returned for slot2>
Newly created logical replication slots initialize their restart LSN to the current insert position in the WAL and also force a running-xacts record to be logged, capturing the current state of the running transactions on the system. The creating process then waits for all of the transactions in that running-xacts record to complete before it can transition to the next snapbuild state. During this window, if another running-xacts record is written and a different logical replication process decodes it, a globally accessible snapshot is persisted to disk.
Once all of the transactions from the initial running-xacts record have finished, the process performing the slot creation becomes unblocked and consumes the new running-xacts record. The process sees a valid serialized snapshot, restores it from disk, and transitions immediately to the consistent state. The slot then sets its confirmed flush LSN to the start of the next record after that running-xacts record.
We now have a logical replication slot whose restart LSN lies after some of the changes of transactions that will commit after its confirmed flush LSN, whenever the two running-xacts records are not fully disjoint. Once those transactions commit, their changes are streamed only partially.
The attached fix addresses the issue by giving the snapshot builder enough context to know whether it is building a snapshot for a slot for the first time or for a previously existing slot. This is similar to the "need_full_snapshot" parameter, which is already set by the caller to control when and how the snapshot builder is allowed to become consistent.
With this context, the snapshot builder can always skip restoring a persisted snapshot in order to become fully consistent. Since the issue only occurs when the logical replication slot consumes a persisted snapshot to become fully consistent, disallowing that behavior during slot creation prevents it.
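To illustrate the shape of the change (this is only a rough sketch; the flag name is a placeholder and the actual plumbing is in the attached patch), the branch in SnapBuildFindSnapshot() that restores a serialized snapshot would simply be skipped while the slot is being created:

/*
 * Sketch only: "in_create" stands for whatever context tells the snapshot
 * builder that we are still finding the initial start point for a slot
 * that is being created.  While that is true, never become consistent by
 * restoring a snapshot that some other slot serialized to disk.
 */
if (!in_create && SnapBuildRestore(builder, lsn))
{
	/* there won't be any state to clean up */
	return false;
}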
Thanks,
Drew
Attachment
Re: Potential data loss due to race condition during logical replication slot creation
Hi, Thank you for the report! On Fri, Feb 2, 2024 at 7:28 AM Callahan, Drew <callaan@amazon.com> wrote: > > Hello, > > We discovered a race condition during logical replication slot creation that can result in the changes for transactionsrunning at the time of the slot creation to only be partially replicated. We found the cause was due to the slottransitioning from an inconsistent or partially consistent state to a fully consistent state when restoring a snapshotthat had been persisted to disk by a different logical slot. We provide a simple reproduction of this issue below: > > Session 1: > > SELECT pg_create_logical_replication_slot('slot1', 'test_decoding'); > > CREATE TABLE test (a int); > BEGIN; > INSERT INTO test VALUES (1); > > Session 2: > > SELECT pg_create_logical_replication_slot('slot2', 'test_decoding'); > > <query hangs> > > Session 3: > > CHECKPOINT; > > select pg_logical_slot_get_changes('slot1', NULL, NULL); > > <should return nothing of interest> > > Session 1: > > INSERT INTO test VALUES (2); > COMMIT; > > <Session 2 query no longer hangs and successfully creates the slot2> > > Session 2: > > select pg_logical_slot_get_changes('slot1', NULL, NULL); > > select pg_logical_slot_get_changes('slot2', NULL, NULL); > > <expected: no rows of the txn are returned for slot2> > <actual: The 2nd row of the txn is returned for slot2> > I've confirmed this happens HEAD and all supported backbranches. And I think it's a (long-standing) bug. I think we can prepare an isolation test of the above steps and include it in contrib/test_decoding/specs. > Newly created logical replication slots initialize their restart LSN to the current insert position within the WAL andalso force a checkpoint to get the current state of the running transactions on the system. This create process will thenwait for all of the transactions within that running xact record to complete before being able to transition to the nextsnapbuild state. During this time period, if another running xact record is written and then a different logical replicationprocess decodes this running xact record, a globally accessible snapshot will be persisted to disk. > > Once all of the transactions from the initial running xact have finished, the process performing the slot creation willbecome unblocked and will then consume the new running xact record. The process will see a valid snapshot, restore thatsnapshot from disk, and then transition immediately to the consistent state. The slot will then set the confirmed flushLSN of the slot to the start of the next record after that running xact. My analysis is the same. After the slot creation, the slot's LSNs become like: (tx-begin) < (tx's some changes) < slot's resetart_lsn < (fx's further some changes) < slot's confirmed_flush_lsn < (tx-commit) > We now have a logical replication slot that has a restart LSN after the changes of transactions that will commit afterour confirmed flushed LSN in any case where the two running xact records are not fully disjointed. Once those transactionscommit, we will then partially stream their changes. > > The attached fix addresses the issue by providing the snapshot builder with enough context to understand whether it isbuilding a snapshot for a slot for the first time or if this is a previously existing slot. This is similar to the “need_full_snapshot”parameter which is already set by the caller to control when and how the snapbuilder is allowed to becomeconsistent. 
> With this context, the snapshot builder can always skip performing snapshot restore in order to become fully consistent. Since this issue only occurs when the logical replication slot consumes a persisted snapshot to become fully consistent we can prevent the issue by disallowing this behavior.

I basically agree with the basic idea of skipping the snapshot restore while finding the initial start point. But I'd like to hear other opinions too.

Another simple way to fix this is that we always set need_full_snapshot to true at slot creation. But it would then make the snapshot builder include non-catalog-change transactions too while finding the initial start point, which is not necessary.

As for the proposed patch, it uses a global variable, InCreate, and a memory context callback to reset it. While I believe this is for not breaking any ABI, as this bug exists on back branches too, I think it's better to consider how to fix it in HEAD and then how to backpatch it. We don't necessarily need to have the same fix for all branches all the time.

IIUC we need to skip the snapshot restore only when we're finding the initial start point. So probably we don't need to set InCreate when calling create_logical_replication_slot() with find_startpoint = false.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Potential data loss due to race condition during logical replication slot creation
> I think we can prepare an isolation test of the above steps and include it in contrib/test_decoding/specs.

Thanks for the suggestion. I included an isolation test for the exact repro and a couple of other tests for the fully-before and fully-after cases. These two extra tests don't add a whole lot of value since this is already heavily tested elsewhere, so I'm fine with removing them. However, I thought it made sense to include them for completeness given the test name.

> Another simple way to fix this is that we always set need_full_snapshot to true at slot creation. But it would then make the snapshot builder include non-catalog-change transactions too while finding the initial start point, which is not necessary.

Since we're not accumulating changes anyway, and we never distribute snapshots while inconsistent, it wouldn't cause any memory pressure in the ReorderBuffer. This seems pretty contained and wouldn't require the use of a global variable and memory context callbacks, which is a bit nicer from a readability standpoint. It's a little inefficient with memory and CPU, especially in the presence of long-running transactions, but likely would not be noticeable to most users.

> As for the proposed patch, it uses a global variable, InCreate, and a memory context callback to reset it. While I believe this is for not breaking any ABI, as this bug exists on back branches too, I think it's better to consider how to fix it in HEAD and then how to backpatch it. We don't necessarily need to have the same fix for all branches all the time.

Attached is a patch that is ABI breaking but removes the need for a global variable. It relies on pushing the context of whether we are in create mode into the logical decoding context, which is cleaner, though not as clean as pushing it all the way down to the snapbuilder. I would push it down to the snap builder, but the entire object is persisted to disk. This is problematic to me since the create flag would be persisted, causing existing slots to restore a snapbuild that could have the create flag set to true. This is not problematic as is, since the variable would only be consulted in SnapBuildRestore(), which is only called while inconsistent, but it seems like a future bug waiting to happen.

We could always clear the flag when persisting, but that seems a bit hacky. We could also fork the snap builder into on-disk and in-memory structs, which would be convenient since we would not need to invalidate existing snapshots. However, I'm not sure how people feel about that fork.

> IIUC we need to skip the snapshot restore only when we're finding the initial start point. So probably we don't need to set InCreate when calling create_logical_replication_slot() with find_startpoint = false.

Yep, that's correct. We can safely skip setting InCreate. We could move the setter function to slotfuncs.c as one way of doing this. I lean towards setting it no matter what, since it is harmless if we never try to find a decoding point and it removes the need for future callers of CreateInitDecodingContext() to determine whether it's safe to leave it false. Though I agree that InCreate may be a bigger hammer than needed and may need to be modified in the future.

Thanks for the quick response,
Drew Callahan
Attachment
Re: Potential data loss due to race condition during logical replication slot creation
On Fri, 2 Feb 2024 at 06:24, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Hi, > > Thank you for the report! > > On Fri, Feb 2, 2024 at 7:28 AM Callahan, Drew <callaan@amazon.com> wrote: > > > > Hello, > > > > We discovered a race condition during logical replication slot creation that can result in the changes for transactionsrunning at the time of the slot creation to only be partially replicated. We found the cause was due to the slottransitioning from an inconsistent or partially consistent state to a fully consistent state when restoring a snapshotthat had been persisted to disk by a different logical slot. We provide a simple reproduction of this issue below: > > > > Session 1: > > > > SELECT pg_create_logical_replication_slot('slot1', 'test_decoding'); > > > > CREATE TABLE test (a int); > > BEGIN; > > INSERT INTO test VALUES (1); > > > > Session 2: > > > > SELECT pg_create_logical_replication_slot('slot2', 'test_decoding'); > > > > <query hangs> > > > > Session 3: > > > > CHECKPOINT; > > > > select pg_logical_slot_get_changes('slot1', NULL, NULL); > > > > <should return nothing of interest> > > > > Session 1: > > > > INSERT INTO test VALUES (2); > > COMMIT; > > > > <Session 2 query no longer hangs and successfully creates the slot2> > > > > Session 2: > > > > select pg_logical_slot_get_changes('slot1', NULL, NULL); > > > > select pg_logical_slot_get_changes('slot2', NULL, NULL); > > > > <expected: no rows of the txn are returned for slot2> > > <actual: The 2nd row of the txn is returned for slot2> > > > > I've confirmed this happens HEAD and all supported backbranches. And I > think it's a (long-standing) bug. > > I think we can prepare an isolation test of the above steps and > include it in contrib/test_decoding/specs. > > > Newly created logical replication slots initialize their restart LSN to the current insert position within the WAL andalso force a checkpoint to get the current state of the running transactions on the system. This create process will thenwait for all of the transactions within that running xact record to complete before being able to transition to the nextsnapbuild state. During this time period, if another running xact record is written and then a different logical replicationprocess decodes this running xact record, a globally accessible snapshot will be persisted to disk. > > > > Once all of the transactions from the initial running xact have finished, the process performing the slot creation willbecome unblocked and will then consume the new running xact record. The process will see a valid snapshot, restore thatsnapshot from disk, and then transition immediately to the consistent state. The slot will then set the confirmed flushLSN of the slot to the start of the next record after that running xact. > > My analysis is the same. After the slot creation, the slot's LSNs become like: > > (tx-begin) < (tx's some changes) < slot's resetart_lsn < (fx's further > some changes) < slot's confirmed_flush_lsn < (tx-commit) > > > We now have a logical replication slot that has a restart LSN after the changes of transactions that will commit afterour confirmed flushed LSN in any case where the two running xact records are not fully disjointed. Once those transactionscommit, we will then partially stream their changes. > > > > The attached fix addresses the issue by providing the snapshot builder with enough context to understand whether it isbuilding a snapshot for a slot for the first time or if this is a previously existing slot. 
This is similar to the “need_full_snapshot”parameter which is already set by the caller to control when and how the snapbuilder is allowed to becomeconsistent. > > > > With this context, the snapshot builder can always skip performing snapshot restore in order to become fully consistent.Since this issue only occurs when the the logical replication slot consumes a persisted snapshot to become fullyconsistent we can prevent the issue by disallowing this behavior. > > I basically agree with the basic idea of skipping snapshot restore > while finding the initial start point. But I'd like to hear other > opinions too. > > Another simple way to fix this is that we always set > need_full_snapshot to true when slot creation. But it would then make > the snapshot builder include non-catalog-change transactions too while > finding the initial startpoint. Which is not necessary. > > As for the proposed patch, it uses a global variable, InCreate, and > memory context callback to reset it. While I believe this is for not > breaking any ABI as this bug exists on back branches too, I think it's > better to consider how to fix it in HEAD and then how to backpatch it. > We don't necessarily need to have the same fixes for all branches all > the time. I agree to having a separate patch for HEAD in this case. How about a fix like a patch attached. Regards, Vignesh
Attachment
Re: Potential data loss due to race condition during logical replication slot creation
On Sat, Feb 3, 2024 at 12:12 PM Callahan, Drew <callaan@amazon.com> wrote: > > > I think we can prepare an isolation test of the above steps and > > include it in contrib/test_decoding/specs. > > Thanks for the suggestion. Included an isolation test for the > exact repro and a couple other tests for fully before and fully after. > These two tests don't add a whole lot of value since this > is already heavily tested elsewhere, so I'm good with removing. > However, thought it made sense to include from a perspective of > totality given the test name. Thank you for the patch for the isolation test! Since Vignesh also shared the isolation patch I've merged them and modified some comments. I've attached the patch. I'm going to merge the isolation test patch to the patch to fix the issue after we get a consensus on how to fix it. > > > Another simple way to fix this is that we always set > > need_full_snapshot to true when slot creation. But it would then make > > the snapshot builder include non-catalog-change transactions too while > > finding the initial startpoint. Which is not necessary. > > Since we're not accumulating changes anyway and we never distribute snapshots > while inconsistent it wouldn't cause any memory pressure in the ReorderBuffer. > This seems pretty contained and wouldn't require the use of a global variable > and memory context callbacks which is a bit nicer from a readability standpoint. > It's a little inefficient with memory and CPU, especially in the presence of > long running transactions, but likely would not be noticeable to most users. > > > As for the proposed patch, it uses a global variable, InCreate, and > > memory context callback to reset it. While I believe this is for not > > breaking any ABI as this bug exists on back branches too, I think it's > > better to consider how to fix it in HEAD and then how to backpatch it. > > We don't necessarily need to have the same fixes for all branches all > > the time. > > Attached a patch that is ABI breaking, but removes the need for a global variable. > This relies on pushing the context of whether we are in a create mode to the > logical decoding context, which is cleaner, but not as clean as just pushing > all the way down to the snapbuilder. I would push down to the snap builder, > but the entire object is persisted to disk. This is problematic to me since the > create flag would be persisted causing exisiting slots to restore a snapbuild > that could have the create flag set to true. This is not problematic as is, > since the variable would only be leveraged in SnapBuildRestore() which is only > called while inconsistent, but seems like a future bug waiting to happen. But IIUC in SnapBuildRestore(), we selectively restore variables from serialized SnapBuild. For instance, building_full_snapshot and initial_xmin_horizon are serialized to the disk but ignored when restoring a snapshot. I think that even if we add the in_create flag to SnapBuild, it would be ignored on snapshot restore. I think that it's cleaner if we could pass in_creat flag to AllocateSnapshotBuilder() than passing the flag down to SnapBuildFindSnapshot() from standby_decode(). > > We could always clear the flag when persisting, but that seems a bit hacky. > We could also fork the snap builder into on-disk and in-memory structs, which > would be convenient since we would not need to invalidate exisiting snapshots. > However, not sure how people feel about the fork. 
> > IIUC we need to skip the snapshot restore only when we're finding the initial start point. So probably we don't need to set InCreate when calling create_logical_replication_slot() with find_startpoint = false.
>
> Yep, that's correct. We can safely skip setting InCreate. We could move the setter function to slotfuncs.c as one way of doing this. I lean towards setting it no matter what, since it is harmless if we never try to find a decoding point and it removes the need for future callers of CreateInitDecodingContext() to determine whether it's safe to leave it false. Though I agree that InCreate may be a bigger hammer than needed and may need to be modified in the future.

I agree to set the flag no matter what. The flag is needed for finding the initial start point, but it would be harmless to set it even if the caller doesn't call DecodingContextFindStartpoint(). If the caller skips calling DecodingContextFindStartpoint(), it's the caller's responsibility to set sane LSNs on the slot before starting logical decoding. So it makes sense to me to remove the need for callers to decide whether to set the flag, as you mentioned.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
Re: Potential data loss due to race condition during logical replication slot creation
On Sat, Feb 3, 2024 at 10:48 PM vignesh C <vignesh21@gmail.com> wrote: > > On Fri, 2 Feb 2024 at 06:24, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Hi, > > > > Thank you for the report! > > > > On Fri, Feb 2, 2024 at 7:28 AM Callahan, Drew <callaan@amazon.com> wrote: > > > > > > Hello, > > > > > > We discovered a race condition during logical replication slot creation that can result in the changes for transactionsrunning at the time of the slot creation to only be partially replicated. We found the cause was due to the slottransitioning from an inconsistent or partially consistent state to a fully consistent state when restoring a snapshotthat had been persisted to disk by a different logical slot. We provide a simple reproduction of this issue below: > > > > > > Session 1: > > > > > > SELECT pg_create_logical_replication_slot('slot1', 'test_decoding'); > > > > > > CREATE TABLE test (a int); > > > BEGIN; > > > INSERT INTO test VALUES (1); > > > > > > Session 2: > > > > > > SELECT pg_create_logical_replication_slot('slot2', 'test_decoding'); > > > > > > <query hangs> > > > > > > Session 3: > > > > > > CHECKPOINT; > > > > > > select pg_logical_slot_get_changes('slot1', NULL, NULL); > > > > > > <should return nothing of interest> > > > > > > Session 1: > > > > > > INSERT INTO test VALUES (2); > > > COMMIT; > > > > > > <Session 2 query no longer hangs and successfully creates the slot2> > > > > > > Session 2: > > > > > > select pg_logical_slot_get_changes('slot1', NULL, NULL); > > > > > > select pg_logical_slot_get_changes('slot2', NULL, NULL); > > > > > > <expected: no rows of the txn are returned for slot2> > > > <actual: The 2nd row of the txn is returned for slot2> > > > > > > > I've confirmed this happens HEAD and all supported backbranches. And I > > think it's a (long-standing) bug. > > > > I think we can prepare an isolation test of the above steps and > > include it in contrib/test_decoding/specs. > > > > > Newly created logical replication slots initialize their restart LSN to the current insert position within the WALand also force a checkpoint to get the current state of the running transactions on the system. This create process willthen wait for all of the transactions within that running xact record to complete before being able to transition tothe next snapbuild state. During this time period, if another running xact record is written and then a different logicalreplication process decodes this running xact record, a globally accessible snapshot will be persisted to disk. > > > > > > Once all of the transactions from the initial running xact have finished, the process performing the slot creationwill become unblocked and will then consume the new running xact record. The process will see a valid snapshot, restorethat snapshot from disk, and then transition immediately to the consistent state. The slot will then set the confirmedflush LSN of the slot to the start of the next record after that running xact. > > > > My analysis is the same. After the slot creation, the slot's LSNs become like: > > > > (tx-begin) < (tx's some changes) < slot's resetart_lsn < (fx's further > > some changes) < slot's confirmed_flush_lsn < (tx-commit) > > > > > We now have a logical replication slot that has a restart LSN after the changes of transactions that will commit afterour confirmed flushed LSN in any case where the two running xact records are not fully disjointed. Once those transactionscommit, we will then partially stream their changes. 
> > > > > > The attached fix addresses the issue by providing the snapshot builder with enough context to understand whether itis building a snapshot for a slot for the first time or if this is a previously existing slot. This is similar to the “need_full_snapshot”parameter which is already set by the caller to control when and how the snapbuilder is allowed to becomeconsistent. > > > > > > With this context, the snapshot builder can always skip performing snapshot restore in order to become fully consistent.Since this issue only occurs when the the logical replication slot consumes a persisted snapshot to become fullyconsistent we can prevent the issue by disallowing this behavior. > > > > I basically agree with the basic idea of skipping snapshot restore > > while finding the initial start point. But I'd like to hear other > > opinions too. > > > > Another simple way to fix this is that we always set > > need_full_snapshot to true when slot creation. But it would then make > > the snapshot builder include non-catalog-change transactions too while > > finding the initial startpoint. Which is not necessary. > > > > As for the proposed patch, it uses a global variable, InCreate, and > > memory context callback to reset it. While I believe this is for not > > breaking any ABI as this bug exists on back branches too, I think it's > > better to consider how to fix it in HEAD and then how to backpatch it. > > We don't necessarily need to have the same fixes for all branches all > > the time. > > I agree to having a separate patch for HEAD in this case. How about a > fix like a patch attached. Thank you for the patch! I've just sent some comments on how to fix this issue and updated the isolation test patch[1]. While I basically agree to add the flag in SnapBuild, I think we can set the flag in AllocateSnapshotBuilder(). Also, we need to bump SNAPBUILD_VERSION. Regards, [1] https://www.postgresql.org/message-id/CAD21AoCsSkwcAYgcUMi2zKp8F%3DmoJL4K1beGD6AV_%3DpHhfAM9Q%40mail.gmail.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Potential data loss due to race condition during logical replication slot creation
On Sat, Feb 3, 2024 at 12:12 PM Callahan, Drew <callaan@amazon.com> wrote:
>
> > I think we can prepare an isolation test of the above steps and include it in contrib/test_decoding/specs.
>
> Thanks for the suggestion. I included an isolation test for the exact repro and a couple of other tests for the fully-before and fully-after cases. These two extra tests don't add a whole lot of value since this is already heavily tested elsewhere, so I'm fine with removing them. However, I thought it made sense to include them for completeness given the test name.
>
> > Another simple way to fix this is that we always set need_full_snapshot to true at slot creation. But it would then make the snapshot builder include non-catalog-change transactions too while finding the initial start point, which is not necessary.
>
> Since we're not accumulating changes anyway, and we never distribute snapshots while inconsistent, it wouldn't cause any memory pressure in the ReorderBuffer. This seems pretty contained and wouldn't require the use of a global variable and memory context callbacks, which is a bit nicer from a readability standpoint.

Sorry I missed this comment.

I think it's a good point that this fix doesn't cause any on-disk incompatibility while not affecting existing users much, especially for back-patching. If we want to choose this approach for back branches, we need to carefully consider the side effects.

> It's a little inefficient with memory and CPU, especially in the presence of long-running transactions, but likely would not be noticeable to most users.

This matches my analysis.

Here is the summary of several proposals we've discussed:

a) Have CreateInitDecodingContext() always pass need_full_snapshot = true to AllocateSnapshotBuilder().

This is the simplest approach and doesn't break the on-disk SnapBuildOnDisk compatibility. A downside is that it's a little inefficient with memory and CPU since it includes non-catalog-change transactions too. Also, passing need_full_snapshot to CreateInitDecodingContext() would no longer have any meaning.

b) Have snapbuild.c be able to handle multiple SnapBuildOnDisk versions.

This bumps SNAPBUILD_VERSION and therefore breaks the on-disk SnapBuildOnDisk compatibility, but snapbuild.c can handle multiple versions. However, since different branches are using different versions of SnapBuildOnDisk, it would probably be hard to have a SnapBuildOnDisk version that is consistent across all branches.

c) Add a global variable, say in_create, to snapbuild.c.

It would also require adding a setter function for in_create. There are several ideas about where to set/reset the flag. One idea is that we reset the flag in AllocateSnapshotBuilder() and set the flag anywhere before starting to find the start point, for example at the beginning of DecodingContextFindStartpoint(). It probably won't require a memory context callback to make sure the flag is cleared. This idea doesn't have a downside from the users' and extensions' perspective. But I'm slightly hesitant to add a global variable.

What do you think? Any other ideas?

As a separate topic, I think that this backpatching complexity comes from the fact that we're serializing a whole SnapBuild to disk although we don't need to serialize all of its fields. In order to make future back-patching easy, it might be worth considering decoupling the fields that need to be serialized from the SnapBuild struct.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
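A minimal sketch of idea (c), with illustrative names only (the flag, the setter, and where they are called from are assumptions rather than an actual patch):

/* In snapbuild.c: a file-local flag plus a setter (hypothetical names). */
static bool in_create = false;

void
SnapBuildSetInCreate(bool flag)
{
	in_create = flag;
}

/*
 * DecodingContextFindStartpoint() would call SnapBuildSetInCreate(true)
 * before its decode loop and SnapBuildSetInCreate(false) afterwards, and
 * the restore branch in SnapBuildFindSnapshot() would additionally test
 * "!in_create" before calling SnapBuildRestore(), as sketched earlier in
 * the thread.
 */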
Re: Potential data loss due to race condition during logical replication slot creation
On Thu, Feb 8, 2024 at 2:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Sat, Feb 3, 2024 at 12:12 PM Callahan, Drew <callaan@amazon.com> wrote: > > > > > I think we can prepare an isolation test of the above steps and > > > include it in contrib/test_decoding/specs. > > > > Thanks for the suggestion. Included an isolation test for the > > exact repro and a couple other tests for fully before and fully after. > > These two tests don't add a whole lot of value since this > > is already heavily tested elsewhere, so I'm good with removing. > > However, thought it made sense to include from a perspective of > > totality given the test name. > > > > > Another simple way to fix this is that we always set > > > need_full_snapshot to true when slot creation. But it would then make > > > the snapshot builder include non-catalog-change transactions too while > > > finding the initial startpoint. Which is not necessary. > > > > Since we're not accumulating changes anyway and we never distribute snapshots > > while inconsistent it wouldn't cause any memory pressure in the ReorderBuffer. > > This seems pretty contained and wouldn't require the use of a global variable > > and memory context callbacks which is a bit nicer from a readability standpoint. > > Sorry I missed this comment. > > I think it's a good point that this fix doesn't lead to any on-disk > compatibility while not affecting existing users much, especially for > back-patching. If we want to choose this approach for bank branches, > we need to carefully consider the side impacts. > > > It's a little inefficient with memory and CPU, especially in the presence of > > long running transactions, but likely would not be noticeable to most users. > > This matches my analysis. > > Here is the summary of several proposals we've discussed: > > a) Have CreateInitDecodingContext() always pass need_full_snapshot = > true to AllocateSnapshotBuilder(). > > This is the simplest approach and doesn't break the on-disk > SnapBuildOnDisck compatibility. A downside is that it's a little > inefficient with memory and CPU since it includes non-catalog change > transactions too. Also, passing need_full_snapshot to > CreateInitDecodingContext() will no longer have any meaning. > > b) Have snapbuild.c being able to handle multiple SnapBuildOnDisk versions. > > This bumps SNAPBUILD_VERSION and therefore breaks the on-disk > SnapBuildOnDisk compatibility, but snapbuild.c can handle multiple > versions. But since different branches are using different versions of > SnapBuildOnDisk, it would probably be hard to have the SnapBuildOnDisk > version that is consistent across all versions. > > c) Add a global variable, say in_create, to snapbuild.c > > it would also require adding a setter function for in_create. There > are several ideas where to set/reset the flag. One idea is that we > reset the flag in AllocateSnapshotBuilder() and set the flag anywhere > before starting to find the start point, for example at the beginning > of DecodingContextFindStartpoint(). It probably won't require a memory > context callback to make sure to clear the flag. This idea doesn't > have a downside from users and extensions perspects. But I'm slightly > hesitant to add a global variable. > > What do you think? and any other ideas? I've drafted the idea (c) for discussion (for master and v16 for now). I also liked the idea (a) but I'm concerned a bit about future impact. 
> As a separate topic, I think that this backpatching complexity comes from the fact that we're serializing a whole SnapBuild to the disk although we don't need to serialize all of its fields. In order to make future back-patching easy, it might be worth considering decoupling the fields that need to be serialized from SnapBuild struct.

If we implement this idea, we won't need to bump SNAPBUILD_VERSION for HEAD.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
RE: Potential data loss due to race condition during logical replication slot creation
Dear hackers,

While analyzing another failure [1], I ended up at this thread. I think both failures occur for the same reason. The reported failure occurs when a replication slot is created in the middle of a transaction and reuses a snapshot serialized by another slot. The reproducer is:

```
Session0
SELECT pg_create_logical_replication_slot('slot0', 'test_decoding');
BEGIN;
INSERT INTO foo ...

Session1
SELECT pg_create_logical_replication_slot('slot1', 'test_decoding');

Session2
CHECKPOINT;
SELECT pg_logical_slot_get_changes('slot0', NULL, NULL);

Session0
INSERT INTO var ... // var is defined with (user_catalog_table = true)
COMMIT;

Session1
SELECT pg_logical_slot_get_changes('slot1', NULL, NULL);
-> Assertion failure.
```

> Here is the summary of several proposals we've discussed:
> a) Have CreateInitDecodingContext() always pass need_full_snapshot = true to AllocateSnapshotBuilder().
> b) Have snapbuild.c be able to handle multiple SnapBuildOnDisk versions.
> c) Add a global variable, say in_create, to snapbuild.c.

Regarding the three options raised by Sawada-san, I prefer approach a). Since the issue can happen on all supported branches, we should choose the conservative approach. Also, it is quite painful if different branches carry different code for handling the same issue.

The attached patch implements approach a), since no one had done so yet. I also added a test which can trigger the assertion failure, but I am not sure whether it should be included.

[1]: https://www.postgresql.org/message-id/TYCPR01MB1207717063D701F597EF98A0CF5272%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Attachment
RE: Potential data loss due to race condition during logical replication slot creation
> The attached patch implements approach a), since no one had done so yet. I also added a test which can trigger the assertion failure, but I am not sure whether it should be included.

The previous attachment contained some unnecessary changes. PSA the corrected version.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Attachment
Re: Potential data loss due to race condition during logical replication slot creation
On Wed, Mar 13, 2024 at 3:17 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> > The attached patch implements approach a), since no one had done so yet. I also added a test which can trigger the assertion failure, but I am not sure whether it should be included.

I feel that setting "need_full_snapshot" to true for decoding means the snapshot will start tracking non-catalog committed xacts as well, which is costly. See SnapBuildCommitTxn(). Can we avoid this problem if we had the list of all xacts that were running when the snapshot was serialized, by not decoding any xact whose xid lies in that list? If so, one idea to achieve this could be that we maintain the highest_running_xid while serializing the snapshot, and then during restore, if that highest_running_xid is <= builder->initial_xmin_horizon, we ignore restoring the snapshot. We already have a few such cases handled in SnapBuildRestore().

--
With Regards,
Amit Kapila.
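For illustration only, the described check might look roughly like this inside SnapBuildRestore(); highest_running_xid is a hypothetical field that would have to be added to SnapBuildOnDisk at serialization time, so this is a sketch of the stated idea rather than existing code:

/*
 * Sketch only: ignore the serialized snapshot if the highest XID that was
 * running when it was written is at or below our initial xmin horizon.
 */
if (TransactionIdIsValid(ondisk.builder.highest_running_xid) &&
	TransactionIdPrecedesOrEquals(ondisk.builder.highest_running_xid,
								  builder->initial_xmin_horizon))
	return false;		/* skip restoring this snapshot */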
Re: Potential data loss due to race condition during logical replication slot creation
On Mon, Feb 19, 2024 at 2:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Feb 8, 2024 at 2:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Sat, Feb 3, 2024 at 12:12 PM Callahan, Drew <callaan@amazon.com> wrote: > > > > > > > I think we can prepare an isolation test of the above steps and > > > > include it in contrib/test_decoding/specs. > > > > > > Thanks for the suggestion. Included an isolation test for the > > > exact repro and a couple other tests for fully before and fully after. > > > These two tests don't add a whole lot of value since this > > > is already heavily tested elsewhere, so I'm good with removing. > > > However, thought it made sense to include from a perspective of > > > totality given the test name. > > > > > > > Another simple way to fix this is that we always set > > > > need_full_snapshot to true when slot creation. But it would then make > > > > the snapshot builder include non-catalog-change transactions too while > > > > finding the initial startpoint. Which is not necessary. > > > > > > Since we're not accumulating changes anyway and we never distribute snapshots > > > while inconsistent it wouldn't cause any memory pressure in the ReorderBuffer. > > > This seems pretty contained and wouldn't require the use of a global variable > > > and memory context callbacks which is a bit nicer from a readability standpoint. > > > > Sorry I missed this comment. > > > > I think it's a good point that this fix doesn't lead to any on-disk > > compatibility while not affecting existing users much, especially for > > back-patching. If we want to choose this approach for bank branches, > > we need to carefully consider the side impacts. > > > > > It's a little inefficient with memory and CPU, especially in the presence of > > > long running transactions, but likely would not be noticeable to most users. > > > > This matches my analysis. > > > > Here is the summary of several proposals we've discussed: > > > > a) Have CreateInitDecodingContext() always pass need_full_snapshot = > > true to AllocateSnapshotBuilder(). > > > > This is the simplest approach and doesn't break the on-disk > > SnapBuildOnDisck compatibility. A downside is that it's a little > > inefficient with memory and CPU since it includes non-catalog change > > transactions too. Also, passing need_full_snapshot to > > CreateInitDecodingContext() will no longer have any meaning. > > > > b) Have snapbuild.c being able to handle multiple SnapBuildOnDisk versions. > > > > This bumps SNAPBUILD_VERSION and therefore breaks the on-disk > > SnapBuildOnDisk compatibility, but snapbuild.c can handle multiple > > versions. But since different branches are using different versions of > > SnapBuildOnDisk, it would probably be hard to have the SnapBuildOnDisk > > version that is consistent across all versions. > > > > c) Add a global variable, say in_create, to snapbuild.c > > > > it would also require adding a setter function for in_create. There > > are several ideas where to set/reset the flag. One idea is that we > > reset the flag in AllocateSnapshotBuilder() and set the flag anywhere > > before starting to find the start point, for example at the beginning > > of DecodingContextFindStartpoint(). It probably won't require a memory > > context callback to make sure to clear the flag. This idea doesn't > > have a downside from users and extensions perspects. But I'm slightly > > hesitant to add a global variable. > > > > What do you think? and any other ideas? 
> I've drafted the idea (c) for discussion (for master and v16 for now). I also liked the idea (a) but I'm concerned a bit about future impact.

I agree that (c) would be better than (a) as it avoids collecting non-catalog xacts in the snapshot. However, I think we shouldn't avoid restoring the snapshot unless really required. See my previous response.

--
With Regards,
Amit Kapila.
RE: Potential data loss due to race condition during logical replication slot creation
Dear Amit,

> I feel that setting "need_full_snapshot" to true for decoding means the snapshot will start tracking non-catalog committed xacts as well, which is costly.

I think that approach was the most conservative one, since it does not require changing the snapshot version. However, I understand that you want to consider the optimized solution for HEAD first.

> See SnapBuildCommitTxn(). Can we avoid this problem if we had the list of all xacts that were running when the snapshot was serialized, by not decoding any xact whose xid lies in that list? If so, one idea to achieve this could be that we maintain the highest_running_xid while serializing the snapshot, and then during restore, if that highest_running_xid is <= builder->initial_xmin_horizon, we ignore restoring the snapshot. We already have a few such cases handled in SnapBuildRestore().

Based on this idea, I made a prototype. It passes the tests added by others and by me. What do others think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Attachment
Re: Potential data loss due to race condition during logical replication slot creation
On Tue, Mar 19, 2024 at 7:46 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> I think that approach was the most conservative one, since it does not require changing the snapshot version. However, I understand that you want to consider the optimized solution for HEAD first.

Right, let's see if we can have a solution other than always avoiding restoring snapshots during slot creation, even if that is just for HEAD.

> > See SnapBuildCommitTxn(). Can we avoid this problem if we had the list of all xacts that were running when the snapshot was serialized, by not decoding any xact whose xid lies in that list? If so, one idea to achieve this could be that we maintain the highest_running_xid while serializing the snapshot, and then during restore, if that highest_running_xid is <= builder->initial_xmin_horizon, we ignore restoring the snapshot. We already have a few such cases handled in SnapBuildRestore().
>
> Based on this idea, I made a prototype. It passes the tests added by others and by me. What do others think?

Won't it be possible to achieve the same thing if we just save (serialize) the highest xid among all running xacts?

--
With Regards,
Amit Kapila.
RE: Potential data loss due to race condition during logical replication slot creation
Dear Amit,

> Won't it be possible to achieve the same thing if we just save (serialize) the highest xid among all running xacts?

Indeed; here is an updated version. Since the array in xl_running_xacts is not ordered, its entries must be scanned to find the highest one.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
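For reference, that scan is just a wraparound-aware maximum over the record's XID array; a minimal sketch, assuming it runs somewhere the xl_running_xacts record "running" is in scope:

/*
 * Sketch: find the highest XID among the (unsorted) entries of an
 * xl_running_xacts record.  TransactionIdFollows() is used because plain
 * integer comparison is not valid across XID wraparound.
 */
TransactionId highest_running_xid = InvalidTransactionId;

for (int i = 0; i < running->xcnt + running->subxcnt; i++)
{
	TransactionId xid = running->xids[i];

	if (!TransactionIdIsValid(highest_running_xid) ||
		TransactionIdFollows(xid, highest_running_xid))
		highest_running_xid = xid;
}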
Attachment
Re: Potential data loss due to race condition during logical replication slot creation
On Mon, Mar 18, 2024 at 6:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> If so, one idea to achieve this could be that we maintain the highest_running_xid while serializing the snapshot, and then during restore, if that highest_running_xid is <= builder->initial_xmin_horizon, we ignore restoring the snapshot. We already have a few such cases handled in SnapBuildRestore().

I think that builder->initial_xmin_horizon could be older than highest_running_xid, for example, when there is a logical replication slot whose catalog_xmin is old. However, even in this case, we might need to ignore restoring the snapshot. For example, a slightly modified test case can still cause the same problem.

The test case in Kuroda-san's v2 patch:

permutation "s0_init" "s0_begin" "s0_insert1" "s1_init" "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit" "s1_get_changes_slot0" "s1_get_changes_slot1"

Modified test case (add "s0_insert1" between "s0_init" and "s0_begin"):

permutation "s0_init" "s0_insert1" "s0_begin" "s0_insert1" "s1_init" "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit" "s1_get_changes_slot0" "s1_get_changes_slot1"

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
RE: Potential data loss due to race condition during logical replication slot creation
Dear Sawada-san,

Thanks for the comments and sorry for the late reply.

> > If so, one idea to achieve this could be that we maintain the highest_running_xid while serializing the snapshot, and then during restore, if that highest_running_xid is <= builder->initial_xmin_horizon, we ignore restoring the snapshot. We already have a few such cases handled in SnapBuildRestore().
>
> I think that builder->initial_xmin_horizon could be older than highest_running_xid, for example, when there is a logical replication slot whose catalog_xmin is old. However, even in this case, we might need to ignore restoring the snapshot. For example, a slightly modified test case can still cause the same problem.
>
> The test case in Kuroda-san's v2 patch:
> permutation "s0_init" "s0_begin" "s0_insert1" "s1_init" "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit" "s1_get_changes_slot0" "s1_get_changes_slot1"
>
> Modified test case (add "s0_insert1" between "s0_init" and "s0_begin"):
> permutation "s0_init" "s0_insert1" "s0_begin" "s0_insert1" "s1_init" "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit" "s1_get_changes_slot0" "s1_get_changes_slot1"

Good catch. I confirmed that this variant still outputs the transaction partially:

```
step s1_get_changes_slot1: SELECT data FROM pg_logical_slot_get_changes('slot1', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
data
-----------------------------------------
BEGIN
table public.tbl: INSERT: val1[integer]:2
COMMIT
(3 rows)
```

/////

I analyzed it a bit. In the descriptions below, I use these notations:

txn0 - has a single insert
txn1 - has begin-insert-commit operations; slot1 is created in parallel

While creating slot0, initial_xmin_horizon was set, and it comes from GetOldestSafeDecodingTransactionId(). Since there were no slots in the database before, the value was just the next xid.

Then the pinned xid was assigned to txn0. This is correct behavior because slot0 must decode txn0.

Finally, while creating slot1 after txn1 had started, the snapshot serialized by slot0 was restored. txn1 was recorded as the highest running xid, but initial_xmin_horizon was older than it, so the snapshot was restored and txn1 was partially decoded:

initial_xmin_horizon (txn0) < highest_running_xid (txn1)

/////

So, how should we fix this? One idea, based on (c), adds a flag which indicates the existence of concurrent transactions; the restoration is skipped if both flags are true. PSA the PoC patch.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Attachment
Re: Potential data loss due to race condition during logical replication slot creation
On Wed, Mar 27, 2024 at 4:54 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Sawada-san, > > Thanks for giving comments and sorry for late reply. > > > > If so, one idea to > > > achieve could be that we maintain the highest_running_xid while > > > serailizing the snapshot and then during restore if that > > > highest_running_xid is <= builder->initial_xmin_horizon, then we > > > ignore restoring the snapshot. We already have few such cases handled > > > in SnapBuildRestore(). > > > > I think that builder->initial_xmin_horizon could be older than > > highest_running_xid, for example, when there is a logical replication > > slot whose catalog_xmin is old. However, even in this case, we might > > need to ignore restoring the snapshot. For example, a slightly > > modified test case still can cause the same problem. > > > > The test case in the Kuroda-san's v2 patch: > > permutation "s0_init" "s0_begin" "s0_insert1" "s1_init" > > "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit" > > "s1_get_changes_slot0"\ "s1_get_changes_slot1" > > > > Modified-version test case (add "s0_insert1" between "s0_init" and "s0_begin"): > > permutation "s0_init" "s0_insert1" "s0_begin" "s0_insert1" "s1_init" > > "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit" > > "s1_get_changes_slot0\ " "s1_get_changes_slot1" > > Good catch, I confirmed that the variant still partially output transactions: > > ``` > step s1_get_changes_slot1: SELECT data FROM pg_logical_slot_get_changes('slot1', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids','0'); > data > ----------------------------------------- > BEGIN > table public.tbl: INSERT: val1[integer]:2 > COMMIT > (3 rows) > ``` > > ///// > > I analyzed a bit. In below descriptions, I uses these notations: > > txn0 - has a single insert > txn1 - has begin-insert-commit operations. > slot1 is created in parallel > > While creating slot0, initial_xmin_horizon was also set, and it come from > GetOldestSafeDecodingTransactionId(). > Since there were no slots in the database before, the value was just a next xid. > > Then, the pinned xid was assigned to the txn0. > It is correct behavior because slot0 must decode txn0. > > Finally, while creating slot1 after the txn1, the snapshot serialized by slot0 was > restored once. txn1 was the recorded as the highest_running, but initial_xmin was > older than it. So the snapshot was restored and txn1 was partially decoded. > > initial_xmin_horizon (txn0) < highest_running_xid (txn1) > > ///// > > So, how should we fix? One idea is based on (c), and adds a flag which indicates > an existence of concurrent transactions. The restoration is skipped if both two > flags are true. PSA the PoC patch. > With the PoC patch, we check ondisk.builder.is_there_running_xact in SnapBuildRestore(), but can we just check running->xcnt in SnapBuildFindSnapshot() to skip calling SnapBuildRestore()? That is, if builder->initial_xmin_horizon is valid (or builder->finding_start_point is true) and running->xcnt > 0, we skip the snapshot restore. However, I think there are still cases where we unnecessarily skip snapshot restores. Probably, what we would like to avoid is, we compute initial_xmin_horizon and start to find the initial start point while there is a concurrently running transaction, and then jump to the consistent state by restoring the consistent snapshot before the concurrent transaction commits. 
So we can ignore snapshot restores if (oldest XID among transactions running at the time of CreateInitDecodingContext()) >= (OldestRunningXID in xl_running_xacts).

I've drafted this idea in the attached patch just for discussion.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
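A minimal sketch of that condition at the point where SnapBuildFindSnapshot() would otherwise call SnapBuildRestore(); "initial_running_xid" is a hypothetical field remembering the oldest XID that was running when CreateInitDecodingContext() was called, so the actual draft patch may well differ:

/*
 * Sketch only: skip the restore when a transaction at least as old as the
 * oldest one running at CreateInitDecodingContext() time is still listed
 * as running in the xl_running_xacts record being decoded.
 */
bool		skip_restore;

skip_restore = TransactionIdIsValid(builder->initial_running_xid) &&
	TransactionIdFollowsOrEquals(builder->initial_running_xid,
								 running->oldestRunningXid);

if (!skip_restore && SnapBuildRestore(builder, lsn))
	return false;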
Attachment
RE: Potential data loss due to race condition during logical replication slot creation
Dear Sawada-san,

> With the PoC patch, we check ondisk.builder.is_there_running_xact in SnapBuildRestore(),

Yes, the PoC requires that the snapshot state in the file be read.

> but can we just check running->xcnt in SnapBuildFindSnapshot() to skip calling SnapBuildRestore()? That is, if builder->initial_xmin_horizon is valid (or builder->finding_start_point is true) and running->xcnt > 0, we skip the snapshot restore.

IIUC, it does not require modifications to the API, which may be an advantage.

> However, I think there are still cases where we unnecessarily skip snapshot restores.
>
> Probably, what we would like to avoid is: we compute initial_xmin_horizon and start to find the initial start point while there is a concurrently running transaction, and then jump to the consistent state by restoring the consistent snapshot before the concurrent transaction commits.

Yeah, information from before the concurrent txns have committed should not be used. I think that's why SnapBuildWaitSnapshot() waits until the listed transactions have finished.

> So we can ignore snapshot restores if (oldest XID among transactions running at the time of CreateInitDecodingContext()) >= (OldestRunningXID in xl_running_xacts).
>
> I've drafted this idea in the attached patch just for discussion.

Thanks for sharing the patch. I confirmed that all the tests and the workload you pointed out in [1] pass. I will post here if I find other issues.

[1]: https://www.postgresql.org/message-id/CAD21AoDzLY9vRpo%2Bxb2qPtfn46ikiULPXDpT94sPyFH4GE8bYg%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Re: Potential data loss due to race condition during logical replication slot creation
On Mon, Mar 18, 2024 at 6:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Feb 19, 2024 at 2:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Feb 8, 2024 at 2:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Sat, Feb 3, 2024 at 12:12 PM Callahan, Drew <callaan@amazon.com> wrote: > > > > > > > > > I think we can prepare an isolation test of the above steps and > > > > > include it in contrib/test_decoding/specs. > > > > > > > > Thanks for the suggestion. Included an isolation test for the > > > > exact repro and a couple other tests for fully before and fully after. > > > > These two tests don't add a whole lot of value since this > > > > is already heavily tested elsewhere, so I'm good with removing. > > > > However, thought it made sense to include from a perspective of > > > > totality given the test name. > > > > > > > > > Another simple way to fix this is that we always set > > > > > need_full_snapshot to true when slot creation. But it would then make > > > > > the snapshot builder include non-catalog-change transactions too while > > > > > finding the initial startpoint. Which is not necessary. > > > > > > > > Since we're not accumulating changes anyway and we never distribute snapshots > > > > while inconsistent it wouldn't cause any memory pressure in the ReorderBuffer. > > > > This seems pretty contained and wouldn't require the use of a global variable > > > > and memory context callbacks which is a bit nicer from a readability standpoint. > > > > > > Sorry I missed this comment. > > > > > > I think it's a good point that this fix doesn't lead to any on-disk > > > compatibility while not affecting existing users much, especially for > > > back-patching. If we want to choose this approach for bank branches, > > > we need to carefully consider the side impacts. > > > > > > > It's a little inefficient with memory and CPU, especially in the presence of > > > > long running transactions, but likely would not be noticeable to most users. > > > > > > This matches my analysis. > > > > > > Here is the summary of several proposals we've discussed: > > > > > > a) Have CreateInitDecodingContext() always pass need_full_snapshot = > > > true to AllocateSnapshotBuilder(). > > > > > > This is the simplest approach and doesn't break the on-disk > > > SnapBuildOnDisck compatibility. A downside is that it's a little > > > inefficient with memory and CPU since it includes non-catalog change > > > transactions too. Also, passing need_full_snapshot to > > > CreateInitDecodingContext() will no longer have any meaning. > > > > > > b) Have snapbuild.c being able to handle multiple SnapBuildOnDisk versions. > > > > > > This bumps SNAPBUILD_VERSION and therefore breaks the on-disk > > > SnapBuildOnDisk compatibility, but snapbuild.c can handle multiple > > > versions. But since different branches are using different versions of > > > SnapBuildOnDisk, it would probably be hard to have the SnapBuildOnDisk > > > version that is consistent across all versions. > > > > > > c) Add a global variable, say in_create, to snapbuild.c > > > > > > it would also require adding a setter function for in_create. There > > > are several ideas where to set/reset the flag. One idea is that we > > > reset the flag in AllocateSnapshotBuilder() and set the flag anywhere > > > before starting to find the start point, for example at the beginning > > > of DecodingContextFindStartpoint(). It probably won't require a memory > > > context callback to make sure to clear the flag. 
This idea doesn't > > > have a downside from users and extensions perspects. But I'm slightly > > > hesitant to add a global variable. > > > > > > What do you think? and any other ideas? > > > > I've drafted the idea (c) for discussion (for master and v16 for now). > > I also liked the idea (a) but I'm concerned a bit about future impact. > > > > I agree that (c) would be better than (a) as it avoids collecting > non-catalog xacts in snapshot. However, I think we shouldn't avoid > restoring the snapshot unless really required. Pondering further, I came across the question; in what case we would need to restore the snapshot and jump to the consistent state when finding the initial start point? When creating a slot, in ReplicationSlotReserveWal(), we determine the restart_lsn and write a RUNNING_XACT record. This record would be the first RUNNING_XACT record that the logical decoding decodes in most cases. In SnapBuildFindSnapshot(), if the record satisfies (a) (i.e. running->oldestRunningXid == running->nextXid), it can jump to the consistent state. If not, it means there were running transactions at that time of the RUNNING_XACT record being produced. Therefore, we must not restore the snapshot and jump to the consistent state. Because otherwise, we end up deciding the start point in the middle of a transaction that started before the RUNNING_XACT record. Probably the same is true for all subsequent snapbuild states. I might be missing something but I could not find the case where we can or want to restore the serialized snapshot when finding the initial start point. If my analysis is correct, we can go with either (a) or (c) I proposed before[1]. Between these two options, it also could be an option that (a) is for backbranches for safety and (c) is for master. Regards, [1] https://www.postgresql.org/message-id/CAD21AoBH4NWdRcEjXpBFHSKuVO6wia-vHHHaKuEi-h4i4wbi8A%40mail.gmail.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
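To make the "case (a)" test above concrete, here is a minimal sketch in C; it is not the snapbuild.c source, only an illustration built on the xl_running_xacts fields named in the message:

    #include "postgres.h"
    #include "access/transam.h"
    #include "storage/standbydefs.h"    /* xl_running_xacts */

    /*
     * Sketch of the "case (a)" test discussed above: if the RUNNING_XACTS
     * record shows that no transaction was in progress when it was logged,
     * the snapshot builder can become consistent immediately, without
     * restoring any serialized snapshot.
     */
    static bool
    can_jump_to_consistent(xl_running_xacts *running)
    {
        return TransactionIdEquals(running->oldestRunningXid, running->nextXid);
    }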
Re: Potential data loss due to race condition during logical replication slot creation
On Tue, Apr 16, 2024 at 9:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Pondering further, I came across the question; in what case we would > need to restore the snapshot and jump to the consistent state when > finding the initial start point? > > When creating a slot, in ReplicationSlotReserveWal(), we determine the > restart_lsn and write a RUNNING_XACT record. This record would be the > first RUNNING_XACT record that the logical decoding decodes in most > cases. > This probably won't be true on standby as we don't LOG RUNNING_XACT record in that case. > In SnapBuildFindSnapshot(), if the record satisfies (a) (i.e. > running->oldestRunningXid == running->nextXid), it can jump to the > consistent state. If not, it means there were running transactions at > that time of the RUNNING_XACT record being produced. Therefore, we > must not restore the snapshot and jump to the consistent state. > Because otherwise, we end up deciding the start point in the middle of > a transaction that started before the RUNNING_XACT record. Probably > the same is true for all subsequent snapbuild states. > One thing to consider on this is that we can only restore the snapshot if by that time some other get changes (in your case, step "s2_get_changes_slot0" as in your draft patch) would have serialized the consistent snapshot at that LSN. One could question if we can't reach a consistent state at a particular LSN then why in the first place the snapshot has been serialized at that LSN? The answer could be that it is okay to use such a serialized snapshot after initial snapshot creation because we know that the restart_lsn of a slot in such cases would be a location where we won't see the data for partial transactions. This is because, after the very first time (after the initdecodingcontext), the restart_lsn would be set to a location after we reach the consistent point. > I might be missing something but I could not find the case where we > can or want to restore the serialized snapshot when finding the > initial start point. If my analysis is correct, we can go with either > (a) or (c) I proposed before[1]. Between these two options, it also > could be an option that (a) is for backbranches for safety and (c) is > for master. > The approach (a) has a downside, it will lead to tracking more transactions (non-catalog) than required without any benefit for the user. Considering that is true, I wouldn't prefer that approach. -- With Regards, Amit Kapila.
Re: Potential data loss due to race condition during logical replication slot creation
On Thu, Jun 20, 2024 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Apr 16, 2024 at 9:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Pondering further, I came across the question; in what case we would > > need to restore the snapshot and jump to the consistent state when > > finding the initial start point? > > > > When creating a slot, in ReplicationSlotReserveWal(), we determine the > > restart_lsn and write a RUNNING_XACT record. This record would be the > > first RUNNING_XACT record that the logical decoding decodes in most > > cases. > > > > This probably won't be true on standby as we don't LOG RUNNING_XACT > record in that case. Yes, on the standby we wait for the RUNNING_XACT record before starting actual decoding. > > > In SnapBuildFindSnapshot(), if the record satisfies (a) (i.e. > > running->oldestRunningXid == running->nextXid), it can jump to the > > consistent state. If not, it means there were running transactions at > > that time of the RUNNING_XACT record being produced. Therefore, we > > must not restore the snapshot and jump to the consistent state. > > Because otherwise, we end up deciding the start point in the middle of > > a transaction that started before the RUNNING_XACT record. Probably > > the same is true for all subsequent snapbuild states. > > > > One thing to consider on this is that we can only restore the snapshot > if by that time some other get changes (in your case, step > "s2_get_changes_slot0" as in your draft patch) would have serialized > the consistent snapshot at that LSN. One could question if we can't > reach a consistent state at a particular LSN then why in the first > place the snapshot has been serialized at that LSN? The answer could > be that it is okay to use such a serialized snapshot after initial > snapshot creation because we know that the restart_lsn of a slot in > such cases would be a location where we won't see the data for partial > transactions. True. After the CreateInitDecodingContext() the restart_lsn is just an arbitrary LSN in a sense; either the latest replay LSN or the current insertion LSN. We need to find an LSN for the start point that corresponds to that restart_lsn. So I think we need to find the start point while disabling snapshot restores. > > > I might be missing something but I could not find the case where we > > can or want to restore the serialized snapshot when finding the > > initial start point. If my analysis is correct, we can go with either > > (a) or (c) I proposed before[1]. Between these two options, it also > > could be an option that (a) is for backbranches for safety and (c) is > > for master. > > > > The approach (a) has a downside, it will lead to tracking more > transactions (non-catalog) than required without any benefit for the > user. Considering that is true, I wouldn't prefer that approach. Yes, it will lead to tracking non-catalog-change transactions as well. If there are many subtransactions, the overhead could be noticeable. But it happens only once when creating a slot. Another variant of (a) is that we skip snapshot restores if the initial_xmin_hirizon is a valid transaction id. The initia_xmin_horizon is always set to a valida transaction id when initializing the decoding context, e.g. during CreateInitDecodingContext(). That way, we don't need to track non-catalog-change transctions. 
A downside is that this approach assumes that DecodingContextFindStartpoint() is called with the decoding context created by CreateInitDecodingContext(), which is true in the core code, but might not be true in third-party extensions. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
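As an illustration of the initial_xmin_horizon variant described above, a minimal sketch follows; the SnapBuildStandIn struct is a stand-in, since the real SnapBuild is private to snapbuild.c:

    #include "postgres.h"
    #include "access/transam.h"

    /* Stand-in for snapbuild.c's private SnapBuild; only the relevant field. */
    typedef struct SnapBuildStandIn
    {
        TransactionId initial_xmin_horizon;
    } SnapBuildStandIn;

    /*
     * Variant of approach (a): a valid initial_xmin_horizon is only ever set
     * via CreateInitDecodingContext(), i.e. during slot creation, so
     * restoring a serialized snapshot would be skipped in exactly that case.
     */
    static bool
    may_restore_serialized_snapshot(SnapBuildStandIn *builder)
    {
        return !TransactionIdIsValid(builder->initial_xmin_horizon);
    }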
Re: Potential data loss due to race condition during logical replication slot creation
On Fri, Jun 21, 2024 at 12:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > The approach (a) has a downside, it will lead to tracking more > > transactions (non-catalog) than required without any benefit for the > > user. Considering that is true, I wouldn't prefer that approach. > > Yes, it will lead to tracking non-catalog-change transactions as well. > If there are many subtransactions, the overhead could be noticeable. > But it happens only once when creating a slot. > True, but it doesn't seem advisable to add such an overhead even during create time without any concrete reason. > Another variant of (a) is that we skip snapshot restores if the > initial_xmin_hirizon is a valid transaction id. The > initia_xmin_horizon is always set to a valida transaction id when > initializing the decoding context, e.g. during > CreateInitDecodingContext(). That way, we don't need to track > non-catalog-change transctions. A downside is that this approach > assumes that DecodingContextFindStartpoint() is called with the > decoding context created by CreateInitDecodingContxt(), which is true > in the core codes, but might not be true in third party extensions. > I think it is better to be explicit in this case rather than relying on initia_xmin_horizon. So, we can store in_create/create_in_progress flag in the Snapbuild in HEAD and store it in LogicalDecodingContext in back branches. I think changing SnapBuild means we have to update SNAPBUILD_VERSION, right? Is that a good idea to do at this point of time or shall we wait new branch to open and change it there? Anyway, it would be a few days away and in the meantime, we can review and keep the patches ready. -- With Regards, Amit Kapila.
Re: Potential data loss due to race condition during logical replication slot creation
On Mon, Jun 24, 2024 at 12:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jun 21, 2024 at 12:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > The approach (a) has a downside, it will lead to tracking more > > > transactions (non-catalog) than required without any benefit for the > > > user. Considering that is true, I wouldn't prefer that approach. > > > > Yes, it will lead to tracking non-catalog-change transactions as well. > > If there are many subtransactions, the overhead could be noticeable. > > But it happens only once when creating a slot. > > > > True, but it doesn't seem advisable to add such an overhead even > during create time without any concrete reason. > > > Another variant of (a) is that we skip snapshot restores if the > > initial_xmin_hirizon is a valid transaction id. The > > initia_xmin_horizon is always set to a valida transaction id when > > initializing the decoding context, e.g. during > > CreateInitDecodingContext(). That way, we don't need to track > > non-catalog-change transctions. A downside is that this approach > > assumes that DecodingContextFindStartpoint() is called with the > > decoding context created by CreateInitDecodingContxt(), which is true > > in the core codes, but might not be true in third party extensions. > > > > I think it is better to be explicit in this case rather than relying > on initia_xmin_horizon. So, we can store in_create/create_in_progress > flag in the Snapbuild in HEAD and store it in LogicalDecodingContext > in back branches. I think we cannot access the flag in LogicalDecodingContext from snapbuild.c at least in backbranches. I've discussed adding such a flag in snapbuild.c as a global variable, but I'm slightly hesitant to add a global variable besides InitialRunningXacts. > I think changing SnapBuild means we have to update > SNAPBUILD_VERSION, right? Is that a good idea to do at this point of > time or shall we wait new branch to open and change it there? Anyway, > it would be a few days away and in the meantime, we can review and > keep the patches ready. I think we should wait to add such changes that break on-disk compatibility until a new branch opens. On HEAD, I think we can add a new flag in SnapBuild and set it during say DecodingContextFindStartpoint(). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Potential data loss due to race condition during logical replication slot creation
On Mon, Jun 24, 2024 at 10:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Jun 24, 2024 at 12:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Jun 21, 2024 at 12:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > The approach (a) has a downside, it will lead to tracking more > > > > transactions (non-catalog) than required without any benefit for the > > > > user. Considering that is true, I wouldn't prefer that approach. > > > > > > Yes, it will lead to tracking non-catalog-change transactions as well. > > > If there are many subtransactions, the overhead could be noticeable. > > > But it happens only once when creating a slot. > > > > > > > True, but it doesn't seem advisable to add such an overhead even > > during create time without any concrete reason. > > > > > Another variant of (a) is that we skip snapshot restores if the > > > initial_xmin_hirizon is a valid transaction id. The > > > initia_xmin_horizon is always set to a valida transaction id when > > > initializing the decoding context, e.g. during > > > CreateInitDecodingContext(). That way, we don't need to track > > > non-catalog-change transctions. A downside is that this approach > > > assumes that DecodingContextFindStartpoint() is called with the > > > decoding context created by CreateInitDecodingContxt(), which is true > > > in the core codes, but might not be true in third party extensions. > > > > > > > I think it is better to be explicit in this case rather than relying > > on initia_xmin_horizon. So, we can store in_create/create_in_progress > > flag in the Snapbuild in HEAD and store it in LogicalDecodingContext > > in back branches. > > I think we cannot access the flag in LogicalDecodingContext from > snapbuild.c at least in backbranches. I've discussed adding such a > flag in snapbuild.c as a global variable, but I'm slightly hesitant to > add a global variable besides InitialRunningXacts. > I agree that adding a global variable is not advisable. Can we pass the flag stored in LogicalDecodingContext to snapbuild.c? That might not be elegant but I don't have any better ideas. > > I think changing SnapBuild means we have to update > > SNAPBUILD_VERSION, right? Is that a good idea to do at this point of > > time or shall we wait new branch to open and change it there? Anyway, > > it would be a few days away and in the meantime, we can review and > > keep the patches ready. > > I think we should wait to add such changes that break on-disk > compatibility until a new branch opens. On HEAD, I think we can add a > new flag in SnapBuild and set it during say > DecodingContextFindStartpoint(). > Fair enough. -- With Regards, Amit Kapila.
Re: Potential data loss due to race condition during logical replication slot creation
On Tue, Jun 25, 2024 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jun 24, 2024 at 10:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Jun 24, 2024 at 12:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Fri, Jun 21, 2024 at 12:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > > The approach (a) has a downside, it will lead to tracking more > > > > > transactions (non-catalog) than required without any benefit for the > > > > > user. Considering that is true, I wouldn't prefer that approach. > > > > > > > > Yes, it will lead to tracking non-catalog-change transactions as well. > > > > If there are many subtransactions, the overhead could be noticeable. > > > > But it happens only once when creating a slot. > > > > > > > > > > True, but it doesn't seem advisable to add such an overhead even > > > during create time without any concrete reason. > > > > > > > Another variant of (a) is that we skip snapshot restores if the > > > > initial_xmin_hirizon is a valid transaction id. The > > > > initia_xmin_horizon is always set to a valida transaction id when > > > > initializing the decoding context, e.g. during > > > > CreateInitDecodingContext(). That way, we don't need to track > > > > non-catalog-change transctions. A downside is that this approach > > > > assumes that DecodingContextFindStartpoint() is called with the > > > > decoding context created by CreateInitDecodingContxt(), which is true > > > > in the core codes, but might not be true in third party extensions. > > > > > > > > > > I think it is better to be explicit in this case rather than relying > > > on initia_xmin_horizon. So, we can store in_create/create_in_progress > > > flag in the Snapbuild in HEAD and store it in LogicalDecodingContext > > > in back branches. > > > > I think we cannot access the flag in LogicalDecodingContext from > > snapbuild.c at least in backbranches. I've discussed adding such a > > flag in snapbuild.c as a global variable, but I'm slightly hesitant to > > add a global variable besides InitialRunningXacts. > > > > I agree that adding a global variable is not advisable. Can we pass > the flag stored in LogicalDecodingContext to snapbuild.c? Ah, I found a good path: snapbuild->reorder->private_data (storing a pointer to a LogicalDecodingContext). This assumes private_data always stores a pointer to a LogicalDecodingContext but I think that's find at least for backbranches. I've attached the patch for this idea for PG16. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
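A sketch of the private_data route agreed on above; it assumes, as the thread states, that core always stores the owning LogicalDecodingContext in reorder->private_data, and that the in_create flag is the field added by the back-branch patch rather than an existing member:

    #include "postgres.h"
    #include "replication/logical.h"
    #include "replication/reorderbuffer.h"

    /*
     * Resolve the LogicalDecodingContext that owns a ReorderBuffer.  Core
     * code always stores the decoding context in reorder->private_data,
     * which is what makes this back-branch approach possible without
     * changing SnapBuild.
     */
    static bool
    creating_slot_via_private_data(ReorderBuffer *reorder)
    {
        LogicalDecodingContext *ctx = (LogicalDecodingContext *) reorder->private_data;

        /* in_create is the flag added by the back-branch patch */
        return ctx != NULL && ctx->in_create;
    }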
Re: Potential data loss due to race condition during logical replication slot creation
On Tue, Jun 25, 2024 at 1:56 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Jun 25, 2024 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Jun 24, 2024 at 10:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Mon, Jun 24, 2024 at 12:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Fri, Jun 21, 2024 at 12:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > > > > The approach (a) has a downside, it will lead to tracking more > > > > > > transactions (non-catalog) than required without any benefit for the > > > > > > user. Considering that is true, I wouldn't prefer that approach. > > > > > > > > > > Yes, it will lead to tracking non-catalog-change transactions as well. > > > > > If there are many subtransactions, the overhead could be noticeable. > > > > > But it happens only once when creating a slot. > > > > > > > > > > > > > True, but it doesn't seem advisable to add such an overhead even > > > > during create time without any concrete reason. > > > > > > > > > Another variant of (a) is that we skip snapshot restores if the > > > > > initial_xmin_hirizon is a valid transaction id. The > > > > > initia_xmin_horizon is always set to a valida transaction id when > > > > > initializing the decoding context, e.g. during > > > > > CreateInitDecodingContext(). That way, we don't need to track > > > > > non-catalog-change transctions. A downside is that this approach > > > > > assumes that DecodingContextFindStartpoint() is called with the > > > > > decoding context created by CreateInitDecodingContxt(), which is true > > > > > in the core codes, but might not be true in third party extensions. > > > > > > > > > > > > > I think it is better to be explicit in this case rather than relying > > > > on initia_xmin_horizon. So, we can store in_create/create_in_progress > > > > flag in the Snapbuild in HEAD and store it in LogicalDecodingContext > > > > in back branches. > > > > > > I think we cannot access the flag in LogicalDecodingContext from > > > snapbuild.c at least in backbranches. I've discussed adding such a > > > flag in snapbuild.c as a global variable, but I'm slightly hesitant to > > > add a global variable besides InitialRunningXacts. > > > > > > > I agree that adding a global variable is not advisable. Can we pass > > the flag stored in LogicalDecodingContext to snapbuild.c? > > Ah, I found a good path: snapbuild->reorder->private_data (storing a > pointer to a LogicalDecodingContext). This assumes private_data always > stores a pointer to a LogicalDecodingContext but I think that's find > at least for backbranches. > +1. This approach looks good to me. -- With Regards, Amit Kapila.
Re: Potential data loss due to race condition during logical replication slot creation
On Wed, Jun 26, 2024 at 6:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jun 25, 2024 at 1:56 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Jun 25, 2024 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Jun 24, 2024 at 10:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > On Mon, Jun 24, 2024 at 12:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Fri, Jun 21, 2024 at 12:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > > > > > > The approach (a) has a downside, it will lead to tracking more > > > > > > > transactions (non-catalog) than required without any benefit for the > > > > > > > user. Considering that is true, I wouldn't prefer that approach. > > > > > > > > > > > > Yes, it will lead to tracking non-catalog-change transactions as well. > > > > > > If there are many subtransactions, the overhead could be noticeable. > > > > > > But it happens only once when creating a slot. > > > > > > > > > > > > > > > > True, but it doesn't seem advisable to add such an overhead even > > > > > during create time without any concrete reason. > > > > > > > > > > > Another variant of (a) is that we skip snapshot restores if the > > > > > > initial_xmin_hirizon is a valid transaction id. The > > > > > > initia_xmin_horizon is always set to a valida transaction id when > > > > > > initializing the decoding context, e.g. during > > > > > > CreateInitDecodingContext(). That way, we don't need to track > > > > > > non-catalog-change transctions. A downside is that this approach > > > > > > assumes that DecodingContextFindStartpoint() is called with the > > > > > > decoding context created by CreateInitDecodingContxt(), which is true > > > > > > in the core codes, but might not be true in third party extensions. > > > > > > > > > > > > > > > > I think it is better to be explicit in this case rather than relying > > > > > on initia_xmin_horizon. So, we can store in_create/create_in_progress > > > > > flag in the Snapbuild in HEAD and store it in LogicalDecodingContext > > > > > in back branches. > > > > > > > > I think we cannot access the flag in LogicalDecodingContext from > > > > snapbuild.c at least in backbranches. I've discussed adding such a > > > > flag in snapbuild.c as a global variable, but I'm slightly hesitant to > > > > add a global variable besides InitialRunningXacts. > > > > > > > > > > I agree that adding a global variable is not advisable. Can we pass > > > the flag stored in LogicalDecodingContext to snapbuild.c? > > > > Ah, I found a good path: snapbuild->reorder->private_data (storing a > > pointer to a LogicalDecodingContext). This assumes private_data always > > stores a pointer to a LogicalDecodingContext but I think that's find > > at least for backbranches. > > > > +1. This approach looks good to me. > I've attached updated patches for HEAD and pg17 for now (I will create the patch for other backbranches). In the patches, I used a different approach in between HEAD and backbranches. On HEAD, we store a flag indicating whether or not we should skip snapshot restores into the SnapBuild struct and set it only while finding the start point. Therefore we have to bump SNAPBUILD_VERSION. On backbranches, I used the approach we discussed above; store the flag in LogicalDecodingContext and set it when creating the LogicalDecodingContext for a new logical slot. 
A possible downside of the approach taken for backbranches is that we implicitly require users to use the same LogicalDecodingContext for both initializing the context for a new slot and finding its start point. IIUC that was not strictly required before. This restriction would not be a problem at least in core, but I'm not sure whether any external extensions create a logical slot in that way. This is the main reason why I used a different approach on HEAD and backbranches. Therefore, if it does not matter, I think we can use the same approach on all branches, which is better in terms of maintainability. Also, I excluded the test case for the problem that Kuroda-san reported[1] since the reported problem has the same root cause as the problem originally reported on this thread. The bug that decodes only partial transactions could lead to various symptoms. IIUC these test cases test the same behavior. Regards, [1] https://www.postgresql.org/message-id/TYCPR01MB1207719C811F580A8774C79B7F52A2%40TYCPR01MB12077.jpnprd01.prod.outlook.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
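For readers skimming the thread, a side-by-side sketch of the two placements described above; field names follow the posted patches and the surrounding members are elided, so this is illustrative rather than the actual hunks:

    #include "postgres.h"

    /*
     * HEAD / v17: the flag lives in snapbuild.c's private SnapBuild.  Since
     * the whole SnapBuild is part of the serialized on-disk state, this is
     * what forces the SNAPBUILD_VERSION bump.
     */
    typedef struct SnapBuildSketch
    {
        bool        in_slot_creation;   /* true while creating the slot */
        /* ... existing fields elided ... */
    } SnapBuildSketch;

    /*
     * Back branches: the flag lives in the public LogicalDecodingContext,
     * which leaves the on-disk snapshot format untouched.
     */
    typedef struct LogicalDecodingContextSketch
    {
        bool        in_create;          /* set when creating a new logical slot */
        /* ... existing fields elided ... */
    } LogicalDecodingContextSketch;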
Re: Potential data loss due to race condition during logical replication slot creation
On Fri, Jul 05, 2024 at 11:52:52PM +0900, Masahiko Sawada wrote: > I've attached updated patches for HEAD and pg17 for now (I will create > the patch for other backbranches). > > In the patches, I used a different approach in between HEAD and > backbranches. On HEAD, we store a flag indicating whether or not we > should skip snapshot restores into the SnapBuild struct and set it > only while finding the start point. Therefore we have to bump > SNAPBUILD_VERSION. On backbranches, I used the approach we discussed > above; store the flag in LogicalDecodingContext and set it when > creating the LogicalDecodingContext for a new logical slot. A possible > downside of the approach taken for backbranches is that we implicitly > require for users to use the same LogicalDecodingContext for both > initializing the context for a new slot and finding its start point. > IIUC it was not strictly required. This restriction would not be a > problem at least in the core, but I'm not sure if there are no > external extensions that create a logical slot in that way. This is > the main reason why I used a different approach on HEAD and > backbranches. Therefore, if it does not matter, I think we can use the > same approach on all branches, which is better in terms of > maintainability. --- a/src/backend/replication/logical/snapbuild.c +++ b/src/backend/replication/logical/snapbuild.c @@ -189,6 +189,9 @@ struct SnapBuild /* Indicates if we are building full snapshot or just catalog one. */ bool building_full_snapshot; + /* Indicates if we are finding the start point to extract changes */ + bool finding_start_point; + FYI, I think that it is still OK to bump SNAPBUILD_VERSION on REL_17_STABLE. That will reduce by 1 year the time window required to maintain the tweaks implemented for the versions in the back-branches. So I'd suggest to do what the v17 version of the patch does for ~16, and use the snapshot format changes in 17~. -- Michael
Attachment
Re: Potential data loss due to race condition during logical replication slot creation
On Mon, Jul 8, 2024 at 8:56 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, Jul 05, 2024 at 11:52:52PM +0900, Masahiko Sawada wrote: > > I've attached updated patches for HEAD and pg17 for now (I will create > > the patch for other backbranches). > > > > In the patches, I used a different approach in between HEAD and > > backbranches. On HEAD, we store a flag indicating whether or not we > > should skip snapshot restores into the SnapBuild struct and set it > > only while finding the start point. Therefore we have to bump > > SNAPBUILD_VERSION. On backbranches, I used the approach we discussed > > above; store the flag in LogicalDecodingContext and set it when > > creating the LogicalDecodingContext for a new logical slot. A possible > > downside of the approach taken for backbranches is that we implicitly > > require for users to use the same LogicalDecodingContext for both > > initializing the context for a new slot and finding its start point. > > IIUC it was not strictly required. This restriction would not be a > > problem at least in the core, but I'm not sure if there are no > > external extensions that create a logical slot in that way. This is > > the main reason why I used a different approach on HEAD and > > backbranches. Therefore, if it does not matter, I think we can use the > > same approach on all branches, which is better in terms of > > maintainability. > > --- a/src/backend/replication/logical/snapbuild.c > +++ b/src/backend/replication/logical/snapbuild.c > @@ -189,6 +189,9 @@ struct SnapBuild > /* Indicates if we are building full snapshot or just catalog one. */ > bool building_full_snapshot; > > + /* Indicates if we are finding the start point to extract changes */ > + bool finding_start_point; > + > > FYI, I think that it is still OK to bump SNAPBUILD_VERSION on > REL_17_STABLE. That will reduce by 1 year the time window required to > maintain the tweaks implemented for the versions in the back-branches. > So I'd suggest to do what the v17 version of the patch does for ~16, > and use the snapshot format changes in 17~. Thank you for the comment. Agreed. If we get consensus on this approach to go, I'll use the patch changing the SnapBuild struct for 17 and HEAD. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
RE: Potential data loss due to race condition during logical replication slot creation
Dear Sawada-san, Thanks for creating the patch! > I've attached updated patches for HEAD and pg17 for now (I will create > the patch for other backbranches). > > In the patches, I used a different approach in between HEAD and > backbranches. On HEAD, we store a flag indicating whether or not we > should skip snapshot restores into the SnapBuild struct and set it > only while finding the start point. Therefore we have to bump > SNAPBUILD_VERSION. On backbranches, I used the approach we discussed > above; store the flag in LogicalDecodingContext and set it when > creating the LogicalDecodingContext for a new logical slot. A possible > downside of the approach taken for backbranches is that we implicitly > require for users to use the same LogicalDecodingContext for both > initializing the context for a new slot and finding its start point. > IIUC it was not strictly required. This restriction would not be a > problem at least in the core, but I'm not sure if there are no > external extensions that create a logical slot in that way. This is > the main reason why I used a different approach on HEAD and > backbranches. Therefore, if it does not matter, I think we can use the > same approach on all branches, which is better in terms of > maintainability. I want to confirm your point. You meant that you wanted to unify the approaches, especially that you wanted to store the flag in LogicalDecodingContext, right? I briefly grepped GitHub repos for "DecodingContextFindStartpoint" and could not find such a pattern, e.g., [1]. But this is a point that would have to be raised by extension developers. > > Also, I excluded the test case for the problem that Kuroda-san > reported[1] since the reported problem occurred due to the same cause > of the problem originally reported on this thread. The bug that > decodes only partial transactions could lead to various symptoms. IIUC > these test cases test the same behavior. Yes, I also think they are caused by the same root cause, so we can skip it. Comments for HEAD patch: 1. You must modify test_decoding/meson.build. This was also missing in the patches for backbranches. 2. The test code needs cleanup. E.g., - Steps "s0_insert_cat" and "s0_savepoint" are not used - Table user_cat is not used This was also missing in the patches for backbranches. 3. Just to confirm - In SnapshotRestore(), the added attribute is not restored from disk. This is intentional because 1) it has been set to false when it was serialized to disk and 2) the destination (SnapBuild *builder) is initialized by palloc0(), so it was set to false. Is that right? Comments for backbranches patch: 1. According to the wiki page [2], new attributes must be added at the end of the struct. [1]: https://github.com/cybertec-postgresql/pg_squeeze/blob/master/worker.c#L1614 [2]: https://wiki.postgresql.org/wiki/Committing_checklist Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
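The point behind question 3 can be illustrated with a short sketch; the struct is a stand-in and the real allocation happens in AllocateSnapshotBuilder() inside snapbuild.c:

    #include "postgres.h"

    /* Stand-in for the private SnapBuild; only the newly added flag is shown. */
    typedef struct SnapBuildStandIn
    {
        bool        in_slot_creation;
        /* ... other fields elided ... */
    } SnapBuildStandIn;

    /*
     * Because the builder is zero-allocated, the new flag starts out false,
     * and since the restore path deliberately does not copy it back from
     * disk it simply stays false for a builder that goes through a restore.
     */
    static SnapBuildStandIn *
    allocate_builder_standin(MemoryContext context)
    {
        return (SnapBuildStandIn *) MemoryContextAllocZero(context, sizeof(SnapBuildStandIn));
    }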
Re: Potential data loss due to race condition during logical replication slot creation
On Mon, Jul 8, 2024 at 6:01 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Sawada-san, > > Thanks for creating the patch! > > > I've attached updated patches for HEAD and pg17 for now (I will create > > the patch for other backbranches). > > > > In the patches, I used a different approach in between HEAD and > > backbranches. On HEAD, we store a flag indicating whether or not we > > should skip snapshot restores into the SnapBuild struct and set it > > only while finding the start point. Therefore we have to bump > > SNAPBUILD_VERSION. On backbranches, I used the approach we discussed > > above; store the flag in LogicalDecodingContext and set it when > > creating the LogicalDecodingContext for a new logical slot. A possible > > downside of the approach taken for backbranches is that we implicitly > > require for users to use the same LogicalDecodingContext for both > > initializing the context for a new slot and finding its start point. > > IIUC it was not strictly required. This restriction would not be a > > problem at least in the core, but I'm not sure if there are no > > external extensions that create a logical slot in that way. This is > > the main reason why I used a different approach on HEAD and > > backbranches. Therefore, if it does not matter, I think we can use the > > same approach on all branches, which is better in terms of > > maintainability. > > I want to confirm your point. You meant that you wanted to unify appraoches, > especially you wanted to store the flag in LogicalDecodingContext, rigth? Yes. Ideally I'd like to use the same approach in all branches regardless of how to fix it for better maintainability. > I > briefly grepped github repos with "DecodingContextFindStartpoint", and cannot > find such pattern. E.g., [1]. But this point must be pointed by extension developers. Thank you for searching on github. It will also affect the future extension developments so I'm slightly hesitant to do that. > > > > Also, I excluded the test case for the problem that Kuroda-san > > reported[1] since the reported problem occurred due to the same cause > > of the problem originally reported on this thread. The bug that > > decodes only partial transactions could lead to various symptoms. IIUC > > these test cases test the same behavior. > > Yes, I also think they are caused by the same root cause, so we can skip. > > Comments for HEAD patch: > 1. > You must modify test_decoding/meson.build. It was also missing in patches for > backbranches. > > 2. > The test code needs cleanup. E.g., > - Steps "s0_insert_cat" and "s0_savepoint" is not used > - Table user_cat is not used > It was also missing in patches for backbranches. Will fix the above points in the next version patch. > > 3. > Just to confirm - In SnapshotRestore(), the added attribute is not restored > from the disk. This is intentional because 1) it has been set to false when it > was serilizing to disk and 2) the destination (SnapBuild *builder) is initialized > by palloc0() so it was set to false. Is it right? Right. > Comments for backblanches patch: > > 1. > According to wikipage [2], new attribute must be at the end of struct. Good point. Will fix. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Potential data loss due to race condition during logical replication slot creation
On Mon, Jul 8, 2024 at 11:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Jul 8, 2024 at 6:01 PM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > Dear Sawada-san, > > > > Thanks for creating the patch! > > > > > I've attached updated patches for HEAD and pg17 for now (I will create > > > the patch for other backbranches). > > > > > > In the patches, I used a different approach in between HEAD and > > > backbranches. On HEAD, we store a flag indicating whether or not we > > > should skip snapshot restores into the SnapBuild struct and set it > > > only while finding the start point. Therefore we have to bump > > > SNAPBUILD_VERSION. On backbranches, I used the approach we discussed > > > above; store the flag in LogicalDecodingContext and set it when > > > creating the LogicalDecodingContext for a new logical slot. A possible > > > downside of the approach taken for backbranches is that we implicitly > > > require for users to use the same LogicalDecodingContext for both > > > initializing the context for a new slot and finding its start point. > > > IIUC it was not strictly required. This restriction would not be a > > > problem at least in the core, but I'm not sure if there are no > > > external extensions that create a logical slot in that way. This is > > > the main reason why I used a different approach on HEAD and > > > backbranches. Therefore, if it does not matter, I think we can use the > > > same approach on all branches, which is better in terms of > > > maintainability. > > > > I want to confirm your point. You meant that you wanted to unify appraoches, > > especially you wanted to store the flag in LogicalDecodingContext, rigth? > > Yes. Ideally I'd like to use the same approach in all branches > regardless of how to fix it for better maintainability. > > > I > > briefly grepped github repos with "DecodingContextFindStartpoint", and cannot > > find such pattern. E.g., [1]. But this point must be pointed by extension developers. > > Thank you for searching on github. It will also affect the future > extension developments so I'm slightly hesitant to do that. > > > > > > > Also, I excluded the test case for the problem that Kuroda-san > > > reported[1] since the reported problem occurred due to the same cause > > > of the problem originally reported on this thread. The bug that > > > decodes only partial transactions could lead to various symptoms. IIUC > > > these test cases test the same behavior. > > > > Yes, I also think they are caused by the same root cause, so we can skip. > > > > Comments for HEAD patch: > > 1. > > You must modify test_decoding/meson.build. It was also missing in patches for > > backbranches. > > > > 2. > > The test code needs cleanup. E.g., > > - Steps "s0_insert_cat" and "s0_savepoint" is not used > > - Table user_cat is not used > > It was also missing in patches for backbranches. > > Will fix the above points in the next version patch. > > > > > 3. > > Just to confirm - In SnapshotRestore(), the added attribute is not restored > > from the disk. This is intentional because 1) it has been set to false when it > > was serilizing to disk and 2) the destination (SnapBuild *builder) is initialized > > by palloc0() so it was set to false. Is it right? > > Right. > > > Comments for backblanches patch: > > > > 1. > > According to wikipage [2], new attribute must be at the end of struct. > > Good point. Will fix. > I've attached the new version patches for all branches. 
Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
- REL14_v2-0001-Fix-possibility-of-logical-decoding-partial-trans.patch
- master_v2-0001-Fix-possibility-of-logical-decoding-partial-trans.patch
- REL16_v2-0001-Fix-possibility-of-logical-decoding-partial-trans.patch
- REL15_v2-0001-Fix-possibility-of-logical-decoding-partial-trans.patch
- REL17_v2-0001-Fix-possibility-of-logical-decoding-partial-trans.patch
- REL12_v2-0001-Fix-possibility-of-logical-decoding-partial-trans.patch
- REL13_v2-0001-Fix-possibility-of-logical-decoding-partial-trans.patch
Re: Potential data loss due to race condition during logical replication slot creation
On Tue, Jul 9, 2024 at 11:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Jul 8, 2024 at 11:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Jul 8, 2024 at 6:01 PM Hayato Kuroda (Fujitsu) > > <kuroda.hayato@fujitsu.com> wrote: > > > > > > Dear Sawada-san, > > > > > > Thanks for creating the patch! > > > > > > > I've attached updated patches for HEAD and pg17 for now (I will create > > > > the patch for other backbranches). > > > > > > > > In the patches, I used a different approach in between HEAD and > > > > backbranches. On HEAD, we store a flag indicating whether or not we > > > > should skip snapshot restores into the SnapBuild struct and set it > > > > only while finding the start point. Therefore we have to bump > > > > SNAPBUILD_VERSION. On backbranches, I used the approach we discussed > > > > above; store the flag in LogicalDecodingContext and set it when > > > > creating the LogicalDecodingContext for a new logical slot. A possible > > > > downside of the approach taken for backbranches is that we implicitly > > > > require for users to use the same LogicalDecodingContext for both > > > > initializing the context for a new slot and finding its start point. > > > > IIUC it was not strictly required. This restriction would not be a > > > > problem at least in the core, but I'm not sure if there are no > > > > external extensions that create a logical slot in that way. This is > > > > the main reason why I used a different approach on HEAD and > > > > backbranches. Therefore, if it does not matter, I think we can use the > > > > same approach on all branches, which is better in terms of > > > > maintainability. > > > > > > I want to confirm your point. You meant that you wanted to unify appraoches, > > > especially you wanted to store the flag in LogicalDecodingContext, rigth? > > > > Yes. Ideally I'd like to use the same approach in all branches > > regardless of how to fix it for better maintainability. > > The difference is minor so using slightly different approaches should be okay. In the ideal case, I agree that using the same approach makes sense but for future extendability, it is better to keep this new variable in SnapBuild at least in HEAD and PG17. > > I've attached the new version patches for all branches. > Few comments: 1. @@ -650,6 +650,9 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx) { ReplicationSlot *slot = ctx->slot; + /* Let snapshot builder start to find the start point */ + SnapBuildSetFindStartPoint(ctx->snapshot_builder, true); + /* Initialize from where to start reading WAL. */ XLogBeginRead(ctx->reader, slot->data.restart_lsn); @@ -683,6 +686,9 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx) if (slot->data.two_phase) slot->data.two_phase_at = ctx->reader->EndRecPtr; SpinLockRelease(&slot->mutex); + + /* Complete to find the start point */ + SnapBuildSetFindStartPoint(ctx->snapshot_builder, false); I wonder why you didn't choose to set this variable in AllocateSnapshotBuilder()? If we do so, then we may not need set/reset in DecodingContextFindStartpoint(). The one advantage of using set/reset for the minimal time as done here is that we can avoid any impact of this new variable but I still feel setting it in AllocateSnapshotBuilder() seems more natural. 2. Since in this case + * the restart LSN could be in the middle of transactions we need to + * find the start point where we won't see the data for partial + * transactions. There is a connecting word missing between *transactions* and *we*. 
Can we use a slightly different wording like: "Can't use this method while finding the start point for decoding changes as the restart LSN would be an arbitrary LSN but we need to find the start point to extract changes where we won't see the data for partial transactions."? -- With Regards, Amit Kapila.
Re: Potential data loss due to race condition during logical replication slot creation
On Tue, Jul 9, 2024 at 6:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jul 9, 2024 at 11:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Jul 8, 2024 at 11:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Mon, Jul 8, 2024 at 6:01 PM Hayato Kuroda (Fujitsu) > > > <kuroda.hayato@fujitsu.com> wrote: > > > > > > > > Dear Sawada-san, > > > > > > > > Thanks for creating the patch! > > > > > > > > > I've attached updated patches for HEAD and pg17 for now (I will create > > > > > the patch for other backbranches). > > > > > > > > > > In the patches, I used a different approach in between HEAD and > > > > > backbranches. On HEAD, we store a flag indicating whether or not we > > > > > should skip snapshot restores into the SnapBuild struct and set it > > > > > only while finding the start point. Therefore we have to bump > > > > > SNAPBUILD_VERSION. On backbranches, I used the approach we discussed > > > > > above; store the flag in LogicalDecodingContext and set it when > > > > > creating the LogicalDecodingContext for a new logical slot. A possible > > > > > downside of the approach taken for backbranches is that we implicitly > > > > > require for users to use the same LogicalDecodingContext for both > > > > > initializing the context for a new slot and finding its start point. > > > > > IIUC it was not strictly required. This restriction would not be a > > > > > problem at least in the core, but I'm not sure if there are no > > > > > external extensions that create a logical slot in that way. This is > > > > > the main reason why I used a different approach on HEAD and > > > > > backbranches. Therefore, if it does not matter, I think we can use the > > > > > same approach on all branches, which is better in terms of > > > > > maintainability. > > > > > > > > I want to confirm your point. You meant that you wanted to unify appraoches, > > > > especially you wanted to store the flag in LogicalDecodingContext, rigth? > > > > > > Yes. Ideally I'd like to use the same approach in all branches > > > regardless of how to fix it for better maintainability. > > > > > The difference is minor so using slightly different approaches should > be okay. In the ideal case, I agree that using the same approach makes > sense but for future extendability, it is better to keep this new > variable in SnapBuild at least in HEAD and PG17. > > > > > I've attached the new version patches for all branches. > > > > Few comments: > 1. > @@ -650,6 +650,9 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx) > { > ReplicationSlot *slot = ctx->slot; > > + /* Let snapshot builder start to find the start point */ > + SnapBuildSetFindStartPoint(ctx->snapshot_builder, true); > + > /* Initialize from where to start reading WAL. */ > XLogBeginRead(ctx->reader, slot->data.restart_lsn); > > @@ -683,6 +686,9 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx) > if (slot->data.two_phase) > slot->data.two_phase_at = ctx->reader->EndRecPtr; > SpinLockRelease(&slot->mutex); > + > + /* Complete to find the start point */ > + SnapBuildSetFindStartPoint(ctx->snapshot_builder, false); > > I wonder why you didn't choose to set this variable in > AllocateSnapshotBuilder()? If we do so, then we may not need set/reset > in DecodingContextFindStartpoint(). The one advantage of using > set/reset for the minimal time as done here is that we can avoid any > impact of this new variable but I still feel setting it in > AllocateSnapshotBuilder() seems more natural. 
Initially, what I thought was this: if we set the flag in AllocateSnapshotBuilder(), we need to pass true down through CreateInitDecodingContext() -> StartupDecodingContext() -> AllocateSnapshotBuilder(), meaning that only a LogicalDecodingContext created via CreateInitDecodingContext() can be used in DecodingContextFindStartpoint(). IOW, if we use a LogicalDecodingContext created via CreateDecodingContext() (i.e., setting the flag as false) in DecodingContextFindStartpoint(), we would end up with the same problem. IIUC we haven't had such a usage restriction before. But thinking about this approach further, probably the same is true for initial_xmin_horizon. The field is set only when the LogicalDecodingContext is created via CreateInitDecodingContext(), and is used only while finding the start point. So I'm fine with setting the flag in AllocateSnapshotBuilder(). If we go with this approach, I think we should check if the flag is set before finding the start point. > > 2. > Since in this case > + * the restart LSN could be in the middle of transactions we need to > + * find the start point where we won't see the data for partial > + * transactions. > > There is a connecting word missing between *transactions* and *we*. > Can we use a slightly different wording like: "Can't use this method > while finding the start point for decoding changes as the restart LSN > would be an arbitrary LSN but we need to find the start point to > extract changes where we won't see the data for partial > transactions."? Looks good. Will fix. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
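A sketch of what setting the flag in AllocateSnapshotBuilder() would mean in practice; the signature is hypothetical and heavily simplified, since the real function takes several more arguments:

    #include "postgres.h"

    /* Stand-in for the private SnapBuild; only the flag is shown. */
    typedef struct SnapBuildStandIn
    {
        bool        in_slot_creation;
    } SnapBuildStandIn;

    /*
     * CreateInitDecodingContext() would pass in_create = true and
     * CreateDecodingContext() would pass false, both routed through
     * StartupDecodingContext() down to the builder allocation, so the flag
     * is fixed for the builder's whole lifetime instead of being set and
     * reset around DecodingContextFindStartpoint().
     */
    static SnapBuildStandIn *
    allocate_snapshot_builder_standin(bool in_create)
    {
        SnapBuildStandIn *builder = palloc0(sizeof(SnapBuildStandIn));

        builder->in_slot_creation = in_create;
        return builder;
    }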
Re: Potential data loss due to race condition during logical replication slot creation
On Wed, Jul 10, 2024 at 10:15 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > If we go this approach, > I think we should check if the flag is set before finding the start > point. Checking if the flag is set in DecodingContextFindStartpoint() requires introducing a new function in snapbuild.c and the function would be used only in an Assertion. So the attached updated patch doesn't introduce such a function. If it's really worth it, I can add it, but I think we can live without that. Overall I think the patches are in good shape, so I'm going to push them tomorrow, barring any objections and further comments. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
- master_v3-0001-Fix-possibility-of-logical-decoding-partial-trans.patch
- REL17_v3-0001-Fix-possibility-of-logical-decoding-partial-trans.patch
- REL15_v3-0001-Fix-possibility-of-logical-decoding-partial-trans.patch
- REL14_v3-0001-Fix-possibility-of-logical-decoding-partial-trans.patch
- REL16_v3-0001-Fix-possibility-of-logical-decoding-partial-trans.patch
- REL13_v3-0001-Fix-possibility-of-logical-decoding-partial-trans.patch
- REL12_v3-0001-Fix-possibility-of-logical-decoding-partial-trans.patch
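For completeness, the Assert-only helper mentioned in the preceding message (and ultimately not added) might have looked roughly like this; the names are hypothetical and the accessor would have to live in snapbuild.c, where the private SnapBuild struct is defined:

    #include "postgres.h"

    /* Stand-in for the private SnapBuild struct. */
    typedef struct SnapBuildStandIn
    {
        bool        in_slot_creation;
    } SnapBuildStandIn;

    /* Hypothetical accessor, discussed above but ultimately not added. */
    static bool
    SnapBuildInSlotCreationStandIn(SnapBuildStandIn *builder)
    {
        return builder->in_slot_creation;
    }

    /*
     * DecodingContextFindStartpoint() would then only use it as:
     *     Assert(SnapBuildInSlotCreationStandIn(ctx->snapshot_builder));
     */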
Re: Potential data loss due to race condition during logical replication slot creation
On Wed, Jul 10, 2024 at 7:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Overall I think the patches are in good shape, so I'm going to push > them tomorrow, barring any objections and further comments. > Agreed. Below are a few minor comments that you might want to consider: 1. @@ -76,6 +77,7 @@ extern SnapBuildState SnapBuildCurrentState(SnapBuild *builder); extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder); extern bool SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr); +extern void SnapBuildSetFindStartPoint(SnapBuild *builder, bool find_start_point); This is not required in the latest version. 2. + /* + * Indicates if we are using the snapshot builder for the initial creation + * of a logical replication slot. The word 'initial' in the above comment is not required. If you consider this then a similar wording change is required in lower branches as well. 3. HEAD and v17 --------------------- - /* b) valid on disk state and not building full snapshot */ + + /* + * b) valid on disk state and while neither building full snapshot nor + * finding the start point. + */ else if (!builder->building_full_snapshot && + !builder->in_slot_creation && V16 and below --------------------- - /* b) valid on disk state and not building full snapshot */ + + /* + * b) valid on disk state and neither building full snapshot nor while + * creating a slot. + */ else if (!builder->building_full_snapshot && + !ctx->in_create && Isn't it better to use the same comment in both places? -- With Regards, Amit Kapila.
Re: Potential data loss due to race condition during logical replication slot creation
On Wed, Jul 10, 2024 at 12:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jul 10, 2024 at 7:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Overall I think the patches are in good shape, so I'm going to push > > them tomorrow, barring any objections and further comments. > > > > Agreed. Below are a few minor comments that you might want to consider: > > 1. > @@ -76,6 +77,7 @@ extern SnapBuildState > SnapBuildCurrentState(SnapBuild *builder); > extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder); > > extern bool SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr); > +extern void SnapBuildSetFindStartPoint(SnapBuild *builder, bool > find_start_point); > > This is not required in the latest version. > > 2. > + /* > + * Indicates if we are using the snapshot builder for the initial creation > + * of a logical replication slot. > > The word 'initial' in the above comment is not required. If you > consider this then a similar wording change is required in lower > branches as well. > > 3. > HEAD and v17 > --------------------- > - /* b) valid on disk state and not building full snapshot */ > + > + /* > + * b) valid on disk state and while neither building full snapshot nor > + * finding the start point. > + */ > else if (!builder->building_full_snapshot && > + !builder->in_slot_creation && > > V16 and below > --------------------- > - /* b) valid on disk state and not building full snapshot */ > + > + /* > + * b) valid on disk state and neither building full snapshot nor while > + * creating a slot. > + */ > else if (!builder->building_full_snapshot && > + !ctx->in_create && > > Isn't it better to use the same comment in both places? Thank you for reviewing the patches! I agreed with all the points. I've attached the updated patches. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
- REL17_v4-0001-Fix-possibility-of-logical-decoding-partial-trans.patch
- REL15_v4-0001-Fix-possibility-of-logical-decoding-partial-trans.patch
- REL14_v4-0001-Fix-possibility-of-logical-decoding-partial-trans.patch
- REL16_v4-0001-Fix-possibility-of-logical-decoding-partial-trans.patch
- master_v4-0001-Fix-possibility-of-logical-decoding-partial-trans.patch
- REL13_v4-0001-Fix-possibility-of-logical-decoding-partial-trans.patch
- REL12_v4-0001-Fix-possibility-of-logical-decoding-partial-trans.patch
Re: Potential data loss due to race condition during logical replication slot creation
On Wed, Jul 10, 2024 at 2:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Jul 10, 2024 at 12:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Jul 10, 2024 at 7:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > Overall I think the patches are in good shape, so I'm going to push > > > them tomorrow, barring any objections and further comments. > > > > > > > Agreed. Below are a few minor comments that you might want to consider: > > > > 1. > > @@ -76,6 +77,7 @@ extern SnapBuildState > > SnapBuildCurrentState(SnapBuild *builder); > > extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder); > > > > extern bool SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr); > > +extern void SnapBuildSetFindStartPoint(SnapBuild *builder, bool > > find_start_point); > > > > This is not required in the latest version. > > > > 2. > > + /* > > + * Indicates if we are using the snapshot builder for the initial creation > > + * of a logical replication slot. > > > > The word 'initial' in the above comment is not required. If you > > consider this then a similar wording change is required in lower > > branches as well. > > > > 3. > > HEAD and v17 > > --------------------- > > - /* b) valid on disk state and not building full snapshot */ > > + > > + /* > > + * b) valid on disk state and while neither building full snapshot nor > > + * finding the start point. > > + */ > > else if (!builder->building_full_snapshot && > > + !builder->in_slot_creation && > > > > V16 and below > > --------------------- > > - /* b) valid on disk state and not building full snapshot */ > > + > > + /* > > + * b) valid on disk state and neither building full snapshot nor while > > + * creating a slot. > > + */ > > else if (!builder->building_full_snapshot && > > + !ctx->in_create && > > > > Isn't it better to use the same comment in both places? > > Thank you for reviewing the patches! I agreed with all the points. > I've attached the updated patches. > Pushed, bb19b70081 (and all supported branches). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Potential data loss due to race condition during logical replication slot creation
On Fri, Jul 12, 2024 at 11:52:20AM +0900, Masahiko Sawada wrote: > Pushed, bb19b70081 (and all supported branches). Thanks, Sawada-san. -- Michael