Re: Potential data loss due to race condition during logical replication slot creation - Mailing list pgsql-bugs
From | Amit Kapila |
---|---|
Subject | Re: Potential data loss due to race condition during logical replication slot creation |
Date | |
Msg-id | CAA4eK1+Q+gHADNeEuC=spzCh=JXBMNSWhK1SScu6dQ1owjrBVw@mail.gmail.com Whole thread Raw |
In response to | Re: Potential data loss due to race condition during logical replication slot creation (Masahiko Sawada <sawada.mshk@gmail.com>) |
Responses |
Re: Potential data loss due to race condition during logical replication slot creation
(Masahiko Sawada <sawada.mshk@gmail.com>)
|
List | pgsql-bugs |
On Mon, Feb 19, 2024 at 2:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Feb 8, 2024 at 2:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Sat, Feb 3, 2024 at 12:12 PM Callahan, Drew <callaan@amazon.com> wrote: > > > > > > > I think we can prepare an isolation test of the above steps and > > > > include it in contrib/test_decoding/specs. > > > > > > Thanks for the suggestion. Included an isolation test for the > > > exact repro and a couple other tests for fully before and fully after. > > > These two tests don't add a whole lot of value since this > > > is already heavily tested elsewhere, so I'm good with removing. > > > However, thought it made sense to include from a perspective of > > > totality given the test name. > > > > > > > Another simple way to fix this is that we always set > > > > need_full_snapshot to true when slot creation. But it would then make > > > > the snapshot builder include non-catalog-change transactions too while > > > > finding the initial startpoint. Which is not necessary. > > > > > > Since we're not accumulating changes anyway and we never distribute snapshots > > > while inconsistent it wouldn't cause any memory pressure in the ReorderBuffer. > > > This seems pretty contained and wouldn't require the use of a global variable > > > and memory context callbacks which is a bit nicer from a readability standpoint. > > > > Sorry I missed this comment. > > > > I think it's a good point that this fix doesn't lead to any on-disk > > compatibility while not affecting existing users much, especially for > > back-patching. If we want to choose this approach for bank branches, > > we need to carefully consider the side impacts. > > > > > It's a little inefficient with memory and CPU, especially in the presence of > > > long running transactions, but likely would not be noticeable to most users. > > > > This matches my analysis. > > > > Here is the summary of several proposals we've discussed: > > > > a) Have CreateInitDecodingContext() always pass need_full_snapshot = > > true to AllocateSnapshotBuilder(). > > > > This is the simplest approach and doesn't break the on-disk > > SnapBuildOnDisck compatibility. A downside is that it's a little > > inefficient with memory and CPU since it includes non-catalog change > > transactions too. Also, passing need_full_snapshot to > > CreateInitDecodingContext() will no longer have any meaning. > > > > b) Have snapbuild.c being able to handle multiple SnapBuildOnDisk versions. > > > > This bumps SNAPBUILD_VERSION and therefore breaks the on-disk > > SnapBuildOnDisk compatibility, but snapbuild.c can handle multiple > > versions. But since different branches are using different versions of > > SnapBuildOnDisk, it would probably be hard to have the SnapBuildOnDisk > > version that is consistent across all versions. > > > > c) Add a global variable, say in_create, to snapbuild.c > > > > it would also require adding a setter function for in_create. There > > are several ideas where to set/reset the flag. One idea is that we > > reset the flag in AllocateSnapshotBuilder() and set the flag anywhere > > before starting to find the start point, for example at the beginning > > of DecodingContextFindStartpoint(). It probably won't require a memory > > context callback to make sure to clear the flag. This idea doesn't > > have a downside from users and extensions perspects. But I'm slightly > > hesitant to add a global variable. > > > > What do you think? and any other ideas? > > I've drafted the idea (c) for discussion (for master and v16 for now). > I also liked the idea (a) but I'm concerned a bit about future impact. > I agree that (c) would be better than (a) as it avoids collecting non-catalog xacts in snapshot. However, I think we shouldn't avoid restoring the snapshot unless really required. See my previous response. -- With Regards, Amit Kapila.
pgsql-bugs by date: