Amit Kapila <amit.kapila16@gmail.com> writes:
> On Thu, Aug 20, 2020 at 10:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 2. On the other hand, the code is *releasing* the
>> ReplicationSlotControlLock before it calls
>> ProcArraySetReplicationSlotXmin, and that seems like a flat out
>> concurrency bug.
> It is not clear to me how those values can go backward.
After releasing ReplicationSlotControlLock, that code is holding no
lock at all (in the already_locked=false case I'm concerned about).
Thus the scenario to consider is:
1. Walsender A runs ReplicationSlotsComputeRequiredXmin, computes
some perfectly-valid xmins, releases ReplicationSlotControlLock,
amd then gets swapped out to Timbuktu.
2. Time passes and the "true values" of those xmins advance thanks
to other walsender activity.
3. Walsender B runs ReplicationSlotsComputeRequiredXmin, computes
some perfectly-valid xmins, and successfully stores them in the
procarray.
4. Walsender A returns from never-never land, and stores its now
quite stale results in the procarray, causing the globally visible
xmins to go backwards from where they were after step 3.
I see no mechanism in the code that prevents this scenario.
On reflection I'm not even very sure that the code change
I'm suggesting would prevent it. It'd prevent walsenders
from entering or exiting until we've updated the procarray,
but there's nothing to stop the furthest-back walsender from
advancing its values.
There may be some argument why this can't lead to a problem,
but I don't see any comments making such an argument, either.
regards, tom lane