> On Oct 23, 2025, at 18:02, Arseniy Mukhin <arseniy.mukhin.dev@gmail.com> wrote:
>
> Hi,
>
> On Thu, Oct 23, 2025 at 11:17 AM Chao Li <li.evan.chao@gmail.com> wrote:
>>
>>
>>
>>> On Oct 21, 2025, at 00:43, Arseniy Mukhin <arseniy.mukhin.dev@gmail.com> wrote:
>>>
>>>
>>> I managed to reproduce the race with v20-alt3. I tried to write a TAP
>>> test reproducing the issue, so it was easier to validate changes.
>>> Please find the attached TAP test. I added it to some random package
>>> for simplicity.
>>>
>>
>> With alt3, as we have acquired the notification lock after reading every message to update the POS, I think we can
doa little bit more optimization:
>>
>> The notifier: in SignalBackend()
>> * Now we check if a listener’s pos equals to beforeWritePos, then we do “directly advancement”
>> * We can change to if a listener’s pos is between beforeWritePos and afterWritePos, then we can do the
advancement.
>>
>> The listener: in asyncQueueReadAllNotifications():
>> * With alt3, we only lock and update pos
>> * We can do more. If current pos in shared memory is after that local pos, then meaning some notifier has done an
advancement,so it can stop reading.
>>
>
> I think this would be a reasonable optimization if it weren't for the
> race condition mentioned above. The problem is that if the local pos
> lags behind the shared memory pos, it could point to a truncated queue
> segment, so we shouldn't allow that.
>
I figured out a way to resolve the race condition for alt3:
* add an awakening flag for every listener, this flag is only set by listeners
* add an advisory pos for every listener, similar to alt1
* if a listener is awaken, notify only updates the listener’s advisory pos; otherwise directly advance its position.
* If a running listener see current pos is behind advisory pos, then stop reading
See more details in attach patch file, I added code comments for my changes. Now the TAP test won’t hit the race
condition.
```
# +++ tap check in src/test/authentication +++
t/008_listen-pos-race.pl .. skipped: Injection points not supported by this build
Files=1, Tests=0, 0 wallclock secs ( 0.00 usr 0.00 sys + 0.03 cusr 0.01 csys = 0.04 CPU)
Result: NOTESTS
```
And with my solution, listeners longer will still use local pos, so that no longer need to acquire notification lock in
everyloop.
The patch stack is: v20 patch -> alt3 patch -> tap patch -> my patch. Please see if my solution works.
I also made a tiny change in the TAP script to allow it to terminate gracefully.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/