On 2020/04/17 3:00, Tom Lane wrote:
> Fujii Masao <masao.fujii@oss.nttdata.com> writes:
>> On 2020/04/14 22:52, Tom Lane wrote:
>>> *Yes it does*. The existing code can deliver entirely broken results
>>> if some walsender exits between where we examine the priorities and
>>> where we fetch the WAL pointers.
>
>> IMO that the broken results can be delivered when walsender marked
>> as sync exits *and* new walsender comes at that moment. If this new
>> walsender uses the WalSnd slot that the exited walsender used,
>> SyncRepGetOldestSyncRecPtr() wronly calculates the oldest lsn based
>> on this new walsender (i.e., different walsender from one marked as sync).
>
> Right, exactly, sorry that I was not more specific.
>
>> BTW, since the patch changes the API of SyncRepGetSyncStandbys(),
>> it should not be back-patched to avoid ABI break. Right?
>
> Anything that is using that is just as broken as the core code is, for the
> same reasons, so I don't have a problem with changing its API. Maybe we
> should rename it while we're at it, just to make it clear that we are
> breaking any external callers. (If there are any, which seems somewhat
> unlikely.)
I agree to change the API if that's the only way to fix the bug. But ISTM that
we can fix the bug without changing the API, like the attached patch does.
Your patch changes the logic to pick up sync standbys, e.g., use qsort(),
in addition to the bug fix. This might be an improvement and I agree that
it's worth considering that idea for the master branch or v14. But I'm not
fan of adding such changes into the back branches if they are not
necessary for the bug fix. I like to basically keep the current logic as it is,
at least for the back branch, like the attached patch does.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION