On 11/5/2025 1:55 AM, Michael Paquier wrote:
> On Wed, Nov 05, 2025 at 03:30:30PM +0800, Xuneng Zhou wrote:
>> On Wed, Nov 5, 2025 at 2:50 PM Michael Paquier <michael@paquier.xyz> wrote:
>>> Timing issue then, the buildfarm has not been complaining on this one
>>> AFAIK, there have been no recoveryCheck failures reported:
>>> https://buildfarm.postgresql.org/cgi-bin/show_failures.pl
>
> drongo has just reported one failure, so I stand corrected:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-11-05%2003%3A50%3A50
>
Perfect timing on that drongo failure - it validates exactly what I
was seeing with the O_CLOEXEC changes.
> And one log rotation should be enough before the restart.
>
>>> Hmm. The reason why I didn't use a PID matching check (mentioned at
>>> [1]) is that this is not entirely bullet-proof. On a very slow
>>> machine, one could assume that standby_1 generates some records and
>>> that these are replayed by standby_2 *before* the PID of the WAL
>>> receiver is retrieved. This could lead to false positives in some
>>> cases, and a bunch of buildfarm members are very slow. You have a
>>> point that these would unlikely happen in normal runs, so a PID
>>> matching check would be relevant most of the time anyway, even if the
>>> original PID has been fetched after the TLI jump has been processed in
>>> standby_2. I'd rather keep the log check, TBH, bypassing it with an
>>> extra rotate_logfile() before the restart of standby_2.
>>
>> I’ve also prepared a patch for this method.
>
> That's exactly what I have done a couple of minutes ago, and noticed
> your message before applying the fix so I've listed you as a
> co-author on this one.
>
Thanks for the quick turnaround and for including both Xuneng and me.
> I have also kept the PID check after pondering a bit about it. A TLI
> jump could be replayed before we grab the initial PID, but in most
> cases it should be able to do its work correctly.
> --
> Michael
The rotate_logfile() call fixes the fundamental issue (log file reuse),
while the PID check adds extra verification for the common case. A
belt-and-braces approach makes sense here.
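
For anyone skimming the thread later, here is a rough sketch of why log
file reuse matters for a log-based check. It is plain Python rather than
the Perl TAP code, and everything in it is an assumption made for
illustration: the rotate() helper only mimics the idea behind
rotate_logfile(), and the message text is a placeholder, not the string
the real test looks for.

```python
import os
import tempfile


def log_contains(path, needle):
    """Return True if `needle` appears anywhere in the log file."""
    with open(path, "r") as f:
        return needle in f.read()


def rotate(path, generation):
    """Hypothetical stand-in for rotate_logfile(): start a fresh, empty log."""
    root, ext = os.path.splitext(path)
    new_path = f"{root}_{generation}{ext}"
    open(new_path, "w").close()
    return new_path


def demo():
    msg = "placeholder message of interest"  # not the real log message
    with tempfile.TemporaryDirectory() as d:
        log = os.path.join(d, "standby_2.log")

        # Activity from before the restart leaves a matching line behind.
        with open(log, "w") as f:
            f.write(f"LOG: {msg}\n")

        # Reusing the same file: the check matches the stale line.
        stale_hit = log_contains(log, msg)

        # Rotating before the restart gives the check a clean slate.
        log = rotate(log, 1)
        fresh_before = log_contains(log, msg)

        # Only lines written after the rotation can match now.
        with open(log, "a") as f:
            f.write(f"LOG: {msg}\n")
        fresh_after = log_contains(log, msg)

        return stale_hit, fresh_before, fresh_after
```

Calling demo() returns three booleans: whether the stale line matched
in the reused file, whether the fresh file matched before anything was
written to it, and whether it matched afterwards. The PID comparison is
a separate, complementary check and is not modeled here.
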
Your point about slow machines is well taken - I hadn't considered the
window between restart and PID capture on heavily loaded systems.
This clears the path for my O_CLOEXEC work without worrying about
spurious test failures. Appreciate it.