On 06/06/17 23:42, Andres Freund wrote:
> On 2017-06-06 23:24:50 +0200, Petr Jelinek wrote:
>> On 06/06/17 23:17, Andres Freund wrote:
>>> Right. I found a couple more instances of similarly iffy, although not
>>> quite as broken, patterns in launcher.c. It's easy to get this wrong,
>>> but it's a lot easier if you do it differently everywhere you use a
>>> latch. It's not good if code in the same file, by the same author(s),
>>> has different ways of using latches.
>>
>> Huh? I see same pattern everywhere in launcher.c, what am I missing?
>
> WaitForReplicationWorkerAttach:
> while (...)
> CHECK_FOR_INTERRUPTS();
> /* other stuff including returns */
> WaitLatch()
> WL_POSTMASTER_DEATH
> ResetLatch()
>
> logicalrep_worker_stop loop 1:
> while (...)
> /* other stuff */
> CHECK_FOR_INTERRUPTS()
> WaitLatch()
> POSTMASTER_DEATH
> ResetLatch()
> /* other stuff including returns */
>
> logicalrep_worker_stop loop 2:
> while (...)
> /* other stuff including returns */
> CHECK_FOR_INTERRUPTS();
> WaitLatch()
> WL_POSTMASTER_DEATH
> ResetLatch()
>
> ApplyLauncherMain:
> while (!got_SIGTERM)
> /* lots of other stuff */
> WaitLatch()
> WL_POSTMASTER_DEATH
> /* some other stuff */
> ResetLatch()
> (note no CFI)
>
> They're not hugely different, but there are subtle differences.
> Sometimes you're guaranteed to check for interrupts after resetting the
> latch, in other cases not. Sometimes expensive-ish things happen before
> a CFI...
>

Ah, that's because signal handling in the launcher is broken; see
https://www.postgresql.org/message-id/fe072153-babd-3b5d-8052-73527a6eb657@2ndquadrant.com
which also includes a patch to fix it.

We originally had custom signal handling everywhere. I then realized that was
a mistake for the workers because of the locking interaction, but missed the
same issue in the launcher (the CFI in the current launcher doesn't work).
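
To make the ordering concern in the quoted loops concrete, here is a minimal
toy model of why the position of the reset relative to the "is there work?"
check matters. It is a sketch under stated assumptions, not launcher.c code:
ToyState, Step, toy_producer, wakeup_lost and count_lost_wakeups are all
hypothetical names, and a plain bool stands in for the latch rather than the
real WaitLatch()/ResetLatch() calls. As I understand the guidance in latch.h,
resetting before checking is the safe order.

```c
#include <stdbool.h>

/* Toy stand-ins only; none of this is the PostgreSQL latch API. */
typedef struct ToyState
{
    bool latch_set;     /* models the latch's "is set" flag */
    bool work_pending;  /* shared state the waiter checks */
} ToyState;

typedef enum { STEP_RESET, STEP_CHECK, STEP_WAIT } Step;

/* Order assumed safe: reset the latch, then check for work, then wait. */
static const Step RESET_THEN_CHECK[3] = { STEP_RESET, STEP_CHECK, STEP_WAIT };
/* Risky order: check for work, then reset the latch, then wait. */
static const Step CHECK_THEN_RESET[3] = { STEP_CHECK, STEP_RESET, STEP_WAIT };

/* Hypothetical producer: publish the work first, then set the latch. */
static void toy_producer(ToyState *s)
{
    s->work_pending = true;
    s->latch_set = true;
}

/*
 * Run one loop iteration, with the producer firing just before step
 * number producer_before. Returns true if the waiter would go to sleep
 * with pending work it never saw, i.e. the wakeup was lost.
 */
static bool wakeup_lost(const Step order[3], int producer_before)
{
    ToyState s = { false, false };
    bool saw_work = false;
    bool would_block = false;

    for (int i = 0; i < 3; i++)
    {
        if (i == producer_before)
            toy_producer(&s);

        switch (order[i])
        {
            case STEP_RESET:
                s.latch_set = false;          /* forgets any earlier wakeup */
                break;
            case STEP_CHECK:
                saw_work = s.work_pending;    /* look at the shared state */
                break;
            case STEP_WAIT:
                would_block = !s.latch_set;   /* sleeps only if latch is unset */
                break;
        }
    }
    return s.work_pending && !saw_work && would_block;
}

/* Count the producer timings (before step 0, 1 or 2) that lose a wakeup. */
static int count_lost_wakeups(const Step order[3])
{
    int lost = 0;

    for (int p = 0; p < 3; p++)
        if (wakeup_lost(order, p))
            lost++;
    return lost;
}
```

With these definitions, count_lost_wakeups(RESET_THEN_CHECK) is 0 and
count_lost_wakeups(CHECK_THEN_RESET) is 1. The one lost case is the producer
firing between the check and the reset, which is the kind of window that is
easy to open by accident when each loop orders these calls differently.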

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services