Thread: pgsql: Get rid of the dedicated latch for signaling the startup process

pgsql: Get rid of the dedicated latch for signaling the startup process

From
Fujii Masao
Date:
Get rid of the dedicated latch for signaling the startup process.

This commit gets rid of the dedicated latch for signaling the startup
process in favor of using its procLatch,  since that comports better
with possible generic signal handlers using that latch.

Commit 1e53fe0e70 changed background processes so that they use standard
SIGHUP handler. Like that, this commit also makes the startup process use
standard SIGHUP handler to simplify the code.

Author: Fujii Masao
Reviewed-by: Bharath Rupireddy, Michael Paquier
Discussion: https://postgr.es/m/CALj2ACXPorUqePswDtOeM_s82v9RW32E1fYmOPZ5NuE+TWKj_A@mail.gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/ac22929a2613e122708bd0172508ac863c51c1cc

Modified Files
--------------
src/backend/access/transam/xlog.c | 29 ++++++++++++++---------------
src/backend/postmaster/startup.c  | 24 +++++-------------------
2 files changed, 19 insertions(+), 34 deletions(-)


Re: pgsql: Get rid of the dedicated latch for signaling the startup process

From
Heikki Linnakangas
Date:
On 04/11/2020 09:44, Fujii Masao wrote:
> Get rid of the dedicated latch for signaling the startup process.
> 
> This commit gets rid of the dedicated latch for signaling the startup
> process in favor of using its procLatch,  since that comports better
> with possible generic signal handlers using that latch.
> 
> Commit 1e53fe0e70 changed background processes so that they use standard
> SIGHUP handler. Like that, this commit also makes the startup process use
> standard SIGHUP handler to simplify the code.

This seems to have made buildfarm member 'elver' to segfault. I've got a 
hunch that setting recoveryWakeupLatch to NULL, when WakeupRecovery() 
doesn't check for NULL, is not OK. It's surprising that we're only 
seeing this on 'elver', though.

- Heikki



Re: pgsql: Get rid of the dedicated latch for signaling the startup process

From
Fujii Masao
Date:

On 2020/11/04 19:27, Heikki Linnakangas wrote:
> On 04/11/2020 09:44, Fujii Masao wrote:
>> Get rid of the dedicated latch for signaling the startup process.
>>
>> This commit gets rid of the dedicated latch for signaling the startup
>> process in favor of using its procLatch,  since that comports better
>> with possible generic signal handlers using that latch.
>>
>> Commit 1e53fe0e70 changed background processes so that they use standard
>> SIGHUP handler. Like that, this commit also makes the startup process use
>> standard SIGHUP handler to simplify the code.
> 
> This seems to have made buildfarm member 'elver' to segfault. I've got a hunch that setting recoveryWakeupLatch to
NULL,when WakeupRecovery() doesn't check for NULL, is not OK. It's surprising that we're only seeing this on 'elver',
though.

Thanks for the report! The latch is reset to NULL after ShutdownWalRcv(). So I thought that there are no processes
settingthe latch (i.e., walreceiver has already exited) after it's reset to NULL. But this is not true. There seem a
windowbetween ShutdownWalRcv() and the actual exit of walreceiver. If a signal is sent during that window, the
segmentationfault would happen. This would be the reason why segv happened in some platforms, but not in others.
 

I'm thinking to remove the following code to fix this issue. Thought?

    /*
     * We don't need the latch anymore. It's not strictly necessary to reset
     * it to NULL, but let's do it for the sake of tidiness.
     */
    if (ArchiveRecoveryRequested)
        XLogCtl->recoveryWakeupLatch = NULL;

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: pgsql: Get rid of the dedicated latch for signaling the startup process

From
Fujii Masao
Date:

On 2020/11/04 20:20, Fujii Masao wrote:
> 
> 
> On 2020/11/04 19:27, Heikki Linnakangas wrote:
>> On 04/11/2020 09:44, Fujii Masao wrote:
>>> Get rid of the dedicated latch for signaling the startup process.
>>>
>>> This commit gets rid of the dedicated latch for signaling the startup
>>> process in favor of using its procLatch,  since that comports better
>>> with possible generic signal handlers using that latch.
>>>
>>> Commit 1e53fe0e70 changed background processes so that they use standard
>>> SIGHUP handler. Like that, this commit also makes the startup process use
>>> standard SIGHUP handler to simplify the code.
>>
>> This seems to have made buildfarm member 'elver' to segfault. I've got a hunch that setting recoveryWakeupLatch to
NULL,when WakeupRecovery() doesn't check for NULL, is not OK. It's surprising that we're only seeing this on 'elver',
though.
> 
> Thanks for the report! The latch is reset to NULL after ShutdownWalRcv(). So I thought that there are no processes
settingthe latch (i.e., walreceiver has already exited) after it's reset to NULL. But this is not true. There seem a
windowbetween ShutdownWalRcv() and the actual exit of walreceiver. If a signal is sent during that window, the
segmentationfault would happen. This would be the reason why segv happened in some platforms, but not in others.
 
> 
> I'm thinking to remove the following code to fix this issue. Thought?
> 
>      /*
>       * We don't need the latch anymore. It's not strictly necessary to reset
>       * it to NULL, but let's do it for the sake of tidiness.
>       */
>      if (ArchiveRecoveryRequested)
>          XLogCtl->recoveryWakeupLatch = NULL;

Or ISTM that WakeupRecovery() should set the latch only when the latch
has not been reset to NULL yet.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: pgsql: Get rid of the dedicated latch for signaling the startup process

From
Heikki Linnakangas
Date:
On 04/11/2020 14:03, Fujii Masao wrote:
>> I'm thinking to remove the following code to fix this issue. Thought?
>>
>>       /*
>>        * We don't need the latch anymore. It's not strictly necessary to reset
>>        * it to NULL, but let's do it for the sake of tidiness.
>>        */
>>       if (ArchiveRecoveryRequested)
>>           XLogCtl->recoveryWakeupLatch = NULL;

That should work, but it seems a bit sloppy to leave it pointing to a 
latch that belongs to a random process though.

> Or ISTM that WakeupRecovery() should set the latch only when the latch
> has not been reset to NULL yet.

Got to be careful with race conditions, if the latch is set to NULL at 
the same time that WakeupRecovery() is called.

- Heikki



Re: pgsql: Get rid of the dedicated latch for signaling the startup process

From
Heikki Linnakangas
Date:
On 04/11/2020 15:17, Heikki Linnakangas wrote:
> On 04/11/2020 14:03, Fujii Masao wrote:
>> Or ISTM that WakeupRecovery() should set the latch only when the latch
>> has not been reset to NULL yet.
> 
> Got to be careful with race conditions, if the latch is set to NULL at
> the same time that WakeupRecovery() is called.

I don't think commit 113d3591b8 got this quite right:

> void
> WakeupRecovery(void)
> {
>     if (XLogCtl->recoveryWakeupLatch)
>         SetLatch(XLogCtl->recoveryWakeupLatch);
> }

If XLogCtl->recoveryWakeupLatch is set to NULL between the if and the 
SetLatch, you'll still get a segfault. That's highly unlikely to happen 
in practice because the compiler will optimize that into a single load 
instruction, but could happen with -O0. I think you'd need to do the 
access only once, using a volatile pointer, to make that safe. Maybe 
it's simpler to just not reset it to NULL, after all.

- Heikki



Re: pgsql: Get rid of the dedicated latch for signaling the startup process

From
Fujii Masao
Date:

On 2020/11/05 5:36, Heikki Linnakangas wrote:
> On 04/11/2020 15:17, Heikki Linnakangas wrote:
>> On 04/11/2020 14:03, Fujii Masao wrote:
>>> Or ISTM that WakeupRecovery() should set the latch only when the latch
>>> has not been reset to NULL yet.
>>
>> Got to be careful with race conditions, if the latch is set to NULL at
>> the same time that WakeupRecovery() is called.
> 
> I don't think commit 113d3591b8 got this quite right:
> 
>> void
>> WakeupRecovery(void)
>> {
>>     if (XLogCtl->recoveryWakeupLatch)
>>         SetLatch(XLogCtl->recoveryWakeupLatch);
>> }
> 
> If XLogCtl->recoveryWakeupLatch is set to NULL between the if and the SetLatch, you'll still get a segfault. That's
highlyunlikely to happen in practice because the compiler will optimize that into a single load instruction, but could
happenwith -O0. I think you'd need to do the access only once, using a volatile pointer, to make that safe. Maybe it's
simplerto just not reset it to NULL, after all.
 

Yes, you're right. I agree it's simpler to remove the code resetting
the latch to NULL. Also as the comment for that code explains,
basically it's not necessary to reset it to NULL.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: pgsql: Get rid of the dedicated latch for signaling the startup process

From
Fujii Masao
Date:

On 2020/11/05 6:02, Fujii Masao wrote:
> 
> 
> On 2020/11/05 5:36, Heikki Linnakangas wrote:
>> On 04/11/2020 15:17, Heikki Linnakangas wrote:
>>> On 04/11/2020 14:03, Fujii Masao wrote:
>>>> Or ISTM that WakeupRecovery() should set the latch only when the latch
>>>> has not been reset to NULL yet.
>>>
>>> Got to be careful with race conditions, if the latch is set to NULL at
>>> the same time that WakeupRecovery() is called.
>>
>> I don't think commit 113d3591b8 got this quite right:
>>
>>> void
>>> WakeupRecovery(void)
>>> {
>>>     if (XLogCtl->recoveryWakeupLatch)
>>>         SetLatch(XLogCtl->recoveryWakeupLatch);
>>> }
>>
>> If XLogCtl->recoveryWakeupLatch is set to NULL between the if and the SetLatch, you'll still get a segfault. That's
highlyunlikely to happen in practice because the compiler will optimize that into a single load instruction, but could
happenwith -O0. I think you'd need to do the access only once, using a volatile pointer, to make that safe.
 

On second thought, since fetching the latch pointer might not be atomic,
maybe we need to use spinlock like WalRcvForceReply() does. But using
spinlock in every calls of WakeupRecovery() might cause performance
overhead. So I'm thinking to use spinlock only when it's necessary, i.e.,
when the latch may be reset to NULL concurrently. Attached is the POC
patch implementing that. Thought?

> Maybe it's simpler to just not reset it to NULL, after all.
> 
> Yes, you're right. I agree it's simpler to remove the code resetting
> the latch to NULL. Also as the comment for that code explains,
> basically it's not necessary to reset it to NULL.

On second thought, your comment in upthread "it seems a bit sloppy to
leave it pointing to a latch that belongs to a random process though."
seems right. So I'm inclined to avoid this approach, for now.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment

Re: pgsql: Get rid of the dedicated latch for signaling the startup process

From
Fujii Masao
Date:

On 2020/11/05 23:32, Fujii Masao wrote:
> 
> 
> On 2020/11/05 6:02, Fujii Masao wrote:
>>
>>
>> On 2020/11/05 5:36, Heikki Linnakangas wrote:
>>> On 04/11/2020 15:17, Heikki Linnakangas wrote:
>>>> On 04/11/2020 14:03, Fujii Masao wrote:
>>>>> Or ISTM that WakeupRecovery() should set the latch only when the latch
>>>>> has not been reset to NULL yet.
>>>>
>>>> Got to be careful with race conditions, if the latch is set to NULL at
>>>> the same time that WakeupRecovery() is called.
>>>
>>> I don't think commit 113d3591b8 got this quite right:
>>>
>>>> void
>>>> WakeupRecovery(void)
>>>> {
>>>>     if (XLogCtl->recoveryWakeupLatch)
>>>>         SetLatch(XLogCtl->recoveryWakeupLatch);
>>>> }
>>>
>>> If XLogCtl->recoveryWakeupLatch is set to NULL between the if and the SetLatch, you'll still get a segfault. That's
highlyunlikely to happen in practice because the compiler will optimize that into a single load instruction, but could
happenwith -O0. I think you'd need to do the access only once, using a volatile pointer, to make that safe.
 
> 
> On second thought, since fetching the latch pointer might not be atomic,
> maybe we need to use spinlock like WalRcvForceReply() does. But using
> spinlock in every calls of WakeupRecovery() might cause performance
> overhead. So I'm thinking to use spinlock only when it's necessary, i.e.,
> when the latch may be reset to NULL concurrently. Attached is the POC
> patch implementing that. Thought?

Previously I added this patch to next CommitFest. But I reverted the commit
ac22929a26 and 113d3591b8 because those changes have other issue. So this
patch is no longer necessary, and I dropped it from next CommitFest.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION