Re: Add Information during standby recovery conflicts - Mailing list pgsql-hackers

From Fujii Masao
Subject Re: Add Information during standby recovery conflicts
Date
Msg-id a802f1c0-58d9-bd3f-bc3e-bdad54726855@oss.nttdata.com
Whole thread Raw
In response to Re: Add Information during standby recovery conflicts  (Fujii Masao <masao.fujii@oss.nttdata.com>)
List pgsql-hackers

On 2020/12/15 15:40, Fujii Masao wrote:
> 
> 
> On 2020/12/15 12:04, Kyotaro Horiguchi wrote:
>> At Tue, 15 Dec 2020 02:00:21 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
>>> Thanks for the review! I'm thinking to wait half a day before
>>> commiting
>>> the patch just in the case someone may object the patch.
>>
>> Sorry for coming late.  I have looked only the latest thread so I
>> should be missing many things so please ignore if it was settled in
>> the past discussion.
>>
>>
>> It emits messages like the follows;
>>
>> [40509:startup] LOG:  recovery still waiting after 1021.431 ms: recovery conflict on lock
>> [40509:startup] DETAIL:  Conflicting processes: 41171, 41194.
>> [40509:startup] CONTEXT:  WAL redo at 0/3013158 for Standby/LOCK: xid 510 db 13609 rel 16384
>>
>> IFAIK DETAIL usually shows ordinary sentences so the first word is
>> capitalized and ends with a period. But it is not a sentence so
>> following period looks odd.  a searcheing the tree for errdetails
>> showed some anomalies.
>>
>> src/backend/parser/parse_param.c                      errdetail("%s versus %s",
>> src/backend/jit/llvm/llvmjit_error.cpp              errdetail("while in LLVM")));
>> src/backend/replication/logical/tablesync.c                  errdetail("The error was: %s", res->err)));
>> src/backend/tcop/postgres.c                 errdetail("prepare: %s", pstmt->plansource->query_string);
>> src/backend/tcop/postgres.c         errdetail("abort reason: recovery conflict");
>>
>> and one similar occurance:
>>
>> src/backend/utils/adt/dbsize.c                      errdetail("Invalid size unit: \"%s\".", strptr),
>>
>> I'm not sure, but it seems to me at least the period is unnecessary
>> here.
> 
> Since Error Message Style Guide in the docs says "Detail and hint messages:
> Use complete sentences, and end each with a period.", I think that a period
> is necessary here. No?
> 
> 
>>
>>
>> +            bool        maybe_log_conflict =
>> +            (standbyWaitStart != 0 && !logged_recovery_conflict);
>>
>> odd indentation.
> 
> This is the result of pgindent run. I'm not sure why pgindent indents
> that way, but for now I'd like to follow pgindent.
> 
> 
>>
>>
>> +        /* Also, set the timer if necessary */
>> +        if (logging_timer)
>> +        {
>> +            timeouts[cnt].id = STANDBY_LOCK_TIMEOUT;
>> +            timeouts[cnt].type = TMPARAM_AFTER;
>> +            timeouts[cnt].delay_ms = DeadlockTimeout;
>> +            cnt++;
>> +        }
>>
>> This doesn't consider spurious wakeup. I'm not sure it actually
>> happenes but we usually consider that.  That is since before this
>> patch, but ProcWaitForSignal()'s comment says that:
>>
>>>   * As this uses the generic process latch the caller has to be robust against
>>>   * unrelated wakeups: Always check that the desired state has occurred, and
>>>   * wait again if not.
>>
>> If we don't care of spurious wakeups, we should add such a comment.
> 
> If ProcWaitForSignal() wakes up because of the reason (e.g., SIGHUP)
> other than deadlock_timeout, ProcSleep() will call
> ResolveRecoveryConflictWithLock() and we sleep on ProcWaitForSignal()
> again since the recovery conflict has not been resolved yet. So we can
> say that we consider "spurious wakeups"?
> 
> However when I read the related code again, I found another issue in
> the patch. In ResolveRecoveryConflictWithLock(), if SIGHUP causes us to
> wake up out of ProcWaitForSignal() before deadlock_timeout is reached,
> we will use deadlock_timeout again when sleeping on ProcWaitForSignal().
> Instead, probably we should use the "deadlock_timeout - elapsed time"
> so that we can emit a log message as soon as deadlock_timeout passes
> since starting waiting on recovery conflict. Otherwise it may take at most
> "deadlock_timeout * 2" to log "still waiting ..." message.
> 
> To fix this issue, "deadlock_timeout - elapsed time" needs to be used as
> the timeout when enabling the timer at least in
> ResolveRecoveryConflictWithLock() and ResolveRecoveryConflictWithBufferPin().
> Also similar change needs to be applied to
> ResolveRecoveryConflictWithVirtualXIDs().
> 
> BTW, without applying the patch, *originally*
> ResolveRecoveryConflictWithBufferPin() seems to have this issue.
> It enables deadlock_timeout timer so as to request for hot-standbfy
> backends to check themselves for deadlocks. But if we wake up out of
> ProcWaitForSignal() before deadlock_timeout is reached, the subsequent
> call to ResolveRecoveryConflictWithBufferPin() also uses deadlock_timeout
> again instead of "deadlock_timeout - elapsed time". So the request for
> deadlock check can be delayed. Furthermore,
> if ResolveRecoveryConflictWithBufferPin() always wakes up out of
> ProcWaitForSignal() before deadlock_timeout is reached, the request
> for deadlock check may also never happen infinitely.
> 
> Maybe we should fix the original issue at first separately from the patch.

Hmm... commit ac22929a26 seems to make the thing worse. Before that commit,
other wakeup request like SIGHUP didn't cause ProcWaitForSignal() to
actually wake up in ResolveRecoveryConflictWithBufferPin(). Because such
other wakeup requests use the different latch from that that
ProcWaitForSignal() waits on.

But commit ac22929a26 changed the startup process code so that they
both use the same latch. Which could cause ProcWaitForSignal() to be
more likely to wake up because of the requests other than deadlock_timeout.

Maybe we need to revert commit ac22929a26.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: Add Information during standby recovery conflicts
Next
From: Noah Misch
Date:
Subject: Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)