Re: Add Information during standby recovery conflicts - Mailing list pgsql-hackers

From Drouvot, Bertrand
Subject Re: Add Information during standby recovery conflicts
Date
Msg-id 29da248a-e21c-a3eb-a051-f1ec79b13d31@amazon.com
Whole thread Raw
In response to Re: Add Information during standby recovery conflicts  ("Drouvot, Bertrand" <bdrouvot@amazon.com>)
Responses Re: Add Information during standby recovery conflicts
Re: Add Information during standby recovery conflicts
List pgsql-hackers
On 8/28/20 4:14 PM, Drouvot, Bertrand wrote:
>
> On 8/28/20 7:03 AM, Masahiko Sawada wrote:
>> CAUTION: This email originated from outside of the organization. Do 
>> not click links or open attachments unless you can confirm the sender 
>> and know the content is safe.
>>
>>
>>
>> On Thu, 27 Aug 2020 at 20:58, Drouvot, Bertrand <bdrouvot@amazon.com> 
>> wrote:
>>>
>>> On 8/27/20 10:16 AM, Masahiko Sawada wrote
>>>> On Mon, 10 Aug 2020 at 23:43, Drouvot, Bertrand 
>>>> <bdrouvot@amazon.com> wrote:
>>>>> Hi,
>>>>>
>>>>> On 7/31/20 7:12 AM, Masahiko Sawada wrote:
>>>>>> +   tmpWaitlist = waitlist;
>>>>>> +   while (VirtualTransactionIdIsValid(*tmpWaitlist))
>>>>>> +   {
>>>>>> +       tmpWaitlist++;
>>>>>> +   }
>>>>>> +
>>>>>> +   num_waitlist_entries = (tmpWaitlist - waitlist);
>>>>>> +
>>>>>> +   /* display wal record information */
>>>>>> +   if (log_recovery_conflicts &&
>>>>>> (TimestampDifferenceExceeds(recovery_conflicts_log_time,
>>>>>> GetCurrentTimestamp(),
>>>>>> +                                   DeadlockTimeout))) {
>>>>>> +       LogBlockedWalRecordInfo(num_waitlist_entries, reason);
>>>>>> +       recovery_conflicts_log_time = GetCurrentTimestamp();
>>>>>> +   }
>>>>>>
>>>>>> recovery_conflicts_log_time is not initialized. And shouldn't we
>>>>>> compare the current timestamp to the timestamp when the startup
>>>>>> process started waiting?
>>>>>>
>>>>>> I think we should call LogBlockedWalRecordInfo() inside of the inner
>>>>>> while loop rather than at the beginning of
>>>>>> ResolveRecoveryConflictWithVirtualXIDs(). In lock conflict cases, 
>>>>>> the
>>>>>> startup process waits until 'ltime', then enters
>>>>>> ResolveRecoveryConflictWithVirtualXIDs() after reaching 'ltime'.
>>>>>> Therefore, it makes sense to call LogBlockedWalRecordInfo() at the
>>>>>> beginning of ResolveRecoveryConflictWithVirtualXIDs(). However, in
>>>>>> snapshot and tablespace conflict cases (i.g.
>>>>>> ResolveRecoveryConflictWithSnapshot() and
>>>>>> ResolveRecoveryConflictWithTablespace()), it enters
>>>>>> ResolveRecoveryConflictWithVirtualXIDs() without waits and waits for
>>>>>> reaching ‘ltime’ inside of the inner while look. So the above
>>>>>> condition could always be false.
>>>>> That would make the information being displayed after
>>>>> max_standby_streaming_delay is reached for the multiple cases you 
>>>>> just
>>>>> described.
>>>> Sorry, it should be deadlock_timeout, not max_standby_streaming_delay.
>>>> Otherwise, the recovery conflict log message is printed when
>>>> resolution, which seems not to achieve the original purpose. Am I
>>>> missing something?
>>> Ok, I understand where the confusion is coming from.
>>>
>>> Indeed the new version is now printing the recovery conflict log 
>>> message
>>> during the conflict resolution (while the initial intention was to be
>>> warned as soon as the replay had to wait).
>>>
>>> The advantage of the new version is that it would be consistent across
>>> all the conflicts scenarios (if not, we would get messages during the
>>> resolution or when the replay started waiting, depending of the 
>>> conflict
>>> scenario).
>>>
>>> On the other hand, the cons of the new version is that we would miss
>>> messages when no resolution is needed (replay wait duration <
>>> max_standby_streaming_delay), but is that really annoying? (As no
>>> cancellation would occur)
>>>
>>> Thinking about it, i like the new version (being warned during the
>>> resolution) as we would get messages only when cancelation will occur
>>> (which is what the user might want to avoid, so the extra info would be
>>> useful).
>>>
>>> What do you think?
>> Hmm, I think we print the reason why backends are canceled even of as
>> now by ProcessInterrupts(). With this patch and related patches you
>> proposed on another thread, the startup process reports virtual xids
>> being interrupted, the reason, and LSN of blocked WAL, then processes
>> will also report its virtual xid and reason. Therefore, the new
>> information added by these patches is only the LSN of blocked WAL.
>
> That's completely right, let's come back to the original intention of 
> this patch (means, don't wait for the conflict resolution to log 
> messages).
>
> I'll submit a new version once updated.

Please find attached the new patch.

It provides the following outcomes depending on the conflict:

2020-10-04 09:08:51.923 UTC [30788] LOG:  recovery is waiting recovery 
conflict on buffer pin

OR

2020-10-04 09:52:25.832 UTC [1249] LOG:  recovery is waiting recovery 
conflict on snapshot
2020-10-04 09:52:25.832 UTC [1249] DETAIL:  Conflicting virtual 
transaction ids: 3/2, 2/4.

OR

2020-10-04 09:11:51.717 UTC [30788] LOG:  recovery is waiting recovery 
conflict on lock
2020-10-04 09:11:51.717 UTC [30788] DETAIL:  Conflicting virtual 
transaction id: 2/5.

OR

2020-10-04 09:13:04.104 UTC [30788] LOG:  recovery is resolving recovery 
conflict on database

Thanks

Bertrand


Attachment

pgsql-hackers by date:

Previous
From: Dilip Kumar
Date:
Subject: Re: Re: [HACKERS] Custom compression methods
Next
From: Alvaro Herrera
Date:
Subject: Re: Add Information during standby recovery conflicts