Re: Some problems of recovery conflict wait events - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Some problems of recovery conflict wait events
Date
Msg-id CA+fd4k5NF=aEr50Mo_RhGCTjVum0GC_e6czbOvTckhq6yrTmAw@mail.gmail.com
Whole thread Raw
In response to Re: Some problems of recovery conflict wait events  (Fujii Masao <masao.fujii@oss.nttdata.com>)
Responses Re: Some problems of recovery conflict wait events
List pgsql-hackers
On Tue, 10 Mar 2020 at 00:57, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
>
>
> On 2020/03/09 14:10, Masahiko Sawada wrote:
> > On Mon, 9 Mar 2020 at 13:24, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> >>
> >>
> >>
> >> On 2020/03/08 13:52, Masahiko Sawada wrote:
> >>> On Thu, 5 Mar 2020 at 20:16, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2020/03/05 16:58, Masahiko Sawada wrote:
> >>>>> On Wed, 4 Mar 2020 at 15:21, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 2020/03/04 14:31, Masahiko Sawada wrote:
> >>>>>>> On Wed, 4 Mar 2020 at 13:48, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 2020/03/04 13:27, Michael Paquier wrote:
> >>>>>>>>> On Wed, Mar 04, 2020 at 01:13:19PM +0900, Masahiko Sawada wrote:
> >>>>>>>>>> Yeah, so 0001 patch sets existing wait events to recovery conflict
> >>>>>>>>>> resolution. For instance, it sets (PG_WAIT_LOCK | LOCKTAG_TRANSACTION)
> >>>>>>>>>> to the recovery conflict on a snapshot. 0003 patch improves these wait
> >>>>>>>>>> events by adding the new type of wait event such as
> >>>>>>>>>> WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT. Therefore 0001 (and 0002) patch
> >>>>>>>>>> is the fix for existing versions and 0003 patch is an improvement for
> >>>>>>>>>> only PG13. Did you mean even 0001 patch doesn't fit for back-patching?
> >>>>>>>>
> >>>>>>>> Yes, it looks like a improvement rather than bug fix.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Okay, understand.
> >>>>>>>
> >>>>>>>>> I got my eyes on this patch set.  The full patch set is in my opinion
> >>>>>>>>> just a set of improvements, and not bug fixes, so I would refrain from
> >>>>>>>>> back-backpatching.
> >>>>>>>>
> >>>>>>>> I think that the issue (i.e., "waiting" is reported twice needlessly
> >>>>>>>> in PS display) that 0002 patch tries to fix is a bug. So it should be
> >>>>>>>> fixed even in the back branches.
> >>>>>>>
> >>>>>>> So we need only two patches: one fixes process title issue and another
> >>>>>>> improve wait event. I've attached updated patches.
> >>>>>>
> >>>>>> Thanks for updating the patches! I started reading 0001 patch.
> >>>>>
> >>>>> Thank you for reviewing this patch.
> >>>>>
> >>>>>>
> >>>>>> -                       /*
> >>>>>> -                        * Report via ps if we have been waiting for more than 500 msec
> >>>>>> -                        * (should that be configurable?)
> >>>>>> -                        */
> >>>>>> -                       if (update_process_title && new_status == NULL &&
> >>>>>> -                               TimestampDifferenceExceeds(waitStart, GetCurrentTimestamp(),
> >>>>>> -                                                                                  500))
> >>>>>>
> >>>>>> The patch changes ResolveRecoveryConflictWithSnapshot() and
> >>>>>> ResolveRecoveryConflictWithTablespace() so that they always add
> >>>>>> "waiting" into the PS display, whether wait is really necessary or not.
> >>>>>> But isn't it better to display "waiting" in PS basically when wait is
> >>>>>> necessary, like originally ResolveRecoveryConflictWithVirtualXIDs()
> >>>>>> does as the above?
> >>>>>
> >>>>> You're right. Will fix it.
> >>>>>
> >>>>>>
> >>>>>>      ResolveRecoveryConflictWithDatabase(Oid dbid)
> >>>>>>      {
> >>>>>> +       char            *new_status = NULL;
> >>>>>> +
> >>>>>> +       /* Report via ps we are waiting */
> >>>>>> +       new_status = set_process_title_waiting();
> >>>>>>
> >>>>>> In  ResolveRecoveryConflictWithDatabase(), there seems no need to
> >>>>>> display "waiting" in PS because no wait occurs when recovery conflict
> >>>>>> with database happens.
> >>>>>
> >>>>> Isn't the startup process waiting for other backend to terminate?
> >>>>
> >>>> Yeah, you're right. I agree that "waiting" should be reported in this case.
> >>>>
> >>>> Currently ResolveRecoveryConflictWithLock() and
> >>>> ResolveRecoveryConflictWithBufferPin() don't call
> >>>> ResolveRecoveryConflictWithVirtualXIDs and don't report "waiting"
> >>>> in PS display. You changed them so that they report "waiting". I agree
> >>>> to have this change. But this change is an improvement rather than
> >>>> a bug fix, i.e., we should apply this change only in v13?
> >>>>
> >>>
> >>> Did you mean ResolveRecoveryConflictWithDatabase and
> >>> ResolveRecoveryConflictWithBufferPin?
> >>
> >> Yes! Sorry for my typo.
> >>
> >>> In the current code as far as I
> >>> researched there are two cases where we don't add "waiting" and one
> >>> case where we doubly add "waiting".
> >>>
> >>> ResolveRecoveryConflictWithDatabase and
> >>> ResolveRecoveryConflictWithBufferPin don't update the ps title.
> >>> Although the path where GetCurrentTimestamp() >= ltime is false in
> >>> ResolveRecoveryConflictWithLock also doesn't update the ps title, it's
> >>> already updated in WaitOnLock. On the other hand, the path where
> >>> GetCurrentTimestamp() >= ltime is true in
> >>> ResolveRecoveryConflictWithLock updates the ps title but it's wrong as
> >>> I reported.
> >>>
> >>> I've split the patch into two patches: 0001 patch fixes the issue
> >>> about doubly updating ps title, 0002 patch makes the recovery conflict
> >>> resolution on database and buffer pin update the ps title.
> >>
> >> Thanks for splitting the patches. I think that 0001 patch can be back-patched.
> >>
> >> -                       /*
> >> -                        * Report via ps if we have been waiting for more than 500 msec
> >> -                        * (should that be configurable?)
> >> -                        */
> >> -                       if (update_process_title && new_status == NULL &&
> >> -                               TimestampDifferenceExceeds(waitStart, GetCurrentTimestamp(),
> >> -                                                                                  500))
> >>
> >> Originally, "waiting" is reported in PS if we've been waiting for more than
> >> 500 msec, as the above does. But you got rid of those codes in the patch.
> >> Did you confirm that it's safe to do that? If not, isn't it better to apply
> >> the attached patch?
> >
> > In WaitOnLock() we update the ps title regardless of waiting time. So
> > I thought we can change it to make these behavior consistent. But
> > considering back-patch, your patch looks better than mine.
>
> Yeah, so I pushed the 0001 patch at first!
> I will review the other patches later.

Thank you!

For 0002 patch which makes ResolveRecoveryConflictWithDatabase and
ResolveRecoveryConflictWithBufferPin update the ps title, I think
these are better to wait for 5ms before updating the ps title like
ResolveRecoveryConflictWithVirtualXIDs, for consistency among recovery
conflict resolution functions, but what do you think?

Regards,


--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: [Patch] pg_rewind: options to use restore_command fromrecovery.conf or command line
Next
From: Justin Pryzby
Date:
Subject: Re: Berserk Autovacuum (let's save next Mandrill)