Re: when the startup process doesn't - Mailing list pgsql-hackers

From Nitin Jadhav
Subject Re: when the startup process doesn't
Date
Msg-id CAMm1aWYAjwui2OrY6TSvVF8N9YLdRdnO5BUS-t5GDR60X1FanQ@mail.gmail.com
Whole thread Raw
In response to Re: when the startup process doesn't  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
Responses Re: when the startup process doesn't
List pgsql-hackers
Hi,

As nobody opposed the log based approach, I thought of creating a
patch using this approach. Please find the patch attached.

Introduced the new GUC variable 'log_min_duration_startup_process'
which indicates the time period after every time period it logs the
progress of the process startup when it is set to a value (In
millisecond) greater than zero. if it is set to zero, then it logs all
available messages. If it is set to -1, then disables the feature.

> There are probably a number of reasons why this can happen, but the
> ones that seem to come up most often in my experience are
> (1) SyncDataDirectory takes a long time, (b) ResetUnloggedRelations
> takes a long time, and (c) there's a lot of WAL to apply so that takes a
> long time.

I have added the proper logs in all of the above scenarios.

Following is the sample log displayed during server startup when the
time period is set to 10ms.

2021-06-04 19:40:06.390 IST [51116] LOG:  Syncing data directory,
elapsed time: 14.165 ms,  current path: ./base/13892/16384_fsm
2021-06-04 19:40:06.399 IST [51116] LOG:  Syncing data directory
completed after 22.661 ms
2021-06-04 19:40:06.399 IST [51116] LOG:  database system was not
properly shut down; automatic recovery in progress
2021-06-04 19:40:06.401 IST [51116] LOG:  Resetting unlogged relations
completed after 0.219 ms
2021-06-04 19:40:06.401 IST [51116] LOG:  redo starts at 0/4728B88
2021-06-04 19:40:06.411 IST [51116] LOG:  Performing crash recovery,
elapsed time: 10.002 ms,  current LSN: 0/47AA998
2021-06-04 19:40:06.421 IST [51116] LOG:  Performing crash recovery,
elapsed time: 20.002 ms,  current LSN: 0/4838D80
2021-06-04 19:40:06.431 IST [51116] LOG:  Performing crash recovery,
elapsed time: 30.002 ms,  current LSN: 0/48DA718
2021-06-04 19:40:06.441 IST [51116] LOG:  Performing crash recovery,
elapsed time: 40.002 ms,  current LSN: 0/49791C0
.
.
.
2021-06-04 19:40:07.222 IST [51116] LOG:  Performing crash recovery,
elapsed time: 820.805 ms,  current LSN: 0/76F6F10
2021-06-04 19:40:07.227 IST [51116] LOG:  invalid record length at
0/774E758: wanted 24, got 0
2021-06-04 19:40:07.227 IST [51116] LOG:  redo done at 0/774E730
system usage: CPU: user: 0.77 s, system: 0.03 s, elapsed: 0.82 s

Kindly let me know if any changes are required.

Thanks & Regards,
Nitin Jadhav

On Thu, Apr 22, 2021 at 4:39 AM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:
>
> On Wed, 21 Apr 2021 12:36:05 -0700
> Andres Freund <andres@anarazel.de> wrote:
>
> >  [...]
> >
> > I don't think that concern equally applies for what I am proposing
> > here. For one, we already have minRecoveryPoint in ControlData, and we
> > already use it for the purpose of determining where we need to recover
> > to, albeit only during crash recovery. Imo that's substantially
> > different from adding actual recovery progress status information to the
> > control file.
>
> Just for the record, when I was talking about updating status of the startup
> in the controldata, I was thinking about setting the last known LSN replayed.
> Not some kind of variable string.
>
> >
> > I also think that it'd actually be a significant reliability improvement
> > if we maintained an approximate minRecoveryPoint during normal running:
> > I've seen way too many cases where WAL files were lost / removed and
> > crash recovery just started up happily. Only hitting problems months
> > down the line. Yes, it'd obviously not bullet proof, since we'd not want
> > to add a significant stream of new fsyncs, but IME such WAL files
> > lost/removed issues tend not to be about a few hundred bytes of WAL but
> > many segments missing.
>
> Maybe setting this minRecoveryPoint once per segment would be enough, near
> from the beginning of the WAL. At least, the recovery process would be
> forced to actually replay until the very last known segment.
>
> Regards,
>
>

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Fixup some appendStringInfo and appendPQExpBuffer calls
Next
From: Robert Haas
Date:
Subject: Re: Race condition in recovery?