Re: when the startup process doesn't - Mailing list pgsql-hackers
From | Nitin Jadhav |
---|---|
Subject | Re: when the startup process doesn't |
Date | |
Msg-id | CAMm1aWYAjwui2OrY6TSvVF8N9YLdRdnO5BUS-t5GDR60X1FanQ@mail.gmail.com Whole thread Raw |
In response to | Re: when the startup process doesn't (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>) |
Responses |
Re: when the startup process doesn't
|
List | pgsql-hackers |
Hi, As nobody opposed the log based approach, I thought of creating a patch using this approach. Please find the patch attached. Introduced the new GUC variable 'log_min_duration_startup_process' which indicates the time period after every time period it logs the progress of the process startup when it is set to a value (In millisecond) greater than zero. if it is set to zero, then it logs all available messages. If it is set to -1, then disables the feature. > There are probably a number of reasons why this can happen, but the > ones that seem to come up most often in my experience are > (1) SyncDataDirectory takes a long time, (b) ResetUnloggedRelations > takes a long time, and (c) there's a lot of WAL to apply so that takes a > long time. I have added the proper logs in all of the above scenarios. Following is the sample log displayed during server startup when the time period is set to 10ms. 2021-06-04 19:40:06.390 IST [51116] LOG: Syncing data directory, elapsed time: 14.165 ms, current path: ./base/13892/16384_fsm 2021-06-04 19:40:06.399 IST [51116] LOG: Syncing data directory completed after 22.661 ms 2021-06-04 19:40:06.399 IST [51116] LOG: database system was not properly shut down; automatic recovery in progress 2021-06-04 19:40:06.401 IST [51116] LOG: Resetting unlogged relations completed after 0.219 ms 2021-06-04 19:40:06.401 IST [51116] LOG: redo starts at 0/4728B88 2021-06-04 19:40:06.411 IST [51116] LOG: Performing crash recovery, elapsed time: 10.002 ms, current LSN: 0/47AA998 2021-06-04 19:40:06.421 IST [51116] LOG: Performing crash recovery, elapsed time: 20.002 ms, current LSN: 0/4838D80 2021-06-04 19:40:06.431 IST [51116] LOG: Performing crash recovery, elapsed time: 30.002 ms, current LSN: 0/48DA718 2021-06-04 19:40:06.441 IST [51116] LOG: Performing crash recovery, elapsed time: 40.002 ms, current LSN: 0/49791C0 . . . 2021-06-04 19:40:07.222 IST [51116] LOG: Performing crash recovery, elapsed time: 820.805 ms, current LSN: 0/76F6F10 2021-06-04 19:40:07.227 IST [51116] LOG: invalid record length at 0/774E758: wanted 24, got 0 2021-06-04 19:40:07.227 IST [51116] LOG: redo done at 0/774E730 system usage: CPU: user: 0.77 s, system: 0.03 s, elapsed: 0.82 s Kindly let me know if any changes are required. Thanks & Regards, Nitin Jadhav On Thu, Apr 22, 2021 at 4:39 AM Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote: > > On Wed, 21 Apr 2021 12:36:05 -0700 > Andres Freund <andres@anarazel.de> wrote: > > > [...] > > > > I don't think that concern equally applies for what I am proposing > > here. For one, we already have minRecoveryPoint in ControlData, and we > > already use it for the purpose of determining where we need to recover > > to, albeit only during crash recovery. Imo that's substantially > > different from adding actual recovery progress status information to the > > control file. > > Just for the record, when I was talking about updating status of the startup > in the controldata, I was thinking about setting the last known LSN replayed. > Not some kind of variable string. > > > > > I also think that it'd actually be a significant reliability improvement > > if we maintained an approximate minRecoveryPoint during normal running: > > I've seen way too many cases where WAL files were lost / removed and > > crash recovery just started up happily. Only hitting problems months > > down the line. Yes, it'd obviously not bullet proof, since we'd not want > > to add a significant stream of new fsyncs, but IME such WAL files > > lost/removed issues tend not to be about a few hundred bytes of WAL but > > many segments missing. > > Maybe setting this minRecoveryPoint once per segment would be enough, near > from the beginning of the WAL. At least, the recovery process would be > forced to actually replay until the very last known segment. > > Regards, > >
Attachment
pgsql-hackers by date: