Re: when the startup process doesn't (logging startup delays) - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: when the startup process doesn't (logging startup delays) |
Date | |
Msg-id | CA+TgmoZwJTw4xUwJ-rHhUZNgcr5Eb0DJJJxJF8A=ycm1oJ9pxg@mail.gmail.com Whole thread Raw |
In response to | Re: when the startup process doesn't (logging startup delays) (Nitin Jadhav <nitinjadhavpostgres@gmail.com>) |
Responses |
Re: when the startup process doesn't (logging startup delays)
|
List | pgsql-hackers |
On Tue, Aug 3, 2021 at 2:48 AM Nitin Jadhav <nitinjadhavpostgres@gmail.com> wrote: > Implemented the above approach and verified the patch. Kindly have a > look and share your thoughts. +LogStartupProgressTimerExpired(bool force, long *secs, int *usecs) The name of this function begins with "log" but it does not log anything, so that's probably a sign that you should rethink the name of the function. I suggested startup_progress_timer_expired() upthread, but now I think maybe we should call it startup_progress_timer_has_expired() and then renaming the Boolean you've called logStartupProgressTimeout to startup_progress_timer_expired. Also, the argument "bool force" doesn't really make sense for this function, which is why I suggested above calling it "bool done" instead. + elapsed_ms = (seconds * 1000) + (useconds / 1000); + interval_in_ms = log_startup_progress_interval * 1000; + enable_timeout_after(LOG_STARTUP_PROGRESS_TIMEOUT, + (interval_in_ms - (elapsed_ms % interval_in_ms))); This will work correctly only if elapsed_ms is equal to interval_in_ms or slightly greater than interval_ms. But if elapsed_ms is greater than two times interval_ms, then it will produce pretty much random results. If elapsed_ms is negative because the system clock gets set backward, a possibility I've already mentioned to you in a previous review, then it will also misbehave. I actually don't think enable_timeout_after() is very easy to use for this kind of thing. At least for me, it's way easier to think about calculating the timestamp at which I want the timer to expire. Maybe something along these lines: next_timeout = last_startup_progress_timeout + interval; if (next_timeout < now) { // Either the timeout was processed so late that we missed an entire cycle, // or the system clock was set backwards. next_timeout = now + interval; } enable_timeout_at(next_timeout); Also, I said before that I thought it was OK that you were logging a line at the end of every operation as well as after every N milliseconds. But, the more I think about it, the less I like it. We already have a 'redo done' line that shows up at the end of redo, which the patch wisely does not duplicate. But it's not quite clear that any of these other things are important enough to bother mentioning in the log unless they take a long time. After an immediate shutdown of an empty cluster, with this patch applied, I get 3 extra log messages: 2021-08-03 10:17:49.630 EDT [17567] LOG: data directory sync (fsync) complete after 0.13 s 2021-08-03 10:17:49.633 EDT [17567] LOG: resetting unlogged relations (cleanup) complete after 0.00 s 2021-08-03 10:17:49.635 EDT [17567] LOG: resetting unlogged relations (init) complete after 0.00 s That doesn't seem like information anyone really needs. If it had taken a long time, it would have been worth logging, but in the normal case where it doesn't, it's just clutter. Another funny thing is that, as you've coded it, those additional log lines only appear when log_startup_progress_interval != 0. That's strange. It seems particularly strange because of the existing precedent where 'redo done' appears regardless of any setting, but also because when I set, say, a 10s interval, I guess I expect something to happen every 10s. Making something happen once at the end is different. So I think we should take this out, which would permit simplifying a bunch of code.The four places where you call ereport_startup_progress(true, ...) would go away. ereport_startup_progress() would no longer need a Boolean argument, and neither would LogStartupProgressTimerExpired() / startup_progress_timer_has_expired(). Note that there's no real need to disable the timeout when we're done with it. It's fine if we do, but if we don't, it's also not a big deal; all that happens if we leave the timer scheduled and let it expire is that it will set a Boolean flag that nobody will care about. So what I'm thinking about is that we could just have, say, reset_startup_progress_timeout() and startup_progress_timeout_has_expired(). reset_startup_progress_timeout() would just do exactly what I showed above to reset the timeout, and you'd call it at the beginning of each phase. And startup_progress_timeout_has_expired() would look roughly like this: if (!startup_progress_timer_expired) return; now = GetCurrentTimestamp(); // compute timestamp difference last_startup_progress_timeout = now; reset_startup_progress_timeout(); With these changes you'd have only 1 place in the code that needs to care about log_startup_progress_interval, as opposed to 3 as you have it currently, and only one place that enables the timeout, as opposed to 2 as you have it currently. I think that would be tidier. I think we also should consider where to put the new functions introduced by this patch, and the GUC. You put them in xlog.c which is reasonable since that is where StartupXLOG() lives. However, xlog.c is also a gigantic file, and xlog.h is included in a lot of places, most of which aren't going to care about the new things you're adding to that file at all. So I'm thinking it might make more sense to put the new code in src/backend/postmaster/startup.c. That is actually a better thematic fit given that this is really about the startup process specifically, not WAL-logging in general. Then reinit.c would include startup.h instead of xlog.h, which seems better, because I don't think we want any actual xlog operations to happen from within that file, so better not to be including xlog.h. The patch currently lacks documentation. It needs to update config.sgml. -- Robert Haas EDB: http://www.enterprisedb.com
pgsql-hackers by date: