Re: when the startup process doesn't - Mailing list pgsql-hackers

From SATYANARAYANA NARLAPURAM
Subject Re: when the startup process doesn't
Date
Msg-id CAHg+QDfjBSvX28yoa8XUL80v=+ukHsUc36fh1bxgR=f2MRgjTw@mail.gmail.com
Whole thread Raw
In response to when the startup process doesn't  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
+1 for both log messages and allowing connections. I believe these two complement each other. 

In the cloud world, we oftentimes want to monitor the progress of the recovery without connecting to the server as the operators don't necessarily have the required permissions to connect and query. Secondly, having this information in the log helps going back in time and understand where Postgres spent time during recovery.

The ability to query the server provides real time information  and come handy.

Thanks,
Satya



On Mon, Apr 19, 2021 at 10:55 AM Robert Haas <robertmhaas@gmail.com> wrote:
Hi,

I've noticed that customers not infrequently complain that they start
postgres and then the system doesn't come up for a while and they have
no idea what's going on and are (understandably) worried. There are
probably a number of reasons why this can happen, but the ones that
seem to come up most often in my experience are (1) SyncDataDirectory
takes a long time, (b) ResetUnloggedRelations takes a long time, and
(c) there's a lot of WAL to apply so that takes a long time. It's
possible to distinguish this last case from the other two by looking
at the output of 'ps', but that's not super-convenient if your normal
method of access to the server is via libpq, and it only works if you
are monitoring it as it's happening rather than looking at the logs
after-the-fact. I am not sure there's any real way to distinguish the
other two cases without using strace or gdb or similar.

It seems to me that we could do better. One approach would be to try
to issue a log message periodically - maybe once per minute, or some
configurable interval, e.g. perhaps add messages something like this:

LOG:  still syncing data directory, elapsed time %ld.%03d ms, current path %s
LOG:  data directory sync complete after %ld.%03d ms
LOG:  still resetting unlogged relations, elapsed time %ld.%03d ms,
current path %s
LOG:  unlogged relations reset after %ld.%03d ms
LOG:  still performing crash recovery, elapsed time %ld.%03d ms,
current LSN %08X/%08X

We already have a message when redo is complete, so there's no need
for another one. The implementation here doesn't seem too hard either:
the startup process would set a timer, when the timer expires the
signal handler sets a flag, at a convenient point we notice the flag
is set and responding by printing a message and clearing the flag.

Another possible approach would be to accept connections for
monitoring purposes even during crash recovery. We can't allow access
to any database at that point, since the system might not be
consistent, but we could allow something like a replication connection
(the non-database-associated variant). Maybe it would be precisely a
replication connection and we'd just refuse all but a subset of
commands, or maybe it would be some other kinds of thing. But either
way you'd be able to issue a command in some mini-language saying "so,
tell me how startup is going" and it would reply with a result set of
some kind.

If I had to pick one of these two ideas, I'd pick the one the
log-based solution, since it seems easier to access and simplifies
retrospective analysis, but I suspect SQL access would be quite useful
for some users too, especially in cloud environments where "just log
into the machine and have a look" is not an option.

Thoughts?

--
Robert Haas
EDB: http://www.enterprisedb.com


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: fix old confusing JSON example
Next
From: SATYANARAYANA NARLAPURAM
Date:
Subject: Re: Synchronous commit behavior during network outage