when the startup process doesn't - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | when the startup process doesn't |
Date | |
Msg-id | CA+TgmoaHQrgDFOBwgY16XCoMtXxsrVGFB2jNCvb7-ubuEe1MGg@mail.gmail.com |
Responses | Re: when the startup process doesn't |
List | pgsql-hackers |
Hi,

I've noticed that customers not infrequently complain that they start postgres and then the system doesn't come up for a while and they have no idea what's going on and are (understandably) worried. There are probably a number of reasons why this can happen, but the ones that seem to come up most often in my experience are (a) SyncDataDirectory takes a long time, (b) ResetUnloggedRelations takes a long time, and (c) there's a lot of WAL to apply so that takes a long time. It's possible to distinguish this last case from the other two by looking at the output of 'ps', but that's not super-convenient if your normal method of access to the server is via libpq, and it only works if you are monitoring it as it's happening rather than looking at the logs after the fact. I am not sure there's any real way to distinguish the other two cases without using strace or gdb or similar.

It seems to me that we could do better. One approach would be to try to issue a log message periodically - maybe once per minute, or some configurable interval, e.g. perhaps add messages something like this:

LOG: still syncing data directory, elapsed time %ld.%03d ms, current path %s
LOG: data directory sync complete after %ld.%03d ms
LOG: still resetting unlogged relations, elapsed time %ld.%03d ms, current path %s
LOG: unlogged relations reset after %ld.%03d ms
LOG: still performing crash recovery, elapsed time %ld.%03d ms, current LSN %08X/%08X

We already have a message when redo is complete, so there's no need for another one. The implementation here doesn't seem too hard either: the startup process would set a timer, when the timer expires the signal handler sets a flag, and at a convenient point we notice the flag is set and respond by printing a message and clearing the flag.

Another possible approach would be to accept connections for monitoring purposes even during crash recovery. We can't allow access to any database at that point, since the system might not be consistent, but we could allow something like a replication connection (the non-database-associated variant). Maybe it would be precisely a replication connection and we'd just refuse all but a subset of commands, or maybe it would be some other kind of thing. But either way you'd be able to issue a command in some mini-language saying "so, tell me how startup is going" and it would reply with a result set of some kind.

If I had to pick one of these two ideas, I'd pick the log-based solution, since it seems easier to access and simplifies retrospective analysis, but I suspect SQL access would be quite useful for some users too, especially in cloud environments where "just log into the machine and have a look" is not an option.

Thoughts?

--
Robert Haas
EDB: http://www.enterprisedb.com
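
The timer-and-flag mechanism described in the message could look roughly like the sketch below. It is only an illustration under stated assumptions: every name in it (progress_log_pending, start_progress_timer, format_progress, maybe_report_progress) is hypothetical and not a PostgreSQL function, it uses plain POSIX sigaction and setitimer rather than whatever timeout facility the startup process would really use, and fprintf to stderr stands in for the server's logging call. The elapsed time is assumed to be tracked in microseconds so that it fits the proposed "%ld.%03d ms" format.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

/* Set by the signal handler; checked and cleared by the work loop. */
static volatile sig_atomic_t progress_log_pending = 0;

/* The handler does the minimum: it only records that the timer fired. */
static void
progress_timer_handler(int signo)
{
    (void) signo;
    progress_log_pending = 1;
}

/*
 * Arm a repeating timer that fires every interval_ms milliseconds.
 * Hypothetical helper; returns 0 on success, -1 on failure.
 */
static int
start_progress_timer(long interval_ms)
{
    struct sigaction sa;
    struct itimerval it;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = progress_timer_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    it.it_interval.tv_sec = interval_ms / 1000;
    it.it_interval.tv_usec = (interval_ms % 1000) * 1000;
    it.it_value = it.it_interval;
    return setitimer(ITIMER_REAL, &it, NULL);
}

/* Build a "still <doing something>" line in the proposed format. */
static int
format_progress(char *buf, size_t len, const char *what,
                long elapsed_us, const char *path)
{
    return snprintf(buf, len,
                    "LOG: still %s, elapsed time %ld.%03d ms, current path %s",
                    what, elapsed_us / 1000, (int) (elapsed_us % 1000), path);
}

/*
 * Call at a convenient point in the work loop. If the timer has fired
 * since the last call, print one message and clear the flag.
 * Returns 1 if a message was printed, 0 otherwise.
 */
static int
maybe_report_progress(const char *what, long elapsed_us, const char *path)
{
    char buf[1024];

    if (!progress_log_pending)
        return 0;
    progress_log_pending = 0;
    format_progress(buf, sizeof(buf), what, elapsed_us, path);
    fprintf(stderr, "%s\n", buf);   /* stand-in for the real log call */
    return 1;
}
```

A caller would invoke start_progress_timer(60000) once before the long-running step and then call maybe_report_progress() once per file or per WAL record. Keeping the handler down to a single flag assignment means all formatting and I/O happen outside signal context, which is the point of the "notice the flag at a convenient point" design.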