Re: Add session statistics to pg_stat_database - Mailing list pgsql-hackers
From | Magnus Hagander |
---|---|
Subject | Re: Add session statistics to pg_stat_database |
Date | |
Msg-id | CABUevEzx9bFWkarv7rc+CCZemG-77y0wHOn8fg3e-OePVS=D-Q@mail.gmail.com Whole thread Raw |
In response to | Re: Add session statistics to pg_stat_database (Laurenz Albe <laurenz.albe@cybertec.at>) |
Responses |
Re: Add session statistics to pg_stat_database
|
List | pgsql-hackers |
On Fri, Nov 20, 2020 at 3:41 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Tue, 2020-11-17 at 17:33 +0100, Magnus Hagander wrote:
> I've taken a look as well, and here are a few short notes:
Much appreciated!
Sorry about the delay in getting back to you on this one. FYI, while the patch has been bumped to the next CF by now, I do intend to continue working on it before that starts.
> * It talks about "number of connections" but "number of aborted sessions". We should probably
> be consistent about talking either about connections or sessions? In particular, connections
> seems wrong in this case, because it only starts counting after authentication is complete
> (since otherwise we send no stats)? (This goes for both docs and actual function names)
Yes, that is true. I have changed "connections" to "sessions" and renamed the new
column "connections" to "session_count".
I think that most people will understand a session as started after a successful
connection.
Yeah, I agree, and as long as it's consistent we don't need more explanations than that.
Further int he views, it's a bit strange to have session_count and aborted_session, but I'm not sure what to suggest. "aborted_session_count" seems too long. Maybe just "sessions" instead of "session_count" -- no other counters actually have the "_count" suffix.
> * Is this actually a fix that's independent of the new stats? It seems in general to be
> changing the behaviour of "force", which is more generic?
> - !have_function_stats)
> + !have_function_stats && !force)
The comment right above that reads:
/* Don't expend a clock check if nothing to do */
So it is just a quick exit if there is nothing to do.
But with that patch we have something to do if "force" (see below) is true:
Report the remaining session duration and if the session was closed normally.
Thus the additional check.
Ah yeah, makes sense. It becomes more clear with the rename.
> * in pgstat_send_connstat() you pass the parameter "force" in as "disconnect".
> That behaviour at least requires a comment saying why, I think. My understanding is
> it relies on that "force" means this is a "backend is shutting down", but that is not
> actually documented anywhere. Maybe the "force" parameter should actually be renamed
> to indicate this is really what it means, to avoid a future mistake in the area?
> But even with that, how does that turn into disconnect?
"pgstat_report_stat(true)" is only called from "pgstat_beshutdown_hook()", so
it is currently only called when the backend is about to exit.
According the the comments the flag means that "caller wants to force stats out".
I guess that the author thought that there may arise other reasons to force sending
statistics in the future (commit 641912b4d from 2007).
However, since that has not happened, I have renamed the flag to "disconnect" and
adapted the documentation. This doesn't change the current behavior, but establishes
a new rule.
That makes it a lot more clear. And I agree, if nobody came up with a reason since 2007, then we are free to repurpose it :)
> * Maybe rename pgStatSessionDisconnected to pgStatSessionNormalDisconnected?
> To avoid having to go back to the setting point and look it up in a comment.
Long, descriptive names are a good thing.
I have decided to use "pgStatSessionDisconnectedNormally", since that is even longer
and seems to fit the "yes or no" category better.
WFM.
> I wonder if there would also be a way to count "sessions that crashed" as well.
> That is,the ones that failed in a way that caused the postmaster to restart the system.
> But that's information we'd have to send from the postmaster, but I'm actually unsure
> if we're "allowed" to send things to the stats collector from the postmaster.
> But I think it could be quite useful information to have. Maybe we can find some way
> to piggyback on the fact that we're restarting the stats collector as a result?
Sure, a crash count would be useful. I don't know if it is easy for the stats collector
to tell the difference between a start after a backend crash and - say - starting from
a base backup.
Patch v6 attached.
I think that that would be material for another patch, and I don't think it should go
to "pg_stat_database", because a) it might be hard to tell to which database the crashed
backend was attached, b) it might be a background process that doesn't belong to a database
and c) if the crash were caused by - say - corruption in a shared catalog, it would be
misleading
I'm not sure it is outside the scope of this patch, because I think it might be easier to do than I (and I think you) first thought. We don't need to track which database crashed -- if we track all *other* ways a database exits, then crashes are all that remains.
So in fact, we *almost* have all the data we need already. We have the number of sessions started. We have the number of sessions "aborted". if we also had the number of sessions that were closed normally, then whatever is "left" would be the number of sessions crashed. And we do already, in your patch, send the message in the case of both aborted and non-aborted sessions. So we just need to keep track of both in the statsfile (which we don't now), and we'd more or less have it, wouldn't we?
However, some thinking around that also leads me to another question which is very much in scope for this patch regardless, which is what about shutdown and admin termination. Right now, when you do a "pg_ctl stop" on the database, all sessions count as aborted. Same thing for a pg_terminate_backend(). I wonder if this is also a case that would be useful to track as a separate thing? One could argue that the docs in your patch say aborted means "terminated by something else than a regular client disconnection". But that's true for a "shutdown", but not for a crash, so whichever way we go with crashes it's slightly incorrect.
But thinking from a usability perspective, wouldn't what we want more be something like <closed by correct disconnect>, <closed by abnormal disconnect>, <closed by admin>, <crash>?
What do you think of adapting it to that?
Basically, that would change pgStatSessionDisconnectedNormally into instead being an enum of reasons, which could be normal disconnect, abnormal disconnect and admin. And we'd track all those three as separate numbers in the stats file, meaning we could then calculate the crash by subtracting all three from the total number of sessions?
(Let me know if you think the idea could work and would prefer it if I worked up a complete suggestion based on it rather than just spitting ideas)
pgsql-hackers by date: