Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible - Mailing list pgsql-hackers
From | Jacob Champion |
---|---|
Subject | Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible |
Date | |
Msg-id | CAOYmi+=azjJ064LfZG0pJ6DBjx9Ro0Kzdd+Z2Jyyy=66cqL+8w@mail.gmail.com Whole thread Raw |
In response to | Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible (Andres Freund <andres@anarazel.de>) |
List | pgsql-hackers |
On Thu, Feb 6, 2025 at 12:35 AM Michael Paquier <michael@paquier.xyz> wrote: > /* Update app name to current GUC setting */ > + /* TODO: ask the list: maybe do this before setting STATE_UNDEFINED? */ > if (application_name) > pgstat_report_appname(application_name); > > Historically, we've always reported the application name after > STATE_UNDEFINED is set, never the reverse. There could be potential > security implications if we were to begin reporting the application > name before the connection has fully authenticated because it would be > possible for attempted connections to show malicious data in the > catalogs, and now we're sure that the catalog data is OK to query for > other users. Let's be careful about that. I think that we should > still set that as late as possible in pgstat_bestart_final(), never at > an earlier step as this data can come from a connection > string, potentially malicious if not authenticated yet. I don't want to change the order of authentication and application name, just move the application name report up above the state change in pgstat_bestart_final(). I don't feel strongly about it, though. > + if (!bootstrap) > + { > + pgstat_bestart_initial(); > + if (MyProcPort) > + pgstat_bestart_security(); /* fill in any SSL/GSS info too */ > > This part of patch 0001 is giving me a very long pause. What's the > merit of filling in twice the backend entry data if we're going to > update it at the end anyway for normal backend connections? More info > for debugging, is it? Correct; if you already know authentication information from the transport then it seemed nice to fill it in before doing the stuff that hangs. > It seems that removing this call does not > influence the tests you are adding, as well (or should the test > "pg_stat_ssl view is updated" be responsible for checking that?). The test is supposed to enforce that, but I see that it's not for some reason. That's concerning. I'll investigate, thanks for pointing it out. > With this patch, the information that we get to be able to debug a > backend entry in pg_stat_activity is st_clientaddr and > remote_hostname. If we have a backend stuck in a wait event for the > "Auth" class, would these be sufficient to provide a good user > experience? Still better than nothing as we don't know the database > and user ID used for the connection until authentication is done, I > guess, to be able to grab patterns behind authentication getting > stuck. It's better than nothing, but is there a particular reason not to trust the crypto the first time around? The SSL/GSS details are what they are; if you don't trust them at either point one or point two then I think we need to have a more urgent conversation about that? > In patch 0002, WAIT_EVENT_SSPI_TRANSLATE_NAME is used twice for two > code paths. Perhaps these should be two separate wait events? That was my question from v6: > I violated the "one event name per call site" rule with > TranslateName(). The call pattern there is "call once to figure out > the buffer length, then call again to fill it in", and IMO that didn't > deserve differentiation. But if anyone objects, I'm happy to change it > (and I'd appreciate some name suggestions in that case). So -- any name suggestions? :D (Personally, I viewed this sort of like an unrolled loop of two. I don't know if it helps to know which call is hanging as long as you know that TranslateName is hanging.) > The events attached to ldap_unbind() make that also kind of hard to > debug. What's the code path actually impacted if we see them in > pg_stat_activity? We have nine of them with the same wait event > attached to them. Do we _want_ nine separate flavors of WAIT_EVENT_LDAP_UNBIND? I figured it was enough to know that you were stuck unbinding. > The patch needs some 2024 -> 2025 updates in its copyright notices. Will do. > At the end of [1], I've been reminded of this quote from Andres about > 0002: > "This doesn't really seem like it's actually using wait events to > describe waits. The new wait events cover stuff like memory > allocations etc, see e.g. pg_SSPI_make_upn()." > > Should we do improvements in this area with interruptions before > attempting to tackle the problem of making starting backend entries > visible in pg_stat_activity, before these complete authentication? We > use wait events for syscalls on I/O already, perhaps that does not > stand as an argument in favor of this patch on its own, but I don't > see why more events associated to external library calls as done in > 0002 is a problem? I had hoped that my v6 addressed Andres' concerns by pushing the events down as far as possible (that way they can't be nested, which I took to be the primary problem). Does something else need to be done? Thanks, --Jacob
pgsql-hackers by date: