Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible - Mailing list pgsql-hackers

From Jacob Champion
Subject Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible
Date
Msg-id CAOYmi+=azjJ064LfZG0pJ6DBjx9Ro0Kzdd+Z2Jyyy=66cqL+8w@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Thu, Feb 6, 2025 at 12:35 AM Michael Paquier <michael@paquier.xyz> wrote:
>         /* Update app name to current GUC setting */
> +       /* TODO: ask the list: maybe do this before setting STATE_UNDEFINED? */
>         if (application_name)
>                 pgstat_report_appname(application_name);
>
> Historically, we've always reported the application name after
> STATE_UNDEFINED is set, never the reverse.  There could be potential
> security implications if we were to begin reporting the application
> name before the connection has fully authenticated because it would be
> possible for attempted connections to show malicious data in the
> catalogs, and now we're sure that the catalog data is OK to query for
> other users.  Let's be careful about that.  I think that we should
> still set that as late as possible in pgstat_bestart_final(), never at
> an earlier step as this data can come from a connection
> string, potentially malicious if not authenticated yet.

I don't want to change the order of authentication and application
name, just move the application name report up above the state change
in pgstat_bestart_final(). I don't feel strongly about it, though.

> +       if (!bootstrap)
> +       {
> +               pgstat_bestart_initial();
> +               if (MyProcPort)
> +                       pgstat_bestart_security();      /* fill in any SSL/GSS info too */
>
> This part of patch 0001 is giving me a very long pause.  What's the
> merit of filling in twice the backend entry data if we're going to
> update it at the end anyway for normal backend connections?  More info
> for debugging, is it?

Correct; if you already know authentication information from the
transport then it seemed nice to fill it in before doing the stuff
that hangs.

> It seems that removing this call does not
> influence the tests you are adding, as well (or should the test
> "pg_stat_ssl view is updated" be responsible for checking that?).

The test is supposed to enforce that, but I see that it's not for some
reason. That's concerning. I'll investigate, thanks for pointing it
out.

> With this patch, the information that we get to be able to debug a
> backend entry in pg_stat_activity is st_clientaddr and
> remote_hostname.  If we have a backend stuck in a wait event for the
> "Auth" class, would these be sufficient to provide a good user
> experience?  Still better than nothing as we don't know the database
> and user ID used for the connection until authentication is done, I
> guess, to be able to grab patterns behind authentication getting
> stuck.

It's better than nothing, but is there a particular reason not to
trust the crypto the first time around? The SSL/GSS details are what
they are; if you don't trust them at either point one or point two
then I think we need to have a more urgent conversation about that?

> In patch 0002, WAIT_EVENT_SSPI_TRANSLATE_NAME is used twice for two
> code paths.  Perhaps these should be two separate wait events?

That was my question from v6:

> I violated the "one event name per call site" rule with
> TranslateName(). The call pattern there is "call once to figure out
> the buffer length, then call again to fill it in", and IMO that didn't
> deserve differentiation. But if anyone objects, I'm happy to change it
> (and I'd appreciate some name suggestions in that case).

So -- any name suggestions? :D (Personally, I viewed this sort of like
an unrolled loop of two. I don't know if it helps to know which call
is hanging as long as you know that TranslateName is hanging.)

> The events attached to ldap_unbind() make that also kind of hard to
> debug.  What's the code path actually impacted if we see them in
> pg_stat_activity?  We have nine of them with the same wait event
> attached to them.

Do we _want_ nine separate flavors of WAIT_EVENT_LDAP_UNBIND? I
figured it was enough to know that you were stuck unbinding.

> The patch needs some 2024 -> 2025 updates in its copyright notices.

Will do.

> At the end of [1], I've been reminded of this quote from Andres about
> 0002:
> "This doesn't really seem like it's actually using wait events to
> describe waits. The new wait events cover stuff like memory
> allocations etc, see e.g. pg_SSPI_make_upn()."
>
> Should we do improvements in this area with interruptions before
> attempting to tackle the problem of making starting backend entries
> visible in pg_stat_activity, before these complete authentication?  We
> use wait events for syscalls on I/O already, perhaps that does not
> stand as an argument in favor of this patch on its own, but I don't
> see why more events associated to external library calls as done in
> 0002 is a problem?

I had hoped that my v6 addressed Andres' concerns by pushing the
events down as far as possible (that way they can't be nested, which I
took to be the primary problem). Does something else need to be done?

Thanks,
--Jacob



pgsql-hackers by date:

Previous
From: "David G. Johnston"
Date:
Subject: Re: describe special values in GUC descriptions more consistently
Next
From: Ilia Evdokimov
Date:
Subject: Re: explain analyze rows=%.0f