Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible - Mailing list pgsql-hackers

From Andres Freund
Subject Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible
Date
Msg-id faesruozcdcg2aksscb3dcojy4gqjqsyaxfajkqzm4kozjowgm@nmjgrrvdax25
Whole thread Raw
In response to Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible  (Noah Misch <noah@leadboat.com>)
Responses Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible
List pgsql-hackers
Hi,

On 2024-11-07 09:20:24 -0800, Jacob Champion wrote:
> From e755fdccf16cb4fcd3036e1209291750ffecd261 Mon Sep 17 00:00:00 2001
> From: Jacob Champion <jacob.champion@enterprisedb.com>
> Date: Fri, 3 May 2024 15:54:58 -0700
> Subject: [PATCH v5 1/2] pgstat: report in earlier with STATE_STARTING
> 
> Add pgstat_bestart_pre_auth(), which reports a 'starting' state while
> waiting for backend initialization and client authentication to
> complete. Since we hold a transaction open for a good amount of that,
> and some authentication methods call out to external systems, having a
> pg_stat_activity entry helps DBAs debug when things go badly wrong.

I don't understand why the pgstat_bestart()/pgstat_bestart_pre_auth() split
makes sense. The latter is going to redo most of the work that the former
did. What's the point of that?

Why not have a new function that initializes just the missing additional
information? Or for that matter, why not move most of what pgstat_bestart()
does into pgstat_beinit()?

As-is I'm -1 on this patch, it makes something complicated and fragile even
more so.


> From 858e95f996589461e2840047fa35675b6f96e46d Mon Sep 17 00:00:00 2001
> From: Jacob Champion <jacob.champion@enterprisedb.com>
> Date: Fri, 3 May 2024 15:58:23 -0700
> Subject: [PATCH v5 2/2] Report external auth calls as wait events
> 
> Introduce a new "Auth" wait class for various external authentication
> systems, to make it obvious what's going wrong if one of those systems
> hangs. Each new wait event is unique in order to more easily pinpoint
> problematic locations in the code.

This doesn't really seem like it's actually using wait events to describe
waits. The new wait events cover stuff like memory allocations etc, see
e.g. pg_SSPI_make_upn().

I have some sympathy for that, it'd be nice if we had some generic way to
describe what code is doing - but it doesn't really seem good to use wait
events for that. Right now a backend reporting a wait allows to conclude that
a backend isn't running postgres code and busy or blocked outside of postgres
- but that's not true anymore if you have wait event cover generic things like
memory allocations (or even various library functions).

This isn't just pedantry - all the relevant code really needs to be rewritten
to allow the blocking to happen in an interruptible way, otherwise
authentication timeout etc can't realiably work. Once that's done you can
actually define useful wait events too.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row
Next
From: Andres Freund
Date:
Subject: Re: Commit Timestamp and LSN Inversion issue