Thread: Auto explain after query timeout

Auto explain after query timeout

From: James Coleman
Hopefully I'm not missing something obvious, but as far as I know
there's no way to configure auto_explain to fire when
statement_timeout fires.

I'd like to look into this at some point, but I'm wondering if anyone
has thought about it before, and, if so, is there any obvious
impediment to doing so?

Thanks,
James Coleman



Re: Auto explain after query timeout

From: Gurjeet
On Tue Sep 20, 2022 at 10:34 AM PDT, James Coleman wrote:
> Hopefully I'm not missing something obvious, but as far as I know
> there's no way to configure auto_explain to fire when
> statement_timeout fires.

I believe you're correct.

> I'd like to look into this at some point, but I'm wondering if anyone
> has thought about it before, and, if so, is there any obvious
> impediment to doing so?

This would be a neat feature. Since the changes would be fairly
localized to the contrib module, this would be a great first patch for
someone new to contributing.

This can be exposed at a new GUC auto_explain.log_on_statement_timeout.
I wish our conventions allowed for creating hierarchies of GUC
parameters, e.g. auto_explain.when.statement_timeout.

For someone who would like to achieve this in the field today, I believe
they can set auto_explain.log_min_duration equal to, or less than,
statement_timeout.
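
For concreteness, that workaround would look something like this (the
values are illustrative only):

```sql
-- Load the module and pick a logging threshold at or below the timeout.
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '200ms';
SET statement_timeout = '30s';
```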

Best regards,
Gurjeet
http://Gurje.et



Re: Auto explain after query timeout

From: James Coleman
On Tue, Sep 20, 2022 at 2:12 PM Gurjeet <singh.gurjeet@gmail.com> wrote:
>
> On Tue Sep 20, 2022 at 10:34 AM PDT, James Coleman wrote:
> > Hopefully I'm not missing something obvious, but as far as I know
> > there's no way to configure auto_explain to fire when
> > statement_timeout fires.
>
> I believe you're correct.
>
> > I'd like to look into this at some point, but I'm wondering if anyone
> > has thought about it before, and, if so, is there any obvious
> > impediment to doing so?
>
> This would be a neat feature. Since the changes would be fairly
> localized to the contrib module, this would be a great first patch for
> someone new to contributing.
>
> This can be exposed at a new GUC auto_explain.log_on_statement_timeout.
> I wish our conventions allowed for creating hierarchies of GUC
> parameters, e.g. auto_explain.when.statement_timeout.
>
> For someone who would like to achieve this in the field today, I believe
> they can set auto_explain.log_min_duration equal to, or less than,
> statement_timeout.

Maybe I'm missing something (and/or this was fixed in a later PG
version), but I don't think this is how it works. We have this
specific problem now: we set auto_explain.log_min_duration to 200 (ms)
and statement_timeout set to 30s, but when a statement times out we do
not get the plan logged with auto-explain.

James Coleman



Re: Auto explain after query timeout

From: Robert Haas
On Tue, Sep 20, 2022 at 2:35 PM James Coleman <jtc331@gmail.com> wrote:
> Maybe I'm missing something (and/or this was fixed in a later PG
> version), but I don't think this is how it works. We have this
> specific problem now: we set auto_explain.log_min_duration to 200 (ms)
> and statement_timeout set to 30s, but when a statement times out we do
> not get the plan logged with auto-explain.

I think you're correct. auto_explain uses the ExecutorEnd hook, but
that hook will not be fired in the case of an error. Indeed, if an
error has already been thrown, it would be unsafe to try to
auto-explain anything. For instance -- and this is just one problem of
probably many -- ExplainTargetRel() performs catalog lookups, which is
not OK in a failed transaction.

To make this work, I think you'd need a hook that fires just BEFORE
the error is thrown. However, previous attempts to introduce hooks
into ProcessInterrupts() have not met with a warm response from Tom, so
it might be a tough sell. And maybe for good reason. I see at least
two problems. The first is that explaining a query is a pretty
complicated operation that involves catalog lookups and lots of
complicated stuff, and I don't think that it would be safe to do all
of that at any arbitrary point in the code where ProcessInterrupts()
happened to fire. What if one of the syscache lookups misses the cache
and we have to open the underlying catalog? Then
AcceptInvalidationMessages() will fire, but we don't currently assume
that any old CHECK_FOR_INTERRUPTS() can process invalidations. What if
the running query has called a user-defined function or procedure
which is running DDL which is in the middle of changing catalog state
for some relation involved in the query at the moment that the
statement timeout arrives? I feel like there are big problems here.

The other problem I see is that ProcessInterrupts() is our mechanism
for allowing people to escape from queries that would otherwise run
forever by hitting ^C. But what if the explain code goes crazy and
itself needs to be interrupted, when we're already inside
ProcessInterrupts()? Maybe that would work out OK... or maybe it
wouldn't. I'm not really sure. But it's another reason to be very,
very cautious about putting complicated logic inside
ProcessInterrupts().

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Auto explain after query timeout

From: Gurjeet
On Tue Sep 20, 2022 at 11:34 AM PDT, James Coleman wrote:
> On Tue, Sep 20, 2022 at 2:12 PM Gurjeet <singh.gurjeet@gmail.com> wrote:
> >
> > For someone who would like to achieve this in the field today, I believe
> > they can set auto_explain.log_min_duration equal to, or less than,
> > statement_timeout.
>
> Maybe I'm missing something (and/or this was fixed in a later PG
> version), but I don't think this is how it works. We have this
> specific problem now: we set auto_explain.log_min_duration to 200 (ms)
> and statement_timeout set to 30s, but when a statement times out we do
> not get the plan logged with auto-explain.

My DBA skills are rusty, so I'll take your word for it.

If this is the current behaviour, I'm inclined to treat this as a bug,
or at least a desirable improvement, and see if auto_explain can be
improved to emit the plan on statement_timeout.

From what I understand, the behaviour of auto_explain is that it waits
for the query to finish before it emits the plan. This information is
very useful for diagnosing long-running queries that are still running.
You often encounter such queries in production workloads, and
reproducing the scenario later is undesirable, expensive, or even
impossible. So being able to see the plan of a query that has crossed
auto_explain.log_min_duration, as soon as possible, would be highly
desirable.

Best regards,
Gurjeet
http://Gurje.et



Re: Auto explain after query timeout

From: James Coleman
On Tue, Sep 20, 2022 at 3:06 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Sep 20, 2022 at 2:35 PM James Coleman <jtc331@gmail.com> wrote:
> > Maybe I'm missing something (and/or this was fixed in a later PG
> > version), but I don't think this is how it works. We have this
> > specific problem now: we set auto_explain.log_min_duration to 200 (ms)
> > and statement_timeout set to 30s, but when a statement times out we do
> > not get the plan logged with auto-explain.
>
> I think you're correct. auto_explain uses the ExecutorEnd hook, but
> that hook will not be fired in the case of an error. Indeed, if an
> error has already been thrown, it would be unsafe to try to
> auto-explain anything. For instance -- and this is just one problem of
> probably many -- ExplainTargetRel() performs catalog lookups, which is
> not OK in a failed transaction.
>
> To make this work, I think you'd need a hook that fires just BEFORE
> the error is thrown. However, previous attempts to introduce hooks
> into ProcessInterrupts() have not met with a warm response from Tom, so
> it might be a tough sell. And maybe for good reason. I see at least
> two problems. The first is that explaining a query is a pretty
> complicated operation that involves catalog lookups and lots of
> complicated stuff, and I don't think that it would be safe to do all
> of that at any arbitrary point in the code where ProcessInterrupts()
> happened to fire. What if one of the syscache lookups misses the cache
> and we have to open the underlying catalog? Then
> AcceptInvalidationMessages() will fire, but we don't currently assume
> that any old CHECK_FOR_INTERRUPTS() can process invalidations. What if
> the running query has called a user-defined function or procedure
> which is running DDL which is in the middle of changing catalog state
> for some relation involved in the query at the moment that the
> statement timeout arrives? I feel like there are big problems here.
>
> The other problem I see is that ProcessInterrupts() is our mechanism
> for allowing people to escape from queries that would otherwise run
> forever by hitting ^C. But what if the explain code goes crazy and
> itself needs to be interrupted, when we're already inside
> ProcessInterrupts()? Maybe that would work out OK... or maybe it
> wouldn't. I'm not really sure. But it's another reason to be very,
> very cautious about putting complicated logic inside
> ProcessInterrupts().

This is exactly the kind of background I was hoping someone would
provide; thank you, Robert.

It seems like one could imagine addressing all of these by having one of:

- A safe explain (e.g., disallow catalog access) that is potentially
missing information.
- A safe way to interrupt queries such as "safe shutdown" of a node
(e.g., a seq scan could stop returning tuples early) and allow a
configurable buffer of time after the statement timeout before firing
a hard abort of the query (and transaction).

Both of those seem like a significant amount of work.

Alternatively I wonder if it's possible (this would maybe assume no
catalog changes in the current transaction -- or at least none that
would be referenced by the current query) to open a new transaction
(with the same horizon information) and duplicate the plan over to
that transaction and run the explain there. This way you do it *after*
the error is raised. That's some serious spit-balling -- I'm not
saying that's doable, just trying to imagine how one might
comprehensively address the concerns.

Does any of that seem at all like a path you could imagine being fruitful?

Thanks,
James Coleman



Re: Auto explain after query timeout

From: Robert Haas
On Tue, Sep 20, 2022 at 5:08 PM James Coleman <jtc331@gmail.com> wrote:
> - A safe explain (e.g., disallow catalog access) that is potentially
> missing information.

This would be pretty useless I think, because you'd be missing all
relation names.

> - A safe way to interrupt queries such as "safe shutdown" of a node
> (e.g., a seq scan could stop returning tuples early) and allow a
> configurable buffer of time after the statement timeout before firing
> a hard abort of the query (and transaction).

This might be useful, but it seems extremely difficult to get working.
You'd not only have to design the safe shutdown mechanism itself, but
also find a way to safely engage it at the right times.

> Alternatively I wonder if it's possible (this would maybe assume no
> catalog changes in the current transaction -- or at least none that
> would be referenced by the current query) to open a new transaction
> (with the same horizon information) and duplicate the plan over to
> that transaction and run the explain there. This way you do it *after*
> the error is raised. That's some serious spit-balling -- I'm not
> saying that's doable, just trying to imagine how one might
> comprehensively address the concerns.

Doesn't work, because the new transaction's snapshot wouldn't be the
same as that of the old one. Imagine that you create a table and run a
query on it in the same transaction. Then you migrate the plan tree to
a new transaction and try to find out the table name. But in the new
transaction, that table doesn't exist: it was destroyed by the
previous rollback.

Honestly I have no very good ideas how to create the feature you want
here. I guess the only thing I can think of is to separate the EXPLAIN
process into two phases: a first phase that runs when the plan tree is
set up and gathers all of the information that we might need later,
like relation names, and then a second phase that runs later when you
want to generate the output and does nothing that can fail, or at
least nothing that touches the database; maybe it's allowed to
allocate memory, for example.
But that sounds like a big and perhaps painful refactoring exercise,
and I can imagine that there might be reasons why it doesn't work out.

-- 
Robert Haas
EDB: http://www.enterprisedb.com