Thread: Making auto_explain more useful / convenient
Hello,
I'm not sure if these ideas were circulated before or not.
We use auto_explain a lot to investigate slow/problematic queries.
One of the main issues with its usability comes from the fact that EXPLAIN output is logged rather than returned to the caller in some way. If you have a large cluster with lots of replicas, there is also an extra inconvenience of log accumulation, search, etc.
Why not have an option to return EXPLAIN results as a NoticeResponse instead? That would make its usage more convenient.
Another thing is tangentially related...
I think it may be good to have a number of options to generate significantly shorter output similar to EXPLAIN. EXPLAIN is great, but sometimes people need more concise and specific information, for example total number of buffers and reads by certain query (this is pretty common), whether or not we had certain nodes in the plan (seq scan, scan of certain index(es)), how bad was cardinality misprediction on certain nodes, etc. It's not totally clear yet what would be the best way to define those rules, but I think we can come up with something reasonable. Logging or returning shorter messages like that can cause less overhead than logging full EXPLAIN and can potentially allow for better query monitoring overall.
Do you see any potential issues with implementing those? Of course there should be more details, like what kind of configuration parameters to add, how to define rules for the 2nd case, etc. Just wanted to check if there are any objections in general.
Thank you,
-Vladimir Churyukin.
I'm not sure if these ideas were circulated before or not.
We use auto_explain a lot to investigate slow/problematic queries.
One of the main issues with its usability comes from the fact that EXPLAIN output is logged rather than returned to the caller in some way. If you have a large cluster with lots of replicas, there is also an extra inconvenience of log accumulation, search, etc.
Why not have an option to return EXPLAIN results as a NoticeResponse instead? That would make its usage more convenient.
Another thing is tangentially related...
I think it may be good to have a number of options to generate significantly shorter output similar to EXPLAIN. EXPLAIN is great, but sometimes people need more concise and specific information, for example total number of buffers and reads by certain query (this is pretty common), whether or not we had certain nodes in the plan (seq scan, scan of certain index(es)), how bad was cardinality misprediction on certain nodes, etc. It's not totally clear yet what would be the best way to define those rules, but I think we can come up with something reasonable. Logging or returning shorter messages like that can cause less overhead than logging full EXPLAIN and can potentially allow for better query monitoring overall.
Do you see any potential issues with implementing those? Of course there should be more details, like what kind of configuration parameters to add, how to define rules for the 2nd case, etc. Just wanted to check if there are any objections in general.
Thank you,
-Vladimir Churyukin.
Hello auto_explain.log_level is available since postgresql 12. postgres=# load 'auto_explain'; LOAD postgres=# set auto_explain.log_min_duration to 0; SET postgres=# set auto_explain.log_level to 'notice'; SET postgres=# select 1; NOTICE: duration: 0.010 ms plan: Query Text: select 1; Result (cost=0.00..0.01 rows=1 width=4) ?column? ---------- 1 regards, Sergei
Vladimir Churyukin <vladimir@churyukin.com> writes: > Why not have an option to return EXPLAIN results as a NoticeResponse > instead? That would make its usage more convenient. That seems quite useless to me, and likely actually counterproductive. If you are manually investigating query performance, you can just use EXPLAIN directly. The point of auto_explain, ISTM, is to capture info about queries issued by automated applications. So something like the above could only work if you taught every one of your applications to capture the NOTICE output, separate it from random other NOTICE output, and then (probably) log it somewhere central for later inspection. That's a lot of code to write, and at the end you'd only have effectively duplicated existing tooling such as pgbadger. Also, what happens in applications you forgot to convert? > Another thing is tangentially related... > I think it may be good to have a number of options to generate > significantly shorter output similar to EXPLAIN. EXPLAIN is great, but > sometimes people need more concise and specific information, for example > total number of buffers and reads by certain query (this is pretty common), > whether or not we had certain nodes in the plan (seq scan, scan of certain > index(es)), how bad was cardinality misprediction on certain nodes, etc. Maybe, but again I'm a bit skeptical. IME you frequently don't know what you're looking for until you've seen the bigger picture. Zeroing in on details like this could be pretty misleading. regards, tom lane
Thank you, that answers the first part of my question.
On Sat, Nov 11, 2023 at 2:43 AM Sergei Kornilov <sk@zsrv.org> wrote:
Hello
auto_explain.log_level is available since postgresql 12.
postgres=# load 'auto_explain';
LOAD
postgres=# set auto_explain.log_min_duration to 0;
SET
postgres=# set auto_explain.log_level to 'notice';
SET
postgres=# select 1;
NOTICE: duration: 0.010 ms plan:
Query Text: select 1;
Result (cost=0.00..0.01 rows=1 width=4)
?column?
----------
1
regards, Sergei
On Sat, Nov 11, 2023 at 7:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Vladimir Churyukin <vladimir@churyukin.com> writes:
> Why not have an option to return EXPLAIN results as a NoticeResponse
> instead? That would make its usage more convenient.
That seems quite useless to me, and likely actually counterproductive.
If you are manually investigating query performance, you can just use
EXPLAIN directly. The point of auto_explain, ISTM, is to capture info
about queries issued by automated applications. So something like the
above could only work if you taught every one of your applications to
capture the NOTICE output, separate it from random other NOTICE
output, and then (probably) log it somewhere central for later
inspection. That's a lot of code to write, and at the end you'd
only have effectively duplicated existing tooling such as pgbadger.
Also, what happens in applications you forgot to convert?
Sergey Kornilov just gave the right answer above in the thread for this one.
Unfortunately, there are a lot of scenarios where you can't use pgbadger or any other log analysis or it's not convenient.
There are a bunch of cloud hosted forks of postgres for example, not all of them give you this functionality.
In AWS for example you need to download all the logs first, which complicates it significantly.
The goal of this is not investigating performance of a single query but rather constant monitoring of a bunch (or all) queries, so you can detect
plan degradations right away.
plan degradations right away.
> Another thing is tangentially related...
> I think it may be good to have a number of options to generate
> significantly shorter output similar to EXPLAIN. EXPLAIN is great, but
> sometimes people need more concise and specific information, for example
> total number of buffers and reads by certain query (this is pretty common),
> whether or not we had certain nodes in the plan (seq scan, scan of certain
> index(es)), how bad was cardinality misprediction on certain nodes, etc.
Maybe, but again I'm a bit skeptical. IME you frequently don't know
what you're looking for until you've seen the bigger picture. Zeroing
in on details like this could be pretty misleading.
If you don't know what you're looking for, then it's not very useful, I agree.
But in many cases you know. There are certain generic "signs of trouble" that you can detect by
the amount of data the query processor scans, by cache hit rate for certain queries. presence of seq scans or scans of certain indexes,
large differences between predicted and actual rows, some other stuff that may be relevant to your app/queries specifically that you want to monitor.
We're already doing similar analysis on our side (a multi-terabyte db cluster with hundreds of millions to billions queries running daily).
But it's not efficient enough because:
1. the problem I mentioned above, access to logs is limited on cloud environments
large differences between predicted and actual rows, some other stuff that may be relevant to your app/queries specifically that you want to monitor.
We're already doing similar analysis on our side (a multi-terabyte db cluster with hundreds of millions to billions queries running daily).
But it's not efficient enough because:
1. the problem I mentioned above, access to logs is limited on cloud environments
2. explain output could be huge, it causes performance issues because of its size. compact output is much more preferable for mass processing
(it's even more important if this output is to notice messages rather than to logs, that's why I said it's tangentially related)
Since it seems the notice output is already possible, half of the problem is solved already.
I'll try to come up with possible options for more compact output then, unless you think it's completely futile.
thank you,
-Vladimir Churyukin
-Vladimir Churyukin