Thread: Re: Proposal: Progressive explain
Hello,
I was recently benchmarking the latest version of the patch and found room for
improvement when the GUC progressive_explain is enabled globally.
Results with the last published version of the patch:
- progressive_explain = off:
/usr/local/pgsql/bin/pgbench -S -n -T 10 -c 30
tps = 18249.363540 (without initial connection time)
- progressive_explain = 'explain':
/usr/local/pgsql/bin/pgbench -S -n -T 10 -c 30
tps = 3536.635125 (without initial connection time)
This is because progressive explains are being printed for every query, including
the ones that finish instantly.
If we think about it, those printed plans for instant queries are useless as
other backends won't have a chance to look at the plans before they get removed
from pg_stat_progress_explain.
So this new version of the patch implements a new GUC, progressive_explain_min_duration,
to be used as a threshold for when the plan is printed for the first time:
- progressive_explain_min_duration: min query duration until progressive
explain starts.
- type: int
- default: 1s
- min: 0
- context: user
Results with the new version:
- progressive_explain = off:
/usr/local/pgsql/bin/pgbench -S -n -T 10 -c 30
tps = 18871.800242 (without initial connection time)
- progressive_explain = 'explain' and progressive_explain_min_duration = '5s':
/usr/local/pgsql/bin/pgbench -S -n -T 10 -c 30
tps = 18896.266538 (without initial connection time)
Implementation of the new GUC progressive_explain_min_duration was done with
timeouts. The timeout callback function is used to initialize the progressive
explain.
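For reference, a minimal sketch of how such a threshold can be wired up with the
existing timeout infrastructure (RegisterTimeout()/enable_timeout_after()); the
function and variable names below are illustrative, not necessarily the ones used
in the patch:

#include "postgres.h"
#include "utils/timeout.h"

static TimeoutId ProgressiveExplainTimeoutId;
static volatile bool ProgressiveExplainPending = false;

/* Timeout callback: runs in a signal-like context, so it only sets a flag. */
static void
ProgressiveExplainTimeoutHandler(void)
{
    ProgressiveExplainPending = true;
}

/* Once per backend: grab a user-defined timeout slot. */
static void
ProgressiveExplainRegisterTimeout(void)
{
    ProgressiveExplainTimeoutId =
        RegisterTimeout(USER_TIMEOUT, ProgressiveExplainTimeoutHandler);
}

/* At executor startup: arm the timer with progressive_explain_min_duration (in ms). */
static void
ProgressiveExplainArm(int min_duration_ms)
{
    enable_timeout_after(ProgressiveExplainTimeoutId, min_duration_ms);
}

The flag is then checked later from executor code, which does the actual printing.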
There is a catch to this implementation. In the thread
https://www.postgresql.org/message-id/flat/d68c3ae31672664876b22d2dcbb526d2%40postgrespro.ru
where torikoshia proposes logging of query plans, concerns were raised about
logging plans in the CFI (CHECK_FOR_INTERRUPTS()), a sensitive part of the code. So
torikoshia implemented a smart workaround consisting of adjusting the execProcNode
wrapper of all nodes so that the plan printing can be done there.
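Roughly, the wrapping works like the sketch below (simplified, not torikoshia's or
this patch's exact code; ExecProcNodeOriginal is the extra PlanState field added by
the patch):

#include "postgres.h"
#include "nodes/execnodes.h"
#include "nodes/nodeFuncs.h"

static TupleTableSlot *ExecProcNodeWithExplain(PlanState *ps);

/*
 * Walk the PlanState tree, swapping each node's ExecProcNode pointer for the
 * wrapper and remembering the original one.  Called once on the root, e.g.
 * WrapExecProcNodeWalker(queryDesc->planstate, NULL).
 */
static bool
WrapExecProcNodeWalker(PlanState *ps, void *context)
{
    ps->ExecProcNodeOriginal = ps->ExecProcNode;
    ps->ExecProcNode = ExecProcNodeWithExplain;
    return planstate_tree_walker(ps, WrapExecProcNodeWalker, context);
}

/*
 * Wrapper: on the first execution of the node the plan can be printed here,
 * safely at a node boundary rather than at an arbitrary interrupt point; then
 * the original function pointer is restored and execution continues as before.
 */
static TupleTableSlot *
ExecProcNodeWithExplain(PlanState *ps)
{
    /* the patch's plan-printing routine would be invoked here */
    ps->ExecProcNode = ps->ExecProcNodeOriginal;
    return ps->ExecProcNode(ps);
}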
I'm not sure if this same concern applies to timeout callbacks, so I also implemented
a second version of the latest patch that uses that execProcNode wrapper strategy.
The wrapper code was implemented by torikoshia (torikoshia@oss.nttdata.com), so
I am adding the credits here.
Rafael.
Attachment
Did additional benchmarks and found issues with the patch that doesn't do execProcNode
wrapping. There are sporadic crashes with "double free or corruption (top)".
So I am making the patch that uses the wrapper the current one. Again, giving the credits to
torikoshia as the owner of that section of the code.
Rafael.
Attachment
On Fri, Mar 7, 2025, at 5:34 PM, Rafael Thofehrn Castro wrote:
> Did additional benchmarks and found issues with the patch that doesn't do execProcNode
> wrapping. There are sporadic crashes with double free or corruption (top)
> So making the patch that uses the wrapper the current one. Again, giving the credits to
> torikoshia as being the owner of that section of the code.
Rafael, thanks for working on it. It is a step forward in observability. I
started with some performance tests and the last improvements seem to fix the
overhead imposed by the initial patch version. I didn't notice any of these new
functions in the perf report while executing fast queries.
I found a crash. It is simple to reproduce.
Session A:
select * from pg_stat_progress_explain;
\watch 2
Session B:
explain select pg_sleep(30);
\watch 2
8<--------------------------------------------------------------------8<
Backtrace:
Core was generated by `postgres: euler postgres [lo'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 WrapExecProcNodeWithExplain (ps=0x7f7f7f7f7f7f7f7f) at explain.c:5401
5401 if (ps->ExecProcNodeOriginal != NULL)
#0 WrapExecProcNodeWithExplain (ps=0x7f7f7f7f7f7f7f7f) at explain.c:5401
#1 0x00005624173829aa in handle_sig_alarm (postgres_signal_arg=<optimized out>) at timeout.c:414
#2 0x00005624173ba02c in wrapper_handler (postgres_signal_arg=14) at pqsignal.c:110
#3 <signal handler called>
#4 0x00007f20fa529e63 in epoll_wait (epfd=6, events=0x56244ef37e58, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#5 0x00005624171fb02f in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffdd9e62080, cur_timeout=-1, set=0x56244ef37dd8) at waiteventset.c:1190
#6 WaitEventSetWait (set=0x56244ef37dd8, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffdd9e62080, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=100663296) at waiteventset.c:1138
#7 0x000056241709513c in secure_read (port=0x56244eeb90e0, ptr=0x56241775a9a0 <PqRecvBuffer>, len=8192) at be-secure.c:218
#8 0x000056241709bf2e in pq_recvbuf () at pqcomm.c:924
#9 0x000056241709ceb5 in pq_getbyte () at pqcomm.c:970
#10 0x000056241721b617 in SocketBackend (inBuf=0x7ffdd9e622a0) at postgres.c:361
#11 ReadCommand (inBuf=0x7ffdd9e622a0) at postgres.c:484
#12 PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4625
#13 0x00005624172167ed in BackendMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at backend_startup.c:107
#14 0x000056241717519b in postmaster_child_launch (child_type=<optimized out>, child_slot=2, startup_data=startup_data@entry=0x7ffdd9e6253c, startup_data_len=startup_data_len@entry=4, client_sock=client_sock@entry=0x7ffdd9e62540) at launch_backend.c:274
#15 0x0000562417178c32 in BackendStartup (client_sock=0x7ffdd9e62540) at postmaster.c:3519
#16 ServerLoop () at postmaster.c:1688
#17 0x000056241717a6da in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x56244eeb81b0) at postmaster.c:1386
#18 0x0000562416e64f9a in main (argc=1, argv=0x56244eeb81b0) at main.c:230
8<--------------------------------------------------------------------8<
You call this feature "progressive explain". My first impression is that it
will only provide the execution plans for EXPLAIN commands. Instead of
"progressive explain", I would suggest "query progress" that is a general
database terminology. It seems natural to use "progressive explain" since you
are using the explain infrastructure (including the same options -- format,
settings, wal, ...) -- to print the execution plan.
+CREATE VIEW pg_stat_progress_explain AS
+ SELECT
+ *
+ FROM pg_stat_progress_explain(true);
+
There is no use for the function argument. If you decide to keep this function,
remove it.
Why don't you use the pgstat_progress_XXX() API? Since you are using a
pg_stat_progress_XXX view name I would expect using the command progress
reporting infrastructure (see backend_progress.c).
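For context, that API (backend_progress.c) only exposes a fixed set of int64
counters per backend, which may be part of why it is an awkward fit for publishing
a full plan text. A minimal sketch of its shape; PROGRESS_COMMAND_VACUUM is used
only so the example refers to an existing command type, and a real integration
would need its own ProgressCommandType:

#include "postgres.h"
#include "utils/backend_progress.h"

static void
command_progress_example(void)
{
    /* start is normally done once, at the beginning of the command */
    pgstat_progress_start_command(PROGRESS_COMMAND_VACUUM, InvalidOid);

    /* each progress "column" is an int64 value in a numbered slot */
    pgstat_progress_update_param(0, 42);

    pgstat_progress_end_command();
}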
Maybe you could include datid and datname as the other progress reporting
views. It would avoid a join to figure out what the database is.
+static const struct config_enum_entry explain_format_options[] = {
+ {"text", EXPLAIN_FORMAT_TEXT, false},
+ {"xml", EXPLAIN_FORMAT_XML, false},
+ {"json", EXPLAIN_FORMAT_JSON, false},
+ {"yaml", EXPLAIN_FORMAT_YAML, false},
+ {NULL, 0, false}
+};
Isn't it the same definition as in auto_explain.c? Use only one definition for
auto_explain and this feature. You can put this struct into explain.c, use an
extern definition for guc_tables.c and put an extern PGDLLIMPORT definition into
guc.h. See wal_level_options for an example.
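In other words, something along these lines (a sketch of the suggestion; the exact
placement of the declaration is up to the patch author):

/* single shared definition, e.g. in explain.c */
const struct config_enum_entry explain_format_options[] = {
    {"text", EXPLAIN_FORMAT_TEXT, false},
    {"xml", EXPLAIN_FORMAT_XML, false},
    {"json", EXPLAIN_FORMAT_JSON, false},
    {"yaml", EXPLAIN_FORMAT_YAML, false},
    {NULL, 0, false}
};

/* declaration visible to guc_tables.c and contrib/auto_explain */
extern PGDLLIMPORT const struct config_enum_entry explain_format_options[];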
+static const struct config_enum_entry progressive_explain_options[] = {
+ {"off", PROGRESSIVE_EXPLAIN_NONE, false},
+ {"explain", PROGRESSIVE_EXPLAIN_EXPLAIN, false},
+ {"analyze", PROGRESSIVE_EXPLAIN_ANALYZE, false},
+ {"false", PROGRESSIVE_EXPLAIN_NONE, true},
+ {NULL, 0, false}
+};
The "analyze" is a separate option in auto_explain. Should we have 2 options?
One that enables/disables this feature and another one that enables/disables the
analyze option.
Don't the other EXPLAIN options make sense here? Like serialize and summary.
TupleDesc tupDesc; /* descriptor for result tuples */
EState *estate; /* executor's query-wide state */
PlanState *planstate; /* tree of per-plan-node state */
+ struct ExplainState *pe_es; /* progressive explain state if enabled */
Should you use the same name pattern here? pestate, for example.
PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, ExplainHash)
Could you use a specific name? Even if you keep the proposed name, you should
use ProgressiveExplainHash, ProgressiveExplain or QueryProgress.
+$node->init;
+# Configure progressive explain to be logged immediatelly
+$node->append_conf('postgresql.conf', 'progressive_explain_min_duration = 0');
+$node->start;
s/immediatelly/immediately/
+typedef struct progressiveExplainHashKey
+{
+ int pid; /* PID */
+} progressiveExplainHashKey;
+
+typedef struct progressiveExplainHashEntry
+{
+ progressiveExplainHashKey key; /* hash key of entry - MUST BE FIRST */
+ dsa_handle h;
+ dsa_pointer p;
+} progressiveExplainHashEntry;
You don't need progressiveExplainHashKey. Use pid as key directly.
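That is, something like the following (sketch only; field names follow the quoted
patch code and the sizes are illustrative):

#include "postgres.h"
#include "storage/shmem.h"
#include "utils/dsa.h"
#include "utils/hsearch.h"

typedef struct progressiveExplainHashEntry
{
    int         pid;        /* hash key - MUST BE FIRST */
    dsa_handle  h;
    dsa_pointer p;
} progressiveExplainHashEntry;

static HTAB *ProgressiveExplainHash;

static void
ProgressiveExplainShmemInit(void)
{
    HASHCTL     info;

    info.keysize = sizeof(int);
    info.entrysize = sizeof(progressiveExplainHashEntry);

    ProgressiveExplainHash = ShmemInitHash("progressive explain hash",
                                           50, 50,
                                           &info,
                                           HASH_ELEM | HASH_BLOBS);
}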
+ /* Update shared memory with new data */
+ strcpy(pe_data->plan, es->str->data);
+ pe_data->last_print = GetCurrentTimestamp();
I don't think last_print is accurate because it is not the time the execution plan
is printed but the time it was updated. I suggest last_updated_time.
+/* Flag set by timeouts to control when to print progressive explains */
+bool ProgressiveExplainPending = false;
s/print/update/
There are other points where you use "print", but it is better to replace it with
"update".
+ progressiveExplainArray = ShmemInitHash("progressive explain hash",
+ 50, 50,
+ &info,
+ HASH_ELEM | HASH_BLOBS);
I'm wondering why you use "array" in the name. ProgressiveExplainHash is a
better name.
+ entry->p = dsa_allocate(es->pe_a,
+ add_size(sizeof(progressiveExplainData),
+ add_size(strlen(es->str->data),
+ PROGRESSIVE_EXPLAIN_ALLOC_SIZE)));
I think you need a better name for PROGRESSIVE_EXPLAIN_ALLOC_SIZE because it
doesn't reflect what it is. PROGRESSIVE_EXPLAIN_FREE_SIZE or
PROGRESSIVE_EXPLAIN_AVAIL_SIZE?
Maybe you can use dshash.
There are some comments that still refer to the wrong function name.
+/*
+ * ExecProcNodeWithExplain
+ * ExecProcNode wrapper that initializes progressive explain
+ * and uwraps ExecProdNode to the original function.
+ */
+static TupleTableSlot *
+ExecProcNodeExplain(PlanState *node)
and
+/*
+ * ExecProcNodeWithExplain
+ * Responsible for initialization of all structures related to progressive
+ * explains.
+ *
+ /* state related to progressive explains */
+ struct PlanState *pe_curr_node;
+ struct Instrumentation *pe_local_instr;
+ dsa_area *pe_a;
Could you add some comments saying what each of these variables are for?
I didn't experiment but I was wondering if there is a way to avoid the
duplicates that you added to avoid the overhead.
Thanks for the valuable inputs Euler. Adjusted most of your recommendations.
> I found a crash. It is simple to reproduce.
Indeed, I failed to test plain EXPLAIN after the addition of the new GUC
progressive_explain_min_duration. This is fixed.
> You call this feature "progressive explain". My first impression is that it
> will only provide the execution plans for EXPLAIN commands. Instead of
> "progressive explain", I would suggest "query progress" that is a general
> database terminology. It seems natural to use "progressive explain" since you
> are using the explain infrastructure (including the same options -- format,
> settings, wal, ...) -- to print the execution plan.
Makes sense. Kept progressive explain for now but this is still open for
discussion.
> There is no use for the function argument. If you decide to keep this function,
> remove it.
Done.
> Why don't you use the pgstat_progress_XXX() API? Since you are using a
> pg_stat_progress_XXX view name I would expect using the command progress
> reporting infrastructure (see backend_progress.c).
I haven't changed that part as of now. My implementation and underlying data
structure may not work well with that API, but I am investigating.
> Maybe you could include datid and datname as the other progress reporting
> views. It would avoid a join to figure out what the database is.
Done.
> Isn't it the same definition as in auto_explain.c? Use only one definition for
> auto_explain and this feature. You can put this struct into explain.c, use an
> extern definition for guc_tables.c and put a extern PGDLLIMPORT defintion into
> guc.h. See wal_level_options, for an example.
Done.
> The "analyze" is a separate option in auto_explain. Should we have 2 options?
> One that enable/disable this feature and another one that enable/disable
> analyze option.
Tomas Vondra proposed the current logic and I think it makes sense. Having a
single GUC to control the execution behavior keeps the feature simpler IMO.
> Don't the other EXPLAIN options make sense here? Like serialize and summary.
I added a missing GUC for option COSTS (progressive_explain_costs). Adding
the other ones doesn't make much sense IMO. SUMMARY, SERIALIZE and MEMORY
are information added at the end of the query execution (or of plan creation, for plain
EXPLAIN) in the summary section, but at that point the progressive explain will
already be finished, with no more information in pg_stat_progress_explain.
> TupleDesc tupDesc; /* descriptor for result tuples */
> EState *estate; /* executor's query-wide state */
> PlanState *planstate; /* tree of per-plan-node state */
> + struct ExplainState *pe_es; /* progressive explain state if enabled */
> Should you use the same name pattern here? pestate, for example.
Done.
> PG_LWLOCK(52, SerialControl)
> +PG_LWLOCK(53, ExplainHash)
> Could you use a specific name? Even if you keep the proposed name, you should
> use ProgressiveExplainHash, ProgressiveExplain or QueryProgress.
Changed to ProgressiveExplainHash.
> You don't need progressiveExplainHashKey. Use pid as key directly.
Done.
> I don't think last_print is accurate because it is not the time the execution plan
> is printed but the time it was updated. I suggest last_updated_time.
Changed from last_print to last_update. This is still open for discussion.
> I'm wondering why you use "array" in the name. ProgressiveExplainHash is a
> better name.
Used to be compatible with the ProcArray (that is also a hash). But what you
proposed is better indeed. Changed.
> I think you need a better name for PROGRESSIVE_EXPLAIN_ALLOC_SIZE because it
> doesn't reflect what it is. PROGRESSIVE_EXPLAIN_FREE_SIZE or
> PROGRESSIVE_EXPLAIN_AVAIL_SIZE?
Changed to PROGRESSIVE_EXPLAIN_FREE_SIZE.
Fixed the wrong function names in the comments and changed the format of those
comments in function headers to be compatible with other functions in explain.c.
> + /* state related to progressive explains */
> + struct PlanState *pe_curr_node;
> + struct Instrumentation *pe_local_instr;
> + dsa_area *pe_a;
> Could you add some comments saying what each of these variables are for?
Done.
> I didn't experiment but I was wondering if there is a way to avoid the
> duplicates that you added to avoid the overhead.
You mean the local instrumentation object reused for each node?
Rafael.
Attachment
Sending a new version as rebase was required.
Rafael.
Attachment
On Wed, Mar 19, 2025 at 1:47 PM Rafael Thofehrn Castro <rafaelthca@gmail.com> wrote:
> Sending a new version as rebase was required.

Reading this thread, it seems to me that there has been a good deal of discussion
about things like the name of the feature and what the UI ought to be, but not much
discussion of whether the feature is actually safe, and not much detailed review of
the code. I'm willing to bet that everybody wants some version of this feature if we
can convince ourselves that it's not going to do horrible things like cause server
crashes, and that even people who don't get their first choice in terms of how the
feature is named or how the GUCs work will still be pretty happy to have it overall.
However, if it breaks stuff and there's no easy way to fix the breakage, that's going
to be a problem.

In broad strokes, the danger here is that doing stuff in the middle of query execution
that we currently only do at the end of query execution will turn out to be
problematic. The biggest problem, I think, is whether it's safe to do all of the
things that EXPLAIN does while we're at some random point in query execution. It looks
to me like the wrap/unwrap stuff is more-or-less consistent with previous discussions
of how a feature of this kind should work, though I don't recall the previous
discussion and I think the patch should contain some comments about why it works the
way that it works. I do notice that WrapExecProcNodeWithExplain does not walk the
ps->initPlan list, which I think is an oversight.

Without having the prior discussion near to hand, I *think* that the reason we wanted
to do this wrap/unwrap stuff is to make it so that the progressive EXPLAIN code could
only execute when entering a new plan node rather than at any random point inside of
that plan node, and that does seem a lot safer than the alternative. For example, I
think it means that we won't start trying to do progressive EXPLAIN while already
holding some random LWLock, which is very good. However, this still means we could do
a bunch of catalog access (and thus potentially receive invalidations) at places where
that can't happen today. I'm not sure that's a problem -- the same thing could happen
at any place in the executor where we evaluate a user-supplied expression, and there
are many such places -- but it's certainly worth a few senior people thinking real
carefully about it and trying to imagine whether there's any scenario in which it
might break something that works today.

One way in which this proposal seems safer than previous proposals is that previous
proposals have involved session A poking session B and trying to get session B to emit
an EXPLAIN on the fly with no prior setup. That would be very useful, but I think it's
more difficult and more risky than this proposal, where all the configuration happens
in the session that is going to emit the EXPLAIN output. It knows from the beginning
that it's going to maybe be doing this, and so it can do whatever setup it likes to
accomplish that goal. So I think this design looks pretty good from that point of
view.

I don't understand how this would be safe against interrupts or errors. If a running
query is interrupted, what would cause ProgressiveExplainCleanup() to be called? If
the answer is "nothing," why isn't that unsafe?

ExecProcNodeOriginal looks like it's basically the same thing as ExecProcNodeReal,
except that it's for a different purpose. But you would be hard-pressed to guess which
one is used for what purpose based on the field names or the comments. Maybe it's also
worth worrying about whether this is a scalable design. Can we find a way to use the
existing fields here instead of having to add a new one?

The documentation for the progressive_explain = { off | explain | analyze } option
seems like it should go into more detail about how the "explain" and "analyze" values
are different. I'm not 100% sure I know the answer, and I'm not the least-experienced
person who will ever read this documentation.

WrapMultiExecProcNodesWithExplain seems like a poor choice of name. It invites
confusion with MultiExecProcNode, to which it is unrelated.

I just went to some trouble to start breaking up the monolith that is explain.c, so
I'm not that happy about seeing this patch dump another 800+ lines of source code into
that file. Probably we should have a new source file for some of this, or maybe even
more than one.

The changes to explain.h add three new data types. Two of them begin with an uppercase
letter and one with a lowercase letter. That seems inconsistent. I also don't think
that the declaration of char plan[] is per PostgreSQL coding style. I believe we
always write char plan[FLEXIBLE_ARRAY_MEMBER]. Also, maybe it should be named
something other than plan, since it's really a string-ified explain-y kind of thing,
not literally a Plan. Also, can we please not have structure members with single
letter names? "h" and "p" are going to be completely ungreppable, and I like grepping.

It looks very strange to me that ProgressiveExplainPrint() seems to have a theoretical
possibility of generating the output and then throwing it away if we end up with
entry == NULL. I guess maybe that case is not supposed to happen because
ProgressiveExplainInit() is supposed to create the entry, but then why isn't this an
elog(ERROR) or something instead of a no-op?

It seems like when we replace a longer entry with a shorter one, we forget that it was
originally longer. Hence, if the length of a progressive EXPLAIN is alternately 2922
characters and 2923 characters, we'll reallocate on every other progressive EXPLAIN
instead of at most once.

I'll try to look at this some more tomorrow. It seems like a very interesting patch,
but time is very short for this release and it doesn't look to me like we have all the
kinks sorted out here just yet.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi Robert,
Thanks for sparing part of your precious time to look at the patch.
I acknowledge it is a very complex one. Since you're going to take
another look, here are some preliminary comments related to some
of the implementation concerns.
> I don't understand how this would be safe against interrupts or
> errors. If a running query is interrupted, what would cause
> ProgressiveExplainCleanup() to be called? If the answer is "nothing,"
> why isn't that unsafe?
The strategy I used here is to use a MemoryContextCallback
(ProgressiveExplainReleaseFunc), configured in the memory context
where the query is being executed, which is responsible for calling
ProgressiveExplainCleanup() if the query doesn't end gracefully.
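For clarity, the registration looks roughly like this (a sketch;
ProgressiveExplainReleaseFunc and ProgressiveExplainCleanup() are names from the
patch, the helper and other details are illustrative):

#include "postgres.h"
#include "executor/execdesc.h"
#include "utils/memutils.h"

extern void ProgressiveExplainCleanup(QueryDesc *queryDesc);    /* patch function */

static void
ProgressiveExplainReleaseFunc(void *arg)
{
    /* runs when the query's memory context is reset or deleted,
     * including after an error */
    ProgressiveExplainCleanup(NULL);
}

static void
ProgressiveExplainRegisterCleanup(MemoryContext queryContext)
{
    MemoryContextCallback *cb;

    /* the callback struct must live at least as long as the context itself */
    cb = MemoryContextAlloc(queryContext, sizeof(MemoryContextCallback));
    cb->func = ProgressiveExplainReleaseFunc;
    cb->arg = NULL;
    MemoryContextRegisterResetCallback(queryContext, cb);
}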
> It looks very strange to me that ProgressiveExplainPrint() seems to
> have a theoretical possibility of generating the output and then
> throwing it away if we end up with entry == NULL. I guess maybe that
> case is not supposed to happen because ProgressiveExplainInit() is
> supposed to create the entry, but then why isn't this an elog(ERROR)
> or something instead of a no-op?
Agreed. Will fix this.
> It seems like when we replace a longer entry with a shorter one, we
> forget that it was originally longer. Hence, if the length of a
> progressive EXPLAIN is alternately 2922 characters and 2923
> characters, we'll reallocate on every other progressive EXPLAIN
> instead of at most once.
Are you talking about re-printing the plan in the same query execution?
The logic for the code, using your example, would be to allocate 2922 +
PROGRESSIVE_EXPLAIN_FREE_SIZE (4096, currently) initially. If the next plans
alternate between 2922 and 2923, no additional allocation will be done.
A reallocation will be needed only if the plan length ends up exceeding
2922+4096. At the end of query execution (or cancellation) that DSA will
be freed and the next query execution will have to allocate again using the
same logic.
Regarding the execProcNode wrapper strategy: I used it precisely because
of the discussion in that other patch. I actually tried not using it here,
calling ProgressiveExplainPrint() in the timeout callback instead. This resulted
in sporadic crashes, confirming the suspicion that it is not a good
idea.
Regarding all other comments related to variable/function names and having
the feature in a separate file, I agree with all of them. I will send a
new version with the fixes.
Rafael.
On Wed, Mar 19, 2025 at 6:53 PM Rafael Thofehrn Castro <rafaelthca@gmail.com> wrote:
> The strategy I used here is to use a MemoryContextCallback
> (ProgressiveExplainReleaseFunc), configured in the memory context
> where the query is being executed, being responsible for calling
> ProgressiveExplainCleanup() if the query doesn't end gracefully.

Thanks for the pointer. I'm a bit skeptical about what's going on here in
ProgressiveExplainReleaseFunc(). It seems like we're depending on shared memory to
tell us whether we need to do purely backend-local cleanup, like calling
disable_timeout() and resetting ProgressiveExplainPending and activeQueryDesc. I would
have expected us to keep track in local memory of whether this kind of work needs to
be done. It seems roundabout to take an LWLock, do a hash table lookup to see if
there's an entry there, release the LWLock, and then very shortly thereafter take the
lock a second time to release the entry that we now know is there.

The comment in ProgressiveExplainCleanup about only detaching from the DSA if the
query ended gracefully is not ideal from my point of view because it says what the
code does instead of why the code does that thing. Also, the function is seemingly
called with queryDesc as an argument not because you need it for anything but because
you're going to test whether it's null. In that case, you could just pass a Boolean.
Even then, something seems odd about this: why do we have to be holding
ProgressiveExplainHashLock to dsa_detach the somewhat-inscrutably named area pe_a? And
why are we not detaching it in case of error?

I am wondering why you chose this relatively unusual error cleanup strategy. What I
would have done is invent AtEOXact_ProgressiveExplain and
AtSubEOXact_ProgressiveExplain. In some sense this looks simpler, because it doesn't
need separate handling for transactions and subtransactions, but it's so unlike what
we do in most places that it's hard for me to tell whether it's correct. I feel like
the approach you've chosen here would make sense if what we wanted to do was basically
release memory or some memory-like resource associated closely with the context --
e.g. expandedrecord.c releases a TupleDesc, but this is doing more than that.

I think the effect of this choice is that cleanup of the progressive-EXPLAIN stuff
happens much later than it normally would. Most of the time, in the case of an abort,
we would want AbortTransaction() to release as many resources as possible, leaving
basically no work to do at CleanupTransaction() time. This is so that if a user types
"SELECT 1/0;" we release resources, as far as possible, right away, and don't wait for
them to subsequently type "ROLLBACK;". The transaction lives on only as a shell. But
these resources, if I'm reading this right, are going to stick around until the
transaction is actually rolled back, because memory is not freed until
CleanupTransaction() time. I wonder what happens if a query inside of an explicit
transaction aborts after putting an entry in the progressive-explain view. My guess is
that the entry will stick around until the actual rollback happens.

In fact, now that I think about it, this is probably why we don't dsa_detach() in
ProgressiveExplainCleanup() in cases of error -- the resource owner cleanup will have
already released the DSA segments long before the memory context is deleted.

I'm sort of inclined to say that this should be rewritten to do error cleanup in a
more normal way. It's probably more code to do it that way, but I think having it be
more similar to other subsystems is probably worth quite a bit.

> > It seems like when we replace a longer entry with a shorter one, we
> > forget that it was originally longer. Hence, if the length of a
> > progressive EXPLAIN is alternately 2922 characters and 2923
> > characters, we'll reallocate on every other progressive EXPLAIN
> > instead of at most once.
>
> Are you talking about re-printing the plan in the same query execution?

Yes.

> The logic for the code, using your example, would be to allocate 2922 +
> PROGRESSIVE_EXPLAIN_FREE_SIZE (4096, currently) initially. If next plans
> alternate between 2922 and 2923 no additional allocation will be done.
> A reallocation will be needed only if the plan length ends up exceeding
> 2922+4096. At the end of query execution (or cancellation) that DSA will
> be freed and a next query execution will have to allocate again using the
> same logic.

It seems to me that if ProgressiveExplainPrint() reaches /* Update shared memory with
new data */ without reallocating, strlen(pe_data->plan) can be reduced. On the next
trip through the function, we don't know whether the string we're seeing is the
original string -- for which strlen()+PROGRESSIVE_EXPLAIN_FREE_SIZE) gives us the
original allocation size -- or whether the string we're seeing is a shorter one that
was copied over the original, longer string.

PROGRESSIVE_EXPLAIN_FREE_SIZE is big enough that this probably isn't much of a problem
in practice, because consecutive EXPLAIN outputs for the same query probably won't
vary in size by enough to cause any reallocation. Looking at this more carefully, I
think that the query plan would have to shrink in size by >4kB and then expand again
in order to trigger reallocation, which seems unlikely. But it still seems unclean to
me. Normally, we track how much space we have actually allocated explicitly, instead
of reconstructing that number from something else, especially something that isn't
guaranteed to produce an accurate result. I think you should just store the number of
available payload bytes at the beginning of the chunk and then reallocate if it isn't
big enough to hold the payload that you have got.

> Regarding the execProcNode wrapper strategy. It used it precisely because
> of the discussion in that other patch. I actually tried not using it here,
> and call ProgressiveExplainPrint() in the timeout callback. This resulted
> in sporadic crashes, confirming the suspicion that it is not a good
> idea.

Makes sense, but we need adequate documentation of what we did and why it works (or at
least why we think it works).

Another thing I just noticed is that pg_stat_progress_explain() uses
beentry->st_procpid == ped->pid as the permissions check, but a more typical test is
HAS_PGSTAT_PERMISSIONS(beentry->st_userid). I know that's only in pgstatfuncs.c, but I
think it would be OK to duplicate the associated test here (i.e.
has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) ||
has_privs_of_role(GetUserId(), role)). I don't really see a reason why this shouldn't
use the same permission rules as other pg_stat_ things, in particular
pg_stat_get_activity().

--
Robert Haas
EDB: http://www.enterprisedb.com
Hello Robert,
Fixed most of the recommendations. Going over one at a time.
> The documentation for the progressive_explain = { off | explain |
> analyze } option seems like it should go into more detail about how
> the "explain" and "analyze" values are different. I'm not 100% sure I
> know the answer, and I'm not the least-experienced person who will
> ever read this documentation.
Added details about behavior of each option. In the doc for that GUC
there is a link to another section that explains in detail how progressive
explains work.
> WrapMultiExecProcNodesWithExplain seems like a poor choice of name. It
> invites confusion with MultiExecProcNode, to which it is unrelated.
Renamed that function (and the unwrap equivalent) to WrapMemberNodesExecProcNodesWithExplain
as MemberNodes is the name used by explain when parsing those types of nodes.
> I do notice that WrapExecProcNodeWithExplain does not walk
> the ps->initPlan list, which I think is an oversight.
Fixed.
> I just went to some trouble to start breaking up the monolith that is
> explain.c, so I'm not that happy about seeing this patch dump another
> 800+ lines of source code into that file. Probably we should have a
> new source file for some or this, or maybe even more than one.
The whole progressive explain code was moved to a new set of files
explain_progressive.h and explain_progressive.c.
> The changes to explain.h add three new data types. Two of them begin
> with an uppercase letter and one with a lowercase letter. That seems
> inconsistent. I also don't think that the declaration of char plan[]
> is per PostgreSQL coding style. I believe we always write char
> plan[FLEXIBLE_ARRAY_MEMBER]. Also, maybe it should be named something
> other than plan, since it's really a string-ified explain-y kind of
> thing, not literally a Plan. Also, can we please not have structure
> members with single letter names? "h" and "p" are going to be
> completely ungreppable, and I like grepping
Done. Adjusted all data types to be uppercase. Added FLEXIBLE_ARRAY_MEMBER
and renamed "h" and "p".
> It looks very strange to me that ProgressiveExplainPrint() seems to
> have a theoretical possibility of generating the output and then
> throwing it away if we end up with entry == NULL. I guess maybe that
> case is not supposed to happen because ProgressiveExplainInit() is
> supposed to create the entry, but then why isn't this an elog(ERROR)
> or something instead of a no-op?
Changed to throw ERROR, which shouldn't happen.
> It seems like when we replace a longer entry with a shorter one, we
> forget that it was originally longer. Hence, if the length of a
> progressive EXPLAIN is alternately 2922 characters and 2923
> characters, we'll reallocate on every other progressive EXPLAIN
> instead of at most once.
Adjusted this logic. Structure ProgressiveExplainHashEntry now contains
a field to store the allocated size, which is used to compare with
the new plan being printed.
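Roughly like the sketch below (entry->allocated_size and
PROGRESSIVE_EXPLAIN_FREE_SIZE follow the thread; field names and the omission of the
struct header are illustrative simplifications):

/* in the progressive explain code, before copying the new plan text */
static void
ProgressiveExplainEnsureSpace(ProgressiveExplainHashEntry *entry,
                              dsa_area *area, Size plan_len)
{
    if (plan_len + 1 > entry->allocated_size)
    {
        if (DsaPointerIsValid(entry->plan_ptr))
            dsa_free(area, entry->plan_ptr);
        /* leave some slack so small growth doesn't reallocate every time */
        entry->allocated_size = plan_len + 1 + PROGRESSIVE_EXPLAIN_FREE_SIZE;
        entry->plan_ptr = dsa_allocate(area, entry->allocated_size);
    }
}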
> Thanks for the pointer. I'm a bit skeptical about what's going on here
> in ProgressiveExplainReleaseFunc(). It seems like we're depending on
> shared memory to tell us whether we need to do purely backend-local
> cleanup, like calling disable_timeout() and resetting
> ProgressiveExplainPending and activeQueryDesc. I would have expected
> us to keep track in local memory of whether this kind of work needs to
> be done. It seems roundabout to take an LWLock, do a hash table lookup
> to see if there's an entry there, release the LWLock, and then very
> shortly thereafter take the lock a second time to release the entry
> that we now know is there.
> The comment in ProgressiveExplainCleanup about only detaching from the
> DSA if the query ended gracefully is not ideal from my point of view
> because it says what the code does instead of why the code does that
> thing. Also, the function is seemingly called with queryDesc as an
> argument not because you need it for anything but because you're going
> to test whether it's null. In that case, you could just pass a
> Boolean. Even then, something seems odd about this: why do we have to
> be holding ProgressiveExplainHashLock to dsa_detach the
> somewhat-inscrutably named area pe_a? And why are we not detaching it
> in case of error?
> I am wondering why you chose this relatively unusual error cleanup
> strategy. What I would have done is invent AtEOXact_ProgressiveExplain
> and AtSubEOXact_ProgressiveExplain. In some sense this looks simpler,
> because it doesn't need separate handling for transactions and
> subtransactions, but it's so unlike what we do in most places that
> it's hard for me to tell whether it's correct. I feel like the
> approach you've chosen here would make sense if what we wanted to do
> was basically release memory or some memory-like resource associated
> closely with the context -- e.g. expandedrecord.c releases a
> TupleDesc, but this is doing more than that.
> I think the effect of this choice is that cleanup of the
> progressive-EXPLAIN stuff happens much later than it normally would.
> Most of the time, in the case of an abort, we would want
> AbortTransaction() to release as many resources as possible, leaving
> basically no work to do at CleanupTransaction() time. This is so that
> if a user types "SELECT 1/0;" we release resources, as far as
> possible, right away, and don't wait for them to subsequently type
> "ROLLBACK;". The transaction lives on only as a shell. But these
> resources, if I'm reading this right, are going to stick around until
> the transaction is actually rolled back, because memory is not freed
> until CleanupTransaction() time. I wonder what happens if a query
> inside of an explicit transaction aborts after putting an entry in the
> progressive-explain view. My guess is that the entry will stick around
> until the actual rollback happens.
> In fact, now that I think about it, this is probably why we don't
> dsa_detach() in ProgressiveExplainCleanup() in cases of error -- the
> resource owner cleanup will have already released the DSA segments
> long before the memory context is deleted.
> I'm sort of inclined to say that this should be rewritten to do error
> cleanup in a more normal way. It's probably more code to do it that
> way, but I think having it be more similar to other subsystems is
> probably worth quite a bit.
Right. I use dsa_detach() only when the query finishes gracefully. This
is needed; otherwise PG will complain with:
WARNING: resource was not closed: dynamic shared memory segment 32349774
That WARNING doesn't appear in case of error and according to my tests
shared memory is correctly being released.
I completely removed the MemoryContextCallback strategy and switched to
using the new function AtEOXact_ProgressiveExplain() called from AbortTransaction().
This first version of the progressive explain feature was designed to only keep
track of the initial query called by the backend, ignoring all subquery calls. So
I believe we don't need to worry about having to add custom logic in
AbortSubTransaction(). In case the query errors out, AbortTransaction() will be called
and everything related to progressive explains will be cleaned up.
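A rough sketch of that abort-path hook (activeQueryDesc and
ProgressiveExplainCleanup() are the patch's names; the exact signature and details
are assumed):

#include "postgres.h"
#include "executor/execdesc.h"

extern QueryDesc *activeQueryDesc;                              /* patch state */
extern void ProgressiveExplainCleanup(QueryDesc *queryDesc);    /* patch function */

void
AtEOXact_ProgressiveExplain(bool isCommit)
{
    if (activeQueryDesc != NULL)
    {
        /* on commit this should already have been cleaned up */
        if (isCommit)
            elog(WARNING, "leaked progressive explain query descriptor");
        ProgressiveExplainCleanup(NULL);
        activeQueryDesc = NULL;
    }
}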
> Another thing I just noticed is that pg_stat_progress_explain() uses
> beentry->st_procpid == ped->pid as the permissions check, but a more
> typical test is HAS_PGSTAT_PERMISSIONS(beentry->st_userid). I know
> that's only in pgstatfuncs.c, but I think it would be OK to duplicate
> the associated test here (i.e. has_privs_of_role(GetUserId(),
> ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(), role)). I
> don't really see a reason why this shouldn't use the same permission
> rules as other pg_stat_ things, in particular pg_stat_get_activity().
Adjusted the logic to use has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) ||
has_privs_of_role(GetUserId(), role) when checking the privileges.
Rafael.
Attachment
Sending a new version that includes the new explain_progressive.c file in meson.build.
Rafael.
Attachment
On Tue, Mar 25, 2025 at 8:52 PM Rafael Thofehrn Castro <rafaelthca@gmail.com> wrote:
> This first version of the progressive explain feature was designed to only keep
> track of initial query called by the backend, ignoring all subquery calls. So
> I believe we don't need to worry about having to add custom logic in
> AbortSubTransaction(). In case query errors out AbortTransaction() will be called
> and everything related to progressive explains will be cleaned.

Suppose:

BEGIN;
SELECT 1;
SAVEPOINT bob;
progressively explain something that aborts

I think in this case we will call AbortSubTransaction(), not AbortTransaction().

--
Robert Haas
EDB: http://www.enterprisedb.com
> Suppose:
> BEGIN;
> SELECT 1;
> SAVEPOINT bob;
> progressively explain something that aborts
> I think in this case we will call AbortSubTransaction(), not AbortTransaction().
Indeed. We need special treatment for subtransactions. There are 2 scenarios where
AbortSubTransaction() will be called alone and can affect progressive explains:
savepoints and PL/pgSQL exception blocks.
We don't want subxact aborts in PL/pgSQL exception blocks to perform cleanup
as the original query will continue to run and progressive explain tracking
is still applicable there.
So the implementation was done based on the transaction nesting level. Cleanup is only
performed when AbortSubTransaction() is called at the same transaction nesting
level as the one where the query is running. This covers both PL/pgSQL exception
blocks and savepoints.
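To illustrate the two cases (hypothetical example, not from the patch or the
regression tests):
-- Savepoint: the failing query runs at the savepoint's nesting level,
-- so AbortSubTransaction() there should clean up its progressive explain.
BEGIN;
SAVEPOINT bob;
SELECT 1/0;        -- aborts the subtransaction at this nesting level
ROLLBACK TO bob;
COMMIT;
-- PL/pgSQL exception block: the trapped error aborts a deeper subtransaction
-- while the outer query keeps running, so the outer query's progressive
-- explain tracking must be left alone.
CREATE FUNCTION f(x int) RETURNS int LANGUAGE plpgsql AS $$
BEGIN
    RETURN 1 / x;
EXCEPTION WHEN division_by_zero THEN
    RETURN 0;
END $$;
SELECT sum(f(g)) FROM generate_series(0, 1000000) g;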
Rafael.
Attachment
On Wed, Mar 26, 2025 at 5:58 PM Rafael Thofehrn Castro <rafaelthca@gmail.com> wrote: > So implementation was done based on transaction nested level. Cleanup is only > performed when AbortSubTransaction() is called in the same transaction nested > level as the one where the query is running. This covers both PL/pgSQL exception > blocks and savepoints. Right. I think this is much closer to being correct. However, I actually think it should look more like this: void AtEOSubXact_ProgressiveExplain(bool isCommit, int nestDepth) { if (activeQueryDesc != NULL && activeQueryXactNestLevel >= nestDepth) { if (isCommit) elog(WARNING, "leaked progressive explain query descriptor"); ProgressiveExplainCleanup(NULL); } } By including the is-commit case in there, we can catch any bugs where things aren't cleaned up properly before a transaction is committed. We generally want to test >= nestDepth instead of == nestDepth in case multiple subtransaction levels abort all at once; I'm not sure it matters here, but even if it isn't, it's best to be consistent with the practice elsewhere. Having {Commit,Abort}SubTransaction pass the nestDepth instead of calling GetCurrentTransactionNestLevel() also has precedent e.g. see AtEOSubXact_HashTables. As a further refinement, consider initializing activeQueryXactNestLevel to -1 and resetting it to that value each time you end a progressive EXPLAIN, so that activeQueryDesc != NULL if and only if activeQueryXactNestLevel >= 0. Then the test above can be simplified to just if (activeQueryXactNestLevel >= nestDepth). The comment for ProgressiveExplainIsActive() is a copy-paste from another function that you forgot to update. Also, the function body could be reduced to a one-liner: return queryDesc == activeQueryDesc; There is a comment in ExplainOnePlan() that says "Handle progressive explain cleanup manually if enabled as it is not called as part of ExecutorFinish," but standard_ExecutorFinish does indeed call ProgressiveExplainFinish(), so either the comment is misleading or the code is wrong. standard_ExecutorFinish() makes its call to ProgressiveExplainFinish() dependent on the value of the progressive_explain GUC, but that GUC could be changed in mid-query via set_config() or a plpgsql function that calls SET, which could result in skipping the cleanup even when it's needed. I think you should make the call unconditional and make it entirely the job of ProgressiveExplainFinish() to decide whether any cleanup is needed. ProgressiveExplainFinish() calls ProgressiveExplainCleanup() in most cases, but sometimes just disables the timeout instead. I think this is weird. I think you should just always call ProgressiveExplainCleanup() and make sure it's robust and cleans up however much or little is appropriate in all cases. On the flip side, I can't really see why dsa_detach(queryDesc->pestate->pe_a) needs to be done while holding ProgressiveExplainHashLock. Why not just have ProgressiveExplainFinish() call ProgressiveExplainCleanup(), and then afterwards it can do the dsa_detach() in the caller? Then ProgressiveExplainCleanup() no longer needs an argument. ProgressiveExplainPrint() can save a level of indentation in a large chunk of the function by understanding that elog(ERROR) does not return. You don't need to wrap everything that follows in else {}. In the documentation table called pg-stat-progress-explain-view, some descriptions end with a period and others do not. 
Calling ProgressiveExplainTrigger() directly from ProgressiveExplainStartupTimeoutHandler() seems extremely scary -- we've tried hard to make our signal handlers do only very simple things like setting a flag, and this one runs around the entire PlanState tree modifying Very Important Things. I fear that's going to be hard to make robust. Like, what happens if we're going around trying to change ExecProcNode pointers while the calling code was also going around trying to change ExecProcNode pointers? I can't quite see how this won't result in chaos. -- Robert Haas EDB: http://www.enterprisedb.com
Thanks Robert. Very thorough analysis there.
Things I don't comment on will be fixed without further discussion.
> There is a comment in ExplainOnePlan() that says "Handle progressive
> explain cleanup manually if enabled as it is not called as part of
> ExecutorFinish," but standard_ExecutorFinish does indeed call
> ProgressiveExplainFinish(), so either the comment is misleading or the
> code is wrong.
The comment is misleading. What I meant to say is that queries executed via
EXPLAIN (without analyze) don't call ExecutorFinish, so ProgressiveExplainFinish
isn't called in that context. I will fix the comment.
> Calling ProgressiveExplainTrigger() directly from
> ProgressiveExplainStartupTimeoutHandler() seems extremely scary --
> we've tried hard to make our signal handlers do only very simple
> things like setting a flag, and this one runs around the entire
> PlanState tree modifying Very Important Things. I fear that's going to
> be hard to make robust. Like, what happens if we're going around
> trying to change ExecProcNode pointers while the calling code was also
> going around trying to change ExecProcNode pointers? I can't quite see
> how this won't result in chaos.
Agreed. In that other similar patch to log query plans after a signal is sent
from another session there was the same discussion and concerns.
I don't see another way of doing it here. This patch became 100x more complex
after I added GUC progressive_explain_min_duration, that required changing the
execProcNode wrapper on the fly.
I can see some alternatives here:
A) Use a temporary execProcNode wrapper set at query start here:
/*
 * If instrumentation is required, change the wrapper to one that just
 * does instrumentation. Otherwise we can dispense with all wrappers and
 * have ExecProcNode() directly call the relevant function from now on.
 */
if (node->instrument)
{
    /*
     * Use wrapper for progressive explains if enabled and the node
     * belongs to the currently tracked query descriptor.
     */
    if (progressive_explain == PROGRESSIVE_EXPLAIN_ANALYZE &&
        ProgressiveExplainIsActive(node->state->query_desc))
        node->ExecProcNode = ExecProcNodeInstrExplain;
    else
        node->ExecProcNode = ExecProcNodeInstr;
This wrapper will have an additional logic that will check if a boolean set
by the timeout function has changed. When that happens the initial progressive
explain setup is done and execProcNode is unwrapped in place. This should be
safe.
The problem here is that all queries would always be using the custom
wrapper until timeout is triggered, and that can add unnecessary overhead.
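A rough sketch of what that wrapper could look like (illustrative only;
ProgressiveExplainPending is an assumed flag set by the timeout callback, and the
unwrapping details are not meant to be the actual patch):
static TupleTableSlot *
ExecProcNodeInstrExplain(PlanState *node)
{
    TupleTableSlot *result;

    /* Until the timeout fires, this adds one branch on top of the
     * normal instrumented path. */
    if (ProgressiveExplainPending)
    {
        ProgressiveExplainPending = false;
        ProgressiveExplainTrigger();            /* initial setup, outside the handler */
        node->ExecProcNode = ExecProcNodeInstr; /* unwrap this node in place */
    }

    InstrStartNode(node->instrument);
    result = node->ExecProcNodeReal(node);
    InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);

    return result;
}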
B) We get rid of the idea of applying progressive explains to non instrumented
runs (which was my original idea). My main use-case here is to see progress
of instrumented runs anyways. For that idea we have 2 possibilities:
B1) Keep implementation as is, with all the existing GUCs to control what
is included in the printed plan. We just change GUC progressive_explain to be
a boolean again. If true, instrumentation will be enabled for the query being
executed and progressive explain will be printed consecutively.
B2) Get rid of all new GUCs I added and change the progressive explain feature
to become an option of the EXPLAIN command. Something like:
EXPLAIN (ANALYZE, PROGRESSIVE) SELECT * FROM ...
(B1) would allow progressive explains in regular queries, but I'm not sure people
would be willing to enable it globally as it adds instrumentation overhead.
What do you think of the options?
Rafael.
On Thu, Mar 27, 2025 at 9:38 PM Rafael Thofehrn Castro <rafaelthca@gmail.com> wrote: > > Calling ProgressiveExplainTrigger() directly from > > ProgressiveExplainStartupTimeoutHandler() seems extremely scary -- > > Agreed. In that other similar patch to log query plans after a signal is sent > from another session there was the same discussion and concerns. > > I don't see another way of doing it here. This patch became 100x more complex > after I added GUC progressive_explain_min_duration, that required changing the > execProcNode wrapper on the fly. > > I can see some alternatives here: > > A) Use a temporary execProcNode wrapper set at query start here: > > The problem here is that all queries would always be using the custom > wrapper until timeout is triggered, and that can add unnecessary overhead. > > B) We get rid of the idea of applying progressive explains to non instrumented > runs (which was my original idea). My main use-case here is to see progress > of instrumented runs anyways. For that idea we have 2 possibilities: > > B1) Keep implementation as is, with all the existing GUCs to control what > is included in the printed plan. We just change GUC progressive_explain to be > a boolean again. If true, instrumentation will be enabled for the query being > executed and progressive explain will be printed consecutively. > > B2) Get rid of all new GUCs I added and change the progressive explain feature > to become an option of the EXPLAIN command. Something like: > > EXPLAIN (ANALYZE, PROGRESSIVE) SELECT * FROM ... > > (B1) would allow progressive explans in regular queries, but I'm not sure people > would be willing to enable it globally as it adds instrumentation overhead. I'm inclined to think that there isn't much value to this feature with progressive_explain=explain. If you just want to print plans after a certain amount of elapsed runtime, you can already use auto_explain to do that. Unless there's some pretty large benefit to doing it with this feature over what that already does, I think we shouldn't invent a second method. Granted, auto_explain is a contrib module and this is proposed as a core feature, but I feel like that use case of printing the plan once is so different from what I see as the core value proposition of this feature that I think it might just be confusing to include it in scope. There's nothing actually "progressive" about that, because you're just doing it once. But having said that, I'm not quite sure I understand why you're proposing (A) and (B1) as separate alternatives. Changing progressive_explain to be a Boolean doesn't seem like it solves the problem of needing to wrap ExecProcNode from a signal handler. The only thing that seems to solve that problem is to instead do the wrapping at the start of the query, which AIUI is (A). So I feel like you should do (A) to solve the problem I thought we were talking about and (B1) to make things simpler. Am I misunderstanding the trade-offs here? I did consider proposing (B2) yesterday, not so much because of this issue but because I wondered whether it might just be a better interface. But on reflection I decided it wasn't, because it removes the option to just turn this feature on for all of your queries, which somebody might want to do. Also, it would actually be kind of a weird interface, because every other form of EXPLAIN returns the plan as output and throws away the original query output; but this form would store the plan ephemerally in a view and return the original output. 
I'm sure we could make that work and find a way to document it but it seems odd. In general, I think we should err on the side of making this feature safe even at some performance cost. If we put in place a version of this feature that imposes a great performance overhead but does not crash the server, some people will be unhappy, complain, and/or be unable to use the feature. That will be sad. But, if we put in place a version of this feature that performs great and occasionally crashes the server, that will be much sadder. So let's do something we're very confident is safe. There is always the opportunity to change things in the future if we're more confident about some questionable choice then than we are now -- performance optimization can be done after the fact. But if it's not stable, the only thing we're likely to change in the future is to revert the whole thing. -- Robert Haas EDB: http://www.enterprisedb.com
On 3/26/25 22:57, Rafael Thofehrn Castro wrote: > So implementation was done based on transaction nested level. Cleanup is > only > performed when AbortSubTransaction() is called in the same transaction > nested > level as the one where the query is running. This covers both PL/pgSQL > exception > blocks and savepoints. Thanks for your efforts! Let me provide an alternative view of your code. Postgres has a massive gap between the start of the execution and the end. All the stuff that happens at that time can't be discovered. That's sad. If you add a new hook into the ExecuteNode (it may be designed as a pointer in the PlanState to affect only necessary nodes), you may push the code out of the core and give other extensions additional tools. I see your reasoning here [1], but I think with the commit 4fd02bf7cf9, it should be revised. As I see after reviewing your code, to design it as an extension, some parallel query execution stuff needs to be exported, and Instrumentation needs to be changed a little. For example, with progressive explain we have a new node state - 'not visited yet' that is different from 'not executed'. What is the purpose of a new hook from a broad perspective? I designed at least two features with such a hook in the Postgres fork: 1) real-time execution statistics and 2) query interruption on the arbitrary trigger (planning error detected, a signal arrived, too much temp memory allocated, etc.). I am not sure if it would be interesting for the community, but when a query is executed for more than one minute, I certainly want to control what is happening and have some tools except query abortion. The benefit of such an approach is that it is doable to change the Instrumentation, add a hook now, and develop this extension without a rush until it is stable - I think at least the case of parallel execution may be enhanced. [1] https://www.postgresql.org/message-id/CAG0ozMrtK_u8Uf5KNZUmRNuMphV5tnC5DEhRBNRGK%2BK4L506xw%40mail.gmail.com -- regards, Andrei Lepikhov
> But having said that, I'm not quite sure I understand why you're
> proposing (A) and (B1) as separate alternatives. Changing
> progressive_explain to be a Boolean doesn't seem like it solves the
> problem of needing to wrap ExecProcNode from a signal handler. The
> only thing that seems to solve that problem is to instead do the
> wrapping at the start of the query, which AIUI is (A). So I feel like
> you should do (A) to solve the problem I thought we were talking about
> and (B1) to make things simpler. Am I misunderstanding the trade-offs
> here?
Both (A) and (B1) use the same strategy, which is to wrap at query start. What
changes is that (A), where we keep the 'explain' option, allows users to still
see the plan without having to include instrumentation overhead.
But (A) is doomed from the start, as the custom wrapper will carry extra logic
that adds some overhead. That is what I was able to avoid with the current
patch, which does the wrapping in the timeout handler function.
In (B1) it is OK to have a custom wrapper that does the progressive explain
at fixed intervals, as that overhead is far less noticeable than the overhead
of the already existing ExecProcNodeInstr wrapper. I tested that.
Notice that the gist of (B1) is already part of the existing patch. If progressive_explain
is set to 'analyze' I set a wrapper ExecProcNodeInstrExplain at query start.
So I think the way to go is with (B1), where the ExecProcNodeInstrExplain
wrapper will continue to do what it does, but only after progressive_explain_min_duration
has passed (boolean flag set by the timeout handler function). And I get rid
of the 'explain'.
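The handler side then stays trivial, which is what addresses the signal-safety
concern (sketch; the timeout id and flag name are made up for illustration):
static volatile sig_atomic_t ProgressiveExplainPending = false;

/* Timeout callback: only set a flag, never touch the PlanState tree here. */
static void
ProgressiveExplainStartupTimeoutHandler(void)
{
    ProgressiveExplainPending = true;
}

/* At query start, assuming the timeout id was registered earlier with
 * RegisterTimeout(): */
enable_timeout_after(PROGRESSIVE_EXPLAIN_TIMEOUT,
                     progressive_explain_min_duration);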
As you said, visibility on the non instrumented query plan is already a feature
of auto_explain.
> The benefit of such an approach is that it is doable to change the
> Instrumentation, add a hook now, and develop this extension without a
> rush until it is stable - I think at least the case of parallel
> execution may be enhanced.
Thanks for the input Andrei! The main issue is that progressive explain
touches a bunch of different parts of the code that would require additional
hooks: for example, to adjust the execProcNode wrapper at query start. There is
also no hook for xact/subxact cleanups yet, as far as I know.
The biggest challenge is what lies inside ExplainNode(), an already complex
function that calls several other helper functions. Progressive explain had
to change several parts of it.
The only solution I found back then while thinking about this idea was to make
ExplainNode itself a hook, and the extension would duplicate the whole code with
the additional custom logic. And that is definitely not a good idea.
With the new hooks Robert added it is indeed one big step ahead, but we still need
to deal with instrumentation and other nuances as you said.
I am definitely not the authority here to talk about the best way forward.
If there is an agreement in turning this into an extension, it can be a new
feature in auto_explain.
Rafael.
On Fri, Mar 28, 2025 at 12:09 PM Rafael Thofehrn Castro <rafaelthca@gmail.com> wrote: > I am definitely not the authority here to talk about the best way forward. > If there is an agreement in turning this into an extension, it can be a new > feature in auto_explain. I'm not against adding some more hooks to explain.c, but I don't really see how you could add hooks to explain.c that would allow somebody to implement this feature out of core, unless either (a) you are thinking that they will copy and paste a lot of core code or (b) you make the hooks extremely fine-grained and special-purpose, designed to accomplish the extremely precise thing that this feature would need them to do. But that just seems like bad hook design. I don't think hooks have to be amazingly general in order to be accepted, but if there's no use of the hook that doesn't involve a big cut-and-paste operation, that's poor, and if the hook can only ever be used to implement one thing, you might as well put the thing in core. I feel like there are just two separate things here. There is Rafael's progressive EXPLAIN patch which, as far as I can see currently, is really only going to work as in-core feature, and then there is Andrei's desire for more hooks, which may very well also be a good idea but not the same idea. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Mar 28, 2025 at 12:09 PM Rafael Thofehrn Castro <rafaelthca@gmail.com> wrote: > As you said, visibility on the non instrumented query plan is already a feature > of auto_explain. True, although thinking about it more, they're not being sent to the same place. auto_explain goes to the log, and this goes to a view. What about something like this: progressive_explain = on | off progressive_explain_inteval = 5s If progressive_explain is off, then we don't populate the view. If it's on, then we populate the view during query startup. Every time the interval elapses, we update it again. If you want to update the view just once at query startup and not thereafter, you can set progressive_explain_interval = 0. -- Robert Haas EDB: http://www.enterprisedb.com
> True, although thinking about it more, they're not being sent to the
> same place. auto_explain goes to the log, and this goes to a view.
> What about something like this:
> progressive_explain = on | off
> progressive_explain_inteval = 5s
> If progressive_explain is off, then we don't populate the view. If
> it's on, then we populate the view during query startup. Every time
> the interval elapses, we update it again. If you want to update the
> view just once at query startup and not thereafter, you can set
> progressive_explain_interval = 0.
That also works. My concern is that with that approach there is a huge
"hidden" change of execution behavior between setting progressive_explain_interval
to 0 and something greater than that.
Setting to 0 doesn't do anything other than dumping the plain plan at
query start. Setting any other value will enable instrumentation under the
hood. This would have to be very well documented.
I'm still more inclined to use:
progressive_explain = off | explain | analyze
progressive_explain_interval = any millisecond greater than a min threshold
(currently 10ms). It doesn't make sense to be dumping the instrumented plan
every 1ms for example, IMHO.
On Fri, Mar 28, 2025 at 3:59 PM Rafael Thofehrn Castro <rafaelthca@gmail.com> wrote: > > True, although thinking about it more, they're not being sent to the > > same place. auto_explain goes to the log, and this goes to a view. > > What about something like this: > > progressive_explain = on | off > > progressive_explain_inteval = 5s > > I'm still more inclined to use: > > progressive_explain = off | explain | analyze > progressive_explain_interval = any millisecond greater than a min threshold > (currently 10ms). It doesn't make sense to be dumping the instrumented plan > every 1ms for example, IMHO. I still have trouble understanding what that means. Is the interval irrelevant except when progressive_explain = analyze? -- Robert Haas EDB: http://www.enterprisedb.com
> I still have trouble understanding what that means. Is the interval
> irrelevant except when progressive_explain = analyze?
That is correct. Interval currently is only used when instrumentation
is enabled through progressive_explain = analyze.
> That is correct. Interval currently is only used when instrumentation
> is enabled through progressive_explain = analyze.
I guess there would still be a point in printing the plan on a regular interval
when instrumentation is disabled. In that case the only thing we would see
changing in the plan is the node currently being executed. But with that we
add more overhead to progressive_explain = 'explain', that in that case will
also require a custom execProcNode wrapper.
On Fri, Mar 28, 2025 at 4:12 PM Rafael Thofehrn Castro <rafaelthca@gmail.com> wrote: > > That is correct. Interval currently is only used when instrumentation > > is enabled through progressive_explain = analyze. > > I guess there would still be a point in printing the plan on a regular interval > when instrumentation is disabled. In that case the only thing we would see > changing in the plan is the node currently being executed. But with that we > add more overhead to progressive_explain = 'explain', that in that case will > also require a custom execProcNode wrapper. I don't think that's at all worth it. I had even considered asking for the current node stuff to be removed altogether. Doing regular updates to only change that seems like a real waste. It's barely ever going to be useful to see the node being executed right this exact instant - what you care about is which nodes suck up resources over time. As far as option naming is concerned, I'm open to input from others, but personally I feel like ANALYZE is a bit opaque as an EXPLAIN option in general. We've had previous discussions about the fact that we might have wanted to name it EXECUTE if EXPLAIN EXECUTE didn't already mean something else. I think for most people, what ANALYZE means - somewhat counterintuitively - is that we're actually going to run the query. But here we're always going to run the query, so my brain just goes into a tailspin trying to figure out what ANALYZE means. So, personally, I would like to see progressive_explain = off | explain | analyze go away in favor of progressive_explain = off | on. I think we have two ideas on how to get there so far: (1) Just get rid of what you call progressive_explain = explain. After all, that amounts to just serializing the plan once, and maybe that's not actually such a great feature. With that change, then we only have two remaining values for the progressive_explain, and we can call them "off" and "on". (2) Merge progressive_explain = explain and progressive_explain = analyze into a single value, progressive_explain = on. To tell which is intended, just check progresive_explain_interval. If it's zero, then there's no repeat, so we mean what you currently call progressive_explain = explain i.e. serialize once. Otherwise we mean progressive_explain = analyze. I'm open to idea #3, but I'm pretty resistant to the idea of progressive_explain remaining three-valued as it is today. If 27 people show up to say that they understand what that means perfectly and Robert's a big fat idiot, I shall of course defer to the consensus, but I kind of think I might not be alone in finding this naming confusing. For what it's worth, I have trouble seeing how anyone gets confused if we do what I propose as #2. I mean, I agree with your point that they could misunderstand how much overhead different settings will cause, so we would need to document that carefully. But if you're just wondering about the behavior, then I feel like people have a pretty good chance of guessing that progressive_explain=on, progressive_explain_interval=0 means do it once. You could think that means do it at maximum speed, but that seems like a somewhat naive interpretation. You could also think it disables the feature, effectively negating progressive_explain=on, but then you should start to wonder why there are two GUCs. If you don't think either of those things, then I think "do it once" is the only other reasonable interpretation. Of course, sometimes what I think turns out to be completely wrong! 
-- Robert Haas EDB: http://www.enterprisedb.com
> (2) Merge progressive_explain = explain and progressive_explain =
> analyze into a single value, progressive_explain = on. To tell which
> is intended, just check progresive_explain_interval. If it's zero,
> then there's no repeat, so we mean what you currently call
> progressive_explain = explain i.e. serialize once. Otherwise we mean
> progressive_explain = analyze.
Implemented this version. New patch has the following characteristics:
- progressive_explain is a boolean
- GUC progressive_explain_min_duration is removed
- if progressive_explain_interval = 0, update the plan in memory only at query
start
- if progressive_explain_interval > 0, share the plan in memory at query
start and update it every progressive_explain_interval (instrumentation is
enabled automatically)
- The plan shared in pg_stat_progress_explain at query start does not
contain any instrumentation detail ((never executed), (actual time), etc.)
if progressive_explain_interval = 0, even if the query is started with EXPLAIN
ANALYZE. My reasoning here is that if the plan will be shared only once we
are actually interested in seeing the plan itself and not instrumentation
progress.
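To make that concrete, a hypothetical session would look like this (big_table is
just a placeholder):
-- Session 1: enable the feature and run something long.
SET progressive_explain = on;
SET progressive_explain_interval = '2s';  -- 0 would mean: share the plan once at query start
SELECT count(*) FROM big_table a JOIN big_table b USING (id);

-- Session 2: watch the instrumented plan evolve while the query runs.
SELECT * FROM pg_stat_progress_explain;
\watch 1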
Fixed other comments you shared in previous emails too:
> void
> AtEOSubXact_ProgressiveExplain(bool isCommit, int nestDepth)
> {
> if (activeQueryDesc != NULL &&
> activeQueryXactNestLevel >= nestDepth)
> {
> if (isCommit)
> elog(WARNING, "leaked progressive explain query descriptor");
> ProgressiveExplainCleanup(NULL);
> }
>}
> By including the is-commit case in there, we can catch any bugs where
> things aren't cleaned up properly before a transaction is committed.
Added the isCommit logic both to Transaction and Subtransaction commits
and aborts that will notify about leakage if cleanup was not properly
done in a commit. Changed function names back to AtEOXact and AtEOSubXact,
as opposed to AtAbort and AtSubAbort.
> We generally want to test >= nestDepth instead of == nestDepth in case
> multiple subtransaction levels abort all at once; I'm not sure it
> matters here, but even if it isn't, it's best to be consistent with
> the practice elsewhere. Having {Commit,Abort}SubTransaction pass the
> nestDepth instead of calling GetCurrentTransactionNestLevel() also has
> precedent e.g. see AtEOSubXact_HashTables.
Done.
> As a further refinement, consider initializing
> activeQueryXactNestLevel to -1 and resetting it to that value each
> time you end a progressive EXPLAIN, so that activeQueryDesc != NULL if
> and only if activeQueryXactNestLevel >= 0. Then the test above can be
> simplified to just if (activeQueryXactNestLevel >= nestDepth).
Done.
> standard_ExecutorFinish() makes its call to ProgressiveExplainFinish()
> dependent on the value of the progressive_explain GUC, but that GUC
> could be changed in mid-query via set_config() or a plpgsql function
> that calls SET, which could result in skipping the cleanup even when
> it's needed. I think you should make the call unconditional and make
> it entirely the job of ProgressiveExplainFinish() to decide whether
> any cleanup is needed.
Done.
> ProgressiveExplainFinish() calls ProgressiveExplainCleanup() in most
> cases, but sometimes just disables the timeout instead. I think this
> is weird. I think you should just always call
> ProgressiveExplainCleanup() and make sure it's robust and cleans up
> however much or little is appropriate in all cases.
Now that I removed all the execProcNode wrapping and the conditional cleanup
based on progressive_explain_min_duration (that got removed), this
part became much simpler. ProgressiveExplainFinish always calls
ProgressiveExplainCleanup.
> On the flip side, I can't really see why
> dsa_detach(queryDesc->pestate->pe_a) needs to be done while holding
> ProgressiveExplainHashLock. Why not just have
> ProgressiveExplainFinish() call ProgressiveExplainCleanup(), and then
> afterwards it can do the dsa_detach() in the caller? Then
> ProgressiveExplainCleanup() no longer needs an argument.
That is perfect. Implemented.
> ProgressiveExplainPrint() can save a level of indentation in a large
> chunk of the function by understanding that elog(ERROR) does not
> return. You don't need to wrap everything that follows in else {}.
Fixed.
I will not fix documentation for now as we are not done with
implementation changes yet. Once we agree that the code logic is
safe and sound we can discuss cosmetics (feature name, GUCs, view, etc).
Rafael.
Attachment
On Fri, Mar 7, 2025 at 6:43 AM Rafael Thofehrn Castro <rafaelthca@gmail.com> wrote: > The wrapper code was implemented by torikoshia > (torikoshia(at)oss(dot)nttdata(dot)com), > so adding the credits here. On Thu, Mar 20, 2025 at 5:35 AM Robert Haas <robertmhaas@gmail.com> wrote: > Without having the prior discussion near to hand, I *think* that the > reason we wanted to do this wrap/unwrap stuff is to make it so that > the progressive EXPLAIN code could only execute when entering a new > plan node rather than at any random point inside of that plan node, > and that does seem a lot safer than the alternative. Your assumption is correct. Various approaches were suggested, such as picking a small number of safe and sufficient places for this feature or classifying CFI() calls into safe and unsafe ones. However, in the end, the wrapping approach [1] was the only one that remained On 2025-03-30 02:51, Rafael Thofehrn Castro wrote: > Implemented this version. New patch has the following characteristics: I haven't looked into the code yet, but when I ran below commands during make installcheck, there was an error and an assertion failure =# select * from pg_stat_progress_explain; =# \watch 0.1 ERROR: could not attach to dynamic shared area WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. The connection to the server was lost. Attempting reset: Failed. 
TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File: "ruleutils.c", Line: 8802, PID: 73180
(the same assertion also fired in PIDs 73181, 73182 and 73183; their backtraces, originally interleaved, are identical)
0 postgres 0x000000010365f5c4 ExceptionalCondition + 236
1 postgres 0x00000001035a7830 get_parameter + 1076
2 postgres 0x000000010359ff2c get_rule_expr + 276
3 postgres 0x00000001035a841c get_rule_expr_paren + 168
4 postgres 0x00000001035a82c4 get_oper_expr + 292
5 postgres 0x00000001035a01f8 get_rule_expr + 992
6 postgres 0x0000000103598520 deparse_expression_pretty + 176
7 postgres 0x0000000103598e78 deparse_expression + 76
8 postgres 0x0000000102f94198 show_expression + 100
9 postgres 0x0000000102f97690 show_qual + 112
10 postgres 0x0000000102f93734 show_scan_qual + 132
11 postgres 0x0000000102f90680 ExplainNode + 6828
12 postgres 0x0000000102f8d398 ExplainPrintPlan + 540
13 postgres 0x0000000102f9b974 ProgressiveExplainPrint + 72
14 postgres 0x0000000102f9b920 ProgressiveExplainStart + 660
15 postgres 0x00000001030771e0 standard_ExecutorStart + 984
16 postgres 0x0000000103076de8 ExecutorStart + 112
17 postgres 0x0000000103080d90 ParallelQueryMain + 292
18 postgres 0x0000000102df7ef8 ParallelWorkerMain + 1712
19 postgres 0x00000001032a5d60 BackgroundWorkerMain + 824
20 postgres 0x00000001032a9ee8 postmaster_child_launch + 492
21 postgres 0x00000001032b4c10 StartBackgroundWorker + 416
22 postgres 0x00000001032af9d8 maybe_start_bgworkers + 552
23 postgres 0x00000001032b26cc LaunchMissingBackgroundProcesses + 1316
24 postgres 0x00000001032afcb0 ServerLoop + 616
25 postgres 0x00000001032ae55c PostmasterMain + 6632
26 postgres 0x0000000103121160 main + 952
27 dyld 0x000000019cdc0274 start + 2840
[1] https://www.postgresql.org/message-id/ac6c51071316279bf903078cf264c37a%40oss.nttdata.com
-- Atsushi Torikoshi Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
On 3/31/25 02:23, torikoshia wrote: > On Fri, Mar 7, 2025 at 6:43 AM Rafael Thofehrn Castro >> Implemented this version. New patch has the following characteristics: > > > I haven't looked into the code yet, but when I ran below commands during > make installcheck, there was an error and an assertion failure > > =# select * from pg_stat_progress_explain; > =# \watch 0.1 Yeah, that's to be expected. I think many corner cases may be found: hash table in the middle of filling, opened file descriptors, an incorrect combination of variables, 'not yet visited' subtrees - who knows what else? So, last time, I just ended up with the idea that using the explain code is a bit dangerous - in the middle of execution, it is enough to expose only basic data - rows, numbers and timings. It seems safe to gather. -- regards, Andrei Lepikhov
Thanks for this valuable testing. I think this is actually a really good idea for how to test something like this, because the regression tests contain lots of different queries that do lots of weird things. On Sun, Mar 30, 2025 at 8:23 PM torikoshia <torikoshia@oss.nttdata.com> wrote: > I haven't looked into the code yet, but when I ran below commands during > make installcheck, there was an error and an assertion failure > > =# select * from pg_stat_progress_explain; > =# \watch 0.1 > > ERROR: could not attach to dynamic shared area This seems like a race condition. Probably some backend's dsa_area went away between the time we got a pointer to it and the time we actually attached to it. We should be able to find some way of handling this without an error, like treating the case where the DSA area is missing the same as the case where there was no DSA pointer in the first place. However, this is also making me wonder if we shouldn't be using one DSA shared by all backends rather than a separate DSA area for every backend. That would require more care to avoid leaks, but I'm not sure that it's a good idea to be creating and destroying a DSA area for every single query. But I'm not 100% sure that's a problem. > TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File: > "ruleutils.c", Line: 8802, PID: 73180 > TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File: > "ruleutils.c", Line: 8802, PID: 73181 I wonder what is happening here. One systemic danger of the patch is that there can be a difference between what must be true at the *end* of a query and what must be true *during* a query. Anything that can't happen at the end but can happen in the middle is something that the patch will need to do something about in order to work properly. But I don't see how that can explain this failure, because AFAIR the patch just prints the same things that would have been printed by any other EXPLAIN, with the same stack of namespaces. It seems possible that this is a pre-existing bug: the regression tests might contain a query that would cause EXPLAIN to fail, but because the regression tests don't actually EXPLAIN that query, no failure occurs. But it could also be something else; for example, maybe this patch is trying to EXPLAIN something that couldn't be used with a regular EXPLAIN for some reason. Or maybe the patch doesn't actually succeed in doing the EXPLAIN with the correct namespace stack in all cases. -- Robert Haas EDB: http://www.enterprisedb.com
> I haven't looked into the code yet, but when I ran below commands during
> make installcheck, there was an error and an assertion failure
Found the problem: it was introduced in the latest patch after code refactoring,
which fails to properly check whether a QueryDesc is already being tracked. So every
subquery call in the same query allocates DSAs, and those segments are not
being properly cleared. So the patch is broken, which probably explains
your crashes.
Just made a 1 line fix and make installcheck looks clean now. Will
do more tests before sending another version.
Hello again,
> ERROR: could not attach to dynamic shared area
In addition to that refactoring issue, the current patch had a race
condition in pg_stat_progress_explain to access the DSA of a process
running a query that gets aborted.
While discussing with Robert we agreed that it would be wiser to take
a step back and change the strategy used to share progressive explain
data in shared memory.
Instead of using per backend's DSAs shared via a hash structure I now
define a dsa_pointer and a LWLock in each backend's PGPROC.
A global DSA is created by the first backend that attempts to use
the progressive explain feature. After the DSA is created, subsequent
uses of the feature will just allocate memory there and reference
via PGPROC's dsa_pointer.
This solves the race condition reported by Torikoshi and improves
concurrency performance as now we don't have a global LWLock
controlling shared memory access, but a per-backend LWLock.
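Schematically, the shared state now looks something like this (sketch only; the
field and variable names are illustrative, not the actual patch):
/* added to each backend's PGPROC */
dsa_pointer progressiveExplain;      /* serialized plan, or InvalidDsaPointer */
LWLock      progressiveExplainLock;  /* protects the pointer and the plan text */

/* reader side, e.g. pg_stat_progress_explain(), for some PGPROC *proc */
LWLockAcquire(&proc->progressiveExplainLock, LW_SHARED);
if (DsaPointerIsValid(proc->progressiveExplain))
{
    char *plan = dsa_get_address(progressive_explain_area,
                                 proc->progressiveExplain);

    /* ... copy the plan text into the function's result ... */
}
LWLockRelease(&proc->progressiveExplainLock);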
Performed the same tests done by Torikoshi and it looks like we are
good now. Even with more frequent inspects in pg_stat_progress_explain
(\watch 0.01).
Rafael.
Attachment
On 2025-04-01 15:23, Rafael Thofehrn Castro wrote: > Hello again, > >> ERROR: could not attach to dynamic shared area > > In addition to that refactoring issue, the current patch had a race > condition in pg_stat_progress_explain to access the DSA of a process > running a query that gets aborted. > > While discussing with Robert we agreed that it would be wiser to take > a step back and change the strategy used to share progressive explain > data in shared memory. > > Instead of using per backend's DSAs shared via a hash structure I now > define a dsa_pointer and a LWLock in each backend's PGPROC. > > A global DSA is created by the first backend that attempts to use > the progressive explain feature. After the DSA is created, subsequent > uses of the feature will just allocate memory there and reference > via PGPROC's dsa_pointer. > > This solves the race condition reported by Torikoshi and improves > concurrency performance as now we don't have a global LWLock > controlling shared memory access, but a per-backend LWLock. > > Performed the same tests done by Torikoshi and it looks like we are > good now. Even with more frequent inspects in pg_stat_progress_explain > (\watch 0.01). Thanks for updating the patch! Have you tested enabling progressive_explain? When I ran the 'make installcheck' test again setting progressive_explain to on, there was the same assertion failure: TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File: "ruleutils.c", Line: 8802, PID: 116832 postgres: parallel worker for PID 116822 (ExceptionalCondition+0x98)[0xb7311ea8bf80] postgres: parallel worker for PID 116822 (+0x89de6c)[0xb7311e9ede6c] postgres: parallel worker for PID 116822 (+0x89eb68)[0xb7311e9eeb68] postgres: parallel worker for PID 116822 (+0x89e78c)[0xb7311e9ee78c] postgres: parallel worker for PID 116822 (+0x8a1d10)[0xb7311e9f1d10] postgres: parallel worker for PID 116822 (+0x89ed80)[0xb7311e9eed80] postgres: parallel worker for PID 116822 (+0x89e78c)[0xb7311e9ee78c] postgres: parallel worker for PID 116822 (+0x89f174)[0xb7311e9ef174] postgres: parallel worker for PID 116822 (+0x89e78c)[0xb7311e9ee78c] postgres: parallel worker for PID 116822 (+0x89f0b8)[0xb7311e9ef0b8] postgres: parallel worker for PID 116822 (+0x8928dc)[0xb7311e9e28dc] postgres: parallel worker for PID 116822 (deparse_expression+0x34)[0xb7311e9e2834] postgres: parallel worker for PID 116822 (+0x347870)[0xb7311e497870] postgres: parallel worker for PID 116822 (+0x3478e4)[0xb7311e4978e4] postgres: parallel worker for PID 116822 (+0x347970)[0xb7311e497970] ... 
TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File: "ruleutils.c", Line: 8802, PID: 116831
[115650] LOG: 00000: background worker "parallel worker" (PID 116831) was terminated by signal 6: Aborted
[115650] DETAIL: Failed process was running: explain (analyze, costs off, summary off, timing off, buffers off) select count(*) from ab where (a = (select 1) or a = (select 3)) and b = 2
[115650] LOCATION: LogChildExit, postmaster.c:2846

We can reproduce it as follows:

show progressive_explain;
 progressive_explain
---------------------
 on

create table ab (a int not null, b int not null) partition by list (a);
create table ab_a2 partition of ab for values in(2) partition by list (b);
create table ab_a2_b1 partition of ab_a2 for values in (1);
create table ab_a2_b2 partition of ab_a2 for values in (2);
create table ab_a2_b3 partition of ab_a2 for values in (3);
create table ab_a1 partition of ab for values in(1) partition by list (b);
create table ab_a1_b1 partition of ab_a1 for values in (1);
create table ab_a1_b2 partition of ab_a1 for values in (2);
create table ab_a1_b3 partition of ab_a1 for values in (3);
create table ab_a3 partition of ab for values in(3) partition by list (b);
create table ab_a3_b1 partition of ab_a3 for values in (1);
create table ab_a3_b2 partition of ab_a3 for values in (2);
create table ab_a3_b3 partition of ab_a3 for values in (3);

set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
set min_parallel_table_scan_size = 0;
set max_parallel_workers_per_gather = 2;

explain (analyze, costs off, summary off, timing off, buffers off)
  select count(*) from ab where (a = (select 1) or a = (select 3)) and b = 2;

WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Note that there is no need to access pg_stat_progress_explain.
Could you please check if you can reproduce this?

--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
On Tue, Apr 1, 2025 at 9:38 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
> Could you please check if you can reproduce this?

I can, and I now see that this patch has a pretty big design problem.

The assertion failure occurs when a background worker tries to call
ruleutils.c's get_parameter(), which tries to find the expression from
which the parameter was computed. To do that, it essentially looks
upward in the plan tree until it finds the source of the parameter. For
example, if the parameter came from the bar_id column of relation foo,
we would then deparse the parameter as foo.bar_id, rather than as $1 or
similar. However, this doesn't work when the deparsing is attempted from
within a parallel worker, because the parallel worker only gets the
portion of the plan it's attempting to execute, not the whole thing. In
your test case, the whole plan looks like this:

 Aggregate
   InitPlan 1
     ->  Result
   InitPlan 2
     ->  Result
   ->  Gather
         Workers Planned: 2
         ->  Parallel Append
               ->  Parallel Seq Scan on ab_a1_b2 ab_1
                     Filter: ((b = 2) AND ((a = (InitPlan 1).col1) OR (a = (InitPlan 2).col1)))
               ->  Parallel Seq Scan on ab_a2_b2 ab_2
                     Filter: ((b = 2) AND ((a = (InitPlan 1).col1) OR (a = (InitPlan 2).col1)))
               ->  Parallel Seq Scan on ab_a3_b2 ab_3
                     Filter: ((b = 2) AND ((a = (InitPlan 1).col1) OR (a = (InitPlan 2).col1)))

Those references to (InitPlan 1).col1 are actually Params. I think
what's happening here is approximately that the worker tries to find the
source of those Params, but (InitPlan 1) is above the Gather node and
thus not available to the worker, and so the worker can't find it and
the assertion fails.

In one sense, it is not hard to fix this: the workers shouldn't really
be doing progressive_explain at all, because then we get a
progressive_explain from each process individually instead of one for
the query as a whole, so we could just think of having the workers
ignore the progressive_explain GUC. However, one thing I realized
earlier this morning is that the progressive EXPLAIN can't show any of
the instrumentation that is relayed back from workers to the leader only
at the end of the execution. See the code in ParallelQueryMain() just
after ExecutorFinish().

What I think this means is that this patch needs significant rethinking
to cope with parallel query. I don't think users are going to be happy
with a progressive EXPLAIN that just ignores what the workers are doing,
and I don't think they're going to be happy with N separate progressive
explains that they have to merge together to get an overall picture of
what the query is doing, and I'm absolutely positive that they're not
going to be happy with something that crashes. I think there may be a
way to come up with a good design here that avoids these problems, but
we definitely don't have time before feature freeze (not that we were
necessarily in a great place to think of committing this before feature
freeze anyway, but it definitely doesn't make sense now that I
understand this problem).

--
Robert Haas
EDB: http://www.enterprisedb.com
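(As an aside, the "have the workers ignore the GUC" idea above would
essentially be a one-line guard at the point where the progressive
explain machinery is started. A rough sketch; ProgressiveExplainBegin is
a hypothetical entry point, not the patch's actual function:)

#include "postgres.h"
#include "access/parallel.h"    /* IsParallelWorker() */
#include "executor/execdesc.h"

/* Hypothetical setup hook for progressive explain (name illustrative). */
void
ProgressiveExplainBegin(QueryDesc *queryDesc)
{
    /*
     * Skip the feature entirely in parallel workers: only the leader
     * publishes a plan, so ruleutils.c never has to deparse Params whose
     * source nodes sit above the Gather and were not shipped to the
     * worker.
     */
    if (IsParallelWorker())
        return;

    /* ... allocate DSA space and print the initial plan here ... */
}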
My bad, I mistakenly ran the tests without assertions enabled over the last
2 days, so I couldn't catch that assertion failure. I was able to reproduce
it, thanks.
I guess when the code was designed we were not expecting explains to be
run in parallel workers.
One comment: this has nothing to do with instrumentation, so the hacks
done with instrument objects are not to blame here.
What is interesting is that removing that Assert (not saying we should do
that) fixes it, and no other Asserts seem to complain anywhere. The
feature works as expected and there are no crashes.
> What I think this means is that this patch needs significant
> rethinking to cope with parallel query. I don't think users are going
> to be happy with a progressive EXPLAIN that just ignores what the
> workers are doing, and I don't think they're going to be happy with N
> separate progressive explains that they have to merge together to get
> an overall picture of what the query is doing, and I'm absolutely
> positive that they're not going to be happy with something that
> crashes. I think there may be a way to come up with a good design here
> that avoids these problems, but we definitely don't have time before
> feature freeze (not that we were necessarily in a great place to think
> of committing this before feature freeze anyway, but it definitely
> doesn't make sense now that I understand this problem).
Yeah, that is a fair point. But I actually envisioned this feature
targeting parallel workers as well, from the start. I see a lot of value
in being able to debug what each parallel worker is doing inside its
black box. It could be that a worker is lagging behind the others because
it is dealing with non-cached data blocks, for example.
According to my tests the parallel workers can actually push
instrumentation to the parent while still running. It all depends on the
types of operations being performed in the plan.
For example, if a parallel worker is doing a parallel seq scan, it will
continuously send chunks of rows to the parent, and instrumentation goes
along with the data. So we don't actually need to wait for the workers to
finish for instrumentation on the parent to be updated. Here is what I
got from Torikoshi's test after removing that Assert and enabling
instrumented progressive explain (progressive_explain_interval = '10ms'):
-[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------------------------------------------
datid | 5
datname | postgres
pid | 169555
last_update | 2025-04-01 12:44:07.36775-03
query_plan | Gather (cost=0.02..137998.03 rows=10000002 width=8) (actual time=0.715..868.619 rows=9232106.00 loops=1) (current) +
| Workers Planned: 2 +
| Workers Launched: 2 +
| InitPlan 1 +
| -> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1.00 loops=1) +
| InitPlan 2 +
| -> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.000..0.000 rows=1.00 loops=1) +
| -> Parallel Append (cost=0.00..137998.01 rows=4166669 width=8) (actual time=0.364..264.804 rows=1533011.00 loops=1) +
| -> Seq Scan on ab_a2_b1 ab_2 (cost=0.00..0.00 rows=1 width=8) (never executed) +
| Filter: ((b = 1) AND ((a = (InitPlan 1).col1) OR (a = (InitPlan 2).col1))) +
| -> Seq Scan on ab_a3_b1 ab_3 (cost=0.00..0.00 rows=1 width=8) (never executed) +
| Filter: ((b = 1) AND ((a = (InitPlan 1).col1) OR (a = (InitPlan 2).col1))) +
| -> Parallel Seq Scan on ab_a1_b1 ab_1 (cost=0.00..117164.67 rows=4166667 width=8) (actual time=0.361..192.896 rows=1533011.00 loops=1)+
| Filter: ((b = 1) AND ((a = (InitPlan 1).col1) OR (a = (InitPlan 2).col1))) +
|
-[ RECORD 2 ]-------------------------------------------------------------------------------------------------------------------------------------------------
datid | 5
datname | postgres
pid | 169596
last_update | 2025-04-01 12:44:07.361846-03
query_plan | Parallel Append (cost=0.00..137998.01 rows=4166669 width=8) (actual time=0.990..666.845 rows=3839515.00 loops=1) (current) +
| -> Seq Scan on ab_a2_b1 ab_2 (cost=0.00..0.00 rows=1 width=8) (never executed) +
| Filter: ((b = 1) AND ((a = $0) OR (a = $1))) +
| -> Seq Scan on ab_a3_b1 ab_3 (cost=0.00..0.00 rows=1 width=8) (never executed) +
| Filter: ((b = 1) AND ((a = $0) OR (a = $1))) +
| -> Parallel Seq Scan on ab_a1_b1 ab_1 (cost=0.00..117164.67 rows=4166667 width=8) (actual time=0.985..480.531 rows=3839515.00 loops=1) +
| Filter: ((b = 1) AND ((a = $0) OR (a = $1))) +
|
-[ RECORD 3 ]-------------------------------------------------------------------------------------------------------------------------------------------------
datid | 5
datname | postgres
pid | 169597
last_update | 2025-04-01 12:44:07.36181-03
query_plan | Parallel Append (cost=0.00..137998.01 rows=4166669 width=8) (actual time=1.003..669.398 rows=3830293.00 loops=1) (current) +
| -> Seq Scan on ab_a2_b1 ab_2 (cost=0.00..0.00 rows=1 width=8) (never executed) +
| Filter: ((b = 1) AND ((a = $0) OR (a = $1))) +
| -> Seq Scan on ab_a3_b1 ab_3 (cost=0.00..0.00 rows=1 width=8) (actual time=0.019..0.019 rows=0.00 loops=1) +
| Filter: ((b = 1) AND ((a = $0) OR (a = $1))) +
| -> Parallel Seq Scan on ab_a1_b1 ab_1 (cost=0.00..117164.67 rows=4166667 width=8) (actual time=0.979..482.420 rows=3830293.00 loops=1) +
| Filter: ((b = 1) AND ((a = $0) OR (a = $1))) +
|
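(The records above were captured from a separate session; a query along
these lines over the view columns shown, combined with psql's \watch, is
enough to sample them at that rate:)

\x on
SELECT datid, datname, pid, last_update, query_plan
  FROM pg_stat_progress_explain;
\watch 0.01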
But if the parallel worker is doing a hash join and the HASH node is a
huge block of sub-operations, then yes, upstream will not see anything
until the hash is built. That is why, IMO, having visibility into
background workers has value too.
I even designed a query visualizer that combines the plans of the parent
and the parallel workers in side-by-side windows, along with other stats
(per-process wait events, CPU, memory consumption, temp file generation),
allowing us to correlate them with the specific operations being done in
the plans.
But Robert is right, the engineering required to make this happen isn't
easy. Removing that Assert could open a Pandora's box of other issues, and
we have no time to do that for PG18.
Rafael.