Thread: Re: Proposal: Progressive explain

Re: Proposal: Progressive explain

From
Rafael Thofehrn Castro
Date:
Hello,

I was recently benchmarking the latest version of the patch and found room for
improvement when GUC progressive_explain is enabled globally.

Results with the last published version of the patch:


- progressive_explain = off:

/usr/local/pgsql/bin/pgbench -S -n -T 10 -c 30
tps = 18249.363540 (without initial connection time)

- progressive_explain = 'explain':

/usr/local/pgsql/bin/pgbench -S -n -T 10 -c 30
tps = 3536.635125 (without initial connection time)

This is because progressive explains are being printed for every query, including
the ones that finish instantly.

If we think about it, those printed plans for instant queries are useless as
other backends won't have a chance to look at the plans before they get removed
from pg_stat_progress_explain.

So this new version of the patch implements a new GUC, progressive_explain_min_duration,
used as a threshold before the plan is printed for the first time:

- progressive_explain_min_duration: min query duration until progressive
explain starts.
  - type: int
  - default: 1s
  - min: 0
  - context: user

Results with the new version:

- progressive_explain = off:

/usr/local/pgsql/bin/pgbench -S -n -T 10 -c 30
tps = 18871.800242 (without initial connection time)

- progressive_explain = 'explain' and progressive_explain_min_duration = '5s':

/usr/local/pgsql/bin/pgbench -S -n -T 10 -c 30
tps = 18896.266538 (without initial connection time)

Implementation of the new GUC progressive_explain_min_duration was done with
timeouts. The timeout callback function is used to initialize the progressive
explain.
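
Roughly, the plumbing looks like this (a simplified sketch; the identifiers here
are illustrative rather than the exact ones used in the patch):

#include "utils/timeout.h"

extern int  progressive_explain_min_duration;   /* the new GUC, in ms */

static TimeoutId pe_timeout_id;

/* timeout callback: in the patch this sets up the progressive explain */
static void
pe_timeout_handler(void)
{
    /* ... create the shared-memory entry, start printing the plan, etc ... */
}

void
pe_register_timeout(void)
{
    /* USER_TIMEOUT asks timeout.c to assign a free dynamic timeout slot */
    pe_timeout_id = RegisterTimeout(USER_TIMEOUT, pe_timeout_handler);
}

void
pe_arm_timeout(void)
{
    /* called at executor startup; fires once after the configured delay */
    enable_timeout_after(pe_timeout_id, progressive_explain_min_duration);
}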

There is a catch to this implementation. In thread https://www.postgresql.org/message-id/flat/d68c3ae31672664876b22d2dcbb526d2%40postgrespro.ru
where torikoshia proposes logging of query plans, concerns were raised about
logging plans in the CFI (CHECK_FOR_INTERRUPTS), a sensitive part of the code. So
torikoshia implemented a smart workaround consisting of adjusting the execProcNode
wrapper of all nodes so that the plan printing can be done there.
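
In code terms the wrapping is roughly the following (a sketch only; the field and
function names follow the patch where I know them, otherwise they are illustrative):

static TupleTableSlot *
ExecProcNodeExplain(PlanState *node)
{
    /* print/refresh the in-progress plan for this query */
    ProgressiveExplainPrint(node->state->query_desc);   /* hypothetical signature */

    /* unwrap: restore the saved pointer so later calls pay no extra cost */
    node->ExecProcNode = node->ExecProcNodeOriginal;
    return node->ExecProcNode(node);
}

static void
WrapExecProcNodeWithExplain(PlanState *ps)
{
    ps->ExecProcNodeOriginal = ps->ExecProcNode;
    ps->ExecProcNode = ExecProcNodeExplain;

    /* children need the same treatment; initPlan/member nodes as well */
    if (outerPlanState(ps))
        WrapExecProcNodeWithExplain(outerPlanState(ps));
    if (innerPlanState(ps))
        WrapExecProcNodeWithExplain(innerPlanState(ps));
}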

I'm not sure if this same concern applies to timeout callbacks so I also implemented
a second version of the latest patch that uses that execProcNode wrapper strategy.

The wrapper code was implemented by torikoshia (torikoshia@oss.nttdata.com), so
adding the credits here.

Rafael.
Attachment

Re: Proposal: Progressive explain

From
Rafael Thofehrn Castro
Date:
 
Did additional benchmarks and found issues with the patch that doesn't do execProcNode
wrapping. There are sporadic crashes with "double free or corruption (top)".

So making the patch that uses the wrapper the current one. Again, giving the credits to
torikoshia as being the owner of that section of the code.

Rafael.
Attachment

Re: Proposal: Progressive explain

From
"Euler Taveira"
Date:
On Fri, Mar 7, 2025, at 5:34 PM, Rafael Thofehrn Castro wrote:
> Did additional benchmarks and found issues with the patch that doesn't do execProcNode
> wrapping. There are sporadic crashes with "double free or corruption (top)".
>
> So making the patch that uses the wrapper the current one. Again, giving the credits to
> torikoshia as being the owner of that section of the code.


Rafael, thanks for working on it. It is a step forward in observability. I
started with some performance tests and the last improvements seem to fix the
overhead imposed in the initial patch version. I didn't notice any of these new
functions in the perf report while executing fast queries.

I found a crash. It is simple to reproduce.

Session A:

select * from pg_stat_progress_explain;
\watch 2

Session B:

explain select pg_sleep(30);
\watch 2

8<--------------------------------------------------------------------8<

Backtrace:

Core was generated by `postgres: euler postgres [lo'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  WrapExecProcNodeWithExplain (ps=0x7f7f7f7f7f7f7f7f) at explain.c:5401
5401 if (ps->ExecProcNodeOriginal != NULL)
#0  WrapExecProcNodeWithExplain (ps=0x7f7f7f7f7f7f7f7f) at explain.c:5401
#1  0x00005624173829aa in handle_sig_alarm (postgres_signal_arg=<optimized out>) at timeout.c:414
#2  0x00005624173ba02c in wrapper_handler (postgres_signal_arg=14) at pqsignal.c:110
#3  <signal handler called>
#4  0x00007f20fa529e63 in epoll_wait (epfd=6, events=0x56244ef37e58, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#5  0x00005624171fb02f in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffdd9e62080, cur_timeout=-1, set=0x56244ef37dd8) at waiteventset.c:1190
#6  WaitEventSetWait (set=0x56244ef37dd8, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffdd9e62080, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=100663296) at waiteventset.c:1138
#7  0x000056241709513c in secure_read (port=0x56244eeb90e0, ptr=0x56241775a9a0 <PqRecvBuffer>, len=8192) at be-secure.c:218
#8  0x000056241709bf2e in pq_recvbuf () at pqcomm.c:924
#9  0x000056241709ceb5 in pq_getbyte () at pqcomm.c:970
#10 0x000056241721b617 in SocketBackend (inBuf=0x7ffdd9e622a0) at postgres.c:361
#11 ReadCommand (inBuf=0x7ffdd9e622a0) at postgres.c:484
#12 PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4625
#13 0x00005624172167ed in BackendMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at backend_startup.c:107
#14 0x000056241717519b in postmaster_child_launch (child_type=<optimized out>, child_slot=2, startup_data=startup_data@entry=0x7ffdd9e6253c, startup_data_len=startup_data_len@entry=4, client_sock=client_sock@entry=0x7ffdd9e62540) at launch_backend.c:274
#15 0x0000562417178c32 in BackendStartup (client_sock=0x7ffdd9e62540) at postmaster.c:3519
#16 ServerLoop () at postmaster.c:1688
#17 0x000056241717a6da in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x56244eeb81b0) at postmaster.c:1386
#18 0x0000562416e64f9a in main (argc=1, argv=0x56244eeb81b0) at main.c:230

8<--------------------------------------------------------------------8<

You call this feature "progressive explain". My first impression is that it
will only provide the execution plans for EXPLAIN commands. Instead of
"progressive explain", I would suggest "query progress" that is a general
database terminology. It seems natural to use "progressive explain" since you
are using the explain infrastructure (including the same options -- format,
settings, wal, ...) -- to print the execution plan.

+CREATE VIEW pg_stat_progress_explain AS
+    SELECT
+            *
+    FROM pg_stat_progress_explain(true);
+

There is no use for the function argument. If you decide to keep this function,
remove the argument.

Why don't you use the pgstat_progress_XXX() API? Since you are using a
pg_stat_progress_XXX view name I would expect using the command progress
reporting infrastructure (see backend_progress.c).
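
For reference, a minimal sketch of that API; PROGRESS_COMMAND_EXPLAIN and the
parameter slot below are hypothetical, they don't exist today:

#include "utils/backend_progress.h"
#include "utils/timestamp.h"

#define PROGRESS_EXPLAIN_LAST_UPDATE    0   /* hypothetical parameter index */

static void
report_explain_progress(Oid relid)
{
    pgstat_progress_start_command(PROGRESS_COMMAND_EXPLAIN, relid);

    /* only fixed int64 slots are available, which may not fit a plan text */
    pgstat_progress_update_param(PROGRESS_EXPLAIN_LAST_UPDATE,
                                 GetCurrentTimestamp());
    pgstat_progress_end_command();
}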

Maybe you could include datid and datname as the other progress reporting
views. It would avoid a join to figure out what the database is.

+static const struct config_enum_entry explain_format_options[] = {
+   {"text", EXPLAIN_FORMAT_TEXT, false},
+   {"xml", EXPLAIN_FORMAT_XML, false},
+   {"json", EXPLAIN_FORMAT_JSON, false},
+   {"yaml", EXPLAIN_FORMAT_YAML, false},
+   {NULL, 0, false}
+};

Isn't it the same definition as in auto_explain.c? Use only one definition for
auto_explain and this feature. You can put this struct into explain.c, use an
extern definition for guc_tables.c and put an extern PGDLLIMPORT definition into
guc.h. See wal_level_options for an example.
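
In other words, something like this (sketch; the header placement is just a
suggestion):

/* in a header (explain.h or guc.h) -- declaration only */
extern PGDLLIMPORT const struct config_enum_entry explain_format_options[];

/* in explain.c -- the single shared definition */
const struct config_enum_entry explain_format_options[] = {
    {"text", EXPLAIN_FORMAT_TEXT, false},
    {"xml", EXPLAIN_FORMAT_XML, false},
    {"json", EXPLAIN_FORMAT_JSON, false},
    {"yaml", EXPLAIN_FORMAT_YAML, false},
    {NULL, 0, false}
};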

+static const struct config_enum_entry progressive_explain_options[] = {
+   {"off", PROGRESSIVE_EXPLAIN_NONE, false},
+   {"explain", PROGRESSIVE_EXPLAIN_EXPLAIN, false},
+   {"analyze", PROGRESSIVE_EXPLAIN_ANALYZE, false},
+   {"false", PROGRESSIVE_EXPLAIN_NONE, true},
+   {NULL, 0, false}
+};

The "analyze" is a separate option in auto_explain. Should we have 2 options?
One that enable/disable this feature and another one that enable/disable
analyze option.

Don't the other EXPLAIN options make sense here? Like serialize and summary.

    TupleDesc   tupDesc;        /* descriptor for result tuples */
    EState     *estate;         /* executor's query-wide state */
    PlanState  *planstate;      /* tree of per-plan-node state */
+   struct ExplainState *pe_es; /* progressive explain state if enabled */

Should you use the same name pattern here? pestate, for example.

PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, ExplainHash)

Could you use a specific name? Even if you keep the proposed name, you should
use ProgressiveExplainHash, ProgressiveExplain or QueryProgress.

+$node->init;
+# Configure progressive explain to be logged immediatelly
+$node->append_conf('postgresql.conf', 'progressive_explain_min_duration = 0');
+$node->start;

s/immediatelly/immediately/

+typedef struct progressiveExplainHashKey
+{
+   int         pid;            /* PID */
+} progressiveExplainHashKey;
+    
+typedef struct progressiveExplainHashEntry
+{
+   progressiveExplainHashKey key;  /* hash key of entry - MUST BE FIRST */
+   dsa_handle  h;
+   dsa_pointer p;
+} progressiveExplainHashEntry;

You don't need progressiveExplainHashKey. Use pid as key directly.
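
Something like this (sketch), sizing the key accordingly when the table is created:

typedef struct progressiveExplainHashEntry
{
    int         pid;            /* hash key - MUST BE FIRST */
    dsa_handle  h;
    dsa_pointer p;
} progressiveExplainHashEntry;

/* ... and in the shmem initialization function ... */
info.keysize = sizeof(int);
info.entrysize = sizeof(progressiveExplainHashEntry);
progressiveExplainHash = ShmemInitHash("progressive explain hash",
                                       50, 50,
                                       &info,
                                       HASH_ELEM | HASH_BLOBS);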

+           /* Update shared memory with new data */
+           strcpy(pe_data->plan, es->str->data);
+           pe_data->last_print = GetCurrentTimestamp();

I don't think last_print is accurate because it is not the time the execution plan
is printed but the time it was updated. I suggest last_updated_time.

+/* Flag set by timeouts to control when to print progressive explains */
+bool       ProgressiveExplainPending = false;

s/print/update/

There are other points where you use "print", but it is better to replace it with
"update".

+   progressiveExplainArray = ShmemInitHash("progressive explain hash",
+                                           50, 50,
+                                           &info,
+                                           HASH_ELEM | HASH_BLOBS);

I'm wondering why you use "array" in the name. ProgressiveExplainHash is a
better name.

+               entry->p = dsa_allocate(es->pe_a,
+                                       add_size(sizeof(progressiveExplainData),
+                                                add_size(strlen(es->str->data),
+                                                         PROGRESSIVE_EXPLAIN_ALLOC_SIZE)));

I think you need a better name for PROGRESSIVE_EXPLAIN_ALLOC_SIZE because it
doesn't reflect what it is. PROGRESSIVE_EXPLAIN_FREE_SIZE or
PROGRESSIVE_EXPLAIN_AVAIL_SIZE?

Maybe you can use dshash.

There are some comments that still refer to the wrong function name.

+/*
+ * ExecProcNodeWithExplain
+ * ExecProcNode wrapper that initializes progressive explain
+ * and uwraps ExecProdNode to the original function.
+ */
+static TupleTableSlot *
+ExecProcNodeExplain(PlanState *node)

and

+/*
+ * ExecProcNodeWithExplain
+ * Responsible for initialization of all structures related to progressive
+ * explains.
+ *

+   /* state related to progressive explains */
+   struct PlanState *pe_curr_node; 
+   struct Instrumentation *pe_local_instr;
+   dsa_area   *pe_a;

Could you add some comments saying what each of these variables are for?

I didn't experiment but I was wondering if there is a way to avoid the
duplicates that you added to avoid the overhead.


--
Euler Taveira

Re: Proposal: Progressive explain

From
Rafael Thofehrn Castro
Date:
Thanks for the valuable inputs Euler. Adjusted most of your recommendations.

> I found a crash. It is simple to reproduce.

Indeed, I failed to test plain EXPLAIN after the addition of the new GUC
progressive_explain_min_duration. This is fixed.

> You call this feature "progressive explain". My first impression is that it
> will only provide the execution plans for EXPLAIN commands. Instead of
> "progressive explain", I would suggest "query progress" that is a general
> database terminology. It seems natural to use "progressive explain" since you
> are using the explain infrastructure (including the same options -- format,
> settings, wal, ...) -- to print the execution plan.

Makes sense. Kept progressive explain for now but this is still open for
discussion.

> There is no use for the function argument. If you decide to keep this function,
> remove the argument.

Done.

> Why don't you use the pgstat_progress_XXX() API? Since you are using a
> pg_stat_progress_XXX view name I would expect using the command progress
> reporting infrastructure (see backend_progress.c).

I haven't changed that part as of now. My implementation and underlying data
structure may not work well with that API, but I am investigating.

> Maybe you could include datid and datname as the other progress reporting
> views. It would avoid a join to figure out what the database is.

Done.

> Isn't it the same definition as in auto_explain.c? Use only one definition for
> auto_explain and this feature. You can put this struct into explain.c, use an
> extern definition for guc_tables.c and put an extern PGDLLIMPORT definition into
> guc.h. See wal_level_options for an example.

Done.

> The "analyze" is a separate option in auto_explain. Should we have 2 options?
> One that enables/disables this feature and another one that enables/disables the
> analyze option.

Tomas Vondra proposed the current logic and I think it makes sense. Having a
single GUC to control the execution behavior keeps the feature simpler IMO.

> Don't the other EXPLAIN options make sense here? Like serialize and summary.

I added a missing GUC for option COSTS (progressive_explain_costs). Adding
the other ones doesn't make much sense IMO. SUMMARY, SERIALIZE and MEMORY
are information added at the end of query execution (or plan creation, for plain
EXPLAIN) in the summary section, but at that point the progressive explain will
already be finished, with no more information in pg_stat_progress_explain.


> TupleDesc   tupDesc;        /* descriptor for result tuples */
> EState     *estate;         /* executor's query-wide state */
> PlanState  *planstate;      /* tree of per-plan-node state */
> +   struct ExplainState *pe_es; /* progressive explain state if enabled */

> Should you use the same name pattern here? pestate, for example.

Done.

> PG_LWLOCK(52, SerialControl)
> +PG_LWLOCK(53, ExplainHash)

> Could you use a specific name? Even if you keep the proposed name, you should
> use ProgressiveExplainHash, ProgressiveExplain or QueryProgress.

Changed to ProgressiveExplainHash.

> You don't need progressiveExplainHashKey. Use pid as key directly.

Done.

> I don't think last_print is accurate because it is not the time the execution plan
> is printed but the time it was updated. I suggest last_updated_time.

Changed from last_print to last_update. This is still open for discussion.

> I'm wondering why you use "array" in the name. ProgressiveExplainHash is a
> better name.

I used "array" to be consistent with ProcArray (which is also a hash). But what you
proposed is better indeed. Changed.

> I think you need a better name for PROGRESSIVE_EXPLAIN_ALLOC_SIZE because it
> doesn't reflect what it is. PROGRESSIVE_EXPLAIN_FREE_SIZE or
> PROGRESSIVE_EXPLAIN_AVAIL_SIZE?

Changed to PROGRESSIVE_EXPLAIN_FREE_SIZE.

Fixed the wrong function names in the comments and changed the format of those
comments in function headers to be compatible with other functions in explain.c.

> +   /* state related to progressive explains */
> +   struct PlanState *pe_curr_node;
> +   struct Instrumentation *pe_local_instr;
> +   dsa_area   *pe_a;

> Could you add some comments saying what each of these variables are for?

Done.

> I didn't experiment but I was wondering if there is a way to avoid the
> duplicates that you added to avoid the overhead.

You mean the local instrumentation object reused for each node?

Rafael.
Attachment

Re: Proposal: Progressive explain

From
Rafael Thofehrn Castro
Date:
Sending a new version as rebase was required.

Rafael.
Attachment

Re: Proposal: Progressive explain

From
Robert Haas
Date:
On Wed, Mar 19, 2025 at 1:47 PM Rafael Thofehrn Castro
<rafaelthca@gmail.com> wrote:
> Sending a new version as rebase was required.

Reading this thread, it seems to me that there has been a good deal of
discussion about things like the name of the feature and what the UI
ought to be, but not much discussion of whether the feature is
actually safe, and not much detailed review of the code. I'm willing
to bet that everybody wants some version of this feature if we can
convince ourselves that it's not going to do horrible things like
cause server crashes, and that even people who don't get their first
choice in terms of how the feature is named or how the GUCs work will
still be pretty happy to have it overall. However, if it breaks stuff
and there's no easy way to fix the breakage, that's going to be a
problem.

In broad strokes, the danger here is that doing stuff in the middle of
query execution that we currently only do at the end of query
execution will turn out to be problematic. The biggest problem, I
think, is whether it's safe to do all of the things that EXPLAIN does
while we're at some random point in query execution. It looks to me
like the wrap/unwrap stuff is more-or-less consistent with previous
discussions of how a feature of this kind should work, though I don't
recall the previous discussion and I think the patch should contain
some comments about why it works the way that it works. I do notice
that WrapExecProcNodeWithExplain does not walk the ps->initPlan list,
which I think is an oversight.

Without having the prior discussion near to hand, I *think* that the
reason we wanted to do this wrap/unwrap stuff is to make it so that
the progressive EXPLAIN code could only execute when entering a new
plan node rather than at any random point inside of that plan node,
and that does seem a lot safer than the alternative. For example, I
think it means that we won't start trying to do progressive EXPLAIN
while already holding some random LWLock, which is very good. However,
this still means we could do a bunch of catalog access (and thus
potentially receive invalidations) at places where that can't happen
today. I'm not sure that's a problem -- the same thing could happen at
any place in the executor where we evaluate a user-supplied
expression, and there are many such places -- but it's certainly worth
a few senior people thinking real carefully about it and trying to
imagine whether there's any scenario in which it might break something
that works today.

One way in which this proposal seems safer than previous proposals is
that previous proposals have involved session A poking session B and
trying to get session B to emit an EXPLAIN on the fly with no prior
setup. That would be very useful, but I think it's more difficult and
more risky than this proposal, where all the configuration happens in
the session that is going to emit the EXPLAIN output. It knows from
the beginning that it's going to maybe be doing this, and so it can do
whatever setup it likes to accomplish that goal. So I think this
design looks pretty good from that point of view.

I don't understand how this would be safe against interrupts or
errors. If a running query is interrupted, what would cause
ProgressiveExplainCleanup() to be called? If the answer is "nothing,"
why isn't that unsafe?

ExecProcNodeOriginal looks like it's basically the same thing as
ExecProcNodeReal, except that it's for a different purpose. But you
would be hard-pressed to guess which one is used for what purpose
based on the field names or the comments. Maybe it's also worth
worrying about whether this is a scalable design. Can we find a way to
use the existing fields here instead of having to add a new one?

The documentation for the progressive_explain = { off | explain |
analyze } option seems like it should go into more detail about how
the "explain" and "analyze" values are different. I'm not 100% sure I
know the answer, and I'm not the least-experienced person who will
ever read this documentation.

WrapMultiExecProcNodesWithExplain seems like a poor choice of name. It
invites confusion with MultiExecProcNode, to which it is unrelated.

I just went to some trouble to start breaking up the monolith that is
explain.c, so I'm not that happy about seeing this patch dump another
800+ lines of source code into that file. Probably we should have a
new source file for some of this, or maybe even more than one.

The changes to explain.h add three new data types. Two of them begin
with an uppercase letter and one with a lowercase letter. That seems
inconsistent. I also don't think that the declaration of char plan[]
is per PostgreSQL coding style. I believe we always write char
plan[FLEXIBLE_ARRAY_MEMBER]. Also, maybe it should be named something
other than plan, since it's really a string-ified explain-y kind of
thing, not literally a Plan. Also, can we please not have structure
members with single letter names? "h" and "p" are going to be
completely ungreppable, and I like grepping.

It looks very strange to me that ProgressiveExplainPrint() seems to
have a theoretical possibility of generating the output and then
throwing it away if we end up with entry == NULL. I guess maybe that
case is not supposed to happen because ProgressiveExplainInit() is
supposed to create the entry, but then why isn't this an elog(ERROR)
or something instead of a no-op?

It seems like when we replace a longer entry with a shorter one, we
forget that it was originally longer. Hence, if the length of a
progressive EXPLAIN is alternately 2922 characters and 2923
characters, we'll reallocate on every other progressive EXPLAIN
instead of at most once.

I'll try to look at this some more tomorrow. It seems like a very
interesting patch, but time is very short for this release and it
doesn't look to me like we have all the kinks sorted out here just
yet.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Proposal: Progressive explain

From
Rafael Thofehrn Castro
Date:
Hi Robert,

Thanks for sparing part of your precious time to look at the patch.
I acknowledge it is a very complex one. Since you're going to take
another look, providing some preliminary comments related to some
of the implementation concerns.

> I don't understand how this would be safe against interrupts or
> errors. If a running query is interrupted, what would cause
> ProgressiveExplainCleanup() to be called? If the answer is "nothing,"
> why isn't that unsafe?

The strategy I used here is to use a MemoryContextCallback
(ProgressiveExplainReleaseFunc), configured in the memory context
where the query is being executed, which is responsible for calling
ProgressiveExplainCleanup() if the query doesn't end gracefully.
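
The registration is essentially the following (a sketch; ProgressiveExplainReleaseFunc
is the real callback name, the rest is illustrative):

static void
ProgressiveExplainReleaseFunc(void *arg)
{
    /* runs when the query's memory context is reset or deleted,
     * which also happens during error cleanup */
    ProgressiveExplainCleanup(NULL);
}

static void
pe_register_cleanup(QueryDesc *queryDesc)
{
    MemoryContextCallback *cb;

    /* allocate the callback node in the context it watches */
    cb = MemoryContextAlloc(queryDesc->estate->es_query_cxt, sizeof(*cb));
    cb->func = ProgressiveExplainReleaseFunc;
    cb->arg = NULL;
    MemoryContextRegisterResetCallback(queryDesc->estate->es_query_cxt, cb);
}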

> It looks very strange to me that ProgressiveExplainPrint() seems to
> have a theoretical possibility of generating the output and then
> throwing it away if we end up with entry == NULL. I guess maybe that
> case is not supposed to happen because ProgressiveExplainInit() is
> supposed to create the entry, but then why isn't this an elog(ERROR)
> or something instead of a no-op?

Agreed. Will fix this.

> It seems like when we replace a longer entry with a shorter one, we
> forget that it was originally longer. Hence, if the length of a
> progressive EXPLAIN is alternately 2922 characters and 2923
> characters, we'll reallocate on every other progressive EXPLAIN
> instead of at most once.

Are you talking about re-printing the plan in the same query execution?
The logic for the code, using your example, would be to allocate 2922 +
PROGRESSIVE_EXPLAIN_FREE_SIZE (4096, currently) initially. If next plans
alternate between 2922 and 2923 no additional allocation will be done.
A reallocation will be needed only if the plan length ends up exceeding
2922+4096. At the end of query execution (or cancellation) that DSA will
be freed and a next query execution will have to allocate again using the
same logic.

Regarding the execProcNode wrapper strategy. I used it precisely because
of the discussion in that other patch. I actually tried not using it here,
and called ProgressiveExplainPrint() in the timeout callback. This resulted
in sporadic crashes, confirming the suspicion that it is not a good
idea.

Regarding all other comments related to variable/function names and having
the feature in a separate file, I agree with them all. Will send a
new version with the fixes.

Rafael.

Re: Proposal: Progressive explain

From
Robert Haas
Date:
On Wed, Mar 19, 2025 at 6:53 PM Rafael Thofehrn Castro
<rafaelthca@gmail.com> wrote:
> The strategy I used here is to use a MemoryContextCallback
> (ProgressiveExplainReleaseFunc), configured in the memory context
> where the query is being executed, being responsible for calling
> ProgressiveExplainCleanup() if the query doesn't end gracefully.

Thanks for the pointer. I'm a bit skeptical about what's going on here
in ProgressiveExplainReleaseFunc(). It seems like we're depending on
shared memory to tell us whether we need to do purely backend-local
cleanup, like calling disable_timeout() and resetting
ProgressiveExplainPending and activeQueryDesc. I would have expected
us to keep track in local memory of whether this kind of work needs to
be done. It seems roundabout to take an LWLock, do a hash table lookup
to see if there's an entry there, release the LWLock, and then very
shortly thereafter take the lock a second time to release the entry
that we now know is there.

The comment in ProgressiveExplainCleanup about only detaching from the
DSA if the query ended gracefully is not ideal from my point of view
because it says what the code does instead of why the code does that
thing. Also, the function is seemingly called with queryDesc as an
argument not because you need it for anything but because you're going
to test whether it's null. In that case, you could just pass a
Boolean. Even then, something seems odd about this: why do we have to
be holding ProgressiveExplainHashLock to dsa_detach the
somewhat-inscrutably named area pe_a? And why are we not detaching it
in case of error?

I am wondering why you chose this relatively unusual error cleanup
strategy. What I would have done is invent AtEOXact_ProgressiveExplain
and AtSubEOXact_ProgressiveExplain. In some sense this looks simpler,
because it doesn't need separate handling for transactions and
subtransactions, but it's so unlike what we do in most places that
it's hard for me to tell whether it's correct. I feel like the
approach you've chosen here would make sense if what we wanted to do
was basically release memory or some memory-like resource associated
closely with the context -- e.g. expandedrecord.c releases a
TupleDesc, but this is doing more than that.

I think the effect of this choice is that cleanup of the
progressive-EXPLAIN stuff happens much later than it normally would.
Most of the time, in the case of an abort, we would want
AbortTransaction() to release as many resources as possible, leaving
basically no work to do at CleanupTransaction() time. This is so that
if a user types "SELECT 1/0;" we release resources, as far as
possible, right away, and don't wait for them to subsequently type
"ROLLBACK;". The transaction lives on only as a shell. But these
resources, if I'm reading this right, are going to stick around until
the transaction is actually rolled back, because memory is not freed
until CleanupTransaction() time. I wonder what happens if a query
inside of an explicit transaction aborts after putting an entry in the
progressive-explain view. My guess is that the entry will stick around
until the actual rollback happens.

In fact, now that I think about it, this is probably why we don't
dsa_detach() in ProgressiveExplainCleanup() in cases of error -- the
resource owner cleanup will have already released the DSA segments
long before the memory context is deleted.

I'm sort of inclined to say that this should be rewritten to do error
cleanup in a more normal way. It's probably more code to do it that
way, but I think having it be more similar to other subsystems is
probably worth quite a bit.

> > It seems like when we replace a longer entry with a shorter one, we
> > forget that it was originally longer. Hence, if the length of a
> > progressive EXPLAIN is alternately 2922 characters and 2923
> > characters, we'll reallocate on every other progressive EXPLAIN
> > instead of at most once.
>
> Are you talking about re-printing the plan in the same query execution?

Yes.

> The logic for the code, using your example, would be to allocate 2922 +
> PROGRESSIVE_EXPLAIN_FREE_SIZE (4096, currently) initially. If next plans
> alternate between 2922 and 2923 no additional allocation will be done.
> A reallocation will be needed only if the plan length ends up exceeding
> 2922+4096. At the end of query execution (or cancellation) that DSA will
> be freed and a next query execution will have to allocate again using the
> same logic.

It seems to me that if ProgressiveExplainPrint() reaches /* Update
shared memory with new data */ without reallocating,
strlen(pe_data->plan) can be reduced. On the next trip through the
function, we don't know whether the string we're seeing is the
original string -- for which strlen()+PROGRESSIVE_EXPLAIN_FREE_SIZE)
gives us the original allocation size -- or whether the string we're
seeing is a shorter one that was copied over the original, longer
string. PROGRESSIVE_EXPLAIN_FREE_SIZE is big enough that this probably
isn't much of a problem in practice, because consecutive EXPLAIN
outputs for the same query probably won't vary in size by enough to
cause any reallocation. Looking at this more carefully, I think that
the query plan would have to shrink in size by >4kB and then expand
again in order to trigger reallocation, which seems unlikely. But it
still seems unclean to me. Normally, we track how much space we have
actually allocated explicitly, instead of reconstructing that number
from something else, especially something that isn't guaranteed to
produce an accurate result. I think you should just store the number
of available payload bytes at the beginning of the chunk and then
reallocate if it isn't big enough to hold the payload that you have
got.
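
Something along these lines, in other words (a sketch, with made-up names):

typedef struct ProgressiveExplainData
{
    Size        plan_alloc_len;     /* payload bytes available in plan[] */
    TimestampTz last_update;
    char        plan[FLEXIBLE_ARRAY_MEMBER];
} ProgressiveExplainData;

static void
pe_store_plan(ProgressiveExplainData *pe_data, const char *plan_text)
{
    if (strlen(plan_text) + 1 > pe_data->plan_alloc_len)
    {
        /* dsa_free() the old chunk, dsa_allocate() a larger one, and
         * record the new capacity in plan_alloc_len (omitted here) */
    }
    strcpy(pe_data->plan, plan_text);
    pe_data->last_update = GetCurrentTimestamp();
}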

> Regarding the execProcNode wrapper strategy. I used it precisely because
> of the discussion in that other patch. I actually tried not using it here,
> and called ProgressiveExplainPrint() in the timeout callback. This resulted
> in sporadic crashes, confirming the suspicion that it is not a good
> idea.

Makes sense, but we need adequate documentation of what we did and why
it works (or at least why we think it works).

Another thing I just noticed is that pg_stat_progress_explain() uses
beentry->st_procpid == ped->pid as the permissions check, but a more
typical test is HAS_PGSTAT_PERMISSIONS(beentry->st_userid). I know
that's only in pgstatfuncs.c, but I think it would be OK to duplicate
the associated test here (i.e. has_privs_of_role(GetUserId(),
ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(), role)). I
don't really see a reason why this shouldn't use the same permission
rules as other pg_stat_ things, in particular pg_stat_get_activity().

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Proposal: Progressive explain

From
Rafael Thofehrn Castro
Date:
Hello Robert,

Fixed most of the recommendations. Going over one at a time.

> The documentation for the progressive_explain = { off | explain |
> analyze } option seems like it should go into more detail about how
> the "explain" and "analyze" values are different. I'm not 100% sure I
> know the answer, and I'm not the least-experienced person who will
> ever read this documentation.

Added details about behavior of each option. In the doc for that GUC
there is a link to another section that explains in detail how progressive
explains work.

> WrapMultiExecProcNodesWithExplain seems like a poor choice of name. It
> invites confusion with MultiExecProcNode, to which it is unrelated.

Renamed that function (and the unwrap equivalent) to WrapMemberNodesExecProcNodesWithExplain
as MemberNodes is the name used by explain when parsing those types of nodes.

> I do notice that WrapExecProcNodeWithExplain does not walk 
> the ps->initPlan list, which I think is an oversight.

Fixed.

> I just went to some trouble to start breaking up the monolith that is
> explain.c, so I'm not that happy about seeing this patch dump another
> 800+ lines of source code into that file. Probably we should have a
> new source file for some or this, or maybe even more than one.

The whole progressive explain code was moved to a new set of files
explain_progressive.h and explain_progressive.c.

> The changes to explain.h add three new data types. Two of them begin
> with an uppercase letter and one with a lowercase letter. That seems
> inconsistent. I also don't think that the declaration of char plan[]
> is per PostgreSQL coding style. I believe we always write char
> plan[FLEXIBLE_ARRAY_MEMBER]. Also, maybe it should be named something
> other than plan, since it's really a string-ified explain-y kind of
> thing, not literally a Plan. Also, can we please not have structure
> members with single letter names? "h" and "p" are going to be
> completely ungreppable, and I like grepping

Done. Adjusted all data types to be uppercase. Added FLEXIBLE_ARRAY_MEMBER
and renamed "h" and "p".

> It looks very strange to me that ProgressiveExplainPrint() seems to
> have a theoretical possibility of generating the output and then
> throwing it away if we end up with entry == NULL. I guess maybe that
> case is not supposed to happen because ProgressiveExplainInit() is
> supposed to create the entry, but then why isn't this an elog(ERROR)
> or something instead of a no-op?

Changed to throw ERROR, which shouldn't happen.

> It seems like when we replace a longer entry with a shorter one, we
> forget that it was originally longer. Hence, if the length of a
> progressive EXPLAIN is alternately 2922 characters and 2923
> characters, we'll reallocate on every other progressive EXPLAIN
> instead of at most once.

Adjusted this logic. Structure ProgressiveExplainHashEntry now contains
a field to store the allocated size, which is used to compare with
the new plan being printed.

> Thanks for the pointer. I'm a bit skeptical about what's going on here
> in ProgressiveExplainReleaseFunc(). It seems like we're depending on
> shared memory to tell us whether we need to do purely backend-local
> cleanup, like calling disable_timeout() and resetting
> ProgressiveExplainPending and activeQueryDesc. I would have expected
> us to keep track in local memory of whether this kind of work needs to
> be done. It seems roundabout to take an LWLock, do a hash table lookup
> to see if there's an entry there, release the LWLock, and then very
> shortly thereafter take the lock a second time to release the entry
> that we now know is there.

> The comment in ProgressiveExplainCleanup about only detaching from the
> DSA if the query ended gracefully is not ideal from my point of view
> because it says what the code does instead of why the code does that
> thing. Also, the function is seemingly called with queryDesc as an
> argument not because you need it for anything but because you're going
> to test whether it's null. In that case, you could just pass a
> Boolean. Even then, something seems odd about this: why do we have to
> be holding ProgressiveExplainHashLock to dsa_detach the
> somewhat-inscrutably named area pe_a? And why are we not detaching it
> in case of error?

> I am wondering why you chose this relatively unusual error cleanup
> strategy. What I would have done is invent AtEOXact_ProgressiveExplain
> and AtSubEOXact_ProgressiveExplain. In some sense this looks simpler,
> because it doesn't need separate handling for transactions and
> subtransactions, but it's so unlike what we do in most places that
> it's hard for me to tell whether it's correct. I feel like the
> approach you've chosen here would make sense if what we wanted to do
> was basically release memory or some memory-like resource associated
> closely with the context -- e.g. expandedrecord.c releases a
> TupleDesc, but this is doing more than that.

> I think the effect of this choice is that cleanup of the
> progressive-EXPLAIN stuff happens much later than it normally would.
> Most of the time, in the case of an abort, we would want
> AbortTransaction() to release as many resources as possible, leaving
> basically no work to do at CleanupTransaction() time. This is so that
> if a user types "SELECT 1/0;" we release resources, as far as
> possible, right away, and don't wait for them to subsequently type
> "ROLLBACK;". The transaction lives on only as a shell. But these
> resources, if I'm reading this right, are going to stick around until
> the transaction is actually rolled back, because memory is not freed
> until CleanupTransaction() time. I wonder what happens if a query
> inside of an explicit transaction aborts after putting an entry in the
> progressive-explain view. My guess is that the entry will stick around
> until the actual rollback happens.

> In fact, now that I think about it, this is probably why we don't
> dsa_detach() in ProgressiveExplainCleanup() in cases of error -- the
> resource owner cleanup will have already released the DSA segments
> long before the memory context is deleted.

> I'm sort of inclined to say that this should be rewritten to do error
> cleanup in a more normal way. It's probably more code to do it that
> way, but I think having it be more similar to other subsystems is
> probably worth quite a bit.

Right. I use dsa_detach() only when the query finishes gracefully. This
is needed otherwise PG will complain with:

WARNING:  resource was not closed: dynamic shared memory segment 32349774

That WARNING doesn't appear in case of error and according to my tests
shared memory is correctly being released.

I completely removed the MemoryContextCallback strategy and switched to
using the new function AtEOXact_ProgressiveExplain() called from AbortTransaction().

This first version of the progressive explain feature was designed to only keep
track of the initial query called by the backend, ignoring all subquery calls. So
I believe we don't need to worry about having to add custom logic in
AbortSubTransaction(). In case the query errors out, AbortTransaction() will be called
and everything related to progressive explains will be cleaned up.

> Another thing I just noticed is that pg_stat_progress_explain() uses
> beentry->st_procpid == ped->pid as the permissions check, but a more
> typical test is HAS_PGSTAT_PERMISSIONS(beentry->st_userid). I know
> that's only in pgstatfuncs.c, but I think it would be OK to duplicate
> the associated test here (i.e. has_privs_of_role(GetUserId(),
> ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(), role)). I
> don't really see a reason why this shouldn't use the same permission
> rules as other pg_stat_ things, in particular pg_stat_get_activity().

Adjusted the logic to use has_privs_of_role(GetUserId(),
ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(), role) when
checking the privileges.

Rafael.
Attachment

Re: Proposal: Progressive explain

From
Rafael Thofehrn Castro
Date:
Sending a new version that includes the new explain_progressive.c file in meson.build.

Rafael.
Attachment

Re: Proposal: Progressive explain

From
Robert Haas
Date:
On Tue, Mar 25, 2025 at 8:52 PM Rafael Thofehrn Castro
<rafaelthca@gmail.com> wrote:
> This first version of the progressive explain feature was designed to only keep
> track of the initial query called by the backend, ignoring all subquery calls. So
> I believe we don't need to worry about having to add custom logic in
> AbortSubTransaction(). In case the query errors out, AbortTransaction() will be called
> and everything related to progressive explains will be cleaned up.

Suppose:

BEGIN;
SELECT 1;
SAVEPOINT bob;
progressively explain something that aborts

I think in this case we will call AbortSubTransaction(), not AbortTransaction().

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Proposal: Progressive explain

From
Rafael Thofehrn Castro
Date:
> Suppose:
> BEGIN;
> SELECT 1;
> SAVEPOINT bob;
> progressively explain something that aborts
> I think in this case we will call AbortSubTransaction(), not AbortTransaction().

Indeed. We need special treatment for subtransactions. There are 2 scenarios where
AbortSubTransaction() will be called alone and can affect progressive explains:
savepoints and PL/pgSQL exception blocks.

We don't want subxact aborts in PL/pgSQL exception blocks to perform cleanup
as the original query will continue to run and progressive explain tracking
is still applicable there.

So implementation was done based on transaction nested level. Cleanup is only
performed when AbortSubTransaction() is called in the same transaction nested
level as the one where the query is running. This covers both PL/pgSQL exception
blocks and savepoints.

Rafael.
Attachment

Re: Proposal: Progressive explain

From
Robert Haas
Date:
On Wed, Mar 26, 2025 at 5:58 PM Rafael Thofehrn Castro
<rafaelthca@gmail.com> wrote:
> So implementation was done based on transaction nested level. Cleanup is only
> performed when AbortSubTransaction() is called in the same transaction nested
> level as the one where the query is running. This covers both PL/pgSQL exception
> blocks and savepoints.

Right. I think this is much closer to being correct. However, I
actually think it should look more like this:

void
AtEOSubXact_ProgressiveExplain(bool isCommit, int nestDepth)
{
    if (activeQueryDesc != NULL &&
        activeQueryXactNestLevel >= nestDepth)
    {
        if (isCommit)
            elog(WARNING, "leaked progressive explain query descriptor");
        ProgressiveExplainCleanup(NULL);
    }
}

By including the is-commit case in there, we can catch any bugs where
things aren't cleaned up properly before a transaction is committed.
We generally want to test >= nestDepth instead of == nestDepth in case
multiple subtransaction levels abort all at once; I'm not sure it
matters here, but even if it isn't, it's best to be consistent with
the practice elsewhere. Having {Commit,Abort}SubTransaction pass the
nestDepth instead of calling GetCurrentTransactionNestLevel() also has
precedent e.g. see AtEOSubXact_HashTables.

As a further refinement, consider initializing
activeQueryXactNestLevel to -1 and resetting it to that value each
time you end a progressive EXPLAIN, so that activeQueryDesc != NULL if
and only if activeQueryXactNestLevel >= 0. Then the test above can be
simplified to just if (activeQueryXactNestLevel >= nestDepth).

The comment for ProgressiveExplainIsActive() is a copy-paste from
another function that you forgot to update. Also, the function body
could be reduced to a one-liner: return queryDesc == activeQueryDesc;

There is a comment in ExplainOnePlan() that says "Handle progressive
explain cleanup manually if enabled as it is not called as part of
ExecutorFinish," but standard_ExecutorFinish does indeed call
ProgressiveExplainFinish(), so either the comment is misleading or the
code is wrong.

standard_ExecutorFinish() makes its call to ProgressiveExplainFinish()
dependent on the value of the progressive_explain GUC, but that GUC
could be changed in mid-query via set_config() or a plpgsql function
that calls SET, which could result in skipping the cleanup even when
it's needed. I think you should make the call unconditional and make
it entirely the job of ProgressiveExplainFinish() to decide whether
any cleanup is needed.

ProgressiveExplainFinish() calls ProgressiveExplainCleanup() in most
cases, but sometimes just disables the timeout instead. I think this
is weird. I think you should just always call
ProgressiveExplainCleanup() and make sure it's robust and cleans up
however much or little is appropriate in all cases.
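
In other words, something like this (sketch):

void
ProgressiveExplainFinish(QueryDesc *queryDesc)
{
    /* decide based on what was actually set up, not on the GUC's current
     * value, which may have changed mid-query */
    if (!ProgressiveExplainIsActive(queryDesc))
        return;

    ProgressiveExplainCleanup(queryDesc);
}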

On the flip side, I can't really see why
dsa_detach(queryDesc->pestate->pe_a) needs to be done while holding
ProgressiveExplainHashLock. Why not just have
ProgressiveExplainFinish() call ProgressiveExplainCleanup(), and then
afterwards it can do the dsa_detach() in the caller? Then
ProgressiveExplainCleanup() no longer needs an argument.

ProgressiveExplainPrint() can save a level of indentation in a large
chunk of the function by understanding that elog(ERROR) does not
return. You don't need to wrap everything that follows in else {}.

In the documentation table called pg-stat-progress-explain-view, some
descriptions end with a period and others do not.

Calling ProgressiveExplainTrigger() directly from
ProgressiveExplainStartupTimeoutHandler() seems extremely scary --
we've tried hard to make our signal handlers do only very simple
things like setting a flag, and this one runs around the entire
PlanState tree modifying Very Important Things. I fear that's going to
be hard to make robust. Like, what happens if we're going around
trying to change ExecProcNode pointers while the calling code was also
going around trying to change ExecProcNode pointers? I can't quite see
how this won't result in chaos.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Proposal: Progressive explain

From
Rafael Thofehrn Castro
Date:
Thanks Robert. Very thorough analysis there.

Things I don't comment on will be fixed without further discussion.

> There is a comment in ExplainOnePlan() that says "Handle progressive
> explain cleanup manually if enabled as it is not called as part of
> ExecutorFinish," but standard_ExecutorFinish does indeed call
> ProgressiveExplainFinish(), so either the comment is misleading or the
> code is wrong.

The comment is misleading. What I meant to say is that queries executed via
EXPLAIN (without analyze) don't call ExecutorFinish, so ProgressiveExplainFinish
isn't called in that context. I will fix the comment.

> Calling ProgressiveExplainTrigger() directly from
> ProgressiveExplainStartupTimeoutHandler() seems extremely scary --
> we've tried hard to make our signal handlers do only very simple
> things like setting a flag, and this one runs around the entire
> PlanState tree modifying Very Important Things. I fear that's going to
> be hard to make robust. Like, what happens if we're going around
> trying to change ExecProcNode pointers while the calling code was also
> going around trying to change ExecProcNode pointers? I can't quite see
> how this won't result in chaos.

Agreed. In that other similar patch to log query plans after a signal is sent
from another session there was the same discussion and concerns.

I don't see another way of doing it here. This patch became 100x more complex
after I added GUC progressive_explain_min_duration, that required changing the
execProcNode wrapper on the fly.

I can see some alternatives here:

A) Use a temporary execProcNode wrapper set at query start here:

/*
 * If instrumentation is required, change the wrapper to one that just
 * does instrumentation.  Otherwise we can dispense with all wrappers and
 * have ExecProcNode() directly call the relevant function from now on.
 */
if (node->instrument)
{
  /*
   * Use wrapper for progressive explains if enabled and the node
   * belongs to the currently tracked query descriptor.
   */
  if (progressive_explain == PROGRESSIVE_EXPLAIN_ANALYZE &&
    ProgressiveExplainIsActive(node->state->query_desc))
    node->ExecProcNode = ExecProcNodeInstrExplain;
  else
    node->ExecProcNode = ExecProcNodeInstr;

This wrapper will have additional logic that checks whether a boolean flag set
by the timeout function has changed. When that happens, the initial progressive
explain setup is done and execProcNode is unwrapped in place, roughly as in the
sketch below. This should be safe.
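
A rough sketch of what I mean (simplified; the exact signature of
ProgressiveExplainInit() and what the node is re-pointed to after setup are
still open):

static TupleTableSlot *
ExecProcNodeInstrExplain(PlanState *node)
{
    TupleTableSlot *result;

    /* flag set by the timeout handler once
     * progressive_explain_min_duration has elapsed */
    if (ProgressiveExplainPending)
    {
        /* one-time setup; the node could then be re-pointed in place to
         * whatever wrapper does the periodic printing */
        ProgressiveExplainInit(node->state->query_desc);
        ProgressiveExplainPending = false;
    }

    /* otherwise behaves exactly like ExecProcNodeInstr */
    InstrStartNode(node->instrument);
    result = node->ExecProcNodeReal(node);
    InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);

    return result;
}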

The problem here is that all queries would always be using the custom
wrapper until timeout is triggered, and that can add unnecessary overhead.

B) We get rid of the idea of applying progressive explains to non instrumented
runs (which was my original idea). My main use-case here is to see progress
of instrumented runs anyways. For that idea we have 2 possibilities:

B1) Keep implementation as is, with all the existing GUCs to control what
is included in the printed plan. We just change GUC progressive_explain to be
a boolean again. If true, instrumentation will be enabled for the query being
executed and progressive explain will be printed consecutively.

B2) Get rid of all new GUCs I added and change the progressive explain feature
to become an option of the EXPLAIN command. Something like:

EXPLAIN (ANALYZE, PROGRESSIVE) SELECT * FROM ...

(B1) would allow progressive explains in regular queries, but I'm not sure people
would be willing to enable it globally as it adds instrumentation overhead.

What do you think of the options?

Rafael.

Re: Proposal: Progressive explain

From
Robert Haas
Date:
On Thu, Mar 27, 2025 at 9:38 PM Rafael Thofehrn Castro
<rafaelthca@gmail.com> wrote:
> > Calling ProgressiveExplainTrigger() directly from
> > ProgressiveExplainStartupTimeoutHandler() seems extremely scary --
>
> Agreed. In that other similar patch to log query plans after a signal is sent
> from another session there was the same discussion and concerns.
>
> I don't see another way of doing it here. This patch became 100x more complex
> after I added GUC progressive_explain_min_duration, that required changing the
> execProcNode wrapper on the fly.
>
> I can see some alternatives here:
>
> A) Use a temporary execProcNode wrapper set at query start here:
>
> The problem here is that all queries would always be using the custom
> wrapper until timeout is triggered, and that can add unnecessary overhead.
>
> B) We get rid of the idea of applying progressive explains to non instrumented
> runs (which was my original idea). My main use-case here is to see progress
> of instrumented runs anyways. For that idea we have 2 possibilities:
>
> B1) Keep implementation as is, with all the existing GUCs to control what
> is included in the printed plan. We just change GUC progressive_explain to be
> a boolean again. If true, instrumentation will be enabled for the query being
> executed and progressive explain will be printed consecutively.
>
> B2) Get rid of all new GUCs I added and change the progressive explain feature
> to become an option of the EXPLAIN command. Something like:
>
> EXPLAIN (ANALYZE, PROGRESSIVE) SELECT * FROM ...
>
> (B1) would allow progressive explains in regular queries, but I'm not sure people
> would be willing to enable it globally as it adds instrumentation overhead.

I'm inclined to think that there isn't much value to this feature with
progressive_explain=explain. If you just want to print plans after a
certain amount of elapsed runtime, you can already use auto_explain to
do that. Unless there's some pretty large benefit to doing it with
this feature over what that already does, I think we shouldn't invent
a second method. Granted, auto_explain is a contrib module and this is
proposed as a core feature, but I feel like that use case of printing
the plan once is so different from what I see as the core value
proposition of this feature that I think it might just be confusing to
include it in scope. There's nothing actually "progressive" about
that, because you're just doing it once.

But having said that, I'm not quite sure I understand why you're
proposing (A) and (B1) as separate alternatives. Changing
progressive_explain to be a Boolean doesn't seem like it solves the
problem of needing to wrap ExecProcNode from a signal handler. The
only thing that seems to solve that problem is to instead do the
wrapping at the start of the query, which AIUI is (A). So I feel like
you should do (A) to solve the problem I thought we were talking about
and (B1) to make things simpler. Am I misunderstanding the trade-offs
here?

I did consider proposing (B2) yesterday, not so much because of this
issue but because I wondered whether it might just be a better
interface. But on reflection I decided it wasn't, because it removes
the option to just turn this feature on for all of your queries, which
somebody might want to do. Also, it would actually be kind of a weird
interface, because every other form of EXPLAIN returns the plan as
output and throws away the original query output; but this form would
store the plan ephemerally in a view and return the original output.
I'm sure we could make that work and find a way to document it but it
seems odd.

In general, I think we should err on the side of making this feature
safe even at some performance cost. If we put in place a version of
this feature that imposes a great performance overhead but does not
crash the server, some people will be unhappy, complain, and/or be
unable to use the feature. That will be sad. But, if we put in place a
version of this feature that performs great and occasionally crashes
the server, that will be much sadder. So let's do something we're very
confident is safe. There is always the opportunity to change things in
the future if we're more confident about some questionable choice then
than we are now -- performance optimization can be done after the
fact. But if it's not stable, the only thing we're likely to change in
the future is to revert the whole thing.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Proposal: Progressive explain

From
Andrei Lepikhov
Date:
On 3/26/25 22:57, Rafael Thofehrn Castro wrote:
> So implementation was done based on transaction nested level. Cleanup is only
> performed when AbortSubTransaction() is called in the same transaction nested
> level as the one where the query is running. This covers both PL/pgSQL exception
> blocks and savepoints.
Thanks for your efforts!
Let me provide an alternative view of your code.
Postgres has a massive gap between the start of the execution and the 
end. All the stuff that happens at that time can't be discovered. That's 
sad. If you add a new hook into ExecProcNode (it may be designed as a
pointer in the PlanState to affect only necessary nodes), you may push
the code out of the core and give other extensions additional tools.
I see your reasoning here [1], but I think with the commit 4fd02bf7cf9, 
it should be revised.
As I see after reviewing your code, to design it as an extension, some 
parallel query execution stuff needs to be exported, and Instrumentation 
needs to be changed a little. For example, with progressive explain we 
have a new node state - 'not visited yet' that is different from 'not 
executed'.
What is the purpose of a new hook from a broad perspective? I designed 
at least two features with such a hook in the Postgres fork: 1) 
real-time execution statistics and 2) query interruption on an
arbitrary trigger (planning error detected, a signal arrived, too much
temp memory allocated, etc.). I am not sure if it would be interesting
for the community, but when a query is executed for more than one
minute, I certainly want to control what is happening and have some
tools other than aborting the query.
The benefit of such an approach is that it is doable to change the 
Instrumentation, add a hook now, and develop this extension without a 
rush until it is stable - I think at least the case of parallel 
execution may be enhanced.

[1] 
https://www.postgresql.org/message-id/CAG0ozMrtK_u8Uf5KNZUmRNuMphV5tnC5DEhRBNRGK%2BK4L506xw%40mail.gmail.com

-- 
regards, Andrei Lepikhov



Re: Proposal: Progressive explain

From
Rafael Thofehrn Castro
Date:
> But having said that, I'm not quite sure I understand why you're
> proposing (A) and (B1) as separate alternatives. Changing
> progressive_explain to be a Boolean doesn't seem like it solves the
> problem of needing to wrap ExecProcNode from a signal handler. The
> only thing that seems to solve that problem is to instead do the
> wrapping at the start of the query, which AIUI is (A). So I feel like
> you should do (A) to solve the problem I thought we were talking about
> and (B1) to make things simpler. Am I misunderstanding the trade-offs
> here?

Both (A) and (B1) use the same strategy, which is to wrap at query start. What
changes is that (A), where we keep the 'explain' option, allows users to still
see the plan without incurring instrumentation overhead.

But (A) is doomed from the start, as the custom wrapper will have custom logic
that adds some overhead. That is what I was able to avoid with the current
patch, which does the wrapping in the timeout handler function.

In (B1) it is ok to have a custom wrapper that does the progressive explain
at fixed intervals, as that overhead is far less noticeable than the overhead
of the already existing ExecProcNodeInstr wrapper. I tested that.

Notice that the gist of (B1) is already part of the existing patch. If progressive_explain
is set to 'analyze' I set a wrapper ExecProcNodeInstrExplain at query start.

So I think the way to go is with (B1), where the ExecProcNodeInstrExplain
wrapper will continue to do what it does, but only after progressive_explain_min_duration
has passed (a boolean flag set by the timeout handler function). And I get rid
of the 'explain' value.
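
Roughly, what I have in mind for (B1) is something like the sketch below
(just a sketch; the handler and flag names are illustrative, and the interval
check is assumed to live inside ProgressiveExplainPrint()):

/* set once by the progressive_explain_min_duration timeout callback */
static volatile sig_atomic_t progressive_explain_active = false;

static void
ProgressiveExplainTimeoutHandler(void)
{
    /* only flip a flag here; no plan printing inside the timeout callback */
    progressive_explain_active = true;
}

static TupleTableSlot *
ExecProcNodeInstrExplain(PlanState *node)
{
    TupleTableSlot *result = ExecProcNodeInstr(node);

    /*
     * A node boundary is a safe point to print. Once the min duration has
     * elapsed, keep re-printing the plan every progressive_explain_interval.
     */
    if (progressive_explain_active)
        ProgressiveExplainPrint(activeQueryDesc);

    return result;
}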

As you said, visibility on the non instrumented query plan is already a feature
of auto_explain.

> The benefit of such an approach is that it is doable to change the
> Instrumentation, add a hook now, and develop this extension without a
> rush until it is stable - I think at least the case of parallel
> execution may be enhanced.

Thanks for the input Andrei! The main issue is that progressive explain
touches a bunch of different parts of the code that would require additional
hooks: for example, to adjust the execProcNode wrapper at query start. We
would also need a hook for xact/subxact cleanups, which I don't think exists yet.

The biggest challenge is what lies inside ExplainNode(), an already complex
function that calls several other helper functions. Progressive explain had
to change several parts of it.

The only solution I found back then while thinking about this idea was to make
ExplainNode itself a hook, and the extension would duplicate the whole code with
the additional custom logic. And that is definitely not a good idea.

With the new hooks Robert added we are indeed one big step ahead, but we still need
to deal with instrumentation and other nuances, as you said.

I am definitely not the authority here to talk about the best way forward.
If there is an agreement in turning this into an extension, it can be a new
feature in auto_explain.

Rafael.

Re: Proposal: Progressive explain

From
Robert Haas
Date:
On Fri, Mar 28, 2025 at 12:09 PM Rafael Thofehrn Castro
<rafaelthca@gmail.com> wrote:
> I am definitely not the authority here to talk about the best way forward.
> If there is an agreement in turning this into an extension, it can be a new
> feature in auto_explain.

I'm not against adding some more hooks to explain.c, but I don't
really see how you could add hooks to explain.c that would allow
somebody to implement this feature out of core, unless either (a) you
are thinking that they will copy and paste a lot of core code or (b)
you make the hooks extremely fine-grained and special-purpose,
designed to accomplish the extremely precise thing that this feature
would need them to do. But that just seems like bad hook design. I
don't think hooks have to be amazingly general in order to be
accepted, but if there's no use of the hook that doesn't involve a big
cut-and-paste operation, that's poor, and if the hook can only ever be
used to implement one thing, you might as well put the thing in core.

I feel like there are just two separate things here. There is Rafael's
progressive EXPLAIN patch which, as far as I can see currently, is
really only going to work as an in-core feature, and then there is
Andrei's desire for more hooks, which may very well also be a good
idea but not the same idea.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Proposal: Progressive explain

From
Robert Haas
Date:
On Fri, Mar 28, 2025 at 12:09 PM Rafael Thofehrn Castro
<rafaelthca@gmail.com> wrote:
> As you said, visibility on the non instrumented query plan is already a feature
> of auto_explain.

True, although thinking about it more, they're not being sent to the
same place. auto_explain goes to the log, and this goes to a view.
What about something like this:

progressive_explain = on | off
progressive_explain_interval = 5s

If progressive_explain is off, then we don't populate the view. If
it's on, then we populate the view during query startup. Every time
the interval elapses, we update it again. If you want to update the
view just once at query startup and not thereafter, you can set
progressive_explain_interval = 0.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Proposal: Progressive explain

From
Rafael Thofehrn Castro
Date:
> True, although thinking about it more, they're not being sent to the
> same place. auto_explain goes to the log, and this goes to a view.
> What about something like this:
> progressive_explain = on | off
> progressive_explain_interval = 5s
> If progressive_explain is off, then we don't populate the view. If
> it's on, then we populate the view during query startup. Every time
> the interval elapses, we update it again. If you want to update the
> view just once at query startup and not thereafter, you can set
> progressive_explain_interval = 0.

That also works. My concern is that with that approach there is a huge
"hidden" change of execution behavior between setting progressive_explain_interval
to 0 and setting it to something greater.

Setting it to 0 does nothing other than dump the plain plan at
query start. Setting any other value will enable instrumentation under the
hood. This would have to be very well documented.

I'm still more inclined to use:

progressive_explain = off | explain | analyze
progressive_explain_interval = any millisecond greater than a min threshold
(currently 10ms). It doesn't make sense to be dumping the instrumented plan 
every 1ms for example, IMHO.

Re: Proposal: Progressive explain

From
Robert Haas
Date:
On Fri, Mar 28, 2025 at 3:59 PM Rafael Thofehrn Castro
<rafaelthca@gmail.com> wrote:
> > True, although thinking about it more, they're not being sent to the
> > same place. auto_explain goes to the log, and this goes to a view.
> > What about something like this:
> > progressive_explain = on | off
> > progressive_explain_interval = 5s
>
> I'm still more inclined to use:
>
> progressive_explain = off | explain | analyze
> progressive_explain_interval = any millisecond greater than a min threshold
> (currently 10ms). It doesn't make sense to be dumping the instrumented plan
> every 1ms for example, IMHO.

I still have trouble understanding what that means. Is the interval
irrelevant except when progressive_explain = analyze?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Proposal: Progressive explain

From
Rafael Thofehrn Castro
Date:
> I still have trouble understanding what that means. Is the interval
> irrelevant except when progressive_explain = analyze?

That is correct. Interval currently is only used when instrumentation
is enabled through progressive_explain = analyze.

Re: Proposal: Progressive explain

From
Rafael Thofehrn Castro
Date:
> That is correct. Interval currently is only used when instrumentation
> is enabled through progressive_explain = analyze.

I guess there would still be a point in printing the plan at a regular interval
when instrumentation is disabled. In that case the only thing we would see
changing in the plan is the node currently being executed. But with that we
add more overhead to progressive_explain = 'explain', which in that case would
also require a custom execProcNode wrapper.

Re: Proposal: Progressive explain

From
Robert Haas
Date:
On Fri, Mar 28, 2025 at 4:12 PM Rafael Thofehrn Castro
<rafaelthca@gmail.com> wrote:
> > That is correct. Interval currently is only used when instrumentation
> > is enabled through progressive_explain = analyze.
>
> I guess there would still be a point in printing the plan on a regular interval
> when instrumentation is disabled. In that case the only thing we would see
> changing in the plan is the node currently being executed. But with that we
> add more overhead to progressive_explain = 'explain', that in that case will
> also require a custom execProcNode wrapper.

I don't think that's at all worth it. I had even considered asking for
the current node stuff to be removed altogether. Doing regular updates
to only change that seems like a real waste. It's barely ever going to
be useful to see the node being executed right this exact instant -
what you care about is which nodes suck up resources over time.

As far as option naming is concerned, I'm open to input from others,
but personally I feel like ANALYZE is a bit opaque as an EXPLAIN
option in general. We've had previous discussions about the fact that
we might have wanted to name it EXECUTE if EXPLAIN EXECUTE didn't
already mean something else. I think for most people, what ANALYZE
means - somewhat counterintuitively - is that we're actually going to
run the query. But here we're always going to run the query, so my
brain just goes into a tailspin trying to figure out what ANALYZE
means. So, personally, I would like to see progressive_explain = off |
explain | analyze go away in favor of progressive_explain = off | on.

I think we have two ideas on how to get there so far:

(1) Just get rid of what you call progressive_explain = explain. After
all, that amounts to just serializing the plan once, and maybe that's
not actually such a great feature. With that change, we only have
two remaining values for progressive_explain, and we can call them
"off" and "on".

(2) Merge progressive_explain = explain and progressive_explain =
analyze into a single value, progressive_explain = on. To tell which
is intended, just check progressive_explain_interval. If it's zero,
then there's no repeat, so we mean what you currently call
progressive_explain = explain, i.e. serialize once. Otherwise we mean
progressive_explain = analyze.

I'm open to idea #3, but I'm pretty resistant to the idea of
progressive_explain remaining three-valued as it is today. If 27
people show up to say that they understand what that means perfectly
and Robert's a big fat idiot, I shall of course defer to the
consensus, but I kind of think I might not be alone in finding this
naming confusing.

For what it's worth, I have trouble seeing how anyone gets confused if
we do what I propose as #2. I mean, I agree with your point that they
could misunderstand how much overhead different settings will cause,
so we would need to document that carefully. But if you're just
wondering about the behavior, then I feel like people have a pretty
good chance of guessing that progressive_explain=on,
progressive_explain_interval=0 means do it once. You could think that
means do it at maximum speed, but that seems like a somewhat naive
interpretation. You could also think it disables the feature,
effectively negating progressive_explain=on, but then you should start
to wonder why there are two GUCs. If you don't think either of those
things, then I think "do it once" is the only other reasonable
interpretation.

Of course, sometimes what I think turns out to be completely wrong!

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Proposal: Progressive explain

From
Rafael Thofehrn Castro
Date:
> (2) Merge progressive_explain = explain and progressive_explain =
> analyze into a single value, progressive_explain = on. To tell which
> is intended, just check progressive_explain_interval. If it's zero,
> then there's no repeat, so we mean what you currently call
> progressive_explain = explain, i.e. serialize once. Otherwise we mean
> progressive_explain = analyze.


Implemented this version. New patch has the following characteristics:

- progressive_explain is a boolean

- GUC progressive_explain_min_duration is removed

- if progressive_explain_interval = 0, update the plan in memory only at query
start

- if progressive_explain_interval > 0, share the plan in memory at query
start and update it every progressive_explain_interval (instrumentation is
enabled automatically)

- The plan shared in pg_stat_progress_explain at query start does not
contain any instrumentation detail ((never executed), (actual time), etc)
if progressive_explain_interval = 0, even if the query is started with EXPLAIN
ANALYZE. My reasoning here is that if the plan will be shared only once, we
are actually interested in seeing the plan itself and not instrumentation
progress.
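
To make the above concrete, the decision at query start is roughly the
following (a simplified sketch, not the actual patch; exactly which
instrument flags get set is glossed over):

if (progressive_explain)
{
    if (progressive_explain_interval > 0)
        queryDesc->instrument_options |= INSTRUMENT_ALL;    /* illustrative */

    /* in both cases the plan shows up in pg_stat_progress_explain now */
    ProgressiveExplainStart(queryDesc);
}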

Fixed other comments you shared in previous emails too:

> void
> AtEOSubXact_ProgressiveExplain(bool isCommit, int nestDepth)
> {
>     if (activeQueryDesc != NULL &&
>         activeQueryXactNestLevel >= nestDepth)
>     {
>         if (isCommit)
>            elog(WARNING, "leaked progressive explain query descriptor");
>        ProgressiveExplainCleanup(NULL);
>    }
>}
> By including the is-commit case in there, we can catch any bugs where
> things aren't cleaned up properly before a transaction is committed.

Added the isCommit logic to both transaction and subtransaction commits
and aborts, so that we get a warning about leakage if cleanup was not properly
done before a commit. Changed function names back to AtEOXact and AtEOSubXact,
as opposed to AtAbort and AtSubAbort.

> We generally want to test >= nestDepth instead of == nestDepth in case
> multiple subtransaction levels abort all at once; I'm not sure it
> matters here, but even if it isn't, it's best to be consistent with
> the practice elsewhere. Having {Commit,Abort}SubTransaction pass the
> nestDepth instead of calling GetCurrentTransactionNestLevel() also has
> precedent e.g. see AtEOSubXact_HashTables.

Done.

> As a further refinement, consider initializing
> activeQueryXactNestLevel to -1 and resetting it to that value each
> time you end a progressive EXPLAIN, so that activeQueryDesc != NULL if
> and only if activeQueryXactNestLevel >= 0. Then the test above can be
> simplified to just if (activeQueryXactNestLevel >= nestDepth).

Done.
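
For reference, with those changes folded in the function now reads roughly
like this (a sketch, not the exact patch text):

void
AtEOSubXact_ProgressiveExplain(bool isCommit, int nestDepth)
{
    /* activeQueryXactNestLevel is -1 whenever nothing is being tracked */
    if (activeQueryXactNestLevel >= nestDepth)
    {
        if (isCommit)
            elog(WARNING, "leaked progressive explain query descriptor");
        ProgressiveExplainCleanup(NULL);
    }
}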

> standard_ExecutorFinish() makes its call to ProgressiveExplainFinish()
> dependent on the value of the progressive_explain GUC, but that GUC
> could be changed in mid-query via set_config() or a plpgsql function
> that calls SET, which could result in skipping the cleanup even when
> it's needed. I think you should make the call unconditional and make
> it entirely the job of ProgressiveExplainFinish() to decide whether
> any cleanup is needed.

Done.

> ProgressiveExplainFinish() calls ProgressiveExplainCleanup() in most
> cases, but sometimes just disables the timeout instead. I think this
> is weird. I think you should just always call
> ProgressiveExplainCleanup() and make sure it's robust and cleans up
> however much or little is appropriate in all cases.

Now that I removed all the execProcNode wrapping and the conditional cleanup
based on the (now removed) progressive_explain_min_duration, this
part became much simpler. ProgressiveExplainFinish() always calls
ProgressiveExplainCleanup().

> On the flip side, I can't really see why
> dsa_detach(queryDesc->pestate->pe_a) needs to be done while holding
> ProgressiveExplainHashLock. Why not just have
> ProgressiveExplainFinish() call ProgressiveExplainCleanup(), and then
> afterwards it can do the dsa_detach() in the caller? Then
> ProgressiveExplainCleanup() no longer needs an argument.

That is perfect. Implemented.
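
So ProgressiveExplainFinish() now ends with roughly this (a sketch):

/* drop the shared entry while holding ProgressiveExplainHashLock ... */
ProgressiveExplainCleanup();

/* ... and detach from the per-query DSA afterwards, without the lock */
dsa_detach(queryDesc->pestate->pe_a);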

> ProgressiveExplainPrint() can save a level of indentation in a large
> chunk of the function by understanding that elog(ERROR) does not
> return. You don't need to wrap everything that follows in else {}.

Fixed.

I will not fix documentation for now as we are not done with
implementation changes yet. Once we agree that the code logic is
safe and sound we can discuss cosmetics (feature name, GUCs, view, etc).

Rafael.
Attachment

Re: Proposal: Progressive explain

From
torikoshia
Date:
On Fri, Mar 7, 2025 at 6:43 AM Rafael Thofehrn Castro 
<rafaelthca@gmail.com> wrote:
> The wrapper code was implemented by torikoshia 
> (torikoshia(at)oss(dot)nttdata(dot)com),
> so adding the credits here.

On Thu, Mar 20, 2025 at 5:35 AM Robert Haas <robertmhaas@gmail.com> 
wrote:
> Without having the prior discussion near to hand, I *think* that the
> reason we wanted to do this wrap/unwrap stuff is to make it so that
> the progressive EXPLAIN code could only execute when entering a new
> plan node rather than at any random point inside of that plan node,
> and that does seem a lot safer than the alternative.

Your assumption is correct. Various approaches were suggested, such as 
picking a small number of safe and sufficient places for this feature or 
classifying CFI() calls into safe and unsafe ones. However, in the end, 
the wrapping approach [1] was the only one that remained.


On 2025-03-30 02:51, Rafael Thofehrn Castro wrote:

> Implemented this version. New patch has the following characteristics:


I haven't looked into the code yet, but when I ran the commands below during 
make installcheck, there was an error and an assertion failure

   =# select * from pg_stat_progress_explain;
   =# \watch 0.1

ERROR:  could not attach to dynamic shared area

   WARNING:  terminating connection because of crash of another server 
process
   DETAIL:  The postmaster has commanded this server process to roll back 
the current transaction and exit, because another server process exited 
abnormally and possibly corrupted shared memory.
   HINT:  In a moment you should be able to reconnect to the database and 
repeat your command.
   server closed the connection unexpectedly
           This probably means the server terminated abnormally
           before or while processing the request.
   The connection to the server was lost. Attempting reset: Failed.
   The connection to the server was lost. Attempting reset: Failed.

   TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File: 
"ruleutils.c", Line: 8802, PID: 73180
   TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File: 
"ruleutils.c", Line: 8802, PID: 73181
   0   postgres                            0x000000010365f5c4 
ExceptionalCondition + 236
   1   postgres                            0x00000001035a7830 
get_parameter + 1076
   0   postgres                            0x000000010365f5c4 
ExceptionalCondition + 236
   2   postgres                            0x000000010359ff2c 
get_rule_expr + 276
   1   postgres                            0x00000001035a7830 
get_parameter + 1076
   3   postgres                            0x00000001035a841c 
get_rule_expr_paren + 168
   2   postgres                            0x000000010359ff2c 
get_rule_expr + 276
   4   postgres                            0x00000001035a82c4 
get_oper_expr + 292
   3   postgres                            0x00000001035a841c 
get_rule_expr_paren + 168
   5   postgres                            0x00000001035a01f8 
get_rule_expr + 992
   4   postgres                            0x00000001035a82c4 
get_oper_expr + 292
   6   postgres                            0x0000000103598520 
deparse_expression_pretty + 176
   5   postgres                            0x00000001035a01f8 
get_rule_expr + 992
   7   postgres                            0x0000000103598e78 
deparse_expression + 76
   6   postgres                            0x0000000103598520 
deparse_expression_pretty + 176
   8   postgres                            0x0000000102f94198 
show_expression + 100
   7   postgres                            0x0000000103598e78 
deparse_expression + 76
   9   postgres                            0x0000000102f97690 show_qual + 
112
   8   postgres                            0x0000000102f94198 
show_expression + 100
   10  postgres                            0x0000000102f93734 
show_scan_qual + 132
   9   postgres                            0x0000000102f97690 show_qual + 
112
   TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File: 
"ruleutils.c", Line: 8802, PID: 73183
   11  postgres                            0x0000000102f90680 ExplainNode 
+ 6828
   10  postgres                            0x0000000102f93734 
show_scan_qual + 132
   12  postgres                            0x0000000102f8d398 
ExplainPrintPlan + 540
   0   postgres                            0x000000010365f5c4 
ExceptionalCondition + 236
   11  postgres                            0x0000000102f90680 ExplainNode 
+ 6828
   13  postgres                            0x0000000102f9b974 
ProgressiveExplainPrint + 72
   12  postgres                            0x0000000102f8d398 
ExplainPrintPlan + 540
   1   postgres                            0x00000001035a7830 
get_parameter + 1076
   14  postgres                            0x0000000102f9b920 
ProgressiveExplainStart + 660
   13  postgres                            0x0000000102f9b974 
ProgressiveExplainPrint + 72
   2   postgres                            0x000000010359ff2c 
get_rule_expr + 276
   15  postgres                            0x00000001030771e0 
standard_ExecutorStart + 984
   14  postgres                            0x0000000102f9b920 
ProgressiveExplainStart + 660
   3   postgres                            0x00000001035a841c 
get_rule_expr_paren + 168
   16  postgres                            0x0000000103076de8 
ExecutorStart + 112
   15  postgres                            0x00000001030771e0 
standard_ExecutorStart + 984
   4   postgres                            0x00000001035a82c4 
get_oper_expr + 292
   17  postgres                            0x0000000103080d90 
ParallelQueryMain + 292
   16  postgres                            0x0000000103076de8 
ExecutorStart + 112
   5   postgres                            0x00000001035a01f8 
get_rule_expr + 992
   18  postgres                            0x0000000102df7ef8 
ParallelWorkerMain + 1712
   17  postgres                            0x0000000103080d90 
ParallelQueryMain + 292
   6   postgres                            0x0000000103598520 
deparse_expression_pretty + 176
   19  postgres                            0x00000001032a5d60 
BackgroundWorkerMain + 824
   18  postgres                            0x0000000102df7ef8 
ParallelWorkerMain + 1712
   7   postgres                            0x0000000103598e78 
deparse_expression + 76
   20  postgres                            0x00000001032a9ee8 
postmaster_child_launch + 492
   19  postgres                            0x00000001032a5d60 
BackgroundWorkerMain + 824
   8   postgres                            0x0000000102f94198 
show_expression + 100
   21  postgres                            0x00000001032b4c10 
StartBackgroundWorker + 416
   20  postgres                            0x00000001032a9ee8 
postmaster_child_launch + 492
   9   postgres                            0x0000000102f97690 show_qual + 
112
   22  postgres                            0x00000001032af9d8 
maybe_start_bgworkers + 552
   21  postgres                            0x00000001032b4c10 
StartBackgroundWorker + 416
   10  postgres                            0x0000000102f93734 
show_scan_qual + 132
   23  postgres                            0x00000001032b26cc 
LaunchMissingBackgroundProcesses + 1316
   22  postgres                            0x00000001032af9d8 
maybe_start_bgworkers + 552
   11  postgres                            0x0000000102f90680 ExplainNode 
+ 6828
   24  postgres                            0x00000001032afcb0 ServerLoop 
+ 616
   23  postgres                            0x00000001032b26cc 
LaunchMissingBackgroundProcesses + 1316
   12  postgres                            0x0000000102f8d398 
ExplainPrintPlan + 540
   25  postgres                            0x00000001032ae55c 
PostmasterMain + 6632
   13  postgres                            0x0000000102f9b974 
ProgressiveExplainPrint + 72
   24  postgres                            0x00000001032afcb0 ServerLoop 
+ 616
   25  postgres                            0x00000001032ae55c 
PostmasterMain + 6632
   26  postgres                            0x0000000103121160 main + 952
   27  dyld                                0x000000019cdc0274 start + 
2840
   26  postgres                            0x0000000103121160 main + 952
   27  dyld                                0x000000019cdc0274 start + 
2840
   TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File: 
"ruleutils.c", Line: 8802, PID: 73182


[1] 
https://www.postgresql.org/message-id/ac6c51071316279bf903078cf264c37a%40oss.nttdata.com

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.



Re: Proposal: Progressive explain

From
Andrei Lepikhov
Date:
On 3/31/25 02:23, torikoshia wrote:
> On Fri, Mar 7, 2025 at 6:43 AM Rafael Thofehrn Castro 
>> Implemented this version. New patch has the following characteristics:
> 
> 
> On Fri, Mar 7, 2025 at 6:43 AM Rafael Thofehrn Castro 
>> Implemented this version. New patch has the following characteristics:
> 
>    =# select * from pg_stat_progress_explain;
>    =# \watch 0.1
Yeah, that's to be expected.
I think many corner cases may be found: a hash table in the middle of being 
filled, open file descriptors, an incorrect combination of variables, 
'not yet visited' subtrees - who knows what else? So, last time, I 
ended up concluding that using the explain code in the middle of execution 
is a bit dangerous - it is enough to expose only basic data - rows, counts 
and timings. That seems safe to gather.

-- 
regards, Andrei Lepikhov



Re: Proposal: Progressive explain

From
Robert Haas
Date:
Thanks for this valuable testing. I think this is actually a really
good idea for how to test something like this, because the regression
tests contain lots of different queries that do lots of weird things.

On Sun, Mar 30, 2025 at 8:23 PM torikoshia <torikoshia@oss.nttdata.com> wrote:
> I haven't looked into the code yet, but when I ran the commands below during
> make installcheck, there was an error and an assertion failure
>
>    =# select * from pg_stat_progress_explain;
>    =# \watch 0.1
>
> ERROR:  could not attach to dynamic shared area

This seems like a race condition. Probably some backend's dsa_area
went away between the time we got a pointer to it and the time we
actually attached to it. We should be able to find some way of
handling this without an error, like treating the case where the DSA
area is missing the same as the case where there was no DSA pointer in
the first place. However, this is also making me wonder if we
shouldn't be using one DSA shared by all backends rather than a
separate DSA area for every backend. That would require more care to
avoid leaks, but I'm not sure that it's a good idea to be creating and
destroying a DSA area for every single query. But I'm not 100% sure
that's a problem.

>    TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File:
> "ruleutils.c", Line: 8802, PID: 73180
>    TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File:
> "ruleutils.c", Line: 8802, PID: 73181

I wonder what is happening here. One systemic danger of the patch is
that there can be a difference between what must be true at the *end*
of a query and what must be true *during* a query. Anything that can't
happen at the end but can happen in the middle is something that the
patch will need to do something about in order to work properly. But I
don't see how that can explain this failure, because AFAIR the patch
just prints the same things that would have been printed by any other
EXPLAIN, with the same stack of namespaces. It seems possible that
this is a pre-existing bug: the regression tests might contain a query
that would cause EXPLAIN to fail, but because the regression tests
don't actually EXPLAIN that query, no failure occurs. But it could
also be something else; for example, maybe this patch is trying to
EXPLAIN something that couldn't be used with a regular EXPLAIN for
some reason. Or maybe the patch doesn't actually succeed in doing the
EXPLAIN with the correct namespace stack in all cases.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Proposal: Progressive explain

From
Rafael Thofehrn Castro
Date:
> I haven't looked into the code yet, but when I ran the commands below during
> make installcheck, there was an error and an assertion failure

Thanks for the report. I actually made a nasty mistake in the last
patch after code refactoring, which was to not properly check whether
a QueryDesc is already being tracked. So every subquery call
in the same query allocates DSAs, and those segments are not
being properly cleared. So the patch is broken, and that probably explains
your crashes.

Just made a 1 line fix and make installcheck looks clean now. Will
do more tests before sending another version.
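
In other words, the missing guard is along these lines (a sketch; the exact
placement in the patch may differ):

/* don't set up progressive explain again for a nested/sub query */
if (activeQueryDesc != NULL)
    return;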

Re: Proposal: Progressive explain

From
Rafael Thofehrn Castro
Date:
Hello again,

> ERROR:  could not attach to dynamic shared area

In addition to that refactoring issue, the current patch had a race
condition in pg_stat_progress_explain to access the DSA of a process
running a query that gets aborted.

While discussing with Robert we agreed that it would be wiser to take
a step back and change the strategy used to share progressive explain
data in shared memory.

Instead of using per backend's DSAs shared via a hash structure I now
define a dsa_pointer and a LWLock in each backend's PGPROC.

A global DSA is created by the first backend that attempts to use
the progressive explain feature. After the DSA is created, subsequent
uses of the feature will just allocate memory there and reference
via PGPROC's dsa_pointer.

This solves the race condition reported by Torikoshi and improves
concurrency performance as now we don't have a global LWLock 
controlling shared memory access, but a per-backend LWLock.
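
A sketch of the new arrangement (field and variable names are illustrative;
dsa_create(), dsa_get_handle(), dsa_attach() and dsa_allocate() are the real
DSA APIs, and locking of the shared handle itself is omitted):

static dsa_area *pe_area = NULL;        /* this backend's attachment */

if (pe_area == NULL)
{
    if (!ProgressiveExplainShared->handle_valid)
    {
        /* first backend to use the feature creates the global DSA */
        pe_area = dsa_create(pe_tranche_id);
        ProgressiveExplainShared->handle = dsa_get_handle(pe_area);
        ProgressiveExplainShared->handle_valid = true;
    }
    else
        pe_area = dsa_attach(ProgressiveExplainShared->handle);
}

/* publish this backend's plan via its own PGPROC slot, under its own lock */
LWLockAcquire(&MyProc->progressiveExplainLock, LW_EXCLUSIVE);
MyProc->progressiveExplain = dsa_allocate(pe_area, plan_len);
LWLockRelease(&MyProc->progressiveExplainLock);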

Performed the same tests done by Torikoshi and it looks like we are
good now. Even with more frequent inspects in pg_stat_progress_explain
(\watch 0.01).

Rafael.
Attachment

Re: Proposal: Progressive explain

From
torikoshia
Date:
On 2025-04-01 15:23, Rafael Thofehrn Castro wrote:
> Hello again,
> 
>> ERROR:  could not attach to dynamic shared area
> 
> In addition to that refactoring issue, the current patch had a race
> condition in pg_stat_progress_explain to access the DSA of a process
> running a query that gets aborted.
> 
> While discussing with Robert we agreed that it would be wiser to take
> a step back and change the strategy used to share progressive explain
> data in shared memory.
> 
> Instead of using per backend's DSAs shared via a hash structure I now
> define a dsa_pointer and a LWLock in each backend's PGPROC.
> 
> A global DSA is created by the first backend that attempts to use
> the progressive explain feature. After the DSA is created, subsequent
> uses of the feature will just allocate memory there and reference
> via PGPROC's dsa_pointer.
> 
> This solves the race condition reported by Torikoshi and improves
> concurrency performance as now we don't have a global LWLock
> controlling shared memory access, but a per-backend LWLock.
> 
> Performed the same tests done by Torikoshi and it looks like we are
> good now. Even with more frequent inspects in pg_stat_progress_explain
> (\watch 0.01).

Thanks for updating the patch!
Have you tested enabling progressive_explain?

When I ran the 'make installcheck' test again setting 
progressive_explain to on, there was the same assertion failure:

   TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File: 
"ruleutils.c", Line: 8802, PID: 116832
   postgres: parallel worker for PID 116822 
(ExceptionalCondition+0x98)[0xb7311ea8bf80]
   postgres: parallel worker for PID 116822 (+0x89de6c)[0xb7311e9ede6c]
   postgres: parallel worker for PID 116822 (+0x89eb68)[0xb7311e9eeb68]
   postgres: parallel worker for PID 116822 (+0x89e78c)[0xb7311e9ee78c]
   postgres: parallel worker for PID 116822 (+0x8a1d10)[0xb7311e9f1d10]
   postgres: parallel worker for PID 116822 (+0x89ed80)[0xb7311e9eed80]
   postgres: parallel worker for PID 116822 (+0x89e78c)[0xb7311e9ee78c]
   postgres: parallel worker for PID 116822 (+0x89f174)[0xb7311e9ef174]
   postgres: parallel worker for PID 116822 (+0x89e78c)[0xb7311e9ee78c]
   postgres: parallel worker for PID 116822 (+0x89f0b8)[0xb7311e9ef0b8]
   postgres: parallel worker for PID 116822 (+0x8928dc)[0xb7311e9e28dc]
   postgres: parallel worker for PID 116822 
(deparse_expression+0x34)[0xb7311e9e2834]
   postgres: parallel worker for PID 116822 (+0x347870)[0xb7311e497870]
   postgres: parallel worker for PID 116822 (+0x3478e4)[0xb7311e4978e4]
   postgres: parallel worker for PID 116822 (+0x347970)[0xb7311e497970]
   ...
   TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File: 
"ruleutils.c", Line: 8802, PID: 116831
   [115650] LOG:  00000: background worker "parallel worker" (PID 116831) 
was terminated by signal 6: Aborted
   [115650] DETAIL:  Failed process was running: explain (analyze, costs 
off, summary off, timing off, buffers off) select count(*) from ab where 
(a = (select 1) or a = (select 3)) and b = 2
   [115650] LOCATION:  LogChildExit, postmaster.c:2846


We can reproduce it as follows:

   show progressive_explain;
   progressive_explain
  ---------------------
   on

   create table ab (a int not null, b int not null) partition by list 
(a);
   create table ab_a2 partition of ab for values in(2) partition by list 
(b);
   create table ab_a2_b1 partition of ab_a2 for values in (1);
   create table ab_a2_b2 partition of ab_a2 for values in (2);
   create table ab_a2_b3 partition of ab_a2 for values in (3);
   create table ab_a1 partition of ab for values in(1) partition by list 
(b);
   create table ab_a1_b1 partition of ab_a1 for values in (1);
   create table ab_a1_b2 partition of ab_a1 for values in (2);
   create table ab_a1_b3 partition of ab_a1 for values in (3);
   create table ab_a3 partition of ab for values in(3) partition by list 
(b);
   create table ab_a3_b1 partition of ab_a3 for values in (1);
   create table ab_a3_b2 partition of ab_a3 for values in (2);
   create table ab_a3_b3 partition of ab_a3 for values in (3);

   set parallel_setup_cost = 0;
   set parallel_tuple_cost = 0;
   set min_parallel_table_scan_size = 0;
   set max_parallel_workers_per_gather = 2;

   explain (analyze, costs off, summary off, timing off, buffers off) 
select count(*) from ab where (a = (select 1) or a = (select 3)) and b = 
2;
   WARNING:  terminating connection because of crash of another server 
process
   DETAIL:  The postmaster has commanded this server process to roll back 
the current transaction and exit, because another server process exited 
abnormally and possibly corrupted shared memory.
   HINT:  In a moment you should be able to reconnect to the database and 
repeat your command.
   server closed the connection unexpectedly
           This probably means the server terminated abnormally
           before or while processing the request.
   The connection to the server was lost. Attempting reset: Failed.


Note that there is no need to access pg_stat_progress_explain.

Could you please check if you can reproduce this?



-- 
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.



Re: Proposal: Progressive explain

From
Robert Haas
Date:
On Tue, Apr 1, 2025 at 9:38 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
> Could you please check if you can reproduce this?

I can, and I now see that this patch has a pretty big design problem.
The assertion failure occurs when a background worker tries to call
ruleutils.c's get_parameter(), which tries to find the expression from
which the parameter was computed. To do that, it essentially looks
upward in the plan tree until it finds the source of the parameter.
For example, if the parameter came from the bar_id column of relation
foo, we would then deparse the parameter as foo.bar_id, rather than as
$1 or similar. However, this doesn't work when the deparsing is
attempted from within a parallel worker, because the parallel worker
only gets the portion of the plan it's attempting to execute, not the
whole thing. In your test case, the whole plan looks like this:

 Aggregate
   InitPlan 1
     ->  Result
   InitPlan 2
     ->  Result
   ->  Gather
         Workers Planned: 2
         ->  Parallel Append
               ->  Parallel Seq Scan on ab_a1_b2 ab_1
                     Filter: ((b = 2) AND ((a = (InitPlan 1).col1) OR
(a = (InitPlan 2).col1)))
               ->  Parallel Seq Scan on ab_a2_b2 ab_2
                     Filter: ((b = 2) AND ((a = (InitPlan 1).col1) OR
(a = (InitPlan 2).col1)))
               ->  Parallel Seq Scan on ab_a3_b2 ab_3
                     Filter: ((b = 2) AND ((a = (InitPlan 1).col1) OR
(a = (InitPlan 2).col1)))

Those references to (InitPlan 1).col1 are actually Params. I think
what's happening here is approximately that the worker tries to find
the source of those Params, but (InitPlan 1) is above the Gather node
and thus not available to the worker, and so the worker can't find it
and the assertion fails.

In one sense, it is not hard to fix this: the workers shouldn't really
be doing progressive_explain at all, because then we get a
progressive_explain from each process individually instead of one for
the query as a whole, so we could just think of having the workers
ignore the progressive_explain GUC. However, one thing I realized
earlier this morning is that the progressive EXPLAIN can't show any of
the instrumentation that is relayed back from workers to the leader
only at the end of the execution. See the code in ParallelQueryMain()
just after ExecutorFinish().
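
The relevant part of ParallelQueryMain() is, paraphrased (not verbatim):

/* worker runs its plan fragment, then shuts the executor down */
ExecutorFinish(queryDesc);

/*
 * Per-node instrumentation is copied back into shared memory only here,
 * after the worker has finished, so the leader cannot see the workers'
 * counters while the query is still executing.
 */
if (instrumentation != NULL)
    ExecParallelReportInstrumentation(queryDesc->planstate,
                                      instrumentation);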

What I think this means is that this patch needs significant
rethinking to cope with parallel query. I don't think users are going
to be happy with a progressive EXPLAIN that just ignores what the
workers are doing, and I don't think they're going to be happy with N
separate progressive explains that they have to merge together to get
an overall picture of what the query is doing, and I'm absolutely
positive that they're not going to be happy with something that
crashes. I think there may be a way to come up with a good design here
that avoids these problems, but we definitely don't have time before
feature freeze (not that we were necessarily in a great place to think
of committing this before feature freeze anyway, but it definitely
doesn't make sense now that I understand this problem).

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Proposal: Progressive explain

From
Rafael Thofehrn Castro
Date:
My bad, I mistakenly did the tests without assertions enabled over the last 2 days,
so I couldn't catch that assertion failure. Was able to reproduce it, thanks.

I guess when the code was designed we were not expecting to be doing explains
in parallel workers.

One comment is that this has nothing to do with instrumentation. So the hacks
done with instrument objects are not to blame here.

What is interesting is that removing that Assert (not saying we should do that)
fixes it, and there don't seem to be any other Asserts complaining anywhere. The
feature works as expected and there are no crashes.

> What I think this means is that this patch needs significant
> rethinking to cope with parallel query. I don't think users are going
> to be happy with a progressive EXPLAIN that just ignores what the
> workers are doing, and I don't think they're going to be happy with N
> separate progressive explains that they have to merge together to get
> an overall picture of what the query is doing, and I'm absolutely
> positive that they're not going to be happy with something that
> crashes. I think there may be a way to come up with a good design here
> that avoids these problems, but we definitely don't have time before
> feature freeze (not that we were necessarily in a great place to think
> of committing this before feature freeze anyway, but it definitely
> doesn't make sense now that I understand this problem).

Yeah, that is a fair point. But I actually envisioned this feature targeting
parallel workers too, from the start. I see a lot of value in being able
to see inside the black box of each parallel worker. It could be
that a worker is lagging behind the others because it is dealing with non-cached
data blocks, for example.

According to my tests the parallel workers can actually push instrumentation
to the parent while still running. It all depends on the types of operations
being performed in the plan.

For example, if the parallel worker is doing a parallel seq scan, then it will
continuously send chunks of rows to the parent and instrumentation goes with
the data. So we don't actually need to wait for the workers to finish before
instrumentation on the parent gets updated. Here is what I got from Torikoshi's
test after removing that Assert and enabling instrumented progressive
explain (progressive_explain_interval = '10ms'):

-[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------------------------------------------
datid       | 5
datname     | postgres
pid         | 169555
last_update | 2025-04-01 12:44:07.36775-03
query_plan  | Gather  (cost=0.02..137998.03 rows=10000002 width=8) (actual time=0.715..868.619 rows=9232106.00 loops=1) (current)                             +
            |   Workers Planned: 2                                                                                                                            +
            |   Workers Launched: 2                                                                                                                           +
            |   InitPlan 1                                                                                                                                    +
            |     ->  Result  (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1.00 loops=1)                                                   +
            |   InitPlan 2                                                                                                                                    +
            |     ->  Result  (cost=0.00..0.01 rows=1 width=4) (actual time=0.000..0.000 rows=1.00 loops=1)                                                   +
            |   ->  Parallel Append  (cost=0.00..137998.01 rows=4166669 width=8) (actual time=0.364..264.804 rows=1533011.00 loops=1)                         +
            |         ->  Seq Scan on ab_a2_b1 ab_2  (cost=0.00..0.00 rows=1 width=8) (never executed)                                                        +
            |               Filter: ((b = 1) AND ((a = (InitPlan 1).col1) OR (a = (InitPlan 2).col1)))                                                        +
            |         ->  Seq Scan on ab_a3_b1 ab_3  (cost=0.00..0.00 rows=1 width=8) (never executed)                                                        +
            |               Filter: ((b = 1) AND ((a = (InitPlan 1).col1) OR (a = (InitPlan 2).col1)))                                                        +
            |         ->  Parallel Seq Scan on ab_a1_b1 ab_1  (cost=0.00..117164.67 rows=4166667 width=8) (actual time=0.361..192.896 rows=1533011.00 loops=1)+
            |               Filter: ((b = 1) AND ((a = (InitPlan 1).col1) OR (a = (InitPlan 2).col1)))                                                        +
            |
-[ RECORD 2 ]-------------------------------------------------------------------------------------------------------------------------------------------------
datid       | 5
datname     | postgres
pid         | 169596
last_update | 2025-04-01 12:44:07.361846-03
query_plan  | Parallel Append  (cost=0.00..137998.01 rows=4166669 width=8) (actual time=0.990..666.845 rows=3839515.00 loops=1) (current)                     +
            |   ->  Seq Scan on ab_a2_b1 ab_2  (cost=0.00..0.00 rows=1 width=8) (never executed)                                                              +
            |         Filter: ((b = 1) AND ((a = $0) OR (a = $1)))                                                                                            +
            |   ->  Seq Scan on ab_a3_b1 ab_3  (cost=0.00..0.00 rows=1 width=8) (never executed)                                                              +
            |         Filter: ((b = 1) AND ((a = $0) OR (a = $1)))                                                                                            +
            |   ->  Parallel Seq Scan on ab_a1_b1 ab_1  (cost=0.00..117164.67 rows=4166667 width=8) (actual time=0.985..480.531 rows=3839515.00 loops=1)      +
            |         Filter: ((b = 1) AND ((a = $0) OR (a = $1)))                                                                                            +
            |
-[ RECORD 3 ]-------------------------------------------------------------------------------------------------------------------------------------------------
datid       | 5
datname     | postgres
pid         | 169597
last_update | 2025-04-01 12:44:07.36181-03
query_plan  | Parallel Append  (cost=0.00..137998.01 rows=4166669 width=8) (actual time=1.003..669.398 rows=3830293.00 loops=1) (current)                     +
            |   ->  Seq Scan on ab_a2_b1 ab_2  (cost=0.00..0.00 rows=1 width=8) (never executed)                                                              +
            |         Filter: ((b = 1) AND ((a = $0) OR (a = $1)))                                                                                            +
            |   ->  Seq Scan on ab_a3_b1 ab_3  (cost=0.00..0.00 rows=1 width=8) (actual time=0.019..0.019 rows=0.00 loops=1)                                  +
            |         Filter: ((b = 1) AND ((a = $0) OR (a = $1)))                                                                                            +
            |   ->  Parallel Seq Scan on ab_a1_b1 ab_1  (cost=0.00..117164.67 rows=4166667 width=8) (actual time=0.979..482.420 rows=3830293.00 loops=1)      +
            |         Filter: ((b = 1) AND ((a = $0) OR (a = $1)))                                                                                            +
            |

But yes, if the parallel worker is doing a hash join and the Hash node is a huge
block of sub-operations, then upstream will not see anything until the hash
is built. That is why, IMO, having visibility into the background workers has value
too.

I even designed a query visualizer that combines the plans of the parent and the
parallel workers in side-by-side windows, along with other stats (per-process
wait events, CPU, memory consumption, temp file generation), allowing us to
correlate them with specific operations being done in the plans.

But Robert is right, the engineering required to make this happen isn't
easy. Changing that Assert could open a Pandora's box of other issues and we
have no time to do that for PG18.

Rafael.