Re: generic plans and "initial" pruning - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: generic plans and "initial" pruning
Date
Msg-id 684c70d7-180e-461d-9377-600c2db581ba@vondra.me
In response to Re: generic plans and "initial" pruning  (Amit Langote <amitlangote09@gmail.com>)
Responses Re: generic plans and "initial" pruning
List pgsql-hackers

On 12/4/24 14:34, Amit Langote wrote:
> Hi Tomas,
> 
> On Mon, Dec 2, 2024 at 3:36 AM Tomas Vondra <tomas@vondra.me> wrote:
>> Hi,
>>
>> I took a look at this patch, mostly to familiarize myself with the
>> pruning etc. I have a bunch of comments, but all of that is minor,
>> perhaps even nitpicking - with prior feedback from David, Tom and
>> Robert, I can't really compete with that.
> 
> Thanks for looking at this.  These are helpful.
> 
>> FWIW the patch needs a rebase, there's a minor bitrot - but it was
>> simple enough to fix for a review / testing.
>>
>>
>> 0001
>> ----
>>
>> 1) But if we don't expect this error to actually happen, do we really
>> need to make it ereport()? Maybe it should be plain elog(). I mean, it's
>> "can't happen" and thus doesn't need translations etc.
>>
>>     if (!bms_equal(relids, pruneinfo->relids))
>>         ereport(ERROR,
>>                 errcode(ERRCODE_INTERNAL_ERROR),
>>                 errmsg_internal("mismatching PartitionPruneInfo found at
>> part_prune_index %d",
>>                                 part_prune_index),
>>                 errdetail_internal("plan node relids %s, pruneinfo
>> relids %s",
>>                                    bmsToString(relids),
>>                                    bmsToString(pruneinfo->relids)));
> 
> I'm fine with elog() here even if it causes the message to be longer:
> 
> elog(ERROR, "mismatching PartitionPruneInfo found at part_prune_index
> %d (plan node relids %s, pruneinfo relids %s)",
>      part_prune_index, bmsToString(relids), bmsToString(pruneinfo->relids));
> 

I'm not forcing you to use elog() if you think ereport() is better. I'm
only asking because AFAIK the "policy" is that ereport() is for cases
that we think can happen (and thus get translated), while elog(ERROR) is
for cases that we believe shouldn't happen.

So every time I see an ereport() I ask myself "how could this happen?",
and here I don't see how it could.

>> Perhaps it should even be an assert?
> 
> I am not sure about that.  Having a message handy might be good if a
> user ends up hitting this case for whatever reason, like trying to run
> a corrupted plan.
> 

I'm a bit skeptical about this, TBH. If we assume the plan is
"corrupted", why would we notice in this particular place? It could be
corrupted in a million different ways, and the chance that it got
through all the earlier steps is maybe one in a million.

>> 2) unnecessary newline added to execPartition.h
> 
> Perhaps you meant "removed".  Fixed.
> 

Yes, sorry. I misread the diff.

>> 5) PlannerGlobal
>>
>>     /* List of PartitionPruneInfo contained in the plan */
>>     List       *partPruneInfos;
>>
>> Why does this say "contained in the plan" unlike the other fields? Is
>> there some sort of difference? I'm not saying it's wrong.
> 
> Ok, maybe the following is a bit more helpful and like the comment for
> other fields:
> 
>     /* "flat" list of PartitionPruneInfos */
>     List       *partPruneInfos;
> 

WFM

>> 0002
>> ----
>>
>> 1) Isn't it weird/undesirable partkey_datum_from_expr() loses some of
>> the asserts? Would the assert be incorrect in the new implementation, or
>> are we removing it simply because we happen to not have one of the fields?
> 
> The former -- the asserts would be incorrect in the new implementation
> -- because in the new implementation a standalone ExprContext is used
> that is independent of the parent PlanState (when available) for both
> types of runtime pruning.
> 
> The old asserts, particularly the second one, weren't asserting
> something very useful anyway, IMO.  What I mean is that the
> ExprContext provided in the PartitionPruneContext to be the same as
> the parent PlanState's ps_ExprContext isn't critical to the code that
> follows.  Nor whether the PlanState is available or not.
> 

OK, thanks for explaining.

>> 2) inconsistent spelling: run-time vs. runtime
> 
> I assume you meant in this comment:
> 
> * estate                       The EState for the query doing runtime pruning
> 
> Fixed by using run-time, which is a more commonly used term in the
> source code than runtime.
> 

Not quite. I was looking at runtime/run-time in the patch files, but now
I realize some of that is preexisting ... Still, maybe the patch should
stick to one spelling.

>> 2) I'm not quite sure what "exec" partition pruning is?
>>
>> /*
>>  * ExecInitPartitionPruning
>>  *   Initialize the data structures needed for runtime "exec" partition
>>  *   pruning and return the result of initial pruning, if available.
>>
>> Is that the same thing as "runtime pruning"?
> 
> "Exec" pruning refers to pruning performed during execution, using
> PARAM_EXEC parameters. In contrast, "init" pruning occurs during plan
> initialization, using parameters whose values remain constant during
> execution, such as PARAM_EXTERN parameters and stable functions.
> 
> Before this patch, the ExecInitPartitionPruning function, called
> during ExecutorStart(), performed "init" pruning and set up state in
> the PartitionPruneState for subsequent "exec" pruning during
> ExecutorRun(). With this patch, "init" pruning is performed well
> before this function is called, leaving its sole responsibility to
> setting up the state for "exec" pruning. It may be worth renaming the
> function to better reflect this new role, rather than updating only
> the comment.
> 
> Actually, that is what I decided to do in the attached, along with
> some other adjustments like moving ExecDoInitialPruning() to
> execPartition.c from execMain.c, fixing up some obsolete comments,
> etc.
> 

I don't see any attachment :-(

Anyway, if I understand correctly, the "runtime pruning" has two
separate cases - initial pruning and exec pruning. Is that right?
> 
>>
>> 2) It may not be quite clear why ExecInitUpdateProjection() switches to
>> mt_updateColnosLists. Should that be explained in a comment, somewhere?
> 
> There is a comment in the ModifyTableState struct definition:
> 
>     /*
>      * List of valid updateColnosLists.  Contains only those belonging to
>      * unpruned relations from ModifyTable.updateColnosLists.
>      */
>     List       *mt_updateColnosLists;
> 
> It seems redundant to reiterate this in ExecInitUpdateProjection().
> 

Ah, I see. Makes sense.

> 
>> 0005
>> ----
>>
>> 1) auto_explain.c - So what happens if the plan gets invalidated? The
>> hook explain_ExecutorStart returns early, but then what? Does that break
>> the user session somehow, or what?
> 
> It will get called again after ExecutorStartExt() loops back to do
> ExecutorStart() with a new updated plan tree.
> 
>> 2) Isn't it a bit fragile if this requires every extension to update
>> and add the ExecPlanStillValid() calls to various places?
> 
> The ExecPlanStillValid() call only needs to be added immediately after
> the call to standard_ExecutorStart() in an extension's
> ExecutorStart_hook() implementation.
> 
>> What if an
>> extension doesn't do that? What weirdness will happen?
> 
> The QueryDesc.planstate won't contain a PlanState tree for starters
> and other state information that InitPlan() populates in EState based
> on the PlannedStmt.
> 

OK, and the consequence is that the query will fail, right?

>> Maybe it'd be
>> possible to at least check this in some other executor hook? Or at least
>> we could ensure the check was done in assert-enabled builds? Or
>> something to make extension authors aware of this?
> 
> I've added a note in the commit message, but if that's not enough, one
> idea might be to change the return type of ExecutorStart_hook so that
> the extensions that implement it are forced to be adjusted. Say, from
> void to bool to indicate whether standard_ExecutorStart() succeeded
> and thus created a "valid" plan.  I had that in the previous versions
> of the patch.  Thoughts?
> 

Maybe. My concern is that this case (plan getting invalidated) is fairly
rare, so it's entirely plausible the extension will seem to work just
fine without the code update for a long time.

Sure, changing the API is allowed; I'm just wondering whether there
might be a way to avoid the issue entirely, or at least to notice the
missing call early.

I haven't tried it, but wouldn't it be better to modify ExecutorStart()
to do the retries internally? Then extensions wouldn't need to check
whether the plan is still valid; ExecutorStart() would take care of
that. It might need some new arguments, but that seems more obvious.

>> Aside from going through the patches, I did a simple benchmark to see
>> how this works in practice. I did a simple test, with pgbench -S and
>> variable number of partitions/clients. I also varied the number of locks
>> per transaction, because I was wondering if it may interact with the
>> fast-path improvements. See the attached xeon.sh script and CSV with
>> results from the 44/88-core machine.
>>
>> There's also two PDFs visualizing the results, to show the impact as a
>> difference between "master" (no patches) vs. "pruning" build with v57
>> applied. As usual, "green" is good (faster), red is "bad" (slower).
>>
>> For most combinations of parameters, there's no impact on throughput.
>> Anything in 99-101% is just regular noise, possibly even more. I'm
>> trying to reduce the noise a bit more, but this seems acceptable. I'd
>> like to discuss three "cases" I see in the results:
> 
> Thanks for doing these benchmarks.  I'll reply separately to discuss
> the individual cases.
> 
>> costing / auto mode
>> -------------------
>>
>> Anyway, this leads me to a related question - not quite a "bug" in the
>> patch, but something to perhaps think about. And that's costing, and
>> what "auto" should do.
>>
>> There are two PNG charts, showing throughput for runs with -M prepared
>> and 1000 partitions. Each chart shows throughput for the three cache
>> modes, and different client counts. There's a clear distinction between
>> "master" and "patched" runs - the "generic" plans performed terribly, by
>> orders of magnitude. With the patches it beats the "custom" plans.
>>
>> Which is great! But it also means that while "auto" used to do the right
>> thing, with the patches that's not the case.
>>
>> AFAIK that's because we don't consider the runtime pruning when costing
>> the plans, so the cost is calculated as if no pruning happened. And so
>> it seems way more expensive than it should ... and it loses with the
>> custom plans. Is that correct, or do I understand this wrong?
> 
> That's correct. The planner does not consider runtime pruning when
> assigning costs to Append or MergeAppend paths in
> create_{merge}append_path().
> 
>> Just to be clear, I'm not claiming the patch has to deal with this. I
>> suppose it can be handled as a future improvement, and I'm not even sure
>> there's a good way to consider this during costing. For example, can we
>> estimate how many partitions will be pruned?
> 
> There have been discussions about this in the 2017 development thread
> of run-time pruning [1] and likely at some later point in other
> threads.  One simple approach mentioned at [1] is to consider that
> only 1 partition will be scanned for queries containing WHERE partkey
> = $1, because only 1 partition can contain matching rows with that
> condition.
> 
> I agree that this should be dealt with sooner than later so users get
> generic plans even without having to use force_generic_plan.
> 
> I'll post the updated patches tomorrow.
> 

Cool, thanks!


regards
-- 
Tomas Vondra



