Re: [PERFORM] Query planner gaining the ability to replanning after start of query execution. - Mailing list pgsql-performance

From Arne Roland
Subject Re: [PERFORM] Query planner gaining the ability to replanning after start of query execution.
Date
Msg-id a6070ab08105462b985e10d6928e6c12@index.de
Whole thread Raw
In response to Re: [PERFORM] Query planner gaining the ability to replanning after start of query execution.  (Oliver Mattos <omattos@gmail.com>)
List pgsql-performance
Hello,

That method is bound to introduce errors if the physical location of a row correlates strongly with a column: imagine "and my_timestamp > now() - INTERVAL '1 year'" as part of the where clause of a query without a limit, executed as a seqscan on an insert-only table that is one and a half years old. There might be similar effects if an index on the timestamp is used to execute the query and other predicates of the filter correlate with it.
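A toy simulation may make the pitfall concrete (all names and numbers here are hypothetical, chosen only to illustrate the correlation effect, not taken from PostgreSQL):

```python
# An insert-only table is 18 months old, rows sit in insertion (timestamp)
# order, and the filter keeps only the most recent 12 months. The true
# selectivity is 12/18, but every row in the first third of the seqscan
# fails the predicate, so early runtime feedback reports selectivity zero.
TOTAL_ROWS = 18_000            # e.g. one row per hour over 18 months
cutoff = TOTAL_ROWS // 3       # the oldest 6 of 18 months fail the filter

def matches(row_position: int) -> bool:
    # Physical position correlates perfectly with my_timestamp here.
    return row_position >= cutoff

def observed_selectivity(rows_scanned: int) -> float:
    hits = sum(matches(i) for i in range(rows_scanned))
    return hits / rows_scanned

true_selectivity = 2 / 3
early = observed_selectivity(1000)        # feedback after 1000 rows: 0.0
late = observed_selectivity(TOTAL_ROWS)   # only the full scan sees 2/3
```

Replanning based on `early` would wrongly conclude the predicate is far more selective than it is, which is exactly the error described above.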
 

This method furthermore only judges filter predicates.
So it's not that easy to just go by the expected rowcount of a single node, since the underlying node might return a totally inaccurate number of rows. While that isn't that common with underlying seqscans, it is very frequent when an indexscan is used without being able to rely on the MCV list. While it's still possible to notice a misestimation, there is no sense of how wrong it is until the rows are already processed.
 

Furthermore: Did you think about parallel plans and most importantly cursors?

Best regards
Arne Roland

-----Original Message-----
From: Oliver Mattos [mailto:omattos@gmail.com] 
Sent: Monday, November 13, 2017 10:52 PM
To: Arne Roland <A.Roland@index.de>
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Query planner gaining the ability to replanning after start of query execution.

> Can you be more elaborate how you'd want to go about it?

My initial approach would be to try to identify places in the plan
where selectivity is seriously over- or underestimated.   I would reuse
the instrumentation infrastructure's counts of filtered and returned tuples for each execnode, and periodically call back into the planner (for example at every power of two tuples processed).
 

The planner would have a wrapper to clauselist_selectivity which somehow combines the existing estimate with the filtered and returned tuples so far.   Exactly how to merge them isn't clear, but I could imagine using a Poisson distribution to calculate the probability that the selectivity estimate is representative of the filtered and returned numbers, and then blending the two linearly based on that estimate.
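One way such a blend might look, sketched in Python rather than as PostgreSQL code (the function name, the normalized Poisson-likelihood weight, and the linear mix are all assumptions standing in for the "somehow" above):

```python
import math

def blend_selectivity(planner_est: float, filtered: int, returned: int) -> float:
    """Blend the planner's selectivity estimate with runtime feedback.

    Compute the observed selectivity from an execnode's filtered/returned
    counts, weight the planner's estimate by how plausibly it explains
    those counts (a Poisson likelihood, normalized by the likelihood of
    the most probable count so it lies in (0, 1]), and mix linearly.
    """
    seen = filtered + returned
    if seen == 0:
        return planner_est          # no feedback yet: trust the planner
    observed = returned / seen
    lam = planner_est * seen        # expected number of returned tuples

    def pois(k: int, lam: float) -> float:
        return math.exp(-lam) * lam ** k / math.factorial(k)

    if lam > 0:
        rep = pois(returned, lam) / max(pois(int(lam), lam), 1e-12)
    else:
        rep = 1.0 if returned == 0 else 0.0
    return rep * planner_est + (1.0 - rep) * observed
```

When the observed counts match the estimate the weight stays near 1 and the planner's number survives; when they are wildly inconsistent the weight collapses and the observed selectivity dominates.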
 

When the planner has re-estimated the cost of the current plan, a discount would be applied for the percentage of each execnode completed (rows processed / estimated rows), and all candidate plans compared.
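The discount-and-compare step might be sketched like this (hypothetical helper, not PostgreSQL's costing API; a single-node plan is assumed for simplicity):

```python
def remaining_cost(total_cost: float, rows_processed: int, est_rows: int) -> float:
    """Discount a plan's re-estimated cost by the fraction already
    executed (rows processed / estimated rows), capped at fully done."""
    done = min(rows_processed / max(est_rows, 1), 1.0)
    return total_cost * (1.0 - done)

# The partially executed current plan only pays for what is left,
# while a fresh candidate pays its full cost from scratch.
current = remaining_cost(total_cost=1000.0, rows_processed=900, est_rows=1000)
candidate = 400.0
keep_current = current <= candidate   # ~100 remaining beats 400 fresh
```

The cap matters: once an execnode has processed more rows than estimated, the naive fraction would go negative, so the remaining cost is clamped at zero rather than turning into a credit.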
 

If another candidate plan is now lower cost, the current plan would be terminated[1] by setting a flag instructing each execnode to return as if it had reached the end of its input, although still caching the node selectivity values, and the new plan started from scratch.
 

The old plan is kept within the query planner's candidate list, together with its cached selectivity values.  If at some point it again becomes cheaper, it is restarted from scratch too.
 


> Even if we know the cardinality is overestimated, we have no idea 
> whether the cardinality of a = 3 or b = 40 is wrong or they just 
> correlate

The goal would be not to know which is wrong, but to try each, discarding it if it turns out worse than we estimated.  Processing a few hundred rows of each of 5 plans is tiny compared to a scan of 1M rows...
 


[1]:   An improvement here (with much more code complexity) is to keep multiple partially executed plans around, so that whichever one is most promising can be worked on, but can be halted and resumed later as selectivity (and hence cost) estimates change.
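The bookkeeping for that footnote resembles a priority queue of plans keyed by estimated remaining cost: always resume the cheapest, and push a plan back whenever feedback changes its estimate. A minimal sketch (hypothetical scheduler, not executor code; plan names invented for illustration):

```python
import heapq

class PlanEntry:
    """A partially executed plan, ordered by estimated remaining cost."""
    def __init__(self, name: str, remaining_cost: float):
        self.name = name
        self.remaining_cost = remaining_cost

    def __lt__(self, other: "PlanEntry") -> bool:
        return self.remaining_cost < other.remaining_cost

frontier: list[PlanEntry] = []
heapq.heappush(frontier, PlanEntry("seqscan", 500.0))
heapq.heappush(frontier, PlanEntry("indexscan", 120.0))

best = heapq.heappop(frontier)    # resume the most promising plan
# ... execute a batch of tuples; feedback shows the estimate was optimistic ...
best.remaining_cost = 800.0       # re-estimated cost after the batch
heapq.heappush(frontier, best)    # halt it and reconsider
nxt = heapq.heappop(frontier)     # now the seqscan plan is resumed instead
```

Each halted plan keeps its executor state, so resuming it later costs only what remains, which is what makes keeping several plans alive potentially cheaper than restarting from scratch.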
 

On Mon, Nov 13, 2017 at 8:06 PM, Arne Roland <A.Roland@index.de> wrote:
> Hello,
>
> I'd love to have some sort of dynamic query feedback, yet it's very complicated to do it right. I am not convinced that changing the plan during a single execution is the right way to do it, not only because it sounds intrusive to do crazy things in the executor, but also because I don't understand why the new plan should be any better than the old one. Can you be more elaborate about how you'd want to go about it?
 
>
> In your example (which presumably just has a single relation), we have 
> no notion of whether the scan returns no rows because we were unlucky, 
> because just the first few pages were empty of matching rows (which in my experience happens more often), or because the cardinality estimation is wrong. Even if the cardinality estimation is wrong, we have no notion of which predicate or predicate combination actually caused the misestimation. If the first few pages were empty, the same might happen with every order (so also with every available indexscan). Imagine a very simple seqscan plan of "select * from mytab where a = 3 and b = 40 limit 1". Even if we know the cardinality is overestimated, we have no idea whether the cardinality of a = 3 or b = 40 is wrong or whether they just correlate, so there is no notion of which is actually the cheapest plan. The usual workaround for most of these queries is to add an order by (which has the nice addition of producing a deterministic result) with an appropriately complex index, usually resulting in indexscans.
 
>
> While we actually know more after the first execution of nodes like materialize, sort or hash nodes, I rarely encounter materialize nodes in the wild. Consequently that is the place where the work is usually already done, which is especially true with the hash node. Even though it still might be more optimal to switch from a mergejoin to a hashjoin in some cases, I doubt that's worth any work (and even less the maintenance).
 
>
> Best regards
> Arne Roland
>
> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org 
> [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Oliver 
> Mattos
> Sent: Monday, November 13, 2017 5:45 PM
> To: pgsql-performance@postgresql.org
> Subject: [PERFORM] Query planner gaining the ability to replanning after start of query execution.
>
> I am interested in giving the query planner the ability to replan (or re-rank plans) after query execution has begun, based on the progression of the query so far.
>
> Example use case:
>
> *  A LIMIT 1 query is planned using an expensive scan which the planner expects to return a large number of results, and to terminate
> early.   The reality is the query actually produces no results, and
> the scan must run to completion, potentially taking thousands of times longer than expected.
>
> *  If this plan's costs were adjusted mid-execution to reflect the fact that the scan is producing far fewer rows than expected, then another query plan might come out ahead, which would complete far faster.
>
>
> Has this been done before?   Are there any pitfalls to beware of?
>
>
> --
> Sent via pgsql-performance mailing list 
> (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>
>
>
>




