Re: Adaptive query optimization - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Adaptive query optimization
Date
Msg-id 20190613132257.ppchci2izg62x63z@development
Whole thread Raw
In response to Re: Adaptive query optimization  (Kuntal Ghosh <kuntalghosh.2007@gmail.com>)
List pgsql-hackers
On Thu, Jun 13, 2019 at 09:37:07AM +0530, Kuntal Ghosh wrote:
>On Thu, Jun 13, 2019 at 5:49 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> For example, we might require 1000 samples for a given node (say, scan
>> with some quals), before we start using it to tweak the estimates. Once
>> we get the number of estimates, we can continue collecting more data,
>> and once in a while update the correction. This would require some care,
>> of course, because we need to know what coefficient was used to compute
>> the estimate, but that's solvable by having some sort of epoch.
>>
>> Of course, the question is what number should we use, but overall this
>> would be a much lower-overhead way to do the learning.
>>
>> Unfortunately, the learning as implemented in the patch does not allow
>> this. It pretty much requires dedicated learning phase with generated
>> workload, in a single process.
>>
>> But I think that's solvable, assuming we:
>>
>> 1) Store the data in shared memory, instead of a file. Collect data from
>> all backends, instead of just a single one, etc.
>>
>> 2) Make the decision for individual entries, depending on how many
>> samples we have for it.
>>
>Sounds good. I was trying to think whether we can maintain a running
>coefficient. In that way, we don't have to store the samples. But,
>calculating a running coefficient for more than two variables (with
>some single pass algorithm) seems to be a hard problem. Moreover, it
>can introduce significant misestimation. Your suggested approach works
>better.
>

I don't know, TBH. I think it would be enough to store the coefficient and
the number of samples it's based on, so that you can consider that as a
weight when merging it with additional values. But I don't think it's a
solved issue, so we may need to experiment a bit.

>> I suggest we try to solve one issue at a time. I agree advising which
>> indexes to create is a very interesting (and valuable) thing, but I see
>> it as an extension of the AQO feature. That is, basic AQO (tweaking row
>> estimates) can work without it.
>>
>+1
>
>> >> Now, if someone uses this same scan in a join, like for example
>> >>
>> >>    SELECT * FROM t1 JOIN t2 ON (t1.id = t2.id)
>> >>     WHERE (t1.a = ? AND t1.b = ? AND t1.c < ?)
>> >>       AND (t2.x = ? AND t2.y = ?)
>> >>
>> >> then we can still apply the same correction to the t1 scan (I think).
>> >> But then we can also collect data for the t1-t2 join, and compute a
>> >> correction coefficient in a similar way. It requires a bit of care
>> >> because we need to compensate for misestimates of inputs, but I think
>> >> that's doable.
>> >>
>> >That'll be an interesting work. For the above query, we can definitely
>> >calculate the correction coefficient of t1-t2 join given (t1.a = ? AND
>> >t1.b = ? AND t1.c < ?) and
>> >(t2.x = ? AND t2.y = ?) are true. But, I'm not sure how we can
>> >extrapolate that value for t1-t2 join.
>>
>> I'm not sure I see the problem? Essentially, we need to know the sizes
>> of the join inputs, i.e.
>>
>>     t1 WHERE (t1.a = ? AND t1.b = ? AND t1.c < ?)
>>
>>     t2 WHERE (t2.x = ? AND t2.y = ?)
>>
>> (which we know, and we know how to correct the estimate), and then the
>> selectivity of the join condition. Which we also know.
>>
>> Obviously, there's a chance those parts (clauses at the scan / join
>> level) are correlated, which could make this less accurate.
>This is exactly what my concern is. The base predicate selectivities
>of t1 and t2 should have an impact on the calculation of the
>correction coefficient. If those selectivities are low, the
>misestimation (which is actual/estimate) should not affect the t1-t2
>join correction coefficient much.
>

The question is whether it really matters. The question is whether this
correlation between restriction and join clauses is universal (applies to
most queries) or an exception.

If it's an exception (only for a small number of rarely queried values),
then we have little chance to fix it. If we ever get extended statistics
on joins, that might help, but I think AQO alone is unlikely to help.

OTOH if it's a systemic misestimate (affecting most queries), then we'll
catch it just fine.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




pgsql-hackers by date:

Previous
From: Rafia Sabih
Date:
Subject: Re: Adaptive query optimization
Next
From: Tom Lane
Date:
Subject: Re: Update list of combining characters