Re: Adaptive query optimization - Mailing list pgsql-hackers
From:           Konstantin Knizhnik
Subject:        Re: Adaptive query optimization
Date:
Msg-id:         b7644989-9c35-d0f2-2775-28f9c886c2f5@postgrespro.ru
In response to: Re: Adaptive query optimization (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses:      Re: Adaptive query optimization
List:           pgsql-hackers
On 12.06.2019 0:43, Tomas Vondra wrote:
>> What is wrong with the learning phase is that it requires some DBA assistance: somebody should
>> determine when to start learning, provide the relevant workload and determine when learning is finished.
>
> I don't think "learning phase" is an issue, in fact I think that's
> something we need to do - it ensures we have enough data to make good
> decisions.
One of the most recent trends in DBMSes is autonomous databases with zero administration effort.
It is especially important for clouds. And one of the main advantages of AQO is that it allows
queries to be optimized without human interaction.
But unfortunately I really do not know how to avoid the learning phase, especially if we want to
run queries at a replica. This is why my AQO implementation stores the collected data in a file.

>> 2. It saves collected data in Postgres tables, which makes a read-only transaction executing
>> only queries become a read-write transaction, obtaining an XID...
> Yeah, that's an issue because it makes it useless on standbys etc. I
> think it'd be enough to do something similar to pg_stat_statements, i.e.
> store it in memory and flush it to disk once in a while.
>> 3. It doesn't take into account the concrete values of literals used in clauses, so it is not
>> able to address data skew.
> Yep. I don't think it's necessarily an issue with all approaches to
> adaptive optimization, though. But I agree we should detect both
> systemic estimation issues, and misestimates for particular parameter
> values. I think that's doable.

>> 4. Machine learning can be quite expensive and seems to be overkill if we just want to adjust
>> selectivities according to the actual number of affected rows.
> I think that depends - some machine learning approaches are not that
> bad. But I think there's a more serious issue - explainability. We need
> a solution where we can explain/justify why it makes some decisions. I
> really don't want a black box that produces numbers that you just need
> to take at face value.
> The good thing is that the simpler the method, the less expensive and
> more explainable it is.

>> I tried to create a much simpler version of AQO based on the auto_explain extension.
>> This extension provides all the necessary infrastructure to analyze statements with long
>> execution time. I have added two new modes to auto_explain:
>> 1. Auto generation of multicolumn statistics for variables used in clauses with large
>> estimation error.
> Interesting! I probably wouldn't consider this part of adaptive query
> optimization, but it probably makes sense to make it part of this. I
> wonder if we might improve this to also suggest "missing" indexes?
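To make the first mode a bit more concrete: when the estimation error for a conjunction of clauses
such as (a = ? AND b = ?) is large, what the extension generates is essentially the standard
multicolumn statistics for the involved columns, i.e. something like this (the table and column
names below are just an example):

    CREATE STATISTICS t1_a_b ON a, b FROM t1;  -- extended statistics on the correlated columns
    ANALYZE t1;                                -- collect the new statistics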
I think that it should be the next step of adaptive query optimization:
- autogeneration of indexes
- auto adjustment of optimizer cost parameters (CPU cost, random/sequential page access cost, ...)
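(By cost parameters I mean the usual GUCs; the values below are just an illustration of what such
auto-adjustment could end up setting:)

    SET seq_page_cost = 1.0;
    SET random_page_cost = 1.1;   -- e.g. much lower than the default 4.0 for SSD or mostly cached data
    SET cpu_tuple_cost = 0.01;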
There is already the hypothetical index extension https://github.com/HypoPG/hypopg,
which can be used to estimate the effect of introducing new indexes.
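As far as I remember, its interface is roughly the following (the table and columns are again just
an example):

    CREATE EXTENSION hypopg;
    SELECT * FROM hypopg_create_index('CREATE INDEX ON t1 (a, b)');
    EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2;   -- plain EXPLAIN can use the hypothetical index
    SELECT hypopg_reset();                            -- drop all hypothetical indexes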
> Right. But I think I might have an idea how to address (some of) this.
>
> As I already mentioned, I was experimenting with something similar,
> maybe two or three years ago (I remember chatting about it with Teodor
> at pgcon last week). I was facing the same issues, and my approach was
> based on hooks too.
>
> But my idea was not to track stats for a plan as a whole, but instead
> decompose it into individual nodes, categorized into three basic groups -
> scans, joins and aggregations. And then apply this extracted information
> to other plans, with "matching" nodes.
>
> For example, let's consider a simple "single-scan" query
>
>    SELECT * FROM t1 WHERE a = ? AND b = ? AND c < ?;
>
> Now, if you execute this enough times (say, 100x or 1000x), tracking
> the estimates and actual row counts, you may then compute the average
> misestimate (maybe a geometric mean would be more appropriate here?):
>
>    AVG(actual/estimate)
Certainly, stats should be collected for each plan node, not for the whole plan,
and this is what is done now in both Oleg's and my implementation.
Oleg is using a gradient descent method. I first tried to calculate the average, but then found
out that building something like a "histogram", where the bin is determined as log10 of the
estimated number of rows, works better.
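To make this a bit more concrete: assuming the per-node data is kept in some table (the table and
column names below are invented for the example), both the geometric mean suggested above and such
a log10 "histogram" can be computed with a query like:

    -- invented table: one row per executed plan node
    --   node_stats(node_type text, rels text, estimated float8, actual float8)
    SELECT floor(log(estimated))            AS bin,         -- bin = log10 of the estimated row count
           exp(avg(ln(actual / estimated))) AS correction,  -- geometric mean of actual/estimate
           count(*)                         AS n
    FROM node_stats
    WHERE node_type = 'Seq Scan' AND rels = 't1'
      AND estimated > 0 AND actual > 0
    GROUP BY 1
    ORDER BY 1;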
> and if this is significantly different from 1.0, then we can say there's
> a systemic misestimate, and we can use this as a correction coefficient
> when computing the scan estimate. (And we need to be careful about
> collecting new data, because the estimates will include this correction.
> But that can be done by tracking the "epoch" of the plan.)
>
> Now, if someone uses this same scan in a join, like for example
>
>    SELECT * FROM t1 JOIN t2 ON (t1.id = t2.id)
>     WHERE (t1.a = ? AND t1.b = ? AND t1.c < ?)
>       AND (t2.x = ? AND t2.y = ?)
>
> then we can still apply the same correction to the t1 scan (I think).
> But then we can also collect data for the t1-t2 join, and compute a
> correction coefficient in a similar way. It requires a bit of care
> because we need to compensate for misestimates of inputs, but I think
> that's doable.
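Concerning compensation for misestimates of inputs: the simplest way I can think of is to compare
the join selectivity computed from the actual input cardinalities with the one computed from the
estimates, e.g. (the column names are again invented):

    -- est_* / act_* are estimated / actual row counts of the join and of its two inputs
    SELECT (act_join / (act_outer * act_inner))    -- actual join selectivity
         / (est_join / (est_outer * est_inner))    -- estimated join selectivity
           AS join_correction
    FROM join_stats;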
> Of course, all this is rather speculative, and I never got to anything
> beyond a very simple PoC. So I hope it makes at least some sense.
As far as I know, Oleg's AQO is now used by Amazon, so it is something more than just a PoC.
But certainly there are still many problems, and my experiments with the JOB benchmark have shown
that there are still a lot of things to improve.