Re: Adaptive query optimization - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Adaptive query optimization |
Date | |
Msg-id | 20190611214331.eww5yot6lytomjcn@development Whole thread Raw |
In response to | Adaptive query optimization (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>) |
Responses |
Re: Adaptive query optimization
Re: Adaptive query optimization |
List | pgsql-hackers |
Hi Alexander, Thanks for starting this thread. I've had similar ideas in the past and even hacked together something (very dirty), so it's great someone else is interested in this topic too. On Mon, Jun 10, 2019 at 11:53:02AM +0300, Konstantin Knizhnik wrote: >Hi, > >Inefficiency of Postgres on some complex queries in most cases is >caused by errors in selectivity estimations. >Postgres doesn't take in account correlation between columns unless >you explicitly create mutlicolumn statistics >(but multicolumn statistic is used only for restriction clauses, not >for join selectivity, where estimation errors are most critical). > Yes, that's the current status. I'd say we want to allow using multicolumn stats to join estimation too, but that's a hard problem. >Certainly it is possible to collect more statistics and improve >estimation formulas but c'est la vie is that estimation of relation >size after several joins //more looks like an exercise in guesswork. I'd go even further - it's a simple fact we can't have perfect stats that would give us "sufficiently good" estimates for all common data distributions and clauses. Firstly, stats are a merely a simplified representation of the overall data distribution - which makes them small, but eliminates some details (which may be quite important for highly non-uniform distributions). Secondly, we only have a number of generic stats types (MCV, histogram, ...) but that may not be sufficient to "capture" the imporant aspects of the data distribution. And finally, we only know how to use those stats for specific types of clauses (equality, inequality, ...) with very simple expressions. But that's often not what the users do. I think adaptive query optimization - in the sense of collecting data from query executions and and leverating that when planning future queries - can (hopefully) help with all those challenges. At least in some cases. >This is why alternative approach based on adaptive query optimization >seems to be more promising. When we analyze query execution with >EXPLAIN ANALYZE, we can see actual number of rows for each plan node. >We can use this information to adjust clause selectivity and reduce >estimation error. > Yep, that's roughly the idea. I don't think we need EXPLAIN ANALYZE, it should be enough to instrument queries to collect row counts on the fly. But I guess that's mostly what the explain_analyze changes do. >At PGconf 2017 my former colleague Oleg Ivanov made presentation about >using machine learning for AQO: >https://www.pgcon.org/2017/schedule/events/1086.en.html >Right now this project is available from PostgresPro repository: >https://github.com/postgrespro/aqo > >There are several problems with this approach: >1. It requires "learning phase" I don't think "learning phase" is an issue, in fact I think that's something we need to do - it ensures we have enough data to make good decisions. >2. It saves collected data in Postgres tables, which makes read-only >transaction executing only queries to become read-write transaction, >obtaining XID... Yeah, that's an issue because it makes it useless on standbys etc. I think it'd be enough to do something similar to pg_stat_statements, i.e. store it in memory and flush it to disk once in a while. >3. It doesn't take in account concrete values of literals used in >clauses, so it is not able to address data skews. Yep. I don't think it's necessarily an issue with all approaches to adaptive optimization, though. But I agree we should detect both systemic estimation issues, and misestimates for particular parameter values. I think that's doable. >4. Machine learning can be quite expensive and seems to be overkill >if we want just to adjust selectivities according to actual number of >affected rows. > I think that depends - some machine learning approaches are not that bad. But I think there's a more serious issue - explainability. We need a solution where we can explain/justify why it makes some decisions. I really don't want a black box that produces numbers that you just need to take at face value. The good thing is that the simpler the method, the less expensive and more explainable it is. >I tried to create much simpler version of AQO based on auto_explain >extension. >This extension provide all necessary infrastructure to analyze >statements with long execution time. >I have added two new modes to auto_explain: >1. Auto generation of multicolumn statistics for variables using in >clauses with large estimation error. Interesting! I probably wouldn't consider this part of adaptive query optimization, but it probably makes sense to make it part of this. I wonder if we might improve this to also suggest "missing" indexes? >2. Direct adjustment of estimated number of rows based on information >collected by EXPLAIN ANALYZE. > Yep! >As well as in Oleg's implementation, it requires few changes in >Postgres core: introducing some new hooks for relation size >estimation. >But most of functionality is implemented in auto_explain extension. >Attached please find patch to vanilla. >Please read Readme.ms file for more details. > >I have tested it on join order benchmark JOB >https://github.com/gregrahn/join-order-benchmark >aqo.svg contains results of applying my and Oleg's versions of AQO to >JOB queries. First result corresponds to the vanilla Postgres, second >- my AQO keeping literal values, third my AQO ignoring literal values >and last one result of Oleg's machine learning (after 10 iterations). > >The principle problem with AQO approach is that using provided explain >feedback we are able to adjust selectivities only for one particular >plan. >But there may be many other alternative plans, and once we adjust one >plan, optimizer most likely choose some other plan which actually can >be ever worser than >original plan. Certainly if we continue learning, then sooner or later >we will know real selectivities for all possible clauses. But number >of possible plans can be very >large for queries with many joins (factorial), so many iterations may >be required. What is worser some intermediate bad plans can take huge >amount of time. >Particularly sixth iteration of Oleg's AQO on JOB queries set takes >about two hours (instead of original 10 minutes!). >Such thing doesn't happen with my AQO, but it seems to be just matter >of luck. > Right. But I think I might have an idea how to address (some of) this. As I already mentioned, I was experimenting with something similar, maybe two or three years ago (I remember chatting about it with Teodor at pgcon last week). I was facing the same issues, and my approach was based on hooks too. But my idea was to not to track stats for a plan as a whole, but instead decompose it into individual nodes, categoried into three basic groups - scans, joins and aggregations. And then use this extracted information to other plans, with "matching" nodes. For example, let's consider a simple "single-scan" query SELECT * FROM t1 WHERE a = ? AND b = ? AND c < ?; Now, if you execute this enought times (say, 100x or 1000x), tracking the estimates and actual row counts, you may then compute the average misestimate (maybe a geometric mean would be more appropriate here?): AVG(actual/estimate) and if this is significantly different from 1.0, then we can say there's a systemic misestimate, and we can use this as a correction coefficient when computing the scan estimate. (And we need to be careful about collection new data, because the estimates will include this correction. But that can be done by tracking "epoch" of the plan.) Now, if someone uses this same scan in a join, like for example SELECT * FROM t1 JOIN t2 ON (t1.id = t2.id) WHERE (t1.a = ? AND t1.b = ? AND t1.c < ?) AND (t2.x = ? AND t2.y = ?) then we can still apply the same correction to the t1 scan (I think). But then we can also collect data for the t1-t2 join, and compute a correction coefficient in a similar way. It requires a bit of care because we need to compensate for misestimates of inputs, but I think that's doable. Of course, all this is rather speculative, and I never got to anything beyond a very simple PoC. So I hope it makes at least some sense. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: