Proposed Query Planner TODO items - Mailing list pgsql-hackers
From | Josh Berkus |
---|---|
Subject | Proposed Query Planner TODO items |
Date | |
Msg-id | 200312051108.05868.josh@agliodbs.com Whole thread Raw |
Responses |
Re: Proposed Query Planner TODO items
Re: Proposed Query Planner TODO items Re: Proposed Query Planner TODO items |
List | pgsql-hackers |
PG Folks, What follows are a couple of proposed TODO items to make up for some of the places our planner is weak compared to other leading databases. Particularly, I'm personally concerned that as of 7.4.0 we would "fail" the TPC benchmark even if someone sponsored us for it (see Issue #2 below). I freely admit that I don't have the skill to implement either of these; instead, I want them on the TODO list just so we don't lose track of them, and just in case some new brilliant coder jumps into our community looking for something to do. 1) MAINTAIN CORROLARY STATS ON FORIEGN KEYS Summary: Keep correspondance statistics between FK columns. Description: One of the areas of ongoing planner estimation problems estimation of cross-table correspondence of column values. Indeed, as late a 7.2.4 the WHERE EXISTS code just estimated a flat 50%.While it would be practically impossible to maintain statistics betweenall columns in a database that might possibly be compared, there is one class of cross-table column comparisons which is both used heavily and is readily identifiable: foriegn keys. My proposal is to keep statistics on the correlation of values between the key and foriegn key values in order to arrive at better estimates. Adapting the newly committed pg_indexstats to track this as well seems to me to be the easiest method, but I'll admit to not really understanding Manfried's code. NOTE: This suggestion was dicussed on Hackers early last summer and received general approval but somehow never ended up on the TODO list. 2) DEVELOP BETTER PLANS FOR "OR GROUP" QUERIES Summary: Currently, queries with complex "or group" criteria get devolved by the planner into canonical and-or filters resulting in very poor execution on large data sets. We should find better ways of dealing with these queries, for example UNIONing. Description: While helping OSDL with their derivative TPC-R benchmark, we ran into a query (#19) which took several hours to complete on PostgreSQL. It was in the general form: SELECT t1.a, t2.b FROM t1, t2 WHERE t1.a = t2.a AND (( t1.c = x AND t1.f IN (m, n, o) AND t2.d = v AND t2.e BETWEEN j AND k)OR( t1.c = y AND t1.f IN (n, o, p) AND t2.d= v AND t2.e BETWEEN k AND h)OR ( t1.c = z AND t1.f IN (p, q) AND t2.d = w AND t2.e BETWEEN k AND h)) The reason why this query is included in the TPC benchmarks is the reason I've run into problems with similar querys before; it is the kind of query produced by many 3rd-party decision-support and reporting applications. Its distinguishing feature is the same thing which gives PG indigestion; the distinct OR groups with a complex set of criteria for each. Or planner's approach to this sort of query is to devolve the criteria into a 3-page long set of canonical and-or filters, and seq scan the entire underlying data set. This is fine if the data set is small, but if it's several times the size of RAM, a full-table seq scan is fatal, as it is for TPC-R which seems specifically designed to test for this kind of failure. One solution which suggests itself is that the following query form runs in a couple of seconds: SELECT t1.a, t2.b FROM t1, t2 WHERE t1.a = t2.aAND t1.c = x AND t1.f IN (m, n, o) AND t2.d = v AND t2.e BETWEEN j AND k UNION ALL SELECT t1.a, t2.b FROM t1, t2 WHERE t1.a = t2.aAND t1.c = y AND t1.f IN (n, o, p) AND t2.d = v AND t2.e BETWEEN k AND h UNION ALL SELECT t1.a, t2.b FROM t1, t2AND t1.c = z AND t1.f IN (p, q) AND t2.d = w AND t2.e BETWEEN k AND h So the trick would be teaching the planner to:a) recognize an "or group query" when it sees one;b) break down that queryinto a multi-part union and estimate the cost However, I'm sure there are other possible solutions. Oracle and MSSQL have solved this particular query problem; anybody know how they do it? -- Josh Berkus Aglio Database Solutions San Francisco
pgsql-hackers by date: