Proposed Query Planner TODO items - Mailing list pgsql-hackers

From Josh Berkus
Subject Proposed Query Planner TODO items
Date
Msg-id 200312051108.05868.josh@agliodbs.com
Whole thread Raw
Responses Re: Proposed Query Planner TODO items
Re: Proposed Query Planner TODO items
Re: Proposed Query Planner TODO items
List pgsql-hackers
PG Folks,

What follows are a couple of proposed TODO items to make up for some of the 
places our planner is weak compared to other leading databases.   
Particularly, I'm personally concerned that as of 7.4.0 we would "fail" the 
TPC benchmark even if someone sponsored us for it (see Issue #2 below).

I freely admit that I don't have the skill to implement either of these; 
instead, I want them on the TODO list just so we don't lose track of them, 
and just in case some new brilliant coder jumps into our community looking 
for something to do.

1) MAINTAIN CORROLARY STATS ON FORIEGN KEYS

Summary: Keep correspondance statistics between FK columns.   

Description:  One of the areas of ongoing planner estimation problems 
estimation of cross-table correspondence of column values.   Indeed, as late 
a 7.2.4 the WHERE EXISTS code just estimated a flat 50%.While it would be practically impossible to maintain statistics
betweenall 
 
columns in a database that might possibly be compared, there is one class of 
cross-table column comparisons which is both used heavily and is readily 
identifiable: foriegn keys. My proposal is to keep statistics on the correlation of values between the 
key and foriegn key values in order to arrive at better estimates.   Adapting 
the newly committed pg_indexstats to track this as well seems to me to be the 
easiest method, but I'll admit to not really understanding Manfried's code.

NOTE:  This suggestion was dicussed on Hackers early last summer and received 
general approval but somehow never ended up on the TODO list.

2) DEVELOP BETTER PLANS FOR "OR GROUP" QUERIES

Summary: Currently, queries with complex "or group" criteria get devolved by 
the planner into canonical and-or filters resulting in very poor execution on 
large data sets.   We should find better ways of dealing with these queries, 
for example UNIONing.

Description: While helping OSDL with their derivative TPC-R benchmark, we ran 
into a query (#19) which took several hours to complete on PostgreSQL.  It 
was in the general form:

SELECT t1.a, t2.b
FROM t1, t2
WHERE t1.a = t2.a
AND (( t1.c = x  AND t1.f IN (m, n, o)  AND t2.d = v  AND t2.e BETWEEN j AND k)OR( t1.c = y  AND t1.f IN (n, o, p)  AND
t2.d= v  AND t2.e BETWEEN k AND h)OR ( t1.c = z  AND t1.f IN (p, q)  AND t2.d = w  AND t2.e BETWEEN k AND h))
 

The reason why this query is included in the TPC benchmarks is the reason I've 
run into problems with similar querys before; it is the kind of query 
produced by many 3rd-party decision-support and reporting applications.   Its 
distinguishing feature is the same thing which gives PG indigestion; the 
distinct OR groups with a complex set of criteria for each.

Or planner's approach to this sort of query is to devolve the criteria into a 
3-page long set of canonical and-or filters, and seq scan the entire 
underlying data set.   This is fine if the data set is small, but if it's 
several times the size of RAM, a full-table seq scan is fatal, as it is for 
TPC-R which seems specifically designed to test for this kind of failure. 

One solution which suggests itself is that the following query form runs in a 
couple of seconds:

SELECT t1.a, t2.b
FROM t1, t2
WHERE t1.a = t2.aAND t1.c = x  AND t1.f IN (m, n, o)  AND t2.d = v  AND t2.e BETWEEN j AND k
UNION ALL
SELECT t1.a, t2.b
FROM t1, t2
WHERE t1.a = t2.aAND  t1.c = y  AND t1.f IN (n, o, p)  AND t2.d = v  AND t2.e BETWEEN k AND h
UNION ALL
SELECT t1.a, t2.b
FROM t1, t2AND t1.c = z  AND t1.f IN (p, q)  AND t2.d = w  AND t2.e BETWEEN k AND h

So the trick would be teaching the planner to:a) recognize an "or group query" when it sees one;b) break down that
queryinto a multi-part union and estimate the cost
 

However, I'm sure there are other possible solutions.   Oracle and MSSQL have 
solved this particular query problem; anybody know how they do it?

-- 
Josh Berkus
Aglio Database Solutions
San Francisco


pgsql-hackers by date:

Previous
From: Joe Conway
Date:
Subject: Re: request for feedback - read-only GUC variables,
Next
From: todd@tekinteractive.com (Todd R. Eigenschink)
Date:
Subject: Re: Examining the output of: ldd `which postgres`