Re: To use a VIEW or not to use a View..... - Mailing list pgsql-sql

From Tom Lane
Subject Re: To use a VIEW or not to use a View.....
Date
Msg-id 26724.1043342499@sss.pgh.pa.us
Whole thread Raw
In response to Re: To use a VIEW or not to use a View.....  (Stephan Szabo <sszabo@megazone23.bigpanda.com>)
Responses Re: To use a VIEW or not to use a View.....  (Stephan Szabo <sszabo@megazone23.bigpanda.com>)
List pgsql-sql
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> On Wed, 22 Jan 2003, Tom Lane wrote:
>> It could still use more eyeballs looking at it.  One thing I'm concerned
>> about is whether the extra (derived) conditions lead to double-counting
>> restrictivity and thus underestimating the number of result rows.  I
>> haven't had time to really test that, but I suspect there may be a problem.

> I haven't looked at code yet but tried examples like Tomasz's and some
> simple ones and have gotten reasonable seeming output for the estimates
> given accurate statistics

I realized this morning that there definitely is a problem.  Consider
this example using the regression database:

regression=# explain analyze select * from tenk1 a join tenk1 b using(ten)
regression-# where ten = 3;                                                     QUERY PLAN

----------------------------------------------------------------------------------------------------------------------Merge
Join (cost=1055.45..2102.12 rows=83006 width=488) (actual time=582.97..65486.57 rows=1000000 loops=1)  Merge Cond:
("outer".ten= "inner".ten)  ->  Sort  (cost=527.73..530.00 rows=910 width=244) (actual time=373.57..382.48 rows=1000
loops=1)       Sort Key: a.ten        ->  Seq Scan on tenk1 a  (cost=0.00..483.00 rows=910 width=244) (actual
time=8.98..330.39rows=1000 loops=1)              Filter: (ten = 3)  ->  Sort  (cost=527.73..530.00 rows=910 width=244)
(actualtime=209.19..8057.64 rows=999001 loops=1)        Sort Key: b.ten        ->  Seq Scan on tenk1 b
(cost=0.00..483.00rows=910 width=244) (actual time=0.40..193.93 rows=1000 loops=1)              Filter: (3 = ten)Total
runtime:73291.01 msec
 
(11 rows)

The condition "ten=3" will select 1000 rows out of the 10000 in the
table.  But, once we have applied that condition to both sides of the
join, the join condition "a.ten = b.ten" is a no-op --- it will not
reject any pair of rows coming out of the seqscans.  Presently we count
its restrictivity anyway, so the estimated row count at the merge is a
bad underestimate.

Not only should we ignore the join condition for selectivity purposes,
but it's a waste of time for execution as well.  We could have
implemented the above query as a nestloop with no join condition, and
saved the effort of the sort and merge logic.

What I was thinking was that any time the code sees a "var = const"
clause as part of a mergejoin equivalence set, we could mark all the
"var = var" clauses in the same set as no-ops.  For example, given

WHERE a.f1 = b.f2 AND b.f2 = c.f3 AND c.f3 = 42

then after we finish deducing a.f1 = 42 and b.f2 = 42, there is no
longer any value in either of the original clauses a.f1 = b.f2 and
b.f2 = c.f3, nor in the other deduced clause a.f1 = c.f3.  This would
take a little bit of restructuring of generate_implied_equalities() and
process_implied_equality(), but it doesn't seem too difficult to do.

Thoughts?  Are there any holes in that logic?
        regards, tom lane


pgsql-sql by date:

Previous
From: Tom Lane
Date:
Subject: Re: calling function from rule
Next
From: jasiek@klaster.net
Date:
Subject: Re: To use a VIEW or not to use a View.....