Re: limit in subquery causes poor selectivity estimation - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: limit in subquery causes poor selectivity estimation
Date:
Msg-id: CA+TgmoYTSQP0F8SOP7FAL_6myT0GSYxH1WBxPO9iQ=n1KsTAqw@mail.gmail.com
In response to: Re: limit in subquery causes poor selectivity estimation (Peter Eisentraut <peter_e@gmx.net>)
List: pgsql-hackers
On Wed, Aug 31, 2011 at 6:22 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On lör, 2011-08-27 at 13:32 -0400, Tom Lane wrote:
>> > EXPLAIN SELECT * FROM test1 WHERE sha1 in (SELECT sha1 FROM test2
>> > LIMIT 200);
>>
>> > Here, however, it has apparently not passed this knowledge through the
>> > LIMIT.
>>
>> The LIMIT prevents the subquery from being flattened entirely, ie we
>> don't have just "test1 SEMI JOIN test2" but "test1 SEMI JOIN (SELECT *
>> FROM test2 LIMIT 200)".  If you look at examine_variable in selfuncs.c
>> you'll note that it punts for Vars coming from unflattened subqueries.
>>
>> > So what's up with that?  Just a case of, we haven't thought about
>> > covering this case yet, or are there larger problems?
>>
>> The larger problem is that if a subquery didn't get flattened, it's
>> often because it's got LIMIT, or GROUP BY, or some similar clause that
>> makes it highly suspect whether the statistics available for the table
>> column are reasonable to use for the subquery outputs.  It wouldn't be
>> that hard to grab the stats for test2.sha1, but then how do you want
>> to adjust them to reflect the LIMIT?
> It turns out that this is a regression introduced in 8.4.8; the same
> topic is also being discussed in
>
> http://archives.postgresql.org/pgsql-performance/2011-08/msg00248.php
>
> and
>
> http://archives.postgresql.org/pgsql-general/2011-08/msg00995.php
>
> This is the (previously posted) plan with 8.4.8:
>
>                                     QUERY PLAN
> ----------------------------------------------------------------------------------
>  Hash Join  (cost=10.60..34.35 rows=500 width=31)
>    Hash Cond: (test1.sha1 = test2.sha1)
>    ->  Seq Scan on test1  (cost=0.00..18.00 rows=1000 width=31)
>    ->  Hash  (cost=8.10..8.10 rows=200 width=32)
>          ->  HashAggregate  (cost=6.10..8.10 rows=200 width=32)
>                ->  Limit  (cost=0.00..3.60 rows=200 width=21)
>                      ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)
>
> And this is the plan with 8.4.7:
>
>                                     QUERY PLAN
> ----------------------------------------------------------------------------------
>  Hash Join  (cost=10.80..34.55 rows=200 width=31)
>    Hash Cond: (test1.sha1 = test2.sha1)
>    ->  Seq Scan on test1  (cost=0.00..18.00 rows=1000 width=31)
>    ->  Hash  (cost=8.30..8.30 rows=200 width=32)
>          ->  HashAggregate  (cost=6.30..8.30 rows=200 width=32)
>                ->  Limit  (cost=0.00..3.80 rows=200 width=21)
>                      ->  Seq Scan on test2  (cost=0.00..19.01 rows=1001 width=21)
>
> I liked the old one better. ;-)

AFAICS, those plans are identical, except for a minor difference in the
cost of scanning test2.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
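[Editor's note: a minimal sketch for reproducing the behavior discussed above. The thread does not show the table definitions; the schema, use of md5() as a stand-in hash, and row counts here are assumptions inferred from the plans' row estimates and widths.]

```sql
-- Hypothetical setup: two tables of ~1000 hash values each.
CREATE TABLE test1 (sha1 text, data text);
CREATE TABLE test2 (sha1 text);

INSERT INTO test1
  SELECT md5(i::text), 'row ' || i
  FROM generate_series(1, 1000) AS i;

INSERT INTO test2 SELECT sha1 FROM test1;

ANALYZE test1;
ANALYZE test2;

-- The LIMIT keeps the subquery from being flattened into a plain
-- semijoin, so (per Tom's explanation) examine_variable punts and the
-- planner cannot use test2.sha1's statistics to estimate the join.
EXPLAIN SELECT * FROM test1
WHERE sha1 IN (SELECT sha1 FROM test2 LIMIT 200);
```

Without the LIMIT, the subquery is flattened and the column statistics apply directly, which is why the estimates differ between the two forms.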