Home > mailing lists

Re: limit in subquery causes poor selectivity estimation - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: limit in subquery causes poor selectivity estimation
Date	August 27, 2011 14:33:09
Msg-id	2651.1314466374@sss.pgh.pa.us Whole thread Raw
In response to	limit in subquery causes poor selectivity estimation (Peter Eisentraut <peter_e@gmx.net>)
Responses	Re: limit in subquery causes poor selectivity estimation Re: limit in subquery causes poor selectivity estimation
List	pgsql-hackers

Tree view

Peter Eisentraut <peter_e@gmx.net> writes:
> EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2);
>                               QUERY PLAN
> ----------------------------------------------------------------------
>  Hash Semi Join  (cost=30.52..61.27 rows=1000 width=27)
>    Hash Cond: (test1.sha1 = test2.sha1)
>    ->  Seq Scan on test1  (cost=0.00..17.00 rows=1000 width=27)
>    ->  Hash  (cost=18.01..18.01 rows=1001 width=21)
>          ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)

> That's OK.  Apparently it can tell that joining two tables on their
> primary keys cannot result in more rows than the smaller table.  (Or
> can it?)

More like it knows that a semijoin can't produce more rows than the
lefthand input has.  But I think it is actually applying stats for
both columns here.

> EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2 LIMIT 200);

> Here, however, it has apparently not passed this knowledge through the
> LIMIT.

The LIMIT prevents the subquery from being flattened entirely, ie we
don't have just "test1 SEMI JOIN test2" but "test1 SEMI JOIN (SELECT *
FROM test2 LIMIT 200)".  If you look at examine_variable in selfuncs.c
you'll note that it punts for Vars coming from unflattened subqueries.

> So what's up with that?  Just a case of, we haven't thought about
> covering this case yet, or are there larger problems?

The larger problem is that if a subquery didn't get flattened, it's
often because it's got LIMIT, or GROUP BY, or some similar clause that
makes it highly suspect whether the statistics available for the table
column are reasonable to use for the subquery outputs.  It wouldn't be
that hard to grab the stats for test2.sha1, but then how do you want
to adjust them to reflect the LIMIT?
        regards, tom lane

pgsql-hackers by date:

From: Tom Lane
Date: 27 August 2011, 13:28:50
Subject: Re: Cryptic error message in low-memory conditions

From: Peter Eisentraut
Date: 27 August 2011, 17:09:26
Subject: Re: tab stop in README

Re: limit in subquery causes poor selectivity estimation - Mailing list pgsql-hackers

Previous

Next