limit in subquery causes poor selectivity estimation - Mailing list pgsql-hackers

From: Peter Eisentraut
Subject: limit in subquery causes poor selectivity estimation
Msg-id: 1314444807.2349.10.camel@vanquo.pezone.net
List: pgsql-hackers
This is an artificial test case shrunk down from a much larger real
query.

CREATE TABLE test1 (
    sha1 bytea PRIMARY KEY,
    something text
);

CREATE TABLE test2 (
    sha1 bytea PRIMARY KEY,
    blah text
);

Fill those with 1000 random rows each, e.g.,

for i in $(seq 1 1000); do
    sha1=$(echo $i | sha1sum | cut -f1 -d' ')
    psql -d test -c "INSERT INTO test1 VALUES (decode('$sha1','hex'),'blah$i$i')"
done

for i in $(seq 500 1500); do
    sha1=$(echo $i | sha1sum | cut -f1 -d' ')
    psql -d test -c "INSERT INTO test2 VALUES (decode('$sha1','hex'),'foo$i')"
done

(Doesn't really matter whether the key values are the same or
overlapping or not.  Just to make it interesting.)
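(Incidentally, if you want to reproduce the exact same keys outside the shell, note that `echo $i | sha1sum` hashes the number plus a trailing newline. A minimal sketch of generating the equivalent INSERT statements, in Python purely for illustration:)

```python
import hashlib

def insert_statement(i: int, table: str, payload: str) -> str:
    # echo appends a newline, so the shell loop actually hashes "<i>\n"
    sha1 = hashlib.sha1(f"{i}\n".encode()).hexdigest()
    return f"INSERT INTO {table} VALUES (decode('{sha1}','hex'),'{payload}')"

# Same rows as the two shell loops above
test1_rows = [insert_statement(i, "test1", f"blah{i}{i}") for i in range(1, 1001)]
test2_rows = [insert_statement(i, "test2", f"foo{i}") for i in range(500, 1501)]
```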

ANALYZE;

EXPLAIN SELECT * FROM test1 WHERE sha1 IN (SELECT sha1 FROM test2);

                             QUERY PLAN
----------------------------------------------------------------------
 Hash Semi Join  (cost=30.52..61.27 rows=1000 width=27)
   Hash Cond: (test1.sha1 = test2.sha1)
   ->  Seq Scan on test1  (cost=0.00..17.00 rows=1000 width=27)
   ->  Hash  (cost=18.01..18.01 rows=1001 width=21)
         ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)

That's OK.  Apparently it can tell that joining two tables on their
primary keys cannot result in more rows than the smaller table.  (Or
can it?)

EXPLAIN SELECT * FROM test1 WHERE sha1 IN (SELECT sha1 FROM test2 LIMIT 200);

                                   QUERY PLAN
----------------------------------------------------------------------------------
 Hash Join  (cost=10.60..33.35 rows=500 width=27)
   Hash Cond: (test1.sha1 = test2.sha1)
   ->  Seq Scan on test1  (cost=0.00..17.00 rows=1000 width=27)
   ->  Hash  (cost=8.10..8.10 rows=200 width=32)
         ->  HashAggregate  (cost=6.10..8.10 rows=200 width=32)
               ->  Limit  (cost=0.00..3.60 rows=200 width=21)
                     ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)

Here, however, it has apparently not passed this knowledge through the
LIMIT.

The problem is that in the real query, instead of the 500 shown here, it
estimates about 30 million rows (which might be a reasonable estimate for
a join between two unkeyed columns), when it should in fact be 200.
(And again, this is part of a larger query, which is then completely
messed up by this misestimation.)
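(For what it's worth, the rows=500 in the second plan looks like the
planner simply applied a flat 50% selectivity to test1 once the LIMIT hid
the primary-key statistics; that is my guess from the numbers, not a
reading of the code. The arithmetic, with an illustrative bound the
planner could have used instead:)

```python
# Guessed arithmetic behind the second plan (hypothetical, based only on
# the numbers in the EXPLAIN output above):
outer_rows = 1000              # Seq Scan on test1
default_sel = 0.5              # assumed flat fallback selectivity
estimated = outer_rows * default_sel   # matches rows=500 in the plan

# The subquery returns at most LIMIT rows of a unique column, so the
# semijoin cannot produce more matches than that:
limit_rows = 200
upper_bound = min(outer_rows, limit_rows)   # the 200 the estimate should respect
```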

So what's up with that?  Is this just a case we haven't thought about
covering yet, or are there larger problems?


