Re: New hashed IN code ignores distinctiveness of subquery - Mailing list pgsql-bugs

From Tom Lane
Subject Re: New hashed IN code ignores distinctiveness of subquery
Date
Msg-id 29875.1043608189@sss.pgh.pa.us
Whole thread Raw
In response to New hashed IN code ignores distinctiveness of subquery  (Bradley Baetz <bbaetz@acm.org>)
Responses Re: New hashed IN code ignores distinctiveness of subquery  (Bradley Baetz <bbaetz@acm.org>)
List pgsql-bugs
Bradley Baetz <bbaetz@acm.org> writes:
> I've been trying out the new hased subselect code from CVS. It appears
> that the planner isn't taking the distinctiveness of the values from the
> subselect into account:

This isn't really anything to do with the new IN code, but is a
long-standing problem: cost_mergejoin doesn't apply any penalty factor
for the case where there are lots of duplicates in both inner and outer
relation (causing rescans of big chunks of the inner relation).  You can
see the rescanning happening in the EXPLAIN ANALYZE output:

>    ->  Merge Join  (cost=0.00..3485661.38 rows=5570 width=8) (actual
> time=0.15..1429696.69 rows=50000 loops=1)
>          Merge Cond: ("outer".product_id = "inner".product_id)
>          ->  Index Scan using bugs_product_id_idx on bugs
> (cost=0.00..2313.33 rows=50000 width=4) (actual time=0.12..358.43
> rows=50000 loops=1)
>          ->  Index Scan using bugs_product_id_idx on bugs
> (cost=0.00..2313.33 rows=50000 width=4) (actual time=0.01..1152455.44
> rows=277884160 loops=1)
  ^^^^^^^^^^^^^^

277884160 rows pulled from a 50000-row relation means a heck of a lot of
rescanning went on :-(

The good news is that the system *is* aware of the small number of
distinct values in the table (note the dead-on estimate of the number of
distinct rows in your other query; which I think is from new-for-7.4
code, though the required stats have been available since 7.2).

I think it'd probably be possible to make some reasonable estimate of
the amount of rescanning required, and then inflate the mergejoin cost
estimate proportionally.  I have not gotten around to looking at the
problem though.  Care to take a shot at it?

            regards, tom lane

pgsql-bugs by date:

Previous
From: Bradley Baetz
Date:
Subject: New hashed IN code ignores distinctiveness of subquery
Next
From: Neil Conway
Date:
Subject: Re: Bug #883: explain analyze causes postgres to die