Thread: Re: [BUGS] New hashed IN code ignores distinctiveness of subquery

From: Bradley Baetz

[Moving -> -hackers]

On Mon, Jan 27, 2003 at 12:00:08AM -0500, Tom Lane wrote:
> Bradley Baetz <bbaetz@acm.org> writes:

> > However, it's much faster (although not as fast as sticking the DISTINCT
> > in there myself), but the actual rows coming from the sort are really odd
> > - where is that number coming from? How can sorting 9 rows take 44476
> > anythings?
>
> We're back full circle to my original comment about rescans in
> mergejoin.  The EXPLAIN ANALYZE instrumentation counts a re-fetch
> as another returned row.

Hmm. OK, I poked through the code a bit more, and I think I now realise
why we were talking past each other. I've attached a 'patch' which
gets the mergejoin counts down to something reasonable. (The patch also
includes a bit to fix the estimated row count for JOIN_IN, as discussed
on -bugs.)
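
Here's a minimal standalone sketch (plain C, not the executor code; the
even spread of outer rows over the 9 keys is my assumption, which is why
it doesn't land exactly on 44476) of how the instrumentation inflates
the inner node's row count:

/*
 * In a merge join the inner plan is marked at the start of each key
 * group and restored for every further outer tuple with the same key;
 * EXPLAIN ANALYZE counts each re-fetch as another returned row.
 */
#include <stdio.h>

int main(void)
{
    long outer_rows = 50000;        /* rows in bugs */
    long distinct_keys = 9;         /* product_ids left after HashAggregate */
    long counted_inner_rows = 0;    /* what the instrumentation reports */

    for (long k = 0; k < distinct_keys; k++)
    {
        long outer_in_group = outer_rows / distinct_keys;

        /* One inner tuple per key, but it is fetched for the first
         * matching outer tuple and re-fetched (after a restore) for
         * every additional one. */
        counted_inner_rows += outer_in_group;
    }

    printf("instrumented inner rows ~ %ld (vs. 9 tuples stored)\n",
           counted_inner_rows);
    return 0;
}

So the ~44476 is on the order of the outer row count, not the 9 tuples
the Sort actually holds.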

When calculating the cost of a join, if a path is a T_UniquePath, the
reduction in the number of rows to be examined isn't taken into
account. This is why the mergejoin code was calculating the cost of
merging two 50000-tuple paths - the overestimation that you mentioned
earlier isn't as important here.

For reference, bugs is a 50000 row table, and there are 9 distinct
values for product_id.
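
To see the size of the error, here's a toy sketch of the arithmetic
(the cost function and its constants are invented for illustration -
only the 50000 and 9 row counts come from the plans below):

#include <stdio.h>

/* Crude stand-in for a merge join cost estimate: each input is walked
 * once, and duplicate outer keys force re-fetches proportional to both
 * inputs.  Not the planner's real formula. */
static double merge_join_cost(double outer_tuples, double inner_tuples)
{
    double per_tuple = 0.01;    /* fabricated per-tuple CPU cost */

    return (outer_tuples + inner_tuples) * per_tuple
         + outer_tuples * inner_tuples * per_tuple;
}

int main(void)
{
    /* Costing the inner side at the base relation's 50000 rows instead
     * of the T_UniquePath's 9 gives a quadratic blow-up: */
    printf("inner costed at 50000 rows: %.0f\n", merge_join_cost(50000, 50000));
    printf("inner costed at 9 rows:     %.0f\n", merge_join_cost(50000, 9));
    return 0;
}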

Before (with the num_rows part of the patch applied, though):

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3494816.98..3494816.98 rows=1 width=8) (actual time=579.71..579.71 rows=1 loops=1)
   ->  Merge Join  (cost=5169.41..3494691.43 rows=50218 width=8) (actual time=111.41..530.16 rows=50000 loops=1)
         Merge Cond: ("outer".product_id = "inner".product_id)
         ->  Index Scan using bugs_product_id_idx on bugs  (cost=0.00..1834.52 rows=50000 width=4) (actual time=0.13..249.57 rows=50000 loops=1)
         ->  Sort  (cost=920.14..920.17 rows=9 width=4) (actual time=111.25..143.42 rows=44476 loops=1)
               Sort Key: public.bugs.product_id
               ->  HashAggregate  (cost=920.00..920.00 rows=9 width=4) (actual time=111.17..111.18 rows=9 loops=1)
                     ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4) (actual time=0.00..67.41 rows=50000 loops=1)
 Total runtime: 579.84 msec
(9 rows)

After:

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=8007.21..8007.21 rows=1 width=8) (actual time=578.16..578.16 rows=1 loops=1)
   ->  Merge Join  (cost=5169.41..7881.67 rows=50218 width=8) (actual time=110.94..527.79 rows=50000 loops=1)
         Merge Cond: ("outer".product_id = "inner".product_id)
         ->  Index Scan using bugs_product_id_idx on bugs  (cost=0.00..1834.52 rows=50000 width=4) (actual time=0.13..250.74 rows=50000 loops=1)
         ->  Sort  (cost=920.14..920.17 rows=9 width=4) (actual time=110.78..142.80 rows=44476 loops=1)
               Sort Key: public.bugs.product_id
               ->  HashAggregate  (cost=920.00..920.00 rows=9 width=4) (actual time=110.70..110.71 rows=9 loops=1)
                     ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4) (actual time=0.00..67.14 rows=50000 loops=1)
 Total runtime: 578.30 msec
(9 rows)

The patch isn't correct as-is, because it only covers merge joins:

bbaetz=# set enable_mergejoin=false;
SET
bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4281712.05..4281712.05 rows=1 width=8) (actual time=410.14..410.14 rows=1 loops=1)
   ->  Hash Join  (cost=1091.00..4281586.50 rows=50218 width=8) (actual time=126.32..362.30 rows=50000 loops=1)
         Hash Cond: ("outer".product_id = "inner".product_id)
         ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4) (actual time=0.04..66.81 rows=50000 loops=1)
         ->  Hash  (cost=795.00..795.00 rows=50000 width=4) (actual time=126.08..126.08 rows=0 loops=1)
               ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4) (actual time=0.02..68.23 rows=50000 loops=1)
 Total runtime: 410.25 msec
(7 rows)

I don't think that propagating my hack to everywhere that wants to know
how many rows are returned is a good idea, though - is there a more
generic way to get the number of rows really returned by a path?

>
>             regards, tom lane

Bradley


Re: [BUGS] New hashed IN code ignores distinctiveness of subquery

From: Tom Lane

Bradley Baetz <bbaetz@acm.org> writes:
> Hmm. OK, I poked through the code a bit more, and I think I now realise
> why we were talking past each other. I've attached a 'patch' which
> gets the mergejoin counts down to something reasonable.

I've just committed a significant set of changes in the join cost
estimation routines.  On looking closer, they hadn't been upgraded for
any of the recent changes --- they were still assuming that merge and
hash join clauses could only be simple var = var, for instance.  I did
something about the mergejoin rescan issue, as well as modeling JOIN
short-circuiting.  All of the estimates are a bit crude, but certainly
better than no model at all.
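
Roughly, the short-circuit part of the model looks like this (a
standalone sketch, not the committed code; the uniform-placement
assumption and the numbers are only illustrative):

#include <stdio.h>

/* For an IN-style join the inner scan stops at the first match.  With
 * "matches" matching tuples scattered uniformly among "inner_tuples",
 * the first match is expected at position (N + 1) / (M + 1), so each
 * outer probe is charged far less than a full inner scan. */
static double expected_scan_length(double inner_tuples, double matches)
{
    if (matches <= 0)
        return inner_tuples;            /* no match: scan everything */
    return (inner_tuples + 1) / (matches + 1);
}

int main(void)
{
    /* 50000 inner rows, 9 distinct keys: every probe finds one of
     * ~5556 matching tuples almost immediately. */
    printf("dense matches: %.1f tuples scanned per probe\n",
           expected_scan_length(50000, 50000.0 / 9));
    printf("one match:     %.1f tuples scanned per probe\n",
           expected_scan_length(50000, 1));
    printf("no match:      %.1f tuples scanned per probe\n",
           expected_scan_length(50000, 0));
    return 0;
}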

I think this covers your concerns, though I'm still worried about
whether it's okay to use the existing selectivity routines to compute
selectivities in the JOIN_IN/JOIN_UNIQUE case.
        regards, tom lane