Re: New hashed IN code ignores distinctiveness of subquery - Mailing list pgsql-bugs

From Bradley Baetz
Subject Re: New hashed IN code ignores distinctiveness of subquery
Date
Msg-id 20030127045041.GA8600@mango.home
Whole thread Raw
In response to Re: New hashed IN code ignores distinctiveness of subquery  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: New hashed IN code ignores distinctiveness of subquery  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
On Sun, Jan 26, 2003 at 11:18:31PM -0500, Tom Lane wrote:
> Bradley Baetz <bbaetz@acm.org> writes:
> > Right, or skip it entirely when selecting stuff with unique constraints.
>
> I'm hesitant to do that until we have some scheme in place for
> invalidating cached plans.

By cached, do you mean PREPARE stuff, or something else?

>
> > I don't think it is. The number of rows is correct if you do product_id
> > IN (1) vs product_id IN (1,2) vs product_id IN (1,2,3) and so on.
>
> But that's a completely different code path; it doesn't even enter the
> routines we're concerned about here.

Yes, but its the same concept. Although we seem to be agreeing about
that now :)

> > What is the point of JOIN_UNIQUE_{INNER,OUTER}, though? What does it do
> > that JOIN_IN doesn't?
>
> Uniqify the inner/outer path and then do a normal inner join.  See
> joinpath.c.

Ah, OK. If I comment out line 547 of joinrels.c (which adds JOIN_IN to
the set of join paths) so that the UNIQUE joins are all that are left to
try, then I get:

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);
                                                                   QUERY
PLAN


-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3494816.98..3494816.98 rows=1 width=8) (actual
time=579.71..579.71 rows=1 loops=1)
   ->  Merge Join  (cost=5169.41..3494691.43 rows=50218 width=8) (actual
time=111.41..530.16 rows=50000 loops=1)
         Merge Cond: ("outer".product_id = "inner".product_id)
         ->  Index Scan using bugs_product_id_idx on bugs
(cost=0.00..1834.52 rows=50000 width=4) (actual time=0.13..249.57
rows=50000 loops=1)
         ->  Sort  (cost=920.14..920.17 rows=9 width=4) (actual
time=111.25..143.42 rows=44476 loops=1)
               Sort Key: public.bugs.product_id
               ->  HashAggregate  (cost=920.00..920.00 rows=9 width=4)
(actual time=111.17..111.18 rows=9 loops=1)
                     ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000
width=4) (actual time=0.00..67.41 rows=50000 loops=1)
 Total runtime: 579.84 msec
(9 rows)

(This isn't picked without my hack, because the cost is slightly higher
than the JOIN_IN version)

However, its much faster (although not as fast as sticking the DISTINCT
in there myself), but the actual rows coming from the sort is really odd
- where is that number coming from? How can sorting 9 rows take 44476
anythings? The final mergejoin cost is still way off, too.

>             regards, tom lane

Thanks,

Bradley

pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: New hashed IN code ignores distinctiveness of subquery
Next
From: Tom Lane
Date:
Subject: Re: New hashed IN code ignores distinctiveness of subquery