Re: [BUGS] New hashed IN code ignores distinctiveness of subquery - Mailing list pgsql-hackers

From Bradley Baetz
Subject Re: [BUGS] New hashed IN code ignores distinctiveness of subquery
Date
Msg-id 20030127072918.GA24215@mango.home
Whole thread Raw
Responses Re: [BUGS] New hashed IN code ignores distinctiveness of subquery  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
[Moving -> -hackers]

On Mon, Jan 27, 2003 at 12:00:08AM -0500, Tom Lane wrote:
> Bradley Baetz <bbaetz@acm.org> writes:

> > However, its much faster (although not as fast as sticking the DISTINCT
> > in there myself), but the actual rows coming from the sort is really odd
> > - where is that number coming from? How can sorting 9 rows take 44476
> > anythings?
>
> We're back full circle to my original comment about rescans in
> mergejoin.  The EXPLAIN ANALYZE instrumentation counts a re-fetch
> as another returned row.

Hmm. OK, I poked through the code a bit more, and I think I now realise
why we were talking across each other. I've attached a 'patch' which
gets the mergejoin counts down to something reasonable. (The patch also
includes a bit to fix the estimated row count for JOIN_IN, as discussed
on -bugs)

When calculating the cost for a join, if a path is a T_UniquePath, then
the reduction in the number of rows to be examined isn't taken into
account. This is why the mergejoin code was calulating the cost of
merging two 50000 tuple paths - the overestimation that you menioned
earlier isn't as important here.

For reference, bugs is a 50000 row table, and there are 9 distinct
values for product_id.

Before(with the num_rows part of the patch, though)

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);
                                                                   QUERY
PLAN


-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3494816.98..3494816.98 rows=1 width=8) (actual
time=579.71..579.71 rows=1 loops=1)
   ->  Merge Join  (cost=5169.41..3494691.43 rows=50218 width=8) (actual
time=111.41..530.16 rows=50000 loops=1)
         Merge Cond: ("outer".product_id = "inner".product_id)
         ->  Index Scan using bugs_product_id_idx on bugs
(cost=0.00..1834.52 rows=50000 width=4) (actual time=0.13..249.57
rows=50000 loops=1)
         ->  Sort  (cost=920.14..920.17 rows=9 width=4) (actual
time=111.25..143.42 rows=44476 loops=1)
               Sort Key: public.bugs.product_id
               ->  HashAggregate  (cost=920.00..920.00 rows=9 width=4)
(actual time=111.17..111.18 rows=9 loops=1)
                     ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000
width=4) (actual time=0.00..67.41 rows=50000 loops=1)
 Total runtime: 579.84 msec
(9 rows)

After:

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);
                                                                   QUERY
PLAN


-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=8007.21..8007.21 rows=1 width=8) (actual
time=578.16..578.16 rows=1 loops=1)
   ->  Merge Join  (cost=5169.41..7881.67 rows=50218 width=8) (actual
time=110.94..527.79 rows=50000 loops=1)
         Merge Cond: ("outer".product_id = "inner".product_id)
         ->  Index Scan using bugs_product_id_idx on bugs
(cost=0.00..1834.52 rows=50000 width=4) (actual time=0.13..250.74
rows=50000 loops=1)
         ->  Sort  (cost=920.14..920.17 rows=9 width=4) (actual
time=110.78..142.80 rows=44476 loops=1)
               Sort Key: public.bugs.product_id
               ->  HashAggregate  (cost=920.00..920.00 rows=9 width=4)
(actual time=110.70..110.71 rows=9 loops=1)
                     ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000
width=4) (actual time=0.00..67.14 rows=50000 loops=1)
 Total runtime: 578.30 msec
(9 rows)

The patch isn't correct as-is, because it only covers merge joins:

bbaetz=# set enable_mergejoin=false;
SET
bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);
                                                       QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4281712.05..4281712.05 rows=1 width=8) (actual
time=410.14..410.14 rows=1 loops=1)
   ->  Hash Join  (cost=1091.00..4281586.50 rows=50218 width=8) (actual
time=126.32..362.30 rows=50000 loops=1)
         Hash Cond: ("outer".product_id = "inner".product_id)
         ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4)
(actual time=0.04..66.81 rows=50000 loops=1)
         ->  Hash  (cost=795.00..795.00 rows=50000 width=4) (actual
time=126.08..126.08 rows=0 loops=1)
               ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000
width=4) (actual time=0.02..68.23 rows=50000 loops=1)
 Total runtime: 410.25 msec
(7 rows)

I don't think that propogating my hack to everywhere which wants to know
how many rows are returned is a good idea though - is there a more
generic way to get the number of rows really returned by a path?

>
>             regards, tom lane

Bradley

Attachment

pgsql-hackers by date:

Previous
From: "Christopher Kings-Lynne"
Date:
Subject: Re: unquoted special constants
Next
From: Antti Haapala
Date:
Subject: Re: Call for objections: put back OIDs in CREATE TABLE