Re: Plan for relatively simple query seems to be very inefficient - Mailing list pgsql-performance

From Dave Held
Subject Re: Plan for relatively simple query seems to be very inefficient
Msg-id 49E94D0CFCD4DB43AFBA928DDD20C8F9026184A0@asg002.asg.local
In response to Plan for relatively simple query seems to be very inefficient  (Arjen van der Meijden <acmmailing@vulcanus.its.tudelft.nl>)
List pgsql-performance
> -----Original Message-----
> From: Arjen van der Meijden
> [mailto:acmmailing@vulcanus.its.tudelft.nl]
> Sent: Wednesday, April 06, 2005 11:53 AM
> To: performance pgsql
> Subject: [PERFORM] Plan for relatively simple query seems to be very
> inefficient
>
> [...]
> SELECT COUNT(*) FROM
> data_main AS dm,
> postcodes AS p
> WHERE dm.range BETWEEN p.range_from AND p.range_till
> [...]
>   Aggregate  (cost=332586.85..332586.85 rows=1 width=0) (actual time=22712.038..22712.039 rows=1 loops=1)
>     ->  Nested Loop  (cost=3.76..328945.96 rows=1456356 width=0) (actual time=0.054..22600.826 rows=82688 loops=1)

I'm still a noob at reading EXPLAIN ANALYZE, but it seems to me
that your statistics are throwing off the planner here.  It
estimates 1.4M rows and gets 82K, so it's off by a factor of
almost 20.  Have you considered running VACUUM ANALYZE or upping
the statistics target on those columns?
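
If it were me, I'd try something along these lines first (table and
column names taken from your post; the statistics target of 100 is
just a guess):

VACUUM ANALYZE data_main;
VACUUM ANALYZE postcodes;
-- optionally gather finer-grained statistics on the join columns
ALTER TABLE data_main ALTER COLUMN range SET STATISTICS 100;
ALTER TABLE postcodes ALTER COLUMN range_from SET STATISTICS 100;
ALTER TABLE postcodes ALTER COLUMN range_till SET STATISTICS 100;
ANALYZE data_main;
ANALYZE postcodes;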

> [...]
> When I do something completely bogus, which results in coupling
> each record from data_main to one record from postcodes, it's
> still not very fast, but acceptable:
> [...]
>   Aggregate  (cost=10076.98..10076.98 rows=1 width=0) (actual time=1456.016..1456.017 rows=1 loops=1)
>     ->  Merge Join  (cost=8636.81..9913.13 rows=65537 width=0) (actual time=1058.105..1358.571 rows=81920 loops=1)

Looks like Merge Join is faster than the Nested Loop for this
query.  If you notice, the row counts are a lot closer to the
estimates, too.  This is probably a "good" plan.
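
As a session-local diagnostic (not something to leave on), you could
forbid the Nested Loop and re-run the original BETWEEN query, to see
whether the planner has any cheaper alternative at all:

SET enable_nestloop = off;
EXPLAIN ANALYZE
SELECT COUNT(*)
FROM data_main AS dm, postcodes AS p
WHERE dm.range BETWEEN p.range_from AND p.range_till;
SET enable_nestloop = on;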

> [...]
> Doing something similarly bogus, but with fewer results, is
> much faster, even though it should have basically the same
> plan:
>
> SELECT COUNT(*) FROM
> data_main AS dm,
> postcodes AS p
> WHERE dm.range  = p.postcode_id
> [...]
>   Aggregate  (cost=2138.63..2138.63 rows=1 width=0) (actual time=180.667..180.668 rows=1 loops=1)
>     ->  Hash Join  (cost=4.00..2087.02 rows=20642 width=0) (actual time=180.645..180.645 rows=0 loops=1)

This one I don't understand at all.  Clearly, the Hash Join is
the way to go, but the estimates are way off: it expects 20642
rows and gets none (which probably explains why this plan isn't
chosen in the first place).  It might be worth checking what the
planner actually believes about these columns; see the pg_stats
query below.

>           Hash Cond: ("outer".range = "inner".postcode_id)
>           ->  Seq Scan on data_main dm  (cost=0.00..1262.20 rows=81920 width=2) (actual time=0.005..105.548 rows=81920 loops=1)
>           ->  Hash  (cost=3.60..3.60 rows=160 width=2) (actual time=0.592..0.592 rows=0 loops=1)
>                 ->  Seq Scan on postcodes p  (cost=0.00..3.60 rows=160 width=2) (actual time=0.025..0.349 rows=160 loops=1)
>   Total runtime: 180.807 ms
> (7 rows)
> [...]
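
Something like this against the standard pg_stats view would show
what the planner is working with (just the columns that matter here):

SELECT tablename, attname, n_distinct, null_frac
FROM pg_stats
WHERE tablename IN ('data_main', 'postcodes');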

My completely amateur guess is that the planner can use the
Merge Join and Hash Join on your contrived queries because you
are joining on a plain equality (operator =), while the BETWEEN
clause is what forces the Nested Loop: Merge and Hash Joins need
an equality condition to match rows on, and a range predicate
doesn't give them one.  You can see that here:

                ->  Seq Scan on postcodes p  (cost=0.00..3.60 rows=160 width=4) (actual time=0.010..0.396 rows=160 loops=1)
vs. here:

          ->  Index Scan using postcodes_pkey on postcodes p  (cost=0.00..5.76 rows=160 width=2) (actual time=0.034..0.507 rows=160 loops=1)

So the first query forces a SeqScan on postcodes, while the
second can do an IndexScan.
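
To make that concrete, here are the two query shapes side by side
(schema assumed from your post).  Only the equality join gives the
planner a hashable or mergeable condition; the range join leaves it
no choice but a Nested Loop:

-- equality join: Hash Join (or Merge Join) is possible
EXPLAIN SELECT COUNT(*)
FROM data_main AS dm, postcodes AS p
WHERE dm.range = p.postcode_id;

-- range join: no hashable or mergeable operator, hence the Nested Loop
EXPLAIN SELECT COUNT(*)
FROM data_main AS dm, postcodes AS p
WHERE dm.range BETWEEN p.range_from AND p.range_till;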

--
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East,  Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129
