Re: Plan for relatively simple query seems to be very inefficient
From: Dave Held
Subject: Re: Plan for relatively simple query seems to be very inefficient
Msg-id: 49E94D0CFCD4DB43AFBA928DDD20C8F9026184A0@asg002.asg.local
In response to: Plan for relatively simple query seems to be very inefficient (Arjen van der Meijden <acmmailing@vulcanus.its.tudelft.nl>)
List: pgsql-performance
> -----Original Message-----
> From: Arjen van der Meijden [mailto:acmmailing@vulcanus.its.tudelft.nl]
> Sent: Wednesday, April 06, 2005 11:53 AM
> To: performance pgsql
> Subject: [PERFORM] Plan for relatively simple query seems to be very
> inefficient
>
> [...]
> SELECT COUNT(*) FROM
>   data_main AS dm,
>   postcodes AS p
> WHERE dm.range BETWEEN p.range_from AND p.range_till
> [...]
> Aggregate (cost=332586.85..332586.85 rows=1 width=0)
>   (actual time=22712.038..22712.039 rows=1 loops=1)
>   -> Nested Loop (cost=3.76..328945.96 rows=1456356 width=0)
>      (actual time=0.054..22600.826 rows=82688 loops=1)

I'm still a noob at reading EXPLAIN ANALYZE, but it seems to me that your
statistics are throwing off the planner here. It estimates 1.4M rows and
gets 82K, so it's off by a factor of almost 20. Have you considered
running VACUUM ANALYZE, or raising the statistics targets on the join
columns? (There's a sketch of both at the end of this message.)

> [...]
> When I do something completely bogus, which will result in coupling
> the data per record from data_main on one record from postcodes, it
> is still not very fast, but acceptable:
> [...]
> Aggregate (cost=10076.98..10076.98 rows=1 width=0)
>   (actual time=1456.016..1456.017 rows=1 loops=1)
>   -> Merge Join (cost=8636.81..9913.13 rows=65537 width=0)
>      (actual time=1058.105..1358.571 rows=81920 loops=1)

Looks like the Merge Join is faster than the Nested Loop for this query.
If you notice, the row counts are a lot closer to the estimates, too.
This is probably a "good" plan.

> [...]
> Doing something similarly bogus, but with fewer results, is much
> faster, even though it should have basically the same plan:
>
> SELECT COUNT(*) FROM
>   data_main AS dm,
>   postcodes AS p
> WHERE dm.range = p.postcode_id
> [...]
> Aggregate (cost=2138.63..2138.63 rows=1 width=0)
>   (actual time=180.667..180.668 rows=1 loops=1)
>   -> Hash Join (cost=4.00..2087.02 rows=20642 width=0)
>      (actual time=180.645..180.645 rows=0 loops=1)

This one I don't understand at all. Clearly the Hash Join is the way to
go, but the estimates are way off (which probably explains why this plan
isn't chosen in the first place).

>     Hash Cond: ("outer".range = "inner".postcode_id)
>     -> Seq Scan on data_main dm (cost=0.00..1262.20 rows=81920 width=2)
>        (actual time=0.005..105.548 rows=81920 loops=1)
>     -> Hash (cost=3.60..3.60 rows=160 width=2)
>        (actual time=0.592..0.592 rows=0 loops=1)
>        -> Seq Scan on postcodes p (cost=0.00..3.60 rows=160 width=2)
>           (actual time=0.025..0.349 rows=160 loops=1)
> Total runtime: 180.807 ms
> (7 rows)
> [...]

My completely amateur guess is that the planner is able to use a Merge
Join or a Hash Join on your contrived queries because you are only
joining one field to a single value (i.e. with operator=). The BETWEEN
clause is what forces the Nested Loop: BETWEEN expands to a pair of
inequality comparisons, and both hash and merge joins need an equality
condition, so the planner has nothing to hash or sort-merge on. (The
catalog query at the end of this message shows this directly.) You can
see the difference here:

-> Seq Scan on postcodes p (cost=0.00..3.60 rows=160 width=4)
   (actual time=0.010..0.396 rows=160 loops=1)

vs. here:

-> Index Scan using postcodes_pkey on postcodes p (cost=0.00..5.76
   rows=160 width=2) (actual time=0.034..0.507 rows=160 loops=1)

So the first query forces a Seq Scan on postcodes, while the second can
do an Index Scan.

__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East, Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129
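
P.S. Here is a minimal sketch of the statistics fix I mean, using only
stock PostgreSQL commands. The table and column names are taken from
your post; the target value of 100 is just an example (the default
per-column target is 10):

    -- Refresh the planner's statistics for both tables.
    ANALYZE data_main;
    ANALYZE postcodes;

    -- If the estimates are still off, raise the per-column statistics
    -- targets for the join columns, then re-analyze so the new targets
    -- take effect.
    ALTER TABLE data_main ALTER COLUMN range SET STATISTICS 100;
    ALTER TABLE postcodes ALTER COLUMN range_from SET STATISTICS 100;
    ALTER TABLE postcodes ALTER COLUMN range_till SET STATISTICS 100;
    ANALYZE data_main;
    ANALYZE postcodes;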
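
If you want to see what the planner would do for the BETWEEN query
without the Nested Loop, you can switch it off for one session as a
diagnostic (enable_nestloop is a standard planner setting; don't leave
it off for real work). It may still pick a nested loop if nothing else
can evaluate the join condition, but the estimated costs of the
alternatives will show up in the plan:

    SET enable_nestloop = off;
    EXPLAIN ANALYZE
    SELECT COUNT(*)
    FROM data_main AS dm,
         postcodes AS p
    WHERE dm.range BETWEEN p.range_from AND p.range_till;
    RESET enable_nestloop;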
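
And on the "BETWEEN forces a Nested Loop" point: hash joins are only
available for operators flagged as hashable in the system catalogs, and
the inequality operators behind BETWEEN are not. You can check the flag
yourself in pg_operator (the integer operands here are just an example;
substitute the actual types of your columns):

    SELECT oprname, oprcanhash
    FROM pg_operator
    WHERE oprname IN ('=', '<=', '>=')
      AND oprleft = 'integer'::regtype
      AND oprright = 'integer'::regtype;

    -- '=' comes back hashable (t); '<=' and '>=' do not (f), which is
    -- why the planner cannot build a hash join around BETWEEN.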