From: Arjen van der Meijden
Subject: Plan for relatively simple query seems to be very inefficient
Msg-id: 425413D3.5030304@vulcanus.its.tudelft.nl
List: pgsql-performance
Responses:
  Re: Plan for relatively simple query seems to be very inefficient (Steve Atkins <steve@blighty.com>)
  Re: Plan for relatively simple query seems to be very inefficient (Tom Lane <tgl@sss.pgh.pa.us>)
  Re: Plan for relatively simple query seems to be very inefficient (Mischa <mischa.Sandberg@telus.net>)
Hi list,

I noticed on a forum a query that took a surprisingly long time in MySQL. Of
course I wanted to prove that PostgreSQL 8.0.1 could do it much better. To my
surprise, PostgreSQL was ten times slower on the same machine, and I don't
understand why.

I don't really need this query to be fast, since I don't actually use it, but
such a range query is not uncommon, I suppose. So I'm wondering why it is so
slow; it may point to a wrong plan being chosen or generated.

Here are table definitions:

         Table "public.postcodes"
    Column    |     Type      | Modifiers
-------------+---------------+-----------
  postcode_id | smallint      | not null
  range_from  | smallint      |
  range_till  | smallint      |
Indexes:
     "postcodes_pkey" PRIMARY KEY, btree (postcode_id)
     "range" UNIQUE, btree (range_from, range_till)

    Table "public.data_main"
  Column |   Type   | Modifiers
--------+----------+-----------
  userid | integer  | not null
  range  | smallint |
Indexes:
     "data_main_pkey" PRIMARY KEY, btree (userid)

And here's the query I ran:

SELECT COUNT(*) FROM
data_main AS dm,
postcodes AS p
WHERE dm.range BETWEEN p.range_from AND p.range_till

                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=332586.85..332586.85 rows=1 width=0) (actual time=22712.038..22712.039 rows=1 loops=1)
   ->  Nested Loop  (cost=3.76..328945.96 rows=1456356 width=0) (actual time=0.054..22600.826 rows=82688 loops=1)
         Join Filter: (("outer".range >= "inner".range_from) AND ("outer".range <= "inner".range_till))
         ->  Seq Scan on data_main dm  (cost=0.00..1262.20 rows=81920 width=2) (actual time=0.020..136.930 rows=81920 loops=1)
         ->  Materialize  (cost=3.76..5.36 rows=160 width=4) (actual time=0.001..0.099 rows=160 loops=81920)
               ->  Seq Scan on postcodes p  (cost=0.00..3.60 rows=160 width=4) (actual time=0.010..0.396 rows=160 loops=1)
 Total runtime: 22712.211 ms
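
For what it's worth, I imagine a rewrite that lets each postcode range probe an
index on data_main.range could fare better than filtering all 81920 rows once
per outer row. This is only a hypothetical sketch; the index data_main_range
does not exist in the setup above:

```sql
-- Hypothetical: with an index on data_main.range, each of the 160
-- postcode rows can be answered by an index range scan instead of a
-- nested-loop filter over all of data_main.
CREATE INDEX data_main_range ON data_main (range);
ANALYZE data_main;

SELECT SUM(cnt) FROM (
    SELECT (SELECT COUNT(*)
            FROM data_main dm
            WHERE dm.range BETWEEN p.range_from AND p.range_till) AS cnt
    FROM postcodes p
) AS per_range;
```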


When I do something completely bogus, which couples each record from data_main
to one record from postcodes, it is still not very fast, but acceptable:

SELECT COUNT(*) FROM
data_main AS dm,
postcodes AS p
WHERE dm.range / 10 = p.postcode_id


                                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=10076.98..10076.98 rows=1 width=0) (actual time=1456.016..1456.017 rows=1 loops=1)
   ->  Merge Join  (cost=8636.81..9913.13 rows=65537 width=0) (actual time=1058.105..1358.571 rows=81920 loops=1)
         Merge Cond: ("outer".postcode_id = "inner"."?column2?")
         ->  Index Scan using postcodes_pkey on postcodes p  (cost=0.00..5.76 rows=160 width=2) (actual time=0.034..0.507 rows=160 loops=1)
         ->  Sort  (cost=8636.81..8841.61 rows=81920 width=2) (actual time=1057.698..1169.879 rows=81920 loops=1)
               Sort Key: (dm.range / 10)
               ->  Seq Scan on data_main dm  (cost=0.00..1262.20 rows=81920 width=2) (actual time=0.020..238.886 rows=81920 loops=1)
 Total runtime: 1461.156 ms
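
Most of the time here goes into the Sort on the expression dm.range / 10.
Possibly an expression index matching the join key would let the planner read
data_main in sorted order instead; this is a hypothetical sketch, the index
below does not exist in my setup:

```sql
-- Hypothetical: an index on the join expression could let the merge
-- join use an index scan on data_main rather than an explicit Sort.
CREATE INDEX data_main_range_div10 ON data_main ((range / 10));
ANALYZE data_main;
```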


Doing something similarly bogus, but with fewer results, is much faster, even
though it should use basically the same plan:

SELECT COUNT(*) FROM
data_main AS dm,
postcodes AS p
WHERE dm.range  = p.postcode_id


                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2138.63..2138.63 rows=1 width=0) (actual time=180.667..180.668 rows=1 loops=1)
   ->  Hash Join  (cost=4.00..2087.02 rows=20642 width=0) (actual time=180.645..180.645 rows=0 loops=1)
         Hash Cond: ("outer".range = "inner".postcode_id)
         ->  Seq Scan on data_main dm  (cost=0.00..1262.20 rows=81920 width=2) (actual time=0.005..105.548 rows=81920 loops=1)
         ->  Hash  (cost=3.60..3.60 rows=160 width=2) (actual time=0.592..0.592 rows=0 loops=1)
               ->  Seq Scan on postcodes p  (cost=0.00..3.60 rows=160 width=2) (actual time=0.025..0.349 rows=160 loops=1)
 Total runtime: 180.807 ms
(7 rows)

If you'd like to toy around with the datasets on your heavily optimized
postgresql-installs, let me know. The data is just generated for
testing-purposes, and I'd happily send a copy to anyone interested.
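
A dataset of roughly the same shape can be generated along these lines. This is
only a sketch matching the row counts above; the actual ranges and distribution
in my copy differ:

```sql
-- Sketch: 160 non-overlapping postcode ranges and 81920 users with a
-- random range value, roughly matching the row counts in the plans.
INSERT INTO postcodes (postcode_id, range_from, range_till)
SELECT i, i * 100, i * 100 + 99
FROM generate_series(0, 159) AS s(i);

INSERT INTO data_main (userid, range)
SELECT i, (random() * 16000)::smallint
FROM generate_series(1, 81920) AS s(i);
```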

Best regards,

Arjen van der Meijden
