Massive performance issues - Mailing list pgsql-performance

From Matthew Sackman
Subject Massive performance issues
Date
Msg-id 20050901174231.GE23970@pongo.lshift.net
Whole thread Raw
Responses Re: Massive performance issues  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Massive performance issues  (Ron <rjpeace@earthlink.net>)
Re: Massive performance issues  ("Steinar H. Gunderson" <sgunderson@bigfoot.com>)
Re: Massive performance issues  (Sebastian Hennebrueder <usenet@laliluna.de>)
Re: Massive performance issues  (Arjen van der Meijden <acmmailing@tweakers.net>)
Re: Massive performance issues  (Michael Fuhr <mike@fuhr.org>)
Re: Massive performance issues  (Mark Kirkwood <markir@paradise.net.nz>)
Re: Massive performance issues  (Mark Kirkwood <markir@paradise.net.nz>)
Re: Massive performance issues  (Matthew Sackman <matthew@lshift.net>)
List pgsql-performance
Hi,

I'm having performance issues with a table consisting of 2,043,133 rows. The
schema is:

\d address
                  Table "public.address"
        Column        |          Type          | Modifiers
----------------------+------------------------+-----------
 postcode_top         | character varying(2)   | not null
 postcode_middle      | character varying(4)   | not null
 postcode_bottom      | character varying(7)   | not null
 postcode             | character varying(10)  | not null
 property_type        | character varying(15)  | not null
 sale_type            | character varying(10)  | not null
 flat_extra           | character varying(100) | not null
 number               | character varying(100) | not null
 street               | character varying(100) | not null
 locality_1           | character varying(100) | not null
 locality_2           | character varying(100) | not null
 city                 | character varying(100) | not null
 county               | character varying(100) | not null
Indexes:
    "address_city_index" btree (city)
    "address_county_index" btree (county)
    "address_locality_1_index" btree (locality_1)
    "address_locality_2_index" btree (locality_2)
    "address_pc_bottom_index" btree (postcode_bottom)
    "address_pc_middle_index" btree (postcode_middle)
    "address_pc_top_index" btree (postcode_top)
    "address_pc_top_middle_bottom_index" btree (postcode_top,
                             postcode_middle, postcode_bottom)
    "address_pc_top_middle_index" btree (postcode_top, postcode_middle)
    "address_postcode_index" btree (postcode)
    "address_property_type_index" btree (property_type)
    "address_street_index" btree (street)
    "street_prefix" btree (lower("substring"((street)::text, 1, 1)))

This is with postgresql 7.4 running on linux 2.6.11 with a 3GHz P4 and a
SATA harddrive.

Queries such as:

select locality_2 from address where locality_2 = 'Manchester';

are taking 14 seconds to complete, and this is only 2 years worth of
data - we will have up to 15 years (so over 15 million rows).

Interestingly, doing:
explain select locality_2 from address where locality_2 = 'Manchester';
gives
                           QUERY PLAN
----------------------------------------------------------------
 Seq Scan on address  (cost=0.00..80677.16 rows=27923 width=12)
   Filter: ((locality_2)::text = 'Manchester'::text)

but:
explain select locality_1 from address where locality_1 = 'Manchester';
gives
                           QUERY PLAN
----------------------------------------------------------------
 Index Scan using address_locality_1_index on address
(cost=0.00..69882.18 rows=17708 width=13)
   Index Cond: ((locality_1)::text = 'Manchester'::text)

Sadly, using the index makes things worse, the query taking 17 seconds.

locality_1 has 16650 distinct values and locality_2 has 1156 distinct
values.

Whilst the locality_2 query is in progress, both the disk and the CPU
are maxed out with the disk constantly reading at 60MB/s and the CPU
rarely dropping under 100% load.

With the locality_1 query in progress, the CPU is maxed out but the disk
is reading at just 3MB/s.

Obviously, to me, this is a problem, I need these queries to be under a
second to complete. Is this unreasonable? What can I do to make this "go
faster"? I've considered normalising the table but I can't work out
whether the slowness is in dereferencing the pointers from the index
into the table or in scanning the index in the first place. And
normalising the table is going to cause much pain when inserting values
and I'm not entirely sure if I see why normalising it should cause a
massive performance improvement.

I need to get to the stage where I can run queries such as:
select street, locality_1, locality_2, city from address
where (city = 'Nottingham' or locality_2 = 'Nottingham'
       or locality_1 = 'Nottingham')
  and upper(substring(street from 1 for 1)) = 'A'
group by street, locality_1, locality_2, city
order by street
limit 20 offset 0

and have the results very quickly.

Any help most gratefully received (even if it's to say that I should be
posting to a different mailing list!).

Many thanks,

Matthew


pgsql-performance by date:

Previous
From: "Luke Lonergan"
Date:
Subject: Re: Need for speed 3
Next
From: "Merlin Moncure"
Date:
Subject: Re: Massive performance issues