Thread: Massive performance issues

Massive performance issues

From
Matthew Sackman
Date:
Hi,

I'm having performance issues with a table consisting of 2,043,133 rows. The
schema is:

\d address
                  Table "public.address"
        Column        |          Type          | Modifiers
----------------------+------------------------+-----------
 postcode_top         | character varying(2)   | not null
 postcode_middle      | character varying(4)   | not null
 postcode_bottom      | character varying(7)   | not null
 postcode             | character varying(10)  | not null
 property_type        | character varying(15)  | not null
 sale_type            | character varying(10)  | not null
 flat_extra           | character varying(100) | not null
 number               | character varying(100) | not null
 street               | character varying(100) | not null
 locality_1           | character varying(100) | not null
 locality_2           | character varying(100) | not null
 city                 | character varying(100) | not null
 county               | character varying(100) | not null
Indexes:
    "address_city_index" btree (city)
    "address_county_index" btree (county)
    "address_locality_1_index" btree (locality_1)
    "address_locality_2_index" btree (locality_2)
    "address_pc_bottom_index" btree (postcode_bottom)
    "address_pc_middle_index" btree (postcode_middle)
    "address_pc_top_index" btree (postcode_top)
    "address_pc_top_middle_bottom_index" btree (postcode_top,
                             postcode_middle, postcode_bottom)
    "address_pc_top_middle_index" btree (postcode_top, postcode_middle)
    "address_postcode_index" btree (postcode)
    "address_property_type_index" btree (property_type)
    "address_street_index" btree (street)
    "street_prefix" btree (lower("substring"((street)::text, 1, 1)))

This is with postgresql 7.4 running on linux 2.6.11 with a 3GHz P4 and a
SATA harddrive.

Queries such as:

select locality_2 from address where locality_2 = 'Manchester';

are taking 14 seconds to complete, and this is only 2 years worth of
data - we will have up to 15 years (so over 15 million rows).

Interestingly, doing:
explain select locality_2 from address where locality_2 = 'Manchester';
gives
                           QUERY PLAN
----------------------------------------------------------------
 Seq Scan on address  (cost=0.00..80677.16 rows=27923 width=12)
   Filter: ((locality_2)::text = 'Manchester'::text)

but:
explain select locality_1 from address where locality_1 = 'Manchester';
gives
                           QUERY PLAN
----------------------------------------------------------------
 Index Scan using address_locality_1_index on address
(cost=0.00..69882.18 rows=17708 width=13)
   Index Cond: ((locality_1)::text = 'Manchester'::text)

Sadly, using the index makes things worse, the query taking 17 seconds.

locality_1 has 16650 distinct values and locality_2 has 1156 distinct
values.

Whilst the locality_2 query is in progress, both the disk and the CPU
are maxed out with the disk constantly reading at 60MB/s and the CPU
rarely dropping under 100% load.

With the locality_1 query in progress, the CPU is maxed out but the disk
is reading at just 3MB/s.

Obviously, to me, this is a problem, I need these queries to be under a
second to complete. Is this unreasonable? What can I do to make this "go
faster"? I've considered normalising the table but I can't work out
whether the slowness is in dereferencing the pointers from the index
into the table or in scanning the index in the first place. And
normalising the table is going to cause much pain when inserting values
and I'm not entirely sure if I see why normalising it should cause a
massive performance improvement.

I need to get to the stage where I can run queries such as:
select street, locality_1, locality_2, city from address
where (city = 'Nottingham' or locality_2 = 'Nottingham'
       or locality_1 = 'Nottingham')
  and upper(substring(street from 1 for 1)) = 'A'
group by street, locality_1, locality_2, city
order by street
limit 20 offset 0

and have the results very quickly.

Any help most gratefully received (even if it's to say that I should be
posting to a different mailing list!).

Many thanks,

Matthew


Re: Massive performance issues

From
"Merlin Moncure"
Date:
> I'm having performance issues with a table consisting of 2,043,133 rows.
> The schema is:

> locality_1 has 16650 distinct values and locality_2 has 1156 distinct
> values.

Just so you know I have a 2GHz p4 workstation with similar size (2M
rows), several keys, and can find and fetch 2k rows based on 20k unique
value key in about 60 ms. (.06 seconds).

Merlin

Re: Massive performance issues

From
"Merlin Moncure"
Date:
>                   Table "public.address"
>         Column        |          Type          | Modifiers
> ----------------------+------------------------+-----------
>  postcode_top         | character varying(2)   | not null
>  postcode_middle      | character varying(4)   | not null
>  postcode_bottom      | character varying(7)   | not null


consider making the above fields char(x), not varchar(x), for small but
important savings.

>  postcode             | character varying(10)  | not null
>  property_type        | character varying(15)  | not null
>  sale_type            | character varying(10)  | not null
>  flat_extra           | character varying(100) | not null
>  number               | character varying(100) | not null
>  street               | character varying(100) | not null
>  locality_1           | character varying(100) | not null
>  locality_2           | character varying(100) | not null
>  city                 | character varying(100) | not null
>  county               | character varying(100) | not null
> Indexes:
>     "address_city_index" btree (city)
>     "address_county_index" btree (county)
>     "address_locality_1_index" btree (locality_1)
>     "address_locality_2_index" btree (locality_2)
>     "address_pc_bottom_index" btree (postcode_bottom)
>     "address_pc_middle_index" btree (postcode_middle)
>     "address_pc_top_index" btree (postcode_top)
>     "address_pc_top_middle_bottom_index" btree (postcode_top,
>                              postcode_middle, postcode_bottom)
>     "address_pc_top_middle_index" btree (postcode_top,
postcode_middle)
>     "address_postcode_index" btree (postcode)
>     "address_property_type_index" btree (property_type)
>     "address_street_index" btree (street)
>     "street_prefix" btree (lower("substring"((street)::text, 1, 1)))
>
> Obviously, to me, this is a problem, I need these queries to be under a
> second to complete. Is this unreasonable? What can I do to make this "go
> faster"? I've considered normalising the table but I can't work out
> whether the slowness is in dereferencing the pointers from the index
> into the table or in scanning the index in the first place. And
> normalising the table is going to cause much pain when inserting values
> and I'm not entirely sure if I see why normalising it should cause a
> massive performance improvement.

http://www.dbdebunk.com :)

> I need to get to the stage where I can run queries such as:
> select street, locality_1, locality_2, city from address
> where (city = 'Nottingham' or locality_2 = 'Nottingham'
>        or locality_1 = 'Nottingham')
>   and upper(substring(street from 1 for 1)) = 'A'
> group by street, locality_1, locality_2, city
> order by street
> limit 20 offset 0
>
> and have the results very quickly.
>
> Any help most gratefully received (even if it's to say that I should be
> posting to a different mailing list!).

this is the correct list.  did you run vacuum/analyze, etc?
Please post vacuum analyze times.

Merlin

Re: Massive performance issues

From
Matthew Sackman
Date:
On Thu, Sep 01, 2005 at 02:47:06PM -0400, Tom Lane wrote:
> Matthew Sackman <matthew@lshift.net> writes:
> > Obviously, to me, this is a problem, I need these queries to be under a
> > second to complete. Is this unreasonable?
>
> Yes.  Pulling twenty thousand rows at random from a table isn't free.

I appreciate that. But I'm surprised by how un-free it seems to be.
And it seems others here have the performance I need on similar hardware.

> You were pretty vague about your disk hardware, which makes me think
> you didn't spend a lot of money on it ... and on low-ball hardware,
> that sort of random access speed just isn't gonna happen.

Well, this is a development box. But the live box wouldn't be much more
than RAID 1 on SCSI 10ks so that should only be a halving of seek time,
not the 1000 times reduction I'm after!

In fact, now I think about it, I have been testing on a 2.4 kernel on a
dual HT 3GHz Xeon with SCSI RAID array and the performance is only
marginally better.

> If the queries you need are very consistent, you might be able to get
> some mileage out of CLUSTERing by the relevant index ... but the number
> of indexes you've created makes me think that's not so ...

No, the queries, whilst in just three distinct forms, will effectively
be for fairly random values.

Matthew

Re: Massive performance issues

From
Matthew Sackman
Date:
On Thu, Sep 01, 2005 at 02:04:54PM -0400, Merlin Moncure wrote:
> > Any help most gratefully received (even if it's to say that I should be
> > posting to a different mailing list!).
>
> this is correct list.  did you run vacuum/analyze, etc?
> Please post vacuum analyze times.

2005-09-01 19:47:08 LOG:  statement: vacuum full analyze address;
2005-09-01 19:48:44 LOG:  duration: 96182.777 ms

2005-09-01 19:50:20 LOG:  statement: vacuum analyze address;
2005-09-01 19:51:48 LOG:  duration: 87675.268 ms

I run them regularly, pretty much after every bulk import.

Matthew

Re: Massive performance issues

From
Tom Lane
Date:
Matthew Sackman <matthew@lshift.net> writes:
> Obviously, to me, this is a problem, I need these queries to be under a
> second to complete. Is this unreasonable?

Yes.  Pulling twenty thousand rows at random from a table isn't free.
You were pretty vague about your disk hardware, which makes me think
you didn't spend a lot of money on it ... and on low-ball hardware,
that sort of random access speed just isn't gonna happen.

If the queries you need are very consistent, you might be able to get
some mileage out of CLUSTERing by the relevant index ... but the number
of indexes you've created makes me think that's not so ...
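
For reference, a minimal sketch of the CLUSTER approach, assuming the
index names from the schema above (7.4 syntax; CLUSTER rewrites the table
in index order and takes an exclusive lock, so it is normally run in a
maintenance window):

CLUSTER address_locality_2_index ON address;
ANALYZE address;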

            regards, tom lane

Re: Massive performance issues

From
"Joel Fradkin"
Date:
Any chance it's a vacuum thing?
Or configuration (out of the box it needs adjusting)?

Joel Fradkin



-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Merlin Moncure
Sent: Thursday, September 01, 2005 2:11 PM
To: Matthew Sackman
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Massive performance issues

> I'm having performance issues with a table consisting of 2,043,133 rows.
> The schema is:

> locality_1 has 16650 distinct values and locality_2 has 1156 distinct
> values.

Just so you know I have a 2GHz p4 workstation with similar size (2M
rows), several keys, and can find and fetch 2k rows based on 20k unique
value key in about 60 ms. (.06 seconds).

Merlin



Re: Massive performance issues

From
Alvaro Herrera
Date:
On Thu, Sep 01, 2005 at 03:51:35PM -0400, Merlin Moncure wrote:

> > Huh, hang on -- AFAIK there's no saving at all by doing that.  Quite
> > the opposite really, because with char(x) you store the padding
> > blanks, which are omitted with varchar(x), so less I/O (not
> > necessarily a measurable amount, mind you, maybe even zero because
> > of padding issues.)
>
> You are right, all this time I thought there was a 4 byte penalty for
> storing varchar type and not in char :(.  So there is no reason at all
> to use the char type?

Other than SQL conformance, apparently not.

--
Alvaro Herrera -- Valdivia, Chile         Architect, www.EnterpriseDB.com
Bob [Floyd] used to say that he was planning to get a Ph.D. by the "green
stamp method," namely by saving envelopes addressed to him as 'Dr. Floyd'.
After collecting 500 such letters, he mused, a university somewhere in
Arizona would probably grant him a degree.              (Don Knuth)

Re: Massive performance issues

From
"Merlin Moncure"
Date:
> -----Original Message-----
> From: Alvaro Herrera [mailto:alvherre@alvh.no-ip.org]
> Sent: Thursday, September 01, 2005 3:34 PM
> To: Merlin Moncure
> Cc: Matthew Sackman; pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Massive performance issues
>
> On Thu, Sep 01, 2005 at 02:04:54PM -0400, Merlin Moncure wrote:
> > >                   Table "public.address"
> > >         Column        |          Type          | Modifiers
> > > ----------------------+------------------------+-----------
> > >  postcode_top         | character varying(2)   | not null
> > >  postcode_middle      | character varying(4)   | not null
> > >  postcode_bottom      | character varying(7)   | not null
> >
> > consider making above fields char(x) not varchar(x) for small but
> > important savings.
>
> Huh, hang on -- AFAIK there's no saving at all by doing that.  Quite the
> opposite really, because with char(x) you store the padding blanks,
> which are omitted with varchar(x), so less I/O (not necessarily a
> measurable amount, mind you, maybe even zero because of padding issues.)

You are right, all this time I thought there was a 4 byte penalty for
storing varchar type and not in char :(.  So there is no reason at all
to use the char type?

Merlin


Re: Massive performance issues

From
Ron
Date:
This should be able to run _very_ fast.

At 01:42 PM 9/1/2005, Matthew Sackman wrote:
>Hi,
>
>I'm having performance issues with a table consisting of 2,043,133 rows. The
>schema is:
>
>\d address
>                   Table "public.address"
>         Column        |          Type          | Modifiers
>----------------------+------------------------+-----------
>  postcode_top         | character varying(2)   | not null
>  postcode_middle      | character varying(4)   | not null
>  postcode_bottom      | character varying(7)   | not null
>  postcode             | character varying(10)  | not null
>  property_type        | character varying(15)  | not null
>  sale_type            | character varying(10)  | not null
>  flat_extra           | character varying(100) | not null
>  number               | character varying(100) | not null
>  street               | character varying(100) | not null
>  locality_1           | character varying(100) | not null
>  locality_2           | character varying(100) | not null
>  city                 | character varying(100) | not null
>  county               | character varying(100) | not null
>Indexes:
>     "address_city_index" btree (city)
>     "address_county_index" btree (county)
>     "address_locality_1_index" btree (locality_1)
>     "address_locality_2_index" btree (locality_2)
>     "address_pc_bottom_index" btree (postcode_bottom)
>     "address_pc_middle_index" btree (postcode_middle)
>     "address_pc_top_index" btree (postcode_top)
>     "address_pc_top_middle_bottom_index" btree (postcode_top,
>                              postcode_middle, postcode_bottom)
>     "address_pc_top_middle_index" btree (postcode_top, postcode_middle)
>     "address_postcode_index" btree (postcode)
>     "address_property_type_index" btree (property_type)
>     "address_street_index" btree (street)
>     "street_prefix" btree (lower("substring"((street)::text, 1, 1)))

IOW, each row takes ~1KB on HD.  First suggestion: format your HD to
use 8KB pages with 1KB segments.  That'll put each row down on HD as
an atomic unit.  8KB pages also "play nice" with pg.

At 1KB per row, this table takes up ~2.1GB and should fit into RAM
fairly easily on a decently configured DB server (my _laptop_ has 2GB
of RAM after all...)

Since you are using ~2.1GB for 2 years worth of data, 15 years worth
should take no more than 2.1GB*7.5= 15.75GB.

If you replace some of those 100 char fields with integers for code
numbers and have an auxiliary table for each of those fields mapping
the code numbers to the associated 100 char string, you should be
able to shrink a row considerably.  Your target is to have each row
take <= 512B.  Once a row fits into one 512B sector on HD, there's
no point in making it smaller unless you can shrink it enough to fit
2 rows into one sector (<= 256B).  Once two rows fit into one sector,
there's no point shrinking a row unless you can make 3 rows fit into
a sector.  Etc.

Assuming each 100 char (eg 100B) field can be replaced with a 4B int,
each row could be as small as 76B.  That makes 85B per row the goal
as it would allow you to fit 6 rows per 512B HD sector.  So in the
best case your table will be 12x smaller in terms of real HD space.

Fitting one (or more) row(s) into one sector will cut down the real
space used on HD for the table to ~7.88GB (or 1.32GB in the best
case).  Any such streamlining will make it faster to load, make the
working set that needs to be RAM for best performance smaller, etc, etc.
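
As a rough sketch of that layout (table, column and constraint names here
are illustrative, not taken from the schema above):

-- auxiliary tables map small integer codes to the long strings
CREATE TABLE city_lookup   (id serial PRIMARY KEY, name varchar(100) NOT NULL UNIQUE);
CREATE TABLE street_lookup (id serial PRIMARY KEY, name varchar(100) NOT NULL UNIQUE);

-- the main table then stores 4B integers instead of up-to-100B strings
CREATE TABLE address_narrow (
    postcode      varchar(10) NOT NULL,
    property_type varchar(15) NOT NULL,
    sale_type     varchar(10) NOT NULL,
    street_id     integer NOT NULL REFERENCES street_lookup(id),
    city_id       integer NOT NULL REFERENCES city_lookup(id)
    -- ... the remaining 100-char columns treated the same way ...
);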


>This is with postgresql 7.4 running on linux 2.6.11 with a 3GHz P4
>and a SATA harddrive.

Upgrade pg to 8.0.3 and make sure you have enough RAM for your real
day to day load.  Frankly, RAM is so cheap ($75-$150/GB), I'd just
upgrade the machine to 4GB as a matter of course.  P4's have PAE, so
if your mainboard can hold it, put more than 4GB of RAM in if you
find you need it.

Since you are describing your workload as being predominantly reads,
you can get away with far less HD capability as long as you crank up
RAM high enough to hold the working set of the DB.  The indications
from the OP are that you may very well be able to hold the entire DB
in RAM.  That's a big win whenever you can achieve it.

After these steps, there may still be performance issues that need
attention, but the DBMS should be _much_ faster.

Ron Peacetree



Re: Massive performance issues

From
"Steinar H. Gunderson"
Date:
On Thu, Sep 01, 2005 at 06:42:31PM +0100, Matthew Sackman wrote:
>  flat_extra           | character varying(100) | not null
>  number               | character varying(100) | not null
>  street               | character varying(100) | not null
>  locality_1           | character varying(100) | not null
>  locality_2           | character varying(100) | not null
>  city                 | character varying(100) | not null
>  county               | character varying(100) | not null

Having these fixed probably won't give you any noticeable improvements;
unless there's something natural about your data setting 100 as a hard limit,
you could just as well drop these.

>     "address_city_index" btree (city)
>     "address_county_index" btree (county)
>     "address_locality_1_index" btree (locality_1)
>     "address_locality_2_index" btree (locality_2)
>     "address_pc_bottom_index" btree (postcode_bottom)
>     "address_pc_middle_index" btree (postcode_middle)
>     "address_pc_top_index" btree (postcode_top)
>     "address_pc_top_middle_bottom_index" btree (postcode_top,
>                              postcode_middle, postcode_bottom)
>     "address_pc_top_middle_index" btree (postcode_top, postcode_middle)
>     "address_postcode_index" btree (postcode)
>     "address_property_type_index" btree (property_type)
>     "address_street_index" btree (street)
>     "street_prefix" btree (lower("substring"((street)::text, 1, 1)))

Wow, that's quite a lot of indexes... but your problem isn't reported as
being in insert/update/delete.

> This is with postgresql 7.4 running on linux 2.6.11 with a 3GHz P4 and a
> SATA harddrive.

8.0 or 8.1 might help you some -- better (and more!) disks will probably help
a _lot_.

> Queries such as:
>
> select locality_2 from address where locality_2 = 'Manchester';
>
> are taking 14 seconds to complete, and this is only 2 years worth of
> data - we will have up to 15 years (so over 15 million rows).

As Tom pointed out; you're effectively doing random searches here, and using
CLUSTER might help. Normalizing your data to get smaller rows (and avoid
possibly costly string comparisons if your strcoll() is slow) will probably
also help.

> I need to get to the stage where I can run queries such as:
> select street, locality_1, locality_2, city from address
> where (city = 'Nottingham' or locality_2 = 'Nottingham'
>        or locality_1 = 'Nottingham')
>   and upper(substring(street from 1 for 1)) = 'A'
> group by street, locality_1, locality_2, city
> order by street
> limit 20 offset 0

This might be a lot quicker than pulling all the records like in your example
queries...

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Massive performance issues

From
Alvaro Herrera
Date:
On Thu, Sep 01, 2005 at 02:04:54PM -0400, Merlin Moncure wrote:
> >                   Table "public.address"
> >         Column        |          Type          | Modifiers
> > ----------------------+------------------------+-----------
> >  postcode_top         | character varying(2)   | not null
> >  postcode_middle      | character varying(4)   | not null
> >  postcode_bottom      | character varying(7)   | not null
>
> consider making above fields char(x) not varchar(x) for small but
> important savings.

Huh, hang on -- AFAIK there's no saving at all by doing that.  Quite the
opposite really, because with char(x) you store the padding blanks,
which are omitted with varchar(x), so less I/O (not necessarily a
measurable amount, mind you, maybe even zero because of padding issues.)

--
Alvaro Herrera -- Valdivia, Chile         Architect, www.EnterpriseDB.com
You liked Linux a lot when he was just the gawky kid from down the block
mowing your lawn or shoveling the snow. But now that he wants to date
your daughter, you're not so sure he measures up. (Larry Greenemeier)

Re: Massive performance issues

From
Tom Lane
Date:
Ron <rjpeace@earthlink.net> writes:
> ...  Your target is to have each row take <= 512B.

Ron, are you assuming that the varchar fields are blank-padded or
something?  I think it's highly unlikely that he's got more than a
couple hundred bytes per row right now --- at least if the data is
what it sounds like.

The upthread comment about strcoll() set off some alarm bells in my head.
If the database wasn't initdb'd in C locale already, try making it so.
Also, use a single-byte encoding if you can (LatinX is fine, Unicode not).
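
A quick way to check both (they are read-only settings; changing them
means re-initdb'ing the cluster, e.g. "initdb --locale=C -E LATIN1"):

SHOW lc_collate;        -- 'C' avoids locale-aware strcoll() comparisons
SHOW server_encoding;   -- a single-byte encoding such as LATIN1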

> Upgrade pg to 8.0.3 and make sure you have enough RAM for your real
> day to day load.

Newer PG definitely better.  Some attention to the configuration
parameters might also be called for.  I fear though that these things
are probably just chipping at the margins ...

            regards, tom lane

Re: Massive performance issues

From
Sebastian Hennebrueder
Date:
Matthew Sackman wrote:

>Hi,
>
>I'm having performance issues with a table consisting of 2,043,133 rows. The
>schema is:
>
>\d address
>                  Table "public.address"
>        Column        |          Type          | Modifiers
>----------------------+------------------------+-----------
> postcode_top         | character varying(2)   | not null
> postcode_middle      | character varying(4)   | not null
> postcode_bottom      | character varying(7)   | not null
> postcode             | character varying(10)  | not null
> property_type        | character varying(15)  | not null
> sale_type            | character varying(10)  | not null
> flat_extra           | character varying(100) | not null
> number               | character varying(100) | not null
> street               | character varying(100) | not null
> locality_1           | character varying(100) | not null
> locality_2           | character varying(100) | not null
> city                 | character varying(100) | not null
> county               | character varying(100) | not null
>Indexes:
>    "address_city_index" btree (city)
>    "address_county_index" btree (county)
>    "address_locality_1_index" btree (locality_1)
>    "address_locality_2_index" btree (locality_2)
>    "address_pc_bottom_index" btree (postcode_bottom)
>    "address_pc_middle_index" btree (postcode_middle)
>    "address_pc_top_index" btree (postcode_top)
>    "address_pc_top_middle_bottom_index" btree (postcode_top,
>                             postcode_middle, postcode_bottom)
>    "address_pc_top_middle_index" btree (postcode_top, postcode_middle)
>    "address_postcode_index" btree (postcode)
>    "address_property_type_index" btree (property_type)
>    "address_street_index" btree (street)
>    "street_prefix" btree (lower("substring"((street)::text, 1, 1)))
>
>This is with postgresql 7.4 running on linux 2.6.11 with a 3GHz P4 and a
>SATA harddrive.
>
>Queries such as:
>
>select locality_2 from address where locality_2 = 'Manchester';
>
>are taking 14 seconds to complete, and this is only 2 years worth of
>data - we will have up to 15 years (so over 15 million rows).
>
>Interestingly, doing:
>explain select locality_2 from address where locality_2 = 'Manchester';
>gives
>                           QUERY PLAN
>----------------------------------------------------------------
> Seq Scan on address  (cost=0.00..80677.16 rows=27923 width=12)
>   Filter: ((locality_2)::text = 'Manchester'::text)
>
>but:
>explain select locality_1 from address where locality_1 = 'Manchester';
>gives
>                           QUERY PLAN
>----------------------------------------------------------------
> Index Scan using address_locality_1_index on address
>(cost=0.00..69882.18 rows=17708 width=13)
>   Index Cond: ((locality_1)::text = 'Manchester'::text)
>
>Sadly, using the index makes things worse, the query taking 17 seconds.
>
>locality_1 has 16650 distinct values and locality_2 has 1156 distinct
>values.
>
>Whilst the locality_2 query is in progress, both the disk and the CPU
>are maxed out with the disk constantly reading at 60MB/s and the CPU
>rarely dropping under 100% load.
>
>With the locality_1 query in progress, the CPU is maxed out but the disk
>is reading at just 3MB/s.
>
>Obviously, to me, this is a problem, I need these queries to be under a
>second to complete. Is this unreasonable? What can I do to make this "go
>faster"? I've considered normalising the table but I can't work out
>whether the slowness is in dereferencing the pointers from the index
>into the table or in scanning the index in the first place. And
>normalising the table is going to cause much pain when inserting values
>and I'm not entirely sure if I see why normalising it should cause a
>massive performance improvement.
>
>
>
Just an idea: when you do not want to adapt your application to use a
normalized database, you may push the data into normalized tables using
triggers.
Example:
Add a table city with columns id and name,
and add a column city_id to your main table.
In this case you have redundant data in your main table (locality_1 and
city_id), but you could query the city table when searching for
'Man%'.
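
A rough sketch of such a trigger, assuming plpgsql is installed and using
hypothetical names (7.4-style quoting, so the body is a string literal;
not safe against two sessions inserting a brand-new city concurrently):

CREATE TABLE city (id serial PRIMARY KEY, name varchar(100) UNIQUE NOT NULL);
ALTER TABLE address ADD COLUMN city_id integer;

CREATE FUNCTION address_set_city_id() RETURNS trigger AS '
BEGIN
    -- register the city name if it has not been seen before
    INSERT INTO city (name)
        SELECT NEW.city
        WHERE NOT EXISTS (SELECT 1 FROM city WHERE name = NEW.city);
    SELECT INTO NEW.city_id id FROM city WHERE name = NEW.city;
    RETURN NEW;
END;
' LANGUAGE plpgsql;

CREATE TRIGGER address_city_id_trg BEFORE INSERT OR UPDATE ON address
    FOR EACH ROW EXECUTE PROCEDURE address_set_city_id();

Searches for 'Man%' could then go against the small city table instead of
the 2M-row address table.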

--
Best Regards / Viele Grüße

Sebastian Hennebrueder

----

http://www.laliluna.de

Tutorials for JSP, JavaServer Faces, Struts, Hibernate and EJB

Get support, education and consulting for these technologies.


Re: Massive performance issues

From
Arjen van der Meijden
Date:
On 1-9-2005 19:42, Matthew Sackman wrote:
> Obviously, to me, this is a problem, I need these queries to be under a
> second to complete. Is this unreasonable? What can I do to make this "go
> faster"? I've considered normalising the table but I can't work out
> whether the slowness is in dereferencing the pointers from the index
> into the table or in scanning the index in the first place. And
> normalising the table is going to cause much pain when inserting values
> and I'm not entirely sure if I see why normalising it should cause a
> massive performance improvement.

In this case, I think normalising will give a major decrease in on-disk
table-size of this large table and the indexes you have. If that's the
case, that in itself will speed-up all i/o-bound queries quite a bit.

locality_1, _2, city and county can probably be normalised away without
much problem, and going from varchars to integers will probably save
you quite a bit of (disk) space.

It won't change the selectivity of the indexes, though, so you won't get
more index scans instead of sequential scans, I suppose.
I think it's not that hard to create a normalized set of tables from this
data-set (using insert into tablename select distinct ... from address
and such, and insert into address_new (..., city) select ... (select cityid
from cities where city = address.city) from address).
So it's at least relatively easy to figure out the performance
improvement from normalizing the dataset a bit.
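
Spelled out, that might look roughly like the following (table and column
names are illustrative):

CREATE TABLE cities (cityid serial PRIMARY KEY, city varchar(100) NOT NULL);
INSERT INTO cities (city) SELECT DISTINCT city FROM address;
CREATE UNIQUE INDEX cities_city_idx ON cities (city);

CREATE TABLE address_new AS
    SELECT street, locality_1, locality_2, county, postcode,
           (SELECT cityid FROM cities WHERE cities.city = address.city) AS cityid
    FROM address;

Comparing the same queries against address_new and address then gives a
quick read on how much the smaller rows actually buy.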

If you want to improve your hardware, have a look at the Western Digital
Raptor-series SATA disks, they are fast scsi-like SATA drives. You may
also have a look at the amount of memory available, to allow caching
this (entire) table.

Best regards,

Arjen

Re: Massive performance issues

From
Ron
Date:
At 04:25 PM 9/1/2005, Tom Lane wrote:
>Ron <rjpeace@earthlink.net> writes:
> > ...  Your target is to have each row take <= 512B.
>
>Ron, are you assuming that the varchar fields are blank-padded or
>something?  I think it's highly unlikely that he's got more than a
>couple hundred bytes per row right now --- at least if the data is
>what it sounds like.

As it stands, each row will take 55B - 748B and each field is
variable in size up to the maximums given in the OP's schema.  Since
pg uses an underlying OS FS, and not a native one, there will be
extra FS overhead no matter what we do, particularly to accommodate
such flexibility...  The goal is to minimize overhead and maximize
regularity in layout.  The recipe I know for HD IO speed is in
keeping the data small, regular, and as simple as possible.

Even better, if the table(s) can be made RAM resident, then searches,
even random ones, can be very fast.  He wants a 1000x performance
improvement.  Going from disk resident to RAM resident should help
greatly in attaining that goal.

In addition, by replacing as many variable sized text strings as
possible with ints, the actual compare functions he used as examples
should run faster as well.


>The upthread comment about strcoll() set off some alarm bells in my
>head.  If the database wasn't initdb'd in C locale already, try
>making it so.  Also, use a single-byte encoding if you can (LatinX
>is fine, Unicode not).

Good thoughts I hadn't had.


> > Upgrade pg to 8.0.3 and make sure you have enough RAM for your real
> > day to day load.
>
>Newer PG definitely better.  Some attention to the configuration
>parameters might also be called for.  I fear though that these
>things are probably just chipping at the margins ...

I don't expect 8.0.3 to be a major performance improvement.  I do
expect it to be a major _maintenance_ improvement for both him and
those of us trying to help him ;-)

The performance difference between not having the working set of the
DB fit into RAM during ordinary operation vs having it be so (or
better, having the whole DB fit into RAM during ordinary operation)
has been considerably more effective than "chipping at the margins"
IME.  Especially so if the HD IO subsystem is wimpy.

Ron Peacetree



Re: Massive performance issues

From
Matthew Sackman
Date:
On Thu, Sep 01, 2005 at 10:09:30PM +0200, Steinar H. Gunderson wrote:
> >     "address_city_index" btree (city)
> >     "address_county_index" btree (county)
> >     "address_locality_1_index" btree (locality_1)
> >     "address_locality_2_index" btree (locality_2)
> >     "address_pc_bottom_index" btree (postcode_bottom)
> >     "address_pc_middle_index" btree (postcode_middle)
> >     "address_pc_top_index" btree (postcode_top)
> >     "address_pc_top_middle_bottom_index" btree (postcode_top,
> >                              postcode_middle, postcode_bottom)
> >     "address_pc_top_middle_index" btree (postcode_top, postcode_middle)
> >     "address_postcode_index" btree (postcode)
> >     "address_property_type_index" btree (property_type)
> >     "address_street_index" btree (street)
> >     "street_prefix" btree (lower("substring"((street)::text, 1, 1)))
>
> Wow, that's quite a lof of indexes... but your problem isn't reported as
> being in insert/update/delete.

Hah, well now that you mention it. Basically, 100,000 rows arrive in a
bulk import every month and the only way I can get it to complete in any
sane time frame at all is to drop the indexes, do the import and then
recreate the indexes. But that's something that I'm OK with - the
imports don't have to be that fast and whilst important, it's not *the*
critical path. Selection from the database is, hence the indexes.
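
For completeness, a sketch of what such a load cycle might look like
(index definitions from the schema posted earlier; the file path is just
a placeholder):

DROP INDEX address_city_index;
DROP INDEX address_locality_1_index;
-- ... drop the remaining indexes ...

COPY address FROM '/path/to/monthly_extract.txt';  -- placeholder path

CREATE INDEX address_city_index ON address (city);
CREATE INDEX address_locality_1_index ON address (locality_1);
-- ... recreate the rest ...

VACUUM ANALYZE address;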

> > This is with postgresql 7.4 running on linux 2.6.11 with a 3GHz P4 and a
> > SATA harddrive.
>
> 8.0 or 8.1 might help you some -- better (and more!) disks will probably help
> a _lot_.

Ok, I did try 8.0 when I started this and found that the server bind
parameters (both via DBD::Pg (with pg_prepare_server => 1) and via JDBC
(various versions I tried)) failed - the parameters were clearly not
being substituted. This was Postgresql 8.0 from Debian unstable. That
was a couple of weeks ago and I've not been back to check whether it's
been fixed. Anyway, because of these problems I dropped back to 7.4.

> > Queries such as:
> >
> > select locality_2 from address where locality_2 = 'Manchester';
> >
> > are taking 14 seconds to complete, and this is only 2 years worth of
> > data - we will have up to 15 years (so over 15 million rows).
>
> As Tom pointed out; you're effectively doing random searches here, and using
> CLUSTER might help. Normalizing your data to get smaller rows (and avoid
> possibly costly string comparisons if your strcoll() is slow) will probably
> also help.

Ok, so you're saying that joining the address table into an address_city
table (the obvious normalization) will help here?

The locale settings in postgresql.conf all have en_GB and a \l shows
encoding of LATIN1. So I don't think I've set anything to UTF8 or such
like.

> > I need to get to the stage where I can run queries such as:
> > select street, locality_1, locality_2, city from address
> > where (city = 'Nottingham' or locality_2 = 'Nottingham'
> >        or locality_1 = 'Nottingham')
> >   and upper(substring(street from 1 for 1)) = 'A'
> > group by street, locality_1, locality_2, city
> > order by street
> > limit 20 offset 0
>
> This might be a lot quicker than pulling all the records like in your example
> queries...

Yes, that certainly does seem to be the case - around 4 seconds. But I
need it to be 10 times faster (or thereabouts) otherwise I have big
problems!

Many thanks for all the advice so far.

Matthew


Re: Massive performance issues

From
Matthew Sackman
Date:
On Thu, Sep 01, 2005 at 10:54:45PM +0200, Arjen van der Meijden wrote:
> On 1-9-2005 19:42, Matthew Sackman wrote:
> >Obviously, to me, this is a problem, I need these queries to be under a
> >second to complete. Is this unreasonable? What can I do to make this "go
> >faster"? I've considered normalising the table but I can't work out
> >whether the slowness is in dereferencing the pointers from the index
> >into the table or in scanning the index in the first place. And
> >normalising the table is going to cause much pain when inserting values
> >and I'm not entirely sure if I see why normalising it should cause a
> >massive performance improvement.
>
> In this case, I think normalising will give a major decrease in on-disk
> table-size of this large table and the indexes you have. If that's the
> case, that in itself will speed-up all i/o-bound queries quite a bit.

Well that's the thing - on the queries where it decides to use the index
it only reads at around 3MB/s and the CPU is maxed out, whereas when it
doesn't use the index, the disk is being read at 60MB/s. So when it
decides to use an index, I don't seem to be IO bound at all. Or at least
that's the way it seems to me.

> locality_1, _2, city and county can probably be normalised away without
> much problem, but going from varchar's to integers will probably safe
> you quite a bit of (disk)space.

Sure, that's what I've been considering today.

> But since it won't change the selectivity of indexes, so you won't get
> more index-scans instead of sequential scans, I suppose.
> I think its not that hard to create a normalized set of tables from this
>  data-set (using insert into tablename select distinct ... from address
> and such, insert into address_new (..., city) select ... (select cityid
> from cities where city = address.city) from address)
> So its at least relatively easy to figure out the performance
> improvement from normalizing the dataset a bit.

Yeah, the initial creation isn't too painful but when adding rows into
the address table it gets more painful. However, as I've said elsewhere,
the import isn't the critical path so I can cope with that pain,
possibly coding around it in a stored procedure and triggers as
suggested.

> If you want to improve your hardware, have a look at the Western Digital
> Raptor-series SATA disks, they are fast scsi-like SATA drives. You may
> also have a look at the amount of memory available, to allow caching
> this (entire) table.

Well I've got 1GB of RAM, but from analysis of its use, a fair amount
isn't being used. About 50% is actually in use by applications and about
half of the rest is cache and the rest isn't being used. Does this have to do
with the max_fsm_pages and max_fsm_relations settings? I've pretty much
not touched the configuration and it's the standard Debian package.

Matthew

Re: Massive performance issues

From
Jeff Frost
Date:
> Well I've got 1GB of RAM, but from analysis of its use, a fair amount
> isn't being used. About 50% is actually in use by applications and about
> half of the rest is cache and the rest isn't being used. Has this to do
> with the max_fsm_pages and max_fsm_relations settings? I've pretty much
> not touched the configuration and it's the standard Debian package.

Matt, have a look at the annotated postgresql.conf for 7.x here:

http://www.varlena.com/varlena/GeneralBits/Tidbits/annotated_conf_e.html

If you have the default settings, you're likely hampering yourself quite a
bit.  You probably care about shared_buffers, sort_mem, vacuum_mem,
max_fsm_pages, and effective_cache_size.
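
By way of illustration only, rough 7.4-era starting points for a 1GB box
might be (these are assumptions to experiment with, not figures from this
thread):

# postgresql.conf (7.4) -- illustrative values
shared_buffers = 10000            # number of 8KB buffers, ~80MB
sort_mem = 16384                  # KB per sort operation
vacuum_mem = 65536                # KB available to VACUUM
max_fsm_pages = 200000
effective_cache_size = 65536      # 8KB pages, ~512MB of expected OS cache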

Also, you may want to read the  PostgreSQL 8.0 Performance Checklist.  Even
though it's for 8.0, it'll give you good ideas on what to change in 7.4.  You
can find it here: http://www.powerpostgresql.com/PerfList/

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: Massive performance issues

From
"Steinar H. Gunderson"
Date:
On Thu, Sep 01, 2005 at 10:13:59PM +0100, Matthew Sackman wrote:
> Well that's the thing - on the queries where it decides to use the index
> it only reads at around 3MB/s and the CPU is maxed out, whereas when it
> doesn't use the index, the disk is being read at 60MB/s. So when it
> decides to use an index, I don't seem to be IO bound at all. Or at least
> that's the way it seems to me.

You are I/O bound; your disk is doing lots and lots of seeks. The SATA
interface is not the bottleneck; the disk's ability to rotate and move its
heads is.

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Massive performance issues

From
Matthew Sackman
Date:
On Thu, Sep 01, 2005 at 02:26:47PM -0700, Jeff Frost wrote:
> >Well I've got 1GB of RAM, but from analysis of its use, a fair amount
> >isn't being used. About 50% is actually in use by applications and about
> >half of the rest is cache and the rest isn't being used. Has this to do
> >with the max_fsm_pages and max_fsm_relations settings? I've pretty much
> >not touched the configuration and it's the standard Debian package.
>
> Matt, have a look at the annotated postgresql.conf for 7.x here:
>
> http://www.varlena.com/varlena/GeneralBits/Tidbits/annotated_conf_e.html
>
> If you have the default settings, you're likely hampering yourself quite a
> bit.  You probably care about shared_buffers, sort_mem,
> vacuum_mem, max_fsm_pages, effective_cache_size

That's a useful resource, thanks for the pointer. I'll work through that
tomorrow.

> Also, you may want to read the  PostgreSQL 8.0 Performance Checklist.  Even
> though it's for 8.0, it'll give you good ideas on what to change in 7.4.
> You can find it here: http://www.powerpostgresql.com/PerfList/

Thanks, another good resource. I'll work through that too.

Matthew

Re: Massive performance issues

From
Ron
Date:
At 05:06 PM 9/1/2005, Matthew Sackman wrote:
>On Thu, Sep 01, 2005 at 10:09:30PM +0200, Steinar H. Gunderson wrote:
> > >     "address_city_index" btree (city)
> > >     "address_county_index" btree (county)
> > >     "address_locality_1_index" btree (locality_1)
> > >     "address_locality_2_index" btree (locality_2)
> > >     "address_pc_bottom_index" btree (postcode_bottom)
> > >     "address_pc_middle_index" btree (postcode_middle)
> > >     "address_pc_top_index" btree (postcode_top)
> > >     "address_pc_top_middle_bottom_index" btree (postcode_top,
> > >                              postcode_middle, postcode_bottom)
> > >     "address_pc_top_middle_index" btree (postcode_top, postcode_middle)
> > >     "address_postcode_index" btree (postcode)
> > >     "address_property_type_index" btree (property_type)
> > >     "address_street_index" btree (street)
> > >     "street_prefix" btree (lower("substring"((street)::text, 1, 1)))
> >
> > Wow, that's quite a lof of indexes... but your problem isn't reported as
> > being in insert/update/delete.
>
>Hah, well now that you mention it. Basically, 100,000 rows come in in a
>bulk import every month and the only way I can get it to complete in any
>sane time frame at all is to drop the indexes, do the import and then
>recreate the indexes. But that's something that I'm OK with -

FTR, this "drop the indexes, do <foo>, recreate the indexes" is
Industry Standard Practice for bulk
inserts/updates/deletes.  Regardless of DB product used.


>  - the imports don't have to be that fast and whilst important,
> it's not *the*
>critical path.  Selection from the database is, hence the indexes.

A DB _without_ indexes that fits into RAM during ordinary operation
may actually be faster than a DB _with_ indexes that does
not.  Fitting the entire DB into RAM during ordinary operation if at
all possible should be the first priority with a small data mine-like
application such as you've described.

Also normalization is _not_ always a good thing for data mining like
apps.  Having most or everything you need in one place in a compact
and regular format is usually more effective for data mines than "Nth
Order Normal Form" optimization to the degree usually found in
textbooks using OLTP-like examples.

Indexes are a complication used as a performance enhancing technique
because without them the DB is not performing well enough.  IME, it's
usually better to get as much performance as one can from other
aspects of design and _then_ start adding complications.  Including
indexes.  Even if you fit the whole DB in RAM, you are very likely to
need some indexes; but profile your performance first and then add
indexes as needed rather than just adding them willy nilly early in
the design process.

You said you had 1GB of RAM on the machine now.  That clearly is
inadequate to your desired performance given what you said about the
DB.  Crank that box to 4GB and tighten up your data structures.  Then
see where you are.


> > > This is with postgresql 7.4 running on linux 2.6.11 with a 3GHz P4 and a
> > > SATA harddrive.
> >
> > 8.0 or 8.1 might help you some -- better (and more!) disks will
> probably help
> > a _lot_.
>
>Ok, I did try 8.0 when I started this and found that the server bind
>parameters (both via DBD::Pg (with pg_prepare_server => 1) and via JDBC
>(various versions I tried)) failed - the parameters were clearly not
>being substituted. This was Postgresql 8.0 from Debian unstable. That
>was a couple of weeks ago and I've not been back to check whether its
>been fixed. Anyway, because of these problems I dropped back to 7.4.

Since I assume you are not going to run anything with the string
"unstable" in its name in production (?!), why not try a decent
production-ready distro like SUSE 9.x and see how pg 8.0.3 runs on an
OS more representative of what you are likely (or at least what is
safe...) to run in production?


> > > Queries such as:
> > >
> > > select locality_2 from address where locality_2 = 'Manchester';
> > >
> > > are taking 14 seconds to complete, and this is only 2 years worth of
> > > data - we will have up to 15 years (so over 15 million rows).
> >
> > As Tom pointed out; you're effectively doing random searches
> here, and using
> > CLUSTER might help. Normalizing your data to get smaller rows (and avoid
> > possibly costly string comparisons if your strcoll() is slow) will probably
> > also help.
>
>Ok, so you're saying that joining the address table into an address_city
>table (the obvious normalization) will help here?
>
>The locale settings in postgresql.conf all have en_GB and a \l shows
>encoding of LATIN1. So I don't think I've set anything to UTF8 or such
>like.
>
> > > I need to get to the stage where I can run queries such as:
> > > select street, locality_1, locality_2, city from address
> > > where (city = 'Nottingham' or locality_2 = 'Nottingham'
> > >        or locality_1 = 'Nottingham')
> > >   and upper(substring(street from 1 for 1)) = 'A'
> > > group by street, locality_1, locality_2, city
> > > order by street
> > > limit 20 offset 0
> >
> > This might be a lot quicker than pulling all the records like in
> your example
> > queries...
>
>Yes, that certainly does seem to be the case - around 4 seconds. But I
>need it to be 10 times faster (or thereabouts) otherwise I have big
>problems!

*beats drum* Get it in RAM, Get it in RAM, ...

Ron Peacetree



Re: Massive performance issues

From
Matthew Sackman
Date:
On Thu, Sep 01, 2005 at 11:52:45PM +0200, Steinar H. Gunderson wrote:
> On Thu, Sep 01, 2005 at 10:13:59PM +0100, Matthew Sackman wrote:
> > Well that's the thing - on the queries where it decides to use the index
> > it only reads at around 3MB/s and the CPU is maxed out, whereas when it
> > doesn't use the index, the disk is being read at 60MB/s. So when it
> > decides to use an index, I don't seem to be IO bound at all. Or at least
> > that's the way it seems to me.
>
> You are I/O bound; your disk is doing lots and lots of seeks. The SATA
> interface is not the bottleneck; the disk's ability to rotate and move its
> heads is.

Ahh of course (/me hits head against wall). Because I've /seen/ it read
at 60MB/s I was assuming that if it wasn't reading that fast then I'm
not IO bound but of course, it's not reading sequentially. That all
makes sense. Been a long day etc... ;-)

Matthew

Re: Massive performance issues

From
Matthew Sackman
Date:
On Thu, Sep 01, 2005 at 06:05:43PM -0400, Ron wrote:
> > Selection from the database is, hence the indexes.
>
> A DB _without_ indexes that fits into RAM during ordinary operation
> may actually be faster than a DB _with_ indexes that does
> not.  Fitting the entire DB into RAM during ordinary operation if at
> all possible should be the first priority with a small data mine-like
> application such as you've described.

That makes sense.

> Also normalization is _not_ always a good thing for data mining like
> apps.  Having most or everything you need in one place in a compact
> and regular format is usually more effective for data mines than "Nth
> Order Normal Form" optimization to the degree usually found in
> textbooks using OLTP-like examples.

Sure.

> >Ok, I did try 8.0 when I started this and found that the server bind
> >parameters (both via DBD::Pg (with pg_prepare_server => 1) and via JDBC
> >(various versions I tried)) failed - the parameters were clearly not
> >being substituted. This was Postgresql 8.0 from Debian unstable. That
> >was a couple of weeks ago and I've not been back to check whether its
> >been fixed. Anyway, because of these problems I dropped back to 7.4.
>
> Since I assume you are not going to run anything with the string
> "unstable" in its name in production (?!), why not try a decent
> production ready distro like SUSE 9.x and see how pg 8.0.3 runs on a
> OS more representative of what you are likely (or at least what is
> safe...) to run in production?

Well, you see, as ever, it's a bit complicated. The company I'm doing
the development for has been subcontracted to do it and the contractor was
contracted by the actual "client". So there are two companies involved
in addition to the "client". Sadly, the "client" actually has dictated
things like "it will be deployed on FreeBSD and thou shall not argue".
At this point in time, I actually have very little information about the
specification of the boxen that'll be running this application. This is
something I'm hoping to solve very soon. The worst part of it is that
I'm not going to have direct (ssh) access to the box and all configuration
changes will most likely have to be relayed through techies at the
"client" so fine tuning this is going to be a veritable nightmare.

> >> > I need to get to the stage where I can run queries such as:
> >> > select street, locality_1, locality_2, city from address
> >> > where (city = 'Nottingham' or locality_2 = 'Nottingham'
> >> >        or locality_1 = 'Nottingham')
> >> >   and upper(substring(street from 1 for 1)) = 'A'
> >> > group by street, locality_1, locality_2, city
> >> > order by street
> >> > limit 20 offset 0
> >>
> >> This might be a lot quicker than pulling all the records like in
> >your example
> >> queries...
> >
> >Yes, that certainly does seem to be the case - around 4 seconds. But I
> >need it to be 10 times faster (or thereabouts) otherwise I have big
> >problems!
>
> *beats drum* Get it in RAM, Get it in RAM, ...

Ok, but I currently have 2 million rows. When this launches in a couple
of weeks, it'll launch with 5 million+ and then gain > a million a year.
I think the upshot of this all is 4GB RAM as a minimum and judicious use
of normalization so as to avoid more expensive string comparisons and
reduce table size is my immediate plan (along with proper configuration
of pg).

Matthew

Re: Massive performance issues

From
Michael Fuhr
Date:
On Thu, Sep 01, 2005 at 06:42:31PM +0100, Matthew Sackman wrote:
>
>     "address_pc_top_index" btree (postcode_top)
>     "address_pc_top_middle_bottom_index" btree (postcode_top,
>                              postcode_middle, postcode_bottom)
>     "address_pc_top_middle_index" btree (postcode_top, postcode_middle)

This doesn't address the query performance problem, but isn't only
one of these indexes necessary?  The second one, on all three
columns, because searches involving only postcode_top or only
postcode_top and postcode_middle could use it, making the indexes
on only those columns superfluous.  Or am I missing something?
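
If that reading is right, the redundant ones could simply be dropped
(worth verifying against the actual query mix first):

DROP INDEX address_pc_top_index;
DROP INDEX address_pc_top_middle_index;
-- address_pc_top_middle_bottom_index continues to serve lookups on
-- postcode_top alone and on (postcode_top, postcode_middle)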

--
Michael Fuhr

Re: Massive performance issues

From
Ron
Date:
At 06:22 PM 9/1/2005, Matthew Sackman wrote:
>On Thu, Sep 01, 2005 at 06:05:43PM -0400, Ron wrote:
> >
> > Since I assume you are not going to run anything with the string
> > "unstable" in its name in production (?!), why not try a decent
> > production ready distro like SUSE 9.x and see how pg 8.0.3 runs on a
> > OS more representative of what you are likely (or at least what is
> > safe...) to run in production?
>
>Well, you see, as ever, it's a bit complicated. The company I'm doing
>the development for has been subcontracted to do it and the contractor was
>contracted by the actual "client". So there are two companies involved
>in addition to the "client". Sadly, the "client" actually has dictated
>things like "it will be deployed on FreeBSD and thou shall not argue".

At least get them to promise they will use a release the BSD folks
mark "stable"!


>At this point in time, I actually have very little information about the
>specification of the boxen that'll be running this application. This is
>something I'm hoping to solve very soon. The worst part of it is that
>I'm not going have direct (ssh) access to the box and all configuration
>changes will most likely have to be relayed through techies at the
>"client" so fine tuning this is going to be a veritable nightmare.

IME, what you have actually just said is "It will not be possible to
safely fine tune the DB unless or until I have direct access; and/or
someone who does have direct access is correctly trained."

Ick.


> > >> > I need to get to the stage where I can run queries such as:
> > >> > select street, locality_1, locality_2, city from address
> > >> > where (city = 'Nottingham' or locality_2 = 'Nottingham'
> > >> >        or locality_1 = 'Nottingham')
> > >> >   and upper(substring(street from 1 for 1)) = 'A'
> > >> > group by street, locality_1, locality_2, city
> > >> > order by street
> > >> > limit 20 offset 0
> > >>
> > >> This might be a lot quicker than pulling all the records like in
> > >your example
> > >> queries...
> > >
> > >Yes, that certainly does seem to be the case - around 4 seconds. But I
> > >need it to be 10 times faster (or thereabouts) otherwise I have big
> > >problems!
> >
> > *beats drum* Get it in RAM, Get it in RAM, ...
>
>Ok, but I currently have 2 million rows. When this launches in a couple
>of weeks, it'll launch with 5 million+ and then gain > a million a year.

At my previously mentioned optimum of 85B per row, 2M rows is
170MB.  5M rows is 425MB.  Assuming the gain of 1M rows per year,
that's +85MB per year for this table.

Up to 2GB DIMMs are currently standard, and 4GB DIMMs are just in the
process of being introduced.  Mainboards with anything from 4 to 16
DIMM slots are widely available.

IOW, given the description you've provided this DB should _always_
fit in RAM.  Size the production system such that the entire DB fits
into RAM during ordinary operation with an extra 1GB of RAM initially
tossed on as a safety measure and the client will be upgrading the HW
because it's obsolete before they run out of room in RAM.


>I think the upshot of this all is 4GB RAM as a minimum and judicious use
>of normalization so as to avoid more expensive string comparisons and
>reduce table size is my immediate plan (along with proper configuration
>of pg).

My suggestion is only slightly different.  Reduce table size(s) and
up the RAM to the point where the whole DB fits comfortably in RAM.

You've got the rare opportunity to build a practical Memory Resident
Database.  It should run like a banshee when you're done.  I'd love
to see the benches on the final product.

Ron Peacetree



Re: Massive performance issues

From
Mark Kirkwood
Date:
It would be good to see EXPLAIN ANALYZE output for the three queries
below (the real vs. estimated row counts being of interest).

The number of pages in your address table might be interesting to know too.
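
One way to gather both:

EXPLAIN ANALYZE select locality_2 from address where locality_2 = 'Manchester';
EXPLAIN ANALYZE select locality_1 from address where locality_1 = 'Manchester';
-- and likewise for the grouped Nottingham query quoted below

-- relpages/reltuples are the estimates kept by VACUUM/ANALYZE
SELECT relpages, reltuples FROM pg_class WHERE relname = 'address';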

regards

Mark

Matthew Sackman wrote (with a fair bit of snippage):
> explain select locality_2 from address where locality_2 = 'Manchester';
> gives
>                            QUERY PLAN
> ----------------------------------------------------------------
>  Seq Scan on address  (cost=0.00..80677.16 rows=27923 width=12)
>    Filter: ((locality_2)::text = 'Manchester'::text)
>
>
> explain select locality_1 from address where locality_1 = 'Manchester';
> gives
>                            QUERY PLAN
> ----------------------------------------------------------------
>  Index Scan using address_locality_1_index on address
> (cost=0.00..69882.18 rows=17708 width=13)
>    Index Cond: ((locality_1)::text = 'Manchester'::text)
>
 >
> select street, locality_1, locality_2, city from address
> where (city = 'Nottingham' or locality_2 = 'Nottingham'
>        or locality_1 = 'Nottingham')
>   and upper(substring(street from 1 for 1)) = 'A'
> group by street, locality_1, locality_2, city
> order by street
> limit 20 offset 0
>

Re: Massive performance issues

From
Mark Kirkwood
Date:
Matthew Sackman wrote:

> I need to get to the stage where I can run queries such as:
 >
> select street, locality_1, locality_2, city from address
> where (city = 'Nottingham' or locality_2 = 'Nottingham'
>        or locality_1 = 'Nottingham')
>   and upper(substring(street from 1 for 1)) = 'A'
> group by street, locality_1, locality_2, city
> order by street
> limit 20 offset 0
>
> and have the results very quickly.
>

This sort of query will be handled nicely in 8.1 - it has bitmap AND/OR
processing to make use of multiple indexes. Note that 8.1 is in beta now.

Cheers

Mark


Re: Massive performance issues

From
"Akshay Mathur"
Date:
Ron,

Can you give me some pointers on making the tables RAM resident? If one
does so, does the program accessing the data need to change? Does pgsql
still take care of writing the data to disk?

Regards,

akshay

---------------------------------------
Akshay Mathur
SMTS, Product Verification
AirTight Networks, Inc. (www.airtightnetworks.net)
O: +91 20 2588 1555 ext 205
F: +91 20 2588 1445


-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Ron
Sent: Friday, September 02, 2005 2:36 AM
To: Tom Lane; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Massive performance issues

Even better, if the table(s) can be made RAM resident, then searches,
even random ones, can be very fast.  He wants a 1000x performance
improvement.  Going from disk resident to RAM resident should help
greatly in attaining that goal.





Re: Massive performance issues

From
Josh Berkus
Date:
Matthew,

> Well, this is a development box. But the live box wouldn't be much more
> than RAID 1 on SCSI 10ks so that should only be a halving of seek time,
> not the 1000 times reduction I'm after!

If you're looking for a 1000-times reduction, I think you're going to need
*considerably* beefier hardware.   You'd pretty much have to count on the
whole DB being in RAM, and on a CPU always being available for incoming queries.

> In fact, now I think about it, I have been testing on a 2.4 kernel on a
> dual HT 3GHz Xeon with SCSI RAID array and the performance is only
> marginally better.

Yes, but HT sucks for databases, so you're probably bottlenecking yourself on
CPU on that machine.

However, if this is the query you really want to optimize for:

select street, locality_1, locality_2, city from address
where (city = 'Nottingham' or locality_2 = 'Nottingham'
       or locality_1 = 'Nottingham')
  and upper(substring(street from 1 for 1)) = 'A'
group by street, locality_1, locality_2, city
order by street
limit 20 offset 0

... then this is the query you should test on.   Although I will say that your
denormalized schema is actually hurting you significantly with the above
type of query; indexes aren't going to be possible for it.
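
One small aside on the street predicate (just a sketch, and it only covers
that one filter, not the ORs across city/locality): the existing
street_prefix index is built on lower(), so
upper(substring(street from 1 for 1)) = 'A' can't use it. An index matching
the expression would be needed, something like:

-- hypothetical index matching the query's function predicate exactly;
-- it does nothing for the OR conditions on city/locality_1/locality_2
CREATE INDEX address_street_prefix_upper
    ON address (upper(substring(street from 1 for 1)));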

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Massive performance issues

From
Chris Travers
Date:
Akshay Mathur wrote:

>Ron,
>
>Can you give me some pointers to make the tables RAM resident. If one
>does so, is the program accessing the data need to change. Does pgsql
>take care to write the data to disk?
>
>
>
PostgreSQL tries to cache information intelligently and will also
use the OS disk cache as a secondary cache.  So a sufficiently small and
frequently accessed table will effectively be resident in RAM.
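
A rough way to see whether that will be the case here (a sketch only, using
nothing beyond the system catalog and assuming 8kB pages):

-- approximate on-disk size of the address table and each of its indexes
SELECT c.relname, c.relpages::bigint * 8192 AS approx_bytes
FROM pg_class c
LEFT JOIN pg_index i ON i.indexrelid = c.oid
WHERE c.relname = 'address'
   OR i.indrelid = (SELECT oid FROM pg_class WHERE relname = 'address')
ORDER BY approx_bytes DESC;

-- the settings that determine how much of that stays cached
SHOW shared_buffers;        -- PostgreSQL's own buffer cache (counted in 8kB pages)
SHOW effective_cache_size;  -- what the planner assumes the OS will also cache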

The simplest way to affect this calculus is to put more RAM in the
machine.  There are hacks I can think of to create RAM caches of
specific tables, but I don't want to take responsibility for anyone
trying these and running into trouble.

Best Wishes,
Chris Travers
Metatron Technology Consulting

Re: Massive performance issues

From
Mark Kirkwood
Date:
Mark Kirkwood wrote:
> Matthew Sackman wrote:
>
>> I need to get to the stage where I can run queries such as:
>
>  >
>
>> select street, locality_1, locality_2, city from address
>> where (city = 'Nottingham' or locality_2 = 'Nottingham'
>>        or locality_1 = 'Nottingham')
>>   and upper(substring(street from 1 for 1)) = 'A'
>> group by street, locality_1, locality_2, city
>> order by street
>> limit 20 offset 0
>>
>> and have the results very quickly.
>>
>
> This sort of query will be handled nicely in 8.1 - it has bitmap AND/OR
> processing to make use of multiple indexes. Note that 8.1 is in beta now.
>

As others have commented, you will probably need better hardware to
achieve a factor of 1000 improvement. However, I think using 8.1 could by
itself give you a factor of 10-100 improvement.

e.g. Using your schema and generating synthetic data:


EXPLAIN
SELECT street, locality_1, locality_2, city
FROM address
WHERE (city = '500TH CITY'
        OR locality_2 = '50TH LOCALITY'
        OR locality_1 = '500TH LOCALITY')
   AND upper(substring(street from 1 for 1)) = 'A'
GROUP BY street, locality_1, locality_2, city
ORDER BY street
LIMIT 20 OFFSET 0

                                 QUERY PLAN
------------------------------------------------------------------------------
  Limit  (cost=59559.04..59559.09 rows=20 width=125)
    ->  Sort  (cost=59559.04..59559.09 rows=21 width=125)
          Sort Key: street
          ->  HashAggregate  (cost=59558.37..59558.58 rows=21 width=125)
                ->  Bitmap Heap Scan on address  (cost=323.19..59556.35 rows=202 width=125)
                      Recheck Cond: (((city)::text = '500TH CITY'::text) OR ((locality_2)::text = '50TH LOCALITY'::text) OR ((locality_1)::text = '500TH LOCALITY'::text))
                      Filter: (upper("substring"((street)::text, 1, 1)) = 'A'::text)
                      ->  BitmapOr  (cost=323.19..323.19 rows=40625 width=0)
                            ->  Bitmap Index Scan on address_city_index  (cost=0.00..15.85 rows=1958 width=0)
                                  Index Cond: ((city)::text = '500TH CITY'::text)
                            ->  Bitmap Index Scan on address_locality_2_index  (cost=0.00..143.00 rows=18000 width=0)
                                  Index Cond: ((locality_2)::text = '50TH LOCALITY'::text)
                            ->  Bitmap Index Scan on address_locality_1_index  (cost=0.00..164.33 rows=20667 width=0)
                                  Index Cond: ((locality_1)::text = '500TH LOCALITY'::text)
(14 rows)


This takes 0.5s -> 2s to execute (depending on the frequencies generated
for the two localities).

So we are a factor of 10 better already, on modest HW (2x PIII 1GHz,
2GB RAM, running FreeBSD 5.4).

To go better than this you could try a specific summary table:


CREATE TABLE address_summary AS
SELECT street,
        locality_1,
        locality_2,
        city,
        upper(substring(street from 1 for 1)) AS cut_street
FROM address
GROUP BY street, locality_1, locality_2, city
;

CREATE INDEX address_summary_city_index ON address_summary(city);
CREATE INDEX address_summary_locality_1_index ON
address_summary(locality_1);
CREATE INDEX address_summary_locality_2_index ON
address_summary(locality_2);
CREATE INDEX address_summary_street_index ON address_summary(street);
CREATE INDEX street_summary_prefix ON address_summary(cut_street);


And the query can be rewritten as:

EXPLAIN
SELECT street, locality_1, locality_2, city
FROM address_summary
WHERE (city = '500TH CITY'
        OR locality_2 = '50TH LOCALITY'
        OR locality_1 = '500TH LOCALITY')
   AND cut_street = 'A'
ORDER BY street
LIMIT 20 OFFSET 0

                                 QUERY PLAN
------------------------------------------------------------------------------
  Limit  (cost=0.00..2006.05 rows=20 width=125)
    ->  Index Scan using address_summary_street_index on address_summary  (cost=0.00..109028.81 rows=1087 width=125)
          Filter: ((((city)::text = '500TH CITY'::text) OR ((locality_2)::text = '50TH LOCALITY'::text) OR ((locality_1)::text = '500TH LOCALITY'::text)) AND (cut_street = 'A'::text))
(3 rows)


This takes 0.02s - so getting close to the factor of 1000 (a modern
machine with 3-5 times the memory access speed will get you there easily).

The effectiveness of the summary table will depend on how much the
GROUP BY reduces the cardinality (not much in this synthetic case), so
you will probably get a better improvement with the real data!
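
One caveat I should add: the summary table is a static snapshot, so it
needs to be refreshed as new address rows arrive. A minimal sketch,
assuming periodic batch loads (TRUNCATE takes an exclusive lock, so
concurrent queries simply wait until the refresh commits):

BEGIN;
TRUNCATE address_summary;
INSERT INTO address_summary
SELECT street, locality_1, locality_2, city,
       upper(substring(street from 1 for 1))
FROM address
GROUP BY street, locality_1, locality_2, city;
COMMIT;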

Cheers

Mark

Re: Massive performance issues

From
Matthew Sackman
Date:
Hi,

Many thanks for all your thoughts and advice. With just 2GB of RAM, no
change to the hard disc (still SATA), but proper tuning of PostgreSQL
(still 7.4) and aggressive normalization to shrink the row width, I have
managed to get suitable performance on a 5 million row data set. When
fully cached, queries such as:

select to_char(sale_date, 'DD Mon YYYY') as sale_date_text, cost,
       property_types.type as property_type, sale_types.type as sale_type,
       flat_extra, number, street, loc1.component as locality_1,
       loc2.component as locality_2, city.component as city,
       county.component as county, postcode
from address
     inner join (
           select id from address_components
           where component = 'Woodborough'
           ) as t1
       on locality_1_id = t1.id or locality_2_id = t1.id or city_id = t1.id
     inner join (
           select id, street from streets where street = 'Lowdham Lane'
           ) as t2
       on street_id = t2.id
     inner join sale_types
       on sale_types.id = sale_type_id
     inner join property_types
       on property_types.id = property_type_id
     inner join address_components as county
       on county_id = county.id
     inner join address_components as city
       on city_id = city.id
     inner join address_components as loc2
       on locality_2_id = loc2.id
     inner join address_components as loc1
       on locality_1_id = loc1.id
order by sale_date desc limit 11 offset 0

complete within 50ms. I've also now managed to request that the full
production system will have 4GB of RAM (there are still a few queries
that don't quite fit in 2GB of RAM) and a 15kRPM SCSI HD.
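
For reference, the shape of the normalization behind that query is roughly
as follows (a sketch only: the types and constraints shown here are
indicative rather than exact):

-- lookup tables replace the wide, repeated varchar columns
CREATE TABLE address_components (
    id        serial PRIMARY KEY,
    component varchar(100) NOT NULL
);

CREATE TABLE streets (
    id     serial PRIMARY KEY,
    street varchar(100) NOT NULL
);

CREATE TABLE property_types (
    id   serial PRIMARY KEY,
    type varchar(15) NOT NULL
);

CREATE TABLE sale_types (
    id   serial PRIMARY KEY,
    type varchar(10) NOT NULL
);

-- the address table now carries small integer keys instead of strings
CREATE TABLE address (
    sale_date        date         NOT NULL,
    cost             integer      NOT NULL,
    property_type_id integer      NOT NULL REFERENCES property_types(id),
    sale_type_id     integer      NOT NULL REFERENCES sale_types(id),
    flat_extra       varchar(100) NOT NULL,
    number           varchar(100) NOT NULL,
    street_id        integer      NOT NULL REFERENCES streets(id),
    locality_1_id    integer      NOT NULL REFERENCES address_components(id),
    locality_2_id    integer      NOT NULL REFERENCES address_components(id),
    city_id          integer      NOT NULL REFERENCES address_components(id),
    county_id        integer      NOT NULL REFERENCES address_components(id),
    postcode         varchar(10)  NOT NULL
);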

So once again, thanks for all your help. I've literally been pulling my
hair out over this so it's great to have basically got it solved.

Matthew