Thread: Indexing Strategy

Indexing Strategy

From

Matthew Engelbert

Date:

10 December 2004, 22:05:16

I'm having trouble correctly indexing a large table.  The table will be
receiving about 30 million rows/day, and has the following definition:

CREATE TABLE quote (
        symbol  varchar (5),
        price   numeric (7,4),
        time timestamp
        );

I need to configure the indices so that I can minimize the time to request all
the records for a particular symbol, for a particular day, ordered by time.

I've tried clustering a single index based on symbol, then time, but this
operation takes much too long.  I think this is because this operation would
have O(N^2) complexity(?).  I think my next best option would be to cluster on
an index based on symbol, then have another index on time.

Does any one know of some resources on the web that discusses indexing
strategies ?  Any help would be appreciated?  Thanks.

- Matt

Re: Indexing Strategy

From

Matthew Engelbert

Date:

11 December 2004, 07:44:01

The select query will be like this:

select * from quote where sybmol = 'MSFT' and extract(doy from time) = 340
order by time

I need to select all the records for a particular symbol, for a particular day,
ordered by time.  I might need to select the records in the same way, but for a
shorter time interval, like some hour during the day.  I'm using Postgresql
version 7.4.

Other info I forgot to mention.... this data is historical, and so I won't be
doing lots of inserts, updates or deletes - I'll just load a big chunk of data
once in a while, and rebuild the index.  I realize that I can also add a lookup
table for symbol, as there are only about 150 unique values.  Time values are
about 99% unique.  Yes, I'm sure I need to tune the db better.  I'm new to
Postgresql.  Thanks for the help.

--- Josh Berkus <josh@agliodbs.com> wrote:

> Matthew,
>
> > I'm having trouble correctly indexing a large table.  The table will be
> > receiving about 30 million rows/day, and has the following definition:
> >
> > CREATE TABLE quote (
> >         symbol  varchar (5),
> >         price   numeric (7,4),
> >         time timestamp
> >         );
> >
> > I need to configure the indices so that I can minimize the time to request
> > all the records for a particular symbol, for a particular day, ordered by
> > time.
>
> Please post the query you're using.   How are you selecting the day?   What
> version of PostgreSQL are you using?
>
> > I've tried clustering a single index based on symbol, then time, but this
> > operation takes much too long.  I think this is because this operation
> > would have O(N^2) complexity(?).
>
> Nope,  just because it's a very large table and you're probably swapping.
> Try
> increasing sort_mem dramatically, like half your RAM.
>
> > I think my next best option would be to
> > cluster on an index based on symbol, then have another index on time.
>
> Not that either.   You need a multicolumn index.  When you answer the above
> quesitons, I'll explain it.
>
> > Does any one know of some resources on the web that discusses indexing
> > strategies ?  Any help would be appreciated?  Thanks.
>
> Not that covers your question.   I'm writing a book which covers this right
> now, but it won't be published until January 2006.
>
> --
> --Josh
>
> Josh Berkus
> Aglio Database Solutions
> San Francisco
>

Re: Indexing Strategy

From

Josh Berkus

Date:

11 December 2004, 07:59:40

Matthew,

> I'm having trouble correctly indexing a large table.  The table will be
> receiving about 30 million rows/day, and has the following definition:
>
> CREATE TABLE quote (
>         symbol  varchar (5),
>         price   numeric (7,4),
>         time timestamp
>         );
>
> I need to configure the indices so that I can minimize the time to request
> all the records for a particular symbol, for a particular day, ordered by
> time.

Please post the query you're using.   How are you selecting the day?   What
version of PostgreSQL are you using?

> I've tried clustering a single index based on symbol, then time, but this
> operation takes much too long.  I think this is because this operation
> would have O(N^2) complexity(?).

Nope,  just because it's a very large table and you're probably swapping.  Try
increasing sort_mem dramatically, like half your RAM.

> I think my next best option would be to
> cluster on an index based on symbol, then have another index on time.

Not that either.   You need a multicolumn index.  When you answer the above
quesitons, I'll explain it.

> Does any one know of some resources on the web that discusses indexing
> strategies ?  Any help would be appreciated?  Thanks.

Not that covers your question.   I'm writing a book which covers this right
now, but it won't be published until January 2006.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Re: Indexing Strategy

From

Josh Berkus

Date:

15 December 2004, 02:32:20

Matthew,

> select * from quote where sybmol = 'MSFT' and extract(doy from time) = 340
> order by time

Aha, that should be easy.

CREATE INDEX quote_doy_symbol ON quote ( (extract(doy from "time")), symbol );

The reason that I'm putting the DOY first is because it's liable to be
slightly more selective; 365 values as opposed to 150.  This is just to help
the planner realize that the index is useful.

I doubt it's worth indexing the order by within that, since the query should
produce a fairly small amount of rows

Of course, using extract doesn't gain you anything, and in fact adds
significant CPU overhead to both the query and the index.   So you'd be
slightly better off doing:

SELECT * FROM quote
WHERE symbol = 'MSFT'
    AND "time" BETWEEN '2004-10-11' AND '2004-10-12';

This will also allow you to create a single index on:

CREATE INDEX quote_time_symbol ON quote("time", symbol);

... which will be useful for any time-based query, not just one on days.  And
it would be potentially useful for time-based queries which don't include a
symbol.

Further, if your queries are *always* structured like the above (time +
symbol) I'd suggest CLUSTERing on the index.

Also, I hope that you didn't really name a column "time".

--
Josh Berkus
Aglio Database Solutions
San Francisco