Thread: Indexing Strategy
I'm having trouble correctly indexing a large table. The table will be receiving about 30 million rows/day, and has the following definition: CREATE TABLE quote ( symbol varchar (5), price numeric (7,4), time timestamp ); I need to configure the indices so that I can minimize the time to request all the records for a particular symbol, for a particular day, ordered by time. I've tried clustering a single index based on symbol, then time, but this operation takes much too long. I think this is because this operation would have O(N^2) complexity(?). I think my next best option would be to cluster on an index based on symbol, then have another index on time. Does any one know of some resources on the web that discusses indexing strategies ? Any help would be appreciated? Thanks. - Matt
The select query will be like this: select * from quote where sybmol = 'MSFT' and extract(doy from time) = 340 order by time I need to select all the records for a particular symbol, for a particular day, ordered by time. I might need to select the records in the same way, but for a shorter time interval, like some hour during the day. I'm using Postgresql version 7.4. Other info I forgot to mention.... this data is historical, and so I won't be doing lots of inserts, updates or deletes - I'll just load a big chunk of data once in a while, and rebuild the index. I realize that I can also add a lookup table for symbol, as there are only about 150 unique values. Time values are about 99% unique. Yes, I'm sure I need to tune the db better. I'm new to Postgresql. Thanks for the help. --- Josh Berkus <josh@agliodbs.com> wrote: > Matthew, > > > I'm having trouble correctly indexing a large table. The table will be > > receiving about 30 million rows/day, and has the following definition: > > > > CREATE TABLE quote ( > > symbol varchar (5), > > price numeric (7,4), > > time timestamp > > ); > > > > I need to configure the indices so that I can minimize the time to request > > all the records for a particular symbol, for a particular day, ordered by > > time. > > Please post the query you're using. How are you selecting the day? What > version of PostgreSQL are you using? > > > I've tried clustering a single index based on symbol, then time, but this > > operation takes much too long. I think this is because this operation > > would have O(N^2) complexity(?). > > Nope, just because it's a very large table and you're probably swapping. > Try > increasing sort_mem dramatically, like half your RAM. > > > I think my next best option would be to > > cluster on an index based on symbol, then have another index on time. > > Not that either. You need a multicolumn index. When you answer the above > quesitons, I'll explain it. > > > Does any one know of some resources on the web that discusses indexing > > strategies ? Any help would be appreciated? Thanks. > > Not that covers your question. I'm writing a book which covers this right > now, but it won't be published until January 2006. > > -- > --Josh > > Josh Berkus > Aglio Database Solutions > San Francisco >
Matthew, > I'm having trouble correctly indexing a large table. The table will be > receiving about 30 million rows/day, and has the following definition: > > CREATE TABLE quote ( > symbol varchar (5), > price numeric (7,4), > time timestamp > ); > > I need to configure the indices so that I can minimize the time to request > all the records for a particular symbol, for a particular day, ordered by > time. Please post the query you're using. How are you selecting the day? What version of PostgreSQL are you using? > I've tried clustering a single index based on symbol, then time, but this > operation takes much too long. I think this is because this operation > would have O(N^2) complexity(?). Nope, just because it's a very large table and you're probably swapping. Try increasing sort_mem dramatically, like half your RAM. > I think my next best option would be to > cluster on an index based on symbol, then have another index on time. Not that either. You need a multicolumn index. When you answer the above quesitons, I'll explain it. > Does any one know of some resources on the web that discusses indexing > strategies ? Any help would be appreciated? Thanks. Not that covers your question. I'm writing a book which covers this right now, but it won't be published until January 2006. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Matthew, > select * from quote where sybmol = 'MSFT' and extract(doy from time) = 340 > order by time Aha, that should be easy. CREATE INDEX quote_doy_symbol ON quote ( (extract(doy from "time")), symbol ); The reason that I'm putting the DOY first is because it's liable to be slightly more selective; 365 values as opposed to 150. This is just to help the planner realize that the index is useful. I doubt it's worth indexing the order by within that, since the query should produce a fairly small amount of rows Of course, using extract doesn't gain you anything, and in fact adds significant CPU overhead to both the query and the index. So you'd be slightly better off doing: SELECT * FROM quote WHERE symbol = 'MSFT' AND "time" BETWEEN '2004-10-11' AND '2004-10-12'; This will also allow you to create a single index on: CREATE INDEX quote_time_symbol ON quote("time", symbol); ... which will be useful for any time-based query, not just one on days. And it would be potentially useful for time-based queries which don't include a symbol. Further, if your queries are *always* structured like the above (time + symbol) I'd suggest CLUSTERing on the index. Also, I hope that you didn't really name a column "time". -- Josh Berkus Aglio Database Solutions San Francisco