Re: pgsql and large tables - Mailing list pgsql-general
From | Gurupartap Davis
---|---
Subject | Re: pgsql and large tables
Date |
Msg-id | 014001c16e28$d8451dc0$0f00a8c0@marlows
In response to | Re: pgsql and large tables (Francisco Reyes <lists@natserv.com>)
List | pgsql-general
err..forgot to cc the list...
--------------------------------------

Thanks for trying to help me out... I've got some more info for you:

> > I've got a large database, currently about 40 million rows in the
> > biggest table.
> > ...
> > I initially created the table with a primary key (5 fields: char(4),
> > varchar(32), date, int, int)
>
> Out of curiosity, why are you using this as your primary key and not a
> serial? Will your tables be normalized?
>
> > and a couple additional indexes, one that just changes the order of
> > the primary key, and one that sorts on the date.
>
> From what I have read so far, your primary key doesn't sound like what
> you would want to have as a primary key... but I guess the design of
> your table is a totally different issue.

I'm migrating this table from an existing mysql installation... This is
what it looks like right now:

             Table "forecast"
  Attribute   |         Type          | Modifier
--------------+-----------------------+----------
 zhr          | smallint              |
 zday         | smallint              |
 model        | character varying(32) |
 temp         | numeric(6,2)          |
 modelhr      | smallint              | not null
 modelruntime | smallint              | not null
 modelrundate | smallint              |
 stn          | character(4)          | not null
 rh           | numeric(6,2)          |
 wdsp         | numeric(6,2)          |
 wddir        | character varying(2)  |
 dwpt         | numeric(6,2)          |
 lpre         | numeric(6,2)          |
 yearmoda     | date                  | not null

It's a table for weather forecasts; a record is identified uniquely by
(model, stn, yearmoda, modelruntime, modelhr), although I will rarely
have a query that specifies all of those fields. The reason for the
permuted indexes mentioned above is that mysql will use a prefix of a
multi-column key to narrow down a search. I guess pgsql doesn't use
indexes that way? (I noticed in the to-do list something like
"re-enable partial indexes".)

The cardinality on these, by the way, is approximately: model: 15-30,
stn: 1800, yearmoda: ~450 so far..., modelruntime: 4, and modelhr:
10-40... Does this make any difference to what order they should be
listed in the index?

So I should add a serial id column, I guess, and make that the primary
key (why is this? I'm wondering, since I will most likely never refer
to that column). Now I need some useful indexes. Most of my queries
will be for a complete model run at a particular station, so I'm
thinking of an index on (model, stn, yearmoda, modelruntime) to return
all modelhr's at a stn. Another common set of queries deals with a
*complete* model run (all stn's), so I would need another index on
(model, yearmoda, modelruntime). Yet another useful query type contains
all model runs within a certain date range, aggregated by modelhr... it
looks to me like the above 2 indexes might be sufficient for that,
though. E.g.:

    SELECT modelhr, avg(temp)
      FROM forecast
     WHERE model = 'avn' AND stn = 'KDRO'
       AND yearmoda >= $date1 AND yearmoda <= $date2
       AND modelruntime = 0
     GROUP BY modelhr;

...or would I need an index that references modelhr for that one?

Oh, and I guess I still need a UNIQUE index as well. Hmm. Now I'm
wondering again if that serial id column is going to mess with me.
Sometimes I will need to re-import a forecast because something went
wrong and all the values are bogus... Can I do a REPLACE into a table,
specifying all of the columns of a UNIQUE index, without specifying the
primary key? How exactly is having a primary key as serial going to
help me here? (Sorry if this is a dumb question, I'm kinda new to this.)
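For concreteness, here's a rough sketch of what I think the new schema
would look like, with the serial surrogate key, a UNIQUE index over the
natural key, and the secondary index for complete runs (the index names
and the literal values in the re-import example are made up for
illustration). As far as I can tell pgsql has no REPLACE, so the
re-import is sketched as a DELETE plus re-INSERT in one transaction:

    CREATE TABLE forecast (
        id            serial PRIMARY KEY,  -- surrogate key, rarely queried directly
        model         varchar(32),
        stn           char(4)      NOT NULL,
        yearmoda      date         NOT NULL,
        modelruntime  smallint     NOT NULL,
        modelhr       smallint     NOT NULL,
        zhr           smallint,
        zday          smallint,
        modelrundate  smallint,
        temp          numeric(6,2),
        rh            numeric(6,2),
        wdsp          numeric(6,2),
        wddir         varchar(2),
        dwpt          numeric(6,2),
        lpre          numeric(6,2)
    );

    -- the natural key, enforced with a unique index
    CREATE UNIQUE INDEX forecast_key
        ON forecast (model, stn, yearmoda, modelruntime, modelhr);

    -- a complete model run across all stations
    CREATE INDEX forecast_run
        ON forecast (model, yearmoda, modelruntime);

    -- re-importing a bogus run: delete the old rows and insert the
    -- corrected ones inside one transaction; the serial id column is
    -- simply omitted from the INSERT and gets assigned automatically
    BEGIN;
    DELETE FROM forecast
     WHERE model = 'avn' AND yearmoda = '2001-11-14' AND modelruntime = 0;
    INSERT INTO forecast
        (model, stn, yearmoda, modelruntime, modelhr, temp)
    VALUES
        ('avn', 'KDRO', '2001-11-14', 0, 12, 4.50);
    COMMIT;

If pgsql uses the leading columns of a multi-column index the way mysql
does, the unique index should also cover the (model, stn, yearmoda,
modelruntime) per-station queries, which would save maintaining a third
index.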
> > Should I make an artificial primary key with a serial type to
> > simplify things?
>
> I recommend you do this. Not only because it is more efficient, but
> because the type of key you selected has "issues", for lack of a
> better term. You can of course have a "unique" index so you don't have
> dups.
>
> I think that when you are dealing with a 40 million row table you need
> to consider your hardware. You also didn't tell us how big the rows
> are. The more info you give us, the more others will be able to help.

The 40 million rows, by the way, is for about 2 months of data ;-)...
we've got about another year of data to import (and will be growing by
about 1.5 million rows a day), so I want to make sure I've got the table
and indexes set up optimally first, since it will take about a week to
import all the data.

It is currently on a P3 700MHz machine running RedHat 7.2 and
PostgreSQL 7.1.3, with 256MB of RAM. We have a dual P3 700 with 512MB
that we could move it to for production (the mysql db is currently
running on it).
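As a sanity check while the import runs, the table's on-disk size can
be estimated from pg_class (a sketch; relpages is only refreshed by a
VACUUM, and the 8192 assumes the default 8 KB block size):

    VACUUM forecast;

    SELECT relname, relpages,
           relpages::int8 * 8192 AS approx_bytes  -- int8 avoids overflow
      FROM pg_class
     WHERE relname = 'forecast';

At 1.5 million new rows a day, if a heap page holds on the order of 100
rows (a guess, given these column types), that's roughly 15,000 new
pages, i.e. somewhere around 120MB of heap a day before indexes, which
seems worth weighing against the 256MB of RAM.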