Re: pgsql and large tables - Mailing list pgsql-general
From: Francisco Reyes
Subject: Re: pgsql and large tables
Msg-id: 20011115180851.X60926-100000@zoraida.natserv.net
In response to: pgsql and large tables ("Gurupartap Davis" <partap@yahoo.com>)
List: pgsql-general
On Thu, 15 Nov 2001, Gurupartap Davis wrote:

> Thanks for trying to help me out...I've got some more info for you:

Not a problem, but remember to include the list. I am still a
relatively new user, and you would deprive yourself of the help of
others more experienced if you don't cc the list.

> I'm migrating this table from an existing mysql installation...
> This is what it looks like right now:
>
>              Table "forecast"
>   Attribute   |         Type          | Modifier
> --------------+-----------------------+----------
>  zhr          | smallint              |
>  zday         | smallint              |
>  model        | character varying(32) |
>  temp         | numeric(6,2)          |
>  modelhr      | smallint              | not null
>  modelruntime | smallint              | not null
>  modelrundate | smallint              |
>  stn          | character(4)          | not null
>  rh           | numeric(6,2)          |
>  wdsp         | numeric(6,2)          |
>  wddir        | character varying(2)  |
>  dwpt         | numeric(6,2)          |
>  lpre         | numeric(6,2)          |
>  yearmoda     | date                  | not null
>
> It's a table for weather forecasts, a record is identified uniquely by
> (model, stn, yearmoda, modelruntime, modelhr) although I will rarely
> have a query that specifies all of those fields.

Which columns do your queries hit most heavily, and do you have
indexes on them?

> The reason for the permuted indexes mentioned above

Are those 5 columns what makes a record unique? It seems strange that
the fields you described as your primary key are not all declared
not null, in particular model. What are zhr and zday?

> is because mysql will use a prefix of a multi-column key to narrow
> down a search. I guess pgsql doesn't use indexes that way?

I don't know. I usually create indexes according to my needs: if I
need to search a lot on 3 columns, then I index those 3 columns. I
don't see the point in adding columns to an index if they will rarely
be used.

> (I noticed in the to-do list something like "reenable partial indexes")

So far I have not tried to do searches on partial index keys.

> The cardinality on these, by the way, is approximately: model: 15-30,
> stn: 1800, yearmoda: ~450 so far..., modelruntime: 4, and
> modelhr: 10-40... Does this make any difference on what order they
> should be listed in the index?

Within a compound index, probably not. If one column is a strong
limiting factor, you may want to give it an index by itself. In
general, not only in pgsql, anything that substantially narrows your
searches is usually worth its own index. I don't know much about your
data, but as an example, if you work a lot with ranges of dates, then
an index on the date column would be a good one to have.

> So, I should add a serial id column, I guess, and make that the
> primary key

That is one approach, although I still don't understand your data well
enough to say whether it is actually good. As a theoretical example:
the index you suggest below on model, stn, yearmoda, and modelruntime
could instead become a separate "model run" table, with a serial key
identifying each combination of those 4 columns. The table with the
rest of the info would then carry only that key instead of the 4
fields. This is basically normalization of your data; its advantage is
limiting how much I/O you need to do. (I sketch both ideas below.) Are
you familiar with normalization? I don't want to bore you with stuff
you may already know.

> Now I need some useful indexes. Most of my queries will be for a
> complete model run at a particular station, so I'm thinking of an
> index on (model, stn, yearmoda, modelruntime) to return all modelhr's
> at a stn ...

When you do your DB design you want to look not only at what makes it
easy to use, but also at what makes it efficient. This is where
normalization usually helps.
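On the NOT NULL point, here is roughly how the table could be declared
if those 5 columns really do identify a record. This is only an
untested sketch (the index name is invented); the types are taken from
your \d output above:

CREATE TABLE forecast (
    zhr          smallint,
    zday         smallint,
    model        character varying(32) NOT NULL,
    temp         numeric(6,2),
    modelhr      smallint NOT NULL,
    modelruntime smallint NOT NULL,
    modelrundate smallint,
    stn          character(4) NOT NULL,
    rh           numeric(6,2),
    wdsp         numeric(6,2),
    wddir        character varying(2),
    dwpt         numeric(6,2),
    lpre         numeric(6,2),
    yearmoda     date NOT NULL,
    PRIMARY KEY (model, stn, yearmoda, modelruntime, modelhr)
);

-- The index you mention for a complete run at one station.  If the
-- primary key's own index already gets used for such queries, this
-- one is redundant; EXPLAIN will tell you.
CREATE INDEX forecast_run_stn_idx
    ON forecast (model, stn, yearmoda, modelruntime);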
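And this is the kind of thing I mean by a "parent" table with a serial
key. All names here (model_run, forecast_hr, run_id) are made up for
illustration:

-- One row per model run; the serial column becomes the key the big
-- table carries instead of repeating these 4 columns in every row.
CREATE TABLE model_run (
    run_id       serial PRIMARY KEY,
    model        character varying(32) NOT NULL,
    stn          character(4) NOT NULL,
    yearmoda     date NOT NULL,
    modelruntime smallint NOT NULL
);

CREATE UNIQUE INDEX model_run_key
    ON model_run (model, stn, yearmoda, modelruntime);

-- The per-hour data then needs only the small integer key.
CREATE TABLE forecast_hr (
    run_id  integer NOT NULL REFERENCES model_run,
    modelhr smallint NOT NULL,
    temp    numeric(6,2),
    rh      numeric(6,2),
    wdsp    numeric(6,2),
    wddir   character varying(2),
    dwpt    numeric(6,2),
    lpre    numeric(6,2)
);

A complete run at one station then becomes a join through the small
table (the model/stn/date values are of course invented):

SELECT f.modelhr, f.temp, f.rh
  FROM model_run r, forecast_hr f
 WHERE r.model = 'avn'
   AND r.stn = 'KDEN'
   AND r.yearmoda = '2001-11-15'
   AND r.modelruntime = 12
   AND f.run_id = r.run_id;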
Although I rarely ever normalize formally, I usually apply some of its
concepts, depending on what I want to achieve.

> Another common set of queries deals with a *complete* model run (all
> stn's), so I would need another index on (model, yearmoda,
> modelruntime).
> Yet another useful query type contains all model runs within a
> certain date range, aggregated by modelhr... it looks to me like the
> above 2 indexes might be sufficient for that, though....

I advise you to think about whether there is any factor common to all
of these queries that could limit them. If such a limiting factor
exists, you could keep it in a separate, smaller table.

> ...or would I need an index that references modelhr for that one???

It is difficult for me to tell, since I know almost nothing about the
meaning of the fields and how you will access these columns.

> Sometimes I will need to re-import a forecast because something went
> wrong, and all the values are bogus... can I do a REPLACE into a
> table, specifying

PostgreSQL doesn't have a REPLACE, but what you do instead is "begin"
a transaction, delete the old data, re-insert the model run, and then
"end" the transaction (see the sketch below).

> How exactly is having a primary key as serial going to help me here?

It helps if you can have some sort of "parent" table with data that
acts as a "key", which you then use as your starting point. Having a
smaller table with the right indexes can greatly help your queries,
because less I/O needs to be done.

> The 40 million rows, by the way, is for about 2 months of data ;-)...
> we've got about another year of data to import (and will be growing
> by about 1.5 million rows a day), so I want to make sure I've got the
> table and indexes set up optimally first, since it will take about a
> week to import all the data.

At sizes like that you need to look at your hardware and at the
optimizations you have done in PostgreSQL, or any database for that
matter.

> It is currently on a P3 700MHz machine, running RedHat 7.2 and
> PostgreSQL 7.1.3, with 256MB of RAM. We have a dual P3 700 with 512MB
> that we could move it to for production (the mysql db is currently
> running on it)

Memory is super cheap nowadays, about $30 for 256MB if memory serves
me right; increase the memory on these machines. Are you using IDE or
SCSI? How many HDs? Any type of RAID? Have you increased your buffers
in PostgreSQL? Have you looked at your shared memory settings? Are you
using EXPLAIN with all your queries to see whether they use your
indexes? (Examples of what I mean are below.)

You also mentioned issues with downtime. Is this DB going to be used
24x7? You need to run VACUUM ANALYZE at least after every big update.
How often will your data be updated? Once data is loaded, will it be
changed at all?
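To be concrete about the REPLACE workaround: the whole re-import can
be one transaction, so if anything fails before the end you issue
ROLLBACK and the old data is untouched. A sketch, with invented values
and file name:

BEGIN;

-- throw away the bogus run
DELETE FROM forecast
 WHERE model = 'avn'
   AND stn = 'KDEN'
   AND yearmoda = '2001-11-15'
   AND modelruntime = 12;

-- re-load it: COPY from a server-side file, or plain INSERTs
COPY forecast FROM '/tmp/avn_run.copy';

COMMIT;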
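On the buffers and EXPLAIN questions, this is the sort of thing I
mean. With 7.1 the settings live in postgresql.conf; the numbers below
are only a starting point to experiment with, not a recommendation for
your machine:

# postgresql.conf: each shared buffer is an 8KB page, so 2048
# buffers is only 16MB; the shipped default is far lower
shared_buffers = 2048
sort_mem = 8192           # KB available per sort

Then, in psql, prefix any suspect query with EXPLAIN to see whether it
uses an index, and vacuum after every big load:

EXPLAIN SELECT * FROM forecast
 WHERE model = 'avn' AND stn = 'KDEN';

VACUUM ANALYZE forecast;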