Thread: pgsql and large tables
I've got a large database, currently about 40 million rows in the biggest table. I'm trying to set it up with PostgreSQL 7.1.3, but I've got some questions...
I initially created the table with a primary key (5 fields: char(4), varchar(32), date, int, int) and a couple additional indexes, one that just changes the order of the primary key, and one that sorts on the date. Then I imported 40 million rows and tried some queries, but discovered that none of my queries were using indexes, causing them to take forever.
So I read somewhere in the archives that you need to VACUUM a table regularly for indexes to work properly. I tried that, but aborted after about 5 hours. I can't use pgsql if I have to take the db down for more than 10-15 minutes a day.
Then I read somewhere else that you should drop your indexes before VACUUMing and re-create them afterwards. I tried that, and VACUUM finished in about 10 minutes. Kewl... but now I've been trying to recreate my primary key for the last 18 hours...not so good.
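In case it matters, what I ran was roughly this (the index names here aren't the real ones, just an illustration):

  -- drop the secondary indexes first
  DROP INDEX forecast_date_idx;
  DROP INDEX forecast_permuted_idx;

  -- this part finished in about 10 minutes
  VACUUM forecast;

  -- recreating the unique index behind the primary key is the part
  -- that has now been running for 18 hours
  CREATE UNIQUE INDEX forecast_pkey
      ON forecast (stn, model, yearmoda, modelruntime, modelhr);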
Should I have dropped all indexes *except* for the primary? or would VACUUM still take forever that way? Should I make an artificial primary key with a serial type to simplify things? Anyone have any hints at all for me?
thanks,
Partap Davis
On Tue, 13 Nov 2001, Gurupartap Davis wrote:

I am somewhat new to PostgreSQL, but I didn't see a reply to your message, so I'll tell you what I know about your questions.

> I've got a large database, currently about 40 million rows in the biggest table.
...
> I initially created the table with a primary key (5 fields: char(4), varchar(32), date, int, int)

Out of curiosity, why are you using this as your primary key and not a serial? Will your tables be normalized?

> and a couple additional indexes, one that just changes the order of the
> primary key, and one that sorts on the date.

From what I have read so far, your primary key doesn't sound like what you would want to have as a primary key... but I guess the design of your table is a totally different issue.

> Then I imported 40 million rows and tried some queries, but discovered
> that none of my queries were using indexes, causing them to take forever.

It is difficult for anyone to reply to generic questions. How about showing us the query, an explain plan, and the index attributes? Are any of the indexes based on a function?

> So I read somewhere in the archives that you need to VACUUM a table regularly
> for indexes to work properly.

Sort of. Let's say they work better. It has to do with what the optimizer believes your data distribution is, which determines when it will use an index.

> I tried that, but aborted after about 5 hours.

You have told us nothing about your hardware, your OS, or which version of PostgreSQL you are running (unless I missed it).

> I can't use pgsql if I have to take the db down for more than 10-15 minutes a day.

Then maybe you may not be able to use it right now. I believe that the current "vacuum analyze", and maybe plain vacuum too, may need to lock the table while they run. Hopefully others will explain this better. This will also change in the soon to be released 7.2.

> Then I read somewhere else that you should drop your indexes before VACUUMing
> and re-create them afterwards.

That makes little sense to me, and I don't recall reading it. What I do recall is that before a bulk load you want to drop your indexes. My understanding was that you want to have your indexes in place when you do vacuum analyze.

> I tried that, and VACUUM finished in about 10 minutes.
> I've been trying to recreate my primary key for the last
> 18 hours...not so good.

Again, we need more info about your hardware, OS, version of pgsql, etc.

> Should I have dropped all indexes *except* for the primary?

My understanding is that you may want to drop your indexes when doing a big load, not when doing a vacuum.

> Should I make an artificial primary key with a serial type to simplify things?

I recommend you do this. Not only because it is more efficient, but because the type of key you selected has "issues", for lack of a better term. You can of course have a "unique" index so you don't have dups.

I think that when you are dealing with a 40 million row table you need to consider your hardware. You also didn't tell us how big the rows are. The more info you give us, the more others will be able to help.
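By the way, to see whether a query uses an index you can prefix it with EXPLAIN. Something like this (the table and column names here are only an example, not yours):

  EXPLAIN SELECT * FROM bigtable WHERE somedate = '2001-11-13';
  -- "Index Scan using ... on bigtable" means an index is being used;
  -- "Seq Scan on bigtable" means the planner is reading the whole table.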
On 2001 November 13 03:49 pm, Gurupartap Davis wrote:

> I initially created the table with a primary key (5 fields: char(4),
> varchar(32), date, int, int) and a couple additional indexes, one that just
> changes the order of the primary key, and one that sorts on the date. Then
> I imported 40 million rows and tried some queries, but discovered that none
> of my queries were using indexes, causing them to take forever.

If your primary key is spread over 5 fields, then its index will also be spread over those 5 fields. This means that in order to use the index, your queries must specify those fields in the WHERE clause. Order is also important. From what it looks like above, your index is sorted first on the char(4), then the varchar(32) ... down to the last int. So if you specify everything except the first column (the char(4)), then your primary key index will be useless to you.

> So I read somewhere in the archives that you need to VACUUM a table
> regularly for indexes to work properly.

As I understand it, VACUUM simply cleans up obsolete versions of the data (an artifact of the transactions?), but VACUUM ANALYZE generates statistics about various useful things that allow the optimizer/planner to do its job more effectively.

> Should I make an artificial primary key with a serial type to simplify
> things? Anyone have any hints at all for me?

Integer primary keys (such as the SERIAL type) are probably going to be a heck of a lot more efficient than a 5-field mixed key. I ALWAYS start EVERY table I make with (id SERIAL PRIMARY KEY, ... I believe that this is sound practice. You can still add extra constraints (such as declaring a UNIQUE INDEX over several fields) to ensure data consistency, but using an integer id as the primary key has always paid off for me.

As an aside, I extend my nomenclature to use key_foo, where foo is the name of the table being referenced, when I use foreign keys. Makes life a lot easier in any number of situations.

--
Andrew G. Hammond  mailto:drew@xyzzy.dhs.org  http://xyzzy.dhs.org/~drew/
"To blow recursion you must first blow recur" -- me
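P.S. A minimal sketch of what I mean, with made-up table and column names matching your 5-field key:

  CREATE TABLE foo (
      id  SERIAL PRIMARY KEY,
      f1  char(4)      NOT NULL,
      f2  varchar(32)  NOT NULL,
      f3  date         NOT NULL,
      f4  int          NOT NULL,
      f5  int          NOT NULL
      -- ... the rest of the columns ...
  );

  -- the old 5-field primary key becomes a plain UNIQUE index,
  -- so duplicate rows are still rejected
  CREATE UNIQUE INDEX foo_f1_f2_f3_f4_f5_key ON foo (f1, f2, f3, f4, f5);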
err..forgot to cc the list...
--------------------------------------
Thanks for trying to help me out...I've got some more info for you:

> > I've got a large database, currently about 40 million rows in the biggest table.
> ...
> > I initially created the table with a primary key (5 fields: char(4), varchar(32), date, int, int)
>
> Out of curiosity why are you using this as your primary key and not a
> serial? Will your tables be normalized?
>
> > and a couple additional indexes, one that just changes the order of the
> > primary key, and one that sorts on the date.
>
> From what I have read so far your primary key doesn't sound like what you
> would want to have as a primary key... but I guess the design of your
> table is a totally different issue.

I'm migrating this table from an existing mysql installation...
This is what it looks like right now:

          Table "forecast"
  Attribute    | Type                  | Modifier
 --------------+-----------------------+----------
  zhr          | smallint              |
  zday         | smallint              |
  model        | character varying(32) |
  temp         | numeric(6,2)          |
  modelhr      | smallint              | not null
  modelruntime | smallint              | not null
  modelrundate | smallint              |
  stn          | character(4)          | not null
  rh           | numeric(6,2)          |
  wdsp         | numeric(6,2)          |
  wddir        | character varying(2)  |
  dwpt         | numeric(6,2)          |
  lpre         | numeric(6,2)          |
  yearmoda     | date                  | not null

It's a table for weather forecasts; a record is identified uniquely by (model, stn, yearmoda, modelruntime, modelhr), although I will rarely have a query that specifies all of those fields. The reason for the permuted indexes mentioned above is because mysql will use a prefix of a multi-column key to narrow down a search. I guess pgsql doesn't use indexes that way? (I noticed in the to-do list something like "re-enable partial indexes")

The cardinality on these, by the way, is approximately: model: 15-30, stn: 1800, yearmoda: ~450 so far..., modelruntime: 4, and modelhr: 10-40... Does this make any difference on what order they should be listed in the index?

So, I should add a serial id column, I guess, and make that the primary key (why is this? I'm wondering, since I will most likely never refer to that column). Now I need some useful indexes. Most of my queries will be for a complete model run at a particular station, so I'm thinking of an index on (model, stn, yearmoda, modelruntime) to return all modelhr's at a stn... Another common set of queries deals with a *complete* model run (all stn's), so I would need another index on (model, yearmoda, modelruntime). Yet another useful query type contains all model runs within a certain date range, aggregated by modelhr...it looks to me like the above 2 indexes might be sufficient for that, though.... eg:

  SELECT avg(temp) FROM forecast
   WHERE model='avn' AND stn='KDRO'
     AND yearmoda >= $date1 AND yearmoda <= $date2
     AND modelruntime=0
   GROUP BY modelhr;

...or would I need an index that references modelhr for that one???

Oh, and I guess I still need a UNIQUE index, as well. Hmm. Now I'm wondering again if that serial id column is going to mess with me. Sometimes I will need to re-import a forecast because something went wrong, and all the values are bogus...can I do a REPLACE into a table, specifying all of the columns of a UNIQUE index, without specifying the primary key? How exactly is having a primary key as serial going to help me here? (Sorry if this is a dumb question, I'm kinda new to this)

> > Should I make an artificial primary key with a serial type to simplify things?
>
> I recommend you do this.
> Not only because it is more efficient, but
> because the type of key you selected has "issues" for lack of a better
> term. You can of course have a "unique" index so you don't have dups.
>
> I think that when you are dealing with a 40 million table you need to
> consider your hardware. You also didn't tell us how big are the rows.
> The more info you give us the more others will be able to help.

The 40 million rows, by the way, is for about 2 months of data ;-)...we've got about another year of data to import (and will be growing by about 1.5 million rows a day), so I want to make sure I've got the table and indexes set up optimally first, since it will take about a week to import all the data.

It is currently on a P3 700MHz machine, running RedHat 7.2 and PostgreSQL 7.1.3, with 256MB of RAM. We have a dual P3 700 with 512MB that we could move it to for production (the mysql db is currently running on it).
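Oh, to spell out the indexes I was describing above (the index names are just placeholders):

  -- complete model run at one station
  CREATE INDEX forecast_stn_run_idx
      ON forecast (model, stn, yearmoda, modelruntime);

  -- complete model run across all stations
  CREATE INDEX forecast_run_idx
      ON forecast (model, yearmoda, modelruntime);

  -- uniqueness of a forecast record
  CREATE UNIQUE INDEX forecast_unique_idx
      ON forecast (model, stn, yearmoda, modelruntime, modelhr);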
On Thu, 15 Nov 2001, Gurupartap Davis wrote:

> Thanks for trying to help me out...I've got some more info for you:

Not a problem, but remember to include the list. I am but a relatively new user, and you would deprive yourself of the help of others more experienced if you don't cc the list.

> I'm migrating this table from an existing mysql installation...
> This is what it looks like right now:
>
>           Table "forecast"
>   Attribute    | Type                  | Modifier
>  --------------+-----------------------+----------
>   zhr          | smallint              |
>   zday         | smallint              |
>   model        | character varying(32) |
>   temp         | numeric(6,2)          |
>   modelhr      | smallint              | not null
>   modelruntime | smallint              | not null
>   modelrundate | smallint              |
>   stn          | character(4)          | not null
>   rh           | numeric(6,2)          |
>   wdsp         | numeric(6,2)          |
>   wddir        | character varying(2)  |
>   dwpt         | numeric(6,2)          |
>   lpre         | numeric(6,2)          |
>   yearmoda     | date                  | not null
>
> It's a table for weather forecasts, a record is identified uniquely by
> (model, stn, yearmoda, modelruntime, modelhr) although I will rarely have a
> query that specifies all of those fields.

What are your most heavily looked-upon columns, and do you have indexes on them?

> The reason for the permuted indexes mentioned above

Are those 5 columns what make a record unique? I find it strange that you didn't put "not null" on all of the fields you described as your primary key, model in particular. What are zhr and zday?

> is because mysql will use a prefix of a multi-column
> key to narrow down a search. I guess pgsql doesn't use indexes that way?

Don't know. I usually use indexes according to my needs. If I need to search a lot on 3 columns then I index those 3 columns. I don't see the point of adding columns to an index if they will rarely be used.

> (I noticed in the to-do list something like "re-enable partial indexes")

So far I have not tried to do searches on partial index keys.

> The cardinality on these, by the way, is approximately: model: 15-30,
> stn: 1800, yearmoda: ~450 so far..., modelruntime: 4, and
> modelhr: 10-40... Does this make any difference on what order they should be
> listed in the index?

On a compound index, probably not. If you have something which is a limiting factor then you may want to have that by itself on an index. In general, not only in pgsql, anything which limits your searches is usually worth having on its own index. I don't know much about your data, but as an example, if you work a lot with ranges of dates then an index on the date would be a good one to have.

> So, I should add a serial id column, I guess, and make that the primary key

That is one approach, and I still don't understand your data enough to say whether it is actually good. As a theoretical example, let's say that the columns you suggest below (model, stn, yearmoda and modelruntime) go into their own "model" table with a serial key, and the other table with the rest of the info only carries that key instead of those 4 fields. This is basically normalization of your data. Its advantage is limiting how much I/O you need to do. Are you familiar with normalization? I don't want to bore you with stuff you may already know.

> Now I need some useful indexes. Most of my queries will be for a
> complete model run at a particular station, so I'm thinking of an index on
> (model, stn, yearmoda, modelruntime) to return all modelhr's at a stn ...

When you do your DB design you not only want to look at what makes it easy to use, but at what makes it efficient. This is where normalization usually helps.
Although I rarely ever normalize formally, I usually apply some of its concepts depending on what I want to achieve.

> Another common set of queries deals with a *complete* model run (all stn's),
> so I would need another index on (model, yearmoda, modelruntime).
> Yet another useful query type contains all model runs within a certain date
> range, aggregated by modelhr...it looks to me like the above 2 indexes might
> be sufficient for that, though....

I advise you to think about whether there is anything which can limit your queries, some factor in common across all of them. If such a limiting factor exists then you could have it in a separate, smaller table.

> ...or would I need an index that references modelhr for that one???

It is difficult for me to try to understand your data, since I know almost nothing about the meaning of the fields and how you will access these columns.

> Sometimes I will need to re-import a forecast because something went wrong,
> and all the values are bogus...can I do a REPLACE into a table, specifying

PostgreSQL doesn't have a REPLACE, but what you do is "begin" a transaction, delete all the old data, re-insert the model run and then "end" the transaction.

> How exactly is having a primary key as serial going to help me here?

If you can have some sort of "parent" table with some data which is a "key", you can use it as your starting point. Having a smaller table with the right indexes can greatly help your queries, due to the smaller amount of I/O that needs to be done.

> The 40 million rows, by the way, is for about 2 months of data ;-)...we've
> got about another year of data to import (and will be growing by about 1.5
> million rows a day), so I want to make sure I've got the table and indexes
> set up optimally first, since it will take about a week to import all the
> data.

When you talk about such sizes you need to look at your hardware and at the optimizations you have done to PostgreSQL, or any database for that matter.

> It is currently on a P3 700MHz machine, running RedHat 7.2 and PostgreSQL
> 7.1.3, with 256MB of RAM. We have a dual P3 700 with 512MB that we could
> move it to for production (the mysql db is currently running on it)

Memory is super cheap nowadays, about $30 for 256MB if memory serves me right. Increase the memory on these machines. Are you using IDE or SCSI? How many HDs? Using any type of RAID? Have you increased your buffers on PostgreSQL? Have you looked at your shared memory settings? Are you using explain with all your queries to see if they are using your indexes?

You also mentioned issues with downtime. Is this DB going to be used 24x7? You need to do vacuum analyze at least after every big update.

How often will your data be updated? Once data is loaded, will it be changed at all?
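By the way, the delete-and-reinsert I described above would look roughly like this (all the values here are made up):

  BEGIN;
  -- throw away the corrupt model run
  DELETE FROM forecast
   WHERE model = 'avn' AND stn = 'KDRO'
     AND yearmoda = '2001-11-13' AND modelruntime = 0;
  -- re-insert the corrected rows (one INSERT per row, or a COPY)
  INSERT INTO forecast (model, stn, yearmoda, modelruntime, modelhr, temp)
       VALUES ('avn', 'KDRO', '2001-11-13', 0, 12, 23.5);
  COMMIT;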
Sheesh...I think I see what you're saying about normalization. I've never had a formal db class, but it looks like you're talking about removing all the redundancy from the database...and there's a lot in this one(!)

I could have a forecast table like so:

  id            serial primary key,
  model_id      references model (id),
  yearmoda      date,
  modelruntime  smallint

Then a temperature table:

  forecast_id   references forecast (id),
  stn_id        references station (id),
  modelhr       smallint,
  value         numeric(6,2)

repeat for relative humidity, dewpoint, etc...

zhr, zday, and modelrundate in the current table are redundant, as they are derivative of the date and modelhr....

It looks like it would save a hella lot of disk space...probably over 50%, seeing as how certain variables (dwpt, rh, lpre) only show up in a fraction of the records...(ie, most records have NULL for these fields)

Dang...I don't know if my employer will let me make a change this radical, though. We would have to redo every single query in about 50 scripts. Hrmmm. I'm a programmer, not a DB admin...but my employer is too cheap (errr, sorry..."thrifty") to hire a real DB admin :-/

> You also mentioned issues with downtime. Is this DB going to be used
> 24x7? You need to do vacuum analyze at least after every big update.
>
> How often will your data be updated? Once data is loaded will it be
> changed at all?

We need to run the db 24x7. Data comes in all day long, one model run at a time. A model run is anywhere from 10-40 hours interpolated over about 1800 weather stations. All we do is add data (and, occasionally, re-insert data that was found corrupt)...no deleting, though. Would you recommend doing a vacuum analyze after every model run, or would once a day be sufficient?
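To put the layout I sketched above into actual SQL, I'm picturing something roughly like this (just my first guess at names and types):

  CREATE TABLE model (
      id    serial PRIMARY KEY,
      name  varchar(32) NOT NULL
  );

  CREATE TABLE station (
      id    serial PRIMARY KEY,
      stn   char(4) NOT NULL
  );

  CREATE TABLE forecast (
      id            serial PRIMARY KEY,
      model_id      int NOT NULL REFERENCES model (id),
      yearmoda      date NOT NULL,
      modelruntime  smallint NOT NULL
  );

  CREATE TABLE temperature (
      forecast_id   int NOT NULL REFERENCES forecast (id),
      stn_id        int NOT NULL REFERENCES station (id),
      modelhr       smallint NOT NULL,
      value         numeric(6,2)
  );
  -- ...and similar tables for rh, dwpt, lpre, etc.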
"Gurupartap Davis" <partap@yahoo.com> writes: > Sheesh...I think I see what you're saying about normalization. I've never > had a formal db class, but it looks like you're talking about removing all > the redundancy from the database...and there's a lot in this one(!) [...] > Dang...I don't know if my employer will let me make a change this radical, > though. We would have to redo every single query in about 50 scripts. You might be able to define views and rules in such a way that old queries still work, or work with minor changes. -Doug -- Let us cross over the river, and rest under the shade of the trees. --T. J. Jackson, 1863
On Thu, 15 Nov 2001, Gurupartap Davis wrote:

> Sheesh...I think I see what you're saying about normalization. I've never
> had a formal db class, but it looks like you're talking about removing all
> the redundancy from the database...and there's a lot in this one(!)

The basics of normalization are not all that complex. Most of the time the biggest issue is truly understanding your data and how you will need to access it.

> I could have a forecast table like so:

I could try to help you, but I would need the original tables with every field explained, and then your suggested new design.

> zhr, zday, and modelrundate in the current table are redundant, as they are
> derivative of the date and modelhr....

If they can be derived AND you will never need to search on them, then don't store them at all.

> It looks like it would save a hella lot of disk space.

It is not only the space saving; your queries will also run faster.

> seeing as how certain variables (dwpt, rh, lpre) only show up in a fraction
> of the records...(ie, most records have NULL for these fields)

Those could possibly be in a separate table then.

> Dang...I don't know if my employer will let me make a change this radical,
> though.

You could do something in between an optimal design and something which would be relatively easy to do. Just explain to your employer that these changes can reduce space consumption and speed up your queries. The alternative is getting faster hardware and more memory... and even then queries may not be as fast as they could be just by re-designing your database.

> We would have to redo every single query in about 50 scripts.

Again, talk to your employer. This change takes place once, while the savings in time and space go on for the long term.

> Hrmmm. I'm a programmer, not a DB admin...but my employer is too cheap
> (errr, sorry..."thrifty") to hire a real DB admin :-/

If you are not paid by the hour then it may just be an issue of you getting some help over the net and doing it yourself. You may want to look into general DB forums, newsgroups, etc... where people with more experience in design can help you.

Doing a non-formal design is fairly simple; it is just a matter of knowing your data. If you take the time to do what I suggested and explain your existing tables field by field, taking into consideration that we may not know the terms, then people on this forum may be able to help you. You don't need to go into too many details. Something like:

  zhr    hour of ???. Usually blank/populated, etc.
  zday   day of ???. Usually blank/populated, etc.
  model  a description/name??
  temp   temperature at time/date???

You could also throw in 20 records so we see how things are distributed, in other words which values are often repeated and may be better off in a separate table, and which fields are usually blank so perhaps you may consider putting them in a separate table for the space savings.

> We need to run the db 24x7. Data comes in all day long,

How about letting the data go into ASCII files and then importing them at set intervals? In particular you want to have time to do your vacuum analyze at least once per day.

> one model run at a time.
> A model run is anywhere from 10-40 hours
> interpolated over about 1800 weather stations.

Is the model run on the database?

> All we do is add data (and, occasionally, re-insert data
> that was found corrupt)...no deleting, though. Would you recommend doing a
> vacuum analyze after every model run, or would once a day be sufficient?
Actually, in your case I would recommend doing a vacuum analyze after each large insert, if possible. If not, then once a day.

Does the data come in batches or as a continuous stream?

What are the chances of getting more memory, as I mentioned in the previous mail?
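As an example of the batch-load-then-vacuum approach (the file name and delimiter here are made up):

  -- load the accumulated ASCII file in one shot, then refresh the statistics
  COPY forecast FROM '/tmp/forecast_batch.txt' USING DELIMITERS '|';
  VACUUM ANALYZE forecast;
  -- note: COPY ... FROM a server-side file must be run as the database
  -- superuser; from psql, \copy does the same thing from the client side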
The way I would do this is to spend some time designing the data model which you find optimal. When this is done, you create it in a database and test it for a little while. Then you create views which look like your old database; this way the interface with your application will not be broken. Finally you test the new interface (make sure it actually works as it is supposed to), import the old data, and continue running it. Then you make your own little plan as to how, which and when you are going to update the other scripts.

This will give you many advantages:
1. You will learn a lot more about postgres (views, rules etc.)
2. You will be able to get many of the advantages much quicker
3. You will get all the advantages over time and you can focus on the most important ones first.

Regards,

Aasmund.
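P.S. Such a compatibility view might look roughly like this. It is purely a sketch; it assumes new tables along the lines discussed earlier in the thread, and it assumes the run-level table is named forecast_run so that the old name "forecast" is free for the view:

  CREATE VIEW forecast AS
      SELECT m.name  AS model,
             s.stn   AS stn,
             r.yearmoda,
             r.modelruntime,
             t.modelhr,
             t.value AS temp
        FROM forecast_run r, model m, station s, temperature t
       WHERE r.model_id = m.id
         AND t.forecast_id = r.id
         AND t.stn_id = s.id;
  -- old SELECTs keep working; INSERTs through the view would additionally
  -- need rewrite rules on top of it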
Aasmund Midttun Godal
aasmund@godal.com - http://www.godal.com/
+47 40 45 20 46