Thread: Vacuum advice
Hi, I have a system that has a moderate amount of activity on it, nothing strenuous. The activity is a real mixture of operations: selects, updates, inserts and deletes. One thing strange about our database is that we have a log of stored procedures that use temporary tables. Most of the time the system runs fine, and a nightly vacuum does the job. However, occasionally we need to import data, and this involves inserting several million rows into a table, but this just *cripples* postgres. After the import has been running for a while, simple selects take a long time, and strangely, the query planner starts using a sequential scan on a table with millions of items in it. Surely inserts shouldn't be doing this? Should I be running a vacuum during the import? After how many operations should I be calling vacuum? We tried using pg_autovacuum but it didn't seem to do a very good job since we would often find the query planner getting it wrong, and a vacuum analyze would fix it. Part of the import is to update items in a table to keep track of where the import is up to. The update is just setting an integer column, so does this create a dead tuple or will postgres alter the column in-place? Thanks for your help -- David Mitchell Software Engineer Telogis
David Mitchell <david.mitchell@telogis.com> writes: > However, occasionally we need to import data, and this involves > inserting several million rows into a table, but this just *cripples* > postgres. After the import has been running for a while, simple selects > take a long time, and strangely, the query planner starts using a > sequential scan on a table with millions of items in it. Surely inserts > shouldn't be doing this? Should I be running a vacuum during the import? PG 8.0 and up should handle this a bit more nicely. Before 8.0 the planner looked only at the table size recorded in pg_catalog (the relpages and reltuples fields) and so of course it wouldn't react to a still-in-progress mass insert. Now it looks at the current actual physical table size, and so it should be willing to switch to indexscan if needed before the import completes. If you *are* using 8.0 then we need to look closer. > After how many operations should I be calling vacuum? We tried using > pg_autovacuum but it didn't seem to do a very good job since we would > often find the query planner getting it wrong, and a vacuum analyze > would fix it. autovacuum has a problem with responding to transactions that are still-in-progress too :-( > Part of the import is to update items in a table to keep track of where > the import is up to. The update is just setting an integer column, so > does this create a dead tuple or will postgres alter the column in-place? Postgres doesn't *ever* update anything in-place. regards, tom lane
Thanks Tom, > If you *are* using 8.0 then we need to look closer. > Sorry, I should have mentioned, I am using PG 8.0. Also, although this is a 'mass insert', it's only kind of mass. While there are millions of rows, they are inserted in blocks of 500 (with a commit in between). We're thinking we might set up vacuum_cost_limit to around 100 and put vacuum_cost_delay at 100 and then just run vacuumdb in a cron job every 15 minutes or so, does this sound silly? -- David Mitchell Software Engineer Telogis
David Mitchell <david.mitchell@telogis.com> writes: >> If you *are* using 8.0 then we need to look closer. > Sorry, I should have mentioned, I am using PG 8.0. Also, although this > is a 'mass insert', it's only kind of mass. While there are millions of > rows, they are inserted in blocks of 500 (with a commit in between). > We're thinking we might set up vacuum_cost_limit to around 100 and put > vacuum_cost_delay at 100 and then just run vacuumdb in a cron job every > 15 minutes or so, does this sound silly? It doesn't sound completely silly, but if you are doing inserts and not updates/deletes then there's not anything for VACUUM to do, really. An ANALYZE command might get the same result with less effort. I am however still wondering why 8.0 doesn't get it right without help. Can you try a few EXPLAIN ANALYZEs as the table grows and watch whether the cost estimates change? (Also, if this is actually 8.0.0 and not a more recent dot-release, I believe there were some bug fixes in this vicinity in 8.0.2.) regards, tom lane
> >>We're thinking we might set up vacuum_cost_limit to around 100 and put >>vacuum_cost_delay at 100 and then just run vacuumdb in a cron job every >>15 minutes or so, does this sound silly? > > > It doesn't sound completely silly, but if you are doing inserts and not > updates/deletes then there's not anything for VACUUM to do, really. > An ANALYZE command might get the same result with less effort. I think that perhaps the fact we are doing updates in the secondary table to track the import is the culprit here. It gets updated for each item inserted into the main table, so even though it has 500 rows, it ended up with about 2million dead tuples, which left a lot to be desired in terms of seq scan speed. Vacuum full cleared this up, so I assume a frequent regular vacuum would keep it in tip top condition. We are using PG 8.0.1. Thanks for your help Tom. -- David Mitchell Software Engineer Telogis
On 06/22/2005 04:39:00 PM, Tom Lane wrote: > David Mitchell <david.mitchell@telogis.com> writes: > > However, occasionally we need to import data, and this involves > > inserting several million rows into a table, but this just > *cripples* > > postgres. After the import has been running for a while, simple > selects > > take a long time, and strangely, the query planner starts using a > > sequential scan on a table with millions of items in it. Surely > inserts > > shouldn't be doing this? Should I be running a vacuum during the > import? > > PG 8.0 and up should handle this a bit more nicely. Before 8.0 the > planner looked only at the table size recorded in pg_catalog (the > relpages and reltuples fields) and so of course it wouldn't react to a > still-in-progress mass insert. Now it looks at the current actual > physical table size, and so it should be willing to switch to > indexscan > if needed before the import completes. I've made a habit of disconnecting, analyzing and re-connecting every so often on mass data loading because any stored procedures I'm using for the load, including triggers, will not have their query plans updated to account for the new data. It's a pain, so I figured I'd ask whether I'm missing something and this is somehow unnecessary. Karl <kop@meme.com> Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein