Thread: possible vacuum improvement?
I know everyone is busy with the 7.3 beta, but maybe this is something to think of before releasing it. Currently VACUUM will vacuum every table, but sometimes it's desirable to leave tables untouched because they're mostly static or protocol tables. In my case this would be pg_largeobject, which is around 4GB of data, while the other tables are ~40MB. Vacuuming the data is important; the large object table, however, rarely changes. The same goes for a protocol table which is around 1GB and is never changed besides INSERTs, so it's just growing but never needs vacuum. VACUUM on the 4GB table takes a long, long time and brings no improvement; it just hurts performance and fills OS buffers.

If pg_class had a field for storing misc flags (e.g. a bitfield), it would be possible to set a flag like NO_AUTO_VACUUM and modify the vacuum code to leave those tables untouched unless specified by hand. Maybe there are other uses for such a bitfield too, and it would help avoid an initdb for simple improvements.

Any comments?

Best regards,
Mario Weilguni
On 3 Sep 2002 at 8:55, Mario Weilguni wrote:

> I know everyone is busy with the 7.3 beta, but maybe this is something to
> think of before releasing it. Currently VACUUM will vacuum every table, but
> sometimes it's desirable to leave tables untouched because they're mostly
> static or protocol tables. In my case this would be pg_largeobject, which
> is around 4GB of data, while the other tables are ~40MB. Vacuuming the data
> is important; the large object table, however, rarely changes. The same
> goes for a protocol table which is around 1GB and is never changed besides
> INSERTs, so it's just growing but never needs vacuum. VACUUM on the 4GB
> table takes a long, long time and brings no improvement; it just hurts
> performance and fills OS buffers.
>
> If pg_class had a field for storing misc flags (e.g. a bitfield), it would
> be possible to set a flag like NO_AUTO_VACUUM and modify the vacuum code to
> leave those tables untouched unless specified by hand. Maybe there are
> other uses for such a bitfield too, and it would help avoid an initdb for
> simple improvements.
>
> Any comments?

I suggest vacuuming only the tables that change. Further, I believe updates/deletes should be watched for performance, as they cause dead tuples. Of course inserts impact statistics and should be monitored, but something like a log table does not need vacuuming that often. In short, knowing the application load can help a lot in tuning the DB.

I was running a banking simulation for benchmarking. I know that the accounts table gets updated for each transaction but the log table only gets inserts. So rather than vacuuming the entire DB, just doing 'vacuum analyze accounts' gives me almost the same results, and performance was far better with the targeted vacuum. Without any vacuum I got something like 50 tps for 80K transactions. With 'vacuum analyze accounts' every 5K transactions I got 200 tps.
Personally I would prefer to have a trigger on a metadata table where I could trigger vacuuming a particular table every n transactions (oh, it would be great if that vacuum ran in the background without blocking the metadata table... just a wishlist...). Can anybody tell me on which table I could write such a trigger? I went through pg_* for some time but didn't find what I was looking for.

Bye,
Shridhar

--
Reisner's Rule of Conceptual Inertia: If you think big enough, you'll never have to do it.
> Personally I would prefer to have a trigger on a metadata table where I
> could trigger vacuuming a particular table every n transactions (oh, it
> would be great if that vacuum ran in the background without blocking the
> metadata table... just a wishlist...). Can anybody tell me on which table
> I could write such a trigger? I went through pg_* for some time but didn't
> find what I was looking for.

Actually, if you wrote it in C and kept some static data on each table, you could probably write a vacuum trigger pretty easily. You could even keep the info in a table.

Chris
On 3 Sep 2002 at 15:14, Christopher Kings-Lynne wrote:

> Actually, if you wrote it in C and kept some static data on each table,
> you could probably write a vacuum trigger pretty easily. You could even
> keep the info in a table.

Actually that's what I did: update a global transaction counter, then trigger the vacuum from a spare thread. But having it in the DB has the advantage of centralisation. It's just a good-to-have kind of thing.

Bye,
Shridhar

--
"I don't know why, but first C programs tend to look a lot worse than first programs in any other language (maybe except for Fortran, but then I suspect all Fortran programs look like `firsts')" (By Olaf Kirch)
> gets updated for each transaction but the log table only gets inserts. So
> rather than vacuuming the entire DB, just doing 'vacuum analyze accounts'
> gives me almost the same results.

That is not really practicable; one database has 107 tables, and making a cron job with 107 vacuum calls is completely out of the question and very error prone anyway.

Regards,
Mario Weilguni
> Actually that's what I did: update a global transaction counter, then
> trigger the vacuum from a spare thread. But having it in the DB has the
> advantage of centralisation. It's just a good-to-have kind of thing.

Care to submit it as a BSD licensed contrib module then? Or at least create a project for it on http://gborg.postgresql.org/ ?

Chris
On 3 Sep 2002 at 9:36, Mario Weilguni wrote:

> That is not really practicable; one database has 107 tables, and making a
> cron job with 107 vacuum calls is completely out of the question and very
> error prone anyway.

That's correct. What are the possible alternatives? Either the backend has to support something or the DBA has to script something.

1) If the number of tables that need vacuum is far larger than the number that don't, then a simple all-table vacuum would do. But again, the sizes of the individual tables will affect that judgement as well.

2) As the OP suggested, vacuum could pick up only those tables marked by bitfields, say via an additional option like 'vacuum analyze frequent_ones'. This is going to need a backend change.

3) I guess scripting a cron job for vacuum is a one-time job. If it's desperately needed, say 60 tables out of 107 require vacuum, personally I would spend some time making that script. Depends upon the requirement, actually.

On a side note, does anybody have some statistics, from a benchmark maybe, as to what's a rule of thumb for vacuuming? I found that a vacuum every 5K-10K transactions increases the tps dramatically, but below 1K transactions it's not as effective. Maybe one should consider this factor as well.

Bye,
Shridhar

--
Pascal: A programming language named after a man who would turn over in his grave if he knew about it. -- Datamation, January 15, 1984
On 3 Sep 2002 at 15:39, Christopher Kings-Lynne wrote:

> Care to submit it as a BSD licensed contrib module then? Or at least
> create a project for it on http://gborg.postgresql.org/ ?

Sounds like a nice idea. I would do that by this weekend, once I finalise the details. Give me a couple of days to finish it. Will come back soon with that.

Bye,
Shridhar

--
Reporter, n.: A writer who guesses his way to the truth and dispels it with a tempest of words. -- Ambrose Bierce, "The Devil's Dictionary"
On Tue, 2002-09-03 at 03:36, Mario Weilguni wrote:

> That is not really practicable; one database has 107 tables, and making a
> cron job with 107 vacuum calls is completely out of the question and very
> error prone anyway.

So... write a script which does something like:

    skiptables="'skipme', 'andme'"
    tables=$(psql -At -c "SELECT relname FROM pg_class
                          WHERE relkind = 'r'
                            AND relname NOT IN (${skiptables})" template1)
    for tab in ${tables}; do
        vacuumdb -t "${tab}"
    done

Fill in the holes and you're done -- restrict to the right pg_class relkind, handle schemas appropriately, etc.
Mario Weilguni <mweilguni@sime.com> writes:

> I know everyone is busy with the 7.3 beta, but maybe this is something to
> think of before releasing it.

We are already in feature freeze.

In terms of what might happen for 7.4 or beyond, what I'd personally like to see is some "auto vacuum" facility that would launch background vacuums automatically every so often. This could (eventually) be made self-tuning so that it would vacuum heavily-updated tables more often than seldom-updated ones --- while not forgetting the every-billion-transactions rule...

regards, tom lane
On 3 Sep 2002 at 9:49, Tom Lane wrote:

> In terms of what might happen for 7.4 or beyond, what I'd personally
> like to see is some "auto vacuum" facility that would launch background
> vacuums automatically every so often. This could (eventually) be made
> self-tuning so that it would vacuum heavily-updated tables more often
> than seldom-updated ones --- while not forgetting the
> every-billion-transactions rule...

OK, I plan to work on this. Here is my brief idea:

1) Create a table (vacuum_info) that stores table names and auto vacuum defaults. Since I am planning this in contrib, I would not touch pg_class. The table will store:
   - table names
   - number of transactions to trigger vacuum analyze (default 1K)
   - number of transactions to trigger full vacuum (default 10K)
   A trigger on pg_class, i.e. on table creation, should add a row to this table as well.

2) Write a trigger on tables that updates statistics on table activity. I see pg_stat_all_tables, pg_stat_sys_tables and pg_stat_user_tables, with the columns n_tup_ins, n_tup_upd and n_tup_del. Of course it will ignore its own updates and inserts to avoid infinite loops. This will update the pseudo-statistics in the vacuum_info table. Another trigger on vacuum_info will trigger vacuum if required. Ideally I would write it in an external multithreaded library so the vacuum runs in the background without blocking operations on the vacuum_info table.

I need to know the following:

1) Does this sound like a workable solution?
2) Is this as simple as I have put it here, or am I missing some vital components?
3) Is there some kind of rework involved?
4) Does the use of threads sound portable enough? I just need to start a thread in the background and return. No locking, nothing is required. Will there be any problem for postgres invoking such an external trigger?
5) When I create a function in a .so, is it possible to invoke init/startup routines? I can create and destroy the thread in these routines to avoid thread creation overhead.
If postgres is using dlopen, I can use _init/_fini.

6) Such a 'daemon' would be per-backend, if I am guessing correctly. Would locking things in transactions for vacuum_info be sufficient?

I hope I am making a sensible proposal/design (my first attempt to contribute to postgres). Please let me know your comments.

Bye,
Shridhar

--
Blast medicine anyway! We've learned to tie into every organ in the human body but one. The brain! The brain is what life is all about. -- McCoy, "The Menagerie", stardate 3012.4
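[For readers following along: the bookkeeping in the proposal above can be sketched outside SQL. This is only an illustration of the counting logic, not PostgreSQL code; the class name, method names, and the 1K/10K thresholds are taken from the proposal's stated defaults, everything else is made up.]

```python
# Sketch of the proposed vacuum_info bookkeeping: per-table modification
# counters, with VACUUM ANALYZE due every 1K changes and VACUUM FULL at 10K.

class VacuumInfo:
    def __init__(self, analyze_threshold=1000, full_threshold=10000):
        self.analyze_threshold = analyze_threshold
        self.full_threshold = full_threshold
        self.activity = {}   # table name -> modifications since last full vacuum

    def record(self, table, n=1):
        """Called by the per-table trigger: count inserts/updates/deletes."""
        self.activity[table] = self.activity.get(table, 0) + n
        return self.maybe_vacuum(table)

    def maybe_vacuum(self, table):
        """Decide which maintenance command (if any) is due for the table."""
        count = self.activity[table]
        if count >= self.full_threshold:
            self.activity[table] = 0          # reset after a full vacuum
            return "VACUUM FULL " + table
        if count >= self.analyze_threshold and count % self.analyze_threshold == 0:
            return "VACUUM ANALYZE " + table
        return None

info = VacuumInfo()
cmds = []
for _ in range(10000):
    cmd = info.record("accounts")
    if cmd:
        cmds.append(cmd)
print(cmds[0], "...", cmds[-1])   # VACUUM ANALYZE accounts ... VACUUM FULL accounts
```

The real version would of course live in triggers and a contrib .so; the sketch only shows when the two thresholds fire.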
"Shridhar Daithankar" <shridhar_daithankar@persistent.co.in> writes:

> 1) Does this sound like a workable solution?

Adding a trigger to every tuple update won't do at all. Storing the counts in a table won't do either, as the updates on that table will generate a huge amount of wasted space themselves (not to mention enough contention to destroy concurrent performance).

> 4) Does the use of threads sound portable enough?

Threads are completely out of the question, at least if you have any hope of seeing this code get accepted into the core distro.

For vacuum's purposes all that we really care to know about is the number of obsoleted tuples in each table: committed deletes and updates, and aborted inserts and updates all count. Furthermore, we do not need or want a 100% reliable solution; approximate counts would be plenty good enough.

What I had in the back of my mind was: each backend counts attempted insertions and deletions in its relcache entries (an update adds to both counts). At transaction commit or abort, we know which of these two counts represents the number of dead tuples added to each relation, so while we scan the relcache for post-xact cleanup (which we will be doing anyway) we can transfer the correct count into the shared FSM entry for the relation. This gives us a reasonably accurate count in shared memory of all the tuple obsoletions since bootup, at least for heavily-used tables. (The FSM might choose to forget about lightly-used tables.) The auto vacuumer could look at the FSM numbers to decide which tables are highest priority to vacuum.

This scheme would lose the count info on a database restart, but that doesn't bother me. In typical scenarios the same tables will soon get enough new counts to be highly ranked for vacuuming. In any case the auto vacuumer must be designed so that it vacuums every table every so often anyhow, so the possibility of forgetting that there were some dead tuples in a given table isn't catastrophic.
I do not think we need or want a control table for this; certainly I see no need for per-table manual control over this process. There should probably be a few knobs in the form of GUC parameters so that the admin can control how much overall work the auto-vacuumer does. For instance you'd probably like to turn it off when under peak interactive load. regards, tom lane
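[The counting scheme sketched above can be illustrated in a few lines. This is a toy model, not PostgreSQL internals: the Backend class and the shared_dead_tuples dict merely stand in for relcache entries and shared FSM entries.]

```python
# Each backend counts attempted inserts and deletes per relation; at
# commit the deletes became dead tuples, at abort the inserts did, and
# the winning count is transferred to a shared per-relation tally.

shared_dead_tuples = {}       # stands in for the shared FSM entries

class Backend:
    def __init__(self):
        self.pending = {}     # rel -> [attempted_inserts, attempted_deletes]

    def insert(self, rel):
        self.pending.setdefault(rel, [0, 0])[0] += 1

    def delete(self, rel):
        self.pending.setdefault(rel, [0, 0])[1] += 1

    def update(self, rel):
        # an update adds to both counts: a new version plus an obsoleted one
        self.insert(rel)
        self.delete(rel)

    def end_xact(self, committed):
        # On commit the deletes are the dead tuples; on abort the inserts are.
        for rel, (ins, dels) in self.pending.items():
            dead = dels if committed else ins
            shared_dead_tuples[rel] = shared_dead_tuples.get(rel, 0) + dead
        self.pending.clear()

b = Backend()
for _ in range(5):
    b.update("accounts")      # 5 committed updates -> 5 dead tuples
b.end_xact(committed=True)

b.insert("log")               # an aborted insert also leaves a dead tuple
b.end_xact(committed=False)

# the table the auto vacuumer would rank highest
print(max(shared_dead_tuples, key=shared_dead_tuples.get))   # accounts
```

Note how both branches of the commit/abort distinction produce dead tuples, matching the "committed deletes and updates, and aborted inserts and updates all count" rule.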
On Tue, 2002-09-03 at 11:01, Tom Lane wrote:

> For vacuum's purposes all that we really care to know about is the
> number of obsoleted tuples in each table: committed deletes and updates,
> and aborted inserts and updates all count. Furthermore, we do not need
> or want a 100% reliable solution; approximate counts would be plenty
> good enough.

It would be nice if it could track successful inserts, and fire off an analyze run when it changes more than 20% from what stats says.
Rod Taylor <rbt@zort.ca> writes:

> It would be nice if it could track successful inserts, and fire off an
> analyze run when it changes more than 20% from what stats says.

That's a thought too. I was only thinking of space reclamation, but it'd be easy to extend the scheme to keep track of the number of tuples successfully inserted, changed, or deleted (all three events would affect stats) as well as the number of dead tuples. Then you could fire auto-analyze every so often, along with auto-vacuum.

Auto-analyze might need more tuning controls than auto-vacuum, though. Vacuum doesn't have any question about when it needs to run: a dead tuple is a dead tuple. But for analyze you might have plenty of update traffic and yet no meaningful change in the interesting stats for a table. An admin who knows the behavior of his tables would like to be able to configure the frequency of analyze runs, rather than trust to a necessarily-not-too-bright auto-analyze routine. (Not sure whether this is important enough to warrant the complications of making it configurable though. You can always do it the old-fashioned way with cron scripts if you want that kind of control, I suppose.)

regards, tom lane
> I do not think we need or want a control table for this; certainly I see
> no need for per-table manual control over this process. There should
> probably be a few knobs in the form of GUC parameters so that the admin
> can control how much overall work the auto-vacuumer does. For instance
> you'd probably like to turn it off when under peak interactive load.

If (auto)vacuum is clever enough to check that some tables do not need vacuum, there's really no need for that.

That brings me to another point: can't the statistics collector be used for this? For my database I wrote a statistics display program for web access, and all the info autovacuum would need is there: http://mw.sime.com/pgsql.htm

And another point: is there interest in this web statistics frontend, maybe for /contrib? I found it extremely useful because it showed up the weak points in my applications.

Best regards,
Mario Weilguni
Wouldn't it make sense to implement autovacuum information in a structure like the FSM, a Dirty Space Map (DSM)? As blocks are dirtied by transactions they can be added to the DSM. Then vacuum can give priority processing to those blocks only.

The reason I suggest this is that in many usage scenarios it will be more efficient to vacuum only part of a table than the entire table. Given a large table that grows over time, it tends to be the case that older data in the table becomes more static as it ages (a lot of financial data is like this: when it is initially created it may get a lot of updates early in its life and may even be deleted, but once the data gets older (for example a year old), it is unlikely to change). This would imply that over time the first blocks in a table will change less and most activity will occur towards the end of the table. If you have a multi-gig table where most of the activity occurs near the end, a lot of CPU cycles can be wasted going over the mostly static beginning of the table.

thanks,
--Barry

Tom Lane wrote:

> Adding a trigger to every tuple update won't do at all. Storing the
> counts in a table won't do either, as the updates on that table will
> generate a huge amount of wasted space themselves (not to mention
> enough contention to destroy concurrent performance).
>
> Threads are completely out of the question, at least if you have any
> hope of seeing this code get accepted into the core distro.
>
> For vacuum's purposes all that we really care to know about is the
> number of obsoleted tuples in each table: committed deletes and updates,
> and aborted inserts and updates all count. Furthermore, we do not need
> or want a 100% reliable solution; approximate counts would be plenty
> good enough.
> What I had in the back of my mind was: each backend counts attempted
> insertions and deletions in its relcache entries (an update adds to both
> counts). At transaction commit or abort, we know which of these two
> counts represents the number of dead tuples added to each relation, so
> while we scan the relcache for post-xact cleanup (which we will be doing
> anyway) we can transfer the correct count into the shared FSM entry for
> the relation. This gives us a reasonably accurate count in shared
> memory of all the tuple obsoletions since bootup, at least for
> heavily-used tables. (The FSM might choose to forget about lightly-used
> tables.) The auto vacuumer could look at the FSM numbers to decide
> which tables are highest priority to vacuum.
>
> This scheme would lose the count info on a database restart, but that
> doesn't bother me. In typical scenarios the same tables will soon get
> enough new counts to be highly ranked for vacuuming. In any case the
> auto vacuumer must be designed so that it vacuums every table every so
> often anyhow, so the possibility of forgetting that there were some dead
> tuples in a given table isn't catastrophic.
>
> I do not think we need or want a control table for this; certainly I see
> no need for per-table manual control over this process. There should
> probably be a few knobs in the form of GUC parameters so that the admin
> can control how much overall work the auto-vacuumer does. For instance
> you'd probably like to turn it off when under peak interactive load.
>
> regards, tom lane
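[The Dirty Space Map idea can be made concrete with a small model. Everything here is hypothetical — the DSM was only a proposal in this thread, and the class, block size constant, and table names are illustrative, not PostgreSQL code.]

```python
# Toy Dirty Space Map: remember which blocks of each table were dirtied
# since the last vacuum, so a partial vacuum can visit only those blocks
# instead of scanning the whole (mostly static) table.

BLOCK_SIZE = 8192  # bytes per heap page, matching PostgreSQL's default

class DirtySpaceMap:
    def __init__(self):
        self.dirty = {}   # table -> set of dirtied block numbers

    def mark(self, table, byte_offset):
        """Called when a tuple at byte_offset is modified."""
        self.dirty.setdefault(table, set()).add(byte_offset // BLOCK_SIZE)

    def blocks_to_vacuum(self, table):
        """Blocks a partial vacuum should visit; cleared once handed out."""
        return sorted(self.dirty.pop(table, set()))

dsm = DirtySpaceMap()
# a roughly million-block table where only the tail gets updated:
for off in (8_191_990_000, 8_191_995_000, 8_191_999_000):
    dsm.mark("big_table", off)
print(dsm.blocks_to_vacuum("big_table"))   # [999998, 999999]
```

The win is exactly the one described above: for the financial-data pattern, vacuum touches the two hot blocks at the end rather than all million.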
"Mario Weilguni" <mario.weilguni@icomedias.com> writes:

> That brings me to another point: can't the statistics collector be used
> for this?

Hmm, that would be a different way of attacking the problem. Not sure offhand which is better, but it'd surely be worth considering both.

Note that collecting of dead-tuple counts requires input from aborted transactions as well as successful ones. I don't recall whether the stats collector currently collects anything from aborted xacts; that might or might not be a sticky point.

regards, tom lane
> That brings me to another point: is there interest in this web statistics
> frontend, maybe for /contrib? I found it extremely useful because it
> showed up the weak points in my applications.

Why not create a project for it here: http://gborg.postgresql.org/

Chris
On Tuesday 03 September 2002 16:24, Tom Lane wrote:

> Hmm, that would be a different way of attacking the problem. Not sure
> offhand which is better, but it'd surely be worth considering both.
>
> Note that collecting of dead-tuple counts requires input from aborted
> transactions as well as successful ones. I don't recall whether the
> stats collector currently collects anything from aborted xacts; that
> might or might not be a sticky point.

I have been doing some poking around with this item, and I was planning on using the stats collector to do "intelligent" auto-vacuuming. I was planning on adding some new columns that account for activity that has taken place since the last vacuum. The current stats collector shows n_tup_ins, n_tup_upd and n_tup_del for any given rel, but those numbers have nothing to do with what has happened since the last vacuum, hence nothing to do with current status or need for vacuum.

I hope to have something worth showing soon (a week or two). I know that is a bit slow, but I am new at pg internals, and since we are in beta I know this is a 7.4 item.
FYI, the current stats collector does keep track of inserts, updates and deletes that are part of a rolled back transaction, as shown in the example below:

matthew=# create TABLE foo (id serial, name text);
NOTICE:  CREATE TABLE will create implicit sequence 'foo_id_seq' for SERIAL column 'foo.id'
CREATE TABLE
matthew=# select relname, n_tup_ins, n_tup_upd, n_tup_del from pg_stat_all_tables where relname = 'foo';
 relname | n_tup_ins | n_tup_upd | n_tup_del
---------+-----------+-----------+-----------
 foo     |         0 |         0 |         0
(1 row)

matthew=# INSERT INTO foo (name) VALUES ('asdf');
INSERT 17075 1
matthew=# UPDATE foo SET name='qwert';
UPDATE 1
matthew=# DELETE FROM foo;
DELETE 1
matthew=# select relname, n_tup_ins, n_tup_upd, n_tup_del from pg_stat_all_tables where relname = 'foo';
 relname | n_tup_ins | n_tup_upd | n_tup_del
---------+-----------+-----------+-----------
 foo     |         1 |         1 |         1
(1 row)

matthew=# begin;
BEGIN
matthew=# INSERT INTO foo (name) VALUES ('asdf');
INSERT 17076 1
matthew=# UPDATE foo SET name='qwert';
UPDATE 1
matthew=# DELETE FROM foo;
DELETE 1
matthew=# rollback;
ROLLBACK
matthew=# select relname, n_tup_ins, n_tup_upd, n_tup_del from pg_stat_all_tables where relname = 'foo';
 relname | n_tup_ins | n_tup_upd | n_tup_del
---------+-----------+-----------+-----------
 foo     |         2 |         2 |         2
(1 row)
> I have been doing some poking around with this item, and I was planning on
> using the stats collector to do "intelligent" auto-vacuuming. I was
> planning on adding some new columns that account for activity that has
> taken place since the last vacuum. The current stats collector shows
> n_tup_ins, n_tup_upd and n_tup_del for any given rel, but those numbers
> have nothing to do with what has happened since the last vacuum, hence
> nothing to do with current status or need for vacuum.

Postgres 7.3-beta has a new function 'pg_stat_reset()' that you can call to reset the stats collector after a vacuum...

Chris
On Tuesday 03 September 2002 23:47, Christopher Kings-Lynne wrote:

> Postgres 7.3-beta has a new function 'pg_stat_reset()' that you can call
> to reset the stats collector after a vacuum...

Just my opinion here, but I don't think having autovac constantly resetting the stats is a good idea. It means that you lose the current stats functionality when using autovacuum, and it also implies that the stats mean different things depending on whether autovac is turned on or off.
On Wednesday, 4 September 2002 at 05:44, Matthew T. OConnor wrote:

> I have been doing some poking around with this item, and I was planning on
> using the stats collector to do "intelligent" auto-vacuuming. I was
> planning on adding some new columns that account for activity that has
> taken place since the last vacuum. The current stats collector shows
> n_tup_ins, n_tup_upd and n_tup_del for any given rel, but those numbers
> have nothing to do with what has happened since the last vacuum, hence
> nothing to do with current status or need for vacuum.

This should be no real problem; extending pg_stat_all_tables with three fields "av_n_tup_ins", "av_n_tup_upd", "av_n_tup_del" should do it, IMO.
How about counting the number of dead tuples examined and the number of live tuples returned? As the ratio of dead tuples to live tuples visited increases, the table becomes a candidate for vacuuming.

-regards
richt

> -----Original Message-----
> From: Tom Lane
> Sent: Tuesday, September 03, 2002 4:25 PM
> To: Mario Weilguni
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] possible vacuum improvement?
>
> "Mario Weilguni" <mario.weilguni@icomedias.com> writes:
> > That brings me to another point: can't the statistics collector be used
> > for this?
>
> Hmm, that would be a different way of attacking the problem. Not sure
> offhand which is better, but it'd surely be worth considering both.
>
> Note that collecting of dead-tuple counts requires input from aborted
> transactions as well as successful ones. I don't recall whether the
> stats collector currently collects anything from aborted xacts; that
> might or might not be a sticky point.
>
> regards, tom lane
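[The dead/live ratio heuristic proposed above is easy to sketch. A minimal illustration, assuming a hypothetical `vacuum_candidates` helper and an arbitrary 0.2 cutoff — the threshold and all names are made up:]

```python
# Rank tables as vacuum candidates once the ratio of dead tuples examined
# to live tuples returned by scans crosses a threshold.

def vacuum_candidates(scan_stats, ratio_threshold=0.2):
    """scan_stats maps table -> (dead_tuples_examined, live_tuples_returned).
    Returns candidate tables, worst ratio first."""
    scored = []
    for table, (dead, live) in scan_stats.items():
        ratio = dead / max(live, 1)      # avoid division by zero
        if ratio > ratio_threshold:
            scored.append((ratio, table))
    return [table for _, table in sorted(scored, reverse=True)]

stats = {
    "accounts": (5000, 10000),   # heavily updated: ratio 0.5
    "log":      (10, 100000),    # insert-only: ratio ~0.0001
    "branches": (300, 1000),     # ratio 0.3
}
print(vacuum_candidates(stats))   # ['accounts', 'branches']
```

Note how this naturally skips the insert-only log table from earlier in the thread: its scans see almost no dead tuples, so it never becomes a candidate.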