Thread: Re: [GENERAL] Autovacuum Improvements
On Fri, 2006-12-29 at 20:25 -0300, Alvaro Herrera wrote: > Christopher Browne wrote: > > > Seems to me that you could get ~80% of the way by having the simplest > > "2 queue" implementation, where tables with size < some threshold get > > thrown at the "little table" queue, and tables above that size go to > > the "big table" queue. > > > > That should keep any small tables from getting "vacuum-starved." > > Hmm, would it make sense to keep 2 queues, one that goes through the > tables in smaller-to-larger order, and the other one in the reverse > direction? > > I am currently writing a design on how to create "vacuum queues" but I'm > thinking that maybe it's getting too complex to handle, and a simple > idea like yours is enough (given sufficient polish). Sounds good to me. My colleague Pavan has just suggested multiple autovacuums and then prototyped something almost as a side issue while trying to solve other problems. I'll show him this entry, maybe he saw it already? I wasn't following this discussion until now. The 2 queue implementation seemed to me to be the most straightforward implementation, mirroring Chris' suggestion. A few aspects that haven't been mentioned are: - if you have more than one VACUUM running, we'll need to watch memory management. Having different queues based upon table size is a good way of doing that, since the smaller queues have a naturally limited memory consumption. - with different size-based queues, the larger VACUUMs can be delayed so they take much longer, while the small tables can go straight through Some feedback from initial testing is that 2 queues probably isn't enough. If you have tables with 100s of blocks and tables with millions of blocks, the tables in the mid-range still lose out. So I'm thinking that a design with 3 queues based upon size ranges, plus the idea that when a queue is empty it will scan for tables slightly above/below its normal range. That way we wouldn't need to specify the cut-offs with a difficult to understand new set of GUC parameters, define them exactly and then have them be wrong when databases grow. The largest queue would be the one reserved for Xid wraparound avoidance. No table would be eligible for more than one queue at a time, though it might change between queues as it grows. Alvaro, have you completed your design? Pavan, what are your thoughts? -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
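To make the size-banded queue idea above concrete, here is a minimal sketch; the band boundaries, queue names and example table sizes are invented for illustration and are not part of any proposal or patch in this thread.

    # Rough sketch of size-banded vacuum queues.  Band boundaries (in blocks),
    # queue names and table sizes are all made up for illustration.
    from collections import defaultdict

    QUEUE_BANDS = [            # (name, upper bound in blocks); None = unbounded
        ("small",  1_000),
        ("medium", 1_000_000),
        ("large",  None),      # band that would also handle Xid wraparound work
    ]

    def assign_queue(table_blocks):
        """Place a table in exactly one queue, based on its current size."""
        for name, upper in QUEUE_BANDS:
            if upper is None or table_blocks < upper:
                return name

    def build_queues(tables):
        """tables: dict of relname -> size in blocks."""
        queues = defaultdict(list)
        for relname, blocks in tables.items():
            queues[assign_queue(blocks)].append(relname)
        return queues

    if __name__ == "__main__":
        print(build_queues({"orders": 120, "order_line": 40_000, "audit": 5_000_000}))

A table is only ever in one queue, but it lands in a higher band the next time the queues are rebuilt once it has grown past a boundary, which matches the "it might change between queues as it grows" behaviour described above.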
Simon Riggs wrote: > Some feedback from initial testing is that 2 queues probably isn't > enough. If you have tables with 100s of blocks and tables with millions > of blocks, the tables in the mid-range still lose out. So I'm thinking > that a design with 3 queues based upon size ranges, plus the idea that > when a queue is empty it will scan for tables slightly above/below its > normal range. Yeah, eventually it occurred to me that as soon as you have 2 queues, you may as well want to have 3 or in fact any number. Which in my proposal is very easily achieved. > Alvaro, have you completed your design? No, I haven't, and the part that's missing is precisely the queues stuff. I think I've been delaying posting it for too long, and that is harmful because it makes other people waste time thinking on issues that I may already have resolved, and delays the bashing that yet others will surely inflict on my proposal, which is never a good thing ;-) So maybe I'll put in a stub about the "queues" stuff and see how people like the whole thing. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Fri, 2007-01-12 at 19:33 -0300, Alvaro Herrera wrote: > > Alvaro, have you completed your design? > > No, I haven't, and the part that's missing is precisely the queues > stuff. I think I've been delaying posting it for too long, and that is > harmful because it makes other people waste time thinking on issues that > I may already have resolved, and delays the bashing that yet others will > surely inflict on my proposal, which is never a good thing ;-) So maybe > I'll put in a stub about the "queues" stuff and see how people like the > whole thing. I've not read a word spoken against the general idea, so I think we should pursue this actively for 8.3. It should be straightforward to harvest the good ideas, though there will definitely be many. Perhaps we should focus on the issues that might result, so that we address those before we spend time on the details of the user interface. Can we deadlock or hang from running multiple autovacuums? -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
In an attempt to throw the authorities off his trail, alvherre@commandprompt.com (Alvaro Herrera) transmitted: > Simon Riggs wrote: > >> Some feedback from initial testing is that 2 queues probably isn't >> enough. If you have tables with 100s of blocks and tables with >> millions of blocks, the tables in the mid-range still lose out. So >> I'm thinking that a design with 3 queues based upon size ranges, >> plus the idea that when a queue is empty it will scan for tables >> slightly above/below its normal range. > > Yeah, eventually it occurred to me the fact that as soon as you have > 2 queues, you may as well want to have 3 or in fact any number. > Which in my proposal is very easily achieved. Adding an extra attribute to reflect a different ordering or a different policy allows having as many queues in one queue table as you might need. >> Alvaro, have you completed your design? > > No, I haven't, and the part that's missing is precisely the queues > stuff. I think I've been delaying posting it for too long, and that > is harmful because it makes other people waste time thinking on > issues that I may already have resolved, and delays the bashing that > yet others will surely inflict on my proposal, which is never a good > thing ;-) So maybe I'll put in a stub about the "queues" stuff and > see how people like the whole thing. Seems like a good idea to me. Implementing multiple queues amounts to having different worker processes/threads that operate on the queue table using varying policies. -- output = reverse("gro.mca" "@" "enworbbc") http://linuxdatabases.info/info/lisp.html Rules of the Evil Overlord #60. "My five-year-old child advisor will also be asked to decipher any code I am thinking of using. If he breaks the code in under 30 seconds, it will not be used. Note: this also applies to passwords." <http://www.eviloverlord.com/>
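A rough sketch of the "one queue table, many policies" idea above: each entry carries an extra policy attribute, and each worker applies its own ordering to the entries it serves. The field names, policies and worker logic below are illustrative assumptions, not anything specified in the thread.

    # One shared "queue table" whose entries carry a policy attribute;
    # each worker serves one policy with its own ordering.
    from dataclasses import dataclass

    @dataclass
    class QueueEntry:
        relname: str
        blocks: int
        policy: str            # e.g. "small-first", "large-first", "wraparound"

    def next_for_worker(queue, policy):
        """Pick the next entry for a worker serving the given policy."""
        candidates = [e for e in queue if e.policy == policy]
        if not candidates:
            return None
        if policy == "small-first":
            return min(candidates, key=lambda e: e.blocks)
        if policy == "large-first":
            return max(candidates, key=lambda e: e.blocks)
        return candidates[0]   # anything else: plain FIFO

    queue = [
        QueueEntry("orders", 120, "small-first"),
        QueueEntry("audit", 5_000_000, "large-first"),
        QueueEntry("order_line", 40_000, "small-first"),
    ]
    print(next_for_worker(queue, "small-first").relname)   # -> orders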
On Fri, Jan 12, 2007 at 07:33:05PM -0300, Alvaro Herrera wrote: > Simon Riggs wrote: > > > Some feedback from initial testing is that 2 queues probably isn't > > enough. If you have tables with 100s of blocks and tables with millions > > of blocks, the tables in the mid-range still lose out. So I'm thinking > > that a design with 3 queues based upon size ranges, plus the idea that > > when a queue is empty it will scan for tables slightly above/below its > > normal range. > > Yeah, eventually it occurred to me the fact that as soon as you have 2 > queues, you may as well want to have 3 or in fact any number. Which in > my proposal is very easily achieved. > > > > Alvaro, have you completed your design? > > No, I haven't, and the part that's missing is precisely the queues > stuff. I think I've been delaying posting it for too long, and that is > harmful because it makes other people waste time thinking on issues that > I may already have resolved, and delays the bashing that yet others will > surely inflict on my proposal, which is never a good thing ;-) So maybe > I'll put in a stub about the "queues" stuff and see how people like the > whole thing. Have you made any consideration of providing feedback on autovacuum to users? Right now we don't even know what tables were vacuumed when and what was reaped. This might actually be another topic. ---elein elein@varlena.com
Simon Riggs wrote: > On Fri, 2006-12-29 at 20:25 -0300, Alvaro Herrera wrote: >> Christopher Browne wrote: >> >>> Seems to me that you could get ~80% of the way by having the simplest >>> "2 queue" implementation, where tables with size < some threshold get >>> thrown at the "little table" queue, and tables above that size go to >>> the "big table" queue. >>> >>> That should keep any small tables from getting "vacuum-starved." > This is exactly what I am trying: two-process autovacuum and a GUC to separate small tables. In this case, one process takes up vacuuming of the small tables and the other process vacuuming of the remaining tables as well as Xid avoidance related vacuuming. The goal is to avoid starvation of small tables when a large table is being vacuumed (which may take several hours) without adding too much complexity to the code. > > Some feedback from initial testing is that 2 queues probably isn't > enough. If you have tables with 100s of blocks and tables with millions > of blocks, the tables in the mid-range still lose out. So I'm thinking > that a design with 3 queues based upon size ranges, plus the idea that > when a queue is empty it will scan for tables slightly above/below its > normal range. That way we wouldn't need to specify the cut-offs with a > difficult to understand new set of GUC parameters, define them exactly > and then have them be wrong when databases grow. > > The largest queue would be the one reserved for Xid wraparound > avoidance. No table would be eligible for more than one queue at a time, > though it might change between queues as it grows. > > Alvaro, have you completed your design? > > Pavan, what are your thoughts? > IMO 2-queue is a good step forward, but in the long term we may need to go for a multiprocess autovacuum where the number and tasks of processes are either demand-based and/or user-configurable. Another idea is to vacuum the tables in round-robin fashion where the quantum could be either "time" or "number of blocks". The autovacuum process would vacuum 'x' blocks of one table and then schedule the next table in the queue. This would avoid starvation of small tables, though the cost of index cleanup might go up because of increased IO. Any thoughts on this approach? Thanks, Pavan
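As an illustration of the round-robin alternative Pavan sketches (a fixed quantum of blocks per table per turn), here is a toy version; the quantum, the progress bookkeeping and the vacuum_blocks callback are placeholders, not a description of the actual prototype.

    # Toy round-robin vacuum: process a fixed quantum of blocks from one
    # table, remember how far we got, then move on to the next table.
    from collections import deque

    def round_robin_vacuum(table_sizes, quantum, vacuum_blocks):
        """table_sizes: dict of relname -> total blocks."""
        progress = {t: 0 for t in table_sizes}
        pending = deque(table_sizes)
        while pending:
            relname = pending.popleft()
            start = progress[relname]
            count = min(quantum, table_sizes[relname] - start)
            vacuum_blocks(relname, start, count)
            progress[relname] = start + count
            if progress[relname] < table_sizes[relname]:
                pending.append(relname)     # not done yet; gets another turn later

    round_robin_vacuum({"small": 50, "big": 220}, quantum=100,
                       vacuum_blocks=lambda t, s, n: print(f"{t}: blocks {s}..{s + n - 1}"))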
Pavan Deolasee wrote: > Simon Riggs wrote: >> On Fri, 2006-12-29 at 20:25 -0300, Alvaro Herrera wrote: >>> Christopher Browne wrote: >>> >>>> Seems to me that you could get ~80% of the way by having the simplest >>>> "2 queue" implementation, where tables with size < some threshold get >>>> thrown at the "little table" queue, and tables above that size go to >>>> the "big table" queue. >>>> >>>> That should keep any small tables from getting "vacuum-starved." >> > > This is exactly what I am trying: two-process autovacuum and a GUC to > separate small tables. > > In this case, one process takes up vacuuming of the small tables and > the other process vacuuming of the remaining tables as well as Xid > avoidance related vacuuming. The goal is to avoid starvation of small > tables when a large table is being vacuumed (which may take > several hours) without adding too much complexity to the code. Would it work to make the queues push the threshold in the direction of the still-running queue if the other queue finishes before the still-running one? This would achieve some kind of auto-tuning, but that is usually tricky. For example, what if one of the queues got stuck on a lock? -- Alban Hertroys alban@magproductions.nl magproductions b.v. T: ++31(0)534346874 F: ++31(0)534346876 M: I: www.magproductions.nl A: Postbus 416 7500 AK Enschede // Integrate Your World //
Simon Riggs wrote: > Perhaps we should focus on the issues that might result, so that we > address those before we spend time on the details of the user interface. > Can we deadlock or hang from running multiple autovacuums? If you were to run multiple autovacuum processes the way they are today, maybe. But that's not my intention -- the launcher would be the only one to read the catalogs; the workers would be started only to do a single VACUUM job. This reduces the risk of this kind of problem. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
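A toy sketch of the launcher/worker split Alvaro describes, with the launcher alone deciding what needs work and each worker living only for a single job; the process model and function names are illustrative, not the actual autovacuum code.

    # Launcher/worker split: the launcher alone reads the (mock) catalogs
    # and decides what needs vacuuming; each worker does exactly one job.
    from multiprocessing import Process

    def do_vacuum(relname):
        # A real worker would connect and run a single VACUUM here.
        print(f"worker: vacuuming {relname}")

    def launcher(read_catalogs, max_workers=3):
        jobs = read_catalogs()
        workers = [Process(target=do_vacuum, args=(relname,))
                   for relname in jobs[:max_workers]]
        for p in workers:
            p.start()
        for p in workers:
            p.join()

    if __name__ == "__main__":
        launcher(lambda: ["orders", "order_line", "audit"])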
elein wrote: > Have you made any consideration of providing feedback on autovacuum to users? > Right now we don't even know what tables were vacuumed when and what was > reaped. This might actually be another topic. I'd like to hear other people's opinions on Darcy Buskermolen proposal to have a log table, on which we'd register what did we run, at what time, how long did it last, how many tuples did it clean, etc. I feel having it on the regular text log is useful but it's not good enough. Keep in mind that in the future we may want to peek at that collected information to be able to take better scheduling decisions (or at least inform the DBA that he sucks). Now, I'd like this to be a VACUUM thing, not autovacuum. That means that manually-run vacuums would be logged as well. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
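For illustration, a guess at the kind of insert-only record such a log table might hold; the field names are assumptions based on the items mentioned above (what ran, when, how long, how many tuples), not a worked-out schema.

    # A guess at an insert-only vacuum history record; field names are made up.
    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass(frozen=True)
    class VacuumLogEntry:
        relname: str
        started_at: datetime
        duration: timedelta
        tuples_removed: int
        triggered_by: str          # "autovacuum" or "manual" -- both get logged

    vacuum_history = []            # insert-only, so trends can be read back later

    def log_vacuum(entry):
        vacuum_history.append(entry)

    log_vacuum(VacuumLogEntry("orders", datetime.now(),
                              timedelta(seconds=42), 1830, "autovacuum"))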
On Tuesday 16 January 2007 06:29, Alvaro Herrera wrote: > elein wrote: > > Have you made any consideration of providing feedback on autovacuum to > > users? Right now we don't even know what tables were vacuumed when and > > what was reaped. This might actually be another topic. > > I'd like to hear other people's opinions on Darcy Buskermolen proposal > to have a log table, on which we'd register what did we run, at what > time, how long did it last, how many tuples did it clean, etc. I feel > having it on the regular text log is useful but it's not good enough. > Keep in mind that in the future we may want to peek at that collected > information to be able to take better scheduling decisions (or at least > inform the DBA that he sucks). > > Now, I'd like this to be a VACUUM thing, not autovacuum. That means > that manually-run vacuums would be logged as well. Yes I did intend this thought for vacuum, not strictly autovacuum.
Alvaro Herrera wrote: > I'd like to hear other people's opinions on Darcy Buskermolen proposal > to have a log table, on which we'd register what did we run, at what > time, how long did it last, how many tuples did it clean, etc. I feel > having it on the regular text log is useful but it's not good enough. > Keep in mind that in the future we may want to peek at that collected > information to be able to take better scheduling decisions (or at least > inform the DBA that he sucks). I'm not familiar with his proposal, but I'm not sure what I think of logging vacuum (and perhaps analyze) commands to a table. We have never logged anything to tables inside PG. I would be worried about this eating a lot of space in some situations. I think most people would just be happy if we could get autovacuum to log its actions at a much higher log level. I think that "autovacuum vacuumed table x" is important and shouldn't be all the way down at the debug level. The other (more involved) solution that was proposed for this problem was to create a separate set of logging control params for autovacuum so that you can turn it up or down independent of the general server logging. > Now, I'd like this to be a VACUUM thing, not autovacuum. That means > that manually-run vacuums would be logged as well. +1
Matthew T. O'Connor wrote: > Alvaro Herrera wrote: >> I'd like to hear other people's opinions on Darcy Buskermolen proposal >> to have a log table, on which we'd register what did we run, at what >> time, how long did it last, [...] > > I think most people would just be happy if we could get autovacuum to > log it's actions at a much higher log level. I think that "autovacuum > vacuumed table x" is important and shouldn't be all the way down at the > debug level. +1 here. Even more than "autovacuum vacuumed table x", I'd like to see "vacuum starting table x" and "vacuum done table x". The reason I say that is because the speculation "autovacuum might have been running" is now a frequent phrase I hear when performance questions are asked. If vacuum start and end times were logged at a much earlier level, that feature plus log_min_duration_statement could easily disprove the "vacuum might have been running" hypothesis.
On Tue, 2007-01-16 at 07:16 -0800, Darcy Buskermolen wrote: > On Tuesday 16 January 2007 06:29, Alvaro Herrera wrote: > > elein wrote: > > > Have you made any consideration of providing feedback on autovacuum to > > > users? Right now we don't even know what tables were vacuumed when and > > > what was reaped. This might actually be another topic. > > > > I'd like to hear other people's opinions on Darcy Buskermolen proposal > > to have a log table, on which we'd register what did we run, at what > > time, how long did it last, how many tuples did it clean, etc. I feel > > having it on the regular text log is useful but it's not good enough. > > Keep in mind that in the future we may want to peek at that collected > > information to be able to take better scheduling decisions (or at least > > inform the DBA that he sucks). > > > > Now, I'd like this to be a VACUUM thing, not autovacuum. That means > > that manually-run vacuums would be logged as well. > > Yes I did intend this thought for vacuum, not strictly autovacuum. I agree, for all VACUUMs: we need a log table. The only way we can get a feedback loop on what has come before is by remembering what happened. Simply logging it is interesting, but not enough. There is some complexity there, because with many applications a small table gets VACUUMed every few minutes, so the log table would become a frequently updated table itself. I'd also suggest that we might want to take account of the number of tuples removed by btree pre-split VACUUMs also. I also like the idea of a single scheduler and multiple child workers. The basic architecture is clear and obviously beneficial. What worries me is how the scheduler will work; there seems to be as many ideas as we have hackers. I'm wondering if we should provide the facility of a pluggable scheduler? That way you'd be able to fine tune the schedule to both the application and to the business requirements. That would allow integration with external workflow engines and job schedulers, for when VACUUMs need to not-conflict with external events. If no scheduler has been defined, just use a fairly simple default. The three main questions are - what is the maximum size of VACUUM that can start *now* - can *this* VACUUM start now? - which is the next VACUUM to run? If we have an API that allows those 3 questions to be asked, then a scheduler plug-in could supply the answers. That way any complex application rules (table A is available for VACUUM now for next 60 mins, table B is in constant use so we must use vacuum_delay), external events (long running reports have now finished, OK to VACUUM), time-based rules (e.g. first Sunday of the month 00:00 - 04:00 is scheduled downtime, first 3 days of the each month is financial accounting close) can be specified. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
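A sketch of what the three-question scheduler API could look like as an interface plus a "fairly simple default"; the class and method names are invented here, since the message above only lists the questions.

    # Pluggable scheduler answering the three questions above.
    from abc import ABC, abstractmethod

    class VacuumScheduler(ABC):
        @abstractmethod
        def max_vacuum_size_now(self):
            """Largest table (in blocks) for which a VACUUM may start *now*."""

        @abstractmethod
        def can_start_now(self, relname, blocks):
            """May *this* VACUUM start now?"""

        @abstractmethod
        def next_vacuum(self, pending):
            """Which of the pending (relname, blocks) jobs runs next?"""

    class DefaultScheduler(VacuumScheduler):
        """Fairly simple default: anything may run, smallest table first."""
        def max_vacuum_size_now(self):
            return float("inf")
        def can_start_now(self, relname, blocks):
            return True
        def next_vacuum(self, pending):
            return min(pending, key=lambda job: job[1]) if pending else None

    sched = DefaultScheduler()
    print(sched.next_vacuum([("audit", 5_000_000), ("orders", 120)]))  # ('orders', 120)

A time-window or business-rule scheduler would then just be a different implementation of the same three methods.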
On Friday 19 January 2007 01:47, Simon Riggs wrote: > On Tue, 2007-01-16 at 07:16 -0800, Darcy Buskermolen wrote: > > On Tuesday 16 January 2007 06:29, Alvaro Herrera wrote: > > > elein wrote: > > > > Have you made any consideration of providing feedback on autovacuum > > > > to users? Right now we don't even know what tables were vacuumed when > > > > and what was reaped. This might actually be another topic. > > > > > > I'd like to hear other people's opinions on Darcy Buskermolen proposal > > > to have a log table, on which we'd register what did we run, at what > > > time, how long did it last, how many tuples did it clean, etc. I feel > > > having it on the regular text log is useful but it's not good enough. > > > Keep in mind that in the future we may want to peek at that collected > > > information to be able to take better scheduling decisions (or at least > > > inform the DBA that he sucks). > > > > > > Now, I'd like this to be a VACUUM thing, not autovacuum. That means > > > that manually-run vacuums would be logged as well. > > > > Yes I did intend this thought for vacuum, not strictly autovacuum. > > I agree, for all VACUUMs: we need a log table. > > The only way we can get a feedback loop on what has come before is by > remembering what happened. Simply logging it is interesting, but not > enough. Correct, I think we are all saying the same thing: this log table is purely inserts, so that we can see trends over time. > > There is some complexity there, because with many applications a small > table gets VACUUMed every few minutes, so the log table would become a > frequently updated table itself. I'd also suggest that we might want to > take account of the number of tuples removed by btree pre-split VACUUMs > also. Thinking on this a bit more, I suppose that this table really should allow for user-defined triggers on it, so that a DBA can create partitioning for it, not to mention being able to move it off into its own tablespace. > > > > I also like the idea of a single scheduler and multiple child workers. > > The basic architecture is clear and obviously beneficial. What worries > me is how the scheduler will work; there seems to be as many ideas as we > have hackers. I'm wondering if we should provide the facility of a > pluggable scheduler? That way you'd be able to fine tune the schedule to > both the application and to the business requirements. That would allow > integration with external workflow engines and job schedulers, for when > VACUUMs need to not-conflict with external events. > > If no scheduler has been defined, just use a fairly simple default. > > The three main questions are > - what is the maximum size of VACUUM that can start *now* How can we determine this given we have no real knowledge of the upcoming adverse IO conditions? > - can *this* VACUUM start now? > - which is the next VACUUM to run? > > If we have an API that allows those 3 questions to be asked, then a > scheduler plug-in could supply the answers. That way any complex > application rules (table A is available for VACUUM now for next 60 mins, > table B is in constant use so we must use vacuum_delay), external events > (long running reports have now finished, OK to VACUUM), time-based rules > (e.g. first Sunday of the month 00:00 - 04:00 is scheduled downtime, > first 3 days of the each month is financial accounting close) can be > specified. Another thought, is it at all possible to do a partial vacuum? ie spend the next 30 minutes vacuuming foo table, and update the fsm with what we have learned over the 30 mins, even if we have not done a full table scan? -- Darcy Buskermolen The PostgreSQL company, Command Prompt Inc.
Added to TODO: > o Allow multiple vacuums so large tables do not starve small > tables > > http://archives.postgresql.org/pgsql-general/2007-01/msg00031.php > > o Improve control of auto-vacuum > > http://archives.postgresql.org/pgsql-hackers/2006-12/msg00876.php --------------------------------------------------------------------------- Darcy Buskermolen wrote: > On Friday 19 January 2007 01:47, Simon Riggs wrote: > > On Tue, 2007-01-16 at 07:16 -0800, Darcy Buskermolen wrote: > > > On Tuesday 16 January 2007 06:29, Alvaro Herrera wrote: > > > > elein wrote: > > > > > Have you made any consideration of providing feedback on autovacuum > > > > > to users? Right now we don't even know what tables were vacuumed when > > > > > and what was reaped. This might actually be another topic. > > > > > > > > I'd like to hear other people's opinions on Darcy Buskermolen proposal > > > > to have a log table, on which we'd register what did we run, at what > > > > time, how long did it last, how many tuples did it clean, etc. I feel > > > > having it on the regular text log is useful but it's not good enough. > > > > Keep in mind that in the future we may want to peek at that collected > > > > information to be able to take better scheduling decisions (or at least > > > > inform the DBA that he sucks). > > > > > > > > Now, I'd like this to be a VACUUM thing, not autovacuum. That means > > > > that manually-run vacuums would be logged as well. > > > > > > Yes I did intend this thought for vacuum, not strictly autovacuum. > > > > I agree, for all VACUUMs: we need a log table. > > > > The only way we can get a feedback loop on what has come before is by > > remembering what happened. Simply logging it is interesting, but not > > enough. > > Correct, I think we are all saying the same thing that is this log table is > purely inserts so that we can see trends over time. > > > > > There is some complexity there, because with many applications a small > > table gets VACUUMed every few minutes, so the log table would become a > > frequently updated table itself. I'd also suggest that we might want to > > take account of the number of tuples removed by btree pre-split VACUUMs > > also. > > Thinking on this a bit more, I suppose that this table really should allow for > user defined triggers on it, so that a DBA can create partioning for it, not > to mention being able to move it off into it's own tablespace. > > > > > > I also like the idea of a single scheduler and multiple child workers. > > > > The basic architecture is clear and obviously beneficial. What worries > > me is how the scheduler will work; there seems to be as many ideas as we > > have hackers. I'm wondering if we should provide the facility of a > > pluggable scheduler? That way you'd be able to fine tune the schedule to > > both the application and to the business requirements. That would allow > > integration with external workflow engines and job schedulers, for when > > VACUUMs need to not-conflict with external events. > > > > If no scheduler has been defined, just use a fairly simple default. > > > > The three main questions are > > - what is the maximum size of VACUUM that can start *now* > > How can we determine this given we have no real knowledge of the upcoming > adverse IO conditions ? > > > - can *this* VACUUM start now? > > - which is the next VACUUM to run? > > > > If we have an API that allows those 3 questions to be asked, then a > > scheduler plug-in could supply the answers. 
That way any complex > > application rules (table A is available for VACUUM now for next 60 mins, > > table B is in constant use so we must use vacuum_delay), external events > > (long running reports have now finished, OK to VACUUM), time-based rules > > (e.g. first Sunday of the month 00:00 - 04:00 is scheduled downtime, > > first 3 days of the each month is financial accounting close) can be > > specified. > > Another thought, is it at all possible to do a partial vacuum? ie spend the > next 30 minutes vacuuming foo table, and update the fsm with what hew have > learned over the 30 mins, even if we have not done a full table scan ? > > > -- > > > Darcy Buskermolen > The PostgreSQL company, Command Prompt Inc. > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Darcy Buskermolen wrote:
> [snip]
> Another thought, is it at all possible to do a partial vacuum? ie spend the next 30 minutes vacuuming foo table, and update the fsm with what we have learned over the 30 mins, even if we have not done a full table scan?
There was a proposal for this, but it was dropped on 2 grounds.
1. partial vacuum would mean that parts of the table are missed, the user could never vacuum certain parts and transaction wraparound would get you. You may also have other performance issues as you forgot certain parts of the table.
2. Index cleanup is the most expensive part of vacuum. So doing a partial vacuum actually means more I/O as you have to do index cleanup more often.
If we are talking about autovacuum, 1 doesn't become so much of an issue as you just make the autovacuum remember what parts of the table it's vacuumed. This really has great power when you have a dead space map.
Item 2 will still be an issue. But if you define "partial" as either fill maintenance_work_mem, or finish the table, you are not increasing I/O at all. As when maintenance work mem is full, you have to clean up all the indexes anyway. This is really more like VACUUM SINGLE, but the same principle applies.
I believe all planning really needs to think about how a dead space map will affect what vacuum is going to be doing in the future.
Strange idea that I haven't researched: given VACUUM can't be run in a transaction, it is possible at a certain point to quit the current transaction and start another one. There has been much chat and now a TODO item about allowing multiple vacuums to not starve small tables. But if a big table has a long-running vacuum, the vacuum of the small table won't be effective anyway, will it? If vacuum of a big table was done in multiple transactions you could reduce the effect of a long-running vacuum. I'm not sure how this affects the rest of the system though.
Russell Smith
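A toy illustration of the "fill maintenance_work_mem or finish the table" definition of partial vacuum above: collect dead tuples until the buffer fills or the table ends, then do the index cleanup that would have been needed at that point anyway. All names and callbacks are placeholders.

    # "Partial" vacuum bounded by a dead-tuple buffer (stand-in for
    # maintenance_work_mem).  scan_page() and cleanup_indexes() are fakes.
    def partial_vacuum(total_blocks, start_block, max_dead_tuples,
                       scan_page, cleanup_indexes):
        dead = []
        block = start_block
        while block < total_blocks and len(dead) < max_dead_tuples:
            dead.extend(scan_page(block))      # dead tuple ids found on this page
            block += 1
        if dead:
            cleanup_indexes(dead)              # would have been needed anyway once full
        return block                           # resume point for the next round

    next_start = partial_vacuum(
        total_blocks=1000, start_block=0, max_dead_tuples=200,
        scan_page=lambda b: [(b, 1)] if b % 3 == 0 else [],
        cleanup_indexes=lambda d: print(f"index cleanup for {len(d)} dead tuples"))
    print("resume at block", next_start)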
Russell Smith wrote: > Strange idea that I haven't researched, Given Vacuum can't be run in a > transaction, it is possible at a certain point to quit the current > transaction and start another one. There has been much chat and now a > TODO item about allowing multiple vacuums to not starve small tables. > But if a big table has a long running vacuum the vacuum of the small > table won't be effective anyway will it? If vacuum of a big table was > done in multiple transactions you could reduce the effect of long > running vacuum. I'm not sure how this effects the rest of the system > thought. That was fixed by Hannu Krosing's patch in 8.2 that made vacuum ignore other vacuums in the oldest xmin calculation. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Sat, 2007-01-20 at 09:41 +1100, Russell Smith wrote: > Darcy Buskermolen wrote: > > [snip] > > > > Another thought, is it at all possible to do a partial vacuum? ie spend the > > next 30 minutes vacuuming foo table, and update the fsm with what hew have > > learned over the 30 mins, even if we have not done a full table scan ? > > > There was a proposal for this, but it was dropped on 2 grounds. > 1. partial vacuum would mean that parts of the table are missed, the > user could never vacuum certain parts and transaction wraparound would > get you. You may also have other performance issues as you forgot > certian parts of the table Partial vacuum would still be possible if you remembered where you got to in the VACUUM and then started from that same point next time. It could then go to the end of the table and wrap back around. > 2. Index cleanup is the most expensive part of vacuum. So doing a > partial vacuum actually means more I/O as you have to do index cleanup > more often. Again, not necessarily. A large VACUUM can currently perform more than one set of index scans, so if you chose the right stopping place for a partial VACUUM you need never incur any additional work. It might even save effort in the long run. I'm not necessarily advocating partial VACUUM, just pointing out that the problems you raise need not be barriers to implementation, should that be considered worthwhile. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
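A tiny, purely illustrative sketch of the "remember where you got to and wrap back around" idea above.

    # Resume from a remembered block and wrap past the end of the table.
    def blocks_to_visit(total_blocks, resume_at, budget):
        for i in range(min(budget, total_blocks)):
            yield (resume_at + i) % total_blocks

    print(list(blocks_to_visit(total_blocks=10, resume_at=7, budget=5)))
    # -> [7, 8, 9, 0, 1]; the next pass would resume at block 2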
On Sun, Jan 21, 2007 at 12:24:38PM +0000, Simon Riggs wrote: > Partial vacuum would still be possible if you remembered where you got > to in the VACUUM and then started from that same point next time. It > could then go to the end of the table and wrap back around. ISTM the Dead Space Map would give you this automatically, since that's your memory... Once you have the DSM to track where the dead pages are, you can set it up to target clusters first, thus giving maximum bang for buck. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
On Sun, Jan 21, 2007 at 11:39:45AM +0000, Heikki Linnakangas wrote: > Russell Smith wrote: > >Strange idea that I haven't researched, Given Vacuum can't be run in a > >transaction, it is possible at a certain point to quit the current > >transaction and start another one. There has been much chat and now a > >TODO item about allowing multiple vacuums to not starve small tables. > >But if a big table has a long running vacuum the vacuum of the small > >table won't be effective anyway will it? If vacuum of a big table was > >done in multiple transactions you could reduce the effect of long > >running vacuum. I'm not sure how this effects the rest of the system > >thought. > > That was fixed by Hannu Krosing's patch in 8.2 that made vacuum to > ignore other vacuums in the oldest xmin calculation. And IIRC in 8.1 every time vacuum finishes a pass over the indexes it will commit and start a new transaction. That's still useful even with Hannu's patch in case you start a vacuum with maintenance_work_mem too small; you can abort the vacuum some time later and at least some of the work it's done will get committed. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Sun, 2007-01-21 at 14:26 -0600, Jim C. Nasby wrote: > On Sun, Jan 21, 2007 at 11:39:45AM +0000, Heikki Linnakangas wrote: > > Russell Smith wrote: > > >Strange idea that I haven't researched, Given Vacuum can't be run in a > > >transaction, it is possible at a certain point to quit the current > > >transaction and start another one. There has been much chat and now a > > >TODO item about allowing multiple vacuums to not starve small tables. > > >But if a big table has a long running vacuum the vacuum of the small > > >table won't be effective anyway will it? If vacuum of a big table was > > >done in multiple transactions you could reduce the effect of long > > >running vacuum. I'm not sure how this effects the rest of the system > > >thought. > > > > That was fixed by Hannu Krosing's patch in 8.2 that made vacuum to > > ignore other vacuums in the oldest xmin calculation. > > And IIRC in 8.1 every time vacuum finishes a pass over the indexes it > will commit and start a new transaction. err...It doesn't do this now and IIRC didn't do that in 8.1 either. > That's still useful even with > Hannu's patch in case you start a vacuum with maintenance_work_mem too > small; you can abort the vacuum some time later and at least some of the > work it's done will get committed. True, but not recommended, though for a variety of reasons. The reason is not intermediate commits, but just that the work of VACUUM is mostly non-transactional in nature, apart from the various catalog entries when it completes. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Russell Smith wrote: > 2. Index cleanup is the most expensive part of vacuum. So doing a > partial vacuum actually means more I/O as you have to do index cleanup > more often. I don't think that's usually the case. Index(es) are typically only a fraction of the size of the table, and since 8.2 we do index vacuums in a single scan in physical order. In fact, in many applications the index is mostly cached and the index scan doesn't generate any I/O at all. I believe the heap scans are the biggest issue at the moment. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Russell Smith wrote: > > 2. Index cleanup is the most expensive part of vacuum. So doing a > > partial vacuum actually means more I/O as you have to do index cleanup > > more often. > > I don't think that's usually the case. Index(es) are typically only a > fraction of the size of the table, and since 8.2 we do index vacuums in > a single scan in physical order. In fact, in many applications the index > is be mostly cached and the index scan doesn't generate any I/O at all. Are _all_ the indexes cached? I would doubt that. Also, for a typical table, what percentage is the size of all indexes combined? -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Heikki Linnakangas wrote: >> Russell Smith wrote: >>> 2. Index cleanup is the most expensive part of vacuum. So doing a >>> partial vacuum actually means more I/O as you have to do index cleanup >>> more often. >> I don't think that's usually the case. Index(es) are typically only a >> fraction of the size of the table, and since 8.2 we do index vacuums in >> a single scan in physical order. In fact, in many applications the index >> is be mostly cached and the index scan doesn't generate any I/O at all. > > Are _all_ the indexes cached? I would doubt that. Well, depends on your schema, of course. In many applications, yes. > Also, for typical > table, what percentage is the size of all indexes combined? Well, there's no such thing as a typical table. As an anecdote, here are the ratios (total size of all indexes of a table)/(size of corresponding heap) for the bigger tables for a DBT-2 run I have at hand:
Stock: 1190470/68550 = 6%
Order_line: 950103/274372 = 29%
Customer: 629011 /(5711+20567) = 8%
In any case, for the statement "Index cleanup is the most expensive part of vacuum" to be true, your indexes would have to take up 2x as much space as the heap, since the heap is scanned twice. I'm sure there are databases like that out there, but I don't think it's the common case. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Bruce Momjian wrote: > > Heikki Linnakangas wrote: > >> Russell Smith wrote: > >>> 2. Index cleanup is the most expensive part of vacuum. So doing a > >>> partial vacuum actually means more I/O as you have to do index cleanup > >>> more often. > >> I don't think that's usually the case. Index(es) are typically only a > >> fraction of the size of the table, and since 8.2 we do index vacuums in > >> a single scan in physical order. In fact, in many applications the index > >> is be mostly cached and the index scan doesn't generate any I/O at all. > > > > Are _all_ the indexes cached? I would doubt that. > > Well, depends on your schema, of course. In many applications, yes. > > > Also, for typical > > table, what percentage is the size of all indexes combined? > > Well, there's no such thing as a typical table. As an anecdote here's > the ratios (total size of all indexes of a table)/(size of corresponding > heap) for the bigger tables for a DBT-2 run I have at hand: > > Stock: 1190470/68550 = 6% > Order_line: 950103/274372 = 29% > Customer: 629011 /(5711+20567) = 8% > > In any case, for the statement "Index cleanup is the most expensive part > of vacuum" to be true, you're indexes would have to take up 2x as much > space as the heap, since the heap is scanned twice. I'm sure there's > databases like that out there, but I don't think it's the common case. I agree it index cleanup isn't > 50% of vacuum. I was trying to figure out how small, and it seems about 15% of the total table, which means if we have bitmap vacuum, we can conceivably reduce vacuum load by perhaps 80%, assuming 5% of the table is scanned. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Mon, 2007-01-22 at 12:18 -0500, Bruce Momjian wrote: > Heikki Linnakangas wrote: > > > > In any case, for the statement "Index cleanup is the most expensive part > > of vacuum" to be true, you're indexes would have to take up 2x as much > > space as the heap, since the heap is scanned twice. I'm sure there's > > databases like that out there, but I don't think it's the common case. > > I agree it index cleanup isn't > 50% of vacuum. I was trying to figure > out how small, and it seems about 15% of the total table, which means if > we have bitmap vacuum, we can conceivably reduce vacuum load by perhaps > 80%, assuming 5% of the table is scanned. Clearly keeping track of what needs vacuuming will lead to a more efficient VACUUM. Your math applies to *any* design that uses some form of book-keeping to focus in on the hot spots. On a separate thread, Heikki has raised a different idea for VACUUM. Heikki's idea asks an important question: where and how should DSM information be maintained? Up to now everybody has assumed that it would be maintained when DML took place and that the DSM would be a transactional data structure (i.e. on-disk). Heikki's idea requires similar bookkeeping requirements to the original DSM concept, but the interesting aspect is that the DSM information is collected off-line, rather than being an overhead on every statement's response time. That idea seems extremely valuable to me. One of the main challenges is how we cope with large tables that have a very fine spray of updates against them. A DSM bitmap won't help with that situation, regrettably. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
"Bruce Momjian" <bruce@momjian.us> writes: > I agree it index cleanup isn't > 50% of vacuum. I was trying to figure > out how small, and it seems about 15% of the total table, which means if > we have bitmap vacuum, we can conceivably reduce vacuum load by perhaps > 80%, assuming 5% of the table is scanned. Actually no. A while back I did experiments to see how fast reading a file sequentially was compared to reading the same file sequentially but skipping x% of the blocks randomly. The results were surprising (to me) and depressing. The breakeven point was about 7%. That is, if you assum that only 5% of the table will be scanned and you arrange to do it sequentially then you should expect the i/o to be marginally faster than just reading the entire table. Vacuum does do some cpu work and wouldn't have to consult the clog as often, so it would still be somewhat faster. The theory online was that as long as you're reading one page from each disk track you're going to pay the same seek overhead as reading the entire track. I also had some theories involving linux being confused by the seeks and turning off read-ahead but I could never prove them. In short, to see big benefits you would have to have a much smaller percentage of the table being read. That shouldn't be taken to mean that the DSM is a loser. There are plenty of use cases where tables can be extremely large and have only very small percentages that are busy. The big advantage of the DSM is that it takes the size of the table out of the equation and replaces it with the size of the busy portion of the table. So updating a single record in a terabyte table has the same costs as updating a single record in a kilobyte table. Sadly that's not quite true due to indexes, and due to the size of the bitmap itself. But going back to your numbers it does mean that if you update a single row out of a terabyte table then we'll be removing about 85% of the i/o (minus the i/o needed to read the DSM, about .025%). If you update about 1% then you would be removing substantially less, and once you get to about 10% then you're back where you started. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Yep, agreed on the random I/O issue. The larger question is if you have a huge table, do you care to reclaim 3% of the table size, rather than just vacuum it when it gets to 10% dirty? I realize the vacuum is going to take a lot of time, but vacuuming to relaim 3% three times seems like it is going to be more expensive than just vacuuming the 10% once. And vacuuming to reclaim 1% ten times seems even more expensive. The partial vacuum idea is starting to look like a loser to me again. --------------------------------------------------------------------------- Gregory Stark wrote: > "Bruce Momjian" <bruce@momjian.us> writes: > > > I agree it index cleanup isn't > 50% of vacuum. I was trying to figure > > out how small, and it seems about 15% of the total table, which means if > > we have bitmap vacuum, we can conceivably reduce vacuum load by perhaps > > 80%, assuming 5% of the table is scanned. > > Actually no. A while back I did experiments to see how fast reading a file > sequentially was compared to reading the same file sequentially but skipping > x% of the blocks randomly. The results were surprising (to me) and depressing. > The breakeven point was about 7%. > > That is, if you assum that only 5% of the table will be scanned and you > arrange to do it sequentially then you should expect the i/o to be marginally > faster than just reading the entire table. Vacuum does do some cpu work and > wouldn't have to consult the clog as often, so it would still be somewhat > faster. > > The theory online was that as long as you're reading one page from each disk > track you're going to pay the same seek overhead as reading the entire track. > I also had some theories involving linux being confused by the seeks and > turning off read-ahead but I could never prove them. > > In short, to see big benefits you would have to have a much smaller percentage > of the table being read. That shouldn't be taken to mean that the DSM is a > loser. There are plenty of use cases where tables can be extremely large and > have only very small percentages that are busy. The big advantage of the DSM > is that it takes the size of the table out of the equation and replaces it > with the size of the busy portion of the table. So updating a single record in > a terabyte table has the same costs as updating a single record in a kilobyte > table. > > Sadly that's not quite true due to indexes, and due to the size of the bitmap > itself. But going back to your numbers it does mean that if you update a > single row out of a terabyte table then we'll be removing about 85% of the i/o > (minus the i/o needed to read the DSM, about .025%). If you update about 1% > then you would be removing substantially less, and once you get to about 10% > then you're back where you started. > > > -- > Gregory Stark > EnterpriseDB http://www.enterprisedb.com > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
"Bruce Momjian" <bruce@momjian.us> writes: > Yep, agreed on the random I/O issue. The larger question is if you have > a huge table, do you care to reclaim 3% of the table size, rather than > just vacuum it when it gets to 10% dirty? I realize the vacuum is going > to take a lot of time, but vacuuming to relaim 3% three times seems like > it is going to be more expensive than just vacuuming the 10% once. And > vacuuming to reclaim 1% ten times seems even more expensive. The > partial vacuum idea is starting to look like a loser to me again. Well the answer is of course "that depends". If you maintain the dead space at a steady state averaging 1.5% instead of 5% your table is 3.33% smaller on average. If this is a DSS system that will translate into running your queries 3.33% faster. It will take a lot of vacuums before they hurt more than a 3%+ performance drop. If it's an OLTP system the it's harder to figure. a 3.33% increase in data density will translate to a higher cache hit rate but how much higher depends on a lot of factors. In our experiments we actually got bigger boost in these kinds of situations than the I expected (I expected comparable to the 3.33% improvement). So it could be even more than 3.33%. But like said it depends. If you already have the whole database cache you won't see any improvement. If you are right on the cusp you could see a huge benefit. It sounds like you're underestimating the performance drain 10% wasted space has. If we found out that one routine was unnecessarily taking 10% of the cpu time it would be an obvious focus of attention. 10% wasted space is going to work out to about 10% of the i/o time. It also sounds like we're still focused on the performance impact in absolute terms. I'm much more interested in changing the performance characteristics so they're predictable and scalable. It doesn't matter much if your 1kb table is 100% slower than necessary but it does matter if your 1TB table needs 1,000x as much vacuuming as your 1GB table even if it's getting the same update traffic. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Mon, 2007-01-22 at 13:27 -0500, Bruce Momjian wrote: > Yep, agreed on the random I/O issue. The larger question is if you have > a huge table, do you care to reclaim 3% of the table size, rather than > just vacuum it when it gets to 10% dirty? I realize the vacuum is going > to take a lot of time, but vacuuming to relaim 3% three times seems like > it is going to be more expensive than just vacuuming the 10% once. And > vacuuming to reclaim 1% ten times seems even more expensive. The > partial vacuum idea is starting to look like a loser to me again. Hold that thought! Read Heikki's Piggyback VACUUM idea on new thread... -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Bruce Momjian wrote: > > Yep, agreed on the random I/O issue. The larger question is if you have > a huge table, do you care to reclaim 3% of the table size, rather than > just vacuum it when it gets to 10% dirty? I realize the vacuum is going > to take a lot of time, but vacuuming to relaim 3% three times seems like > it is going to be more expensive than just vacuuming the 10% once. And > vacuuming to reclaim 1% ten times seems even more expensive. The > partial vacuum idea is starting to look like a loser to me again. But if the partial vacuum is able to clean the busiest pages and reclaim useful space, currently-running transactions will be able to use that space and thus not have to extend the table. Not that extension is a problem in itself, but it'll keep your working set smaller. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Bruce Momjian wrote: > Yep, agreed on the random I/O issue. The larger question is if you have > a huge table, do you care to reclaim 3% of the table size, rather than > just vacuum it when it gets to 10% dirty? I realize the vacuum is going > to take a lot of time, but vacuuming to relaim 3% three times seems like > it is going to be more expensive than just vacuuming the 10% once. And > vacuuming to reclaim 1% ten times seems even more expensive. The > partial vacuum idea is starting to look like a loser to me again. Buying a house with a 25-year mortgage is much more expensive than just paying cash too, but you don't always have a choice. Surely the key benefit of the partial vacuuming thing is that you can at least do something useful with a large table if a full vacuum takes 24 hours and you only have 4 hours of idle I/O. It's also occurred to me that all the discussion of scheduling way back when isn't directly addressing the issue. What most people want (I'm guessing) is to vacuum *when the user-workload allows* and the time-tabling is just a sysadmin first-approximation at that. With partial vacuuming possible, we can arrange things with just three thresholds and two measurements: Measurement 1 = system workload Measurement 2 = a per-table "requires vacuuming" value Threshold 1 = workload at which we do more vacuuming Threshold 2 = workload at which we do less vacuuming Threshold 3 = point at which a table is considered worth vacuuming. Once every 10 seconds, the manager compares the current workload to the thresholds and starts a new vacuum, kills one or does nothing. New vacuum processes keep getting started as long as there is workload spare and tables that need vacuuming. Now the trick of course is how you measure system workload in a meaningful manner. -- Richard Huxton Archonet Ltd
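To make the shape of that loop concrete, here is a minimal C sketch of the three-threshold manager Richard describes. Everything in it is invented for illustration: the struct, the workload and per-table score functions (stubbed out with constants), and the worker start/stop calls correspond to nothing in the PostgreSQL sources. It only shows how the two measurements and three thresholds would interact on each 10-second tick.

#include <stdio.h>

typedef struct
{
    double more_vacuum_workload;   /* Threshold 1: below this, start another vacuum */
    double less_vacuum_workload;   /* Threshold 2: above this, kill one */
    double table_needs_vacuum;     /* Threshold 3: per-table "worth vacuuming" score */
} VacuumThresholds;

/* Stubs standing in for the real measurements and actions; a real
 * implementation would consult the stats collector and signal
 * autovacuum workers instead. */
static int    nworkers = 0;
static double measure_system_workload(void)     { return 0.30; }  /* Measurement 1 */
static double most_needy_table_score(void)      { return 0.80; }  /* Measurement 2 */
static void   start_partial_vacuum_worker(void) { nworkers++; printf("start worker (%d running)\n", nworkers); }
static void   stop_one_vacuum_worker(void)      { nworkers--; printf("stop worker (%d running)\n", nworkers); }

/* Called once every 10 seconds by the manager. */
static void
manager_tick(const VacuumThresholds *t)
{
    double workload = measure_system_workload();

    if (workload < t->more_vacuum_workload &&
        most_needy_table_score() >= t->table_needs_vacuum)
        start_partial_vacuum_worker();      /* spare capacity: do more vacuuming */
    else if (workload > t->less_vacuum_workload && nworkers > 0)
        stop_one_vacuum_worker();           /* user workload too high: back off */
    /* otherwise leave things alone until the next tick */
}

int
main(void)
{
    VacuumThresholds t = { 0.50, 0.80, 0.25 };

    manager_tick(&t);   /* with the stub numbers above, this starts one worker */
    return 0;
}

The open question Richard flags at the end, how to measure system workload meaningfully, is exactly the part stubbed out with a constant here.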
Gregory Stark wrote: > "Bruce Momjian" <bruce@momjian.us> writes: > >> I agree it index cleanup isn't > 50% of vacuum. I was trying to figure >> out how small, and it seems about 15% of the total table, which means if >> we have bitmap vacuum, we can conceivably reduce vacuum load by perhaps >> 80%, assuming 5% of the table is scanned. > > Actually no. A while back I did experiments to see how fast reading a file > sequentially was compared to reading the same file sequentially but skipping > x% of the blocks randomly. The results were surprising (to me) and depressing. > The breakeven point was about 7%. Note that with uniformly random updates, you have dirtied every page of the table until you get anywhere near 5% of dead space. So we have to assume non-uniform distribution of update for the DSM to be of any help. And if we assume non-uniform distribution, it's a good bet that the blocks that need vacuuming are also not randomly distributed. In fact, they might very well all be in one cluster, so that scanning that cluster is indeed sequential I/O. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Kenneth Marshall wrote: > On Mon, Jan 22, 2007 at 06:42:09PM +0000, Simon Riggs wrote: >> Hold that thought! Read Heikki's Piggyback VACUUM idea on new thread... > > There may be other functions that could leverage a similar sort of > infrastructure. For example, a long DB mining query could be registered > with the system. Then as the pieces of the table/database are brought in > to shared memory during the normal daily DB activity they can be acquired > without forcing the DB to run a very I/O expensive query when waiting a > bit for the results would be acceptable. As long as we are thinking > piggyback. Yeah, I had the same idea when we discussed synchronizing sequential scans. The biggest difference is that with queries, there's often a user waiting for the query to finish, but with vacuum we don't care so much how long it takes. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Alvaro Herrera wrote: > Bruce Momjian wrote: > > > > Yep, agreed on the random I/O issue. The larger question is if you have > > a huge table, do you care to reclaim 3% of the table size, rather than > > just vacuum it when it gets to 10% dirty? I realize the vacuum is going > > to take a lot of time, but vacuuming to relaim 3% three times seems like > > it is going to be more expensive than just vacuuming the 10% once. And > > vacuuming to reclaim 1% ten times seems even more expensive. The > > partial vacuum idea is starting to look like a loser to me again. > > But if the partial vacuum is able to clean the busiest pages and reclaim > useful space, currently-running transactions will be able to use that > space and thus not have to extend the table. Not that extension is a > problem on itself, but it'll keep your working set smaller. Yes, but my point is that if you are trying to avoid vacuuming the table, I am afraid the full index scan is going to be painful too. I can see corner cases where partial vacuum is a win (I only have 4 hours of idle I/O), but for the general case I am still worried that partial vacuum will not be that useful as long as we have to scan the indexes. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Gregory Stark wrote: > > Actually no. A while back I did experiments to see how fast reading a file > sequentially was compared to reading the same file sequentially but skipping > x% of the blocks randomly. The results were surprising (to me) and depressing. > The breakeven point was about 7%. [...] > > The theory online was that as long as you're reading one page from each disk > track you're going to pay the same seek overhead as reading the entire track. Could one take advantage of this observation in designing the DSM? Instead of a separate bit representing every page, having each bit represent 20 or so pages might be a more useful unit. It sounds like the time spent reading would be similar, while the bitmap would be significantly smaller.
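As an illustration of what such a coarse map might look like, here is a small self-contained C sketch in which one bit covers a run of pages rather than a single page. The structure, the function names and the 32-page chunk size are assumptions made up for this example; nothing like this exists in PostgreSQL today.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGES_PER_BIT 32        /* one bit summarizes a run of 32 heap pages */

typedef struct
{
    uint32_t nblocks;           /* heap size in pages */
    uint8_t *bits;
} CoarseDSM;

static CoarseDSM *
coarse_dsm_create(uint32_t nblocks)
{
    CoarseDSM *dsm = malloc(sizeof(CoarseDSM));
    size_t nbits = (nblocks + PAGES_PER_BIT - 1) / PAGES_PER_BIT;

    dsm->nblocks = nblocks;
    dsm->bits = calloc((nbits + 7) / 8, 1);
    return dsm;
}

/* A dead tuple anywhere in the chunk marks the whole chunk. */
static void
coarse_dsm_mark_dead(CoarseDSM *dsm, uint32_t blkno)
{
    uint32_t bit = blkno / PAGES_PER_BIT;
    dsm->bits[bit / 8] |= (uint8_t) (1u << (bit % 8));
}

/* Vacuum asks: is this chunk of pages worth visiting at all? */
static int
coarse_dsm_chunk_is_dirty(const CoarseDSM *dsm, uint32_t blkno)
{
    uint32_t bit = blkno / PAGES_PER_BIT;
    return (dsm->bits[bit / 8] >> (bit % 8)) & 1;
}

int
main(void)
{
    CoarseDSM *dsm = coarse_dsm_create(131072);     /* a 1 GB table of 8 kB pages */

    coarse_dsm_mark_dead(dsm, 70000);               /* one dead tuple ...          */
    printf("chunk dirty? %d\n",
           coarse_dsm_chunk_is_dirty(dsm, 70015));  /* ... dirties its 32-page run */

    free(dsm->bits);
    free(dsm);
    return 0;
}

With 8 kB pages and one bit per 32-page run, the map costs about 512 bytes per GB of heap, versus roughly 16 kB per GB for a one-bit-per-page map. The trade-off is that vacuum has to visit a whole 32-page run even when only one of its pages has dead tuples, which the ~7% break-even figure quoted above suggests may cost little extra I/O in practice anyway.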
>-----Original Message----- >From: pgsql-general-owner@postgresql.org >[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Gregory Stark >Sent: Monday 22 January 2007 19:41 >To: Bruce Momjian >Cc: Heikki Linnakangas; Russell Smith; Darcy Buskermolen; >Simon Riggs; Alvaro Herrera; Matthew T. O'Connor; Pavan >Deolasee; Christopher Browne; pgsql-general@postgresql.org; >pgsql-hackers@postgresql.org >Subject: Re: [HACKERS] [GENERAL] Autovacuum Improvements > >"Bruce Momjian" <bruce@momjian.us> writes: > >> Yep, agreed on the random I/O issue. The larger question is if you >> have a huge table, do you care to reclaim 3% of the table >size, rather >> than just vacuum it when it gets to 10% dirty? I realize the vacuum >> is going to take a lot of time, but vacuuming to relaim 3% >three times >> seems like it is going to be more expensive than just vacuuming the >> 10% once. And vacuuming to reclaim 1% ten times seems even more >> expensive. The partial vacuum idea is starting to look like >a loser to me again. > >Well the answer is of course "that depends". > >If you maintain the dead space at a steady state averaging >1.5% instead of 5% your table is 3.33% smaller on average. If >this is a DSS system that will translate into running your >queries 3.33% faster. It will take a lot of vacuums before >they hurt more than a 3%+ performance drop. Good, this means a DSS system will mostly do table scans (right?). So presumably you would see the 'table scan' statistic and rows-fetched count approaching the end of the universe (at least compared to inserts/updates/deletes)? >If it's an OLTP system the it's harder to figure. a 3.33% >increase in data density will translate to a higher cache hit >rate but how much higher depends on a lot of factors. In our >experiments we actually got bigger boost in these kinds of >situations than the I expected (I expected comparable to the >3.33% improvement). So it could be even more than 3.33%. But >like said it depends. >If you already have the whole database cache you won't see any >improvement. If you are right on the cusp you could see a huge benefit. These tables have high insert, update and delete rates, and probably a lot of index scans? I believe the table-scan workload should be (close to) zero. Are you willing to share some of this measured data? I'm quite interested in such figures. >It sounds like you're underestimating the performance drain >10% wasted space has. If we found out that one routine was >unnecessarily taking 10% of the cpu time it would be an >obvious focus of attention. 10% wasted space is going to work >out to about 10% of the i/o time. > >It also sounds like we're still focused on the performance >impact in absolute terms. I'm much more interested in changing >the performance characteristics so they're predictable and >scalable. It doesn't matter much if your 1kb table is 100% >slower than necessary but it does matter if your 1TB table >needs 1,000x as much vacuuming as your 1GB table even if it's >getting the same update traffic. Or rather, the vacuuming should pay back. A nice metric might be: cost_of_not_vacuuming / cost_of_vacuuming. Obviously, the higher the better. - Joris Dobbelsteen
On Mon, Jan 22, 2007 at 07:24:20PM +0000, Heikki Linnakangas wrote: > Kenneth Marshall wrote: > >On Mon, Jan 22, 2007 at 06:42:09PM +0000, Simon Riggs wrote: > >>Hold that thought! Read Heikki's Piggyback VACUUM idea on new thread... > > > >There may be other functions that could leverage a similar sort of > >infrastructure. For example, a long DB mining query could be registered > >with the system. Then as the pieces of the table/database are brought in > >to shared memory during the normal daily DB activity they can be acquired > >without forcing the DB to run a very I/O expensive query when waiting a > >bit for the results would be acceptable. As long as we are thinking > >piggyback. > > Yeah, I had the same idea when we discussed synchronizing sequential > scans. The biggest difference is that with queries, there's often a user > waiting for the query to finish, but with vacuum we don't care so much > how long it takes. > Yes, but with trending and statistical analysis you may not need the exact answer ASAP. An approximate answer based on a fraction of the information would be useful. Also, "what if" queries could be run without impacting the production uses of a database. One might imagine having a query with results that "converge" as the table is processed during normal use. Ken
On Mon, Jan 22, 2007 at 06:42:09PM +0000, Simon Riggs wrote: > On Mon, 2007-01-22 at 13:27 -0500, Bruce Momjian wrote: > > Yep, agreed on the random I/O issue. The larger question is if you have > > a huge table, do you care to reclaim 3% of the table size, rather than > > just vacuum it when it gets to 10% dirty? I realize the vacuum is going > > to take a lot of time, but vacuuming to relaim 3% three times seems like > > it is going to be more expensive than just vacuuming the 10% once. And > > vacuuming to reclaim 1% ten times seems even more expensive. The > > partial vacuum idea is starting to look like a loser to me again. > > Hold that thought! Read Heikki's Piggyback VACUUM idea on new thread... > > -- > Simon Riggs > EnterpriseDB http://www.enterprisedb.com > There may be other functions that could leverage a similar sort of infrastructure. For example, a long DB mining query could be registered with the system. Then as the pieces of the table/database are brought in to shared memory during the normal daily DB activity they can be acquired without forcing the DB to run a very I/O expensive query when waiting a bit for the results would be acceptable. As long as we are thinking piggyback. Ken
On Mon, Jan 22, 2007 at 12:17:39PM -0800, Ron Mayer wrote: > Gregory Stark wrote: > > > > Actually no. A while back I did experiments to see how fast reading a file > > sequentially was compared to reading the same file sequentially but skipping > > x% of the blocks randomly. The results were surprising (to me) and depressing. > > The breakeven point was about 7%. [...] > > > > The theory online was that as long as you're reading one page from each disk > > track you're going to pay the same seek overhead as reading the entire track. > > Could one take advantage of this observation in designing the DSM? > > Instead of a separate bit representing every page, having each bit > represent 20 or so pages might be a more useful unit. It sounds > like the time spent reading would be similar; while the bitmap > would be significantly smaller. If we extended relations by more than one page at a time we'd probably have a better shot at the blocks on disk being contiguous and all read at the same time by the OS. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Jan 22, 2007, at 11:16 AM, Richard Huxton wrote: > Bruce Momjian wrote: >> Yep, agreed on the random I/O issue. The larger question is if >> you have >> a huge table, do you care to reclaim 3% of the table size, rather >> than >> just vacuum it when it gets to 10% dirty? I realize the vacuum is >> going >> to take a lot of time, but vacuuming to relaim 3% three times >> seems like >> it is going to be more expensive than just vacuuming the 10% >> once. And >> vacuuming to reclaim 1% ten times seems even more expensive. The >> partial vacuum idea is starting to look like a loser to me again. > > Buying a house with a 25-year mortgage is much more expensive than > just paying cash too, but you don't always have a choice. > > Surely the key benefit of the partial vacuuming thing is that you > can at least do something useful with a large table if a full > vacuum takes 24 hours and you only have 4 hours of idle I/O. > > It's also occurred to me that all the discussion of scheduling way > back when isn't directly addressing the issue. What most people > want (I'm guessing) is to vacuum *when the user-workload allows* > and the time-tabling is just a sysadmin first-approximation at that. Yup. I'd really like for my app to be able to say "Hmm. No interactive users at the moment, no critical background tasks. Now would be a really good time for the DB to do some maintenance." but also to be able to interrupt the maintenance process if some new users or other system load show up. > With partial vacuuming possible, we can arrange things with just > three thresholds and two measurements: > Measurement 1 = system workload > Measurement 2 = a per-table "requires vacuuming" value > Threshold 1 = workload at which we do more vacuuming > Threshold 2 = workload at which we do less vacuuming > Threshold 3 = point at which a table is considered worth vacuuming. > Once every 10 seconds, the manager compares the current workload to > the thresholds and starts a new vacuum, kills one or does nothing. > New vacuum processes keep getting started as long as there is > workload spare and tables that need vacuuming. > > Now the trick of course is how you measure system workload in a > meaningful manner. I'd settle for a "start maintenance", "stop maintenance" API. Anything else (for instance the heuristics you suggest above) would definitely be gravy. It's not going to be simple to do, though, I don't think. Cheers, Steve
On Mon, Jan 22, 2007 at 05:51:53PM +0000, Gregory Stark wrote: > Actually no. A while back I did experiments to see how fast reading a file > sequentially was compared to reading the same file sequentially but skipping > x% of the blocks randomly. The results were surprising (to me) and depressing. > The breakeven point was about 7%. I assume this means you were reading 7% of the blocks, not skipping 7% of the blocks when you broke even? I presume by break-even you mean it took just as long, time-wise. But did it have the same effect on system load? If reading only 7% of the blocks allows the drive to complete other requests more quickly then it's beneficial, even if the vacuum takes longer. This may be a silly thought; I'm not sure how drives handle multiple requests... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
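For anyone who wants to repeat the measurement, the experiment being discussed can be approximated with something like the program below. This is a reconstruction of the kind of test Gregory describes, not his actual code; the 'fraction' argument is the fraction of blocks actually read, which is exactly the interpretation Martijn asks about. Run it against a file much larger than RAM, time it with time(1), and clear the OS cache between runs (for example via /proc/sys/vm/drop_caches on recent Linux kernels).

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#define BLKSZ 8192

int
main(int argc, char **argv)
{
    if (argc != 3)
    {
        fprintf(stderr, "usage: %s <file> <fraction-of-blocks-to-read>\n", argv[0]);
        return 1;
    }

    double fraction = atof(argv[2]);    /* e.g. 0.07 to read 7% of the blocks */
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    struct stat st;
    if (fstat(fd, &st) < 0)
    {
        perror("fstat");
        return 1;
    }

    char buf[BLKSZ];
    off_t nblocks = st.st_size / BLKSZ;
    long nread = 0;

    /* Walk the file in block order but only read a random subset; skipped
     * blocks are simply never touched, leaving the kernel to seek past them. */
    for (off_t blk = 0; blk < nblocks; blk++)
    {
        if ((double) rand() / RAND_MAX < fraction)
        {
            if (pread(fd, buf, BLKSZ, blk * (off_t) BLKSZ) < 0)
            {
                perror("pread");
                break;
            }
            nread++;
        }
    }

    printf("read %ld of %ld blocks\n", nread, (long) nblocks);
    close(fd);
    return 0;
}

Comparing the run time at various fractions against a full sequential read (fraction 1.0) gives the break-even point for a particular disk and filesystem.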
Thx Russell, I want to control it from software; changing network access via pg_hba with software doesn't feel right. ================ possible case================ Say I have a Group called Normal_Rights and one called Zero_Rights. So the dB runs as... Normal_Rights(User A, User B, User C, User D) Then via SQL, the superuser REVOKEs those user rights and GRANTs them Zero_Rights(User A, User B, User C, User D)... i.e. make the users members of the Zero_Rights group. Then hopefully Postgres kicks them out gracefully? Then the software makes changes and switches them back to their Normal_Rights group. ================ or a more general case================ RECORD all the SQL for all user rights... REVOKE everything except the needed software superusers (postgres, and the program superuser). Make changes via software. PLAY BACK all the rights SQL script. What do you think, will PG kill connections, let them go gracefully, stop after the current transaction? ================ maybe I'm in the wrong tree================ Is it possible to make quick structural changes to postgres, with user activity? Maybe start a transaction that changes structure... wonder if that will stop or hold user activity? Thx.
org@kewlstuff.co.za schrieb: > Thx Russel, > I want to control it from software, changing network access via pg_hba > with software doesnt feel right. > > ================ possible case================ > Say I have a Group called Normal_Rights and one called Zero_Rights. > > So dB runs as... Normal_Rights(User A, User B, User C, User D) > Then via sql, superuser REVOKEs those user rights and GRANTs them > Zero_Rights(User A, User B, User C, User D)... ie make users a member of > the ZERO rights group. > > Then hopefully Postgres kicks them out gracefully????? > > Then software make changes and switch's them back to their Normal_Rights > group. > > ================ or more general case================ > RECORD all the SQL for all user rights... > REVOKE everything except needed software superusers (postgres, and > program superuser). > make changes via software. > PLAY BACK all the rights SQL script. > > What do you think, will PG kill connections, let them go gracefully, > stop after current transaction???? > > ================ maybe I'm in the wrong tree================ Yes I'm thinking that too: > Is it possible to make quick structural changes to postgres, with user > activety? of course. > Maybe start a transaction that changes structure... wonder if that will > stop or hold user activity??? Usually not - all your DDL is done in a transaction just like any other access users would make. So it only fails (but as a whole) if you want to modify locked tables and such. But you would not end up w/ a partly changed database in any case. Just make sure you do everything in a transaction. No need to suspend user accounts for that. Regards Tino
On Jan 22, 2007, at 6:53 PM, Kenneth Marshall wrote: > The default should > be approximately the OS standard read-ahead amount. Is there anything resembling a standard across the OSes we support? Better yet, is there a standard call that allows you to find out what the read-ahead setting is? -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Ha ha... thx Tino. Yes, I think this is the way to go; strange how my mind climbs the wrong tree sometimes :) I actually need to acquire a transaction across several dB's, check if the conditions are right, and then modify some tables, write and remove some triggers. Transactions in postgres are too sophisticated; I don't think they will hold the locks at the level I need them. But I was thinking (climbing out of the wrong tree;)... I can just acquire exclusive locks on the tables, and hey presto, users are on hold while the software checks the dB's. Effectively creating a very rough transaction, with the lock scope needed? ... i.e. it will keep users out long enough to align several dB's... I'm hoping? From: "Tino Wildenhain" <tino@wildenhain.de> >> ================ maybe I'm in the wrong tree================ > > Yes I'm thinking that too: > >> Is it possible to make quick structural changes to postgres, with user activety? > > of course. > >> Maybe start a transaction that changes structure... wonder if that will stop or hold user activity??? > > Usually not - all your DDL is done in a transaction just like any other access users would make. So it only fails (but as a whole) if you want to modify locked tables and such. But you would not end up w/ a partly changed database in any case. Just make sure you do everything in a transaction. No need to suspend user accounts for that. > > > Regards > Tino > > >
Hi, org@kewlstuff.co.za wrote: > Ha ha... thx Tino > Yes, I think this is way to go, strange how my mind climbs the wrong > tree sometimes :) > I actually need to aquire a transaction across several dB's, check if > the conditions are right, and then modify some tables, write and remove > some triggers. > Transactions in postgres are 2 sophisticated, I dont think they will > hold the locks at the level I need them. You want to read about explicit locking: http://www.postgresql.org/docs/8.2/static/explicit-locking.html > But I was thinking (climbing out of the wrong tree;)... I can just > aquire exclusive locks on the tables, and hey presto, users are on hold > while the software checks the dB's. I'm sure that's possible. However, I remember you were talking about replication, so I have to add a warning: please keep in mind that this does not scale. You're most probably better off using two-phase commit, aren't you? Regards Markus
On Mon, Jan 22, 2007 at 05:11:03PM -0600, Jim C. Nasby wrote: > On Mon, Jan 22, 2007 at 12:17:39PM -0800, Ron Mayer wrote: > > Gregory Stark wrote: > > > > > > Actually no. A while back I did experiments to see how fast reading a file > > > sequentially was compared to reading the same file sequentially but skipping > > > x% of the blocks randomly. The results were surprising (to me) and depressing. > > > The breakeven point was about 7%. [...] > > > > > > The theory online was that as long as you're reading one page from each disk > > > track you're going to pay the same seek overhead as reading the entire track. > > > > Could one take advantage of this observation in designing the DSM? > > > > Instead of a separate bit representing every page, having each bit > > represent 20 or so pages might be a more useful unit. It sounds > > like the time spent reading would be similar; while the bitmap > > would be significantly smaller. > > If we extended relations by more than one page at a time we'd probably > have a better shot at the blocks on disk being contiguous and all read > at the same time by the OS. > -- > Jim Nasby jim@nasby.net > EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) > Yes, most OSes have some read-ahead when reading a file from disk. Any increment over 1 would be an improvement. If you used a counter with a time-based decrement function, you could increase the amount that the relation is extended based on temporal proximity. If you have extended it several times recently, increase the size of the new extension to reduce the overhead even further. The default should be approximately the OS standard read-ahead amount. Ken
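A toy C sketch of the counter-with-decay idea Kenneth describes follows. The constants, the decay rule and the geometric growth are all invented for illustration, and it deliberately ignores the hard part Tom raises further down the thread, namely making the extra blocks visible to other backends via the FSM.

#include <stdio.h>
#include <time.h>

#define MAX_EXTEND_BLOCKS 64        /* cap: 512 kB worth of 8 kB pages */
#define DECAY_SECONDS     10        /* halve the history every 10 idle seconds */

typedef struct
{
    double recent_extensions;       /* decayed count of recent extension calls */
    time_t last_extension;
} ExtensionState;

/* Decide how many blocks to add to the relation this time around. */
static int
blocks_to_extend(ExtensionState *st, time_t now)
{
    int periods = (int) (difftime(now, st->last_extension) / DECAY_SECONDS);
    int shift;

    /* Time-based decay: the longer the relation has sat idle,
     * the less the recent history counts. */
    if (periods >= 30)
        st->recent_extensions = 0;
    else if (periods > 0)
        st->recent_extensions /= (double) (1 << periods);

    st->recent_extensions += 1.0;
    st->last_extension = now;

    /* Grow geometrically with recent activity: 1, 2, 4, ... blocks,
     * bounded by the cap above. */
    shift = (int) st->recent_extensions - 1;
    if (shift > 6)                  /* 1 << 6 == MAX_EXTEND_BLOCKS */
        return MAX_EXTEND_BLOCKS;
    return 1 << shift;
}

int
main(void)
{
    ExtensionState st = { 0.0, 0 };
    time_t now = time(NULL);

    /* Five extensions in quick succession grow the request size ... */
    for (int i = 0; i < 5; i++)
        printf("extend by %d blocks\n", blocks_to_extend(&st, now));

    /* ... and a long quiet spell shrinks it back down. */
    printf("after an idle hour: extend by %d blocks\n",
           blocks_to_extend(&st, now + 3600));
    return 0;
}

Extending by a handful of blocks during a burst of inserts keeps the file extents more contiguous on most filesystems, which is the fragmentation effect Kenneth mentions; whether the unused blocks can then be found by other backends is the separate FSM question.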
On Tue, Jan 23, 2007 at 09:01:41PM -0600, Jim Nasby wrote: > On Jan 22, 2007, at 6:53 PM, Kenneth Marshall wrote: > >The default should > >be approximately the OS standard read-ahead amount. > > Is there anything resembling a standard across the OSes we support? > Better yet, is there a standard call that allows you to find out what > the read-ahead setting is? > -- > Jim Nasby jim@nasby.net > EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) > Not that I am aware of. Even extending the relation by one additional block can make a big difference in performance and should easily fall within every read-ahead in use today. Or use a GUC variable that defaults to a small power-of-2 number of PostgreSQL blocks, with the default arrived at by testing. Ken
Kenneth Marshall <ktm@is.rice.edu> writes: > Not that I am aware of. Even extending the relation by one additional > block can make a big difference in performance Do you have any evidence to back up that assertion? It seems a bit nontrivial to me --- not the extension part exactly, but making sure that the space will get used promptly. With the current code the backend extending a relation will do subsequent inserts into the block it just got, which is fine, but there's no mechanism for remembering that any other newly-added blocks are available --- unless you wanted to push them into the FSM, which could work but the current FSM code doesn't support piecemeal addition of space, and in any case there's some question in my mind about the concurrency cost of increasing FSM traffic even more. In short, it's hardly an unquestionable improvement, so we need some evidence. regards, tom lane
Good memory you have, and you're exactly right. Yes... the replication is using postgres's normal transactions, i.e. two-phase commits... and it works like a dream. When moving data during replication, the locks are happening at record level, and it's intrinsic to the postgres transaction machinery, i.e. postgres is deciding how 'fine grained' the locks should be, and doing all that other amazing MVCC stuff. The part I'm toying with and struggling with is the start and stop, or the admin side of the replication. As it stands now, one has to start with identical databases, then set up replication, and then the users come on. But now say I want to make a structural change... as it stands I have to claim the dB's back, fix them all, make sure they are identical, re-set up the replication, and then the users can come back on. It's that requirement, that the dB's must be identical when setting up replication, that I'm trying to get around. I think that when it comes to the structural side, I have to hold (LOCK) those dB's while the software removes the replication, changes the structures of all the dB's, reinstalls the scripts and triggers... and I want to make that invisible to a system that's already active. Ideally the user software just delays for say 10 seconds, and in that time 6 dB's have been restructured, checked and the replication restarted. In terms of the setup I want to get to... "make those 5 dB's the same as this template and start or continue replicating"... it becomes a mind twister. That's the idea anyway... the current version is at http://coolese.100free.com/ and it works great, but you'll see it has a setup/teardown problem on an active system. Thx for the help, Johnny From: "Markus Schiltknecht" <markus@bluegap.ch> >> But I was thinking (climbing out of the wrong tree;)... I can just aquire >> exclusive locks on the tables, and hey presto, users are on hold while >> the software checks the dB's. > > I'm sure, that's possible. However, I remember you were talking about > replication, thus I have to add a warning: please keep in mind that this > does not scale. You're most probably better using two phase commit, aren't > you? > > Regards > > Markus >
Jim C. Nasby wrote: > On Mon, Jan 22, 2007 at 12:17:39PM -0800, Ron Mayer wrote: > > Gregory Stark wrote: > > > > > > Actually no. A while back I did experiments to see how fast reading a file > > > sequentially was compared to reading the same file sequentially but skipping > > > x% of the blocks randomly. The results were surprising (to me) and depressing. > > > The breakeven point was about 7%. [...] > > > > > > The theory online was that as long as you're reading one page from each disk > > > track you're going to pay the same seek overhead as reading the entire track. > > > > Could one take advantage of this observation in designing the DSM? > > > > Instead of a separate bit representing every page, having each bit > > represent 20 or so pages might be a more useful unit. It sounds > > like the time spent reading would be similar; while the bitmap > > would be significantly smaller. > > If we extended relations by more than one page at a time we'd probably > have a better shot at the blocks on disk being contiguous and all read > at the same time by the OS. Actually, there is evidence that adding only a single page to the end causes a lot of contention for that last page, and that adding a few might be better. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes: > Jim C. Nasby wrote: >> If we extended relations by more than one page at a time we'd probably >> have a better shot at the blocks on disk being contiguous and all read >> at the same time by the OS. > Actually, there is evidence that adding only a single page to the end > causes a lot of contention for that last page, and that adding a few > might be better. Evidence where? The code is designed so that the last page *isn't* shared --- go read the comments in hio.c sometime. regards, tom lane
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > Jim C. Nasby wrote: > >> If we extended relations by more than one page at a time we'd probably > >> have a better shot at the blocks on disk being contiguous and all read > >> at the same time by the OS. > > > Actually, there is evidence that adding only a single page to the end > > causes a lot of contention for that last page, and that adding a few > > might be better. > > Evidence where? The code is designed so that the last page *isn't* > shared --- go read the comments in hio.c sometime. I was talking about the last page of a table, where INSERTs all cluster on that last page and cause lots of page locking. hio.c does look like it avoids that problem. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Wed, Jan 24, 2007 at 07:30:05PM -0500, Tom Lane wrote: > Kenneth Marshall <ktm@is.rice.edu> writes: > > Not that I am aware of. Even extending the relation by one additional > > block can make a big difference in performance > > Do you have any evidence to back up that assertion? > > It seems a bit nontrivial to me --- not the extension part exactly, but > making sure that the space will get used promptly. With the current > code the backend extending a relation will do subsequent inserts into > the block it just got, which is fine, but there's no mechanism for > remembering that any other newly-added blocks are available --- unless > you wanted to push them into the FSM, which could work but the current > FSM code doesn't support piecemeal addition of space, and in any case > there's some question in my mind about the concurrency cost of increasing > FSM traffic even more. > > In short, it's hardly an unquestionable improvement, so we need some > evidence. > > regards, tom lane > My comment was purely based on the reduction in fragmentation of the file behind the relation, a result that I have seen repeatedly in file-related data processing. It does sound much more complicated to make the additional space available to other backends. If one backend was doing many inserts, it might still be of value even for just that backend. As you mention, testing is needed to see if there is enough value in this process. Ken