Thread: autovac hung/blocked
I have an 8.1.2 autovac which appears to be hanging/blocking
every few days or so, but we don't understand what's causing it.
I wasn't able to catch a backtrace before we killed it.  I do not
see autovac locks in the pg_locks view.

Will running 8.1.5 buy me anything in terms of being able to see
what it has locked and what locks it might be awaiting?  Or maybe
in terms of bugs that might explain it?

TIA.

Ed
On Tue, Nov 14, 2006 at 11:20:30AM -0700, Ed L. wrote:
> I have an 8.1.2 autovac which appears to be hanging/blocking
> every few days or so, but we don't understand what's causing
> it.  I wasn't able to catch a backtrace before we killed it.
> I do not see autovac locks in the pg_locks view.
>
> Will running 8.1.5 buy me anything in terms of being able to
> see what it has locked and what locks it might be awaiting?
> Or maybe in terms of bugs that might explain it?

If there's no locks, why do you think it's blocked?
--
Jim Nasby                    jim@nasby.net
EnterpriseDB                 http://enterprisedb.com
512.569.9461 (cell)
On Tuesday November 14 2006 12:49 pm, Jim C. Nasby wrote:
> If there's no locks, why do you think it's blocked?

I had wondered if all autovac locks were in the pg_locks view,
but I now see them.  A gdb backtrace a few days ago had autovac
waiting on a semaphore, and I assumed that was waiting on a
lock.

Ed
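For anyone following along, a query in this vein against the
pg_locks view (a sketch; only columns present in the 8.1 docs are
used) shows any lock a backend, autovac included, is still
waiting on:

```sql
-- Rows with granted = false are locks some backend is waiting on.
-- Match pid against the autovacuum process to see what it's stuck on.
SELECT pid, locktype, relation::regclass AS rel, mode, granted
FROM pg_locks
WHERE NOT granted;
```

If that returns nothing for the autovac pid, the process is
sleeping on something other than a heavyweight lock.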
On Tue, Nov 14, 2006 at 12:53:56PM -0700, Ed L. wrote:
> I had wondered if all autovac locks were in the pg_locks view,
> but I now see them.  A gdb backtrace a few days ago had autovac
> waiting on a semaphore, and I assumed that was waiting on a
> lock.

Well, vacuum shouldn't really block on much... DDL and LOCK TABLE
are the only things I can think of off the top of my head.  You
don't have the vacuum cost delay settings set unreasonably high,
do you?
On Tuesday November 14 2006 12:56 pm, Jim C. Nasby wrote:
> You don't have the vacuum cost delay settings set unreasonably
> high, do you?

I'm not sure.  Here's what we're running:

    #vacuum_cost_delay = 0                  # 0-1000 milliseconds
    #vacuum_cost_page_hit = 1               # 0-10000 credits
    #vacuum_cost_page_miss = 10             # 0-10000 credits
    #vacuum_cost_page_dirty = 20            # 0-10000 credits
    #vacuum_cost_limit = 200                # 0-10000 credits
    autovacuum = on                         # enable autovacuum subprocess?
    #autovacuum_naptime = 60                # time between autovacuum runs, in secs
    autovacuum_naptime = 600                # changed by CW 9/11/06 to minimize
                                            # interference with application
    autovacuum_vacuum_threshold = 1000      # min # of tuple updates before vacuum
    autovacuum_analyze_threshold = 500      # min # of tuple updates before analyze
    autovacuum_vacuum_scale_factor = 0.4    # fraction of rel size before vacuum
    autovacuum_analyze_scale_factor = 0.2   # fraction of rel size before analyze
    autovacuum_vacuum_cost_delay = 500      # default for vacuum_cost_delay
    autovacuum_vacuum_cost_limit = 200      # default for vacuum_cost_limit

The table on which it seems to be struggling is updated many
times per second and has 7.2M rows over 15GB total.

I'm trying to figure out how often it would be napping on
that...?  Maybe we're just seeing a very conservative vacuum on a
large table...?

Ed
On Tuesday November 14 2006 1:02 pm, Ed L. wrote:
> I'm not sure.  Here's what we're running: [settings snipped]

Well, I think we clearly have an HPUX CPU bottleneck (long pri
queue, high cpu utilization, high user cpu %, lots of processes
"blocked on pri").

It seems to get worse and slow all queries down across the board
when autovac tries to vacuum a 15GB table.  I'm guessing this is
flushing the OS/DB caches, exacerbating the CPU bottleneck.

I'm also not sure what to do about it beyond the customer buying
some politically/financially expensive CPUs.  The table in
question appears to be the pathological case for vacuum: very
large with lots of frequent UPDATEs.  It's essentially a log
table.

So the other options seem to be to figure out how to partition
the table so as to minimize the size of the data needing
vacuuming, or to redesign if possible to replace UPDATEs with
INSERTs.  Other ideas?

Ed
Ed L. wrote:
> Well, I think we clearly have an HPUX CPU bottleneck (long pri
> queue, high cpu utilization, high user cpu %, lots of processes
> "blocked on pri").
>
> It seems to get worse and slow all queries down across the
> board when autovac tries to vacuum a 15GB table.  I'm guessing
> this is flushing the OS/DB caches, exacerbating the CPU
> bottleneck.
>
> I'm also not sure what to do about it beyond the customer
> buying some politically/financially expensive CPUs.

I suggest turning autovac off for that particular table, and
doing the vacuum during off-peak hours.

> The table in question appears to be the pathological case for
> vacuum: very large with lots of frequent UPDATEs.  It's
> essentially a log table.

A big log table where the log entries are being updated?
Certainly sounds like a recipe for vacuum headaches.

--
Alvaro Herrera                 http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
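In 8.1, per-table autovacuum tuning is done through the
pg_autovacuum system catalog rather than postgresql.conf.  A
sketch of what Alvaro is suggesting (the table name app_log is
invented for illustration; -1 means "use the default"):

```sql
-- Disable autovacuum for just the one big table; everything else
-- keeps getting vacuumed on the usual schedule.
INSERT INTO pg_autovacuum
    (vacrelid, enabled,
     vac_base_thresh, vac_scale_factor,
     anl_base_thresh, anl_scale_factor,
     vac_cost_delay, vac_cost_limit)
VALUES ('app_log'::regclass, false, -1, -1, -1, -1, -1, -1);
```

The off-peak vacuum then becomes a cron'd `VACUUM ANALYZE
app_log;`.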
On Wednesday November 15 2006 6:30 am, Alvaro Herrera wrote:
> A big log table where the log entries are being updated?
> Certainly sounds like a recipe for vacuum headaches.

I'm curious to know how others are dealing with this problem, and
how pgsql might support this issue.

In our case, we have a 1.8GB OS cache, a 30GB DB cache, serving
around 200 transactions/second from a 110GB DB, and this
problematic table is 15GB on disk.  So when it is vacuumed, I
suspect it essentially flushes the OS cache and half the DB
cache, severely impacting performance on an already
cpu-bottlenecked machine.

I have attempted to adjust autovac to spread out its I/O impact,
but then it takes so long to run that other smaller
frequently-updated tables are not vacuumed/analyzed in the
meantime and performance starts to suffer.

Suppose there simply are no "off-hours" periods when you can
vacuum a very large table with many frequent updates.  (There is
never a good time to flush the caches.)  How do you manage such a
table in a 24x7 environment?

One idea would be to partition the table somehow such that the
chunks getting vacuumed are much smaller and thus not such an
impact.  On the app side, I suppose we could break the table into
multiple tables on some dimension (time) to make the vacuum
impacts smaller.  But a pgsql solution in the future would be
nice.

I don't know the pgsql code well, but what if the free space map
were divided into smaller sectors such that individual sectors
could be vacuumed without having to hit the entire table?  Or
even simply breaking up the 15GB vacuum job into 1GB chunks with
some spacing between would help.  Of course, it'd be important to
keep the smaller tables vacuumed/analyzed as needed in the
in-between time.

I don't know what the best answer is here, just groping for
ideas.  TIA.

Ed
On Nov 14, 2006, at 1:02 PM, Ed L. wrote:
> I'm not sure.  Here's what we're running:
> [settings snipped]
>
> The table on which it seems to be struggling is updated many
> times per second and has 7.2M rows over 15GB total.
>
> I'm trying to figure out how often it would be napping on
> that...?  Maybe we're just seeing a very conservative vacuum
> on a large table...?

Well, it's going to take some time to vacuum 15G, but it appears
there's also confusion about what some of these settings do.
_naptime simply controls how often autovacuum checks to see if
there's work to do.  There's really no reason at all to set it to
10 minutes.  If you're worried about the impact vacuum has on
your application, you want to be using the *_cost_* settings.
Those work by incrementing a cost counter for every page that
vacuum touches (vacuum_cost_page_*).  Once that counter hits
vacuum_cost_limit, vacuum will stop what it's doing and sleep for
vacuum_cost_delay milliseconds.

So in broad terms, if you want to reduce the impact of
vacuum/autovacuum, you should increase
vacuum_cost_delay/autovacuum_vacuum_cost_delay to something (20
ms seems to be a decent starting point).  If you monitor disk
activity, you can adjust things so that you're at 90% or so of
your drive capability with vacuum and a normal workload running.
Depending on how much of your data is normally in cache, you
might want to increase both page_dirty and the cost_limit.
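To see why the thread's settings make the 15GB vacuum crawl, here
is a back-of-the-envelope model of the throttling Jim describes
(an illustration only, not PostgreSQL's actual code; it also
assumes the worst case where every page accrues the dirty-page
cost of 20):

```python
# Cost-based vacuum delay: vacuum accrues cost per page touched and
# sleeps for cost_delay ms each time the counter reaches cost_limit.

def pages_per_second(cost_per_page, cost_limit, cost_delay_ms):
    """Upper bound on page throughput imposed by the sleeps alone
    (ignores the time spent doing the actual I/O)."""
    pages_per_cycle = cost_limit / cost_per_page
    return pages_per_cycle / (cost_delay_ms / 1000.0)

# The thread's settings: delay 500 ms, limit 200, dirty-page cost 20
slow = pages_per_second(20, 200, 500)
# The suggested delay of 20 ms instead
fast = pages_per_second(20, 200, 20)

table_pages = 15 * 1024**3 // 8192   # a 15 GB table in 8 KB pages
print(f"hours at 500 ms delay: {table_pages / slow / 3600:.1f}")
print(f"hours at 20 ms delay:  {table_pages / fast / 3600:.1f}")
```

With the 500 ms delay the sleeps alone cap vacuum at roughly 20
dirty pages per second, i.e. more than a day for the 15GB table,
which matches the "hung" symptom Ed is seeing.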
Ed L. wrote:
> One idea would be to partition the table somehow such that the
> chunks getting vacuumed are much smaller and thus not such an
> impact.  On the app side, I suppose we could break the table
> into multiple tables on some dimension (time) to make the
> vacuum impacts smaller.

You're running on bigger datasets than I'm used to, but that
would be my approach.  Did you notice the constraint partitioning
introduced in 8.1?

http://www.postgresql.org/docs/8.1/static/ddl-partitioning.html#DDL-PARTITIONING-CONSTRAINT-EXCLUSION

--
  Richard Huxton
  Archonet Ltd
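For the archives, the 8.1 inheritance-style layout that page
describes looks roughly like this (table and column names here
are invented for illustration):

```sql
-- Hypothetical time-partitioned log table, 8.1 inheritance style.
CREATE TABLE app_log (
    id        bigint,
    logged_at timestamp with time zone,
    detail    text
);

-- One child per month; the CHECK constraint is what lets the
-- planner skip irrelevant children when constraint_exclusion = on.
CREATE TABLE app_log_2006_11 (
    CHECK (logged_at >= '2006-11-01' AND logged_at < '2006-12-01')
) INHERITS (app_log);

-- Each child can now be vacuumed on its own schedule, e.g.:
-- VACUUM ANALYZE app_log_2006_11;
```

The vacuum win is that each child is a small table in its own
right, so no single vacuum pass touches 15GB.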
"Ed L." <pgsql@bluepolka.net> writes: > On Tuesday November 14 2006 12:56 pm, Jim C. Nasby wrote: >> You don't have the vacuum cost delay settings set unreasonably >> high, do you? > I'm not sure. Here's what we're running: > autovacuum_vacuum_cost_delay = 500 # default vacuum cost delay for > # vacuum_cost_delay > autovacuum_vacuum_cost_limit = 200 # default vacuum cost limit for > # vacuum_cost_limit That's unreasonably high --- you're telling autovac to do a little work (about ten pages' worth) and then sleep for half a second. No wonder it seems to take forever on a big table. Try autovacuum_vacuum_cost_delay around 10 or 20 (msec) or so. regards, tom lane
On Thursday November 16 2006 3:33 am, Richard Huxton wrote:
> You're running on bigger datasets than I'm used to, but that
> would be my approach.  Did you notice the constraint
> partitioning introduced in 8.1?
>
> http://www.postgresql.org/docs/8.1/static/ddl-partitioning.html#DDL-PARTITIONING-CONSTRAINT-EXCLUSION

Thanks for the tip.  We have avoided use of inheritance in order
to stay closer to the beaten path, but the partitioning
implementation ideas are useful.

Ed