Thread: autovac hung/blocked

autovac hung/blocked

From: "Ed L."
I have an 8.1.2 autovac which appears to be hanging/blocking
every few days or so, but we don't understand what's causing
it.  I wasn't able to catch a backtrace before we killed it.  I
do not see autovac locks in the pg_locks view.
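
For reference, this is roughly the query I've been using to look
for ungranted locks (nothing fancy, just the stock pg_locks
columns):

SELECT pid, locktype, relation::regclass AS rel, mode, granted
FROM pg_locks
WHERE NOT granted;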

Will running 8.1.5 buy me anything in terms of being able to see
what it has locked and what locks it might be awaiting?  Or
maybe in terms of bugs that might explain it?

TIA.
Ed


Re: autovac hung/blocked

From: "Jim C. Nasby"
On Tue, Nov 14, 2006 at 11:20:30AM -0700, Ed L. wrote:
> I have an 8.1.2 autovac which appears to be hanging/blocking
> every few days or so, but we don't understand what's causing
> it.  I wasn't able to catch a backtrace before we killed it.  I
> do not see autovac locks in the pg_locks view.
>
> Will running 8.1.5 buy me anything in terms of being able to see
> what it has locked and what locks it might be awaiting?  Or
> maybe in terms of bugs that might explain it?

If there are no locks, why do you think it's blocked?
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: autovac hung/blocked

From: "Ed L."
On Tuesday November 14 2006 12:49 pm, Jim C. Nasby wrote:
> On Tue, Nov 14, 2006 at 11:20:30AM -0700, Ed L. wrote:
> > I have an 8.1.2 autovac which appears to be hanging/blocking
> > every few days or so, but we don't understand what's
> > causing it.  I wasn't able to catch a backtrace before we
> > killed it.  I do not see autovac locks in the pg_locks view.
> >
> > Will running 8.1.5 buy me anything in terms of being able to
> > see what it has locked and what locks it might be awaiting?
> > Or maybe in terms of bugs that might explain it?
>
> If there are no locks, why do you think it's blocked?

I had wondered if all autovac locks were in the pg_locks view,
but I now see them.  A gdb backtrace a few days ago had autovac
waiting on a semaphore, and I assumed that was waiting on a
lock.

Ed


Re: autovac hung/blocked

From: "Jim C. Nasby"
On Tue, Nov 14, 2006 at 12:53:56PM -0700, Ed L. wrote:
> On Tuesday November 14 2006 12:49 pm, Jim C. Nasby wrote:
> > On Tue, Nov 14, 2006 at 11:20:30AM -0700, Ed L. wrote:
> > > I have an 8.1.2 autovac which appears to be hanging/blocking
> > > every few days or so, but we don't understand what's
> > > causing it.  I wasn't able to catch a backtrace before we
> > > killed it.  I do not see autovac locks in the pg_locks view.
> > >
> > > Will running 8.1.5 buy me anything in terms of being able to
> > > see what it has locked and what locks it might be awaiting?
> > > Or maybe in terms of bugs that might explain it?
> >
> > If there are no locks, why do you think it's blocked?
>
> I had wondered if all autovac locks were in the pg_locks view,
> but I now see them.  A gdb backtrace a few days ago had autovac
> waiting on a semaphore, and I assumed that was waiting on a
> lock.

Well, vacuum shouldn't really block on much... DDL and LOCK TABLE are
the only things I can think of off the top of my head.
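
For example (hypothetical table name), a session sitting on a lock
like this would make a concurrent vacuum of that table wait:

BEGIN;
LOCK TABLE some_big_table IN ACCESS EXCLUSIVE MODE;
-- any VACUUM of some_big_table now blocks until this
-- transaction commits or rolls back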

You don't have the vacuum cost delay settings set unreasonably high, do
you?
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: autovac hung/blocked

From: "Ed L."
On Tuesday November 14 2006 12:56 pm, Jim C. Nasby wrote:
> You don't have the vacuum cost delay settings set unreasonably
> high, do you?

I'm not sure.  Here's what we're running:

#vacuum_cost_delay = 0                  # 0-1000 milliseconds
#vacuum_cost_page_hit = 1               # 0-10000 credits
#vacuum_cost_page_miss = 10             # 0-10000 credits
#vacuum_cost_page_dirty = 20            # 0-10000 credits
#vacuum_cost_limit = 200                # 0-10000 credits
autovacuum = on                         # enable autovacuum subprocess?
#autovacuum_naptime = 60                # time between autovacuum runs, in secs
autovacuum_naptime = 600                # changed by CW 9/11/06 to minimize interference with application
autovacuum_vacuum_threshold = 1000      # min # of tuple updates before
                                        # vacuum
autovacuum_analyze_threshold = 500      # min # of tuple updates before
                                        # analyze
autovacuum_vacuum_scale_factor = 0.4    # fraction of rel size before
                                        # vacuum
autovacuum_analyze_scale_factor = 0.2   # fraction of rel size before
                                        # analyze
autovacuum_vacuum_cost_delay = 500      # default vacuum cost delay for
                                        # vacuum_cost_delay
autovacuum_vacuum_cost_limit = 200      # default vacuum cost limit for
                                        # vacuum_cost_limit

The table on which it seems to be struggling is updated
many times per second and has 7.2M rows over 15GB total.

I'm trying to figure out how often it would be napping
on that...?  Maybe we're just seeing a very conservative
vacuum on a large table...?

Ed

Re: autovac hung/blocked

From: "Ed L."
On Tuesday November 14 2006 1:02 pm, Ed L. wrote:
> On Tuesday November 14 2006 12:56 pm, Jim C. Nasby wrote:
> > You don't have the vacuum cost delay settings set
> > unreasonably high, do you?
>
> I'm not sure.  Here's what we're running:

Well, I think we clearly have an HPUX CPU bottleneck (long pri
queue, high cpu utilization, high user cpu %, lots of processes
"blocked on pri").

It seems to get worse and slows all queries down across the board
when autovac tries to vacuum a 15GB table.  I'm guessing this is
flushing the OS/DB caches, exacerbating the CPU bottleneck.

I'm also not sure what to do about it beyond the customer buying
some politically/financially expensive CPUs.  The table in
question appears to be the pathological case for vacuum:  very
large with lots of frequent UPDATEs.  It's essentially a log
table.  So the other options seem to be to figure out how to
partition the table so as to minimize the size of the data needing
vacuuming, or to redesign it, if possible, to replace UPDATEs with
INSERTs.  Other ideas?

Ed

Re: autovac hung/blocked

From: Alvaro Herrera
Ed L. wrote:

> Well, I think we clearly have an HPUX CPU bottleneck (long pri
> queue, high cpu utilization, high user cpu %, lots of processes
> "blocked on pri").
>
> It seems to get worse and slows all queries down across the board
> when autovac tries to vacuum a 15GB table.  I'm guessing this is
> flushing the OS/DB caches, exacerbating the CPU bottleneck.
>
> I'm also not sure what to do about it beyond the customer buying
> some politically/financially expensive CPUs.

I suggest turning autovac off for that particular table, and doing the
vacuum during off-peak hours.

> The table in
> question appears to be the pathological case for vacuum:  very
> large with lots of frequent UPDATEs.  It's essentially a log
> table.

A big log table where the log entries are being updated?  Certainly
sounds like a recipe for vacuum headaches.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: autovac hung/blocked

From: "Ed L."
On Wednesday November 15 2006 6:30 am, Alvaro Herrera wrote:
>
> > The table in
> > question appears to be the pathological case for vacuum:
> > very large with lots of frequent UPDATEs.  It's essentially
> > a log table.
>
> A big log table where the log entries are being updated?
> Certainly sounds like a recipe for vacuum headaches.

I'm curious to know how others are dealing with this problem, and
how pgsql might support this issue.

In our case, we have a 1.8GB OS cache, a 30GB DB cache, serving
around 200 transactions/second from a 110GB DB, and this
problematic table is 15GB on disk.  So when it is vacuumed, I
suspect it essentially flushes the OS cache and half the DB
cache, severely impacting performance in an already
cpu-bottlenecked machine.  I have attempted to adjust autovac
to spread out its I/O impact, but then it takes so long to run
that other smaller frequently-updated tables are not
vacuumed/analyzed in the meantime and performance starts to
suffer.

Suppose there simply are no "off-hours" periods when you can
vacuum a very large table with many frequent updates.  (There is
never a good time to flush the caches.)  How do you manage such
a table in a 24x7 environment?

One idea would be to partition the table somehow such that the
chunks getting vacuumed are much smaller and thus not such an
impact.  On the app side, I suppose we could break the table
into multiple tables on some dimension (time) to make the vacuum
impacts smaller.
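
As a rough sketch of the app-side version (made-up names, assuming
a timestamp column to split on):

CREATE TABLE log_2006_10 (LIKE log_current);
CREATE TABLE log_2006_11 (LIKE log_current);

CREATE VIEW log_all AS
    SELECT * FROM log_2006_10
    UNION ALL SELECT * FROM log_2006_11
    UNION ALL SELECT * FROM log_current;

-- each month's table stays small enough that vacuuming it
-- individually doesn't thrash the caches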

But a pgsql solution in the future would be nice.  I don't know
the pgsql code well, but what if the freespace map were divided
into smaller sectors such that individual sectors could be
vacuumed without having to hit the entire table?  Or even simply
breaking up the 15GB vacuum job into 1GB chunks with some
spacing between would help.  Of course, it'd be important to
keep the smaller tables vacuumed/analyzed as needed in the
in-between time.

I don't know what the best answer is here, just groping for
ideas.

TIA.
Ed

Re: autovac hung/blocked

From: Jim Nasby
On Nov 14, 2006, at 1:02 PM, Ed L. wrote:
> On Tuesday November 14 2006 12:56 pm, Jim C. Nasby wrote:
>> You don't have the vacuum cost delay settings set unreasonably
>> high, do you?
>
> I'm not sure.  Here's what we're running:
>
> #vacuum_cost_delay = 0                  # 0-1000 milliseconds
> #vacuum_cost_page_hit = 1               # 0-10000 credits
> #vacuum_cost_page_miss = 10             # 0-10000 credits
> #vacuum_cost_page_dirty = 20            # 0-10000 credits
> #vacuum_cost_limit = 200                # 0-10000 credits
> autovacuum = on                         # enable autovacuum subprocess?
> #autovacuum_naptime = 60                # time between autovacuum runs, in secs
> autovacuum_naptime = 600                # changed by CW 9/11/06 to minimize interference with application
> autovacuum_vacuum_threshold = 1000      # min # of tuple updates before
>                                         # vacuum
> autovacuum_analyze_threshold = 500      # min # of tuple updates before
>                                         # analyze
> autovacuum_vacuum_scale_factor = 0.4    # fraction of rel size before
>                                         # vacuum
> autovacuum_analyze_scale_factor = 0.2   # fraction of rel size before
>                                         # analyze
> autovacuum_vacuum_cost_delay = 500      # default vacuum cost delay for
>                                         # vacuum_cost_delay
> autovacuum_vacuum_cost_limit = 200      # default vacuum cost limit for
>                                         # vacuum_cost_limit
>
> The table on which it seems to be struggling is updated
> many times per second and has 7.2M rows over 15GB total.
>
> I'm trying to figure out how often it would be napping
> on that...?  Maybe we're just seeing a very conservative
> vacuum on a large table...?

Well, it's going to take some time to vacuum 15G, but it appears
there's also confusion about what some of these settings do.

_naptime simply controls how often autovacuum checks to see if
there's work to do. There's really no reason at all to set it to 10
minutes.

If you're worried about the impact vacuum has on your application,
you want to be using the *_cost_* settings. Those work by
incrementing a cost counter for every page that vacuum touches
(the vacuum_cost_page_* settings). Once that counter hits
vacuum_cost_limit, vacuum will stop what it's doing and sleep for
vacuum_cost_delay milliseconds.

So in broad terms, if you want to reduce the impact of
vacuum/autovacuum, you should increase
vacuum_cost_delay/autovacuum_vacuum_cost_delay to something
reasonable (20 ms seems to be a decent starting point). If you
monitor disk activity, you can adjust things so that you're at 90%
or so of your drive capability with vacuum and a normal workload
running. Depending on how much of your data is normally in cache,
you might want to increase both page_dirty and the cost_limit.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: autovac hung/blocked

From: Richard Huxton
Ed L. wrote:
> One idea would be to partition the table somehow such that the
> chunks getting vacuumed are much smaller and thus not such an
> impact.  On the app side, I suppose we could break the table
> into multiple tables on some dimension (time) to make the vacuum
> impacts smaller.

You're running on bigger datasets than I'm used to, but that would be my
approach. Did you notice the constraint partitioning introduced in 8.1?

http://www.postgresql.org/docs/8.1/static/ddl-partitioning.html#DDL-PARTITIONING-CONSTRAINT-EXCLUSION
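
A minimal sketch of the idea from those docs (made-up table/column
names and ranges):

CREATE TABLE logs_2006_11 (
    CHECK (logdate >= DATE '2006-11-01' AND logdate < DATE '2006-12-01')
) INHERITS (logs);

SET constraint_exclusion = on;
-- queries that filter on logdate can then skip the other partitions,
-- and each partition can be vacuumed on its own schedule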

--
   Richard Huxton
   Archonet Ltd

Re: autovac hung/blocked

From: Tom Lane
"Ed L." <pgsql@bluepolka.net> writes:
> On Tuesday November 14 2006 12:56 pm, Jim C. Nasby wrote:
>> You don't have the vacuum cost delay settings set unreasonably
>> high, do you?

> I'm not sure.  Here's what we're running:

> autovacuum_vacuum_cost_delay = 500      # default vacuum cost delay for
>                                         # vacuum_cost_delay
> autovacuum_vacuum_cost_limit = 200      # default vacuum cost limit for
>                                         # vacuum_cost_limit

That's unreasonably high --- you're telling autovac to do a little work
(about ten pages' worth) and then sleep for half a second.  No wonder it
seems to take forever on a big table.  Try autovacuum_vacuum_cost_delay
around 10 or 20 (msec) or so.
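
A back-of-envelope, assuming 8K pages that all get dirtied (cost 20
apiece):

-- pages in 15GB / pages per cost_limit cycle * sleep per cycle
SELECT (15 * 1024 * 1024 / 8)    -- ~1.97M 8K pages in 15GB
       / (200 / 20)              -- ~10 dirtied pages per cycle
       * 0.5 / 3600.0            -- 500 ms sleep per cycle, in hours
       AS hours_spent_sleeping;
-- comes out around 27 hours of pure sleep for a single pass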

            regards, tom lane

Re: autovac hung/blocked

From
"Ed L."
Date:
On Thursday November 16 2006 3:33 am, Richard Huxton wrote:
> Ed L. wrote:
> > One idea would be to partition the table somehow such that
> > the chunks getting vacuumed are much smaller and thus not
> > such an impact.  On the app side, I suppose we could break
> > the table into multiple tables on some dimension (time) to
> > make the vacuum impacts smaller.
>
> You're running on bigger datasets than I'm used to, but that
> would be my approach. Did you notice the constraint
> partitioning introduced in 8.1?
>
> http://www.postgresql.org/docs/8.1/static/ddl-partitioning.html#DDL-PARTITIONING-CONSTRAINT-EXCLUSION

Thanks for the tip.  We have avoided use of inheritance in order
to stay closer to the beaten path, but the partitioning
implementation ideas are useful.

Ed