Thread: Why do we let autovacuum give up?

Why do we let autovacuum give up?

From
"Joshua D. Drake"
Date:
Hello,

I have run into yet another situation where there was an assumption 
that autovacuum was keeping up and it wasn't. It was caused by 
autovacuum quitting because another process requested a lock.

In turn we ended up with a ton of bloat on pg_attribute, which caused 
all kinds of other issues (as can be expected).

The more I run into it, the more it seems like autovacuum should behave 
like vacuum, in that it gets precedence while it is running. First come, 
first served, as they say.

Thoughts?

JD
-- 
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms   a rose in the deeps of my heart. - W.B. Yeats



Re: Why do we let autovacuum give up?

From
Tom Lane
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:
> I have run into yet again another situation where there was an 
> assumption that autovacuum was keeping up and it wasn't. It was caused 
> by autovacuum quitting because another process requested a lock.

> In turn we received a ton of bloat on pg_attribute which caused all 
> kinds of other issues (as can be expected).

> The more I run into it, the more it seems like autovacuum should behave 
> like vacuum, in that it gets precedence when it is running. First come, 
> first serve as they say.

1. Back when it worked like that, things were worse.

2. What have you got that is requesting exclusive lock on pg_attribute?
That seems like a pretty unfriendly behavior in itself.
        regards, tom lane



Re: Why do we let autovacuum give up?

From
Josh Berkus
Date:
On 01/23/2014 12:34 PM, Joshua D. Drake wrote:
> 
> Hello,
> 
> I have run into yet again another situation where there was an
> assumption that autovacuum was keeping up and it wasn't. It was caused
> by autovacuum quitting because another process requested a lock.
> 
> In turn we received a ton of bloat on pg_attribute which caused all
> kinds of other issues (as can be expected).
> 
> The more I run into it, the more it seems like autovacuum should behave
> like vacuum, in that it gets precedence when it is running. First come,
> first serve as they say.
> 
> Thoughts?

If we let autovacuum block user activity, a lot more people would turn
it off.

Now, if you were to argue that we should have some way to monitor the
tables which autovac can never touch because of conflicts, I would agree
with you.
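
For what it's worth, here is a rough sketch of what can be watched today
with just the standard pg_stat_all_tables columns (the dead-tuple threshold
below is an arbitrary example, not a recommendation):

    -- Tables with lots of dead tuples that autovacuum hasn't visited lately.
    -- (pg_stat_all_tables includes system catalogs such as pg_attribute.)
    SELECT schemaname, relname, n_dead_tup, n_live_tup,
           last_autovacuum, autovacuum_count
    FROM   pg_stat_all_tables
    WHERE  n_dead_tup > 10000
    ORDER  BY n_dead_tup DESC;

What this can't show, of course, is how often autovacuum started on a table
and then gave up - that's the part that would need new instrumentation.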

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Why do we let autovacuum give up?

From
Harold Giménez
Date:
On Thu, Jan 23, 2014 at 12:53 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 01/23/2014 12:34 PM, Joshua D. Drake wrote:
>>
>> Hello,
>>
>> I have run into yet again another situation where there was an
>> assumption that autovacuum was keeping up and it wasn't. It was caused
>> by autovacuum quitting because another process requested a lock.
>>
>> In turn we received a ton of bloat on pg_attribute which caused all
>> kinds of other issues (as can be expected).
>>
>> The more I run into it, the more it seems like autovacuum should behave
>> like vacuum, in that it gets precedence when it is running. First come,
>> first serve as they say.
>>
>> Thoughts?
>
> If we let autovacuum block user activity, a lot more people would turn
> it off.
>
> Now, if you were to argue that we should have some way to monitor the
> tables which autovac can never touch because of conflicts, I would agree
> with you.

Agree completely. Easy ways to monitor this would be great. Once you
know there's a problem, tweaking autovacuum settings is hard and widely
misunderstood, and explaining how to do it effectively is a dark art
too.

-Harold



Re: Why do we let autovacuum give up?

From
Mark Kirkwood
Date:
On 24/01/14 09:49, Tom Lane wrote:
> 2. What have you got that is requesting exclusive lock on 
> pg_attribute? That seems like a pretty unfriendly behavior in itself. 
> regards, tom lane 

I've seen this sort of problem where every db session was busily 
creating temporary tables. I never got to find out *why* they needed to 
make so many, but it seemed like a bad idea.

Regards

Mark




Re: Why do we let autovacuum give up?

From
Robert Haas
Date:
On Thu, Jan 23, 2014 at 4:03 PM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
> On 24/01/14 09:49, Tom Lane wrote:
>> 2. What have you got that is requesting exclusive lock on pg_attribute?
>> That seems like a pretty unfriendly behavior in itself. regards, tom lane
>
> I've seen this sort of problem where every db session was busily creating
> temporary tables. I never got to the find *why* they needed to make so many,
> but it seemed like a bad idea.

But... how does that result in a vacuum-incompatible lock request
against pg_attribute?

I see that it'll insert lots of rows into pg_attribute, and maybe
later delete them, but none of that blocks vacuum.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why do we let autovacuum give up?

From
"Joshua D. Drake"
Date:
On 01/23/2014 01:03 PM, Mark Kirkwood wrote:
>
> On 24/01/14 09:49, Tom Lane wrote:
>> 2. What have you got that is requesting exclusive lock on
>> pg_attribute? That seems like a pretty unfriendly behavior in itself.
>> regards, tom lane
>
> I've seen this sort of problem where every db session was busily
> creating temporary tables. I never got to the find *why* they needed to
> make so many, but it seemed like a bad idea.

Yep... that's the one. They are creating lots and lots of temp tables.

JD



-- 
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms   a rose in the deeps of my heart. - W.B. Yeats



Re: Why do we let autovacuum give up?

From
Pavel Stehule
Date:
<p dir="ltr"><br /> Dne 23.1.2014 22:04 "Mark Kirkwood" <<a
href="mailto:mark.kirkwood@catalyst.net.nz">mark.kirkwood@catalyst.net.nz</a>>napsal(a):<br /> ><br /> > On
24/01/1409:49, Tom Lane wrote:<br /> >><br /> >> 2. What have you got that is requesting exclusive lock on
pg_attribute?That seems like a pretty unfriendly behavior in itself. regards, tom lane <br /> ><br /> ><br />
>I've seen this sort of problem where every db session was busily creating temporary tables. I never got to the find
*why*they needed to make so many, but it seemed like a bad idea.<br /> ><p dir="ltr">Our customer had same problem
with temp tables by intensively plpgsql functions. For higher load a temp tables are performance and stability killer.
Vacuumof pg attrib has very ugly impacts :(<p dir="ltr">Regars<p dir="ltr">Pavel<p dir="ltr">After redesign - without
tmptables - his applications works well.<p dir="ltr">We needs a global temp tables<br /><p dir="ltr">> Regards<br />
><br/> > Mark<br /> ><br /> ><br /> ><br /> ><br /> > -- <br /> > Sent via pgsql-hackers
mailinglist (<a href="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br /> > To make changes
toyour subscription:<br /> > <a
href="http://www.postgresql.org/mailpref/pgsql-hackers">http://www.postgresql.org/mailpref/pgsql-hackers</a><br/> 

Re: Why do we let autovacuum give up?

From
Tom Lane
Date:
Mark Kirkwood <mark.kirkwood@catalyst.net.nz> writes:
> On 24/01/14 09:49, Tom Lane wrote:
>> 2. What have you got that is requesting exclusive lock on 
>> pg_attribute? That seems like a pretty unfriendly behavior in itself. 

> I've seen this sort of problem where every db session was busily 
> creating temporary tables. I never got to the find *why* they needed to 
> make so many, but it seemed like a bad idea.

That shouldn't result in any table-level exclusive locks on system
catalogs, though.

[ thinks... ]  It's possible that what you saw is not the
kick-out-autovacuum-entirely behavior, but the behavior added in commit
bbb6e559c, whereby vacuum (auto or regular) will skip over pages that it
can't immediately get an exclusive buffer lock on.  On a heavily used
table, we might skip the same page repeatedly, so that dead tuples don't
get cleaned for a long time.

To add insult to injury, despite having done that, vacuum would reset the
pgstats dead-tuple count to zero, thus postponing the next autovacuum.
I think commit 115f41412 may have improved the situation, but I'd want
to see some testing of this theory before I'd propose back-patching it.
        regards, tom lane



Re: Why do we let autovacuum give up?

From
Mark Kirkwood
Date:
On 24/01/14 10:09, Robert Haas wrote:
> On Thu, Jan 23, 2014 at 4:03 PM, Mark Kirkwood
> <mark.kirkwood@catalyst.net.nz> wrote:
>> On 24/01/14 09:49, Tom Lane wrote:
>>> 2. What have you got that is requesting exclusive lock on pg_attribute?
>>> That seems like a pretty unfriendly behavior in itself. regards, tom lane
>> I've seen this sort of problem where every db session was busily creating
>> temporary tables. I never got to the find *why* they needed to make so many,
>> but it seemed like a bad idea.
> But... how does that result on a vacuum-incompatible lock request
> against pg_attribute?
>
> I see that it'll insert lots of rows into pg_attribute, and maybe
> later delete them, but none of that blocks vacuum.
>

That was my thought too - if I see it happening again here (was a year 
or so ago that I saw some serious pg_attribute bloat) I'll dig deeper.

regards

Mark



Re: Why do we let autovacuum give up?

From
Mark Kirkwood
Date:
On 24/01/14 10:16, Mark Kirkwood wrote:
> On 24/01/14 10:09, Robert Haas wrote:
>> On Thu, Jan 23, 2014 at 4:03 PM, Mark Kirkwood
>> <mark.kirkwood@catalyst.net.nz> wrote:
>>> On 24/01/14 09:49, Tom Lane wrote:
>>>> 2. What have you got that is requesting exclusive lock on
>>>> pg_attribute?
>>>> That seems like a pretty unfriendly behavior in itself. regards,
>>>> tom lane
>>> I've seen this sort of problem where every db session was busily
>>> creating
>>> temporary tables. I never got to the find *why* they needed to make
>>> so many,
>>> but it seemed like a bad idea.
>> But... how does that result on a vacuum-incompatible lock request
>> against pg_attribute?
>>
>> I see that it'll insert lots of rows into pg_attribute, and maybe
>> later delete them, but none of that blocks vacuum.
>>
>
> That was my thought too - if I see it happening again here (was a year
> or so ago that I saw some serious pg_attribute bloat) I'll dig deeper.
>
>

Actually not much digging required. Running the attached script via
pgbench (8 sessions) against a default-configured Postgres 8.4 sees
pg_attribute grow to 1 GB after about 15 minutes.


Attachment
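
[The attachment isn't reproduced in this archive. As an illustration only, a
minimal pgbench custom script exercising this kind of temp-table churn -
table shape and row count invented here, not taken from Mark's script -
could look like:

    -- temp_churn.sql: each transaction creates and drops a temp table,
    -- leaving dead rows behind in pg_attribute (one per column plus the
    -- system columns).
    BEGIN;
    CREATE TEMP TABLE churn_tmp (id int, payload text) ON COMMIT DROP;
    INSERT INTO churn_tmp SELECT g, md5(g::text) FROM generate_series(1, 100) g;
    SELECT count(*) FROM churn_tmp;
    COMMIT;

Run with something like "pgbench -n -c 8 -T 900 -f temp_churn.sql", and watch
the catalog with "SELECT pg_size_pretty(pg_total_relation_size('pg_attribute'));".]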

Re: Why do we let autovacuum give up?

From
Magnus Hagander
Date:
On Thu, Jan 23, 2014 at 10:00 PM, Harold Giménez <harold@heroku.com> wrote:
> On Thu, Jan 23, 2014 at 12:53 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> On 01/23/2014 12:34 PM, Joshua D. Drake wrote:
>>>
>>> Hello,
>>>
>>> I have run into yet again another situation where there was an
>>> assumption that autovacuum was keeping up and it wasn't. It was caused
>>> by autovacuum quitting because another process requested a lock.
>>>
>>> In turn we received a ton of bloat on pg_attribute which caused all
>>> kinds of other issues (as can be expected).
>>>
>>> The more I run into it, the more it seems like autovacuum should behave
>>> like vacuum, in that it gets precedence when it is running. First come,
>>> first serve as they say.
>>>
>>> Thoughts?
>>
>> If we let autovacuum block user activity, a lot more people would turn
>> it off.
>>
>> Now, if you were to argue that we should have some way to monitor the
>> tables which autovac can never touch because of conflicts, I would agree
>> with you.
>
> Agree completely. Easy ways to monitor this would be great. Once you
> know there's a problem, tweaking autovacuum settings is very hard and
> misunderstood, and explaining how to be effective at it is a dark art
> too.

FWIW, I have a patch around somewhere that I never cleaned up properly for
submission, which simply added a counter to pg_stat_user_tables indicating
how many times vacuum had aborted on that specific table. If that's enough
info (it was for my case) to cover this case, I can try to dig it out again
and clean it up...

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: Why do we let autovacuum give up?

From
Josh Berkus
Date:
On 01/23/2014 02:17 PM, Magnus Hagander wrote:
> FWIW, I have a patch around somewhere that I never cleaned up properly for
> submissions that simply added a counter to pg_stat_user_tables indicating
> how many times vacuum had aborted on that specific table. If that's enough
> info  (it was for my case) to cover this case, I can try to dig it out
> again and clean it up...

It would be 100% more information than we currently have.  How much more
difficult would it be to count completed autovacuums as well?  It's
really the ratio of the two which matters ...

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Why do we let autovacuum give up?

From
Josh Berkus
Date:
On 01/23/2014 02:55 PM, Josh Berkus wrote:
> On 01/23/2014 02:17 PM, Magnus Hagander wrote:
>> FWIW, I have a patch around somewhere that I never cleaned up properly for
>> submissions that simply added a counter to pg_stat_user_tables indicating
>> how many times vacuum had aborted on that specific table. If that's enough
>> info  (it was for my case) to cover this case, I can try to dig it out
>> again and clean it up...
> 
> It would be 100% more information than we currently have.  How much more
> difficult would it be to count completed autovacuums as well?  It's
> really the ratio of the two which matters ...

Actually, now that I think about it, the ratio of the two doesn't matter
as much as whether the most recent autovacuum aborted or not.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Why do we let autovacuum give up?

From
Jeff Janes
Date:
On Thu, Jan 23, 2014 at 1:41 PM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
> On 24/01/14 10:16, Mark Kirkwood wrote:
>> On 24/01/14 10:09, Robert Haas wrote:
>>> On Thu, Jan 23, 2014 at 4:03 PM, Mark Kirkwood
>>> <mark.kirkwood@catalyst.net.nz> wrote:
>>>> On 24/01/14 09:49, Tom Lane wrote:
>>>>> 2. What have you got that is requesting exclusive lock on pg_attribute?
>>>>> That seems like a pretty unfriendly behavior in itself. regards, tom lane
>>>> I've seen this sort of problem where every db session was busily creating
>>>> temporary tables. I never got to the find *why* they needed to make so many,
>>>> but it seemed like a bad idea.
>>> But... how does that result on a vacuum-incompatible lock request
>>> against pg_attribute?
>>>
>>> I see that it'll insert lots of rows into pg_attribute, and maybe
>>> later delete them, but none of that blocks vacuum.
>>
>> That was my thought too - if I see it happening again here (was a year or so ago that I saw some serious pg_attribute bloat) I'll dig deeper.
>
> Actually not much digging required. Running the attached script via pgbench (8 sessions) against a default configured postgres 8.4 sees pg_attribute get to 1G after about 15 minutes.

At that rate, with default throttling, it will be a close race whether autovac can vacuum pages as fast as they are being added.  Even if it never gets cancelled, it might not ever finish.

Cheers,

Jeff 

Re: Why do we let autovacuum give up?

From
Mark Kirkwood
Date:
On 24/01/14 12:13, Jeff Janes wrote:
> On Thu, Jan 23, 2014 at 1:41 PM, Mark Kirkwood <
> mark.kirkwood@catalyst.net.nz> wrote:
>
>> On 24/01/14 10:16, Mark Kirkwood wrote:
>>
>>> On 24/01/14 10:09, Robert Haas wrote:
>>>
>>>> On Thu, Jan 23, 2014 at 4:03 PM, Mark Kirkwood
>>>> <mark.kirkwood@catalyst.net.nz> wrote:
>>>>
>>>>> On 24/01/14 09:49, Tom Lane wrote:
>>>>>
>>>>>> 2. What have you got that is requesting exclusive lock on pg_attribute?
>>>>>> That seems like a pretty unfriendly behavior in itself. regards, tom
>>>>>> lane
>>>>>>
>>>>> I've seen this sort of problem where every db session was busily
>>>>> creating
>>>>> temporary tables. I never got to the find *why* they needed to make so
>>>>> many,
>>>>> but it seemed like a bad idea.
>>>>>
>>>> But... how does that result on a vacuum-incompatible lock request
>>>> against pg_attribute?
>>>>
>>>> I see that it'll insert lots of rows into pg_attribute, and maybe
>>>> later delete them, but none of that blocks vacuum.
>>>>
>>>>
>>> That was my thought too - if I see it happening again here (was a year or
>>> so ago that I saw some serious pg_attribute bloat) I'll dig deeper.
>>>
>>>
>>>
>> Actually not much digging required. Running the attached script via
>> pgbench (8 sessions) against a default configured postgres 8.4 sees
>> pg_attribute get to 1G after about 15 minutes.
>>
> At that rate, with default throttling, it will be a close race whether
> autovac can vacuum pages as fast as they are being added.  Even if it never
> gets cancelled, it might not ever finish.
>

Yes - I should have set the cost delay to 0 first (checking that now).




Re: Why do we let autovacuum give up?

From
Mark Kirkwood
Date:
On 24/01/14 12:28, Mark Kirkwood wrote:
> On 24/01/14 12:13, Jeff Janes wrote:
>> On Thu, Jan 23, 2014 at 1:41 PM, Mark Kirkwood <
>> mark.kirkwood@catalyst.net.nz> wrote:
>>
>>> On 24/01/14 10:16, Mark Kirkwood wrote:
>>>
>>>> On 24/01/14 10:09, Robert Haas wrote:
>>>>
>>>>> On Thu, Jan 23, 2014 at 4:03 PM, Mark Kirkwood
>>>>> <mark.kirkwood@catalyst.net.nz> wrote:
>>>>>
>>>>>> On 24/01/14 09:49, Tom Lane wrote:
>>>>>>
>>>>>>> 2. What have you got that is requesting exclusive lock on 
>>>>>>> pg_attribute?
>>>>>>> That seems like a pretty unfriendly behavior in itself. regards, 
>>>>>>> tom
>>>>>>> lane
>>>>>>>
>>>>>> I've seen this sort of problem where every db session was busily
>>>>>> creating
>>>>>> temporary tables. I never got to the find *why* they needed to 
>>>>>> make so
>>>>>> many,
>>>>>> but it seemed like a bad idea.
>>>>>>
>>>>> But... how does that result on a vacuum-incompatible lock request
>>>>> against pg_attribute?
>>>>>
>>>>> I see that it'll insert lots of rows into pg_attribute, and maybe
>>>>> later delete them, but none of that blocks vacuum.
>>>>>
>>>>>
>>>> That was my thought too - if I see it happening again here (was a 
>>>> year or
>>>> so ago that I saw some serious pg_attribute bloat) I'll dig deeper.
>>>>
>>>>
>>>>
>>> Actually not much digging required. Running the attached script via
>>> pgbench (8 sessions) against a default configured postgres 8.4 sees
>>> pg_attribute get to 1G after about 15 minutes.
>>>
>> At that rate, with default throttling, it will be a close race whether
>> autovac can vacuum pages as fast as they are being added.  Even if it 
>> never
>> gets cancelled, it might not ever finish.
>>
>
> Yes - I should have set the cost delay to 0 first (checking that now).
>
>
>

Doing that (and a few other autovac tweaks):

autovacuum_max_workers = 4
autovacuum_naptime = 10s
autovacuum_vacuum_scale_factor = 0.1
autovacuum_analyze_scale_factor = 0.1
autovacuum_vacuum_cost_delay = 0ms

Stops excessive bloat - clearly autovacuum *is* able to vacuum 
pg_attribute in this case. Back to the drawing board for a test case.

Regards

Mark





Re: Why do we let autovacuum give up?

From
Andres Freund
Date:
On 2014-01-24 12:49:57 +1300, Mark Kirkwood wrote:
> autovacuum_max_workers = 4
> autovacuum_naptime = 10s
> autovacuum_vacuum_scale_factor = 0.1
> autovacuum_analyze_scale_factor = 0.1
> autovacuum_vacuum_cost_delay = 0ms
> 
> Stops excessive bloat - clearly autovacuum *is* able to vacuum pg_attribute
> in this case. Back to drawing board for a test case.

Well, I think quite a few people don't realize it might be necessary to
tune autovac on busy workloads, as may very well be the case in
Josh's situation.

Greetings,

Andres Freund

-- 
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Why do we let autovacuum give up?

From
Andres Freund
Date:
On 2014-01-23 16:15:50 -0500, Tom Lane wrote:
> [ thinks... ]  It's possible that what you saw is not the
> kick-out-autovacuum-entirely behavior, but the behavior added in commit
> bbb6e559c, whereby vacuum (auto or regular) will skip over pages that it
> can't immediately get an exclusive buffer lock on.  On a heavily used
> table, we might skip the same page repeatedly, so that dead tuples don't
> get cleaned for a long time.

I don't think it's too likely as an explanation here. Such workloads are
likely to fill a page with only dead tuples, right? Once all the tuples are
safely dead they will be killed from the btree, which should cause the
page not to be visited anymore and thus make it safely vacuumable.

Greetings,

Andres Freund

-- 
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Why do we let autovacuum give up?

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-01-23 16:15:50 -0500, Tom Lane wrote:
>> [ thinks... ]  It's possible that what you saw is not the
>> kick-out-autovacuum-entirely behavior, but the behavior added in commit
>> bbb6e559c, whereby vacuum (auto or regular) will skip over pages that it
>> can't immediately get an exclusive buffer lock on.  On a heavily used
>> table, we might skip the same page repeatedly, so that dead tuples don't
>> get cleaned for a long time.

> I don't think it's too likely as an explanation here. Such workloads are
> likely to fill a page with only dead tuples, right? Once all tuples are
> safely dead they will be killed from the btree which should cause the
> page not to be visited anymore and thus safely vacuumable.

I added some instrumentation to vacuumlazy.c to count the number of pages
skipped in this way.  You're right, it seems to be negligible, at least
with Mark's test case.  I saw at most two pages skipped per vacuum, and
usually none; so there's no way that a whole lot of tuples are going
unvacuumed because of this.  (I wonder though if we ought to add such
counting as a permanent feature ...)

I concur with the other reports that the main problem in this test case is
just that the default cost delay settings throttle autovacuum so hard that
it has no chance of keeping up.  If I reduce autovacuum_vacuum_cost_delay
from the default 20ms to 2ms, it seems to keep up quite nicely, on my
machine anyway.  Probably other combinations of changes would do it too.

Perhaps we need to back off the default cost delay settings a bit?
We've certainly heard more than enough reports of table bloat in
heavily-updated tables.  A system that wasn't hitting the updates as hard
as it could might not need this, but on the other hand it probably
wouldn't miss the I/O cycles from a more aggressive autovacuum, either.
        regards, tom lane



Re: Why do we let autovacuum give up?

From
Andres Freund
Date:
On 2014-01-23 19:29:23 -0500, Tom Lane wrote:
> I saw at most two pages skipped per vacuum, and
> usually none; so there's no way that a whole lot of tuples are going
> unvacuumed because of this.  (I wonder though if we ought to add such
> counting as a permanent feature ...)

I generally think we need to work a bit on the data reported back by
vacuum: adding more data, and also making the output when using
autovacuum more consistent with what VACUUM VERBOSE reports. The latter
curiously often has less detail than autovacuum.
I had hoped to get to that for 9.4, but it doesn't look like it.


> I concur with the other reports that the main problem in this test case is
> just that the default cost delay settings throttle autovacuum so hard that
> it has no chance of keeping up.  If I reduce autovacuum_vacuum_cost_delay
> from the default 20ms to 2ms, it seems to keep up quite nicely, on my
> machine anyway.  Probably other combinations of changes would do it too.

> Perhaps we need to back off the default cost delay settings a bit?
> We've certainly heard more than enough reports of table bloat in
> heavily-updated tables.  A system that wasn't hitting the updates as hard
> as it could might not need this, but on the other hand it probably
> wouldn't miss the I/O cycles from a more aggressive autovacuum, either.

Yes, I think adjusting the default makes sense, most setups that have
enough activity that costing plays a role have to greatly increase the
values. I'd rather increase the cost limit than reduce cost delay so
drastically though, but that's admittedly just gut feeling.

Greetings,

Andres Freund

-- 
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Why do we let autovacuum give up?

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-01-23 19:29:23 -0500, Tom Lane wrote:
>> I concur with the other reports that the main problem in this test case is
>> just that the default cost delay settings throttle autovacuum so hard that
>> it has no chance of keeping up.  If I reduce autovacuum_vacuum_cost_delay
>> from the default 20ms to 2ms, it seems to keep up quite nicely, on my
>> machine anyway.  Probably other combinations of changes would do it too.

>> Perhaps we need to back off the default cost delay settings a bit?
>> We've certainly heard more than enough reports of table bloat in
>> heavily-updated tables.  A system that wasn't hitting the updates as hard
>> as it could might not need this, but on the other hand it probably
>> wouldn't miss the I/O cycles from a more aggressive autovacuum, either.

> Yes, I think adjusting the default makes sense, most setups that have
> enough activity that costing plays a role have to greatly increase the
> values. I'd rather increase the cost limit than reduce cost delay so
> drastically though, but that's admittedly just gut feeling.

Well, I didn't experiment with intermediate values, I was just trying
to test the theory that autovac could keep up given less-extreme
throttling.  I'm not taking any position on just where we need to set
the values, only that what we've got is probably too extreme.
        regards, tom lane



Re: Why do we let autovacuum give up?

From
Craig Ringer
Date:
On 01/24/2014 07:52 AM, Andres Freund wrote:
> On 2014-01-24 12:49:57 +1300, Mark Kirkwood wrote:
>> autovacuum_max_workers = 4
>> autovacuum_naptime = 10s
>> autovacuum_vacuum_scale_factor = 0.1
>> autovacuum_analyze_scale_factor = 0.1
>> autovacuum_vacuum_cost_delay = 0ms
>>
>> Stops excessive bloat - clearly autovacuum *is* able to vacuum pg_attribute
>> in this case. Back to drawing board for a test case.
> 
> Well, I think quite many people don't realize it might be necessary to
> tune autovac on busy workloads. As it very well might be the case in
> Josh's case.

Oh, lots of people realise it's a good idea to tune autovac on busy
workloads.

They just do it in the wrong direction, making it run less often and
less aggressively, causing more bloat, and making their problem worse.

I've seen this enough times that I'm starting to think the autovacuum
tuning knobs need a child safety lock ;-)

More seriously, people don't understand autovacuum, how it works, or why
they need it. They notice it when things are already bad, see that it's
doing lots of work and doing lots of I/O that competes with queries, and
turn it off to "solve" the problem.

I'm not sure how to tackle that.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Why do we let autovacuum give up?

From
Claudio Freire
Date:
On Thu, Jan 23, 2014 at 10:38 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
>>>
>>> Stops excessive bloat - clearly autovacuum *is* able to vacuum pg_attribute
>>> in this case. Back to drawing board for a test case.
>>
>> Well, I think quite many people don't realize it might be necessary to
>> tune autovac on busy workloads. As it very well might be the case in
>> Josh's case.
>
> Oh, lots of people realise it's a good idea to tune autovac on busy
> workloads.
>
> They just do it in the wrong direction, making it run less often and
> less aggressively, causing more bloat, and making their problem worse.
>
> I've seen this enough times that I'm starting to think the autovauum
> tuning knobs need a child safety lock ;-)
>
> More seriously, people don't understand autovacuum, how it works, or why
> they need it. They notice it when things are already bad, see that it's
> doing lots of work and doing lots of I/O that competes with queries, and
> turn it off to "solve" the problem.


AFAIK, tuning down autovacuum is common advice **when combined with
manually scheduled vacuuming**.

The problem with autovacuum is that it always picks the wrong time to
work - that is, when the DB is most active - because statistically
that's when the thresholds are crossed.

If you ask me, I'd like autovac to know when not to run (or rather
wait a bit, not forever), perhaps by checking load factors or some
other tell-tale of an already-saturated I/O system.



Re: Why do we let autovacuum give up?

From
Alvaro Herrera
Date:
Claudio Freire escribió:

> If you ask me, I'd like autovac to know when not to run (or rather
> wait a bit, not forever), perhaps by checking load factors or some
> other tell-tale of an already-saturated I/O system.

We had a proposed design to tell autovac when not to run (or rather,
when to switch settings very high so that in practice it'd never run).
At some point somebody said "but we can just change autovacuum=off in
postgresql.conf via crontab when the high load period starts, and turn
it back on afterwards" --- and that was the end of it.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Why do we let autovacuum give up?

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Claudio Freire escribió:
>> If you ask me, I'd like autovac to know when not to run (or rather
>> wait a bit, not forever), perhaps by checking load factors or some
>> other tell-tale of an already-saturated I/O system.

> We had a proposed design to tell autovac when not to run (or rather,
> when to switch settings very high so that in practice it'd never run).
> At some point somebody said "but we can just change autovacuum=off in
> postgresql.conf via crontab when the high load period starts, and turn
> it back on afterwards" --- and that was the end of it.

The hard part of this is that shutting down autovacuum during heavy
load may be exactly the wrong thing to do.
        regards, tom lane



Re: Why do we let autovacuum give up?

From
Craig Ringer
Date:
On 01/24/2014 11:32 AM, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> Claudio Freire escribió:
>>> If you ask me, I'd like autovac to know when not to run (or rather
>>> wait a bit, not forever), perhaps by checking load factors or some
>>> other tell-tale of an already-saturated I/O system.
> 
>> We had a proposed design to tell autovac when not to run (or rather,
>> when to switch settings very high so that in practice it'd never run).
>> At some point somebody said "but we can just change autovacuum=off in
>> postgresql.conf via crontab when the high load period starts, and turn
>> it back on afterwards" --- and that was the end of it.
> 
> The hard part of this is that shutting down autovacuum during heavy
> load may be exactly the wrong thing to do.

Yep. In fact, it may be appropriate to limit or stop autovacuum's work
on some big tables, while pushing its activity even higher for small,
high churn tables.

If you stop autovacuum on a message-queue system when load gets high,
you'll get a giant messy bloat explosion.


-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Why do we let autovacuum give up?

From
Alvaro Herrera
Date:
Craig Ringer escribió:
> On 01/24/2014 11:32 AM, Tom Lane wrote:

> > The hard part of this is that shutting down autovacuum during heavy
> > load may be exactly the wrong thing to do.
> 
> Yep. In fact, it may be appropriate to limit or stop autovacuum's work
> on some big tables, while pushing its activity even higher for small,
> high churn tables.
> 
> If you stop autovacuum on a message-queue system when load gets high,
> you'll get a giant messy bloat explosion.

The design we had was to have table groups, each with their own set of
custom parameters, and they would change depending on schedule.  You
could keep the queue tables in one group which would not change
parameters, and only change the rest.

But as I said, it was never fully implemented.  (We had a partial patch
from a GSoC project, IIRC.)  I don't have the cycles to implement it
now, anyway.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Why do we let autovacuum give up?

From
Claudio Freire
Date:
On Fri, Jan 24, 2014 at 12:33 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 01/24/2014 11:32 AM, Tom Lane wrote:
>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>>> Claudio Freire escribió:
>>>> If you ask me, I'd like autovac to know when not to run (or rather
>>>> wait a bit, not forever), perhaps by checking load factors or some
>>>> other tell-tale of an already-saturated I/O system.
>>
>>> We had a proposed design to tell autovac when not to run (or rather,
>>> when to switch settings very high so that in practice it'd never run).
>>> At some point somebody said "but we can just change autovacuum=off in
>>> postgresql.conf via crontab when the high load period starts, and turn
>>> it back on afterwards" --- and that was the end of it.
>>
>> The hard part of this is that shutting down autovacuum during heavy
>> load may be exactly the wrong thing to do.
>
> Yep. In fact, it may be appropriate to limit or stop autovacuum's work
> on some big tables, while pushing its activity even higher for small,
> high churn tables.
>
> If you stop autovacuum on a message-queue system when load gets high,
> you'll get a giant messy bloat explosion.

A message queue has a steady state and needs way more than autovacuum.
A table used as a message queue would need a wholly dedicated
autovacuum worker to be constantly vacuuming. It's certainly an
extreme example.

But normal tables are much bigger than their active set, so vacuuming,
which walks all those cold gigabytes, tends to wreak havoc with I/O
performance. Doing it in peak hours, which is autovacuum's preferred
time, is terrible. Delaying autovacuum for a while doesn't sound like
such a disastrous thing.

In essence, I'm talking about two thresholds: a "vacuum in the
background" threshold, and an "omfg this table is a mess, vacuum now now
now" threshold. The background part is not quite straightforward
though. As in, what counts as background?



Re: Why do we let autovacuum give up?

From
Josh Berkus
Date:
On 01/23/2014 07:22 PM, Alvaro Herrera wrote:
>> If you ask me, I'd like autovac to know when not to run (or rather
>> > wait a bit, not forever), perhaps by checking load factors or some
>> > other tell-tale of an already-saturated I/O system.
> We had a proposed design to tell autovac when not to run (or rather,
> when to switch settings very high so that in practice it'd never run).
> At some point somebody said "but we can just change autovacuum=off in
> postgresql.conf via crontab when the high load period starts, and turn
> it back on afterwards" --- and that was the end of it.

Anything which depends on a timing-based feedback loop is going to be
hopeless.  Saying "autovac shouldn't run if load is high" sounds like a
simple statement, until you actually try to implement it.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Why do we let autovacuum give up?

From
Claudio Freire
Date:
On Fri, Jan 24, 2014 at 2:44 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 01/23/2014 07:22 PM, Alvaro Herrera wrote:
>>> If you ask me, I'd like autovac to know when not to run (or rather
>>> > wait a bit, not forever), perhaps by checking load factors or some
>>> > other tell-tale of an already-saturated I/O system.
>> We had a proposed design to tell autovac when not to run (or rather,
>> when to switch settings very high so that in practice it'd never run).
>> At some point somebody said "but we can just change autovacuum=off in
>> postgresql.conf via crontab when the high load period starts, and turn
>> it back on afterwards" --- and that was the end of it.
>
> Anything which depends on a timing-based feedback loop is going to be
> hopeless.  Saying "autovac shouldn't run if load is high" sounds like a
> simple statement, until you actually try to implement it.

Exactly.

But people tuning autovac down are doing exactly that: trying to tune
autovac to background-only work. When doing that, they *must* then
launch foreground vacuums at times they deem sensible.

So the problem is not people tuning down autovacuum, but people
forgetting to vacuum explicitly after doing so.



Re: Why do we let autovacuum give up?

From
Robert Haas
Date:
On Thu, Jan 23, 2014 at 7:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> On 2014-01-23 19:29:23 -0500, Tom Lane wrote:
>>> I concur with the other reports that the main problem in this test case is
>>> just that the default cost delay settings throttle autovacuum so hard that
>>> it has no chance of keeping up.  If I reduce autovacuum_vacuum_cost_delay
>>> from the default 20ms to 2ms, it seems to keep up quite nicely, on my
>>> machine anyway.  Probably other combinations of changes would do it too.
>
>>> Perhaps we need to back off the default cost delay settings a bit?
>>> We've certainly heard more than enough reports of table bloat in
>>> heavily-updated tables.  A system that wasn't hitting the updates as hard
>>> as it could might not need this, but on the other hand it probably
>>> wouldn't miss the I/O cycles from a more aggressive autovacuum, either.
>
>> Yes, I think adjusting the default makes sense, most setups that have
>> enough activity that costing plays a role have to greatly increase the
>> values. I'd rather increase the cost limit than reduce cost delay so
>> drastically though, but that's admittedly just gut feeling.
>
> Well, I didn't experiment with intermediate values, I was just trying
> to test the theory that autovac could keep up given less-extreme
> throttling.  I'm not taking any position on just where we need to set
> the values, only that what we've got is probably too extreme.

So, Greg Smith proposed what I think is a very useful methodology for
assessing settings in this area: figure out what it works out to in
MB/s.  If we assume we're going to read and dirty every page we
vacuum, and that this will take negligible time of itself so that the
work is dominated by the sleeps, the default settings work out to
200/(10 + 20) pages every 20ms, or 2.67MB/s.  Obviously, the rate will
be 3x higher if the pages don't need to be dirtied, and higher still
if they're all in cache, but considering the way the visibility map
works, it seems like a good bet that we WILL need to dirty most of the
pages that we look at - either they've got dead tuples and need
clean-up, or they don't and need to be marked all-visible.
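
To spell out the arithmetic (defaults: vacuum_cost_page_miss = 10,
vacuum_cost_page_dirty = 20, a cost limit of 200, a 20ms sleep, 8kB pages;
the query below is just a sketch of the calculation, nothing more):

    SELECT round( (200.0 / (10 + 20))   -- pages processed per sleep cycle
                * (1000 / 20)           -- sleep cycles per second
                * 8                     -- kB per 8kB page
                / 1000                  -- kB -> MB
                , 2);                   -- ~2.67 MB/s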

A corollary of this is that if you're dirtying heap pages faster than
a few megabytes per second, autovacuum, at least with default
settings, is not going to keep up.  And if you assume that each write
transaction dirties at least one heap page, any volume of write
transactions in excess of a few hundred per second will meet that
criterion.  Which is really not that much; a single core can do over
1000 tps with synchronous_commit=off, or if there's a BBWC that can
absorb it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company