Thread: autovacuum not prioritising for-wraparound tables

autovacuum not prioritising for-wraparound tables

From: Alvaro Herrera
Hi,

I have a bug pending that autovacuum fails to give priority to
for-wraparound tables.  When xid consumption rate is high and dead tuple
creation is also high, it is possible that some tables are waiting for
for-wraparound vacuums that don't complete in time because the workers
are busy processing other tables that have accumulated dead tuples; the
system is then down because it's too near the Xid wraparound horizon.
Apparently this is particularly notorious in connection with TOAST
tables, because those are always put in the tables-to-process list after
regular tables.

(As far as I recall, this was already reported elsewhere, but so far I
have been unable to find the discussion in the archives.  Pointers
appreciated.)

So here's a small, backpatchable patch that sorts the list of tables to
process (not all that much tested yet).  Tables which have the
wraparound flag set are processed before those that are not.  Other
than this criterion, the order is not defined.
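
A minimal sketch of such a comparator might look like this (names are
invented for illustration, not taken from the attached patch):

#include <stdlib.h>

/* One entry in the hypothetical tables-to-process list. */
typedef struct
{
    int     for_wraparound;     /* past autovacuum_freeze_max_age? */
    /* ... other per-table bookkeeping ... */
} av_entry;

/*
 * Wraparound tables sort first; within each group the order is
 * deliberately left undefined, as described above.
 */
static int
av_compare(const void *a, const void *b)
{
    const av_entry *ea = a;
    const av_entry *eb = b;

    if (ea->for_wraparound != eb->for_wraparound)
        return ea->for_wraparound ? -1 : 1;
    return 0;
}

/* usage: qsort(list, nentries, sizeof(av_entry), av_compare); */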

Now we could implement this differently, and maybe more simply (say by
keeping two lists of tables to process, one with for-wraparound tables
and one with the rest) but this way it is simpler to add additional
sorting criteria later: say within each category we could first process
smaller tables that have more dead tuples.

My intention is to clean this up and backpatch to all live branches.
Comments?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: autovacuum not prioritising for-wraparound tables

From: Heikki Linnakangas
On 24.01.2013 23:57, Alvaro Herrera wrote:
> I have a bug pending that autovacuum fails to give priority to
> for-wraparound tables.  When xid consumption rate is high and dead tuple
> creation is also high, it is possible that some tables are waiting for
> for-wraparound vacuums that don't complete in time because the workers
> are busy processing other tables that have accumulated dead tuples; the
> system is then down because it's too near the Xid wraparound horizon.
> Apparently this is particularly notorious in connection with TOAST
> tables, because those are always put in the tables-to-process list after
> regular tables.
>
> (As far as I recall, this was already reported elsewhere, but so far I
> have been unable to find the discussion in the archives.  Pointers
> appreciated.)
>
> So here's a small, backpatchable patch that sorts the list of tables to
> process (not all that much tested yet).  Tables which have the
> wraparound flag set are processed before those that are not.  Other
> than this criterion, the order is not defined.
>
> Now we could implement this differently, and maybe more simply (say by
> keeping two lists of tables to process, one with for-wraparound tables
> and one with the rest) but this way it is simpler to add additional
> sorting criteria later: say within each category we could first process
> smaller tables that have more dead tuples.
>
> My intention is to clean this up and backpatch to all live branches.
> Comments?

Backpatching sounds a bit scary. It's not a clear-cut bug, it's just 
that autovacuum could be smarter about its priorities. There are other 
ways you can still bump into the xid-wraparound issue, even with this patch.

- Heikki



Re: autovacuum not prioritising for-wraparound tables

From: Christopher Browne
On Thu, Jan 24, 2013 at 5:22 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Backpatching sounds a bit scary. It's not a clear-cut bug, it's just that
> autovacuum could be smarter about its priorities. There are other ways you
> can still bump into the xid-wraparound issue, even with this patch.

I don't think this is a single-priority issue.  It's *also* crucial
that small tables with high "tuple attrition rates" get vacuumed
extremely frequently; your system will bog down, albeit in a different
way, if the small tables don't get vacuumed enough.

This seems to me to involve multiple competing priorities where the
main solution *I* can think of is to have multiple backends doing
autovacuum, and assigning some to XID activity and others to the
"small, needs vacuuming frequently" tables.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"



Re: autovacuum not prioritising for-wraparound tables

From: Tom Lane
Christopher Browne <cbbrowne@gmail.com> writes:
> On Thu, Jan 24, 2013 at 5:22 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Backpatching sounds a bit scary. It's not a clear-cut bug, it's just that
>> autovacuum could be smarter about its priorities. There are other ways you
>> can still bump into the xid-wraparound issue, even with this patch.

> I don't think this is a single-priority issue.  It's *also* crucial
> that small tables with high "tuple attrition rates" get vacuumed
> extremely frequently; your system will bog down, albeit in a different
> way, if the small tables don't get vacuumed enough.

Yeah.  Another problem with a simple-minded priority arrangement is that
it might cause some tables to get starved for service because workers
keep on choosing other ones; we have to be sure the sorting rule is
designed to prevent that.

As posted, what we've got here is sorting on a boolean condition, with
the behavior within each group totally up to the whims of qsort().  That
seems especially dangerous since the priority order is mostly undefined.

I was a bit surprised that Alvaro didn't propose sorting by the age of
relfrozenxid, at least for the subset of tables that are considered
wraparound hazards.  Not sure what a good criterion is for the rest.
        regards, tom lane



Re: autovacuum not prioritising for-wraparound tables

From: Andres Freund
Hi Alvaro,

Nice to see a patch on this!

On 2013-01-24 18:57:15 -0300, Alvaro Herrera wrote:
> I have a bug pending that autovacuum fails to give priority to
> for-wraparound tables.  When xid consumption rate is high and dead tuple
> creation is also high, it is possible that some tables are waiting for
> for-wraparound vacuums that don't complete in time because the workers
> are busy processing other tables that have accumulated dead tuples; the
> system is then down because it's too near the Xid wraparound horizon.
> Apparently this is particularly notorious in connection with TOAST
> tables, because those are always put in the tables-to-process list after
> regular tables.
> 
> (As far as I recall, this was already reported elsewhere, but so far I
> have been unable to find the discussion in the archives.  Pointers
> appreciated.)
> 
> So here's a small, backpatchable patch that sorts the list of tables to
> process (not all that much tested yet).  Tables which have the
> wraparound flag set are processed before those that are not.  Other
> than this criterion, the order is not defined.
> 
> Now we could implement this differently, and maybe more simply (say by
> keeping two lists of tables to process, one with for-wraparound tables
> and one with the rest) but this way it is simpler to add additional
> sorting criteria later: say within each category we could first process
> smaller tables that have more dead tuples.

If I remember the issue that triggered this correctly, I don't think this
would be sufficient to solve the whole issue, although it sure would
delay the shutdown.
Due to the high activity on the system while some bigger, active table
got vacuumed, other previously vacuumed tables already hit
freeze_max_age again and thus were re-eligible for vacuum, even though
other tables - in our specific case always toast relations, because
they always got added last - were very close to the shutdown limit.

So I think we need to sort by age(relfrozenxid) in tables that are over
the anti-wraparound limit. Given your code that doesn't seem to be that
hard?

I think after the infrastructure is there we might want to have some
more intelligence for non-wraparound tables too, but that possibly looks
more like a HEAD than a backpatch thing.

I am very much of the opinion that this needs to be backpatched though -
it's a pretty bad thing if autovacuum cannot be relied on to keep a
system from shutting itself down because it always vacuums the wrong
relations and never gets to the problematic ones. Single user mode is
nothing normal users should ever have to see.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From: Alvaro Herrera
Tom Lane escribió:

> As posted, what we've got here is sorting on a boolean condition, with
> the behavior within each group totally up to the whims of qsort().  That
> seems especially dangerous since the priority order is mostly undefined.
>
> I was a bit surprised that Alvaro didn't propose sorting by the age of
> relfrozenxid, at least for the subset of tables that are considered
> wraparound hazards.  Not sure what a good criterion is for the rest.

Hah.  This patch began life with more complex prioritisation, but
before going much further I dumbed down the idea to avoid having to
discuss these issues, as the timing doesn't seem particularly good.
And I do want to get something back-patchable.

So if we're to discuss this, here's what I had in mind:

1. for-wraparound tables always go first; oldest age(relfrozenxid) are
sorted earlier.  For tables of the same age, consider size as below.

2. for other tables, consider floor(log(size)).  This makes tables of
sizes in the same ballpark be considered together.

3. For tables of similar size, consider
(n_dead_tuples - threshold) / threshold.
"threshold" is what gets calculated as the number of tuples over which
a table is considered for vacuuming.  This number, then, is a relative
measure of how badly vacuuming is needed.
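
To make that concrete, here is a sketch of how the three levels could
combine into one comparator (all names invented; sorting the smaller
size ballpark first in step 2 is an assumption, since the proposal
doesn't say which way):

#include <math.h>
#include <stdbool.h>

typedef struct
{
    bool    wraparound;     /* past autovacuum_freeze_max_age? */
    double  xid_age;        /* age(relfrozenxid) */
    double  size;           /* e.g. relpages */
    double  dead_excess;    /* (n_dead_tuples - threshold) / threshold */
} av_entry;

static int
av_compare(const void *a, const void *b)
{
    const av_entry *ta = a;
    const av_entry *tb = b;

    /* 1. wraparound tables first, oldest age(relfrozenxid) earliest;
     * equal ages fall through to the size criteria below */
    if (ta->wraparound != tb->wraparound)
        return ta->wraparound ? -1 : 1;
    if (ta->wraparound && ta->xid_age != tb->xid_age)
        return ta->xid_age > tb->xid_age ? -1 : 1;

    /* 2. group tables by size ballpark using floor(log(size)) */
    double ba = floor(log(fmax(ta->size, 1.0)));
    double bb = floor(log(fmax(tb->size, 1.0)));

    if (ba != bb)
        return ba < bb ? -1 : 1;    /* assumed: smaller ballpark first */

    /* 3. within a ballpark, the relatively most bloated table first */
    if (ta->dead_excess != tb->dead_excess)
        return ta->dead_excess > tb->dead_excess ? -1 : 1;
    return 0;
}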

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From: Peter Eisentraut
On 1/25/13 10:29 AM, Alvaro Herrera wrote:
> And I do want to get something back-patchable.

Autovacuum has existed for N years and nobody complained about this
until just now, so I don't see a strong justification for backpatching.

Or is this a regression from an earlier release?

In general, I think we should backpatch less.




Re: autovacuum not prioritising for-wraparound tables

From: Alvaro Herrera
Peter Eisentraut escribió:
> On 1/25/13 10:29 AM, Alvaro Herrera wrote:
> > And I do want to get something back-patchable.
>
> Autovacuum has existed for N years and nobody complained about this
> until just now, so I don't see a strong justification for backpatching.

I disagree about people not complaining.  Maybe the complaints have not
been specifically about the wraparound stuff and toast tables, but for
sure there have been complaints about autovacuum not giving more
priority to tables that need work more urgently.

> Or is this a regression from an earlier release?

Nope.

> In general, I think we should backpatch less.

I don't disagree with this general principle, but I certainly don't like
the idea of letting systems run with known flaws just because we're too
scared to patch them.  Now I don't object to a plan such as keep it in
master only for a while and backpatch after it has seen some more
testing.  But for large sites, this is a real problem and they have to
work around it manually which is frequently inconvenient; keep in mind
9.0 is going to be supported for years yet.

That said, if consensus here is to not backpatch this at all, I will go
with that; but let's have the argument first.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From: Tom Lane
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Peter Eisentraut escribió:
>> Autovacuum has existed for N years and nobody complained about this
>> until just now, so I don't see a strong justification for backpatching.

> I disagree about people not complaining.  Maybe the complaints have not
> been specifically about the wraparound stuff and toast tables, but for
> sure there have been complaints about autovacuum not giving more
> priority to tables that need work more urgently.

FWIW, I don't see that this is too scary to back-patch.  It's unlikely
to make things worse than the current coding, which is more or less
pg_class tuple order.

I do suggest that it might be wise not to try to squeeze it into the
early-February update releases.  Put it in master as soon as we agree
on the behavior, then back-patch after the next updates.  That will
give us a couple months' testing, rather than a few days, before it
hits any release tarballs.
        regards, tom lane



Re: autovacuum not prioritising for-wraparound tables

From: Tom Lane
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> So if we're to discuss this, here's what I had in mind:

> 1. for-wraparound tables always go first; oldest age(relfrozenxid) are
> sorted earlier.  For tables of the same age, consider size as below.

It seems unlikely that age(relfrozenxid) will be identical for multiple
tables often enough to worry about, so the second part of that seems
like overcomplication.

> 2. for other tables, consider floor(log(size)).  This makes tables of
> sizes in the same ballpark be considered together.

> 3. For tables of similar size, consider
> (n_dead_tuples - threshold) / threshold.
> "threshold" is what gets calculated as the number of tuples over which
> a table is considered for vacuuming.  This number, then, is a relative
> measure of how badly vacuuming is needed.

The floor(log(size)) part seems like it will have rather arbitrary
behavioral shifts when a table grows just past a log boundary.  Also,
I'm not exactly sure whether you're proposing smaller tables first or
bigger tables first, nor that either of those orderings is a good thing.

I think sorting by just age(relfrozenxid) for for-wraparound tables, and
just the n_dead_tuples measurement for others, is probably reasonable
for now.  If we find out that has bad behaviors then we can look at how
to fix them, but I don't think we have enough understanding yet of what
the bad behaviors might be.
        regards, tom lane



Re: autovacuum not prioritising for-wraparound tables

From: Andres Freund
On 2013-01-25 11:51:33 -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > 2. for other tables, consider floor(log(size)).  This makes tables of
> > sizes in the same ballpark be considered together.
>
> > 3. For tables of similar size, consider
> > (n_dead_tuples - threshold) / threshold.
> > "threshold" is what gets calculated as the number of tuples over which
> > a table is considered for vacuuming.  This number, then, is a relative
> > measure of how badly vacuuming is needed.
>
> The floor(log(size)) part seems like it will have rather arbitrary
> behavioral shifts when a table grows just past a log boundary.  Also,
> I'm not exactly sure whether you're proposing smaller tables first or
> bigger tables first, nor that either of those orderings is a good thing.

That seems dubious to me as well.

> I think sorting by just age(relfrozenxid) for for-wraparound tables, and
> just the n_dead_tuples measurement for others, is probably reasonable
> for now.  If we find out that has bad behaviors then we can look at how
> to fix them, but I don't think we have enough understanding yet of what
> the bad behaviors might be.

If we want another ordering criterion than that it might be worth
thinking about something like n_dead_tuples/relpages to make sure that
small tables with a high dead tuples ratio get vacuumed in time.
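
As a sketch, that criterion is just a per-page density rather than a raw
count (illustrative only):

#include <math.h>

/* Dead tuples per page: a 10-page table with 500 dead tuples scores
 * 50.0 and outranks a 100000-page table with 5000 dead tuples (0.05). */
static double
dead_tuple_density(double n_dead_tuples, double relpages)
{
    return n_dead_tuples / fmax(relpages, 1.0);
}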

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From: Robert Haas
On Fri, Jan 25, 2013 at 11:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The floor(log(size)) part seems like it will have rather arbitrary
> behavioral shifts when a table grows just past a log boundary.  Also,
> I'm not exactly sure whether you're proposing smaller tables first or
> bigger tables first, nor that either of those orderings is a good thing.
>
> I think sorting by just age(relfrozenxid) for for-wraparound tables, and
> just the n_dead_tuples measurement for others, is probably reasonable
> for now.  If we find out that has bad behaviors then we can look at how
> to fix them, but I don't think we have enough understanding yet of what
> the bad behaviors might be.

Which is exactly why back-patching this is not a good idea, IMHO.  We
could easily run across a system where pg_class order happens to be
better than anything else we come up with.  Such changes are expected
in new major versions, but not in maintenance releases.

I think that to do this right, we need to consider not only the status
quo but the trajectory.  For example, suppose we have two tables to
process, one of which needs a wraparound vacuum and the other one of
which needs dead tuples removed.  If the table needing the wraparound
vacuum is small and just barely over the threshold, it isn't urgent;
but if it's large and way over the threshold, it's quite urgent.
Similarly, if the table which needs dead tuples removed is rarely
updated, postponing vacuum is not a big deal, but if it's being
updated like crazy, postponing vacuum is a big problem.  Categorically
putting autovacuum wraparound tables ahead of everything else seems
simplistic, and thinking that more dead tuples is more urgent than
fewer dead tuples seems *extremely* simplistic.

I ran across a real-world case where a user had a small table that had
to be vacuumed every 15 seconds to prevent bloat.  If we change the
algorithm in a way that gives other things priority over that table,
then that user could easily get hosed when they install a maintenance
release containing this change.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: autovacuum not prioritising for-wraparound tables

From: Andres Freund
On 2013-01-25 12:19:25 -0500, Robert Haas wrote:
> On Fri, Jan 25, 2013 at 11:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > The floor(log(size)) part seems like it will have rather arbitrary
> > behavioral shifts when a table grows just past a log boundary.  Also,
> > I'm not exactly sure whether you're proposing smaller tables first or
> > bigger tables first, nor that either of those orderings is a good thing.
> >
> > I think sorting by just age(relfrozenxid) for for-wraparound tables, and
> > just the n_dead_tuples measurement for others, is probably reasonable
> > for now.  If we find out that has bad behaviors then we can look at how
> > to fix them, but I don't think we have enough understanding yet of what
> > the bad behaviors might be.
> I think that to do this right, we need to consider not only the status
> quo but the trajectory.  For example, suppose we have two tables to
> process, one of which needs a wraparound vacuum and the other one of
> which needs dead tuples removed.  If the table needing the wraparound
> vacuum is small and just barely over the threshold, it isn't urgent;
> but if it's large and way over the threshold, it's quite urgent.
> Similarly, if the table which needs dead tuples removed is rarely
> updated, postponing vacuum is not a big deal, but if it's being
> updated like crazy, postponing vacuum is a big problem.  Categorically
> putting autovacuum wraparound tables ahead of everything else seems
> simplistic, and thinking that more dead tuples is more urgent than
> fewer dead tuples seems *extremely* simplistic.

I don't think the first part is problematic. Which scenario do you have
in mind where that would really cause adverse behaviour? autovacuum
seldom does full-table vacuums otherwise these days, so tables get
"old" in that sense pretty regularly and mostly uniformly.

I agree that the second criterion isn't worth very much and that we need
something better there.

> I ran across a real-world case where a user had a small table that had
> to be vacuumed every 15 seconds to prevent bloat.  If we change the
> algorithm in a way that gives other things priority over that table,
> then that user could easily get hosed when they install a maintenance
> release containing this change.

I think if we backpatch this we should only prefer wraparound tables and
leave the rest unchanged.

> Which is exactly why back-patching this is not a good idea, IMHO.  We
> could easily run across a system where pg_class order happens to be
> better than anything else we come up with.  Such changes are expected
> in new major versions, but not in maintenance releases.

I think a minimal version might be acceptable. It's a bug if the database
regularly shuts down and you need to write manual vacuuming scripts to
prevent it from happening.

I don't think the argument that the pg_class order might work better
than anything holds that much truth - it's not like that's something
really stable.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From: Tom Lane
Andres Freund <andres@2ndquadrant.com> writes:
> I think if we backpatch this we should only prefer wraparound tables and
> leave the rest unchanged.

That's not a realistic option, at least not with anything that uses this
approach to sorting the tables.  You'd have to assume that qsort() is
stable which it probably isn't.

> I don't think the argument that the pg_class order might work better
> than anything holds that much truth - it's not like that's something
> really stable.

I find that less than credible as well.
        regards, tom lane



Re: autovacuum not prioritising for-wraparound tables

From: Christopher Browne
On Fri, Jan 25, 2013 at 12:00 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-01-25 11:51:33 -0500, Tom Lane wrote:
>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> > 2. for other tables, consider floor(log(size)).  This makes tables of
>> > sizes in the same ballpark be considered together.
>>
>> > 3. For tables of similar size, consider
>> > (n_dead_tuples - threshold) / threshold.
>> > "threshold" is what gets calculated as the number of tuples over which
>> > a table is considered for vacuuming.  This number, then, is a relative
>> > measure of how badly vacuuming is needed.
>>
>> The floor(log(size)) part seems like it will have rather arbitrary
>> behavioral shifts when a table grows just past a log boundary.  Also,
>> I'm not exactly sure whether you're proposing smaller tables first or
>> bigger tables first, nor that either of those orderings is a good thing.
>
> That seems dubious to me as well.
>
>> I think sorting by just age(relfrozenxid) for for-wraparound tables, and
>> just the n_dead_tuples measurement for others, is probably reasonable
>> for now.  If we find out that has bad behaviors then we can look at how
>> to fix them, but I don't think we have enough understanding yet of what
>> the bad behaviors might be.
>
> If we want another ordering criterion than that it might be worth
> thinking about something like n_dead_tuples/relpages to make sure that
> small tables with a high dead tuples ratio get vacuumed in time.

I'd imagine it a good idea to reserve some autovacuum connections for small
tables, that is, to have a maximum relpages for some portion of the
connections.

That way you don't get stuck having all the connections busy working on
huge tables and leaving small tables starved.  That scenario seems pretty
obvious.

I'd be inclined to do something a bit more sophisticated than just
age(relfrozenxid) for wraparound; I'd be inclined to kick off large tables'
wraparound vacuums earlier than those for smaller tables.

With a little bit of noodling around, here's a thought for a joint function
that I *think* has reasonably common scales:

f(deadtuples, relpages, age) = deadtuples/relpages + e^(age*ln(relpages)/2^32)

When the age of the table is low, this is dominated by the deadtuple/relpages
part of the equation; you vacuum tables based on what has the largest % of
dead tuples.

But when a table is not vacuumed for a long time, the second term will kick
in, and we'll tend to:
a) Vacuum the ones that are largest the earliest, but nonetheless
b) Vacuum them as the ratio of age/2^32 gets close to 1.

This function assumes relpages > 0, and there's a constant, 2^32, there which
might be fiddled with.
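
Transcribed as a sketch, the function is just this (note that
e^(age*ln(relpages)/2^32) is the same as relpages^(age/2^32)):

#include <math.h>

/* Joint priority as proposed above; assumes relpages > 0, and 2^32 is
 * the constant that might be fiddled with. */
static double
joint_priority(double deadtuples, double relpages, double age)
{
    /* the second term is about 1 while the table is young and
     * approaches relpages as age nears 2^32, so larger tables cross
     * over to wraparound-driven priority sooner */
    return deadtuples / relpages +
           exp(age * log(relpages) / pow(2.0, 32.0));
}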
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"



Re: autovacuum not prioritising for-wraparound tables

From: Andres Freund
On 2013-01-25 12:52:46 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > I think if we backpatch this we should only prefer wraparound tables and
> > leave the rest unchanged.
> 
> That's not a realistic option, at least not with anything that uses this
> approach to sorting the tables.  You'd have to assume that qsort() is
> stable which it probably isn't.

Well, comparing them equally will result in about as arbitrary an order
as right now, so I don't really see a problem with that. I am fine with
sorting them truly randomly as well (by assigning a temporary value when
putting it into the list, so the comparison is repeatable and conforms to
the triangle inequality etc).
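
A sketch of that idea (names invented): the random key is assigned once
when the entry is built, so repeated comparisons of the same pair always
agree:

#include <stdlib.h>

typedef struct
{
    int     for_wraparound;
    long    tiebreak;       /* fixed when the entry enters the list */
} av_entry;

static void
av_entry_init(av_entry *e, int for_wraparound)
{
    e->for_wraparound = for_wraparound;
    e->tiebreak = random();     /* stable for this entry's lifetime */
}

static int
av_compare(const void *a, const void *b)
{
    const av_entry *ea = a;
    const av_entry *eb = b;

    if (ea->for_wraparound != eb->for_wraparound)
        return ea->for_wraparound ? -1 : 1;
    /* repeatable, transitive tiebreak instead of qsort()'s whims */
    if (ea->tiebreak != eb->tiebreak)
        return ea->tiebreak < eb->tiebreak ? -1 : 1;
    return 0;
}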

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From: Robert Haas
On Fri, Jan 25, 2013 at 12:35 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I think that to do this right, we need to consider not only the status
>> quo but the trajectory.  For example, suppose we have two tables to
>> process, one of which needs a wraparound vacuum and the other one of
>> which needs dead tuples removed.  If the table needing the wraparound
>> vacuum is small and just barely over the threshold, it isn't urgent;
>> but if it's large and way over the threshold, it's quite urgent.
>> Similarly, if the table which needs dead tuples removed is rarely
>> updated, postponing vacuum is not a big deal, but if it's being
>> updated like crazy, postponing vacuum is a big problem.  Categorically
>> putting autovacuum wraparound tables ahead of everything else seems
>> simplistic, and thinking that more dead tuples is more urgent than
>> fewer dead tuples seems *extremely* simplistic.
>
> I don't think the first part is problematic. Which scenario do you have
> in mind where that would really cause adverse behaviour? autovacuum
> seldom does full-table vacuums otherwise these days, so tables get
> "old" in that sense pretty regularly and mostly uniformly.

I'm worried about the case of a very, very frequently updated table
getting put ahead of a table that needs a wraparound vacuum, but only
just.  It doesn't sit well with me to think that the priority of that
goes from 0 (we don't even try to update it) to infinity (it goes
ahead of all tables needing to be vacuumed for dead tuples) the
instant we hit the vacuum_freeze_table_age.

One idea would be to give each table a "badness".  So estimate the
percentage of the tuples in each table that are dead.  And then we
compute the percentage by which age(relfrozenxid) exceeds
vacuum_freeze_table_age, and add those two percentages up to get total
badness.  We process tables that are otherwise-eligible for vacuuming
in descending order of badness.  So if autovacuum_vacuum_scale_factor =
0.2 and a table is more than 120% of vacuum_freeze_table_age, then it's
certain to be vacuumed before any table that only needs dead-tuple
processing.  But if it's only slightly past the cutoff, it doesn't get
to stomp all over the people who need dead tuples cleaned up.
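
A sketch of that scoring (names invented; this just transcribes the
description above):

#include <math.h>

/* Total badness: percentage of dead tuples, plus the percentage by
 * which age(relfrozenxid) exceeds vacuum_freeze_table_age (zero while
 * the table is still below that threshold). */
static double
table_badness(double n_dead, double n_live,
              double xid_age, double freeze_table_age)
{
    double dead_pct = 100.0 * n_dead / fmax(n_dead + n_live, 1.0);
    double wrap_pct = 100.0 * (xid_age - freeze_table_age)
                      / freeze_table_age;

    return dead_pct + fmax(wrap_pct, 0.0);
}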

The thing is, avoiding a full-cluster shutdown due to anti-wraparound
vacuum is important.  But, IME, that rarely happens.  What is much
more common is that an individual table gets bloated and CLUSTER or
VACUUM FULL is required to recover, and now the system is effectively
down for as long as that takes to complete.  I don't want to make that
case substantially more likely just to avoid a danger of full-cluster
shutdown that, for most users most of the time, is really a very
remote risk.  There's some point at which an anti-wraparound vacuum
should not only trump everything else, but probably also ignore the
configured cost delay settings - but equating that point with the
first point at which we consider doing it at all does not seem right
to me.

> I think a minimal version might be acceptable. It's a bug if the database
> regularly shuts down and you need to write manual vacuuming scripts to
> prevent it from happening.
>
> I don't think the argument that the pg_class order might work better
> than anything holds that much truth - it's not like that's something
> really stable.

I freely admit that if pg_class order happens to work better, it's
just good luck.  But sometimes people get lucky.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: autovacuum not prioritising for-wraparound tables

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Jan 25, 2013 at 12:35 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I don't think the first part is problematic. Which scenario do you have
>> in mind where that would really cause adverse behaviour? autovacuum
>> seldom does full-table vacuums otherwise these days, so tables get
>> "old" in that sense pretty regularly and mostly uniformly.

> I'm worried about the case of a very, very frequently updated table
> getting put ahead of a table that needs a wraparound vacuum, but only
> just.  It doesn't sit well with me to think that the priority of that
> goes from 0 (we don't even try to update it) to infinity (it goes
> ahead of all tables needing to be vacuumed for dead tuples) the
> instant we hit the vacuum_freeze_table_age.

Well, really the answer to that is that we have multiple autovac
workers, and even if the first one that comes along picks the wraparound
job, the next one won't.

Having said that, I agree that it might be better to express the
sort priority as some sort of continuous function of multiple figures of
merit, rather than "sort by one then the next".  See Chris Browne's
mail for another variant.
        regards, tom lane



Re: autovacuum not prioritising for-wraparound tables

From: Robert Haas
On Fri, Jan 25, 2013 at 1:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Jan 25, 2013 at 12:35 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>>> I don't think the first part is problematic. Which scenario do you have
>>> in mind where that would really cause adverse behaviour? autovacuum
>>> seldom does full-table vacuums otherwise these days, so tables get
>>> "old" in that sense pretty regularly and mostly uniformly.
>
>> I'm worried about the case of a very, very frequently updated table
>> getting put ahead of a table that needs a wraparound vacuum, but only
>> just.  It doesn't sit well with me to think that the priority of that
>> goes from 0 (we don't even try to update it) to infinity (it goes
>> ahead of all tables needing to be vacuumed for dead tuples) the
>> instant we hit the vacuum_freeze_table_age.
>
> Well, really the answer to that is that we have multiple autovac
> workers, and even if the first one that comes along picks the wraparound
> job, the next one won't.

Sure, but you could easily have 10 or 20 cross the
vacuum_freeze_table_age threshold simultaneously - and you'll only be
able to process a few of those at a time, due to
autovacuum_max_workers.  Moreover, even if you don't hit the
autovacuum_max_workers limit (say it's jacked up to 100 or so), you're
still introducing a delay of up to N * autovacuum_naptime, where N is
the number of tables that cross the threshold at the same instant,
before any dead-tuple cleanup vacuums are initiated.  It's not
difficult to imagine that being bad.

> Having said that, I agree that it might be better to express the
> sort priority as some sort of continuous function of multiple figures of
> merit, rather than "sort by one then the next".  See Chris Browne's
> mail for another variant.

Ah, so.  I think, though, that my variant is a whole lot simpler and
accomplishes mostly the same purpose.  One difference between my
proposal and the others that have popped up thus far is that I am not
convinced table size matters, or at least not in the way that people
are proposing to make it matter.  The main reason I can see why big
tables matter more than small tables is that a big table takes
*longer* to autovacuum than a small table.  If you are 123,456
transactions from a cluster-wide shutdown, and there is one big table
and one small table that need to be autovacuumed, you had better start
on the big one first - because the next autovacuum worker to come
along will quite possibly be able to finish the small one before
doomsday, but if you don't start the big one now you won't finish in
time.  This remains true even if the small table has a slightly older
relfrozenxid than the large one, but ceases to be true when the
difference is large enough that vacuuming the small one first will
advance datfrozenxid enough to extend the time until a shutdown occurs
by more than the time it takes to vacuum it.

For dead-tuple vacuuming, the question of whether the table is large
or small does not seem to me to have a categorical right answer.  You
could argue that it's more important recover 2GB of space in a 20GB
table than 2MB of space in a 20MB table, because more space is being
wasted.  On the flip side you could argue that a small table becomes
bloated much more easily than a large table, because even a minute of
heavy update activity can turn over the entire table contents, which
is unlikely for a larger table.  I am inclined to think that the
percentage of dead tuples is a more important rubric - if things are
going well, it shouldn't ever be much different from the threshold
that triggers AV in the first place - but if somehow it is much
different (e.g. because the table's been locked for a while, or is
accumulating more bloat than the threshold in a single
autovacuum_naptime), that seems like good justification for doing it
ahead of other things that are less bloated.

We do need to make sure that the formula is defined in such a way that
something that is *severely* past vacuum_freeze_table_age always beats
an arbitrarily-bloated table.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: autovacuum not prioritising for-wraparound tables

From: Simon Riggs
On 25 January 2013 17:19, Robert Haas <robertmhaas@gmail.com> wrote:

> We
> could easily run across a system where pg_class order happens to be
> better than anything else we come up with.

I think you should read that back to yourself and see if you still
feel the word "easily" applies here.

I agree with Tom that it's hard for almost any prioritisation not to be
better than what we have now.

But also, we should keep it fairly simple to avoid introducing new
behaviour that defeats people with a highly tuned vacuum config.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From: Robert Haas
On Sun, Jan 27, 2013 at 4:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 25 January 2013 17:19, Robert Haas <robertmhaas@gmail.com> wrote:
>> We
>> could easily run across a system where pg_class order happens to be
>> better than anything else we come up with.
>
> I think you should read that back to yourself and see if you still
> feel the word "easily" applies here.

I absolutely do.  You will not convince me that whacking around the
behavior of autovacuum in a maintenance release is a remotely sane
thing to do.  There are plenty of things wrong with the way autovacuum
works today, and I am all in favor of fixing them - but not in the
back-branches.  Every time we whack behavior around in the back
branches, no matter how innocuous it looks, somebody's environment
gets broken, and then they won't apply patch releases, and it causes
all sorts of headaches.  At least, that's my experience at
EnterpriseDB. YMMV.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: autovacuum not prioritising for-wraparound tables

From: Jeff Janes
On Fri, Jan 25, 2013 at 9:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> I think that to do this right, we need to consider not only the status
> quo but the trajectory.  For example, suppose we have two tables to
> process, one of which needs a wraparound vacuum and the other one of
> which needs dead tuples removed.  If the table needing the wraparound
> vacuum is small and just barely over the threshold, it isn't urgent;

But it being small, it also won't take long to vacuum.  Why not just do it?

> but if it's large and way over the threshold, it's quite urgent.
> Similarly, if the table which needs dead tuples removed is rarely
> updated, postponing vacuum is not a big deal, but if it's being
> updated like crazy, postponing vacuum is a big problem.

I don't see this as being the case.  If it is being updated like
crazy, it doesn't matter whether it meets the threshold to have tuples
removed *right at the moment* or not.  It will meet that threshold
soon.  If you can't keep up with that need with your current settings,
you have a steady-state problem.  Changing the order, or not changing
the order, isn't going to make a whole lot of difference, you need to
overcome the steady-state problem.

> Categorically
> putting autovacuum wraparound tables ahead of everything else seems
> simplistic, and thinking that more dead tuples is more urgent than
> fewer dead tuples seems *extremely* simplistic.
>
> I ran across a real-world case where a user had a small table that had
> to be vacuumed every 15 seconds to prevent bloat.  If we change the
> algorithm in a way that gives other things priority over that table,

Eventually an anti-wraparound vacuum is going to be done, and once it starts
it does have priority, because things already underway don't get
preempted.  Have they ever reached that point?  Did it cause problems?

> then that user could easily get hosed when they install a maintenance
> release containing this change.

Yeah, I don't know that back-patching is a good idea, or at least not soon.

Cheers,

Jeff



Re: autovacuum not prioritising for-wraparound tables

From: Jeff Janes
On Fri, Jan 25, 2013 at 10:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> I'm worried about the case of a very, very frequently updated table
> getting put ahead of a table that needs a wraparound vacuum, but only
> just.  It doesn't sit well with me to think that the priority of that
> goes from 0 (we don't even try to update it) to infinity (it goes
> ahead of all tables needing to be vacuumed for dead tuples) the
> instant we hit the vacuum_freeze_table_age.

What if it were the instant we hit autovacuum_freeze_max_age, not
vacuum_freeze_table_age?  Or does the current behavior already do
this?  Which process is responsible for enforcing
autovacuum_freeze_max_age?


Cheers,

Jeff



Re: autovacuum not prioritising for-wraparound tables

From: Jeff Janes
On Thu, Jan 24, 2013 at 1:57 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Hi,
>
> I have a bug pending that autovacuum fails to give priority to
> for-wraparound tables.  When xid consumption rate is high and dead tuple
> creation is also high, it is possible that some tables are waiting for
> for-wraparound vacuums that don't complete in time because the workers
> are busy processing other tables that have accumulated dead tuples; the
> system is then down because it's too near the Xid wraparound horizon.
> Apparently this is particularly notorious in connection with TOAST
> tables, because those are always put in the tables-to-process list after
> regular tables.

Is something killing off your autovacuum workers routinely, such that
they rarely reach the end of their to-do list?

Otherwise it seems like the tables would come up for vacuuming in a
cyclic fashion, staggered for each worker; and it being a circle it
shouldn't systematically matter where in it they were added.

What are the various settings for vacuuming?

Cheers,

Jeff



Re: autovacuum not prioritising for-wraparound tables

From: Simon Riggs
On 27 January 2013 17:11, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Jan 27, 2013 at 4:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 25 January 2013 17:19, Robert Haas <robertmhaas@gmail.com> wrote:
>>> We
>>> could easily run across a system where pg_class order happens to be
>>> better than anything else we come up with.
>>
>> I think you should read that back to yourself and see if you still
>> feel the word "easily" applies here.
>
> I absolutely do.

> You will not convince me that whacking around the
> behavior of autovacuum in a maintenance release is a remotely sane
> thing to do.

This is a different argument. It would be better to say this than to
come up with implausible problems as a way of rejecting something.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From: Josh Berkus
> So I think we need to sort by age(relfrozenxid) in tables that are over
> the anti-wraparound limit. Given your code that doesn't seem to be that
> hard?

I might also suggest that we think about changing the defaults for
wraparound vacuum behavior.  Particularly, the fact that
vacuum_freeze_min_age is 50% of autovacuum_freeze_max_age by default is
optimal for absolutely nobody, and forces re-wraparound vacuuming of
wraparound tables which were just recently wraparound-vacuumed.  We
should lower vacuum_freeze_min_age to something sane, like 1000000.

(background:
http://www.databasesoup.com/2012/10/freezing-your-tuples-off-part-2.html)

Also, while I don't know if Alvaro's optimization is a net gain or not
(It might be), I do agree that backpatching it is not worth considering.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: autovacuum not prioritising for-wraparound tables

From: Andres Freund
On 2013-01-29 00:11:12 +1100, Josh Berkus wrote:
> 
> > So I think we need to sort by age(relfrozenxid) in tables that are over
> > the anti-wraparound limit. Given your code that doesn't seem to be that
> > hard?
> 
> I might also suggest that we think about changing the defaults for
> wraparound vacuum behavior.  Particularly, the fact that
> vacuum_freeze_min_age is 50% of autovacuum_freeze_max_age by default is
> optimal for absolutely nobody, and forces re-wraparound vacuuming of
> wraparound tables which were just recently wraparound-vacuumed.  We
> should lower vacuum_freeze_min_age to something sane, like 1000000.

I have to admit, I fail to see why this is a good idea. There isn't much
of an efficiency bonus in freezing early (due to hint bits) and vacuums
over vacuum_freeze_table_age are considerably more expensive as they
have to scan the whole heap instead of using the visibilitymap. And if
you don't vacuum the whole heap you can't lower relfrozenxid. So
changing freeze_min_age doesn't help at all to avoid anti-wraparound
vacuums.

Am I missing something?

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From: Kevin Grittner
Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-01-29 00:11:12 +1100, Josh Berkus wrote:
>>
>>> So I think we need to sort by age(relfrozenxid) in tables that
>>> are over the anti-wraparound limit. Given your code that
>>> doesn't seem to be that hard?
>>
>> I might also suggest that we think about changing the defaults
>> for wraparound vacuum behavior.  Particularly, the fact that
>> vacuum_freeze_min_age is 50% of autovacuum_freeze_max_age by
>> default is optimal for absolutely nobody, and forces
>> re-wraparound vacuuming of wraparound tables which were just
>> recently wraparound-vacuumed. We should lower
>> vacuum_freeze_min_age to something sane, like 1000000.
>
> I have to admit, I fail to see why this is a good idea. There
> isn't much of an efficiency bonus in freezing early (due to hint
> bits) and vacuums over vacuum_freeze_table_age are considerably
> more expensive as they have to scan the whole heap instead of
> using the visibilitymap. And if you don't vacuum the whole heap
> you can't lower relfrozenxid. So changing freeze_min_age doesn't
> help at all to avoid anti-wraparound vacuums.
>
> Am I missing something?

IMO, anything which changes an anti-wraparound vacuum of a
bulk-loaded table from "read the entire table and rewrite nearly
the complete table with WAL-logging" to rewriting a smaller portion
of the table with WAL-logging is an improvement.  Anyone who has
run an OLTP load on a database which was loaded from pg_dump output
or other bulk load processes, has probably experienced the pain
related to the WAL-logged rewrite of massive quantities of data.
Of course, since it triggers based on transaction count, the higher
your transaction rate at any moment, the more likely it is to kick
in at that moment.  Whatever we can do to reduce that pain helps.

-Kevin



Re: autovacuum not prioritising for-wraparound tables

From: Andres Freund
On 2013-01-28 08:15:29 -0800, Kevin Grittner wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-01-29 00:11:12 +1100, Josh Berkus wrote:
> >>
> >>> So I think we need to sort by age(relfrozenxid) in tables that
> >>> are over the anti-wraparound limit. Given your code that
> >>> doesn't seem to be that hard?
> >>
> >> I might also suggest that we think about changing the defaults
> >> for wraparound vacuum behavior.  Particularly, the fact that
> >> vacuum_freeze_min_age is 50% of autovacuum_freeze_max_age by
> >> default is optimal for absolutely nobody, and forces
> >> re-wraparound vacuuming of wraparound tables which were just
> >> recently wraparound-vacuumed. We should lower
> >> vacuum_freeze_min_age to something sane, like 1000000.
> >
> > I have to admit, I fail to see why this is a good idea. There
> > isn't much of an efficiency bonus in freezing early (due to hint
> > bits) and vacuums over vacuum_freeze_table_age are considerably
> > more expensive as they have to scan the whole heap instead of
> > using the visibilitymap. And if you don't vacuum the whole heap
> > you can't lower relfrozenxid. So changing freeze_min_age doesn't
> > help at all to avoid anti-wraparound vacuums.
> >
> > Am I missing something?
>
> IMO, anything which changes an anti-wraparound vacuum of a
> bulk-loaded table from "read the entire table and rewrite nearly
> the complete table with WAL-logging" to rewriting a smaller portion
> of the table with WAL-logging is an improvement.

Yes, but the proposed changes make that *more* frequent, not less. One
million transactions will usually be *after* the next checkpoint, and
thus the pages will be written twice.

IMNSHO the solution to actually solve that problem is to have a
'freeze map' that marks blocks (say 16MB) of tables as already frozen,
so they don't have to be reread/rewritten by (auto-)vacuum at all.
I would like to work on that sometime in the future, but it won't be all
that soon...

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From: Robert Haas
On Sun, Jan 27, 2013 at 2:17 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Fri, Jan 25, 2013 at 9:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think that to do this right, we need to consider not only the status
>> quo but the trajectory.  For example, suppose we have two tables to
>> process, one of which needs a wraparound vacuum and the other one of
>> which needs dead tuples removed.  If the table needing the wraparound
>> vacuum is small and just barely over the threshold, it isn't urgent;
>
> But it being small, it also won't take long to vacuum.  Why not just do it?

Because "big" and "small" are relative terms.

>> but if it's large and way over the threshold, it's quite urgent.
>> Similarly, if the table which needs dead tuples removed is rarely
>> updated, postponing vacuum is not a big deal, but if it's being
>> updated like crazy, postponing vacuum is a big problem.
>
> I don't see this as being the case.  If it is being updated like
> crazy, it doesn't matter whether it meets the threshold to have tuples
> removed *right at the moment* or not.  It will meet that threshold
> soon.  If you can't keep up with that need with your current settings,
> you have a steady-state problem.  Changing the order, or not changing
> the order, isn't going to make a whole lot of difference, you need to
> overcome the steady-state problem.

Sure.  There are many people for which vacuum has no trouble at all
keeping up, and others for whom it isn't even close to keeping up.
People in the first category aren't likely to be damaged by the
proposed change and people in the second category aren't likely to be
helped.  The issue is around what happens for people who are close to
the edge.  Will things get better or worse?  Alvaro (and Simon)
contend that there will be cases where full-cluster shutdowns that
happen under today's algorithm would be avoided if we prioritize
anti-wraparound vacuums over dead-tuple-cleanup vacuums.  I believe
that.  I also believe that there will be cases where it goes the other
way - where a bloat situation that remains under control with today's
algorithm gets perturbed just enough by this change to cause
runaway table bloat.  Or at least, I contend that we don't have nearly
enough evidence that that *won't* happen to risk back-patching a
change of this type.

In my experience, full-cluster shutdowns caused by autovacuum failing
to advance datfrozenxid are extremely rare - and if they do happen,
it's usually because the vacuum cost delay is set too high, or the
cost limit too low.  If we want to attack the problem of making sure
such shutdowns don't happen, I'd argue that the most promising way to
attack that problem is to progressively ratchet the delay down and the
cost limit up as age(relfrozenxid) gets larger.  On the other hand,
problems with runaway table bloat are relatively common.  Heikki's
8.4-era changes have of course helped quite a bit, but the problem is
still very, very common.  All you need is a series of "long"-running
transactions (like a couple of *minutes* on a busy system), or a
vacuum cost delay that is just ever-so-slightly too high, and you're
completely hosed.  I agree with you that if you've got a database
that's well-tuned, so that you aren't skating on the ragged edge of
disaster, this change probably won't break anything.  But I am willing
to bet that there are people out there who are, completely
unknowingly, skating on that ragged edge.  It is not as if we provide
an easy way to know whether you've got the cost delay set optimally.
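
For instance, a sketch of that ratcheting (thresholds, scaling and names
are invented for illustration):

/* Scale the cost delay down and the cost limit up as age(relfrozenxid)
 * approaches autovacuum_freeze_max_age.  age_frac is the table's age
 * divided by autovacuum_freeze_max_age. */
static void
ratchet_vacuum_cost(double age_frac, int *cost_delay_ms, int *cost_limit)
{
    if (age_frac < 0.0)
        age_frac = 0.0;
    if (age_frac > 1.0)
        age_frac = 1.0;

    /* linearly remove the delay and (arbitrarily) quintuple the
     * limit by the time the wraparound threshold is reached */
    *cost_delay_ms = (int) ((1.0 - age_frac) * *cost_delay_ms);
    *cost_limit = (int) ((1.0 + 4.0 * age_frac) * *cost_limit);
}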

>> Categorically
>> putting autovacuum wraparound tables ahead of everything else seems
>> simplistic, and thinking that more dead tuples is more urgent than
>> fewer dead tuples seems *extremely* simplistic.
>>
>> I ran across a real-world case where a user had a small table that had
>> to be vacuumed every 15 seconds to prevent bloat.  If we change the
>> algorithm in a way that gives other things priority over that table,
>
> Eventually an anti-wraparound vacuum is going to be done, and once it starts
> it does have priority, because things already underway don't get
> preempted.  Have they ever reached that point?  Did it cause problems?

In that specific case, I don't know.

>> then that user could easily get hosed when they install a maintenance
>> release containing this change.
>
> Yeah, I don't know that back-patching is a good idea, or at least not soon.

That's all I'm arguing.  I think it would be nice to do something for
9.3, preferably a little more sophisticated than just "put all
anti-wraparound vacuums first".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: autovacuum not prioritising for-wraparound tables

From: Robert Haas
On Mon, Jan 28, 2013 at 8:11 AM, Josh Berkus <josh@agliodbs.com> wrote:
>> So I think we need to sort by age(relfrozenxid) in tables that are over
>> the anti-wraparound limit. Given your code that doesn't seem to be that
>> hard?
>
> I might also suggest that we think about changing the defaults for
> wraparound vacuum behavior.  Particularly, the fact that
> vacuum_freeze_min_age is 50% of autovacuum_freeze_max_age by default is
> optimal for absolutely nobody, and forces re-wraparound vacuuming of
> wraparound tables which were just recently wraparound-vacuumed.  We
> should lower vacuum_freeze_min_age to something sane, like 1000000.

Currently, the defaults are 50m for vacuum_freeze_min_age, 150m for
vacuum_freeze_table_age, and 200m for autovacuum_freeze_max_age, which
doesn't seem like it matches what you're saying.
vacuum_freeze_min_age doesn't force anti-wraparound vacuuming anyway;
it just controls the tuple age at which opportunistic freezing starts.
There's possibly some advantage to lowering it anyway, in the hopes
of leaving less work for the eventual full-table scan to do, but
dropping it to as low as 1m seems like it might be too much of a good
thing - we might end up freezing a lot of tuples that would have
gotten removed before they hit the 50m freeze threshold we have today.

While we're talking about prioritizing vacuum, though, it would be
nice to have a high-priority anti-wraparound vacuum and a low-priority
anti-wraparound vacuum.  When a table crosses the "low priority"
threshold, we work on freezing it *if there is no other vacuuming that
needs doing* - i.e. we're the only autovacuum process running, and no
other table needs autovacuuming either for dead tuples or for
wraparound.  When it crosses the "high priority" threshold, then we
behave as now, or perhaps even more aggressively (move to front of
queue, reduce the cost delay or ignore it outright, etc.).  The "low
priority" anti-wraparound vacuum would abort if some other vacuuming
came to be required, if a lock conflict occurred, etc., and might also
run with a higher cost delay.  I believe this would tend to spread the
anti-wraparound work out over a longer period of time, instead of
clumping it all together as often happens today, and reduce the effect
it has on foreground activities.  It might not be going far enough in
that direction but at least it would be a start.
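
A sketch of that two-threshold classification (everything here is
invented for illustration; no such settings exist today):

#include <stdint.h>

typedef enum
{
    WRAP_NOT_NEEDED,    /* below both thresholds */
    WRAP_LOW_PRIORITY,  /* freeze only if nothing else needs vacuuming */
    WRAP_HIGH_PRIORITY  /* behave as today, or more aggressively */
} wrap_urgency;

static wrap_urgency
classify_wraparound(uint32_t xid_age,
                    uint32_t low_threshold, uint32_t high_threshold)
{
    if (xid_age >= high_threshold)
        return WRAP_HIGH_PRIORITY;
    if (xid_age >= low_threshold)
        return WRAP_LOW_PRIORITY;
    return WRAP_NOT_NEEDED;
}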

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: autovacuum not prioritising for-wraparound tables

From: Josh Berkus
> I have to admit, I fail to see why this is a good idea. There isn't much
> of an efficiency bonus in freezing early (due to hint bits) and vacuums
> over vacuum_freeze_table_age are considerably more expensive as they
> have to scan the whole heap instead of using the visibilitymap. And if
> you don't vacuum the whole heap you can't lower relfrozenxid. So
> changing freeze_min_age doesn't help at all to avoid anti-wraparound
> vacuums.
> 
> Am I missing something?

Yep.  First, you're confusing vacuum_freeze_table_age and
vacuum_freeze_min_age.  Second, you're not doing any arithmetic.

Let's do this by example.  TableA is a large table which receives an
almost constant stream of individual row updates, inserts, and deletes.

DEFAULTS:

XID 1:  First rows in TableA are updated.
XID 200m: Anti-wraparound autovac of TableA.  All XIDs older than XID 100m set to FROZENXID.
XID 300m: Anti-wraparound autovac of TableA.  All XIDs older than XID 200m set to FROZENXID.
XID 400m: Anti-wraparound autovac of TableA.  All XIDs older than XID 300m set to FROZENXID.
XID 500m: Anti-wraparound autovac of TableA.  All XIDs older than XID 400m set to FROZENXID.
XID 600m: Anti-wraparound autovac of TableA.  All XIDs older than XID 500m set to FROZENXID.


vacuum_freeze_min_age = 1m

XID 1:  First rows in TableA are updated.
XID 200m: Anti-wraparound autovac of TableA.
     All XIDs older than XID 199m set to FROZENXID.
XID 399m: Anti-wraparound autovac of TableA
     All XIDs older than XID 398M set to FROZENXID.
XID 598m: Anti-wraparound autovac of TableA
     All XIDs older than XID 597M set to FROZENXID.


vacuum_freeze_min_age = 1m, autovacuum_freeze_max_age = 500m

XID 1:  First rows in TableA are updated.
XID 500m: Anti-wraparound autovac of TableA.
     All XIDs older than XID 499m set to FROZENXID.

As you can see, the current default settings cause 80% more wraparound
autovacs per table than vacuum_freeze_min_age of 1m would, and almost
500% more than what I consider sane settings would.  Just so that we can
preserve XIDs which are 90m transactions old.
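(To watch this on a live system, the distance from each table to its next
anti-wraparound vacuum can be eyeballed with a sketch like the following,
using only the standard catalog columns:)

    -- Sketch: how close each table is to triggering an anti-wraparound
    -- autovacuum under the current autovacuum_freeze_max_age.
    SELECT relname,
           age(relfrozenxid) AS xid_age,
           current_setting('autovacuum_freeze_max_age')::bigint
             - age(relfrozenxid) AS xids_to_go
    FROM pg_class
    WHERE relkind = 'r'
    ORDER BY xid_age DESC
    LIMIT 20;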

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: autovacuum not prioritising for-wraparound tables

From
Andres Freund
Date:
On 2013-01-29 16:09:52 +1100, Josh Berkus wrote:
> 
> > I have to admit, I fail to see why this is a good idea. There isn't much
> > of an efficiency bonus in freezing early (due to hint bits) and vacuums
> > over vacuum_freeze_table_age are considerably more expensive as they
> > have to scan the whole heap instead of using the visibilitymap. And if
> > you don't vacuum the whole heap you can't lower relfrozenxid. So
> > changing freeze_min_age doesn't help at all to avoid anti-wraparound
> > vacuums.
> > 
> > Am I missing something?
> 
> Yep.  First, you're confusing vacuum_freeze_table_age  and
> vacuum_freeze_min_age.

Don't think I did. I was talking about vacuum_freeze_table_age because
that influences the amount of full-table scans, in contrast to ones using
the vm. That's an independent thing from anti-wraparound vacuums, which
are triggered by autovacuum_freeze_max_age.

The point I was trying to make is that a very big part of the load is
not actually the freezing itself but the full-table vacuums which are
triggered by freeze_table_age.

>  Second, you're not doing any arithmetic.

Because it's not actually as easy to calculate as you make it seem.

Even in the case of a large insert-only table you have far more
complex behaviour than what you describe.
The lifetime of tuples/buffers in that context is approx the following:
- inserted
- written by bgwriter or by checkpoint
- vacuum reads the non-all-visible part of the table
- vacuum sets HEAP_XMIN_COMMITTED
- freeze_table_age vacuum reads the whole table - doesn't find anything because of freeze_min_age
- freeze_table_age vacuum reads the whole table - freezes tuple because >= freeze_min_age
- freeze_table_age vacuum reads the whole table - doesn't change anything in our page because its already frozen

So the big point your computation is missing is that all those
anti-wraparound vacuums a) might not even happen, due to normal vacuums
being over freeze_table_age, which change the relfrozenxid; b) don't
rewrite the whole table, because the tuples actually are already frozen;
c) will write out a page repeatedly because of tuples that get
changed again; d) incur full-page writes.

> Let's do this by example.  TableA is a large table which receives an
> almost constant stream of individual row updates, inserts, and deletes.
> 
> DEFAULTS:
> 
> XID 1:  First rows in TableA are updated.
> XID 200m: Anti-wraparound autovac of TableA.
>      All XIDs older than XID 100m set to FROZENXID.

Between those the table will have been vacuumed already, and depending on
the schedule the tuples will already have been frozen due to
freeze_min_age being 50mio and freeze_table_age being 150mio. Before
that, all the tuples will already have been written another time for hint
bit writes.

> XID 300m: Anti-wraparound autovac of TableA
>      All XIDs older than XID 200M set to FROZENXID.

Only the newer tuples are going to be rewritten, the older parts of the
table will only be read.

> XID 400m: Anti-wraparound autovac of TableA
>      All XIDs older than XID 300M set to FROZENXID.
> XID 500m: Anti-wraparound autovac of TableA
>      All XIDs older than XID 400M set to FROZENXID.
> XID 600m: Anti-wraparound autovac of TableA
>      All XIDs older than XID 500M set to FROZENXID.

> vacuum_freeze_min_age = 1m
> 
> XID 1:  First rows in TableA are updated.
> XID 200m: Anti-wraparound autovac of TableA.
>      All XIDs older than XID 199m set to FROZENXID.

Even in an insert-only case the tuples will be written at least twice
before an anti-freeze-wraparound, often thrice:
- first checkpoint
- hint bit sets due to a normal vacuum
- frozen due to a full-table vacuum

But, as you assumed the table will also get deletes and updates, the low
freeze age will mean that some tuples on a page will get frozen on each
vacuum that reads the page, which incurs a full-page write every time
some tuples are frozen, as most of the time the last time the page was
touched will be before the last checkpoint happened. As the WAL is a
major bottleneck on a write-heavy server, that can incur a pretty hefty
global slowdown.
It's *good* to only freeze tuples once you're pretty damn sure they won't be
touched by actual data changes again. As full-table vacuums happen more
frequently than anti-freeze vacuums anyway, the cost of actual
anti-freeze vacuums, should they happen because of a too busy
autovacuum, isn't a problem in itself.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From
Kevin Grittner
Date:
Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-01-29 16:09:52 +1100, Josh Berkus wrote:
>>
>>> I have to admit, I fail to see why this is a good idea. There
>>> isn't much of an efficiency bonus in freezing early (due to
>>> hint bits) and vacuums over vacuum_freeze_table_age are
>>> considerably more expensive as they have to scan the whole heap
>>> instead of using the visibilitymap. And if you don't vacuum the
>>> whole heap you can't lower relfrozenxid.  So changing
>>> freeze_min_age doesn't help at all to avoid anti-wraparound
>>> vacuums.
>>>
>>> Am I missing something?
>>
>> Yep.  First, you're confusing vacuum_freeze_table_age  and
>> vacuum_freeze_min_age.
>
> Don't think I did. I was talking about vacuum_freeze_table_age
> because that influences the amount of full-table scans

Not any more than vacuum_freeze_min_age does.

http://www.postgresql.org/docs/9.2/interactive/routine-vacuuming.html#AUTOVACUUM

| a whole table sweep is forced if the table hasn't been fully
| scanned for vacuum_freeze_table_age minus vacuum_freeze_min_age
| transactions.

So reducing vacuum_freeze_min_age not only helps minimize the
writes that are needed when autovacuum needs to scan the entire
heap, but also decreases the frequency of those full-table scans.
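(The arithmetic is easy to sanity-check; with the defaults this comes out
to 100m transactions between whole-table sweeps, and lowering
vacuum_freeze_min_age stretches that interval.  A sketch:)

    -- Sketch: XIDs between forced whole-table sweeps, per the documented
    -- rule (vacuum_freeze_table_age - vacuum_freeze_min_age).
    SELECT current_setting('vacuum_freeze_table_age')::bigint
         - current_setting('vacuum_freeze_min_age')::bigint
           AS xids_between_whole_table_sweeps;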

-Kevin



Re: autovacuum not prioritising for-wraparound tables

From
Andres Freund
Date:
On 2013-01-30 05:39:29 -0800, Kevin Grittner wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-01-29 16:09:52 +1100, Josh Berkus wrote:
> >>
> >>> I have to admit, I fail to see why this is a good idea. There
> >>> isn't much of an efficiency bonus in freezing early (due to
> >>> hint bits) and vacuums over vacuum_freeze_table_age are
> >>> considerably more expensive as they have to scan the whole heap
> >>> instead of using the visibilitymap. And if you don't vacuum the
> >>> whole heap you can't lower relfrozenxid.  So changing
> >>> freeze_min_age doesn't help at all to avoid anti-wraparound
> >>> vacuums.
> >>>
> >>> Am I missing something?
> >>
> >> Yep.  First, you're confusing vacuum_freeze_table_age  and
> >> vacuum_freeze_min_age.
> >
> > Don't think I did. I was talking about vacuum_freeze_table_age
> > because that influences the amount of full-table scans
>
> Not any more than vacuum_freeze_min_age does.

Well, vacuum_freeze_min_age is 50m while vacuum_freeze_table_age is
150m.

> http://www.postgresql.org/docs/9.2/interactive/routine-vacuuming.html#AUTOVACUUM
>
> | a whole table sweep is forced if the table hasn't been fully
> | scanned for vacuum_freeze_table_age minus vacuum_freeze_min_age
> | transactions.
>
> So reducing vacuum_freeze_min_age not only helps minimize the
> writes that are needed when autovacuum needs to scan the entire
> heap, but also decreases the frequency of those full-table scans.

But it increases the amount of pages that are written out multiple times
because they contain tuples of different ages, in contrast to increasing
vacuum_freeze_table_age, which doesn't have that problem. In combination
with full_page_writes that makes a noticeable difference in total write
volume.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From
Andres Freund
Date:
On 2013-01-30 14:58:24 +0100, Andres Freund wrote:
> > So reducing vacuum_freeze_min_age not only helps minimize the
> > writes that are needed when autovacuum needs to scan the entire
> > heap, but also decreases the frequency of those full-table scans.
>
> But it increases the amount of pages that are written out multiple times
> because they contain tuples of different ages, in contrast to increasing
> vacuum_freeze_table_age which doesn't have that problem. In combination
> with full_page_writes that makes a noticeable different in total write
> volume.

Btw, as far as I read the code, that behaviour only exists insofar as
the last time vacuum runs it freezes all tuples below freeze_min_age but
not newer ones, so relfrozenxid will only be set to current_xmin -
freeze_min_age. But if you manually freeze, or no such old tuples exist,
it's solely influenced by freeze_table_age.

The relevant parts of the code are:

c.f.
vacuum_set_xid_limits:
        /*
         * Determine the table freeze age to use: as specified by the caller,
         * or vacuum_freeze_table_age, but in any case not more than
         * autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
         * VACUUM schedule, the nightly VACUUM gets a chance to freeze tuples
         * before anti-wraparound autovacuum is launched.
         */
        freezetable = freeze_min_age;
        if (freezetable < 0)
            freezetable = vacuum_freeze_table_age;
        freezetable = Min(freezetable, autovacuum_freeze_max_age * 0.95);
        Assert(freezetable >= 0);

        /*
         * Compute the cutoff XID, being careful not to generate a "permanent"
         * XID.
         */
        limit = ReadNewTransactionId() - freezetable;
        if (!TransactionIdIsNormal(limit))
            limit = FirstNormalTransactionId;

        *freezeTableLimit = limit;

lazy_vacuum_rel:
    scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
                         freezeTableLimit);

If you're careful you can also notice that there is an interesting typo
in the freeze table computation: namely, it uses freeze_min_age instead
of freeze_table_age. Which probably explains why I had such bad
performance results with lowering vacuum_freeze_min_age; it basically
radically increases the amount of full-table scans, far more than it
should.

I can't imagine that anybody with a large database ran pg successfully
with a small freeze_min_age due to this.

It seems to be broken since the initial introduction of freeze_table_age
in 6587818542e79012276dcfedb2f97e3522ee5e9b. I guess it wasn't noticed
because the behaviour is only visible via autovacuum because a
user-issued VACUUM passes -1 as freeze_min_age.

Trivial patch attached.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: autovacuum not prioritising for-wraparound tables

From
Jeff Janes
Date:
On Mon, Jan 28, 2013 at 9:09 PM, Josh Berkus <josh@agliodbs.com> wrote:
>
> Let's do this by example.  TableA is a large table which receives an
> almost constant stream of individual row updates, inserts, and deletes.
>
> DEFAULTS:
>
> XID 1:  First rows in TableA are updated.
> XID 200m: Anti-wraparound autovac of TableA.
>      All XIDs older than XID 100m set to FROZENXID.
> XID 300m: Anti-wraparound autovac of TableA
>      All XIDs older than XID 200M set to FROZENXID.
> XID 400m: Anti-wraparound autovac of TableA
>      All XIDs older than XID 300M set to FROZENXID.
> XID 500m: Anti-wraparound autovac of TableA
>      All XIDs older than XID 400M set to FROZENXID.
> XID 600m: Anti-wraparound autovac of TableA
>      All XIDs older than XID 500M set to FROZENXID.

You seem to be assuming the only reason for vacuums to occur on this
table is autovacuum_freeze_max_age, which doesn't seem likely to be
the case for a busy table.

> vacuum_freeze_min_age = 1m
>
> XID 1:  First rows in TableA are updated.
> XID 200m: Anti-wraparound autovac of TableA.
>      All XIDs older than XID 199m set to FROZENXID.
> XID 399m: Anti-wraparound autovac of TableA
>      All XIDs older than XID 398M set to FROZENXID.
> XID 598m: Anti-wraparound autovac of TableA
>      All XIDs older than XID 597M set to FROZENXID.
>
>
> vacuum_freeze_min_age = 1m, autovacuum_freeze_max_age = 500m
>
> XID 1:  First rows in TableA are updated.
> XID 500m: Anti-wraparound autovac of TableA.
>      All XIDs older than XID 499m set to FROZENXID.
>
> As you can see, the current default settings cause 80% more wraparound
> autovacs per table than vacuum_freeze_min_age of 1m would, and almost
> 500% more than what I consider sane settings would.

You can get nearly all the benefits of your "sane" settings just by
increasing autovacuum_freeze_max_age and leaving vacuum_freeze_min_age
alone.  (Assuming the table doesn't get vacuumed for other reasons)


Cheers,

Jeff



Re: autovacuum not prioritising for-wraparound tables

From
Jeff Janes
Date:
On Wed, Jan 30, 2013 at 5:39 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:
>>
>> Don't think I did. I was talking about vacuum_freeze_table_age
>> because that influences the amount of full-table scans
>
> Not any more than vacuum_freeze_min_age does.

There is a lot more room for vacuum_freeze_table_age to be increased,
then there is for vacuum_freeze_min_age to be decreased.

> http://www.postgresql.org/docs/9.2/interactive/routine-vacuuming.html#AUTOVACUUM
>
> | a whole table sweep is forced if the table hasn't been fully
> | scanned for vacuum_freeze_table_age minus vacuum_freeze_min_age
> | transactions.

Why is vacuuming described as such a violent process?  It doesn't
"force" a table sweep, it just goes ahead and performs one.  In
general, it seems hard to tell from the docs that this only promotes a
vacuum which is going to happen anyway from a vm one to a full scan
one.  The forcefulness makes it sound more like it is doing vacuums
that wouldn't happen otherwise (like autovacuum_freeze_max_age does,
which actually could be described as "force" since it turns on the
autovac launcher even if it is configured to be off).

> So reducing vacuum_freeze_min_age not only helps minimize the
> writes that are needed when autovacuum needs to scan the entire
> heap,

How does it do that? If the tuple doesn't need to be frozen now because
it was already frozen, that just means the write happened at a
different time, but it still happened.

Cheers,

Jeff



Re: autovacuum not prioritising for-wraparound tables

From
Kevin Grittner
Date:
Andres Freund <andres@2ndquadrant.com> wrote:

> I can't imagine that anybody with a large database ran pg
> successfully with a small freeze_min_age due to this.

I can't speak to this from personal experience, because at
Wisconsin Courts we found ourselves best served by running a
database VACUUM FREEZE ANALYZE each night during off-peak hours.

> It seems to be broken since the initial introduction of
> freeze_table_age in 6587818542e79012276dcfedb2f97e3522ee5e9b.

> Trivial patch attached.

I didn't see a patch attached.

-Kevin



Re: autovacuum not prioritising for-wraparound tables

From
Andres Freund
Date:
On 2013-01-30 10:21:07 -0800, Kevin Grittner wrote:
> > It seems to be broken since the initial introduction of
> > freeze_table_age in 6587818542e79012276dcfedb2f97e3522ee5e9b.
> 
> > Trivial patch attached.
> 
> I didn't see a patch attached.

The archive has it, so I for once haven't forgotten sending it:
http://www.postgresql.org/message-id/20130130145521.GB3355@awork2.anarazel.de 

While it's a clear bug and fixing it in HEAD is a sure thing, an argument
could be made that it's a clear behavioural change in the back
branches.  I don't think it holds too much water, but wanted to mention
it.

Andres

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From
Kevin Grittner
Date:
Kevin Grittner <kgrittn@ymail.com> wrote:
> To: Andres Freund <andres@2ndquadrant.com>

>> Trivial patch attached.
>
> I didn't see a patch attached.

Never mind; I was looking in the wrong spot.  (I just switched
email providers again because the last one couldn't seem to get the
email headers right for threading purposes.)

-Kevin



Re: autovacuum not prioritising for-wraparound tables

From
Kevin Grittner
Date:
Jeff Janes <jeff.janes@gmail.com> wrote:
> Kevin Grittner <kgrittn@ymail.com> wrote:

>> So reducing vacuum_freeze_min_age not only helps minimize the
>> writes that are needed when autovacuum needs to scan the entire
>> heap,
>
> How does it do that? If the tuple doesn't need to frozen now
> because it was already frozen, that just means the write happened
> at a different time but it still happened.

There's a lot to be said for nibbling away at it during VM
autovacuum runs versus doing it in big chunks in heap-scan runs,
particularly if your BBU cache can absorb up to a certain amount
with no real pain, but causes major pain if the write cache fills.

-Kevin



Re: autovacuum not prioritising for-wraparound tables

From
Jim Nasby
Date:
On 1/25/13 11:56 AM, Christopher Browne wrote:
> With a little bit of noodling around, here's a thought for a joint function
> that I*think*  has reasonably common scales:
>
> f(deadtuples, relpages, age) =
>     deadtuples/relpages + e ^ (age*ln(relpages)/2^32)

Be careful with dead/relpages, because dead tuples increase relpages as 
well. The effect is extremely noticeable on frequently hit tables that 
need to be kept small. If you want to have a deadtuples/size metric, I 
think it would be far better to do deadtuples/non_bloated_table_size.
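(One crude way to get at such a metric from the statistics we already
collect, treating n_live_tup as a stand-in for the non-bloated size; a
sketch only:)

    -- Sketch: dead tuples relative to the live tuple count rather than
    -- the current physical size, so bloat doesn't dilute the ratio.
    SELECT relname,
           n_dead_tup,
           n_live_tup,
           round(n_dead_tup::numeric / NULLIF(n_live_tup, 0), 3)
             AS dead_per_live
    FROM pg_stat_user_tables
    ORDER BY dead_per_live DESC NULLS LAST;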

Someone else in the thread mentioned that what we really need to be 
watching aren't raw values, but trends. Or you can think of it as 
watching first (or even second) derivatives if you like. I couldn't 
agree more. I believe there are several parts of Postgres that end up 
with a bunch of hard to tune GUCs specifically because we're measuring 
the wrong things.

Take freezing for example. Since the only reason to freeze is XID wrap,
the *ideal* time to start a freeze vacuum on a table is so that the
vacuum would end *exactly* as we were about to hit XID wrap.

Obviously that's a completely impractical goal to hit, but notice the 
simplicity of the goal: we only care about the vacuum ending right 
before we'd hit XID wrap. The only way to do that is to monitor how fast 
vacuums are running, how fast XIDs are being consumed, and how quickly 
the oldest XID in each table is advancing. Notice that all of those 
measurements are time derivatives.

From a more practical standpoint, I think it would be extremely useful
to have a metric that showed how quickly a table churned. Something like 
dead tuples per time period. Comparing that to the non-bloated table 
size should give a very strong indication of how critical frequent 
vacuums on that table are.

I don't have a good metric in mind for freeze right now, but I do want 
to mention a use case that I don't think has come up before. When 
building a londiste slave (and presumably all the other trigger 
replication systems suffer from this), each table is copied over in a 
single transaction, and then updates start flowing in for that table. 
That can easily result in a scenario where you have an enormous volume 
of tuples that will all need freezing at almost exactly the same time. 
It would be nice if we could detect such a condition and freeze those 
tuples over time, instead of trying to freeze all of them in one shot.



Re: autovacuum not prioritising for-wraparound tables

From
Kevin Grittner
Date:
Jim Nasby <jim@nasby.net> wrote:

> the only reason to freeze is XID wrap

The conclusions you draw seem to be based on a slightly different
premise than stated here; the conclusions are only supported by the
assumption that "the only reason to freeze at any particular moment
is to assure that all tuples are frozen in time to prevent
transaction wraparound."  In a mostly write-only database, that has
to happen for nearly every tuple at some point -- but *which* point
is best is not necessarily "as late as possible".

> then the *ideal* time to start a freeze vacuum on a table is so
> that the vacuum would end *exactly* as we were about to hit XID
> wrap.

For a tuple which you know is going to survive long enough to be
frozen, the *ideal* time to freeze a tuple, from a performance
standpoint, is when the tuple is written, to eliminate a separate
WAL-log operation.  The *next best* time to freeze is when the hint
bits are set, to avoid a separate page write.  If you are doing
differential backups, the *third best* time to freeze is before the
first differential backup of the tuple, to avoid a separate backup
after the freeze.  And so on.

-Kevin



Re: autovacuum not prioritising for-wraparound tables

From
Christopher Browne
Date:
On Wed, Jan 30, 2013 at 4:05 PM, Jim Nasby <jim@nasby.net> wrote:
> On 1/25/13 11:56 AM, Christopher Browne wrote:
>>
>> With a little bit of noodling around, here's a thought for a joint
>> function
>> that I*think*  has reasonably common scales:
>>
>>
>> f(deadtuples, relpages, age) =
>>     deadtuples/relpages + e ^ (age*ln(relpages)/2^32)
>
>
> Be careful with dead/relpages, because dead tuples increase relpages as
> well. The effect is extremely noticeable on frequently hit tables that need
> to be kept small. If you want to have a deadtuples/size metric, I think it
> would be far better to do deadtuples/non_bloated_table_size.

You may well be right about that; I didn't try to get to second order effects,
just to come up with something that seemed reasonably readily computable
based on existing data that seemed relevant.

One approach to things is to ask if we can "tweak" that model to be materially
better without requiring extra data collection.  It seems plausible to do so; we
might get something better by deducting dead tuples.  I'll not try to present
this yet; I think it's a better idea to come up with a broadly fuller model.

> Someone else in the thread mentioned that what we really need to be watching
> aren't raw values, but trends. Or you can think of it as watching first (or
> even second) derivatives if you like. I couldn't agree more. I believe there
> are several parts of Postgres that end up with a bunch of hard to tune GUCs
> specifically because we're measuring the wrong things.

My "first order" reaction is that this sounds great.  And I reckon it would be
a great idea to come up with a reasonably sophisticated model that covers
all of the behaviours that we see emerging.

Mind you, I'm not sure that we can actually *use* that model; it may require
collecting enough more data that it becomes burdensome to collect the
additional data.

> Take freezing for example. Since the only reason to freeze is XID wrap then
> the *ideal* time to start a freeze vacuum on a table is so that the vacuum
> would end *exactly* as we were about to hit XID wrap.
>
> Obviously that's a completely impractical goal to hit, but notice the
> simplicity of the goal: we only care about the vacuum ending right before
> we'd hit XID wrap. The only way to do that is to monitor how fast vacuums
> are running, how fast XIDs are being consumed, and how quickly the oldest
> XID in each table is advancing. Notice that all of those measurements are
> time derivatives.

Yep, and I think some second derivatives will prove useful too.

The function that I presented took a slightly different tack to things; the
exponential term would try to make sure that tables get frozen WELL in
advance of necessity.  That strikes me as being safer in cases where
a database is blowing through XIDs mighty quickly.

> From a more practical standpoint, I think it would be extremely useful to
> have a metric that showed how quickly a table churned. Something like dead
> tuples per time period. Comparing that to the non-bloated table size should
> give a very strong indication of how critical frequent vacuums on that table
> are.

That's a nice idea, though I'm pretty sure that this requires capturing and
storing a lot more statistical data, which presumably worsens things
(presuming the data gets stored in tables), making the database bigger,
adding an extra set of data that needs to get collected, queried, and
trimmed, and chewing through still more XIDs.

If you could demonstrate pretty forcibly improved behaviours falling
out of that, that may make it salable.  There's quite a bit of work
there, and we would need to accept a burden of still more stats
collection.

> I don't have a good metric in mind for freeze right now, but I do want to
> mention a use case that I don't think has come up before. When building a
> londiste slave (and presumably all the other trigger replication systems
> suffer from this), each table is copied over in a single transaction, and
> then updates start flowing in for that table. That can easily result in a
> scenario where you have an enormous volume of tuples that will all need
> freezing at almost exactly the same time. It would be nice if we could
> detect such a condition and freeze those tuples over time, instead of trying
> to freeze all of them in one shot.

I wonder if these are perhaps good candidates for being frozen immediately;
COPY FREEZE was added not so long ago; it doesn't perfectly cover
this, but if I squint at it a bit...

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=8de72b66a2edcf12c812de0a73bd50b6b7d81d62

I don't see a lot of value to having the data COPYed over to a subscriber
NOT getting frozen immediately.
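For anyone who hasn't played with it, the pattern that commit enables looks
roughly like this (table and file names are made up; FREEZE only applies
when the table was created or truncated in the same transaction):

    BEGIN;
    -- FREEZE requires that the table be created or truncated within the
    -- current (sub)transaction.
    TRUNCATE subscriber_table;
    COPY subscriber_table FROM '/tmp/snapshot.copy' WITH (FREEZE);
    COMMIT;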
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"



Re: autovacuum not prioritising for-wraparound tables

From
Jim Nasby
Date:
On 1/30/13 3:28 PM, Kevin Grittner wrote:
> Jim Nasby <jim@nasby.net> wrote:

>> then the *ideal* time to start a freeze vacuum on a table is so
>> that the vacuum would end *exactly* as we were about to hit XID
>> wrap.
>
> For a tuple which you know is going to survive long enough to be
> frozen, the *ideal* time to freeze a tuple, from a performance
> standpoint, is when the tuple is written, to eliminate a separate
> WAL-log operation.  The *next best* time to freeze is when the hint
> bits are set, to avoid a separate page write.  If you are doing
> differential backups, the *third best* time to freeze is before the
> first differential backup of the tuple, to avoid a separate backup
> after the freeze.  And so on.

Yeah, I hadn't considered that aspect, but I think that's a separate 
discussion. I agree that there are many cases where we would benefit 
from freezing early, and those should all help reduce the impact of 
XID-wrap induced freezing.

Even if we do introduce "up-front" freezing, we still need to deal with 
all the other cases though, which in the worst case means avoiding the 
XID wrap. I believe that particular case (as well as the whole autovac 
priority question) would be best served by looking at 1st and 2nd order 
derivatives.




Re: autovacuum not prioritising for-wraparound tables

From
Jim Nasby
Date:
On 1/30/13 4:37 PM, Christopher Browne wrote:
>>  From a more practical standpoint, I think it would be extremely useful to
>> >have a metric that showed how quickly a table churned. Something like dead
>> >tuples per time period. Comparing that to the non-bloated table size should
>> >give a very strong indication of how critical frequent vacuums on that table
>> >are.
> That's a nice idea, though I'm pretty sure that this requires capturing and
> storing a lot more statistical data, which presumably worsens things
> (presuming the data gets stored in tables), making the database bigger,
> adding an extra set of data that needs to get collected, queried, and
> trimmed, and chewing through still more XIDs.

We don't necessarily need added storage (or at least nothing serious). 
It could potentially be as simple as remembering when we last took a 
measurement and using that to also store number of blahs per second. 
That's only 2 added data points (time and the rate). Of course, there 
are pitfalls to something that simple, though some of those pitfalls have 
simple solutions (such as using estimated weighted means).

In this case I think we would be better served by trying to define a 
good metric first, and then worrying about how to actually capture and 
store it.

>> >I don't have a good metric in mind for freeze right now, but I do want to
>> >mention a use case that I don't think has come up before. When building a
>> >londiste slave (and presumably all the other trigger replication systems
>> >suffer from this), each table is copied over in a single transaction, and
>> >then updates start flowing in for that table. That can easily result in a
>> >scenario where you have an enormous volume of tuples that will all need
>> >freezing at almost exactly the same time. It would be nice if we could
>> >detect such a condition and freeze those tuples over time, instead of trying
>> >to freeze all of them in one shot.
> I wonder if these are perhaps good candidates for being frozen immediately;
> COPY FREEZE was added in not so long ago; it doesn't perfectly cover
> this, but if I squint at it a bit...
>
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=8de72b66a2edcf12c812de0a73bd50b6b7d81d62
>
> I don't see a lot of value to having the data COPYed over to a subscriber
> NOT getting frozen immediately.

Yeah, COPY FREEZE would probably be highly valuable in this case.



Re: autovacuum not prioritising for-wraparound tables

From
Josh Berkus
Date:
> You can get nearly all the benefits of your "sane" settings just by
> increasing autovacuum_freeze_max_age and leaving vacuum_freeze_min_age
> alone.  (Assuming the table doesn't get vacuumed for other reasons)

Correct, it's the ratio that matters.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: autovacuum not prioritising for-wraparound tables

From
Alvaro Herrera
Date:
Christopher Browne escribió:
> On Fri, Jan 25, 2013 at 12:00 PM, Andres Freund <andres@2ndquadrant.com> wrote:

> I'd be inclined to do something a bit more sophisticated than just
> age(relfrozenxid) for wraparound; I'd be inclined to kick off large tables'
> wraparound vacuums earlier than those for smaller tables.
>
> With a little bit of noodling around, here's a thought for a joint function
> that I *think* has reasonably common scales:
>
> f(deadtuples, relpages, age) =
>    deadtuples/relpages + e ^ (age*ln(relpages)/2^32)

Okay, here's a patch along these lines.  I haven't considered Jim's
suggestion downthread about discounting dead tuples from relpages; maybe
we can do that by subtracting the pages attributed to dead ones,
estimating via tuple density (reltuples/relpages).  But that's no my
main concern here.

Instead, what I propose (and is not really in the patch), as a
backpatchable item, is an approach in which the functions to compute
each rel's Browne strength and sort are hooks.  Normal behavior is not
to sort at all, as currently, and sites that have a problem with the
current random order can install a custom module that provide hooks to
change ordering as they see fit.  So behavior won't change for people
who have no problem today.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From
Alvaro Herrera
Date:
Alvaro Herrera escribió:

> Okay, here's a patch along these lines.  I haven't considered Jim's
> suggestion downthread about discounting dead tuples from relpages; maybe
> we can do that by subtracting the pages attributed to dead ones,
> estimating via tuple density (reltuples/relpages).

Patch attached.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: autovacuum not prioritising for-wraparound tables

From
Robert Haas
Date:
On Thu, Jan 31, 2013 at 2:40 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Alvaro Herrera escribió:
>
>> Okay, here's a patch along these lines.  I haven't considered Jim's
>> suggestion downthread about discounting dead tuples from relpages; maybe
>> we can do that by subtracting the pages attributed to dead ones,
>> estimating via tuple density (reltuples/relpages).
>
> Patch attached.

This strikes me as too clever by half.  You've introduced the concept
of a "Browne strength" (apparently named for Christopher Browne) and
yet you haven't even bothered to add a comment explaining the meaning
of the term, let alone justifying the choice of that formula rather
than any other.  I don't want to dog this proposal to death, because
surely we can do better than the status quo here, but adopting the
first formula someone proposed without any analysis of whether it does
the right thing cannot possibly be the correct decision process.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: autovacuum not prioritising for-wraparound tables

From
Robert Haas
Date:
On Thu, Jan 31, 2013 at 2:36 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Instead, what I propose (and is not really in the patch), as a
> backpatchable item, is an approach in which the functions to compute
> each rel's Browne strength and sort are hooks.  Normal behavior is not
> to sort at all, as currently, and sites that have a problem with the
> current random order can install a custom module that provide hooks to
> change ordering as they see fit.  So behavior won't change for people
> who have no problem today.

Can you think of any examples of cases where we have back-patched a new hook?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: autovacuum not prioritising for-wraparound tables

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Instead, what I propose (and is not really in the patch), as a
> backpatchable item, is an approach in which the functions to compute
> each rel's Browne strength and sort are hooks.  Normal behavior is not
> to sort at all, as currently, and sites that have a problem with the
> current random order can install a custom module that provide hooks to
> change ordering as they see fit.  So behavior won't change for people
> who have no problem today.

Meh.  I'm not really thrilled with adding hooks (that presumably we'd
have to preserve forever) to solve a short-term problem.  Nor does this
sound hugely convenient for users with the problem, anyway.  Do we even
know for sure that anyone would create such modules?

I think we should just fix it as best we can in HEAD, and then anyone
who thinks the risk/reward ratio is favorable can back-patch that fix
into a private build.
        regards, tom lane



Re: autovacuum not prioritising for-wraparound tables

From
Alvaro Herrera
Date:
Tom Lane escribió:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Instead, what I propose (and is not really in the patch), as a
> > backpatchable item, is an approach in which the functions to compute
> > each rel's Browne strength and sort are hooks.  Normal behavior is not
> > to sort at all, as currently, and sites that have a problem with the
> > current random order can install a custom module that provide hooks to
> > change ordering as they see fit.  So behavior won't change for people
> > who have no problem today.
>
> Meh.  I'm not really thrilled with adding hooks (that presumably we'd
> have to preserve forever) to solve a short-term problem.  Nor does this
> sound hugely convenient for users with the problem, anyway.  Do we even
> know for sure that anyone would create such modules?

Well, I would.  Providing a custom module is many times more convenient
than providing a patched binary.  But since there seems to be
considerable resistance to the idea I will drop it, unless others vote
in favour.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From
Alvaro Herrera
Date:
Robert Haas escribió:
> On Thu, Jan 31, 2013 at 2:40 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Alvaro Herrera escribió:
> >
> >> Okay, here's a patch along these lines.  I haven't considered Jim's
> >> suggestion downthread about discounting dead tuples from relpages; maybe
> >> we can do that by subtracting the pages attributed to dead ones,
> >> estimating via tuple density (reltuples/relpages).
> >
> > Patch attached.
>
> This strikes me as too clever by half.  You've introduced the concept
> of a "Browne strength" (apparently named for Christopher Browne) and
> yet you haven't even bothered to add a comment explaining the meaning
> of the term, let alone justifying the choice of that formula rather
> than any other.  I don't want to dog this proposal to death, because
> surely we can do better than the status quo here, but adopting the
> first formula someone proposed without any analysis of whether it does
> the right thing cannot possibly be the correct decision process.

My intention was to apply a Nasby correction to Browne Strength and call
the resulting function Browne' (Browne prime).  Does that sound better?

Now seriously, I did experiment a bit with this and it seems to behave
reasonably.  Of course, there might be problems with it, and I don't
oppose changing the name.  "Vacuum strength" didn't sound so great,
so I picked the first term that came to mind.  It's not like picking
people's last names to name stuff is a completely new idea; that said,
it was sort of a joke.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From
Christopher Browne
Date:
On Thu, Jan 31, 2013 at 3:18 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Robert Haas escribió:
>> On Thu, Jan 31, 2013 at 2:40 PM, Alvaro Herrera
>> <alvherre@2ndquadrant.com> wrote:
>> > Alvaro Herrera escribió:
>> >
>> >> Okay, here's a patch along these lines.  I haven't considered Jim's
>> >> suggestion downthread about discounting dead tuples from relpages; maybe
>> >> we can do that by subtracting the pages attributed to dead ones,
>> >> estimating via tuple density (reltuples/relpages).
>> >
>> > Patch attached.
>>
>> This strikes me as too clever by half.  You've introduced the concept
>> of a "Browne strength" (apparently named for Christopher Browne) and
>> yet you haven't even bothered to add a comment explaining the meaning
>> of the term, let alone justifying the choice of that formula rather
>> than any other.  I don't want to dog this proposal to death, because
>> surely we can do better than the status quo here, but adopting the
>> first formula someone proposed without any analysis of whether it does
>> the right thing cannot possibly be the correct decision process.
>
> My intention was to apply a Nasby correction to Browne Strength and call
> the resulting function Browne' (Browne prime).  Does that sound better?
>
> Now seriously, I did experiment a bit with this and it seems to behave
> reasonably.  Of course, there might be problems with it, and I don't
> oppose changing the name.  "Vacuum strength" didn't sound so great,
> so I picked the first term that came to mind.  It's not like picking
> people's last names to name stuff is a completely new idea; that said,
> it was sort of a joke.

Color me amused :-).

And, when thinking about how strong these things are, just remember,
"smell isn't everything".

I spent 20 minutes at a whiteboard arriving at the "Browne strength",
and I think it's not unreasonable as a usage of the data already
immediately at hand.  But it is absolutely just intended as a
strawman proposal, and I'd be pleased to see it get prodded into
something more "prime."
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"



Re: autovacuum not prioritising for-wraparound tables

From
Alvaro Herrera
Date:
Andres Freund wrote:

> If you're careful you can also notice that there is an interesting typo
> in the freeze table computation: namely, it uses freeze_min_age instead
> of freeze_table_age. Which probably explains why I had such bad
> performance results with lowering vacuum_freeze_min_age; it basically
> radically increases the amount of full-table scans, far more than it
> should.
>
> I can't imagine that anybody with a large database ran pg successfully
> with a small freeze_min_age due to this.
>
> It seems to be broken since the initial introduction of freeze_table_age
> in 6587818542e79012276dcfedb2f97e3522ee5e9b. I guess it wasn't noticed
> because the behaviour is only visible via autovacuum because a
> user-issued VACUUM passes -1 as freeze_min_age.

Backpatched all the way back to 8.4

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From
Robert Haas
Date:
On Thu, Jan 31, 2013 at 3:18 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> My intention was to apply a Nasby correction to Browne Strength and call
> the resulting function Browne' (Browne prime).  Does that sound better?

/me rests head in hands.  I'm not halfway clever enough to hang with
this crowd; I'm not even going to touch the puns in Chris' reply.

> Now seriously, I did experiment a bit with this and it seems to behave
> reasonably.  Of course, there might be problems with it, and I don't
> oppose changing the name.  "Vacuum strength" didn't sound so great,
> so I picked the first term that came to mind.  It's not like picking
> people's last names to name stuff is a completely new idea; that said,
> it was sort of a joke.

I don't think I really understand the origin of the formula, so
perhaps if someone would try to characterize why it seems to behave
reasonably that would be helpful (at least to me).

> f(deadtuples, relpages, age) =
>    deadtuples/relpages + e ^ (age*ln(relpages)/2^32)

To maybe make that discussion go more quickly let me kvetch about a
few things to kick things off:

- Using deadtuples/relpages as part of the formula means that tables
with smaller tuples (thus more tuples per page) will tend to get
vacuumed before tables with larger tuples (thus less tuples per page).
I can't immediately see why that's a good thing.

- It's probably important to have a formula where we can be sure that
the wrap-around term will eventually dominate the dead-tuple term,
with enough time to spare to make sure nothing really bad happens; on
the other hand, it's also desirable to avoid the case where a table
that has just crossed the threshold for wraparound vacuuming doesn't
immediately shoot to the top of the list even if it isn't truly
urgent.  It's unclear to me just from looking at this formula how well
the second term meets those goals.

- More generally, it seems to me that we ought to be trying to think
about the units in which these various quantities are measured.  Each
term ought to be unit-less.  So perhaps the first term ought to divide
dead tuples by total tuples, which has the nice property that the
result is a dimensionless quantity that never exceeds 1.0.  Then the
second term can be scaled somehow based on that value.
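As a sketch of what a dimensionless version might look like (the particular
combination below is illustrative, not a proposal):

    -- Sketch: dead fraction (dead tuples over total tuples, bounded near
    -- 1.0) plus the same exponential wraparound term, for discussion only.
    SELECT c.relname,
           s.n_dead_tup / GREATEST(c.reltuples, 1)::float8
             + exp(age(c.relfrozenxid) * ln(GREATEST(c.relpages, 2)) / 2.0^32)
             AS score
    FROM pg_class c
    JOIN pg_stat_all_tables s ON s.relid = c.oid
    WHERE c.relkind = 'r'
    ORDER BY score DESC;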

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: autovacuum not prioritising for-wraparound tables

From
Jeff Janes
Date:
On Wed, Jan 30, 2013 at 6:55 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>
> c.f.
> vacuum_set_xid_limits:
>                 /*
>                  * Determine the table freeze age to use: as specified by the caller,
>                  * or vacuum_freeze_table_age, but in any case not more than
>                  * autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
>                  * VACUUM schedule, the nightly VACUUM gets a chance to freeze tuples
>                  * before anti-wraparound autovacuum is launched.
>                  */
>                 freezetable = freeze_min_age;
>                 if (freezetable < 0)
>                         freezetable = vacuum_freeze_table_age;
>                 freezetable = Min(freezetable, autovacuum_freeze_max_age * 0.95);
>                 Assert(freezetable >= 0);
>
>                 /*
>                  * Compute the cutoff XID, being careful not to generate a "permanent"
>                  * XID.
>                  */
>                 limit = ReadNewTransactionId() - freezetable;
>                 if (!TransactionIdIsNormal(limit))
>                         limit = FirstNormalTransactionId;
>
>                 *freezeTableLimit = limit;
>
> lazy_vacuum_rel:
>         scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
>                                                  freezeTableLimit);
>
> If you're careful you can also notice that there is an interesting typo
> in the freeze table computation: namely, it uses freeze_min_age instead
> of freeze_table_age. Which probably explains why I had such bad
> performance results with lowering vacuum_freeze_min_age; it basically
> radically increases the amount of full-table scans, far more than it
> should.
>
> I can't imagine that anybody with a large database ran pg successfully
> with a small freeze_min_age due to this.

As far as I can tell this bug kicks in when your cluster gets to be
older than freeze_min_age, and then lasts forever after.  After that
point pretty much every auto-vacuum inspired by update/deletion
activity will get promoted to a full table scan.  (Which makes me
wonder how much field-testing the vm-only vacuum has received, if it
was rarely happening in practice due to this bug.)

Lowering the setting of freeze_min_age does not make the bug worse, it
only makes it manifest earlier in the lifetime of the database.



Cheers,

Jeff



Re: autovacuum not prioritising for-wraparound tables

From
Andres Freund
Date:
On 2013-02-01 14:05:46 -0800, Jeff Janes wrote:
> On Wed, Jan 30, 2013 at 6:55 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >
> > c.f.
> > vacuum_set_xid_limits:
> >                 /*
> >                  * Determine the table freeze age to use: as specified by the caller,
> >                  * or vacuum_freeze_table_age, but in any case not more than
> >                  * autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
> >                  * VACUUM schedule, the nightly VACUUM gets a chance to freeze tuples
> >                  * before anti-wraparound autovacuum is launched.
> >                  */
> >                 freezetable = freeze_min_age;
> >                 if (freezetable < 0)
> >                         freezetable = vacuum_freeze_table_age;
> >                 freezetable = Min(freezetable, autovacuum_freeze_max_age * 0.95);
> >                 Assert(freezetable >= 0);
> >
> >                 /*
> >                  * Compute the cutoff XID, being careful not to generate a "permanent"
> >                  * XID.
> >                  */
> >                 limit = ReadNewTransactionId() - freezetable;
> >                 if (!TransactionIdIsNormal(limit))
> >                         limit = FirstNormalTransactionId;
> >
> >                 *freezeTableLimit = limit;
> >
> > lazy_vacuum_rel:
> >         scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
> >                                                  freezeTableLimit);
> >
> > If you're careful you can also notice that there is an interesting typo
> > in the freeze table computation: namely, it uses freeze_min_age instead
> > of freeze_table_age. Which probably explains why I had such bad
> > performance results with lowering vacuum_freeze_min_age; it basically
> > radically increases the amount of full-table scans, far more than it
> > should.
> >
> > I can't imagine that anybody with a large database ran pg successfully
> > with a small freeze_min_age due to this.
> 
> As far as I can tell this bug kicks in when your cluster gets to be
> older than freeze_min_age, and then lasts forever after.  After that
> point pretty much every auto-vacuum inspired by update/deletion
> activity will get promoted to a full table scan.  (Which makes me
> wonder how much field-testing the vm-only vacuum has received, if it
> was rarely happening in practice due to this bug.)

I think you're misreading the code. freezeTableLimit is calculated by
> >                 limit = ReadNewTransactionId() - freezetable;
which is always relative to the current xid. The bug was that
freezetable had the wrong value in autovac due to freeze_min_age being
used instead of freeze_table_age.


Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From
Christopher Browne
Date:
On Fri, Feb 1, 2013 at 4:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 31, 2013 at 3:18 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> My intention was to apply a Nasby correction to Browne Strength and call
>> the resulting function Browne' (Browne prime).  Does that sound better?
>
> /me rests head in hands.  I'm not halfway clever enough to hang with
> this crowd; I'm not even going to touch the puns in Chris' reply.

It's Friday... Fun needs to be had :-).

>> Now seriously, I did experiment a bit with this and it seems to behave
>> reasonably.  Of course, there might be problems with it, and I don't
>> oppose to changing the name.  "Vacuum strength" didn't sound so great,
>> so I picked the first term that came to mind.  It's not like picking
>> people's last names to name stuff is a completely new idea; that said,
>> it was sort of a joke.
>
> I don't think I really understand the origin of the formula, so
> perhaps if someone would try to characterize why it seems to behave
> reasonably that would be helpful (at least to me).
>
>> f(deadtuples, relpages, age) =
>>    deadtuples/relpages + e ^ (age*ln(relpages)/2^32)
>
> To maybe make that discussion go more quickly let me kvetch about a
> few things to kick things off:
>
> - Using deadtuples/relpages as part of the formula means that tables
> with smaller tuples (thus more tuples per page) will tend to get
> vacuumed before tables with larger tuples (thus less tuples per page).
>  I can't immediately see why that's a good thing.

That wasn't intentional, and may be somewhat unfortunate.

I picked values that I knew could be easily grabbed, and we don't
have an immediate tuples-per-page estimate on pg_class.  An
estimate should be available in pg_statistic; I'm not sure that the
bias from this hurts things badly.

> - It's probably important to have a formula where we can be sure that
> the wrap-around term will eventually dominate the dead-tuple term,
> with enough time to spare to make sure nothing really bad happens; on
> the other hand, it's also desirable to avoid the case where a table
> that has just crossed the threshold for wraparound vacuuming doesn't
> immediately shoot to the top of the list even if it isn't truly
> urgent.  It's unclear to me just from looking at this formula how well
> the second term meets those goals.

I think the second term *does* provide a way for wraparound to dominate;
splitting it apart a bit...

Consider...
      age * ln(relpages)
e^ ----------------------
           2^32

The wraparound portion of this involves age/2^32...  In the beginning, the
numerator is near zero against a fixed denominator of 2^32, so the exponent
is close to zero and the term is roughly 1.  As age trends towards 2^32, the
fraction (ignoring ln(relpages)) trends towards 1, so the longer we go
without vacuuming, the closer the fraction gets to 1.  That *tends* to give
you something looking like e^1, or 2.71828+, ignoring the relpages part.

I threw in multiplying by ln(relpages) as a way to step Well Back from
rollover; that means that this term will start growing considerably before
rollover, and the larger the table, the sooner that growth takes place.

There is a problem with the ln(relpages) term; if the table has just 1 page,
the ln(relpages) = 0 so the value of the exponential term is *always* 1.
Probably should have ln(relpages+CONSTANT) so that we guarantee
the numerator is never 0.

I'm a bit worried that the exponential term might dominate *too* quickly.

For a table I have handy with 163K tuples, spread across 3357 pages,
ln(relpages) = 8.1188, and the "exponential bit" ranges as follows:

dotpro0620@localhost->  select generate_series(1,20)*100/20 as percent_wraparound,
    power(2.71828, (65536.0*32768.0*generate_series(1,20)/20.0 * ln(3357))/(65536.0*32768)) as wraparound_term;
 percent_wraparound | wraparound_term
--------------------+------------------
                  5 | 1.50071232210687
                 10 | 2.2521374737234
                 15 | 3.37981045789535
                 20 | 5.07212320054923
                 25 | 7.61179778630838
                 30 | 11.4231187312988
                 35 | 17.1428150369499
                 40 | 25.7264337615498
                 45 | 38.607976149824
                 50 | 57.9394655396491
                 55 | 86.950469871638
                 60 | 130.48764154935
                 65 | 195.824411555774
                 70 | 293.876107391077
                 75 | 441.023495534592
                 80 | 661.849394087408
                 85 | 993.24554108594
                 90 | 1490.57582238538
                 95 | 2236.92550368832
                100 | 3356.98166702019
(20 rows)

At the beginning, the "wraparound" portion is just 1.5, so easily dominated by
a table with a lot of dead tuples.  As the time to wraparound declines,
the term becomes steadily more urgent.  There may be constant
factors to fiddle with at the edges, but this term definitely heads towards
dominance.

That's definitely doing what I intended, and after constructing that table,
I think I'm *more* confident.  Consider two tables of different sizes:
each heads towards "maxing out" at a "wraparound_term" value that is
directly correlated with the size of the table, which seems
mighty right.  A bigger table needs to get drawn into play (e.g. - needs
to get vacuumed) earlier than a smaller one.
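
(Pulling the strawman together against a live catalog, for experimentation
only; this is just the f() above with a +1 guard inside ln(), not anything
a patch actually computes:)

    -- Sketch: f(deadtuples, relpages, age), with relpages+1 inside ln()
    -- so a one-page table doesn't zero out the wraparound term.
    SELECT c.relname,
           s.n_dead_tup / GREATEST(c.relpages, 1)::float8
             + exp(age(c.relfrozenxid) * ln(c.relpages + 1) / 2.0^32)
             AS browne_strength
    FROM pg_class c
    JOIN pg_stat_all_tables s ON s.relid = c.oid
    WHERE c.relkind = 'r'
    ORDER BY browne_strength DESC;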

> - More generally, it seems to me that we ought to be trying to think
> about the units in which these various quantities are measured.  Each
> term ought to be unit-less.  So perhaps the first term ought to divide
> dead tuples by total tuples, which has the nice property that the
> result is a dimensionless quantity that never exceeds 1.0.  Then the
> second term can be scaled somehow based on that value.

Absolutely a good idea.  I'm not sure I agree it ought to wind up
unitless; I'd instead expect a common unit, perhaps number of pages,
indicating a surrogate for quantity of I/O.  That we're both thinking
about "what's the unit?" means we're on a compatible trail.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"



Re: autovacuum not prioritising for-wraparound tables

From
Tom Lane
Date:
Christopher Browne <cbbrowne@gmail.com> writes:
> I picked values that I knew could be easily grabbed, and we don't
> have an immediate tuples-per-page estimate on pg_class.

Er, what?  reltuples/relpages is exactly that estimate --- in fact,
it's only because of historical accident that we don't store a single
float field with that ratio, rather than two fields.  Both the creation
and the usage of those numbers work explicitly with the ratio.
        regards, tom lane



Re: autovacuum not prioritising for-wraparound tables

From
Jeff Janes
Date:
On Fri, Feb 1, 2013 at 2:34 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-02-01 14:05:46 -0800, Jeff Janes wrote:

>> As far as I can tell this bug kicks in when your cluster gets to be
>> older than freeze_min_age, and then lasts forever after.  After that
>> point pretty much every auto-vacuum inspired by update/deletion
>> activity will get promoted to a full table scan.  (Which makes me
>> wonder how much field-testing the vm-only vacuum has received, if it
>> was rarely happening in practice due to this bug.)
>
> I think you're misreading the code. freezeTableLimit is calculated by

>> >                 limit = ReadNewTransactionId() - freezetable;

> which is always relative to the current xid. The bug was that
> freezetable had the wrong value in autovac due to freeze_min_age being
> used instead of freeze_table_age.

Right.  Since freeze_min_age was mistakenly being used, the limit
would be 50 million in the past (rather than 150 million) under
defaults.  But since the last full-table vacuum, whenever that was,
used freeze_min_age for its intended purpose, that means the 50
million in the past *at the time of that last vacuum* is the highest
that relfrozenxid can be.  And that is going to be further back than
50 million from right now, so the vacuum will always be promoted to a
full scan.

I am not entirely sure of my logic above[1], but I'm depending on
empirical observation for my conclusion.  The attached patch emits a
log entry telling if scan_all is being used, and it always is used
(under the bug) once the database gets old enough.  Or at least, I've
never seen it not use scan_all after that point.

As an aside, it does seem like log_autovacuum_min_duration=0 should
log whether a scan_all was done, and if so what relfrozenxid got set
to.  But looking at where the log message is generated, I don't know
where to retrieve that info.


[1] I don't know why it is that a scan_all vacuum with a
freeze_min_age of 50m (or a freezeLimit of 50 million ago) will not
set relfrozenxid to a higher value than that if it discovers that it
can, but it doesn't seem to.


Cheers,

Jeff

Attachment

Re: autovacuum not prioritising for-wraparound tables

From
Jeff Janes
Date:
On Monday, January 28, 2013, Kevin Grittner wrote:
> IMO, anything which changes an anti-wraparound vacuum of a
> bulk-loaded table from "read the entire table and rewrite nearly
> the complete table with WAL-logging" to rewriting a smaller portion
> of the table with WAL-logging is an improvement.  Anyone who has
> run an OLTP load on a database which was loaded from pg_dump output
> or other bulk load processes, has probably experienced the pain
> related to the WAL-logged rewrite of massive quantities of data.

pgbench seems to be the OLTP load par excellence (or perhaps ad nauseam).

What other setup is needed to get it to reproduce this problem?  Do we
just do a dump/restore in lieu of pgbench -i?

Cheers,

Jeff

Re: autovacuum not prioritising for-wraparound tables

From
Andres Freund
Date:
On 2013-02-01 15:09:34 -0800, Jeff Janes wrote:
> On Fri, Feb 1, 2013 at 2:34 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-02-01 14:05:46 -0800, Jeff Janes wrote:
> 
> >> As far as I can tell this bug kicks in when your cluster gets to be
> >> older than freeze_min_age, and then lasts forever after.  After that
> >> point pretty much every auto-vacuum inspired by update/deletion
> >> activity will get promoted to a full table scan.  (Which makes me
> >> wonder how much field-testing the vm-only vacuum has received, if it
> >> was rarely happening in practice due to this bug.)
> >
> > I think you're misreading the code. freezeTableLimit is calculated by
> 
> >> >                 limit = ReadNewTransactionId() - freezetable;
> 
> > which is always relative to the current xid. The bug was that
> > freezetable had the wrong value in autovac due to freeze_min_age being
> > used instead of freeze_table_age.
> 
> Right.  Since freeze_min_age was mistakenly being used, the limit
> would be 50 million in the past (rather than 150 million) under
> defaults.  But since the last full-table vacuum, whenever that was,
> used freeze_min_age for its intended purpose, that means the 50
> million in the past *at the time of that last vacuum* is the highest
> that relfrozenxid can be.  And that is going to be further back than
> 50 million from right now, so the vacuum will always be promoted to a
> full scan.

Oh, wow. You're right. I shouldn't answer emails after sport with cramped
fingers on a Friday night... And I should have thought about this
scenario, because I essentially already explained it upthread, just with
a different set of variables.

This is rather scary. How come nobody noticed that this major
performance improvement was effectively disabled for that long?

I wonder if Kevin's observations about the price of autovac during
OLTPish workloads aren't at least partially caused by this. It will cause
lots of I/O prematurely, because it scans far more than it should, and a
VACUUM FREEZE will push it off.

> As an aside, it does seem like log_autovacuum_min_duration=0 should
> log whether a scan_all was done, and if so what relfrozenxid got set
> to.  But looking at where the log message is generated, I don't know
> where to retrieve that info.

Yes, I agree, I had already been thinking about that because it's really
hard to get that information right now.
It seems easy enough to include it in the ereport() at the bottom of
lazy_vacuum_rel, we determine scan_all in that function, so that seems
ok?
For head I would actually vote for two data points, full_table_scan:
yes/no, skipped_percentage..., both are already available, so it seems
like it should be an easy thing to do.
I'd like to do this for 9.3, agreed? I would even like to add it to the
back branches, but I guess I cannot convince people of that...

> [1] I don't know why it is that a scan_all vacuum with a
> freeze_min_age of 50m (or a freezeLimit of 50 million ago) will not
> set relfrozenxid to a higher value than that if it discovers that it
> can, but it doesn't seem to.

There currently is no code to track the oldest observed xid, so that's a
simple implementation limitation. Making that code better might be
rather worthwhile if you're loading your table in a batch and don't touch
it later anymore...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From
Andres Freund
Date:
On 2013-02-01 16:59:52 -0500, Robert Haas wrote:
> I don't think I really understand the origin of the formula, so
> perhaps if someone would try to characterize why it seems to behave
> reasonably that would be helpful (at least to me).
> 
> > f(deadtuples, relpages, age) =
> >    deadtuples/relpages + e ^ (age*ln(relpages)/2^32)
> 
> To maybe make that discussion go more quickly let me kvetch about a
> few things to kick things off:

I am not too happy with the formula yet either, but it seems you've
started the discussion in the right direction...

> - It's probably important to have a formula where we can be sure that
> the wrap-around term will eventually dominate the dead-tuple term,
> with enough time to spare to make sure nothing really bad happens; on
> the other hand, it's also desirable to avoid the case where a table
> that has just crossed the threshold for wraparound vacuuming doesn't
> immediately shoot to the top of the list even if it isn't truly
> urgent.  It's unclear to me just from looking at this formula how well
> the second term meets those goals.

I just wanted to mention that if everything goes well, we won't *ever*
get to an anti-wraparound vacuum. Normally the table should cross the
vacuum_freeze_table_age barrier earlier and promote a normal vacuum to a
full-table vacuum, which will set relfrozenxid to a newer value
and thus prevent anti-wraparound vacuums from occurring.
So prioritizing anti-wraparound vacuums immediately and heavily doesn't
seem to be too bad.
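
Roughly, the promotion boils down to a comparison like this (a simplified
sketch, not the exact code in any branch):

    /*
     * Sketch: once the table's relfrozenxid precedes the limit derived
     * from vacuum_freeze_table_age, the vacuum scans the whole table.
     */
    TransactionId freezeTableLimit;
    bool          scan_all;

    freezeTableLimit = ReadNewTransactionId() - vacuum_freeze_table_age;
    scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
                                             freezeTableLimit);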

> - More generally, it seems to me that we ought to be trying to think
> about the units in which these various quantities are measured.  Each
> term ought to be unit-less.  So perhaps the first term ought to divide
> dead tuples by total tuples, which has the nice property that the
> result is a dimensionless quantity that never exceeds 1.0.  Then the
> second term can be scaled somehow based on that value.

I think we also need to be careful not to try to get too elaborate on
this end. Once the general code for prioritization is in, the exact
prioritization formula can be easily incrementally tweaked. Just about any
half-way sensible prioritization is better than what we have right now and
we might discover new effects once we do marginally better.

Imo the browne_strength field should be called 'priority' and the
prioritization calculation formula should be moved into an extra
function.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From
Robert Haas
Date:
On Fri, Feb 1, 2013 at 6:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> As an aside, it does seem like log_autovacuum_min_duration=0 should
> log whether a scan_all was done, and if so what relfrozenxid got set
> to.

That would be nifty.

> [1] I don't know why it is that a scan_all vacuum with a
> freeze_min_age of 50m (or a freezeLimit of 50 million ago) will not
> set relfrozenxid to a higher value than that if it discovers that it
> can, but it doesn't seem to.

That also seems very much worth fixing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: autovacuum not prioritising for-wraparound tables

From
Robert Haas
Date:
On Sat, Feb 2, 2013 at 8:41 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> - It's probably important to have a formula where we can be sure that
>> the wrap-around term will eventually dominate the dead-tuple term,
>> with enough time to spare to make sure nothing really bad happens; on
>> the other hand, it's also desirable to avoid the case where a table
>> that has just crossed the threshold for wraparound vacuuming doesn't
>> immediately shoot to the top of the list even if it isn't truly
>> urgent.  It's unclear to me just from looking at this formula how well
>> the second term meets those goals.
>
> I just wanted to mention that if everything goes well, we won't *ever*
> get to an anti-wraparound vacuum. Normally the table should cross the
> vacuum_freeze_table_age barrier earlier and promote a normal vacuum to a
> full-table vacuum, which will set relfrozenxid to a newer value
> and thus prevent anti-wraparound vacuums from occurring.
> So prioritizing anti-wraparound vacuums immediately and heavily doesn't
> seem to be too bad.

IMHO, this is hopelessly optimistic.  Yes, it's intended to work that
way.  But INSERT-only or INSERT-mostly tables are far from an uncommon
use case; and in fact they're probably the most common cause of pain
in this area.  You insert a gajillion tuples, and vacuum never kicks
off, and then eventually you either update some tuples or hit
autovacuum_freeze_max_age and suddenly, BAM, you get this gigantic
vacuum that rewrites the entire table.  And then you open a support
ticket with your preferred PostgreSQL support provider and say
something like "WTF?".

>> - More generally, it seems to me that we ought to be trying to think
>> about the units in which these various quantities are measured.  Each
>> term ought to be unit-less.  So perhaps the first term ought to divide
>> dead tuples by total tuples, which has the nice property that the
>> result is a dimensionless quantity that never exceeds 1.0.  Then the
>> second term can be scaled somehow based on that value.
>
> I think we also need to be careful not to try to get too elaborate on
> this end. Once the general code for prioritization is in, the exact
> prioritization formula can be easily incrementally tweaked. Just about any
> half-way sensible prioritization is better than what we have right now and
> we might discover new effects once we do marginally better.

I agree.  It would be nice to have some way of measuring the positive
or negative impact of what we introduce, too, but I don't have a good
idea what that would be.

> Imo the browne_strength field should be called 'priority' and the
> prioritization calculation formula should be moved into an extra
> function.

Yeah, or maybe vacuum_priority, since that would be easier to grep for.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: autovacuum not prioritising for-wraparound tables

From
Andres Freund
Date:
On 2013-02-02 11:25:01 -0500, Robert Haas wrote:
> On Sat, Feb 2, 2013 at 8:41 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> - It's probably important to have a formula where we can be sure that
> >> the wrap-around term will eventually dominate the dead-tuple term,
> >> with enough time to spare to make sure nothing really bad happens; on
> >> the other hand, it's also desirable to avoid the case where a table
> >> that has just crossed the threshold for wraparound vacuuming doesn't
> >> immediately shoot to the top of the list even if it isn't truly
> >> urgent.  It's unclear to me just from looking at this formula how well
> >> the second term meets those goals.
> >
> > I just wanted to mention that if everything goes well, we won't *ever*
> > get to an anti-wraparound vacuum. Normally the table should cross the
> > vacuum_freeze_table_age barrier earlier and promote a normal vacuum to a
> > full-table vacuum, which will set relfrozenxid to a newer value
> > and thus prevent anti-wraparound vacuums from occurring.
> > So prioritizing anti-wraparound vacuums immediately and heavily doesn't
> > seem to be too bad.
> 
> IMHO, this is hopelessly optimistic.  Yes, it's intended to work that
> way.  But INSERT-only or INSERT-mostly tables are far from an uncommon
> use case; and in fact they're probably the most common cause of pain
> in this area.  You insert a gajillion tuples, and vacuum never kicks
> off, and then eventually you either update some tuples or hit
> autovacuum_freeze_max_age and suddenly, BAM, you get this gigantic
> vacuum that rewrites the entire table.  And then you open a support
> ticket with your preferred PostgreSQL support provider and say
> something like "WTF?".

You're right, this doesn't work superbly well, especially for
insert-only tables... But imo the place to fix it is not the
prioritization logic but relation_needs_vacanalyze, since fixing it in
prioritization won't prevent the BAM, just change the timing of it.

I think scheduling a table for a partial vacuum every min_freeze * 2
xids, even if it's insert-only, would go a long way towards reducing the
impact of full-table vacuums. Obviously that would require retaining the
last xid a vacuum was executed in...
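
A minimal sketch of the check I have in mind, assuming a per-table
last_vacuum_xid existed (no such field is tracked today):

    /*
     * Hypothetical addition to relation_needs_vacanalyze(): force a
     * vacuum once ~2 * vacuum_freeze_min_age xids have passed since the
     * last one, even if no dead tuples have accumulated.
     */
    if (!*dovacuum &&
        TransactionIdPrecedes(last_vacuum_xid,
                              ReadNewTransactionId() - 2 * freeze_min_age))
        *dovacuum = true;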

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From
Robert Haas
Date:
On Sat, Feb 2, 2013 at 1:49 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> You're right, this doesn't work superbly well, especially for
> insert-only tables... But imo the place to fix it is not the
> prioritization logic but relation_needs_vacanalyze, since fixing it in
> prioritization won't prevent the BAM, just change the timing of it.

Agreed.

> I think scheduling a table for a partial vacuum every min_freeze * 2
> xids, even if it's insert-only, would go a long way towards reducing the
> impact of full-table vacuums. Obviously that would require retaining the
> last xid a vacuum was executed in...

I'm not sure that min_freeze * 2 is the right value, but otherwise agreed.

I keep coming back to the idea that vacuum should have a high-priority
queue and a low-priority queue.  When stuff meets the current
thresholds, it goes into the high-priority queue.  But then there
should be a low-priority queue where we do partial vacuums of things
that meet some lower threshold - like the unfrozen portions of
insert-only tables.
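
Hand-waving in C, just to make the shape concrete (entirely hypothetical;
no such structure exists today):

    /* Hypothetical pair of work queues for the autovacuum launcher. */
    typedef struct AutoVacQueues
    {
        List   *high_priority;  /* tables past the current vacuum thresholds */
        List   *low_priority;   /* partial-vacuum candidates, e.g. the
                                 * unfrozen portions of insert-only tables */
    } AutoVacQueues;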

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: autovacuum not prioritising for-wraparound tables

From
Andres Freund
Date:
On 2013-02-02 14:54:10 -0500, Robert Haas wrote:
> On Sat, Feb 2, 2013 at 1:49 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I think scheduling a table for a partial vacuum every min_freeze * 2
> > xids, even if it's insert-only, would go a long way towards reducing the
> > impact of full-table vacuums. Obviously that would require retaining the
> > last xid a vacuum was executed in...
>
> I'm not sure that min_freeze * 2 is the right value, but otherwise agreed.

Yes, min_freeze * 2 was purely a very first quick guess. My reasoning
was basically that values smaller than that won't really do all that
much work, since barely any of the tuples are old enough, while values much
bigger won't help much in reducing huge amounts of writes being
submitted at the same time, and they might miss the window before the
next anti-wraparound vacuum.
I guess a multiplier of something like 5 would theoretically be better
because it would reduce the number of tuples that are too young for
freezing, which will be marked all-visible nonetheless and thus not
scanned again. But it would not work with the current relevant default
settings:

vacuum_freeze_table_age = 150,000,000
vacuum_freeze_min_age = 50,000,000
autovacuum_freeze_max_age = 200,000,000

I guess we will have to think about the defaults for those values. Ad hoc,
I am thinking something like:

vacuum_freeze_table_age = 300,000,000
vacuum_freeze_min_age = 20,000,000
autovacuum_freeze_max_age = 800,000,000

and scheduling a vacuum independently of n_dead_tuples every
(freeze_max_age - freeze_table_age) / 5 or so xids (100 million with the
values above). That would mean that approx 4/5 (or more on a busy system)
of the tuples would get frozen before a full-table vacuum.
I don't think the disk size argument for freeze_max_age = 200000000 is
particularly relevant anymore, especially as we would normally keep the
size at vacuum_freeze_table_age.

To finally fix the issue ISTM that we need an 'age map' to know which
parts to scan again and which parts never need to be scanned again, but
that's a separate, not too small, feature.

> I keep coming back to the idea that vacuum should have a high-priority
> queue and a low-priority queue.  When stuff meets the current
> thresholds, it goes into the high-priority queue.  But then there
> should be a low-priority queue where we do partial vacuums of things
> that meet some lower threshold - like the unfrozen portions of
> insert-only tables.

I don't think that's the most crucial part atm. Such a queue doesn't
solve the problem that we don't want to do unnecessary, repetitive work.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From
Andres Freund
Date:
On 2013-02-01 15:09:34 -0800, Jeff Janes wrote:
> As an aside, it does seem like log_autovacuum_min_duration=0 should
> log whether a scan_all was done, and if so what relfrozenxid got set
> to.  But looking at where the log message is generated, I don't know
> where to retrieve that info.

What about the following, very rough and quick, patch:

LOG:  automatic vacuum of table "postgres.public.data_1": index scans: 1
     pages: 2703 removed, 2702 remain, 5405 (100.00%) scanned
     tuples: 499999 removed, 500001 remain
     full-scan: 1, freeze-limit: 28824, new-frozen-xid: 28824
     buffer usage: 29957 hits, 2 misses, 4 dirtied
     avg read rate: 0.020 MB/s, avg write rate: 0.040 MB/s
     system usage: CPU 0.01s/0.67u sec elapsed 0.77 sec
...
LOG:  automatic vacuum of table "postgres.public.data_1": index scans: 1
     pages: 2703 removed, 5404 remain, 5411 (66.74%) scanned
     tuples: 499999 removed, 1000071 remain
     full-scan: 0, freeze-limit: 28828, new-frozen-xid: -
     buffer usage: 34085 hits, 3 misses, 4 dirtied
     avg read rate: 0.027 MB/s, avg write rate: 0.036 MB/s
     system usage: CPU 0.01s/0.73u sec elapsed 0.86 sec
 

It obviously needs more polish:

- I opted for using the 64bit representation of xids, seems to be better
  in a log which very well might be looked at only after some wraparounds
- exporting 'txid' from adt/txid.c is pretty ugly. I don't like the
  invention of the type in general, but making it visible outside of
  txid.c is even uglier, but using both, plain uint64 and txid inside
  txid.c isn't nice either.
- txid_from_xid should be renamed, don't have a good idea to what right now.
- is there agreement on the additionally logged information?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From
Andres Freund
Date:
On 2013-02-03 02:40:04 +0100, Andres Freund wrote:
> On 2013-02-01 15:09:34 -0800, Jeff Janes wrote:
> > As an aside, it does seem like log_autovacuum_min_duration=0 should
> > log whether a scan_all was done, and if so what relfrozenxid got set
> > to.  But looking at where the log message is generated, I don't know
> > where to retrieve that info.
>
> What about the following, very rough and quick, patch:

-EINTR

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: autovacuum not prioritising for-wraparound tables

From
Christopher Browne
Date:
On Sat, Feb 2, 2013 at 2:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Feb 2, 2013 at 1:49 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> You're right, this doesn't work superbly well, especially for
>> insert-only tables... But imo the place to fix it is not the
>> prioritization logic but relation_needs_vacanalyze, since fixing it in
>> prioritization won't prevent the BAM, just change the timing of it.
>
> Agreed.
>
>> I think scheduling a table for a partial vacuum every min_freeze * 2
>> xids, even if it's insert-only, would go a long way towards reducing the
>> impact of full-table vacuums. Obviously that would require retaining the
>> last xid a vacuum was executed in...
>
> I'm not sure that min_freeze * 2 is the right value, but otherwise agreed.
>
> I keep coming back to the idea that vacuum should have a high-priority
> queue and a low-priority queue.  When stuff meets the current
> thresholds, it goes into the high-priority queue.  But then there
> should be a low-priority queue where we do partial vacuums of things
> that meet some lower threshold - like the unfrozen portions of
> insert-only tables.

When I was thinking about your desire for "unitless" values, I found
myself uncomfortable about that, and I think I've mentioned that.

On further reflection, there's good reason.  The need to vacuum
tables with lots of dead tuples has very different characteristics
from the need to vacuum tables to avoid XID rollover.  Trying to
force them onto the same units seems unlikely to turn out
happily.

On the other hand, I always thought that there was use for having
multiple autovacuum queues, and giving queues different
shaped policies, one for each purpose, seems like a mighty
fine idea.  That way we don't need to worry about mixing the
policies.  There can be two "best policies."

I'd go further, and have 3 queues:

a) A queue devoted to vacuuming small tables.  Anything
with more than [some number of relpages] need not apply.

b) A queue devoted to vacuuming tables with a lot of dead
tuples.

c) A queue devoted to vacuuming tables before their XID
rollover.

The appropriate "strength" functions for b) and c) can be
pretty simple, possibly the relevant bits of the functions that
Nasby and I have suggested.  And any time b) and c) find
small tables, throw them to queue a), essentially doing
the "quick & easy" vacuums.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"



Re: autovacuum not prioritising for-wraparound tables

From
Gavin Flower
Date:
On 03/02/13 15:08, Christopher Browne wrote:
> On Sat, Feb 2, 2013 at 2:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, Feb 2, 2013 at 1:49 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>>> You're right, this doesn't work superbly well, especially for
>>> insert-only tables... But imo the place to fix it is not the
>>> prioritization logic but relation_needs_vacanalyze, since fixing it in
>>> prioritization won't prevent the BAM, just change the timing of it.
>>
>> Agreed.
>>
>>> I think scheduling a table for a partial vacuum every min_freeze * 2
>>> xids, even if it's insert-only, would go a long way towards reducing the
>>> impact of full-table vacuums. Obviously that would require retaining the
>>> last xid a vacuum was executed in...
>>
>> I'm not sure that min_freeze * 2 is the right value, but otherwise agreed.
>>
>> I keep coming back to the idea that vacuum should have a high-priority
>> queue and a low-priority queue.  When stuff meets the current
>> thresholds, it goes into the high-priority queue.  But then there
>> should be a low-priority queue where we do partial vacuums of things
>> that meet some lower threshold - like the unfrozen portions of
>> insert-only tables.
>
> When I was thinking about your desire for "unitless" values, I found
> myself uncomfortable about that, and I think I've mentioned that.
>
> On further reflection, there's good reason.  The need to vacuum
> tables with lots of dead tuples has very different characteristics
> from the need to vacuum tables to avoid XID rollover.  Trying to
> force them onto the same units seems unlikely to turn out
> happily.
>
> On the other hand, I always thought that there was use for having
> multiple autovacuum queues, and giving queues different
> shaped policies, one for each purpose, seems like a mighty
> fine idea.  That way we don't need to worry about mixing the
> policies.  There can be two "best policies."
>
> I'd go further, and have 3 queues:
>
> a) A queue devoted to vacuuming small tables.  Anything
> with more than [some number of relpages] need not apply.
>
> b) A queue devoted to vacuuming tables with a lot of dead
> tuples.
>
> c) A queue devoted to vacuuming tables before their XID
> rollover.
>
> The appropriate "strength" functions for b) and c) can be
> pretty simple, possibly the relevant bits of the functions that
> Nasby and I have suggested.  And any time b) and c) find
> small tables, throw them to queue a), essentially doing
> the "quick & easy" vacuums.

Hmm...

Could there be some measure of bloatedness?

A table with 10 live rows and 100 dead tuples should surely have a higher
priority of being vacuumed than a table with 1000 rows and 100 dead
tuples?  Especially for tables with hundreds of millions of rows!

Cheers,
Gavin

Re: autovacuum not prioritising for-wraparound tables

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> It obviously needs more polish:

> - I opted for using the 64bit representation of xids, seems to be better
>   in a log which very well might be looked at only after some
>   wraparounds
> - exporting 'txid' from adt/txid.c is pretty ugly. I don't like the
>   invention of the type in general, but making it visible outside of
>   txid.c is even uglier, but using both, plain uint64 and txid inside
>   txid.c isn't nice either.
> - txid_from_xid should be renamed, don't have a good idea to what right
>   now.
> - is there agreement on the additionally logged information?

-1 on using txids here.  If memory serves, we have had exactly this
discussion before and rejected spreading those into other parts
of the system.  That gets rid of three of your problems right there,
as well as a lot of ugly hackery with UINT64_FORMAT.
        regards, tom lane



Re: autovacuum not prioritising for-wraparound tables

From
Andres Freund
Date:
On 2013-02-03 11:17:42 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > It obviously needs more polish:
> 
> > - I opted for using the 64bit representation of xids, seems to be better
> >   in a log which very well might be looked at only after some
> >   wraparounds
> > - exporting 'txid' from adt/txid.c is pretty ugly. I don't like the
> >   invention of the type in general, but making it visible outside of
> >   txid.c is even uglier, but using both, plain uint64 and txid inside
> >   txid.c isn't nice either.
> > - txid_from_xid should be renamed, don't have a good idea to what right
> >   now.
> > - is there agreement on the additionally logged information?
> 
> -1 on using txids here.  If memory serves, we have had exactly this
> discussion before and rejected spreading those into other parts
> of the system.  That gets rid of three of your problems right there,
> as well as a lot of ugly hackery with UINT64_FORMAT.

What about providing something like char *TransactionIdToEpochStrP() and
implementing it in txid.c instead of transam.c? Not pretty but it
wouldn't expose much to the outside?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From
Kevin Grittner
Date:
Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-02-01 15:09:34 -0800, Jeff Janes wrote:

>> Since freeze_min_age was mistakenly being used, the limit
>> would be 50 million in the past (rather than 150 million) under
>> defaults.  But since the last full-table vacuum, whenever that was,
>> used freeze_min_age for its intended purpose, that means the 50
>> million in the past *at the time of that last vacuum* is the highest
>> that relfrozenxid can be.  And that is going to be further back than
>> 50 million from right now, so the vacuum will always be promoted to a
>> full scan.

> This is rather scary. How come nobody noticed that this major
> performance improvement was effectively disabled for that long?
>
> I wonder if Kevin's observations about the price of autovac during
> OLTPish workloads aren't at least partially caused by this. It will cause
> lots of I/O prematurely, because it scans far more than it should, and a
> VACUUM FREEZE will push it off.

Well, there seem to be multiple problems with autovacuum which are
serious enough that people are noticing severe hits to production
performance, figuring out that it is autovacuum, turning off
autovacuum and getting immediate relief, and then calling for
support a few months later when the inevitable consequences hit.
At that point, of course, there is a lot of pain, and it is hard to
recover from it without a sizable maintenance window.

I was able to confirm two cases where this was a consequence of the
lazy truncate logic which Jan recently fixed, but there are clearly
other problems which I didn't have much of a grasp on prior to this
thread.  The only thing I knew for sure was that a bulk load, and
in particular a pg_dump/restore cycle, was a time bomb without
VACUUM FREEZE, and users are often reluctant to add time for that
to their maintenance window.  If we can't load tuples as frozen in
the first plance, the next best thing would be to nibble away at
freezing them in the background, so that there is no single big
hit.  While proposals here seem to deal with very real problems
which we should fix, I'm not sure that anything here addresses this
issue.  COPY FREEZE does address this, where it can be used; but
I'm not sure that we don't need to address it in the autovacuum end
of things, too.

These problems are serious enough to merit cautious back-patching,
in my view; the current state of affairs really is causing serious
disruption of production environments.

-Kevin




Re: autovacuum not prioritising for-wraparound tables

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-02-03 11:17:42 -0500, Tom Lane wrote:
>> -1 on using txids here.  If memory serves, we have had exactly this
>> discussion before and rejected spreading those into other parts
>> of the system.  That gets rid of three of your problems right there,
>> as well as a lot of ugly hackery with UINT64_FORMAT.

> What about providing something like char *TransactionIdToEpochStrP() and
> implementing it in txid.c instead of transam.c? Not pretty but it
> wouldn't expose much to the outside?

I'm objecting to the entire concept, not just how much cruft gets
exposed outside txid.c.

I went looking for the previous discussion and couldn't find it, but
here are some significant reasons not to use txids for logging:

* They don't have anything to do with the xids you can actually see
in the database (at least not without mod-2^32 arithmetic that is hard
to do in one's head).

* txid_from_xid is very expensive because of the GetNextXidAndEpoch
call; and if it got to be commonly used it would significantly increase
contention for the XidGenLock lock.  (Admittedly, two such calls per
VACUUM probably don't mean anything.  But once we establish a precedent
of logging txids not xids, there's a slippery slope down to where it
will be a problem.)

* We've found bugs repeatedly in the txid epoch conversion code, and
I have little confidence that there aren't more.  (The fact that your
patch found it necessary to touch convert_xid() isn't exactly improving
my opinion of this code, either.)  Accordingly, I think that log entries
involving txids would be materially less reliable than if we just print
the xids and have done.
        regards, tom lane



Re: autovacuum not prioritising for-wraparound tables

From
Andres Freund
Date:
On 2013-02-03 13:26:25 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2013-02-03 11:17:42 -0500, Tom Lane wrote:
> >> -1 on using txids here.  If memory serves, we have had exactly this
> >> discussion before and rejected spreading those into other parts
> >> of the system.  That gets rid of three of your problems right there,
> >> as well as a lot of ugly hackery with UINT64_FORMAT.
> 
> > What about providing something like char *TransactionIdToEpochStrP() and
> > implementing it in txid.c instead of transam.c? Not pretty but it
> > wouldn't expose much to the outside?
> 
> I'm objecting to the entire concept, not just how much cruft gets
> exposed outside txid.c.

Ok, I can live with that.

The reason I wanted to log txids instead of plain xids is exactly that
you're basically required to do the mod-2^32 arithmetic to understand the
numbers, since the xids frequently are wrapped around. E.g. the freeze xid in
a somewhat new cluster will be something like 4094966496, which isn't
that easy to interpret. Even more so with relfrozenxids et al.
I personally find it relatively hard to compare an xid like 4094966496,
even more so when comparing a wrapped-around value with one that isn't.

> The fact that your patch found it necessary to touch convert_xid()
> isn't exactly improving my opinion of this code, either.

That's just because it couldn't handle xids that are essentially in epoch
-1. E.g. the freeze limit in a new cluster will wrap around into
that. (~800 - 200mio).

Any other opinions on the content of what's being logged?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum not prioritising for-wraparound tables

From
Jeff Janes
Date:
On Sun, Feb 3, 2013 at 9:25 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
>
> I was able to confirm two cases where this was a consequence of the
> lazy truncate logic which Jan recently fixed, but there are clearly
> other problems which I didn't have much of a grasp on prior to this
> thread.  The only thing I knew for sure was that a bulk load, and
> in particular a pg_dump/restore cycle, was a time bomb without
> VACUUM FREEZE, and users are often reluctant to add time for that
> to their maintenance window.  If we can't load tuples as frozen in
> the first plance, the next best thing would be to nibble away at
> freezing them in the background, so that there is no single big
> hit.  While proposals here seem to deal with very real problems
> which we should fix, I'm not sure that anything here addresses this
> issue.

It seems like we already have all the machinery we need for this.

Don't run VACUUM FREEZE during the maintenance window; instead, at the
end of the maintenance window, set vacuum_cost_delay to nibble
(whatever integer that translates to on your system) and let 'er rip.
Since vacuum_cost_delay is session-settable, it wouldn't interfere
with the setting for other things.

The only real problem I see with this is how to restart it so that it
picks up where it left off if it should get interrupted.

I guess giving vacuumdb new options to set the vacuum_cost settings it
runs under would be a handy feature for doing this.

Cheers,

Jeff



Re: autovacuum not prioritising for-wraparound tables

From
Jeff Janes
Date:
On Sat, Feb 2, 2013 at 5:25 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-02-01 15:09:34 -0800, Jeff Janes wrote:
>> On Fri, Feb 1, 2013 at 2:34 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > On 2013-02-01 14:05:46 -0800, Jeff Janes wrote:
>>
>> >> As far as I can tell this bug kicks in when your cluster gets to be
>> >> older than freeze_min_age, and then lasts forever after.  After that
>> >> point pretty much every auto-vacuum inspired by update/deletion
>> >> activity will get promoted to a full table scan.  (Which makes me
>> >> wonder how much field-testing the vm-only vacuum has received, if it
>> >> was rarely happening in practice due to this bug.)
>> >
>> > I think you're misreading the code. freezeTableLimit is calculated by
>>
>> >> >                 limit = ReadNewTransactionId() - freezetable;
>>
>> > which is always relative to the current xid. The bug was that
>> > freezetable had the wrong value in autovac due to freeze_min_age being
>> > used instead of freeze_table_age.
>>
>> Right.  Since freeze_min_age was mistakenly being used, the limit
>> would be 50 million in the past (rather than 150 million) under
>> defaults.  But since the last full-table vacuum, whenever that was,
>> used freeze_min_age for its intended purpose, that means the 50
>> million in the past *at the time of that last vacuum* is the highest
>> that relfrozenxid can be.  And that is going to be further back than
>> 50 million from right now, so the vacuum will always be promoted to a
>> full scan.
>
> Oh, wow. You're right. I shouldn't answer emails after sport with cramped
> fingers on a Friday night... And I should have thought about this
> scenario, because I essentially already explained it upthread, just with
> a different set of variables.
>
> This is rather scary. How come nobody noticed that this major
> performance improvement was effectively disabled for that long?

I'm not sure whom to address this to, but the just-committed release
notes for this issue reflect the original understanding that it only
applied when vacuum_freeze_min_age was lowered from its default,
rather than the current understanding that it affects all old-enough
systems.

If the release notes are not already baked in, I would suggest this wording:

+      The main consequence of this mistake is that it
+      caused full-table vacuuming scans to occur much more frequently
+      than intended.


Cheers,

Jeff



Re: autovacuum not prioritising for-wraparound tables

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> On Sat, Feb 2, 2013 at 5:25 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> If the release notes are not already baked in, I would suggest this wording:

> +      The main consequence of this mistake is that it
> +      caused full-table vacuuming scans to occur much more frequently
> +      than intended.

It's baked ... sorry about that, but when making the notes there's
seldom time to go through the threads about every patch to see if the
commit messages are accurate or not.
        regards, tom lane



Re: autovacuum not prioritising for-wraparound tables

From
Jim Nasby
Date:
On 1/31/13 2:18 PM, Alvaro Herrera wrote:
> My intention was to apply a Nasby correction to Browne Strength and call
> the resulting function Browne' (Browne prime).  Does that sound better?

I suggest painting that bikeshed "Browneby". :P



Re: autovacuum not prioritising for-wraparound tables

From
Alvaro Herrera
Date:
Here's another cut at this patch.  This is mainly about the
infrastructure to pass the data around in autovacuum; the proposed
formulas probably need a lot of work.

We still have two terms in the autovacuum priority: the first one considers
dead tuples and the second one considers the wraparound limit.  I have kept
Chris' proposal for the second term, but refined the first one a bit per
Jim Nasby's suggestion of discounting dead space.  So we now have

    return
        (d->p_deadtuples * (d->p_livetuples + d->p_deadtuples) * d->p_relpages) /
        ((d->p_livetuples + 1) * d->p_reltuples * nblocks) +
        exp(d->p_xidage * logf(nblocks) / UINT_MAX);

Here, deadtuples and livetuples come from pgstat data, while relpages
and reltuples come from pg_class.  nblocks, on the other hand, comes
from the actual number of blocks in the table.  In words, the first term
is the dead-per-live-tuple ratio scaled by the tuple density pgstat
reports relative to the one pg_class reports, and the second term grows
towards nblocks as the xid age approaches 2^32.

I haven't considered the case where pg_class.reltuples = 0 (which
results in division-by-zero), but I think to be really robust here we'd
want to have some code copied from estimate_rel_size; or maybe simply
use some hardcoded magic value.  I lean towards the latter, because I'm
not sure we want to expend a relation open at this point (incurring an
attempt to lock the table, which could be problematic); hence the new
RelFileNodeGetNumberOfBlocks() thingy, which is admittedly pretty ugly,
not to mention untested.

(I have considered livetuples=0 however, hence the +1 there).

I think we now need to have a more focused discussion on useful
formulas to use here.  One thing I noticed that fails in the above
formula is that as nblocks grows, ceteris paribus, the score falls; but
that's wrong, because if you have a table that turns out to have much
larger nblocks because it bloated and pgstat lost the message, we need
to look harder at it.  So somehow we need to consider the tuple density
as given by pg_class.reltuples/pg_class.relpages, and compare with the
one given by pgstat.(live+dead) / nblocks; and raise the score as the
ratio goes down (in normal conditions the ratio should be 1; a bloated
table that pgstat hasn't noticed will have a lower ratio).
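
Something along these lines, perhaps; a rough sketch only, reusing the
patch's field names and ignoring the zero-reltuples/relpages cases
discussed above:

    /*
     * Sketch of a density correction: scale the score up as the density
     * implied by pgstat and the real block count falls below the one
     * pg_class believes in, i.e. when the table bloated without pgstat
     * noticing.  Illustrative only; the clamp constant is arbitrary.
     */
    double  expected_density = d->p_reltuples / d->p_relpages;
    double  observed_density =
        (d->p_livetuples + d->p_deadtuples) / nblocks;

    score *= expected_density / Max(observed_density, 0.000001);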

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment