Thread: VACUUM/ANALYZE counting of in-doubt tuples

VACUUM/ANALYZE counting of in-doubt tuples

From
Tom Lane
Date:
There was some discussion in pgsql-performance about the problem that
the live-and-dead-tuple counts that ANALYZE reports to the stats
collector don't reflect insert-in-progress tuples properly:
http://archives.postgresql.org/pgsql-performance/2007-11/msg00225.php
I proposed a patch here:
http://archives.postgresql.org/pgsql-patches/2007-11/msg00169.php
to switch ANALYZE over to using HeapTupleSatisfiesVacuum() to determine
tuple status, which would at least give it awareness of tuples that are
in-doubt because of concurrent transactions.

On further thought though, that's not the whole story, and in fact
VACUUM itself isn't doing very well at accounting for in-doubt tuples.
The current implementation is that whatever live and dead tuple totals
are arrived at by a VACUUM or ANALYZE are sent to the stats collector
and simply overwrite its counters on arrival.  Meanwhile, the
transaction that is responsible for an in-doubt tuple will send a
stats message to increment either the live-tuple or dead-tuple count
as appropriate when it commits or aborts.  If that happens before
the VACUUM or ANALYZE completes, the increment will get overwritten
by VACUUM/ANALYZE's total; if afterwards, the increment will get
added onto the total.  So ideally we'd count the state change as already
done if we knew the other transaction would commit first, otherwise
not.

Since we don't know whether the other transaction will commit or abort,
much less when, it seems hopeless to have any really accurate accounting
for in-doubt tuples.  We could avoid the whole problem if we gave up on
having VACUUM/ANALYZE reset the stats counters (instead, VACUUM would
just send an incremental dead-tuple-count reduction, and ANALYZE would
do nothing); as indeed was already proposed by Itagaki-san.  But I am
still convinced that that cure would be worse than the disease --- if
the counters are driven entirely by incremental updates then there is
no way to correct errors that will inevitably creep in, and the
accumulation of errors will allow the counts to diverge arbitrarily
far from reality.  I'm happier with the idea that the error in the
counts can be bounded by the number of in-doubt tuples seen by the
last VACUUM/ANALYZE.

So having said all that, we need to devise heuristic rules that minimize
the error.  In that light, I suggest that the accounting rules assume
that the transactions responsible for in-doubt tuples will finish after
the VACUUM/ANALYZE finishes.  That probably sounds wrong, because in
most of the sorts of applications we think about, updating transactions
are short and are likely to finish before the VACUUM/ANALYZE does.  But
what I'm thinking is that a short transaction by definition can't change
very many rows.  The cases that can seriously hurt the accuracy of our
counts are long-running bulk inserts or updates, and for those it seems
entirely likely that ANALYZE in particular could finish first.

Given that assumption, then the rule should be to not count
INSERT_IN_PROGRESS tuples at all --- they will be added to the 
appropriate count when the inserting transaction commits or aborts.
And DELETE_IN_PROGRESS tuples should be counted as live --- if the
deleting transaction commits, it will increment the dead-tuple count
at that time, and if it aborts then the tuple remains live.
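
A minimal standalone sketch of that counting rule (the enum and function
names below are illustrative, mirroring but not copied from the backend's
HeapTupleSatisfiesVacuum / HTSV_Result handling):

    /* Illustrative tuple states; not the actual backend definitions. */
    typedef enum
    {
        TUPLE_DEAD,              /* dead and removable */
        TUPLE_RECENTLY_DEAD,     /* deleted by a committed xact, not yet removable */
        TUPLE_LIVE,
        TUPLE_INSERT_IN_PROGRESS,
        TUPLE_DELETE_IN_PROGRESS
    } TupleVacStatus;

    /* Apply the proposed rule: assume the transaction responsible for an
     * in-doubt tuple finishes after we do and sends its own stats increment
     * then, so we avoid counting its change twice. */
    static void
    count_tuple(TupleVacStatus status, double *live, double *dead)
    {
        switch (status)
        {
            case TUPLE_LIVE:
                (*live)++;
                break;
            case TUPLE_DEAD:
            case TUPLE_RECENTLY_DEAD:
                (*dead)++;
                break;
            case TUPLE_INSERT_IN_PROGRESS:
                /* not counted at all: the inserter reports it at commit/abort */
                break;
            case TUPLE_DELETE_IN_PROGRESS:
                /* counted as live: the deleter bumps the dead count if it commits */
                (*live)++;
                break;
        }
    }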

I feel fairly comfortable with this analysis for ANALYZE, and the
patch I posted yesterday can easily be adjusted to accommodate it.
However, what of VACUUM?  As that code stands, every non-removable
tuple (including RECENTLY_DEAD ones) is counted as live, and the
dead-tuples count gets reset to zero.  That seems clearly bogus.
But the other-transaction-commits-second hypothesis seems a good bit
more dubious for VACUUM than it is for ANALYZE.

Should we attempt to adjust VACUUM's accounting as well, or leave it
for 8.4?  For that matter, should adjusting ANALYZE be left for 8.4?
Thoughts?
        regards, tom lane


Re: VACUUM/ANALYZE counting of in-doubt tuples

From
Russell Smith
Date:
Hi,

Please read the below with some skepticism.  I am not an expert with
regard to statistics and vacuum internals.  Hopefully it just keeps the
thinking caps moving.

Tom Lane wrote:
> There was some discussion in pgsql-performance about the problem that
> the live-and-dead-tuple counts that ANALYZE reports to the stats
> collector don't reflect insert-in-progress tuples properly:
> http://archives.postgresql.org/pgsql-performance/2007-11/msg00225.php
> I proposed a patch here:
> http://archives.postgresql.org/pgsql-patches/2007-11/msg00169.php
> to switch ANALYZE over to using HeapTupleSatisfiesVacuum() to determine
> tuple status, which would at least give it awareness of tuples that are
> in-doubt because of concurrent transactions.
>
> On further thought though, that's not the whole story, and in fact
> VACUUM itself isn't doing very well at accounting for in-doubt tuples.
> The current implementation is that whatever live and dead tuple totals
> are arrived at by a VACUUM or ANALYZE are sent to the stats collector
> and simply overwrite its counters on arrival.  Meanwhile, the
> transaction that is responsible for an in-doubt tuple will send a
> stats message to increment either the live-tuple or dead-tuple count
> as appropriate when it commits or aborts.  If that happens before
> the VACUUM or ANALYZE completes, the increment will get overwritten
> by VACUUM/ANALYZE's total; if afterwards, the increment will get
> added onto the total.  So ideally we'd count the state change as already
> done if we knew the other transaction would commit first, otherwise
> not.
>   
So, given that autovacuum starts on the basis of the stats collector, the
problem we are having is that one vacuum will be followed by a second,
basically useless vacuum when a long-running insert transaction commits.
> Since we don't know whether the other transaction will commit or abort,
> much less when, it seems hopeless to have any really accurate accounting
> for in-doubt tuples.  We could avoid the whole problem if we gave up on
> having VACUUM/ANALYZE reset the stats counters (instead, VACUUM would
> just send an incremental dead-tuple-count reduction, and ANALYZE would
> do nothing); as indeed was already proposed by Itagaki-san.  But I am
> still convinced that that cure would be worse than the disease --- if
> the counters are driven entirely by incremental updates then there is
> no way to correct errors that will inevitably creep in, and the
> accumulation of errors will allow the counts to diverge arbitrarily
> far from reality.  I'm happier with the idea that the error in the
> counts can be bounded by the number of in-doubt tuples seen by the
> last VACUUM/ANALYZE.
>
> So having said all that, we need to devise heuristic rules that minimize
> the error.  In that light, I suggest that the accounting rules assume
> that the transactions responsible for in-doubt tuples will finish after
> the VACUUM/ANALYZE finishes.  That probably sounds wrong, because in
> most of the sorts of applications we think about, updating transactions
> are short and are likely to finish before the VACUUM/ANALYZE does.  But
> what I'm thinking is that a short transaction by definition can't change
> very many rows.  The cases that can seriously hurt the accuracy of our
> counts are long-running bulk inserts or updates, and for those it seems
> entirely likely that ANALYZE in particular could finish first.
>   
This may be the case, but I'm going to throw in the usual "where is your
evidence?" question.  Saying it doesn't actually make it so.  That makes
choosing the heuristic rules, and determining what they should be, very
important.

A potential thought for measuring them, and possibly accounting for tuple
changes, would be (sketched in code below):
1. At the beginning of vacuuming a table, cache the live/dead tuple
statistics.
2. At the end of vacuuming the table, compare the current stats with the
cached version from (1).
3. Wipe out the stats row and write the vacuum values, with the changes
between (1) and (2) included, as the final stats row.
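
As a rough sketch of that arithmetic (hypothetical structs, not the real
pgstat ones):

    /* Hypothetical counter snapshot; not the actual pgstat structs. */
    typedef struct
    {
        long live;
        long dead;
    } TupleCounts;

    /* Step (3): instead of letting VACUUM's totals simply overwrite the row,
     * re-apply the delta that other backends reported while VACUUM ran. */
    static TupleCounts
    merge_vacuum_counts(TupleCounts before,        /* cached at step (1) */
                        TupleCounts at_end,        /* read back at step (2) */
                        TupleCounts vacuum_result) /* what VACUUM itself counted */
    {
        TupleCounts final;

        final.live = vacuum_result.live + (at_end.live - before.live);
        final.dead = vacuum_result.dead + (at_end.dead - before.dead);
        return final;
    }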

All transactions that committed during the vacuum run will be included
in the changes (1) -> (2).  All transactions that rolled back will be in
the same boat.  The problem in my logic is: if vacuum starts and we check
page X, then a transaction alters page X, commits, and updates the stats,
and vacuum finishes and alters the stats again, we have double
counting :(  The goal is to come up with the best way to minimize the size
of the count error that could be experienced.

I suppose it all gets back to the original point.  What evidence do we
have for how big the problem is, and what cases are we fixing by a
particular heuristic rule?  I'm sure if it's a major problem we can
invent all sorts of complicated algorithms to track the stats correctly
and update them at the end of the vacuum.  Is that worth the performance
hit and the trouble?  If we adopt the wrong solution, can we make the
situation worse for other types of workloads?
> Given that assumption, then the rule should be to not count
> INSERT_IN_PROGRESS tuples at all --- they will be added to the 
> appropriate count when the inserting transaction commits or aborts.
> And DELETE_IN_PROGRESS tuples should be counted as live --- if the
> deleting transaction commits, it will increment the dead-tuple count
> at that time, and if it aborts then the tuple remains live.
>
> I feel fairly comfortable with this analysis for ANALYZE, and the
> patch I posted yesterday can easily be adjusted to accommodate it.
> However, what of VACUUM?  As that code stands, every non-removable
> tuple (including RECENTLY_DEAD ones) is counted as live, and the
> dead-tuples count gets reset to zero.  That seems clearly bogus.
> But the other-transaction-commits-second hypothesis seems a good bit
> more dubious for VACUUM than it is for ANALYZE.
>
> Should we attempt to adjust VACUUM's accounting as well, or leave it
> for 8.4?  For that matter, should adjusting ANALYZE be left for 8.4?
> Thoughts?
>   
Given all my unqualified statements above, I'd be tempted to make sure we
can measure the problem with 8.3 and get a proper solution into 8.4.
ANALYZE seems safe to change now.  It doesn't actually make changes to
the live/dead tuple counts (I don't think), so it is lower risk.  I feel
uneasy about the vacuum stuff, as we don't know the exact side effects a
change like that could have.  It could have a larger impact than
suspected.  And at beta3/rc1 I feel it's too late in the cycle.

Regards

Russell Smith


Re: VACUUM/ANALYZE counting of in-doubt tuples

From
Simon Riggs
Date:
On Sat, 2007-11-17 at 12:27 -0500, Tom Lane wrote:

> I feel fairly comfortable with this analysis for ANALYZE, and the
> patch I posted yesterday can easily be adjusted to accommodate it.
> However, what of VACUUM?  As that code stands, every non-removable
> tuple (including RECENTLY_DEAD ones) is counted as live, and the
> dead-tuples count gets reset to zero.  That seems clearly bogus.
> But the other-transaction-commits-second hypothesis seems a good bit
> more dubious for VACUUM than it is for ANALYZE.
> 
> Should we attempt to adjust VACUUM's accounting as well, or leave it
> for 8.4?  For that matter, should adjusting ANALYZE be left for 8.4?
> Thoughts?

Some random thoughts:

The question is, what % of the table is recently dead?  If it's zero, then
we have no problem.  We only care if it's a substantial number.

By fixing ANALYZE we are now also less likely to run a VACUUM while
executing a large COPY or DELETE.

At the end of VACUUM it takes a lock prior to truncation.  If it gets
that lock, it knows nobody else is locking the table.  In that case the
recently-dead count should be shown as dead, not live.

If we read the stats before VACUUM starts then we can use the number of
non-HOT updates and deletes made between start and finish as the number
of dead rows when VACUUM completes. 
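
A rough sketch of that correction, with hypothetical field names standing
in for the per-table counters:

    /* Hypothetical snapshot of the per-table activity counters. */
    typedef struct
    {
        long n_tuples_updated;   /* non-HOT updates only */
        long n_tuples_deleted;
    } TableActivity;

    /* Estimate the dead-row count to report when VACUUM completes as the
     * update/delete activity that happened while it was running. */
    static long
    dead_rows_at_vacuum_end(TableActivity at_start, TableActivity at_end)
    {
        return (at_end.n_tuples_updated - at_start.n_tuples_updated)
             + (at_end.n_tuples_deleted - at_start.n_tuples_deleted);
    }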

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: VACUUM/ANALYZE counting of in-doubt tuples

From
Alvaro Herrera
Date:
Tom Lane wrote:

> On further thought though, that's not the whole story, and in fact
> VACUUM itself isn't doing very well at accounting for in-doubt tuples.
> The current implementation is that whatever live and dead tuple totals
> are arrived at by a VACUUM or ANALYZE are sent to the stats collector
> and simply overwrite its counters on arrival.  Meanwhile, the
> transaction that is responsible for an in-doubt tuple will send a
> stats message to increment either the live-tuple or dead-tuple count
> as appropriate when it commits or aborts.  If that happens before
> the VACUUM or ANALYZE completes, the increment will get overwritten
> by VACUUM/ANALYZE's total; if afterwards, the increment will get
> added onto the total.  So ideally we'd count the state change as already
> done if we knew the other transaction would commit first, otherwise
> not.

How about this: let's have VACUUM send a message at the start of
processing the table.  pgstats saves the current counters for the table
somewhere and resets them to zero; and any transaction that sends
messages after that is counted toward the new counter.

When vacuum finishes and commits, it sends another message and pgstats
forgets the counters it saved.  At this point, the count of dead tuples
will be correct.  (If during vacuum anyone retrieves the number of dead
tuples, the logical thing would be to report the saved counter).

If vacuum aborts, it sends a message saying so and pgstats restores the
saved counter, adding whatever has been accumulated on the other counter
during the vacuum.

If the system crashes there is no problem because the stats are reset
anyway.
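
A small sketch of what the collector-side bookkeeping might look like
(hypothetical struct and function names, not the real pgstat code):

    #include <stdbool.h>

    /* Hypothetical per-table state inside the stats collector. */
    typedef struct
    {
        long dead_tuples;        /* counter visible to readers */
        long saved_dead_tuples;  /* snapshot taken when VACUUM announced its start */
        bool vacuum_in_progress;
    } TableStats;

    static void
    on_vacuum_start(TableStats *t)
    {
        t->saved_dead_tuples = t->dead_tuples;  /* remember the pre-vacuum count */
        t->dead_tuples = 0;                     /* later messages accumulate here */
        t->vacuum_in_progress = true;
    }

    static void
    on_vacuum_commit(TableStats *t)
    {
        t->saved_dead_tuples = 0;               /* forget the saved counter */
        t->vacuum_in_progress = false;
    }

    static void
    on_vacuum_abort(TableStats *t)
    {
        /* restore the saved counter plus whatever arrived during the vacuum */
        t->dead_tuples += t->saved_dead_tuples;
        t->saved_dead_tuples = 0;
        t->vacuum_in_progress = false;
    }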

-- 
Alvaro Herrera                 http://www.amazon.com/gp/registry/CTMLCN8V17R4
Este mail se entrega garantizadamente 100% libre de sarcasmo.


Re: VACUUM/ANALYZE counting of in-doubt tuples

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Tom Lane wrote:
>> On further thought though, that's not the whole story, and in fact
>> VACUUM itself isn't doing very well at accounting for in-doubt tuples.

> How about this: let's have VACUUM send a message at the start of
> processing the table.  pgstats saves the current counters for the table
> somewhere and resets them to zero; and any transaction that sends
> messages after that is counted to the new counter.

> When vacuum finishes and commits, it sends another message and pgstats
> forgets the counters it saved.  At this point, the count of dead tuples
> will be correct.  (If during vacuum anyone retrieves the number of dead
> tuples, the logical thing would be to report the saved counter).

No, that doesn't work (not to mention that adding two more counters
per-table will be a lot of bloat for the stats tables).

The race conditions are a lot more subtle than that.  The stats
collector cannot know when it receives a tabstat message after VACUUM
starts whether VACUUM has/will see the tuples involved, or whether it
will see them as committed or not.  That would depend on whether VACUUM
has yet reached the page(s) the tuples are in.  (Conversely tabstats
arriving shortly after the VACUUM completion report might or might not
correspond to tuples seen by VACUUM, though neither your proposal nor
mine tries to address that.)

AFAICS the only way to be 100% correct would be to track live/dead
counts on a page-by-page basis, which is obviously impractical.
(And I'm not sure even that works, given the possibility of stats
messages arriving at the collector out-of-order compared to the
actual page-changing operations.)

So we have to settle for an approximation, and that being the case
I'd rather not have an expensive approximation.
        regards, tom lane


Re: VACUUM/ANALYZE counting of in-doubt tuples

From
Simon Riggs
Date:
On Mon, 2007-11-19 at 10:38 -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > Tom Lane wrote:
> >> On further thought though, that's not the whole story, and in fact
> >> VACUUM itself isn't doing very well at accounting for in-doubt tuples.
> 
> > How about this: let's have VACUUM send a message at the start of
> > processing the table.  pgstats saves the current counters for the table
> > somewhere and resets them to zero; and any transaction that sends
> > messages after that is counted to the new counter.
> 
> > When vacuum finishes and commits, it sends another message and pgstats
> > forgets the counters it saved.  At this point, the count of dead tuples
> > will be correct.  (If during vacuum anyone retrieves the number of dead
> > tuples, the logical thing would be to report the saved counter).
> 
> No, that doesn't work (not to mention that adding two more counters
> per-table will be a lot of bloat for the stats tables).
> 
> The race conditions are a lot more subtle than that.  The stats
> collector cannot know when it receives a tabstat message after VACUUM
> starts whether VACUUM has/will see the tuples involved, or whether it
> will see them as committed or not.  That would depend on whether VACUUM
> has yet reached the page(s) the tuples are in.  (Conversely tabstats
> arriving shortly after the VACUUM completion report might or might not
> correspond to tuples seen by VACUUM, though neither your proposal nor
> mine tries to address that.)
> 
> AFAICS the only way to be 100% correct would be to track live/dead
> counts on a page-by-page basis, which is obviously impractical.
> (And I'm not sure even that works, given the possibility of stats
> messages arriving at the collector out-of-order compared to the
> actual page-changing operations.)
> 
> So we have to settle for an approximation, and that being the case
> I'd rather not have an expensive approximation.

I think the before-and-after approach can be made to work:

VACUUM just needs to save the counter in memory; it doesn't need to
write it anywhere else.

VACUUM can force the flush of the tabstat file so that there is no race
condition, or at least a minimised one.  We need only do that for larger
tables, where the chance of concurrent deletes is high enough that it's
worth making this special correction.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: VACUUM/ANALYZE counting of in-doubt tuples

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Mon, 2007-11-19 at 10:38 -0500, Tom Lane wrote:
>> The race conditions are a lot more subtle than that.  The stats
>> collector cannot know when it receives a tabstat message after VACUUM
>> starts whether VACUUM has/will see the tuples involved, or whether it
>> will see them as committed or not.  That would depend on whether VACUUM
>> has yet reached the page(s) the tuples are in.

> I think the before-and-after approach can be made to work:

> VACUUM just needs to save the counter in memory, it doesn't need to
> write that anywhere else.

> VACUUM can force the flush of the tabstat file so that there is no race
> condition, or at least a minimised one.

I don't think you understood what I said at all.  The race condition is
not "before vs after VACUUM starts", it is "before vs after when VACUUM
scans the page that the in-doubt tuple is in".
        regards, tom lane


Re: VACUUM/ANALYZE counting of in-doubt tuples

From
Simon Riggs
Date:
On Mon, 2007-11-19 at 13:33 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Mon, 2007-11-19 at 10:38 -0500, Tom Lane wrote:
> >> The race conditions are a lot more subtle than that.  The stats
> >> collector cannot know when it receives a tabstat message after VACUUM
> >> starts whether VACUUM has/will see the tuples involved, or whether it
> >> will see them as committed or not.  That would depend on whether VACUUM
> >> has yet reached the page(s) the tuples are in.
> 
> > I think the before-and-after approach can be made to work:
> 
> > VACUUM just needs to save the counter in memory, it doesn't need to
> > write that anywhere else.
> 
> > VACUUM can force the flush of the tabstat file so that there is no race
> > condition, or at least a minimised one.
> 
> I don't think you understood what I said at all.  The race condition is
> not "before vs after VACUUM starts", it is "before vs after when VACUUM
> scans the page that the in-doubt tuple is in".

I thought we were looking for heuristics, not exact accuracy? I
understand the visibility issues.

Right now, the larger the table, the more likely we are to see deletes
start and complete while we are doing the VACUUM.  So right now, the larger
the table, the larger the error, assuming a constant workload.

Large VACUUMs are currently not sweeping up all the rows they could, as
Heikki's patch for re-evaluating xmin and the associated test results
showed.  Those are the same rows that we are ignoring now.

Carrying on ignoring them probably isn't the right thing to do, even if
the exactly right thing to do isn't fully knowable.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: VACUUM/ANALYZE counting of in-doubt tuples

From
Alvaro Herrera
Date:
Tom Lane wrote:

> I feel fairly comfortable with this analysis for ANALYZE, and the
> patch I posted yesterday can easily be adjusted to accommodate it.
> However, what of VACUUM?  As that code stands, every non-removable
> tuple (including RECENTLY_DEAD ones) is counted as live, and the
> dead-tuples count gets reset to zero.  That seems clearly bogus.
> But the other-transaction-commits-second hypothesis seems a good bit
> more dubious for VACUUM than it is for ANALYZE.
> 
> Should we attempt to adjust VACUUM's accounting as well, or leave it
> for 8.4?  For that matter, should adjusting ANALYZE be left for 8.4?
> Thoughts?

Has this issue been a real problem?  If so, probably we should consider
adjusting ANALYZE for 8.3 per your proposal.

For VACUUM, I think one thing we should do to reduce the severity of the
problem is to send the pgstat message before attempting the truncation;
that way, fewer transactions are lost.  (There is still going to be a lot
of lost pgstat traffic when vacuum_delay is high.)  I am not sure about
further changes.

For 8.4 we could discuss more invasive changes.  Maybe send a pgstat
message just before each vacuum_delay sleep point?  This would have to
use the incremental update approach, which is probably better when
vacuum_delay is enabled.
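
For illustration only, the sleep-point reporting might look roughly like
this (pgstat_report_dead_tuple_delta() is purely hypothetical; it stands in
for whatever incremental message would have to be added):

    /* Purely hypothetical stand-in for an incremental pgstat message. */
    static void
    pgstat_report_dead_tuple_delta(long delta)
    {
        (void) delta;   /* would send a tabstat message with this increment */
    }

    /* Called just before each cost-delay sleep point: report only what has
     * not been reported yet, rather than overwriting the counter at the end. */
    static void
    report_progress_before_sleep(long removed_so_far, long *already_reported)
    {
        long delta = removed_so_far - *already_reported;

        if (delta > 0)
        {
            pgstat_report_dead_tuple_delta(delta);
            *already_reported = removed_so_far;
        }
    }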

-- 
Alvaro Herrera                  http://www.amazon.com/gp/registry/5ZYLFMCVHXC
"Right now the sectors on the hard disk run clockwise, but I heard a rumor that
you can squeeze 0.2% more throughput by running them counterclockwise.
It's worth the effort. Recommended."  (Gerry Pourwelle)


Re: VACUUM/ANALYZE counting of in-doubt tuples

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Tom Lane wrote:
>> Should we attempt to adjust VACUUM's accounting as well, or leave it
>> for 8.4?  For that matter, should adjusting ANALYZE be left for 8.4?
>> Thoughts?

> Has this issue been a real problem?  If so, probably we should consider
> adjusting ANALYZE for 8.3 per your proposal.

I'm not sure.  Upthread, two or three people said they thought they'd
seen autovac launching vacuums against tables during bulk inserts.
However, that could only happen if there were already a reason to launch
an auto-analyze (which could misreport dead tuples and thus trigger a
subsequent auto-vacuum), and in typical bulk load situations I don't see
why that would be very likely to happen.

I'm fine with leaving the whole issue for 8.4.
        regards, tom lane


Re: VACUUM/ANALYZE counting of in-doubt tuples

From
"Kevin Grittner"
Date:
>>> On Wed, Nov 21, 2007 at 12:32 PM, in message <15089.1195669979@sss.pgh.pa.us>,
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>> Has this issue been a real problem?  If so, probably we should consider
>> adjusting ANALYZE for 8.3 per your proposal.
>
> I'm not sure.  Upthread, two or three people said they thought they'd
> seen autovac launching vacuums against tables during bulk inserts.
> However, that could only happen if there were already a reason to launch
> an auto-analyze (which could misreport dead tuples and thus trigger a
> subsequent auto-vacuum), and in typical bulk load situations I don't see
> why that would be very likely to happen.
We had been in the habit of throwing a commit into our bulk loads
periodically (say, every 10,000 or 100,000 rows).  This was because
our prior database product needed to keep the entire transaction
image in a fixed-size transaction log until commit; if we didn't
commit now and then, the whole thing locked up and died.  I'm not
sure I've seen the behavior since we realized it was just an old
habit and went to a single transaction per table.
> I'm fine with leaving the whole issue for 8.4.
Perhaps a comment regarding the above should go into the documentation
for releases where this technique can be costly, suggesting a single
transaction or a suspension of autovacuum?
-Kevin