Thread: VACUUM FULL versus system catalog cache invalidation

VACUUM FULL versus system catalog cache invalidation

From: Tom Lane
I've been testing the problem reported by Dave Gould by running "make
installcheck-parallel" together with a tight loop of "vacuum full
pg_class":

while true; do psql -c "vacuum full pg_class" regression; usleep 100000; done

Even after fixing the cache-reset-recovery-order problem I described
yesterday, there are still occasional bizarre failures, for example
in this bit of truncate.sql:

BEGIN;
SELECT * FROM trunc_f;
TRUNCATE trunc_f;
SELECT * FROM trunc_f;
ROLLBACK;

BEGIN;
SELECT * FROM trunc_f;
TRUNCATE ONLY trunc_f;
ERROR:  attempted to update invisible tuple

I'm not sure that there's only one bug remaining here, but I've
identified this bug at least.  In the first transaction above, the
TRUNCATE has to update trunc_f's pg_class tuple with a new relfilenode
value.  Say the update invalidates the tuple at tid A and inserts the
updated version at tid B.  Then vacuum full gets control and rewrites
pg_class.  It correctly keeps both versions of the trunc_f tuple, but
now they will be at new tid locations, say C and D.  The original
transaction meanwhile was blocked trying to re-open pg_class to reload
trunc_f's pg_class tuple into its catcaches.  Now it successfully does
that, and the tuple it reads has tid D.  It finishes out the
transaction, and rolls back as per script.  In the rollback, we know
that we have to flush catcache entries for tuple versions inserted by
the failed transaction ... but what inval.c has stored is an inval event
against pg_class tid B.  There is no such entry in the catcaches, so
nothing happens, and the entry with tid D stays valid.  Ooops.
The relcache entry for trunc_f does get rebuilt correctly (without
consulting the bogus catcache entry), so the next SELECT goes through
all right.  But when the second TRUNCATE wants to again update the
pg_class tuple, it finds it in the catcaches ... and what it finds has
tid D, so it attempts to heap_update that TID instead of the actually
live tuple at tid C.
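
To make that sequence concrete, here is a toy standalone model of the
abort-time flush (invented struct and function names, compilable on its own,
not the real inval.c/catcache.c code).  The queued inval event still carries
the pre-rewrite TID B, while the surviving cache entry holds the relocated
TID D, so a TID-based match never fires; matching on the key hash alone, as
proposed further down, would still catch it:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for ItemPointerData and a catcache entry; all names invented. */
typedef struct { uint32_t block; uint16_t offset; } TupleId;

typedef struct
{
    uint32_t key_hash;      /* hash of the cache key, e.g. trunc_f's OID */
    TupleId  tid;           /* where the cached pg_class tuple now lives */
    bool     valid;
} ToyCacheEntry;

static bool
tid_equals(TupleId a, TupleId b)
{
    return a.block == b.block && a.offset == b.offset;
}

/* Current behavior: flush only if both the hash and the recorded TID match. */
static void
flush_by_tid(ToyCacheEntry *e, uint32_t hash, TupleId tid)
{
    if (e->key_hash == hash && tid_equals(e->tid, tid))
        e->valid = false;
}

/* Hash-only behavior: the TID is ignored, so relocated tuples can't escape. */
static void
flush_by_hash(ToyCacheEntry *e, uint32_t hash)
{
    if (e->key_hash == hash)
        e->valid = false;
}

int
main(void)
{
    uint32_t hash = 0xDEADBEEF;             /* hash of trunc_f's cache key */
    TupleId  tid_B = {10, 3};               /* TID recorded by the inval event */
    TupleId  tid_D = {2, 7};                /* TID after VACUUM FULL rewrote pg_class */
    ToyCacheEntry entry = {hash, tid_D, true};  /* catcache reloaded the tuple post-rewrite */

    flush_by_tid(&entry, hash, tid_B);      /* rollback: no TID match, nothing happens */
    printf("TID-based flush:  valid = %d  (stale entry survives)\n", entry.valid);

    flush_by_hash(&entry, hash);            /* hash-only matching does catch it */
    printf("hash-based flush: valid = %d\n", entry.valid);
    return 0;
}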

In this particular case, the incorrectly targeted tuple is the
just-invalidated version of trunc_f, so heap_update sees it as
XMIN_INVALID and fails with "attempted to update invisible tuple".

The potential consequences are hugely worse than that, though: with a
bit more time between the two operations, it'd be possible for someone
else to reclaim the dead tuple and replace it with something else.
As long as the TID is live when we get to it, heap_update will blindly
replace it, whether or not it has anything to do with the tuple we think
we're replacing.  I suspect that some of the other bizarre cases I'm
seeing are artifacts of that type of outcome, but they're hard to
reproduce so I've not been able to pin them down yet.

In principle this same type of failure could have occurred before 9.0,
since catcache inval has always been TID-based.  It turns out we were
saved by the fact that the old VACUUM FULL implementation would chicken
out and not do any tuple moving if it found any INSERT_IN_PROGRESS or
DELETE_IN_PROGRESS tuples.  The pre-9.0 versions of CLUSTER reject such
tuples too, so I don't think we need to do anything in those branches.

But in 9.0 and up, we have a problem.  So far I've thought of two possible
avenues to fix it:

1. When a SHAREDINVALCATALOG_ID inval message comes in (telling us a VAC
FULL or CLUSTER just finished on a system catalog), enter that message
into the transaction's local inval list as well as executing it
immediately.  On transaction abort, redo the resulting cache flushes a
second time.  This would ensure we flushed any bad entries even though
they were logged with obsolete TIDs elsewhere in the inval list.
(We might need to do this on commit too, though right now I don't think
so.  Also, a cache reset event would a fortiori have to queue a similar
entry to blow things away a second time at transaction end.)

2. Forget about targeting catcache invals by TID, and instead just use the
key hash value to determine which cache entries to drop.

Approach #2 seems a lot less invasive and more trustworthy, but it has the
disadvantage that cache invals would become more likely to blow away
entries unnecessarily (because of chance hashvalue collisions), even
without any VACUUM FULL being done.  If we could make approach #1 work
reliably, it would result in more overhead during VACUUM FULL but less at
other times --- or at least we could hope so.  In an environment where
lots of sinval overflows and consequent resets happen, we might come out
behind due to doubling the number of catcache flushes forced by a reset
event.

Right at the moment I'm leaning to approach #2.  I wonder if anyone
sees it differently, or has an idea for a third approach?

BTW, going forward it might be interesting to think about invalidating
the catcaches based on xmin/xmax rather than specific TIDs.  That would
reduce the sinval traffic to one message per transaction that had
affected the catalogs, instead of one per TID.  But that clearly is not
going to lead to something we'd dare back-patch.
        regards, tom lane


Re: VACUUM FULL versus system catalog cache invalidation

From: Robert Haas
On Fri, Aug 12, 2011 at 2:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> But in 9.0 and up, we have a problem.  So far I've thought of two possible
> avenues to fix it:
>
> 1. When a SHAREDINVALCATALOG_ID inval message comes in (telling us a VAC
> FULL or CLUSTER just finished on a system catalog), enter that message
> into the transaction's local inval list as well as executing it
> immediately.  On transaction abort, redo the resulting cache flushes a
> second time.  This would ensure we flushed any bad entries even though
> they were logged with obsolete TIDs elsewhere in the inval list.
> (We might need to do this on commit too, though right now I don't think
> so.  Also, a cache reset event would a fortiori have to queue a similar
> entry to blow things away a second time at transaction end.)
>
> 2. Forget about targeting catcache invals by TID, and instead just use the
> key hash value to determine which cache entries to drop.
>
> Approach #2 seems a lot less invasive and more trustworthy, but it has the
> disadvantage that cache invals would become more likely to blow away
> entries unnecessarily (because of chance hashvalue collisions), even
> without any VACUUM FULL being done.  If we could make approach #1 work
> reliably, it would result in more overhead during VACUUM FULL but less at
> other times --- or at least we could hope so.  In an environment where
> lots of sinval overflows and consequent resets happen, we might come out
> behind due to doubling the number of catcache flushes forced by a reset
> event.
>
> Right at the moment I'm leaning to approach #2.  I wonder if anyone
> sees it differently, or has an idea for a third approach?

I don't think it really matters whether we occasionally blow away an
entry unnecessarily due to a hash-value collision.  IIUC, we'd only
need to worry about hash-value collisions between rows in the same
catalog; and the number of entries that we have cached had better be
many orders of magnitude less than 2^32.  If the cache is large enough
that we're having hash value collisions more than once in a great
while, we probably should have flushed some entries out of it a whole
lot sooner and a whole lot more aggressively, because we're likely
eating memory like crazy.
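
For scale, here is a quick back-of-the-envelope calculation, assuming
uniformly distributed 32-bit hash values and a made-up working set of 10,000
entries cached from a single catalog:

/* cc collide.c -lm */
#include <math.h>
#include <stdio.h>

int
main(void)
{
    const double space = 4294967296.0;      /* 2^32 possible hash values */
    const double n = 10000.0;               /* assumed cached entries from one catalog */

    /* Chance that a given inval's hash also matches some unrelated cached entry. */
    double spurious_per_inval = n / space;

    /* Birthday bound: chance that any two of the n cached entries collide at all. */
    double any_collision = 1.0 - exp(-n * (n - 1.0) / (2.0 * space));

    printf("spurious flushes per inval: ~%.2e\n", spurious_per_inval);     /* ~2.3e-06 */
    printf("any collision among %.0f entries: ~%.2e\n", n, any_collision); /* ~1.2e-02 */
    return 0;
}

Even the any-collision figure is only about one percent, and a collision
costs nothing worse than an unnecessary reload of one cache entry.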

On the other hand, having to duplicate sinval resets seems like a bad
thing.  So +1 for #2.

> BTW, going forward it might be interesting to think about invalidating
> the catcaches based on xmin/xmax rather than specific TIDs.  That would
> reduce the sinval traffic to one message per transaction that had
> affected the catalogs, instead of one per TID.  But that clearly is not
> going to lead to something we'd dare back-patch.

To be honest, I am not real keen on back-patching either of the
options you list above, either.  I think we should probably do
something about the problem Dave Gould reported (to wit: sinval resets
need to be smarter about the order they do things in), but I am not
sure it's a good idea to back-patch what seems like a fairly radical
overhaul of the sinval mechanism to fix a problem nobody's actually
complained about hitting in production, especially given there's no
certainty that this is the last bug.  Perhaps we should just fix this
one in master and consider back-patching it if and when we get some
plausibly related bug reports.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: VACUUM FULL versus system catalog cache invalidation

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Aug 12, 2011 at 2:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 2. Forget about targeting catcache invals by TID, and instead just use the
>> key hash value to determine which cache entries to drop.

>> Right at the moment I'm leaning to approach #2.  I wonder if anyone
>> sees it differently, or has an idea for a third approach?

> To be honest, I am not real keen on back-patching either of the
> options you list above, either.  I think we should probably do
> something about the problem Dave Gould reported (to wit: sinval resets
> need to be smarter about the order they do things in), but I am not
> sure it's a good idea to back-patch what seems like a fairly radical
> overhaul of the sinval mechanism to fix a problem nobody's actually
> complained about hitting in production, especially given there's no
> certainty that this is the last bug.

What radical overhaul?  I'm talking about removing one if-test in
CatalogCacheIdInvalidate, viz
            nextelt = DLGetSucc(elt);

            if (hashValue != ct->hash_value)
                continue;           /* ignore non-matching hash values */

-           if (ct->negative ||
-               ItemPointerEquals(pointer, &ct->tuple.t_self))
            {
                if (ct->refcount > 0 ||
                    (ct->c_list && ct->c_list->refcount > 0))
                {

In any case, it is now clear to me that this bug is capable of eating
peoples' databases, as in "what just happened to our most critical
table?  Uh, it's not there anymore, boss, but we seem to have duplicate
pg_class entries for this other table".  This may not have happened
yet in the field, but that's probably only because 9.0 hasn't been in
production that long.  Doing nothing about it in the released branches
is unacceptable IMO.  People that that happens to will not be satisfied
with a response like "you shouldn't have been using VACUUM FULL on
system catalogs, it's buggy".

For that matter, I'm not convinced that Gould hasn't seen some form of
this --- he's mentioned seeing bizarre errors above and beyond the
"could not find pg_class tuple for index 2662" one.

As for it not being the last bug, no, it's probably not.  That's a
pretty sucky excuse for not fixing it too.
        regards, tom lane


Re: VACUUM FULL versus system catalog cache invalidation

From: Kevin Grittner
Robert Haas <robertmhaas@gmail.com> wrote:
> Perhaps we should just fix this one in master and consider
> back-patching it if and when we get some plausibly related bug
> reports.
I'm not completely clear on what one would do to be vulnerable to
hitting the bug, or what the impact of hitting it would be.  Tom
said:
> The potential consequences are hugely worse than that, though:
> with a bit more time between the two operations, it'd be possible
> for someone else to reclaim the dead tuple and replace it with
> something else.  As long as the TID is live when we get to it,
> heap_update will blindly replace it, whether or not it has
> anything to do with the tuple we think we're replacing.
What would a worst-case bug report based on hitting that look like?
-Kevin


Re: VACUUM FULL versus system catalog cache invalidation

From: Kevin Grittner
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In any case, it is now clear to me that this bug is capable of
> eating peoples' databases, as in "what just happened to our most
> critical table?  Uh, it's not there anymore, boss, but we seem to
> have duplicate pg_class entries for this other table".
Based on this, I don't think we want to wait for bug reports.  One
would be too many.
+1 for backpatching a fix.
> As for it not being the last bug, no, it's probably not.  That's a
> pretty sucky excuse for not fixing it too.
I agree.
-Kevin


Re: VACUUM FULL versus system catalog cache invalidation

From: Heikki Linnakangas
On 12.08.2011 21:49, Robert Haas wrote:
> On Fri, Aug 12, 2011 at 2:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 2. Forget about targeting catcache invals by TID, and instead just use the
>> key hash value to determine which cache entries to drop.
>>
>> Approach #2 seems a lot less invasive and more trustworthy, but it has the
>> disadvantage that cache invals would become more likely to blow away
>> entries unnecessarily (because of chance hashvalue collisions), even
>> without any VACUUM FULL being done.  If we could make approach #1 work
>> reliably, it would result in more overhead during VACUUM FULL but less at
>> other times --- or at least we could hope so.  In an environment where
>> lots of sinval overflows and consequent resets happen, we might come out
>> behind due to doubling the number of catcache flushes forced by a reset
>> event.
>>
>> Right at the moment I'm leaning to approach #2.  I wonder if anyone
>> sees it differently, or has an idea for a third approach?
>
> I don't think it really matters whether we occasionally blow away an
> entry unnecessarily due to a hash-value collision.  IIUC, we'd only
> need to worry about hash-value collisions between rows in the same
> catalog; and the number of entries that we have cached had better be
> many orders of magnitude less than 2^32.  If the cache is large enough
> that we're having hash value collisions more than once in a great
> while, we probably should have flushed some entries out of it a whole
> lot sooner and a whole lot more aggressively, because we're likely
> eating memory like crazy.

What would suck, though, is if you have an application that repeatedly 
creates and drops a temporary table, and the hash value for that happens 
to match some other table in the database. catcache invalidation would 
keep flushing the entry for that other table too, and you couldn't do 
anything about it except for renaming one of the tables.

Despite that, +1 for option #2. The risk of collision seems acceptable, 
and the consequence of a collision wouldn't be too bad in most 
applications anyway.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: VACUUM FULL versus system catalog cache invalidation

From: Tom Lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 12.08.2011 21:49, Robert Haas wrote:
>> I don't think it really matters whether we occasionally blow away an
>> entry unnecessarily due to a hash-value collision.  IIUC, we'd only
>> need to worry about hash-value collisions between rows in the same
>> catalog; and the number of entries that we have cached had better be
>> many orders of magnitude less than 2^32.  If the cache is large enough
>> that we're having hash value collisions more than once in a great
>> while, we probably should have flushed some entries out of it a whole
>> lot sooner and a whole lot more aggressively, because we're likely
>> eating memory like crazy.

> What would suck, though, is if you have an application that repeatedly 
> creates and drops a temporary table, and the hash value for that happens 
> to match some other table in the database. catcache invalidation would 
> keep flushing the entry for that other table too, and you couldn't do 
> anything about it except for renaming one of the tables.

> Despite that, +1 for option #2. The risk of collision seems acceptable, 
> and the consequence of a collision wouldn't be too bad in most 
> applications anyway.

Yeah.  Also, to my mind this is only a fix that will be used in 9.0 and
9.1 --- now that it's occurred to me that we could use tuple xmin/xmax
to invalidate catcaches instead of recording individual TIDs, I'm
excited about doing that instead for 9.2 and beyond.  I believe that
that could result in a significant reduction in sinval traffic, which
would be a considerable performance win.
        regards, tom lane


Re: VACUUM FULL versus system catalog cache invalidation

From: Robert Haas
On Fri, Aug 12, 2011 at 3:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In any case, it is now clear to me that this bug is capable of eating
> peoples' databases, as in "what just happened to our most critical
> table?  Uh, it's not there anymore, boss, but we seem to have duplicate
> pg_class entries for this other table".

OK, I take your point.  That's certainly scary enough to worry about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: VACUUM FULL versus system catalog cache invalidation

From: Simon Riggs
On Fri, Aug 12, 2011 at 7:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Right at the moment I'm leaning to approach #2.  I wonder if anyone
> sees it differently, or has an idea for a third approach?

You are trying to solve the problem directly, which seems overkill.

With HOT, there is very little need to perform a VACUUM FULL on any
shared catalog table. Look at the indexes...

I would suggest that VACUUM FULL perform only a normal VACUUM on
shared catalog tables, then perform an actual VACUUM FULL only in dire
need (some simple heuristic in size and density). This avoids doing a
VACUUM FULL unless it is actually necessary to do so. That has the
added advantage of not locking out essential tables, which is always a
concern.

In the unlikely event we do actually have to VACUUM FULL a shared
catalog table, nuke any cache entry for the whole shared catalog. That
way we absolutely and positively will never get any more bugs in this
area, ever again. Sounds harsh, but these events are only actually
needed very, very rarely and hygiene is more important than a few
minor points of performance.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: VACUUM FULL versus system catalog cache invalidation

From: daveg
On Fri, Aug 12, 2011 at 09:26:02PM +0100, Simon Riggs wrote:
> With HOT, there is very little need to perform a VACUUM FULL on any
> shared catalog table. Look at the indexes...
> 
> I would suggest that VACUUM FULL perform only a normal VACUUM on
> shared catalog tables, then perform an actual VACUUM FULL only in dire
> need (some simple heuristic in size and density). This avoids doing a
> VACUUM FULL unless it is actually necessary to do so. That has the
> added advantage of not locking out essential tables, which is always a
> concern.
> 
> In the unlikely event we do actually have to VACUUM FULL a shared
> catalog table, nuke any cache entry for the whole shared catalog. That
> way we absolutely and positively will never get any more bugs in this
> area, ever again. Sounds harsh, but these events are only actually
> needed very, very rarely and hygiene is more important than a few
> minor points of performance.

This is a very optimistic view. My client makes heavy use of temp tables.
HOT and autovacuum are not sufficient to keep catalog bloat under control.
We run a daily script that calculates the density of the catalog and only
vacuum fulls those that are severely bloated. Here is a result from a
recent bloat check on one db. 'packed' is the number of pages needed for
the rows if they were packed, 'bloat' is the multiple of pages in use over
the number really needed.
    relation      | tuples | pages | packed | bloat
------------------+--------+-------+--------+-------
 pg_class; --     |   4292 | 10619 |    114 |  93.2
 pg_depend; --    |  25666 |  7665 |    217 |  35.4
 pg_attrdef; --   |   6585 |  7595 |    236 |  32.2
 pg_type; --      |   4570 |  8177 |    416 |  19.6
 pg_shdepend; --  |  52040 |  7968 |    438 |  18.2

-dg
-- 
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.


Re: VACUUM FULL versus system catalog cache invalidation

From: Tom Lane
Simon Riggs <simon@2ndQuadrant.com> writes:
> With HOT, there is very little need to perform a VACUUM FULL on any
> shared catalog table. Look at the indexes...

This is not really a useful argument.  People do do VAC FULL on
catalogs, whether we think they should or not.  Also, it's not only
"shared" catalogs that are at stake.

We could avoid fixing the bug if we forbade both VAC FULL and CLUSTER
on all system catalogs, period, no exceptions.  That doesn't seem like
a workable answer though.  Some usage patterns do see bloat on the
catalogs, especially if you disable or dial down autovacuum.

> In the unlikely event we do actually have to VACUUM FULL a shared
> catalog table, nuke any cache entry for the whole shared catalog.

You're still not getting the point.  We *do* nuke all cache entries
after a VAC FULL.  That does not avoid this bug.
        regards, tom lane


Re: VACUUM FULL versus system catalog cache invalidation

From: Tom Lane
I wrote:
> Yeah.  Also, to my mind this is only a fix that will be used in 9.0 and
> 9.1 --- now that it's occurred to me that we could use tuple xmin/xmax
> to invalidate catcaches instead of recording individual TIDs, I'm
> excited about doing that instead for 9.2 and beyond.  I believe that
> that could result in a significant reduction in sinval traffic, which
> would be a considerable performance win.

On closer inspection this idea doesn't seem workable.  I was imagining
that after a transaction commits, we could find obsoleted catcache
entries by looking for tuples with xmax equal to the transaction's XID.
But a catcache entry made before the transaction had done the update
wouldn't contain the correct xmax, so we'd fail to remove it.  The only
apparent way to fix that would be to go out to disk and consult the
current on-disk xmax, which would hardly be any cheaper than just
dropping the cache entry and then reloading it when/if needed.
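
A toy illustration of that ordering problem (again invented names, not
PostgreSQL code): the cached copy was taken before our UPDATE, so it still
carries the old xmax, and a commit-time sweep keyed on "xmax equals my XID"
never sees it.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t Xid;

typedef struct
{
    uint32_t key_hash;
    Xid      xmax;          /* xmax as it was when the tuple was cached */
    bool     valid;
} ToyCacheEntry;

/* Hypothetical commit-time sweep: drop entries whose cached xmax is ours. */
static void
flush_obsoleted_by(ToyCacheEntry *e, Xid my_xid)
{
    if (e->xmax == my_xid)
        e->valid = false;
}

int
main(void)
{
    Xid my_xid = 1234;

    /* Entry cached *before* our UPDATE: its xmax is still invalid (0). */
    ToyCacheEntry entry = {0xDEADBEEF, 0, true};

    /*
     * Our UPDATE stamps xmax = 1234 on the on-disk tuple, but nothing
     * rewrites the already-cached copy, so the sweep finds no match.
     */
    flush_obsoleted_by(&entry, my_xid);

    printf("valid = %d  (stale entry survives the commit-time sweep)\n", entry.valid);
    return 0;
}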

All is not entirely lost, however: there's still some possible
performance benefit to be gained here, if we go to the scheme of
identifying victim catcache entries by hashvalue only.  Currently,
each heap_update in a cached catalog has to issue two sinval messages
(per cache!): one against the old TID and one against the new TID.
We'd be able to reduce that to one message in the common case where the
hashvalue remains the same because the cache key columns didn't change.
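
A sketch of where that saving comes from (invented message struct and helper
names, not the actual inval.c interface): hash the key columns of the old and
new tuple versions and emit the second message only when the hashes differ.

#include <stdint.h>
#include <stdio.h>

/* Invented stand-ins for a catcache key and a sinval catcache message. */
typedef struct { uint32_t key_hash; } TupleKey;
typedef struct { int cache_id; uint32_t hash_value; } ToyInvalMessage;

static void
send_message(ToyInvalMessage msg)
{
    printf("sinval: cache %d, hash %08x\n", msg.cache_id, (unsigned) msg.hash_value);
}

/*
 * On heap_update of a cached catalog row: one message per affected hash
 * value.  If the key columns didn't change, the old and new hashes are
 * equal and a single message suffices, versus one per TID today.
 */
static void
register_update_inval(int cache_id, TupleKey oldkey, TupleKey newkey)
{
    send_message((ToyInvalMessage){cache_id, oldkey.key_hash});
    if (newkey.key_hash != oldkey.key_hash)
        send_message((ToyInvalMessage){cache_id, newkey.key_hash});
}

int
main(void)
{
    TupleKey before = {0x5EEDF00D};
    TupleKey after_same = {0x5EEDF00D};     /* e.g. relfilenode changed, key didn't */
    TupleKey after_diff = {0x0BADCAFE};     /* e.g. the relation was renamed */

    register_update_inval(42, before, after_same);   /* one message */
    register_update_inval(42, before, after_diff);   /* two messages */
    return 0;
}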

Another thing we could consider doing, if one-in-2^32 hash collisions
seems too high, is widening the hash values to 64 bits.  I'm not
terribly excited about that, because a lot of the caches are on OID
columns for which there'd be zero benefit.
        regards, tom lane


Re: VACUUM FULL versus system catalog cache invalidation

From: Robert Haas
On Sat, Aug 13, 2011 at 12:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Yeah.  Also, to my mind this is only a fix that will be used in 9.0 and
>> 9.1 --- now that it's occurred to me that we could use tuple xmin/xmax
>> to invalidate catcaches instead of recording individual TIDs, I'm
>> excited about doing that instead for 9.2 and beyond.  I believe that
>> that could result in a significant reduction in sinval traffic, which
>> would be a considerable performance win.
>
> On closer inspection this idea doesn't seem workable.  I was imagining
> that after a transaction commits, we could find obsoleted catcache
> entries by looking for tuples with xmax equal to the transaction's XID.
> But a catcache entry made before the transaction had done the update
> wouldn't contain the correct xmax, so we'd fail to remove it.  The only
> apparent way to fix that would be to go out to disk and consult the
> current on-disk xmax, which would hardly be any cheaper than just
> dropping the cache entry and then reloading it when/if needed.
>
> All is not entirely lost, however: there's still some possible
> performance benefit to be gained here, if we go to the scheme of
> identifying victim catcache entries by hashvalue only.  Currently,
> each heap_update in a cached catalog has to issue two sinval messages
> (per cache!): one against the old TID and one against the new TID.
> We'd be able to reduce that to one message in the common case where the
> hashvalue remains the same because the cache key columns didn't change.

Cool.

> Another thing we could consider doing, if one-in-2^32 hash collisions
> seems too high, is widening the hash values to 64 bits.  I'm not
> terribly excited about that, because a lot of the caches are on OID
> columns for which there'd be zero benefit.

Yeah, and I can't get excited about the increased memory usage, either.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company