Thread: VACUUM FULL versus system catalog cache invalidation
I've been testing the problem reported by Dave Gould by running "make installcheck-parallel" together with a tight loop of "vacuum full pg_class":

    while true; do psql -c "vacuum full pg_class" regression; usleep 100000; done

Even after fixing the cache-reset-recovery-order problem I described yesterday, there are still occasional bizarre failures, for example in this bit of truncate.sql:

    BEGIN;
    SELECT * FROM trunc_f;
    TRUNCATE trunc_f;
    SELECT * FROM trunc_f;
    ROLLBACK;

    BEGIN;
    SELECT * FROM trunc_f;
    TRUNCATE ONLY trunc_f;
    ERROR: attempted to update invisible tuple

I'm not sure that there's only one bug remaining here, but I've identified this bug at least.

In the first transaction above, the TRUNCATE has to update trunc_f's pg_class tuple with a new relfilenode value. Say the update invalidates the tuple at tid A and inserts the updated version at tid B. Then vacuum full gets control and rewrites pg_class. It correctly keeps both versions of the trunc_f tuple, but now they will be at new tid locations, say C and D. The original transaction meanwhile was blocked trying to re-open pg_class to reload trunc_f's pg_class tuple into its catcaches. Now it successfully does that, and the tuple it reads has tid D. It finishes out the transaction, and rolls back as per script. In the rollback, we know that we have to flush catcache entries for tuple versions inserted by the failed transaction ... but what inval.c has stored is an inval event against pg_class tid B. There is no such entry in the catcaches, so nothing happens, and the entry with tid D stays valid. Ooops.

The relcache entry for trunc_f does get rebuilt correctly (without consulting the bogus catcache entry), so the next SELECT goes through all right. But when the second TRUNCATE wants to again update the pg_class tuple, it finds it in the catcaches ... and what it finds has tid D, so it attempts to heap_update that TID instead of the actually live tuple at tid C.

In this particular case, the incorrectly targeted tuple is the just-invalidated version of trunc_f, so heap_update sees it as XMIN_INVALID and fails with "attempted to update invisible tuple". The potential consequences are hugely worse than that, though: with a bit more time between the two operations, it'd be possible for someone else to reclaim the dead tuple and replace it with something else. As long as the TID is live when we get to it, heap_update will blindly replace it, whether or not it has anything to do with the tuple we think we're replacing. I suspect that some of the other bizarre cases I'm seeing are artifacts of that type of outcome, but they're hard to reproduce so I've not been able to pin them down yet.

In principle this same type of failure could have occurred before 9.0, since catcache inval has always been TID-based. It turns out we were saved by the fact that the old VACUUM FULL implementation would chicken out and not do any tuple moving if it found any INSERT_IN_PROGRESS or DELETE_IN_PROGRESS tuples. The pre-9.0 versions of CLUSTER reject such tuples too, so I don't think we need to do anything in those branches. But in 9.0 and up, we have a problem. So far I've thought of two possible avenues to fix it:

1. When a SHAREDINVALCATALOG_ID inval message comes in (telling us a VAC FULL or CLUSTER just finished on a system catalog), enter that message into the transaction's local inval list as well as executing it immediately. On transaction abort, redo the resulting cache flushes a second time. This would ensure we flushed any bad entries even though they were logged with obsolete TIDs elsewhere in the inval list. (We might need to do this on commit too, though right now I don't think so. Also, a cache reset event would a fortiori have to queue a similar entry to blow things away a second time at transaction end.)

2. Forget about targeting catcache invals by TID, and instead just use the key hash value to determine which cache entries to drop.

Approach #2 seems a lot less invasive and more trustworthy, but it has the disadvantage that cache invals would become more likely to blow away entries unnecessarily (because of chance hashvalue collisions), even without any VACUUM FULL being done. If we could make approach #1 work reliably, it would result in more overhead during VACUUM FULL but less at other times --- or at least we could hope so. In an environment where lots of sinval overflows and consequent resets happen, we might come out behind due to doubling the number of catcache flushes forced by a reset event.

Right at the moment I'm leaning to approach #2. I wonder if anyone sees it differently, or has an idea for a third approach?

BTW, going forward it might be interesting to think about invalidating the catcaches based on xmin/xmax rather than specific TIDs. That would reduce the sinval traffic to one message per transaction that had affected the catalogs, instead of one per TID. But that clearly is not going to lead to something we'd dare back-patch.

			regards, tom lane
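A small standalone sketch of the failure sequence Tom describes. This is toy code, not PostgreSQL source; the tid numbers, the hash value, and the CacheEntry struct are invented purely for illustration. It shows why a flush keyed on the TID remembered at update time misses the cache entry once VACUUM FULL has relocated the tuple, while a flush keyed on the cache-key hash value (approach #2) still catches it:

    /*
     * Toy model of the two invalidation strategies (not PostgreSQL code).
     * "tid" stands in for an ItemPointer and "hash" for the catcache key
     * hash value; all the concrete numbers are invented.
     */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct CacheEntry
    {
        unsigned int tid;       /* where the cached copy was read from */
        unsigned int hash;      /* hash of the cache key columns */
        bool        valid;
    } CacheEntry;

    /* Current scheme: flush by the TID recorded at update time. */
    static void
    flush_by_tid(CacheEntry *e, unsigned int tid)
    {
        if (e->valid && e->tid == tid)
            e->valid = false;
    }

    /* Approach #2: flush anything whose key hash matches, ignoring TIDs. */
    static void
    flush_by_hash(CacheEntry *e, unsigned int hash)
    {
        if (e->valid && e->hash == hash)
            e->valid = false;
    }

    int
    main(void)
    {
        /* The TRUNCATE's update left inval.c holding tid B = 11. */
        unsigned int inval_tid = 11;
        unsigned int key_hash = 0xfeedbeef;     /* hash of trunc_f's key */

        /*
         * VACUUM FULL rewrote pg_class, so the version the backend then
         * loaded into its catcache sits at tid D = 21, not B.
         */
        CacheEntry  entry = {.tid = 21, .hash = key_hash, .valid = true};

        flush_by_tid(&entry, inval_tid);
        printf("TID-based flush:  valid = %d (stale entry survives)\n",
               entry.valid);

        flush_by_hash(&entry, key_hash);
        printf("hash-based flush: valid = %d (stale entry gone)\n",
               entry.valid);
        return 0;
    }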
On Fri, Aug 12, 2011 at 2:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> But in 9.0 and up, we have a problem. So far I've thought of two possible
> avenues to fix it:
>
> 1. When a SHAREDINVALCATALOG_ID inval message comes in (telling us a VAC
> FULL or CLUSTER just finished on a system catalog), enter that message
> into the transaction's local inval list as well as executing it
> immediately. On transaction abort, redo the resulting cache flushes a
> second time. This would ensure we flushed any bad entries even though
> they were logged with obsolete TIDs elsewhere in the inval list.
> (We might need to do this on commit too, though right now I don't think
> so. Also, a cache reset event would a fortiori have to queue a similar
> entry to blow things away a second time at transaction end.)
>
> 2. Forget about targeting catcache invals by TID, and instead just use the
> key hash value to determine which cache entries to drop.
>
> Approach #2 seems a lot less invasive and more trustworthy, but it has the
> disadvantage that cache invals would become more likely to blow away
> entries unnecessarily (because of chance hashvalue collisions), even
> without any VACUUM FULL being done. If we could make approach #1 work
> reliably, it would result in more overhead during VACUUM FULL but less at
> other times --- or at least we could hope so. In an environment where
> lots of sinval overflows and consequent resets happen, we might come out
> behind due to doubling the number of catcache flushes forced by a reset
> event.
>
> Right at the moment I'm leaning to approach #2. I wonder if anyone
> sees it differently, or has an idea for a third approach?

I don't think it really matters whether we occasionally blow away an entry unnecessarily due to a hash-value collision. IIUC, we'd only need to worry about hash-value collisions between rows in the same catalog; and the number of entries that we have cached had better be many orders of magnitude less than 2^32. If the cache is large enough that we're having hash value collisions more than once in a great while, we probably should have flushed some entries out of it a whole lot sooner and a whole lot more aggressively, because we're likely eating memory like crazy. On the other hand, having to duplicate sinval resets seems like a bad thing. So +1 for #2.

> BTW, going forward it might be interesting to think about invalidating
> the catcaches based on xmin/xmax rather than specific TIDs. That would
> reduce the sinval traffic to one message per transaction that had
> affected the catalogs, instead of one per TID. But that clearly is not
> going to lead to something we'd dare back-patch.

To be honest, I am not real keen on back-patching either of the options you list above, either. I think we should probably do something about the problem Dave Gould reported (to wit: sinval resets need to be smarter about the order they do things in), but I am not sure it's a good idea to back-patch what seems like a fairly radical overhaul of the sinval mechanism to fix a problem nobody's actually complained about hitting in production, especially given there's no certainty that this is the last bug. Perhaps we should just fix this one in master and consider back-patching it if and when we get some plausibly related bug reports.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Aug 12, 2011 at 2:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 2. Forget about targeting catcache invals by TID, and instead just use the
>> key hash value to determine which cache entries to drop.
>> Right at the moment I'm leaning to approach #2. I wonder if anyone
>> sees it differently, or has an idea for a third approach?

> To be honest, I am not real keen on back-patching either of the
> options you list above, either. I think we should probably do
> something about the problem Dave Gould reported (to wit: sinval resets
> need to be smarter about the order they do things in), but I am not
> sure it's a good idea to back-patch what seems like a fairly radical
> overhaul of the sinval mechanism to fix a problem nobody's actually
> complained about hitting in production, especially given there's no
> certainty that this is the last bug.

What radical overhaul? I'm talking about removing one if-test in CatalogCacheIdInvalidate, viz

        nextelt = DLGetSucc(elt);

        if (hashValue != ct->hash_value)
            continue;           /* ignore non-matching hash values */

-       if (ct->negative ||
-           ItemPointerEquals(pointer, &ct->tuple.t_self))
        {
            if (ct->refcount > 0 ||
                (ct->c_list && ct->c_list->refcount > 0))
            {

In any case, it is now clear to me that this bug is capable of eating people's databases, as in "what just happened to our most critical table? Uh, it's not there anymore, boss, but we seem to have duplicate pg_class entries for this other table". This may not have happened yet in the field, but that's probably only because 9.0 hasn't been in production that long. Doing nothing about it in the released branches is unacceptable IMO. People that that happens to will not be satisfied with a response like "you shouldn't have been using VACUUM FULL on system catalogs, it's buggy". For that matter, I'm not convinced that Gould hasn't seen some form of this --- he's mentioned seeing bizarre errors above and beyond the "could not find pg_class tuple for index 2662" one.

As for it not being the last bug, no, it's probably not. That's a pretty sucky excuse for not fixing it too.

			regards, tom lane
Robert Haas <robertmhaas@gmail.com> wrote:
> Perhaps we should just fix this one in master and consider
> back-patching it if and when we get some plausibly related bug
> reports.

I'm not completely clear on what one would do to be vulnerable to hitting the bug, or what the impact of hitting it would be. Tom said:

> The potential consequences are hugely worse than that, though:
> with a bit more time between the two operations, it'd be possible
> for someone else to reclaim the dead tuple and replace it with
> something else. As long as the TID is live when we get to it,
> heap_update will blindly replace it, whether or not it has
> anything to do with the tuple we think we're replacing.

What would a worst-case bug report based on hitting that look like?

-Kevin
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In any case, it is now clear to me that this bug is capable of
> eating people's databases, as in "what just happened to our most
> critical table? Uh, it's not there anymore, boss, but we seem to
> have duplicate pg_class entries for this other table".

Based on this, I don't think we want to wait for bug reports. One would be too many. +1 for backpatching a fix.

> As for it not being the last bug, no, it's probably not. That's a
> pretty sucky excuse for not fixing it too.

I agree.

-Kevin
On 12.08.2011 21:49, Robert Haas wrote:
> On Fri, Aug 12, 2011 at 2:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 2. Forget about targeting catcache invals by TID, and instead just use the
>> key hash value to determine which cache entries to drop.
>>
>> Approach #2 seems a lot less invasive and more trustworthy, but it has the
>> disadvantage that cache invals would become more likely to blow away
>> entries unnecessarily (because of chance hashvalue collisions), even
>> without any VACUUM FULL being done. If we could make approach #1 work
>> reliably, it would result in more overhead during VACUUM FULL but less at
>> other times --- or at least we could hope so. In an environment where
>> lots of sinval overflows and consequent resets happen, we might come out
>> behind due to doubling the number of catcache flushes forced by a reset
>> event.
>>
>> Right at the moment I'm leaning to approach #2. I wonder if anyone
>> sees it differently, or has an idea for a third approach?
>
> I don't think it really matters whether we occasionally blow away an
> entry unnecessarily due to a hash-value collision. IIUC, we'd only
> need to worry about hash-value collisions between rows in the same
> catalog; and the number of entries that we have cached had better be
> many orders of magnitude less than 2^32. If the cache is large enough
> that we're having hash value collisions more than once in a great
> while, we probably should have flushed some entries out of it a whole
> lot sooner and a whole lot more aggressively, because we're likely
> eating memory like crazy.

What would suck, though, is if you have an application that repeatedly creates and drops a temporary table, and the hash value for that happens to match some other table in the database. catcache invalidation would keep flushing the entry for that other table too, and you couldn't do anything about it except for renaming one of the tables.

Despite that, +1 for option #2. The risk of collision seems acceptable, and the consequence of a collision wouldn't be too bad in most applications anyway.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 12.08.2011 21:49, Robert Haas wrote:
>> I don't think it really matters whether we occasionally blow away an
>> entry unnecessarily due to a hash-value collision. IIUC, we'd only
>> need to worry about hash-value collisions between rows in the same
>> catalog; and the number of entries that we have cached had better be
>> many orders of magnitude less than 2^32. If the cache is large enough
>> that we're having hash value collisions more than once in a great
>> while, we probably should have flushed some entries out of it a whole
>> lot sooner and a whole lot more aggressively, because we're likely
>> eating memory like crazy.

> What would suck, though, is if you have an application that repeatedly
> creates and drops a temporary table, and the hash value for that happens
> to match some other table in the database. catcache invalidation would
> keep flushing the entry for that other table too, and you couldn't do
> anything about it except for renaming one of the tables.

> Despite that, +1 for option #2. The risk of collision seems acceptable,
> and the consequence of a collision wouldn't be too bad in most
> applications anyway.

Yeah. Also, to my mind this is only a fix that will be used in 9.0 and 9.1 --- now that it's occurred to me that we could use tuple xmin/xmax to invalidate catcaches instead of recording individual TIDs, I'm excited about doing that instead for 9.2 and beyond. I believe that that could result in a significant reduction in sinval traffic, which would be a considerable performance win.

			regards, tom lane
On Fri, Aug 12, 2011 at 3:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In any case, it is now clear to me that this bug is capable of eating
> people's databases, as in "what just happened to our most critical
> table? Uh, it's not there anymore, boss, but we seem to have duplicate
> pg_class entries for this other table".

OK, I take your point. That's certainly scary enough to worry about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 12, 2011 at 7:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Right at the moment I'm leaning to approach #2. I wonder if anyone
> sees it differently, or has an idea for a third approach?

You are trying to solve the problem directly, which seems overkill. With HOT, there is very little need to perform a VACUUM FULL on any shared catalog table. Look at the indexes...

I would suggest that VACUUM FULL perform only a normal VACUUM on shared catalog tables, then perform an actual VACUUM FULL only in dire need (some simple heuristic in size and density). This avoids doing a VACUUM FULL unless it is actually necessary to do so. That has the added advantage of not locking out essential tables, which is always a concern.

In the unlikely event we do actually have to VACUUM FULL a shared catalog table, nuke any cache entry for the whole shared catalog. That way we absolutely and positively will never get any more bugs in this area, ever again. Sounds harsh, but these events are only actually needed very, very rarely and hygiene is more important than a few minor points of performance.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Aug 12, 2011 at 09:26:02PM +0100, Simon Riggs wrote:
> With HOT, there is very little need to perform a VACUUM FULL on any
> shared catalog table. Look at the indexes...
>
> I would suggest that VACUUM FULL perform only a normal VACUUM on
> shared catalog tables, then perform an actual VACUUM FULL only in dire
> need (some simple heuristic in size and density). This avoids doing a
> VACUUM FULL unless it is actually necessary to do so. That has the
> added advantage of not locking out essential tables, which is always a
> concern.
>
> In the unlikely event we do actually have to VACUUM FULL a shared
> catalog table, nuke any cache entry for the whole shared catalog. That
> way we absolutely and positively will never get any more bugs in this
> area, ever again. Sounds harsh, but these events are only actually
> needed very, very rarely and hygiene is more important than a few
> minor points of performance.

This is a very optimistic view. My client makes heavy use of temp tables. HOT and autovacuum are not sufficient to keep catalog bloat under control. We run a daily script that calculates the density of the catalog and only vacuum fulls those that are severely bloated.

Here is a result from a recent bloat check on one db. 'packed' is the number of pages needed for the rows if they were packed, 'bloat' is the multiple of pages in use over the number really needed.

        relation      | tuples | pages | packed | bloat
    ------------------+--------+-------+--------+-------
     pg_class         |   4292 | 10619 |    114 |  93.2
     pg_depend        |  25666 |  7665 |    217 |  35.4
     pg_attrdef       |   6585 |  7595 |    236 |  32.2
     pg_type          |   4570 |  8177 |    416 |  19.6
     pg_shdepend      |  52040 |  7968 |    438 |  18.2

-dg

--
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.
Simon Riggs <simon@2ndQuadrant.com> writes:
> With HOT, there is very little need to perform a VACUUM FULL on any
> shared catalog table. Look at the indexes...

This is not really a useful argument. People do do VAC FULL on catalogs, whether we think they should or not. Also, it's not only "shared" catalogs that are at stake. We could avoid fixing the bug if we forbade both VAC FULL and CLUSTER on all system catalogs, period, no exceptions. That doesn't seem like a workable answer though. Some usage patterns do see bloat on the catalogs, especially if you disable or dial down autovacuum.

> In the unlikely event we do actually have to VACUUM FULL a shared
> catalog table, nuke any cache entry for the whole shared catalog.

You're still not getting the point. We *do* nuke all cache entries after a VAC FULL. That does not avoid this bug.

			regards, tom lane
I wrote:
> Yeah. Also, to my mind this is only a fix that will be used in 9.0 and
> 9.1 --- now that it's occurred to me that we could use tuple xmin/xmax
> to invalidate catcaches instead of recording individual TIDs, I'm
> excited about doing that instead for 9.2 and beyond. I believe that
> that could result in a significant reduction in sinval traffic, which
> would be a considerable performance win.

On closer inspection this idea doesn't seem workable. I was imagining that after a transaction commits, we could find obsoleted catcache entries by looking for tuples with xmax equal to the transaction's XID. But a catcache entry made before the transaction had done the update wouldn't contain the correct xmax, so we'd fail to remove it. The only apparent way to fix that would be to go out to disk and consult the current on-disk xmax, which would hardly be any cheaper than just dropping the cache entry and then reloading it when/if needed.

All is not entirely lost, however: there's still some possible performance benefit to be gained here, if we go to the scheme of identifying victim catcache entries by hashvalue only. Currently, each heap_update in a cached catalog has to issue two sinval messages (per cache!): one against the old TID and one against the new TID. We'd be able to reduce that to one message in the common case where the hashvalue remains the same because the cache key columns didn't change.

Another thing we could consider doing, if one-in-2^32 hash collisions seems too high, is widening the hash values to 64 bits. I'm not terribly excited about that, because a lot of the caches are on OID columns for which there'd be zero benefit.

			regards, tom lane
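A minimal standalone sketch of the message-count saving described here, under the assumption that invals are keyed by hash value only. The function names and hash values are invented for illustration and the real inval.c bookkeeping is considerably more involved:

    /*
     * Toy illustration (not PostgreSQL code): with hash-only invalidation,
     * an update that leaves the cache key columns unchanged needs just one
     * sinval message, whereas the old-TID/new-TID scheme always needed two.
     */
    #include <stdio.h>

    /* Stand-in for queueing one hash-based catcache inval message. */
    static int
    register_catcache_inval(unsigned int hash)
    {
        printf("queue catcache inval for hash value %08x\n", hash);
        return 1;
    }

    /* Returns the number of sinval messages the update had to queue. */
    static int
    register_update_inval(unsigned int old_hash, unsigned int new_hash)
    {
        int         n = register_catcache_inval(old_hash);

        /* Only an update that changes the key columns (hence the hash)
         * needs a second message. */
        if (new_hash != old_hash)
            n += register_catcache_inval(new_hash);
        return n;
    }

    int
    main(void)
    {
        printf("relfilenode-only update: %d message(s)\n",
               register_update_inval(0x1234abcd, 0x1234abcd));
        printf("key-changing update:     %d message(s)\n",
               register_update_inval(0x1234abcd, 0x9876fedc));
        return 0;
    }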
On Sat, Aug 13, 2011 at 12:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Yeah. Also, to my mind this is only a fix that will be used in 9.0 and
>> 9.1 --- now that it's occurred to me that we could use tuple xmin/xmax
>> to invalidate catcaches instead of recording individual TIDs, I'm
>> excited about doing that instead for 9.2 and beyond. I believe that
>> that could result in a significant reduction in sinval traffic, which
>> would be a considerable performance win.
>
> On closer inspection this idea doesn't seem workable. I was imagining
> that after a transaction commits, we could find obsoleted catcache
> entries by looking for tuples with xmax equal to the transaction's XID.
> But a catcache entry made before the transaction had done the update
> wouldn't contain the correct xmax, so we'd fail to remove it. The only
> apparent way to fix that would be to go out to disk and consult the
> current on-disk xmax, which would hardly be any cheaper than just
> dropping the cache entry and then reloading it when/if needed.
>
> All is not entirely lost, however: there's still some possible
> performance benefit to be gained here, if we go to the scheme of
> identifying victim catcache entries by hashvalue only. Currently,
> each heap_update in a cached catalog has to issue two sinval messages
> (per cache!): one against the old TID and one against the new TID.
> We'd be able to reduce that to one message in the common case where the
> hashvalue remains the same because the cache key columns didn't change.

Cool.

> Another thing we could consider doing, if one-in-2^32 hash collisions
> seems too high, is widening the hash values to 64 bits. I'm not
> terribly excited about that, because a lot of the caches are on OID
> columns for which there'd be zero benefit.

Yeah, and I can't get excited about the increased memory usage, either.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company