Thread: Nasty bug in heap_page_prune
While working on the previously discussed refactoring of heap_page_prune,
I came to the realization that its use of CacheInvalidateHeapTuple is
not just a PANIC risk but simply wrong :-(  The semantics aren't right:
inval.c assumes that it's dealing with transactional invalidations, but
what we are dealing with in a redirect collapse is non-transactional.
Once heap_page_prune completes, the CTID update is a done deal even if
the calling transaction rolls back.  However, inval.c will think it
doesn't need to broadcast any invalidations after a failed transaction.

This is fairly difficult to show a self-contained example of, because it
can only occur if VACUUM FULL errors out after doing a page prune, and
there's no very easy way to guarantee that will happen.  I resorted to
putting an elog(ERROR) call into vacuum.c right after the scan_heap()
call.  With that, it was possible to demonstrate the problem:

regression=# create table foo(f1 int);
CREATE TABLE
regression=# select ctid,relname from pg_class where relname = 'foo';
  ctid  | relname
--------+---------
 (9,32) | foo
(1 row)

-- need a HOT-candidate update to the pg_class row, eg
regression=# alter table foo owner to joe;
ALTER TABLE

-- check that update is on same page, else it's not HOT
regression=# select ctid,relname from pg_class where relname = 'foo';
  ctid  | relname
--------+---------
 (9,33) | foo
(1 row)

-- make sure the updated tuple is in local syscache
regression=# select 'foo'::regclass;
 regclass
----------
 foo
(1 row)

-- now, in another backend, execute intentionally broken
--     VACUUM FULL pg_class
-- and try to alter the updated tuple again using a syscache-based operation

regression=# alter table foo owner to postgres;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The crash is here:

TRAP: FailedAssertion("!(((lp)->lp_flags == 1))", File: "heapam.c", Line: 2330)
LOG:  server process (PID 8967) was terminated by signal 6
LOG:  terminating any other active server processes

because it's trying to find the tuple at a CTID that's no longer valid.

Not sure about a clean solution to this.  I don't really want to
bastardize inval.c by making it deal with nontransactional semantics,
but there may be no other way.

Or we could forget about letting VACUUM FULL collapse out LP_REDIRECT
pointers, at least in system catalogs.  That might be the best answer
for 8.3 in any case; I am not seeing a real fix that I'd care to risk
back-patching.  (Note that this is not exactly trivial in itself, since
then vacuum.c would need at least some minimal ability to deal with
LP_REDIRECT entries.)

			regards, tom lane
On Thu, Mar 6, 2008 at 6:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> While working on the previously discussed refactoring of
> heap_page_prune, I came to the realization that its use of
> CacheInvalidateHeapTuple is not just a PANIC risk but simply wrong :-(
> The semantics aren't right: inval.c assumes that it's dealing with
> transactional invalidations,

Oh. I did not know that :-(

> Not sure about a clean solution to this.  I don't really want to
> bastardize inval.c by making it deal with nontransactional semantics,
> but there may be no other way.
>
> Or we could forget about letting VACUUM FULL collapse out LP_REDIRECT
> pointers, at least in system catalogs.  That might be the best answer
> for 8.3 in any case; I am not seeing a real fix that I'd care to risk
> back-patching.  (Note that this is not exactly trivial in itself, since
> then vacuum.c would need at least some minimal ability to deal with
> LP_REDIRECT entries.)

I am not sure how ugly or difficult it would be to teach inval.c to
handle non-transactional invalidations. But as you said, skipping
collapse of LP_REDIRECT pointers may not be a good idea either. The
overhead of 4 bytes per tuple for system tables may not be much, but
handling LP_REDIRECT pointers in vacuum.c would be cumbersome and
error-prone. This was very painful before we added the step to collapse
redirected pointers: we had a stress test running concurrent INSERTs,
UPDATEs, VACUUMs, VACUUM FULLs and CREATE/DROP INDEXes, and VACUUM FULL
used to once in a while complain about tuple mismatch.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
On Wed, 2008-03-05 at 20:23 -0500, Tom Lane wrote:
> Not sure about a clean solution to this.  I don't really want to
> bastardize inval.c by making it deal with nontransactional semantics,
> but there may be no other way.

Is this something that happens only with concurrent VACUUM FULLs? If so,
then can't we just disallow doing them concurrently? VACUUM FULL is
something you don't want on a high-load database anyway.

> Or we could forget about letting VACUUM FULL collapse out LP_REDIRECT
> pointers, at least in system catalogs.  That might be the best answer
> for 8.3 in any case; I am not seeing a real fix that I'd care to risk
> back-patching.  (Note that this is not exactly trivial in itself, since
> then vacuum.c would need at least some minimal ability to deal with
> LP_REDIRECT entries.)
>
> 			regards, tom lane

Hannu Krosing
On Fri, Mar 7, 2008 at 3:42 PM, Hannu Krosing <hannu@krosing.net> wrote:
>
> Is this something that happens only with concurrent VACUUM FULLs ?

No, it's about a VACUUM FULL on a system catalog which fails for some
reason. The VACUUM FULL may have changed the CTID of a tuple because of
line pointer redirection collapse, but the change is non-transactional.
The current cache invalidation mechanism can only handle transactional
changes (since it does not broadcast invalidations if the transaction
aborts). Hence some other backend which has cached the tuple may not see
the change in CTID, and fails when the cached tuple is accessed.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
Tom Lane escribió:
> Not sure about a clean solution to this.  I don't really want to
> bastardize inval.c by making it deal with nontransactional semantics,
> but there may be no other way.

FWIW, IIRC we hit precisely this problem while trying to do the
pg_class_nt stuff awhile ago, so if it's overcome as a result of this
HOT bug perhaps we can push forward with the other stuff too.

--
Alvaro Herrera       http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> On Thu, Mar 6, 2008 at 6:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> While working on the previously discussed refactoring of
>> heap_page_prune, I came to the realization that its use of
>> CacheInvalidateHeapTuple is not just a PANIC risk but simply wrong :-(
>> The semantics aren't right: inval.c assumes that it's dealing with
>> transactional invalidations,

> I am not sure how ugly or difficult it would be to teach inval.c to
> handle non-transactional invalidations.

After further study I think it might not be that hard.  As things
stand, CacheInvalidateHeapTuple accumulates notices into a list that is
pushed out at transaction commit.  But we could push them out earlier,
ie, before making the actual page changes in heap_page_prune.  This
seems safe since an unnecessary invalidation notice cannot break
anything; at worst it causes useless work.

My first visualization of how to do this was to extend the current
subtransaction handling logic in inval.c, such that we'd build a phony
"subtransaction" for each invocation of heap_page_prune, and then force
the messages to be sent at "subtransaction commit".  However, it looks
to me like we don't even need that much mechanism.  The problem case
only occurs during the first stage of VACUUM FULL, and AFAICS there
could never be any other pending inval events at that point.  (VACUUM
FULL will generate transactional inval events later, when it's doing
cross-page tuple moves, but there shouldn't be any during the time that
we're running heap_page_prune.)  So I think the only mechanism we really
need is something to force out pending inval events immediately, that
is just AtEOXact_Inval(true) or something pretty close to it.

I'm inclined to set this up as having heap_page_prune first call a
function named something like "BeginNontransactionalInvalidation", then
do its CacheInvalidateHeapTuple calls, then call
"EndNontransactionalInvalidation".  In the initial implementation the
first would just assert that the inval queue is empty, and the second
would push out the queue.  If we ever need to generalize things then
the code structure would be there to build a phony subtransaction.

Obviously this is a bit of a hack, but there's hardly anything about
VACUUM FULL that's not a bit of a hack ...

I haven't coded this, much less tested it, but it seems like it'd work.
Comments?

			regards, tom lane
On Thu, Mar 13, 2008 at 10:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> But we could push them out earlier, ie, before making the actual page
> changes in heap_page_prune.  This seems safe since an unnecessary
> invalidation notice cannot break anything; at worst it causes useless
> work.

If I understand correctly, that would let us call
CacheInvalidateHeapTuple() outside a critical section. Seems like a good
idea to me.

> I'm inclined to set this up as having heap_page_prune first call
> a function named something like "BeginNontransactionalInvalidation",
> then do its CacheInvalidateHeapTuple calls, then call
> "EndNontransactionalInvalidation".  In the initial implementation
> the first would just assert that the inval queue is empty, and the
> second would push out the queue.  If we ever need to generalize
> things then the code structure would be there to build a phony
> subtransaction.

I wonder if we can have a separate list of non-transactional events in
InvalidationListHeader and broadcast those events irrespective of
transaction commit or abort. But we may still need the
"BeginNontransactionalInvalidation" and
"EndNontransactionalInvalidation" markers to push these events to the
non-transactional list.

> Obviously this is a bit of a hack, but there's hardly anything about
> VACUUM FULL that's not a bit of a hack ...

True :-) And I would personally prefer any hack over playing with
left-over redirected line pointers in VACUUM FULL.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> I wonder if we can have a separate list of non-transaction events in
> InvalidationListHeader and broadcast those events irrespective of
> transaction commit or abort.

Yeah, I started with that same idea.  But AFAICS there is no percentage
in postponing the broadcast until commit/abort; we may as well push the
messages out immediately.  The reason inval postpones transactional
messages until commit/abort is that that's when the invalidation
actually "takes effect" (or not) from the point of view of other
transactions; telling them to flush their caches earlier would be
useless.  For these nontransactional invalidations the inval is
effective immediately, and other sessions can reload their caches as
soon as we release buffer lock.  (Well, except that VACUUM FULL is
holding exclusive lock on the whole table...)

Anyway, the point is that that seems to be extra complication that
doesn't buy anything, and if anything puts the behavior further away
from what it should ideally be.

			regards, tom lane
On Thu, Mar 13, 2008 at 9:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Yeah, I started with that same idea.  But AFAICS there is no percentage
> in postponing the broadcast until commit/abort; we may as well push the
> messages out immediately.

Seems like a good plan to me.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com