Nasty bug in heap_page_prune - Mailing list pgsql-hackers
| From | Tom Lane |
|---|---|
| Subject | Nasty bug in heap_page_prune |
| Date | |
| Msg-id | 9319.1204766626@sss.pgh.pa.us Whole thread Raw |
| Responses |
Re: Nasty bug in heap_page_prune
Re: Nasty bug in heap_page_prune Re: Nasty bug in heap_page_prune |
| List | pgsql-hackers |
While working on the previously discussed refactoring of
heap_page_prune, I came to the realization that its use of
CacheInvalidateHeapTuple is not just a PANIC risk but simply wrong :-(
The semantics aren't right: inval.c assumes that it's dealing with
transactional invalidations, but what we are dealing with in a redirect
collapse is non-transactional. Once heap_page_prune completes, the
CTID update is a done deal even if the calling transaction rolls back.
However, inval.c will think it doesn't need to broadcast any
invalidations after a failed transaction.
This is fairly difficult to show a self-contained example of, because
it can only occur if VACUUM FULL errors out after doing a page prune,
and there's no very easy way to guarantee that will happen. I resorted
to putting an elog(ERROR) call into vacuum.c right after the scan_heap()
call. With that, it was possible to demonstrate the problem:
regression=# create table foo(f1 int);
CREATE TABLE
regression=# select ctid,relname from pg_class where relname = 'foo'; ctid | relname
--------+---------(9,32) | foo
(1 row)
-- need a HOT-candidate update to the pg_class row, eg
regression=# alter table foo owner to joe;
ALTER TABLE
-- check that update is on same page, else it's not HOT
regression=# select ctid,relname from pg_class where relname = 'foo'; ctid | relname
--------+---------(9,33) | foo
(1 row)
-- make sure the updated tuple is in local syscache
regression=# select 'foo'::regclass;regclass
----------foo
(1 row)
-- now, in another backend, execute intentionally broken VACUUM FULL pg_class
-- and try to alter the updated tuple again using a syscache-based operation
regression=# alter table foo owner to postgres;
server closed the connection unexpectedly This probably means the server terminated abnormally before or
whileprocessing the request.
The connection to the server was lost. Attempting reset: Failed.
The crash is here:
TRAP: FailedAssertion("!(((lp)->lp_flags == 1))", File: "heapam.c", Line: 2330)
LOG: server process (PID 8967) was terminated by signal 6
LOG: terminating any other active server processes
because it's trying to find the tuple at a CTID that's no longer valid.
Not sure about a clean solution to this. I don't really want to
bastardize inval.c by making it deal with nontransactional semantics,
but there may be no other way.
Or we could forget about letting VACUUM FULL collapse out LP_REDIRECT
pointers, at least in system catalogs. That might be the best answer
for 8.3 in any case; I am not seeing a real fix that I'd care to risk
back-patching. (Note that this is not exactly trivial in itself, since
then vacuum.c would need at least some minimal ability to deal with
LP_REDIRECT entries.)
regards, tom lane
pgsql-hackers by date: