Thread: Nasty bug in heap_page_prune
While working on the previously discussed refactoring of heap_page_prune,
I came to the realization that its use of CacheInvalidateHeapTuple is
not just a PANIC risk but simply wrong :-(  The semantics aren't right:
inval.c assumes that it's dealing with transactional invalidations, but
what we are dealing with in a redirect collapse is non-transactional.
Once heap_page_prune completes, the CTID update is a done deal even if
the calling transaction rolls back.  However, inval.c will think it
doesn't need to broadcast any invalidations after a failed transaction.

This is fairly difficult to show a self-contained example of, because it
can only occur if VACUUM FULL errors out after doing a page prune, and
there's no very easy way to guarantee that will happen.  I resorted to
putting an elog(ERROR) call into vacuum.c right after the scan_heap()
call.  With that, it was possible to demonstrate the problem:

regression=# create table foo(f1 int);
CREATE TABLE
regression=# select ctid,relname from pg_class where relname = 'foo';
  ctid  | relname
--------+---------
 (9,32) | foo
(1 row)

-- need a HOT-candidate update to the pg_class row, eg
regression=# alter table foo owner to joe;
ALTER TABLE

-- check that update is on same page, else it's not HOT
regression=# select ctid,relname from pg_class where relname = 'foo';
  ctid  | relname
--------+---------
 (9,33) | foo
(1 row)

-- make sure the updated tuple is in local syscache
regression=# select 'foo'::regclass;
 regclass
----------
 foo
(1 row)

-- now, in another backend, execute intentionally broken
--     VACUUM FULL pg_class
-- and try to alter the updated tuple again using a syscache-based operation

regression=# alter table foo owner to postgres;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The crash is here:

TRAP: FailedAssertion("!(((lp)->lp_flags == 1))", File: "heapam.c", Line: 2330)
LOG:  server process (PID 8967) was terminated by signal 6
LOG:  terminating any other active server processes

because it's trying to find the tuple at a CTID that's no longer valid.

Not sure about a clean solution to this.  I don't really want to
bastardize inval.c by making it deal with nontransactional semantics,
but there may be no other way.

Or we could forget about letting VACUUM FULL collapse out LP_REDIRECT
pointers, at least in system catalogs.  That might be the best answer
for 8.3 in any case; I am not seeing a real fix that I'd care to risk
back-patching.  (Note that this is not exactly trivial in itself, since
then vacuum.c would need at least some minimal ability to deal with
LP_REDIRECT entries.)

			regards, tom lane
On Thu, Mar 6, 2008 at 6:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> While working on the previously discussed refactoring of
> heap_page_prune, I came to the realization that its use of
> CacheInvalidateHeapTuple is not just a PANIC risk but simply wrong :-(
> The semantics aren't right: inval.c assumes that it's dealing with
> transactional invalidations,

Oh. I did not know that :-(

> Not sure about a clean solution to this.  I don't really want to
> bastardize inval.c by making it deal with nontransactional semantics,
> but there may be no other way.
>
> Or we could forget about letting VACUUM FULL collapse out LP_REDIRECT
> pointers, at least in system catalogs.  That might be the best answer
> for 8.3 in any case; I am not seeing a real fix that I'd care to risk
> back-patching.  (Note that this is not exactly trivial in itself, since
> then vacuum.c would need at least some minimal ability to deal with
> LP_REDIRECT entries.)

I am not sure how ugly or difficult it would be to teach inval.c to
handle non-transactional invalidations. But as you said, skipping
collapse of LP_REDIRECT pointers may not be a good idea either. The
overhead of 4 bytes per tuple for system tables may not be much, but
handling LP_REDIRECT pointers in vacuum.c would be cumbersome and
error-prone. This was very painful before we added the step to collapse
redirected pointers: we had a stress test running concurrent INSERTs,
UPDATEs, VACUUMs, VACUUM FULLs and CREATE/DROP INDEXes, and VACUUM FULL
used to once in a while complain about tuple mismatch.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
On Wed, 2008-03-05 at 20:23 -0500, Tom Lane wrote:
> Not sure about a clean solution to this.  I don't really want to
> bastardize inval.c by making it deal with nontransactional semantics,
> but there may be no other way.

Is this something that happens only with concurrent VACUUM FULLs? If so,
then can't we just disallow doing them concurrently? VACUUM FULL is
something you don't want on a high-load database anyway.

> Or we could forget about letting VACUUM FULL collapse out LP_REDIRECT
> pointers, at least in system catalogs.  That might be the best answer
> for 8.3 in any case; I am not seeing a real fix that I'd care to risk
> back-patching.  (Note that this is not exactly trivial in itself, since
> then vacuum.c would need at least some minimal ability to deal with
> LP_REDIRECT entries.)
>
> 			regards, tom lane

Hannu Krosing
On Fri, Mar 7, 2008 at 3:42 PM, Hannu Krosing <hannu@krosing.net> wrote:
>
> Is this something that happens only with concurrent VACUUM FULLs ?

No, it's about a VACUUM FULL on a system catalog which fails for some
reason. The VACUUM FULL may have changed the CTID of a tuple because of
line pointer redirection collapse, but the change is non-transactional.
The current cache invalidation mechanism can only handle transactional
changes (since it does not broadcast invalidations if the transaction
aborts). Hence some other backend which has cached the tuple may not see
the change in CTID, and fails when the cached tuple is accessed.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
Tom Lane escribió:
> Not sure about a clean solution to this.  I don't really want to
> bastardize inval.c by making it deal with nontransactional semantics,
> but there may be no other way.

FWIW, IIRC we hit precisely this problem while trying to do the
pg_class_nt stuff awhile ago, so if it's overcome as a result of this
HOT bug perhaps we can push forward with the other stuff too.

--
Alvaro Herrera       http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> On Thu, Mar 6, 2008 at 6:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> While working on the previously discussed refactoring of
>> heap_page_prune, I came to the realization that its use of
>> CacheInvalidateHeapTuple is not just a PANIC risk but simply wrong :-(
>> The semantics aren't right: inval.c assumes that it's dealing with
>> transactional invalidations,

> I am not sure how ugly or difficult it would be to teach inval.c to
> handle non-transactional invalidations.

After further study I think it might not be that hard.  As things
stand, CacheInvalidateHeapTuple accumulates notices into a list that is
pushed out at transaction commit.  But we could push them out earlier,
ie, before making the actual page changes in heap_page_prune.  This
seems safe since an unnecessary invalidation notice cannot break
anything; at worst it causes useless work.

My first visualization of how to do this was to extend the current
subtransaction handling logic in inval.c, such that we'd build a phony
"subtransaction" for each invocation of heap_page_prune, and then force
the messages to be sent at "subtransaction commit".  However, it looks
to me like we don't even need that much mechanism.  The problem case
only occurs during the first stage of VACUUM FULL, and AFAICS there
could never be any other pending inval events at that point.  (VACUUM
FULL will generate transactional inval events later, when it's doing
cross-page tuple moves, but there shouldn't be any during the time that
we're running heap_page_prune.)  So I think the only mechanism we really
need is something to force out pending inval events immediately, that
is just AtEOXact_Inval(true) or something pretty close to it.

I'm inclined to set this up as having heap_page_prune first call a
function named something like "BeginNontransactionalInvalidation", then
do its CacheInvalidateHeapTuple calls, then call
"EndNontransactionalInvalidation".  In the initial implementation the
first would just assert that the inval queue is empty, and the second
would push out the queue.  If we ever need to generalize things then
the code structure would be there to build a phony subtransaction.

Obviously this is a bit of a hack, but there's hardly anything about
VACUUM FULL that's not a bit of a hack ...

I haven't coded this, much less tested it, but it seems like it'd work.
Comments?

			regards, tom lane
On Thu, Mar 13, 2008 at 10:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> But we could push them out earlier, ie, before making the actual page
> changes in heap_page_prune.  This seems safe since an unnecessary
> invalidation notice cannot break anything; at worst it causes useless
> work.

If I understand correctly, that would let us call
CacheInvalidateHeapTuple() outside a critical section. Seems like a good
idea to me.

> I'm inclined to set this up as having heap_page_prune first call
> a function named something like "BeginNontransactionalInvalidation",
> then do its CacheInvalidateHeapTuple calls, then call
> "EndNontransactionalInvalidation".  In the initial implementation
> the first would just assert that the inval queue is empty, and the
> second would push out the queue.  If we ever need to generalize
> things then the code structure would be there to build a phony
> subtransaction.

I wonder if we can have a separate list of non-transactional events in
InvalidationListHeader and broadcast those events irrespective of
transaction commit or abort. But we may still need the
"BeginNontransactionalInvalidation" and
"EndNontransactionalInvalidation" markers to push these events to the
non-transactional list.

> Obviously this is a bit of a hack, but there's hardly anything about
> VACUUM FULL that's not a bit of a hack ...

True :-) And I would personally prefer any hack over playing with
left-over redirected line pointers in VACUUM FULL.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> I wonder if we can have a separate list of non-transaction events in
> InvalidationListHeader and broadcast those events irrespective of
> transaction commit or abort.

Yeah, I started with that same idea.  But AFAICS there is no percentage
in postponing the broadcast until commit/abort; we may as well push the
messages out immediately.  The reason inval postpones transactional
messages until commit/abort is that that's when the invalidation
actually "takes effect" (or not) from the point of view of other
transactions; telling them to flush their caches earlier would be
useless.  For these nontransactional invalidations the inval is
effective immediately, and other sessions can reload their caches as
soon as we release buffer lock.  (Well, except that VACUUM FULL is
holding exclusive lock on the whole table...)

Anyway, the point is that that seems to be extra complication that
doesn't buy anything, and if anything puts the behavior further away
from what it should ideally be.

			regards, tom lane
On Thu, Mar 13, 2008 at 9:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Yeah, I started with that same idea.  But AFAICS there is no percentage
> in postponing the broadcast until commit/abort; we may as well push the
> messages out immediately.

Seems like a good plan to me.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com