Thread: Sure enough, SI buffer overrun is broken

Sure enough, SI buffer overrun is broken

From

Tom Lane

Date:

24 January 2000, 22:06:42

I built the current sources with MAXNUMMESSAGES set to 32 in
src/include/storage/sinvaladt.h.  The regular regress tests
run OK, with just a few NOTICEs about 'cache state reset'
and 'SI buffer overflow' inserted in the normal outputs
(as you'd expect, if SI overrun occurs).

However, the parallel tests crash spectacularly, with weird errors
and Assert() coredumps.  Some of the unexpected messages in the
postmaster log are:

ERROR:  Relation 0 does not exist
NOTICE:  LockRelease: locktable lookup failed, no lock
TRAP: Failed Assertion("!(((file) > 0 && (file) < SizeVfdCache && VfdCache[file].fileName != ((void *)0))):", File:
"fd.c",Line: 817)
 

!(((file) > 0 && (file) < SizeVfdCache && VfdCache[file].fileName != ((void *)0))) (0)
NOTICE:  LockRelease: locktable lookup failed, no lock
TRAP: Failed Assertion("!(attnum <= 0 || (attnum - 1 <= tuple_type->natts - 1 && tuple_type->attrs[attnum - 1] !=
((void*)0) && variable->vartype == tuple_type->attrs[attnum - 1]->atttypid)):", File: "execQual.c", Line: 283)
 

!(attnum <= 0 || (attnum - 1 <= tuple_type->natts - 1 && tuple_type->attrs[attnum - 1] != ((void *)0) &&
variable->vartype== tuple_type->attrs[attnum - 1]->atttypid)) (0) [Not a typewriter]
 
TRAP: Failed Assertion("!(((file) > 0 && (file) < SizeVfdCache && VfdCache[file].fileName != ((void *)0))):", File:
"fd.c",Line: 817)
 

!(((file) > 0 && (file) < SizeVfdCache && VfdCache[file].fileName != ((void *)0))) (0) [Not a typewriter]

We have a problem.

I think Hiroshi was beating on this code recently --- Hiroshi,
do you recall anything you might have done that would affect
SI cache reset recovery?
        regards, tom lane

RE: [HACKERS] Sure enough, SI buffer overrun is broken

From

"Hiroshi Inoue"

Date:

25 January 2000, 10:39:50

> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Tom Lane
> 
> I built the current sources with MAXNUMMESSAGES set to 32 in
> src/include/storage/sinvaladt.h.  The regular regress tests
> run OK, with just a few NOTICEs about 'cache state reset'
> and 'SI buffer overflow' inserted in the normal outputs
> (as you'd expect, if SI overrun occurs).
> 
> However, the parallel tests crash spectacularly, with weird errors
> and Assert() coredumps.  Some of the unexpected messages in the
> postmaster log are:
> 
> ERROR:  Relation 0 does not exist
> NOTICE:  LockRelease: locktable lookup failed, no lock
> TRAP: Failed Assertion("!(((file) > 0 && (file) < SizeVfdCache && 
> VfdCache[file].fileName != ((void *)0))):", File: "fd.c", Line: 817)
> 
> !(((file) > 0 && (file) < SizeVfdCache && VfdCache[file].fileName 
> != ((void *)0))) (0)
> NOTICE:  LockRelease: locktable lookup failed, no lock
> TRAP: Failed Assertion("!(attnum <= 0 || (attnum - 1 <= 
> tuple_type->natts - 1 && tuple_type->attrs[attnum - 1] != ((void 
> *)0) && variable->vartype == tuple_type->attrs[attnum - 
> 1]->atttypid)):", File: "execQual.c", Line: 283)
> 
> !(attnum <= 0 || (attnum - 1 <= tuple_type->natts - 1 && 
> tuple_type->attrs[attnum - 1] != ((void *)0) && variable->vartype 
> == tuple_type->attrs[attnum - 1]->atttypid)) (0) [Not a typewriter]
> TRAP: Failed Assertion("!(((file) > 0 && (file) < SizeVfdCache && 
> VfdCache[file].fileName != ((void *)0))):", File: "fd.c", Line: 817)
> 
> !(((file) > 0 && (file) < SizeVfdCache && VfdCache[file].fileName 
> != ((void *)0))) (0) [Not a typewriter]
> 
> We have a problem.
> 
> I think Hiroshi was beating on this code recently --- Hiroshi,
> do you recall anything you might have done that would affect
> SI cache reset recovery?
>

Certainly crash occurs.
But I couldn't see such Assert messages.

OK,I will examine tomorrow.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

Re: [HACKERS] Sure enough, SI buffer overrun is broken

From

Patrick Welche

Date:

27 January 2000, 08:07:57

On Wed, Jan 26, 2000 at 12:40:00AM +0900, Hiroshi Inoue wrote:
> > -----Original Message-----
> > From: owner-pgsql-hackers@postgreSQL.org
> > [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Tom Lane
> > 
> > I built the current sources with MAXNUMMESSAGES set to 32 in
> > src/include/storage/sinvaladt.h.  The regular regress tests
> > run OK, with just a few NOTICEs about 'cache state reset'
> > and 'SI buffer overflow' inserted in the normal outputs
> > (as you'd expect, if SI overrun occurs).
> > 
> > However, the parallel tests crash spectacularly, with weird errors
> > and Assert() coredumps.  Some of the unexpected messages in the
> > postmaster log are:
> > 
> > ERROR:  Relation 0 does not exist
> > NOTICE:  LockRelease: locktable lookup failed, no lock
> > TRAP: Failed Assertion("!(((file) > 0 && (file) < SizeVfdCache && 
> > VfdCache[file].fileName != ((void *)0))):", File: "fd.c", Line: 817)
> > 
> > !(((file) > 0 && (file) < SizeVfdCache && VfdCache[file].fileName 
> > != ((void *)0))) (0)
> > NOTICE:  LockRelease: locktable lookup failed, no lock
> > TRAP: Failed Assertion("!(attnum <= 0 || (attnum - 1 <= 
> > tuple_type->natts - 1 && tuple_type->attrs[attnum - 1] != ((void 
> > *)0) && variable->vartype == tuple_type->attrs[attnum - 
> > 1]->atttypid)):", File: "execQual.c", Line: 283)
> > 
> > !(attnum <= 0 || (attnum - 1 <= tuple_type->natts - 1 && 
> > tuple_type->attrs[attnum - 1] != ((void *)0) && variable->vartype 
> > == tuple_type->attrs[attnum - 1]->atttypid)) (0) [Not a typewriter]
> > TRAP: Failed Assertion("!(((file) > 0 && (file) < SizeVfdCache && 
> > VfdCache[file].fileName != ((void *)0))):", File: "fd.c", Line: 817)
> > 
> > !(((file) > 0 && (file) < SizeVfdCache && VfdCache[file].fileName 
> > != ((void *)0))) (0) [Not a typewriter]
> > 
> > We have a problem.
> > 
> > I think Hiroshi was beating on this code recently --- Hiroshi,
> > do you recall anything you might have done that would affect
> > SI cache reset recovery?
> >
> 
> Certainly crash occurs.
> But I couldn't see such Assert messages.
> 
> OK,I will examine tomorrow.

How's it going? Can I help?

Cheers,

Patrick

RE: [HACKERS] Sure enough, SI buffer overrun is broken

From

"Hiroshi Inoue"

Date:

27 January 2000, 10:38:14

> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Tom Lane
> 
> I built the current sources with MAXNUMMESSAGES set to 32 in
> src/include/storage/sinvaladt.h.  The regular regress tests
> run OK, with just a few NOTICEs about 'cache state reset'
> and 'SI buffer overflow' inserted in the normal outputs
> (as you'd expect, if SI overrun occurs).
> 
> However, the parallel tests crash spectacularly, with weird errors
> and Assert() coredumps.

Unfortunately,I don't understand the cause yet.
The cause may be not unique.

Is the call RelationCacheInvalidate(false not true) in ResetSys-
temCaches() right ?  Relation descriptors would be destoryed if 
ResetSystemCaches() occurs in CommandConterIncrement().
Recent change setheapoverride -> CommandCounterIncrement
may need reopen of relations after CommandCounterIncrement.     
static void
ResetSystemCaches()
{ResetSystemCache();RelationCacheInvalidate(false);                  ^^^^^^^^
}

I changed false ->  true and tried.
Crash decreased but still occurs.

> Some of the unexpected messages in the
> postmaster log are:
> 
> NOTICE:  LockRelease: locktable lookup failed, no lock

I have seen this NOTICE only once or twice.
This seems because of setting MyProc->xid to InvalidTransa
ctionId in CommitTransaction() and AbortTransaction().
There's a little time until AtCommit(Abort)_Locks.
I have no idea to solve this now.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

RE: [HACKERS] Sure enough, SI buffer overrun is broken

From

"Hiroshi Inoue"

Date:

27 January 2000, 11:32:15

> -----Original Message-----
> From: Patrick Welche [mailto:prlw1@newn.cam.ac.uk]
> > > I built the current sources with MAXNUMMESSAGES set to 32 in
> > > src/include/storage/sinvaladt.h.  The regular regress tests
> > > run OK, with just a few NOTICEs about 'cache state reset'
> > > and 'SI buffer overflow' inserted in the normal outputs
> > > (as you'd expect, if SI overrun occurs).
> > > 
> > > However, the parallel tests crash spectacularly, with weird errors
> > > and Assert() coredumps.  Some of the unexpected messages in the
> > > postmaster log are:
> > > 
> > 
> > Certainly crash occurs.
> > But I couldn't see such Assert messages.
> > 
> > OK,I will examine tomorrow.
> 
> How's it going? Can I help ?
>

Of cource you can help us.
As Tom wrote
1) set MAXNUMMESSAGES of src/include/storage/sinvaldt.h to 32
2) cd src/test/regress
3) make all runcheck

You would see crash and various kind of unexpected messages.
My another posting may be a little help.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

Re: [HACKERS] Sure enough, SI buffer overrun is broken

From

Tom Lane

Date:

29 January 2000, 14:54:22

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
>> I built the current sources with MAXNUMMESSAGES set to 32 in
>> src/include/storage/sinvaladt.h.  The regular regress tests
>> run OK, with just a few NOTICEs about 'cache state reset'
>> and 'SI buffer overflow' inserted in the normal outputs
>> (as you'd expect, if SI overrun occurs).
>> 
>> However, the parallel tests crash spectacularly, with weird errors
>> and Assert() coredumps.

> Is the call RelationCacheInvalidate(false not true) in ResetSys-
> temCaches() right ?  Relation descriptors would be destoryed if 
> ResetSystemCaches() occurs in CommandConterIncrement().

You are absolutely right.  I have thought before that it was extremely
dangerous for RelationCacheInvalidate *ever* to blow away a relcache
entry with a positive refcount.  I have changed ResetSystemCaches to
pass TRUE, and have modified the comments for RelationCacheInvalidate
to indicate that this is probably the only safe setting.  I also changed
RelationIdInvalidateRelationCacheByRelationId to unconditionally pass
true to RelationFlushRelation.  Now, RelationFlushRelation is *never*
called with onlyFlushReferenceCountZero=false, so it will always attempt
to rebuild a relcache entry that has positive refcount.

I suspect that the "feature" of RelationFlushRelation to allow blowing
away a relcache entry regardless of refcount was a hack put in back when
relcache refcounts couldn't be trusted (because elog(ERROR) would leave
refcounts positive).  Now we have RelationCacheAbort to fix refcounts
after elog, so I see no reason to take the risk of trying to destroy an
open relcache entry.

With these two changes in place, the parallel regress tests seem much
more stable.  There is still a big problem though, which is that the
"portals" regress test sometimes fails.  I traced this far enough to
discover that the code is trying to use a TupleDesc that it's stored in
the ScanTupleSlot of a plan, and this TupleDesc is no longer valid ---
presumably the relcache entry it was gotten from has been flushed and
rebuilt, leaving the plan with a dangling TupleDesc pointer.  Ugh.

I do not think it is very practical to try to change all of the places
that assume that they can save pointers to the TupleDesc of a relcache
entry.  Instead I am thinking about solving the problem inside the
relcache, as follows:

During a relcache entry rebuild, do not simply free and reconstruct
the TupleDesc.  Instead, read the catalogs to build a new TupleDesc,
and compare this one to the old one.  If they are the same, free the
new one instead (leaving the old one in place, and hence stored pointers
to it are still valid).  If they are not the same, then elog(ERROR).

(elog may sound overly paranoid, but this condition indicates that the
table's definition has actually changed since we grabbed the refcount on
it --- remember we wouldn't be doing this at all if the relcache entry had
zero refcount --- and therefore all kinds of derived information such as
plans may be wrong.  Pressing ahead will probably lead to crash.)

We may need to do the same for any other substructures of the relcache
entry that are visible from outside relcache.c.

I know this sounds pretty grotty, but we are already doing it for the
relcache entry itself --- rebuild re-uses the same physical entry
rather than deleting and reallocating it, to ensure that pointers
to an open Relation stay valid over an SI flush.  We need to extend
the same guarantee to the substructure of the relcache entry.

Unless someone has a better idea, I'll work on that.

>> Some of the unexpected messages in the
>> postmaster log are:
>> 
>> NOTICE:  LockRelease: locktable lookup failed, no lock

> I have seen this NOTICE only once or twice.
> This seems because of setting MyProc->xid to InvalidTransa
> ctionId in CommitTransaction() and AbortTransaction().
> There's a little time until AtCommit(Abort)_Locks.
> I have no idea to solve this now.

I am not seeing it after the change to never flush relcaches with
positive refcount.  I think the locks being complained of are
probably the locks that should have been kept on flushed relations,
and that it's not CommitTransaction's fault.
        regards, tom lane

RE: [HACKERS] Sure enough, SI buffer overrun is broken

From

"Hiroshi Inoue"

Date:

29 January 2000, 22:16:28

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> >> I built the current sources with MAXNUMMESSAGES set to 32 in
> >> src/include/storage/sinvaladt.h.  The regular regress tests
> >> run OK, with just a few NOTICEs about 'cache state reset'
> >> and 'SI buffer overflow' inserted in the normal outputs
> >> (as you'd expect, if SI overrun occurs).
> >>
> >> However, the parallel tests crash spectacularly, with weird errors
> >> and Assert() coredumps.
>
> With these two changes in place, the parallel regress tests seem much
> more stable.  There is still a big problem though, which is that the
> "portals" regress test sometimes fails.  I traced this far enough to
> discover that the code is trying to use a TupleDesc that it's stored in
> the ScanTupleSlot of a plan, and this TupleDesc is no longer valid ---
> presumably the relcache entry it was gotten from has been flushed and
> rebuilt, leaving the plan with a dangling TupleDesc pointer.  Ugh.
>

Oh great,I've also doubted relcache entry rebuild but wasn't able to
trace so far.

> I do not think it is very practical to try to change all of the places
> that assume that they can save pointers to the TupleDesc of a relcache
> entry.  Instead I am thinking about solving the problem inside the
> relcache, as follows:
>
> During a relcache entry rebuild, do not simply free and reconstruct
> the TupleDesc.  Instead, read the catalogs to build a new TupleDesc,
> and compare this one to the old one.  If they are the same, free the
> new one instead (leaving the old one in place, and hence stored pointers
> to it are still valid).  If they are not the same, then elog(ERROR).
>
> (elog may sound overly paranoid, but this condition indicates that the
> table's definition has actually changed since we grabbed the refcount on
> it --- remember we wouldn't be doing this at all if the relcache entry had
> zero refcount --- and therefore all kinds of derived information such as
> plans may be wrong.  Pressing ahead will probably lead to crash.)
>

Sounds reasonable.

> We may need to do the same for any other substructures of the relcache
> entry that are visible from outside relcache.c.
>
> I know this sounds pretty grotty, but we are already doing it for the
> relcache entry itself --- rebuild re-uses the same physical entry
> rather than deleting and reallocating it, to ensure that pointers
> to an open Relation stay valid over an SI flush.  We need to extend
> the same guarantee to the substructure of the relcache entry.
>
> Unless someone has a better idea, I'll work on that.
>

Agreed.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp