Thread: Sure enough, SI buffer overrun is broken
I built the current sources with MAXNUMMESSAGES set to 32 in src/include/storage/sinvaladt.h. The regular regress tests run OK, with just a few NOTICEs about 'cache state reset' and 'SI buffer overflow' inserted in the normal outputs (as you'd expect, if SI overrun occurs). However, the parallel tests crash spectacularly, with weird errors and Assert() coredumps. Some of the unexpected messages in the postmaster log are: ERROR: Relation 0 does not exist NOTICE: LockRelease: locktable lookup failed, no lock TRAP: Failed Assertion("!(((file) > 0 && (file) < SizeVfdCache && VfdCache[file].fileName != ((void *)0))):", File: "fd.c",Line: 817) !(((file) > 0 && (file) < SizeVfdCache && VfdCache[file].fileName != ((void *)0))) (0) NOTICE: LockRelease: locktable lookup failed, no lock TRAP: Failed Assertion("!(attnum <= 0 || (attnum - 1 <= tuple_type->natts - 1 && tuple_type->attrs[attnum - 1] != ((void*)0) && variable->vartype == tuple_type->attrs[attnum - 1]->atttypid)):", File: "execQual.c", Line: 283) !(attnum <= 0 || (attnum - 1 <= tuple_type->natts - 1 && tuple_type->attrs[attnum - 1] != ((void *)0) && variable->vartype== tuple_type->attrs[attnum - 1]->atttypid)) (0) [Not a typewriter] TRAP: Failed Assertion("!(((file) > 0 && (file) < SizeVfdCache && VfdCache[file].fileName != ((void *)0))):", File: "fd.c",Line: 817) !(((file) > 0 && (file) < SizeVfdCache && VfdCache[file].fileName != ((void *)0))) (0) [Not a typewriter] We have a problem. I think Hiroshi was beating on this code recently --- Hiroshi, do you recall anything you might have done that would affect SI cache reset recovery? regards, tom lane
> -----Original Message----- > From: owner-pgsql-hackers@postgreSQL.org > [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Tom Lane > > I built the current sources with MAXNUMMESSAGES set to 32 in > src/include/storage/sinvaladt.h. The regular regress tests > run OK, with just a few NOTICEs about 'cache state reset' > and 'SI buffer overflow' inserted in the normal outputs > (as you'd expect, if SI overrun occurs). > > However, the parallel tests crash spectacularly, with weird errors > and Assert() coredumps. Some of the unexpected messages in the > postmaster log are: > > ERROR: Relation 0 does not exist > NOTICE: LockRelease: locktable lookup failed, no lock > TRAP: Failed Assertion("!(((file) > 0 && (file) < SizeVfdCache && > VfdCache[file].fileName != ((void *)0))):", File: "fd.c", Line: 817) > > !(((file) > 0 && (file) < SizeVfdCache && VfdCache[file].fileName > != ((void *)0))) (0) > NOTICE: LockRelease: locktable lookup failed, no lock > TRAP: Failed Assertion("!(attnum <= 0 || (attnum - 1 <= > tuple_type->natts - 1 && tuple_type->attrs[attnum - 1] != ((void > *)0) && variable->vartype == tuple_type->attrs[attnum - > 1]->atttypid)):", File: "execQual.c", Line: 283) > > !(attnum <= 0 || (attnum - 1 <= tuple_type->natts - 1 && > tuple_type->attrs[attnum - 1] != ((void *)0) && variable->vartype > == tuple_type->attrs[attnum - 1]->atttypid)) (0) [Not a typewriter] > TRAP: Failed Assertion("!(((file) > 0 && (file) < SizeVfdCache && > VfdCache[file].fileName != ((void *)0))):", File: "fd.c", Line: 817) > > !(((file) > 0 && (file) < SizeVfdCache && VfdCache[file].fileName > != ((void *)0))) (0) [Not a typewriter] > > We have a problem. > > I think Hiroshi was beating on this code recently --- Hiroshi, > do you recall anything you might have done that would affect > SI cache reset recovery? > Certainly crash occurs. But I couldn't see such Assert messages. OK,I will examine tomorrow. Regards. Hiroshi Inoue Inoue@tpf.co.jp
On Wed, Jan 26, 2000 at 12:40:00AM +0900, Hiroshi Inoue wrote: > > -----Original Message----- > > From: owner-pgsql-hackers@postgreSQL.org > > [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Tom Lane > > > > I built the current sources with MAXNUMMESSAGES set to 32 in > > src/include/storage/sinvaladt.h. The regular regress tests > > run OK, with just a few NOTICEs about 'cache state reset' > > and 'SI buffer overflow' inserted in the normal outputs > > (as you'd expect, if SI overrun occurs). > > > > However, the parallel tests crash spectacularly, with weird errors > > and Assert() coredumps. Some of the unexpected messages in the > > postmaster log are: > > > > ERROR: Relation 0 does not exist > > NOTICE: LockRelease: locktable lookup failed, no lock > > TRAP: Failed Assertion("!(((file) > 0 && (file) < SizeVfdCache && > > VfdCache[file].fileName != ((void *)0))):", File: "fd.c", Line: 817) > > > > !(((file) > 0 && (file) < SizeVfdCache && VfdCache[file].fileName > > != ((void *)0))) (0) > > NOTICE: LockRelease: locktable lookup failed, no lock > > TRAP: Failed Assertion("!(attnum <= 0 || (attnum - 1 <= > > tuple_type->natts - 1 && tuple_type->attrs[attnum - 1] != ((void > > *)0) && variable->vartype == tuple_type->attrs[attnum - > > 1]->atttypid)):", File: "execQual.c", Line: 283) > > > > !(attnum <= 0 || (attnum - 1 <= tuple_type->natts - 1 && > > tuple_type->attrs[attnum - 1] != ((void *)0) && variable->vartype > > == tuple_type->attrs[attnum - 1]->atttypid)) (0) [Not a typewriter] > > TRAP: Failed Assertion("!(((file) > 0 && (file) < SizeVfdCache && > > VfdCache[file].fileName != ((void *)0))):", File: "fd.c", Line: 817) > > > > !(((file) > 0 && (file) < SizeVfdCache && VfdCache[file].fileName > > != ((void *)0))) (0) [Not a typewriter] > > > > We have a problem. > > > > I think Hiroshi was beating on this code recently --- Hiroshi, > > do you recall anything you might have done that would affect > > SI cache reset recovery? > > > > Certainly crash occurs. > But I couldn't see such Assert messages. > > OK,I will examine tomorrow. How's it going? Can I help? Cheers, Patrick
> -----Original Message----- > From: owner-pgsql-hackers@postgreSQL.org > [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Tom Lane > > I built the current sources with MAXNUMMESSAGES set to 32 in > src/include/storage/sinvaladt.h. The regular regress tests > run OK, with just a few NOTICEs about 'cache state reset' > and 'SI buffer overflow' inserted in the normal outputs > (as you'd expect, if SI overrun occurs). > > However, the parallel tests crash spectacularly, with weird errors > and Assert() coredumps. Unfortunately,I don't understand the cause yet. The cause may be not unique. Is the call RelationCacheInvalidate(false not true) in ResetSys- temCaches() right ? Relation descriptors would be destoryed if ResetSystemCaches() occurs in CommandConterIncrement(). Recent change setheapoverride -> CommandCounterIncrement may need reopen of relations after CommandCounterIncrement. static void ResetSystemCaches() {ResetSystemCache();RelationCacheInvalidate(false); ^^^^^^^^ } I changed false -> true and tried. Crash decreased but still occurs. > Some of the unexpected messages in the > postmaster log are: > > NOTICE: LockRelease: locktable lookup failed, no lock I have seen this NOTICE only once or twice. This seems because of setting MyProc->xid to InvalidTransa ctionId in CommitTransaction() and AbortTransaction(). There's a little time until AtCommit(Abort)_Locks. I have no idea to solve this now. Regards. Hiroshi Inoue Inoue@tpf.co.jp
> -----Original Message----- > From: Patrick Welche [mailto:prlw1@newn.cam.ac.uk] > > > I built the current sources with MAXNUMMESSAGES set to 32 in > > > src/include/storage/sinvaladt.h. The regular regress tests > > > run OK, with just a few NOTICEs about 'cache state reset' > > > and 'SI buffer overflow' inserted in the normal outputs > > > (as you'd expect, if SI overrun occurs). > > > > > > However, the parallel tests crash spectacularly, with weird errors > > > and Assert() coredumps. Some of the unexpected messages in the > > > postmaster log are: > > > > > > > Certainly crash occurs. > > But I couldn't see such Assert messages. > > > > OK,I will examine tomorrow. > > How's it going? Can I help ? > Of cource you can help us. As Tom wrote 1) set MAXNUMMESSAGES of src/include/storage/sinvaldt.h to 32 2) cd src/test/regress 3) make all runcheck You would see crash and various kind of unexpected messages. My another posting may be a little help. Regards. Hiroshi Inoue Inoue@tpf.co.jp
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: >> I built the current sources with MAXNUMMESSAGES set to 32 in >> src/include/storage/sinvaladt.h. The regular regress tests >> run OK, with just a few NOTICEs about 'cache state reset' >> and 'SI buffer overflow' inserted in the normal outputs >> (as you'd expect, if SI overrun occurs). >> >> However, the parallel tests crash spectacularly, with weird errors >> and Assert() coredumps. > Is the call RelationCacheInvalidate(false not true) in ResetSys- > temCaches() right ? Relation descriptors would be destoryed if > ResetSystemCaches() occurs in CommandConterIncrement(). You are absolutely right. I have thought before that it was extremely dangerous for RelationCacheInvalidate *ever* to blow away a relcache entry with a positive refcount. I have changed ResetSystemCaches to pass TRUE, and have modified the comments for RelationCacheInvalidate to indicate that this is probably the only safe setting. I also changed RelationIdInvalidateRelationCacheByRelationId to unconditionally pass true to RelationFlushRelation. Now, RelationFlushRelation is *never* called with onlyFlushReferenceCountZero=false, so it will always attempt to rebuild a relcache entry that has positive refcount. I suspect that the "feature" of RelationFlushRelation to allow blowing away a relcache entry regardless of refcount was a hack put in back when relcache refcounts couldn't be trusted (because elog(ERROR) would leave refcounts positive). Now we have RelationCacheAbort to fix refcounts after elog, so I see no reason to take the risk of trying to destroy an open relcache entry. With these two changes in place, the parallel regress tests seem much more stable. There is still a big problem though, which is that the "portals" regress test sometimes fails. I traced this far enough to discover that the code is trying to use a TupleDesc that it's stored in the ScanTupleSlot of a plan, and this TupleDesc is no longer valid --- presumably the relcache entry it was gotten from has been flushed and rebuilt, leaving the plan with a dangling TupleDesc pointer. Ugh. I do not think it is very practical to try to change all of the places that assume that they can save pointers to the TupleDesc of a relcache entry. Instead I am thinking about solving the problem inside the relcache, as follows: During a relcache entry rebuild, do not simply free and reconstruct the TupleDesc. Instead, read the catalogs to build a new TupleDesc, and compare this one to the old one. If they are the same, free the new one instead (leaving the old one in place, and hence stored pointers to it are still valid). If they are not the same, then elog(ERROR). (elog may sound overly paranoid, but this condition indicates that the table's definition has actually changed since we grabbed the refcount on it --- remember we wouldn't be doing this at all if the relcache entry had zero refcount --- and therefore all kinds of derived information such as plans may be wrong. Pressing ahead will probably lead to crash.) We may need to do the same for any other substructures of the relcache entry that are visible from outside relcache.c. I know this sounds pretty grotty, but we are already doing it for the relcache entry itself --- rebuild re-uses the same physical entry rather than deleting and reallocating it, to ensure that pointers to an open Relation stay valid over an SI flush. We need to extend the same guarantee to the substructure of the relcache entry. Unless someone has a better idea, I'll work on that. >> Some of the unexpected messages in the >> postmaster log are: >> >> NOTICE: LockRelease: locktable lookup failed, no lock > I have seen this NOTICE only once or twice. > This seems because of setting MyProc->xid to InvalidTransa > ctionId in CommitTransaction() and AbortTransaction(). > There's a little time until AtCommit(Abort)_Locks. > I have no idea to solve this now. I am not seeing it after the change to never flush relcaches with positive refcount. I think the locks being complained of are probably the locks that should have been kept on flushed relations, and that it's not CommitTransaction's fault. regards, tom lane
> -----Original Message----- > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > > "Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > >> I built the current sources with MAXNUMMESSAGES set to 32 in > >> src/include/storage/sinvaladt.h. The regular regress tests > >> run OK, with just a few NOTICEs about 'cache state reset' > >> and 'SI buffer overflow' inserted in the normal outputs > >> (as you'd expect, if SI overrun occurs). > >> > >> However, the parallel tests crash spectacularly, with weird errors > >> and Assert() coredumps. > > With these two changes in place, the parallel regress tests seem much > more stable. There is still a big problem though, which is that the > "portals" regress test sometimes fails. I traced this far enough to > discover that the code is trying to use a TupleDesc that it's stored in > the ScanTupleSlot of a plan, and this TupleDesc is no longer valid --- > presumably the relcache entry it was gotten from has been flushed and > rebuilt, leaving the plan with a dangling TupleDesc pointer. Ugh. > Oh great,I've also doubted relcache entry rebuild but wasn't able to trace so far. > I do not think it is very practical to try to change all of the places > that assume that they can save pointers to the TupleDesc of a relcache > entry. Instead I am thinking about solving the problem inside the > relcache, as follows: > > During a relcache entry rebuild, do not simply free and reconstruct > the TupleDesc. Instead, read the catalogs to build a new TupleDesc, > and compare this one to the old one. If they are the same, free the > new one instead (leaving the old one in place, and hence stored pointers > to it are still valid). If they are not the same, then elog(ERROR). > > (elog may sound overly paranoid, but this condition indicates that the > table's definition has actually changed since we grabbed the refcount on > it --- remember we wouldn't be doing this at all if the relcache entry had > zero refcount --- and therefore all kinds of derived information such as > plans may be wrong. Pressing ahead will probably lead to crash.) > Sounds reasonable. > We may need to do the same for any other substructures of the relcache > entry that are visible from outside relcache.c. > > I know this sounds pretty grotty, but we are already doing it for the > relcache entry itself --- rebuild re-uses the same physical entry > rather than deleting and reallocating it, to ensure that pointers > to an open Relation stay valid over an SI flush. We need to extend > the same guarantee to the substructure of the relcache entry. > > Unless someone has a better idea, I'll work on that. > Agreed. Regards. Hiroshi Inoue Inoue@tpf.co.jp