RE: [HACKERS] Sure enough, SI buffer overrun is broken - Mailing list pgsql-hackers

From Hiroshi Inoue
Subject RE: [HACKERS] Sure enough, SI buffer overrun is broken
Date
Msg-id NDBBIJLOILGIKBGDINDFKEIECCAA.Inoue@tpf.co.jp
Whole thread Raw
In response to Re: [HACKERS] Sure enough, SI buffer overrun is broken  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> >> I built the current sources with MAXNUMMESSAGES set to 32 in
> >> src/include/storage/sinvaladt.h.  The regular regress tests
> >> run OK, with just a few NOTICEs about 'cache state reset'
> >> and 'SI buffer overflow' inserted in the normal outputs
> >> (as you'd expect, if SI overrun occurs).
> >>
> >> However, the parallel tests crash spectacularly, with weird errors
> >> and Assert() coredumps.
>
> With these two changes in place, the parallel regress tests seem much
> more stable.  There is still a big problem though, which is that the
> "portals" regress test sometimes fails.  I traced this far enough to
> discover that the code is trying to use a TupleDesc that it's stored in
> the ScanTupleSlot of a plan, and this TupleDesc is no longer valid ---
> presumably the relcache entry it was gotten from has been flushed and
> rebuilt, leaving the plan with a dangling TupleDesc pointer.  Ugh.
>

Oh great,I've also doubted relcache entry rebuild but wasn't able to
trace so far.

> I do not think it is very practical to try to change all of the places
> that assume that they can save pointers to the TupleDesc of a relcache
> entry.  Instead I am thinking about solving the problem inside the
> relcache, as follows:
>
> During a relcache entry rebuild, do not simply free and reconstruct
> the TupleDesc.  Instead, read the catalogs to build a new TupleDesc,
> and compare this one to the old one.  If they are the same, free the
> new one instead (leaving the old one in place, and hence stored pointers
> to it are still valid).  If they are not the same, then elog(ERROR).
>
> (elog may sound overly paranoid, but this condition indicates that the
> table's definition has actually changed since we grabbed the refcount on
> it --- remember we wouldn't be doing this at all if the relcache entry had
> zero refcount --- and therefore all kinds of derived information such as
> plans may be wrong.  Pressing ahead will probably lead to crash.)
>

Sounds reasonable.

> We may need to do the same for any other substructures of the relcache
> entry that are visible from outside relcache.c.
>
> I know this sounds pretty grotty, but we are already doing it for the
> relcache entry itself --- rebuild re-uses the same physical entry
> rather than deleting and reallocating it, to ensure that pointers
> to an open Relation stay valid over an SI flush.  We need to extend
> the same guarantee to the substructure of the relcache entry.
>
> Unless someone has a better idea, I'll work on that.
>

Agreed.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp



pgsql-hackers by date:

Previous
From: "Hiroshi Inoue"
Date:
Subject: RE: ImmediateSharedRelationCacheInvalidate considered harmful
Next
From: Bruce Momjian
Date:
Subject: Re: [HACKERS] freefuncs.c is never called from anywhere!?