Thread: cache lookup failed dropping public schema with trgm index
Hi,
This reproduces on 15.4 and 13.12:
create table foo(t text);
create extension pg_trgm;
create index on foo using gist(t gist_trgm_ops);
drop schema public cascade;
NOTICE: drop cascades to 2 other objects
DETAIL: drop cascades to table foo
drop cascades to extension pg_trgm
ERROR: cache lookup failed for function 1195999
Time: 20.968 ms
DETAIL: drop cascades to table foo
drop cascades to extension pg_trgm
ERROR: cache lookup failed for function 1195999
Time: 20.968 ms
On Mon, Aug 21, 2023 at 11:40:15AM -0700, Wyatt Alt wrote: > This reproduces on 15.4 and 13.12: > > create table foo(t text); > create extension pg_trgm; > create index on foo using gist(t gist_trgm_ops); > drop schema public cascade; Indeed, reproduced here down to 13. 12 and older versions don't react like that. I'll go bisect that.. -- Michael
Attachment
Hi, On 2023-08-21 11:40:15 -0700, Wyatt Alt wrote: > This reproduces on 15.4 and 13.12: Also reproduces on HEAD. > create table foo(t text); > create extension pg_trgm; > create index on foo using gist(t gist_trgm_ops); > drop schema public cascade; > > NOTICE: drop cascades to 2 other objects > DETAIL: drop cascades to table foo > drop cascades to extension pg_trgm > ERROR: cache lookup failed for function 1195999 > Time: 20.968 ms It also seems to work without even involving a drop schema. Just dropping pg_trgm with cascade is sufficient. <several wrong theories> I think we might primarily dealing with missing invalidations. Dropping objects is scheduled in an order where we first drop schedule deletion of function 10 (text, text) of operator family t.gist_trgm_ops for access method gist: t.gtrgm_options(internal)at 0 schedule deletion of function t.gtrgm_options(internal) at 1 and then later schedule deletion of index t.foo_t at 32 During the index deletion we try to initialize the access method. But haven't performed sufficient invalidation and still think "function 10" exists. Calling it then causes the error. One can "verify" this theory by adding an InvalidateSystemCaches() at the end of deleteOneObject(). That "fixes" the issue. This also explains why dropping pg_trgm in a new session works - we don't have old cache entries that could be out of date. Not quite sure where we are dropping the ball with invalidations yet. However, I suspect there's more wrong than this, albeit perhaps not problematic in a huge way. While debugging I added a getObjectDescription() description call to deleteObjectsInList() *before* calling deleteOneObject(). That fails when dropping pg_trgmp, because we end up dropping the type gtrgm before trgm_out(). The reason this happens is because we reach gtrgm_out() via the extension dependency, rather than via the type. When recursing to gtrgm_out(), we recurse to type gtrgm, recurse to gtrgm_in(), schedule it for deletion, then recurse to gtrgm_out(), but find it's in the stack and do *not* schedule it for deletion, before scheduling gtrgm for deletion. Only then we delete gtrgm_out(). Now, this isn't a real issue in practice (without such a debugging statement, which likely can't work in some cases), but I strongly suspect that it indicates a scheduling order issue that's more widespread. Despite, I think, correct dependencies, we end up with a topologically inconsistent drop order. There aren't any cycles in the directed dependency graph from what I can see. Greetings, Andres Freund
On Mon, Aug 21, 2023 at 03:36:10PM -0700, Andres Freund wrote: > It also seems to work without even involving a drop schema. Just dropping > pg_trgm with cascade is sufficient. FWIW, after a bisect I can see that 911e7020 is the origin of the failure (`git bisect start b5d69b7 9e1c9f9` based on two merge-bases). > Now, this isn't a real issue in practice (without such a debugging statement, > which likely can't work in some cases), but I strongly suspect that it > indicates a scheduling order issue that's more widespread. Despite, I think, > correct dependencies, we end up with a topologically inconsistent drop > order. There aren't any cycles in the directed dependency graph from what I > can see. Yeah, guess so. I was first betting on a missing shared inval here. Now note that for example, this command works: psql -v ON_ERROR_STOP=1 -c 'create table foo(t text); create extension pg_trgm; create index on foo using gist(t gist_trgm_ops);createindex on foo using gist(t gist_trgm_ops); drop schema public cascade;' -- Michael
Attachment
Thanks for having a look. Note also that switching the first two lines of my example prevents the failure.
On Mon, Aug 21, 2023 at 3:55 PM Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Aug 21, 2023 at 03:36:10PM -0700, Andres Freund wrote:
> It also seems to work without even involving a drop schema. Just dropping
> pg_trgm with cascade is sufficient.
FWIW, after a bisect I can see that 911e7020 is the origin of the
failure (`git bisect start b5d69b7 9e1c9f9` based on two merge-bases).
> Now, this isn't a real issue in practice (without such a debugging statement,
> which likely can't work in some cases), but I strongly suspect that it
> indicates a scheduling order issue that's more widespread. Despite, I think,
> correct dependencies, we end up with a topologically inconsistent drop
> order. There aren't any cycles in the directed dependency graph from what I
> can see.
Yeah, guess so. I was first betting on a missing shared inval here.
Now note that for example, this command works:
psql -v ON_ERROR_STOP=1 -c 'create table foo(t text); create extension pg_trgm; create index on foo using gist(t gist_trgm_ops);create index on foo using gist(t gist_trgm_ops); drop schema public cascade;'
--
Michael
Michael Paquier <michael@paquier.xyz> writes: > On Mon, Aug 21, 2023 at 03:36:10PM -0700, Andres Freund wrote: >> It also seems to work without even involving a drop schema. Just dropping >> pg_trgm with cascade is sufficient. > FWIW, after a bisect I can see that 911e7020 is the origin of the > failure (`git bisect start b5d69b7 9e1c9f9` based on two merge-bases). Hmm. I see that 911e7020 modified pg_trgm's install script with +ALTER OPERATOR FAMILY gist_trgm_ops USING gist +ADD FUNCTION 10 (text) gtrgm_options (internal); I wonder whether that correctly adds a dependency to ensure the opfamily is dropped before the function. regards, tom lane
On Mon, Aug 21, 2023 at 07:27:19PM -0400, Tom Lane wrote: > Hmm. I see that 911e7020 modified pg_trgm's install script with > > +ALTER OPERATOR FAMILY gist_trgm_ops USING gist > +ADD FUNCTION 10 (text) gtrgm_options (internal); > > I wonder whether that correctly adds a dependency to ensure the > opfamily is dropped before the function. The dependencies look OK seen from here when it comes to the extension, the indexes or the option function. See, for instance: =# SELECT pg_describe_object(classid, objid, objsubid) as obj, pg_describe_object(refclassid,refobjid,refobjsubid) as objref, deptype FROM pg_depend WHERE (classid = 'pg_proc'::regclass AND objid = 'gtrgm_options'::regproc) OR (refclassid = 'pg_proc'::regclass AND refobjid = 'gtrgm_options'::regproc); -[ RECORD 1]------------------------------------------------------------------------------------------------------ obj | function gtrgm_options(internal) objref | schema public deptype | n -[ RECORD 2 ]------------------------------------------------------------------------------------------------------ obj | function gtrgm_options(internal) objref | extension pg_trgm deptype | e -[ RECORD 3]------------------------------------------------------------------------------------------------------ obj | function 10 (text, text) of operator family gist_trgm_ops for access method gist: gtrgm_options(internal) objref | function gtrgm_options(internal) deptype | a -- Michael
Attachment
Hi, On 2023-08-21 19:27:19 -0400, Tom Lane wrote: > Michael Paquier <michael@paquier.xyz> writes: > > On Mon, Aug 21, 2023 at 03:36:10PM -0700, Andres Freund wrote: > >> It also seems to work without even involving a drop schema. Just dropping > >> pg_trgm with cascade is sufficient. > > > FWIW, after a bisect I can see that 911e7020 is the origin of the > > failure (`git bisect start b5d69b7 9e1c9f9` based on two merge-bases). > > Hmm. I see that 911e7020 modified pg_trgm's install script with > > +ALTER OPERATOR FAMILY gist_trgm_ops USING gist > +ADD FUNCTION 10 (text) gtrgm_options (internal); > > I wonder whether that correctly adds a dependency to ensure the > opfamily is dropped before the function. It's not quite that, I think. Whether it fails or succeeds depends on the state of the system caches. The dependencies lead to the equivalent of ALTER OPERATOR FAMILY ... USING ... DROP being performed, before dropping the index. I think there are at least three levels of problems here: - I don't think we currently properly force index relcache entries to be invalidated when the relevant opfamily changes - LookupOpclassInfo() doesn't have *any* invalidation support, so even if we were to perform invalidation, we'd still return a potentially stale entry. The function says: * Note there is no provision for flushing the cache. This is OK at the * moment because there is no way to ALTER any interesting properties of an * existing opclass --- all you can do is drop it, which will result in * a useless but harmless dead entry in the cache. But that's not true (amymore?), because it does do pg_amproc lookups, and they *can* change. - Minor: Even if you force LookupOpclassInfo() to perform lookups again, it doesn't work, because when an amproc entry doesn't exist anymore, the old value in opcentry->supportProcs[i] survives, because nothing resets it. Greetings, Andres Freund
On Tue, Aug 22, 2023 at 11:45:58AM -0700, Andres Freund wrote: > It's not quite that, I think. Whether it fails or succeeds depends on the > state of the system caches. The dependencies lead to the equivalent of > ALTER OPERATOR FAMILY ... USING ... DROP > being performed, before dropping the index. > > I think there are at least three levels of problems here: > > - I don't think we currently properly force index relcache entries to be > invalidated when the relevant opfamily changes > > - LookupOpclassInfo() doesn't have *any* invalidation support, so even if we > were to perform invalidation, we'd still return a potentially stale > entry. The function says: > > * Note there is no provision for flushing the cache. This is OK at the > * moment because there is no way to ALTER any interesting properties of an > * existing opclass --- all you can do is drop it, which will result in > * a useless but harmless dead entry in the cache. > > But that's not true (amymore?), because it does do pg_amproc lookups, and > they *can* change. > > - Minor: Even if you force LookupOpclassInfo() to perform lookups again, it > doesn't work, because when an amproc entry doesn't exist anymore, the old > value in opcentry->supportProcs[i] survives, because nothing resets it. Another thing: we also need to think harder about RelationReloadIndexInfo() which is the cheap path for cache reload for the index information. This is quite different from rd_options, though, because we are going to need an extra facility to look at the pg_attribute tuples for the indexes to get access to the new option options? We touch that a bit in RelationBuildTupleDesc(), but not in this cheap reload path. This is going to require more infrastructure, at quick glance.. -- Michael