Thread: BUG #17182: Race condition on concurrent DROP and CREATE of dependent object
BUG #17182: Race condition on concurrent DROP and CREATE of dependent object
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 17182 Logged by: Alexander Lakhin Email address: exclusion@gmail.com PostgreSQL version: 14beta3 Operating system: Ubuntu 20.04 Description: As result of the following script: for i in `seq 100`; do ( { for n in `seq 20`; do echo "DROP DOMAIN i;"; done } | psql ) >psql1.log 2>&1 & ( echo " CREATE DOMAIN i AS int; CREATE FUNCTION f1() RETURNS i LANGUAGE SQL RETURN 1; CREATE FUNCTION f2() RETURNS i LANGUAGE SQL RETURN 2; CREATE FUNCTION f3() RETURNS i LANGUAGE SQL RETURN 3; CREATE FUNCTION f4() RETURNS i LANGUAGE SQL RETURN 4; CREATE FUNCTION f5() RETURNS i LANGUAGE SQL RETURN 5; " | psql ) >psql2.log 2>&1 & wait psql -c "DROP DOMAIN i CASCADE" >psql3.log 2>&1 done I get several broken functions with the invalid return type: SELECT f1() ERROR: cache lookup failed for type 16519 CONTEXT: SQL function "f1" during inlining \df ERROR: cache lookup failed for type 16519 (\df is effectively unusable when such functions exist) SELECT pp.oid, proname, pronamespace, proowner, prolang, prorettype FROM pg_proc pp INNER JOIN pg_namespace pn ON (pp.pronamespace = pn.oid) WHERE pn.nspname='public' oid | proname | pronamespace | proowner | prolang | prorettype -------+---------+--------------+----------+---------+------------ 16520 | f1 | 2200 | 10 | 14 | 16519 16521 | f2 | 2200 | 10 | 14 | 16519 16564 | f3 | 2200 | 10 | 14 | 16563 16565 | f4 | 2200 | 10 | 14 | 16563 16616 | f5 | 2200 | 10 | 14 | 16615 (5 rows) The similar behaviour is reproduced with "CREATE SCHEMA s; CREATE FUNCTION s.funcX ... / DROP SCHEMA s;", but \df *.func* shows such functions with the empty schema column.
Re: BUG #17182: Race condition on concurrent DROP and CREATE of dependent object
From
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes: > As result of the following script: > for i in `seq 100`; do > ( { for n in `seq 20`; do echo "DROP DOMAIN i;"; done } | psql ) >psql1.log > 2>&1 & > ( echo " > CREATE DOMAIN i AS int; > CREATE FUNCTION f1() RETURNS i LANGUAGE SQL RETURN 1; > CREATE FUNCTION f2() RETURNS i LANGUAGE SQL RETURN 2; > CREATE FUNCTION f3() RETURNS i LANGUAGE SQL RETURN 3; > CREATE FUNCTION f4() RETURNS i LANGUAGE SQL RETURN 4; > CREATE FUNCTION f5() RETURNS i LANGUAGE SQL RETURN 5; > " | psql ) >psql2.log 2>&1 & > wait > psql -c "DROP DOMAIN i CASCADE" >psql3.log 2>&1 > done > I get several broken functions with the invalid return type: > SELECT f1() > ERROR: cache lookup failed for type 16519 > CONTEXT: SQL function "f1" during inlining I don't find this particularly surprising, and I'm unwilling to add the amount of locking overhead it'd take to prevent it. The generic problem is that a newly-created dependent object is not protected against deletion of its referenced object(s) until we commit its new pg_depend entries; before that, a concurrent DROP won't see the dependencies. I recall some discussion of trying to take an anti-deletion lock in recordDependency, but that's too late: the deletion might have committed and released its own lock since we looked up the type (or other referenced object). So the only real fix for this would be to make every object lookup in the entire system do the sort of dance that's done in RangeVarGetRelidExtended. We have agreed that the cost is worth it for tables (though I don't think that that was without controversy, nor am I 100% convinced that RangeVarGetRelidExtended is correct). But I'm not excited about extending the principle to other object types. regards, tom lane
Re: BUG #17182: Race condition on concurrent DROP and CREATE of dependent object
From
Alexander Lakhin
Date:
Hello Tom, 05.09.2021 17:15, Tom Lane wrote: >> I get several broken functions with the invalid return type: >> SELECT f1() >> ERROR: cache lookup failed for type 16519 >> CONTEXT: SQL function "f1" during inlining > I don't find this particularly surprising, and I'm unwilling to add the > amount of locking overhead it'd take to prevent it. > > The generic problem is that a newly-created dependent object is not > protected against deletion of its referenced object(s) until we commit > its new pg_depend entries; before that, a concurrent DROP won't see > the dependencies. I recall some discussion of trying to take an > anti-deletion lock in recordDependency, but that's too late: the > deletion might have committed and released its own lock since we > looked up the type (or other referenced object). So the only real fix > for this would be to make every object lookup in the entire system do > the sort of dance that's done in RangeVarGetRelidExtended. We have > agreed that the cost is worth it for tables (though I don't think > that that was without controversy, nor am I 100% convinced that > RangeVarGetRelidExtended is correct). But I'm not excited about > extending the principle to other object types. > Thanks for the detailed explanation! So if that is an acceptable behaviour, then it seems that to prevent not an accidental, but malicious exploitation of it, postgres should be hardened enough to stand up to the attempts to access such invalid objects (e.g. objects with non-existing schemas). (Maybe there are other ways for an unprivileged user to produce such objects, so fixing follow-up crashes is beneficial anyway.) For example, the following script produces functions with invalid schemas: for i in `seq 100`; do ( { for n in `seq 20`; do echo "DROP SCHEMA s;"; done } | psql ) >psql2.log 2>&1 & ( { for n in `seq 1`; do echo " CREATE SCHEMA s; CREATE FUNCTION s.func1() RETURNS int LANGUAGE SQL AS 'SELECT 1;'; CREATE FUNCTION s.func2() RETURNS int LANGUAGE SQL AS 'SELECT 2;'; CREATE FUNCTION s.func3() RETURNS int LANGUAGE SQL AS 'SELECT 3;'; CREATE FUNCTION s.func4() RETURNS int LANGUAGE SQL AS 'SELECT 4;'; CREATE FUNCTION s.func5() RETURNS int LANGUAGE SQL AS 'SELECT 5;'; "; done } | psql ) >psql1.log 2>&1 & wait psql -c "DROP SCHEMA s CASCADE" >psql3.log done and then the following query crashes the server: SELECT pg_identify_object('pg_proc'::regclass, pp.oid, 0), pp.oid FROM pg_proc pp LEFT JOIN pg_namespace pn ON pp.pronamespace = pn.oid WHERE pn.oid IS NULL Core was generated by `postgres: law regression [local] SELECT '. Program terminated with signal SIGSEGV, Segmentation fault. #0 quote_identifier (ident=0x0) at ruleutils.c:11357 11357 safe = ((ident[0] >= 'a' && ident[0] <= 'z') || ident[0] == '_'); (gdb) bt #0 quote_identifier (ident=0x0) at ruleutils.c:11357 #1 0x00005585a5dcee76 in pg_identify_object (fcinfo=<optimized out>) at objectaddress.c:4215 #2 0x00005585a5ed1291 in ExecInterpExpr (state=0x5585a67b1e80, econtext=0x5585a67a8b00, isnull=0x7ffcfbee9e5f) at execExprInterp.c:749 #3 0x00005585a5ecdb05 in ExecInterpExprStillValid (state=0x5585a67b1e80, econtext=0x5585a67a8b00, isNull=0x7ffcfbee9e5f) at execExprInterp.c:1824 So the quote_identifier(schema) construction seems unsafe in the circumstances. My colleagues Alexander Kozhemyakin and Sergey Shinderuk discovered similar crash case with OCLASS_DEFACL in getObjectIdentityParts(), that drove me to study this issue additionally. Best regards, Alexander