Thread: BUG #17182: Race condition on concurrent DROP and CREATE of dependent object

BUG #17182: Race condition on concurrent DROP and CREATE of dependent object

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17182
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 14beta3
Operating system:   Ubuntu 20.04
Description:

As result of the following script:
for i in `seq 100`; do
( { for n in `seq 20`; do echo "DROP DOMAIN i;"; done } | psql ) >psql1.log
2>&1 &
( echo "
CREATE DOMAIN i AS int;
CREATE FUNCTION f1() RETURNS i LANGUAGE SQL RETURN 1;
CREATE FUNCTION f2() RETURNS i LANGUAGE SQL RETURN 2;
CREATE FUNCTION f3() RETURNS i LANGUAGE SQL RETURN 3;
CREATE FUNCTION f4() RETURNS i LANGUAGE SQL RETURN 4;
CREATE FUNCTION f5() RETURNS i LANGUAGE SQL RETURN 5;
" | psql ) >psql2.log 2>&1 &
wait
psql -c "DROP DOMAIN i CASCADE" >psql3.log 2>&1
done

I get several broken functions with the invalid return type:
SELECT f1()
ERROR:  cache lookup failed for type 16519
CONTEXT:  SQL function "f1" during inlining

\df
ERROR:  cache lookup failed for type 16519
(\df is effectively unusable when such functions exist)

SELECT pp.oid, proname, pronamespace, proowner, prolang, prorettype FROM
pg_proc pp INNER JOIN pg_namespace pn ON (pp.pronamespace = pn.oid) WHERE
pn.nspname='public'
  oid  | proname | pronamespace | proowner | prolang | prorettype 
-------+---------+--------------+----------+---------+------------
 16520 | f1      |         2200 |       10 |      14 |      16519
 16521 | f2      |         2200 |       10 |      14 |      16519
 16564 | f3      |         2200 |       10 |      14 |      16563
 16565 | f4      |         2200 |       10 |      14 |      16563
 16616 | f5      |         2200 |       10 |      14 |      16615
(5 rows)

The similar behaviour is reproduced with "CREATE SCHEMA s; CREATE FUNCTION
s.funcX ... / DROP SCHEMA s;", but \df *.func* shows such functions with the
empty schema column.


PG Bug reporting form <noreply@postgresql.org> writes:
> As result of the following script:
> for i in `seq 100`; do
> ( { for n in `seq 20`; do echo "DROP DOMAIN i;"; done } | psql ) >psql1.log
> 2>&1 &
> ( echo "
> CREATE DOMAIN i AS int;
> CREATE FUNCTION f1() RETURNS i LANGUAGE SQL RETURN 1;
> CREATE FUNCTION f2() RETURNS i LANGUAGE SQL RETURN 2;
> CREATE FUNCTION f3() RETURNS i LANGUAGE SQL RETURN 3;
> CREATE FUNCTION f4() RETURNS i LANGUAGE SQL RETURN 4;
> CREATE FUNCTION f5() RETURNS i LANGUAGE SQL RETURN 5;
> " | psql ) >psql2.log 2>&1 &
> wait
> psql -c "DROP DOMAIN i CASCADE" >psql3.log 2>&1
> done

> I get several broken functions with the invalid return type:
> SELECT f1()
> ERROR:  cache lookup failed for type 16519
> CONTEXT:  SQL function "f1" during inlining

I don't find this particularly surprising, and I'm unwilling to add the
amount of locking overhead it'd take to prevent it.

The generic problem is that a newly-created dependent object is not
protected against deletion of its referenced object(s) until we commit
its new pg_depend entries; before that, a concurrent DROP won't see
the dependencies.  I recall some discussion of trying to take an
anti-deletion lock in recordDependency, but that's too late: the
deletion might have committed and released its own lock since we
looked up the type (or other referenced object).  So the only real fix
for this would be to make every object lookup in the entire system do
the sort of dance that's done in RangeVarGetRelidExtended.  We have
agreed that the cost is worth it for tables (though I don't think
that that was without controversy, nor am I 100% convinced that
RangeVarGetRelidExtended is correct).  But I'm not excited about
extending the principle to other object types.

            regards, tom lane



Re: BUG #17182: Race condition on concurrent DROP and CREATE of dependent object

From
Alexander Lakhin
Date:
Hello Tom,
05.09.2021 17:15, Tom Lane wrote:
>> I get several broken functions with the invalid return type:
>> SELECT f1()
>> ERROR:  cache lookup failed for type 16519
>> CONTEXT:  SQL function "f1" during inlining
> I don't find this particularly surprising, and I'm unwilling to add the
> amount of locking overhead it'd take to prevent it.
>
> The generic problem is that a newly-created dependent object is not
> protected against deletion of its referenced object(s) until we commit
> its new pg_depend entries; before that, a concurrent DROP won't see
> the dependencies.  I recall some discussion of trying to take an
> anti-deletion lock in recordDependency, but that's too late: the
> deletion might have committed and released its own lock since we
> looked up the type (or other referenced object).  So the only real fix
> for this would be to make every object lookup in the entire system do
> the sort of dance that's done in RangeVarGetRelidExtended.  We have
> agreed that the cost is worth it for tables (though I don't think
> that that was without controversy, nor am I 100% convinced that
> RangeVarGetRelidExtended is correct).  But I'm not excited about
> extending the principle to other object types.
>
Thanks for the detailed explanation!
So if that is an acceptable behaviour, then it seems that to prevent not
an accidental, but malicious exploitation of it, postgres should be
hardened enough to stand up to the attempts to access such invalid
objects (e.g. objects with non-existing schemas).
(Maybe there are other ways for an unprivileged user to produce such
objects, so fixing follow-up crashes is beneficial anyway.)

For example, the following script produces functions with invalid schemas:
for i in `seq 100`; do
( { for n in `seq 20`; do echo "DROP SCHEMA s;"; done } | psql )
>psql2.log 2>&1 &
( { for n in `seq 1`; do
echo "
CREATE SCHEMA s;
CREATE FUNCTION s.func1() RETURNS int LANGUAGE SQL AS 'SELECT 1;';
CREATE FUNCTION s.func2() RETURNS int LANGUAGE SQL AS 'SELECT 2;';
CREATE FUNCTION s.func3() RETURNS int LANGUAGE SQL AS 'SELECT 3;';
CREATE FUNCTION s.func4() RETURNS int LANGUAGE SQL AS 'SELECT 4;';
CREATE FUNCTION s.func5() RETURNS int LANGUAGE SQL AS 'SELECT 5;';
";  done } | psql ) >psql1.log 2>&1 &
wait
psql -c "DROP SCHEMA s CASCADE" >psql3.log
done

and then the following query crashes the server:
SELECT pg_identify_object('pg_proc'::regclass, pp.oid, 0), pp.oid FROM
pg_proc pp LEFT JOIN pg_namespace pn ON pp.pronamespace = pn.oid WHERE
pn.oid IS NULL

Core was generated by `postgres: law regression [local]
SELECT                                         '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  quote_identifier (ident=0x0) at ruleutils.c:11357
11357           safe = ((ident[0] >= 'a' && ident[0] <= 'z') || ident[0]
== '_');
(gdb) bt
#0  quote_identifier (ident=0x0) at ruleutils.c:11357
#1  0x00005585a5dcee76 in pg_identify_object (fcinfo=<optimized out>) at
objectaddress.c:4215
#2  0x00005585a5ed1291 in ExecInterpExpr (state=0x5585a67b1e80,
econtext=0x5585a67a8b00, isnull=0x7ffcfbee9e5f)
    at execExprInterp.c:749
#3  0x00005585a5ecdb05 in ExecInterpExprStillValid
(state=0x5585a67b1e80, econtext=0x5585a67a8b00,
    isNull=0x7ffcfbee9e5f) at execExprInterp.c:1824

So the quote_identifier(schema) construction seems unsafe in the
circumstances. My colleagues Alexander Kozhemyakin and Sergey Shinderuk
discovered similar crash case with OCLASS_DEFACL in
getObjectIdentityParts(), that drove me to study this issue additionally.

Best regards,
Alexander