Thread: Re: Proper object locking for GRANT/REVOKE

Re: Proper object locking for GRANT/REVOKE

From

Bertrand Drouvot

Date:

31 October 2024, 17:26:57

Hi,

On Mon, Oct 28, 2024 at 04:20:42PM +0100, Peter Eisentraut wrote:
> This patch started out as a refactoring, thinking that objectNamesToOids()
> in aclchk.c should really mostly be a loop around get_object_address().
> This is mostly true, with a few special cases because the node
> representations are a bit different in some cases, and OBJECT_PARAMETER_ACL,
> which is obviously very different.  This saves a bunch of duplicative code,
> which is nice.
> 
> Additionally, get_object_address() handles locking, which
> objectNamesToOids() somewhat famously does not do, and there is a code
> comment about it.  With this refactoring, we get the locking pretty much for
> free.

Thanks for the patch, this refactoring makes sense to me.

A few random comments:

1 ===

+               default:

I like the idea of using default as the first "case" as a way to emphasize that
this is the most "common" behavior.

2 === Nit

+  address = get_object_address(objtype, lfirst(cell), &relation, lockmode, false);
+  Assert(relation == NULL);

Worth to explain why we do expect relation to be NULL here? (the comment on top
of get_object_address() says it all, but maybe a few words here could be worth
it).

3 ===

-
-       /* Used GRANT DOMAIN on a non-domain? */
-       if (istmt->objtype == OBJECT_DOMAIN &&
-               pg_type_tuple->typtype != TYPTYPE_DOMAIN)
-               ereport(ERROR,
-                               (errcode(ERRCODE_WRONG_OBJECT_TYPE),
-                                errmsg("\"%s\" is not a domain",
-                                               NameStr(pg_type_tuple->typname))));

Yeah, get_object_address()->get_object_address_type() does take care of it.

4 ===

> 
> Interestingly, this changes the output of the intra-grant-inplace isolation
> test, which is recent work.  It might be good to get some review from those
> who worked on that, to make sure that the new behavior is still correct
> and/or to check whether those test cases are still applicable.

-s4: WARNING:  got: cache lookup failed for relation REDACTED
+s4: WARNING:  got: relation "intra_grant_inplace" does not exist

I did not work on this but out of curiosity I looked at it, and IIUC this change
comes from:

-  relOid = RangeVarGetRelid(relvar, NoLock, false);
+  relOid = RangeVarGetRelid(relvar, lockmode, false);

So without the lock, the error is coming from ExecGrant_Relation() that way:

[local] DO(+0x22529b) [0x5c56afdb629b]
[local] DO(+0x2220b0) [0x5c56afdb30b0]
[local] DO(ExecuteGrantStmt+0x696) [0x5c56afdb3047]
[local] DO(+0x6dc8cb) [0x5c56b026d8cb]
[local] DO(standard_ProcessUtility+0xd38) [0x5c56b026b9da]
[local] DO(ProcessUtility+0x13a) [0x5c56b026ac9b]
[local] DO(+0x460073) [0x5c56afff1073]

With the lock, the error comes from RangeVarGetRelidExtended() that way:

[local] DO(RangeVarGetRelidExtended+0x4e6) [0x5a1c3ac49d36]
[local] DO(+0x2223fe) [0x5a1c3ac2a3fe]
[local] DO(ExecuteGrantStmt+0x111) [0x5a1c3ac29ac2]
[local] DO(+0x6dc1d2) [0x5a1c3b0e41d2]
[local] DO(standard_ProcessUtility+0xd38) [0x5a1c3b0e22e1]
[local] DO(ProcessUtility+0x13a) [0x5a1c3b0e15a2]

That's due to (in RangeVarGetRelidExtended()):

"
     * But if lockmode = NoLock, then we assume that either the caller is OK
     * with the answer changing under them, or that they already hold some
     * appropriate lock, and therefore return the first answer we get without
     * checking for invalidation messages.
"

So, in the RangeVarGetRelid(relvar, NoLock, false) case (without the patch) then
if we are able to reach "relId = RelnameGetRelid(relation->relname);" before the
drop gets committed then we get the "cache lookup failed" error.

But if we are slow enough so that we don't reach "relId = RelnameGetRelid(relation->relname);"
before the drop get committed then we would also get the "relation "intra_grant_inplace" does not exist"
error (tested manually, attaching gdb on s4 and breakpoint before the RelnameGetRelid(relation->relname)
call in RangeVarGetRelidExtended()).

The fact that we use lockmode != NoLock in the patch produces a lock followed
by a "retry" in RangeVarGetRelidExtended() and so we get the "relation "intra_grant_inplace" does not exist"
error.

I think that the new behavior is still correct and in fact is more "appropriate" (
I mean that's the kind of error I expect to see from a user point of view).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Proper object locking for GRANT/REVOKE

From

Peter Eisentraut

Date:

09 November 2024, 15:43:13

On 31.10.24 15:26, Bertrand Drouvot wrote:
> +  address = get_object_address(objtype, lfirst(cell), &relation, lockmode, false);
> +  Assert(relation == NULL);
> 
> Worth to explain why we do expect relation to be NULL here? (the comment on top
> of get_object_address() says it all, but maybe a few words here could be worth
> it).

There are several other callers with this pattern.

Maybe it would be better to push the assertion into 
get_object_address(), something like

     Assert(!relation || relp)

near the end.  Meaning, if you pass NULL for the relp argument, then you 
don't expect a relation.  This is kind of what will happen now anyway, 
except with a segfault instead of an assertion.

Re: Proper object locking for GRANT/REVOKE

From

Peter Eisentraut

Date:

15 November 2024, 13:19:50

On 11.11.24 08:53, Bertrand Drouvot wrote:
>> Maybe it would be better to push the assertion into get_object_address(),
>> something like
>>
>>      Assert(!relation || relp)
>>
>> near the end.
> 
> That looks like a good idea to me, that would make the code cleaner and easier
> to understand.
> 
>> Meaning, if you pass NULL for the relp argument, then you
>> don't expect a relation.  This is kind of what will happen now anyway,
>> except with a segfault instead of an assertion.
> 
> Yeah, I like it.
> 
> So, something like the attached (provided as .txt file to no mess up the CF bot
> entry related to this thread) could be applied before?

Thanks.  I have applied your patch and then also mine with the 
appropriate adjustments.

Re: Proper object locking for GRANT/REVOKE

From

Noah Misch

Date:

25 November 2024, 04:24:00

On Fri, Nov 15, 2024 at 11:19:50AM +0100, Peter Eisentraut wrote:
> On 11.11.24 08:53, Bertrand Drouvot wrote:

> Thanks.  I have applied your patch and then also mine with the appropriate
> adjustments.

commit d31bbfb wrote:
> --- a/src/backend/catalog/aclchk.c
> +++ b/src/backend/catalog/aclchk.c
> @@ -659,147 +659,77 @@ ExecGrantStmt_oids(InternalGrant *istmt)
>   * objectNamesToOids
>   *
>   * Turn a list of object names of a given type into an Oid list.
> - *
> - * XXX: This function doesn't take any sort of locks on the objects whose
> - * names it looks up.  In the face of concurrent DDL, we might easily latch
> - * onto an old version of an object, causing the GRANT or REVOKE statement
> - * to fail.

To prevent "latch onto an old version" and remove the last sentence of the
comment, we'd need two more things:

- Use a self-exclusive lock here, not AccessShareLock.  With two sessions
  doing GRANT under AccessShareLock, one will "latch onto an old version".

- Use a self-exclusive lock before *every* CatalogTupleUpdate() of a row that
  GRANT/REVOKE can affect.  For example, today's locking in ALTER FUNCTION is
  the xmax stamped on the old tuple.  If GRANT switched to
  ShareUpdateExclusiveLock, concurrent ALTER FUNCTION could still cause GRANT
  to "latch onto an old version".

I wouldn't do those, however.  It would make GRANT ALL TABLES IN SCHEMA
terminate every autovacuum running in the schema and consume a shared lock
table entry per table in the schema.  I think the user-visible benefit of
commit d31bbfb plus this additional work is just changing "ERROR:  tuple
concurrently updated" to blocking.  That's not nothing, but I don't see it
outweighing autovacuum termination and lock table consumption spikes.  What
other benefits and drawbacks should we weigh?

> --- a/src/test/isolation/expected/intra-grant-inplace.out
> +++ b/src/test/isolation/expected/intra-grant-inplace.out
> @@ -248,6 +248,6 @@ relhasindex
>  -----------
>  (0 rows)
>  
> -s4: WARNING:  got: cache lookup failed for relation REDACTED
> +s4: WARNING:  got: relation "intra_grant_inplace" does not exist

The affected permutation existed to cover the first LockRelease() in
SearchSysCacheLocked1().  Since this commit, that line no longer has coverage.

Re: Proper object locking for GRANT/REVOKE

From

Peter Eisentraut

Date:

02 December 2024, 14:13:56

On 25.11.24 02:24, Noah Misch wrote:
> commit d31bbfb wrote:
>> --- a/src/backend/catalog/aclchk.c
>> +++ b/src/backend/catalog/aclchk.c
>> @@ -659,147 +659,77 @@ ExecGrantStmt_oids(InternalGrant *istmt)
>>    * objectNamesToOids
>>    *
>>    * Turn a list of object names of a given type into an Oid list.
>> - *
>> - * XXX: This function doesn't take any sort of locks on the objects whose
>> - * names it looks up.  In the face of concurrent DDL, we might easily latch
>> - * onto an old version of an object, causing the GRANT or REVOKE statement
>> - * to fail.
> 
> To prevent "latch onto an old version" and remove the last sentence of the
> comment, we'd need two more things:
> 
> - Use a self-exclusive lock here, not AccessShareLock.  With two sessions
>    doing GRANT under AccessShareLock, one will "latch onto an old version".
> 
> - Use a self-exclusive lock before *every* CatalogTupleUpdate() of a row that
>    GRANT/REVOKE can affect.  For example, today's locking in ALTER FUNCTION is
>    the xmax stamped on the old tuple.  If GRANT switched to
>    ShareUpdateExclusiveLock, concurrent ALTER FUNCTION could still cause GRANT
>    to "latch onto an old version".

Ok, we should probably put that comment back in slightly altered form, like

"XXX This function intentionally takes only an AccessShareLock ... 
$REASON.  In the face of concurrent DDL, we might easily latch
onto an old version of an object, causing the GRANT or REVOKE statement
to fail."

> I wouldn't do those, however.  It would make GRANT ALL TABLES IN SCHEMA
> terminate every autovacuum running in the schema and consume a shared lock
> table entry per table in the schema.  I think the user-visible benefit of
> commit d31bbfb plus this additional work is just changing "ERROR:  tuple
> concurrently updated" to blocking.  That's not nothing, but I don't see it
> outweighing autovacuum termination and lock table consumption spikes.  What
> other benefits and drawbacks should we weigh?

I think what are describing is a reasonable tradeoff.  The user 
experience is tolerable: "tuple concurrently updated" is a mildly useful 
error message, and it's probably the table owner executing both commands.

The change to AccessShareLock at least prevents confusing "cache lookup 
failed" messages, and might alleviate some security concerns about 
swapping in a completely different object concurrently (even if we 
currently think this is not an actual problem).


>> --- a/src/test/isolation/expected/intra-grant-inplace.out
>> +++ b/src/test/isolation/expected/intra-grant-inplace.out
>> @@ -248,6 +248,6 @@ relhasindex
>>   -----------
>>   (0 rows)
>>   
>> -s4: WARNING:  got: cache lookup failed for relation REDACTED
>> +s4: WARNING:  got: relation "intra_grant_inplace" does not exist
> 
> The affected permutation existed to cover the first LockRelease() in
> SearchSysCacheLocked1().  Since this commit, that line no longer has coverage.

Do you have an idea how such a test case could be constructed now?

Re: Proper object locking for GRANT/REVOKE

From

Noah Misch

Date:

09 December 2024, 04:25:41

On Mon, Dec 02, 2024 at 12:13:56PM +0100, Peter Eisentraut wrote:
> On 25.11.24 02:24, Noah Misch wrote:
> > commit d31bbfb wrote:
> > > --- a/src/backend/catalog/aclchk.c
> > > +++ b/src/backend/catalog/aclchk.c
> > > @@ -659,147 +659,77 @@ ExecGrantStmt_oids(InternalGrant *istmt)
> > >    * objectNamesToOids
> > >    *
> > >    * Turn a list of object names of a given type into an Oid list.
> > > - *
> > > - * XXX: This function doesn't take any sort of locks on the objects whose
> > > - * names it looks up.  In the face of concurrent DDL, we might easily latch
> > > - * onto an old version of an object, causing the GRANT or REVOKE statement
> > > - * to fail.
> > 
> > To prevent "latch onto an old version" and remove the last sentence of the
> > comment, we'd need two more things:
> > 
> > - Use a self-exclusive lock here, not AccessShareLock.  With two sessions
> >    doing GRANT under AccessShareLock, one will "latch onto an old version".
> > 
> > - Use a self-exclusive lock before *every* CatalogTupleUpdate() of a row that
> >    GRANT/REVOKE can affect.  For example, today's locking in ALTER FUNCTION is
> >    the xmax stamped on the old tuple.  If GRANT switched to
> >    ShareUpdateExclusiveLock, concurrent ALTER FUNCTION could still cause GRANT
> >    to "latch onto an old version".
> 
> Ok, we should probably put that comment back in slightly altered form, like
> 
> "XXX This function intentionally takes only an AccessShareLock ... $REASON.
> In the face of concurrent DDL, we might easily latch
> onto an old version of an object, causing the GRANT or REVOKE statement
> to fail."

Yep.

> > I wouldn't do those, however.  It would make GRANT ALL TABLES IN SCHEMA
> > terminate every autovacuum running in the schema and consume a shared lock
> > table entry per table in the schema.  I think the user-visible benefit of
> > commit d31bbfb plus this additional work is just changing "ERROR:  tuple
> > concurrently updated" to blocking.  That's not nothing, but I don't see it
> > outweighing autovacuum termination and lock table consumption spikes.  What
> > other benefits and drawbacks should we weigh?
> 
> I think what are describing is a reasonable tradeoff.  The user experience
> is tolerable: "tuple concurrently updated" is a mildly useful error message,
> and it's probably the table owner executing both commands.
> 
> The change to AccessShareLock at least prevents confusing "cache lookup
> failed" messages, and might alleviate some security concerns about swapping
> in a completely different object concurrently (even if we currently think
> this is not an actual problem).

Perhaps.  To me, the v17 behavior smells mildly superior to the v18 behavior.

> > > --- a/src/test/isolation/expected/intra-grant-inplace.out
> > > +++ b/src/test/isolation/expected/intra-grant-inplace.out
> > > @@ -248,6 +248,6 @@ relhasindex
> > >   -----------
> > >   (0 rows)
> > > -s4: WARNING:  got: cache lookup failed for relation REDACTED
> > > +s4: WARNING:  got: relation "intra_grant_inplace" does not exist
> > 
> > The affected permutation existed to cover the first LockRelease() in
> > SearchSysCacheLocked1().  Since this commit, that line no longer has coverage.
> 
> Do you have an idea how such a test case could be constructed now?

A rough idea.  The test worked because REVOKE used only LOCKTAG_TUPLE, which
didn't mind the LOCKTAG_RELATION from DROP TABLE.

One route might be to find another SearchSysCacheLocked1() caller that takes
no locks beyond the LOCKTAG_TUPLE taken right in SearchSysCacheLocked1().  I'd
add a temporary elog to report if that's happening.
check_lock_if_inplace_updateable_rel() is an example of reporting the absence
of a lock.  If check-world w/ that elog finds some operation reaching that
circumstance, this test could replace REVOKE with that operation.

Another route would be to remove the catalog row without locking the
underlying object, e.g. by replacing DROP TABLE with DELETE FROM pg_class.
That's more artificial.

Re: Proper object locking for GRANT/REVOKE

From

Noah Misch

Date:

24 June, 02:16:09

On Wed, Jun 11, 2025 at 05:22:53PM +0200, Peter Eisentraut wrote:
> On 09.12.24 02:25, Noah Misch wrote:
> > > Ok, we should probably put that comment back in slightly altered form, like
> > > 
> > > "XXX This function intentionally takes only an AccessShareLock ... $REASON.
> > > In the face of concurrent DDL, we might easily latch
> > > onto an old version of an object, causing the GRANT or REVOKE statement
> > > to fail."
> > 
> > Yep.
> 
> There is an open item for PG18 for this.  So here is a patch that adds a
> comment back, mostly from your descriptions in this thread.
> 
> > > The change to AccessShareLock at least prevents confusing "cache lookup
> > > failed" messages, and might alleviate some security concerns about swapping
> > > in a completely different object concurrently (even if we currently think
> > > this is not an actual problem).
> > 
> > Perhaps.  To me, the v17 behavior smells mildly superior to the v18 behavior.
> 
> Hmm.  I think there has been a general effort to get rid of internal errors
> like "cache lookup failed ..." and replace them with proper user-facing
> errors.  This change seems in line with that.

I have seen some commits along those lines, e.g. d8a0993.  They weren't adding
lock acquisitions, though.  If we've deliberately incurred lock acquisitions
just to get rid of "cache lookup failed ...", I don't remember those commits.

> An alternative, if we wanted to go back to the old behavior (other than
> reverting altogether, since I think the refactoring is still valuable),
> would be to allow get_object_address() to work with lockmode == NoLock. That
> would require a bit of work, but nothing magical.

That seems a bit better to me than your comment-only proposal, but either
could be okay.