Thread: Re: Proper object locking for GRANT/REVOKE
Hi, On Mon, Oct 28, 2024 at 04:20:42PM +0100, Peter Eisentraut wrote: > This patch started out as a refactoring, thinking that objectNamesToOids() > in aclchk.c should really mostly be a loop around get_object_address(). > This is mostly true, with a few special cases because the node > representations are a bit different in some cases, and OBJECT_PARAMETER_ACL, > which is obviously very different. This saves a bunch of duplicative code, > which is nice. > > Additionally, get_object_address() handles locking, which > objectNamesToOids() somewhat famously does not do, and there is a code > comment about it. With this refactoring, we get the locking pretty much for > free. Thanks for the patch, this refactoring makes sense to me. A few random comments: 1 === + default: I like the idea of using default as the first "case" as a way to emphasize that this is the most "common" behavior. 2 === Nit + address = get_object_address(objtype, lfirst(cell), &relation, lockmode, false); + Assert(relation == NULL); Worth to explain why we do expect relation to be NULL here? (the comment on top of get_object_address() says it all, but maybe a few words here could be worth it). 3 === - - /* Used GRANT DOMAIN on a non-domain? */ - if (istmt->objtype == OBJECT_DOMAIN && - pg_type_tuple->typtype != TYPTYPE_DOMAIN) - ereport(ERROR, - (errcode(ERRCODE_WRONG_OBJECT_TYPE), - errmsg("\"%s\" is not a domain", - NameStr(pg_type_tuple->typname)))); Yeah, get_object_address()->get_object_address_type() does take care of it. 4 === > > Interestingly, this changes the output of the intra-grant-inplace isolation > test, which is recent work. It might be good to get some review from those > who worked on that, to make sure that the new behavior is still correct > and/or to check whether those test cases are still applicable. -s4: WARNING: got: cache lookup failed for relation REDACTED +s4: WARNING: got: relation "intra_grant_inplace" does not exist I did not work on this but out of curiosity I looked at it, and IIUC this change comes from: - relOid = RangeVarGetRelid(relvar, NoLock, false); + relOid = RangeVarGetRelid(relvar, lockmode, false); So without the lock, the error is coming from ExecGrant_Relation() that way: [local] DO(+0x22529b) [0x5c56afdb629b] [local] DO(+0x2220b0) [0x5c56afdb30b0] [local] DO(ExecuteGrantStmt+0x696) [0x5c56afdb3047] [local] DO(+0x6dc8cb) [0x5c56b026d8cb] [local] DO(standard_ProcessUtility+0xd38) [0x5c56b026b9da] [local] DO(ProcessUtility+0x13a) [0x5c56b026ac9b] [local] DO(+0x460073) [0x5c56afff1073] With the lock, the error comes from RangeVarGetRelidExtended() that way: [local] DO(RangeVarGetRelidExtended+0x4e6) [0x5a1c3ac49d36] [local] DO(+0x2223fe) [0x5a1c3ac2a3fe] [local] DO(ExecuteGrantStmt+0x111) [0x5a1c3ac29ac2] [local] DO(+0x6dc1d2) [0x5a1c3b0e41d2] [local] DO(standard_ProcessUtility+0xd38) [0x5a1c3b0e22e1] [local] DO(ProcessUtility+0x13a) [0x5a1c3b0e15a2] That's due to (in RangeVarGetRelidExtended()): " * But if lockmode = NoLock, then we assume that either the caller is OK * with the answer changing under them, or that they already hold some * appropriate lock, and therefore return the first answer we get without * checking for invalidation messages. " So, in the RangeVarGetRelid(relvar, NoLock, false) case (without the patch) then if we are able to reach "relId = RelnameGetRelid(relation->relname);" before the drop gets committed then we get the "cache lookup failed" error. But if we are slow enough so that we don't reach "relId = RelnameGetRelid(relation->relname);" before the drop get committed then we would also get the "relation "intra_grant_inplace" does not exist" error (tested manually, attaching gdb on s4 and breakpoint before the RelnameGetRelid(relation->relname) call in RangeVarGetRelidExtended()). The fact that we use lockmode != NoLock in the patch produces a lock followed by a "retry" in RangeVarGetRelidExtended() and so we get the "relation "intra_grant_inplace" does not exist" error. I think that the new behavior is still correct and in fact is more "appropriate" ( I mean that's the kind of error I expect to see from a user point of view). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On 31.10.24 15:26, Bertrand Drouvot wrote: > + address = get_object_address(objtype, lfirst(cell), &relation, lockmode, false); > + Assert(relation == NULL); > > Worth to explain why we do expect relation to be NULL here? (the comment on top > of get_object_address() says it all, but maybe a few words here could be worth > it). There are several other callers with this pattern. Maybe it would be better to push the assertion into get_object_address(), something like Assert(!relation || relp) near the end. Meaning, if you pass NULL for the relp argument, then you don't expect a relation. This is kind of what will happen now anyway, except with a segfault instead of an assertion.
On 11.11.24 08:53, Bertrand Drouvot wrote: >> Maybe it would be better to push the assertion into get_object_address(), >> something like >> >> Assert(!relation || relp) >> >> near the end. > > That looks like a good idea to me, that would make the code cleaner and easier > to understand. > >> Meaning, if you pass NULL for the relp argument, then you >> don't expect a relation. This is kind of what will happen now anyway, >> except with a segfault instead of an assertion. > > Yeah, I like it. > > So, something like the attached (provided as .txt file to no mess up the CF bot > entry related to this thread) could be applied before? Thanks. I have applied your patch and then also mine with the appropriate adjustments.
On Fri, Nov 15, 2024 at 11:19:50AM +0100, Peter Eisentraut wrote: > On 11.11.24 08:53, Bertrand Drouvot wrote: > Thanks. I have applied your patch and then also mine with the appropriate > adjustments. commit d31bbfb wrote: > --- a/src/backend/catalog/aclchk.c > +++ b/src/backend/catalog/aclchk.c > @@ -659,147 +659,77 @@ ExecGrantStmt_oids(InternalGrant *istmt) > * objectNamesToOids > * > * Turn a list of object names of a given type into an Oid list. > - * > - * XXX: This function doesn't take any sort of locks on the objects whose > - * names it looks up. In the face of concurrent DDL, we might easily latch > - * onto an old version of an object, causing the GRANT or REVOKE statement > - * to fail. To prevent "latch onto an old version" and remove the last sentence of the comment, we'd need two more things: - Use a self-exclusive lock here, not AccessShareLock. With two sessions doing GRANT under AccessShareLock, one will "latch onto an old version". - Use a self-exclusive lock before *every* CatalogTupleUpdate() of a row that GRANT/REVOKE can affect. For example, today's locking in ALTER FUNCTION is the xmax stamped on the old tuple. If GRANT switched to ShareUpdateExclusiveLock, concurrent ALTER FUNCTION could still cause GRANT to "latch onto an old version". I wouldn't do those, however. It would make GRANT ALL TABLES IN SCHEMA terminate every autovacuum running in the schema and consume a shared lock table entry per table in the schema. I think the user-visible benefit of commit d31bbfb plus this additional work is just changing "ERROR: tuple concurrently updated" to blocking. That's not nothing, but I don't see it outweighing autovacuum termination and lock table consumption spikes. What other benefits and drawbacks should we weigh? > --- a/src/test/isolation/expected/intra-grant-inplace.out > +++ b/src/test/isolation/expected/intra-grant-inplace.out > @@ -248,6 +248,6 @@ relhasindex > ----------- > (0 rows) > > -s4: WARNING: got: cache lookup failed for relation REDACTED > +s4: WARNING: got: relation "intra_grant_inplace" does not exist The affected permutation existed to cover the first LockRelease() in SearchSysCacheLocked1(). Since this commit, that line no longer has coverage.
On 25.11.24 02:24, Noah Misch wrote: > commit d31bbfb wrote: >> --- a/src/backend/catalog/aclchk.c >> +++ b/src/backend/catalog/aclchk.c >> @@ -659,147 +659,77 @@ ExecGrantStmt_oids(InternalGrant *istmt) >> * objectNamesToOids >> * >> * Turn a list of object names of a given type into an Oid list. >> - * >> - * XXX: This function doesn't take any sort of locks on the objects whose >> - * names it looks up. In the face of concurrent DDL, we might easily latch >> - * onto an old version of an object, causing the GRANT or REVOKE statement >> - * to fail. > > To prevent "latch onto an old version" and remove the last sentence of the > comment, we'd need two more things: > > - Use a self-exclusive lock here, not AccessShareLock. With two sessions > doing GRANT under AccessShareLock, one will "latch onto an old version". > > - Use a self-exclusive lock before *every* CatalogTupleUpdate() of a row that > GRANT/REVOKE can affect. For example, today's locking in ALTER FUNCTION is > the xmax stamped on the old tuple. If GRANT switched to > ShareUpdateExclusiveLock, concurrent ALTER FUNCTION could still cause GRANT > to "latch onto an old version". Ok, we should probably put that comment back in slightly altered form, like "XXX This function intentionally takes only an AccessShareLock ... $REASON. In the face of concurrent DDL, we might easily latch onto an old version of an object, causing the GRANT or REVOKE statement to fail." > I wouldn't do those, however. It would make GRANT ALL TABLES IN SCHEMA > terminate every autovacuum running in the schema and consume a shared lock > table entry per table in the schema. I think the user-visible benefit of > commit d31bbfb plus this additional work is just changing "ERROR: tuple > concurrently updated" to blocking. That's not nothing, but I don't see it > outweighing autovacuum termination and lock table consumption spikes. What > other benefits and drawbacks should we weigh? I think what are describing is a reasonable tradeoff. The user experience is tolerable: "tuple concurrently updated" is a mildly useful error message, and it's probably the table owner executing both commands. The change to AccessShareLock at least prevents confusing "cache lookup failed" messages, and might alleviate some security concerns about swapping in a completely different object concurrently (even if we currently think this is not an actual problem). >> --- a/src/test/isolation/expected/intra-grant-inplace.out >> +++ b/src/test/isolation/expected/intra-grant-inplace.out >> @@ -248,6 +248,6 @@ relhasindex >> ----------- >> (0 rows) >> >> -s4: WARNING: got: cache lookup failed for relation REDACTED >> +s4: WARNING: got: relation "intra_grant_inplace" does not exist > > The affected permutation existed to cover the first LockRelease() in > SearchSysCacheLocked1(). Since this commit, that line no longer has coverage. Do you have an idea how such a test case could be constructed now?
On Mon, Dec 02, 2024 at 12:13:56PM +0100, Peter Eisentraut wrote: > On 25.11.24 02:24, Noah Misch wrote: > > commit d31bbfb wrote: > > > --- a/src/backend/catalog/aclchk.c > > > +++ b/src/backend/catalog/aclchk.c > > > @@ -659,147 +659,77 @@ ExecGrantStmt_oids(InternalGrant *istmt) > > > * objectNamesToOids > > > * > > > * Turn a list of object names of a given type into an Oid list. > > > - * > > > - * XXX: This function doesn't take any sort of locks on the objects whose > > > - * names it looks up. In the face of concurrent DDL, we might easily latch > > > - * onto an old version of an object, causing the GRANT or REVOKE statement > > > - * to fail. > > > > To prevent "latch onto an old version" and remove the last sentence of the > > comment, we'd need two more things: > > > > - Use a self-exclusive lock here, not AccessShareLock. With two sessions > > doing GRANT under AccessShareLock, one will "latch onto an old version". > > > > - Use a self-exclusive lock before *every* CatalogTupleUpdate() of a row that > > GRANT/REVOKE can affect. For example, today's locking in ALTER FUNCTION is > > the xmax stamped on the old tuple. If GRANT switched to > > ShareUpdateExclusiveLock, concurrent ALTER FUNCTION could still cause GRANT > > to "latch onto an old version". > > Ok, we should probably put that comment back in slightly altered form, like > > "XXX This function intentionally takes only an AccessShareLock ... $REASON. > In the face of concurrent DDL, we might easily latch > onto an old version of an object, causing the GRANT or REVOKE statement > to fail." Yep. > > I wouldn't do those, however. It would make GRANT ALL TABLES IN SCHEMA > > terminate every autovacuum running in the schema and consume a shared lock > > table entry per table in the schema. I think the user-visible benefit of > > commit d31bbfb plus this additional work is just changing "ERROR: tuple > > concurrently updated" to blocking. That's not nothing, but I don't see it > > outweighing autovacuum termination and lock table consumption spikes. What > > other benefits and drawbacks should we weigh? > > I think what are describing is a reasonable tradeoff. The user experience > is tolerable: "tuple concurrently updated" is a mildly useful error message, > and it's probably the table owner executing both commands. > > The change to AccessShareLock at least prevents confusing "cache lookup > failed" messages, and might alleviate some security concerns about swapping > in a completely different object concurrently (even if we currently think > this is not an actual problem). Perhaps. To me, the v17 behavior smells mildly superior to the v18 behavior. > > > --- a/src/test/isolation/expected/intra-grant-inplace.out > > > +++ b/src/test/isolation/expected/intra-grant-inplace.out > > > @@ -248,6 +248,6 @@ relhasindex > > > ----------- > > > (0 rows) > > > -s4: WARNING: got: cache lookup failed for relation REDACTED > > > +s4: WARNING: got: relation "intra_grant_inplace" does not exist > > > > The affected permutation existed to cover the first LockRelease() in > > SearchSysCacheLocked1(). Since this commit, that line no longer has coverage. > > Do you have an idea how such a test case could be constructed now? A rough idea. The test worked because REVOKE used only LOCKTAG_TUPLE, which didn't mind the LOCKTAG_RELATION from DROP TABLE. One route might be to find another SearchSysCacheLocked1() caller that takes no locks beyond the LOCKTAG_TUPLE taken right in SearchSysCacheLocked1(). I'd add a temporary elog to report if that's happening. check_lock_if_inplace_updateable_rel() is an example of reporting the absence of a lock. If check-world w/ that elog finds some operation reaching that circumstance, this test could replace REVOKE with that operation. Another route would be to remove the catalog row without locking the underlying object, e.g. by replacing DROP TABLE with DELETE FROM pg_class. That's more artificial.