Thread: findDependentObjects() mutual exclusion vs. MVCC catalog scans
Consider this sequence of commands: create type rowtype as (c int, d int); create temp table t of rowtype; \c - drop type rowtype cascade; Since the switch to MVCC catalog scans, it exhibits the following error about 10% of the time on my system: CREATE TYPE CREATE TABLE You are now connected to database "test" as user "nm". ERROR: XX000: cache lookup failed for relation 17009 LOCATION: getRelationDescription, objectaddress.c:2186 With "\c", in general, you may end up executing commands under the new session before the old backend has finished exiting. For this test case specifically, the two backends' attempts to drop table "t" regularly overlap. The old backend would drop it within RemoveTempRelationsCallback(), and the new backend would cascade from "rowtype" to drop it. findDependentObjects() deals with concurrent deletion attempts by acquiring a lock on each object it will delete, then calling systable_recheck_tuple() to determine whether another deleter was successful while the current backend was waiting for the lock. systable_recheck_tuple() uses the scan snapshot, which really only works if that snapshot is SnapshotNow or some other that changes its decision in response to concurrent transaction commits. The switch to MVCC snapshots left this mutual exclusion protocol ineffective. Let's fix this by having systable_recheck_tuple() acquire a fresh catalog MVCC snapshot and recheck against that. I believe it would also be fully safe to use SnapshotNow here; however, I'm assuming we would otherwise manage to remove SnapshotNow entirely. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Attachment
On Tue, Jul 16, 2013 at 9:50 AM, Noah Misch <noah@leadboat.com> wrote: > Consider this sequence of commands: > > create type rowtype as (c int, d int); > create temp table t of rowtype; > \c - > drop type rowtype cascade; > > Since the switch to MVCC catalog scans, it exhibits the following error about > 10% of the time on my system: > > CREATE TYPE > CREATE TABLE > You are now connected to database "test" as user "nm". > ERROR: XX000: cache lookup failed for relation 17009 > LOCATION: getRelationDescription, objectaddress.c:2186 > > With "\c", in general, you may end up executing commands under the new session > before the old backend has finished exiting. For this test case specifically, > the two backends' attempts to drop table "t" regularly overlap. The old > backend would drop it within RemoveTempRelationsCallback(), and the new > backend would cascade from "rowtype" to drop it. findDependentObjects() deals > with concurrent deletion attempts by acquiring a lock on each object it will > delete, then calling systable_recheck_tuple() to determine whether another > deleter was successful while the current backend was waiting for the lock. > systable_recheck_tuple() uses the scan snapshot, which really only works if > that snapshot is SnapshotNow or some other that changes its decision in > response to concurrent transaction commits. The switch to MVCC snapshots left > this mutual exclusion protocol ineffective. > > Let's fix this by having systable_recheck_tuple() acquire a fresh catalog MVCC > snapshot and recheck against that. I believe it would also be fully safe to > use SnapshotNow here; however, I'm assuming we would otherwise manage to > remove SnapshotNow entirely. I recommend reworking the header comment to avoid mention of SnapshotNow, since if we get rid of SnapshotNow, the reference might not be too clear to far-future hackers. + /* + * For a scan using a non-MVCC snapshot like SnapshotSelf, we would simply + * reuse the old snapshot. So far, the only caller uses MVCC snapshots. + */ + freshsnap = GetCatalogSnapshot(RelationGetRelid(sysscan->heap_rel)); This comment is not very clear, because it doesn't describe what the code actually does, but rather speculates about what the code could do if the intention of some future caller were different. I recommend adding Assert(IsMVCCSnapshot(scan->xs_snapshot)) and changing the comment to something like this: "For now, we don't handle the case of a non-MVCC scan snapshot. This is adequate for existing uses of this function, but might need to be changed in the future." -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Jul 16, 2013 at 9:50 AM, Noah Misch <noah@leadboat.com> wrote: >> Let's fix this by having systable_recheck_tuple() acquire a fresh catalog MVCC >> snapshot and recheck against that. I believe it would also be fully safe to >> use SnapshotNow here; however, I'm assuming we would otherwise manage to >> remove SnapshotNow entirely. I agree with Robert's comments, and in addition suggest that this code needs a comment about why it's safe to use the snapshot without doing RegisterSnapshot or equivalent. regards, tom lane
On 2013-07-16 09:50:07 -0400, Noah Misch wrote: > With "\c", in general, you may end up executing commands under the new session > before the old backend has finished exiting. For this test case specifically, > the two backends' attempts to drop table "t" regularly overlap. The old > backend would drop it within RemoveTempRelationsCallback(), and the new > backend would cascade from "rowtype" to drop it. findDependentObjects() deals > with concurrent deletion attempts by acquiring a lock on each object it will > delete, then calling systable_recheck_tuple() to determine whether another > deleter was successful while the current backend was waiting for the lock. > systable_recheck_tuple() uses the scan snapshot, which really only works if > that snapshot is SnapshotNow or some other that changes its decision in > response to concurrent transaction commits. The switch to MVCC snapshots left > this mutual exclusion protocol ineffective. Nice catch. I wonder though, isn't that code unsafe in other ways as well? What if the pg_depend entry was rewritten inbetween? Consider somebody doing something like REASSIGN OWNED concurrently with a DROP. The DROP possibly will cascade to an entry which changed the owner already. And the recheck will then report that the object doesn't exist anymore and abort since it does a simple HeapTupleSatisfiesVisibility() and doesn't follow the ctid chain if the tuple has been updated... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jul 16, 2013 at 05:56:10PM +0200, Andres Freund wrote: > On 2013-07-16 09:50:07 -0400, Noah Misch wrote: > > With "\c", in general, you may end up executing commands under the new session > > before the old backend has finished exiting. For this test case specifically, > > the two backends' attempts to drop table "t" regularly overlap. The old > > backend would drop it within RemoveTempRelationsCallback(), and the new > > backend would cascade from "rowtype" to drop it. findDependentObjects() deals > > with concurrent deletion attempts by acquiring a lock on each object it will > > delete, then calling systable_recheck_tuple() to determine whether another > > deleter was successful while the current backend was waiting for the lock. > > systable_recheck_tuple() uses the scan snapshot, which really only works if > > that snapshot is SnapshotNow or some other that changes its decision in > > response to concurrent transaction commits. The switch to MVCC snapshots left > > this mutual exclusion protocol ineffective. > > Nice catch. > > I wonder though, isn't that code unsafe in other ways as well? What if > the pg_depend entry was rewritten inbetween? Consider somebody doing > something like REASSIGN OWNED concurrently with a DROP. The DROP > possibly will cascade to an entry which changed the owner already. And > the recheck will then report that the object doesn't exist anymore and > abort since it does a simple HeapTupleSatisfiesVisibility() and doesn't > follow the ctid chain if the tuple has been updated... I'm not seeing a problem with that particular route. Say we're examining a pg_depend tuple where the referencing object is a table and the referenced object is a role, the table's owner. We're dropping the role and cascade to the table. If the REASSIGNED OWNED assigns the table to a different role, then we are correct to treat the dependency as gone. If it's the same role ("REASSIGNED OWNED BY alice TO alice", pointless but permitted), several of the rename implementations short-circuit that case and don't change catalog entries. But even if that optimization were omitted, shdepChangeDep() will have blocked against our previously-acquired deletion lock on the role. Code that adds or updates a dependency without locking both objects of the new pg_depend tuple is buggy independently. That being said, there may well be a related mechanism that can slip past the locking here. My brain turns to mush when I ponder findDependentObjects() too thoroughly. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com
On Tue, Jul 16, 2013 at 11:27:02AM -0400, Robert Haas wrote: > I recommend reworking the header comment to avoid mention of > SnapshotNow, since if we get rid of SnapshotNow, the reference might > not be too clear to far-future hackers. > > + /* > + * For a scan using a non-MVCC snapshot like SnapshotSelf, we would simply > + * reuse the old snapshot. So far, the only caller uses MVCC snapshots. > + */ > + freshsnap = GetCatalogSnapshot(RelationGetRelid(sysscan->heap_rel)); > > This comment is not very clear, because it doesn't describe what the > code actually does, but rather speculates about what the code could do > if the intention of some future caller were different. I recommend > adding Assert(IsMVCCSnapshot(scan->xs_snapshot)) and changing the > comment to something like this: "For now, we don't handle the case of > a non-MVCC scan snapshot. This is adequate for existing uses of this > function, but might need to be changed in the future." On Tue, Jul 16, 2013 at 11:35:48AM -0400, Tom Lane wrote: > I agree with Robert's comments, and in addition suggest that this code > needs a comment about why it's safe to use the snapshot without doing > RegisterSnapshot or equivalent. Committed with hopefully-better comments. Thanks. -- Noah Misch EnterpriseDB http://www.enterprisedb.com