Re: race condition in pg_class - Mailing list pgsql-hackers
From: Noah Misch
Subject: Re: race condition in pg_class
Date:
Msg-id: 20231102030915.d3.nmisch@google.com
In response to: Re: race condition in pg_class (Noah Misch <noah@leadboat.com>)
Responses: Re: race condition in pg_class
List: pgsql-hackers
I prototyped two ways, one with a special t_ctid and one with LockTuple().

On Fri, Oct 27, 2023 at 04:26:12PM -0700, Noah Misch wrote:
> On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> > Noah Misch <noah@leadboat.com> writes:
> > > On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
> > > I anticipate a new locktag per catalog that can receive inplace updates,
> > > i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.
> >
> > We could perhaps make this work by using the existing tuple-lock
> > infrastructure, rather than inventing new locktags (a choice that
> > spills to a lot of places including clients that examine pg_locks).
>
> That could be okay. It would be weird to reuse a short-term lock like that
> one as something held till end of transaction. But the alternative of new
> locktags ain't perfect, as you say.

That worked.

> > I would prefer though to find a solution that only depends on making
> > heap_inplace_update protect itself, without high-level cooperation
> > from the possibly-interfering updater. This is basically because
> > I'm still afraid that we're defining the problem too narrowly.
> > For one thing, I have nearly zero confidence that GRANT et al are
> > the only problematic source of conflicting transactional updates.
>
> Likewise here, but I have fair confidence that an assertion would flush out
> the rest. heap_inplace_update() would assert that the backend holds one of
> the acceptable locks. It could even be an elog; heap_inplace_update() can
> tolerate that cost.

That check would fall in both heap_inplace_update() and heap_update(). After
all, a heap_inplace_update() check won't detect an omission in GRANT.
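To be concrete, the shape of check I have in mind is roughly the following.
This is an untested illustration, not code from either attached prototype:
the helper name is made up, the list of acceptable lock modes is a
placeholder, and I'm assuming LockHeldByMe() is the right lmgr primitive to
ask with.

#include "postgres.h"

#include "storage/lock.h"

/*
 * Hypothetical helper, called from heap_inplace_update() and from
 * heap_update() on catalogs that receive inplace updates.  Error out unless
 * the backend holds a lock that serializes against transactional updaters
 * of the relation whose pg_class row is being modified.
 */
static void
check_inplace_rel_lock(Oid dboid, Oid reloid)
{
	LOCKTAG		tag;

	SET_LOCKTAG_RELATION(tag, dboid, reloid);

	/* Placeholder list of acceptable modes; the real set is TBD. */
	if (!LockHeldByMe(&tag, ShareUpdateExclusiveLock) &&
		!LockHeldByMe(&tag, AccessExclusiveLock))
		elog(ERROR,
			 "pg_class row of relation %u modified without expected lock",
			 reloid);
}

Under the first design listed further down (acquire ShareUpdateExclusiveLock
in GRANT), the GRANT-side counterpart would then be little more than a
LockRelationOid(reloid, ShareUpdateExclusiveLock) call before fetching the
old tuple.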
> > For another, I'm worried that some extension may be using
> > heap_inplace_update against a catalog we're not considering here.
>
> A pgxn search finds "citus" using heap_inplace_update().
>
> > I'd also like to find a solution that fixes the case of a conflicting
> > manual UPDATE (although certainly that's a stretch goal we may not be
> > able to reach).
>
> It would be nice.

I expect most approaches could get there by having ExecModifyTable() arrange
for the expected locking or other actions. That's analogous to how
heap_update() takes care of sinval even for a manual UPDATE.

> > I wonder if there's a way for heap_inplace_update to mark the tuple
> > header as just-updated in a way that regular heap_update could
> > recognize. (For standard catalog updates, we'd then end up erroring
> > in simple_heap_update, which I think is fine.) We can't update xmin,
> > because the VACUUM callers don't have an XID; but maybe there's some
> > other way? I'm speculating about putting a funny value into xmax,
> > or something like that, and having heap_update check that what it
> > sees in xmax matches what was in the tuple the update started with.
>
> Hmmm. Achieving it without an XID would be the real trick. (With an XID, we
> could use xl_heap_lock like heap_update() does.) Thinking out loud, what if
> heap_inplace_update() sets HEAP_XMAX_INVALID and xmax =
> TransactionIdAdvance(xmax)? Or change t_ctid in a similar way. Then regular
> heap_update() could complain if the field changed vs. last seen value. This
> feels like something to regret later in terms of limiting our ability to
> harness those fields for more-valuable ends or compact them away in a future
> page format. I can't pinpoint a specific loss, so the idea might have legs.
>
> Nontransactional data in separate tables or in new metapages smells like the
> right long-term state. A project wanting to reuse the tuple header bits could
> introduce such storage to unblock its own bit reuse.

heap_update() does not have the pre-modification xmax today, so I used t_ctid.
heap_modify_tuple() preserves t_ctid, so heap_update() already has the
pre-modification t_ctid in key cases. For details of how the prototype uses
t_ctid, see comment at "#define InplaceCanaryOffsetNumber". The prototype
doesn't prevent corruption in the following scenario, because the aborted
ALTER TABLE RENAME overwrites the special t_ctid:

== session 1
drop table t;
create table t (c int);
begin;
-- in gdb, set breakpoint on heap_modify_tuple
grant select on t to public;

== session 2
alter table t add primary key (c);
begin;
alter table t rename to t2;
rollback;

== back in session 1
-- release breakpoint
-- want error (would get it w/o the begin;alter;rollback)
commit;

I'm missing how to mark the tuple in a fashion accessible to a second
heap_update() after a rolled-back heap_update(). The mark needs enough bits
"N" so it's implausible for 2^N inplace updates to happen between GRANT
fetching the old tuple and GRANT completing heap_update(). Looking for bits
that could persist across a rolled-back heap_update(), we have 3 in t_ctid,
2 in t_infomask2, and 0 in xmax. I definitely don't want to paint us into a
corner by spending the t_infomask2 bits on this. Even if I did, 2^(3+2)=32
wouldn't clearly be enough inplace updates.

Is there a way to salvage the goal of fixing the bug without modifying code
like ExecGrant_common()? If not, I'm inclined to pick from one of the
following designs:

- Acquire ShareUpdateExclusiveLock in GRANT ((B1) from previous list). It
  does make GRANT more intrusive; e.g. GRANT will cancel autovacuum. I'm
  leaning toward this one for two reasons. First, it doesn't slow
  heap_update() outside of assert builds. Second, it makes the GRANT
  experience more like the rest of our DDL, in that concurrent DDL will make
  GRANT block, not fail.

- GRANT passes to heapam the fixed-size portion of the pre-modification
  tuple. heap_update() compares those bytes to the oldtup in shared buffers
  to see if an inplace update happened. (HEAD could get the bytes from a new
  heap_update() parameter, while back branches would need a different passing
  approach.)

- LockTuple(), as seen in its attached prototype. I like this least at the
  moment, because it changes pg_locks content without having a clear
  advantage over the previous option. Also, the prototype has enough FIXME
  markers that I expect this to get hairy before it's done. I might change my
  preference after further prototypes.

Does anyone have a strong preference between those? Briefly, I did consider
these additional alternatives:

- Just accept the yet-rarer chance of corruption from this message's test
  procedure.

- Hold a buffer lock long enough to solve things.

- Remember the tuples where we overwrote a special t_ctid, and reverse the
  overwrite during abort processing. But I/O in the abort path sounds bad.

Thanks,
nm
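P.S. For the second design above (compare pre-modification bytes), the
heap_update() side would look roughly like this. Again an illustration
rather than code from the attached prototypes; how GRANT hands heap_update()
the pre-modification image, and how the length of the catalog's fixed-size
portion is determined, are hand-waved here.

#include "postgres.h"

#include "access/htup_details.h"

/*
 * Hypothetical check inside heap_update(): "expected" is the tuple header
 * GRANT read before building its new tuple, "current" is the tuple now in
 * the shared buffer, and "fixedlen" covers the catalog's fixed-size columns.
 * An intervening inplace update changes those bytes, so on mismatch we fail
 * instead of silently overwriting the inplace update.
 */
static void
check_no_inplace_update(HeapTupleHeader current, HeapTupleHeader expected,
						Size fixedlen)
{
	if (memcmp((char *) current + current->t_hoff,
			   (char *) expected + expected->t_hoff,
			   fixedlen) != 0)
		elog(ERROR, "tuple concurrently updated");
}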
Attachment