Re: race condition in pg_class - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | Re: race condition in pg_class |
Date | |
Msg-id | 20240817040748.f8.nmisch@google.com Whole thread Raw |
In response to | Re: race condition in pg_class (Heikki Linnakangas <hlinnaka@iki.fi>) |
Responses |
Re: race condition in pg_class
|
List | pgsql-hackers |
Thanks for reviewing. On Fri, Aug 16, 2024 at 12:26:28PM +0300, Heikki Linnakangas wrote: > On 14/07/2024 20:48, Noah Misch wrote: > > I've pushed the two patches for your reports. To placate cfbot, I'm attaching > > the remaining patches. > > inplace090-LOCKTAG_TUPLE-eoxact-v8.patch: Makes sense. A comment would be in > order, it looks pretty random as it is. Something like: > > /* > * Tuple locks are currently only held for short durations within a > * transaction. Check that we didn't forget to release one. > */ Will add. > inplace110-successors-v8.patch: Makes sense. > > The README changes would be better as part of the third patch, as this patch > doesn't actually do any of the new locking described in the README, and it > fixes the "inplace update updates wrong tuple" bug even without those tuple > locks. That should work. Will confirm. > > + * ... [any slow preparation not requiring oldtup] ... > > + * heap_inplace_update_scan([...], &tup, &inplace_state); > > + * if (!HeapTupleIsValid(tup)) > > + * elog(ERROR, [...]); > > + * ... [buffer is exclusive-locked; mutate "tup"] ... > > + * if (dirty) > > + * heap_inplace_update_finish(inplace_state, tup); > > + * else > > + * heap_inplace_update_cancel(inplace_state); > > I wonder if the functions should be called "systable_*" and placed in > genam.c rather than in heapam.c. The interface looks more like the existing > systable functions. It feels like a modularity violation for a function in > heapam.c to take an argument like "indexId", and call back into systable_* > functions. Yes, _scan() and _cancel() especially are wrappers around systable. Some API options follow. Any preference or other ideas? ==== direct s/heap_/systable_/ rename systable_inplace_update_scan([...], &tup, &inplace_state); if (!HeapTupleIsValid(tup)) elog(ERROR, [...]); ... [buffer is exclusive-locked; mutate "tup"] ... if (dirty) systable_inplace_update_finish(inplace_state, tup); else systable_inplace_update_cancel(inplace_state); ==== make the first and last steps more systable-like systable_inplace_update_begin([...], &tup, &inplace_state); if (!HeapTupleIsValid(tup)) elog(ERROR, [...]); ... [buffer is exclusive-locked; mutate "tup"] ... if (dirty) systable_inplace_update(inplace_state, tup); systable_inplace_update_end(inplace_state); ==== no systable_ wrapper for middle step, more like CatalogTupleUpdate systable_inplace_update_begin([...], &tup, &inplace_state); if (!HeapTupleIsValid(tup)) elog(ERROR, [...]); ... [buffer is exclusive-locked; mutate "tup"] ... if (dirty) heap_inplace_update(relation, systable_inplace_old_tuple(inplace_state), tup, systable_inplace_buffer(inplace_state)); systable_inplace_update_end(inplace_state); > > /*---------- > > * XXX A crash here can allow datfrozenxid() to get ahead of relfrozenxid: > > * > > * ["D" is a VACUUM (ONLY_DATABASE_STATS)] > > * ["R" is a VACUUM tbl] > > * D: vac_update_datfrozenid() -> systable_beginscan(pg_class) > > * c * D: systable_getnext() returns pg_class tuple of tbl > > * R: memcpy() into pg_class tuple of tbl > > * D: raise pg_database.datfrozenxid, XLogInsert(), finish > > * [crash] > > * [recovery restores datfrozenxid w/o relfrozenxid] > > */ > > Hmm, that's a tight race, but feels bad to leave it unfixed. One approach > would be to modify the tuple on the buffer only after WAL-logging it. That > way, D cannot read the updated value before it has been WAL logged. Just > need to make sure that the change still gets included in the WAL record. > Maybe something like: > > if (RelationNeedsWAL(relation)) > { > /* > * Make a temporary copy of the page that includes the change, in > * case the a full-page image is logged > */ > PGAlignedBlock tmppage; > > memcpy(tmppage.data, page, BLCKSZ); > > /* copy the tuple to the temporary copy */ > memcpy(...); > > XLogRegisterBlock(0, ..., tmppage, REGBUF_STANDARD); > XLogInsert(); > } > > /* copy the tuple to the buffer */ > memcpy(...); Yes, that's the essence of inplace180-datfrozenxid-overtakes-relfrozenxid-v1.patch from https://postgr.es/m/flat/20240620012908.92.nmisch@google.com. > > pg_class heap_inplace_update_scan() callers: before the call, acquire > > LOCKTAG_RELATION in mode ShareLock (CREATE INDEX), ShareUpdateExclusiveLock > > (VACUUM), or a mode with strictly more conflicts. If the update targets a > > row of RELKIND_INDEX (but not RELKIND_PARTITIONED_INDEX), that lock must be > > on the table. Locking the index rel is optional. (This allows VACUUM to > > overwrite per-index pg_class while holding a lock on the table alone.) We > > could allow weaker locks, in which case the next paragraph would simply call > > for stronger locks for its class of commands. heap_inplace_update_scan() > > acquires and releases LOCKTAG_TUPLE in InplaceUpdateTupleLock, an alias for > > ExclusiveLock, on each tuple it overwrites. > > > > pg_class heap_update() callers: before copying the tuple to modify, take a > > lock that conflicts with at least one of those from the preceding paragraph. > > SearchSysCacheLocked1() is one convenient way to acquire LOCKTAG_TUPLE. > > After heap_update(), release any LOCKTAG_TUPLE. Most of these callers opt > > to acquire just the LOCKTAG_RELATION. > > These rules seem complicated. Phrasing this slightly differently, if I > understand correctly: for a heap_update() caller, it's always sufficient to > hold LOCKTAG_TUPLE, but if you happen to hold some other lock on the > relation that conflicts with those mentioned in the first paragraph, then > you can skip the LOCKTAG_TUPLE lock. Yes. > Could we just stipulate that you must always hold LOCKTAG_TUPLE when you > call heap_update() on pg_class or pg_database? That'd make the rule simple. We could. That would change more code sites. Rough estimate: $ git grep -E CatalogTupleUpd'.*(class|relrelation|relationRelation)' | wc -l 23 If the count were 2, I'd say let's simplify the rule like you're exploring. (I originally had a complicated rule for pg_database, but I abandoned that when it helped few code sites.) If it were 100, I'd say the complicated rule is worth it. A count of 23 makes both choices fair. Long-term, I hope relfrozenxid gets reimplemented with storage outside pg_class, removing the need for inplace updates. So the additional 23 code sites might change back at a future date. That shouldn't be a big consideration, though. Another option here would be to preface that README section with a simplified view, something like, "If a warning brought you here, take a tuple lock. The rest of this section is just for people needing to understand the conditions for --enable-casserts emitting that warning." How about that instead of simplifying the rules?
pgsql-hackers by date: