Re: Inval reliability, especially for inplace updates - Mailing list pgsql-hackers
From | Noah Misch
Subject | Re: Inval reliability, especially for inplace updates
Date |
Msg-id | 20240618181643.14.nmisch@google.com
In response to | Re: Inval reliability, especially for inplace updates (Noah Misch <noah@leadboat.com>)
List | pgsql-hackers

On Tue, Jun 18, 2024 at 08:23:49AM -0700, Noah Misch wrote:
> On Mon, Jun 17, 2024 at 06:57:30PM -0700, Andres Freund wrote:
> > On 2024-06-17 16:58:54 -0700, Noah Misch wrote:
> > > On Sat, Jun 15, 2024 at 03:37:18PM -0700, Noah Misch wrote:
> > > > On Wed, May 22, 2024 at 05:05:48PM -0700, Noah Misch wrote:
> > > > > https://postgr.es/m/20240512232923.aa.nmisch@google.com wrote:
> > > > > > Separable, nontrivial things not fixed in the attached patch stack:
> > > > > >
> > > > > > - Inplace update uses transactional CacheInvalidateHeapTuple(). ROLLBACK of
> > > > > >   CREATE INDEX wrongly discards the inval, leading to the relhasindex=t loss
> > > > > >   still seen in inplace-inval.spec. CacheInvalidateRelmap() does this right.
> > > > >
> > > > > I plan to fix that like CacheInvalidateRelmap(): send the inval immediately,
> > > > > inside the critical section. Send it in heap_xlog_inplace(), too.
> >
> > I'm worried this might cause its own set of bugs, e.g. if there are any places
> > that, possibly accidentally, rely on the invalidation from the inplace update
> > to also cover separate changes.
>
> Good point. I do have index_update_stats() still doing an ideally-superfluous
> relcache update for that reason. Taking that further, it would be cheap
> insurance to have the inplace update do a transactional inval in addition to
> its immediate inval. Future master-only work could remove the transactional
> one. How about that?
>
> > Have you considered instead submitting these invalidations during abort as
> > well?
>
> I had not. Hmmm. If the lock protocol in README.tuplock (after patch
> inplace120) told SearchSysCacheLocked1() to do systable scans instead of
> syscache reads, that could work. Would need to ensure a PANIC if transaction
> abort doesn't reach the inval submission. Overall, it would be harder to
> reason about the state of caches, but I suspect the patch would be smaller.
> How should we choose between those strategies?
>
> > > > > > a. Within logical decoding, cease processing invalidations for inplace
> > > >
> > > > I'm attaching the implementation. This applies atop the v3 patch stack from
> > > > https://postgr.es/m/20240614003549.c2.nmisch@google.com, but the threads are
> > > > mostly orthogonal and intended for independent review. Translating a tuple
> > > > into inval messages uses more infrastructure than relmapper, which needs just
> > > > a database ID. Hence, this ended up more like a miniature of inval.c's
> > > > participation in the transaction commit sequence.
> > > >
> > > > I waffled on whether to back-patch inplace150-inval-durability-atcommit
> > >
> > > That inplace150 patch turned out to be unnecessary. Contrary to the
> > > "noncritical resource releasing" comment some lines above
> > > AtEOXact_Inval(true), the actual behavior is already to promote ERROR to
> > > PANIC. An ERROR just before or after sending invals becomes PANIC, "cannot
> > > abort transaction %u, it was already committed".
> >
> > Relying on that, instead of explicit critical sections, seems fragile to me.
> > IIRC some of the behaviour around errors around transaction commit/abort has
> > changed a bunch of times. Tying correctness into something that could be
> > changed for unrelated reasons doesn't seem great.
>
> Fair enough. It could still be a good idea for master, but given I missed a
> bug in inplace150-inval-durability-atcommit-v1.patch far worse than the ones
> $SUBJECT fixes, let's not risk it in back branches.
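
For concreteness, here is a minimal sketch of the immediate-inval shape discussed
above: send the messages inside an explicit critical section, much as
CacheInvalidateRelmap() does, and keep the ideally-superfluous transactional inval
as cheap insurance. The helper name and its call site are hypothetical and not
taken from the attached patches; SendSharedInvalidMessages(),
CacheInvalidateHeapTuple(), and the critical-section macros are the only real
backend APIs used.

```c
#include "postgres.h"
#include "access/htup.h"        /* HeapTuple */
#include "miscadmin.h"          /* START_CRIT_SECTION / END_CRIT_SECTION */
#include "storage/sinval.h"     /* SendSharedInvalidMessages */
#include "utils/inval.h"        /* CacheInvalidateHeapTuple */
#include "utils/rel.h"          /* Relation */

/*
 * Hypothetical helper: publish an inplace update's invalidation messages.
 *
 * The immediate send happens whether or not the transaction later commits,
 * so a rolled-back CREATE INDEX can no longer discard the relhasindex inval
 * (the failure inplace-inval.spec demonstrates).  The transactional call is
 * belt-and-suspenders for any caller that may have come to rely on the old
 * transactional behavior; it is discarded on ROLLBACK as before.
 */
static void
inplace_publish_invals(Relation relation, HeapTuple tuple,
                       const SharedInvalidationMessage *msgs, int nmsgs)
{
    START_CRIT_SECTION();
    /* ... the buffer change and XLOG_HEAP_INPLACE insertion go here ... */
    SendSharedInvalidMessages(msgs, nmsgs); /* immediate, like relmapper */
    END_CRIT_SECTION();

    /* ideally-superfluous transactional inval, queued with the xact's invals */
    CacheInvalidateHeapTuple(relation, tuple, NULL);
}
```
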
> > I'm not sure it holds true even today - what if the transaction didn't have an
> > xid? Then RecordTransactionAbort() wouldn't trigger
> > "cannot abort transaction %u, it was already committed"
> > I think?
>
> I think that's right. As the inplace160-inval-durability-inplace-v2.patch
> edits to xact.c say, the concept of invals in XID-less transactions is buggy
> at its core. Fortunately, after that patch, we use them only for two things
> that could themselves stop with something roughly as simple as the attached.

Now actually attached.

> > > > - Same change, no WAL version bump. Standby must update before primary. This
> > > >   is best long-term, but the transition is more disruptive. I'm leaning
> > > >   toward this one, but the second option isn't bad:
> >
> > Hm. The inplace record doesn't use the length of the "main data" record
> > segment for anything, from what I can tell. If records by an updated primary
> > were replayed by an old standby, it'd just ignore the additional data, afaict?
>
> Agreed, but ...
>
> > I think with the code as-is, the situation with an updated standby replaying
> > an old primary's record would actually be worse - it'd afaict just assume the
> > now-longer record contained valid fields, despite those just pointing into
> > uninitialized memory. I think the replay routine would have to check the
> > length of the main data and execute the invalidation conditionally.
>
> I anticipated back branches supporting a new XLOG_HEAP_INPLACE_WITH_INVAL
> alongside the old XLOG_HEAP_INPLACE. Updated standbys would run both fine,
> and old binaries consuming new WAL would PANIC, "heap_redo: unknown op code".
>
> > > > - heap_xlog_inplace() could set the shared-inval-queue overflow signal on
> > > >   every backend. This is more wasteful, but inplace updates might be rare
> > > >   enough (~once per VACUUM) to make it tolerable.
> >
> > We already set that surprisingly frequently, as
> > a) The size of the sinval queue is small
> > b) If a backend is busy, it does not process catchup interrupts
> >    (i.e. executing queries, waiting for a lock prevents processing)
> > c) There's no deduplication of invals, we often end up sending the same inval
> >    over and over.
> >
> > So I suspect this might not be too bad, compared to the current badness.
>
> That is good. We might be able to do the overflow signal once at end of
> recovery, like RelationCacheInitFileRemove() does for the init file. That's
> mildly harder to reason about, but it would be cheaper. Hmmm.
>
> > At least for core code. I guess there could be extension code triggering
> > inplace updates more frequently? But I'd hope they'd do it not on catalog
> > tables... Except that we wouldn't know that that's the case during replay,
> > it's not contained in the record.
>
> For what it's worth, from a grep of PGXN, only citus does inplace updates.
>
> > > > - Use LogStandbyInvalidations() just after XLOG_HEAP_INPLACE. This isn't
> > > >   correct if one ends recovery between the two records, but you'd need to be
> > > >   unlucky to notice. Noticing would need a procedure like the following. A
> > > >   hot standby backend populates a relcache entry, then does DDL on the rel
> > > >   after recovery ends.
> >
> > Hm. The problematic cases presumably involves an access exclusive lock? If so,
> > could we do LogStandbyInvalidations() *before* logging the WAL record for the
> > inplace update? The invalidations can't be processed by other backends until
> > the exclusive lock has been released, which should avoid the race?
>
> A lock forces a backend to drain the inval queue before using the locked
> object, but it doesn't stop the backend from draining the queue and
> repopulating cache entries earlier. For example, pg_describe_object() can
> query many syscaches without locking underlying objects. Hence, the inval
> system relies on the buffer change getting fully visible to catcache queries
> before the sinval message enters the shared queue.
>
> Thanks,
> nm
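
To restate that ordering constraint as code, here is a rough sketch of the
publish order an inplace update has to follow (hypothetical function, not the
committed behavior; MarkBufferDirty(), SendSharedInvalidMessages(), and the
critical-section macros are the only real backend APIs used):

```c
#include "postgres.h"
#include "miscadmin.h"          /* START_CRIT_SECTION / END_CRIT_SECTION */
#include "storage/bufmgr.h"     /* Buffer, MarkBufferDirty */
#include "storage/sinval.h"     /* SendSharedInvalidMessages */

/*
 * Hypothetical sketch of the publish order for an inplace update.
 *
 * If the sinval message could be read from the shared queue before the
 * catalog change were visible, another backend might process the message,
 * rebuild its catcache entry from the still-unmodified tuple, and then never
 * hear about the change; no lock on the object is needed for that backend to
 * repopulate its caches in the meantime (e.g. via pg_describe_object()).
 */
static void
inplace_publish_order(Buffer buf,
                      const SharedInvalidationMessage *msgs, int nmsgs)
{
    START_CRIT_SECTION();
    /* 1. modify the tuple in the (exclusively locked) buffer */
    MarkBufferDirty(buf);
    /* 2. emit the WAL record for the change */

    /* 3. only now let the message enter the shared queue */
    SendSharedInvalidMessages(msgs, nmsgs);
    END_CRIT_SECTION();
}
```
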