Re: New WAL record to detect the checkpoint redo location - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: New WAL record to detect the checkpoint redo location
Msg-id: 20231009204702.cy3ilp6wgh7a5qms@awork3.anarazel.de
In response to: Re: New WAL record to detect the checkpoint redo location (Robert Haas <robertmhaas@gmail.com>)
Responses:
  Re: New WAL record to detect the checkpoint redo location
  Re: New WAL record to detect the checkpoint redo location
List: pgsql-hackers
Hi,

As noted in my email from a few minutes ago, I agree that optimizing this shouldn't be a requirement for merging the patch.

On 2023-10-09 15:58:36 -0400, Robert Haas wrote:
> 1. The reason why we're doing this multiplication and division is to
> make sure that the code in ReserveXLogInsertLocation which executes
> while holding insertpos_lck remains as simple and brief as possible.
> We could eliminate the conversion between usable byte positions and
> LSNs if we replaced Insert->{Curr,Prev}BytePos with LSNs and had
> ReserveXLogInsertLocation work out by how much to advance the LSN, but
> it would have to be worked out while holding insertpos_lck (or some
> replacement lwlock, perhaps) and that cure seems worse than the
> disease. Given that, I think we're stuck with converting between
> usable byte positions and LSNs, and that intrinsically needs some
> multiplication and division.

Right, that's absolutely crucial for scalability.

> 2. It seems possible to remove one branch in each of
> XLogBytePosToRecPtr and XLogBytePosToEndRecPtr. Rather than testing
> whether bytesleft < XLOG_BLCKSZ - SizeOfXLogLongPHD, we could simply
> increment bytesleft by SizeOfXLogLongPHD - SizeOfXLogShortPHD. Then
> the rest of the calculations can be performed as if every page in the
> segment had a header of length SizeOfXLogShortPHD, with no need to
> special-case the first page. However, that doesn't get rid of any
> multiplication or division, just a branch.

This reminded me of something that has bugged me for a while: the whole idea of short xlog page headers seems like a completely premature optimization. The page header is a very small fraction of the overall data (long: 40/8192 ~= 0.00488, short: 24/8192 ~= 0.00293); compared to the space we waste in many other places, including at a per-record level, it doesn't seem worth the complexity.

> 3. Aside from that, there seems to be no simple way to reduce the
> complexity of an individual calculation, but ReserveXLogInsertLocation
> does perform 3 rather similar computations, and I believe that we know
> that it will always be the case that *PrevPtr < *StartPos < *EndPos.
> Maybe we could have a fast-path for the case where they are all in the
> same segment. We could take prevbytepos modulo UsableBytesInSegment;
> call the result prevsegoff. If UsableBytesInSegment - prevsegoff >
> endbytepos - prevbytepos, then all three pointers are in the same
> segment, and maybe we could take advantage of that to avoid performing
> the segment calculations more than once, but still needing to repeat
> the page calculations. Or, instead or in addition, I think we could by
> a similar technique check whether all three pointers are on the same
> page; if so, then *StartPos and *EndPos can be computed from *PrevPtr
> by just adding the difference between the corresponding byte
> positions.

I think we might be able to speed some of this up by pre-computing values so that we can implement things like bytesleft / UsableBytesInPage with shifts. IIRC we already insist on power-of-two segment sizes, so instead of needing to divide by a runtime value, we should be able to shift by a runtime value (and the modulo should be a mask).

> I'm not really sure whether that would come out cheaper. It's just the
> only idea that I have. It did also occur to me to wonder whether the
> apparent delays performing multiplication and division here were
> really the result of the arithmetic itself being slow or whether they
> were synchronization-related, SpinLockRelease(&Insert->insertpos_lck)
> being a memory barrier just before. But I assume you thought about
> that and concluded that wasn't the issue here.

I did verify that they continue to be a bottleneck even after removing the spinlock (incorrectly, obviously).
That's also not too surprising: the latency of 64-bit divisions is just high, particularly on Intel CPUs from a few years ago (my Cascade Lake workstation), and IIRC there is just a single execution port for them, so multiple division instructions can't be fully parallelized. https://uops.info/table.html documents a worst-case latency of 89 cycles on Cascade Lake, with the division broken up into 36 uops (reducing what's available to track other in-flight instructions). It's much better on Alder Lake (9 cycles and 7 uops on the performance cores, 44 cycles and 4 uops on the efficiency cores) and on Zen 3+ (19 cycles, 2 uops).

Greetings,

Andres Freund