Re: Deleting older versions in unique indexes to avoid page splits - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Deleting older versions in unique indexes to avoid page splits
Date
Msg-id CAH2-Wzn0TGfxhAviNq=wgwoP9FHHhBvyPmiZiDKKRXpEk8GEYQ@mail.gmail.com
Whole thread Raw
In response to Re: Deleting older versions in unique indexes to avoid page splits  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
On Mon, Jan 4, 2021 at 4:08 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> You said above that heap_tid_shellsort() is very performance critical,
> and that's why you use the two arrays approach. If it's so performance
> critical that swapping 8 bytes vs 12 byte array elements makes a
> difference, I would guess that comparing TID vs just the block numbers
> would also make a difference.
>
> If you really want to shave cycles, you could also store BlockNumber and
> OffsetNumber in the TM_IndexDelete array, instead of ItemPointerData.
> What's the difference, you might ask? ItemPointerData stores the block
> number as two 16 bit integers that need to be reassembled into a 32 bit
> integer in the ItemPointerGetBlockNumber() macro.
>
> My argument here is two-pronged: If this is performance critical, you
> could do these additional optimizations. If it's not, then you don't
> need the two-arrays trick; one array would be simpler.

That's reasonable. The reason why I haven't been more decisive here is
because the question of whether or not it matters is actually very
complicated, for reasons that have little to do with sorting. The more
effective the mechanism is each time (the more bytes it allows nbtree
to free from each leaf page), the less often it is called, and the
less performance critical the overhead per operation is. On the other
hand there are a couple of other questions about what we do in
heapam.c that aren't quite resolved yet (e.g. exact "favorable
blocks"/prefetching behavior in bottom-up case), that probably affect
how important the heap_tid_shellsort() microoptimisations are.

I think that it makes sense to make a final decision on this at the
last minute, once everything else is resolved, since the implicit
dependencies make any other approach much too complicated. I agree
that this kind of microoptimization is best avoided, but I really
don't want to have to worry about regressions in workloads that I now
understand fully. I think that the sort became less important on
perhaps 2 or 3 occasions during development, even though that was
never really the goal that I had in mind in each case.

I'll do my best to avoid it.

> Hmm, maybe: "... is the best information that we have to go on, it is
> just a guess based on simple heuristics about duplicates in indexes".

I'll add something like that to the next revision.

> I see. Would be good to explain that pattern with multiple indexes in
> the comment.

Will do -- it is the single best example of how heap block locality
can matter with a real workload, so it makes sense to go with it in
explanatory comments.

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: set_config() documentation clarification
Next
From: Chapman Flack
Date:
Subject: Re: set_config() documentation clarification