[HACKERS] Abbreviated keys in nbtree internal pages - Mailing list pgsql-hackers
From | Peter Geoghegan
---|---
Subject | [HACKERS] Abbreviated keys in nbtree internal pages
Date |
Msg-id | CAH2-Wz=mV4dmOaPFicRSyNtv2KinxEOtBwUY5R7fXXOC-OearA@mail.gmail.com
List | pgsql-hackers
Over on the "memory layouts for binary search in nbtree" thread, I described a plausible way of implementing abbreviated keys for nbtree internal pages [1]. I've been going on about this technique for a long time, but the insight that we can do it by reusing an ItemId's lp_len field is a new one (a rough sketch of the idea appears at the end of this message). I'm encouraged by the fact that I got a POC patch to work quite easily.

This optimization is all about reducing the number of cache misses within _bt_binsrch(). We'd do this for internal pages only. The big idea is that binary searches may only have to access the ItemId array on internal pages, which is often less than 1/8 of the size of the page, and sometimes only 1/16 (consider internal page fillfactor).

Areas of concern/interest include:

* How to implement an encoding scheme that makes the best of the 15 bits we'd have for this (see the second sketch below). That could involve hard trade-offs for something like numeric. Note that I'm not interested in making any encoding scheme adaptive -- simplicity demands that we implement totally generic encoding schemes, I think. We'd also make the comparisons totally generic unsigned integer comparisons, hard-coded within _bt_compare().

  15 bits is more than it sounds. It would be useless for sorting, but within B-Trees things are quite different. You can amortize the cost of generating the key over a very long period of time, and abbreviated keys are only needed for internal pages (typically perhaps 1 in 50 - 200 leaf tuples for text). You only need to generate a new abbreviated key during a leaf page split (page splits of internal pages just reuse the existing one). And, most importantly, internal pages naturally represent dispersed values. You have an excellent chance of resolving comparisons within the root page with only one-byte abbreviated key comparisons. 15 bits gets you surprisingly far.

* What other code this optimization might break.

* Workloads helped by the optimization, which should be tested.

* Whether it's okay that some workloads will not be helped at all. The hope here is that branch prediction, plus the fact that we're not accessing anything that isn't already in a cache line, means those cases don't hurt either.

* We'll want to do this for text's default B-Tree opclass, with a real collation, since that is where it would really matter. I think there are concerns about the stability of something like a strxfrm() blob over and above the concerns we have for sort support. You don't just have to worry about the stability of the behavior of collations -- you also have to worry about the stability of the actual blobs on disk, and the possibility that a technical enhancement to the collator algorithm could break things that wouldn't break sort support. ICU has facilities for this (see the third sketch below), and the ICU documentation talks about storing strxfrm()-style blobs on disk for a long time, but the details would need to be worked out.

[1] postgr.es/m/CAH2-WzmW9LjTfzp7uY4CHsF3NH0YM-_pow3LXRh18ORLgeu48A@mail.gmail.com

--
Peter Geoghegan
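To make the lp_len idea concrete, here is a minimal, self-contained sketch (not the POC patch) of how a 15-bit abbreviated key stored in an ItemId's lp_len field could let a binary search resolve most comparisons while touching only the ItemId array. The ItemIdData layout matches the real line pointer struct; everything else -- bt_binsrch_abbrev(), bt_compare_full(), the sample values -- is hypothetical, and the search shown is a plain exact-match binary search rather than _bt_binsrch()'s actual boundary search.

```c
/*
 * Sketch only: on internal pages, lp_len holds a 15-bit abbreviated key
 * instead of the tuple length, so the hot loop mostly avoids dereferencing
 * lp_off and pulling index tuples into cache.
 */
#include <stdint.h>
#include <stdio.h>

typedef struct ItemIdData
{
	unsigned	lp_off:15,	/* offset of tuple within page */
				lp_flags:2,	/* line pointer state */
				lp_len:15;	/* reused here: 15-bit abbreviated key */
} ItemIdData;

/*
 * Placeholder for the authoritative comparison, which would follow lp_off
 * to the IndexTuple and compare the full key.  Only reached on abbreviated
 * key ties.
 */
static int
bt_compare_full(const ItemIdData *itemid, uint16_t scankey_full)
{
	(void) itemid;
	(void) scankey_full;
	return 0;				/* pretend the full keys compare as equal */
}

/*
 * Binary search over the ItemId array alone, comparing 15-bit abbreviated
 * keys as plain unsigned integers (the "totally generic" comparison).
 */
static int
bt_binsrch_abbrev(const ItemIdData *items, int nitems, uint16_t scankey_abbrev)
{
	int			low = 0;
	int			high = nitems - 1;

	while (low <= high)
	{
		int			mid = low + (high - low) / 2;
		uint16_t	item_abbrev = items[mid].lp_len;
		int			cmp;

		if (scankey_abbrev < item_abbrev)
			cmp = -1;
		else if (scankey_abbrev > item_abbrev)
			cmp = 1;
		else
			cmp = bt_compare_full(&items[mid], scankey_abbrev);

		if (cmp == 0)
			return mid;
		if (cmp < 0)
			high = mid - 1;
		else
			low = mid + 1;
	}
	return -(low + 1);		/* not found; insertion point encoded */
}

int
main(void)
{
	/* Pretend internal-page ItemId array; lp_len holds abbreviated keys */
	ItemIdData	items[4] = {
		{.lp_off = 100, .lp_flags = 1, .lp_len = 0x0102},
		{.lp_off = 200, .lp_flags = 1, .lp_len = 0x2334},
		{.lp_off = 300, .lp_flags = 1, .lp_len = 0x51a0},
		{.lp_off = 400, .lp_flags = 1, .lp_len = 0x7fff},
	};

	printf("found at %d\n", bt_binsrch_abbrev(items, 4, 0x51a0));
	return 0;
}
```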
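On the encoding question, here is one way a totally generic 15-bit encoding for text could look, assuming a strxfrm()-style binary-comparable blob is available. The specific packing (first two blob bytes with the low-order bit dropped) is only illustrative; the property that matters is that unsigned comparisons of the abbreviated keys can never contradict the full ordering, only tie more often. abbrev_text_15bit() is a made-up name, not anything in the tree.

```c
/*
 * Sketch only: derive a 15-bit abbreviated key for text from a strxfrm()
 * blob, which by definition orders correctly under byte-wise comparison.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>

static uint16_t
abbrev_text_15bit(const char *value)
{
	size_t		needed = strxfrm(NULL, value, 0);
	char	   *blob = malloc(needed + 1);
	unsigned char b0,
				b1;

	if (blob == NULL)
		return 0;
	strxfrm(blob, value, needed + 1);

	b0 = needed > 0 ? (unsigned char) blob[0] : 0;
	b1 = needed > 1 ? (unsigned char) blob[1] : 0;
	free(blob);

	/* Pack 16 bits, then drop the low bit to fit the 15 bits of lp_len */
	return (uint16_t) (((b0 << 8) | b1) >> 1);
}

int
main(void)
{
	setlocale(LC_COLLATE, "");	/* use a real collation if one is set */

	printf("%04x %04x %04x\n",
		   abbrev_text_15bit("apple"),
		   abbrev_text_15bit("banana"),
		   abbrev_text_15bit("cherry"));
	return 0;
}
```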
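As for ICU's facilities: the sketch below is plain ICU C API usage (nothing PostgreSQL-specific) showing the two pieces that seem relevant -- ucol_getSortKey() to produce the blob, and ucol_getVersion() to produce a collator version that could be stored alongside any persisted blobs and rechecked later. It should build against the usual icuuc/icui18n libraries; how we would actually track versions on disk is exactly the part that would need to be worked out.

```c
/*
 * Sketch only: generate an ICU sort key and the collator version one would
 * want to persist alongside any sort-key blobs stored on disk.
 */
#include <stdint.h>
#include <stdio.h>
#include <unicode/ucol.h>
#include <unicode/ustring.h>

int
main(void)
{
	UErrorCode	status = U_ZERO_ERROR;
	UCollator  *coll = ucol_open("en_US", &status);
	UChar		ustr[64];
	uint8_t		sortkey[128];
	UVersionInfo version;
	char		versionstr[U_MAX_VERSION_STRING_LENGTH];
	int32_t		keylen;

	if (U_FAILURE(status))
	{
		fprintf(stderr, "ucol_open: %s\n", u_errorName(status));
		return 1;
	}

	u_strFromUTF8(ustr, 64, NULL, "abbreviated keys", -1, &status);
	if (U_FAILURE(status))
	{
		fprintf(stderr, "u_strFromUTF8: %s\n", u_errorName(status));
		ucol_close(coll);
		return 1;
	}

	keylen = ucol_getSortKey(coll, ustr, -1, sortkey, sizeof(sortkey));

	/* The collator version would be stored with any persisted blobs */
	ucol_getVersion(coll, version);
	u_versionToString(version, versionstr);

	printf("sort key is %d bytes, collator version %s\n", keylen, versionstr);
	printf("first two key bytes: 0x%02x 0x%02x\n", sortkey[0], sortkey[1]);

	ucol_close(coll);
	return 0;
}
```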