Re: [WIP] Effective storage of duplicates in B-tree index. - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: [WIP] Effective storage of duplicates in B-tree index. |
Date | |
Msg-id | CAM3SWZQ3_PLQCH4w7uQ8q_f2t4HEseKTr2n0rQ5pxA18OeRTJw@mail.gmail.com Whole thread Raw |
In response to | Re: [WIP] Effective storage of duplicates in B-tree index. (Anastasia Lubennikova <a.lubennikova@postgrespro.ru>) |
Responses |
Re: [WIP] Effective storage of duplicates in B-tree index.
|
List | pgsql-hackers |
On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > I fixed it in the new version (attached). Some quick remarks on your V2.0: * Seems unnecessary that _bt_binsrch() is passed a real pointer by all callers. Maybe the one current posting list caller _bt_findinsertloc(), or its caller, _bt_doinsert(), should do this work itself: @@ -373,7 +377,17 @@ _bt_binsrch(Relation rel, * scan key), which could be the last slot + 1. */ if (P_ISLEAF(opaque)) + { + if (low <= PageGetMaxOffsetNumber(page)) + { + IndexTuple oitup = (IndexTuple) PageGetItem(page, PageGetItemId(page, low)); + /* one excessive check of equality. for possible posting tuple update or creation */ + if ((_bt_compare(rel, keysz, scankey, page, low) == 0) + && (IndexTupleSize(oitup) + sizeof(ItemPointerData) < BTMaxItemSize(page))) + *updposing = true; + } return low; + } * ISTM that you should not use _bt_compare() above, in any case. Consider this: postgres=# select 5.0 = 5.000;?column? ──────────t (1 row) B-Tree operator class indicates equality here. And yet, users will expect to see the original value in an index-only scan, including the trailing zeroes as they were originally input. So this should be a bit closer to HeapSatisfiesHOTandKeyUpdate() (actually, heap_tuple_attr_equals()), which looks for strict binary equality for similar reasons. * Is this correct?: @@ -555,7 +662,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup) * it off the old page, not the new one, in case we are not at leaf * level. */ - state->btps_minkey = CopyIndexTuple(oitup); + ItemId iihk = PageGetItemId(opage, P_HIKEY); + IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk); + state->btps_minkey = CopyIndexTuple(hikey); How this code has changed from the master branch is not clear to me. I understand that this code in incomplete/draft: +#define MaxPackedIndexTuplesPerPage \ + ((int) ((BLCKSZ - SizeOfPageHeaderData) / \ + (sizeof(ItemPointerData)))) But why is it different to the old (actually unchanged) MaxIndexTuplesPerPage? I would like to see comments explaining your understanding, even if they are quite rough. Why did GIN never require this change to a generic header (itup.h)? Should such a change live in that generic header file, and not another one more localized to nbtree? * More explanation of the design would be nice. I suggest modifying the nbtree README file, so it's easy to tell what the current design is. It's hard to follow this from the thread. When I reviewed Heikki's B-Tree patches from a couple of years ago, we spent ~75% of the time on design, and only ~25% on code. * I have a paranoid feeling that the deletion locking protocol (VACUUMing index tuples concurrently and safely) may need special consideration here. Basically, with the B-Tree code, there are several complicated locking protocols, like for page splits, page deletion, and interlocking with vacuum ("super exclusive lock" stuff). These are why the B-Tree code is complicated in general, and it's very important to pin down exactly how we deal with each. Ideally, you'd have an explanation for why your code was correct in each of these existing cases (especially deletion). With very complicated and important code like this, it's often wise to be very clear about when we are talking about your design, and when we are talking about your code. It's generally too hard to review both at the same time. Ideally, when you talk about your design, you'll be able to say things like "it's clear that this existing thing is correct; at least we have no complaints from the field. Therefore, it must be true that my new technique is also correct, because it makes that general situation no worse". Obviously that kind of rigor is just something we aspire to, and still fall short of at times. Still, it would be nice to specifically see a reason why the new code isn't special from the point of view of the super-exclusive lock thing (which is what I mean by deletion locking protocol + special consideration). Or why it is special, but that's okay, or whatever. This style of review is normal when writing B-Tree code. Some other things don't need this rigor, or have no invariants that need to be respected/used. Maybe this is obvious to you already, but it isn't obvious to me. It's okay if you don't know why, but knowing that you don't have a strong opinion about something is itself useful information. * I see you disabled the LP_DEAD thing; why? Just because that made bugs go away? * Have you done much stress testing? Using pgbench with many concurrent VACUUM FREEZE operations would be a good idea, if you haven't already, because that is insistent about getting super exclusive locks, unlike regular VACUUM. * Are you keeping the restriction of 1/3 of a buffer page, but that just includes the posting list now? That's the kind of detail I'd like to see in the README now. * Why not support unique indexes? The obvious answer is that it isn't worth it, but why? How useful would that be (a bit, just not enough)? What's the trade-off? Anyway, this is really cool work; I have often thought that we don't have nearly enough people thinking about how to optimize B-Tree indexing. It is hard, but so is anything worthwhile. That's all I have for now. Just a quick review focused on code and correctness (and not on the benefits). I want to do more on this, especially the benefits, because it deserves more attention. -- Peter Geoghegan
pgsql-hackers by date: