Re: [WIP] Effective storage of duplicates in B-tree index. - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: [WIP] Effective storage of duplicates in B-tree index.
Date
Msg-id CAM3SWZQ3_PLQCH4w7uQ8q_f2t4HEseKTr2n0rQ5pxA18OeRTJw@mail.gmail.com
Whole thread Raw
In response to Re: [WIP] Effective storage of duplicates in B-tree index.  (Anastasia Lubennikova <a.lubennikova@postgrespro.ru>)
Responses Re: [WIP] Effective storage of duplicates in B-tree index.
List pgsql-hackers
On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> I fixed it in the new version (attached).

Some quick remarks on your V2.0:

* Seems unnecessary that _bt_binsrch() is passed a real pointer by all
callers. Maybe the one current posting list caller
_bt_findinsertloc(), or its caller, _bt_doinsert(), should do this
work itself:

@@ -373,7 +377,17 @@ _bt_binsrch(Relation rel,    * scan key), which could be the last slot + 1.    */   if
(P_ISLEAF(opaque))
+   {
+       if (low <= PageGetMaxOffsetNumber(page))
+       {
+           IndexTuple oitup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, low));
+           /* one excessive check of equality. for possible posting
tuple update or creation */
+           if ((_bt_compare(rel, keysz, scankey, page, low) == 0)
+               && (IndexTupleSize(oitup) + sizeof(ItemPointerData) <
BTMaxItemSize(page)))
+               *updposing = true;
+       }       return low;
+   }

* ISTM that you should not use _bt_compare() above, in any case. Consider this:

postgres=# select 5.0 = 5.000;?column?
──────────t
(1 row)

B-Tree operator class indicates equality here. And yet, users will
expect to see the original value in an index-only scan, including the
trailing zeroes as they were originally input. So this should be a bit
closer to HeapSatisfiesHOTandKeyUpdate() (actually,
heap_tuple_attr_equals()), which looks for strict binary equality for
similar reasons.

* Is this correct?:

@@ -555,7 +662,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState
*state, IndexTuple itup)        * it off the old page, not the new one, in case we are not at leaf        * level.
 */ 
-       state->btps_minkey = CopyIndexTuple(oitup);
+       ItemId iihk = PageGetItemId(opage, P_HIKEY);
+       IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+       state->btps_minkey = CopyIndexTuple(hikey);

How this code has changed from the master branch is not clear to me.

I understand that this code in incomplete/draft:

+#define MaxPackedIndexTuplesPerPage    \
+   ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+           (sizeof(ItemPointerData))))

But why is it different to the old (actually unchanged)
MaxIndexTuplesPerPage? I would like to see comments explaining your
understanding, even if they are quite rough. Why did GIN never require
this change to a generic header (itup.h)? Should such a change live in
that generic header file, and not another one more localized to
nbtree?

* More explanation of the design would be nice. I suggest modifying
the nbtree README file, so it's easy to tell what the current design
is. It's hard to follow this from the thread. When I reviewed Heikki's
B-Tree patches from a couple of years ago, we spent ~75% of the time
on design, and only ~25% on code.

* I have a paranoid feeling that the deletion locking protocol
(VACUUMing index tuples concurrently and safely) may need special
consideration here. Basically, with the B-Tree code, there are several
complicated locking protocols, like for page splits, page deletion,
and interlocking with vacuum ("super exclusive lock" stuff). These are
why the B-Tree code is complicated in general, and it's very important
to pin down exactly how we deal with each. Ideally, you'd have an
explanation for why your code was correct in each of these existing
cases (especially deletion). With very complicated and important code
like this, it's often wise to be very clear about when we are talking
about your design, and when we are talking about your code. It's
generally too hard to review both at the same time.

Ideally, when you talk about your design, you'll be able to say things
like "it's clear that this existing thing is correct; at least we have
no complaints from the field. Therefore, it must be true that my new
technique is also correct, because it makes that general situation no
worse". Obviously that kind of rigor is just something we aspire to,
and still fall short of at times. Still, it would be nice to
specifically see a reason why the new code isn't special from the
point of view of the super-exclusive lock thing (which is what I mean
by deletion locking protocol + special consideration). Or why it is
special, but that's okay, or whatever. This style of review is normal
when writing B-Tree code. Some other things don't need this rigor, or
have no invariants that need to be respected/used. Maybe this is
obvious to you already, but it isn't obvious to me.

It's okay if you don't know why, but knowing that you don't have a
strong opinion about something is itself useful information.

* I see you disabled the LP_DEAD thing; why? Just because that made
bugs go away?

* Have you done much stress testing? Using pgbench with many
concurrent VACUUM FREEZE operations would be a good idea, if you
haven't already, because that is insistent about getting super
exclusive locks, unlike regular VACUUM.

* Are you keeping the restriction of 1/3 of a buffer page, but that
just includes the posting list now? That's the kind of detail I'd like
to see in the README now.

* Why not support unique indexes? The obvious answer is that it isn't
worth it, but why? How useful would that be (a bit, just not enough)?
What's the trade-off?

Anyway, this is really cool work; I have often thought that we don't
have nearly enough people thinking about how to optimize B-Tree
indexing. It is hard, but so is anything worthwhile.

That's all I have for now. Just a quick review focused on code and
correctness (and not on the benefits). I want to do more on this,
especially the benefits, because it deserves more attention.

--
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Ashutosh Bapat
Date:
Subject: Re: postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)
Next
From: Shubham Barai
Date:
Subject: UNIQUE capability to hash indexes