Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Msg-id CAPpHfdtAjvfMcNp6h=JcNwv8rpMcuCXXq9NOJ6PYX-Jdxf0OJA@mail.gmail.com
In response to Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
Hi Peter,

Thank you very much for your attention to this patch.  Let me comment
on some points of your message.

On Sun, Jul 7, 2019 at 2:09 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
> > The new version of the patch is attached.
> > This version is even simpler than the previous one,
> > thanks to the recent btree design changes and all the feedback I received.
> > I consider it ready for review and testing.
>
> I took a closer look at this patch, and have some general thoughts on
> its design, and specific feedback on the implementation.
>
> Preserving the *logical contents* of B-Tree indexes that use
> compression is very important -- that should not change in a way that
> outside code can notice. The heap TID itself should count as logical
> contents here, since we want to be able to implement retail index
> tuple deletion in the future. Even without retail index tuple
> deletion, amcheck's "rootdescend" verification assumes that it can
> find one specific tuple (which could now just be one specific "logical
> tuple") using specific key values from the heap, including the heap
> tuple's heap TID. This requirement makes things a bit harder for your
> patch, because you have to deal with one or two edge-cases that you
> currently don't handle: insertion of new duplicates that fall inside
> the min/max range of some existing posting list. That should be rare
> enough in practice, so the performance penalty won't be too bad. This
> probably means that code within _bt_findinsertloc() and/or
> _bt_binsrch_insert() will need to think about a logical tuple as a
> distinct thing from a physical tuple, though that won't be necessary
> in most places.

Could you please elaborate more on preserving the logical contents?  I
understand it as follows: "the B-tree should have the same structure
and invariants as if each TID in a posting list were a separate
tuple".  So, if we imagine each TID becoming a separate tuple, we
would get the same B-tree, which just manages to store more tuples per
page.  Is my understanding correct?  But outside code will still
notice changes as soon as it accesses B-tree pages directly (as
contrib/amcheck does).  Do you mean we need an API for accessing
logical B-tree tuples, or something like that?
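
To make my reading concrete, here is a minimal standalone sketch (my
own illustration with made-up struct names, not the patch's actual
data structures) of one posting list standing for several logical
tuples:

/*
 * A posting list here is one physical tuple holding a key plus a
 * sorted array of heap TIDs, but for ordering and invariants it is
 * meant to behave as if each (key, TID) pair were a separate tuple.
 */
#include <stdio.h>

typedef struct HypotheticalTID
{
    unsigned int   block;       /* heap block number */
    unsigned short offset;      /* offset within the block */
} HypotheticalTID;

typedef struct HypotheticalPostingList
{
    int             key;        /* indexed value shared by all TIDs */
    int             ntids;
    HypotheticalTID tids[8];    /* TIDs kept in ascending order */
} HypotheticalPostingList;

int
main(void)
{
    /* One physical tuple ... */
    HypotheticalPostingList plist = {
        .key = 42,
        .ntids = 3,
        .tids = {{10, 1}, {10, 7}, {11, 2}}
    };

    /* ... standing for ntids logical tuples with the same key. */
    for (int i = 0; i < plist.ntids; i++)
        printf("logical tuple: key=%d tid=(%u,%u)\n",
               plist.key, plist.tids[i].block, plist.tids[i].offset);

    return 0;
}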

> The need to "preserve the logical contents" also means that the patch
> will need to recognize when indexes are not safe as a target for
> compression/deduplication (maybe we should call this feature
> deduplication, so it's clear how it differs from TOAST?). For
> example, if we have a case-insensitive ICU collation, then it is not
> okay to treat an opclass-equal pair of text strings that use the
> collation as having the same value when considering merging the two
> into one. You don't actually do that in the patch, but you also don't
> try to deal with the fact that such a pair of strings are equal, and
> so must have their final positions determined by the heap TID column
> (deduplication within _bt_compress_one_page() must respect that).
> Possibly equal-but-distinct values seems like a problem that's not
> worth truly fixing, but it will be necessary to store metadata about
> whether or not we're willing to do deduplication in the meta page,
> based on operator class and collation details. That seems like a
> restriction that we're just going to have to accept, though I'm not
> too worried about exactly what that will look like right now. We can
> work it out later.

I think that in order to deduplicate "equal but distinct" values we
would at least have to give up index-only scans, because there is no
guarantee that values which are equal according to the B-tree opclass
are identical for other operations and/or user output.
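
For illustration (my own toy example, not taken from the patch), two
values can compare equal under a case-insensitive comparison while
their bytes differ, so whichever copy a deduplicating index keeps may
be the "wrong" one to return from an index-only scan:

#include <stdio.h>
#include <string.h>
#include <strings.h>    /* strcasecmp (POSIX) */

int
main(void)
{
    const char *a = "Postgres";
    const char *b = "POSTGRES";

    /* Equal under a case-insensitive "opclass" ... */
    printf("case-insensitive compare: %d\n", strcasecmp(a, b));

    /* ... but not the same datum as far as output is concerned. */
    printf("byte-wise compare: %d\n", strcmp(a, b));

    return 0;
}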

> I think that the need to be careful about the logical contents of
> indexes already causes bugs, even with "safe for compression" indexes.
> For example, I can sometimes see an assertion failure
> within _bt_truncate(), at the point where we check if heap TID values
> are safe:
>
>     /*
>      * Lehman and Yao require that the downlink to the right page, which is to
>      * be inserted into the parent page in the second phase of a page split be
>      * a strict lower bound on items on the right page, and a non-strict upper
>      * bound for items on the left page.  Assert that heap TIDs follow these
>      * invariants, since a heap TID value is apparently needed as a
>      * tiebreaker.
>      */
> #ifndef DEBUG_NO_TRUNCATE
>     Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
>                               BTreeTupleGetMinTID(firstright)) < 0);
> ...
>
> This bug is not that easy to see, but it will happen with a big index,
> even without updates or deletes. I think that this happens because
> compression can allow the "logical tuples" to be in the wrong heap TID
> order when there are multiple posting lists for the same value. As I
> said, I think that it's necessary to see a posting list as being
> comprised of multiple logical tuples in the context of inserting new
> tuples, even when you're not performing compression or splitting the
> page. I also see that amcheck's bt_index_parent_check() function
> fails, though bt_index_check() does not fail when I don't use any of
> its extra verification options. (You haven't updated amcheck, but I
> don't think that you need to update it for these basic checks to
> continue to work.)

Do I understand correctly that the current patch may produce posting
lists for the same value with overlapping TID ranges?  If so, that is
definitely wrong.
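
Just to spell out the invariant I have in mind, here is a standalone
sketch (made-up structs, not the patch's code) of the non-overlapping
TID-range check that the quoted Assert enforces for two posting lists
of the same value:

#include <stdio.h>

typedef struct HypotheticalTID
{
    unsigned int   block;
    unsigned short offset;
} HypotheticalTID;

/* Compare TIDs the way ItemPointerCompare does: block first, then offset. */
static int
tid_compare(HypotheticalTID a, HypotheticalTID b)
{
    if (a.block != b.block)
        return (a.block < b.block) ? -1 : 1;
    if (a.offset != b.offset)
        return (a.offset < b.offset) ? -1 : 1;
    return 0;
}

int
main(void)
{
    /* max TID of the left posting list, min TID of the right one */
    HypotheticalTID lastleft_max   = {10, 7};
    HypotheticalTID firstright_min = {10, 5};

    /* Overlapping ranges like these are what must never be produced. */
    if (tid_compare(lastleft_max, firstright_min) >= 0)
        printf("invariant violated: posting lists overlap\n");

    return 0;
}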

> * Maybe we could do compression with unique indexes when inserting
> values with NULLs? Note that we now treat an insertion of a tuple with
> NULLs into a unique index as if it wasn't even a unique index -- see
> the "checkingunique" optimization at the beginning of _bt_doinsert().
> Having many NULL values in a unique index is probably fairly common.

I think unique indexes may benefit from deduplication not only because
of NULL values.  Non-HOT updates produce duplicates of non-NULL values
in unique indexes, and those duplicates can take a significant amount
of space.


------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


