Re: WIP: Covering + unique indexes. - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: WIP: Covering + unique indexes.
Date
Msg-id CAH2-WzmNgNi8TtGqpgGQ2gku=Jwb6yXNcoQ1uy-jA2diyfTK6Q@mail.gmail.com
Whole thread Raw
In response to Re: WIP: Covering + unique indexes.  (Alexander Korotkov <a.korotkov@postgrespro.ru>)
Responses Re: WIP: Covering + unique indexes.
List pgsql-hackers
On Sat, Mar 24, 2018 at 12:39 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> +1, putting "nattributes" to argument of index_truncate_tuple() would
> make this function way more universal.

Great.

> Originally that code was written by Teodor, but I also put my hands there.
> In GIN entry tree, item pointers are stored in a posting list which is
> located
> after index tuple attributes.  So, both t_tid block number and offset are
> used for GIN needs.

Well, you worked on the posting list compression stuff, at least.  :-)

> That makes sense.  Let's store the number of tuple attributes to t_tid.
> Assuming that our INDEX_MAX_KEYS is quite small number, we will have
> higher bits of t_tid free for latter use.

I was going to say that you could just treat the low bit in the t_tid
offset as representing "see catalog entry". My first idea was that
nothing would have to change about the existing format, since internal
page items already have only the low bit set within their offset.
However, I now see that that won't really work, because we don't
change the offset in high keys when they're copied from a real item
during a page split. Whatever we do, it has to work equally well for
all "separator keys" -- that is, it must work for both downlinks in
internal pages, and all high keys (including high keys at the leaf
level).

A good solution is to use the unused 13th t_bit. If hash can have a
INDEX_MOVED_BY_SPLIT_MASK, then nbtree can have a INDEX_ALT_TID_MASK.
This avoids a BTREE_VERSION bump, and allows us to deal with the
highkey offset issue. Actually, it's even more flexible than that --
it can work with ordinary leaf tuples in the future, too. That is, we
can eventually implement prefix truncation and deduplication at the
leaf level using this representation, since there is nothing that
limits INDEX_ALT_TID_MASK IndexTuples to "separator keys".

The main difference between this approach to leaf prefix
truncation/compression/deduplication, and the GIN entry tree's posting
list representation would be that it wouldn't have to be
super-optimized for duplicates, at the expense of more common case for
regular nbtree indexes -- having few or no duplicates. A btree_gin
index on pgbench_accounts(aid) looks very similar to an equivalent
nbtree index if you just compare internal pages from each, but they
look quite different at the leaf level, where GIN has 24 byte
IndexTuples instead of 16 bytes IndexTuples. Of course, this is
because the leaf pages have posting lists that can never be simple
heap pointer TIDs.

A secondary goal of this INDEX_ALT_TID_MASK representation should be
that it won't even be necessary to know that an IndexTuple is
contained within a leaf page rather than an index page (again, unlike
GIN). I'm pretty confident that we can have a truly universal
IndexTuple representation for nbtree, while supporting all of these
standard optimizations.

Sorry for going off in a tangent, but I think it's somewhat necessary
to have a strategy here. Of course, we don't have to get everything
right now, but we should be looking in this direction whenever we talk
about on-disk nbtree changes.

-- 
Peter Geoghegan


pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Undesirable entries in typedefs list
Next
From: Andrew Dunstan
Date:
Subject: Re: Faster inserts with mostly-monotonically increasing values