Thread: Do we actually need an ItemId array on nbtree pages containing fixed-width tuples?
Do we actually need an ItemId array on nbtree pages containing fixed-width tuples?
From
Peter Geoghegan
Date:
I've been working on adding a new feature to pg_hexedit [1] to allow it to work with GIN indexes. This led to an idea about how we could do better within nbtree, which I will now describe.

I noticed that GIN's internal posting tree pages (B-Trees over TIDs) do away with the need for an ItemId array, despite more or less working in the same way as classic B-Tree pages (posting tree *leaf* pages are a bit weird OTOH, but that's not important). This is possible because internal posting list pages store fixed-width PostingItem structs instead of IndexTuples. A posting tree only needs to work with TID key values -- not values of some arbitrary data type, which could be a variable-width type. This raises the question: since we really just use ItemIds to deal with variable-width tuples, can we do away with the need for an ItemId array at the start of nbtree pages in some important, common cases?

Applicability
=============

When an index has a single attribute, and we know that its IndexTuples will always be 16 bytes (on 64-bit systems), ISTM that it should be possible to not have any ItemId array. Simple pointer arithmetic on the page, based on a tuple width that is known at compile time, can be used instead -- no need to go off of an lp_off value. See GinDataPageGetPostingItem() for an example of this; a rough sketch of an nbtree equivalent appears after the list of advantages below.

All ItemId metadata becomes redundant, and we add IndexTuples from the start of the page, just after the page header, and continue to add them until the special area at the end of the page is reached (again, see the GIN posting tree code). Note that ItemId is already quasi-redundant within indexes. In general, nbtree indexes only actually need the LP_DEAD bit, as well as lp_off. (lp_len is simply redundant, even for variable-width index tuples [2].)

Advantages
==========

Before I go into more depth on implementation issues, let me first list the ways in which I'd expect this to improve performance:

* We currently fit 367 IndexTuples on to a leaf page of a single-int4-or-int8-attribute nbtree index (with the default leaf fillfactor and BLCKSZ) after many SERIAL/BIGSERIAL style sequential inserts (workloads where all leaf page splits are 10:90 splits of the rightmost page). If we removed the ItemId array, we'd be able to fit 458 IndexTuples on the left half post-split (while still leaving the same amount of free space on the page as before). IOW, we'd make indexes that avail of the optimization about 20% smaller, thereby reducing random I/O and so on.

The benefits for internal pages would also be significant, because we win in at least 3 places:

* The first win comes from the fact that we can store more IndexTuples (downlinks) now, because the optimization works just as well for internal pages as it does for leaf pages. This much is obvious.

* The second win comes from the fact that there are now about 20% fewer downlinks/internal tuples needed in the first place, because that directly corresponds to the number of pages one level down. The advantages *compound*; our savings in the total number of internal pages are close to 40% (it's less than a full 40% because of the 30:70 splits we do on internal pages).

* The third win comes from the fact that we will make more efficient use of CPU cache within internal pages, by packing IndexTuples together. Binary searches with many cache misses are something that has been identified as a bottleneck for some workloads [3].
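To give a feel for the kind of pointer arithmetic I have in mind, here is a rough sketch in the style of GinDataPageGetPostingItem(). None of these names exist today, and the choice to track the end of the packed tuple area in pd_lower is just one possible convention:

#include "postgres.h"
#include "access/itup.h"
#include "storage/bufpage.h"

/* hypothetical: MAXALIGN'd IndexTuple with a single int4/int8 attribute */
#define BTFixedTupleWidth		16

/*
 * Tuples are packed contiguously, starting right after the page header,
 * so the Nth tuple is found with arithmetic alone -- no lp_off lookup.
 */
#define BTFixedPageGetTuple(page, n) \
	((IndexTuple) (PageGetContents(page) + ((n) - 1) * BTFixedTupleWidth))

/*
 * Assuming pd_lower is maintained as the end of the packed tuple area,
 * the number of tuples on the page falls out of it directly.
 */
#define BTFixedPageGetMaxOffsetNumber(page) \
	((((PageHeader) (page))->pd_lower - MAXALIGN(SizeOfPageHeaderData)) / \
	 BTFixedTupleWidth)

Compared to today's PageGetItemId()/PageGetItem() hops, a binary search over a page laid out like this touches one contiguous array, which is where the CPU cache win in the third bullet above comes from.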
I haven't bothered to apply the classic "2/3 of a page" utilization rule for my free space accounting in this example, just to keep things simple. (That rule applies to workloads with random insertions.)

The general approach here should be to keep things simple by changing the physical layout of B-Tree pages (where the optimization applies) without changing their logical structure. So, very few changes ought to be required within amcheck to make it work with this, and they should all be fairly mechanical. Similarly, nothing would change about the interface of the nbtree pageinspect functions. We never make any IndexTuple bigger, so we can mix the old and new formats when pg_upgrade is used without needing to be very defensive all over the place.

Issues
======

I'm not planning to work on this myself, so I encourage other contributors to think about it. These tricky issues will need to be worked through:

* In the case of GIN, routines like GinDataPageAddPostingItem() are used instead of the generic PageAddItem() routine. This optimization requires similar new code that performs page modifications that must care about page space management (code that must memmove() ranges of tuples to make space for new tuples, and so on). The implementation should probably generalize its approach, so that this optimization could in principle be used by other index AMs, such as the hash AM. The PageAddItem() code (and similar bufpage.c code) that is now used by nbtree would only be used in pages that still need an ItemId array. The new fixed-width-aware code can be maintained alongside its variable-width equivalent, within bufpage.c.

* A lot of the code in nbtinsert.c will need to be adjusted, most notably the page split point picking logic.

* The LP_DEAD/kill_prior_tuple optimization is very valuable, and definitely needs to be totally preserved. I think that we can do this by using the free bit within IndexTupleData (the 13th IndexTuple.t_info bit) instead of LP_DEAD. We'd need to develop alternatives to the ItemId-specific macros that manage the optimization to make it work. This seems doable.

* The representation of NULL values will need some work. It will need to take account of the fact that there cannot be a NULL bitmap when the optimization is used, since IndexTuples must always be the same width. We could represent an optimized IndexTuple as having a t_info-wise length of 0xFFF, a magic value. This would allow us to do away with a NULL bitmap (INDEX_NULL_MASK becomes "only attribute is NULL"), and to readily mix old and new formats in most nbtree code. We can and should represent that pages are in the new format by using a new special area flag bit, too (a BTPageOpaqueData.btpo_flags bit). A rough sketch of these bits appears below.

* For similar reasons, we'd need to pad out the "minus infinity" item (the first "real" item on nbtree internal pages) to always be 16 bytes (it's 8 bytes now). That's easy, though.

Thoughts?

[1] https://github.com/petergeoghegan/pg_hexedit
[2] https://postgr.es/m/CAH2-Wz=mV4dmOaPFicRSyNtv2KinxEOtBwUY5R7fXXOC-OearA@mail.gmail.com
[3] https://www.postgresql.org/message-id/CANP8%2BjLf6ozqc2ie30%2B--h6oxFUL48jmjQXPS5dpiTFtvtYPYQ%40mail.gmail.com

--
Peter Geoghegan
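To make the t_info and btpo_flags ideas above a bit more concrete, here is a hypothetical sketch of how the bits could be laid out. Only INDEX_SIZE_MASK, IndexTupleSize() and BTPageOpaqueData exist today; every other name and value here is made up for illustration:

#include "postgres.h"
#include "access/itup.h"
#include "access/nbtree.h"

/*
 * Hypothetical: the currently-unused 13th t_info bit takes over the role
 * that LP_DEAD plays today, so kill_prior_tuple survives the loss of the
 * ItemId array.
 */
#define BT_FIXED_DEAD_MASK		0x2000

#define BTFixedTupleIsDead(itup)	(((itup)->t_info & BT_FIXED_DEAD_MASK) != 0)
#define BTFixedTupleMarkDead(itup)	((itup)->t_info |= BT_FIXED_DEAD_MASK)

/*
 * Hypothetical: a magic t_info length meaning "the single attribute is
 * NULL", standing in for the NULL bitmap that fixed-width tuples cannot
 * carry.  0xFFF can never be the length of a 16-byte fixed-width tuple,
 * so it is free to act as a marker on pages in the new format.
 */
#define BT_FIXED_NULL_LENGTH	0xFFF

#define BTFixedTupleIsNull(itup)	(IndexTupleSize(itup) == BT_FIXED_NULL_LENGTH)

/*
 * Hypothetical: a new special-area flag marking pages that use the
 * fixed-width layout, so old- and new-format pages can coexist after
 * pg_upgrade.
 */
#define BTP_FIXED_WIDTH			(1 << 8)

#define BTPageIsFixedWidth(opaque)	(((opaque)->btpo_flags & BTP_FIXED_WIDTH) != 0)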
Re: Do we actually need an ItemId array on nbtree pages containing fixed-width tuples?
From
Andrey Borodin
Date:
Hi, Peter!

> On 4 Dec 2017, at 4:55, Peter Geoghegan <pg@bowt.ie> wrote:
> Thoughts?

I like the idea of a more compact B-tree. Chances are that I didn't understand all your ideas.

But ItemIds let you insert a tuple between two existing tuples without data movement: the new tuple is placed wherever free space starts, and you just shift bytes in the ItemId array. And you always have to insert a tuple at a specific position, since the B-tree relies on tuple order.

Best regards, Andrey Borodin.
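For reference, an ordered insert with an ItemId array today looks roughly like this (a simplified sketch, loosely modelled on PageAddItem() in bufpage.c; sanity checks, free space checks and unused-slot recycling are omitted, and the function name is made up):

#include "postgres.h"
#include "storage/bufpage.h"
#include "storage/itemid.h"
#include "storage/off.h"

static void
page_add_item_simplified(Page page, Item item, Size size, OffsetNumber offnum)
{
	PageHeader	phdr = (PageHeader) page;
	OffsetNumber limit = OffsetNumberNext(PageGetMaxOffsetNumber(page));
	ItemId		itemId = PageGetItemId(page, offnum);
	Size		alignedSize = MAXALIGN(size);
	int			upper;

	/* only the 4-byte line pointers at or after offnum are shifted */
	memmove(itemId + 1, itemId, (limit - offnum) * sizeof(ItemIdData));

	/*
	 * The tuple body itself goes at the top of the free space (pd_upper
	 * grows downward); its physical position never has to match key order.
	 */
	upper = (int) phdr->pd_upper - (int) alignedSize;
	memcpy((char *) page + upper, item, size);
	ItemIdSetNormal(itemId, upper, size);

	phdr->pd_lower += sizeof(ItemIdData);
	phdr->pd_upper = (LocationIndex) upper;
}

The point Andrey raises is the memmove() line: today it only ever moves 4-byte line pointers, never the tuples themselves.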
Re: Do we actually need an ItemId array on nbtree pages containing fixed-width tuples?
From
Peter Geoghegan
Date:
Hi Andrey,

On Sun, Dec 3, 2017 at 10:11 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> I like the idea of a more compact B-tree. Chances are that I didn't understand all your ideas.
>
> But ItemIds let you insert a tuple between two existing tuples without data movement: the new tuple is placed wherever free space starts, and you just shift bytes in the ItemId array. And you always have to insert a tuple at a specific position, since the B-tree relies on tuple order.

It's certainly true that the need to memmove() more data is a new cost to be paid under my proposal: a simple random insertion must move almost 50% of the entire page on average, rather than almost 10% on average, as is the case today (see the sketch below). I think that this would probably be acceptable.

We really only pay this extra cost during random writes into the index (sequential writes will still just append within a page). Random inserts are already much more expensive than sequential int4/int8 index inserts, because of the extra FPIs emitted for full_page_writes, the fact that more pages are dirtied, etc.

It's not that I think that random insertions are rare. I think that they're a lot rarer with simple SERIAL/BIGSERIAL primary keys on fact tables, which are what this optimization is really about. The approach might actually be faster for a workload that consists only of random insertions into a table with a SERIAL/BIGSERIAL primary key -- the "worst case". I'm less confident about that, but it seems very possible.

--
Peter Geoghegan
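For comparison with the PageAddItem()-style sketch upthread, here is roughly what an ordered insert into a packed, fixed-width tuple area might look like. As before, BTFixedTupleWidth, the function name, and the convention of tracking the end of the tuple area in pd_lower are assumptions for illustration only:

#include "postgres.h"
#include "access/itup.h"
#include "storage/bufpage.h"

#define BTFixedTupleWidth	16		/* hypothetical, as in the earlier sketch */

/*
 * Hypothetical sketch: insert a fixed-width tuple at a given ordered
 * position on a page that has no ItemId array.
 */
static void
bt_fixed_page_add_tuple(Page page, IndexTuple itup, OffsetNumber offnum)
{
	PageHeader	phdr = (PageHeader) page;
	char	   *insertPos = PageGetContents(page) +
							(offnum - 1) * BTFixedTupleWidth;
	Size		bytesAfter = phdr->pd_lower - (insertPos - (char *) page);

	/* free space is assumed to run from pd_lower up to the special area */
	Assert(phdr->pd_special - phdr->pd_lower >= BTFixedTupleWidth);

	/*
	 * Unlike the ItemId-array case, every tuple at or after the target
	 * offset must be shifted: for a random insertion that is about half of
	 * the tuple area (roughly 4KB with BLCKSZ = 8192), versus, on average,
	 * well under 1KB of line pointers today.
	 */
	memmove(insertPos + BTFixedTupleWidth, insertPos, bytesAfter);
	memcpy(insertPos, itup, BTFixedTupleWidth);

	phdr->pd_lower += BTFixedTupleWidth;
}

Hypothetically, something like this would be used in place of PageAddItem() whenever a check along the lines of the BTPageIsFixedWidth() sketch earlier says the page has the new layout.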