Why B-Tree suffix truncation matters

From: Peter Geoghegan
I've been working on B-Tree suffix truncation, as part of a larger
effort to make all index tuples have unique keys by treating their
heap TID as a first class part of the key space, as part of an even
larger (and still ill-defined) effort to improve index vacuuming
(a.k.a. "retail index tuple deletion"). I posted some prototype
patches over on the "Making all nbtree entries unique by having heap
TIDs participate in comparisons", and have been working on this fairly
solidly for the past few weeks.

I'm starting this new thread to discuss the benefits of B-Tree suffix
truncation, and to demonstrate how effective it can be at increasing
fan-out and space utilization with a real-world example. I haven't
explained why suffix truncation matters very well before now. Now that
I have a patch that works, I can just show the benefit directly. (By
the way, there are similar benefits to covering INCLUDE indexes, where
suffix truncation occurs all the time, and not just
opportunistically.)

Definition
==========

To recap, suffix truncation is a classic optimization for B-Trees that
was first described by Bayer in 1977. We truncate away
non-distinguishing trailing attributes from pivot tuples to save space
in internal pages, improving fan-out. Pivot tuples are the index
tuples that don't point to the heap; they're only used to guide
searches to the correct leaf page. Since they only guide searches,
they only need the attributes up to and including the first attribute
that distinguished the two halves of a leaf page split at the time of
the original split. For example, if during a leaf page split the tuple
to the immediate left of the split point looks like this:

('USA', 'New Jersey', 'Wyckoff')

...and, if the tuple to the immediate right of the split point looked
like this:

('USA', 'New York', 'Albany')

...then we could get away with truncating to make the new left page's
high key (and right page's downlink) be represented as:

('USA', 'New York', <minus infinity>)

Where "<minus infinity>" is represented a little bit like a NULL
value, though with less overhead (we can use the same representation
as INCLUDE indexes do for pivot tuples). Don't make the mistake of
thinking of this as a form of compression -- it's actually a way of
avoiding storing something that we simply don't need, and never could
need, something that prematurely limits where tuples can be stored in
the index.
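The truncation rule can be sketched in a few lines of Python. This is
a conceptual illustration only, not the nbtree implementation; the
names truncate_pivot and MINUS_INFINITY are mine, with MINUS_INFINITY
standing in for the sentinel representation described above:

```python
# Conceptual sketch of attribute-level suffix truncation (not nbtree code).
# Given the last tuple on the new left page and the first tuple on the new
# right page, keep attributes up to and including the first one that
# differs; every remaining attribute becomes a "minus infinity" sentinel.

MINUS_INFINITY = object()  # stand-in for the minus-infinity sentinel

def truncate_pivot(left_tuple, right_tuple):
    pivot = []
    for l_attr, r_attr in zip(left_tuple, right_tuple):
        pivot.append(r_attr)
        if l_attr != r_attr:
            break  # first distinguishing attribute: stop here
    # Pad out the truncated-away attributes with the sentinel
    pivot.extend([MINUS_INFINITY] * (len(right_tuple) - len(pivot)))
    return tuple(pivot)

left = ('USA', 'New Jersey', 'Wyckoff')
right = ('USA', 'New York', 'Albany')
pivot = truncate_pivot(left, right)
# pivot is ('USA', 'New York', <minus infinity>)
```

The invariant to check is that left < pivot <= right holds under tuple
comparison with the sentinel ordering below everything, which is what
makes the truncated pivot a valid separator.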

Example
=======

It's well known that the number of internal pages in a B-Tree seldom
exceeds 1% of the total (2% would be considered hugely bloated). These
pages are what we're really interested in shrinking, so suffix
truncation might appear to be a micro-optimization of little value.
Who cares if I can shave 0.2% off the total size of the index? In
fact, the savings can be far greater than that in cases involving
*random* insertions. Here is a simple motivating example that uses the
UK land registry data [1]:

pg@~[31744]=# \dt+
                                List of relations
 Schema │            Name             │ Type  │ Owner │    Size    │ Description
────────┼─────────────────────────────┼───────┼───────┼────────────┼─────────────
 public │ land_registry_price_paid_uk │ table │ pg    │ 3005 MB │
(1 row)

Let's create a table with the same schema, and build a composite index
on (county, city, locality) against that table, for our random
insertions test:

pg@~[31744]=# create table land2 (like land_registry_price_paid_uk);
CREATE TABLE
pg@~[31744]=# create index composite on land2 (county, city, locality);
CREATE INDEX

The data we're about to insert looks like this:

pg@~[7002]=# select county, city, locality from
land_registry_price_paid_uk limit 15;
      county      │    city     │  locality
──────────────────┼─────────────┼─────────────
 LANCASHIRE       │ BURNLEY     │ BURNLEY
 SLOUGH           │ SLOUGH      │ SLOUGH
 EAST SUSSEX      │ RYE         │ RYE
 HAMPSHIRE        │ FARNBOROUGH │ FARNBOROUGH
 GREATER LONDON   │ LONDON      │ LONDON
 LINCOLNSHIRE     │ SKEGNESS    │ SKEGNESS
 WREKIN           │ TELFORD     │ TELFORD
 KENT             │ CANTERBURY  │ CANTERBURY
 GREATER LONDON   │ LONDON      │ LONDON
 GREATER LONDON   │ LONDON      │ LEYTON
 NORTHAMPTONSHIRE │ CORBY       │ CORBY
 SURREY           │ GUILDFORD   │ GUILDFORD
 SURREY           │ LEATHERHEAD │ OXSHOTT
 LINCOLNSHIRE     │ WISBECH     │ TYDD GOTE
 WILTSHIRE        │ PEWSEY      │ PEWSEY
(15 rows)

Let's do something that results in more or less random insertions into
the index:

pg@~[31744]=# insert into land2 select * from land_registry_price_paid_uk;
INSERT 0 21456045
Time: 155536.176 ms (02:35.536)
pg@~[31744]=# \di+
                         List of relations
 Schema │   Name    │ Type  │ Owner │ Table │  Size   │ Description
────────┼───────────┼───────┼───────┼───────┼─────────┼─────────────
 public │ composite │ index │ pg    │ land2 │ 1134 MB │
(1 row)

(Note that I actually VACUUM FREEZE'd land_registry_price_paid_uk
before inserting in both cases.)

On the master branch, the same bulk INSERT takes about 40% longer --
03:38.682 (I haven't benchmarked this carefully enough to be sure it's
a consistent improvement, but I think it is). More importantly, the
index is consistently about 17% larger on the master branch, as shown
here:

pg@~[6387]=# \di+
                         List of relations
 Schema │   Name    │ Type  │ Owner │ Table │  Size   │ Description
────────┼───────────┼───────┼───────┼───────┼─────────┼─────────────
 public │ composite │ index │ pg    │ land2 │ 1329 MB │
(1 row)

Note that I've changed the page split logic quite a bit to get this
benefit. I haven't posted a version of my patch that'll do this just
yet, because there are a bunch of competing considerations that I
won't get into now.

Explanation
-----------

Pivot tuples describe how the key space will be split up, which will
*hopefully* be balanced for future insertions. So, apart from the
obvious issue about the size of pivot tuples, there is a more
important issue: why unnecessarily commit to certain exact boundaries
early on, before it's clear how well that will balance values that get
inserted later? Suffix truncation isn't very helpful when used by
CREATE INDEX, which works off a sorted stream, since nbtsort.c "page
splits" will naturally be spaced "far apart in terms of the final key
space". In other words, whether or not the split points chosen will
balance the space utilization of the index as a whole is not left to
chance during CREATE INDEX.

To illustrate what I mean, I'll show that the master branch actually
manages to get the index even smaller than 1134MB following a REINDEX:

pg@~[6387]=# reindex index composite;
REINDEX
pg@~[6387]=# \di+
                         List of relations
 Schema │   Name    │ Type  │ Owner │ Table │  Size   │ Description
────────┼───────────┼───────┼───────┼───────┼─────────┼─────────────
 public │ composite │ index │ pg    │ land2 │ 1100 MB │
(1 row)

If post-REINDEX/CREATE INDEX size is all you care about, then suffix
truncation is much less important, since now you really are talking
about shaving off a fraction of 1% of the total size of the index.
FWIW, my patch leaves the index size at 1101 MB following a REINDEX,
which is slightly larger (we need to store heap TIDs in duplicate
pivot tuples), though this is likely to pay off when there are random
insertions later.

Summary
=======

In summary, suffix truncation is much more useful than it may appear
at first, because:

* It can make space utilization for random insertions much closer to
the utilization we'd get with a REINDEX after the random insertions
are over.

* While it cannot noticeably reduce the number of index tuple
comparisons (fewer internal pages have very little influence on that),
it can allow many of those comparisons to be resolved by a cheap
"negative infinity" comparison. I think that the number of expensive
strcoll() calls we make can be reduced significantly for multi-column
indexes. We see some evidence of this in my example. Many of the
remaining calls to varstr_cmp() are probably resolved using the
memcmp() equality fast-path.

* The downside of trying to truncate can likely be made close to zero.
Unlike techniques like prefix compression, suffix truncation leaves us
with little to lose, so we don't need buy-in from the user. No new
GUCs are required.
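The second bullet can be made concrete with a toy comparison routine.
This is an illustrative sketch only (the names and the MINUS_INFINITY
sentinel are mine), not how nbtree's actual comparison code is
written:

```python
# Toy sketch: comparing a scan key against a (possibly truncated) pivot.
# A truncated attribute resolves the comparison immediately, without any
# strcoll()-style collation-aware comparison of the trailing attribute.

MINUS_INFINITY = object()  # stand-in for the minus-infinity sentinel

def compare_to_pivot(scan_key, pivot):
    """Return -1, 0, or 1 for scan_key versus pivot."""
    for key_attr, pivot_attr in zip(scan_key, pivot):
        if pivot_attr is MINUS_INFINITY:
            return 1  # scan key > pivot: resolved with no string comparison
        if key_attr != pivot_attr:          # cheap equality check first
            return -1 if key_attr < pivot_attr else 1  # "expensive" path
    return 0

# Searching for Albany never compares the third attribute as a string,
# because the pivot was truncated to ('USA', 'New York', <minus infinity>).
result = compare_to_pivot(('USA', 'New York', 'Albany'),
                          ('USA', 'New York', MINUS_INFINITY))
# result is 1
```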

[1] https://wiki.postgresql.org/wiki/Sample_Databases#Other_Samples
--
Peter Geoghegan


Re: Why B-Tree suffix truncation matters

From: Andrey Borodin
Hi, Peter!

> On 4 July 2018 at 20:43, Peter Geoghegan <pg@bowt.ie> wrote:
>
> I've been working on B-Tree suffix truncation

Thank you for this detailed explanation. I must admit I had been
seriously confusing prefix truncation and suffix truncation before
this post.

Some years ago I observed a measurable performance improvement (around
8% on inserts and 5% on selects) from covering indexes in a series of
experiments [0]. I believe the same improvement may be achieved by
suffix truncation in the case of complex composite indexes.

> Unlike techniques like prefix compression, suffix
> truncation leaves us with little to lose, so we don't need buy-in from
> the user. No new GUCs required.

Indeed, but prefix truncation can reduce the whole index size
dramatically, not by a matter of a few percent. If we have a foreign
key from table 1 with 1M tuples to table 2 with 1K rows, the index
size can be reduced by an order of magnitude. And this optimization
seems very straightforward: trim a tuple's prefix if it matches the
hikey's prefix (or some other tuple's prefix, in the case of a leaf
page).


> The downside of trying to truncate seems like it could be made to be
> close to zero.
I see code complexity as somewhat of a downside. B-tree is kind of
rocket science. Chances are you have some kind of roadmap of B-tree
things to implement in the coming years?

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/20160814171131.21390.75752.pgcf%40coridan.postgresql.org

Re: Why B-Tree suffix truncation matters

From: Peter Geoghegan
On Thu, Jul 5, 2018 at 7:17 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Some years ago I've observed viable performance improvement (around 8% in inserts and 5% in selects) of covering
> indexes in a series of experiments [0].

Covering indexes are definitely a closely related idea. It kind of
makes sense that we got covering indexes first, even though I think
most other DB systems had suffix truncation quite early on.

Suffix truncation (and being smarter about the point at which pages
are split, both in general, and to help suffix truncation) is
important for bloat. Index bloat and LP_DEAD killing and heap bloat
and heap pruning seem to be related, and to benefit from better
locality of access that the TID key thing adds. The same patch that
helped a lot with the composite index seems to help an awful lot with
pgbench workloads that are I/O bound and take hours to run -- I saw
30%+ increases in transaction throughput on a benchmark that just
finished up, after running for almost 24 hours. My latest work
seems very promising for OLTP workloads that we've traditionally not
done so well with.

I hope to be able to post a new version of my patch in the next few
days, or maybe next week. I just need to do a bit more validation, and
clean-up. V3 will be very different to V2 and V1, since it won't just
be an idea for a new architectural direction. It addresses several
problems all at once.

This work seems to have a kind of "inventor's paradox" quality to it
-- it's somehow easier to fix all the problems at once than it is to
fix just one, which is very counter-intuitive.

> Indeed, but prefix truncation can reduce the whole index size dramatically, not by a matter of a few percent.

I made the example composite index about 15% smaller after random
insertions, and I did so with code that mostly just pays close
attention to the existing rules. It saves us the cost of many
comparisons, and it has benefits for I/O and for bloat control. The
optimization has no downside that I'm aware of. That's why I'm
starting there.

Look at it like this: It's already true that pivot tuples may have
values that were from non-pivot tuples that were killed long ago,
since the job of pivot tuples is to simply separate the key space; the
fact that a particular index tuple happened to have the exact values
used for a pivot makes no difference to that index tuple. VACUUM may
have removed the index tuple a year ago, without caring about the
pivot tuple. As far as anybody can tell, it could be a fictitious
value that never existed in the first place. Why not *actually* start
with a truly fictitious value for pivot tuples when we're doing leaf
page splits? It's obvious that it's safe. We just need to be sure that
it separates values on each side of the original split, and be a bit
smarter about where on the page to take as a split point. Everything
else automatically follows.
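The "be a bit smarter about where on the page to take as a split
point" part can be sketched as follows. This is an assumption-laden
toy, not the actual page split code; pick_split and attrs_kept are
hypothetical names, and a real implementation must also weigh space
utilization:

```python
# Toy illustration of choosing a leaf split point with truncation in mind:
# among a few candidate split points near the middle of the page, prefer
# the one whose neighboring tuples diverge earliest, so that the most
# trailing attributes can be truncated away from the new pivot.

def attrs_kept(left_tuple, right_tuple):
    """Number of attributes a pivot would keep (up to the first distinguishing one)."""
    for i, (l, r) in enumerate(zip(left_tuple, right_tuple)):
        if l != r:
            return i + 1
    return len(left_tuple)

def pick_split(tuples, slack=2):
    """Return the index of the first tuple that goes to the right page."""
    mid = len(tuples) // 2
    candidates = range(max(1, mid - slack), min(len(tuples), mid + slack + 1))
    # Fewer kept attributes => smaller pivot, less premature commitment
    return min(candidates, key=lambda i: attrs_kept(tuples[i - 1], tuples[i]))

page = [('UK', 'KENT', 'CANTERBURY'),
        ('UK', 'KENT', 'DOVER'),
        ('UK', 'KENT', 'RYE'),
        ('UK', 'SURREY', 'GUILDFORD'),
        ('UK', 'SURREY', 'OXSHOTT'),
        ('UK', 'SURREY', 'WOKING')]
split = pick_split(page)
# Splitting between KENT and SURREY lets the 'locality' attribute truncate away.
```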

Right now, the suffix truncation is not very aggressive, since it just
truncates whole attributes. Some day, we should be able to truncate up
to a single character in a text string. I think we need key
normalization for that, though.
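Character-level truncation reduces to finding the shortest separator
between two adjacent keys. A sketch, under the assumption that key
normalization has already produced binary-comparable keys (byte-wise
order), with shortest_separator being a name of my own:

```python
# Sketch of character-level suffix truncation on normalized (binary
# comparable) keys: find the shortest prefix of the right key that still
# compares greater than the entire left key. That prefix is a valid pivot.

def shortest_separator(left: bytes, right: bytes) -> bytes:
    assert left < right
    for n in range(1, len(right) + 1):
        candidate = right[:n]
        if candidate > left:   # still separates the two halves
            return candidate
    return right

sep = shortest_separator(b'New Jersey', b'New York')
# sep == b'New Y' -- five bytes are enough to separate the pages
```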

> If we have a foreign key from table 1 with 1M tuples to table 2 with 1K rows, the index size can be reduced by an
> order of magnitude. And this optimization seems very straightforward: trim a tuple's prefix if it matches the hikey's
> prefix (or some other tuple's prefix, in the case of a leaf page).

Prefix compression is way more complicated than you seem to think.

With key normalization, it's fairly easy to do it for pivot tuples in
internal pages, because the normalized key representation is easy to
reason about during page splits -- we need to pay very close attention
to the use of space. But at the leaf level, we have to use the original
representation (for index-only scans), and so the space management is
hard. You really need to add a "low key" on the leaf, because we need
to think about the universe of possible values that can go on the
page, not just the values that happen to currently be present.
Otherwise the free space management is a nightmare.

In MySQL, prefix truncation is configurable, and is often not used.
I'd expect to have to make it configurable. That's a very different
kind of feature.
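For contrast, here is roughly what page-level prefix compression
against a page low key involves. This is a deliberately simplified
sketch (function names are mine); it sidesteps the hard part described
above, namely managing free space when the shared prefix of a page
changes as tuples come and go:

```python
# Simplified sketch of leaf-page prefix compression relative to a page
# "low key". Each tuple is stored as (shared_prefix_len, suffix). The
# hard part in a real design -- free space accounting as the page's
# contents shift -- is deliberately omitted here.

def compress_page(low_key, tuples):
    out = []
    for t in tuples:
        n = 0
        while n < min(len(low_key), len(t)) and low_key[n] == t[n]:
            n += 1
        out.append((n, t[n:]))   # bytes shared with the low key, then the rest
    return out

def decompress_page(low_key, compressed):
    return [low_key[:n] + suffix for n, suffix in compressed]

low_key = 'GREATER LONDON'
page = ['GREATER LONDON LEYTON', 'GREATER LONDON LONDON', 'GREATER MANCHESTER']
compressed = compress_page(low_key, page)
# e.g. 'GREATER LONDON LEYTON' becomes (14, ' LEYTON')
```

Note that compression is against the low key, not against a neighboring
tuple, so any single tuple can be reconstructed without decompressing
the whole page.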

> I see code complexity as somewhat a downside. B-tree is kind of a rocket science.

I know what you mean, but FWIW I see it as a matter of developing a
good mental model. If you look at the code without a mental model, the
complexity is overwhelming. If you develop a high-level understanding
first, and try to visualize the structure, then eventually it starts
to make sense, and it seems way less complicated (though still very
complicated).

> Chances are you have some kind of roadmap of B-tree things to implement in the coming years?

Kind of. That's almost what this page is:
https://wiki.postgresql.org/wiki/Key_normalization

That page is really just about the representation of pivot tuples and
the representation of individual pages, and yet that's almost the only
area that I'd like to make changes to. I have a very high opinion of
the nbtree code. I'm not really interested in changing the high-level
rules. For example, I'm not interested in adding merging of pages that
are not yet completely empty. It's way too much complexity for way too
little gain. The same problem can actually be more effectively
addressed by suffix truncation -- why not just make it so that those
pages don't become half empty to begin with?

I could perhaps work on prefix compression. But probably not in the
next 3 years. It's way down the list for me.

--
Peter Geoghegan