Why B-Tree suffix truncation matters - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Why B-Tree suffix truncation matters
Date
Msg-id CAH2-Wzn5XbCzk6u0GL+uPnCp1tbrp2pJHJ=3bYT4yQ0_zzHxmw@mail.gmail.com
Whole thread Raw
Responses Re: Why B-Tree suffix truncation matters
List pgsql-hackers
I've been working on B-Tree suffix truncation, as part of a larger
effort to make all index tuples have unique keys by treating their
heap TID as a first class part of the key space, as part of an even
larger (and still ill-defined) effort to improve index vacuuming
(a.k.a. "retail index tuple deletion"). I posted some prototype
patches over on the "Making all nbtree entries unique by having heap
TIDs participate in comparisons", and have been working on this fairly
solidly for the past few weeks.

I'm starting this new thread to discuss the benefits of B-Tree suffix
truncation, and to demonstrate how effective it can be at increasing
fan-out and space utilization with a real-world example. I haven't
explained why suffix truncation matters very well before now. Now that
I have a patch that works, I can just show the benefit directly. (By
the way, there are similar benefits to covering INCLUDE indexes, where
suffix truncation occurs all the time, and not just
opportunistically.)

Definition
==========

To recap, suffix truncation is a classic optimization for B-Trees that
was first described by Bayer in 1977. We truncate away
non-distinguishing trailing attributes from pivot tuples to save space
in internal pages, improving fan-in. Pivot tuples are the index tuples
that don't point to the heap - they're only used to guide searches to
the correct leaf page. Since they only guide searches, they only need
those attributes up to and including the first attribute that
distinguished each half of a leaf page split at the time of the
original split. For example, if during a leaf page split the tuple to
the immediate left of the split point looks like this:

('USA', 'New Jersey', 'Wyckoff')

...and, if the tuple at the immediate right half looked like this:

('USA', 'New York', 'Albany')

...then we could get away with truncating to make the new left page's
high key (and right page's downlink) be represented as:

('USA', 'New York', <minus infinity>)

Where "<minus infinity>" is represented a little bit like a NULL
value, though with less overhead (we can use the same representation
as INCLUDE indexes do for pivot tuples). Don't make the mistake of
thinking of this as a form of compression -- it's actually a way
avoiding storing something that we simply don't need, and never could
need, that prematurely limits where tuples can be stored in the index.

Example
=======

It's well known that the number of internal pages in a B-Tree seldom
exceeds 1% of the total (2% would be considered hugely bloated). These
pages are what we're really interested in shrinking, so suffix
truncation might appear to be a micro-optimization of little value.
Who cares if I can shave 0.2% off the total size of the index? In
fact, it can be far more than that in cases involving *random*
insertions. Here is a simple motivating example, that uses the UK land
registry data [1]:

pg@~[31744]=# \dt+
                                List of relations
 Schema │            Name             │ Type  │ Owner │    Size    │ Description
────────┼─────────────────────────────┼───────┼───────┼────────────┼─────────────
 public │ land_registry_price_paid_uk │ table │ pg    │ 3005 MB │
(1 row)

Let's create a table with the same schema, and build a composite index
on (county, city, locality) against that table, for our random
insertions test:

pg@~[31744]=# create table land2 (like land_registry_price_paid_uk);
CREATE TABLE
pg@~[31744]=# create index composite on land2 (county, city, locality);
CREATE INDEX

The data we're about to insert looks like this:

pg@~[7002]=# select county, city, locality from
land_registry_price_paid_uk limit 15;
      county      │    city     │  locality
──────────────────┼─────────────┼─────────────
 LANCASHIRE       │ BURNLEY     │ BURNLEY
 SLOUGH           │ SLOUGH      │ SLOUGH
 EAST SUSSEX      │ RYE         │ RYE
 HAMPSHIRE        │ FARNBOROUGH │ FARNBOROUGH
 GREATER LONDON   │ LONDON      │ LONDON
 LINCOLNSHIRE     │ SKEGNESS    │ SKEGNESS
 WREKIN           │ TELFORD     │ TELFORD
 KENT             │ CANTERBURY  │ CANTERBURY
 GREATER LONDON   │ LONDON      │ LONDON
 GREATER LONDON   │ LONDON      │ LEYTON
 NORTHAMPTONSHIRE │ CORBY       │ CORBY
 SURREY           │ GUILDFORD   │ GUILDFORD
 SURREY           │ LEATHERHEAD │ OXSHOTT
 LINCOLNSHIRE     │ WISBECH     │ TYDD GOTE
 WILTSHIRE        │ PEWSEY      │ PEWSEY
(15 rows)

Let's do something that results in more or less random insertions into
the index:

pg@~[31744]=# insert into land2 select * from land_registry_price_paid_uk;
INSERT 0 21456045
Time: 155536.176 ms (02:35.536)
pg@~[31744]=# \di+
                                     List of relations
 Schema │         Name          │ Type  │ Owner │      Table       │
Size   │ Description
────────┼───────────────────────┼───────┼───────┼──────────────────┼─────────┼─────────────
 public │ composite             │ index │ pg    │ land2            │ 1134 MB │
(1 row)

(Note that I actually VACUUM FREEZE'd land_registry_price_paid_uk
before inserting in both cases.)

If I evaluated this on the master branch, this bulk INSERT takes about
40% longer -- 03:38.682 (I haven't tried to benchmark this to make
sure that it's a consistent improvement, but I think it is). More
importantly, the index is consistently about 17% larger on the master
branch, as shown here:

pg@~[6387]=# \di+
                         List of relations
 Schema │   Name    │ Type  │ Owner │ Table │  Size   │ Description
────────┼───────────┼───────┼───────┼───────┼─────────┼─────────────
 public │ composite │ index │ pg    │ land2 │ 1329 MB │
(1 row)

Note that I've changed the page split logic quite a bit to get this
benefit. I haven't posted a version of my patch that'll do this just
yet, because there are a bunch of competing considerations that I
won't get into now.

Explanation
-----------

Pivot tuples describe how the key space will be split up, which will
*hopefully* be balanced for future insertions. So, apart from the
obvious issue about the size of pivot tuples, there is a more
important issue: why unnecessarily commit to certain exact boundaries
early on, before it's clear how well that will balance values that get
inserted later? Suffix truncation isn't very helpful when used by
CREATE INDEX, which works off a sorted stream, since nbtsort.c "page
splits" will naturally be spaced "far apart in terms of the final key
space". In other words, whether or not the split points chosen will
balance the space utilization of the index as a whole is not left to
chance during CREATE INDEX.

To illustrate what I mean, I'll show that the master branch actually
manages to get the index even smaller than 1134MB following a REINDEX:

pg@~[6387]=# reindex index composite;
REINDEX
pg@~[6387]=# \di+
                         List of relations
 Schema │   Name    │ Type  │ Owner │ Table │  Size   │ Description
────────┼───────────┼───────┼───────┼───────┼─────────┼─────────────
 public │ composite │ index │ pg    │ land2 │ 1100 MB │

If post-REINDEX/CREATE INDEX size is all you care about, then suffix
truncation is much less important, since now you really are talking
about shaving off a fraction of 1% of the total size of the index.
FWIW, my patch leaves the index size at 1101 MB following a REINDEX,
which is slightly larger (we need to store heap TIDs in duplicate
pivot tuples), though this is likely to pay off when there are random
insertions later.

Summary
=======

In summary, suffix truncation is much more useful than it may appear
at first, because:

* It can make space utilization for random insertions much closer to
the utilization we'd get with a REINDEX after the random insertions
are over.

* While it cannot noticeably reduce the number of index tuple
comparisons (fewer internal pages has very little influence on that),
it can make those comparisons resolved by a cheap "negative infinity"
comparison in many cases. I think that the amount of expensive
strcoll() calls we do can be improved significantly with
multi-attribute columns. We see some evidence of this in my example.
Many of the remaining calls to varstr_cmp() are probably resolved
using the memcmp() equality fast-path.

* The downside of trying to truncate seems like it could be made to be
close to zero. Unlike techniques like prefix compression, suffix
truncation leaves us with little to lose, so we don't need buy-in from
the user. No new GUCs required.

[1] https://wiki.postgresql.org/wiki/Sample_Databases#Other_Samples
--
Peter Geoghegan


pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: Recovery performance of DROP DATABASE with many tablespaces
Next
From: Peter Geoghegan
Date:
Subject: Re: Bulk Insert into PostgreSQL