Why B-Tree suffix truncation matters - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Why B-Tree suffix truncation matters |
Date | |
Msg-id | CAH2-Wzn5XbCzk6u0GL+uPnCp1tbrp2pJHJ=3bYT4yQ0_zzHxmw@mail.gmail.com Whole thread Raw |
Responses |
Re: Why B-Tree suffix truncation matters
|
List | pgsql-hackers |
I've been working on B-Tree suffix truncation, as part of a larger effort to make all index tuples have unique keys by treating their heap TID as a first class part of the key space, as part of an even larger (and still ill-defined) effort to improve index vacuuming (a.k.a. "retail index tuple deletion"). I posted some prototype patches over on the "Making all nbtree entries unique by having heap TIDs participate in comparisons", and have been working on this fairly solidly for the past few weeks. I'm starting this new thread to discuss the benefits of B-Tree suffix truncation, and to demonstrate how effective it can be at increasing fan-out and space utilization with a real-world example. I haven't explained why suffix truncation matters very well before now. Now that I have a patch that works, I can just show the benefit directly. (By the way, there are similar benefits to covering INCLUDE indexes, where suffix truncation occurs all the time, and not just opportunistically.) Definition ========== To recap, suffix truncation is a classic optimization for B-Trees that was first described by Bayer in 1977. We truncate away non-distinguishing trailing attributes from pivot tuples to save space in internal pages, improving fan-in. Pivot tuples are the index tuples that don't point to the heap - they're only used to guide searches to the correct leaf page. Since they only guide searches, they only need those attributes up to and including the first attribute that distinguished each half of a leaf page split at the time of the original split. For example, if during a leaf page split the tuple to the immediate left of the split point looks like this: ('USA', 'New Jersey', 'Wyckoff') ...and, if the tuple at the immediate right half looked like this: ('USA', 'New York', 'Albany') ...then we could get away with truncating to make the new left page's high key (and right page's downlink) be represented as: ('USA', 'New York', <minus infinity>) Where "<minus infinity>" is represented a little bit like a NULL value, though with less overhead (we can use the same representation as INCLUDE indexes do for pivot tuples). Don't make the mistake of thinking of this as a form of compression -- it's actually a way avoiding storing something that we simply don't need, and never could need, that prematurely limits where tuples can be stored in the index. Example ======= It's well known that the number of internal pages in a B-Tree seldom exceeds 1% of the total (2% would be considered hugely bloated). These pages are what we're really interested in shrinking, so suffix truncation might appear to be a micro-optimization of little value. Who cares if I can shave 0.2% off the total size of the index? In fact, it can be far more than that in cases involving *random* insertions. Here is a simple motivating example, that uses the UK land registry data [1]: pg@~[31744]=# \dt+ List of relations Schema │ Name │ Type │ Owner │ Size │ Description ────────┼─────────────────────────────┼───────┼───────┼────────────┼───────────── public │ land_registry_price_paid_uk │ table │ pg │ 3005 MB │ (1 row) Let's create a table with the same schema, and build a composite index on (county, city, locality) against that table, for our random insertions test: pg@~[31744]=# create table land2 (like land_registry_price_paid_uk); CREATE TABLE pg@~[31744]=# create index composite on land2 (county, city, locality); CREATE INDEX The data we're about to insert looks like this: pg@~[7002]=# select county, city, locality from land_registry_price_paid_uk limit 15; county │ city │ locality ──────────────────┼─────────────┼───────────── LANCASHIRE │ BURNLEY │ BURNLEY SLOUGH │ SLOUGH │ SLOUGH EAST SUSSEX │ RYE │ RYE HAMPSHIRE │ FARNBOROUGH │ FARNBOROUGH GREATER LONDON │ LONDON │ LONDON LINCOLNSHIRE │ SKEGNESS │ SKEGNESS WREKIN │ TELFORD │ TELFORD KENT │ CANTERBURY │ CANTERBURY GREATER LONDON │ LONDON │ LONDON GREATER LONDON │ LONDON │ LEYTON NORTHAMPTONSHIRE │ CORBY │ CORBY SURREY │ GUILDFORD │ GUILDFORD SURREY │ LEATHERHEAD │ OXSHOTT LINCOLNSHIRE │ WISBECH │ TYDD GOTE WILTSHIRE │ PEWSEY │ PEWSEY (15 rows) Let's do something that results in more or less random insertions into the index: pg@~[31744]=# insert into land2 select * from land_registry_price_paid_uk; INSERT 0 21456045 Time: 155536.176 ms (02:35.536) pg@~[31744]=# \di+ List of relations Schema │ Name │ Type │ Owner │ Table │ Size │ Description ────────┼───────────────────────┼───────┼───────┼──────────────────┼─────────┼───────────── public │ composite │ index │ pg │ land2 │ 1134 MB │ (1 row) (Note that I actually VACUUM FREEZE'd land_registry_price_paid_uk before inserting in both cases.) If I evaluated this on the master branch, this bulk INSERT takes about 40% longer -- 03:38.682 (I haven't tried to benchmark this to make sure that it's a consistent improvement, but I think it is). More importantly, the index is consistently about 17% larger on the master branch, as shown here: pg@~[6387]=# \di+ List of relations Schema │ Name │ Type │ Owner │ Table │ Size │ Description ────────┼───────────┼───────┼───────┼───────┼─────────┼───────────── public │ composite │ index │ pg │ land2 │ 1329 MB │ (1 row) Note that I've changed the page split logic quite a bit to get this benefit. I haven't posted a version of my patch that'll do this just yet, because there are a bunch of competing considerations that I won't get into now. Explanation ----------- Pivot tuples describe how the key space will be split up, which will *hopefully* be balanced for future insertions. So, apart from the obvious issue about the size of pivot tuples, there is a more important issue: why unnecessarily commit to certain exact boundaries early on, before it's clear how well that will balance values that get inserted later? Suffix truncation isn't very helpful when used by CREATE INDEX, which works off a sorted stream, since nbtsort.c "page splits" will naturally be spaced "far apart in terms of the final key space". In other words, whether or not the split points chosen will balance the space utilization of the index as a whole is not left to chance during CREATE INDEX. To illustrate what I mean, I'll show that the master branch actually manages to get the index even smaller than 1134MB following a REINDEX: pg@~[6387]=# reindex index composite; REINDEX pg@~[6387]=# \di+ List of relations Schema │ Name │ Type │ Owner │ Table │ Size │ Description ────────┼───────────┼───────┼───────┼───────┼─────────┼───────────── public │ composite │ index │ pg │ land2 │ 1100 MB │ If post-REINDEX/CREATE INDEX size is all you care about, then suffix truncation is much less important, since now you really are talking about shaving off a fraction of 1% of the total size of the index. FWIW, my patch leaves the index size at 1101 MB following a REINDEX, which is slightly larger (we need to store heap TIDs in duplicate pivot tuples), though this is likely to pay off when there are random insertions later. Summary ======= In summary, suffix truncation is much more useful than it may appear at first, because: * It can make space utilization for random insertions much closer to the utilization we'd get with a REINDEX after the random insertions are over. * While it cannot noticeably reduce the number of index tuple comparisons (fewer internal pages has very little influence on that), it can make those comparisons resolved by a cheap "negative infinity" comparison in many cases. I think that the amount of expensive strcoll() calls we do can be improved significantly with multi-attribute columns. We see some evidence of this in my example. Many of the remaining calls to varstr_cmp() are probably resolved using the memcmp() equality fast-path. * The downside of trying to truncate seems like it could be made to be close to zero. Unlike techniques like prefix compression, suffix truncation leaves us with little to lose, so we don't need buy-in from the user. No new GUCs required. [1] https://wiki.postgresql.org/wiki/Sample_Databases#Other_Samples -- Peter Geoghegan
pgsql-hackers by date: