Re: Improving btree performance through specializing by key shape, take 2 - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: Improving btree performance through specializing by key shape, take 2
Date
Msg-id CAEze2WiqOONRQTUT1p_ZV19nyMA69UU2s0e2dp+jSBM=j8snuw@mail.gmail.com
In response to Re: Improving btree performance through specializing by key shape, take 2  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: Improving btree performance through specializing by key shape, take 2
List pgsql-hackers
On Thu, 26 Oct 2023 at 00:36, Peter Geoghegan <pg@bowt.ie> wrote:
> Most of the work with something like
> this is the analysis of the trade-offs, not writing code. There are
> all kinds of trade-offs that you could make with something like this,
> and the prospect of doing that myself is kind of daunting. Ideally
> you'd have made a significant start on that at this point.

I believe I'd already made most trade-offs clear earlier in the
threads, along with rationales for the changes in behaviour. But here
goes again:

_bt_compare currently uses index_getattr() on each attribute of the
key. index_getattr() is O(n) for the n-th attribute if the index tuple
has any null or non-attcacheoff attributes in front of the current
one. Thus, _bt_compare does O(n^2) work, with n = the number of key
attributes, and makes O(n) calls to opclass-supplied compare
operations. That typically costs several % of performance, but it
gets very bad in extreme cases.
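
To make that concrete, here is a toy, self-contained model of the
mechanism (not the actual index_getattr() from itup.h; it ignores
alignment and the null bitmap): once any earlier attribute is
variable-width or null, the offset of attribute n cannot be cached,
so finding it means walking every preceding attribute.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Per-attribute metadata: fixed length in bytes, or -1 for variable width. */
typedef struct AttrDesc
{
    int attlen;         /* -1 means variable-width */
    int attcacheoff;    /* cached start offset, or -1 if not cacheable */
} AttrDesc;

/* Toy variable-width format: a 4-byte length header that includes itself. */
static size_t
toy_varsize(const char *ptr)
{
    uint32_t len;
    memcpy(&len, ptr, sizeof(len));
    return len;
}

/*
 * Fetch a pointer to attribute 'attnum' (0-based).  With a valid
 * attcacheoff this is O(1); otherwise we must walk every preceding
 * attribute to discover where this one starts, which is O(attnum).
 * Comparing all n attributes this way is O(n^2) per tuple comparison.
 */
static const char *
toy_getattr(const char *tupdata, const AttrDesc *atts, int attnum)
{
    size_t off = 0;

    if (atts[attnum].attcacheoff >= 0)
        return tupdata + atts[attnum].attcacheoff;

    for (int i = 0; i < attnum; i++)
        off += (atts[i].attlen >= 0) ? (size_t) atts[i].attlen
                                     : toy_varsize(tupdata + off);
    return tupdata + off;
}

int
main(void)
{
    /* Tuple layout: int32, a 9-byte variable-width value, int32. */
    char     tupdata[32] = {0};
    AttrDesc atts[] = {{4, 0}, {-1, 4}, {4, -1}};  /* last offset unknown */
    uint32_t vlen = 4 + 5;
    int32_t  third = 42;

    memcpy(tupdata + 4, &vlen, 4);
    memcpy(tupdata + 4 + 4, "hello", 5);
    memcpy(tupdata + 4 + vlen, &third, 4);

    memcpy(&third, toy_getattr(tupdata, atts, 2), 4);
    printf("third attribute = %d\n", third);
    return 0;
}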

To get rid of most of the O(n) compare operations, we can optimize
_bt_compare to only compare the "interesting" attributes, i.e. we can
apply "dynamic prefix truncation". This is implemented by patch 0001.
Patch 0002 enhances this further by skipping the compare operations
entirely when the HIKEY is the same as the right separator of the
downlink we followed (due to our page split code, this case is
extremely likely).
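
To illustrate the idea behind 0001, here is a minimal, self-contained
sketch of dynamic prefix truncation during a binary search over
multi-column keys (rows of ints here; the actual patch applies this
inside nbtree's per-page binary search with more bookkeeping, and the
0002 HIKEY shortcut is not shown): once both search bounds are known
to agree with the search key on their first k columns, every key
between them must agree on those columns too, so the midpoint
comparison can start at column k.

#include <stdio.h>

#define NCOLS 3

/*
 * Compare 'key' and 'tup' column by column, starting at column *skipcols
 * (those columns are already known to be equal).  On return, *skipcols
 * holds the number of leading columns now known to be equal.
 */
static int
compare_from(const int *key, const int *tup, int *skipcols)
{
    for (int i = *skipcols; i < NCOLS; i++)
    {
        if (key[i] != tup[i])
            return (key[i] < tup[i]) ? -1 : 1;
        *skipcols = i + 1;
    }
    return 0;
}

/* Find the index of the first row in rows[0..nrows) that is >= key. */
static int
search(const int rows[][NCOLS], int nrows, const int *key)
{
    int low = 0, high = nrows;
    int low_skip = 0, high_skip = 0;   /* columns known equal at each bound */

    while (low < high)
    {
        int mid = low + (high - low) / 2;
        /*
         * Every row strictly between the last-compared low and high
         * boundary rows shares at least min(low_skip, high_skip) leading
         * columns with the search key, so don't compare those again.
         */
        int skip = (low_skip < high_skip) ? low_skip : high_skip;
        int cmp = compare_from(key, rows[mid], &skip);

        if (cmp > 0)
        {
            low = mid + 1;
            low_skip = skip;
        }
        else
        {
            high = mid;
            high_skip = skip;
        }
    }
    return low;
}

int
main(void)
{
    const int rows[][NCOLS] = {{1, 1, 1}, {1, 1, 2}, {1, 2, 0}, {2, 0, 0}};
    const int key[NCOLS] = {1, 2, 0};

    printf("first row >= key is at index %d\n", search(rows, 4, key));
    return 0;
}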

However, the above only reduces the attribute indexing work in
_bt_compare to O(n) for at most about 76% of the comparisons on a page
(1 - (2 / log2(max_tuples_per_page))); the remaining 20+% of compare
operations, on average, still hit the O(n^2) total complexity of
index_getattr.

To fix that O(n^2) issue (the issue this thread was originally created
for), the approach I originally implemented is to not use index_getattr,
but an "attribute iterator" that incrementally extracts the next
attribute while keeping track of the current offset into the tuple, so
that each next attribute is O(1). That is implemented in the last
patches of the patchset.
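
A simplified, self-contained sketch of that idea (again using the toy
tuple format from above, not the actual patch): the iterator carries
the running offset, so extracting attribute n after attribute n-1
costs O(1) instead of rescanning the tuple from the start.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Fixed attribute lengths; -1 marks a variable-width attribute. */
static const int attlen[] = {4, -1, 4};

/* Toy variable-width format: a 4-byte length header that includes itself. */
static size_t
toy_varsize(const char *ptr)
{
    uint32_t len;
    memcpy(&len, ptr, sizeof(len));
    return len;
}

/* Iterator state: the running offset makes each next attribute O(1). */
typedef struct AttrIter
{
    const char *tupdata;
    size_t      off;
    int         attnum;
} AttrIter;

static void
attr_iter_init(AttrIter *it, const char *tupdata)
{
    it->tupdata = tupdata;
    it->off = 0;
    it->attnum = 0;
}

static const char *
attr_iter_next(AttrIter *it)
{
    const char *datum = it->tupdata + it->off;
    int         len = attlen[it->attnum++];

    it->off += (len >= 0) ? (size_t) len : toy_varsize(datum);
    return datum;
}

int
main(void)
{
    char     tupdata[32] = {0};
    uint32_t vlen = 4 + 5;
    int32_t  val = 7, last;
    AttrIter it;

    memcpy(tupdata, &val, 4);
    memcpy(tupdata + 4, &vlen, 4);
    memcpy(tupdata + 4 + 4, "hello", 5);
    val = 42;
    memcpy(tupdata + 4 + vlen, &val, 4);

    attr_iter_init(&it, tupdata);
    attr_iter_next(&it);                    /* attribute 0: int */
    attr_iter_next(&it);                    /* attribute 1: variable-width */
    memcpy(&last, attr_iter_next(&it), 4);  /* attribute 2: O(1), not O(n) */
    printf("last attribute = %d\n", last);
    return 0;
}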

This attribute iterator approach has an issue: it doesn't perform very
well for indexes that make full use of attcacheoff, because the
bookkeeping for attribute iteration proved to be much more expensive
than just reading attcacheoff from memory. This is why the later
patches (v14's 0003+) adapt the btree code to generate different code
paths for different "shapes" of key attributes: the current
attcacheoff code keeps its performance, while indexes where the
attcacheoff optimization cannot be applied get higher performance. In
passing, it also specializes the code for single-attribute indexes, so
that they don't have to deal with branching code for further
attributes, increasing their performance, too.
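
For flavor only, a tiny sketch of that kind of shape-based dispatch
(hypothetical names; the actual patches generate the specialized
_bt_compare paths from templated nbtree source rather than going
through function pointers):

#include <stdio.h>

/*
 * Three "shapes" of key attributes, each served by its own specialized
 * comparator so the common paths stay free of per-attribute branching.
 */
typedef enum KeyShape
{
    KEY_SINGLE_ATTR,      /* one key attribute: no per-attribute loop */
    KEY_CACHED_OFFSETS,   /* every attribute has a valid attcacheoff */
    KEY_UNCACHED_OFFSETS  /* null/variable-width attrs: use an iterator */
} KeyShape;

typedef int (*CompareFn)(const void *key, const void *tuple);

/* Stub comparators standing in for the specialized variants. */
static int compare_single(const void *k, const void *t)   { (void) k; (void) t; return 0; }
static int compare_cacheoff(const void *k, const void *t) { (void) k; (void) t; return 0; }
static int compare_iter(const void *k, const void *t)     { (void) k; (void) t; return 0; }

/* Pick the specialized comparator once per scan, not once per comparison. */
static CompareFn
select_compare(KeyShape shape)
{
    switch (shape)
    {
        case KEY_SINGLE_ATTR:     return compare_single;
        case KEY_CACHED_OFFSETS:  return compare_cacheoff;
        case KEY_UNCACHED_OFFSETS:
        default:                  return compare_iter;
    }
}

int
main(void)
{
    CompareFn cmp = select_compare(KEY_CACHED_OFFSETS);

    printf("cmp(...) = %d\n", cmp(NULL, NULL));
    return 0;
}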

TLDR:
The specialization in 0003+ is applied because index_getattr is good
when attcacheoff applies, but very bad when it doesn't. Attribute
iteration is worse than index_getattr when attcacheoff applies, but is
significantly better when attcacheoff does not work. By specializing
we get the best of both worlds.

The 0001 and 0002 optimizations were added later to further remove
unneeded calls to the btree attribute compare functions, thus further
reducing the total time spent in _bt_compare.

Anyway.

PFA v14 of the patchset. v13's 0001 is now split in two: 0001 contains
the dynamic prefix truncation, and 0002 the downlink's right
separator/HIKEY optimization.

Performance numbers (data attached):
0001 has significant gains in multi-column indexes with shared
prefixes, where the prefix columns are expensive to compare, but
otherwise doesn't have much impact.
0002 further improves performance across the board, but again mostly
for indexes with expensive compare operations.
0007 sees performance improvements almost across the board, with only
the 'ul' and 'tnt' indexes getting somewhat worse results than master
(but still better average results).

All patches applied, per-index average performance improvements on 15
runs range from 3% to 290% for INSERT benchmarks, and from -2.83% to
370% for REINDEX.

Configured with autoconf: config.log:
> It was created by PostgreSQL configure 17devel, which was
> generated by GNU Autoconf 2.69.  Invocation command line was
>
>   $ ./configure --enable-tap-tests --enable-depend --with-lz4 --with-zstd COPT=-ggdb -O3 --prefix=/home/matthias/projects/postgresql/pg_install --no-create --no-recursion

The benchmark was done on 1M random rows of the pp-complete dataset,
as found on UK Gov's S3 bucket [0]. (Using a parallel, multi-threaded
downloader is preferred, as per-client throughput is measured in
kBps.)

I'll do a few runs on the full dataset of 29M rows soon too, but
master's performance is so bad for the 'worstcase' index that I can't
finish its runs fast enough; benchmarking it takes hours per
iteration.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] http://prod1.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-complete.csv

