Re: Fast insertion indexes: why no developments - Mailing list pgsql-hackers

From Nicolas Barbier
Subject Re: Fast insertion indexes: why no developments
Msg-id CAP-rdTZmaq=qDEcmx3S50--e4x7QKo=LSatiwRu+Biq6cUEpJg@mail.gmail.com
In response to Re: Fast insertion indexes: why no developments  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: Fast insertion indexes: why no developments  (Nicolas Barbier <nicolas.barbier@gmail.com>)
Re: Fast insertion indexes: why no developments  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
2013/11/2 Simon Riggs <simon@2ndquadrant.com>:

> On 29 October 2013 16:10, Peter Geoghegan <pg@heroku.com> wrote:
>
>> Presumably someone will get around to implementing a btree index
>> insertion buffer one day. I think that would be a particularly
>> compelling optimization for us, because we could avoid ever inserting
>> index tuples that are already dead when the deferred insertion
>> actually occurs.
>
> That's pretty much what the LSM-tree is.

[ Disclaimer: I have only skimmed the LSM-trees paper rather than read
it thoroughly. ]

LSM-trees seem to hit a wall when the total amount of data gets big
enough, unless one uses “multi-component LSM-trees” (as described in
the paper). Having a B-tree with deferred insertion would be similar
to having an LSM-tree without the multi-component property.

The underlying problem with fast random insertions into a B-tree is
that, once the tree gets large enough relative to memory, each
whole-block write carries only a small amount of “useful change.” The
“multi-component” concept fixes that. I think that simply combining
that concept with normal B-trees is a rather elegant way of (at least
theoretically) solving the problem:

Definitions:

* Define a rather small integer K (with K ≥ 2) that will influence
maximum insertion speed (higher K = higher insertion speed), but also
look-up time (higher K = slower look-up).
* Define some size S “that easily fits in memory.”
* F is the fan-out of the B-tree. (If F < K, the algorithm results in
slower amortized insertion speed than simply using one single big
B-tree, if only the number of blocks read/written is taken into
account; it may still be better because of seek times.)
* n is the total number of entries in the index.

Algorithm:

* Start inserting into a small B-tree until it reaches size S, then
start inserting into a new B-tree until that one also reaches size S,
etc.
* When K such B-trees (of size S) exist, merge them together into one
(S × K)-sized B-tree (delete the old ones).
* Do this recursively: Once K B-trees of size (K × S) exist, merge
them together into one (S × K^2)-sized B-tree, etc.
* Let each look-up walk all trees, and return the union of all results.

(Note that K B-trees can be merged by simply scanning all of them
concurrently, and merging them just like a merge sort merges runs.
Also, all B-trees except for the first level (of size S) can be
fully compacted (100% fill factor), as there is no need to reserve
space for further insertions in them.)
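
To make the cascade concrete, here is a minimal Python sketch (my own
toy formulation, not actual PostgreSQL code: sorted in-memory lists
stand in for on-disk B-trees, and BForest, K and S are just the
definitions above):

import heapq
from bisect import insort, bisect_left

K = 4     # merge fan-in (K >= 2)
S = 1000  # entries in a level-0 tree ("easily fits in memory")

class BForest:
    def __init__(self):
        self.memtree = []  # level-0 tree currently being filled
        self.levels = []   # levels[i]: full trees of S * K**i entries

    def insert(self, key):
        insort(self.memtree, key)      # stand-in for a B-tree insert
        if len(self.memtree) >= S:     # level-0 tree is full
            self._add_tree(0, self.memtree)
            self.memtree = []

    def _add_tree(self, level, tree):
        if len(self.levels) <= level:
            self.levels.append([])
        self.levels[level].append(tree)
        if len(self.levels[level]) == K:
            # K trees of S * K**level entries exist: merge them (just
            # like a merge sort merges runs) into one tree of
            # S * K**(level + 1) entries, drop the old ones, and
            # recurse in case the next level is now full as well.
            merged = list(heapq.merge(*self.levels[level]))
            self.levels[level] = []
            self._add_tree(level + 1, merged)

    def lookup(self, key):
        # Walk all trees and take the union of the results (here just
        # a membership test).
        trees = [self.memtree] + [t for lvl in self.levels for t in lvl]
        return any(self._probe(t, key) for t in trees)

    @staticmethod
    def _probe(tree, key):
        i = bisect_left(tree, key)
        return i < len(tree) and tree[i] == key

if __name__ == "__main__":
    f = BForest()
    for i in range(50000):
        f.insert((i * 37) % 100000)
    print(f.lookup(37), f.lookup(100000))  # True False

With these parameters each level holds at most K - 1 full trees at
any time, which is where the K × log_K(n) look-up factor below comes
from.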

Insertion speed can be calculated as follows (I disregard seek times
to make this easy; it also happens to be the right assumption for
SSDs): Assume that a small B-tree (size S) can simply be written out
without having to write each block multiple times. Each entry is
rewritten once per merge level, so it has to be written log_K(n ×
log_F(n)) times in total. All blocks written are 100% “useful
changes.” Therefore, insertion speed is log_K(n × log_F(n)) times
slower than raw disk speed. (Note that I also disregard the time
needed for reading; this may make everything about twice as slow.)

Example: For K = 10, F = 100 (i.e., 80B per entry), blocksize = 8kB,
and n = 10^9 (i.e., 80GB of entries), the insertion speed is
log_10(10^9 × log_100(10^9)) = log_10(10^9 × 4.5) = ~9.7 times slower
than writing the 80GB of index entries sequentially. For the “one
single big B-tree” scenario, the insertion speed would be ~F = ~100
times slower than raw disk speed (assuming that all writes but the
ones to the leaf level can be combined).
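
(For reference, the numbers above can be reproduced with a throwaway
Python calculation; math.log(x, b) computes log_b(x):)

import math

K, F, n = 10, 100, 10**9
print(math.log(n, F))                   # 4.5   = log_100(10^9)
print(math.log(n * math.log(n, F), K))  # ~9.65 = log_10(10^9 * 4.5)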

Look-up speed is as follows: Each look-up must look through all
B-trees. Assume for simplicity that all trees have the same height as
the single B-tree in the “one single big B-tree” case (which is
rather wrong, as most are shorter, but seems good enough as an
estimate), that there are K trees of each size (idem), and that there
are log_K(n) different sizes. Then, the number of trees to walk is K
× log_K(n), and therefore each look-up is K × log_K(n) times slower
than in the “one single big B-tree” case.

Example (using the same parameters as above): Look-up speed is 10 ×
log_10(10^9) = 90 times slower (because there are ~90 trees to walk).
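
(Again as a throwaway check of the arithmetic:)

import math

K, n = 10, 10**9
print(round(K * math.log(n, K)))  # 90: ~90 trees to walk per look-up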

Index size: I think (didn’t calculate) that the combined size of the
B-trees will be about the same as (a little bit more than) the size of
a single big B-tree containing the same entries.

I have no idea whether someone already gave this kind of “forest of
B-trees” structure a name. Otherwise, I suggest “B-forest” :-).

In conclusion, use a “B-forest” when:

* The index entries are small (large fan-out).
* The insertion throughput is high.
* It’s OK for look-ups to be slow.
* Extra points when the storage medium has high seek times.

Major missing piece in PostgreSQL (I think):

* Functionality to merge K indexes into one (bigger) combined index.

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?


