Re: Fast insertion indexes: why no developments - Mailing list pgsql-hackers

From Nicolas Barbier
Subject Re: Fast insertion indexes: why no developments
Date
Msg-id CAP-rdTYTvuiMB1FKT53=cANP70deenAq_uO0Pj_D4UQEsTOTQA@mail.gmail.com
Whole thread Raw
In response to Re: Fast insertion indexes: why no developments  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
2013/11/12 Simon Riggs <simon@2ndquadrant.com>:

> On 12 November 2013 21:41, Nicolas Barbier <nicolas.barbier@gmail.com> wrote:
>
>> Look-up speed is as follows: Each look-up must look through all
>> B-trees.
>
> That can be optimised by using a min max approach, so we need only
> look at sub-trees that may contain data.

Under the assumption that the data are really *really* inserted in
random order, I think that min max indexes would not help much: All
subtrees would contain so many different values that the boundaries
will almost never be narrow enough to exclude scanning it (unless one
is searching for outliers).

I think that min max indexes are very useful for data that is inserted
in a some less-than-random way: When rather large swathes of inserted
data fall in between boundaries that a lot of other data doesn’t fall
in between. For min max index scans to be fast, the data doesn’t
exactly have to be ordered in the heap (that’s one of the advantages
vs. B-trees, where a different order in the index and the heap is very
bad for scans of any significant part of the index), but it can also
not be entirely arbitrary.

I would say that min max indexes should be used in either of the
following cases (and probably some more that I don’t think of right
now):

* When the data is inserted close to ordered (i.e., past or future
dates that have a tendency to be correlated with “the current day,”
such as invoice dates). For this usecase, a normal B-tree would also
work, but would take more space and require more heavy maintenance.
* When large batches of “similar” data (fitting in between boundaries
that are more narrow than the “global boundaries”) are inserted at a
time, even when multiple of such batches arrive in random order.
* When insertion is up to entirely random, but the goal is to optimize
look-ups for (relatively rare) outliers.
* (combined with any the above to actually make the index *useful*)
When needing a lot a indexes on the same data or having a very high
rate of insertion, and therefore the maintenance burden matters a lot.

> I would add that it is possible to optimise large DELETEs from a table
> if complete sub-trees of the btree can be easily removed. This for me
> would be the compelling benefit of this approach.

Idem WRT the randomness: If the data are really deleted in a
completely random fashion (e.g., Leonardo might want to delete all
phone calls in a certain year, while the index is on phone number),
whole subtrees would almost never become candidates for deletion (the
same problem applies to a normal min max index). Of course, everything
would change if the data is not really deleted randomly.

The same usecases as mentioned above (replace “insertion” with
“deletion”) seem to apply for deletion in min max indexes.

Note that B-forests as described before don’t work well for a high (or
even not-so-high) rate of deletions: During VACUUM the updates to the
bigger trees would kill all performance similar to how a high rate of
insertion kills the performance of normal B-trees once they get big.
To remedy this, one could adds stuff such as “B-trees of deleted
entries” (i.e., negative trees) that may then afterwards be merged
with other such “negative” trees + “positive” trees. Look-ups would
need to take all those trees into account.

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?



pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: Fast insertion indexes: why no developments
Next
From: Leonardo Francalanci
Date:
Subject: Re: Fast insertion indexes: why no developments