Re: Fast insertion indexes: why no developments - Mailing list pgsql-hackers
From | Nicolas Barbier
---|---
Subject | Re: Fast insertion indexes: why no developments
Date |
Msg-id | CAP-rdTYTvuiMB1FKT53=cANP70deenAq_uO0Pj_D4UQEsTOTQA@mail.gmail.com
In response to | Re: Fast insertion indexes: why no developments (Simon Riggs <simon@2ndQuadrant.com>)
List | pgsql-hackers
2013/11/12 Simon Riggs <simon@2ndquadrant.com>:
> On 12 November 2013 21:41, Nicolas Barbier <nicolas.barbier@gmail.com> wrote:
>
>> Look-up speed is as follows: Each look-up must look through all
>> B-trees.
>
> That can be optimised by using a min max approach, so we need only
> look at sub-trees that may contain data.

Under the assumption that the data are really *really* inserted in random order, I think that min max indexes would not help much: every subtree would contain so many different values that its boundaries would almost never be narrow enough to exclude it from the scan (unless one is searching for outliers).

I think that min max indexes are very useful for data that is inserted in some less-than-random way: when rather large swathes of inserted data fall between boundaries that a lot of other data does not fall between. For min max index scans to be fast, the data does not have to be exactly ordered in the heap (that is one of their advantages over B-trees, where a different order in the index and the heap is very bad for scans of any significant part of the index), but it cannot be entirely arbitrary either.

I would say that min max indexes should be used in any of the following cases (and probably some more that I can't think of right now):

* When the data is inserted close to ordered (i.e., past or future dates that tend to be correlated with "the current day," such as invoice dates). For this use case a normal B-tree would also work, but it would take more space and require heavier maintenance.

* When large batches of "similar" data (fitting between boundaries that are narrower than the "global boundaries") are inserted at a time, even when multiple such batches arrive in random order.

* When insertion is entirely random, but the goal is to optimize look-ups for (relatively rare) outliers.
* (Combined with any of the above, to actually make the index *useful*:) When a lot of indexes are needed on the same data, or the rate of insertion is very high, so that the maintenance burden matters a lot.

> I would add that it is possible to optimise large DELETEs from a table
> if complete sub-trees of the btree can be easily removed. This for me
> would be the compelling benefit of this approach.

Idem WRT the randomness: if the data are really deleted in a completely random fashion (e.g., Leonardo might want to delete all phone calls in a certain year, while the index is on phone number), whole subtrees would almost never become candidates for deletion (the same problem applies to a normal min max index). Of course, everything changes if the data is not deleted randomly: the same use cases as mentioned above (replace "insertion" with "deletion") seem to apply to deletion in min max indexes.

Note that the B-forests described before don't work well for a high (or even not-so-high) rate of deletion: during VACUUM, the updates to the bigger trees would kill all performance, similar to how a high rate of insertion kills the performance of normal B-trees once they get big. To remedy this, one could add structures such as "B-trees of deleted entries" (i.e., negative trees) that may afterwards be merged with other such "negative" trees and with "positive" trees. Look-ups would then need to take all those trees into account.

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?
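[Editor's note: the look-up scheme sketched in this message (search every tree in the forest whose min/max bounds admit the key, and treat a key found in a "negative" tree as deleted) can be illustrated roughly as below. The class names and the use of sorted lists as stand-ins for B-trees are illustrative inventions, not anything proposed in the thread; a real implementation would merge negative and positive trees in the background rather than consult all of them on every look-up.]

```python
import bisect

class MinMaxTree:
    """Stand-in for one B-tree of the forest: a sorted key list plus its
    min/max bounds, so look-ups can skip trees whose range excludes the key."""
    def __init__(self, keys):
        self.keys = sorted(keys)
        self.lo, self.hi = self.keys[0], self.keys[-1]

    def may_contain(self, key):
        # Min/max pruning: only trees whose bounds admit the key are scanned.
        return self.lo <= key <= self.hi

    def contains(self, key):
        i = bisect.bisect_left(self.keys, key)
        return i < len(self.keys) and self.keys[i] == key

class BForest:
    """Many small 'positive' trees (cheap batch insertion) plus 'negative'
    trees that record deletions, as suggested at the end of the message."""
    def __init__(self):
        self.positive = []
        self.negative = []

    def insert_batch(self, keys):
        self.positive.append(MinMaxTree(keys))

    def delete_batch(self, keys):
        # Instead of updating the big trees in place, record the deletions.
        self.negative.append(MinMaxTree(keys))

    def lookup(self, key):
        # A key found in any negative tree counts as deleted.
        for t in self.negative:
            if t.may_contain(key) and t.contains(key):
                return False
        # Otherwise search every positive tree whose bounds admit the key.
        return any(t.may_contain(key) and t.contains(key)
                   for t in self.positive)
```

Note the simplification: re-inserting a key after deleting it would wrongly still look deleted here, which is exactly why the message says the negative trees must eventually be merged with the positive ones.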