Simon Riggs <simon@2ndquadrant.com> writes:
> Recent discussions on PERFORM have made me look into some aspects of
> B-tree index code, especially with regard to bulk loading high volumes
> of data.
Have you read the archived discussions that led up to the current
algorithm? I don't think it's nearly as bad as you believe. In
particular, I think you missed the point that the move-or-split
decision is *random* and is therefore made independently by each inserter.
Therefore the probability that the earlier pages get split rises rapidly
as more and more insertions are made --- and it only takes one split
decision to take the pressure off.
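(A toy standalone simulation of that dynamic may make this concrete.
It is emphatically not the backend code: the page model, the rand()
coin flip, and all the constants below are simplifications for
illustration. It just demonstrates the claim: the average number of
pages examined per insertion stays small even as the chain of
duplicate pages grows.)

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAGE_CAPACITY   500     /* entries per page, int4-ish */
    #define MOVE_RIGHT_PROB 0.99    /* chance of moving right at a full page */
    #define N_INSERTS       1000000
    #define MAX_PAGES       10000

    int
    main(void)
    {
        static int fill[MAX_PAGES]; /* fill count of each duplicate page */
        int     npages = 3;
        long    visits = 0;         /* total pages examined */
        int     i;

        /* starting state from the example below: two full pages, then
         * a half-full one */
        fill[0] = fill[1] = PAGE_CAPACITY;
        fill[2] = PAGE_CAPACITY / 2;

        srand(42);
        for (i = 0; i < N_INSERTS; i++)
        {
            int     p = 0;

            /* move right past full pages; at each, a 1% chance of
             * giving up and splitting instead */
            visits++;
            while (p < npages - 1 && fill[p] >= PAGE_CAPACITY &&
                   (double) rand() / RAND_MAX < MOVE_RIGHT_PROB)
            {
                p++;
                visits++;
            }

            if (fill[p] >= PAGE_CAPACITY)
            {
                /* split: half the entries move to a new page at p+1 */
                memmove(&fill[p + 2], &fill[p + 1],
                        (npages - p - 1) * sizeof(int));
                fill[p] = fill[p + 1] = PAGE_CAPACITY / 2;
                npages++;
            }
            fill[p]++;          /* the insertion itself */
        }

        printf("pages: %d  avg pages examined per insert: %.2f\n",
               npages, (double) visits / N_INSERTS);
        return 0;
    }

(Build it with any C compiler; it prints the final chain length and
the average walk length after a million insertions.)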
It's entirely possible that the 0.99 figure needs some fine tuning.
IIRC, the experiments we did to choose that number were done with
pretty simple test indexes --- probably just int4. Thinking about
the behavior, it seems plausible that the figure needs to drop as
the number of entries per page drops ... but we have not tested that.
In an int4 index on Intel-ish hardware (MAXALIGN 4), you can fit about
500 entries per page. So consider a case where the first 2 pages for a
given value are full and the third is half full. Filling the third
completely will require 250 insertions, and by then there is a very
good chance (more than 90%, if I did the math right) that someone will
have decided to split rather than move right at the second page. After
that the second page fills, and then we are back to the original state
(because the new third page will be half full). So I claim that in fact
the behavior *is* constant time: most insertions will succeed on either
the second or third page, indefinitely. However, if only a few entries
fit per page, the behavior would obviously be much worse. (OTOH, the
wider the index values, the lower the probability of exact duplicates
anyway, I'd think, so we may be wasting our time worrying about the
behavior there.)
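(To double-check the >90% figure, and the guess that 0.99 should drop
for wider entries: the chance that none of n inserters passing a full
page chooses to split it is 0.99^n. The n = 5 case below assumes, just
for illustration, entries so wide that only ~10 fit per page, so a new
half-full page absorbs only about 5 insertions before filling:

    #include <stdio.h>
    #include <math.h>

    int
    main(void)
    {
        /* chance that at least one of n independent inserters decides
         * to split, when each moves right with probability 0.99 */
        printf("n = 250: %.3f\n", 1.0 - pow(0.99, 250));    /* ~0.919 */
        printf("n = 5:   %.3f\n", 1.0 - pow(0.99, 5));      /* ~0.049 */
        return 0;
    }

So with 500 entries per page the pressure comes off about 92% of the
time before the third page fills, but at ~10 entries per page that
drops to about 5%, which supports the suspicion above that the figure
should fall as entries per page drops.)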
regards, tom lane