On Thu, 2005-04-14 at 10:35 -0400, Tom Lane wrote:
> Simon Riggs <> writes:
> > Recent discussions on PERFORM have made me look into some aspects of
> > B-tree index code, especially with regard to bulk loading high volumes
> > of data.
>
> Have you read the archived discussions that led up to the current
> algorithm? I don't think it's nearly as bad as you believe. In
> particular, I think you missed the point that the move-or-split
> decision is *random* and is therefore made differently by each inserter.
> Therefore the probability that the earlier pages get split rises rapidly
> as more and more insertions are made --- and it only takes one split
> decision to take the pressure off.
Yes, and I did the math too. The cumulative probability of having to
search more than 20 blocks before you split is 82%, and there is a 36%
chance of searching more than 100.
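Those figures follow directly from treating each inserter's move-or-split
choice as an independent 0.99/0.01 decision; a quick sketch (Python,
assuming that model):

```python
# Under the current algorithm, an inserter that hits a full page moves
# right with probability 0.99 and splits with probability 0.01.  The
# chance that none of the first k move-or-split decisions was a split,
# i.e. that a search still has to pass k blocks, is 0.99^k.
P_MOVE = 0.99

def p_search_beyond(k):
    """Probability that no split has occurred after k move-or-split decisions."""
    return P_MOVE ** k

print(f"P(search > 20 blocks)  = {p_search_beyond(20):.3f}")   # ~0.818
print(f"P(search > 100 blocks) = {p_search_beyond(100):.3f}")  # ~0.366
```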
> It's entirely possible that the 0.99 figure needs some fine tuning.
That would be a simple approach, though has downsides as discussed.
> IIRC, the experiments we did to choose that number were done with
> pretty simple test indexes --- probably just int4. Thinking about
> the behavior, it seems plausible that the figure needs to drop as
> the number of entries per page drops ... but we have not tested that.
> In an int4 index on Intel-ish hardware (MAXALIGN 4), you can fit about
> 500 entries per page. So consider a case where the first 2 pages for a
> given value are full and the third is half full. To fill the third
> completely will require 250 insertions, by which time there is a very
> good chance (more than 90% if I did the math right) that someone will
> have decided to split rather than move right at the second page. After
> that the second page fills, and then we are back to the original state
> (because the new third page will be half full). So I claim that in fact
> the behavior *is* constant time: most insertions will succeed on either
> the second or third page, indefinitely. However, obviously if there are
> only a few values per page, you would get much worse behavior. (OTOH,
> the wider the index values, the lower the probability of exact
> duplicates anyway, I'd think, so we may be wasting our time to worry
> about the behavior there.)
For once, I beg to note that the above maths is not correct, because the
algorithm doesn't work exactly that way.
The move right only occurs when the page is full, so the chance of
moving right is not 0.99^250 but 0.99, since the previous 249 inserts
would not cause a page split. The probability does not drop away as you
suggest, and as a result the operation is not constant time. IMHO the
performance figures bear this out.
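The gap between the two readings is easy to put in numbers; a sketch
(Python, simply restating each model as described above, without
adjudicating between them):

```python
P_MOVE = 0.99  # probability an inserter moves right rather than splits

# Reading 1 (the quoted analysis): each of the 250 insertions into the
# half-full third page makes its own move-or-split decision at the full
# second page, so a split there is nearly certain before the page fills:
p_split_per_insert = 1 - P_MOVE ** 250
print(f"P(split within 250 decisions) = {p_split_per_insert:.3f}")  # ~0.919

# Reading 2 (this mail): the move-or-split choice only arises when the
# page an inserter lands on is actually full, so the chance of moving
# right stays a flat 0.99 per decision and never compounds away:
p_move_at_full_page = P_MOVE
print(f"P(move right at a full page)  = {p_move_at_full_page:.2f}")  # 0.99
```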
Yes, with 500 entries per page, it would take 1,500 rows per index value
before the proposed new algorithm makes *any* difference at all. Since
people often build indexes on columns that have skewed distributions,
the time taken for the most frequent values (MFVs) is of particular
concern here. In a million+ row table we are *very* likely to find that
many rows per index value even amongst infrequently occurring values,
and still find the index has useful selectivity.
My viewpoint is, as ever, towards large and high performance databases.
Best Regards, Simon Riggs