Recent discussions on PERFORM have made me look into some aspects of
B-tree index code, especially with regard to bulk loading high volumes
of data.
I now have cause for concern about the way that Btree index code
currently works when inserting large volumes of data into a table with
non-unique indexes, which is most tables with > 1 index.
_bt_insertonpg contains an algorithm to locate a page to insert into.
Since the index is unique, if there are many rows in the table then
there could be potentially a large number of possible insertion points.
line 404 onwards says* If we will need to split the page to put the item here,* check whether we can put the tuple
somewhereto the right,* instead. Keep scanning right until we* (a) find a page with enough free space,* (b)
reachthe last page where the tuple can legally go, or* (c) get tired of searching.* (c) is not flippant; it is
importantbecause if there are many* pages' worth of equal keys, it's better to split one of the early* pages than to
scanall the way to the end of the run of equal keys* on every insert. We implement "get tired" as a random choice,*
sincestopping after scanning a fixed number of pages wouldn't work* well (we'd never reach the right-hand side of
previouslysplit* pages). Currently the probability of moving right is set at 0.99,* which may seem too high to change
thebehavior much, but it does an* excellent job of preventing O(N^2) behavior with many equal keys.
from v1.62
The end result of this is that the first N records fill the first block,
then the next N records search the first block, start to fill the next
block. When that is full we move to the next, etc. So as we add rows
index insertion time increases. The probability we move right is high,
so we soon grow to the situation where we move right a large number of
times to find an insertion point. Once a page has been split, all new
inserts must make there way across the same pages to the new insertion
point.
After many rows had been inserted, the steady state will be that all
blocks at the start of that index value will be full apart from the
current insertion point.
With a current probability of moving right of 0.99, the likelihood that
the current insertion point is more than 20 blocks away from the start
of the index value is 82%!! With a large index, many of these would not
be in cache, so I/O is going to show close to O(N^2) behaviour for at
least the first 10-20 pages until the randomness of the algorithm kicks
in.
Any algorithm to find an insertion point has a number of factors to
trade-off:
a) insertion cost
b) packing density
c) ordering of index to heap data
However you cut it, you can only fit so many records in a block, so when
averaged out the number of page splits cannot be less than a constant
wherever you choose to split, assuming we have constant length index
keys (for now).
Current algorithm attempts to locally minimise a) and minimise b).
The current algorithm is "greedy" because it tries to avoid page
splitting by searching for a space created by somebody else, rather than
performing the action itself. The end result is that everybody pays a
much higher cost overall.
When loading large number of rows, a greedy algorithm makes no sense at
all, since you are only competing with yourself. A non-greedy algorithm
would set the probability of moving right = 0, minimising insertion cost
since we would always try to insert the index entry into the first block
and if no space, then split.
If we reduce the probability of moving right this reduces insertion time
considerably, though with the problem that we would allow the index to
increase further in size, eventually ending up at twice as large since
we split each page in two. This would waste disk space, but also waste
RAM since most index pages would be half full, as well as doubling the
time taken for a full index scan.
That effect prevents me from suggesting the very simple change of
reducing the probability of moving right to 90%, which would
dramatically improves the insertion time behaviour.
Reducing the probability of moving right should only be done in
conjunction with making an imbalanced split so that the right hand index
page would be almost full.
The combined approach would offer:
a) insertion cost very low
b) packing density very high
and as a minor addition...
c) excellent ordering of index -> heap (when heap is ordered)
Changing the algorithm for all cases would probably be a bad thing,
since not all indexes are highly non-unique and the way it works now is
tried and tested.
An observation: If we ever scan 3 or more index blocks looking for an
insertion point we know that the middle block only has rows for this
index value. Once we know that we have a block entirely dedicated to an
index value, we can safely take another strategy to locate the insertion
point.
Proposed algorithm would then be* If we will need to split the page to put the item here,* check whether we can put the
tuplesomewhere to the right,* instead. Keep scanning right until we* (a) find a page with enough free space,*
(b)reach the last page where the tuple can legally go, or* (c) we have searched 3 pages without finding free space*
Ifwe find ourselves at (c), then we return to the first page and * recheck for freespace. If no freespace, we move
rightand while* holding the lock on the first page retest the next page for free* space. Those two actions are required
becauseit is possible that* somebody else may have got there first and split either page to* create new space. * If
there'sspace, we use it, if not, we insert a completely new index* page between the 1st and 2nd pages, and place new
rowupon that page.
This algorithm keeps the current behaviour for nearly unique indexes,
but greatly improves the behaviour for highly non-unique indexes so that
the maximum insert time is now a low constant. Most inserts will take
only 1-2 leaf page accesses, while a page split will take 5 (plus the
recursive rearrangement of parent pages which is unchanged)
This should allow high volume inserts to occur both faster and at a
constant rate, rather than reducing as table size increases. Concurrency
of index access and overall scalability would not be effected.
The packing density of highly non-unique indexes is also improved
somewhat, so index scan performance should improve slightly. Vacuum is
not likely to be greatly effected by these changes.
Page access is no longer fully sequential, since we return to our first
page, though the pages will almost certainly be still in shared_buffers,
so will be unlikely to cause any random I/O.
The new algorithm is implementable by code change, rather than changing
index page structure, with all change isolated to _bt_insertonpg and the
creation of a new _bt_nonunique_split function.
If the index has widely variable length keys (which is always a bad idea
anyway) then it does make some sense to attempt to test lots of blocks
to see where it might fit. The algorithm does give you 3 chances to find
freespace, so that should alleviate some of the effects of highly
variable index value lengths.
Comments?
Best Regards, Simon Riggs

Simon Riggs <> writes:
> Recent discussions on PERFORM have made me look into some aspects of
> B-tree index code, especially with regard to bulk loading high volumes
> of data.
Have you read the archived discussions that led up to the current
algorithm? I don't think it's nearly as bad as you believe. In
particular, I think you missed the point that the move-or-split
decision is *random* and is therefore made differently by each inserter.
Therefore the probability that the earlier pages get split rises rapidly
as more and more insertions are made --- and it only takes one split
decision to take the pressure off.
It's entirely possible that the 0.99 figure needs some fine tuning.
IIRC, the experiments we did to choose that number were done with
pretty simple test indexes --- probably just int4. Thinking about
the behavior, it seems plausible that the figure needs to drop as
the number of entries per page drops ... but we have not tested that.
In an int4 index on Intel-ish hardware (MAXALIGN 4), you can fit about
500 entries per page. So consider a case where the first 2 pages for a
given value are full and the third is half full. To fill the third
completely will require 250 insertions, by which time there is a very
good chance (more than 90% if I did the math right) that someone will
have decided to split rather than move right at the second page. After
that the second page fills, and then we are back to the original state
(because the new third page will be half full). So I claim that in fact
the behavior *is* constant time: most insertions will succeed on either
the second or third page, indefinitely. However, obviously if there are
only a few values per page, you would get much worse behavior. (OTOH,
the wider the index values, the lower the probability of exact
duplicates anyway, I'd think, so we may be wasting our time to worry
about the behavior there.)
regards, tom lane

On Thu, 2005-04-14 at 10:35 -0400, Tom Lane wrote:
> Simon Riggs <> writes:
> > Recent discussions on PERFORM have made me look into some aspects of
> > B-tree index code, especially with regard to bulk loading high volumes
> > of data.
>
> Have you read the archived discussions that led up to the current
> algorithm? I don't think it's nearly as bad as you believe. In
> particular, I think you missed the point that the move-or-split
> decision is *random* and is therefore made differently by each inserter.
> Therefore the probability that the earlier pages get split rises rapidly
> as more and more insertions are made --- and it only takes one split
> decision to take the pressure off.
Yes, and did the math too. The cumulative probability of having to
search more than 20 blocks before you split is 82%.
..36% chance of searching more than 100.
> It's entirely possible that the 0.99 figure needs some fine tuning.
That would be a simple approach, though has downsides as discussed.
> IIRC, the experiments we did to choose that number were done with
> pretty simple test indexes --- probably just int4. Thinking about
> the behavior, it seems plausible that the figure needs to drop as
> the number of entries per page drops ... but we have not tested that.
> In an int4 index on Intel-ish hardware (MAXALIGN 4), you can fit about
> 500 entries per page. So consider a case where the first 2 pages for a
> given value are full and the third is half full. To fill the third
> completely will require 250 insertions, by which time there is a very
> good chance (more than 90% if I did the math right) that someone will
> have decided to split rather than move right at the second page. After
> that the second page fills, and then we are back to the original state
> (because the new third page will be half full). So I claim that in fact
> the behavior *is* constant time: most insertions will succeed on either
> the second or third page, indefinitely. However, obviously if there are
> only a few values per page, you would get much worse behavior. (OTOH,
> the wider the index values, the lower the probability of exact
> duplicates anyway, I'd think, so we may be wasting our time to worry
> about the behavior there.)
For once, I beg to note that the above maths is not correct, because the
algorithm doesn't work exactly that way.
The move right only occurs when the page is full, so the chance of
moving right is not 0.99^250, but 0.99, since the previous 249 inserts
would not cause a page split. The probability does not drop away as you
suggest and the operation is not constant time as a result. IMHO the
performance figures show this to be true.
Yes, with 500 entries per page, it would take 1500 rows/per index value
before the proposed new algorithm makes *any* difference at all. Since
people often build indexes on columns that have skewed distributions,
the time taken for MFVs is of particular concern in this regard. In a
million+ row table we are are *very* likely to find such numbers of
rows/index value even amongst infrequently occurring values and still
find the index has useful selectivity.
My viewpoint is, as ever, towards large and high performance databases.
Best Regards, Simon Riggs

Simon Riggs <> writes:
> The move right only occurs when the page is full, so the chance of
> moving right is not 0.99^250, but 0.99, since the previous 249 inserts
> would not cause a page split.
Sure, but given that we have a full page, the probability that 250
successive insertions *all* decide to move right rather than split
that page is 0.99^250. And it only takes one decision to split to
maintain the constant-time behavior. So I still think your analysis
is incorrect.
> IMHO the performance figures show this to be true.
*What* performance figures? You have shown none. We did do performance
testing of this algorithm when we adopted it, and it worked fine ---
though as I say, I don't think we tested with any very wide keys.
regards, tom lane

On Thu, 2005-04-14 at 11:15 -0400, Tom Lane wrote:
> Simon Riggs <> writes:
> > The move right only occurs when the page is full, so the chance of
> > moving right is not 0.99^250, but 0.99, since the previous 249 inserts
> > would not cause a page split.
>
> Sure, but given that we have a full page, the probability that 250
> successive insertions *all* decide to move right rather than split
> that page is 0.99^250. And it only takes one decision to split to
> maintain the constant-time behavior. So I still think your analysis
> is incorrect.
OK... point accepted. Darn, but also thank goodness it performs.
P(N) > 0.999 for W byte keys, at...
N W Mean blocks read/insert
3 4 bytes 1.1
5 8 bytes 1.4
11 16 bytes 2.1
22 32 bytes 3.6
43 64 bytes 6.7
83 128 bytes 12.5
lots 256 bytes 23
> > IMHO the performance figures show this to be true.
>
> *What* performance figures?
The figures shown on PERFORM recently, with graphs. We still have a
performance issue with insertion rate for large indexes.
Best Regards, Simon Riggs

Just to check if I was nuts or not, I made up a test case:
create table foo (f1 text);
create index fooi on foo(f1);
then
truncate foo;
copy foo from stdin;
0000999999
0000999998
0000999997
... one million rows ...
0000000002
0000000001
0000000000
\.
versus
truncate foo;
copy foo from stdin;
xxxxxxxxxx
xxxxxxxxxx
xxxxxxxxxx
... one million rows ...
xxxxxxxxxx
xxxxxxxxxx
xxxxxxxxxx
\.
The first of these should of course force a btree split on the first
page each time it splits, while the second will involve the
probabilistic moveright on each split. But the files will be exactly
the same size.
[tgl@rh1 ~]$ time psql -f zdecr10 test
TRUNCATE TABLE
real 1m41.681s
user 0m1.424s
sys 0m0.957s
[tgl@rh1 ~]$ time psql -f zsame10 test
TRUNCATE TABLE
real 1m40.927s
user 0m1.409s
sys 0m0.896s
[tgl@rh1 ~]$
And it just happens that I had this server built with profiling enabled,
so I was able to look at the stats for each run, and I see that
_bt_compare is actually called slightly *fewer* times in the
all-identical-data case.
zdecr10: 2954536 _bt_moveright <cycle 3> [57]
20938485 _bt_binsrch <cycle 3> [48] 0.05 0.18 6491/3006701 _bt_insertonpg <cycle 2>
[16]
[33] 4.9 11.73 0.00 23899512 _bt_compare <cycle 3> [33] 21944768
FunctionCall2 <cycle 3> [24]
zsame10: 2935922 _bt_moveright <cycle 3> [62]
17156948 _bt_binsrch <cycle 3> [54] 3.45 11.09 465429/3072793 _bt_insertonpg <cycle 2>
[16]
[31] 5.0 11.56 0.00 20558299 _bt_compare <cycle 3> [31] 18622167
FunctionCall2 <cycle 3> [24]
So the theory does work, at least for small index entries. Currently
repeating with wider ones ...
regards, tom lane

On Thu, 2005-04-14 at 12:10 -0400, Tom Lane wrote:
> The first of these should of course force a btree split on the first
> page each time it splits, while the second will involve the
> probabilistic moveright on each split. But the files will be exactly
> the same size.
>
> [tgl@rh1 ~]$ time psql -f zdecr10 test
> TRUNCATE TABLE
>
> real 1m41.681s
> user 0m1.424s
> sys 0m0.957s
> [tgl@rh1 ~]$ time psql -f zsame10 test
> TRUNCATE TABLE
>
> real 1m40.927s
> user 0m1.409s
> sys 0m0.896s
> [tgl@rh1 ~]$
I think thats conclusive.
> So the theory does work, at least for small index entries. Currently
> repeating with wider ones ...
I think we should adjust the probability for longer item sizes - many
identifiers can be 32 bytes and there are many people with a non-unique
URL column for example. An average of over 2 blocks/insert at 16 bytes
is still one too many for my liking, though I do understand the need for
the randomness.
I'd suggest a move right probability of 97% (divide by 16) for itemsz >
16 bytes and 94% (divide by 32) when itemsz >= 128
Though I think functional indexes are the way to go there.
Best Regards, Simon Riggs

On Thu, Apr 14, 2005 at 06:36:44PM +0100, Simon Riggs wrote:
> I'd suggest a move right probability of 97% (divide by 16) for itemsz >
> 16 bytes and 94% (divide by 32) when itemsz >= 128
>
> Though I think functional indexes are the way to go there.
Why?
On a somewhat different note, what about storing or caching info about
what page has free space on it? ISTM that will be much more performant
than any kind of search, though you still do need to be careful about
bulk-inserts. My thought is this:
Go to the last page that we knew to have free space on it (last_free_page)
If that page is now full, continue scanning right
If we hit the end of appropriate pages, do a page split
Finally, when vacuum frees space in the index, it needs to set
last_free_page to the left-most page with free space, so that all free
space will be used.
That would be the 'greediest' version of this algorithm possible. In
reality, it probably makes more sense to combine this with the current
behavior, so that you have a random chance of stopping your scan to the
right and just doing a page split wherever you are in the scan. This
would still be a win because most of the time you'd go straight to a
page with free space on it.
As for storing the extra info, there's 2 possibilities that come to
mind. You can either store it in the index itself, or you can cache it
in shared memory. The advantage of just caching it is that you don't
need to change the on-disk index format. You also wouldn't need to
update something on disk every time last_free_page changes. The
disadvantage is that it's probably more complex to code, and you
obviously lose last_free_page info on restart and for index values that
see fewer inserts as the info for those index values falls out of the
cache.
In either case, you only want to worry about doing this for index values
that span more than a few pages (maybe the magic 3 that means you've got
at least one index page that's nothing but values for this index). For
more unique indexes, the extra overhead doesn't make sense.
Can someone think of a way to check the performance of this idea without
actually coding it?
--
Jim C. Nasby, Database Consultant
Give your computer some brain candy! www.distributed.net Team #1828
Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Your message is accepted. The reference number is .