Constant time insertion into highly non-unique indexes - Mailing list pgsql-hackers
From | Simon Riggs |
---|---|
Subject | Constant time insertion into highly non-unique indexes |
Date | |
Msg-id | 1113483086.16721.1869.camel@localhost.localdomain Whole thread Raw |
Responses |
Re: Constant time insertion into highly non-unique indexes
|
List | pgsql-hackers |
Recent discussions on PERFORM have made me look into some aspects of B-tree index code, especially with regard to bulk loading high volumes of data. I now have cause for concern about the way that Btree index code currently works when inserting large volumes of data into a table with non-unique indexes, which is most tables with > 1 index. _bt_insertonpg contains an algorithm to locate a page to insert into. Since the index is unique, if there are many rows in the table then there could be potentially a large number of possible insertion points. line 404 onwards says* If we will need to split the page to put the item here,* check whether we can put the tuple somewhereto the right,* instead. Keep scanning right until we* (a) find a page with enough free space,* (b) reachthe last page where the tuple can legally go, or* (c) get tired of searching.* (c) is not flippant; it is importantbecause if there are many* pages' worth of equal keys, it's better to split one of the early* pages than to scanall the way to the end of the run of equal keys* on every insert. We implement "get tired" as a random choice,* sincestopping after scanning a fixed number of pages wouldn't work* well (we'd never reach the right-hand side of previouslysplit* pages). Currently the probability of moving right is set at 0.99,* which may seem too high to change thebehavior much, but it does an* excellent job of preventing O(N^2) behavior with many equal keys. from v1.62 The end result of this is that the first N records fill the first block, then the next N records search the first block, start to fill the next block. When that is full we move to the next, etc. So as we add rows index insertion time increases. The probability we move right is high, so we soon grow to the situation where we move right a large number of times to find an insertion point. Once a page has been split, all new inserts must make there way across the same pages to the new insertion point. After many rows had been inserted, the steady state will be that all blocks at the start of that index value will be full apart from the current insertion point. With a current probability of moving right of 0.99, the likelihood that the current insertion point is more than 20 blocks away from the start of the index value is 82%!! With a large index, many of these would not be in cache, so I/O is going to show close to O(N^2) behaviour for at least the first 10-20 pages until the randomness of the algorithm kicks in. Any algorithm to find an insertion point has a number of factors to trade-off: a) insertion cost b) packing density c) ordering of index to heap data However you cut it, you can only fit so many records in a block, so when averaged out the number of page splits cannot be less than a constant wherever you choose to split, assuming we have constant length index keys (for now). Current algorithm attempts to locally minimise a) and minimise b). The current algorithm is "greedy" because it tries to avoid page splitting by searching for a space created by somebody else, rather than performing the action itself. The end result is that everybody pays a much higher cost overall. When loading large number of rows, a greedy algorithm makes no sense at all, since you are only competing with yourself. A non-greedy algorithm would set the probability of moving right = 0, minimising insertion cost since we would always try to insert the index entry into the first block and if no space, then split. If we reduce the probability of moving right this reduces insertion time considerably, though with the problem that we would allow the index to increase further in size, eventually ending up at twice as large since we split each page in two. This would waste disk space, but also waste RAM since most index pages would be half full, as well as doubling the time taken for a full index scan. That effect prevents me from suggesting the very simple change of reducing the probability of moving right to 90%, which would dramatically improves the insertion time behaviour. Reducing the probability of moving right should only be done in conjunction with making an imbalanced split so that the right hand index page would be almost full. The combined approach would offer: a) insertion cost very low b) packing density very high and as a minor addition... c) excellent ordering of index -> heap (when heap is ordered) Changing the algorithm for all cases would probably be a bad thing, since not all indexes are highly non-unique and the way it works now is tried and tested. An observation: If we ever scan 3 or more index blocks looking for an insertion point we know that the middle block only has rows for this index value. Once we know that we have a block entirely dedicated to an index value, we can safely take another strategy to locate the insertion point. Proposed algorithm would then be* If we will need to split the page to put the item here,* check whether we can put the tuplesomewhere to the right,* instead. Keep scanning right until we* (a) find a page with enough free space,* (b)reach the last page where the tuple can legally go, or* (c) we have searched 3 pages without finding free space* Ifwe find ourselves at (c), then we return to the first page and * recheck for freespace. If no freespace, we move rightand while* holding the lock on the first page retest the next page for free* space. Those two actions are required becauseit is possible that* somebody else may have got there first and split either page to* create new space. * If there'sspace, we use it, if not, we insert a completely new index* page between the 1st and 2nd pages, and place new rowupon that page. This algorithm keeps the current behaviour for nearly unique indexes, but greatly improves the behaviour for highly non-unique indexes so that the maximum insert time is now a low constant. Most inserts will take only 1-2 leaf page accesses, while a page split will take 5 (plus the recursive rearrangement of parent pages which is unchanged) This should allow high volume inserts to occur both faster and at a constant rate, rather than reducing as table size increases. Concurrency of index access and overall scalability would not be effected. The packing density of highly non-unique indexes is also improved somewhat, so index scan performance should improve slightly. Vacuum is not likely to be greatly effected by these changes. Page access is no longer fully sequential, since we return to our first page, though the pages will almost certainly be still in shared_buffers, so will be unlikely to cause any random I/O. The new algorithm is implementable by code change, rather than changing index page structure, with all change isolated to _bt_insertonpg and the creation of a new _bt_nonunique_split function. If the index has widely variable length keys (which is always a bad idea anyway) then it does make some sense to attempt to test lots of blocks to see where it might fit. The algorithm does give you 3 chances to find freespace, so that should alleviate some of the effects of highly variable index value lengths. Comments? Best Regards, Simon Riggs
pgsql-hackers by date: