Re: [PERFORM] A Better External Sort? - Mailing list pgsql-hackers

From Ron Peacetree
Subject Re: [PERFORM] A Better External Sort?
Msg-id 19483839.1127783447942.JavaMail.root@elwamui-polski.atl.sa.earthlink.net
In response to [PERFORM] A Better External Sort?  (Ron Peacetree <rjpeace@earthlink.net>)
Responses Re: [PERFORM] A Better External Sort?  (Josh Berkus <josh@agliodbs.com>)
List pgsql-hackers
>From: Dann Corbit <DCorbit@connx.com>
>Sent: Sep 26, 2005 5:13 PM
>To: Ron Peacetree <rjpeace@earthlink.net>, pgsql-hackers@postgresql.org,
>    pgsql-performance@postgresql.org
>Subject: RE: [HACKERS] [PERFORM] A Better External Sort?
>
>I think that the btrees are going to be O(n*log(n)) in construction of
>the indexes in disk access unless you memory map them [which means you
>would need stupendous memory volume] and so I cannot say that I really
>understand your idea yet.
>
Traditional algorithms for the construction of Btree variants (B, B+, B*, ...)
don't require O(nlgn) HD accesses.  These shouldn't either.

Let's start by assuming that an element is no larger than a cache line and
that a node fits into L1 DCache.  To make the discussion more concrete, I'll
use a 64KB L1 cache + a 1MB L2 cache only as an example.

Simplest case: the Key has few enough distinct values that all Keys or
KeyPrefixes fit into L1 DCache (for a 64KB cache with 64B lines, that's
64KB/64B = 1024 lines, so roughly 1000 different values.  More if we can
fit more than 1 element into each cache line.).

As we scan the data set coming in from HD, we compare the Key or KeyPrefix
to the sorted list of Key values in the node.  This can be done in O(lgn) using
Binary Search or O(lglgn) using a variation of Interpolation Search.
If the Key value exists, we append this RID to the list of RIDs having the
same Key:
  If the RAM buffer of this list of RIDs is full we append it and the current
  RID to the HD list of these RIDs.
Else we insert this new Key value into its proper place in the sorted list of
Key values in the node and start a new list of RIDs for this Key value.
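
To make the lookup step concrete, here is a minimal Python sketch of the two
searches mentioned above.  It assumes integer Keys held in a sorted list; the
function names are mine, not from any existing code.  Note that the O(lglgn)
figure for Interpolation Search is an average that holds for roughly uniformly
distributed Keys; skewed Keys can do much worse.

```python
from bisect import bisect_left


def binary_search(keys, target):
    """Return the index of target in the sorted list keys, or -1 if absent."""
    i = bisect_left(keys, target)
    return i if i < len(keys) and keys[i] == target else -1


def interpolation_search(keys, target):
    """Same contract as binary_search, but guess the position from the value."""
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        if keys[hi] == keys[lo]:
            # Remaining keys are all equal, so they all equal target.
            return lo
        # Estimate where target would sit if keys were evenly spaced.
        pos = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
        if keys[pos] == target:
            return pos
        if keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1
```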

We allocate room for a CPU write buffer so we can schedule RAM writes to
the RAM lists of RIDs so as to minimize the randomness of them.

When we are finished scanning the data set from HD, the sorted node with
RID lists for each Key value contains the sort order for the whole data set.
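
A rough Python sketch of this single-node case, for illustration only.  Plain
Python lists stand in for the cache-resident node, the RAM buffers, and the HD
lists; RID_BUFFER_SIZE and the function names are made up for the example, and
the CPU write buffer is not modeled.

```python
from bisect import bisect_left

RID_BUFFER_SIZE = 4  # hypothetical: RIDs buffered in RAM per Key value


def build_node(records):
    """Scan (rid, key) pairs once; return (keys, ram, disk)."""
    keys = []   # sorted distinct Key values: the node
    ram = []    # ram[i]: RAM buffer of RIDs for keys[i]
    disk = []   # disk[i]: stand-in for the HD list of RIDs for keys[i]
    for rid, key in records:
        i = bisect_left(keys, key)
        if i == len(keys) or keys[i] != key:
            # New Key value: insert it in sorted position with empty lists.
            keys.insert(i, key)
            ram.insert(i, [])
            disk.insert(i, [])
        if len(ram[i]) == RID_BUFFER_SIZE:
            # Buffer full: append it and the current RID to the HD list.
            disk[i].extend(ram[i])
            disk[i].append(rid)
            ram[i].clear()
        else:
            ram[i].append(rid)
    return keys, ram, disk


def sorted_rids(keys, ram, disk):
    """Walk the node in Key order; RIDs come out in sort order."""
    for i in range(len(keys)):
        yield from disk[i]
        yield from ram[i]
```

Walking the node with sorted_rids gives the same order as a stable sort of
the records by Key, without ever moving the records themselves.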

Notice that almost all of the random data access is occurring within the CPU
rather than in RAM or HD, and that we are accessing RAM or HD only when
absolutely needed.

Next simplest case: Multiple nodes, but they all fit in the CPU cache(s).
In the given example CPU, we will be able to fit at least 1000 elements per
node and 2^20/2^16= up to 16 such nodes in this CPU.  We use a node's
worth of space as a RAM write buffer, so we end up with room for 15 such
nodes in this CPU.  This is enough for a 2 level index to at least 15,000
distinct Key value lists.
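
The arithmetic behind those numbers, spelled out in Python.  The cache sizes
are the ones from the example above; the one-element-per-cache-line assumption
is the conservative case.

```python
L1_BYTES = 64 * 1024      # 64KB L1 DCache (example figure)
L2_BYTES = 1024 * 1024    # 1MB L2 cache (example figure)
LINE_BYTES = 64           # 64B cache line, one element per line assumed

elements_per_node = L1_BYTES // LINE_BYTES  # 1024, i.e. "at least 1000"
nodes_in_l2 = L2_BYTES // L1_BYTES          # 2^20 / 2^16 = 16
index_nodes = nodes_in_l2 - 1               # one node's worth is the write buffer

print(elements_per_node, nodes_in_l2, index_nodes)
```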

All of the traditional tricks for splitting a Btree node and redistributing
elements within them during insertion or splitting for maximum node
utilization can be used here.

The most general case: There are too many nodes to fit within the CPU
cache(s).  The root node can now point to at least 1000 nodes
since each element in the root node points to another node.  A full 2 level
index is now enough to point to at least 10^6 distinct Key value lists, and
3 levels will index more distinct Key values than is possible in our 1TB,
500M record example.
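
As a quick check on the level counts, a small Python sketch.  It assumes a
fanout of 1000 elements per node, as above; levels_needed is a name made up
for this example.

```python
def levels_needed(distinct_keys, fanout=1000):
    """Smallest number of index levels whose capacity covers distinct_keys."""
    levels, capacity = 1, fanout
    while capacity < distinct_keys:
        capacity *= fanout
        levels += 1
    return levels


# Worst case for the example: every one of the 500M records has its own Key.
print(levels_needed(500_000_000))
```

With a fanout of 1000, 2 levels cover 10^6 Key values and 3 levels cover 10^9,
which is more than the 500M records can supply.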

We can use some sort of node use prediction algorithm like LFU to decide
which node should be moved out of CPU when we have to replace one of
the nodes in the CPU.  The nodes in RAM or on HD can be arranged to
maximize streaming IO behavior and minimize random access IO
behavior.
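
One possible shape for the replacement policy, as a minimal Python sketch.
This is plain LFU with the use count dropped on eviction; the class name and
the load callback are hypothetical, and a real implementation would also have
to write back a modified node before dropping it.

```python
from collections import Counter


class LFUNodeCache:
    """Keep at most `capacity` nodes; evict the least frequently used one."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.nodes = {}        # node_id -> node contents held "in CPU"
        self.uses = Counter()  # node_id -> number of accesses while cached
        self.evicted = []      # eviction order, kept only for illustration

    def get(self, node_id, load):
        if node_id not in self.nodes:
            if len(self.nodes) >= self.capacity:
                # Evict the cached node with the fewest accesses so far.
                victim = min(self.nodes, key=lambda n: self.uses[n])
                del self.nodes[victim]
                del self.uses[victim]
                self.evicted.append(victim)
            # load() stands in for fetching the node from RAM or HD.
            self.nodes[node_id] = load(node_id)
        self.uses[node_id] += 1
        return self.nodes[node_id]
```

In the example CPU the capacity would be 15, the number of nodes left after
reserving one node's worth of space for the write buffer.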

As you can see, both the RAM and HD IO are kept to a minimum, and what
IO remains has been optimized for streaming behavior.


>Can you draw a picture of it for me?  (I am dyslexic and understand things
>far better when I can visualize it).
>
Not much for pictures.  Hopefully the explanation helps?

Ron
