Re: [PERFORM] A Better External Sort? - Mailing list pgsql-hackers
From | Dann Corbit |
---|---|
Subject | Re: [PERFORM] A Better External Sort? |
Date | |
Msg-id | D425483C2C5C9F49B5B7A41F8944154757D10E@postal.corporate.connx.com Whole thread Raw |
In response to | [PERFORM] A Better External Sort? (Ron Peacetree <rjpeace@earthlink.net>) |
Responses |
Re: [PERFORM] A Better External Sort?
|
List | pgsql-hackers |
> -----Original Message----- > From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers- > owner@postgresql.org] On Behalf Of Ron Peacetree > Sent: Monday, September 26, 2005 10:47 AM > To: pgsql-hackers@postgresql.org; pgsql-performance@postgresql.org > Subject: [HACKERS] [PERFORM] A Better External Sort? > > >From: Ron Peacetree <rjpeace@earthlink.net> > >Sent: Sep 24, 2005 6:30 AM > >Subject: Re: [HACKERS] [PERFORM] Releasing memory during External > sorting? > > > >... the amount of IO done is the most > >important of the things that you should be optimizing for in > >choosing an external sorting algorithm. > > > > <snip> > > > >Since sorting is a fundamental operation in many parts of a DBMS, > >this is a Big Deal. > > > >This discussion has gotten my creative juices flowing. I'll post > >some Straw Man algorithm sketches after I've done some more > >thought. > > > As a thought exeriment, I've been considering the best way to sort 1TB > (2^40B) of 2-4KB (2^11-2^12B) records. That's 2^28-2^29 records. > > Part I: A Model of the System > The performance of such external sorts is limited by HD IO, then > memory IO, and finally CPU throughput. > > On commodity HW, single HD IO is ~1/2048 (single HD realistic worst > case) to ~1/128 (single HD best case. No more than one seek every > ~14.7ms for a ~50MB/s 7200rpm SATA II HD) the throughtput of RAM. > > RAID HD IO will be in the range from as low as a single HD (RAID 1) to > ~1/8 (a RAID system saturating the external IO bus) the throughput of > RAM. > > RAM is ~1/8-1/16 the throughput and ~128x the latency of the data > pathways internal to the CPU. > > This model suggests that HD IO will greatly dominate every other > factor, particuarly if we are talking about a single HD rather than a > peripheral bus saturating RAID subsystem. If at all possible, we want > to access the HD subsystem only once for each data item, If you can achieve that, I think you should be given a Nobel Prize, and I mean that sincerely. I also think that your analysis is interesting. > and we want > to avoid seeking more than the critical number of seeks implied above > when doing it. It also suggests that at a minimum, it's worth it to > spend ~8 memory operations or ~64 CPU operations to avoid a HD access. > Far more than that if we are talking about a single random access. > > It's worth spending ~128 CPU operations to avoid a single random RAM > access, and literally 10's or even 100's of thousands of CPU operations to > avoid a random HD access. In addition, there are many indications in > current ECE and IT literature that the performance gaps between these > pieces of computer systems are increasing and expected to continue to do > so for the forseeable future. In short, _internal_ sorts have some, and > are > going to increasingly have more, of the same IO problems usually > associated with external sorts. Knuth has made the observation (confirmed by others) that 40% of mainframe CPU cycles are spent on sorting. Hence, any sort of optimization in this area is a potential for enormous savings. > Part II: a Suggested Algorithm > The simplest case is one where we have to order the data using a key that > only has two values. I suggest testing against a very large class of distributions. All of the common statistical models are a start (Gaussian, Poisson, etc.) and also single value, two distinct values, to some limit. > Given 2^40B of data using 2KB or 4KB per record, the most compact > representation we can make of such a data set is to assign a 32b= 4B RID > or Rptr for location + a 1b key for each record. Just the RID's would > take up > 1.25GB (250M records) or 2.5GB (500M records). Enough space that even > an implied ordering of records may not fit into RAM. > > Still, sorting 1.25GB or 2.5GB of RIDs is considerably less expensive in > terms > of IO operations than sorting the actual 1TB of data. > > That IO cost can be lowered even further if instead of actually physically > sorting the RIDs, we assign a RID to the appropriate catagory inside the > CPU > as we scan the data set and append the entries in a catagory from CPU > cache > to a RAM file in one IO burst whenever said catagory gets full inside the > CPU. > We can do the same with either RAM file to HD whenever they get full. The > sorted order of the data is found by concatenating the appropriate files > at the > end of the process. > > As simple as this example is, it has many of the characteristics we are > looking for: > A= We access each piece of data once on HD and in RAM. > B= We do the minimum amount of RAM and HD IO, and almost no random IO in > either case. > C= We do as much work as possible within the CPU. > D= This process is stable. Equal keys stay in the original order they are > encountered. > > To generalize this method, we first need our 1b Key to become a > sufficiently large > enough Key or KeyPrefix to be useful, yet not so big as to be CPU cache > unfriendly. > > Cache lines (also sometimes called "blocks") are usually 64B= 512b in > size. > Therefore our RID+Key or KeyPrefix should never be larger than this. For > a 2^40B > data set, a 5B RID leaves us with potentially as much as 59B of Key or > KeyPrefix. > Since the data can't take on more than 40b worth different values > (actually 500M= 29b > for our example), we have more than adequate space for Key or KeyPrefix. > We just > have to figure out how to use it effectively. > A typical CPU L2 cache can hold 10's or 100's of thousands of such cache > lines. > That's enough that we should be able to do a significant amount of useful > work within > the CPU w/o having to go off-die. > > The data structure we are using to represent the sorted data also needs to > be > generalized. We want a space efficient DS that allows us to find any > given element in > as few accesses as possible and that allows us to insert new elements or > reorganize > the DS as efficiently as possible. This being a DB discussion list, a B+ > tree seems like > a fairly obvious suggestion ;-) > > A B+ tree where each element is no larger than a cache line and no node is > larger than > what fits into L2 cache can be created dynamically as we scan the data set > via any of > the fast, low IO methods well known for doing so. Since the L2 cache can > hold 10's of > thousands of cache lines, it should be easy to make sure that the B+ tree > has something > like 1000 elements per node (making the base of the logarithm for access > being at least > 1000). The log base 1000 of 500M is ~2.9, so that means that even in the > absolute > worst case where every one of the 500M records is unique we can find any > given > element in less than 3 accesses of the B+ tree. Increasing the order of > the B+ tree is > an option to reduce average accesses even further. > > Since the DS representing the sorted order of the data is a B+ tree, it's > very "IO friendly" > if we need to store part or all of it on HD. > > In an multiprocessor environment, we can assign chunks of the data set to > different > CPUs, let them build their independant B+ trees to represent the data in > sorted order from > their POV, and then merge the B+ trees very efficiently into one overall > DS to represent > the sorted order of the entire data set. > > Finally, since these are B+ trees, we can keep them around and easily > update them at will > for frequent used sorting conditions. > > What do people think? I think that your analysis is very interesting. I would like to see the result of the experiment. I think that the btrees are going to be O(n*log(n)) in construction of the indexes in disk access unless you memory map them [which means you would need stupendous memory volume] and so I cannot say that I really understand your idea yet. Can you draw a picture of it for me? (I am dyslexic and understand things far better when I can visualize it).
pgsql-hackers by date: