[HACKERS] A Better External Sort? - Mailing list pgsql-performance
From | Ron Peacetree |
---|---|
Subject | [HACKERS] A Better External Sort? |
Date | |
Msg-id | 21606161.1127756844886.JavaMail.root@elwamui-polski.atl.sa.earthlink.net |
List | pgsql-performance |
>From: Ron Peacetree <rjpeace@earthlink.net>
>Sent: Sep 24, 2005 6:30 AM
>Subject: Re: [HACKERS] [PERFORM] Releasing memory during External sorting?
>
>... the amount of IO done is the most
>important of the things that you should be optimizing for in
>choosing an external sorting algorithm.
>
> <snip>
>
>Since sorting is a fundamental operation in many parts of a DBMS,
>this is a Big Deal.
>
>This discussion has gotten my creative juices flowing.  I'll post
>some Straw Man algorithm sketches after I've done some more
>thought.
>

As a thought experiment, I've been considering the best way to sort 1TB (2^40B) of 2-4KB (2^11-2^12B) records.  That's 2^28-2^29 records.

Part I: A Model of the System

The performance of such external sorts is limited by HD IO, then by memory IO, and finally by CPU throughput.

On commodity HW, single HD IO is ~1/2048 (single HD realistic worst case) to ~1/128 (single HD best case: no more than one seek every ~14.7ms for a ~50MB/s 7200rpm SATA II HD) the throughput of RAM.  RAID HD IO will be in the range from as low as a single HD (RAID 1) to ~1/8 (a RAID system saturating the external IO bus) the throughput of RAM.

RAM is ~1/8-1/16 the throughput, and ~128x the latency, of the data pathways internal to the CPU.

This model suggests that HD IO will greatly dominate every other factor, particularly if we are talking about a single HD rather than a RAID subsystem that saturates the peripheral bus.  If at all possible, we want to access the HD subsystem only once for each data item, and we want to avoid seeking more than the critical number of seeks implied above when doing it.

It also suggests that, at a minimum, it's worth spending ~8 memory operations or ~64 CPU operations to avoid a HD access, and far more than that if we are talking about a single random access.  It's worth spending ~128 CPU operations to avoid a single random RAM access, and literally 10's or even 100's of thousands of CPU operations to avoid a random HD access.
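To make those trade-offs easy to play with, here is a trivial C calculator that just encodes the ratios assumed above.  The inputs are this post's round numbers for 2005-era commodity HW, not measurements; substitute your own before relying on the outputs:

```c
#include <stdio.h>

/* Back-of-the-envelope numbers for the cost model above.  All inputs
 * are the rough figures assumed in this post (a ~50MB/s, 7200rpm
 * SATA II HD with ~14.7ms per random seek, best-case RAID IO at ~1/8
 * of RAM throughput, RAM at ~1/8 of CPU-internal throughput and
 * ~128x CPU-internal latency). */
int main(void)
{
    const double hd_mb_per_s  = 50.0;   /* sequential HD throughput        */
    const double seek_ms      = 14.7;   /* cost of one random HD seek      */
    const double ram_vs_hd    = 8.0;    /* RAM is ~8x best-case RAID IO    */
    const double cpu_vs_ram   = 8.0;    /* CPU pathways ~8x RAM throughput */
    const double ram_lat_mult = 128.0;  /* RAM latency ~128x CPU-internal  */

    /* Sequential transfer forgone by a single random seek. */
    double mb_lost_per_seek = hd_mb_per_s * (seek_ms / 1000.0);

    printf("MB of sequential IO lost per random seek: ~%.2f\n",
           mb_lost_per_seek);                       /* ~0.74MB */
    printf("RAM ops worth spending to avoid one HD access: ~%.0f\n",
           ram_vs_hd);                              /* ~8      */
    printf("CPU ops worth spending to avoid one HD access: ~%.0f\n",
           cpu_vs_ram * ram_vs_hd);                 /* ~64     */
    printf("CPU ops worth spending to avoid one random RAM access: ~%.0f\n",
           ram_lat_mult);                           /* ~128    */
    return 0;
}
```

The ~0.74MB figure is why seeks matter even more than raw IO volume: every random seek forgoes nearly a megabyte of sequential transfer on such a drive.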
In addition, there are many indications in current ECE and IT literature that the performance gaps between these pieces of computer systems are increasing, and are expected to continue to do so for the foreseeable future.  In short, _internal_ sorts have some, and are going to increasingly have more, of the same IO problems usually associated with external sorts.

Part II: A Suggested Algorithm

The simplest case is one where we have to order the data using a key that has only two values.  Given 2^40B of data using 2KB or 4KB per record, the most compact representation we can make of such a data set is to assign a 32b= 4B RID or Rptr for location + a 1b key for each record.  Just the RIDs would take up 1.25GB (250M records) or 2.5GB (500M records): enough space that even an implied ordering of the records may not fit into RAM.

Still, sorting 1.25GB or 2.5GB of RIDs is considerably less expensive in terms of IO operations than sorting the actual 1TB of data.

That IO cost can be lowered even further if, instead of actually physically sorting the RIDs, we assign each RID to the appropriate category inside the CPU as we scan the data set, and append the entries in a category from CPU cache to a RAM file in one IO burst whenever said category gets full inside the CPU.  We can do the same from either RAM file to HD whenever they get full.  The sorted order of the data is found by concatenating the appropriate files at the end of the process.

As simple as this example is, it has many of the characteristics we are looking for:
A= We access each piece of data once on HD and in RAM.
B= We do the minimum amount of RAM and HD IO, and almost no random IO in either case.
C= We do as much work as possible within the CPU.
D= This process is stable.  Equal keys stay in the original order they are encountered.
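A rough sketch of that two-value pass in C follows.  The record layout, buffer size, and plain-FILE output are all invented for illustration; the point is just the shape of the control flow: one sequential scan, two cache-resident buffers, one burst of sequential IO per full buffer.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative sketch of the two-value pass described above: scan the
 * data once, route each RID to one of two cache-resident buffers by
 * its 1b key, and flush each buffer to its file in one burst when it
 * fills. */

#define BUF_RIDS 8192            /* per-category buffer, sized to stay cache-resident */

typedef uint32_t rid_t;          /* 32b = 4B record identifier */

typedef struct {
    rid_t buf[BUF_RIDS];
    size_t n;
    FILE  *out;                  /* stands in for the RAM file / HD file */
} category;

static void flush_cat(category *c)
{
    /* One sequential burst of IO per full buffer: no random IO. */
    fwrite(c->buf, sizeof(rid_t), c->n, c->out);
    c->n = 0;
}

static void emit(category *c, rid_t rid)
{
    c->buf[c->n++] = rid;        /* appending preserves encounter order: stable */
    if (c->n == BUF_RIDS)
        flush_cat(c);
}

/* key_of() is a placeholder for however the 1b key is extracted from
 * the record at hand. */
int partition_pass(FILE *data, int (*key_of)(const void *rec),
                   size_t rec_len, category cats[2])
{
    char rec[4096];              /* 2-4KB records, per the example */
    rid_t rid = 0;

    while (fread(rec, rec_len, 1, data) == 1)
        emit(&cats[key_of(rec)], rid++);

    flush_cat(&cats[0]);         /* drain whatever is left in both buffers */
    flush_cat(&cats[1]);
    /* Sorted order = file for key 0 concatenated with file for key 1. */
    return 0;
}
```

Note that appending within each category is what makes the pass stable (characteristic D= above): equal keys come out in the order they were encountered.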
To generalize this method, we first need our 1b Key to become a Key or KeyPrefix that is large enough to be useful, yet not so big as to be CPU cache unfriendly.  Cache lines (also sometimes called "blocks") are usually 64B= 512b in size, so our RID+Key or KeyPrefix should never be larger than this.  For a 2^40B data set, a 5B RID leaves us with potentially as much as 59B of Key or KeyPrefix.  Since the data can't take on more than 40b worth of different values (actually 500M= 29b for our example), we have more than adequate space for the Key or KeyPrefix.  We just have to figure out how to use it effectively.

A typical CPU L2 cache can hold 10's or 100's of thousands of such cache lines.  That's enough that we should be able to do a significant amount of useful work within the CPU w/o having to go off-die.

The data structure we are using to represent the sorted data also needs to be generalized.  We want a space efficient DS that allows us to find any given element in as few accesses as possible, and that allows us to insert new elements or reorganize the DS as efficiently as possible.  This being a DB discussion list, a B+ tree seems like a fairly obvious suggestion ;-)

A B+ tree where each element is no larger than a cache line, and no node is larger than what fits into L2 cache, can be created dynamically as we scan the data set via any of the fast, low IO methods well known for doing so.  Since the L2 cache can hold 10's of thousands of cache lines, it should be easy to make sure that the B+ tree has something like 1000 elements per node (making the base of the logarithm for access at least 1000).  The log base 1000 of 500M is ~2.9, so even in the absolute worst case, where every one of the 500M records is unique, we can find any given element in less than 3 accesses of the B+ tree.  Increasing the order of the B+ tree is an option to reduce average accesses even further.

Since the DS representing the sorted order of the data is a B+ tree, it's very "IO friendly" if we need to store part or all of it on HD.

In a multiprocessor environment, we can assign chunks of the data set to different CPUs, let them build their independent B+ trees to represent the data in sorted order from their POV, and then merge the B+ trees very efficiently into one overall DS to represent the sorted order of the entire data set.

Finally, since these are B+ trees, we can keep them around and easily update them at will for frequently used sorting conditions.

What do people think?

Ron
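P.S.  For concreteness, a cache-line-sized element could be laid out roughly as below.  The 5B/59B split is just the budget described above; the alignment attribute is GCC syntax, and the layout is one possibility, not a prescription.

```c
#include <stdint.h>

/* One possible layout for a cache-line-sized B+ tree element: a 5B
 * RID (enough to address any byte of a 2^40B data set) plus up to
 * 59B of Key or KeyPrefix, packed into one 64B cache line. */
typedef struct __attribute__((aligned(64))) {
    uint8_t rid[5];              /* 5B = 40b record locator       */
    uint8_t keyprefix[59];       /* leading bytes of the sort key */
} sort_element;                  /* sizeof(sort_element) == 64    */

/* With ~1000 such elements per L2-resident node, tree depth for
 * 500M unique keys is ceil(log1000(5e8)) = 3, matching the
 * "less than 3 accesses" figure above. */
```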