Thread: Sorting Improvements for 8.4
Just wanted to review a few thoughts and ideas around improving external sorts, as recently encouraged to do by Jim Nasby.

Current issues/opportunities are these:

ISSUES

a) Memory is always in short supply, so using what we have more effectively is going to be welcome.

b) Heap sort has a reasonably strong anti-memory effect, meaning that there is an optimum amount of memory for any sort. This shows itself with the CPU time increasing during run forming, making this stage of the sort CPU bound.

c) Many sorts are performed prior to aggregation. It might be possible to aggregate prior to writing to disk, as a way of reducing the overall I/O cost. Benefit would occur when the total CPU cost was the same no matter when aggregation occurred; that would not apply in all cases, so we would need to sense when benefit was possible.

d) Generally reducing the I/O cost of sorting may help the merging stages of a sort.

SOLUTIONS

The ideas that Greg Stark, Jim Nasby, Heikki and myself have discussed to date were the following:

1. Sort I/O Compression
2. Aggregation during Sort
3. Memory Pools
4. Dynamic Heap Management
5. Dynamic Run Handling

I've added (5) to the list as well, which hasn't yet been discussed.

1. SORT I/O COMPRESSION

This idea is not dead yet; it just needs a full set of tests to confirm that there is benefit in all cases. If there's not benefit in all cases, we may be able to work out which cases those are, so we know when to use it.

2. AGGREGATION DURING SORT

Many sorts are preliminary steps before aggregation. Aggregation during run forming would potentially reduce the size of the heap and reduce the number of comparisons. For many types of aggregate this would not theoretically increase the number of ops, since sum(), avg(), min() and max() are all commutative with respect to their inputs. We would probably need to add another option to aggregate functions to indicate the possibility of calculating the aggregate in this way, since some aggregates might rely on the current situation of seeing all their inputs at once, in sorted order. (Windowed aggregates are unlikely to be this way.)

3. MEMORY POOLS

Solving a) could be done by sensible management and allocation of resources. Discussed before, so not rehashed here.

4. DYNAMIC HEAP MANAGEMENT

The size of the active heap required to produce the fewest number of runs varies as the sort progresses. For example, sorting an already sorted input needs a trivial heap size. Larger heap sizes simply avoid forming more runs, which is not necessarily a bad thing. More runs only become a bad thing when we go beyond our ability to perform a single final merge (see Dynamic Run Handling below). Smaller heap sizes reduce the number of comparisons required, plus increase the L2+ cache efficiencies. Those two things are the cause of the anti-memory effect.

Because of b), optimising the size of the heap could potentially be a good thing. This can make a considerable difference for nearly sorted data (measurements required...).

When we have M amount of memory available to us, we don't start by using it all. We start with m memory and only increase up to M if required. Runs are built with memory set at m. If a tuple arrives that would force the formation of a new run we assess:

i) Do we care if another run is formed? Use our knowledge of the likely amount of data coming our way, compared with the number of runs formed so far, and see if we really care. If we don't care, allow the new run to be formed and carry on with a heap size of just m. (See Dynamic Run Handling later.)

ii) If we do care about the number of runs, then allow the heap to grow by increments up to the full size of M. Increments would be at least x2 and possibly x4. That way we always have work space to rearrange the heap.

All of this dances too cleverly around the exact technique and potential costs of rearranging the heap. That is not to be ignored, and is the next task in evaluating and accepting/dismissing this potential technique.

In combination with memory pooling this technique might also allow memory to be better distributed to other users.

5. DYNAMIC RUN HANDLING (in Final Merge)

Another way of addressing a) is to simply make better use of memory itself. Let's look at that in more detail:

Number of runs that can be merged at once is currently fixed, based upon available memory. This has the underlying assumption that all runs will be concurrently active during final merging, which may not always be true.

If we have random data then almost all runs will overlap with all other runs, i.e. the min and max values are sufficiently wide that the runs do all overlap. In many cases, data arrives in somewhat sorted order, e.g. financial data is fairly regular with some late payers but not many, and those trail off with a fairly tight decay. In the somewhat sorted case we find that the actual overlap is less than total, so there are many later runs that don't overlap the earlier ones. In the best case we would find run 1 and 2 overlap, runs 2 and 3 overlap, then 3 and 4 overlap.

This is also the point where I suggest breaking away from Knuth completely. All of the main algorithms described by Knuth are tape sorts. A run is written to a particular tape and then stays there until "moved" to another tape. That means we have to get super-clever about how runs should be written and formed (see Knuth). If we realise that the runs aren't fixed to particular tapes - they are all just independent runs - we can radically rethink sorting. There is no need to implement Cascade Sort, but we do need to rethink merging from the ground up. (All of which is a relief, because Knuth et al are definitely smarter than me, but I've got disks and lots of memory and those guys had tapes.)

If we track the min and max values for each run, when run building is finished we will be able to build a merging plan that allows us to be smart about the runs we should bring together. We start with the run with the lowest min value, as well as all runs that overlap that run. When that run is exhausted we move to the next lowest and at that point start merging all runs that overlap that one. This then means we may be able to begin final merging with more runs than the current cut-off. It's possible that we could merge an infinite number of runs in the final merge with fixed memory.

If we *do* need to merge, we can work out which runs should be our best pre-merge candidates, based upon how big they are and which other runs they overlap. (That's much better than being forced to merge tapes 2, 7 and 17 because some bizarre math says so (see Knuth).)

Anyway, claiming to have found a better way than Knuth makes me feel a little nervous, so some searching questions on this are very welcome.

Interestingly, if we combine this technique with dynamic heap management we may be able to allow a very large number of efficiently written runs to form without it causing any merging.

mac_man recently noted the possibility that some runs don't overlap at all and so can be merged for free. That's true, though it doesn't actually improve the basic idea here, which is building a merge plan after runs have been formed, with an eye on minimizing and potentially eliminating the merge phase.

There are probably some typos or thinkos above, so go easy on me Greg! They aren't there because I want to skim over anything.

I'm not likely to get a chance to do all of this in the near future, so documenting it now should help others to carry things forward.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
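As a rough illustration of the merge-plan idea above (nothing here is from an actual patch; the types and function names are hypothetical), the sketch below sorts the runs by their minimum key and then groups runs whose min/max ranges overlap transitively. Runs that fall into different groups never need to be open at the same time, so only the largest group determines how many runs must be merged concurrently.

    /* Hypothetical sketch: derive a merge plan from per-run min/max bounds. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct
    {
        int     runno;      /* which run this is */
        double  minkey;     /* lowest sort key seen in the run */
        double  maxkey;     /* highest sort key seen in the run */
    } RunBound;

    static int
    cmp_by_min(const void *a, const void *b)
    {
        const RunBound *ra = a, *rb = b;
        return (ra->minkey > rb->minkey) - (ra->minkey < rb->minkey);
    }

    /* Print groups of runs that must be merged together.  Runs in different
     * groups never overlap, so their output can simply be concatenated. */
    static void
    plan_merge(RunBound *runs, int nruns)
    {
        qsort(runs, nruns, sizeof(RunBound), cmp_by_min);

        int     i = 0;
        while (i < nruns)
        {
            double  group_max = runs[i].maxkey;
            int     start = i;

            /* extend the group while the next run starts before it ends */
            while (i + 1 < nruns && runs[i + 1].minkey <= group_max)
            {
                i++;
                if (runs[i].maxkey > group_max)
                    group_max = runs[i].maxkey;
            }

            printf("merge group:");
            for (int j = start; j <= i; j++)
                printf(" run %d", runs[j].runno);
            printf("  (%d-way merge needed)\n", i - start + 1);
            i++;
        }
    }

    int
    main(void)
    {
        /* toy example: runs 0 and 1 overlap, run 2 is disjoint */
        RunBound runs[] = {
            {0, 1.0, 50.0},
            {1, 40.0, 90.0},
            {2, 100.0, 200.0},
        };
        plan_merge(runs, 3);
        return 0;
    }

On the toy data this prints one two-run group and one single-run group, i.e. only a 2-way merge is ever needed even though three runs were formed.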
On Tue, Nov 27, 2007 at 06:03:46PM +0000, Simon Riggs wrote: > Just wanted to review a few thoughts and ideas around improving external > sorts, as recently encouraged to do by Jim Nasby. Is there any way of PG knowing that having an index on a subset of the sorted columns is sometimes a win? For example, if we have: CREATE TABLE foo ( a INTEGER NOT NULL PRIMARY KEY, b INTEGER NOT NULL, c INTEGER ); and we request: SELECT * FROM foo ORDER BY a,b LIMIT 10; then it may be a win to do smaller sorts for each value of "a", rather than one big sort after all the data has been pulled out. Obviously, it would depend on the distribution of "a", large numbers of distinct values for "a" being good, and a small number being bad. I think this would help in a number of other situations as well, but that's just the most obvious case. Sam
On Tue, 2007-11-27 at 18:03 +0000, Simon Riggs wrote: > 5. DYNAMIC RUN HANDLING (in Final Merge) > > Another way of addressing a) is to simply make better use of memory > itself. Let's look at that in more detail: > > Number of runs that can be merged at once is currently fixed, based upon > available memory. This has the underlying assumption that all runs will > be concurrently active during final merging, which may not always be > true. > > If we have random data then almost all runs will overlap with all other > runs, i.e. the min and max values are sufficiently wide that the runs do > all overlap. In many cases, data arrives in somewhat sorted order, e.g. > financial data is fairly regular with some late payers but not many, and > those trail off with a fairly tight decay. In the somewhat sorted case > we find that the actual overlap is less than total, so there are many > later runs that don't overlap the earlier ones. In the best case we > would find run 1 and 2 overlap, runs 2 and 3 overlap, then 3 and 4 > overlap. I have spoken with Len Shapiro, a professor at Portland State University, regarding sorting before. He suggests that PostgreSQL should implement forecasting, which is similar to what you're describing. Forecasting does not require that entire runs are disjoint, it works by tracking the maximum values from the last block read from every run. This allows you to know which run you will need more blocks from the soonest. I'm still looking into the problem to understand it better, but the algorithm is in Knuth Vol 3. I can look at it in more detail, but have you already looked into this idea? Is there a reason we don't do this currently? Regards,Jeff Davis
On Fri, 2007-11-30 at 12:07 -0800, Jeff Davis wrote:
> On Tue, 2007-11-27 at 18:03 +0000, Simon Riggs wrote:
> > 5. DYNAMIC RUN HANDLING (in Final Merge)
> >
> > Another way of addressing a) is to simply make better use of memory itself. Let's look at that in more detail:
> >
> > Number of runs that can be merged at once is currently fixed, based upon available memory. This has the underlying assumption that all runs will be concurrently active during final merging, which may not always be true.
> >
> > If we have random data then almost all runs will overlap with all other runs, i.e. the min and max values are sufficiently wide that the runs do all overlap. In many cases, data arrives in somewhat sorted order, e.g. financial data is fairly regular with some late payers but not many, and those trail off with a fairly tight decay. In the somewhat sorted case we find that the actual overlap is less than total, so there are many later runs that don't overlap the earlier ones. In the best case we would find run 1 and 2 overlap, runs 2 and 3 overlap, then 3 and 4 overlap.
>
> I have spoken with Len Shapiro, a professor at Portland State University, regarding sorting before.
>
> He suggests that PostgreSQL should implement forecasting, which is similar to what you're describing. Forecasting does not require that entire runs are disjoint, it works by tracking the maximum values from the last block read from every run. This allows you to know which run you will need more blocks from the soonest.
>
> I'm still looking into the problem to understand it better, but the algorithm is in Knuth Vol 3.
>
> I can look at it in more detail, but have you already looked into this idea? Is there a reason we don't do this currently?

Interesting, I hadn't read that part. Knuth's Algorithm F covers how to do a P-way merge using 2P + 2 buffers. My ideas cover how to do a P-way merge when you don't have enough memory for that many buffers.

The current sort code makes two assumptions, amongst others:

1. minimizing the number of runs is always worth it
2. there is a single fixed maximum size of P, depending upon memory

I'm challenging both of those. Only runs that overlap need to be merged simultaneously, so if the runs aren't overlapping then it's OK to allow more runs to be formed. If it's OK to allow more runs, then reducing heap size to allow better CPU efficiency is possible.

So Algorithm F is somewhat orthogonal to what I've proposed, though maybe still interesting. What we do now is fairly close, but you should look at the code in tuplesort.c and logtape.c to see how well it matches. That might lead to an increase in the limit of the number of concurrent runs mergeable at any one time.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
On Mon, 2007-12-03 at 11:51 +0000, Simon Riggs wrote:
> So Algorithm F is somewhat orthogonal to what I've proposed, though maybe still interesting. What we do now is fairly close, but you should look at the code in tuplesort.c and logtape.c to see how well it matches. That might lead to an increase in the limit of the number of concurrent runs mergeable at any one time.

tuplesort.c:

 * When merging runs, we use a heap containing just the frontmost tuple from
 * each source run; we repeatedly output the smallest tuple and insert the
 * next tuple from its source tape (if any). When the heap empties, the merge
 * is complete. The basic merge algorithm thus needs very little memory ---
 * only M tuples for an M-way merge, and M is constrained to a small number.
 * However, we can still make good use of our full workMem allocation by
 * pre-reading additional tuples from each source tape. Without prereading,
 * our access pattern to the temporary file would be very erratic; on average
 * we'd read one block from each of M source tapes during the same time that
 * we're writing M blocks to the output tape, so there is no sequentiality of
 * access at all, defeating the read-ahead methods used by most Unix kernels.
 * Worse, the output tape gets written into a very random sequence of blocks
 * of the temp file, ensuring that things will be even worse when it comes
 * time to read that tape. A straightforward merge pass thus ends up doing a
 * lot of waiting for disk seeks. We can improve matters by prereading from
 * each source tape sequentially, loading about workMem/M bytes from each tape
 * in turn. Then we run the merge algorithm, writing but not reading until
 * one of the preloaded tuple series runs out. Then we switch back to preread
 * mode, fill memory again, and repeat. This approach helps to localize both
 * read and write accesses.

The idea of prefetching, as I understand it, is that we don't blindly preread workMem/M bytes from each of M tapes; instead we predict which tapes we will need tuples from next through forecasting.

If I understand correctly, we just keep track of the maximum value of the last block read from each run, and then always read from the run in which the last block read has the lowest maximum.

It seems as if this would allow a variable number of runs to be merged at once, but if the data really *is* random, we'd want it to degrade gracefully to something resembling the current implementation.

I'm being somewhat vague here because I haven't taken the time to really understand it. If you think this idea has potential I will look into it in more detail.

Regards, Jeff Davis
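A tiny sketch of the forecasting rule being discussed, assuming each tape remembers the highest key of the block it most recently delivered (the names are hypothetical; this is not the tuplesort.c code). The tape whose last block had the lowest maximum is the one that will run dry soonest, so it is the one to preread from next:

    /* Hypothetical sketch of run-level forecasting. */
    typedef struct
    {
        double  last_block_max;   /* highest key in the most recently read block */
        int     exhausted;        /* no more blocks remain on this tape */
    } TapeState;

    /* Pick the tape to preread from next, or -1 if every tape is exhausted. */
    static int
    choose_tape_to_preread(const TapeState *tapes, int ntapes)
    {
        int     best = -1;

        for (int i = 0; i < ntapes; i++)
        {
            if (tapes[i].exhausted)
                continue;
            if (best < 0 || tapes[i].last_block_max < tapes[best].last_block_max)
                best = i;
        }
        return best;
    }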
On Mon, 2007-12-03 at 10:32 -0800, Jeff Davis wrote:
> If I understand correctly, we just keep track of the maximum value of the last block read from each run, and then always read from the run in which the last block read has the lowest maximum.

Yep, sounds like Algorithm F.

> It seems as if this would allow a variable number of runs to be merged at once, but if the data really *is* random, we'd want it to degrade gracefully to something resembling the current implementation.

If we also keep track of the endpoints of runs that we haven't yet read from, then yes that would link my ideas with Algorithm F, so we just have a single implementation. (F++ ?) Probably easiest to store the endpoint tuples directly, with some sane limits for when we have very large tuples.

You'll still need to do run-level forecasting as I had proposed to tell whether you need to do any intermediate merging prior to the final merge. So the two sets of ideas can't be brought together completely.

> I'm being somewhat vague here because I haven't taken the time to really understand it. If you think this idea has potential I will look into it in more detail.

Yes, F++ sounds like it will use memory more effectively than we do currently. That's likely to improve performance when the number of runs approaches the limit for the size of work_mem. So this will improve external sorts with too small memory allocations, but it won't do anything about sorts with too large a memory allocation. That's probably the order of importance for tackling sort performance, so that's good.

Probably best to test with
- 1M - 4M work_mem, so we see the full benefit of any improvements in memory utilisation in a typical context
- number of runs nearly at the limit for memory
- total sort very large, so we see real I/O issues starkly

You'll need to instrument things carefully so you can tell how many runs are being merged at any one time and how that affects elapsed time/row.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Mon, 2007-12-03 at 10:32 -0800, Jeff Davis wrote: > >> If I understand correctly, we just keep track of the maximum value of >> the last block read from each run, and then always read from the run in >> which the last block read has the lowest maximum. So it sounds like the use case where this is the biggest win would be a situation where you have presorted input which has been sliced up. So for example sorting by "zip code" in a table which was clustered by city. The alphabetic order of the cities isn't correlated to the results but all the zip codes for a city are in a contiguous block somewhere in the output. In such a case after doing a single pass we would have a bunch of tapes each of which corresponded to a single city and was able to completely reorder the zip codes in that city to be ordered. So the desired results would be, for example, all the tuples from tape 17 (NYC) followed by all the tuples from tape 3 (Buffalo) followed by all the tuples from tape 1 (Albuquerque), etc. We currently preread an equal amount from each tape and then would empty all the preread tuples from tape 17, refill them, preread them again, repeat until tape 17 is empty then move on to tape 3. All the tuples except the currently active tape are completely idle. I think the way to do what you're proposing is to preread one tuple from each tape, then when one preread bunch is emptied refill it with twice as many and repeat. In this case you would end up with nearly all of workmem full of tuples from NYC until you're done with NYC. That would increase the prereading block size by a factor of 20 in this case. So the question is just how many seeks are we doing during sorting. If we're doing 0.1% seeks and 99.9% sequential i/o then eliminating the 1% entirely (which we can't do) isn't going to speed up seeking all that much. If we're doing 20% seeks and can get that down to 10% it might be worthwhile. I'm not sure where the idea of keeping the current bounds of the input tapes comes into it. We only preread when we run out of tuples anyways and then we don't really have a choice about which tape we want to preread from. And it's a good thing too since maintaining such a list of bounds and finding the lowest or highest would mean maintaining a second heap which would basically double the cpu cost of sorting. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
On Mon, 2007-12-03 at 20:40 +0000, Gregory Stark wrote:
> I think the way to do what you're proposing is...

Don't understand that. Algorithm F covers that already, doesn't it?

> So the question is just how many seeks are we doing during sorting. If we're doing 0.1% seeks and 99.9% sequential i/o then eliminating the 1% entirely (which we can't do) isn't going to speed up seeking all that much. If we're doing 20% seeks and can get that down to 10% it might be worthwhile.

The buffer size at max tapes is an optimum - a trade-off between avoiding intermediate merging and merging efficiently. Freeing more memory is definitely going to help in the case of low work_mem and lots of runs. You're right that there is a limit to the benefit you can get.

I wrote a patch in 2005/6 to optimise the memory usage when there were few runs and lots of memory. I still think there's value in that.

> I'm not sure where the idea of keeping the current bounds of the input tapes comes into it. We only preread when we run out of tuples anyways and then we don't really have a choice about which tape we want to preread from.

You have to decide whether to perform intermediate merges or whether you can do everything at the final merge. Otherwise you can't merge more runs than you have buffers for, since you'd at some point freeze up and not be able to input.

> And it's a good thing too since maintaining such a list of bounds and finding the lowest or highest would mean maintaining a second heap which would basically double the cpu cost of sorting.

I think you're not understanding me.

You only need to record the lowest or highest when a run completes/starts. When all runs have been written we then have a table of the highest and lowest values for each run. We then scan that to see whether we can perform merging in one pass, or if not what kind of intermediate merging is required. We keep the merge plan in memory and then follow it. So probably a very small % of the total sort cost, though it might save you doing intermediate merges with huge costs.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > The buffer size at max tapes is an optimum - a trade off between > avoiding intermediate merging and merging efficiently. Freeing more > memory is definitely going to help in the case of low work_mem and lots > of runs. I can't follow these abstract arguments. That's why I tried to spell out a concrete example. > I think you're not understanding me. > > You only need to record the lowest or highest when a run > completes/starts. When all runs have been written we then have a table > of the highest and lowest values for each run. We then scan that to see > whether we can perform merging in one pass, or if not what kind of > intermediate merging is required. We keep the merge plan in memory and > then follow it. So probably very small % of total sort cost, though > might save you doing intermediate merges with huge costs. Ok, that's a very different concept than what I was thinking. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning
On Mon, 2007-12-03 at 20:40 +0000, Gregory Stark wrote: > So the question is just how many seeks are we doing during sorting. If we're > doing 0.1% seeks and 99.9% sequential i/o then eliminating the 1% entirely > (which we can't do) isn't going to speed up seeking all that much. If we're > doing 20% seeks and can get that down to 10% it might be worthwhile. It's not just about eliminating seeks, it's about being able to merge more runs at one time. If you are merging 10 runs at once, and only two of those runs overlap and the rest are much greater values, you might be spending 99% of the time in sequential I/O. But the point is, we're wasting the memory holding those other 8 runs in memory (wasting 80% of the memory you're using), so we really could be merging a lot more than 10 runs at once. This might eliminate stages from the merge process. My point is just that "how many seeks are we doing" is not the only question. We could be doing 99% sequential I/O and still make huge wins. In reality, of course, the runs aren't going to be disjoint completely, but they may be partially disjoint. That's where forecasting comes in: you preread from the tapes you will actually need tuples from soonest. Regards,Jeff Davis
On Mon, 2007-12-03 at 20:40 +0000, Gregory Stark wrote: > I'm not sure where the idea of keeping the current bounds of the input tapes > comes into it. We only preread when we run out of tuples anyways and then we > don't really have a choice about which tape we want to preread from. And it's > a good thing too since maintaining such a list of bounds and finding the > lowest or highest would mean maintaining a second heap which would basically > double the cpu cost of sorting. > You're only keeping track of the maximum value for each run, which should be cheap to track. The only time it changes is when you're reading more data from that run, in which case it increases. The tradeoff that's happening right now is: we want to merge many runs at once because it reduces the number of merge phases, but the problem is that it increases the seeking because we read one block from one run, then one block from another run, etc., especially if the input is random. If we reduce the number of runs, then we can preread more efficiently. See: tuplesort.c: * as sorted runs, we can eliminate any repeated I/O at all. In the current * code we determine the number of tapes M on the basis of workMem: we want * workMem/M to be large enough that we read a fair amount of data each time * we preread from a tape, so as to maintain the locality of access described * above. Nonetheless, with large workMem we can have many tapes. So, for workMem/M to be "large enough", M has to be small enough. But a small M means we have to do more merge phases, which is expensive. Forecasting improves this trade. Forecasting no longer _blindly_ prereads from each tape, it uses information that it already has (the max value of the last block read from each run) to determine the runs from which we need tuples the soonest. Then, it prereads the _correct_ data. Regards,Jeff Davis
Has anyone looked into sorting algorithms that could use more than one CPU or core at a time?

Benchmarks I see [1][2] suggest that sorting is an area that improves greatly with multiple processors and even with multi-threading on a single core processor.

"For 1-processor and 2-threads (1p2t), the algorithm sorts the relation about 48% faster than the single-threaded version with a speedup of 31% during the quicksort and 58% during the mergesort. The dual-processor (2p2t) version provides an even faster total speedup of 86% over the single-threaded version with a speedup of 60% during the quicksort and 100% during the merge sort." [from the acm paper on link 2 below]

PS: Yeah, I know multi-threading is a hot-button on these lists; but sorting seems a relatively isolated part of the code and I'd wonder if it'd be isolate-able enough that multiple CPUs could be used there.

[1] http://www.cs.cmu.edu/~damon2005/damonpdf/4%20best%20paper%20-%20multithreaded%20architectures%20and%20the%20sort%20benchmark.pdf
[2] http://delivery.acm.org/10.1145/1120000/1114254/DaMoN_103.pdf?key1=1114254&key2=5713023711&coll=&dl=ACM&CFID=15151515&CFTOKEN=6184618
Hi, Le mardi 18 décembre 2007, Ron Mayer a écrit : > Has anyone looked into sorting algorithms that could use > more than one CPU or core at a time? [...] > PS: Yeah, I know multi-threading is a hot-button on these > lists; but sorting seems a relatively isolated of the code > and I'd wonder if it'd be isolate-able enough that multiple > CPUs could be used there. And before that objection to multi-threading implementation and portability concerns arise, what about using a coroutine BSD-licenced portable implementation such as Protothreads to have backend code use several CPU at a time? http://www.sics.se/~adam/pt/ With such a tool, would it be possible to think about producer/consumer parallel executions for sorting, aggregates nodes or other parts of the executor? Hope this helps, regards, -- dim
On Mon, 2007-12-17 at 16:34 -0800, Ron Mayer wrote:
> PS: Yeah, I know multi-threading is a hot-button on these lists; but sorting seems a relatively isolated part of the code and I'd wonder if it'd be isolate-able enough that multiple CPUs could be used there.

I'm not sure multi-threading is the issue you think it is. Threads are, but only for architectural reasons. Using multiple processes to complete a task seems very sensible to me.

Yeh, sorting is isolated enough to try out some of those ideas on. I was unaware of the work on finding medians, so that's a good way of dividing the workloads for parallelism.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
On Tue, 2007-12-18 at 09:31 +0000, Simon Riggs wrote: > On Mon, 2007-12-17 at 16:34 -0800, Ron Mayer wrote: > > > PS: Yeah, I know multi-threading is a hot-button on these > > lists; but sorting seems a relatively isolated of the code > > and I'd wonder if it'd be isolate-able enough that multiple > > CPUs could be used there. > > I'm not sure multi-threading is the issue you think. Threads is, but > only for architectural reasons. Using multiple processes to complete a > task seems very sensible to me. My first thought would be that we would need a new executor node (e.g. "ParallelSort") that would only be chosen when the cost of the sort is large enough to outweigh other factors (such as process creation time, dividing available work_mem, and any necessary IPC). It seems to me the simplest way to do it would be to allow each sub process to allocate work_mem/P where P is the degree of parallelization. However, that somewhat works against our schemes for dynamic run handling and forecasting, and may lead to more random reading from disk. Any other scheme I can think of would involve more IPC, which might make the idea just too complex. Regards,Jeff Davis
Jeff Davis wrote: > My first thought would be that we would need a new executor node (e.g. > "ParallelSort") that would only be chosen when the cost of the sort is > large enough to outweigh other factors (such as process creation time, > dividing available work_mem, and any necessary IPC). > > It seems to me the simplest way to do it would be to allow each sub > process to allocate work_mem/P where P is the degree of parallelization. > However, that somewhat works against our schemes for dynamic run > handling and forecasting, and may lead to more random reading from disk. > Any other scheme I can think of would involve more IPC, which might make > the idea just too complex. > I am curious - what algorithms exist to efficiently do a parallel sort? Do you mean if sorting 1 million items, it is possible to separate this into 2 sets of 500 thousand each, execute them in separate threads (with task administration and synchronization overhead) , combine the results, and complete the task in significantly less time than doing it in one thread? I am skeptical that this is possible, and suspect that the overall efficiency of the system would go down even if the throughput of a single execution increases. Or do you mean being able to perform parts of the query plan fully in parallel? If this, then one would need a lot more than ParallelSort... Cheers, mark -- Mark Mielke <mark@mielke.cc>
2007/12/19, Mark Mielke <mark@mark.mielke.cc>:
> Jeff Davis wrote:
> > My first thought would be that we would need a new executor node (e.g. "ParallelSort") that would only be chosen when the cost of the sort is large enough to outweigh other factors (such as process creation time, dividing available work_mem, and any necessary IPC).
> >
> > It seems to me the simplest way to do it would be to allow each sub process to allocate work_mem/P where P is the degree of parallelization. However, that somewhat works against our schemes for dynamic run handling and forecasting, and may lead to more random reading from disk. Any other scheme I can think of would involve more IPC, which might make the idea just too complex.
>
> I am curious - what algorithms exist to efficiently do a parallel sort? Do you mean if sorting 1 million items, it is possible to separate this into 2 sets of 500 thousand each, execute them in separate threads (with task administration and synchronization overhead), combine the results, and complete the task in significantly less time than doing it in one thread? I am skeptical that this is possible, and suspect that the overall efficiency of the system would go down even if the throughput of a single execution increases.

OK - we want to sort a table with quicksort, and we want to do it on N threads. Every thread gets begin and end indices into the table. The first step starts at 0 and ends at count - 1. A single step: find a pivot (median) value, and move lower values before it and bigger values after it. In the normal case we use a recursive call, so the same procedure is called for the two parts. With threads, we can instead put the index ranges on a shared work list, and use a pool of threads to pick up work from that list. We can use a common table; access to the shared list of index ranges has to be serialized.

> Or do you mean being able to perform parts of the query plan fully in parallel? If this, then one would need a lot more than ParallelSort...

Nice to have, but rather for data warehouses. In other cases - i.e. a backend for the Internet - there are many requests and every processor / core works nicely.

-- Regards, Michał Zaborowski (TeXXaS)
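To make the discussion concrete, here is a minimal, self-contained sketch of one way to parallelize a sort with pthreads. It uses the simpler split-sort-merge shape (sort fixed chunks in parallel, then one sequential merge pass) rather than the shared work-list quicksort described above; none of this is PostgreSQL code, and NTHREADS and all names are arbitrary. Compile with -pthread.

    /* Sketch: sort NTHREADS chunks concurrently, then merge sequentially. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NTHREADS 4

    typedef struct { int *base; size_t n; } Chunk;

    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *) a, y = *(const int *) b;
        return (x > y) - (x < y);
    }

    static void *sort_chunk(void *arg)
    {
        Chunk *c = arg;
        qsort(c->base, c->n, sizeof(int), cmp_int);
        return NULL;
    }

    int main(void)
    {
        size_t  n = 1000000;
        int    *data = malloc(n * sizeof(int));
        int    *out = malloc(n * sizeof(int));
        for (size_t i = 0; i < n; i++)
            data[i] = rand();

        /* phase 1: sort NTHREADS chunks in parallel */
        pthread_t   tid[NTHREADS];
        Chunk       chunk[NTHREADS];
        size_t      per = n / NTHREADS;
        for (int t = 0; t < NTHREADS; t++)
        {
            chunk[t].base = data + t * per;
            chunk[t].n = (t == NTHREADS - 1) ? n - t * per : per;
            pthread_create(&tid[t], NULL, sort_chunk, &chunk[t]);
        }
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);

        /* phase 2: naive sequential merge of the sorted chunks */
        size_t  pos[NTHREADS] = {0};
        for (size_t k = 0; k < n; k++)
        {
            int best = -1;
            for (int t = 0; t < NTHREADS; t++)
                if (pos[t] < chunk[t].n &&
                    (best < 0 || chunk[t].base[pos[t]] < chunk[best].base[pos[best]]))
                    best = t;
            out[k] = chunk[best].base[pos[best]++];
        }

        printf("first=%d last=%d\n", out[0], out[n - 1]);
        free(data);
        free(out);
        return 0;
    }

The papers cited earlier in the thread also parallelize the merge phase; this sketch leaves the merge single-threaded, which is where the interesting (and hard) part of the problem actually lives.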
On Tuesday 18 December 2007 10:03:25 Dimitri Fontaine wrote: > Hi, > > Le mardi 18 décembre 2007, Ron Mayer a écrit : > > Has anyone looked into sorting algorithms that could use > > more than one CPU or core at a time? > > [...] > > > PS: Yeah, I know multi-threading is a hot-button on these > > lists; but sorting seems a relatively isolated of the code > > and I'd wonder if it'd be isolate-able enough that multiple > > CPUs could be used there. > > And before that objection to multi-threading implementation and portability > concerns arise, what about using a coroutine BSD-licenced portable > implementation such as Protothreads to have backend code use several CPU at > a time? > http://www.sics.se/~adam/pt/ > > With such a tool, would it be possible to think about producer/consumer > parallel executions for sorting, aggregates nodes or other parts of the > executor? > > Hope this helps, regards, And remember; Users don't care about portability-issues, they care about performance. If multi-threading is a way to speed up sorting considerably, it should, IMHO, be considered seriously. -- Andreas Joseph Krogh
Andreas Joseph Krogh <andreak@officenet.no> writes: > And remember; Users don't care about portability-issues, they care about > performance. Nonsense. If Postgres stops working on their machine, they'll care. regards, tom lane
Michał Zaborowski wrote:
> OK - we want to sort a table with quicksort, and we want to do it on N threads. Every thread gets begin and end indices into the table. The first step starts at 0 and ends at count - 1. A single step: find a pivot (median) value, and move lower values before it and bigger values after it. In the normal case we use a recursive call, so the same procedure is called for the two parts. With threads, we can instead put the index ranges on a shared work list, and use a pool of threads to pick up work from that list. We can use a common table; access to the shared list of index ranges has to be serialized.

Stupid question #2: Is it well recognized that the CPU is the bottleneck in the PostgreSQL sorting mechanism? Or might it be memory bandwidth and I/O?

It would seem to me that any sort worth parallelizing (given the administrative and synchronization overhead) must have data larger than the L2 cache. If larger than the L2 cache, it becomes real memory speed. If real memory speed, wouldn't one CPU, without hardware synchronization, be able to fill the memory read/write pipe? If we 'divide and conquer' to parallelize, wouldn't the values written from one thread often (1 / N of the time) need to be read from another thread, requiring hardware data synchronization?

I see the wikipedia.org page describes how easy it is to parallelize quicksort, and to scale performance linearly with the number of processors, but I don't see references to back this claim. At least some of these steps seem difficult or impractical to parallelize. For example, the initial partition reorder that moves items lower than the pivot to the left, and items higher than the pivot to the right, would not be easy to parallelize using an in-place re-order. It needs to move one partition down before it can 'divide and conquer'. They say no synchronization is required, but I think they are missing the hardware synchronization required (especially in the innermost loops where the thread task becomes shorter, and starts to fit in L1/L2). They say linear, but then talk about a 'new thread being created'. New thread creation has a cost, and if reduced to using a thread pool, then synchronization *is* required.

It sounds like a 'nice in theory' idea. :-) Which doesn't mean it is wrong...

I am curious enough to write a test...

> > Or do you mean being able to perform parts of the query plan fully in parallel? If this, then one would need a lot more than ParallelSort...
>
> Nice to have, but rather for data warehouses. In other cases - i.e. a backend for the Internet - there are many requests and every processor / core works nicely.

I'm a fan of the 'each plan item is a task that is assigned to the pool, with each CPU grabbing tasks from the pool' approach. Another 'nice in theory' idea (used by DB2?). As it is, though, I think PostgreSQL planning is heavily designed to maximize performance on a single CPU, and single queries would not easily scale to multiple CPUs. (Perhaps hashing could be done on another CPU, or, as you describe above, sorting.)

Cheers,
mark

-- Mark Mielke <mark@mielke.cc>
On Wed, 2007-12-19 at 12:08 -0500, Mark Mielke wrote: > Stupid question #2: Is it well recognized that the CPU is the > bottleneck in the PostgreSQL sorting mechanism? Or might it be memory > bandwidth and I/O? > I think it depends a lot on several factors. It's probably a different bottleneck for integers versus localized text, and depends on the available memory and I/O characteristics. > > It would seem to me that any sort worth parallelizing (administrative > and synchronization overhead), must have data larger than the L2 > cache. If larger than the L2 cache, it becomes real memory speed. If > real memory speed, wouldn't one CPU without hardware synchronization, > be able to fill the memory read/write pipe? We do an external merge sort, which involves merging M runs together. You seem to be implying that we can generate the output run at disk speed, and therefore the CPU speed is unimportant. I suspect that the comparison costs are enough that the above statement isn't true in all cases, particularly in the case of localized text. Also, there is probably a lot of memory copying going on, and that probably destroys a lot of the effectiveness of L2 caching. When L2 caching is ineffective, the CPU spends a lot of time just waiting on memory. In that case, it's better to have P threads of execution all waiting on memory operations in parallel. This would explain why "1p2t" would outperform a "1p1t" in Ron's reference above. These are just my first thoughts, however. There is a lot of existing research out there that we can look into, and also a lot of tests that we can run before jumping into this. I think parallel sorting can be looked into separately from the other sorting improvements. Regards,Jeff Davis
Mark Mielke wrote:
> I am curious - what algorithms exist to efficiently do a parallel sort? Do you mean if sorting 1 million items, it is possible to separate this into 2 sets of 500 thousand each, execute them in separate threads (with task administration and synchronization overhead), combine the results, and complete the task in significantly less time than doing it in one thread? I am skeptical that this is possible...

The link in the beginning of the thread points to articles that seem to describe one such algorithm, along with benchmarks. (http://tinyurl.com/3bvu4u, http://tinyurl.com/32wg2m)

The improvements were pretty consistent for set sizes ranging from very small sets (hundreds) to quite large ones (hundreds of K). Interestingly, even multi-threading helped a lot.

"Our tests correlate well with previous research that showed Intel's implementation of SMT (Hyper-Threading) to be adept at hiding this latency [6, 20, 12]. Table 4 shows that by having two threads access memory at the same time, performance improved over 80% when compared to the single-threaded version."

It uses both quicksort phases and merge phases; for the merge phase, using 2 CPUs (no hyperthreading) apparently gave more than a 2X speed improvement; apparently because it could parallelize memory access with CPU work more.

> Or do you mean being able to perform parts of the query plan fully in parallel? If this, then one would need a lot more than ParallelSort...

I wouldn't recommend that - it seems like a Hard Problem. My guess is that the best way to use multiple threads in one backend would be to find specific algorithms like sorting that would be easier to isolate.
Ron Mayer wrote:
> The link in the beginning of the thread points to articles that seem to describe one such algorithm, along with benchmarks. (http://tinyurl.com/3bvu4u, http://tinyurl.com/32wg2m)
>
> The improvements were pretty consistent for set sizes ranging from very small sets (hundreds) to quite large ones (hundreds of K). Interestingly, even multi-threading helped a lot.
>
> "Our tests correlate well with previous research that showed Intel's implementation of SMT (Hyper-Threading) to be adept at hiding this latency [6, 20, 12]. Table 4 shows that by having two threads access memory at the same time, performance improved over 80% when compared to the single-threaded version."
>
> It uses both quicksort phases and merge phases; for the merge phase, using 2 CPUs (no hyperthreading) apparently gave more than a 2X speed improvement; apparently because it could parallelize memory access with CPU work more.

Good points. I had forgotten about DDR and DDR2 having high throughput at the cost of high latency. Somewhere in there, having the largest number of memory requests in the queue would allow hardware to eliminate this high latency effect.

> > Or do you mean being able to perform parts of the query plan fully in parallel? If this, then one would need a lot more than ParallelSort...
>
> I wouldn't recommend that - it seems like a Hard Problem. My guess is that the best way to use multiple threads in one backend would be to find specific algorithms like sorting that would be easier to isolate.

Also a good point. :-)

Cheers,
mark

-- Mark Mielke <mark@mielke.cc>
Jeff Davis wrote:
> On Wed, 2007-12-19 at 12:08 -0500, Mark Mielke wrote:
> > Stupid question #2: Is it well recognized that the CPU is the bottleneck in the PostgreSQL sorting mechanism? Or might it be memory bandwidth and I/O?
>
> I think it depends a lot on several factors. It's probably a different bottleneck for integers versus localized text, and depends on the available memory and I/O characteristics.

Makes sense.

> > It would seem to me that any sort worth parallelizing (administrative and synchronization overhead), must have data larger than the L2 cache. If larger than the L2 cache, it becomes real memory speed. If real memory speed, wouldn't one CPU without hardware synchronization, be able to fill the memory read/write pipe?
>
> We do an external merge sort, which involves merging M runs together. You seem to be implying that we can generate the output run at disk speed, and therefore the CPU speed is unimportant.

Correct. Or, alternatively, you could achieve the same effect using asynchronous I/O or read ahead.

> I suspect that the comparison costs are enough that the above statement isn't true in all cases, particularly in the case of localized text.

That sounds possible, but I still feel myself suspecting that disk reads will be much slower than localized text comparison. Perhaps I am overestimating the performance of the comparison function?

> Also, there is probably a lot of memory copying going on, and that probably destroys a lot of the effectiveness of L2 caching. When L2 caching is ineffective, the CPU spends a lot of time just waiting on memory. In that case, it's better to have P threads of execution all waiting on memory operations in parallel.

I didn't consider the high throughput / high latency effect. This could be true if the CPU prefetch isn't effective enough.

> This would explain why "1p2t" would outperform a "1p1t" in Ron's reference above. These are just my first thoughts, however. There is a lot of existing research out there that we can look into, and also a lot of tests that we can run before jumping into this. I think parallel sorting can be looked into separately from the other sorting improvements.

Yep - I started to read up on it. It still sounds like it's a hard-ish problem (to achieve near N times speedup for N CPU cores without degrading performance for existing loads), but that doesn't mean impossible. :-)

Cheers,
mark

-- Mark Mielke <mark@mielke.cc>
As long as sorting improvements are being considered, may I suggest an experiment that uses a very simple model? Assuming that you have K subfiles created by the initial sorting pass, insert the top record of each file into a priority queue. Then, emit records from the queue until the priority queue is empty. Now, there will be the objection that we will be jumping willy-nilly all over the disk because of reading one record at a time, but (depending on how it is implemented) generally several records are buffered during a read. So (as a gentle suggestion) I suggest testing the model. It works great for a single CPU or multiple CPU system for the work that *I* do. I have no idea if it will be a benefit for PostgreSQL or not, but it should be a very simple matter to try it. As long as someone is doing the work right now, it would be a good time to give it a go. I am not very familiar with PostgreSQL internals, but I would be willing to give a hand with it (not really sure how much time I can guarantee, though, since I would be doing it on my free time).
On Wed, 2007-12-19 at 15:51 -0500, Mark Mielke wrote: > That sounds possible, but I still feel myself suspecting that disk > reads will be much slower than localized text comparison. Perhaps I am > overestimating the performance of the comparison function? I think this simple test will change your perceptions: Do an initdb with --locale="en_US.UTF-8" and start postgres. test=> create table sorter(t text, b bytea, f float); CREATE TABLE test=> insert into sorter select r AS rt, r::text::bytea AS rb, r AS rf from (select random() as r from generate_series(1,1000000)) a; INSERT 0 1000000 test=> select pg_size_pretty(pg_total_relation_size('sorter')); pg_size_pretty ---------------- 70 MB (1 row) test=> explain analyze select * from sorter order by t; test=> explain analyze select * from sorter order by b; test=> explain analyze select * from sorter order by f; On my machine this table fits easily in memory (so there aren't any disk reads at all). Sorting takes 7 seconds for floats, 9 seconds for binary data, and 20 seconds for localized text. That's much longer than it would take to read that data from disk, since it's only 70MB (which takes a fraction of a second on my machine). I think this disproves your hypothesis that sorting happens at disk speed. > Yep - I started to read up on it. It still sounds like it's a hard-ish > problem (to achieve near N times speedup for N CPU cores without > degrading performance for existing loads), but that doesn't mean > impossible. :-) > You don't even need multiple cores to achieve a speedup, according to Ron's reference. Regards,Jeff Davis
On Wed, 2007-12-19 at 14:41 -0800, Dann Corbit wrote:
> As long as sorting improvements are being considered, may I suggest an experiment that uses a very simple model?
>
> Assuming that you have K subfiles created by the initial sorting pass, insert the top record of each file into a priority queue.
>
> Then, emit records from the queue until the priority queue is empty.

What is the principal difference between that idea and our existing sort algorithm?

There's a good explanation in the comment at the top of tuplesort.c.

Regards, Jeff Davis
> -----Original Message-----
> From: Jeff Davis [mailto:pgsql@j-davis.com]
> Sent: Wednesday, December 19, 2007 3:10 PM
> To: Dann Corbit
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Sorting Improvements for 8.4
>
> On Wed, 2007-12-19 at 14:41 -0800, Dann Corbit wrote:
> > As long as sorting improvements are being considered, may I suggest an experiment that uses a very simple model?
> >
> > Assuming that you have K subfiles created by the initial sorting pass, insert the top record of each file into a priority queue.
> >
> > Then, emit records from the queue until the priority queue is empty.
>
> What is the principal difference between that idea and our existing sort algorithm?
>
> There's a good explanation in the comment at the top of tuplesort.c.

According to the comments, PostgreSQL uses replacement selection. Replacement selection is a wonderful thing because it creates runs that are twice as long as normal due to the snowplow effect. See (for instance):
http://ieeexplore.ieee.org/Xplore/login.jsp?url=/iel5/69/27216/01209012.pdf

Then, the merge routine will have half as many runs to merge the files together. So (for instance) if you would otherwise create 1024 subfiles, then replacement selection will create 512. That saves one merge pass.

The algorithm that I am suggesting will take exactly one pass to merge all of the files. It works like this...

Imagine an array of pointers to the subfiles: [*subfile][*subfile]...[*subfile]

Step 0: We sort the array by a comparison operator that examines the top element of each subfile. So now the array is ordered such that the record with the smallest key is in array slot 0.

Step 1: We remove the first record from the subfile in array slot 0. Now, the priority of the first element *may* have changed. So if it is no longer smaller than the subfile immediately to the right, we do a binary insertion to put this subfile in its new location, moving the contents of array slot[1] to array slot 0 if needed.

Step 2: Is the entire list of subfiles empty? If yes, then terminate; if no, then go to Step 1.

Like I said, it is ultra-simple and it sorts the entire contents of all subfiles to the output with a single pass.

Consider the way that current replacement selection works. The actual O(f(N)) behavior of replacement selection is just terrible O(n^2). But because we save one full merge pass, it is usually worth it anyway, since memory access is much faster than disk. And if we only have a few subfiles, the savings will be large. In the case of a priority queue merge, we only have one single merge pass no matter how many subfiles there are.
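For concreteness, a small self-contained sketch of the merge Dann outlines (Steps 0-2), with in-memory arrays standing in for the sorted subfiles and purely illustrative names: the cursors are kept ordered by their front record, the front record of slot 0 is emitted, and the slot-0 cursor is re-inserted by binary search whenever its new front record no longer belongs first.

    /* Sketch of a single-pass K-way merge over sorted "subfiles". */
    #include <stdio.h>
    #include <string.h>

    typedef struct
    {
        const int  *rec;    /* remaining records of this subfile (sorted) */
        int         nleft;  /* how many records remain */
    } Cursor;

    static int front(const Cursor *c) { return c->rec[0]; }

    /* Re-place the cursor in slot 0 after its front record changed, or drop
     * it if the subfile is exhausted.  Returns the new live cursor count. */
    static int
    reposition(Cursor *a, int k)
    {
        Cursor  moved = a[0];
        int     lo = 1, hi = k;

        if (moved.nleft == 0)
        {
            memmove(&a[0], &a[1], (k - 1) * sizeof(Cursor));
            return k - 1;
        }
        while (lo < hi)                 /* binary search for the new slot */
        {
            int mid = (lo + hi) / 2;
            if (front(&a[mid]) < front(&moved))
                lo = mid + 1;
            else
                hi = mid;
        }
        memmove(&a[0], &a[1], (lo - 1) * sizeof(Cursor));
        a[lo - 1] = moved;
        return k;
    }

    int
    main(void)
    {
        static const int s1[] = {1, 4, 9};
        static const int s2[] = {2, 3, 10};
        static const int s3[] = {5, 6, 7};
        Cursor  a[3] = {{s1, 3}, {s2, 3}, {s3, 3}};
        int     k = 3;

        /* Step 0: cursors ordered by their front record (already true here). */
        while (k > 0)
        {
            printf("%d ", front(&a[0]));    /* emit the smallest record     */
            a[0].rec++;                     /* Step 1: advance slot 0       */
            a[0].nleft--;
            k = reposition(a, k);           /* Step 2: restore the ordering */
        }
        printf("\n");                       /* prints 1 2 3 4 5 6 7 9 10    */
        return 0;
    }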
P.S. A beautiful paper on replacement selection is found here: http://students.fim.uni-passau.de/~fickensc/Proseminar/Proseminar.pdf
On Wed, 2007-12-19 at 15:19 -0800, Dann Corbit wrote:
> The algorithm that I am suggesting will take exactly one pass to merge all of the files.

From tuplesort.c:

"In the current code we determine the number of tapes M on the basis of workMem: we want workMem/M to be large enough that we read a fair amount of data each time we preread from a tape, so as to maintain the locality of access described above. Nonetheless, with large workMem we can have many tapes."

It seems like you are just choosing M to be equal to the number of initial runs, whereas the current code takes into account the cost of having workMem/M too small.

We do want to increase the number of runs that can be merged at once; that's what dynamic run handling and forecasting are all about. But we want to avoid unnecessary seeking, also.

Regards, Jeff Davis
Mark Mielke <mark@mark.mielke.cc> writes: > Jeff Davis wrote: >> Also, there is probably a lot of memory copying going on, and that >> probably destroys a lot of the effectiveness of L2 caching. When L2 >> caching is ineffective, the CPU spends a lot of time just waiting on >> memory. In that case, it's better to have P threads of execution all >> waiting on memory operations in parallel. >> > I didn't consider the high throughput / high latency effect. This could > be true if the CPU prefetch isn't effective enough. Note that if this is the argument, then there's a ceiling on the speedup you can expect to get: it's just the extent of mismatch between the CPU and memory speeds. I can believe that suitable test cases would show 2X improvement for 2 threads, but it doesn't follow that you will get 10X improvement with 10 threads, or even 4X with 4. regards, tom lane
"Jeff Davis" <pgsql@j-davis.com> writes: > test=> explain analyze select * from sorter order by t; > test=> explain analyze select * from sorter order by b; > test=> explain analyze select * from sorter order by f; > > On my machine this table fits easily in memory (so there aren't any disk > reads at all). Sorting takes 7 seconds for floats, 9 seconds for binary > data, and 20 seconds for localized text. That's much longer than it > would take to read that data from disk, since it's only 70MB (which > takes a fraction of a second on my machine). > > I think this disproves your hypothesis that sorting happens at disk > speed. I suspect most of that is spent just copying the data around. Which would not be helped by having multiple threads doing the copying -- and in fact might be exacerbated if it required an extra copy to consolidate all the data in the end. How long does a "explain analyze sinmple select * from sorter" take? And assuming you're doing disk sorts (in disk cache) you're doing quite a lot of copying to temporary files (in disk cache) and then back to memory. Note that speeding up a query from 20s to 5s isn't terribly useful. If it's OLTP you can't be using all your cores for each user anyways. And if it's DSS 20s isn't a problem. Where parallel processing like this becomes attractive is when you're running a 2 hour query on a machine sequentially running scheduled batch jobs which can be sped up to 30 minutes. But in that case you're almost certainly being limited by your disk bandwidth, not your cpu speed. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!
> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Gregory Stark
> Sent: Wednesday, December 19, 2007 5:26 PM
> To: Jeff Davis
> Cc: Mark Mielke; Michał Zaborowski; Simon Riggs; Ron Mayer; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Sorting Improvements for 8.4
>
> "Jeff Davis" <pgsql@j-davis.com> writes:
>
> > test=> explain analyze select * from sorter order by t;
> > test=> explain analyze select * from sorter order by b;
> > test=> explain analyze select * from sorter order by f;
> >
> > On my machine this table fits easily in memory (so there aren't any disk reads at all). Sorting takes 7 seconds for floats, 9 seconds for binary data, and 20 seconds for localized text. That's much longer than it would take to read that data from disk, since it's only 70MB (which takes a fraction of a second on my machine).
> >
> > I think this disproves your hypothesis that sorting happens at disk speed.
>
> I suspect most of that is spent just copying the data around. Which would not be helped by having multiple threads doing the copying -- and in fact might be exacerbated if it required an extra copy to consolidate all the data in the end.

Benchmarking a single system will really only explain that system.

Someone may have a disk farm with 2GB/Sec throughput:
http://www.sql-server-performance.com/articles/per/system_storage_configuration_p1.aspx
But such a configuration is very unlikely.

Someone may have 10GB/S NIC cards, but those too, are rare.

So for any benchmark, we will really just end up with a number for that system. Typically, disk is the bottleneck.

I found this on the net somewhere, but it's quite a useful table for capacity planning (to find the weak link in the chain using back of the envelope calculations):

Interface                                  Width           Frequency      Bytes/Sec   Bits/Sec
4-way interleaved PC1600 (DDR200) SDRAM    4 x 64bits      100 MHz DDR    6.4 GB/s    51 Gbps
Opteron HyperTransport memory bus          128bits         200 MHz DDR    6.4 GB/s    51 Gbps
Pentium 4 "800 MHz" FSB                    64bits          200 MHz QDR    6.4 GB/s    51 Gbps
PC2 6400 (DDR-II 800) SDRAM                64bits          400 MHz DDR    6.4 GB/s    51 Gbps
PC2 5300 (DDR-II 667) SDRAM                64bits          333 MHz DDR    5.3 GB/s    43 Gbps
Pentium 4 "533 MHz" FSB                    64bits          133 MHz QDR    4.3 GB/s    34 Gbps
PC2 4300 (DDR-II 533) SDRAM                64bits          266 MHz DDR    4.3 GB/s    34 Gbps
2-channel PC1066 RDRAM                     2 x 16bits      533 MHz DDR    4.3 GB/s    34 Gbps
PCI-X 533                                  64bits          533 MHz        4.3 GB/s    34 Gbps
PCI-Express x16                            serial/16lanes  2.5 GHz        4 GB/s      32 Gbps
Pentium 4 "400 MHz" FSB                    64bits          100 MHz QDR    3.2 GB/s    25.6 Gbps
2-channel PC800 RDRAM                      2 x 16bits      400 MHz DDR    3.2 GB/s    25.6 Gbps
2-way interleaved PC1600 (DDR200) SDRAM    2 x 64bits      100 MHz DDR    3.2 GB/s    25.6 Gbps
PC2 3200 (DDR-II 400) SDRAM                64bits          200 MHz DDR    3.2 GB/s    25.6 Gbps
PC3200 (DDR400) SDRAM                      64bits          200 MHz DDR    3.2 GB/s    25.6 Gbps
PC2700 (DDR333) SDRAM                      64bits          167 MHz DDR    2.7 GB/s    21 Gbps
PC2100 (DDR266) SDRAM                      64bits          133 MHz DDR    2.1 GB/s    17 Gbps
AGP 8x                                     32bits          533 MHz        2.1 GB/s    17 Gbps
PCI-X 266                                  64bits          266 MHz        2.1 GB/s    17 Gbps
PCI-Express x8                             serial/8lanes   2.5 GHz        2 GB/s      16 Gbps
EV6 bus (Athlon/Duron FSB)                 64bits          100 MHz DDR    1.6 GB/s    13 Gbps
PC1600 (DDR200) SDRAM                      64bits          100 MHz DDR    1.6 GB/s    13 Gbps
PC800 RDRAM                                16bits          400 MHz DDR    1.6 GB/s    13 Gbps
PC150 SDRAM                                64bits          150 MHz        1.3 GB/s    10.2 Gbps
10 gigabit ethernet                        serial          10 GHz         1.25 GB/s   10 Gbps
OC-192                                     serial          9.953 GHz      1.24 GB/s   9.953 Gbps
133 MHz FSB                                64bits          133 MHz        1.06 GB/s   8.5 Gbps
PC133 SDRAM                                64bits          133 MHz        1.06 GB/s   8.5 Gbps
AGP 4x                                     32bits          266 MHz        1.06 GB/s   8.5 Gbps
PCI-X                                      64bits          133 MHz        1.06 GB/s   8.5 Gbps
PCI-Express x4                             serial/4lanes   2.5 GHz        1 GB/s      8 Gbps
100 MHz FSB                                64bits          100 MHz        800 MB/s    6.4 Gbps
PC100 SDRAM                                64bits          100 MHz        800 MB/s    6.4 Gbps
PC66 SDRAM                                 64bits          66 MHz         533 MB/s    4.3 Gbps
fast/wide PCI                              64bits          66 MHz         533 MB/s    4.3 Gbps
AGP 2x                                     32bits          133 MHz        533 MB/s    4.3 Gbps
single-link DVI                            12bits          165 MHz DDR    495 MB/s    3.96 Gbps
Ultra-320 SCSI                             16bits          160 MHz        320 MB/s    2.6 Gbps
OC-48 network                              serial          2.488 GHz      311 MB/s    2.488 Gbps
AGP                                        32bits          66 MHz         266 MB/s    2.1 Gbps
PCI-Express x1                             serial          2.5 GHz        250 MB/s    2 Gbps
Serial ATA/1500 disk                       serial          1.5 GHz        187 MB/s    1.5 Gbps
Ultra-160 SCSI                             16bits          80 MHz         160 MB/s    1.3 Gbps
OC-24 network                              serial          1.244 GHz      155 MB/s    1.244 Gbps
PCI                                        32bits          33 MHz         133 MB/s    1.06 Gbps
ATA/133 disk                               8bits           66 MHz DDR     133 MB/s    1.06 Gbps
gigabit ethernet                           serial          1 GHz          125 MB/s    1 Gbps
ATA/100 disk                               8bits           50 MHz DDR     100 MB/s    800 Mbps
IEEE 1394b                                 serial          800 MHz        100 MB/s    800 Mbps
Ultra-2 Wide SCSI                          16bits          40 MHz         80 MB/s     640 Mbps
OC-12 network                              serial          622.08 MHz     77.7 MB/s   622.08 Mbps
ATA/66 disk                                8bits           33 MHz DDR     66 MB/s     533 Mbps
USB-2                                      serial          480 MHz        60 MB/s     480 Mbps
IEEE 1394                                  serial          400 MHz        50 MB/s     400 Mbps
Ultra Wide SCSI                            16bits          20 MHz         40 MB/s     320 Mbps
ATA/33 disk                                8bits           16.6 MHz DDR   33 MB/s     266 Mbps
Fast Wide SCSI                             16bits          10 MHz         20 MB/s     160 Mbps
OC-3 network                               serial          155.52 MHz     19.4 MB/s   155.52 Mbps
100baseT ethernet                          serial          100 MHz        12.5 MB/s   100 Mbps
OC-1 network                               serial          51.84 MHz      6.5 MB/s    51.84 Mbps
T-3 network                                serial          45 MHz         5.6 MB/s    44.736 Mbps
USB                                        serial          12 MHz         1.5 MB/s    12 Mbps
10baseT ethernet                           serial          10 MHz         1.25 MB/s   10 Mbps
IrDA-2                                     serial          4 MHz          500 KB/s    4 Mbps
T-1 network                                serial          1.5 MHz        193 KB/s    1.544 Mbps

> How long does a simple "explain analyze select * from sorter" take?
>
> And assuming you're doing disk sorts (in disk cache) you're doing quite a lot of copying to temporary files (in disk cache) and then back to memory.
>
> Note that speeding up a query from 20s to 5s isn't terribly useful. If it's OLTP you can't be using all your cores for each user anyways. And if it's DSS 20s isn't a problem.

Unless (of course) there are 20,000 users doing the queries that would take 20 seconds but now they take 5 (when run single-user). They will still have a bit of a wait, of course.

> Where parallel processing like this becomes attractive is when you're running a 2 hour query on a machine sequentially running scheduled batch jobs which can be sped up to 30 minutes. But in that case you're almost certainly being limited by your disk bandwidth, not your cpu speed.

A linear speedup of 2 or more is always worthwhile[*]. Since sorting (e.g. for 'group by' and 'order by') and sort joins are a major database task, I guess that a linear speedup by a factor of 2 might make the database operations on the whole be 10% faster or so {OK, it's a SWAG}. I guess it would look good on the benchmarks, if nothing else.

[*] unless it is already fast enough. If, at peak load, a query takes 1 ms, then making the query take 0.5 ms is not going to win you any medals, especially if the improvement costs $10,000.
On Wed, 19 Dec 2007, Dann Corbit wrote: > Benchmarking a single system will really only explain that system. > Someone may have a disk farm with 2GB/Sec throughput > But such a configuration is very unlikely. If you believe comments like those at http://www.c0t0d0s0.org/archives/1792-Do-it-yourself-X4500.html it's possible to hit >2GB/s total to the 48 disks in one of the Sun X4500 servers, which start at $24K. May be unlikely to you, but I was reading there after I set one up last night, and that's a boring standard configuration for some Sun and Greenplum customers. Also, that's today--by the time 8.4 is mainstream high-end machines will be even faster. Wanna make a bet on how much disk throughput will be available as SSD disks go mainstream in the next two years? -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
"Dann Corbit" <DCorbit@connx.com> writes: >> Note that speeding up a query from 20s to 5s isn't terribly useful. If it's >> OLTP you can't be using all your cores for each user anyways. And if it's >> DSS 20s isn't a problem. > > Unless (of course) there are 20,000 users doing the queries that would take 20 > seconds but now they take 5 (when run single-user). They will still have a bit > of a wait, of course. I'm not exactly following. If you have 20,000 users then you're probably using all the processors already. If you process them one by one on 4 cores in 5s then you'll get the same throughput as if you ran them four at a time on 1 core each in 20s. >> Where parallel processing like this becomes attractive is when you're >> running a 2 hour query on a machine sequentially running scheduled batch >> jobs which can be sped up to 30 minutes. But in that case you're almost >> certainly being limited by your disk bandwidth, not your cpu speed. > > A linear speedup of 2 or more is always worth while[*]. Since sorting (e.g. for > group by' and 'order by') and sort joins are a major database task, I guess > that a linear speedup by a factor of 2 might make the database operations on > the whole be 10% faster or so {OK, it's a SWAG}. I guess it would look good on > the benchmarks, if nothing else. Except note that you're not getting this linear speedup for free. To get a linear speedup of 2x you'll be using more than 2x the cpu resources. If there is nothing else contending for that resource (such as the scenario I described where you're running a single large batch query on a system and want to use all available resources to run it as fast as possible), then you'll get a 2x speedup. But if there is more than one query running on the system then you're not actually gaining anything. Each query will run faster but you won't be able to run as many simultaneously without having them slow back down. And the overhead of parallelizing the query will be a net loss. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
"Greg Smith" <gsmith@gregsmith.com> writes: > On Wed, 19 Dec 2007, Dann Corbit wrote: > >> Benchmarking a single system will really only explain that system. >> Someone may have a disk farm with 2GB/Sec throughput >> But such a configuration is very unlikely. > > If you believe comments like those at > http://www.c0t0d0s0.org/archives/1792-Do-it-yourself-X4500.html it's possible > to hit >2GB/s total to the 48 disks in one of the Sun X4500 servers, which > start at $24K. May be unlikely to you, but I was reading there after I set one > up last night, and that's a boring standard configuration for some Sun and > Greenplum customers. Surely such machines have kickass memory backplanes too though? How could it ever be reasonable to have an i/o controller with more bandwidth than your memory? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
Tom Lane wrote: > ...I can believe that suitable test cases would show > 2X improvement for 2 threads, One other thing I found interesting is that their test case showed a near 2X improvement for hyperthreading, whereas I haven't heard of many other ways to get hyperthreading to show improvements for PostgreSQL. > but it doesn't follow that you will get > 10X improvement with 10 threads, or even 4X with 4. Yeah - unless those 10 cores have additional I/O to the memories compared to a 1 core system (which I'd hope would be the case or else I'd expect many apps would run into memory bottlenecks on such systems, no?).
On Thu, 20 Dec 2007, Gregory Stark wrote: > Surely such machines have kickass memory backplanes too though? How could it > ever be reasonable to have an i/o controller with more bandwidth than your > memory? Dann had the right general numbers here--max of 6.4GB/s between processors and you might coax an aggregate of double that out of the DDR RAM with 2 4-way interleaved banks of memory. Let's call it 12GB/s theoretical max. If the theoretical max of the disks is 2GB/s, that's only a 6:1 headroom, and with a decent cache rate it's not outrageous to imagine you could bottleneck on memory with some things before you run out of disk throughput. Right now I think a lot of the disk bottlenecks are seek-limited more than anything, but SSD will knock that one out for apps that are more about throughput than maximum storage. I could already switch to SSD usefully today for some of what I do that's in that category, it's just a bit too expensive to do yet; soon, though. Just trying to usefully estimate where the edge of that back of the envelope should go to. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Wed, Dec 19, 2007 at 07:17:21PM -0800, Ron Mayer wrote: > > but it doesn't follow that you will get > > 10X improvement with 10 threads, or even 4X with 4. > > Yeah - unless those 10 cores have additional I/O to the > memories compared to a 1 core system (which I'd hope > would be the case or else I'd expect many apps would > run into memory bottlenecks on such systems, no?). I don't suppose you saw the document from Ulrich Drepper, "What Every Programmer Should Know About Memory". It's a fact that most machines with multiple cores have less L2 cache/core than a single core machine. And having multiple conduits to main memory isn't that common at all. So having more threads sometimes *decreases* performance because you cut the L2 cache and memory bandwidth in half. The document is very useful for getting tips about how to work out optimal thread/memory/datasize ratios. The way around this is a NUMA architecture, but that's a whole other ball of wax. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Those who make peaceful revolution impossible will make violent revolution inevitable. > -- John F Kennedy
On Thu, 20 Dec 2007, Martijn van Oosterhout wrote: > The way around this is a NUMA architecture, but that's a whole > other ball of wax. Quick note for those reading Ulrich's paper: he refers in a couple of places to Intel's upcoming CSI approach to NUMA. This has now been renamed QuickPath, and it looks like it will be late 2008 before that even makes it to Itanium processors. The fact that AMD has a good NUMA implementation in their Opteron lines while Intel's Xeon processors do not is one area AMD still has a clear competitive lead on. But you need a memory-bandwidth-starved application before that matters more than the fact that the current Xeons are faster in general. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
While we're blue skying things, I've had an idea for a sorting algorithm kicking around for a couple of years that might be interesting. It's a variation on heapsort to make it significantly more block-friendly. I have no idea if the idea would work, or how well it'd work, but it might be worthwhile kicking around. Now, the core idea of heapsort is that the array is put into heap order- basically, that a[i] >= a[2i+1] and a[i] >= a[2i+2] (doing the 0-based array version here). The problem is that, assuming that the length of a is larger than memory, then a[2i+1] is likely going to be on a different page or block than a[i]. That means every time you have to bubble down a new element, you end up reading O(log N) blocks- this is *per element*. The variation is to instead work with blocks, so you have a block of entries b[i], and you change the definition of heap order, so that min(b[i]) >= max(b[2i+1]) and min(b[i]) >= max(b[2i+2]). Also, during bubble down, you need to be careful to only change the minimum value of one of the two child blocks b[2i+1] and b[2i+2]. Other than that, the algorithm works as normal. The advantage of doing it this way is that while each bubble down still takes O(log N) blocks being touched, you get an entire block worth of results for your effort. Make your blocks large enough (say, 1/4 the size of workmem) and you greatly reduce N, the number of blocks you have to deal with, and get much better I/O (when you're reading, you're reading megabytes at a shot). Now, there are boatloads of complexities I'm glossing over here. This is more of a sketch of the idea. But it's something to consider. Brian
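To make the block-heap invariant concrete, here is a toy Python sketch of one way it could work. The block size, the merge-and-redistribute rule inside sift_down (which is cruder than the "only change the minimum of one child block" refinement Brian describes), and every name in it are illustrative assumptions, not tuplesort code:

    # Toy block heap: every node is a sorted block of keys, and the invariant
    # is min(parent) >= max(child) for both children, so the root block always
    # holds the largest remaining keys.

    BLOCK = 4   # toy block size; the proposal suggests something like 1/4 of workmem

    def sift_down(blocks, i):
        """Restore min(b[i]) >= max(child) below node i by merging the parent
        block with a violating child and giving the parent the larger keys."""
        n = len(blocks)
        while True:
            bad = [c for c in (2 * i + 1, 2 * i + 2)
                   if c < n and min(blocks[i]) < max(blocks[c])]
            if not bad:
                return
            c = max(bad, key=lambda j: max(blocks[j]))   # worst offender first
            k = len(blocks[c])
            merged = sorted(blocks[i] + blocks[c])
            blocks[c] = merged[:k]        # child keeps the k smallest keys
            blocks[i] = merged[k:]        # parent keeps the larger keys
            sift_down(blocks, c)          # the child's own subtree may now need fixing

    def block_heapify(blocks):
        for i in reversed(range(len(blocks) // 2)):
            sift_down(blocks, i)

    def pop_largest_block(blocks):
        """Remove and return the root block, i.e. the largest remaining keys."""
        top = blocks[0]
        last = blocks.pop()
        if blocks:
            blocks[0] = last
            sift_down(blocks, 0)
        return top

    if __name__ == "__main__":
        import random
        data = [random.randrange(1000) for _ in range(10 * BLOCK)]
        blocks = [sorted(data[j:j + BLOCK]) for j in range(0, len(data), BLOCK)]
        block_heapify(blocks)
        out = []
        while blocks:
            out.extend(reversed(pop_largest_block(blocks)))   # descending output
        assert out == sorted(data, reverse=True)

The only point of the exercise is that each sift_down moves a handful of whole blocks rather than touching one element per level, which is the I/O behaviour the proposal is after.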
> -----Original Message----- > From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers- > owner@postgresql.org] On Behalf Of Brian Hurt > Sent: Thursday, December 20, 2007 6:42 AM > To: pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Sorting Improvements for 8.4 > > While we're blue skying things, I've had an idea for a sorting algorithm > kicking around for a couple of years that might be interesting. It's a > variation on heapsort to make it significantly more block-friendly. I > have no idea if the idea would work, or how well it'd work, but it might > be worthwhile kicking around. > > Now, the core idea of heapsort is that the array is put into heap order- > basically, that a[i] >= a[2i+1] and a[i] >= a[2i+2] (doing the 0-based > array version here). The problem is that, assuming that the length of a > is larger than memory, then a[2i+1] is likely going to be on a different > page or block than a[i]. That means every time you have to bubble down > a new element, you end up reading O(log N) blocks- this is *per element*. > > The variation is to instead work with blocks, so you have a block of > entries b[i], and you change the definition of heap order, so that > min(b[i]) >= max(b[2i+1]) and min(b[i]) >= max(b[2i+2]). Also, during > bubble down, you need to be carefull to only change the minimum value of > one of the two child blocks b[2i+1] and b[2i+2]. Other than that, the > algorithm works as normal. The advantage of doing it this way is that > while each bubble down still takes O(log N) blocks being touched, you > get a entire block worth of results for your effort. Make your blocks > large enough (say, 1/4 the size of workmem) and you greatly reduce N, > the number of blocks you have to deal with, and get much better I/O > (when you're reading, you're reading megabytes at a shot). > > Now, there are boatloads of complexities I'm glossing over here. This > is more of a sketch of the idea. But it's something to consider. It's an interesting idea to work with a "heap of heaps" where you try to keep each heap page-sized. It reminds me of the B+ tree, where you collect a whole bunch of nodes into a single page. I don't know if you have examined weak-heaps, but there are some interesting results for weak-heap approaches. As you know, heapsort variants do not degenerate to O(N^2). On this link: http://www.jea.acm.org/2002/EdelkampHeapsort/ I highly recommend all the goodies he has embedded (papers, source, etc.)
Gregory Stark wrote: > Note that speeding up a query from 20s to 5s isn't terribly useful. I disagree totally with that. That is the difference between no chance of someone waiting for a web page to load and a good chance they'd wait. And 2s vs 0.5s is the difference between a web site that feels responsive and one that doesn't. > If it's OLTP you can't be using all your cores for each user anyways. Even so, I'd much rather keep each response time lower. If web page requests are coming in at 1 a second, it's much nicer to respond to each of them in 1 second than in 4 seconds -- even if the overall throughput is identical.
Jeff Davis wrote:
> On Wed, 2007-12-19 at 15:51 -0500, Mark Mielke wrote:
> > That sounds possible, but I still feel myself suspecting that disk
> > reads will be much slower than localized text comparison. Perhaps I am
> > overestimating the performance of the comparison function?
> I think this simple test will change your perceptions:

Yes - I received the same results (although my PostgreSQL doesn't have a built in cast ::text::bytea... :-) )

> On my machine this table fits easily in memory (so there aren't any disk
> reads at all). Sorting takes 7 seconds for floats, 9 seconds for binary
> data, and 20 seconds for localized text. That's much longer than it
> would take to read that data from disk, since it's only 70MB (which
> takes a fraction of a second on my machine).

Might this mean that PostgreSQL performs too many copy operations? :-)

> I think this disproves your hypothesis that sorting happens at disk
> speed.

Yes.

> > Yep - I started to read up on it. It still sounds like it's a hard-ish
> > problem (to achieve near N times speedup for N CPU cores without
> > degrading performance for existing loads), but that doesn't mean
> > impossible. :-)
> You don't even need multiple cores to achieve a speedup, according to
> Ron's reference.

I think Ron's reference actually said that you don't need full cores to achieve a speedup. It spoke of Intel's HT system. A single CPU with a single execution pipeline is not going to function better with multiple threads unless the single thread case is written wrong. Multiple threads are always an overall loss without hardware support. The thinking on this is that multiple threads can sometimes lead to cleaner designs, which are sometimes more naturally written to be performant. In my experience, the opposite is usually true.

But, if you do have HT, and the algorithm can be modified to take advantage of it for an overall increase in speed - great.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>
On Thu, 2007-12-20 at 01:26 +0000, Gregory Stark wrote: > I suspect most of that is spent just copying the data around. Which would not > be helped by having multiple threads doing the copying -- and in fact might be > exacerbated if it required an extra copy to consolidate all the data in the > end. The theory is that it could be helped by multiple threads, because of the memory latency. > How long does a "explain analyze simple select * from sorter" take? 2 seconds, but the table is already in cache I'm sure (since it's so small). > Note that speeding up a query from 20s to 5s isn't terribly useful. If it's > OLTP you can't be using all your cores for each user anyways. And if it's DSS > 20s isn't a problem. I'm not pushing for parallel sort, I'm just brainstorming. I think Ron's idea has merit, but I realize it also has limitations. > Where parallel processing like this becomes attractive is when you're running > a 2 hour query on a machine sequentially running scheduled batch jobs which > can be sped up to 30 minutes. But in that case you're almost certainly being > limited by your disk bandwidth, not your cpu speed. Are you sure that's always the case? My test seemed to indicate that sorting took longer than it would to read the file from disk. Regards, Jeff Davis
Jeff Davis wrote:
> > Where parallel processing like this becomes attractive is when you're running
> > a 2 hour query on a machine sequentially running scheduled batch jobs which
> > can be sped up to 30 minutes. But in that case you're almost certainly being
> > limited by your disk bandwidth, not your cpu speed.
> Are you sure that's always the case? My test seemed to indicate that
> sorting took longer than it would to read the file from disk.

It's probably not a relevant scenario either, as this discussion has only been about improving the performance of the sort, and I suspect there are very few database loads with performance characteristics completely defined by the efficiency of the sort algorithm? :-)

So far I am getting:

1) Sort is slower than many people expect. (Jeff's test case emphasizes this well)
2) White papers exist that document theoretical, simulated, and in some cases actual execution where parallel sort can be beneficial.
3) White papers exist that document how parallel sort is difficult to get right, and that characteristics of machines in use today prevent full utilization.
4) PostgreSQL is not designed to spread a single query across multiple execution units (whether CPUs, cores, or HT).

It's interesting discussion for me thus far.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>
Brian Hurt wrote: > While we're blue skying things, I've had an idea for a sorting > algorithm kicking around for a couple of years that might be > interesting. It's a variation on heapsort to make it significantly > more block-friendly. I have no idea if the idea would work, or how > well it'd work, but it might be worthwhile kicking around. > > Now, the core idea of heapsort is that the array is put into heap > order- basically, that a[i] >= a[2i+1] and a[i] >= a[2i+2] (doing the > 0-based array version here). The problem is that, assuming that the > length of a is larger than memory, then a[2i+1] is likely going to be > on a different page or block than a[i]. That means every time you > have to bubble down a new element, you end up reading O(log N) blocks- > this is *per element*. > > The variation is to instead work with blocks, so you have a block of > entries b[i], and you change the definition of heap order, so that > min(b[i]) >= max(b[2i+1]) and min(b[i]) >= max(b[2i+2]). Also, during > bubble down, you need to be carefull to only change the minimum value > of one of the two child blocks b[2i+1] and b[2i+2]. Other than that, > the algorithm works as normal. The advantage of doing it this way is > that while each bubble down still takes O(log N) blocks being touched, > you get a entire block worth of results for your effort. Make your > blocks large enough (say, 1/4 the size of workmem) and you greatly > reduce N, the number of blocks you have to deal with, and get much > better I/O (when you're reading, you're reading megabytes at a shot). > > Now, there are boatloads of complexities I'm glossing over here. This > is more of a sketch of the idea. But it's something to consider. > Following up to myself (my apologies), but it's occurred to me that there are three advantages to this proposal that I've since thought of: 1) The two child blocks b[2i+1] and b[2i+2]- the one with the larger minimum element is the one we might replace. In other words, if min(b[2i+1]) > min(b[2i+2]) and min(b[i]) < min(b[2i+1]), then we know we're going to want the blocks b[4i+3] and b[4i+4]- before we're done with blocks b[2i+1] and b[2i+2]. The point here is that this would work wonders with the posix_fadvise/asyncio ideas kicking around. It'd be easy for the code to keep 2 large writes and 2 large reads going pretty constantly. 2) There is some easy parallelization available. I'm not sure how much worth this is, but the bubble down code is fairly easy to parallelize. If we have two bubble-downs going on in parallel, once they go down different branches (one thread goes to block b[2i+1] while the other goes to b[2i+2]) they no longer interact. Blocks near the root of the heap would be contended over, and multiple threads means smaller blocks to keep the total memory foot print the same. Personally, I think the asyncio idea above is more likely to be worthwhile. 3) It's possible to perform the sort lazily. You have the initial O(N) pass over the list, but then each block is only O(log N) cost. If it's likely that only the first part of the result is needed, then much of the work can be avoided. Brian
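The lazy-evaluation point is easy to see with the toy block heap sketched after Brian's first message above; block_heapify and pop_largest_block are those made-up helpers, and a min-ordered variant of the same structure would serve an ascending ORDER BY ... LIMIT the same way:

    # One O(N) heapify pass up front, then each emitted block costs only a
    # single sift_down, so a caller that stops early never pays for the rest.
    def lazy_top_rows(blocks, limit):
        """Yield at most `limit` keys in descending order, doing no more heap
        work than the caller actually consumes."""
        block_heapify(blocks)                 # O(N), done once
        produced = 0
        while blocks and produced < limit:
            for key in reversed(pop_largest_block(blocks)):   # O(log N) blocks per block of output
                yield key
                produced += 1
                if produced == limit:
                    return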
"Brian Hurt" <bhurt@janestcapital.com> writes: > 3) It's possible to perform the sort lazily. You have the initial O(N) pass > over the list, but then each block is only O(log N) cost. If it's likely that > only the first part of the result is needed, then much of the work can be > avoided. Now that's a *fascinating* idea. I'm having trouble coming up with a really killer use case for it since the bounded heap sort takes care of many cases where it would seem to apply. But it seems rally promising. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
>-----Original Message----- >From: pgsql-hackers-owner@postgresql.org >[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Ron Mayer >Sent: Wednesday, 19 December 2007 19:26 >To: Mark Mielke; pgsql-hackers@postgresql.org >Subject: Re: [HACKERS] Sorting Improvements for 8.4 > >> Or do you mean being able to perform parts of the query plan fully in >> parallel? If this, then one would need a lot more than ParallelSort... > >I wouldn't recommend that - it seems like a Hard Problem. > >My guess is that the best way to use multiple threads in one >backend would be to find specific algorithms like sorting that > would be easier to isolate. To give my view on this problem: if I look at a competing (commercial) database product, they added some operations called "parallelize" and "combine". Basically they split the data across several threads at one point and combine them later. This is basically what you are also implementing for "parallelsort", but as a single step in the query execution. In my opinion your starting point is too narrow and specific, especially since a fairly simple generalization is possible. Instead, the issue becomes the spill-to-disk code that needs to operate in parallel (which needs to be tackled sooner or later anyways). If you can change the sort into three steps: parallelize, sort (multiple parallel instances) and combine (merge) you still have the same base case. However I believe such a thing is much easier to extend to more operations. Furthermore it seems that cache is considered a major problem, especially the varying sizes. Wouldn't a cache-oblivious algorithm, like <http://erikdemaine.org/papers/BRICS2002/> or <http://etd.uwaterloo.ca/etd/afarzan2004.pdf> be a good starting point for refinements on the sort algorithm itself? I believe you can get more consistent performance depending on the cache sizes, but it might be slower than a well-tuned quicksort. Just my EUR 0,02... - Joris
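As an illustration of the parallelize / sort / combine shape Joris describes, here is a toy using only Python's standard library; it says nothing about how the executor or the spill-to-disk code would actually be structured:

    import heapq
    import random
    from concurrent.futures import ProcessPoolExecutor

    def parallel_sort(rows, workers=4):
        # parallelize: deal the input out into one partition per worker
        parts = [rows[i::workers] for i in range(workers)]
        # sort: each partition is sorted independently, in parallel
        with ProcessPoolExecutor(max_workers=workers) as pool:
            runs = list(pool.map(sorted, parts))
        # combine: a single streaming merge over the sorted runs
        return list(heapq.merge(*runs))

    if __name__ == "__main__":
        data = [random.random() for _ in range(100000)]
        assert parallel_sort(data) == sorted(data)

The combine step is just a streaming merge of sorted runs, which is also where the spill-to-disk question would reappear in a real implementation.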
Ron Mayer wrote: >> Or do you mean being able to perform parts of the query plan fully in >> parallel? If this, then one would need a lot more than ParallelSort... >> > > I wouldn't recommend that - it seems like a Hard Problem. > > Isn't it the case that the implicit unions from processing partitioned data provide a more-or-less-ideal opportunity here? I certainly have sympathy for parallelising expensive queries to bring the best response time down, even if the average under full load goes up slightly, since any implied locks (including pinning of read-ahead pages) will be released sooner. And when load is light, users who are online get more of the hardware they paid for. James
Added to TODO: > * Consider being smarter about memory and external files used during > sorts > > http://archives.postgresql.org/pgsql-hackers/2007-11/msg01101.php > http://archives.postgresql.org/pgsql-hackers/2007-12/msg00045.php --------------------------------------------------------------------------- Simon Riggs wrote: > Just wanted to review a few thoughts and ideas around improving external > sorts, as recently encouraged to do by Jim Nasby. > > Current issues/opportunities are these: > > ISSUES > > a) Memory is always in short supply, so using what we have more > effectively is going to be welcome. > > b) Heap sort has a reasonably strong anti-memory effect, meaning that > there is an optimum amount of memory for any sort. This shows itself > with the CPU time increasing during run forming, making this stage of > the sort CPU bound. > > c) Many sorts are performed prior to aggregation. It might be possible > to aggregate prior to writing to disk, as a way of reducing the overall > I/O cost. Benefit would occur when the total CPU cost was same no matter > when aggregation occurred; that would not apply in all cases, so we > would need to sense when benefit was possible. > > d) Generally reducing the I/O cost of sorting may help the merging > stages of a sort. > > > SOLUTIONS > > The ideas that Greg Stark, Jim Nasby, Heikki and myself have discussed > to date were the following: > > 1. Sort I/O Compression > 2. Aggregation during Sort > 3. Memory Pools > 4. Dynamic Heap Management > 5. Dynamic Run Handling > > I've added (5) to the list as well, which hasn't yet been discussed. > > 1. SORT I/O COMPRESSION > > This idea is not dead yet, it just needs a full set of tests to confirm > that there is benefit in all cases. If there's not benefit in all cases, > we may be able to work out which cases those are, so we know when to use > it. > > > 2. AGGREGATION DURING SORT > > Many sorts are preliminary steps before aggregation. Aggregation during > run forming would potentially reduce size of heap and reduce number of > comparisons. For many types of aggregate this would not theoretically > increase the number of ops since sum(), avg(), min(), max() are all > commutative according to their inputs. We would probably need to add > another option to Aggregate Functions to indicate the possibility of > calculating the aggregate in this way, since some aggregates might rely > on the current situation that they expect all their inputs at once in > sorted order. (Windowed aggregates are unlikely to be this way). > > > 3. MEMORY POOLS > > Solving a) could be done by sensible management and allocation of > resources. Discussed before, so not rehashed here. > > > 4. DYNAMIC HEAP MANAGEMENT > > The size of the active heap required to produce the fewest number of > runs varies as the sort progresses. For example, sorting an already > sorted input needs a trivial heap size. > > Larger heap sizes simply avoid forming more runs, which is not > necessarily a bad thing. More runs only become bad things when we go > beyond our ability to perform a single final merge (see Dynamic Run > Handling below). > > Smaller heap sizes reduce the number of comparisons required, plus > increase the L2+ cache efficiencies. Those two things are the cause of > the anti-memory effect. > > Because of b), optimising the size of the heap could potentially be a > good thing. This can make a considerable difference for nearly sorted > data (measurements required...). 
> > When we have M amount of memory available to us, we don't start by using > it all. We start with m memory and only increase up to M if required. > Runs are built with memory set at m. If a tuple arrives that would force > the formation of a new run we assess > > i) do we care if another run is formed? Use our knowledge of the likely > amount of data coming our way, compared with number of runs formed so > far and see if we really care. If we don't care, allow the new run to be > formed and carry on with just heap size of m. (see Dynamic Run Handling > later). > > ii) if we do care about number of runs, then allow the heap to grow by > increments up to the full size of M. Increments would be at least x2 and > possibly x4. That way we always have work space to rearrange the heap. > > All of this dances too cleverly around the exact technique and potential > costs of rearranging the heap. That is not to be ignored and is the next > task in evaluating and accepting/dismissing this potential technique. > > In combination with memory pooling this technique might also allow > memory to be better distributed to other users. > > > 5. DYNAMIC RUN HANDLING (in Final Merge) > > Another way of addressing a) is to simply make better use of memory > itself. Let's look at that in more detail: > > Number of runs that can be merged at once is currently fixed, based upon > available memory. This has the underlying assumption that all runs will > be concurrently active during final merging, which may not always be > true. > > If we have random data then almost all runs will overlap with all other > runs, i.e. the min and max values are sufficiently wide that the runs do > all overlap. In many cases, data arrives in somewhat sorted order, e.g. > financial data is fairly regular with some late payers but not many, and > those trail off with a fairly tight decay. In the somewhat sorted case > we find that the actual overlap is less than total, so there are many > later runs that don't overlap the earlier ones. In the best case we > would find run 1 and 2 overlap, runs 2 and 3 overlap, then 3 and 4 > overlap. > > This is also the point where I suggest breaking away from Knuth > completely. All of the main algorithms described by Knuth are tape > sorts. A run is written to a particular tape and then stays there until > "moved" to another tape. That means we have to get super-clever about > how runs should be written and formed (see Knuth). If we realise that > the runs aren't fixed to particular tapes they are all just independent > runs, we can radically rethink sorting. There is no need to implement > Cascade Sort, but we do need to rethink merging from the ground up. (All > of which is a relief, because Knuth et al are definitely smarter than > me, but I've got disks and lots of memory and those guys had tapes.). > > If we track the min and max values for each run, when run building is > finished we will be able to build a merging plan that allows us to be > smart about the runs we should bring together. We start with the run > with the lowest min value, as well as all runs that overlap that run. > When that run is exhausted we move to the next lowest and at that point > start merging all runs that overlap that one. > > This then means we may be able to begin final merging with more runs > than the current cut-off. It's possible that we could merge an infinite > number of runs in final merge with fixed memory. 
> If we *do* need to > merge we can work out which runs should be our best pre-merge > candidates, based upon how big they are and which other runs they > overlap. (That's much better than being forced to merge tapes 2, 7 and > 17 because some bizarre math says so (see Knuth).) > > Anyway, claiming to have found a better way than Knuth makes me feel a > little nervous, so some searching questions on this are very welcome. > > Interestingly, if we combine this technique with dynamic heap management > we may be able to allow a very large number of efficiently written runs > to form without it causing any merging. > > mac_man recently noted the possibility that some runs don't overlap at > all and so can be merged for free. That's true, though it doesn't actually > improve the basic idea here which is building a merge plan after runs > have been formed, with an eye on minimizing and potentially eliminating > the merge phase. > > There's probably some typos or thinkos above, so go easy on me Greg! > They aren't there because I want to skim over anything. > > I'm not likely to get a chance to do all of this in the near future, so > documenting it now should help others to carry things forward. > > -- > Simon Riggs > 2ndQuadrant http://www.2ndQuadrant.com > > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
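To make the merge-plan idea quoted above a little more concrete: if run building records a (min key, max key) pair per run, a simple sweep over those bounds says how many runs ever have to be open at the same time. The sketch below is purely illustrative Python; the function name and the event-sweep approach are not from any patch, and real bounds would have to come from the run-building code in tuplesort:

    def max_concurrent_runs(bounds):
        """bounds is a list of (lowest key, highest key) pairs, one per run.
        Returns the most runs that ever need to be open at once if each run is
        opened at its lowest key and retired once past its highest key; when
        this stays below the merge fan-in, a single final merge is possible no
        matter how many runs were written."""
        events = []
        for lo, hi in bounds:
            events.append((lo, 0))    # run becomes active
            events.append((hi, 1))    # run can be retired
        active = peak = 0
        for _, kind in sorted(events):
            if kind == 0:
                active += 1
                peak = max(peak, active)
            else:
                active -= 1
        return peak

    # Somewhat-sorted input: each run overlaps only its neighbours, so a merge
    # plan needs just 2 runs open at once even though 4 runs were written.
    print(max_concurrent_runs([(0, 10), (8, 20), (18, 30), (28, 40)]))    # -> 2
    # Random input: every run spans the whole key range, so all 4 runs must be
    # merged together, exactly as today.
    print(max_concurrent_runs([(0, 100), (0, 100), (0, 100), (0, 100)]))  # -> 4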