Re: Memory usage during sorting - Mailing list pgsql-hackers
From: Jeff Janes
Subject: Re: Memory usage during sorting
Date:
Msg-id: CAMkU=1yfaWD1h9UY-es3_bpoVyMPm2b-3Pz2iGibWPZq_gnqJA@mail.gmail.com
In response to: Re: Memory usage during sorting (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Memory usage during sorting
List: pgsql-hackers
On Sun, Feb 26, 2012 at 7:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Feb 25, 2012 at 4:31 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> I'm not sure about the conclusion, but given this discussion, I'm
>>> inclined to mark this Returned with Feedback.
>>
>> OK, thanks.  Does anyone have additional feed-back on how tightly we
>> wish to manage memory usage?  Is trying to make us use as much memory
>> as we are allowed to without going over a worthwhile endeavor at all,
>> or is it just academic nitpicking?
>
> I'm not sure, either.  It strikes me that, in general, it's hard to
> avoid a little bit of work_mem overrun, since we can't know whether
> the next tuple will fit until we've read it, and then if it turns out
> to be big, well, the best thing we can do is free it, but perhaps
> that's closing the barn door after the horse has gotten out already.

That type of overrun doesn't bother me much, because the size of the
next tuple someone else feeds us is mostly outside of this module's
control, and also because the memory that is overrun should be reusable
once the offending tuple is written out to tape.

The type of overrun I'm more concerned with is the kind that is purely
under this module's control, and which is then not re-used productively.

The better solution would be to reduce the overhead in the first place.
While building the initial runs, there is no reason to have 3 blocks'
worth of overhead for each tape, when only one tape is ever being used
at a time.  But that change seems much tougher to implement.

> Having recently spent quite a bit of time looking at tuplesort.c as a
> result of Peter Geoghegan's work and some other concerns, I'm inclined
> to think that it needs more than minor surgery.  That file is peppered
> with numerous references to Knuth which serve the dual function of
> impressing the reader with the idea that the code must be very good
> (since Knuth is a smart guy) and rendering it almost completely
> impenetrable (since the design is justified by reference to a textbook
> most of us probably do not have copies of).

Yes, I agree with that analysis.  And getting the volume I want by
inter-library loan has been challenging--I keep getting the volume
before or after the one I want.  Maybe Knuth starts counting volumes
at 0 and libraries start counting at 1 :)

Anyway, I think the logtape code could use redoing.  When your tapes
are actually physical tape drives, it is necessary to build up runs
one after the other on physical tapes, because unmounting a tape from
a tape drive and remounting another is not very feasible at scale.
Which then means you need to place your runs carefully, because if the
final merge finds that two runs it needs are back-to-back on one tape,
that is bad.  But with disks pretending to be tapes, you could
rearrange the "runs" with just some bookkeeping.  Maintaining the
distinction between "tapes" and "runs" is pointless, which means the
Fibonacci placement algorithm is pointless as well.

> A quick Google search for external sorting algorithms suggest that
> the typical way of doing an external sort is to read data until you
> fill your in-memory buffer, quicksort it, and dump it out as a run.
> Repeat until end-of-data; then, merge the runs (either in a single
> pass, or if there are too many, in multiple passes).  I'm not sure
> whether that would be better than what we're doing now, but there
> seem to be enough other people doing it that we might want to try it
> out.  Our current algorithm is to build a heap, bounded in size by
> work_mem, and dribble tuples in and out, but maintaining that heap is
> pretty expensive; there's a reason people use quicksort rather than
> heapsort for in-memory sorting.

But it would mean we have about 1.7x more runs that need to be merged
(for initially random data).  Considering the minimum merge order is 6,
that increase in runs is likely not to lead to an additional level of
merging, in which case the extra speed of building the runs would
definitely win.  But if it does cause an additional level of merge, it
could end up being a loss.

Is there some broad corpus of sorting benchmarks that changes could be
tested against?  I usually end up testing just simple columns of
integers or small strings, because they are easy to set up.  That is
not ideal.
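Just to make sure we are talking about the same scheme, here is a rough
standalone sketch of the quicksort-the-runs-then-merge approach: plain C,
integers only, a fixed RUN_CAPACITY standing in for work_mem, a fixed
MAX_RUNS, a single merge pass, and no error handling.  The names and
limits are all made up for the sketch; it is obviously nothing like what
the real tuplesort.c would need.

#include <stdio.h>
#include <stdlib.h>

#define RUN_CAPACITY (1024 * 1024)  /* tuples per run; stands in for work_mem */
#define MAX_RUNS     64             /* assume a single merge pass suffices */

static int
cmp_int(const void *a, const void *b)
{
    int     ia = *(const int *) a;
    int     ib = *(const int *) b;

    return (ia > ib) - (ia < ib);
}

int
main(void)
{
    int    *buf = malloc(sizeof(int) * RUN_CAPACITY);
    FILE   *runs[MAX_RUNS];
    int     head[MAX_RUNS];
    int     valid[MAX_RUNS];
    int     nruns = 0;
    size_t  n;

    /*
     * Run building: fill the buffer, quicksort it, dump it as a run.
     * (This sketch simply stops after MAX_RUNS runs.)
     */
    while (nruns < MAX_RUNS &&
           (n = fread(buf, sizeof(int), RUN_CAPACITY, stdin)) > 0)
    {
        qsort(buf, n, sizeof(int), cmp_int);
        runs[nruns] = tmpfile();
        fwrite(buf, sizeof(int), n, runs[nruns]);
        rewind(runs[nruns]);
        nruns++;
    }
    free(buf);

    /* Merge: repeatedly emit the smallest of the current run heads. */
    for (int i = 0; i < nruns; i++)
        valid[i] = (fread(&head[i], sizeof(int), 1, runs[i]) == 1);

    for (;;)
    {
        int     best = -1;

        for (int i = 0; i < nruns; i++)
            if (valid[i] && (best < 0 || head[i] < head[best]))
                best = i;
        if (best < 0)
            break;              /* all runs exhausted */
        fwrite(&head[best], sizeof(int), 1, stdout);
        valid[best] = (fread(&head[best], sizeof(int), 1, runs[best]) == 1);
    }
    return 0;
}

A real implementation would of course use a heap over the run heads once
the number of runs gets large; the linear scan in the merge step is just
to keep the sketch short.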
> As a desirable side effect, I think it would mean that we could
> dispense with retail palloc and pfree altogether.  We could just
> allocate a big chunk of memory, copy tuples into it until it's full,
> using a pointer to keep track of the next unused byte, and then,
> after writing the run, reset the allocation pointer back to the
> beginning of the buffer.  That would not only avoid the cost of going
> through palloc/pfree, but also the memory overhead imposed by
> bookkeeping and power-of-two rounding.

Wouldn't we still need an array of pointers to the start of every
tuple's location in the buffer?  Otherwise, how would qsort know where
to find them?

Also, to do this we would need to get around the 1GB allocation limit.
It is bad enough that memtuples is limited to 1GB; it would be much
worse if the entire arena were limited to that amount.

> If we do want to stick with the current algorithm, there seem to be
> some newer techniques for cutting down on the heap maintenance
> overhead.  Heikki's been investigating that a bit.

Interesting.  Is that investigation about the poor L1/L2 caching
properties of large heaps?

I was wondering if there might be a way to give tuplesort an initial
estimate of how much data there is to sort, so that it could use a
smaller amount of memory than the max if it decided that would lead to
better caching effects.  Once you know you can't do an in-memory sort,
there is no reason to use more memory than the amount that lets you
merge all the runs in one go.

Cheers,

Jeff