Re: Automatically setting work_mem - Mailing list pgsql-hackers
From | Simon Riggs
Subject | Re: Automatically setting work_mem
Date |
Msg-id | 1142971550.24487.499.camel@localhost.localdomain
In response to | Re: Automatically setting work_mem (Tom Lane <tgl@sss.pgh.pa.us>)
Responses | Re: Automatically setting work_mem
List | pgsql-hackers
On Fri, 2006-03-17 at 09:46 -0500, Tom Lane wrote:
> "Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
> > So what's the difference between these two strategy?
> > (1) Running time: do they use the same amount of memory? Why option 2 is
> > better than 1?
> > (2) Idle time: after sort done, option 1 will return all 1024 to the OS and
> > 2 will still keep 512?
>
> Point 2 is actually a serious flaw in Simon's proposal, because there
> is no portable way to make malloc return freed memory to the OS. Some
> mallocs will do that ... in some cases ... but many simply don't ever
> move the brk address down. It's not an easy thing to do when the arena
> gets cluttered with a lot of different alloc chunks and only some of
> them get freed.

I'm aware of that objection and agree it's an issue...

One of the situations I am looking at is larger queries with multiple
sorts in them. I'm getting some reasonable results for the final merge
even after releasing a lot of the memory (say 50-90%). The memtuples
array is not required at all for randomAccess sorts (a 100% reduction).

The largest memory requirement is the run building during performsort,
and that portion of the code is not executed concurrently within the
same query. If we can reduce memory usage once that phase completes, we
stand a chance of avoiding the situation where a big query over-allocates
memory that we then cannot reclaim. So overall memory usage could be as
low as work_mem + (numsorts * 0.1 * work_mem), which is a lot less than
numsorts * work_mem (e.g. for a query with three sorts, 130% of work_mem
rather than 300%).
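To show the shape of the change, here is a minimal standalone sketch
(made-up names for illustration only, not the tuplesort.c patch itself):

/*
 * Illustrative sketch: keep the full allocation while runs are being
 * built, then shrink the in-memory array before the final merge begins.
 */
#include <stdlib.h>

typedef struct SortSketch
{
    void      **tuples;        /* stand-in for the in-memory tuple array */
    size_t      allocated;     /* number of slots currently allocated */
} SortSketch;

/* Called once run building ("performsort") has completed. */
static void
release_build_memory(SortSketch *sort, double keep_fraction)
{
    size_t      keep = (size_t) (sort->allocated * keep_fraction);
    void      **shrunk;

    if (keep == 0 || keep >= sort->allocated)
        return;                 /* nothing useful to give back */

    shrunk = realloc(sort->tuples, keep * sizeof(void *));
    if (shrunk != NULL)
    {
        sort->tuples = shrunk;
        sort->allocated = keep;
    }

    /*
     * Tom's caveat still applies: whether the released portion actually
     * goes back to the OS is allocator-specific; many mallocs never move
     * the brk address down.  What it does buy is a smaller accounted
     * footprint per sort, so several sorts in one query need not each
     * hold a full work_mem's worth at the same time.
     */
}

int
main(void)
{
    SortSketch  sort;

    sort.allocated = 1024 * 1024;   /* pretend this is work_mem's worth */
    sort.tuples = malloc(sort.allocated * sizeof(void *));
    if (sort.tuples == NULL)
        return 1;

    /* ... build runs here, using the full allocation ... */

    release_build_memory(&sort, 0.1);   /* keep ~10% for the final merge */

    free(sort.tuples);
    return 0;
}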
> So the semantics we'd have to adopt is that once a backend claims some
> "shared work mem", it keeps it until process exit. I don't think that
> makes the idea worthless, because there's usually a clear distinction
> between processes doing expensive stuff and processes doing cheap
> stuff. But it's definitely a limitation.

...Hopefully less so with the memory-reduction changes.

The other way is of course to allocate *all* sort/hash/big-stuff space
out of shared memory and then let all the backends fight it out
(somehow...) to see who gets access to it. That way backends stay small
and we have well-bounded memory usage.

> Also, if you've got a process
> doing expensive stuff, it's certainly possible to expect the user to
> just increase work_mem locally.

Doing that is fine, but you have to retune the system every time the
memory usage changes for any reason, so most of the time manual tuning
has to be very conservative to avoid getting it wrong.

> My own thoughts about the problems with our work_mem arrangement are
> that the real problem is the rule that we can allocate work_mem per sort
> or hash operation; this makes the actual total memory use per backend
> pretty unpredictable for nontrivial queries. I don't know how to fix
> this though. The planner needs to know the work_mem that will be used
> for any one of these operations in order to estimate costs, so simply
> trying to divide up work_mem among the operations of a completed plan
> tree is not going to improve matters.

We spent about 5 hours discussing that, and the best answer was "use
queuing"...

= = = =

Anyway, thinking just about sort, I've got the following concrete
suggestions (the first two of which are coded and tested):

1. I originally picked MERGE_BUFFER_SIZE at 32 blocks as a guess. Better
test results show that is indeed the optimum when we take into account
both intermediate and final merging. However, the preferred buffer size
would be about 256 blocks in the case that Nruns << Ntapes, i.e. when
work_mem is set high. In this case memory is reallocated to reduce the
overall usage after "performsort done" has happened.

2. When a tape runs out of tuples, its memory is reallocated to the
remaining tapes to increase their I/O efficiency. This should help to
increase performance for smaller work_mem settings with large sorts, or
anything where the final merge Nruns is close to Ntapes. You can see
this occurring in the example below. The reallocation is either done
uniformly or all onto a single tape, depending upon the preread pattern
(a toy sketch of this appears at the end of this mail). This mostly
doesn't occur with well-sorted output, so there is little overhead from
doing this in the general case. Right now, large sort performance is
very good, whereas smaller sort performance is still fairly bad.

3. We implement the new merge algorithm as Tom suggested...

Best Regards,
Simon Riggs
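As a rough illustration of the reallocation in suggestion (2), again
with made-up names rather than the actual tuplesort/logtape code:

/*
 * Toy version of suggestion 2: when one tape is exhausted during the
 * final merge, spread its preread buffer space over the tapes that are
 * still live.  Names are invented for illustration only.
 */
#include <stdbool.h>
#include <stdlib.h>

typedef struct TapeSketch
{
    char       *buffer;         /* preread buffer for this tape */
    size_t      bufsize;        /* current buffer size in bytes */
    bool        exhausted;      /* no more tuples left on this tape */
} TapeSketch;

static void
redistribute_tape_buffer(TapeSketch *tapes, int ntapes, int dead)
{
    size_t      give = tapes[dead].bufsize;
    size_t      share;
    int         nlive = 0;
    int         i;
    char       *grown;

    /* mark the exhausted tape and release its buffer */
    free(tapes[dead].buffer);
    tapes[dead].buffer = NULL;
    tapes[dead].bufsize = 0;
    tapes[dead].exhausted = true;

    for (i = 0; i < ntapes; i++)
        if (!tapes[i].exhausted)
            nlive++;
    if (nlive == 0)
        return;

    /*
     * Uniform redistribution; the mail also mentions giving the whole
     * amount to a single tape, depending on the preread pattern.
     */
    share = give / nlive;
    for (i = 0; i < ntapes; i++)
    {
        if (tapes[i].exhausted)
            continue;
        grown = realloc(tapes[i].buffer, tapes[i].bufsize + share);
        if (grown != NULL)
        {
            tapes[i].buffer = grown;
            tapes[i].bufsize += share;
        }
    }
}

int
main(void)
{
    TapeSketch  tapes[4];
    int         i;

    for (i = 0; i < 4; i++)
    {
        tapes[i].bufsize = 32 * 8192;   /* 32 blocks, assuming 8K blocks */
        tapes[i].buffer = malloc(tapes[i].bufsize);
        tapes[i].exhausted = false;
    }

    /* pretend tape 2 just ran out of tuples during the final merge */
    redistribute_tape_buffer(tapes, 4, 2);

    for (i = 0; i < 4; i++)
        free(tapes[i].buffer);
    return 0;
}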