tuplesort memory usage: grow_memtuples - Mailing list pgsql-hackers
From: Jeff Janes
Subject: tuplesort memory usage: grow_memtuples
Msg-id: CAMkU=1zE-d2HAU=MqmeVjBFv577FzWRzHmm6e3CQ9Os7-jRv7A@mail.gmail.com
List: pgsql-hackers
When sorting small tuples, the memtuples array can use a substantial fraction of the total per-tuple memory used. (In the pass-by-value case, it is all of it.) The way the array grows leads to sub-optimal memory usage.

In one case, if it could grow by 1.999x but not by 2x, we just give up and use half the memory we could have. This is what actually happens in pass-by-value sorting when using a power of two for *work_mem. While I understand we don't want to thrash memory, it seems like we could at least grow by less than 2x just one last time. Also, if there is just barely room to double memtuples (and it is a pass-by-reference sort), then the number of slots is doubled but no room is left for the tuples themselves, so most of those slots can't be used.

A solution to both would be to detect when we are about to do one of the two things above, and instead use the historical average tuple size to compute how much growth we can do, so that the memtuples slots and the cumulative palloc heap of tuples are exhausted at the same time. If availMem is less than half of allowedMem, then the next doubling of memtuples would either fail, or would succeed but leave behind too little allowedMem to hold the full complement of new tuples. So either way, extrapolate from current usage.

The patch assumes that nothing other than memtuples and individual tuple storage has been deducted from availMem. Currently that is true, but it may not always be so (we recently discussed pre-deducting the tape buffer overhead, so it doesn't cause a memory overrun). As the XXXX comment indicates, I am not currently able to prove to myself that there are no conditions under which this change could cause LACKMEM(state) to become true.

This patch gives about a 2-fold window over which a sort that would otherwise go to tape will instead be done in memory, which is 2-3 times faster.
That might not be very impressive, because after all if you wanted to use more memory you could have just increased work_mem. But it still seems worthwhile to use the memory we are told to use. It may be hard to fine-tune the *work_mem settings, but not using the amount we are given surely doesn't make it any easier.

Beyond the tape-sort vs. in-memory improvement, this also brings other gains. For example, it gives me about a 35% speed-up when using the default work_mem to do a "select count(distinct k) from t", where k is a random integer and t has 512 million rows. Granted, it is silly to intentionally do such a large sort with such small memory. But it still seems worth improving, and the wins should be present at other sizes as well, though probably not as big.

In my test case, the improvement mostly comes from turning random IO reads into sequential ones. The prevalence of random IO is due to a cascade of issues:

1) As discussed above, we are using only half the memory we could be.

2) Due to MINORDER, we are spreading that reduced memory over twice as many tapes as MERGE_BUFFER_SIZE would suggest. Changing this would obviously have a trade-off.

3) MERGE_BUFFER_SIZE itself describes the in-RAM footprint, not the on-tape footprint, because we unpack tuples as we read them from tape. Perhaps mergeprereadone could stash the preread data still packed and unpack it only as needed. But that is a much more invasive change.

4) Pre-reading all tapes whenever just one becomes empty causes blocks to be freed, and thus re-used, in smaller contiguous chunks than they otherwise would be. (This is easy to change, but there is a trade-off to changing it.)

5) Even without 4, logtape.c still makes a jumble of the free list on multi-level merges. I'm not entirely sure why yet--I think maybe it is the way indirect blocks are handled.

Add it all up, and instead of pre-reading 32 consecutive 8K blocks, it pre-reads only about 1 or 2 consecutive ones on the final merge.
Now some of those could be salvaged by the kernel keeping track of multiple interleaved read-ahead opportunities, but in my hands vmstat shows a lot of IO wait, with reads that look closer to random IO than to large read-ahead. If truly efficient read-ahead were used, CPU would probably be the limit.

I'll add this to the open commitfest.

When doing performance testing, I find it far easier to alter a guc.c parameter than to keep multiple compiled binaries around. Would people like a testing patch that does the same thing as this, or the original behavior, under control of a parameter setting?

Cheers,

Jeff