Re: Use generation context to speed up tuplesorts - Mailing list pgsql-hackers
From: Ronan Dunklau
Subject: Re: Use generation context to speed up tuplesorts
Date:
Msg-id: 2740198.88bMQJbFj6@aivenronan
In response to: Re: Use generation context to speed up tuplesorts (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: Use generation context to speed up tuplesorts
List: pgsql-hackers
On Wednesday, December 8, 2021 at 22:07:12 CET, Tomas Vondra wrote:
> On 12/8/21 16:51, Ronan Dunklau wrote:
> > On Thursday, September 9, 2021 at 15:37:59 CET, Tomas Vondra wrote:
> >> And now comes the funny part - if I run it in the same backend as the
> >> "full" benchmark, I get roughly the same results:
> >>
> >>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> >> ------------+------------+---------------+----------+---------
> >>       32768 |        512 |     806256640 |    37159 |   76669
> >>
> >> but if I reconnect and run it in the new backend, I get this:
> >>
> >>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> >> ------------+------------+---------------+----------+---------
> >>       32768 |        512 |     806158336 |   233909 |  100785
> >> (1 row)
> >>
> >> It does not matter if I wait a bit before running the query, if I run it
> >> repeatedly, etc. The machine is not doing anything else, the CPU is set
> >> to use the "performance" governor, etc.
> >
> > I've reproduced the behaviour you mention. I also noticed
> > asm_exc_page_fault showing up in the perf report in that case.
> >
> > Running strace on it shows that in one case we have a lot of brk calls,
> > while when we run in the same process as the previous tests, we don't.
> >
> > My suspicion is that the previous workload makes glibc malloc change its
> > trim_threshold and possibly other dynamic options, which leads to
> > constantly moving the brk pointer in one case and not the other.
> >
> > Running your fifo test with absurd malloc options shows that this might
> > indeed be the case (I needed to change several of them, because changing
> > just one disables the dynamic adjustment for every single one, and malloc
> > would fall back to using mmap and freeing it on each iteration):
> >
> >     mallopt(M_TOP_PAD, 1024 * 1024 * 1024);
> >     mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024);
> >     mallopt(M_MMAP_THRESHOLD, 4 * 1024 * 1024 * sizeof(long));
> >
> > I get the following results for your self-contained test. In each case I
> > ran the query twice, to see the relative cost of the first allocations
> > and the subsequent ones.
> >
> > With default malloc options:
> >
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |   300156 |  207557
> >
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |   211942 |   77207
> >
> > With the oversized values above:
> >
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |   219000 |   36223
> >
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |    75761 |   78082
> > (1 row)
> >
> > I can't tell how representative your benchmark extension is of real-life
> > allocation / free patterns, but there is probably something we can
> > improve here.
>
> Thanks for looking at this. I think those allocation / free patterns are
> fairly extreme, and there probably are no workloads doing exactly this.
> The idea is that actual workloads are likely some combination of these
> extreme cases.
>
> > I'll try to see if I can understand more precisely what is happening.
>
> Thanks, that'd be helpful.
> Maybe we can learn something about tuning malloc parameters to get
> significantly better performance.

Apologies for the long email; maybe what I will state here is obvious to others, but I learnt a lot, so...

I think I understood what the problem was in your generation tests: depending on the sequence of allocations, we end up raising mmap_threshold and trim_threshold to different maximums.

When an mmap'ed block is freed, malloc raises its mmap threshold, as it considers that freed memory is better served by regular movement of the sbrk pointer than by mmap'ing and unmapping memory. This threshold is upped any time we free an mmap'ed block. At the same time, trim_threshold is raised to double the mmap_threshold, on the assumption that this amount of memory should not be released back to the OS because we have a good chance of reusing it.

This can be demonstrated using the attached systemtap script, along with a patch adding new traces to the generation context for this purpose. When running your query:

select block_size, chunk_size, x.*
from (values (512)) AS a(chunk_size),
     (values (32768)) AS b(block_size),
     lateral generation_bench_fifo(1000000, block_size, chunk_size,
                                   2*chunk_size, 100, 10000, 5000) x;

we obtain the following trace for the threshold adjustments:

2167837: New thresholds: mmap: 135168 bytes, trim: 270336 bytes
2167837: New thresholds: mmap: 266240 bytes, trim: 532480 bytes
2167837: New thresholds: mmap: 528384 bytes, trim: 1056768 bytes
2167837: New thresholds: mmap: 1052672 bytes, trim: 2105344 bytes
2167837: New thresholds: mmap: 16003072 bytes, trim: 32006144 bytes

When running the full benchmark, we reach a higher threshold at some point:

2181301: New thresholds: mmap: 135168 bytes, trim: 270336 bytes
2181301: New thresholds: mmap: 266240 bytes, trim: 532480 bytes
2181301: New thresholds: mmap: 528384 bytes, trim: 1056768 bytes
2181301: New thresholds: mmap: 1052672 bytes, trim: 2105344 bytes
2181301: New thresholds: mmap: 16003072 bytes, trim: 32006144 bytes
2181301: New thresholds: mmap: 24002560 bytes, trim: 48005120 bytes

This is because at some point in the full benchmark we allocate a block bigger than the mmap threshold, which malloc serves with an mmap, and then free it, which means we also raise the trim_threshold. The subsequent allocations in the lone query end up between those thresholds, so they are served by moving the sbrk pointer, with the memory then released back to the OS, which turns out to be expensive too.

One thing I observed is that this effect of constantly moving the sbrk pointer still happens in a real tuplesort workload, but not as much. I haven't tested the reorder buffer for now, which is the other part of the code using a generation context.

So, I decided to benchmark while trying to control the malloc thresholds. The results are from my laptop, with an i7 processor, for the benchmark initially proposed by David, with 4GB work_mem and default settings. I benchmarked master, the original patch with a fixed block size, and the one adjusting the block size. For each of them, I ran the benchmark using the default malloc options (ie, with the thresholds adjusted dynamically), and with MALLOC_MMAP_THRESHOLD_ set to 32MB (the maximum on my platform, 4 * 1024 * 1024 * sizeof(long)) and MALLOC_TRIM_THRESHOLD_ set to the value malloc itself would have chosen after reaching that threshold (ie, 64MB). I did that by setting the corresponding environment variables. The results are in the attached spreadsheet. I will follow up with a benchmark of the test sorting a table whose width varies from 1 to 32 columns.

As of now, my conclusion is that for glibc malloc, the block size we use doesn't really matter, as long as we tell malloc not to release a certain amount of memory back to the system once it has been allocated.
Setting the mmap_threshold to min(work_mem, DEFAULT_MMAP_THRESHOLD_MAX) and the trim_threshold to twice that would IMO take us to where a long-lived backend would likely end up anyway: as soon as we allocate min(work_mem, 32MB), we won't give it back to the system, saving us a huge number of syscalls in common cases. Doing that would not change the allocation profile on other platforms, and should be safe for them.

The only problem I see, if we were to do that, is with allocations in excess of work_mem, which would no longer trigger a threshold raise: once we're in "non-dynamic" mode, glibc's malloc keeps our manually set values. I guess this proposal could be refined to adjust the thresholds dynamically ourselves when pfreeing blocks, just like malloc does.

What do you think of this analysis and idea?

--
Ronan Dunklau