Re: Use generation context to speed up tuplesorts - Mailing list pgsql-hackers
From           | Tomas Vondra
Subject        | Re: Use generation context to speed up tuplesorts
Date           |
Msg-id         | bcdd4e3e-c12d-cd2b-7ead-a91ad416100a@enterprisedb.com
In response to | Re: Use generation context to speed up tuplesorts (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses      | Re: Use generation context to speed up tuplesorts
               | Re: Use generation context to speed up tuplesorts
List           | pgsql-hackers
Hi,

I've spent quite a bit of time investigating the regressions clearly
visible in the benchmark results for some allocation/free patterns.
Attached is a subset of results from a recent run, but the old results
show mostly the same thing.

Some allocation patterns (e.g. fifo-cycle or lifo-cycle) are perfectly
fine - the patches are a clear improvement. For example, the results for
the fifo-decrease pattern look like this (showing just the palloc
timings):

    block   chunk     0001     0002     0003     0004
    --------------------------------------------------
    32768     512    38497    37234    39817   226451

0001 just adds an extension used for the benchmarking, so it's pretty
much "master" without any behavior changes. 0002 and 0003 make some
changes that seem to have no impact on this workload, but 0004 makes it
about 5x slower.

Which is bizarre, because 0004 just tweaks how we calculate the
threshold for oversized chunks. But that shouldn't have any impact,
because with the benchmark parameters the threshold should be 64kB, way
more than the chunk size (which is what we allocate). Similarly, it
should not matter whether we double or quadruple the block size, because
we reach the maximum block size (1MB) very fast in both cases, and then
just use 1MB blocks for 99% of the benchmark. (A toy model of both
points is further down in this message.)

I was thinking that maybe the extra allocChunkLimit increases the size
of the GenerationContext struct, requiring another cacheline. But no,
it's 3 cachelines in both cases.

But while trying to investigate / profile this, I noticed a rather
bizarre behavior, which I can't explain but I suspect may be related to
the slowdown. The benchmark queries execute generation_bench_fifo() for
a range of block_size / chunk_size combinations, and look about like
this:

    select block_size, chunk_size, x.* from
      generate_series(32,512,32) a(chunk_size),
      (values (1024), (2048), (4096), (8192), (16384), (32768)) AS b(block_size),
      lateral generation_bench_fifo(1000000, block_size, chunk_size,
                                    2*chunk_size, 100, 10000, 5000) x;

This takes a while to run, and then produces something like this:

     block_size | chunk_size | mem_allocated | alloc_ms | free_ms
    ------------+------------+---------------+----------+---------
           1024 |         32 |      81528832 |    62016 |   24601
           1024 |         64 |     137449472 |    61951 |   36034
           1024 |         96 |     208408464 |    85437 |   57984
            ...
          32768 |        448 |     707756032 |    36605 |   67648
          32768 |        480 |     757104640 |    36838 |   71967
          32768 |        512 |     806256640 |    37073 |   76489
    (96 rows)

Now, I can run the benchmark for a single case (e.g. the last one in the
results above), say like this:

    select block_size, chunk_size, x.* from
      (values (512)) AS a(chunk_size),
      (values (32768)) AS b(block_size),
      lateral generation_bench_fifo(1000000, block_size, chunk_size,
                                    2*chunk_size, 100, 10000, 5000) x;

And now comes the funny part - if I run it in the same backend as the
"full" benchmark, I get roughly the same results:

     block_size | chunk_size | mem_allocated | alloc_ms | free_ms
    ------------+------------+---------------+----------+---------
          32768 |        512 |     806256640 |    37159 |   76669

but if I reconnect and run it in a new backend, I get this:

     block_size | chunk_size | mem_allocated | alloc_ms | free_ms
    ------------+------------+---------------+----------+---------
          32768 |        512 |     806158336 |   233909 |  100785
    (1 row)

It does not matter if I wait a bit before running the query, or if I run
it repeatedly. The machine is not doing anything else, the CPU is set to
use the "performance" governor, etc.
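To make the oversized-chunk argument a bit more concrete, here is the
toy model mentioned above - a standalone sketch with assumed names and
numbers (the 64kB threshold, the 1MB cap), not the actual patch and not
generation.c code:

#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

#define INIT_BLOCK_SIZE   32768                 /* 32kB initial block size */
#define MAX_BLOCK_SIZE    (1024 * 1024)         /* 1MB block size cap */
#define ALLOC_CHUNK_LIMIT 65536                 /* assumed 64kB threshold */

/* the kind of check the oversized-chunk threshold drives */
static bool
chunk_is_oversized(size_t size)
{
    return size > ALLOC_CHUNK_LIMIT;
}

/* how many blocks it takes to grow from the initial size to the cap */
static int
blocks_until_cap(size_t block_size, int growth_factor)
{
    int         nblocks = 0;

    while (block_size < MAX_BLOCK_SIZE)
    {
        block_size *= growth_factor;
        nblocks++;
    }
    return nblocks;
}

int
main(void)
{
    printf("512-byte chunk oversized? %s\n",
           chunk_is_oversized(512) ? "yes" : "no");
    printf("blocks to reach the 1MB cap when doubling:    %d\n",
           blocks_until_cap(INIT_BLOCK_SIZE, 2));
    printf("blocks to reach the 1MB cap when quadrupling: %d\n",
           blocks_until_cap(INIT_BLOCK_SIZE, 4));
    return 0;
}

It only confirms the arithmetic: a 512-byte request never trips the
threshold, and the block size hits the cap after a handful of blocks
whether it doubles or quadruples, so neither knob should be visible in
the hot path.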
Back to the warm vs. fresh backend results: this is really strange, and
the really suspicious thing is that this slower timing is very close to
the 0004 timing (which is 226451). So perhaps the reason for these
slowdowns is the same, it's just that with 0004 it happens all the time.

According to perf profiling, it seems there's one pretty significant
change - 0004 spends quite a bit of time in asm_exc_page_fault:

    41.17%     5.26%  postgres  postgres  [.] GenerationAlloc
            |
            --36.06%--GenerationAlloc
                      |
                      |--28.95%--asm_exc_page_fault
                      |          |
                      |          --23.17%--exc_page_fault
                      |

while on 0003 there's not a single frame with asm_exc_page_fault. I have
no idea why, though. I wonder if this might be some unexpected /
unfortunate interaction with kernel memory management, or something like
that.

Any ideas what might be the root cause, what to look for, etc.?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
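PS: A minimal standalone sketch of the first-touch effect suspected
above - purely an illustration of the hypothesis (the 1MB block size
matches the benchmark, the block count is made up), not a measurement of
the patches:

/*
 * Count minor page faults while writing to freshly malloc'd 1MB blocks,
 * and then while writing to the same blocks again. The guess is that a
 * fresh backend pays the first-touch fault for every page of every new
 * block, while a warm backend largely reuses memory that is already
 * mapped.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

#define BLOCK_SIZE  (1024 * 1024)   /* 1MB, the capped block size */
#define NBLOCKS     256             /* arbitrary number of blocks */

static long
minor_faults(void)
{
    struct rusage ru;

    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int
main(void)
{
    char       *blocks[NBLOCKS];
    long        before;

    for (int i = 0; i < NBLOCKS; i++)
    {
        blocks[i] = malloc(BLOCK_SIZE);
        if (blocks[i] == NULL)
            return 1;
    }

    /* first write: the kernel has to fault in every page */
    before = minor_faults();
    for (int i = 0; i < NBLOCKS; i++)
        memset(blocks[i], 1, BLOCK_SIZE);
    printf("first touch:  %ld minor faults\n", minor_faults() - before);

    /* second write: the pages are already mapped */
    before = minor_faults();
    for (int i = 0; i < NBLOCKS; i++)
        memset(blocks[i], 2, BLOCK_SIZE);
    printf("second touch: %ld minor faults\n", minor_faults() - before);

    /* read one byte back so the writes can't be treated as dead */
    printf("check byte: %d\n", blocks[0][0]);

    for (int i = 0; i < NBLOCKS; i++)
        free(blocks[i]);

    return 0;
}

If the guess is right, almost all of the minor faults should show up on
the first pass and almost none on the second, which is the kind of
asymmetry that would surface as asm_exc_page_fault under GenerationAlloc
whenever a run has to map all of its blocks from scratch.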