From: Tomas Vondra <tomas.vondra@enterprisedb.com>
Subject: Re: Use generation context to speed up tuplesorts
Msg-id: bcdd4e3e-c12d-cd2b-7ead-a91ad416100a@enterprisedb.com
In response to: Re: Use generation context to speed up tuplesorts (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers

Hi,

I've spent quite a bit of time investigating the regressions clearly 
visible in the benchmark results for some allocation/free patterns. 
Attached is a subset of results from a recent run, but the old results 
show mostly the same thing.

Some allocation patterns (e.g. fifo-cycle or lifo-cycle) are perfectly 
fine - the patches are a clear improvement. But the results for the 
fifo-decrease pattern, for example, look like this (showing just the 
palloc timings):

     block    chunk      0001     0002    0003     0004
    ----------------------------------------------------
     32768      512     38497    37234   39817   226451

0001 just adds an extension used for the benchmarking, so it's pretty 
much "master" without any behavior changes. 0002 and 0003 make some 
changes that seem to have no impact on this workload, but 0004 makes it 
about 5x slower. Which is bizarre, because 0004 just tweaks how we 
calculate the threshold for oversized chunks.

But that shouldn't have any impact, because with the benchmark 
parameters the threshold should be 64kB, way more than the chunk size 
(which is what we allocate).
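
To make the reasoning concrete, the oversized-chunk decision is just a 
size check against a per-context limit, roughly like this (a simplified 
sketch with illustrative names, not the actual code from 0004):

    #include <stdbool.h>
    #include <stddef.h>

    /*
     * Simplified sketch of an "oversized chunk" check; the field names are
     * illustrative, not the actual ones touched by 0004.
     */
    typedef struct BenchContextSketch
    {
        size_t  blockSize;          /* current block size */
        size_t  maxBlockSize;       /* block size cap, e.g. 1MB */
        size_t  allocChunkLimit;    /* chunks above this get a dedicated block */
    } BenchContextSketch;

    static bool
    chunk_is_oversized(BenchContextSketch *context, size_t size)
    {
        /*
         * With the benchmark parameters above the limit works out to 64kB,
         * while the requested chunks are far smaller, so this should never
         * be true in this workload.
         */
        return size > context->allocChunkLimit;
    }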

Similarly, it should not matter whether we double or quadruple the block 
size, because we reach the maximum block size (1MB) very fast in both 
cases, and then just use 1MB blocks for 99% of the benchmark.
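
Put differently, the block-size growth converges on the cap within a 
handful of steps regardless of the growth factor, something like this 
(again only a sketch):

    #include <stddef.h>

    /*
     * Sketch of geometric block-size growth with a cap. Starting at 32kB
     * and capping at 1MB, doubling reaches the cap in 5 steps and
     * quadrupling in 3, so nearly all blocks end up being 1MB either way.
     */
    static size_t
    next_block_size(size_t cur_size, size_t max_size, int growth_factor)
    {
        size_t  next = cur_size * growth_factor;

        return (next > max_size) ? max_size : next;
    }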

I was thinking that maybe the extra allocChunkLimit increases the size 
of the GenerationContext struct, requiring another cacheline. But no, 
it's 3 cachelines in both cases.

But while trying to investigate / profile this, I noticed a rather 
bizarre behavior, which I can't explain but I suspect it may be related 
to the slowdown.

The benchmark queries execute generation_bench_fifo() for a range of 
block_size / chunk_size combinations, and look about like this:

     select block_size, chunk_size, x.*
       from generate_series(32,512,32) a(chunk_size),
          (values (1024), (2048), (4096), (8192), (16384), (32768))
          AS b(block_size),
       lateral generation_bench_fifo(1000000, block_size, chunk_size,
                                     2*chunk_size, 100, 10000, 5000) x;

Which takes a while to run, and then produces something like this:

      block_size | chunk_size | mem_allocated | alloc_ms | free_ms
     ------------+------------+---------------+----------+---------
            1024 |         32 |      81528832 |    62016 |   24601
            1024 |         64 |     137449472 |    61951 |   36034
            1024 |         96 |     208408464 |    85437 |   57984
     ...
           32768 |        448 |     707756032 |    36605 |   67648
           32768 |        480 |     757104640 |    36838 |   71967
           32768 |        512 |     806256640 |    37073 |   76489
     (96 rows)


Now, I can run the benchmark for a single case (e.g. the last one in the 
results above), say like this:

     select block_size, chunk_size, x.* from
        (values (512)) AS a(chunk_size),
        (values (32768)) AS b(block_size),
     lateral generation_bench_fifo(1000000, block_size, chunk_size,
                                   2*chunk_size, 100, 10000, 5000) x;


And now comes the funny part - if I run it in the same backend as the 
"full" benchmark, I get roughly the same results:

      block_size | chunk_size | mem_allocated | alloc_ms | free_ms
     ------------+------------+---------------+----------+---------
           32768 |        512 |     806256640 |    37159 |   76669

but if I reconnect and run it in a new backend, I get this:

      block_size | chunk_size | mem_allocated | alloc_ms | free_ms
     ------------+------------+---------------+----------+---------
           32768 |        512 |     806158336 |   233909 |  100785
     (1 row)

It does not matter whether I wait a bit before running the query, run it 
repeatedly, etc. The machine is not doing anything else, the CPU is set 
to the "performance" governor, etc.

This is really strange, and the really suspicious thing is that this 
slower timing is very close to the 0004 timing (which is 226451). So 
perhaps the reason for these slowdowns is the same, it's just that with 
0004 it happens all the time.

According to perf profiling, it seems there's one pretty significant 
change - 0004 spends quite a bit of time in asm_exc_page_fault:

   41.17%     5.26%  postgres  postgres   [.] GenerationAlloc
           |
            --36.06%--GenerationAlloc
                      |
                      |--28.95%--asm_exc_page_fault
                      |          |
                      |           --23.17%--exc_page_fault
                      |                     |

while on 0003 there's not a single frame with asm_exc_page_fault. I have 
no idea why, though.
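
One way to check whether page faults really account for the gap would be 
to count them directly around the allocation loop, e.g. with getrusage() 
(a rough sketch, not part of the patches; run_benchmark() stands in for 
whatever workload is being measured):

    #include <stdio.h>
    #include <sys/resource.h>

    /*
     * Rough sketch: report minor/major page faults incurred by a workload,
     * using getrusage(2). run_benchmark is a placeholder callback.
     */
    static void
    report_page_faults(void (*run_benchmark) (void))
    {
        struct rusage   before, after;

        getrusage(RUSAGE_SELF, &before);
        run_benchmark();
        getrusage(RUSAGE_SELF, &after);

        printf("minor faults: %ld, major faults: %ld\n",
               after.ru_minflt - before.ru_minflt,
               after.ru_majflt - before.ru_majflt);
    }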

I wonder if this might be some unexpected / unfortunate interaction with 
kernel memory management, or something like that. Any ideas what might 
be the root cause, what to look for, etc.?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
