Re: Use generation context to speed up tuplesorts - Mailing list pgsql-hackers

From Ronan Dunklau
Subject Re: Use generation context to speed up tuplesorts
Date
Msg-id 8046109.NyiUUSuA9g@aivenronan
In response to Re: Use generation context to speed up tuplesorts  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Use generation context to speed up tuplesorts
List pgsql-hackers
On Thursday, 9 September 2021 at 15:37:59 CET, Tomas Vondra wrote:
> And now comes the funny part - if I run it in the same backend as the
> "full" benchmark, I get roughly the same results:
>
>       block_size | chunk_size | mem_allocated | alloc_ms | free_ms
>      ------------+------------+---------------+----------+---------
>            32768 |        512 |     806256640 |    37159 |   76669
>
> but if I reconnect and run it in the new backend, I get this:
>
>       block_size | chunk_size | mem_allocated | alloc_ms | free_ms
>      ------------+------------+---------------+----------+---------
>            32768 |        512 |     806158336 |   233909 |  100785
>      (1 row)
>
> It does not matter if I wait a bit before running the query, if I run it
> repeatedly, etc. The machine is not doing anything else, the CPU is set
> to use "performance" governor, etc.

I've reproduced the behaviour you mention.
I also noticed asm_exc_page_fault showing up in the perf report in that case.

Running strace on it shows that in the fresh-backend case we issue a lot of brk
calls, while when we run in the same backend as the previous tests, we don't.
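
Something along these lines is enough to count the memory-management syscalls
issued by the backend (the exact strace invocation doesn't matter much, this is
just one way of attaching to the backend pid):

    strace -c -e trace=%memory -p <backend pid>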

My suspicion is that the previous workload makes glibc's malloc adjust its
trim_threshold and possibly other dynamic options, which leads to the brk
pointer being moved constantly in one case but not in the other.

Running your fifo test with absurdly large malloc options shows that this might
indeed be the case (I needed to change several of them, because changing just
one disables the dynamic adjustment for all of them, and malloc would then fall
back to using mmap and freeing the memory on each iteration):

mallopt(M_TOP_PAD, 1024 * 1024 * 1024);
mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024);
mallopt(M_MMAP_THRESHOLD, 4*1024*1024*sizeof(long));
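
For what it's worth, here is the kind of standalone program I have in mind to
observe the difference outside of Postgres (just a sketch: the chunk size,
count and file name are arbitrary, not the extension's exact allocation
pattern):

/*
 * malloc_test.c: allocate and free a few hundred MB per iteration so the
 * brk/munmap traffic can be compared under strace, with and without the
 * oversized thresholds above.
 */
#include <malloc.h>
#include <stdlib.h>
#include <string.h>

#define NCHUNKS		10000
#define CHUNK_SIZE	32768
#define ITERATIONS	100

int
main(int argc, char **argv)
{
	static char *chunks[NCHUNKS];

	if (argc > 1)		/* any argument: apply the oversized settings */
	{
		mallopt(M_TOP_PAD, 1024 * 1024 * 1024);
		mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024);
		mallopt(M_MMAP_THRESHOLD, 4 * 1024 * 1024 * sizeof(long));
	}

	for (int it = 0; it < ITERATIONS; it++)
	{
		for (int i = 0; i < NCHUNKS; i++)
		{
			chunks[i] = malloc(CHUNK_SIZE);
			memset(chunks[i], 0, CHUNK_SIZE);	/* touch the pages */
		}
		for (int i = 0; i < NCHUNKS; i++)
			free(chunks[i]);
	}
	return 0;
}

With the default settings the heap should get trimmed and re-grown through brk
on every iteration; with the oversized settings the first iteration's brk calls
should be the only ones.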

I get the following results for your self-contained test. In each case I ran
the query twice, to see the impact of the first run's allocations compared to
the subsequent ones:

With default malloc options:

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   300156 |  207557

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   211942 |   77207


With the oversized values above:

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   219000 |   36223


 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |    75761 |   78082
(1 row)

I can't tell how representative your benchmark extension is of real-life
allocation / free patterns, but there is probably something we can improve
here.

I'll try to see if I can understand more precisely what is happening.

--
Ronan Dunklau




