Re: Use generation memory context for tuplestore.c - Mailing list pgsql-hackers
From: David Rowley
Subject: Re: Use generation memory context for tuplestore.c
Msg-id: CAApHDvobfoHnWB8mmWm=3b6ZUaGh-Bmh7_fYRXgd2hJZ4xDUAQ@mail.gmail.com
In response to: Re: Use generation memory context for tuplestore.c (Dmitry Dolgov <9erthalion6@gmail.com>)
List: pgsql-hackers
Thanks for having a look at this.

On Sat, 11 May 2024 at 04:34, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Do I understand correctly, that the efficiency of generation memory
> context could be measured directly via counting number of malloc/free
> calls? In those experiments I've conducted the numbers were indeed
> visibly lower for the patched version (~30%), but I would like to
> confirm my interpretation of this difference.

For the test case I had in the script, I imagine the reduction in
malloc/free calls is just because of the elimination of the power-of-2
round-up. If the test case had resulted in tuplestore_trim() being used,
e.g. if Material was used to allow mark and restore for a Merge Join or
WindowAgg, then tuplestore_trim() would result in pfree()ing of tuples
and further reduce the number of mallocs for new blocks. That is because
AllocSet records pfree'd non-large chunks in a freelist element and
makes no attempt to free() blocks, whereas in the tuplestore_trim()
scenario with the patched version, the pfree'ing of chunks happens in
first-in-first-out order, which allows an entire block to become empty
of palloc'd chunks, at which point it becomes the freeblock and gets
reused rather than malloc'ing another block when there's not enough
space on the current one. If tuplestore_trim() always manages to keep
the number of tuples down to something that fits on one GenerationBlock,
then we'll only malloc() 2 blocks and the generation code will just
alternate between the two, reusing the empty one when it needs a new
block rather than calling malloc().

> > 5. Higher likelihood of neighbouring tuples being stored consecutively
> > in memory, resulting in better CPU memory prefetching.
>
> I guess this roughly translates into better CPU cache utilization.
> Measuring cache hit ratio for unmodified vs patched versions in my case
> indeed shows about 10% less cache misses.

Thanks for testing that. In simple test cases, that's likely to come
from removing the power-of-2 round-up behaviour of AllocSet, which
allows the allocations to be more dense. In more complex cases, where
AllocSet would be making occasional use of chunks from its freelist[],
the patched version's access pattern will be more predictable, allowing
the CPU to prefetch memory more efficiently and resulting in a better
CPU cache hit ratio.

> The query you use in the benchmark, is it something like a "best-case"
> scenario for generation memory context? I was experimenting with
> different size of the b column, lower values seems to produce less
> difference between generation and aset (although still generation
> context is distinctly faster regarding final query latencies, see the
> attached PDF estimation, ran for 8192 rows). I'm curious what could be a
> "worst-case" workload type for this patch?

It's not purposefully "best-case". Likely there'd be more improvement if
I'd scanned the Material node more than twice. However, the tuple size I
picked is close to the best case as it's just over 1024 bytes:
Generation will just round the size up to the next 8-byte alignment
boundary, whereas AllocSet will round it up to 2048 bytes. A case where
the patched version would do even better is one where the unpatched
version spills to disk but the patched version does not. I imagine
there's room for hundreds of percent speedup in that case.
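To give an idea of the shape of test being discussed, something along
the lines of the following exercises an in-memory tuplestore with
tuples just over 1kB. This is only an illustrative sketch with made-up
sizes and settings, not the script attached upthread (and unlike that
test, the Materialize node here gets rescanned once per outer row
rather than just twice):

create table mat_test (a int, b text);
-- each b value is 1024 bytes, so each stored tuple ends up just over 1kB;
-- AllocSet rounds such an allocation up to 2048 bytes, Generation only up
-- to the next 8-byte boundary
insert into mat_test select x, repeat('x', 1024)
from generate_series(1, 10000) x;
vacuum analyze mat_test;

set work_mem = '256MB';     -- large enough that the tuplestore stays in memory
set enable_hashjoin = off;  -- coax the planner into Nested Loop + Materialize
set enable_mergejoin = off;

explain (analyze, costs off, timing off, summary off)
select count(*) from mat_test t1 inner join mat_test t2 on t1.a = t2.a;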
> I've also noticed the first patch disables materialization in some tests.
>
> --- a/src/test/regress/sql/partition_prune.sql
> +++ b/src/test/regress/sql/partition_prune.sql
>
> +set enable_material = 0;
> +
> -- UPDATE on a partition subtree has been seen to have problems.
> insert into ab values (1,2);
> explain (analyze, costs off, summary off, timing off)
>
> Is it due to the volatility of Maximum Storage values? If yes, could it
> be covered with something similar to COSTS OFF instead?

I don't think hiding this under COSTS OFF would be very consistent with
Sort and Memoize's memory stats output. I opted to disable the Material
node for that plan as it seemed like the easiest way to make the output
stable. There are other ways it could be done; see explain_memoize() in
memoize.sql (a rough sketch of that approach is at the end of this
mail). I thought about doing that, but it seemed like overkill when
there was a much simpler way to get what I wanted. I'd certainly live to
regret that if disabling Material put the Nested Loop costs dangerously
close to the costs of some other join method and the plan became
unstable on the buildfarm.

David
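For reference, the explain_memoize()-style approach amounts to a small
plpgsql wrapper that filters the volatile numbers out of the EXPLAIN
output before it is compared against the expected file. Below is a
rough, untested sketch of that idea; the function name and the exact
"Maximum Storage: ..." text it matches are assumptions of mine, not
code from the patch or from memoize.sql:

create or replace function explain_material(query text) returns setof text
language plpgsql as
$$
declare
    ln text;
begin
    for ln in
        execute format('explain (analyze, costs off, summary off, timing off) %s',
                       query)
    loop
        -- mask the volatile storage figure with a stable placeholder
        ln := regexp_replace(ln, 'Maximum Storage: \d+', 'Maximum Storage: N');
        return next ln;
    end loop;
end;
$$;

-- usage: select explain_material('<the query whose plan we want to check>');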