Re: Reduce TupleHashEntryData struct size by half

From: Heikki Linnakangas
On 18/11/2024 02:01, Jeff Davis wrote:
> 
> TupleHashEntryData is the size of a hash bucket, and it's currently 24
> bytes. The size of the structure is important for HashAgg and other
> callers that can build up large hashtables of small tuples. Waste in
> this structure is even worse when the hash table is sparse, because the
> space is consumed whether the bucket is occupied or not.
> 
> The attached patch series brings it down to 12. There are several
> unappealing hacks, so ideally we'd find something cleaner, but halving
> the bucket size seems quite desirable if it can be done in an
> acceptable way.

Makes sense.

> 0001: This patch makes the structure private, so that it's easier to
> control the way the structure is accessed.

Seems pretty uncontroversial. You removed the typedef for struct 
TupleHashEntryData, which is a bit unusual for our source style. 
Was there a reason for that?

> 0002: Removes the "additional" pointer, instead storing it right after
> the tuple, which is already stored in a separate chunk. Hack: this
> crams the "additional" pointer after the MinimalTuple in the same chunk
> of memory to avoid adding additional palloc()s.

Hmm, it would seem more straightforward to store it in the beginning, 
i.e. have something like this:

struct {
      void *additional;
      MinimalTupleData mtup;
} ;

Come to think of it, how important is it that we use MinimalTuple here 
at all? Some other representation could be faster to deal with in 
TupleHashTableMatch() anyway.
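For contrast, the after-the-tuple scheme 0002 describes could be sketched 
like this (all names and the alignment macro are stand-ins, not the actual 
patch code):

```c
#include <stdint.h>
#include <stdlib.h>

/* stand-in for PostgreSQL's MAXALIGN, assuming 8-byte maximum alignment */
#define FAKE_MAXALIGN(len) (((uintptr_t) (len) + 7) & ~((uintptr_t) 7))

/* stand-in for MinimalTupleData: t_len is the tuple's total size */
typedef struct
{
	uint32_t	t_len;
} FakeMinimalTuple;

/*
 * Allocate the tuple plus room for the crammed "additional" pointer in
 * one chunk, so no separate bucket field or palloc is needed for it.
 */
static FakeMinimalTuple *
fake_entry_alloc(uint32_t tuple_len)
{
	FakeMinimalTuple *mtup = malloc(FAKE_MAXALIGN(tuple_len) + sizeof(void *));

	mtup->t_len = tuple_len;
	return mtup;
}

/* the "additional" pointer lives right after the MAXALIGN'd tuple */
static void **
fake_entry_additional_ptr(FakeMinimalTuple *mtup)
{
	return (void **) ((char *) mtup + FAKE_MAXALIGN(mtup->t_len));
}
```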

> 0003: simplehash: allow the caller to decide which entries are empty vs
> in-use rather than requiring a separate "status" field. This may limit
> other possible status values in the future (i.e. adding on to the
> enum), but I'm not sure what those other status values might be.

+1. I've wanted to have this in the past.
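In miniature, the idea is that the caller supplies the emptiness test 
instead of simplehash storing a status byte per entry (hypothetical names, 
not the actual simplehash.h template interface):

```c
#include <stdbool.h>
#include <stddef.h>

/* a hash-table entry with no status field */
typedef struct
{
	const char *key;			/* NULL means the slot is empty */
	int			value;
} DemoEntry;

/* caller-provided predicate, replacing the built-in status byte */
static inline bool
demo_entry_is_empty(const DemoEntry *e)
{
	return e->key == NULL;
}

/* caller-provided "mark empty", replacing status = EMPTY */
static inline void
demo_entry_set_empty(DemoEntry *e)
{
	e->key = NULL;
}
```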

> 0004: Removes the "status" field from TupleHashEntryData, using
> firstTuple==NULL to mean "empty", otherwise "in use". Hack: need an
> additional "special" pointer value to mean "input slot" now that NULL
> means "empty".

+1

> 0005: Pack TupleHashEntryData. IIUC, this is fine even on machines that
> can't do unaligned access, so long as we are accessing the fields
> through the struct, and not taking the address of individual members.

Seems OK.

I wonder if the compiler understands that the members are still 4-byte 
aligned, or if it forces byte-by-byte access? Playing with godbolt a 
little, it seems like GCC at least understands it, but clang does not. 
On architectures with non-strict alignment, it doesn't matter, as a 
simple load/store instruction is the fastest option anyway.
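A toy illustration of that point, using the GCC/clang attribute directly 
(PostgreSQL spells it pg_attribute_packed(); the fields here are made up): 
reading a misaligned member through the struct is fine, because the 
compiler emits whatever access sequence the target needs, but taking the 
member's address and dereferencing it separately would not be.

```c
#include <stdint.h>

/* a packed struct: 'value' sits at offset 5, misaligned for a uint32_t */
typedef struct
{
	uint32_t	hash;			/* offset 0 */
	char		flag;			/* offset 4 */
	uint32_t	value;			/* offset 5 */
} __attribute__((packed)) PackedEntry;

/* safe: the compiler handles the unaligned load of e->value */
static uint32_t
packed_entry_sum(const PackedEntry *e)
{
	return e->hash + e->value;
}
```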

-- 
Heikki Linnakangas
Neon (https://neon.tech)



Re: Reduce TupleHashEntryData struct size by half

From: Jeff Davis
On Mon, 2024-11-18 at 12:13 +0200, Heikki Linnakangas wrote:
> Seems pretty uncontroversial. You removed the typedef for struct
> TupleHashEntryData, which is a bit unusual for our source style.
> Was there a reason for that?

If it's private to a file and I don't intend to use it a lot, I do it
to cut down typedefs.list bloat. I'll go ahead and add the typedef back
to match the style, though.

>
> Hmm, it would seem more straightforward to store it in the beginning,
> i.e. have something like this:
>
> struct {
>       void *additional;
>       MinimalTupleData mtup;
> } ;

That was my first approach, but it requires an additional memcpy,
because ExecCopySlotMinimalTuple() does its own palloc. I used
repalloc() because it will often have space at the end of the chunk
anyway, and not need a memcpy(). Maybe that's not significant, but it
did seem detectable in some perf tests.

But perhaps we can go further and get rid of the "additional" pointer
and inline the pergroup data and the grouping key tuple into the same
palloc chunk? That would cut out a palloc and the 8 wasted bytes on the
pointer.

> Come to think of it, how important is it that we use MinimalTuple
> here
> at all? Some other representation could be faster to deal with in
> TupleHashTableMatch() anyway.

What did you have in mind? That sounds like a good idea orthogonal to
reducing the bucket size.

Alternatively, MinimalTuple is not very "minimal", and perhaps we can
just make it better.

>
> > 0004: Removes the "status" field from TupleHashEntryData, using
> > firstTuple==NULL to mean "empty", otherwise "in use". Hack: need an
> > additional "special" pointer value to mean "input slot" now that
> > NULL
> > means "empty".
>
> +1

For the FIRSTTUPLE_INPUTSLOT marker, do you think it's cleaner to use
what I did:

  const static MinimalTuple FIRSTTUPLE_INPUTSLOT = (MinimalTuple) 0x1;

or something like:

  static MinimalTupleData dummy = {0};
  const static MinimalTuple FIRSTTUPLE_INPUTSLOT = &dummy;

?

>
> On architectures with non-strict alignment, it doesn't matter as a
> simple load/store instruction is the fastest option anyway.

My intuition is that the cost of dereferencing that pointer (to memory
which is not expected to be in cache) is going to be way higher than
the cost of a couple extra instructions to do the unaligned access.

Regards,
    Jeff Davis




Re: Reduce TupleHashEntryData struct size by half

From: Heikki Linnakangas
On 18/11/2024 22:22, Jeff Davis wrote:
> On Mon, 2024-11-18 at 12:13 +0200, Heikki Linnakangas wrote:
>> Hmm, it would seem more straightforward to store it in the beginning,
>> i.e. have something like this:
>>
>> struct {
>>        void *additional;
>>        MinimalTupleData mtup;
>> } ;
> 
> That was my first approach, but it requires an additional memcpy,
> because ExecCopySlotMinimalTuple() does its own palloc. I used
> repalloc() because it will often have space at the end of the chunk
> anyway, and not need to memcpy(). Maybe that's not significant but it
> did seem detectable in some perf tests.
> 
> But perhaps we can go further and get rid of the "additional" pointer
> and inline the pergroup data and the grouping key tuple into the same
> palloc chunk? That would cut out a palloc and the 8 wasted bytes on the
> pointer.

Sounds like a good idea. It needs some changes to the TupleTableSlotOps 
interface to avoid the memcpy, I presume.

>> Come to think of it, how important is it that we use MinimalTuple
>> here
>> at all? Some other representation could be faster to deal with in
>> TupleHashTableMatch() anyway.
> 
> What did you have in mind? That sounds like a good idea orthogonal to
> reducing the bucket size.

Queries that have only a small number of groups might benefit from 
storing a plain Datum/isnull array instead of a MinimalTuple. That 
would take more memory when you have a lot of groups, though.

> Alternatively, MinimalTuple is not very "minimal", and perhaps we can
> just make it better.

Yeah. It tries to be compatible with HeapTuple, but perhaps we should 
give up on that and pack it more tightly instead.

>>> 0004: Removes the "status" field from TupleHashEntryData, using
>>> firstTuple==NULL to mean "empty", otherwise "in use". Hack: need an
>>> additional "special" pointer value to mean "input slot" now that
>>> NULL
>>> means "empty".
>>
>> +1
> 
> For the FIRSTTUPLE_INPUTSLOT marker, do you think it's cleaner to use
> what I did:
> 
>    const static MinimalTuple FIRSTTUPLE_INPUTSLOT = (MinimalTuple) 0x1;
> 
> or something like:
> 
>    static MinimalTupleData dummy = {0};
>    const static MinimalTuple FIRSTTUPLE_INPUTSLOT = &dummy;
> 
> ?

I think I'd do "(MinimalTuple) 0x1" myself, but no strong opinion.

-- 
Heikki Linnakangas
Neon (https://neon.tech)



Re: Reduce TupleHashEntryData struct size by half

From: David Rowley
On Thu, 13 Feb 2025 at 14:01, Jeff Davis <pgsql@j-davis.com> wrote:
> Attached a new patchset that doesn't change the API for
> ExecCopySlotMinimalTuple(). Instead, I'm using
> ExecFetchSlotMinimalTuple(), which avoids the extra memcpy if the slot
> is TTSOpsMinimalTuple, which is what HashAgg uses. I separated out the
> change to use ExecFetchSlotMinimalTuple() into 0004 to illustrate the
> performance impact.

Some review comments:

* my_log2() takes a "long" parameter type but transitionSpace is a
"Size". These types aren't the same width on all platforms we support.
Maybe pg_nextpower2_size_t() is a better option.
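The hazard in miniature: on 64-bit Windows, long is 32 bits while Size is 
64, so a large Size passed through a long parameter can be truncated. A 
Size-wide helper avoids that; here is an illustrative reimplementation, 
not the actual pg_nextpower2_size_t():

```c
#include <stddef.h>

/* illustrative next-power-of-2 that keeps the full width of size_t */
static size_t
demo_nextpower2_size_t(size_t num)
{
	size_t		result = 1;

	while (result < num)
		result <<= 1;
	return result;
}
```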

* Should the following have MAXALIGN(tupleSize) on the right side of
the expression?

tupleChunkSize = tupleSize

I understand this was missing before, but bump.c does consume at least
MAXALIGN(<req_size>) in BumpAlloc().

* while (16 * maxBlockSize > work_mem * 1024L)

The 1024L predates the change made in 041e8b95. "1024L" needs to be
replaced with "(Size) 1024".

Maybe you could just replace the while loop and the subsequent "if" check with:

/*
 * Like CreateWorkExprContext(), use smaller sizings for smaller work_mem,
 * to avoid large jumps in memory usage.
 */
maxBlockSize = pg_prevpower2_size_t(work_mem * (Size) 1024 / 16);

/* But no bigger than ALLOCSET_DEFAULT_MAXSIZE */
maxBlockSize = Min(maxBlockSize, ALLOCSET_DEFAULT_MAXSIZE);

/* and no smaller than ALLOCSET_DEFAULT_INITSIZE */
maxBlockSize = Max(maxBlockSize, ALLOCSET_DEFAULT_INITSIZE);

I believe that gives the same result without the looping.

* In hash_create_memory(), can you get rid of the minContextSize and
initBlockSize variables?

* Is it worth an Assert() check that additionalsize > 0?

/*
 * The amount of space available is the additionalsize requested in the call
 * to BuildTupleHashTable(). If additionalsize was specified as zero, no
 * additional space is available and this function should not be called.
 */
static inline void *
TupleHashEntryGetAdditional(TupleHashTable hashtable, TupleHashEntry entry)
{
	return (char *) entry->firstTuple - hashtable->additionalsize;
}

Benchmarks:

I was also experimenting with the performance of this using the
following test case:

create table c (a int not null);
insert into c select a from generate_Series(1,1000) a;
vacuum freeze analyze c;

Query: select a,count(*) from c group by a;

Average TPS over 10x 10-second runs with -M prepared:

master: 3653.9853988
v7-0001: 3741.815979
v7-0001+0002: 3737.4313064
v7-0001+0002+0003: 3834.6271706
v7-0001+0002+0003+0004: 3912.1158887

I also tried out changing hash_agg_check_limits() so that it no longer
calls MemoryContextMemAllocated and instead uses ->mem_allocated
directly and with all the other patches got:

v7-0001+0002+0003+0004+extra: 4049.0732381

We probably shouldn't do exactly that, as it would be better not to
access that internal field from outside the memory context code.  A
static inline function in memutils.h that handles the non-recursive
callers might be nice.
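Roughly this shape, sketched with a stand-in struct rather than the real 
MemoryContextData (and with the child list simplified to a first-child 
chain):

```c
#include <stddef.h>

/* stand-in for MemoryContextData; only the fields used here */
typedef struct DemoMemoryContext
{
	size_t		mem_allocated;
	struct DemoMemoryContext *firstchild;
} DemoMemoryContext;

/* for callers that know the context has no children: just a field read */
static inline size_t
demo_mem_allocated_self(const DemoMemoryContext *cxt)
{
	return cxt->mem_allocated;
}

/* walking total, akin to MemoryContextMemAllocated(cxt, true) */
static size_t
demo_mem_allocated_recurse(const DemoMemoryContext *cxt)
{
	size_t		total = 0;

	for (; cxt != NULL; cxt = cxt->firstchild)
		total += cxt->mem_allocated;
	return total;
}
```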

I've attached my results of running your test in graph form. I also
see a small regression for these small-scale tests. I'm wondering if
it's worth mocking up some code to see what the performance would be
like without the additional memcpy() that's new to
LookupTupleHashEntry_internal().  How about hacking something up that
adds an additionalsize field to TupleDesc, sets that field to your
additional size, and has heap_form_minimal_tuple() allocate that much
extra memory?

David

Attachment

Re: Reduce TupleHashEntryData struct size by half

From: Jeff Davis
On Fri, 2025-02-28 at 17:09 +1300, David Rowley wrote:
> * my_log2() takes a "long" parameter type but transitionSpace is a
> "Size". These types aren't the same width on all platforms we
> support.
> Maybe pg_nextpower2_size_t() is a better option.

Done.

> * Should the following have MAXALIGN(tupleSize) on the right side of
> the expression?

Done.

> Maybe you could just replace the while loop and the subsequent "if"
> check with:

Done.

> * In hash_create_memory(), can you get rid of the minContextSize and
> initBlockSize variables?

Done.

> * Is it worth an Assert() check that additionalsize > 0?

One caller happens to call it unconditionally. It seems better to
return NULL if additionalsize == 0.
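In sketch form (the pointer arithmetic matches the patch's accessor, but 
the signature here is simplified and the names are stand-ins):

```c
#include <stddef.h>

/*
 * The additional data sits immediately before firstTuple; with
 * additionalsize == 0 there is nothing there, so return NULL rather
 * than asserting.
 */
static inline void *
demo_entry_get_additional(char *firstTuple, size_t additionalsize)
{
	if (additionalsize == 0)
		return NULL;
	return firstTuple - additionalsize;
}
```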

> create table c (a int not null);
> insert into c select a from generate_Series(1,1000) a;
> vacuum freeze analyze c;

The schema you posted is the narrower table, and your numbers better
match the wide table you posted before. Was there a mixup?

> master: 3653.9853988
> v7-0001: 3741.815979
> v7-0001+0002: 3737.4313064
> v7-0001+0002+0003: 3834.6271706
> v7-0001+0002+0003+0004: 3912.1158887
>
> I also tried out changing hash_agg_check_limits() so that it no
> longer
> calls MemoryContextMemAllocated and instead uses ->mem_allocated
> directly and with all the other patches got:
>
> v7-0001+0002+0003+0004+extra: 4049.0732381

Great!

> We probably shouldn't do exactly that, as it would be better not to
> that internal field from outside the memory context code.  A static
> inline function in memutils.h that handles the non-recursive callers
> might be nice.

Both the metacxt as well as the context used for byref transition
values can have child contexts, so we should keep the recursion. I just
inlined MemoryContextMemAllocated() and MemoryContextTraverseNext().

> I've attached my results of running your test in graph form.

Thank you!

My results (with wide tables):

                        GROUP BY      EXCEPT
   master:                  2151        1732
   entire v8 series:        2054        1740

In some of the patches in the middle of the series, I ran into some
hard-to-explain regressions, so consider my results preliminary. I may
need to profile and figure out what's going on. I didn't see any
overall regression.

But the series overall seems about even, while the memory consumption
is ~35% lower for the example I posted in the first message in the
thread.


>   How about hacking something up that
> adds an additionalsize field to TupleDesc and then set that field to
> your additional size and have heap_form_minimal_tuple() allocate that
> much extra memory?

I assume we wouldn't want to actually add a field to TupleDescData,
right?

When I reworked the ExecCopySlotMinimalTupleExtra() API to place the
extra memory before the tuple, it worked out to be a bit cleaner with
fewer special cases, so I'm fine with that API now.

Regards,
    Jeff Davis


Attachment

Re: Reduce TupleHashEntryData struct size by half

From: Jeff Davis
On Tue, 2025-03-04 at 17:28 -0800, Jeff Davis wrote:
> My results (with wide tables):
>
>                         GROUP BY      EXCEPT
>    master:                  2151        1732
>    entire v8 series:        2054        1740

I'm not sure what I did with the EXCEPT test above; perhaps I filled
the test table twice by mistake?

New numbers below:

Data:

  create table c (a int not null);
  insert into c select a from generate_Series(1,1000) a;

  create table big(i int, j int);
  insert into big select g, 1 from generate_series(1,10000000) g;

  create table wide (t text not null);
  insert into wide
    select repeat(md5(a::text),10)
    from generate_Series(1,1000) a;

  create table wide_copy (t text not null);
  insert into wide_copy select * from wide;

  vacuum freeze analyze;


Results:

              c(tps) wide(tps) except(tps)  big(ms)  big(MiB)

91ecb5e0bc     4768      2091        2317     3853       768
master         4942      2063        2323     3831       768
0001           4932      2038        2361     3712       616
0002           4601      2034        2365     3753       616
0003           4850      2040        2418     3749       616
0004           4744      2065        2282     3665       488
0005           4630      1994        2307     3665       488
0006           4809      2063        2394     3646       488
0007           4752      2070        2479     3659       488

Note: c.sql, wide.sql, and except.sql are measured in tps, higher is
better. big.sql is measured in ms and MiB, lower is better.

For some reason I'm getting a decline of about 3% in the c.sql test
that seems to be associated with the accessor functions, even when
inlined. I'm also not seeing as much benefit from the inlining of
MemoryContextMemAllocated(). But these mixed results are minor
compared with the memory savings of 35% and the more consistently
improved performance of 5% on the larger test (which is also integers),
so I plan to commit it.

Regards,
    Jeff Davis


Attachment

Re: Reduce TupleHashEntryData struct size by half

From: Jeff Davis
On Sat, 2025-03-22 at 09:39 -0700, Jeff Davis wrote:
> For some reason I'm getting a decline of about 3% in the c.sql test
> that seems to be associated with the accessor functions, even when
> inlined. I'm also not seeing as much benefit from the inlining of the
> MemoryContextMemAllocated(). But these mixed test results are minor
> compared with the memory savings of 35% and the more consistently-
> improved performance of 5% on the larger test (which is also
> integers),
> so I plan to commit it.

Committed, except for v9-0007. (Note that some commits were combined;
they were only separated originally for performance testing.)

I attached a new version of v9-0007, now v10-0001, that uses
recurse=false for two of the memory contexts. I didn't see a major
speedup, but posting here anyway.

Regards,
    Jeff Davis


Attachment