Thread: Re: Reduce TupleHashEntryData struct size by half

Re: Reduce TupleHashEntryData struct size by half

From: Heikki Linnakangas
On 18/11/2024 02:01, Jeff Davis wrote:
> 
> TupleHashEntryData is the size of a hash bucket, and it's currently 24
> bytes. The size of the structure is important for HashAgg and other
> callers that can build up large hashtables of small tuples. Waste in
> this structure is even worse when the hash table is sparse, because the
> space is consumed whether the bucket is occupied or not.
> 
> The attached patch series brings it down to 12. There are several
> unappealing hacks, so ideally we'd find something cleaner, but halving
> the bucket size seems quite desirable if it can be done in an
> acceptable way.

Makes sense.

> 0001: This patch makes the structure private, so that it's easier to
> control the way the structure is accessed.

Seems pretty uncontroversial. You removed the typedef for struct 
TupleHashEntryData, which is a bit unusual for our usual source style. 
Was there a reason for that?

> 0002: Removes the "additional" pointer, instead storing it right after
> the tuple, which is already stored in a separate chunk. Hack: this
> crams the "additional" pointer after the MinimalTuple in the same chunk
> of memory to avoid adding additional palloc()s.

Hmm, it would seem more straightforward to store it in the beginning, 
i.e. have something like this:

struct {
      void *additional;
      MinimalTupleData mtup;
} ;

Come to think of it, how important is it that we use MinimalTuple here 
at all? Some other representation could be faster to deal with in 
TupleHashTableMatch() anyway.

> 0003: simplehash: allow the caller to decide which entries are empty vs
> in-use rather than requiring a separate "status" field. This may limit
> other possible status values in the future (i.e. adding on to the
> enum), but I'm not sure what those other status values might be.

+1. I've wanted to have this in the past.

> 0004: Removes the "status" field from TupleHashEntryData, using
> firstTuple==NULL to mean "empty", otherwise "in use". Hack: need an
> additional "special" pointer value to mean "input slot" now that NULL
> means "empty".

+1

> 0005: Pack TupleHashEntryData. IIUC, this is fine even on machines that
> can't do unaligned access, so long as we are accessing the fields
> through the struct, and not taking the address of individual members.

Seems OK.

I wonder if the compiler understands that the elements are still 4-byte 
aligned, or if it forces byte-per-byte access? Playing with godbolt a 
little, it seems like GCC at least understands it, but clang does not. 
On architectures with non-strict alignment, it doesn't matter as a 
simple load/store instruction is the fastest option anyway.

-- 
Heikki Linnakangas
Neon (https://neon.tech)



Re: Reduce TupleHashEntryData struct size by half

From: Jeff Davis
On Mon, 2024-11-18 at 12:13 +0200, Heikki Linnakangas wrote:
> Seems pretty uncontroversial. You removed the typedef for struct
> TupleHashEntryData, which is a bit unusual for our usual source
> style.
> Was there a reason for that?

If it's private to a file and I don't intend to use it a lot, I do it
to cut down typedefs.list bloat. I'll go ahead and add the typedef back
to match the style, though.

>
> Hmm, it would seem more straightforward to store it in the beginning,
> i.e. have something like this:
>
> struct {
>       void *additional;
>       MinimalTupleData mtup;
> } ;

That was my first approach, but it requires an additional memcpy,
because ExecCopySlotMinimalTuple() does its own palloc. I used
repalloc() because it will often have space at the end of the chunk
anyway, and not need to memcpy(). Maybe that's not significant but it
did seem detectable in some perf tests.

But perhaps we can go further and get rid of the "additional" pointer
and inline the pergroup data and the grouping key tuple into the same
palloc chunk? That would cut out a palloc and the 8 wasted bytes on the
pointer.

> Come to think of it, how important is it that we use MinimalTuple
> here
> at all? Some other representation could be faster to deal with in
> TupleHashTableMatch() anyway.

What did you have in mind? That sounds like a good idea orthogonal to
reducing the bucket size.

Alternatively, MinimalTuple is not very "minimal", and perhaps we can
just make it better.

>
> > 0004: Removes the "status" field from TupleHashEntryData, using
> > firstTuple==NULL to mean "empty", otherwise "in use". Hack: need an
> > additional "special" pointer value to mean "input slot" now that
> > NULL
> > means "empty".
>
> +1

For the FIRSTTUPLE_INPUTSLOT marker, do you think it's cleaner to use
what I did:

  static const MinimalTuple FIRSTTUPLE_INPUTSLOT = (MinimalTuple) 0x1;

or something like:

  static MinimalTupleData dummy = {0};
  static const MinimalTuple FIRSTTUPLE_INPUTSLOT = &dummy;

?

>
> On architectures with non-strict alignment, it doesn't matter as a
> simple load/store instruction is the fastest option anyway.

My intuition is that the cost of dereferencing that pointer (to memory
which is not expected to be in cache) is going to be way higher than
the cost of a couple extra instructions to do the unaligned access.

Regards,
    Jeff Davis




Re: Reduce TupleHashEntryData struct size by half

From: Heikki Linnakangas
On 18/11/2024 22:22, Jeff Davis wrote:
> On Mon, 2024-11-18 at 12:13 +0200, Heikki Linnakangas wrote:
>> Hmm, it would seem more straightforward to store it in the beginning,
>> i.e. have something like this:
>>
>> struct {
>>        void *additional;
>>        MinimalTupleData mtup;
>> } ;
> 
> That was my first approach, but it requires an additional memcpy,
> because ExecCopySlotMinimalTuple() does its own palloc. I used
> repalloc() because it will often have space at the end of the chunk
> anyway, and not need to memcpy(). Maybe that's not significant but it
> did seem detectable in some perf tests.
> 
> But perhaps we can go further and get rid of the "additional" pointer
> and inline the pergroup data and the grouping key tuple into the same
> palloc chunk? That would cut out a palloc and the 8 wasted bytes on the
> pointer.

Sounds like a good idea. Needs some changes to the TupleTableSlotOps 
interface to avoid the memcpy I presume.

>> Come to think of it, how important is it that we use MinimalTuple
>> here
>> at all? Some other representation could be faster to deal with in
>> TupleHashTableMatch() anyway.
> 
> What did you have in mind? That sounds like a good idea orthogonal to
> reducing the bucket size.

Queries that have only a small number of groups might benefit
from storing a plain Datum/isnull array instead of a MinimalTuple. That 
would take more memory when you have a lot of groups though.

> Alternatively, MinimalTuple is not very "minimal", and perhaps we can
> just make it better.

Yeah. It tries to be compatible with HeapTuple, but perhaps we should 
give up on that and pack it more tightly instead.

>>> 0004: Removes the "status" field from TupleHashEntryData, using
>>> firstTuple==NULL to mean "empty", otherwise "in use". Hack: need an
>>> additional "special" pointer value to mean "input slot" now that
>>> NULL
>>> means "empty".
>>
>> +1
> 
> For the FIRSTTUPLE_INPUTSLOT marker, do you think it's cleaner to use
> what I did:
> 
>    static const MinimalTuple FIRSTTUPLE_INPUTSLOT = (MinimalTuple) 0x1;
> 
> or something like:
> 
>    static MinimalTupleData dummy = {0};
>    static const MinimalTuple FIRSTTUPLE_INPUTSLOT = &dummy;
> 
> ?

I think I'd do "(MinimalTuple) 0x1" myself, but no strong opinion.

-- 
Heikki Linnakangas
Neon (https://neon.tech)



Re: Reduce TupleHashEntryData struct size by half

From: David Rowley
On Thu, 13 Feb 2025 at 14:01, Jeff Davis <pgsql@j-davis.com> wrote:
> Attached a new patchset that doesn't change the API for
> ExecCopySlotMinimalTuple(). Instead, I'm using
> ExecFetchSlotMinimalTuple(), which avoids the extra memcpy if the slot
> is TTSOpsMinimalTuple, which is what HashAgg uses. I separated out the
> change to use ExecFetchSlotMinimalTuple() into 0004 to illustrate the
> performance impact.

Some review comments:

* my_log2() takes a "long" parameter type but transitionSpace is a
"Size". These types aren't the same width on all platforms we support.
Maybe pg_nextpower2_size_t() is a better option.

* Should the following have MAXALIGN(tupleSize) on the right side of
the expression?

tupleChunkSize = tupleSize

I understand this was missing before, but bump.c does consume at
least MAXALIGN(<req_size>) in BumpAlloc().

* while (16 * maxBlockSize > work_mem * 1024L)

The 1024L predates the change made in 041e8b95. "1024L" needs to be
replaced with "(Size) 1024".

Maybe you could just replace the while loop and the subsequent "if" check with:

/*
 * Like CreateWorkExprContext(), use smaller sizings for smaller work_mem,
 * to avoid large jumps in memory usage.
 */
maxBlockSize = pg_prevpower2_size_t(work_mem * (Size) 1024 / 16);

/* But no bigger than ALLOCSET_DEFAULT_MAXSIZE */
maxBlockSize = Min(maxBlockSize, ALLOCSET_DEFAULT_MAXSIZE);

/* and no smaller than ALLOCSET_DEFAULT_INITSIZE */
maxBlockSize = Max(maxBlockSize, ALLOCSET_DEFAULT_INITSIZE);

I believe that gives the same result without the looping.

* In hash_create_memory(), can you get rid of the minContextSize and
initBlockSize variables?

* Is it worth an Assert() to check that additionalsize > 0?

/*
 * The amount of space available is the additionalsize requested in the call
 * to BuildTupleHashTable(). If additionalsize was specified as zero, no
 * additional space is available and this function should not be called.
 */
static inline void *
TupleHashEntryGetAdditional(TupleHashTable hashtable, TupleHashEntry entry)
{
	return (char *) entry->firstTuple - hashtable->additionalsize;
}

Benchmarks:

I was also experimenting with the performance of this using the
following test case:

create table c (a int not null);
insert into c select a from generate_Series(1,1000) a;
vacuum freeze analyze c;

Query: select a,count(*) from c group by a;

Average TPS over 10x 10 second runs with -M prepared

master: 3653.9853988
v7-0001: 3741.815979
v7-0001+0002: 3737.4313064
v7-0001+0002+0003: 3834.6271706
v7-0001+0002+0003+0004: 3912.1158887

I also tried out changing hash_agg_check_limits() so that it no longer
calls MemoryContextMemAllocated and instead uses ->mem_allocated
directly and with all the other patches got:

v7-0001+0002+0003+0004+extra: 4049.0732381

We probably shouldn't do exactly that as it'd be better not to access
that internal field from outside the memory context code.  A static
inline function in memutils.h that handles the non-recursive callers
might be nice.

I've attached my results of running your test in graph form. I also
see a small regression for these small scale tests. I'm wondering if
it's worth mocking up some code to see what the performance would be
like without the additional memcpy() that's new to
LookupTupleHashEntry_internal().  How about hacking something up that
adds an additionalsize field to TupleDesc and then set that field to
your additional size and have heap_form_minimal_tuple() allocate that
much extra memory?

David
