Re: More speedups for tuple deformation - Mailing list pgsql-hackers

From: David Rowley
Subject: Re: More speedups for tuple deformation
Msg-id: CAApHDvpuEbhvH1ViCZRz5vks+_bGbEnPoEdZYAZXK76_isb_+Q@mail.gmail.com
In response to: Re: More speedups for tuple deformation (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Sat, 31 Jan 2026 at 06:11, Andres Freund <andres@anarazel.de> wrote:
> This is why I like the idea of keeping track of whether we can rely on NOT
> NULL columns to be present (I think that means we're evaluating expressions
> other than constraint checks for new rows). It allows the leading NOT NULL
> fixed-width columns to be decoded without having to wait for a good chunk of
> the computations above. That's a performance boon even if we later have
> nullable or varlength columns.

I can look into this. As we both know, we can't apply this
optimisation in every case, as there are places in the code which form
and then deform tuples before NOT NULL constraints are checked.
Perhaps the slot can store a flag to indicate whether the optimisation
is valid to apply. It doesn't look like the flag can be part of the
TupleDesc since we cache those in relcache. I'm imagining that
TupleDescFinalize() calculates another field, which could be the last
attribute with a cached offset that has a NOT NULL constraint and
isn't attmissing. I think this will need another dedicated loop in
slot_deform_heap_tuple() to loop up to that attribute before doing the
firstNonCacheOffsetAttr loop. Maybe it's worth making that new loop
only handle byval types so that we can get rid of the byval branch
from fetch_att(). That should make it a bit faster at the expense of
not being able to handle fixed-width types that have an attlen > 8,
which I don't think are excessively common...  It's hard to know.
UUIDs do get used as primary key columns, and making the PK column the
first column in the table is a common pattern.
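
To make that a bit more concrete, here's a rough standalone sketch of
the shape of loop I have in mind. The LeadingAttr struct, its field
names and notnull_byval_count are made up for illustration, Datum is
modelled as uint64_t, and none of this is taken from an attached
patch:

#include <stdint.h>
#include <string.h>

/*
 * Standalone model only.  "notnull_byval_count" plays the role of a
 * value TupleDescFinalize() could precompute: the number of leading
 * attributes that are byval, NOT NULL, not attmissing and have a
 * cached offset.
 */
typedef struct
{
    int     cached_off;     /* byte offset of the attribute in the tuple */
    int     attlen;         /* 1, 2, 4 or 8 */
} LeadingAttr;

static void
deform_leading_notnull(const char *tp,  /* start of tuple data */
                       const LeadingAttr *atts,
                       int notnull_byval_count,
                       uint64_t *values,
                       uint8_t *isnull)
{
    /*
     * Every attribute here is known to be present, NOT NULL and byval,
     * so there's no null-bitmap check and no byval branch; just load
     * directly from the cached offset.
     */
    for (int i = 0; i < notnull_byval_count; i++)
    {
        const char *p = tp + atts[i].cached_off;

        switch (atts[i].attlen)
        {
            case 1:
                values[i] = *(const uint8_t *) p;
                break;
            case 2:
                {
                    uint16_t    v;

                    memcpy(&v, p, sizeof(v));
                    values[i] = v;
                }
                break;
            case 4:
                {
                    uint32_t    v;

                    memcpy(&v, p, sizeof(v));
                    values[i] = v;
                }
                break;
            default:            /* assume attlen == 8 */
                memcpy(&values[i], p, sizeof(uint64_t));
                break;
        }
        isnull[i] = 0;
    }
}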

Overall, I agree that this will likely speed up some of the tests, but
it will also likely further increase the overheads for the 0 extra
column tests.

> > > 3) I briefly experimented with this code, and I think we may be able to
> > >    optimize the combination of att_pointer_alignby(), fetch_att() and
> > >    att_addlength_pointer(). They all do quite related work, and for byvalue
> > >    types, we know at compile time what the alignment requirement for each of
> > >    the supported attlen is.
> >
> > Is this true? Isn't there some nearby discussion about AIX having
> > 4-byte double alignment?
>
> The AIX stuff is just bonkers. Having different alignment based on where in an
> aggregate a double is is just insane.
>
> We could probably just accommodate that with a compile-time ifdef.

I've not tried this yet. There's probably more than
align_fetch_then_add() that could benefit from the knowledge that
attalign is the same as attlen for byval types, but it's the only
place I'm currently aware of, and there the knowledge only saves
doing the attalign - 1 computation, which is only 1 cycle.

> > I've taken a go at implementing a function called align_fetch_then_add(),
> > which rolls all the macros into one (See 0004). I just can't see any
> > improvements with it. Maybe I've missed something that could be more
> > optimal. I did even ditch one of the cases from the switch(attlen). It might
> > be ok to do that now as we can check for invalid attlens for byval types
> > when we populate the CompactAttribute.
>
> You could move the TYPEALIGN(attalignby, *off) calls into the attlen switch
> and call it with a constant argument.

I think that would need the assumption you mentioned above that the
byval type's attlen defines the alignment requirements.
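
i.e. something like the shape below, where the alignment is a
compile-time constant in each switch case. This is only a standalone
sketch under that assumption (alignment == attlen for byval types),
with a local ALIGN_TO macro standing in for TYPEALIGN(), and it's not
taken from the 0004 patch:

#include <stdint.h>
#include <string.h>

/* round "off" up to a multiple of "a"; "a" must be a power of two */
#define ALIGN_TO(a, off) (((off) + ((a) - 1)) & ~((uintptr_t) ((a) - 1)))

/*
 * One combined "align, fetch, advance" step for byval types, assuming
 * the alignment requirement equals attlen.  With the constant folded
 * into each case, the compiler can use a plain mask rather than
 * computing (attalignby - 1) at runtime.
 */
static inline uint64_t
align_fetch_then_add(const char *base, uintptr_t *off, int attlen)
{
    uint64_t    value;

    switch (attlen)
    {
        case 1:
            value = *(const uint8_t *) (base + *off);
            *off += 1;
            break;
        case 2:
            {
                uint16_t    v;

                *off = ALIGN_TO(2, *off);
                memcpy(&v, base + *off, sizeof(v));
                value = v;
                *off += 2;
            }
            break;
        case 4:
            {
                uint32_t    v;

                *off = ALIGN_TO(4, *off);
                memcpy(&v, base + *off, sizeof(v));
                value = v;
                *off += 4;
            }
            break;
        default:                /* assume attlen == 8 */
            *off = ALIGN_TO(8, *off);
            memcpy(&value, base + *off, sizeof(value));
            *off += 8;
            break;
    }
    return value;
}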

> > > Have you experimented setting isnull[] in a dedicated loop if there are nulls
> > > and then in this loop just checking isnull[attnum]? Seems like that could
> > > perhaps be combined with the work in first_null_attr() and be more efficient
> > > than doing an att_isnull() separately for each column.
> >
> > Yes. I experiment with that quite a bit. I wasn't able to make it any
> > faster than setting the isnull element in the same loop as the
> > tts_values element. What I did try was having a dedicated tight loop
> > like; for (int i = attnum; i < firstNullAttr; i++) isnull[i] = false;,
> > but the compiler would always try to optimise that into an inlined
> > memset which would result in poorly performing code in cases with a
> > small number of columns due to the size and alignment prechecks.
>
> Yea, that kind of transformation is pretty annoying and makes little sense
> here :(.
>
> I was thinking of actually computing the value of isnull[] based on the null
> bitmap (as you also try below).

I've taken the code you posted in [1] to do this. Thanks for that. It
works very well. I made the tts_isnull array size round up to the next
multiple of 8; without that we'd overwrite the sentinel byte in the
palloc'd chunk when writing to tts_isnull in multiples of 8 elements.
With the rounded-up size, I can also make the !hasnulls loop zero the
memory 8 bytes at a time, which avoids the annoying memset inlining I
was getting. That seems to mostly fix the regression in the zero extra
column test. The t_hoff calculation replacement may have helped a bit
too; I didn't test it individually.
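
For the !hasnulls path, that's roughly the following shape, as a
standalone sketch (the clear_isnull name and the rounding arithmetic
here are just for illustration; the allocation side is assumed to have
rounded natts up the same way):

#include <stdint.h>
#include <string.h>

/*
 * Sketch of the !hasnulls path only.  Because the isnull array is
 * allocated with its size rounded up to a multiple of 8 entries, the
 * loop can store 8 zero bytes per iteration without risking a write
 * past the end of the allocation.
 */
static void
clear_isnull(uint8_t *isnull, int natts)
{
    int     natts_rounded = (natts + 7) & ~7;

    for (int i = 0; i < natts_rounded; i += 8)
    {
        uint64_t    zero = 0;

        memcpy(&isnull[i], &zero, sizeof(zero));
    }
}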

> > I spent a bit of time trying to figure out a way to do this without the
> > lookup table and only came up with a method that requires AVX512
> > instructions. I coded that up, but it requires building with
> > -march=x86-64-v4, which will likely cause many other reasons for the
> > performance to vary.
>
> Yea, I doubt we want that... Too many tuples will be too short to benefit.

It was only doing 8 columns at once, not 64. I think it's irrelevant
now with your 0x204081 trick.

I've revised the 0004 patch to use the 0x204081 trick. I also added
fetch_att_noerr(), which gets rid of the elog and assumes the default
switch case must be attlen == 8. I didn't modify the existing
function, as that will be used for attlen values that don't come from
CompactAttribute. I also got rid of the t_hoff usage. With those and
the new code that zeros the tts_isnull array 8 bytes at a time, things
are looking pretty good.
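
For reference, fetch_att_noerr() is roughly along these lines (a
sketch of the idea rather than the exact code in 0004, and it assumes
the usual postgres.h environment with SIZEOF_DATUM == 8):

/*
 * Like fetch_att() for byval types, but the attlen is trusted to have
 * come from a CompactAttribute, so the default case assumes
 * attlen == 8 instead of raising an elog(ERROR).
 */
static inline Datum
fetch_att_noerr(const void *T, int attlen)
{
    switch (attlen)
    {
        case 1:
            return CharGetDatum(*((const char *) T));
        case 2:
            return Int16GetDatum(*((const int16 *) T));
        case 4:
            return Int32GetDatum(*((const int32 *) T));
        default:
            /* anything else must be attlen == 8 */
            return Int64GetDatum(*((const int64 *) T));
    }
}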

I've attached 3 graphs, which are now looking a bit better. The gcc
results are not quite as good: there's still a small regression in the
0 extra column test, and overall the results are not as impressive as
clang's. I've not yet studied why. The Apple M2 machine averaged over
53% faster than master across all tests, or ~63% if you exclude the 0
extra column tests.

I'll look into adding another optional optimisation for the NOT NULL
columns. Attaching the updated patch in the meantime. This includes
the fix for the __builtin_ctz() issue mentioned by John.

David
