Re: More speedups for tuple deformation - Mailing list pgsql-hackers

From Andres Freund
Subject Re: More speedups for tuple deformation
Date
Msg-id v6z545yozjtywghn5glujemu72z4i4ynadsc2xks4ejotdg7yl@4rry7ixwr4us
Whole thread Raw
In response to Re: More speedups for tuple deformation  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: More speedups for tuple deformation
List pgsql-hackers
Hi,

On 2026-02-01 00:27:02 +1300, David Rowley wrote:
> On Sat, 31 Jan 2026 at 06:11, Andres Freund <andres@anarazel.de> wrote:
> > This is why I like the idea of keeping track of whether we can rely on NOT
> > NULL columns to be present (I think that means we're evaluating expressions
> > other than constraint checks for new rows). It allows the leading NOT NULL
> > fixed-width columns to be decoded without having to wait for a good chunk of
> > the computations above. That's a performance boon even if we later have
> > nullable or varlength columns.
> 
> I can look into this. As we both know, we can't apply this
> optimisation in every case as there are places in the code which form
> then deform tuples before NOT NULL constraints are checked.

Right.


> Perhaps the slot can store a flag to mention if the optimisation is valid to
> apply or not. It doesn't look like the flag can be part of the TupleDesc
> since we cache those in relcache.

I wonder if we should do it the other way round - use a special flag (and
perhaps tuple descriptor) iff we are evaluating "unsanitizes" tuples,
i.e. ones where the NOT NULLness might not yet be correct.


> I'm imagining that TupleDescFinalize() calculates another field which could
> be the max cached offset that's got a NOT NULL constraint and isn't
> attmissing. I think this will need another dedicated loop in
> slot_deform_heap_tuple() to loop up to that attribute before doing the
> firstNonCacheOffsetAttr loop.

I was imagining that we'd use the new value to enter the
firstNonCacheOffsetAttr loop without having to depend on
HeapTupleHeaderGetNatts() & HeapTupleHasNulls(). I.e. just use it to avoid the
dependency on having to have completed the memory fetch for the header.



> > > > Have you experimented setting isnull[] in a dedicated loop if there are nulls
> > > > and then in this loop just checking isnull[attnum]? Seems like that could
> > > > perhaps be combined with the work in first_null_attr() and be more efficient
> > > > than doing an att_isnull() separately for each column.
> > >
> > > Yes. I experiment with that quite a bit. I wasn't able to make it any
> > > faster than setting the isnull element in the same loop as the
> > > tts_values element. What I did try was having a dedicated tight loop
> > > like; for (int i = attnum; i < firstNullAttr; i++) isnull[i] = false;,
> > > but the compiler would always try to optimise that into an inlined
> > > memset which would result in poorly performing code in cases with a
> > > small number of columns due to the size and alignment prechecks.
> >
> > Yea, that kind of transformation is pretty annoying and makes little sense
> > here :(.
> >
> > I was thinking of actually computing the value of isnull[] based on the null
> > bitmap (as you also try below).
> 
> I've taken the code you posted in [1] to do this. Thanks for that. It
> works very well.

Nice!


> I made it so the tts_isnull array size is rounded up to the next multiple of
> 8.

Right, that's what I assumed we'd need.


> I've attached 3 graphs, which are now looking a bit better. The gcc
> results are not quite as good. There's still a small regression with 0
> extra column test, and overall, the results are not as impressive as
> clang's. I've not yet studied why.

I suspect it's due to gcc thinking it'd be a good idea to vectorize the
loop. I saw that happening on godbolt.

Are your results better if you use

#if defined(__clang__)
#define pg_nounroll _Pragma("clang loop unroll(disable)")
#define pg_novector _Pragma("clang loop vectorize(disable)")
#elif defined(__GNUC__)
#define pg_nounroll _Pragma("GCC unroll 0")
#define pg_novector _Pragma("GCC novector")
#else
#define pg_nounroll
#define pg_novector _Pragma("loop( no_vector )")
#endif

and put "pg_nounroll pg_novector" before the loop in populate_isnull_array()?
That improves both gcc and clang code generation substantially for me, but
with a considerably bigger improvement for gcc.

Compiler   Opt      Isns
gcc        -O2      165
clang      -O2      135
gcc        -O3      532
clang      -O3      135

Preventing vectorization & unrolling:
gcc        -O2      26
clang      -O2      25
gcc        -O3      26
clang      -O3      25


It's somewhat scary to see this level of code size increase in a case where
the compiler really has no information to think vectorizing really is
beneficial...



I'd expect it makes sense to combine the loop for first_null_attr() with the
one for populate_isnull_array().  It might also prevent gcc from trying to
vectorize the loop...


Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Paul A Jungwirth
Date:
Subject: Re: CREATE OR REPLACE MATERIALIZED VIEW
Next
From: "Euler Taveira"
Date:
Subject: Re: Change default of jit to off