Re: More speedups for tuple deformation - Mailing list pgsql-hackers
| From | Andres Freund |
|---|---|
| Subject | Re: More speedups for tuple deformation |
| Msg-id | v6z545yozjtywghn5glujemu72z4i4ynadsc2xks4ejotdg7yl@4rry7ixwr4us |
| In response to | Re: More speedups for tuple deformation (David Rowley <dgrowleyml@gmail.com>) |
| Responses | Re: More speedups for tuple deformation |
| List | pgsql-hackers |
Hi,
On 2026-02-01 00:27:02 +1300, David Rowley wrote:
> On Sat, 31 Jan 2026 at 06:11, Andres Freund <andres@anarazel.de> wrote:
> > This is why I like the idea of keeping track of whether we can rely on NOT
> > NULL columns to be present (I think that means we're evaluating expressions
> > other than constraint checks for new rows). It allows the leading NOT NULL
> > fixed-width columns to be decoded without having to wait for a good chunk of
> > the computations above. That's a performance boon even if we later have
> > nullable or varlength columns.
>
> I can look into this. As we both know, we can't apply this
> optimisation in every case as there are places in the code which form
> then deform tuples before NOT NULL constraints are checked.
Right.
> Perhaps the slot can store a flag to mention if the optimisation is valid to
> apply or not. It doesn't look like the flag can be part of the TupleDesc
> since we cache those in relcache.
I wonder if we should do it the other way round - use a special flag (and
perhaps tuple descriptor) iff we are evaluating "unsanitized" tuples,
i.e. ones where the NOT NULLness might not yet be correct.
> I'm imagining that TupleDescFinalize() calculates another field which could
> be the max cached offset that's got a NOT NULL constraint and isn't
> attmissing. I think this will need another dedicated loop in
> slot_deform_heap_tuple() to loop up to that attribute before doing the
> firstNonCacheOffsetAttr loop.
I was imagining that we'd use the new value to enter the
firstNonCacheOffsetAttr loop without having to depend on
HeapTupleHeaderGetNatts() & HeapTupleHasNulls(), i.e. just use it to avoid
having to wait for the memory fetch of the tuple header to complete.
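The fast path being discussed might look roughly like the following standalone sketch. The types and names (`AttrInfoSketch`, `deform_notnull_prefix`) are illustrative stand-ins, not PostgreSQL's actual TupleDesc machinery, and all columns are assumed to be 4-byte fixed-width for simplicity:

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the per-attribute metadata with a precomputed offset. */
typedef struct
{
	int		attcacheoff;	/* cached byte offset within the tuple data */
	int		attlen;			/* fixed width in bytes (4 in this sketch) */
} AttrInfoSketch;

/*
 * Deform the leading run of NOT NULL fixed-width columns.  Because these
 * columns are known to be present and non-null, their offsets are known at
 * tuple-descriptor build time, so we can copy them out without first
 * consulting natts or the null bitmap in the tuple header.
 */
static void
deform_notnull_prefix(const char *tupdata, const AttrInfoSketch *atts,
					  int prefix_natts, int32_t *values)
{
	for (int i = 0; i < prefix_natts; i++)
		memcpy(&values[i], tupdata + atts[i].attcacheoff, atts[i].attlen);
}
```

The point of the sketch is only the data dependency: nothing in the loop reads the tuple header, so it can run as soon as the tuple data bytes arrive.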
> > > > Have you experimented setting isnull[] in a dedicated loop if there are nulls
> > > > and then in this loop just checking isnull[attnum]? Seems like that could
> > > > perhaps be combined with the work in first_null_attr() and be more efficient
> > > > than doing an att_isnull() separately for each column.
> > >
> > > Yes. I experimented with that quite a bit. I wasn't able to make it any
> > > faster than setting the isnull element in the same loop as the
> > > tts_values element. What I did try was having a dedicated tight loop
> > > like `for (int i = attnum; i < firstNullAttr; i++) isnull[i] = false;`,
> > > but the compiler would always try to optimise that into an inlined
> > > memset, which would result in poorly performing code in cases with a
> > > small number of columns due to the size and alignment prechecks.
> >
> > Yea, that kind of transformation is pretty annoying and makes little sense
> > here :(.
> >
> > I was thinking of actually computing the value of isnull[] based on the null
> > bitmap (as you also try below).
>
> I've taken the code you posted in [1] to do this. Thanks for that. It
> works very well.
Nice!
> I made it so the tts_isnull array size is rounded up to the next multiple of
> 8.
Right, that's what I assumed we'd need.
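A minimal standalone sketch of what the rounding buys: with isnull[] sized to a multiple of 8, each byte of the null bitmap can be expanded into 8 output bools with no tail-handling loop. This is an illustration under the heap-tuple convention that a set bit means the attribute is present (not null), not the actual patch code:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Expand the tuple's null bitmap into a bool-per-attribute array.
 * isnull[] is assumed to be sized to the next multiple of 8 attributes,
 * so writing a full 8 bools per input byte never overruns it.
 */
static void
populate_isnull_sketch(const uint8_t *nullbits, bool *isnull, int natts)
{
	for (int i = 0; i < natts; i += 8)
	{
		uint8_t		byte = nullbits[i / 8];

		for (int j = 0; j < 8; j++)
			isnull[i + j] = ((byte >> j) & 1) == 0; /* clear bit => NULL */
	}
}
```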
> I've attached 3 graphs, which are now looking a bit better. The gcc
> results are not quite as good. There's still a small regression in the
> 0-extra-columns test, and overall, the results are not as impressive as
> clang's. I've not yet studied why.
I suspect it's due to gcc thinking it'd be a good idea to vectorize the
loop. I saw that happening on godbolt.
Are your results better if you use

```c
#if defined(__clang__)
#define pg_nounroll _Pragma("clang loop unroll(disable)")
#define pg_novector _Pragma("clang loop vectorize(disable)")
#elif defined(__GNUC__)
#define pg_nounroll _Pragma("GCC unroll 0")
#define pg_novector _Pragma("GCC novector")
#else
#define pg_nounroll
#define pg_novector _Pragma("loop( no_vector )")
#endif
```

and put "pg_nounroll pg_novector" before the loop in populate_isnull_array()?
That improves both gcc and clang code generation substantially for me, but
with a considerably bigger improvement for gcc.
With vectorization & unrolling allowed:

| Compiler | Opt | Instructions |
|---|---|---|
| gcc | -O2 | 165 |
| clang | -O2 | 135 |
| gcc | -O3 | 532 |
| clang | -O3 | 135 |

Preventing vectorization & unrolling:

| Compiler | Opt | Instructions |
|---|---|---|
| gcc | -O2 | 26 |
| clang | -O2 | 25 |
| gcc | -O3 | 26 |
| clang | -O3 | 25 |
It's somewhat scary to see this level of code size increase in a case where
the compiler has no real information suggesting that vectorizing is
beneficial...
I'd expect it makes sense to combine the loop for first_null_attr() with the
one for populate_isnull_array(). It might also prevent gcc from trying to
vectorize the loop...
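One way such a combined loop might look, as a standalone sketch under the heap-tuple convention that a set bit means not-null (this is not the actual patch code, and the name is illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Fill isnull[] from the null bitmap and return the index of the first
 * NULL attribute (natts if there is none), in a single pass.  A combined
 * loop like this has a loop-carried dependency through first_null, which
 * may also discourage the compiler from vectorizing it.
 */
static int
first_null_and_fill_sketch(const uint8_t *nullbits, bool *isnull, int natts)
{
	int			first_null = natts;

	for (int i = 0; i < natts; i++)
	{
		bool		is_null = (nullbits[i >> 3] & (1u << (i & 7))) == 0;

		isnull[i] = is_null;
		if (is_null && first_null == natts)
			first_null = i;
	}
	return first_null;
}
```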
Greetings,
Andres Freund