Re: More speedups for tuple deformation - Mailing list pgsql-hackers
| From | David Rowley |
|---|---|
| Subject | Re: More speedups for tuple deformation |
| Date | |
| Msg-id | CAApHDvpuEbhvH1ViCZRz5vks+_bGbEnPoEdZYAZXK76_isb_+Q@mail.gmail.com |
| In response to | Re: More speedups for tuple deformation (Andres Freund <andres@anarazel.de>) |
| List | pgsql-hackers |
On Sat, 31 Jan 2026 at 06:11, Andres Freund <andres@anarazel.de> wrote:
> This is why I like the idea of keeping track of whether we can rely on NOT
> NULL columns to be present (I think that means we're evaluating expressions
> other than constraint checks for new rows). It allows the leading NOT NULL
> fixed-width columns to be decoded without having to wait for a good chunk of
> the computations above. That's a performance boon even if we later have
> nullable or varlength columns.

I can look into this. As we both know, we can't apply this optimisation in
every case, as there are places in the code which form and then deform tuples
before NOT NULL constraints are checked. Perhaps the slot could store a flag
to indicate whether the optimisation is valid to apply. It doesn't look like
the flag can be part of the TupleDesc since we cache those in relcache.

I'm imagining that TupleDescFinalize() calculates another field which could
be the max cached offset that has a NOT NULL constraint and isn't attmissing.
I think this will need another dedicated loop in slot_deform_heap_tuple() to
loop up to that attribute before doing the firstNonCacheOffsetAttr loop.
Maybe it's worth making that new loop only handle byval types so that we can
get rid of the byval branch from fetch_att(). That should make it a bit
faster at the expense of not being able to handle fixed-width types that have
an attlen > 8, which I don't think are excessively common... It's hard to
know. UUIDs do get used as primary key columns, and making the PK column the
first column in the table is a common pattern to follow.

Overall, I agree that this will likely speed up some of the tests, but it
will also likely further increase the overheads for the 0 extra column tests.

> > > 3) I briefly experimented with this code, and I think we may be able to
> > > optimize the combination of att_pointer_alignby(), fetch_att() and
> > > att_addlength_pointer(). They all do quite related work, and for byvalue
> > > types, we know at compile time what the alignment requirement for each
> > > of the supported attlen is.
> >
> > Is this true? Isn't there some nearby discussion about AIX having
> > 4-byte double alignment?
>
> The AIX stuff is just bonkers. Having different alignment based on where in
> an aggregate a double is is just insane.
>
> We could probably just accommodate that with a compile-time ifdef.

I've not tried this yet. There's probably more than align_fetch_then_add()
that could benefit from the knowledge that attalign is the same as attlen for
byval types. align_fetch_then_add() is the only one I'm currently aware of,
and that only saves doing attalign - 1, which only saves 1 cycle.

> > I've taken a go at implementing a function called align_fetch_then_add(),
> > which rolls all the macros into one (See 0004). I just can't see any
> > improvements with it. Maybe I've missed something that could be more
> > optimal. I did even ditch one of the cases from the switch(attlen). It
> > might be ok to do that now as we can check for invalid attlens for byval
> > types when we populate the CompactAttribute.
>
> You could move the TYPEALIGN(attalignby, *off) calls into the attlen switch
> and call it with a constant argument.

I think that would need the assumption you mentioned above that the byval
type's attlen defines the alignment requirements.
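To make that concrete, the rough shape I'm picturing is something like the
sketch below. This is a simplified, standalone illustration rather than the
patch code: the function name, the ALIGN_OFFSET macro and the Datum stand-in
are just for the example, and it assumes the tuple data pointer itself is
maxaligned so the casts are safe.

```c
#include <stdint.h>

/* Illustrative stand-in for Datum; the real definition lives in postgres.h. */
typedef uint64_t Datum;

/* Round "off" up to a power-of-two alignment, in the style of TYPEALIGN. */
#define ALIGN_OFFSET(off, alignval) \
	(((off) + ((uintptr_t) (alignval) - 1)) & ~((uintptr_t) (alignval) - 1))

/*
 * Align the offset, fetch a byval attribute and advance the offset, all in
 * one switch on attlen.  Because a byval type's attlen implies its alignment,
 * each case can align with a compile-time constant.  The default case assumes
 * attlen == 8, relying on invalid byval attlens being rejected earlier.
 */
static inline Datum
byval_align_fetch_then_add(const char *tupdata, int attlen, uintptr_t *off)
{
	Datum		value;

	switch (attlen)
	{
		case 1:
			value = *(const uint8_t *) (tupdata + *off);
			*off += 1;
			break;
		case 2:
			*off = ALIGN_OFFSET(*off, 2);
			value = *(const uint16_t *) (tupdata + *off);
			*off += 2;
			break;
		case 4:
			*off = ALIGN_OFFSET(*off, 4);
			value = *(const uint32_t *) (tupdata + *off);
			*off += 4;
			break;
		default:
			*off = ALIGN_OFFSET(*off, 8);
			value = *(const uint64_t *) (tupdata + *off);
			*off += 8;
			break;
	}
	return value;
}
```

With the alignment value a compile-time constant in each case, the
attalign - 1 arithmetic folds away, which is the (small) saving mentioned
above.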
> > > Have you experimented setting isnull[] in a dedicated loop if there are
> > > nulls and then in this loop just checking isnull[attnum]? Seems like
> > > that could perhaps be combined with the work in first_null_attr() and
> > > be more efficient than doing an att_isnull() separately for each column.
> >
> > Yes. I experimented with that quite a bit. I wasn't able to make it any
> > faster than setting the isnull element in the same loop as the
> > tts_values element. What I did try was having a dedicated tight loop
> > like: for (int i = attnum; i < firstNullAttr; i++) isnull[i] = false;,
> > but the compiler would always try to optimise that into an inlined
> > memset which would result in poorly performing code in cases with a
> > small number of columns due to the size and alignment prechecks.
>
> Yea, that kind of transformation is pretty annoying and makes little sense
> here :(.
>
> I was thinking of actually computing the value of isnull[] based on the null
> bitmap (as you also try below).

I've taken the code you posted in [1] to do this. Thanks for that. It works
very well. I made it so the tts_isnull array size is rounded up to the next
multiple of 8. Without that, we'd overwrite the sentinel byte in the palloc'd
chunk if we assumed we can always write to tts_isnull in multiples of 8
elements.

Because I've made tts_isnull round up to the next 8 elements, I can make the
!hasnulls loop zero the memory 8 bytes at a time. That solves the annoying
memset inlining I was getting, and it seems to mostly fix the regression in
the zero extra column test. The t_hoff calculation replacement maybe helped a
bit too; I didn't test individually.

> > I spent a bit of time trying to figure out a way to do this without the
> > lookup table and only came up with a method that requires AVX512
> > instructions. I coded that up, but it requires building with
> > -march=x86-64-v4, which will likely cause many other reasons for the
> > performance to vary.
>
> Yea, I doubt we want that... Too many tuples will be too short to benefit.

It was only doing 8 columns at once, not 64. I think it's irrelevant now with
your 0x204081 trick.

I've revised the 0004 patch to use the 0x204081 trick. I also added
fetch_att_noerr(), which gets rid of the elog and assumes the default switch
case must be attlen == 8. I didn't modify the existing function as that will
still be used for attlen values that don't come from CompactAttribute. I also
got rid of the t_hoff usage. With those and the new code that zeros the
tts_isnull array 8 bytes at a time, things are looking pretty good.

I've attached 3 graphs, which are now looking a bit better. The gcc results
are not quite as good: there's still a small regression with the 0 extra
column test, and overall the results are not as impressive as clang's. I've
not yet studied why. The Apple M2 machine averaged over 53% faster than
master over all tests, or ~63% if you exclude the 0 extra column tests.

I'll look into adding another optional optimisation for the NOT NULL columns.
Attaching the updated patch in the meantime. This includes the fix for the
__builtin_ctz() issue mentioned by John.

David
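PS: for anyone who'd rather not dig through the patch to see what's meant by
the 0x204081 trick, here's a rough standalone sketch of the idea; it's
illustrative only and not the patch code. The helper name is made up, it
assumes a little-endian machine and sizeof(bool) == 1, and it relies on the
isnull array having been rounded up to a multiple of 8 entries as described
above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Expand one byte of a heap tuple's null bitmap into 8 bool isnull flags.
 *
 * In the heap format a set bit means the attribute is present (not NULL),
 * so the complement of the bitmap byte is expanded.  Multiplying a nibble
 * by 0x204081 places copies of it at bit offsets 0, 7, 14 and 21; masking
 * with 0x01010101 then leaves bit i of the nibble in the low bit of byte i.
 * Handling the two nibbles separately keeps the partial products from
 * carrying into one another.
 */
static inline void
expand_null_bitmap_byte(uint8_t bitmap_byte, bool *isnull)
{
	uint8_t		nulls = (uint8_t) ~bitmap_byte;	/* 1 = column is NULL */
	uint64_t	lo = (uint64_t) (nulls & 0x0F) * 0x204081;
	uint64_t	hi = (uint64_t) (nulls >> 4) * 0x204081;
	uint64_t	flags = (lo | (hi << 32)) & UINT64_C(0x0101010101010101);

	/* one 8-byte store per bitmap byte; assumes little-endian layout */
	memcpy(isnull, &flags, sizeof(flags));
}
```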