Thread: Make tuple deformation faster
Currently, TupleDescData contains the descriptor's attributes in a variable-length array of FormData_pg_attribute allocated within the same allocation as the TupleDescData. According to my IDE, sizeof(FormData_pg_attribute) == 104 bytes. It's that large mainly due to attname being 64 bytes. The TupleDescData.attrs[] array could end up quite large on tables with many columns, and that could result in poor CPU cache hit ratios when deforming tuples.

Instead, we could make TupleDescData contain an out-of-line pointer to the array of FormData_pg_attribute and have a much more compact inlined array of some other struct that much more densely contains the fields required for tuple deformation. attname and many of the other fields are not required to deform a tuple.

I've attached a patch series which does this.

0001: Just fixes up some missing usages of TupleDescAttr(). (mostly missed by me, apparently :-( )
0002: Adjusts the TupleDescData.attrs array to make it out of line. I wanted to make sure nothing weird happened by doing this before doing the bulk of the other changes to add the new struct.
0003: Adds a very compact 8-byte struct named TupleDescDeformAttr, which can be used for tuple deformation. 8 columns fit on a single 64-byte cacheline rather than spanning 13 cachelines.
0004: Adjusts the attalign to change it from char to uint8. See below.

The 0004 patch changes the TupleDescDeformAttr.attalign to a uint8 rather than a char containing 'c', 's', 'i' or 'd'. This allows much simpler code in the att_align_nominal() macro. What's in master is quite a complex expression to evaluate every time we deform a column, as it must translate: 'c' -> 1, 's' -> 2, 'i' -> 4, 'd' -> 8. If we just store that numeric value in the struct, that macro can become a simple TYPEALIGN(), so the operation becomes simple bit masking rather than a poorly branch-predictable series of compares and jumps.

The state of this patch series is "proof of concept". I think the current state should be enough to get an idea of the rough amount of code churn this change would cause and also an idea of the expected performance of the change. It certainly isn't in a finished state. I've not put much effort into updating comments or looking at READMEs to see what's now outdated.

I also went with packing a bunch of additional boolean columns from pg_attribute so they each take up just 1 bit of space in the attflags field in the new struct. I've not tested the performance of expanding these out so they use 1 bool field each; that would make the struct bigger than 8 bytes. Having the struct be a power-of-2 size is also beneficial as it allows fast bit-shifting to be used to get the array element address rather than a more complex (and slower) LEA instruction. I could try making the struct 16 bytes and see if there are any further wins by avoiding the bitwise AND on the TupleDescDeformAttr.attflags field.

To test the performance of this, I tried using the attached script, which creates a table where the first column is a variable-length column and the final column is an int. The query I ran to test the performance inserted 1 million rows into this table and performed a sum() on the final column. The attached graph shows that the query is 30% faster than master with 15 columns between the first and last column. For fewer columns, the speedup is less. This is quite a deform-heavy query, so it's not like it speeds up every table with that column arrangement by 30%, but certainly some queries could see that much gain, and even more seems possible.
I didn't go to a great deal of trouble to find the most deform-heavy workload.

I'll stick this in the July CF. It would be good to get some feedback on the idea and feedback on whether more work on this is worthwhile.

As mentioned, the 0001 patch just fixes up the missing usages of the TupleDescAttr() macro. I see no reason not to commit this now.

Thanks

David
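To make the proposal above concrete, here is a minimal sketch of the kind of 8-byte per-attribute struct being described. The field choice and layout are inferred from this email and from later messages in the thread, not copied from the patch itself; only fields the deforming loop actually needs are kept, so eight attributes share one 64-byte cacheline.

    /* Illustrative sketch only -- the real patch's layout may differ. */
    typedef struct TupleDescDeformAttr
    {
        int32   attcacheoff;    /* cached offset into the tuple, or -1 */
        int16   attlen;         /* length in bytes, or -1 (varlena) / -2 (cstring) */
        uint8   attalign;       /* alignment as a byte count: 1, 2, 4 or 8 (patch 0004) */
        uint8   attflags;       /* attbyval, attnotnull, etc. packed as bits */
    } TupleDescDeformAttr;      /* 4 + 2 + 1 + 1 = 8 bytes */

For comparison, with sizeof(FormData_pg_attribute) == 104, the deform-relevant data for a 16-column table spans 16 * 104 = 1,664 bytes (26 cachelines) today, versus 16 * 8 = 128 bytes (2 cachelines) with a struct like the above.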
David Rowley <dgrowleyml@gmail.com> writes:

> Currently, TupleDescData contains the descriptor's attributes in a
> variable-length array of FormData_pg_attribute allocated within the
> same allocation as the TupleDescData. According to my IDE,
> sizeof(FormData_pg_attribute) == 104 bytes. It's that large mainly due
> to attname being 64 bytes. The TupleDescData.attrs[] array could end
> up quite large on tables with many columns, and that could result in
> poor CPU cache hit ratios when deforming tuples.
> ...
>
> To test the performance of this, I tried using the attached script,
> which creates a table where the first column is a variable-length
> column and the final column is an int. The query I ran to test the
> performance inserted 1 million rows into this table and performed a
> sum() on the final column. The attached graph shows that the query is
> 30% faster than master with 15 columns between the first and last
> column.

Yet another wonderful optimization! I just want to know how you found this optimization (CPU cache hits) and decided it was worth some time, because before we invest our time in optimizing something, it is better to know that we can get some measurable improvement for the time spent. As for this case, 30% is really a huge number, even if it is an artificial case.

Another case is when Andrew introduced NullableDatum 5 years ago and said using it in TupleTableSlot could be CPU-cache friendly. I can follow that, but how much can it improve in an ideal case? Is it possible to forecast that somehow? I ask it here because both cases are optimizing for the CPU cache.

--
Best Regards
Andy Fan
On Mon, 1 Jul 2024 at 21:17, Andy Fan <zhihuifan1213@163.com> wrote:
> Yet another wonderful optimization! I just want to know how you found
> this optimization (CPU cache hits) and decided it was worth some time,
> because before we invest our time in optimizing something, it is better
> to know that we can get some measurable improvement for the time spent.
> As for this case, 30% is really a huge number, even if it is an
> artificial case.
>
> Another case is when Andrew introduced NullableDatum 5 years ago and
> said using it in TupleTableSlot could be CPU-cache friendly. I can
> follow that, but how much can it improve in an ideal case? Is it
> possible to forecast that somehow? I ask it here because both cases are
> optimizing for the CPU cache.

Have a look at:

perf stat --pid=<backend pid>

On my AMD Zen4 machine running the 16 extra column test from the script in my last email, I see:

$ echo master && perf stat --pid=389510 sleep 10
master

 Performance counter stats for process id '389510':

          9990.65 msec task-clock:u              #    0.999 CPUs utilized
                0      context-switches:u        #    0.000 /sec
                0      cpu-migrations:u          #    0.000 /sec
                0      page-faults:u             #    0.000 /sec
      49407204156      cycles:u                  #    4.945 GHz
         18529494      stalled-cycles-frontend:u #    0.04% frontend cycles idle
          8505168      stalled-cycles-backend:u  #    0.02% backend cycles idle
     165442142326      instructions:u            #    3.35  insn per cycle
                                                 #    0.00  stalled cycles per insn
      39409877343      branches:u                #    3.945 G/sec
        146350275      branch-misses:u           #    0.37% of all branches

      10.001012132 seconds time elapsed

$ echo patched && perf stat --pid=380216 sleep 10
patched

 Performance counter stats for process id '380216':

          9989.14 msec task-clock:u              #    0.998 CPUs utilized
                0      context-switches:u        #    0.000 /sec
                0      cpu-migrations:u          #    0.000 /sec
                0      page-faults:u             #    0.000 /sec
      49781280456      cycles:u                  #    4.984 GHz
         22922276      stalled-cycles-frontend:u #    0.05% frontend cycles idle
         24259785      stalled-cycles-backend:u  #    0.05% backend cycles idle
     213688149862      instructions:u            #    4.29  insn per cycle
                                                 #    0.00  stalled cycles per insn
      44147675129      branches:u                #    4.420 G/sec
         14282567      branch-misses:u           #    0.03% of all branches

      10.005034271 seconds time elapsed

You can see the branch predictor has done a *much* better job in the patched code vs master with about 10x fewer misses. This should have helped contribute to the "insn per cycle" increase. 4.29 is quite good for postgres. I often see that around 0.5. According to [1] (relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the micro-op cache". I'm unsure how micro-ops translate to "insn per cycle" that's shown in perf stat. I thought 4-5 was about the maximum pipeline size from today's era of CPUs. Maybe someone else can explain better than I can. In more simple terms, generally, the higher the "insn per cycle", the better. Also, the lower all of the idle and branch-miss percentages are, the better. However, you'll notice that the patched version has more frontend and backend stalls. I assume this is due to performing more instructions per cycle from improved branch prediction causing memory and instruction stalls to occur more frequently; effectively (I think) it's just hitting the next bottleneck(s) - memory and instruction decoding. At least, modern CPUs should be able to out-pace RAM in many workloads, so perhaps it's not that surprising that "backend cycles idle" has gone up due to such a large increase in instructions per cycle from improved branch prediction.

It would be nice to see this tested on some modern Intel CPU.
A 13th or 14th series, for example, or even any Intel from the past 5 years would be better than nothing.

David

[1] https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-frontend-and-execution-engine/
On Mon, 1 Jul 2024 at 10:56, David Rowley <dgrowleyml@gmail.com> wrote:
>
> Currently, TupleDescData contains the descriptor's attributes in a
> variable-length array of FormData_pg_attribute allocated within the
> same allocation as the TupleDescData. According to my IDE,
> sizeof(FormData_pg_attribute) == 104 bytes. It's that large mainly due
> to attname being 64 bytes. The TupleDescData.attrs[] array could end
> up quite large on tables with many columns, and that could result in
> poor CPU cache hit ratios when deforming tuples.
>
> Instead, we could make TupleDescData contain an out-of-line pointer to
> the array of FormData_pg_attribute and have a much more compact
> inlined array of some other struct that much more densely contains the
> fields required for tuple deformation. attname and many of the other
> fields are not required to deform a tuple.

+1

> I've attached a patch series which does this.
>
> 0001: Just fixes up some missing usages of TupleDescAttr(). (mostly
> missed by me, apparently :-( )
> 0002: Adjusts the TupleDescData.attrs array to make it out of line. I
> wanted to make sure nothing weird happened by doing this before doing
> the bulk of the other changes to add the new struct.
> 0003: Adds a very compact 8-byte struct named TupleDescDeformAttr,
> which can be used for tuple deformation. 8 columns fit on a single
> 64-byte cacheline rather than spanning 13 cachelines.

Cool, that's similar to, but even better than, my patch from 2021 over at [0].

One thing I'm slightly concerned about is that this allocates another 8 bytes for each attribute in the tuple descriptor. While that's not a lot when compared with the ->attrs array, it's still quite a lot when we might not care at all about this data; e.g. in temporary tuple descriptors during execution, in intermediate planner nodes.

Did you test for performance gains (or losses) with an out-of-line TupleDescDeformAttr array? One benefit from this would be that we could reuse the deform array for suffix-truncated TupleDescs, reuse of which currently would require temporarily updating TupleDesc->natts with a smaller value; but with out-of-line ->attrs and ->deform_attrs, we could reuse these arrays between TupleDescs if one is shorter than the other, but has otherwise fully matching attributes. I know that btree split code would benefit from this, as it wouldn't have to construct a full new TupleDesc when it creates a suffix-truncated tuple during page splits.

> 0004: Adjusts the attalign to change it from char to uint8. See below.
>
> The 0004 patch changes the TupleDescDeformAttr.attalign to a uint8
> rather than a char containing 'c', 's', 'i' or 'd'. This allows much
> simpler code in the att_align_nominal() macro. What's in master is
> quite a complex expression to evaluate every time we deform a column,
> as it must translate: 'c' -> 1, 's' -> 2, 'i' -> 4, 'd' -> 8. If we
> just store that numeric value in the struct, that macro can become a
> simple TYPEALIGN(), so the operation becomes simple bit masking rather
> than a poorly branch-predictable series of compares and jumps.

+1, that's something I'd missed in my patches, and is probably the largest contributor to the speedup.

> I'll stick this in the July CF. It would be good to get some feedback
> on the idea and feedback on whether more work on this is worthwhile.

Do you plan to remove the ->attcacheoff catalog field from the FormData_pg_attribute, now that (with your patch) it isn't used anymore as a placeholder field for fast (de)forming of tuples?
Kind regards,

Matthias van de Meent

[0] https://www.postgresql.org/message-id/CAEze2Wh8-metSryZX_Ubj-uv6kb%2B2YnzHAejmEdubjhmGusBAg%40mail.gmail.com
On Mon, 1 Jul 2024 at 22:07, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> Cool, that's similar to, but even better than, my patch from 2021 over at [0].

I'll have a read of that. Thanks for pointing it out.

> One thing I'm slightly concerned about is that this allocates another
> 8 bytes for each attribute in the tuple descriptor. While that's not a
> lot when compared with the ->attrs array, it's still quite a lot when
> we might not care at all about this data; e.g. in temporary tuple
> descriptors during execution, in intermediate planner nodes.

I've not done it in the patch, but one way to get some of that back is to ditch pg_attribute.attcacheoff. There's no need for it after this patch. That's only 4 out of 8 bytes, however.

I think in most cases, due to FormData_pg_attribute being so huge, the aset.c power-of-2 round-up behaviour will be unlikely to cross a power-of-2 boundary. The following demonstrates which column counts actually make a difference:

select c as n_cols,old_bytes, new_bytes from
  (select c,24+104*c as old_bytes, 32+100*c+8*c as new_bytes
   from generate_series(1,1024) c)
where position('1' in old_bytes::bit(32)::text) != position('1' in new_bytes::bit(32)::text);

That returns just 46 column counts out of 1024 where we cross a power-of-2 boundary with the patched code that we didn't cross in master. Of course, larger pallocs will result in a malloc() directly, so perhaps that's not a good measure. At least for smaller column counts it should be mainly the same amount of memory used. There are only 6 rows in there for column counts below 100.

I think if we were worried about memory there are likely 100 other things we could do to reclaim some. It would only take some shuffling of fields in RelationData. I count 50 bytes of holes in that struct out of the 488 bytes. There are probably a few that could be moved without upsetting the struct-field-order-lords too much.

> Did you test for performance gains (or losses) with an out-of-line
> TupleDescDeformAttr array? One benefit from this would be that we
> could reuse the deform array for suffix-truncated TupleDescs, reuse of
> which currently would require temporarily updating TupleDesc->natts
> with a smaller value; but with out-of-line ->attrs and ->deform_attrs,
> we could reuse these arrays between TupleDescs if one is shorter than
> the other, but has otherwise fully matching attributes. I know that
> btree split code would benefit from this, as it wouldn't have to
> construct a full new TupleDesc when it creates a suffix-truncated
> tuple during page splits.

No, but it sounds easy to test, as patch 0002 moves that out of line and does nothing else.

> > 0004: Adjusts the attalign to change it from char to uint8. See below.
> >
> > The 0004 patch changes the TupleDescDeformAttr.attalign to a uint8
> > rather than a char containing 'c', 's', 'i' or 'd'. This allows much
> > simpler code in the att_align_nominal() macro. What's in master is
> > quite a complex expression to evaluate every time we deform a column,
> > as it must translate: 'c' -> 1, 's' -> 2, 'i' -> 4, 'd' -> 8. If we
> > just store that numeric value in the struct, that macro can become a
> > simple TYPEALIGN(), so the operation becomes simple bit masking rather
> > than a poorly branch-predictable series of compares and jumps.
>
> +1, that's something I'd missed in my patches, and is probably the
> largest contributor to the speedup.
I think so too, and I did consider if we should try and do that to pg_attribute, renaming the column to attalignby. I started but didn't finish a patch for that.

> > I'll stick this in the July CF. It would be good to get some feedback
> > on the idea and feedback on whether more work on this is worthwhile.
>
> Do you plan to remove the ->attcacheoff catalog field from the
> FormData_pg_attribute, now that (with your patch) it isn't used
> anymore as a placeholder field for fast (de)forming of tuples?

Yes, I plan to do that once I get more confidence I'm on to a winner here.

Thanks for having a look at this.

David
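To make the round-up arithmetic in the query above concrete (my numbers, derived from the same formulas, so illustrative only): at 16 columns, master's allocation is 24 + 104*16 = 1,688 bytes and the patched one is 32 + 108*16 = 1,760 bytes; aset.c rounds both up to the same 2,048-byte chunk, so no extra memory is actually consumed. 19 columns is one of the few counts below 100 that does cross a boundary: 24 + 104*19 = 2,000 bytes still fits a 2,048-byte chunk, while 32 + 108*19 = 2,084 bytes rounds up to 4,096.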
On Mon, 1 Jul 2024 at 12:49, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Mon, 1 Jul 2024 at 22:07, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > One thing I'm slightly concerned about is that this allocates another
> > 8 bytes for each attribute in the tuple descriptor. While that's not a
> > lot when compared with the ->attrs array, it's still quite a lot when
> > we might not care at all about this data; e.g. in temporary tuple
> > descriptors during execution, in intermediate planner nodes.
>
> I've not done it in the patch, but one way to get some of that back is
> to ditch pg_attribute.attcacheoff. There's no need for it after this
> patch. That's only 4 out of 8 bytes, however.

FormData_pg_attribute as a C struct has 4-byte alignment; AFAIK it doesn't have any fields that require 8-byte alignment? Only on 64-bit systems we align the tuples on pages with 8-byte alignment, but in-memory arrays of the struct wouldn't have to deal with that, AFAIK.

> I think in most cases, due to FormData_pg_attribute being so huge, the
> aset.c power-of-2 round-up behaviour will be unlikely to cross a
> power-of-2 boundary.

ASet isn't the only allocator, but default enough for this to make sense, yes.

> The following demonstrates which column counts actually make a
> difference:
>
> select c as n_cols,old_bytes, new_bytes from
>   (select c,24+104*c as old_bytes, 32+100*c+8*c as new_bytes
>    from generate_series(1,1024) c)
> where position('1' in old_bytes::bit(32)::text) != position('1' in new_bytes::bit(32)::text);
>
> That returns just 46 column counts out of 1024 where we cross a
> power-of-2 boundary with the patched code that we didn't cross in
> master. Of course, larger pallocs will result in a malloc() directly,
> so perhaps that's not a good measure. At least for smaller column
> counts it should be mainly the same amount of memory used. There are
> only 6 rows in there for column counts below 100.
>
> I think if we were worried about memory there are likely 100 other
> things we could do to reclaim some. It would only take some shuffling
> of fields in RelationData. I count 50 bytes of holes in that struct out
> of the 488 bytes. There are probably a few that could be moved without
> upsetting the struct-field-order-lords too much.

I'd love for RelationData to be split into IndexRelation, TableRelation, ForeignTableRelation, etc., as there's a lot of wastage caused by exclusive fields, too.

> > > 0004: Adjusts the attalign to change it from char to uint8. See below.
> > >
> > > The 0004 patch changes the TupleDescDeformAttr.attalign to a uint8
> > > rather than a char containing 'c', 's', 'i' or 'd'. This allows much
> > > simpler code in the att_align_nominal() macro. What's in master is
> > > quite a complex expression to evaluate every time we deform a column,
> > > as it must translate: 'c' -> 1, 's' -> 2, 'i' -> 4, 'd' -> 8. If we
> > > just store that numeric value in the struct, that macro can become a
> > > simple TYPEALIGN(), so the operation becomes simple bit masking rather
> > > than a poorly branch-predictable series of compares and jumps.
> >
> > +1, that's something I'd missed in my patches, and is probably the
> > largest contributor to the speedup.
>
> I think so too, and I did consider if we should try and do that to
> pg_attribute, renaming the column to attalignby. I started but didn't
> finish a patch for that.
I'm not sure we have a pg_type entry that supports numeric attalign values without increasing the size of the field, as the one single-byte integer-like type (char) is always used as a printable character, and is implied to always be printable through its usage in e.g. the nodeToString infrastructure.

I'd love to have a better option, but we don't seem to have one yet.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
On Mon, 1 Jul 2024 at 23:42, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>
> On Mon, 1 Jul 2024 at 12:49, David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > On Mon, 1 Jul 2024 at 22:07, Matthias van de Meent
> > <boekewurm+postgres@gmail.com> wrote:
> > > One thing I'm slightly concerned about is that this allocates another
> > > 8 bytes for each attribute in the tuple descriptor. While that's not a
> > > lot when compared with the ->attrs array, it's still quite a lot when
> > > we might not care at all about this data; e.g. in temporary tuple
> > > descriptors during execution, in intermediate planner nodes.
> >
> > I've not done it in the patch, but one way to get some of that back is
> > to ditch pg_attribute.attcacheoff. There's no need for it after this
> > patch. That's only 4 out of 8 bytes, however.
>
> FormData_pg_attribute as a C struct has 4-byte alignment; AFAIK it
> doesn't have any fields that require 8-byte alignment? Only on 64-bit
> systems we align the tuples on pages with 8-byte alignment, but
> in-memory arrays of the struct wouldn't have to deal with that, AFAIK.

Yeah, 4-byte alignment. The "out of 8 bytes" I was talking about is sizeof(TupleDescDeformAttr), which I believe is the same "another 8 bytes" you had mentioned. What I meant was that deleting attcacheoff only reduces FormData_pg_attribute by 4 bytes per column and adding TupleDescDeformAttr adds 8 per column, so we still use 4 more bytes per column with the patch.

I really doubt the 4 bytes of extra memory is a big concern here. It would be more concerning for a patch that wanted to do something like change NAMEDATALEN to 128, but I think the main concern with that would be even slower tuple deforming. Additional memory would also be concerning, but I doubt that's more important than the issue of making all queries slower due to slower tuple deformation, which is what such a patch would result in.

> > I think in most cases, due to FormData_pg_attribute being so huge, the
> > aset.c power-of-2 round-up behaviour will be unlikely to cross a
> > power-of-2 boundary.
>
> ASet isn't the only allocator, but default enough for this to make sense, yes.

It's the only allocator we use for allocating TupleDescs, so other types and their behaviour are not relevant here.

> I'm not sure we have a pg_type entry that supports numeric
> attalign values without increasing the size of the field, as the one
> single-byte integer-like type (char) is always used as a printable
> character, and implied to always be printable through its usage in
> e.g. nodeToString infrastructure.
>
> I'd love to have a better option, but we don't seem to have one yet.

Yeah, select typname from pg_Type where typalign = 'c' and typlen=1; has just bool and char.

I'm happy to keep going with this version of the patch unless someone points out some good reason that we should go with the alternative instead.

David
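For readers who want to see what the attalign discussion is comparing, master's char-based macro looks roughly like the first block below (paraphrased from src/include/access/tupmacs.h, not copied from the patch), while a byte-count field lets it collapse to a single TYPEALIGN. The second block is a sketch of that idea under the att_nominal_alignby name the later patches use; the exact definition in the patch may differ.

    /* Roughly what master evaluates for every column deformed: */
    #define att_align_nominal(cur_offset, attalign) \
    ( \
        ((attalign) == TYPALIGN_INT) ? INTALIGN(cur_offset) : \
         (((attalign) == TYPALIGN_CHAR) ? (uintptr_t) (cur_offset) : \
          (((attalign) == TYPALIGN_DOUBLE) ? DOUBLEALIGN(cur_offset) : \
           ( \
            AssertMacro((attalign) == TYPALIGN_SHORT), \
            SHORTALIGN(cur_offset) \
           ))) \
    )

    /* Sketch: with attalignby holding 1, 2, 4 or 8, this is branch-free add-and-mask. */
    #define att_nominal_alignby(cur_offset, attalignby) \
        TYPEALIGN(attalignby, cur_offset)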
David Rowley <dgrowleyml@gmail.com> writes:

> You can see the branch predictor has done a *much* better job in the
> patched code vs master with about 10x fewer misses. This should have
> helped contribute to the "insn per cycle" increase. 4.29 is quite
> good for postgres. I often see that around 0.5. According to [1]
> (relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the
> micro-op cache". I'm unsure how micro-ops translate to "insn per
> cycle" that's shown in perf stat. I thought 4-5 was about the maximum
> pipeline size from today's era of CPUs. Maybe someone else can explain
> better than I can. In more simple terms, generally, the higher the
> "insn per cycle", the better. Also, the lower all of the idle and
> branch-miss percentages are, the better. However, you'll notice that
> the patched version has more frontend and backend stalls. I assume this
> is due to performing more instructions per cycle from improved branch
> prediction causing memory and instruction stalls to occur more
> frequently; effectively (I think) it's just hitting the next
> bottleneck(s) - memory and instruction decoding. At least, modern CPUs
> should be able to out-pace RAM in many workloads, so perhaps it's not
> that surprising that "backend cycles idle" has gone up due to such a
> large increase in instructions per cycle from improved branch
> prediction.

Thanks for the answer; this is just another area that deserves some exploring.

> It would be nice to see this tested on some modern Intel CPU. A 13th
> or 14th series, for example, or even any Intel from the past 5
> years would be better than nothing.

I have two kinds of CPUs: a) an Intel Xeon Processor (Icelake) on my ECS, and b) an Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz on my Mac. My ECS reports "<not supported> branch-misses", probably because it runs in virtualization software, and Mac doesn't support perf yet :(

--
Best Regards
Andy Fan
On Tue, 2 Jul 2024 at 02:23, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Mon, 1 Jul 2024 at 23:42, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> >
> > On Mon, 1 Jul 2024 at 12:49, David Rowley <dgrowleyml@gmail.com> wrote:
> > >
> > > On Mon, 1 Jul 2024 at 22:07, Matthias van de Meent
> > > <boekewurm+postgres@gmail.com> wrote:
> > > > One thing I'm slightly concerned about is that this allocates another
> > > > 8 bytes for each attribute in the tuple descriptor. While that's not a
> > > > lot when compared with the ->attrs array, it's still quite a lot when
> > > > we might not care at all about this data; e.g. in temporary tuple
> > > > descriptors during execution, in intermediate planner nodes.
> > >
> > > I've not done it in the patch, but one way to get some of that back is
> > > to ditch pg_attribute.attcacheoff. There's no need for it after this
> > > patch. That's only 4 out of 8 bytes, however.
> >
> > FormData_pg_attribute as a C struct has 4-byte alignment; AFAIK it
> > doesn't have any fields that require 8-byte alignment? Only on 64-bit
> > systems we align the tuples on pages with 8-byte alignment, but
> > in-memory arrays of the struct wouldn't have to deal with that, AFAIK.
>
> Yeah, 4-byte alignment. The "out of 8 bytes" I was talking about is
> sizeof(TupleDescDeformAttr), which I believe is the same "another 8
> bytes" you had mentioned. What I meant was that deleting attcacheoff
> only reduces FormData_pg_attribute by 4 bytes per column and adding
> TupleDescDeformAttr adds 8 per column, so we still use 4 more bytes
> per column with the patch.

I see I was confused; thank you for clarifying. As I said, the concerns were only small; 4 more bytes, only in memory, shouldn't matter much in the grand scheme of things.

> I'm happy to keep going with this version of the patch

+1, go for it.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
On Mon, Jul 1, 2024 at 5:07 PM David Rowley <dgrowleyml@gmail.com> wrote:

>          8505168      stalled-cycles-backend:u  #    0.02% backend cycles idle
>     165442142326      instructions:u            #    3.35  insn per cycle
>                                                 #    0.00  stalled cycles per insn
>      39409877343      branches:u                #    3.945 G/sec
>        146350275      branch-misses:u           #    0.37% of all branches
>
> patched
>         24259785      stalled-cycles-backend:u  #    0.05% backend cycles idle
>     213688149862      instructions:u            #    4.29  insn per cycle
>                                                 #    0.00  stalled cycles per insn
>      44147675129      branches:u                #    4.420 G/sec
>         14282567      branch-misses:u           #    0.03% of all branches

> You can see the branch predictor has done a *much* better job in the
> patched code vs master with about 10x fewer misses. This should have

Nice!

> helped contribute to the "insn per cycle" increase. 4.29 is quite
> good for postgres. I often see that around 0.5. According to [1]
> (relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the
> micro-op cache". I'm unsure how micro-ops translate to "insn per
> cycle" that's shown in perf stat. I thought 4-5 was about the maximum
> pipeline size from today's era of CPUs.

"insn per cycle" is micro-ops retired (i.e. it excludes those executed speculatively on a mispredicted branch). That article mentions that 6 micro-ops per cycle can enter the backend from the frontend, but that can happen only with internally cached ops, since only 4 instructions per cycle can be decoded. In specific cases, CPUs can fuse multiple front-end instructions into a single macro-op, which I think means a pair of micro-ops that can "travel together" as one. The authors concluded further down that "Zen 4's reorder buffer is also special, because each entry can hold up to 4 NOPs. Pairs of NOPs are likely fused by the decoders, and pairs of fused NOPs are fused again at the rename stage."
On Tue, 16 Jul 2024 at 00:13, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>
> On Tue, 2 Jul 2024 at 02:23, David Rowley <dgrowleyml@gmail.com> wrote:
> > I'm happy to keep going with this version of the patch
>
> +1, go for it.

I've attached an updated patch series which is a bit more polished than the last set. I've attempted to document the distinction between FormData_pg_attribute and the abbreviated struct and tried to give an indication of which one should be used.

Apart from that, changes include:

* I pushed the v1-0001 patch, so that's removed from the patch series.
* Rename TupleDescDeformAttr struct. It's now called CompactAttribute.
* Rename TupleDescDeformAttr() macro. It's now called TupleDescCompactAttr().
* Other macro renaming, e.g. ATT_IS_PACKABLE_FAST to COMPACT_ATTR_IS_PACKABLE.
* In 0003, renamed CompactAttribute.attalign to attalignby to make it easier to understand the distinction between the align char and the number of bytes.
* Added 0004 patch to remove pg_attribute.attcacheoff.

There are a few more things that could be done to optimise things further. For example, a bunch of places still use att_align_nominal(). With a bit more work, these could use att_nominal_alignby(). I'm not yet sure of the cleanest way to do this. Adding alignby to the typecache might be one way, or maybe just a function that converts the attalign to the number of bytes. This would be useful in all places where att_align_nominal() is used in loops, as converting the char to the number of bytes would only be done once rather than once per loop. I feel like this patch series is probably big enough for now, so I'd like to opt for those improvements to take place as follow-on work.

I'll put this up for the CF bot to run with for a bit, as the patch has needed a rebase since I pushed the v1-0001 patch.

David
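A sketch of the "function that converts the attalign to the number of bytes" idea mentioned above, so the translation can be hoisted out of loops that currently call att_align_nominal() per column. The helper name is made up for illustration and is not part of the patch set; TYPALIGN_* and ALIGNOF_* are the existing constants from pg_type.h and the build configuration.

    /* Hypothetical helper: translate 'c'/'s'/'i'/'d' into a byte count once. */
    static inline uint8
    attalign_to_alignby(char attalign)
    {
        switch (attalign)
        {
            case TYPALIGN_CHAR:
                return 1;
            case TYPALIGN_SHORT:
                return ALIGNOF_SHORT;
            case TYPALIGN_INT:
                return ALIGNOF_INT;
            case TYPALIGN_DOUBLE:
                return ALIGNOF_DOUBLE;
            default:
                Assert(false);
                return 1;       /* keep the compiler happy */
        }
    }

A caller that loops over many columns would call this once before the loop and then use the byte count with att_nominal_alignby() inside it, rather than re-deriving the alignment from the char on every iteration.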
Hi,

A few weeks ago David and I discussed this patch. We were curious *why* the flags approach was slower. It turns out that, at least on my machine, this is just a compiler optimization issue. Putting a pg_compiler_barrier() just after:

>  	for (; attnum < natts; attnum++)
>  	{
> -		Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum);
> +		CompactAttribute *thisatt = TupleDescCompactAttr(tupleDesc, attnum);

addressed the issue. The problem basically is that instead of computing the address of thisatt once, gcc for some reason instead uses complex addressing lea instructions, which are slower on some machines. Not sure what a good way to deal with that is. I haven't tried, but it could be that just advancing thisatt using ++ would do the trick?

I think this generally looks quite good and the performance wins are quite nice!

I'm a bit concerned about the impact on various extensions - I haven't looked, but I wouldn't be surprised if this requires a bunch of changes. Perhaps we could reduce that a bit?

Could it make sense to use bitfields instead of flag values, to reduce the impact?

> From 35a6cdc2056decc31d67ad552826b573d2b66073 Mon Sep 17 00:00:00 2001
> From: David Rowley <dgrowley@gmail.com>
> Date: Tue, 28 May 2024 20:10:50 +1200
> Subject: [PATCH v3 2/5] Introduce CompactAttribute array in TupleDesc
>
> This array stores a subset of the fields of FormData_pg_attribute,
> primarily the ones for deforming tuples, but since we have additional
> space, pack a few additional boolean columns in the attflags field.
>
> Many areas of the code can get away with only accessing the
> CompactAttribute array, which because the details of each attribute is
> stored much more densely than FormData_pg_attribute, many operations can
> be performed accessing fewer cachelines which can improve performance.
>
> This also makes pg_attribute.attcacheoff redundant. A follow-on commit
> will remove it.

Yay.

> diff --git a/src/include/access/tupdesc.h b/src/include/access/tupdesc.h
> index 2c435cdcb2..478ebbe1f4 100644
> --- a/src/include/access/tupdesc.h
> +++ b/src/include/access/tupdesc.h
> @@ -45,6 +45,46 @@ typedef struct TupleConstr
>  	bool		has_generated_stored;
>  } TupleConstr;
>
> +/*
> + * CompactAttribute
> + *		Cut-down version of FormData_pg_attribute for faster access for tasks
> + *		such as tuple deformation.
> + */
> +typedef struct CompactAttribute
> +{
> +	int32		attcacheoff;	/* fixed offset into tuple, if known, or -1 */

For a short while I thought we could never have a large offset here, due to toast, but that's only when tuples come from a table...

> +	int16		attlen;			/* attr len in bytes or -1 = varlen, -2 =
> +								 * cstring */

It's hard to imagine fixed-width types longer than 256 bits making sense, but even if we wanted to change that, it'd not be sensible to change it in this patch...

> +#ifdef USE_ASSERT_CHECKING
> +
> +/*
> + * Accessor for the i'th CompactAttribute of tupdesc. In Assert enabled
> + * builds we verify that the CompactAttribute is populated correctly.
> + * This helps find bugs in places such as ALTER TABLE where code makes changes
> + * to the FormData_pg_attribute but forgets to call populate_compact_attribute
> + */
> +static inline CompactAttribute *
> +TupleDescCompactAttr(TupleDesc tupdesc, int i)
> +{
> +	CompactAttribute snapshot;
> +	CompactAttribute *cattr = &tupdesc->compact_attrs[i];
> +
> +	/*
> +	 * Take a snapshot of how the CompactAttribute is now before calling
> +	 * populate_compact_attribute to make it up-to-date with the
> +	 * FormData_pg_attribute.
> +	 */
> +	memcpy(&snapshot, cattr, sizeof(CompactAttribute));
> +
> +	populate_compact_attribute(tupdesc, i);
> +
> +	/* reset attcacheoff back to what it was */
> +	cattr->attcacheoff = snapshot.attcacheoff;
> +
> +	/* Ensure the snapshot matches the freshly populated CompactAttribute */
> +	Assert(memcmp(&snapshot, cattr, sizeof(CompactAttribute)) == 0);
> +
> +	return cattr;
> +}
> +
> +#else
> +/* Accessor for the i'th CompactAttribute of tupdesc */
> +#define TupleDescCompactAttr(tupdesc, i) (&(tupdesc)->compact_attrs[(i)])
> +#endif

Personally I'd have the ifdef inside the static inline function, rather than changing between a static inline and a macro depending on USE_ASSERT_CHECKING.

>  #ifndef FRONTEND
>  /*
> - * Given a Form_pg_attribute and a pointer into a tuple's data area,
> + * Given a CompactAttribute pointer and a pointer into a tuple's data area,
>   * return the correct value or pointer.
>   *
>   * We return a Datum value in all cases. If the attribute has "byval" false,
> @@ -43,7 +43,7 @@ att_isnull(int ATT, const bits8 *BITS)
>   *
>   * Note that T must already be properly aligned for this to work correctly.
>   */
> -#define fetchatt(A,T) fetch_att(T, (A)->attbyval, (A)->attlen)
> +#define fetchatt(A, T) fetch_att(T, CompactAttrByVal(A), (A)->attlen)

Stuff like this seems like it might catch some extensions unaware. I think it might make sense to change macros like this to be static inlines, so that you get proper type mismatch errors rather than errors about invalid casts or nonexistent fields.

> From c75d072b8bc43ba4f9b7fbe2f99b65edc7421a15 Mon Sep 17 00:00:00 2001
> From: David Rowley <dgrowley@gmail.com>
> Date: Wed, 29 May 2024 12:19:03 +1200
> Subject: [PATCH v3 3/5] Optimize alignment calculations in tuple form/deform
>
> This converts CompactAttribute.attalign from a char which is directly
> derived from pg_attribute.attalign into a uint8 which specifies the
> number of bytes to align the column by. Also, rename the field to
> attalignby to make the distinction more clear in code.
>
> This removes the complexity of checking each char value and transforming
> that into the appropriate alignment call. This can just be a simple
> TYPEALIGN passing in the number of bytes.

I like this a lot.

> diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
> index 08772de39f..b66eb178b9 100644
> --- a/contrib/amcheck/verify_heapam.c
> +++ b/contrib/amcheck/verify_heapam.c
> @@ -1592,7 +1592,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
>  		/* Skip non-varlena values, but update offset first */
>  		if (thisatt->attlen != -1)
>  		{
> -			ctx->offset = att_align_nominal(ctx->offset, thisatt->attalign);
> +			ctx->offset = att_nominal_alignby(ctx->offset, thisatt->attalignby);
>  			ctx->offset = att_addlength_pointer(ctx->offset, thisatt->attlen,
>  												tp + ctx->offset);
>  			if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)

A bit confused about the change in naming policy here...
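A minimal sketch of the static-inline form of fetchatt() suggested a few paragraphs up, assuming the plain attbyval field spelling used by the 5/5 patch quoted below (the committed form, if any, may differ). The point is only that a mismatched argument type then fails with a clear compiler error instead of a confusing cast or missing-field error:

    static inline Datum
    fetchatt(const CompactAttribute *att, const char *tp)
    {
        /* same behaviour as the macro: delegate to fetch_att() */
        return fetch_att(tp, att->attbyval, att->attlen);
    }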
> From d1ec19a46a480b0c75f9df468b2765ad4e51dce2 Mon Sep 17 00:00:00 2001
> From: David Rowley <dgrowley@gmail.com>
> Date: Tue, 3 Sep 2024 14:05:30 +1200
> Subject: [PATCH v3 5/5] Try a larger CompactAttribute struct without flags
>
> Benchmarks have shown that making the CompactAttribute struct larger and
> getting rid of the flags to reduce the bitwise-ANDing requirements makes
> things go faster.

I think we have some way of not needing this. I don't like my compiler barrier hack, but I'm sure we can hold the hands of the compiler sufficiently to generate useful code. But this made me wonder:

> ---
>  src/backend/access/common/tupdesc.c | 21 ++++++----------
>  src/include/access/tupdesc.h        | 39 ++++++++++-------------------
>  2 files changed, 20 insertions(+), 40 deletions(-)
>
> diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
> index 74f22cffb9..95c92e6585 100644
> --- a/src/backend/access/common/tupdesc.c
> +++ b/src/backend/access/common/tupdesc.c
> @@ -67,20 +67,13 @@ populate_compact_attribute(TupleDesc tupdesc, int i)
>  	dst->attcacheoff = -1;
>  	dst->attlen = src->attlen;
>
> -	dst->attflags = 0;
> -
> -	if (src->attbyval)
> -		dst->attflags |= COMPACT_ATTR_FLAG_BYVAL;
> -	if (src->attstorage != TYPSTORAGE_PLAIN)
> -		dst->attflags |= COMPACT_ATTR_FLAG_IS_PACKABLE;
> -	if (src->atthasmissing)
> -		dst->attflags |= COMPACT_ATTR_FLAG_HAS_MISSING;
> -	if (src->attisdropped)
> -		dst->attflags |= COMPACT_ATTR_FLAG_IS_DROPPED;
> -	if (src->attgenerated)
> -		dst->attflags |= COMPACT_ATTR_FLAG_IS_GENERATED;
> -	if (src->attnotnull)
> -		dst->attflags |= COMPACT_ATTR_FLAG_IS_NOTNULL;
> +
> +	dst->attbyval = src->attbyval;
> +	dst->attispackable = (src->attstorage != TYPSTORAGE_PLAIN);
> +	dst->atthasmissing = src->atthasmissing;
> +	dst->attisdropped = src->attisdropped;
> +	dst->attgenerated = src->attgenerated;
> +	dst->attnotnull = src->attnotnull;

It'd sure be nice if we could condense some of these fields in pg_attribute too. It obviously shouldn't be a requirement for this patch, don't get me wrong! Production systems often have very large pg_attribute tables, which makes this a fairly worthwhile optimization target. I wonder if we could teach the bootstrap and deform logic about bool:1 or such and have it generate the right code for that.

Greetings,

Andres Freund
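For what it's worth, the "advance thisatt using ++" idea from the top of this review would look something like the following in the deform loop. This is a sketch only, trimmed to the pointer handling, with variable names following the hunk quoted at the start of the message:

    /* Compute the array element address once, then just bump the pointer. */
    CompactAttribute *thisatt = TupleDescCompactAttr(tupleDesc, attnum);

    for (; attnum < natts; attnum++, thisatt++)
    {
        /*
         * Deform the column that thisatt describes, as before.  The compiler
         * no longer has to recompute &compact_attrs[attnum] with a
         * scaled-index LEA on every iteration.
         */
    }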