Thread: Make tuple deformation faster
Currently, TupleDescData contains the descriptor's attributes in a variable-length array of FormData_pg_attribute allocated within the same allocation as the TupleDescData. According to my IDE, sizeof(FormData_pg_attribute) == 104 bytes. It's that large mainly due to attname being 64 bytes. The TupleDescData.attrs[] array could end up quite large on tables with many columns, and that could result in poor CPU cache hit ratios when deforming tuples.

Instead, we could make TupleDescData contain an out-of-line pointer to the array of FormData_pg_attribute and have a much more compact inlined array of some other struct that much more densely contains the fields required for tuple deformation. attname and many of the other fields are not required to deform a tuple.

I've attached a patch series which does this.

0001: Just fixes up some missing usages of TupleDescAttr(). (mostly missed by me, apparently :-( )
0002: Adjusts the TupleDescData.attrs array to make it out of line. I wanted to make sure nothing weird happened by doing this before doing the bulk of the other changes to add the new struct.
0003: Adds a very compact 8-byte struct named TupleDescDeformAttr, which can be used for tuple deformation. 8 columns fit on a single 64-byte cacheline rather than spanning 13 cachelines.
0004: Adjusts the attalign to change it from char to uint8. See below.

The 0004 patch changes the TupleDescDeformAttr.attalign to a uint8 rather than a char containing 'c', 's', 'i' or 'd'. This allows much simpler code in the att_align_nominal() macro. What's in master is quite a complex expression to evaluate every time we deform a column, as it must translate: 'c' -> 1, 's' -> 2, 'i' -> 4, 'd' -> 8. If we just store that numeric value in the struct, that macro can become a simple TYPEALIGN(), so the operation becomes simple bit masking rather than a poorly branch-predictable series of compares and jumps.

The state of this patch series is "proof of concept". I think the current state should be enough to get an idea of the rough amount of code churn this change would cause and also an idea of the expected performance of the change. It certainly isn't in a finished state. I've not put much effort into updating comments or looking at READMEs to see what's now outdated.

I also went with packing a bunch of additional boolean columns from pg_attribute so they each take up just 1 bit of space in the attflags field in the new struct. I've not tested the performance of expanding these out so they use 1 bool field each; that would make the struct bigger than 8 bytes. Having the struct be a power-of-2 size is also beneficial as it allows fast bit-shifting to be used to get the array element address rather than a more complex (and slower) LEA instruction. I could try making the struct 16 bytes and see if there are any further wins by avoiding the bitwise AND on the TupleDescDeformAttr.attflags field.

To test the performance of this, I tried using the attached script, which creates a table where the first column is a variable-length column and the final column is an int. The query I ran to test the performance inserted 1 million rows into this table and performed a sum() on the final column. The attached graph shows that the query is 30% faster than master with 15 columns between the first and last column. For fewer columns, the speedup is less. This is quite a deform-heavy query, so it's not like it speeds up every table with that column arrangement by 30%, but certainly some queries could see that much gain, and even more seems possible.
I didn't go to a great deal of trouble to find the most deform-heavy workload.

I'll stick this in the July CF. It would be good to get some feedback on the idea and feedback on whether more work on this is worthwhile.

As mentioned, the 0001 patch just fixes up the missing usages of the TupleDescAttr() macro. I see no reason not to commit this now.

Thanks

David
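To make the proposal above concrete, here is a minimal sketch of the kind of 8-byte per-attribute struct being described. The field choice and layout are inferred from this email and from later messages in the thread, not copied from the patch itself; only fields the deforming loop actually needs are kept, so eight attributes share one 64-byte cacheline.

    /* Illustrative sketch only -- the real patch's layout may differ. */
    typedef struct TupleDescDeformAttr
    {
        int32   attcacheoff;    /* cached offset into the tuple, or -1 */
        int16   attlen;         /* length in bytes, or -1 (varlena) / -2 (cstring) */
        uint8   attalign;       /* alignment as a byte count: 1, 2, 4 or 8 (patch 0004) */
        uint8   attflags;       /* attbyval, attnotnull, etc. packed as bits */
    } TupleDescDeformAttr;      /* 4 + 2 + 1 + 1 = 8 bytes */

For comparison, with sizeof(FormData_pg_attribute) == 104, the deform-relevant data for a 16-column table spans 16 * 104 = 1,664 bytes (26 cachelines) today, versus 16 * 8 = 128 bytes (2 cachelines) with a struct like the above.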
David Rowley <dgrowleyml@gmail.com> writes:

> Currently, TupleDescData contains the descriptor's attributes in a
> variable-length array of FormData_pg_attribute allocated within the
> same allocation as the TupleDescData. According to my IDE,
> sizeof(FormData_pg_attribute) == 104 bytes. It's that large mainly due
> to attname being 64 bytes. The TupleDescData.attrs[] array could end
> up quite large on tables with many columns, and that could result in
> poor CPU cache hit ratios when deforming tuples.
> ...
>
> To test the performance of this, I tried using the attached script,
> which creates a table where the first column is a variable-length
> column and the final column is an int. The query I ran to test the
> performance inserted 1 million rows into this table and performed a
> sum() on the final column. The attached graph shows that the query is
> 30% faster than master with 15 columns between the first and last
> column.

Yet another wonderful optimization! I just want to know how you found this optimization (CPU cache hits) and decided it was worth some time, because before we invest our time in optimizing something, it is better to know that we can get some measurable improvement for the time spent. As for this case, 30% is really a huge number, even if it is an artificial case.

Another case is when Andrew introduced NullableDatum 5 years ago and said using it in TupleTableSlot could be CPU-cache friendly. I can follow that, but how much can it improve in an ideal case? Is it possible to forecast that somehow? I ask it here because both cases are optimizing for the CPU cache.

--
Best Regards
Andy Fan
On Mon, 1 Jul 2024 at 21:17, Andy Fan <zhihuifan1213@163.com> wrote:
> Yet another wonderful optimization! I just want to know how you found
> this optimization (CPU cache hits) and decided it was worth some time,
> because before we invest our time in optimizing something, it is better
> to know that we can get some measurable improvement for the time spent.
> As for this case, 30% is really a huge number, even if it is an
> artificial case.
>
> Another case is when Andrew introduced NullableDatum 5 years ago and
> said using it in TupleTableSlot could be CPU-cache friendly. I can
> follow that, but how much can it improve in an ideal case? Is it
> possible to forecast that somehow? I ask it here because both cases are
> optimizing for the CPU cache.

Have a look at:

perf stat --pid=<backend pid>

On my AMD Zen4 machine running the 16 extra column test from the script in my last email, I see:

$ echo master && perf stat --pid=389510 sleep 10
master

 Performance counter stats for process id '389510':

          9990.65 msec task-clock:u              #    0.999 CPUs utilized
                0      context-switches:u        #    0.000 /sec
                0      cpu-migrations:u          #    0.000 /sec
                0      page-faults:u             #    0.000 /sec
      49407204156      cycles:u                  #    4.945 GHz
         18529494      stalled-cycles-frontend:u #    0.04% frontend cycles idle
          8505168      stalled-cycles-backend:u  #    0.02% backend cycles idle
     165442142326      instructions:u            #    3.35  insn per cycle
                                                 #    0.00  stalled cycles per insn
      39409877343      branches:u                #    3.945 G/sec
        146350275      branch-misses:u           #    0.37% of all branches

      10.001012132 seconds time elapsed

$ echo patched && perf stat --pid=380216 sleep 10
patched

 Performance counter stats for process id '380216':

          9989.14 msec task-clock:u              #    0.998 CPUs utilized
                0      context-switches:u        #    0.000 /sec
                0      cpu-migrations:u          #    0.000 /sec
                0      page-faults:u             #    0.000 /sec
      49781280456      cycles:u                  #    4.984 GHz
         22922276      stalled-cycles-frontend:u #    0.05% frontend cycles idle
         24259785      stalled-cycles-backend:u  #    0.05% backend cycles idle
     213688149862      instructions:u            #    4.29  insn per cycle
                                                 #    0.00  stalled cycles per insn
      44147675129      branches:u                #    4.420 G/sec
         14282567      branch-misses:u           #    0.03% of all branches

      10.005034271 seconds time elapsed

You can see the branch predictor has done a *much* better job in the patched code vs master with about 10x fewer misses. This should have helped contribute to the "insn per cycle" increase. 4.29 is quite good for postgres. I often see that around 0.5. According to [1] (relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the micro-op cache". I'm unsure how micro-ops translate to "insn per cycle" that's shown in perf stat. I thought 4-5 was about the maximum pipeline size from today's era of CPUs. Maybe someone else can explain better than I can. In more simple terms, generally, the higher the "insn per cycle", the better. Also, the lower all of the idle and branch-miss percentages are, the better. However, you'll notice that the patched version has more frontend and backend stalls. I assume this is due to performing more instructions per cycle from improved branch prediction causing memory and instruction stalls to occur more frequently; effectively (I think) it's just hitting the next bottleneck(s) - memory and instruction decoding. At least, modern CPUs should be able to out-pace RAM in many workloads, so perhaps it's not that surprising that "backend cycles idle" has gone up due to such a large increase in instructions per cycle from improved branch prediction.

It would be nice to see this tested on some modern Intel CPU.
A 13th or 14th series, for example, or even any Intel from the past 5 years would be better than nothing.

David

[1] https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-frontend-and-execution-engine/
On Mon, 1 Jul 2024 at 10:56, David Rowley <dgrowleyml@gmail.com> wrote:
>
> Currently, TupleDescData contains the descriptor's attributes in a
> variable-length array of FormData_pg_attribute allocated within the
> same allocation as the TupleDescData. According to my IDE,
> sizeof(FormData_pg_attribute) == 104 bytes. It's that large mainly due
> to attname being 64 bytes. The TupleDescData.attrs[] array could end
> up quite large on tables with many columns, and that could result in
> poor CPU cache hit ratios when deforming tuples.
>
> Instead, we could make TupleDescData contain an out-of-line pointer to
> the array of FormData_pg_attribute and have a much more compact
> inlined array of some other struct that much more densely contains the
> fields required for tuple deformation. attname and many of the other
> fields are not required to deform a tuple.

+1

> I've attached a patch series which does this.
>
> 0001: Just fixes up some missing usages of TupleDescAttr(). (mostly
> missed by me, apparently :-( )
> 0002: Adjusts the TupleDescData.attrs array to make it out of line. I
> wanted to make sure nothing weird happened by doing this before doing
> the bulk of the other changes to add the new struct.
> 0003: Adds a very compact 8-byte struct named TupleDescDeformAttr,
> which can be used for tuple deformation. 8 columns fit on a single
> 64-byte cacheline rather than spanning 13 cachelines.

Cool, that's similar to, but even better than, my patch from 2021 over at [0].

One thing I'm slightly concerned about is that this allocates another 8 bytes for each attribute in the tuple descriptor. While that's not a lot when compared with the ->attrs array, it's still quite a lot when we might not care at all about this data; e.g. in temporary tuple descriptors during execution, in intermediate planner nodes.

Did you test for performance gains (or losses) with an out-of-line TupleDescDeformAttr array? One benefit from this would be that we could reuse the deform array for suffix-truncated TupleDescs, reuse of which currently would require temporarily updating TupleDesc->natts with a smaller value; but with out-of-line ->attrs and ->deform_attrs, we could reuse these arrays between TupleDescs if one is shorter than the other, but has otherwise fully matching attributes. I know that btree split code would benefit from this, as it wouldn't have to construct a full new TupleDesc when it creates a suffix-truncated tuple during page splits.

> 0004: Adjusts the attalign to change it from char to uint8. See below.
>
> The 0004 patch changes the TupleDescDeformAttr.attalign to a uint8
> rather than a char containing 'c', 's', 'i' or 'd'. This allows much
> simpler code in the att_align_nominal() macro. What's in master is
> quite a complex expression to evaluate every time we deform a column,
> as it must translate: 'c' -> 1, 's' -> 2, 'i' -> 4, 'd' -> 8. If we
> just store that numeric value in the struct, that macro can become a
> simple TYPEALIGN(), so the operation becomes simple bit masking rather
> than a poorly branch-predictable series of compares and jumps.

+1, that's something I'd missed in my patches, and is probably the largest contributor to the speedup.

> I'll stick this in the July CF. It would be good to get some feedback
> on the idea and feedback on whether more work on this is worthwhile.

Do you plan to remove the ->attcacheoff catalog field from the FormData_pg_attribute, now that (with your patch) it isn't used anymore as a placeholder field for fast (de)forming of tuples?
Kind regards,

Matthias van de Meent

[0] https://www.postgresql.org/message-id/CAEze2Wh8-metSryZX_Ubj-uv6kb%2B2YnzHAejmEdubjhmGusBAg%40mail.gmail.com
On Mon, 1 Jul 2024 at 22:07, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> Cool, that's similar to, but even better than, my patch from 2021 over at [0].

I'll have a read of that. Thanks for pointing it out.

> One thing I'm slightly concerned about is that this allocates another
> 8 bytes for each attribute in the tuple descriptor. While that's not a
> lot when compared with the ->attrs array, it's still quite a lot when
> we might not care at all about this data; e.g. in temporary tuple
> descriptors during execution, in intermediate planner nodes.

I've not done it in the patch, but one way to get some of that back is to ditch pg_attribute.attcacheoff. There's no need for it after this patch. That's only 4 out of 8 bytes, however.

I think in most cases, due to FormData_pg_attribute being so huge, the aset.c power-of-2 round-up behaviour will be unlikely to cross a power-of-2 boundary. The following demonstrates which column counts actually make a difference:

select c as n_cols,old_bytes, new_bytes from
  (select c,24+104*c as old_bytes, 32+100*c+8*c as new_bytes
   from generate_series(1,1024) c)
where position('1' in old_bytes::bit(32)::text) != position('1' in new_bytes::bit(32)::text);

That returns just 46 column counts out of 1024 where we cross a power-of-2 boundary with the patched code that we didn't cross in master. Of course, larger pallocs will result in a malloc() directly, so perhaps that's not a good measure. At least for smaller column counts it should be mainly the same amount of memory used. There are only 6 rows in there for column counts below 100.

I think if we were worried about memory there are likely 100 other things we could do to reclaim some. It would only take some shuffling of fields in RelationData. I count 50 bytes of holes in that struct out of the 488 bytes. There are probably a few that could be moved without upsetting the struct-field-order-lords too much.

> Did you test for performance gains (or losses) with an out-of-line
> TupleDescDeformAttr array? One benefit from this would be that we
> could reuse the deform array for suffix-truncated TupleDescs, reuse of
> which currently would require temporarily updating TupleDesc->natts
> with a smaller value; but with out-of-line ->attrs and ->deform_attrs,
> we could reuse these arrays between TupleDescs if one is shorter than
> the other, but has otherwise fully matching attributes. I know that
> btree split code would benefit from this, as it wouldn't have to
> construct a full new TupleDesc when it creates a suffix-truncated
> tuple during page splits.

No, but it sounds easy to test, as patch 0002 moves that out of line and does nothing else.

> > 0004: Adjusts the attalign to change it from char to uint8. See below.
> >
> > The 0004 patch changes the TupleDescDeformAttr.attalign to a uint8
> > rather than a char containing 'c', 's', 'i' or 'd'. This allows much
> > simpler code in the att_align_nominal() macro. What's in master is
> > quite a complex expression to evaluate every time we deform a column,
> > as it must translate: 'c' -> 1, 's' -> 2, 'i' -> 4, 'd' -> 8. If we
> > just store that numeric value in the struct, that macro can become a
> > simple TYPEALIGN(), so the operation becomes simple bit masking rather
> > than a poorly branch-predictable series of compares and jumps.
>
> +1, that's something I'd missed in my patches, and is probably the
> largest contributor to the speedup.
I think so too, and I did consider if we should try and do that to pg_attribute, renaming the column to attalignby. I started but didn't finish a patch for that.

> > I'll stick this in the July CF. It would be good to get some feedback
> > on the idea and feedback on whether more work on this is worthwhile.
>
> Do you plan to remove the ->attcacheoff catalog field from the
> FormData_pg_attribute, now that (with your patch) it isn't used
> anymore as a placeholder field for fast (de)forming of tuples?

Yes, I plan to do that once I get more confidence I'm on to a winner here.

Thanks for having a look at this.

David
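To make the round-up arithmetic in the query above concrete (my numbers, derived from the same formulas, so illustrative only): at 16 columns, master's allocation is 24 + 104*16 = 1,688 bytes and the patched one is 32 + 108*16 = 1,760 bytes; aset.c rounds both up to the same 2,048-byte chunk, so no extra memory is actually consumed. 19 columns is one of the few counts below 100 that does cross a boundary: 24 + 104*19 = 2,000 bytes still fits a 2,048-byte chunk, while 32 + 108*19 = 2,084 bytes rounds up to 4,096.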
On Mon, 1 Jul 2024 at 12:49, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Mon, 1 Jul 2024 at 22:07, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > One thing I'm slightly concerned about is that this allocates another
> > 8 bytes for each attribute in the tuple descriptor. While that's not a
> > lot when compared with the ->attrs array, it's still quite a lot when
> > we might not care at all about this data; e.g. in temporary tuple
> > descriptors during execution, in intermediate planner nodes.
>
> I've not done it in the patch, but one way to get some of that back is
> to ditch pg_attribute.attcacheoff. There's no need for it after this
> patch. That's only 4 out of 8 bytes, however.

FormData_pg_attribute as a C struct has 4-byte alignment; AFAIK it doesn't have any fields that require 8-byte alignment? Only on 64-bit systems we align the tuples on pages with 8-byte alignment, but in-memory arrays of the struct wouldn't have to deal with that, AFAIK.

> I think in most cases, due to FormData_pg_attribute being so huge, the
> aset.c power-of-2 round-up behaviour will be unlikely to cross a
> power-of-2 boundary.

ASet isn't the only allocator, but default enough for this to make sense, yes.

> The following demonstrates which column counts actually make a
> difference:
>
> select c as n_cols,old_bytes, new_bytes from
>   (select c,24+104*c as old_bytes, 32+100*c+8*c as new_bytes
>    from generate_series(1,1024) c)
> where position('1' in old_bytes::bit(32)::text) != position('1' in new_bytes::bit(32)::text);
>
> That returns just 46 column counts out of 1024 where we cross a
> power-of-2 boundary with the patched code that we didn't cross in
> master. Of course, larger pallocs will result in a malloc() directly,
> so perhaps that's not a good measure. At least for smaller column
> counts it should be mainly the same amount of memory used. There are
> only 6 rows in there for column counts below 100.
>
> I think if we were worried about memory there are likely 100 other
> things we could do to reclaim some. It would only take some shuffling
> of fields in RelationData. I count 50 bytes of holes in that struct out
> of the 488 bytes. There are probably a few that could be moved without
> upsetting the struct-field-order-lords too much.

I'd love for RelationData to be split into IndexRelation, TableRelation, ForeignTableRelation, etc., as there's a lot of wastage caused by exclusive fields, too.

> > > 0004: Adjusts the attalign to change it from char to uint8. See below.
> > >
> > > The 0004 patch changes the TupleDescDeformAttr.attalign to a uint8
> > > rather than a char containing 'c', 's', 'i' or 'd'. This allows much
> > > simpler code in the att_align_nominal() macro. What's in master is
> > > quite a complex expression to evaluate every time we deform a column,
> > > as it must translate: 'c' -> 1, 's' -> 2, 'i' -> 4, 'd' -> 8. If we
> > > just store that numeric value in the struct, that macro can become a
> > > simple TYPEALIGN(), so the operation becomes simple bit masking rather
> > > than a poorly branch-predictable series of compares and jumps.
> >
> > +1, that's something I'd missed in my patches, and is probably the
> > largest contributor to the speedup.
>
> I think so too, and I did consider if we should try and do that to
> pg_attribute, renaming the column to attalignby. I started but didn't
> finish a patch for that.
I'm not sure we have a pg_type entry that supports numeric attalign values without increasing the size of the field, as the one single-byte integer-like type (char) is always used as a printable character, and is implied to always be printable through its usage in e.g. the nodeToString infrastructure.

I'd love to have a better option, but we don't seem to have one yet.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
On Mon, 1 Jul 2024 at 23:42, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>
> On Mon, 1 Jul 2024 at 12:49, David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > On Mon, 1 Jul 2024 at 22:07, Matthias van de Meent
> > <boekewurm+postgres@gmail.com> wrote:
> > > One thing I'm slightly concerned about is that this allocates another
> > > 8 bytes for each attribute in the tuple descriptor. While that's not a
> > > lot when compared with the ->attrs array, it's still quite a lot when
> > > we might not care at all about this data; e.g. in temporary tuple
> > > descriptors during execution, in intermediate planner nodes.
> >
> > I've not done it in the patch, but one way to get some of that back is
> > to ditch pg_attribute.attcacheoff. There's no need for it after this
> > patch. That's only 4 out of 8 bytes, however.
>
> FormData_pg_attribute as a C struct has 4-byte alignment; AFAIK it
> doesn't have any fields that require 8-byte alignment? Only on 64-bit
> systems we align the tuples on pages with 8-byte alignment, but
> in-memory arrays of the struct wouldn't have to deal with that, AFAIK.

Yeah, 4-byte alignment. The "out of 8 bytes" I was talking about is sizeof(TupleDescDeformAttr), which I believe is the same "another 8 bytes" you had mentioned. What I meant was that deleting attcacheoff only reduces FormData_pg_attribute by 4 bytes per column and adding TupleDescDeformAttr adds 8 per column, so we still use 4 more bytes per column with the patch.

I really doubt the 4 bytes of extra memory is a big concern here. It would be more concerning for a patch that wanted to do something like change NAMEDATALEN to 128, but I think the main concern with that would be even slower tuple deforming. Additional memory would also be concerning, but I doubt that's more important than the issue of making all queries slower due to slower tuple deformation, which is what such a patch would result in.

> > I think in most cases, due to FormData_pg_attribute being so huge, the
> > aset.c power-of-2 round-up behaviour will be unlikely to cross a
> > power-of-2 boundary.
>
> ASet isn't the only allocator, but default enough for this to make sense, yes.

It's the only allocator we use for allocating TupleDescs, so other types and their behaviour are not relevant here.

> I'm not sure we have a pg_type entry that supports numeric
> attalign values without increasing the size of the field, as the one
> single-byte integer-like type (char) is always used as a printable
> character, and implied to always be printable through its usage in
> e.g. nodeToString infrastructure.
>
> I'd love to have a better option, but we don't seem to have one yet.

Yeah, select typname from pg_Type where typalign = 'c' and typlen=1; has just bool and char.

I'm happy to keep going with this version of the patch unless someone points out some good reason that we should go with the alternative instead.

David
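For readers who want to see what the attalign discussion is comparing, master's char-based macro looks roughly like the first block below (paraphrased from src/include/access/tupmacs.h, not copied from the patch), while a byte-count field lets it collapse to a single TYPEALIGN. The second block is a sketch of that idea under the att_nominal_alignby name the later patches use; the exact definition in the patch may differ.

    /* Roughly what master evaluates for every column deformed: */
    #define att_align_nominal(cur_offset, attalign) \
    ( \
        ((attalign) == TYPALIGN_INT) ? INTALIGN(cur_offset) : \
         (((attalign) == TYPALIGN_CHAR) ? (uintptr_t) (cur_offset) : \
          (((attalign) == TYPALIGN_DOUBLE) ? DOUBLEALIGN(cur_offset) : \
           ( \
            AssertMacro((attalign) == TYPALIGN_SHORT), \
            SHORTALIGN(cur_offset) \
           ))) \
    )

    /* Sketch: with attalignby holding 1, 2, 4 or 8, this is branch-free add-and-mask. */
    #define att_nominal_alignby(cur_offset, attalignby) \
        TYPEALIGN(attalignby, cur_offset)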
David Rowley <dgrowleyml@gmail.com> writes:

> You can see the branch predictor has done a *much* better job in the
> patched code vs master with about 10x fewer misses. This should have
> helped contribute to the "insn per cycle" increase. 4.29 is quite
> good for postgres. I often see that around 0.5. According to [1]
> (relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the
> micro-op cache". I'm unsure how micro-ops translate to "insn per
> cycle" that's shown in perf stat. I thought 4-5 was about the maximum
> pipeline size from today's era of CPUs. Maybe someone else can explain
> better than I can. In more simple terms, generally, the higher the
> "insn per cycle", the better. Also, the lower all of the idle and
> branch-miss percentages are, the better. However, you'll notice that
> the patched version has more frontend and backend stalls. I assume this
> is due to performing more instructions per cycle from improved branch
> prediction causing memory and instruction stalls to occur more
> frequently; effectively (I think) it's just hitting the next
> bottleneck(s) - memory and instruction decoding. At least, modern CPUs
> should be able to out-pace RAM in many workloads, so perhaps it's not
> that surprising that "backend cycles idle" has gone up due to such a
> large increase in instructions per cycle from improved branch
> prediction.

Thanks for the answer; this is just another area that deserves some exploring.

> It would be nice to see this tested on some modern Intel CPU. A 13th
> or 14th series, for example, or even any Intel from the past 5
> years would be better than nothing.

I have two kinds of CPUs: a) an Intel Xeon Processor (Icelake) on my ECS, and b) an Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz on my Mac. My ECS reports "<not supported> branch-misses", probably because it runs in virtualization software, and Mac doesn't support perf yet :(

--
Best Regards
Andy Fan
On Tue, 2 Jul 2024 at 02:23, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Mon, 1 Jul 2024 at 23:42, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> >
> > On Mon, 1 Jul 2024 at 12:49, David Rowley <dgrowleyml@gmail.com> wrote:
> > >
> > > On Mon, 1 Jul 2024 at 22:07, Matthias van de Meent
> > > <boekewurm+postgres@gmail.com> wrote:
> > > > One thing I'm slightly concerned about is that this allocates another
> > > > 8 bytes for each attribute in the tuple descriptor. While that's not a
> > > > lot when compared with the ->attrs array, it's still quite a lot when
> > > > we might not care at all about this data; e.g. in temporary tuple
> > > > descriptors during execution, in intermediate planner nodes.
> > >
> > > I've not done it in the patch, but one way to get some of that back is
> > > to ditch pg_attribute.attcacheoff. There's no need for it after this
> > > patch. That's only 4 out of 8 bytes, however.
> >
> > FormData_pg_attribute as a C struct has 4-byte alignment; AFAIK it
> > doesn't have any fields that require 8-byte alignment? Only on 64-bit
> > systems we align the tuples on pages with 8-byte alignment, but
> > in-memory arrays of the struct wouldn't have to deal with that, AFAIK.
>
> Yeah, 4-byte alignment. The "out of 8 bytes" I was talking about is
> sizeof(TupleDescDeformAttr), which I believe is the same "another 8
> bytes" you had mentioned. What I meant was that deleting attcacheoff
> only reduces FormData_pg_attribute by 4 bytes per column and adding
> TupleDescDeformAttr adds 8 per column, so we still use 4 more bytes
> per column with the patch.

I see I was confused; thank you for clarifying. As I said, the concerns were only small; 4 more bytes, only in memory, shouldn't matter much in the grand scheme of things.

> I'm happy to keep going with this version of the patch

+1, go for it.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
On Mon, Jul 1, 2024 at 5:07 PM David Rowley <dgrowleyml@gmail.com> wrote:

>          8505168      stalled-cycles-backend:u  #    0.02% backend cycles idle
>     165442142326      instructions:u            #    3.35  insn per cycle
>                                                 #    0.00  stalled cycles per insn
>      39409877343      branches:u                #    3.945 G/sec
>        146350275      branch-misses:u           #    0.37% of all branches
>
> patched
>         24259785      stalled-cycles-backend:u  #    0.05% backend cycles idle
>     213688149862      instructions:u            #    4.29  insn per cycle
>                                                 #    0.00  stalled cycles per insn
>      44147675129      branches:u                #    4.420 G/sec
>         14282567      branch-misses:u           #    0.03% of all branches

> You can see the branch predictor has done a *much* better job in the
> patched code vs master with about 10x fewer misses. This should have

Nice!

> helped contribute to the "insn per cycle" increase. 4.29 is quite
> good for postgres. I often see that around 0.5. According to [1]
> (relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the
> micro-op cache". I'm unsure how micro-ops translate to "insn per
> cycle" that's shown in perf stat. I thought 4-5 was about the maximum
> pipeline size from today's era of CPUs.

"insn per cycle" is micro-ops retired (i.e. it excludes those executed speculatively on a mispredicted branch). That article mentions that 6 micro-ops per cycle can enter the backend from the frontend, but that can happen only with internally cached ops, since only 4 instructions per cycle can be decoded. In specific cases, CPUs can fuse multiple front-end instructions into a single macro-op, which I think means a pair of micro-ops that can "travel together" as one. The authors concluded further down that "Zen 4's reorder buffer is also special, because each entry can hold up to 4 NOPs. Pairs of NOPs are likely fused by the decoders, and pairs of fused NOPs are fused again at the rename stage."
On Tue, 16 Jul 2024 at 00:13, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>
> On Tue, 2 Jul 2024 at 02:23, David Rowley <dgrowleyml@gmail.com> wrote:
> > I'm happy to keep going with this version of the patch
>
> +1, go for it.

I've attached an updated patch series which is a bit more polished than the last set. I've attempted to document the distinction between FormData_pg_attribute and the abbreviated struct and tried to give an indication of which one should be used.

Apart from that, changes include:

* I pushed the v1-0001 patch, so that's removed from the patch series.
* Rename TupleDescDeformAttr struct. It's now called CompactAttribute.
* Rename TupleDescDeformAttr() macro. It's now called TupleDescCompactAttr().
* Other macro renaming, e.g. ATT_IS_PACKABLE_FAST to COMPACT_ATTR_IS_PACKABLE.
* In 0003, renamed CompactAttribute.attalign to attalignby to make it easier to understand the distinction between the align char and the number of bytes.
* Added 0004 patch to remove pg_attribute.attcacheoff.

There are a few more things that could be done to optimise things further. For example, a bunch of places still use att_align_nominal(). With a bit more work, these could use att_nominal_alignby(). I'm not yet sure of the cleanest way to do this. Adding alignby to the typecache might be one way, or maybe just a function that converts the attalign to the number of bytes. This would be useful in all places where att_align_nominal() is used in loops, as converting the char to the number of bytes would only be done once rather than once per loop. I feel like this patch series is probably big enough for now, so I'd like to opt for those improvements to take place as follow-on work.

I'll put this up for the CF bot to run with for a bit, as the patch has needed a rebase since I pushed the v1-0001 patch.

David
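A sketch of the "function that converts the attalign to the number of bytes" idea mentioned above, so the translation can be hoisted out of loops that currently call att_align_nominal() per column. The helper name is made up for illustration and is not part of the patch set; TYPALIGN_* and ALIGNOF_* are the existing constants from pg_type.h and the build configuration.

    /* Hypothetical helper: translate 'c'/'s'/'i'/'d' into a byte count once. */
    static inline uint8
    attalign_to_alignby(char attalign)
    {
        switch (attalign)
        {
            case TYPALIGN_CHAR:
                return 1;
            case TYPALIGN_SHORT:
                return ALIGNOF_SHORT;
            case TYPALIGN_INT:
                return ALIGNOF_INT;
            case TYPALIGN_DOUBLE:
                return ALIGNOF_DOUBLE;
            default:
                Assert(false);
                return 1;       /* keep the compiler happy */
        }
    }

A caller that loops over many columns would call this once before the loop and then use the byte count with att_nominal_alignby() inside it, rather than re-deriving the alignment from the char on every iteration.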
Hi,

A few weeks ago David and I discussed this patch. We were curious *why* the flags approach was slower. It turns out that, at least on my machine, this is just a compiler optimization issue. Putting a pg_compiler_barrier() just after:

>  	for (; attnum < natts; attnum++)
>  	{
> -		Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum);
> +		CompactAttribute *thisatt = TupleDescCompactAttr(tupleDesc, attnum);

addressed the issue. The problem basically is that instead of computing the address of thisatt once, gcc for some reason instead uses complex addressing lea instructions, which are slower on some machines. Not sure what a good way to deal with that is. I haven't tried, but it could be that just advancing thisatt using ++ would do the trick?

I think this generally looks quite good and the performance wins are quite nice!

I'm a bit concerned about the impact on various extensions - I haven't looked, but I wouldn't be surprised if this requires a bunch of changes. Perhaps we could reduce that a bit?

Could it make sense to use bitfields instead of flag values, to reduce the impact?

> From 35a6cdc2056decc31d67ad552826b573d2b66073 Mon Sep 17 00:00:00 2001
> From: David Rowley <dgrowley@gmail.com>
> Date: Tue, 28 May 2024 20:10:50 +1200
> Subject: [PATCH v3 2/5] Introduce CompactAttribute array in TupleDesc
>
> This array stores a subset of the fields of FormData_pg_attribute,
> primarily the ones for deforming tuples, but since we have additional
> space, pack a few additional boolean columns in the attflags field.
>
> Many areas of the code can get away with only accessing the
> CompactAttribute array, which because the details of each attribute is
> stored much more densely than FormData_pg_attribute, many operations can
> be performed accessing fewer cachelines which can improve performance.
>
> This also makes pg_attribute.attcacheoff redundant. A follow-on commit
> will remove it.

Yay.

> diff --git a/src/include/access/tupdesc.h b/src/include/access/tupdesc.h
> index 2c435cdcb2..478ebbe1f4 100644
> --- a/src/include/access/tupdesc.h
> +++ b/src/include/access/tupdesc.h
> @@ -45,6 +45,46 @@ typedef struct TupleConstr
>  	bool		has_generated_stored;
>  } TupleConstr;
>
> +/*
> + * CompactAttribute
> + *		Cut-down version of FormData_pg_attribute for faster access for tasks
> + *		such as tuple deformation.
> + */
> +typedef struct CompactAttribute
> +{
> +	int32		attcacheoff;	/* fixed offset into tuple, if known, or -1 */

For a short while I thought we could never have a large offset here, due to toast, but that's only when tuples come from a table...

> +	int16		attlen;			/* attr len in bytes or -1 = varlen, -2 =
> +								 * cstring */

It's hard to imagine fixed-width types longer than 256 bits making sense, but even if we wanted to change that, it'd not be sensible to change it in this patch...

> +#ifdef USE_ASSERT_CHECKING
> +
> +/*
> + * Accessor for the i'th CompactAttribute of tupdesc. In Assert enabled
> + * builds we verify that the CompactAttribute is populated correctly.
> + * This helps find bugs in places such as ALTER TABLE where code makes changes
> + * to the FormData_pg_attribute but forgets to call populate_compact_attribute
> + */
> +static inline CompactAttribute *
> +TupleDescCompactAttr(TupleDesc tupdesc, int i)
> +{
> +	CompactAttribute snapshot;
> +	CompactAttribute *cattr = &tupdesc->compact_attrs[i];
> +
> +	/*
> +	 * Take a snapshot of how the CompactAttribute is now before calling
> +	 * populate_compact_attribute to make it up-to-date with the
> +	 * FormData_pg_attribute.
> +	 */
> +	memcpy(&snapshot, cattr, sizeof(CompactAttribute));
> +
> +	populate_compact_attribute(tupdesc, i);
> +
> +	/* reset attcacheoff back to what it was */
> +	cattr->attcacheoff = snapshot.attcacheoff;
> +
> +	/* Ensure the snapshot matches the freshly populated CompactAttribute */
> +	Assert(memcmp(&snapshot, cattr, sizeof(CompactAttribute)) == 0);
> +
> +	return cattr;
> +}
> +
> +#else
> +/* Accessor for the i'th CompactAttribute of tupdesc */
> +#define TupleDescCompactAttr(tupdesc, i) (&(tupdesc)->compact_attrs[(i)])
> +#endif

Personally I'd have the ifdef inside the static inline function, rather than changing between a static inline and a macro depending on USE_ASSERT_CHECKING.

>  #ifndef FRONTEND
>  /*
> - * Given a Form_pg_attribute and a pointer into a tuple's data area,
> + * Given a CompactAttribute pointer and a pointer into a tuple's data area,
>   * return the correct value or pointer.
>   *
>   * We return a Datum value in all cases. If the attribute has "byval" false,
> @@ -43,7 +43,7 @@ att_isnull(int ATT, const bits8 *BITS)
>   *
>   * Note that T must already be properly aligned for this to work correctly.
>   */
> -#define fetchatt(A,T) fetch_att(T, (A)->attbyval, (A)->attlen)
> +#define fetchatt(A, T) fetch_att(T, CompactAttrByVal(A), (A)->attlen)

Stuff like this seems like it might catch some extensions unaware. I think it might make sense to change macros like this to be static inlines, so that you get proper type mismatch errors rather than errors about invalid casts or nonexistent fields.

> From c75d072b8bc43ba4f9b7fbe2f99b65edc7421a15 Mon Sep 17 00:00:00 2001
> From: David Rowley <dgrowley@gmail.com>
> Date: Wed, 29 May 2024 12:19:03 +1200
> Subject: [PATCH v3 3/5] Optimize alignment calculations in tuple form/deform
>
> This converts CompactAttribute.attalign from a char which is directly
> derived from pg_attribute.attalign into a uint8 which specifies the
> number of bytes to align the column by. Also, rename the field to
> attalignby to make the distinction more clear in code.
>
> This removes the complexity of checking each char value and transforming
> that into the appropriate alignment call. This can just be a simple
> TYPEALIGN passing in the number of bytes.

I like this a lot.

> diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
> index 08772de39f..b66eb178b9 100644
> --- a/contrib/amcheck/verify_heapam.c
> +++ b/contrib/amcheck/verify_heapam.c
> @@ -1592,7 +1592,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
>  		/* Skip non-varlena values, but update offset first */
>  		if (thisatt->attlen != -1)
>  		{
> -			ctx->offset = att_align_nominal(ctx->offset, thisatt->attalign);
> +			ctx->offset = att_nominal_alignby(ctx->offset, thisatt->attalignby);
>  			ctx->offset = att_addlength_pointer(ctx->offset, thisatt->attlen,
>  												tp + ctx->offset);
>  			if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)

A bit confused about the change in naming policy here...
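A minimal sketch of the static-inline form of fetchatt() suggested a few paragraphs up, assuming the plain attbyval field spelling used by the 5/5 patch quoted below (the committed form, if any, may differ). The point is only that a mismatched argument type then fails with a clear compiler error instead of a confusing cast or missing-field error:

    static inline Datum
    fetchatt(const CompactAttribute *att, const char *tp)
    {
        /* same behaviour as the macro: delegate to fetch_att() */
        return fetch_att(tp, att->attbyval, att->attlen);
    }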
> From d1ec19a46a480b0c75f9df468b2765ad4e51dce2 Mon Sep 17 00:00:00 2001
> From: David Rowley <dgrowley@gmail.com>
> Date: Tue, 3 Sep 2024 14:05:30 +1200
> Subject: [PATCH v3 5/5] Try a larger CompactAttribute struct without flags
>
> Benchmarks have shown that making the CompactAttribute struct larger and
> getting rid of the flags to reduce the bitwise-ANDing requirements makes
> things go faster.

I think we have some way of not needing this. I don't like my compiler barrier hack, but I'm sure we can hold the hands of the compiler sufficiently to generate useful code. But this made me wonder:

> ---
>  src/backend/access/common/tupdesc.c | 21 ++++++----------
>  src/include/access/tupdesc.h        | 39 ++++++++++-------------------
>  2 files changed, 20 insertions(+), 40 deletions(-)
>
> diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
> index 74f22cffb9..95c92e6585 100644
> --- a/src/backend/access/common/tupdesc.c
> +++ b/src/backend/access/common/tupdesc.c
> @@ -67,20 +67,13 @@ populate_compact_attribute(TupleDesc tupdesc, int i)
>  	dst->attcacheoff = -1;
>  	dst->attlen = src->attlen;
>
> -	dst->attflags = 0;
> -
> -	if (src->attbyval)
> -		dst->attflags |= COMPACT_ATTR_FLAG_BYVAL;
> -	if (src->attstorage != TYPSTORAGE_PLAIN)
> -		dst->attflags |= COMPACT_ATTR_FLAG_IS_PACKABLE;
> -	if (src->atthasmissing)
> -		dst->attflags |= COMPACT_ATTR_FLAG_HAS_MISSING;
> -	if (src->attisdropped)
> -		dst->attflags |= COMPACT_ATTR_FLAG_IS_DROPPED;
> -	if (src->attgenerated)
> -		dst->attflags |= COMPACT_ATTR_FLAG_IS_GENERATED;
> -	if (src->attnotnull)
> -		dst->attflags |= COMPACT_ATTR_FLAG_IS_NOTNULL;
> +
> +	dst->attbyval = src->attbyval;
> +	dst->attispackable = (src->attstorage != TYPSTORAGE_PLAIN);
> +	dst->atthasmissing = src->atthasmissing;
> +	dst->attisdropped = src->attisdropped;
> +	dst->attgenerated = src->attgenerated;
> +	dst->attnotnull = src->attnotnull;

It'd sure be nice if we could condense some of these fields in pg_attribute too. It obviously shouldn't be a requirement for this patch, don't get me wrong! Production systems often have very large pg_attribute tables, which makes this a fairly worthwhile optimization target. I wonder if we could teach the bootstrap and deform logic about bool:1 or such and have it generate the right code for that.

Greetings,

Andres Freund
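For what it's worth, the "advance thisatt using ++" idea from the top of this review would look something like the following in the deform loop. This is a sketch only, trimmed to the pointer handling, with variable names following the hunk quoted at the start of the message:

    /* Compute the array element address once, then just bump the pointer. */
    CompactAttribute *thisatt = TupleDescCompactAttr(tupleDesc, attnum);

    for (; attnum < natts; attnum++, thisatt++)
    {
        /*
         * Deform the column that thisatt describes, as before.  The compiler
         * no longer has to recompute &compact_attrs[attnum] with a
         * scaled-index LEA on every iteration.
         */
    }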