Thread: Removing alignment padding for byval types
Hi, We currently align byval types such as int4/8, float4/8, timestamp *, date etc, even though we mostly don't need to. When tuples are deformed, all byval types are copied out from the tuple data into the corresponding Datum array, therefore the original alignment in the tuple data doesn't matter. This is different from byref types, where the Datum formed will often be a pointer into the tuple data. While there are some older systems where it could be a bit slower to copy data out from unaligned positions into the datum array, this is more than bought back by the next point: The fact that these types are aligned has substantial costs: For one, we often waste substantial amounts of space inside tables with alignment padding. It's not uncommon to see about 30% or more of space wasted (especially when taking alignment of the first column into account). For another, and this I think is less obvious, we actually waste substantial amounts of CPU maintaining the alignment. This is primarily the case because we have to perform to align the pointer to the next field during tuple [de]forming. Those instructions [1] have to be executed taking time, but what's worse, they also reduce the ability of out-of-order execution to hide latencies. There's a hard dependency on knowing the offset to the next column to be able to continue with the next column. There's two reasons why we can't just set the alignment for these types to 'c'. 1) pg_upgrade, for fairly obvious reasons 2) We map catalog table rows to structs, in a *lot* of places. It seems to me that, despite the above, it's still worth trying to improve upon the current state, to benefit from reduced space and CPU usage. As it turns out we already separate out the alignment for a type, and a column, between pg_type.typalign and pg_attribute.attalign. One way to tackle this would be to allow to specify, for byval types only, at column creation time whether a column uses a 'struct-mappable' alignment or not. When set, set attalign to pg_type.typalign for alignment, when not, to 'c'. By changing pg_dump in binary upgrade mode to emit the necessary options, and by adding such options during bki processing, we'd solve 1) and 2), but otherwise gain the benefits. Alternatively we could declare such a propert on the table level, but that seems more restrictive, without a corresponding upside. It's possible that we should do something related with a few varlena datatypes. We currently use intalign for types like text, json, and as far as I can tell that does not make all that much sense. They're not struct mappable *anyway* (and if they were, they'd need to be 8 byte aligned on common platforms, __alignof__(void*) is 8). We'd have to take a bit of care to treat the varlena header as unaligned - but we need to do so anyway, because of 1byte varlenas. Short varlenas seems to make it less crucial to pursue this, as the average datum that'd benefit is long enough to make padding a non-issue. So I don't think it'd make sense to tackle this project at the same time. To fully benefit from the increased tuple deforming speed, it might be beneficial to branch very early between two different versions within slot_deform_heap_tuple, having determined whether there's any byval columns with alignment requirements at slot creation / ExecSetSlotDescriptor() time (or even set a different callback getsomeattrs callback, but that's a bit more complicated). Thoughts? Independent of the above, I think it might make sense to replace pg_attribute.attalign with a smallint or such. It's a bit absurd that we need code like #define att_align_nominal(cur_offset, attalign) \ ( \ ((attalign) == 'i') ? INTALIGN(cur_offset) : \ (((attalign) == 'c') ? (uintptr_t) (cur_offset) : \ (((attalign) == 'd') ? DOUBLEALIGN(cur_offset) : \ ( \ AssertMacro((attalign) == 's'), \ SHORTALIGN(cur_offset) \ ))) \ ) instead of just using TYPEALIGN(). There's no need to adapt CREATE TYPE, or pg_type - we should just store the number in pg_attribute.attalign. That keeps CREATE TYPE 32/64bit/arch independent, doesn't require reconstructing c/s/i/d in pg_dump, simplifies the generated code [1], and would also "just" work for what I described earlier in this email. Greetings, Andres Freund [1] E.g. as a function of (void *ptr, char attalign) this ends up with assembly like .globl alignme .type alignme, @function alignme: .LFB210: .cfi_startproc movq %rdi, %rax cmpb $105, %sil je .L496 cmpb $99, %sil je .L492 addq $7, %rax leaq 1(%rdi), %rdi andq $-8, %rax andq $-2, %rdi cmpb $100, %sil cmovne %rdi, %rax .L492: ret .p2align 4,,10 .p2align 3 .L496: addq $3, %rax andq $-4, %rax ret .cfi_endproc using (void *ptr, int8 attalign) instead yields: .globl alignme2 .type alignme2, @function alignme2: .LFB211: .cfi_startproc movzbl %sil, %esi leal -1(%rsi), %eax cltq negl %esi addq %rax, %rdi movslq %esi, %rax andq %rdi, %rax ret .cfi_endproc
On Thu, Oct 31, 2019 at 11:48:21AM -0700, Andres Freund wrote: >Hi, > >We currently align byval types such as int4/8, float4/8, timestamp *, >date etc, even though we mostly don't need to. When tuples are deformed, >all byval types are copied out from the tuple data into the >corresponding Datum array, therefore the original alignment in the tuple >data doesn't matter. This is different from byref types, where the >Datum formed will often be a pointer into the tuple data. > >While there are some older systems where it could be a bit slower to >copy data out from unaligned positions into the datum array, this is >more than bought back by the next point: > > >The fact that these types are aligned has substantial costs: > >For one, we often waste substantial amounts of space inside tables with >alignment padding. It's not uncommon to see about 30% or more of space >wasted (especially when taking alignment of the first column into >account). > >For another, and this I think is less obvious, we actually waste >substantial amounts of CPU maintaining the alignment. This is primarily >the case because we have to perform to align the pointer to the next >field during tuple [de]forming. Those instructions [1] have to be >executed taking time, but what's worse, they also reduce the ability of >out-of-order execution to hide latencies. There's a hard dependency on >knowing the offset to the next column to be able to continue with the >next column. > Right. Reducing this overhead was one of the goals to allow "logical ordering" of columns in a table (while arbitrarily reordering the physical ones), but that patch got out of hand pretty quickly. Also, it'd still keep some of the overhead, because it was not removing the alignment/padding entirely. > >There's two reasons why we can't just set the alignment for these types >to 'c'. >1) pg_upgrade, for fairly obvious reasons >2) We map catalog table rows to structs, in a *lot* of places. > > >It seems to me that, despite the above, it's still worth trying to >improve upon the current state, to benefit from reduced space and CPU >usage. > >As it turns out we already separate out the alignment for a type, and a >column, between pg_type.typalign and pg_attribute.attalign. One way to >tackle this would be to allow to specify, for byval types only, at >column creation time whether a column uses a 'struct-mappable' alignment >or not. When set, set attalign to pg_type.typalign for alignment, when >not, to 'c'. By changing pg_dump in binary upgrade mode to emit the >necessary options, and by adding such options during bki processing, >we'd solve 1) and 2), but otherwise gain the benefits. > >Alternatively we could declare such a propert on the table level, but >that seems more restrictive, without a corresponding upside. > I don't know, but it seems entirely sufficient specifying this at the table level, no? What would be the use case for removing padding for only some of the columns? I don't see the use case for that. > >It's possible that we should do something related with a few varlena >datatypes. We currently use intalign for types like text, json, and as >far as I can tell that does not make all that much sense. They're not >struct mappable *anyway* (and if they were, they'd need to be 8 byte >aligned on common platforms, __alignof__(void*) is 8). We'd have to take >a bit of care to treat the varlena header as unaligned - but we need to >do so anyway, because of 1byte varlenas. Short varlenas seems to make it >less crucial to pursue this, as the average datum that'd benefit is long >enough to make padding a non-issue. So I don't think it'd make sense to >tackle this project at the same time. > Not sure, but how come it's not failing on the picky platforms, then? On x86 it's probably OK because it's pretty permissive, but I'd expect some platforms (s390, parisc, itanium, powerpc, ...) to be much pickier. > >To fully benefit from the increased tuple deforming speed, it might be >beneficial to branch very early between two different versions within >slot_deform_heap_tuple, having determined whether there's any byval >columns with alignment requirements at slot creation / >ExecSetSlotDescriptor() time (or even set a different callback >getsomeattrs callback, but that's a bit more complicated). > > >Thoughts? > Seems reasonable. I certainly agree this padding is pretty annoying, so if we can get rid of it without causing issues, that'd be nice. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2019-10-31 20:15:12 +0100, Tomas Vondra wrote: > On Thu, Oct 31, 2019 at 11:48:21AM -0700, Andres Freund wrote: > > We currently align byval types such as int4/8, float4/8, timestamp *, > > date etc, even though we mostly don't need to. When tuples are deformed, > > all byval types are copied out from the tuple data into the > > corresponding Datum array, therefore the original alignment in the tuple > > data doesn't matter. This is different from byref types, where the > > Datum formed will often be a pointer into the tuple data. > > > > While there are some older systems where it could be a bit slower to > > copy data out from unaligned positions into the datum array, this is > > more than bought back by the next point: > > > > > > The fact that these types are aligned has substantial costs: > > > > For one, we often waste substantial amounts of space inside tables with > > alignment padding. It's not uncommon to see about 30% or more of space > > wasted (especially when taking alignment of the first column into > > account). > > > > For another, and this I think is less obvious, we actually waste > > substantial amounts of CPU maintaining the alignment. This is primarily > > the case because we have to perform to align the pointer to the next > > field during tuple [de]forming. Those instructions [1] have to be > > executed taking time, but what's worse, they also reduce the ability of > > out-of-order execution to hide latencies. There's a hard dependency on > > knowing the offset to the next column to be able to continue with the > > next column. > > > > Right. Reducing this overhead was one of the goals to allow "logical > ordering" of columns in a table (while arbitrarily reordering the > physical ones), but that patch got out of hand pretty quickly. Also, > it'd still keep some of the overhead, because it was not removing the > alignment/padding entirely. Yea. It'd keep just about all the CPU overhead, because we'd still need to align as soon as there is a preceding nulled or varlena colum. There's still some benefit for logical column order, as grouping NOT NULL fixed-length columns at the start is beneficial. And it's also beneficial to have frequently accessed columns at the start. But I think this proposal gains a lot of the space related benefits, at a much lower complexity, together with a lot of other benefits. > > There's two reasons why we can't just set the alignment for these types > > to 'c'. > > 1) pg_upgrade, for fairly obvious reasons > > 2) We map catalog table rows to structs, in a *lot* of places. > > > > > > It seems to me that, despite the above, it's still worth trying to > > improve upon the current state, to benefit from reduced space and CPU > > usage. > > > > As it turns out we already separate out the alignment for a type, and a > > column, between pg_type.typalign and pg_attribute.attalign. One way to > > tackle this would be to allow to specify, for byval types only, at > > column creation time whether a column uses a 'struct-mappable' alignment > > or not. When set, set attalign to pg_type.typalign for alignment, when > > not, to 'c'. By changing pg_dump in binary upgrade mode to emit the > > necessary options, and by adding such options during bki processing, > > we'd solve 1) and 2), but otherwise gain the benefits. > > > > Alternatively we could declare such a propert on the table level, but > > that seems more restrictive, without a corresponding upside. > I don't know, but it seems entirely sufficient specifying this at the > table level, no? What would be the use case for removing padding for > only some of the columns? I don't see the use case for that. Well, if we had it on a per-table level, we'd also align a) catalog table columns that follow the first varlena column - which we don't need to align, as they can't be accessed via mapping b) columns in pg_upgraded tables that have been added after the upgrade > > It's possible that we should do something related with a few varlena > > datatypes. We currently use intalign for types like text, json, and as > > far as I can tell that does not make all that much sense. They're not > > struct mappable *anyway* (and if they were, they'd need to be 8 byte > > aligned on common platforms, __alignof__(void*) is 8). We'd have to take > > a bit of care to treat the varlena header as unaligned - but we need to > > do so anyway, because of 1byte varlenas. Short varlenas seems to make it > > less crucial to pursue this, as the average datum that'd benefit is long > > enough to make padding a non-issue. So I don't think it'd make sense to > > tackle this project at the same time. > > > > Not sure, but how come it's not failing on the picky platforms, then? On > x86 it's probably OK because it's pretty permissive, but I'd expect some > platforms (s390, parisc, itanium, powerpc, ...) to be much pickier. I'm not quite following? As I said, we already need to use alignment aware code due to caring for short varlenas. And these types aren't actually struct mappable. Greetings, Andres Freund
On Thu, Oct 31, 2019 at 12:24:33PM -0700, Andres Freund wrote: >Hi, > >On 2019-10-31 20:15:12 +0100, Tomas Vondra wrote: >> On Thu, Oct 31, 2019 at 11:48:21AM -0700, Andres Freund wrote: >> > We currently align byval types such as int4/8, float4/8, timestamp *, >> > date etc, even though we mostly don't need to. When tuples are deformed, >> > all byval types are copied out from the tuple data into the >> > corresponding Datum array, therefore the original alignment in the tuple >> > data doesn't matter. This is different from byref types, where the >> > Datum formed will often be a pointer into the tuple data. >> > >> > While there are some older systems where it could be a bit slower to >> > copy data out from unaligned positions into the datum array, this is >> > more than bought back by the next point: >> > >> > >> > The fact that these types are aligned has substantial costs: >> > >> > For one, we often waste substantial amounts of space inside tables with >> > alignment padding. It's not uncommon to see about 30% or more of space >> > wasted (especially when taking alignment of the first column into >> > account). >> > >> > For another, and this I think is less obvious, we actually waste >> > substantial amounts of CPU maintaining the alignment. This is primarily >> > the case because we have to perform to align the pointer to the next >> > field during tuple [de]forming. Those instructions [1] have to be >> > executed taking time, but what's worse, they also reduce the ability of >> > out-of-order execution to hide latencies. There's a hard dependency on >> > knowing the offset to the next column to be able to continue with the >> > next column. >> > >> >> Right. Reducing this overhead was one of the goals to allow "logical >> ordering" of columns in a table (while arbitrarily reordering the >> physical ones), but that patch got out of hand pretty quickly. Also, >> it'd still keep some of the overhead, because it was not removing the >> alignment/padding entirely. > >Yea. It'd keep just about all the CPU overhead, because we'd still need >to align as soon as there is a preceding nulled or varlena colum. > >There's still some benefit for logical column order, as grouping NOT >NULL fixed-length columns at the start is beneficial. And it's also >beneficial to have frequently accessed columns at the start. But I think >this proposal gains a lot of the space related benefits, at a much lower >complexity, together with a lot of other benefits. > +1 > >> > There's two reasons why we can't just set the alignment for these types >> > to 'c'. >> > 1) pg_upgrade, for fairly obvious reasons >> > 2) We map catalog table rows to structs, in a *lot* of places. >> > >> > >> > It seems to me that, despite the above, it's still worth trying to >> > improve upon the current state, to benefit from reduced space and CPU >> > usage. >> > >> > As it turns out we already separate out the alignment for a type, and a >> > column, between pg_type.typalign and pg_attribute.attalign. One way to >> > tackle this would be to allow to specify, for byval types only, at >> > column creation time whether a column uses a 'struct-mappable' alignment >> > or not. When set, set attalign to pg_type.typalign for alignment, when >> > not, to 'c'. By changing pg_dump in binary upgrade mode to emit the >> > necessary options, and by adding such options during bki processing, >> > we'd solve 1) and 2), but otherwise gain the benefits. >> > >> > Alternatively we could declare such a propert on the table level, but >> > that seems more restrictive, without a corresponding upside. > >> I don't know, but it seems entirely sufficient specifying this at the >> table level, no? What would be the use case for removing padding for >> only some of the columns? I don't see the use case for that. > >Well, if we had it on a per-table level, we'd also align >a) catalog table columns that follow the first varlena column - which we don't need > to align, as they can't be accessed via mapping >b) columns in pg_upgraded tables that have been added after the upgrade > Hmm, OK. I think the question is whether it's worth the extra complexity. I'd say it's not, but perhaps I'm wrong. > >> > It's possible that we should do something related with a few varlena >> > datatypes. We currently use intalign for types like text, json, and as >> > far as I can tell that does not make all that much sense. They're not >> > struct mappable *anyway* (and if they were, they'd need to be 8 byte >> > aligned on common platforms, __alignof__(void*) is 8). We'd have to take >> > a bit of care to treat the varlena header as unaligned - but we need to >> > do so anyway, because of 1byte varlenas. Short varlenas seems to make it >> > less crucial to pursue this, as the average datum that'd benefit is long >> > enough to make padding a non-issue. So I don't think it'd make sense to >> > tackle this project at the same time. >> > >> >> Not sure, but how come it's not failing on the picky platforms, then? On >> x86 it's probably OK because it's pretty permissive, but I'd expect some >> platforms (s390, parisc, itanium, powerpc, ...) to be much pickier. > >I'm not quite following? As I said, we already need to use alignment >aware code due to caring for short varlenas. And these types aren't >actually struct mappable. > Sorry, I misread/misunderstood what you wrote. cheers -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services