I wrote:
> However, I then tried a partitioned equivalent of the 6-column case
> (script also attached), and it looks like
> 6 columns 16551 19097 15637 18201
> which is really noticeably worse, 16% or so.
... and on the third hand, that might just be some weird compiler-
and platform-specific artifact.
Using the exact same compiler (RHEL8's gcc 8.3.1) on a different
x86_64 machine, I measure the same case as about a 7% slowdown, not
16%. That's still not great, but it calls the original measurement
into question, for sure.
Using Apple's clang 12.0.0 on an M1 mini, the patch actually clocks
in a couple percent *faster* than HEAD, for both the partitioned and
unpartitioned 6-column test cases.
So I'm not sure what to make of these results, but my level of concern
is less than it was earlier today. I might've just gotten trapped by
the usual bugaboo of micro-benchmarking, i.e., putting too much stock in
only one test case.
regards, tom lane