For v10, I've split the patch up into two parts. 0001 uses pure C everywhere. This is much smaller and easier to review, and gets us the most bang for the buck.
One concern Heikki raised upthread is that platforms with poor unaligned-memory access will see a regression. We could easily add an #ifdef to take care of that, but I haven't done so here.
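For illustration only, such a guard might look something like the sketch below. This is not code from 0001: the macro SKETCH_SLOW_UNALIGNED_ACCESS and the function sketch_is_all_ascii are hypothetical names, and the word-at-a-time high-bit test is an assumed stand-in for whatever the patch actually does. With the macro defined, only the byte-at-a-time loop is compiled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical macro, not from the patch: define it on platforms where
 * unaligned memory access is slow, so that only the byte loop is used.
 */
/* #define SKETCH_SLOW_UNALIGNED_ACCESS */

/* Hypothetical helper: returns true if all len bytes of s are 7-bit ASCII. */
static bool
sketch_is_all_ascii(const unsigned char *s, size_t len)
{
#ifndef SKETCH_SLOW_UNALIGNED_ACCESS
	/* Word-at-a-time path: test the high bit of 8 bytes per iteration. */
	while (len >= sizeof(uint64_t))
	{
		uint64_t	chunk;

		/* memcpy keeps the possibly unaligned load well-defined in C */
		memcpy(&chunk, s, sizeof(chunk));
		if (chunk & UINT64_C(0x8080808080808080))
			return false;
		s += sizeof(chunk);
		len -= sizeof(chunk);
	}
#endif

	/* Byte-at-a-time path: the tail, or all input when the guard is set. */
	while (len > 0)
	{
		if (*s & 0x80)
			return false;
		s++;
		len--;
	}
	return true;
}
```

Whether a guard like this is worth having would depend on measurements on the affected platforms, which I don't have.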
To recap: on ASCII-only input with storage taken out of the picture, profiles of COPY FROM show a reduction from nearly 10% down to just over 1%. In microbenchmarks found earlier in this thread, this works out to about 7 times faster. On multibyte/mixed input, 0001 is a bit faster, but not really enough to make a difference in COPY performance.
0002 adds the SSE4 implementation on x86-64. It is equally fast on all input, at the cost of greater complexity.
To reflect the split, I've changed the thread subject and the commitfest title.
--