On 28 Jul 2014, at 5:23 PM, Peter Geoghegan wrote:
> On Mon, Jul 28, 2014 at 5:14 PM, Wim Lewis <wiml@omnigroup.com> wrote:
>> A quick glance at OSX's strxfrm() suggests they're using an implementation of strxfrm() from FreeBSD. You can find
thesource here:
>>
>> http://www.opensource.apple.com/source/Libc/Libc-997.90.3/string/FreeBSD/strxfrm.c
>>
>> (and a really quick glance at the contents of libc on OSX 10.9 reinforces this--- I don't see any calls into their
CoreFoundationunicode string APIs.)
>
> Something isn't quite accounted for, then. The FreeBSD behavior is to
> append the primary weights only. That makes their returned blobs
> smaller than those you'll see on Linux, but also appears to imply that
> their implementation is substandard (The PostgreSQL port uses ICU on
> FreeBSD for a reason, I suppose). But FreeBSD did not add extra,
> redundant "header bytes" right in the primary level when I tested it,
> but I'm told Mac OS X does.
I don't think OSX actually does. From a look at the source and a simple test program, OSX's strxfrm represents its
outputas a series of 24-bit weights, each one encoded into four bytes. Multiple "levels" of weights are separated by a
weightof 0 (which is encoded as the four bytes "0000").
Robert Haas' message in this thread on 7 April is the first mention of a header, but his examples from 12 June don't
reallydemonstrate a header--- they're completely consistent with the description above and the published source code.
OSX's strxfrm output is very space-inefficient, especially for Latin text, where none of the weights seem to be greater
than2^12, meaning that half of the bytes in any output are always going to be 0x30. (Testing with, e.g., Hangul
charactersI find some weights that use more of the space.) But as far as I can tell it's not completely crazy. :)
What this means is that on OSX, comparing the first 8 bytes of strxfrm output will end up comparing the primary weights
ofthe first two characters in the original string. Which is the same conclusion Robert Haas came to earlier, modulo the
"headerbytes" interpretation.