Re: B-Tree support function number 3 (strxfrm() optimization) - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: B-Tree support function number 3 (strxfrm() optimization)
Date
Msg-id CAM3SWZSq96DKTa4otA5Z71-8PwhNrWEkPFYJKGaArGhaSMdKHQ@mail.gmail.com
Whole thread Raw
In response to Re: B-Tree support function number 3 (strxfrm() optimization)  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: B-Tree support function number 3 (strxfrm() optimization)
Re: B-Tree support function number 3 (strxfrm() optimization)
Re: B-Tree support function number 3 (strxfrm() optimization)
List pgsql-hackers
Thanks for looking into this.

On Thu, Jun 12, 2014 at 9:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Still, it's fair to say that on this Linux system, the first 8 bytes
> capture a significant portion of the entropy of the first 8 bytes of
> the string, whereas on MacOS X you only get entropy from the first 2
> bytes of the string.  It would be interesting to see results from
> other platforms people might care about also.

Right. It was a little bit incautious of me to say that we had the
full benefit of 8 bytes of storage with "en_US.UTF-8", since that is
only true of lower case characters (I think that FreeBSD can play
tricks with this. Sometimes, it will give you the benefit of 8 bytes
of entropy for an 8 byte string, with only non-differentiating
trailing bytes, so that the first 8 bytes of "Aaaaaaaa" are distinct
from the first eight bytes of "aaaaaaaa", while any trailing bytes are
non-distinct for both). In any case it's pretty clear that a goal of
the glibc implementation is to concentrate as much entropy as possible
into the first part of the string, and that's the important point.
This makes perfect sense, and is why I was so incredulous about the
Mac behavior. After all, the Open Group's strcoll() documentation
says:

"The strxfrm() and strcmp() functions should be used for sorting large lists."

Sorting text is hardly an infrequent requirement -- it's almost the
entire reason for having strxfrm() in the standard. You're always
going to want to have each strcmp() find differences as early as
possible.

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Noah Misch
Date:
Subject: Crash on backend exit w/ OpenLDAP [2.4.24, 2.4.31]
Next
From: Noah Misch
Date:
Subject: Re: lo_create(oid, bytea) breaks every extant release of libpq