Re: Refactor to introduce pg_strcoll(). - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Refactor to introduce pg_strcoll().
Date
Msg-id CA+hUKG+BgA7nXBW22hZR2c1c=kBazZiojxotnYe1PjFMj1ELMw@mail.gmail.com
In response to Re: Refactor to introduce pg_strcoll().  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
+    /* Win32 does not have UTF-8, so we need to map to UTF-16 */

I wonder if this is still true.  I think in Windows 10+ you can enable
UTF-8 support.  Then could you use strcoll_l() directly?  I struggled
to understand that, but I am a simple Unix hobbit from the shire so I
dunno.  (Perhaps the *whole OS* has to be in that mode, so you might
have to do a runtime test?  This was discussed in another thread that
mostly left me confused[1].)
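
To make the "runtime test" idea concrete, here is a minimal sketch, assuming that checking the process's active ANSI code page is the right test. The helper name process_code_page_is_utf8() is hypothetical and is not part of the patch; GetACP() and CP_UTF8 are existing Win32 names. Whether strcoll_l() then behaves sensibly on UTF-8 input is a separate question that this sketch does not answer.

```c
#include <stdbool.h>

#ifdef _WIN32
#include <windows.h>
#endif

/*
 * Hypothetical helper (an assumption, not from the patch): report whether
 * this process is running with UTF-8 as its active ANSI code page.
 * On non-Windows builds it simply returns false.
 */
bool
process_code_page_is_utf8(void)
{
#ifdef _WIN32
    /* GetACP() returns the active code page; CP_UTF8 is 65001. */
    return GetACP() == CP_UTF8;
#else
    return false;
#endif
}
```

A caller could use such a check to pick between the UTF-16 conversion path and a direct char-based comparison, but that branch would still need testing on real Windows installations.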

And that leads to another thought.  We have an old comment
"Unfortunately, there is no strncoll(), so ...".  Curiously, Windows
does actually have one, spelled _strncoll_l() (and some other libcs out
there have something similar).
So after skipping the expansion to wchar_t, one might think you could
avoid the extra copy required to nul-terminate the string (and hope
that it doesn't make an extra copy internally, which is far from given).
Unfortunately it seems to be defined in a strange way that doesn't
look like your pg_strncoll_XXX() convention: it has just one length
parameter, not one for each string.  That is, it's designed for
comparing prefixes of strings, not for working with
non-nul-terminated strings.  I'm not entirely sure if the interface
makes sense at all!  Is it measuring in 'chars' or 'encoded
characters'?  I would guess the former, like strncpy() et al, but then
what does it mean if it chops a UTF-8 sequence in half?  And at a
higher level, if you wanted to use it for our purpose, you'd
presumably need Min(s1_len, s2_len), but I wonder if there are string
pairs that would sort in a different order if the collation algorithm
could see more characters after that?  For example, in Dutch "ij" is
sometimes treated like a letter that sorts differently than "i" + "j"
normally would, so if you arbitrarily chop that "j" off while
comparing common-length prefix you might get into trouble; likewise
for "aa" in Danish.  Perhaps these sorts of problems explain why it's
not in the standard (though I see it was at some point in some kind of
draft; I don't grok the C standards process enough to track down what
happened but WG20/WG14 draft N1027[2] clearly contains strncoll_l()
alongside the stuff that we know and use today).  Or maybe I'm
underthinking it.
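
To illustrate the two points above, here is a rough sketch under stated assumptions. Both function names are hypothetical and neither comes from the patch. The first shows the "extra copy" approach with one length per string, built only on standard strcoll(). The second shows how a single byte-count limit can land in the middle of a UTF-8 sequence, assuming the input is valid UTF-8.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: compare two length-delimited byte strings under the
 * current LC_COLLATE by making nul-terminated copies and calling strcoll().
 * This is the "extra copy" that a true two-length strncoll() would avoid.
 */
int
strncoll_by_copy(const char *a, size_t alen, const char *b, size_t blen)
{
    char   *abuf = malloc(alen + 1);
    char   *bbuf = malloc(blen + 1);
    int     result;

    /* A real implementation would report out-of-memory properly. */
    assert(abuf != NULL && bbuf != NULL);

    memcpy(abuf, a, alen);
    abuf[alen] = '\0';
    memcpy(bbuf, b, blen);
    bbuf[blen] = '\0';

    result = strcoll(abuf, bbuf);

    free(abuf);
    free(bbuf);
    return result;
}

/*
 * Hypothetical helper: return 1 if keeping only the first n bytes of s
 * (which is len bytes of valid UTF-8) would cut a multibyte sequence in
 * half, i.e. if the byte at offset n is a continuation byte (10xxxxxx).
 */
int
utf8_cut_splits_sequence(const char *s, size_t len, size_t n)
{
    if (n == 0 || n >= len)
        return 0;
    return ((unsigned char) s[n] & 0xC0) == 0x80;
}
```

With two lengths, each string is compared in full, so the collation algorithm sees everything it needs. With one shared length such as Min(s1_len, s2_len), a check like utf8_cut_splits_sequence() would at least detect a cut inside a multibyte character, but it would not detect a cut inside a multi-character collating element such as Dutch "ij" or Danish "aa".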

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJ%3DXThErgAQRoqfCy1bKPxXVuF0%3D2zDbB%2BSxDs59pv7Fw%40mail.gmail.com
[2] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1027.pdf


