
Re: Refactor to introduce pg_strcoll(). - Mailing list pgsql-hackers

From	Thomas Munro
Subject	Re: Refactor to introduce pg_strcoll().
Date	March 6, 2023 01:20:48
Msg-id	CA+hUKG+BgA7nXBW22hZR2c1c=kBazZiojxotnYe1PjFMj1ELMw@mail.gmail.com
In response to	Re: Refactor to introduce pg_strcoll(). (Jeff Davis <pgsql@j-davis.com>)
List	pgsql-hackers


+ /* Win32 does not have UTF-8, so we need to map to UTF-16 */

I wonder if this is still true. I think in Windows 10+ you can enable
UTF-8 support. Then could you use strcoll_l() directly? I struggled
to understand that, but I am a simple Unix hobbit from the shire so I
dunno. (Perhaps the *whole OS* has to be in that mode, so you might
have to do a runtime test? This was discussed in another thread that
mostly left me confused[1].)

And that leads to another thought. We have an old comment
"Unfortunately, there is no strncoll(), so ...". Curiously, Windows
does actually have strncoll_l(), spelled _strncoll_l() there (as do some other libcs).
So after skipping the expansion to wchar_t, one might think you could
avoid the extra copy required to nul-terminate the string (and hope
that it doesn't make an extra copy internally, which is far from a given).
Unfortunately it seems to be defined in a strange way that doesn't
look like your pg_strncoll_XXX() convention: it has just one length
parameter, not one for each string. That is, it's designed for
comparing prefixes of strings, not for working with
non-nul-terminated strings. I'm not entirely sure if the interface
makes sense at all! Is it measuring in 'chars' or 'encoded
characters'? I would guess the former, like strncpy() et al, but then
what does it mean if it chops a UTF-8 sequence in half? And at a
higher level, if you wanted to use it for our purpose, you'd
presumably need Min(s1_len, s2_len), but I wonder if there are string
pairs that would sort in a different order if the collation algorithm
could see more characters after that? For example, in Dutch "ij" is
sometimes treated like a letter that sorts differently than "i" + "j"
normally would, so if you arbitrarily chop that "j" off while
comparing common-length prefix you might get into trouble; likewise
for "aa" in Danish. Perhaps these sorts of problems explain why it's
not in the standard (though I see it was at some point in some kind of
draft; I don't grok the C standards process enough to track down what
happened but WG20/WG14 draft N1027[2] clearly contains strncoll_l()
alongside the stuff that we know and use today). Or maybe I'm
underthinking it.
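
To make the two options above concrete, here is a minimal sketch in plain C.
It is not the patch's code: coll_with_copy() and coll_with_prefix() are
hypothetical names, and strncmp() stands in for a single-length
strncoll()-style function, so the second variant only shows the shape of
the interface problem, not real locale behavior.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reduce a comparison result to -1, 0 or 1. */
static int
sign(int x)
{
    return (x > 0) - (x < 0);
}

/*
 * Option A: copy both length-delimited arguments into nul-terminated
 * buffers and hand them to strcoll().  This is the extra copy that the
 * "there is no strncoll()" comment is complaining about.
 */
static int
coll_with_copy(const char *s1, size_t len1, const char *s2, size_t len2)
{
    char   *buf;
    int     result;

    assert(s1 != NULL && s2 != NULL);

    buf = malloc(len1 + len2 + 2);
    if (buf == NULL)
        abort();

    memcpy(buf, s1, len1);
    buf[len1] = '\0';
    memcpy(buf + len1 + 1, s2, len2);
    buf[len1 + len2 + 1] = '\0';

    result = sign(strcoll(buf, buf + len1 + 1));
    free(buf);
    return result;
}

/*
 * Option B: a comparison that takes only ONE length, applied to the
 * common-length prefix, with the lengths as a tie-breaker.  strncmp()
 * is an assumption standing in for a strncoll()-like call.  The
 * tie-break is right for bytewise ordering; whether it is right for a
 * collation with contractions such as Dutch "ij" is exactly the open
 * question, because the prefix comparison never sees the trailing bytes.
 */
static int
coll_with_prefix(const char *s1, size_t len1, const char *s2, size_t len2)
{
    size_t  n = len1 < len2 ? len1 : len2;
    int     result;

    assert(s1 != NULL && s2 != NULL);

    result = sign(strncmp(s1, s2, n));
    if (result == 0 && len1 != len2)
        result = len1 < len2 ? -1 : 1;
    return result;
}
```

In the default "C" locale both functions agree, for example on ("ab", 2)
versus ("abc", 3). The concern is only about locales where the bytes
beyond Min(s1_len, s2_len) could change how the prefix itself is weighted.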

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJ%3DXThErgAQRoqfCy1bKPxXVuF0%3D2zDbB%2BSxDs59pv7Fw%40mail.gmail.com
[2] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1027.pdf
