Re: Move defaults toward ICU in 16? - Mailing list pgsql-hackers
From | Jeff Davis |
---|---|
Subject | Re: Move defaults toward ICU in 16? |
Date | |
Msg-id | 438b15cf88c9c5cf0cf1c24bf7b835d6362a8e0c.camel@j-davis.com Whole thread Raw |
In response to | Re: Move defaults toward ICU in 16? (Peter Eisentraut <peter.eisentraut@enterprisedb.com>) |
Responses |
Re: Move defaults toward ICU in 16?
|
List | pgsql-hackers |
On Thu, 2023-03-02 at 10:37 +0100, Peter Eisentraut wrote: > I would skip this. There was a brief discussion about this at [0], > where I pointed out that if we are going to do something like that, > there would be other candidates among the optional dependencies to > promote, such as certainly openssl and arguably lz4. If we don't do > this consistently across dependencies, then there will be confusion. The difference is that ICU affects semantics of collations, and collations are not really an optional feature. If postgres is built without ICU, that will affect the default at initdb time (after patch 003, anyway), which will then affect the default collations in all databases. > In practice, I don't think it matters. Almost all installations are > made by packagers, who will make their own choices. Mostly true, but the discussion at [0] reveals that some people do build postgresql themselves for whatever reason. When I first started out with postgres I always built from source. That was quite a while ago, so maybe that means nothing; but it would be sad to think that the build-it-yourself experience doesn't matter. > > 0002: update template0 in new cluster (as described above) > > This makes sense. I'm confused what the code originally wanted to > achieve, e.g., > > -/* > - * Check that every database that already exists in the new cluster > is > - * compatible with the corresponding database in the old one. > - */ > -static void > -check_databases_are_compatible(void) > > Was there once support for the new cluster having additional > databases > in place? Weird. It looks like 33755e8edf was the last significant change here. CC Heikki for comment. > In any case, I think you can remove additional code from > get_db_infos() > related to fields that are no longer used, such as db_encoding, and > the > corresponding struct fields in DbInfo. You're right: there's not much of an intersection between the code that needs a DbInfo and the code that needs the locale fields. I created a separate DbLocaleInfo struct for the template0 locale information, and removed the locale fields from DbInfo. I also added a TAP test. > > 0003: default initdb to use ICU > > What are the changes in the citext tests about? There's a test in citext_utf8.sql for: SELECT 'i'::citext = 'İ'::citext AS t; citext_eq uses DEFAULT_COLLATION_OID, comparing the results after applying lower(). Apparently: lower('İ' collate "en_US") = 'i' -- true lower('İ' collate "tr-TR-x-icu") = 'i' -- true lower('İ' collate "en-US-x-icu") = 'i' -- false the test was passing before because it seems to be true for all libc locales. But for ICU, it seems to only be true in the "tr-TR" locale at colstrength=secondary (and true for other ICU locales at colstrength=primary). We can't fix the test by being explicit about the collation, because citext doesn't pass it down; it always uses the default collation. We could fix citext to pass it down properly, but that seems like a different patch. In any case, citext doesn't seem very important to those using ICU (we have a doc note suggesting ICU instead), so I don't see a strong reason to test the combination. So, I just exit the test early if it's ICU. I added a better comment. > Is it the same issue as > in unaccent? In that case, the OR should be an AND? Maybe add a > comment? > > Why is unaccent is "broken" if the default collation is provided by > ICU > and LC_CTYPE=C? Is that a general problem? Should we prevent this > combination? A different issue: unaccent is calling t_isspace(), which is just not properly handling locales. t_isspace() always passes NULL as the last argument to char2wchar. There are TODO comments throughout that file. Specifically what happens: lc_ctype_is_c(DEFAULT_COLLATION_OID) returns false so it calls char2wchar(), which calls mbstowcs() which returns an error because the LC_CTYPE=C Right now, that's a longstanding issue for all users of t_isspace() and related functions: tsearch, ltree, pg_trgm, dict_xsyn, and unaccent. I assume it was known and considered unimportant, otherwise we wouldn't have left the TODO comments in there. I believe it's only a problem when the provider is ICU and the LC_CTYPE is C. I think a quick fix would be to just test LC_CTYPE directly (from the environment or setlocale(LC_CTYPE, NULL)) rather than try to extract it from the default collation. It sounds like a separate patch, and should be handled as a bugfix and backported. A better fix would be to support character classification in ICU. I don't think that's hard, but ICU has quite a few options, and we'd need to discuss which are the right ones to support. We may also want to pass collation information down rather than just using the database default, but that may depend on the caller and we should discuss that, as well. > What are the changes in the ecpg tests about? Looks harmless, but if > there is a need, maybe it should be commented somewhere, otherwise > what > prevents someone from changing it back? ICU is not compatible with SQL_ASCII, so I had to remove the ENCODING=SQL_ASCII line from the ecpg test build. CC Michael Meskes who added the line in 1fa6be6f69 in case he has a comment. But when I did that, I got CI failures on windows because it couldn't transcode between LATIN1 and WIN1252. So I changed the ecpg test to just use SQL_ASCII for the client_encoding (not the server encoding). Michael Meskes added the client_encoding parameter test in 5e7710e725, so he might have a comment about that as well. Since I removed the code, I didn't see a clear place to add a comment, but if you have a suggestion I'll take it. -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
pgsql-hackers by date: