Thread: ICU locale validation / canonicalization
Right now, ICU locales are not validated: initdb ... --locale-provider=icu --icu-locale=anything CREATE COLLATION foo (PROVIDER=icu, LOCALE='anything'); CREATE DATABASE anythingdb ICU_LOCALE 'anything'; all succeed. We do check that the value is accepted by ICU, but ICU seems to accept anything and use some fallback logic. Bogus strings will typically end up as the "root" locale (spelled "root" or ""). At first, I thought this was a bug. The ICU documentation[1] suggests that the fallback logic can result in using the ICU default locale in some cases. The default locale is problematic because it's affected by the environment (LANG, LC_ALL, and strangely LC_MESSAGES; but strangely not LC_COLLATE). Fortunately, I didn't find any cases where it actually does fall back to the default locale, so I think we're safe, but validation seems wise regrardless. In different contexts we may want to fail (e.g. initdb with a bogus locale), or warn, issue a notice that we changed the string, or just silently change what the user entered to be in a consistent form. BCP47 [2] seems to be the standard here, and we're already using it when importing the icu collations. ICU locale validation is not exactly straightforward, though, and I suppose that's why it isn't already done. There's a document[3] that explains canonicalization in terms of "level 1" and "level 2", and says that ucol_canonicalize() provides level 2 canonicalization, but I am not seeing all of the documented behavior in my tests. For instance, the document says that "de__PHONEBOOK" should canonicalize to "de@collation=phonebook", but instead I see that it remains "de__PHONEBOOK". It also says that "C" should canonicalize to "en_US_POSIX", but in my test, it just goes to "c". The right entry point appears to get uloc_getLanguageTag(), which internally calls uloc_canonicalize, but also converts to BCP47 format, and gives the option for strictness. Non-strict mode seems problematic because for "de__PHONEBOOK", it returns a langtag of plain "de", which is a different actual locale than "de__PHONEBOOK". If uloc_canonicalize worked as documented, it would have changed it to "de@collation=phonebook" and the correct language tag "de-u-co-phonebk" would be returned, which would find the right collator. I suppose that means we would need to use strict mode. And then we need to check whether it actually exists; i.e. reject well- formed but bogus locales, like "wx-YZ". To do that, probably the most straightforward way would be to initialize a UCollator and then query it using ucol_getLocaleByType() with ULOC_VALID_LOCALE. If that results in the root locale, we could say that it doesn't exist because it failed to find a more suitable match (unless the user explicitly requested the root locale). If it resolves to something else, we could either just assume it's fine, or we could try to validate that it matches what we expect in more detail. To be safe, we could double- check that the resulting BCP 47 locale string loads the same actual collator as what would have been loaded with the original string (also check attributes?). The overall benefit here is that we keep our catalogs consistently using an independent standard format for ICU locale strings, rather than whatever the user specifies. That makes it less likely that ICU needs to use any fallback logic when trying to open a collator, which could only be bad news. Thoughts? [1] https://unicode-org.github.io/icu/userguide/locale/#fallback [2] https://en.wikipedia.org/wiki/IETF_language_tag [3] https://unicode-org.github.io/icu/userguide/locale/#canonicalization -- Jeff Davis PostgreSQL Contributor Team - AWS
On 08.02.23 08:59, Jeff Davis wrote: > The overall benefit here is that we keep our catalogs consistently > using an independent standard format for ICU locale strings, rather > than whatever the user specifies. That makes it less likely that ICU > needs to use any fallback logic when trying to open a collator, which > could only be bad news. One use case is that if a user specifies a locale, say, of 'de-AT', this might canonicalize to 'de' today, but we should still store what the user specified because 1) that documents what the user wanted, and 2) it might not canonicalize to the same thing tomorrow.
On Wed, Feb 8, 2023 at 2:59 AM Jeff Davis <pgsql@j-davis.com> wrote: > We do check that the value is accepted by ICU, but ICU seems to accept > anything and use some fallback logic. Bogus strings will typically end > up as the "root" locale (spelled "root" or ""). I've noticed this, and I think it's really frustrating. There's barely any documentation of what strings you're allowed to specify, and the documentation that does exist is extremely difficult to understand. Normally, you could work around that problem to some degree by making a guess at what you're supposed to be doing and then seeing whether the program accepts it, but here that doesn't work either. It just accepts anything you give it and then you have to try to figure out whether the behavior is what you wanted. But there's also no real documentation of what the behavior of any collation is, so you're apparently just supposed to magically know what collations exist and how they behave and then you can test whether the string you put in gave you the behavior you wanted. Adding validation and canonicalization wouldn't cure the documentation problems, but it would be a big help. You still wouldn't know what string you were supposed to be passing to ICU, but if you did pass it a string, you'd find out what it thought that string meant. I think that would be a huge step forward. Unfortunately, I have no idea whether your specific ideas about how to make that happen are any good or not. But I hope they are, because the current situation is pessimal. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, 2023-02-09 at 15:44 +0100, Peter Eisentraut wrote: > One use case is that if a user specifies a locale, say, of 'de-AT', > this > might canonicalize to 'de' today, Canonicalization should not lose useful information, it should just rearrange it, so I don't see a risk here based on what I read and the behavior I saw. In ICU, "de-AT" canonicalizes to "de_AT" and becomes the language tag "de-AT". > but we should still store what the > user specified because 1) that documents what the user wanted, and 2) > it > might not canonicalize to the same thing tomorrow. We don't want to store things with ambiguous interpretations that could change tomorrow; that's a recipe for trouble. That's why most people store timestamps as the offset from some epoch in UTC rather than as "2/9/23" (Feb 9 or Sept 2? 1923 or 2023?). There are exceptions where you would want to store something like that, but I don't see why they'd apply in this case, where reinterpretation probably means a corrupted index. If the user wants to know how their ad-hoc string was interpreted, they can look at the resulting BCP 47 language tag, and see if it's what they meant. We can try to make this user-friendly by offering a NOTICE, WARNING, or helper functions that allow them to explore. We can also double check that the canonicalized form resolves to the same actual collator to be safe, and maybe even fall back to whatever the user specified if not. I'm open to discuss how strict we want to be and what kind of escape hatches we need to offer. There is still a risk that the BCP 47 language tag resolves to a different specific ICU collator or different collator version tomorrow. That's why we need to be careful about versioning (library versions or collator versions or both), and we've had long discussions about that. -- Jeff Davis PostgreSQL Contributor Team - AWS
On Thu, 2023-02-09 at 10:53 -0500, Robert Haas wrote: > Unfortunately, I have no idea whether your specific ideas about how > to > make that happen are any good or not. But I hope they are, because > the > current situation is pessimal. It feels like BCP 47 is the right catalog representation. We are already using it for the import of initial collations, and it's a standard, and there seems to be good support in ICU. There are a couple cases where canonicalization will succeed but conversion to a BCP 47 language tag will fail. One is for unsupported attributes, like "en_US@foo=bar". Another is a bug I found and reported here: https://unicode-org.atlassian.net/browse/ICU-22268 In both cases, we know that conversion has failed, and we have a choice about how to proceed. We can fail, warn and continue with the user- entered representation, or turn off the strictness checking and come up with some BCP 47 tag and see if it resolves to the same collator. I do like the ICU format locale IDs from a readability standpoint. "en_US@colstrength=primary" is more meaningful to me than "en-US-u-ks- level1" (the equivalent language tag). And the format is specified[1], even though it's not an independent standard. But I think the benefits of better validation, an independent standard, and the fact that we're already favoring BCP47 outweigh my subjective opinion. I also attached a simple test program that I've been using to experiment (not intended for code review). It's hard for me to say that I'm sure I'm right. I really just got involved in this a few months back, and had a few off-list conversations with Peter Eisentraut to try to learn more (I believe he is aligned with my proposal but I will let him speak for himself). I should also say that I'm not exactly an expert in languages or scripts. I assume that ICU and IETF are doing sensible things to accommodate the diversity of human language as well as they can (or at least much better than the Postgres project could do on its own). I'm happy to hear more input or other proposals. [1] https://unicode-org.github.io/icu/userguide/locale/#canonicalization -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
On 2/9/23 23:09, Jeff Davis wrote: > I do like the ICU format locale IDs from a readability standpoint. > "en_US@colstrength=primary" is more meaningful to me than "en-US-u-ks- > level1" (the equivalent language tag). And the format is specified[1], > even though it's not an independent standard. But I think the benefits > of better validation, an independent standard, and the fact that we're > already favoring BCP47 outweigh my subjective opinion. I have the same feeling one is readable and the other unreadable but the unreadable one is standardized. Hard call. And in general I agree, if we are going to make ICU default it needs to be more user friendly than it is now. Currently there is no nice way to understand if you entered the right locale or made a typo in the BCP47 syntax. Andreas
On Fri, 2023-02-10 at 01:04 +0100, Andreas Karlsson wrote: > I have the same feeling one is readable and the other unreadable but > the > unreadable one is standardized. Hard call. > > And in general I agree, if we are going to make ICU default it needs > to > be more user friendly than it is now. Currently there is no nice way > to > understand if you entered the right locale or made a typo in the > BCP47 > syntax. We will still allow the ICU format locale IDs for input; we would just convert them to BCP47 before storing them in the catalog. And there's an inverse function, so it's easy enough to offer a view that shows the ICU format locale IDs in addition to the BCP 47 tags. I don't think it's hugely important that we use BCP47; ICU format locale IDs would also make sense. But I do think we should be consistent to simplify things where we can -- collator versioning is hard enough without wondering how a user-entered string will be interpreted. And if we're going to be consistent, BCP 47 seems like the most obvious choice. -- Jeff Davis PostgreSQL Contributor Team - AWS
On 2/10/23 02:22, Jeff Davis wrote: > We will still allow the ICU format locale IDs for input; we would just > convert them to BCP47 before storing them in the catalog. And there's > an inverse function, so it's easy enough to offer a view that shows the > ICU format locale IDs in addition to the BCP 47 tags. Aha, then I misread your mail. Sorry! BCP 47 sounds perfect for storage. Andreas
On 09.02.23 22:15, Jeff Davis wrote: > On Thu, 2023-02-09 at 15:44 +0100, Peter Eisentraut wrote: >> One use case is that if a user specifies a locale, say, of 'de-AT', >> this >> might canonicalize to 'de' today, > Canonicalization should not lose useful information, it should just > rearrange it, so I don't see a risk here based on what I read and the > behavior I saw. In ICU, "de-AT" canonicalizes to "de_AT" and becomes > the language tag "de-AT". It turns out that 'de_AT' is actually a distinct collation from 'de' in CLDR, so that was not the best example. What behavior do you see for 'de_CH'?
On Thu, Feb 9, 2023 at 5:09 PM Jeff Davis <pgsql@j-davis.com> wrote: > I do like the ICU format locale IDs from a readability standpoint. > "en_US@colstrength=primary" is more meaningful to me than "en-US-u-ks- > level1" (the equivalent language tag). Sadly, neither of those means a whole lot to me? :-( How did you find out that those are equivalent? > And the format is specified[1], > even though it's not an independent standard. But I think the benefits > of better validation, an independent standard, and the fact that we're > already favoring BCP47 outweigh my subjective opinion. See, I'm confused, because that link says "If a keyword list is present it must be preceded by an at-sign" which makes it sound like it is talking about stuff like en_US@colstrength=primary rather than stuff like en-US-u-ks-level1. The examples are all that way too, like it gives examples like en_IE@currency=IEP and fr@collation=phonebook;calendar=islamic-civil. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, 2023-02-10 at 07:42 +0100, Peter Eisentraut wrote: > It turns out that 'de_AT' is actually a distinct collation from 'de' > in > CLDR, so that was not the best example. What behavior do you see for > 'de_CH'? The canonicalized form is de_CH and the bcp47 tag is de-CH. uloc_canonicalize() and uloc_getLanguageTag() are declared in uloc.h, and they aren't (as far as I can tell) tied to which collations are actually defined. -- Jeff Davis PostgreSQL Contributor Team - AWS
On Fri, 2023-02-10 at 09:43 -0500, Robert Haas wrote: > On Thu, Feb 9, 2023 at 5:09 PM Jeff Davis <pgsql@j-davis.com> wrote: > > I do like the ICU format locale IDs from a readability standpoint. > > "en_US@colstrength=primary" is more meaningful to me than "en-US-u- > > ks- > > level1" (the equivalent language tag). > > Sadly, neither of those means a whole lot to me? :-( > > How did you find out that those are equivalent? In our tests you can see colstrength=primary is used to mean "case insensitive". That's where I picked up the "colstrength" keyword, which is also present in the ICU sources, but now that you ask I'm embarassed that I don't see the keyword itself documented very well. This document https://unicode-org.github.io/icu/userguide/locale/#keywords lists keywords, but colstrength is not there. It's easy enough to find in the ICU source; I'm probably just missing the document. Here's the API reference, which tells you that you can set the strength of a collator (using the API, not the keyword): https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucol_8h.html#acc801048729e684bcabed328be85f77a The more precise definitions of the strengths are here: https://unicode-org.github.io/icu/userguide/collation/concepts.html#comparison-levels Regarding the equivalence of the two forms, uloc_toLanguageTag() and uloc_toLanguageTag() are inverses. As far as I can tell (a lower degree of assurance than you are looking for), if one succeeds, then the other will also succeed and produce the original result. There are another couple documents here (TR35): http://www.unicode.org/reports/tr35/ https://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options that seems to cover the "ks-level1" and how it maps to the collation strength. My examination of these standards is very superficial -- I'm basically just checking that they seem to be there. If I search for a string like "en-US-u-ks-level1", I only find Postgres-related results, so you could also question whether these standards are actually used. Using BCP 47 tags for icu locale strings, and moving to ICU (as discussed in the other thread) is basically a leap of faith in ICU. The docs aren't perfect, the source is hard to read, and we've found bugs. But it seems like a better place for us than libc for the reasons I mentioned in the other thread. > > And the format is specified[1], > > even though it's not an independent standard. But I think the > > benefits > > of better validation, an independent standard, and the fact that > > we're > > already favoring BCP47 outweigh my subjective opinion. > > See, I'm confused, because that link says "If a keyword list is > present it must be preceded by an at-sign" which makes it sound like > it is talking about stuff like en_US@colstrength=primary rather than > stuff like en-US-u-ks-level1. The examples are all that way too, like > it gives examples like en_IE@currency=IEP and > fr@collation=phonebook;calendar=islamic-civil. My paragraph was unclear, let me restate the point: To represent ICU locale strings in the catalog consistently, we have two choices, which as far as I can tell are equivalent: 1. ICU format Locale IDs. These are more readable, and still specified (albeit non-standard). 2. BCP47 language tags. These are standardized, there's better validation with "strict" mode, and we are already using them. Honestly I don't think it's hugely important which one we pick. But being consistent is important, so we need to pick one, and BCP 47 seems like the better option to me. -- Jeff Davis PostgreSQL Contributor Team - AWS
On Fri, Feb 10, 2023 at 12:54 PM Jeff Davis <pgsql@j-davis.com> wrote: > In our tests you can see colstrength=primary is used to mean "case > insensitive". That's where I picked up the "colstrength" keyword, which > is also present in the ICU sources, but now that you ask I'm embarassed > that I don't see the keyword itself documented very well. > > This document > https://unicode-org.github.io/icu/userguide/locale/#keywords > lists keywords, but colstrength is not there. It's easy enough to find > in the ICU source; I'm probably just missing the document. The fact that you're figuring out how it all works from reading the source code does not give me a warm feeling. > But it seems like a better place for us than libc for the reasons I > mentioned in the other thread. It may be. But sometimes I feel that's not setting our sights very high. :-( -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, 2023-02-10 at 22:50 -0500, Robert Haas wrote: > The fact that you're figuring out how it all works from reading the > source code does not give me a warm feeling. Right. On the other hand, the behavior is quite well documented, it was just the keyword that was undocumented (or I didn't find it). > > But it seems like a better place for us than libc for the reasons I > > mentioned in the other thread. > > It may be. But sometimes I feel that's not setting our sights very > high. :-( How much higher could we set our sights? What would the ideal collation provider look like? Those are good questions, but please let's take those questions to the thread about ICU as a default. The topic of this thread is: Given that we are already offering ICU support, should we canonicalize the locale string stored in the catalog? If so, should we use the ICU format locale IDs, or BCP 47 language tags? Do you have an opinion on that topic? If not, do you need additional information? -- Jeff Davis PostgreSQL Contributor Team - AWS
On Thu, 2023-02-09 at 14:09 -0800, Jeff Davis wrote: > It feels like BCP 47 is the right catalog representation. We are > already using it for the import of initial collations, and it's a > standard, and there seems to be good support in ICU. Patch attached. We should have been canonicalizing all along -- either with uloc_toLanguageTag(), as this patch does, or at least with uloc_canonicalize() -- before passing to ucol_open(). ucol_open() is documented[1] to work on either language tags or ICU format locale IDs. Anything else is invalid and ends up going through some fallback logic, probably after being mis-parsed. For instance, in ICU 72, "fr_CA.UTF-8" is not a valid ICU format locale ID or a valid language tag, and is resolved by ucol_open() to the actual locale "root"; but if you canonicalize it first (to the ICU format locale ID "fr_CA" or the language tag "fr-CA"), it correctly resolves to the actual locale "fr_CA". The correct thing to do is canonicalize first and then pass to ucol_open(). But because we didn't canonicalize in the past, there could be raw locale strings stored in the catalog that resolve to the wrong actual collator, and there could be indexes depending on the wrong collator, so we have to be careful during pg_upgrade. Say someone created two ICU collations, one with locale "en_US.UTF-8" and one with locale "fr_CA.UTF-8" in PG15. When they upgrade to PG16, this patch will check the language tag "en-US" and see that it resolves to the same locale as "en_US.UTF-8", and change to the language tag during upgrade (so "en-US" will be in the new catalog). But when it checks the language tag "fr-CA", it will notice that it resolves to a different locale than "fr_CA.UTF-8", and keep the latter string even though it's wrong, because some indexes might be dependent on that wrong collator. [1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucol_8h.html#a3b0bf34733dc208040e4157b0fe5fcd6 -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
On 10.02.23 18:53, Jeff Davis wrote: > To represent ICU locale strings in the catalog consistently, we have > two choices, which as far as I can tell are equivalent: > > 1. ICU format Locale IDs. These are more readable, and still specified > (albeit non-standard). > > 2. BCP47 language tags. These are standardized, there's better > validation with "strict" mode, and we are already using them. > > Honestly I don't think it's hugely important which one we pick. But > being consistent is important, so we need to pick one, and BCP 47 seems > like the better option to me. I found some discussion about this from when ICU support was first added. See this message as a starting point: https://www.postgresql.org/message-id/flat/5291804b-169e-3ba9-fdaf-fae8e7d2d959%402ndquadrant.com#96acb7eb9299c2ca64dbabcf58e11a90 There isn't much detail there, but the discussion and the current code seem pretty convinced that a) BCP47 tags are preferred, and b) They don't work with ICU versions before 54. I can't locate the source for claim b) anymore. However, it seems pretty clear that there is some cutoff, even if it isn't exactly 54. I would support transitioning this forward somehow, but we would need to know exactly what the impact would be.
New patch attached. The new patch also includes a GUC that (when enabled) validates that the collator is actually found. On Mon, 2023-02-20 at 15:46 +0100, Peter Eisentraut wrote: > a) BCP47 tags are preferred, and Agreed. > b) They don't work with ICU versions before 54. I tried in versions 50 through 53, and the language tags are supported, but I think I know why we don't use them: Prior to version 54, ICU would not set the collator attributes based on the locale name. That is the same for either language tags or ICU format locale IDs. However, for ICU format locale IDs, we added special code to parse the locale string and set the attributes ourselves. We didn't bother to add the same parsing logic for language tags, so if a language tag is found in the catalog, the parts of it that specify collation strength (for example) would be ignored. I don't know if that's an actual problem when importing the system collations, because I don't think we use any collator attributes, but it makes sense that we'd not favor language tags in ICU prior to v54. > I would support transitioning this forward somehow, but we would need > to > know exactly what the impact would be. I've done quite a bit of investigation, which I've described upthread. We need to transition somehow, because the prior behavior is incorrect for locales like "fr_CA.UTF-8". Our tests suggest that's an acceptable thing to do, but if we pass that straight to ucol_open(), then it gets misinterpreted as plain "fr" because it doesn't understand the "." as a valid separator. We must turn it into a language tag (or at least canonicalize it) before passing the string to ucol_open(). This misbehavior only affects a small number of locales, which resolve to a different actual collator than they should. The most problematic case is during pg_upgrade, where a slight behavior change would result in corrupt indexes. So during binary upgrade, my patch falls back to the original raw string (not the language tag) when it resolves to a different actual collator. If we want to be more paranoid, we could also provide a compatibility GUC to preserve the old misbehavior for newly-created collations, too, but I don't think that's necessary. There is also some interaction with pg_upgrade's ability to check whether the old and new cluster are compatible. If the catalog representation of the locale changes, then it could falsely believe the icu locales aren't compatible, because it's doing a simple string comparison. But as we are discussing in the other thread[1], the whole idea of checking for compatibility of the initialized cluster is strange: pg_upgrade should be in charge of making a compatible cluster to upgrade into (assuming the binaries are at least compatible). I don't see this as a major problem; we'll sort out the other thread first to allow ICU as the default, and then adapt this patch if necessary. [1] https://www.postgresql.org/message-id/20230214175957.idkb7shsqzp5nbll@awork3.anarazel.de -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
On Mon, 2023-02-20 at 15:23 -0800, Jeff Davis wrote: > > New patch attached. The new patch also includes a GUC that (when > enabled) validates that the collator is actually found. New patch attached. Now it always preserves the exact locale string during pg_upgrade, and does not attempt to canonicalize it. Before it was trying to be clever by determining if the language tag was finding the same collator as the original string -- I didn't find a problem with that, but it just seemed a bit too clever. So, only newly-created locales and databases have the ICU locale string canonicalized to a language tag. Also, I added a SQL function pg_icu_language_tag() that can convert locale strings to language tags, and check whether they exist or not. -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
On 28.02.23 06:57, Jeff Davis wrote: > On Mon, 2023-02-20 at 15:23 -0800, Jeff Davis wrote: >> >> New patch attached. The new patch also includes a GUC that (when >> enabled) validates that the collator is actually found. > > New patch attached. > > Now it always preserves the exact locale string during pg_upgrade, and > does not attempt to canonicalize it. Before it was trying to be clever > by determining if the language tag was finding the same collator as the > original string -- I didn't find a problem with that, but it just > seemed a bit too clever. So, only newly-created locales and databases > have the ICU locale string canonicalized to a language tag. > > Also, I added a SQL function pg_icu_language_tag() that can convert > locale strings to language tags, and check whether they exist or not. This patch appears to do about three things at once, and it's not clear exactly where the boundaries are between them and which ones we might actually want. And I think the terminology also gets mixed up a bit, which makes following this harder. 1. Canonicalizing the locale string. This is presumably what uloc_canonicalize() does, which the patch doesn't actually use. What are examples of what this does? Does the patch actually do this? 2. Converting the locale string to BCP 47 format. This converts 'de@collation=phonebook' to 'de-u-co-phonebk'. This is what uloc_getLanguageTag() does. 3. Validating the locale string, to reject faulty input. What are the relationships between these? I don't understand how the validation actually happens in your patch. Does uloc_getLanguageTag() do the validation also? Can you do canonicalization without converting to language tag? Can you do validation of un-canonicalized locale names? What is the guidance for the use of the icu_locale_validation GUC? The description throws in yet another term: "validates that ICU locale strings are well-formed". What is "well-formed"? How does that relate to the other concepts? Personally, I'm not on board with this behavior: => CREATE COLLATION test (provider = icu, locale = 'de@collation=phonebook'); NOTICE: 00000: using language tag "de-u-co-phonebk" for locale "de@collation=phonebook" I mean, maybe that is a thing we want to do somehow sometime, to migrate people to the "new" spellings, but the old ones aren't wrong. So this should be a separate consideration, with an option, and it would require various updates in the documentation. It also doesn't appear to address how to handle ICU before version 54. But, see earlier questions, are these three things all connected somehow?
On Thu, 2023-03-09 at 09:46 +0100, Peter Eisentraut wrote: > This patch appears to do about three things at once, and it's not > clear > exactly where the boundaries are between them and which ones we might > actually want. And I think the terminology also gets mixed up a bit, > which makes following this harder. > > 1. Canonicalizing the locale string. This is presumably what > uloc_canonicalize() does, which the patch doesn't actually use. What > are examples of what this does? Does the patch actually do this? Both uloc_canonicalize() and uloc_getLanguageTag() do Level 2 Canonicalization, which is described here: https://unicode-org.github.io/icu/userguide/locale/#canonicalization > 2. Converting the locale string to BCP 47 format. This converts > 'de@collation=phonebook' to 'de-u-co-phonebk'. This is what > uloc_getLanguageTag() does. Yes, though uloc_getLanguageTag() also canonicalizes. I consider converting to the language tag a part of "canonicalization", because it's the canonical form we agreed on in this thread. > 3. Validating the locale string, to reject faulty input. Canonicalization doesn't make sure the locale actually exists in ICU, so it's easy to make a typo like "jp_JP" instead of "ja_JP". After canonicalizing to a language tag, the former is "jp-JP" (resolving to the collator with valid locale "root") and the latter is "ja-JP" (resolving to the collator with valid locale "ja"). The former is clearly a mistake, and I call catching that mistake "validation". If the user specifies something other than the root locale (i.e. not "root", "und", or ""), and the locale resolves to a collator with a valid locale of "root", then this patch considers that to be a mistake and issues a WARNING (upgraded to ERROR if the GUC icu_locale_validation is true). > What are the relationships between these? 1 & 2 are closely related. If we canonicalize, we need to pick one canonical form: either BCP 47 or ICU format locale IDs. 3 is related, but can be seen as an independent change. > I don't understand how the validation actually happens in your patch. > Does uloc_getLanguageTag() do the validation also? Using the above definition of "validation" it happens inside icu_collator_exists(). > Can you do canonicalization without converting to language tag? If we used uloc_canonicalize(), it would give us ICU format locale IDs, and that would be a valid thing to do; and we could switch the canonical form from ICU format locale IDs to BCP 47 in a separate patch. I don't have a strong opinion, but if we're going to canonicalize, I think it makes sense to go straight to language tags. > Can you do validation of un-canonicalized locale names? Yes, though I feel like an un-canonicalized name is less stable in meaning, and so validation on that name may also be less stable. For instance, if we don't canonicalize "fr_CA.UTF-8", it resolves to plain "fr"; but if we do canonicalize it first, it resolves to "fr-CA". Will the uncanonicalized name always resolve to "fr"? I'm not sure, because the documentation says that ucol_open() expects either an ICU format locale ID or, preferably, a language tag. So they are technically independently useful changes, but I would recommend that canonicalization goes in first. > What is the guidance for the use of the icu_locale_validation GUC? If an error when creating a new collation or database due to a bad locale name would be highly disruptive, leave it false. If such an error would be helpful to make sure you get the locale you expect, then turn it on. In practice, existing important production systems would leave it off; new systems could turn it on to help avoid misconfigurations/mistakes. > The description throws in yet another term: "validates that ICU > locale > strings are well-formed". What is "well-formed"? How does that > relate > to the other concepts? Good point, I don't think I need to redefine "validation". Maybe I should just describe it as elevating canonicalization or validation problems from WARNING to ERROR. > Personally, I'm not on board with this behavior: > > => CREATE COLLATION test (provider = icu, locale = > 'de@collation=phonebook'); > NOTICE: 00000: using language tag "de-u-co-phonebk" for locale > "de@collation=phonebook" > > I mean, maybe that is a thing we want to do somehow sometime, to > migrate > people to the "new" spellings, but the old ones aren't wrong. I see what you mean; I'm not sure the best thing to do here. We are adjusting the string passed by the user, and it feels like some users might want to know that. It's a NOTICE, not a WARNING, so it's not meant to imply that it's wrong. But at the same time I can see it being annoying or confusing. If it's confusing, perhaps a wording change and documentation would improve it? If it's annoying, we might need to have an option and/or a different log level? > It also doesn't appear to address > how to handle ICU before version 54. Do you have a specific concern here? Regards, Jeff Davis
On 09.03.23 21:17, Jeff Davis wrote: >> Personally, I'm not on board with this behavior: >> >> => CREATE COLLATION test (provider = icu, locale = >> 'de@collation=phonebook'); >> NOTICE: 00000: using language tag "de-u-co-phonebk" for locale >> "de@collation=phonebook" >> >> I mean, maybe that is a thing we want to do somehow sometime, to >> migrate >> people to the "new" spellings, but the old ones aren't wrong. > > I see what you mean; I'm not sure the best thing to do here. We are > adjusting the string passed by the user, and it feels like some users > might want to know that. It's a NOTICE, not a WARNING, so it's not > meant to imply that it's wrong. For clarification, I wasn't complaining about the notice, but about the automatic conversion from old-style ICU locale ID to language tag. >> It also doesn't appear to address >> how to handle ICU before version 54. > > Do you have a specific concern here? What we had discussed a while ago in one of these threads is that ICU before version 54 do not support keyword lists, and we have custom code to do that parsing ourselves, but we don't have code to do the same for language tags. Therefore, if I understand this right, if we automatically convert ICU locale IDs to language tags, as shown above, then we break support for such locales in those older ICU versions.
On Mon, 2023-03-13 at 08:25 +0100, Peter Eisentraut wrote: > For clarification, I wasn't complaining about the notice, but about > the > automatic conversion from old-style ICU locale ID to language tag. Canonicalization means that we pick one format, and automatically convert to it, right? > What we had discussed a while ago in one of these threads is that ICU > before version 54 do not support keyword lists, and we have custom > code > to do that parsing ourselves, but we don't have code to do the same > for > language tags. Therefore, if I understand this right, if we > automatically convert ICU locale IDs to language tags, as shown > above, > then we break support for such locales in those older ICU versions. Right. In versions 53 and earlier, and during pg_upgrade, we would just preserve the locale string as entered. Alternatively, we could canonicalize to the ICU format locale IDs. Or add something to parse out the attributes from a language tag. Regards, Jeff Davis
On 13.03.23 16:31, Jeff Davis wrote: >> What we had discussed a while ago in one of these threads is that ICU >> before version 54 do not support keyword lists, and we have custom >> code >> to do that parsing ourselves, but we don't have code to do the same >> for >> language tags. Therefore, if I understand this right, if we >> automatically convert ICU locale IDs to language tags, as shown >> above, >> then we break support for such locales in those older ICU versions. > > Right. In versions 53 and earlier, and during pg_upgrade, we would just > preserve the locale string as entered. Another issue that came to mind: Right now, you can, say, develop SQL schemas on a newer ICU version, say, your laptop, and then deploy them on a server running an older ICU version. If we have a cutoff beyond which we convert ICU locale IDs to language tags, then this won't work anymore for certain combinations. And RHEL/CentOS 7 is still pretty popular.
On Tue, 2023-03-14 at 08:08 +0100, Peter Eisentraut wrote: > Another issue that came to mind: Right now, you can, say, develop > SQL > schemas on a newer ICU version, say, your laptop, and then deploy > them > on a server running an older ICU version. If we have a cutoff beyond > which we convert ICU locale IDs to language tags, then this won't > work > anymore for certain combinations. And RHEL/CentOS 7 is still pretty > popular. If we just uloc_canonicalize() in icu_set_collation_attributes() then versions 50-53 can support language tags. Patch attached. One loose end is that we really should support language tags like "und" in those older versions (54 and earlier). Your commit d72900bded avoided the problem, but perhaps we should fix it by looking for "und" and replacing it with "root" while opening, or something. -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
On Tue, 2023-03-14 at 10:10 -0700, Jeff Davis wrote: > One loose end is that we really should support language tags like > "und" > in those older versions (54 and earlier). Your commit d72900bded > avoided the problem, but perhaps we should fix it by looking for > "und" > and replacing it with "root" while opening, or something. Attached are a few patches to implement this idea. -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
On Tue, 2023-03-14 at 23:47 -0700, Jeff Davis wrote: > On Tue, 2023-03-14 at 10:10 -0700, Jeff Davis wrote: > > One loose end is that we really should support language tags like > > "und" > > in those older versions (54 and earlier). Your commit d72900bded > > avoided the problem, but perhaps we should fix it by looking for > > "und" > > and replacing it with "root" while opening, or something. > > Attached are a few patches to implement this idea. Here is an updated patch series that includes these earlier fixes for older ICU versions, with the canonicalization patch last (0005). I left out the validation patch for now, and I'm evaluating a different approach that will attempt to match to the locales retrieved with uloc_countAvailable()/uloc_getAvailable(). -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
On Wed, 2023-03-15 at 15:18 -0700, Jeff Davis wrote: > I left out the validation patch for now, and I'm evaluating a > different > approach that will attempt to match to the locales retrieved with > uloc_countAvailable()/uloc_getAvailable(). I like this approach, attached new patch series with that included as 0006. The first 3 patches are essentially bugfixes -- should they be backported? -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
- v6-0001-Support-language-tags-in-older-ICU-versions-53-an.patch
- v6-0002-Wrap-ICU-ucol_open.patch
- v6-0003-Handle-the-und-locale-in-ICU-versions-54-and-olde.patch
- v6-0004-Add-SQL-function-pg_icu_language_tag.patch
- v6-0005-Canonicalize-ICU-locale-names-to-language-tags.patch
- v6-0006-Validate-ICU-locales.patch
On 17.03.23 18:55, Jeff Davis wrote: > On Wed, 2023-03-15 at 15:18 -0700, Jeff Davis wrote: >> I left out the validation patch for now, and I'm evaluating a >> different >> approach that will attempt to match to the locales retrieved with >> uloc_countAvailable()/uloc_getAvailable(). > > I like this approach, attached new patch series with that included as > 0006. I have looked at the first three patches. I think we want what those patches do. [PATCH v6 1/6] Support language tags in older ICU versions (53 and earlier). In pg_import_system_collations(), this is now redundant and can be simplified: - if (!pg_is_ascii(langtag) || !pg_is_ascii(iculocstr)) + if (!pg_is_ascii(langtag) || !pg_is_ascii(langtag)) icu_set_collation_attributes() needs more commenting about what is going on. My guess is that uloc_canonicalize() converts from language tag to ICU locale ID, and then the existing logic to parse that apart would apply. Is that how it works? [PATCH v6 2/6] Wrap ICU ucol_open(). It makes sense to try to unify some of this. But I find the naming confusing. If I see pg_ucol_open(), then I would expect that all calls to ucol_open() would be replaced by this. But here it's only a few, without explanation. (pg_ucol_open() has no explanation at all AFAICT.) I have in my notes that check_icu_locale() and make_icu_collator() should be combined into a single function. I think that would be a better way to slice it. Btw., I had intentionally not written code like this +#if U_ICU_VERSION_MAJOR_NUM < 54 + icu_set_collation_attributes(collator, loc_str); +#endif The disadvantage of doing it that way is that you then need to dig out an old version of ICU in order to check whether the code compiles at all. With the current code, you can be sure that that code compiles if you make changes elsewhere. [PATCH v6 3/6] Handle the "und" locale in ICU versions 54 and older. This makes sense, but the same comment about not #if'ing out code for old ICU versions applies here. The +#ifdef USE_ICU + before pg_ucol_open() probably belongs in patch 2.
On Tue, 2023-03-21 at 10:35 +0100, Peter Eisentraut wrote: > [PATCH v6 1/6] Support language tags in older ICU versions (53 and > earlier). > > In pg_import_system_collations(), this is now redundant and can be > simplified: > > - if (!pg_is_ascii(langtag) || !pg_is_ascii(iculocstr)) > + if (!pg_is_ascii(langtag) || !pg_is_ascii(langtag)) > > icu_set_collation_attributes() needs more commenting about what is > going > on. My guess is that uloc_canonicalize() converts from language tag > to > ICU locale ID, and then the existing logic to parse that apart would > apply. Is that how it works? Fixed the redundancy, added some comments, and committed 0001. > [PATCH v6 2/6] Wrap ICU ucol_open(). > > It makes sense to try to unify some of this. But I find the naming > confusing. If I see pg_ucol_open(), then I would expect that all > calls > to ucol_open() would be replaced by this. But here it's only a few, > without explanation. (pg_ucol_open() has no explanation at all > AFAICT.) The remaining callsite which doesn't use the wrapper is in initdb.c, which can't call into pg_locale.c, and has different intentions. initdb uses ucol_open to get the default locale if icu_locale is not specified; and it also uses ucol open to verify that the locale can be opened (whether specified or the default). (Aside: I created a tiny 0004 patch which makes this difference more clear and adds a nice comment.) There's no reason to use a wrapper when getting the default locale, because it's just passing NULL anyway. When verifying that the locale can be opened, ucol_open() doesn't catch many problems anyway, so I'm not sure it's worth a lot of effort to copy these extra checks that the wrapper does into initdb.c. For instance, what's the value in replacing "und" with "root" if opening either will succeed? Parsing the attributes can potentially catch problems, but the later patch 0006 will check the attributes when converting to a language tag at initdb time. So I'm inclined to just leave initdb alone in patches 0002 and 0003. > I have in my notes that check_icu_locale() and make_icu_collator() > should be combined into a single function. I think that would be a > better way to slice it. That would leave out get_collation_actual_version(), which should handle the same fixups for attributes and the "und" locale. > Btw., I had intentionally not written code like this > > +#if U_ICU_VERSION_MAJOR_NUM < 54 > + icu_set_collation_attributes(collator, loc_str); > +#endif > > The disadvantage of doing it that way is that you then need to dig > out > an old version of ICU in order to check whether the code compiles at > all. With the current code, you can be sure that that code compiles > if > you make changes elsewhere. I was wondering about that -- thank you, I changed it back to use "if" rather than "#ifdef". New series attached (starting at 0002 to better correspond to the previous series). -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
- v7-0002-Wrap-ICU-ucol_open.patch
- v7-0003-Handle-the-und-locale-in-ICU-versions-54-and-olde.patch
- v7-0004-Accept-C-POSIX-locales-when-converting-to-languag.patch
- v7-0005-initdb-emit-message-when-using-default-ICU-locale.patch
- v7-0006-Canonicalize-ICU-locale-names-to-language-tags.patch
- v7-0007-Validate-ICU-locales.patch
On 22.03.23 19:05, Jeff Davis wrote: > On Tue, 2023-03-21 at 10:35 +0100, Peter Eisentraut wrote: >> [PATCH v6 1/6] Support language tags in older ICU versions (53 and >> earlier). >> >> In pg_import_system_collations(), this is now redundant and can be >> simplified: >> >> - if (!pg_is_ascii(langtag) || !pg_is_ascii(iculocstr)) >> + if (!pg_is_ascii(langtag) || !pg_is_ascii(langtag)) >> >> icu_set_collation_attributes() needs more commenting about what is >> going >> on. My guess is that uloc_canonicalize() converts from language tag >> to >> ICU locale ID, and then the existing logic to parse that apart would >> apply. Is that how it works? > > Fixed the redundancy, added some comments, and committed 0001. So, does uloc_canonicalize() always convert to ICU locale IDs? What if you pass a language tag, does it convert it to ICU locale ID as well? >> [PATCH v6 2/6] Wrap ICU ucol_open(). > So I'm inclined to just leave initdb alone in patches 0002 and 0003. 0002 and 0003 look ok to me now. In 0002, the error "opening default collator is not supported", should that be an assert or an elog? Is it reachable by the user? You might want to check the declarations at the top of pg_ucol_open(). 0003 reformats them after they were just added in 0002. Maybe check that they are pgindent'ed in 0002 properly. I don't understand patch 0004. It seems to do two things, handle C/POSIX locale specifications and add an SQL-callable function. Are those connected?
On Thu, 2023-03-23 at 07:27 +0100, Peter Eisentraut wrote: > So, does uloc_canonicalize() always convert to ICU locale IDs? What > if > you pass a language tag, does it convert it to ICU locale ID as well? Yes. The documentation is not clear on that point, but my testing shows that it does. And this is only for old versions of the code, so we don't need to worry about later versions of ICU changing that. I thought about using uloc_forLanguageTag(), but the documentation for that is not clear what formats it accepts as an input, so it doesn't seem like a win. If wanted to be paranoid we could use uloc_toLanguageTag() followed by uloc_forLanguageTag(), but that seemed excessive. > > > 0002 and 0003 look ok to me now. Thank you, committed 0002 and 0003. > In 0002, the error "opening default collator is not supported", > should > that be an assert or an elog? Is it reachable by the user? It's not reachable by the user, but could catch a bug if we accidentally read a NULL field from the catalog or something like that. It seemed a worthwhile check to leave in production builds. > You might want to check the declarations at the top of > pg_ucol_open(). > 0003 reformats them after they were just added in 0002. Maybe check > that they are pgindent'ed in 0002 properly. They seem to be pgindented fine in 0002, it was unnecessarily reindented in 0003 and I fixed that. I use emacs "align-current" and generally that does the right thing, but I'll rely more on pgindent in the future. > I don't understand patch 0004. It seems to do two things, handle > C/POSIX locale specifications and add an SQL-callable function. Are > those connected? It's hard to test (or even exercise) the former without the latter. I could get rid of the SQL-callable function and move the rest of the changes into 0006. I'll see if that arrangement works better, and that way we can add the SQL-callable function later (or perhaps not at all if it's not desired). Regards, Jeff Davis
On Thu, 2023-03-23 at 10:16 -0700, Jeff Davis wrote: > I could get rid of the SQL-callable function and move the rest of the > changes into 0006. I'll see if that arrangement works better, and > that > way we can add the SQL-callable function later (or perhaps not at all > if it's not desired). Attached a new series that doesn't include the SQL-callable function. It's probably better to just wait and see what functions seem actually useful to users. I included a new small patch to fix a potential UCollator leak and make the errors more consistent. -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
On 23.03.23 18:16, Jeff Davis wrote: >> In 0002, the error "opening default collator is not supported", >> should >> that be an assert or an elog? Is it reachable by the user? > It's not reachable by the user, but could catch a bug if we > accidentally read a NULL field from the catalog or something like that. > It seemed a worthwhile check to leave in production builds. Then it ought to be an elog().
On 24.03.23 07:39, Jeff Davis wrote: > On Thu, 2023-03-23 at 10:16 -0700, Jeff Davis wrote: >> I could get rid of the SQL-callable function and move the rest of the >> changes into 0006. I'll see if that arrangement works better, and >> that >> way we can add the SQL-callable function later (or perhaps not at all >> if it's not desired). > > Attached a new series that doesn't include the SQL-callable function. > It's probably better to just wait and see what functions seem actually > useful to users. > > I included a new small patch to fix a potential UCollator leak and make > the errors more consistent. [PATCH v8 1/4] Avoid potential UCollator leak for older ICU versions. Couldn't we do this in a simpler way by just freeing the collator before the ereport() calls. Or wrap a PG_TRY/PG_FINALLY around the whole thing? It would be nicer to not make the callers of icu_set_collation_attributes() responsible for catching and reporting the errors. [PATCH v8 2/4] initdb: emit message when using default ICU locale. I'm not able to make initdb print this message. Under what circumstances am I supposed to see this? Do you have some examples? The function check_icu_locale() has now gotten a lot more functionality than its name suggests. Maybe the part that assigns the default ICU locale should be moved up one level to setlocales(), which has a better name and does something similar for the libc locale realm. [PATCH v8 3/4] Canonicalize ICU locale names to language tags. I'm still on the fence about whether we actually want to do this, but I'm warming up to it, now that the issues with pre-54 versions are fixed. But if we do this, the documentation needs to be updated. There is a bunch of text there that says, like, you can do this format or that format, whatever you like. At least the guidance should be changed there. [PATCH v8 4/4] Validate ICU locales. I would make icu_locale_validation true by default. Or maybe it should be a log-level type option, so you can set it to error, warning, and also completely off?
On Fri, 2023-03-24 at 10:10 +0100, Peter Eisentraut wrote: > Couldn't we do this in a simpler way by just freeing the collator > before > the ereport() calls. I committed a tiny patch to do this. We still need to address the error inconsistency though. The problem is that, in older ICU versions, if the fixup for "und@colNumeric=lower" -> "root@colNumeric=lower" is applied, then icu_set_collation_attributes() will throw an error reporting "root@colNumeric=lower", which is not what the user typed. We could fix that directly by passing the original string to icu_set_collation_attributes() instead, or perhaps as an extra parameter used only for the ereport(). I like the minor refactoring I did better, though. It puts the ereports() close to each other, so any differences are more obvious. And it seems cleaner to me for pg_ucol_open to close the UCollator because it's the one that opened it. I don't have a strong opinion, but that's my reasoning. > Or wrap a PG_TRY/PG_FINALLY around the whole thing? I generally avoid PG_TRY/FINALLY unless it avoids some major awkwardness or other problem. > It would be nicer to not make the callers of > icu_set_collation_attributes() responsible for catching and reporting > the errors. There's only one caller now: pg_ucol_open(). > [PATCH v8 2/4] initdb: emit message when using default ICU locale. > > I'm not able to make initdb print this message. Under what > circumstances am I supposed to see this? Do you have some examples? It happens when you don't specify --icu-locale. It is slightly redundant with "ICU locale", but it lets you see that it came from the environment rather than the command line: ------------- $ initdb -D data The files belonging to this database system will be owned by user "someone". This user must also own the server process. Using default ICU locale "en_US_POSIX". The database cluster will be initialized with this locale configuration: provider: icu ICU locale: en_US_POSIX ... ------------- That seems fairly useful for testing, etc., where initdb.log doesn't show the command line options. > The function check_icu_locale() has now gotten a lot more > functionality > than its name suggests. Maybe the part that assigns the default ICU > locale should be moved up one level to setlocales(), which has a > better > name and does something similar for the libc locale realm. Agreed, done. In fact, initdb.c:check_icu_locale() is completely unnecessary in that patch, because as the comment points out, the backend will try to open it during post-bootstrap initialization. I think it was simply a mistake to try to do this validation in commit 27b62377b4. The later validation patch does do some better validation at initdb time to make sure the language can be found. > [PATCH v8 3/4] Canonicalize ICU locale names to language tags. > > I'm still on the fence about whether we actually want to do this, but > I'm warming up to it, now that the issues with pre-54 versions are > fixed. > > But if we do this, the documentation needs to be updated. There is a > bunch of text there that says, like, you can do this format or that > format, whatever you like. At least the guidance should be changed > there. > > > [PATCH v8 4/4] Validate ICU locales. > > I would make icu_locale_validation true by default. Agreed. I considered also not having a GUC, but it seems like some kind of escape hatch is wise, at least for now. > Or maybe it should be a log-level type option, so you can set it to > error, warning, and also completely off? As the validation patch seems closer to acceptance, I changed it to be before the canonicalization patch. New series attached. -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
[PATCH v9 1/5] Fix error inconsistency in older ICU versions. ok [PATCH v9 2/5] initdb: replace check_icu_locale() with default_icu_locale(). I would keep the #ifdef USE_ICU inside the lower-level function default_icu_locale(), like it was before, so that the higher-level setlocales() doesn't need to know about it. Otherwise ok. [PATCH v9 3/5] initdb: emit message when using default ICU locale. ok [PATCH v9 4/5] Validate ICU locales. Also here, let's keep the #ifdef USE_ICU in the lower-level function and move more logic in there. Otherwise you have to repeat various things in DefineCollation() and createdb(). I'm not sure we need the IsBinaryUpgrade checks. Users can set icu_validation_level on the target instance if they don't want that.
On Tue, 2023-03-28 at 08:41 +0200, Peter Eisentraut wrote: > [PATCH v9 1/5] Fix error inconsistency in older ICU versions. > > ok Committed 0001. > [PATCH v9 2/5] initdb: replace check_icu_locale() with > default_icu_locale(). > > I would keep the #ifdef USE_ICU inside the lower-level function > default_icu_locale(), like it was before, so that the higher-level > setlocales() doesn't need to know about it. > > Otherwise ok. Done and committed 0002. > > [PATCH v9 3/5] initdb: emit message when using default ICU locale. Done and committed 0003. > [PATCH v9 4/5] Validate ICU locales. > > Also here, let's keep the #ifdef USE_ICU in the lower-level function > and > move more logic in there. Otherwise you have to repeat various > things > in DefineCollation() and createdb(). Done. > I'm not sure we need the IsBinaryUpgrade checks. Users can set > icu_validation_level on the target instance if they don't want that. I committed a version that still performs the checks during binary upgrade, but degrades the message to a WARNING if it's set higher than that. I tried some upgrades with invalid locales, and getting an error deep in the logs after the upgrade actually starts is not very user- friendly. We could add something during the --check phase, which would be more helpful, but I didn't do that for this patch. Attached is a new version of the final patch, which performs canonicalization. I'm not 100% sure that it's wanted, but it still seems like a good idea to get the locales into a standard format in the catalogs, and if a lot more people start using ICU in v16 (because it's the default), then it would be a good time to do it. But perhaps there are risks? -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
On 30.03.23 04:33, Jeff Davis wrote: > Attached is a new version of the final patch, which performs > canonicalization. I'm not 100% sure that it's wanted, but it still > seems like a good idea to get the locales into a standard format in the > catalogs, and if a lot more people start using ICU in v16 (because it's > the default), then it would be a good time to do it. But perhaps there > are risks? I say, let's do it. I don't think we should show the notice when the canonicalization doesn't change anything. This is not useful: +NOTICE: using language tag "und-u-kf-upper" for locale "und-u-kf-upper" Also, the message should be phrased more from the perspective of the user instead of using ICU jargon, like NOTICE: using canonicalized form "%s" for locale specification "%s" (Still too many big words?) I don't think the special handling of IsBinaryUpgrade is needed or wanted. I would hope that with this feature, all old-style locale IDs would go away, but this way we would keep them forever. If we believe that canonicalization is safe, then I don't see why we cannot apply it during binary upgrade. Needs documentation updates in doc/src/sgml/charset.sgml.
On Thu, 2023-03-30 at 08:59 +0200, Peter Eisentraut wrote: > I don't think the special handling of IsBinaryUpgrade is needed or > wanted. I would hope that with this feature, all old-style locale > IDs > would go away, but this way we would keep them forever. If we > believe > that canonicalization is safe, then I don't see why we cannot apply > it > during binary upgrade. There are two issues: 1. Failures can occur. For instance, if an invalid attribute is used, like '@collStrength=primary', then we can't canonicalize it (or if we do, it could end up being not what the user intended). 2. Version 15 and earlier have a subtle bug: it passes the raw locale straight to ucol_open(), and if the locale is "fr_CA.UTF-8" ucol_open() mis-parses it to have language "fr" with no region. If you canonicalize first, it properly parses the locale and produces "fr-CA", which results in a different collator. The 15 behavior is wrong, and this canonicalization patch will fix it, but it doesn't do so during pg_upgrade because that could change the collator and corrupt an index. The current patch deals with these problems by simply preserving the locale (valid or not) during pg_upgrade, and only canonicalizing new collations and databases (so #2 is only fixed for new collations/databases). I think that's a good trade-off because a lot more users will be on ICU now that it's the default, so let's avoid creating more of the problem cases for those new users. To get to perfectly-canonicalized catalogs for upgrades from earlier versions: * We need a way to detect #2, which I posted some code for in an uncommitted revision[1] of this patch series. * We need a way to detect #1 and #2 during the pg_upgrade --check phase. * We need actions that the user can take to correct the problems. I have some ideas but they could use some dicsussion. I'm not sure all of those will be ready for v16, though. Regards, Jeff Davis [1] See check_equivalent_icu_locales() and calling code here: https://www.postgresql.org/message-id/8c7af6820aed94dc7bc259d2aa7f9663518e6137.camel@j-davis.com
On Thu, 2023-03-30 at 08:59 +0200, Peter Eisentraut wrote: > I don't think we should show the notice when the canonicalization > doesn't change anything. This is not useful: > > +NOTICE: using language tag "und-u-kf-upper" for locale "und-u-kf- > upper" Done. > Also, the message should be phrased more from the perspective of the > user instead of using ICU jargon, like > > NOTICE: using canonicalized form "%s" for locale specification "%s" > > (Still too many big words?) Changed to: NOTICE: using standard form "%s" for locale "%s" > Needs documentation updates in doc/src/sgml/charset.sgml. I made a very minor update. Do you have something more specific in mind? Regards, Jeff Davis
Attachment
On 31.03.23 12:11, Jeff Davis wrote: > On Thu, 2023-03-30 at 08:59 +0200, Peter Eisentraut wrote: >> I don't think we should show the notice when the canonicalization >> doesn't change anything. This is not useful: >> >> +NOTICE: using language tag "und-u-kf-upper" for locale "und-u-kf- >> upper" > > Done. > >> Also, the message should be phrased more from the perspective of the >> user instead of using ICU jargon, like >> >> NOTICE: using canonicalized form "%s" for locale specification "%s" >> >> (Still too many big words?) > > Changed to: > > NOTICE: using standard form "%s" for locale "%s" > >> Needs documentation updates in doc/src/sgml/charset.sgml. > > I made a very minor update. Do you have something more specific in > mind? This all looks good to me.
MSVC now says this on master: [17:48:12.446] c:\cirrus\src\backend\utils\adt\pg_locale.c(2912) : warning C4715: 'icu_language_tag': not all control paths return a value CI doesn't currently fail for MSVC warnings, so it's a bit hidden. FWIW cfbot does show this with a ⚠ sign with its new system for grovelling through logs, which will now show up on every entry now that this warning is in master.
On Thu, Mar 30, 2023 at 08:59:41AM +0200, Peter Eisentraut wrote: > On 30.03.23 04:33, Jeff Davis wrote: > >Attached is a new version of the final patch, which performs > >canonicalization. I'm not 100% sure that it's wanted, but it still > >seems like a good idea to get the locales into a standard format in the > >catalogs, and if a lot more people start using ICU in v16 (because it's > >the default), then it would be a good time to do it. But perhaps there > >are risks? > > I say, let's do it. The following is not cause for postgresql.git changes at this time, but I'm sharing it in case it saves someone else the study effort. Commit ea1db8a ("Canonicalize ICU locale names to language tags.") slowed buildfarm member hoverfly, but that disappears if I drop debug_parallel_query from its config. Typical end-to-end duration rose from 2h5m to 2h55m. Most-affected were installcheck runs, which rose from 11m to 19m. (The "check" stage uses NO_LOCALE=1, so it changed less.) From profiles, my theory is that each of the many parallel workers burns notable CPU and I/O opening its ICU collator for the first time. debug_parallel_query, by design, pursues parallelism independent of cost, so this is working as intended. If it ever matters in non-debug configurations, we might raise the default parallel_setup_cost or pre-load ICU collators in the postmaster.
On Tue, 2023-05-02 at 07:29 -0700, Noah Misch wrote: > On Thu, Mar 30, 2023 at 08:59:41AM +0200, Peter Eisentraut wrote: > > On 30.03.23 04:33, Jeff Davis wrote: > > > Attached is a new version of the final patch, which performs > > > canonicalization. I'm not 100% sure that it's wanted, but it > > > still > > > seems like a good idea to get the locales into a standard format > > > in the > > > catalogs, and if a lot more people start using ICU in v16 > > > (because it's > > > the default), then it would be a good time to do it. But perhaps > > > there > > > are risks? > > > > I say, let's do it. > > The following is not cause for postgresql.git changes at this time, > but I'm > sharing it in case it saves someone else the study effort. Commit > ea1db8a > ("Canonicalize ICU locale names to language tags.") slowed buildfarm > member > hoverfly, but that disappears if I drop debug_parallel_query from its > config. > Typical end-to-end duration rose from 2h5m to 2h55m. Most-affected > were > installcheck runs, which rose from 11m to 19m. (The "check" stage > uses > NO_LOCALE=1, so it changed less.) From profiles, my theory is that > each of > the many parallel workers burns notable CPU and I/O opening its ICU > collator > for the first time. I didn't repro the overall test timings (mine is ~1m40s compared to ~11-19m on hoverfly) but I think a microbenchmark on the ICU calls showed a possible cause. I ran open in a loop 10M times on the requested locale. The root locale ("und"[1], "root" and "") take about 1.3s to open 10M times; simple locales like 'en' and 'fr-CA' and 'de-DE' are all a little shower at 3.3s. Unrecognized locales like "xyz" take about 10 times as long: 13s to open 10M times, presumably to perform the fallback logic that ultimately opens the root locale. Not sure if 10X slower in the open path is enough to explain the overall test slowdown. My guess is that the ICU locale for these tests is not recognized, or is some other locale that opens slowly. Can you tell me the actual daticulocale? Regards, Jeff Davis [1] It appears that "und" is also slow to open in ICU < 64. Hoverfly is on v58, so it's possible that's the problem if daticulocale=und.
On Sat, May 20, 2023 at 10:19:30AM -0700, Jeff Davis wrote: > On Tue, 2023-05-02 at 07:29 -0700, Noah Misch wrote: > > On Thu, Mar 30, 2023 at 08:59:41AM +0200, Peter Eisentraut wrote: > > > On 30.03.23 04:33, Jeff Davis wrote: > > > > Attached is a new version of the final patch, which performs > > > > canonicalization. I'm not 100% sure that it's wanted, but it > > > > still > > > > seems like a good idea to get the locales into a standard format > > > > in the > > > > catalogs, and if a lot more people start using ICU in v16 > > > > (because it's > > > > the default), then it would be a good time to do it. But perhaps > > > > there > > > > are risks? > > > > > > I say, let's do it. > > > > The following is not cause for postgresql.git changes at this time, > > but I'm > > sharing it in case it saves someone else the study effort. Commit > > ea1db8a > > ("Canonicalize ICU locale names to language tags.") slowed buildfarm > > member > > hoverfly, but that disappears if I drop debug_parallel_query from its > > config. > > Typical end-to-end duration rose from 2h5m to 2h55m. Most-affected > > were > > installcheck runs, which rose from 11m to 19m. (The "check" stage > > uses > > NO_LOCALE=1, so it changed less.) From profiles, my theory is that > > each of > > the many parallel workers burns notable CPU and I/O opening its ICU > > collator > > for the first time. > > I didn't repro the overall test timings (mine is ~1m40s compared to > ~11-19m on hoverfly) but I think a microbenchmark on the ICU calls > showed a possible cause. > > I ran open in a loop 10M times on the requested locale. The root locale > ("und"[1], "root" and "") take about 1.3s to open 10M times; simple > locales like 'en' and 'fr-CA' and 'de-DE' are all a little shower at > 3.3s. > > Unrecognized locales like "xyz" take about 10 times as long: 13s to > open 10M times, presumably to perform the fallback logic that > ultimately opens the root locale. Not sure if 10X slower in the open > path is enough to explain the overall test slowdown. > > My guess is that the ICU locale for these tests is not recognized, or > is some other locale that opens slowly. Can you tell me the actual > daticulocale? As of commit b8c3f6d, InstallCheck-C got daticulocale=en-US-u-va-posix. Check got daticulocale=NULL. (The machine in question was unusable for PostgreSQL from 2023-05-12 to 2023-06-30, due to https://stackoverflow.com/q/76369660/16371536. That delayed my response.)
On Sat, 2023-07-01 at 10:31 -0700, Noah Misch wrote: > As of commit b8c3f6d, InstallCheck-C got daticulocale=en-US-u-va- > posix. Check > got daticulocale=NULL. With the same test setup, that locale takes about 8.6 seconds (opening it 10M times), about 2.5X slower than "en-US" and about 7X slower than "und". I think that explains it. The locale "en-US-u-va-posix" normally happens when passing a locale beginning with "C" to ICU. After 2535c74b1a we don't get ICU locales from the environment anywhere, so that should be rare (and probably indicates a user mistake). I don't think this is a practical problem any more. Regards, Jeff Davis