Re: Order changes in PG16 since ICU introduction - Mailing list pgsql-hackers
From | Jonathan S. Katz |
---|---|
Subject | Re: Order changes in PG16 since ICU introduction |
Date | |
Msg-id | 25787ec7-4c04-9a8a-d241-4dc9be0b1ba3@postgresql.org |
In response to | Re: Order changes in PG16 since ICU introduction (Jeff Davis <pgsql@j-davis.com>) |
Responses | Re: Order changes in PG16 since ICU introduction |
List | pgsql-hackers |
On 5/5/23 8:25 PM, Jeff Davis wrote:
> On Fri, 2023-04-21 at 20:12 -0400, Robert Haas wrote:
>> On Fri, Apr 21, 2023 at 5:56 PM Jeff Davis <pgsql@j-davis.com> wrote:
>>> Most of the complaints seem to be complaints about v15 as well, and
>>> while those complaints may be a reason to not make ICU the default,
>>> they are also an argument that we should continue to learn and try to
>>> fix those issues because they exist in an already-released version.
>>> Leaving it the default for now will help us fix those issues rather
>>> than hide them.
>>>
>>> It's still early, so we have plenty of time to revert the initdb
>>> default if we need to.
>>
>> That's fair enough, but I really think it's important that some energy
>> get invested in providing adequate documentation for this stuff. Just
>> patching the code is not enough.
>
> Attached a significant documentation patch.
> I tried to make it comprehensive without trying to be exhaustive, and I
> separated the explanation of language tags from what collation settings
> you can include in a language tag, so hopefully that's more clear.
>
> I added quite a few examples spread throughout the various sections,
> and I preserved the existing examples at the end. I also left all of
> the external links at the bottom for those interested enough to go
> beyond what's there.

[Personal hat, not RMT]

Thanks -- this is super helpful. A bunch of these examples I had
previously had to figure out by randomly searching blog posts /
trial-and-error, so I think this will help developers get started more
quickly.

Comments (and a lot are just little nits to tighten the language)

Commit message -- typo: "documentaiton"

+ If you see such a message, ensure that the <symbol>PROVIDER</symbol> and
+ <symbol>LOCALE</symbol> are as you expect, and consider specifying
+ directly as the canonical language tag instead of relying on the
+ transformation.
+ </para>

I'd recommend making this more prescriptive:

"If you see this notice, ensure that the <symbol>PROVIDER</symbol> and
<symbol>LOCALE</symbol> are the expected result. For consistent results
when using the ICU provider, specify the canonical
<link linkend="icu-language-tag">language tag</link> instead of relying
on the transformation."

+ If there is some problem interpreting the locale name, or if it represents
+ a language or region that ICU does not recognize, a message will be reported:

This is passive voice, consider:

"If there is a problem interpreting the locale name, or if the locale
name represents a language or region that ICU does not recognize, you'll
see the following error:"

+ <sect3 id="icu-language-tag">
+ <title>Language Tag</title>
+ <para>

Before jumping in, I'd recommend a quick definition of what a language
tag is, e.g.:

"A language tag, defined in BCP 47, is a standardized identifier used to
identify languages in computer systems" or something similar.

(I did find a database that made it simpler to search for these, which
is one issue I've previously had, but I don't think we'd want to link to
it)

+ To include this additional collation information in a language tag,
+ append <literal>-u</literal>, followed by one or more

My first question was "what's special about '-u'", so maybe we say:

"To include this additional collation information in a language tag,
append <literal>-u</literal>, which indicates there are additional
collation settings, followed by one or more..."

+ ICU locales are specified as a <link linkend="icu-language-tag">Language
+ Tag</link>, but can also accept most libc-style locale names (which will
+ be transformed into language tags if possible).
+ </para>

I'd recommend removing the parentheticals:

ICU locales are specified as a BCP 47
<link linkend="icu-language-tag">Language Tag</link>, but can also
accept most libc-style locale names. If possible, libc-style locale
names are transformed into language tags.
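
To make the '-u' point concrete, here is a minimal sketch of how a language tag with collation settings is assembled: the base tag, then "u", then key/value pairs. The helper build_language_tag is hypothetical and not part of PostgreSQL or ICU. The keys shown ("ks" for strength, "kf" for case-first) are BCP 47 collation keys; sorting the keys is only an assumption made here so that the same settings always produce the same string.

```python
def build_language_tag(base, settings):
    """Append BCP 47 '-u' collation settings to a base language tag.

    Hypothetical helper for illustration only; it does not validate
    that the keys or values are ones ICU actually accepts.
    """
    if not settings:
        # No settings: the base tag is used as-is, with no '-u' marker.
        return base
    parts = [base, "u"]
    # Sort by key so identical settings always yield an identical tag.
    for key in sorted(settings):
        parts.extend([key, settings[key]])
    return "-".join(parts)


# Secondary strength ("ks-level2") with upper case sorting first ("kf-upper").
tag = build_language_tag("en-US", {"ks": "level2", "kf": "upper"})
print(tag)
```

The resulting string is what would be passed as the ICU locale name. Whether a particular key/value pair is honored depends on the ICU version in use.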

+ <title>ICU Collation Levels</title>

Nothing to add here other than to say I'm extremely appreciative of this
section. Once upon a time I sunk a lot of time trying to figure out how
all of these levels worked.

+ Sensitivity when determining equality, with
+ <literal>level1</literal> the least sensitive and
+ <literal>identic</literal> the most sensitive. See <xref
+ linkend="icu-collation-levels"/> for details.

This discusses equality sensitivity, but I'm not sure if I understand
that term here. The ICU docs seem to call these "strengths"[1], maybe we
use that term to be consistent with upstream?

+ If set to <literal>upper</literal>, upper case sorts before lower
+ case. If set to <literal>lower</literal>, lower case sorts before
+ upper case. If set to <literal>false</literal>, it depends on the
+ locale.

Suggestion to tighten this up:

"If set to <literal>false</literal>, the sort depends on the rules of
the locale."

+ Defaults may depend on locale. The above table is not meant to be
+ complete. See <xref linkend="icu-external-references"/> for additinal
+ options and details.

Typo: additinal => "additional"

> I didn't add additional documentation for ICU rules. There are so many
> options for collations that it's hard for me to think of realistic
> examples to specify the rules directly, unless someone wants to invent
> a new language. Perhaps useful if working with an interesting text file
> format with special treatment for delimiters?
>
> I asked the question about rules here:
>
> https://www.postgresql.org/message-id/e861ac4fdae9f9f5ce2a938a37bcb5e083f0f489.camel%40cybertec.at
>
> and got some limited response about addressing sort complaints. That
> sounds reasonable, but a lot of that can also be handled just by
> specifying the right collation settings. Someone who understands the
> use case better could add some more documentation.

I'm not too sure about this one -- from my experience, users want
predictability in sorts, but there are a variety of ways to get that
experience.

Thanks,

Jonathan

[1] https://unicode-org.github.io/icu/userguide/collation/concepts.html