Re: Order changes in PG16 since ICU introduction - Mailing list pgsql-hackers

From Jonathan S. Katz
Subject Re: Order changes in PG16 since ICU introduction
Date
Msg-id 25787ec7-4c04-9a8a-d241-4dc9be0b1ba3@postgresql.org
In response to Re: Order changes in PG16 since ICU introduction  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Order changes in PG16 since ICU introduction
List pgsql-hackers
On 5/5/23 8:25 PM, Jeff Davis wrote:
> On Fri, 2023-04-21 at 20:12 -0400, Robert Haas wrote:
>> On Fri, Apr 21, 2023 at 5:56 PM Jeff Davis <pgsql@j-davis.com> wrote:
>>> Most of the complaints seem to be complaints about v15 as well, and
>>> while those complaints may be a reason to not make ICU the default,
>>> they are also an argument that we should continue to learn and try
>>> to
>>> fix those issues because they exist in an already-released version.
>>> Leaving it the default for now will help us fix those issues rather
>>> than hide them.
>>>
>>> It's still early, so we have plenty of time to revert the initdb
>>> default if we need to.
>>
>> That's fair enough, but I really think it's important that some
>> energy
>> get invested in providing adequate documentation for this stuff. Just
>> patching the code is not enough.
> 
> Attached a significant documentation patch.


> I tried to make it comprehensive without trying to be exhaustive, and I
> separated the explanation of language tags from what collation settings
> you can include in a language tag, so hopefully that's more clear.
> 
> I added quite a few examples spread throughout the various sections,
> and I preserved the existing examples at the end. I also left all of
> the external links at the bottom for those interested enough to go
> beyond what's there.

[Personal hat, not RMT]

Thanks -- this is super helpful. I previously had to figure out a bunch
of these examples by randomly searching blog posts and by trial and
error, so I think this will help developers get started more quickly.

Comments (a lot of these are just little nits to tighten the language):

Commit message -- typo: "documentaiton"


+     If you see such a message, ensure that the <symbol>PROVIDER</symbol> and
+     <symbol>LOCALE</symbol> are as you expect, and consider specifying
+     directly as the canonical language tag instead of relying on the
+     transformation.
+    </para>

I'd recommend making this more prescriptive:

"If you see this notice, ensure that the <symbol>PROVIDER</symbol> and 
<symbol>LOCALE</symbol> are the expected result. For consistent results 
when using the ICU provider, specify the canonical <link 
linkend="icu-language-tag">language tag</link> instead of relying on the 
transformation."

+     If there is some problem interpreting the locale name, or if it represents
+     a language or region that ICU does not recognize, a message will be reported:

This is passive voice; consider:

"If there is a problem interpreting the locale name, or if the locale 
name represents a language or region that ICU does not recognize, you'll 
see the following error:"


+   <sect3 id="icu-language-tag">
+    <title>Language Tag</title>
+    <para>

Before jumping in, I'd recommend a quick definition of what a language 
tag is, e.g.:

"A language tag, defined in BCP 47, is a standardized identifier used to 
identify languages in computer systems" or something similar.

(I did find a database that made it simpler to search for these, which 
is one issue I've previously had, but I don't think we'd want to link to it)

+     To include this additional collation information in a language tag,
+     append <literal>-u</literal>, followed by one or more

My first question was "what's special about '-u'", so maybe we say:

"To include this additional collation information in a language tag, 
append <literal>-u</literal>, which indicates there are additional 
collation settings, followed by one or more..."
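To make the structure concrete, here is a rough Python sketch of how the part after "-u" breaks down into key/value settings. It is only an illustration, not what ICU or PostgreSQL actually runs: the function name split_collation_settings is made up for this email, and it assumes the simplified rule that two-character subtags after "-u" are keys, longer subtags are values, and a key with no value means "true".

```python
def split_collation_settings(tag):
    """Split a language tag into (base, settings) at the "-u" extension.

    Simplified illustration only: this is neither a full BCP 47 parser
    nor the code path ICU uses. Assumes two-character subtags after
    "-u" are keys and longer subtags are their values.
    """
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "u" not in lowered:
        return tag, {}

    start = lowered.index("u")
    base = "-".join(subtags[:start])

    values = {}  # key -> list of value subtags
    key = None
    for sub in lowered[start + 1:]:
        if len(sub) == 1:
            # Another singleton (for example "x") ends the -u extension.
            break
        if len(sub) == 2:
            # A two-character subtag starts a new key.
            key = sub
            values[key] = []
        elif key is not None:
            # A longer subtag belongs to the most recent key.
            values[key].append(sub)

    # A key without an explicit value is treated as "true".
    settings = {k: "-".join(v) if v else "true" for k, v in values.items()}
    return base, settings


if __name__ == "__main__":
    print(split_collation_settings("en-US-u-kf-upper-kn-true"))
    print(split_collation_settings("und-u-ks-level2"))
```

The keys used in the demo ("kf" for case-first, "kn" for numeric ordering, "ks" for strength) are the ones the settings table in the patch describes; a reader who sees the tag split this way may find the "-u" marker less mysterious.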

+     ICU locales are specified as a <link linkend="icu-language-tag">Language
+     Tag</link>, but can also accept most libc-style locale names (which will
+     be transformed into language tags if possible).
+    </para>

I'd recommend removing the parenthetical:

ICU locales are specified as a BCP 47 <link 
linkend="icu-language-tag">Language
  Tag</link>, but can also accept most libc-style locale names. If 
possible, libc-style locale names are transformed into language tags.
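For readers who have not seen the transformation, the common case looks roughly like the Python sketch below. This is a deliberately naive approximation and an assumption on my part, not ICU's algorithm: the helper name libc_to_language_tag_sketch is hypothetical, it simply drops the codeset and any "@" modifier (ICU can interpret some modifiers, which this ignores), and it does not cover special names such as "C" or "POSIX".

```python
def libc_to_language_tag_sketch(name):
    """Approximate a libc-style locale name as a language tag.

    Naive illustration only; ICU's real canonicalization handles many
    more cases. Codesets (".UTF-8") and modifiers ("@euro") are dropped.
    """
    # Strip the modifier first, then the codeset.
    name = name.split("@", 1)[0].split(".", 1)[0]

    parts = name.split("_")
    tag = [parts[0].lower()]          # language, e.g. "en"
    if len(parts) > 1 and parts[1]:
        tag.append(parts[1].upper())  # region, e.g. "US"
    return "-".join(tag)


if __name__ == "__main__":
    for libc_name in ("en_US.UTF-8", "ja_JP", "fr"):
        print(libc_name, "->", libc_to_language_tag_sketch(libc_name))
```

Because the real transformation has more cases than this, specifying the language tag directly remains the more predictable option.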

+      <title>ICU Collation Levels</title>

Nothing to add here other than to say I'm extremely appreciative of this 
section. Once upon a time I sank a lot of time into trying to figure out 
how all of these levels worked.

+          Sensitivity when determining equality, with
+          <literal>level1</literal> the least sensitive and
+          <literal>identic</literal> the most sensitive. See <xref
+          linkend="icu-collation-levels"/> for details.

This discusses equality sensitivity, but I'm not sure I understand that 
term here. The ICU docs seem to call these "strengths"[1]; maybe we use 
that term to be consistent with upstream?

+          If set to <literal>upper</literal>, upper case sorts before lower
+          case. If set to <literal>lower</literal>, lower case sorts before
+          upper case. If set to <literal>false</literal>, it depends on the
+          locale.

Suggestion to tighten this up:

"If set to <literal>false</literal>, the sort depends on the rules of 
the locale."

+      Defaults may depend on locale. The above table is not meant to be
+      complete. See <xref linkend="icu-external-references"/> for additinal
+      options and details.

Typo: additinal => "additional"

> I didn't add additional documentation for ICU rules. There are so many
> options for collations that it's hard for me to think of realistic
> examples to specify the rules directly, unless someone wants to invent
> a new language. Perhaps useful if working with an interesting text file
> format with special treatment for delimiters?
> 
> I asked the question about rules here:
> 
> https://www.postgresql.org/message-id/e861ac4fdae9f9f5ce2a938a37bcb5e083f0f489.camel%40cybertec.at
> 
> and got some limited response about addressing sort complaints. That
> sounds reasonable, but a lot of that can also be handled just by
> specifying the right collation settings. Someone who understands the
> use case better could add some more documentation.

I'm not too sure about this one -- from my experience, users want 
predictability in sorts, but there are a variety of ways to get that 
experience.

Thanks,

Jonathan

[1] https://unicode-org.github.io/icu/userguide/collation/concepts.html
