Re: ICU locale validation / canonicalization - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: ICU locale validation / canonicalization
Date
Msg-id c6dd63de8267ad8267ce4ec939f72c43f1fc2020.camel@j-davis.com
Whole thread Raw
In response to Re: ICU locale validation / canonicalization  (Peter Eisentraut <peter.eisentraut@enterprisedb.com>)
Responses Re: ICU locale validation / canonicalization  (Peter Eisentraut <peter.eisentraut@enterprisedb.com>)
List pgsql-hackers
On Tue, 2023-03-21 at 10:35 +0100, Peter Eisentraut wrote:
> [PATCH v6 1/6] Support language tags in older ICU versions (53 and
>   earlier).
>
> In pg_import_system_collations(), this is now redundant and can be
> simplified:
>
> -               if (!pg_is_ascii(langtag) || !pg_is_ascii(iculocstr))
> +               if (!pg_is_ascii(langtag) || !pg_is_ascii(langtag))
>
> icu_set_collation_attributes() needs more commenting about what is
> going
> on.  My guess is that uloc_canonicalize() converts from language tag
> to
> ICU locale ID, and then the existing logic to parse that apart would
> apply.  Is that how it works?

Fixed the redundancy, added some comments, and committed 0001.

> [PATCH v6 2/6] Wrap ICU ucol_open().
>
> It makes sense to try to unify some of this.  But I find the naming
> confusing.  If I see pg_ucol_open(), then I would expect that all
> calls
> to ucol_open() would be replaced by this.  But here it's only a few,
> without explanation.  (pg_ucol_open() has no explanation at all
> AFAICT.)

The remaining callsite which doesn't use the wrapper is in initdb.c,
which can't call into pg_locale.c, and has different intentions. initdb
uses ucol_open to get the default locale if icu_locale is not
specified; and it also uses ucol open to verify that the locale can be
opened (whether specified or the default). (Aside: I created a tiny
0004 patch which makes this difference more clear and adds a nice
comment.)

There's no reason to use a wrapper when getting the default locale,
because it's just passing NULL anyway.

When verifying that the locale can be opened, ucol_open() doesn't catch
many problems anyway, so I'm not sure it's worth a lot of effort to
copy these extra checks that the wrapper does into initdb.c. For
instance, what's the value in replacing "und" with "root" if opening
either will succeed? Parsing the attributes can potentially catch
problems, but the later patch 0006 will check the attributes when
converting to a language tag at initdb time.

So I'm inclined to just leave initdb alone in patches 0002 and 0003.

> I have in my notes that check_icu_locale() and make_icu_collator()
> should be combined into a single function.  I think that would be a
> better way to slice it.

That would leave out get_collation_actual_version(), which should
handle the same fixups for attributes and the "und" locale.

> Btw., I had intentionally not written code like this
>
> +#if U_ICU_VERSION_MAJOR_NUM < 54
> +       icu_set_collation_attributes(collator, loc_str);
> +#endif
>
> The disadvantage of doing it that way is that you then need to dig
> out
> an old version of ICU in order to check whether the code compiles at
> all.  With the current code, you can be sure that that code compiles
> if
> you make changes elsewhere.

I was wondering about that -- thank you, I changed it back to use "if"
rather than "#ifdef".


New series attached (starting at 0002 to better correspond to the
previous series).


--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

pgsql-hackers by date:

Previous
From: Jeff Davis
Date:
Subject: Re: Request for comment on setting binary format output per session
Next
From: Pavel Luzanov
Date:
Subject: Re: psql: Add role's membership options to the \du+ command