On Thu, May 16, 2024 at 1:40 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 15, 2024 at 2:45 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> > On 14.05.24 16:51, Robert Haas wrote:
> > The rules are only loaded once on first use, right? I tested with
> >
> > date; for x in $(seq 1 1000); do psql -X -c "select unaccent('foobar')" -o /dev/null; done; date
> >
> > and this had the same runtime (about 8 seconds here) with and without
> > the patch.
>
> Cool. Sounds like that's not a problem.

Thanks Peter for testing, and thanks Robert for kicking this thread.

> > Btw., with the patch I get
> >
> > WARNING: duplicate source strings, first one will be used
> >
> > so it will need some adjustments in how the rules are produced.
>
> OK. Does anyone want to look into that?

I think the problem is that the new "simple redirection" rule from the
Unicode database produces some values that are also present in
Latin-ASCII.xml, and these are all tolerated as long as the "from" and
"to" strings both match, because we uniquify them as pairs. But there
is one pair where the "to" string is different, resulting in this
clash:
ℌ x
ℌ H
I think the first line might actually be a bug in CLDR data. I dunno,
but this just doesn't look right:
ℌ → x ; # 210C;BLACK-LETTER CAPITAL H (compat)
And in the tests I now see that Michael had already figured that out!
I've included a kludge to remove that. Someone should file a ticket with CLDR.
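
To make the uniquify-as-pairs behaviour concrete, here is a rough sketch of the idea. It is not the code from the patch: the function name find_conflicts and the sample rule list are made up for illustration, and I'm assuming the "x" mapping is the one from Latin-ASCII.xml and the "H" mapping is the one derived from the Unicode database. Identical (from, to) pairs collapse silently, and only a source string left with more than one distinct target is reported.

```python
from collections import defaultdict


def find_conflicts(rules):
    """Return {source: [targets]} for sources with more than one distinct target.

    `rules` is an iterable of (from, to) string pairs. This is a hypothetical
    helper for illustration, not part of the actual rule generator.
    """
    unique_pairs = set(rules)  # identical (from, to) pairs collapse here
    targets = defaultdict(set)
    for src, dst in unique_pairs:
        targets[src].add(dst)
    return {src: sorted(dsts) for src, dsts in targets.items() if len(dsts) > 1}


# Sample data (assumed, for illustration only).
rules = [
    ("\u210C", "x"),  # assumed: from Latin-ASCII.xml, the suspicious one
    ("\u210C", "H"),  # assumed: from the Unicode database
    ("\u00E9", "e"),  # same pair from both sources: tolerated
    ("\u00E9", "e"),
]

conflicts = find_conflicts(rules)
for src, dsts in conflicts.items():
    print(f"U+{ord(src):04X} maps to {dsts}")
```

With a check like that, only the U+210C entry would be flagged, which matches the single clash above.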