Re: BUG #18362: unaccent rules and Old Greek text - Mailing list pgsql-bugs

From Robert Haas
Subject Re: BUG #18362: unaccent rules and Old Greek text
Msg-id CA+TgmobZvFiTHhnO=jBqffMY=5OTB7R=jAkVRRR13yQG11UnvQ@mail.gmail.com
In response to Re: BUG #18362: unaccent rules and Old Greek text  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-bugs
On Thu, Feb 29, 2024 at 8:53 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Feb 27, 2024 at 1:33 AM Cees van Zeeland
> <cees.van.zeeland@freedom.nl> wrote:
> > I'm not an expert, but obviously computers make a difference between the two versions of the characters.
> > We are talking about this series:
> > U+1F70 - U+1F7D:    ὰ     ά     ὲ     έ     ὴ     ή     ὶ     ί     ὸ     ό     ὺ     ύ     ὼ     ώ
> > Is it possible to filter / limit in some way the redirection in the script to this range?
>
> Right, so to get this in we either need to decide that we're OK with
> adding that many characters, or figure out some systematic way to
> select just the ones we want.  One hint that might be helpful if
> someone wants to investigate: I suspect that a lot of those mappings
> might be marked with <font>, which seems to be for code points for
> alternative renderings ("mathematical" bold, italic, fraktur etc), so
> perhaps we could filter them out that way without losing the
> oxia-marked characters if that's the way it has to be.

There's a CommitFest entry for this thread at
https://commitfest.postgresql.org/48/4873/ which is set to "Needs
Review," but does it, really? There seem to basically be two related
issues here. One is whether all of the mappings that the proposed
change would create are correct; perhaps some of them are superfluous,
or even wrong. The other is whether adding a lot of mappings is going
to cause a problem with rule file load times.

It's true that a reviewer could look into these questions, but I think
what we need here is more like a volunteer to be a co-author or
co-sponsor of the patch. When someone asks me to review a patch, I
normally expect that they already have good reasons to believe it is
correct, and that what they want from me is to know whether I agree.
It doesn't seem like this has progressed to that stage, so I kind of
wonder whether it ought to be evicted from the CommitFest entirely
until it does. This perhaps
sounds mean-spirited, but our CommitFest application contains an awful
lot of things that aren't actually actionable, and it makes it hard to
find the things that are, so trying to improve that situation is the
motivation here. On the other hand, maybe this is actionable after all
and we just need to make a decision and do something, so let's look at
the practical questions.

1. The question of rule file load times seems like something that
anyone who could compile PostgreSQL with and without a patch applied
could test in under an hour. They could then report the results that
they got, and people here could judge whether the resulting numbers
are totally cool or very sad or something in between. Anyone willing
to do that?

2. The question of which mappings we actually ought to be adding seems
a lot harder, because it's not altogether clear what it means to
"remove an accent". The proposed patch adds a whole lot of rules that
turn tiny little characters into full-sized characters, and boldfaced
and/or italicized and/or otherwise-fancily-printed characters into
plain ones. Only a handful of the changes actually add rules that
specifically *remove an accent*, but there are
similar rules that already exist, like turning ⅐ into the
four-character sequence " 1/7" and blocky-looking versions of each
letter into standard versions and ㍱ into the three-character sequence
"hPa". So my naive guess would be that we want all of these rules,
even though you would not guess from the unaccent documentation that
it's supposed to do stuff like this. But my knowledge of languages
other than English is very limited, and I am not a user of unaccent
and never have been, so I am reluctant to make grand pronouncements.
Does anyone more knowledgeable want to opine?
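
For anyone who wants to poke at this, here is a minimal sketch of how
the different kinds of mappings could be told apart, using only
Python's standard unicodedata module. It is not what the rule
generation script in the tree does; the function names
decomposition_tag and strip_marks are made up for illustration. The
assumption is that the compatibility tag on a character's
decomposition (<font>, <fraction>, <square>, <wide> and so on) is a
usable signal, while an untagged (canonical) decomposition is what the
oxia-marked Greek characters have.

```python
import unicodedata


def decomposition_tag(ch):
    """Return the compatibility tag of ch's decomposition (e.g. 'font'),
    'canonical' for an untagged decomposition, or None if ch has none."""
    decomp = unicodedata.decomposition(ch)
    if not decomp:
        return None
    first = decomp.split()[0]
    if first.startswith("<") and first.endswith(">"):
        return first[1:-1]
    return "canonical"


def strip_marks(ch):
    """Fully decompose ch (NFD) and drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


# Hypothetical sample: one oxia-marked letter, one mathematical bold
# letter, the fraction and the squared unit mentioned above, and one
# fullwidth letter.
samples = ["\u1f71", "\U0001d400", "\u2150", "\u3371", "\uff21"]
for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {decomposition_tag(ch)}")
```

If that assumption holds, skipping characters whose tag is "font"
would drop the mathematical bold/italic/fraktur alphabets without
touching U+1F70 - U+1F7D. Whether the other tagged classes should stay
is exactly the open question above.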

--
Robert Haas
EDB: http://www.enterprisedb.com


