Re: BUG #18362: unaccent rules and Old Greek text - Mailing list pgsql-bugs
From | Robert Haas |
---|---|
Subject | Re: BUG #18362: unaccent rules and Old Greek text |
Date | |
Msg-id | CA+TgmobZvFiTHhnO=jBqffMY=5OTB7R=jAkVRRR13yQG11UnvQ@mail.gmail.com |
In response to | Re: BUG #18362: unaccent rules and Old Greek text (Thomas Munro <thomas.munro@gmail.com>) |
Responses | Re: BUG #18362: unaccent rules and Old Greek text (three replies) |
List | pgsql-bugs |
On Thu, Feb 29, 2024 at 8:53 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Feb 27, 2024 at 1:33 AM Cees van Zeeland
> <cees.van.zeeland@freedom.nl> wrote:
> > I'm not an expert, but obviously computers make a difference between the two versions of the characters.
> > We are talking about this series:
> > U+1F70 - U+1F7D: ὰ ά ὲ έ ὴ ή ὶ ί ὸ ό ὺ ύ ὼ ώ
> > Is it possible to filter / limit in some way the redirection in the script to this range?
>
> Right, so to get this in we either need to decide that we're OK with
> adding that many characters, or figure out some systematic way to
> select just the ones we want. One hint that might be helpful if
> someone wants to investigate: I suspect that a lot of those mappings
> might be marked with <font>, which seems to be for code points for
> alternative renderings ("mathematical" bold, italic, fraktur etc), so
> perhaps we could filter them out that way without losing the
> oxia-marked characters if that's the way it has to be.

There's a CommitFest entry for this thread at
https://commitfest.postgresql.org/48/4873/ which is set to "Needs
Review," but does it, really?

There seem to basically be two related issues here. One is whether all
of the mappings that the proposed change would create are correct;
perhaps some of them are superfluous, or even wrong. The other is
whether adding a lot of mappings is going to cause a problem with rule
file load times. It's true that a reviewer could look into these
questions, but I think what we need here is almost more like a
volunteer to be a co-author or co-sponsor of the patch, because I
normally expect that when someone asks me to review a patch, they
believe they've got something that they already have good reasons to
believe is correct and what they want from me is to know whether I
agree. And it doesn't seem like this has progressed to that stage, so
I kind of wonder whether it ought to be evicted from the CommitFest
entirely until it does.

This perhaps sounds mean-spirited, but our CommitFest application
contains an awful lot of things that aren't actually actionable, and
it makes it hard to find the things that are, so trying to improve
that situation is the motivation here.

On the other hand, maybe this is actionable after all and we just need
to make a decision and do something, so let's look at the practical
questions.

1. The question of rule file load times seems like something that
anyone who could compile PostgreSQL with and without a patch applied
could test in under an hour. They could then report the results that
they got, and people here could judge whether the resulting numbers
are totally cool or very sad or something in between. Anyone willing
to do that?

2. The question of which mappings we actually ought to be adding seems
a lot harder, because it's not altogether clear what it means to
"remove an accent". The proposed patch adds a whole lot of rules that
turn tiny little characters into full-sized characters, boldfaced
and/or italicized and/or otherwise-fancily-printed characters into
full-sized characters. Only a handful of the changes are actually
adding rules that specifically *remove an accent*, but there are
similar rules that already exist, like turning ⅐ into the
four-character sequence " 1/7" and blocky-looking versions of each
letter into standard versions and ㍱ into the three-character sequence
"hPa". So my naive guess would be that we want all of these rules,
even though you would not guess from the unaccent documentation that
it's supposed to do stuff like this. But my knowledge of languages
other than English is very limited, and I am not a user of unaccent
and never have been, so I am reluctant to make grand pronouncements.
Does anyone more knowledgeable want to opine?

--
Robert Haas
EDB: http://www.enterprisedb.com
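
For anyone who wants to poke at the <font> hint quoted above, here is a
minimal sketch using Python's standard unicodedata module. It is not the
script that generates the unaccent rules, and the helper names
(decomposition_tag, strip_marks, keep_mapping) are hypothetical; it only
illustrates how the Unicode decomposition tag might separate the
"mathematical" letter variants from the oxia-marked Greek characters in
U+1F70 - U+1F7D.

```python
import unicodedata


def decomposition_tag(ch):
    """Return the compatibility tag of ch's decomposition (e.g. 'font'),
    or None when the decomposition is canonical or there is none."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        return decomp[1:decomp.index(">")]
    return None


def strip_marks(ch):
    """Canonically decompose ch (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def keep_mapping(ch, excluded_tags=("font",)):
    """Hypothetical filter: reject code points whose decomposition
    carries one of the excluded tags. The choice of tags is an assumption."""
    return decomposition_tag(ch) not in excluded_tags


if __name__ == "__main__":
    # The Greek range from the bug report, plus three comparison points:
    # a mathematical bold letter, the 1/7 fraction, and the hPa square.
    samples = [chr(cp) for cp in range(0x1F70, 0x1F7E)]
    samples += ["\U0001D400", "\u2150", "\u3371"]
    for ch in samples:
        print(
            f"U+{ord(ch):04X}",
            unicodedata.name(ch),
            decomposition_tag(ch),
            keep_mapping(ch),
            repr(strip_marks(ch)),
        )
```

With this filter the characters in U+1F70 - U+1F7D are all kept, since
their decompositions are canonical and carry no tag, while U+1D400
(MATHEMATICAL BOLD CAPITAL A) is rejected because its decomposition is
tagged <font>. Note that ⅐ and ㍱ carry different tags (<fraction> and
<square>), so excluding only <font> would not touch rules of that kind.
Whether that is the right set to exclude is exactly the open question
in point 2.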