Thread: BUG #18362: unaccent rules and Old Greek text
The following bug has been logged on the website:

Bug reference: 18362
Logged by: Cees van Zeeland
Email address: cees.van.zeeland@freedom.nl
PostgreSQL version: 15.6
Operating system: Windows 11

Description:

I am using a PostgreSQL server 15.6-1 with UTF-8 and am struggling with the unaccent extension and "Old Greek" characters. To reproduce the behaviour I encountered, try this:

1. Create a table with one text field:

CREATE TABLE IF NOT EXISTS public.test
(
    entry text COLLATE pg_catalog."default" NOT NULL,
    CONSTRAINT test_pkey PRIMARY KEY (entry)
);

2. Insert the following Greek words, whose vowels carry stress accents, or import the CSV file with the same items:

ἀνήρ (== man)
πέντε (== five)
γίγας (== giant)
γράφω (== write)
δύο (== two)
ἐγώ (== I)
θεός (== god)

3. Create the following view for searching:

CREATE OR REPLACE VIEW public.test_view AS
SELECT test.entry,
       COALESCE(array_to_string(ts_lexize('unaccent'::regdictionary,
                                          replace(test.entry, 'ς'::text, 'σ'::text)), ''::text),
                replace(test.entry, 'ς'::text, 'σ'::text)) AS search_entry
FROM test
ORDER BY test.entry;

4. Check whether it works:

SELECT entry, search_entry FROM public.test_view;

The result shows that not all diacritics are removed. When I search in unaccent.rules I see, around line 530, characters that look the same but are in fact different, e.g. Greek Small Letter Epsilon with Tonos versus Greek Small Letter Epsilon with Oxia. I found a discussion about this subject here: https://ibiblio.org/bgreek/forum/viewtopic.php?t=4170

So there are reasons to keep the current unaccent.rules as it is, but there are also reasons to add a few lines to it, e.g. after line 955, inserting the Greek vowels with Oxia. Please add:

ά α
έ ε
ή η
ί ι
ό ο
ύ υ
ώ ω

It would solve the problem and make searching through Old Greek texts a lot easier.

Thanks for your help,
Cees van Zeeland
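To make the reported behaviour concrete: the two accented forms really are distinct code points that merely render alike. A minimal Python sketch (not part of the original report) using only the standard unicodedata module demonstrates this:

# Hedged illustration of the report: the "tonos" and "oxia" precomposed
# letters are distinct code points, although most fonts draw them the same.
import unicodedata

tonos = "\u03AC"  # GREEK SMALL LETTER ALPHA WITH TONOS (mapped in unaccent.rules)
oxia = "\u1F71"   # GREEK SMALL LETTER ALPHA WITH OXIA (not mapped)

print(tonos == oxia)            # False: different code points
print(unicodedata.name(tonos))  # GREEK SMALL LETTER ALPHA WITH TONOS
print(unicodedata.name(oxia))   # GREEK SMALL LETTER ALPHA WITH OXIA

# Both are canonically equivalent: they NFD-normalize to the same sequence,
# alpha (U+03B1) followed by COMBINING ACUTE ACCENT (U+0301).
assert unicodedata.normalize("NFD", tonos) == unicodedata.normalize("NFD", oxia)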
On Sun, Feb 25, 2024 at 11:14 AM PG Bug reporting form <noreply@postgresql.org> wrote:
> So there are reasons to keep the current unaccent.rules as it is, but
> there are also reasons to add a few lines to it, e.g. after line 955,
> inserting the Greek vowels with Oxia. Please add:
> ά α
> έ ε
> ή η
> ί ι
> ό ο
> ύ υ
> ώ ω

Hi,

We don't exactly maintain this list manually; we extract it from Unicode source data. Can you see what needs to be adjusted in here to achieve that goal?

https://github.com/postgres/postgres/blob/master/contrib/unaccent/generate_unaccent_rules.py

Perhaps a new range or something like that?
On Sun, Feb 25, 2024 at 4:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sun, Feb 25, 2024 at 11:14 AM PG Bug reporting form
> <noreply@postgresql.org> wrote:
> > So there are reasons to keep the current unaccent.rules as it is, but
> > there are also reasons to add a few lines to it, e.g. after line 955,
> > inserting the Greek vowels with Oxia. Please add:
> > ά α

Oh, I think I see it. "ά" is:

1F71;GREEK SMALL LETTER ALPHA WITH OXIA;Ll;0;L;03AC;;;;N;;;1FBB;;1FBB

The Python script is looking for combining sequences that add accents, but this one has just "03AC" in the decomposition field, so it's a kind of "simple" redirection that points here:

03AC;GREEK SMALL LETTER ALPHA WITH TONOS;Ll;0;L;03B1 0301;;;;N;GREEK SMALL LETTER ALPHA TONOS;;0386;;0386

That one has a normal-looking sequence that we can understand (α + an accent). If I tell the script to follow such "simple" redirections, I get over a thousand new mappings, including those. See attached. There is probably more correct terminology than what I'm using here...
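For readers following along, here is a rough, hedged sketch of the idea being described: recursively follow the canonical decomposition field until only a base letter plus combining marks remain. It uses Python's standard unicodedata module rather than the script's own UnicodeData.txt parser, and the function name is made up for illustration:

# A sketch (not the actual patch) of following "simple" redirections:
# unicodedata.decomposition() exposes the raw UnicodeData.txt decomposition
# field, e.g. "03AC" for U+1F71 and "03B1 0301" for U+03AC.
import unicodedata

def fully_decompose(codepoint):
    """Follow canonical decompositions recursively; stop at compat (<tag>) ones."""
    decomp = unicodedata.decomposition(chr(codepoint))
    if not decomp or decomp.startswith("<"):
        return [codepoint]
    result = []
    for field in decomp.split():
        result.extend(fully_decompose(int(field, 16)))
    return result

# U+1F71 (ALPHA WITH OXIA) -> U+03AC (ALPHA WITH TONOS) -> U+03B1 U+0301
print([hex(cp) for cp in fully_decompose(0x1F71)])  # ['0x3b1', '0x301']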
On Sun, Feb 25, 2024 at 04:21:36PM +1300, Thomas Munro wrote:
> On Sun, Feb 25, 2024 at 11:14 AM PG Bug reporting form
> <noreply@postgresql.org> wrote:
>> So there are reasons to keep the current unaccent.rules as it is, but
>> there are also reasons to add a few lines to it, e.g. after line 955,
>> inserting the Greek vowels with Oxia. Please add:
>> ά α
>> έ ε
>> ή η
>> ί ι
>> ό ο
>> ύ υ
>> ώ ω

Correct me if I'm wrong, of course, but reading a bit on the matter at [1], letters with Tonos or Oxia have actually been equivalent since 1986, and we only include the characters with Tonos in our unaccent.rules.

> We don't exactly maintain this list manually; we extract it from
> Unicode source data. Can you see what needs to be adjusted in here to
> achieve that goal?

See commits like e3dd7c06e627 or 59f47fb98dab for some references. Unfortunately, our policy has been not to backpatch any changes to the in-core rules file, though you can plug in your own file. That said, these additions sound like a natural addition seen from here.

> Perhaps a new range or something like that?

It seems to me that it is a bit more complicated than that, because UnicodeData.txt decomposes the characters with Oxia as characters with Tonos, and then the characters with Tonos are decomposed into the "base" alphabet characters + Tonos. We do a recursive lookup at the Unicode table in get_plain_letter() and is_letter_with_marks(), so it seems to me that we're not missing much, and I suspect that there should be no need for a new custom range of characters.

Cees, perhaps you would like to take a shot at that?

[1]: https://en.wikipedia.org/wiki/Greek_diacritics#Unicode
--
Michael
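For users stuck on released branches, Michael's "plug in your own file" remark suggests a workaround. The following is only a hedged sketch: the file name my_unaccent.rules, the target directory, and the chosen range are illustrative assumptions, not an official recipe.

# Sketch: generate a small supplementary rules file mapping the Greek vowels
# with varia/oxia (U+1F70..U+1F7D) to their plain base letters, suitable for
# appending to a copy of unaccent.rules under $SHAREDIR/tsearch_data.
import unicodedata

with open("my_unaccent.rules", "w", encoding="utf-8") as f:
    for cp in range(0x1F70, 0x1F7E):  # ὰ ά ὲ έ ὴ ή ὶ ί ὸ ό ὺ ύ ὼ ώ
        base = unicodedata.normalize("NFD", chr(cp))[0]  # strip the marks
        f.write(f"{chr(cp)}\t{base}\n")

# Then point a dictionary at the custom file (name given without extension):
#   CREATE TEXT SEARCH DICTIONARY my_unaccent
#       (TEMPLATE = unaccent, RULES = 'my_unaccent');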
On Mon, Feb 26, 2024 at 12:15:57PM +1300, Thomas Munro wrote:
> The Python script is looking for combining sequences that add accents,
> but this one has just "03AC" in the decomposition field, so it's
> a kind of "simple" redirection that points here:
>
> 03AC;GREEK SMALL LETTER ALPHA WITH TONOS;Ll;0;L;03B1 0301;;;;N;GREEK
> SMALL LETTER ALPHA TONOS;;0386;;0386
>
> That one has a normal-looking sequence that we can understand (α + an
> accent). If I tell the script to follow such "simple" redirections, I
> get over a thousand new mappings, including those. See attached.
> There is probably more correct terminology than what I'm using here...

Ah, you've beaten me to it. Yes, that's pretty much the impression I was getting while looking at the set of characters in UnicodeData.txt. I am not entirely sure whether what you are doing is the best way to do it, but the set of characters generated in unaccent.rules makes sense here. I am surprised to see that many, TBH.

Perhaps you should add a few characters from these series to unaccent.sql?
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
> On Mon, Feb 26, 2024 at 12:15:57PM +1300, Thomas Munro wrote:
>> That one has a normal-looking sequence that we can understand (α + an
>> accent). If I tell the script to follow such "simple" redirections, I
>> get over a thousand new mappings, including those. See attached.
>> There is probably more correct terminology than what I'm using here...

> Ah, you've beaten me to it. Yes, that's pretty much the impression I
> was getting while looking at the set of characters in UnicodeData.txt.
> I am not entirely sure whether what you are doing is the best way to do
> it, but the set of characters generated in unaccent.rules makes sense
> here. I am surprised to see that many, TBH.

There are only about 1650 lines in our standard unaccent.rules file today. Are we concerned about adding so many more? I don't think the trie lookup logic would be slowed down any, but the time to load the rules file might take a hit.

regards, tom lane
Thomas Munro <thomas.munro@gmail.com> wrote:
> If I tell the script to follow such "simple" redirections, I
> get over a thousand new mappings, including those. See attached.
> There is probably more correct terminology than what I'm using here...

Michael Paquier <michael@paquier.xyz> wrote:
> UnicodeData.txt decomposes the characters with Oxia as characters with
> Tonos, and then the characters with Tonos are decomposed into the "base"
> alphabet characters + Tonos. We do a recursive lookup at the Unicode
> table in get_plain_letter() and is_letter_with_marks(), so it seems to
> me that we're not missing much, and I suspect that there should be no
> need for a new custom range of characters.
>
> Cees, perhaps you would like to take a shot at that?
>
> [1]: https://en.wikipedia.org/wiki/Greek_diacritics#Unicode
I'm not an expert, but computers obviously distinguish between the two versions of these characters.
We are talking about this series:
U+1F70 - U+1F7D: ὰ ά ὲ έ ὴ ή ὶ ί ὸ ό ὺ ύ ὼ ώ
Is it possible to filter or limit the redirection in the script to this range in some way?
~
Cees
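In case it helps frame Cees's question, a hypothetical range filter could look like the sketch below; the function name and the idea of gating redirections on a code-point range are illustrative only, not how generate_unaccent_rules.py is actually structured:

# Hypothetical sketch of limiting "simple" redirections to this one range.
GREEK_VARIA_OXIA = range(0x1F70, 0x1F7E)  # ὰ ά ὲ έ ὴ ή ὶ ί ὸ ό ὺ ύ ὼ ώ

def follow_redirection(codepoint):
    """Only chase single-code-point decompositions inside the Greek range."""
    return codepoint in GREEK_VARIA_OXIA

print(follow_redirection(0x1F71))  # True: ά with oxia
print(follow_redirection(0x210C))  # False: ℌ would be left alone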
On Tue, Feb 27, 2024 at 1:33 AM Cees van Zeeland <cees.van.zeeland@freedom.nl> wrote:
> I'm not an expert, but computers obviously distinguish between the two versions of these characters.
> We are talking about this series:
> U+1F70 - U+1F7D: ὰ ά ὲ έ ὴ ή ὶ ί ὸ ό ὺ ύ ὼ ώ
> Is it possible to filter or limit the redirection in the script to this range in some way?

Right, so to get this in we either need to decide that we're OK with adding that many characters, or figure out some systematic way to select just the ones we want. One hint that might be helpful if someone wants to investigate: I suspect that a lot of those mappings might be marked with <font>, which seems to be for code points for alternative renderings ("mathematical" bold, italic, fraktur, etc.), so perhaps we could filter them out that way without losing the oxia-marked characters, if that's the way it has to be.

I think all the relevant parts of the character database file are described here:

https://unicode.org/reports/tr44/#Property_Values

The file we're currently using is 15.1:

https://www.unicode.org/Public/15.1.0/ucd/UnicodeData.txt

I registered this thread as https://commitfest.postgresql.org/47/4873/ .
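The <font> hint can be checked directly against the Unicode data. As a hedged sketch (again using the standard unicodedata module, not the script itself), the tag at the start of the decomposition field distinguishes canonical redirections like the oxia letters from <font>-style compatibility variants:

# Sketch: classify a code point by its UnicodeData.txt decomposition tag.
import unicodedata

def decomposition_tag(codepoint):
    decomp = unicodedata.decomposition(chr(codepoint))
    if not decomp:
        return None                # no decomposition at all
    if decomp.startswith("<"):
        return decomp.split()[0]   # e.g. "<font>", "<fraction>"
    return "<canonical>"           # untagged = canonical mapping

print(decomposition_tag(0x1F71))   # <canonical>: ά with oxia -> ά with tonos
print(decomposition_tag(0x1D400))  # <font>: 𝐀 MATHEMATICAL BOLD CAPITAL A
print(decomposition_tag(0x210C))   # <font>: ℌ BLACK-LETTER CAPITAL H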
Hi Thomas,

I found:

https://www.unicode.org/Public/15.1.0/ucd/CompositionExclusions.txt

which might be useful for tackling the characters we are searching for. Hope this helps.

Cees

On 01/03/2024 02:53, Thomas Munro wrote:
> On Tue, Feb 27, 2024 at 1:33 AM Cees van Zeeland
> <cees.van.zeeland@freedom.nl> wrote:
>> I'm not an expert, but computers obviously distinguish between the two versions of these characters.
>> We are talking about this series:
>> U+1F70 - U+1F7D: ὰ ά ὲ έ ὴ ή ὶ ί ὸ ό ὺ ύ ὼ ώ
>> Is it possible to filter or limit the redirection in the script to this range in some way?
> Right, so to get this in we either need to decide that we're OK with
> adding that many characters, or figure out some systematic way to
> select just the ones we want. One hint that might be helpful if
> someone wants to investigate: I suspect that a lot of those mappings
> might be marked with <font>, which seems to be for code points for
> alternative renderings ("mathematical" bold, italic, fraktur, etc.), so
> perhaps we could filter them out that way without losing the
> oxia-marked characters, if that's the way it has to be.
>
> I think all the relevant parts of the character database file are described here:
>
> https://unicode.org/reports/tr44/#Property_Values
>
> The file we're currently using is 15.1:
>
> https://www.unicode.org/Public/15.1.0/ucd/UnicodeData.txt
>
> I registered this thread as https://commitfest.postgresql.org/47/4873/ .
On Thu, Feb 29, 2024 at 8:53 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Feb 27, 2024 at 1:33 AM Cees van Zeeland
> <cees.van.zeeland@freedom.nl> wrote:
> > Is it possible to filter or limit the redirection in the script to this range in some way?
>
> Right, so to get this in we either need to decide that we're OK with
> adding that many characters, or figure out some systematic way to
> select just the ones we want.

There's a CommitFest entry for this thread at https://commitfest.postgresql.org/48/4873/ which is set to "Needs Review" — but does it really need review? There seem to be basically two related issues here. One is whether all of the mappings that the proposed change would create are correct; perhaps some of them are superfluous, or even wrong. The other is whether adding a lot of mappings is going to cause a problem with rule-file load times.

It's true that a reviewer could look into these questions, but I think what we need here is almost more like a volunteer to be a co-author or co-sponsor of the patch. I normally expect that when someone asks me to review a patch, they believe they've got something they already have good reasons to believe is correct, and what they want from me is to know whether I agree. It doesn't seem like this has progressed to that stage, so I rather wonder whether it ought to be evicted from the CommitFest entirely until it does. This perhaps sounds mean-spirited, but our CommitFest application contains an awful lot of things that aren't actually actionable, which makes it hard to find the things that are, so trying to improve that situation is the motivation here.

On the other hand, maybe this is actionable after all and we just need to make a decision and do something, so let's look at the practical questions.

1. The question of rule-file load times seems like something that anyone who could compile PostgreSQL with and without the patch applied could test in under an hour. They could then report the results they got, and people here could judge whether the resulting numbers are totally cool or very sad or something in between. Anyone willing to do that?

2. The question of which mappings we actually ought to be adding seems a lot harder, because it's not altogether clear what it means to "remove an accent". The proposed patch adds a whole lot of rules that turn tiny little characters into full-sized characters, and boldfaced and/or italicized and/or otherwise fancily printed characters into full-sized characters. Only a handful of the changes actually add rules that specifically *remove an accent*, but similar rules already exist, like turning ⅐ into the four-character sequence " 1/7", blocky-looking versions of each letter into standard versions, and ㍱ into the three-character sequence "hPa". So my naive guess would be that we want all of these rules, even though you would not guess from the unaccent documentation that it's supposed to do stuff like this.

But my knowledge of languages other than English is very limited, and I am not a user of unaccent and never have been, so I am reluctant to make grand pronouncements. Does anyone more knowledgeable want to opine?

--
Robert Haas
EDB: http://www.enterprisedb.com
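The existing examples Robert cites can be traced to compatibility decompositions with tags other than <font>; a quick, hedged check with Python's standard unicodedata module shows where they come from:

# The " 1/7" and "hPa" rules mirror Unicode compatibility decompositions.
import unicodedata

print(unicodedata.decomposition("\u2150"))  # ⅐: '<fraction> 0031 2044 0037'
print(unicodedata.decomposition("\u3371"))  # ㍱: '<square> 0068 0050 0061'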
On Tue, 2024-05-14 at 10:51 -0400, Robert Haas wrote:
> The question of which mappings we actually ought to be adding seems
> a lot harder, because it's not altogether clear what it means to
> "remove an accent". The proposed patch adds a whole lot of rules that
> turn tiny little characters into full-sized characters, and boldfaced
> and/or italicized and/or otherwise fancily printed characters into
> full-sized characters. Only a handful of the changes actually add
> rules that specifically *remove an accent*, but similar rules already
> exist, like turning ⅐ into the four-character sequence " 1/7",
> blocky-looking versions of each letter into standard versions, and ㍱
> into the three-character sequence "hPa". So my naive guess would be
> that we want all of these rules, even though you would not guess from
> the unaccent documentation that it's supposed to do stuff like this.
> But my knowledge of languages other than English is very limited, and
> I am not a user of unaccent and never have been, so I am reluctant to
> make grand pronouncements. Does anyone more knowledgeable want to
> opine?

I am not necessarily more knowledgeable, but I'll opine anyway.

As a German speaker, I wouldn't call the dieresis on "ü" an accent like the French é, è or ê, even though the current implementation of unaccent() turns it into a "u". And while most people would agree that the caret on â is an accent in the French language, I am not sure whether the same holds in Vietnamese. And I cannot see how ⅐ could be considered an accent...

Perhaps if we invented a function called convert_to_ascii() or similar instead of shoving all that into unaccent(), it would make more sense.

Yours,
Laurenz Albe
On 14.05.24 16:51, Robert Haas wrote:
> 1. The question of rule-file load times seems like something that
> anyone who could compile PostgreSQL with and without the patch applied
> could test in under an hour. They could then report the results they
> got, and people here could judge whether the resulting numbers are
> totally cool or very sad or something in between. Anyone willing to do
> that?

The rules are only loaded once, on first use, right? I tested with

date; for x in $(seq 1 1000); do psql -X -c "select unaccent('foobar')" -o /dev/null; done; date

and this had the same runtime (about 8 seconds here) with and without the patch.

Btw., with the patch I get

WARNING: duplicate source strings, first one will be used

so it will need some adjustments in how the rules are produced.
On 14.05.24 16:51, Robert Haas wrote:
> 2. The question of which mappings we actually ought to be adding seems
> a lot harder, because it's not altogether clear what it means to
> "remove an accent". The proposed patch adds a whole lot of rules that
> turn tiny little characters into full-sized characters, and boldfaced
> and/or italicized and/or otherwise fancily printed characters into
> full-sized characters. Only a handful of the changes actually add
> rules that specifically *remove an accent*, but similar rules already
> exist, like turning ⅐ into the four-character sequence " 1/7",
> blocky-looking versions of each letter into standard versions, and ㍱
> into the three-character sequence "hPa". So my naive guess would be
> that we want all of these rules, even though you would not guess from
> the unaccent documentation that it's supposed to do stuff like this.

unaccent actually does both accent removal and ligature expansion. (This is documented.) The cases you show above are ligature expansions. You can also run generate_unaccent_rules.py with --no-ligatures, and then you get a smaller list that indeed looks more like just accent removal. It does look like whatever it considers a ligature can produce some unintuitive results, though.
On Wed, May 15, 2024 at 2:45 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> On 14.05.24 16:51, Robert Haas wrote:
> The rules are only loaded once, on first use, right? I tested with
>
> date; for x in $(seq 1 1000); do psql -X -c "select unaccent('foobar')" -o /dev/null; done; date
>
> and this had the same runtime (about 8 seconds here) with and without
> the patch.

Cool. Sounds like that's not a problem.

> Btw., with the patch I get
>
> WARNING: duplicate source strings, first one will be used
>
> so it will need some adjustments in how the rules are produced.

OK. Does anyone want to look into that?

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, May 15, 2024 at 3:01 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> unaccent actually does both accent removal and ligature expansion.
> (This is documented.) The cases you show above are ligature expansions.

Ah. Searching the unaccent documentation for the word "ligature", I found where this is mentioned. But at the top of the page, it says only that "unaccent is a text search dictionary that removes accents (diacritic signs) from lexemes". Perhaps we should add a mention of ligature expansion there as well.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, May 16, 2024 at 1:40 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 15, 2024 at 2:45 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> > and this had the same runtime (about 8 seconds here) with and without
> > the patch.
>
> Cool. Sounds like that's not a problem.

Thanks Peter for testing, and thanks Robert for kicking this thread.

> > Btw., with the patch I get
> >
> > WARNING: duplicate source strings, first one will be used
> >
> > so it will need some adjustments in how the rules are produced.
>
> OK. Does anyone want to look into that?

I think the problem is that the new "simple redirection" rule from the Unicode database produces some values that are also present in Latin-ASCII.xml. These are all tolerated as long as the "from" and "to" strings both match, because we uniquify them as pairs, but there is one pair where the "to" string is different, resulting in this clash:

ℌ x
ℌ H

I think the first line might actually be a bug in the CLDR data. I dunno, but this just doesn't look right:

ℌ → x ; # 210C;BLACK-LETTER CAPITAL H (compat)

And in the tests I now see that Michael had already figured that out! I've included a kludge to remove that. Someone should file a ticket with CLDR.
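A hedged sketch of the pair-level deduplication being described (illustrative names; the script's real bookkeeping differs) shows how agreeing pairs collapse silently while the one conflicting target surfaces:

# Sketch: merge mappings from two sources, tolerating exact duplicates and
# flagging conflicting targets for the same source string (first one wins).
def merge_mappings(pairs):
    merged = {}
    for src, trg in pairs:
        if src in merged and merged[src] != trg:
            print(f"conflict for {src!r}: keeping {merged[src]!r}, dropping {trg!r}")
            continue
        merged[src] = trg
    return merged

# ℌ -> "x" comes from Latin-ASCII.xml (the suspected CLDR bug); ℌ -> "H"
# comes from the Unicode decomposition.
print(merge_mappings([("\u210C", "x"), ("\u210C", "H")]))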
On 18.05.24 11:36, Thomas Munro wrote:
>>> WARNING: duplicate source strings, first one will be used
>>>
>>> so it will need some adjustments in how the rules are produced.
>>
>> OK. Does anyone want to look into that?
>
> I think the problem is that the new "simple redirection" rule from the
> Unicode database produces some values that are also present in
> Latin-ASCII.xml. These are all tolerated as long as the "from" and
> "to" strings both match, because we uniquify them as pairs, but there
> is one pair where the "to" string is different, resulting in this
> clash:
>
> ℌ x
> ℌ H
>
> I think the first line might actually be a bug in the CLDR data. I
> dunno, but this just doesn't look right:
>
> ℌ → x ; # 210C;BLACK-LETTER CAPITAL H (compat)
>
> And in the tests I now see that Michael had already figured that out!
> I've included a kludge to remove that. Someone should file a ticket with CLDR.

Done: https://unicode-org.atlassian.net/browse/CLDR-17656
On Sat, May 18, 2024 at 5:37 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> And in the tests I now see that Michael had already figured that out!
> I've included a kludge to remove that. Someone should file a ticket with CLDR.

I think you should update the comment that says "a mistake?" to instead link to the CLDR issue that Peter filed. Beyond that, I'm not sure this needs any other changes. I can't actually testify to the correctness of the Python code, but the results look sane, so hey, why not?

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, May 22, 2024 at 12:47:37PM -0400, Robert Haas wrote:
> On Sat, May 18, 2024 at 5:37 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> And in the tests I now see that Michael had already figured that out!
>> I've included a kludge to remove that. Someone should file a ticket with CLDR.

That was some time ago... I was not sure back then how to handle that with the upstream data, so thanks for the bug report and the pointers. I'll try to remember that.

> I think you should update the comment that says "a mistake?" to
> instead link to the CLDR issue that Peter filed. Beyond that, I'm
> not sure this needs any other changes. I can't actually testify to the
> correctness of the Python code, but the results look sane, so hey, why
> not?

+1 for the comment refresh, and for keeping the test.

+        if src == "ℌ":
+            # a mistake?
+            continue

Perhaps this should use the code point rather than the non-ASCII character in the script.

Another thing would be to add some tests covering the new characters that get a conversion. Just a few of them in the new ranges, checking the recursive case with is_letter_with_marks(), would be fine.
--
Michael
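Michael's code-point suggestion might look like this hedged fragment (variable names invented for illustration; the actual commit differs):

# Sketch: skip the suspect Latin-ASCII.xml mapping by code point escape,
# so no raw non-ASCII glyph appears in the script source.
CLDR_MAPPINGS_TO_SKIP = {"\u210C"}  # BLACK-LETTER CAPITAL H, see CLDR-17656

mappings = [("\u210C", "x"), ("\u00E9", "e")]
for src, trg in mappings:
    if src in CLDR_MAPPINGS_TO_SKIP:
        continue  # drop the known-bad entry
    print(f"{src} -> {trg}")  # only "é -> e" survives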
On Thu, May 23, 2024 at 2:21 AM Michael Paquier <michael@paquier.xyz> wrote:
> Another thing would be to add some tests covering the new characters
> that get a conversion. Just a few of them in the new ranges, checking
> the recursive case with is_letter_with_marks(), would be fine.

Because the mappings that the script produces are going to be checked into the repository, I don't necessarily see that we also need separate tests for them. I'm not sure what problem that could realistically find.

--
Robert Haas
EDB: http://www.enterprisedb.com
I improved the comment, referenced the CLDR bug report, and replaced the funky character with \uNNNN notation, as requested. And pushed (master/18-to-be only). Ευχαριστώ! (Thank you!)
On Fri, Jul 05, 2024 at 03:41:05PM +1200, Thomas Munro wrote:
> I improved the comment, referenced the CLDR bug report, and replaced
> the funky character with \uNNNN notation, as requested. And pushed
> (master/18-to-be only). Ευχαριστώ! (Thank you!)

Thanks!
--
Michael