Thread: PATCH: Allow empty targets in unaccent dictionary
Hi, Currently, unaccent extension only allows replacing one source character with one or more target characters. In Arabic, Hebrew and possibly other languages, diacritics are standalone characters that are being added to normal letters. To use unaccent dictionary for these languages, we need to allow empty targets to remove diacritics instead of replacing them. The attached patch modfies unaacent.c so that dictionary parser uses zero-length target when the line has no target. Best Regards, Mohammad Alhashash
Attachment
Please add this to the next commitfest. https://commitfest.postgresql.org/action/commitfest_view?id=22 Cheers, David. On Sun, Apr 20, 2014 at 01:06:43AM +0200, Mohammad Alhashash wrote: > Hi, > > Currently, unaccent extension only allows replacing one source > character with one or more target characters. In Arabic, Hebrew and > possibly other languages, diacritics are standalone characters that > are being added to normal letters. To use unaccent dictionary for > these languages, we need to allow empty targets to remove diacritics > instead of replacing them. > > The attached patch modfies unaacent.c so that dictionary parser uses > zero-length target when the line has no target. > > Best Regards, > > Mohammad Alhashash > > diff --git a/contrib/unaccent/unaccent.c b/contrib/unaccent/unaccent.c > old mode 100644 > new mode 100755 > index a337df6..4e72829 > --- a/contrib/unaccent/unaccent.c > +++ b/contrib/unaccent/unaccent.c > @@ -58,7 +58,9 @@ placeChar(TrieChar *node, unsigned char *str, int lenstr, char *replaceTo, int r > { > curnode->replacelen = replacelen; > curnode->replaceTo = palloc(replacelen); > - memcpy(curnode->replaceTo, replaceTo, replacelen); > + /* palloc(0) returns a valid address, not NULL */ > + if (replaceTo) /* memcpy() is undefined for NULL pointers*/ > + memcpy(curnode->replaceTo, replaceTo, replacelen); > } > } > else > @@ -105,10 +107,10 @@ initTrie(char *filename) > while ((line = tsearch_readline(&trst)) != NULL) > { > /* > - * The format of each line must be "src trg" where src and trg > + * The format of each line must be "src [trg]" where src and trg > * are sequences of one or more non-whitespace characters, > * separated by whitespace. Whitespace at start or end of > - * line is ignored. > + * line is ignored. If no trg added, a zero-length string is used. > */ > int state; > char *ptr; > @@ -160,6 +162,13 @@ initTrie(char *filename) > } > } > > + /* if no trg (loop stops at state 1 or 2), use zero-length target */ > + if (state == 1 || state == 2) > + { > + trglen = 0; > + state = 5; > + } > + > if (state >= 3) > rootTrie = placeChar(rootTrie, > (unsigned char *) src, srclen, > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Hi. At 2014-04-20 01:06:43 +0200, alhashash@alhashash.net wrote: > > To use unaccent dictionary for these languages, we need to allow empty > targets to remove diacritics instead of replacing them. Your patch should definitely add a test case or two to sql/unaccent.sql and expected/unaccent.out showing the behaviour that didn't work before the change. > The attached patch modfies unaacent.c so that dictionary parser uses > zero-length target when the line has no target. The basic idea seems sensible. > diff --git a/contrib/unaccent/unaccent.c b/contrib/unaccent/unaccent.c > old mode 100644 > new mode 100755 > index a337df6..4e72829 > --- a/contrib/unaccent/unaccent.c > +++ b/contrib/unaccent/unaccent.c > @@ -58,7 +58,9 @@ placeChar(TrieChar *node, unsigned char *str, int lenstr, char *replaceTo, int r > { > curnode->replacelen = replacelen; > curnode->replaceTo = palloc(replacelen); > - memcpy(curnode->replaceTo, replaceTo, replacelen); > + /* palloc(0) returns a valid address, not NULL */ > + if (replaceTo) /* memcpy() is undefined for NULL pointers*/ > + memcpy(curnode->replaceTo, replaceTo, replacelen); > } > } I think these comments confuse the issue, and should be removed. In fact, I think this part of the code can remain unchanged (see below). > @@ -105,10 +107,10 @@ initTrie(char *filename) > while ((line = tsearch_readline(&trst)) != NULL) > { > /* > - * The format of each line must be "src trg" where src and trg > + * The format of each line must be "src [trg]" where src and trg > * are sequences of one or more non-whitespace characters, > * separated by whitespace. Whitespace at start or end of > - * line is ignored. > + * line is ignored. If no trg added, a zero-length string is used. > */ > int state; I suggest "If trg is empty, a zero-length string is used" for the last sentence. > @@ -160,6 +162,13 @@ initTrie(char *filename) > } > } > > + /* if no trg (loop stops at state 1 or 2), use zero-length target */ > + if (state == 1 || state == 2) > + { > + trglen = 0; > + state = 5; > + } If I understand the code correctly, "src" alone will leave state == 1, and "src " will leave state == 2, and in both cases trg and trglen will be unchanged (from NULL and 0 respectively). In that case, I think it would be clearer to do something like this: char *trg = ""; … /* It's OK to have a valid src and empty trg. */ if (state > 0 && trglen == 0) state = 5; That way, you don't have the NULL pointer, and you don't have to add a NULL-pointer test in placeChar() above. What do you think? If you submit a revised patch along these lines, I'll mark it ready for committer. Thank you. -- Abhijit
Hi. I've attached a patch to contrib/unaccent as outlined in my review the other day. I'm familiar with multiple languages in which modifiers are separate characters (but not Arabic), so I decided to try a quick test because I was curious. I added a line containing only U+0940 (DEVANAGARI VOWEL SIGN II) to unaccent.rules, and tried the following (the argument to unaccent is U+0915 U+0940, and the result is U+0915): ams=# select unaccent('unaccent','की '); unaccent ---------- क (1 row) So the patch works fine: it correctly removes the modifier. To add a test, however, it would be necessary to add this modifier to unaccent.rules. But if we're adding one modifier to unaccent.rules, we really should add them all. I have nowhere near the motivation needed to add all the Devanagari modifiers, let alone any of the other languages I know, and even if I did, it still wouldn't address Mohammad's use case. (As a separate matter, it's not clear to me if stripping these modifiers using unaccent is something everyone will want to do.) So, though I'm not fond of saying it, perhaps the right thing to do is to forget my earlier objection (that the patch didn't have tests), and just commit as-is. It's a pretty straightforward patch, and it works. I'm setting this as ready for committer. -- अभजत "unaccented in three languages" മനന-সন
Attachment
Hi, Thanks a lot for the review and comments. Here is an updated patch. On 6/25/2014 8:20 AM, Abhijit Menon-Sen wrote: > Your patch should definitely add a test case or two to > sql/unaccent.sql and expected/unaccent.out showing the behaviour that > didn't work before the change. That would require adding new entries to the "unaccent.rules" template. I'm afraid that the templates I'm using for Arabic now are not complete enough to be including in the default dictionary. I can create a new template just for the test cases but I've to update the make file to include that file in installation. Should I do this? Thanks, Mohammad Alhashash
Attachment
Abhijit Menon-Sen <ams@2ndQuadrant.com> writes: > I've attached a patch to contrib/unaccent as outlined in my review the > other day. I went to commit this, and while testing I realized that the current implementation of unaccent_lexize is only capable of coping with "src" strings that are single characters in the current encoding. I'm not sure exactly how complicated it would be to fix that, but it might not be terribly difficult. (Overlapping matches would be the main complication, likely.) Anyway, this raises the question of whether the current patch is actually a desirable way to do things, or whether it would be better if the unaccenting rules were like "base-char accent-char" -> "base-char". Presumably, that would require more rules, but it would prevent deletion of a standalone accent-char, which might or might not be a good thing. Also, if there are any contexts where the right translation of an accent-char depends on the base-char, you couldn't do it with the patch as it stands. I don't know any languages that use separate accent-chars, so I'm not qualified to opine on whether this is important. It's not unlikely that we want this patch *and* an improvement that allows multi-character src strings, but I held off committing for the moment until somebody weighs in with an opinion about that. In any case, the patch failed to update the user-facing docs (unaccent.sgml) so we need some more work in that department. The user-facing docs are already pretty weak in that they fail to explain the whitespace rules, much less clarify that the src must be exactly a single character. If we don't improve the code to allow multi-character src, I wonder whether the rule-reading code shouldn't be improved to forbid such cases. I'm also wondering why it silently ignores malformed lines instead of bleating. But that's more or less orthogonal to this patch. Lastly, I didn't especially like the coding details of either proposed patch, and rewrote it as attached. I didn't see a need to introduce a "state 5", and I also didn't agree with changing the initial values of trg/trglen to be meaningful. Right now, those variables are initialized only to keep the compiler from bleating; the initial values aren't actually used anywhere. It didn't seem like a good idea for the trgxxx variables to be different from the srcxxx variables in that respect. regards, tom lane diff --git a/contrib/unaccent/unaccent.c b/contrib/unaccent/unaccent.c index a337df6..5a31f85 100644 *** a/contrib/unaccent/unaccent.c --- b/contrib/unaccent/unaccent.c *************** initTrie(char *filename) *** 104,114 **** while ((line = tsearch_readline(&trst)) != NULL) { ! /* ! * The format of each line must be "src trg" where src and trg ! * are sequences of one or more non-whitespace characters, ! * separated by whitespace. Whitespace at start or end of ! * line is ignored. */ int state; char *ptr; --- 104,124 ---- while ((line = tsearch_readline(&trst)) != NULL) { ! /*---------- ! * The format of each line must be "src" or "src trg", where ! * src and trg are sequences of one or more non-whitespace ! * characters, separated by whitespace. Whitespace at start ! * or end of line is ignored. If trg is omitted, an empty ! * string is used as the replacement. ! * ! * We use a simple state machine, with states ! * 0 initial (before src) ! * 1 in src ! * 2 in whitespace after src ! * 3 in trg ! * 4 in whitespace after trg ! * -1 syntax error detected (line will be ignored) ! *---------- */ int state; char *ptr; *************** initTrie(char *filename) *** 160,166 **** } } ! if (state >= 3) rootTrie = placeChar(rootTrie, (unsigned char *) src, srclen, trg, trglen); --- 170,183 ---- } } ! if (state == 1 || state == 2) ! { ! /* trg was omitted, so use "" */ ! trg = ""; ! trglen = 0; ! } ! ! if (state > 0) rootTrie = placeChar(rootTrie, (unsigned char *) src, srclen, trg, trglen);
At 2014-06-30 15:19:17 -0400, tgl@sss.pgh.pa.us wrote: > > Anyway, this raises the question of whether the current patch is > actually a desirable way to do things, or whether it would be better > if the unaccenting rules were like "base-char accent-char" -> > "base-char". It might be useful to be able to write such rules, but it would be highly impractical to do so instead of being able to single out accent-chars for removal. In all the languages I'm familiar with that use such accent-chars, any accent-char would form a valid combination with nearly every base-char, unlike European languages where you don't have to worry about k-umlaut, say. Also, a standalone accent-char would always be meaningless. (These accent-chars don't actually exist independently in the syllabary that a Hindi speaker might learn in school: they're combining forms of vowels and are treated differently from characters in practice.) > Also, if there are any contexts where the right translation of an > accent-char depends on the base-char, you couldn't do it with the > patch as it stands. I can't think of a satisfactory example at the moment, but that sounds entirely plausible. > It's not unlikely that we want this patch *and* an improvement that > allows multi-character src strings I think it's enough to apply just this patch, but I wouldn't object to doing both if it were easy. It's not clear to me if that's true after a quick glance at the code, but I'll look again when I'm properly awake. > Lastly, I didn't especially like the coding details of either proposed > patch, and rewrote it as attached. :-) -- Abhijit
Abhijit Menon-Sen <ams@2ndQuadrant.com> writes: > At 2014-06-30 15:19:17 -0400, tgl@sss.pgh.pa.us wrote: >> Anyway, this raises the question of whether the current patch is >> actually a desirable way to do things, or whether it would be better >> if the unaccenting rules were like "base-char accent-char" -> >> "base-char". > It might be useful to be able to write such rules, but it would be > highly impractical to do so instead of being able to single out > accent-chars for removal. On reflection, if we were thinking of this as a general substring-replacement mechanism rather than just a de-accenter (and why shouldn't we think of it that way?), then clearly both multi-character source strings and zero-character substitute strings could be of value. The fact that the existing patch only fixes one of those omissions isn't a strike against it. regards, tom lane
Abhijit Menon-Sen <ams@2ndQuadrant.com> writes: > At 2014-06-30 15:19:17 -0400, tgl@sss.pgh.pa.us wrote: >> It's not unlikely that we want this patch *and* an improvement that >> allows multi-character src strings > I think it's enough to apply just this patch, but I wouldn't object to > doing both if it were easy. It's not clear to me if that's true after a > quick glance at the code, but I'll look again when I'm properly awake. I went ahead and committed this patch, and also some further work to fix the multicharacter-source problem. I took it on myself to make the code issue warnings about misformatted lines, too. regards, tom lane
At 2014-06-30 22:06:30 -0400, tgl@sss.pgh.pa.us wrote: > > I went ahead and committed this patch, and also some further work to > fix the multicharacter-source problem. I took it on myself to make > the code issue warnings about misformatted lines, too. Thanks, looks good. I found the multicharacter-source diff an instructive read. -- Abhijit