Thread: PATCH: Allow empty targets in unaccent dictionary

PATCH: Allow empty targets in unaccent dictionary

From

Mohammad Alhashash

Date:

19 April 2014, 23:06:50

Hi,

Currently, unaccent extension only allows replacing one source character
with one or more target characters. In Arabic, Hebrew and possibly other
languages, diacritics are standalone characters that are being added to
normal letters. To use unaccent dictionary for these languages, we need
to allow empty targets to remove diacritics instead of replacing them.

The attached patch modfies unaacent.c so that dictionary parser uses
zero-length target when the line has no target.

Best Regards,

Mohammad Alhashash

Attachment

unaccent_empty_target.patch

Re: PATCH: Allow empty targets in unaccent dictionary

From

David Fetter

Date:

21 April 2014, 04:21:22

Please add this to the next commitfest.

https://commitfest.postgresql.org/action/commitfest_view?id=22

Cheers,
David.
On Sun, Apr 20, 2014 at 01:06:43AM +0200, Mohammad Alhashash wrote:
> Hi,
> 
> Currently, unaccent extension only allows replacing one source
> character with one or more target characters. In Arabic, Hebrew and
> possibly other languages, diacritics are standalone characters that
> are being added to normal letters. To use unaccent dictionary for
> these languages, we need to allow empty targets to remove diacritics
> instead of replacing them.
> 
> The attached patch modfies unaacent.c so that dictionary parser uses
> zero-length target when the line has no target.
> 
> Best Regards,
> 
> Mohammad Alhashash
> 

> diff --git a/contrib/unaccent/unaccent.c b/contrib/unaccent/unaccent.c
> old mode 100644
> new mode 100755
> index a337df6..4e72829
> --- a/contrib/unaccent/unaccent.c
> +++ b/contrib/unaccent/unaccent.c
> @@ -58,7 +58,9 @@ placeChar(TrieChar *node, unsigned char *str, int lenstr, char *replaceTo, int r
>          {
>              curnode->replacelen = replacelen;
>              curnode->replaceTo = palloc(replacelen);
> -            memcpy(curnode->replaceTo, replaceTo, replacelen);
> +            /* palloc(0) returns a valid address, not NULL */
> +            if (replaceTo) /* memcpy() is undefined for NULL pointers*/
> +                memcpy(curnode->replaceTo, replaceTo, replacelen);
>          }
>      }
>      else
> @@ -105,10 +107,10 @@ initTrie(char *filename)
>              while ((line = tsearch_readline(&trst)) != NULL)
>              {
>                  /*
> -                 * The format of each line must be "src trg" where src and trg
> +                 * The format of each line must be "src [trg]" where src and trg
>                   * are sequences of one or more non-whitespace characters,
>                   * separated by whitespace.  Whitespace at start or end of
> -                 * line is ignored.
> +                 * line is ignored. If no trg added, a zero-length string is used.
>                   */
>                  int            state;
>                  char       *ptr;
> @@ -160,6 +162,13 @@ initTrie(char *filename)
>                      }
>                  }
>  
> +                /* if no trg (loop stops at state 1 or 2), use zero-length target */
> +                if (state == 1 || state == 2)
> +                {
> +                    trglen = 0;
> +                    state = 5;
> +                }
> +                
>                  if (state >= 3)
>                      rootTrie = placeChar(rootTrie,
>                                           (unsigned char *) src, srclen,

> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Re: PATCH: Allow empty targets in unaccent dictionary

From

Abhijit Menon-Sen

Date:

25 June 2014, 05:20:18

Hi.

At 2014-04-20 01:06:43 +0200, alhashash@alhashash.net wrote:
>
> To use unaccent dictionary for these languages, we need to allow empty
> targets to remove diacritics instead of replacing them.

Your patch should definitely add a test case or two to sql/unaccent.sql
and expected/unaccent.out showing the behaviour that didn't work before
the change.

> The attached patch modfies unaacent.c so that dictionary parser uses
> zero-length target when the line has no target.

The basic idea seems sensible.

> diff --git a/contrib/unaccent/unaccent.c b/contrib/unaccent/unaccent.c
> old mode 100644
> new mode 100755
> index a337df6..4e72829
> --- a/contrib/unaccent/unaccent.c
> +++ b/contrib/unaccent/unaccent.c
> @@ -58,7 +58,9 @@ placeChar(TrieChar *node, unsigned char *str, int lenstr, char *replaceTo, int r
>          {
>              curnode->replacelen = replacelen;
>              curnode->replaceTo = palloc(replacelen);
> -            memcpy(curnode->replaceTo, replaceTo, replacelen);
> +            /* palloc(0) returns a valid address, not NULL */
> +            if (replaceTo) /* memcpy() is undefined for NULL pointers*/
> +                memcpy(curnode->replaceTo, replaceTo, replacelen);
>          }
>      }

I think these comments confuse the issue, and should be removed. In
fact, I think this part of the code can remain unchanged (see below).

> @@ -105,10 +107,10 @@ initTrie(char *filename)
>              while ((line = tsearch_readline(&trst)) != NULL)
>              {
>                  /*
> -                 * The format of each line must be "src trg" where src and trg
> +                 * The format of each line must be "src [trg]" where src and trg
>                   * are sequences of one or more non-whitespace characters,
>                   * separated by whitespace.  Whitespace at start or end of
> -                 * line is ignored.
> +                 * line is ignored. If no trg added, a zero-length string is used.
>                   */
>                  int            state;

I suggest "If trg is empty, a zero-length string is used" for the last
sentence.

> @@ -160,6 +162,13 @@ initTrie(char *filename)
>                      }
>                  }
>  
> +                /* if no trg (loop stops at state 1 or 2), use zero-length target */
> +                if (state == 1 || state == 2)
> +                {
> +                    trglen = 0;
> +                    state = 5;
> +                }

If I understand the code correctly, "src" alone will leave state == 1,
and "src " will leave state == 2, and in both cases trg and trglen will
be unchanged (from NULL and 0 respectively).

In that case, I think it would be clearer to do something like this:
   char *trg = "";
   …
   /* It's OK to have a valid src and empty trg. */   if (state > 0 && trglen == 0)       state = 5;

That way, you don't have the NULL pointer, and you don't have to add a
NULL-pointer test in placeChar() above.

What do you think? If you submit a revised patch along these lines, I'll
mark it ready for committer.

Thank you.

-- Abhijit

Re: PATCH: Allow empty targets in unaccent dictionary

From

Abhijit Menon-Sen

Date:

29 June 2014, 11:43:36

Hi.

I've attached a patch to contrib/unaccent as outlined in my review the
other day. I'm familiar with multiple languages in which modifiers are
separate characters (but not Arabic), so I decided to try a quick test
because I was curious.

I added a line containing only U+0940 (DEVANAGARI VOWEL SIGN II) to
unaccent.rules, and tried the following (the argument to unaccent is
U+0915 U+0940, and the result is U+0915):

    ams=# select unaccent('unaccent','की ');
     unaccent
    ----------
     क
    (1 row)

So the patch works fine: it correctly removes the modifier.

To add a test, however, it would be necessary to add this modifier to
unaccent.rules. But if we're adding one modifier to unaccent.rules, we
really should add them all. I have nowhere near the motivation needed to
add all the Devanagari modifiers, let alone any of the other languages I
know, and even if I did, it still wouldn't address Mohammad's use case.

(As a separate matter, it's not clear to me if stripping these modifiers
using unaccent is something everyone will want to do.)

So, though I'm not fond of saying it, perhaps the right thing to do is
to forget my earlier objection (that the patch didn't have tests), and
just commit as-is. It's a pretty straightforward patch, and it works.

I'm setting this as ready for committer.

-- अभजत "unaccented in three languages" മനന-সন

Attachment

unaccent.diff

Re: PATCH: Allow empty targets in unaccent dictionary

From

Mohammad Alhashash

Date:

29 June 2014, 12:03:20

Hi,

Thanks a lot for the review and comments. Here is an updated patch.

On 6/25/2014 8:20 AM, Abhijit Menon-Sen wrote:
> Your patch should definitely add a test case or two to
> sql/unaccent.sql and expected/unaccent.out showing the behaviour that
> didn't work before the change.
That would require adding new entries to the "unaccent.rules" template.
I'm afraid that the templates I'm using for Arabic now are not complete
enough to be including in the default dictionary.

I can create a new template just for the test cases but I've to update
the make file to include that file in installation. Should I do this?

Thanks,

Mohammad Alhashash

Attachment

unaccent.patch

Re: PATCH: Allow empty targets in unaccent dictionary

From

Tom Lane

Date:

30 June 2014, 19:19:26

Abhijit Menon-Sen <ams@2ndQuadrant.com> writes:
> I've attached a patch to contrib/unaccent as outlined in my review the
> other day.

I went to commit this, and while testing I realized that the current
implementation of unaccent_lexize is only capable of coping with "src"
strings that are single characters in the current encoding.  I'm not
sure exactly how complicated it would be to fix that, but it might
not be terribly difficult.  (Overlapping matches would be the main
complication, likely.)

Anyway, this raises the question of whether the current patch is
actually a desirable way to do things, or whether it would be better if
the unaccenting rules were like "base-char accent-char" -> "base-char".
Presumably, that would require more rules, but it would prevent deletion
of a standalone accent-char, which might or might not be a good thing.
Also, if there are any contexts where the right translation of an
accent-char depends on the base-char, you couldn't do it with the
patch as it stands.  I don't know any languages that use separate
accent-chars, so I'm not qualified to opine on whether this is important.

It's not unlikely that we want this patch *and* an improvement that allows
multi-character src strings, but I held off committing for the moment
until somebody weighs in with an opinion about that.

In any case, the patch failed to update the user-facing docs
(unaccent.sgml) so we need some more work in that department.  The
user-facing docs are already pretty weak in that they fail to explain the
whitespace rules, much less clarify that the src must be exactly a single
character.

If we don't improve the code to allow multi-character src, I wonder
whether the rule-reading code shouldn't be improved to forbid such
cases.  I'm also wondering why it silently ignores malformed lines
instead of bleating.  But that's more or less orthogonal to this patch.

Lastly, I didn't especially like the coding details of either proposed
patch, and rewrote it as attached.  I didn't see a need to introduce
a "state 5", and I also didn't agree with changing the initial values
of trg/trglen to be meaningful.  Right now, those variables are
initialized only to keep the compiler from bleating; the initial values
aren't actually used anywhere.  It didn't seem like a good idea for
the trgxxx variables to be different from the srcxxx variables in that
respect.

            regards, tom lane

diff --git a/contrib/unaccent/unaccent.c b/contrib/unaccent/unaccent.c
index a337df6..5a31f85 100644
*** a/contrib/unaccent/unaccent.c
--- b/contrib/unaccent/unaccent.c
*************** initTrie(char *filename)
*** 104,114 ****

              while ((line = tsearch_readline(&trst)) != NULL)
              {
!                 /*
!                  * The format of each line must be "src trg" where src and trg
!                  * are sequences of one or more non-whitespace characters,
!                  * separated by whitespace.  Whitespace at start or end of
!                  * line is ignored.
                   */
                  int            state;
                  char       *ptr;
--- 104,124 ----

              while ((line = tsearch_readline(&trst)) != NULL)
              {
!                 /*----------
!                  * The format of each line must be "src" or "src trg", where
!                  * src and trg are sequences of one or more non-whitespace
!                  * characters, separated by whitespace.  Whitespace at start
!                  * or end of line is ignored.  If trg is omitted, an empty
!                  * string is used as the replacement.
!                  *
!                  * We use a simple state machine, with states
!                  *    0    initial (before src)
!                  *    1    in src
!                  *    2    in whitespace after src
!                  *    3    in trg
!                  *    4    in whitespace after trg
!                  *    -1    syntax error detected (line will be ignored)
!                  *----------
                   */
                  int            state;
                  char       *ptr;
*************** initTrie(char *filename)
*** 160,166 ****
                      }
                  }

!                 if (state >= 3)
                      rootTrie = placeChar(rootTrie,
                                           (unsigned char *) src, srclen,
                                           trg, trglen);
--- 170,183 ----
                      }
                  }

!                 if (state == 1 || state == 2)
!                 {
!                     /* trg was omitted, so use "" */
!                     trg = "";
!                     trglen = 0;
!                 }
!
!                 if (state > 0)
                      rootTrie = placeChar(rootTrie,
                                           (unsigned char *) src, srclen,
                                           trg, trglen);

Re: PATCH: Allow empty targets in unaccent dictionary

From

Abhijit Menon-Sen

Date:

30 June 2014, 20:10:48

At 2014-06-30 15:19:17 -0400, tgl@sss.pgh.pa.us wrote:
>
> Anyway, this raises the question of whether the current patch is
> actually a desirable way to do things, or whether it would be better
> if the unaccenting rules were like "base-char accent-char" ->
> "base-char".

It might be useful to be able to write such rules, but it would be
highly impractical to do so instead of being able to single out
accent-chars for removal.

In all the languages I'm familiar with that use such accent-chars, any
accent-char would form a valid combination with nearly every base-char,
unlike European languages where you don't have to worry about k-umlaut,
say. Also, a standalone accent-char would always be meaningless.

(These accent-chars don't actually exist independently in the syllabary
that a Hindi speaker might learn in school: they're combining forms of
vowels and are treated differently from characters in practice.)

> Also, if there are any contexts where the right translation of an
> accent-char depends on the base-char, you couldn't do it with the
> patch as it stands.

I can't think of a satisfactory example at the moment, but that sounds
entirely plausible.

> It's not unlikely that we want this patch *and* an improvement that
> allows multi-character src strings

I think it's enough to apply just this patch, but I wouldn't object to
doing both if it were easy. It's not clear to me if that's true after a
quick glance at the code, but I'll look again when I'm properly awake.

> Lastly, I didn't especially like the coding details of either proposed
> patch, and rewrote it as attached.

:-)

-- Abhijit

Re: PATCH: Allow empty targets in unaccent dictionary

From

Tom Lane

Date:

30 June 2014, 23:42:48

Abhijit Menon-Sen <ams@2ndQuadrant.com> writes:
> At 2014-06-30 15:19:17 -0400, tgl@sss.pgh.pa.us wrote:
>> Anyway, this raises the question of whether the current patch is
>> actually a desirable way to do things, or whether it would be better
>> if the unaccenting rules were like "base-char accent-char" ->
>> "base-char".

> It might be useful to be able to write such rules, but it would be
> highly impractical to do so instead of being able to single out
> accent-chars for removal.

On reflection, if we were thinking of this as a general
substring-replacement mechanism rather than just a de-accenter
(and why shouldn't we think of it that way?), then clearly both
multi-character source strings and zero-character substitute strings
could be of value.  The fact that the existing patch only fixes
one of those omissions isn't a strike against it.
        regards, tom lane

Re: PATCH: Allow empty targets in unaccent dictionary

From

Tom Lane

Date:

01 July 2014, 02:06:39

Abhijit Menon-Sen <ams@2ndQuadrant.com> writes:
> At 2014-06-30 15:19:17 -0400, tgl@sss.pgh.pa.us wrote:
>> It's not unlikely that we want this patch *and* an improvement that
>> allows multi-character src strings

> I think it's enough to apply just this patch, but I wouldn't object to
> doing both if it were easy. It's not clear to me if that's true after a
> quick glance at the code, but I'll look again when I'm properly awake.

I went ahead and committed this patch, and also some further work to
fix the multicharacter-source problem.  I took it on myself to make the
code issue warnings about misformatted lines, too.
        regards, tom lane

Re: PATCH: Allow empty targets in unaccent dictionary

From

Abhijit Menon-Sen

Date:

01 July 2014, 17:58:22

At 2014-06-30 22:06:30 -0400, tgl@sss.pgh.pa.us wrote:
>
> I went ahead and committed this patch, and also some further work to
> fix the multicharacter-source problem.  I took it on myself to make
> the code issue warnings about misformatted lines, too.

Thanks, looks good. I found the multicharacter-source diff an
instructive read.

-- Abhijit