Re: PATCH: Allow empty targets in unaccent dictionary - Mailing list pgsql-hackers

From Tom Lane
Subject Re: PATCH: Allow empty targets in unaccent dictionary
Date
Msg-id 20035.1404155957@sss.pgh.pa.us
Whole thread Raw
In response to Re: PATCH: Allow empty targets in unaccent dictionary  (Abhijit Menon-Sen <ams@2ndQuadrant.com>)
Responses Re: PATCH: Allow empty targets in unaccent dictionary  (Abhijit Menon-Sen <ams@2ndQuadrant.com>)
List pgsql-hackers
Abhijit Menon-Sen <ams@2ndQuadrant.com> writes:
> I've attached a patch to contrib/unaccent as outlined in my review the
> other day.

I went to commit this, and while testing I realized that the current
implementation of unaccent_lexize is only capable of coping with "src"
strings that are single characters in the current encoding.  I'm not
sure exactly how complicated it would be to fix that, but it might
not be terribly difficult.  (Overlapping matches would be the main
complication, likely.)

Anyway, this raises the question of whether the current patch is
actually a desirable way to do things, or whether it would be better if
the unaccenting rules were like "base-char accent-char" -> "base-char".
Presumably, that would require more rules, but it would prevent deletion
of a standalone accent-char, which might or might not be a good thing.
Also, if there are any contexts where the right translation of an
accent-char depends on the base-char, you couldn't do it with the
patch as it stands.  I don't know any languages that use separate
accent-chars, so I'm not qualified to opine on whether this is important.

It's not unlikely that we want this patch *and* an improvement that allows
multi-character src strings, but I held off committing for the moment
until somebody weighs in with an opinion about that.

In any case, the patch failed to update the user-facing docs
(unaccent.sgml) so we need some more work in that department.  The
user-facing docs are already pretty weak in that they fail to explain the
whitespace rules, much less clarify that the src must be exactly a single
character.

If we don't improve the code to allow multi-character src, I wonder
whether the rule-reading code shouldn't be improved to forbid such
cases.  I'm also wondering why it silently ignores malformed lines
instead of bleating.  But that's more or less orthogonal to this patch.

Lastly, I didn't especially like the coding details of either proposed
patch, and rewrote it as attached.  I didn't see a need to introduce
a "state 5", and I also didn't agree with changing the initial values
of trg/trglen to be meaningful.  Right now, those variables are
initialized only to keep the compiler from bleating; the initial values
aren't actually used anywhere.  It didn't seem like a good idea for
the trgxxx variables to be different from the srcxxx variables in that
respect.

            regards, tom lane

diff --git a/contrib/unaccent/unaccent.c b/contrib/unaccent/unaccent.c
index a337df6..5a31f85 100644
*** a/contrib/unaccent/unaccent.c
--- b/contrib/unaccent/unaccent.c
*************** initTrie(char *filename)
*** 104,114 ****

              while ((line = tsearch_readline(&trst)) != NULL)
              {
!                 /*
!                  * The format of each line must be "src trg" where src and trg
!                  * are sequences of one or more non-whitespace characters,
!                  * separated by whitespace.  Whitespace at start or end of
!                  * line is ignored.
                   */
                  int            state;
                  char       *ptr;
--- 104,124 ----

              while ((line = tsearch_readline(&trst)) != NULL)
              {
!                 /*----------
!                  * The format of each line must be "src" or "src trg", where
!                  * src and trg are sequences of one or more non-whitespace
!                  * characters, separated by whitespace.  Whitespace at start
!                  * or end of line is ignored.  If trg is omitted, an empty
!                  * string is used as the replacement.
!                  *
!                  * We use a simple state machine, with states
!                  *    0    initial (before src)
!                  *    1    in src
!                  *    2    in whitespace after src
!                  *    3    in trg
!                  *    4    in whitespace after trg
!                  *    -1    syntax error detected (line will be ignored)
!                  *----------
                   */
                  int            state;
                  char       *ptr;
*************** initTrie(char *filename)
*** 160,166 ****
                      }
                  }

!                 if (state >= 3)
                      rootTrie = placeChar(rootTrie,
                                           (unsigned char *) src, srclen,
                                           trg, trglen);
--- 170,183 ----
                      }
                  }

!                 if (state == 1 || state == 2)
!                 {
!                     /* trg was omitted, so use "" */
!                     trg = "";
!                     trglen = 0;
!                 }
!
!                 if (state > 0)
                      rootTrie = placeChar(rootTrie,
                                           (unsigned char *) src, srclen,
                                           trg, trglen);

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Cluster name in ps output
Next
From: Christian Ullrich
Date:
Subject: Re: PostgreSQL in Windows console and Ctrl-C