Thread: BUG #16337: Finnish Ispell dictionary cannot be created

BUG #16337: Finnish Ispell dictionary cannot be created

From

PG Bug reporting form

Date:

02 April 2020, 10:11:51

The following bug has been logged on the website:

Bug reference:      16337
Logged by:          Matti Linnanvuori
Email address:      matti.linnanvuori@portalify.com
PostgreSQL version: 12.2
Operating system:   Red Hat Enterprise Linux 8.1
Description:

postgres=# CREATE TEXT SEARCH DICTIONARY finnish_ispell ( TEMPLATE = ispell,
DictFile = fi_fi, AffFile = fi_fi, Stopwords = finnish);
ERROR:  syntax error
CONTEXT:  line 83 of configuration file
"/usr/pgsql-12/share/tsearch_data/fi_fi.affix": "    I           >
ALI\-
"

http://ispell-fi.sourceforge.net/finnish.dict.bz2
bunzip2 finnish.dict.bz2
iconv -f ISO_8859-1 -t UTF-8 -o fi_fi.dict finnish.dict
cp fi_fi.dict /usr/pgsql-12/share/tsearch_data

http://ispell-fi.sourceforge.net/finnish.large.aff.bz2
bunzip2 finnish.large.aff.bz2
iconv -f ISO_8859-1 -t UTF-8 -o fi_fi.affix finnish.large.aff
cp fi_fi.affix /usr/pgsql-12/share/tsearch_data

http://ispell-fi.sourceforge.net/finnish.medium.aff.bz2
bunzip2 finnish.medium.aff.bz2
iconv -f ISO_8859-1 -t UTF-8 -o fi_fi.affix finnish.medium.aff
cp fi_fi.affix /usr/pgsql-12/share/tsearch_data

https://www.postgresql.org/message-id/46CD5588.5080404%40enterprisedb.com

Re: BUG #16337: Finnish Ispell dictionary cannot be created

From

Artur Zakirov

Date:

03 April 2020, 03:33:00

Hello,

On 4/2/2020 7:11 PM, PG Bug reporting form wrote:
> postgres=# CREATE TEXT SEARCH DICTIONARY finnish_ispell ( TEMPLATE = ispell,
> DictFile = fi_fi, AffFile = fi_fi, Stopwords = finnish);
> ERROR:  syntax error
> CONTEXT:  line 83 of configuration file
> "/usr/pgsql-12/share/tsearch_data/fi_fi.affix": "    I           >
> ALI\-
> "

Thank you for the email.

It seems that here the backslash is used to escape the following 
character according to the comment for the following flag:

> flag *E:
>     .           >       YLI     # ylijohtaja
>     I           >       YLI\-   # yli-inhimillinen

Escaping a character is valid in the ispell format (see 
https://manpages.debian.org/testing/ispell/ispell.5.en.html):

> Any character with special meaning to the parser can be changed to an uninterpreted token by backslashing it

I also looked for a Finnish Hunspell dictionary, but I didn't find any. 
I found only a PostgreSQL extension:
https://github.com/Houston-Inc/dict_voikko

I think it is possible to fix the PostgreSQL parser, but I'm not sure 
whether we should do that.

At first sight it is necessary to fix parse_affentry().

-- 
Artur

Re: BUG #16337: Finnish Ispell dictionary cannot be created

From

Tomas Vondra

Date:

03 April 2020, 08:55:09

On Fri, Apr 03, 2020 at 12:33:00PM +0900, Artur Zakirov wrote:
>Hello,
>
>On 4/2/2020 7:11 PM, PG Bug reporting form wrote:
>>postgres=# CREATE TEXT SEARCH DICTIONARY finnish_ispell ( TEMPLATE = ispell,
>>DictFile = fi_fi, AffFile = fi_fi, Stopwords = finnish);
>>ERROR:  syntax error
>>CONTEXT:  line 83 of configuration file
>>"/usr/pgsql-12/share/tsearch_data/fi_fi.affix": "    I           >
>>ALI\-
>>"
>
>Thank you for the email.
>
>It seems that here the backslash is used to escape the following 
>character according to the comment for the following flag:
>
>>flag *E:
>>    .           >       YLI     # ylijohtaja
>>    I           >       YLI\-   # yli-inhimillinen
>
>Escaping a character is valid in the ispell format (see 
>https://manpages.debian.org/testing/ispell/ispell.5.en.html):
>
>>Any character with special meaning to the parser can be changed to an uninterpreted token by backslashing it
>
>I also looked for a Finnish Hunspell dictionary, but I didn't find 
>any. I found only a PostgreSQL extension:
>https://github.com/Houston-Inc/dict_voikko
>
>
>I think it is possible to fix the PostgreSQL parser, but I'm not sure 
>whether we should do that.
>

I'm not sure if it's a valid ispell format (it might be, but I'm not
very good at reading the ispell manpage). But if it is, we should fix
the code to be able to read it.

>At first sight it is necessary to fix parse_affentry().
>

Right, that seems like the place to fix. It seems we don't expect '-'
(escaped) when in PAE_INREPL state. I wonder if there are other things
we fail to support ...
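
The failure can be modelled outside PostgreSQL with a few lines of C. The
sketch below is not the code in parse_affentry(): the function name
parse_repl, its allow_escaped_hyphen switch, and the ASCII-only handling are
assumptions made for illustration (the real parser works on multibyte text).

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical, ASCII-only model of the replacement field of one affix rule.
 * Copies the field into out; returns 0 on success, -1 on a "syntax error".
 * allow_escaped_hyphen = 0 models a parser that accepts only letters;
 * allow_escaped_hyphen = 1 additionally accepts the two-character sequence \-.
 */
int
parse_repl(const char *s, int allow_escaped_hyphen, char *out, size_t outsz)
{
    size_t      n = 0;

    if (outsz == 0)
        return -1;

    /* the field ends at whitespace, a comment, or the end of the line */
    for (; *s != '\0' && !isspace((unsigned char) *s) && *s != '#'; s++)
    {
        char        c;

        if (isalpha((unsigned char) *s))
            c = *s;
        else if (allow_escaped_hyphen && s[0] == '\\' && s[1] == '-')
            c = *++s;           /* skip the backslash, keep the hyphen */
        else
            return -1;          /* unexpected character, e.g. the backslash */

        if (n + 1 >= outsz)
            return -1;          /* output buffer too small */
        out[n++] = c;
    }
    out[n] = '\0';
    return 0;
}
```

With allow_escaped_hyphen set to 0, the field YLI\- is rejected at the
backslash; with 1 it is copied as YLI-.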


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #16337: Finnish Ispell dictionary cannot be created

From

Artur Zakirov

Date:

12 April 2020, 14:13:26

On Fri, Apr 3, 2020 at 5:55 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I'm not sure if it's a valid ispell format (it might be, but I'm not
> very good at reading the ispell manpage). But if it is, we should fix
> the code to be able to read it.

I attached a simple patch which fixes the PAE_INREPL state.

I don't fully understand the ispell manpage either. I've looked at the
ispell source code. They use yacc for parsing. I'm not good at yacc, but
it seems that the escape symbol is used for all fields. The patch,
however, fixes only the PAE_INREPL state.

I also ran some tests with the ispell utility. For simplicity I changed
the .aff file in the following way:

flag *E:
    .           >     YLI
    .           >     YLI\-

And I got the following results:

word: ylijohdon
ok (derives from root JOHDON)

word: yli-johdon
ok (derives from root JOHDON)

word: yly-johdon
how about: yli-johdon

So hyphen escaping works. Here are the results for PostgreSQL with the
patch and the modified .aff file:

=# select ts_lexize('finnish_ispell', 'yli-johdon');
     ts_lexize
-------------------
 {johdon,johdossa}
=# select ts_lexize('finnish_ispell', 'ylijohdon');
     ts_lexize
-------------------
 {johdon,johdossa}

-- 
Artur

Attachment

tsearch_escape_hyphen.patch

Re: BUG #16337: Finnish Ispell dictionary cannot be created

From

Kyotaro Horiguchi

Date:

13 April 2020, 08:36:10

Hello, Artur.

At Sun, 12 Apr 2020 23:13:26 +0900, Artur Zakirov <zaartur@gmail.com> wrote in 
> On Fri, Apr 3, 2020 at 5:55 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > I'm not sure if it's a valid ispell format (it might be, but I'm not
> > very good at reading the ispell manpage). But if it is, we should fix
> > the code to be able to read it.
> 
> I attached a simple patch which fixes the PAE_INREPL state.

Looking at man 5 ispell: "Any character with special meaning to the parser
can be changed to an uninterpreted token by backslashing it". It
depends on how strict we should be about that, but I think it is safer
to treat any character prefixed by a backslash as a word
character.  (I don't understand how '-' can be in a word by the
definition in the .affix file, though.)

Since an escaped character is intended to be part of a word, there's
no point in special-casing the minus sign ad hoc, I think.

So as a result, parse_affentry() would look something like the following.

  while (*str)
  {
    /* an unescaped backslash escapes the next character */
    if (t_iseq(str, '\\') && !escaped)
    {
      str += pg_mblen(str);
      escaped = true;
      continue;
    }

    if (state == ..)
    {
      if (t_iseq(str, <special>) && !escaped)
        <handle special>
      else if (t_isalpha(str) || escaped)
        <handle non-special (or word) character>
      else if (!t_isspace(str))
        ereport(ERROR...
    }
    ...

    str += pg_mblen(str);
    escaped = false;
  }
  
Any thoughts or opinions?
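
The loop above can be made concrete as a small standalone sketch. It is a
simplification, not a proposed patch: read_word is a hypothetical name, the
set of special characters (",#>-") is assumed for the example, and plain
ASCII <ctype.h> checks stand in for t_isalpha()/t_isspace() and pg_mblen().

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical, ASCII-only sketch: read one word from s into out.
 * A backslash sets the "escaped" flag, and an escaped character is always
 * copied as a word character. An unescaped special character or whitespace
 * ends the word. Returns the number of input bytes consumed, or -1 on error.
 */
int
read_word(const char *s, char *out, size_t outsz)
{
    const char *p = s;
    size_t      n = 0;
    int         escaped = 0;

    if (outsz == 0)
        return -1;

    while (*p != '\0')
    {
        unsigned char c = (unsigned char) *p;

        if (c == '\\' && !escaped)
        {
            escaped = 1;        /* remember the escape, look at next char */
            p++;
            continue;
        }

        if (!escaped && (isspace(c) || strchr(",#>-", c) != NULL))
            break;              /* unescaped special: the word ends here */
        else if (isalpha(c) || escaped)
        {
            if (n + 1 >= outsz)
                return -1;      /* output buffer too small */
            out[n++] = (char) c;
        }
        else
            return -1;          /* unescaped non-word character */

        p++;
        escaped = 0;
    }

    if (escaped)
        return -1;              /* dangling backslash at end of input */

    out[n] = '\0';
    return (int) (p - s);
}
```

In this sketch any escaped character, including '#' or a second backslash,
becomes part of the word, which is the "treat every escaped character as a
word character" reading of the man page.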


> I don't fully understand the ispell manpage either. I've looked at the
> ispell source code. They use yacc for parsing. I'm not good at yacc, but
> it seems that the escape symbol is used for all fields. The patch,
> however, fixes only the PAE_INREPL state.
> 
> I also ran some tests with the ispell utility. For simplicity I changed
> the .aff file in the following way:
> 
> flag *E:
>     .           >     YLI
>     .           >     YLI\-
> 
> And I got the following results:
> 
> word: ylijohdon
> ok (derives from root JOHDON)
> 
> word: yli-johdon
> ok (derives from root JOHDON)
> 
> word: yly-johdon
> how about: yli-johdon
> 
> So hyphen escaping works. Here are the results for PostgreSQL with the
> patch and the modified .aff file:
> 
> =# select ts_lexize('finnish_ispell', 'yli-johdon');
>      ts_lexize
> -------------------
>  {johdon,johdossa}
> =# select ts_lexize('finnish_ispell', 'ylijohdon');
>      ts_lexize
> -------------------
>  {johdon,johdossa}

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: BUG #16337: Finnish Ispell dictionary cannot be created

From

Artur Zakirov

Date:

14 April 2020, 03:44:44

Hello Horiguchi-san,

On 4/13/2020 5:36 PM, Kyotaro Horiguchi wrote:
> Looking at man 5 ispell: "Any character with special meaning to the parser
> can be changed to an uninterpreted token by backslashing it". It
> depends on how strict we should be about that, but I think it is safer
> to treat any character prefixed by a backslash as a word
> character.  (I don't understand how '-' can be in a word by the
> definition in the .affix file, though.)
> 
> Since an escaped character is intended to be part of a word, there's
> no point in special-casing the minus sign ad hoc, I think.

Thank you for looking at the patch.

I don't mind if the patch covers broader cases. However, I previously 
tested the ispell utility with escaped characters other than '-', and it 
seems to either ignore such affixes or handle them incorrectly. Still, in 
general it may be better to stick closer to the man page description.

I attached a new version of the patch. It handles escaping only in the 
PAE_INFIND and PAE_INREPL states. I don't think we should allow escaping 
in every state; it is safer to have some exceptions:
- In PAE_WAIT_MASK we shouldn't escape a comment string, which starts with '#'
- The PAE_INMASK string is handled separately by regcomp.c, so it may be 
better to leave it as-is
- PAE_WAIT_FIND can start only with '-'
- I don't think it makes sense to escape in PAE_WAIT_REPL

And in PAE_INFIND and PAE_INREPL, I think we shouldn't allow escaping 
',' and '#'.

The condition:

if (t_iseq(str, '\\') && !isescaped &&
    (state == PAE_INFIND || state == PAE_INREPL))

may not be great, but I cannot come up with a better solution.
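
As a rough illustration of these rules, the check could be wrapped in a
small predicate. This is only a sketch and not the attached patch: the
state names follow the thread, but the enum, the helper name starts_escape,
and the treatment of a backslash at the end of the line are assumptions.

```c
/* State names follow the thread; the enum itself is a stand-in. */
typedef enum
{
    PAE_WAIT_MASK,
    PAE_INMASK,
    PAE_WAIT_FIND,
    PAE_INFIND,
    PAE_WAIT_REPL,
    PAE_INREPL
} ParseState;

/*
 * Hypothetical helper: does the backslash at s[0] start an escape sequence?
 * It is honoured only in PAE_INFIND and PAE_INREPL, never when the
 * backslash is itself escaped, and never in front of ',' or '#'.
 * Returns 1 if the next character should be taken literally, else 0.
 */
int
starts_escape(ParseState state, const char *s, int isescaped)
{
    if (s[0] != '\\' || isescaped)
        return 0;               /* not a backslash, or already escaped */
    if (state != PAE_INFIND && state != PAE_INREPL)
        return 0;               /* escaping is ignored in other states */
    if (s[1] == '\0' || s[1] == ',' || s[1] == '#')
        return 0;               /* assumed: these cannot be escaped */
    return 1;
}
```

Keeping the rule in one place would at least keep the state test out of the
main parsing loop, at the cost of one more function.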

-- 
Artur

Attachment

tsearch_escape_hyphen_v2.patch