Re: BUG #16337: Finnish Ispell dictionary cannot be created - Mailing list pgsql-bugs

From	Kyotaro Horiguchi
Subject	Re: BUG #16337: Finnish Ispell dictionary cannot be created
Date	April 13, 2020 08:36:10
Msg-id	20200413.173610.1847967467851370073.horikyota.ntt@gmail.com
In response to	Re: BUG #16337: Finnish Ispell dictionary cannot be created (Artur Zakirov <zaartur@gmail.com>)
Responses	Re: BUG #16337: Finnish Ispell dictionary cannot be created
List	pgsql-bugs

Hello, Artur.

At Sun, 12 Apr 2020 23:13:26 +0900, Artur Zakirov <zaartur@gmail.com> wrote in 
> On Fri, Apr 3, 2020 at 5:55 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > I'm not sure if it's a valid ispell format (it might be, but I'm not
> > very good in reading the ispell manpage). But if it is, we should fix
> > the code to be able to read it.
> 
> I attached the simple patch which fixes PAE_INREPL state.

Looking at man 5 ispell, it says "Any character with special meaning to parser
can be changed to an uninterpreted token by backslashing it". It
depends on how strict we should be about that, but I think it is safer
to treat any character prefixed by a backslash as a word
character.  (I don't understand how '-' can be in a word by the
definition in the .affix file, though.)

Since an escaped character is intended to be part of a word, there's
no point in special-casing the minus sign in an ad hoc way, I think.

So as a result, parse_affentry would be something like the following.

  while (*str)
  {
    /* An unescaped backslash escapes the next character. */
    if (t_iseq(str, '\\') && !escaped)
    {
      str += pg_mblen(str);
      escaped = true;
      continue;
    }

    if (state == ..)
    {
      if (t_iseq(str, <special>) && !escaped)
        <handle special>
      else if (t_isalpha(str) || escaped)
        <handle non-special (or word) character>
      else if (!t_isspace(str))
        ereport(ERROR...
    }
    ...

    /* The escape applies to one character only. */
    str += pg_mblen(str);
    escaped = false;
  }
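
For illustration only, the same escaping rule can be written as a small
standalone C function. This is a sketch under simplifying assumptions: it
handles single-byte characters with the standard <ctype.h> tests instead of
pg_mblen()/t_isalpha(), it has no parser states, and copy_affix_word is a
hypothetical name, not a function in spell.c.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not PostgreSQL code): copy one affix field into dst,
 * treating every backslash-escaped byte as a word character.
 *
 * Returns the number of bytes written (excluding the terminating NUL), or -1
 * if an unescaped byte is neither alphabetic nor whitespace, or if dst is
 * too small.
 */
static int
copy_affix_word(const char *str, char *dst, size_t dstlen)
{
    bool    escaped = false;
    size_t  n = 0;

    assert(str != NULL && dst != NULL);
    if (dstlen == 0)
        return -1;

    for (; *str; str++)
    {
        unsigned char c = (unsigned char) *str;

        /* An unescaped backslash only marks the next byte as escaped. */
        if (c == '\\' && !escaped)
        {
            escaped = true;
            continue;
        }

        if (escaped || isalpha(c))
        {
            /* Keep one byte free for the terminating NUL. */
            if (n + 1 >= dstlen)
                return -1;
            dst[n++] = (char) c;
        }
        else if (!isspace(c))
            return -1;          /* e.g. a bare '-' is rejected */

        /* The escape applies to exactly one byte. */
        escaped = false;
    }

    dst[n] = '\0';
    return (int) n;
}
```

With this rule the field YLI\- is copied as "YLI-", while a bare YLI- is
rejected, so no character needs its own special case.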
  
Are there any thoughts or opinions?


> I don't fully understand the ispell manpage either. I've looked the
> ispell source code. They
> use yacc for parsing. I'm not good at yacc but it seems that the
> escape symbol is used
> for all fields. But the patch fixes only PAE_INREPL state.
> 
> Also I did some tests with ispell utility. For simplicity I fixed the
> .aff file in the following way:
> 
> flag *E:
>     .           >     YLI
>     .           >     YLI\-
> 
> And I got the following results:
> 
> word: ylijohdon
> ok (derives from root JOHDON)
> 
> word: yli-johdon
> ok (derives from root JOHDON)
> 
> word: yly-johdon
> how about: yli-johdon
> 
> So hyphen escaping works. And results for PostgreSQL with the patch
> and the .aff file
> fix:
> 
> =# select ts_lexize('finnish_ispell', 'yli-johdon');
>      ts_lexize
> -------------------
>  {johdon,johdossa}
> =# select ts_lexize('finnish_ispell', 'ylijohdon');
>      ts_lexize
> -------------------
>  {johdon,johdossa}

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
