Thread: BUG #16337: Finnish Ispell dictionary cannot be created
The following bug has been logged on the website:

Bug reference: 16337
Logged by: Matti Linnanvuori
Email address: matti.linnanvuori@portalify.com
PostgreSQL version: 12.2
Operating system: Red Hat Enterprise Linux 8.1
Description:

postgres=# CREATE TEXT SEARCH DICTIONARY finnish_ispell ( TEMPLATE = ispell,
DictFile = fi_fi, AffFile = fi_fi, Stopwords = finnish);
ERROR: syntax error
CONTEXT: line 83 of configuration file
"/usr/pgsql-12/share/tsearch_data/fi_fi.affix": " I >
ALI\-
"

http://ispell-fi.sourceforge.net/finnish.dict.bz2
bunzip2 finnish.dict.bz2
iconv -f ISO_8859-1 -t UTF-8 -o fi_fi.dict finnish.dict
cp fi_fi.dict /usr/pgsql-12/share/tsearch_data

http://ispell-fi.sourceforge.net/finnish.large.aff.bz2
bunzip2 finnish.large.aff.bz2
iconv -f ISO_8859-1 -t UTF-8 -o fi_fi.affix finnish.large.aff
cp fi_fi.affix /usr/pgsql-12/share/tsearch_data

http://ispell-fi.sourceforge.net/finnish.medium.aff.bz2
bunzip2 finnish.medium.aff.bz2
iconv -f ISO_8859-1 -t UTF-8 -o fi_fi.affix finnish.medium.aff
cp fi_fi.affix /usr/pgsql-12/share/tsearch_data

https://www.postgresql.org/message-id/46CD5588.5080404%40enterprisedb.com
Hello,

On 4/2/2020 7:11 PM, PG Bug reporting form wrote:
> postgres=# CREATE TEXT SEARCH DICTIONARY finnish_ispell ( TEMPLATE = ispell,
> DictFile = fi_fi, AffFile = fi_fi, Stopwords = finnish);
> ERROR: syntax error
> CONTEXT: line 83 of configuration file
> "/usr/pgsql-12/share/tsearch_data/fi_fi.affix": " I >
> ALI\-
> "

Thank you for the email.

It seems that the backslash here escapes the character that follows it,
judging by the comment on this flag:

> flag *E:
> . > YLI # ylijohtaja
> I > YLI\- # yli-inhimillinen

Escaping a character is valid in the ispell format (see
https://manpages.debian.org/testing/ispell/ispell.5.en.html):

> Any character with special meaning to the parser can be changed to an uninterpreted token by backslashing it

I also looked for a Hunspell Finnish dictionary but didn't find one. I
found only a PostgreSQL extension:
https://github.com/Houston-Inc/dict_voikko

I think it is possible to fix the PostgreSQL parser, but I'm not sure
whether we should.

At first sight, parse_affentry() is what needs fixing.

--
Artur
On Fri, Apr 03, 2020 at 12:33:00PM +0900, Artur Zakirov wrote:
>Hello,
>
>On 4/2/2020 7:11 PM, PG Bug reporting form wrote:
>>postgres=# CREATE TEXT SEARCH DICTIONARY finnish_ispell ( TEMPLATE = ispell,
>>DictFile = fi_fi, AffFile = fi_fi, Stopwords = finnish);
>>ERROR: syntax error
>>CONTEXT: line 83 of configuration file
>>"/usr/pgsql-12/share/tsearch_data/fi_fi.affix": " I >
>>ALI\-
>>"
>
>Thank you for the email.
>
>It seems that the backslash here escapes the character that follows it,
>judging by the comment on this flag:
>
>>flag *E:
>> . > YLI # ylijohtaja
>> I > YLI\- # yli-inhimillinen
>
>Escaping a character is valid in the ispell format (see
>https://manpages.debian.org/testing/ispell/ispell.5.en.html):
>
>>Any character with special meaning to the parser can be changed to an uninterpreted token by backslashing it
>
>I also looked for a Hunspell Finnish dictionary but didn't find one. I
>found only a PostgreSQL extension:
>https://github.com/Houston-Inc/dict_voikko
>
>I think it is possible to fix the PostgreSQL parser, but I'm not sure
>whether we should.
>

I'm not sure if it's a valid ispell format (it might be, but I'm not
very good at reading the ispell manpage). But if it is, we should fix
the code to be able to read it.

>At first sight, parse_affentry() is what needs fixing.
>

Right, that seems like the place to fix. It seems we don't expect '-'
(escaped) when in PAE_INREPL state. I wonder if there are other things
we fail to support ...

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Apr 3, 2020 at 5:55 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> I'm not sure if it's a valid ispell format (it might be, but I'm not
> very good at reading the ispell manpage). But if it is, we should fix
> the code to be able to read it.

I attached a simple patch that fixes the PAE_INREPL state.

I don't fully understand the ispell manpage either. I looked at the
ispell source code; it uses yacc for parsing. I'm not good at yacc, but
it seems that the escape symbol is accepted in all fields. The patch,
however, fixes only the PAE_INREPL state.

I also ran some tests with the ispell utility. For simplicity I changed
the .aff file in the following way:

flag *E:
. > YLI
. > YLI\-

And I got the following results:

word: ylijohdon
ok (derives from root JOHDON)

word: yli-johdon
ok (derives from root JOHDON)

word: yly-johdon
how about: yli-johdon

So hyphen escaping works. Here are the results for PostgreSQL with the
patch and the same .aff file change:

=# select ts_lexize('finnish_ispell', 'yli-johdon');
 ts_lexize
-------------------
 {johdon,johdossa}

=# select ts_lexize('finnish_ispell', 'ylijohdon');
 ts_lexize
-------------------
 {johdon,johdossa}

--
Artur
Hello, Artur.

At Sun, 12 Apr 2020 23:13:26 +0900, Artur Zakirov <zaartur@gmail.com> wrote in
> On Fri, Apr 3, 2020 at 5:55 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > I'm not sure if it's a valid ispell format (it might be, but I'm not
> > very good at reading the ispell manpage). But if it is, we should fix
> > the code to be able to read it.
>
> I attached a simple patch that fixes the PAE_INREPL state.

Looking at man 5 ispell: "Any character with special meaning to the
parser can be changed to an uninterpreted token by backslashing it". It
depends on how strict we should be about that, but I think it is safer
to treat any character prefixed by a backslash as a word character. (I
don't understand how '-' can be part of a word according to the
definitions in the .affix file, though.)

Since an escaped character is intended to be part of a word, I see no
point in special-casing the minus sign ad hoc. As a result,
parse_affentry would look something like the following:

    while (*str)
    {
        if (t_iseq(str, '\\') && !escaped)
        {
            str += pg_mblen(str);
            escaped = true;
            continue;
        }

        if (state == ..)
        {
            if (t_iseq(str, <special>) && !escaped)
                <handle special>
            else if (t_isalpha(str) || escaped)
                <handle non-special (or word) character>
            else if (!t_isspace(str))
                ereport(ERROR...
        }
        ...
        str += pg_mblen(str);
        escaped = false;
    }

Any thoughts or opinions?

> I don't fully understand the ispell manpage either. I looked at the
> ispell source code; it uses yacc for parsing. I'm not good at yacc, but
> it seems that the escape symbol is accepted in all fields. The patch,
> however, fixes only the PAE_INREPL state.
>
> I also ran some tests with the ispell utility. For simplicity I changed
> the .aff file in the following way:
>
> flag *E:
> . > YLI
> . > YLI\-
>
> And I got the following results:
>
> word: ylijohdon
> ok (derives from root JOHDON)
>
> word: yli-johdon
> ok (derives from root JOHDON)
>
> word: yly-johdon
> how about: yli-johdon
>
> So hyphen escaping works. Here are the results for PostgreSQL with the
> patch and the same .aff file change:
>
> =# select ts_lexize('finnish_ispell', 'yli-johdon');
>  ts_lexize
> -------------------
>  {johdon,johdossa}
>
> =# select ts_lexize('finnish_ispell', 'ylijohdon');
>  ts_lexize
> -------------------
>  {johdon,johdossa}

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
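The loop proposed in the message above can be illustrated with a small standalone C sketch. It is not PostgreSQL code and not the patch from this thread: read_affix_token is a hypothetical helper, the delimiter set (whitespace, ',' and '#') is an assumption made for illustration, and it walks single bytes with isalpha()/isspace(), whereas the real parser steps through multibyte characters with pg_mblen() and the t_* helpers.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not PostgreSQL code: copy one affix token from src
 * into dst, treating any backslash-prefixed character as a literal word
 * character. Stops at the first unescaped whitespace, ',' or '#'.
 * Returns the number of bytes consumed from src, or -1 on a syntax error.
 * Assumes single-byte (ASCII) input.
 */
int
read_affix_token(const char *src, char *dst, size_t dstlen)
{
    const char *p = src;
    size_t      n = 0;
    bool        escaped = false;

    if (dstlen == 0)
        return -1;

    for (; *p; p++)
    {
        unsigned char c = (unsigned char) *p;

        if (c == '\\' && !escaped)
        {
            escaped = true;     /* next character is taken literally */
            continue;
        }
        if (!escaped && (isspace(c) || c == ',' || c == '#'))
            break;              /* unescaped delimiter ends the token */
        if (!escaped && !isalpha(c))
            return -1;          /* unescaped non-letter: syntax error */
        if (n + 1 >= dstlen)
            return -1;          /* token does not fit in dst */
        dst[n++] = (char) c;
        escaped = false;
    }
    if (escaped)
        return -1;              /* dangling backslash at end of input */
    dst[n] = '\0';
    return (int) (p - src);
}
```

With this sketch, the token written as ALI\- in the .affix file is read as "ALI-", while an unescaped ALI- is rejected, which mirrors the syntax error in the original report.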
Hello Horiguchi-san,

On 4/13/2020 5:36 PM, Kyotaro Horiguchi wrote:
> Looking at man 5 ispell: "Any character with special meaning to the
> parser can be changed to an uninterpreted token by backslashing it". It
> depends on how strict we should be about that, but I think it is safer
> to treat any character prefixed by a backslash as a word character. (I
> don't understand how '-' can be part of a word according to the
> definitions in the .affix file, though.)
>
> Since an escaped character is intended to be part of a word, I see no
> point in special-casing the minus sign ad hoc.

Thank you for looking at the patch.

I don't mind making the patch cover broader cases. However, I tested the
ispell utility earlier with escaped characters other than '-', and it
seems to ignore such affixes or not work properly. Still, in general it
is probably better to stick closer to the man page description.

I attached a new version of the patch. It handles escaping only in the
PAE_INFIND and PAE_INREPL states. I don't think we should allow escaping
in every state; it is safer to have some exceptions:

- In PAE_WAIT_MASK we shouldn't escape a comment string, which starts
  with '#'
- The PAE_INMASK case is handled separately by regcomp.c, and it is
  probably better to leave the string as-is
- PAE_WAIT_FIND can start only with '-'
- I don't think there is any sense in escaping in PAE_WAIT_REPL

Also, in PAE_INFIND and PAE_INREPL I think we shouldn't allow escaping
of ',' and '#'.

The condition:

    if (t_iseq(str, '\\') && !isescaped &&
        (state == PAE_INFIND || state == PAE_INREPL))

may not be great, but I cannot come up with a better solution.

--
Artur
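The state-restricted rule described in the last message can be sketched as a standalone C predicate. This is an illustration under stated assumptions, not the attached patch: the AffParseState enum only stands in for the PAE_* state names mentioned in the thread (its values are arbitrary), starts_escape is a hypothetical name, the handling of ',' and '#' is one possible reading of "shouldn't allow escaping", and characters are assumed to be single bytes.

```c
#include <stdbool.h>

/* Stand-ins for the parser states named in the thread; values are arbitrary. */
typedef enum
{
    PAE_WAIT_MASK,
    PAE_INMASK,
    PAE_WAIT_FIND,
    PAE_INFIND,
    PAE_WAIT_REPL,
    PAE_INREPL
} AffParseState;

/*
 * Hypothetical predicate: does the backslash at str[0] start an escape?
 * Assumes single-byte characters; the real parser steps with pg_mblen().
 */
bool
starts_escape(AffParseState state, const char *str, bool escaped)
{
    if (str[0] != '\\' || escaped)
        return false;           /* not a backslash, or already escaped */
    if (state != PAE_INFIND && state != PAE_INREPL)
        return false;           /* mask and waiting states are left as-is */
    /* Assumed rule: ',' and '#' keep their special meaning. */
    return str[1] != '\0' && str[1] != ',' && str[1] != '#';
}
```

Keeping the decision in a single predicate makes the exceptions listed above easy to check one by one: a backslash in the mask is passed through untouched, and `\#` or `\,` inside the find or replace field is not treated as an escape.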