BUG #6455: Wrong match of ipsell dict. - Mailing list pgsql-bugs

From vincent.desmares@inovia-team.com
Subject BUG #6455: Wrong match of ipsell dict.
Date
Msg-id E1RwvqG-0000fE-Oj@wrigleys.postgresql.org
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      6455
Logged by:          Desmares Vincent
Email address:      vincent.desmares@inovia-team.com
PostgreSQL version: 9.1.0
Operating system:   Ubuntu
Description:

Hello everyone,

We recently discovered what may be a bug in the Full Text Search of
PostgreSQL, more precisely in the ispell dictionary.

It appears that words made of a single repeated character (like “a”, “aa”,
“aaa”, ...) trigger all the prefix and suffix rules even if no flags have
been specified for them in the dictionary.

We hit the bug with the dictionary word “e”, which was returned as the
lexeme for the word “deer”.

Here is a short way to reproduce the bug from scratch:

# 1) Create a test.dict with only “e” inside
echo "e" > test.dict
# 2) Create an empty test.stop file
touch test.stop
# 3) Create a test.affix file with these rules:
echo -e 'PFX C Y 1\nPFX C 0 de .\n\nSFX R Y 1\nSFX R 0 r e\n' > test.affix
# 4) Execute these queries:

DROP TEXT SEARCH DICTIONARY IF EXISTS testispell CASCADE;

CREATE TEXT SEARCH DICTIONARY testispell (
    TEMPLATE = ispell,
    DictFile = test,
    AffFile = test,
    StopWords = test
);

CREATE TEXT SEARCH CONFIGURATION test_ispell (
  PARSER = "default"
);
ALTER TEXT SEARCH CONFIGURATION test_ispell ADD MAPPING FOR asciihword WITH
testispell;
ALTER TEXT SEARCH CONFIGURATION test_ispell ADD MAPPING FOR asciiword WITH
testispell;
ALTER TEXT SEARCH CONFIGURATION test_ispell ADD MAPPING FOR uint WITH
testispell;
ALTER TEXT SEARCH CONFIGURATION test_ispell ADD MAPPING FOR word WITH
testispell;

SELECT * from ts_debug('test_ispell', 'deer');

# 5) You should get a row with this result:

alias : "asciiword"
description :  "Word, all ASCII"
token : "deer"
dictionaries : "{testispell}"
dictionary : "testispell"
lexemes : "{e}"
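
Steps 1 to 3 can also be written as a small script. This is only a sketch,
assuming a POSIX shell: it uses printf instead of echo -e (which not every
shell supports) and writes into a scratch directory whose name, ispell_repro,
is made up for illustration. PostgreSQL reads these files from the
tsearch_data directory under its share directory, so they still have to be
copied there afterwards.

```shell
#!/bin/sh
# Sketch of steps 1-3; WORKDIR is a hypothetical scratch location.
set -eu
WORKDIR="${WORKDIR:-./ispell_repro}"
mkdir -p "$WORKDIR"

# 1) Dictionary containing the single word "e", with no affix flags.
printf 'e\n' > "$WORKDIR/test.dict"

# 2) Empty stop-word file.
: > "$WORKDIR/test.stop"

# 3) Affix rules: prefix flag C prepends "de" to any word,
#    suffix flag R appends "r" to words ending in "e".
printf 'PFX C Y 1\nPFX C 0 de .\n\nSFX R Y 1\nSFX R 0 r e\n' > "$WORKDIR/test.affix"

# Print the affix file so it can be compared with the report.
cat "$WORKDIR/test.affix"

# To install (assumes pg_config is on PATH and the directory is writable):
#   cp "$WORKDIR"/test.* "$(pg_config --sharedir)/tsearch_data/"
```

With both flags applied, “e” would expand to “de” + “e” + “r”, which is
exactly “deer”; the dictionary entry above carries neither flag.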

It appears to be reproducible with longer runs of the same letter:
- .dict with [ee] searching for [deeer] gives [ee]
but
- .dict with [ee] searching for [eer|deee] gives nothing
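
The cases above can be checked against the dictionary directly with
ts_lexize(), which bypasses the parser and the configuration mappings. The
following is a sketch only: it generates one probe per token into a file
whose name, check_tokens.sql, is made up for illustration, and it assumes
the testispell dictionary has already been created from a .dict containing
[ee].

```shell
#!/bin/sh
# Sketch: write one ts_lexize() probe per token; OUT is a hypothetical file name.
set -eu
OUT="${OUT:-check_tokens.sql}"
: > "$OUT"

for token in ee deeer eer deee; do
    # ts_lexize(dictionary, token) returns the lexemes produced by the
    # dictionary, or NULL when the token is unknown to it.
    printf "SELECT '%s' AS token, ts_lexize('testispell', '%s') AS lexemes;\n" \
        "$token" "$token" >> "$OUT"
done

# Show the generated queries.
cat "$OUT"
```

The generated file can then be run with psql -X -f check_tokens.sql against
the database that holds the dictionary.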

Did we miss a configuration option or a default behavior, or is this really
a bug?

Regards,

Vincent Desmares
Developer @ Inovia-team
