Re: a tsearch2 (8.2.4) dictionary that only filters out stopwords - Mailing list pgsql-patches

From Jan Urbański
Subject Re: a tsearch2 (8.2.4) dictionary that only filters out stopwords
Msg-id 47348F23.7070002@students.mimuw.edu.pl
In response to a tsearch2 (8.2.4) dictionary that only filters out stopwords  (Jan Urbański <j.urbanski@students.mimuw.edu.pl>)
Responses Re: a tsearch2 (8.2.4) dictionary that only filters out stopwords
List pgsql-patches
> This example still doesn't seem very convincing --- why would you not
> merely attach the stopword list to the pl_ispell dictionary?

Because the ispell-based dictionaries first stem the lexeme and only
then look the stemmed form up in the stopwords file. The situation here
is that a stopword is first stemmed to produce another lexeme (which is
not in the stopwords file, as it's a perfectly valid word), and that
lexeme then gets indexed instead of being discarded.
To restate: the word 'od' in Polish is both a preposition and a declined
form of the noun 'oda'. The ispell dictionary, when passed the lexeme
'od', first stems it to produce 'oda' and then fails to find 'oda' in
the stopwords file. If I included the word 'oda' in the stopwords file,
I'd be losing information about the noun 'oda' appearing in documents.
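
To make the ordering problem concrete, here is a toy model of the two
configurations. It is only a sketch: the function names and the tiny
stem table are made up for illustration and are not the tsearch2 API. It
assumes the usual convention that a dictionary either returns lexemes,
returns an empty list (stopword, discard), or reports "unknown" so that
the next dictionary in the chain gets a try.

```python
# Toy model only: these names are hypothetical, not the tsearch2 API.
STOPWORDS = {"od"}       # the Polish preposition 'od'
STEMS = {"od": "oda"}    # 'od' is also a declined form of the noun 'oda'


def ispell_like(token, stopwords):
    """Stem first, then check the stopword list (the order described above)."""
    stem = STEMS.get(token, token)
    if stem in stopwords:
        return []        # recognized as a stopword: discard
    return [stem]        # otherwise the stem gets indexed


def stopword_filter(token, stopwords):
    """Discard stopwords; return None ("unknown") to pass the token on."""
    return [] if token in stopwords else None


def run_chain(token, dictionaries):
    """Ask each dictionary in turn; the first answer that is not None wins."""
    for dictionary in dictionaries:
        result = dictionary(token, STOPWORDS)
        if result is not None:
            return result
    return []


print(run_chain("od", [ispell_like]))                    # ['oda']
print(run_chain("od", [stopword_filter, ispell_like]))   # []
print(run_chain("oda", [stopword_filter, ispell_like]))  # ['oda']
```

With the filter in front, 'od' is dropped before it can be stemmed,
while the noun 'oda' still reaches the ispell-like step and is indexed.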

I'm still trying to find an English example, as I'm sure it would be
easier for most readers of this list to understand. Nothing comes to
mind, however - I guess some languages just have rotten luck with their
grammar.

> If there is a use-case for it, IMHO it'd be better to add a boolean
> accept-or-pass-on parameter to the "simple" dictionary than to add a
> whole new dictionary type.

Ah, I hadn't thought of that. You may well be right - it does look like
an easier solution. However, it would require coding some basic parsing
logic into the dex_init procedure, because right now the 'simple'
dictionary expects dict_initoption to be a path to the stopwords file.
Do you mean something like
'StopFile="/path/to/stopwords", AcceptUnknown=0'?
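
For illustration, the kind of parsing I have in mind could look roughly
like the sketch below. It is written in Python only to show the logic;
the real change would be C inside dex_init. The helper name
parse_dict_options is hypothetical, and so is the default for
AcceptUnknown (I assume "accept", so that an option string without it
would behave like today's 'simple' dictionary).

```python
import re

# Hypothetical helper, for illustration only; not part of tsearch2.
_OPTION = re.compile(r'\s*(\w+)\s*=\s*(?:"([^"]*)"|([^,\s]+))\s*(?:,|$)')


def parse_dict_options(text):
    """Parse 'Key=value, Key="quoted value"' into a dict with lowercase keys."""
    text = text.strip()
    options = {}
    pos = 0
    while pos < len(text):
        match = _OPTION.match(text, pos)
        if match is None:
            raise ValueError(f"cannot parse options at offset {pos}: {text[pos:]!r}")
        quoted, bare = match.group(2), match.group(3)
        options[match.group(1).lower()] = quoted if quoted is not None else bare
        pos = match.end()
    return options


opts = parse_dict_options('StopFile="/path/to/stopwords", AcceptUnknown=0')
stop_file = opts["stopfile"]
# Assumed default: accept unknown words unless explicitly set to 0.
accept_unknown = opts.get("acceptunknown", "1") != "0"
```

A bare path with no 'Key=' part would be rejected by this sketch, so
existing configurations would need either a fallback or a migration.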

Regards,
Jan Urbanski
--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


