On Tue, 28 Aug 2018 12:40:32 +0700
Aleksandr Parfenov <a.parfenov@postgrespro.ru> wrote:
>On Fri, 24 Aug 2018 18:50:38 +0300
>Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>>Agreed, backward compatibility is important here. Probably we should
>>leave old dictionaries for that. But I just meant that if we
>>introduce new (better) way of stop words handling and encourage users
>>to use it, then it would look strange if default configurations work
>>the old way...
>
>I agree with Alexander. The only drawback I see is that after addition
>of new dictionaries, there will be 3 dictionaries for each language:
>old one, stop-word filter for the language, and stemmer dictionary.

While working on the new version of the patch, I found an issue in the
proposed syntax. At the beginning of the conversation, there was a
suggestion to split stop word filtering from word normalization. At this
stage of development, we can use a separate dictionary for stop word
detection, but if we drop the word, the word counter is not incremented
and the stop word is processed as an unknown word, so the positions of
the following lexemes in the vector change.
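To make the difference concrete, here is a toy model in Python. It is
only a sketch of the behaviour described above, not PostgreSQL code: the
function names, the stop word list, and the lowercase-only "stemming"
are all assumptions made for illustration. A dictionary returns a list
of lexemes, an empty list for a stop word, or None for a word it does
not recognize.

```python
from typing import Callable, List, Optional, Tuple

# A toy dictionary returns a list of lexemes, [] for a stop word,
# or None when it does not recognize the token.
Dictionary = Callable[[str], Optional[List[str]]]

STOP_WORDS = frozenset({"the", "a", "of"})  # illustrative, not a real stop-word file


def stemmer_with_stop_words(token: str) -> Optional[List[str]]:
    """Old-style dictionary: stop word detection and normalization combined."""
    word = token.lower()
    if word in STOP_WORDS:
        return []      # recognized as a stop word
    return [word]      # stand-in for real stemming


def drop_stop_words_then_stem(token: str) -> Optional[List[str]]:
    """Split setup where the filter drops the word before the stemmer sees it."""
    word = token.lower()
    if word in STOP_WORDS:
        return None    # dropped: indistinguishable from an unknown word
    return [word]


def to_vector(text: str, dictionary: Dictionary) -> List[Tuple[str, int]]:
    """Assign positions; the counter advances only for recognized tokens."""
    position = 0
    vector: List[Tuple[str, int]] = []
    for token in text.split():
        lexemes = dictionary(token)
        if lexemes is None:
            continue   # unknown word: no lexeme, counter unchanged
        position += 1  # recognized word or stop word: counter advances
        vector.extend((lexeme, position) for lexeme in lexemes)
    return vector
```

In this model, to_vector("the quick fox", stemmer_with_stop_words) puts
"quick" at position 2, while the same text with
drop_stop_words_then_stem puts it at position 1, because the dropped
stop word no longer advances the counter.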
Currently, I see two solutions:

1) Keep the old way of stop word filtering. The drawback of this
approach is that word normalization and stop word detection logic stay
mixed inside one dictionary. This can be worked around by using a
'simple' dictionary in accept=false mode as a stop word filter.

2) Add a STOPWORD action alongside KEEP and DROP (which is not
implemented in the previous patch, but I think it is good to have both
of them), meaning "increase the word counter but don't add a lexeme to
the vector".
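The second option could behave roughly like the Python sketch below.
This is a toy model only: the Action enum, classify() and the word sets
are hypothetical names, and the semantics of KEEP and DROP are my
reading of the proposal (KEEP emits the lexeme, DROP discards the word
without touching the counter), not the patch's actual implementation.

```python
from enum import Enum
from typing import List, Tuple


class Action(Enum):
    KEEP = "keep"          # assumed: emit the lexeme and advance the counter
    DROP = "drop"          # assumed: discard the word, counter unchanged
    STOPWORD = "stopword"  # proposed: advance the counter, emit nothing


STOP_WORDS = frozenset({"the", "a", "of"})  # illustrative list
DROPPED = frozenset({"@@"})                 # hypothetical tokens to drop


def classify(word: str) -> Action:
    """Hypothetical filter stage that decides what happens to a word."""
    if word in STOP_WORDS:
        return Action.STOPWORD
    if word in DROPPED:
        return Action.DROP
    return Action.KEEP


def build_vector(text: str) -> List[Tuple[str, int]]:
    """Build (lexeme, position) pairs according to the three actions."""
    position = 0
    vector: List[Tuple[str, int]] = []
    for token in text.split():
        word = token.lower()
        action = classify(word)
        if action is Action.DROP:
            continue       # no lexeme, no position
        position += 1      # KEEP and STOPWORD both advance the counter
        if action is Action.KEEP:
            vector.append((word, position))
    return vector
```

With this model, build_vector("The quick @@ fox") keeps "quick" at
position 2 and "fox" at position 3: the stop word still occupies
position 1, and the dropped token occupies none.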
Any suggestions on the issue?
--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company