Re: [HACKERS] Flexible configuration for full-text search - Mailing list pgsql-hackers
From | Aleksandr Parfenov |
---|---|
Subject | Re: [HACKERS] Flexible configuration for full-text search |
Date | |
Msg-id | 20171030154032.5447672c@asp437-24-g082ur |
In response to | Re: [HACKERS] Flexible configuration for full-text search (Emre Hasegeli <emre@hasegeli.com>) |
Responses | Re: [HACKERS] Flexible configuration for full-text search |
List | pgsql-hackers |
I'm mostly happy with the mentioned modifications, but I have a few
questions to clarify some points. I will send a new patch in a week or two.

On Thu, 26 Oct 2017 20:01:14 +0200
Emre Hasegeli <emre@hasegeli.com> wrote:

> To put it formally:
>
> ALTER TEXT SEARCH CONFIGURATION name
>     ADD MAPPING FOR token_type [, ... ] WITH config
>
> where config is one of:
>
>     dictionary_name
>     config { UNION | INTERSECT | EXCEPT } config
>     CASE config WHEN [ NO ] MATCH THEN [ KEEP ELSE ] config END

According to the formal definition, the following configurations are valid:

    CASE english_hunspell WHEN MATCH THEN KEEP ELSE simple END
    CASE english_noun WHEN MATCH THEN english_hunspell END

But the configuration:

    CASE english_noun WHEN MATCH THEN english_hunspell ELSE simple END

is not (as I understand it, ELSE can be used only with KEEP).

I think we should decide whether to allow or disallow the use of different
dictionaries for match checking (between CASE and WHEN) and for the result
(after THEN). If the answer is 'allow', maybe we should allow the third
example too, for consistency in configurations.

> > 3) Using different dictionaries for recognizing and output
> > generation. As I mentioned before, in new syntax condition and
> > command are separate and we can use it for some more complex text
> > processing. Here an example for processing only nouns:
> >
> > ALTER TEXT SEARCH CONFIGURATION nouns_only
> >     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
> >     word, hword, hword_part WITH CASE
> >         WHEN english_noun THEN english_hunspell
> >     END
>
> This would also still work with the simpler syntax because
> "english_noun", still being a dictionary, would pass the tokens to the
> next one.

Based on the formal definition, this example can be described in the
following manner:

    CASE english_noun WHEN MATCH THEN english_hunspell END

The question is the same as in the previous example.

> Instead of supporting old way of putting stopwords on dictionaries, we
> can make them dictionaries on their own. This would then become
> something like:
>
> CASE polish_stopword
>     WHEN NO MATCH THEN polish_isspell
> END

Currently, stopwords increment the position, for example:

    SELECT to_tsvector('english','a test message');
    ---------------------
     'messag':3 'test':2

The stopword 'a' has position 1, but it is not in the vector. If we want
to keep this behavior, we should somehow pass the stopword to the tsvector
composition function (parsetext in ts_parse.c) so that it can increment the
counter, or increment the counter in some other way. Currently, an empty
lexemes array is passed as the result of LexizeExec. One possible way to do
this is something like:

    CASE polish_stopword
        WHEN MATCH THEN KEEP -- stopword counting
        ELSE polish_isspell
    END

--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
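To make the validity question above concrete, here is a minimal Python
sketch of a checker for the quoted grammar. It is only an illustration and
not part of the patch: the names tokenize, ConfigParser and is_valid are
hypothetical, and it assumes my reading of the definition, in which ELSE is
accepted only directly after KEEP.

```python
import re

SET_OPS = {"UNION", "INTERSECT", "EXCEPT"}
KEYWORDS = SET_OPS | {"CASE", "WHEN", "NO", "MATCH", "THEN", "KEEP", "ELSE", "END"}


def tokenize(text):
    """Split a configuration string into identifier-like tokens."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text)


class ConfigParser:
    """Recursive-descent recognizer for the 'config' rule quoted above."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        # Upper-cased next token, or None at end of input.
        if self.pos < len(self.tokens):
            return self.tokens[self.pos].upper()
        return None

    def take(self):
        tok = self.peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return tok

    def expect(self, keyword):
        tok = self.take()
        if tok != keyword:
            raise SyntaxError(f"expected {keyword}, got {tok}")

    def config(self):
        # config { UNION | INTERSECT | EXCEPT } config
        self.primary()
        while self.peek() in SET_OPS:
            self.take()
            self.primary()

    def primary(self):
        tok = self.take()
        if tok == "CASE":
            # CASE config WHEN [ NO ] MATCH THEN [ KEEP ELSE ] config END
            self.config()
            self.expect("WHEN")
            if self.peek() == "NO":
                self.take()
            self.expect("MATCH")
            self.expect("THEN")
            if self.peek() == "KEEP":
                self.take()
                self.expect("ELSE")  # assumption: ELSE only follows KEEP
            self.config()
            self.expect("END")
        elif tok in KEYWORDS:
            raise SyntaxError(f"unexpected keyword {tok}")
        # Any other token is treated as a dictionary name.


def is_valid(text):
    """Return True if text matches the grammar as read above."""
    parser = ConfigParser(tokenize(text))
    try:
        parser.config()
    except SyntaxError:
        return False
    return parser.pos == len(parser.tokens)
```

Under this reading, is_valid accepts the first two CASE examples and both
stopword variants, and rejects the third example because ELSE appears
without a preceding KEEP. Allowing the third form would mean relaxing the
branch after THEN.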