Re: [HACKERS] Flexible configuration for full-text search - Mailing list pgsql-hackers

From Aleksandr Parfenov
Subject Re: [HACKERS] Flexible configuration for full-text search
Msg-id 20171030154032.5447672c@asp437-24-g082ur
In response to Re: [HACKERS] Flexible configuration for full-text search  (Emre Hasegeli <emre@hasegeli.com>)
Responses Re: [HACKERS] Flexible configuration for full-text search
List pgsql-hackers
I'm mostly happy with the mentioned modifications, but I have a few
questions to clarify some points. I will send a new patch in a week or two.

On Thu, 26 Oct 2017 20:01:14 +0200
Emre Hasegeli <emre@hasegeli.com> wrote:
> To put it formally:
> 
> ALTER TEXT SEARCH CONFIGURATION name
>     ADD MAPPING FOR token_type [, ... ] WITH config
> 
> where config is one of:
> 
>     dictionary_name
>     config { UNION | INTERSECT | EXCEPT } config
>     CASE config WHEN [ NO ] MATCH THEN [ KEEP ELSE ] config END

According to the formal definition, the following configurations are valid:

CASE english_hunspell WHEN MATCH THEN KEEP ELSE simple END
CASE english_noun WHEN MATCH THEN english_hunspell END

But this configuration:

CASE english_noun WHEN MATCH THEN english_hunspell ELSE simple END

is not (as I understand it, ELSE can be used only with KEEP).

I think we should decide whether to allow or disallow the use of
different dictionaries for match checking (between CASE and WHEN) and
for the result (after THEN). If the answer is 'allow', maybe we should
allow the third example too, for consistency in configurations.
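
For illustration, here is a minimal Python sketch of a recogniser for the
grammar exactly as quoted above. It is not part of the patch and says
nothing about how the real parser would be written; the names tokenize,
Parser and is_valid are hypothetical. It only checks which of the three
examples can be derived from the formal definition.

```python
import re

# Reserved words of the quoted grammar; anything else is a dictionary name.
KEYWORDS = {"CASE", "WHEN", "NO", "MATCH", "THEN", "KEEP", "ELSE", "END",
            "UNION", "INTERSECT", "EXCEPT"}
SET_OPS = ("UNION", "INTERSECT", "EXCEPT")


def tokenize(text):
    """Split into identifier-like words (SQL comments are not handled)."""
    return re.findall(r"[A-Za-z_][A-Za-z_0-9]*", text)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, word):
        if self.peek() != word:
            raise SyntaxError(f"expected {word}, got {self.peek()}")
        self.pos += 1

    def config(self):
        # config { UNION | INTERSECT | EXCEPT } config, written as a loop
        self.primary()
        while self.peek() in SET_OPS:
            self.pos += 1
            self.primary()

    def primary(self):
        tok = self.peek()
        if tok == "CASE":
            # CASE config WHEN [ NO ] MATCH THEN [ KEEP ELSE ] config END
            self.pos += 1
            self.config()
            self.expect("WHEN")
            if self.peek() == "NO":
                self.pos += 1
            self.expect("MATCH")
            self.expect("THEN")
            if self.peek() == "KEEP":
                self.pos += 1
                self.expect("ELSE")  # ELSE is reachable only after KEEP
            self.config()
            self.expect("END")
        elif tok is not None and tok not in KEYWORDS:
            self.pos += 1  # dictionary_name
        else:
            raise SyntaxError(f"unexpected token {tok}")


def is_valid(text):
    """True if the whole text is derivable from the quoted grammar."""
    parser = Parser(tokenize(text))
    try:
        parser.config()
    except SyntaxError:
        return False
    return parser.pos == len(parser.tokens)


for example in (
    "CASE english_hunspell WHEN MATCH THEN KEEP ELSE simple END",
    "CASE english_noun WHEN MATCH THEN english_hunspell END",
    "CASE english_noun WHEN MATCH THEN english_hunspell ELSE simple END",
):
    print(is_valid(example), example)
```

Under this reading the first two examples are accepted and the third is
rejected, because ELSE can only follow KEEP.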

> > 3) Using different dictionaries for recognizing and output
> > generation. As I mentioned before, in new syntax condition and
> > command are separate and we can use it for some more complex text
> > processing. Here an example for processing only nouns:
> >
> > ALTER TEXT SEARCH CONFIGURATION nouns_only
> >   ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
> >                     word, hword, hword_part WITH CASE
> >   WHEN english_noun THEN english_hunspell
> > END  
> 
> This would also still work with the simpler syntax because
> "english_noun", still being a dictionary, would pass the tokens to the
> next one.

Based on the formal definition, this example can be written in the
following manner:
CASE english_noun WHEN MATCH THEN english_hunspell END

The question is the same as in the previous example.

> Instead of supporting old way of putting stopwords on dictionaries, we
> can make them dictionaries on their own.  This would then become
> something like:
> 
>     CASE polish_stopword
>         WHEN NO MATCH THEN polish_isspell
>     END

Currently, stopwords increment the position counter, for example:
SELECT to_tsvector('english', 'a test message');
     to_tsvector
---------------------
 'messag':3 'test':2

The stopword 'a' has position 1, but it is not in the vector.

If we want to preserve this behavior, we should somehow pass the
stopword to the tsvector composition function (parsetext in ts_parse.c)
so that it increments the counter, or increment the counter in another
way. Currently, an empty lexeme array is passed as the result of
LexizeExec.
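
The difference can be sketched with a toy Python model. This is only an
illustration under stated assumptions, not the code in ts_parse.c: the
lookup tables are hand-written stand-ins for the dictionaries, the names
build_vector, lexize_with_stopword and lexize_dropping_stopword are
hypothetical, and the model assumes that a token for which nothing is
returned does not advance the counter.

```python
def lexize_with_stopword(token):
    """Toy dictionary: a stopword yields an empty list, as described above."""
    table = {"a": [], "test": ["test"], "message": ["messag"]}
    return table.get(token)  # None means the token was not recognised


def lexize_dropping_stopword(token):
    """Toy variant: the stopword is swallowed and nothing is returned."""
    table = {"test": ["test"], "message": ["messag"]}
    return table.get(token)


def build_vector(tokens, lexize):
    """Collect lexeme positions the way the toy model counts them."""
    positions = {}
    pos = 0
    for token in tokens:
        lexemes = lexize(token)
        if lexemes is None:
            continue  # assumption: nothing returned, counter not advanced
        pos += 1      # an empty list still advances the counter
        for lexeme in lexemes:
            positions.setdefault(lexeme, []).append(pos)
    return positions


tokens = ["a", "test", "message"]
print(build_vector(tokens, lexize_with_stopword))
print(build_vector(tokens, lexize_dropping_stopword))
```

In the first call 'test' and 'messag' get positions 2 and 3, matching the
to_tsvector output above; in the second call they shift to 1 and 2,
which is the change in behavior that would need to be avoided.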

One possible way to do so is something like:
CASE polish_stopword
    WHEN MATCH THEN KEEP -- stopword counting
    ELSE polish_isspell
END

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company