Thread: [HACKERS] Flexible configuration for full-text search
Hello hackers,

Arthur Zakirov and I are working on a patch to introduce a more flexible way to configure full-text search in PostgreSQL, because the current syntax doesn't allow a variety of scenarios to be handled. Additionally, some parts contain implicit processing logic, such as filtering dictionaries with the TSL_FILTER flag, so part of the configuration has moved into the dictionary itself and is in most cases hardcoded there. One more drawback of the current FTS configuration is that we can't separate dictionary selection from output production, so we can't configure FTS to use one dictionary only if another one recognized a token (e.g. use hunspell only if a dictionary of nouns recognized the token).

Basically, the key goal of the patch is to give the user more control over text processing. The patch introduces a way to configure FTS based on a CASE/WHEN/THEN/ELSE construction. The current comma-separated list remains available for compatibility. The basic form of the new syntax is:

ALTER TEXT SEARCH CONFIGURATION <fts_conf>
ALTER MAPPING FOR <token_types> WITH
    CASE
        WHEN <condition> THEN <command>
        ....
        [ ELSE <command> ]
    END;

A condition is a logical expression on dictionaries. You can specify how to interpret dictionary output with:

dictionary IS [ NOT ] NULL     -- for a NULL result
dictionary IS [ NOT ] STOPWORD -- for an empty (stopword) result

If no interpretation marker is given, it is interpreted as:

dictionary IS NOT NULL AND dictionary IS NOT STOPWORD

A command is an expression on dictionary output sets with the operators UNION, EXCEPT and INTERSECT. Additionally, there is a special operator MAP BY which allows us to recreate the behavior of filtering dictionaries. The MAP BY operator takes the output of the right subexpression and sends it to the left subexpression as an input token (if there is more than one lexeme, each one is sent separately).

Here are a few examples of the new configuration, compared with solutions using the current syntax.

1) Multilingual search.
Can be used for FTS on a set of documents in different languages (the example is for German and English).

ALTER TEXT SEARCH CONFIGURATION multi
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
        word, hword, hword_part WITH CASE
            WHEN english_hunspell AND german_hunspell THEN
                english_hunspell UNION german_hunspell
            WHEN english_hunspell THEN english_hunspell
            WHEN german_hunspell THEN german_hunspell
            ELSE german_stem UNION english_stem
    END;

With the old configuration we would have to use a separate vector and index for each required language, and the query would have to combine the search results for each language:

SELECT * FROM en_de_documents
WHERE to_tsvector('english', text) @@ to_tsquery('english', 'query')
   OR to_tsvector('german', text) @@ to_tsquery('german', 'query');

The new multilingual search configuration itself looks more complex, but it avoids splitting the index and vectors. Additionally, for similar languages, or for configurations with simple or *_stem dictionaries in the list, we can reduce the total size of the index, since in the current-state example the index for the English configuration also keeps data about documents written in German, and vice versa.

2) Combination of exact search with morphological search. This patch doesn't fully solve the problem, but it is a step toward a solution. Currently, we have to split exact and morphological search in the query manually and use a separate index for each part. With the new way to configure FTS we can use the following configuration:

ALTER TEXT SEARCH CONFIGURATION exact_and_morph
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
        word, hword, hword_part WITH CASE
            WHEN english_hunspell THEN english_hunspell UNION simple
            ELSE english_stem UNION simple
    END;

Some queries, such as "'looking' <1> through" where 'looking' is a search for the exact form of the word, don't work in current-state FTS, since we can only guarantee that the document contains both 'looking' and 'through', but can't be sure about the distance between them.
Unfortunately, we can't fully support such queries with the current format of tsvector, because after processing we can't distinguish whether a word appeared in its normal form in the text or was produced by some dictionary. This leads to false positive hits if a user searches for the normal form of the word. I think we should give the user the ability to mark a dictionary as something like an "exact form producer". But without a tsvector modification this mark is useless, since we can't mark the output of such a dictionary in the tsvector. There is a patch on the commitfest which removes the 1MB limit on tsvector [1]. There are a few free bits available in each lexeme in the vector, so one of them could be used as an "exact" flag.

3) Using different dictionaries for recognition and output generation. As I mentioned before, in the new syntax the condition and the command are separate, and we can use that for more complex text processing. Here is an example that processes only nouns:

ALTER TEXT SEARCH CONFIGURATION nouns_only
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
        word, hword, hword_part WITH CASE
            WHEN english_noun THEN english_hunspell
    END;

This behavior cannot be achieved with the current state of FTS.

4) Special stopword processing allows us to discard stopwords even if the main dictionary doesn't support that feature (for example, the pl_ispell dictionary keeps stopwords in the text):

ALTER TEXT SEARCH CONFIGURATION pl_without_stops
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
        word, hword, hword_part WITH CASE
            WHEN simple_pl IS NOT STOPWORD THEN pl_ispell
    END;

The patch is attached. I will be glad to hear the hackers' opinion on it.

Several related cases were discussed on -hackers earlier:

Check for stopwords using a non-target dictionary.
https://www.postgresql.org/message-id/4733B65A.9030707@students.mimuw.edu.pl

Support union of outputs of several dictionaries.
https://www.postgresql.org/message-id/c6851b7e-da25-3d8e-a5df-022c395a11b4%40postgrespro.ru

Support of a chain of dictionaries using the MAP BY operator.
https://www.postgresql.org/message-id/46D57E6F.8020009%40enterprisedb.com

[1] Remove 1MB size limit in tsvector
https://commitfest.postgresql.org/15/1221/

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
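[Editor's note] The CASE/WHEN semantics proposed above can be sketched with a toy Python model (entirely hypothetical; the patch itself is C code inside the text-search machinery, and the dictionary behaviors below are stand-ins). Using example 4's pl_without_stops mapping: a dictionary maps a token to None (IS NULL), an empty list (IS STOPWORD), or a list of lexemes.

```python
# Toy three-state dictionary results, mirroring the proposal:
#   None -> IS NULL (unrecognized), [] -> IS STOPWORD, [lexemes] -> normal output.

def simple_pl(token):
    # Hypothetical stopword-aware dictionary: recognizes everything,
    # returning an empty list for stopwords.
    stopwords = {'i', 'w', 'z'}
    return [] if token.lower() in stopwords else [token.lower()]

def pl_ispell(token):
    # Hypothetical morphological dictionary that keeps stopwords as-is.
    return [token.lower()]

def recognized(result):
    """Default interpretation: dictionary IS NOT NULL AND IS NOT STOPWORD."""
    return result is not None and result != []

def pl_without_stops(token):
    # CASE WHEN simple_pl IS NOT STOPWORD THEN pl_ispell END
    if simple_pl(token) != []:          # simple_pl IS NOT STOPWORD
        return pl_ispell(token)
    return None                         # no branch matched: token dropped

tokens = ['Kot', 'i', 'pies']
lexemes = [lx for t in tokens for lx in (pl_without_stops(t) or [])]
print(lexemes)   # ['kot', 'pies'] -- the stopword 'i' is discarded
```

The point of the model is the three-way distinction on dictionary output: the condition between WHEN and THEN inspects it, while the command after THEN independently decides what to emit.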
Attachment
Attached is an updated patch with fixes for empty XML tags in the documentation.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment
> The patch introduces way to configure FTS based on CASE/WHEN/THEN/ELSE
> construction.

Interesting feature. I needed this flexibility before, when I was implementing text search for a Turkish private listing application. Aleksandr and Arthur were kind enough to discuss it with me off-list today.

> 1) Multilingual search. Can be used for FTS on a set of documents in
> different languages (example for German and English languages).
>
> ALTER TEXT SEARCH CONFIGURATION multi
>     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
>         word, hword, hword_part WITH CASE
>             WHEN english_hunspell AND german_hunspell THEN
>                 english_hunspell UNION german_hunspell
>             WHEN english_hunspell THEN english_hunspell
>             WHEN german_hunspell THEN german_hunspell
>             ELSE german_stem UNION english_stem
>     END;

I understand the need to support branching, but this syntax is overly complicated. I don't think there is any need to support a different set of dictionaries as condition and action. Something like this might work better:

ALTER TEXT SEARCH CONFIGURATION multi
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
        word, hword, hword_part WITH
        CASE english_hunspell UNION german_hunspell
            WHEN MATCH THEN KEEP
            ELSE german_stem UNION english_stem
        END;

To put it formally:

ALTER TEXT SEARCH CONFIGURATION name
    ADD MAPPING FOR token_type [, ... ] WITH config

where config is one of:

    dictionary_name
    config { UNION | INTERSECT | EXCEPT } config
    CASE config WHEN [ NO ] MATCH THEN [ KEEP ELSE ] config END

> 2) Combination of exact search with morphological one. This patch not
> fully solve the problem but it is a step toward solution. Currently, we
> should split exact and morphological search in query manually and use
> separate index for each part.
> With new way to configure FTS we can use
> following configuration:
>
> ALTER TEXT SEARCH CONFIGURATION exact_and_morph
>     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
>         word, hword, hword_part WITH CASE
>             WHEN english_hunspell THEN english_hunspell UNION simple
>             ELSE english_stem UNION simple
>     END

This could be:

CASE english_hunspell WHEN MATCH THEN KEEP ELSE english_stem END UNION simple

> 3) Using different dictionaries for recognizing and output generation.
> As I mentioned before, in new syntax condition and command are separate
> and we can use it for some more complex text processing. Here an
> example for processing only nouns:
>
> ALTER TEXT SEARCH CONFIGURATION nouns_only
>     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
>         word, hword, hword_part WITH CASE
>             WHEN english_noun THEN english_hunspell
>     END

This would also still work with the simpler syntax, because "english_noun", still being a dictionary, would pass the tokens to the next one.

> 4) Special stopword processing allows us to discard stopwords even if
> the main dictionary doesn't support such feature (in example pl_ispell
> dictionary keeps stopwords in text):
>
> ALTER TEXT SEARCH CONFIGURATION pl_without_stops
>     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
>         word, hword, hword_part WITH CASE
>             WHEN simple_pl IS NOT STOPWORD THEN pl_ispell
>     END

Instead of supporting the old way of putting stopwords in dictionaries, we can make them dictionaries of their own. This would then become something like:

CASE polish_stopword
    WHEN NO MATCH THEN polish_ispell
END
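[Editor's note] The formal grammar proposed above is small enough to model with a toy recursive evaluator (hypothetical Python, not part of the patch; the dictionaries are made up, and a dictionary returns None for "no match" or a list of lexemes).

```python
# Configs as nested tuples:
#   ('dict', fn)                         -- a dictionary
#   ('union', cfg, cfg)                  -- config UNION config
#   ('case', cond_cfg, then_cfg, else_cfg)
# then_cfg may be the string 'KEEP', meaning: reuse the condition's own output.

def evaluate(cfg, token):
    kind = cfg[0]
    if kind == 'dict':
        return cfg[1](token)
    if kind == 'union':
        left = evaluate(cfg[1], token) or []
        right = evaluate(cfg[2], token) or []
        return sorted(set(left) | set(right)) or None
    if kind == 'case':
        _, cond, then_cfg, else_cfg = cfg
        result = evaluate(cond, token)
        if result is not None:                      # WHEN MATCH
            return result if then_cfg == 'KEEP' else evaluate(then_cfg, token)
        return evaluate(else_cfg, token) if else_cfg else None
    raise ValueError(kind)

# Hypothetical dictionaries for illustration only.
def english_hunspell(token):
    lexicon = {'books': 'book', 'looking': 'look'}
    return [lexicon[token]] if token in lexicon else None

def english_stem(token):
    return [token.rstrip('s')]      # crude toy stemmer

def simple(token):
    return [token.lower()]

# CASE english_hunspell WHEN MATCH THEN KEEP ELSE english_stem END UNION simple
exact_and_morph = ('union',
                   ('case', ('dict', english_hunspell), 'KEEP',
                    ('dict', english_stem)),
                   ('dict', simple))

print(evaluate(exact_and_morph, 'books'))   # ['book', 'books']
print(evaluate(exact_and_morph, 'cats'))    # ['cat', 'cats']
```

The KEEP keyword falls out naturally here: the condition config is evaluated once, and its result is either reused ('KEEP') or discarded in favor of a separate THEN config, which is the allow-different-dictionaries question debated later in the thread.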
I'm mostly happy with the mentioned modifications, but I have a few questions to clarify some points. I will send a new patch in a week or two.

On Thu, 26 Oct 2017 20:01:14 +0200
Emre Hasegeli <emre@hasegeli.com> wrote:
> To put it formally:
>
> ALTER TEXT SEARCH CONFIGURATION name
>     ADD MAPPING FOR token_type [, ... ] WITH config
>
> where config is one of:
>
>     dictionary_name
>     config { UNION | INTERSECT | EXCEPT } config
>     CASE config WHEN [ NO ] MATCH THEN [ KEEP ELSE ] config END

According to the formal definition, the following configurations are valid:

CASE english_hunspell WHEN MATCH THEN KEEP ELSE simple END
CASE english_noun WHEN MATCH THEN english_hunspell END

But the configuration:

CASE english_noun WHEN MATCH THEN english_hunspell ELSE simple END

is not (as I understand it, ELSE can be used only after KEEP).

I think we should decide whether to allow or disallow the usage of different dictionaries for match checking (between CASE and WHEN) and for the result (after THEN). If the answer is 'allow', maybe we should allow the third example too, for consistency in configurations.

> > 3) Using different dictionaries for recognizing and output
> > generation. As I mentioned before, in new syntax condition and
> > command are separate and we can use it for some more complex text
> > processing. Here an example for processing only nouns:
> >
> > ALTER TEXT SEARCH CONFIGURATION nouns_only
> >     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
> >         word, hword, hword_part WITH CASE
> >             WHEN english_noun THEN english_hunspell
> >     END
>
> This would also still work with the simpler syntax because
> "english_noun", still being a dictionary, would pass the tokens to the
> next one.

Based on the formal definition, it is possible to describe this example in the following manner:

CASE english_noun WHEN MATCH THEN english_hunspell END

The question is the same as in the previous example.

> Instead of supporting old way of putting stopwords on dictionaries, we
> can make them dictionaries on their own.
> This would then become
> something like:
>
> CASE polish_stopword
>     WHEN NO MATCH THEN polish_ispell
> END

Currently, stopwords increment the position. For example:

SELECT to_tsvector('english','a test message');
---------------------
 'messag':3 'test':2

The stopword 'a' has position 1, but it is not in the vector. If we want to preserve this behavior, we should somehow pass the stopword to the tsvector composition function (parsetext in ts_parse.c) for the counter increment, or increment it in another way. Currently, an empty lexemes array is passed as the result of LexizeExec.

One possible way to do so is something like:

CASE polish_stopword
    WHEN MATCH THEN KEEP    -- stopword counting
    ELSE polish_ispell
END

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
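[Editor's note] The position-counting issue above can be illustrated with a toy model of the tsvector builder (hypothetical Python, not the real parsetext() C code): a stopword contributes no lexemes, yet it must still advance the position counter for the 'messag':3 'test':2 result to come out right.

```python
# Toy tsvector builder: stopwords contribute no lexemes but still
# occupy a position, matching the behavior described in the thread.
STOPWORDS = {'a', 'an', 'the', 'or', 'to', 'be'}

def lexize(token):
    """Toy 'english' dictionary: [] for stopwords, else a crude stem."""
    if token in STOPWORDS:
        return []                      # stopword: empty lexeme array
    return [token.rstrip('e')]         # toy normalization: 'message' -> 'messag'

def to_tsvector(text):
    vector = {}
    for pos, token in enumerate(text.lower().split(), start=1):
        for lexeme in lexize(token):   # empty for stopwords, so nothing is
            vector.setdefault(lexeme, []).append(pos)   # stored at that pos
    return vector

print(to_tsvector('a test message'))   # {'test': [2], 'messag': [3]}
```

If a CASE branch discards a stopword entirely before this step, the counter never advances, which is exactly the problem with modeling stopwords as ordinary non-matching dictionaries.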
> I'm mostly happy with mentioned modifications, but I have few questions
> to clarify some points. I will send new patch in week or two.

I am glad you liked it. Though, I think we should get approval from more senior community members or committers about the syntax before we put more effort into the code.

> But configuration:
>
> CASE english_noun WHEN MATCH THEN english_hunspell ELSE simple END
>
> is not (as I understand ELSE can be used only with KEEP).
>
> I think we should decide to allow or disallow usage of different
> dictionaries for match checking (between CASE and WHEN) and a result
> (after THEN). If answer is 'allow', maybe we should allow the
> third example too for consistency in configurations.

I think you are right. We had better allow this too. Then the CASE syntax becomes:

CASE config WHEN [ NO ] MATCH THEN { KEEP | config } [ ELSE config ] END

> Based on formal definition it is possible to describe this example in
> following manner:
> CASE english_noun WHEN MATCH THEN english_hunspell END
>
> The question is same as in the previous example.

I couldn't understand the question.

> Currently, stopwords increment position, for example:
> SELECT to_tsvector('english','a test message');
> ---------------------
> 'messag':3 'test':2
>
> A stopword 'a' has a position 1 but it is not in the vector.

Does this problem only apply to stopwords and the mechanism we are inventing? Shouldn't we preserve the positions through the whole pipeline?

> If we want to save this behavior, we should somehow pass a stopword to
> tsvector composition function (parsetext in ts_parse.c) for counter
> increment or increment it in another way. Currently, an empty lexemes
> array is passed as a result of LexizeExec.
>
> One of possible way to do so is something like:
> CASE polish_stopword
>     WHEN MATCH THEN KEEP -- stopword counting
>     ELSE polish_ispell
> END

This would mean keeping the stopwords.
What we want is

CASE polish_stopword    -- stopword counting
    WHEN NO MATCH THEN polish_ispell
END

Do you think it is possible?
On Sat, Oct 21, 2017 at 1:39 AM, Aleksandr Parfenov <a.parfenov@postgrespro.ru> wrote:
> In attachment updated patch with fixes of empty XML tags in
> documentation.

Hi Aleksandr,

I'm not sure if this is expected at this stage, but just in case you aren't aware, with this version of the patch the binary upgrade test in src/bin/pg_dump/t/002_pg_dump.pl fails for me:

# Failed test 'binary_upgrade: dumps ALTER TEXT SEARCH CONFIGURATION
dump_test.alt_ts_conf1 ...'
#   at t/002_pg_dump.pl line 6715.

-- 
Thomas Munro
http://www.enterprisedb.com
On Mon, 6 Nov 2017 18:05:23 +1300
Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Sat, Oct 21, 2017 at 1:39 AM, Aleksandr Parfenov
> <a.parfenov@postgrespro.ru> wrote:
> > In attachment updated patch with fixes of empty XML tags in
> > documentation.
>
> Hi Aleksandr,
>
> I'm not sure if this is expected at this stage, but just in case you
> aren't aware, with this version of the patch the binary upgrade test
> in src/bin/pg_dump/t/002_pg_dump.pl fails for me:
>
> # Failed test 'binary_upgrade: dumps ALTER TEXT SEARCH CONFIGURATION
> dump_test.alt_ts_conf1 ...'
> #   at t/002_pg_dump.pl line 6715.

Hi Thomas,

Thank you for noticing it. I will investigate it during work on the next version of the patch.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On Tue, 31 Oct 2017 09:47:57 +0100
Emre Hasegeli <emre@hasegeli.com> wrote:
> > If we want to save this behavior, we should somehow pass a stopword
> > to tsvector composition function (parsetext in ts_parse.c) for
> > counter increment or increment it in another way. Currently, an
> > empty lexemes array is passed as a result of LexizeExec.
> >
> > One of possible way to do so is something like:
> > CASE polish_stopword
> >     WHEN MATCH THEN KEEP -- stopword counting
> >     ELSE polish_ispell
> > END
>
> This would mean keeping the stopwords. What we want is
>
> CASE polish_stopword -- stopword counting
>     WHEN NO MATCH THEN polish_ispell
> END
>
> Do you think it is possible?

Hi Emre,

I thought about how it can be implemented. The way I see it is to increment the word counter if any checked dictionary matched the word, even without returning a lexeme. The main drawback is that the counter increment is implicit.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On Tue, Nov 7, 2017 at 3:18 PM, Aleksandr Parfenov <a.parfenov@postgrespro.ru> wrote:
> On Mon, 6 Nov 2017 18:05:23 +1300
> Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
>> On Sat, Oct 21, 2017 at 1:39 AM, Aleksandr Parfenov
>> <a.parfenov@postgrespro.ru> wrote:
>> > In attachment updated patch with fixes of empty XML tags in
>> > documentation.
>>
>> Hi Aleksandr,
>>
>> I'm not sure if this is expected at this stage, but just in case you
>> aren't aware, with this version of the patch the binary upgrade test
>> in src/bin/pg_dump/t/002_pg_dump.pl fails for me:
>>
>> # Failed test 'binary_upgrade: dumps ALTER TEXT SEARCH CONFIGURATION
>> dump_test.alt_ts_conf1 ...'
>> #   at t/002_pg_dump.pl line 6715.
>
> Hi Thomas,
>
> Thank you for noticing it. I will investigate it during work on next
> version of patch.

With the next version pending for three weeks, I am marking the patch as returned with feedback for now.
-- 
Michael
Hi,

On Tue, 31 Oct 2017 09:47:57 +0100
Emre Hasegeli <emre@hasegeli.com> wrote:
> I am glad you liked it. Though, I think we should get approval from
> more senior community members or committers about the syntax, before
> we put more effort to the code.

I postponed a new version of the patch in order to wait for more feedback, but I think now is the time to send it, to push the discussion further. I kept all the feedback in mind while reworking the patch, so the FTS syntax and behavior have changed since the previous version. But I'm not sure about one last thing:

> CASE polish_stopword -- stopword counting
>     WHEN NO MATCH THEN polish_ispell
> END
>
> Do you think it is possible?

If we count tokens in such a case, any dropped words will be counted too. For example:

CASE banned_words
    WHEN NO MATCH THEN some_dictionary
END

And I'm not sure about that behavior, due to the implicit use of the token produced by the 'banned_words' dictionary in the example. In the third version of the patch I kept the behavior without implicit use of tokens for counting.

The new version of the patch is attached, along with a little README file describing the changes in each file. Any feedback is welcome.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment
Hello,

On Tue, Dec 19, 2017 at 05:31:09PM +0300, Aleksandr Parfenov wrote:
>
> The new version of the patch is in attachment as well as a
> little README file with a description of changes in each file. Any
> feedback is welcome.
>

I looked at the patch a little bit. The patch applies and the tests pass.

I noticed that there are typos in the documentation. Also, I think it is necessary to keep the information about the previous syntax, since that syntax will be supported anyway. For example, the information about TSL_FILTER was removed. It would also be good to add more examples of the new syntax.

The result of the ts_debug() function was changed. Is it possible to keep the old ts_debug() result? To be specific, the 'dictionaries' column is text now, not an array, as I understood. It would be good to keep the old result for the sake of backward compatibility.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Hi Arthur,

Thank you for the review.

On Thu, 21 Dec 2017 17:46:42 +0300
Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
> I noticed that there are typos in the documentation. And I think it
> is necessary to keep information about previous syntax. The syntax
> will be supported anyway. For example, information about TSL_FILTER
> was removed. And it will be good to add more examples of the new
> syntax.

In the current version of the patch, configurations written in the old syntax are rewritten into the same configuration in the new syntax. Since the new syntax doesn't support TSL_FILTER, it was removed from the documentation. It is possible to store configurations written in the old syntax in a special way and simulate the TSL_FILTER behavior for them, but that would lead to maintaining two different FTS behaviors, depending on the version of the syntax used during configuration. Do you think we should keep both behaviors?

> The result of ts_debug() function was changed. Is it possible to keep
> the old ts_debug() result? To be specific, 'dictionaries' column is
> text now, not array, as I understood. It will be good to keep the
> result for the sake of backward compatibility.

The types of the 'dictionaries' and 'dictionary' columns were changed to text because, after the patch, the configuration may be not a plain array of dictionaries but a complex expression tree. In the 'dictionaries' column the result is a textual representation of the configuration, the same as in the \dF+ description of the configuration.

I decided to rename the newly added column to 'configuration' and keep the 'dictionaries' column with an array of all dictionaries used in the configuration (no matter how). Also, I fixed a bug in the 'command' output of ts_debug in some cases.

Additionally, I added some examples to the documentation regarding multilingual search and the combination of exact and linguistic-aware search, and fixed typos.
Attachment
On Mon, Dec 25, 2017 at 05:15:07PM +0300, Aleksandr Parfenov wrote:
>
> In the current version of the patch, configurations written in old
> syntax are rewritten into the same configuration in the new syntax.
> Since new syntax doesn't support a TSL_FILTER, it was removed from the
> documentation. It is possible to store configurations written in old
> syntax in a special way and simulate a TSL_FILTER behavior for them.
> But it will lead to maintenance of two different behavior of the FTS
> depends on a version of the syntax used during configuration. Do you
> think we should keep both behaviors?

As I understand it, users will need to rewrite their configurations if they use the unaccent dictionary, for example. I don't think that is good. Users will be upset about it if they use only the old configuration and don't need the new one.

From my point of view it is necessary to keep the old configuration syntax.

> Columns' 'dictionaries' and 'dictionary' type were changed to text
> because after the patch the configuration may be not a plain array of
> dictionaries but a complex expression tree. In the column
> 'dictionaries' the result is textual representation of configuration
> and it is the same as a result of \dF+ description of the configuration.

Oh, I understand now.

> I decide to rename newly added column to 'configuration' and keep
> column 'dictionaries' with an array of all dictionaries used in
> configuration (no matter how). Also, I fixed a bug in 'command' output
> of the ts_debug in some cases.

Maybe it would be better to keep the 'dictionary' column name? Is there a reason why it was renamed to 'command'?

> Additionally, I added some examples to documentation regarding
> multilingual search and combination of exact and linguistic-aware
> search and fixed typos.

Great!

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On Tue, 26 Dec 2017 13:51:03 +0300
Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
> On Mon, Dec 25, 2017 at 05:15:07PM +0300, Aleksandr Parfenov wrote:
>
> As I understood users need to rewrite their configurations if they
> use unaccent dictionary, for example. It is not good I think. Users
> will be upset about that if they use only old configuration and they
> don't need new configuration.
>
> From my point of view it is necessary to keep old configuration
> syntax.

I see your point. I will rework the patch to keep backward compatibility and return the TSL_FILTER entry to the documentation.

> > I decide to rename newly added column to 'configuration' and keep
> > column 'dictionaries' with an array of all dictionaries used in
> > configuration (no matter how). Also, I fixed a bug in 'command'
> > output of the ts_debug in some cases.
>
> Maybe it would be better to keep the 'dictionary' column name? Is
> there a reason why it was renamed to 'command'?

I changed the name because the column may contain more than one dictionary, interconnected via operators (e.g. 'english_stem UNION simple'), so the word 'dictionary' no longer fully describes its content. Also, the type of the column was changed from regdictionary to text in order to put operators into the output.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Hello,

I changed the processing of a comma-separated list of dictionaries in order to keep backward compatibility. Also, I fixed the output of the ts_debug function, as well as some other minor bugs. A new version of the patch is attached.
Attachment
Greetings,

According to http://commitfest.cputube.org/ the patch no longer applies. An updated version of the patch is attached.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

This patch seems to be in pretty good shape. There is room for improvement, though.

1. There are no comments for some procedures, their arguments and return values. Ditto regarding structures and their fields.

2. Please fix the year in the copyright messages from 2017 to 2018.

3. Somehow I doubt that the current amount of tests covers most of the functionality. Are you sure that if we run lcov, it will not show that most of the new code is never executed during make installcheck-world?

4. I'm a bit concerned regarding the change of the catalog in the src/include/catalog/indexing.h file. Are you sure it will not break if I migrate from PostgreSQL 10 to PostgreSQL 11?

5. There are typos, e.g. "Last result retued...", "...thesaurus pharse processing...".

I'm going to run a few more tests a bit later. I'll let you know if I find anything.

The new status of this patch is: Waiting on Author
Hi Aleksander,

Thank you for the review!

> 1. There are no comments for some procedures, their arguments and
> return values. Ditto regarding structures and their fields.
>
> 2. Please, fix the year in the copyright messages from 2017 to 2018.

Both issues are fixed.

> 3. Somehow I doubt that current amount of tests covers most of the
> functionality. Are you sure that if we run lcov, it will not show
> that most of the new code is never executed during make
> installcheck-world?

I checked it, and most of the new code is executed (I mostly checked ts_parse.c and ts_configmap.c, because those files contain most of the potentially unchecked code). I have added some tests to check the output of configurations. Also, I have added a TEXT SEARCH CONFIGURATION test to the unaccent contrib module to test the MAP BY operator.

> 4. I'm a bit concerned regarding change of the catalog in the
> src/include/catalog/indexing.h file. Are you sure it will not break
> if I migrate from PostgreSQL 10 to PostgreSQL 11?

I have tested the upgrade process via pg_upgrade from PostgreSQL 10 and PostgreSQL 9.6, and found a bug in the way pg_dump generates the schema dump for old databases. I fixed it, and now pg_upgrade works fine.

The new version of the patch is attached.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment
I have found an issue in the grammar which didn't allow constructing complex expressions (using CASEs as operands) without parentheses. I fixed and simplified the grammar a little bit. The rest of the patch is the same.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment
The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested

Unfortunately this patch doesn't apply anymore: http://commitfest.cputube.org/
On Mon, 05 Mar 2018 12:59:21 +0000
Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
> The following review has been posted through the commitfest
> application:
> make installcheck-world: not tested
> Implements feature: not tested
> Spec compliant: not tested
> Documentation: not tested
>
> Unfortunately this patch doesn't apply anymore:
> http://commitfest.cputube.org/

Thank you for noticing that. A refreshed version of the patch is attached.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

LGTM.

The new status of this patch is: Ready for Committer
On Fri, 30 Mar 2018 14:43:30 +0000
Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
> The following review has been posted through the commitfest
> application:
> make installcheck-world: tested, passed
> Implements feature: tested, passed
> Spec compliant: tested, passed
> Documentation: tested, passed
>
> LGTM.
>
> The new status of this patch is: Ready for Committer

It seems that after d204ef6 (MERGE SQL Command) in master the patch doesn't apply, due to a conflict in the keyword lists (grammar and header). A new version of the patch without conflicts is attached.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment
Some notices:

0) The patch conflicts with the latest changes in gram.y; the conflicts are trivial.

1) jsonb in the catalog. I'm ok with it; any opinions?

2) In pg_ts_config_map.h, "jsonb mapdicts" isn't decorated with #ifdef CATALOG_VARLEN like other varlena columns in the catalog. If that's right, please explain and add a comment.

3) I see changes in pg_catalog, including dropping a column, changing its type, changing an index, changing a function, etc. Did you pay attention to pg_upgrade? I don't see it in the patch.

4) Seems it could work:

ALTER TEXT SEARCH CONFIGURATION russian
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
        word, hword, hword_part
    WITH english_stem union (russian_stem, simple);
         ^^^^^^^^^^^^^^^^^^^^^

as a simpler way instead of

    WITH english_stem union (case russian_stem when match then keep else simple end);

5) The initial approach suggested distinguishing three states of a dictionary result: null (unknown word), stopword, and usual word. Now there are only two; we have lost the possibility to catch stopwords. One way to use stopwords is this: suppose we have two identical FTS configurations, except that one skips stopwords and the other doesn't. The second configuration is used for indexing, and the first one for search by default. But if we can't find anything ('to be or to be' - the phrase contains stopwords only), then we can use the second configuration. For now, we need to keep two variants of each dictionary - with and without stopwords. But if it were possible to distinguish stop and non-stop words in the configuration, then we wouldn't need duplicated dictionaries.

Aleksandr Parfenov wrote:
> On Fri, 30 Mar 2018 14:43:30 +0000
> Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
>
>> The following review has been posted through the commitfest
>> application: make installcheck-world: tested, passed
>> Implements feature: tested, passed
>> Spec compliant: tested, passed
>> Documentation: tested, passed
>>
>> LGTM.
>> >> The new status of this patch is: Ready for Committer > > It seems that after d204ef6 (MERGE SQL Command) in master the patch > doesn't apply due to a conflict in keywords lists (grammar and header). > The new version of the patch without conflicts is attached. > -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
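To make the shorthand in note 4 concrete, the two forms would look like
this side by side. This is only a sketch against the patch's proposed
grammar (which was never committed), assuming the standard built-in
english_stem, russian_stem, and simple dictionaries:

```sql
-- Verbose form: spell out the fallback with an explicit CASE.
ALTER TEXT SEARCH CONFIGURATION russian
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH english_stem UNION (CASE russian_stem
                             WHEN MATCH THEN KEEP
                             ELSE simple
                             END);

-- Proposed shorthand: a comma-separated list inside parentheses means
-- "first dictionary that matches wins", mirroring the semantics of the
-- legacy comma-separated dictionary list.
ALTER TEXT SEARCH CONFIGURATION russian
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH english_stem UNION (russian_stem, simple);
```

In both variants a token is always stemmed by english_stem, and the
result is unioned with whichever Russian-side dictionary accepts it.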
Hi,

On 2018-04-05 17:26:10 +0300, Teodor Sigaev wrote:
> Some notices:
>
> 0) patch conflicts with last changes in gram.y, conflicts are trivial.
>
> 1) jsonb in catalog. I'm ok with it, any opinions?
> [...]

Just to be clear: I object to attempting to merge this into v11. It
introduces a new user interface, arrived late in the development cycle,
and hasn't seen that much review. That's not something that should be
merged two minutes before midnight.

I think it's good to continue reviewing, don't get me wrong.

- Andres
On Thu, 5 Apr 2018 17:26:10 +0300
Teodor Sigaev <teodor@sigaev.ru> wrote:

> 0) patch conflicts with last changes in gram.y, conflicts are trivial.

Yes, due to the MERGE command commits with changes in gram.y there were
some conflicts.

> 2) pg_ts_config_map.h, "jsonb mapdicts" isn't decorated with #ifdef
> CATALOG_VARLEN like other varlena columns in catalog.

Since there is only one varlena column, it is safe to access it
directly. I have added a comment about it.

> 3) I see changes in pg_catalog, including drop column, change its
> type, change index, change function etc. Did you pay attention to
> pg_upgrade? I don't see it in patch.

The full-text search configuration is migrated via FTS commands such as
CREATE TEXT SEARCH CONFIGURATION. pg_upgrade uses pg_dump to create a
dump of this part of the catalog, where dictionary_mapping_to_text is
used to get a textual representation of the FTS configuration. Correct
me if I'm wrong.

> 4) Seems, it could work:
> ALTER TEXT SEARCH CONFIGURATION russian
>     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
>                       word, hword, hword_part
>     WITH english_stem union (russian_stem, simple);

I have added this ability, since it was just a small fix in the
grammar. I also added tests for this kind of configuration. The test is
a bit synthetic, because I used a synonym dictionary as one which
doesn't accept some of the input.

> 5) Initial approach suggested to distinguish three state of
> dictionary result: null (unknown word), stopword and usual word. Now
> only two, we lost possibility to catch stopwords. [...]

With the proposed way to configure, it is possible to create a special
dictionary only for stopword checking and use it at decision-making
time.

For example, we can create a dictionary english_stopword which returns
the word itself in case of a stopword and NULL otherwise. With such a
dictionary we can create the configuration:

ALTER TEXT SEARCH CONFIGURATION test_cfg
    ALTER MAPPING FOR asciiword, word
    WITH CASE english_stopword WHEN NO MATCH THEN english_hunspell END;

In this example, english_hunspell can be implemented without any
stopword processing at all, so stopword handling and the processing of
other words are split into separate dictionaries.

The key point of the patch is to process stopwords the same way as
other words at the level of the PostgreSQL internals, and to give users
an instrument to process them in a special way via configurations.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
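The stopword-dictionary setup described above could be spelled out
roughly as follows. This is a hedged sketch: the CASE mapping uses the
patch's proposed (uncommitted) grammar, and english_stopword and
english_hunspell are hypothetical dictionaries, not shipped objects.
The simple template's real Accept = false mode is used here as one
plausible way to build a dictionary that reacts only to stopwords:

```sql
-- Hypothetical stopword-only dictionary: with Accept = false, the
-- simple template returns NULL for any word not in its stopword list,
-- so only stopwords produce a match.
CREATE TEXT SEARCH DICTIONARY english_stopword (
    TEMPLATE = simple,
    STOPWORDS = english,
    Accept = false
);

-- Route words the stopword dictionary does NOT match to Hunspell,
-- which can then be built without its own stopword list.
ALTER TEXT SEARCH CONFIGURATION test_cfg
    ALTER MAPPING FOR asciiword, word
    WITH CASE english_stopword
         WHEN NO MATCH THEN english_hunspell
         END;
```

The design point is the separation of concerns: the decision "is this a
stopword?" lives in the configuration, not hardcoded inside the
normalizing dictionary.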
Hi,

After the recent commits related to storing the initial catalog data,
and the MERGE command commits with changes in gram.y, the patch no
longer applies. A rebased version is attached.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Hello hackers,

A new version of the patch is attached. There are no changes since the
last version other than rebasing it onto current HEAD.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Hi, Aleksandr!
On Mon, Jul 9, 2018 at 10:26 AM Aleksandr Parfenov <a.parfenov@postgrespro.ru> wrote:
A new version of the patch in the attachment. There are no changes since
the last version except refreshing it to current HEAD.
I took a look at this patch. It applied cleanly, but didn't pass regression tests.
*** /Users/smagen/projects/postgresql/env/master/src/src/test/regress/expected/misc_sanity.out 2018-07-20 13:44:54.000000000 +0300
--- /Users/smagen/projects/postgresql/env/master/src/src/test/regress/results/misc_sanity.out 2018-07-20 13:47:00.000000000 +0300
***************
*** 105,109 ****
pg_index | indpred | pg_node_tree
pg_largeobject | data | bytea
pg_largeobject_metadata | lomacl | aclitem[]
! (11 rows)
--- 105,110 ----
pg_index | indpred | pg_node_tree
pg_largeobject | data | bytea
pg_largeobject_metadata | lomacl | aclitem[]
! pg_ts_config_map | mapdicts | jsonb
! (12 rows)
This seems to be related to the recent patches which add toast tables to the majority of system tables with varlena columns. The regression diff indicates that the mapdicts field of pg_ts_config_map can't be toasted. I think we should add a toast table to pg_ts_config_map.
Also, I see that you add extra grammar rules for ALTER TEXT SEARCH CONFIGURATION to the documentation, which allow specifying a config instead of a dictionary_name.
+ALTER TEXT SEARCH CONFIGURATION <replaceable>name</replaceable>
+ ADD MAPPING FOR <replaceable class="parameter">token_type</replaceable> [, ... ] WITH <replaceable class="parameter">config</replaceable>
ALTER TEXT SEARCH CONFIGURATION <replaceable>name</replaceable>
ADD MAPPING FOR <replaceable class="parameter">token_type</replaceable> [, ... ] WITH <replaceable class="parameter">dictionary_name</replaceable> [, ... ]
+ALTER TEXT SEARCH CONFIGURATION <replaceable>name</replaceable>
+ ALTER MAPPING FOR <replaceable class="parameter">token_type</replaceable> [, ... ] WITH <replaceable class="parameter">config</replaceable>
ALTER TEXT SEARCH CONFIGURATION <replaceable>name</replaceable>
ALTER MAPPING FOR <replaceable class="parameter">token_type</replaceable> [, ... ] WITH <replaceable class="parameter">dictionary_name</replaceable> [, ... ]
ALTER TEXT SEARCH CONFIGURATION <replaceable>name</replaceable>
At the same time, you declare config as follows.
+ <para>
+ Formally <replaceable class="parameter">config</replaceable> is one of:
+ </para>
+ <programlisting>
+ * dictionary_name
+
+ * config { UNION | INTERSECT | EXCEPT | MAP } config
+
+ * CASE config
+ WHEN [ NO ] MATCH THEN { KEEP | config }
+ [ ELSE config ]
+ END
+ </programlisting>
That is, a config itself can be a bare dictionary_name. I think this makes the extra grammar rules for ALTER TEXT SEARCH CONFIGURATION redundant. We could specify those rules to always expect a config, assuming that it can actually be a dictionary name.
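Concretely, under a unified rule the legacy form would just be the degenerate case of a config. A sketch, assuming the patch's proposed grammar:

```sql
-- A bare dictionary name is already a valid config ...
ALTER TEXT SEARCH CONFIGURATION english
    ALTER MAPPING FOR asciiword WITH english_stem;

-- ... and the same grammar production covers the operator form, so no
-- separate dictionary_name-only rule is needed in the documentation.
ALTER TEXT SEARCH CONFIGURATION english
    ALTER MAPPING FOR asciiword WITH english_stem UNION simple;
```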
+ if (fout->remoteVersion >= 110000)
PostgreSQL 11 has already passed feature freeze. Thus, we should aim for PostgreSQL 12.
That's all for now, but I'm going to do more detailed code review.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi Alexander,

Thank you for the feedback! A fixed version is attached.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On Fri, Apr 6, 2018 at 10:52 AM Aleksandr Parfenov
<a.parfenov@postgrespro.ru> wrote:
> With the proposed way to configure it is possible to create a special
> dictionary only for stopword checking and use it at decision-making
> time. [...]
>
> The key point of the patch is to process stopwords the same way as
> others at the level of the PostgreSQL internals and give users an
> instrument to process them in a special way via configurations.

If we're going to do it that way, by providing separate dictionaries
for stop words, then I think we should also do it for the built-in
dictionaries and configurations. So, I think this patch should also
split the built-in dictionaries into stemmers and stop word
dictionaries, and provide corresponding configurations over them. We
would also need some benchmarking to show that the new way of defining
configurations is no worse than the previous one in terms of
performance.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> On Fri, Apr 6, 2018 at 10:52 AM Aleksandr Parfenov
> <a.parfenov@postgrespro.ru> wrote:
>> The key point of the patch is to process stopwords the same way as
>> others at the level of the PostgreSQL internals and give users an
>> instrument to process them in a special way via configurations.

> If we're going to do it that way by providing separate dictionaries
> for stop words, then I think we should also make it for builtin
> dictionaries and configurations. So, I think this patch should also
> split builtin dictionaries into stemmers and stop word dictionaries,
> and provide corresponding configuration over them. It would be also
> needed to perform some benchmarking to show that new way of defining
> configurations is not worse than previous way in the performance.

I'm hesitant about the backwards-compatibility aspects of this.
Yes, we could set up the standard text search configurations to still
work the same as before, but how will you do it without breaking
existing custom configurations that use those dictionaries?

			regards, tom lane
On Fri, Aug 24, 2018 at 1:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm hesitant about the backwards-compatibility aspects of this.
> Yes, we could set up the standard text search configurations to still
> work the same as before, but how will you do it without breaking
> existing custom configurations that use those dictionaries?

Agreed, backward compatibility is important here. Probably we should
leave the old dictionaries for that. But I just meant that if we
introduce a new (better) way of stop word handling and encourage users
to use it, then it would look strange if the default configurations
work the old way...

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, 24 Aug 2018 18:50:38 +0300
Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
> Agreed, backward compatibility is important here. Probably we should
> leave old dictionaries for that. But I just meant that if we
> introduce new (better) way of stop words handling and encourage users
> to use it, then it would look strange if default configurations work
> the old way...

I agree with Alexander. The only drawback I see is that after the
addition of the new dictionaries there will be three dictionaries for
each language: the old one, a stop-word filter for the language, and a
stemmer dictionary.

Also, the new approach will resolve an ambiguity around the 'simple'
dictionary. Currently, it filters stop words for the language that was
selected during database initialization.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On Tue, 28 Aug 2018 12:40:32 +0700
Aleksandr Parfenov <a.parfenov@postgrespro.ru> wrote:
> I agree with Alexander. The only drawback I see is that after addition
> of new dictionaries, there will be 3 dictionaries for each language:
> old one, stop-word filter for the language, and stemmer dictionary.

During work on the new version of the patch, I found an issue in the
proposed syntax. At the beginning of the conversation, there was a
suggestion to split stop word filtering and word normalization. At this
stage of development, we can use a different dictionary for stop word
detection, but if we drop the word, the word counter isn't increased
and the stop word is processed like an unknown word.

Currently, I see two solutions:

1) Keep the old way of stop word filtering. The drawback of this
approach is the mixing of word normalization and stop word detection
logic inside a single dictionary. It can be mitigated by using the
'simple' dictionary in accept=false mode as a stop word filter.

2) Add an action STOPWORD alongside KEEP and DROP (the latter is not
implemented in the previous patch, but I think it is good to have both
of them), meaning "increase the word counter but don't add a lexeme to
the vector".

Any suggestions on the issue?

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
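Option 2 might look roughly like this under the proposed grammar. Note
that this is purely hypothetical syntax: the STOPWORD action is only
being proposed in the message above, and the english_stopword and
english_hunspell dictionaries are illustrative assumptions:

```sql
-- Hypothetical: STOPWORD would advance the word counter (preserving
-- phrase-distance semantics such as 'a <2> b') without emitting a
-- lexeme into the resulting tsvector.
ALTER TEXT SEARCH CONFIGURATION test_cfg
    ALTER MAPPING FOR asciiword, word
    WITH CASE english_stopword
         WHEN MATCH THEN STOPWORD
         ELSE english_hunspell
         END;
```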
Hello hackers!

As I wrote a few weeks ago, there is an issue with stopword processing
in the proposed syntax for full-text configurations. I want to separate
word normalization and stopword detection into two separate
dictionaries. The problem is how to configure the stopword detection
dictionary.

The cause of the problem is that stopwords are counted, but no lexemes
are produced for them. However, do we have to count stopwords at all,
or can we ignore them like unknown words? The problem I see is backward
compatibility, since we would have to regenerate all queries and
vectors. But is that a real problem, or can we change the behavior in
this way?

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Aleksandr Parfenov <a.parfenov@postgrespro.ru> writes:
> As I wrote few weeks ago, there is a issue with stopwords processing in
> proposed syntax for full-text configurations. I want to separate word
> normalization and stopwords detection to two separate dictionaries. The
> problem is how to configure stopword detection dictionary.
>
> The cause of the problem is counting stopwords, but not using any
> lexemes for them. However, do we have to count stopwords during words
> counting or can we ignore them like unknown words? The problem I see is
> backward compatibility, since we have to regenerate all queries and
> vectors. But is it real problem or we can change its behavior in this
> way?

I think there should be a pretty high bar for forcing people to
regenerate all that data when they haven't made any change of their own
choice.

Also, I'm not very clear on exactly what you're proposing here, but it
sounds like it'd have the effect of changing whether stopwords count in
phrase distances ('a <N> b'). I think that's right out --- whether or
not you feel the current distance behavior is ideal, asking people to
*both* rebuild all their derived data *and* change their applications
will cause a revolt. It's not sufficiently obviously broken that we can
change it.

			regards, tom lane
> On Wed, Aug 29, 2018 at 10:38 AM Aleksandr Parfenov
> <a.parfenov@postgrespro.ru> wrote:
>
> During work on the new version of the patch, I found an issue in
> proposed syntax. At the beginning of the conversation, there was a
> suggestion to split stop word filtering and words normalization. At
> this stage of development, we can use a different dictionary for stop
> word detection, but if we drop the word, the word counter wouldn't
> increase and the stop word will be processed as an unknown word.

Maybe it would be better if you or one of your colleagues (Alexander,
Arthur?) could post this new version, because the current one has some
conflicts; that would make it easier for reviewers. For now I'll move
it to the next CF.
On Thu, Nov 29, 2018 at 02:02:16PM +0100, Dmitry Dolgov wrote:
> Maybe it would be better if you or some of your colleagues (Alexander,
> Arthur?) will post this new version, because the current one has some
> conflicts - so it would be easier for a reviewers. For now I'll move
> it to the next CF.

No updates for some time now, so marked as returned with feedback.
--
Michael