Re: Flexible configuration for full-text search - Mailing list pgsql-hackers

From Aleksandr Parfenov
Subject Re: Flexible configuration for full-text search
Msg-id 20180406105138.72ed468c@asp437-manjaro
In response to Re: Flexible configuration for full-text search  (Teodor Sigaev <teodor@sigaev.ru>)
List pgsql-hackers
On Thu, 5 Apr 2018 17:26:10 +0300
Teodor Sigaev <teodor@sigaev.ru> wrote:
> Some notices:
> 
> 0) patch conflicts with last changes in gram.y, conflicts are trivial.

Yes, the commits for the MERGE command changed gram.y, which caused
some conflicts.

> 2) pg_ts_config_map.h, "jsonb       mapdicts" isn't decorated with
> #ifdef CATALOG_VARLEN like other varlena columns in catalog. If it's
> right, please explain and add a comment.

Since it is the only varlena column, it is safe to use it directly. I
added a comment explaining this.

> 3) I see changes in pg_catalog, including drop column, change its
> type, change index, change function etc. Did you pay attention to
> pg_upgrade? I don't see it in patch.

The full-text search configuration is migrated via FTS commands such
as CREATE TEXT SEARCH CONFIGURATION. pg_upgrade uses pg_dump to
dump this part of the catalog, where dictionary_mapping_to_text is
used to get a textual representation of the FTS configuration.
Correct me if I'm wrong.
 
> 4) Seems, it could work:
> ALTER TEXT SEARCH CONFIGURATION russian
>    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
>                                            word, hword, hword_part
>          WITH english_stem union (russian_stem, simple);
>                  ^^^^^^^^^^^^^^^^^^^^^ simple way
> instead of WITH english_stem union (case russian_stem when match then
> keep else simple end);

I added this ability, since it required only a small fix in the
grammar. I also added tests for this kind of configuration. The test
is a bit synthetic because I used a synonym dictionary as the one that
doesn't accept some input.
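
To make the shorthand concrete, here is a toy Python model of it (not
PostgreSQL code and not code from the patch). It assumes that a
dictionary is a function returning a list of lexemes, or None when it
does not accept the token, that "(a, b)" means "use the first
dictionary that accepts the token", and that "union" merges both
results. The dictionary contents and the helper names first_match and
union are hypothetical.

```python
from typing import Callable, List, Optional

# A dictionary maps a token to lexemes, or None if it rejects the token.
Dictionary = Callable[[str], Optional[List[str]]]


def synonym_dict(token: str) -> Optional[List[str]]:
    """Toy synonym dictionary: accepts only the words it knows."""
    table = {"pgsql": ["postgres"], "postgresql": ["postgres"]}
    return table.get(token.lower())


def simple_dict(token: str) -> Optional[List[str]]:
    """Toy 'simple' dictionary: accepts everything, lowercased."""
    return [token.lower()]


def toy_stem(token: str) -> Optional[List[str]]:
    """Toy stemmer: strips one trailing 's'; accepts everything."""
    word = token.lower()
    return [word[:-1] if word.endswith("s") and len(word) > 1 else word]


def first_match(*dicts: Dictionary) -> Dictionary:
    """Model of '(a, b)': the first dictionary that accepts the token wins."""
    def run(token: str) -> Optional[List[str]]:
        for d in dicts:
            result = d(token)
            if result is not None:
                return result
        return None
    return run


def union(left: Dictionary, right: Dictionary) -> Dictionary:
    """Model of 'a union b': merge lexemes in order, dropping duplicates."""
    def run(token: str) -> Optional[List[str]]:
        merged: List[str] = []
        for d in (left, right):
            for lexeme in d(token) or []:
                if lexeme not in merged:
                    merged.append(lexeme)
        return merged or None
    return run


# Analogue of: WITH toy_stem union (synonym_dict, simple_dict)
mapping = union(toy_stem, first_match(synonym_dict, simple_dict))

print(mapping("PgSQL"))   # ['pgsql', 'postgres']
print(mapping("Tables"))  # ['table', 'tables']
```

In this model the synonym dictionary rejects "Tables", so the simple
dictionary handles it, which is the role the synonym dictionary plays
in the synthetic test.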

> 4) The initial approach suggested distinguishing three states of a
> dictionary result: null (unknown word), stopword and usual word. Now
> there are only two, so we lost the possibility to catch stopwords.
> One way to use stopwords is this: suppose we have two identical FTS
> configurations, except that one skips stopwords and the other
> doesn't. The second configuration is used for indexing, and the
> first one for search by default. But if we can't find anything
> ('to be or to be' - the phrase contains stopwords only) then we can
> use the second configuration. For now, we need to keep two variants
> of each dictionary - with and without stopwords. But if it's
> possible to distinguish stop and nonstop words in the configuration,
> then we don't need to have duplicated dictionaries.

With the proposed configuration syntax, it is possible to create a
special dictionary used only for stopword checking and to consult it
at decision-making time.

For example, we can create a dictionary english_stopword which
returns the word itself for a stopword and NULL otherwise. With such a
dictionary we create a configuration:

ALTER TEXT SEARCH CONFIGURATION test_cfg ALTER MAPPING FOR asciiword,
                                                           word WITH
    CASE english_stopword WHEN NO MATCH THEN english_hunspell END;

In the described example, english_hunspell can be implemented without
any stopword processing at all, so stopword processing and the
processing of other words are split into separate dictionaries.
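
The decision logic of that mapping can be sketched as a toy Python
model (again, not PostgreSQL code). It assumes that a CASE without an
ELSE branch yields nothing when the condition does not hold, so
stopwords are dropped. The stopword list, toy_hunspell and the helpers
case_when_no_match and to_lexemes are hypothetical stand-ins.

```python
from typing import Callable, List, Optional

# A dictionary maps a token to lexemes, or None if it rejects the token.
Dictionary = Callable[[str], Optional[List[str]]]

STOPWORDS = {"to", "be", "or", "not", "the"}


def english_stopword(token: str) -> Optional[List[str]]:
    """Returns the word itself for a stopword and None otherwise."""
    word = token.lower()
    return [word] if word in STOPWORDS else None


def toy_hunspell(token: str) -> Optional[List[str]]:
    """Stand-in for english_hunspell with no stopword handling at all."""
    return [token.lower()]


def case_when_no_match(condition: Dictionary, then: Dictionary) -> Dictionary:
    """Model of: CASE condition WHEN NO MATCH THEN then END."""
    def run(token: str) -> Optional[List[str]]:
        if condition(token) is None:
            return then(token)
        # Assumption: with no ELSE branch the token produces nothing.
        return None
    return run


def to_lexemes(cfg: Dictionary, text: str) -> List[str]:
    """Run every whitespace-separated token through a mapping."""
    out: List[str] = []
    for token in text.split():
        out.extend(cfg(token) or [])
    return out


# Search configuration: skip stopwords, then use the main dictionary.
search_cfg = case_when_no_match(english_stopword, toy_hunspell)
# Indexing configuration: the same main dictionary, stopwords kept.
index_cfg = toy_hunspell

print(to_lexemes(search_cfg, "to be or not to be"))  # []
print(to_lexemes(search_cfg, "the Question"))        # ['question']
print(to_lexemes(index_cfg, "to be or not to be"))
```

Both configurations share one main dictionary here; only the mapping
differs, so no duplicated dictionary with and without stopwords is
needed.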

The key point of the patch is to process stopwords the same way as
others at the level of the PostgreSQL internals and give users an
instrument to process them in a special way via configurations.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
