[HACKERS] Flexible configuration for full-text search - Mailing list pgsql-hackers
From | Aleksandr Parfenov |
---|---|
Subject | [HACKERS] Flexible configuration for full-text search |
Date | |
Msg-id | 20171019172409.731f52a7@asp437-24-g082ur |
Responses |
Re: [HACKERS] Flexible configuration for full-text search
Re: [HACKERS] Flexible configuration for full-text search |
List | pgsql-hackers |
Hello hackers,

Arthur Zakirov and I are working on a patch that introduces a more
flexible way to configure full-text search in PostgreSQL, because the
current syntax cannot handle a variety of scenarios. In addition, some
parts of the processing logic are implicit, such as filtering
dictionaries with the TSL_FILTER flag, so the configuration has
partially moved into the dictionary itself and in most cases is
hardcoded there. One more drawback of the current FTS configuration is
that dictionary selection and output production cannot be separated, so
we cannot configure FTS to use one dictionary if another one recognized
a token (e.g. use hunspell if a dictionary of nouns recognized the
token).

Basically, the key goal of the patch is to give the user more control
over the processing of the text.

The patch introduces a way to configure FTS based on a
CASE/WHEN/THEN/ELSE construction. The current comma-separated list is
still available for compatibility. The basic form of the new syntax is
as follows:

ALTER TEXT SEARCH CONFIGURATION <fts_conf>
    ALTER MAPPING FOR <token_types> WITH
    CASE
        WHEN <condition> THEN <command>
        ....
        [ ELSE <command> ]
    END;

A condition is a logical expression on dictionaries. You can specify
how to interpret dictionary output with:

    dictionary IS [ NOT ] NULL      - for a NULL result
    dictionary IS [ NOT ] STOPWORD  - for an empty (stopword) result

If no interpretation marker is given, the dictionary is interpreted as:

    dictionary IS NOT NULL AND dictionary IS NOT STOPWORD

A command is an expression on dictionary output sets with the operators
UNION, EXCEPT and INTERSECT. Additionally, there is a special operator
MAP BY, which allows us to reproduce the behavior of filtering
dictionaries. The MAP BY operator takes the output of the right
subexpression and sends it to the left subexpression as an input token
(if there is more than one lexeme, each one is sent separately).

Here are a few examples of the new configuration, compared with
solutions using the current syntax.

1) Multilingual search.
Can be used for FTS on a set of documents in different languages (the
example is for German and English).

ALTER TEXT SEARCH CONFIGURATION multi
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part WITH
    CASE
        WHEN english_hunspell AND german_hunspell THEN
            english_hunspell UNION german_hunspell
        WHEN english_hunspell THEN english_hunspell
        WHEN german_hunspell THEN german_hunspell
        ELSE german_stem UNION english_stem
    END;

With the old configuration we have to use a separate vector and index
for each required language, and the query has to combine the search
results for each language:

SELECT * FROM en_de_documents
WHERE to_tsvector('english', text) @@ to_tsquery('english', 'query')
   OR to_tsvector('german', text) @@ to_tsquery('german', 'query');

The new multilingual search configuration itself looks more complex,
but it lets us avoid splitting the index and vectors. Additionally, for
similar languages or for configurations with simple or *_stem
dictionaries in the list, we can reduce the total size of the index,
since in the current-state example the index for the English
configuration also keeps data about documents written in German and
vice versa.

2) Combination of exact search with morphological search.
This patch does not fully solve the problem, but it is a step toward a
solution. Currently, we have to split the exact and morphological parts
of the query manually and use a separate index for each part. With the
new way to configure FTS we can use the following configuration:

ALTER TEXT SEARCH CONFIGURATION exact_and_morph
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part WITH
    CASE
        WHEN english_hunspell THEN english_hunspell UNION simple
        ELSE english_stem UNION simple
    END;

Some queries, like "'looking' <1> through" where 'looking' is a search
for the exact form of the word, don't work in current-state FTS, since
we can guarantee that the document contains both 'looking' and
'through' but cannot be sure about the distance between them.

Unfortunately, we can't fully support such queries with the current
tsvector format, because after processing we can't distinguish whether
a word appeared in the text in its normal form or was produced by some
dictionary. This leads to false positive hits if the user searches for
the normal form of the word. I think we should give the user the
ability to mark a dictionary as something like an "exact form
producer". But without a tsvector modification this mark is useless,
since we can't mark the output of this dictionary in the tsvector.

There is a patch in the commitfest that removes the 1MB limit on
tsvector [1]. There are a few free bits available in each lexeme in the
vector, so one of the bits may be used for an "exact" flag.

3) Using different dictionaries for recognition and output generation.
As I mentioned before, in the new syntax the condition and the command
are separate, and we can use that for more complex text processing.
Here is an example that processes only nouns:

ALTER TEXT SEARCH CONFIGURATION nouns_only
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part WITH
    CASE
        WHEN english_noun THEN english_hunspell
    END;

This behavior can't be achieved with the current state of FTS.

4) Special stopword processing allows us to discard stopwords even if
the main dictionary doesn't support such a feature (in the example, the
pl_ispell dictionary keeps stopwords in the text):

ALTER TEXT SEARCH CONFIGURATION pl_without_stops
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part WITH
    CASE
        WHEN simple_pl IS NOT STOPWORD THEN pl_ispell
    END;

The patch is attached. I will be glad to hear hackers' opinions about
it.

Several of these cases have been discussed on hackers earlier:

Check for stopwords using a non-target dictionary.
https://www.postgresql.org/message-id/4733B65A.9030707@students.mimuw.edu.pl

Support for a union of outputs of several dictionaries.
https://www.postgresql.org/message-id/c6851b7e-da25-3d8e-a5df-022c395a11b4%40postgrespro.ru

Support for a chain of dictionaries using the MAP BY operator.
https://www.postgresql.org/message-id/46D57E6F.8020009%40enterprisedb.com

[1] Remove 1MB size limit in tsvector
https://commitfest.postgresql.org/15/1221/

--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
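For readers who want to experiment with the semantics described in this message, here is a minimal Python sketch of how a CASE/WHEN/THEN/ELSE mapping might be evaluated for a single token. It is not taken from the patch. The dictionaries (english_hunspell, english_stem, simple, simple_pl, pl_ispell) are tiny hypothetical stand-ins, only the default condition, IS NOT STOPWORD and UNION are modeled, and two points are assumptions of the sketch rather than statements from the message: a NULL result does not count as a stopword, and a token that matches no WHEN branch and has no ELSE produces no lexemes.

```python
from typing import Callable, List, Optional, Set, Tuple

# A dictionary maps a token to None (not recognized, the NULL result),
# [] (the token is a stopword), or a list of lexemes.
Dictionary = Callable[[str], Optional[List[str]]]
Condition = Callable[[str], bool]
Command = Callable[[str], Set[str]]


def recognized(d: Dictionary) -> Condition:
    """Bare dictionary in WHEN: d IS NOT NULL AND d IS NOT STOPWORD."""
    return lambda token: bool(d(token))


def is_not_stopword(d: Dictionary) -> Condition:
    """d IS NOT STOPWORD. Assumption: a NULL result is not a stopword."""
    return lambda token: d(token) != []


def union(*dicts: Dictionary) -> Command:
    """d1 UNION d2 ...: merge lexeme sets, treating NULL as an empty set."""
    def command(token: str) -> Set[str]:
        lexemes: Set[str] = set()
        for d in dicts:
            lexemes.update(d(token) or [])
        return lexemes
    return command


def evaluate(cases: List[Tuple[Condition, Command]],
             else_command: Optional[Command],
             token: str) -> Set[str]:
    """Run the first WHEN branch whose condition holds, else the ELSE branch."""
    for condition, command in cases:
        if condition(token):
            return command(token)
    # Assumption: without ELSE, an unmatched token yields no lexemes.
    return else_command(token) if else_command else set()


# Hypothetical stand-in dictionaries, for illustration only.
def english_hunspell(token: str) -> Optional[List[str]]:
    return {"looking": ["look"]}.get(token)

def english_stem(token: str) -> Optional[List[str]]:
    return [token[:-3]] if token.endswith("ing") else [token]

def simple(token: str) -> Optional[List[str]]:
    return [token.lower()]

def simple_pl(token: str) -> Optional[List[str]]:
    return [] if token in {"i", "w"} else [token]

def pl_ispell(token: str) -> Optional[List[str]]:
    return {"domy": ["dom"], "i": ["i"]}.get(token)


# Example 2: WHEN english_hunspell THEN english_hunspell UNION simple
#            ELSE english_stem UNION simple
exact_and_morph = ([(recognized(english_hunspell),
                     union(english_hunspell, simple))],
                   union(english_stem, simple))

# Example 4: WHEN simple_pl IS NOT STOPWORD THEN pl_ispell
pl_without_stops = ([(is_not_stopword(simple_pl), union(pl_ispell))], None)

print(evaluate(*exact_and_morph, "looking"))
print(evaluate(*pl_without_stops, "i"))
```

In this model the exact_and_morph configuration stores both the exact form and the normalized form of 'looking', which is why the resulting vector alone cannot tell which of the two actually appeared in the text.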