Thread: [HACKERS] Flexible configuration for full-text search
Hello hackers,

Arthur Zakirov and I are working on a patch to introduce a more flexible way to configure full-text search in PostgreSQL, because the current syntax doesn't allow a variety of scenarios to be handled. Additionally, some parts contain implicit processing logic, such as filtering dictionaries with the TSL_FILTER flag, so part of the configuration has moved into the dictionary itself and is in most cases hardcoded there. One more drawback of the current FTS configuration is that we can't separate dictionary selection from output production, so we can't configure FTS to use one dictionary only if another one recognized a token (e.g. use hunspell only if a dictionary of nouns recognized the token).

Basically, the key goal of the patch is to give the user more control over text processing. The patch introduces a way to configure FTS based on a CASE/WHEN/THEN/ELSE construction. The current comma-separated list remains available for compatibility. The basic form of the new syntax is:

ALTER TEXT SEARCH CONFIGURATION <fts_conf>
ALTER MAPPING FOR <token_types> WITH
    CASE
        WHEN <condition> THEN <command>
        ....
        [ ELSE <command> ]
    END;

A condition is a logical expression on dictionaries. You can specify how to interpret dictionary output with:

dictionary IS [ NOT ] NULL     -- for a NULL result
dictionary IS [ NOT ] STOPWORD -- for an empty (stopword) result

If no interpretation marker is given, it is interpreted as:

dictionary IS NOT NULL AND dictionary IS NOT STOPWORD

A command is an expression on dictionary output sets with the operators UNION, EXCEPT and INTERSECT. Additionally, there is a special operator MAP BY which allows us to recreate the behavior of filtering dictionaries. The MAP BY operator takes the output of the right subexpression and sends it to the left subexpression as an input token (if there is more than one lexeme, each one is sent separately).

Here are a few examples of the new configuration, compared with solutions using the current syntax.

1) Multilingual search.
Can be used for FTS on a set of documents in different languages (the example is for German and English).

ALTER TEXT SEARCH CONFIGURATION multi
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
        word, hword, hword_part WITH CASE
            WHEN english_hunspell AND german_hunspell THEN
                english_hunspell UNION german_hunspell
            WHEN english_hunspell THEN english_hunspell
            WHEN german_hunspell THEN german_hunspell
            ELSE german_stem UNION english_stem
    END;

With the old configuration we would have to use a separate vector and index for each required language, and the query would have to combine the search results for each language:

SELECT * FROM en_de_documents
WHERE to_tsvector('english', text) @@ to_tsquery('english', 'query')
   OR to_tsvector('german', text) @@ to_tsquery('german', 'query');

The new multilingual search configuration itself looks more complex, but it avoids splitting the index and vectors. Additionally, for similar languages, or for configurations with simple or *_stem dictionaries in the list, we can reduce the total size of the index, since in the current-state example the index for the English configuration also keeps data about documents written in German, and vice versa.

2) Combination of exact search with morphological search. This patch doesn't fully solve the problem, but it is a step toward a solution. Currently, we have to split exact and morphological search in the query manually and use a separate index for each part. With the new way to configure FTS we can use the following configuration:

ALTER TEXT SEARCH CONFIGURATION exact_and_morph
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
        word, hword, hword_part WITH CASE
            WHEN english_hunspell THEN english_hunspell UNION simple
            ELSE english_stem UNION simple
    END;

Some queries, such as "'looking' <1> through" where 'looking' is a search for the exact form of the word, don't work in current-state FTS, since we can only guarantee that the document contains both 'looking' and 'through', but can't be sure about the distance between them.
Unfortunately, we can't fully support such queries with the current format of tsvector, because after processing we can't distinguish whether a word appeared in its normal form in the text or was produced by some dictionary. This leads to false positive hits if a user searches for the normal form of the word. I think we should give the user the ability to mark a dictionary as something like an "exact form producer". But without a tsvector modification this mark is useless, since we can't mark the output of such a dictionary in the tsvector. There is a patch on the commitfest which removes the 1MB limit on tsvector [1]. There are a few free bits available in each lexeme in the vector, so one of them could be used as an "exact" flag.

3) Using different dictionaries for recognition and output generation. As I mentioned before, in the new syntax the condition and the command are separate, and we can use that for more complex text processing. Here is an example that processes only nouns:

ALTER TEXT SEARCH CONFIGURATION nouns_only
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
        word, hword, hword_part WITH CASE
            WHEN english_noun THEN english_hunspell
    END;

This behavior cannot be achieved with the current state of FTS.

4) Special stopword processing allows us to discard stopwords even if the main dictionary doesn't support that feature (for example, the pl_ispell dictionary keeps stopwords in the text):

ALTER TEXT SEARCH CONFIGURATION pl_without_stops
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
        word, hword, hword_part WITH CASE
            WHEN simple_pl IS NOT STOPWORD THEN pl_ispell
    END;

The patch is attached. I will be glad to hear the hackers' opinion on it.

Several related cases were discussed on -hackers earlier:

Check for stopwords using a non-target dictionary.
https://www.postgresql.org/message-id/4733B65A.9030707@students.mimuw.edu.pl

Support union of outputs of several dictionaries.
https://www.postgresql.org/message-id/c6851b7e-da25-3d8e-a5df-022c395a11b4%40postgrespro.ru

Support of a chain of dictionaries using the MAP BY operator.
https://www.postgresql.org/message-id/46D57E6F.8020009%40enterprisedb.com

[1] Remove 1MB size limit in tsvector
https://commitfest.postgresql.org/15/1221/

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
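[Editor's note] The CASE/WHEN semantics proposed above can be sketched with a toy Python model (entirely hypothetical; the patch itself is C code inside the text-search machinery, and the dictionary behaviors below are stand-ins). Using example 4's pl_without_stops mapping: a dictionary maps a token to None (IS NULL), an empty list (IS STOPWORD), or a list of lexemes.

```python
# Toy three-state dictionary results, mirroring the proposal:
#   None -> IS NULL (unrecognized), [] -> IS STOPWORD, [lexemes] -> normal output.

def simple_pl(token):
    # Hypothetical stopword-aware dictionary: recognizes everything,
    # returning an empty list for stopwords.
    stopwords = {'i', 'w', 'z'}
    return [] if token.lower() in stopwords else [token.lower()]

def pl_ispell(token):
    # Hypothetical morphological dictionary that keeps stopwords as-is.
    return [token.lower()]

def recognized(result):
    """Default interpretation: dictionary IS NOT NULL AND IS NOT STOPWORD."""
    return result is not None and result != []

def pl_without_stops(token):
    # CASE WHEN simple_pl IS NOT STOPWORD THEN pl_ispell END
    if simple_pl(token) != []:          # simple_pl IS NOT STOPWORD
        return pl_ispell(token)
    return None                         # no branch matched: token dropped

tokens = ['Kot', 'i', 'pies']
lexemes = [lx for t in tokens for lx in (pl_without_stops(t) or [])]
print(lexemes)   # ['kot', 'pies'] -- the stopword 'i' is discarded
```

The point of the model is the three-way distinction on dictionary output: the condition between WHEN and THEN inspects it, while the command after THEN independently decides what to emit.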
Attachment
Attached is an updated patch with fixes for empty XML tags in the documentation.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment
> The patch introduces way to configure FTS based on CASE/WHEN/THEN/ELSE
> construction.

Interesting feature. I needed this flexibility before, when I was implementing text search for a Turkish private listing application. Aleksandr and Arthur were kind enough to discuss it with me off-list today.

> 1) Multilingual search. Can be used for FTS on a set of documents in
> different languages (example for German and English languages).
>
> ALTER TEXT SEARCH CONFIGURATION multi
>     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
>         word, hword, hword_part WITH CASE
>             WHEN english_hunspell AND german_hunspell THEN
>                 english_hunspell UNION german_hunspell
>             WHEN english_hunspell THEN english_hunspell
>             WHEN german_hunspell THEN german_hunspell
>             ELSE german_stem UNION english_stem
>     END;

I understand the need to support branching, but this syntax is overly complicated. I don't think there is any need to support a different set of dictionaries as condition and action. Something like this might work better:

ALTER TEXT SEARCH CONFIGURATION multi
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
        word, hword, hword_part WITH
        CASE english_hunspell UNION german_hunspell
            WHEN MATCH THEN KEEP
            ELSE german_stem UNION english_stem
        END;

To put it formally:

ALTER TEXT SEARCH CONFIGURATION name
    ADD MAPPING FOR token_type [, ... ] WITH config

where config is one of:

    dictionary_name
    config { UNION | INTERSECT | EXCEPT } config
    CASE config WHEN [ NO ] MATCH THEN [ KEEP ELSE ] config END

> 2) Combination of exact search with morphological one. This patch not
> fully solve the problem but it is a step toward solution. Currently, we
> should split exact and morphological search in query manually and use
> separate index for each part.
> With new way to configure FTS we can use
> following configuration:
>
> ALTER TEXT SEARCH CONFIGURATION exact_and_morph
>     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
>         word, hword, hword_part WITH CASE
>             WHEN english_hunspell THEN english_hunspell UNION simple
>             ELSE english_stem UNION simple
>     END

This could be:

CASE english_hunspell WHEN MATCH THEN KEEP ELSE english_stem END UNION simple

> 3) Using different dictionaries for recognizing and output generation.
> As I mentioned before, in new syntax condition and command are separate
> and we can use it for some more complex text processing. Here an
> example for processing only nouns:
>
> ALTER TEXT SEARCH CONFIGURATION nouns_only
>     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
>         word, hword, hword_part WITH CASE
>             WHEN english_noun THEN english_hunspell
>     END

This would also still work with the simpler syntax, because "english_noun", still being a dictionary, would pass the tokens to the next one.

> 4) Special stopword processing allows us to discard stopwords even if
> the main dictionary doesn't support such feature (in example pl_ispell
> dictionary keeps stopwords in text):
>
> ALTER TEXT SEARCH CONFIGURATION pl_without_stops
>     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
>         word, hword, hword_part WITH CASE
>             WHEN simple_pl IS NOT STOPWORD THEN pl_ispell
>     END

Instead of supporting the old way of putting stopwords in dictionaries, we can make them dictionaries of their own. This would then become something like:

CASE polish_stopword
    WHEN NO MATCH THEN polish_ispell
END
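[Editor's note] The formal grammar proposed above is small enough to model with a toy recursive evaluator (hypothetical Python, not part of the patch; the dictionaries are made up, and a dictionary returns None for "no match" or a list of lexemes).

```python
# Configs as nested tuples:
#   ('dict', fn)                         -- a dictionary
#   ('union', cfg, cfg)                  -- config UNION config
#   ('case', cond_cfg, then_cfg, else_cfg)
# then_cfg may be the string 'KEEP', meaning: reuse the condition's own output.

def evaluate(cfg, token):
    kind = cfg[0]
    if kind == 'dict':
        return cfg[1](token)
    if kind == 'union':
        left = evaluate(cfg[1], token) or []
        right = evaluate(cfg[2], token) or []
        return sorted(set(left) | set(right)) or None
    if kind == 'case':
        _, cond, then_cfg, else_cfg = cfg
        result = evaluate(cond, token)
        if result is not None:                      # WHEN MATCH
            return result if then_cfg == 'KEEP' else evaluate(then_cfg, token)
        return evaluate(else_cfg, token) if else_cfg else None
    raise ValueError(kind)

# Hypothetical dictionaries for illustration only.
def english_hunspell(token):
    lexicon = {'books': 'book', 'looking': 'look'}
    return [lexicon[token]] if token in lexicon else None

def english_stem(token):
    return [token.rstrip('s')]      # crude toy stemmer

def simple(token):
    return [token.lower()]

# CASE english_hunspell WHEN MATCH THEN KEEP ELSE english_stem END UNION simple
exact_and_morph = ('union',
                   ('case', ('dict', english_hunspell), 'KEEP',
                    ('dict', english_stem)),
                   ('dict', simple))

print(evaluate(exact_and_morph, 'books'))   # ['book', 'books']
print(evaluate(exact_and_morph, 'cats'))    # ['cat', 'cats']
```

The KEEP keyword falls out naturally here: the condition config is evaluated once, and its result is either reused ('KEEP') or discarded in favor of a separate THEN config, which is the allow-different-dictionaries question debated later in the thread.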
I'm mostly happy with the mentioned modifications, but I have a few questions to clarify some points. I will send a new patch in a week or two.

On Thu, 26 Oct 2017 20:01:14 +0200
Emre Hasegeli <emre@hasegeli.com> wrote:
> To put it formally:
>
> ALTER TEXT SEARCH CONFIGURATION name
>     ADD MAPPING FOR token_type [, ... ] WITH config
>
> where config is one of:
>
>     dictionary_name
>     config { UNION | INTERSECT | EXCEPT } config
>     CASE config WHEN [ NO ] MATCH THEN [ KEEP ELSE ] config END

According to the formal definition, the following configurations are valid:

CASE english_hunspell WHEN MATCH THEN KEEP ELSE simple END
CASE english_noun WHEN MATCH THEN english_hunspell END

But the configuration:

CASE english_noun WHEN MATCH THEN english_hunspell ELSE simple END

is not (as I understand it, ELSE can be used only after KEEP).

I think we should decide whether to allow or disallow the usage of different dictionaries for match checking (between CASE and WHEN) and for the result (after THEN). If the answer is 'allow', maybe we should allow the third example too, for consistency in configurations.

> > 3) Using different dictionaries for recognizing and output
> > generation. As I mentioned before, in new syntax condition and
> > command are separate and we can use it for some more complex text
> > processing. Here an example for processing only nouns:
> >
> > ALTER TEXT SEARCH CONFIGURATION nouns_only
> >     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
> >         word, hword, hword_part WITH CASE
> >             WHEN english_noun THEN english_hunspell
> >     END
>
> This would also still work with the simpler syntax because
> "english_noun", still being a dictionary, would pass the tokens to the
> next one.

Based on the formal definition, it is possible to describe this example in the following manner:

CASE english_noun WHEN MATCH THEN english_hunspell END

The question is the same as in the previous example.

> Instead of supporting old way of putting stopwords on dictionaries, we
> can make them dictionaries on their own.
> This would then become
> something like:
>
> CASE polish_stopword
>     WHEN NO MATCH THEN polish_ispell
> END

Currently, stopwords increment the position. For example:

SELECT to_tsvector('english','a test message');
---------------------
 'messag':3 'test':2

The stopword 'a' has position 1, but it is not in the vector. If we want to preserve this behavior, we should somehow pass the stopword to the tsvector composition function (parsetext in ts_parse.c) for the counter increment, or increment it in another way. Currently, an empty lexemes array is passed as the result of LexizeExec.

One possible way to do so is something like:

CASE polish_stopword
    WHEN MATCH THEN KEEP    -- stopword counting
    ELSE polish_ispell
END

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
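[Editor's note] The position-counting issue above can be illustrated with a toy model of the tsvector builder (hypothetical Python, not the real parsetext() C code): a stopword contributes no lexemes, yet it must still advance the position counter for the 'messag':3 'test':2 result to come out right.

```python
# Toy tsvector builder: stopwords contribute no lexemes but still
# occupy a position, matching the behavior described in the thread.
STOPWORDS = {'a', 'an', 'the', 'or', 'to', 'be'}

def lexize(token):
    """Toy 'english' dictionary: [] for stopwords, else a crude stem."""
    if token in STOPWORDS:
        return []                      # stopword: empty lexeme array
    return [token.rstrip('e')]         # toy normalization: 'message' -> 'messag'

def to_tsvector(text):
    vector = {}
    for pos, token in enumerate(text.lower().split(), start=1):
        for lexeme in lexize(token):   # empty for stopwords, so nothing is
            vector.setdefault(lexeme, []).append(pos)   # stored at that pos
    return vector

print(to_tsvector('a test message'))   # {'test': [2], 'messag': [3]}
```

If a CASE branch discards a stopword entirely before this step, the counter never advances, which is exactly the problem with modeling stopwords as ordinary non-matching dictionaries.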
> I'm mostly happy with mentioned modifications, but I have few questions
> to clarify some points. I will send new patch in week or two.

I am glad you liked it. Though, I think we should get approval from more senior community members or committers about the syntax before we put more effort into the code.

> But configuration:
>
> CASE english_noun WHEN MATCH THEN english_hunspell ELSE simple END
>
> is not (as I understand ELSE can be used only with KEEP).
>
> I think we should decide to allow or disallow usage of different
> dictionaries for match checking (between CASE and WHEN) and a result
> (after THEN). If answer is 'allow', maybe we should allow the
> third example too for consistency in configurations.

I think you are right. We had better allow this too. Then the CASE syntax becomes:

CASE config WHEN [ NO ] MATCH THEN { KEEP | config } [ ELSE config ] END

> Based on formal definition it is possible to describe this example in
> following manner:
> CASE english_noun WHEN MATCH THEN english_hunspell END
>
> The question is same as in the previous example.

I couldn't understand the question.

> Currently, stopwords increment position, for example:
> SELECT to_tsvector('english','a test message');
> ---------------------
> 'messag':3 'test':2
>
> A stopword 'a' has a position 1 but it is not in the vector.

Does this problem only apply to stopwords and the mechanism we are inventing? Shouldn't we preserve the positions through the whole pipeline?

> If we want to save this behavior, we should somehow pass a stopword to
> tsvector composition function (parsetext in ts_parse.c) for counter
> increment or increment it in another way. Currently, an empty lexemes
> array is passed as a result of LexizeExec.
>
> One of possible way to do so is something like:
> CASE polish_stopword
>     WHEN MATCH THEN KEEP -- stopword counting
>     ELSE polish_ispell
> END

This would mean keeping the stopwords.
What we want is

CASE polish_stopword    -- stopword counting
    WHEN NO MATCH THEN polish_ispell
END

Do you think it is possible?
On Sat, Oct 21, 2017 at 1:39 AM, Aleksandr Parfenov <a.parfenov@postgrespro.ru> wrote:
> In attachment updated patch with fixes of empty XML tags in
> documentation.

Hi Aleksandr,

I'm not sure if this is expected at this stage, but just in case you aren't aware, with this version of the patch the binary upgrade test in src/bin/pg_dump/t/002_pg_dump.pl fails for me:

# Failed test 'binary_upgrade: dumps ALTER TEXT SEARCH CONFIGURATION
dump_test.alt_ts_conf1 ...'
#   at t/002_pg_dump.pl line 6715.

-- 
Thomas Munro
http://www.enterprisedb.com
On Mon, 6 Nov 2017 18:05:23 +1300
Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Sat, Oct 21, 2017 at 1:39 AM, Aleksandr Parfenov
> <a.parfenov@postgrespro.ru> wrote:
> > In attachment updated patch with fixes of empty XML tags in
> > documentation.
>
> Hi Aleksandr,
>
> I'm not sure if this is expected at this stage, but just in case you
> aren't aware, with this version of the patch the binary upgrade test
> in src/bin/pg_dump/t/002_pg_dump.pl fails for me:
>
> # Failed test 'binary_upgrade: dumps ALTER TEXT SEARCH CONFIGURATION
> dump_test.alt_ts_conf1 ...'
> #   at t/002_pg_dump.pl line 6715.

Hi Thomas,

Thank you for noticing it. I will investigate it during work on the next version of the patch.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On Tue, 31 Oct 2017 09:47:57 +0100
Emre Hasegeli <emre@hasegeli.com> wrote:
> > If we want to save this behavior, we should somehow pass a stopword
> > to tsvector composition function (parsetext in ts_parse.c) for
> > counter increment or increment it in another way. Currently, an
> > empty lexemes array is passed as a result of LexizeExec.
> >
> > One of possible way to do so is something like:
> > CASE polish_stopword
> >     WHEN MATCH THEN KEEP -- stopword counting
> >     ELSE polish_ispell
> > END
>
> This would mean keeping the stopwords. What we want is
>
> CASE polish_stopword -- stopword counting
>     WHEN NO MATCH THEN polish_ispell
> END
>
> Do you think it is possible?

Hi Emre,

I thought about how it can be implemented. The way I see it is to increment the word counter if any checked dictionary matched the word, even without returning a lexeme. The main drawback is that the counter increment is implicit.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On Tue, Nov 7, 2017 at 3:18 PM, Aleksandr Parfenov <a.parfenov@postgrespro.ru> wrote:
> On Mon, 6 Nov 2017 18:05:23 +1300
> Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
>> On Sat, Oct 21, 2017 at 1:39 AM, Aleksandr Parfenov
>> <a.parfenov@postgrespro.ru> wrote:
>> > In attachment updated patch with fixes of empty XML tags in
>> > documentation.
>>
>> Hi Aleksandr,
>>
>> I'm not sure if this is expected at this stage, but just in case you
>> aren't aware, with this version of the patch the binary upgrade test
>> in src/bin/pg_dump/t/002_pg_dump.pl fails for me:
>>
>> # Failed test 'binary_upgrade: dumps ALTER TEXT SEARCH CONFIGURATION
>> dump_test.alt_ts_conf1 ...'
>> #   at t/002_pg_dump.pl line 6715.
>
> Hi Thomas,
>
> Thank you for noticing it. I will investigate it during work on next
> version of patch.

With the next version pending for three weeks, I am marking the patch as returned with feedback for now.
-- 
Michael
Hi,

On Tue, 31 Oct 2017 09:47:57 +0100
Emre Hasegeli <emre@hasegeli.com> wrote:
> I am glad you liked it. Though, I think we should get approval from
> more senior community members or committers about the syntax, before
> we put more effort to the code.

I postponed a new version of the patch in order to wait for more feedback, but I think now is the time to send it, to push the discussion further. I kept all the feedback in mind while reworking the patch, so the FTS syntax and behavior have changed since the previous version. But I'm not sure about one last thing:

> CASE polish_stopword -- stopword counting
>     WHEN NO MATCH THEN polish_ispell
> END
>
> Do you think it is possible?

If we count tokens in such a case, any dropped words will be counted too. For example:

CASE banned_words
    WHEN NO MATCH THEN some_dictionary
END

And I'm not sure about that behavior, due to the implicit use of the token produced by the 'banned_words' dictionary in the example. In the third version of the patch I kept the behavior without implicit use of tokens for counting.

The new version of the patch is attached, along with a little README file describing the changes in each file. Any feedback is welcome.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment
Hello,

On Tue, Dec 19, 2017 at 05:31:09PM +0300, Aleksandr Parfenov wrote:
>
> The new version of the patch is in attachment as well as a
> little README file with a description of changes in each file. Any
> feedback is welcome.
>

I looked at the patch a little bit. The patch applies and the tests pass.

I noticed that there are typos in the documentation. Also, I think it is necessary to keep the information about the previous syntax, since that syntax will be supported anyway. For example, the information about TSL_FILTER was removed. It would also be good to add more examples of the new syntax.

The result of the ts_debug() function was changed. Is it possible to keep the old ts_debug() result? To be specific, the 'dictionaries' column is text now, not an array, as I understood. It would be good to keep the old result for the sake of backward compatibility.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Hi Arthur,

Thank you for the review.

On Thu, 21 Dec 2017 17:46:42 +0300
Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
> I noticed that there are typos in the documentation. And I think it
> is necessary to keep information about previous syntax. The syntax
> will be supported anyway. For example, information about TSL_FILTER
> was removed. And it will be good to add more examples of the new
> syntax.

In the current version of the patch, configurations written in the old syntax are rewritten into the same configuration in the new syntax. Since the new syntax doesn't support TSL_FILTER, it was removed from the documentation. It is possible to store configurations written in the old syntax in a special way and simulate the TSL_FILTER behavior for them, but that would lead to maintaining two different FTS behaviors, depending on the version of the syntax used during configuration. Do you think we should keep both behaviors?

> The result of ts_debug() function was changed. Is it possible to keep
> the old ts_debug() result? To be specific, 'dictionaries' column is
> text now, not array, as I understood. It will be good to keep the
> result for the sake of backward compatibility.

The types of the 'dictionaries' and 'dictionary' columns were changed to text because, after the patch, the configuration may be not a plain array of dictionaries but a complex expression tree. In the 'dictionaries' column the result is a textual representation of the configuration, the same as in the \dF+ description of the configuration.

I decided to rename the newly added column to 'configuration' and keep the 'dictionaries' column with an array of all dictionaries used in the configuration (no matter how). Also, I fixed a bug in the 'command' output of ts_debug in some cases.

Additionally, I added some examples to the documentation regarding multilingual search and the combination of exact and linguistic-aware search, and fixed typos.
Attachment
On Mon, Dec 25, 2017 at 05:15:07PM +0300, Aleksandr Parfenov wrote:
>
> In the current version of the patch, configurations written in old
> syntax are rewritten into the same configuration in the new syntax.
> Since new syntax doesn't support a TSL_FILTER, it was removed from the
> documentation. It is possible to store configurations written in old
> syntax in a special way and simulate a TSL_FILTER behavior for them.
> But it will lead to maintenance of two different behavior of the FTS
> depends on a version of the syntax used during configuration. Do you
> think we should keep both behaviors?

As I understand it, users will need to rewrite their configurations if they use the unaccent dictionary, for example. I don't think that is good. Users will be upset about it if they use only the old configuration and don't need the new one.

From my point of view it is necessary to keep the old configuration syntax.

> Columns' 'dictionaries' and 'dictionary' type were changed to text
> because after the patch the configuration may be not a plain array of
> dictionaries but a complex expression tree. In the column
> 'dictionaries' the result is textual representation of configuration
> and it is the same as a result of \dF+ description of the configuration.

Oh, I understand now.

> I decide to rename newly added column to 'configuration' and keep
> column 'dictionaries' with an array of all dictionaries used in
> configuration (no matter how). Also, I fixed a bug in 'command' output
> of the ts_debug in some cases.

Maybe it would be better to keep the 'dictionary' column name? Is there a reason why it was renamed to 'command'?

> Additionally, I added some examples to documentation regarding
> multilingual search and combination of exact and linguistic-aware
> search and fixed typos.

Great!

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On Tue, 26 Dec 2017 13:51:03 +0300
Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
> On Mon, Dec 25, 2017 at 05:15:07PM +0300, Aleksandr Parfenov wrote:
>
> As I understood users need to rewrite their configurations if they
> use unaccent dictionary, for example. It is not good I think. Users
> will be upset about that if they use only old configuration and they
> don't need new configuration.
>
> From my point of view it is necessary to keep old configuration
> syntax.

I see your point. I will rework the patch to keep backward compatibility and return the TSL_FILTER entry to the documentation.

> > I decide to rename newly added column to 'configuration' and keep
> > column 'dictionaries' with an array of all dictionaries used in
> > configuration (no matter how). Also, I fixed a bug in 'command'
> > output of the ts_debug in some cases.
>
> Maybe it would be better to keep the 'dictionary' column name? Is
> there a reason why it was renamed to 'command'?

I changed the name because the column may contain more than one dictionary, interconnected via operators (e.g. 'english_stem UNION simple'), so the word 'dictionary' no longer fully describes its content. Also, the type of the column was changed from regdictionary to text in order to put operators into the output.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Hello,

I changed the processing of a comma-separated list of dictionaries in order to keep backward compatibility. Also, I fixed the output of the ts_debug function, as well as some other minor bugs. A new version of the patch is attached.
Attachment
Greetings,

According to http://commitfest.cputube.org/ the patch no longer applies. An updated version of the patch is attached.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

This patch seems to be in pretty good shape. There is room for improvement, though.

1. There are no comments for some procedures, their arguments and return values. Ditto regarding structures and their fields.

2. Please fix the year in the copyright messages from 2017 to 2018.

3. Somehow I doubt that the current amount of tests covers most of the functionality. Are you sure that if we run lcov, it will not show that most of the new code is never executed during make installcheck-world?

4. I'm a bit concerned regarding the change of the catalog in the src/include/catalog/indexing.h file. Are you sure it will not break if I migrate from PostgreSQL 10 to PostgreSQL 11?

5. There are typos, e.g. "Last result retued...", "...thesaurus pharse processing...".

I'm going to run a few more tests a bit later. I'll let you know if I find anything.

The new status of this patch is: Waiting on Author
Hi Aleksander,

Thank you for the review!

> 1. There are no comments for some procedures, their arguments and
> return values. Ditto regarding structures and their fields.
>
> 2. Please, fix the year in the copyright messages from 2017 to 2018.

Both issues are fixed.

> 3. Somehow I doubt that current amount of tests covers most of the
> functionality. Are you sure that if we run lcov, it will not show
> that most of the new code is never executed during make
> installcheck-world?

I checked it, and most of the new code is executed (I mostly checked ts_parse.c and ts_configmap.c, because those files contain most of the potentially unchecked code). I have added some tests to check the output of configurations. Also, I have added a TEXT SEARCH CONFIGURATION test to the unaccent contrib module to test the MAP BY operator.

> 4. I'm a bit concerned regarding change of the catalog in the
> src/include/catalog/indexing.h file. Are you sure it will not break
> if I migrate from PostgreSQL 10 to PostgreSQL 11?

I have tested the upgrade process via pg_upgrade from PostgreSQL 10 and PostgreSQL 9.6, and found a bug in the way pg_dump generates the schema dump for old databases. I fixed it, and now pg_upgrade works fine.

The new version of the patch is attached.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment
I have found an issue in the grammar which didn't allow constructing complex expressions (using CASEs as operands) without parentheses. I fixed and simplified the grammar a little bit. The rest of the patch is the same.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment
The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested

Unfortunately this patch doesn't apply anymore: http://commitfest.cputube.org/
On Mon, 05 Mar 2018 12:59:21 +0000
Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
> The following review has been posted through the commitfest
> application:
> make installcheck-world: not tested
> Implements feature: not tested
> Spec compliant: not tested
> Documentation: not tested
>
> Unfortunately this patch doesn't apply anymore:
> http://commitfest.cputube.org/

Thank you for noticing that. A refreshed version of the patch is attached.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

LGTM.

The new status of this patch is: Ready for Committer
On Fri, 30 Mar 2018 14:43:30 +0000
Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
> The following review has been posted through the commitfest
> application:
> make installcheck-world: tested, passed
> Implements feature: tested, passed
> Spec compliant: tested, passed
> Documentation: tested, passed
>
> LGTM.
>
> The new status of this patch is: Ready for Committer

It seems that after d204ef6 (MERGE SQL Command) in master the patch doesn't apply, due to a conflict in the keyword lists (grammar and header). A new version of the patch without conflicts is attached.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment
Some notices:

0) The patch conflicts with the latest changes in gram.y; the conflicts are trivial.

1) jsonb in the catalog. I'm ok with it; any opinions?

2) In pg_ts_config_map.h, "jsonb mapdicts" isn't decorated with #ifdef CATALOG_VARLEN like other varlena columns in the catalog. If that's right, please explain and add a comment.

3) I see changes in pg_catalog, including dropping a column, changing its type, changing an index, changing a function, etc. Did you pay attention to pg_upgrade? I don't see it in the patch.

4) Seems it could work:

ALTER TEXT SEARCH CONFIGURATION russian
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
        word, hword, hword_part
    WITH english_stem union (russian_stem, simple);
         ^^^^^^^^^^^^^^^^^^^^^

as a simpler way instead of

    WITH english_stem union (case russian_stem when match then keep else simple end);

5) The initial approach suggested distinguishing three states of a dictionary result: null (unknown word), stopword, and usual word. Now there are only two; we have lost the possibility to catch stopwords. One way to use stopwords is this: suppose we have two identical FTS configurations, except that one skips stopwords and the other doesn't. The second configuration is used for indexing, and the first one for search by default. But if we can't find anything ('to be or to be' - the phrase contains stopwords only), then we can use the second configuration. For now, we need to keep two variants of each dictionary - with and without stopwords. But if it were possible to distinguish stop and non-stop words in the configuration, then we wouldn't need duplicated dictionaries.

Aleksandr Parfenov wrote:
> On Fri, 30 Mar 2018 14:43:30 +0000
> Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
>
>> The following review has been posted through the commitfest
>> application: make installcheck-world: tested, passed
>> Implements feature: tested, passed
>> Spec compliant: tested, passed
>> Documentation: tested, passed
>>
>> LGTM.
>> >> The new status of this patch is: Ready for Committer > > It seems that after d204ef6 (MERGE SQL Command) in master the patch > doesn't apply due to a conflict in keywords lists (grammar and header). > The new version of the patch without conflicts is attached. > -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
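To make the shorthand in note 4 concrete, the two forms would look like
this side by side. This is only a sketch against the patch's proposed
grammar (which was never committed), assuming the standard built-in
english_stem, russian_stem, and simple dictionaries:

```sql
-- Verbose form: spell out the fallback with an explicit CASE.
ALTER TEXT SEARCH CONFIGURATION russian
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH english_stem UNION (CASE russian_stem
                             WHEN MATCH THEN KEEP
                             ELSE simple
                             END);

-- Proposed shorthand: a comma-separated list inside parentheses means
-- "first dictionary that matches wins", mirroring the semantics of the
-- legacy comma-separated dictionary list.
ALTER TEXT SEARCH CONFIGURATION russian
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH english_stem UNION (russian_stem, simple);
```

In both variants a token is always stemmed by english_stem, and the
result is unioned with whichever Russian-side dictionary accepts it.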
Hi,

On 2018-04-05 17:26:10 +0300, Teodor Sigaev wrote:
> Some notices:
>
> 0) patch conflicts with last changes in gram.y, conflicts are trivial.
>
> 1) jsonb in catalog. I'm ok with it, any opinions?
> [...]

Just to be clear: I object to attempting to merge this into v11. It
introduces a new user interface, arrived late in the development cycle,
and hasn't seen that much review. That's not something that should be
merged two minutes before midnight.

I think it's good to continue reviewing, don't get me wrong.

- Andres
On Thu, 5 Apr 2018 17:26:10 +0300
Teodor Sigaev <teodor@sigaev.ru> wrote:

> 0) patch conflicts with last changes in gram.y, conflicts are trivial.

Yes, due to the MERGE command commits with changes in gram.y there were
some conflicts.

> 2) pg_ts_config_map.h, "jsonb mapdicts" isn't decorated with #ifdef
> CATALOG_VARLEN like other varlena columns in catalog.

Since there is only one varlena column, it is safe to access it
directly. I have added a comment about it.

> 3) I see changes in pg_catalog, including drop column, change its
> type, change index, change function etc. Did you pay attention to
> pg_upgrade? I don't see it in patch.

The full-text search configuration is migrated via FTS commands such as
CREATE TEXT SEARCH CONFIGURATION. pg_upgrade uses pg_dump to create a
dump of this part of the catalog, where dictionary_mapping_to_text is
used to get a textual representation of the FTS configuration. Correct
me if I'm wrong.

> 4) Seems, it could work:
> ALTER TEXT SEARCH CONFIGURATION russian
>     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
>                       word, hword, hword_part
>     WITH english_stem union (russian_stem, simple);

I have added this ability, since it was just a small fix in the
grammar. I also added tests for this kind of configuration. The test is
a bit synthetic, because I used a synonym dictionary as one which
doesn't accept some of the input.

> 5) Initial approach suggested to distinguish three state of
> dictionary result: null (unknown word), stopword and usual word. Now
> only two, we lost possibility to catch stopwords. [...]

With the proposed way to configure, it is possible to create a special
dictionary only for stopword checking and use it at decision-making
time.

For example, we can create a dictionary english_stopword which returns
the word itself in case of a stopword and NULL otherwise. With such a
dictionary we can create the configuration:

ALTER TEXT SEARCH CONFIGURATION test_cfg
    ALTER MAPPING FOR asciiword, word
    WITH CASE english_stopword WHEN NO MATCH THEN english_hunspell END;

In this example, english_hunspell can be implemented without any
stopword processing at all, so stopword handling and the processing of
other words are split into separate dictionaries.

The key point of the patch is to process stopwords the same way as
other words at the level of the PostgreSQL internals, and to give users
an instrument to process them in a special way via configurations.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
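The stopword-dictionary setup described above could be spelled out
roughly as follows. This is a hedged sketch: the CASE mapping uses the
patch's proposed (uncommitted) grammar, and english_stopword and
english_hunspell are hypothetical dictionaries, not shipped objects.
The simple template's real Accept = false mode is used here as one
plausible way to build a dictionary that reacts only to stopwords:

```sql
-- Hypothetical stopword-only dictionary: with Accept = false, the
-- simple template returns NULL for any word not in its stopword list,
-- so only stopwords produce a match.
CREATE TEXT SEARCH DICTIONARY english_stopword (
    TEMPLATE = simple,
    STOPWORDS = english,
    Accept = false
);

-- Route words the stopword dictionary does NOT match to Hunspell,
-- which can then be built without its own stopword list.
ALTER TEXT SEARCH CONFIGURATION test_cfg
    ALTER MAPPING FOR asciiword, word
    WITH CASE english_stopword
         WHEN NO MATCH THEN english_hunspell
         END;
```

The design point is the separation of concerns: the decision "is this a
stopword?" lives in the configuration, not hardcoded inside the
normalizing dictionary.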
Hi,

After the recent commits related to storing the initial catalog data,
and the MERGE command commits with changes in gram.y, the patch no
longer applies. A rebased version is attached.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Hello hackers,

A new version of the patch is attached. There are no changes since the
last version other than rebasing it onto current HEAD.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Hi, Aleksandr!
On Mon, Jul 9, 2018 at 10:26 AM Aleksandr Parfenov <a.parfenov@postgrespro.ru> wrote:
A new version of the patch in the attachment. There are no changes since
the last version except refreshing it to current HEAD.
I took a look at this patch. It applied cleanly, but didn't pass regression tests.
*** /Users/smagen/projects/postgresql/env/master/src/src/test/regress/expected/misc_sanity.out 2018-07-20 13:44:54.000000000 +0300
--- /Users/smagen/projects/postgresql/env/master/src/src/test/regress/results/misc_sanity.out 2018-07-20 13:47:00.000000000 +0300
***************
*** 105,109 ****
pg_index | indpred | pg_node_tree
pg_largeobject | data | bytea
pg_largeobject_metadata | lomacl | aclitem[]
! (11 rows)
--- 105,110 ----
pg_index | indpred | pg_node_tree
pg_largeobject | data | bytea
pg_largeobject_metadata | lomacl | aclitem[]
! pg_ts_config_map | mapdicts | jsonb
! (12 rows)
This seems to be related to the recent patches which add toast tables to the majority of system tables with varlena columns. The regression diff indicates that the mapdicts field of pg_ts_config_map can't be toasted. I think we should add a toast table to pg_ts_config_map.
Also, I see that you add extra grammar rules for ALTER TEXT SEARCH CONFIGURATION to the documentation, which allow specifying a config instead of a dictionary_name.
+ALTER TEXT SEARCH CONFIGURATION <replaceable>name</replaceable>
+ ADD MAPPING FOR <replaceable class="parameter">token_type</replaceable> [, ... ] WITH <replaceable class="parameter">config</replaceable>
ALTER TEXT SEARCH CONFIGURATION <replaceable>name</replaceable>
ADD MAPPING FOR <replaceable class="parameter">token_type</replaceable> [, ... ] WITH <replaceable class="parameter">dictionary_name</replaceable> [, ... ]
+ALTER TEXT SEARCH CONFIGURATION <replaceable>name</replaceable>
+ ALTER MAPPING FOR <replaceable class="parameter">token_type</replaceable> [, ... ] WITH <replaceable class="parameter">config</replaceable>
ALTER TEXT SEARCH CONFIGURATION <replaceable>name</replaceable>
ALTER MAPPING FOR <replaceable class="parameter">token_type</replaceable> [, ... ] WITH <replaceable class="parameter">dictionary_name</replaceable> [, ... ]
ALTER TEXT SEARCH CONFIGURATION <replaceable>name</replaceable>
At the same time, you declare config as follows.
+ <para>
+ Formally <replaceable class="parameter">config</replaceable> is one of:
+ </para>
+ <programlisting>
+ * dictionary_name
+
+ * config { UNION | INTERSECT | EXCEPT | MAP } config
+
+ * CASE config
+ WHEN [ NO ] MATCH THEN { KEEP | config }
+ [ ELSE config ]
+ END
+ </programlisting>
That is, a config itself can be a bare dictionary_name. I think this makes the extra grammar rules for ALTER TEXT SEARCH CONFIGURATION redundant. We could specify those rules to always expect a config, assuming that it can actually be a dictionary name.
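Concretely, under a unified rule the legacy form would just be the degenerate case of a config. A sketch, assuming the patch's proposed grammar:

```sql
-- A bare dictionary name is already a valid config ...
ALTER TEXT SEARCH CONFIGURATION english
    ALTER MAPPING FOR asciiword WITH english_stem;

-- ... and the same grammar production covers the operator form, so no
-- separate dictionary_name-only rule is needed in the documentation.
ALTER TEXT SEARCH CONFIGURATION english
    ALTER MAPPING FOR asciiword WITH english_stem UNION simple;
```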
+ if (fout->remoteVersion >= 110000)
PostgreSQL 11 has already passed feature freeze. Thus, we should aim for PostgreSQL 12.
That's all for now, but I'm going to do more detailed code review.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi Alexander,

Thank you for the feedback! A fixed version is attached.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On Fri, Apr 6, 2018 at 10:52 AM Aleksandr Parfenov
<a.parfenov@postgrespro.ru> wrote:
> With the proposed way to configure it is possible to create a special
> dictionary only for stopword checking and use it at decision-making
> time. [...]
>
> The key point of the patch is to process stopwords the same way as
> others at the level of the PostgreSQL internals and give users an
> instrument to process them in a special way via configurations.

If we're going to do it that way, by providing separate dictionaries
for stop words, then I think we should also do it for the built-in
dictionaries and configurations. So, I think this patch should also
split the built-in dictionaries into stemmers and stop word
dictionaries, and provide corresponding configurations over them. We
would also need some benchmarking to show that the new way of defining
configurations is no worse than the previous one in terms of
performance.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> On Fri, Apr 6, 2018 at 10:52 AM Aleksandr Parfenov
> <a.parfenov@postgrespro.ru> wrote:
>> The key point of the patch is to process stopwords the same way as
>> others at the level of the PostgreSQL internals and give users an
>> instrument to process them in a special way via configurations.

> If we're going to do it that way by providing separate dictionaries
> for stop words, then I think we should also make it for builtin
> dictionaries and configurations. So, I think this patch should also
> split builtin dictionaries into stemmers and stop word dictionaries,
> and provide corresponding configuration over them. It would be also
> needed to perform some benchmarking to show that new way of defining
> configurations is not worse than previous way in the performance.

I'm hesitant about the backwards-compatibility aspects of this.
Yes, we could set up the standard text search configurations to still
work the same as before, but how will you do it without breaking
existing custom configurations that use those dictionaries?

			regards, tom lane
On Fri, Aug 24, 2018 at 1:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm hesitant about the backwards-compatibility aspects of this.
> Yes, we could set up the standard text search configurations to still
> work the same as before, but how will you do it without breaking
> existing custom configurations that use those dictionaries?

Agreed, backward compatibility is important here. Probably we should
leave the old dictionaries for that. But I just meant that if we
introduce a new (better) way of stop word handling and encourage users
to use it, then it would look strange if the default configurations
work the old way...

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, 24 Aug 2018 18:50:38 +0300
Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
> Agreed, backward compatibility is important here. Probably we should
> leave old dictionaries for that. But I just meant that if we
> introduce new (better) way of stop words handling and encourage users
> to use it, then it would look strange if default configurations work
> the old way...

I agree with Alexander. The only drawback I see is that after the
addition of the new dictionaries there will be three dictionaries for
each language: the old one, a stop-word filter for the language, and a
stemmer dictionary.

Also, the new approach will resolve an ambiguity around the 'simple'
dictionary. Currently, it filters stop words for the language that was
selected during database initialization.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On Tue, 28 Aug 2018 12:40:32 +0700
Aleksandr Parfenov <a.parfenov@postgrespro.ru> wrote:
> I agree with Alexander. The only drawback I see is that after addition
> of new dictionaries, there will be 3 dictionaries for each language:
> old one, stop-word filter for the language, and stemmer dictionary.

During work on the new version of the patch, I found an issue in the
proposed syntax. At the beginning of the conversation, there was a
suggestion to split stop word filtering and word normalization. At this
stage of development, we can use a different dictionary for stop word
detection, but if we drop the word, the word counter isn't increased
and the stop word is processed like an unknown word.

Currently, I see two solutions:

1) Keep the old way of stop word filtering. The drawback of this
approach is the mixing of word normalization and stop word detection
logic inside a single dictionary. It can be mitigated by using the
'simple' dictionary in accept=false mode as a stop word filter.

2) Add an action STOPWORD alongside KEEP and DROP (the latter is not
implemented in the previous patch, but I think it is good to have both
of them), meaning "increase the word counter but don't add a lexeme to
the vector".

Any suggestions on the issue?

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
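Option 2 might look roughly like this under the proposed grammar. Note
that this is purely hypothetical syntax: the STOPWORD action is only
being proposed in the message above, and the english_stopword and
english_hunspell dictionaries are illustrative assumptions:

```sql
-- Hypothetical: STOPWORD would advance the word counter (preserving
-- phrase-distance semantics such as 'a <2> b') without emitting a
-- lexeme into the resulting tsvector.
ALTER TEXT SEARCH CONFIGURATION test_cfg
    ALTER MAPPING FOR asciiword, word
    WITH CASE english_stopword
         WHEN MATCH THEN STOPWORD
         ELSE english_hunspell
         END;
```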
Hello hackers!

As I wrote a few weeks ago, there is an issue with stopword processing
in the proposed syntax for full-text configurations. I want to separate
word normalization and stopword detection into two separate
dictionaries. The problem is how to configure the stopword detection
dictionary.

The cause of the problem is that stopwords are counted, but no lexemes
are produced for them. However, do we have to count stopwords at all,
or can we ignore them like unknown words? The problem I see is backward
compatibility, since we would have to regenerate all queries and
vectors. But is that a real problem, or can we change the behavior in
this way?

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Aleksandr Parfenov <a.parfenov@postgrespro.ru> writes:
> As I wrote few weeks ago, there is a issue with stopwords processing in
> proposed syntax for full-text configurations. I want to separate word
> normalization and stopwords detection to two separate dictionaries. The
> problem is how to configure stopword detection dictionary.
>
> The cause of the problem is counting stopwords, but not using any
> lexemes for them. However, do we have to count stopwords during words
> counting or can we ignore them like unknown words? The problem I see is
> backward compatibility, since we have to regenerate all queries and
> vectors. But is it real problem or we can change its behavior in this
> way?

I think there should be a pretty high bar for forcing people to
regenerate all that data when they haven't made any change of their own
choice.

Also, I'm not very clear on exactly what you're proposing here, but it
sounds like it'd have the effect of changing whether stopwords count in
phrase distances ('a <N> b'). I think that's right out --- whether or
not you feel the current distance behavior is ideal, asking people to
*both* rebuild all their derived data *and* change their applications
will cause a revolt. It's not sufficiently obviously broken that we can
change it.

			regards, tom lane
> On Wed, Aug 29, 2018 at 10:38 AM Aleksandr Parfenov
> <a.parfenov@postgrespro.ru> wrote:
>
> During work on the new version of the patch, I found an issue in
> proposed syntax. At the beginning of the conversation, there was a
> suggestion to split stop word filtering and words normalization. At
> this stage of development, we can use a different dictionary for stop
> word detection, but if we drop the word, the word counter wouldn't
> increase and the stop word will be processed as an unknown word.

Maybe it would be better if you or one of your colleagues (Alexander,
Arthur?) could post this new version, because the current one has some
conflicts; that would make it easier for reviewers. For now I'll move
it to the next CF.
On Thu, Nov 29, 2018 at 02:02:16PM +0100, Dmitry Dolgov wrote:
> Maybe it would be better if you or some of your colleagues (Alexander,
> Arthur?) will post this new version, because the current one has some
> conflicts - so it would be easier for a reviewers. For now I'll move
> it to the next CF.

No updates for some time now, so marked as returned with feedback.
--
Michael