Thread: [HACKERS] new function for tsquery creartion
Dear all, Now Postgres has a few functions to create tsqueries for full text search. The main one is the to_tsquery function that allows to make query with any operation. But to make correct query all of the operators should be specified explicitly. In order to make it easier postgres has functions like plainto_tsquery and phraseto_tsquery which allow to make tsqueries from strings. But they are not flexible enough. Let me introduce new function for full text search query creation(which is called 'queryto_tsquery'). It takes 'google like' query string and translates it to tsquery. The main features are the following: All the text inside double quotes would be treated as a phrase("a b c" -> 'a <-> b <-> c') New operator AROUND(N). It matches if the distance between words(or maybe phrases) is less than or equal to N. Alias for !('-rat' is the same as '!rat') Alias for |('dog OR cat' is the same as 'dog | cat') As a plainto_tsquery and phraseto_tsquery it will fill operators by itself, but already placed operations won't be ignored. It allows to combine two approaches. In the attachment you can find patch with the new features, tests and documentation for it. What do you think about it? Thank you very much for the attention! -- ------ Victor Drobny Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Wed, Jul 19, 2017 at 12:43 PM, Victor Drobny <v.drobny@postgrespro.ru> wrote: > Let me introduce new function for full text search query creation(which is > called 'queryto_tsquery'). It takes 'google like' query string and > translates it to tsquery. I haven't looked at the code, but that sounds like a neat idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jul 20, 2017 at 4:58 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jul 19, 2017 at 12:43 PM, Victor Drobny <v.drobny@postgrespro.ru> wrote: >> Let me introduce new function for full text search query creation(which is >> called 'queryto_tsquery'). It takes 'google like' query string and >> translates it to tsquery. > > I haven't looked at the code, but that sounds like a neat idea. +1 This is a very cool feature making tsquery much more accessible. Many people know that sort of defacto search engine query language that many websites accept using quotes, AND, OR, - etc. Calling this search syntax just "query" seems too general and overloaded. "Simple search", "simple query", "web search", "web syntax", "web query", "Google-style query", "Poogle" (kidding!) ... well I'm not sure, but I feel like it deserves a proper name. websearch_to_tsquery()? I see that your AROUND(n) is an undocumented Google search syntax. That's a good trick to know. Please send a rebased version of the patch for people to review and test as that one has bit-rotted. -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 2017-09-09 06:03, Thomas Munro wrote: > Please send a rebased version of the patch for people to review and > test as that one has bit-rotted. Hello, Thank you for interest. In the attachment you can find rebased version(based on 69835bc8988812c960f4ed5aeee86b62ac73602a commit) -- Victor Drobny Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Hi all, I am extending phrase operator <n> is such way that it will have <n,m> syntax that means from n to m words, so I will use such syntax (<n,m>) further. I found that a AROUND(N) b is exactly the same as a <-N,N> b and it can be replaced while parsing. So, what do you think of such idea? In this patch I have noticed some unobvious behavior. # select to_tsvector('Hello, cat world!') @@ queryto_tsquery('cat AROUND(1) cat') as match; match -------t cat AROUND(1) cat is the same is "cat <1> cat || cat <0> cat" and: # select to_tsvector('Hello, cat world!') @@ to_tsquery('cat <0> cat');?column? -------t It seems to be a proper logic behavior but it is a possible pitfall, maybe it should be documented? But more important question is how AROUND() operator should handle stop words? Now it works as: # select queryto_tsquery('cat <2> (a AROUND(10) rat)');queryto_tsquery ------------------'cat' <12> 'rat' (1 row) # select queryto_tsquery('cat <2> a AROUND(10) rat'); queryto_tsquery ------------------------'cat' AROUND(12) 'rat' (1 row) In my opinion it should be like: cat <2> (a AROUND(10) rat) == cat <2,2> (a <-10,10> rat) == cat <-8,12> rat cat <2> a AROUND(10) rat == cat <2,2> a <-10,10> rat = cat <-8, 12> rat Now <n,m> operator can be replaced with combination of phrase operator <n>, AROUND(), and logical operators, but with <n,m> operator it will be much painless. Correct me, please, if I am wrong. -- Alexey Chernyshov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 2017-10-13 16:37, Alexey Chernyshov wrote: > Hi all, > I am extending phrase operator <n> is such way that it will have <n,m> > syntax that means from n to m words, so I will use such syntax (<n,m>) > further. I found that a AROUND(N) b is exactly the same as a <-N,N> b > and it can be replaced while parsing. So, what do you think of such > idea? In this patch I have noticed some unobvious behavior. Thank you for the interest and review! > # select to_tsvector('Hello, cat world!') @@ queryto_tsquery('cat > AROUND(1) cat') as match; > match > ------- > t > > cat AROUND(1) cat is the same is "cat <1> cat || cat <0> cat" and: > > # select to_tsvector('Hello, cat world!') @@ to_tsquery('cat <0> cat'); > ?column? > ------- > t > > It seems to be a proper logic behavior but it is a possible pitfall, > maybe it should be documented? It is a tricky question. I think that this interpretation is confusing, so better to make it as <-N, -1> and <1, N>. > But more important question is how AROUND() operator should handle stop > words? Now it works as: > > # select queryto_tsquery('cat <2> (a AROUND(10) rat)'); > queryto_tsquery > ------------------ > 'cat' <12> 'rat' > (1 row) > > # select queryto_tsquery('cat <2> a AROUND(10) rat'); > queryto_tsquery > ------------------------ > 'cat' AROUND(12) 'rat' > (1 row) > > In my opinion it should be like: > cat <2> (a AROUND(10) rat) == cat <2,2> (a <-10,10> rat) == cat <-8,12> > rat I think that correct version is: cat <2> (a AROUND(10) rat) == cat <2,2> (a <-10,10> rat) == cat <-2,12> rat. > cat <2> a AROUND(10) rat == cat <2,2> a <-10,10> rat = cat <-8, 12> > rat It is a problem indeed. I did not catch it during implementation. Thank you for pointing it out. > Now <n,m> operator can be replaced with combination of phrase > operator <n>, AROUND(), and logical operators, but with <n,m> operator > it will be much painless. Correct me, please, if I am wrong. I think that <n,m> operator is more general than around(n) so the last one should be based on yours. However, i think, that taking negative parameters is not the best idea because it is confusing. On top of that it is not so necessary and i think it won`t be popular among users. It seems to me that AROUND operator can be easily implemented with <n,m>, also, it helps to avoid problems, that you showed above. -- Victor Drobny Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Hi, On 09/13/2017 10:57 AM, Victor Drobny wrote: > On 2017-09-09 06:03, Thomas Munro wrote: >> Please send a rebased version of the patch for people to review and >> test as that one has bit-rotted. > Hello, > Thank you for interest. In the attachment you can find rebased > version(based on 69835bc8988812c960f4ed5aeee86b62ac73602a commit) > I did a quick review of the patch today. The patch unfortunately no longer applies, so I had to use an older commit from September. Please rebase to current master. I've only looked on the diff at this point, will do more testing once the rebase happens. Some comments: 1) This seems to mix multiple improvements into one patch. There's the support for alternative query syntax, and then there are the new operators (AROUND and <m,n>). I propose to split the patch into two or more parts, each addressing one of those bits. I guess there will be two or three parts - first adding the syntax, second adding <m,n> and third adding the AROUND(n). Seems reasonable? 2) I don't think we should mention Google in the docs explicitly. Not that I'm somehow anti-google, but this syntax was certainly not invented by Google - I vividly remember using something like that on Altavista (yeah, I'm old). And it's used by pretty much every other web search engine out there ... 3) In the SGML docs, please use <literal></literal> instead of just quoting the values. So it should be <literal>|</literal> instead of '|' etc. Just like in the parts describing plainto_tsquery, for example. 4) Also, I recommend adding a brief explanation what the examples do. Right now there's just a bunch of queryto_tsquery, and the reader is expected to understand the output. I suggest adding a sentence or two, explaining what's happening (just like for plainto_tsquery examples). 5) I'm not sure about negative values in the <n,m> operator. I don't find it particularly confusing - once you understand that (a <n,m> b) means "there are 'k' words between 'a' and 'b' (n <= k <= m)", then negative values seem like a fairly straightforward extension. But I guess the main question is "Is there really a demand for the new <n,m> operator, or have we just invented if because we can?" 6) There seem to be some new constants defined, with names not following the naming convention. I mean this #define WAITOPERAND 1 #define WAITOPERATOR 2 #define WAITFIRSTOPERAND 3 #define WAITSINGLEOPERAND 4 #define INSIDEQUOTES 5 <-- the new one and #define TSPO_L_ONLY 0x01 #define TSPO_R_ONLY 0x02 #define TSPO_BOTH 0x04 #defineTS_NOT_EXAC 0x08 <-- the new one Perhaps that's OK, but it seems a bit inconsistent. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-11-19 04:30, Tomas Vondra wrote: Hello, > Hi, > > On 09/13/2017 10:57 AM, Victor Drobny wrote: >> On 2017-09-09 06:03, Thomas Munro wrote: >>> Please send a rebased version of the patch for people to review and >>> test as that one has bit-rotted. >> Hello, >> Thank you for interest. In the attachment you can find rebased >> version(based on 69835bc8988812c960f4ed5aeee86b62ac73602a commit) >> > > I did a quick review of the patch today. The patch unfortunately no > longer applies, so I had to use an older commit from September. Please > rebase to current master. Thank you for your time. In the attachment you can find rebased version. (based on e842791b0 commit) > I've only looked on the diff at this point, will do more testing once > the rebase happens. > > Some comments: > > 1) This seems to mix multiple improvements into one patch. There's the > support for alternative query syntax, and then there are the new > operators (AROUND and <m,n>). I propose to split the patch into two or > more parts, each addressing one of those bits. I agree. I have split it in 3 parts: support for around operator, queryto_tsquery function and documentation. > I guess there will be two or three parts - first adding the syntax, > second adding <m,n> and third adding the AROUND(n). Seems reasonable? > > 2) I don't think we should mention Google in the docs explicitly. Not > that I'm somehow anti-google, but this syntax was certainly not > invented > by Google - I vividly remember using something like that on Altavista > (yeah, I'm old). And it's used by pretty much every other web search > engine out there ... Yes, those syntax is not introduced by google, but, as for me, it was the easiest way to give a brief description of it. Of cause it can be changed, I just don't know how. Any suggestions are welcomed! ;) > 3) In the SGML docs, please use <literal></literal> instead of just > quoting the values. So it should be <literal>|</literal> instead of '|' > etc. Just like in the parts describing plainto_tsquery, for example. Fixed. I hope that i didn't miss anything. > 4) Also, I recommend adding a brief explanation what the examples do. > Right now there's just a bunch of queryto_tsquery, and the reader is > expected to understand the output. I suggest adding a sentence or two, > explaining what's happening (just like for plainto_tsquery examples). > > 5) I'm not sure about negative values in the <n,m> operator. I don't > find it particularly confusing - once you understand that (a <n,m> b) > means "there are 'k' words between 'a' and 'b' (n <= k <= m)", then > negative values seem like a fairly straightforward extension. > > But I guess the main question is "Is there really a demand for the new > <n,m> operator, or have we just invented if because we can?" The operator <n,m> is not introduced yet. It's just a concept. It were our thoughts about implementation AROUND operator through <n,m> in future. > 6) There seem to be some new constants defined, with names not > following > the naming convention. I mean this > > #define WAITOPERAND 1 > #define WAITOPERATOR 2 > #define WAITFIRSTOPERAND 3 > #define WAITSINGLEOPERAND 4 > #define INSIDEQUOTES 5 <-- the new one > > and > > #define TSPO_L_ONLY 0x01 > #define TSPO_R_ONLY 0x02 > #define TSPO_BOTH 0x04 > #define TS_NOT_EXAC 0x08 <-- the new one > > Perhaps that's OK, but it seems a bit inconsistent. I agree. I have fixed it. > > regards -- Victor Drobny Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
The following review has been posted through the commitfest application: make installcheck-world: tested, failed Implements feature: tested, passed Spec compliant: tested, passed Documentation: tested, passed Hi Victor, I like the idea and I think it's a great patch. However in current shape it requires some amount of reworking to meet PostgreSQL standards of code quality. Particularly: 1. Many new procedures don't have a comment with a brief description. Ideally every procedure should have not only a brief description but also a description of every argument, return value and changes of global state if applicable. 2. I believe you could implement the has_prefix procedure just as a wrapper of strstr(). 3. I suggest to use snprintf instead of sprintf in a new code whenever possible, especially if you are using %s - just to be on a safe side. 4. I noticed that your code affects the catalog. Did you check that your changes will not cause problems during the migration from the older version of PostgreSQL to the never one? 5. Tests for queryto_tsquery use only ASCII strings. I suggest to add a few test that use non-ASCII characters as well, and a few corner cases like empty string, string that contains only the stop-words, etc. 6. `make check-world` doesn't pass: ``` *************** *** 1672,1678 **** (1 row) set enable_seqscan = on; - --test queryto_tsquery function select queryto_tsquery('My brand new smartphone'); queryto_tsquery --- 1672,1677 ---- *************** *** 1784,1786 **** --- 1783,1786 ---- --------------------------- 'fat-rat' & 'fat' & 'rat' (1 row) + ```
Hi Victor, > I like the idea and I think it's a great patch. However in current shape it > requires some amount of reworking to meet PostgreSQL standards of code quality. Also I would like to add that I agree with Thomas Munro: > Calling this search syntax just "query" seems too general and > overloaded. "Simple search", "simple query", "web search", "web > syntax", "web query", "Google-style query", "Poogle" (kidding!) ... > well I'm not sure, but I feel like it deserves a proper name. > websearch_to_tsquery()? websearch_to_tsquery() sounds much better than query_to_tsquery(). Also I agree Tomas Vondra in regard that: > 2) I don't think we should mention Google in the docs explicitly. Not > that I'm somehow anti-google, but this syntax was certainly not invented > by Google - I vividly remember using something like that on Altavista > (yeah, I'm old). And it's used by pretty much every other web search > engine out there ... I suggest to rephrase: ``` + about its input. <function>queryto_tsquery</function> provides a + different, Google like syntax to create tsquery. ``` .. to something more like "provides a different syntax, similar to one used in web search engines, to create tsqeury". And maybe give a few examples right in the next sentence. -- Best regards, Aleksander Alekseev
On Tue, Nov 28, 2017 at 11:57 PM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote: >> I like the idea and I think it's a great patch. However in current shape it >> requires some amount of reworking to meet PostgreSQL standards of code quality. > > Also I would like to add that I agree with Thomas Munro: > >> Calling this search syntax just "query" seems too general and >> overloaded. "Simple search", "simple query", "web search", "web >> syntax", "web query", "Google-style query", "Poogle" (kidding!) ... >> well I'm not sure, but I feel like it deserves a proper name. >> websearch_to_tsquery()? > > websearch_to_tsquery() sounds much better than query_to_tsquery(). > > Also I agree Tomas Vondra in regard that: > >> 2) I don't think we should mention Google in the docs explicitly. Not >> that I'm somehow anti-google, but this syntax was certainly not invented >> by Google - I vividly remember using something like that on Altavista >> (yeah, I'm old). And it's used by pretty much every other web search >> engine out there ... > > I suggest to rephrase: > > ``` > + about its input. <function>queryto_tsquery</function> provides a > + different, Google like syntax to create tsquery. > ``` > > .. to something more like "provides a different syntax, similar to one > used in web search engines, to create tsqeury". And maybe give a few > examples right in the next sentence. The patch got a review less than 1 day ago, so I am moving it to next CF with the same status, waiting on author. -- Michael
On 2017-11-28 17:57, Aleksander Alekseev wrote: Hi Aleksander, Thank you for review. I have tried to fix all of your comments. However i want to mention that the absence of comments for functions in to_tsany.c is justified by the absence of comments for other similar functions. > Hi Victor, > >> I like the idea and I think it's a great patch. However in current >> shape it >> requires some amount of reworking to meet PostgreSQL standards of code >> quality. > > Also I would like to add that I agree with Thomas Munro: > >> Calling this search syntax just "query" seems too general and >> overloaded. "Simple search", "simple query", "web search", "web >> syntax", "web query", "Google-style query", "Poogle" (kidding!) ... >> well I'm not sure, but I feel like it deserves a proper name. >> websearch_to_tsquery()? > > websearch_to_tsquery() sounds much better than query_to_tsquery(). > > Also I agree Tomas Vondra in regard that: > >> 2) I don't think we should mention Google in the docs explicitly. Not >> that I'm somehow anti-google, but this syntax was certainly not >> invented >> by Google - I vividly remember using something like that on Altavista >> (yeah, I'm old). And it's used by pretty much every other web search >> engine out there ... > > I suggest to rephrase: > > ``` > + about its input. <function>queryto_tsquery</function> provides a > + different, Google like syntax to create tsquery. > ``` > > .. to something more like "provides a different syntax, similar to one > used in web search engines, to create tsqeury". And maybe give a few > examples right in the next sentence. Best, -- Victor Drobny Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 2017-11-29 17:56, Victor Drobny wrote: Sorry, forgot to attach new version of the patch. > On 2017-11-28 17:57, Aleksander Alekseev wrote: > Hi Aleksander, > > Thank you for review. I have tried to fix all of your comments. > However i want to mention that the absence of comments for functions > in to_tsany.c is justified by the absence of comments for other > similar functions. >> Hi Victor, >> >>> I like the idea and I think it's a great patch. However in current >>> shape it >>> requires some amount of reworking to meet PostgreSQL standards of >>> code quality. >> >> Also I would like to add that I agree with Thomas Munro: >> >>> Calling this search syntax just "query" seems too general and >>> overloaded. "Simple search", "simple query", "web search", "web >>> syntax", "web query", "Google-style query", "Poogle" (kidding!) ... >>> well I'm not sure, but I feel like it deserves a proper name. >>> websearch_to_tsquery()? >> >> websearch_to_tsquery() sounds much better than query_to_tsquery(). >> >> Also I agree Tomas Vondra in regard that: >> >>> 2) I don't think we should mention Google in the docs explicitly. Not >>> that I'm somehow anti-google, but this syntax was certainly not >>> invented >>> by Google - I vividly remember using something like that on Altavista >>> (yeah, I'm old). And it's used by pretty much every other web search >>> engine out there ... >> >> I suggest to rephrase: >> >> ``` >> + about its input. <function>queryto_tsquery</function> provides a >> + different, Google like syntax to create tsquery. >> ``` >> >> .. to something more like "provides a different syntax, similar to one >> used in web search engines, to create tsqeury". And maybe give a few >> examples right in the next sentence. > Best, -- Victor Drobny Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: tested, passed Documentation: tested, passed Here are a few minor issues: ``` +/* + * Checks if 'str' starts with a 'prefix' + */ +static bool +has_prefix(char * str, char * prefix) +{ + if (strlen(prefix) > strlen(str)) + { + return false; + } + return strstr(str, prefix) == str; +} ``` strlen() check is redundant. ``` + case OP_AROUND: + snprintf(in->cur, 256, " AROUND(%d) %s", distance, nrm.buf); + break; ``` Instead of the constant 256 it's better to use sizeof(). Apart from these issues this patch looks not bad. The new status of this patch is: Ready for Committer
On 2017-11-29 17:56:30 +0300, Victor Drobny wrote: > Thank you for review. I have tried to fix all of your comments. > However i want to mention that the absence of comments for functions > in to_tsany.c is justified by the absence of comments for other > similar functions. That's not justification. Tsquery related code is notorious for being badly commented, we do not want to continue that. Greetings, Andres Freund
It seems that this patch doesn't apply anymore, see http://commitfest.cputube.org/ The new status of this patch is: Waiting on Author
Hi Victor, On 3/5/18 7:52 AM, Aleksander Alekseev wrote: > It seems that this patch doesn't apply anymore, see http://commitfest.cputube.org/ > > The new status of this patch is: Waiting on Author This patch needs a rebase and should address the comments from Aleksander and Andres. We are now three weeks into the CF with no new patch. Are you planning to provide a new patch? If not, I think it should be marked as Returned with Feedback and submitted to the next CF once it has been updated. Regards, -- -David david@pgmasters.net
Hi David, I'd like to take over from Victor. I'll post a revised version of the patch in a couple of days. -- Dmitry Ivanov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
> I am extending phrase operator <n> is such way that it will have <n,m> > syntax that means from n to m words, so I will use such syntax (<n,m>) > further. I found that a AROUND(N) b is exactly the same as a <-N,N> b > and it can be replaced while parsing. So, what do you think of such > idea? In this patch I have noticed some unobvious behavior. I think new operator should be a subject for separate patch. And I prefer idea about range phrase operator. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
On Thu, 22 Mar 2018 16:53:15 +0300 Dmitry Ivanov <d.ivanov@postgrespro.ru> wrote: > Hi David, > > I'd like to take over from Victor. I'll post a revised version of the > patch in a couple of days. > Hi Dmitry, Recently I worked with the old version of the patch and found a bug. So, I think it is better to notify you immediately, so you can fix it in rebased/revised version. I noticed, that operator AROUND(N) works only in case of non-negative operands. If any of the operands is negative, it behaves as phrase operator <N>. It is caused by lack of TS_NOT_EXAC flag and AROUND(N) operator check in function TS_phrase_execute in branches for negated operands. -- Aleksandr Parfenov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
> Recently I worked with the old version of the patch and found a bug. > So, I think it is better to notify you immediately, so you can fix it > in > rebased/revised version. > > I noticed, that operator AROUND(N) works only > in case of non-negative operands. If any of the operands is negative, > it > behaves as phrase operator <N>. It is caused by lack of TS_NOT_EXAC > flag and AROUND(N) operator check in function TS_phrase_execute in > branches for negated operands. Good to know, thanks! To be honest, I' sure that Theodor is right: it's better to implement AROUND(N) operator using <N, M> when it's committed. The following version of patch won't support AROUND(N). I have to fix a few more questionable things, though. -- Dmitry Ivanov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Mon, Mar 26, 2018 at 9:51 PM, Dmitry Ivanov <d.ivanov@postgrespro.ru> wrote: >> Recently I worked with the old version of the patch and found a bug. >> So, I think it is better to notify you immediately, so you can fix it in >> rebased/revised version. >> >> I noticed, that operator AROUND(N) works only >> in case of non-negative operands. If any of the operands is negative, it >> behaves as phrase operator <N>. It is caused by lack of TS_NOT_EXAC >> flag and AROUND(N) operator check in function TS_phrase_execute in >> branches for negated operands. > > Good to know, thanks! To be honest, I' sure that Theodor is right: it's > better to implement AROUND(N) operator using <N, M> when it's committed. The > following version of patch won't support AROUND(N). I have to fix a few more > questionable things, though. Hi, I took a quick look at the language in the last version of the patches. Patch 01: + errmsg("Invalid AROUND(X) operator!"))); s/I/i/;s/!// + errmsg("Missing ')' in AROUND(X) operator"))); s/M/m/ Patch 03 (the documentation) needed some proof-reading. I've attached a new version of that patch with some small suggested improvements. Questions I had while reading the documentation without looking at the code: Is there anything to_tsquery() can do that websearch_to_tsquery() can't? The documentation doesn't mention parentheses, but I can see that they are in fact supported from the regression test. Would it be OK to use user-supplied websearch strings? Ie can it produce a syntax error? Well clearly it can, see above -- perhaps that should be explicitly documented? Is there any way to write OR as a term (that's a valuable non-stopword in French)? It seems like AROUND(x) should be documented also more generally for tsquery, but I see there is some discussion about how that should look. By the way, not this patch's fault, but I noticed that commit f5f1355dc4d did this: - (errmsg("query contains only stopword(s) or doesn't contain lexeme(s), ignored"))); + (errmsg("text-search query contains only stop words or doesn't contain lexemes, ignored"))); But the old test still appears in an example in doc/src/sgml/textsearch.sgml. -- Thomas Munro http://www.enterprisedb.com
Attachment
> Patch 03 (the documentation) needed some proof-reading. I've attached > a new version of that patch with some small suggested improvements. Thanks, I'm definitely going to use this. > Is there anything to_tsquery() can do that websearch_to_tsquery() > can't? Currently, no. > Would it be OK to use user-supplied websearch strings? > Ie can it produce a syntax error? I believe that's the most important question. After a private discussion with Theodor I came to a conclusion that the most beneficial outcome would be to suppress all syntax errors and give user some result, cutting all misused operators along the way. This requires some changes, though. > Is there any way to write OR as a term (that's a valuable non-stopword > in French)? You could quote it like this: websearch_to_tsquery('"or"'); Moreover, it's still possible to use & and |. > It seems like AROUND(x) should be documented also more generally for > tsquery, but I see there is some discussion about how that should > look. Personally, I like <N, M> operator better. It would instantly deprecate AROUND(N), which is why I'm going to drop it. > By the way, not this patch's fault, but I noticed that commit > f5f1355dc4d did this: > > - (errmsg("query contains only > stopword(s) or doesn't contain lexeme(s), ignored"))); > + (errmsg("text-search query contains > only stop words or doesn't contain lexemes, ignored"))); > > But the old test still appears in an example in > doc/src/sgml/textsearch.sgml. Will fix this. -- Dmitry Ivanov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hi everyone, I'd like to share some intermediate results. Here's what has changed: 1. OR operator is now case-insensitive. Moreover, trailing whitespace is no longer used to identify it: select websearch_to_tsquery('simple', 'abc or'); websearch_to_tsquery ---------------------- 'abc' & 'or' (1 row) select websearch_to_tsquery('simple', 'abc or(def)'); websearch_to_tsquery ---------------------- 'abc' | 'def' (1 row) select websearch_to_tsquery('simple', 'abc or!def'); websearch_to_tsquery ---------------------- 'abc' | 'def' (1 row) 2. AROUND(N) has been dropped. I hope that <N, M> operator will allow us to implement it with a few lines of code. 3. websearch_to_tsquery() now tolerates various syntax errors, for instance: Misused operators: 'abc &' '| abc' '<- def' Missing parentheses: 'abc & (def <-> (cat or rat' Other sorts of nonsense: 'abc &--|| def' => 'abc' & !!'def' 'abc:def' => 'abc':D & 'ef' This, however, doesn't mean that the result will always be adequate (who would have thought?). Overall, current implementation follows the GIGO principle. In theory, this would allow us to use user-supplied websearch strings (but see gotchas), even if they don't make much sense. Better then nothing, right? 4. A small refactoring: I've replaced all WAIT* macros with a enum for better debugging (names look much nicer in GDB). Hope this is acceptable. 5. Finally, I've added a few more comments and tests. I haven't checked the code coverage, though. A few gotchas: I haven't touched gettoken_tsvector() yet. As a result, the following queries produce errors: select websearch_to_tsquery('simple', ''''); ERROR: syntax error in tsquery: "'" select websearch_to_tsquery('simple', '\'); ERROR: there is no escaped character: "\" Maybe there's more. The question is: should we fix those, or it's fine as it is? I don't have a strong opinion about this. -- Dmitry Ivanov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
> select websearch_to_tsquery('simple', 'abc or!def'); > websearch_to_tsquery > ---------------------- > 'abc' | 'def' > (1 row) This is wrong ofc, I've attached the fixed version. -- Dmitry Ivanov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Hello Dmitry, > A few gotchas: > > I haven't touched gettoken_tsvector() yet. As a result, the following > queries produce errors: > > select websearch_to_tsquery('simple', ''''); > ERROR: syntax error in tsquery: "'" > > select websearch_to_tsquery('simple', '\'); > ERROR: there is no escaped character: "\" > > Maybe there's more. The question is: should we fix those, or it's fine as it > is? I don't have a strong opinion about this. It doesn't sound right to me to accept any input as a general rule but sometimes return errors nevertheless. That API would be complicated for the users. Thus I suggest to accept any garbage and try our best to interpret it. -- Best regards, Aleksander Alekseev
Attachment
Hello hackers, On 2018-03-28 12:21, Aleksander Alekseev wrote: > It doesn't sound right to me to accept any input as a general rule but > sometimes return errors nevertheless. That API would be complicated for > the users. Thus I suggest to accept any garbage and try our best to > interpret it. I agree with Aleksander about silencing all errors in websearch_to_tsquery(). In the attachment is a revised patch with the attempt to introduce an ability to ignore syntax errors in gettoken_tsvector(). I'm also read through the patch and all the code looks good to me except one thing. The name of enum ts_parsestate looks more like a name of the function than a name of a type. In my version, it renamed to QueryParserState, but you can fix it if I'm wrong. -- Aleksandr Parfenov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
Hi Aleksandr, > I agree with Aleksander about silencing all errors in > websearch_to_tsquery(). > > In the attachment is a revised patch with the attempt to introduce an > ability to ignore syntax errors in gettoken_tsvector(). Thanks for the further improvements! Yes, you're both right, the API has to be consistent. Unfortunately, I had to make some adjustments according to Oleg Bartunov's review. Here's a change log: 1. &, | and (), <-> are no longer considered operators in web search mode. 2. I've stumbled upon a bug: web search used to transform "pg_class" into 'pg <-> class', which is no longer the case. 3. I changed the behavior of gettoken_tsvector() as soon as I had heard from Aleksander Alekseev, so I decided to use my implementation in this revision of the patch. This is a good subject for discussion, though. Feel free to share your opinion. 4. As suggested by Theodor, I've replaced some bool args with bit flags. > The name of enum ts_parsestate looks more like a name of the function > than a name of a type. > In my version, it renamed to QueryParserState, but you can fix it if > I'm wrong. True, but gettoken_query() returns ts_tokentype, so I decided to use this naming scheme. -- Dmitry Ivanov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
I've fixed a bug and added some tests and documentation. -- Dmitry Ivanov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Hi everyone, The code in its current state looks messy and way too complicated; there're lots of interleaving code branches. Thus, I decided to split gettoken_query() into three independent tokenizers for phrase, web and original (to_tsquery()) syntaxes. Documentation is included. Any feedback is very welcome. -- Dmitry Ivanov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
> The code in its current state looks messy and way too complicated; > there're lots of interleaving code branches. Thus, I decided to split > gettoken_query() into three independent tokenizers for phrase, web and > original (to_tsquery()) syntaxes. Documentation is included. Any > feedback is very welcome. I'm sorry, I totally forgot to fix a few more things, the patch is attached below. -- Dmitry Ivanov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Tue, 03 Apr 2018 14:28:37 +0300 Dmitry Ivanov <d.ivanov@postgrespro.ru> wrote: > I'm sorry, I totally forgot to fix a few more things, the patch is > attached below. The patch looks good to me except two things. I'm not sure about the different result for these queries: SELECT websearch_to_tsquery('simple', 'cat or '); websearch_to_tsquery ---------------------- 'cat' (1 row) SELECT websearch_to_tsquery('simple', 'cat or'); websearch_to_tsquery ---------------------- 'cat' & 'or' (1 row) But I don't have strong opinion about these queries, since input in both of them looks broken in terms of operator usage. I found an odd behavior of the query creation function in case: SELECT websearch_to_tsquery('english', '"pg_class pg"'); websearch_to_tsquery ----------------------------- ( 'pg' & 'class' ) <-> 'pg' (1 row) This query means that lexemes 'pg' and 'class' should be at the same distance from the last 'pg'. In other words, they should have the same position. But default parser will interpret pg_class as two separate words during text parsing/vector creation. The bug wasn't introduced in the patch and can be found in current master. During the discussion of the patch with Dmitry, he noticed that to_tsquery() function shares same odd behavior: select to_tsquery('english', ' pg_class <-> pg'); to_tsquery ----------------------------- ( 'pg' & 'class' ) <-> 'pg' (1 row) This oddity caused by they implementation of makepol. In makepol, each token (parsed by query parser) is sent to FTS parser and in case of further separation of the token, it uses operator selected in functions to_tsquery and friends. So it doesn't change over the runtime. I see two different ways to solve it: 1) Use the same operator inside the parenthesizes. This will mean to parse it as few parts of one word. 2) Remove parenthesizes. This will mean to parse it as few separate words. I prefer the second way since in some languages words can be separated by some special symbol or not separated by any symbols at all and should be extracted by special FTS parser. It also allows us to parse such words as one by using the special parser (as it done for hyphenated word). But in the example with websearch_to_tsquery, I think it should use the same operator for quoted part of the query. For example, we can update the operator in makepol before sending it to pushval (pushval_morph) to do so. It looks like there should be two separated patches, one for websearch_to_tsquery and another one for fixing odd behavior of the query construction. But since the first one may depend on the bugfix, to solve case with quotes, I will mark it as Waiting on Author. -- Aleksandr Parfenov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
> I'm not sure about the different result for these queries: > SELECT websearch_to_tsquery('simple', 'cat or '); > websearch_to_tsquery > ---------------------- > 'cat' > (1 row) > SELECT websearch_to_tsquery('simple', 'cat or'); > websearch_to_tsquery > ---------------------- > 'cat' & 'or' > (1 row) I guess both queries should produce just 'cat'. I've changed the definition of parse_or_operator(). > I found an odd behavior of the query creation function in case: > SELECT websearch_to_tsquery('english', '"pg_class pg"'); > websearch_to_tsquery > ----------------------------- > ( 'pg' & 'class' ) <-> 'pg' > (1 row) > > This query means that lexemes 'pg' and 'class' should be at the same > distance from the last 'pg'. In other words, they should have the same > position. But default parser will interpret pg_class as two separate > words during text parsing/vector creation. > > The bug wasn't introduced in the patch and can be found in current > master. During the discussion of the patch with Dmitry, he noticed that > to_tsquery() function shares same odd behavior: > select to_tsquery('english', ' pg_class <-> pg'); > to_tsquery > ----------------------------- > ( 'pg' & 'class' ) <-> 'pg' > (1 row) I've been thinking about this for a while, and it seems that this should be fixed somewhere near parsetext(). Perhaps 'pg' and 'class' should share the same position. After all, somebody could implement a parser which would split some words using an arbitrary set of rules, for instance "split all words containing digits". I propose merging this patch provided that there are no objections. -- Dmitry Ivanov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On 2018-04-04 17:33, Dmitry Ivanov wrote: > I've been thinking about this for a while, and it seems that this > should be fixed somewhere near parsetext(). Perhaps 'pg' and 'class' > should share the same position. After all, somebody could implement a > parser which would split some words using an arbitrary set of rules, > for instance "split all words containing digits". I propose merging > this patch provided that there are no objections. I'm agree that this problem should be solved in separate patch and that this feature can be merged without waiting for the fix. -- Aleksandr Parfenov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Thanks to everyone, pushed with some editorization: 1) translate russian test to prevent potential problems with encoding 2) fix inconsistency 'or cat' and 'cat or', second example doesn't treat OR as lexeme, but first one does. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Hello everyone, I am wondering if
AROUND(N) or <N, M> is still possible? I found this thread below and the original post https://www.postgresql.org/message-id/fe931111ff7e9ad79196486ada79e268%40postgrespro.ru
mentioned the proposed feature: 'New operator AROUND(N). It matches if the distance between words(or maybe phrases) is less than or equal to N.'
currently in tsquery_phrase(query1 tsquery, query2 tsquery, distance integer) the distaince is searching a fixed distance, is there way to
search maximum distance so the search returns query1 followed by query2 up
to a certain distance? like the AROUND(N) or <N, M> mentioned in the thread?
search maximum distance so the search returns query1 followed by query2 up
to a certain distance? like the AROUND(N) or <N, M> mentioned in the thread?
Thank you!
On Mon, Jul 22, 2019 at 9:13 AM Dmitry Ivanov <d.ivanov@postgrespro.ru> wrote:
Hi everyone,
I'd like to share some intermediate results. Here's what has changed:
1. OR operator is now case-insensitive. Moreover, trailing whitespace is
no longer used to identify it:
select websearch_to_tsquery('simple', 'abc or');
websearch_to_tsquery
----------------------
'abc' & 'or'
(1 row)
select websearch_to_tsquery('simple', 'abc or(def)');
websearch_to_tsquery
----------------------
'abc' | 'def'
(1 row)
select websearch_to_tsquery('simple', 'abc or!def');
websearch_to_tsquery
----------------------
'abc' | 'def'
(1 row)
2. AROUND(N) has been dropped. I hope that <N, M> operator will allow us
to implement it with a few lines of code.
3. websearch_to_tsquery() now tolerates various syntax errors, for
instance:
Misused operators:
'abc &'
'| abc'
'<- def'
Missing parentheses:
'abc & (def <-> (cat or rat'
Other sorts of nonsense:
'abc &--|| def' => 'abc' & !!'def'
'abc:def' => 'abc':D & 'ef'
This, however, doesn't mean that the result will always be adequate (who
would have thought?). Overall, current implementation follows the GIGO
principle. In theory, this would allow us to use user-supplied websearch
strings (but see gotchas), even if they don't make much sense. Better
then nothing, right?
4. A small refactoring: I've replaced all WAIT* macros with a enum for
better debugging (names look much nicer in GDB). Hope this is
acceptable.
5. Finally, I've added a few more comments and tests. I haven't checked
the code coverage, though.
A few gotchas:
I haven't touched gettoken_tsvector() yet. As a result, the following
queries produce errors:
select websearch_to_tsquery('simple', '''');
ERROR: syntax error in tsquery: "'"
select websearch_to_tsquery('simple', '\');
ERROR: there is no escaped character: "\"
Maybe there's more. The question is: should we fix those, or it's fine
as it is? I don't have a strong opinion about this.
--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Is this possible with the current websearch_to_tsquery function?
Thanks.
Hello everyone, I am wondering ifAROUND(N) or <N, M> is still possible? I found this thread below and the original post https://www.postgresql.org/message-id/fe931111ff7e9ad79196486ada79e268%40postgrespro.rumentioned the proposed feature: 'New operator AROUND(N). It matches if the distance between words(or maybe phrases) is less than or equal to N.'currently in tsquery_phrase(query1 tsquery, query2 tsquery, distance integer) the distaince is searching a fixed distance, is there way to
search maximum distance so the search returns query1 followed by query2 up
to a certain distance? like the AROUND(N) or <N, M> mentioned in the thread?Thank you!On Mon, Jul 22, 2019 at 9:13 AM Dmitry Ivanov <d.ivanov@postgrespro.ru> wrote:Hi everyone,
I'd like to share some intermediate results. Here's what has changed:
1. OR operator is now case-insensitive. Moreover, trailing whitespace is
no longer used to identify it:
select websearch_to_tsquery('simple', 'abc or');
websearch_to_tsquery
----------------------
'abc' & 'or'
(1 row)
select websearch_to_tsquery('simple', 'abc or(def)');
websearch_to_tsquery
----------------------
'abc' | 'def'
(1 row)
select websearch_to_tsquery('simple', 'abc or!def');
websearch_to_tsquery
----------------------
'abc' | 'def'
(1 row)
2. AROUND(N) has been dropped. I hope that <N, M> operator will allow us
to implement it with a few lines of code.
3. websearch_to_tsquery() now tolerates various syntax errors, for
instance:
Misused operators:
'abc &'
'| abc'
'<- def'
Missing parentheses:
'abc & (def <-> (cat or rat'
Other sorts of nonsense:
'abc &--|| def' => 'abc' & !!'def'
'abc:def' => 'abc':D & 'ef'
This, however, doesn't mean that the result will always be adequate (who
would have thought?). Overall, current implementation follows the GIGO
principle. In theory, this would allow us to use user-supplied websearch
strings (but see gotchas), even if they don't make much sense. Better
then nothing, right?
4. A small refactoring: I've replaced all WAIT* macros with a enum for
better debugging (names look much nicer in GDB). Hope this is
acceptable.
5. Finally, I've added a few more comments and tests. I haven't checked
the code coverage, though.
A few gotchas:
I haven't touched gettoken_tsvector() yet. As a result, the following
queries produce errors:
select websearch_to_tsquery('simple', '''');
ERROR: syntax error in tsquery: "'"
select websearch_to_tsquery('simple', '\');
ERROR: there is no escaped character: "\"
Maybe there's more. The question is: should we fix those, or it's fine
as it is? I don't have a strong opinion about this.
--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: query1 followed by query2 at maximum distance vs current fixeddistance
From
Arthur Zakirov
Date:
Hello, On 23.07.2019 09:55, Wh isere wrote: > Is this possible with the current websearch_to_tsquery function? > > Thanks. > > Hello everyone, I am wondering if > AROUND(N) or <N, M> is still possible? I found this thread below and > the original post > https://www.postgresql.org/message-id/fe931111ff7e9ad79196486ada79e268%40postgrespro.ru > mentioned the proposed feature: 'New operator AROUND(N). It matches > if the distance between words(or maybe phrases) is less than or > equal to N.' > > currently in tsquery_phrase(query1 tsquery, query2 tsquery, distance > integer) the distaince is searching a fixed distance, is there way to > search maximum distance so the search returns query1 followed by > query2 up > to a certain distance? like the AROUND(N) or <N, M> mentioned in the > thread? As far as I know AROUND(N) and <N, M> weren't committed, unfortunately. And so you can search only using a fixed distance currently. websearch_to_tsquery() can't help here. It just transforms search pattern with OR, AND statements into tsquery syntax. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Thanks Arthur! I guess there is not other solution? I tried to create a function to loop through all the distance but its very slow.
On Tuesday, July 30, 2019, Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
On Tuesday, July 30, 2019, Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
Hello,
On 23.07.2019 09:55, Wh isere wrote:Is this possible with the current websearch_to_tsquery function?As far as I know AROUND(N) and <N, M> weren't committed, unfortunately. And so you can search only using a fixed distance currently.
Thanks.
Hello everyone, I am wondering if
AROUND(N) or <N, M> is still possible? I found this thread below and
the original post
https://www.postgresql.org/message-id/fe931111ff7e9ad7919648 6ada79e268%40postgrespro.ru
mentioned the proposed feature: 'New operator AROUND(N). It matches
if the distance between words(or maybe phrases) is less than or
equal to N.'
currently in tsquery_phrase(query1 tsquery, query2 tsquery, distance
integer) the distaince is searching a fixed distance, is there way to
search maximum distance so the search returns query1 followed by
query2 up
to a certain distance? like the AROUND(N) or <N, M> mentioned in the
thread?
websearch_to_tsquery() can't help here. It just transforms search pattern with OR, AND statements into tsquery syntax.
--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company