Thread: FTS: using "::tsquery" directly in a query
Hi, everyone:

First, I can successfully run this query:

select name, ts_rank_cd(vectors, query) as rank from element, to_tsquery('item') query where query @@ vectors order by rank desc;

But what I actually want to run is this one:

select name, ts_rank_cd(vectors, query) as rank from element, 'item'::tsquery query where query @@ vectors order by rank desc;

It looks like FTS does not support using "::tsquery" directly in such a query. Do I misunderstand something?

Thanks!
Xu
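For what it's worth, a sketch of the usual workaround (assuming the table and column names from the queries above): SQL's FROM clause accepts function calls such as to_tsquery('item'), but not bare expressions like 'item'::tsquery, so the cast needs to be wrapped in a sub-select.

```sql
-- A function call may appear in FROM; a bare expression may not.
-- A sub-select gives the casted constant a valid place to live:
SELECT name, ts_rank_cd(vectors, query) AS rank
FROM element,
     (SELECT 'item'::tsquery AS query) AS q
WHERE query @@ vectors
ORDER BY rank DESC;
```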
Xu,

FTS has nothing to do with your problem; it's a general limitation/feature of SQL.

Oleg

On Sat, 23 Jan 2010, xu fei wrote:
> [...]

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Hi, Oleg Bartunov:

First, thanks for your quick reply. Could you explain "it's a general limitation/feature" a little more? I am confused, because to_tsquery('item') returns a tsquery, which to my understanding is the same type as 'item'::tsquery. Let me explain what I want.

First step: extract the top K tokens. I have a table with a column of type tsvector. Some records in this column are too big; they contain hundreds of tokens. I only want the top K tokens by frequency, for example the top 5. I am not sure there is a direct way to get such top-K tokens, so I just read them out in Java, count the frequency of each token, and sort them.

Second step: generate the query. Now I use these tokens to construct a query to search the other vectors in the same table. I cannot use to_tsquery() directly, for two reasons: 1) the default logical operator in to_tsquery() is "&", but what I need is "|"; 2) since the tokens come from a tsvector, they are already normalized, and if I run them through to_tsquery() they will be normalized again, for example "course" -> "cours" -> "cour". So I just concatenate the top K tokens with "|" and cast the result with "::tsquery". Unfortunately, as you say, "it's a general limitation/feature", so I cannot do that.

I checked your manual "Full-Text Search in PostgreSQL: A Gentle Introduction", but could not figure out how. So is it possible to implement what I want in FTS? If so, how?

Thanks!
Xu

--- On Sun, 1/24/10, Oleg Bartunov <oleg@sai.msu.su> wrote:
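Both steps above can be sketched in SQL alone (a sketch, assuming the element/vectors names from this thread): ts_stat computes per-lexeme frequencies server-side, and casting a hand-built string with ::tsquery skips the dictionary pass, so already-normalized lexemes are not stemmed a second time.

```sql
-- Step 1: top 5 lexemes by total occurrences (nentry); the inner
-- query can be restricted to a single row's vector if needed.
SELECT word, nentry
FROM ts_stat('SELECT vectors FROM element')
ORDER BY nentry DESC
LIMIT 5;

-- Step 2: join them with | and cast. Unlike to_tsquery(), the
-- text-to-tsquery cast does not normalize the lexemes again.
SELECT array_to_string(ARRAY(
         SELECT word
         FROM ts_stat('SELECT vectors FROM element')
         ORDER BY nentry DESC
         LIMIT 5
       ), ' | ')::tsquery AS query;
```

Lexemes containing spaces or quote characters would need quoting before the cast; the sketch ignores that.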
On Mon, 25 Jan 2010 07:19:59 -0800 (PST) xu fei <autofei@yahoo.com> wrote:
> [...]

You're trying to solve a similar problem to mine. I'd like to build a naive similar-text search. I don't have the "length" problem, but I'd still like to avoid tokenizing/lexizing a text twice to build a tsquery. I have weighted tsvectors stored in a column, and once I pick one up I'd like to look for similar ones in the same column.
There are thousands of ways to measure text similarity (and Oleg pointed me to some), but ts_rank should be "good enough for me". I have very short texts, so I can't use & on the whole tsvector; otherwise the chances are very high that I'd find just one match.

As you suggested, I could pick a subset of "important"[1] lexemes in the tsvector and build an "&"-ed tsquery with them. Still, at least in my case, since I'm dealing with very short texts, this looks too risky (just 1 match). Considering that I'm using weighted tsvectors, it seems that "|" plus picking the results with the best rank could be the way to go. But as you've noted, there is no function that turns a tsvector into a tsquery (possibly including weights) and gives you the choice of using "|".

Well... I'm trying to write a couple of helper functions in C. But I'm pretty new to Postgres internals, and I miss a reference of functions/macros with some examples... and this is a side project, and I haven't been using C for quite a while.

Once I have that function, I'll have to solve how to return only a few rows (since I'll have to use |, I expect a lot of returned rows), to make efficient use of the GIN index and avoid computing ts_rank for too many rows.

Don't hold your breath waiting... but let me know if you're interested, so I don't have to be the only one posting newbie questions on pgsql-hackers ;)

[1] ts_stat could give you some hints about which lexemes may be important... but deciding what's important is another can of worms... and, as anticipated, ts_rank should be "good enough for me".

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
Do you guys want something like:

arxiv=# select and2or(to_tsquery('1 & 2 & 3'));
       and2or
---------------------
 ( '1' | '2' ) | '3'
(1 row)

Oleg

On Mon, 25 Jan 2010, Ivan Sergio Borgonovo wrote:
> [...]
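If weights don't matter, a naive text-level version of an and2or function (a sketch; it breaks on lexemes that themselves contain '&') can be had without writing C, by round-tripping through the tsquery's text representation:

```sql
-- Rewrite every & into | at the text level, then cast back.
SELECT replace(to_tsquery('1 & 2 & 3')::text, '&', '|')::tsquery;
```

Oleg's and2or, operating on the parsed tsquery tree, is the robust route; the replace() trick is only a quick approximation.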
On Mon, 25 Jan 2010 23:35:12 +0300 (MSK) Oleg Bartunov <oleg@sai.msu.su> wrote:

> Do you guys want something like:
>
> arxiv=# select and2or(to_tsquery('1 & 2 & 3'));
>        and2or
> ---------------------
>  ( '1' | '2' ) | '3'
> (1 row)

Nearly. I'm starting from a weighted tsvector, not from text/tsquery. I would like to:
- keep the weights in the query
- avoid parsing the text twice to extract lexemes (I already have a tsvector)

Extending pg in C is a new science for me, but I'm actually trying to write at least a couple of functions that:
- will return a tsvector as a (weight int, pos int[], lexeme text) record set
- will turn a tsvector + operator into a tsquery

'orange':A1,2,3 'banana':B4,5 'tomato':C6,7
->
'orange':A | 'banana':B | 'tomato':C
or eventually
'orange':A & 'banana':B & 'tomato':C

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
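The second helper described above (tsvector + operator -> tsquery) can be prototyped outside the database before committing to C. A minimal sketch in Python, working on the text form of a tsvector (assuming PostgreSQL's actual output format, where weight letters are suffixed to each position, e.g. 'orange':1A,2A):

```python
import re

def tsvector_to_query(tsv, op='|'):
    """Turn the text form of a tsvector, e.g.
    "'orange':1A,2A 'banana':3B 'tomato':4",
    into a tsquery string such as
    "'orange':A | 'banana':B | 'tomato'".
    The first weight letter found among the positions is kept;
    the positions themselves are dropped (tsquery has no positions).
    """
    terms = []
    # Each entry is a quoted lexeme, optionally followed by :positions.
    for lexeme, positions in re.findall(r"'((?:[^']|'')+)'(?::(\S+))?", tsv):
        weight = re.search(r'[A-D]', positions)
        terms.append(f"'{lexeme}'" + (f":{weight.group(0)}" if weight else ''))
    return f' {op} '.join(terms)

print(tsvector_to_query("'orange':1A,2A 'banana':3B 'tomato':4"))
# 'orange':A | 'banana':B | 'tomato'
```

A real C implementation would work on the parsed tsvector instead of its text form, but the parsing and joining logic is the same.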
Hi, Ivan:

I agree with you and would also like to 'hack' into the code. The current FTS is the best among database systems and a great building block for supporting more functions. I list some I can think of:
I am not sure whether there is currently a team helping Oleg Bartunov. If needed, I can try to do something rather than just hacking on it. I am sure Ivan will join this too. :)

Xu

--- On Mon, 1/25/10, Ivan Sergio Borgonovo <mail@webthatworks.it> wrote: