Thread: tsearch strategy for incremental search
I am trying to implement an incremental search engine. The service should start searching when the user has typed at least 3 characters. I am thinking of using the following strategy: a) Create a function string_to_three_char_tsvector(str text) that would generate the tsvector composed of the three-letter lexemes that begin all the admissible words within the parameter str. b) Using this function, create an indexed tsvector column: three_char_index. c) Given the query string query_string (assume query_string containing at least 3 characters): SELECT * FROM mytable, plainto_tsquery((string_to_three_char_tsvector(query_string))::text) AS query WHERE three_char_index @@ query AND text_field LIKE '%' || str || '%'; Once I've narrowed the field of possibilities down to the correct 3-letter lexemes, there are fewer than 100 lines to search through with LIKE. I could even repeat the exercise with 4-letter lexemes if these numbers were to grow or if I needed the extra boost in performance. So, two questions to postgres/tsearch experts: 1) Does that seem like a decent overall strategy? 2) About the function string_to_three_char_tsvector(text), I cannot think of an elegant way of writing this. Is it possible to do better than the following: str => cast to tsvector => cast to text => for each lexeme-string, take first-three-char substring => concat back together => cast to tsvector Is there a nice way of performing the middle operation? Like splitting the string to an array...
2008/6/30 Oleg Bartunov <oleg@sai.msu.su>: > tsearch will have prefix search support in 8.4. Thanks Oleg! That's fantastic news! In the meantime, carrying on with my earlier idea, here's the little function I came up with for extracting the 3-char-lexeme tsvector, in case anyone's interested: CREATE OR REPLACE FUNCTION public.to_three_char_tsvector(str text) RETURNS tsvector AS $BODY$declare somerow record; shortened_text text := ''; BEGIN FOR somerow IN SELECT regexp_split_to_table(trim(both '\'' from strip(to_tsvector('simple',str))::text),'\'') AS item LOOP shortened_text := shortened_text || ' ' || COALESCE(substring(somerow.item for 3), ''); END LOOP; RETURN strip(to_tsvector(shortened_text)); END;$BODY$ LANGUAGE 'plpgsql' IMMUTABLE;
Pierre, you, probably, can use custom configuration, which uses pg_3chars dictionary. Oleg On Mon, 30 Jun 2008, Pierre Thibaudeau wrote: > 2008/6/30 Oleg Bartunov <oleg@sai.msu.su>: >> tsearch will have prefix search support in 8.4. > > Thanks Oleg! That's fantastic news! > > In the meantime, carrying on with my earlier idea, here's the little > function I came up with for extracting the 3-char-lexeme tsvector, in > case anyone's interested: > > CREATE OR REPLACE FUNCTION public.to_three_char_tsvector(str text) > RETURNS tsvector AS > $BODY$declare > somerow record; > shortened_text text := ''; > BEGIN > FOR somerow IN SELECT regexp_split_to_table(trim(both '\'' from > strip(to_tsvector('simple',str))::text),'\'') AS item LOOP > shortened_text := shortened_text || ' ' || > COALESCE(substring(somerow.item for 3), ''); > END LOOP; > RETURN strip(to_tsvector(shortened_text)); > END;$BODY$ > LANGUAGE 'plpgsql' IMMUTABLE; > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
2008/6/30 Oleg Bartunov <oleg@sai.msu.su>: > you, probably, can use custom configuration, which uses pg_3chars > dictionary. Thanks Oleg. That sounds like a cleverer (and more natural) idea than mine. I am intrigued: is "pg_3chars" something that's part of the current distribution of postgresql (I did a quick search and found nothing), or are you implying that it would be worth writing a tsearch configuration along the lines of my earlier ideas?
On Mon, 30 Jun 2008, Pierre Thibaudeau wrote: > 2008/6/30 Oleg Bartunov <oleg@sai.msu.su>: >> you, probably, can use custom configuration, which uses pg_3chars >> dictionary. > > Thanks Oleg. That sounds like a cleverer (and more natural) idea than mine. > > I am intrigued: is "pg_3chars" something that's part of the current > distribution of postgresql (I did a quick search and found nothing), > or are you implying that it would be worth writing a tsearch > configuration along the lines of my earlier ideas? it's just a suggestion name :) btw, if you're really lazy you can use dict_regex (http://vo.astronet.ru/arxiv/dict_regex.html) dictionary. Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83