Re: [GENERAL] Fragments in tsearch2 headline - Mailing list pgsql-hackers
From:           Sushant Sinha
Subject:        Re: [GENERAL] Fragments in tsearch2 headline
Msg-id:         1212538472.5848.29.camel@dragflick
In response to: Re: [GENERAL] Fragments in tsearch2 headline (Teodor Sigaev <teodor@sigaev.ru>)
Responses:      Re: [GENERAL] Fragments in tsearch2 headline
List:           pgsql-hackers
My main argument for using Cover instead of hlCover was that Cover would be faster. I tested the default headline generation, which uses hlCover, against the current patch, which uses Cover. There was not much difference. So I think you are right that we do not need norms and can just use hlCover.

I also compared the performance of ts_headline with my first patch for headline generation (the one that was a separate function and took a tsvector as input). The performance was dramatically different. For one query ts_headline took roughly 200 ms while headline_with_fragments took just 70 ms. On another query ts_headline took 76 ms while headline_with_fragments took 24 ms. You can find the 'explain analyze' output for the first query at the bottom of this message. These queries were run multiple times to ensure that I never hit the disk. This is a machine with a 2.0 GHz Pentium 4 CPU and 512 MB RAM running Linux 2.6.22-gentoo-r8.

A couple of caveats:

1. ts_headline testing was done with current CVS head whereas headline_with_fragments was tested on PostgreSQL 8.3.1.
2. For headline_with_fragments, the tsvector for the document was obtained by joining with another table.

Are these differences understandable? If you think these caveats are the reason, or that there is something I am missing, then I can repeat the experiments under exactly the same conditions.

-Sushant.
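For intuition on why Cover was expected to be faster, here is a minimal, hypothetical Python sketch (not the actual PostgreSQL C code) of finding a smallest cover when you only have each query lexeme's sorted position list, which is the information a tsvector already stores; hlCover, by contrast, has to walk every word of the parsed document.

```python
import heapq

# Hypothetical sketch, not PostgreSQL source: find a smallest span of word
# positions containing every query term, scanning only the terms' position
# lists rather than the whole document.

def min_cover(positions):
    """positions maps each query term to its sorted word positions.
    Returns (start, end) of a smallest covering span, or None if some
    term never occurs in the document."""
    if any(not plist for plist in positions.values()):
        return None
    iters = {term: iter(plist) for term, plist in positions.items()}
    # the heap holds the current position from each term's list
    heap = [(next(it), term) for term, it in iters.items()]
    heapq.heapify(heap)
    cur_max = max(pos for pos, _ in heap)
    best = (heap[0][0], cur_max)
    while True:
        pos, term = heapq.heappop(heap)    # smallest current position
        nxt = next(iters[term], None)      # advance that term's list
        if nxt is None:
            return best                    # one list exhausted: done
        cur_max = max(cur_max, nxt)
        heapq.heappush(heap, (nxt, term))
        lo = heap[0][0]
        if cur_max - lo < best[1] - best[0]:
            best = (lo, cur_max)
```

The loop does O(P log T) work for P total positions across T query terms, independent of document length, which matches the intuition that Cover's cost scales with query-word occurrences rather than with the size of the whole document.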
Here is the 'explain analyze' output for both functions:

ts_headline
-----------

lawdb=# explain analyze SELECT ts_headline('english', doc, q, '')
        FROM docraw, plainto_tsquery('english', 'freedom of speech') as q
        WHERE docraw.tid = 125596;

                               QUERY PLAN
 Nested Loop  (cost=0.00..8.31 rows=1 width=497) (actual time=199.692..200.207 rows=1 loops=1)
   ->  Index Scan using docraw_pkey on docraw  (cost=0.00..8.29 rows=1 width=465) (actual time=0.041..0.065 rows=1 loops=1)
         Index Cond: (tid = 125596)
   ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual time=0.010..0.014 rows=1 loops=1)
 Total runtime: 200.311 ms

headline_with_fragments
-----------------------

lawdb=# explain analyze SELECT headline_with_fragments('english', docvector, doc, q, 'MaxWords=40')
        FROM docraw, docmeta, plainto_tsquery('english', 'freedom of speech') as q
        WHERE docraw.tid = 125596 and docmeta.tid = 125596;

                               QUERY PLAN
 Nested Loop  (cost=0.00..16.61 rows=1 width=883) (actual time=70.564..70.949 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..16.59 rows=1 width=851) (actual time=0.064..0.094 rows=1 loops=1)
         ->  Index Scan using docraw_pkey on docraw  (cost=0.00..8.29 rows=1 width=454) (actual time=0.040..0.044 rows=1 loops=1)
               Index Cond: (tid = 125596)
         ->  Index Scan using docmeta_pkey on docmeta  (cost=0.00..8.29 rows=1 width=397) (actual time=0.017..0.040 rows=1 loops=1)
               Index Cond: (docmeta.tid = 125596)
   ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual time=0.012..0.016 rows=1 loops=1)
 Total runtime: 71.076 ms
(8 rows)

On Tue, 2008-06-03 at 22:53 +0400, Teodor Sigaev wrote:
> > Why we need norms?
> We don't need norms at all - all matched HeadlineWordEntry already marked by
> HeadlineWordEntry->item! If it equals to NULL then this word isn't contained in
> tsquery.
>
> > hlCover does the exact thing that Cover in tsrank does which is to find
> > the cover that contains the query. However hlCover has to go through
> > words that do not match the query. Cover on the other hand operates on
> > position indexes for just the query words and so it should be faster.
> Cover, by definition, is a minimal continuous text's piece matched by query. May
> be a several covers in text and hlCover will find all of them. Next,
> prsd_headline() (for now) tries to define the best one. "Best" means: cover
> contains a lot of words from query, not less that MinWords, not greater than
> MaxWords, hasn't words shorter that ShortWord on the begin and end of cover etc.
>
> > The main reason why I would like it to be fast is that I want to
> > generate all covers for a given query. Then choose covers with smallest
> hlCover generates all covers.
>
> > Let me know what you think on this patch and I will update the patch to
> > respect other options like MinWords and ShortWord.
> As I understand, you very wish to call Cover() function instead of hlCover() -
> by design, they should be identical, but accepts different document's
> representation. So, the best way is generalize them: develop a new one which can
> be called with some kind of callback or/and opaque structure to use it in both
> rank and headline.
>
> > NumFragments < 2:
> > I wanted people to use the new headline marker if they specify
> > NumFragments >= 1. If they do not specify the NumFragments or put it to
> Ok, but if you unify cover generation and NumFragments == 1 then result for old
> and new algorithms should be the same...
>
> > On another note I found that make_tsvector crashes if it receives a
> > ParsedText with curwords = 0. Specifically uniqueWORD returns curwords
> > as 1 even when it gets 0 words. I am not sure if this is the desired
> > behavior.
> In all places there is a check before call of make_tsvector.
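As a rough illustration of the fragment-selection step discussed in the quoted exchange (choosing the best covers under NumFragments and MaxWords), here is a hedged Python sketch; the function name and the clipping policy are my own assumptions for illustration, not the patch's actual logic:

```python
# Hypothetical sketch, not the patch's code: given all covers (start, end)
# found for a query, keep the NumFragments shortest ones and clip each to
# at most max_words words, keeping the leading part of over-long covers.

def pick_fragments(covers, num_fragments, max_words):
    shortest = sorted(covers, key=lambda c: c[1] - c[0])[:num_fragments]
    clipped = []
    for start, end in sorted(shortest):       # emit in document order
        if end - start + 1 > max_words:
            end = start + max_words - 1       # clip an over-long cover
        clipped.append((start, end))
    return clipped
```

Preferring the shortest covers mirrors the idea that a tight cover packs more query words per headline word; the real code would additionally honor MinWords and ShortWord when stretching fragments.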