Thread: Re: [GENERAL] Fragments in tsearch2 headline
[moved to -hackers, because the talk is about implementation details]

> I've ported the patch of Sushant Sinha for fragmented headlines to pg8.3.1
> (http://archives.postgresql.org/pgsql-general/2007-11/msg00508.php)
Thank you.

1) > diff -Nrub postgresql-8.3.1-orig/contrib/tsearch2/tsearch2.c
   contrib/tsearch2 is now just a compatibility layer for old applications -
   they don't know about new features. So this part isn't needed.

2) The patch compiles the function (ts_headline_with_fragments) into core but
   uses it only from the contrib module, which looks very odd: the new feature
   can be used only through the compatibility layer for the old release :)

3) headline_with_fragments() is hardcoded to use the default parser, but what
   happens when the configuration uses another parser, for example for the
   Japanese language?

4) I would prefer the signature ts_headline( [regconfig,] text, tsquery [,text] ),
   and the function should accept 'NumFragments=>N' for the default parser.
   Other parsers may use other options.

5) It just doesn't work correctly, because the new code doesn't take care of
   parser-specific lexeme types:

   contrib_regression=# select headline_with_fragments('english', 'wow asd-wow wow', 'asd', '');
        headline_with_fragments
   ----------------------------------
    ...wow asd-wow<b>asd</b>-wow wow
   (1 row)

So I'm inclined to use the existing framework/infrastructure, although it may
be subject to change. Some description:

1) ts_headline determines the correct parser to use
2) it calls hlparsetext to split the text into a structure suitable for both
   goals: finding the best fragment(s) and concatenating those fragment(s) back
   into a text representation
3) it calls the parser-specific method prsheadline, which works with the
   preparsed text (the parsing was done in hlparsetext); the method should mark
   the needed words/parts/lexemes etc.
4) ts_headline glues the fragments into text and returns that.

We need a parser's headline method because only the parser knows everything
about its lexemes.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
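For illustration only, a call against the proposed unified signature might look
like the sketch below. The option name follows the suggestion in point 4 and
reuses the existing Option=Value format of headline option strings; nothing
here is implemented at this point in the thread, so the exact spelling is an
assumption.

   -- hypothetical call; 'NumFragments' is only proposed, not implemented yet
   select ts_headline('english',
                      'wow asd-wow wow',
                      to_tsquery('english', 'asd'),
                      'NumFragments=1');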
Now I understand the code much better. A few more questions on headline
generation that I was not able to answer from the code:

1. Why is hlparsetext used to parse the document rather than the parsetext
function? Since words to be included in the headline will be marked afterwards,
it seems more reasonable to just use the parsetext function. The main
difference I see is the use of hlfinditem and the marking of whether some word
is repeated. The reason this is important is that hlparsetext does not seem to
store word positions, which parsetext does. The word positions are important
for generating a headline with fragments.

2.
> I would prefer the signature ts_headline( [regconfig,] text, tsquery [,text] ),
> and the function should accept 'NumFragments=>N' for the default parser.
> Other parsers may use other options.

Does this mean we want a unified function ts_headline and we trigger the
fragments if NumFragments is specified? It seems that introducing a new
function which can take a configuration OID or name is complex, as there are so
many functions handling these issues in wparser.c. If this is true, then we
only need to add marking of headline words in prsd_headline. Otherwise we will
need another prsd_headline_with_covers function.

3. In many cases people may already have a tsvector for a given document (for
search operations). Would it be faster to pass the tsvector to the headline
function when compared to computing the tsvector each time? If so, should we
have an option to pass a tsvector to the headline function?

-Sushant.
On Sat, May 24, 2008 at 11:18 PM, Sushant Sinha <sushant354@gmail.com> wrote:
> Does this mean we want a unified function ts_headline and we trigger the
> fragments if NumFragments is specified? It seems that introducing a new
> function which can take a configuration OID or name is complex, as there are
> so many functions handling these issues in wparser.c.
>
> If this is true, then we only need to add marking of headline words in
> prsd_headline. Otherwise we will need another prsd_headline_with_covers
> function.

I think that we could merge down the two functions using a default cover of the
whole text when NumCovers is missing. I started writing the function. The only
missing information is, as you stated, the positions in HeadlineParsedText.
Hi!

> 1. Why is hlparsetext used to parse the document rather than the parsetext
> function? Since words to be included in the headline will be marked
> afterwards, it seems more reasonable to just use the parsetext function.
> The main difference I see is the use of hlfinditem and the marking of whether
> some word is repeated.

hlparsetext preserves every kind of lexeme - non-indexed ones, spaces, etc.;
parsetext doesn't. hlparsetext preserves the original form of lexemes;
parsetext doesn't.

> The reason this is important is that hlparsetext does not seem to store word
> positions, which parsetext does. The word positions are important for
> generating a headline with fragments.

That isn't needed - hlparsetext preserves the whole text, so the position is
simply the index into the array.

> 2.
>> I would prefer the signature ts_headline( [regconfig,] text, tsquery
>> [,text] ), and the function should accept 'NumFragments=>N' for the default
>> parser. Other parsers may use other options.
>
> Does this mean we want a unified function ts_headline and we trigger the
> fragments if NumFragments is specified?

The trigger should be inside the parser-specific function
(pg_ts_parser.prsheadline). Other parsers might not recognize that option.

> It seems that introducing a new function which can take a configuration OID
> or name is complex, as there are so many functions handling these issues in
> wparser.c.

No, of course not - ts_headline takes care of finding the configuration and
calling the correct parser.

> If this is true, then we only need to add marking of headline words in
> prsd_headline. Otherwise we will need another prsd_headline_with_covers
> function.

Yes, pg_ts_parser.prsheadline should mark the lexemes too. It can even change
the array of HeadlineParsedText.

> 3. In many cases people may already have a tsvector for a given document (for
> search operations). Would it be faster to pass the tsvector to the headline
> function when compared to computing the tsvector each time? If so, should we
> have an option to pass a tsvector to the headline function?

As I mentioned above, a tsvector doesn't contain the whole information about
the text.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
I have attached a new patch against the current CVS head. This produces a
headline in a document for a given query. Basically it identifies fragments of
text that contain the query and displays them.

DESCRIPTION

HeadlineParsedText contains an array of the actual words but no information
about the norms. We need an indexed position vector for each norm so that we
can quickly evaluate a number of possible fragments - something that tsvector
provides. So this patch changes HeadlineParsedText to contain the norms
(ParsedText). This field is updated while parsing in hlparsetext. The position
information of the norms corresponds to the positions of words in
HeadlineParsedText (not to the norm positions as is the case in tsvector).
This works correctly with the current parser. If you think there may be issues
with other parsers, please let me know. This approach does not change any other
interface and fits nicely with the overall framework.

The norms are converted into a tsvector and a number of covers are generated.
The best covers are then chosen to be in the headline. The covers are separated
using a hardcoded coversep. Let me know if you want to expose this as an
option.

Covers that overlap with already chosen covers are excluded.

Some options like ShortWord and MinWords are not taken care of right now.
MaxWords is used as maxcoversize. Let me know if you would like to see other
options for fragment generation as well.

Let me know any more changes you would like to see.

-Sushant.
Attachment
> I have attached a new patch against the current CVS head. This produces a
> headline in a document for a given query. Basically it identifies fragments
> of text that contain the query and displays them.

The new variant is much better, but...

> HeadlineParsedText contains an array of the actual words but no information
> about the norms. We need an indexed position vector for each norm so that we
> can quickly evaluate a number of possible fragments - something that tsvector
> provides.

Why do you need to store norms? The single purpose of norms is identifying
words from the query - but that is already done by hlfinditem: it sets
HeadlineWordEntry->item to the corresponding QueryOperand in the tsquery.
Look, the headline function is rather expensive, and your patch adds a lot of
extra work - at least in memory usage. And if the user calls it with
NumFragments=0, that work is unneeded.

> This approach does not change any other interface and fits nicely with the
> overall framework.

Yes, it's a really big step forward. Thank you. You are very close to
committing, except: did you notice the hlCover() function, which produces a
cover from the original HeadlineParsedText representation? Is there any reason
not to use it?

> The norms are converted into a tsvector and a number of covers are generated.
> The best covers are then chosen to be in the headline. The covers are
> separated using a hardcoded coversep. Let me know if you want to expose this
> as an option.
>
> Covers that overlap with already chosen covers are excluded.
>
> Some options like ShortWord and MinWords are not taken care of right now.
> MaxWords is used as maxcoversize. Let me know if you would like to see other
> options for fragment generation as well.

ShortWord, MinWords and MaxWords should keep their meaning, but per fragment,
not for the whole headline.

> Let me know any more changes you would like to see.

    if (num_fragments == 0)
        /* call the default headline generator */
        mark_hl_words(prs, query, highlight, shortword, min_words, max_words);
    else
        mark_hl_fragments(prs, query, highlight, num_fragments, max_words);

Suppose num_fragments < 2?

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
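From the user's side, that dispatch corresponds to calls like the sketch below
(the option name is the one used by this version of the patch, and the outputs
are deliberately omitted because the num_fragments=1 behaviour is exactly what
is being questioned here):

   -- num_fragments = 0: default headline generator
   select ts_headline('1 2 3 4 5 1 2 3 1', '1 & 3'::tsquery, 'NumFragments=0');
   -- num_fragments = 1: which of the two code paths should this take?
   select ts_headline('1 2 3 4 5 1 2 3 1', '1 & 3'::tsquery, 'NumFragments=1');
   -- num_fragments >= 2: fragment-based generator
   select ts_headline('1 2 3 4 5 1 2 3 1', '1 & 3'::tsquery, 'NumFragments=2');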
Efficiency:

I realized that we do not need to store all norms; we only need to store norms
that are in the query. So I moved the addition of norms from addHLParsedLex to
hlfinditem. This should add very little memory overhead to existing headline
generation. If this is still not acceptable for default headline generation,
then I can push it into mark_hl_fragments. But I think any headline-marking
function will benefit from having the norms corresponding to the query.

Why we need norms?

hlCover does the exact thing that Cover in tsrank does, which is to find a
cover that contains the query. However, hlCover has to go through words that do
not match the query. Cover, on the other hand, operates on position indexes for
just the query words, and so it should be faster. The main reason I would like
it to be fast is that I want to generate all covers for a given query, then
choose the covers with the smallest length, as they will be the ones that best
explain the relation of a query to a document, and finally stretch those covers
to the specified size.

In my understanding, the current headline generation tries to find the biggest
cover for display in the headline. I personally think that such a cover does
not explain the context of a query in a document. We may differ on this, and
that's why we may need both options.

Let me know what you think of this patch and I will update it to respect other
options like MinWords and ShortWord.

NumFragments < 2:

I wanted people to use the new headline marker if they specify
NumFragments >= 1. If they do not specify NumFragments or set it to 0, then the
default marker will be used. This becomes a bit of a tricky parameter, so
please send any ideas on how to trigger the new marker.

On another note, I found that make_tsvector crashes if it receives a ParsedText
with curwords = 0. Specifically, uniqueWORD returns curwords as 1 even when it
gets 0 words. I am not sure if this is the desired behavior.

-Sushant.
Attachment
> Why we need norms?

We don't need norms at all - every matched HeadlineWordEntry is already marked
via HeadlineWordEntry->item! If it equals NULL then the word isn't contained in
the tsquery.

> hlCover does the exact thing that Cover in tsrank does, which is to find a
> cover that contains the query. However, hlCover has to go through words that
> do not match the query. Cover, on the other hand, operates on position
> indexes for just the query words, and so it should be faster.

A cover, by definition, is a minimal continuous piece of text matched by the
query. There may be several covers in a text, and hlCover will find all of
them. Next, prsd_headline() (for now) tries to pick the best one. "Best" means:
the cover contains many words from the query, is not shorter than MinWords, not
longer than MaxWords, has no words shorter than ShortWord at its beginning and
end, etc.

> The main reason I would like it to be fast is that I want to generate all
> covers for a given query, then choose the covers with the smallest length

hlCover generates all covers.

> Let me know what you think of this patch and I will update it to respect
> other options like MinWords and ShortWord.

As I understand it, you really want to call the Cover() function instead of
hlCover() - by design they should be identical, but they accept different
document representations. So the best way is to generalize them: develop a new
function that can be called with some kind of callback and/or opaque structure,
so that it can be used in both ranking and headline generation.

> NumFragments < 2:
> I wanted people to use the new headline marker if they specify
> NumFragments >= 1. If they do not specify NumFragments or set it to 0...

OK, but if you unify cover generation, then for NumFragments = 1 the results of
the old and new algorithms should be the same...

> On another note, I found that make_tsvector crashes if it receives a
> ParsedText with curwords = 0. Specifically, uniqueWORD returns curwords as 1
> even when it gets 0 words. I am not sure if this is the desired behavior.

Everywhere it is used, there is a check before the call of make_tsvector.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
My main argument for using Cover instead of hlCover was that Cover would be
faster. I tested the default headline generation that uses hlCover against the
current patch that uses Cover. There was not much difference. So I think you
are right in that we do not need norms and we can just use hlCover.

I also compared the performance of ts_headline with my first patch for headline
generation (the one that was a separate function and took a tsvector as input).
The performance was dramatically different. For one query ts_headline took
roughly 200 ms while headline_with_fragments took just 70 ms. On another query
ts_headline took 76 ms while headline_with_fragments took 24 ms. You can find
'explain analyze' for the first query at the bottom of this mail. These queries
were run multiple times to ensure that I never hit the disk. This is a machine
with a 2.0 GHz Pentium 4 CPU and 512 MB RAM running Linux 2.6.22-gentoo-r8.

A couple of caveats:

1. ts_headline testing was done with the current CVS head, whereas
   headline_with_fragments was done with PostgreSQL 8.3.1.
2. For headline_with_fragments, the tsvector for the document was obtained by
   joining with another table.

Are these differences understandable? If you think these caveats are the
reason, or there is something I am missing, then I can repeat the entire
experiment under exactly the same conditions.

-Sushant.

Here is 'explain analyze' for both functions:

ts_headline
------------
lawdb=# explain analyze SELECT ts_headline('english', doc, q, '')
        FROM docraw, plainto_tsquery('english', 'freedom of speech') as q
        WHERE docraw.tid = 125596;
                                                        QUERY PLAN
 Nested Loop  (cost=0.00..8.31 rows=1 width=497) (actual time=199.692..200.207 rows=1 loops=1)
   ->  Index Scan using docraw_pkey on docraw  (cost=0.00..8.29 rows=1 width=465) (actual time=0.041..0.065 rows=1 loops=1)
         Index Cond: (tid = 125596)
   ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual time=0.010..0.014 rows=1 loops=1)
 Total runtime: 200.311 ms

headline_with_fragments
-----------------------
lawdb=# explain analyze SELECT headline_with_fragments('english', docvector, doc, q, 'MaxWords=40')
        FROM docraw, docmeta, plainto_tsquery('english', 'freedom of speech') as q
        WHERE docraw.tid = 125596 and docmeta.tid = 125596;
                                                        QUERY PLAN
 Nested Loop  (cost=0.00..16.61 rows=1 width=883) (actual time=70.564..70.949 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..16.59 rows=1 width=851) (actual time=0.064..0.094 rows=1 loops=1)
         ->  Index Scan using docraw_pkey on docraw  (cost=0.00..8.29 rows=1 width=454) (actual time=0.040..0.044 rows=1 loops=1)
               Index Cond: (tid = 125596)
         ->  Index Scan using docmeta_pkey on docmeta  (cost=0.00..8.29 rows=1 width=397) (actual time=0.017..0.040 rows=1 loops=1)
               Index Cond: (docmeta.tid = 125596)
   ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual time=0.012..0.016 rows=1 loops=1)
 Total runtime: 71.076 ms
(8 rows)
> A couple of caveats:
>
> 1. ts_headline testing was done with the current CVS head, whereas
>    headline_with_fragments was done with PostgreSQL 8.3.1.
> 2. For headline_with_fragments, the tsvector for the document was obtained by
>    joining with another table.
>
> Are these differences understandable?

That is a possible situation, because ts_headline has several criteria for the
'best' cover - length, number of words from the query, good words at the
beginning and end of the headline - while your fragment algorithm cares only
about the total number of words in all covers. That's not very good, but it's
acceptable, I think. Headline generation (and ranking too) has no formal rules
defining what is good or bad - just people's opinions.

Another possible reason: the original algorithm looks at all covers trying to
find the best one, while your algorithm tries to find just the shortest covers
to fill a headline.

But it is very desirable to honor ShortWord - it's not very comfortable for the
user if one option produces an unobvious side effect on another one.

> If you think these caveats are the reason, or there is something I am
> missing, then I can repeat the entire experiment under exactly the same
> conditions.

The interesting test for me would be comparing hlCover with Cover in your
patch, i.e. make a version of the patch which uses hlCover instead of Cover and
compare the old patch with the new one.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
I have attached an updated patch with the following changes:

1. Respects ShortWord and MinWords.
2. Uses hlCover instead of Cover.
3. Does not store norms (lexemes) for headline marking.
4. Removes ts_rank.h.
5. Earlier it was counting even NONWORDTOKEN in the headline; now it only
   counts the actual words and excludes spaces etc.

I have also changed the NumFragments option to MaxFragments, as there may not
be enough covers to display NumFragments.

Another change that I was thinking about: right now, if the cover size is
greater than max_words, I just cut the trailing words. Instead, I was thinking
that we should split the cover into more fragments such that each fragment
contains a few query words. Then each fragment will not contain all the query
words, but the headline will show more occurrences of query words. I would like
to know your opinion on this.

-Sushant.
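For reference, a hedged example of how the renamed option combines with the
pre-existing headline options (MaxWords, MinWords and ShortWord are existing
options; MaxFragments is the one introduced by this patch; the document text is
made up and the output is not shown):

   select ts_headline('english',
          'Search engines usually show fragments of a document that contain the query terms',
          to_tsquery('english', 'query & fragments'),
          'MaxFragments=2, MaxWords=7, MinWords=3, ShortWord=3');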
Attachment
> 1. Respects ShortWord and MinWords.
> 2. Uses hlCover instead of Cover.
> 3. Does not store norms (lexemes) for headline marking.
> 4. Removes ts_rank.h.
> 5. Earlier it was counting even NONWORDTOKEN in the headline; now it only
>    counts the actual words and excludes spaces etc.
>
> I have also changed the NumFragments option to MaxFragments, as there may not
> be enough covers to display NumFragments.

Nice. But it would be good to resolve the following issues:

1) The patch contains mistakes; I didn't investigate or read it carefully.
   Get http://www.sai.msu.su/~megera/postgres/fts/apod.dump.gz and load it into
   a db. The queries

   # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
     from apod where to_tsvector(body) @@ plainto_tsquery('black hole');

   and

   # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
     from apod;

   crash postgresql :(

2) Please include documentation and regression tests in your patch.

> Another change that I was thinking about: right now, if the cover size is
> greater than max_words, I just cut the trailing words. Instead, I was
> thinking that we should split the cover into more fragments such that each
> fragment contains a few query words. Then each fragment will not contain all
> the query words, but the headline will show more occurrences of query words.
> I would like to know your opinion on this.

Agreed.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
Attached a new patch that:
1. fixes previous bug
2. better handles the case when the cover size is greater than MaxWords. Basically it divides a cover greater than MaxWords into fragments of MaxWords, resizes each such fragment so that each end of the fragment contains a query word, and then evaluates the best fragments based on the number of query words in each fragment. In case of a tie it picks the smaller fragment. This allows more query words to be shown across multiple fragments when a single cover is larger than MaxWords.
The resizing of a fragment such that each end is a query word provides room for stretching both sides of the fragment. This (hopefully) better presents the context in which query words appear in the document. If a cover is smaller than MaxWords then the cover is treated as a fragment.
Let me know if you have any more suggestions or anything is not clear.
I have not yet added the regression tests. The regression test suite seemed to be only ensuring that the function works. How many tests should I be adding? Is there any other place that I need to add different test cases for the function?
-Sushant.
Attachment
> Attached a new patch that:
>
> 1. fixes previous bug
> 2. better handles the case when the cover size is greater than MaxWords.

Looks good, I'll make some tests with a real-world application.

> I have not yet added the regression tests. The regression test suite seemed
> to be only ensuring that the function works. How many tests should I be
> adding? Is there any other place that I need to add different test cases for
> the function?

Just add 3-5 selects to src/test/regress/sql/tsearch.sql checking basic
functionality and corner cases like:
 - there are no covers in the text
 - a cover (or covers) is too big
 - and so on

Please add some words to the documentation too.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
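A sketch of what such corner-case selects might look like (the documents here
are placeholders and the expected outputs are intentionally not given - these
are not the actual regression entries):

   -- no cover: the query term never appears in the text
   select ts_headline('1 2 3 4 5', '6'::tsquery, 'MaxFragments=2');

   -- a single cover larger than MaxWords, which has to be split or trimmed
   select ts_headline('1 2 3 4 5 6 7 8 9 10 1', '1 & 10'::tsquery,
                      'MaxFragments=2, MaxWords=4, MinWords=2');

   -- more disjoint covers than MaxFragments allows
   select ts_headline('1 2 3 1 2 3 1 2 3', '1 & 3'::tsquery, 'MaxFragments=2');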
Attached are two patches:

1. documentation
2. regression tests for headline with fragments

-Sushant.
Attachment
Sushant,

first, please provide simple test queries which demonstrate the right behavior
in the corner cases. This will help reviewers test your patch and help you make
sure your new version is ok. For example:

=# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery);
                      ts_headline
------------------------------------------------------
 <b>1</b> 2 <b>3</b> 4 5 <b>1</b> 2 <b>3</b> <b>1</b>

This select breaks your code:

=# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery,'maxfragments=2');
 ts_headline
--------------
  ... 2 ...

and so on....

Regards,
    Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
I will add test queries and their results for the corner cases in a separate
file. I guess the only thing I am confused about is what the behavior of
headline generation should be when query items contain words shorter than
ShortWord. I guess the answer is to ignore the ShortWord parameter, but let me
know if the answer is any different.

-Sushant.
On Wed, 16 Jul 2008, Sushant Sinha wrote:

> I will add test queries and their results for the corner cases in a separate
> file. I guess the only thing I am confused about is what the behavior of
> headline generation should be when query items contain words shorter than
> ShortWord. I guess the answer is to ignore the ShortWord parameter, but let
> me know if the answer is any different.

ShortWord is about the headline text; it doesn't affect words in the query, so
you can't discard them from the query.

Regards,
    Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Fixed some off-by-one errors pointed out by Oleg and errors in excluding
overlapping fragments. Also added test queries and updated the regression
tests.

Let me know of any other changes that are needed.

-Sushant.
Attachment
> Let me know of any other changes that are needed.

Looks ready to commit, but documentation is needed.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
By the way, is it intentional to have '....' in the headline?

=# select ts_headline('1 2 3 4 5 1 2 3 1','1&4'::tsquery,'MaxFragments=1');
        ts_headline
-------------------------
  ... <b>4</b> 5 <b>1</b>

On Wed, 23 Jul 2008, Teodor Sigaev wrote:

>> Let me know of any other changes that are needed.
>
> Looks ready to commit, but documentation is needed.

Regards,
    Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
I guess it is more readable to add the cover separator at the end of a fragment
than at the front. Let me know what you think and I can update it.

I think the right place for the cover separator is in the structure
HeadlineParsedText, just like startsel and stopsel. This will enable users to
specify their own cover separators. But this will require changes to the
structure as well as to the generateHeadline function, and the option will not
play well with the default headline generation function.

The default MaxWords = 35 seems a bit high for this headline generation
function; 20 seems more reasonable. Any thoughts?
-Sushant.
On Wed, 23 Jul 2008, Sushant Sinha wrote:

> I guess it is more readable to add the cover separator at the end of a
> fragment than at the front. Let me know what you think and I can update it.

FragmentsDelimiter should *separate* fragments, and that says it all. It's not
a very difficult algorithmic problem - it's like perl's
join(FragmentsDelimiter, @array).

> I think the right place for the cover separator is in the structure
> HeadlineParsedText, just like startsel and stopsel. This will enable users to
> specify their own cover separators. But this will require changes to the
> structure as well as to the generateHeadline function, and the option will
> not play well with the default headline generation function.

As soon as we introduce FragmentsDelimiter we should make it configurable.

> The default MaxWords = 35 seems a bit high for this headline generation
> function; 20 seems more reasonable. Any thoughts?

I think we should not change the default value, because it could change the
behaviour of existing applications. I'm not sure whether it would be useful and
possible to define default values in CREATE TEXT SEARCH PARSER.

Regards,
    Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Sorry for the delay. Here is the patch with the FragmentDelimiter option. It
requires an extra field in HeadlineParsedText and uses it during
generateHeadline. Implementing a notion of fragments in HeadlineParsedText and
a separate function to join them seems more complicated, so for the time being
I just emit a FragmentDelimiter whenever a new fragment (other than the first
one) starts.

The patch also contains the updated regression tests/results, a new test for
the FragmentDelimiter option, and documentation for the new options. I have
also attached a separate file that tests different aspects of the new headline
generation function.

Let me know if anything else is needed.

-Sushant.
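As an illustration of the new option (the delimiter value here is arbitrary,
and the double-quoting of a value containing spaces follows the usual
convention for headline option strings; the output is not shown):

   select ts_headline('1 2 3 4 5 1 2 3 1', '1 & 3'::tsquery,
                      'MaxFragments=2, FragmentDelimiter=" +++ "');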
Attachment
Sushant Sinha <sushant354@gmail.com> writes:
> Sorry for the delay. Here is the patch with the FragmentDelimiter option. It
> requires an extra field in HeadlineParsedText and uses it during
> generateHeadline.

I did some editing of the documentation for this patch and noticed that the
explanation of the fragment-based headline method says

    If not all query words are found in the
    document, then a single fragment of the first <literal>MinWords</>
    in the document will be displayed.

(That's what it says now, that is, based on my editing and testing of the
original.)  This seems like a pretty dumb fallback approach --- if you have
only a partial match, the headline generation suddenly becomes about as stupid
as it could possibly be.  I could understand doing the above if the text
actually contains *none* of the query words, but surely if it contains some of
them we should still select fragments centered on those words.

			regards, tom lane
Sushant Sinha <sushant354@gmail.com> writes:
> Headline generation uses hlCover to get fragments of text with *all* query
> items. In case there is no such fragment, it does not return anything.
> What you are asking will either require returning *maximally* matching
> covers or handling it as a separate case.

Efficiently being useless is still useless --- a headline selection function
needs to be robust, not fragile, and not doing anything useful for a partial
match sounds pretty fragile to me.

			regards, tom lane
Headline generation uses hlCover to get fragments of text with *all* query
items. In case there is no such fragment, it does not return anything.

What you are asking will either require returning *maximally* matching covers
or handling it as a separate case.

-Sushant.
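A minimal way to exercise the partial-match case under discussion (the output
is deliberately not shown, since its desired behavior is exactly what is in
question): here '1' appears in the document but '6' does not, so no single
cover contains all the query words.

   select ts_headline('1 2 3 4 5 1 2 3 1', '1 & 6'::tsquery, 'MaxFragments=2');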