Re: [GENERAL] Fragments in tsearch2 headline - Mailing list pgsql-hackers
From | Sushant Sinha |
---|---|
Subject | Re: [GENERAL] Fragments in tsearch2 headline |
Date | |
Msg-id | 1212278321.5891.24.camel@dragflick |
In response to | Re: [GENERAL] Fragments in tsearch2 headline (Teodor Sigaev <teodor@sigaev.ru>) |
Responses | Re: [GENERAL] Fragments in tsearch2 headline |
List | pgsql-hackers |
I have attached a new patch against the current CVS head. It produces a
headline for a document given a query: it identifies the fragments of
text that contain the query terms and displays them.

DESCRIPTION

HeadlineParsedText contains an array of the actual words but no
information about the norms. We need an indexed position vector for each
norm so that we can quickly evaluate a number of possible fragments,
which is what tsvector provides.

So this patch changes HeadlineParsedText to contain the norms
(ParsedText). This field is updated while parsing in hlparsetext. The
position information of the norms corresponds to the position of words
in HeadlineParsedText (not to the norm positions, as is the case in
tsvector). This works correctly with the current parser. If you think
there may be issues with other parsers, please let me know.

This approach does not change any other interface and fits nicely with
the overall framework.

The norms are converted into a tsvector and a number of covers are
generated. The best covers are then chosen to be in the headline. The
covers are separated using a hardcoded coversep. Let me know if you want
to expose this as an option.

Covers that overlap with already chosen covers are excluded.

Some options like ShortWord and MinWords are not handled right now.
MaxWords is used as maxcoversize. Let me know if you would like to see
other options for fragment generation as well.

Let me know of any more changes you would like to see.

-Sushant.

On Tue, 2008-05-27 at 13:30 +0400, Teodor Sigaev wrote:
> Hi!
>
> > 1. Why is hlparsetext used to parse the document rather than the
> > parsetext function? Since words to be included in the headline will be
> > marked afterwards, it seems more reasonable to just use the parsetext
> > function.
> > The main difference I see is the use of hlfinditem and marking whether
> > some word is repeated.
> hlparsetext preserves any kind of lexeme - not indexed, spaces etc. parsetext
> doesn't.
> hlparsetext preserves the original form of lexemes. parsetext doesn't.
>
> >
> > The reason this is important is that hlparsetext does not seem to be
> > storing word positions which parsetext does. The word positions are
> > important for generating headline with fragments.
> Not needed - hlparsetext preserves the whole text, so the position is the
> index in the array.
>
> >
> > 2.
> >> I would prefer the signature ts_headline( [regconfig,] text, tsquery
> >> [,text] ) and the function should accept 'NumFragments=>N' for the default
> >> parser. Other parsers may use other options.
> >
> > Does this mean we want a unified function ts_headline and we trigger the
> > fragments if NumFragments is specified?
>
> The trigger should be inside the parser-specific function
> (pg_ts_parser.prsheadline). Other parsers might not recognize that option.
>
> > It seems that introducing a new
> > function which can take configuration OID, or name is complex as there
> > are so many functions handling these issues in wparser.c.
> No, of course - ts_headline takes care of finding the configuration and
> calling the correct parser.
>
> >
> > If this is true then we need to just add marking of headline words in
> > prsd_headline. Otherwise we will need another prsd_headline_with_covers
> > function.
> Yeah, pg_ts_parser.prsheadline should mark the lexemes too. It can even
> change the array of HeadlineParsedText.
>
> >
> > 3. In many cases people may already have TSVector for a given document
> > (for search operation). Would it be faster to pass TSVector to headline
> > function when compared to computing TSVector each time? If that is the
> > case then should we have an option to pass TSVector to headline
> > function?
> As I mentioned above, tsvector doesn't contain the whole information about
> the text.
>
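For readers following the thread without the patch at hand, the fragment selection described in the message above (index norm positions by word-array index, generate covers, keep the best non-overlapping ones, join them with a separator) can be sketched roughly as follows. This is only an illustrative Python sketch, not the patch's C code: the function names, the scoring rule (more distinct query terms first, then shorter covers) and the default " ... " separator are all assumptions made for the example.

```python
from collections import defaultdict


def index_positions(norms):
    """Map each norm to the word-array indices where it occurs.

    norms[i] is the normalized form of word i, or None when the word
    has no norm (spaces, punctuation). Hypothetical helper.
    """
    positions = defaultdict(list)
    for i, norm in enumerate(norms):
        if norm is not None:
            positions[norm].append(i)
    return positions


def candidate_covers(positions, query_terms, max_cover_size):
    """Build one candidate cover per query-term occurrence.

    Each cover starts at a hit and extends over later hits while it
    stays within max_cover_size words. Returns (score, start, end)
    tuples, where score is the number of distinct query terms covered.
    """
    hits = sorted((p, t) for t in query_terms for p in positions.get(t, []))
    covers = []
    for i, (start, _) in enumerate(hits):
        seen, end = set(), start
        for pos, term in hits[i:]:
            if pos - start + 1 > max_cover_size:
                break
            seen.add(term)
            end = pos
        covers.append((len(seen), start, end))
    return covers


def choose_covers(covers, num_fragments):
    """Greedily keep the best covers, skipping any that overlap a chosen one."""
    chosen = []
    # Assumed ranking: more distinct terms, then shorter, then earlier.
    ranked = sorted(covers, key=lambda c: (-c[0], c[2] - c[1], c[1]))
    for score, start, end in ranked:
        if len(chosen) == num_fragments:
            break
        if all(end < s or start > e for _, s, e in chosen):
            chosen.append((score, start, end))
    return sorted(chosen, key=lambda c: c[1])  # document order


def headline(words, norms, query_terms, max_cover_size=5,
             num_fragments=2, coversep=" ... "):
    """Join the chosen covers' original words with coversep."""
    positions = index_positions(norms)
    covers = candidate_covers(positions, query_terms, max_cover_size)
    chosen = choose_covers(covers, num_fragments)
    return coversep.join(" ".join(words[s:e + 1]) for _, s, e in chosen)
```

Because positions are plain indices into the word array, the original text of a cover is recovered by slicing that array, which is the property of hlparsetext that Teodor points out in the quoted reply.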