Thread: text search: restricting the number of parsed words in headline generation

text search: restricting the number of parsed words in headline generation

From: Sushant Sinha
Given a document and a query, the goal of headline generation is to
produce text excerpts in which the query appears. Currently, headline
generation in Postgres proceeds in the following steps:

1. Tokenize the document and obtain the lexemes
2. Decide which lexemes should be part of the headline
3. Generate the headline

So the time taken by headline generation depends directly on the size
of the document: the longer the document, the more time is spent
tokenizing and the more lexemes there are to operate on.

Most of the time goes into the tokenization phase, so for very big
documents headline generation is very expensive.

Here is a simple patch that limits the number of words during the
tokenization phase and puts an upper bound on headline generation.
The headline function takes a parameter MaxParsedWords. If this
parameter is negative or not supplied, then the entire document is
tokenized and operated on (the current behavior). However, if the
supplied MaxParsedWords is a positive number, then tokenization stops
once MaxParsedWords tokens have been obtained, and the rest of headline
generation operates on the tokens obtained up to that point.
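
In outline, the change is just an early exit from the token-collection
loop. Here is a toy, self-contained illustration of the idea (plain
whitespace splitting stands in for the real text-search parser; this is
a sketch of the technique, not the patch itself):

    #include <stdio.h>
    #include <string.h>

    /*
     * Toy illustration of the MaxParsedWords idea: stop collecting
     * tokens once max_parsed_words of them have been obtained, so all
     * downstream headline work is bounded.  A negative limit keeps the
     * current "parse everything" behavior.
     */
    static int
    tokenize_bounded(char *doc, char **tokens, int max_tokens,
                     int max_parsed_words)
    {
        int     n = 0;
        char   *tok = strtok(doc, " \t\n");

        while (tok != NULL && n < max_tokens)
        {
            tokens[n++] = tok;
            if (max_parsed_words >= 0 && n >= max_parsed_words)
                break;          /* stop tokenizing early */
            tok = strtok(NULL, " \t\n");
        }
        return n;
    }

    int
    main(void)
    {
        char    doc[] = "the quick brown fox jumps over the lazy dog";
        char   *tokens[16];
        int     n = tokenize_bounded(doc, tokens, 16, 4);

        printf("parsed %d tokens; last = %s\n", n, tokens[n - 1]);
        return 0;
    }

With a limit of 4 this parses only "the quick brown fox"; with -1 it
parses all nine words, as today.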

The current patch can be applied to 9.1rc1. It lacks changes to the
documentation and test cases. I will add them if you folks agree on the
functionality.

-Sushant.

Sushant Sinha <sushant354@gmail.com> writes:
> Given a document and a query, the goal of headline generation is to
> produce text excerpts in which the query appears.

... right ...

> Here is a simple patch that limits the number of words during the
> tokenization phase and puts an upper bound on headline generation.

Doesn't this force the headline to be taken from the first N words of
the document, independent of where the match was?  That seems rather
unworkable, or at least unhelpful.
        regards, tom lane


Re: text search: restricting the number of parsed words in headline generation

From: Alvaro Herrera
Excerpts from Tom Lane's message of Tue Aug 23 15:59:18 -0300 2011:
> Sushant Sinha <sushant354@gmail.com> writes:
> > Given a document and a query, the goal of headline generation is to
> > produce text excerpts in which the query appears.
> 
> ... right ...
> 
> > Here is a simple patch that limits the number of words during the
> > tokenization phase and puts an upper bound on headline generation.
> 
> Doesn't this force the headline to be taken from the first N words of
> the document, independent of where the match was?  That seems rather
> unworkable, or at least unhelpful.

Yeah ...

Doesn't a search result include the positions at which the tokens were
found within the document?  Wouldn't it make more sense to improve the
system somehow so that it can restrict searching for headlines to the
general area where the tokens were found?

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: text search: restricting the number of parsed words in headline generation

From: Sushant Sinha
> > Here is a simple patch that limits the number of words during the
> > tokenization phase and puts an upper bound on headline generation.
> 
> Doesn't this force the headline to be taken from the first N words of
> the document, independent of where the match was?  That seems rather
> unworkable, or at least unhelpful.
> 
>             regards, tom lane

In the headline generation function, we don't have any index or knowledge
of where the match is. We discover the matches by first tokenizing and
then comparing the tokens with the query tokens. So it is hard to do
anything better than the first N words.
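
Schematically, that discovery step is nothing more than the following
(a toy sketch with invented names, not the actual code):

    #include <stdio.h>
    #include <string.h>

    /*
     * Toy sketch of match discovery in the headline code: there is no
     * index available, so tokens must be compared against the query
     * terms one by one.
     */
    static int
    find_first_match(const char **tokens, int ntokens,
                     const char **query, int nquery)
    {
        int     i, j;

        for (i = 0; i < ntokens; i++)
            for (j = 0; j < nquery; j++)
                if (strcmp(tokens[i], query[j]) == 0)
                    return i;   /* position of the first match */
        return -1;              /* the query does not appear */
    }

    int
    main(void)
    {
        const char *tokens[] = {"postgres", "text", "search", "headline"};
        const char *query[] = {"headline", "generation"};

        printf("first match at token %d\n",
               find_first_match(tokens, 4, query, 2));
        return 0;
    }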


One option could be to start looking for a "good" match while
tokenizing, and to stop once we have found one. But the algorithms that
decide what makes a good match currently operate independently of the
tokenization, and there are two of them, so integrating them would not
be easy.

The patch is very helpful if you accept the common-case assumption that
most of the time a good match is at the top of the document. Typically a
search application generates headlines for the top matches of a query,
i.e., those in which the query terms appear frequently, so there should
be at least one or two good text-excerpt matches at the top of the
document.



-Sushant.



Sushant Sinha <sushant354@gmail.com> writes:
>> Doesn't this force the headline to be taken from the first N words of
>> the document, independent of where the match was?  That seems rather
>> unworkable, or at least unhelpful.

> In the headline generation function, we don't have any index or knowledge
> of where the match is. We discover the matches by first tokenizing and
> then comparing the tokens with the query tokens. So it is hard to do
> anything better than the first N words.

After looking at the code in wparser_def.c a bit more, I wonder whether
this patch is doing what you think it is.  Did you do any profiling to
confirm that tokenization is where the cost is?  Because it looks to me
like the match searching in hlCover() is at least O(N^2) in the number
of tokens in the document, which means it's probably the dominant cost
for any long document.  I suspect that your patch helps not so much
because it saves tokenization costs as because it bounds the amount of
effort spent in hlCover().

I haven't tried to do anything about this, but I wonder whether it
wouldn't be possible to eliminate the quadratic blowup by saving more
state across the repeated calls to hlCover().  At the very least, it
shouldn't be necessary to find the last query-token occurrence in the
document from scratch on each and every call.
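
As an illustrative sketch only (the names are invented, and this is not
hlCover()'s actual interface), the memoization could look like this:

    #include <stdbool.h>
    #include <stdio.h>

    /* State carried across repeated cover searches over one document. */
    typedef struct
    {
        int     last_hit;   /* index of the last query-matching token */
        bool    valid;      /* has last_hit been computed yet? */
    } CoverCache;

    /*
     * Return the position of the last token that matches the query,
     * scanning the document at most once.  match[i] stands in for the
     * real "does token i match some query item?" test.
     */
    static int
    last_query_token(const bool *match, int ntokens, CoverCache *cache)
    {
        if (!cache->valid)
        {
            int     i;

            cache->last_hit = -1;
            for (i = ntokens - 1; i >= 0; i--)
            {
                if (match[i])
                {
                    cache->last_hit = i;
                    break;
                }
            }
            cache->valid = true;    /* later calls skip the O(N) scan */
        }
        return cache->last_hit;
    }

    int
    main(void)
    {
        bool    match[] = {false, true, false, true, false, false};
        CoverCache cache = {-1, false};

        /* The first call pays for one scan; the second is O(1). */
        printf("%d\n", last_query_token(match, 6, &cache));
        printf("%d\n", last_query_token(match, 6, &cache));
        return 0;
    }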

Actually, this code seems probably flat-out wrong: won't every
successful call of hlCover() on a given document return exactly the same
q value (end position), namely the last token occurrence in the
document?  How is that helpful?
        regards, tom lane


Re: text search: restricting the number of parsed words in headline generation

From: Sushant Sinha
<div class="im"><br /><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px
#cccsolid;padding-left:1ex"><br /> Actually, this code seems probably flat-out wrong: won't every<br /> successful call
ofhlCover() on a given document return exactly the same<br /> q value (end position), namely the last token occurrence
inthe<br /> document?  How is that helpful?<br /><br />                        regards, tom lane<br
/></blockquote></div><br/></div>There is a line that saves the computation state from the previous call and search only
startsfrom there:<br /><pre><span>int</span>         pos = *p;</pre><br /><br /> 

Re: text search: restricting the number of parsed words in headline generation

From: Bruce Momjian
Is this a TODO?

---------------------------------------------------------------------------

On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote:
> Sushant Sinha <sushant354@gmail.com> writes:
> >> Doesn't this force the headline to be taken from the first N words of
> >> the document, independent of where the match was?  That seems rather
> >> unworkable, or at least unhelpful.
> 
> > In the headline generation function, we don't have any index or knowledge
> > of where the match is. We discover the matches by first tokenizing and
> > then comparing the tokens with the query tokens. So it is hard to do
> > anything better than the first N words.
> 
> After looking at the code in wparser_def.c a bit more, I wonder whether
> this patch is doing what you think it is.  Did you do any profiling to
> confirm that tokenization is where the cost is?  Because it looks to me
> like the match searching in hlCover() is at least O(N^2) in the number
> of tokens in the document, which means it's probably the dominant cost
> for any long document.  I suspect that your patch helps not so much
> because it saves tokenization costs as because it bounds the amount of
> effort spent in hlCover().
> 
> I haven't tried to do anything about this, but I wonder whether it
> wouldn't be possible to eliminate the quadratic blowup by saving more
> state across the repeated calls to hlCover().  At the very least, it
> shouldn't be necessary to find the last query-token occurrence in the
> document from scratch on each and every call.
> 
> Actually, this code seems probably flat-out wrong: won't every
> successful call of hlCover() on a given document return exactly the same
> q value (end position), namely the last token occurrence in the
> document?  How is that helpful?
> 
>             regards, tom lane

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: text search: restricting the number of parsed words in headline generation

From: Bruce Momjian
This might indicate that the hlCover() item is resolved.

---------------------------------------------------------------------------

On Wed, Aug 24, 2011 at 10:08:11AM +0530, Sushant Sinha wrote:
> 
> 
>     Actually, this code seems probably flat-out wrong: won't every
>     successful call of hlCover() on a given document return exactly the same
>     q value (end position), namely the last token occurrence in the
>     document?  How is that helpful?
> 
>                            regards, tom lane
> 
> 
> There is a line that saves the computation state from the previous call,
> and the search only starts from there:
> 
> int         pos = *p;
> 
> 
> 

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Bruce Momjian <bruce@momjian.us> writes:
> Is this a TODO?

AFAIR nothing's been done about the speed issue, so yes.  I didn't
like the idea of creating a user-visible knob when the speed issue
might be fixable with internal algorithm improvements, but we never
followed up on this in either fashion.
        regards, tom lane

> ---------------------------------------------------------------------------

> On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote:
>> Sushant Sinha <sushant354@gmail.com> writes:
>>> Doesn't this force the headline to be taken from the first N words of
>>> the document, independent of where the match was?  That seems rather
>>> unworkable, or at least unhelpful.
>> 
>>> In the headline generation function, we don't have any index or knowledge
>>> of where the match is. We discover the matches by first tokenizing and
>>> then comparing the tokens with the query tokens. So it is hard to do
>>> anything better than the first N words.
>> 
>> After looking at the code in wparser_def.c a bit more, I wonder whether
>> this patch is doing what you think it is.  Did you do any profiling to
>> confirm that tokenization is where the cost is?  Because it looks to me
>> like the match searching in hlCover() is at least O(N^2) in the number
>> of tokens in the document, which means it's probably the dominant cost
>> for any long document.  I suspect that your patch helps not so much
>> because it saves tokenization costs as because it bounds the amount of
>> effort spent in hlCover().
>> 
>> I haven't tried to do anything about this, but I wonder whether it
>> wouldn't be possible to eliminate the quadratic blowup by saving more
>> state across the repeated calls to hlCover().  At the very least, it
>> shouldn't be necessary to find the last query-token occurrence in the
>> document from scratch on each and every call.
>> 
>> Actually, this code seems probably flat-out wrong: won't every
>> successful call of hlCover() on a given document return exactly the same
>> q value (end position), namely the last token occurrence in the
>> document?  How is that helpful?
>> 
>> regards, tom lane

> -- 
>   Bruce Momjian  <bruce@momjian.us>        http://momjian.us
>   EnterpriseDB                             http://enterprisedb.com

>   + It's impossible for everything to be true. +





Re: text search: restricting the number of parsed words in headline generation

From: Sushant Sinha
I will do the profiling and present the results.

On Wed, 2012-08-15 at 12:45 -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Is this a TODO?
> 
> AFAIR nothing's been done about the speed issue, so yes.  I didn't
> like the idea of creating a user-visible knob when the speed issue
> might be fixable with internal algorithm improvements, but we never
> followed up on this in either fashion.
> 
>             regards, tom lane
> 
> > ---------------------------------------------------------------------------
> 
> > On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote:
> >> Sushant Sinha <sushant354@gmail.com> writes:
> >>> Doesn't this force the headline to be taken from the first N words of
> >>> the document, independent of where the match was?  That seems rather
> >>> unworkable, or at least unhelpful.
> >> 
> >>> In the headline generation function, we don't have any index or knowledge
> >>> of where the match is. We discover the matches by first tokenizing and
> >>> then comparing the tokens with the query tokens. So it is hard to do
> >>> anything better than the first N words.
> >> 
> >> After looking at the code in wparser_def.c a bit more, I wonder whether
> >> this patch is doing what you think it is.  Did you do any profiling to
> >> confirm that tokenization is where the cost is?  Because it looks to me
> >> like the match searching in hlCover() is at least O(N^2) in the number
> >> of tokens in the document, which means it's probably the dominant cost
> >> for any long document.  I suspect that your patch helps not so much
> >> because it saves tokenization costs as because it bounds the amount of
> >> effort spent in hlCover().
> >> 
> >> I haven't tried to do anything about this, but I wonder whether it
> >> wouldn't be possible to eliminate the quadratic blowup by saving more
> >> state across the repeated calls to hlCover().  At the very least, it
> >> shouldn't be necessary to find the last query-token occurrence in the
> >> document from scratch on each and every call.
> >> 
> >> Actually, this code seems probably flat-out wrong: won't every
> >> successful call of hlCover() on a given document return exactly the same
> >> q value (end position), namely the last token occurrence in the
> >> document?  How is that helpful?
> >> 
> >> regards, tom lane
> >> 
> 
> > -- 
> >   Bruce Momjian  <bruce@momjian.us>        http://momjian.us
> >   EnterpriseDB                             http://enterprisedb.com
> 
> >   + It's impossible for everything to be true. +
> 
> 





Re: text search: restricting the number of parsed words in headline generation

From: Bruce Momjian
On Wed, Aug 15, 2012 at 11:09:18PM +0530, Sushant Sinha wrote:
> I will do the profiling and present the results.

Sushant, do you have any profiling results on this issue from August?

---------------------------------------------------------------------------


> 
> On Wed, 2012-08-15 at 12:45 -0400, Tom Lane wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> > > Is this a TODO?
> > 
> > AFAIR nothing's been done about the speed issue, so yes.  I didn't
> > like the idea of creating a user-visible knob when the speed issue
> > might be fixable with internal algorithm improvements, but we never
> > followed up on this in either fashion.
> > 
> >             regards, tom lane
> > 
> > > ---------------------------------------------------------------------------
> > 
> > > On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote:
> > >> Sushant Sinha <sushant354@gmail.com> writes:
> > >>> Doesn't this force the headline to be taken from the first N words of
> > >>> the document, independent of where the match was?  That seems rather
> > >>> unworkable, or at least unhelpful.
> > >> 
> > >>> In the headline generation function, we don't have any index or knowledge
> > >>> of where the match is. We discover the matches by first tokenizing and
> > >>> then comparing the tokens with the query tokens. So it is hard to do
> > >>> anything better than the first N words.
> > >> 
> > >> After looking at the code in wparser_def.c a bit more, I wonder whether
> > >> this patch is doing what you think it is.  Did you do any profiling to
> > >> confirm that tokenization is where the cost is?  Because it looks to me
> > >> like the match searching in hlCover() is at least O(N^2) in the number
> > >> of tokens in the document, which means it's probably the dominant cost
> > >> for any long document.  I suspect that your patch helps not so much
> > >> because it saves tokenization costs as because it bounds the amount of
> > >> effort spent in hlCover().
> > >> 
> > >> I haven't tried to do anything about this, but I wonder whether it
> > >> wouldn't be possible to eliminate the quadratic blowup by saving more
> > >> state across the repeated calls to hlCover().  At the very least, it
> > >> shouldn't be necessary to find the last query-token occurrence in the
> > >> document from scratch on each and every call.
> > >> 
> > >> Actually, this code seems probably flat-out wrong: won't every
> > >> successful call of hlCover() on a given document return exactly the same
> > >> q value (end position), namely the last token occurrence in the
> > >> document?  How is that helpful?
> > >> 
> > >> regards, tom lane
> > >> 
> > 
> > > -- 
> > >   Bruce Momjian  <bruce@momjian.us>        http://momjian.us
> > >   EnterpriseDB                             http://enterprisedb.com
> > 
> > >   + It's impossible for everything to be true. +
> > 
> > 
> 
> 

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: text search: restricting the number of parsed words in headline generation

From: Bruce Momjian
FYI, I have kept this email from 2011 about the poor performance of
parsing words for headline generation.  If someone wants to research it,
please do so:
http://www.postgresql.org/message-id/1314117620.3700.12.camel@dragflick

---------------------------------------------------------------------------

On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote:
> Sushant Sinha <sushant354@gmail.com> writes:
> >> Doesn't this force the headline to be taken from the first N words of
> >> the document, independent of where the match was?  That seems rather
> >> unworkable, or at least unhelpful.
> 
> > In the headline generation function, we don't have any index or knowledge
> > of where the match is. We discover the matches by first tokenizing and
> > then comparing the tokens with the query tokens. So it is hard to do
> > anything better than the first N words.
> 
> After looking at the code in wparser_def.c a bit more, I wonder whether
> this patch is doing what you think it is.  Did you do any profiling to
> confirm that tokenization is where the cost is?  Because it looks to me
> like the match searching in hlCover() is at least O(N^2) in the number
> of tokens in the document, which means it's probably the dominant cost
> for any long document.  I suspect that your patch helps not so much
> because it saves tokenization costs as because it bounds the amount of
> effort spent in hlCover().
> 
> I haven't tried to do anything about this, but I wonder whether it
> wouldn't be possible to eliminate the quadratic blowup by saving more
> state across the repeated calls to hlCover().  At the very least, it
> shouldn't be necessary to find the last query-token occurrence in the
> document from scratch on each and every call.
> 
> Actually, this code seems probably flat-out wrong: won't every
> successful call of hlCover() on a given document return exactly the same
> q value (end position), namely the last token occurrence in the
> document?  How is that helpful?
> 
>             regards, tom lane
> 

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +