Re: tsearch parser inefficiency if text includes urls or emails - new version - Mailing list pgsql-hackers

From Andres Freund
Subject Re: tsearch parser inefficiency if text includes urls or emails - new version
Date
Msg-id 200912102005.16560.andres@anarazel.de
Whole thread Raw
In response to Re: tsearch parser inefficiency if text includes urls or emails - new version  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
Responses Re: tsearch parser inefficiency if text includes urls or emails - new version  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
List pgsql-hackers
On Thursday 10 December 2009 19:10:24 Kevin Grittner wrote:
> I wrote:
> > Thanks for the sample which shows the difference.
> 
> Ah, once I got on the right track, there is no problem seeing
> dramatic improvements with the patch.  It changes some nasty O(N^2)
> cases to O(N).  In particular, the fixes affect parsing of large
> strings encoded with multi-byte character encodings and containing
> email addresses or URLs with a non-IP-address host component.  It
> strikes me as odd that URLs without a slash following the host
> portion, or with an IP address, are treated so differently in the
> parser, but if we want to address that, it's a matter for another
> patch.
Same here. Generally I do have to say that I dont find that parser very nice - 
and actually I think its not very efficient as well because it searches a big 
part of the search space all the time.
I think a generated parser would be more efficient and way much easier to 
read...

> I'm inclined to think that the minimal differences found in some of
> my tests probably have more to do with happenstance of code
> alignment than the particulars of the patch.
Yes, I think that as well.

> I did find one significant (although easily solved) problem.  In the
> patch, the recursive setup of usewide, pgwstr, and wstr are not
> conditioned by #ifdef USE_WIDE_UPPER_LOWER in the non-patched
> version.  Unless there's a good reason for that, the #ifdef should
> be added.
Uh. Sorry. Fixed.

> Less critical, but worth fixing one way or the other, TParserClose
> does not drop breadcrumbs conditioned on #ifdef WPARSER_TRACE, but
> TParserCopyClose does.  I think this should be consistent.
Actually there was *some* thinking behind that: The end of parsing is quite 
obvious (the call returns, and its so verbose you will never do more than one 
call anyway) - but knowing where the copy ends is quite important to 
understand the parse flow.
I added a log there because in the end that line is not going to make any 
difference ;-)

> Sorry for the delay in review.  I hope there's still time to get
> this committed in this CF.
Thanks for your reviewing!

Actually I dont mind very much if it gets delayed or not. Its trivial enough 
that it shouldnt cause much work/conflicts/whatever next round and I am running 
patched versions anyway, so ...



Andres

pgsql-hackers by date:

Previous
From: "Kevin Grittner"
Date:
Subject: Re: tsearch parser inefficiency if text includes urls or emails - new version
Next
From: "Kevin Grittner"
Date:
Subject: Re: explain output infelicity in psql