Re: tsearch parser inefficiency if text includes urls or emails - new version - Mailing list pgsql-hackers

From Kevin Grittner
Subject Re: tsearch parser inefficiency if text includes urls or emails - new version
Msg-id 4B20E530020000250002D305@gw.wicourts.gov
In response to Re: tsearch parser inefficiency if text includes urls or emails - new version  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
Responses Re: tsearch parser inefficiency if text includes urls or emails - new version  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
Re: tsearch parser inefficiency if text includes urls or emails - new version  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers

I wrote:

> Thanks for the sample which shows the difference.

Ah, once I got on the right track, there is no problem seeing
dramatic improvements with the patch.  It changes some nasty O(N^2)
cases to O(N).  In particular, the fixes affect parsing of large
strings encoded with multi-byte character encodings and containing
email addresses or URLs with a non-IP-address host component.  It
strikes me as odd that URLs without a slash following the host
portion, or with an IP address, are treated so differently in the
parser, but if we want to address that, it's a matter for another
patch.
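
To make the O(N^2) versus O(N) difference concrete, here is a toy cost
model in C.  It is only a sketch under an assumption that is not spelled
out above: that the quadratic behavior comes from converting the
remaining multi-byte text to wide characters again each time a nested
parse (for an email address or URL) starts, while the linear behavior
converts the text once and shares that buffer.  The function names and
the fixed token length are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical cost model: count how many characters get converted to
 * wide form while parsing text_len characters, where a nested parse
 * starts every token_len characters.
 */

/* Assumed old behavior: each nested parse re-converts the whole tail. */
size_t
cost_reconvert(size_t text_len, size_t token_len)
{
    size_t      converted = 0;
    size_t      pos;

    assert(token_len > 0);
    for (pos = 0; pos < text_len; pos += token_len)
        converted += text_len - pos;    /* tail converted once more */
    return converted;
}

/* Assumed new behavior: convert once, nested parses share the buffer. */
size_t
cost_shared(size_t text_len, size_t token_len)
{
    assert(token_len > 0);
    return text_len;                    /* independent of token count */
}
```

In this model, doubling the input from 1000 to 2000 characters with
10-character tokens roughly quadruples the re-conversion cost (50500
to 201000 characters converted), while the shared-buffer cost merely
doubles.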

I'm inclined to think that the minimal differences found in some of
my tests probably have more to do with happenstance of code
alignment than the particulars of the patch.

I did find one significant (although easily solved) problem.  In the
patch, the recursive setup of usewide, pgwstr, and wstr is not
conditioned by #ifdef USE_WIDE_UPPER_LOWER, as it is in the
non-patched version.  Unless there's a good reason for that, the
#ifdef should be added.
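
As a rough illustration of the kind of conditioning meant here, the
sketch below guards the wide-character setup of a nested parser with
#ifdef USE_WIDE_UPPER_LOWER.  The struct and function are hypothetical
stand-ins, not the real TParser or the patch's code; only the names
usewide, pgwstr, wstr, and USE_WIDE_UPPER_LOWER are taken from the
review, and the field types are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for the parser state; NOT the real TParser. */
typedef struct SketchParser
{
    const char *str;            /* multibyte text */
    size_t      lenstr;         /* length of str in bytes */
#ifdef USE_WIDE_UPPER_LOWER
    int         usewide;        /* nonzero if a wide buffer is in use */
    const wchar_t *wstr;        /* wide-character view, may be NULL */
    const unsigned int *pgwstr; /* assumed pg_wchar stand-in, may be NULL */
#endif
} SketchParser;

/*
 * Set up a nested parser that starts posbyte bytes (poschar characters)
 * into the parent's text and shares the parent's buffers.
 */
void
sketch_copy_init(SketchParser *dst, const SketchParser *src,
                 size_t posbyte, size_t poschar)
{
    assert(posbyte <= src->lenstr);
    dst->str = src->str + posbyte;
    dst->lenstr = src->lenstr - posbyte;

#ifdef USE_WIDE_UPPER_LOWER
    /* These fields exist only when the symbol is defined. */
    dst->usewide = src->usewide;
    dst->wstr = src->wstr ? src->wstr + poschar : NULL;
    dst->pgwstr = src->pgwstr ? src->pgwstr + poschar : NULL;
#else
    (void) poschar;             /* unused without wide-character support */
#endif
}
```

Written this way, the sketch compiles whether or not
USE_WIDE_UPPER_LOWER is defined, which is the property the unguarded
setup would put at risk.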

Less critical, but worth fixing one way or the other, TParserClose
does not drop breadcrumbs conditioned on #ifdef WPARSER_TRACE, but
TParserCopyClose does.  I think this should be consistent.

Finally, there's that spelling error in the comment for
TParserCopyInit.  Please fix.

If a patch is produced with fixes for these three things, I'd say
it'll be ready for committer.  I'm marking it as Waiting on Author
for fixes to these three items.

Sorry for the delay in review.  I hope there's still time to get
this committed in this CF.

-Kevin

