Thread: [PATCH] tsearch parser inefficiency if text includes urls or emails

[PATCH] tsearch parser inefficiency if text includes urls or emails

From: Andres Freund
Date:
Hi,

While playing around with/evaluating tsearch I noticed that to_tsvector is
obscenely slow for some files. After some profiling I found that this is due
to using a separate TSParser in p_ishost/p_isURLPath in wparser_def.c.
If a multibyte encoding is in use, TParserInit copies the whole remaining input
and converts it to wchar_t or pg_wchar - for every email or protocol-prefixed
URL in the document - which is obviously bad.
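
To make the hot spot concrete, the pre-patch code path looks roughly like
this (a sketch of wparser_def.c from memory - details may differ slightly
from the actual sources). To classify a single candidate token, p_ishost
starts a complete second parser over everything that is left of the input,
and TParserInit is where the copying and conversion happens:

    static int
    p_ishost(TParser *prs)
    {
        /*
         * Start a full sub-parser on the remaining input. With a multibyte
         * encoding, TParserInit copies and converts everything from the
         * current position to the end of the document - O(remaining input)
         * work for every candidate host/URL/email token.
         */
        TParser    *tmpprs = TParserInit(prs->str + prs->state->posbyte,
                                         prs->lenstr - prs->state->posbyte);
        int         res = 0;

        tmpprs->wanthost = true;

        if (TParserGet(tmpprs) && tmpprs->type == HOST)
        {
            /* accept the sub-parser's token and advance our own position */
            prs->state->posbyte += tmpprs->lenbytetoken;
            prs->state->poschar += tmpprs->lenchartoken;
            prs->state->lenbytetoken = tmpprs->lenbytetoken;
            prs->state->lenchartoken = tmpprs->lenchartoken;
            res = 1;
        }
        TParserClose(tmpprs);

        return res;
    }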

I solved the issue by adding a separate TParserCopyInit/TParserCopyClose pair
which reuses the already converted strings of the original TParser - only at
different offsets.

Another approach would be to get rid of the separate parser invocations -
requiring a bunch of additional states. This seemed more complex to me, so I
wanted to get some feedback first.
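
Roughly sketched, the core of the fix looks like this (the attached patch is
authoritative - some details here are reconstructed from memory). Instead of
copying, the sub-parser simply points into the original parser's buffers at
the current byte/character offsets:

    static TParser *
    TParserCopyInit(const TParser *orig)
    {
        TParser    *prs = (TParser *) palloc0(sizeof(TParser));

        prs->charmaxlen = orig->charmaxlen;
        prs->str = orig->str + orig->state->posbyte;
        prs->lenstr = orig->lenstr - orig->state->posbyte;

    #ifdef USE_WIDE_UPPER_LOWER
        prs->usewide = orig->usewide;

        /* reuse the already converted wide-char buffers, just at an offset */
        if (orig->pgwstr)
            prs->pgwstr = orig->pgwstr + orig->state->poschar;
        if (orig->wstr)
            prs->wstr = orig->wstr + orig->state->poschar;
    #endif

        prs->state = newTParserPosition(NULL);
        prs->state->state = TPS_Base;

        return prs;
    }

    /*
     * Matching teardown: free only what TParserCopyInit allocated. The
     * string buffers still belong to the original parser and must not be
     * freed here - that is the whole point of the Copy variant.
     */
    static void
    TParserCopyClose(TParser *prs)
    {
        while (prs->state)
        {
            TParserPosition *ptr = prs->state->prev;

            pfree(prs->state);
            prs->state = ptr;
        }
        pfree(prs);
    }

p_ishost/p_isURLPath then call TParserCopyInit/TParserCopyClose instead of
TParserInit/TParserClose, so the document is converted exactly once, by the
outermost parser.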

Without patch:
andres=# SELECT to_tsvector('english', document) FROM document WHERE filename =
'/usr/share/doc/libdrm-nouveau1/changelog';

─────────────────────────────────────────────────────────────────────────────────────────────────────
...
(1 row)
Time: 5835.676 ms

With patch:
andres=# SELECT to_tsvector('english', document) FROM document WHERE filename =
'/usr/share/doc/libdrm-nouveau1/changelog';

─────────────────────────────────────────────────────────────────────────────────────────────────────
...
(1 row)
Time: 395.341 ms

I'll clean up the patch if it seems like a sensible solution...

Is this backpatch-worthy?

Andres


PS: I left the additional define in for the moment so that it's easier to see
the performance differences.

On Sunday 01 November 2009 16:19:43 Andres Freund wrote:
> While playing around with/evaluating tsearch I noticed that to_tsvector is
> obscenely slow for some files. After some profiling I found that this is
>  due to using a separate TSParser in p_ishost/p_isURLPath in wparser_def.c. If
>  a multibyte encoding is in use, TParserInit copies the whole remaining
>  input and converts it to wchar_t or pg_wchar - for every email or
>  protocol-prefixed URL in the document - which is obviously bad.
>
> I solved the issue by adding a separate TParserCopyInit/TParserCopyClose
>  pair which reuses the already converted strings of the original TParser -
>  only at different offsets.
>
> Another approach would be to get rid of the separate parser invocations -
> requiring a bunch of additional states. This seemed more complex to me, so
>  I wanted to get some feedback first.
>
> Without patch:
> andres=# SELECT to_tsvector('english', document) FROM document WHERE
>  filename = '/usr/share/doc/libdrm-nouveau1/changelog';
>
>  ────────────────────────────────────────────────────────────────────────── ...
>  (1 row)
>
> Time: 5835.676 ms
>
> With patch:
> andres=# SELECT to_tsvector('english', document) FROM document WHERE
>  filename = '/usr/share/doc/libdrm-nouveau1/changelog';
>
>  ────────────────────────────────────────────────────────────────────────── ...
>  (1 row)
>
> Time: 395.341 ms
>
> I'll clean up the patch if it seems like a sensible solution...
As nobody commented, here is a corrected (stupid thinko) and cleaned-up
version. Does anyone care to comment on whether I am the only one who thinks
this is an issue?

Andres

Re: [PATCH] tsearch parser inefficiency if text includes urls or emails - new version

From: Kenneth Marshall
Date:
On Sun, Nov 08, 2009 at 05:00:53PM +0100, Andres Freund wrote:
> On Sunday 01 November 2009 16:19:43 Andres Freund wrote:
> > While playing around with/evaluating tsearch I noticed that to_tsvector is
> > obscenely slow for some files. After some profiling I found that this is
> >  due to using a separate TSParser in p_ishost/p_isURLPath in wparser_def.c. If
> >  a multibyte encoding is in use, TParserInit copies the whole remaining
> >  input and converts it to wchar_t or pg_wchar - for every email or
> >  protocol-prefixed URL in the document - which is obviously bad.
> > 
> > I solved the issue by adding a separate TParserCopyInit/TParserCopyClose
> >  pair which reuses the already converted strings of the original TParser -
> >  only at different offsets.
> > 
> > Another approach would be to get rid of the separate parser invocations -
> > requiring a bunch of additional states. This seemed more complex to me, so
> >  I wanted to get some feedback first.
> > 
> > Without patch:
> > andres=# SELECT to_tsvector('english', document) FROM document WHERE
> >  filename = '/usr/share/doc/libdrm-nouveau1/changelog';
> > 
> > ────────────────────────────────────────────────────────────────────────── ...
> >  (1 row)
> > 
> > Time: 5835.676 ms
> > 
> > With patch:
> > andres=# SELECT to_tsvector('english', document) FROM document WHERE
> >  filename = '/usr/share/doc/libdrm-nouveau1/changelog';
> > 
> > ────────────────────────────────────────────────────────────────────────── ...
> >  (1 row)
> > 
> > Time: 395.341 ms
> > 
> > I'll clean up the patch if it seems like a sensible solution...
> As nobody commented, here is a corrected (stupid thinko) and cleaned-up
> version. Does anyone care to comment on whether I am the only one who thinks
> this is an issue?
> 
> Andres

+1

As a user of tsearch, I can certainly appreciate the speed-up in parsing --
more CPU for everyone else.

Regards,
Ken


On Sunday 08 November 2009 17:41:15 Kenneth Marshall wrote:
> On Sun, Nov 08, 2009 at 05:00:53PM +0100, Andres Freund wrote:
> > As nobody commented, here is a corrected (stupid thinko) and cleaned-up
> > version. Does anyone care to comment on whether I am the only one who
> > thinks this is an issue?
> > Andres
> +1
> As a user of tsearch, I can certainly appreciate the speed-up in parsing --
> more CPU for everyone else.
Please note that this is mostly an issue for rather long documents that
include either email addresses (i.e. an @) or links with protocol prefixes
(like ?+://) - so it might not benefit you personally :-(

Andres


Re: tsearch parser inefficiency if text includes urls or emails - new version

From: "Kevin Grittner"
Date:
Andres Freund <andres@anarazel.de> wrote:
> On Sunday 01 November 2009 16:19:43 Andres Freund wrote:
>> While playing around with/evaluating tsearch I noticed that to_tsvector
>> is obscenely slow for some files. After some profiling I found that
>> this is due to using a separate TSParser in p_ishost/p_isURLPath in
>> wparser_def.c. If a multibyte encoding is in use, TParserInit copies
>> the whole remaining input and converts it to wchar_t or pg_wchar -
>> for every email or protocol-prefixed URL in the document - which is
>> obviously bad.
>> 
>> I solved the issue by adding a separate TParserCopyInit/
>> TParserCopyClose pair which reuses the already converted strings of
>> the original TParser - only at different offsets.
>> 
>> Another approach would be to get rid of the separate parser
>> invocations - requiring a bunch of additional states. This seemed
>> more complex to me, so I wanted to get some feedback first.
>> Without patch:
>> Time: 5835.676 ms
>> With patch:
>> Time: 395.341 ms
> As nobody commented, here is a corrected (stupid thinko) and cleaned-up
> version.
I've taken this one for review, and have taken a preliminary look.
It is in context format, applies cleanly, and passes "make check".

Next I read through the code, and I have the same question that Andres
posed 12 days ago.  His patch massively reduces the cost of the parser
recursively calling itself for some cases, and it seems like the least
invasive way to modify the parser to solve this performance problem;
but it does raise the question of why a state machine like this should
recursively call itself when it hits certain states.

The patch mitigates the penalty for what seems to me to be an
existing ugly kludge.  Is it acceptable to address the problem in this
way, or should the much more invasive work be done to eliminate the
kludge?  (Note: I personally would much rather see the performance
penalty addressed this way, and a TODO added for the more invasive
work, than leave this alone for the next release if there's nobody
willing to tackle the problem at a more fundamental level.)

If nobody has a contrary opinion, I'll proceed with the review of this
patch and add something to the TODO page for the more invasive work.
-Kevin


Re: tsearch parser inefficiency if text includes urls or emails - new version

From: Alvaro Herrera
Date:
Kevin Grittner wrote:

> (Note: I personally would much rather see the performance
> penalty addressed this way, and a TODO added for the more invasive
> work, than leave this alone for the next release if there's nobody
> willing to tackle the problem at a more fundamental level.)

+1

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: tsearch parser inefficiency if text includes urls or emails - new version

From: Andres Freund
Date:
On Saturday 14 November 2009 01:03:33 Kevin Grittner wrote:
> It is in context format, applies cleanly, and passes "make check".
Unfortunately the latter does not say much - I had a bug there and it was not
found by the regression tests. Perhaps I should take a stab at adding at least
some more...

> It is in context format, applies cleanly, and passes "make check".
> Next I read through the code, and I have the same question that Andres
> posed 12 days ago.  His patch massively reduces the cost of the parser
> recursively calling itself for some cases, and it seems like the least
> invasive way to modify the parser to solve this performance problem;
> but it does raise the question of why a state machine like this should
> recursively call itself when it hits certain states.
I was wondering about that as well. I am not completely sure, but to me it
looks like it's just done to reduce the number of rules and states.

I have to say that the code is not exactly clear or well documented...

Andres


tsearch parser overhaul

From: "Kevin Grittner"
Date:
re:
http://archives.postgresql.org/pgsql-hackers/2009-11/msg00754.php

Alvaro Herrera <alvherre@commandprompt.com> wrote:
> Kevin Grittner wrote:
> 
>> (Note: I personally would much rather see the performance
>> penalty addressed this way, and a TODO added for the more
>> invasive work, than leave this alone for the next release if
>> there's nobody willing to tackle the problem at a more
>> fundamental level.)
> 
> +1
I haven't added a TODO yet because I'm not sure how to frame it.

I'm inclined to think that it would be no more work to replace the current
recursively called state engine with something easier to read and
understand than to try to fix the current oddities.  Perhaps
something along the lines of this?:

http://vo.astronet.ru/arxiv/dict_regex.html

I suspect we'd need to get it to use the same regexp code used
elsewhere in PostgreSQL.

Thoughts?
-Kevin


Re: tsearch parser overhaul

From: Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> I'm inclined to think that it would be no more work to replace the current
> recursively called state engine with something easier to read and
> understand than to try to fix the current oddities.  Perhaps
> something along the lines of this?:
> http://vo.astronet.ru/arxiv/dict_regex.html
> I suspect we'd need to get it to use the same regexp code used
> elsewhere in PostgreSQL.

We're certainly not going to start carrying two regexp engines,
so yeah ;-)

I guess if this is proposed as a replacement for the existing parser,
we'd need to see some speed comparisons.  I have no idea at all which
is faster.
        regards, tom lane