Thread: [PATCH] tsearch parser inefficiency if text includes urls or emails
Hi,

While playing around with/evaluating tsearch I noticed that to_tsvector is obscenely slow for some files. After some profiling I found that this is due to using a separate TSParser in p_ishost/p_isURLPath in wparser_def.c. If a multibyte encoding is in use, TParserInit copies the whole remaining input and converts it to wchar_t or pg_wchar - for every email or protocol-prefixed url in the document. Which obviously is bad.

I solved the issue by having a separate TParserCopyInit/TParserCopyClose which reuses the already converted strings of the original TParser - only at different offsets.

Another approach would be to get rid of the separate parser invocations - requiring a bunch of additional states. This seemed more complex to me, so I wanted to get some feedback first.

Without patch:
andres=# SELECT to_tsvector('english', document) FROM document WHERE filename = '/usr/share/doc/libdrm-nouveau1/changelog';
...
(1 row)
Time: 5835.676 ms

With patch:
andres=# SELECT to_tsvector('english', document) FROM document WHERE filename = '/usr/share/doc/libdrm-nouveau1/changelog';
...
(1 row)
Time: 395.341 ms

I'll clean up the patch if it seems like a sensible solution... Is this backpatch-worthy?

Andres

PS: I left the additional define in for the moment so that it's easier to see the performance differences.
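The buffer-sharing idea can be sketched in miniature. This is hypothetical illustration code, not the real TParser from wparser_def.c: the struct and function names (MiniParser, mini_init, mini_copy_init, mini_close) are invented for the example, and it shows only why re-converting the remaining input per nested parser is expensive while borrowing the parent's converted buffer is cheap.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical, simplified stand-in for the parser state -- just
 * enough to show the buffer-sharing idea from the patch. */
typedef struct MiniParser
{
	wchar_t    *wstr;			/* input converted to wide chars */
	size_t		len;			/* length in wide chars */
	size_t		pos;			/* current scan position */
	int			owns_wstr;		/* does this parser own the buffer? */
} MiniParser;

/*
 * Pre-patch behaviour: every nested parser start converts the whole
 * remaining input again, costing O(remaining input) per URL/email found.
 */
MiniParser *
mini_init(const char *str)
{
	MiniParser *p = malloc(sizeof(MiniParser));
	size_t		n = strlen(str);

	p->wstr = malloc((n + 1) * sizeof(wchar_t));
	p->len = mbstowcs(p->wstr, str, n + 1);	/* the expensive copy+convert */
	p->pos = 0;
	p->owns_wstr = 1;
	return p;
}

/*
 * Patched behaviour: the nested parser borrows the parent's already
 * converted buffer at an offset, so setup becomes O(1).
 */
MiniParser *
mini_copy_init(const MiniParser *parent, size_t offset)
{
	MiniParser *p = malloc(sizeof(MiniParser));

	p->wstr = parent->wstr + offset;	/* share, do not reconvert */
	p->len = parent->len - offset;
	p->pos = 0;
	p->owns_wstr = 0;
	return p;
}

void
mini_close(MiniParser *p)
{
	if (p->owns_wstr)
		free(p->wstr);
	free(p);
}
```

The owns_wstr flag is the one subtlety: the copied parser must not free the shared buffer on close, only the original owner does.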
Re: [PATCH] tsearch parser inefficiency if text includes urls or emails - new version
From
Andres Freund
Date:
On Sunday 01 November 2009 16:19:43 Andres Freund wrote:
> While playing around with/evaluating tsearch I noticed that to_tsvector is
> obscenely slow for some files. After some profiling I found that this is
> due to using a separate TSParser in p_ishost/p_isURLPath in wparser_def.c.
> If a multibyte encoding is in use, TParserInit copies the whole remaining
> input and converts it to wchar_t or pg_wchar - for every email or
> protocol-prefixed url in the document. Which obviously is bad.
>
> I solved the issue by having a separate TParserCopyInit/TParserCopyClose
> which reuses the already converted strings of the original TParser - only
> at different offsets.
>
> Another approach would be to get rid of the separate parser invocations -
> requiring a bunch of additional states. This seemed more complex to me, so
> I wanted to get some feedback first.
>
> Without patch:
> Time: 5835.676 ms
>
> With patch:
> Time: 395.341 ms
>
> I'll clean up the patch if it seems like a sensible solution...

As nobody commented, here is a corrected (stupid thinko) and cleaned up
version. Does anyone care to comment on whether I am the only one thinking
this is an issue?

Andres
Re: [PATCH] tsearch parser inefficiency if text includes urls or emails - new version
From
Kenneth Marshall
Date:
On Sun, Nov 08, 2009 at 05:00:53PM +0100, Andres Freund wrote:
> On Sunday 01 November 2009 16:19:43 Andres Freund wrote:
> > While playing around with/evaluating tsearch I noticed that to_tsvector
> > is obscenely slow for some files. After some profiling I found that this
> > is due to using a separate TSParser in p_ishost/p_isURLPath in
> > wparser_def.c. If a multibyte encoding is in use, TParserInit copies the
> > whole remaining input and converts it to wchar_t or pg_wchar - for every
> > email or protocol-prefixed url in the document. Which obviously is bad.
> >
> > I solved the issue by having a separate TParserCopyInit/TParserCopyClose
> > which reuses the already converted strings of the original TParser -
> > only at different offsets.
> >
> > Without patch:
> > Time: 5835.676 ms
> >
> > With patch:
> > Time: 395.341 ms
> >
> > I'll clean up the patch if it seems like a sensible solution...
> As nobody commented, here is a corrected (stupid thinko) and cleaned up
> version. Does anyone care to comment on whether I am the only one thinking
> this is an issue?
>
> Andres

+1

As a user of tsearch, I can certainly appreciate the speed-up in parsing --
more CPU for everyone else.

Regards,
Ken
Re: [PATCH] tsearch parser inefficiency if text includes urls or emails - new version
From
Andres Freund
Date:
On Sunday 08 November 2009 17:41:15 Kenneth Marshall wrote:
> On Sun, Nov 08, 2009 at 05:00:53PM +0100, Andres Freund wrote:
> > As nobody commented, here is a corrected (stupid thinko) and cleaned up
> > version. Does anyone care to comment on whether I am the only one
> > thinking this is an issue?
> > Andres
> +1
> As a user of tsearch, I can certainly appreciate the speed-up in parsing
> -- more CPU for everyone else.

Please note that this is mostly an issue when using rather long documents
including either email addresses (i.e. an @) or links with protocol prefixes
(like ?+://) - so it might not give you personally a benefit :-(

Andres
Re: tsearch parser inefficiency if text includes urls or emails - new version
From
"Kevin Grittner"
Date:
Andres Freund <andres@anarazel.de> wrote:
> On Sunday 01 November 2009 16:19:43 Andres Freund wrote:
>> While playing around with/evaluating tsearch I noticed that to_tsvector
>> is obscenely slow for some files. After some profiling I found that
>> this is due to using a separate TSParser in p_ishost/p_isURLPath in
>> wparser_def.c. If a multibyte encoding is in use, TParserInit copies
>> the whole remaining input and converts it to wchar_t or pg_wchar -
>> for every email or protocol-prefixed url in the document. Which
>> obviously is bad.
>>
>> I solved the issue by having a separate TParserCopyInit/
>> TParserCopyClose which reuses the already converted strings of
>> the original TParser - only at different offsets.
>>
>> Another approach would be to get rid of the separate parser
>> invocations - requiring a bunch of additional states. This seemed
>> more complex to me, so I wanted to get some feedback first.
>> Without patch:
>> Time: 5835.676 ms
>> With patch:
>> Time: 395.341 ms
> As nobody commented, here is a corrected (stupid thinko) and cleaned
> up version.

I've taken this one for review, and have taken a preliminary look. It is
in context format, applies cleanly, and passes "make check".

Next I read through the code, and have the same question that Andres posed
12 days ago. His patch massively reduces the cost of the parser recursively
calling itself for some cases, and it seems like the least invasive way to
modify the parser to solve this performance problem; but it does beg the
question of why a state machine like this should recursively call itself
when it hits certain states. The patch is mitigating the penalty for what
seems to me to be an existing ugly kludge. Is it acceptable to address the
problem in this way, or should the much more invasive work be done to
eliminate the kludge?

(Note: I personally would much rather see the performance penalty addressed
this way, and a TODO added for the more invasive work, than to leave this
alone for the next release if there's nobody willing to tackle the problem
at a more fundamental level.)

If nobody has a contrary opinion, I'll proceed with the review of this
patch and add something to the TODO page for the more invasive work.

-Kevin
Re: tsearch parser inefficiency if text includes urls or emails - new version
From
Alvaro Herrera
Date:
Kevin Grittner wrote:
> (Note: I personally would much rather see the performance penalty
> addressed this way, and a TODO added for the more invasive work, than to
> leave this alone for the next release if there's nobody willing to tackle
> the problem at a more fundamental level.)

+1

-- 
Alvaro Herrera                               http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: tsearch parser inefficiency if text includes urls or emails - new version
From
Andres Freund
Date:
On Saturday 14 November 2009 01:03:33 Kevin Grittner wrote:
> It is in context format, applies cleanly, and passes "make check".

Unfortunately the latter is not saying much - I had a bug there and it was
not found by the regression tests. Perhaps I should take a stab and add at
least some more...

> Next I read through the code, and have the same question that Andres
> posed 12 days ago. His patch massively reduces the cost of the parser
> recursively calling itself for some cases, and it seems like the least
> invasive way to modify the parser to solve this performance problem;
> but it does beg the question of why a state machine like this should
> recursively call itself when it hits certain states.

I was wondering about that as well. I am not completely sure, but to me it
looks like it's just done to reduce the number of rules and states. I have
to say that that code is not exactly clear and well documented...

Andres
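The rules-versus-recursion trade-off being guessed at here can be shown with a toy example. This is not the actual wparser_def.c machine; scan_host and url_host_len are invented names, and the real parser recognises far more token classes. The point is only that, when the outer scan hits a protocol prefix, calling a fresh inner scan on the remainder avoids duplicating all the host-recognition states inside the outer machine:

```c
#include <stddef.h>
#include <string.h>

/* Inner "parser": length of a leading host-like run of [a-z0-9.-].
 * In the real code this role is played by a whole second TParser. */
static size_t
scan_host(const char *s)
{
	size_t		n = 0;

	while (s[n] != '\0' &&
		   strchr("abcdefghijklmnopqrstuvwxyz0123456789.-", s[n]) != NULL)
		n++;
	return n;
}

/*
 * Outer "parser": on finding a protocol prefix it simply invokes the
 * inner scan on the remainder, instead of encoding host-recognition
 * states into its own state machine -- fewer rules and states, at the
 * price of a nested parser invocation (which is exactly where the
 * reconversion cost showed up before the patch).
 */
size_t
url_host_len(const char *s)
{
	const char *p = strstr(s, "://");

	return p ? scan_host(p + 3) : 0;
}
```

Folding scan_host's logic into the outer machine would remove the nested invocation but multiply the outer machine's state table, which is presumably why the original authors chose recursion.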
Re: tsearch parser inefficiency if text includes urls or emails - new version
From
"Kevin Grittner"
Date:
re: http://archives.postgresql.org/pgsql-hackers/2009-11/msg00754.php

Alvaro Herrera <alvherre@commandprompt.com> wrote:
> Kevin Grittner wrote:
>
>> (Note: I personally would much rather see the performance
>> penalty addressed this way, and a TODO added for the more
>> invasive work, than to leave this alone for the next release if
>> there's nobody willing to tackle the problem at a more
>> fundamental level.)
>
> +1

I haven't added a TODO yet because I'm not sure how to frame it. I'm
inclined to think it would be no more work to replace the current
recursively called state engine with something easier to read and
understand than to try to fix the current oddities. Perhaps something
along the lines of this?:

http://vo.astronet.ru/arxiv/dict_regex.html

I suspect we'd need to get it to use the same regexp code used elsewhere
in PostgreSQL.

Thoughts?

-Kevin
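One way to picture the regex-driven alternative: each token class becomes a pattern tried at the current input position, replacing hand-written parser states. The sketch below uses POSIX <regex.h> purely for illustration (the pattern and the function name match_url_prefix are made up here, and an in-tree version would have to use PostgreSQL's own regex engine rather than a second one):

```c
#include <regex.h>
#include <stddef.h>

/*
 * Illustrative only: recognise a protocol-prefixed URL at the start of
 * the input with one compiled regex instead of hand-written states.
 * Returns the match length, or 0 if no URL starts here.
 */
size_t
match_url_prefix(const char *s)
{
	static const char pattern[] =
	"^[a-z][a-z0-9+.-]*://[a-z0-9.-]+(/[^ ]*)?";
	regex_t		re;
	regmatch_t	m;
	size_t		len = 0;

	if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE) != 0)
		return 0;				/* pattern is a constant; should not fail */
	if (regexec(&re, s, 1, &m, 0) == 0)
		len = (size_t) (m.rm_eo - m.rm_so);
	regfree(&re);
	return len;
}
```

A real tokenizer would compile the patterns once up front and try one pattern per token class, so the speed comparison Tom asks about would hinge on how those repeated pattern attempts fare against the current single-pass state machine.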
Re: tsearch parser inefficiency if text includes urls or emails - new version
From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> I'm inclined to think it would be no more work to replace the current
> recursively called state engine with something easier to read and
> understand than to try to fix the current oddities. Perhaps
> something along the lines of this?:
> http://vo.astronet.ru/arxiv/dict_regex.html
> I suspect we'd need to get it to use the same regexp code used
> elsewhere in PostgreSQL.

We're certainly not going to start carrying two regexp engines, so yeah ;-)

I guess if this is proposed as a replacement for the existing parser, we'd
need to see some speed comparisons. I have no idea at all which is faster.

			regards, tom lane