Thread: Very bad FTS performance with the Polish config
Hello This has been discussed in #postgresql and posted to -performance a couple days ago, but no solution has been found. The discussion can be found here: http://archives.postgresql.org/pgsql-performance/2009-11/msg00162.php I just finished implementing a "search engine" for my site and found ts_headline extremely slow when used with a Polish tsearch configuration, while fast with English. All of it boils down to a simple testcase, but first some background. I tested on 8.3.1 on G5/OSX 10.5.8 and Xeon/Gentoo AMD64-2008.0 (2.6.21), then switched both installations to 8.3.8 (both packages compiled from source, but provided by the distro - port/emerge). The Polish dictionaries and config were created according to this article (it's in Polish, but the code is self-explanatory): http://www.depesz.com/index.php/2008/04/22/polish-tsearch-in-83-polski-tsearch-w-postgresie-83/ Now for the testcase: text = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.' # explain analyze select ts_headline('polish', text, plainto_tsquery('polish', 'foobar')); QUERY PLAN ------------------------------------------------------------------------------------ Result (cost=0.00..0.01 rows=1 width=0)(actual time=6.407..6.470 rows=1 loops=1) Total runtime: 6.524 ms (2 rows) # explain analyze select ts_headline('english', text, plainto_tsquery('english', 'foobar')); QUERY PLAN ------------------------------------------------------------------------------------ Result (cost=0.00..0.01 rows=1 width=0)(actual time=0.861..0.895 rows=1 loops=1) Total runtime: 0.935 ms (2 rows) # explain analyze select ts_headline('simple', text, plainto_tsquery('simple', 'foobar')); QUERY PLAN ------------------------------------------------------------------------------------ Result (cost=0.00..0.01 rows=1 width=0)(actual time=0.627..0.660 rows=1 loops=1) Total runtime: 0.697 ms (2 rows) # As you can see, the results differ by an order of magnitude between Polish and English. While in this simple testcase it's a non-issue, in the real world this translates into enormous overhead. One of the queries I ran testing my site's search function took 1870ms. When I took that query and changed all ts_headline(foo) calls to just foo, the time dropped below 100ms. That's the difference between something completely unacceptable and something quite useful. I can post various details about the hardware, software and specific queries, but the testcases speak for themselves. I'm sure you can easily reproduce my results. I'm putting my code into production tomorrow, since I can't wait anymore. Hints would be very much appreciated! cheers, Wojciech Knapik PS. This issue is not related to the loading time of dictionaries, or calls to ts_headline for results that won't be displayed. A few other details can be found here http://pastie.textmate.org/private/hqnqfnsfsknjyjlffzmog along with snippets of my conversations in #postgresql that lead to this testcase. Big thanks to RhodiumToad for helping me with fts for the last couple days ;]
Wojciech Knapik escreveu: > PS. This issue is not related to the loading time of dictionaries, or > calls to ts_headline for results that won't be displayed. > So what? Could you post the profiling of that query? -- Euler Taveira de Oliveira http://www.timbira.com/
Euler Taveira de Oliveira wrote: >> PS. This issue is not related to the loading time of dictionaries, or >> calls to ts_headline for results that won't be displayed. > > So what? Could you post the profiling of that query? Polish: http://pastie.textmate.org/private/8lhmnbvde43lfjoxc52r1q English: http://pastie.textmate.org/private/4iaipottrmjmfxfykz94mw cheers, Wojciech Knapik PS. Sorry for the double post.
Wojciech Knapik escreveu: > > Euler Taveira de Oliveira wrote: > >>> PS. This issue is not related to the loading time of dictionaries, or >>> calls to ts_headline for results that won't be displayed. >> >> So what? Could you post the profiling of that query? > I was talking about gprof (--enable-profiling), oprofile [1] or similar tools. But it seems the slow step is the sort one. [1] http://wiki.postgresql.org/wiki/Profiling_with_OProfile -- Euler Taveira de Oliveira http://www.timbira.com/
Wojciech Knapik <webmaster@wolniartysci.pl> writes: > I tested on 8.3.1 on G5/OSX 10.5.8 and Xeon/Gentoo AMD64-2008.0 > (2.6.21), then switched both installations to 8.3.8 (both packages > compiled from source, but provided by the distro - port/emerge). The > Polish dictionaries and config were created according to this article > (it's in Polish, but the code is self-explanatory): > http://www.depesz.com/index.php/2008/04/22/polish-tsearch-in-83-polski-tsearch-w-postgresie-83/ I tried to duplicate this test, but got no further than here: u8=# CREATE TEXT SEARCH DICTIONARY polish_ispell ( TEMPLATE = ispell, DictFile = polish, AffFile = polish, StopWords= polish ); ERROR: syntax error CONTEXT: line 174 of configuration file "/home/tgl/testversion/share/postgresql/tsearch_data/polish.affix": " L E C > -C,G�EM #zalec (15a) " u8=# Seems there's something about the current version of the dictionary that we don't like. I used sjp-ispell-pl-20091117-src.tar.bz2 ... regards, tom lane
Wojciech, your polish_english, polish configurations uses ispell language and slow, while english configuration doesn't contains ispell. So, what's your complains ? Try add ispell dictionary to english configuration and see timings. Oleg On Wed, 18 Nov 2009, Wojciech Knapik wrote: > > Hello > > > This has been discussed in #postgresql and posted to -performance a > couple days ago, but no solution has been found. The discussion can be > found here: > http://archives.postgresql.org/pgsql-performance/2009-11/msg00162.php > > I just finished implementing a "search engine" for my site and found > ts_headline extremely slow when used with a Polish tsearch > configuration, while fast with English. All of it boils down to a simple > testcase, but first some background. > > I tested on 8.3.1 on G5/OSX 10.5.8 and Xeon/Gentoo AMD64-2008.0 > (2.6.21), then switched both installations to 8.3.8 (both packages > compiled from source, but provided by the distro - port/emerge). The > Polish dictionaries and config were created according to this article > (it's in Polish, but the code is self-explanatory): > > http://www.depesz.com/index.php/2008/04/22/polish-tsearch-in-83-polski-tsearch-w-postgresie-83/ > > Now for the testcase: > > text = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do > eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad > minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip > ex ea commodo consequat. Duis aute irure dolor in reprehenderit in > voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur > sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt > mollit anim id est laborum.' > > # explain analyze select ts_headline('polish', text, > plainto_tsquery('polish', 'foobar')); > QUERY PLAN > ------------------------------------------------------------------------------------ > Result (cost=0.00..0.01 rows=1 width=0) (actual time=6.407..6.470 > rows=1 loops=1) > Total runtime: 6.524 ms > (2 rows) > > # explain analyze select ts_headline('english', text, > plainto_tsquery('english', 'foobar')); > QUERY PLAN > ------------------------------------------------------------------------------------ > Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.861..0.895 > rows=1 loops=1) > Total runtime: 0.935 ms > (2 rows) > > # explain analyze select ts_headline('simple', text, > plainto_tsquery('simple', 'foobar')); > QUERY PLAN > ------------------------------------------------------------------------------------ > Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.627..0.660 > rows=1 loops=1) > Total runtime: 0.697 ms > (2 rows) > > # > > As you can see, the results differ by an order of magnitude between > Polish and English. While in this simple testcase it's a non-issue, in > the real world this translates into enormous overhead. > > One of the queries I ran testing my site's search function took > 1870ms. When I took that query and changed all ts_headline(foo) calls to > just foo, the time dropped below 100ms. That's the difference between > something completely unacceptable and something quite useful. > > I can post various details about the hardware, software and specific > queries, but the testcases speak for themselves. I'm sure you can easily > reproduce my results. > > I'm putting my code into production tomorrow, since I can't wait > anymore. Hints would be very much appreciated! > > > cheers, > Wojciech Knapik > > PS. This issue is not related to the loading time of dictionaries, or > calls to ts_headline for results that won't be displayed. A few other > details can be found here > http://pastie.textmate.org/private/hqnqfnsfsknjyjlffzmog along with > snippets of my conversations in #postgresql that lead to this testcase. > Big thanks to RhodiumToad for helping me with fts for the last couple > days ;] > > > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Tom Lane wrote: >> I tested on 8.3.1 on G5/OSX 10.5.8 and Xeon/Gentoo AMD64-2008.0 >> (2.6.21), then switched both installations to 8.3.8 (both packages >> compiled from source, but provided by the distro - port/emerge). The >> Polish dictionaries and config were created according to this article >> (it's in Polish, but the code is self-explanatory): > >> http://www.depesz.com/index.php/2008/04/22/polish-tsearch-in-83-polski-tsearch-w-postgresie-83/ > > I tried to duplicate this test, but got no further than here: > > u8=# CREATE TEXT SEARCH DICTIONARY polish_ispell ( > TEMPLATE = ispell, > DictFile = polish, > AffFile = polish, > StopWords = polish > ); > ERROR: syntax error > CONTEXT: line 174 of configuration file "/home/tgl/testversion/share/postgresql/tsearch_data/polish.affix": " L E C > -C,GĹEM #zalec (15a) > " > u8=# > > Seems there's something about the current version of the dictionary that > we don't like. I used sjp-ispell-pl-20091117-src.tar.bz2 ... Here are the files I used (polish.affix, polish.dict already generated): http://wolniartysci.pl/pl.tar.gz These should work fine. I'd be grateful if you could test and see if you get similar results. cheers, Wojciech Knapik PS. Weird, I get the emails without a reply-to set for the list..
> your polish_english, polish configurations uses ispell language and slow, > while english configuration doesn't contains ispell. So, what's your > complains ? Try add ispell dictionary to english configuration and see > timings. Oh, so this is not anomalous ? These are the expected speeds for an ispell dictionary ? I didn't realize that. Sorry for the bother then. It just seemed way too slow to be practical. cheers, Wojciech Knapik
On Wed, 18 Nov 2009, Wojciech Knapik wrote: > >> your polish_english, polish configurations uses ispell language and slow, >> while english configuration doesn't contains ispell. So, what's your >> complains ? Try add ispell dictionary to english configuration and see >> timings. > > Oh, so this is not anomalous ? These are the expected speeds for an ispell > dictionary ? I didn't realize that. Sorry for the bother then. It just seemed > way too slow to be practical. You can see real timings using ts_lexize() function for different dictionaries (try several time to avoid cold-start problem) instead of ts_headline(), which involves other factors. On my test machine I see no real difference between very simple dictionary and french ispell, snowball dictionaries: dev-oleg=# select ts_lexize('simple','voila'); ts_lexize ----------- {voila} (1 row) Time: 0.282 ms dev-oleg=# select ts_lexize('simple','voila'); ts_lexize ----------- {voila} (1 row) Time: 0.269 ms dev-oleg=# select ts_lexize('french_stem','voila'); ts_lexize ----------- {voil} (1 row) Time: 0.187 ms I see no big difference in ts_headline as well: dev-oleg=# select ts_headline('english','I can do voila', 'voila'::tsquery); ts_headline ----------------------- I can do <b>voila</b> (1 row) Time: 0.265 ms dev-oleg=# select ts_headline('nomaofr','I can do voila', 'voila'::tsquery); ts_headline ----------------------- I can do <b>voila</b> (1 row) Time: 0.299 ms This is 8.4.1 version of PostgreSQL. Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov wrote: >>> your polish_english, polish configurations uses ispell language >>> and slow, while english configuration doesn't contains ispell. >>> So, what's your complains ? Try add ispell dictionary to english >>> configuration and see timings. >> >> Oh, so this is not anomalous ? These are the expected speeds for an >> ispell dictionary ? I didn't realize that. Sorry for the bother >> then. It just seemed way too slow to be practical. > > You can see real timings using ts_lexize() function for different > dictionaries (try several time to avoid cold-start problem) instead > of ts_headline(), which involves other factors. > > On my test machine I see no real difference between very simple > dictionary and french ispell, snowball dictionaries: ts_lexize seems to be just as fast for simple, polish_ispell and english_stem with the 'voila' argument. polish_ispell is in fact *faster* for the lorem ipsum text repeated a couple times (10 ?). Which suggests that the issue is with ts_headline iteself. > I see no big difference in ts_headline as well: > > dev-oleg=# select ts_headline('english','I can do voila', > 'voila'::tsquery); > ts_headline > ----------------------- > I can do <b>voila</b> > (1 row) > > Time: 0.265 ms Yes, for 4-word texts the results are similar. Try that with a longer text and the difference becomes more and more significant. For the lorem ipsum text, 'polish' is about 4 times slower, than 'english'. For 5 repetitions of the text, it's 6 times, for 10 repetitions - 7.5 times... > This is 8.4.1 version of PostgreSQL. An that was 8.3.8/OSX. cheers, Wojciech Knapik
2009/11/18 Oleg Bartunov <oleg@sai.msu.su>: > On Wed, 18 Nov 2009, Wojciech Knapik wrote: > >> >>> your polish_english, polish configurations uses ispell language and slow, >>> while english configuration doesn't contains ispell. So, what's your >>> complains ? Try add ispell dictionary to english configuration and see >>> timings. >> >> Oh, so this is not anomalous ? These are the expected speeds for an ispell >> dictionary ? I didn't realize that. Sorry for the bother then. It just >> seemed way too slow to be practical. > > You can see real timings using ts_lexize() function for different > dictionaries > (try several time to avoid cold-start problem) instead of ts_headline(), > which involves other factors. > > On my test machine I see no real difference between very simple dictionary > and french ispell, snowball dictionaries: > It's depend on language (and dictionary sizes). for czech: postgres=# select ts_lexize('simple','vody');ts_lexize -----------{vody} (1 row) Time: 0.785 ms postgres=# select ts_lexize('cspell','vody');ts_lexize -----------{voda} (1 row) Time: 1.041 ms I afraid so czech and polland language is very hard (with long affix file). Regards Pavel > dev-oleg=# select ts_lexize('simple','voila'); > ts_lexize > ----------- > {voila} > (1 row) > > Time: 0.282 ms > dev-oleg=# select ts_lexize('simple','voila'); > ts_lexize > ----------- > {voila} > (1 row) > > Time: 0.269 ms > > dev-oleg=# select ts_lexize('french_stem','voila'); > ts_lexize > ----------- > {voil} > (1 row) > > Time: 0.187 ms > > I see no big difference in ts_headline as well: > > dev-oleg=# select ts_headline('english','I can do voila', 'voila'::tsquery); > ts_headline > ----------------------- > I can do <b>voila</b> > (1 row) > > Time: 0.265 ms > dev-oleg=# select ts_headline('nomaofr','I can do voila', 'voila'::tsquery); > ts_headline > ----------------------- > I can do <b>voila</b> > (1 row) > > Time: 0.299 ms > > This is 8.4.1 version of PostgreSQL. > > Regards, > Oleg > _____________________________________________________________ > Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), > Sternberg Astronomical Institute, Moscow University, Russia > Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > phone: +007(495)939-16-83, +007(495)939-23-83 > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
On Wed, 18 Nov 2009, Wojciech Knapik wrote: > Yes, for 4-word texts the results are similar. > Try that with a longer text and the difference becomes more and more > significant. For the lorem ipsum text, 'polish' is about 4 times slower, than > 'english'. For 5 repetitions of the text, it's 6 times, for 10 repetitions - > 7.5 times... Again, I see nothing unclear here, since dictionaries (as specified in configuration) apply to ALL words in document. The more words in document, the more overhead. You can pass not all document to ts_headline, but just part, to have predicted performance. This is useful in any case. Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov wrote: >> Yes, for 4-word texts the results are similar. >> Try that with a longer text and the difference becomes more and more >> significant. For the lorem ipsum text, 'polish' is about 4 times >> slower, than 'english'. For 5 repetitions of the text, it's 6 times, >> for 10 repetitions - 7.5 times... > > Again, I see nothing unclear here, since dictionaries (as specified > in configuration) apply to ALL words in document. The more words in > document, the more overhead. You're missing the point. I'm not surprised that the function takes more time for larger input texts - that's obvious. The thing is, the computation times rise more steeply when the Polish config is used. Steeply enough, that the difference between the Polish and English configs becomes enormous in practical cases. Now this may be expected behaviour, but since I don't know if it is, I posted to the mailing lists to find out. If you're saying this is ok and there's nothing to fix here, then there's nothing more to discuss and we may consider the thread closed. If not, ts_headline deserves a closer look. cheers, Wojciech Knapik
ts_headline calls ts_lexize equivalent to break the text. Off course there is algorithm to process the tokens and generate the headline. I would be really surprised if the algorithm to generate the headline is somehow dependent on language (as it only processes the tokens). So Oleg is right when he says ts_lexize is something to be checked.
I will try to replicate what you are trying to do but in the meantime can you run the same ts_headline under psql multiple times and paste the result.
-Sushant.
I will try to replicate what you are trying to do but in the meantime can you run the same ts_headline under psql multiple times and paste the result.
-Sushant.
2009/11/19 Wojciech Knapik <webmaster@wolniartysci.pl>
You're missing the point. I'm not surprised that the function takes more time for larger input texts - that's obvious. The thing is, the computation times rise more steeply when the Polish config is used. Steeply enough, that the difference between the Polish and English configs becomes enormous in practical cases.
Oleg Bartunov wrote:Yes, for 4-word texts the results are similar.
Try that with a longer text and the difference becomes more and more significant. For the lorem ipsum text, 'polish' is about 4 times slower, than 'english'. For 5 repetitions of the text, it's 6 times, for 10 repetitions - 7.5 times...
Again, I see nothing unclear here, since dictionaries (as specified
in configuration) apply to ALL words in document. The more words in document, the more overhead.
Now this may be expected behaviour, but since I don't know if it is, I posted to the mailing lists to find out. If you're saying this is ok and there's nothing to fix here, then there's nothing more to discuss and we may consider the thread closed.
If not, ts_headline deserves a closer look.
cheers,
Wojciech Knapik
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
> ts_headline calls ts_lexize equivalent to break the text. Off course > there is algorithm to process the tokens and generate the headline. I > would be really surprised if the algorithm to generate the headline is > somehow dependent on language (as it only processes the tokens). So Oleg > is right when he says ts_lexize is something to be checked. ts_lexize performs well for all dictionaries, nothing to see here. It's ts_headline that's causing the problem. But that's just IMHO. > I will try to replicate what you are trying to do but in the meantime > can you run the same ts_headline under psql multiple times and paste the > result. All the results I pasted had the medium run time out of multiple calls. These were certainly not some extreme corner cases. cheers, Wojciech Knapik
Wojciech Knapik <webmaster@wolniartysci.pl> writes: > Tom Lane wrote: >> I tried to duplicate this test, but got no further than here: >> ERROR: syntax error >> CONTEXT: line 174 of configuration file "/home/tgl/testversion/share/postgresql/tsearch_data/polish.affix": " L E C > -C,G�EM #zalec (15a) > Here are the files I used (polish.affix, polish.dict already generated): > http://wolniartysci.pl/pl.tar.gz Your files were the same as mine. I eventually figured out the problem was I was using C locale, in which some of those letters aren't letters. (I wonder whether the tsearch config file parsers could be made less sensitive to this by avoiding t_isalpha tests.) In pl_PL.ut8 locale I could see that the example is indeed much slower. Oleg is right that the fundamental difference is that this Polish configuration is using an ispell dictionary where the simple English configuration is not. But, just for the record, here's what an oprofile profile looks like: samples % image name symbol name 7480 20.9477 postgres RS_execute 5370 15.0386 postgres pg_utf_mblen 4138 11.5884 postgres pg_mblen 3756 10.5187 postgres mb_strchr 2880 8.0654 postgres FindWord 2754 7.7126 postgres CheckAffix 1576 4.4136 postgres NormalizeSubWord 966 2.7053 postgres FindAffixes 896 2.5092 postgres TParserGet 742 2.0780 postgres AllocSetAlloc 420 1.1762 postgres AllocSetFree 396 1.1090 postgres addHLParsedLex 384 1.0754 postgres LexizeExec So about 55% of the time is going into affix pattern matching. I wonder whether that couldn't be made faster. A lot of the cycles are spent on coping with variable-length characters --- perhaps the ispell code should convert to wchar representation before doing this? regards, tom lane
Tom Lane wrote: *SNIP* > So about 55% of the time is going into affix pattern matching. > I wonder whether that couldn't be made faster. A lot of the cycles > are spent on coping with variable-length characters --- perhaps the > ispell code should convert to wchar representation before doing this? Thanks a lot for looking into this. I hope this will lead to some improvements one day. Unfortunately my C skills were pretty basic years ago and I haven't used the language since, so I can't be of much help.. cheers, Wojciech Knapik