Thread: Very bad FTS performance with the Polish config

Very bad FTS performance with the Polish config

From

Wojciech Knapik

Date:

17 November 2009, 19:30:16

Hello

 This has been discussed in #postgresql and posted to -performance a
couple days ago, but no solution has been found. The discussion can be
found here:
http://archives.postgresql.org/pgsql-performance/2009-11/msg00162.php
 I just finished implementing a "search engine" for my site and found
ts_headline extremely slow when used with a Polish tsearch
configuration, while fast with English. All of it boils down to a simple
testcase, but first some background.
 I tested on 8.3.1 on G5/OSX 10.5.8 and Xeon/Gentoo AMD64-2008.0
(2.6.21), then switched both installations to 8.3.8 (both packages
compiled from source, but provided by the distro - port/emerge). The
Polish dictionaries and config were created according to this article
(it's in Polish, but the code is self-explanatory):

http://www.depesz.com/index.php/2008/04/22/polish-tsearch-in-83-polski-tsearch-w-postgresie-83/
 Now for the testcase:

text = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip
ex ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur
sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt
mollit anim id est laborum.'

# explain analyze select ts_headline('polish', text,
plainto_tsquery('polish', 'foobar'));                                     QUERY PLAN
------------------------------------------------------------------------------------ Result  (cost=0.00..0.01 rows=1
width=0)(actual time=6.407..6.470
 
rows=1 loops=1) Total runtime: 6.524 ms
(2 rows)

# explain analyze select ts_headline('english', text,
plainto_tsquery('english', 'foobar'));                                     QUERY PLAN
------------------------------------------------------------------------------------ Result  (cost=0.00..0.01 rows=1
width=0)(actual time=0.861..0.895
 
rows=1 loops=1) Total runtime: 0.935 ms
(2 rows)

# explain analyze select ts_headline('simple', text,
plainto_tsquery('simple', 'foobar'));                                     QUERY PLAN
------------------------------------------------------------------------------------ Result  (cost=0.00..0.01 rows=1
width=0)(actual time=0.627..0.660
 
rows=1 loops=1) Total runtime: 0.697 ms
(2 rows)

#
 As you can see, the results differ by an order of magnitude between
Polish and English. While in this simple testcase it's a non-issue, in
the real world this translates into enormous overhead.
 One of the queries I ran testing my site's search function took
1870ms. When I took that query and changed all ts_headline(foo) calls to
just foo, the time dropped below 100ms. That's the difference between
something completely unacceptable and something quite useful.
 I can post various details about the hardware, software and specific
queries, but the testcases speak for themselves. I'm sure you can easily
reproduce my results.
 I'm putting my code into production tomorrow, since I can't wait
anymore. Hints would be very much appreciated!


cheers,
Wojciech Knapik

PS. This issue is not related to the loading time of dictionaries, or
calls to ts_headline for results that won't be displayed. A few other
details can be found here
http://pastie.textmate.org/private/hqnqfnsfsknjyjlffzmog along with
snippets of my conversations in #postgresql that lead to this testcase.
Big thanks to RhodiumToad for helping me with fts for the last couple
days ;]

Re: Very bad FTS performance with the Polish config

From

Euler Taveira de Oliveira

Date:

17 November 2009, 19:49:02

Wojciech Knapik escreveu:
> PS. This issue is not related to the loading time of dictionaries, or
> calls to ts_headline for results that won't be displayed.
> 
So what? Could you post the profiling of that query?


--  Euler Taveira de Oliveira http://www.timbira.com/

Re: Very bad FTS performance with the Polish config

From

Wojciech Knapik

Date:

17 November 2009, 20:04:05

Euler Taveira de Oliveira wrote:

>> PS. This issue is not related to the loading time of dictionaries, or
>> calls to ts_headline for results that won't be displayed.
>
> So what? Could you post the profiling of that query?

Polish:
http://pastie.textmate.org/private/8lhmnbvde43lfjoxc52r1q

English:
http://pastie.textmate.org/private/4iaipottrmjmfxfykz94mw

cheers,
Wojciech Knapik

PS. Sorry for the double post.

Re: Very bad FTS performance with the Polish config

From

Euler Taveira de Oliveira

Date:

17 November 2009, 21:27:16

Wojciech Knapik escreveu:
> 
> Euler Taveira de Oliveira wrote:
> 
>>> PS. This issue is not related to the loading time of dictionaries, or
>>> calls to ts_headline for results that won't be displayed.
>>
>> So what? Could you post the profiling of that query?
> 
I was talking about gprof (--enable-profiling), oprofile [1] or similar tools.
But it seems the slow step is the sort one.


[1] http://wiki.postgresql.org/wiki/Profiling_with_OProfile


--  Euler Taveira de Oliveira http://www.timbira.com/

Re: Very bad FTS performance with the Polish config

From

Tom Lane

Date:

17 November 2009, 23:51:50

Wojciech Knapik <webmaster@wolniartysci.pl> writes:
>   I tested on 8.3.1 on G5/OSX 10.5.8 and Xeon/Gentoo AMD64-2008.0
> (2.6.21), then switched both installations to 8.3.8 (both packages
> compiled from source, but provided by the distro - port/emerge). The
> Polish dictionaries and config were created according to this article
> (it's in Polish, but the code is self-explanatory):

> http://www.depesz.com/index.php/2008/04/22/polish-tsearch-in-83-polski-tsearch-w-postgresie-83/

I tried to duplicate this test, but got no further than here:

u8=# CREATE TEXT SEARCH DICTIONARY polish_ispell (   TEMPLATE = ispell,   DictFile = polish,   AffFile = polish,
StopWords= polish
 
);
ERROR:  syntax error
CONTEXT:  line 174 of configuration file "/home/tgl/testversion/share/postgresql/tsearch_data/polish.affix": "  L E C
               >       -C,G�EM         #zalec (15a)
 
"
u8=# 

Seems there's something about the current version of the dictionary that
we don't like.  I used sjp-ispell-pl-20091117-src.tar.bz2 ...
        regards, tom lane

Re: Very bad FTS performance with the Polish config

From

Oleg Bartunov

Date:

18 November 2009, 05:11:33

Wojciech,

your polish_english, polish configurations uses ispell language and slow,
while english configuration doesn't contains ispell. So, what's your
complains ? Try add ispell dictionary to english configuration and see
timings.

Oleg

On Wed, 18 Nov 2009, Wojciech Knapik wrote:

>
> Hello
>
>
> This has been discussed in #postgresql and posted to -performance a
> couple days ago, but no solution has been found. The discussion can be
> found here:
> http://archives.postgresql.org/pgsql-performance/2009-11/msg00162.php
>
> I just finished implementing a "search engine" for my site and found
> ts_headline extremely slow when used with a Polish tsearch
> configuration, while fast with English. All of it boils down to a simple
> testcase, but first some background.
>
> I tested on 8.3.1 on G5/OSX 10.5.8 and Xeon/Gentoo AMD64-2008.0
> (2.6.21), then switched both installations to 8.3.8 (both packages
> compiled from source, but provided by the distro - port/emerge). The
> Polish dictionaries and config were created according to this article
> (it's in Polish, but the code is self-explanatory):
>
> http://www.depesz.com/index.php/2008/04/22/polish-tsearch-in-83-polski-tsearch-w-postgresie-83/
>
> Now for the testcase:
>
> text = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do
> eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
> minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip
> ex ea commodo consequat. Duis aute irure dolor in reprehenderit in
> voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur
> sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt
> mollit anim id est laborum.'
>
> # explain analyze select ts_headline('polish', text,
> plainto_tsquery('polish', 'foobar'));
>                                     QUERY PLAN
> ------------------------------------------------------------------------------------
> Result  (cost=0.00..0.01 rows=1 width=0) (actual time=6.407..6.470
> rows=1 loops=1)
> Total runtime: 6.524 ms
> (2 rows)
>
> # explain analyze select ts_headline('english', text,
> plainto_tsquery('english', 'foobar'));
>                                     QUERY PLAN
> ------------------------------------------------------------------------------------
> Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.861..0.895
> rows=1 loops=1)
> Total runtime: 0.935 ms
> (2 rows)
>
> # explain analyze select ts_headline('simple', text,
> plainto_tsquery('simple', 'foobar'));
>                                     QUERY PLAN
> ------------------------------------------------------------------------------------
> Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.627..0.660
> rows=1 loops=1)
> Total runtime: 0.697 ms
> (2 rows)
>
> #
>
> As you can see, the results differ by an order of magnitude between
> Polish and English. While in this simple testcase it's a non-issue, in
> the real world this translates into enormous overhead.
>
> One of the queries I ran testing my site's search function took
> 1870ms. When I took that query and changed all ts_headline(foo) calls to
> just foo, the time dropped below 100ms. That's the difference between
> something completely unacceptable and something quite useful.
>
> I can post various details about the hardware, software and specific
> queries, but the testcases speak for themselves. I'm sure you can easily
> reproduce my results.
>
> I'm putting my code into production tomorrow, since I can't wait
> anymore. Hints would be very much appreciated!
>
>
> cheers,
> Wojciech Knapik
>
> PS. This issue is not related to the loading time of dictionaries, or
> calls to ts_headline for results that won't be displayed. A few other
> details can be found here
> http://pastie.textmate.org/private/hqnqfnsfsknjyjlffzmog along with
> snippets of my conversations in #postgresql that lead to this testcase.
> Big thanks to RhodiumToad for helping me with fts for the last couple
> days ;]
>
>
>
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Very bad FTS performance with the Polish config

From

Wojciech Knapik

Date:

18 November 2009, 05:33:38

Tom Lane wrote:

>>   I tested on 8.3.1 on G5/OSX 10.5.8 and Xeon/Gentoo AMD64-2008.0
>> (2.6.21), then switched both installations to 8.3.8 (both packages
>> compiled from source, but provided by the distro - port/emerge). The
>> Polish dictionaries and config were created according to this article
>> (it's in Polish, but the code is self-explanatory):
>
>> http://www.depesz.com/index.php/2008/04/22/polish-tsearch-in-83-polski-tsearch-w-postgresie-83/
>
> I tried to duplicate this test, but got no further than here:
>
> u8=# CREATE TEXT SEARCH DICTIONARY polish_ispell (
>     TEMPLATE = ispell,
>     DictFile = polish,
>     AffFile = polish,
>     StopWords = polish
> );
> ERROR:  syntax error
> CONTEXT:  line 174 of configuration file "/home/tgl/testversion/share/postgresql/tsearch_data/polish.affix": "  L E C
                 >       -C,GĹEM         #zalec (15a) 
> "
> u8=#
>
> Seems there's something about the current version of the dictionary that
> we don't like.  I used sjp-ispell-pl-20091117-src.tar.bz2 ...

Here are the files I used (polish.affix, polish.dict already generated):
http://wolniartysci.pl/pl.tar.gz

These should work fine. I'd be grateful if you could test and see if you
get similar results.

cheers,
Wojciech Knapik

PS. Weird, I get the emails without a reply-to set for the list..

Re: Very bad FTS performance with the Polish config

From

Wojciech Knapik

Date:

18 November 2009, 05:35:50

> your polish_english, polish configurations uses ispell language and slow,
> while english configuration doesn't contains ispell. So, what's your
> complains ? Try add ispell dictionary to english configuration and see
> timings.

Oh, so this is not anomalous ? These are the expected speeds for an 
ispell dictionary ? I didn't realize that. Sorry for the bother then. It 
just seemed way too slow to be practical.

cheers,
Wojciech Knapik

Re: Very bad FTS performance with the Polish config

From

Oleg Bartunov

Date:

18 November 2009, 10:47:26

On Wed, 18 Nov 2009, Wojciech Knapik wrote:

>
>> your polish_english, polish configurations uses ispell language and slow,
>> while english configuration doesn't contains ispell. So, what's your
>> complains ? Try add ispell dictionary to english configuration and see
>> timings.
>
> Oh, so this is not anomalous ? These are the expected speeds for an ispell 
> dictionary ? I didn't realize that. Sorry for the bother then. It just seemed 
> way too slow to be practical.

You can see real timings using ts_lexize() function for different dictionaries
(try several time to avoid cold-start problem) instead of ts_headline(), 
which involves other factors.

On my test machine I see no real difference between very simple dictionary
and french ispell, snowball dictionaries:

dev-oleg=# select ts_lexize('simple','voila'); ts_lexize
----------- {voila}
(1 row)

Time: 0.282 ms
dev-oleg=# select ts_lexize('simple','voila'); ts_lexize
----------- {voila}
(1 row)

Time: 0.269 ms

dev-oleg=# select ts_lexize('french_stem','voila'); ts_lexize
----------- {voil}
(1 row)

Time: 0.187 ms

I see no big difference in ts_headline as well:

dev-oleg=# select ts_headline('english','I can do voila', 'voila'::tsquery);      ts_headline
----------------------- I can do <b>voila</b>
(1 row)

Time: 0.265 ms
dev-oleg=# select ts_headline('nomaofr','I can do voila', 'voila'::tsquery);      ts_headline
----------------------- I can do <b>voila</b>
(1 row)

Time: 0.299 ms

This is 8.4.1 version of PostgreSQL.
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Very bad FTS performance with the Polish config

From

Wojciech Knapik

Date:

18 November 2009, 11:27:36

Oleg Bartunov wrote:

>>> your polish_english, polish configurations uses ispell language
>>> and slow, while english configuration doesn't contains ispell.
>>> So, what's your complains ? Try add ispell dictionary to english
>>> configuration and see timings.
>> 
>> Oh, so this is not anomalous ? These are the expected speeds for an
>> ispell dictionary ? I didn't realize that. Sorry for the bother
>> then. It just seemed way too slow to be practical.
> 
> You can see real timings using ts_lexize() function for different 
> dictionaries (try several time to avoid cold-start problem) instead
> of ts_headline(), which involves other factors.
> 
> On my test machine I see no real difference between very simple
> dictionary and french ispell, snowball dictionaries:

ts_lexize seems to be just as fast for simple, polish_ispell and 
english_stem with the 'voila' argument.

polish_ispell is in fact *faster* for the lorem ipsum text repeated a 
couple times (10 ?). Which suggests that the issue is with ts_headline 
iteself.

> I see no big difference in ts_headline as well:
> 
> dev-oleg=# select ts_headline('english','I can do voila', 
> 'voila'::tsquery);
>       ts_headline
> -----------------------
>  I can do <b>voila</b>
> (1 row)
> 
> Time: 0.265 ms

Yes, for 4-word texts the results are similar.
Try that with a longer text and the difference becomes more and more 
significant. For the lorem ipsum text, 'polish' is about 4 times slower, 
than 'english'. For 5 repetitions of the text, it's 6 times, for 10 
repetitions - 7.5 times...

> This is 8.4.1 version of PostgreSQL.

An that was 8.3.8/OSX.

cheers,
Wojciech Knapik

Re: Very bad FTS performance with the Polish config

From

Pavel Stehule

Date:

18 November 2009, 11:38:35

2009/11/18 Oleg Bartunov <oleg@sai.msu.su>:
> On Wed, 18 Nov 2009, Wojciech Knapik wrote:
>
>>
>>> your polish_english, polish configurations uses ispell language and slow,
>>> while english configuration doesn't contains ispell. So, what's your
>>> complains ? Try add ispell dictionary to english configuration and see
>>> timings.
>>
>> Oh, so this is not anomalous ? These are the expected speeds for an ispell
>> dictionary ? I didn't realize that. Sorry for the bother then. It just
>> seemed way too slow to be practical.
>
> You can see real timings using ts_lexize() function for different
> dictionaries
> (try several time to avoid cold-start problem) instead of ts_headline(),
> which involves other factors.
>
> On my test machine I see no real difference between very simple dictionary
> and french ispell, snowball dictionaries:
>

It's depend on language (and dictionary sizes).

for czech:

postgres=# select ts_lexize('simple','vody');ts_lexize
-----------{vody}
(1 row)

Time: 0.785 ms

postgres=# select ts_lexize('cspell','vody');ts_lexize
-----------{voda}
(1 row)

Time: 1.041 ms

I afraid so czech and polland language is very hard (with long affix file).

Regards
Pavel

> dev-oleg=# select ts_lexize('simple','voila');
>  ts_lexize
> -----------
>  {voila}
> (1 row)
>
> Time: 0.282 ms
> dev-oleg=# select ts_lexize('simple','voila');
>  ts_lexize
> -----------
>  {voila}
> (1 row)
>
> Time: 0.269 ms
>
> dev-oleg=# select ts_lexize('french_stem','voila');
>  ts_lexize
> -----------
>  {voil}
> (1 row)
>
> Time: 0.187 ms
>
> I see no big difference in ts_headline as well:
>
> dev-oleg=# select ts_headline('english','I can do voila', 'voila'::tsquery);
>      ts_headline
> -----------------------
>  I can do <b>voila</b>
> (1 row)
>
> Time: 0.265 ms
> dev-oleg=# select ts_headline('nomaofr','I can do voila', 'voila'::tsquery);
>      ts_headline
> -----------------------
>  I can do <b>voila</b>
> (1 row)
>
> Time: 0.299 ms
>
> This is 8.4.1 version of PostgreSQL.
>
>        Regards,
>                Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

Re: Very bad FTS performance with the Polish config

From

Oleg Bartunov

Date:

18 November 2009, 11:52:28

On Wed, 18 Nov 2009, Wojciech Knapik wrote:

> Yes, for 4-word texts the results are similar.
> Try that with a longer text and the difference becomes more and more 
> significant. For the lorem ipsum text, 'polish' is about 4 times slower, than 
> 'english'. For 5 repetitions of the text, it's 6 times, for 10 repetitions - 
> 7.5 times...

Again, I see nothing unclear here, since dictionaries (as specified
in configuration) apply to ALL words in document. The more words in 
document, the more overhead.

You can pass not all document to ts_headline, but just part, to have
predicted performance. This is useful in any case.

    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Very bad FTS performance with the Polish config

From

Wojciech Knapik

Date:

18 November 2009, 15:15:14

Oleg Bartunov wrote:

>> Yes, for 4-word texts the results are similar.
>> Try that with a longer text and the difference becomes more and more 
>> significant. For the lorem ipsum text, 'polish' is about 4 times 
>> slower, than 'english'. For 5 repetitions of the text, it's 6 times, 
>> for 10 repetitions - 7.5 times...
> 
> Again, I see nothing unclear here, since dictionaries (as specified
> in configuration) apply to ALL words in document. The more words in 
> document, the more overhead.

You're missing the point. I'm not surprised that the function takes more 
time for larger input texts - that's obvious. The thing is, the 
computation times rise more steeply when the Polish config is used. 
Steeply enough, that the difference between the Polish and English 
configs becomes enormous in practical cases.

Now this may be expected behaviour, but since I don't know if it is, I 
posted to the mailing lists to find out. If you're saying this is ok and 
there's nothing to fix here, then there's nothing more to discuss and we 
may consider the thread closed.
If not, ts_headline deserves a closer look.

cheers,
Wojciech Knapik

Re: Very bad FTS performance with the Polish config

From

Sushant Sinha

Date:

19 November 2009, 00:29:25

ts_headline calls ts_lexize equivalent to break the text. Off course there is algorithm to process the tokens and generate the headline. I would be really surprised if the algorithm to generate the headline is somehow dependent on language (as it only processes the tokens). So Oleg is right when he says ts_lexize is something to be checked.

I will try to replicate what you are trying to do but in the meantime can you run the same ts_headline under psql multiple times and paste the result.

-Sushant.

2009/11/19 Wojciech Knapik <webmaster@wolniartysci.pl>

Oleg Bartunov wrote:

Yes, for 4-word texts the results are similar.
Try that with a longer text and the difference becomes more and more significant. For the lorem ipsum text, 'polish' is about 4 times slower, than 'english'. For 5 repetitions of the text, it's 6 times, for 10 repetitions - 7.5 times...

Again, I see nothing unclear here, since dictionaries (as specified
in configuration) apply to ALL words in document. The more words in document, the more overhead.

You're missing the point. I'm not surprised that the function takes more time for larger input texts - that's obvious. The thing is, the computation times rise more steeply when the Polish config is used. Steeply enough, that the difference between the Polish and English configs becomes enormous in practical cases.

Now this may be expected behaviour, but since I don't know if it is, I posted to the mailing lists to find out. If you're saying this is ok and there's nothing to fix here, then there's nothing more to discuss and we may consider the thread closed.
If not, ts_headline deserves a closer look.

cheers,
Wojciech Knapik

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Very bad FTS performance with the Polish config

From

Wojciech Knapik

Date:

19 November 2009, 08:22:59

> ts_headline calls ts_lexize equivalent to break the text. Off course 
> there is algorithm to process the tokens and generate the headline. I 
> would be really surprised if the algorithm to generate the headline is 
> somehow dependent on language (as it only processes the tokens). So Oleg 
> is right when he says ts_lexize is something to be checked.

ts_lexize performs well for all dictionaries, nothing to see here.
It's ts_headline that's causing the problem. But that's just IMHO.

> I will try to replicate what you are trying to do but in the meantime 
> can you run the same ts_headline under psql multiple times and paste the 
> result.

All the results I pasted had the medium run time out of multiple calls. 
These were certainly not some extreme corner cases.

cheers,
Wojciech Knapik

Re: Very bad FTS performance with the Polish config

From

Tom Lane

Date:

19 November 2009, 11:51:25

Wojciech Knapik <webmaster@wolniartysci.pl> writes:
> Tom Lane wrote:
>> I tried to duplicate this test, but got no further than here:
>> ERROR:  syntax error
>> CONTEXT:  line 174 of configuration file "/home/tgl/testversion/share/postgresql/tsearch_data/polish.affix": "  L E
C                  >       -C,G�EM         #zalec (15a)
 

> Here are the files I used (polish.affix, polish.dict already generated):
> http://wolniartysci.pl/pl.tar.gz

Your files were the same as mine.  I eventually figured out the problem
was I was using C locale, in which some of those letters aren't letters.
(I wonder whether the tsearch config file parsers could be made less
sensitive to this by avoiding t_isalpha tests.)  In pl_PL.ut8 locale
I could see that the example is indeed much slower.  Oleg is right that
the fundamental difference is that this Polish configuration is using
an ispell dictionary where the simple English configuration is not.
But, just for the record, here's what an oprofile profile looks like:

samples  %        image name               symbol name
7480     20.9477  postgres                 RS_execute
5370     15.0386  postgres                 pg_utf_mblen
4138     11.5884  postgres                 pg_mblen
3756     10.5187  postgres                 mb_strchr
2880      8.0654  postgres                 FindWord
2754      7.7126  postgres                 CheckAffix
1576      4.4136  postgres                 NormalizeSubWord
966       2.7053  postgres                 FindAffixes
896       2.5092  postgres                 TParserGet
742       2.0780  postgres                 AllocSetAlloc
420       1.1762  postgres                 AllocSetFree
396       1.1090  postgres                 addHLParsedLex
384       1.0754  postgres                 LexizeExec

So about 55% of the time is going into affix pattern matching.
I wonder whether that couldn't be made faster.  A lot of the cycles
are spent on coping with variable-length characters --- perhaps the
ispell code should convert to wchar representation before doing this?
        regards, tom lane

Re: Very bad FTS performance with the Polish config

From

Wojciech Knapik

Date:

19 November 2009, 17:33:09

Tom Lane wrote:

*SNIP*
> So about 55% of the time is going into affix pattern matching.
> I wonder whether that couldn't be made faster.  A lot of the cycles
> are spent on coping with variable-length characters --- perhaps the
> ispell code should convert to wchar representation before doing this?

Thanks a lot for looking into this. I hope this will lead to some 
improvements one day.

Unfortunately my C skills were pretty basic years ago and I haven't used 
the language since, so I can't be of much help..

cheers,
Wojciech Knapik