Very bad FTS performance with the Polish config - Mailing list pgsql-hackers

From Wojciech Knapik
Subject Very bad FTS performance with the Polish config
Date
Msg-id 4B03314A.6010202@wolniartysci.pl
Whole thread Raw
List pgsql-hackers
Hello

 This has been discussed in #postgresql and posted to -performance a 
couple days ago, but no solution has been found. The discussion can be 
found here: 
http://archives.postgresql.org/pgsql-performance/2009-11/msg00162.php
 I just finished implementing a "search engine" for my site and found
ts_headline extremely slow when used with a Polish tsearch 
configuration, while fast with English. All of it boils down to a simple
testcase, but first some background.
 I tested on 8.3.1 on G5/OSX 10.5.8 and Xeon/Gentoo AMD64-2008.0
(2.6.21), then switched both installations to 8.3.8 (both packages
compiled from source, but provided by the distro - port/emerge). The 
Polish dictionaries and config were created according to this article 
(it's in Polish, but the code is self-explanatory):

http://www.depesz.com/index.php/2008/04/22/polish-tsearch-in-83-polski-tsearch-w-postgresie-83/
 Now for the testcase:

text = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip
ex ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur
sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt
mollit anim id est laborum.'

# explain analyze select ts_headline('polish', text,
plainto_tsquery('polish', 'foobar'));                                     QUERY PLAN
------------------------------------------------------------------------------------ Result  (cost=0.00..0.01 rows=1
width=0)(actual time=6.407..6.470
 
rows=1 loops=1) Total runtime: 6.524 ms
(2 rows)

# explain analyze select ts_headline('english', text,
plainto_tsquery('english', 'foobar'));                                     QUERY PLAN
------------------------------------------------------------------------------------ Result  (cost=0.00..0.01 rows=1
width=0)(actual time=0.861..0.895
 
rows=1 loops=1) Total runtime: 0.935 ms
(2 rows)

# explain analyze select ts_headline('simple', text,
plainto_tsquery('simple', 'foobar'));                                     QUERY PLAN
------------------------------------------------------------------------------------ Result  (cost=0.00..0.01 rows=1
width=0)(actual time=0.627..0.660
 
rows=1 loops=1) Total runtime: 0.697 ms
(2 rows)

#
 As you can see, the results differ by an order of magnitude between
Polish and English. While in this simple testcase it's a non-issue, in
the real world this translates into enormous overhead.
 One of the queries I ran testing my site's search function took
1870ms. When I took that query and changed all ts_headline(foo) calls to
just foo, the time dropped below 100ms. That's the difference between
something completely unacceptable and something quite useful.
 I can post various details about the hardware, software and specific
queries, but the testcases speak for themselves. I'm sure you can easily
reproduce my results.
 I'm putting my code into production tomorrow, since I can't wait 
anymore. Hints would be very much appreciated!


cheers,
Wojciech Knapik

PS. This issue is not related to the loading time of dictionaries, or 
calls to ts_headline for results that won't be displayed. A few other 
details can be found here 
http://pastie.textmate.org/private/hqnqfnsfsknjyjlffzmog along with
snippets of my conversations in #postgresql that lead to this testcase.
Big thanks to RhodiumToad for helping me with fts for the last couple
days ;]



pgsql-hackers by date:

Previous
From: Greg Smith
Date:
Subject: Re: CommitFest expectations
Next
From: Euler Taveira de Oliveira
Date:
Subject: Re: Very bad FTS performance with the Polish config