Re: Very bad FTS performance with the Polish config - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Very bad FTS performance with the Polish config
Msg-id 15251.1258645873@sss.pgh.pa.us
In response to Re: Very bad FTS performance with the Polish config  (Wojciech Knapik <webmaster@wolniartysci.pl>)
Responses Re: Very bad FTS performance with the Polish config
List pgsql-hackers
Wojciech Knapik <webmaster@wolniartysci.pl> writes:
> Tom Lane wrote:
>> I tried to duplicate this test, but got no further than here:
>> ERROR:  syntax error
>> CONTEXT:  line 174 of configuration file "/home/tgl/testversion/share/postgresql/tsearch_data/polish.affix": "  L E C                  >       -C,GŁEM         #zalec (15a)
 

> Here are the files I used (polish.affix, polish.dict already generated):
> http://wolniartysci.pl/pl.tar.gz

Your files were the same as mine.  I eventually figured out the problem
was that I was using the C locale, in which some of those letters aren't
letters.  (I wonder whether the tsearch config file parsers could be made
less sensitive to this by avoiding t_isalpha tests.)  In the pl_PL.utf8 locale
I could see that the example is indeed much slower.  Oleg is right that
the fundamental difference is that this Polish configuration is using
an ispell dictionary where the simple English configuration is not.
But, just for the record, here's what an oprofile profile looks like:

samples  %        image name               symbol name
7480     20.9477  postgres                 RS_execute
5370     15.0386  postgres                 pg_utf_mblen
4138     11.5884  postgres                 pg_mblen
3756     10.5187  postgres                 mb_strchr
2880      8.0654  postgres                 FindWord
2754      7.7126  postgres                 CheckAffix
1576      4.4136  postgres                 NormalizeSubWord
966       2.7053  postgres                 FindAffixes
896       2.5092  postgres                 TParserGet
742       2.0780  postgres                 AllocSetAlloc
420       1.1762  postgres                 AllocSetFree
396       1.1090  postgres                 addHLParsedLex
384       1.0754  postgres                 LexizeExec

So about 55% of the time is going into affix pattern matching.
I wonder whether that couldn't be made faster.  A lot of the cycles
are spent on coping with variable-length characters --- perhaps the
ispell code should convert to wchar representation before doing this?
        regards, tom lane
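
To make the wchar suggestion above concrete, here is a minimal sketch in C.
It is not code from the PostgreSQL tree: every function name below is
hypothetical, and it assumes the input is valid, NUL-terminated UTF-8.  The
multibyte matcher has to re-measure character lengths on every comparison
(the kind of work that shows up as pg_utf_mblen/pg_mblen/mb_strchr in the
profile), while the wide version decodes once and then compares fixed-width
code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte length of the UTF-8 sequence starting at s.
 * Hypothetical helper; assumes s points at a valid lead byte. */
static int utf8_len(const unsigned char *s)
{
    if (*s < 0x80) return 1;
    if ((*s & 0xE0) == 0xC0) return 2;
    if ((*s & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Multibyte approach: test whether the character at p is a member of
 * the character set `set`.  Each step must recompute a sequence length
 * before it can compare or advance. */
static int mb_in_set(const char *set, const char *p)
{
    int plen = utf8_len((const unsigned char *) p);

    while (*set)
    {
        int slen = utf8_len((const unsigned char *) set);

        if (slen == plen && memcmp(set, p, (size_t) plen) == 0)
            return 1;
        set += slen;
    }
    return 0;
}

/* Decode a UTF-8 string into fixed-width code points, once.
 * Writes at most cap entries to dst and returns the number written. */
static size_t utf8_to_wide(const char *src, uint32_t *dst, size_t cap)
{
    const unsigned char *s = (const unsigned char *) src;
    size_t n = 0;

    assert(dst != NULL);
    while (*s && n < cap)
    {
        int len = utf8_len(s);
        int i;
        uint32_t cp;

        if (len == 1)      cp = s[0];
        else if (len == 2) cp = s[0] & 0x1F;
        else if (len == 3) cp = s[0] & 0x0F;
        else               cp = s[0] & 0x07;

        /* Fold in continuation bytes; stop early on an unexpected NUL
         * so a truncated sequence cannot run past the terminator. */
        for (i = 1; i < len && s[i]; i++)
            cp = (cp << 6) | (uint32_t) (s[i] & 0x3F);

        dst[n++] = cp;
        s += i;
    }
    return n;
}

/* Wide approach: set membership is a plain comparison per element,
 * with no length computation in the inner loop. */
static int wide_in_set(const uint32_t *set, size_t nset, uint32_t c)
{
    size_t i;

    for (i = 0; i < nset; i++)
        if (set[i] == c)
            return 1;
    return 0;
}
```

With this layout a word and each affix character class would be decoded
once via utf8_to_wide(), after which a check such as "is the last character
in the class" becomes wide_in_set(set, nset, word[n - 1]) with direct
indexing, instead of a scan from the start of the string.  Whether the
one-time conversion cost is repaid in the real ispell code would have to be
measured.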

