Re: TSearch2 and optimisation ... - Mailing list pgsql-performance
From | George Essig |
---|---|
Subject | Re: TSearch2 and optimisation ... |
Date | |
Msg-id | 20040913030357.66842.qmail@web53701.mail.yahoo.com Whole thread Raw |
In response to | TSearch2 and optimisation ... (Hervé Piedvache <footcow@noos.fr>) |
List | pgsql-performance |
--- Herv���<inputPiedvache <herve@elma.fr> wrote: > George, > > I have well read many pages about this subject ... but I have not found any > thing for the moment to really help me ... > What can I do to optimize my PostgreSQL configuration for a special use of > Tsearch2 ... > I'm a little dispointed looking the Postgresql Russian search engine using > Tsearch2 is really quick ... why I can't haev the same result with a > bi-pentium III 933 and 1Gb of RAM with the text indexation of 1 500 000 > records ? > > Regards, > -- > Herv���<inputPiedvache > > Elma Ing���<inputierie Informatique > 6 rue du Faubourg Saint-Honor���<input> F-75008 - Paris - France > Pho. 33-144949901 > Fax. 33-144949902 > Tsearch does not scale indefinitely. It was designed for fast online updates and to be integrated into PostgreSQL. My understanding is that it uses a bloom filter together with bit string signatures. Typically, full text searches use inverted indexes, scale better, but are slower to update. My understanding is that tsearch has a practical limit of 100,000 distinct word stems or lexemes. Note that word stems are not words. Word stems are what are actually stored in a tsvector after parsing and dictionary processing. The key to making tsearch fast is to keep the number of word stems low. You decrease the number of word stems by using stop words, various dictionaries, synonyms, and preprocessing text before it gets to tsearch. You can find what word stems are stored in a tsvector column by using the stat function. For examples of how to use the stat function, see: http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/stat.html Note that the stat function will take a long time to run on large tables. Performance tuning must be done on a case by case basis. It can take some time to try different things and see the change in performance. Each time you try something new, use the stat function to see how the number of word stems has changed. The largest project I used tsearch2 on contained 900,000 records. Without performance tuning, there were 275,000 distinct word stems. After performance tuning, I got it down to 14,000 distinct word stems. By using the stat function, I noticed some obvious stop words that were very frequent that nobody would ever search for. For how to use stop words, see: http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/stop_words.html Also I noticed some strange patterns by looking through all of the word stems. In one case, strings of 3 to 7 words were joined together with hyphens to indicate category nesting. Tsearch would store these long hyphenated words intact and also store the stem of each individual word. I made a judgment call that no one would ever search for the long hyphenated words, so I preprocessed the text to remove the hyphens. I also noticed that many of the word stems were alphanumeric IDs that were designed to be unique. There were many of these IDs in the tsvector column although each ID would occur only once or twice. I again preprocessed the text to remove these IDs, but created a btree index on a varchar column representing the IDs. My search form allows users to either search full text using tsearch2 or search IDs using 'LIKE' queries which use a btree index. For 'LIKE' queries, it was another matter to get postgres to use the btree index and not use a sequential scan. For this, see: http://www.postgresql.org/docs/7.4/static/indexes-opclass.html Last, I noticed that most users wanted to restrict the full text search to a subset determined by another column in the table. As a result, I created a multicolumn gist index on an integer column and a tsvector column. For how to setup a multicolumn gist index, see: http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/multi_column_index.html There are no easy answers. Like I said, performance tuning must be done on a case by case basis. Hope this helps, George Essig
pgsql-performance by date: