Questions regarding contrib/tsearch - Mailing list pgsql-general
From: Markus Wollny
Subject: Questions regarding contrib/tsearch
Date:
Msg-id: 2266D0630E43BB4290742247C8910575014CE32A@dozer.computec.de
Responses:
  Re: Questions regarding contrib/tsearch (Oleg Bartunov <oleg@sai.msu.su>)
  Re: Questions regarding contrib/tsearch (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-general
Hello!

I have installed tsearch for several of our databases (PostgreSQL 7.2.1 on SuSE Linux 7.3, 4x PIII Xeon 550, 2 GB RAM, RAID 5 with 5x18 GB) and it's really working like a charm - much more flexible than contrib/fti and faster on inserts, too. The documentation is still a bit thin on what exactly is happening - the fti mechanism was quite easy to grasp, while the workings of tsearch seem more like a black box to me. But as long as it works, and it does work very well indeed, you won't find me complaining :) I have got two questions, though.

First: has somebody already implemented this with a dictionary for German? I imagine it could be a bit difficult, because plurals of words containing diphthongs tend to transform a vowel into an umlaut ("Haus" -> "Häuser", "Maus" -> "Mäuse" etc.), and there is a lot of more complicated grammar that doesn't confine itself to changing a suffix. So German morphology cannot be described in an algorithm quite as easily. The umlaut vowels and the sharp s are a nuisance anyway, because we store them in HTML transcription in the database; a search for "Fähre" (ferry) must therefore be written as 'f&auml&hre' in the search string. And if a row contains just one word with an o-umlaut (&ouml;), a search for "Föhre" (search string 'f&ouml&hre'; the word is German for "pine tree") or "führe" (search string 'f&uuml&hre'; German subjunctive of the verb "to drive") produces a hit, even though these words don't appear in the indexed text at all. I can live with this tiny inaccuracy, because we run a website about games and 99% of our searches are for game titles (near enough all of them in English, hardly any containing an umlaut), but I'd be interested in your experiences with this issue and how you resolved it - or whether you just ignored it, too :)

My second question is about performance - I think I know the answer, but I'd like to know whether I'm correct.
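One way around the entity problem (a sketch, not something the setup above already does) would be to decode the HTML transcription back to real characters before building the txtidx value, so that a stored "F&auml;hre" is indexed as the single token "fähre" instead of being split at the "&" and ";". A minimal Python illustration, assuming the stored text uses standard named HTML entities; the function name is made up for the example:

```python
import html

def normalize_for_index(stored_text: str) -> str:
    """Decode HTML entities so umlauts become single characters
    instead of splitting one word into several tokens at '&'/';'."""
    return html.unescape(stored_text)

print(normalize_for_index("F&auml;hre"))  # -> Fähre
```

The same normalization would then have to be applied to the search terms, so that query and index agree on one spelling.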
I've got a table containing forum messages with nearly 500,000 rows; the tsearch-indexed fields can currently contain as much as 5,000 characters per row (we plan to allow about 12,000 characters in the near future), and the column holding the txtidx data is named textindex. Now I search for messages containing the words 'Dungeon' and 'Siege'.

First-time search for 'dungeon&siege':

community=# explain analyze select count (*) from ct_com_board_message where textindex ## 'dungeon&siege';
NOTICE:  QUERY PLAN:

Aggregate  (cost=1985.40..1985.40 rows=1 width=0) (actual time=24913.37..24913.38 rows=1 loops=1)
  ->  Index Scan using t_idx on ct_com_board_message  (cost=0.00..1984.18 rows=487 width=0) (actual time=14.62..24899.67 rows=2647 loops=1)
Total runtime: 24913.60 msec

Second time, same search:

community=# explain analyze select count (*) from ct_com_board_message where textindex ## 'dungeon&siege';
NOTICE:  QUERY PLAN:

Aggregate  (cost=1985.40..1985.40 rows=1 width=0) (actual time=415.66..415.66 rows=1 loops=1)
  ->  Index Scan using t_idx on ct_com_board_message  (cost=0.00..1984.18 rows=487 width=0) (actual time=0.12..407.31 rows=2647 loops=1)
Total runtime: 415.88 msec

Just out of curiosity, the number of results of this query:

community=# select count (*) from ct_com_board_message where textindex ## 'dungeon&siege';
 count
-------
  2647
(1 row)

Right now the database is not in full production state, as our main website is not yet running on it - and that site causes more than 90% of our overall database traffic. This is why I need to get as much performance as I can out of searching before switching this last remaining site to PostgreSQL, too. I suspect that the high running time of the first call of that query is due to the database having to do hard-disk access in order to get the needed parts of the table into memory.
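For scale: both plans above do identical work (same plan shape, same 2647 rows), only the timings differ. A throwaway check of the cold/warm ratio, with the two "Total runtime" figures copied from the output above (nothing here is PostgreSQL-specific):

```python
import re

cold = "Total runtime: 24913.60 msec"
warm = "Total runtime: 415.88 msec"

def runtime_msec(line: str) -> float:
    """Pull the millisecond figure out of an EXPLAIN ANALYZE summary line."""
    return float(re.search(r"Total runtime: ([\d.]+) msec", line).group(1))

speedup = runtime_msec(cold) / runtime_msec(warm)
print(f"warm run is ~{speedup:.0f}x faster")  # -> warm run is ~60x faster
```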
This would explain the acceptably low running time of the second call - the needed information is already in memory, so there is no slow hard-disk access involved and the query completes quite quickly. Is this correct?

If so, what can I do to keep all of the database in memory? The machine has 2 GB of RAM, and if I dump all the databases into one SQL script, the resulting file is about 600 MB in size. Surely it should be possible to keep most of that in memory at all times? What would I need to do to accomplish this? And is there a way to get even more detailed information about query execution, such as how much time is needed for query planning, for disk access and so on?

Regards,

Markus Wollny
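On the question of keeping data in memory: the size of PostgreSQL's own buffer cache is set by shared_buffers in postgresql.conf, and any data not held there can still be served from the operating system's file cache rather than disk. A sketch with illustrative values only, not tuned advice; on 7.2 the setting is a count of 8 kB buffers, and the kernel must permit a shared memory segment of that size:

```
# postgresql.conf -- illustrative values, not a recommendation:
# 32768 buffers * 8 kB = 256 MB of shared buffer cache.
shared_buffers = 32768

# The kernel's SysV shared memory limit must allow the segment,
# e.g. on Linux: echo 536870912 > /proc/sys/kernel/shmmax
```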