Thread: TSearch2 and optimisation ...
Hi,

I'm a bit of a beginner with Tsearch2 ... I have simple tables like this:

# \d article
                                  Table "public.article"
  Column  |            Type             |                            Modifiers
----------+-----------------------------+-----------------------------------------------------------------
 id       | integer                     | not null default nextval('public.article_rss_id_rss_seq'::text)
 id_site  | integer                     | not null
 title    | text                        |
 url      | text                        |
 desc     | text                        |
 r_date   | timestamp without time zone | default now()
 r_update | timestamp without time zone | default now()
 idxfti   | tsvector                    |
Indexes:
    "article_id_key" unique, btree (id)
    "idxfti_idx" gist (idxfti)
    "ix_article_update" btree (r_update)
    "ix_article_url" btree (url)
    "ix_id_site" btree (id_site)
Foreign-key constraints:
    "$1" FOREIGN KEY (id_site) REFERENCES site (id_site)
Triggers:
    tsvectorupdate BEFORE INSERT OR UPDATE ON article
    FOR EACH ROW EXECUTE PROCEDURE tsearch2('idxfti', 'title', 'desc')

# \d site
                               Table "public.site"
    Column    |  Type   |                         Modifiers
--------------+---------+---------------------------------------------------------------
 id_site      | integer | not null default nextval('public.site_id_site_seq'::text)
 site_name    | text    |
 site_url     | text    |
 url          | text    |
 language     | text    |
 datecrea     | date    | default now()
 id_category  | integer |
 time_refresh | integer |
 active       | integer |
 error        | integer |
Indexes:
    "site_id_site_key" unique, btree (id_site)
    "ix_site_id_category" btree (id_category)
    "ix_site_url" btree (url)

# \d user_choice
   Table "public.user_choice"
 Column  |  Type   | Modifiers
---------+---------+-----------
 id_user | integer |
 id_site | integer |
Indexes:
    "ix_user_choice_all" unique, btree (id_user, id_site)

I have done a simple request, looking for articles with "postgresql" in the
title or description, ordered by rank and date, like this:

SELECT a.title, a.id, a.url, to_char(a.r_date, 'DD/MM/YYYY HH24:MI:SS') as dt,
       s.site_name, s.id_site,
       case when exists (select id_user from user_choice u
                         where u.id_site = s.id_site and u.id_user = 1)
            then 1 else 0 end as bookmarked
FROM article a, site s
WHERE s.id_site = a.id_site
  AND idxfti @@ to_tsquery('postgresql')
ORDER BY rank(idxfti, to_tsquery('postgresql')) DESC, a.r_date DESC;

The request takes about 4 seconds ... I have about 1 400 000 records in
article and 36 000 records in the site table ... it's a Bi-Pentium III
933 MHz server with 1 GB of memory ... I'm using PostgreSQL 7.4.5.
For me this result is very, very slow; I really need a quicker result,
in less than 1 second ...
The next time I call the same request I get the result in 439 ms ... but if
I replace "postgresql" in my search with "linux", for example, I get the
next result in 5 seconds ... :o(

Is it a bad use of Tsearch2 ... or a bad table structure ... or my
request ... ? I have no idea how to optimise this ...

Explain gives me this result:

                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Sort  (cost=10720.91..10724.29 rows=1351 width=191)
   Sort Key: rank(a.idxfti, '\'postgresql\''::tsquery), a.r_date
   ->  Merge Join  (cost=4123.09..10650.66 rows=1351 width=191)
         Merge Cond: ("outer".id_site = "inner".id_site)
         ->  Index Scan using site_id_site_key on site s  (cost=0.00..2834.96 rows=35705 width=28)
         ->  Sort  (cost=4123.09..4126.47 rows=1351 width=167)
               Sort Key: a.id_site
               ->  Index Scan using idxfti_idx on article a  (cost=0.00..4052.84 rows=1351 width=167)
                     Index Cond: (idxfti @@ '\'postgresql\''::tsquery)
                     Filter: (idxfti @@ '\'postgresql\''::tsquery)
         SubPlan
           ->  Seq Scan on user_choice u  (cost=0.00..2.69 rows=1 width=4)
                 Filter: ((id_site = $0) AND (id_user = 1))
(13 rows)

Any ideas are welcome ;o)

Regards,
--
Bill Footcow
Herve'

> The request takes about 4 seconds ... I have about 1 400 000 records in
> article and 36 000 records in site table ... it's a Bi-Pentium III 933 MHz
> server with 1 Gb memory ... I'm using Postgresql 7.4.5
> For me this result is very very slow I really need a quicker result with
> less than 1 second ...
> The next time I call the same request I have got the result in 439 ms ...
> but If I replace "Postgresql" in my find with "Linux" for example I will
> get the next result in 5 seconds ... :o(

Hmmm.  It sounds like your system is unable to keep all of the data cached
in memory.  What else do you have going on on that machine?

> Explain gives me this result :

Please do "EXPLAIN ANALYZE" so that we can see where time is actually spent.

--
Josh Berkus
Aglio Database Solutions
San Francisco
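What Josh is asking for is simply the original query from the first post with an EXPLAIN ANALYZE prefix (a sketch using the tables from the first post; unlike plain EXPLAIN, this actually executes the query, so expect it to take the full four seconds):

```sql
-- EXPLAIN ANALYZE runs the query and reports actual per-node times and row
-- counts alongside the planner's estimates, which is what reveals where the
-- time really goes.
EXPLAIN ANALYZE
SELECT a.title, a.id, a.url,
       to_char(a.r_date, 'DD/MM/YYYY HH24:MI:SS') AS dt,
       s.site_name, s.id_site
FROM article a, site s
WHERE s.id_site = a.id_site
  AND idxfti @@ to_tsquery('postgresql')
ORDER BY rank(idxfti, to_tsquery('postgresql')) DESC, a.r_date DESC;
```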
Josh,

Le Jeudi 26 Août 2004 01:50, Josh Berkus a écrit :
> > The request takes about 4 seconds ... I have about 1 400 000 records in
> > article and 36 000 records in site table ... it's a Bi-Pentium III 933
> > MHz server with 1 Gb memory ... I'm using Postgresql 7.4.5
> > For me this result is very very slow I really need a quicker result with
> > less than 1 second ...
> > The next time I call the same request I have got the result in 439 ms ...
> > but If I replace "Postgresql" in my find with "Linux" for example I will
> > get the next result in 5 seconds ... :o(
>
> Hmmm.  It sounds like your system is unable to keep all of the data cached
> in memory.  What else do you have going on on that machine?

There is an Apache + PHP running on the same machine at the same time ...

> > Explain gives me this result :
>
> Please do "EXPLAIN ANALYZE" so that we can see where time is actually
> spent.

                                            QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=10740.35..10743.73 rows=1351 width=190) (actual time=7054.603..7054.707 rows=139 loops=1)
   Sort Key: rank(a.idxfti, '\'postgresql\''::tsquery), a.r_date
   ->  Merge Join  (cost=4123.09..10670.10 rows=1351 width=190) (actual time=5476.749..7052.766 rows=139 loops=1)
         Merge Cond: ("outer".id_site = "inner".id_site)
         ->  Index Scan using site_id_site_key on site s  (cost=0.00..2846.52 rows=35705 width=28) (actual time=43.985..1548.903 rows=34897 loops=1)
         ->  Sort  (cost=4123.09..4126.47 rows=1351 width=166) (actual time=5416.836..5416.983 rows=139 loops=1)
               Sort Key: a.id_site
               ->  Index Scan using idxfti_idx on article a  (cost=0.00..4052.84 rows=1351 width=166) (actual time=109.766..5415.108 rows=139 loops=1)
                     Index Cond: (idxfti @@ '\'postgresql\''::tsquery)
                     Filter: (idxfti @@ '\'postgresql\''::tsquery)
         SubPlan
           ->  Seq Scan on user_choice u  (cost=0.00..2.69 rows=1 width=4) (actual time=0.146..0.146 rows=0 loops=139)
                 Filter: ((id_site = $0) AND (id_user = 1))
 Total runtime: 7056.126 ms

Thanks for your help ...
--
Hervé Piedvache

Elma Ingénierie Informatique
6 rue du Faubourg Saint-Honoré
F-75008 - Paris - France
Pho. 33-144949901
Fax. 33-144949902
Herve'

> (cost=0.00..4052.84 rows=1351 width=166) (actual time=109.766..5415.108
> rows=139 loops=1)
>       Index Cond: (idxfti @@ '\'postgresql\''::tsquery)
>       Filter: (idxfti @@ '\'postgresql\''::tsquery)

From this, it looks like your FTI index isn't fitting in your sort_mem.
What's sort_mem at now?  Can you increase it?

Overall, though, I'm not sure you can get this sub-1s without a faster
machine.  Although I'm doing FTI on about 25MB of FTI text on a
single-processor machine and getting 40ms response times, so maybe we can ...

--
Josh Berkus
Aglio Database Solutions
San Francisco
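On 7.4, the setting Josh mentions can be inspected and raised per session (a sketch; 20480 kB is an arbitrary example value, not a recommendation from the thread):

```sql
SHOW sort_mem;           -- current value in kB (7.4-era name; later versions call it work_mem)
SET sort_mem TO 20480;   -- raise to 20 MB for this session only

-- To make it permanent, set sort_mem = 20480 in postgresql.conf and reload.
```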
Bill Footcow wrote:

...
> I have done a simple request, looking for title or description having
> Postgres inside, ordered by rank and date, like this :
> SELECT a.title, a.id, a.url, to_char(a.r_date, 'DD/MM/YYYY HH24:MI:SS') as dt,
> s.site_name, s.id_site, case when exists (select id_user from user_choice u
> where u.id_site=s.id_site and u.id_user = 1) then 1 else 0 end as bookmarked
> FROM article a, site s
> WHERE s.id_site = a.id_site
> AND idxfti @@ to_tsquery('postgresql')
> ORDER BY rank(idxfti, to_tsquery('postgresql')) DESC, a.r_date DESC;
>
> The request takes about 4 seconds ... I have about 1 400 000 records in
> article and 36 000 records in site table ... it's a Bi-Pentium III 933 MHz
> server with 1 Gb memory ... I'm using Postgresql 7.4.5
> For me this result is very very slow I really need a quicker result with less
> than 1 second ...
> The next time I call the same request I have got the result in 439 ms ... but
...

The first query is slow because the relevant index pages are not cached in
memory.  Everyone experiences this.  GiST indexes on tsvector columns can get
really big.  You have done nothing wrong.  When you have a lot of records,
tsearch2 will not run fast without extensive performance tuning.

Read the following:

Optimization
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/optimization.html

stat function
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/stat.html

Stop words
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/stop_words.html

Multicolumn GiST index
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/multi_column_index.html

openfts-general mailing list archive
http://sourceforge.net/mailarchive/forum.php?forum=openfts-general

Try some of these things out and let me know how it goes.
George Essig
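As an illustration of the stop-words approach George links to: in the tsearch2 of that era, a dictionary's stop-word file was configured through the pg_ts_dict catalog table (a sketch; the file path is an assumption and depends on the installation):

```sql
-- Point the English stemmer dictionary at a stop-word file so that very
-- common words ("the", "and", ...) are dropped during parsing instead of
-- being indexed.  The path below is an assumption; use your installation's.
UPDATE pg_ts_dict
SET dict_initoption = '/usr/local/pgsql/share/contrib/english.stop'
WHERE dict_name = 'en_stem';

-- Existing tsvectors must be rebuilt for the change to take effect,
-- e.g. for the article table from this thread ("desc" is a reserved word):
UPDATE article
SET idxfti = to_tsvector(coalesce(title, '') || ' ' || coalesce("desc", ''));
```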
Le Jeudi 26 Août 2004 19:48, Josh Berkus a écrit :
> Herve'
>
> > (cost=0.00..4052.84 rows=1351 width=166) (actual time=109.766..5415.108
> > rows=139 loops=1)
> >       Index Cond: (idxfti @@ '\'postgresql\''::tsquery)
> >       Filter: (idxfti @@ '\'postgresql\''::tsquery)
>
> From this, it looks like your FTI index isn't fitting in your sort_mem.
> What's sort_mem at now?  Can you increase it?

shared_buffers = 3000
sort_mem = 10240

> Overall, though, I'm not sure you can get this sub-1s without a faster
> machine.  Although I'm doing FTI on about 25MB of FTI text on a
> single-processor machine, and getting 40ms response times, so maybe we can
> ...

Sorry, I misunderstood what you mean here ... you tell me to upgrade the
hardware, but you manage 25 MB with a single processor?  What do you mean?
My database is about 450 MB ...

Regards,
--
Bill Footcow
George,

Le Jeudi 26 Août 2004 19:58, George Essig a écrit :
> Bill Footcow wrote:
>
> ...
>
> > I have done a simple request, looking for title or description having
> > Postgres inside order by rank and date, like this :
> > SELECT a.title, a.id, a.url, to_char(a.r_date, 'DD/MM/YYYY HH24:MI:SS')
> > as dt, s.site_name, s.id_site, case when exists (select id_user from
> > user_choice u where u.id_site=s.id_site and u.id_user = 1) then 1 else 0
> > end as bookmarked FROM article a, site s
> > WHERE s.id_site = a.id_site
> > AND idxfti @@ to_tsquery('postgresql')
> > ORDER BY rank(idxfti, to_tsquery('postgresql')) DESC, a.r_date DESC;
> >
> > The request takes about 4 seconds ... I have about 1 400 000 records in
> > article and 36 000 records in site table ... it's a Bi-Pentium III 933
> > MHz server with 1 Gb memory ... I'm using Postgresql 7.4.5
> > For me this result is very very slow I really need a quicker result with
> > less than 1 second ...
> > The next time I call the same request I have got the result in 439 ms ...
> > but
>
> ...
>
> The first query is slow because the relevant index pages are not cached in
> memory.  Everyone experiences this.  GiST indexes on tsvector columns can
> get really big.  You have done nothing wrong.  When you have a lot of
> records, tsearch2 will not run fast without extensive performance tuning.
>
> Read the following:
>
> Optimization
> http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/optimization.html
>
> ...

I have read many pages about this subject ... but so far I have not found
anything that really helps me ...
What can I do to optimize my PostgreSQL configuration for this particular use
of Tsearch2?
I'm a little disappointed: the PostgreSQL Russian search engine using
Tsearch2 is really quick ... why can't I have the same result with a
bi-Pentium III 933 and 1 GB of RAM with the text indexation of 1 500 000
records?
Regards, -- Hervé Piedvache Elma Ingénierie Informatique 6 rue du Faubourg Saint-Honoré F-75008 - Paris - France Pho. 33-144949901 Fax. 33-144949902
--- Hervé Piedvache <herve@elma.fr> wrote:

> George,
>
> I have read many pages about this subject ... but so far I have not found
> anything that really helps me ...
> What can I do to optimize my PostgreSQL configuration for this particular
> use of Tsearch2?
> I'm a little disappointed: the PostgreSQL Russian search engine using
> Tsearch2 is really quick ... why can't I have the same result with a
> bi-Pentium III 933 and 1 GB of RAM with the text indexation of 1 500 000
> records?
>
> Regards,
> --
> Hervé Piedvache
>
> Elma Ingénierie Informatique
> 6 rue du Faubourg Saint-Honoré
> F-75008 - Paris - France
> Pho. 33-144949901
> Fax. 33-144949902

Tsearch does not scale indefinitely.  It was designed for fast online updates
and to be integrated into PostgreSQL.  My understanding is that it uses a
bloom filter together with bit string signatures.  Full text search engines
typically use inverted indexes instead, which scale better but are slower to
update.  My understanding is that tsearch has a practical limit of 100,000
distinct word stems or lexemes.

Note that word stems are not words.  Word stems are what are actually stored
in a tsvector after parsing and dictionary processing.  The key to making
tsearch fast is to keep the number of word stems low.  You decrease the
number of word stems by using stop words, various dictionaries, synonyms,
and preprocessing text before it gets to tsearch.

You can find what word stems are stored in a tsvector column by using the
stat function.  For examples of how to use the stat function, see:

http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/stat.html

Note that the stat function will take a long time to run on large tables.
Performance tuning must be done on a case by case basis.  It can take some
time to try different things and see the change in performance.
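The stat function takes the text of a query returning a tsvector column and returns one row per distinct word stem, so for the article table from earlier in this thread a typical call looks like this (a sketch; column names word, ndoc, nentry are from the tsearch2 documentation):

```sql
-- The most frequent word stems in article.idxfti: these are the prime
-- candidates for stop words.  ndoc = number of documents containing the
-- stem, nentry = total number of occurrences across all documents.
SELECT word, ndoc, nentry
FROM stat('SELECT idxfti FROM article')
ORDER BY ndoc DESC, nentry DESC
LIMIT 40;
```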
Each time you try something new, use the stat function to see how the number
of word stems has changed.

The largest project I used tsearch2 on contained 900,000 records.  Without
performance tuning, there were 275,000 distinct word stems.  After
performance tuning, I got it down to 14,000 distinct word stems.

By using the stat function, I noticed some obvious stop words that were very
frequent and that nobody would ever search for.  For how to use stop words,
see:

http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/stop_words.html

Also, I noticed some strange patterns by looking through all of the word
stems.  In one case, strings of 3 to 7 words were joined together with
hyphens to indicate category nesting.  Tsearch would store these long
hyphenated words intact and also store the stem of each individual word.  I
made a judgment call that no one would ever search for the long hyphenated
words, so I preprocessed the text to remove the hyphens.

I also noticed that many of the word stems were alphanumeric IDs that were
designed to be unique.  There were many of these IDs in the tsvector column,
although each ID would occur only once or twice.  I again preprocessed the
text to remove these IDs, but created a btree index on a varchar column
representing the IDs.  My search form allows users to either search full
text using tsearch2 or search IDs using 'LIKE' queries which use the btree
index.  For 'LIKE' queries, it was another matter to get postgres to use the
btree index and not a sequential scan.  For this, see:

http://www.postgresql.org/docs/7.4/static/indexes-opclass.html

Last, I noticed that most users wanted to restrict the full text search to a
subset determined by another column in the table.  As a result, I created a
multicolumn gist index on an integer column and a tsvector column.
For how to set up a multicolumn gist index, see:

http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/multi_column_index.html

There are no easy answers.  Like I said, performance tuning must be done on a
case by case basis.

Hope this helps,

George Essig
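For illustration, a multicolumn GiST index like the one George describes could be built on this thread's article table roughly as follows (a sketch; the index name and the id_site value are made up, and it assumes contrib/btree_gist is installed so the integer column can participate in a GiST index):

```sql
-- Requires contrib/btree_gist: plain integer columns have no GiST operator
-- class of their own, so btree_gist supplies one for id_site.
CREATE INDEX ix_article_site_fti ON article USING gist (id_site, idxfti);

-- A full-text search restricted to one site can then be answered from the
-- single combined index instead of intersecting two scans:
SELECT id, title
FROM article
WHERE id_site = 42
  AND idxfti @@ to_tsquery('postgresql');
```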