Thread: TSearch2 and optimisation ...

TSearch2 and optimisation ...

From
Hervé Piedvache
Date:
Hi,

I'm a little beginner with Tsearch2 ....

I have simples tables like this :

# \d article
                                         Table "public.article"
   Column   |            Type             |
Modifiers
------------+-----------------------------+-----------------------------------------------------------------
 id     | integer                     | not null default
nextval('public.article_rss_id_rss_seq'::text)
 id_site    | integer                     | not null
 title  | text                        |
 url    | text                        |
 desc   | text                        |
 r_date   | timestamp without time zone | default now()
 r_update | timestamp without time zone | default now()
 idxfti     | tsvector                    |
Indexes:
    "article_id_key" unique, btree (id)
    "idxfti_idx" gist (idxfti)
    "ix_article_update" btree (r_update)
    "ix_article_url" btree (url)
    "ix_id_site" btree (id_site)
Foreign-key constraints:
    "$1" FOREIGN KEY (id_site) REFERENCES site (id_site)
Triggers:
    tsvectorupdate BEFORE INSERT OR UPDATE ON article FOR EACH ROW EXECUTE
PROCEDURE tsearch2('idxfti', 'title', 'desc')

# \d site_rss
                                Table "public.site"
    Column    |  Type   |                           Modifiers
--------------+---------+---------------------------------------------------------------
 id_site      | integer | not null default
nextval('public.site_id_site_seq'::text)
 site_name    | text    |
 site_url     | text    |
url             | text    |
 language     | text    |
 datecrea     | date    | default now()
 id_category  | integer |
 time_refresh | integer |
 active       | integer |
 error        | integer |
Indexes:
    "site_id_site_key" unique, btree (id_site)
    "ix_site_id_category" btree (id_category)
    "ix_site_url" btree (url)

# \d user_choice
  Table "public.user_choice"
 Column  |  Type   | Modifiers
---------+---------+-----------
 id_user | integer |
 id_site | integer |
Indexes:
    "ix_user_choice_all" unique, btree (id_user, id_site)

I have done a simple request, looking for title or description having Postgres
inside order by rank and date,  like this :
SELECT a.title, a.id, a.url, to_char(a.r_date, 'DD/MM/YYYY HH24:MI:SS') as dt,
s.site_name, s.id_site, case when exists (select id_user from user_choice u
where u.id_site=s.id_site and u.id_user = 1) then 1 else 0 end as bookmarked
    FROM article a, site s
  WHERE s.id_site = a.id_site
       AND idxfti @@ to_tsquery('postgresql')
   ORDER BY rank(idxfti, to_tsquery('postgresql')) DESC, a.r_date DESC;

The request takes about 4 seconds ... I have about 1 400 000 records in
article and 36 000 records in site table ... it's a Bi-Pentium III 933 MHz
server with 1 Gb memory ... I'm using Postgresql 7.4.5
For me this result is very very slow I really need a quicker result with less
than 1 second ...
The next time I call the same request I have got the result in 439 ms ... but
If I replace "Postgresql" in my find with "Linux" for example I will get the
next result in 5 seconds ... :o(

Is it a bad use of Tsearch2 ... or a bad table structure ... or from my
request ... ? I have no idea how to optimise this ...

Explain gives me this result :
                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Sort  (cost=10720.91..10724.29 rows=1351 width=191)
   Sort Key: rank(a.idxfti, '\'postgresql\''::tsquery), a.r_date
   ->  Merge Join  (cost=4123.09..10650.66 rows=1351 width=191)
         Merge Cond: ("outer".id_site = "inner".id_site)
         ->  Index Scan using site_id_site_key on site s  (cost=0.00..2834.96
rows=35705 width=28)
         ->  Sort  (cost=4123.09..4126.47 rows=1351 width=167)
               Sort Key: a.id_site
               ->  Index Scan using idxfti_idx on article a
(cost=0.00..4052.84 rows=1351 width=167)
                     Index Cond: (idxfti @@ '\'postgresql\''::tsquery)
                     Filter: (idxfti @@ '\'postgresql\''::tsquery)
         SubPlan
           ->  Seq Scan on user_choice u  (cost=0.00..2.69 rows=1 width=4)
                 Filter: ((id_site = $0) AND (id_user = 1))
(13 rows)

Any idea are well done ;o)

Regards,
--
Bill Footcow


Re: TSearch2 and optimisation ...

From
Josh Berkus
Date:
Herve'

> The request takes about 4 seconds ... I have about 1 400 000 records in
> article and 36 000 records in site table ... it's a Bi-Pentium III 933 MHz
> server with 1 Gb memory ... I'm using Postgresql 7.4.5
> For me this result is very very slow I really need a quicker result with
> less than 1 second ...
> The next time I call the same request I have got the result in 439 ms ...
> but If I replace "Postgresql" in my find with "Linux" for example I will
> get the next result in 5 seconds ... :o(

Hmmm.  It sounds like your system is unable to keep all of the data cached in
memory.  What else do you have going on on that machine?

> Explain gives me this result :

Please do "EXPLAIN ANALYZE" so that we can see where time is actually spent.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: TSearch2 and optimisation ...

From
Hervé Piedvache
Date:
Josh,

Le Jeudi 26 Août 2004 01:50, Josh Berkus a écrit :
> > The request takes about 4 seconds ... I have about 1 400 000 records in
> > article and 36 000 records in site table ... it's a Bi-Pentium III 933
> > MHz server with 1 Gb memory ... I'm using Postgresql 7.4.5
> > For me this result is very very slow I really need a quicker result with
> > less than 1 second ...
> > The next time I call the same request I have got the result in 439 ms ...
> > but If I replace "Postgresql" in my find with "Linux" for example I will
> > get the next result in 5 seconds ... :o(
>
> Hmmm.  It sounds like your system is unable to keep all of the data cached
> in memory.  What else do you have going on on that machine?

There is an Apache + PHP running in same time ...

> > Explain gives me this result :
>
> Please do "EXPLAIN ANALYZE" so that we can see where time is actually
> spent.

                                                         QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=10740.35..10743.73 rows=1351 width=190) (actual
time=7054.603..7054.707 rows=139 loops=1)
   Sort Key: rank(a.idxfti, '\'postgresql\''::tsquery), a.r_date
   ->  Merge Join  (cost=4123.09..10670.10 rows=1351 width=190) (actual
time=5476.749..7052.766 rows=139 loops=1)
         Merge Cond: ("outer".id_site = "inner".id_site)
         ->  Index Scan using site_id_site_key on site s  (cost=0.00..2846.52
rows=35705 width=28) (actual time=43.985..1548.903 rows=34897 loops=1)
         ->  Sort  (cost=4123.09..4126.47 rows=1351 width=166) (actual
time=5416.836..5416.983 rows=139 loops=1)
               Sort Key: a.id_site
               ->  Index Scan using idxfti_idx on article a
(cost=0.00..4052.84 rows=1351 width=166) (actual time=109.766..5415.108
rows=139 loops=1)
                     Index Cond: (idxfti @@ '\'postgresql\''::tsquery)
                     Filter: (idxfti @@ '\'postgresql\''::tsquery)
         SubPlan
           ->  Seq Scan on user_choice u  (cost=0.00..2.69 rows=1 width=4)
(actual time=0.146..0.146 rows=0 loops=139)
                 Filter: ((id_site = $0) AND (id_user = 1))
 Total runtime: 7056.126 ms

Thanks for your help ...
--
Hervé Piedvache

Elma Ingénierie Informatique
6 rue du Faubourg Saint-Honoré
F-75008 - Paris - France
Pho. 33-144949901
Fax. 33-144949902

Re: TSearch2 and optimisation ...

From
Josh Berkus
Date:
Herve'

> (cost=0.00..4052.84 rows=1351 width=166) (actual time=109.766..5415.108
> rows=139 loops=1)
>                      Index Cond: (idxfti @@ '\'postgresql\''::tsquery)
>                      Filter: (idxfti @@ '\'postgresql\''::tsquery)

From this, it looks like your FTI index isn't fitting in your sort_mem.
What's sort_mem at now?  Can you increase it?

Overall, though, I'm not sure you can get this sub-1s without a faster
machine.   Although I'm doing FTI on about 25MB of FTI text on a
single-processor machine, and getting 40ms response times, so maybe we can
...

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: TSearch2 and optimisation ...

From
George Essig
Date:
Bill Footcow wrote:

...
> I have done a simple request, looking for title or description having Postgres
> inside order by rank and date,  like this :
> SELECT a.title, a.id, a.url, to_char(a.r_date, 'DD/MM/YYYY HH24:MI:SS') as dt,
> s.site_name, s.id_site, case when exists (select id_user from user_choice u
> where u.id_site=s.id_site and u.id_user = 1) then 1 else 0 end as bookmarked
>     FROM article a, site s
>   WHERE s.id_site = a.id_site
>        AND idxfti @@ to_tsquery('postgresql')
>    ORDER BY rank(idxfti, to_tsquery('postgresql')) DESC, a.r_date DESC;
>
> The request takes about 4 seconds ... I have about 1 400 000 records in
> article and 36 000 records in site table ... it's a Bi-Pentium III 933 MHz
> server with 1 Gb memory ... I'm using Postgresql 7.4.5
> For me this result is very very slow I really need a quicker result with less
> than 1 second ...
> The next time I call the same request I have got the result in 439 ms ... but
...

The first query is slow because the relevant index pages are not cached in memory.  Everyone
experiences this.  GiST indexes on tsvector columns can get really big.  You have done nothing
wrong.  When you have a lot of records, tsearch2 will not run fast without extensive performance
tuning.

Read the following:

Optimization
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/optimization.html

stat function
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/stat.html

Stop words
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/stop_words.html

Multicolumn GiST index
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/multi_column_index.html

openfts-general mailing list archive
http://sourceforge.net/mailarchive/forum.php?forum=openfts-general

Try some of things out and let me know how it goes.

George Essig



Re: TSearch2 and optimisation ...

From
Hervé Piedvache
Date:
Le Jeudi 26 Août 2004 19:48, Josh Berkus a écrit :
> Herve'
>
> > (cost=0.00..4052.84 rows=1351 width=166) (actual time=109.766..5415.108
> > rows=139 loops=1)
> >                      Index Cond: (idxfti @@ '\'postgresql\''::tsquery)
> >                      Filter: (idxfti @@ '\'postgresql\''::tsquery)
> >
> >From this, it looks like your FTI index isn't fitting in your sort_mem.
>
> What's sort_mem at now?  Can you increase it?

shared_buffers = 3000
sort_mem = 10240

> Overall, though, I'm not sure you can get this sub-1s without a faster
> machine.   Although I'm doing FTI on about 25MB of FTI text on a
> single-processor machine, and getting 40ms response times, so maybe we can
> ...

Sorry I missed understand what you mean here ...
You tell me to upgrade the hardware but you manage a 25 Mb with a single
processor ?? What you mean ?
My database is about 450 Mb ...

Regards,
--
Bill Footcow


Re: TSearch2 and optimisation ...

From
Hervé Piedvache
Date:
George,

Le Jeudi 26 Août 2004 19:58, George Essig a écrit :
> Bill Footcow wrote:
>
> ...
>
> > I have done a simple request, looking for title or description having
> > Postgres inside order by rank and date,  like this :
> > SELECT a.title, a.id, a.url, to_char(a.r_date, 'DD/MM/YYYY HH24:MI:SS')
> > as dt, s.site_name, s.id_site, case when exists (select id_user from
> > user_choice u where u.id_site=s.id_site and u.id_user = 1) then 1 else 0
> > end as bookmarked FROM article a, site s
> >   WHERE s.id_site = a.id_site
> >        AND idxfti @@ to_tsquery('postgresql')
> >    ORDER BY rank(idxfti, to_tsquery('postgresql')) DESC, a.r_date DESC;
> >
> > The request takes about 4 seconds ... I have about 1 400 000 records in
> > article and 36 000 records in site table ... it's a Bi-Pentium III 933
> > MHz server with 1 Gb memory ... I'm using Postgresql 7.4.5
> > For me this result is very very slow I really need a quicker result with
> > less than 1 second ...
> > The next time I call the same request I have got the result in 439 ms ...
> > but
>
> ...
>
> The first query is slow because the relevant index pages are not cached in
> memory.  Everyone experiences this.  GiST indexes on tsvector columns can
> get really big.  You have done nothing wrong.  When you have a lot of
> records, tsearch2 will not run fast without extensive performance tuning.
>
> Read the following:
>
> Optimization
> http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/
>optimization.html
>
> ...

I have well read many pages about this subject ... but I have not found any
thing for the moment to really help me ...
What can I do to optimize my PostgreSQL configuration for a special use of
Tsearch2 ...
I'm a little dispointed looking the Postgresql Russian search engine using
Tsearch2 is really quick ... why I can't haev the same result with a
bi-pentium III 933 and 1Gb of RAM with the text indexation of 1 500 000
records ?

Regards,
--
Hervé Piedvache

Elma Ingénierie Informatique
6 rue du Faubourg Saint-Honoré
F-75008 - Paris - France
Pho. 33-144949901
Fax. 33-144949902

Re: TSearch2 and optimisation ...

From
George Essig
Date:
--- Herv���<inputPiedvache <herve@elma.fr> wrote:

> George,
>
> I have well read many pages about this subject ... but I have not found any
> thing for the moment to really help me ...
> What can I do to optimize my PostgreSQL configuration for a special use of
> Tsearch2 ...
> I'm a little dispointed looking the Postgresql Russian search engine using
> Tsearch2 is really quick ... why I can't haev the same result with a
> bi-pentium III 933 and 1Gb of RAM with the text indexation of 1 500 000
> records ?
>
> Regards,
> --
> Herv���<inputPiedvache
>
> Elma Ing���<inputierie Informatique
> 6 rue du Faubourg Saint-Honor���<input> F-75008 - Paris - France
> Pho. 33-144949901
> Fax. 33-144949902
>

Tsearch does not scale indefinitely.  It was designed for fast online updates and to be integrated
into PostgreSQL.  My understanding is that it uses a bloom filter together with bit string
signatures.  Typically, full text searches use inverted indexes, scale better, but are slower to
update.

My understanding is that tsearch has a practical limit of 100,000 distinct word stems or lexemes.
Note that word stems are not words.  Word stems are what are actually stored in a tsvector after
parsing and dictionary processing.

The key to making tsearch fast is to keep the number of word stems low.  You decrease the number
of word stems by using stop words, various dictionaries, synonyms, and preprocessing text before
it gets to tsearch.  You can find what word stems are stored in a tsvector column by using the
stat function.  For examples of how to use the stat function, see:

http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/stat.html

Note that the stat function will take a long time to run on large tables.

Performance tuning must be done on a case by case basis.  It can take some time to try different
things and see the change in performance.  Each time you try something new, use the stat function
to see how the number of word stems has changed.

The largest project I used tsearch2 on contained 900,000 records.  Without performance tuning,
there were 275,000 distinct word stems.  After performance tuning, I got it down to 14,000
distinct word stems.

By using the stat function, I noticed some obvious stop words that were very frequent that nobody
would ever search for.  For how to use stop words, see:

http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/stop_words.html

Also I noticed some strange patterns by looking through all of the word stems.

In one case, strings of 3 to 7 words were joined together with hyphens to indicate category
nesting.  Tsearch would store these long hyphenated words intact and also store the stem of each
individual word.  I made a judgment call that no one would ever search for the long hyphenated
words, so I preprocessed the text to remove the hyphens.

I also noticed that many of the word stems were alphanumeric IDs that were designed to be unique.
There were many of these IDs in the tsvector column although each ID would occur only once or
twice.  I again preprocessed the text to remove these IDs, but created a btree index on a varchar
column representing the IDs.  My search form allows users to either search full text using
tsearch2 or search IDs using 'LIKE' queries which use a btree index.  For 'LIKE' queries, it was
another matter to get postgres to use the btree index and not use a sequential scan.  For this,
see:

http://www.postgresql.org/docs/7.4/static/indexes-opclass.html

Last, I noticed that most users wanted to restrict the full text search to a subset determined by
another column in the table.  As a result, I created a multicolumn gist index on an integer column
and a tsvector column.  For how to setup a multicolumn gist index, see:

http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/multi_column_index.html

There are no easy answers.  Like I said, performance tuning must be done on a case by case basis.

Hope this helps,

George Essig