Thread: Tsearch2 performance on big database

Tsearch2 performance on big database

From: Rick Jansen

Hi,

I'm looking for a *fast* solution to search through ~4 million records of
book descriptions. I've installed PostgreSQL 8.0.1 on a dual Opteron
server with 8G of memory, running Linux 2.6. I haven't done a lot of
tuning on PostgreSQL itself, but here are the settings I've changed so far:

shared_buffers = 2000 (anything much bigger and the kernel says it isn't
  allowed; still have to look into that)
effective_cache_size = 32768

Here's my table:

ilab=# \d books
                                        Table "public.books"
     Column     |          Type          |                        Modifiers
---------------+------------------------+----------------------------------------------------------
  recordnumber  | integer                | not null default
nextval('books_recordnumber_seq'::text)
  membernumber  | integer                | not null default 0
  booknumber    | character varying(20)  | not null default
''::character varying
  author        | character varying(60)  | not null default
''::character varying
  titel         | text                   | not null
  description   | character varying(100) | not null default
''::character varying
  descriprest   | text                   | not null
  price         | bigint                 | not null default 0::bigint
  keywords      | character varying(100) | not null default
''::character varying
  dollarprice   | bigint                 | not null default 0::bigint
  countrynumber | smallint               | not null default 0::smallint
  entrydate     | date                   | not null
  status        | smallint               | not null default 0::smallint
  recordtype    | smallint               | not null default 0::smallint
  bookflags     | smallint               | not null default 0::smallint
  year          | smallint               | not null default 0::smallint
  firstedition  | smallint               | not null default 0::smallint
  dustwrapper   | smallint               | not null default 0::smallint
  signed        | smallint               | not null default 0::smallint
  cover         | smallint               | not null default 0::smallint
  specialfield  | smallint               | not null default 0::smallint
  idxfti        | tsvector               |
Indexes:
     "recordnumber_idx" unique, btree (recordnumber)
     "idxfti_idx" gist (idxfti)

idxfti is a tsvector of concatenated description and descriprest.
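For reference, this is roughly how the column and index were set up,
following the tsearch2 intro (the trigger that keeps idxfti up to date
is omitted here):

   UPDATE books SET idxfti = to_tsvector('default',
       coalesce(description,'') || ' ' || coalesce(descriprest,''));
   CREATE INDEX idxfti_idx ON books USING gist (idxfti);
   VACUUM FULL ANALYZE books;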

ilab=# select
avg(character_length(description)),avg(character_length(descriprest))
from books;
          avg         |         avg
---------------------+----------------------
  89.1596992873947218 | 133.0468689304200538

Queries take forever to run. Right now we run a MySQL server, on which
we maintain our own indices (we split the description fields by word and
have separate tables for words and the book descriptions they appear in).

For example, a query for the word 'terminology' on our MySQL search
takes 5.8 seconds and returns 375 results. The same query on PostgreSQL
using the tsearch2 index takes 30802.105 ms and returns 298 results.

How do I speed this up? Should I change settings, add or change indexes,
or... what?

Rick Jansen
--
Systems Administrator for Rockingstone IT
http://www.rockingstone.com
http://www.megabooksearch.com - Search many book listing sites at once

Re: Tsearch2 performance on big database

From: Oleg Bartunov

On Tue, 22 Mar 2005, Rick Jansen wrote:

> Hi,
>
> I'm looking for a *fast* solution to search through ~4 million records of book
> descriptions. I've installed PostgreSQL 8.0.1 on a dual Opteron server with
> 8G of memory, running Linux 2.6. I haven't done a lot of tuning on PostgreSQL
> itself, but here are the settings I've changed so far:
>
> shared_buffers = 2000 (anything much bigger and the kernel says it isn't
> allowed; still have to look into that)

use something like
echo "150000000" > /proc/sys/kernel/shmmax
to increase shared memory. In your case you could dedicate much more
memory.
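
Then raise shared_buffers in postgresql.conf, restart the postmaster,
and check what the server actually uses from psql:

   SHOW shared_buffers;          -- in 8.0, a count of 8 kB buffers
   SHOW effective_cache_size;    -- likewise counted in 8 kB pages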


     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Tsearch2 performance on big database

From: Mike Rylander

On Tue, 22 Mar 2005 15:36:11 +0300 (MSK), Oleg Bartunov <oleg@sai.msu.su> wrote:
> On Tue, 22 Mar 2005, Rick Jansen wrote:
>
> > Hi,
> >
> > I'm looking for a *fast* solution to search through ~4 million records of book
> > descriptions. I've installed PostgreSQL 8.0.1 on a dual Opteron server with
> > 8G of memory, running Linux 2.6. I haven't done a lot of tuning on PostgreSQL
> > itself, but here are the settings I've changed so far:
> >
> > shared_buffers = 2000 (anything much bigger and the kernel says it isn't
> > allowed; still have to look into that)
>
> use something like
> echo "150000000" > /proc/sys/kernel/shmmax
> to increase shared memory. In your case you could dedicate much more
> memory.
>
>         Regards,
>                 Oleg

And Oleg should know.  Unless I'm mistaken, he (co)wrote tsearch2.
Other than shared buffers, I can't imagine what could be causing that
kind of slowness.  EXPLAIN ANALYZE, please?

As an example of what I think you *should* be seeing, I have a similar
box (4 procs, but that doesn't matter for one query) and I can search
a column with tens of millions of rows in around a second.

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org

Re: Tsearch2 performance on big database

From: Oleg Bartunov

On Tue, 22 Mar 2005, Mike Rylander wrote:

>
> And Oleg should know.  Unless I'm mistaken, he (co)wrote tsearch2.

You're not mistaken :)

> Other than shared buffers, I can't imagine what could be causing that
> kind of slowness.  EXPLAIN ANALYZE, please?
>

tsearch2 configs are also very important. I've seen a lot of
mistakes in configs!


     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Tsearch2 performance on big database

From: Rick Jansen

Mike Rylander wrote:
> On Tue, 22 Mar 2005 15:36:11 +0300 (MSK), Oleg Bartunov <oleg@sai.msu.su> wrote:
>>
>>use something like
>>echo "150000000" > /proc/sys/kernel/shmmax
>>to increase shared memory. In your case you could dedicate much more
>>memory.
>>
>>        Regards,
>>                Oleg


Thanks, I'll check that out.

> And Oleg should know.  Unless I'm mistaken, he (co)wrote tsearch2.
> Other than shared buffers, I can't imagine what could be causing that
> kind of slowness.  EXPLAIN ANALYZE, please?
>

ilab=# explain analyze select count(titel) from books where idxfti @@
to_tsquery('default', 'buckingham | palace');
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=35547.99..35547.99 rows=1 width=56) (actual time=125968.119..125968.120 rows=1 loops=1)
   ->  Index Scan using idxfti_idx on books  (cost=0.00..35525.81 rows=8869 width=56) (actual time=0.394..125958.245 rows=3080 loops=1)
         Index Cond: (idxfti @@ '\'buckingham\' | \'palac\''::tsquery)
 Total runtime: 125968.212 ms
(4 rows)

Time: 125969.264 ms
ilab=#

 > As an example of what I think you *should* be seeing, I have a similar
 > box (4 procs, but that doesn't matter for one query) and I can search
 > a column with tens of millions of rows in around a second.
 >

That sounds very promising, I'd love to get those results... could you
tell me what your settings are, how much memory you have, and such? Thanks.

Rick


--
Systems Administrator for Rockingstone IT
http://www.rockingstone.com
http://www.megabooksearch.com - Search many book listing sites at once

Re: Tsearch2 performance on big database

From: Mike Rylander

On Tue, 22 Mar 2005 14:25:19 +0100, Rick Jansen <rick@rockingstone.nl> wrote:
>
> ilab=# explain analyze select count(titel) from books where idxfti @@
> to_tsquery('default', 'buckingham | palace');
>                                        QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=35547.99..35547.99 rows=1 width=56) (actual time=125968.119..125968.120 rows=1 loops=1)
>    ->  Index Scan using idxfti_idx on books  (cost=0.00..35525.81 rows=8869 width=56) (actual time=0.394..125958.245 rows=3080 loops=1)
>          Index Cond: (idxfti @@ '\'buckingham\' | \'palac\''::tsquery)
>  Total runtime: 125968.212 ms
> (4 rows)
>
> Time: 125969.264 ms
> ilab=#

Ahh... I should have qualified my claim.  I am creating a google-esque
search interface and almost every query uses '&' as the term joiner.
'AND' queries and one-term queries are orders of magnitude faster than
'OR' queries, and fortunately are the expected default for most users.
 (Think, "I typed in these words, therefore I want to match these
words"...)  An interesting test may be to time multiple queries
independently, one for each search term, and see if the combined cost
is less than a single 'OR' search.  If so, you could use UNION to join
the results.

However, the example you originally gave ('terminology') should be
very fast.  On a comparable query ("select count(value) from
metabib.full_rec where index_vector @@ to_tsquery('default','jane');")
I get 12ms.

Oleg, do you see anything else on the surface here?

Try:

EXPLAIN ANALYZE
  SELECT titel FROM books WHERE idxfti @@
    to_tsquery('default', 'buckingham')
  UNION
  SELECT titel FROM books WHERE idxfti @@
    to_tsquery('default', 'palace');

and see if using '&' instead of '|', where you can, helps out.  I
imagine you'd be surprised by the speed of:

  SELECT titel FROM books WHERE idxfti @@
    to_tsquery('default', 'buckingham&palace');


>
>  > As an example of what I think you *should* be seeing, I have a similar
>  > box (4 procs, but that doesn't matter for one query) and I can search
>  > a column with tens of millions of rows in around a second.
>  >
>
> That sounds very promising, I'd love to get those results... could you
> tell me what your settings are, how much memory you have, and such?

16G of RAM on a dedicated machine.


shared_buffers = 15000          # min 16, at least max_connections*2, 8KB each
work_mem = 10240                # min 64, size in KB
maintenance_work_mem = 1000000  # min 1024, size in KB
# big m_w_m for loading data...

random_page_cost = 2.5          # units are one sequential page fetch cost
# fast drives, and tons of RAM


--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org

Re: Tsearch2 performance on big database

From: Oleg Bartunov

Mike,

no comments until Rick posts his tsearch configs and results with
increased buffers! UNION shouldn't be faster than (term1|term2).
The tsearch2 internals description might help you understand tsearch2's limitations.
See  http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
Also, don't miss my notes:
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes

Oleg

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Tsearch2 performance on big database

From: Rick Jansen

Oleg Bartunov wrote:
> Mike,
>
> no comments before Rick post tsearch configs and increased buffers !
> Union shouldn't be faster than (term1|term2).
> tsearch2 internals description might help you understanding tsearch2
> limitations.
> See  http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
> Also, don't miss my notes:
> http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes
>
> Oleg

Thanks Oleg, I've seen those pages before :) I've set shared_buffers to
45000 now (yes, that's probably a lot, isn't it?) and it already seems
much quicker.

How do I find out what my tsearch config is? I followed the intro
(http://www.sai.msu.su/~megera/oddmuse/index.cgi/tsearch-v2-intro) and
applied it to our books table; that's all, I didn't change anything else
about the configs.


> On Tue, 22 Mar 2005, Mike Rylander wrote:
>> Ahh... I should have qualified my claim.  I am creating a google-esque
>> search interface and almost every query uses '&' as the term joiner.
>> 'AND' queries and one-term queries are orders of magnitude faster than
>> 'OR' queries, and fortunately are the expected default for most users.
>> (Think, "I typed in these words, therefore I want to match these
>> words"...)  An interesting test may be to time multiple queries
>> independently, one for each search term, and see if the combined cost
>> is less than a single 'OR' search.  If so, you could use UNION to join
>> the results.

Well, I just asked my colleagues, and OR queries aren't used by us anyway,
so I'll test AND queries instead.

>> However, the example you originally gave ('terminology') should be
>> very fast.  On a comparable query ("select count(value) from
>> metabib.full_rec where index_vector @@ to_tsquery('default','jane');")
>> I get 12ms.

ilab=# select count(*) from books where idxfti @@ to_tsquery('default',
'jane');
  count
-------
   4093
(1 row)
Time: 217395.820 ms

:(

ilab=# explain analyze select count(*) from books where idxfti @@
to_tsquery('default', 'jane');
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16591.95..16591.95 rows=1 width=0) (actual time=4634.931..4634.932 rows=1 loops=1)
   ->  Index Scan using idxfti_idx on books  (cost=0.00..16581.69 rows=4102 width=0) (actual time=0.395..4631.454 rows=4093 loops=1)
         Index Cond: (idxfti @@ '\'jane\''::tsquery)
 Total runtime: 4635.023 ms
(4 rows)

Time: 4636.028 ms
ilab=#

>> 16G of RAM on a dedicated machine.
>>
>>
>> shared_buffers = 15000          # min 16, at least max_connections*2,
>> 8KB each
>> work_mem = 10240                # min 64, size in KB
>> maintenance_work_mem = 1000000  # min 1024, size in KB
>> # big m_w_m for loading data...
>>
>> random_page_cost = 2.5          # units are one sequential page fetch
>> cost
>> # fast drives, and tons of RAM
>>

Right... well, I'll try copying these settings and see how that works out,
thanks :)

Rick
--
Systems Administrator for Rockingstone IT
http://www.rockingstone.com
http://www.megabooksearch.com - Search many book listing sites at once

Re: Tsearch2 performance on big database

From: Oleg Bartunov

On Tue, 22 Mar 2005, Rick Jansen wrote:

> Thanks Oleg, I've seen those pages before :) I've set shared_buffers to
> 45000 now (yes, that's probably a lot, isn't it?) and it already seems
> much quicker.
>
> How do I find out what my tsearch config is? I followed the intro
> (http://www.sai.msu.su/~megera/oddmuse/index.cgi/tsearch-v2-intro) and
> applied it to our books table; that's all, I didn't change anything else
> about the configs.

Hmm, the default configuration is too eager: you index every lexeme using
the simple dictionary! That's probably too much. Here is what I have for my
Russian configuration in the dictionary database:

  default_russian | lword        | {en_ispell,en_stem}
  default_russian | lpart_hword  | {en_ispell,en_stem}
  default_russian | lhword       | {en_ispell,en_stem}
  default_russian | nlword       | {ru_ispell,ru_stem}
  default_russian | nlpart_hword | {ru_ispell,ru_stem}
  default_russian | nlhword      | {ru_ispell,ru_stem}

Notice that I index only Russian and English words; no numbers, URLs, etc.
You could just delete the unwanted rows in pg_ts_cfgmap for your configuration,
but I'd recommend updating them instead, setting dict_name to NULL.
For example, to stop indexing integers:

update pg_ts_cfgmap set dict_name=NULL where ts_name='default_russian'
and tok_alias='int';
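(Likewise for 'uint', 'float', 'url', and whatever other token types you
don't want indexed.)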

voc=# select token,dict_name,tok_type,tsvector from ts_debug('Do you have +70000 bucks');
  token  |      dict_name      | tok_type | tsvector
--------+---------------------+----------+----------
  Do     | {en_ispell,en_stem} | lword    |
  you    | {en_ispell,en_stem} | lword    |
  have   | {en_ispell,en_stem} | lword    |
  +70000 |                     | int      |
  bucks  | {en_ispell,en_stem} | lword    | 'buck'

Only 'bucks' gets indexed :)
Hmm, I should probably add this to the documentation.

Also, what about your word statistics (the number of unique words, for example)?



     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Tsearch2 performance on big database

From: Rick Jansen

Oleg Bartunov wrote:
> On Tue, 22 Mar 2005, Rick Jansen wrote:
>
> Hmm, default configuration is too eager, you index every lexem using
> simple dictionary) ! Probably, it's too much. Here is what I have for my
> russian configuration in dictionary database:
>
>  default_russian | lword        | {en_ispell,en_stem}
>  default_russian | lpart_hword  | {en_ispell,en_stem}
>  default_russian | lhword       | {en_ispell,en_stem}
>  default_russian | nlword       | {ru_ispell,ru_stem}
>  default_russian | nlpart_hword | {ru_ispell,ru_stem}
>  default_russian | nlhword      | {ru_ispell,ru_stem}
>
> Notice, I index only russian and english words, no numbers, url, etc.
> You may just delete unwanted rows in pg_ts_cfgmap for your configuration,
> but I'd recommend just update them setting dict_name to NULL.
> For example, to not indexing integers:
>
> update pg_ts_cfgmap set dict_name=NULL where ts_name='default_russian'
> and tok_alias='int';
>
> voc=# select token,dict_name,tok_type,tsvector from ts_debug('Do you
> have +70000 bucks');
>  token  |      dict_name      | tok_type | tsvector
> --------+---------------------+----------+----------
>  Do     | {en_ispell,en_stem} | lword    |
>  you    | {en_ispell,en_stem} | lword    |
>  have   | {en_ispell,en_stem} | lword    |
>  +70000 |                     | int      |
>  bucks  | {en_ispell,en_stem} | lword    | 'buck'
>
> Only 'bucks' gets indexed :)
> Hmm, probably I should add this into documentation.
>
> What about word statistics (# of unique words, for example).
>

I'm now following the guide to add the ispell dictionary, and I've
updated most of the rows, setting dict_name to NULL:

      ts_name     |  tok_alias   | dict_name
-----------------+--------------+-----------
  default         | lword        | {en_stem}
  default         | nlword       | {simple}
  default         | word         | {simple}
  default         | part_hword   | {simple}
  default         | nlpart_hword | {simple}
  default         | lpart_hword  | {en_stem}
  default         | hword        | {simple}
  default         | lhword       | {en_stem}
  default         | nlhword      | {simple}

These are left, but I have no idea what 'hword' or 'nlhword' or any of
these other token types are.

Anyway, how do I find out the number of unique words or other word
statistics?

Rick
--
Systems Administrator for Rockingstone IT
http://www.rockingstone.com
http://www.megabooksearch.com - Search many book listing sites at once

Re: Tsearch2 performance on big database

From: Oleg Bartunov

On Wed, 23 Mar 2005, Rick Jansen wrote:

> I'm now following the guide to add the ispell dictionary, and I've
> updated most of the rows, setting dict_name to NULL:
>
>     ts_name     |  tok_alias   | dict_name
> -----------------+--------------+-----------
> default         | lword        | {en_stem}
> default         | nlword       | {simple}
> default         | word         | {simple}
> default         | part_hword   | {simple}
> default         | nlpart_hword | {simple}
> default         | lpart_hword  | {en_stem}
> default         | hword        | {simple}
> default         | lhword       | {en_stem}
> default         | nlhword      | {simple}
>
> These are left, but I have no idea what 'hword' or 'nlhword' or any of
> these other token types are.

From my notes, http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes:
I've been asked how to know the token types supported by the parser.
There is a function token_type(parser), so you can just use:

     select * from token_type();

>
> Anyway, how do I find out the number of unique words or other word
> statistics?


From my notes, http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes:

It's useful to see word statistics, for example, to check how well your
dictionaries work or how you configured pg_ts_cfgmap. You may also notice
probable stop words relevant to your collection.
Tsearch provides the stat() function:

.......................

Don't hesitate to read it, and if you find bugs or know better wording,
I'd be glad to improve my notes.

>
> Rick
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Tsearch2 performance on big database

From: Rick Jansen

Oleg Bartunov wrote:
> From my notes,
> http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes:
>
> It's useful to see word statistics, for example, to check how well your
> dictionaries work or how you configured pg_ts_cfgmap. You may also notice
> probable stop words relevant to your collection.
> Tsearch provides the stat() function:
>
> .......................
>
> Don't hesitate to read it, and if you find bugs or know better wording,
> I'd be glad to improve my notes.

Thanks, but that stat() query takes way too long... I let it run for
about 4 hours and still nothing. The database server I am testing
tsearch2 on also runs our production (MySQL) database, so I have to be
careful not to use too many resources :o

Anyway, here's my pg_ts_cfgmap now (well, the relevant bits):

default_english | lhword       | {en_ispell,en_stem}
default_english | lpart_hword  | {en_ispell,en_stem}
default_english | lword        | {en_ispell,en_stem}

Is it normal that queries for single words (or perhaps these are common
words) take a really long time? Like this:

ilab=# explain analyze select count(*) from books where description_fti
@@ to_tsquery('default', 'hispanic');
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=20369.81..20369.81 rows=1 width=0) (actual time=261512.031..261512.031 rows=1 loops=1)
   ->  Index Scan using idxfti_idx on books  (cost=0.00..20349.70 rows=8041 width=0) (actual time=45777.760..261509.288 rows=674 loops=1)
         Index Cond: (description_fti @@ '\'hispan\''::tsquery)
 Total runtime: 261518.529 ms
(4 rows)

ilab=# explain analyze select titel from books where description_fti @@
to_tsquery('default', 'buckingham & palace');
                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idxfti_idx on books  (cost=0.00..20349.70 rows=8041 width=57) (actual time=18992.045..48863.385 rows=185 loops=1)
   Index Cond: (description_fti @@ '\'buckingham\' & \'palac\''::tsquery)
 Total runtime: 48863.874 ms
(3 rows)


I don't know what happened; these queries were a lot faster 2 days
ago... what the feck is going on?!

Rick

--
Systems Administrator for Rockingstone IT
http://www.rockingstone.com
http://www.megabooksearch.com - Search many book listing sites at once

Re: Tsearch2 performance on big database

From: Oleg Bartunov

On Thu, 24 Mar 2005, Rick Jansen wrote:

> Thanks, but that stat() query takes way too long... I let it run for
> about 4 hours and still nothing. The database server I am testing
> tsearch2 on also runs our production (MySQL) database, so I have to be
> careful not to use too many resources :o

stat() is indeed a big dog; it was designed for developers' needs,
so we recommend saving its results in a table.
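For example, something like this (the table name is just an example, and
I assume your tsvector column is still idxfti):

   CREATE TABLE books_stat AS
       SELECT * FROM stat('SELECT idxfti FROM books');
   -- most frequent lexemes first
   SELECT word, ndoc, nentry FROM books_stat ORDER BY ndoc DESC LIMIT 20;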

>
> Anyway, here's my pg_ts_cfgmap now (well the relevant bits):
>
> default_english | lhword       | {en_ispell,en_stem}
> default_english | lpart_hword  | {en_ispell,en_stem}
> default_english | lword        | {en_ispell,en_stem}
>
> Is it normal that queries for single words (or perhaps they are words
> that are common) take a really long time? Like this:
>

'hispanic' isn't common: I see you get only 674 rows, and 'buckingham &
palace' returns 185 rows. Did you run 'vacuum analyze'?
I see a big discrepancy between the estimated rows (8041) and the actual rows.




     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Tsearch2 performance on big database

From: Rick Jansen

Oleg Bartunov wrote:
>
> stat() is indeed a big dog; it was designed for developers' needs,
> so we recommend saving its results in a table.
>
> 'hispanic' isn't common: I see you get only 674 rows, and 'buckingham &
> palace' returns 185 rows. Did you run 'vacuum analyze'?
> I see a big discrepancy between the estimated rows (8041) and the actual rows.
>

Yes, I did a vacuum analyze right before executing these queries.

I'm going to recreate the GiST index now, and do a vacuum full analyze
after that... see if that makes a difference.
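That is, roughly:

   DROP INDEX idxfti_idx;
   CREATE INDEX idxfti_idx ON books USING gist (idxfti);
   VACUUM FULL ANALYZE books;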

Rick

--
Systems Administrator for Rockingstone IT
http://www.rockingstone.com
http://www.megabooksearch.com - Search many book listing sites at once