Thread: Fast tsearch2, trigram matching on short phrases

Fast tsearch2, trigram matching on short phrases

From

"Carlo Stonebanks"

Date:

22 August 2007, 13:02:18

I have read that trigram matching (similarity()) performance degrades when
the matching is on longer strings such as phrases. I need to quickly match
strings and rate them by similiarity. The strings are typically one to seven
words in length - and will often include unconventional abbreviations and
misspellings.

I have a stored function which does more thorough testing of the phrases,
including spelling correction, abbreviation translation, etc... and scores
the results - I pick the winning score that passes a pass/fail constant.
However, the function is slow. My solution was to reduce the number of rows
that are passed to the function by pruning obvious mismatches using
similarity(). However, trigram matching on phrases is slow as well.

I have experimented with tsearch2 but I have two problems:

1) I need a "score" so I can decide if match passed or failed. trigram
similarity() has a fixed result that you can test, but I don't know if
rank() returns results that can be compared to a fixed value

2) I need an efficient methodology to create vectors based on trigrams, and
a way to create an index to support it. My tsearch2 experiment with normal
vectors used gist(text tsvector) and an on insert/update trigger to populate
the vector field.

Any suggestions on where to go with this project to improve performance
would be greatly appreciated.

Carlo

Re: Fast tsearch2, trigram matching on short phrases

From

"Steinar H. Gunderson"

Date:

22 August 2007, 13:30:30

On Wed, Aug 22, 2007 at 12:02:54PM -0400, Carlo Stonebanks wrote:
> Any suggestions on where to go with this project to improve performance
> would be greatly appreciated.

I'm a bit unsure from reading your mail -- have you tried pg_trgm with a GiST
index?

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Fast tsearch2, trigram matching on short phrases

From

Oleg Bartunov

Date:

22 August 2007, 15:48:50

On Wed, 22 Aug 2007, Carlo Stonebanks wrote:

> I have read that trigram matching (similarity()) performance degrades when
> the matching is on longer strings such as phrases. I need to quickly match
> strings and rate them by similiarity. The strings are typically one to seven
> words in length - and will often include unconventional abbreviations and
> misspellings.
>
> I have a stored function which does more thorough testing of the phrases,
> including spelling correction, abbreviation translation, etc... and scores
> the results - I pick the winning score that passes a pass/fail constant.
> However, the function is slow. My solution was to reduce the number of rows
> that are passed to the function by pruning obvious mismatches using
> similarity(). However, trigram matching on phrases is slow as well.

you didn't show us explain analyze of your select.

>
> I have experimented with tsearch2 but I have two problems:
>
> 1) I need a "score" so I can decide if match passed or failed. trigram
> similarity() has a fixed result that you can test, but I don't know if rank()
> returns results that can be compared to a fixed value
>
> 2) I need an efficient methodology to create vectors based on trigrams, and a
> way to create an index to support it. My tsearch2 experiment with normal
> vectors used gist(text tsvector) and an on insert/update trigger to populate
> the vector field.
>
> Any suggestions on where to go with this project to improve performance would
> be greatly appreciated.
>
> Carlo
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Fast tsearch2, trigram matching on short phrases

From

"Carlo Stonebanks"

Date:

22 August 2007, 17:24:39

In December I had tried this; it caused a tremendous slowdown in our system.
I have avoided it since then.

Do you expect pg_trgm to work with phrases? OI had read a post earlier from
an earlier support question that suggested that it I SHOULD expect
performance to degrade and that pg_trgrm was oriented towards word mathcing,
not phrase matching.

Carlo

""Steinar H. Gunderson"" <sgunderson@bigfoot.com> wrote in message
news:20070822162950.GB24930@uio.no...
> On Wed, Aug 22, 2007 at 12:02:54PM -0400, Carlo Stonebanks wrote:
>> Any suggestions on where to go with this project to improve performance
>> would be greatly appreciated.
>
> I'm a bit unsure from reading your mail -- have you tried pg_trgm with a
> GiST
> index?
>
> /* Steinar */
> --
> Homepage: http://www.sesse.net/
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>       choose an index scan if your joining column's datatypes do not
>       match
>

Re: Fast tsearch2, trigram matching on short phrases

From

"Carlo Stonebanks"

Date:

22 August 2007, 17:30:31

Hi Oleg,

> you didn't show us explain analyze of your select.

I didn't because I didn't expect any reaction to it - my understanding is
that trigram matching for phrases is not recommended because of the
performance. Do you believe that I SHOULD expect good performance from
trigram matching on phrases (a sopposed to words)?

Carlo

Re: Fast tsearch2, trigram matching on short phrases

From

Oleg Bartunov

Date:

22 August 2007, 17:43:16

On Wed, 22 Aug 2007, Carlo Stonebanks wrote:

> Hi Oleg,
>
>> you didn't show us explain analyze of your select.
>
> I didn't because I didn't expect any reaction to it - my understanding is
> that trigram matching for phrases is not recommended because of the
> performance. Do you believe that I SHOULD expect good performance from
> trigram matching on phrases (a sopposed to words)?

The problem is in idea, not in performance.

>
> Carlo
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
>               http://www.postgresql.org/about/donate
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Fast tsearch2, trigram matching on short phrases

From

"Carlo Stonebanks"

Date:

22 August 2007, 18:09:18

>> The problem is in idea, not in performance.

Oh, I think we both agree on that! ;-D

This is why I didn't post any EXPLAINs or anything like that. I thought the
problem was in the entire method of how to best zero in on the set of
records best suited for closer analysis by my phrase-matching function.

Since you asked about an EXPLAIN ANALYZE, I put together some results for
for you. I added a pg_trgm index to the table to show the performance of
GiST indexes, and did another based on exclusively on similarity(). But I
don't think that you will be surprised by what you see.

As you say, the problem is in the idea - but no matter what, I need to be
able to match phrases that will have all sorts of erratic abbreviations and
misspellings - and I have to do it at very high speeds.

I would appreciate any suggestions you might have.

Carlo



select similarity('veterans''s affairs', name) as sim, name
from institution
where name % 'veterans''s affairs'
order by sim desc

Sort  (cost=4068.21..4071.83 rows=1446 width=23) (actual
time=4154.962..4155.006 rows=228 loops=1)
  Sort Key: similarity('veterans''s affairs'::text, (name)::text)
  ->  Bitmap Heap Scan on institution  (cost=75.07..3992.31 rows=1446
width=23) (actual time=4152.825..4154.754 rows=228 loops=1)
        Recheck Cond: ((name)::text % 'veterans''s affairs'::text)
        ->  Bitmap Index Scan on institution_name_trgm_idx
(cost=0.00..74.71 rows=1446 width=0) (actual time=4152.761..4152.761
rows=228 loops=1)
              Index Cond: ((name)::text % 'veterans''s affairs'::text)
Total runtime: 4155.127 ms

select name
from institution
where
   similarity('veterans''s affairs', name) > 0.5
order by similarity('veterans''s affairs', name) > 0.5

Sort  (cost=142850.08..144055.17 rows=482036 width=23) (actual
time=12603.745..12603.760 rows=77 loops=1)
  Sort Key: (similarity('veterans''s affairs'::text, (name)::text) >
0.5::double precision)
  ->  Seq Scan on institution  (cost=0.00..97348.81 rows=482036 width=23)
(actual time=2032.439..12603.370 rows=77 loops=1)
        Filter: (similarity('veterans''s affairs'::text, (name)::text) >
0.5::double precision)
Total runtime: 12603.818 ms