Thread: n-gram search function

n-gram search function

From
Tatsuo Ishii
Date:
Hi,

Is anybody working on implementing n-gram search functionality for
text type data? tsearch2 is great for long text but it's not
appropreate for short (10-100 bytes) text data. What I want to achieve
is a fast partial match search using indexes, i.e. foo ~ 'bar' or foo
LIKE '%bar%' type matching.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: n-gram search function

From
Oleg Bartunov
Date:
3-gram is implemented as a contrib/pg_trgm. It currently uses GiST index,
but may be enhanced with the GiN.

Oleg

On Sat, 17 Feb 2007, Tatsuo Ishii wrote:

> Hi,
>
> Is anybody working on implementing n-gram search functionality for
> text type data? tsearch2 is great for long text but it's not
> appropreate for short (10-100 bytes) text data. What I want to achieve
> is a fast partial match search using indexes, i.e. foo ~ 'bar' or foo
> LIKE '%bar%' type matching.
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: n-gram search function

From
Tatsuo Ishii
Date:
Thanks. I'll look into this.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

> 3-gram is implemented as a contrib/pg_trgm. It currently uses GiST index,
> but may be enhanced with the GiN.
> 
> Oleg
> 
> On Sat, 17 Feb 2007, Tatsuo Ishii wrote:
> 
> > Hi,
> >
> > Is anybody working on implementing n-gram search functionality for
> > text type data? tsearch2 is great for long text but it's not
> > appropreate for short (10-100 bytes) text data. What I want to achieve
> > is a fast partial match search using indexes, i.e. foo ~ 'bar' or foo
> > LIKE '%bar%' type matching.
> > --
> > Tatsuo Ishii
> > SRA OSS, Inc. Japan
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 2: Don't 'kill -9' the postmaster
> >
> 
>      Regards,
>          Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings



Re: n-gram search function

From
Tatsuo Ishii
Date:
Thanks. I'll look into this.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

> 3-gram is implemented as a contrib/pg_trgm. It currently uses GiST index,
> but may be enhanced with the GiN.
> 
> Oleg
> 
> On Sat, 17 Feb 2007, Tatsuo Ishii wrote:
> 
> > Hi,
> >
> > Is anybody working on implementing n-gram search functionality for
> > text type data? tsearch2 is great for long text but it's not
> > appropreate for short (10-100 bytes) text data. What I want to achieve
> > is a fast partial match search using indexes, i.e. foo ~ 'bar' or foo
> > LIKE '%bar%' type matching.
> > --
> > Tatsuo Ishii
> > SRA OSS, Inc. Japan
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 2: Don't 'kill -9' the postmaster
> >
> 
>      Regards,
>          Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83




Re: n-gram search function

From
"Guillaume Smet"
Date:
Hi Oleg,

On 2/17/07, Oleg Bartunov <oleg@sai.msu.su> wrote:
> 3-gram is implemented as a contrib/pg_trgm. It currently uses GiST index,
> but may be enhanced with the GiN.

As I'm facing the same problem, I've taken a look to pg_trgm. At the
moment, my opinion is quite mixed but perhaps I did something wrong.

I have a table (100k rows) with a location name in it generally
composed of several words but not that long. I created the index
directly on this column (ie I don't create a table with each word of
the location name). Then I tried a few queries.

Here is an example:

prod=# explain analyze select nomlieu from lieu where nomlieu ilike '%gaumont%';
   QUERY PLAN
 
-----------------------------------------------------------------------------------------------------Seq Scan on lieu
(cost=0.00..7230.20rows=7 width=21) (actual
 
time=7.768..556.930 rows=39 loops=1)  Filter: ((nomlieu)::text ~~* '%gaumont%'::text)Total runtime: 557.066 ms
(3 rows)

_prod=# explain analyze select nomlieu from lieu where nomlieu % 'gaumont';
              QUERY
 
PLAN

-------------------------------------------------------------------------------------------------------------------------------------Bitmap
HeapScan on lieu  (cost=3.37..200.80 rows=106 width=21)
 
(actual time=689.799..690.035 rows=36 loops=1)  Recheck Cond: ((nomlieu)::text % 'gaumont'::text)  ->  Bitmap Index
Scanon idx_lieu_nomlieu_trgm  (cost=0.00..3.37
 
rows=106 width=0) (actual time=689.749..689.749 rows=36 loops=1)        Index Cond: ((nomlieu)::text %
'gaumont'::text)Totalruntime: 690.195 ms
 
(5 rows)

The trigram version is slower and doesn't return 3 results I should
have. The 3 results it doesn't return have the word gaumont in them at
the start of the string exactly like the others.

Is there anything I can do to improve the performances and investigate
why I don't have these 3 results?

Thanks.

--
Guillaume


Re: n-gram search function

From
Oleg Bartunov
Date:
On Sun, 18 Feb 2007, Guillaume Smet wrote:

> Hi Oleg,
>
> On 2/17/07, Oleg Bartunov <oleg@sai.msu.su> wrote:
>> 3-gram is implemented as a contrib/pg_trgm. It currently uses GiST index,
>> but may be enhanced with the GiN.
>
> As I'm facing the same problem, I've taken a look to pg_trgm. At the
> moment, my opinion is quite mixed but perhaps I did something wrong.
>
> I have a table (100k rows) with a location name in it generally
> composed of several words but not that long. I created the index
> directly on this column (ie I don't create a table with each word of
> the location name). Then I tried a few queries.
>
> Here is an example:
>
> prod=# explain analyze select nomlieu from lieu where nomlieu ilike 
> '%gaumont%';
>                                            QUERY PLAN
> -----------------------------------------------------------------------------------------------------
> Seq Scan on lieu  (cost=0.00..7230.20 rows=7 width=21) (actual
> time=7.768..556.930 rows=39 loops=1)
>  Filter: ((nomlieu)::text ~~* '%gaumont%'::text)
> Total runtime: 557.066 ms
> (3 rows)
>
> _prod=# explain analyze select nomlieu from lieu where nomlieu % 'gaumont';
>                                                            QUERY
> PLAN
>
-------------------------------------------------------------------------------------------------------------------------------------
> Bitmap Heap Scan on lieu  (cost=3.37..200.80 rows=106 width=21)
> (actual time=689.799..690.035 rows=36 loops=1)
>  Recheck Cond: ((nomlieu)::text % 'gaumont'::text)
>  ->  Bitmap Index Scan on idx_lieu_nomlieu_trgm  (cost=0.00..3.37
> rows=106 width=0) (actual time=689.749..689.749 rows=36 loops=1)
>        Index Cond: ((nomlieu)::text % 'gaumont'::text)
> Total runtime: 690.195 ms
> (5 rows)
>
> The trigram version is slower and doesn't return 3 results I should
> have. The 3 results it doesn't return have the word gaumont in them at
> the start of the string exactly like the others.
>
> Is there anything I can do to improve the performances and investigate
> why I don't have these 3 results?

pg_trgm was developed for spelling corrrection and there is a threshold of
similarity, which is 0.3 by default. Readme explains what does it means.

Similarity could be very low, since you didn't make separate column and length
of the full string is used to normalize similarity.

pg_trgm as is isn't well suited for wild card search, but the idea is there.
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: n-gram search function

From
"Guillaume Smet"
Date:
On 2/19/07, Oleg Bartunov <oleg@sai.msu.su> wrote:
> pg_trgm was developed for spelling corrrection and there is a threshold of
> similarity, which is 0.3 by default. Readme explains what does it means.

Yes, I read it.

> Similarity could be very low, since you didn't make separate column and length
> of the full string is used to normalize similarity.

Yep, that's probably my problem. Ignored records are a bit longer than
the others.

I tried the tip in README.pg_trgm to generate a table with all the words.

It can do the work in conjunction of tsearch2 and a bit of AJAX to
suggest the full words to the users. The reason why I was not using
tsearch2 is that it's sometimes hard to spell location names
correctly.

The only problem is that it is still quite slow on a 50k rows words
table but I'll make further tests on a decent server this afternoon.

--
Guillaume


Re: n-gram search function

From
Oleg Bartunov
Date:
On Mon, 19 Feb 2007, Guillaume Smet wrote:

> On 2/19/07, Oleg Bartunov <oleg@sai.msu.su> wrote:
>> pg_trgm was developed for spelling corrrection and there is a threshold of
>> similarity, which is 0.3 by default. Readme explains what does it means.
>
> Yes, I read it.
>
>> Similarity could be very low, since you didn't make separate column and 
>> length
>> of the full string is used to normalize similarity.
>
> Yep, that's probably my problem. Ignored records are a bit longer than
> the others.
>
> I tried the tip in README.pg_trgm to generate a table with all the words.
>
> It can do the work in conjunction of tsearch2 and a bit of AJAX to
> suggest the full words to the users. The reason why I was not using
> tsearch2 is that it's sometimes hard to spell location names
> correctly.
>
> The only problem is that it is still quite slow on a 50k rows words
> table but I'll make further tests on a decent server this afternoon.

You need to wait GiN support.

    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: n-gram search function

From
"Guillaume Smet"
Date:
On 2/19/07, Oleg Bartunov <oleg@sai.msu.su> wrote:
> You need to wait GiN support.

OK. Thanks.

If you need testers for this one, feel free to contact me. I'm very
interested in testing pg_trgm in conjunction with tsearch2.

--
Guillaume