Thread: Q: text query search and

Q: text query search and

From
amihay gonen
Date:
Hi  I'm trying to implement a text search in PG . 

My goal to enable the user search on several columns also on partial words.

here is sample code :
create table test_test( text_data tsvector, text_a varchar,text_b varchar);

insert into test_test(text_a,text_b) select 'name 10.10.2.3 ','name3 name' from generate_series(1,500);
update test_test set text_data=to_tsvector(text_a||' '||text_b);
CREATE INDEX test_test_idx ON test_test USING gin(text_data);

explain ANALYZE select * from test_test where text_data@@plainto_tsquery('name');

my questions are :
1. why the index is not used (I guess it is related to the way the data is generated)
2, how can I use pg_trgm with ts_vector to enable to answer query like 10.10 or nam ? 
   the idea is to use the gin index , maybe there are other option without using pg_trgm?

thanks 
amihay 

Re: Q: text query search and

From
Kevin Grittner
Date:
amihay gonen <agonenil@gmail.com> wrote:

> explain ANALYZE select * from test_test
>   where text_data@@plainto_tsquery('name');
>
> my questions are :
> 1. why the index is not used (I guess it is related to the way
>    the data is generated)

Because the sequential scan has a lower cost based on available
statistics.  If I run all the example statements without any
delays, I actually see it using the index; it is only if I run
VACUUM ANALYZE (or give autovacuum enough time to do so) that I see
a sequential scan.  That's because it sees that it will need to
visit all the tuples in the heap anyway, so why bother also
visiting all the index tuples?  If you search for a word that's not
in the table, you'll see it use the index, too.

> 2, how can I use pg_trgm with ts_vector to enable to answer query
>    like 10.10 or nam ?  the idea is to use the gin index , maybe
>    there are other option without using pg_trgm?

The easy way to do that is just to add a trigram index and search
for similar strings, and forget about full text search.  If you
want to use full text search for this you may need to use special
dictionaries or parser code.  (I used both for a system to search
court document text, and it seemed to work well.)  The facilities
for custom full text search parsers seem pretty bad; I found what I
needed using regular expressions and cast to the appopriate ts
types.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Q: text query search and

From
Jeff Janes
Date:
On Thu, Jul 23, 2015 at 2:55 AM, amihay gonen <agonenil@gmail.com> wrote:
Hi  I'm trying to implement a text search in PG . 

My goal to enable the user search on several columns also on partial words.

here is sample code :
create table test_test( text_data tsvector, text_a varchar,text_b varchar);

insert into test_test(text_a,text_b) select 'name 10.10.2.3 ','name3 name' from generate_series(1,500);
update test_test set text_data=to_tsvector(text_a||' '||text_b);
CREATE INDEX test_test_idx ON test_test USING gin(text_data);

explain ANALYZE select * from test_test where text_data@@plainto_tsquery('name');

my questions are :
1. why the index is not used (I guess it is related to the way the data is generated)

It returns the entire table, so there is no point in using an index.  Yes, it is the way it is generated, the same data repeated over and over is not very realistic.  If you just want random text, I use md5(random()::text).  But if you want text that looks vaguely like English, I don't have a nice trick for that.  Maybe load the sgml files into a table.
 
2, how can I use pg_trgm with ts_vector to enable to answer query like 10.10 or nam ? 
   the idea is to use the gin index , maybe there are other option without using pg_trgm?

Do you mean:

WHERE text_a LIKE '%10.10%' or text_a LIKE '%nam%' ?

With the or, that going to be hard to optimize.  

Anyway, pg_tgrm requires its own special GIN index, it can't piggy back on the tsvector GIN index.

CREATE INDEX whatever ON test_test USING gin(text_a gin_trgm_ops);

or 

CREATE INDEX whatever ON test_test USING gin((text_a||' '||text_b) gin_trgm_ops);

But, LIKE '%10.10%' is going to be tough for a pg_trgm index to help with, unless you compile your own code after removing "#define KEEPONLYALNUM"

Cheers,

Jeff

Re: Q: text query search and

From
Dane Foster
Date:
Kevin,

I am also interested in a deeper explanation of what you mean by, "The easy way to do that is just to add a trigram index and search for similar strings, and forget about full text search." Because I need to make a decision about whether to use full text search or use other pattern matching facilities such as LIKE and/or regular expressions. For me, the reason I don't just default to full text search is the documents are relative small (i.e, HTML <= 128K) and number fewer than 10,000 so I'm not sure if the effort expended to learn the ins/outs of full text search will be beneficial to my use case.

Regards,

Dane

On Thu, Jul 23, 2015 at 12:17 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Thu, Jul 23, 2015 at 2:55 AM, amihay gonen <agonenil@gmail.com> wrote:
Hi  I'm trying to implement a text search in PG . 

My goal to enable the user search on several columns also on partial words.

here is sample code :
create table test_test( text_data tsvector, text_a varchar,text_b varchar);

insert into test_test(text_a,text_b) select 'name 10.10.2.3 ','name3 name' from generate_series(1,500);
update test_test set text_data=to_tsvector(text_a||' '||text_b);
CREATE INDEX test_test_idx ON test_test USING gin(text_data);

explain ANALYZE select * from test_test where text_data@@plainto_tsquery('name');

my questions are :
1. why the index is not used (I guess it is related to the way the data is generated)

It returns the entire table, so there is no point in using an index.  Yes, it is the way it is generated, the same data repeated over and over is not very realistic.  If you just want random text, I use md5(random()::text).  But if you want text that looks vaguely like English, I don't have a nice trick for that.  Maybe load the sgml files into a table.
 
2, how can I use pg_trgm with ts_vector to enable to answer query like 10.10 or nam ? 
   the idea is to use the gin index , maybe there are other option without using pg_trgm?

Do you mean:

WHERE text_a LIKE '%10.10%' or text_a LIKE '%nam%' ?

With the or, that going to be hard to optimize.  

Anyway, pg_tgrm requires its own special GIN index, it can't piggy back on the tsvector GIN index.

CREATE INDEX whatever ON test_test USING gin(text_a gin_trgm_ops);

or 

CREATE INDEX whatever ON test_test USING gin((text_a||' '||text_b) gin_trgm_ops);

But, LIKE '%10.10%' is going to be tough for a pg_trgm index to help with, unless you compile your own code after removing "#define KEEPONLYALNUM"

Cheers,

Jeff