Thread: String searching

String searching

From

Robert DiFalco

Date:

17 November 2014, 17:55:54

I notice there are several modules to create specialized indices in PostgreSQL for searching VARCHAR data. For example, fuzzy, trigram, full text, etc.

I've been googling around but I can't find the optimal method (reasonable speed and size, simplicity) for my use case.

My text searches will always be like the following. User specifies a word (e.g. "John") and I have a field called "FullName" that could return records with "John Doe", "Robert Johnson", "Joe Johnson Smith", etc. I may also extend the search criteria to other fields. So for example the query would always look like this:

SELECT * FROM MyTable WHERE upper(FullName) LIKE upper('%John%');

So you see it is also case insensitive. Pretty simple/standard stuff.

That said, which would be the best extension module to use? A "gist" index on the uppercased column? Or something else? Thanks!

Re: String searching

From

Jonathan Vanasco

Date:

18 November 2014, 01:54:46

On Nov 17, 2014, at 12:55 PM, Robert DiFalco wrote:

>     SELECT * FROM MyTable WHERE upper(FullName) LIKE upper('%John%');
>
> That said, which would be the best extension module to use? A "gist" index on the uppercased column? Or something
else?Thanks! 

Performance wise, I think a function index would probably be the best:

    CREATE INDEX mytable_lower_fullname_idx ON mytable(lower(fullname));

    SELECT * FROM mytable WHERE lower(fullname) LIKE lower('%john%');

The only reason why I use `lower` and not `upper` is that it's easier to look at when dealing with debugging and sample
queries.

I'd bench against GIN and GIST, but I think this will work the best.

The reason is that GIN/GIST use language patterns to simplify the index.  so they work great on "words"

    select plainto_tsquery('doing watching reading programming');
    'watch' & 'read' & 'program'

but not so great on "names":

    select plainto_tsquery('john doe');
     'john' & 'doe'

    select plainto_tsquery('jon doe');
     'jon' & 'doe

So you'll get a bit more overhead on the match and you won't get a smaller index (which is why they're great for
fulltext)

The search execution might turn out to be much faster.  If so, i'd love to know.  But doing a lower() search on a
lower()function index has always been ridiculously fast for me. 

This only goes for names though.  If you're searching other fields, then another search method might be considerably
better.

Re: String searching

From

David G Johnston

Date:

18 November 2014, 02:03:51

Jonathan Vanasco-7 wrote
> The reason is that GIN/GIST use language patterns to simplify the index.
> so they work great on "words"
>
>     select plainto_tsquery('doing watching reading programming');
>     'watch' & 'read' & 'program'
>
> but not so great on "names":
>
>     select plainto_tsquery('john doe');
>      'john' & 'doe'
>
>     select plainto_tsquery('jon doe');
>      'jon' & 'doe

Going from memory here so take with a cave full of salt...

This doesn't help the OP but full text search is simply a single user of
gin/gist index families; there are others.  The two features are related but
operate at different levels of abstraction/interaction.

At a cursory glance I would think trigram would be a good fit here...but as
I've never used it personally I could be way off base.  The idea being to
figure out whether the smaller string's trigrams all exist in the larger
string.

David J.




--
View this message in context: http://postgresql.nabble.com/String-searching-tp5827268p5827320.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

Re: String searching

From

Albe Laurenz

Date:

18 November 2014, 12:38:36

Jonathan Vanasco wrote:
> On Nov 17, 2014, at 12:55 PM, Robert DiFalco wrote:
>>     SELECT * FROM MyTable WHERE upper(FullName) LIKE upper('%John%');
>>
>> That said, which would be the best extension module to use? A "gist" index on the uppercased column?
>> Or something else? Thanks!
> 
> Performance wise, I think a function index would probably be the best:
> 
>     CREATE INDEX mytable_lower_fullname_idx ON mytable(lower(fullname));
> 
>     SELECT * FROM mytable WHERE lower(fullname) LIKE lower('%john%');

That index wouldn't help with the query at all.

If you really need a full substring search (i.e., you want to find
"howardjohnson"), the only thing that could help are trigram indexes.

But maybe you can lower the requirements to a prefix search (i.e.,
you want to find "john" and "johnson"), in which case a full text search
with an appropriate index would do the trick (if you use a prefix search pattern).

Yours,
Laurenz Albe

Re: String searching

From

Andy Colson

Date:

18 November 2014, 14:37:22

On 11/17/2014 7:54 PM, Jonathan Vanasco wrote:
>
> On Nov 17, 2014, at 12:55 PM, Robert DiFalco wrote:
>
>>      SELECT * FROM MyTable WHERE upper(FullName) LIKE upper('%John%');
>>
>> That said, which would be the best extension module to use? A "gist" index on the uppercased column? Or something
else?Thanks! 
>
> Performance wise, I think a function index would probably be the best:
>
>     CREATE INDEX mytable_lower_fullname_idx ON mytable(lower(fullname));
>
>     SELECT * FROM mytable WHERE lower(fullname) LIKE lower('%john%');
>
> The only reason why I use `lower` and not `upper` is that it's easier to look at when dealing with debugging and
samplequeries. 
>
> I'd bench against GIN and GIST, but I think this will work the best.
>
> The reason is that GIN/GIST use language patterns to simplify the index.  so they work great on "words"
>
>     select plainto_tsquery('doing watching reading programming');
>     'watch' & 'read' & 'program'
>
> but not so great on "names":
>
>     select plainto_tsquery('john doe');
>      'john' & 'doe'
>
>     select plainto_tsquery('jon doe');
>      'jon' & 'doe
>
> So you'll get a bit more overhead on the match and you won't get a smaller index (which is why they're great for
fulltext)
>
> The search execution might turn out to be much faster.  If so, i'd love to know.  But doing a lower() search on a
lower()function index has always been ridiculously fast for me. 
>
> This only goes for names though.  If you're searching other fields, then another search method might be considerably
better.
>

Full Text Search has another awesome benefit.  Aliases.  Bob == Robert.
  I do address searches, and I've created a custom dictionary that says
st == street, n == north, etc.

So when a person searches for 1st ne, they find all combinations of
1 street north east.

Its indexes, so its fast.  Ram and disk are cheap, who cares how big it is.

-Andy

Re: String searching

From

Jonathan Vanasco

Date:

18 November 2014, 16:01:55

On Nov 18, 2014, at 7:38 AM, Albe Laurenz wrote:

That index wouldn't help with the query at all.

If you really need a full substring search (i.e., you want to find
"howardjohnson"), the only thing that could help are trigram indexes.

I stand corrected.

I ran a sample query on my test database of 100k names

using a function index `lower(name)`

this runs an index scan in .2ms

... where lower(name) = lower('bob');

but this runs a sequential scan in 90ms:

... where lower(name) like lower('%bob%');

I didn't know that 'like' doesn't run on indexes!

using a trigaram index,

this runs a bitmap index on the trigram, then a bitmap heap on the table. 13ms.

...where name ilike '%bob%';

Re: String searching

From

Robert DiFalco

Date:

18 November 2014, 16:49:48

Thanks everyone.

Either I'm not that smart or I am working on too many things at once (or both) but making Full Text work seems super tedious. I just have a single VARCHAR field for name, so the full name "William S. Burroughs" is a single row and column. I want to as simply as possible have the ability to search find this record with Will, will, Burr, burroughs, etc.

As far as I can tell, the trigram extension would be the easiest way to implement this. It looks like I wouldn't need to mess with vectors, etc. It would just look like a standard index and query, right? It seems that if I need something more powerful in the future that I could always move to ElasticSearch, Sphinx, or something similar.

Does this sound about right?

On Tue, Nov 18, 2014 at 8:01 AM, Jonathan Vanasco <postgres@2xlp.com> wrote:

On Nov 18, 2014, at 7:38 AM, Albe Laurenz wrote:

That index wouldn't help with the query at all.

If you really need a full substring search (i.e., you want to find
"howardjohnson"), the only thing that could help are trigram indexes.

I stand corrected.

I ran a sample query on my test database of 100k names

using a function index `lower(name)`

this runs an index scan in .2ms
... where lower(name) = lower('bob');

but this runs a sequential scan in 90ms:
... where lower(name) like lower('%bob%');

I didn't know that 'like' doesn't run on indexes!

using a trigaram index,

this runs a bitmap index on the trigram, then a bitmap heap on the table. 13ms.
...where name ilike '%bob%';

Re: String searching

From

Jonathan Vanasco

Date:

18 November 2014, 16:59:53

On Nov 18, 2014, at 11:49 AM, Robert DiFalco wrote:

> As far as I can tell, the trigram extension would be the easiest way to implement this. It looks like I wouldn't need
tomess with vectors, etc. It would just look like a standard index and query, right? It seems that if I need something
morepowerful in the future that I could always move to ElasticSearch, Sphinx, or something similar. 

I just followed the instructions in the docs to create the index, and ran ANALYZE on the table before running a
standard"like" SELECT.

Re: String searching

From

Vick Khera

Date:

18 November 2014, 18:01:42

On Tue, Nov 18, 2014 at 11:49 AM, Robert DiFalco <robert.difalco@gmail.com> wrote:

Either I'm not that smart or I am working on too many things at once (or both) but making Full Text work seems super tedious. I just have a single VARCHAR field for name, so the full name "William S. Burroughs" is a single row and column. I want to as simply as possible have the ability to search find this record with Will, will, Burr, burroughs, etc.

As far as I can tell, the trigram extension would be the easiest way to implement this. It looks like I wouldn't need to mess with vectors, etc. It would just look like a standard index and query, right? It seems that if I need something more powerful in the future that I could always move to ElasticSearch, Sphinx, or something similar.

Does this sound about right?

It depends on how complicated you want to make your indexing and searching. With the FTS in the data store, your updates, deletes, inserts are automagically handled. You can also trivially combine your full text search with other columns like "create_date > 2010-01-01" with ease.

The three steps to success are:

1) add a column to store the tsvector with an index on it.

2) Add a trigger to populate this tsvector on insert/update.

3) change your search queries to compute the tsvector of your search term and compare that to the tsvector column instead of the original column.

4) profit.

Re: String searching

From

Kevin Grittner

Date:

18 November 2014, 20:35:31

Robert DiFalco <robert.difalco@gmail.com> wrote:

> I just have a single VARCHAR field for name, so the full name
> "William S. Burroughs" is a single row and column. I want to as
> simply as possible have the ability to search find this record
> with Will, will, Burr, burroughs, etc.

> As far as I can tell, the trigram extension would be the easiest
> way to implement this. It looks like I wouldn't need to mess with
> vectors, etc. It would just look like a standard index and query,
> right?

Right.  For what you describe, I think the KNN searches using
trigrams and a gist index sounds like the appropriate solution.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company