Thread: optimizing SELECT with LIKE

optimizing SELECT with LIKE

From

Jorge Sarmiento

Date:

12 June 2002, 12:45:29

I have 3000000 rows in a database where I have to make a:

SELECT name FROM table WHERE name LIKE '%firstname%' AND name LIKE
'%secondname%';

to obtain the data I need.

Due to the data nature, there is no other way to look for the data, it's old
data that was registered in paper years ago, with no searchable index... just
names, that sometimes are wrote like: JORGE SARMIENTO, other times like:
SARMIENTO, JORGE and other times like JORGE LUIS SARMIENTO. and due to legal
reasons, the data must be entered the exact way it was wroten.

Is there any way to optimize postgres for doing this kind of SELECT?

thanx in advance!

Jorge S.

Re: optimizing SELECT with LIKE

From

Ericson Smith

Date:

12 June 2002, 13:25:46

Hmmm....

That's gonna be super slow.

You might want to create 2 additional tables:

keywords
--------
id
keyword

searchmap
---------
keywordid
documentid

Then index your data so that
1. There is a unique entry in keywords for your existing data
2. Your searchmap would have an entry for keywordid, and a documentid

To search for data...
1. Find the ID of the keywords in question eg:
JORGE => 55
SARMIENTO => 89

2. Get data from searchmap like so:
SELECT documentid, count(*) as hits FROM searchmap WHERE keywordid IN
(55,89) GROUP BY documentid ORDER BY hits DESC

3. Get your documentid's from the document id in #2 above.

This will *ligntning* fast.

- Ericson Smith
eric@did-it.com
http://www.did-it.com

On Wed, 2002-06-12 at 11:43, Jorge Sarmiento wrote:
> I have 3000000 rows in a database where I have to make a:
>
> SELECT name FROM table WHERE name LIKE '%firstname%' AND name LIKE
> '%secondname%';
>
> to obtain the data I need.
>
> Due to the data nature, there is no other way to look for the data, it's old
> data that was registered in paper years ago, with no searchable index... just
> names, that sometimes are wrote like: JORGE SARMIENTO, other times like:
> SARMIENTO, JORGE and other times like JORGE LUIS SARMIENTO. and due to legal
> reasons, the data must be entered the exact way it was wroten.
>
> Is there any way to optimize postgres for doing this kind of SELECT?
>
> thanx in advance!
>
> Jorge S.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org

Re: optimizing SELECT with LIKE

From

Steve Atkins

Date:

12 June 2002, 13:38:42

On Wed, Jun 12, 2002 at 11:43:25AM -0400, Jorge Sarmiento wrote:
> I have 3000000 rows in a database where I have to make a:
>
> SELECT name FROM table WHERE name LIKE '%firstname%' AND name LIKE
> '%secondname%';
>
> to obtain the data I need.
>
> Due to the data nature, there is no other way to look for the data, it's old
> data that was registered in paper years ago, with no searchable index... just
> names, that sometimes are wrote like: JORGE SARMIENTO, other times like:
> SARMIENTO, JORGE and other times like JORGE LUIS SARMIENTO. and due to legal
> reasons, the data must be entered the exact way it was wroten.

> Is there any way to optimize postgres for doing this kind of SELECT?

Not in any trivial way. The wildcard at the beginning of the match pattern
makes it impossible to use indexes.

One of the full text search modules might help, but a simpler way
might be to create a second table just to search on.

So from an input table like this

UID  |  NAME
-----------------------
 1   | SARMIENTO, JORGE
 2   | JORGE LUIS SARMIENTO

generate a search table with a perl script to convert the existing data
and index it on word. I've used code like this (untested, not complete):

$sh = $dbh->prepare("SELECT uid, name FROM table");
$ih = $dbh->prepare("INSERT INTO searchtable(uid, word) VALUES (?,?)");

$sh->execute();

while(($uid, $name) = $sh->fetchrow_array()) {
  @cleanname = split(' ', uc(join(' ', reverse(split(',', $name)))));
  foreach $el (@cleanname) {
    $ih->execute($uid, $el);
  }
}

UID  | WORD
-----------------------
 1   | SARMENIO
 1   | JORGE
 2   | JORGE
 2   | LUIS
 2   | SARMIENTO

Then you should be able to do a much more efficient query on the
search table to get the UID of the result.

SELECT uid FROM searchtable WHERE word='firstname' AND
  uid IN (SELECT uid FROM searchtable WHERE word='lastname')

or the more efficient equivalent I'm sure the SQL gurus on the
list will pop up with.

An even simpler way would be to do a single pass through the data to
parse the names and add firstname and lastname columns, if that's
possible given the format of your data.

UID  |  NAME                | FIRSTNAME |   LASTNAME
-----------------------------------------------------
 1   | SARMIENTO, JORGE     | JORGE     | SARMIENTO
 2   | JORGE LUIS SARMIENTO | JORGE     | SARMIENTO

I've used code something like this snippet (untested) for that:

$sh = $dbh->prepare("SELECT uid, name FROM table");
$uh = $dbh->prepare("UPDATE table SET firstname=?, lastname=? where uid=?");

$sh->execute();

while(($uid, $name) = $sh->fetchrow_array()) {
  @cleanname = split(' ', uc(join(' ', reverse(split(',', $name)))));
  $firstname = $cleanname[0];
  $lastname = $cleanname[$#cleanname];
  $uh->execute($firstname, $lastname, $uid);
}

All of the name munging and index generation could be done on insert triggers,
making it transparent to insertions of new data, but I tend to handle that
at an application level rather than within the database.

Cheers,
  Steve

Re: optimizing SELECT with LIKE

From

Scott Marlowe

Date:

12 June 2002, 14:28:22

Have a look in the postgresql-7.2.1/contrib/fulltextindex directory.