Re: optimizing SELECT with LIKE - Mailing list pgsql-general

From Steve Atkins
Subject Re: optimizing SELECT with LIKE
Date
Msg-id 20020612104847.A69991@blighty.com
Whole thread Raw
In response to optimizing SELECT with LIKE  (Jorge Sarmiento <jsarmiento@ccom.org>)
List pgsql-general
On Wed, Jun 12, 2002 at 11:43:25AM -0400, Jorge Sarmiento wrote:
> I have 3000000 rows in a database where I have to make a:
>
> SELECT name FROM table WHERE name LIKE '%firstname%' AND name LIKE
> '%secondname%';
>
> to obtain the data I need.
>
> Due to the data nature, there is no other way to look for the data, it's old
> data that was registered in paper years ago, with no searchable index... just
> names, that sometimes are wrote like: JORGE SARMIENTO, other times like:
> SARMIENTO, JORGE and other times like JORGE LUIS SARMIENTO. and due to legal
> reasons, the data must be entered the exact way it was wroten.


> Is there any way to optimize postgres for doing this kind of SELECT?

Not in any trivial way. The wildcard at the beginning of the match pattern
makes it impossible to use indexes.



One of the full text search modules might help, but a simpler way
might be to create a second table just to search on.

So from an input table like this

UID  |  NAME
-----------------------
 1   | SARMIENTO, JORGE
 2   | JORGE LUIS SARMIENTO

generate a search table with a perl script to convert the existing data
and index it on word. I've used code like this (untested, not complete):

$sh = $dbh->prepare("SELECT uid, name FROM table");
$ih = $dbh->prepare("INSERT INTO searchtable(uid, word) VALUES (?,?)");

$sh->execute();

while(($uid, $name) = $sh->fetchrow_array()) {
  @cleanname = split(' ', uc(join(' ', reverse(split(',', $name)))));
  foreach $el (@cleanname) {
    $ih->execute($uid, $el);
  }
}

UID  | WORD
-----------------------
 1   | SARMENIO
 1   | JORGE
 2   | JORGE
 2   | LUIS
 2   | SARMIENTO

Then you should be able to do a much more efficient query on the
search table to get the UID of the result.

SELECT uid FROM searchtable WHERE word='firstname' AND
  uid IN (SELECT uid FROM searchtable WHERE word='lastname')

or the more efficient equivalent I'm sure the SQL gurus on the
list will pop up with.


An even simpler way would be to do a single pass through the data to
parse the names and add firstname and lastname columns, if that's
possible given the format of your data.

UID  |  NAME                | FIRSTNAME |   LASTNAME
-----------------------------------------------------
 1   | SARMIENTO, JORGE     | JORGE     | SARMIENTO
 2   | JORGE LUIS SARMIENTO | JORGE     | SARMIENTO


I've used code something like this snippet (untested) for that:

$sh = $dbh->prepare("SELECT uid, name FROM table");
$uh = $dbh->prepare("UPDATE table SET firstname=?, lastname=? where uid=?");

$sh->execute();

while(($uid, $name) = $sh->fetchrow_array()) {
  @cleanname = split(' ', uc(join(' ', reverse(split(',', $name)))));
  $firstname = $cleanname[0];
  $lastname = $cleanname[$#cleanname];
  $uh->execute($firstname, $lastname, $uid);
}

All of the name munging and index generation could be done on insert triggers,
making it transparent to insertions of new data, but I tend to handle that
at an application level rather than within the database.

Cheers,
  Steve

pgsql-general by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: [BUGS] createdb comments
Next
From: Doug Fields
Date:
Subject: Re: "set nice" running a transaction in a lower