Thread: similarity and operator '%'

similarity and operator '%'

From
Volker Boehm
Date:
Hello,
I'm trying to find persons in an address database where I have built
trgm-indexes on name, street, zip and city.

When I search for all four parts of the address (name, street, zip and city)

     select name, street, zip, city
     from addresses
     where name % $1
       and street % $2
       and (zip % $3 or city % $4)

everything works fine: It takes less than a second to get some (5 - 500)
proposed addresses out of 500,000 addresses and the query plan shows

     Bitmap Heap Scan on addresses  (cost=168.31..1993.38 rows=524 ...
       Recheck Cond: ...
       ->  Bitmap Index Scan on ...
             Index Cond: ...

The same happens when I search only by name with

     select name, street, zip, city
     from addresses
     where name % $1

But when I rewrite this query to

     select name, street, zip, city
     from addresses
     where similarity(name, $1) > 0.3

which means exactly then same as the second example, the query plan
changes to

     Seq Scan on addresses  (cost=0.00..149714.42 rows=174675 width=60)
       Filter: ...

and the query lasts about a minute.

The reason for using the similarity function in place of the
'%'-operator is that I want to use different similarity values in one query:

     select name, street, zip, city
     from addresses
     where name % $1
         and street % $2
         and (zip % $3 or city % $4)
         or similarity(name, $1) > 0.8

which means: take all addresses where name, street, zip and city have
little similarity _plus_ all addresses where the name matches very good.


The only way I found, was to create a temporary table from the first
query, change the similarity value with set_limit() and then select the
second query UNION the temporary table.

Is there a more elegant and straight forward way to achieve this result?

regards  Volker

--
Volker Böhm        Tel.: +49 4141 981155
Voßkuhl 5          mailto:volker@vboehm.de
21682 Stade        http://www.vboehm.de


Re: similarity and operator '%'

From
"David G. Johnston"
Date:
On Mon, May 30, 2016 at 1:53 PM, Volker Boehm <volker@vboehm.de> wrote:

The reason for using the similarity function in place of the '%'-operator is that I want to use different similarity values in one query:

    select name, street, zip, city
    from addresses
    where name % $1
        and street % $2
        and (zip % $3 or city % $4)
        or similarity(name, $1) > 0.8

which means: take all addresses where name, street, zip and city have little similarity _plus_ all addresses where the name matches very good.


The only way I found, was to create a temporary table from the first query, change the similarity value with set_limit() and then select the second query UNION the temporary table.

Is there a more elegant and straight forward way to achieve this result?

​Not that I can envision.

You are forced into using an operator due to our index implementation.

You are thus forced into using a GUC to control the parameter that the index scanning function uses to compute true/false.

A GUC can only take on a single value within a given query - well, not quite true[1] but the exception doesn't seem like it will help here.

Th
us you are consigned to​
 
​using two queries.

*​A functional index​ doesn't work since the second argument is query specific

[1]​ When defining a function you can attach a "SET" clause to it; commonly used for search_path but should work with any GUC.  If you could wrap the operator comparison into a custom function you could use this capability.  It also would require a function that would take the threshold as a value - the extension only provides variations that use the GUC.

I don't think this will use the index even if it compiles (not tested):

CREATE FUNCTION similarity_80(col, val)
RETURNS boolean
SET similarity_threshold = 0.80
LANGUAGE sql
AS $$
​SELECT ​col % val;
$$;

​David J.​

Re: similarity and operator '%'

From
Jeff Janes
Date:
On Mon, May 30, 2016 at 10:53 AM, Volker Boehm <volker@vboehm.de> wrote:

> The reason for using the similarity function in place of the '%'-operator is
> that I want to use different similarity values in one query:
>
>     select name, street, zip, city
>     from addresses
>     where name % $1
>         and street % $2
>         and (zip % $3 or city % $4)
>         or similarity(name, $1) > 0.8

I think the best you can do through query writing is to use the
most-lenient setting in all places, and then refilter to get the less
lenient cutoff:

     select name, street, zip, city
     from addresses
     where name % $1
         and street % $2
         and (zip % $3 or city % $4)
         or (name % $1 and similarity(name, $1) > 0.8)

If it were really important to me to get maximum performance, what I
would do is alter/fork the pg_trgm extension so that it had another
operator, say %%%, with a hard-coded cutoff which paid no attention to
the set_limit().  I'm not really sure how the planner would deal with
that, though.

Cheers,

Jeff