Thread: similarity and operator '%'
Hello,

I'm trying to find persons in an address database where I have built trgm indexes on name, street, zip and city.

When I search for all four parts of the address (name, street, zip and city)

    select name, street, zip, city
    from addresses
    where name % $1
      and street % $2
      and (zip % $3 or city % $4)

everything works fine: it takes less than a second to get some (5 - 500) proposed addresses out of 500,000 addresses, and the query plan shows

    Bitmap Heap Scan on addresses  (cost=168.31..1993.38 rows=524 ...
      Recheck Cond: ...
      ->  Bitmap Index Scan on ...
            Index Cond: ...

The same happens when I search only by name with

    select name, street, zip, city
    from addresses
    where name % $1

But when I rewrite this query to

    select name, street, zip, city
    from addresses
    where similarity(name, $1) > 0.3

which means exactly the same as the second example, the query plan changes to

    Seq Scan on addresses  (cost=0.00..149714.42 rows=174675 width=60)
      Filter: ...

and the query takes about a minute.

The reason for using the similarity function in place of the '%' operator is that I want to use different similarity values in one query:

    select name, street, zip, city
    from addresses
    where name % $1
      and street % $2
      and (zip % $3 or city % $4)
       or similarity(name, $1) > 0.8

which means: take all addresses where name, street, zip and city have little similarity _plus_ all addresses where the name matches very well.

The only way I found was to create a temporary table from the first query, change the similarity value with set_limit(), and then run the second query UNION the temporary table.

Is there a more elegant and straightforward way to achieve this result?

regards
Volker

--
Volker Böhm             Tel.: +49 4141 981155
Voßkuhl 5               mailto:volker@vboehm.de
21682 Stade             http://www.vboehm.de
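(A minimal, untested sketch of the temporary-table workaround described above, using pg_trgm's set_limit() and the 0.3 / 0.8 thresholds from the examples; the temp table name loose_matches is only illustrative:)

    begin;

    -- first query at the lenient threshold, collected into a temp table
    select set_limit(0.3);

    create temporary table loose_matches on commit drop as
    select name, street, zip, city
    from addresses
    where name % $1
      and street % $2
      and (zip % $3 or city % $4);

    -- second query at the strict threshold, combined with the temp table
    select set_limit(0.8);

    select name, street, zip, city
    from addresses
    where name % $1
    union
    select name, street, zip, city from loose_matches;

    commit;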
> The reason for using the similarity function in place of the '%' operator is
> that I want to use different similarity values in one query:
>
>     select name, street, zip, city
>     from addresses
>     where name % $1
>       and street % $2
>       and (zip % $3 or city % $4)
>        or similarity(name, $1) > 0.8
>
> which means: take all addresses where name, street, zip and city have little
> similarity _plus_ all addresses where the name matches very well.
>
> The only way I found was to create a temporary table from the first query,
> change the similarity value with set_limit(), and then run the second query
> UNION the temporary table.
>
> Is there a more elegant and straightforward way to achieve this result?
Not that I can envision.
You are forced into using an operator due to our index implementation.
You are thus forced into using a GUC to control the parameter that the index scanning function uses to compute true/false.
A GUC can only take on a single value within a given query - well, not quite true[1] but the exception doesn't seem like it will help here.
Thus you are consigned to using two queries.
* A functional index doesn't work since the second argument is query-specific.
[1] When defining a function you can attach a "SET" clause to it; commonly used for search_path but should work with any GUC. If you could wrap the operator comparison into a custom function you could use this capability. It also would require a function that would take the threshold as a value - the extension only provides variations that use the GUC.
I don't think this will use the index even if it compiles (not tested):
-- % consults pg_trgm.similarity_threshold, pinned to 0.80 for the duration of this function
CREATE FUNCTION similarity_80(col text, val text)
RETURNS boolean
SET pg_trgm.similarity_threshold = 0.80
LANGUAGE sql
AS $$
    SELECT col % val;
$$;
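If it did work, usage would presumably look something like this (again untested; similarity_80 is the sketch above):

    select name, street, zip, city
    from addresses
    where name % $1
      and street % $2
      and (zip % $3 or city % $4)
       or similarity_80(name, $1);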
David J.
On Mon, May 30, 2016 at 10:53 AM, Volker Boehm <volker@vboehm.de> wrote:

> The reason for using the similarity function in place of the '%' operator is
> that I want to use different similarity values in one query:
>
>     select name, street, zip, city
>     from addresses
>     where name % $1
>       and street % $2
>       and (zip % $3 or city % $4)
>        or similarity(name, $1) > 0.8

I think the best you can do through query writing is to use the most-lenient
setting in all places, and then refilter to get the less lenient cutoff:

    select name, street, zip, city
    from addresses
    where name % $1
      and street % $2
      and (zip % $3 or city % $4)
       or (name % $1 and similarity(name, $1) > 0.8)

If it were really important to me to get maximum performance, what I would do
is alter/fork the pg_trgm extension so that it had another operator, say %%%,
with a hard-coded cutoff which paid no attention to the set_limit(). I'm not
really sure how the planner would deal with that, though.

Cheers,

Jeff
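(A sketch of that refilter approach with the lenient threshold made explicit, assuming the default 0.3 cutoff from the original post:)

    -- lenient threshold applies to every % match; the stricter 0.8 cutoff is refiltered explicitly
    select set_limit(0.3);

    select name, street, zip, city
    from addresses
    where name % $1
      and street % $2
      and (zip % $3 or city % $4)
       or (name % $1 and similarity(name, $1) > 0.8);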