Re: [pg_trgm] Making similarity(?, ?) < ? use an index - Mailing list pgsql-general

From Jeff Janes
Subject Re: [pg_trgm] Making similarity(?, ?) < ? use an index
Date
Msg-id CAMkU=1z4ZA-xHBbtRzUEqqT36ac_X2x1Tfa7DBawL03_rWJnGg@mail.gmail.com
Whole thread Raw
In response to Re: [pg_trgm] Making similarity(?, ?) < ? use an index  (Greg Navis <contact@gregnavis.com>)
Responses Re: [pg_trgm] Making similarity(?, ?) < ? use an index  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: [pg_trgm] Making similarity(?, ?) < ? use an index  (Greg Navis <contact@gregnavis.com>)
List pgsql-general
On Sat, Jun 4, 2016 at 2:50 AM, Greg Navis <contact@gregnavis.com> wrote:
> Thanks for your replies.
>
> Sorry for confusion. Instead of `similarity(lhs, rhs) >= show_limit()`,
> which of course is completely equivalent to `lhs % rhs`, I wanted to write
> `similarity(lhs, rhs) >= my_custom_threshold`. It seems that the approach
> with ternary operators is quite a bit of work. I might have a simpler idea:

The problem is that you would have to build your index on the
expression (similarity(lhs,rhs)), and the then one of the lhs or rhs
would have to be a column, and the other would have to be a constant
string at the time you define the index, which is the same constant
string as specified when you want to use the index.  This could work
as long as there are only a few strings you ever care about checking
similarity to, and you would build an index for each one.  These would
be ordinary btree indexes, not gin indexes.   I've done this before,
only using substring rather than similarity.


> pg_trgm also provides `<->` but it seems this operator doesn't use indexes
> either. It seems the shortest path to per-query thresholds, without
> compromising the design, is making this operator use the index.

To be efficient, the index has to know where to start and stop its
scan.  Such a thing doesn't exists for "a<->b", there is no stopping
point.  It would have to be "a <-> b < c", at which point you are back
to ternary again.

> Please help
> me understand whether my reasoning is correct. If it is, I'd appreciate a
> high-level overview of what needs to be done. I can block a few hours to
> work on this in the upcoming weeks.

I think that you are proposing rewriting the indexing architecture
from scratch.  Doing that in a way that is robust and tested and
backward compatible would  probably take years, not hours, for a
single individual who is not already extensively experienced in the
internals of the code.  And, it is not really clear even what you want
to rewrite it into.

Tom's proposal is to do it in a way that works with the existing
architecture, rather than trying to re-write that architecture.
(Which could probably not be done in a few hours, either, especially
not once you write documentation and upgrade scripts and all the other
stuff that goes into production-quality code.)

I don't know if this would even be appropriate as an addition to
pg_trgm.  We might want to fork that code instead.  That would be a
shame, because the underlying c code would be the fundamentally the
same, but the alternative would be to force people who like % and
set_limit() to carry around the baggage of new operators and types
they have no interest in using, and vice versa.  True, we did just add
several new functions and operators to pg_trgm that many people will
have no interest in, so maybe that is not a big deal.

Cheers,

Jeff


pgsql-general by date:

Previous
From: Vik Fearing
Date:
Subject: Re: WAL's listing in pg_xlog by some sql query
Next
From: Tom Lane
Date:
Subject: Re: [pg_trgm] Making similarity(?, ?) < ? use an index