Thread: t1.col like '%t2.col%'

t1.col like '%t2.col%'

From
"Dan Kaplan"
Date:

I’ve got a lot of rows in one table and a lot of rows in another table.  I want to do a bunch of queries on their join column.  One of these is like this: t1.col like '%t2.col%'

 

I know that always sucks.  I’m wondering how I can make it better.  First, I should let you know that I can likely hold both of these tables entirely in ram.  Since that’s the case, would it be better to accomplish this with my programming language?  Also you should know that in most cases, t1.col and t2.col is 2 words or less.  I’m not sure if that matters, I mention it because it may make tsearch2 perform badly.

Re: t1.col like '%t2.col%'

From
Oleg Bartunov
Date:
On Wed, 27 Feb 2008, Dan Kaplan wrote:

> I've got a lot of rows in one table and a lot of rows in another table.  I
> want to do a bunch of queries on their join column.  One of these is like
> this: t1.col like '%t2.col%'

We have an idea how to speedup wildcard search at the expense of the size -
we have to index all permutation of the original word. Then we could
use GIN for quieries like a*b.

>
>
>
> I know that always sucks.  I'm wondering how I can make it better.  First, I
> should let you know that I can likely hold both of these tables entirely in
> ram.  Since that's the case, would it be better to accomplish this with my
> programming language?  Also you should know that in most cases, t1.col and
> t2.col is 2 words or less.  I'm not sure if that matters, I mention it
> because it may make tsearch2 perform badly.
>

contrib/pg_trgm should help you.

>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: t1.col like '%t2.col%'

From
"Dan Kaplan"
Date:
I learned a little about pg_trgm here:
http://www.sai.msu.su/~megera/postgres/gist/pg_trgm/README.pg_trgm

But this seems like it's for finding similarities, not substrings.  How can
I use it to speed up t1.col like '%t2.col%'?

Thanks,
Dan

-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Oleg Bartunov
Sent: Wednesday, February 27, 2008 9:47 PM
To: Dan Kaplan
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] t1.col like '%t2.col%'

On Wed, 27 Feb 2008, Dan Kaplan wrote:

> I've got a lot of rows in one table and a lot of rows in another table.  I
> want to do a bunch of queries on their join column.  One of these is like
> this: t1.col like '%t2.col%'

We have an idea how to speedup wildcard search at the expense of the size -
we have to index all permutation of the original word. Then we could
use GIN for quieries like a*b.

>
>
>
> I know that always sucks.  I'm wondering how I can make it better.  First,
I
> should let you know that I can likely hold both of these tables entirely
in
> ram.  Since that's the case, would it be better to accomplish this with my
> programming language?  Also you should know that in most cases, t1.col and
> t2.col is 2 words or less.  I'm not sure if that matters, I mention it
> because it may make tsearch2 perform badly.
>

contrib/pg_trgm should help you.

>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not
       match


Re: t1.col like '%t2.col%'

From
"Joshua D. Drake"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Fri, 29 Feb 2008 15:52:31 -0800
"Dan Kaplan" <dkaplan@citizenhawk.com> wrote:

> I learned a little about pg_trgm here:
> http://www.sai.msu.su/~megera/postgres/gist/pg_trgm/README.pg_trgm
> 
> But this seems like it's for finding similarities, not substrings.
> How can I use it to speed up t1.col like '%t2.col%'?

Faster disks.

No matter what, that will seqscan. So if you want it to go faster, you
need faster hardware.

Sincerely,

Joshua D. Drake

- -- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL SPI Liaison | SPI Director |  PostgreSQL political pundit

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHyJu9ATb/zqfZUUQRAlPwAJ9XZvoWvNquuWGytvJfNlm79LBvtwCbBwRw
uqb7fhD5+w87BzUoVEjICEY=
=z5xQ
-----END PGP SIGNATURE-----

Re: t1.col like '%t2.col%'

From
"Gregory Williamson"
Date:

Joshua Drake spake thusly:

> On Fri, 29 Feb 2008 15:52:31 -0800
> "Dan Kaplan" <dkaplan@citizenhawk.com> wrote:
>
> > I learned a little about pg_trgm here:
> > http://www.sai.msu.su/~megera/postgres/gist/pg_trgm/README.pg_trgm
> >
> > But this seems like it's for finding similarities, not substrings.
> > How can I use it to speed up t1.col like '%t2.col%'?
>
> Faster disks.
>
> No matter what, that will seqscan. So if you want it to go faster, you
> need faster hardware.

Word!

That said ...

Once upon a time we had a requirement to allow users to search within US counties for property owner name or street names by text fragment.

We used the now deprecated Full Text Indexing (FTI) with some handwaving. But that was in PostgreSQL 7.4 and FTI is not in the contrib package for some time now. See <http://pgfoundry.org/projects/simplefti/> ... I looked at using it in 8.1 but my "C" chops weren't up to it, and it depended heavily on OIDs which we didn't want to use. Anyway, our business requirement evaporated so it doesn't matter to us now.

HTH,

Greg Williamson
Senior DBA
GlobeXplorer LLC, a DigitalGlobe company

Confidentiality Notice: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information and must be protected in accordance with those provisions. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message.

(My corporate masters made me say this.)

Re: t1.col like '%t2.col%'

From
"Joshua D. Drake"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Fri, 29 Feb 2008 17:30:08 -0700
"Gregory Williamson" <Gregory.Williamson@digitalglobe.com> wrote:

> Joshua Drake spake thusly:

> We used the now deprecated Full Text Indexing (FTI) with some
> handwaving. But that was in PostgreSQL 7.4 and FTI is not in the
> contrib package for some time now. See
> <http://pgfoundry.org/projects/simplefti/> ... I looked at using it
> in 8.1 but my "C" chops weren't up to it, and it depended heavily on
> OIDs which we didn't want to use. Anyway, our business requirement
> evaporated so it doesn't matter to us now.
> 

Right but wouldn't this be solved with tsearch2 and pg_tgrm?

- --
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL SPI Liaison | SPI Director |  PostgreSQL political pundit

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHyLO+ATb/zqfZUUQRAnRlAJ0S0jrc4pSmnBcobEtZvuDkpkWzIACcDP1t
Cwk/j1C2pEXWdANsyZV5f8E=
=9GYn
-----END PGP SIGNATURE-----

Re: t1.col like '%t2.col%'

From
Tom Lane
Date:
"Dan Kaplan" <dkaplan@citizenhawk.com> writes:
> I learned a little about pg_trgm here:
> http://www.sai.msu.su/~megera/postgres/gist/pg_trgm/README.pg_trgm

There's also real documentation in the 8.3 release:
http://www.postgresql.org/docs/8.3/static/pgtrgm.html
AFAIK pg_trgm hasn't changed much lately, so you should be able to
rely on that for recent earlier branches.

> But this seems like it's for finding similarities, not substrings.  How can
> I use it to speed up t1.col like '%t2.col%'?

The idea is to use it as a lossy index.  You make a trigram index on
t1.col and then do something like

    ... where t1.col % t2.col and t1.col like ('%'||t2.col||'%');

The index gets you the %-matches and then you filter for the exact
matches with LIKE.

The similarity threshold (set_limit()) has to be set low enough that you
don't lose any desired matches, but not so low that you get everything
in the table back.  Not sure how delicate that will be.  It might be
unworkable, but surely it's worth a try.

            regards, tom lane