Thread: t1.col like '%t2.col%'
I’ve got a lot of rows in one table and a lot of rows in another table. I want to do a bunch of queries on their join column. One of these is like this: t1.col like '%t2.col%'
I know that always sucks. I’m wondering how I can make it better. First, I should let you know that I can likely hold both of these tables entirely in ram. Since that’s the case, would it be better to accomplish this with my programming language? Also you should know that in most cases, t1.col and t2.col is 2 words or less. I’m not sure if that matters, I mention it because it may make tsearch2 perform badly.
On Wed, 27 Feb 2008, Dan Kaplan wrote: > I've got a lot of rows in one table and a lot of rows in another table. I > want to do a bunch of queries on their join column. One of these is like > this: t1.col like '%t2.col%' We have an idea how to speedup wildcard search at the expense of the size - we have to index all permutation of the original word. Then we could use GIN for quieries like a*b. > > > > I know that always sucks. I'm wondering how I can make it better. First, I > should let you know that I can likely hold both of these tables entirely in > ram. Since that's the case, would it be better to accomplish this with my > programming language? Also you should know that in most cases, t1.col and > t2.col is 2 words or less. I'm not sure if that matters, I mention it > because it may make tsearch2 perform badly. > contrib/pg_trgm should help you. > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
I learned a little about pg_trgm here: http://www.sai.msu.su/~megera/postgres/gist/pg_trgm/README.pg_trgm But this seems like it's for finding similarities, not substrings. How can I use it to speed up t1.col like '%t2.col%'? Thanks, Dan -----Original Message----- From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Oleg Bartunov Sent: Wednesday, February 27, 2008 9:47 PM To: Dan Kaplan Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] t1.col like '%t2.col%' On Wed, 27 Feb 2008, Dan Kaplan wrote: > I've got a lot of rows in one table and a lot of rows in another table. I > want to do a bunch of queries on their join column. One of these is like > this: t1.col like '%t2.col%' We have an idea how to speedup wildcard search at the expense of the size - we have to index all permutation of the original word. Then we could use GIN for quieries like a*b. > > > > I know that always sucks. I'm wondering how I can make it better. First, I > should let you know that I can likely hold both of these tables entirely in > ram. Since that's the case, would it be better to accomplish this with my > programming language? Also you should know that in most cases, t1.col and > t2.col is 2 words or less. I'm not sure if that matters, I mention it > because it may make tsearch2 perform badly. > contrib/pg_trgm should help you. > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Fri, 29 Feb 2008 15:52:31 -0800 "Dan Kaplan" <dkaplan@citizenhawk.com> wrote: > I learned a little about pg_trgm here: > http://www.sai.msu.su/~megera/postgres/gist/pg_trgm/README.pg_trgm > > But this seems like it's for finding similarities, not substrings. > How can I use it to speed up t1.col like '%t2.col%'? Faster disks. No matter what, that will seqscan. So if you want it to go faster, you need faster hardware. Sincerely, Joshua D. Drake - -- The PostgreSQL Company since 1997: http://www.commandprompt.com/ PostgreSQL Community Conference: http://www.postgresqlconference.org/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL SPI Liaison | SPI Director | PostgreSQL political pundit -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFHyJu9ATb/zqfZUUQRAlPwAJ9XZvoWvNquuWGytvJfNlm79LBvtwCbBwRw uqb7fhD5+w87BzUoVEjICEY= =z5xQ -----END PGP SIGNATURE-----
Joshua Drake spake thusly:
> On Fri, 29 Feb 2008 15:52:31 -0800
> "Dan Kaplan" <dkaplan@citizenhawk.com> wrote:
>
> > I learned a little about pg_trgm here:
> > http://www.sai.msu.su/~megera/postgres/gist/pg_trgm/README.pg_trgm
> >
> > But this seems like it's for finding similarities, not substrings.
> > How can I use it to speed up t1.col like '%t2.col%'?
>
> Faster disks.
>
> No matter what, that will seqscan. So if you want it to go faster, you
> need faster hardware.
Word!
That said ...
Once upon a time we had a requirement to allow users to search within US counties for property owner name or street names by text fragment.
We used the now deprecated Full Text Indexing (FTI) with some handwaving. But that was in PostgreSQL 7.4 and FTI is not in the contrib package for some time now. See <http://pgfoundry.org/projects/simplefti/> ... I looked at using it in 8.1 but my "C" chops weren't up to it, and it depended heavily on OIDs which we didn't want to use. Anyway, our business requirement evaporated so it doesn't matter to us now.
HTH,
Greg Williamson
Senior DBA
GlobeXplorer LLC, a DigitalGlobe company
Confidentiality Notice: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information and must be protected in accordance with those provisions. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message.
(My corporate masters made me say this.)
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Fri, 29 Feb 2008 17:30:08 -0700 "Gregory Williamson" <Gregory.Williamson@digitalglobe.com> wrote: > Joshua Drake spake thusly: > We used the now deprecated Full Text Indexing (FTI) with some > handwaving. But that was in PostgreSQL 7.4 and FTI is not in the > contrib package for some time now. See > <http://pgfoundry.org/projects/simplefti/> ... I looked at using it > in 8.1 but my "C" chops weren't up to it, and it depended heavily on > OIDs which we didn't want to use. Anyway, our business requirement > evaporated so it doesn't matter to us now. > Right but wouldn't this be solved with tsearch2 and pg_tgrm? - -- The PostgreSQL Company since 1997: http://www.commandprompt.com/ PostgreSQL Community Conference: http://www.postgresqlconference.org/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL SPI Liaison | SPI Director | PostgreSQL political pundit -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFHyLO+ATb/zqfZUUQRAnRlAJ0S0jrc4pSmnBcobEtZvuDkpkWzIACcDP1t Cwk/j1C2pEXWdANsyZV5f8E= =9GYn -----END PGP SIGNATURE-----
"Dan Kaplan" <dkaplan@citizenhawk.com> writes: > I learned a little about pg_trgm here: > http://www.sai.msu.su/~megera/postgres/gist/pg_trgm/README.pg_trgm There's also real documentation in the 8.3 release: http://www.postgresql.org/docs/8.3/static/pgtrgm.html AFAIK pg_trgm hasn't changed much lately, so you should be able to rely on that for recent earlier branches. > But this seems like it's for finding similarities, not substrings. How can > I use it to speed up t1.col like '%t2.col%'? The idea is to use it as a lossy index. You make a trigram index on t1.col and then do something like ... where t1.col % t2.col and t1.col like ('%'||t2.col||'%'); The index gets you the %-matches and then you filter for the exact matches with LIKE. The similarity threshold (set_limit()) has to be set low enough that you don't lose any desired matches, but not so low that you get everything in the table back. Not sure how delicate that will be. It might be unworkable, but surely it's worth a try. regards, tom lane