Thread: Using In Clause For a Large Text Matching Query

Using In Clause For a Large Text Matching Query

From
Jason Farmer
Date:
Hello all, my first post to the pgsql mailing list!

There may be a completely better way to do this; if so please help point 
me in the right direction!

What I'm trying to do is run a query to partially match 300+ text fields 
to a PG table of about 200,000 rows. The idea is to pull out a portion 
of text from the original text fields and do a partial text match on 
anything in my PG table containing that substring.

I liked the idea of using a where IN(group) to do my comparisons, as in

select col1 from table1 where col1 in ('text1','text2')

however, this requires an exact string match. Is there any way to do a 
substring match inside of my IN group? Or can anyone think of a better 
way to do something like this?

Heres an example of something of how I'd like this to work:

Portion of 300 Original Text fields:
"brown cat"
"green dog"

2 rows of 200k+ Database table:
"brown kitty"
"green doggy"

We can assume a certain portion of the text is included in the DB table, 
so I want to be able to do a substring match on "brown" and "green" and 
in this case return both "brown kitty", and "green doggy". However the 
problem is, if I run the query on each of my 300 rows to scan 200,000 
rows in my DB is entirely too slow. So I was hoping to use the IN clause 
to create an IN group of about 300 items to scan the DB once.

I hope this makes sense, but if anything sounds confusing please let me 
know, and I will be sure to clarify! Thanks for any help or direction 
anyone can provide!!

- Jason Farmer


Re: Using In Clause For a Large Text Matching Query

From
Richard Broersma Jr
Date:
> We can assume a certain portion of the text is included in the DB table, 
> so I want to be able to do a substring match on "brown" and "green" and 
> in this case return both "brown kitty", and "green doggy". However the 
> problem is, if I run the query on each of my 300 rows to scan 200,000 
> rows in my DB is entirely too slow. So I was hoping to use the IN clause 
> to create an IN group of about 300 items to scan the DB once.

You can probably do it.  However, you will have to pick a substring from your text field to
compare against.  In this case you seem to be choosing the first word, i.e. "brown" and "green".

so maybe:

select t1.col1
from    table1 as t1,   (      select distinct split_part( tblFileDump.Col1, ' ', 1 ) as samplecol      from
tblFileDump  ) as fd1
 

where t1.col1 like '%' || fd1.samplecol || '%'
;

This is just an idea.  I've never used split_part or developed a sudo join this way.  But it may
work provided you and jump your text files into a temp table.

Notice:
http://www.postgresql.org/docs/8.1/interactive/functions-string.html
for the syntax for split_part().

Regards,

Richard Broersma Jr.


Re: Using In Clause For a Large Text Matching Query

From
Richard Broersma Jr
Date:
> Well, there is also: <a 
> href="http://www.postgresql.org/docs/8.1/interactive/functions-comparisons.html#AEN13377">
> 
> /expression/ /operator/ ANY (/array expression/)</a>.  So, if you have a way to preprocess you
> input text fields that you want matched 
> you could build a regex for each and feed them in an array to an '~ ANY' expression like so (or,
> use ~* for case 
> insensitive matching):
> 
> SELECT col1
> FROM table
> WHERE col1 ~ ANY (ARRAY['regex1', 'regex2', ...]);

Good point,  But don't forget to include the list in your response. :-)

Regards,

Richard Broersma Jr.


Re: Using In Clause For a Large Text Matching Query

From
Jason Farmer
Date:
Ah, I do think that sounds about like what I want! Let me play with this 
one some, thanks so much!!

Richard Broersma Jr wrote:
>> Well, there is also: <a 
>> href="http://www.postgresql.org/docs/8.1/interactive/functions-comparisons.html#AEN13377">
>>
>> /expression/ /operator/ ANY (/array expression/)</a>.  So, if you have a way to preprocess you
>> input text fields that you want matched 
>> you could build a regex for each and feed them in an array to an '~ ANY' expression like so (or,
>> use ~* for case 
>> insensitive matching):
>>
>> SELECT col1
>> FROM table
>> WHERE col1 ~ ANY (ARRAY['regex1', 'regex2', ...]);
>>     
>
> Good point,  But don't forget to include the list in your response. :-)
>
> Regards,
>
> Richard Broersma Jr.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq
>
>