Thread: Using In Clause For a Large Text Matching Query
Hello all, my first post to the pgsql mailing list! There may be a completely better way to do this; if so please help point me in the right direction! What I'm trying to do is run a query to partially match 300+ text fields to a PG table of about 200,000 rows. The idea is to pull out a portion of text from the original text fields and do a partial text match on anything in my PG table containing that substring. I liked the idea of using a where IN(group) to do my comparisons, as in select col1 from table1 where col1 in ('text1','text2') however, this requires an exact string match. Is there any way to do a substring match inside of my IN group? Or can anyone think of a better way to do something like this? Heres an example of something of how I'd like this to work: Portion of 300 Original Text fields: "brown cat" "green dog" 2 rows of 200k+ Database table: "brown kitty" "green doggy" We can assume a certain portion of the text is included in the DB table, so I want to be able to do a substring match on "brown" and "green" and in this case return both "brown kitty", and "green doggy". However the problem is, if I run the query on each of my 300 rows to scan 200,000 rows in my DB is entirely too slow. So I was hoping to use the IN clause to create an IN group of about 300 items to scan the DB once. I hope this makes sense, but if anything sounds confusing please let me know, and I will be sure to clarify! Thanks for any help or direction anyone can provide!! - Jason Farmer
> We can assume a certain portion of the text is included in the DB table, > so I want to be able to do a substring match on "brown" and "green" and > in this case return both "brown kitty", and "green doggy". However the > problem is, if I run the query on each of my 300 rows to scan 200,000 > rows in my DB is entirely too slow. So I was hoping to use the IN clause > to create an IN group of about 300 items to scan the DB once. You can probably do it. However, you will have to pick a substring from your text field to compare against. In this case you seem to be choosing the first word, i.e. "brown" and "green". so maybe: select t1.col1 from table1 as t1, ( select distinct split_part( tblFileDump.Col1, ' ', 1 ) as samplecol from tblFileDump ) as fd1 where t1.col1 like '%' || fd1.samplecol || '%' ; This is just an idea. I've never used split_part or developed a sudo join this way. But it may work provided you and jump your text files into a temp table. Notice: http://www.postgresql.org/docs/8.1/interactive/functions-string.html for the syntax for split_part(). Regards, Richard Broersma Jr.
> Well, there is also: <a > href="http://www.postgresql.org/docs/8.1/interactive/functions-comparisons.html#AEN13377"> > > /expression/ /operator/ ANY (/array expression/)</a>. So, if you have a way to preprocess you > input text fields that you want matched > you could build a regex for each and feed them in an array to an '~ ANY' expression like so (or, > use ~* for case > insensitive matching): > > SELECT col1 > FROM table > WHERE col1 ~ ANY (ARRAY['regex1', 'regex2', ...]); Good point, But don't forget to include the list in your response. :-) Regards, Richard Broersma Jr.
Ah, I do think that sounds about like what I want! Let me play with this one some, thanks so much!! Richard Broersma Jr wrote: >> Well, there is also: <a >> href="http://www.postgresql.org/docs/8.1/interactive/functions-comparisons.html#AEN13377"> >> >> /expression/ /operator/ ANY (/array expression/)</a>. So, if you have a way to preprocess you >> input text fields that you want matched >> you could build a regex for each and feed them in an array to an '~ ANY' expression like so (or, >> use ~* for case >> insensitive matching): >> >> SELECT col1 >> FROM table >> WHERE col1 ~ ANY (ARRAY['regex1', 'regex2', ...]); >> > > Good point, But don't forget to include the list in your response. :-) > > Regards, > > Richard Broersma Jr. > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq > >