Thread: non-static LIKE patterns
Hi all, I'm sure this has been discussed before, but I am not too sure what key search-terms to use to find any potentially relevant discussions. Issue: I have two tables, each has a column that contains a directory path. First table contains a starting path and the second holds sub-paths (retaining full path names from root directory). In short, first table entries are sub-strings of those found in the second table. I need to match entries in second table to the first, so I use the following in my WHERE clause: ... WHERE second.path LIKE first.path||'%' This seemed to work at first, but it fails if the paths use back-slashes (like Windows paths). Following is a simple test-case to illustrate what I described. PostgreSQL 9.1.1 (similar results with much older version) $ psql -d db -e < testcase.sql CREATE TEMPORARY TABLE foo (id INTEGER, a TEXT); CREATE TABLE CREATE TEMPORARY TABLE bar (id INTEGER, b TEXT); CREATE TABLE INSERT INTO foo VALUES (0, '/root/a/b'); INSERT 8030228 1 INSERT INTO foo VALUES (1, '\root\a\b'); INSERT 8030229 1 INSERT INTO bar VALUES (0, '/root/a/b/c/*nix'); INSERT 8030230 1 INSERT INTO bar VALUES (1, '\root\a\b\c\Windows'); INSERT 8030231 1 SELECT * FROM foo; id | a ----+----------- 0 | /root/a/b 1 | \root\a\b (2 rows) SELECT * FROM bar; id | b ----+--------------------- 0 | /root/a/b/c/*nix 1 | \root\a\b\c\Windows (2 rows) SELECT a,b, b LIKE a||'%' FROM foo JOIN bar USING (id); a | b | ?column? -----------+---------------------+---------- /root/a/b | /root/a/b/c/*nix | t \root\a\b | \root\a\b\c\Windows | f (2 rows) Hmm... just tried these two cases as well which seem interesting: SELECT '\root\a\b\c\Windows' LIKE '\root\a\b'||'%'; ?column? ---------- f (1 row) mod=# SELECT '\root\a\b\c\Windows' LIKE '\\root\\a\\b'||'%'; ?column? ---------- t (1 row) Is this a bug in the SQL statement, or a bug in PostgreSQL? If the former, what is the correct way to do this? If the latter, is there a work-around? I realize the same thing can be done with the following statement, but it is harder to read and might be slightly more expensive to run on a large data set. SELECT a,b,substr(b,1,length(a)), substr(b,1,length(a)) = a FROM foo JOIN bar USING (id); a | b | substr | ?column? -----------+---------------------+-----------+---------- /root/a/b | /root/a/b/c/*nix | /root/a/b | t \root\a\b | \root\a\b\c\Windows | \root\a\b | t (2 rows) Thanks, --patrick
patrick keshishian <pkeshish@gmail.com> writes: > I need to match entries in second table to the first, so I use the > following in my WHERE clause: > ... WHERE second.path LIKE first.path||'%' > This seemed to work at first, but it fails if the paths use > back-slashes (like Windows paths). By default, back-slash is a special character in LIKE patterns. You can change that with the ESCAPE option. See http://www.postgresql.org/docs/devel/static/functions-matching.html#FUNCTIONS-LIKE regards, tom lane
On Wed, Apr 11, 2012 at 4:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > patrick keshishian <pkeshish@gmail.com> writes: >> I need to match entries in second table to the first, so I use the >> following in my WHERE clause: >> ... WHERE second.path LIKE first.path||'%' >> This seemed to work at first, but it fails if the paths use >> back-slashes (like Windows paths). > > By default, back-slash is a special character in LIKE patterns. > You can change that with the ESCAPE option. See > http://www.postgresql.org/docs/devel/static/functions-matching.html#FUNCTIONS-LIKE Thanks for the quick reply. Would be tough choosing another "reasonable" ESCAPE character while dealing with paths. Will think more about this. Cheers, --patrick
patrick keshishian <pkeshish@gmail.com> writes: > Thanks for the quick reply. Would be tough choosing another > "reasonable" ESCAPE character while dealing with paths. Will think > more about this. If you want it to be bulletproof, what I'd think about is something like WHERE second.path LIKE quote_like(first.path)||'%' where quote_like() is a function that inserts a backslash before each backslash, percent, and underscore in the given value. Probably not hard to cons that up from regexp_replace(). regards, tom lane
Tom Lane wrote: patrick keshishian <pkeshish@gmail.com> writes: > Thanks for the quick reply. Would be tough choosing another > "reasonable" ESCAPE character while dealing with paths. Will think > more about this. If you want it to be bulletproof, what I'd think about is something like WHERE second.path LIKE quote_like(first.path)||'%' where quote_like() is a function that inserts a backslash before each backslash, percent, and underscore in the given value. Probably not hard to cons that up from regexp_replace(). regards, tom lane Just out of curiosity: wouldn't that (as well as using non-static like) be an enormous performance problem? I tried something with normal "~" regex matching some time ago but gave up on the idea pretty soon Regards Wolfgang Hamann
hamann.w@t-online.de writes: > Tom Lane wrote: > If you want it to be bulletproof, what I'd think about is something like > WHERE second.path LIKE quote_like(first.path)||'%' > Just out of curiosity: wouldn't that (as well as using non-static like) > be an enormous performance problem? Well, it won't be free, but I think you've already doomed yourself to a not-very-bright plan by using LIKE in this way at all. In any case, as a wise man once said, you can make it run arbitrarily fast if it doesn't have to give the right answer. Correctness trumps any micro-optimization questions, so if you have to have prefix matching of this sort, it's gonna cost ya somehow. Actually, if the only case you're worried about is prefix match, you could do it in substring style: WHERE second.path = substring(first.path, 1, length(second.path)) (better double-check the substring syntax, I'm too lazy to). This is still going to completely suck on a macro level: there's still no way to perform the join except by tediously iterating through every combination of rows. But it'll likely outrun any LIKE-based solution by some percentage. regards, tom lane
>> >> hamann.w@t-online.de writes: >> > Tom Lane wrote: >> > If you want it to be bulletproof, what I'd think about is something like >> > WHERE second.path LIKE quote_like(first.path)||'%' >> >> > Just out of curiosity: wouldn't that (as well as using non-static like) >> > be an enormous performance problem? >> >> Well, it won't be free, but I think you've already doomed yourself to >> a not-very-bright plan by using LIKE in this way at all. >> >> In any case, as a wise man once said, you can make it run arbitrarily >> fast if it doesn't have to give the right answer. Correctness trumps >> any micro-optimization questions, so if you have to have prefix matching >> of this sort, it's gonna cost ya somehow. >> Hi Tom, I just stumbled across this question because I regularly come across problems that, at first, look like they should be solved with non-static LIKE or REGEX patterns I actually have two situations where I would need a better plan. One is, fortunately, fairly static (mostly lookups, hardly inserts) for name matches. Many famous people appear in different spellings, say these two musicians Franz|Ferenc Liszt Fr(e|y)der(ic|yk) Chopin So the first plan would be to regex-compare the sought name against the first name (or last name) regexes. Run-time is astronomical, though My current approach is to a) keep the regexes in a separate table/column, so names with a regex entry are handled in a smaller query b) reverse the query: for every regex (they are well-behaved in this context) I pre-create a pattern so that my actual query becomes where pre-made-pattern ~ searched_name c) while preparing the pattern, a common initial character (the "F" for Franz and Ferenc) is identified to build an index. In the rare case that the first letter is already different, there would be two entries in the table. So the actual query can check for first letter before it does the regex. The other situation, unfortionately, is ad-hoc queries where I cannot do that kind of preparation typically, the DB would contain strings like XY4711A, XY271, XY17321AAA, and I want to check whether an input like XY17321 matches a database entry up to the end of the numerals. So I add [^0-9]*$ to the end of my candidates, select where candidate ~ entry-in-table and go for a coffee or two Of course I would prefer to see a pre-built solution do all that mess for me... Regards Wolfgang Hamann