Thread: non-static LIKE patterns

non-static LIKE patterns

From
patrick keshishian
Date:
Hi all,

I'm sure this has been discussed before, but I am not too sure what
key search-terms to use to find any potentially relevant discussions.

Issue: I have two tables, each has a column that contains a directory
path. First table contains a starting path and the second holds
sub-paths (retaining full path names from root directory). In short,
first table entries are sub-strings of those found in the second
table.

I need to match entries in second table to the first, so I use the
following in my WHERE clause:

    ... WHERE second.path LIKE first.path||'%'

This seemed to work at first, but it fails if the paths use
back-slashes (like Windows paths).

Following is a simple test-case to illustrate what I described.

PostgreSQL 9.1.1 (similar results with much older version)

$ psql -d db -e < testcase.sql
CREATE TEMPORARY TABLE foo (id INTEGER, a TEXT);
CREATE TABLE
CREATE TEMPORARY TABLE bar (id INTEGER, b TEXT);
CREATE TABLE
INSERT INTO foo VALUES (0, '/root/a/b');
INSERT 8030228 1
INSERT INTO foo VALUES (1, '\root\a\b');
INSERT 8030229 1
INSERT INTO bar VALUES (0, '/root/a/b/c/*nix');
INSERT 8030230 1
INSERT INTO bar VALUES (1, '\root\a\b\c\Windows');
INSERT 8030231 1
SELECT * FROM foo;
 id |     a
----+-----------
  0 | /root/a/b
  1 | \root\a\b
(2 rows)

SELECT * FROM bar;
 id |          b
----+---------------------
  0 | /root/a/b/c/*nix
  1 | \root\a\b\c\Windows
(2 rows)

SELECT a,b, b LIKE a||'%' FROM foo JOIN bar USING (id);
     a     |          b          | ?column?
-----------+---------------------+----------
 /root/a/b | /root/a/b/c/*nix    | t
 \root\a\b | \root\a\b\c\Windows | f
(2 rows)


Hmm... just tried these two cases as well which seem interesting:

SELECT '\root\a\b\c\Windows' LIKE '\root\a\b'||'%';
 ?column?
----------
 f
(1 row)

mod=# SELECT '\root\a\b\c\Windows' LIKE '\\root\\a\\b'||'%';
 ?column?
----------
 t
(1 row)



Is this a bug in the SQL statement, or a bug in PostgreSQL? If the
former, what is the correct way to do this? If the latter, is there a
work-around?

I realize the same thing can be done with the following statement, but
it is harder to read and might be slightly more expensive to run on a
large data set.

SELECT a,b,substr(b,1,length(a)), substr(b,1,length(a)) = a FROM foo
JOIN bar USING (id);
     a     |          b          |  substr   | ?column?
-----------+---------------------+-----------+----------
 /root/a/b | /root/a/b/c/*nix    | /root/a/b | t
 \root\a\b | \root\a\b\c\Windows | \root\a\b | t
(2 rows)


Thanks,
--patrick

Re: non-static LIKE patterns

From
Tom Lane
Date:
patrick keshishian <pkeshish@gmail.com> writes:
> I need to match entries in second table to the first, so I use the
> following in my WHERE clause:
>     ... WHERE second.path LIKE first.path||'%'
> This seemed to work at first, but it fails if the paths use
> back-slashes (like Windows paths).

By default, back-slash is a special character in LIKE patterns.
You can change that with the ESCAPE option.  See
http://www.postgresql.org/docs/devel/static/functions-matching.html#FUNCTIONS-LIKE

            regards, tom lane

Re: non-static LIKE patterns

From
patrick keshishian
Date:
On Wed, Apr 11, 2012 at 4:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> patrick keshishian <pkeshish@gmail.com> writes:
>> I need to match entries in second table to the first, so I use the
>> following in my WHERE clause:
>>       ... WHERE second.path LIKE first.path||'%'
>> This seemed to work at first, but it fails if the paths use
>> back-slashes (like Windows paths).
>
> By default, back-slash is a special character in LIKE patterns.
> You can change that with the ESCAPE option.  See
> http://www.postgresql.org/docs/devel/static/functions-matching.html#FUNCTIONS-LIKE

Thanks for the quick reply. Would be tough choosing another
"reasonable" ESCAPE character while dealing with paths. Will think
more about this.

Cheers,
--patrick

Re: non-static LIKE patterns

From
Tom Lane
Date:
patrick keshishian <pkeshish@gmail.com> writes:
> Thanks for the quick reply. Would be tough choosing another
> "reasonable" ESCAPE character while dealing with paths. Will think
> more about this.

If you want it to be bulletproof, what I'd think about is something like

    WHERE second.path LIKE quote_like(first.path)||'%'

where quote_like() is a function that inserts a backslash before each
backslash, percent, and underscore in the given value.  Probably not
hard to cons that up from regexp_replace().

            regards, tom lane

Re: non-static LIKE patterns

From
hamann.w@t-online.de
Date:
Tom Lane wrote:

patrick keshishian <pkeshish@gmail.com> writes:
> Thanks for the quick reply. Would be tough choosing another
> "reasonable" ESCAPE character while dealing with paths. Will think
> more about this.

If you want it to be bulletproof, what I'd think about is something like

    WHERE second.path LIKE quote_like(first.path)||'%'

where quote_like() is a function that inserts a backslash before each
backslash, percent, and underscore in the given value.  Probably not
hard to cons that up from regexp_replace().

            regards, tom lane


Just out of curiosity: wouldn't that (as well as using non-static like)
be an enormous performance problem?
I tried something with normal "~" regex matching some time ago but
gave up on the idea pretty soon

Regards
Wolfgang Hamann

Re: non-static LIKE patterns

From
Tom Lane
Date:
hamann.w@t-online.de writes:
> Tom Lane wrote:
> If you want it to be bulletproof, what I'd think about is something like
>     WHERE second.path LIKE quote_like(first.path)||'%'

> Just out of curiosity: wouldn't that (as well as using non-static like)
> be an enormous performance problem?

Well, it won't be free, but I think you've already doomed yourself to
a not-very-bright plan by using LIKE in this way at all.

In any case, as a wise man once said, you can make it run arbitrarily
fast if it doesn't have to give the right answer.  Correctness trumps
any micro-optimization questions, so if you have to have prefix matching
of this sort, it's gonna cost ya somehow.

Actually, if the only case you're worried about is prefix match, you
could do it in substring style:

    WHERE second.path = substring(first.path, 1, length(second.path))

(better double-check the substring syntax, I'm too lazy to).  This is
still going to completely suck on a macro level: there's still no way to
perform the join except by tediously iterating through every combination
of rows.  But it'll likely outrun any LIKE-based solution by some
percentage.

            regards, tom lane

Re: non-static LIKE patterns

From
hamann.w@t-online.de
Date:
>>
>> hamann.w@t-online.de writes:
>> > Tom Lane wrote:
>> > If you want it to be bulletproof, what I'd think about is something like
>> >     WHERE second.path LIKE quote_like(first.path)||'%'
>>
>> > Just out of curiosity: wouldn't that (as well as using non-static like)
>> > be an enormous performance problem?
>>
>> Well, it won't be free, but I think you've already doomed yourself to
>> a not-very-bright plan by using LIKE in this way at all.
>>
>> In any case, as a wise man once said, you can make it run arbitrarily
>> fast if it doesn't have to give the right answer.  Correctness trumps
>> any micro-optimization questions, so if you have to have prefix matching
>> of this sort, it's gonna cost ya somehow.
>>

Hi Tom,

I just stumbled across this question because I regularly come across problems that,
at first, look like they should be solved with non-static LIKE or REGEX patterns

I actually have two situations where I would need a better plan. One is, fortunately,
fairly static (mostly lookups, hardly inserts) for name matches. Many famous people
appear in different spellings, say these two musicians
Franz|Ferenc Liszt
Fr(e|y)der(ic|yk) Chopin
So the first plan would be to regex-compare the sought name against the first name (or last name)
regexes. Run-time is astronomical, though
My current approach is to
a) keep the regexes in a separate table/column, so names with a regex entry are handled in
a smaller query
b) reverse the query: for every regex (they are well-behaved in this context) I pre-create a pattern
so that my actual query becomes
where pre-made-pattern ~ searched_name
c) while preparing the pattern, a common initial character (the "F" for Franz and Ferenc) is identified
to build an index. In the rare case that the first letter is already different, there would be
two entries in the table. So the actual query can check for first letter before it does the regex.

The other situation, unfortionately, is ad-hoc queries where I cannot do that kind of preparation
typically, the DB would contain strings like XY4711A, XY271, XY17321AAA, and I want to
check whether an input like XY17321 matches a database entry up to the end of the
numerals. So I add [^0-9]*$ to the end of my candidates, select
  where candidate ~ entry-in-table
and go for a coffee or two

Of course I would prefer to see a pre-built solution do all that mess for me...

Regards
Wolfgang Hamann