Thread: Using POSIX Regular Expressions on xml type fields gives inconsistent results
Using POSIX Regular Expressions on xml type fields gives inconsistent results
From
Denis Papathanasiou
Date:
I have a table with an xml type column, and while I can make regex queries like this successfully: => select id from form_d where 'kumar' ~* any( CAST(xpath('//PersonsList/PersonName/LastName/text()', data) as TEXT[]) ); id -------------------------------------- 97e1541b-27f4-4d95-beb5-2f67830ebc48 (1 row) and => select id from form_d where '(kumar|gonzales)' ~* any( CAST(xpath('//PersonsList/PersonName/LastName/text()', data) as TEXT[]) ); id -------------------------------------- aea32e7e-f422-405c-953b-86fe3c8c1e30 97e1541b-27f4-4d95-beb5-2f67830ebc48 (2 rows) I.e., they are successful in that the last names in the xml data are "Kumar" and "Gonzales", so the ~* operator handled the case comparison correctly, and the (|) grouping also found the two rows where the corresponding xml had "Kumar" and "Gonzales" in the PersonList attribute. But if I change the expression to ask for all last names beginning with "Kuma" or "Gonza", like this, the query returns no matches: => select id from form_d where '^kuma' ~* any( CAST(xpath('//PersonsList/PersonName/LastName/text()', data) as TEXT[]) ); id ---- (0 rows) => select id from form_d where '^gonza' ~* any( CAST(xpath('//PersonsList/PersonName/LastName/text()', data) as TEXT[]) ); id ---- (0 rows) => select id from form_d where '^(kuma|gonza)' ~* any( CAST(xpath('//PersonsList/PersonName/LastName/text()', data) as TEXT[]) ); id ---- (0 rows) Why is that?
Re: Using POSIX Regular Expressions on xml type fields gives inconsistent results
From
"David Johnston"
Date:
> > I have a table with an xml type column, and while I can make regex queries > like this successfully: > > => select id from form_d where 'kumar' ~* any( > CAST(xpath('//PersonsList/PersonName/LastName/text()', data) as TEXT[]) > ); It took me way too long to figure this out: SELECT 'text to search against' ~* 'regular expression' The text to be searched must be on the left-hand side of the operator and thus the regular expression operators cannot be used with "ANY/ALL" because they require that the text to be search exist on the right-hand side of the operator. I'll let others respond as to the difficulty of implementing a LHS-Array version of ANY/ALL David J.
Re: Using POSIX Regular Expressions on xml type fields gives inconsistent results
From
"David Johnston"
Date:
> -----Original Message----- > From: pgsql-general-owner@postgresql.org [mailto:pgsql-general- > owner@postgresql.org] On Behalf Of Denis Papathanasiou > Sent: Thursday, December 20, 2012 7:56 PM > To: pgsql-general@postgresql.org > Subject: [GENERAL] Using POSIX Regular Expressions on xml type fields gives > inconsistent results > > I have a table with an xml type column, and while I can make regex queries > like this successfully: > > => select id from form_d where 'kumar' ~* any( > CAST(xpath('//PersonsList/PersonName/LastName/text()', data) as TEXT[]) > ); > id > -------------------------------------- > 97e1541b-27f4-4d95-beb5-2f67830ebc48 > (1 row) For educational purposes: SELECT 'kumar' ~* 'Kumar' -> TRUE (simplest possible) SELECT 'kumar' ~* any ( ARRAY['Kumar','Gonzales']::text[] ) -> TRUE (ok, arrays work) [let's play with partial matching now that we know full/basic matching works] SELECT 'kuma' ~* 'Kumar' -> FALSE ???? (wait, if this is broken then regression tests would had to have caught this...I must have something wrong) [optionally go look at " ~* " in the documentation at this point; or just try a simple flip-flop of the expression] SELECT 'Kumar' ~* 'kuma' -> TRUE !!! (ah ha! I had the order reversed) SELECT any( ARRAY['Kumar','Gozales']::text ) ... wait...ANY doesn't work on the other side... :( [At this point I'd confirm or question why ANY hasn't been made to go both ways but also realize that I will have to approach this in a different way to achieve my goal.] If you come across a problem with basic features of any software it is likely you are confused rather than the system providing inconsistent results. Take the feature in question and make sure you understand the more basic uses (while omitting as many moving parts as possible - in this case the use of XML when all you care about is how ANY(array) interacts with the regular expression operator) and slowly building up until something breaks. Shooting out a question to the mailing list is probably easier (for you) but you are more likely to learn and remember if you figure it out yourself. You had a good beginning with your original e-mail but you really needed to take it a step or two further. The community is very helpful and forgiving to these kinds of questions but it is in your own interest to dig deeper before asking for help. This applies especially for basic and long-present features such as regular expressions and arrays. And if it really is a bug you will be able to provide some very helpful information in your report. David J.
Re: Using POSIX Regular Expressions on xml type fields gives inconsistent results
From
Tom Lane
Date:
"David Johnston" <polobo@yahoo.com> writes: > [optionally go look at " ~* " in the documentation at this point; or just > try a simple flip-flop of the expression] > SELECT 'Kumar' ~* 'kuma' -> TRUE !!! (ah ha! I had the order reversed) > SELECT any( ARRAY['Kumar','Gozales']::text ) ... wait...ANY doesn't work on > the other side... :( > [At this point I'd confirm or question why ANY hasn't been made to go both > ways but also realize that I will have to approach this in a different way > to achieve my goal.] It's been awhile since I looked at the point in detail, but I seem to recall that there are fundamental syntactic-ambiguity reasons why the ANY/ALL part has to be on the righthand side of the comparison operator. There's a much easier fix to this problem though, which is to invent a "reverse ~" operator that does POSIX comparison with the pattern on the left. The hardest part of doing that for yourself is choosing a name for the reverse operator --- it just goes like create function reverse_regexeq(text, text) returns bool as 'select $2 ~ $1' language sql strict immutable; create operator ~~~ (procedure = reverse_regexeq, leftarg = text, rightarg = text); and similarly for the case-insensitive version, and there you go: pattern ~~~ ANY (whatever) solves the problem. Every so often we debate providing built-in operators like this, but we never seem to get past the what-to-call-it part. Anyone have a good color for that bikeshed? regards, tom lane
Re: Using POSIX Regular Expressions on xml type fields gives inconsistent results
From
Denis Papathanasiou
Date:
Hi David, On 12/20/2012 08:48 PM, David Johnston wrote: >> -----Original Message----- >> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general- >> owner@postgresql.org] On Behalf Of Denis Papathanasiou >> Sent: Thursday, December 20, 2012 7:56 PM >> To: pgsql-general@postgresql.org >> Subject: [GENERAL] Using POSIX Regular Expressions on xml type fields > gives >> inconsistent results [snip] > [At this point I'd confirm or question why ANY hasn't been made to go both > ways but also realize that I will have to approach this in a different way > to achieve my goal.] I did realize that ANY() must be a right-hand operator, but what I didn't understand (and admittedly still don't understand) is why regex operations that are normally right-side work from the left. If you look at the four examples which follow the posix match table in the docs (http://www.postgresql.org/docs/9.1/static/functions-matching.html#FUNCTIONS-POSIX-TABLE), some of them work from the left side, e.g.: 'abc' ~ '(b|d)' true In my original example, I found I could write this from left to right like this, and it would still work: '(b|d)' ~ 'abc' true But when I reverse this expression: 'abc' ~ '^a' true like this, it doesn't work: '^a' ~ 'abc' false
Re: Using POSIX Regular Expressions on xml type fields gives inconsistent results
From
"David Johnston"
Date:
> > If you look at the four examples which follow the posix match table in the > docs (http://www.postgresql.org/docs/9.1/static/functions- > matching.html#FUNCTIONS-POSIX-TABLE), > some of them work from the left side, e.g.: > > 'abc' ~ '(b|d)' true > > In my original example, I found I could write this from left to right like this, > and it would still work: > > '(b|d)' ~ 'abc' true Really??? Testing on Windows 9.0.4 this expression returns FALSE, not true as you claim. Please try again and reply with detailed version information and the exact query(s) used if you can get the behavior to repeat itself. David J.
Re: Using POSIX Regular Expressions on xml type fields gives inconsistent results
From
Denis Papathanasiou
Date:
On 12/21/2012 10:35 AM, David Johnston wrote: >> >> If you look at the four examples which follow the posix match table in the >> docs (http://www.postgresql.org/docs/9.1/static/functions- >> matching.html#FUNCTIONS-POSIX-TABLE), >> some of them work from the left side, e.g.: >> >> 'abc' ~ '(b|d)' true >> >> In my original example, I found I could write this from left to right like > this, >> and it would still work: >> >> '(b|d)' ~ 'abc' true > > Really??? > > Testing on Windows 9.0.4 this expression returns FALSE, not true as you > claim. Please try again and reply with detailed version information and the > exact query(s) used if you can get the behavior to repeat itself. => select id from form_d where '(kumar|gonzales)' ~* any( CAST(xpath('//PersonsList/PersonName/LastName/text()', data) as TEXT[]) ); This expression returned (correctly) the information I wanted. Shouldn't it *not* have worked b/c the '(kumar|gonzales)' regex is on the left? I was hoping to be able to use the xml type field to be able to do '^a' type searches on names (e.g. a search for 'Fred' would also match 'Frederick', etc.) but since it seems that I cannot, I'll look at different ways of solving this problem.
Re: Using POSIX Regular Expressions on xml type fields gives inconsistent results
From
"David Johnston"
Date:
> -----Original Message----- > From: pgsql-general-owner@postgresql.org [mailto:pgsql-general- > owner@postgresql.org] On Behalf Of Denis Papathanasiou > Sent: Friday, December 21, 2012 10:43 AM > To: pgsql-general@postgresql.org > Subject: Re: [GENERAL] Using POSIX Regular Expressions on xml type fields > gives inconsistent results > > On 12/21/2012 10:35 AM, David Johnston wrote: > >> > >> If you look at the four examples which follow the posix match table > >> in the docs (http://www.postgresql.org/docs/9.1/static/functions- > >> matching.html#FUNCTIONS-POSIX-TABLE), > >> some of them work from the left side, e.g.: > >> > >> 'abc' ~ '(b|d)' true > >> > >> In my original example, I found I could write this from left to right > >> like > > this, > >> and it would still work: > >> > >> '(b|d)' ~ 'abc' true > > > > Really??? > > > > Testing on Windows 9.0.4 this expression returns FALSE, not true as > > you claim. Please try again and reply with detailed version > > information and the exact query(s) used if you can get the behavior to > repeat itself. > > => select id from form_d where '(kumar|gonzales)' ~* any( > CAST(xpath('//PersonsList/PersonName/LastName/text()', data) as TEXT[]) > ); > > This expression returned (correctly) the information I wanted. > > Shouldn't it *not* have worked b/c the '(kumar|gonzales)' regex is on the > left? It worked for three reasons: 1) The expression on the right-hand side was a valid regular expression 2) The value on the left-hand side contained a string that happened to exactly match the expression on the right-hand side 3) The "(kumar|gonzales)" text, while it happens to look like a regular expression, is just plain text because it appears on the left-hand side of the operator. Arguably "regular expression" should have its own data type just like JSON and XML since RegEx is text-like but with special validation characteristics. Lacking that the system is unable to help in this situation. You may have gotten help if someone had a name that did not resolve to a valid regular expression - in which case the match attempt would have raised an error. You said above that '(b|d') ~ 'abc' returned TRUE for you. Did you actually test that exact (simple) expression or did you immediately jump to your convoluted example with XML and ANY(array)? > > I was hoping to be able to use the xml type field to be able to do '^a' > type searches on names (e.g. a search for 'Fred' would also match 'Frederick', > etc.) but since it seems that I cannot, I'll look at different ways of solving this > problem. > See Tom's suggestion of creating a custom function and operator that reverses the order of the two text fields. David J.
Re: Using POSIX Regular Expressions on xml type fields gives inconsistent results
From
Tom Lane
Date:
Denis Papathanasiou <denis.papathanasiou@banrai.com> writes: > I did realize that ANY() must be a right-hand operator, but what I > didn't understand (and admittedly still don't understand) is why regex > operations that are normally right-side work from the left. All regex operators in Postgres have the pattern on the right. This follows the lead of the SQL standard's LIKE operator, which has its pattern on the right. You might be used to some other language that puts the pattern on the left, but that isn't some kind of natural law. And it most certainly isn't the case that the operator will look at the two text strings and try to guess which one is meant as a pattern. That would be impossibly unreliable. > If you look at the four examples which follow the posix match table in > the docs > (http://www.postgresql.org/docs/9.1/static/functions-matching.html#FUNCTIONS-POSIX-TABLE), > some of them work from the left side, e.g.: Not to my eyes ... regards, tom lane
Re: Using POSIX Regular Expressions on xml type fields gives inconsistent results
From
hamann.w@t-online.de
Date:
>> >> "David Johnston" <polobo@yahoo.com> writes: >> > [optionally go look at " ~* " in the documentation at this point; or just >> > try a simple flip-flop of the expression] >> >> > SELECT 'Kumar' ~* 'kuma' -> TRUE !!! (ah ha! I had the order reversed) >> >> > SELECT any( ARRAY['Kumar','Gozales']::text ) ... wait...ANY doesn't work on >> > the other side... :( >> >> > [At this point I'd confirm or question why ANY hasn't been made to go both >> > ways but also realize that I will have to approach this in a different way >> > to achieve my goal.] >> >> It's been awhile since I looked at the point in detail, but I seem to >> recall that there are fundamental syntactic-ambiguity reasons why the >> ANY/ALL part has to be on the righthand side of the comparison operator. >> >> There's a much easier fix to this problem though, which is to invent a >> "reverse ~" operator that does POSIX comparison with the pattern on the >> left. The hardest part of doing that for yourself is choosing a name >> for the reverse operator --- it just goes like >> >> create function reverse_regexeq(text, text) returns bool as >> 'select $2 ~ $1' language sql strict immutable; >> >> create operator ~~~ (procedure = reverse_regexeq, >> leftarg = text, rightarg = text); >> >> and similarly for the case-insensitive version, and there you go: >> pattern ~~~ ANY (whatever) solves the problem. >> Hi Tom, while it might work to reverse the order in a regex, I have experienced severe slowdown when the pattern comes from table data Regards Wolfgang Hamann
Re: Using POSIX Regular Expressions on xml type fields gives inconsistent results
From
"David Johnston"
Date:
> >> There's a much easier fix to this problem though, which is to invent > >> a "reverse ~" operator that does POSIX comparison with the pattern on > >> the left. The hardest part of doing that for yourself is choosing a > >> name for the reverse operator --- it just goes like > >> > >> create function reverse_regexeq(text, text) returns bool as 'select > >> $2 ~ $1' language sql strict immutable; > >> > >> create operator ~~~ (procedure = reverse_regexeq, leftarg = text, > >> rightarg = text); > >> > >> and similarly for the case-insensitive version, and there you go: > >> pattern ~~~ ANY (whatever) solves the problem. > >> > > Hi Tom, > > while it might work to reverse the order in a regex, I have experienced > severe slowdown when the pattern comes from table data > > Regards > Wolfgang Hamann You at least have to provide a query that you feel is "too slow". Since you claim a "slowdown" you should also provide a query the better performing query you are referring to. For all queries provided you also need to provide comparable timing information. It is also unclear whether you experience this slowdown when using the "reverse" expression only or if it occurs in any situation where the expression comes from a table column rather than a string constant. If it is indeed related to the custom reverse query the fact that you are going through an SQL wrapper to access a c-level function is inevitably going to degrade performance. In short you need to provide much more detail and clarity regarding what exactly made you draw this conclusion so that others can repeat and verify and then provide meaningful explanations or solutions. If you have not done so please read the content at http://wiki.postgresql.org/wiki/Slow_Query_Questions and http://wiki.postgresql.org/wiki/Guide_to_reporting_problems David J.
Re: Using POSIX Regular Expressions on xml type fields gives inconsistent results
From
Tom Lane
Date:
"David Johnston" <polobo@yahoo.com> writes: >> while it might work to reverse the order in a regex, I have experienced >> severe slowdown when the pattern comes from table data > If it is indeed related to the custom reverse query the fact that you are > going through an SQL wrapper to access a c-level function is inevitably > going to degrade performance. I believe that in most situations of interest, the SQL wrapper would get "inlined" and thus create no significant per-row penalty. It's possible this isn't happening in some specific case, but as you say we'd need to see a concrete example to know one way or the other. Another point that comes to mind is that "pattern comes from table data" probably means "query uses a whole lot of different patterns over its lifetime". Each different pattern is going to require regex compilation, which is not a particularly cheap operation. regards, tom lane