Thread: Using POSIX Regular Expressions on xml type fields gives inconsistent results

Using POSIX Regular Expressions on xml type fields gives inconsistent results

From
Denis Papathanasiou
Date:
I have a table with an xml type column, and while I can make regex
queries like this successfully:

=> select id from form_d where 'kumar' ~* any(
CAST(xpath('//PersonsList/PersonName/LastName/text()', data) as TEXT[]) );
                   id
--------------------------------------
  97e1541b-27f4-4d95-beb5-2f67830ebc48
(1 row)

and

=> select id from form_d where '(kumar|gonzales)' ~* any(
CAST(xpath('//PersonsList/PersonName/LastName/text()', data) as TEXT[]) );
                   id
--------------------------------------
  aea32e7e-f422-405c-953b-86fe3c8c1e30
  97e1541b-27f4-4d95-beb5-2f67830ebc48
(2 rows)

I.e., they are successful in that the last names in the xml data are
"Kumar" and "Gonzales", so the ~* operator handled the case comparison
correctly, and the (|) grouping also found the two rows where the
corresponding xml had "Kumar" and "Gonzales" in the PersonList attribute.

But if I change the expression to ask for all last names beginning with
"Kuma" or "Gonza", like this, the query returns no matches:

=> select id from form_d where '^kuma' ~* any(
CAST(xpath('//PersonsList/PersonName/LastName/text()', data) as TEXT[]) );
  id
----
(0 rows)

=> select id from form_d where '^gonza' ~* any(
CAST(xpath('//PersonsList/PersonName/LastName/text()', data) as TEXT[]) );
  id
----
(0 rows)

=> select id from form_d where '^(kuma|gonza)' ~* any(
CAST(xpath('//PersonsList/PersonName/LastName/text()', data) as TEXT[]) );
  id
----
(0 rows)

Why is that?


Re: Using POSIX Regular Expressions on xml type fields gives inconsistent results

From
"David Johnston"
Date:
>
> I have a table with an xml type column, and while I can make regex queries
> like this successfully:
>
> => select id from form_d where 'kumar' ~* any(
> CAST(xpath('//PersonsList/PersonName/LastName/text()', data) as TEXT[])
> );

It took me way too long to figure this out:

SELECT 'text to search against'  ~* 'regular expression'

The text to be searched must be on the left-hand side of the operator and
thus the regular expression operators cannot be used with "ANY/ALL" because
they require that the text to be search exist on the right-hand side of the
operator.

I'll let others respond as to the difficulty of implementing a LHS-Array
version of ANY/ALL

David J.




Re: Using POSIX Regular Expressions on xml type fields gives inconsistent results

From
"David Johnston"
Date:
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of Denis Papathanasiou
> Sent: Thursday, December 20, 2012 7:56 PM
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] Using POSIX Regular Expressions on xml type fields
gives
> inconsistent results
>
> I have a table with an xml type column, and while I can make regex queries
> like this successfully:
>
> => select id from form_d where 'kumar' ~* any(
> CAST(xpath('//PersonsList/PersonName/LastName/text()', data) as TEXT[])
> );
>                    id
> --------------------------------------
>   97e1541b-27f4-4d95-beb5-2f67830ebc48
> (1 row)

For educational purposes:

SELECT 'kumar' ~* 'Kumar' -> TRUE (simplest possible)

SELECT 'kumar' ~* any ( ARRAY['Kumar','Gonzales']::text[] ) -> TRUE (ok,
arrays work)

[let's play with partial matching now that we know full/basic matching
works]

SELECT 'kuma' ~* 'Kumar' -> FALSE ???? (wait, if this is broken then
regression tests would had to have caught this...I must have something
wrong)

[optionally go look at " ~* " in the documentation at this point; or just
try a simple flip-flop of the expression]

SELECT 'Kumar' ~* 'kuma' -> TRUE !!! (ah ha! I had the order reversed)

SELECT any( ARRAY['Kumar','Gozales']::text ) ... wait...ANY doesn't work on
the other side... :(

[At this point I'd confirm or question why ANY hasn't been made to go both
ways but also realize that I will have to approach this in a different way
to achieve my goal.]

If you come across a problem with basic features of any software it is
likely you are confused rather than the system providing inconsistent
results.  Take the feature in question and make sure you understand the more
basic uses (while omitting as many moving parts as possible - in this case
the use of XML when all you care about is how ANY(array) interacts with the
regular expression operator) and slowly building up until something breaks.
Shooting out a question to the mailing list is probably easier (for you) but
you are more likely to learn and remember if you figure it out yourself.

You had a good beginning with your original e-mail but you really needed to
take it a step or two further.  The community is very helpful and forgiving
to these kinds of questions but it is in your own interest to dig deeper
before asking for help.  This applies especially for basic and long-present
features such as regular expressions and arrays.  And if it really is a bug
you will be able to provide some very helpful information in your report.

David J.




"David Johnston" <polobo@yahoo.com> writes:
> [optionally go look at " ~* " in the documentation at this point; or just
> try a simple flip-flop of the expression]

> SELECT 'Kumar' ~* 'kuma' -> TRUE !!! (ah ha! I had the order reversed)

> SELECT any( ARRAY['Kumar','Gozales']::text ) ... wait...ANY doesn't work on
> the other side... :(

> [At this point I'd confirm or question why ANY hasn't been made to go both
> ways but also realize that I will have to approach this in a different way
> to achieve my goal.]

It's been awhile since I looked at the point in detail, but I seem to
recall that there are fundamental syntactic-ambiguity reasons why the
ANY/ALL part has to be on the righthand side of the comparison operator.

There's a much easier fix to this problem though, which is to invent a
"reverse ~" operator that does POSIX comparison with the pattern on the
left.  The hardest part of doing that for yourself is choosing a name
for the reverse operator --- it just goes like

create function reverse_regexeq(text, text) returns bool as
'select $2 ~ $1' language sql strict immutable;

create operator ~~~ (procedure = reverse_regexeq,
leftarg = text, rightarg = text);

and similarly for the case-insensitive version, and there you go:
pattern ~~~ ANY (whatever) solves the problem.

Every so often we debate providing built-in operators like this,
but we never seem to get past the what-to-call-it part.  Anyone
have a good color for that bikeshed?

            regards, tom lane


Re: Using POSIX Regular Expressions on xml type fields gives inconsistent results

From
Denis Papathanasiou
Date:
Hi David,

On 12/20/2012 08:48 PM, David Johnston wrote:
>> -----Original Message-----
>> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
>> owner@postgresql.org] On Behalf Of Denis Papathanasiou
>> Sent: Thursday, December 20, 2012 7:56 PM
>> To: pgsql-general@postgresql.org
>> Subject: [GENERAL] Using POSIX Regular Expressions on xml type fields
> gives
>> inconsistent results
[snip]
> [At this point I'd confirm or question why ANY hasn't been made to go both
> ways but also realize that I will have to approach this in a different way
> to achieve my goal.]

I did realize that ANY() must be a right-hand operator, but what I
didn't understand (and admittedly still don't understand) is why regex
operations that are normally right-side work from the left.

If you look at the four examples which follow the posix match table in
the docs
(http://www.postgresql.org/docs/9.1/static/functions-matching.html#FUNCTIONS-POSIX-TABLE),
some of them work from the left side, e.g.:

'abc' ~ '(b|d)'  true

In my original example, I found I could write this from left to right
like this, and it would still work:

'(b|d)' ~ 'abc'  true

But when I reverse this expression:

'abc' ~ '^a'     true

like this, it doesn't work:

'^a' ~ 'abc'     false




Re: Using POSIX Regular Expressions on xml type fields gives inconsistent results

From
"David Johnston"
Date:
>
> If you look at the four examples which follow the posix match table in the
> docs (http://www.postgresql.org/docs/9.1/static/functions-
> matching.html#FUNCTIONS-POSIX-TABLE),
> some of them work from the left side, e.g.:
>
> 'abc' ~ '(b|d)'  true
>
> In my original example, I found I could write this from left to right like
this,
> and it would still work:
>
> '(b|d)' ~ 'abc'  true

Really???

Testing on Windows 9.0.4 this expression returns FALSE, not true as you
claim.  Please try again and reply with detailed version information and the
exact query(s) used if you can get the behavior to repeat itself.

David J.






Re: Using POSIX Regular Expressions on xml type fields gives inconsistent results

From
Denis Papathanasiou
Date:
On 12/21/2012 10:35 AM, David Johnston wrote:
>>
>> If you look at the four examples which follow the posix match table in the
>> docs (http://www.postgresql.org/docs/9.1/static/functions-
>> matching.html#FUNCTIONS-POSIX-TABLE),
>> some of them work from the left side, e.g.:
>>
>> 'abc' ~ '(b|d)'  true
>>
>> In my original example, I found I could write this from left to right like
> this,
>> and it would still work:
>>
>> '(b|d)' ~ 'abc'  true
>
> Really???
>
> Testing on Windows 9.0.4 this expression returns FALSE, not true as you
> claim.  Please try again and reply with detailed version information and the
> exact query(s) used if you can get the behavior to repeat itself.

=> select id from form_d where '(kumar|gonzales)' ~* any(
CAST(xpath('//PersonsList/PersonName/LastName/text()', data) as TEXT[]) );

This expression returned (correctly) the information I wanted.

Shouldn't it *not* have worked b/c the '(kumar|gonzales)' regex is on
the left?

I was hoping to be able to use the xml type field to be able to do '^a'
type searches on names (e.g. a search for 'Fred' would also match
'Frederick', etc.) but since it seems that I cannot, I'll look at
different ways of solving this problem.


Re: Using POSIX Regular Expressions on xml type fields gives inconsistent results

From
"David Johnston"
Date:
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of Denis Papathanasiou
> Sent: Friday, December 21, 2012 10:43 AM
> To: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Using POSIX Regular Expressions on xml type fields
> gives inconsistent results
>
> On 12/21/2012 10:35 AM, David Johnston wrote:
> >>
> >> If you look at the four examples which follow the posix match table
> >> in the docs (http://www.postgresql.org/docs/9.1/static/functions-
> >> matching.html#FUNCTIONS-POSIX-TABLE),
> >> some of them work from the left side, e.g.:
> >>
> >> 'abc' ~ '(b|d)'  true
> >>
> >> In my original example, I found I could write this from left to right
> >> like
> > this,
> >> and it would still work:
> >>
> >> '(b|d)' ~ 'abc'  true
> >
> > Really???
> >
> > Testing on Windows 9.0.4 this expression returns FALSE, not true as
> > you claim.  Please try again and reply with detailed version
> > information and the exact query(s) used if you can get the behavior to
> repeat itself.
>
> => select id from form_d where '(kumar|gonzales)' ~* any(
> CAST(xpath('//PersonsList/PersonName/LastName/text()', data) as TEXT[])
> );
>
> This expression returned (correctly) the information I wanted.
>
> Shouldn't it *not* have worked b/c the '(kumar|gonzales)' regex is on the
> left?

It worked for three reasons:

1) The expression on the right-hand side was a valid regular expression
2) The value on the left-hand side contained a string that happened to
exactly match the expression on the right-hand side
3) The "(kumar|gonzales)" text, while it happens to look like a regular
expression, is just plain text because it appears on the left-hand side of
the operator.

Arguably "regular expression" should have its own data type just like JSON
and XML since RegEx is text-like but with special validation
characteristics.  Lacking that the system is unable to help in this
situation.  You may have gotten help if someone had a name that did not
resolve to a valid regular expression - in which case the match attempt
would have raised an error.

You said above that '(b|d') ~ 'abc' returned TRUE for you.  Did you actually
test that exact (simple) expression or did you immediately jump to your
convoluted example with XML and ANY(array)?

>
> I was hoping to be able to use the xml type field to be able to do '^a'
> type searches on names (e.g. a search for 'Fred' would also match
'Frederick',
> etc.) but since it seems that I cannot, I'll look at different ways of
solving this
> problem.
>

See Tom's suggestion of creating a custom function and operator that
reverses the order of the two text fields.

David J.




Denis Papathanasiou <denis.papathanasiou@banrai.com> writes:
> I did realize that ANY() must be a right-hand operator, but what I
> didn't understand (and admittedly still don't understand) is why regex
> operations that are normally right-side work from the left.

All regex operators in Postgres have the pattern on the right.  This
follows the lead of the SQL standard's LIKE operator, which has its
pattern on the right.  You might be used to some other language that
puts the pattern on the left, but that isn't some kind of natural law.
And it most certainly isn't the case that the operator will look at the
two text strings and try to guess which one is meant as a pattern.  That
would be impossibly unreliable.

> If you look at the four examples which follow the posix match table in
> the docs
> (http://www.postgresql.org/docs/9.1/static/functions-matching.html#FUNCTIONS-POSIX-TABLE),
> some of them work from the left side, e.g.:

Not to my eyes ...

            regards, tom lane


Re: Using POSIX Regular Expressions on xml type fields gives inconsistent results

From
hamann.w@t-online.de
Date:
>>
>> "David Johnston" <polobo@yahoo.com> writes:
>> > [optionally go look at " ~* " in the documentation at this point; or just
>> > try a simple flip-flop of the expression]
>>
>> > SELECT 'Kumar' ~* 'kuma' -> TRUE !!! (ah ha! I had the order reversed)
>>
>> > SELECT any( ARRAY['Kumar','Gozales']::text ) ... wait...ANY doesn't work on
>> > the other side... :(
>>
>> > [At this point I'd confirm or question why ANY hasn't been made to go both
>> > ways but also realize that I will have to approach this in a different way
>> > to achieve my goal.]
>>
>> It's been awhile since I looked at the point in detail, but I seem to
>> recall that there are fundamental syntactic-ambiguity reasons why the
>> ANY/ALL part has to be on the righthand side of the comparison operator.
>>
>> There's a much easier fix to this problem though, which is to invent a
>> "reverse ~" operator that does POSIX comparison with the pattern on the
>> left.  The hardest part of doing that for yourself is choosing a name
>> for the reverse operator --- it just goes like
>>
>> create function reverse_regexeq(text, text) returns bool as
>> 'select $2 ~ $1' language sql strict immutable;
>>
>> create operator ~~~ (procedure = reverse_regexeq,
>> leftarg = text, rightarg = text);
>>
>> and similarly for the case-insensitive version, and there you go:
>> pattern ~~~ ANY (whatever) solves the problem.
>>

Hi Tom,

while it might work to reverse the order in a regex, I have experienced severe slowdown
when the pattern comes from table data

Regards
Wolfgang Hamann



Re: Using POSIX Regular Expressions on xml type fields gives inconsistent results

From
"David Johnston"
Date:
> >> There's a much easier fix to this problem though, which is to invent
> >> a "reverse ~" operator that does POSIX comparison with the pattern on
> >> the left.  The hardest part of doing that for yourself is choosing a
> >> name for the reverse operator --- it just goes like
> >>
> >> create function reverse_regexeq(text, text) returns bool as 'select
> >> $2 ~ $1' language sql strict immutable;
> >>
> >> create operator ~~~ (procedure = reverse_regexeq, leftarg = text,
> >> rightarg = text);
> >>
> >> and similarly for the case-insensitive version, and there you go:
> >> pattern ~~~ ANY (whatever) solves the problem.
> >>
>
> Hi Tom,
>
> while it might work to reverse the order in a regex, I have experienced
> severe slowdown when the pattern comes from table data
>
> Regards
> Wolfgang Hamann

You at least have to provide a query that you feel is "too slow".  Since you
claim a "slowdown" you should also provide a query the better performing
query you are referring to.  For all queries provided you also need to
provide comparable timing information.

It is also unclear whether you experience this slowdown when using the
"reverse" expression only or if it occurs in any situation where the
expression comes from a table column rather than a string constant.

If it is indeed related to the custom reverse query the fact that you are
going through an SQL wrapper to access a c-level function is inevitably
going to degrade performance.

In short you need to provide much more detail and clarity regarding what
exactly made you draw this conclusion so that others can repeat and verify
and then provide meaningful explanations or solutions.

If you have not done so please read the content at
http://wiki.postgresql.org/wiki/Slow_Query_Questions and
http://wiki.postgresql.org/wiki/Guide_to_reporting_problems

David J.





"David Johnston" <polobo@yahoo.com> writes:
>> while it might work to reverse the order in a regex, I have experienced
>> severe slowdown when the pattern comes from table data

> If it is indeed related to the custom reverse query the fact that you are
> going through an SQL wrapper to access a c-level function is inevitably
> going to degrade performance.

I believe that in most situations of interest, the SQL wrapper would get
"inlined" and thus create no significant per-row penalty.  It's possible
this isn't happening in some specific case, but as you say we'd need to
see a concrete example to know one way or the other.

Another point that comes to mind is that "pattern comes from table data"
probably means "query uses a whole lot of different patterns over its
lifetime".  Each different pattern is going to require regex
compilation, which is not a particularly cheap operation.

            regards, tom lane