Thread: 8.0.3 regexp_replace()...

8.0.3 regexp_replace()...

From
"rlee0001"
Date:
I have a stupid problem. My server is running an old version of
postgres (8.0.3) and therefore lacks the regexp_replace() function. It
does however support substring and replace functions. So what I am
trying to do is emulate the regexp_replace() function by creating a
function which finds each matching substring and replaces them by hand
in a loop. The loop is supposed to exit when there are no more matches
to replace. The problem is that the function enters and infinate loop
which brings down the server. The faulty code follows:

-- PRE-PostgreSQL 8.1 regexp_replace() function called
regexp_replacex()
CREATE OR REPLACE FUNCTION "regexp_replacex" (source varchar, pattern
varchar, replacement varchar) RETURNS varchar AS
$body$
DECLARE
 retvalue VARCHAR;
BEGIN
 retvalue = "source";
 LOOP
  retvalue = REPLACE(retvalue, SUBSTRING(retvalue FROM "pattern"),
"replacement");
  EXIT WHEN retvalue = REPLACE(retvalue, SUBSTRING(retvalue FROM
"pattern"), "replacement");
 END LOOP;
 RETURN retvalue;
END;
$body$
LANGUAGE 'plpgsql' VOLATILE CALLED ON NULL INPUT SECURITY INVOKER;

Now first of all I realize that my code is stupid and the problem is
stupid and if it were up to be I'd be running 8.1 anyway. But its not
up to me but I really need this functionality. The parameters might be
in a different order than the real 8.1 regexp_replace so pay attention
if you test it and do expect to have to kill the server when it enters
the infinate loop.

I can not beleave that nobody has done this before and yet I can't find
it anywhere on the net. I've been searching Google and Google Groups
for "postgres replace substring return function" and nothing.

Can anyone spot my folly?


Re: 8.0.3 regexp_replace()...

From
"rlee0001"
Date:
I did get the code working. The function DDL follows:

CREATE OR REPLACE FUNCTION "webadmin"."regexp_replacex" (source
varchar, pattern varchar, replacement varchar) RETURNS varchar AS
$body$
DECLARE
 retvalue VARCHAR;
BEGIN
 retvalue = "source";
 LOOP
  retvalue = REPLACE(retvalue, COALESCE(SUBSTRING(retvalue FROM
"pattern"), ''), "replacement");
  EXIT WHEN retvalue = REPLACE(retvalue, COALESCE(SUBSTRING(retvalue
FROM "pattern"), ''), "replacement");
 END LOOP;
 RETURN retvalue;
END;
$body$
LANGUAGE 'plpgsql' VOLATILE CALLED ON NULL INPUT SECURITY INVOKER;

The problem was that SUBSTRING returns NULL if it cannot find any
matches for the pattern and when the second parameter to REPLACE
returns NULL, REPLACE returns NULL (which is idiotic). Using COALESCE I
ensure that is SUBSTRING cannot find a match that '' (empty string) is
sent to REPLACE. REPLACE then behaves as expected and replaces nothing.

Debugging PostgreSQL's retarded behaviour around NULL values can be a
real pain. But at least I learned to use EMS PostgreSQL Manager for
Windows' function debugger, which can step through a function while
reporting the values of all variables, parameters and return values.
Very handy.

-Robert


Re: 8.0.3 regexp_replace()...

From
Tom Lane
Date:
"rlee0001" <robeddielee@hotmail.com> writes:
> CREATE OR REPLACE FUNCTION "regexp_replacex" (source varchar, pattern
> varchar, replacement varchar) RETURNS varchar AS
> $body$
> DECLARE
>  retvalue VARCHAR;
> BEGIN
>  retvalue = "source";
>  LOOP
>   retvalue = REPLACE(retvalue, SUBSTRING(retvalue FROM "pattern"),
> "replacement");
>   EXIT WHEN retvalue = REPLACE(retvalue, SUBSTRING(retvalue FROM
> "pattern"), "replacement");
>  END LOOP;
>  RETURN retvalue;
> END;
> $body$
> LANGUAGE 'plpgsql' VOLATILE CALLED ON NULL INPUT SECURITY INVOKER;

You probably don't want to be double-quoting the variable names.  Also,
I think this should be marked STRICT IMMUTABLE rather than the options
you have chosen.

> if you test it and do expect to have to kill the server when it enters
> the infinate loop.

Control-C cancels the query just fine for me ...

            regards, tom lane

Re: 8.0.3 regexp_replace()...

From
Martijn van Oosterhout
Date:
On Mon, Jan 30, 2006 at 11:27:23AM -0800, rlee0001 wrote:
> The problem was that SUBSTRING returns NULL if it cannot find any
> matches for the pattern and when the second parameter to REPLACE
> returns NULL, REPLACE returns NULL (which is idiotic). Using COALESCE I
> ensure that is SUBSTRING cannot find a match that '' (empty string) is
> sent to REPLACE. REPLACE then behaves as expected and replaces nothing.

Well, the rule for STRICT functions (which replace is) is that if any
of the arguments are NULL, the result is NULL. Most of the time this is
what you want. IMHO the problem above is substring returning null. NULL
should generally mean "unknown" and a substr that doesn't match
certainly isn't unknown. Question is, what should it return then?

In SQL2003 standard terms this is a "null-call" function:

4.27 SQL-invoked routines
...
A null-call function is an SQL-invoked function that is defined to
return the null value if any of its input arguments is the null value.
A null-call function is an SQL-invoked function whose <null-call
clause> specifies RETURNS NULL ON NULL INPUT.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

Re: 8.0.3 regexp_replace()...

From
Stephan Szabo
Date:
On Mon, 30 Jan 2006, rlee0001 wrote:

> I did get the code working. The function DDL follows:
>
> CREATE OR REPLACE FUNCTION "webadmin"."regexp_replacex" (source
> varchar, pattern varchar, replacement varchar) RETURNS varchar AS
> $body$
> DECLARE
>  retvalue VARCHAR;
> BEGIN
>  retvalue = "source";
>  LOOP
>   retvalue = REPLACE(retvalue, COALESCE(SUBSTRING(retvalue FROM
> "pattern"), ''), "replacement");
>   EXIT WHEN retvalue = REPLACE(retvalue, COALESCE(SUBSTRING(retvalue
> FROM "pattern"), ''), "replacement");
>  END LOOP;
>  RETURN retvalue;
> END;
> $body$
> LANGUAGE 'plpgsql' VOLATILE CALLED ON NULL INPUT SECURITY INVOKER;
>
> The problem was that SUBSTRING returns NULL if it cannot find any
> matches for the pattern and when the second parameter to REPLACE
> returns NULL, REPLACE returns NULL (which is idiotic).

Why do you say that?  I'd say that's precisely the most reasonable answer.
You're asking to replace an unknown portion of a string (since NULL is
unknown, it might match some portion of the string, it might not) with
something else.  The answer to that seems pretty unknown to me.

Re: 8.0.3 regexp_replace()...

From
Stephan Szabo
Date:
On Tue, 31 Jan 2006, Martijn van Oosterhout wrote:

> On Mon, Jan 30, 2006 at 11:27:23AM -0800, rlee0001 wrote:
> > The problem was that SUBSTRING returns NULL if it cannot find any
> > matches for the pattern and when the second parameter to REPLACE
> > returns NULL, REPLACE returns NULL (which is idiotic). Using COALESCE I
> > ensure that is SUBSTRING cannot find a match that '' (empty string) is
> > sent to REPLACE. REPLACE then behaves as expected and replaces nothing.
>
> Well, the rule for STRICT functions (which replace is) is that if any
> of the arguments are NULL, the result is NULL. Most of the time this is
> what you want. IMHO the problem above is substring returning null. NULL
> should generally mean "unknown" and a substr that doesn't match
> certainly isn't unknown. Question is, what should it return then?

Sadly, that seems to me to match the SQL2003 semantics for its regular
expression substring search.  Or at least I believe that's what 6.29 GR5g
is implying.

Re: 8.0.3 regexp_replace()...

From
"rlee0001"
Date:
Martijn,

(Warning: This post contains somewhat of a long rant followed by a
question.)

I realize that NULL is the unknown value in SQL and that (most)
functions therefore treat it as such. I have no problem with "RETURNS
NULL ON NULL INPUT" except when a function returns NULL for no good
reason. If I were the ruler of the world I would declare that:

1) All functions correctly treat NULL input as "unknown" and if this
prevents them from returning a logical value they must return NULL.
(SQL has this already)
2) If an error occurs within a function which prevents the function
from returning a logical value (such as invalid inputs) the function
must raise an exception and not return. IE: No value is returned at all
-- not even NULL. PostgreSQL seems to already do this.
3) If a function can not calculate a return value due to some ambiguity
it should return NULL. This is really just an extention of rule 1.
4) If a function can calculate a return value (no processing error,
NULL input or other ambiguity exists) it MUST NOT return NULL.

My problem is with SUBSTRING. When it fails to find a match for the
regexp pattern within the source string it returns NULL. Why?! There is
no ambiguity nor NULL inputs! The result of the function is known to
logically by "No Results". So the only question is this: How do you
logically return that from a function?

a) An empty string is logically incorrect because an empty string is a
value and returning it would imply that it was matched.
b) NULL is logically incorrect because the result was known. NULL is
supposed to be returned when the result is unknown.

Does Postgres have a "EMPTY" or "NOTHING" return value? What does a
SELECT return when there are no matching records? NULL? Hell no! Then
what? Because whatever SELECT returns when it finds no matching records
is what SUBSTRING should return when there are no matching substrings.
Of course SELECT was never intended to return a SCALAR value either
while SUBSTRING was. Maybe "EMPTY" is something that SCALAR values
simply are no capable of specifying?

I am suggesting that the behaviour of SUBSTRING returning NULL when no
matches is found is either a bug in PostgreSQL or a flaw in the SQL
specification. It is not logical.

Another words I wanted by function to do this (in english):

Replace the next occurance of <pattern> in <sourcestring> with
<replacementstring>; if found. Repeat for all additional occurances; if
any.

At no point should any ambiguity ever be introduced into this function
so long as none of the three inputs are NULL since no external data
source is used. Therefore nothing should ever return NULL during the
operation of the function.

In my experience having to code "special cases" around the NULL value
is one of the biggest annoyances in SQL. Special cases should never
have to be coded for specific values that have no bearing on the
business logic of the application. As a programmer I want to be
reasonably assured that I can anticipate NULL popping up and that
anywhere that it might pop up it will be handled gracefully.

Obviously I'm somewhat new to SQL (less than 2 years) and especially to
PostgreSQL (about 2 months). In addition to returning null on null
input, what are the best programming practices for dealing with NULL
within a procedure/query? Was coalesce the best way to handle this?
Should I have just checked for NULL in my EXIT statement instead?

-Robert


Re: 8.0.3 regexp_replace()...

From
Tom Lane
Date:
"rlee0001" <robeddielee@hotmail.com> writes:
> My problem is with SUBSTRING. When it fails to find a match for the
> regexp pattern within the source string it returns NULL. Why?!

Because the SQL standard says so.

Of course, you're free to wrap the built-in function in your own
function that has behavior you like better for this case...

            regards, tom lane

Re: 8.0.3 regexp_replace()...

From
Stephan Szabo
Date:
On Tue, 31 Jan 2006, rlee0001 wrote:

> I am suggesting that the behaviour of SUBSTRING returning NULL when no
> matches is found is either a bug in PostgreSQL or a flaw in the SQL
> specification. It is not logical.

No, but sadly it seems to be what the SQL spec wants for its similar
construct.

---

In general, SQL's handling of NULLs is badly designed.

Sometimes it's misused (like the substring case).

Sometimes it's confusing (like the IN and NOT IN cases).

Sometimes it's just bizarre.  If cardinality of a table expression is
important (ie, count(*) is meaningful), why is DISTINCT defined in a way
that basically does not give results consistent with NULL being unknown.
UNIQUE(q) should return NULL in the presence of NULLs rather than true,
since the real result is well, unknown (the two of these together have the
side effect of UNIQUE being true not guaranteeing that the cardinality of
a subquery and the subquery with distinct being the same).

---

Re: 8.0.3 regexp_replace()...

From
"rlee0001"
Date:
Stephan,

How do IN and NOT IN treat NULLs? Don't these functions search an array
for a specified value returning true or false? I guess the intuitive
thing for IN and NOT IN to do would be to return NULL if NULL appears
anywhere in the array since those elements values are "unknown".

Personally I think treating NULL as "unknown" is rediculous. NULL is a
value and its value is known to be NULL. I know what NULL is: NULL. How
many NULLs do I have here {NULL, 'hi', NULL, NULL}? NULL NULLs? No,
three NULLs. How many NULLs are in this string: 'hi'? NULL NULLs? No,
zero NULLs. How many occurances of 'yo' are in 'hey'? NULL occurances?
No, zero occurances. How many NULLs are in this paragraph? You better
count them yourself because if you ask SQL you know what it will say.
NULL. Or will it tell you? Who the hell knows!?

Oh yeah, my favorite is this: NULL::VARCHAR? Nope, you can't do it. Not
without creating your own CAST. Seems to me that an obvious value would
be 'NULL'. Or maybe '' (empty string). I hate having to use COALESCE.
It just reaks of a bad programming practice.

Thats the way I see it. But if NULL has to mean "unknown" then all the
functions should treat it as such. Also several other values including
UNSPECIFIED and EMPTY should be provided. EMPTY should return an empty
array {} and UNSPECIFIED should do what NULL is often used to mean
(nothing). Then NULL should be renamed to UNKNOWN to clear up any
confusion. :o)

But I'm not really the ruler of the world. At least not yet. But maybe
someday...

-Robert


Re: 8.0.3 regexp_replace()...

From
Stephan Szabo
Date:
On Wed, 1 Feb 2006, rlee0001 wrote:

> How do IN and NOT IN treat NULLs? Don't these functions search an array
> for a specified value returning true or false? I guess the intuitive
> thing for IN and NOT IN to do would be to return NULL if NULL appears
> anywhere in the array since those elements values are "unknown".

It's IN and NOT IN (subselect) that people often get confused by, exactly
because it does return NULL which means that a row not selected by IN may
also not be selected by NOT IN.

> Personally I think treating NULL as "unknown" is rediculous. NULL is a
> value and its value is known to be NULL. I know what NULL is: NULL.

The problem is that NULL isn't a known string, numeric, time, etc value.
Until you define semantics for it, you don't really have a value. Those
semantics could be alot simpler than the SQL ones however.

> Oh yeah, my favorite is this: NULL::VARCHAR? Nope, you can't do it. Not
> without creating your own CAST. Seems to me that an obvious value would
> be 'NULL'. Or maybe '' (empty string). I hate having to use COALESCE.
> It just reaks of a bad programming practice.

I don't see how using COALESCE is particularly worse than using CAST,
honestly.  CAST(NULL AS VARCHAR(n)) versus COALESCE(NULL, <what you want
out>) seems pretty much a wash, unless you want it to happen implicitly.

> Thats the way I see it. But if NULL has to mean "unknown" then all the
> functions should treat it as such. Also several other values including
> UNSPECIFIED and EMPTY should be provided. EMPTY should return an empty
> array {} and UNSPECIFIED should do what NULL is often used to mean
> (nothing). Then NULL should be renamed to UNKNOWN to clear up any
> confusion. :o)

Well, yes, keeping the separate uses of NULL separate would have been
nice.

Re: 8.0.3 regexp_replace()...

From
Andrew - Supernews
Date:
On 2006-02-01, rlee0001 <robeddielee@hotmail.com> wrote:
> Stephan,
>
> How do IN and NOT IN treat NULLs? Don't these functions search an array
> for a specified value returning true or false? I guess the intuitive
> thing for IN and NOT IN to do would be to return NULL if NULL appears
> anywhere in the array since those elements values are "unknown".

foo IN (x1,x2,x3) is exactly equivalent to
(foo = x1) OR (foo = x2) OR (foo = x3)

foo NOT IN (x1,x2,x3) is likewise equivalent to
(foo <> x1) AND (foo <> x2) AND (foo <> x3)

In the first case, if one of the x? is null, then the result of the
expression is true if any of the clauses is true, or null otherwise;
TRUE OR NULL is true, while FALSE OR NULL is null.

In the second case, the result is likewise determined by the logic of
three-valued AND. Since TRUE AND NULL is null, and FALSE AND NULL is
false, that means that the expression can never return true if any of
the x? is null.

> Oh yeah, my favorite is this: NULL::VARCHAR? Nope, you can't do it.

=> select null::varchar;
 varchar
---------

(1 row)

works for me. (Note: that's not an empty string; use \pset null in
psql to see the difference.)

> Not without creating your own CAST.

Casting from what? NULL isn't a type...

> Seems to me that an obvious value would be 'NULL'. Or maybe ''
> (empty string).

If NULL ever got converted to 'NULL' or '', how would you distinguish it
from the literal 'NULL' or ''?

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services