Re: [GENERAL] workaround for lack of REPLACE() function - Mailing list pgsql-hackers

From Joe Conway
Subject Re: [GENERAL] workaround for lack of REPLACE() function
Date
Msg-id 3D549ED8.40102@joeconway.com
Whole thread Raw
In response to Re: [GENERAL] workaround for lack of REPLACE() function  (Thomas Lockhart <lockhart@fourpalms.org>)
Responses Re: [GENERAL] workaround for lack of REPLACE() function  (Tatsuo Ishii <t-ishii@sra.co.jp>)
List pgsql-hackers
Joe Conway wrote:
> more work than I had time for when I wrote the current replace(). But as 
> I said, if there is support for getting this into the backend, I'll add 
> it to my todo list:
> 
> - Create new backend function replace()
> - Either create new backend functions, or merge into existing functions: 
> to_hex() and extract_tok()
> 

I'm just starting to take a look at this again. While studying the 
current text_substr() function I found two behaviors which conflict with 
specific SQL92/SQL99 requirements, and one bug. First the spec 
compliance -- SQL92 section 6.7/SQL99 section 6.18 say:

If <character substring function> is specified, then:
a) Let C be the value of the <character value expression>, let LC be the   length of C, and let S be the value of the
<startposition>.
 
b) If <string length> is specified, then let L be the value of <string   length> and let E be S+L. Otherwise, let E be
thelarger of LC + 1   and S.
 
c) If either C, S, or L is the null value, then the result of the   <character substring function> is the null value.
d) If E is less than S, then an exception condition is raised: data   exception-substring error.
e) Case:   i) If S is greater than LC or if E is less than 1, then the result of      the <character substring
function>is a zero-length string.  ii) Otherwise,      1) Let SI be the larger of S and 1. Let El be the smaller of E
and        LC+l. Let Ll be El-Sl.      2) The result of the <character substring function> is a character
stringcontaining the Ll characters of C starting at character         number Sl in the same order that the characters
appearin C.
 

The only way for d) to be true is when L < 0. Instead of an error, we do:
test=# select substr('hello',2,-1); substr
-------- ello
(1 row)

The other spec issue is wrt para e)i). If E (=S+L) < 1, we should return 
a zero-length string. Currently I get:
test=# select substr('hello',-4,3); substr
-------- hello
(1 row)

Neither behavior is documented (unless it's somewhere other than:
http://developer.postgresql.org/docs/postgres/functions-string.html ).

The bug is this one:
test=# create DATABASE testmb with encoding = 'EUC_JP';
CREATE DATABASE
test=# \c testmb
You are now connected to database testmb.
testmb=# select substr('hello',6,2); substr
-------- ~
(1 row)

testmb=# \c test
You are now connected to database test.
test=# select substr('hello',6,2); substr
--------

(1 row)

The multibyte database behavior is the bug. The SQL_ASCII behavior is 
correct (zero-length string):
test=# select substr('hello',6,2) is null; ?column?
---------- f
(1 row)


Any objection if I rework this function to meet SQL92 and fix the bug? 
Or is the SQL92 part not desirable because it breaks backward 
compatability?

In any case, can the #ifdef MULTIBYTE's be removed now in favor of a 
test for encoding max length?

Joe



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Proposal: stand-alone composite types
Next
From: Joe Conway
Date:
Subject: Re: Proposal: stand-alone composite types