Re: [HACKERS] [GENERAL] workaround for lack of REPLACE() function - Mailing list pgsql-patches
| From | Joe Conway |
|---|---|
| Subject | Re: [HACKERS] [GENERAL] workaround for lack of REPLACE() function |
| Date | |
| Msg-id | 3D59F57D.1070804@joeconway.com Whole thread Raw |
| Responses |
Re: [HACKERS] [GENERAL] workaround for lack of REPLACE() function
|
| List | pgsql-patches |
Tatsuo Ishii wrote:
> Joe Conway wrote:
>>Any objection if I rework this function to meet SQL92 and fix the bug?
>
> I don't object.
>
>>Or is the SQL92 part not desirable because it breaks backward
>>compatability?
>
> I don't think so.
>
>>In any case, can the #ifdef MULTIBYTE's be removed now in favor of a
>>test for encoding max length?
>
> Sure.
<sorry so long-winded>
Attached is a patch that implements the above items wrt text_substr(). I
also modified textlen(), textoctetlen(), byteaoctetlen(), and
bytea_substr(). Here's a summary of the change to each:
- text_substr(): rewrite function to meet SQL92, fix MB related bug,
and remove #ifdef MULTIBYTE.
- bytea_substr(): same as text_substr() enc max len == 1.
- textoctetlen(), byteaoctetlen(): use
toast_raw_datum_size(PG_GETARG_DATUM(0)) - VARHDRSZ)
to avoid detoasting.
- textlen(): same as textoctetlen() for enc max len == 1, and remove
#ifdef MULTIBYTE.
I did some benchmarking to ensure no performance degradation, and to
help me understand MB and related performance issues. The results were
very enlightening:
===================================================================
First test - textlen() (already reported, repeated here for completeness):
-------------------------------------------------------------------
create table strtest(f1 text);
do 100 times
insert into strtest values('12345....'); -- 100000 characters
loop
do 1000 times
select length(f1) from strtest;
loop
-------------------------------------------------------------------
Results:
SQL_ASCII database, new code 2 seconds
SQL_ASCII database, old code 66 seconds
EUC_JP database, new & old code 469 seconds
===================================================================
Second test - short string test:
-------------------------------------------------------------------
create table parts(partnum text);
<fill with ~220000 rows, 8 to 12 characters each>
do 300 times
select substr(partnum, 3, 3) from parts;
loop
-------------------------------------------------------------------
Results:
SQL_ASCII database, old code 352 seconds
SQL_ASCII database, new code 350 seconds
EUC_JP database, old code 461 seconds
EUC_JP database, new code 422 seconds
===================================================================
Third test - long string, EXTENDED storage (EXTERNAL+COMPRESSED):
-------------------------------------------------------------------
create table strtest(f1 text);
do 100 times
insert into strtest values('12345....'); -- 100000 characters
loop
do 1000 times
select substr(f1, 89000, 10000) from strtest;
loop
-------------------------------------------------------------------
Results:
SQL_ASCII database, old code 59 seconds
SQL_ASCII database, new code 58 seconds
EUC_JP database, old code 915 seconds
EUC_JP database, new code 912 seconds
===================================================================
Forth test - long string, EXTERNAL storage (not COMPRESSED)
-------------------------------------------------------------------
create table strtest(f1 text);
do 100 times
insert into strtest values('12345....'); -- 100000 characters
loop
do 1000 times
select substr(f1, 89000, 10000) from strtest;
loop
-------------------------------------------------------------------
Results:
SQL_ASCII database, old code 17 seconds
SQL_ASCII database, new code 17 seconds
EUC_JP database, old code 918 seconds
EUC_JP database, new code 911 seconds
The only remaining problem is that this causes opr_sanity to fail based
on this query:
-- Considering only built-in procs (prolang = 12), look for multiple
-- uses of the same internal function (ie, matching prosrc fields).
-- It's OK to have several entries with different pronames for the same
-- internal function, but conflicts in the number of arguments and other
-- critical items should be complained of.
SELECT p1.oid, p1.proname, p2.oid, p2.proname
FROM pg_proc AS p1, pg_proc AS p2
WHERE p1.oid != p2.oid AND
p1.prosrc = p2.prosrc AND
p1.prolang = 12 AND p2.prolang = 12 AND
(p1.prolang != p2.prolang OR
p1.proisagg != p2.proisagg OR
p1.prosecdef != p2.prosecdef OR
p1.proisstrict != p2.proisstrict OR
p1.proretset != p2.proretset OR
p1.provolatile != p2.provolatile OR
p1.pronargs != p2.pronargs);
This fails because I implemented text_substr() and bytea_substr() to
take either 2 or 3 args. This was necessary for SQL92 spec compliance.
SQL92 requires L < 0 to throw an error, and L IS NULL to return NULL. It
also requires that if L is not provided, the length to the end of the
string is assumed. Current code handles L IS NULL correctly but not L <
0 -- it assumes L < 0 is the same as L is not provided. By allowing the
function to determine if it was passed 2 or 3 args, this can be handled
properly.
So the question is, can/should I change opr_sanity to allow this case?
I also still owe some additions to the strings regression test to make
it cover toasted values.
Other than those two issues, I think the patch is ready to go. I'm
planning to take on the replace function next.
Thanks,
Joe
Index: src/backend/utils/adt/varlena.c
===================================================================
RCS file: /opt/src/cvs/pgsql-server/src/backend/utils/adt/varlena.c,v
retrieving revision 1.87
diff -c -r1.87 varlena.c
*** src/backend/utils/adt/varlena.c 4 Aug 2002 06:44:47 -0000 1.87
--- src/backend/utils/adt/varlena.c 14 Aug 2002 05:42:45 -0000
***************
*** 18,23 ****
--- 18,24 ----
#include "mb/pg_wchar.h"
#include "miscadmin.h"
+ #include "access/tuptoaster.h"
#include "utils/builtins.h"
#include "utils/pg_locale.h"
***************
*** 285,303 ****
Datum
textlen(PG_FUNCTION_ARGS)
{
! text *t = PG_GETARG_TEXT_P(0);
! #ifdef MULTIBYTE
! /* optimization for single byte encoding */
! if (pg_database_encoding_max_length() <= 1)
! PG_RETURN_INT32(VARSIZE(t) - VARHDRSZ);
!
! PG_RETURN_INT32(
! pg_mbstrlen_with_len(VARDATA(t), VARSIZE(t) - VARHDRSZ)
! );
! #else
! PG_RETURN_INT32(VARSIZE(t) - VARHDRSZ);
! #endif
}
/*
--- 286,309 ----
Datum
textlen(PG_FUNCTION_ARGS)
{
! /* fastpath when max encoding length is one */
! if (pg_database_encoding_max_length() == 1)
! PG_RETURN_INT32(toast_raw_datum_size(PG_GETARG_DATUM(0)) - VARHDRSZ);
! if (pg_database_encoding_max_length() > 1)
! {
! text *t = PG_GETARG_TEXT_P(0);
!
! PG_RETURN_INT32(pg_mbstrlen_with_len(VARDATA(t),
! VARSIZE(t) - VARHDRSZ));
! }
!
! /* should never get here */
! elog(ERROR, "Invalid backend encoding; encoding max length "
! "is less than one.");
!
! /* notreached: suppress compiler warning */
! PG_RETURN_NULL();
}
/*
***************
*** 308,316 ****
Datum
textoctetlen(PG_FUNCTION_ARGS)
{
! text *arg = PG_GETARG_TEXT_P(0);
!
! PG_RETURN_INT32(VARSIZE(arg) - VARHDRSZ);
}
/*
--- 314,320 ----
Datum
textoctetlen(PG_FUNCTION_ARGS)
{
! PG_RETURN_INT32(toast_raw_datum_size(PG_GETARG_DATUM(0)) - VARHDRSZ);
}
/*
***************
*** 358,363 ****
--- 362,375 ----
PG_RETURN_TEXT_P(result);
}
+
+ #define PG_TEXTARG_GET_STR(arg_) \
+ DatumGetCString(DirectFunctionCall1(textout, PG_GETARG_DATUM(arg_)))
+ #define PG_TEXT_GET_STR(textp_) \
+ DatumGetCString(DirectFunctionCall1(textout, PointerGetDatum(textp_)))
+ #define PG_STR_GET_TEXT(str_) \
+ DatumGetTextP(DirectFunctionCall1(textin, CStringGetDatum(str_)))
+
/*
* text_substr()
* Return a substring starting at the specified position.
***************
*** 382,471 ****
* - Thomas Lockhart 1998-12-10
* Now uses faster TOAST-slicing interface
* - John Gray 2002-02-22
*/
Datum
text_substr(PG_FUNCTION_ARGS)
{
! text *string;
! int32 m = PG_GETARG_INT32(1);
! int32 n = PG_GETARG_INT32(2);
! int32 sm;
! int32 sn;
! int eml = 1;
! #ifdef MULTIBYTE
! int i;
! int len;
! text *ret;
! char *p;
! #endif
! /*
! * starting position before the start of the string? then offset into
! * the string per SQL92 spec...
! */
! if (m < 1)
{
! n += (m - 1);
! m = 1;
! }
! /* Check for m > octet length is made in TOAST access routine */
! /* m will now become a zero-based starting position */
! sm = m - 1;
! sn = n;
! #ifdef MULTIBYTE
! eml = pg_database_encoding_max_length ();
! if (eml > 1)
! {
! sm = 0;
! if (n > -1)
! sn = (m + n) * eml + 3; /* +3 to avoid mb characters overhanging slice end */
else
! sn = n; /* n < 0 is special-cased by heap_tuple_untoast_attr_slice */
! }
! #endif
! string = PG_GETARG_TEXT_P_SLICE (0, sm, sn);
! if (eml == 1)
! {
! PG_RETURN_TEXT_P (string);
! }
! #ifndef MULTIBYTE
! PG_RETURN_NULL(); /* notreached: suppress compiler warning */
! #endif
! #ifdef MULTIBYTE
! if (n > -1)
! len = pg_mbstrlen_with_len (VARDATA (string), sn - 3);
! else /* n < 0 is special-cased; need full string length */
! len = pg_mbstrlen_with_len (VARDATA (string), VARSIZE(string)-VARHDRSZ);
!
! if (m > len)
! {
! m = 1;
! n = 0;
! }
! m--;
! if (((m + n) > len) || (n < 0))
! n = (len - m);
!
! p = VARDATA(string);
! for (i = 0; i < m; i++)
! p += pg_mblen(p);
! m = p - VARDATA(string);
! for (i = 0; i < n; i++)
! p += pg_mblen(p);
! n = p - (VARDATA(string) + m);
! ret = (text *) palloc(VARHDRSZ + n);
! VARATT_SIZEP(ret) = VARHDRSZ + n;
! memcpy(VARDATA(ret), VARDATA(string) + m, n);
! PG_RETURN_TEXT_P(ret);
! #endif
}
/*
--- 394,582 ----
* - Thomas Lockhart 1998-12-10
* Now uses faster TOAST-slicing interface
* - John Gray 2002-02-22
+ * Remove "#ifdef MULTIBYTE" and test for encoding_max_length instead. Change
+ * behaviors conflicting with SQL92 to meet SQL92 (if E = S + L < S throw
+ * error; if E < 1, return '', not entire string). Fixed MB related bug when
+ * S > LC and < LC + 4 sometimes garbage characters are returned.
+ * - Joe Conway 2002-08-10
*/
Datum
text_substr(PG_FUNCTION_ARGS)
{
! int S = PG_GETARG_INT32(1); /* start position */
! int eml = pg_database_encoding_max_length();
! int S1; /* adjusted start position */
! int L1; /* adjusted substring length */
! /* life is easy if the encoding max length is 1 */
! if (eml == 1)
{
! S1 = Max(S, 1);
! if (fcinfo->nargs == 2)
! {
! /*
! * Not passed a length - PG_GETARG_TEXT_P_SLICE()
! * grabs everything to the end of the string if we pass it
! * a negative value for length.
! */
! L1 = -1;
! }
! else
! {
! /* end position */
! int E = S + PG_GETARG_INT32(2);
! /*
! * A negative value for L is the only way for the end position
! * to be before the start. SQL99 says to throw an error.
! */
! if (E < S)
! elog(ERROR, "negative substring length not allowed");
! /*
! * A zero or negative value for the end position can happen if the start
! * was negative or one. SQL99 says to return a zero-length string.
! */
! if (E < 1)
! PG_RETURN_TEXT_P(PG_STR_GET_TEXT(""));
!
! L1 = E - S1;
! }
!
! /*
! * If the start position is past the end of the string,
! * SQL99 says to return a zero-length string --
! * PG_GETARG_TEXT_P_SLICE() will do that for us.
! * Convert to zero-based starting position
! */
! PG_RETURN_TEXT_P (PG_GETARG_TEXT_P_SLICE (0, S1 - 1, L1));
! }
! else if (eml > 1)
! {
! /*
! * When encoding max length is > 1, we can't get LC without
! * detoasting, so we'll grab a conservatively large slice
! * now and go back later to do the right thing
! */
! int slice_start;
! int slice_size;
! int slice_strlen;
! text *slice;
! int E1;
! int i;
! char *p;
! char *s;
! text *ret;
!
! /*
! * if S is past the end of the string, the tuple toaster
! * will return a zero-length string to us
! */
! S1 = Max(S, 1);
!
! /*
! * We need to start at position zero because there is no
! * way to know in advance which byte offset corresponds to
! * the supplied start position.
! */
! slice_start = 0;
!
! if (fcinfo->nargs == 2)
! {
! /*
! * If we were not passed a length, the spec says that
! * E = Max(LC + 1, S). Since we don't know LC yet, set
! * slice_size = -1 which will cause heap_tuple_untoast_attr_slice
! * to give use everything to the end of the string.
! * If S > LC + 1, we'll get back a zero length string anyway.
! */
! slice_size = L1 = -1;
! }
else
! {
! int E = S + PG_GETARG_INT32(2);
! /*
! * A negative value for L is the only way for the end position
! * to be before the start. SQL99 says to throw an error.
! */
! if (E < S)
! elog(ERROR, "negative substring length not allowed");
! /*
! * A zero or negative value for the end position can happen if the start
! * was negative or one. SQL99 says to return a zero-length string.
! */
! if (E < 1)
! PG_RETURN_TEXT_P(PG_STR_GET_TEXT(""));
! /*
! * if E is past the end of the string, the tuple toaster
! * will truncate the length for us
! */
! L1 = E - S1;
! /*
! * Total slice size in bytes can't be any longer than the start
! * position plus substring length times the encoding max length.
! */
! slice_size = (S1 + L1) * eml;
! }
! slice = PG_GETARG_TEXT_P_SLICE (0, slice_start, slice_size);
! /* see if we got back an empty string */
! if ((VARSIZE(slice) - VARHDRSZ) == 0)
! PG_RETURN_TEXT_P(PG_STR_GET_TEXT(""));
!
! /* Now we can get the actual length of the slice in MB characters */
! slice_strlen = pg_mbstrlen_with_len (VARDATA(slice), VARSIZE(slice) - VARHDRSZ);
!
! /* Check that the start position wasn't > slice_strlen. If so,
! * SQL99 says to return a zero-length string.
! */
! if (S1 > slice_strlen)
! PG_RETURN_TEXT_P(PG_STR_GET_TEXT(""));
!
! /*
! * Adjust L1 and E1 now that we know the slice string length.
! * Again remember that S1 is one based, and slice_start is zero based.
! */
! if (L1 > -1)
! E1 = Min(S1 + L1 , slice_start + 1 + slice_strlen);
! else
! E1 = slice_start + 1 + slice_strlen;
!
! /*
! * Find the start position in the slice;
! * remember S1 is not zero based
! */
! p = VARDATA(slice);
! for (i = 0; i < S1 - 1; i++)
! p += pg_mblen(p);
!
! /* hang onto a pointer to our start position */
! s = p;
!
! /*
! * Count the actual bytes used by the substring of
! * the requested length.
! */
! for (i = S1; i < E1; i++)
! p += pg_mblen(p);
!
! ret = (text *) palloc(VARHDRSZ + (p - s));
! VARATT_SIZEP(ret) = VARHDRSZ + (p - s);
! memcpy(VARDATA(ret), s, (p - s));
!
! PG_RETURN_TEXT_P(ret);
! }
! else
! elog(ERROR, "Invalid backend encoding; encoding max length "
! "is less than one.");
!
! /* notreached: suppress compiler warning */
! PG_RETURN_NULL();
}
/*
***************
*** 758,766 ****
Datum
byteaoctetlen(PG_FUNCTION_ARGS)
{
! bytea *v = PG_GETARG_BYTEA_P(0);
!
! PG_RETURN_INT32(VARSIZE(v) - VARHDRSZ);
}
/*
--- 869,875 ----
Datum
byteaoctetlen(PG_FUNCTION_ARGS)
{
! PG_RETURN_INT32(toast_raw_datum_size(PG_GETARG_DATUM(0)) - VARHDRSZ);
}
/*
***************
*** 805,810 ****
--- 914,921 ----
PG_RETURN_BYTEA_P(result);
}
+ #define PG_STR_GET_BYTEA(str_) \
+ DatumGetByteaP(DirectFunctionCall1(byteain, CStringGetDatum(str_)))
/*
* bytea_substr()
* Return a substring starting at the specified position.
***************
*** 813,845 ****
* Input:
* - string
* - starting position (is one-based)
! * - string length
*
* If the starting position is zero or less, then return from the start of the string
* adjusting the length to be consistent with the "negative start" per SQL92.
! * If the length is less than zero, return the remaining string.
! *
*/
Datum
bytea_substr(PG_FUNCTION_ARGS)
{
! int32 m = PG_GETARG_INT32(1);
! int32 n = PG_GETARG_INT32(2);
!
! /*
! * starting position before the start of the string? then offset into
! * the string per SQL92 spec...
! */
! if (m < 1)
! {
! n += (m - 1);
! m = 1;
}
! /* m will now become a zero-based starting position */
! m--;
!
! PG_RETURN_BYTEA_P(PG_GETARG_BYTEA_P_SLICE (0, m, n));
}
/*
--- 924,983 ----
* Input:
* - string
* - starting position (is one-based)
! * - string length (optional)
*
* If the starting position is zero or less, then return from the start of the string
* adjusting the length to be consistent with the "negative start" per SQL92.
! * If the length is less than zero, an ERROR is thrown. If no third argument
! * (length) is provided, the length to the end of the string is assumed.
*/
Datum
bytea_substr(PG_FUNCTION_ARGS)
{
! int S = PG_GETARG_INT32(1); /* start position */
! int S1; /* adjusted start position */
! int L1; /* adjusted substring length */
!
! S1 = Max(S, 1);
!
! if (fcinfo->nargs == 2)
! {
! /*
! * Not passed a length - PG_GETARG_BYTEA_P_SLICE()
! * grabs everything to the end of the string if we pass it
! * a negative value for length.
! */
! L1 = -1;
}
+ else
+ {
+ /* end position */
+ int E = S + PG_GETARG_INT32(2);
! /*
! * A negative value for L is the only way for the end position
! * to be before the start. SQL99 says to throw an error.
! */
! if (E < S)
! elog(ERROR, "negative substring length not allowed");
!
! /*
! * A zero or negative value for the end position can happen if the start
! * was negative or one. SQL99 says to return a zero-length string.
! */
! if (E < 1)
! PG_RETURN_BYTEA_P(PG_STR_GET_BYTEA(""));
!
! L1 = E - S1;
! }
!
! /*
! * If the start position is past the end of the string,
! * SQL99 says to return a zero-length string --
! * PG_GETARG_TEXT_P_SLICE() will do that for us.
! * Convert to zero-based starting position
! */
! PG_RETURN_BYTEA_P(PG_GETARG_BYTEA_P_SLICE (0, S1 - 1, L1));
}
/*
Index: src/include/catalog/pg_proc.h
===================================================================
RCS file: /opt/src/cvs/pgsql-server/src/include/catalog/pg_proc.h,v
retrieving revision 1.253
diff -c -r1.253 pg_proc.h
*** src/include/catalog/pg_proc.h 9 Aug 2002 16:45:15 -0000 1.253
--- src/include/catalog/pg_proc.h 14 Aug 2002 02:34:38 -0000
***************
*** 2121,2127 ****
DESCR("remove initial characters from string");
DATA(insert OID = 882 ( rtrim PGNSP PGUID 14 f f t f i 1 25 "25" "select rtrim($1, \' \')" - _null_ ));
DESCR("remove trailing characters from string");
! DATA(insert OID = 883 ( substr PGNSP PGUID 14 f f t f i 2 25 "25 23" "select substr($1, $2, -1)" - _null_
));
DESCR("return portion of string");
DATA(insert OID = 884 ( btrim PGNSP PGUID 12 f f t f i 2 25 "25 25" btrim - _null_ ));
DESCR("trim both ends of string");
--- 2121,2127 ----
DESCR("remove initial characters from string");
DATA(insert OID = 882 ( rtrim PGNSP PGUID 14 f f t f i 1 25 "25" "select rtrim($1, \' \')" - _null_ ));
DESCR("remove trailing characters from string");
! DATA(insert OID = 883 ( substr PGNSP PGUID 12 f f t f i 2 25 "25 23" text_substr - _null_ ));
DESCR("return portion of string");
DATA(insert OID = 884 ( btrim PGNSP PGUID 12 f f t f i 2 25 "25 25" btrim - _null_ ));
DESCR("trim both ends of string");
***************
*** 2130,2136 ****
DATA(insert OID = 936 ( substring PGNSP PGUID 12 f f t f i 3 25 "25 23 23" text_substr - _null_ ));
DESCR("return portion of string");
! DATA(insert OID = 937 ( substring PGNSP PGUID 14 f f t f i 2 25 "25 23" "select substring($1, $2, -1)" -
_null_));
DESCR("return portion of string");
/* for multi-byte support */
--- 2130,2136 ----
DATA(insert OID = 936 ( substring PGNSP PGUID 12 f f t f i 3 25 "25 23 23" text_substr - _null_ ));
DESCR("return portion of string");
! DATA(insert OID = 937 ( substring PGNSP PGUID 12 f f t f i 2 25 "25 23" text_substr - _null_ ));
DESCR("return portion of string");
/* for multi-byte support */
***************
*** 2778,2784 ****
DESCR("concatenate");
DATA(insert OID = 2012 ( substring PGNSP PGUID 12 f f t f i 3 17 "17 23 23" bytea_substr - _null_ ));
DESCR("return portion of string");
! DATA(insert OID = 2013 ( substring PGNSP PGUID 14 f f t f i 2 17 "17 23" "select substring($1, $2, -1)"
-_null_ ));
DESCR("return portion of string");
DATA(insert OID = 2014 ( position PGNSP PGUID 12 f f t f i 2 23 "17 17" byteapos - _null_ ));
DESCR("return position of substring");
--- 2778,2784 ----
DESCR("concatenate");
DATA(insert OID = 2012 ( substring PGNSP PGUID 12 f f t f i 3 17 "17 23 23" bytea_substr - _null_ ));
DESCR("return portion of string");
! DATA(insert OID = 2013 ( substring PGNSP PGUID 12 f f t f i 2 17 "17 23" bytea_substr - _null_ ));
DESCR("return portion of string");
DATA(insert OID = 2014 ( position PGNSP PGUID 12 f f t f i 2 23 "17 17" byteapos - _null_ ));
DESCR("return position of substring");
pgsql-patches by date: