Re: [HACKERS] [GENERAL] workaround for lack of REPLACE() function - Mailing list pgsql-patches
From | Joe Conway |
---|---|
Subject | Re: [HACKERS] [GENERAL] workaround for lack of REPLACE() function |
Date | |
Msg-id | 3D59F57D.1070804@joeconway.com Whole thread Raw |
Responses |
Re: [HACKERS] [GENERAL] workaround for lack of REPLACE() function
|
List | pgsql-patches |
Tatsuo Ishii wrote: > Joe Conway wrote: >>Any objection if I rework this function to meet SQL92 and fix the bug? > > I don't object. > >>Or is the SQL92 part not desirable because it breaks backward >>compatability? > > I don't think so. > >>In any case, can the #ifdef MULTIBYTE's be removed now in favor of a >>test for encoding max length? > > Sure. <sorry so long-winded> Attached is a patch that implements the above items wrt text_substr(). I also modified textlen(), textoctetlen(), byteaoctetlen(), and bytea_substr(). Here's a summary of the change to each: - text_substr(): rewrite function to meet SQL92, fix MB related bug, and remove #ifdef MULTIBYTE. - bytea_substr(): same as text_substr() enc max len == 1. - textoctetlen(), byteaoctetlen(): use toast_raw_datum_size(PG_GETARG_DATUM(0)) - VARHDRSZ) to avoid detoasting. - textlen(): same as textoctetlen() for enc max len == 1, and remove #ifdef MULTIBYTE. I did some benchmarking to ensure no performance degradation, and to help me understand MB and related performance issues. The results were very enlightening: =================================================================== First test - textlen() (already reported, repeated here for completeness): ------------------------------------------------------------------- create table strtest(f1 text); do 100 times insert into strtest values('12345....'); -- 100000 characters loop do 1000 times select length(f1) from strtest; loop ------------------------------------------------------------------- Results: SQL_ASCII database, new code 2 seconds SQL_ASCII database, old code 66 seconds EUC_JP database, new & old code 469 seconds =================================================================== Second test - short string test: ------------------------------------------------------------------- create table parts(partnum text); <fill with ~220000 rows, 8 to 12 characters each> do 300 times select substr(partnum, 3, 3) from parts; loop ------------------------------------------------------------------- Results: SQL_ASCII database, old code 352 seconds SQL_ASCII database, new code 350 seconds EUC_JP database, old code 461 seconds EUC_JP database, new code 422 seconds =================================================================== Third test - long string, EXTENDED storage (EXTERNAL+COMPRESSED): ------------------------------------------------------------------- create table strtest(f1 text); do 100 times insert into strtest values('12345....'); -- 100000 characters loop do 1000 times select substr(f1, 89000, 10000) from strtest; loop ------------------------------------------------------------------- Results: SQL_ASCII database, old code 59 seconds SQL_ASCII database, new code 58 seconds EUC_JP database, old code 915 seconds EUC_JP database, new code 912 seconds =================================================================== Forth test - long string, EXTERNAL storage (not COMPRESSED) ------------------------------------------------------------------- create table strtest(f1 text); do 100 times insert into strtest values('12345....'); -- 100000 characters loop do 1000 times select substr(f1, 89000, 10000) from strtest; loop ------------------------------------------------------------------- Results: SQL_ASCII database, old code 17 seconds SQL_ASCII database, new code 17 seconds EUC_JP database, old code 918 seconds EUC_JP database, new code 911 seconds The only remaining problem is that this causes opr_sanity to fail based on this query: -- Considering only built-in procs (prolang = 12), look for multiple -- uses of the same internal function (ie, matching prosrc fields). -- It's OK to have several entries with different pronames for the same -- internal function, but conflicts in the number of arguments and other -- critical items should be complained of. SELECT p1.oid, p1.proname, p2.oid, p2.proname FROM pg_proc AS p1, pg_proc AS p2 WHERE p1.oid != p2.oid AND p1.prosrc = p2.prosrc AND p1.prolang = 12 AND p2.prolang = 12 AND (p1.prolang != p2.prolang OR p1.proisagg != p2.proisagg OR p1.prosecdef != p2.prosecdef OR p1.proisstrict != p2.proisstrict OR p1.proretset != p2.proretset OR p1.provolatile != p2.provolatile OR p1.pronargs != p2.pronargs); This fails because I implemented text_substr() and bytea_substr() to take either 2 or 3 args. This was necessary for SQL92 spec compliance. SQL92 requires L < 0 to throw an error, and L IS NULL to return NULL. It also requires that if L is not provided, the length to the end of the string is assumed. Current code handles L IS NULL correctly but not L < 0 -- it assumes L < 0 is the same as L is not provided. By allowing the function to determine if it was passed 2 or 3 args, this can be handled properly. So the question is, can/should I change opr_sanity to allow this case? I also still owe some additions to the strings regression test to make it cover toasted values. Other than those two issues, I think the patch is ready to go. I'm planning to take on the replace function next. Thanks, Joe Index: src/backend/utils/adt/varlena.c =================================================================== RCS file: /opt/src/cvs/pgsql-server/src/backend/utils/adt/varlena.c,v retrieving revision 1.87 diff -c -r1.87 varlena.c *** src/backend/utils/adt/varlena.c 4 Aug 2002 06:44:47 -0000 1.87 --- src/backend/utils/adt/varlena.c 14 Aug 2002 05:42:45 -0000 *************** *** 18,23 **** --- 18,24 ---- #include "mb/pg_wchar.h" #include "miscadmin.h" + #include "access/tuptoaster.h" #include "utils/builtins.h" #include "utils/pg_locale.h" *************** *** 285,303 **** Datum textlen(PG_FUNCTION_ARGS) { ! text *t = PG_GETARG_TEXT_P(0); ! #ifdef MULTIBYTE ! /* optimization for single byte encoding */ ! if (pg_database_encoding_max_length() <= 1) ! PG_RETURN_INT32(VARSIZE(t) - VARHDRSZ); ! ! PG_RETURN_INT32( ! pg_mbstrlen_with_len(VARDATA(t), VARSIZE(t) - VARHDRSZ) ! ); ! #else ! PG_RETURN_INT32(VARSIZE(t) - VARHDRSZ); ! #endif } /* --- 286,309 ---- Datum textlen(PG_FUNCTION_ARGS) { ! /* fastpath when max encoding length is one */ ! if (pg_database_encoding_max_length() == 1) ! PG_RETURN_INT32(toast_raw_datum_size(PG_GETARG_DATUM(0)) - VARHDRSZ); ! if (pg_database_encoding_max_length() > 1) ! { ! text *t = PG_GETARG_TEXT_P(0); ! ! PG_RETURN_INT32(pg_mbstrlen_with_len(VARDATA(t), ! VARSIZE(t) - VARHDRSZ)); ! } ! ! /* should never get here */ ! elog(ERROR, "Invalid backend encoding; encoding max length " ! "is less than one."); ! ! /* notreached: suppress compiler warning */ ! PG_RETURN_NULL(); } /* *************** *** 308,316 **** Datum textoctetlen(PG_FUNCTION_ARGS) { ! text *arg = PG_GETARG_TEXT_P(0); ! ! PG_RETURN_INT32(VARSIZE(arg) - VARHDRSZ); } /* --- 314,320 ---- Datum textoctetlen(PG_FUNCTION_ARGS) { ! PG_RETURN_INT32(toast_raw_datum_size(PG_GETARG_DATUM(0)) - VARHDRSZ); } /* *************** *** 358,363 **** --- 362,375 ---- PG_RETURN_TEXT_P(result); } + + #define PG_TEXTARG_GET_STR(arg_) \ + DatumGetCString(DirectFunctionCall1(textout, PG_GETARG_DATUM(arg_))) + #define PG_TEXT_GET_STR(textp_) \ + DatumGetCString(DirectFunctionCall1(textout, PointerGetDatum(textp_))) + #define PG_STR_GET_TEXT(str_) \ + DatumGetTextP(DirectFunctionCall1(textin, CStringGetDatum(str_))) + /* * text_substr() * Return a substring starting at the specified position. *************** *** 382,471 **** * - Thomas Lockhart 1998-12-10 * Now uses faster TOAST-slicing interface * - John Gray 2002-02-22 */ Datum text_substr(PG_FUNCTION_ARGS) { ! text *string; ! int32 m = PG_GETARG_INT32(1); ! int32 n = PG_GETARG_INT32(2); ! int32 sm; ! int32 sn; ! int eml = 1; ! #ifdef MULTIBYTE ! int i; ! int len; ! text *ret; ! char *p; ! #endif ! /* ! * starting position before the start of the string? then offset into ! * the string per SQL92 spec... ! */ ! if (m < 1) { ! n += (m - 1); ! m = 1; ! } ! /* Check for m > octet length is made in TOAST access routine */ ! /* m will now become a zero-based starting position */ ! sm = m - 1; ! sn = n; ! #ifdef MULTIBYTE ! eml = pg_database_encoding_max_length (); ! if (eml > 1) ! { ! sm = 0; ! if (n > -1) ! sn = (m + n) * eml + 3; /* +3 to avoid mb characters overhanging slice end */ else ! sn = n; /* n < 0 is special-cased by heap_tuple_untoast_attr_slice */ ! } ! #endif ! string = PG_GETARG_TEXT_P_SLICE (0, sm, sn); ! if (eml == 1) ! { ! PG_RETURN_TEXT_P (string); ! } ! #ifndef MULTIBYTE ! PG_RETURN_NULL(); /* notreached: suppress compiler warning */ ! #endif ! #ifdef MULTIBYTE ! if (n > -1) ! len = pg_mbstrlen_with_len (VARDATA (string), sn - 3); ! else /* n < 0 is special-cased; need full string length */ ! len = pg_mbstrlen_with_len (VARDATA (string), VARSIZE(string)-VARHDRSZ); ! ! if (m > len) ! { ! m = 1; ! n = 0; ! } ! m--; ! if (((m + n) > len) || (n < 0)) ! n = (len - m); ! ! p = VARDATA(string); ! for (i = 0; i < m; i++) ! p += pg_mblen(p); ! m = p - VARDATA(string); ! for (i = 0; i < n; i++) ! p += pg_mblen(p); ! n = p - (VARDATA(string) + m); ! ret = (text *) palloc(VARHDRSZ + n); ! VARATT_SIZEP(ret) = VARHDRSZ + n; ! memcpy(VARDATA(ret), VARDATA(string) + m, n); ! PG_RETURN_TEXT_P(ret); ! #endif } /* --- 394,582 ---- * - Thomas Lockhart 1998-12-10 * Now uses faster TOAST-slicing interface * - John Gray 2002-02-22 + * Remove "#ifdef MULTIBYTE" and test for encoding_max_length instead. Change + * behaviors conflicting with SQL92 to meet SQL92 (if E = S + L < S throw + * error; if E < 1, return '', not entire string). Fixed MB related bug when + * S > LC and < LC + 4 sometimes garbage characters are returned. + * - Joe Conway 2002-08-10 */ Datum text_substr(PG_FUNCTION_ARGS) { ! int S = PG_GETARG_INT32(1); /* start position */ ! int eml = pg_database_encoding_max_length(); ! int S1; /* adjusted start position */ ! int L1; /* adjusted substring length */ ! /* life is easy if the encoding max length is 1 */ ! if (eml == 1) { ! S1 = Max(S, 1); ! if (fcinfo->nargs == 2) ! { ! /* ! * Not passed a length - PG_GETARG_TEXT_P_SLICE() ! * grabs everything to the end of the string if we pass it ! * a negative value for length. ! */ ! L1 = -1; ! } ! else ! { ! /* end position */ ! int E = S + PG_GETARG_INT32(2); ! /* ! * A negative value for L is the only way for the end position ! * to be before the start. SQL99 says to throw an error. ! */ ! if (E < S) ! elog(ERROR, "negative substring length not allowed"); ! /* ! * A zero or negative value for the end position can happen if the start ! * was negative or one. SQL99 says to return a zero-length string. ! */ ! if (E < 1) ! PG_RETURN_TEXT_P(PG_STR_GET_TEXT("")); ! ! L1 = E - S1; ! } ! ! /* ! * If the start position is past the end of the string, ! * SQL99 says to return a zero-length string -- ! * PG_GETARG_TEXT_P_SLICE() will do that for us. ! * Convert to zero-based starting position ! */ ! PG_RETURN_TEXT_P (PG_GETARG_TEXT_P_SLICE (0, S1 - 1, L1)); ! } ! else if (eml > 1) ! { ! /* ! * When encoding max length is > 1, we can't get LC without ! * detoasting, so we'll grab a conservatively large slice ! * now and go back later to do the right thing ! */ ! int slice_start; ! int slice_size; ! int slice_strlen; ! text *slice; ! int E1; ! int i; ! char *p; ! char *s; ! text *ret; ! ! /* ! * if S is past the end of the string, the tuple toaster ! * will return a zero-length string to us ! */ ! S1 = Max(S, 1); ! ! /* ! * We need to start at position zero because there is no ! * way to know in advance which byte offset corresponds to ! * the supplied start position. ! */ ! slice_start = 0; ! ! if (fcinfo->nargs == 2) ! { ! /* ! * If we were not passed a length, the spec says that ! * E = Max(LC + 1, S). Since we don't know LC yet, set ! * slice_size = -1 which will cause heap_tuple_untoast_attr_slice ! * to give use everything to the end of the string. ! * If S > LC + 1, we'll get back a zero length string anyway. ! */ ! slice_size = L1 = -1; ! } else ! { ! int E = S + PG_GETARG_INT32(2); ! /* ! * A negative value for L is the only way for the end position ! * to be before the start. SQL99 says to throw an error. ! */ ! if (E < S) ! elog(ERROR, "negative substring length not allowed"); ! /* ! * A zero or negative value for the end position can happen if the start ! * was negative or one. SQL99 says to return a zero-length string. ! */ ! if (E < 1) ! PG_RETURN_TEXT_P(PG_STR_GET_TEXT("")); ! /* ! * if E is past the end of the string, the tuple toaster ! * will truncate the length for us ! */ ! L1 = E - S1; ! /* ! * Total slice size in bytes can't be any longer than the start ! * position plus substring length times the encoding max length. ! */ ! slice_size = (S1 + L1) * eml; ! } ! slice = PG_GETARG_TEXT_P_SLICE (0, slice_start, slice_size); ! /* see if we got back an empty string */ ! if ((VARSIZE(slice) - VARHDRSZ) == 0) ! PG_RETURN_TEXT_P(PG_STR_GET_TEXT("")); ! ! /* Now we can get the actual length of the slice in MB characters */ ! slice_strlen = pg_mbstrlen_with_len (VARDATA(slice), VARSIZE(slice) - VARHDRSZ); ! ! /* Check that the start position wasn't > slice_strlen. If so, ! * SQL99 says to return a zero-length string. ! */ ! if (S1 > slice_strlen) ! PG_RETURN_TEXT_P(PG_STR_GET_TEXT("")); ! ! /* ! * Adjust L1 and E1 now that we know the slice string length. ! * Again remember that S1 is one based, and slice_start is zero based. ! */ ! if (L1 > -1) ! E1 = Min(S1 + L1 , slice_start + 1 + slice_strlen); ! else ! E1 = slice_start + 1 + slice_strlen; ! ! /* ! * Find the start position in the slice; ! * remember S1 is not zero based ! */ ! p = VARDATA(slice); ! for (i = 0; i < S1 - 1; i++) ! p += pg_mblen(p); ! ! /* hang onto a pointer to our start position */ ! s = p; ! ! /* ! * Count the actual bytes used by the substring of ! * the requested length. ! */ ! for (i = S1; i < E1; i++) ! p += pg_mblen(p); ! ! ret = (text *) palloc(VARHDRSZ + (p - s)); ! VARATT_SIZEP(ret) = VARHDRSZ + (p - s); ! memcpy(VARDATA(ret), s, (p - s)); ! ! PG_RETURN_TEXT_P(ret); ! } ! else ! elog(ERROR, "Invalid backend encoding; encoding max length " ! "is less than one."); ! ! /* notreached: suppress compiler warning */ ! PG_RETURN_NULL(); } /* *************** *** 758,766 **** Datum byteaoctetlen(PG_FUNCTION_ARGS) { ! bytea *v = PG_GETARG_BYTEA_P(0); ! ! PG_RETURN_INT32(VARSIZE(v) - VARHDRSZ); } /* --- 869,875 ---- Datum byteaoctetlen(PG_FUNCTION_ARGS) { ! PG_RETURN_INT32(toast_raw_datum_size(PG_GETARG_DATUM(0)) - VARHDRSZ); } /* *************** *** 805,810 **** --- 914,921 ---- PG_RETURN_BYTEA_P(result); } + #define PG_STR_GET_BYTEA(str_) \ + DatumGetByteaP(DirectFunctionCall1(byteain, CStringGetDatum(str_))) /* * bytea_substr() * Return a substring starting at the specified position. *************** *** 813,845 **** * Input: * - string * - starting position (is one-based) ! * - string length * * If the starting position is zero or less, then return from the start of the string * adjusting the length to be consistent with the "negative start" per SQL92. ! * If the length is less than zero, return the remaining string. ! * */ Datum bytea_substr(PG_FUNCTION_ARGS) { ! int32 m = PG_GETARG_INT32(1); ! int32 n = PG_GETARG_INT32(2); ! ! /* ! * starting position before the start of the string? then offset into ! * the string per SQL92 spec... ! */ ! if (m < 1) ! { ! n += (m - 1); ! m = 1; } ! /* m will now become a zero-based starting position */ ! m--; ! ! PG_RETURN_BYTEA_P(PG_GETARG_BYTEA_P_SLICE (0, m, n)); } /* --- 924,983 ---- * Input: * - string * - starting position (is one-based) ! * - string length (optional) * * If the starting position is zero or less, then return from the start of the string * adjusting the length to be consistent with the "negative start" per SQL92. ! * If the length is less than zero, an ERROR is thrown. If no third argument ! * (length) is provided, the length to the end of the string is assumed. */ Datum bytea_substr(PG_FUNCTION_ARGS) { ! int S = PG_GETARG_INT32(1); /* start position */ ! int S1; /* adjusted start position */ ! int L1; /* adjusted substring length */ ! ! S1 = Max(S, 1); ! ! if (fcinfo->nargs == 2) ! { ! /* ! * Not passed a length - PG_GETARG_BYTEA_P_SLICE() ! * grabs everything to the end of the string if we pass it ! * a negative value for length. ! */ ! L1 = -1; } + else + { + /* end position */ + int E = S + PG_GETARG_INT32(2); ! /* ! * A negative value for L is the only way for the end position ! * to be before the start. SQL99 says to throw an error. ! */ ! if (E < S) ! elog(ERROR, "negative substring length not allowed"); ! ! /* ! * A zero or negative value for the end position can happen if the start ! * was negative or one. SQL99 says to return a zero-length string. ! */ ! if (E < 1) ! PG_RETURN_BYTEA_P(PG_STR_GET_BYTEA("")); ! ! L1 = E - S1; ! } ! ! /* ! * If the start position is past the end of the string, ! * SQL99 says to return a zero-length string -- ! * PG_GETARG_TEXT_P_SLICE() will do that for us. ! * Convert to zero-based starting position ! */ ! PG_RETURN_BYTEA_P(PG_GETARG_BYTEA_P_SLICE (0, S1 - 1, L1)); } /* Index: src/include/catalog/pg_proc.h =================================================================== RCS file: /opt/src/cvs/pgsql-server/src/include/catalog/pg_proc.h,v retrieving revision 1.253 diff -c -r1.253 pg_proc.h *** src/include/catalog/pg_proc.h 9 Aug 2002 16:45:15 -0000 1.253 --- src/include/catalog/pg_proc.h 14 Aug 2002 02:34:38 -0000 *************** *** 2121,2127 **** DESCR("remove initial characters from string"); DATA(insert OID = 882 ( rtrim PGNSP PGUID 14 f f t f i 1 25 "25" "select rtrim($1, \' \')" - _null_ )); DESCR("remove trailing characters from string"); ! DATA(insert OID = 883 ( substr PGNSP PGUID 14 f f t f i 2 25 "25 23" "select substr($1, $2, -1)" - _null_ )); DESCR("return portion of string"); DATA(insert OID = 884 ( btrim PGNSP PGUID 12 f f t f i 2 25 "25 25" btrim - _null_ )); DESCR("trim both ends of string"); --- 2121,2127 ---- DESCR("remove initial characters from string"); DATA(insert OID = 882 ( rtrim PGNSP PGUID 14 f f t f i 1 25 "25" "select rtrim($1, \' \')" - _null_ )); DESCR("remove trailing characters from string"); ! DATA(insert OID = 883 ( substr PGNSP PGUID 12 f f t f i 2 25 "25 23" text_substr - _null_ )); DESCR("return portion of string"); DATA(insert OID = 884 ( btrim PGNSP PGUID 12 f f t f i 2 25 "25 25" btrim - _null_ )); DESCR("trim both ends of string"); *************** *** 2130,2136 **** DATA(insert OID = 936 ( substring PGNSP PGUID 12 f f t f i 3 25 "25 23 23" text_substr - _null_ )); DESCR("return portion of string"); ! DATA(insert OID = 937 ( substring PGNSP PGUID 14 f f t f i 2 25 "25 23" "select substring($1, $2, -1)" - _null_)); DESCR("return portion of string"); /* for multi-byte support */ --- 2130,2136 ---- DATA(insert OID = 936 ( substring PGNSP PGUID 12 f f t f i 3 25 "25 23 23" text_substr - _null_ )); DESCR("return portion of string"); ! DATA(insert OID = 937 ( substring PGNSP PGUID 12 f f t f i 2 25 "25 23" text_substr - _null_ )); DESCR("return portion of string"); /* for multi-byte support */ *************** *** 2778,2784 **** DESCR("concatenate"); DATA(insert OID = 2012 ( substring PGNSP PGUID 12 f f t f i 3 17 "17 23 23" bytea_substr - _null_ )); DESCR("return portion of string"); ! DATA(insert OID = 2013 ( substring PGNSP PGUID 14 f f t f i 2 17 "17 23" "select substring($1, $2, -1)" -_null_ )); DESCR("return portion of string"); DATA(insert OID = 2014 ( position PGNSP PGUID 12 f f t f i 2 23 "17 17" byteapos - _null_ )); DESCR("return position of substring"); --- 2778,2784 ---- DESCR("concatenate"); DATA(insert OID = 2012 ( substring PGNSP PGUID 12 f f t f i 3 17 "17 23 23" bytea_substr - _null_ )); DESCR("return portion of string"); ! DATA(insert OID = 2013 ( substring PGNSP PGUID 12 f f t f i 2 17 "17 23" bytea_substr - _null_ )); DESCR("return portion of string"); DATA(insert OID = 2014 ( position PGNSP PGUID 12 f f t f i 2 23 "17 17" byteapos - _null_ )); DESCR("return position of substring");
pgsql-patches by date: