Re: [HACKERS] [GENERAL] workaround for lack of REPLACE() function - Mailing list pgsql-patches

From Joe Conway
Subject Re: [HACKERS] [GENERAL] workaround for lack of REPLACE() function
Date
Msg-id 3D59F57D.1070804@joeconway.com
Whole thread Raw
Responses Re: [HACKERS] [GENERAL] workaround for lack of REPLACE() function
List pgsql-patches
Tatsuo Ishii wrote:
  > Joe Conway wrote:
 >>Any objection if I rework this function to meet SQL92 and fix the bug?
 >
 > I don't object.
 >
 >>Or is the SQL92 part not desirable because it breaks backward
 >>compatability?
 >
 > I don't think so.
 >
 >>In any case, can the #ifdef MULTIBYTE's be removed now in favor of a
 >>test for encoding max length?
 >
 > Sure.

<sorry so long-winded>

Attached is a patch that implements the above items wrt text_substr(). I
also modified textlen(), textoctetlen(), byteaoctetlen(), and
bytea_substr(). Here's a summary of the change to each:

- text_substr(): rewrite function to meet SQL92, fix MB related bug,
    and remove #ifdef MULTIBYTE.

- bytea_substr(): same as text_substr() enc max len == 1.

- textoctetlen(), byteaoctetlen(): use
    toast_raw_datum_size(PG_GETARG_DATUM(0)) - VARHDRSZ)
    to avoid detoasting.

- textlen(): same as textoctetlen() for enc max len == 1, and remove
    #ifdef MULTIBYTE.

I did some benchmarking to ensure no performance degradation, and to
help me understand MB and related performance issues. The results were
very enlightening:

===================================================================
First test - textlen() (already reported, repeated here for completeness):
-------------------------------------------------------------------
create table strtest(f1 text);
do 100 times
     insert into strtest values('12345....');  -- 100000 characters
loop
do 1000 times
     select length(f1) from strtest;
loop
-------------------------------------------------------------------
Results:
SQL_ASCII database, new code        2 seconds
SQL_ASCII database, old code        66 seconds
EUC_JP database, new & old code        469 seconds
===================================================================
Second test - short string test:
-------------------------------------------------------------------
create table parts(partnum text);
<fill with ~220000 rows, 8 to 12 characters each>

do 300 times
     select substr(partnum, 3, 3) from parts;
loop
-------------------------------------------------------------------
Results:
SQL_ASCII database, old code        352 seconds
SQL_ASCII database, new code        350 seconds
EUC_JP database, old code        461 seconds
EUC_JP database, new code        422 seconds
===================================================================
Third test - long string, EXTENDED storage (EXTERNAL+COMPRESSED):
-------------------------------------------------------------------
create table strtest(f1 text);
do 100 times
     insert into strtest values('12345....');  -- 100000 characters
loop
do 1000 times
     select substr(f1, 89000, 10000) from strtest;
loop
-------------------------------------------------------------------
Results:
SQL_ASCII database, old code        59 seconds
SQL_ASCII database, new code        58 seconds
EUC_JP database, old code        915 seconds
EUC_JP database, new code        912 seconds
===================================================================
Forth test - long string, EXTERNAL storage (not COMPRESSED)
-------------------------------------------------------------------
create table strtest(f1 text);
do 100 times
     insert into strtest values('12345....');  -- 100000 characters
loop
do 1000 times
     select substr(f1, 89000, 10000) from strtest;
loop
-------------------------------------------------------------------
Results:
SQL_ASCII database, old code        17 seconds
SQL_ASCII database, new code        17 seconds
EUC_JP database, old code        918 seconds
EUC_JP database, new code        911 seconds


The only remaining problem is that this causes opr_sanity to fail based
on this query:

-- Considering only built-in procs (prolang = 12), look for multiple
-- uses of the same internal function (ie, matching prosrc fields).
-- It's OK to have several entries with different pronames for the same
-- internal function, but conflicts in the number of arguments and other
-- critical items should be complained of.
SELECT p1.oid, p1.proname, p2.oid, p2.proname
FROM pg_proc AS p1, pg_proc AS p2
WHERE p1.oid != p2.oid AND
      p1.prosrc = p2.prosrc AND
      p1.prolang = 12 AND p2.prolang = 12 AND
      (p1.prolang != p2.prolang OR
       p1.proisagg != p2.proisagg OR
       p1.prosecdef != p2.prosecdef OR
       p1.proisstrict != p2.proisstrict OR
       p1.proretset != p2.proretset OR
       p1.provolatile != p2.provolatile OR
       p1.pronargs != p2.pronargs);

This fails because I implemented text_substr() and bytea_substr() to
take either 2 or 3 args. This was necessary for SQL92 spec compliance.

SQL92 requires L < 0 to throw an error, and L IS NULL to return NULL. It
also requires that if L is not provided, the length to the end of the
string is assumed. Current code handles L IS NULL correctly but not L <
0 -- it assumes L < 0 is the same as L is not provided. By allowing the
function to determine if it was passed 2 or 3 args, this can be handled
properly.

So the question is, can/should I change opr_sanity to allow this case?

I also still owe some additions to the strings regression test to make
it cover toasted values.

Other than those two issues, I think the patch is ready to go. I'm
planning to take on the replace function next.

Thanks,

Joe

Index: src/backend/utils/adt/varlena.c
===================================================================
RCS file: /opt/src/cvs/pgsql-server/src/backend/utils/adt/varlena.c,v
retrieving revision 1.87
diff -c -r1.87 varlena.c
*** src/backend/utils/adt/varlena.c    4 Aug 2002 06:44:47 -0000    1.87
--- src/backend/utils/adt/varlena.c    14 Aug 2002 05:42:45 -0000
***************
*** 18,23 ****
--- 18,24 ----

  #include "mb/pg_wchar.h"
  #include "miscadmin.h"
+ #include "access/tuptoaster.h"
  #include "utils/builtins.h"
  #include "utils/pg_locale.h"

***************
*** 285,303 ****
  Datum
  textlen(PG_FUNCTION_ARGS)
  {
!     text       *t = PG_GETARG_TEXT_P(0);

! #ifdef MULTIBYTE
!     /* optimization for single byte encoding */
!     if (pg_database_encoding_max_length() <= 1)
!         PG_RETURN_INT32(VARSIZE(t) - VARHDRSZ);
!
!     PG_RETURN_INT32(
!         pg_mbstrlen_with_len(VARDATA(t), VARSIZE(t) - VARHDRSZ)
!         );
! #else
!     PG_RETURN_INT32(VARSIZE(t) - VARHDRSZ);
! #endif
  }

  /*
--- 286,309 ----
  Datum
  textlen(PG_FUNCTION_ARGS)
  {
!     /* fastpath when max encoding length is one */
!     if (pg_database_encoding_max_length() == 1)
!         PG_RETURN_INT32(toast_raw_datum_size(PG_GETARG_DATUM(0)) - VARHDRSZ);

!     if (pg_database_encoding_max_length() > 1)
!     {
!         text       *t = PG_GETARG_TEXT_P(0);
!
!         PG_RETURN_INT32(pg_mbstrlen_with_len(VARDATA(t),
!                                      VARSIZE(t) - VARHDRSZ));
!     }
!
!     /* should never get here */
!     elog(ERROR, "Invalid backend encoding; encoding max length "
!                 "is less than one.");
!
!     /* notreached: suppress compiler warning */
!     PG_RETURN_NULL();
  }

  /*
***************
*** 308,316 ****
  Datum
  textoctetlen(PG_FUNCTION_ARGS)
  {
!     text    *arg = PG_GETARG_TEXT_P(0);
!
!     PG_RETURN_INT32(VARSIZE(arg) - VARHDRSZ);
  }

  /*
--- 314,320 ----
  Datum
  textoctetlen(PG_FUNCTION_ARGS)
  {
!     PG_RETURN_INT32(toast_raw_datum_size(PG_GETARG_DATUM(0)) - VARHDRSZ);
  }

  /*
***************
*** 358,363 ****
--- 362,375 ----
      PG_RETURN_TEXT_P(result);
  }

+
+ #define PG_TEXTARG_GET_STR(arg_) \
+     DatumGetCString(DirectFunctionCall1(textout, PG_GETARG_DATUM(arg_)))
+ #define PG_TEXT_GET_STR(textp_) \
+     DatumGetCString(DirectFunctionCall1(textout, PointerGetDatum(textp_)))
+ #define PG_STR_GET_TEXT(str_) \
+     DatumGetTextP(DirectFunctionCall1(textin, CStringGetDatum(str_)))
+
  /*
   * text_substr()
   * Return a substring starting at the specified position.
***************
*** 382,471 ****
   * - Thomas Lockhart 1998-12-10
   * Now uses faster TOAST-slicing interface
   * - John Gray 2002-02-22
   */
  Datum
  text_substr(PG_FUNCTION_ARGS)
  {
!     text       *string;
!     int32        m = PG_GETARG_INT32(1);
!     int32        n = PG_GETARG_INT32(2);
!     int32       sm;
!     int32       sn;
!     int         eml = 1;
! #ifdef MULTIBYTE
!     int            i;
!     int            len;
!     text       *ret;
!     char       *p;
! #endif

!     /*
!      * starting position before the start of the string? then offset into
!      * the string per SQL92 spec...
!      */
!     if (m < 1)
      {
!         n += (m - 1);
!         m = 1;
!     }
!     /* Check for m > octet length is made in TOAST access routine */

!     /* m will now become a zero-based starting position */
!     sm = m - 1;
!     sn = n;

! #ifdef MULTIBYTE
!     eml = pg_database_encoding_max_length ();

!     if (eml > 1)
!     {
!         sm = 0;
!         if (n > -1)
!             sn = (m + n) * eml + 3; /* +3 to avoid mb characters overhanging slice end */
          else
!             sn = n;        /* n < 0 is special-cased by heap_tuple_untoast_attr_slice */
!     }
! #endif

!     string = PG_GETARG_TEXT_P_SLICE (0, sm, sn);

!     if (eml == 1)
!     {
!         PG_RETURN_TEXT_P (string);
!     }
! #ifndef MULTIBYTE
!     PG_RETURN_NULL();   /* notreached: suppress compiler warning */
! #endif
! #ifdef MULTIBYTE
!     if (n > -1)
!         len = pg_mbstrlen_with_len (VARDATA (string), sn - 3);
!     else    /* n < 0 is special-cased; need full string length */
!         len = pg_mbstrlen_with_len (VARDATA (string), VARSIZE(string)-VARHDRSZ);
!
!     if (m > len)
!     {
!         m = 1;
!         n = 0;
!     }
!     m--;
!     if (((m + n) > len) || (n < 0))
!         n = (len - m);
!
!     p = VARDATA(string);
!     for (i = 0; i < m; i++)
!         p += pg_mblen(p);
!     m = p - VARDATA(string);
!     for (i = 0; i < n; i++)
!         p += pg_mblen(p);
!     n = p - (VARDATA(string) + m);

!     ret = (text *) palloc(VARHDRSZ + n);
!     VARATT_SIZEP(ret) = VARHDRSZ + n;

!     memcpy(VARDATA(ret), VARDATA(string) + m, n);

!     PG_RETURN_TEXT_P(ret);
! #endif
  }

  /*
--- 394,582 ----
   * - Thomas Lockhart 1998-12-10
   * Now uses faster TOAST-slicing interface
   * - John Gray 2002-02-22
+  * Remove "#ifdef MULTIBYTE" and test for encoding_max_length instead. Change
+  * behaviors conflicting with SQL92 to meet SQL92 (if E = S + L < S throw
+  * error; if E < 1, return '', not entire string). Fixed MB related bug when
+  * S > LC and < LC + 4 sometimes garbage characters are returned.
+  * - Joe Conway 2002-08-10
   */
  Datum
  text_substr(PG_FUNCTION_ARGS)
  {
!     int        S = PG_GETARG_INT32(1);    /* start position */
!     int        eml = pg_database_encoding_max_length();
!     int        S1;                        /* adjusted start position */
!     int        L1;                        /* adjusted substring length */

!     /* life is easy if the encoding max length is 1 */
!     if (eml == 1)
      {
!         S1 = Max(S, 1);

!         if (fcinfo->nargs == 2)
!         {
!             /*
!              * Not passed a length - PG_GETARG_TEXT_P_SLICE()
!              * grabs everything to the end of the string if we pass it
!              * a negative value for length.
!              */
!             L1 = -1;
!         }
!         else
!         {
!             /* end position */
!             int    E = S + PG_GETARG_INT32(2);

!             /*
!              * A negative value for L is the only way for the end position
!              * to be before the start. SQL99 says to throw an error.
!              */
!             if (E < S)
!                 elog(ERROR, "negative substring length not allowed");

!             /*
!              * A zero or negative value for the end position can happen if the start
!              * was negative or one. SQL99 says to return a zero-length string.
!              */
!             if (E < 1)
!                 PG_RETURN_TEXT_P(PG_STR_GET_TEXT(""));
!
!             L1 = E - S1;
!         }
!
!         /*
!          * If the start position is past the end of the string,
!          * SQL99 says to return a zero-length string --
!          * PG_GETARG_TEXT_P_SLICE() will do that for us.
!          * Convert to zero-based starting position
!          */
!         PG_RETURN_TEXT_P (PG_GETARG_TEXT_P_SLICE (0, S1 - 1, L1));
!     }
!     else if (eml > 1)
!     {
!         /*
!          * When encoding max length is > 1, we can't get LC without
!          * detoasting, so we'll grab a conservatively large slice
!          * now and go back later to do the right thing
!          */
!         int        slice_start;
!         int        slice_size;
!         int        slice_strlen;
!         text    *slice;
!         int        E1;
!         int        i;
!         char   *p;
!         char   *s;
!         text   *ret;
!
!         /*
!          * if S is past the end of the string, the tuple toaster
!          * will return a zero-length string to us
!          */
!         S1 = Max(S, 1);
!
!         /*
!          * We need to start at position zero because there is no
!          * way to know in advance which byte offset corresponds to
!          * the supplied start position.
!          */
!         slice_start = 0;
!
!         if (fcinfo->nargs == 2)
!         {
!             /*
!              * If we were not passed a length, the spec says that
!              * E = Max(LC + 1, S). Since we don't know LC yet, set
!              * slice_size = -1 which will cause heap_tuple_untoast_attr_slice
!              * to give use everything to the end of the string.
!              * If S > LC + 1, we'll get back a zero length string anyway.
!              */
!             slice_size = L1 = -1;
!         }
          else
!         {
!             int    E = S + PG_GETARG_INT32(2);

!             /*
!              * A negative value for L is the only way for the end position
!              * to be before the start. SQL99 says to throw an error.
!              */
!             if (E < S)
!                 elog(ERROR, "negative substring length not allowed");

!             /*
!              * A zero or negative value for the end position can happen if the start
!              * was negative or one. SQL99 says to return a zero-length string.
!              */
!             if (E < 1)
!                 PG_RETURN_TEXT_P(PG_STR_GET_TEXT(""));

!             /*
!              * if E is past the end of the string, the tuple toaster
!              * will truncate the length for us
!              */
!             L1 = E - S1;

!             /*
!              * Total slice size in bytes can't be any longer than the start
!              * position plus substring length times the encoding max length.
!              */
!             slice_size = (S1 + L1) * eml;
!         }
!         slice = PG_GETARG_TEXT_P_SLICE (0, slice_start, slice_size);

!         /* see if we got back an empty string */
!         if ((VARSIZE(slice) - VARHDRSZ) == 0)
!             PG_RETURN_TEXT_P(PG_STR_GET_TEXT(""));
!
!         /* Now we can get the actual length of the slice in MB characters */
!         slice_strlen = pg_mbstrlen_with_len (VARDATA(slice), VARSIZE(slice) - VARHDRSZ);
!
!         /* Check that the start position wasn't > slice_strlen. If so,
!          * SQL99 says to return a zero-length string.
!          */
!         if (S1 > slice_strlen)
!             PG_RETURN_TEXT_P(PG_STR_GET_TEXT(""));
!
!         /*
!          * Adjust L1 and E1 now that we know the slice string length.
!          * Again remember that S1 is one based, and slice_start is zero based.
!          */
!         if (L1 > -1)
!             E1 = Min(S1 + L1 , slice_start + 1 + slice_strlen);
!         else
!             E1 = slice_start + 1 + slice_strlen;
!
!         /*
!          * Find the start position in the slice;
!          * remember S1 is not zero based
!          */
!         p = VARDATA(slice);
!         for (i = 0; i < S1 - 1; i++)
!             p += pg_mblen(p);
!
!         /* hang onto a pointer to our start position */
!         s = p;
!
!         /*
!          * Count the actual bytes used by the substring of
!          * the requested length.
!          */
!         for (i = S1; i < E1; i++)
!             p += pg_mblen(p);
!
!         ret = (text *) palloc(VARHDRSZ + (p - s));
!         VARATT_SIZEP(ret) = VARHDRSZ + (p - s);
!         memcpy(VARDATA(ret), s, (p - s));
!
!         PG_RETURN_TEXT_P(ret);
!     }
!     else
!         elog(ERROR, "Invalid backend encoding; encoding max length "
!                     "is less than one.");
!
!     /* notreached: suppress compiler warning */
!     PG_RETURN_NULL();
  }

  /*
***************
*** 758,766 ****
  Datum
  byteaoctetlen(PG_FUNCTION_ARGS)
  {
!     bytea       *v = PG_GETARG_BYTEA_P(0);
!
!     PG_RETURN_INT32(VARSIZE(v) - VARHDRSZ);
  }

  /*
--- 869,875 ----
  Datum
  byteaoctetlen(PG_FUNCTION_ARGS)
  {
!     PG_RETURN_INT32(toast_raw_datum_size(PG_GETARG_DATUM(0)) - VARHDRSZ);
  }

  /*
***************
*** 805,810 ****
--- 914,921 ----
      PG_RETURN_BYTEA_P(result);
  }

+ #define PG_STR_GET_BYTEA(str_) \
+     DatumGetByteaP(DirectFunctionCall1(byteain, CStringGetDatum(str_)))
  /*
   * bytea_substr()
   * Return a substring starting at the specified position.
***************
*** 813,845 ****
   * Input:
   *    - string
   *    - starting position (is one-based)
!  *    - string length
   *
   * If the starting position is zero or less, then return from the start of the string
   * adjusting the length to be consistent with the "negative start" per SQL92.
!  * If the length is less than zero, return the remaining string.
!  *
   */
  Datum
  bytea_substr(PG_FUNCTION_ARGS)
  {
!     int32        m = PG_GETARG_INT32(1);
!     int32        n = PG_GETARG_INT32(2);
!
!     /*
!      * starting position before the start of the string? then offset into
!      * the string per SQL92 spec...
!      */
!     if (m < 1)
!     {
!         n += (m - 1);
!         m = 1;
      }

!     /* m will now become a zero-based starting position */
!     m--;
!
!     PG_RETURN_BYTEA_P(PG_GETARG_BYTEA_P_SLICE (0, m, n));
  }

  /*
--- 924,983 ----
   * Input:
   *    - string
   *    - starting position (is one-based)
!  *    - string length (optional)
   *
   * If the starting position is zero or less, then return from the start of the string
   * adjusting the length to be consistent with the "negative start" per SQL92.
!  * If the length is less than zero, an ERROR is thrown. If no third argument
!  * (length) is provided, the length to the end of the string is assumed.
   */
  Datum
  bytea_substr(PG_FUNCTION_ARGS)
  {
!     int        S = PG_GETARG_INT32(1);    /* start position */
!     int        S1;                        /* adjusted start position */
!     int        L1;                        /* adjusted substring length */
!
!     S1 = Max(S, 1);
!
!     if (fcinfo->nargs == 2)
!     {
!         /*
!          * Not passed a length - PG_GETARG_BYTEA_P_SLICE()
!          * grabs everything to the end of the string if we pass it
!          * a negative value for length.
!          */
!         L1 = -1;
      }
+     else
+     {
+         /* end position */
+         int    E = S + PG_GETARG_INT32(2);

!         /*
!          * A negative value for L is the only way for the end position
!          * to be before the start. SQL99 says to throw an error.
!          */
!         if (E < S)
!             elog(ERROR, "negative substring length not allowed");
!
!         /*
!          * A zero or negative value for the end position can happen if the start
!          * was negative or one. SQL99 says to return a zero-length string.
!          */
!         if (E < 1)
!             PG_RETURN_BYTEA_P(PG_STR_GET_BYTEA(""));
!
!         L1 = E - S1;
!     }
!
!     /*
!      * If the start position is past the end of the string,
!      * SQL99 says to return a zero-length string --
!      * PG_GETARG_TEXT_P_SLICE() will do that for us.
!      * Convert to zero-based starting position
!      */
!     PG_RETURN_BYTEA_P(PG_GETARG_BYTEA_P_SLICE (0, S1 - 1, L1));
  }

  /*
Index: src/include/catalog/pg_proc.h
===================================================================
RCS file: /opt/src/cvs/pgsql-server/src/include/catalog/pg_proc.h,v
retrieving revision 1.253
diff -c -r1.253 pg_proc.h
*** src/include/catalog/pg_proc.h    9 Aug 2002 16:45:15 -0000    1.253
--- src/include/catalog/pg_proc.h    14 Aug 2002 02:34:38 -0000
***************
*** 2121,2127 ****
  DESCR("remove initial characters from string");
  DATA(insert OID =  882 (  rtrim           PGNSP PGUID 14 f f t f i 1 25 "25"  "select rtrim($1, \' \')" - _null_ ));
  DESCR("remove trailing characters from string");
! DATA(insert OID =  883 (  substr       PGNSP PGUID 14 f f t f i 2 25 "25 23"    "select substr($1, $2, -1)" - _null_
));
  DESCR("return portion of string");
  DATA(insert OID =  884 (  btrim           PGNSP PGUID 12 f f t f i 2 25 "25 25"    btrim - _null_ ));
  DESCR("trim both ends of string");
--- 2121,2127 ----
  DESCR("remove initial characters from string");
  DATA(insert OID =  882 (  rtrim           PGNSP PGUID 14 f f t f i 1 25 "25"  "select rtrim($1, \' \')" - _null_ ));
  DESCR("remove trailing characters from string");
! DATA(insert OID =  883 (  substr       PGNSP PGUID 12 f f t f i 2 25 "25 23"    text_substr - _null_ ));
  DESCR("return portion of string");
  DATA(insert OID =  884 (  btrim           PGNSP PGUID 12 f f t f i 2 25 "25 25"    btrim - _null_ ));
  DESCR("trim both ends of string");
***************
*** 2130,2136 ****

  DATA(insert OID =  936 (  substring    PGNSP PGUID 12 f f t f i 3 25 "25 23 23"  text_substr - _null_ ));
  DESCR("return portion of string");
! DATA(insert OID =  937 (  substring    PGNSP PGUID 14 f f t f i 2 25 "25 23"    "select substring($1, $2, -1)" -
_null_)); 
  DESCR("return portion of string");

  /* for multi-byte support */
--- 2130,2136 ----

  DATA(insert OID =  936 (  substring    PGNSP PGUID 12 f f t f i 3 25 "25 23 23"  text_substr - _null_ ));
  DESCR("return portion of string");
! DATA(insert OID =  937 (  substring    PGNSP PGUID 12 f f t f i 2 25 "25 23"    text_substr - _null_ ));
  DESCR("return portion of string");

  /* for multi-byte support */
***************
*** 2778,2784 ****
  DESCR("concatenate");
  DATA(insert OID = 2012 (  substring           PGNSP PGUID 12 f f t f i 3 17 "17 23 23"  bytea_substr - _null_ ));
  DESCR("return portion of string");
! DATA(insert OID = 2013 (  substring           PGNSP PGUID 14 f f t f i 2 17 "17 23"    "select substring($1, $2, -1)"
-_null_ )); 
  DESCR("return portion of string");
  DATA(insert OID = 2014 (  position           PGNSP PGUID 12 f f t f i 2 23 "17 17"    byteapos - _null_ ));
  DESCR("return position of substring");
--- 2778,2784 ----
  DESCR("concatenate");
  DATA(insert OID = 2012 (  substring           PGNSP PGUID 12 f f t f i 3 17 "17 23 23"  bytea_substr - _null_ ));
  DESCR("return portion of string");
! DATA(insert OID = 2013 (  substring           PGNSP PGUID 12 f f t f i 2 17 "17 23"    bytea_substr - _null_ ));
  DESCR("return portion of string");
  DATA(insert OID = 2014 (  position           PGNSP PGUID 12 f f t f i 2 23 "17 17"    byteapos - _null_ ));
  DESCR("return position of substring");

pgsql-patches by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: [HACKERS] CREATE TEMP TABLE .... ON COMMIT
Next
From: Tom Lane
Date:
Subject: Re: [HACKERS] [GENERAL] workaround for lack of REPLACE() function