Re: [BUGS] BUG #9210: PostgreSQL string store bug? not enforce check with correct characterSET/encoding - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [BUGS] BUG #9210: PostgreSQL string store bug? not enforce check with correct characterSET/encoding
Date
Msg-id 4385.1392859333@sss.pgh.pa.us
Whole thread Raw
In response to Re: [BUGS] BUG #9210: PostgreSQL string store bug? not enforce check with correct characterSET/encoding  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: BUG #9210: PostgreSQL string store bug? not enforce check with correct characterSET/encoding
List pgsql-hackers
I wrote:
>> The minimum-refactoring solution to this would be to tweak
>> pg_do_encoding_conversion() so that if the src_encoding is SQL_ASCII but
>> the dest_encoding isn't, it does pg_verify_mbstr() rather than nothing.

>> I'm not sure if this would break anything we need to have work,
>> though.  Thoughts?  Do we want to back-patch such a change?

> I looked through all the callers of pg_do_encoding_conversion(), and
> AFAICS this change is a good idea.  There are a whole bunch of places
> that use pg_do_encoding_conversion() to convert from the database encoding
> to encoding X (most usually UTF8), and right now if you do that in a
> SQL_ASCII database you have no assurance whatever that what is produced
> is actually valid in encoding X.  I think we need to close that loophole.

The more I looked into mbutils.c, the less happy I got.  The attached
proposed patch takes care of the missing-verification hole in
pg_do_encoding_conversion() and pg_server_to_any(), and also gets rid
of what I believe to be obsolete provisions in pg_do_encoding_conversion
to "work" if called outside a transaction --- if you consider it working
to completely fail to honor its API contract.  That should no longer be
necessary now that we perform client<->server encoding conversions via
perform_default_encoding_conversion rather than here.

For testing I inserted an "Assert(IsTransactionState());" at the top of
pg_do_encoding_conversion(), and couldn't trigger it, so I'm fairly sure
this change is OK.  Still, back-patching it might not be a good thing.
On the other hand, if there is still any code path that can get here
outside a transaction, we probably ought to find out rather than allow
completely bogus data to be returned.

I also made some more-cosmetic improvements, notably removing a bunch
of Asserts that are certainly dead code because the relevant variables
are never NULL.

I've not done anything yet about simplifying unnecessary calls of
pg_do_encoding_conversion into pg_server_to_any/pg_any_to_server.

How much of this is back-patch material, do you think?

            regards, tom lane

diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index 08440e9..7f43cae 100644
*** a/src/backend/utils/mb/mbutils.c
--- b/src/backend/utils/mb/mbutils.c
***************
*** 1,10 ****
! /*
!  * This file contains public functions for conversion between
!  * client encoding and server (database) encoding.
   *
!  * Tatsuo Ishii
   *
!  * src/backend/utils/mb/mbutils.c
   */
  #include "postgres.h"

--- 1,36 ----
! /*-------------------------------------------------------------------------
   *
!  * mbutils.c
!  *      This file contains functions for encoding conversion.
   *
!  * The string-conversion functions in this file share some API quirks.
!  * Note the following:
!  *
!  * The functions return a palloc'd, null-terminated string if conversion
!  * is required.  However, if no conversion is performed, the given source
!  * string pointer is returned as-is.
!  *
!  * Although the presence of a length argument means that callers can pass
!  * non-null-terminated strings, care is required because the same string
!  * will be passed back if no conversion occurs.  Such callers *must* check
!  * whether result == src and handle that case differently.
!  *
!  * If the source and destination encodings are the same, the source string
!  * is returned without any verification; it's assumed to be valid data.
!  * If that might not be the case, the caller is responsible for validating
!  * the string using a separate call to pg_verify_mbstr().  Whenever the
!  * source and destination encodings are different, the functions ensure that
!  * the result is validly encoded according to the destination encoding.
!  *
!  *
!  * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
!  * Portions Copyright (c) 1994, Regents of the University of California
!  *
!  *
!  * IDENTIFICATION
!  *      src/backend/utils/mb/mbutils.c
!  *
!  *-------------------------------------------------------------------------
   */
  #include "postgres.h"

*************** InitializeClientEncoding(void)
*** 290,296 ****
  int
  pg_get_client_encoding(void)
  {
-     Assert(ClientEncoding);
      return ClientEncoding->encoding;
  }

--- 316,321 ----
*************** pg_get_client_encoding(void)
*** 300,328 ****
  const char *
  pg_get_client_encoding_name(void)
  {
-     Assert(ClientEncoding);
      return ClientEncoding->name;
  }

  /*
!  * Apply encoding conversion on src and return it. The encoding
!  * conversion function is chosen from the pg_conversion system catalog
!  * marked as "default". If it is not found in the schema search path,
!  * it's taken from pg_catalog schema. If it even is not in the schema,
!  * warn and return src.
!  *
!  * If conversion occurs, a palloc'd null-terminated string is returned.
!  * In the case of no conversion, src is returned.
!  *
!  * CAUTION: although the presence of a length argument means that callers
!  * can pass non-null-terminated strings, care is required because the same
!  * string will be passed back if no conversion occurs.    Such callers *must*
!  * check whether result == src and handle that case differently.
   *
!  * Note: we try to avoid raising error, since that could get us into
!  * infinite recursion when this function is invoked during error message
!  * sending.  It should be OK to raise error for overlength strings though,
!  * since the recursion will come with a shorter message.
   */
  unsigned char *
  pg_do_encoding_conversion(unsigned char *src, int len,
--- 325,337 ----
  const char *
  pg_get_client_encoding_name(void)
  {
      return ClientEncoding->name;
  }

  /*
!  * Convert src string to another encoding (general case).
   *
!  * See the notes about string conversion functions at the top of this file.
   */
  unsigned char *
  pg_do_encoding_conversion(unsigned char *src, int len,
*************** pg_do_encoding_conversion(unsigned char
*** 331,369 ****
      unsigned char *result;
      Oid            proc;

!     if (!IsTransactionState())
!         return src;

      if (src_encoding == dest_encoding)
!         return src;

!     if (src_encoding == PG_SQL_ASCII || dest_encoding == PG_SQL_ASCII)
!         return src;

!     if (len <= 0)
          return src;

      proc = FindDefaultConversionProc(src_encoding, dest_encoding);
      if (!OidIsValid(proc))
!     {
!         ereport(LOG,
                  (errcode(ERRCODE_UNDEFINED_FUNCTION),
                   errmsg("default conversion function for encoding \"%s\" to \"%s\" does not exist",
                          pg_encoding_to_char(src_encoding),
                          pg_encoding_to_char(dest_encoding))));
-         return src;
-     }
-
-     /*
-      * XXX we should avoid throwing errors in OidFunctionCall. Otherwise we
-      * are going into infinite loop!  So we have to make sure that the
-      * function exists before calling OidFunctionCall.
-      */
-     if (!SearchSysCacheExists1(PROCOID, ObjectIdGetDatum(proc)))
-     {
-         elog(LOG, "cache lookup failed for function %u", proc);
-         return src;
-     }

      /*
       * Allocate space for conversion result, being wary of integer overflow
--- 340,371 ----
      unsigned char *result;
      Oid            proc;

!     if (len <= 0)
!         return src;                /* empty string is always valid */

      if (src_encoding == dest_encoding)
!         return src;                /* no conversion required, assume valid */

!     if (dest_encoding == PG_SQL_ASCII)
!         return src;                /* any string is valid in SQL_ASCII */

!     if (src_encoding == PG_SQL_ASCII)
!     {
!         /* No conversion is possible, but we must validate the result */
!         (void) pg_verify_mbstr(dest_encoding, (const char *) src, len, false);
          return src;
+     }
+
+     if (!IsTransactionState())    /* shouldn't happen */
+         elog(ERROR, "cannot perform encoding conversion outside a transaction");

      proc = FindDefaultConversionProc(src_encoding, dest_encoding);
      if (!OidIsValid(proc))
!         ereport(ERROR,
                  (errcode(ERRCODE_UNDEFINED_FUNCTION),
                   errmsg("default conversion function for encoding \"%s\" to \"%s\" does not exist",
                          pg_encoding_to_char(src_encoding),
                          pg_encoding_to_char(dest_encoding))));

      /*
       * Allocate space for conversion result, being wary of integer overflow
*************** pg_do_encoding_conversion(unsigned char
*** 387,393 ****
  }

  /*
!  * Convert string using encoding_name. The source
   * encoding is the DB encoding.
   *
   * BYTEA convert_to(TEXT string, NAME encoding_name) */
--- 389,395 ----
  }

  /*
!  * Convert string to encoding encoding_name. The source
   * encoding is the DB encoding.
   *
   * BYTEA convert_to(TEXT string, NAME encoding_name) */
*************** pg_convert_to(PG_FUNCTION_ARGS)
*** 412,418 ****
  }

  /*
!  * Convert string using encoding_name. The destination
   * encoding is the DB encoding.
   *
   * TEXT convert_from(BYTEA string, NAME encoding_name) */
--- 414,420 ----
  }

  /*
!  * Convert string from encoding encoding_name. The destination
   * encoding is the DB encoding.
   *
   * TEXT convert_from(BYTEA string, NAME encoding_name) */
*************** pg_convert_from(PG_FUNCTION_ARGS)
*** 439,445 ****
  }

  /*
!  * Convert string using encoding_names.
   *
   * BYTEA convert(BYTEA string, NAME src_encoding_name, NAME dest_encoding_name)
   */
--- 441,447 ----
  }

  /*
!  * Convert string between two arbitrary encodings.
   *
   * BYTEA convert(BYTEA string, NAME src_encoding_name, NAME dest_encoding_name)
   */
*************** pg_convert(PG_FUNCTION_ARGS)
*** 472,479 ****
      src_str = VARDATA_ANY(string);
      pg_verify_mbstr_len(src_encoding, src_str, len, false);

!     dest_str = (char *) pg_do_encoding_conversion(
!                 (unsigned char *) src_str, len, src_encoding, dest_encoding);
      if (dest_str != src_str)
          len = strlen(dest_str);

--- 474,486 ----
      src_str = VARDATA_ANY(string);
      pg_verify_mbstr_len(src_encoding, src_str, len, false);

!     /* perform conversion */
!     dest_str = (char *) pg_do_encoding_conversion((unsigned char *) src_str,
!                                                   len,
!                                                   src_encoding,
!                                                   dest_encoding);
!
!     /* update len if conversion actually happened */
      if (dest_str != src_str)
          len = strlen(dest_str);

*************** pg_convert(PG_FUNCTION_ARGS)
*** 503,512 ****
  Datum
  length_in_encoding(PG_FUNCTION_ARGS)
  {
!     bytea       *string = PG_GETARG_BYTEA_P(0);
      char       *src_encoding_name = NameStr(*PG_GETARG_NAME(1));
      int            src_encoding = pg_char_to_encoding(src_encoding_name);
!     int            len = VARSIZE(string) - VARHDRSZ;
      int            retval;

      if (src_encoding < 0)
--- 510,520 ----
  Datum
  length_in_encoding(PG_FUNCTION_ARGS)
  {
!     bytea       *string = PG_GETARG_BYTEA_PP(0);
      char       *src_encoding_name = NameStr(*PG_GETARG_NAME(1));
      int            src_encoding = pg_char_to_encoding(src_encoding_name);
!     const char *src_str;
!     int            len;
      int            retval;

      if (src_encoding < 0)
*************** length_in_encoding(PG_FUNCTION_ARGS)
*** 515,525 ****
                   errmsg("invalid encoding name \"%s\"",
                          src_encoding_name)));

!     retval = pg_verify_mbstr_len(src_encoding, VARDATA(string), len, false);
!     PG_RETURN_INT32(retval);

  }

  Datum
  pg_encoding_max_length_sql(PG_FUNCTION_ARGS)
  {
--- 523,541 ----
                   errmsg("invalid encoding name \"%s\"",
                          src_encoding_name)));

!     len = VARSIZE_ANY_EXHDR(string);
!     src_str = VARDATA_ANY(string);

+     retval = pg_verify_mbstr_len(src_encoding, src_str, len, false);
+
+     PG_RETURN_INT32(retval);
  }

+ /*
+  * Get maximum multibyte character length in the specified encoding.
+  *
+  * Note encoding is specified numerically, not by name as above.
+  */
  Datum
  pg_encoding_max_length_sql(PG_FUNCTION_ARGS)
  {
*************** pg_encoding_max_length_sql(PG_FUNCTION_A
*** 532,558 ****
  }

  /*
!  * convert client encoding to server encoding.
   */
  char *
  pg_client_to_server(const char *s, int len)
  {
-     Assert(ClientEncoding);
-
      return pg_any_to_server(s, len, ClientEncoding->encoding);
  }

  /*
!  * convert any encoding to server encoding.
   */
  char *
  pg_any_to_server(const char *s, int len, int encoding)
  {
-     Assert(DatabaseEncoding);
-     Assert(ClientEncoding);
-
      if (len <= 0)
!         return (char *) s;

      if (encoding == DatabaseEncoding->encoding ||
          encoding == PG_SQL_ASCII)
--- 548,578 ----
  }

  /*
!  * Convert client encoding to server encoding.
!  *
!  * See the notes about string conversion functions at the top of this file.
   */
  char *
  pg_client_to_server(const char *s, int len)
  {
      return pg_any_to_server(s, len, ClientEncoding->encoding);
  }

  /*
!  * Convert any encoding to server encoding.
!  *
!  * See the notes about string conversion functions at the top of this file.
!  *
!  * Unlike the other string conversion functions, this will apply validation
!  * even if encoding == DatabaseEncoding->encoding.    This is because this is
!  * used to process data coming in from outside the database, and we never
!  * want to just assume validity.
   */
  char *
  pg_any_to_server(const char *s, int len, int encoding)
  {
      if (len <= 0)
!         return (char *) s;        /* empty string is always valid */

      if (encoding == DatabaseEncoding->encoding ||
          encoding == PG_SQL_ASCII)
*************** pg_any_to_server(const char *s, int len,
*** 594,639 ****
          return (char *) s;
      }

!     if (ClientEncoding->encoding == encoding)
          return perform_default_encoding_conversion(s, len, true);
!     else
!         return (char *) pg_do_encoding_conversion(
!              (unsigned char *) s, len, encoding, DatabaseEncoding->encoding);
  }

  /*
!  * convert server encoding to client encoding.
   */
  char *
  pg_server_to_client(const char *s, int len)
  {
-     Assert(ClientEncoding);
-
      return pg_server_to_any(s, len, ClientEncoding->encoding);
  }

  /*
!  * convert server encoding to any encoding.
   */
  char *
  pg_server_to_any(const char *s, int len, int encoding)
  {
-     Assert(DatabaseEncoding);
-     Assert(ClientEncoding);
-
      if (len <= 0)
!         return (char *) s;

      if (encoding == DatabaseEncoding->encoding ||
!         encoding == PG_SQL_ASCII ||
!         DatabaseEncoding->encoding == PG_SQL_ASCII)
          return (char *) s;        /* assume data is valid */

!     if (ClientEncoding->encoding == encoding)
          return perform_default_encoding_conversion(s, len, false);
!     else
!         return (char *) pg_do_encoding_conversion(
!              (unsigned char *) s, len, DatabaseEncoding->encoding, encoding);
  }

  /*
--- 614,672 ----
          return (char *) s;
      }

!     /* Fast path if we can use cached conversion function */
!     if (encoding == ClientEncoding->encoding)
          return perform_default_encoding_conversion(s, len, true);
!
!     /* General case ... will not work outside transactions */
!     return (char *) pg_do_encoding_conversion((unsigned char *) s,
!                                               len,
!                                               encoding,
!                                               DatabaseEncoding->encoding);
  }

  /*
!  * Convert server encoding to client encoding.
!  *
!  * See the notes about string conversion functions at the top of this file.
   */
  char *
  pg_server_to_client(const char *s, int len)
  {
      return pg_server_to_any(s, len, ClientEncoding->encoding);
  }

  /*
!  * Convert server encoding to any encoding.
!  *
!  * See the notes about string conversion functions at the top of this file.
   */
  char *
  pg_server_to_any(const char *s, int len, int encoding)
  {
      if (len <= 0)
!         return (char *) s;        /* empty string is always valid */

      if (encoding == DatabaseEncoding->encoding ||
!         encoding == PG_SQL_ASCII)
          return (char *) s;        /* assume data is valid */

!     if (DatabaseEncoding->encoding == PG_SQL_ASCII)
!     {
!         /* No conversion is possible, but we must validate the result */
!         (void) pg_verify_mbstr(encoding, s, len, false);
!         return (char *) s;
!     }
!
!     /* Fast path if we can use cached conversion function */
!     if (encoding == ClientEncoding->encoding)
          return perform_default_encoding_conversion(s, len, false);
!
!     /* General case ... will not work outside transactions */
!     return (char *) pg_do_encoding_conversion((unsigned char *) s,
!                                               len,
!                                               DatabaseEncoding->encoding,
!                                               encoding);
  }

  /*
*************** pg_server_to_any(const char *s, int len,
*** 643,649 ****
   *    SetClientEncoding(), no conversion is performed.
   */
  static char *
! perform_default_encoding_conversion(const char *src, int len, bool is_client_to_server)
  {
      char       *result;
      int            src_encoding,
--- 676,683 ----
   *    SetClientEncoding(), no conversion is performed.
   */
  static char *
! perform_default_encoding_conversion(const char *src, int len,
!                                     bool is_client_to_server)
  {
      char       *result;
      int            src_encoding,
*************** raw_pg_bind_textdomain_codeset(const cha
*** 931,941 ****
   * On most platforms, gettext defaults to the codeset implied by LC_CTYPE.
   * When that matches the database encoding, we don't need to do anything.  In
   * CREATE DATABASE, we enforce or trust that the locale's codeset matches the
!  * database encoding, except for the C locale.  (On Windows, we also permit a
   * discrepancy under the UTF8 encoding.)  For the C locale, explicitly bind
   * gettext to the right codeset.
   *
!  * On Windows, gettext defaults to the Windows ANSI code page.  This is a
   * convenient departure for software that passes the strings to Windows ANSI
   * APIs, but we don't do that.  Compel gettext to use database encoding or,
   * failing that, the LC_CTYPE encoding as it would on other platforms.
--- 965,975 ----
   * On most platforms, gettext defaults to the codeset implied by LC_CTYPE.
   * When that matches the database encoding, we don't need to do anything.  In
   * CREATE DATABASE, we enforce or trust that the locale's codeset matches the
!  * database encoding, except for the C locale.    (On Windows, we also permit a
   * discrepancy under the UTF8 encoding.)  For the C locale, explicitly bind
   * gettext to the right codeset.
   *
!  * On Windows, gettext defaults to the Windows ANSI code page.    This is a
   * convenient departure for software that passes the strings to Windows ANSI
   * APIs, but we don't do that.  Compel gettext to use database encoding or,
   * failing that, the LC_CTYPE encoding as it would on other platforms.
*************** pg_bind_textdomain_codeset(const char *d
*** 980,1007 ****
  int
  GetDatabaseEncoding(void)
  {
-     Assert(DatabaseEncoding);
      return DatabaseEncoding->encoding;
  }

  const char *
  GetDatabaseEncodingName(void)
  {
-     Assert(DatabaseEncoding);
      return DatabaseEncoding->name;
  }

  Datum
  getdatabaseencoding(PG_FUNCTION_ARGS)
  {
-     Assert(DatabaseEncoding);
      return DirectFunctionCall1(namein, CStringGetDatum(DatabaseEncoding->name));
  }

  Datum
  pg_client_encoding(PG_FUNCTION_ARGS)
  {
-     Assert(ClientEncoding);
      return DirectFunctionCall1(namein, CStringGetDatum(ClientEncoding->name));
  }

--- 1014,1037 ----
*************** pg_client_encoding(PG_FUNCTION_ARGS)
*** 1014,1020 ****
  int
  GetMessageEncoding(void)
  {
-     Assert(MessageEncoding);
      return MessageEncoding->encoding;
  }

--- 1044,1049 ----

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: should we add a XLogRecPtr/LSN SQL type?
Next
From: Michael Paquier
Date:
Subject: Re: should we add a XLogRecPtr/LSN SQL type?