Home > mailing lists
Re: Decision by Monday: PQescapeString() vs. encoding violation - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: Decision by Monday: PQescapeString() vs. encoding violation
Date	February 15, 2025 22:12:06
Msg-id	164408.1739646726@sss.pgh.pa.us Whole thread Raw
In response to	Re: Decision by Monday: PQescapeString() vs. encoding violation (Noah Misch <noah@leadboat.com>)
Responses	Re: Decision by Monday: PQescapeString() vs. encoding violation
List	pgsql-hackers
Tree view
Noah Misch <noah@leadboat.com> writes:
> On 2025-02-15 13:08:03 -0500, Tom Lane wrote:
>> The other thing that was discussed in the security thread was
>> modifying PQescapeStringInternal and PQescapeInternal to produce
>> no more than one complaint about invalid multibyte characters,
>> on the grounds that input that's just plain in some other encoding
>> would otherwise produce a ton of repetitive messages.  That seems
>> trivial enough to mechanize with a bool already_complained flag,
>> so I think we should incorporate that refinement while we're here.

> +1

On closer inspection, PQescapeInternal already issues only one
error message, since it does "return NULL" after detecting the
first error.  So this makes PQescapeStringInternal behave more
like that.

            regards, tom lane

From 565b42cecf645b1dbde277512fd67836ee9081d1 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat, 15 Feb 2025 14:10:33 -0500
Subject: [PATCH v5] Make escaping functions retain trailing bytes of an
 invalid character.

Instead of dropping the trailing byte(s) of an invalid or incomplete
multibyte character, replace only the first byte with a known-invalid
sequence, and process the rest normally.  This seems less likely to
confuse incautious callers than the behavior adopted in 5dc1e42b4.

Author: Andres Freund <andres@anarazel.de>
Reported-by: Jeff Davis <pgsql@j-davis.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/20250215012712.45@rfd.leadboat.com
Backpatch-through: 13
---
 src/fe_utils/string_utils.c    | 91 +++++++++++++---------------------
 src/interfaces/libpq/fe-exec.c | 69 ++++++++++++--------------
 2 files changed, 65 insertions(+), 95 deletions(-)

diff --git a/src/fe_utils/string_utils.c b/src/fe_utils/string_utils.c
index b6a7b19708..bdf35db614 100644
--- a/src/fe_utils/string_utils.c
+++ b/src/fe_utils/string_utils.c
@@ -180,40 +180,25 @@ fmtIdEnc(const char *rawid, int encoding)
             /* Slow path for possible multibyte characters */
             charlen = pg_encoding_mblen(encoding, cp);

-            if (remaining < charlen)
-            {
-                /*
-                 * If the character is longer than the available input,
-                 * replace the string with an invalid sequence. The invalid
-                 * sequence ensures that the escaped string will trigger an
-                 * error on the server-side, even if we can't directly report
-                 * an error here.
-                 */
-                enlargePQExpBuffer(id_return, 2);
-                pg_encoding_set_invalid(encoding,
-                                        id_return->data + id_return->len);
-                id_return->len += 2;
-                id_return->data[id_return->len] = '\0';
-
-                /* there's no more input data, so we can stop */
-                break;
-            }
-            else if (pg_encoding_verifymbchar(encoding, cp, charlen) == -1)
+            if (remaining < charlen ||
+                pg_encoding_verifymbchar(encoding, cp, charlen) == -1)
             {
                 /*
                  * Multibyte character is invalid.  It's important to verify
-                 * that as invalid multi-byte characters could e.g. be used to
+                 * that as invalid multibyte characters could e.g. be used to
                  * "skip" over quote characters, e.g. when parsing
                  * character-by-character.
                  *
-                 * Replace the bytes corresponding to the invalid character
-                 * with an invalid sequence, for the same reason as above.
+                 * Replace the character's first byte with an invalid
+                 * sequence. The invalid sequence ensures that the escaped
+                 * string will trigger an error on the server-side, even if we
+                 * can't directly report an error here.
                  *
                  * It would be a bit faster to verify the whole string the
                  * first time we encounter a set highbit, but this way we can
-                 * replace just the invalid characters, which probably makes
-                 * it easier for users to find the invalidly encoded portion
-                 * of a larger string.
+                 * replace just the invalid data, which probably makes it
+                 * easier for users to find the invalidly encoded portion of a
+                 * larger string.
                  */
                 enlargePQExpBuffer(id_return, 2);
                 pg_encoding_set_invalid(encoding,
@@ -222,11 +207,13 @@ fmtIdEnc(const char *rawid, int encoding)
                 id_return->data[id_return->len] = '\0';

                 /*
-                 * Copy the rest of the string after the invalid multi-byte
-                 * character.
+                 * Handle the following bytes as if this byte didn't exist.
+                 * That's safer in case the subsequent bytes contain
+                 * characters that are significant for the caller (e.g. '>' in
+                 * html).
                  */
-                remaining -= charlen;
-                cp += charlen;
+                remaining--;
+                cp++;
             }
             else
             {
@@ -395,49 +382,39 @@ appendStringLiteral(PQExpBuffer buf, const char *str,
         /* Slow path for possible multibyte characters */
         charlen = PQmblen(source, encoding);

-        if (remaining < charlen)
-        {
-            /*
-             * If the character is longer than the available input, replace
-             * the string with an invalid sequence. The invalid sequence
-             * ensures that the escaped string will trigger an error on the
-             * server-side, even if we can't directly report an error here.
-             *
-             * We know there's enough space for the invalid sequence because
-             * the "target" buffer is 2 * length + 2 long, and at worst we're
-             * replacing a single input byte with two invalid bytes.
-             */
-            pg_encoding_set_invalid(encoding, target);
-            target += 2;
-
-            /* there's no more valid input data, so we can stop */
-            break;
-        }
-        else if (pg_encoding_verifymbchar(encoding, source, charlen) == -1)
+        if (remaining < charlen ||
+            pg_encoding_verifymbchar(encoding, source, charlen) == -1)
         {
             /*
              * Multibyte character is invalid.  It's important to verify that
-             * as invalid multi-byte characters could e.g. be used to "skip"
+             * as invalid multibyte characters could e.g. be used to "skip"
              * over quote characters, e.g. when parsing
              * character-by-character.
              *
-             * Replace the bytes corresponding to the invalid character with
-             * an invalid sequence, for the same reason as above.
+             * Replace the character's first byte with an invalid sequence.
+             * The invalid sequence ensures that the escaped string will
+             * trigger an error on the server-side, even if we can't directly
+             * report an error here.
+             *
+             * We know there's enough space for the invalid sequence because
+             * the "target" buffer is 2 * length + 2 long, and at worst we're
+             * replacing a single input byte with two invalid bytes.
              *
              * It would be a bit faster to verify the whole string the first
              * time we encounter a set highbit, but this way we can replace
-             * just the invalid characters, which probably makes it easier for
-             * users to find the invalidly encoded portion of a larger string.
+             * just the invalid data, which probably makes it easier for users
+             * to find the invalidly encoded portion of a larger string.
              */
             pg_encoding_set_invalid(encoding, target);
             target += 2;
-            remaining -= charlen;

             /*
-             * Copy the rest of the string after the invalid multi-byte
-             * character.
+             * Handle the following bytes as if this byte didn't exist. That's
+             * safer in case the subsequent bytes contain important characters
+             * for the caller (e.g. '>' in html).
              */
-            source += charlen;
+            source++;
+            remaining--;
         }
         else
         {
diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index 120d4d032e..294843ed8b 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -4076,6 +4076,7 @@ PQescapeStringInternal(PGconn *conn,
     const char *source = from;
     char       *target = to;
     size_t        remaining = strnlen(from, length);
+    bool        already_complained = false;

     if (error)
         *error = 0;
@@ -4102,65 +4103,57 @@ PQescapeStringInternal(PGconn *conn,
         /* Slow path for possible multibyte characters */
         charlen = pg_encoding_mblen(encoding, source);

-        if (remaining < charlen)
+        if (remaining < charlen ||
+            pg_encoding_verifymbchar(encoding, source, charlen) == -1)
         {
             /*
-             * If the character is longer than the available input, report an
-             * error if possible, and replace the string with an invalid
-             * sequence. The invalid sequence ensures that the escaped string
-             * will trigger an error on the server-side, even if we can't
-             * directly report an error here.
+             * Multibyte character is invalid.  It's important to verify that
+             * as invalid multibyte characters could e.g. be used to "skip"
+             * over quote characters, e.g. when parsing
+             * character-by-character.
+             *
+             * Report an error if possible, and replace the character's first
+             * byte with an invalid sequence. The invalid sequence ensures
+             * that the escaped string will trigger an error on the
+             * server-side, even if we can't directly report an error here.
              *
              * This isn't *that* crucial when we can report an error to the
-             * caller, but if we can't, the caller will use this string
-             * unmodified and it needs to be safe for parsing.
+             * caller; but if we can't or the caller ignores it, the caller
+             * will use this string unmodified and it needs to be safe for
+             * parsing.
              *
              * We know there's enough space for the invalid sequence because
              * the "to" buffer needs to be at least 2 * length + 1 long, and
              * at worst we're replacing a single input byte with two invalid
              * bytes.
-             */
-            if (error)
-                *error = 1;
-            if (conn)
-                libpq_append_conn_error(conn, "incomplete multibyte character");
-
-            pg_encoding_set_invalid(encoding, target);
-            target += 2;
-
-            /* there's no more input data, so we can stop */
-            break;
-        }
-        else if (pg_encoding_verifymbchar(encoding, source, charlen) == -1)
-        {
-            /*
-             * Multibyte character is invalid.  It's important to verify that
-             * as invalid multi-byte characters could e.g. be used to "skip"
-             * over quote characters, e.g. when parsing
-             * character-by-character.
-             *
-             * Replace the bytes corresponding to the invalid character with
-             * an invalid sequence, for the same reason as above.
              *
              * It would be a bit faster to verify the whole string the first
              * time we encounter a set highbit, but this way we can replace
-             * just the invalid characters, which probably makes it easier for
-             * users to find the invalidly encoded portion of a larger string.
+             * just the invalid data, which probably makes it easier for users
+             * to find the invalidly encoded portion of a larger string.
              */
             if (error)
                 *error = 1;
-            if (conn)
-                libpq_append_conn_error(conn, "invalid multibyte character");
+            if (conn && !already_complained)
+            {
+                if (remaining < charlen)
+                    libpq_append_conn_error(conn, "incomplete multibyte character");
+                else
+                    libpq_append_conn_error(conn, "invalid multibyte character");
+                /* Issue a complaint only once per string */
+                already_complained = true;
+            }

             pg_encoding_set_invalid(encoding, target);
             target += 2;
-            remaining -= charlen;

             /*
-             * Copy the rest of the string after the invalid multi-byte
-             * character.
+             * Handle the following bytes as if this byte didn't exist. That's
+             * safer in case the subsequent bytes contain important characters
+             * for the caller (e.g. '>' in html).
              */
-            source += charlen;
+            source++;
+            remaining--;
         }
         else
         {
--
2.43.5
pgsql-hackers by date:
From: Noah Misch
Date: 15 February 2025, 21:38:52
Subject: Re: Decision by Monday: PQescapeString() vs. encoding violation
From: Andres Freund
Date: 15 February 2025, 23:09:48
Subject: Re: Decision by Monday: PQescapeString() vs. encoding violation
Re: Decision by Monday: PQescapeString() vs. encoding violation - Mailing list pgsql-hackers

Previous

Next