Thread: Re: Decision by Monday: PQescapeString() vs. encoding violation
On 2025-02-14 Fr 8:27 PM, Noah Misch wrote: > The security team has a question below about how best to proceed with a recent > behavior change. > > Commit 5dc1e42b4fa6a4434afa7d7cdcf0291351a7b873 for this week's CVE-2025-1094 > changed how PQescapeString()[1] reacts to input that is not valid in the > client encoding. Before that commit, the function would ignore encoding > problems except at the end of the string. Now, it replaces the bad sequence > up to the length implied by the first byte. For example, if UTF8 input has > 0xc2 followed by an ASCII byte, the function removes both bytes. > > Jeff Davis reported to the security team that > http://www.unicode.org/reports/tr36/#Ill-Formed_Subsequences forbids something > like this, saying a UTF-8 converter "must not consume the [second byte] if it > continues". (That's my summary, not his. He might reply to add color here.) > While PQescapeString() is not a UTF-8 converter, that standard still felt to > multiple security team members like a decent argument for removing only the > invalid 0xc2, not the byte following it. UTF8 is the most important encoding, > and other encodings tolerate this. Either way, the server will report an > encoding error. The difference doesn't have functional consequences if one > simply puts the function result in a query. The difference could matter for > debugging or if applications are postprocessing the PQescapeString() result in > some way. Postprocessing is not supported, but we'd still like to do the best > thing for applications that may already be doing it. > > Security team members disagreed on whether next week's releases are the last > reasonable chance to change this, or whether changing it in e.g. May would be > reasonable. If applications make changes to cope with the new behavior, that > could be an argument against further change. > > Question for all: would you switch to the "remove fewer bytes" behavior in > next week's releases, switch later, or switch never? Why so? Please answer > in the next 24hr if possible, given the little time until we wrap next week's > releases on Monday. I regret the late notice. > > I'm attaching a WIP patch from Andres Freund. We may use it to adopt the > "remove fewer bytes" behavior, if that's the decision. > > Thanks, > nm > > > [1] The commit changed other functions, but PQescapeString() is most > interesting for this discussion. Others have ways to report errors, or they > have reason to believe the input is already checked. New code should be using > the others and checking the error indicator. Removing fewer bytes seems like the right thing to do, now or later. The WIP patch itself is trivial. If it weren't I'd be opposed to doing it at the last minute. Even so I'm a bit nervous - last minute changes have bitten us before. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
I studied the v3 patch a little and realized that it only fixes the behavior for the case of a complete-but-invalid multibyte character. If we have an incomplete character at the end of the string, the whole thing still gets dropped. Surely that's not what we want if we are going to adopt this behavior. Here's a v4 that fixes that. As a bonus, we can get rid of duplicative coding since the "incomplete" and "invalid" cases are now treated identically. One minor point is that the error messages from PQescapeStringInternal no longer distinguish "incomplete" from "invalid". I don't find that to be a terribly useful distinction, so that's fine with me. But if someone feels that's important to preserve, we could make it do something like if (conn) { if (remaining < charlen) libpq_append_conn_error(conn, "incomplete multibyte character"); else libpq_append_conn_error(conn, "invalid multibyte character"); } regards, tom lane From 935ccea6da82d0150f65672825c8f5e6f86bca89 Mon Sep 17 00:00:00 2001 From: Tom Lane <tgl@sss.pgh.pa.us> Date: Sat, 15 Feb 2025 12:58:22 -0500 Subject: [PATCH v4] Make escaping functions retain trailing bytes of an invalid character. Instead of dropping the trailing byte(s) of an invalid or incomplete multibyte character, replace only the first byte with a known-invalid sequence, and process the rest normally. This seems less likely to confuse incautious callers than the behavior adopted in 5dc1e42b4. Author: Andres Freund <andres@anarazel.de> Reviewed-by: Jeff Davis <pgsql@j-davis.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/20250215012712.45@rfd.leadboat.com Backpatch-through: 13 --- src/fe_utils/string_utils.c | 91 +++++++++++++--------------------- src/interfaces/libpq/fe-exec.c | 57 ++++++++------------- 2 files changed, 55 insertions(+), 93 deletions(-) diff --git a/src/fe_utils/string_utils.c b/src/fe_utils/string_utils.c index b6a7b19708..bdf35db614 100644 --- a/src/fe_utils/string_utils.c +++ b/src/fe_utils/string_utils.c @@ -180,40 +180,25 @@ fmtIdEnc(const char *rawid, int encoding) /* Slow path for possible multibyte characters */ charlen = pg_encoding_mblen(encoding, cp); - if (remaining < charlen) - { - /* - * If the character is longer than the available input, - * replace the string with an invalid sequence. The invalid - * sequence ensures that the escaped string will trigger an - * error on the server-side, even if we can't directly report - * an error here. - */ - enlargePQExpBuffer(id_return, 2); - pg_encoding_set_invalid(encoding, - id_return->data + id_return->len); - id_return->len += 2; - id_return->data[id_return->len] = '\0'; - - /* there's no more input data, so we can stop */ - break; - } - else if (pg_encoding_verifymbchar(encoding, cp, charlen) == -1) + if (remaining < charlen || + pg_encoding_verifymbchar(encoding, cp, charlen) == -1) { /* * Multibyte character is invalid. It's important to verify - * that as invalid multi-byte characters could e.g. be used to + * that as invalid multibyte characters could e.g. be used to * "skip" over quote characters, e.g. when parsing * character-by-character. * - * Replace the bytes corresponding to the invalid character - * with an invalid sequence, for the same reason as above. + * Replace the character's first byte with an invalid + * sequence. The invalid sequence ensures that the escaped + * string will trigger an error on the server-side, even if we + * can't directly report an error here. * * It would be a bit faster to verify the whole string the * first time we encounter a set highbit, but this way we can - * replace just the invalid characters, which probably makes - * it easier for users to find the invalidly encoded portion - * of a larger string. + * replace just the invalid data, which probably makes it + * easier for users to find the invalidly encoded portion of a + * larger string. */ enlargePQExpBuffer(id_return, 2); pg_encoding_set_invalid(encoding, @@ -222,11 +207,13 @@ fmtIdEnc(const char *rawid, int encoding) id_return->data[id_return->len] = '\0'; /* - * Copy the rest of the string after the invalid multi-byte - * character. + * Handle the following bytes as if this byte didn't exist. + * That's safer in case the subsequent bytes contain + * characters that are significant for the caller (e.g. '>' in + * html). */ - remaining -= charlen; - cp += charlen; + remaining--; + cp++; } else { @@ -395,49 +382,39 @@ appendStringLiteral(PQExpBuffer buf, const char *str, /* Slow path for possible multibyte characters */ charlen = PQmblen(source, encoding); - if (remaining < charlen) - { - /* - * If the character is longer than the available input, replace - * the string with an invalid sequence. The invalid sequence - * ensures that the escaped string will trigger an error on the - * server-side, even if we can't directly report an error here. - * - * We know there's enough space for the invalid sequence because - * the "target" buffer is 2 * length + 2 long, and at worst we're - * replacing a single input byte with two invalid bytes. - */ - pg_encoding_set_invalid(encoding, target); - target += 2; - - /* there's no more valid input data, so we can stop */ - break; - } - else if (pg_encoding_verifymbchar(encoding, source, charlen) == -1) + if (remaining < charlen || + pg_encoding_verifymbchar(encoding, source, charlen) == -1) { /* * Multibyte character is invalid. It's important to verify that - * as invalid multi-byte characters could e.g. be used to "skip" + * as invalid multibyte characters could e.g. be used to "skip" * over quote characters, e.g. when parsing * character-by-character. * - * Replace the bytes corresponding to the invalid character with - * an invalid sequence, for the same reason as above. + * Replace the character's first byte with an invalid sequence. + * The invalid sequence ensures that the escaped string will + * trigger an error on the server-side, even if we can't directly + * report an error here. + * + * We know there's enough space for the invalid sequence because + * the "target" buffer is 2 * length + 2 long, and at worst we're + * replacing a single input byte with two invalid bytes. * * It would be a bit faster to verify the whole string the first * time we encounter a set highbit, but this way we can replace - * just the invalid characters, which probably makes it easier for - * users to find the invalidly encoded portion of a larger string. + * just the invalid data, which probably makes it easier for users + * to find the invalidly encoded portion of a larger string. */ pg_encoding_set_invalid(encoding, target); target += 2; - remaining -= charlen; /* - * Copy the rest of the string after the invalid multi-byte - * character. + * Handle the following bytes as if this byte didn't exist. That's + * safer in case the subsequent bytes contain important characters + * for the caller (e.g. '>' in html). */ - source += charlen; + source++; + remaining--; } else { diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c index 120d4d032e..69554cd603 100644 --- a/src/interfaces/libpq/fe-exec.c +++ b/src/interfaces/libpq/fe-exec.c @@ -4102,50 +4102,34 @@ PQescapeStringInternal(PGconn *conn, /* Slow path for possible multibyte characters */ charlen = pg_encoding_mblen(encoding, source); - if (remaining < charlen) + if (remaining < charlen || + pg_encoding_verifymbchar(encoding, source, charlen) == -1) { /* - * If the character is longer than the available input, report an - * error if possible, and replace the string with an invalid - * sequence. The invalid sequence ensures that the escaped string - * will trigger an error on the server-side, even if we can't - * directly report an error here. + * Multibyte character is invalid. It's important to verify that + * as invalid multibyte characters could e.g. be used to "skip" + * over quote characters, e.g. when parsing + * character-by-character. + * + * Report an error if possible, and replace the character's first + * byte with an invalid sequence. The invalid sequence ensures + * that the escaped string will trigger an error on the + * server-side, even if we can't directly report an error here. * * This isn't *that* crucial when we can report an error to the - * caller, but if we can't, the caller will use this string - * unmodified and it needs to be safe for parsing. + * caller; but if we can't or the caller ignores it, the caller + * will use this string unmodified and it needs to be safe for + * parsing. * * We know there's enough space for the invalid sequence because * the "to" buffer needs to be at least 2 * length + 1 long, and * at worst we're replacing a single input byte with two invalid * bytes. - */ - if (error) - *error = 1; - if (conn) - libpq_append_conn_error(conn, "incomplete multibyte character"); - - pg_encoding_set_invalid(encoding, target); - target += 2; - - /* there's no more input data, so we can stop */ - break; - } - else if (pg_encoding_verifymbchar(encoding, source, charlen) == -1) - { - /* - * Multibyte character is invalid. It's important to verify that - * as invalid multi-byte characters could e.g. be used to "skip" - * over quote characters, e.g. when parsing - * character-by-character. - * - * Replace the bytes corresponding to the invalid character with - * an invalid sequence, for the same reason as above. * * It would be a bit faster to verify the whole string the first * time we encounter a set highbit, but this way we can replace - * just the invalid characters, which probably makes it easier for - * users to find the invalidly encoded portion of a larger string. + * just the invalid data, which probably makes it easier for users + * to find the invalidly encoded portion of a larger string. */ if (error) *error = 1; @@ -4154,13 +4138,14 @@ PQescapeStringInternal(PGconn *conn, pg_encoding_set_invalid(encoding, target); target += 2; - remaining -= charlen; /* - * Copy the rest of the string after the invalid multi-byte - * character. + * Handle the following bytes as if this byte didn't exist. That's + * safer in case the subsequent bytes contain important characters + * for the caller (e.g. '>' in html). */ - source += charlen; + source++; + remaining--; } else { -- 2.43.5
Hi, On 2025-02-15 13:08:03 -0500, Tom Lane wrote: > I studied the v3 patch a little and realized that it only fixes the > behavior for the case of a complete-but-invalid multibyte character. > If we have an incomplete character at the end of the string, the > whole thing still gets dropped. Surely that's not what we want if > we are going to adopt this behavior. Good catch! The different behaviour made sense when we were dropping the entire multi-byte character, but not anymore. > Here's a v4 that fixes that. As a bonus, we can get rid of > duplicative coding since the "incomplete" and "invalid" cases > are now treated identically. > One minor point is that the error messages from PQescapeStringInternal > no longer distinguish "incomplete" from "invalid". I don't find that > to be a terribly useful distinction, so that's fine with me. But if > someone feels that's important to preserve, we could make it do > something like I think there's some value in the distinction, because incomplete characters can happen a lot more easily due to truncating longer strings. I don't know if it's worth the code complexity though. > From 935ccea6da82d0150f65672825c8f5e6f86bca89 Mon Sep 17 00:00:00 2001 > From: Tom Lane <tgl@sss.pgh.pa.us> > Date: Sat, 15 Feb 2025 12:58:22 -0500 > Subject: [PATCH v4] Make escaping functions retain trailing bytes of an > invalid character. > > Instead of dropping the trailing byte(s) of an invalid or incomplete > multibyte character, replace only the first byte with a known-invalid > sequence, and process the rest normally. This seems less likely to > confuse incautious callers than the behavior adopted in 5dc1e42b4. > > Author: Andres Freund <andres@anarazel.de> > Reviewed-by: Jeff Davis <pgsql@j-davis.com> This should have been a Reported-by, not a Reviewed-by, my mistake... It seems that nobody is arguing against the "just skip one byte" behaviour, so I'm inclined to push this fairly soon, even if Noah's "24 hours" haven't quite elapsed. A few more cycles in the buildfarm wouldn't hurt. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > It seems that nobody is arguing against the "just skip one byte" behaviour, so > I'm inclined to push this fairly soon, even if Noah's "24 hours" haven't quite > elapsed. A few more cycles in the buildfarm wouldn't hurt. Agreed. I thought there would be more discussion, but it seems nobody really objects to changing this. The other thing that was discussed in the security thread was modifying PQescapeStringInternal and PQescapeInternal to produce no more than one complaint about invalid multibyte characters, on the grounds that input that's just plain in some other encoding would otherwise produce a ton of repetitive messages. That seems trivial enough to mechanize with a bool already_complained flag, so I think we should incorporate that refinement while we're here. I can write that patch if you're busy. regards, tom lane
On Sat, Feb 15, 2025 at 01:23:51PM -0500, Andres Freund wrote: > On 2025-02-15 13:08:03 -0500, Tom Lane wrote: > > I studied the v3 patch a little and realized that it only fixes the > > behavior for the case of a complete-but-invalid multibyte character. > > If we have an incomplete character at the end of the string, the > > whole thing still gets dropped. Surely that's not what we want if > > we are going to adopt this behavior. Thanks for doing that. > Good catch! The different behaviour made sense when we were dropping the > entire multi-byte character, but not anymore. > > > > Here's a v4 that fixes that. As a bonus, we can get rid of > > duplicative coding since the "incomplete" and "invalid" cases > > are now treated identically. > > > One minor point is that the error messages from PQescapeStringInternal > > no longer distinguish "incomplete" from "invalid". I don't find that > > to be a terribly useful distinction, so that's fine with me. But if > > someone feels that's important to preserve, we could make it do > > something like > > I think there's some value in the distinction, because incomplete characters > can happen a lot more easily due to truncating longer strings. I don't know if > it's worth the code complexity though. While I agree it doesn't feel like a terribly-important distinction, I'd definitely keep that distinction today rather than drop it with so little notice. (I'd feel fine about dropping it in a major release. I'd probably not drop it in any minor release, but especially not this close to wrap.) On Sat, Feb 15, 2025 at 01:35:10PM -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > It seems that nobody is arguing against the "just skip one byte" behaviour, so > > I'm inclined to push this fairly soon, even if Noah's "24 hours" haven't quite > > elapsed. A few more cycles in the buildfarm wouldn't hurt. Works for me; we can always revert in the event of further discussion. For what it's worth, I didn't intend this 24hr as a hard period for gating any particular action. > Agreed. I thought there would be more discussion, but it seems > nobody really objects to changing this. > > The other thing that was discussed in the security thread was > modifying PQescapeStringInternal and PQescapeInternal to produce > no more than one complaint about invalid multibyte characters, > on the grounds that input that's just plain in some other encoding > would otherwise produce a ton of repetitive messages. That seems > trivial enough to mechanize with a bool already_complained flag, > so I think we should incorporate that refinement while we're here. +1
Noah Misch <noah@leadboat.com> writes: > On 2025-02-15 13:08:03 -0500, Tom Lane wrote: >> The other thing that was discussed in the security thread was >> modifying PQescapeStringInternal and PQescapeInternal to produce >> no more than one complaint about invalid multibyte characters, >> on the grounds that input that's just plain in some other encoding >> would otherwise produce a ton of repetitive messages. That seems >> trivial enough to mechanize with a bool already_complained flag, >> so I think we should incorporate that refinement while we're here. > +1 On closer inspection, PQescapeInternal already issues only one error message, since it does "return NULL" after detecting the first error. So this makes PQescapeStringInternal behave more like that. regards, tom lane From 565b42cecf645b1dbde277512fd67836ee9081d1 Mon Sep 17 00:00:00 2001 From: Tom Lane <tgl@sss.pgh.pa.us> Date: Sat, 15 Feb 2025 14:10:33 -0500 Subject: [PATCH v5] Make escaping functions retain trailing bytes of an invalid character. Instead of dropping the trailing byte(s) of an invalid or incomplete multibyte character, replace only the first byte with a known-invalid sequence, and process the rest normally. This seems less likely to confuse incautious callers than the behavior adopted in 5dc1e42b4. Author: Andres Freund <andres@anarazel.de> Reported-by: Jeff Davis <pgsql@j-davis.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/20250215012712.45@rfd.leadboat.com Backpatch-through: 13 --- src/fe_utils/string_utils.c | 91 +++++++++++++--------------------- src/interfaces/libpq/fe-exec.c | 69 ++++++++++++-------------- 2 files changed, 65 insertions(+), 95 deletions(-) diff --git a/src/fe_utils/string_utils.c b/src/fe_utils/string_utils.c index b6a7b19708..bdf35db614 100644 --- a/src/fe_utils/string_utils.c +++ b/src/fe_utils/string_utils.c @@ -180,40 +180,25 @@ fmtIdEnc(const char *rawid, int encoding) /* Slow path for possible multibyte characters */ charlen = pg_encoding_mblen(encoding, cp); - if (remaining < charlen) - { - /* - * If the character is longer than the available input, - * replace the string with an invalid sequence. The invalid - * sequence ensures that the escaped string will trigger an - * error on the server-side, even if we can't directly report - * an error here. - */ - enlargePQExpBuffer(id_return, 2); - pg_encoding_set_invalid(encoding, - id_return->data + id_return->len); - id_return->len += 2; - id_return->data[id_return->len] = '\0'; - - /* there's no more input data, so we can stop */ - break; - } - else if (pg_encoding_verifymbchar(encoding, cp, charlen) == -1) + if (remaining < charlen || + pg_encoding_verifymbchar(encoding, cp, charlen) == -1) { /* * Multibyte character is invalid. It's important to verify - * that as invalid multi-byte characters could e.g. be used to + * that as invalid multibyte characters could e.g. be used to * "skip" over quote characters, e.g. when parsing * character-by-character. * - * Replace the bytes corresponding to the invalid character - * with an invalid sequence, for the same reason as above. + * Replace the character's first byte with an invalid + * sequence. The invalid sequence ensures that the escaped + * string will trigger an error on the server-side, even if we + * can't directly report an error here. * * It would be a bit faster to verify the whole string the * first time we encounter a set highbit, but this way we can - * replace just the invalid characters, which probably makes - * it easier for users to find the invalidly encoded portion - * of a larger string. + * replace just the invalid data, which probably makes it + * easier for users to find the invalidly encoded portion of a + * larger string. */ enlargePQExpBuffer(id_return, 2); pg_encoding_set_invalid(encoding, @@ -222,11 +207,13 @@ fmtIdEnc(const char *rawid, int encoding) id_return->data[id_return->len] = '\0'; /* - * Copy the rest of the string after the invalid multi-byte - * character. + * Handle the following bytes as if this byte didn't exist. + * That's safer in case the subsequent bytes contain + * characters that are significant for the caller (e.g. '>' in + * html). */ - remaining -= charlen; - cp += charlen; + remaining--; + cp++; } else { @@ -395,49 +382,39 @@ appendStringLiteral(PQExpBuffer buf, const char *str, /* Slow path for possible multibyte characters */ charlen = PQmblen(source, encoding); - if (remaining < charlen) - { - /* - * If the character is longer than the available input, replace - * the string with an invalid sequence. The invalid sequence - * ensures that the escaped string will trigger an error on the - * server-side, even if we can't directly report an error here. - * - * We know there's enough space for the invalid sequence because - * the "target" buffer is 2 * length + 2 long, and at worst we're - * replacing a single input byte with two invalid bytes. - */ - pg_encoding_set_invalid(encoding, target); - target += 2; - - /* there's no more valid input data, so we can stop */ - break; - } - else if (pg_encoding_verifymbchar(encoding, source, charlen) == -1) + if (remaining < charlen || + pg_encoding_verifymbchar(encoding, source, charlen) == -1) { /* * Multibyte character is invalid. It's important to verify that - * as invalid multi-byte characters could e.g. be used to "skip" + * as invalid multibyte characters could e.g. be used to "skip" * over quote characters, e.g. when parsing * character-by-character. * - * Replace the bytes corresponding to the invalid character with - * an invalid sequence, for the same reason as above. + * Replace the character's first byte with an invalid sequence. + * The invalid sequence ensures that the escaped string will + * trigger an error on the server-side, even if we can't directly + * report an error here. + * + * We know there's enough space for the invalid sequence because + * the "target" buffer is 2 * length + 2 long, and at worst we're + * replacing a single input byte with two invalid bytes. * * It would be a bit faster to verify the whole string the first * time we encounter a set highbit, but this way we can replace - * just the invalid characters, which probably makes it easier for - * users to find the invalidly encoded portion of a larger string. + * just the invalid data, which probably makes it easier for users + * to find the invalidly encoded portion of a larger string. */ pg_encoding_set_invalid(encoding, target); target += 2; - remaining -= charlen; /* - * Copy the rest of the string after the invalid multi-byte - * character. + * Handle the following bytes as if this byte didn't exist. That's + * safer in case the subsequent bytes contain important characters + * for the caller (e.g. '>' in html). */ - source += charlen; + source++; + remaining--; } else { diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c index 120d4d032e..294843ed8b 100644 --- a/src/interfaces/libpq/fe-exec.c +++ b/src/interfaces/libpq/fe-exec.c @@ -4076,6 +4076,7 @@ PQescapeStringInternal(PGconn *conn, const char *source = from; char *target = to; size_t remaining = strnlen(from, length); + bool already_complained = false; if (error) *error = 0; @@ -4102,65 +4103,57 @@ PQescapeStringInternal(PGconn *conn, /* Slow path for possible multibyte characters */ charlen = pg_encoding_mblen(encoding, source); - if (remaining < charlen) + if (remaining < charlen || + pg_encoding_verifymbchar(encoding, source, charlen) == -1) { /* - * If the character is longer than the available input, report an - * error if possible, and replace the string with an invalid - * sequence. The invalid sequence ensures that the escaped string - * will trigger an error on the server-side, even if we can't - * directly report an error here. + * Multibyte character is invalid. It's important to verify that + * as invalid multibyte characters could e.g. be used to "skip" + * over quote characters, e.g. when parsing + * character-by-character. + * + * Report an error if possible, and replace the character's first + * byte with an invalid sequence. The invalid sequence ensures + * that the escaped string will trigger an error on the + * server-side, even if we can't directly report an error here. * * This isn't *that* crucial when we can report an error to the - * caller, but if we can't, the caller will use this string - * unmodified and it needs to be safe for parsing. + * caller; but if we can't or the caller ignores it, the caller + * will use this string unmodified and it needs to be safe for + * parsing. * * We know there's enough space for the invalid sequence because * the "to" buffer needs to be at least 2 * length + 1 long, and * at worst we're replacing a single input byte with two invalid * bytes. - */ - if (error) - *error = 1; - if (conn) - libpq_append_conn_error(conn, "incomplete multibyte character"); - - pg_encoding_set_invalid(encoding, target); - target += 2; - - /* there's no more input data, so we can stop */ - break; - } - else if (pg_encoding_verifymbchar(encoding, source, charlen) == -1) - { - /* - * Multibyte character is invalid. It's important to verify that - * as invalid multi-byte characters could e.g. be used to "skip" - * over quote characters, e.g. when parsing - * character-by-character. - * - * Replace the bytes corresponding to the invalid character with - * an invalid sequence, for the same reason as above. * * It would be a bit faster to verify the whole string the first * time we encounter a set highbit, but this way we can replace - * just the invalid characters, which probably makes it easier for - * users to find the invalidly encoded portion of a larger string. + * just the invalid data, which probably makes it easier for users + * to find the invalidly encoded portion of a larger string. */ if (error) *error = 1; - if (conn) - libpq_append_conn_error(conn, "invalid multibyte character"); + if (conn && !already_complained) + { + if (remaining < charlen) + libpq_append_conn_error(conn, "incomplete multibyte character"); + else + libpq_append_conn_error(conn, "invalid multibyte character"); + /* Issue a complaint only once per string */ + already_complained = true; + } pg_encoding_set_invalid(encoding, target); target += 2; - remaining -= charlen; /* - * Copy the rest of the string after the invalid multi-byte - * character. + * Handle the following bytes as if this byte didn't exist. That's + * safer in case the subsequent bytes contain important characters + * for the caller (e.g. '>' in html). */ - source += charlen; + source++; + remaining--; } else { -- 2.43.5
Hi, On 2025-02-15 14:12:06 -0500, Tom Lane wrote: > On closer inspection, PQescapeInternal already issues only one > error message, since it does "return NULL" after detecting the > first error. So this makes PQescapeStringInternal behave more > like that. This looks good to me. I looked through the diff between what test_escape -v before/after this change prints out, looks good to me. > From 565b42cecf645b1dbde277512fd67836ee9081d1 Mon Sep 17 00:00:00 2001 > From: Tom Lane <tgl@sss.pgh.pa.us> > Date: Sat, 15 Feb 2025 14:10:33 -0500 > Subject: [PATCH v5] Make escaping functions retain trailing bytes of an > invalid character. > > Instead of dropping the trailing byte(s) of an invalid or incomplete > multibyte character, replace only the first byte with a known-invalid > sequence, and process the rest normally. This seems less likely to > confuse incautious callers than the behavior adopted in 5dc1e42b4. > > Author: Andres Freund <andres@anarazel.de> I think you deserve primary authorship for the change now... Are you planning to push / backpatch, or should I? Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > Are you planning to push / backpatch, or should I? I can take care of it, if you like. regards, tom lane
On 2025-02-15 15:53:20 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > Are you planning to push / backpatch, or should I? > > I can take care of it, if you like. That'd be appreciated!
Andres Freund <andres@anarazel.de> writes: > On 2025-02-15 15:53:20 -0500, Tom Lane wrote: >> I can take care of it, if you like. > That'd be appreciated! Done. regards, tom lane
On Sat, 2025-02-15 at 16:21 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2025-02-15 15:53:20 -0500, Tom Lane wrote: > > > I can take care of it, if you like. > > > That'd be appreciated! > > Done. Thank you all. Regards, Jeff Davis