BUG: SUBSTRING on TOASTed text returns "invalid byte sequence for encoding UTF8: 0xc3" - Mailing list pgsql-bugs

From ludovitkramar
Subject BUG: SUBSTRING on TOASTed text returns "invalid byte sequence for encoding UTF8: 0xc3"
Date
Msg-id ANM2Wyhl66VMeL4PfgAA4Quozz-NM9Q3j0OFpyxDIwlUQ8392QerkipA3xniJ7GaJIKF3Gd6X1wDw95bXkdP_8VdFK5etQmb1AoDkzvbAak=@proton.me
Whole thread
Responses Re: BUG: SUBSTRING on TOASTed text returns "invalid byte sequence for encoding UTF8: 0xc3"
List pgsql-bugs
Disclosure: This bug reproduction and report were generated with the
assistance of Claude Opus 4.6, and reviewed by me.


PostgreSQL version: 18.2-1 (Arch Linux package)

SELECT version();
  PostgreSQL 18.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 15.2.1 20260209, 64-bit

Platform: Arch Linux (rolling), kernel 6.18.9-arch1-2, x86_64
Encoding: UTF8, Collation: C.UTF-8, Ctype: C.UTF-8
Block size: 8192

Full reproduction (can be run through psql):

  -- 1. Load the test data from the attached hex file (pg18_repro_hex.txt)
  CREATE TABLE _pg18_repro (id INT PRIMARY KEY DEFAULT 1, hex_content TEXT);
  \COPY _pg18_repro (hex_content) FROM '/tmp/pg18_repro_hex.txt'
  ALTER TABLE _pg18_repro ADD COLUMN content TEXT;
  UPDATE _pg18_repro SET content = convert_from(decode(replace(hex_content, '\x', ''), 'hex'), 'UTF8');
  ALTER TABLE _pg18_repro DROP COLUMN hex_content;

  -- 2. Verify the data is valid UTF-8 and is TOASTed
  SELECT length(content) AS char_len,
         octet_length(content) AS byte_len,
         pg_column_size(content) AS stored_size
  FROM _pg18_repro WHERE id = 1;
  -- Returns: char_len=136485, byte_len=137165, stored_size=65412
  -- (stored_size confirms the value is TOASTed and compressed)

  -- 3. SUBSTRING fails at character position 1212
  SELECT SUBSTRING(content FROM 1212 FOR 1) FROM _pg18_repro WHERE id = 1;

Actual output (with \set VERBOSITY verbose):

  ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xc3
  LOCATION:  report_invalid_encoding_int, mbutils.c:1847

Expected output:

  substring
  -----------
   &
  (1 row)

The character at position 1212 is U+0026 AMPERSAND (&), a single-byte ASCII
character. The error 0xc3 is the lead byte of a 2-byte UTF-8 sequence that
exists elsewhere in the data — SUBSTRING appears to be computing an incorrect
byte offset when fetching from the TOASTed value.

Evidence that the data is valid and the bug is TOAST-specific:

  -- LEFT/RIGHT returns the correct character at the same position:
  SELECT LEFT(RIGHT(content, -1211), 1) FROM _pg18_repro WHERE id = 1;
  -- Returns: &

  -- Adjacent positions work with SUBSTRING:
  SELECT SUBSTRING(content FROM 1211 FOR 1) FROM _pg18_repro WHERE id = 1;
  -- Returns: (newline)
  SELECT SUBSTRING(content FROM 1213 FOR 1) FROM _pg18_repro WHERE id = 1;
  -- Returns: S

  -- Forcing a full detoast with || '' makes SUBSTRING work:
  SELECT SUBSTRING((content || '') FROM 1212 FOR 1) FROM _pg18_repro WHERE id = 1;
  -- Returns: &

  -- An inline (non-TOASTed) copy of the same range works:
  SELECT SUBSTRING(
      (SELECT SUBSTRING(content FROM 1 FOR 1500) FROM _pg18_repro WHERE id = 1)
      FROM 1212 FOR 1
  );
  -- Returns: &

Position 1212 is not the only affected position. Scanning the first 5000
character positions, 37 positions fail — all with the same 0xc3 error.
The full list of failing positions in the first 5000 characters:

  18, 69, 75, 167, 308, 704, 876, 1183, 1212, 1235, 1557, 1758, 1770,
  2034, 2077, 2091, 2576, 2742, 2878, 2895, 2918, 2944, 3020, 3126,
  3181, 3226, 3267, 3462, 3548, 3551, 3554, 3556, 4068, 4154, 4269,
  4332, 4964

The scanning query used:

  DO $$
  DECLARE
      i INT;
      ch TEXT;
  BEGIN
      FOR i IN 1..5000 LOOP
          BEGIN
              SELECT SUBSTRING(content FROM i FOR 1) INTO ch
              FROM _pg18_repro WHERE id = 1;
          EXCEPTION WHEN OTHERS THEN
              RAISE NOTICE 'FAIL at position %: %', i, SQLERRM;
          END;
      END LOOP;
  END;
  $$;

No non-default configuration. The PostgreSQL installation is the stock Arch
Linux postgresql 18.2-1 package, started via systemd, with no changes to
postgresql.conf beyond the defaults.

Cleanup:
  DROP TABLE _pg18_repro;

Attachment

pgsql-bugs by date:

Previous
From: Sergey Shinderuk
Date:
Subject: Re: BUG #19412: Wrong query result with not null constraint
Next
From: Álvaro Herrera
Date:
Subject: Re: BUG #19393: pg_upgrade fails with duplicate key violation when CHECK constraint named *_not_null exists