tsearch is non-multibyte-aware in a few places - Mailing list pgsql-hackers

From Tom Lane
Subject tsearch is non-multibyte-aware in a few places
Date
Msg-id 15580.1213892951@sss.pgh.pa.us
Responses Re: tsearch is non-multibyte-aware in a few places  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
I've identified the cause of bug #4253:
           /* Trim trailing space */
           while (*pbuf && !t_isspace(pbuf))
               pbuf++;
           *pbuf = '\0';
 

At least on Macs, t_isspace is capable of returning "true" when pointed
at the second byte of a 2-byte UTF8 character.  This explains the report
that the letter "�" has a problem when some other ones don't.  Of
course pbuf needs to be incremented using pg_mblen not just ++.
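
A minimal, self-contained sketch of the difference, not the actual patch.
utf8_char_len() is a hypothetical stand-in for pg_mblen(), and
bytewise_isspace() is a hypothetical stand-in for t_isspace() that
deliberately treats byte 0xA0 as whitespace in order to reproduce the
symptom described above. Both helpers are assumptions made for
illustration only.

```c
#include <stddef.h>

/* Hypothetical stand-in for pg_mblen(): byte length of the UTF-8
 * character whose lead byte is at s. */
static int
utf8_char_len(const char *s)
{
    unsigned char c = (unsigned char) *s;

    if (c < 0x80)
        return 1;
    if ((c & 0xE0) == 0xC0)
        return 2;
    if ((c & 0xF0) == 0xE0)
        return 3;
    if ((c & 0xF8) == 0xF0)
        return 4;
    return 1;                   /* invalid lead byte: step one byte */
}

/* Hypothetical stand-in for t_isspace(): looks at a single byte and
 * (by assumption) accepts 0xA0, as a single-byte locale might. */
static int
bytewise_isspace(const char *s)
{
    unsigned char c = (unsigned char) *s;

    return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == 0xA0;
}

/* Original pattern: can stop on the second byte of a 2-byte character. */
static void
trim_bytewise(char *pbuf)
{
    while (*pbuf && !bytewise_isspace(pbuf))
        pbuf++;
    *pbuf = '\0';
}

/* Character-aware pattern: the space test only ever sees lead bytes. */
static void
trim_by_char(char *pbuf)
{
    while (*pbuf && !bytewise_isspace(pbuf))
    {
        int         len = utf8_char_len(pbuf);

        /* Step over the whole character, but never past the terminator. */
        while (len-- > 0 && *pbuf)
            pbuf++;
    }
    *pbuf = '\0';
}
```

With the input bytes 0xC3 0xA0 followed by "bc def", trim_bytewise()
cuts the string after the first byte, while trim_by_char() keeps
everything up to the real space.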

I looked around for other occurrences of the same problem and found
a couple.  I also found occurrences of the same pattern for skipping
whitespace:
           while (*s && t_isspace(s))
               s++;

This is safe if and only if t_isspace is never true for multibyte
characters ... can anyone think of a counterexample?
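
To make the question concrete, here is a sketch of what would go wrong
if such a character existed. It assumes, purely for illustration, a
space test that accepts the 3-byte UTF-8 sequence E3 80 80 (U+3000);
whether any real locale makes t_isspace behave this way is exactly the
open question. char_isspace() and utf8_char_len() are hypothetical
stand-ins for t_isspace() and pg_mblen().

```c
#include <stddef.h>

/* Hypothetical stand-in for pg_mblen(). */
static int
utf8_char_len(const char *s)
{
    unsigned char c = (unsigned char) *s;

    if (c < 0x80)
        return 1;
    if ((c & 0xE0) == 0xC0)
        return 2;
    if ((c & 0xF0) == 0xE0)
        return 3;
    if ((c & 0xF8) == 0xF0)
        return 4;
    return 1;                   /* invalid lead byte: step one byte */
}

/* Hypothetical space test that is true for ASCII whitespace and, by
 * assumption, for the multibyte sequence E3 80 80. */
static int
char_isspace(const char *s)
{
    const unsigned char *u = (const unsigned char *) s;

    if (u[0] == ' ' || u[0] == '\t' || u[0] == '\n' || u[0] == '\r')
        return 1;
    /* Short-circuit evaluation keeps this from reading past a NUL. */
    return u[0] == 0xE3 && u[1] == 0x80 && u[2] == 0x80;
}

/* Pattern from the message: leaves s in the middle of the character. */
static const char *
skip_space_bytewise(const char *s)
{
    while (*s && char_isspace(s))
        s++;
    return s;
}

/* Character-aware variant: always lands on a character boundary. */
static const char *
skip_space_by_char(const char *s)
{
    while (*s && char_isspace(s))
        s += utf8_char_len(s);
    return s;
}
```

Under that assumption, the byte-wise loop stops one byte into the
3-byte character, while the character-aware loop skips it whole. When
the space test is never true for a multibyte character, the two loops
behave identically.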
        regards, tom lane

