Re: [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence
Date	August 20, 2010 16:50:22
Msg-id	25852.1282333813@sss.pgh.pa.us
In response to	Re: [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence (Steven Schlansker <steven@trumpet.io>)
List	pgsql-hackers

Steven Schlansker <steven@trumpet.io> writes:
> On Aug 19, 2010, at 3:24 PM, Tom Lane wrote:
>> We generally assume that in server-safe encodings, the ctype.h functions
>> will behave sanely on any single-byte value.  You can argue the wisdom
>> of that, but deciding to change that policy would be a rather massive
>> code change; I'm not excited about going that direction.

> Fair enough.  I presume there are no "server-safe encodings" for which
> a multibyte sequence 0x XX20 would be valid - which would break anyway
> (as the second byte looks like a real space)

Right: our definition of a "server-safe encoding" is precisely that no
byte of a multibyte character looks like ASCII, i.e., all bytes have their
high bit set.  We're essentially assuming that the <ctype.h> functions
will all return false for any byte with the high bit set, if the
selected encoding is multibyte.
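
As a minimal sketch of that assumption (not PostgreSQL's actual code; both
helper names below are hypothetical), the first function checks that every
byte of a multibyte sequence has its high bit set, and the second shows one
way a caller could avoid depending on how the platform's isspace() treats
high-bit bytes:

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns true if every byte in s[0..len) has its
 * high bit set, i.e. no byte of the sequence could be mistaken for ASCII.
 */
static bool
all_bytes_high_bit(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
    {
        if ((s[i] & 0x80) == 0)
            return false;
    }
    return true;
}

/*
 * Hypothetical guard: consult isspace() only for ASCII bytes, so the
 * result for bytes >= 0x80 never depends on the platform's locale tables.
 */
static bool
ascii_only_isspace(unsigned char c)
{
    return c < 0x80 && isspace(c) != 0;
}
```

With these, the two-byte sequence 0xC3 0xA0 passes the first check, while a
sequence such as 0xC3 0x20 fails it because the second byte is a real space.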

> Anyway, it looks like this is actually a BSD bug which got copy +
> pasted into Apple's Darwin source -
> http://lists.freebsd.org/pipermail/freebsd-i18n/2007-September/000157.html

Interesting.  So the BSD people did fix it upstream?
        regards, tom lane
