Re: text column constraint, newbie question - Mailing list pgsql-general

From Craig Ringer
Subject Re: text column constraint, newbie question
Date
Msg-id 49C82143.9080500@postnewspapers.com.au
Whole thread Raw
In response to Re: text column constraint, newbie question  (RebeccaJ <rebeccaj@gmail.com>)
List pgsql-general
RebeccaJ wrote:

> And I wonder why you like SQL_ASCII better than UTF8, and whether
> others have any opinions about those two. (My web server's LC_CTYPE is
> C, so I can use any character set.) Wouldn't UTF8 allow more
> characters than SQL_ASCII?

I've had a LOT of experience dealing with apps that use 8-bit byte
strings (like SQL_ASCII `text') to store data, and I've rarely seen one
that *doesn't* have text encoding handling bugs.

If you store your text as byte streams that don't know, check, or
enforce their own encoding you must keep track of the encoding
separately - either with another value stored alongside the string, or
through your app logic.

If you start storing data with multiple different text encodings in the
DB, you're most likely to land up tracking down annoying "corrupt text"
bugs sooner or later.

If, on the other hand, you use UTF-8, you *know* that everything in the
database is well-formed UTF-8. You ensure that it is UTF-8 before
storing it in the DB and know it'll be UTF-8 coming out. The DB takes
care of encoding conversion for you if you ask it to, by setting
client_encoding - the only downside being that it'll refuse to return
strings that can't be represented in your current client_encoding, like
say Cyrillic (Russian etc) text if you're using ISO-8859-1 (latin-1) for
your client encoding.

Even with a UTF-8 database you must still get your I/O to/from libraries
and the rest of the system right, converting UTF-8 text to whatever the
system expects or vice versa. Alternately, if you set client_encoding,
you must be prepared for cases where the DB can't send you what you ask
for because your encoding can't represent it.

All in all, I personally think a UTF-8 database is considerably better
for most uses. There are certainly cases where I'd use SQL_ASCII, but
not most.

--
Craig Ringer

pgsql-general by date:

Previous
From: Alban Hertroys
Date:
Subject: Re: text column constraint, newbie question
Next
From: Craig Ringer
Date:
Subject: Re: Last modification time of a database?