On Sun, Mar 20, 2005 at 10:02:24AM -0500, Carlos Moreno wrote:
Carlos,
> So, our system (CGI's written in C++ running on a Linux server)
> simply takes whatever the user gives (properly validated and
> escaped) and throws it in the database. We've never encountered
> any problem (well, or perhaps it's the opposite? Perhaps we've
> always been living with the problem without realizing it?)
The latter, I think. The problem is character recoding. If your old
system has been running with encoding SQL_ASCII, then no recoding ever
takes place. If you are now using UTF8 or latin1 (say) as server
encoding, then as soon as the client is using a different encoding,
there should be conversion in order to make the new data correct w.r.t.
the server encoding. If the wrong conversion takes place, or if no
conversion takes place, you may either end up with invalid data, or
have the server reject your input (as happened in this case).
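
To make those three outcomes concrete, here is a minimal sketch in Python (picked only for illustration; your CGIs are C++, and the server does the equivalent conversion itself once it knows the client encoding). The helper names recode and is_valid are made up for this example, and "utf-8" stands in for a UTF8 server encoding.

```python
def recode(raw: bytes, client_encoding: str, server_encoding: str = "utf-8") -> bytes:
    """Interpret raw as client_encoding, then re-encode it for the server."""
    return raw.decode(client_encoding).encode(server_encoding)


def is_valid(raw: bytes, encoding: str) -> bool:
    """Return True if raw is a well-formed byte sequence in encoding."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Curly quotes typed on a Windows client: bytes 0x93 and 0x94 in win1252.
raw = b"\x93hi\x94"

# No conversion: these bytes are not valid UTF-8, so a server that checks
# its input would reject them.
assert not is_valid(raw, "utf-8")

# Correct conversion: the client declares the encoding it really uses.
good = recode(raw, "cp1252")

# Wrong conversion: declaring latin1 still "succeeds", but what gets stored
# is a pair of C1 control characters instead of quotation marks.
bad = recode(raw, "latin-1")

assert good.decode("utf-8") == "\u201chi\u201d"
assert bad != good
```

The last case is the nasty one: nothing fails, and the data is simply wrong.
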
So the moral of the story seems to be that yes, you need to make each
application set the correct client_encoding before entering any data.
You can also make it a default for the user or the database by issuing
ALTER USER (resp. ALTER DATABASE) ... SET client_encoding. But if you are
using a web interface, where the user can
enter data in either win1252 or latin1 encoding (or whatever) depending
on the environment, then I'm not sure what you should do. One idea
would be "do nothing," but that seems very prone to invalid data. Another
idea would be to have the user select an encoding (and maybe display the
data to them after the recoding has taken place, so they can correct it
in case they got it wrong). This seems messy and likely to upset your
users.
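
For the part that is clear-cut (setting client_encoding per session, or as a default per user or database), the statements could be generated along these lines. This is only a sketch: the PG_ENCODINGS table and both function names are hypothetical, the charset labels are whatever your web layer happens to report, and the PostgreSQL encoding names assume a release that supports them (WIN1252 in particular).

```python
# Hypothetical mapping from charset labels a web layer might report to
# PostgreSQL encoding names. Extend it for the encodings you actually see.
PG_ENCODINGS = {
    "utf-8": "UTF8",
    "iso-8859-1": "LATIN1",
    "windows-1252": "WIN1252",
}


def session_encoding_sql(charset: str) -> str:
    """Build the statement to run at the start of a session.

    Raises KeyError for an unknown charset, so an unrecognised label is
    never passed through to the server.
    """
    pg_name = PG_ENCODINGS[charset.strip().lower()]
    return f"SET client_encoding TO '{pg_name}';"


def default_encoding_sql(kind: str, name: str, charset: str) -> str:
    """Build ALTER USER / ALTER DATABASE ... SET client_encoding."""
    kind = kind.upper()
    if kind not in ("USER", "DATABASE"):
        raise ValueError("kind must be USER or DATABASE")
    pg_name = PG_ENCODINGS[charset.strip().lower()]
    # Quote the identifier, doubling any embedded double quotes.
    quoted = '"' + name.replace('"', '""') + '"'
    return f"ALTER {kind} {quoted} SET client_encoding TO '{pg_name}';"


print(session_encoding_sql("ISO-8859-1"))
print(default_encoding_sql("user", "webapp", "utf-8"))
```

The per-session form is what a CGI would send right after connecting; the ALTER form only sets the default that new sessions start with.
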
Someone else may have better advice for you on this. I haven't really
worked with these things.
--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"I can't go to a restaurant and order food because I keep looking at the
fonts on the menu. Five minutes later I realize that it's also talking
about food" (Donald Knuth)