Thread: SQL_ASCII vs. 7-bit ASCII encodings
The SQL_ASCII-breaks-JDBC issue just came up yet again on the JDBC list, and I'm wondering if we can do something better on the server side to help solve it. The problem is that people have SQL_ASCII databases with non-7-bit data in them under some encoding known only to a (non-JDBC) application. Changing client_encoding has no effect on a SQL_ASCII database, it's always passthrough. So when a JDBC client is later written, and the JDBC driver sets client_encoding=UNICODE, we get data corruption and/or complaints from the driver that the server is sending it invalid unicode (because it's really LATIN1 or whatever the original inserter happened to use). At this point the user has real problems as there is existing data in their database in one or more encodings, but the encoding info associated with that data has been lost. Converting such a database to a single database-wide encoding is painful at best. I suppose that we can't change the semantics of SQL_ASCII without backwards compatibility problems. I wonder if introducing a new encoding that only allows 7-bit ascii, and making that the default, is the way to go. This new encoding would be treated like any other normal encoding, i.e. setting client_encoding does transcoding (I expect that'd be a 1:1 mapping in most or all cases) and rejects unmappable characters as soon as they're encountered. Then the problem is visible as soon as problematic strings are given to the server, rather than when a client that depends on having proper encoding information (such as JDBC) happens to be used. If the DB is only using simple 7-bit ASCII, then there's no change in behaviour. If the DB does need to store additional characters, the user is forced to choose an appropriate encoding before any encoding info is lost. Any thoughts on this? -O
> I suppose that we can't change the semantics of SQL_ASCII without > backwards compatibility problems. I wonder if introducing a new encoding > that only allows 7-bit ascii, and making that the default, is the way to > go. A while back I requested a new encoding that is '7BITASCII'. It would be excellent for those of use who require that the data is ascii, latin1 and utf8. > This new encoding would be treated like any other normal encoding, i.e. > setting client_encoding does transcoding (I expect that'd be a 1:1 > mapping in most or all cases) and rejects unmappable characters as soon > as they're encountered. Personally, I'd like UTF8 to be the default encoding :) This is the 21st century :D Chris
Am Donnerstag, 12. Mai 2005 04:42 schrieb Oliver Jowett: > I suppose that we can't change the semantics of SQL_ASCII without > backwards compatibility problems. I wonder if introducing a new encoding > that only allows 7-bit ascii, and making that the default, is the way to > go. In 8.0, the de facto default encoding is no longer SQL_ASCII, so that problem should go away over time. Certainly, making 7-bit ASCII the default encoding is not an option. -- Peter Eisentraut http://developer.postgresql.org/~petere/
> In 8.0, the de facto default encoding is no longer SQL_ASCII, so that problem > should go away over time. Certainly, making 7-bit ASCII the default encoding > is not an option. You sure?
> Personally, I'd like UTF8 to be the default encoding :) This > is the 21st century :D I concur. ... John
Peter Eisentraut wrote: > Am Donnerstag, 12. Mai 2005 04:42 schrieb Oliver Jowett: > >>I suppose that we can't change the semantics of SQL_ASCII without >>backwards compatibility problems. I wonder if introducing a new encoding >>that only allows 7-bit ascii, and making that the default, is the way to >>go. > > > In 8.0, the de facto default encoding is no longer SQL_ASCII, so that problem > should go away over time. My 8.0.0 (what I happen to have on hand) initdb creates a SQL_ASCII cluster by default unless I specify -E. > Certainly, making 7-bit ASCII the default encoding > is not an option. Why is that? -O
Oliver Jowett <oliver@opencloud.com> writes: > Peter Eisentraut wrote: >> In 8.0, the de facto default encoding is no longer SQL_ASCII, so that >> problem should go away over time. > My 8.0.0 (what I happen to have on hand) initdb creates a SQL_ASCII > cluster by default unless I specify -E. This would depend on what your LANG/LC_ALL environment is. I think Peter is assuming that more and more systems default to setting up non-C locale settings for users. Which is probably true but I have no idea about total numbers. regards, tom lane
Am Donnerstag, 12. Mai 2005 14:57 schrieb Oliver Jowett: > My 8.0.0 (what I happen to have on hand) initdb creates a SQL_ASCII > cluster by default unless I specify -E. Then you use the locale C. We could create a 7-bit encoding and map it to locale C, I suppose. > > Certainly, making 7-bit ASCII the default encoding > > is not an option. > > Why is that? That would cripple a system that many users are perfectly content with now. I compare this to the occasional requests to make pg_hba.conf reject everyone by default. We have to err a little on the side of usablity. Anyway, the issue here is the mismatch between the C locale and the SQL_ASCII encoding. The solution is to fix that mismatch, not cripple the entire system. -- Peter Eisentraut http://developer.postgresql.org/~petere/
Peter Eisentraut wrote: > Am Donnerstag, 12. Mai 2005 14:57 schrieb Oliver Jowett: > >>My 8.0.0 (what I happen to have on hand) initdb creates a SQL_ASCII >>cluster by default unless I specify -E. > > > Then you use the locale C. We could create a 7-bit encoding and map it to > locale C, I suppose. Ok, I think that's what I intended :) Do we choose SQL_ASCII in any case other than a C locale? > That would cripple a system that many users are perfectly content with now. I > compare this to the occasional requests to make pg_hba.conf reject everyone > by default. We have to err a little on the side of usablity. Anyway, the > issue here is the mismatch between the C locale and the SQL_ASCII encoding. > The solution is to fix that mismatch, not cripple the entire system. Well, I wasn't thinking of using a 7-bit encoding always, just as a replacement for the cases where we currently choose SQL_ASCII. Does that sound reasonable? -O
Oliver Jowett <oliver@opencloud.com> writes: > Peter Eisentraut wrote: >> That would cripple a system that many users are perfectly content with now. > Well, I wasn't thinking of using a 7-bit encoding always, just as a > replacement for the cases where we currently choose SQL_ASCII. Does that > sound reasonable? I agree with what (I think) Peter is saying: that would break things for many people for whom the default works fine now. We are currently seeing a whole lot of complaints due to the fact that 8.0 tends to default to Unicode encoding in environments where previous versions defaulted to SQL-ASCII. That says to me that a whole lot of people were getting along just fine in SQL-ASCII, and therefore that moving further away from that behavior is the wrong thing. In particular, there is not any single one of those complainants who would be happier with a 7-bit-only default; if they were using 7-bit-only data, they'd not have noticed a problem anyway. regards, tom lane
> We are currently seeing a whole lot of complaints due to the fact that > 8.0 tends to default to Unicode encoding in environments where previous > versions defaulted to SQL-ASCII. That says to me that a whole lot of > people were getting along just fine in SQL-ASCII, and therefore that > moving further away from that behavior is the wrong thing. In > particular, there is not any single one of those complainants who would > be happier with a 7-bit-only default; if they were using 7-bit-only > data, they'd not have noticed a problem anyway. I'm thinking of implementing a 7-bit-only encoding - any chance of getting a quick checklist of what has to be done? Thanks, Chris
Tom Lane wrote: > Oliver Jowett <oliver@opencloud.com> writes: > >>Peter Eisentraut wrote: >> >>>That would cripple a system that many users are perfectly content with now. > > >>Well, I wasn't thinking of using a 7-bit encoding always, just as a >>replacement for the cases where we currently choose SQL_ASCII. Does that >>sound reasonable? > > > I agree with what (I think) Peter is saying: that would break things for > many people for whom the default works fine now. > > We are currently seeing a whole lot of complaints due to the fact that > 8.0 tends to default to Unicode encoding in environments where previous > versions defaulted to SQL-ASCII. That says to me that a whole lot of > people were getting along just fine in SQL-ASCII, and therefore that > moving further away from that behavior is the wrong thing. In > particular, there is not any single one of those complainants who would > be happier with a 7-bit-only default; if they were using 7-bit-only > data, they'd not have noticed a problem anyway. This is exactly the case where JDBC has problems, and the case I'd like to prevent happening in the first place where possible: SQL_ASCII with non-7-bit data. How do you propose that the JDBC driver converts from SQL_ASCII to UTF-16 (the internal Java String representation)? Changing client_encoding does not help. Requiring the JDBC client to specify the right encoding to use is error-prone at best, and impossible at worst (who says that only one encoding has been used?) I'm not suggesting that a 7-bit encoding is necessarily useful to everyone. I'm saying that we should make it a setting that users have to think about and correctly set before they can insert 8-bit data. If they decide they want SQL_ASCII and the associated client_encoding problems, rather than an appropriate encoding the database understands, so be it; but it's on their head, and requires active intervention before the database starts losing encoding information. If SQL_ASCII plus 8-bit data is considered the right thing to do, then I'd consider the ability to change client_encoding on a SQL_ASCII database without an error to be a bug -- you've asked the server to give you (for example) UTF8, but it isn't doing that. In that case, can we get this to generate an error when client_encoding is set instead of producing invalid output? -O
On Fri, May 13, 2005 at 01:15:36AM -0400, Tom Lane wrote: > We are currently seeing a whole lot of complaints due to the fact that > 8.0 tends to default to Unicode encoding in environments where previous > versions defaulted to SQL-ASCII. That says to me that a whole lot of > people were getting along just fine in SQL-ASCII, and therefore that > moving further away from that behavior is the wrong thing. In > particular, there is not any single one of those complainants who would > be happier with a 7-bit-only default; if they were using 7-bit-only > data, they'd not have noticed a problem anyway. I disagree. Of course none of the complainants would be happy with 7-bit encoding, but if they had noticed they had a problem before they had inserted millions of tuples, they could have corrected their configuration right away. The problem is that a single application coming from a single environment is happy with a 8-bit-unchecked encoding, but as soon as they develop a second application using a different environment, which uses a different encoding, they start seeing invalid data pop up. And then they have a problem, because they have to dump all data, recode it, and reimport it. And that's very painful. -- Alvaro Herrera (<alvherre[a]surnet.cl>) Voy a acabar con todos los humanos / con los humanos yo acabaré voy a acabar con todos / con todos los humanos acabaré (Bender)
Alvaro Herrera <alvherre@surnet.cl> writes: > The problem is that a single application coming from a single > environment is happy with a 8-bit-unchecked encoding, but as soon as > they develop a second application using a different environment, which > uses a different encoding, they start seeing invalid data pop up. [ shrug... ] The evidence at hand says that many people never get to that point. For instance, a particular database may never be accessed through anything except JDBC, and so all the incoming data will be utf8 anyway. My feeling about it is that we already made significant changes in 8.0 --- it won't default to SQL_ASCII unless your locale is "C", which to me is a pretty strong indication that you are not very concerned about encodings. We should wait and see what field experience is like with that, rather than insisting on anything as anal-retentive as disallowing 8-bit data in SQL_ASCII. Doing that might have technical purity but I think it will create as many problems as it prevents. regards, tom lane
On Fri, May 13, 2005 at 09:59:27AM -0400, Tom Lane wrote: > Alvaro Herrera <alvherre@surnet.cl> writes: > > The problem is that a single application coming from a single > > environment is happy with a 8-bit-unchecked encoding, but as soon as > > they develop a second application using a different environment, which > > uses a different encoding, they start seeing invalid data pop up. > > [ shrug... ] The evidence at hand says that many people never get to > that point. For instance, a particular database may never be accessed > through anything except JDBC, and so all the incoming data will be utf8 > anyway. One thing that's not clear to me is what encoding does people running on Windows get? Is it also determined based on locale, and is it something useful? In fact I've seen many more people with this problem after 8.0 was released, at least in pgsql-es-ayuda. -- Alvaro Herrera (<alvherre[a]surnet.cl>) "La verdad no siempre es bonita, pero el hambre de ella sí"
Alvaro Herrera <alvherre@surnet.cl> writes: > In fact I've seen many more people with this problem after 8.0 was > released, at least in pgsql-es-ayuda. Which problem exactly? Most of the 8.0 complaints I can recall seemed to come from people who were trying to dump from a SQL_ASCII database and reload into a UTF8 database --- that is, their problems were specifically *caused by* our decision to shift to a more restrictive default. Which is why I'm not eager to make it even more restrictive. regards, tom lane
Tom Lane wrote: > We should wait and see what field experience is like with > that, rather than insisting on anything as anal-retentive as disallowing > 8-bit data in SQL_ASCII. I didn't suggest changing the behaviour of SQL_ASCII.. -O
On Fri, May 13, 2005 at 10:22:06AM -0400, Tom Lane wrote: > Alvaro Herrera <alvherre@surnet.cl> writes: > > In fact I've seen many more people with this problem after 8.0 was > > released, at least in pgsql-es-ayuda. > > Which problem exactly? Most of the 8.0 complaints I can recall seemed > to come from people who were trying to dump from a SQL_ASCII database > and reload into a UTF8 database --- that is, their problems were > specifically *caused by* our decision to shift to a more restrictive > default. Which is why I'm not eager to make it even more restrictive. Well, of all kinds really. There are some of those. Also I've seen several questions from people using, say, a Python environment and then they have to add a Windows/ODBC application. The Python apps have been using SQL_ASCII all along, but ODBC apparently wants UTF8 or some sort of recode. So non-7-bit chars display funny. Anyway now that I look at them ISTM that most complaints actually come from misunderstanding of the whole encoding issue. They just create a database and an app and start coding and throwing data. At first it all seems fine. Later they find out that the client is using, say, win-1250, and has been using client_encoding latin1; so it has been wrong all around and they only find out when a new, different app shows something strange. So maybe it's not about 7 bits or not, it's just lack of knowledge. But I've certainly seen much more problems reported after 8.0 was released. -- Alvaro Herrera (<alvherre[a]surnet.cl>) "Pensar que el espectro que vemos es ilusorio no lo despoja de espanto, sólo le suma el nuevo terror de la locura" (Perelandra, CSLewis)