Thread: SQL_ASCII vs. 7-bit ASCII encodings

SQL_ASCII vs. 7-bit ASCII encodings

From
Oliver Jowett
Date:
The SQL_ASCII-breaks-JDBC issue just came up yet again on the JDBC list,
and I'm wondering if we can do something better on the server side to
help solve it.

The problem is that people have SQL_ASCII databases with non-7-bit data
in them under some encoding known only to a (non-JDBC) application.
Changing client_encoding has no effect on a SQL_ASCII database, it's
always passthrough. So when a JDBC client is later written, and the JDBC
driver sets client_encoding=UNICODE, we get data corruption and/or
complaints from the driver that the server is sending it invalid unicode
(because it's really LATIN1 or whatever the original inserter happened
to use).

At this point the user has real problems as there is existing data in
their database in one or more encodings, but the encoding info
associated with that data has been lost. Converting such a database to a
single database-wide encoding is painful at best.

I suppose that we can't change the semantics of SQL_ASCII without
backwards compatibility problems. I wonder if introducing a new encoding
that only allows 7-bit ascii, and making that the default, is the way to
go.

This new encoding would be treated like any other normal encoding, i.e.
setting client_encoding does transcoding (I expect that'd be a 1:1
mapping in most or all cases) and rejects unmappable characters as soon
as they're encountered.

Then the problem is visible as soon as problematic strings are given to
the server, rather than when a client that depends on having proper
encoding information (such as JDBC) happens to be used. If the DB is
only using simple 7-bit ASCII, then there's no change in behaviour. If
the DB does need to store additional characters, the user is forced to
choose an appropriate encoding before any encoding info is lost.

Any thoughts on this?

-O



Re: SQL_ASCII vs. 7-bit ASCII encodings

From
Christopher Kings-Lynne
Date:
> I suppose that we can't change the semantics of SQL_ASCII without
> backwards compatibility problems. I wonder if introducing a new encoding
> that only allows 7-bit ascii, and making that the default, is the way to
> go.

A while back I requested a new encoding that is '7BITASCII'.  It would 
be excellent for those of use who require that the data is ascii, latin1 
and utf8.

> This new encoding would be treated like any other normal encoding, i.e.
> setting client_encoding does transcoding (I expect that'd be a 1:1
> mapping in most or all cases) and rejects unmappable characters as soon
> as they're encountered.

Personally, I'd like UTF8 to be the default encoding :)  This is the 
21st century :D

Chris


Re: SQL_ASCII vs. 7-bit ASCII encodings

From
Peter Eisentraut
Date:
Am Donnerstag, 12. Mai 2005 04:42 schrieb Oliver Jowett:
> I suppose that we can't change the semantics of SQL_ASCII without
> backwards compatibility problems. I wonder if introducing a new encoding
> that only allows 7-bit ascii, and making that the default, is the way to
> go.

In 8.0, the de facto default encoding is no longer SQL_ASCII, so that problem 
should go away over time.  Certainly, making 7-bit ASCII the default encoding 
is not an option.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: SQL_ASCII vs. 7-bit ASCII encodings

From
Christopher Kings-Lynne
Date:
> In 8.0, the de facto default encoding is no longer SQL_ASCII, so that problem 
> should go away over time.  Certainly, making 7-bit ASCII the default encoding 
> is not an option.

You sure?


Re: SQL_ASCII vs. 7-bit ASCII encodings

From
"John Hansen"
Date:
> Personally, I'd like UTF8 to be the default encoding :)  This
> is the 21st century :D

I concur.

... John


Re: SQL_ASCII vs. 7-bit ASCII encodings

From
Oliver Jowett
Date:
Peter Eisentraut wrote:
> Am Donnerstag, 12. Mai 2005 04:42 schrieb Oliver Jowett:
> 
>>I suppose that we can't change the semantics of SQL_ASCII without
>>backwards compatibility problems. I wonder if introducing a new encoding
>>that only allows 7-bit ascii, and making that the default, is the way to
>>go.
> 
> 
> In 8.0, the de facto default encoding is no longer SQL_ASCII, so that problem 
> should go away over time.

My 8.0.0 (what I happen to have on hand) initdb creates a SQL_ASCII
cluster by default unless I specify -E.

>  Certainly, making 7-bit ASCII the default encoding 
> is not an option.

Why is that?

-O


Re: SQL_ASCII vs. 7-bit ASCII encodings

From
Tom Lane
Date:
Oliver Jowett <oliver@opencloud.com> writes:
> Peter Eisentraut wrote:
>> In 8.0, the de facto default encoding is no longer SQL_ASCII, so that
>> problem should go away over time.

> My 8.0.0 (what I happen to have on hand) initdb creates a SQL_ASCII
> cluster by default unless I specify -E.

This would depend on what your LANG/LC_ALL environment is.  I think
Peter is assuming that more and more systems default to setting up
non-C locale settings for users.  Which is probably true but I have
no idea about total numbers.
        regards, tom lane


Re: SQL_ASCII vs. 7-bit ASCII encodings

From
Peter Eisentraut
Date:
Am Donnerstag, 12. Mai 2005 14:57 schrieb Oliver Jowett:
> My 8.0.0 (what I happen to have on hand) initdb creates a SQL_ASCII
> cluster by default unless I specify -E.

Then you use the locale C.  We could create a 7-bit encoding and map it to 
locale C, I suppose.

> >  Certainly, making 7-bit ASCII the default encoding
> > is not an option.
>
> Why is that?

That would cripple a system that many users are perfectly content with now.  I 
compare this to the occasional requests to make pg_hba.conf reject everyone 
by default.  We have to err a little on the side of usablity.  Anyway, the 
issue here is the mismatch between the C locale and the SQL_ASCII encoding.  
The solution is to fix that mismatch, not cripple the entire system.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: SQL_ASCII vs. 7-bit ASCII encodings

From
Oliver Jowett
Date:
Peter Eisentraut wrote:
> Am Donnerstag, 12. Mai 2005 14:57 schrieb Oliver Jowett:
> 
>>My 8.0.0 (what I happen to have on hand) initdb creates a SQL_ASCII
>>cluster by default unless I specify -E.
> 
> 
> Then you use the locale C.  We could create a 7-bit encoding and map it to 
> locale C, I suppose.

Ok, I think that's what I intended :)

Do we choose SQL_ASCII in any case other than a C locale?

> That would cripple a system that many users are perfectly content with now.  I 
> compare this to the occasional requests to make pg_hba.conf reject everyone 
> by default.  We have to err a little on the side of usablity.  Anyway, the 
> issue here is the mismatch between the C locale and the SQL_ASCII encoding.  
> The solution is to fix that mismatch, not cripple the entire system.

Well, I wasn't thinking of using a 7-bit encoding always, just as a
replacement for the cases where we currently choose SQL_ASCII. Does that
sound reasonable?

-O


Re: SQL_ASCII vs. 7-bit ASCII encodings

From
Tom Lane
Date:
Oliver Jowett <oliver@opencloud.com> writes:
> Peter Eisentraut wrote:
>> That would cripple a system that many users are perfectly content with now.

> Well, I wasn't thinking of using a 7-bit encoding always, just as a
> replacement for the cases where we currently choose SQL_ASCII. Does that
> sound reasonable?

I agree with what (I think) Peter is saying: that would break things for
many people for whom the default works fine now.

We are currently seeing a whole lot of complaints due to the fact that
8.0 tends to default to Unicode encoding in environments where previous
versions defaulted to SQL-ASCII.  That says to me that a whole lot of
people were getting along just fine in SQL-ASCII, and therefore that
moving further away from that behavior is the wrong thing.  In
particular, there is not any single one of those complainants who would
be happier with a 7-bit-only default; if they were using 7-bit-only
data, they'd not have noticed a problem anyway.
        regards, tom lane


Re: SQL_ASCII vs. 7-bit ASCII encodings

From
Christopher Kings-Lynne
Date:
> We are currently seeing a whole lot of complaints due to the fact that
> 8.0 tends to default to Unicode encoding in environments where previous
> versions defaulted to SQL-ASCII.  That says to me that a whole lot of
> people were getting along just fine in SQL-ASCII, and therefore that
> moving further away from that behavior is the wrong thing.  In
> particular, there is not any single one of those complainants who would
> be happier with a 7-bit-only default; if they were using 7-bit-only
> data, they'd not have noticed a problem anyway.

I'm thinking of implementing a 7-bit-only encoding - any chance of 
getting a quick checklist of what has to be done?

Thanks,

Chris


Re: SQL_ASCII vs. 7-bit ASCII encodings

From
Oliver Jowett
Date:
Tom Lane wrote:
> Oliver Jowett <oliver@opencloud.com> writes:
> 
>>Peter Eisentraut wrote:
>>
>>>That would cripple a system that many users are perfectly content with now.
> 
> 
>>Well, I wasn't thinking of using a 7-bit encoding always, just as a
>>replacement for the cases where we currently choose SQL_ASCII. Does that
>>sound reasonable?
> 
> 
> I agree with what (I think) Peter is saying: that would break things for
> many people for whom the default works fine now.
> 
> We are currently seeing a whole lot of complaints due to the fact that
> 8.0 tends to default to Unicode encoding in environments where previous
> versions defaulted to SQL-ASCII.  That says to me that a whole lot of
> people were getting along just fine in SQL-ASCII, and therefore that
> moving further away from that behavior is the wrong thing.  In
> particular, there is not any single one of those complainants who would
> be happier with a 7-bit-only default; if they were using 7-bit-only
> data, they'd not have noticed a problem anyway.

This is exactly the case where JDBC has problems, and the case I'd like
to prevent happening in the first place where possible: SQL_ASCII with
non-7-bit data. How do you propose that the JDBC driver converts from
SQL_ASCII to UTF-16 (the internal Java String representation)? Changing
client_encoding does not help. Requiring the JDBC client to specify the
right encoding to use is error-prone at best, and impossible at worst
(who says that only one encoding has been used?)

I'm not suggesting that a 7-bit encoding is necessarily useful to
everyone. I'm saying that we should make it a setting that users have to
think about and correctly set before they can insert 8-bit data. If they
decide they want SQL_ASCII and the associated client_encoding problems,
rather than an appropriate encoding the database understands, so be it;
but it's on their head, and requires active intervention before the
database starts losing encoding information.

If SQL_ASCII plus 8-bit data is considered the right thing to do, then
I'd consider the ability to change client_encoding on a SQL_ASCII
database without an error to be a bug -- you've asked the server to give
you (for example) UTF8, but it isn't doing that. In that case, can we
get this to generate an error when client_encoding is set instead of
producing invalid output?

-O


Re: SQL_ASCII vs. 7-bit ASCII encodings

From
Alvaro Herrera
Date:
On Fri, May 13, 2005 at 01:15:36AM -0400, Tom Lane wrote:

> We are currently seeing a whole lot of complaints due to the fact that
> 8.0 tends to default to Unicode encoding in environments where previous
> versions defaulted to SQL-ASCII.  That says to me that a whole lot of
> people were getting along just fine in SQL-ASCII, and therefore that
> moving further away from that behavior is the wrong thing.  In
> particular, there is not any single one of those complainants who would
> be happier with a 7-bit-only default; if they were using 7-bit-only
> data, they'd not have noticed a problem anyway.

I disagree.  Of course none of the complainants would be happy with
7-bit encoding, but if they had noticed they had a problem before they
had inserted millions of tuples, they could have corrected their
configuration right away.

The problem is that a single application coming from a single
environment is happy with a 8-bit-unchecked encoding, but as soon as
they develop a second application using a different environment, which
uses a different encoding, they start seeing invalid data pop up.  And
then they have a problem, because they have to dump all data, recode it,
and reimport it.  And that's very painful.

-- 
Alvaro Herrera (<alvherre[a]surnet.cl>)
Voy a acabar con todos los humanos / con los humanos yo acabaré
voy a acabar con todos / con todos los humanos acabaré (Bender)


Re: SQL_ASCII vs. 7-bit ASCII encodings

From
Tom Lane
Date:
Alvaro Herrera <alvherre@surnet.cl> writes:
> The problem is that a single application coming from a single
> environment is happy with a 8-bit-unchecked encoding, but as soon as
> they develop a second application using a different environment, which
> uses a different encoding, they start seeing invalid data pop up.

[ shrug... ]  The evidence at hand says that many people never get to
that point.  For instance, a particular database may never be accessed
through anything except JDBC, and so all the incoming data will be utf8
anyway.

My feeling about it is that we already made significant changes in 8.0
--- it won't default to SQL_ASCII unless your locale is "C", which to me
is a pretty strong indication that you are not very concerned about
encodings.  We should wait and see what field experience is like with
that, rather than insisting on anything as anal-retentive as disallowing
8-bit data in SQL_ASCII.  Doing that might have technical purity but I
think it will create as many problems as it prevents.
        regards, tom lane


Re: SQL_ASCII vs. 7-bit ASCII encodings

From
Alvaro Herrera
Date:
On Fri, May 13, 2005 at 09:59:27AM -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@surnet.cl> writes:
> > The problem is that a single application coming from a single
> > environment is happy with a 8-bit-unchecked encoding, but as soon as
> > they develop a second application using a different environment, which
> > uses a different encoding, they start seeing invalid data pop up.
> 
> [ shrug... ]  The evidence at hand says that many people never get to
> that point.  For instance, a particular database may never be accessed
> through anything except JDBC, and so all the incoming data will be utf8
> anyway.

One thing that's not clear to me is what encoding does people running on
Windows get?  Is it also determined based on locale, and is it something
useful?

In fact I've seen many more people with this problem after 8.0 was
released, at least in pgsql-es-ayuda.

-- 
Alvaro Herrera (<alvherre[a]surnet.cl>)
"La verdad no siempre es bonita, pero el hambre de ella sí"


Re: SQL_ASCII vs. 7-bit ASCII encodings

From
Tom Lane
Date:
Alvaro Herrera <alvherre@surnet.cl> writes:
> In fact I've seen many more people with this problem after 8.0 was
> released, at least in pgsql-es-ayuda.

Which problem exactly?  Most of the 8.0 complaints I can recall seemed
to come from people who were trying to dump from a SQL_ASCII database
and reload into a UTF8 database --- that is, their problems were
specifically *caused by* our decision to shift to a more restrictive
default.  Which is why I'm not eager to make it even more restrictive.
        regards, tom lane


Re: SQL_ASCII vs. 7-bit ASCII encodings

From
Oliver Jowett
Date:
Tom Lane wrote:
> We should wait and see what field experience is like with
> that, rather than insisting on anything as anal-retentive as disallowing
> 8-bit data in SQL_ASCII.

I didn't suggest changing the behaviour of SQL_ASCII..

-O


Re: SQL_ASCII vs. 7-bit ASCII encodings

From
Alvaro Herrera
Date:
On Fri, May 13, 2005 at 10:22:06AM -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@surnet.cl> writes:
> > In fact I've seen many more people with this problem after 8.0 was
> > released, at least in pgsql-es-ayuda.
> 
> Which problem exactly?  Most of the 8.0 complaints I can recall seemed
> to come from people who were trying to dump from a SQL_ASCII database
> and reload into a UTF8 database --- that is, their problems were
> specifically *caused by* our decision to shift to a more restrictive
> default.  Which is why I'm not eager to make it even more restrictive.

Well, of all kinds really.  There are some of those.  Also I've seen
several questions from people using, say, a Python environment and then
they have to add a Windows/ODBC application.  The Python apps have been
using SQL_ASCII all along, but ODBC apparently wants UTF8 or some sort
of recode.  So non-7-bit chars display funny.

Anyway now that I look at them ISTM that most complaints actually come
from misunderstanding of the whole encoding issue.  They just create a
database and an app and start coding and throwing data.  At first it all
seems fine.  Later they find out that the client is using, say,
win-1250, and has been using client_encoding latin1; so it has been
wrong all around and they only find out when a new, different app shows
something strange.

So maybe it's not about 7 bits or not, it's just lack of knowledge.
But I've certainly seen much more problems reported after 8.0 was
released.

-- 
Alvaro Herrera (<alvherre[a]surnet.cl>)
"Pensar que el espectro que vemos es ilusorio no lo despoja de espanto,
sólo le suma el nuevo terror de la locura" (Perelandra, CSLewis)