Thread: Bug or not about ASCII and Multi-Byte character set

Bug or not about ASCII and Multi-Byte character set

From

"ZF"

Date:

14 August 2005, 07:10:23

version: psqlodbc 08.00.0101 and 08.00.0102

The server database encoding is ASCII, on Win2000. After inserting several multi-byte characters with pgAdmin3 (version:
1.2.1), I found that the ODBC driver can't provide the right data (the application is written in Visual C++ and calls the ODBC API to
access the db server). All displayed multi-byte character data was incorrect, but the ASCII data was right. However,
pgAdmin3 can correctly insert and select all data.
 

Then I read the "mylog" file and found that the multi-byte data had been received correctly from the server.
Is it possible that some function (maybe the conversion function) has a small bug?


P.S.
If the server db encoding is UNICODE and I build a new psqlodbc.dll after deleting all SQL*W functions from the
'psqlodbc_win32.def' file, Delphi applications using the ODBC driver show all data correctly.

Re: Bug or not about ASCII and Multi-Byte character set

From

Andreas Pflug

Date:

14 August 2005, 09:11:34

ZF wrote:
> version: psqlodbc 08.00.0101 and 08.00.0102
>
> The server database encoding is ASCII in win2000. After inserting
> several multi-byte characters with pgAdmin3(version: 1.2.1),

Since SQL_ASCII is *not* multi-byte, pgAdmin will insert single byte
characters which are non-ASCII compliant. You can't read such data with
unicode. Simple advice: don't use SQL_ASCII for storing non-ASCII data.
Use a specific encoding or Unicode.
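
To make the failure mode concrete, here is a minimal Python sketch (not psqlodbc code). It assumes, purely for illustration, that the client wrote its text as GBK bytes; the thread does not say which multi-byte encoding was actually used. An SQL_ASCII database keeps such bytes as-is, and a reader that then treats them as ASCII or as UTF-8 cannot decode them.

```python
# Illustration only: GBK is an assumed client encoding, not taken from the thread.
original = "中文"
stored = original.encode("gbk")  # what an SQL_ASCII database would keep, byte for byte


def try_decode(raw: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


for encoding in ("ascii", "utf-8", "gbk"):
    print(encoding, "->", try_decode(stored, encoding))
```

Only the encoding that produced the bytes gets the text back; a reader that has only been told "SQL_ASCII" has no way of knowing which one that was.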

Regards,
Andreas

Re: Bug or not about ASCII and Multi-Byte character set

From

Josef Springer

Date:

14 August 2005, 14:47:51

Multibyte characters are written correctly but read incorrectly by the ODBC driver. It seems to work correctly on WinXP clients; only Win2k clients may have this problem. This bug may be fixed in PostgreSQL 8.1.

Josef Springer

With kind regards,
Josef Springer
(Management)

-- the software company --

Orlando-di-Lasso Str. 2
D-85640 Putzbrunn

Tel.	++49(0)89 600 6920
Fax	++49(0)89 600 69220
mailto	Josef.Springer@joops.com
Website	http://www.joops.com

Re: Bug or not about ASCII and Multi-Byte character set

From

"Dave Page"

Date:

14 August 2005, 17:00:39

As I have said a number of times now, DO NOT USE ASCII FOR NON-ASCII DATA!!!

Whilst some apps like pgAdmin may do what you expect, other apps and interfaces may not.

Instead of ASCII, use the correct encoding for your data, or try Unicode if unsure.

Regards, Dave

Re: Bug or not about ASCII and Multi-Byte character set

From

"ZF"

Date:

14 August 2005, 23:48:23

But in my environment, with a WinXP Professional SP2 client, I see the same problem as on Win2k.

I used the PostgreSQL 8.0.3 win32 installer to create the db server.

Josef Springer wrote:
> Multibyte characters are written correctly but read incorrectly by the ODBC driver. It seems to work correctly on WinXP clients; only Win2k clients may have this problem. This bug may be fixed in PostgreSQL 8.1.

Re: Bug or not about ASCII and Multi-Byte character set

From

"ZF"

Date:

15 August 2005, 00:14:10

I would like to use UNICODE, but with Delphi BDE/ODBC there are a few problems that can't be resolved. The driver worked
OK with a UNICODE DB only after I rebuilt it with all SQL*W function names deleted from the 'psqlodbc_win32.def' file.

I only tried the ASCII mode to see whether the same problem exists there.
zeoslib may be good for Delphi, but its development stopped a long time ago.

Thanks.



Re: Bug or not about ASCII and Multi-Byte character set

From

Josef Springer

Date:

16 August 2005, 06:38:19

I did not use WinXP myself. Other posters report that it works fine with WinXP. Maybe there are other circumstances involved too.

Josef Springer

Re: Bug or not about ASCII and Multi-Byte character set

From

"Ben Trewern"

Date:

19 August 2005, 01:23:53

I'd like to make a few points on this issue.

1.  This problem should be mentioned in the FAQs as prominently as possible, as
it is difficult to rectify once you have fallen into this trap.

2. If you do have data in SQL_ASCII, the old ODBC driver worked, pgAdmin III
works, and Delphi with zeoslib works. I understand why there may be a problem, but
is it not possible to make the new 8.x driver work?

3. Correct me if I'm wrong, but SQL_ASCII is PostgreSQL's default encoding.
If this isn't sorted out then we'll see lots more of these requests for
help.

4. I'd like to disagree with your "DO NOT USE ASCII FOR NON-ASCII DATA", as
you will see if you read any of Tom Lane's many messages on the subject.  Here's a quote:

"The SQL_ASCII setting isn't an
encoding, really; it's a declaration of ignorance. In this setting
the server will just store and regurgitate whatever character strings
you send it. This will work fine, more or less, if all your clients
use exactly the same encoding and you don't care about functions like
upper()/lower()"

It should be possible to use SQL_ASCII and get out what you put in.  That's
how I read the above statement, and at the moment the 8.0 version of the ODBC
driver doesn't do that.  I only wish I knew enough C to look into this and
make an attempt at fixing the problem.
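
The behaviour Tom Lane describes can be sketched in a few lines of Python. This is a toy model, not server or driver code: the `PassThroughStore` class and the choice of Latin-1 and CP437 as client encodings are assumptions made for the example.

```python
class PassThroughStore:
    """Toy stand-in for an SQL_ASCII database: keeps bytes exactly as received."""

    def __init__(self):
        self._rows = []

    def insert(self, raw: bytes) -> None:
        self._rows.append(raw)

    def select_all(self):
        return list(self._rows)


store = PassThroughStore()
text = "Grüße"

# Client A writes and reads with the same encoding: the round trip is lossless.
store.insert(text.encode("latin-1"))
same_client = store.select_all()[0].decode("latin-1")

# Client B assumes a different encoding: the bytes are unchanged, the text is not.
other_client = store.select_all()[0].decode("cp437")

print(same_client == text)   # reader and writer agree on the encoding
print(other_client == text)  # reader and writer disagree on the encoding
```

The stored bytes never change; what changes is the interpretation each client applies to them, which is the "if" in the quote.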

Regards,

Ben

Re: Bug or not about ASCII and Multi-Byte character set

From

Andreas Pflug

Date:

19 August 2005, 05:36:30

Ben Trewern wrote:
> I'd like to make a few points on this issue.
>
> 1.  This problem should be mentioned in the FAQs as prominently as possible, as
> it is difficult to rectify once you have fallen into this trap.
>
> 2. If you do have data in SQL_ASCII, the old ODBC driver worked, pgAdmin III
> works, and Delphi with zeoslib works. I understand why there may be a problem, but
> is it not possible to make the new 8.x driver work?
>
> 3. Correct me if I'm wrong, but SQL_ASCII is PostgreSQL's default encoding.
> If this isn't sorted out then we'll see lots more of these requests for
> help.
>
> 4. I'd like to disagree with your "DO NOT USE ASCII FOR NON-ASCII DATA", as
> you will see if you read any of Tom Lane's many messages on the subject.  Here's a quote:
>
> "The SQL_ASCII setting isn't an
> encoding, really; it's a declaration of ignorance. In this setting
> the server will just store and regurgitate whatever character strings
> you send it. This will work fine, more or less, if all your clients
> use exactly the same encoding and you don't care about functions like
> upper()/lower()"

Mind the *if*.
You'll always create a mess when mixing drivers/apps. Ignoring proper
encodings is always non-standard, so don't expect drivers to support it,
or ask for it to be supported.
Server encoding is for fixing this issue, and all drivers are obeying
this. So if you want a guarantee to have a working configuration, DO NOT
USE ASCII FOR NON-ASCII DATA.

Regards,
Andreas

Re: Bug or not about ASCII and Multi-Byte character set

From

Marc Herbert

Date:

19 August 2005, 06:35:07

On Fri, Aug 19, 2005 at 10:35:28AM +0200, Andreas Pflug wrote:
> Ben Trewern wrote:
> >I'd like to make a few points on this issue.
> >
> >2. If you do have data in SQL_ASCII, the old ODBC driver worked, pgAdmin
> >III works, and Delphi with zeoslib works. I understand why there may be a
> >problem, but is it not possible to make the new 8.x driver work?
> >
> >
> >4. I'd like to disagree with your "DO NOT USE ASCII FOR NON-ASCII DATA", as
> >you will see if you read any of Tom Lane's many messages on the subject.  Here's a quote:
> >
> >"The SQL_ASCII setting isn't an
> >encoding, really; it's a declaration of ignorance. In this setting
> >the server will just store and regurgitate whatever character strings
> >you send it. This will work fine, more or less, if all your clients
> >use exactly the same encoding and you don't care about functions like
> >upper()/lower()"

If SQL_ASCII is/was equivalent to "ignoring encoding", then it
looks/looked pretty misnamed! For instance, if someone wants to convert
between UTF-16 and ASCII, then some conversion is definitely needed, and
ASCII==ignorance is then a bug.

Encoding ignorance should rather be called SQL_BINARY. A BINARY setting
for strings makes sense, just like when transferring text files using
FTP: you just don't trust FTP for encodings and use it like a
filesystem. BINARY just means that: "don't mess with encodings and
let something else deal with the issue".

Of course you cannot then use upper() or whatever if you don't want to
reveal the encoding you use...

> Mind the *if*.  You'll always create a mess when mixing drivers/apps.

I guess some people knew what they were doing and simply did not mix
drivers/apps, or mixed them in a way they had mastered and that worked.

> Ignoring proper encodings is always non-standard,
> so don't expect drivers to support it, or ask for it to be supported.

Well, reading the complaints, it seems this BINARY mode was
there before (by "accident"?), even if misnamed... so the BINARY mode
recently disappeared? No surprise some people become angry :-)

> Server encoding is for fixing this issue, and all drivers are obeying
> this. So if you want a guarantee to have a working configuration, DO NOT
> USE ASCII FOR NON-ASCII DATA.

Looks like people fixed issues by themselves before, and Postgres's
recent fixes do not interact nicely with theirs?

PS: BTW "unicode" is not one encoding but many different ones.

Re: Bug or not about ASCII and Multi-Byte character set

From

"Dave Page"

Date:

19 August 2005, 09:24:17


> -----Original Message-----
> From: pgsql-odbc-owner@postgresql.org
> [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of Ben Trewern
> Sent: 15 August 2005 13:44
> To: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Bug or not about ASCII and Multi-Byte
> character set
>
> I'd like to make a few points on this issue.
>
> 1.  This problem should be mentioned in the FAQs as prominently as possible, as
> it is difficult to rectify once you have fallen into this trap.

Agreed - I'll do that.

> 2. If you do have data in SQL_ASCII, the old ODBC driver worked, pgAdmin III
> works, and Delphi with zeoslib works. I understand why there may be a problem, but
> is it not possible to make the new 8.x driver work?

Actually, pgAdmin doesn't always work. If you try to insert Japanese
characters into an SQL_ASCII database it errors for example. Try it with
a Unicode database and it's fine.

Also, the old ODBC driver didn't always work either. If you check the
archives, you'll find people complaining of the same problem with the
non-libpq drivers.

> 3. Correct me if I'm wrong, but SQL_ASCII is PostgreSQL's default encoding.
> If this isn't sorted out then we'll see lots more of these requests for
> help.
>
> 4. I'd like to disagree with your "DO NOT USE ASCII FOR NON-ASCII DATA", as
> you will see if you read any of Tom Lane's many messages on the subject.
> Here's a quote:
>
> "The SQL_ASCII setting isn't an
> encoding, really; it's a declaration of ignorance. In this setting
> the server will just store and regurgitate whatever character strings
> you send it. This will work fine, more or less, if all your clients
> use exactly the same encoding and you don't care about functions like
> upper()/lower()"

Which is fine - however, as Tom also says:

"If you flip between SQL_ASCII and other settings, on either end, without
clearly understanding what's happening, you're likely to get very
confused."

Which is pretty much what seems to be happening: people are using a
Unicode ODBC driver, and data that cannot be represented as 7-bit ASCII
gets converted to '?'. They don't necessarily realise that
apps like Access will usually retrieve data as SQL_C_WCHAR if they can,
thus forcing a conversion to Unicode.
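
As a rough picture of what such a forced conversion does, the sketch below widens a byte string the way a strictly 7-bit ASCII conversion would. The `widen_as_ascii` helper is hypothetical and is not the driver's actual conversion routine; it only shows why every byte above 0x7F ends up as '?'.

```python
def widen_as_ascii(raw: bytes) -> str:
    """Hypothetical byte-to-wide-char step that only understands 7-bit ASCII."""
    # Bytes below 0x80 map one-to-one; anything else has no ASCII meaning.
    return "".join(chr(byte) if byte < 0x80 else "?" for byte in raw)


stored = "naïve".encode("latin-1")  # assumed client encoding, for illustration
print(widen_as_ascii(stored))       # na?ve
```

The ASCII part of the value survives untouched, which matches the original report that only the non-ASCII data was displayed incorrectly.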

Regards, Dave

Re: Bug or not about ASCII and Multi-Byte character set

From

"Joel Fradkin"

Date:

19 August 2005, 09:38:14

Looks like a lot of thoughts on this.

I would just like to say (as one who had to convert their data) thanks so
much to everyone who has worked so hard on the current driver.

I understand if folks are upset because of the change, but keep in mind this is
an open source effort, and my hat goes off to the accomplishment. The old
driver may have handled ASCII in a less imposing way, but it also did not
work.

Maybe one of those needing a plain ASCII ODBC driver that deals with
encoding (or doesn't, depending on the point of view) can make a version of
the current driver that does not require the user to specify one, and give it
a name like genericascii or something (or add it as a connect string
param?).

I for one am very happy the work was done, and it seems to be working well. I
did have a major error occur during my conversion (because of human error
doing a backup restore at midnight), but it was still worth the end result.

You cannot expect everyone to agree (any change is always hard to accept).
My users hate change even if it brings more flexibility. I agree it was a
major hassle for me to have to deal with the change, but we have been told the
driver functionality changed: if you want to use the driver on non-ASCII data,
then make sure the database reflects the proper encoding. I provided examples
of how to convert from one to the other, so bite the bullet and fix your
issues. In the end, having upper() etc. working will be and should be the way you
want it.
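
Those conversion examples are not part of this thread. As a rough sketch of the core step only, assuming the existing bytes are really Latin-1 (an assumption for this example; you need to know your own data's encoding), re-encoding to UTF-8 in Python could look like this. The `transcode` helper is a made-up name.

```python
def transcode(raw: bytes, source_encoding: str, target_encoding: str = "utf-8") -> bytes:
    """Re-encode bytes from the encoding they were written in to the target encoding."""
    # Latin-1 accepts every byte value, so a wrong guess will not raise an error;
    # the converted text has to be checked by eye or against known values.
    return raw.decode(source_encoding).encode(target_encoding)


legacy = b"caf\xe9"  # "café" as Latin-1 bytes
converted = transcode(legacy, "latin-1")
print(converted)  # b'caf\xc3\xa9'
```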

Joel Fradkin

Re: Bug or not about ASCII and Multi-Byte character set

From

Andreas Pflug

Date:

19 August 2005, 11:12:21

Marc Herbert wrote:

>If SQL_ASCII is/was equivalent to "ignoring encoding", then it
>looks/looked pretty misnamed!
>
It's not. It should be used for ASCII only, but the database system will
not barf if you offer it a byte with the upper bit set. You're simply on
your own.

>Encoding ignorance should rather be called SQL_BINARY. A BINARY setting
>for strings makes sense, just like when transferring text files using
>FTP: you just don't trust FTP for encodings and use it like a
>filesystem. BINARY just means that: "don't mess with encodings and
>let something else deal with the issue".
>
>
No, binary would include 0x00, which is definitely *not* a character but
the string terminator. If SQL_ASCII were implemented nowadays, there
would probably be a check that the upper bit is cleared, with the input
rejected otherwise. But since this part is really, really old, this can't
be changed without breaking zillions of old apps that used to ignore
proper storage encoding.
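
Such a check would be simple to express. The following Python sketch is hypothetical (SQL_ASCII does not do this, for the compatibility reasons just given), and the function names are made up for the example.

```python
def is_seven_bit(data: bytes) -> bool:
    """True if every byte has its upper bit cleared, i.e. is plain 7-bit ASCII."""
    return all(byte < 0x80 for byte in data)


def accept_strict_ascii(data: bytes) -> bytes:
    """Reject input that a strict ASCII-only store would not accept."""
    if not is_seven_bit(data):
        raise ValueError("byte with upper bit set in ASCII-only input")
    return data


print(is_seven_bit(b"hello"))     # True
print(is_seven_bit(b"caf\xe9"))   # False
```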

>I guess some people knew what they were doing and simply did not mix
>drivers/apps, or mixed them in a way they had mastered and that worked.
>
>
The latter, with the obvious chance of breaking as soon as the next app accesses
the data. This is certainly not the design goal of an RDBMS.

>Well, reading the complaints, it seems this BINARY mode was
>there before (by "accident"?),
>
No.

>Looks like people fixed issues by themselves before,
>
They didn't fix anything; they worked around the wrongly chosen server
encoding. I perfectly understand this, because initially I made the same
mistake.

> and Postgres's
>recent fixes do not interact nicely with theirs?
>
>
Automatically choosing the right client encoding and properly converting
in the driver had (and maybe still has) bugs, but fixing these will
certainly support the rules that proper design requires, not
ill-designed apps.

>PS: BTW "unicode" is not one encoding but many different ones.
>
>
Doesn't matter. It always means the current Unicode form for the system: in the
backend UTF-8, on Win32 UTF-16, on Linux UCS-4 or UTF-8 depending on the
interface definition. The *driver* has to take care of the proper
conversion, *if* it is instructed correctly (i.e. correct server encoding).
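
To illustrate how much these forms differ at the byte level, here is a short Python sketch that encodes one string in three Unicode encoding forms. The sample text and the little-endian variants are arbitrary choices for the example.

```python
text = "Grüße"

# Same five characters, three different byte representations.
encoded = {codec: text.encode(codec) for codec in ("utf-8", "utf-16-le", "utf-32-le")}

for codec, raw in encoded.items():
    print(f"{codec:10} {len(raw):2d} bytes  {raw.hex()}")
```

Each form decodes back to the same text, but only if the reader knows which form it has been handed.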

Regards,
Andreas

Re: Bug or not about ASCII and Multi-Byte character set

From

Marc Herbert

Date:

19 August 2005, 15:02:36

On Fri, Aug 19, 2005 at 04:11:48PM +0200, Andreas Pflug wrote:
> Marc Herbert wrote:
>
> >If SQL_ASCII is/was equivalent to "ignoring encoding", then it
> >looks/looked pretty misnamed!
> >
> It's not. It should be used for ASCII only, but the database system will
> not barf if you offer it a byte with the upper bit set. You're simply on
> your own.

Well this still looks like what I called a "BINARY/don't touch it"
accidental mode.

> >Encoding ignorance should rather be called SQL_BINARY. A BINARY setting
> >for strings makes sense, just like when transferring text files using
> >FTP: you just don't trust FTP for encodings and use it like a
> >filesystem. BINARY just means that: "don't mess with encodings and
> >let something else deal with the issue".
> >

> No, binary would include 0x00

This seems irrelevant to me, see below.

> which is definitely *not* a character but the string terminator.

Not everyone in the world uses 0x00 as a string terminator. C does, and
Postgres also, but Java does not, and I don't think database standards,
and even less encoding standards, say anything about this (please prove
me wrong, I'd really like to have a definitive answer on this).

I just tried inserting such a string into hsqldb using JDBC and it worked
perfectly fine. The Postgres JDBC driver is also "strings with
null-character"-ready, so this seems to be only a limitation of
Postgres.

By the way many ODBC function calls ask for the length of string
arguments, _optionally_ being SQL_NTS (Null-Terminated String). So it
seems some people here catered for strings with null characters even
in C!
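
A small Python model of that calling convention, for illustration only: `SQL_NTS` is the real ODBC constant (defined as -3 in sql.h), while `string_argument` is a hypothetical helper showing how a callee could interpret a buffer plus length pair.

```python
SQL_NTS = -3  # ODBC marker meaning "the string is null-terminated"


def string_argument(buffer: bytes, length: int) -> bytes:
    """Return the bytes a callee would treat as the string value."""
    if length == SQL_NTS:
        # Null-terminated: everything up to the first 0x00 byte.
        end = buffer.find(b"\x00")
        return buffer if end == -1 else buffer[:end]
    # Explicit length: embedded 0x00 bytes are ordinary data.
    return buffer[:length]


raw = b"ab\x00cd\x00"
print(string_argument(raw, SQL_NTS))  # b'ab'
print(string_argument(raw, 5))        # b'ab\x00cd'
```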

In any case, whether 0x00 is The String Terminator or not is not
relevant to the fact that there was an accidental "BINARY" string
encoding before.  If we learn that 0x00 is really The Database String
Terminator, then it can also be interpreted as a terminator even in
"encoding ignorance" mode since it translates into 0x00 for every
known encoding.

> >I guess some people knew what they were doing and simply did not mix
> >drivers/apps, or mixed them in a way they had mastered and that worked.

> The latter, with the obvious chance to break if the next app accesses
> the data. This is certainly not the design goal of a RDBMS.

There was a time, not so long ago, when everything encoding-related
was under-specified, all software was buggy, etc., so people had to cope
with it. They were probably pleased at that time to have this
accidental "BINARY" workaround available. One can easily understand
that they complain a little bit about the sudden removal of this
workaround and the unplanned migration to The Right Solution.

Of course on the other hand everyone can understand that the Postgres
developers want to get rid of this accidental BINARY string mode, and
that they are free to do what they want.

> >Well, reading the complaints, it seems this BINARY mode was
> >there before (by "accident"?),

> No.

Well, I am still waiting for some proof of the opposite (since this
0x00 stuff does not seem really related to it).

I was just reformulating Tom Lane's "SQL_ASCII ignorance" quote above,
which looked quite informed.

> >PS: BTW "unicode" is not one encoding but many different ones.

> Doesn't matter. It always means the current Unicode form for the system: in the
> backend UTF-8, on Win32 UTF-16, on Linux UCS-4 or UTF-8 depending on the
> interface definition.

Interesting. I hope this "current unicode for the system" concept is
well documented, because just saying "unicode" is not clear at all,
even if not ambiguous.

Regards,

Marc.