Re: Continuing encoding fun.... - Mailing list pgsql-odbc

From Marc Herbert
Subject Re: Continuing encoding fun....
Date
Msg-id 874q6510ze.fsf@meije.emic.fr
In response to Re: Continuing encoding fun....  ("Dave Page" <dpage@vale-housing.co.uk>)
List pgsql-odbc
"Dave Page" <dpage@vale-housing.co.uk> writes:

>> I agree that 4) can never work, because ODBC does not seem compatible
>> with multibyte apps by design. ODBC caters for "ANSI" and "Unicode"
>> strings, that's all.
>> <http://blogs.msdn.com/oldnewthing/archive/2004/05/31/144893.aspx>
>
> Actually our ANSI driver works quite nicely in various non-Unicode
> multibyte encodings such as Shift-JIS, EUC_CN, JOHAB and more. It'll
> even work with pure UTF-8 in multibyte mode using the ANSI API.

Great.

Out of curiosity, is this because all the ODBC code has a "don't
touch" attitude in this full-ANSI case, leaving all string data as is?
Or is there something more clever?  Who performs the conversion if the
database is in UTF-8 for instance? Multibyte cases seem to fall outside
the scope of the ODBC spec, which refers only to "ANSI" and "Unicode".

Thanks in advance for providing pointers if this is an FAQ. Even vague
references to the archive of this list would be nice.


>> However, I don't get why 3) does not work.
>>
>>  If the driver is a Unicode driver, the Driver Manager makes function
>>  calls as follows:
>>  - Converts an ANSI function (with the A suffix) to a Unicode function
>>  (with the W suffix) by converting the string arguments into Unicode
>>                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>  characters and passes the Unicode function to the driver.
>>
>>
>> Are you saying in 3) that the "converting" underlined above is
>> actually just a static cast?!
>
> No, not really a static cast, but a similar effect. Unicode chars
> 0000-007F are exactly the same as their ASCII counterparts, as are
> LATIN1 (0080-00FF). All the DM does is map the single byte values
> into low bytes of the unicode characters and passes them to the
> Unicode functions.

> This works just fine for pure ASCII/LATIN1, but
> not with other charactersets which don't directly map from their
> single byte values into Unicode.

Very interesting. Maybe the driver manager does so only because it
cannot get (or fails to get) the active codepage, and falls back on CP-1252?
(CP1252 ~= latin1, <http://czyborra.com/charsets/codepages.html#CP1252>)
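
To make the low-byte mapping described above concrete, here is a minimal
Python sketch. It is not the driver manager's actual code: `zero_extend`
is a hypothetical helper that models "put each byte into the low byte of
a Unicode character", which is my reading of the description.

```python
def zero_extend(raw: bytes) -> str:
    """Hypothetical model of the DM behaviour: one byte -> one code point."""
    return "".join(chr(b) for b in raw)

# For LATIN1 data the widening happens to be a correct conversion.
latin1_bytes = "café".encode("latin-1")
latin1_widened = zero_extend(latin1_bytes)

# For a multibyte encoding such as Shift-JIS it is not: each byte of a
# two-byte character becomes its own, unrelated, code point.
sjis_bytes = "日本".encode("shift_jis")
sjis_widened = zero_extend(sjis_bytes)
sjis_decoded = sjis_bytes.decode("shift_jis")  # a real, codepage-aware conversion

print(latin1_widened == "café")          # True
print(sjis_widened == "日本")             # False
print(len(sjis_bytes), len(sjis_widened), len(sjis_decoded))
```

Under that assumption the widening is the same operation as decoding the
bytes as ISO-8859-1, whatever the real encoding of the data was.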


>> Is this "bug" true for every driver manager out there?

> It's not really a bug, but I believe so, yes.

Including unixODBC and iODBC, for instance?


> It gets corrected by
> the more advanced drivers though - for example, the SQL server
> driver might see a 'Š' character (8A). It knows the local charset is
> LATIN4, so it can then rewrite that character to 0160, the Unicode
> equivalent.

Are you saying that the SQL server driver is fixing the flawed
conversion job of the driver manager, finally taking the codepage into
account? Surprising to say the least!

By the way, 0x8A is not in the printable range of Latin-4:
<http://czyborra.com/charsets/iso8859.html#ISO-8859-4>
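
A quick way to check where 'Š' lives is to ask Python's standard codecs.
This is only a sketch; it says nothing about what any particular driver
does internally, and the variable names are mine.

```python
raw = b"\x8a"

# CP1252 (the Windows "ANSI" codepage for Western Europe) puts Š at 0x8A.
as_cp1252 = raw.decode("cp1252")

# ISO-8859-4 (Latin-4) and ISO-8859-1 both treat 0x8A as a C1 control code.
as_latin4 = raw.decode("iso8859_4")
as_latin1 = raw.decode("latin-1")

# In Latin-4, Š is encoded at a different position.
latin4_bytes = "Š".encode("iso8859_4")

print(f"cp1252 0x8A -> U+{ord(as_cp1252):04X}")   # U+0160
print(f"latin4 0x8A -> U+{ord(as_latin4):04X}")   # U+008A
print(f"latin1 0x8A -> U+{ord(as_latin1):04X}")   # U+008A
print(f"Š in latin4 -> 0x{latin4_bytes.hex().upper()}")  # 0xA9
```

So a 0x8A byte that means 'Š' points at CP1252 rather than Latin-4, and a
plain low-byte mapping would yield U+008A instead of U+0160.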


> Our Unicode driver will simply leave it

Of course, you don't want to perform a conversion that is supposed to
have already happened.


> Regardless though, the encoding bug reports have all-but stopped now
> we ship 2 drivers again.

And having two different drivers is indeed the approach the ODBC
documentation suggests, from what I understand of it.

Thanks a lot for your insights.
