Re: Continuing encoding fun.... - Mailing list pgsql-odbc
From | Marc Herbert |
---|---|
Subject | Re: Continuing encoding fun.... |
Date | |
Msg-id | 874q6510ze.fsf@meije.emic.fr |
In response to | Re: Continuing encoding fun.... ("Dave Page" <dpage@vale-housing.co.uk>) |
List | pgsql-odbc |
"Dave Page" <dpage@vale-housing.co.uk> writes: >> I agree that 4) can never work, because ODBC does not seem compatible >> with multibyte apps by design. ODBC caters for "ANSI" and "Unicode" >> strings, that's all. >> <http://blogs.msdn.com/oldnewthing/archive/2004/05/31/144893.aspx> > > Actually our ANSI driver works quite nicely in various non-Unicode > multibyte encodings such as Shift-JIS, EUC_CN, JOHAB and more. It'll > even work with pure UTF-8 in multibyte mode using the ANSI API. Great. Out of curiosity, is this because all the ODBC code has a "don't touch" attitude in this full-ANSI case, leaving all string data as is? Or is there something more clever? Who performs the conversion if the database is in UTF-8 for instance? Multibyte cases seem to fall outside the scope of the ODBC spec, which refers only to "ANSI" and "Unicode". Thanks in advance for providing pointers if this is an FAQ. Even vague references to the archive of this list would be nice. >> However, I don't get why 3) does not work. >> >> If the driver is a Unicode driver, the Driver Manager makes function >> calls as follows: >> - Converts an ANSI function (with the A suffix) to a Unicode function >> (with the W suffix) by converting the string arguments into Unicode >> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >> characters and passes the Unicode function to the driver. >> >> >> Are you saying in 3) that the "converting" underlined above is >> actually just a static cast?! > > No, not really a static cast, but a similar effect. Unicode chars > 0000-007F are exactly the same as their ASCII counterparts, as are > LATIN1 (0080-00FF). All the DM does is map the single byte values > into low bytes of the unicode characters and passes them to the > Unicode functions. > This works just fine for pure ASCII/LATIN1, but > not with other charactersets which don't directly map from their > single byte values into Unicode. Very interesting. 
Maybe the driver manager does so only because the it cannot/fails to get the active codepage, falling back on CP-1252? (CP1252 ~= latin1, <http://czyborra.com/charsets/codepages.html#CP1252>) >> Is this "bug" true for every driver manager out there? > It's not really a bug, but I believe so, yes. including unixodbc and iodbc for instance? > It gets corrected by > the more advanced drivers though - for example, the SQL server > driver might see a 'Š' character (8A). It knows the local charset is > LATIN4, so it can then rewrite that character to 0160, the Unicode > equivalent. Are you saying that the SQL server driver is fixing the flawed conversion job of the driver manager, finally taking the codepage into account? Surprising to say the least! By the way 0x8A is not in the range of latin4 <http://czyborra.com/charsets/iso8859.html#ISO-8859-4> > Our Unicode driver will simply leave it Of course, you don't want to perform a conversion that is supposed to already have happeneD. > Regardless though, the encoding bug reports have all-but stopped now > we ship 2 drivers again. And having two different drivers is indeed the approach induced by the ODBC documentation, from what I've got from it. Thanks a lot for your insights.
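
For the archive, here is a minimal Python sketch of the byte-to-low-byte mapping discussed above, next to a codepage-aware conversion. It only illustrates the effect described in this thread and is not taken from any driver manager's source. The helper names `widen_bytes` and `decode_with_codepage` are hypothetical, and the choice of CP-1252 as the "local" codepage is an assumption made for the example.

```python
def widen_bytes(data: bytes) -> str:
    """Hypothetical model of the DM behaviour described in the thread:
    each single byte becomes the low byte of one Unicode character."""
    return "".join(chr(b) for b in data)


def decode_with_codepage(data: bytes, codepage: str) -> str:
    """Codepage-aware conversion, as a driver that knows the local
    charset could perform it (the codepage is an assumption here)."""
    return data.decode(codepage)


# Pure ASCII and LATIN1 survive the widening unchanged.
print(widen_bytes(b"abc") == "abc")
print(widen_bytes("é".encode("latin-1")) == "é")

# Byte 0x8A: widening yields U+008A, while CP-1252 maps it to U+0160.
print(hex(ord(widen_bytes(b"\x8a"))))
print(hex(ord(decode_with_codepage(b"\x8a", "cp1252"))))

# A multibyte encoding such as Shift-JIS is split into one bogus
# character per byte by the widening, instead of one per kanji.
sjis = "日本".encode("shift_jis")
print(len(widen_bytes(sjis)), len(decode_with_codepage(sjis, "shift_jis")))
```

The widening is the same operation as decoding the bytes as LATIN1, which is why it is harmless for ASCII/LATIN1 data and wrong for everything else.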