Thread: Continuing encoding fun....
I've been thinking about this whilst getting dragged round the shops today, and having read Marko's, Johann's, Hiroshi's and other emails, not to mention bits of the ODBC spec, here's where I think we stand.

1) The current driver works as expected with Unicode apps.

2) 7 bit ASCII apps work correctly. The driver manager maps the ANSI functions to the Unicode ones, and this works because (as I think Marko pointed out) the basic latin chars map directly into the lower Unicode characters (see http://www.unicode.org/charts/PDF/U0000.pdf).

3) Some other single byte LATIN encodings do not work. This is because the characters do not map directly into Unicode 80-FF (http://www.unicode.org/charts/PDF/U0080.pdf).

4) Multibyte apps do not work. I believe that in fact they never will with a Unicode driver, because multibyte characters simply won't map into Unicode in the same way that ASCII does. The user cannot opt to use the non-wide functions, because the DM automatically maps them to the Unicode versions.

Because the Driver Manager forces the user to use the *W functions if they exist, I cannot see any way to make 3 or 4 work with a Unicode driver. If we were to try to detect what encoding to use based on the OS settings and convert on the fly, we would most likely break any apps that try to do the right thing by using Unicode themselves. Does that sound reasonable?

Therefore, it seems to me that the only thing to do is to reinstate the #ifdef UNICODE preprocessor definitions in the source code (that I now wish I hadn't removed!), and ship 2 versions of the driver - a Unicode one, and an ANSI/Multibyte version (i.e. what 07.xx was).

Thoughts/comments? Hiroshi, what do other vendors do for the Japanese market?

Regards, Dave.
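[Editor's note: to make points 2) and 3) above concrete, here is a small standalone C sketch - not Driver Manager or driver code, and the Latin-2 example byte is our own - of what a byte-for-byte widening does. It is right whenever the source bytes are ASCII or Latin-1, whose code points coincide with U+0000-U+00FF, and wrong for any other single-byte (or multibyte) encoding.]

#include <stdio.h>
#include <wchar.h>

/* widen each byte so that the byte value becomes the code point, unchanged */
static void naive_widen(const unsigned char *in, wchar_t *out)
{
    while (*in)
        *out++ = (wchar_t)*in++;
    *out = L'\0';
}

int main(void)
{
    wchar_t w[8];

    naive_widen((const unsigned char *)"\x41", w);   /* 'A' - same in ASCII and Unicode */
    printf("U+%04X\n", (unsigned)w[0]);              /* prints U+0041 - correct */

    naive_widen((const unsigned char *)"\xA3", w);   /* 0xA3 is 'L with stroke' (U+0141) in Latin-2 */
    printf("U+%04X\n", (unsigned)w[0]);              /* prints U+00A3, the pound sign - wrong character */

    return 0;
}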
Hi Dave.

> With this patch, you can build either the old style ANSI/Multibyte
> driver, or the Unicode driver. I've also removed the -libpq suffix that
> was added for testing, as this patch gives the driver a new name anyway.
> When installed on a Windows system, you then get:
>
> psqlodbca.dll "PostgreSQL ANSI"
> psqlodbcw.dll "PostgreSQL Unicode"

Is it meant as follows after all?

with libpq version
  psqlodbca.dll "PostgreSQL ANSI"
  psqlodbcw.dll "PostgreSQL Unicode"
without libpq version
  psqlodbca.dll "PostgreSQL ANSI"
  psqlodbcw.dll "PostgreSQL Unicode"

> Unless anyone has a better solution, I think this is the best fix to
> allow users with non-Unicode friendly apps to work as they used to with
> the older driver.

One complaint, although I have not fully tried it yet :-(
I think that the CRLF line endings of the base code and patch make it hard to use.

> Please shout ASAP if you object!!

one vote is invested.

Regards,
Hiroshi Saito
> -----Original Message-----
> From: Hiroshi Saito [mailto:saito@inetrt.skcapi.co.jp]
> Sent: 05 September 2005 05:57
> To: Dave Page; pgsql-odbc@postgresql.org
> Cc: Marko Ristola; Johann Zuschlag; Anoop Kumar
> Subject: Re: [ODBC] Continuing encoding fun....
>
> Hi Dave.
>
> > With this patch, you can build either the old style ANSI/Multibyte
> > driver, or the Unicode driver. I've also removed the -libpq suffix that
> > was added for testing, as this patch gives the driver a new name anyway.
> > When installed on a Windows system, you then get:
> >
> > psqlodbca.dll "PostgreSQL ANSI"
> > psqlodbcw.dll "PostgreSQL Unicode"
>
> Is it meant as follows after all?
>
> with libpq version
>   psqlodbca.dll "PostgreSQL ANSI"
>   psqlodbcw.dll "PostgreSQL Unicode"
> without libpq version
>   psqlodbca.dll "PostgreSQL ANSI"
>   psqlodbcw.dll "PostgreSQL Unicode"

Yes - I am not concerned with the socket version of the driver - in fact, I was going to talk to Anoop about removing the old code because we've had at least a couple of cases of people patching the wrong part, or mistakenly using the socket code.

Either way, we're certainly not going to release the non-libpq version any more.

> > Unless anyone has a better solution, I think this is the best fix to
> > allow users with non-Unicode friendly apps to work as they used to with
> > the older driver.
>
> One complaint, although I have not fully tried it yet :-(
> I think that the CRLF line endings of the base code and patch make it
> hard to use.

Yes, I was too tired to try to fix the patch to remove the CRLF changes :-( Still, they need to be fixed anyway.

BTW, your version misses the changes to installer/psqlodbcm.wxs...

> > Please shout ASAP if you object!!
>
> one vote is invested.

:-)

Regards, Dave
Dave Page wrote:

>> -----Original Message-----
>> From: pgsql-odbc-owner@postgresql.org
>> [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of Dave Page
>> Sent: 03 September 2005 20:48
>> To: pgsql-odbc@postgresql.org
>> Cc: Hiroshi Saito; Marko Ristola; Johann Zuschlag
>> Subject: [ODBC] Continuing encoding fun....
>>
>> Therefore, it seems to me that the only thing to do is to reinstate the
>> #ifdef UNICODE preprocessor definitions in the source code (that I now
>> wish I hadn't removed!), and ship 2 versions of the driver - a Unicode
>> one, and an ANSI/Multibyte version (i.e. what 07.xx was).
>
> Attached is a patch to do this (apologies for the size, it seems that
> options.c had broken line ends).
>
> With this patch, you can build either the old style ANSI/Multibyte
> driver, or the Unicode driver. I've also removed the -libpq suffix that
> was added for testing, as this patch gives the driver a new name anyway.
> When installed on a Windows system, you then get:
>
> psqlodbca.dll "PostgreSQL ANSI"
> psqlodbcw.dll "PostgreSQL Unicode"
>
> Unless anyone has a better solution, I think this is the best fix to
> allow users with non-Unicode friendly apps to work as they used to with
> the older driver.
>
> Please shout ASAP if you object!!
>
> Regards, Dave

It is ok for me.

Can you send me the dll for the ANSI driver?

Is it not possible to just put a switch in the driver menu?

Regards,

Johann
> -----Original Message-----
> From: Johann Zuschlag [mailto:zuschlag2@online.de]
> Sent: 05 September 2005 10:40
> To: pgsql-odbc@postgresql.org
> Cc: Dave Page
> Subject: Re: [ODBC] Continuing encoding fun....
>
> It is ok for me.
>
> Can you send me the dll for the ANSI driver?

Yup, I'll send it offlist.

> Is it not possible to just put a switch in the driver menu?

Unfortunately not, because it affects the functions exported by the DLL - if the *W functions exist, the DM will map all calls to the *W versions, even if the app uses the non-wide version.

Regards, Dave
> Either way, we're certainly not going to release the non-libpq version
> any more.

Ok, I also think that it stands to reason.

> BTW, your version misses the changes to installer/psqlodbcm.wxs...

Uga... Sorry.

Ah, I noticed a part that looks strange. Please check it. :-)

Regards,
Hiroshi Saito
Hi Dave,

It would be wise to remove the socket code from the new driver. I will let you know as soon as it gets completed.

Regards

Anoop

> -----Original Message-----
> From: Dave Page [mailto:dpage@vale-housing.co.uk]
> Sent: Monday, September 05, 2005 12:47 PM
> To: Hiroshi Saito; pgsql-odbc@postgresql.org
> Cc: Marko Ristola; Johann Zuschlag; Anoop Kumar
> Subject: RE: [ODBC] Continuing encoding fun....
>
> Yes - I am not concerned with the socket version of the driver - in
> fact, I was going to talk to Anoop about removing the old code because
> we've had at least a couple of cases of people patching the wrong part,
> or mistakenly using the socket code.
>
> Either way, we're certainly not going to release the non-libpq version
> any more.
> -----Original Message-----
> From: Anoop Kumar [mailto:anoopk@pervasive-postgres.com]
> Sent: 06 September 2005 06:25
> To: Dave Page; Hiroshi Saito; pgsql-odbc@postgresql.org
> Cc: Marko Ristola; Johann Zuschlag
> Subject: RE: [ODBC] Continuing encoding fun....
>
> Hi Dave,
>
> It would be wise to remove the socket code from the new driver. I will
> let you know as soon as it gets completed.

Now there's a coincidence - I was going to email you about that today!! We've had a couple of instances of people mistakenly compiling the wrong version, and even fixing bugs in the socket code :-(

Shall I apply the ANSI/Unicode patch first? It's quite invasive of course - possibly more so than libpq/socket.

Regards, Dave
> -----Original Message-----
> From: Hiroshi Saito [mailto:saito@inetrt.skcapi.co.jp]
> Sent: 05 September 2005 18:35
> To: Dave Page; pgsql-odbc@postgresql.org
> Cc: Marko Ristola; Johann Zuschlag; Anoop Kumar
> Subject: Re: [ODBC] Continuing encoding fun....
>
> > Either way, we're certainly not going to release the non-libpq version
> > any more.
>
> Ok, I also think that it stands to reason.
>
> > BTW, your version misses the changes to installer/psqlodbcm.wxs...
>
> Uga... Sorry.
>
> Ah, I noticed a part that looks strange. Please check it. :-)

Re patch:

--- connection.c.orig	Tue Sep  6 01:47:23 2005
+++ connection.c	Tue Sep  6 02:13:53 2005
@@ -1545,7 +1545,7 @@
 	if (self->unicode)
 	{
 		if (!self->client_encoding ||
-		    !stricmp(self->client_encoding, "UNICODE"))
+		    stricmp(self->client_encoding, "UNICODE"))
 		{
 			QResultClass	*res;
 			if (PG_VERSION_LT(self, 7.1))

The opposite of this change was made in 1.92 of connection.c:
http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/psqlodbc/psqlodbc/connection.c?rev=1.92&content-type=text/x-cvsweb-markup

It seems to me that the current case is correct - in the Unicode driver we *must* run with client_encoding = 'UNICODE' or it won't work properly. That said, I wonder if we shouldn't just remove the if() altogether, and unconditionally set the client encoding for the Unicode driver.

Don't forget, this won't affect the ANSI/Multibyte case because it's inside a "#ifdef UNICODE_SUPPORT".

What do you think Anoop?

Regards, Dave
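[Editor's note: the "unconditionally set the client encoding" idea can be pictured with plain libpq. This is only a standalone sketch of the effect - the connection string is made up and the driver goes through its own connection machinery rather than this call.]

#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=postgres");        /* illustrative connection string */
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        return 1;
    }

    /* Force the client encoding regardless of the server-side default. */
    if (PQsetClientEncoding(conn, "UNICODE") != 0)        /* 'UNICODE' is PostgreSQL's name for UTF-8 */
        fprintf(stderr, "could not set client_encoding: %s", PQerrorMessage(conn));

    PQfinish(conn);
    return 0;
}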
> Done. Once you've removed the socket code, a new release seems in order.
> Sound OK to you?

OK for me. A new release would be proper.

> It seems to me that the current case is correct - in the Unicode driver
> we *must* run with client_encoding = 'UNICODE' or it won't work
> properly. That said, I wonder if we shouldn't just remove the if()
> altogether, and unconditionally set the client encoding for the Unicode
> driver.
>
> Don't forget, this won't affect the ANSI/Multibyte case because it's
> inside a "#ifdef UNICODE_SUPPORT".
>
> What do you think Anoop?

As this is already inside "#ifdef UNICODE_SUPPORT", I don't find the necessity for checking it again.

Regards

Anoop
Hi Dave,

> It seems to me that the current case is correct - in the Unicode driver
> we *must* run with client_encoding = 'UNICODE' or it won't work
> properly. That said, I wonder if we shouldn't just remove the if()
> altogether, and unconditionally set the client encoding for the Unicode
> driver.

That assumption seems to be ok, even though I still need it for further testing. But I can use the version you've sent me.

Regards,

Johann
There is one inconvenient bug in the driver. I tested this with the very latest CVS version, so it still exists.

psql
> select * from test1;
(104 rows)

isql marko XXX
2 rows returned

I get the above result by configuring .odbc.ini:

[marko]
Fetch = 2
UseDeclareFetch = 1

So unfortunately it fetches too few rows.

Old behaviour:
1. SELECT * from test1 with cursor STM7737819
2. While more rows; do Fetch at most 2 rows; done
3. CLOSE CURSOR, maybe with STMT_DROP.

This hack was implemented to support SELECT * over more than a few million rows. Without it, on 32 bit operating systems, query results are limited to maybe 8 million rows before a memory allocation failure. I don't remember the exact number of millions; it depends heavily on the result row width and on the operating system's memory architecture.

So, it seems that the hack implementation has been partially removed, but it is still active.

Regards,
Marko Ristola
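[Editor's note: the "old behaviour" above - declare a cursor, keep FETCHing a small batch until the backend returns no more rows, then close - looks roughly like the following standalone libpq sketch. The connection string and table name are just the ones from the example, and real code would also check each PGresult.]

#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn   *conn = PQconnectdb("dbname=marko");          /* illustrative connection string */
    PGresult *res;

    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        return 1;
    }

    PQclear(PQexec(conn, "BEGIN"));
    PQclear(PQexec(conn, "DECLARE cur CURSOR FOR SELECT * FROM test1"));

    for (;;) {
        res = PQexec(conn, "FETCH 2 FROM cur");            /* Fetch = 2 */
        int n = PQntuples(res);
        PQclear(res);
        if (n == 0)
            break;                                         /* cursor exhausted: stop fetching */
        /* ... hand the n rows to the application here ... */
    }

    PQclear(PQexec(conn, "CLOSE cur"));
    PQclear(PQexec(conn, "COMMIT"));
    PQfinish(conn);
    return 0;
}

The question in the bug report is effectively whether the driver keeps re-issuing the FETCH until the cursor is exhausted, or stops after the first batch.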
Hi Marko.

It is strange...

> So I get the above result by configuring .odbc.ini:
> [marko]
> Fetch = 2
> UseDeclareFetch = 1

I do not find any problems with the driver for Windows. Probably it is something peculiar to the Linux version?? On Windows, though, CACHE is used as FETCH.

Although I want to see the log, Anoop or Dave may be able to tell immediately. :-)

Regards,
Hiroshi Saito
zuschlag2@online.de wrote:

> Hi Dave
>
>> It seems to me that the current case is correct - in the Unicode driver
>> we *must* run with client_encoding = 'UNICODE' or it won't work
>> properly. That said, I wonder if we shouldn't just remove the if()
>> altogether, and unconditionally set the client encoding for the Unicode
>> driver.

The following might be interesting for you: if I activate the ISO C 99 API, I can do the following. (I thought that I used ANSI C 99, but the correct name for the standard I meant is ISO C 99. It will become the default later; maybe it already is with the newest GCCs.)

#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

char cbuf[500];
wchar_t wbuf[500];

setlocale(LC_CTYPE, "");

strcpy(cbuf, "Some multibyte text");
swprintf(wbuf, 500, L"%s", cbuf);

Now the text is in wchar_t's internal format, maybe UCS-2. The following also works:

wcscpy(wbuf, L"Some UNICODE text");
sprintf(cbuf, "%ls", wbuf);

So, the UCS-2 and multibyte conversion under ISO C 99 seems to be very easy. With GCC, on Debian Sarge, this can be done as follows:

gcc -std=c99

I don't have more time to test now, at least today. Iconv seems to be the solution for more advanced conversions under Linux.

Regards,
Marko
Maybe those with more than 8 million row tables could move on to 64 bit operating systems. Memory hogging would not be a problem anymore with a big enough swap space. So making sure the feature is not active would fix it.

PostgreSQL works with bad performance with UseDeclareFetch by design: with UseDeclareFetch, the backend assumes that only a few rows will be fetched. Maybe users are not prepared to move on to 64 bit so quickly.

Now to the analysis of the problem:

The problem seems to be that, with UseDeclareFetch=1 and Fetch=2, the libpq psqlodbc driver issues the FETCH only once against the PostgreSQL backend.

The feature would be nice if PGAPI_ExtendedFetch() could fetch more tuples with FETCH from the backend once the first two tuples have been processed. Now it just sees that the FETCH returned two rows, and after those two rows it will not fetch any more.

So I tracked down the problem with a debugger into PGAPI_ExtendedFetch. It seems that the earlier implementation was that SQLFetch somehow called QR_fetch_tuples() to fetch more rows from the backend. If QR_fetch_tuples() didn't return more rows, the fetching from the backend would stop.

If the user application asks for the number of rows, the ODBC driver is forced to read everything into memory.

Regards,
Marko Ristola

Hiroshi Saito wrote:

> I do not find any problems with the driver for Windows.
> Probably it is something peculiar to the Linux version??

"Dave Page" <dpage@vale-housing.co.uk> writes:

> I've been thinking about this whilst getting dragged round the shops
> today, and having read Marko's, Johann's, Hiroshi's and other emails,
> not to mention bits of the ODBC spec, here's where I think we stand.
>
> 1) The current driver works as expected with Unicode apps.
>
> 2) 7 bit ASCII apps work correctly. The driver manager maps the ANSI
> functions to the Unicode ones, and this works because (as I think Marko
> pointed out) the basic latin chars map directly into the lower Unicode
> characters (see http://www.unicode.org/charts/PDF/U0000.pdf).
>
> 3) Some other single byte LATIN encodings do not work. This is because
> the characters do not map directly into Unicode 80-FF
> (http://www.unicode.org/charts/PDF/U0080.pdf).
>
> 4) Multibyte apps do not work. I believe that in fact they never will
> with a Unicode driver, because multibyte characters simply won't map
> into Unicode in the same way that ASCII does. The user cannot opt to use
> the non-wide functions, because the DM automatically maps them to the
> Unicode versions.
>
> Because the Driver Manager forces the user to use the *W functions if
> they exist, I cannot see any way to make 3 or 4 work with a Unicode
> driver. If we were to try to detect what encoding to use based on the OS
> settings and convert on the fly, we would most likely break any apps
> that try to do the right thing by using Unicode themselves.

In a perfect world there are no "unicode apps": the internal encoding is set by the system, properly written apps use abstract TCHAR/wchar_t characters without knowing anything about what encoding they use, and programs communicating with the outside (such as a database driver) should query the system encoding using something like setlocale(), and perform any appropriate conversion on the fly.

Excerpt from "info libc - Character Set Handling" of GNU libc 2.3.2
<http://www.gnu.org/software/libc/manual/html_node/Character-Set-Handling.html>

   The question remaining is: how to select the character set or
   encoding to use. The answer: you cannot decide about it yourself, it
   is decided by the developers of the system or the majority of the
   users. Since the goal is interoperability one has to use whatever the
   other people one works with use.

<http://www.faqs.org/docs/Linux-HOWTO/Unicode-HOWTO.html#s6> says the same thing: "Avoid direct access with Unicode. This is a task of the platform's internationalization framework."

Of course those two quotes are targeted at application developers. They imply that a driver communicating with the outside world/database should carry out any conversion task.

However, I have no idea how far this theory is from reality, from the ODBC API, and from Windows, sorry :-( I just was woken up by the "unicode apps" word. I tried to follow the discussions here but got lost.

My 2 cents.
> -----Original Message-----
> From: pgsql-odbc-owner@postgresql.org
> [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of Marc Herbert
> Sent: 07 September 2005 19:16
> To: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Continuing encoding fun....
>
> In a perfect world there are no "unicode apps",

In my perfect world, everything is one flavour of Unicode, and everyone can consequently read and write everything with no compatibility problems at all. But then I like to retreat to my little fantasy world from time to time...

> However, I have no idea how far this theory is from reality, from
> the ODBC API, and from Windows, sorry :-( I just was woken up by
> the "unicode apps" word. I tried to follow the discussions here but
> got lost.

The ODBC API (defined by Microsoft of course) includes a number of *W functions which are Unicode variants of the ANSI versions with the same name. The ODBC driver manager maps all ANSI function calls to the Unicode equivalents if they exist, on the assumption that ASCII chars will map correctly into Unicode (which they do if they are 7 bit chars).

In theory we could attempt to recode incoming ASCII or multibyte ourselves I guess, but it's not going to be a particularly easy task (and will mean a performance loss), and given that some apps don't play nicely with Unicode drivers anyway, we might as well kill 2 birds with one stone and just ship 2 versions of the driver.

Regards, Dave.

"Dave Page" <dpage@vale-housing.co.uk> writes:

> The ODBC API (defined by Microsoft of course) includes a number of *W
> functions which are Unicode variants of the ANSI versions with the same
> name.

I think one extra layer of confusion is added by the fact that POSIX defines the type wchar_t as "the abstract/platform-dependent character", W just meaning here "W like Wide enough", whereas Microsoft defines WCHAR as "W like Unicode", Microsoft's abstract character being "TCHAR".

Am I right here?
> -----Original Message-----
> From: pgsql-odbc-owner@postgresql.org
> [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of Marc Herbert
> Sent: 08 September 2005 11:10
> To: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Continuing encoding fun....
>
> "Dave Page" <dpage@vale-housing.co.uk> writes:
>
> > The ODBC API (defined by Microsoft of course) includes a number of *W
> > functions which are Unicode variants of the ANSI versions with the same
> > name.
>
> I think one extra layer of confusion is added by the fact that POSIX
> defines the type wchar_t as "the abstract/platform-dependent
> character", W just meaning here "W like Wide enough", whereas
> Microsoft defines WCHAR as "W like Unicode", Microsoft's abstract
> character being "TCHAR".
>
> Am I right here?

That certainly wouldn't help matters. We already have ucs2<->utf-8 conversion in various places to deal with *nix/win32 differences - trying to properly munge other encodings into those correctly wouldn't be fun!

As I said though - there are other advantages to having a non-Unicode driver (like, BDE won't barf for example), so why go to all the hassle, when we can just advise the non-Unicode folks to use the ANSI driver?

Regards, Dave.
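[Editor's note: the wchar_t/WCHAR/TCHAR distinction discussed above can be checked with a tiny program. Nothing here is ODBC-specific; the Windows branch assumes <windows.h>/<tchar.h> and a build where the UNICODE macro may or may not be defined.]

#include <stdio.h>
#include <wchar.h>
#ifdef _WIN32
#include <windows.h>   /* WCHAR, and TCHAR switching on the UNICODE macro */
#include <tchar.h>
#endif

int main(void)
{
    /* 2 on Windows (UTF-16 code units), 4 on glibc (UCS-4) */
    printf("sizeof(wchar_t) = %u\n", (unsigned)sizeof(wchar_t));
#ifdef _WIN32
    /* 1 in an ANSI build, 2 in a UNICODE build: the "abstract" character */
    printf("sizeof(TCHAR)   = %u\n", (unsigned)sizeof(TCHAR));
#endif
    return 0;
}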
There is one thing that might be good for you to know.

I tried the wprintf("%s", char_text) and printf("%ls", wchar_text) methods. They don't work with LATIN1 under Linux.

gcc does not support non-ASCII multibyte conversions; gcc leaves that responsibility to library functions. That is so even for GCC 4.0.

So, at least libiconv is a good way to handle the multibyte conversions robustly under Linux. That works if and only if the libiconv library works. libiconv is LGPL licensed.

Regards,
Marko Ristola

> However, I have no idea how far this theory is from reality, from
> the ODBC API, and from Windows, sorry :-( I just was woken up by
> the "unicode apps" word. I tried to follow the discussions here but
> got lost.
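[Editor's note: a minimal iconv sketch of the kind of conversion Marko means; the strings and encoding names are just an example. With glibc the conversion routines live in libc itself, with a standalone libiconv you additionally link -liconv.]

#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char   in[]   = "caf\xe9";            /* "café" encoded in Latin-1 */
    char   out[16];
    char  *inp    = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof(out) - 1;

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");   /* to-encoding first */
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        perror("iconv");
    *outp = '\0';

    printf("%s\n", out);                  /* the 0xE9 byte is now a two-byte UTF-8 sequence */
    iconv_close(cd);
    return 0;
}

This builds with the "gcc -std=c99" invocation mentioned earlier in the thread.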
Marko Ristola <Marko.Ristola@kolumbus.fi> writes:

> There is one thing that might be good for you to know.
>
> I tried the wprintf("%s", char_text) and printf("%ls", wchar_text)
> methods. They don't work with LATIN1 under Linux.

What do you mean by that? Could you post a short sample code? Since wchar_t is 32 bits for glibc, wchar_text cannot be LATIN1, which is 8 bits long...

> gcc does not support non-ASCII multibyte conversions.

Well, I would find it weird for a compiler to perform such conversions.

"Dave Page" <dpage@vale-housing.co.uk> writes:

> 3) Some other single byte LATIN encodings do not work. This is because
> the characters do not map directly into Unicode 80-FF
> (http://www.unicode.org/charts/PDF/U0080.pdf).
>
> 4) Multibyte apps do not work. I believe that in fact they never will
> with a Unicode driver, because multibyte characters simply won't map
> into Unicode in the same way that ASCII does. The user cannot opt to use
> the non-wide functions, because the DM automatically maps them to the
> Unicode versions.
>
> Because the Driver Manager forces the user to use the *W functions if
> they exist, I cannot see any way to make 3 or 4 work with a Unicode
> driver.

I agree that 4) can never work, because ODBC does not seem compatible with multibyte apps by design. ODBC caters for "ANSI" and "Unicode" strings, that's all.
<http://blogs.msdn.com/oldnewthing/archive/2004/05/31/144893.aspx>

However, I don't get why 3) does not work. From here:
<http://msdn.microsoft.com/library/default.asp?url=/library/en-us/odbc/htm/odbcunicode_function_arguments.asp>

   If the driver is a Unicode driver, the Driver Manager makes function
   calls as follows:
   - Converts an ANSI function (with the A suffix) to a Unicode function
     (with the W suffix) by converting the string arguments into Unicode
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     characters and passes the Unicode function to the driver.

Are you saying in 3) that the "converting" underlined above is actually just a static cast?!

Is this "bug" true for every driver manager out there?
> -----Original Message-----
> From: pgsql-odbc-owner@postgresql.org
> [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of Marc Herbert
> Sent: 21 November 2005 17:19
> To: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Continuing encoding fun....
>
> I agree that 4) can never work, because ODBC does not seem compatible
> with multibyte apps by design. ODBC caters for "ANSI" and "Unicode"
> strings, that's all.
> <http://blogs.msdn.com/oldnewthing/archive/2004/05/31/144893.aspx>

Actually our ANSI driver works quite nicely in various non-Unicode multibyte encodings such as Shift-JIS, EUC_CN, JOHAB and more. It'll even work with pure UTF-8 in multibyte mode using the ANSI API.

> However, I don't get why 3) does not work. From here:
> <http://msdn.microsoft.com/library/default.asp?url=/library/en-us/odbc/htm/odbcunicode_function_arguments.asp>
>
>    If the driver is a Unicode driver, the Driver Manager makes function
>    calls as follows:
>    - Converts an ANSI function (with the A suffix) to a Unicode function
>      (with the W suffix) by converting the string arguments into Unicode
>                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>      characters and passes the Unicode function to the driver.
>
> Are you saying in 3) that the "converting" underlined above is
> actually just a static cast?!

No, not really a static cast, but a similar effect. Unicode chars 0000-007F are exactly the same as their ASCII counterparts, as are LATIN1 (0080-00FF). All the DM does is map the single byte values into the low bytes of the Unicode characters and passes them to the Unicode functions. This works just fine for pure ASCII/LATIN1, but not with other character sets which don't directly map from their single byte values into Unicode.

> Is this "bug" true for every driver manager out there?

It's not really a bug, but I believe so, yes. It gets corrected by the more advanced drivers though - for example, the SQL Server driver might see a 'Š' character (8A). It knows the local charset is LATIN4, so it can then rewrite that character to 0160, the Unicode equivalent. Our Unicode driver will simply leave it as 8A, which is actually a control character (VTS - LINE TABULATION SET).

http://www.unicode.org/roadmaps/bmp/

At least, this is how I understand things :-).
Regardless though, the encoding bug reports have all but stopped now that we ship 2 drivers again.

Regards, Dave.

"Dave Page" <dpage@vale-housing.co.uk> writes:

>> I agree that 4) can never work, because ODBC does not seem compatible
>> with multibyte apps by design. ODBC caters for "ANSI" and "Unicode"
>> strings, that's all.
>> <http://blogs.msdn.com/oldnewthing/archive/2004/05/31/144893.aspx>
>
> Actually our ANSI driver works quite nicely in various non-Unicode
> multibyte encodings such as Shift-JIS, EUC_CN, JOHAB and more. It'll
> even work with pure UTF-8 in multibyte mode using the ANSI API.

Great.

Out of curiosity, is this because all the ODBC code has a "don't touch" attitude in this full-ANSI case, leaving all string data as is? Or is there something more clever? Who performs the conversion if the database is in UTF-8 for instance? Multibyte cases seem to fall outside the scope of the ODBC spec, which refers only to "ANSI" and "Unicode".

Thanks in advance for providing pointers if this is an FAQ. Even vague references to the archive of this list would be nice.

>> However, I don't get why 3) does not work.
>>
>>    If the driver is a Unicode driver, the Driver Manager makes function
>>    calls as follows:
>>    - Converts an ANSI function (with the A suffix) to a Unicode function
>>      (with the W suffix) by converting the string arguments into Unicode
>>                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>      characters and passes the Unicode function to the driver.
>>
>> Are you saying in 3) that the "converting" underlined above is
>> actually just a static cast?!
>
> No, not really a static cast, but a similar effect. Unicode chars
> 0000-007F are exactly the same as their ASCII counterparts, as are
> LATIN1 (0080-00FF). All the DM does is map the single byte values
> into the low bytes of the Unicode characters and passes them to the
> Unicode functions. This works just fine for pure ASCII/LATIN1, but
> not with other character sets which don't directly map from their
> single byte values into Unicode.

Very interesting. Maybe the driver manager does so only because it cannot/fails to get the active codepage, falling back on CP-1252? (CP1252 ~= latin1, <http://czyborra.com/charsets/codepages.html#CP1252>)

>> Is this "bug" true for every driver manager out there?
>
> It's not really a bug, but I believe so, yes.

Including unixodbc and iodbc for instance?

> It gets corrected by
> the more advanced drivers though - for example, the SQL Server
> driver might see a 'Š' character (8A). It knows the local charset is
> LATIN4, so it can then rewrite that character to 0160, the Unicode
> equivalent.

Are you saying that the SQL Server driver is fixing the flawed conversion job of the driver manager, finally taking the codepage into account? Surprising to say the least!

By the way, 0x8A is not in the range of latin4
<http://czyborra.com/charsets/iso8859.html#ISO-8859-4>

> Our Unicode driver will simply leave it

Of course, you don't want to perform a conversion that is supposed to already have happened.

> Regardless though, the encoding bug reports have all but stopped now
> that we ship 2 drivers again.

And having two different drivers is indeed the approach induced by the ODBC documentation, from what I've got from it.

Thanks a lot for your insights.
> -----Original Message-----
> From: pgsql-odbc-owner@postgresql.org
> [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of Marc Herbert
> Sent: 22 November 2005 09:33
> To: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Continuing encoding fun....
>
> Out of curiosity, is this because all the ODBC code has a "don't
> touch" attitude in this full-ANSI case, leaving all string data as is?
> Or is there something more clever? Who performs the conversion if the
> database is in UTF-8 for instance? Multibyte cases seem to fall outside
> the scope of the ODBC spec, which refers only to "ANSI" and "Unicode".

No, multibyte support was intentionally added by Eiji Tokuya in 2001. Don't ask me how it works though as I really don't know. Much of the code for it is in multibyte.c if you want to take a peek.

> Very interesting. Maybe the driver manager does so only because it
> cannot/fails to get the active codepage, falling back on CP-1252?
> (CP1252 ~= latin1,
> <http://czyborra.com/charsets/codepages.html#CP1252>)

The docs are somewhat fuzzy on this point, simply stating that:

"If the driver is a Unicode driver, the Driver Manager makes function calls as follows:" ... "Converts an ANSI function (with the A suffix) to a Unicode function (with the W suffix) by converting the string arguments into Unicode characters and passes the Unicode function to the driver."

(http://msdn.microsoft.com/library/default.asp?url=/library/en-us/odbc/htm/odbcunicode_applications.asp)

My assertion that the driver does the conversion comes from the SQL Server driver which allows you to turn conversion on or off:

"Perform translation for character data check box

When selected, the SQL Server ODBC driver converts ANSI strings sent between the client computer and SQL Server by using Unicode. The SQL Server ODBC driver sometimes converts between the SQL Server code page and Unicode on the client computer. This requires that the code page used by SQL Server be one of the code pages available on the client computer.

When cleared, no translation of extended characters in ANSI character strings is done when they are sent between the client application and the server. If the client computer is using an ANSI code page (ACP) different from the SQL Server code page, extended characters in ANSI character strings may be misinterpreted. If the client computer is using the same code page for its ACP that SQL Server is using, the extended characters are interpreted correctly."

If Microsoft intended the DM to do the conversion when they wrote the spec, why would they then add the same functionality to their driver?

> >> Is this "bug" true for every driver manager out there?
>
> > It's not really a bug, but I believe so, yes.
>
> Including unixodbc and iodbc for instance?

If they follow the parts of the spec I quoted above, and interpret them in the same way, then yes. However I'm not overly familiar with either DM, so I can't say for sure.
> > It gets corrected by
> > the more advanced drivers though - for example, the SQL Server
> > driver might see a 'Š' character (8A). It knows the local charset is
> > LATIN4, so it can then rewrite that character to 0160, the Unicode
> > equivalent.
>
> Are you saying that the SQL Server driver is fixing the flawed
> conversion job of the driver manager, finally taking the codepage into
> account? Surprising to say the least!
>
> By the way, 0x8A is not in the range of latin4
> <http://czyborra.com/charsets/iso8859.html#ISO-8859-4>

http://www.gar.no/home/mats/8859-4.htm says differently; however, I can't claim to know enough about encoding issues to refute either. I've been forced to learn what I can about the subject to help maintain this driver and certainly may have got the wrong end of the stick on one or more points!

Regards, Dave.
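[Editor's note: the 0x8A example from the exchange above - byte-for-byte widening versus a codepage-aware rewrite to U+0160 - can be shown with this Windows-only sketch. For illustration it assumes code page 1252, one of the Microsoft-extended code pages in which byte 0x8A is 'Š'; the thread's mention of "LATIN4" refers to a similarly extended table rather than the ISO standard one.]

#include <windows.h>
#include <stdio.h>

int main(void)
{
    const char ansi[] = "\x8a";
    wchar_t    wide[4];

    /* Naive widening: the byte value becomes the code point, U+008A (a C1 control). */
    wchar_t naive = (wchar_t)(unsigned char)ansi[0];

    /* Codepage-aware conversion: a driver that knows the client code page can do this. */
    MultiByteToWideChar(1252, 0, ansi, 1, wide, 4);

    printf("naive: U+%04X  converted: U+%04X\n", (unsigned)naive, (unsigned)wide[0]);
    /* prints: naive: U+008A  converted: U+0160 */
    return 0;
}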
Microsoft harmful extensions to 8859-X charsets (was: Continuing encoding fun....)
From: Marc Herbert

"Dave Page" <dpage@vale-housing.co.uk> writes:

>> By the way, 0x8A is not in the range of latin4
>> <http://czyborra.com/charsets/iso8859.html#ISO-8859-4>
>
> http://www.gar.no/home/mats/8859-4.htm says differently; however, I
> can't claim to know enough about encoding issues to refute
> either. I've been forced to learn what I can about the subject to help
> maintain this driver and certainly may have got the wrong end of the
> stick on one or more points!

The page from gar.no is just a dump of the *Microsoft-extended* latin4 charset.

The standards committee carefully left a gap in all LATIN-X charsets between 0x80 and 0x9F, because those characters become (harmful) control characters once stripped of their 8th bit (by accident). You can see that very clearly in this table for instance:
<http://en.wikipedia.org/wiki/ISO_8859-4>

If you follow the links from gar.no itself, you can land here: <http://en.wikipedia.org/wiki/ISO_8859>, with tons of links (like the ECMA standards for instance) showing this gap.

Microsoft, being Microsoft, jumped into that gap. Those non-standard Microsoft characters now plague the web, as clearly explained here:
<http://home.earthlink.net/~bobbau/platforms/specialchars/#windows>
or here:
<http://www.cs.tut.fi/~jkorpela/www/windows-chars.html>
[Cross-posting to unixodbc-devel. Also crossing fingers so it works]

Archives of both lists here for instance:
<http://dir.gmane.org/search.php?match=odbc>

"Dave Page" <dpage@vale-housing.co.uk> writes:

> The docs are somewhat fuzzy on this point, simply stating that:
>
> "If the driver is a Unicode driver, the Driver Manager makes function
> calls as follows:" ... "Converts an ANSI function (with the A suffix)
> to a Unicode function (with the W suffix) by converting the string
> arguments into Unicode characters and passes the Unicode function to
> the driver."
>
> (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/odbc/htm/odbcunicode_applications.asp)
>
> My assertion that the driver does the conversion comes from the SQL
> Server driver which allows you to turn conversion on or off:
>
> "Perform translation for character data check box
>
> When selected, the SQL Server ODBC driver converts ANSI strings sent
> between the client computer and SQL Server by using Unicode. The SQL
> Server ODBC driver sometimes converts between the SQL Server code page
> and Unicode on the client computer. This requires that the code page
> used by SQL Server be one of the code pages available on the client
> computer.
>
> When cleared, no translation of extended characters in ANSI character
> strings is done when they are sent between the client application and
> the server. If the client computer is using an ANSI code page (ACP)
> different from the SQL Server code page, extended characters in ANSI
> character strings may be misinterpreted. If the client computer is
> using the same code page for its ACP that SQL Server is using, the
> extended characters are interpreted correctly."
>
> If Microsoft intended the DM to do the conversion when they wrote the
> spec, why would they then add the same functionality to their driver?

Here is a hypothesis: the checkbox in the SQL Server driver is actually a switch between the ANSI version and the Unicode version of this driver. That would be pretty much consistent with all the above. The only inconsistency would be "The driver converts...", to be actually read as "This setting triggers the conversion operated by the DM".

What do you think?
> -----Original Message-----
> From: pgsql-odbc-owner@postgresql.org
> [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of Marc Herbert
> Sent: 24 November 2005 14:18
> To: pgsql-odbc@postgresql.org
> Cc: unixodbc-dev@unixodbc.org
> Subject: Re: [ODBC] Continuing encoding fun....
>
> > If Microsoft intended the DM to do the conversion when they wrote the
> > spec, why would they then add the same functionality to their driver?
>
> Here is a hypothesis: the checkbox in the SQL Server driver is actually
> a switch between the ANSI version and the Unicode version of this
> driver. That would be pretty much consistent with all the above. The
> only inconsistency would be "The driver converts...", to be actually
> read as "This setting triggers the conversion operated by the DM".
>
> What do you think?

The DM detects whether the driver is Unicode or not from the presence of the SQLConnectW function
(http://msdn.microsoft.com/library/default.asp?url=/library/en-us/odbc/htm/odbcunicode_drivers.asp).
Whether or not this is exported is determined at compile time and cannot be changed at runtime.

Regards, Dave
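[Editor's note: a sketch of that compile-time decision - not the actual psqlodbc source, and the stub body is invented. The *W entry points are only compiled into the Unicode build, and it is their mere presence in the DLL that the Driver Manager tests for.]

#include <sql.h>
#include <sqlext.h>

#ifdef UNICODE_SUPPORT                 /* build-time switch, as in the ANSI/Unicode patch */
#include <sqlucode.h>                  /* declares the SQL*W prototypes */

SQLRETURN SQL_API SQLConnectW(SQLHDBC hdbc,
                              SQLWCHAR *dsn,  SQLSMALLINT dsn_len,
                              SQLWCHAR *uid,  SQLSMALLINT uid_len,
                              SQLWCHAR *auth, SQLSMALLINT auth_len)
{
    /* ...convert the SQLWCHAR arguments and call the internal connect routine... */
    return SQL_SUCCESS;
}
#endif /* UNICODE_SUPPORT */

Leaving this block out (the ANSI build) removes the export, and the Driver Manager then treats the driver as an ANSI one.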