Thread: Unicode support
Hi Anoop and anyone else who might be interested,

I've been thinking about how the Unicode support might be improved to allow the old 07.xx non-unicode style behaviour to work for those that need it. At the moment, when we connect using one of the wide connect functions, the CC->unicode flag is set to true. This only affects a few options, such as pgtype_to_concise_type()'s mapping of PG types to SQL types.

It seems to me that perhaps we should set CC->unicode = 1 only upon connection to a Unicode database. Anything else, we leave set to 0 so that it always maps varchars etc. to ANSI types, and handles other encodings in single byte or non-unicode multibyte mode (which worked fine in 07.xx where those encodings were appropriate, such as SJIS in Japan). This should also help BDE-based apps, which further research has shown me are broken with Unicode columns in SQL Server and Oracle as well as PostgreSQL (search unicode + BDE on Google Groups for more).

Am I seeing a possible improvement where in fact there isn't one, or missing some obvious downside?

Regards, Dave.
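[A minimal sketch of the check Dave is proposing, in C. The ConnectionClass stand-in, the helper name, and the idea of keying off the reported server encoding are illustrative assumptions, not the driver's actual code.]

#include <string.h>

/* Illustrative stand-in for the driver's ConnectionClass; only the
   flag discussed in this thread is shown. */
typedef struct
{
    int unicode;    /* CC->unicode: should the driver offer SQL_W* types? */
} ConnectionClass;

/* Hypothetical sketch of the proposal: derive CC->unicode from the
   server's encoding (e.g. the result of "SHOW server_encoding")
   instead of from which connect entry point (SQLConnect vs.
   SQLConnectW) was called. */
static void CC_set_unicode_mode(ConnectionClass *conn, const char *server_encoding)
{
    /* PostgreSQL of this era reports its UTF-8 encoding as "UNICODE";
       later versions report "UTF8", so accept both here. */
    if (server_encoding != NULL &&
        (strcmp(server_encoding, "UNICODE") == 0 ||
         strcmp(server_encoding, "UTF8") == 0))
        conn->unicode = 1;  /* Unicode database: map varchar etc. to wide SQL types */
    else
        conn->unicode = 0;  /* anything else: ANSI types, 07.xx-style behaviour */
}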
From: "Dave Page" > Hi Anoop and anyone else who might be interested, > > I've been thinking about how the Unicode support might be improved to > allow the old 07.xx non-unicode style behaviour to work for those that Yea, I think that a libpq version is very great. However, Legacy environment is raising the scream for a rapid change. Then, I think that a multibyte needs to be supported. Regards, Hiroshi Saito
Hi Dave,

Checking for the database encoding and calling the functions using the appropriate flag seems to be fine.

Regards
Anoop

> -----Original Message-----
> From: Dave Page [mailto:dpage@vale-housing.co.uk]
> Sent: Wednesday, August 31, 2005 3:07 AM
> To: Anoop Kumar
> Cc: pgsql-odbc@postgresql.org
> Subject: Unicode support
>
> Hi Anoop and anyone else who might be interested,
>
> [...]
> -----Original Message-----
> From: Hiroshi Saito [mailto:saito@inetrt.skcapi.co.jp]
> Sent: 31 August 2005 02:56
> To: Dave Page; Anoop Kumar
> Cc: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Unicode support
>
> Yes, I think a libpq version would be very great. However, legacy
> environments are crying out against such a rapid change. So I think
> multibyte needs to be supported.

OK, I'll prepare a patch, and because it's an odd problem, a test build to go with it. Any volunteers to test? It really needs people with a reproducible encoding error that doesn't exist in 07.xx, or people using BDE (which barfs bigtime on SQL_C_Wxxx data).

Regards, Dave.
Dave Page schrieb:
>
> OK, I'll prepare a patch, and because it's an odd problem, a test build
> to go with it. Any volunteers to test? It really needs people with a
> reproducible encoding error that doesn't exist in 07.xx, or people
> using BDE (which barfs bigtime on SQL_C_Wxxx data).
>
> Regards, Dave.

Hi Dave,

just send it to me (the Windows DLL), even though I just switched my Linux server to UTF-8. :-) I'll test it with the old environment.

Regards,
Johann
> -----Original Message-----
> From: pgsql-odbc-owner@postgresql.org
> [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of Dave Page
> Sent: 31 August 2005 08:20
> To: Hiroshi Saito; Anoop Kumar
> Cc: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Unicode support
>
> OK, I'll prepare a patch, and because it's an odd problem, a test build
> to go with it. Any volunteers to test? [...]

OK, patch attached. This works slightly differently than I envisaged, because simply switching off Unicode isn't that straightforward, especially if the DM is using the *W functions.

Basically, what this does is only offer wide character types if the database is Unicode and, in that case, set the client encoding to Unicode. For anything else, it will report non-wide character types as per the 07 driver, and let the user set their own encoding as required. From what I can tell of the BDE missing-fields problem, this should almost certainly fix it.

Please look at this carefully - as most of you know, MB/Unicode issues aren't exactly my strong point!

I'll forward test DLLs to volunteer victims privately.

Regards, Dave.
Attachment
LATIN1 and UCS have one common point by design: code points 0x00-0xFF are numerically equal, so treating the data as SQL_ASCII means that LATIN1 characters won't get changed! That is, 0xE4 in ISO-8859-1 is the same character as 0x00E4 in UCS-2; only the number of bytes needed changes.

Reference: "man 7 unicode"

Marko Ristola

Dave Page wrote:
> Hi Anoop and anyone else who might be interested,
>
> I've been thinking about how the Unicode support might be improved to
> allow the old 07.xx non-unicode style behaviour to work for those that
> need it.
>
> [...]
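[Marko's point, as a short C sketch: ISO-8859-1 maps into UCS-2 by plain zero-extension, because code points 0x00-0xFF are identical in both. This is an illustration, not driver code.]

#include <stddef.h>
#include <stdint.h>

/* Convert ISO-8859-1 (Latin-1) bytes to UCS-2 code units by
   zero-extension; no lookup table is needed because the code
   points coincide. */
static void latin1_to_ucs2(const unsigned char *src, size_t n, uint16_t *dst)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = (uint16_t) src[i];   /* e.g. 0xE4 ('ä') becomes 0x00E4 */
}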
From: "Dave Page" Thanks.!! > Please look at this carefully - as most of you know, MB/Unicode issues > aren't exactly my strong point! Ok, I am going to try the specification in multibyte. :-) > > I'll forward test DLLs to volunteer victims privately. Regards, Hiroshi Saito
Hi Dave.

I tried your patch with SJIS here in Japan. It seems that it needs some additional correction. Moreover, it seems necessary to make the driver behave differently from UNICODE (wide-character) mode. It seems that I have to investigate further.

BTW, I remember the original discussion about pgAdmin III. I said then that I should support MultiByte. However, how is it now? It is very wonderful. I feel that having so many choices of character encoding complicates the problem further, although the external environment is different.

Regards,
Hiroshi Saito
Attachment
Did this miss something? :-)

/D

> -----Original Message-----
> From: Miguel Juan [mailto:mjuan@cibal.es]
> Sent: 01 September 2005 10:06
> To: Dave Page
> Cc: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Unicode support
>
> ----- Original Message -----
> From: "Dave Page" <dpage@vale-housing.co.uk>
> To: "Miguel Juan" <mjuan@cibal.es>
> Cc: <pgsql-odbc@postgresql.org>
> Sent: Thursday, September 01, 2005 10:32 AM
> Subject: RE: [ODBC] Unicode support
>
> > Yes please - attached.
> >
> > [...]
> -----Original Message-----
> From: Hiroshi Saito [mailto:saito@inetrt.skcapi.co.jp]
> Sent: 31 August 2005 21:00
> To: Hiroshi Saito; Dave Page; Anoop Kumar
> Cc: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Unicode support
>
> Hi Dave.
>
> I tried your patch with SJIS here in Japan. It seems that it needs
> some additional correction. Moreover, it seems necessary to make the
> driver behave differently from UNICODE (wide-character) mode. It seems
> that I have to investigate further.

Hmmm, well I can't remove the Unicode functions. Do your apps request SQL_C_WCHAR etc. even if the driver doesn't offer it?

> BTW, I remember the original discussion about pgAdmin III. I said then
> that I should support MultiByte. However, how is it now? It is very
> wonderful. I feel that having so many choices of character encoding
> complicates the problem further, although the external environment is
> different.

Hmm, I hate multibyte :-(!!

Regards, Dave
> -----Original Message-----
> From: Miguel Juan [mailto:mjuan@cibal.es]
> Sent: 01 September 2005 11:06
> To: Dave Page
> Cc: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Unicode support
>
> Hello Dave,
>
> I'm just trying the last fix (for BDE) and I can see some odd behavior.
>
> - It shows the TEXT fields as MEMO. But you can see the data if you
> make a double click on it. It looks like it doesn't use the "text as
> LongVarchar" option (this works in version 7.x).

Right, I'll look at that.

> - After a "SELECT * FROM table", the Borland SQL Explorer shows an
> error ('Invalid Blob Handle') for empty TEXT fields (NULL) when you
> try to view them. This works fine in table view.

Strange.

> - After an error inserting a row ('not null' constraint) it closes the
> connection (dead connection error)

I'll test that as well. To be honest though, I've been researching BDE on Google Groups and there are lots of people reporting similar problems with SQL Server and Oracle - apparently BDE fails to work properly with any Unicode data. I'm happy to spend a little time trying to work around that, but I can't spend masses of time on it.

Regards, Dave.
Hi all.

How about creating a charset conversion interface and taking UTF-8 as the internal format for ODBC? At least the following functions might be needed:

Internal2WChar()
WChar2Internal()
Internal2Char()
Char2Internal()

The backend would talk only UTF-8. Here is a minimum set of interface (in the object-oriented design sense) functions:

cvt_FromUTF8()
cvt_ToUTF8()
cvt_Free()

Interface implementation:

struct CvtInterface
{
    char *(*cvt_FromUTF8)(void *internalData, char *source, size_t bytes);
    char *(*cvt_ToUTF8)(void *internalData, char *source, size_t bytes);
    void (*cvt_Free)(void *internalData);
    void *internalData;
};

Object creation:

struct Env
{
    struct CvtInterface char_cvt;   /* C program char conversions */
    struct CvtInterface wchar_cvt;  /* C program wchar_t conversions */
};

struct CvtInterface utf8_to_utf8_New();

env->char_cvt = utf8_to_utf8_New();

These are some interface implementation functions (I don't know how many are needed, but at least char, wchar and multibyte support is needed):

sjis_New()
sjis_FromUTF8()
sjis_ToUTF8()
sjis_Free()

wchar_FromUTF8()
wchar_ToUTF8()
wchar_Free()

char_FromUTF8()
char_ToUTF8()
char_Free()

utf8_FromUTF8()
utf8_ToUTF8()
utf8_Free()

ascii8_FromUTF8()
ascii8_ToUTF8()
ascii8_Free()

So there would be a single internal UTF-8 format inside psqlODBC. The backend could always deliver UTF-8, so an internal format <-> backend format layer is not needed. This would be easy to implement.

Examples:

A C program calls SQLExecDirectW. AllocEnv has found out that the wchar format is UCS-2, so it has created an object:

env->wchar_cvt = cvt_ucs2_UTF8_New();

The PGAPI function needs to convert from WCHAR into the internal format:

sqlquery = (*env->wchar_cvt.cvt_ToUTF8)(env->wchar_cvt.internalData, wcharquery, bytes);

Then sqlquery is in UTF-8, and the query is in an easily manageable format!

A C program uses SQLGetData with SQL_C_WCHAR to get a string. So when the data is converted in convert.c, psqlodbc calls:

result = (*env->wchar_cvt.cvt_FromUTF8)(env->wchar_cvt.internalData, internalformat, bytes);

I don't know whether the ENV handle is the best place to put the converter objects.

What I like about this implementation:
- Simplifies support for clients using different charsets.
- Simplifies psqlodbc internally, because of the internal UTF-8 assumption.
- Easy to implement and to test.
- Easy to add more converters once the initial implementation works.
- Enables use of advanced lexers and parsers when needed to improve performance.
- psqlODBC will support all UTF-8-representable charsets well.

I have not suggested this before, for the following reasons:
- The psqlodbc charset conversion implementation seems to work most of the time.
- Avoiding unnecessary charset conversions is good for performance.
- It takes time to implement and test this.
- Unnecessary malloc + free is bad for performance.

What do you think about this? Would this solve the problems? Is it implementable? Would the performance be good enough? Would this simplify things (that's the goal)?

Regards,
Marko Ristola

Dave Page wrote:
> Hmmm, well I can't remove the Unicode functions. Do your apps request
> SQL_C_WCHAR etc. even if the driver doesn't offer it?
>
> [...]
>
> Hmm, I hate multibyte :-(!!
>
> Regards, Dave
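[One way Marko's converter interface could be realized concretely: a hedged POSIX sketch backed by iconv(3), here for a Latin-1 client. The names follow his proposal; nothing below is actual psqlODBC code, and error handling is minimal.]

#include <iconv.h>
#include <stdlib.h>

typedef struct CvtInterface
{
    char *(*cvt_ToUTF8)(void *internalData, const char *src, size_t bytes);
    void  (*cvt_Free)(void *internalData);
    void  *internalData;
} CvtInterface;

/* Convert a Latin-1 buffer to a freshly malloc'd, NUL-terminated
   UTF-8 string; returns NULL on failure. */
static char *latin1_ToUTF8(void *internalData, const char *src, size_t bytes)
{
    iconv_t cd = (iconv_t) internalData;
    size_t outlen = bytes * 4 + 1;            /* ample for Latin-1 -> UTF-8 */
    char *out = malloc(outlen);
    char *inp = (char *) src, *outp = out;
    size_t inleft = bytes, outleft = outlen - 1;

    if (out == NULL ||
        iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
    {
        free(out);
        return NULL;
    }
    *outp = '\0';
    return out;                               /* caller frees */
}

static void latin1_Free(void *internalData)
{
    iconv_close((iconv_t) internalData);
}

/* Constructor in the style of the proposed xxx_New() functions. */
static CvtInterface latin1_New(void)
{
    CvtInterface cvt;

    cvt.cvt_ToUTF8 = latin1_ToUTF8;
    cvt.cvt_Free = latin1_Free;
    cvt.internalData = iconv_open("UTF-8", "ISO-8859-1");  /* to, from */
    return cvt;
}

[Usage would mirror the proposal: env->char_cvt = latin1_New(); convert as needed; then (*env->char_cvt.cvt_Free)(env->char_cvt.internalData) on teardown.]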
> -----Original Message-----
> From: pgsql-odbc-owner@postgresql.org
> [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of Marko Ristola
> Sent: 01 September 2005 18:21
> Cc: Hiroshi Saito; Anoop Kumar; pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Unicode support
>
> Hi all.

Hi Marko,

> How about creating a charset conversion interface
> and taking UTF-8 as an internal format for ODBC?
>
> <snip>
>
> So, there would be a single internal UTF-8 format inside PsqlODBC.
> The backend could always deliver UTF-8, so the need for internal
> format <-> backend format layer is not needed.
>
> This implementation would be easy to implement.

This is what already happens (if you ignore my recent experimental patch). If the connection is made using one of the *W connect functions, then the ConnectionClass->unicode flag is set to true, and SET client_encoding = 'UTF-8' is sent to the backend. From then on, data going out to the client is fed through utf8_to_ucs2_lf() *if* the data type is specified as SQL_C_WCHAR, and data coming in to *W functions is fed through ucs2_to_utf8(). Afaict, Unicode mode works exactly as it should.

If the connection is made using a non-wide function, ConnectionClass->unicode is not set. In this case, the client is expected to continue using non-wide functions, and the client encoding is left at the default. The driver will then never report data types as SQL_C_WCHAR.

This is where I believe the major problem occurs - if the ODBC Driver Manager sees that SQLConnectW (iirc) exists, it will automatically map ANSI calls (e.g. SQLConnect()) to Unicode (e.g. SQLConnectW()). This then causes the driver to report text/char columns as SQL_C_WCHAR. Less well-written apps then fall over because they aren't clever enough to request data as SQL_C_CHAR instead of SQL_C_WCHAR.

My recent experimental patch aims to address this by forcing the driver to report SQL_C_CHAR instead of SQL_C_WCHAR for non-unicode databases. This should (and seems to, with minor side effects yet to be fully investigated) fix the BDE problem.

As for multibyte (non-unicode) data such as Hiroshi's, my understanding is that in the presence of a Unicode driver, apps are expected to use Unicode (and in fact are forced to by the driver manager's mapping of ANSI function calls to Unicode calls).

Anoop, do you or any of your guys (or anyone else) know unicode/multibyte/encoding well? I'm learning as I go at the moment, so some more experienced help would be *really* appreciated.

Regards, Dave.
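[What a "well-written app" does, per Dave's description: explicitly request SQL_C_CHAR, regardless of whether the driver reports the column as a wide type. A hedged sketch using only standard ODBC calls; the statement text and column are made up, and error handling is omitted. On Windows, include <windows.h> before the ODBC headers.]

#include <sql.h>
#include <sqlext.h>

static void fetch_as_ansi(SQLHSTMT hstmt)
{
    SQLCHAR buf[256];
    SQLLEN  ind;

    SQLExecDirect(hstmt, (SQLCHAR *) "SELECT name FROM mytable", SQL_NTS);
    while (SQLFetch(hstmt) == SQL_SUCCESS)
    {
        /* Ask for SQL_C_CHAR no matter what type the driver reported;
           apps that blindly follow the reported SQL_WVARCHAR type are
           the ones that fall over. */
        SQLGetData(hstmt, 1, SQL_C_CHAR, buf, sizeof(buf), &ind);
    }
}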
> Anoop, do you or any of your guys (or anyone else) know
> unicode/multibyte/encoding well? I'm learning as I go at the moment, so
> some more experienced help would be *really* appreciated.

Sorry Dave, no one on my list! :-(

Regards
Anoop
So, I don't have much experience with Windows ODBC. That's true. Is it possible to compile psqlodbc with MinGW tools for Windows?

After using Google, I found out that the GLIB libraries are able to convert UTF-8 into multibyte under Windows. Windows should also be able to convert UTF-8 into multibyte and vice versa with its own character set conversion functions.

I also found out that Windows XP had a problem with Korean multibyte: "Windows XP Device Driver Does Not Convert Multibyte Data to Korean", Article ID: 817522. That was fixed in Service Pack 2.

So I ask you, how have you thought about these things:

If I have understood Windows correctly, it uses UCS-2 as its internal Unicode character set. Linux prefers UTF-8. So if we treat UCS-2 and UTF-8 as equivalent inside psqlodbc, that makes sense. That's what has been implemented in psqlodbc already for Windows.

Then there is the world before Unicode existed: DOS codepages, character sets for groups of countries, and multibyte character sets. JIS X 0208 is a character set (see man 7 charsets). Shift_JIS is an encoding that can contain JIS X 0208 multibyte characters (see man 7 charsets).

So it seems that one working implementation can be done by using a UTF-8 PostgreSQL server and UTF-8 to multibyte conversions. However, according to the Samba team's Unicode problem descriptions, there are some problems: UTF-8 to EUC_JP conversion may be different on Linux and Windows, and on different conversion library implementations. Some multibyte character sets are contradictory with each other.

If we drop the *W() functions away, we might get a working implementation, but we might not support the full ODBC API.

So, if and only if one single conversion library does the conversions, it works. That is, if either the PostgreSQL backend alone or the psqlodbc side alone does the needed conversions, psqlodbc should work with multibyte encodings and with UTF-8. If the PostgreSQL server is in the same kind of Windows environment as the clients, it should work fully with UTF-8 and the multibyte character sets. This should be the best working option. Windows does have a working UCS-2 to multibyte conversion implementation on the psqlodbc client (since Service Pack 2).

Unfortunately, pg_dump + restore from SJIS into UTF-8 might not work, because Linux's iconv might not do the conversion correctly. The conversion into UTF-8 must be done using the fully working Windows conversion functions. So one way might be to use a pg_dump under Windows that does the multibyte into UTF-8 conversion on the Windows side.

How about the following implementation:

ODBC against the backend:
- Backend has multibyte characters.
- Windows uses multibyte characters.
- psqlodbc has UTF-8 as its internal format.

=> A fully working implementation:
- Backend delivers multibyte characters. psqlodbc converts them into UTF-8. psqlodbc delivers multibyte characters to the client using UTF-8-to-locale Windows functions when necessary.

So the solution might be to do all conversions on the client side! The reasoning for this is that two separate conversion libraries might be contradictory with each other, at least with the Asian character sets. (With Macs, the UTF-8 implementation differs from the standard.)

Or then Asian users should move and use UTF-8 as their PostgreSQL server's backend format. That's the other solution for the same problem. Then the PostgreSQL server doesn't have to do the conversion.
It does not seem possible to do all the conversion functions inside the PostgreSQL server under Windows, because of the xx() -> xxW() mapping inside the Windows ODBC manager. We can't control that.

What do you think about these thoughts?

Marko Ristola

Hiroshi Saito wrote:
> Hi Dave.
>
> I tried your patch with SJIS here in Japan. It seems that it needs
> some additional correction. Moreover, it seems necessary to make the
> driver behave differently from UNICODE (wide-character) mode. It
> seems that I have to investigate further.
>
> [...]
>
> ------------------------------------------------------------------------
>
> --- convert.c.orig Thu Aug 4 21:26:57 2005
> +++ convert.c Thu Sep 1 04:38:45 2005
> @@ -762,7 +762,7 @@
>  {
>  BOOL lf_conv = conn->connInfo.lf_conversion;
>
> - if (fCType == SQL_C_WCHAR)
> + if ((conn->unicode && conn->report_wide_types) && (fCType == SQL_C_WCHAR))
>  {
>  len = utf8_to_ucs2_lf(neut_str, -1, lf_conv, NULL, 0);
>  len *= WCLEN;
> @@ -778,7 +778,7 @@
>  }
>  else
>  #ifdef WIN32
> - if (fCType == SQL_C_CHAR)
> + if ((conn->unicode && conn->report_wide_types) && (fCType == SQL_C_CHAR))
>  {
>  wstrlen = utf8_to_ucs2_lf(neut_str, -1, lf_conv, NULL, 0);
>  allocbuf = (SQLWCHAR *) malloc(WCLEN * (wstrlen + 1));
> @@ -810,7 +810,7 @@
>  pgdc->ttlbuflen = len + 1;
>  }
>
> - if (fCType == SQL_C_WCHAR)
> + if ((conn->unicode && conn->report_wide_types) && (fCType == SQL_C_WCHAR))
>  {
>  utf8_to_ucs2_lf(neut_str, -1, lf_conv, (SQLWCHAR *) pgdc->ttlbuf, len / WCLEN);
>  }
> @@ -824,7 +824,7 @@
>  }
>  else
>  #ifdef WIN32
> - if (fCType == SQL_C_CHAR)
> + if ((conn->unicode && conn->report_wide_types) && (fCType == SQL_C_CHAR))
>  {
>  len = WideCharToMultiByte(CP_ACP, 0, allocbuf, wstrlen, pgdc->ttlbuf, pgdc->ttlbuflen, NULL, NULL);
>  free(allocbuf);
> @@ -871,7 +871,7 @@
>
>  copy_len = (len >= cbValueMax) ? cbValueMax - 1 : len;
>
> - if (fCType == SQL_C_WCHAR)
> + if ((conn->unicode && conn->report_wide_types) && (fCType == SQL_C_WCHAR))
>  {
>  copy_len /= WCLEN;
>  copy_len *= WCLEN;
> @@ -911,7 +911,7 @@
>  memcpy(rgbValueBindRow, ptr, copy_len);
>  /* Add null terminator */
>
> - if (fCType == SQL_C_WCHAR)
> + if ((conn->unicode && conn->report_wide_types) && (fCType == SQL_C_WCHAR))
>  memset(rgbValueBindRow + copy_len, 0, WCLEN);
>  else
>
> @@ -942,7 +942,7 @@
>  break;
>  }
>
> - if (SQL_C_WCHAR == fCType && ! wchanged)
> + if ((conn->unicode && conn->report_wide_types) && (SQL_C_WCHAR == fCType && ! wchanged))
>  {
>  if (cbValueMax > (SDWORD) (WCLEN * (len + 1)))
>  {
> @@ -2629,6 +2629,8 @@
>  case SQL_WCHAR:
>  case SQL_WVARCHAR:
>  case SQL_WLONGVARCHAR:
> + if (conn->unicode && conn->report_wide_types)
> + {
>  if (SQL_NTS == used)
>  used = strlen(buffer);
>  allocbuf = malloc(WCLEN * (used + 1));
> @@ -2637,6 +2639,11 @@
>  buf = ucs2_to_utf8((SQLWCHAR *) allocbuf, used, (UInt4 *) &used, FALSE);
>  free(allocbuf);
>  allocbuf = buf;
> + }
> + else
> + {
> + buf = buffer;
> + }
>  break;
>  default:
>  buf = buffer;
> @@ -2647,10 +2654,17 @@
>  break;
>
>  case SQL_C_WCHAR:
> + if (conn->unicode && conn->report_wide_types)
> + {
>  if (SQL_NTS == used)
>  used = WCLEN * wcslen((SQLWCHAR *) buffer);
>  buf = allocbuf = ucs2_to_utf8((SQLWCHAR *) buffer, used / WCLEN, (UInt4 *) &used, FALSE);
>  used *= WCLEN;
> + }
> + else
> + {
> + buf = buffer;
> + }
>  break;
>
>  case SQL_C_DOUBLE:
> --- psqlodbc_win32.def.orig Thu Sep 1 04:41:37 2005
> +++ psqlodbc_win32.def Thu Sep 1 04:42:08 2005
> @@ -78,31 +78,3 @@
>  DllMain @201
>  ConfigDSN @202
>
> -SQLColAttributeW @101
> -SQLColumnPrivilegesW @102
> -SQLColumnsW @103
> -SQLConnectW @104
> -SQLDescribeColW @106
> -SQLExecDirectW @107
> -SQLForeignKeysW @108
> -SQLGetConnectAttrW @109
> -SQLGetCursorNameW @110
> -SQLGetInfoW @111
> -SQLNativeSqlW @112
> -SQLPrepareW @113
> -SQLPrimaryKeysW @114
> -SQLProcedureColumnsW @115
> -SQLProceduresW @116
> -SQLSetConnectAttrW @117
> -SQLSetCursorNameW @118
> -SQLSpecialColumnsW @119
> -SQLStatisticsW @120
> -SQLTablesW @121
> -SQLTablePrivilegesW @122
> -SQLDriverConnectW @123
> -SQLGetDiagRecW @124
> -SQLGetStmtAttrW @125
> -SQLSetStmtAttrW @126
> -SQLSetDescFieldW @127
> -SQLGetTypeInfoW @128
> -SQLGetDiagFieldW @129
>
> ------------------------------------------------------------------------
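[The Windows-side conversion Marko describes above, sketched with the real Win32 conversion functions: UTF-8 to UCS-2/UTF-16 and then to Shift_JIS. Code page 932 is Windows Shift_JIS; error handling is omitted, and this is an illustration rather than driver code.]

#include <windows.h>
#include <stdlib.h>

/* Convert a NUL-terminated UTF-8 string to a freshly malloc'd
   Shift_JIS string via an intermediate wide-character buffer. */
static char *utf8_to_sjis(const char *utf8)
{
    /* Pass 1: UTF-8 -> UTF-16 (what this thread calls UCS-2);
       first call measures, second call converts. */
    int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    WCHAR *wbuf = malloc(wlen * sizeof(WCHAR));
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wbuf, wlen);

    /* Pass 2: UTF-16 -> Shift_JIS (code page 932). */
    int slen = WideCharToMultiByte(932, 0, wbuf, -1, NULL, 0, NULL, NULL);
    char *sjis = malloc(slen);
    WideCharToMultiByte(932, 0, wbuf, -1, sjis, slen, NULL, NULL);

    free(wbuf);
    return sjis;   /* caller frees */
}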
Marko Ristola wrote:
> However, according to the Samba team's Unicode problem descriptions,
> there are some problems: UTF-8 to EUC_JP conversion may be different
> on Linux and Windows, and on different conversion library
> implementations. Some multibyte character sets are contradictory with
> each other.

This was the Samba reference; I recommend you read the applicable parts:

http://www.samba.org/samba/docs/man/Samba-HOWTO-Collection/unicode.html

I hope that multibyte into UTF-8 and vice versa is possible. If not, disabling UTF-8 and UCS-2 seems to be the only workable choice :(

Regards, Marko
----- Original Message -----
From: "Dave Page" <dpage@vale-housing.co.uk>
To: "Miguel Juan" <mjuan@cibal.es>
Cc: <pgsql-odbc@postgresql.org>
Sent: Thursday, September 01, 2005 10:32 AM
Subject: RE: [ODBC] Unicode support

Yes please - attached.

> -----Original Message-----
> From: Miguel Juan [mailto:mjuan@cibal.es]
> Sent: 01 September 2005 09:25
> To: Dave Page
> Cc: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Unicode support
>
> hello,
>
> I will try it with BDE environment if you want.
>
> Regards,
>
> Miguel Juan
>
> [...]
Hello Dave,

I'm just trying the last fix (for BDE) and I can see some odd behavior.

- It shows the TEXT fields as MEMO. But you can see the data if you make a double click on it. It looks like it doesn't use the "text as LongVarchar" option (this works in version 7.x).

- After a "SELECT * FROM table", the Borland SQL Explorer shows an error ('Invalid Blob Handle') for empty TEXT fields (NULL) when you try to view them. This works fine in table view.

- After an error inserting a row ('not null' constraint) it closes the connection (dead connection error).

Regards,

Miguel Juan

----- Original Message -----
From: "Dave Page" <dpage@vale-housing.co.uk>
To: "Miguel Juan" <mjuan@cibal.es>
Cc: <pgsql-odbc@postgresql.org>
Sent: Thursday, September 01, 2005 10:32 AM
Subject: RE: [ODBC] Unicode support

Yes please - attached.

> [...]
hello,

I will try it with BDE environment if you want.

Regards,

Miguel Juan

----- Original Message -----
From: "Dave Page" <dpage@vale-housing.co.uk>
To: "Anoop Kumar" <anoopk@pervasive-postgres.com>
Cc: <pgsql-odbc@postgresql.org>
Sent: Tuesday, August 30, 2005 11:36 PM
Subject: [ODBC] Unicode support

Hi Anoop and anyone else who might be interested,

I've been thinking about how the Unicode support might be improved to allow the old 07.xx non-unicode style behaviour to work for those that need it. [...]
Marc Herbert wrote:
> Marko Ristola <Marko.Ristola@kolumbus.fi> writes:
>
>> So I ask you, how have you thought about these things:
>>
>> If I have understood Windows correctly, it uses UCS-2 as its internal
>> Unicode character set. Linux prefers UTF-8.
>
> I am not sure what you mean by "internal UNICODE character set", but I
> understand that Linux does prefer UTF-32, NOT UTF-8!

If you want to know the details of UTF-8's encoding, the following Linux manual page is recommended reading :)

man utf-8

It gives you a good explanation of the encoding used in UTF-8. UTF-8 uses from one to four bytes per character, and it supports almost all character sets in the world. Because the task is so huge, there exist variants and bugs in the implementations; that's what I read from the Samba filesystem FAQ. So if you stick with the Windows implementation, you won't find any bugs, but when you move the file to another operating system, the file might look different :(

UCS-2 is a 32-bit Unicode wchar_t type. According to the Linux manuals, wchar_t is not equal on all implementations, and inside binary files it is recommended in C to use UTF-8 strings, which are then converted at runtime into the wchar_t type. The Java language is another story; there might be the same problems though. The number remains the same, but if you try to draw the character into a window with different implementations, you might get different drawings.

> On all platforms I had a look at, variable-length encodings are only
> for disk and network, never used in memory.
>
> Don't you agree?

locale
LANG=fi_FI.UTF-8@euro
LC_CTYPE="fi_FI.UTF-8@euro"
LC_NUMERIC="fi_FI.UTF-8@euro"
LC_TIME="fi_FI.UTF-8@euro"
LC_COLLATE="fi_FI.UTF-8@euro"
LC_MONETARY="fi_FI.UTF-8@euro"
LC_MESSAGES="fi_FI.UTF-8@euro"
LC_PAPER="fi_FI.UTF-8@euro"
LC_NAME="fi_FI.UTF-8@euro"
LC_ADDRESS="fi_FI.UTF-8@euro"
LC_TELEPHONE="fi_FI.UTF-8@euro"
LC_MEASUREMENT="fi_FI.UTF-8@euro"
LC_IDENTIFICATION="fi_FI.UTF-8@euro"
LC_ALL=

So, under Linux nowadays, UTF-8 is used very much. Just as Windows recommends everybody to move to native Windows Unicode characters (UCS-2), under Linux it is recommended to move to UTF-8. Both are Unicode character encodings; the UCS-2 encoding is just simpler: an integer with a numerical value. The reason for the popularity of UTF-8 under Linux is that each program needs very little adjustment to move from a LATIN1-style encoding to UTF-8.

Happy studying about Unicode character sets :)

Regards, Marko Ristola
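[A small illustration of the one-to-four-byte encoding rules Marko cites from utf-8(7); a hedged sketch for reference, not driver code. For brevity it does not reject the surrogate range U+D800-U+DFFF, which strictly should not be encoded.]

#include <stdint.h>

/* Encode one Unicode code point (up to U+10FFFF) as UTF-8.
   Returns the number of bytes written (1-4), or 0 if out of range. */
static int codepoint_to_utf8(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80)                            /* 1 byte: 0xxxxxxx */
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    else if (cp < 0x800)                      /* 2 bytes: 110xxxxx 10xxxxxx */
    {
        out[0] = 0xC0 | (cp >> 6);
        out[1] = 0x80 | (cp & 0x3F);
        return 2;
    }
    else if (cp < 0x10000)                    /* 3 bytes */
    {
        out[0] = 0xE0 | (cp >> 12);
        out[1] = 0x80 | ((cp >> 6) & 0x3F);
        out[2] = 0x80 | (cp & 0x3F);
        return 3;
    }
    else if (cp < 0x110000)                   /* 4 bytes */
    {
        out[0] = 0xF0 | (cp >> 18);
        out[1] = 0x80 | ((cp >> 12) & 0x3F);
        out[2] = 0x80 | ((cp >> 6) & 0x3F);
        out[3] = 0x80 | (cp & 0x3F);
        return 4;
    }
    return 0;
}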
Marc Herbert wrote:
> On Thu, Sep 08, 2005 at 08:22:50PM +0300, Marko Ristola wrote:
>
> Actually my question was just: what do you mean by 'internal'?
>
> Usually 'internal' means 'in memory', and I really don't think there
> is any application/system using UTF-8 in memory, is there?

I'm sorry. I hope that this time I'll answer the right question. Unfortunately this is also a lengthy answer.

By internal Unicode I meant the wchar_t type in Linux and the Unixes, or the Windows internal Unicode (TCHAR). I tried to find out more about the wchar_t (TCHAR) implementations - the internal Unicode representations.

According to the libc info pages, under GNU Linux wchar_t is implemented as 32-bit UCS-4 characters. My earlier assumption, that Linux uses UCS-2, is wrong :( My earlier assumption, that UCS-2 is 32-bit, is wrong :(

Some Unix systems implement wchar_t as 16-bit UCS-2 characters. This means that if they want to implement the full 31-bit character set space, they can do so by using pairs of certain UCS-2 characters. This is the UTF-16 format (a multibyte version of UCS-2).

- If I remember correctly, Java uses 16-bit Unicode, meaning that it uses UCS-2 or UTF-16.
- According to the psqlodbc implementation, Windows uses UCS-2 as its internal format. This implies that Windows might actually use the UTF-16 multibyte format internally, because UCS-2 is a subset of UTF-16. Of course, Windows is capable of creating its own private standards.

The PostgreSQL ODBC driver for Windows already has the UTF-8 to UCS-2 character set conversion functions. The PostgreSQL ODBC driver for Linux still misses UTF-8 to UCS-4 conversion functions.

The PostgreSQL server delivers the query results as UTF-8 to the Windows ODBC driver, and the ODBC driver then converts the UTF-8 data into UCS-2. So the psqlodbc driver uses UTF-8 internally under Windows.

UTF-8 use cases under Linux:
- OpenOffice files are stored as UTF-8.
- Emacs and many other editors store files as UTF-8.
- LATIN1 and other non-Unicode characters still work, but are less used.
- Some programs still don't work with UTF-8.

The names of files and their data are stored nowadays as UTF-8. Many programs use UTF-8 internally, not the wchar_t format UCS-4. There might be some new programs and editors (Gnome, KDE??) that are written from scratch and might use wchar_t (UCS-4) internally. Java programs always use UCS-2 (or UTF-16) as their internal format, so it isn't wchar_t Unicode; the Java way is standardized. Terminals nowadays use UTF-8. Over the network, UTF-8 works (ideally - reality??) from Windows to Linux, and from Mac to Linux. So a common format is good.

So, under Linux every program may choose:
- to store file names as UTF-8 or as LATIN1. This is actually a bad thing: the behaviour depends on which character set you selected before logging in.
- to store files as UTF-8 or as LATIN1. This is again based on the console login setting.
- to do any character conversions it likes, with the libiconv library.

>>> locale
>>
>> The reason for the popularity of UTF-8 under Linux is, that each
>> program needs to be adjusted very little to be able to move
>> from LATIN1 style encoding into UTF-8.
>
> Again, are you talking about memory, disk/network?
>
> This is definitely not the same thing IMHO.

So when you ask about memory and disk: the answer is that each application chooses its own character set formats. Usually the environment variables affect the selection.

And when you ask about the network character set: network interaction is standardized, and many Unixes, Linux and Windows try to conform to those network standards.

Regards, Marko Ristola
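[The "pairs of certain UCS-2 characters" Marko mentions are UTF-16 surrogate pairs; here is a hedged sketch of how one code point above U+FFFF is split into such a pair. Illustration only, not driver code.]

#include <stdint.h>

/* Encode one Unicode code point as UTF-16.
   Returns the number of 16-bit units written (1 or 2). */
static int codepoint_to_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000)                  /* fits in a single 16-bit unit */
    {
        out[0] = (uint16_t) cp;
        return 1;
    }
    cp -= 0x10000;                     /* 20 bits remain */
    out[0] = 0xD800 | (cp >> 10);      /* high surrogate: top 10 bits */
    out[1] = 0xDC00 | (cp & 0x3FF);    /* low surrogate: bottom 10 bits */
    return 2;
}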