Thread: Unicode support

Unicode support

From
"Dave Page"
Date:
Hi Anoop and anyone else who might be interested,

I've been thinking about how the Unicode support might be improved to
allow the old 07.xx non-unicode style behaviour to work for those that
need it. At the moment, when we connect using one of the wide connect
functions, the CC->unicode flag is set to true. This only affects a few
options, such as pgtype_to_concise_type()'s mapping of PG types to SQL
types.

It seems to me that perhaps we should set CC->unicode = 1 only upon
connection to a Unicode database. Anything else, we leave it set to 0 so
that it always maps varchars etc to ANSI types, and handles other
encodings in single byte or non-unicode multibyte mode (which worked
fine in 07.xx where those encodings were appropriate, such as SJIS in
Japan). This should also help BDE based apps, which further research has
shown me are broken with Unicode columns in SQL Server and Oracle as
well as PostgreSQL (search unicode + BDE on Google Groups for more).

Am I seeing a possible improvement where in fact there isn't one, or
missing some obvious downside?

Regards, Dave.
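A minimal sketch of the proposal, keyed on the database encoding rather than on which connect function was called (the struct layout and helper name here are illustrative only, not the driver's actual code):

```c
#include <string.h>

/* Hypothetical sketch: set CC->unicode from the database encoding,
 * not from whether a wide (*W) connect function was used.
 * ConnectionClass fields and the helper name are illustrative. */
typedef struct {
    char server_encoding[32];
    int  unicode;               /* the CC->unicode flag */
} ConnectionClass;

static void CC_set_unicode_from_db(ConnectionClass *cc)
{
    /* Only a Unicode (UTF-8) database gets wide-type behaviour;  */
    /* SJIS, LATIN1 etc. keep the 07.xx ANSI/multibyte behaviour. */
    cc->unicode = (strcmp(cc->server_encoding, "UNICODE") == 0 ||
                   strcmp(cc->server_encoding, "UTF8") == 0);
}
```

Anything other than a UTF-8 database would then behave as the 07.xx driver did.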

Re: Unicode support

From
"Hiroshi Saito"
Date:
From: "Dave Page"

> Hi Anoop and anyone else who might be interested,
>
> I've been thinking about how the Unicode support might be improved to
> allow the old 07.xx non-unicode style behaviour to work for those that

Yes, I think a libpq version would be great. However, legacy environments
are screaming against a rapid change. So I think multibyte needs
to be supported.

Regards,
Hiroshi Saito


Re: Unicode support

From
"Anoop Kumar"
Date:
Hi Dave,

Checking for the database encoding and calling the functions using the
appropriate flag seems to be fine.

Regards

Anoop

> -----Original Message-----
> From: Dave Page [mailto:dpage@vale-housing.co.uk]
> Sent: Wednesday, August 31, 2005 3:07 AM
> To: Anoop Kumar
> Cc: pgsql-odbc@postgresql.org
> Subject: Unicode support
>
> Hi Anoop and anyone else who might be interested,
>
> <snip>
>
> Regards, Dave.

Re: Unicode support

From
"Dave Page"
Date:

> -----Original Message-----
> From: Hiroshi Saito [mailto:saito@inetrt.skcapi.co.jp]
> Sent: 31 August 2005 02:56
> To: Dave Page; Anoop Kumar
> Cc: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Unicode support
>
> From: "Dave Page"
>
> > Hi Anoop and anyone else who might be interested,
> >
> > I've been thinking about how the Unicode support might be improved
> > to allow the old 07.xx non-unicode style behaviour to work for
> > those that need it.
>
> Yea, I think that a libpq version is very great. However, Legacy
> environment is raising the scream for a rapid change. Then, I think
> that a multibyte needs to be supported.

OK, I'll prepare a patch and, because it's an odd problem, a test build
to go with it. Any volunteers to test? It really needs people with a
reproducible encoding error that doesn't exist in 07.xx, or people
using BDE (which barfs bigtime on SQL_C_Wxxx data).

Regards, Dave.

Re: Unicode support

From
Johann Zuschlag
Date:
Dave Page schrieb:

>
>OK, I'll prepare a patch and, because it's an odd problem, a test build
>to go with it. Any volunteers to test? It really needs people with a
>reproducible encoding error that doesn't exist in 07.xx, or people
>using BDE (which barfs bigtime on SQL_C_Wxxx data).
>
>Regards, Dave.
>
>
>
Hi Dave,

Just send it to me (the Windows DLL), even though I just switched my
Linux server to UTF-8. :-)
I'll test it with the old environment.

Regards,
Johann




Re: Unicode support

From
"Dave Page"
Date:

> -----Original Message-----
> From: pgsql-odbc-owner@postgresql.org
> [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of Dave Page
> Sent: 31 August 2005 08:20
> To: Hiroshi Saito; Anoop Kumar
> Cc: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Unicode support
>
>
> OK, I'll prepare a patch and, because it's an odd problem, a test
> build to go with it. Any volunteers to test? It really needs people
> with a reproducible encoding error that doesn't exist in 07.xx, or
> people using BDE (which barfs bigtime on SQL_C_Wxxx data).

OK, patch attached. This works slightly differently from what I
envisaged, because simply switching off Unicode isn't that
straightforward, especially if the DM is using the *W functions.

Basically what this does is only offer wide character types if the
database is unicode, and, in that case, sets the client encoding to
unicode. For anything else, it will report non-wide character types as
per the 07 driver, and let the user set their own encoding as required.
From what I can tell of the BDE missing fields problem, this should
almost certainly fix it.
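The type-reporting side of this can be pictured with a small sketch (constant values as in the ODBC headers; the helper is illustrative, not the actual pgtype_to_concise_type() code):

```c
/* Illustrative ODBC concise type constants (values from sqlext.h). */
#define SQL_VARCHAR    12
#define SQL_WVARCHAR  (-9)

/* Sketch: report the wide type only when the database is Unicode; */
/* otherwise report the ANSI type, as the 07.xx driver did.        */
static int report_varchar_type(int db_is_unicode)
{
    return db_is_unicode ? SQL_WVARCHAR : SQL_VARCHAR;
}
```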

Please look at this carefully - as most of you know, MB/Unicode issues
aren't exactly my strong point!

I'll forward test DLLs to volunteer victims privately.

Regards, Dave.

Attachment

Re: Unicode support

From
Marko Ristola
Date:
LATIN1 and UCS share one point by design:
code points 0x00 - 0xFF are equal numbers, so treating the data
as SQL_ASCII (ignoring the encoding) means that LATIN1 characters
won't get changed!

So this means that
0xE4 in ISO-8859-1 is the same character as
0x00E4 in UCS-2; just the number of bytes needed changes.

Reference: "man 7 unicode"
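This property can be demonstrated in a couple of lines; widening a LATIN1 byte to UCS-2 is pure zero-extension (illustrative helper):

```c
/* LATIN1 -> UCS-2 is zero-extension: code points 0x00..0xFF agree. */
static unsigned short latin1_to_ucs2(unsigned char c)
{
    return (unsigned short)c;   /* 0xE4 becomes 0x00E4 */
}
```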

Marko Ristola

Dave Page wrote:

>Hi Anoop and anyone else who might be interested,
>
><snip>
>
>Regards, Dave.


Re: Unicode support

From
"Hiroshi Saito"
Date:
From: "Dave Page"

Thanks.!!

> Please look at this carefully - as most of you know, MB/Unicode issues
> aren't exactly my strong point!

OK, I am going to try it with multibyte. :-)

>
> I'll forward test DLLs to volunteer victims privately.

Regards,
Hiroshi Saito


Re: Unicode support

From
"Hiroshi Saito"
Date:
Hi Dave.

I tried your patch with Japanese SJIS. It seems that it needs some
additional correction. Moreover, the driver needs to behave differently
from UNICODE (WideCharacter) mode. It seems that I have to catch up further.

BTW, I remembered the original discussion about pgAdmin III. I said then
that I should support MultiByte. However, how is it now? It is very wonderful.
I feel that having many choices of character encoding complicates the problem
more, although the external environment is different.

Regards,
Hiroshi Saito

Attachment

Re: Unicode support

From
"Dave Page"
Date:
Did this miss something?

:-)

/D

> -----Original Message-----
> From: Miguel Juan [mailto:mjuan@cibal.es]
> Sent: 01 September 2005 10:06
> To: Dave Page
> Cc: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Unicode support
>
>
> ----- Original Message -----
> From: "Dave Page" <dpage@vale-housing.co.uk>
> To: "Miguel Juan" <mjuan@cibal.es>
> Cc: <pgsql-odbc@postgresql.org>
> Sent: Thursday, September 01, 2005 10:32 AM
> Subject: RE: [ODBC] Unicode support
>
>
> Yes please - attached.
>
> > -----Original Message-----
> > From: Miguel Juan [mailto:mjuan@cibal.es]
> > Sent: 01 September 2005 09:25
> > To: Dave Page
> > Cc: pgsql-odbc@postgresql.org
> > Subject: Re: [ODBC] Unicode support
> >
> > hello,
> >
> > I will try it with BDE environment if you want.
> >
> > Regards,
> >
> > Miguel Juan
> >
> > ----- Original Message -----
> > From: "Dave Page" <dpage@vale-housing.co.uk>
> > To: "Anoop Kumar" <anoopk@pervasive-postgres.com>
> > Cc: <pgsql-odbc@postgresql.org>
> > Sent: Tuesday, August 30, 2005 11:36 PM
> > Subject: [ODBC] Unicode support
> >
> > <snip>

Re: Unicode support

From
"Dave Page"
Date:

> -----Original Message-----
> From: Hiroshi Saito [mailto:saito@inetrt.skcapi.co.jp]
> Sent: 31 August 2005 21:00
> To: Hiroshi Saito; Dave Page; Anoop Kumar
> Cc: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Unicode support
>
> Hi Dave.
>
> I tried your patch by SJIS of Japan. It seems that it needs some
> additional correction. Moreover, it is necessary to make the driver
> different from UNICODE (WideCharacter). It seems that I have to
> catch up further.

Hmmm, well I can't remove the Unicode functions. Do your apps request
SQL_C_WCHAR etc even if the driver doesn't offer it?

> BTW, I remembered the original discussion by pgAdmin III. I said
> that I should support MultiByte then. However, How is it now? It
> is very wonderful. I feel that that there are many choices of a
> character code complicates a problem more. but, it is although
> external environment is different.

Hmm, I hate multibyte :-(!!

Regards, Dave

Re: Unicode support

From
"Dave Page"
Date:

> -----Original Message-----
> From: Miguel Juan [mailto:mjuan@cibal.es]
> Sent: 01 September 2005 11:06
> To: Dave Page
> Cc: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Unicode support
>
> Hello Dave,
>
> I'm just trying the last fix (for BDE) and I can see some odd
> behavior.
>
> - It shows the TEXT fields as MEMO. But you can see the data
> if you make a
> double click on it. It looks like it doesn't use the "text as
> LongVarchar"
> option (this works in version 7.x).

Right, I'll look at that.

> - After a "SELECT * FROM table" The Borland SQL Explorer
> shows an error
> ('Invalid Blob Handle') for empty TEXT fields (NULL) when you
> try to view
> them. This works fine for table view.

Strange.

> - After an error inserting a row ('not null' constraint) it
> closes the
> connection (dead connection error)

I'll test that as well.

To be honest though, I've been researching BDE on Google Groups and
there are lots of people reporting similar problems with SQL Server and
Oracle - apparently BDE fails to work properly with any Unicode data.
I'm happy to spend a little time trying to work around that, but I can't
spend masses of time on it.

Regards, Dave.

Re: Unicode support

From
Marko Ristola
Date:
Hi all.

How about creating a charset conversion interface
and taking UTF-8 as an internal format for ODBC?:

At least the following functions might be needed:

Internal2WChar()
WChar2Internal()

Internal2Char()
Char2Internal()

Backend would talk only UTF-8.

Here is a minimum set of interface
(Object oriented design term) functions:

cvt_FromUTF8()
cvt_ToUTF8()
cvt_Free()

Interface implementation:

struct CvtInterface {
  char (*cvt_FromUTF8)(void *internalData, char *source, size_t bytes);
  char (*cvt_ToUTF8)(void *internalData, char *source, size_t bytes);
  void (*cvt_Free)(void *internalData);

  void *internalData;
};
Object creation:

struct Env {
    struct CvtInterface char_cvt; // C program char conversions
    struct CvtInterface wchar_cvt; // C program wchar_t conversions
};

struct CvtInterface utf8_to_utf8_New();
env->char_cvt = utf8_to_utf8_New();

These are some interface implementation functions
(I don't know how many are needed, but at least
char, wchar and multibyte support is needed):

sjis_new()
sjis_FromUTF8()
sjis_ToUTF8()
sjis_Free()

wchar_FromUTF8()
wchar_ToUTF8()
wchar_Free()

char_FromUTF8()
char_ToUTF8()
char_Free()

utf8_FromUTF8()
utf8_ToUTF8()
utf8_Free()

ascii8_FromUTF8()
ascii8_ToUTF8()
ascii8_Free()

So, there would be a single internal UTF-8 format inside psqlodbc.
The backend could always deliver UTF-8, so an internal
format <-> backend format conversion layer is not needed.

This would be easy to implement.

Examples:

A C program calls SQLExecuteW.
AllocEnv has found out that the wchar format is UCS-2,
so it has created an object:
env->wchar_cvt = cvt_ucs2_UTF8_New();

The PGAPI function needs to convert from WCHAR into the internal format:
sqlquery = (*env->wchar_cvt.cvt_ToUTF8)(wcharquery);
Then sqlquery is in UTF-8, and the query is in
an easily manageable format!

A C program uses SQLGetDataW to get a string.
So when the data is converted in convert.c, psqlodbc calls:
result = (*env->wchar_cvt.cvt_FromUTF8)(internalformat);

I don't know whether the ENV handle is the best place to put the converter
objects.

What I like about this implementation:
- Simplifies support for clients using different charsets.
- Simplifies psqlodbc internally, because of the internal UTF-8 assumption.
- Easy to implement and to test.
- Easy to add more converters, once the initial implementation works.
- Enables usage of advanced lexers and parsers when needed to improve
performance.
- psqlodbc will support all UTF-8-compatible charsets well.

I have not suggested this before, because of the following reasons:
- The psqlodbc charset conversion implementation seems to work most of the time.
- Avoiding unnecessary charset conversions is good for performance.
- It takes time to implement and test this.
- Unnecessary malloc + free is bad for performance.

What do you think about this?
Would this solve the problems?
Is this implementable?
Would the performance be good enough?
Would this simplify things (that's the Goal)?

Regards,
Marko Ristola
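A compilable sketch of the interface above, with the trivial UTF-8 identity converter plugged in. Everything here illustrates the proposal rather than existing psqlodbc code, and the conversion functions are assumed to return a freshly allocated `char *` (the proposal's bare `char` return type is taken to be a typo):

```c
#include <stdlib.h>
#include <string.h>

/* The proposed converter interface, with allocating conversions. */
struct CvtInterface {
    char *(*cvt_FromUTF8)(void *internalData, const char *src, size_t bytes);
    char *(*cvt_ToUTF8)(void *internalData, const char *src, size_t bytes);
    void  (*cvt_Free)(void *internalData);
    void  *internalData;
};

/* The trivial UTF-8 <-> UTF-8 converter: a plain copy. */
static char *utf8_copy(void *internalData, const char *src, size_t bytes)
{
    char *out = malloc(bytes + 1);
    (void)internalData;          /* identity converter keeps no state */
    memcpy(out, src, bytes);
    out[bytes] = '\0';
    return out;
}

static struct CvtInterface utf8_to_utf8_New(void)
{
    struct CvtInterface cvt = { utf8_copy, utf8_copy, NULL, NULL };
    return cvt;
}
```

An SJIS or UCS-2 converter would supply its own three functions in the same slots, and the rest of the driver would never need to know which charset is in play.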


Dave Page wrote:

>Hmmm, well I can't remove the Unicode functions. Do your apps request
>SQL_C_WCHAR etc even if the driver doesn't offer it?
>
><snip>
>
>Hmm, I hate multibyte :-(!!
>
>Regards, Dave

Re: Unicode support

From
"Dave Page"
Date:

> -----Original Message-----
> From: pgsql-odbc-owner@postgresql.org
> [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of Marko Ristola
> Sent: 01 September 2005 18:21
> Cc: Hiroshi Saito; Anoop Kumar; pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Unicode support
>
>
> Hi all.

Hi Marko,

> How about creating a charset conversion interface
> and taking UTF-8 as an internal format for ODBC?:
>

<snip>

>
> So, there would be a single internal UTF-8 format inside PsqlODBC.
> The backend could always deliver UTF-8, so the need for internal
> format <-> backend format layer is not needed.
>
> This implementation would be easy to implement.

This is what already happens (if you ignore my recent experimental
patch).

If the connection is made using one of the *W connect functions, then
the ConnectionClass->unicode flag is set to true, and SET
client_encoding = 'UTF-8' is sent to the backend. From then on, data
going out to the client is fed through utf8_to_ucs2_lf() *if* the data
type is specified as SQL_C_WCHAR, and data coming in to *W functions is
fed through ucs2_to_utf8().

Afaict, Unicode mode works exactly as it should.
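For reference, the kind of transformation ucs2_to_utf8() performs on each BMP code unit looks roughly like this (an illustrative encoder only; the driver's real function has a different signature and also handles whole buffers and line feeds):

```c
#include <stddef.h>

/* Encode one UCS-2 (BMP) code unit as UTF-8; returns the byte count. */
static size_t ucs2_char_to_utf8(unsigned short c, unsigned char *out)
{
    if (c < 0x80) {                      /* 1 byte: ASCII            */
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {                     /* 2 bytes: e.g. Latin-1    */
        out[0] = 0xC0 | (c >> 6);
        out[1] = 0x80 | (c & 0x3F);
        return 2;
    }
    out[0] = 0xE0 | (c >> 12);           /* 3 bytes: e.g. CJK range  */
    out[1] = 0x80 | ((c >> 6) & 0x3F);
    out[2] = 0x80 | (c & 0x3F);
    return 3;
}
```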

If the connection is made using a non-wide function, the
ConnectionClass->unicode is not set. In this case, the client is
expected to continue using non-wide functions, and the client encoding
left at default. In this case, the driver will never report data types
as SQL_C_WCHAR.

This is where I believe the major problem occurs - if the ODBC Driver
Manager sees that SQLConnectW (iirc) exists, it will automatically map
ANSI calls (eg. SQLConnect()) to Unicode (eg. SQLConnectW()). This then
causes the driver to report text/char columns as SQL_C_WCHAR. Less well
written apps then fall over because they aren't clever enough to request
data as SQL_C_CHAR instead of SQL_C_WCHAR.

My recent experimental patch aims to address this, by forcing the driver
to report SQL_C_CHAR instead of SQL_C_WCHAR for non-unicode databases.
This should (and seems to, with minor side effects yet to be fully
investigated) fix the BDE problem.

As for multibyte (non-unicode) data such as Hiroshi's, my understanding
is that in the presence of a Unicode driver, apps are expected to use
Unicode (and in fact, are forced to by the driver manager's mapping of
ANSI function calls to Unicode calls).

Anoop, do you or any of your guys (or anyone else) know
unicode/multibyte/encoding well? I'm learning as I go at the moment, so
some more experienced help would be *really* appreciated.

Regards, Dave.

Re: Unicode support

From
"Anoop Kumar"
Date:
> Anoop, do you or any of your guys (or anyone else) know
> unicode/multibyte/encoding well? I'm learning as I go at the moment,
so
> some more experienced help would be *really* appreciated.

Sorry Dave, no one on my list! :-(

Regards

Anoop

Re: Unicode support

From
Marko Ristola
Date:
So, I don't have much experience with Windows ODBC, that's true.
Is it possible to compile psqlodbc with MinGW tools for Windows?

After some searching, I found that the GLib libraries are able to
convert UTF-8 into multibyte under Windows. Windows should also be
able to convert UTF-8 into multibyte and vice versa with its own
character set conversion functions.

I also found that Windows XP had a problem with Korean multibyte:

  "Windows XP Device Driver Does Not Convert Multibyte Data to Korean"
  Article ID: 817522

That was fixed in Service Pack 2.

So I ask you, how have you thought about these things:

If I have understood Windows correctly, it uses UCS-2 as its internal
Unicode character set. Linux prefers UTF-8. So, if we treat UCS-2 and
UTF-8 as equivalent inside psqlodbc, that makes sense. That's what has
already been implemented in psqlodbc for Windows.

Then there is the world before Unicode existed. There were DOS codepages,
character sets for groups of countries and Multibyte character sets.

JIS X 0208 is a character set (see man 7 charsets).
Shift_JIS is an encoding that can contain JIS X 0208 multibyte
characters (see man 7 charsets).

So it seems that one working implementation can be done by using a
UTF-8 PostgreSQL server and UTF-8 to multibyte conversions.

However, according to the Samba team's Unicode problem descriptions,
there are some problems: UTF-8 to EUC_JP conversion may differ between
Linux and Windows, and between different conversion library
implementations.

Some multibyte character sets are contradictory with each other.

If we drop the *W() functions, we might get a working implementation,
but then we might not support the full ODBC API.

So it works if and only if one single conversion library does the
conversions.

So if and only if the PostgreSQL backend, or only the psqlodbc side,
does the needed conversions, psqlodbc should work with multibyte
encodings and with UTF-8. If the PostgreSQL server is in the same kind
of Windows environment as the clients, it should work fully with UTF-8
and the multibyte character sets. This should be the best working
option.

Windows does have a working UCS-2 to multibyte conversion implementation
on the psqlodbc client (since Service Pack 2).

Unfortunately pg_dump + restore from SJIS into UTF-8 might not work,
because Linux's ICONV might not do the conversion correctly.

The conversion into UTF-8 must be done using fully working Windows
conversion functions. So one way might be to use a pg_dump under
Windows that does the multibyte into UTF-8 conversion on the Windows
side.
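On the Linux side, a single client-side conversion through iconv(3), the library mentioned above, might look like this (a minimal sketch; error handling and encoding names may need adjusting per platform):

```c
#include <iconv.h>
#include <string.h>

/* Convert a UTF-8 buffer to LATIN1 with iconv(3); returns the number
 * of output bytes, or (size_t)-1 on failure. */
static size_t utf8_to_latin1(const char *in, size_t inlen,
                             char *out, size_t outlen)
{
    iconv_t cd = iconv_open("LATIN1", "UTF-8");
    char *inp = (char *)in, *outp = out;
    size_t inleft = inlen, outleft = outlen, rc;

    if (cd == (iconv_t)-1)
        return (size_t)-1;
    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    return rc == (size_t)-1 ? (size_t)-1 : outlen - outleft;
}
```

Whether two such libraries (Windows and glibc/ICONV) agree for the Asian character sets is exactly the Samba-style problem described above.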

How about the following implementation:
ODBC against the backend:
- Backend has multibyte characters.
- Windows uses multibyte characters.
- psqlodbc has UTF-8 as its internal format.

=> A fully working implementation:
- Backend delivers multibyte characters.
- psqlodbc converts them into UTF-8.
- psqlodbc delivers multibyte characters to the client
using utf8-to-locale Windows functions, when necessary.

So the solution might be to do all conversions on the client side!
The reasoning for this is that two separate conversion libraries might
be contradictory with each other, at least with the Asian character
sets. (On Macs, the UTF-8 implementation differs from the standard.)

Alternatively, Asian users could move to UTF-8 as their PostgreSQL
server's backend encoding. That's the other solution to the same
problem. Then the PostgreSQL server doesn't have to do the conversion.

It does not seem possible to do all the conversions inside the
PostgreSQL server under Windows, because of the xx() -> xxW() mapping
inside the Windows ODBC manager. We can't control that.

What do you think about these thoughts?

Marko Ristola

Hiroshi Saito wrote:

>Hi Dave.
>
>I tried your patch by SJIS of Japan. It seems that it needs some additional
>correction. Moreover, it is necessary to make the driver different from
>UNICODE (WideCharacter). It seems that I have to catch up further.
>
>BTW, I remembered the discussion original by pgAdminIII. I said that I
>should support MullutiByte then. However, How is it now? It is very wonderful.
>I feel that that there are many choices of a character code complicates a problem
>more. but, it is although external environment is different.
>
>Regards,
>Hiroshi Saito
>
>------------------------------------------------------------------------
>
>--- convert.c.orig    Thu Aug  4 21:26:57 2005
>+++ convert.c    Thu Sep  1 04:38:45 2005
>@@ -762,7 +762,7 @@
>                 {
>                     BOOL lf_conv = conn->connInfo.lf_conversion;
>
>-                    if (fCType == SQL_C_WCHAR)
>+                    if ((conn->unicode && conn->report_wide_types) && (fCType == SQL_C_WCHAR))
>                     {
>                         len = utf8_to_ucs2_lf(neut_str, -1, lf_conv, NULL, 0);
>                         len *= WCLEN;
>@@ -778,7 +778,7 @@
>                     }
>                     else
> #ifdef    WIN32
>-                    if (fCType == SQL_C_CHAR)
>+                    if ((conn->unicode && conn->report_wide_types) && (fCType == SQL_C_CHAR))
>                     {
>                         wstrlen = utf8_to_ucs2_lf(neut_str, -1, lf_conv, NULL, 0);
>                         allocbuf = (SQLWCHAR *) malloc(WCLEN * (wstrlen + 1));
>@@ -810,7 +810,7 @@
>                             pgdc->ttlbuflen = len + 1;
>                         }
>
>-                        if (fCType == SQL_C_WCHAR)
>+                        if ((conn->unicode && conn->report_wide_types) && (fCType == SQL_C_WCHAR))
>                         {
>                             utf8_to_ucs2_lf(neut_str, -1, lf_conv, (SQLWCHAR *) pgdc->ttlbuf, len / WCLEN);
>                         }
>@@ -824,7 +824,7 @@
>                         }
>                         else
> #ifdef    WIN32
>-                        if (fCType == SQL_C_CHAR)
>+                        if ((conn->unicode && conn->report_wide_types) && (fCType == SQL_C_CHAR))
>                         {
>                             len = WideCharToMultiByte(CP_ACP, 0, allocbuf, wstrlen, pgdc->ttlbuf, pgdc->ttlbuflen,
NULL,NULL); 
>                             free(allocbuf);
>@@ -871,7 +871,7 @@
>
>                     copy_len = (len >= cbValueMax) ? cbValueMax - 1 : len;
>
>-                    if (fCType == SQL_C_WCHAR)
>+                    if ((conn->unicode && conn->report_wide_types) && (fCType == SQL_C_WCHAR))
>                     {
>                         copy_len /= WCLEN;
>                         copy_len *= WCLEN;
>@@ -911,7 +911,7 @@
>                         memcpy(rgbValueBindRow, ptr, copy_len);
>                         /* Add null terminator */
>
>-                        if (fCType == SQL_C_WCHAR)
>+                        if ((conn->unicode && conn->report_wide_types) && (fCType == SQL_C_WCHAR))
>                             memset(rgbValueBindRow + copy_len, 0, WCLEN);
>                         else
>
>@@ -942,7 +942,7 @@
>                 break;
>         }
>
>-        if (SQL_C_WCHAR == fCType && ! wchanged)
>+        if ((conn->unicode && conn->report_wide_types) && (SQL_C_WCHAR == fCType && ! wchanged))
>         {
>             if (cbValueMax > (SDWORD) (WCLEN * (len + 1)))
>             {
>@@ -2629,6 +2629,8 @@
>                 case SQL_WCHAR:
>                 case SQL_WVARCHAR:
>                 case SQL_WLONGVARCHAR:
>+                    if (conn->unicode && conn->report_wide_types)
>+                    {
>                     if (SQL_NTS == used)
>                         used = strlen(buffer);
>                     allocbuf = malloc(WCLEN * (used + 1));
>@@ -2637,6 +2639,11 @@
>                     buf = ucs2_to_utf8((SQLWCHAR *) allocbuf, used, (UInt4 *) &used, FALSE);
>                     free(allocbuf);
>                     allocbuf = buf;
>+                    }
>+                    else
>+                    {
>+                        buf = buffer;
>+                    }
>                     break;
>                 default:
>                     buf = buffer;
>@@ -2647,10 +2654,17 @@
>             break;
>
>         case SQL_C_WCHAR:
>+            if (conn->unicode && conn->report_wide_types)
>+            {
>             if (SQL_NTS == used)
>                 used = WCLEN * wcslen((SQLWCHAR *) buffer);
>             buf = allocbuf = ucs2_to_utf8((SQLWCHAR *) buffer, used / WCLEN, (UInt4 *) &used, FALSE);
>             used *= WCLEN;
>+            }
>+            else
>+            {
>+                buf = buffer;
>+            }
>             break;
>
>         case SQL_C_DOUBLE:
>--- psqlodbc_win32.def.orig    Thu Sep  1 04:41:37 2005
>+++ psqlodbc_win32.def    Thu Sep  1 04:42:08 2005
>@@ -78,31 +78,3 @@
> DllMain @201
> ConfigDSN @202
>
>-SQLColAttributeW    @101
>-SQLColumnPrivilegesW    @102
>-SQLColumnsW        @103
>-SQLConnectW        @104
>-SQLDescribeColW        @106
>-SQLExecDirectW        @107
>-SQLForeignKeysW        @108
>-SQLGetConnectAttrW    @109
>-SQLGetCursorNameW    @110
>-SQLGetInfoW        @111
>-SQLNativeSqlW        @112
>-SQLPrepareW        @113
>-SQLPrimaryKeysW        @114
>-SQLProcedureColumnsW    @115
>-SQLProceduresW        @116
>-SQLSetConnectAttrW    @117
>-SQLSetCursorNameW    @118
>-SQLSpecialColumnsW    @119
>-SQLStatisticsW        @120
>-SQLTablesW        @121
>-SQLTablePrivilegesW    @122
>-SQLDriverConnectW    @123
>-SQLGetDiagRecW        @124
>-SQLGetStmtAttrW        @125
>-SQLSetStmtAttrW        @126
>-SQLSetDescFieldW    @127
>-SQLGetTypeInfoW        @128
>-SQLGetDiagFieldW    @129
>
>
>------------------------------------------------------------------------
>
>


Re: Unicode support

From
Marko Ristola
Date:
Marko Ristola wrote:

>However, according to Samba team's UNICODE problem descriptions,
>there are some problems: UTF-8 to EUC_JP conversion may be different
>on Linux and Windows, and on different conversion library implementations.
>
>
>

This was the Samba reference; I recommend reading the applicable parts.
http://www.samba.org/samba/docs/man/Samba-HOWTO-Collection/unicode.html

I hope that converting multibyte into UTF-8 and vice versa is possible.
If not, disabling UTF-8 and UCS-2 seems to be the only workable choice :(

Regards,
Marko

>Some multibyte character sets are contradictory with each other.
>
>
>


Re: Unicode support

From
"Miguel Juan"
Date:
----- Original Message -----
From: "Dave Page" <dpage@vale-housing.co.uk>
To: "Miguel Juan" <mjuan@cibal.es>
Cc: <pgsql-odbc@postgresql.org>
Sent: Thursday, September 01, 2005 10:32 AM
Subject: RE: [ODBC] Unicode support


Yes please - attached.

> -----Original Message-----
> From: Miguel Juan [mailto:mjuan@cibal.es]
> Sent: 01 September 2005 09:25
> To: Dave Page
> Cc: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Unicode support
>
> hello,
>
> I will try it with BDE environment if you want.
>
> Regards,
>
> Miguel Juan
>
>
> ----- Original Message -----
> From: "Dave Page" <dpage@vale-housing.co.uk>
> To: "Anoop Kumar" <anoopk@pervasive-postgres.com>
> Cc: <pgsql-odbc@postgresql.org>
> Sent: Tuesday, August 30, 2005 11:36 PM
> Subject: [ODBC] Unicode support
>
>
> Hi Anoop and anyone else who might be interested,
>
> I've been thinking about how the Unicode support might be improved to
> allow the old 07.xx non-unicode style behaviour to work for those that
> need it. At them moment, when we connect using on of the wide connect
> functions, the CC->unicode flag is set to true. This only
> affects a few
> options, such as pgtype_to_concise_type()'s mapping of PG types to SQL
> types.
>
> It seems to me that perhaps we should set CC->unicode = 1 only upon
> connection to a Unicode database. Anything else, we leave it
> set to 0 so
> that it always maps varchars etc to ANSI types, and handles other
> encodings in single byte or non-unicode multibyte mode (which worked
> fine in 07.xx where those encodings were appropriate, such as SJIS in
> Japan). This should also help BDE based apps, which further
> research has
> shown me are broken with Unicode columns in SQL Server and Oracle as
> well as PostgreSQL (search unicode + BDE on Google Groups for more).
>
> Am I seeing a possible improvement where in fact there isn't one, or
> missing some obvious downside?
>
> Regards, Dave.
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>
>
>



Re: Unicode support

From
"Miguel Juan"
Date:
Hello Dave,

I'm trying the latest fix (for BDE) and I can see some odd behavior.

- It shows TEXT fields as MEMO, but you can see the data if you
double-click on it. It looks like it doesn't use the "Text as LongVarchar"
option (this works in version 7.x).

- After a "SELECT * FROM table", the Borland SQL Explorer shows an error
('Invalid Blob Handle') for empty (NULL) TEXT fields when you try to view
them. This works fine in table view.

- After an error inserting a row (a 'not null' constraint violation), it
closes the connection (dead connection error)


Regards,

Miguel Juan







Re: Unicode support

From
"Miguel Juan"
Date:
hello,

I will try it with BDE environment if you want.

Regards,

Miguel Juan





Re: Unicode support

From
Marko Ristola
Date:
Marc Herbert wrote:

>Marko Ristola <Marko.Ristola@kolumbus.fi> writes:
>
>
>>So I ask you, how you have thought about these things:
>>
>>If I have understood Windows correctly, it uses UCS-2 as it's internal
>>UNICODE
>>character set. Linux prefers into UTF-8.
>>
>>
>
>I am not sure what you mean by "internal UNICODE character set", but I
>understand that Linux does prefer UTF-32, NOT UTF-8 !
>
>
>

If you want to know the details of UTF-8's encoding, the following
is recommended reading (a Linux manual page) :)

man utf-8

It gives a good explanation of the encoding used in UTF-8.

UTF-8 uses from one to four bytes per character.
It supports almost all character sets in the world.
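A minimal sketch of that one-to-four-byte scheme, following the bit
layouts described in utf-8(7) (utf8_encode is a hypothetical helper
name, not anything from the driver):

```c
/* Encode one Unicode code point as UTF-8, returning how many bytes
 * (1..4) were written to out, or 0 if the code point is out of range.
 * The bit layouts follow the table in the utf-8(7) manual page. */
static int utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                  /* 1 byte:  0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                 /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {               /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {              /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Plain ASCII stays one byte, which is why so many existing programs keep
working unchanged under UTF-8.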

Because the task is so huge, variants and bugs exist in the
implementations. That's what I read in the Samba filesystem FAQ.

So, if you stick with the Windows implementation, you won't find
any bugs, but when you move a file to another operating system,
it might look different :(

UCS-2 is a 32-bit Unicode wchar_t type. According to the
Linux manuals, wchar_t is not equal on all implementations.
According to the manuals, inside binary files it is recommended in C
to use UTF-8 strings, which are then converted at runtime into
the wchar_t type. The Java language is another story, though it might
have the same problems. The number remains the same, but
if you try to draw the character in a window with
different implementations, you might get different drawings.
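That recommended runtime conversion from UTF-8 strings into wchar_t can
be done with the C library's mbstowcs(). A minimal sketch, assuming a
UTF-8 locale such as C.UTF-8 is installed (utf8_to_wchar is a
hypothetical helper name):

```c
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Decode a UTF-8 string into the platform's wchar_t representation
 * (32-bit on glibc, 16-bit units on some other systems).  mbstowcs()
 * interprets its input according to the current locale, so a UTF-8
 * locale must be selected first.  Returns the number of wide
 * characters written, or (size_t)-1 on failure. */
size_t utf8_to_wchar(const char *utf8, wchar_t *out, size_t outlen)
{
    if (setlocale(LC_CTYPE, "C.UTF-8") == NULL &&
        setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)
        return (size_t)-1;          /* no UTF-8 locale available */
    return mbstowcs(out, utf8, outlen);
}
```

Note the locale dependence: the same call decodes LATIN1 instead if a
LATIN1 locale is active, which is exactly the implementation difference
being discussed.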


>On all platforms I had a look at, variable-length encodings are only
>for disk and network, never used in memory.
>
>Don't you agree?
>
>
> locale
LANG=fi_FI.UTF-8@euro
LC_CTYPE="fi_FI.UTF-8@euro"
LC_NUMERIC="fi_FI.UTF-8@euro"
LC_TIME="fi_FI.UTF-8@euro"
LC_COLLATE="fi_FI.UTF-8@euro"
LC_MONETARY="fi_FI.UTF-8@euro"
LC_MESSAGES="fi_FI.UTF-8@euro"
LC_PAPER="fi_FI.UTF-8@euro"
LC_NAME="fi_FI.UTF-8@euro"
LC_ADDRESS="fi_FI.UTF-8@euro"
LC_TELEPHONE="fi_FI.UTF-8@euro"
LC_MEASUREMENT="fi_FI.UTF-8@euro"
LC_IDENTIFICATION="fi_FI.UTF-8@euro"
LC_ALL=

So, under Linux nowadays, UTF-8 is used very much.
Just as Windows recommends everybody to move to its
native Windows Unicode characters (UCS-2), under Linux
it is recommended to move to UTF-8. Both are Unicode
character encodings. The UCS-2 encoding is just simpler:
an integer that has a numerical value.

The reason for the popularity of UTF-8 under Linux is that each
program needs very little adjustment to be able to move
from a LATIN1-style encoding to UTF-8.

Happy studying about Unicode character sets :)

Regards,
Marko Ristola




Re: Unicode support

From
Marko Ristola
Date:
Marc Herbert wrote:

>On Thu, Sep 08, 2005 at 08:22:50PM +0300, Marko Ristola wrote:
>
>
>>Marc Herbert wrote:
>>
>>
>>
>>>Marko Ristola <Marko.Ristola@kolumbus.fi> writes:
>>>
>>>
>>>
>>>
>
>Actually my question was just: what do you mean by 'internal'?
>
>Usually 'internal" means 'in memory', and I really don't think there
>is any application/system using UTF-8 in memory, is there?
>
>
>
I'm sorry.

I hope that this time I'll answer the right question.
Unfortunately, this is also a lengthy answer.


By internal Unicode I meant the wchar_t type on Linux and Unixes,
or the Windows internal Unicode (TCHAR).

I tried to find out more about the wchar_t (TCHAR) implementations - the
internal Unicode representations.

According to the libc info pages:

Under GNU/Linux, wchar_t is implemented as 32-bit UCS-4 characters.
My earlier assumption, that Linux uses UCS-2, is wrong :(
My earlier assumption, that UCS-2 is 32-bit, is wrong :(

Some Unix systems implement wchar_t as 16-bit UCS-2 characters.
This means that if they want to cover the character space beyond
what 16 bits can hold, they can do so by using pairs of certain
UCS-2 code units (surrogate pairs). This is the UTF-16 format
(a multibyte extension of UCS-2).
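Those "pairs of certain UCS-2 characters" are UTF-16 surrogate pairs; a
minimal sketch of how a code point above U+FFFF splits into one
(utf16_surrogates is a hypothetical helper name):

```c
/* Split a code point above U+FFFF into a UTF-16 surrogate pair.
 * Returns 1 on success, 0 if the code point fits in one 16-bit unit
 * (BMP) or lies outside the Unicode range. */
static int utf16_surrogates(unsigned long cp,
                            unsigned short *hi, unsigned short *lo)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;                          /* no surrogates needed/possible */
    cp -= 0x10000;                         /* 20 bits remain */
    *hi = (unsigned short)(0xD800 | (cp >> 10));   /* high surrogate */
    *lo = (unsigned short)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return 1;
}
```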

- If I remember correctly: Java uses 16-bit Unicode, meaning that it
uses UCS-2 or UTF-16.
- According to the psqlodbc implementation, Windows uses UCS-2 as its
internal format. This also implies that Windows might actually use the
UTF-16 multibyte format internally, because UCS-2 is a subset of UTF-16.
Of course, Windows is capable of creating its own private standards.

The PostgreSQL ODBC driver for Windows already has UTF-8 to UCS-2
character set conversion functions.

The PostgreSQL ODBC driver for Linux still lacks UTF-8 to UCS-4
character set conversion functions.


The PostgreSQL server delivers query results as UTF-8 to the Windows
ODBC driver, which then converts the UTF-8 data into UCS-2. So the
psqlodbc driver uses UTF-8 internally under Windows.
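For illustration, a conversion of that kind can be sketched in a few
lines of C. This is a simplified stand-in, not the driver's actual
code, and utf8_to_ucs2_char is a hypothetical name:

```c
/* Decode one UTF-8 sequence into a single 16-bit UCS-2 unit.
 * Returns the number of input bytes consumed (1..3), or 0 on invalid
 * input.  Code points above U+FFFF (4-byte UTF-8 sequences) do not fit
 * in UCS-2 without surrogate pairs, so they are rejected here. */
static int utf8_to_ucs2_char(const unsigned char *s, unsigned short *out)
{
    if (s[0] < 0x80) {                                   /* ASCII */
        *out = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) { /* 2 bytes */
        *out = (unsigned short)(((s[0] & 0x1F) << 6) | (s[1] & 0x3F));
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 &&
        (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) { /* 3 bytes */
        *out = (unsigned short)(((s[0] & 0x0F) << 12) |
                                ((s[1] & 0x3F) << 6)  |
                                 (s[2] & 0x3F));
        return 3;
    }
    return 0;   /* 4-byte sequences (> U+FFFF) don't fit in UCS-2 */
}
```

A UCS-4 variant for Linux would look almost identical, except the
output is 32-bit and 4-byte sequences can be accepted.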

UTF-8 use cases under Linux:

- OpenOffice files are stored as UTF-8.
- Emacs and many other editors store files as UTF-8.
- LATIN1 and other non-Unicode encodings still work, but are less used.
- Some programs still don't work with UTF-8.

File names and file contents are nowadays stored as UTF-8.
Many programs use UTF-8 internally, not the wchar_t format UCS-4.

There might be some new programs and editors (GNOME, KDE??) that are
written from scratch and might use wchar_t (UCS-4) internally. Java
programs always use UCS-2 (or UTF-16) as their internal format, so it
isn't the wchar_t Unicode. The Java way is standardized.

Terminals nowadays use UTF-8. Over the network, UTF-8 works (ideally;
in reality??) from Windows to Linux, and from Mac to Linux. So a
common format is good.

So, under Linux every program may choose:
- to store file names as UTF-8 or as LATIN1. This is actually a bad
thing: the behaviour depends on which character set you selected
before logging in.
- to store files as UTF-8 or as LATIN1. This is again based on the
console login setting.
- to do any character conversions they like, with the libiconv library.
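As an example of the last point, converting a LATIN1 buffer to UTF-8
through the iconv(3) API might look like this (latin1_to_utf8 is a
hypothetical helper; error handling is minimal):

```c
#include <iconv.h>
#include <stddef.h>

/* Convert a LATIN1 (ISO-8859-1) buffer to UTF-8 using iconv(3).
 * Returns the number of UTF-8 bytes produced, or (size_t)-1 on error
 * (unknown encoding pair, output buffer too small, etc.). */
size_t latin1_to_utf8(const char *in, size_t inlen,
                      char *out, size_t outsize)
{
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    char *inp = (char *)in;      /* iconv() wants non-const pointers */
    char *outp = out;
    size_t inleft = inlen, outleft = outsize;

    if (cd == (iconv_t)-1)
        return (size_t)-1;
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        iconv_close(cd);
        return (size_t)-1;
    }
    iconv_close(cd);
    return outsize - outleft;    /* bytes of UTF-8 written */
}
```

Every LATIN1 byte above 0x7F becomes a two-byte UTF-8 sequence, so the
output can be up to twice the input size.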


>
>
>
>
>>>locale
>>>
>>>
>>The reason for the popularity of UTF-8 under Linux is, that each
>>program needs to be adjusted very little to be able to move
>>from LATIN1 style encoding into UTF-8.
>>
>>
>
>Again, are you talking about memory, disk/network?
>
>This is definitely not the same thing IMHO.
>
>
>
So when you ask about memory and disk: the answer is that each
application chooses its own character set formats. Usually the
environment variables affect the selection.

So when you ask about the network character set: network interaction
is standardized. Many Unixes, Linux and Windows try to conform to
these network standards.

Regards, Marko Ristola