Thread: Continuing encoding fun....

Continuing encoding fun....

From
"Dave Page"
Date:
I've been thinking about this whilst getting dragged round the shops
today, and having read Marko's, Johann's, Hiroshi's and other emails,
not to mention bits of the ODBC spec, here's where I think we stand.

1) The current driver works as expected with Unicode apps.

2) 7 bit ASCII apps work correctly. The driver manager maps the ANSI
functions to the Unicode ones, and this works because (as I think Marko
pointed out) the basic latin chars map directly into the lower Unicode
characters (see http://www.unicode.org/charts/PDF/U0000.pdf).

3) Some other single byte LATIN encodings do not work. This is because
the characters do not map directly into Unicode 80-FF
(http://www.unicode.org/charts/PDF/U0080.pdf).

4) Multibyte apps do not work. I believe that in fact they never will
with a Unicode driver, because multibyte characters simply won't map
into Unicode in the same way that ASCII does. The user cannot opt to use
the non-wide functions, because the DM automatically maps them to the
Unicode versions.

Because the Driver Manager forces the user to use the *W functions if
they exist, I cannot see any way to make 3 or 4 work with a Unicode
driver. If we were to try to detect what encoding to use based on the OS
settings and convert on the fly, we would most likely break any apps
that try to do the right thing by using Unicode themselves. Does that
sound reasonable?
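
As a rough, standalone illustration of 3 and 4 (this is not driver code,
just a sketch): take 0xA9, which is 'Š' in ISO-8859-2. Widening the byte
value directly, as the ANSI-to-wide mapping effectively does, yields
U+00A9 (the copyright sign) instead of the U+0160 the app meant:

#include <stdio.h>

int main(void)
{
    unsigned char latin2_byte = 0xA9;   /* 'Š' in ISO-8859-2 */
    unsigned int widened = latin2_byte; /* naive widening, no charset-aware conversion */

    printf("U+%04X\n", widened);        /* prints U+00A9, not the U+0160 we wanted */
    return 0;
}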

Therefore, it seems to me that the only thing to do is to reinstate the
#ifdef UNICODE preprocessor definitions in the source code (that I now
wish I hadn't removed!), and ship 2 versions of the driver - a Unicode
one, and an ANSI/Multibyte version (i.e. what 07.xx was).

Thoughts/comments? Hiroshi, what do other vendors do for the Japanese
market?

Regards, Dave.

Re: Continuing encoding fun....

From
"Hiroshi Saito"
Date:
Hi Dave.

> With this patch, you can build either the old style ANSI/Multibyte
> driver, or the Unicode driver. I've also removed the -libpq suffix that
> was added for testing, as this patch gives the driver a new name anyway.
> When installed on a Windows system, you then get:
>
> psqlodbca.dll "PostgreSQL ANSI"
> psqlodbcw.dll "PostgreSQL Unicode"

Is it meant as follows after all?
with libpq version
psqlodbca.dll "PostgreSQL ANSI"
psqlodbcw.dll "PostgreSQL Unicode"
without libpq version
psqlodbca.dll "PostgreSQL ANSI"
psqlodbcw.dll "PostgreSQL Unicode"

>
> Unless anyone has a better solution, I think this is the best fix to
> allow users with non-Unicode friendly apps to work as they used to with
> the older driver.

One complaint, although I have not fully tried it yet :-(
I think the CRLF line endings of the base code and the patch make it hard to use.

>
> Please shout ASAP if you object!!

one vote is invested.

Regards,
Hiroshi Saito

Attachment

Re: Continuing encoding fun....

From
"Dave Page"
Date:

> -----Original Message-----
> From: Hiroshi Saito [mailto:saito@inetrt.skcapi.co.jp]
> Sent: 05 September 2005 05:57
> To: Dave Page; pgsql-odbc@postgresql.org
> Cc: Marko Ristola; Johann Zuschlag; Anoop Kumar
> Subject: Re: [ODBC] Continuing encoding fun....
>
> Hi Dave.
>
> > With this patch, you can build either the old style ANSI/Multibyte
> > driver, or the Unicode driver. I've also removed the -libpq
> suffix that
> > was added for testing, as this patch gives the driver a new
> name anyway.
> > When installed on a Windows system, you then get:
> >
> > psqlodbca.dll "PostgreSQL ANSI"
> > psqlodbcw.dll "PostgreSQL Unicode"
>
> Is it meant as follows after all?
> with libpq version
> psqlodbca.dll "PostgreSQL ANSI"
> psqlodbcw.dll "PostgreSQL Unicode"
> without libpq version
> psqlodbca.dll "PostgreSQL ANSI"
> psqlodbcw.dll "PostgreSQL Unicode"

Yes - I am not concerned with the socket version of the driver - in
fact, I was going to talk to Anoop about removing the old code because
we've had at least a couple of cases of people patching the wrong part,
or mistakenly using the socket code.

Either way, we're certainly not going to release the non-libpq version
any more.

> > Unless anyone has a better solution, I think this is the best fix to
> > allow users with non-Unicode friendly apps to work as they
> used to with
> > the older driver.
>
> Some complaint. Although I have not fully tried yet.-(
> I think that CRLF of a base code and patch is what it is hard to use.

Yes, I was too tired to try to fix the patch to remove the CRLF changes
:-( Still, they need to be fixed anyway.

BTW, your version misses the changes to installer/psqlodbcm.wxs...

> > Please shout ASAP if you object!!
>
> one vote is invested.

:-)

Regards, Dave

Re: Continuing encoding fun....

From
Johann Zuschlag
Date:
Dave Page schrieb:

>
>
>
>
>>-----Original Message-----
>>From: pgsql-odbc-owner@postgresql.org
>>[mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of Dave Page
>>Sent: 03 September 2005 20:48
>>To: pgsql-odbc@postgresql.org
>>Cc: Hiroshi Saito; Marko Ristola; Johann Zuschlag
>>Subject: [ODBC] Continuing encoding fun....
>>
>>Therefore, it seems to me that the only thing to do is to
>>reinstate the
>>#ifdef UNICODE preprocessor definitions in the source code (that I now
>>with I hadn't removed!), and ship 2 versions of the driver - a Unicode
>>one, and an ANSI/Multibyte version (ie. What 07.xx was).
>>
>>
>
>Attached is a patch to do this (apologies for the size, it seems that
>options.c had broken line ends).
>
>With this patch, you can build either the old style ANSI/Multibyte
>driver, or the Unicode driver. I've also removed the -libpq suffix that
>was added for testing, as this patch gives the driver a new name anyway.
>When installed on a Windows system, you then get:
>
>psqlodbca.dll "PostgreSQL ANSI"
>psqlodbcw.dll "PostgreSQL Unicode"
>
>Unless anyone has a better solution, I think this is the best fix to
>allow users with non-Unicode friendly apps to work as they used to with
>the older driver.
>
>Please shout ASAP if you object!!
>
>Regards, Dave
>
>
It is ok for me.

Can you send me the dll for the ANSI Driver?

Is it not possible to just put a switch in the driver menu?

Regards,
Johann


Re: Continuing encoding fun....

From
"Dave Page"
Date:

> -----Original Message-----
> From: Johann Zuschlag [mailto:zuschlag2@online.de]
> Sent: 05 September 2005 10:40
> To: pgsql-odbc@postgresql.org
> Cc: Dave Page
> Subject: Re: [ODBC] Continuing encoding fun....
>
> It is ok for me.
>
> Can you send me the dll for the ANSI Driver?

Yup, I'll send it offlist.

> It is not possible to just put a switch in the driver menu?

Unfortunately not, because it affects the functions exported by the DLL -
if the *W functions exist, the DM will map all calls to the *W versions,
even if the app uses the non-wide version.

Regards, Dave

Re: Continuing encoding fun....

From
"Hiroshi Saito"
Date:
> Either way, we're certainly not going to release the non-libpq version
> any more.

Ok, I also think that it is reasonable.

> BTW, your version misses the changes to installer/psqlodbcm.wxs...

Uga... Sorry.

Ah.. I found a part that looks strange.
Please check it. :-)

Regards,
Hiroshi Saito

Attachment

Re: Continuing encoding fun....

From
"Anoop Kumar"
Date:
Hi Dave,

It would be wise to remove the socket code from the new driver. I will
let you know as soon as it gets completed.

Regards

Anoop

> -----Original Message-----
> From: Dave Page [mailto:dpage@vale-housing.co.uk]
> Sent: Monday, September 05, 2005 12:47 PM
> To: Hiroshi Saito; pgsql-odbc@postgresql.org
> Cc: Marko Ristola; Johann Zuschlag; Anoop Kumar
> Subject: RE: [ODBC] Continuing encoding fun....
>
>
>
> > -----Original Message-----
> > From: Hiroshi Saito [mailto:saito@inetrt.skcapi.co.jp]
> > Sent: 05 September 2005 05:57
> > To: Dave Page; pgsql-odbc@postgresql.org
> > Cc: Marko Ristola; Johann Zuschlag; Anoop Kumar
> > Subject: Re: [ODBC] Continuing encoding fun....
> >
> > Hi Dave.
> >
> > > With this patch, you can build either the old style ANSI/Multibyte
> > > driver, or the Unicode driver. I've also removed the -libpq
> > suffix that
> > > was added for testing, as this patch gives the driver a new
> > name anyway.
> > > When installed on a Windows system, you then get:
> > >
> > > psqlodbca.dll "PostgreSQL ANSI"
> > > psqlodbcw.dll "PostgreSQL Unicode"
> >
> > Is it meant as follows after all?
> > with libpq version
> > psqlodbca.dll "PostgreSQL ANSI"
> > psqlodbcw.dll "PostgreSQL Unicode"
> > without libpq version
> > psqlodbca.dll "PostgreSQL ANSI"
> > psqlodbcw.dll "PostgreSQL Unicode"
>
> Yes - I am not concerned with the socket version of the driver - in
> fact, I was going to talk to Anoop about removing the old code because
> we've had at least a couple of cases of people patching the wrong part,
> or mistakenly using the socket code.
>
> Either way, we're certainly not going to release the non-libpq version
> any more.
>
> > > Unless anyone has a better solution, I think this is the best fix to
> > > allow users with non-Unicode friendly apps to work as they
> > used to with
> > > the older driver.
> >
> > Some complaint. Although I have not fully tried yet.-(
> > I think that CRLF of a base code and patch is what it is hard to use.
>
> Yes, I was too tired to try to fix the patch to remove the CRLF changes
> :-( Still, they need to be fixed anyway.
>
> BTW, your version misses the changes to installer/psqlodbcm.wxs...
>
> > > Please shout ASAP if you object!!
> >
> > one vote is invested.
>
> :-)
>
> Regards, Dave

Re: Continuing encoding fun....

From
"Dave Page"
Date:

> -----Original Message-----
> From: Anoop Kumar [mailto:anoopk@pervasive-postgres.com]
> Sent: 06 September 2005 06:25
> To: Dave Page; Hiroshi Saito; pgsql-odbc@postgresql.org
> Cc: Marko Ristola; Johann Zuschlag
> Subject: RE: [ODBC] Continuing encoding fun....
>
> Hi Dave,
>
> It would be wise to remove the socket code from the new driver. I will
> let you know as soon as it gets completed.

Now there's a coincidence - I was going to email you about that today!!

We've had a couple of instances of people mistakenly compiling the wrong
version, and even fixing bugs in the socket code :-(

Shall I apply the ANSI/Unicode patch first? It's quite invasive of
course - possibly more so than libpq/socket.

Regards, Dave

Re: Continuing encoding fun....

From
"Dave Page"
Date:

> -----Original Message-----
> From: Hiroshi Saito [mailto:saito@inetrt.skcapi.co.jp]
> Sent: 05 September 2005 18:35
> To: Dave Page; pgsql-odbc@postgresql.org
> Cc: Marko Ristola; Johann Zuschlag; Anoop Kumar
> Subject: Re: [ODBC] Continuing encoding fun....
>
> > Either way, we're certainly not going to release the
> non-libpq version
> > any more.
>
> Ok, I also think that it is accordant to reason.
>
> > BTW, your version misses the changes to installer/psqlodbcm.wxs...
>
> Uga... Sorry.
>
> Ah.. I look at a part strange one.
> Please check it.:-)


Re patch:

--- connection.c.orig    Tue Sep  6 01:47:23 2005
+++ connection.c    Tue Sep  6 02:13:53 2005
@@ -1545,7 +1545,7 @@
         if (self->unicode)
         {
             if (!self->client_encoding ||
-                !stricmp(self->client_encoding, "UNICODE"))
+                stricmp(self->client_encoding, "UNICODE"))
             {
                 QResultClass    *res;
                 if (PG_VERSION_LT(self, 7.1))

The opposite of this change was made in 1.92 of connection.c:
http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/psqlodbc/psqlodbc/connection.c?rev=1.92&content-type=text/x-cvsweb-markup

It seems to me that the current case is correct - in the Unicode driver
we *must* run with client_encoding = 'UNICODE' or it won't work
properly. That said, I wonder if we shouldn't just remove the if()
altogether, and unconditionally set the client encoding for the Unicode
driver.

Don't forget, this won't affect the ANSI/Multibyte case because it's
inside a "#ifdef UNICODE_SUPPORT".

What do you think Anoop?

Regards, Dave


Re: Continuing encoding fun....

From
"Anoop Kumar"
Date:
> Done. Once you've removed the socket code, a new release seems in order.
> Sound OK to you?

OK for me. A new release would be proper.

> It seems to me that the current case is correct - in the Unicode driver
> we *must* run with client_encoding = 'UNICODE' or it won't work
> properly. That said, I wonder if we shouldn't just remove the if()
> altogether, and unconditionally set the client encoding for the Unicode
> driver.
>
> Don't forget, this won't affect the ANSI/Multibyte case because it's
> inside a "#ifdef UNICODE_SUPPORT".
>
> What do you think Anoop?
>
As this is already inside "#ifdef UNICODE_SUPPORT", I don't find the
necessity for checking it again.

Regards
Anoop

Re: Continuing encoding fun....

From
zuschlag2@online.de
Date:
Hi Dave

>It seems to me that the current case is correct - in the Unicode driver
>we *must* run with client_encoding = 'UNICODE' or it won't work
>properly. That said, I wonder if we shouldn't just remove the if()
>altogether, and unconditionally set the client encoding for the Unicode
>driver.

That assumption seems to be OK, even though I still need to do further testing. But I can use the version you've sent me.

Regards,
Johann


Critical Bug with UseDeclareFetch in development version

From
Marko Ristola
Date:
There is one inconvenient bug in the driver:

I tested this with the very latest CVS version, so it still exists.


psql
> select * from test1;
(104 rows)

isql marko XXX
2 rows returned

So I get the above result by configuring .odbc.ini:
[marko]
Fetch = 2
UseDeclareFetch = 1

So unfortunately it fetches too few rows.

Old behaviour:
1. SELECT * from test1 with cursor STM7737819
2. While more rows; do Fetch at most 2 rows; done
3. CLOSE CURSOR maybe with STMT_DROP.

So this hack was implemented to support SELECT * of more than
a few million rows.

So without this hack, on 32 bit operating systems, query results are
limited to maybe 8 million rows before memory allocation failure. I don't
remember the exact number of millions. Of course the exact limit depends
heavily on the result row width and on the operating system's memory
architecture.

So, it seems that the hack implementation has been partially removed,
but it is still active.
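
For reference, a minimal ODBC client loop that shows the symptom (the DSN
"marko" matches the .odbc.ini above; the user name and password here are
just placeholders, and error checking is omitted): it should count 104
rows, but with UseDeclareFetch=1 and Fetch=2 it only sees 2.

#include <stdio.h>
#include <sql.h>
#include <sqlext.h>

int main(void)
{
    SQLHENV env;
    SQLHDBC dbc;
    SQLHSTMT stmt;
    long rows = 0;

    SQLAllocHandle(SQL_HANDLE_ENV, SQL_NULL_HANDLE, &env);
    SQLSetEnvAttr(env, SQL_ATTR_ODBC_VERSION, (SQLPOINTER) SQL_OV_ODBC3, 0);
    SQLAllocHandle(SQL_HANDLE_DBC, env, &dbc);
    SQLConnect(dbc, (SQLCHAR *) "marko", SQL_NTS,
               (SQLCHAR *) "user", SQL_NTS, (SQLCHAR *) "password", SQL_NTS);
    SQLAllocHandle(SQL_HANDLE_STMT, dbc, &stmt);

    SQLExecDirect(stmt, (SQLCHAR *) "SELECT * FROM test1", SQL_NTS);
    while (SQL_SUCCEEDED(SQLFetch(stmt)))
        rows++;
    printf("fetched %ld rows\n", rows);  /* expect 104; the buggy driver stops at 2 */

    SQLFreeHandle(SQL_HANDLE_STMT, stmt);
    SQLDisconnect(dbc);
    SQLFreeHandle(SQL_HANDLE_DBC, dbc);
    SQLFreeHandle(SQL_HANDLE_ENV, env);
    return 0;
}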

Regards,
Marko Ristola


Re: Critical Bug with UseDeclareFetch in development version

From
"Hiroshi Saito"
Date:
Hi Marko.

It is strange...

> So I get the above result by configuring .odbc.ini:
> [marko]
> Fetch = 2
> UseDeclareFetch = 1

I do not find any problems with the driver for Windows.
Probably it is something peculiar to the Linux version??
On Windows, though, CACHE is used as FETCH.

I would like to see the log; Anoop or Dave may be able to
tell immediately. :-)

Regards,
Hiroshi Saito




Re: Continuing encoding fun....

From
Marko Ristola
Date:
zuschlag2@online.de wrote:

>Hi Dave
>
>
>
>>It seems to me that the current case is correct - in the Unicode driver
>>we *must* run with client_encoding = 'UNICODE' or it won't work
>>properly. That said, I wonder if we shouldn't just remove the if()
>>altogether, and unconditionally set the client encoding for the Unicode
>>
>>

The following might be interesting for you:

If I activate the ISO C 99 API, I can do the following:
(I thought that I had written ANSI C 99, but the correct name for the
standard I meant is ISO C 99. It will become the default later; maybe it
already is with the newest GCCs.)

#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

char cbuf[500];
wchar_t wbuf[500];

setlocale(LC_CTYPE,"");

strcpy(cbuf,"Some multibyte text");
swprintf(wbuf, 500, L"%s", cbuf);   /* C99 swprintf takes a buffer size and a wide format string */
Now the text is under wchar_t's internal format, maybe UCS-2.

The following also works:
wcscpy(wbuf, L"Some UNICODE text"); /* wcscpy, not strcpy, for a wide string literal */
sprintf(cbuf, "%ls", wbuf);

So, the UCS-2 and multibyte conversion under ISO C 99 seems to be very easy.
With GCC, with Debian Sarge, this can be done as follows:
gcc -std=c99

I don't have more time to test now, at least not today.

Iconv seems to be the solution for more advanced conversions under Linux.

Regards, Marko



Re: Critical Bug with UseDeclareFetch in development version

From
Marko Ristola
Date:
Maybe those with tables of more than 8 million rows could move on to 64 bit
operating systems. Memory hogging would not be a problem anymore with a
big enough swap space.

So making sure the feature is not active would fix it.
PostgreSQL performs badly with UseDeclareFetch by design:
with UseDeclareFetch, the backend assumes that only a few rows will be
fetched.

Maybe users are not prepared to move to 64 bit so quickly.

Now to the analysis of the problem:

The problem seems to be that,
with UseDeclareFetch=1
and Fetch=2, the libpq psqlodbc driver issues the FETCH
only once against the PostgreSQL backend.

The feature would work nicely if PGAPI_ExtendedFetch() could
fetch more tuples with FETCH from the backend
once the first two tuples have been processed.

Now it just sees that the FETCH returned two rows,
and after those two rows it will not fetch any more.

So I tracked the problem down with a debugger into PGAPI_ExtendedFetch.

It seems that in the earlier implementation, SQLFetch somehow called
QR_fetch_tuples() to fetch more rows from the backend.
If QR_fetch_tuples() didn't return more rows, the fetching from the backend
would stop.

If the user application asks for the number of rows, the ODBC driver is forced
to read everything into memory.


Regards,
Marko Ristola

Hiroshi Saito wrote:

>Hi Marko.
>
>It is strange...
>
>
>
>>So I get the above result by configuring .odbc.ini:
>>[marko]
>>Fetch = 2
>>UseDeclareFetch = 1
>>
>>
>
>I do not find any problems by the driver for Windows.
>Probably, I think a portion peculiar to Linux version.??
>In windows, though CACHE is used as FETCH.
>
>Although I want to see the log, Anoop or Dave may be able to
>be distinguished immediately.:-)
>
>Regards,
>Hiroshi Saito
>
>
>


Re: Continuing encoding fun....

From
Marc Herbert
Date:
"Dave Page" <dpage@vale-housing.co.uk> writes:

> I've been thinking about this whilst getting dragged round the shops
> today, and having read Marko's, Johann's, Hiroshi's and other emails,
> not to mention bits of the ODBC spec, here's where I think we stand.
>
> 1) The current driver works as expected with Unicode apps.
>
> 2) 7 bit ASCII apps work correctly. The driver manager maps the ANSI
> functions to the Unicode ones, and because (as I think Marko pointed
> out) the basic latin chars map directly into the lower Unicode
> characters (see http://www.unicode.org/charts/PDF/U0000.pdf).
>
> 3) Some other single byte LATIN encodings do not work. This is because
> the characters do not map directly into Unicode 80-FF
> (http://www.unicode.org/charts/PDF/U0080.pdf).
>
> 4) Multibyte apps do not work. I believe that in fact they never will
> with a Unicode driver, because multibyte characters simply won't map
> into Unicode in the same way that ASCII does. The user cannot opt to use
> the non-wide functions, because the DM automatically maps them to the
> Unicode versions.
>
> Because the Driver Manager forces the user to use the *W functions if
> they exist, I cannot see any way to make 3 or 4 work with a Unicode
> driver. If we were to try to detect what encoding to use based on the OS
> settings and convert on the fly, we would most likely break any apps
> that try to do the right thing by using Unicode themselves.

In a perfect world there are no "unicode apps", the internal encoding
is set by the system, properly written apps use abstract TCHAR/wchar_t
characters without knowing anything about what encoding they use, and
programs communicating with the outside (such as a database driver)
should query the system encoding using something like "setlocale()",
and perform any appropriate conversion on the fly.

Excerpt from "info libc - Character Set Handling" of GNU libc 2.3.2

<http://www.gnu.org/software/libc/manual/html_node/Character-Set-Handling.html>

  The question remaining is: how to select the character set or
  encoding to use.  The answer: you cannot decide about it yourself,
  it is decided by the developers of the system or the majority of the
  users.  Since the goal is interoperability one has to use whatever
  the other people one works with use.

<http://www.faqs.org/docs/Linux-HOWTO/Unicode-HOWTO.html#s6>
says the same thing:

 "Avoid direct access with Unicode. This is a task of the platform's
 internationalization framework."

Of course those two quotes are targeted at application
developers. They imply that some driver communicating with the outside
world/database should carry out any conversion task.
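
A tiny sketch of that model (plain C99, nothing ODBC-specific; the input
is just whatever bytes the current locale says they are):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(int argc, char **argv)
{
    wchar_t wide[256];
    size_t n;

    setlocale(LC_CTYPE, "");           /* adopt the system/user encoding choice */

    if (argc < 2)
        return 1;

    n = mbstowcs(wide, argv[1], 256);  /* convert using that locale, not a hard-coded charset */
    if (n == (size_t) -1) {
        perror("mbstowcs");
        return 1;
    }
    printf("%zu wide characters\n", n);
    return 0;
}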

However, I have no idea how far this theory is from reality, from
the ODBC API, and from Windows, sorry :-( I was just woken up by
the phrase "unicode apps". I tried to follow the discussions here but
got lost.


My 2 cents.

Re: Continuing encoding fun....

From
"Dave Page"
Date:

> -----Original Message-----
> From: pgsql-odbc-owner@postgresql.org
> [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of Marc Herbert
> Sent: 07 September 2005 19:16
> To: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Continuing encoding fun....
>
> In a perfect world there are no "unicode apps",

In my perfect world, everything is one flavour of Unicode, and everyone
can consequently read and write everything with no compatibility problems
at all. But then I like to retreat to my little fantasy world from time
to time...

>
> However, I have no idea how this theory is far from reality, far from
> the ODBC API, and far from Windows, sorry :-( I just was woken up by
> the "unicode apps" word. I tried to follow the discussions here but
> got lost.

The ODBC API (defined by Microsoft of course) includes a number of *W
functions which are Unicode variants of the ANSI versions with the same
name. The ODBC driver manager maps all ANSI function calls to the
Unicode equivalents if they exist, on the assumption that ASCII chars
will map correctly into Unicode (which they do if they are 7 bit chars).
In theory we could attempt to recode incoming ascii or multibyte
ourselves I guess, but it's not going to be a particularly easy task
(and will mean performance loss), and given that some apps don't play
nicely with Unicode drivers anyway, we might as well kill 2 birds with
one stone and just ship 2 versions of the driver.

Regards, Dave.

Re: Continuing encoding fun....

From
Marc Herbert
Date:
"Dave Page" <dpage@vale-housing.co.uk> writes:

> The ODBC API (defined by Microsoft of course) includes a number of *W
> functions which are Unicode variants of the ANSI versions with the same
> name.

I think one extra layer of confusion is added by the fact that POSIX
defines the type wchar_t as "the abstract/platform-dependent
character", W just meaning here: "W like Wide enough", whereas
Microsoft defines WCHAR as: "W like Unicode".  Microsoft's abstract
character being "TCHAR".

Am I right here?



Re: Continuing encoding fun....

From
"Dave Page"
Date:

> -----Original Message-----
> From: pgsql-odbc-owner@postgresql.org
> [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of Marc Herbert
> Sent: 08 September 2005 11:10
> To: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Continuing encoding fun....
>
> "Dave Page" <dpage@vale-housing.co.uk> writes:
>
> > The ODBC API (defined by Microsoft of course) includes a
> number of *W
> > functions which are Unicode variants of the ANSI versions
> with the same
> > name.
>
> I think one extra layer of confusion is added by the fact that POSIX
> defines the type wchar_t as "the abstract/platform-dependent
> character", W just meaning here: "W like Wide enough", whereas
> Microsoft defines WCHAR as: "W like Unicode".  Microsoft's abstract
> character being "TCHAR".
>
> Am I right here?

That certainly wouldn't help matters. We already have ucs2<->utf-8
conversion in various places to deal with *nix/win32 differences -
trying to properly munge other encodings into those correctly wouldn't
be fun!

As I said though - there are other advantages to having a non-Unicode
driver (like, BDE won't barf for example), so why go to all the hassle,
when we can just advise the non-Unicode folks to use the ANSI driver?

Regards, Dave.

Re: Continuing encoding fun....

From
Marko Ristola
Date:
There is one thing, that might be good for you to know:

I tried
wprintf("%s",char_text) and printf("%ls",wchar_text) methods.
They don't work with LATIN1 under Linux.

gcc does not support non-ASCII multibyte conversions;
gcc leaves that responsibility to the library functions.

That is so even for GCC 4.0.

So, at least libiconv is a good way to handle the multibyte conversions
robustly under Linux. That works if and only if the libiconv library works.

libiconv is LGPL licensed.
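
For the record, a short sketch of the kind of conversion meant here
(LATIN1 bytes to UTF-8 with iconv(3); this is just an illustration, not
psqlodbc code):

#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char inbuf[] = "caf\xe9";      /* "café" in ISO-8859-1 */
    char outbuf[16];
    char *in = inbuf, *out = outbuf;
    size_t inleft = strlen(inbuf), outleft = sizeof(outbuf);
    iconv_t cd;

    cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t) -1) {
        perror("iconv_open");
        return 1;
    }
    if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t) -1)
        perror("iconv");
    iconv_close(cd);

    printf("converted to %zu UTF-8 bytes\n", sizeof(outbuf) - outleft);
    return 0;
}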

Regards,
Marko Ristola

>However, I have no idea how this theory is far from reality, far from
>the ODBC API, and far from Windows, sorry :-( I just was woken up by
>the "unicode apps" word. I tried to follow the discussions here but
>got lost.
>
>


Re: Continuing encoding fun....

From
Marc Herbert
Date:
Marko Ristola <Marko.Ristola@kolumbus.fi> writes:

> There is one thing, that might be good for you to know:
>
> I tried
> wprintf("%s",char_text) and printf("%ls",wchar_text) methods.
> They don't work with LATIN1 under Linux.

What do you mean by that? Could you post a short sample code?

Since wchar_t is 32 bits for glibc, wchar_text cannot be LATIN1, which
is 8 bits long...


> gcc does not support NON-ASCII multibyte conversions.

Well, I would find it weird for a compiler to perform such conversions.

Re: Continuing encoding fun....

From
Marc Herbert
Date:
"Dave Page" <dpage@vale-housing.co.uk> writes:

> I've been thinking about this whilst getting dragged round the shops
> today, and having read Marko's, Johann's, Hiroshi's and other emails,
> not to mention bits of the ODBC spec, here's where I think we stand.
>
> 1) The current driver works as expected with Unicode apps.
>
> 2) 7 bit ASCII apps work correctly. The driver manager maps the ANSI
> functions to the Unicode ones, and because (as I think Marko pointed
> out) the basic latin chars map directly into the lower Unicode
> characters (see http://www.unicode.org/charts/PDF/U0000.pdf).
>
> 3) Some other single byte LATIN encodings do not work. This is because
> the characters do not map directly into Unicode 80-FF
> (http://www.unicode.org/charts/PDF/U0080.pdf).
>
> 4) Multibyte apps do not work. I believe that in fact they never will
> with a Unicode driver, because multibyte characters simply won't map
> into Unicode in the same way that ASCII does. The user cannot opt to use
> the non-wide functions, because the DM automatically maps them to the
> Unicode versions.
>
> Because the Driver Manager forces the user to use the *W functions if
> they exist, I cannot see any way to make 3 or 4 work with a Unicode
> driver.


I agree that 4) can never work, because ODBC does not seem compatible
with multibyte apps by design. ODBC caters for "ANSI" and "Unicode"
strings, that's all.
<http://blogs.msdn.com/oldnewthing/archive/2004/05/31/144893.aspx>


However, I don't get why 3) does not work. From here:
<http://msdn.microsoft.com/library/default.asp?url=/library/en-us/odbc/htm/odbcunicode_function_arguments.asp>

 If the driver is a Unicode driver, the Driver Manager makes function
 calls as follows:
 - Converts an ANSI function (with the A suffix) to a Unicode function
 (with the W suffix) by converting the string arguments into Unicode
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 characters and passes the Unicode function to the driver.


Are you saying in 3) that the "converting" underlined above is
actually just a static cast?!

Is this "bug" true for every driver manager out there?



Re: Continuing encoding fun....

From
"Dave Page"
Date:

> -----Original Message-----
> From: pgsql-odbc-owner@postgresql.org
> [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of Marc Herbert
> Sent: 21 November 2005 17:19
> To: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Continuing encoding fun....
>
> "Dave Page" <dpage@vale-housing.co.uk> writes:
>
> > I've been thinking about this whilst getting dragged round the shops
> > today, and having read Marko's, Johann's, Hiroshi's and
> other emails,
> > not to mention bits of the ODBC spec, here's where I think we stand.
> >
> > 1) The current driver works as expected with Unicode apps.
> >
> > 2) 7 bit ASCII apps work correctly. The driver manager maps the ANSI
> > functions to the Unicode ones, and because (as I think Marko pointed
> > out) the basic latin chars map directly into the lower Unicode
> > characters (see http://www.unicode.org/charts/PDF/U0000.pdf).
> >
> > 3) Some other single byte LATIN encodings do not work. This
> is because
> > the characters do not map directly into Unicode 80-FF
> > (http://www.unicode.org/charts/PDF/U0080.pdf).
> >
> > 4) Multibyte apps do not work. I believe that in fact they
> never will
> > with a Unicode driver, because multibyte characters simply won't map
> > into Unicode in the same way that ASCII does. The user
> cannot opt to use
> > the non-wide functions, because the DM automatically maps
> them to the
> > Unicode versions.
> >
> > Because the Driver Manager forces the user to use the *W
> functions if
> > they exist, I cannot see any way to make 3 or 4 work with a Unicode
> > driver.
>
>
> I agree that 4) can never work, because ODBC does not seem compatible
> with multibyte apps by design. ODBC caters for "ANSI" and "Unicode"
> strings, that's all.
> <http://blogs.msdn.com/oldnewthing/archive/2004/05/31/144893.aspx>

Actually our ANSI driver works quite nicely in various non-Unicode multibyte encodings such as Shift-JIS, EUC_CN, JOHAB
and more. It'll even work with pure UTF-8 in multibyte mode using the ANSI API.

>
> However, I don't get why 3) does not work. From here:
> <http://msdn.microsoft.com/library/default.asp?url=/library/en
> -us/odbc/htm/odbcunicode_function_arguments.asp>
>
>  If the driver is a Unicode driver, the Driver Manager makes function
>  calls as follows:
>  - Converts an ANSI function (with the A suffix) to a Unicode function
>  (with the W suffix) by converting the string arguments into Unicode
>                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>  characters and passes the Unicode function to the driver.
>
>
> Are you saying in 3) that the "converting" underlined above is
> actually just a static cast?!

No, not really a static cast, but a similar effect. Unicode chars 0000-007F are exactly the same as their ASCII
counterparts, as are LATIN1 (0080-00FF). All the DM does is map the single byte values into low bytes of the unicode
characters and passes them to the Unicode functions. This works just fine for pure ASCII/LATIN1, but not with other
character sets which don't directly map from their single byte values into Unicode.

> Is this "bug" true for every driver manager out there?

It's not really a bug, but I believe so, yes. It gets corrected by the more advanced drivers though - for example, the
SQL Server driver might see a 'Š' character (8A). It knows the local charset is LATIN4, so it can then rewrite that
character to 0160, the Unicode equivalent. Our Unicode driver will simply leave it as 8A, which is actually a control
character (VTS - LINE TABULATION SET).

http://www.unicode.org/roadmaps/bmp/

At least, this is how I understand things :-). Regardless though, the encoding bug reports have all but stopped now we
ship 2 drivers again.

Regards, Dave.

Re: Continuing encoding fun....

From
Marc Herbert
Date:
"Dave Page" <dpage@vale-housing.co.uk> writes:

>> I agree that 4) can never work, because ODBC does not seem compatible
>> with multibyte apps by design. ODBC caters for "ANSI" and "Unicode"
>> strings, that's all.
>> <http://blogs.msdn.com/oldnewthing/archive/2004/05/31/144893.aspx>
>

> Actually our ANSI driver works quite nicely in various non-Unicode
> multibyte encodings such as Shift-JIS, EUC_CN, JOHAB and more. It'll
> even work with pure UTF-8 in multibyte mode using the ANSI API.

Great.

Out of curiosity, is this because all the ODBC code has a "don't
touch" attitude in this full-ANSI case, leaving all string data as is?
Or is there something more clever?  Who performs the conversion if the
database is in UTF-8 for instance? Multibyte cases seem to fall outside
the scope of the ODBC spec, which refers only to "ANSI" and "Unicode".

Thanks in advance for providing pointers if this is an FAQ. Even vague
references to the archive of this list would be nice.


>> However, I don't get why 3) does not work.
>>
>>  If the driver is a Unicode driver, the Driver Manager makes function
>>  calls as follows:
>>  - Converts an ANSI function (with the A suffix) to a Unicode function
>>  (with the W suffix) by converting the string arguments into Unicode
>>                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>  characters and passes the Unicode function to the driver.
>>
>>
>> Are you saying in 3) that the "converting" underlined above is
>> actually just a static cast?!
>

> No, not really a static cast, but a similar effect. Unicode chars
> 0000-007F are exactly the same as their ASCII counterparts, as are
> LATIN1 (0080-00FF). All the DM does is map the single byte values
> into low bytes of the unicode characters and passes them to the
> Unicode functions.

> This works just fine for pure ASCII/LATIN1, but
> not with other charactersets which don't directly map from their
> single byte values into Unicode.

Very interesting. Maybe the driver manager does so only because it
cannot/fails to get the active codepage, falling back on CP-1252?
(CP1252 ~= latin1, <http://czyborra.com/charsets/codepages.html#CP1252>)


>> Is this "bug" true for every driver manager out there?

> It's not really a bug, but I believe so, yes.

including unixodbc and iodbc for instance?


> It gets corrected by
> the more advanced drivers though - for example, the SQL server
> driver might see a 'Š' character (8A). It knows the local charset is
> LATIN4, so it can then rewrite that character to 0160, the Unicode
> equivalent.

Are you saying that the SQL server driver is fixing the flawed
conversion job of the driver manager, finally taking the codepage into
account? Surprising to say the least!

By the way 0x8A is not in the range of latin4
<http://czyborra.com/charsets/iso8859.html#ISO-8859-4>


> Our Unicode driver will simply leave it

Of course, you don't want to perform a conversion that is supposed to
already have happened.


> Regardless though, the encoding bug reports have all-but stopped now
> we ship 2 drivers again.

And having two different drivers is indeed the approach implied by the
ODBC documentation, from what I've gathered from it.

Thanks a lot for your insights.

Re: Continuing encoding fun....

From
"Dave Page"
Date:

> -----Original Message-----
> From: pgsql-odbc-owner@postgresql.org
> [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of Marc Herbert
> Sent: 22 November 2005 09:33
> To: pgsql-odbc@postgresql.org
> Subject: Re: [ODBC] Continuing encoding fun....
>
> "Dave Page" <dpage@vale-housing.co.uk> writes:
>
> >> I agree that 4) can never work, because ODBC does not seem
> compatible
> >> with multibyte apps by design. ODBC caters for "ANSI" and "Unicode"
> >> strings, that's all.
> >> <http://blogs.msdn.com/oldnewthing/archive/2004/05/31/144893.aspx>
> >
>
> > Actually our ANSI driver works quite nicely in various non-Unicode
> > multibyte encodings such as Shift-JIS, EUC_CN, JOHAB and more. It'll
> > even work with pure UTF-8 in multibyte mode using the ANSI API.
>
> Great.
>
> Out of curiosity, is this because all the ODBC code has a "don't
> touch" attitude in this full-ANSI case, leaving all string data as is?
> Or is there something more clever?  Who performs the conversion if the
> database is in UTF-8 for instance? Multibyte cases seem to
> fall outside
> the scope of the ODBC spec, which refers only to "ANSI" and "Unicode".

No, Multibyte support was intentionally added by Eiji Tokuya in 2001. Don't ask me how it works though as I really
don't know. Much of the code for it is in multibyte.c if you want to take a peek.


> Very interesting. Maybe the driver manager does so only because the it
> cannot/fails to get the active codepage, falling back on CP-1252?
> (CP1252 ~= latin1,
> <http://czyborra.com/charsets/codepages.html#CP1252>)

The docs are somewhat fuzzy on this point, simply stating that

"If the driver is a Unicode driver, the Driver Manager makes function calls as follows:" ... "Converts an ANSI function
(withthe A suffix) to a Unicode function (with the W suffix) by converting the string arguments into Unicode characters
andpasses the Unicode function to the driver." 

 (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/odbc/htm/odbcunicode_applications.asp)

My assertion that the driver does the conversion comes from the SQL Server driver which allows you to turn conversion
on or off:

"Perform translation for character data check box

When selected, the SQL Server ODBC driver converts ANSI strings sent between the client computer and SQL Server by
using Unicode. The SQL Server ODBC driver sometimes converts between the SQL Server code page and Unicode on the client
computer. This requires that the code page used by SQL Server be one of the code pages available on the client computer.

When cleared, no translation of extended characters in ANSI character strings is done when they are sent between the
client application and the server. If the client computer is using an ANSI code page (ACP) different from the SQL Server
code page, extended characters in ANSI character strings may be misinterpreted. If the client computer is using the same
code page for its ACP that SQL Server is using, the extended characters are interpreted correctly."

If Microsoft intended the DM to do the conversion when they wrote the spec, why would they then add the same
functionality to their driver?

> >> Is this "bug" true for every driver manager out there?
>
> > It's not really a bug, but I believe so, yes.
>
> including unixodbc and iodbc for instance?

If they follow the parts of the spec I quoted above, and interpret them in the same way, then yes. However I'm not
overly familiar with either DM, so I can't say for sure.


> > It gets corrected by
> > the more advanced drivers though - for example, the SQL server
> > driver might see a 'Š' character (8A). It knows the local charset is
> > LATIN4, so it can then rewrite that character to 0160, the Unicode
> > equivalent.
>
> Are you saying that the SQL server driver is fixing the flawed
> conversion job of the driver manager, finally taking the codepage into
> account? Surprising to say the least!
>
> By the way 0x8A is not in the range of latin4
> <http://czyborra.com/charsets/iso8859.html#ISO-8859-4>

http://www.gar.no/home/mats/8859-4.htm says differently, however, I can't claim to know enough about encoding issues to
refute either. I've been forced to learn what I can about the subject to help maintain this driver and certainly may
have got the wrong end of the stick on one or more points!

Regards, Dave.

"Dave Page" <dpage@vale-housing.co.uk> writes:

>> By the way 0x8A is not in the range of latin4
>> <http://czyborra.com/charsets/iso8859.html#ISO-8859-4>
>
> http://www.gar.no/home/mats/8859-4.htm says differently, however, I
> can't claim to know enough about encoding issues to refute
> either. I've been forced to learn what I can about the subject to help
> maintain this driver and certainly may have got the wrong end of the
> stick on one or more points!

The page from gar.no is just a dump of the *Microsoft-extended* latin4
charset.

The standards committee carefully left a gap in all LATIN-X charsets
between 0x80 and 0x9F, because those characters become (harmful)
control characters once stripped of their 8th bit (by accident).
You can see that very clearly in this table for instance
 <http://en.wikipedia.org/wiki/ISO_8859-4>

If you follow the links from gar.no itself, you can land here:
<http://en.wikipedia.org/wiki/ISO_8859> with tons of links (like the
ECMA standards for instance) showing this gap.

Microsoft, being Microsoft, jumped in that gap. Those non-standard
Microsoft characters now plague the web as clearly explained here:

<http://home.earthlink.net/~bobbau/platforms/specialchars/#windows>
or here:
<http://www.cs.tut.fi/~jkorpela/www/windows-chars.html>



Re: Continuing encoding fun....

From
Marc Herbert
Date:
[Cross-posting to unixodbc-devel. Also crossing fingers so it works]
Archives of both lists here for instance: <http://dir.gmane.org/search.php?match=odbc>

"Dave Page" <dpage@vale-housing.co.uk> writes:
>
> The docs are somewhat fuzzy on this point, simply stating that
>
> "If the driver is a Unicode driver, the Driver Manager makes function
> calls as follows:" ... "Converts an ANSI function (with the A suffix)
> to a Unicode function (with the W suffix) by converting the string
> arguments into Unicode characters and passes the Unicode function to
> the driver."
>
>  (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/odbc/htm/odbcunicode_applications.asp)
>
> My assertion that the driver does the conversion comes from the SQL
> Server driver which allows you to turn conversion on or off:
>
> "Perform translation for character data check box
>
> When selected, the SQL Server ODBC driver converts ANSI strings sent
> between the client computer and SQL Server by using Unicode. The SQL
> Server ODBC driver sometimes converts between the SQL Server code page
> and Unicode on the client computer. This requires that the code page
> used by SQL Server be one of the code pages available on the client
> computer.
>
> When cleared, no translation of extended characters in ANSI character
> strings is done when they are sent between the client application and
> the server. If the client computer is using an ANSI code page (ACP)
> different from the SQL Server code page, extended characters in ANSI
> character strings may be misinterpreted. If the client computer is
> using the same code page for its ACP that SQL Server is using, the
> extended characters are interpreted correctly."
>
> If Microsoft intended the DM to do the conversion when they wrote the
> spec, why would they then add the same functionality to their driver?


Here is a hypothesis: the checkbox in SQL Server driver is actually a
switch between the ANSI version and the Unicode version of this
driver.  That would be pretty much consistent with all the above. The
only inconsistency would be: "The driver converts...", to be actually
read as: "This setting triggers the conversion operated by the DM".

What do you think?




Re: Continuing encoding fun....

From
"Dave Page"
Date:

> -----Original Message-----
> From: pgsql-odbc-owner@postgresql.org
> [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of Marc Herbert
> Sent: 24 November 2005 14:18
> To: pgsql-odbc@postgresql.org
> Cc: unixodbc-dev@unixodbc.org
> Subject: Re: [ODBC] Continuing encoding fun....
>
> > If Microsoft intended the DM to do the conversion when they
> wrote the
> > spec, why would they then add the same functionality to
> their driver?
>
>
> Here is a hypothesis: the checkbox in SQL Server driver is actually a
> switch between the ANSI version and the Unicode version of this
> driver.  That would be pretty much consistent with all the above. The
> only inconsistency would be: "The driver converts...", to be actually
> read as: "This setting triggers the conversion operated by the DM".
>
> What do you think?

The DM detects whether the driver is Unicode or not from the presence of
the SQLConnectW function
(http://msdn.microsoft.com/library/default.asp?url=/library/en-us/odbc/htm/odbcunicode_drivers.asp).
Whether or not this is exported is determined at compile time and cannot
be changed at runtime.
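
A compile-time sketch of why (this is an illustration, not the actual
psqlodbc source; the stub body is obviously not what the real driver does):

#include <sql.h>
#include <sqlext.h>

#ifdef UNICODE_SUPPORT
/* Only built - and therefore only exported from the DLL - when
 * UNICODE_SUPPORT is defined, so the DM's ANSI/Unicode decision is
 * fixed when the driver is compiled and cannot be toggled from the
 * setup dialog. */
SQLRETURN SQL_API SQLConnectW(SQLHDBC ConnectionHandle,
                              SQLWCHAR *ServerName, SQLSMALLINT NameLength1,
                              SQLWCHAR *UserName, SQLSMALLINT NameLength2,
                              SQLWCHAR *Authentication, SQLSMALLINT NameLength3)
{
    return SQL_ERROR;   /* stub body for the sketch only */
}
#endif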

Regards, Dave