Thread: Unicode support

Unicode support

From
Nhan Ngo Dinh
Date:
Greetings,

I've been searching for long time for the way to convert a Microsoft SQL
7.0 database into a database such as MySQL or PostgreSQL. The problem is
that this database contains tables with both LATIN1 characters (English,
French) and Arabic characters, so the only way to put all together is to
use an unicode charset such as UTF-8.

PostgreSQL does support unicode characters (at least the documents say
that) while MySQL doesn't. I enabled the multibyte support while compiling
sources on a Linux platform and tried some ways to transfer the original
database.

First of all, I tried using the PostgreSQL ODBC driver. I've exported the
database to Microsoft Access to better handle the situation and I've
installed the PostgreSQL ODBC driver taken from ftp.postgresql.org (version
7.01.00.09). I've transferred a sample table containing both charsets and
the result was that Arabic letters were replaced into the target PostgreSQL
database by "?" characters.

In the psqlodbc_*.log files the Arabic entries were already replaced with
quotation marks.

Then I tried many other methods, using pgAdmin II, or even trying to
synchronize the databases with a PHP script that used the freetds library
(first) or the Sybase library (then) to directly access the MSSQL database.
The imported Arabic fields were always quotation marks, because the unicode
support was not fully enabled into both sybase libraries.

The only solution that I've found is to import the database using ODBC in a
PHP script with PHP on a Windows box, converting nvarchar values (with
CONVERT in the query) into varbinary, setting the binary visualization to
hex. So I can parse that string and store unicode strings into the database.

Well, may be there is a better solution directly using the PostgreSQL ODBC
driver.
Any idea?

Thanks.

Nhan NGO DINH





Re: Unicode support

From
Jean-Michel POURE
Date:
Le Dimanche 3 Février 2002 21:26, Nhan Ngo Dinh a écrit :
> Any idea?

The only problem is with Arabic, right? Export schema and data separately
from MS SQL Server.

1) Import the schema file in PostgreSQL directly in Latin1.
2) Translate the data file to UFT-8 under Linux using recode or iconv:
- iconv -f arabic_enconding_name  -t utf-8 filename.txt
- recode arabic_enconding_name..u8 file_name.txt

A good manual can be found here :
http://www.iro.umontreal.ca/contrib/recode/HTML/recode.html

3) Use pgAdmin2 or pgsql \copy statement to import the file.

Please note that Access2000 and PgAdmin 2 do not support Unicode. This was an
issue for us at pgAdmin Team and the answer is "there is no unicode support
although $M claims there is". You have to wait for Windows ZX 2439, which
will take place in more than 300 years and 134 Windows upgrades. So prepare a
cryogenic place and wait until UTF-8 is added to Access.

On the converse PHP supports unicode very well. Do not hesitate to write me
back for further information.

Cheers,
Jean-Michel POURE

Re: Unicode support

From
Hiroshi Inoue
Date:
Jean-Michel POURE wrote:
>
> Le Dimanche 3 F騅rier 2002 21:26, Nhan Ngo Dinh a 馗rit :
> > Any idea?
>
> 3) Use pgAdmin2 or pgsql \copy statement to import the file.
>
> Please note that Access2000 and PgAdmin 2 do not support Unicode. This was an
> issue for us at pgAdmin Team and the answer is "there is no unicode support
> although $M claims there is".

If our ODBC driver supports UNICODE (UCS-2 not utf-8)
it would probably be possible for Access2000, PgAdmin
or other VB apps etc to handle UNICODE data via ODBC
(ADO etc) under Windows NT/2000/XP because the OS's
are UNICODE(UCS-2) based. Though it's been on my TODO
list I've had no time to do it unfortunately.

regards,
Hiroshi Inoue

Re: Unicode support

From
Jean-Michel POURE
Date:
Le Mardi 12 Février 2002 01:45, Hiroshi Inoue a écrit :
> If our ODBC driver supports UNICODE (UCS-2 not utf-8)
> it would probably be possible for Access2000, PgAdmin
> or other VB apps etc to handle UNICODE data via ODBC
> (ADO etc) under Windows NT/2000/XP because the OS's
> are UNICODE(UCS-2) based. Though it's been on my TODO
> list I've had no time to do it unfortunately.

OK, I understand.
Cheers, Jean-Michel POURE

Re: Unicode support

From
Jean-Michel POURE
Date:
Le Lundi 18 Février 2002 11:13, Nhan NGO DINH a écrit :
> Well, you said to export data from the SQL server splitted into structure
> and data. How to do that? I really don't have so much experience with
> Microsoft SQL Server...
>
> Then: I've tried to convert with iconv an UNICODE exported table but I
> don't see the expected results.

What is the encoding of your $MS SQL database? If Unicode, no recoding is
needed.

I recommand you installed pgAdmin2 (http://pgadmi.postgresql.org) and
followed the following steps:

1) Create a Unicode database.

2) Use the migration wizard to import $MS SQL Schema into PostgreSQL. It will
handle type conversion from $MS SQL Server <-> PostgreSQL.

Alternatively, you may use $MS SQL Server GUI to export schema. You will need
to adapt schema types and objects (convert money into float8, sequences,
etc...). This can be quite difficult in a large schema.

Use pgAdmin2 graphical interface to make sure all objects are imported :
keys, sequences, etc.. The best way is to run both GUIs at the same time to
create the missing objects manually.

3) Export the data from SQL Server into a text file using $MS SQL Server GUI.
Alternatively, you can attach files in Access2K and export them into a text
file.

Recode the text file using : recode Latin1..u8 file_name.txt (assuming
Latin1 is the source encoding). If the $MS SQL database is already Unicode,
no recoding is required and you can use pgAdmin2 during the entire process.

4) Import the resulting data into PostgreSQL using pgAdmin2 or pgsql \Copy
statement.

Do not hesitate to write me on pgadmin-hackers@postgresql.org for more
support.

Cheers,
Jean-MIchel POURE

Re: Unicode support

From
Jean-Michel POURE
Date:
> 1) When I try to view tables from Access XP, the arabic letters are not
> correctly displayed: I can see only the multibyte to single byte 1:1
> conversion. This is not that important because I don't need to see it in
> Access because the database will be seen through PHP on the web.

The odbc driver does not support UTF-8 yet.

> 2) The most critical problem is that I cannot convert a table with both
> french letters with accents and arabic letters. Pgadmin seems to refuse
> french characters. So I've erased french records (at least records with
> accents) and the database has been transferred.

The French letters are correctly stored in PostgreSQL. The problem comes from
pgAdmin2 / odbc.

> As far as I can see the unicode text file from Microsoft SQL Server is not
> fully UTF-8 standard, but maybe I'm wrong.
>
> On the postgresql server side there is the message:
> ERROR:  Invalid UNICODE character sequence found (0xe974e9)
> So E9 74 E9 is "été" ("société" field in the 5th record)

Can you test displaying you data in Apache / Php. I tried displaying UTF-8
data in Windows but never ***ever*** succedeed.

Please give us more details :
- What is the encoding of you source data
- What is your platform (Win2000, XP),
- Do yo have Access 2K?
- Do you have access to an Apache/PHP platform?

Also, keep in mind pgAdmin2 and the odbc driver are not UTF-8 compliant yet.
So if you plan to use Windows / odbc / PostgreSQL / UTF-8 it will not work.
Windows can only be used during migration of UTF-8 data.

I can only help you if the data is to be used on an Apache/php platform.

Cheers,
Jean-Michel POURE

Re: Unicode support

From
Jean-Michel POURE
Date:
Dear Nhan Dgo,

Please CC pg-odbc and pgadmin lists so that anyone can participate.

> I'm currently running Microsoft SQL Server 2000 on a Windows 2000 Server
> platform, but this is only to read the source database. I don't know how to
> retrieve the source encoding... It should be UNICODE since it stores both
> LATIN1 and arabic chars.

I am not sure. There should be a way to run a select statement to know what
encoding is used internally. Search the $W website.

> The target platform is Linux with PostgreSQL 7.2 and PHP/Apache of course
> and this shouldn't be a problem for me.

Don't try to view data in the Windows platform, this will not work because
pgAdmin2 and the odbc driver are not utf-8 compatible.

Make sure to migrate the schema first using pgAdmin2. Then export text and
convert it using iconv or recode (if not utf-8 yet). Maybe you can try to
open the text files in Yudit, the multi-byte text editor, which will tell you
what encoding is used.

Try to write a simple script under PHP. Please note the testing php page 1)
should be saved in utf-8 format 2) and should carry utf-8 headers to tell the
browser to switch to utf-8. Be sure your browser has the arabic fonts (go to
an arabic web site to verify).

Good luck my friend.

Cheers,
Jean-Michel POURE

Re: Unicode support

From
Hiroshi Inoue
Date:
Jean-Michel POURE wrote:
>
> > 1) When I try to view tables from Access XP, the arabic letters are not
> > correctly displayed: I can see only the multibyte to single byte 1:1
> > conversion. This is not that important because I don't need to see it in
> > Access because the database will be seen through PHP on the web.
>
> The odbc driver does not support UTF-8 yet.

The odbc driver can support UTF-8.
What it doesn't support is UCS-2.
In most cases Unicode means UCS-2 under Windows.
Note that Unicode in PostgreSQL means UTF-8 and
it couldn't handle UCS-2.

regards,
Hiroshi Inoue

Re: Unicode support

From
Jean-Michel POURE
Date:
Ok, I've tried to follow your steps: be sure I've already installed pgadmin
II, that's obviously the best way to intercommunicate with postgresql from
Microsoft platforms.

It was surprising that I've succeeded transferring arabic unicode to
postgresql, at least it seems to have included multibyte characters. Then
I've found two problems:

1) When I try to view tables from Access XP, the arabic letters are not
correctly displayed: I can see only the multibyte to single byte 1:1
conversion. This is not that important because I don't need to see it in
Access because the database will be seen through PHP on the web.

2) The most critical problem is that I cannot convert a table with both
french letters with accents and arabic letters. Pgadmin seems to refuse
french characters. So I've erased french records (at least records with
accents) and the database has been transferred.

As far as I can see the unicode text file from Microsoft SQL Server is not
fully UTF-8 standard, but maybe I'm wrong.

On the postgresql server side there is the message:
ERROR:  Invalid UNICODE character sequence found (0xe974e9)
So E9 74 E9 is "été" ("société" field in the 5th record)

To better understand the situation I send you in attachment the unicode TXT
file that I'm trying to import and the pgadmin log.

Thx a lot!
Bye.

At 12.14 18/02/2002 +0100, you wrote:
>Le Lundi 18 Février 2002 11:13, Nhan NGO DINH a écrit :
> > Well, you said to export data from the SQL server splitted into structure
> > and data. How to do that? I really don't have so much experience with
> > Microsoft SQL Server...
> >
> > Then: I've tried to convert with iconv an UNICODE exported table but I
> > don't see the expected results.
>
>What is the encoding of your $MS SQL database? If Unicode, no recoding is
>needed.
>
>I recommand you installed pgAdmin2 (http://pgadmi.postgresql.org) and
>followed the following steps:
>
>1) Create a Unicode database.
>
>2) Use the migration wizard to import $MS SQL Schema into PostgreSQL. It will
>handle type conversion from $MS SQL Server <-> PostgreSQL.
>
>Alternatively, you may use $MS SQL Server GUI to export schema. You will need
>to adapt schema types and objects (convert money into float8, sequences,
>etc...). This can be quite difficult in a large schema.
>
>Use pgAdmin2 graphical interface to make sure all objects are imported :
>keys, sequences, etc.. The best way is to run both GUIs at the same time to
>create the missing objects manually.
>
>3) Export the data from SQL Server into a text file using $MS SQL Server GUI.
>Alternatively, you can attach files in Access2K and export them into a text
>file.
>
>Recode the text file using : recode Latin1..u8 file_name.txt (assuming
>Latin1 is the source encoding). If the $MS SQL database is already Unicode,
>no recoding is required and you can use pgAdmin2 during the entire process.
>
>4) Import the resulting data into PostgreSQL using pgAdmin2 or pgsql \Copy
>statement.
>
>Do not hesitate to write me on pgadmin-hackers@postgresql.org for more
>support.
>
>Cheers,
>Jean-MIchel POURE
>
>---------------------------(end of broadcast)---------------------------
>TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)



Attachment

Re: Unicode support

From
Jean-Michel POURE
Date:
> 1) When I try to view tables from Access XP, the arabic letters are not
> correctly displayed: I can see only the multibyte to single byte 1:1
> conversion. This is not that important because I don't need to see it in
> Access because the database will be seen through PHP on the web.

Dave Page wrote 30.000 lines of code for pgAdmin2... It is getting better
every day.

Access2K/XP + PostgreSQL support Unicode. You can use them to display both
French and Arabic. Althought I never tried, I know it is possible.


> 2) The most critical problem is that I cannot convert a table with both
> french letters with accents and arabic letters. Pgadmin seems to refuse
> french characters. So I've erased french records (at least records with
> accents) and the database has been transferred.

pgAdmin2 is not multi-byte safe for various ***serious*** reasons.

It means you cannot view Unicode data in pgAdmin2 grid. On the converse, you
can manage your database (add tables, views, triggers, etc...)

Cheers,
Jean-Michel