Thread: MSSQL to PostgreSQL : Encoding problem

MSSQL to PostgreSQL : Encoding problem

From
Arnaud Lesauvage
Date:
Hi list !

I already posted this as "COPY FROM encoding error", but I have
been doing some more tests since then.

I'm trying to export data from MS SQL Server to PostgreSQL.
The tables are quite big (>20M rows), so a CSV export and a "COPY
FROM3 import seems to be the only reasonable solution.

In DTS, I have 3 options to export a table as a text file : ANSI,
OEM and UNICODE.
I tried all these options (and I have three files, one for each).

I then try to import into PostgreSQL. The farther I can get is
when using the UNICODE export, and importing it using a
client_encoding set to UTF8 (I tried WIN1252, LATIN9, LATIN1, ...).
The copy then stops with an error :
ERROR: invalid byte sequence for encoding "UTF8": 0xff
État SQL :22021

The problematic character is the euro currency symbol.

Does anyone know how I can solve this ?
Thanks a lot !

--
Arnaud

Re: MSSQL to PostgreSQL : Encoding problem

From
Richard Huxton
Date:
Arnaud Lesauvage wrote:
> Hi list !
>
> I already posted this as "COPY FROM encoding error", but I have been
> doing some more tests since then.
>
> I'm trying to export data from MS SQL Server to PostgreSQL.
> The tables are quite big (>20M rows), so a CSV export and a "COPY FROM3
> import seems to be the only reasonable solution.

Or go via MS-Access/Perl and ODBC/DBI perhaps?

> In DTS, I have 3 options to export a table as a text file : ANSI, OEM
> and UNICODE.
> I tried all these options (and I have three files, one for each).

Well, what character-set is your database in?

> I then try to import into PostgreSQL. The farther I can get is when
> using the UNICODE export, and importing it using a client_encoding set
> to UTF8 (I tried WIN1252, LATIN9, LATIN1, ...).
> The copy then stops with an error :
> ERROR: invalid byte sequence for encoding "UTF8": 0xff
> État SQL :22021
>
> The problematic character is the euro currency symbol.

You'll want UTF-8 or LATIN9 for the euro symbol. LATIN1 supports that
character-number but it is used for a different symbol.

Your first step needs to be to find out what character-set your data is in.
Your second is then to decide what char-set you want to use to store it
in PG.
Then you can decide how to get there.

--
   Richard Huxton
   Archonet Ltd


Re: MSSQL to PostgreSQL : Encoding problem

From
Tony Caduto
Date:
Arnaud Lesauvage wrote:
>
>
>> I then try to import into PostgreSQL. The farther I can get is when
>> using the UNICODE export, and importing it using a client_encoding
>> set to UTF8 (I tried WIN1252, LATIN9, LATIN1, ...).
>> The copy then stops with an error :
>> ERROR: invalid byte sequence for encoding "UTF8": 0xff
>> État SQL :22021
>>
>> The problematic character is the euro currency symbol.
>
>
Exporting from MS SQL server as unicode is going to give you full
Unicode, not UTF8.  Full unicde is 2 bytes per character and UTF8 is 1,
same as ASCII.
You will have to encode the Unicode data to UTF8

I have done this in Delphi using it's built in UTF8 encoding and
decoding routines.   You can get a free copy of Delphi Turbo Explorer
which includes components for MS SQL server and ODBC, so it would be
pretty straight forward to get this working.

The actual method in Delphi is system.UTF8Encode(widestring).  This will
encode unicode to UTF8 which is compatible with a Postgresql UTF8 database.

I am sure Perl could do it also.

Hope this helps.

--
Tony Caduto
AM Software Design
http://www.amsoftwaredesign.com
Home of PG Lightning Admin for Postgresql
Your best bet for Postgresql Administration


Re: MSSQL to PostgreSQL : Encoding problem

From
Richard Huxton
Date:
Tony Caduto wrote:
> Arnaud Lesauvage wrote:
>>
>>
>>> I then try to import into PostgreSQL. The farther I can get is when
>>> using the UNICODE export, and importing it using a client_encoding
>>> set to UTF8 (I tried WIN1252, LATIN9, LATIN1, ...).
>>> The copy then stops with an error :
>>> ERROR: invalid byte sequence for encoding "UTF8": 0xff
>>> État SQL :22021
>>>
>>> The problematic character is the euro currency symbol.
>>
>>
> Exporting from MS SQL server as unicode is going to give you full
> Unicode, not UTF8.  Full unicde is 2 bytes per character and UTF8 is 1,
> same as ASCII.
> You will have to encode the Unicode data to UTF8

Well, UTF8 is a minimum of one byte, but can be longer for non-ASCII
characters. The idea being that chars below 128 map to ASCII. There's
also UTF16 and I believe UTF32 with 2+ and four byte characters.

> I have done this in Delphi using it's built in UTF8 encoding and
> decoding routines.   You can get a free copy of Delphi Turbo Explorer
> which includes components for MS SQL server and ODBC, so it would be
> pretty straight forward to get this working.
>
> The actual method in Delphi is system.UTF8Encode(widestring).  This will
> encode unicode to UTF8 which is compatible with a Postgresql UTF8 database.

Ah, that's useful to know. Windows just doesn't have the same quantity
of tools installed as a *nix platform.

> I am sure Perl could do it also.

And in one line if you're clever enough no doubt ;-)

--
   Richard Huxton
   Archonet Ltd


Re: MSSQL to PostgreSQL : Encoding problem

From
Arnaud Lesauvage
Date:
Tomi NA a écrit :
>> I think I'll go this way... No other choice, actually !
>> The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS.
>> I don't really understand what this is. It supports the euro
>> symbol, so it is probably not pure LATIN1, right ?
>
> I suppose you'd have to look at the latin1 codepage character table
> somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the
> question. :)

Yep, http://en.wikipedia.org/wiki/Latin-1 tells me that
LATIN1 is missing the euro sign...
Grrrrr I hate this !!!

--
Arnaud

Re: MSSQL to PostgreSQL : Encoding problem

From
Arnaud Lesauvage
Date:
Richard Huxton a écrit :
> Arnaud Lesauvage wrote:
>> Hi list !
>>
>> I already posted this as "COPY FROM encoding error", but I have been
>> doing some more tests since then.
>>
>> I'm trying to export data from MS SQL Server to PostgreSQL.
>> The tables are quite big (>20M rows), so a CSV export and a "COPY FROM3
>> import seems to be the only reasonable solution.
>
> Or go via MS-Access/Perl and ODBC/DBI perhaps?

Yes, I think it would work. The problem is that the DB is
too big for this king of export. Using DTS from MSSQL to
export directly to PostgreSQL using psqlODBC Unicode Driver,
I exported ~1000 rows per second in a 2-columns table with
~20M rows. That means several days just for this table, and
I have bigger ones !


>> In DTS, I have 3 options to export a table as a text file : ANSI, OEM
>> and UNICODE.
>> I tried all these options (and I have three files, one for each).
>
> Well, what character-set is your database in?


Collation in MSSQL is SQL_Latin1_General_CP1_Cl_AS.
DTS documentation tells me that exporting in ANSI should
export using the current codepage.
According to my local setting, my codepage is Windows-1252.
This file is not correctly read by COPY when using
client_encoding of WIN1252 though...


>> I then try to import into PostgreSQL. The farther I can get is when
>> using the UNICODE export, and importing it using a client_encoding set
>> to UTF8 (I tried WIN1252, LATIN9, LATIN1, ...).
>> The copy then stops with an error :
>> ERROR: invalid byte sequence for encoding "UTF8": 0xff
>> État SQL :22021
>>
>> The problematic character is the euro currency symbol.
>
> You'll want UTF-8 or LATIN9 for the euro symbol. LATIN1 supports that
> character-number but it is used for a different symbol.
>
> Your first step needs to be to find out what character-set your data is in.
> Your second is then to decide what char-set you want to use to store it
> in PG.
> Then you can decide how to get there.

In PG, UTF8 was my choice (the DB already exists, I am just
adding some tables that are still stored in MSSQL), and
according to what you say this was the right choice.
The problem is really about reading this file I think.

I thought that given the character sets available in
PostgreSQL, I would be able to COPY directly from my
exported files.
If I have to convert them using some third party tool, I'll
do that, but that's a bit more painful...

--
Arnaud

Re: MSSQL to PostgreSQL : Encoding problem

From
"Tomi NA"
Date:
2006/11/21, Arnaud Lesauvage <thewild@freesurf.fr>:
> Hi list !
>
> I already posted this as "COPY FROM encoding error", but I have
> been doing some more tests since then.
>
> I'm trying to export data from MS SQL Server to PostgreSQL.
> The tables are quite big (>20M rows), so a CSV export and a "COPY
> FROM3 import seems to be the only reasonable solution.

I believe you might have more luck working without files altogether.
Use an ETL tool like kettle or even DTS with the pgsql ODBC driver.
That's exactly what those tools are for.

You still have to get the encodings right, though.
I suggest unicode for pgsql, but only you know how the MSSQL database
is encoded.

t.n.a.

Re: MSSQL to PostgreSQL : Encoding problem

From
Arnaud Lesauvage
Date:
Richard Huxton a écrit :
> Arnaud Lesauvage wrote:
>> Richard Huxton a écrit :
>>>
>>> Or go via MS-Access/Perl and ODBC/DBI perhaps?
>>
>> Yes, I think it would work. The problem is that the DB is too big for
>> this king of export. Using DTS from MSSQL to export directly to
>> PostgreSQL using psqlODBC Unicode Driver, I exported ~1000 rows per
>> second in a 2-columns table with ~20M rows. That means several days just
>> for this table, and I have bigger ones !
>
> Well it's about 0.25 days, but if it's too long, it's too long.

Sure, sorry for the confusion, the problem is with the other
tables (same number of rows but a lot of columns, some very
large).

>> > Collation in MSSQL is SQL_Latin1_General_CP1_Cl_AS.
>> > DTS documentation tells me that exporting in ANSI should export using
>> > the current codepage.
>> > According to my local setting, my codepage is Windows-1252.
>> > This file is not correctly read by COPY when using client_encoding of
>> > WIN1252 though...
>
> Hmm. Odd that they don't agree on what WIN1252 is. I'm not sure how to
> check the file and confirm one way or the other. Anyone else on the list
> got an idea?

I just downloaded the GnuWin32 version of iconv.
I'm giving it a try and I'll tell you haw it went.


--
Arnaud

Re: MSSQL to PostgreSQL : Encoding problem

From
Alvaro Herrera
Date:
Arnaud Lesauvage wrote:
> Tomi NA a écrit :
> >>I think I'll go this way... No other choice, actually !
> >>The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS.
> >>I don't really understand what this is. It supports the euro
> >>symbol, so it is probably not pure LATIN1, right ?
> >
> >I suppose you'd have to look at the latin1 codepage character table
> >somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the
> >question. :)
>
> Yep, http://en.wikipedia.org/wiki/Latin-1 tells me that
> LATIN1 is missing the euro sign...
> Grrrrr I hate this !!!

So use Latin9 ...

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: MSSQL to PostgreSQL : Encoding problem

From
"Tomi NA"
Date:
2006/11/22, Arnaud Lesauvage <thewild@freesurf.fr>:
> Tomi NA a écrit :
> > 2006/11/21, Arnaud Lesauvage <thewild@freesurf.fr>:
> >> Hi list !
> >>
> >> I already posted this as "COPY FROM encoding error", but I have
> >> been doing some more tests since then.
> >>
> >> I'm trying to export data from MS SQL Server to PostgreSQL.
> >> The tables are quite big (>20M rows), so a CSV export and a "COPY
> >> FROM3 import seems to be the only reasonable solution.
> >
> > I believe you might have more luck working without files altogether.
> > Use an ETL tool like kettle or even DTS with the pgsql ODBC driver.
> > That's exactly what those tools are for.
> >
> > You still have to get the encodings right, though.
> > I suggest unicode for pgsql, but only you know how the MSSQL database
> > is encoded.
>
> I think I'll go this way... No other choice, actually !
> The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS.
> I don't really understand what this is. It supports the euro
> symbol, so it is probably not pure LATIN1, right ?

I suppose you'd have to look at the latin1 codepage character table
somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the
question. :)

t.n.a.

Re: MSSQL to PostgreSQL : Encoding problem

From
"Magnus Hagander"
Date:
> > I have done this in Delphi using it's built in UTF8 encoding and
> > decoding routines.   You can get a free copy of Delphi
> Turbo Explorer
> > which includes components for MS SQL server and ODBC, so it
> would be
> > pretty straight forward to get this working.
> >
> > The actual method in Delphi is system.UTF8Encode(widestring).  This
> > will encode unicode to UTF8 which is compatible with a
> Postgresql UTF8 database.
>
> Ah, that's useful to know. Windows just doesn't have the same
> quantity of tools installed as a *nix platform.

If your file is small enough, you can just open it up in Notepad and
re-save it as UTF8. It might play funny with the BOMs though
(byte-order-marks).

There is also, IIRC, an iconv binary available for Windows that should
be able to do such a conversion. Can't rememebr where thuogh :-)

//Magnus

Re: MSSQL to PostgreSQL : Encoding problem

From
"Magnus Hagander"
Date:
> >> I already posted this as "COPY FROM encoding error", but I
> have been
> >> doing some more tests since then.
> >>
> >> I'm trying to export data from MS SQL Server to PostgreSQL.
> >> The tables are quite big (>20M rows), so a CSV export and a "COPY
> >> FROM3 import seems to be the only reasonable solution.
> >
> > Or go via MS-Access/Perl and ODBC/DBI perhaps?
>
> Yes, I think it would work. The problem is that the DB is too
> big for this king of export. Using DTS from MSSQL to export
> directly to PostgreSQL using psqlODBC Unicode Driver, I
> exported ~1000 rows per second in a 2-columns table with ~20M
> rows. That means several days just for this table, and I have
> bigger ones !
>

Interesting. What did you set the "Inser batch size" to? (I think that's
available for all transformatino tasks). And did you remember to check
the box for "use transactions"?

While it's never as fast as a COPY, it should be possible to make it
faster than that, Ithink.


Another option is to just BCP the file out, and then COPY it into
postgresql. No nice GUI, but you can tune almost everything with BCP.

//Magnus

Re: MSSQL to PostgreSQL : Encoding problem

From
Arnaud Lesauvage
Date:
Alvaro Herrera a écrit :
> Arnaud Lesauvage wrote:
>> Tomi NA a écrit :
>> >>I think I'll go this way... No other choice, actually !
>> >>The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS.
>> >>I don't really understand what this is. It supports the euro
>> >>symbol, so it is probably not pure LATIN1, right ?
>> >
>> >I suppose you'd have to look at the latin1 codepage character table
>> >somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the
>> >question. :)
>>
>> Yep, http://en.wikipedia.org/wiki/Latin-1 tells me that
>> LATIN1 is missing the euro sign...
>> Grrrrr I hate this !!!
>
> So use Latin9 ...

Of course, but it doesn't work !!!
Whatever client encoding I choose in postgresql before
COPYing, I get the 'invalid byte sequence error'.

The farther I can get is exporting to UNICODE and importing
as UTF8. Then COPY only breaks on the euro symbol (otherwise
it breaks very early, I think on the first "non-ascii"
character).

--
Arnaud

Re: MSSQL to PostgreSQL : Encoding problem

From
Arnaud Lesauvage
Date:
Tomi NA a écrit :
> 2006/11/21, Arnaud Lesauvage <thewild@freesurf.fr>:
>> Hi list !
>>
>> I already posted this as "COPY FROM encoding error", but I have
>> been doing some more tests since then.
>>
>> I'm trying to export data from MS SQL Server to PostgreSQL.
>> The tables are quite big (>20M rows), so a CSV export and a "COPY
>> FROM3 import seems to be the only reasonable solution.
>
> I believe you might have more luck working without files altogether.
> Use an ETL tool like kettle or even DTS with the pgsql ODBC driver.
> That's exactly what those tools are for.
>
> You still have to get the encodings right, though.
> I suggest unicode for pgsql, but only you know how the MSSQL database
> is encoded.

I think I'll go this way... No other choice, actually !
The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS.
I don't really understand what this is. It supports the euro
symbol, so it is probably not pure LATIN1, right ?

--
Arnaud

Re: MSSQL to PostgreSQL : Encoding problem

From
Alvaro Herrera
Date:
Arnaud Lesauvage wrote:
> Alvaro Herrera a écrit :
> >Arnaud Lesauvage wrote:
> >>Tomi NA a écrit :
> >>>>I think I'll go this way... No other choice, actually !
> >>>>The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS.
> >>>>I don't really understand what this is. It supports the euro
> >>>>symbol, so it is probably not pure LATIN1, right ?
> >>>
> >>>I suppose you'd have to look at the latin1 codepage character table
> >>>somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the
> >>>question. :)
> >>
> >>Yep, http://en.wikipedia.org/wiki/Latin-1 tells me that
> >>LATIN1 is missing the euro sign...
> >>Grrrrr I hate this !!!
> >
> >So use Latin9 ...
>
> Of course, but it doesn't work !!!
> Whatever client encoding I choose in postgresql before
> COPYing, I get the 'invalid byte sequence error'.

Humm ... how are you choosing the client encoding?  Is it actually
working?  I don't see how choosing Latin1 or Latin9 and feeding whatever
byte sequence would give you an "invalid byte sequence".  These charsets
don't have any way to validate the bytes, as opposed to what UTF-8 can
do.  So you could end up with invalid bytes if you choose the wrong
client encoding, but that's a different error.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: MSSQL to PostgreSQL : Encoding problem

From
Richard Huxton
Date:
Arnaud Lesauvage wrote:
> Richard Huxton a écrit :
>>
>> Or go via MS-Access/Perl and ODBC/DBI perhaps?
>
> Yes, I think it would work. The problem is that the DB is too big for
> this king of export. Using DTS from MSSQL to export directly to
> PostgreSQL using psqlODBC Unicode Driver, I exported ~1000 rows per
> second in a 2-columns table with ~20M rows. That means several days just
> for this table, and I have bigger ones !

Well it's about 0.25 days, but if it's too long, it's too long.

>>> In DTS, I have 3 options to export a table as a text file : ANSI, OEM
>>> and UNICODE.
>>> I tried all these options (and I have three files, one for each).
>>
>> Well, what character-set is your database in?
>
>
> Collation in MSSQL is SQL_Latin1_General_CP1_Cl_AS.
> DTS documentation tells me that exporting in ANSI should export using
> the current codepage.
> According to my local setting, my codepage is Windows-1252.
> This file is not correctly read by COPY when using client_encoding of
> WIN1252 though...

Hmm. Odd that they don't agree on what WIN1252 is. I'm not sure how to
check the file and confirm one way or the other. Anyone else on the list
got an idea?

--
   Richard Huxton
   Archonet Ltd


Re: MSSQL to PostgreSQL : Encoding problem

From
Tony Caduto
Date:
>
> Of course, but it doesn't work !!!
> Whatever client encoding I choose in postgresql before COPYing, I get
> the 'invalid byte sequence error'.
>
> The farther I can get is exporting to UNICODE and importing as UTF8.
> Then COPY only breaks on the euro symbol (otherwise it breaks very
> early, I think on the first "non-ascii" character).
>
> --
Like I said before UNICODE <> UTF8  That's why the COPY command breaks
on the Euro symbol.
You will have to export as UNICODE, then encode it as UTF8, then you
won't get the breakage.

UTF8 is simply a means to store UNICODE pretty much as ASCII text.

You could grab a copy of Delphi TurboExplorer and create a import
routine using the dbGO ADO components and the PG ODBC driver.
Basicly you need to encode any UNICODE data going to the PG server with
the system.utf8encode function:

[Delphi] function *UTF8Encode*(const WS: WideString): UTF8String;
Call Utf8Encode to convert a Unicode string to UTF-8. WS is the Unicode
string to convert. Utf8Encode returns the corresponding UTF-8 string.

I would imagine that Perl also has such routines, but I don't know for
sure. These routines might be in FreePascal as well.

--
Tony Caduto
AM Software Design
http://www.amsoftwaredesign.com
Home of PG Lightning Admin for Postgresql
Your best bet for Postgresql Administration


Re: MSSQL to PostgreSQL : Encoding problem

From
Arnaud Lesauvage
Date:
Magnus Hagander a écrit :
>> > I have done this in Delphi using it's built in UTF8 encoding and
>> > decoding routines.   You can get a free copy of Delphi
>> Turbo Explorer
>> > which includes components for MS SQL server and ODBC, so it
>> would be
>> > pretty straight forward to get this working.
>> >
>> > The actual method in Delphi is system.UTF8Encode(widestring).  This
>> > will encode unicode to UTF8 which is compatible with a
>> Postgresql UTF8 database.
>>
>> Ah, that's useful to know. Windows just doesn't have the same
>> quantity of tools installed as a *nix platform.
>
> If your file is small enough, you can just open it up in Notepad and
> re-save it as UTF8. It might play funny with the BOMs though
> (byte-order-marks).
>
> There is also, IIRC, an iconv binary available for Windows that should
> be able to do such a conversion. Can't rememebr where thuogh :-)

The file is way too big for notepad. It is even too big for
notepad++.

I do have the GnuWin32 version of iconv (*great* software
collection, BTW), but still no go...
I tried iconv -f "CP1252" -t "UTF-8"
detailrecherche_ansi.csv >detailrecherche_cp1252utf8.csv
and iconv -f "LATIN-9" -t "UTF-8" detailrecherche_ansi.csv
 >detailrecherche_latin9utf8.csv

Both don't want to load as UTF8 (invalid byte sequence x00).

I am desperate...
--
Arnaud

Re: MSSQL to PostgreSQL : Encoding problem

From
Arnaud Lesauvage
Date:
Alvaro Herrera a écrit :
> Arnaud Lesauvage wrote:
>> Alvaro Herrera a écrit :
>> >Arnaud Lesauvage wrote:
>> >>Tomi NA a écrit :
>> >>>>I think I'll go this way... No other choice, actually !
>> >>>>The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS.
>> >>>>I don't really understand what this is. It supports the euro
>> >>>>symbol, so it is probably not pure LATIN1, right ?
>> >>>
>> >>>I suppose you'd have to look at the latin1 codepage character table
>> >>>somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the
>> >>>question. :)
>> >>
>> >>Yep, http://en.wikipedia.org/wiki/Latin-1 tells me that
>> >>LATIN1 is missing the euro sign...
>> >>Grrrrr I hate this !!!
>> >
>> >So use Latin9 ...
>>
>> Of course, but it doesn't work !!!
>> Whatever client encoding I choose in postgresql before
>> COPYing, I get the 'invalid byte sequence error'.
>
> Humm ... how are you choosing the client encoding?  Is it actually
> working?  I don't see how choosing Latin1 or Latin9 and feeding whatever
> byte sequence would give you an "invalid byte sequence".  These charsets
> don't have any way to validate the bytes, as opposed to what UTF-8 can
> do.  So you could end up with invalid bytes if you choose the wrong
> client encoding, but that's a different error.
>

mydb=# SET client_encoding TO LATIN9;
SET
mydb=# COPY statistiques.detailrecherche (log_gid,
champrecherche, valeurrecherche) FROM
'E:\\Production\\Temp\\detailrecherche_ansi.csv' CSV;
ERROR:  invalid byte sequence for encoding "LATIN9": 0x00
HINT:  This error can also happen if the byte sequence does
not match the encoding expected by the server, which is
controlled by "client_encoding".
CONTEXT:  COPY detailrecherche, line 9212
mydb=# SET client_encoding TO WIN1252;
SET
mydb=# COPY statistiques.detailrecherche (log_gid,
champrecherche, valeurrecherche) FROM
'E:\\Production\\Temp\\detailrecherche_ansi.csv' CSV;
ERROR:  invalid byte sequence for encoding "WIN1252": 0x00
HINT:  This error can also happen if the byte sequence does
not match the encoding expected by the server, which is
controlled by "client_encoding".
CONTEXT:  COPY detailrecherche, line 9212


Really, I'd rather have another error, but this is all I can
get.
This is with the "ANSI" export.
With the "UNICODE" export :

mydb=# SET client_encoding TO UTF8;
SET
mydb=# COPY statistiques.detailrecherche (log_gid,
champrecherche, valeurrecherche) FROM
'E:\\Production\\Temp\\detailrecherche_unicode.csv' CSV;
ERROR:  invalid byte sequence for encoding "UTF8": 0xff
HINT:  This error can also happen if the byte sequence does
not match the encoding expected by the server, which is
controlled by "client_encoding".
CONTEXT:  COPY detailrecherche, line 592680

So, line 592680 is *a lot* better, but it is still not good!

--
Arnaud


Re: MSSQL to PostgreSQL : Encoding problem

From
"Thomas H."
Date:
>>>> Or go via MS-Access/Perl and ODBC/DBI perhaps?
>>>
>>> Yes, I think it would work. The problem is that the DB is too big for
>>> this king of export. Using DTS from MSSQL to export directly to
>>> PostgreSQL using psqlODBC Unicode Driver, I exported ~1000 rows per
>>> second in a 2-columns table with ~20M rows. That means several days just
>>> for this table, and I have bigger ones !
>>
>> Well it's about 0.25 days, but if it's too long, it's too long.
>
> Sure, sorry for the confusion, the problem is with the other tables (same
> number of rows but a lot of columns, some very large).
>

well, if its too slow, then you will have to dump the db to a textfile (DTS
does this for you) and then convert the textfile to utf8 manually before
importing it to pgsql. iconv for win32 will help you there. i found tho it
removes some wanted special characters, so watch out.
a less "scientific" approach would be using an unicode-aware texteditor to
convert it (ultraedit does this pretty nicely, for example). have had good
results with it.

loading several million rows will always take some time, tho.

- thomas



Re: MSSQL to PostgreSQL : Encoding problem

From
Alvaro Herrera
Date:
Arnaud Lesauvage wrote:

> mydb=# SET client_encoding TO LATIN9;
> SET
> mydb=# COPY statistiques.detailrecherche (log_gid,
> champrecherche, valeurrecherche) FROM
> 'E:\\Production\\Temp\\detailrecherche_ansi.csv' CSV;
> ERROR:  invalid byte sequence for encoding "LATIN9": 0x00
> HINT:  This error can also happen if the byte sequence does
> not match the encoding expected by the server, which is
> controlled by "client_encoding".

Huh, why do you have a "0x00" byte in there?  That's certainly not
Latin9 (nor UTF8 as far as I know).

Is the file actually Latin-something or did you convert it to something
else at some point?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: MSSQL to PostgreSQL : Encoding problem

From
Alvaro Herrera
Date:
Arnaud Lesauvage wrote:
> Alvaro Herrera a écrit :
> >Arnaud Lesauvage wrote:
> >
> >>mydb=# SET client_encoding TO LATIN9;
> >>SET
> >>mydb=# COPY statistiques.detailrecherche (log_gid,
> >>champrecherche, valeurrecherche) FROM
> >>'E:\\Production\\Temp\\detailrecherche_ansi.csv' CSV;
> >>ERROR:  invalid byte sequence for encoding "LATIN9": 0x00
> >>HINT:  This error can also happen if the byte sequence does
> >>not match the encoding expected by the server, which is
> >>controlled by "client_encoding".
> >
> >Huh, why do you have a "0x00" byte in there?  That's certainly not
> >Latin9 (nor UTF8 as far as I know).
> >
> >Is the file actually Latin-something or did you convert it to something
> >else at some point?
>
> This is the file generated by DTS with "ANSI" encoding. It
> was not altered in any way after that !
> The doc states that ANSI exports with the local codepage
> (which is Win1252). That's all I know. :(

I thought Win1252 was supposed to be almost the same as Latin1.  While
I'd expect certain differences, I wouldn't expect it to use 0x00 as
data!

Maybe you could have DTS export Unicode, which would presumably be
UTF-16, then recode that to something else (possibly UTF-8) with GNU
iconv.

FWIW, I think the preferred way to set the client encoding on psql is
\encoding.  I'm not sure if it does anything different from the SET
command though.

--
Alvaro Herrera       Valdivia, Chile   ICBM: S 39º 49' 18.1", W 73º 13' 56.4"
"In Europe they call me Niklaus Wirth; in the US they call me Nickel's worth.
 That's because in Europe they call me by name, and in the US by value!"

Re: MSSQL to PostgreSQL : Encoding problem

From
Arnaud Lesauvage
Date:
Alvaro Herrera a écrit :
> Arnaud Lesauvage wrote:
>> Alvaro Herrera a écrit :
>> >Arnaud Lesauvage wrote:
>> >
>> >>mydb=# SET client_encoding TO LATIN9;
>> >>SET
>> >>mydb=# COPY statistiques.detailrecherche (log_gid,
>> >>champrecherche, valeurrecherche) FROM
>> >>'E:\\Production\\Temp\\detailrecherche_ansi.csv' CSV;
>> >>ERROR:  invalid byte sequence for encoding "LATIN9": 0x00
>> >>HINT:  This error can also happen if the byte sequence does
>> >>not match the encoding expected by the server, which is
>> >>controlled by "client_encoding".
>> >
>> >Huh, why do you have a "0x00" byte in there?  That's certainly not
>> >Latin9 (nor UTF8 as far as I know).
>> >
>> >Is the file actually Latin-something or did you convert it to something
>> >else at some point?
>>
>> This is the file generated by DTS with "ANSI" encoding. It
>> was not altered in any way after that !
>> The doc states that ANSI exports with the local codepage
>> (which is Win1252). That's all I know. :(
>
> I thought Win1252 was supposed to be almost the same as Latin1.  While
> I'd expect certain differences, I wouldn't expect it to use 0x00 as
> data!
>
> Maybe you could have DTS export Unicode, which would presumably be
> UTF-16, then recode that to something else (possibly UTF-8) with GNU
> iconv.

UTF-16 ! That's something I haven't tried !
I'll try an iconv conversion tomorrow from UTF16 to UTF8 !

--
Arnaud

Re: MSSQL to PostgreSQL : Encoding problem

From
Bruce Momjian
Date:
Arnaud Lesauvage wrote:
> > I thought Win1252 was supposed to be almost the same as Latin1.  While
> > I'd expect certain differences, I wouldn't expect it to use 0x00 as
> > data!
> >
> > Maybe you could have DTS export Unicode, which would presumably be
> > UTF-16, then recode that to something else (possibly UTF-8) with GNU
> > iconv.
>
> UTF-16 ! That's something I haven't tried !
> I'll try an iconv conversion tomorrow from UTF16 to UTF8 !

Right!  To clarify, Unicode is the character set, and UTF8 and UTF16 are
ways of representing that characters set in 8-bit and 16-bit segments,
respectively.  PostgreSQL only suports UTF8, and Win32 only supports
UTF16 in the operating system.  And 0x00 is not a valid value in any of
those, that I know of, but perhaps it is in UTF16.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: MSSQL to PostgreSQL : Encoding problem

From
"Magnus Hagander"
Date:
> > > I thought Win1252 was supposed to be almost the same as Latin1.
> > > While I'd expect certain differences, I wouldn't expect it to use
> > > 0x00 as data!
> > >
> > > Maybe you could have DTS export Unicode, which would
> presumably be
> > > UTF-16, then recode that to something else (possibly
> UTF-8) with GNU
> > > iconv.
> >
> > UTF-16 ! That's something I haven't tried !
> > I'll try an iconv conversion tomorrow from UTF16 to UTF8 !
>
> Right!  To clarify, Unicode is the character set, and UTF8
> and UTF16 are ways of representing that characters set in
> 8-bit and 16-bit segments, respectively.  PostgreSQL only
> suports UTF8, and Win32 only supports
> UTF16 in the operating system.  And 0x00 is not a valid value
> in any of those, that I know of, but perhaps it is in UTF16.

Actually, Win32 supports UTF8 as well. There are a few operations that
aren't supported on it, but you can certainly read and write files in it
from most builtin apps.

One other problem is that in most (all) win32 documentation talks about
UNICODE when they mean UTF16 (in <= NT4, UCS-2). And PostgreSQL used to
say UNICODE when we meant UTF8. Adds to the confusion.

Finally, UTF-8 does not represent the characters in 8-bit segments - it
can use anything from 8 to 32 bits. UTF-16 always uses 16 bits. This
also means that you acn't talk about "0x00 being valid" in UTF-16,
because all characters are 16-bit. It would be "0x0000" or "0x00 0x00".
But that requires an application that knows UTF16, which postgresql
doesn't, so it reports on the first 0x00.

//Magnus

Re: MSSQL to PostgreSQL : Encoding problem

From
"Brandon Aiken"
Date:
It also might be a big/little endian problem, although I always thought that was platform specific, not locale
specific.

Try the UCS-2-INTERNAL and UCS-4-INTERNAL codepages in iconv, which should use the two-byte or four-byte versions of
UCSencoding using the system's default endian setting. 

There's many Unicode codepage formats that iconv supports:
UTF-8
ISO-10646-UCS-2 UCS-2 CSUNICODE
UCS-2BE UNICODE-1-1 UNICODEBIG CSUNICODE11
UCS-2LE UNICODELITTLE
ISO-10646-UCS-4 UCS-4 CSUCS4
UCS-4BE
UCS-4LE
UTF-16
UTF-16BE
UTF-16LE
UTF-32
UTF-32BE
UTF-32LE
UNICODE-1-1-UTF-7 UTF-7 CSUNICODE11UTF7
UCS-2-INTERNAL
UCS-2-SWAPPED
UCS-4-INTERNAL
UCS-4-SWAPPED

Gee, didn't Unicode just so simplify this codepage mess?  Remember when it was just ASCII, EBCDIC, ANSI, and localized
codepages?

--
Brandon Aiken
CS/IT Systems Engineer
-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Arnaud Lesauvage
Sent: Wednesday, November 22, 2006 12:38 PM
To: Arnaud Lesauvage; General
Subject: Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

Alvaro Herrera a écrit :
> Arnaud Lesauvage wrote:
>> Alvaro Herrera a écrit :
>> >Arnaud Lesauvage wrote:
>> >
>> >>mydb=# SET client_encoding TO LATIN9;
>> >>SET
>> >>mydb=# COPY statistiques.detailrecherche (log_gid,
>> >>champrecherche, valeurrecherche) FROM
>> >>'E:\\Production\\Temp\\detailrecherche_ansi.csv' CSV;
>> >>ERROR:  invalid byte sequence for encoding "LATIN9": 0x00
>> >>HINT:  This error can also happen if the byte sequence does
>> >>not match the encoding expected by the server, which is
>> >>controlled by "client_encoding".
>> >
>> >Huh, why do you have a "0x00" byte in there?  That's certainly not
>> >Latin9 (nor UTF8 as far as I know).
>> >
>> >Is the file actually Latin-something or did you convert it to something
>> >else at some point?
>>
>> This is the file generated by DTS with "ANSI" encoding. It
>> was not altered in any way after that !
>> The doc states that ANSI exports with the local codepage
>> (which is Win1252). That's all I know. :(
>
> I thought Win1252 was supposed to be almost the same as Latin1.  While
> I'd expect certain differences, I wouldn't expect it to use 0x00 as
> data!
>
> Maybe you could have DTS export Unicode, which would presumably be
> UTF-16, then recode that to something else (possibly UTF-8) with GNU
> iconv.

UTF-16 ! That's something I haven't tried !
I'll try an iconv conversion tomorrow from UTF16 to UTF8 !

--
Arnaud

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not
       match

Re: MSSQL to PostgreSQL : Encoding problem

From
"Tomi NA"
Date:
2006/11/22, Brandon Aiken <BAiken@winemantech.com>:

> Gee, didn't Unicode just so simplify this codepage mess?  Remember when it was just ASCII, EBCDIC, ANSI, and
localizedcodepages? 

Unicode is a heaven sent, compared to 3 or 4 codepages representing
any given (obviously non-English) language, and 3 or 4 more for every
other language you have to deal with in your application. Perfect?
Hardly. But then again, much more so than natural languages.
I'd say we'd deliver products 10-20% faster (in the company I work in)
if people looked ahead a couple of decades ago and decided upon
something along the lines of unicode instead of ASCII.

Cheers,
t.n.a.

Re: MSSQL to PostgreSQL : Encoding problem

From
Martijn van Oosterhout
Date:
On Wed, Nov 22, 2006 at 01:55:55PM -0500, Brandon Aiken wrote:
> Gee, didn't Unicode just so simplify this codepage mess?  Remember
> when it was just ASCII, EBCDIC, ANSI, and localized codepages?

I think that's one reason why Unix has standardised on UTF-8 rather
than one of the other Unicode variants. For transmission between
systems it's the easiest to get right...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Attachment

Re: MSSQL to PostgreSQL : Encoding problem

From
Arnaud Lesauvage
Date:
Brandon Aiken a écrit :
> It also might be a big/little endian problem, although I always thought that was platform specific, not locale
specific.
>
> Try the UCS-2-INTERNAL and UCS-4-INTERNAL codepages in iconv, which should use the two-byte or four-byte versions of
UCSencoding using the system's default endian setting. 


Guys, it worked !!!!
UCS-4-INTERNAL was the right choice !!!

I love you all !

(now I just have an out of memory problem, but that's going
to be a new thread)

--
Arnaud


Re: MSSQL to PostgreSQL : Encoding problem

From
Arnaud Lesauvage
Date:
Arnaud Lesauvage a écrit :
> Brandon Aiken a écrit :
>> It also might be a big/little endian problem, although I always thought that was platform specific, not locale
specific.
>>
>> Try the UCS-2-INTERNAL and UCS-4-INTERNAL codepages in iconv, which should use the two-byte or four-byte versions of
UCSencoding using the system's default endian setting. 
>
>
> Guys, it worked !!!!
> UCS-4-INTERNAL was the right choice !!!
>
> I love you all !
>
> (now I just have an out of memory problem, but that's going
> to be a new thread)

Guys, it did not work !!! :(
I thought it worked because postgres seemed to be loading
the file and failing at the end with an "out of memory"
error, but in fact I think the conversion remove all
end-of-line characters (one line of 1.5GB was too much for
COPY...).

Still searching !

--
Arnaud

Re: MSSQL to PostgreSQL : Encoding problem

From
"Tomi NA"
Date:
2006/11/23, Arnaud Lesauvage <thewild@freesurf.fr>:
> Arnaud Lesauvage a écrit :
> > Brandon Aiken a écrit :
> >> It also might be a big/little endian problem, although I always thought that was platform specific, not locale
specific.
> >>
> >> Try the UCS-2-INTERNAL and UCS-4-INTERNAL codepages in iconv, which should use the two-byte or four-byte versions
ofUCS encoding using the system's default endian setting. 
> >
> >
> > Guys, it worked !!!!
> > UCS-4-INTERNAL was the right choice !!!
> >
> > I love you all !
> >
> > (now I just have an out of memory problem, but that's going
> > to be a new thread)
>
> Guys, it did not work !!! :(
> I thought it worked because postgres seemed to be loading
> the file and failing at the end with an "out of memory"
> error, but in fact I think the conversion remove all
> end-of-line characters (one line of 1.5GB was too much for
> COPY...).
>
> Still searching !

It will take you a day or two to get started, and then a day or two to
get the job done, but you really might want to look into kettle or
some other ETL tool to do the job.
It looks to me like you're trying to screw in a screw using a hammer.

t.n.a.

Re: MSSQL to PostgreSQL : Encoding problem

From
Arnaud Lesauvage
Date:
Tomi NA a écrit :
> 2006/11/23, Arnaud Lesauvage <thewild@freesurf.fr>:
>> Arnaud Lesauvage a écrit :
>> > Brandon Aiken a écrit :
>> >> It also might be a big/little endian problem, although I always thought that was platform specific, not locale
specific.
>> >>
>> >> Try the UCS-2-INTERNAL and UCS-4-INTERNAL codepages in iconv, which should use the two-byte or four-byte versions
ofUCS encoding using the system's default endian setting. 
>> >
>> >
>> > Guys, it worked !!!!
>> > UCS-4-INTERNAL was the right choice !!!
>> >
>> > I love you all !
>> >
>> > (now I just have an out of memory problem, but that's going
>> > to be a new thread)
>>
>> Guys, it did not work !!! :(
>> I thought it worked because postgres seemed to be loading
>> the file and failing at the end with an "out of memory"
>> error, but in fact I think the conversion remove all
>> end-of-line characters (one line of 1.5GB was too much for
>> COPY...).
>>
>> Still searching !
>
> It will take you a day or two to get started, and then a day or two to
> get the job done, but you really might want to look into kettle or
> some other ETL tool to do the job.
> It looks to me like you're trying to screw in a screw using a hammer.

Yes, I might try something else.
I was thinking that others would probably run into this
problem sometime, and that our investigations might help them.
I think I'll forget about this COPY stuff and just export
with DTS through PostgreSQL ODBC Unicode.

--
Arnaud