Re: MSSQL to PostgreSQL : Encoding problem - Mailing list pgsql-general

From Arnaud Lesauvage
Subject Re: MSSQL to PostgreSQL : Encoding problem
Date
Msg-id 45641385.5020006@freesurf.fr
Whole thread Raw
In response to Re: MSSQL to PostgreSQL : Encoding problem  (Richard Huxton <dev@archonet.com>)
Responses Re: MSSQL to PostgreSQL : Encoding problem  ("Magnus Hagander" <mha@sollentuna.net>)
Re: MSSQL to PostgreSQL : Encoding problem  (Richard Huxton <dev@archonet.com>)
List pgsql-general
Richard Huxton a écrit :
> Arnaud Lesauvage wrote:
>> Hi list !
>>
>> I already posted this as "COPY FROM encoding error", but I have been
>> doing some more tests since then.
>>
>> I'm trying to export data from MS SQL Server to PostgreSQL.
>> The tables are quite big (>20M rows), so a CSV export and a "COPY FROM3
>> import seems to be the only reasonable solution.
>
> Or go via MS-Access/Perl and ODBC/DBI perhaps?

Yes, I think it would work. The problem is that the DB is
too big for this king of export. Using DTS from MSSQL to
export directly to PostgreSQL using psqlODBC Unicode Driver,
I exported ~1000 rows per second in a 2-columns table with
~20M rows. That means several days just for this table, and
I have bigger ones !


>> In DTS, I have 3 options to export a table as a text file : ANSI, OEM
>> and UNICODE.
>> I tried all these options (and I have three files, one for each).
>
> Well, what character-set is your database in?


Collation in MSSQL is SQL_Latin1_General_CP1_Cl_AS.
DTS documentation tells me that exporting in ANSI should
export using the current codepage.
According to my local setting, my codepage is Windows-1252.
This file is not correctly read by COPY when using
client_encoding of WIN1252 though...


>> I then try to import into PostgreSQL. The farther I can get is when
>> using the UNICODE export, and importing it using a client_encoding set
>> to UTF8 (I tried WIN1252, LATIN9, LATIN1, ...).
>> The copy then stops with an error :
>> ERROR: invalid byte sequence for encoding "UTF8": 0xff
>> État SQL :22021
>>
>> The problematic character is the euro currency symbol.
>
> You'll want UTF-8 or LATIN9 for the euro symbol. LATIN1 supports that
> character-number but it is used for a different symbol.
>
> Your first step needs to be to find out what character-set your data is in.
> Your second is then to decide what char-set you want to use to store it
> in PG.
> Then you can decide how to get there.

In PG, UTF8 was my choice (the DB already exists, I am just
adding some tables that are still stored in MSSQL), and
according to what you say this was the right choice.
The problem is really about reading this file I think.

I thought that given the character sets available in
PostgreSQL, I would be able to COPY directly from my
exported files.
If I have to convert them using some third party tool, I'll
do that, but that's a bit more painful...

--
Arnaud

pgsql-general by date:

Previous
From: Arnaud Lesauvage
Date:
Subject: Re: MSSQL to PostgreSQL : Encoding problem
Next
From: "Tomi NA"
Date:
Subject: Re: MSSQL to PostgreSQL : Encoding problem