Thread: UNICODE problem on 7.4 with COPY

UNICODE problem on 7.4 with COPY

From
"Toby Doig"
Date:
When I try to import data from a Unicode file into PostgreSQL 7.4 under FreeBSD, it appears not to understand the
Unicode file format.

To demonstrate, I export a set of integers into a Unicode file from MSSQL 2000. I copy the file over Samba to a FreeBSD
box and try to import it from psql with COPY. It fails. Wordpad and Notepad both read the file OK, even after I bounce
the file via the FreeBSD box (to test that Samba didn't munge it).

FreeBSD 5.1-RELEASE #0
PGSql 7.4 (dl'd and compiled fri 28th Nov 2003)
Dual 800MHz P3's

I create a database with encoding = UNICODE.
I create a table

CREATE TABLE testunicode
(
  anum int4
) WITHOUT OIDS;

I then use psql to import the file, which is a single column of integers.

copy testunicode from '/home/toby/itxt/anum.txt';
ERROR:  invalid input syntax for integer: "ÿþ1"
CONTEXT:  COPY testunicode, line 1, column anum: "ÿþ1"


When viewing the file as hex I see:
FF FE 31 00 31 00 32 00 37 00 39 00 30 00 0D 00 0A 00
 ÿ  þ  1  .  1  .  2  .  7  .  9  .  0  .  .  .  .  .

According to http://www.crispen.org/src/archive/0013.html

FF FE   UTF-16/UCS-2, little endian
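
As a quick check of the byte order, the sketch below rebuilds the 18 bytes from the hex dump and classifies the first two. FF FE is the little-endian form of the UTF-16 byte order mark, which matches the 31 00 pairs that follow it. The file name sample.txt is made up for this illustration.

```shell
# Sketch: rebuild the 18 bytes shown in the hex dump and classify the BOM.
# sample.txt is a hypothetical file name used only for this illustration.
printf '\377\376\061\000\061\000\062\000\067\000\071\000\060\000\015\000\012\000' > sample.txt

# Read the first two bytes as lowercase hex, e.g. "fffe".
bom=$(head -c 2 sample.txt | od -An -tx1 | tr -d ' \n')

case "$bom" in
  fffe) order="little endian" ;;
  feff) order="big endian" ;;
  *)    order="unknown (no UTF-16 BOM)" ;;
esac
echo "BOM $bom: UTF-16 $order"
```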

So, what is going wrong? Why can't I import this very simple unicode file?
I've searched the archives and google, but to no avail.

Btw, the actual stuff I want to import is larger and more complex, this little table is to demonstrate the problem.

Help would be much appreciated.
Toby

Re: UNICODE problem on 7.4 with COPY

From
Gianni Mariani
Date:
Toby Doig wrote:
...

>So, what is going wrong? Why can't I import this very simple unicode file?
>I've searched the archives and google, but to no avail.
>
>
try converting the file to utf-8.

iconv -t utf-8 -f utf-16 < unicode-file.txt > utf-8-file.txt
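
One way to check what the conversion actually produced is to dump the output as hex. This is a sketch under the assumption that iconv recognises the leading BOM when given -f utf-16 (GNU iconv does; other builds may differ). The names in.txt and out.txt are hypothetical.

```shell
# Sketch only: in.txt and out.txt are hypothetical names for this illustration.
# Rebuild the UTF-16 sample from the hex dump in the original post.
printf '\377\376\061\000\061\000\062\000\067\000\071\000\060\000\015\000\012\000' > in.txt

# Same conversion as suggested above; the BOM tells iconv the byte order.
iconv -f utf-16 -t utf-8 < in.txt > out.txt

# Dump the result as one hex string so the digits 1 1 2 7 9 0 (31 31 32 37 39 30)
# can be checked by eye.
converted=$(od -An -tx1 < out.txt | tr -d ' \n')
echo "$converted"
```

If the dump of the file given to COPY still starts with ff fe, that file is not the converted one. A trailing 0d before 0a is the carriage return from the Windows line ending in the source file.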




Re: UNICODE problem on 7.4 with COPY

From
"Toby Doig"
Date:
Same error as before

Toby Doig
Software Development Manager
Vibrant Media
toby@vibrantmedia.com
0207 239 0134

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Gianni Mariani
Sent: 01 December 2003 16:37
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] UNICODE problem on 7.4 with COPY

Toby Doig wrote:
...

>So, what is going wrong? Why can't I import this very simple unicode file?
>I've searched the archives and google, but to no avail.
>
>
try converting the file to utf-8.

iconv -t utf-8 -f utf-16 < unicode-file.txt > utf-8-file.txt





Re: UNICODE problem on 7.4 with COPY

From
Tino Wildenhain
Date:
Toby Doig wrote:
> When I try to import data from a Unicode file into PostgreSQL 7.4 under FreeBSD, it appears not to understand the
> Unicode file format.
>
> To demonstrate, I export a set of integers into a Unicode file from MSSQL 2000. I copy the file over Samba to a FreeBSD
> box and try to import it from psql with COPY. It fails. Wordpad and Notepad both read the file OK, even after I bounce
> the file via the FreeBSD box (to test that Samba didn't munge it).
>
> FreeBSD 5.1-RELEASE #0
> PGSql 7.4 (dl'd and compiled fri 28th Nov 2003)
> Dual 800MHz P3's
>
> I create a database with encoding = UNICODE.
> I create a table
>
> CREATE TABLE testunicode
> (
>   anum int4
> ) WITHOUT OIDS;
>
> I then use psql to import the file, which is a single column of integers.
>
> copy testunicode from '/home/toby/itxt/anum.txt';
> ERROR:  invalid input syntax for integer: "ÿþ1"
> CONTEXT:  COPY testunicode, line 1, column anum: "ÿþ1"
>
>
> When viewing the file as hex I see:
> FF FE 31 00 31 00 32 00 37 00 39 00 30 00 0D 00 0A 00
>  ÿ  þ  1  .  1  .  2  .  7  .  9  .  0  .  .  .  .  .
>
> According to http://www.crispen.org/src/archive/0013.html
>
> FF FE   UTF-16/UCS-2, little endian

See also
http://www.unicode.org/unicode/faq/utf_bom.html#22


>
> So, what is going wrong? Why can't I import this very simple unicode file?
> I've searched the archives and google, but to no avail.

PostgreSQL only accepts a stream of characters in the given
client encoding, which defaults to UTF-8 when you set up your
database as UNICODE. psql does not read the BOM in files,
since it operates on streams rather than on files.
I fear the same is true of PostgreSQL's COPY command.
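
Since COPY sees only a stream of bytes in the client encoding, one option is to convert the stream before it reaches the server. A minimal sketch, assuming iconv detects the BOM for -f utf-16; to_utf8 is a hypothetical helper, and the database name in the commented usage line is an assumption, not something from this thread.

```shell
# Sketch, assuming iconv is installed; to_utf8 is a hypothetical helper name.
# Converts a BOM-prefixed UTF-16 stream on stdin to UTF-8 on stdout.
to_utf8() {
  iconv -f utf-16 -t utf-8
}

# Hypothetical usage ("mydb" is an assumed database name):
#   to_utf8 < /home/toby/itxt/anum.txt | psql mydb -c "COPY testunicode FROM STDIN"
```

Feeding COPY from standard input keeps the conversion and the load in one pipeline, so no intermediate file has to be kept in sync with the original.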

I think a patch from you would be appreciated :-)

Regards
Tino