Thread: moving data from windows to linux

moving data from windows to linux

From
Francois Deliege
Date:
Hi,

I am having problem importing data from a windows xp2 postgresql server
to a linux one.
On the windows server I did: pg_dumpall -U postgres > dumpall.sql

On the linux machine I did:
tr -d '\r' < /data/dumpall.sql | psql -U pgsql postgres

ERROR:  invalid byte sequence for encoding "UTF8": 0x92
HINT:  This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".

As adviced in
http://pgfoundry.org/tracker/?func=detail&atid=538&aid=1000836&group_id=1000125
<http://pgfoundry.org/tracker/?func=detail&atid=538&aid=1000836&group_id=1000125>,
I changed the client_encoding in postgresql.conf and restarted the server.
I also changed it using \encoding in psql.  But I get the same error.

What am I missing ?
Also, would a brutal copy of the base directory do the trick ?

Cheers,

Francois

Attachment

Re: moving data from windows to linux

From
Alan Hodgson
Date:
On Wednesday 25 April 2007 06:29, Francois Deliege <fdeliege@cs.aau.dk>
wrote:
> ERROR:  invalid byte sequence for encoding "UTF8": 0x92
> HINT:  This error can also happen if the byte sequence does not match
> the encoding expected by the server, which is controlled by
> "client_encoding".

Newer versions of PostgreSQL do much stricter checks of UTF-8 data than
older versions did. It is very possible that if you are upgrading
versions that your original data is invalid UTF-8 even if it's in a
UTF-8 database.

And of course, if the source database is SQL-ASCII, then any sort of
invalid data error is possible when importing to a UTF-8 database.

iconv is usually suggested as the tool to correct encoding errors in the
dump file prior to importing to a new database. There are several
threads in the archives that address this issue.

--
"It is a besetting vice of democracies to substitute public opinion for
law." - James Fenimore Cooper


Re: moving data from windows to linux

From
Francois Deliege
Date:
Alan Hodgson wrote:
> On Wednesday 25 April 2007 06:29, Francois Deliege <fdeliege@cs.aau.dk>
> wrote:
>
>> ERROR:  invalid byte sequence for encoding "UTF8": 0x92
>> HINT:  This error can also happen if the byte sequence does not match
>> the encoding expected by the server, which is controlled by
>> "client_encoding".
>>
> Newer versions of PostgreSQL do much stricter checks of UTF-8 data than
> older versions did. It is very possible that if you are upgrading
> versions that your original data is invalid UTF-8 even if it's in a
> UTF-8 database.
>
I am moving from 8.2.4 under Windows to 8.2.4 under FreeBSD 6.2.
The databases are all encoded in UTF8 on the windows server.
The threads I have seen in the archives are all about version or
database encoding issues.
The only trick here, is that I am using tr to solve the endline problem.

Cheers,

Francois


Attachment

Re: moving data from windows to linux

From
Alan Hodgson
Date:
On Wednesday 25 April 2007 08:36, Francois Deliege <fdeliege@cs.aau.dk>
wrote:
> I am moving from 8.2.4 under Windows to 8.2.4 under FreeBSD 6.2.
> The databases are all encoded in UTF8 on the windows server.
> The threads I have seen in the archives are all about version or
> database encoding issues.
> The only trick here, is that I am using tr to solve the endline
> problem.

Is that actually necessary? I'm not seeing anything in the docs about
cr/lf issues in dump files. Or maybe just use pg_dump on the FreeBSD
box to create the dump file.

--
"The true danger is when liberty is nibbled away, for expedients, and by
parts ... the only thing necessary for evil to triumph is for good men
to do nothing." -- Edmund Burke


Re: moving data from windows to linux

From
Francois Deliege
Date:
Alan Hodgson wrote:
> On Wednesday 25 April 2007 08:36, Francois Deliege <fdeliege@cs.aau.dk>
> wrote:
>
>> I am moving from 8.2.4 under Windows to 8.2.4 under FreeBSD 6.2.
>> The databases are all encoded in UTF8 on the windows server.
>> The threads I have seen in the archives are all about version or
>> database encoding issues.
>> The only trick here, is that I am using tr to solve the endline
>> problem.
>>
>
> Is that actually necessary? I'm not seeing anything in the docs about
> cr/lf issues in dump files. Or maybe just use pg_dump on the FreeBSD
> box to create the dump file.
>
It seems so.  I get this error when using directly the dump file from
the windows server.

ERROR:  literal carriage return found in data
HINT:  Use "\r" to represent carriage return.

I have been looking at sed, tr and dos2unix so far, but without
success.  Ken was right, the
'invalid byte sequence for encoding "UTF8": 0x92' error seems to have
been caused by tr removing to many chars.

The FreeBSD and Windows are the same machine, different HD, it makes it
difficult to generate the dump of the Windows XP box from the FreeBSD box.

I have been trying to "look" at the line of the dump file using tail,
but the file seems to large to be handled.

Attachment

Re: moving data from windows to linux

From
Francois Deliege
Date:
Alan Hodgson wrote:
> On Wednesday 25 April 2007 08:36, Francois Deliege <fdeliege@cs.aau.dk>
> wrote:
>
>> I am moving from 8.2.4 under Windows to 8.2.4 under FreeBSD 6.2.
>> The databases are all encoded in UTF8 on the windows server.
>> The threads I have seen in the archives are all about version or
>> database encoding issues.
I think I identified the "moving from Windows to FreeBSD" problem.  In
short:
FreeBSD 6.2 cannot handle files larger than 1.7GB on NTFS partition mounted.

--> Would be nice to enable splits to be handled by pg_dump /
pg_restore.  :-)

Here are some pieces of evidence:

Looking at the data that causes the problem in the music-data dump file:
dimension1# awk '/16862\t8[7-9]/' /mnt/windows/music-data
16862   87      0.99999997673498697     2.7562404582937301e-010
5.3289215505938802e-011 1.63906667677656e-014   3.6428502118254301e-013
1.3056815099222999e-009       9.8685777446191209e-016
1.2298273123735399e-017 1.2144251138520999e-011 4.8726130130392397e-014
1.9123218467089399e-012 4.0612663109915703e-012
2.1516222137270299e-010       4.4099191584683199e-012
2.1392297825760101e-008
16862   88      0.99983678662586195     5.0406110505424003e-005
1.5563266660749601e-006 1.6705707371203099e-007 1.63349359798036e-011
3.48994254

Not a lot of data for 16862 88 compared to 16862 87 !!!  That's what was
causing the problem...
dimension1# sed '/16862\t88/,//d' < /mnt/windows/music-data >
/usr/home/fdeliege/musicdata-ok
dimension1# ls -l musicdata-ok
-rw-r--r--  1 root  fdeliege  1836729357 Apr 26 13:57 musicdata-ok

The size of the file is 1.7GB
A short python script to visualize the data mounted on the windows
partition from my FreeBSD box...
dimension1# python pythonscript.py /mnt/windows/music-data 1836729340 1000
e-011   3.48994254

The same running script running from windows gives me what I expected, a
lot of numbers.  :-)

Now, I am splitting the 150 GB dump file into 1G chunks... time for a beer.

Cheers,

Francois

Attachment