Thread: encoding confusion

From
Sim Zacks
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

[BACKGROUND]
I am testing dbmail for our corporate email solution.

We originally tested it on mysql and now we are migrating it to postgresql.

The messages are stored in a longblob field on mysql and a bytea field
in postgresql.

I set the database up as UTF-8, even though we get emails that are not
UTF-8 encoded, mostly because I didn't know what else to try that would
incorporate all the possible encodings. Three encodings we regularly
receive are UTF-8, Windows-1255, and ISO-8859-8-I.
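To make the problem concrete (the byte sequence below is my own
hypothetical example, not taken from an actual message): the same raw
bytes mean different things under different codecs, and may not be
valid UTF-8 at all:

```python
# Four raw bytes as they might arrive in a Hebrew mail (hypothetical).
raw = b"\xf9\xec\xe5\xed"

# Decoded with the right codec, Windows-1255, they are a Hebrew word.
print(raw.decode("cp1255"))

# Decoded with the wrong codec, e.g. Latin-1, they come out as mojibake.
print(raw.decode("latin-1"))

# Decoded as UTF-8, they are simply invalid - 0xF9 is not a legal
# UTF-8 start byte.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc)
```

So no single "right" encoding exists for the incoming data; only raw
bytes are safe.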

I transferred the entire database except for one table using the
dbi-link contrib module, connecting directly to MySQL and pulling the
data. The table that did not transfer was the messageblks table, which
has the blob field mentioned above.
It would not transfer through the dbi-link, so I wrote a python script
(see below) to read a row from mysql and write a row to postgresql
(using pygresql and mysqldb).
When I used pygresql's escape_bytea function to copy the data, it went
smoothly, but the data was corrupt.
When I tried the escape_string function it died because the data it was
moving was not UTF-8.

I finally got it to work by defining the database as SQL_ASCII; with
that, escape_string worked. After the data was all in place, I pg_dumped
and pg_restored into a UTF-8 database, and surprisingly it works now.
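One way to picture why the SQL_ASCII detour helped (a rough model
only, not PostgreSQL internals): a UTF-8 database validates incoming
text values against UTF-8, while SQL_ASCII performs no validation and
lets any byte sequence through:

```python
def utf8_db_accepts(value: bytes) -> bool:
    """Rough model: a UTF-8 database validates text input as UTF-8."""
    try:
        value.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def sql_ascii_db_accepts(value: bytes) -> bool:
    """Rough model: SQL_ASCII does no validation; any bytes pass."""
    return True

hebrew = b"\xf9\xec\xe5\xed"            # Windows-1255 text, not UTF-8
print(utf8_db_accepts(hebrew))          # False: direct transfer failed
print(sql_ascii_db_accepts(hebrew))     # True: SQL_ASCII workaround held
```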

[CONFUSION]
What I don't understand is this: if the database can't handle non-UTF-8
characters, how does it allow them in when I receive an email (tested,
it works) and when I restored the backup?
I also don't understand why the data transfer didn't work into a UTF-8
database but did work into an ASCII database, when the data can
evidently go from an ASCII database into a UTF-8 one.

Lastly, I wanted to know if anybody has experience moving data from
mysql to postgresql and if what I did is considered normal, or if there
is a better way of doing this.

Thank you
Sim

[Python script]:
import MySQLdb
import MySQLdb.cursors
import pg

pgdb = pg.connect(host="1.2.3.4", user="username", dbname="dbmail",
                  passwd="password")
mydb = MySQLdb.connect(host="localhost", user="mysqluser",
                       passwd="mysqlpassword", db="dbmail",
                       cursorclass=MySQLdb.cursors.SSDictCursor,
                       charset="utf8")
mycrs = mydb.cursor()
mycrs.execute("select * from dbmail_messageblks")
while 1:
    nextrow = mycrs.fetchone()
    if nextrow is None:   # no more rows
        break
    # escape the blob so it can be embedded in the SQL string
    nextrow["messageblk"] = pg.escape_string(nextrow["messageblk"])
    pgdb.query("insert into dbmail_messageblks"
               "(messageblk,physmessage_id,is_header,"
               "messageblk_idnr,blocksize) "
               "values($field$%(messageblk)s$field$,%(physmessage_id)s,"
               "%(is_header)s,%(messageblk_idnr)s,%(blocksize)s)"
               % nextrow)


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.8 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkhOfUQACgkQjDX6szCBa+oyJwCghkG7XpcrHKY7ybeJgvjPA/XM
qLIAoMS4gOWpP4o2lKrRU2v0IdtaiSQj
=OhYp
-----END PGP SIGNATURE-----

Re: encoding confusion

From
Richard Huxton
Date:
Sim Zacks wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> {BACKGROUND]
> I am testing dbmail for our corporate email solution.
>
> We originally tested it on mysql and now we are migrating it to postgresql.
>
> The messages are stored in a longblob field on mysql and a bytea field
> in postgresql.
>
> I set the database up as UTF-8

Not relevant if you're using bytea. Encoding is for character-based
types (varchar, text) not byte-based types.

[snip]
> When I used pygresql's escape_bytea function to copy the data, it went
> smoothly, but the data was corrupt.
> When I tried the escape_string function it died because the data it was
> moving was not UTF-8.

Your Python script seems to think it's dealing with text rather than a
stream of bytes too. I'm not a Python programmer, but I'd
guess it's treating one of the database fields (either MySQL or
PostgreSQL) as text not bytes. You'll need to check the docs for
binary-data handling in your Python libraries.
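For comparison, this is roughly what binary-safe escaping produces. The
function below is my own simplified sketch of PostgreSQL's classic
"escape" bytea format (octal escapes for non-printable bytes), not
pygresql's actual implementation:

```python
def escape_bytea_classic(data: bytes) -> str:
    """Sketch of the classic bytea escape format: backslashes are
    doubled and non-printable bytes become \\ooo octal escapes."""
    out = []
    for b in data:
        if b == 0x5C:                # backslash
            out.append("\\\\")
        elif 0x20 <= b <= 0x7E:      # printable ASCII passes through
            out.append(chr(b))
        else:                        # everything else -> octal escape
            out.append("\\%03o" % b)
    return "".join(out)

print(escape_bytea_classic(b"a\n\xf9"))   # a\012\371
```

Note the escaped string may need further quoting (e.g. doubled
backslashes) depending on the server's string-literal settings; the
point is that every byte survives, whatever encoding it came from.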

I'm puzzled as to how the data was corrupted with escape_bytea() - I
can't imagine it would be that difficult for the library to get right.
I'd be suspicious that the source data wasn't what I thought it was.

> [CONFUSION]
> What I don't understand, is that if that database can't handle the non
> UTF characters, how does it allow them in when I receive an email
> (tested, it works) and when I restored the backup.
> I also don't understand why the data transfer didn't work to a UTF
> database, but it worked to an ASCII database, if the data can go into a
> UTF database from an ascii database.

Whatever is going on, it's nothing to do with the bytea type.

> Lastly, I wanted to know if anybody has experience moving data from
> mysql to postgresql and if what I did is considered normal, or if there
> is a better way of doing this.

I think that something in the process is trying to be clever and
treating the blobs as text.

--
   Richard Huxton
   Archonet Ltd

Re: encoding confusion

From
"Albe Laurenz"
Date:
Sim Zacks wrote:
> We originally tested it on mysql and now we are migrating it
> to postgresql.
>
> The messages are stored in a longblob field on mysql and a bytea field
> in postgresql.
>
> I set the database up as UTF-8, even though we get emails that are not
> UTF encoded, mostly because I didn't know what else to try that would
> incorporate all the possible encodings. Examples of 3 encodings we
> regularly receive are: UTF-8, Windows-1255, ISO-8859-8-I.

[...]

> It would not transfer through the dbi-link, so I wrote a python script
> (see below) to read a row from mysql and write a row to postgresql
> (using pygresql and mysqldb).
> When I used pygresql's escape_bytea function to copy the data, it went
> smoothly, but the data was corrupt.
> When I tried the escape_string function it died because the data it was
> moving was not UTF-8.
>
> I finally got it to work by defining a database as SQL-ASCII and then
> using escape_string worked. After the data was all in place, I pg_dumped
> and pg_restored into a UTF-8 database and it surprisingly works now.

It's very difficult to know what exactly happened unless you can show
an example byte sequence that illustrates what you describe:
how it looked in MySQL, how it looked in your Python script, and what
you fed to escape_bytea.

What client encoding did you use in your Python script?

Yours,
Laurenz Albe

Re: encoding confusion

From
Sim Zacks
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

The data in the longblob field might be text, which could be causing the
confusion. For example, when I look at the data in the longblob field, I
see \n for a newline, and when I look at the bytea it is 012.

I can only tell you what happened in the client end, in terms of
corruption. I am using the Thunderbird client. When I clicked on a
message, it didn't show the data and when I looked at the headers, it
was just a big mess. I'm guessing that somehow the newlines didn't work
and the headers and message were overlaid on top of each other.

Richard Huxton wrote:
> Your Python script seems to think it's dealing with text
> rather than a stream of bytes too. I'm not a Python programmer, but I'd
> guess it's treating one of the database fields (either MySQL or
> PostgreSQL) as text not bytes. You'll need to check the docs for
> binary-data handling in your Python libraries.
>
> I'm puzzled as to how the data was corrupted with escape_bytea() - I
> can't imagine it would be that difficult for the library to get right.
> I'd be suspicious that the source data wasn't what I thought it was.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.8 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkhPZD8ACgkQjDX6szCBa+pKkwCfXwzX9iZJl6OS7gji9WXczfy7
T7UAn0sCr+Ut5lbZzXQizC8V3P/Irwrh
=/9bo
-----END PGP SIGNATURE-----

Re: encoding confusion

From
Richard Huxton
Date:
Sim Zacks wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> The data in the longblob field might be text, which could be causing the
> confusion. For example, when I look at the data in the longblob field, I
> see \n for a newline and when I look at the bytea it is 012.

That's right - newline is ASCII 10 (or 12 in octal).
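The arithmetic, spelled out:

```python
print(ord("\n"))           # 10: newline as a decimal byte value
print("%03o" % ord("\n"))  # 012: the same byte in octal, as bytea shows it
print("%03o" % ord("\r"))  # 015: carriage return in octal
```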

> I can only tell you what happened in the client end, in terms of
> corruption. I am using the Thunderbird client. When I clicked on a
> message, it didn't show the data and when I looked at the headers, it
> was just a big mess. I'm guessing that somehow the newlines didn't work
> and the headers and message were overlaid on top of each other.

Well, that might be a problem with dbmail's setup rather than the
database. I think headers are restricted to ASCII only (the body is a
different matter). The best bet is to be certain whether the database is
to blame.

Find a problem entry, dump that one row to a file from MySQL, do the
same from PostgreSQL and also from the midpoint in your Python code
doing the transfer. Then use a hex editor / dumper (e.g. "hexdump -C" on
linux) to see what bytes differ in the files.
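A hypothetical helper along those lines (the function name is my own;
`hexdump -C` or any byte-comparison tool works just as well):

```python
def first_difference(path_a, path_b):
    """Return (offset, byte_a, byte_b) for the first byte at which the
    two files differ, or None if they are identical. A length mismatch
    is reported at the offset where the shorter file ends."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.read(), fb.read()
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i, x, y
    if len(a) != len(b):
        return min(len(a), len(b)), None, None
    return None
```

Running it on the MySQL dump, the mid-transfer Python copy, and the
PostgreSQL dump of the same row would pinpoint where the bytes change.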

--
   Richard Huxton
   Archonet Ltd

Re: encoding confusion

From
Alban Hertroys
Date:
On Jun 11, 2008, at 9:03 AM, Richard Huxton wrote:

> Sim Zacks wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>> The data in the longblob field might be text, which could be
>> causing the
>> confusion. For example, when I look at the data in the longblob
>> field, I
>> see \n for a newline and when I look at the bytea it is 012.
>
> That's right - newline is ASCII 10 (or 12 in octal).

Just off the top of my head, but aren't mail headers supposed to be
terminated by \r\n? In that case you're missing a byte 13 (or 015
octal) before the 012 byte. That could explain your missing line breaks.
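A quick way to test that theory on a raw message block (a sketch;
`line_ending_counts` is a name I made up):

```python
def line_ending_counts(blob: bytes):
    """Count CRLF pairs versus bare-LF line endings in a raw block."""
    crlf = blob.count(b"\r\n")
    bare_lf = blob.count(b"\n") - crlf
    return crlf, bare_lf

# A header block that lost its \r bytes would come back as (0, n).
print(line_ending_counts(b"Subject: hi\r\nTo: you\r\n\r\nbody\n"))  # (3, 1)
```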

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.

