Thread: encoding confusion
[BACKGROUND]

I am testing dbmail for our corporate email solution.

We originally tested it on MySQL and now we are migrating it to PostgreSQL. The messages are stored in a longblob field on MySQL and a bytea field in PostgreSQL.

I set the database up as UTF-8, even though we get emails that are not UTF encoded, mostly because I didn't know what else to try that would cover all the possible encodings. Examples of 3 encodings we regularly receive are: UTF-8, Windows-1255 and ISO-8859-8-I.

I transferred the entire database except for one table using the dbi-link contrib module, connecting directly to MySQL and pulling the data. The table that did not transfer was the messageblks table, which has the field mentioned above. It would not transfer through dbi-link, so I wrote a Python script (see below) to read a row from MySQL and write a row to PostgreSQL (using PyGreSQL and MySQLdb).

When I used PyGreSQL's escape_bytea function to copy the data, it went smoothly, but the data was corrupt. When I tried the escape_string function, it died because the data it was moving was not UTF-8. I finally got it to work by defining a database as SQL_ASCII; escape_string worked there. After the data was all in place, I pg_dumped and pg_restored into a UTF-8 database, and it surprisingly works now.

[CONFUSION]

What I don't understand is this: if the database can't handle the non-UTF characters, how does it allow them in when I receive an email (tested, it works) and when I restored the backup? I also don't understand why the data transfer didn't work to a UTF database but did work to an ASCII database, if the data can then go into a UTF database from an ASCII database.

Lastly, I wanted to know if anybody has experience moving data from MySQL to PostgreSQL, and whether what I did is considered normal or if there is a better way of doing this.
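The encoding mismatch described above is easy to reproduce: the same byte sequence can be valid Windows-1255 but invalid UTF-8, which is why a text-escaping path "dies" under a UTF-8 database but not under SQL_ASCII. A minimal sketch (the sample bytes are illustrative, not taken from the poster's data):

```python
# Bytes for the Hebrew word "shalom" encoded as Windows-1255 (illustrative sample)
data = b"\xf9\xec\xe5\xed"

# Decodes cleanly under the encoding the sender actually used...
print(data.decode("cp1255"))  # -> 'שלום'

# ...but the very same bytes are rejected as UTF-8, which is why a
# text-based escape function fails when the target enforces UTF-8.
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc.reason)
```

SQL_ASCII avoids the error only because it performs no validation at all; the bytes pass through uninterpreted.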
Thank you

Sim

[Python script]:

import MySQLdb
from MySQLdb.cursors import *
import pg
import sys

pgdb = pg.connect(host="1.2.3.4", user="username", dbname="dbmail",
                  passwd="password")
mydb = MySQLdb.connect(host="localhost", user="mysqluser", passwd="mysqlpassword",
                       db="dbmail", cursorclass=MySQLdb.cursors.SSDictCursor,
                       charset="utf8")
mycrs = mydb.cursor()
mycrs.execute("select * from dbmail_messageblks")
while 1:
    nextrow = mycrs.fetchone()
    if nextrow is None:  # fetchone() returns None when the result set is exhausted
        break
    nextrow["messageblk"] = pg.escape_string(nextrow["messageblk"])
    pgdb.query("insert into dbmail_messageblks"
               "(messageblk,physmessage_id,is_header,messageblk_idnr,blocksize)"
               " values($field$%(messageblk)s$field$,%(physmessage_id)s,"
               "%(is_header)s,%(messageblk_idnr)s,%(blocksize)s)" % nextrow)
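The pitfall in the script above is interpolating raw message bytes into the SQL string, which forces a text interpretation on them. Driver-level parameter binding keeps the data as bytes end to end. A minimal sketch of that approach, using the stdlib sqlite3 driver as a stand-in for the real MySQL and PostgreSQL drivers (table and column names borrowed from the thread):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table dbmail_messageblks (messageblk blob)")

# Message bytes that are not valid UTF-8 (illustrative sample)
raw = b"Subject: test\r\n\r\n\xf9\xec\xe5\xed"

# The driver binds the bytes directly: no string escaping, no encoding step
conn.execute("insert into dbmail_messageblks (messageblk) values (?)", (raw,))

stored = conn.execute("select messageblk from dbmail_messageblks").fetchone()[0]
assert stored == raw  # round-trips byte-for-byte
```

With PyGreSQL and MySQLdb the exact binding API differs, but the principle is the same: hand the driver a bytes value and let it do the quoting.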
Sim Zacks wrote:
> [BACKGROUND]
> I am testing dbmail for our corporate email solution.
>
> We originally tested it on mysql and now we are migrating it to postgresql.
>
> The messages are stored in a longblob field on mysql and a bytea field
> in postgresql.
>
> I set the database up as UTF-8

Not relevant if you're using bytea. Encoding is for character-based types (varchar, text), not byte-based types.

[snip]

> When I used pygresql's escape_bytea function to copy the data, it went
> smoothly, but the data was corrupt.
> When I tried the escape_string function it died because the data it was
> moving was not UTF-8.

Your Python script seems to think it's dealing with text rather than a stream of bytes too. I'm not a Python programmer, but I'd guess it's treating one of the database fields (either MySQL or PostgreSQL) as text, not bytes. You'll need to check the docs for binary-data handling in your Python libraries.

I'm puzzled as to how the data was corrupted with escape_bytea() - I can't imagine it would be that difficult for the library to get right. I'd be suspicious that the source data wasn't what I thought it was.

> [CONFUSION]
> What I don't understand, is that if that database can't handle the non
> UTF characters, how does it allow them in when I receive an email
> (tested, it works) and when I restored the backup.
> I also don't understand why the data transfer didn't work to a UTF
> database, but it worked to an ASCII database, if the data can go into a
> UTF database from an ascii database.

Whatever is going on, it's nothing to do with the bytea type.

> Lastly, I wanted to know if anybody has experience moving data from
> mysql to postgresql and if what I did is considered normal, or if there
> is a better way of doing this.

I think that something in the process is trying to be clever and treating the blobs as text.

--
Richard Huxton
Archonet Ltd
Sim Zacks wrote:
> We originally tested it on mysql and now we are migrating it
> to postgresql.
>
> The messages are stored in a longblob field on mysql and a bytea field
> in postgresql.
>
> I set the database up as UTF-8, even though we get emails that are not
> UTF encoded, mostly because I didn't know what else to try that would
> incorporate all the possible encodings. Examples of 3 encodings we
> regularly receive are: UTF-8, Windows-1255, ISO-8859-8-I.
[...]
> It would not transfer through the dbi-link, so I wrote a python script
> (see below) to read a row from mysql and write a row to postgresql
> (using pygresql and mysqldb).
> When I used pygresql's escape_bytea function to copy the data, it went
> smoothly, but the data was corrupt.
> When I tried the escape_string function it died because the data it was
> moving was not UTF-8.
>
> I finally got it to work by defining a database as SQL-ASCII and then
> using escape_string worked. After the data was all in place, I pg_dumped
> and pg_restored into a UTF-8 database and it surprisingly works now.

It's very difficult to know exactly what happened unless you have some examples of a byte sequence that illustrates what you describe: how it looked in MySQL, how it looked in your Python script, and what you fed to escape_bytea.

What client encoding did you use in your Python script?

Yours,
Laurenz Albe
The data in the longblob field might be text, which could be causing the confusion. For example, when I look at the data in the longblob field, I see \n for a newline, and when I look at the bytea it is 012.

I can only tell you what happened on the client end, in terms of corruption. I am using the Thunderbird client. When I clicked on a message, it didn't show the data, and when I looked at the headers, it was just a big mess. I'm guessing that somehow the newlines didn't work and the headers and message were overlaid on top of each other.

Richard Huxton wrote:
> Your Python script seems to think it's dealing with text
> rather than a stream of bytes too. I'm not a Python programmer, but I'd
> guess it's treating one of the database fields (either MySQL or
> PostgreSQL) as text not bytes. You'll need to check the docs for
> binary-data handling in your Python libraries.
>
> I'm puzzled as to how the data was corrupted with escape_bytea() - I
> can't imagine it would be that difficult for the library to get right.
> I'd be suspicious that the source data wasn't what I thought it was.
Sim Zacks wrote:
> The data in the longblob field might be text, which could be causing the
> confusion. For example, when I look at the data in the longblob field, I
> see \n for a newline and when I look at the bytea it is 012.

That's right - newline is ASCII 10 (or 012 in octal).

> I can only tell you what happened in the client end, in terms of
> corruption. I am using the Thunderbird client. When I clicked on a
> message, it didn't show the data, and when I looked at the headers, it
> was just a big mess. I'm guessing that somehow the newlines didn't work
> and the headers and message were overlaid on top of each other.

Well, that might be a problem with dbmail's setup rather than the database. I think headers are restricted to ASCII only (the body is a different matter).

The best bet is to be certain whether the database is to blame. Find a problem entry, dump that one row to a file from MySQL, do the same from PostgreSQL, and also from the midpoint in your Python code doing the transfer. Then use a hex editor / dumper (e.g. "hexdump -C" on Linux) to see which bytes differ in the files.

--
Richard Huxton
Archonet Ltd
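The "\n vs 012" observation is just two renderings of the same byte: MySQL tools show it as an escape sequence, while PostgreSQL's escape bytea format writes non-printable bytes as three-digit octal. A rough sketch of that octal rendering (a simplification for illustration, not PyGreSQL's actual implementation):

```python
def to_bytea_escape(data: bytes) -> str:
    """Render bytes roughly the way PostgreSQL's escape bytea format does."""
    parts = []
    for b in data:
        if b == 0x5C:                      # backslash must be doubled
            parts.append("\\\\")
        elif 0x20 <= b <= 0x7E and b != 0x27:
            parts.append(chr(b))           # printable ASCII passes through
        else:
            parts.append("\\%03o" % b)     # everything else as \ooo octal
    return "".join(parts)

print(to_bytea_escape(b"a\nb"))  # -> a\012b
```

So a newline (byte 10) comes out as \012, which is exactly what Sim sees when looking at the bytea column.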
On Jun 11, 2008, at 9:03 AM, Richard Huxton wrote:
> Sim Zacks wrote:
>> The data in the longblob field might be text, which could be
>> causing the
>> confusion. For example, when I look at the data in the longblob
>> field, I
>> see \n for a newline and when I look at the bytea it is 012.
>
> That's right - newline is ASCII 10 (or 012 in octal).

Just from the top of my head, but aren't mail headers supposed to be terminated by \r\n? In that case you're missing a byte 13 (or 015 octal) before the 012 byte. That could explain your missing line breaks.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.
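Alban's point can be checked exactly the way Richard suggested: dump the bytes and look for a 0d (CR) before each 0a (LF). A small sketch comparing CRLF and bare-LF renderings of the same header fragment (the sample content is illustrative):

```python
# The same header fragment with CRLF line endings and with bare LF
with_crlf = b"Subject: test\r\n\r\nbody\n"
lf_only = b"Subject: test\n\nbody\n"

# Show the bytes as hex, the way a hexdump comparison would
print(with_crlf.hex(" "))
print(lf_only.hex(" "))

assert b"\x0d\x0a" in with_crlf   # CR precedes each LF, as mail headers require
assert b"\x0d" not in lf_only     # the bare-LF version has no CR bytes at all
```

If the PostgreSQL copy of a message shows only 0a bytes where the MySQL copy shows 0d 0a pairs, the transfer stripped the carriage returns, which would produce exactly the garbled-header symptom described.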