Thread: Hex characters in COPY input
Good evening list,

I am taking the output of a MySQL query and saving it into a text file for COPY input into a PostgreSQL database. The query gives me a list of addresses. One of the addresses is in Montreal, and was input using the correct spelling of Montreal, where the e is an accented e. The output ends up in the text file as Montr\xe9al, where the \xe9 is a single character. When I try to copy that into my PostgreSQL table, I get an error "ERROR: invalid byte sequence for encoding "UTF8": 0xe9616c", which makes sense since the hex character has not been sanitized.

Now if I run the output through sed first, and replace the \xe9 character with something, say the word TEST, I end up with MontrTESTal in my input file, and naturally that imports just fine. So this tells me that I can perform a substitution on the hex character. But I have been pulling my hair out trying to figure out how to substitute in a properly escaped representation of the accented e. For instance, this: s/\(\xe9\)/U\&'\1'/g gives me MontrU&'\xe9'al in my input file, but that just causes a different invalid byte sequence error.

So my question is, how do I sanitize the hex character in the middle of a word to be able to copy in Montreal with an accented e? Or am I going about this at the wrong point?

Thanks,
Melvin
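For anyone reproducing this, a quick way to see what the file actually contains before blaming COPY is to look at the raw bytes (a minimal sketch, assuming a Unix-like shell with file(1) and xxd available; addresses.txt is a stand-in for the real COPY input file):

    # file(1) guesses the encoding; a Latin-1 file is typically reported as
    # "ISO-8859 text" rather than "UTF-8 Unicode text".
    file addresses.txt

    # Dump the bytes of the offending line. In Latin-1, é is the single byte
    # 0xe9; in UTF-8 it would be the two-byte sequence 0xc3 0xa9.
    grep Montr addresses.txt | xxd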
On Thu, Feb 26, 2015 at 9:50 PM, Melvin Call <melvincall979@gmail.com> wrote:
> Montreal where the e is an accented e. The output ends up in the text file as
> Montr\xe9al, where the xe9 is a single character. When I try to copy that into
> my PostgreSQL table, I get an error "ERROR: invalid byte sequence for encoding
Character code E9 is not UTF8. Don't tell Postgres you're importing UTF8 if you're not.
On 2/26/15, Vick Khera <vivek@khera.org> wrote:
> On Thu, Feb 26, 2015 at 9:50 PM, Melvin Call <melvincall979@gmail.com> wrote:
>
>> Montreal where the e is an accented e. The output ends up in the text file as
>> Montr\xe9al, where the xe9 is a single character. When I try to copy that into
>> my PostgreSQL table, I get an error "ERROR: invalid byte sequence for encoding
>
> Character code E9 is not UTF8. Don't tell Postgres you're importing UTF8 if
> you're not.

Thank you Vick, adding the ENCODING 'latin1' option to the COPY command worked
perfectly.

If you don't mind a follow-up to your reply, I have tried to understand the
different character sets and collations, but I guess I still have a lot to
learn. Your suggestion did not even come close to crossing my mind because the
MySQL table and database are encoded in UTF8. I assume the conversion to latin1
happened because I was putting the MySQL query output into a locally stored
text file? Regardless, can you point me to some reading that would have clued
me in that e9 is not a UTF8 character? Or is the clue the fact that it was not
preceded by 0x00?

Regards,
Melvin
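The fix, spelled out (a minimal sketch; the database name mydb, the table addresses, and the file paths are hypothetical — the point is that COPY's ENCODING option describes the input file, independently of the database's own UTF8 encoding):

    # Server-side COPY; the file path is read by the server process.
    psql -d mydb -c "COPY addresses FROM '/tmp/addresses.txt' (FORMAT text, ENCODING 'latin1')"

    # Or client-side via psql's \copy, which reads the file on the local
    # machine and accepts the same option list.
    psql -d mydb -c "\copy addresses FROM 'addresses.txt' WITH (FORMAT text, ENCODING 'latin1')"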
On 2/27/15, Adam Hooper <adam@adamhooper.com> wrote:
> On Thu, Feb 26, 2015 at 9:50 PM, Melvin Call <melvincall979@gmail.com> wrote:
>
>> So my question is, how do I sanitize the hex character in the middle of a word
>> to be able to copy in Montreal with an accented e? Or am I going about this at
>> the wrong point?
>
> Hi Melvin,
>
> This is not a Postgres problem, and it is not a regex problem. So yes,
> you're going about it at the wrong point: you're trying to fix the text one
> character at a time, but the real problem is the encoding of the whole file.
> Text replacement cannot do what you want it to do.
>
> If you're on Linux or Mac, iconv will work -- for instance, `iconv
> --from-code=windows-1252 --to-code=utf-8 < input-file.txt >
> output-file.txt`
>
> Otherwise, you can use a text editor. Be sure to open the file
> properly (such that é appears) and then save it as utf-8.
>
> Alternatively, you could tell Postgres to use your existing encoding
> -- judging from the \xe9, any of "windows-1252", "iso-8859-15" or
> "iso-8859-1" will work. But I always prefer my data to be stored as
> "utf-8", and you should, too.
>
> Read up on character sets here:
> http://www.joelonsoftware.com/articles/Unicode.html
>
> Enjoy life,
> Adam

Thank you Adam. I was able to make this work by adding the ENCODING 'latin1'
option to the COPY command per Vick's suggestion, and as you correctly pointed
out as well. However, iconv would probably do the trick too, now that I know
where the problem actually lies. I failed to realize that I was not dealing
with UTF8 because the MySQL data is encoded in UTF8, but you saw what I wasn't
seeing.

Your suggested reading is also most appreciated. Maybe one of these days I will
actually make sense of this encoding issue. Thanks for the link.

Regards,
Melvin
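The iconv route, end to end (a sketch, assuming GNU iconv and the same hypothetical file and table names as above):

    # Convert the file from Latin-1 to UTF-8; the é byte 0xe9 becomes 0xc3 0xa9.
    iconv --from-code=latin1 --to-code=utf-8 < addresses.txt > addresses.utf8.txt

    # The converted file then loads without an ENCODING override, assuming the
    # database and client_encoding are UTF8.
    psql -d mydb -c "\copy addresses FROM 'addresses.utf8.txt'"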
On 02/27/2015 06:39 AM, Melvin Call wrote:
> On 2/26/15, Vick Khera <vivek@khera.org> wrote:
>> On Thu, Feb 26, 2015 at 9:50 PM, Melvin Call <melvincall979@gmail.com> wrote:
>>
>>> Montreal where the e is an accented e. The output ends up in the text file as
>>> Montr\xe9al, where the xe9 is a single character. When I try to copy that into
>>> my PostgreSQL table, I get an error "ERROR: invalid byte sequence for encoding
>>
>> Character code E9 is not UTF8. Don't tell Postgres you're importing UTF8 if
>> you're not.
>
> Thank you Vick, adding the ENCODING 'latin1' option to the COPY command worked
> perfectly.
>
> If you don't mind a follow-up to your reply, I have tried to understand the
> different character sets and collations, but I guess I still have a lot to
> learn. Your suggestion did not even come close to crossing my mind because the
> MySQL table and database are encoded in UTF8. I assume the conversion to latin1
> happened because I was putting the MySQL query output into a locally stored
> text file? Regardless, can you point me to some reading that would have clued
> me in that e9 is not a UTF8 character? Or is the clue the fact that it was not
> preceded by 0x00?

For UTF8 characters see here:

http://www.utf8-chartable.de/

For the MySQL part, can you detail how you got the data out?

> Regards,
> Melvin

--
Adrian Klaver
adrian.klaver@aklaver.com
On Fri, Feb 27, 2015 at 8:58 AM, Adam Hooper <adam@adamhooper.com> wrote:
> On Fri, Feb 27, 2015 at 9:39 AM, Melvin Call <melvincall979@gmail.com> wrote:
>> On 2/26/15, Vick Khera <vivek@khera.org> wrote:
>
>>> Character code E9 is not UTF8. Don't tell Postgres you're importing UTF8 if
>>> you're not.
>>>
>> Thank you Vick, adding the ENCODING 'latin1' option to the COPY command worked
>> perfectly.
>>
>> Your suggestion did not even come close to crossing my mind because the
>> MySQL table and database are encoded in UTF8. I assume the conversion to latin1
>> happened because I was putting the MySQL query output into a locally stored
>> text file?
>
> The reality is confusing: when moving text around, every step of the
> process involves a character set conversion.
>
> I'm guessing your MySQL client is defaulting to character_set_client =
> 'iso-8859-1' or some-such. Depending on your client, that could be
> because your _terminal_ is set to iso-8859-1 encoding. (If you're on
> Unix, type `locale` and if you don't see lots of "UTF-8"s your
> terminal probably isn't using UTF-8.)

Ah, that makes perfect sense.

> But really, there are so many variables it's only an issue if you're
> trying to change the way the MySQL client is behaving. And I don't
> think this is the list for that.

Agreed. I didn't realize that the MySQL client would be the issue, but I think
you've hit it. This is in preparation for moving away from MySQL so no need to
pursue it much further. I just need to get the information out once, and y'all
have helped me get there. Thanks!

> MySQL's encoding logic is written up here:
> http://dev.mysql.com/doc/refman/5.7/en/charset-connection.html. (The
> way the data is stored is in a completely separate web page:
> http://dev.mysql.com/doc/refman/5.7/en/charset-applications.html ...
> which only has a tiny hint at the bottom of the page about UTF-8 in
> the MySQL command-line client.)
>
> Enjoy life,
> Adam
>
> --
> Adam Hooper
> +1-613-986-3339
> http://adamhooper.com
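Adam's point about the client is easy to check from the shell (a sketch; locale and the mysql client options are standard, but the database name mydb, table, and query are placeholders):

    # If LANG/LC_CTYPE don't end in .UTF-8, the terminal (and clients that
    # follow it) are probably not using UTF-8.
    locale

    # Force the mysql client to use UTF-8 regardless of the terminal, so the
    # redirected output file is written as UTF-8 rather than latin1.
    # -N suppresses the column-name header row.
    mysql --default-character-set=utf8 -N -e "SELECT city FROM addresses" mydb > addresses.txt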
On Fri, Feb 27, 2015 at 9:03 AM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
> On 02/27/2015 06:39 AM, Melvin Call wrote:
>> Regardless, can you point me to some reading that would have clued
>> me in that e9 is not a UTF8 character? Or is the clue the fact that it was not
>> preceded by 0x00?
>
> For UTF8 characters see here:
>
> http://www.utf8-chartable.de/

Thank you for the link. Bookmarked.

> For the MySQL part, can you detail how you got the data out?

This is in preparation for moving away from MySQL. I inherited this MySQL
database and it is in horrible shape: no referential integrity, no constraints
other than arbitrarily chosen VARCHAR lengths, no indexes, and inconsistent
entity and attribute naming. I have to pull the fields out through a query that
is redirected to a local file, as opposed to a dump, because I'm having to
filter out a lot of useless rows (TestCity in a production system!?). I just
realized I could have put the usable data into similar tables and then used
mysqldump with the encoding specified, but no need now. I have my extraction
and import done, so this should be the last I need to touch the MySQL system.

>> Regards,
>> Melvin
>
> --
> Adrian Klaver
> adrian.klaver@aklaver.com
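The mysqldump variant mentioned above might look something like this (a sketch; the cleaned-up table name addresses_clean and database mydb are placeholders, and --no-create-info limits the dump to data):

    # Dump only the filtered/usable table, forcing the dump file to UTF-8
    # regardless of the server or session defaults.
    mysqldump --default-character-set=utf8 --no-create-info mydb addresses_clean > addresses_dump.sql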
On 02/27/2015 07:55 AM, Melvin Call wrote:
> On Fri, Feb 27, 2015 at 9:03 AM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>> For the MySQL part, can you detail how you got the data out?
>
> This is in preparation for moving away from MySQL. I inherited this MySQL
> database and it is in horrible shape: no referential integrity, no constraints
> other than arbitrarily chosen VARCHAR lengths, no indexes, and inconsistent
> entity and attribute naming. I have to pull the fields out through a query that
> is redirected to a local file, as opposed to a dump, because I'm having to
> filter out a lot of useless rows (TestCity in a production system!?). I just
> realized I could have put the usable data into similar tables and then used
> mysqldump with the encoding specified, but no need now. I have my extraction
> and import done, so this should be the last I need to touch the MySQL system.

Gotcha, I recently did something similar.

--
Adrian Klaver
adrian.klaver@aklaver.com
On Fri, Feb 27, 2015 at 9:39 AM, Melvin Call <melvincall979@gmail.com> wrote:
> On 2/26/15, Vick Khera <vivek@khera.org> wrote:
>> On Thu, Feb 26, 2015 at 9:50 PM, Melvin Call <melvincall979@gmail.com> wrote:
>>
>>> I get an error "ERROR: invalid byte sequence for
>>> encoding "UTF8": 0xe9616c"
>>
>> Character code E9 is not UTF8. Don't tell Postgres you're importing UTF8 if
>> you're not.
>
> Regardless, can you point me to some reading that would have clued
> me in that e9 is not a UTF8 character? Or is the clue the fact that it was not
> preceded by 0x00?

The error message Postgres gave you is a pretty darned good clue :).

But your question has an error, and it's worth expounding a bit. 0xe9 is not a
UTF8 character because 0xe9 is not a _character_. 0xe9 is a _byte_. Characters
are not bytes. Characters can be _encoded_ into bytes, and that's not the same
thing.

UTF-8 encoding is a bit confusing: any byte in the range [0x00-0x7f] represents
the same character as in ASCII encoding (an extremely popular encoding). Bytes
in the range [0xc0-0xff] are "leading" bytes that start a multi-byte sequence
representing a single character, and "continuation" bytes are in the range
[0x80-0xbf]. (Why not the full range [0x00-0xff]? Because UTF-8 was designed to
generate errors when fed non-UTF8 byte sequences.) The number of leading 1 bits
in the leading byte tells you how many bytes make up the sequence. If you care
to read up on the how and why of UTF-8 (a red herring in this discussion), see
http://en.wikipedia.org/wiki/UTF-8

Back to 0xe9. 0xe9 is '1110 1001' in binary. Postgres' UTF-8 decoder sees that
initial '1110' and determines that it needs to inspect three bytes to read one
character. The second byte is 0x61, which is not in the range [0x80-0xbf], so
Postgres reports an invalid byte sequence. Hooray: it produces exactly the
error message it should.

You don't need to concern yourself with the complications of UTF-8. You only
need to know that bytes are not characters; if you don't know the encoding of a
sequence of bytes, you've made a logic error. Postgres told you the error,
though it didn't hint at how to fix it. (Text editors and web browsers use
heuristics to guess file encodings, and they're usually right, though it's
impossible to create a perfect heuristic. See
http://stackoverflow.com/questions/4198804/how-to-reliably-guess-the-encoding-between-macroman-cp1252-latin1-utf-8-and
for further discussion.)

If you're looking for take-away lessons, the main one is: "read the error
message" :).

Next time you have the "wrong encoding" problem, you have two options: 1)
figure out the encoding and tell Postgres; or 2) regenerate the file in the
correct encoding (UTF-8). The "why" is here:
http://www.joelonsoftware.com/articles/Unicode.html

We on this list jumped straight to option 1. We've memorized 0xe9 in
particular, because we've been through your pain before. In the Americas and
Western Europe, if a file contains the byte 0xe9 it probably contains the
character "é" encoded as windows-1252/ISO-8859-1/ISO-8859-15. That's very
common. MySQL in particular is a ghastly Internet denizen, in that it defaults
to latin1 (ISO-8859-1) in an apparent crusade against globalization and modern
standards.

Enjoy life,
Adam

--
Adam Hooper
+1-613-986-3339
http://adamhooper.com
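Adam's byte-level walk-through can be reproduced at a shell prompt (a sketch, assuming bash's printf along with xxd and iconv):

    # The Latin-1 bytes for "Montréal": é is the single byte 0xe9.
    printf 'Montr\xe9al' | xxd

    # The same text transcoded to UTF-8: é becomes the two bytes 0xc3 0xa9.
    printf 'Montr\xe9al' | iconv -f latin1 -t utf-8 | xxd

    # Why Postgres rejected 0xe9616c: 0xe9 is 1110 1001, so a UTF-8 decoder
    # expects two continuation bytes in the range 0x80-0xbf to follow, but the
    # next byte is 0x61 ('a') -- hence "invalid byte sequence for encoding UTF8".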
On Fri, Feb 27, 2015 at 12:02 PM, Adam Hooper <adam@adamhooper.com> wrote:
> On Fri, Feb 27, 2015 at 9:39 AM, Melvin Call <melvincall979@gmail.com> wrote:
>> Regardless, can you point me to some reading that would have clued
>> me in that e9 is not a UTF8 character? Or is the clue the fact that it was not
>> preceded by 0x00?
>
> The error message Postgres gave you is a pretty darned good clue :).
>
> But your question has an error, and it's worth expounding a bit. 0xe9 is not a
> UTF8 character because 0xe9 is not a _character_. 0xe9 is a _byte_. Characters
> are not bytes. Characters can be _encoded_ into bytes, and that's not the same
> thing.
>
> [snip]
>
> Next time you have the "wrong encoding" problem, you have two options: 1)
> figure out the encoding and tell Postgres; or 2) regenerate the file in the
> correct encoding (UTF-8). The "why" is here:
> http://www.joelonsoftware.com/articles/Unicode.html

Timing is everything: I was just starting to read this, in fact. Thank you for
taking the time to provide this information. It is proving to be most useful.

And yes, I did misspeak. One thing I have learned from all of this is that more
than one byte is required to represent the majority of characters out there. So
yeah, I did know that e9 was a byte, but I appreciate you re-iterating that for
me. Hoping to represent each character with a single byte is an artifact of
most early computer development being dominated by English-speaking persons way
back when, eh?
I must say, at least as far as the MySQL output goes, I like option 1. It was
easy enough to find and fix that way once I knew what to look for. The MySQL
documentation is sometimes difficult to wade through. But no documentation is
ever perfect, is it? One of the things that has endeared me to PostgreSQL is
the fact that it seems to strive for global usefulness while remaining
standards-compliant. Or are those one and the same?

Regards,
Melvin

> We on this list jumped straight to option 1. We've memorized 0xe9 in
> particular, because we've been through your pain before. In the
> Americas and Western Europe, if a file contains the byte 0xe9 it
> probably contains the character "é" encoded as
> windows-1252/ISO-8859-1/ISO-8859-15. That's very common. MySQL in
> particular is a ghastly Internet denizen, in that it defaults to
> latin1 (ISO-8859-1) in an apparent crusade against globalization and
> modern standards.
>
> Enjoy life,
> Adam
>
> --
> Adam Hooper
> +1-613-986-3339
> http://adamhooper.com