Thread: convert string function and built-in conversions
It seems to me that these values should be the same:

select 'lydia eugenia treviño', convert('lydia eugenia treviño' using ascii_to_utf_8);

but they seem to be different. What am I missing?

culley
On Sun, 19 Oct 2003, culley harrelson wrote:
> It seems to me that these values should be the same:
>
> select 'lydia eugenia treviño', convert('lydia eugenia treviño' using
> ascii_to_utf_8);
>
> but they seem to be different. What am I missing?

I don't think the marked n is a valid ASCII character (it might be extended ASCII, but that's different and not really standard afaik). You're probably getting the character associated with the lower 7 bits.
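To make the "lower 7 bits" guess concrete, here is a minimal Python sketch. It assumes the stored byte for ñ is the ISO-8859-1 value 0xF1 and that the conversion simply clears the high bit of each byte; the helper name `mask_to_7_bits` is made up for illustration and is not PostgreSQL code.

```python
def mask_to_7_bits(raw: bytes) -> bytes:
    """Clear the high bit of every byte, i.e. apply byte & 0x7F."""
    return bytes(b & 0x7F for b in raw)


# Assumption: the text was stored as ISO-8859-1, where ñ is the single byte 0xF1.
latin1_bytes = "treviño".encode("latin-1")

masked = mask_to_7_bits(latin1_bytes)

# 0xF1 & 0x7F == 0x71, which is ASCII 'q', so the name comes back munged.
print(masked.decode("ascii"))  # prints "treviqo"
```

Plain 7-bit characters pass through unchanged, which would explain why only the accented letters look wrong.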
It is one of the extended characters in iso-8859-1. This data was taken from a text field in a SQL_ASCII database. Basically what I am trying to do is migrate data from a SQL_ASCII database to a UNICODE database by running all the data through an external script that does something like:

select convert(my_field using ascii_to_utf_8) from my_table;

then inserts the selected text into an identical table in the unicode database. All the data goes across, but extended characters such as ñ are getting munged. The docs indicate that ascii_to_utf_8 is for SQL_ASCII -> UNICODE... Are you saying that ñ isn't really an ASCII character even though it is valid in a SQL_ASCII database? I have found that all extended characters of the various LATIN encodings will work just fine in my SQL_ASCII database.

This project is a big can of worms... Every 6 months I open the can, stir the worms around a bit, wrinkle my nose then promptly close the can again and stuff it away for another 6 months. :) Wish I could figure it out.

On Sun, 19 Oct 2003 00:31:43 -0700 (PDT), "Stephan Szabo" <sszabo@megazone.bigpanda.com> said:
> On Sun, 19 Oct 2003, culley harrelson wrote:
> > It seems to me that these values should be the same:
> >
> > select 'lydia eugenia treviño', convert('lydia eugenia treviño' using
> > ascii_to_utf_8);
> >
> > but they seem to be different. What am I missing?
>
> I don't think the marked n is a valid ascii character (it might be
> extended ascii, but that's different and not really standard afaik).
> You're probably getting the character associated with the lower 7 bits.
On Sun, 19 Oct 2003, culley harrelson wrote:
> It is one of the extended characters in iso-8859-1. This data was taken
> from a text field in a SQL_ASCII database. Basically what I am trying to
> do is migrate data from a SQL_ASCII database to a UNICODE database by
> running all the data through an external script that does something like:
>
> select convert(my_field using ascii_to_utf_8) from my_table;
>
> then inserts the selected text into an identical table in the unicode
> database. All the data goes across, but extended characters such as ñ
> are getting munged. The docs indicate that ascii_to_utf_8 is for
> SQL_ASCII -> UNICODE... Are you saying that ñ isn't really an ASCII
> character even though it is valid in a SQL_ASCII database? I have found
> that all extended characters of the various LATIN encodings will work
> just fine in my SQL_ASCII database.

I would guess that it's not actually forcing/checking the characters for 7 bitness in SQL_ASCII, but that the conversions are treating them as if you had actually only put in valid 7 bit values (as they appear to be doing an & 0x7F in at least the routines I looked at).

If you're actually putting iso-8859-1 (latin1) in there, try the conversion from iso-8859-1 to utf8. It doesn't appear to display properly in my iso-8859-1 terminal, but taking that string and inserting it into a unicode database and then setting my client_encoding to iso-8859-1 gives me the original string back when I select it.
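The byte-level effect of the iso-8859-1 to utf8 suggestion can be sketched outside the database. The Python below is only an illustration under the assumption that the SQL_ASCII column really holds ISO-8859-1 bytes; the function names are hypothetical, and it uses Python's codecs rather than the server's conversion routines.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Reinterpret ISO-8859-1 bytes as text and re-encode them as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


def utf8_to_latin1(raw: bytes) -> bytes:
    """Reverse direction, similar in effect to reading with a latin1 client encoding."""
    return raw.decode("utf-8").encode("latin-1")


# Assumption: this is what the SQL_ASCII database actually stored.
stored = "lydia eugenia treviño".encode("latin-1")

converted = latin1_to_utf8(stored)        # ñ becomes the two bytes 0xC3 0xB1
round_tripped = utf8_to_latin1(converted)

# A latin1 terminal shows the raw UTF-8 bytes as two separate characters,
# which is why the converted string looks wrong until it is converted back.
as_seen_on_latin1_terminal = converted.decode("latin-1")

print(round_tripped == stored)  # prints True
```

The converted value is one byte longer than the original because ñ takes two bytes in UTF-8, and converting back to latin1 recovers the original bytes exactly.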