Thread: Significance of Database Encoding
Hi , I would want to know what is the difference between databases that are created using UNICODE encoding and SQL_ASCII encoding. I have an existing database that has SQL_ASCII encoding but still i am able to store multibyte characters that are not in ASCII character set. for example: tradein_clients=# \l List of databases +-----------------+----------+-----------+ | Name | Owner | Encoding | +-----------------+----------+-----------+ | template0 | postgres | SQL_ASCII | | template1 | postgres | SQL_ASCII | | tradein_clients | tradein | SQL_ASCII | +-----------------+----------+-----------+ tradein_clients=# SELECT * from t_A; +--------------------------------------------------------------------------------------+ | a | +--------------------------------------------------------------------------------------+ | 私はガラス | +--------------------------------------------------------------------------------------+ Above is some japanese character. I have seen some posting regarding migrating databases from SQL_ASCII to UNICODE, given the above observation what significance does a migration have. Regards Rajesh Kumar Mallah. __________________________________ Yahoo! Mail Mobile Take Yahoo! Mail with you! Check email on your mobile phone. http://mobile.yahoo.com/learn/mail
> +--------------------------------------------------------------------------------------+ > | 私はガラス > +--------------------------------------------------------------------------------------+ You say it displays correctly in xterm (ie. you didn't see these in your xterm).There are HTML/XML unicode character entities, probably generated by your mailer from your Unicode cut'n'paste.Using SQL ASCII to store UTF8 encoded data will work, but postgres won't know that it's manipulating multibyte characters, so for instance the length of a string will be its Byte length instead of correctly counting the characters, collation rules will be funky, etc. And substring() may well cut in the middle of an UTF8 multibyte char which will then screw your application side processing...Apart from that, it'll work ;)
--- PFC <lists@boutiquenumerique.com> wrote: > > > +--------------------------------------------------------------------------------------+ > > | 私はガラス > > +--------------------------------------------------------------------------------------+ > > You say it displays correctly in xterm (ie. you didn't see these in your > xterm). > There are HTML/XML unicode character entities, probably generated by your > mailer from your Unicode cut'n'paste. That is correct. Now the question is how to convert from SQL_ASCII to UNICODE. Mailing lists suggests to run recode or iconv on the dump file and restore. The problem is on running iconv with -f US-ASCII the program aborted: $ iconv -f US-ASCII -t UTF-8 < test.sql > out.sql iconv: illegal input sequence at position 114500 Any ideas how the job can be accomplised reliably. Also my database may contain data in multiple encodings like WINDOWS-1251 and WINDOWS-1256 in various places as data has been inserted by different peoples using different sources and client software. Regds Rajesh Kumar Mallah. > Using SQL ASCII to store UTF8 encoded data will work, but postgres won't > know that it's manipulating multibyte characters, so for instance the > length of a string will be its Byte length instead of correctly counting > the characters, collation rules will be funky, etc. And substring() may > well cut in the middle of an UTF8 multibyte char which will then screw > your application side processing... > Apart from that, it'll work ;) > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com
> $ iconv -f US-ASCII -t UTF-8 < test.sql > out.sql > iconv: illegal input sequence at position 114500 > > Any ideas how the job can be accomplised reliably. > > Also my database may contain data in multiple encodings > like WINDOWS-1251 and WINDOWS-1256 in various places > as data has been inserted by different peoples using > different sources and client software. You could use a simple program like that (in Python): output = open( "unidump", "w" ) for line in open( "your dump" ):for encoding in "utf-8", "iso-8859-15", "whatever": try: output.write( unicode(line, encoding ).encode( "utf-8" )) break except UnicodeError: passelse: print "No suitable encodingfor line..." I'd say this might work, if UTF-8 cannot absorb an apostrophe inside a multibit character. Can it ? Or you could do that to all your table using SELECTs but it's going to be painful...
--- PFC <lists@boutiquenumerique.com> wrote: > > > $ iconv -f US-ASCII -t UTF-8 < test.sql > out.sql > > iconv: illegal input sequence at position 114500 > > > > Any ideas how the job can be accomplised reliably. > > > > Also my database may contain data in multiple encodings > > like WINDOWS-1251 and WINDOWS-1256 in various places > > as data has been inserted by different peoples using > > different sources and client software. > > You could use a simple program like that (in Python): > > output = open( "unidump", "w" ) > for line in open( "your dump" ): > for encoding in "utf-8", "iso-8859-15", "whatever": > try: > output.write( unicode( line, encoding ).encode( "utf-8" )) > break > except UnicodeError: > pass > else: > print "No suitable encoding for line..." This may not work . Becuase ,conversion to utf-8 can be successfull (no runtime error) even for an incorrect guess of the original encoding but the result will be an incorrect utf8. Regds Rajesh Kumar Mallah > > I'd say this might work, if UTF-8 cannot absorb an apostrophe inside a > multibit character. Can it ? > > Or you could do that to all your table using SELECTs but it's going to be > painful... > > ---------------------------(end of broadcast)--------------------------- > TIP 7: don't forget to increase your free space map settings > __________________________________ Do you Yahoo!? Read only the mail you want - Yahoo! Mail SpamGuard. http://promotions.yahoo.com/new_mail