Thread: Significance of Database Encoding

Significance of Database Encoding

From

Rajesh Mallah

Date:

15 May 2005, 12:33:46

Hi ,

I would want to know what is the difference between databases
that are created using UNICODE encoding and SQL_ASCII encoding.

I have an existing database that has SQL_ASCII encoding but
still i am able to store multibyte characters that are not
in ASCII character set. for example:

tradein_clients=# \l          List of databases
+-----------------+----------+-----------+
|      Name       |  Owner   | Encoding  |
+-----------------+----------+-----------+
| template0       | postgres | SQL_ASCII |
| template1       | postgres | SQL_ASCII |
| tradein_clients | tradein  | SQL_ASCII |
+-----------------+----------+-----------+

tradein_clients=# SELECT  * from t_A;
+--------------------------------------------------------------------------------------+
|                                    a                                                 |
+--------------------------------------------------------------------------------------+
| 私はガラス                                                                           |
+--------------------------------------------------------------------------------------+

Above is some japanese character.

I have seen some posting regarding migrating databases from
SQL_ASCII to UNICODE, given the above observation what 
significance does a migration have.

Regards

Rajesh Kumar Mallah.





    
__________________________________ 
Yahoo! Mail Mobile 
Take Yahoo! Mail with you! Check email on your mobile phone. 
http://mobile.yahoo.com/learn/mail

Re: Significance of Database Encoding

From

PFC

Date:

15 May 2005, 14:03:22

> +--------------------------------------------------------------------------------------+
> | 私はガラス
> +--------------------------------------------------------------------------------------+
You say it displays correctly in xterm (ie. you didn't see these in your  
xterm).There are HTML/XML unicode character entities, probably generated by your  
mailer from your Unicode cut'n'paste.Using SQL ASCII to store UTF8 encoded data will work, but postgres won't  
know that it's manipulating multibyte characters, so for instance the  
length of a string will be its Byte length instead of correctly counting  
the characters, collation rules will be funky, etc. And substring() may  
well cut in the middle of an UTF8 multibyte char which will then screw  
your application side processing...Apart from that, it'll work ;)

Re: Significance of Database Encoding

From

Rajesh Mallah

Date:

15 May 2005, 15:38:36

--- PFC <lists@boutiquenumerique.com> wrote:
> 
> > +--------------------------------------------------------------------------------------+
> > | 私はガラス
> > +--------------------------------------------------------------------------------------+
> 
>     You say it displays correctly in xterm (ie. you didn't see these in your  
> xterm).
>     There are HTML/XML unicode character entities, probably generated by your  
> mailer from your Unicode cut'n'paste.

That is correct.

Now the question is how to convert from SQL_ASCII to UNICODE. 
Mailing lists suggests to run recode or iconv on the dump file
and restore. The problem is on running iconv with -f US-ASCII
the program aborted:

$ iconv -f US-ASCII -t UTF-8  < test.sql > out.sql
iconv: illegal input sequence at position 114500

Any ideas how the job can be accomplised reliably.

Also my database may contain data in multiple encodings
like WINDOWS-1251 and WINDOWS-1256 in various places
as data has been inserted by different peoples using
different sources and client software.

Regds
Rajesh Kumar Mallah.

>     Using SQL ASCII to store UTF8 encoded data will work, but postgres won't  
> know that it's manipulating multibyte characters, so for instance the  
> length of a string will be its Byte length instead of correctly counting  
> the characters, collation rules will be funky, etc. And substring() may  
> well cut in the middle of an UTF8 multibyte char which will then screw  
> your application side processing...
>     Apart from that, it'll work ;)
> 

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com

Re: Significance of Database Encoding

From

PFC

Date:

15 May 2005, 16:49:04

> $ iconv -f US-ASCII -t UTF-8  < test.sql > out.sql
> iconv: illegal input sequence at position 114500
>
> Any ideas how the job can be accomplised reliably.
>
> Also my database may contain data in multiple encodings
> like WINDOWS-1251 and WINDOWS-1256 in various places
> as data has been inserted by different peoples using
> different sources and client software.
You could use a simple program like that (in Python):

output = open( "unidump", "w" )
for line in open( "your dump" ):for encoding in "utf-8", "iso-8859-15", "whatever":    try:        output.write(
unicode(line, encoding ).encode( "utf-8" ))        break    except UnicodeError:        passelse:    print "No suitable
encodingfor line..."
 
I'd say this might work, if UTF-8 cannot absorb an apostrophe inside a  
multibit character. Can it ?
Or you could do that to all your table using SELECTs but it's going to be  
painful...

Re: Significance of Database Encoding

From

Rajesh Mallah

Date:

15 May 2005, 23:16:50

--- PFC <lists@boutiquenumerique.com> wrote:
> 
> > $ iconv -f US-ASCII -t UTF-8  < test.sql > out.sql
> > iconv: illegal input sequence at position 114500
> >
> > Any ideas how the job can be accomplised reliably.
> >
> > Also my database may contain data in multiple encodings
> > like WINDOWS-1251 and WINDOWS-1256 in various places
> > as data has been inserted by different peoples using
> > different sources and client software.
> 
>     You could use a simple program like that (in Python):
> 
> output = open( "unidump", "w" )
> for line in open( "your dump" ):
>     for encoding in "utf-8", "iso-8859-15", "whatever":
>         try:
>             output.write( unicode( line, encoding ).encode( "utf-8" ))
>             break
>         except UnicodeError:
>             pass
>     else:
>         print "No suitable encoding for line..."


This may not work . Becuase ,conversion to utf-8 can be successfull (no runtime error)
even for an incorrect guess of the original encoding but the  result will be  an 
incorrect utf8. 

Regds
Rajesh Kumar Mallah


> 
>     I'd say this might work, if UTF-8 cannot absorb an apostrophe inside a  
> multibit character. Can it ?
> 
>     Or you could do that to all your table using SELECTs but it's going to be  
> painful...
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
> 

    
__________________________________ 
Do you Yahoo!? 
Read only the mail you want - Yahoo! Mail SpamGuard. 
http://promotions.yahoo.com/new_mail