Thread: Changing character set when the damage is done

Changing character set when the damage is done

From
Alexis Paul Bertolini
Date:
Dear all,

I set up a DB with default values and it now uses the SQL_ASCII 
character set (as per "show client_encoding;").

I have copied in quite a lot of data from various Access databases but 
only now have I realized that all accented vowels show up differently 
from what I expected (I dare not cite some examples a they would 
doubtlessly show up wrong in the mail).

The contents are text written in Italian, so it's just the 5 vowels with 
an acute or grave accent. The user base is all Italian, so a quick and 
dirty trick is perfectly viable.

Any chance of running some smart UPDATE queries to put things right? Or 
elaborating on the fields whilst performing a SELECT query?

Thanks!

Alex.




Re: Changing character set when the damage is done

From
Ragnar
Date:
On lau, 2006-12-23 at 00:12 +0100, Alexis Paul Bertolini wrote:

> I set up a DB with default values and it now uses the SQL_ASCII 
> character set (as per "show client_encoding;").
> 
> I have copied in quite a lot of data from various Access databases but 
> only now have I realized that all accented vowels show up differently 
> from what I expected (I dare not cite some examples a they would 
> doubtlessly show up wrong in the mail).
> 
> The contents are text written in Italian, so it's just the 5 vowels with 
> an acute or grave accent. The user base is all Italian, so a quick and 
> dirty trick is perfectly viable.
> 
> Any chance of running some smart UPDATE queries to put things right? Or 
> elaborating on the fields whilst performing a SELECT query?

depends on whether all imports have been in the same encoding 
or not. SQL_ASCII basically accepts and stores the characters
without interpretation, so if all imorts were done with one
client_encoding, you should be able to read the values again
using the same encoding.

are all the instances of these 5 characters the same everywhere?
in that case there is a good chance that you can fix the values
with UPDATES

when you say 
'all accented vowels show up differently from what I expected'
do you mean in psql or some other tool?
do these show as 1 or 2 characters?

gnari




Re: Changing character set when the damage is done

From
Alexis Paul Bertolini
Date:
Ragnar wrote:
> On lau, 2006-12-23 at 00:12 +0100, Alexis Paul Bertolini wrote:
>   
> depends on whether all imports have been in the same encoding 
> or not. SQL_ASCII basically accepts and stores the characters
> without interpretation, so if all imorts were done with one
> client_encoding, you should be able to read the values again
> using the same encoding.
>
> are all the instances of these 5 characters the same everywhere?
> in that case there is a good chance that you can fix the values
> with UPDATES
>
> when you say 
> 'all accented vowels show up differently from what I expected'
> do you mean in psql or some other tool?
> do these show as 1 or 2 characters?
Thanks for your reply, Gnari,

all imports were done using default values, so the encoding ought to be 
the same. As proof of this, same characters show up in the same way.

They show up in PHP, PgAdminIII and psql. All the same. A lowercase e 
with a grave accent appears as a capital A with the cedilla, followed by 
an umlaut (just the umlaut, on its own). So to answer your question, 
they are two characters.

Oh... and Merry Christmas!

Alex.


Re: Changing character set when the damage is done

From
Michael Fuhr
Date:
On Sun, Dec 24, 2006 at 05:05:37PM +0100, Alexis Paul Bertolini wrote:
> They show up in PHP, PgAdminIII and psql. All the same. A lowercase e 
> with a grave accent appears as a capital A with the cedilla, followed by 
> an umlaut (just the umlaut, on its own). So to answer your question, 
> they are two characters.

Are you sure that's not a tilde (a wavy line above the A) instead
of a cedilla (a hook below the A)?  The UTF-8 encoding for lowercase e
with grave is 0xc3 0xa8, which in ISO-8859-1 (LATIN1) or Windows-1252
is uppercase A with tilde followed by a diaeresis (an umlaut on its
own).  Does the data appear correctly if you do either of the following?

SELECT convert(colname, 'utf8', 'latin1') FROM tablename;
SELECT convert(colname, 'utf8', 'win1252') FROM tablename;

If you use characters like "smart quotes" or the Euro sign then
you'll probably need to use win1252 instead of latin1.  Does the
following show a Euro sign or does it show blank?

SELECT convert('\342\202\254', 'utf8', 'win1252');

-- 
Michael Fuhr


Re: Changing character set when the damage is done

From
Alexis Paul Bertolini
Date:
Michael Fuhr wrote:
> Are you sure that's not a tilde (a wavy line above the A) instead
> of a cedilla (a hook below the A)?  The UTF-8 encoding for lowercase e
> with grave is 0xc3 0xa8, which in ISO-8859-1 (LATIN1) or Windows-1252
> is uppercase A with tilde followed by a diaeresis (an umlaut on its
> own).  Does the data appear correctly if you do either of the following?
>
> SELECT convert(colname, 'utf8', 'latin1') FROM tablename;
>   
Yep, I meant the A with the tilde (the word cedilla sounded Spanish... 
lol). Thanks, the "convert" function does the trick for me.

Using win1252 I get a "invalid destination encoding" error, but no 
problem since I know for sure there aren't any euro signs.

Thanks!!

Alex.


Re: Changing character set when the damage is done

From
Michael Fuhr
Date:
On Sun, Dec 24, 2006 at 06:44:04PM +0100, Alexis Paul Bertolini wrote:
> >SELECT convert(colname, 'utf8', 'latin1') FROM tablename;
>
> Yep, I meant the A with the tilde (the word cedilla sounded Spanish... 
> lol). Thanks, the "convert" function does the trick for me.

If possible consider recreating the database with an encoding other
than SQL_ASCII (e.g., LATIN1 or UTF8).  If you do that, and if you
set client_encoding appropriately when inserting and selecting data,
then the database will convert the data automatically.

> Using win1252 I get a "invalid destination encoding" error, but no 
> problem since I know for sure there aren't any euro signs.

Versions 8.1 and later support win1252 so I'd guess you're using
8.0 or earlier.  Consider upgrading to 8.1 or 8.2 for numerous
improvements over earlier releases.

http://www.postgresql.org/docs/8.1/interactive/release-8-1.html
http://www.postgresql.org/docs/8.2/interactive/release-8-2.html

-- 
Michael Fuhr