Thread: How to find broken UTF-8 characters ?

How to find broken UTF-8 characters ?

From
Andreas
Date:
Hi,

I regularly have to import from Excel files, that hold rather simple 
text and number columns.
That works.

Occasionally there are unwanted special characters at the end of 
text-columns that Exel and pgAdmin either show as a upward arrow with a 
short leg on top 90° to the right or others are invisible with UTF-8 
aware programs or get dispayed as ? by Access.
A text viewer shows "ÔÇÄ" or E2 80 8E in Hex for the invisible thingy.

My database is unicode so it doesn't mind those freak-chars.
The problem rises when I need to export those records to CSV with 
pgAdmin. pgAdmin complains about not beeing able to store those lines in 
the local charset.

How can I find those broken UTF-8 characters?
How can I get rid of them?



Re: [SPAM]-D] How to find broken UTF-8 characters ?

From
silly sad
Date:
On 04/26/10 04:12, Andreas wrote:

> Excel files> pgAdmin> Access.

looks like a complete offtopic

> How can I find those broken UTF-8 characters?
> How can I get rid of them?

iconv -c

BUT
u should not have those characters at all
if one is occured it most probably an error

AND
u should get rid of this error itself --
not of its consequences.




Re: [SPAM]-D] How to find broken UTF-8 characters ?

From
Andreas
Date:
Am 26.04.2010 12:12, schrieb silly sad:
> On 04/26/10 04:12, Andreas wrote:
>
> looks like a complete offtopic
Not anymore. The bad signs are in the DB now.

I'd need some command that filters somehow for inconvertible 
(Unicode-->local charset) data.
How can I find those Unicode characters that allready sneaked in?

Actually there shouldn't be anything within the tables that NEED to be 
coded in Unicode.

something like
SELECT * FROM tab_1 WHERE  field_x <>  ConvertToLocal(field_x)
might be a good start.


>> How can I get rid of them?
> iconv -c
AFAIK iconv would translate on file system level but I would think that 
messed up a allready messed up Excel workmap even further.
I'd be glad to handle csv, too.

> BUT
> u should not have those characters at all
> if one is occured it most probably an error

Sure, but those files hit me over a chain of people who consider it ok 
to convert data over numerus file formats, cut, edit, save as X, send 
per mail .... then hit me and I am the one to clean up.


> AND
> u should get rid of this error itself -- not of its consequences.
Like quitting the job and grow flowers instead?
I'll consider this.   ;)



Re: [SPAM]-D] Re: [SPAM]-D] How to find broken UTF-8 characters ?

From
silly sad
Date:
>>> How can I get rid of them?
>> iconv -c
> AFAIK iconv would translate on file system level but I would think that
> messed up a allready messed up Excel workmap even further.
> I'd be glad to handle csv, too.

pg_dump | iconv -c | psql


Re: [SPAM]-D] How to find broken UTF-8 characters ?

From
Andreas
Date:
Hi,

while writing the reply below I found it sounds like beeing OT but it's 
actually not.
I just need a way to check if a collumn contains values that CAN NOT be 
converted from Utf8 to Latin1.
I tried:
Select convert_to (my_column::text, 'LATIN1') from my_table;

It raises an error that says translated:
ERROR:  character 0xe28093 in encoding »UTF8« has no equivalent in »LATIN1«

I'd like to select all those records.
When I know which record has faulty content I can correct it.

If this is really OT on the SQL list then please tell me where to ask.


Am 28.04.2010 15:18, schrieb Justin Graf:
> On 4/26/2010 8:41 AM, Andreas wrote:
>>>> How can I get rid of them?
>>> iconv -c
>> AFAIK iconv would translate on file system level but I would think 
>> that messed up a allready messed up Excel workmap even further.
>> I'd be glad to handle csv, too.
>
> I would look at a macro/script to have excel dump the data out in CSV 
> then move data to into Postgres

It's like this.
I load the spreadsheet into an Access-DB and let a VBA skript stuff the 
data into PG via ADODB/ODBC.
Often I have to clean up more obvious things than obscure characters or 
amend the info out of other sources before I can upload it to PG.

>
> Now these are not illegal UTF chars.  If those values where wacky  
> Postgresql would not have allowed you insert the record.
> Ô = utf code 212, Ç = utf code 199, Ä = utf code 196

Those are even in Latin1.
They were only 1 example. I suppose where I find them the 3 codes form a 
multibyte code that can't be displayd or don't get displayd as a usual 
letter but some symbol or asian-looking thing which definately doesn't 
belong there.
I saw occasionally that such a wacky symbol replaced some other signes 
that are language specific like ä, ö, ü. Then the next sign is missing 
too, so something is mixing up the encoding and combines 2 chars into 1 
utf8-code.

> To force a string into a specific encoding  we have the Covert, 
> Convert_From and  Cover_to  see section 9.5 in the help files

The problem is, that pgAdmin complains those signes aren't convertible 
and drops the whole record out of the result of the select that I'd like 
to dump into a csv.


> Select covert('MyUtf8', 'UTF8', 'LATIN')
> or
> Select covert_to('MyUtf8',  'LATIN')

I found them before but didn't understand their output.
e.g.
Select convert('1aäßx', 'utf8', 'LATIN1') ;
Result = "1a\344\337x"
so it translated  ä = 344  and  ß = 337. The other 3 are just as they 
were before.
How can this be valid in a single byte charset like Latin1?
Especially as ä, ß are E4 and DF.
Why do they come out as escaped codes when they are in Latin1 aswell as 
1, a and x?

> What ever pg client library used to move Excel data to PG my have 
> incorrectly converted some of the data or moved formatting information 
> into the database.  I have seen Access and Excel do mightily odd 
> things when connecting to DB's  I don't know about current versions 
> but 2000 and 2003 Excels did really stupid things when trying to write 
> to DB's including MSSQL.

Cute ... we use Access 2000 and 2003   :(




Re: [SPAM]-D] How to find broken UTF-8 characters ?

From
Justin Graf
Date:
On 4/28/2010 10:34 PM, Andreas wrote: <blockquote cite="mid:4BD8F038.6060601@gmx.net" type="cite">Hi, <br /><br />
whilewriting the reply below I found it sounds like beeing OT but it's actually not. <br /> I just need a way to check
ifa collumn contains values that CAN NOT be converted from Utf8 to Latin1. <br /> I tried: <br /> Select convert_to
(my_column::text,'LATIN1') from my_table; <br /><br /> It raises an error that says translated: <br /> ERROR: 
character0xe28093 in encoding »UTF8« has no equivalent in »LATIN1« <br /><br /> I'd like to select all those records.
<br/> When I know which record has faulty content I can correct it. <br /><br /> If this is really OT on the SQL list
thenplease tell me where to ask. <br /></blockquote><br /><font color="#990000">That's easy enough  you need to write
anUpdate statement using regular expression to replace  all non legal Latin/ASCII char <br /><a
href="http://www.postgresql.org/docs/8.4/interactive/functions-string.html">http://www.postgresql.org/docs/8.4/interactive/functions-string.html
</a><br/><br /> the command is regexp_replace('MyBadString', 'SearchForallNoneAsccIIChars', 'ReplaceWithBlankString
')<br/><br /> I'm pretty sure this is the regualr expression to find all non ASCII chars..  <span
class="messagetext">[^\x00-\xFF]<br/><br /> To test is try to  Select </span>regexp_replace( MyColumn, '<span
class="messagetext">[^\x00-\xFF]',' ') from screweduptable </span><br /><br /> If the regular expression does not work,
I'mdry well, when it comes to regular expressions.  Dd i say i hate regular expression.  It dam near impossible to
write.<br /> Once you get the expression right and working<br /><br /> the Update is straight forward.<br /> Update
mytableset mybadcolumn = regexp_replace( mybadcolumn, '<span class="messagetext">[^\x00-\xFF]', ' ')</span></font><br
/><br/><blockquote cite="mid:4BD8F038.6060601@gmx.net" type="cite"><blockquote type="cite">Select covert('MyUtf8',
'UTF8','LATIN') <br /> or <br /> Select covert_to('MyUtf8',  'LATIN') <br /></blockquote><br /> I found them before but
didn'tunderstand their output. <br /> e.g. <br /> Select convert('1aäßx', 'utf8', 'LATIN1') ; <br /> Result =
"1a\344\337x"<br /> so it translated  ä = 344  and  ß = 337. The other 3 are just as they were before. <br /> How can
thisbe valid in a single byte charset like Latin1? <br /> Especially as ä, ß are E4 and DF. <br /> Why do they come out
asescaped codes when they are in Latin1 aswell as 1, a and x? <br /></blockquote><br /><font color="#990000">Someone
withmore knowledge how convert()  works is going to have to explain why they have been escaped. PgAdmin may have
escapedthem.  But those characters are valid Latin1 characters<br /><br /><a class="moz-txt-link-freetext"
href="http://en.wikipedia.org/wiki/%C3%84">http://en.wikipedia.org/wiki/%C3%84</a><br/><a class="moz-txt-link-freetext"
href="http://en.wikipedia.org/wiki/%C3%9F">http://en.wikipedia.org/wiki/%C3%9F</a><br/> ß = latin Beta <br /><br /> It
seemsAccess and Excel are putting in extra bits of data into the field.  In the past i had to change inserts/updates
fromAccess so it would send data in a specific char encoding.  I had problems where Access was using a Windows
Encoding,the ODBC converted it to Latin1, and the MsSQL Database put it in UTF8.  It was no fun cleaning it up.  <br
/></font><br/><blockquote cite="mid:4BD8F038.6060601@gmx.net" type="cite"><br /><blockquote type="cite">What ever pg
clientlibrary used to move Excel data to PG my have incorrectly converted some of the data or moved formatting
informationinto the database.  I have seen Access and Excel do mightily odd things when connecting to DB's  I don't
knowabout current versions but 2000 and 2003 Excels did really stupid things when trying to write to DB's including
MSSQL.<br /></blockquote><br /> Cute ... we use Access 2000 and 2003   :( <br /></blockquote><br /><font
color="#990000">Beenmore Screwed by Excel and Access flakiness, and Access Programmers thinking they are
DBA's.</font><br/><br />   <br /><br /><br /> All legitimate Magwerks Corporation quotations are sent in a .PDF file
attachmentwith a unique ID number generated by our proprietary quotation system. Quotations received via any other form
ofcommunication will not be honored. <br /><br /> CONFIDENTIALITY NOTICE: This e-mail, including attachments, may
containlegally privileged, confidential or other information proprietary to Magwerks Corporation and is intended solely
forthe use of the individual to whom it addresses. If the reader of this e-mail is not the intended recipient or
authorizedagent, the reader is hereby notified that any unauthorized viewing, dissemination, distribution or copying of
thise-mail is strictly prohibited. If you have received this e-mail in error, please notify the sender by replying to
thismessage and destroy all occurrences of this e-mail immediately. <br /> Thank you. <br /> 

Re: [SPAM]-D] How to find broken UTF-8 characters ?

From
Jasen Betts
Date:
On 2010-04-29, Andreas <maps.on@gmx.net> wrote:
> Hi,
>
> while writing the reply below I found it sounds like beeing OT but it's 
> actually not.
> I just need a way to check if a collumn contains values that CAN NOT be 
> converted from Utf8 to Latin1.
> I tried:
> Select convert_to (my_column::text, 'LATIN1') from my_table;
>
> It raises an error that says translated:
> ERROR:  character 0xe28093 in encoding »UTF8« has no equivalent in »LATIN1«

use a regular expression.
ISO8859-1 is easy, all the caracters a grouped together in unicode so
the regular expression consists of a single inverted range class
SELECT pkey FROM tabname WHERE ( textfield || textfiled2 || textfield3 ) ~ ('[^'||chr(1)||'-'||chr(255)||']');



Re: [SPAM]-D] How to find broken UTF-8 characters ?

From
Jasen Betts
Date:
On 2010-04-29, Justin Graf <justin@magwerks.com> wrote:

> I'm pretty sure this is the regualr expression to find all non ASCII=20
> chars.. [^\x00-\xFF]

Not in postgres.
\x00 does not work well in strings,  and \xFF is invalid utf-8.this is why I used char() 
(also ASCII is undefined past at \x7F ... but the original requestwas for LATIN-1 which does end at char(255))