Re: [SPAM]-D] How to find broken UTF-8 characters ? - Mailing list pgsql-sql

From Justin Graf
Subject Re: [SPAM]-D] How to find broken UTF-8 characters ?
Date
Msg-id 4BD98413.1010109@magwerks.com
Whole thread Raw
In response to Re: [SPAM]-D] How to find broken UTF-8 characters ?  (Andreas <maps.on@gmx.net>)
List pgsql-sql
On 4/28/2010 10:34 PM, Andreas wrote: <blockquote cite="mid:4BD8F038.6060601@gmx.net" type="cite">Hi, <br /><br />
whilewriting the reply below I found it sounds like beeing OT but it's actually not. <br /> I just need a way to check
ifa collumn contains values that CAN NOT be converted from Utf8 to Latin1. <br /> I tried: <br /> Select convert_to
(my_column::text,'LATIN1') from my_table; <br /><br /> It raises an error that says translated: <br /> ERROR: 
character0xe28093 in encoding »UTF8« has no equivalent in »LATIN1« <br /><br /> I'd like to select all those records.
<br/> When I know which record has faulty content I can correct it. <br /><br /> If this is really OT on the SQL list
thenplease tell me where to ask. <br /></blockquote><br /><font color="#990000">That's easy enough  you need to write
anUpdate statement using regular expression to replace  all non legal Latin/ASCII char <br /><a
href="http://www.postgresql.org/docs/8.4/interactive/functions-string.html">http://www.postgresql.org/docs/8.4/interactive/functions-string.html
</a><br/><br /> the command is regexp_replace('MyBadString', 'SearchForallNoneAsccIIChars', 'ReplaceWithBlankString
')<br/><br /> I'm pretty sure this is the regualr expression to find all non ASCII chars..  <span
class="messagetext">[^\x00-\xFF]<br/><br /> To test is try to  Select </span>regexp_replace( MyColumn, '<span
class="messagetext">[^\x00-\xFF]',' ') from screweduptable </span><br /><br /> If the regular expression does not work,
I'mdry well, when it comes to regular expressions.  Dd i say i hate regular expression.  It dam near impossible to
write.<br /> Once you get the expression right and working<br /><br /> the Update is straight forward.<br /> Update
mytableset mybadcolumn = regexp_replace( mybadcolumn, '<span class="messagetext">[^\x00-\xFF]', ' ')</span></font><br
/><br/><blockquote cite="mid:4BD8F038.6060601@gmx.net" type="cite"><blockquote type="cite">Select covert('MyUtf8',
'UTF8','LATIN') <br /> or <br /> Select covert_to('MyUtf8',  'LATIN') <br /></blockquote><br /> I found them before but
didn'tunderstand their output. <br /> e.g. <br /> Select convert('1aäßx', 'utf8', 'LATIN1') ; <br /> Result =
"1a\344\337x"<br /> so it translated  ä = 344  and  ß = 337. The other 3 are just as they were before. <br /> How can
thisbe valid in a single byte charset like Latin1? <br /> Especially as ä, ß are E4 and DF. <br /> Why do they come out
asescaped codes when they are in Latin1 aswell as 1, a and x? <br /></blockquote><br /><font color="#990000">Someone
withmore knowledge how convert()  works is going to have to explain why they have been escaped. PgAdmin may have
escapedthem.  But those characters are valid Latin1 characters<br /><br /><a class="moz-txt-link-freetext"
href="http://en.wikipedia.org/wiki/%C3%84">http://en.wikipedia.org/wiki/%C3%84</a><br/><a class="moz-txt-link-freetext"
href="http://en.wikipedia.org/wiki/%C3%9F">http://en.wikipedia.org/wiki/%C3%9F</a><br/> ß = latin Beta <br /><br /> It
seemsAccess and Excel are putting in extra bits of data into the field.  In the past i had to change inserts/updates
fromAccess so it would send data in a specific char encoding.  I had problems where Access was using a Windows
Encoding,the ODBC converted it to Latin1, and the MsSQL Database put it in UTF8.  It was no fun cleaning it up.  <br
/></font><br/><blockquote cite="mid:4BD8F038.6060601@gmx.net" type="cite"><br /><blockquote type="cite">What ever pg
clientlibrary used to move Excel data to PG my have incorrectly converted some of the data or moved formatting
informationinto the database.  I have seen Access and Excel do mightily odd things when connecting to DB's  I don't
knowabout current versions but 2000 and 2003 Excels did really stupid things when trying to write to DB's including
MSSQL.<br /></blockquote><br /> Cute ... we use Access 2000 and 2003   :( <br /></blockquote><br /><font
color="#990000">Beenmore Screwed by Excel and Access flakiness, and Access Programmers thinking they are
DBA's.</font><br/><br />   <br /><br /><br /> All legitimate Magwerks Corporation quotations are sent in a .PDF file
attachmentwith a unique ID number generated by our proprietary quotation system. Quotations received via any other form
ofcommunication will not be honored. <br /><br /> CONFIDENTIALITY NOTICE: This e-mail, including attachments, may
containlegally privileged, confidential or other information proprietary to Magwerks Corporation and is intended solely
forthe use of the individual to whom it addresses. If the reader of this e-mail is not the intended recipient or
authorizedagent, the reader is hereby notified that any unauthorized viewing, dissemination, distribution or copying of
thise-mail is strictly prohibited. If you have received this e-mail in error, please notify the sender by replying to
thismessage and destroy all occurrences of this e-mail immediately. <br /> Thank you. <br /> 

pgsql-sql by date:

Previous
From: Edward Ross
Date:
Subject: Re: problem converting strings to timestamps with time zone
Next
From: "sandeep prakash dhumale"
Date:
Subject: Tsearch not searching 'Y'