Re: finding bogus UTF-8 - Mailing list pgsql-general

From Geoffrey Myers
Subject Re: finding bogus UTF-8
Date
Msg-id 4D5AA54E.9080104@serioustechnology.com
Whole thread Raw
In response to Re: finding bogus UTF-8  (Glenn Maynard <glenn@zewt.org>)
Responses Re: finding bogus UTF-8
List pgsql-general
Glenn Maynard wrote:
> On Thu, Feb 10, 2011 at 2:02 PM, Scott Ribe <scott_ribe@elevated-dev.com
> <mailto:scott_ribe@elevated-dev.com>> wrote:
>
>     I know that I have at least one instance of a varchar that is not
>     valid UTF-8, imported from a source with errors (AMA CPT files,
>     actually) before PG's checking was as stringent as it is today. Can
>     anybody suggest a query to find such values?
>
>
> I hit this problem too, if I remember correctly when trying to upgrade a
> database from 8.3 to 8.4.  I ended up aborting the upgrade, since the
> upgrade documentation made no mention of this and I didn't have time to
> dig into it at the time.  A tool to find all instances of this would be
> very helpful.

I'm about to pipe the ascii output of a database dump through a perl
script that removes any unwanted characters.  To help define what
'unwanted characters' are, compare the ord() values to decimal values at
http://www.asciitable.com/

while (<>)
{
     $_ =~ s/(.)/((ord($1) >= 0) && (ord($1) <= 8))
         || ((ord($1) >= 11) && (ord($1) <= 31))
         || ((ord($1) >= 127)) ?"": $1/egs;
     print;
}

comments would be appreciated.

--
Until later, Geoffrey

"I predict future happiness for America if they can prevent
the government from wasting the labors of the people under
the pretense of taking care of them."
- Thomas Jefferson

pgsql-general by date:

Previous
From: Vick Khera
Date:
Subject: Re: Speeding up index scans by truncating timestamp?
Next
From: pasman pasmański
Date:
Subject: Re: read and restore deleted record