Thread: finding bogus UTF-8

finding bogus UTF-8

From
Scott Ribe
Date:
I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors (AMA
CPT files, actually) before PG's checking was as stringent as it is today. Can anybody suggest a query to find such
values?


--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice





Re: finding bogus UTF-8

From
dennis jenkins
Date:
I'm working on a project to convert a large database from SQL_ASCII to
UTF-8.  I am using this procedure:

1) pg_dump the SQL_ASCII database to an SQL text file.
2) Run it through a small (efficient) C program that logs each line that
contains ANY "unclean" ASCII text.
3) Parse that log with a small perl program (hashes are easier in perl
than C) to produce a report, and emit some SQL.
4) Construct SQL update statements to "repair" the original data (see
the sketch after this list).
5) Repeat at step #1 until the database is clean.
6) pg_dump (SQL_ASCII) -> pg_restore -EUTF8 new database.
7) Profit!
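
As an illustration of step 4 (this is not output from the tool): a
repair statement run against the still-SQL_ASCII database might look
like the sketch below.  Table, column, and key values are made-up
placeholders, and it assumes the bad bytes are Latin-1 characters
(here an e-acute, 0xE9) that should become their UTF-8 equivalents
before the final restore.

-- hypothetical repair: rewrite a Latin-1 e-acute (0xE9, octal 351)
-- as its UTF-8 encoding (0xC3 0xA9, octal 303 251)
update some_table
   set some_column = replace(some_column, E'\351', E'\303\251')
 where primary_key in (1, 2, 3, 4);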

If you are interested, I can email you the C and Perl source.

It runs like this:

# time pg_restore /db-dumps/some_ascii_pgdump.bin | ./ascii-tester |
./bad-ascii-report.pl > unclean-ascii.rpt

real    11m11.804s
user    18m2.579s
sys     2m25.803s

# grep "^--" unclean-ascii.rpt
-- some_table 4051021
-- other_table 16

^^^ The numbers are counts of rows that need cleaning.
The entire "rpt" file contains SQL comments ("--") and SQL select
statements of the form:
select * from table where primary_key in (1, 2, 3, 4, ....);

The perl script contains a hash that maps table names to primary key
column IDs (so it can pick out the key values when parsing the raw SQL
restore's "COPY" data).  I will need to purge my secret schema stuff
from it before sharing it with anyone.

My solution is probably not perfect, and probably not optimal, but it
is working great so far.  I'm almost done cleaning up my database and
hope to attempt a real UTF8 restore in the near future.


On Thu, Feb 10, 2011 at 1:02 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
> I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors
> (AMA CPT files, actually) before PG's checking was as stringent as it is today. Can anybody suggest a query to find
> such values?

Re: finding bogus UTF-8

From
dennis jenkins
Date:
On Thu, Feb 10, 2011 at 1:02 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
> I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors
> (AMA CPT files, actually) before PG's checking was as stringent as it is today. Can anybody suggest a query to find
> such values?

If you know which table and column the data is in, you can also do
something like this:

(I typed this up without checking the syntax.  The basic idea is to
cast the column to bytea, encode it with the 'escape' method, and then
grep for backslashes.)

select * from bad_table
where encode(bad_column::bytea, 'escape') ~ '\\\\';
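
Also untested, but a variant of the same idea that avoids the cast
(PostgreSQL has no built-in text-to-bytea cast) is to push the column
through convert_to(); converting to SQL_ASCII should pass the bytes
through unchanged.  encode(..., 'escape') then renders NUL and
high-bit bytes as \nnn octal escapes, which the regex looks for.  Note
this flags every non-ASCII byte, valid UTF-8 or not; table and column
names are placeholders:

select * from bad_table
where encode(convert_to(bad_column, 'SQL_ASCII'), 'escape')
      ~ E'\\\\[0-7]{3}';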

Re: finding bogus UTF-8

From
dennis jenkins
Date:
>
> If you are interested, I can email to you the C and Perl source.
>
> It runs like this:
>
> # time pg_restore /db-dumps/some_ascii_pgdump.bin | ./ascii-tester |
> ./bad-ascii-report.pl > unclean-ascii.rpt

http://www.ecoligames.com/~djenkins/pgsql/

Disclaimer: I offer NO warranty.  Use at your own risk.  The code does
minimal error checking (it's a hack / tool for manual use, not reliable
production use).

C code compiles cleanly with gcc.  Perl code uses no libraries (just a
STDIN -> STDOUT processor).  This code should run damn near anywhere.

The code will stay on my web server until I forget about it and re-org
stuff in a few weeks, so grab it while you can.

Re: finding bogus UTF-8

From
Glenn Maynard
Date:
On Thu, Feb 10, 2011 at 2:02 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
> I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors (AMA CPT files, actually) before PG's checking was as stringent as it is today. Can anybody suggest a query to find such values?

I hit this problem too, if I remember correctly, when trying to upgrade a database from 8.3 to 8.4.  I ended up aborting the upgrade, since the upgrade documentation made no mention of this and I didn't have time to dig into it at the time.  A tool to find all instances of this would be very helpful.

--
Glenn Maynard

Re: finding bogus UTF-8

From
Geoffrey Myers
Date:
Glenn Maynard wrote:
> On Thu, Feb 10, 2011 at 2:02 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
>
>     I know that I have at least one instance of a varchar that is not
>     valid UTF-8, imported from a source with errors (AMA CPT files,
>     actually) before PG's checking was as stringent as it is today. Can
>     anybody suggest a query to find such values?
>
>
> I hit this problem too, if I remember correctly when trying to upgrade a
> database from 8.3 to 8.4.  I ended up aborting the upgrade, since the
> upgrade documentation made no mention of this and I didn't have time to
> dig into it at the time.  A tool to find all instances of this would be
> very helpful.

I'm about to pipe the ASCII output of a database dump through a perl
script that removes any unwanted characters.  To help define what
'unwanted characters' are, compare the ord() values to the decimal
values at http://www.asciitable.com/

while (<>)
{
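     # drop bytes 0-8, 11-31, and 127-255; keep tab, newline, and printable ASCII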
     $_ =~ s/(.)/((ord($1) >= 0) && (ord($1) <= 8))
         || ((ord($1) >= 11) && (ord($1) <= 31))
         || ((ord($1) >= 127)) ?"": $1/egs;
     print;
}

comments would be appreciated.

--
Until later, Geoffrey

"I predict future happiness for America if they can prevent
the government from wasting the labors of the people under
the pretense of taking care of them."
- Thomas Jefferson

Re: finding bogus UTF-8

From
Marko Kreen
Date:
On Thu, Feb 10, 2011 at 9:02 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
> I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors
> (AMA CPT files, actually) before PG's checking was as stringent as it is today. Can anybody suggest a query to find
> such values?

CREATE OR REPLACE FUNCTION is_utf8(text)
RETURNS bool AS $$
try:
    args[0].decode('utf8')
    return True
except UnicodeDecodeError:
    return False
$$ LANGUAGE plpythonu STRICT;
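
Once the language and function are installed, finding the offending
rows is just a WHERE clause; the table and column names here are only
an example:

select * from some_table where not is_utf8(some_column);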

--
marko

Re: finding bogus UTF-8

From
Vick Khera
Date:
On Tue, Feb 15, 2011 at 11:09 AM, Geoffrey Myers
<lists@serioustechnology.com> wrote:
> comments would be appreciated.
>

If all you're doing is filtering stdin to stdout and deleting a range
of characters, it seems that tr would be a faster tool:

cat foo.txt | tr -d '\000-\008\013-\037\177-\377' > foo-cleaned.txt

Re: finding bogus UTF-8

From
Geoffrey Myers
Date:
Vick Khera wrote:
> On Tue, Feb 15, 2011 at 11:09 AM, Geoffrey Myers
> <lists@serioustechnology.com> wrote:
>> comments would be appreciated.
>>
>
> If all you're doing is filtering stdin to stdout and deleting a range
> of characters, it seems that tr would be a faster tool:
>
> cat foo.txt | tr -d '\000-\008\013-\037\177-\377' > foo-cleaned.txt

I toyed with tr for a bit, but could not get it to work.  The above did
not work for me either.  Not exactly sure what it's doing, but here's a
couple of diff lines:


1619c1619
<     days integer DEFAULT 28,
---
>     days integer DEFAULT 2,


So it appears 'tr' is deleting the '8' character, rather than the
octal value 008.


--
Until later, Geoffrey

"I predict future happiness for America if they can prevent
the government from wasting the labors of the people under
the pretense of taking care of them."
- Thomas Jefferson

Re: finding bogus UTF-8

From
Vick Khera
Date:
On Tue, Feb 15, 2011 at 5:06 PM, Geoffrey Myers
<lists@serioustechnology.com> wrote:
> I toyed with tr for a bit, but could not get it to work.  The above did not
> work for me either.  Not exactly sure what it's doing, but here's a couple
> of diff lines:

Check your shell escaping.  You may need \\ to protect the \ characters.