Re: Eliminating bad characters from a database for - Mailing list pgsql-general

From mike
Subject Re: Eliminating bad characters from a database for
Date
Msg-id 1163735351.30434.0.camel@localhost.localdomain
Whole thread Raw
In response to Re: Eliminating bad characters from a database for upgrading from 7.4 to 8.1  ("Gregory S. Williamson" <gsw@globexplorer.com>)
List pgsql-general
The manual suggests: iconv -c -f UTF-8 -t UTF-8 -o cleanfile.sql
dumpfile.sql. The -c option removes invalid character sequences. A diff
of the two files will show the sequences that are invalid. iconv reads
the entire input file into memory so it might be necessary to use split
to break up the dump into multiple smaller files for processing.

On Thu, 2006-11-16 at 19:38 -0800, Gregory S. Williamson wrote:
> Thanks for the suggestion ... since the data involved came from different source, I suspect there may be more than
oneencoding, but this has great promise. 
>
> Greg
>
>
> -----Original Message-----
> From:    Russell Smith [mailto:mr-russ@pws.com.au]
> Sent:    Thu 11/16/2006 7:27 PM
> To:    Gregory S. Williamson
> Cc:    pgsql-general@postgresql.org
> Subject:    Re: [GENERAL] Eliminating bad characters from a database for upgrading  from 7.4 to 8.1
>
> Gregory S. Williamson wrote:
> > Dear list,
> >
> > I have been banging my head against a problem for a few days now, and although I am making progress it is painfully
slow,and I am hoping that some one out there can steer me in a better way. 
> >
> > I've got a medium size database (7.4), about 64 gigs of data, about 1/3rd of which is in one table, which has ~32
millionrows (22 gigs when dumped). This largish table has about 20 different columns of varchar or text. 
> >
> > There are some records that have illegal characters in them, according to postgres 8.1.5, which imposes stricter
standardson UTF encoding. 
> >
> > I've been using copy to dump the big table to disk, then try to load it into my new table. When it fails, I use
splitto break the file into managable chunks and then use ""vi" to find the offending line, then figure out the column.
ThenI use something like: 
> >
> > create table bad_char_gids as select gid from parcels where position('Ñ' in s_street) > 0;
> >
> > And so create a table with the ids of the bad records; and then use replace to either replace or eliminate the
offendingcharacters from that column. This example got 5001 records, but often it is one record in the whole DB will
havesome other offending character. I fix the problem in the loaddata as well, and continue. 
> >
> > The problem is that there are errors in quite a few of the columns (but only a few tens of thousands of records),
andthe offending characters are all quite different (wierd diacritics and characters, upper and lower case). And so
thisis a very slow process. 
> >
> > Is there any way to get a list of records, even if done repeatedly for each column, that would let me find the
offendingrecords in 7.4 which have any invalid UTF chars? I am feeling stupid for not seeing one ... I can find any
individualbad character, but I want to find them all at once, if possible. 
> >
> Try converting the dump files encoding to UTF-8.  before 8.1 you could
> insert invalid characters into the DB because it accepted other
> encodings.  It will also dump other encoding.  For example, converting
> something with windows characters in it.
>
> iconv -f "WINDOWS-1251" -t "UTF-8" dump_file > converted_dump_file
>
> And import the converted file.  you may need to try a couple of
> different input encodings if you aren't sure what encoding was used when
> inserting data into the DB.
>
> Russell.
>
> > TIA,
> >
> > Greg Williamson
> > DBA
> > GlobeXplorer LLC
> >
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 6: explain analyze is your friend
> >
> >
> >
>
>
>
> -------------------------------------------------------
> Click link below if it is SPAM gsw@globexplorer.com
>
"https://mailscanner.globexplorer.com/dspam/dspam.cgi?signatureID=455d2cee144961034217237&user=gsw@globexplorer.com&retrain=spam&template=history&history_page=1"
> !DSPAM:455d2cee144961034217237!
> -------------------------------------------------------
>
>
>
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that your
>        message can get through to the mailing list cleanly


pgsql-general by date:

Previous
From: "Gregory S. Williamson"
Date:
Subject: Re: Eliminating bad characters from a database for upgrading from 7.4 to 8.1
Next
From: Tom Lane
Date:
Subject: Re: PostgreSQL: Question about rules