Re: How to remove non-UTF values from a table? - Mailing list pgsql-general

From Howard Cole
Subject Re: How to remove non-UTF values from a table?
Date
Msg-id 4B28E942.8090402@selestial.com
Whole thread Raw
In response to Re: How to remove non-UTF values from a table?  (Phoenix Kiula <phoenix.kiula@gmail.com>)
Responses Re: How to remove non-UTF values from a table?  (Richard Broersma <richard.broersma@gmail.com>)
List pgsql-general
Phoenix Kiula wrote:
> On Tue, Dec 15, 2009 at 9:26 PM, Howard Cole <howardnews@selestial.com> wrote:
>
>> Phoenix Kiula wrote:
>>
>>> An easy question for some I hope.
>>>
>>> I have a DB from 8.2 days that when I now dump and try to take into
>>> the 8.3.7, it gives me errors about utf-8 stuff.
>>>
>>> I tried searching this list's archives but could not come up with an
>>> answer.
>>>
>>> Google returns some sites like these:
>>> http://sniptools.com/databases/finding-non-utf8-values-in-postgresql -
>>> but I'm not clear on how to use them.
>>>
>>> Following the SQL on this site I could identify some columns that
>>> contain text like this:
>>>
>>>    "Évolution générale de la situation démographique"
>>>
>>> So my guess is that the non-English characters were originally not
>>> getting written in proper utf-8 variants.
>>>
>>> Is there any SQL possibility to find these columns and replace them
>>> with utf-8 equivalents using some postgresql commands? Couldn't find
>>> anything in the "Strings functions" (chapter 9 of manual).
>>>
>>> We're on CentOS.
>>>
>>> Thanks!
>>>
>>> My recommendation would be to install the iconv utility and run it on a
>>> plain text (pg_dump -Fp) backup as suggested in the google article - and
>>> then reimport the clean UTF-8.
>>>
>>> I am surprised that you managed to install the original backup on 8.3
>>> because it seems to be much more strict on encoding - Unless your database
>>> is not in UTF-8?
>>>
>
>
>
> Thanks Howard.
>
> I ran the SQL and it finds anything that has non-English characters.
> For example:
>
>
>
http://www.amazon.co.jp/%E3%83%A4%E3%83%9E%E3%83%80%E9%9B%BB%E6%A9%9F%E3%81%AE%E5%93%81%E6%A0%BC%E2%80%95No-1%E4%BC%81%E6%A5%AD%E3%81%AE%E6%BF%80%E5%AE%89%E5%93%B2%E5%AD%A6-%E7%AB%8B%E7%9F%B3-%E6%B3%B0%E5%89%87/dp/406214378X/ref=sr_1_1?ie=UTF8&s=books&qid=1199212694&sr=8-1
>
>
> Part of this URL is actually in Japanese, but when I paste it in this
> email it comes up with all these percentage signs. I suppose this is
> "url encoded".
>
> Shouldn't this be valid UTF-8? How does PG calculate if something is
> not valid UTF-8?
>
> Thanks.
>
>
I am sure that postgres uses standard test to see if text is valid UTF8,
however I guess you will only get a warning when you try to read or
write the data, and the warning only appears in the log as far as I am
aware. Encoding checking has become more strict with each upgrade to
postgres, this is why you are seeing errors as you import into 8.3 from
an 8.2 backup.

I still maintain that you should check and fix using iconv rather than
SQL. Read your distro notes on how to install iconv (if it isn't already
installed) and run it on your plain text backup. You can then pinpoint
individual changes using diff if you want to find it in your exising 8.2
database. Better still, just fix using Iconv then import the clean data
into 8.3 or 8.4

Howard Cole
www.selestial.com


pgsql-general by date:

Previous
From: Howard Cole
Date:
Subject: Re: Interesting Benchmark Article
Next
From: Michael Clark
Date:
Subject: Re: Possible causes for database corruption and solutions