Thread: How to remove non-UTF values from a table?

How to remove non-UTF values from a table?

From
Phoenix Kiula
Date:
An easy question for some I hope.

I have a DB from 8.2 days that when I now dump and try to take into
the 8.3.7, it gives me errors about utf-8 stuff.

I tried searching this list's archives but could not come up with an answer.

Google returns some sites like these:
http://sniptools.com/databases/finding-non-utf8-values-in-postgresql -
but I'm not clear on how to use them.

Following the SQL on this site I could identify some columns that
contain text like this:

    "Évolution générale de la situation démographique"

So my guess is that the non-English characters were originally not
getting written in proper utf-8 variants.

Is there any SQL possibility to find these columns and replace them
with utf-8 equivalents using some postgresql commands? Couldn't find
anything in the "Strings functions" (chapter 9 of manual).

We're on CentOS.

Thanks!

Re: How to remove non-UTF values from a table?

From
Phoenix Kiula
Date:
Actually the title of my email should have been "how to **replace**
utf-8 values".

Thanks.



On Mon, Dec 14, 2009 at 7:03 PM, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> An easy question for some I hope.
>
> I have a DB from 8.2 days that when I now dump and try to take into
> the 8.3.7, it gives me errors about utf-8 stuff.
>
> I tried searching this list's archives but could not come up with an answer.
>
> Google returns some sites like these:
> http://sniptools.com/databases/finding-non-utf8-values-in-postgresql -
> but I'm not clear on how to use them.
>
> Following the SQL on this site I could identify some columns that
> contain text like this:
>
>    "Évolution générale de la situation démographique"
>
> So my guess is that the non-English characters were originally not
> getting written in proper utf-8 variants.
>
> Is there any SQL possibility to find these columns and replace them
> with utf-8 equivalents using some postgresql commands? Couldn't find
> anything in the "Strings functions" (chapter 9 of manual).
>
> We're on CentOS.
>
> Thanks!
>

Re: How to remove non-UTF values from a table?

From
Phoenix Kiula
Date:
Actually I just realized that the SQL below will also pick up on these
perfectly valid-looking columns:



http://factfinder.census.gov/servlet/ReferenceMapFramesetServlet?_bm=y&-zip=27340&-PANEL_ID=rm_result&-_MapEvent=zoomToAddress&-street=&-city=&-rm_config=|b=50|l=en|t=420|zf=0.0|ms=ref_legal_00dec|dw=0.21626605473484609|dh=0.13180874155445527|dt=gov.census.aff.domain.map.EnglishMapExtent|if=gif|cx=-79.8023|cy=35.827|zl=5|pz=5|bo=404:315:314:313:323:321:319|bl=362:360:393:392:355:354:385|ft=350:349:335:389:388:332:331|fl=381:403:204:380:369:379:368|g=16000US3752760&-tree_id=420&-errMsg=&-redoLog=false&-geo_id=16000US3752760&-states=


Which part of this is non-UTF8? Why is this going into a UTF8 table
with corrupted values? The lc_collate etc and all settings I can
imagine are already utf-8!

Thanks for any pointers.





On Mon, Dec 14, 2009 at 7:04 PM, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> Actually the title of my email should have been "how to **replace**
> utf-8 values".
>
> Thanks.
>
>
>
> On Mon, Dec 14, 2009 at 7:03 PM, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
>> An easy question for some I hope.
>>
>> I have a DB from 8.2 days that when I now dump and try to take into
>> the 8.3.7, it gives me errors about utf-8 stuff.
>>
>> I tried searching this list's archives but could not come up with an answer.
>>
>> Google returns some sites like these:
>> http://sniptools.com/databases/finding-non-utf8-values-in-postgresql -
>> but I'm not clear on how to use them.
>>
>> Following the SQL on this site I could identify some columns that
>> contain text like this:
>>
>>    "Évolution générale de la situation démographique"
>>
>> So my guess is that the non-English characters were originally not
>> getting written in proper utf-8 variants.
>>
>> Is there any SQL possibility to find these columns and replace them
>> with utf-8 equivalents using some postgresql commands? Couldn't find
>> anything in the "Strings functions" (chapter 9 of manual).
>>
>> We're on CentOS.
>>
>> Thanks!
>>
>

Re: How to remove non-UTF values from a table?

From
Alban Hertroys
Date:
On 14 Dec 2009, at 13:21, Phoenix Kiula wrote:

> Actually I just realized that the SQL below will also pick up on these
> perfectly valid-looking columns:
>
>
>
http://factfinder.census.gov/servlet/ReferenceMapFramesetServlet?_bm=y&-zip=27340&-PANEL_ID=rm_result&-_MapEvent=zoomToAddress&-street=&-city=&-rm_config=|b=50|l=en|t=420|zf=0.0|ms=ref_legal_00dec|dw=0.21626605473484609|dh=0.13180874155445527|dt=gov.census.aff.domain.map.EnglishMapExtent|if=gif|cx=-79.8023|cy=35.827|zl=5|pz=5|bo=404:315:314:313:323:321:319|bl=362:360:393:392:355:354:385|ft=350:349:335:389:388:332:331|fl=381:403:204:380:369:379:368|g=16000US3752760&-tree_id=420&-errMsg=&-redoLog=false&-geo_id=16000US3752760&-states=

I think you accidentally posted your holiday plans?

> Which part of this is non-UTF8? Why is this going into a UTF8 table
> with corrupted values? The lc_collate etc and all settings I can
> imagine are already utf-8!
>
> Thanks for any pointers.

Alban Hertroys

--
Screwing up is the best way to attach something to the ceiling.


!DSPAM:737,4b264f49228051716694515!



Re: How to remove non-UTF values from a table?

From
Dimitri Fontaine
Date:
Phoenix Kiula <phoenix.kiula@gmail.com> writes:
> Is there any SQL possibility to find these columns and replace them
> with utf-8 equivalents using some postgresql commands? Couldn't find
> anything in the "Strings functions" (chapter 9 of manual).

I've bookmarked this for later:

  http://archives.postgresql.org/pgsql-general/2009-07/msg00904.php

Regards,
--
dim

Re: How to remove non-UTF values from a table?

From
Howard Cole
Date:
Phoenix Kiula wrote:
> An easy question for some I hope.
>
> I have a DB from 8.2 days that when I now dump and try to take into
> the 8.3.7, it gives me errors about utf-8 stuff.
>
> I tried searching this list's archives but could not come up with an answer.
>
> Google returns some sites like these:
> http://sniptools.com/databases/finding-non-utf8-values-in-postgresql -
> but I'm not clear on how to use them.
>
> Following the SQL on this site I could identify some columns that
> contain text like this:
>
>     "Évolution générale de la situation démographique"
>
> So my guess is that the non-English characters were originally not
> getting written in proper utf-8 variants.
>
> Is there any SQL possibility to find these columns and replace them
> with utf-8 equivalents using some postgresql commands? Couldn't find
> anything in the "Strings functions" (chapter 9 of manual).
>
> We're on CentOS.
>
> Thanks!
>
>
My recommendation would be to install the iconv utility and run it on a
plain text (pg_dump -Fp) backup as suggested in the google article - and
then reimport the clean UTF-8.

I am surprised that you managed to install the original backup on 8.3
because it seems to be much more strict on encoding - Unless your
database is not in UTF-8?

Howard
www.selestial.com



Re: How to remove non-UTF values from a table?

From
Phoenix Kiula
Date:
On Tue, Dec 15, 2009 at 9:26 PM, Howard Cole <howardnews@selestial.com> wrote:
> Phoenix Kiula wrote:
>>
>> An easy question for some I hope.
>>
>> I have a DB from 8.2 days that when I now dump and try to take into
>> the 8.3.7, it gives me errors about utf-8 stuff.
>>
>> I tried searching this list's archives but could not come up with an
>> answer.
>>
>> Google returns some sites like these:
>> http://sniptools.com/databases/finding-non-utf8-values-in-postgresql -
>> but I'm not clear on how to use them.
>>
>> Following the SQL on this site I could identify some columns that
>> contain text like this:
>>
>>    "Évolution générale de la situation démographique"
>>
>> So my guess is that the non-English characters were originally not
>> getting written in proper utf-8 variants.
>>
>> Is there any SQL possibility to find these columns and replace them
>> with utf-8 equivalents using some postgresql commands? Couldn't find
>> anything in the "Strings functions" (chapter 9 of manual).
>>
>> We're on CentOS.
>>
>> Thanks!
>>
>>
>
> My recommendation would be to install the iconv utility and run it on a
> plain text (pg_dump -Fp) backup as suggested in the google article - and
> then reimport the clean UTF-8.
>
> I am surprised that you managed to install the original backup on 8.3
> because it seems to be much more strict on encoding - Unless your database
> is not in UTF-8?



Thanks Howard.

I ran the SQL and it finds anything that has non-English characters.
For example:



http://www.amazon.co.jp/%E3%83%A4%E3%83%9E%E3%83%80%E9%9B%BB%E6%A9%9F%E3%81%AE%E5%93%81%E6%A0%BC%E2%80%95No-1%E4%BC%81%E6%A5%AD%E3%81%AE%E6%BF%80%E5%AE%89%E5%93%B2%E5%AD%A6-%E7%AB%8B%E7%9F%B3-%E6%B3%B0%E5%89%87/dp/406214378X/ref=sr_1_1?ie=UTF8&s=books&qid=1199212694&sr=8-1


Part of this URL is actually in Japanese, but when I paste it in this
email it comes up with all these percentage signs. I suppose this is
"url encoded".

Shouldn't this be valid UTF-8? How does PG calculate if something is
not valid UTF-8?

Thanks.

Re: How to remove non-UTF values from a table?

From
Howard Cole
Date:
Phoenix Kiula wrote:
> On Tue, Dec 15, 2009 at 9:26 PM, Howard Cole <howardnews@selestial.com> wrote:
>
>> Phoenix Kiula wrote:
>>
>>> An easy question for some I hope.
>>>
>>> I have a DB from 8.2 days that when I now dump and try to take into
>>> the 8.3.7, it gives me errors about utf-8 stuff.
>>>
>>> I tried searching this list's archives but could not come up with an
>>> answer.
>>>
>>> Google returns some sites like these:
>>> http://sniptools.com/databases/finding-non-utf8-values-in-postgresql -
>>> but I'm not clear on how to use them.
>>>
>>> Following the SQL on this site I could identify some columns that
>>> contain text like this:
>>>
>>>    "Évolution générale de la situation démographique"
>>>
>>> So my guess is that the non-English characters were originally not
>>> getting written in proper utf-8 variants.
>>>
>>> Is there any SQL possibility to find these columns and replace them
>>> with utf-8 equivalents using some postgresql commands? Couldn't find
>>> anything in the "Strings functions" (chapter 9 of manual).
>>>
>>> We're on CentOS.
>>>
>>> Thanks!
>>>
>>> My recommendation would be to install the iconv utility and run it on a
>>> plain text (pg_dump -Fp) backup as suggested in the google article - and
>>> then reimport the clean UTF-8.
>>>
>>> I am surprised that you managed to install the original backup on 8.3
>>> because it seems to be much more strict on encoding - Unless your database
>>> is not in UTF-8?
>>>
>
>
>
> Thanks Howard.
>
> I ran the SQL and it finds anything that has non-English characters.
> For example:
>
>
>
http://www.amazon.co.jp/%E3%83%A4%E3%83%9E%E3%83%80%E9%9B%BB%E6%A9%9F%E3%81%AE%E5%93%81%E6%A0%BC%E2%80%95No-1%E4%BC%81%E6%A5%AD%E3%81%AE%E6%BF%80%E5%AE%89%E5%93%B2%E5%AD%A6-%E7%AB%8B%E7%9F%B3-%E6%B3%B0%E5%89%87/dp/406214378X/ref=sr_1_1?ie=UTF8&s=books&qid=1199212694&sr=8-1
>
>
> Part of this URL is actually in Japanese, but when I paste it in this
> email it comes up with all these percentage signs. I suppose this is
> "url encoded".
>
> Shouldn't this be valid UTF-8? How does PG calculate if something is
> not valid UTF-8?
>
> Thanks.
>
>
I am sure that postgres uses standard test to see if text is valid UTF8,
however I guess you will only get a warning when you try to read or
write the data, and the warning only appears in the log as far as I am
aware. Encoding checking has become more strict with each upgrade to
postgres, this is why you are seeing errors as you import into 8.3 from
an 8.2 backup.

I still maintain that you should check and fix using iconv rather than
SQL. Read your distro notes on how to install iconv (if it isn't already
installed) and run it on your plain text backup. You can then pinpoint
individual changes using diff if you want to find it in your exising 8.2
database. Better still, just fix using Iconv then import the clean data
into 8.3 or 8.4

Howard Cole
www.selestial.com


Re: How to remove non-UTF values from a table?

From
Richard Broersma
Date:
On Wed, Dec 16, 2009 at 6:05 AM, Howard Cole <howardnews@selestial.com> wrote:

> Better still, just fix using Iconv then import the clean data into
> 8.3 or 8.4

FWIW,  If I was going to make a choice between 8.3 and 8.4, I would
choose 8.4.  It looks like the future 8.5 release will be able to
preform an in-place upgrade on 8.4.  This *should* eliminate to future
need to dump and reload for future upgrades.

--
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug

Re: How to remove non-UTF values from a table?

From
Raymond O'Donnell
Date:
On 16/12/2009 15:01, Richard Broersma wrote:

> It looks like the future 8.5 release will be able to
> preform an in-place upgrade on 8.4.

Really? That would be *wonderful*. I know it's impossible to be
definitive, but how likely would you say this is?

Ray.


--
Raymond O'Donnell :: Galway :: Ireland
rod@iol.ie

Re: How to remove non-UTF values from a table?

From
Tom Lane
Date:
"Raymond O'Donnell" <rod@iol.ie> writes:
> On 16/12/2009 15:01, Richard Broersma wrote:
>> It looks like the future 8.5 release will be able to
>> preform an in-place upgrade on 8.4.

> Really? That would be *wonderful*. I know it's impossible to be
> definitive, but how likely would you say this is?

It's wishful thinking.  We tried to have that for 8.4, and failed: there
are so many constraints on pg_upgrade that it's not a usable mainstream
solution.  So far as I've seen there has been no work done that would
make it any more reliable for 8.5.

            regards, tom lane

Re: How to remove non-UTF values from a table?

From
Bruce Momjian
Date:
Tom Lane wrote:
> "Raymond O'Donnell" <rod@iol.ie> writes:
> > On 16/12/2009 15:01, Richard Broersma wrote:
> >> It looks like the future 8.5 release will be able to
> >> preform an in-place upgrade on 8.4.
>
> > Really? That would be *wonderful*. I know it's impossible to be
> > definitive, but how likely would you say this is?
>
> It's wishful thinking.  We tried to have that for 8.4, and failed: there
> are so many constraints on pg_upgrade that it's not a usable mainstream
> solution.  So far as I've seen there has been no work done that would
> make it any more reliable for 8.5.

I do not share Tom's conclusions.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: How to remove non-UTF values from a table?

From
Bruce Momjian
Date:
Tom Lane wrote:
> "Raymond O'Donnell" <rod@iol.ie> writes:
> > On 16/12/2009 15:01, Richard Broersma wrote:
> >> It looks like the future 8.5 release will be able to
> >> preform an in-place upgrade on 8.4.
>
> > Really? That would be *wonderful*. I know it's impossible to be
> > definitive, but how likely would you say this is?
>
> It's wishful thinking.  We tried to have that for 8.4, and failed: there
> are so many constraints on pg_upgrade that it's not a usable mainstream
> solution.  So far as I've seen there has been no work done that would
> make it any more reliable for 8.5.

Some people just enjoy discouraging people.

pg_migrator (not pg_upgrade) has been used by many people to migrate
from 8.3 to 8.4.  I just helped someone yesterday with a migration.
pg_migrator threw an error because they had reindexed pg_largeobject,
and pg_migrator was not prepared to handle that.  They also had an old
cluster with a higher autovacuum_freeze_max_age than the new cluster,
causing autovacuum freeze to run during the upgrade and remove clog
files.  These are unusual cases, so I assume there are many people who
successfully use pg_migrator that I never hear from.

Both of these issues are fixed in pg_migrator 8.4.6, released yesterday.
I also talked to Alvaro about using fixed oids to remove these migration
restrictions in 8.5:

        o  a user-defined composite data type
        o  a user-defined array data type
        o  a user-defined enum data type

So, pg_migrator is a work in progress.  It current requires migration
testing and should be used by experienced Postgres users.  Someday
pg_migrator will be easier to use and more reliable.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: How to remove non-UTF values from a table?

From
Bruce Momjian
Date:
Bruce Momjian wrote:
> pg_migrator (not pg_upgrade) has been used by many people to migrate
> from 8.3 to 8.4.  I just helped someone yesterday with a migration.
> pg_migrator threw an error because they had reindexed pg_largeobject,
> and pg_migrator was not prepared to handle that.  They also had an old
> cluster with a higher autovacuum_freeze_max_age than the new cluster,
> causing autovacuum freeze to run during the upgrade and remove clog
> files.  These are unusual cases, so I assume there are many people who
> successfully use pg_migrator that I never hear from.
>
> Both of these issues are fixed in pg_migrator 8.4.6, released yesterday.
> I also talked to Alvaro about using fixed oids to remove these migration
> restrictions in 8.5:
>
>         o  a user-defined composite data type
>         o  a user-defined array data type
>         o  a user-defined enum data type
>
> So, pg_migrator is a work in progress.  It current requires migration
> testing and should be used by experienced Postgres users.  Someday
> pg_migrator will be easier to use and more reliable.

Oh, and pg_migrator 8.4.6 works for migrations from 8.4 to 8.5 CVS.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: How to remove non-UTF values from a table?

From
Phoenix Kiula
Date:
On Fri, Dec 18, 2009 at 3:54 AM, Bruce Momjian <bruce@momjian.us> wrote:
> Bruce Momjian wrote:
>> pg_migrator (not pg_upgrade) has been used by many people to migrate
>> from 8.3 to 8.4.  I just helped someone yesterday with a migration.
>> pg_migrator threw an error because they had reindexed pg_largeobject,
>> and pg_migrator was not prepared to handle that.  They also had an old
>> cluster with a higher autovacuum_freeze_max_age than the new cluster,
>> causing autovacuum freeze to run during the upgrade and remove clog
>> files.  These are unusual cases, so I assume there are many people who
>> successfully use pg_migrator that I never hear from.
>>
>> Both of these issues are fixed in pg_migrator 8.4.6, released yesterday.
>> I also talked to Alvaro about using fixed oids to remove these migration
>> restrictions in 8.5:
>>
>>         o  a user-defined composite data type
>>         o  a user-defined array data type
>>         o  a user-defined enum data type
>>
>> So, pg_migrator is a work in progress.  It current requires migration
>> testing and should be used by experienced Postgres users.  Someday
>> pg_migrator will be easier to use and more reliable.
>
> Oh, and pg_migrator 8.4.6 works for migrations from 8.4 to 8.5 CVS.


Is pg_migrator worth trying between 8.2.9 and 8.4.x?

Thanks

Re: How to remove non-UTF values from a table?

From
Bruce Momjian
Date:
Phoenix Kiula wrote:
> On Fri, Dec 18, 2009 at 3:54 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > Bruce Momjian wrote:
> >> pg_migrator (not pg_upgrade) has been used by many people to migrate
> >> from 8.3 to 8.4. ?I just helped someone yesterday with a migration.
> >> pg_migrator threw an error because they had reindexed pg_largeobject,
> >> and pg_migrator was not prepared to handle that. ?They also had an old
> >> cluster with a higher autovacuum_freeze_max_age than the new cluster,
> >> causing autovacuum freeze to run during the upgrade and remove clog
> >> files. ?These are unusual cases, so I assume there are many people who
> >> successfully use pg_migrator that I never hear from.
> >>
> >> Both of these issues are fixed in pg_migrator 8.4.6, released yesterday.
> >> I also talked to Alvaro about using fixed oids to remove these migration
> >> restrictions in 8.5:
> >>
> >> ? ? ? ? o ?a user-defined composite data type
> >> ? ? ? ? o ?a user-defined array data type
> >> ? ? ? ? o ?a user-defined enum data type
> >>
> >> So, pg_migrator is a work in progress. ?It current requires migration
> >> testing and should be used by experienced Postgres users. ?Someday
> >> pg_migrator will be easier to use and more reliable.
> >
> > Oh, and pg_migrator 8.4.6 works for migrations from 8.4 to 8.5 CVS.
>
>
> Is pg_migrator worth trying between 8.2.9 and 8.4.x?

No, it only handles 8.3 and later.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +