Re: How to find freak UTF-8 character? - Mailing list pgsql-general

From pasman pasmański
Subject Re: How to find freak UTF-8 character?
Msg-id CAOWY8=ZXtzGrshy8Fe3v3RD0TwFsQiq_eSCY+3HWTt5EbN9irw@mail.gmail.com
In response to Re: How to find freak UTF-8 character?  (Leif Biberg Kristensen <leif@solumslekt.org>)
Responses Re: How to find freak UTF-8 character?  (Leif Biberg Kristensen <leif@solumslekt.org>)
List pgsql-general
It's simple to remove strange chars with regexp_replace.
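
For anyone doing the cleanup outside the database, here is a minimal sketch of the same idea in Python. It assumes the offending character is U+200E (LEFT-TO-RIGHT MARK) or U+200F (RIGHT-TO-LEFT MARK) and that the export target is ISO-8859-1. The function names and the sample string are made up for illustration and are not taken from Leif's schema or export script.

```python
import re

# Bidirectional control characters that most terminals render as nothing:
# U+200E is the left-to-right mark, U+200F the right-to-left mark.
BIDI_MARKS = re.compile("[\u200e\u200f]")


def find_unencodable(text, encoding="iso-8859-1"):
    """Return (index, character) pairs that cannot be encoded in `encoding`."""
    hits = []
    for index, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            hits.append((index, char))
    return hits


def strip_bidi_marks(text):
    """Remove left-to-right and right-to-left marks from `text`."""
    return BIDI_MARKS.sub("", text)


# Hypothetical sample modelled on the pasted name, with the marks written
# as escapes so they are visible in the source.
sample = "Aslaug Steinarsdotter Fj\u00e5gesund  \u200e(I2914)\u200e"
cleaned = strip_bidi_marks(sample)
```

Running find_unencodable() over a column's values first shows which rows would break a Latin-1 export. The strip step is the Python counterpart of a global regexp_replace with the same character class.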

2011/10/1, Leif Biberg Kristensen <leif@solumslekt.org>:
> On Saturday 1. October 2011 21.29.45 Andrew Sullivan wrote:
>> I see you found it, but note that it's _not_ a spurious UTF-8
>> character: it's a right-to-left mark, and is a perfectly ok UTF-8 code
>> point.
>
> Andrew,
> thank you for your reply. Yes I know that this is a perfectly legal UTF-8
> character. It crept into my database as a result of a copy-and-paste job
> from a web site. The point is that it doesn't have a counterpart in
> ISO-8859-1 to which I regularly have to export the data.
>
> The offending character came from this URL:
> <http://www.soge.kviteseid.no/individual.php?pid=I2914&ged=Kviteseid.GED&tab=0>
>
> and the text that I copied and pasted from the page looks like this in the
> source code:
>
> Aslaug Steinarsdotter Fjågesund  &lrm;(I2914)&lrm;
>
> I'm going to write to the webmaster of the site and ask why that character,
> represented in the HTML as the &lrm; entity, has to appear in a Norwegian
> web site which never should have to display text in anything but
> left-to-right order.
>
>> If you need a subset of the UTF-8 character set, you want to make sure
>> you have some sort of constraint in your application or your database
>> that prevents insertion of just anything at all in UTF-8.  This is a need
>> people often forget when working in an internationalized setting,
>> because there's a lot of crap that comes from the client side in a
>> UTF-8 setting that might not come in other settings (like LATIN1).
>
> I don't want any constraint of that sort. I'm perfectly happy with UTF-8.
> And now that I've found out how to spot problematic characters that will
> crash my export script, it's really not an issue anymore. The character
> didn't print in either psql or my PHP frontend, so I just removed the
> problematic text and re-entered it by hand. Problem solved.
>
> But thank you for the idea, I think that I will strip out at least any
> &lrm; entities from text entered into the database.
>
> By the way, is there a setting in psql that will output unprintable
> characters as question marks or something?
>
> regards, Leif.


--
------------
pasman
