Re: regexp_replace and UTF8 - Mailing list pgsql-sql

From Jasen Betts
Subject Re: regexp_replace and UTF8
Date
Msg-id gmbotc$2je$1@reversiblemaps.ath.cx
Whole thread Raw
In response to regexp_replace and UTF8  ("Bart Degryse" <Bart.Degryse@indicator.be>)
List pgsql-sql
On 2009-02-02, Bart Degryse <Bart.Degryse@indicator.be> wrote:
>
> --=__PartF6DE34E1.0__=
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: quoted-printable
>
> Thanks for the ideas!
> The function Jasen suggests works partially.
> There are also entities like
> ‘
> ’
> “
> ”
> …

these work fine here.

jasen=# select htmlent('‘ ’ “ ” …'); htmlent   -----------  ‘ ’ “ ” …  (1 row)  
> The output of the htmlent function for these looks like |

I see a pipe symbol '|' is that what you wrote?

> Was that what you meant with 
>>"characters outside of the LATIN-1 space
>> are not handled but apparently this is what you want." ?

I was under a mistaken impression of the encoding of HTML numeric
entities.

> Because in that case, they should be handled too.
> How should that be done.
> Thanks,

it works here because server_encoding is UTF8;

to check do this.
 show server_encoding;show client_encoding; 
it looks like you's need to convert the numbers to utt-8 in a bytea
type and then use convert to translate then to your preferred encoding
(appears to be win125x)

postgresql8.3 appears to provide no way to generate UTF-8 in a bytea
type so if you can't use a utf-8 encoding for your database you'll
have to write your own.








>>>> Jasen Betts <jasen@xnet.co.nz> 2009-01-31 12:47 >>>
> On 2009-01-30, Bart Degryse <Bart.Degryse@indicator.be> wrote:
>>
>> --=3D__Part8EA648F8.0__=3D
>> Content-Type: text/plain; charset=3DUTF-8
>> Content-Transfer-Encoding: quoted-printable
>>
>> Hi,
>> I have a text field with data like this: 'de patiënt niet'
>> (without the quotes).
>> I would like to convert this string to look like this: 'de pati=C3=ABnt
>> niet'
>> Basically what I need to do (I think) is
>> - get rid of the &, # and ;
>> - convert the number to hex
>> - make a UTF8 from that (thus: \xEB)
> that is not UTF8.
>
> the UTF8 representation for '=C3=AB' in  SQL is e'\xC3\xAB' or chr(235)
>
> your input appears to be encoded in LATIN-1 (or possibly 8859-13)
>
>
> I think you'll need to write a function.
>
> here, where I have database encoding UTF8 this appears to work as you
> desire.
>
> CREATE or replace FUNCTION htmlent(inp text) returns text as
> $f$
> DECLARE
>   str text;
> BEGIN
> str=3D regexp_replace(quote_literal( inp)
>                     ,$$&#(\d+);$$
>     ,$$'||chr(\1)||'$$=20
>     ,'g');
> execute 'select '||str into str;
> return str;
> END
> $f$
> LANGUAGE PLPGSQL;
>
> select htmlent('de patiënt niet');
>
> probably the above should be expanded to handle named entities=20
> like '&' too.
>
> characters outside of the LATIN-1 space are not handled=20
> but aparently this is what you want.
>
>
> --=20
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql
>
> --=__PartF6DE34E1.0__=
> Content-Type: text/html; charset=UTF-8
> Content-Transfer-Encoding: quoted-printable
>
><HTML><HEAD>
><META http-equiv=3DContent-Type content=3D"text/html; charset=3Dutf-8">
><META content=3D"MSHTML 6.00.2800.1264" name=3DGENERATOR></HEAD>
><BODY style=3D"MARGIN: 4px 4px 1px; FONT: 10pt Tahoma">
><DIV>Thanks for the ideas!</DIV>
><DIV>The function Jasen suggests works partially.</DIV>
><DIV>There are also entities like</DIV>
><DIV>&#8216;</DIV>
><DIV>
><DIV>&#8217;</DIV>
><DIV>&#8220;</DIV>
><DIV>&#8221;</DIV>
><DIV>&#8230;</DIV>
><DIV>The output of the htmlent function for these looks like | </DIV>
><DIV>Was that what you meant with "characters outside of the LATIN-1 space =
> are not handled</DIV>
><DIV>but apparently this is what you want." ?</DIV>
><DIV>Because in that case, they should be handled too.</DIV>
><DIV>How should that be done.</DIV>
><DIV>Thanks,</DIV>
><DIV>Bart </DIV><BR><BR>>>> Jasen Betts <jasen@xnet.co.nz&=
> gt; 2009-01-31 12:47 >>><BR>On 2009-01-30, Bart Degryse <Bart.D=
> egryse@indicator.be> wrote:<BR>><BR>> --=3D__Part8EA648F8.0__=3D<B=
> R>> Content-Type: text/plain; charset=3DUTF-8<BR>> Content-Transfer-E=
> ncoding: quoted-printable<BR>><BR>> Hi,<BR>> I have a text field =
> with data like this: 'de pati&#235;nt niet'<BR>> (without the =
> quotes).<BR>> I would like to convert this string to look like this: =
> 'de pati=C3=ABnt<BR>> niet'<BR>> Basically what I need to do (I =
> think) is<BR>> - get rid of the &, # and ;<BR>> - convert the =
> number to hex<BR>> - make a UTF8 from that (thus: \xEB)<BR>that is not =
> UTF8.<BR><BR>the UTF8 representation for '=C3=AB' in  SQL is =
> e'\xC3\xAB' or chr(235)<BR><BR>your input appears to be encoded in LATIN-1 =
> (or possibly 8859-13)<BR><BR><BR>I think you'll need to write a function.<B=
> R><BR>here, where I have database encoding UTF8 this appears to work as =
> you<BR>desire.<BR><BR>CREATE or replace FUNCTION htmlent(inp text) returns =
> text as<BR>$f$<BR>DECLARE<BR>  str text;<BR>BEGIN<BR>str=3D regexp_rep=
> lace(quote_literal( inp)<BR>        =
> ;            =
> ,$$&#(\d+);$$<BR>    ,$$'||chr(\1)||'$$ <BR>  =
>   ,'g');<BR>execute 'select '||str into str;<BR>return str;<BR>END<BR>=
> $f$<BR>LANGUAGE PLPGSQL;<BR><BR>select htmlent('de pati&#235;nt =
> niet');<BR><BR>probably the above should be expanded to handle named =
> entities <BR>like '&amp;' too.<BR><BR>characters outside of the =
> LATIN-1 space are not handled <BR>but aparently this is what you want.<BR><=
> BR><BR>-- <BR>Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)<BR=
>>To make changes to your subscription:<BR><A href=3D"http://www.postgresql.=
> org/mailpref/pgsql-sql">http://www.postgresql.org/mailpref/pgsql-sql</A><BR=
>></DIV></BODY></HTML>
>
> --=__PartF6DE34E1.0__=--
>



pgsql-sql by date:

Previous
From: Nikhil teltia
Date:
Subject: Function Returning a Set of Composite Value
Next
From: "F."
Date:
Subject: postgre2postgre