Re: regexp_replace and UTF8 - Mailing list pgsql-sql

From Jasen Betts
Subject Re: regexp_replace and UTF8
Date
Msg-id gm1doi$914$1@reversiblemaps.ath.cx
Whole thread Raw
In response to regexp_replace and UTF8  ("Bart Degryse" <Bart.Degryse@indicator.be>)
Responses Re: regexp_replace and UTF8  ("Bart Degryse" <Bart.Degryse@indicator.be>)
List pgsql-sql
On 2009-01-30, Bart Degryse <Bart.Degryse@indicator.be> wrote:
>
> --=__Part8EA648F8.0__=
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: quoted-printable
>
> Hi,
> I have a text field with data like this: 'de patiënt niet'
> (without the quotes).
> I would like to convert this string to look like this: 'de patiënt
> niet'
> Basically what I need to do (I think) is
> - get rid of the &, # and ;
> - convert the number to hex
> - make a UTF8 from that (thus: \xEB)

that is not UTF8.

the UTF8 representation for 'ë' in  SQL is e'\xC3\xAB' or chr(235)

your input appears to be encoded in LATIN-1 (or possibly 8859-13)


I think you'll need to write a function.

here, where I have database encoding UTF8 this appears to work as you
desire.

CREATE or replace FUNCTION htmlent(inp text) returns text as
$f$
DECLARE str text;
BEGINstr= regexp_replace(quote_literal( inp)                   ,$$&#(\d+);$$        ,$$'||chr(\1)||'$$
,'g');execute'select '||str into str;return str;
 
END
$f$
LANGUAGE PLPGSQL;

select htmlent('de patiënt niet');

probably the above should be expanded to handle named entities 
like '&' too.

characters outside of the LATIN-1 space are not handled 
but aparently this is what you want.



pgsql-sql by date:

Previous
From: Craig Ringer
Date:
Subject: Re: dynamic OUT parameters?
Next
From: gherzig@fmed.uba.ar
Date:
Subject: Re: dynamic OUT parameters?