Thread: regexp_replace and UTF8

regexp_replace and UTF8

From
"Bart Degryse"
Date:
Hi,
I have a text field with data like this: 'de patiënt niet' (without the quotes).
I would like to convert this string to look like this: 'de patiënt niet'
Basically what I need to do (I think) is
- get rid of the &, # and ;
- convert the number to hex
- make a UTF8 from that (thus: \xEB)
- convert that to SQL_ASCII
Since this works:
  select regexp_replace('de patiënt niet', '&#(\\d+);', '\xEB', 'g')
I was thinking that this would work too, but it doesn't
  select regexp_replace('de patiënt niet', '&#(\\d+);', '\x' || to_hex(E'\\1'), 'g')
It gives me:
  ERROR:  function to_hex("unknown") is not unique
  HINT:  Could not choose a best candidate function. You may need to add explicit type casts.
So I changed it to
  select regexp_replace('de patiënt niet', '&#(\\d+);', '\x' || to_hex(CAST (to_number(E'\\1','999') AS INTEGER)), 'g')
which kind of works except that the result is not what I need:
  'de patix1nt niet'
Can anyone help me fix this or point me to a better approach.
By the way, changing the way data is put into the field is unfortunately not an option.
Many thanks in advance.
 
ICT Departement - Indicator NV
Bart Degryse

Re: regexp_replace and UTF8

From
Gregory Stark
Date:
"Bart Degryse" <Bart.Degryse@indicator.be> writes:

> Hi,
> I have a text field with data like this: 'de patiënt niet'

> Can anyone help me fix this or point me to a better approach.
> By the way, changing the way data is put into the field is
> unfortunately not an option.
> Many thanks in advance.

You could use a plperl function to use one of the many html parsing perl
modules?

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!


Re: regexp_replace and UTF8

From
Harald Fuchs
Date:
In article <87ljstm4eq.fsf@oxford.xeocode.com>,
Gregory Stark <stark@enterprisedb.com> writes:

> "Bart Degryse" <Bart.Degryse@indicator.be> writes:
>> Hi,
>> I have a text field with data like this: 'de patiënt niet'

>> Can anyone help me fix this or point me to a better approach.
>> By the way, changing the way data is put into the field is
>> unfortunately not an option.

> You could use a plperl function to use one of the many html parsing perl
> modules?

Yes, either plperl or some external HTML tool.

>> Basically what I need to do (I think) is
>> - get rid of the &, # and ;
>> - convert the number to hex
>> - make a UTF8 from that (thus: \xEB)
>> - convert that to SQL_ASCII

You know that SQL_ASCII is a misnomer for "no encoding at all, and I
don't care"?  I'd use UTF8 or (if you stay in Western Europe) Latin9.



Re: regexp_replace and UTF8

From
Jasen Betts
Date:
On 2009-01-30, Bart Degryse <Bart.Degryse@indicator.be> wrote:
>
> --=__Part8EA648F8.0__=
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: quoted-printable
>
> Hi,
> I have a text field with data like this: 'de patiënt niet'
> (without the quotes).
> I would like to convert this string to look like this: 'de patiënt
> niet'
> Basically what I need to do (I think) is
> - get rid of the &, # and ;
> - convert the number to hex
> - make a UTF8 from that (thus: \xEB)

that is not UTF8.

the UTF8 representation for 'ë' in  SQL is e'\xC3\xAB' or chr(235)

your input appears to be encoded in LATIN-1 (or possibly 8859-13)


I think you'll need to write a function.

here, where I have database encoding UTF8 this appears to work as you
desire.

CREATE or replace FUNCTION htmlent(inp text) returns text as
$f$
DECLARE str text;
BEGINstr= regexp_replace(quote_literal( inp)                   ,$$&#(\d+);$$        ,$$'||chr(\1)||'$$
,'g');execute'select '||str into str;return str;
 
END
$f$
LANGUAGE PLPGSQL;

select htmlent('de patiënt niet');

probably the above should be expanded to handle named entities 
like '&' too.

characters outside of the LATIN-1 space are not handled 
but aparently this is what you want.



Re: regexp_replace and UTF8

From
"Bart Degryse"
Date:
Thanks for the ideas!
The function Jasen suggests works partially.
There are also entities like
&#8216;
&#8217;
&#8220;
&#8221;
&#8230;
The output of the htmlent function for these looks like |
Was that what you meant with "characters outside of the LATIN-1 space are not handled
but apparently this is what you want." ?
Because in that case, they should be handled too.
How should that be done.
Thanks,
Bart 


>>> Jasen Betts <jasen@xnet.co.nz> 2009-01-31 12:47 >>>
On 2009-01-30, Bart Degryse <Bart.Degryse@indicator.be> wrote:
>
> --=__Part8EA648F8.0__=
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: quoted-printable
>
> Hi,
> I have a text field with data like this: 'de pati&#235;nt niet'
> (without the quotes).
> I would like to convert this string to look like this: 'de patiënt
> niet'
> Basically what I need to do (I think) is
> - get rid of the &, # and ;
> - convert the number to hex
> - make a UTF8 from that (thus: \xEB)
that is not UTF8.

the UTF8 representation for 'ë' in  SQL is e'\xC3\xAB' or chr(235)

your input appears to be encoded in LATIN-1 (or possibly 8859-13)


I think you'll need to write a function.

here, where I have database encoding UTF8 this appears to work as you
desire.

CREATE or replace FUNCTION htmlent(inp text) returns text as
$f$
DECLARE
  str text;
BEGIN
str= regexp_replace(quote_literal( inp)
                    ,$$&#(\d+);$$
    ,$$'||chr(\1)||'$$
    ,'g');
execute 'select '||str into str;
return str;
END
$f$
LANGUAGE PLPGSQL;

select htmlent('de pati&#235;nt niet');

probably the above should be expanded to handle named entities
like '&amp;' too.

characters outside of the LATIN-1 space are not handled
but aparently this is what you want.


--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql

Re: regexp_replace and UTF8

From
Jasen Betts
Date:
On 2009-02-02, Bart Degryse <Bart.Degryse@indicator.be> wrote:
>
> --=__PartF6DE34E1.0__=
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: quoted-printable
>
> Thanks for the ideas!
> The function Jasen suggests works partially.
> There are also entities like
> ‘
> ’
> “
> ”
> …

these work fine here.

jasen=# select htmlent('‘ ’ “ ” …'); htmlent   -----------  ‘ ’ “ ” …  (1 row)  
> The output of the htmlent function for these looks like |

I see a pipe symbol '|' is that what you wrote?

> Was that what you meant with 
>>"characters outside of the LATIN-1 space
>> are not handled but apparently this is what you want." ?

I was under a mistaken impression of the encoding of HTML numeric
entities.

> Because in that case, they should be handled too.
> How should that be done.
> Thanks,

it works here because server_encoding is UTF8;

to check do this.
 show server_encoding;show client_encoding; 
it looks like you's need to convert the numbers to utt-8 in a bytea
type and then use convert to translate then to your preferred encoding
(appears to be win125x)

postgresql8.3 appears to provide no way to generate UTF-8 in a bytea
type so if you can't use a utf-8 encoding for your database you'll
have to write your own.








>>>> Jasen Betts <jasen@xnet.co.nz> 2009-01-31 12:47 >>>
> On 2009-01-30, Bart Degryse <Bart.Degryse@indicator.be> wrote:
>>
>> --=3D__Part8EA648F8.0__=3D
>> Content-Type: text/plain; charset=3DUTF-8
>> Content-Transfer-Encoding: quoted-printable
>>
>> Hi,
>> I have a text field with data like this: 'de patiënt niet'
>> (without the quotes).
>> I would like to convert this string to look like this: 'de pati=C3=ABnt
>> niet'
>> Basically what I need to do (I think) is
>> - get rid of the &, # and ;
>> - convert the number to hex
>> - make a UTF8 from that (thus: \xEB)
> that is not UTF8.
>
> the UTF8 representation for '=C3=AB' in  SQL is e'\xC3\xAB' or chr(235)
>
> your input appears to be encoded in LATIN-1 (or possibly 8859-13)
>
>
> I think you'll need to write a function.
>
> here, where I have database encoding UTF8 this appears to work as you
> desire.
>
> CREATE or replace FUNCTION htmlent(inp text) returns text as
> $f$
> DECLARE
>   str text;
> BEGIN
> str=3D regexp_replace(quote_literal( inp)
>                     ,$$&#(\d+);$$
>     ,$$'||chr(\1)||'$$=20
>     ,'g');
> execute 'select '||str into str;
> return str;
> END
> $f$
> LANGUAGE PLPGSQL;
>
> select htmlent('de patiënt niet');
>
> probably the above should be expanded to handle named entities=20
> like '&' too.
>
> characters outside of the LATIN-1 space are not handled=20
> but aparently this is what you want.
>
>
> --=20
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql
>
> --=__PartF6DE34E1.0__=
> Content-Type: text/html; charset=UTF-8
> Content-Transfer-Encoding: quoted-printable
>
><HTML><HEAD>
><META http-equiv=3DContent-Type content=3D"text/html; charset=3Dutf-8">
><META content=3D"MSHTML 6.00.2800.1264" name=3DGENERATOR></HEAD>
><BODY style=3D"MARGIN: 4px 4px 1px; FONT: 10pt Tahoma">
><DIV>Thanks for the ideas!</DIV>
><DIV>The function Jasen suggests works partially.</DIV>
><DIV>There are also entities like</DIV>
><DIV>&#8216;</DIV>
><DIV>
><DIV>&#8217;</DIV>
><DIV>&#8220;</DIV>
><DIV>&#8221;</DIV>
><DIV>&#8230;</DIV>
><DIV>The output of the htmlent function for these looks like | </DIV>
><DIV>Was that what you meant with "characters outside of the LATIN-1 space =
> are not handled</DIV>
><DIV>but apparently this is what you want." ?</DIV>
><DIV>Because in that case, they should be handled too.</DIV>
><DIV>How should that be done.</DIV>
><DIV>Thanks,</DIV>
><DIV>Bart </DIV><BR><BR>>>> Jasen Betts <jasen@xnet.co.nz&=
> gt; 2009-01-31 12:47 >>><BR>On 2009-01-30, Bart Degryse <Bart.D=
> egryse@indicator.be> wrote:<BR>><BR>> --=3D__Part8EA648F8.0__=3D<B=
> R>> Content-Type: text/plain; charset=3DUTF-8<BR>> Content-Transfer-E=
> ncoding: quoted-printable<BR>><BR>> Hi,<BR>> I have a text field =
> with data like this: 'de pati&#235;nt niet'<BR>> (without the =
> quotes).<BR>> I would like to convert this string to look like this: =
> 'de pati=C3=ABnt<BR>> niet'<BR>> Basically what I need to do (I =
> think) is<BR>> - get rid of the &, # and ;<BR>> - convert the =
> number to hex<BR>> - make a UTF8 from that (thus: \xEB)<BR>that is not =
> UTF8.<BR><BR>the UTF8 representation for '=C3=AB' in  SQL is =
> e'\xC3\xAB' or chr(235)<BR><BR>your input appears to be encoded in LATIN-1 =
> (or possibly 8859-13)<BR><BR><BR>I think you'll need to write a function.<B=
> R><BR>here, where I have database encoding UTF8 this appears to work as =
> you<BR>desire.<BR><BR>CREATE or replace FUNCTION htmlent(inp text) returns =
> text as<BR>$f$<BR>DECLARE<BR>  str text;<BR>BEGIN<BR>str=3D regexp_rep=
> lace(quote_literal( inp)<BR>        =
> ;            =
> ,$$&#(\d+);$$<BR>    ,$$'||chr(\1)||'$$ <BR>  =
>   ,'g');<BR>execute 'select '||str into str;<BR>return str;<BR>END<BR>=
> $f$<BR>LANGUAGE PLPGSQL;<BR><BR>select htmlent('de pati&#235;nt =
> niet');<BR><BR>probably the above should be expanded to handle named =
> entities <BR>like '&amp;' too.<BR><BR>characters outside of the =
> LATIN-1 space are not handled <BR>but aparently this is what you want.<BR><=
> BR><BR>-- <BR>Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)<BR=
>>To make changes to your subscription:<BR><A href=3D"http://www.postgresql.=
> org/mailpref/pgsql-sql">http://www.postgresql.org/mailpref/pgsql-sql</A><BR=
>></DIV></BODY></HTML>
>
> --=__PartF6DE34E1.0__=--
>