Thread: regexp_replace and UTF8
Hi,
I have a text field with data like this: 'de patiënt niet' (without the quotes).
I would like to convert this string to look like this: 'de patiënt niet'
Basically what I need to do (I think) is
- get rid of the &, # and ;
- convert the number to hex
- make a UTF8 from that (thus: \xEB)
- convert that to SQL_ASCII
Since this works:
select regexp_replace('de patiënt niet', '&#(\\d+);', '\xEB', 'g')
I was thinking that this would work too, but it doesn't
select regexp_replace('de patiënt niet', '&#(\\d+);', '\x' || to_hex(E'\\1'), 'g')
It gives me:
It gives me:
ERROR: function to_hex("unknown") is not unique
HINT: Could not choose a best candidate function. You may need to add explicit type casts.
So I changed it to
HINT: Could not choose a best candidate function. You may need to add explicit type casts.
So I changed it to
select regexp_replace('de patiënt niet', '&#(\\d+);', '\x' || to_hex(CAST (to_number(E'\\1','999') AS INTEGER)), 'g')
which kind of works except that the result is not what I need:
'de patix1nt niet'
Can anyone help me fix this or point me to a better approach.
By the way, changing the way data is put into the field is unfortunately not an option.
Many thanks in advance.
ICT Departement - Indicator NV
Bart Degryse
Bart Degryse
"Bart Degryse" <Bart.Degryse@indicator.be> writes: > Hi, > I have a text field with data like this: 'de patiënt niet' > Can anyone help me fix this or point me to a better approach. > By the way, changing the way data is put into the field is > unfortunately not an option. > Many thanks in advance. You could use a plperl function to use one of the many html parsing perl modules? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
In article <87ljstm4eq.fsf@oxford.xeocode.com>, Gregory Stark <stark@enterprisedb.com> writes: > "Bart Degryse" <Bart.Degryse@indicator.be> writes: >> Hi, >> I have a text field with data like this: 'de patiënt niet' >> Can anyone help me fix this or point me to a better approach. >> By the way, changing the way data is put into the field is >> unfortunately not an option. > You could use a plperl function to use one of the many html parsing perl > modules? Yes, either plperl or some external HTML tool. >> Basically what I need to do (I think) is >> - get rid of the &, # and ; >> - convert the number to hex >> - make a UTF8 from that (thus: \xEB) >> - convert that to SQL_ASCII You know that SQL_ASCII is a misnomer for "no encoding at all, and I don't care"? I'd use UTF8 or (if you stay in Western Europe) Latin9.
On 2009-01-30, Bart Degryse <Bart.Degryse@indicator.be> wrote: > > --=__Part8EA648F8.0__= > Content-Type: text/plain; charset=UTF-8 > Content-Transfer-Encoding: quoted-printable > > Hi, > I have a text field with data like this: 'de patiënt niet' > (without the quotes). > I would like to convert this string to look like this: 'de patiënt > niet' > Basically what I need to do (I think) is > - get rid of the &, # and ; > - convert the number to hex > - make a UTF8 from that (thus: \xEB) that is not UTF8. the UTF8 representation for 'ë' in SQL is e'\xC3\xAB' or chr(235) your input appears to be encoded in LATIN-1 (or possibly 8859-13) I think you'll need to write a function. here, where I have database encoding UTF8 this appears to work as you desire. CREATE or replace FUNCTION htmlent(inp text) returns text as $f$ DECLARE str text; BEGINstr= regexp_replace(quote_literal( inp) ,$$(\d+);$$ ,$$'||chr(\1)||'$$ ,'g');execute'select '||str into str;return str; END $f$ LANGUAGE PLPGSQL; select htmlent('de patiënt niet'); probably the above should be expanded to handle named entities like '&' too. characters outside of the LATIN-1 space are not handled but aparently this is what you want.
Thanks for the ideas!
The function Jasen suggests works partially.
There are also entities like
‘
’
“
”
…
The output of the htmlent function for these looks like |
Was that what you meant with "characters outside of the LATIN-1 space are not handled
but apparently this is what you want." ?
Because in that case, they should be handled too.
How should that be done.
Thanks,
Bart
>>> Jasen Betts <jasen@xnet.co.nz> 2009-01-31 12:47 >>>
On 2009-01-30, Bart Degryse <Bart.Degryse@indicator.be> wrote:
>
> --=__Part8EA648F8.0__=
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: quoted-printable
>
> Hi,
> I have a text field with data like this: 'de patiënt niet'
> (without the quotes).
> I would like to convert this string to look like this: 'de patiënt
> niet'
> Basically what I need to do (I think) is
> - get rid of the &, # and ;
> - convert the number to hex
> - make a UTF8 from that (thus: \xEB)
that is not UTF8.
the UTF8 representation for 'ë' in SQL is e'\xC3\xAB' or chr(235)
your input appears to be encoded in LATIN-1 (or possibly 8859-13)
I think you'll need to write a function.
here, where I have database encoding UTF8 this appears to work as you
desire.
CREATE or replace FUNCTION htmlent(inp text) returns text as
$f$
DECLARE
str text;
BEGIN
str= regexp_replace(quote_literal( inp)
,$$&#(\d+);$$
,$$'||chr(\1)||'$$
,'g');
execute 'select '||str into str;
return str;
END
$f$
LANGUAGE PLPGSQL;
select htmlent('de patiënt niet');
probably the above should be expanded to handle named entities
like '&' too.
characters outside of the LATIN-1 space are not handled
but aparently this is what you want.
--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql
On 2009-02-02, Bart Degryse <Bart.Degryse@indicator.be> wrote: > > --=__PartF6DE34E1.0__= > Content-Type: text/plain; charset=UTF-8 > Content-Transfer-Encoding: quoted-printable > > Thanks for the ideas! > The function Jasen suggests works partially. > There are also entities like > ‘ > ’ > “ > ” > … these work fine here. jasen=# select htmlent('‘ ’ “ ” …'); htmlent ----------- ‘ ’ “ ” … (1 row) > The output of the htmlent function for these looks like | I see a pipe symbol '|' is that what you wrote? > Was that what you meant with >>"characters outside of the LATIN-1 space >> are not handled but apparently this is what you want." ? I was under a mistaken impression of the encoding of HTML numeric entities. > Because in that case, they should be handled too. > How should that be done. > Thanks, it works here because server_encoding is UTF8; to check do this. show server_encoding;show client_encoding; it looks like you's need to convert the numbers to utt-8 in a bytea type and then use convert to translate then to your preferred encoding (appears to be win125x) postgresql8.3 appears to provide no way to generate UTF-8 in a bytea type so if you can't use a utf-8 encoding for your database you'll have to write your own. >>>> Jasen Betts <jasen@xnet.co.nz> 2009-01-31 12:47 >>> > On 2009-01-30, Bart Degryse <Bart.Degryse@indicator.be> wrote: >> >> --=3D__Part8EA648F8.0__=3D >> Content-Type: text/plain; charset=3DUTF-8 >> Content-Transfer-Encoding: quoted-printable >> >> Hi, >> I have a text field with data like this: 'de patiënt niet' >> (without the quotes). >> I would like to convert this string to look like this: 'de pati=C3=ABnt >> niet' >> Basically what I need to do (I think) is >> - get rid of the &, # and ; >> - convert the number to hex >> - make a UTF8 from that (thus: \xEB) > that is not UTF8. > > the UTF8 representation for '=C3=AB' in SQL is e'\xC3\xAB' or chr(235) > > your input appears to be encoded in LATIN-1 (or possibly 8859-13) > > > I think you'll need to write a function. > > here, where I have database encoding UTF8 this appears to work as you > desire. > > CREATE or replace FUNCTION htmlent(inp text) returns text as > $f$ > DECLARE > str text; > BEGIN > str=3D regexp_replace(quote_literal( inp) > ,$$(\d+);$$ > ,$$'||chr(\1)||'$$=20 > ,'g'); > execute 'select '||str into str; > return str; > END > $f$ > LANGUAGE PLPGSQL; > > select htmlent('de patiënt niet'); > > probably the above should be expanded to handle named entities=20 > like '&' too. > > characters outside of the LATIN-1 space are not handled=20 > but aparently this is what you want. > > > --=20 > Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-sql > > --=__PartF6DE34E1.0__= > Content-Type: text/html; charset=UTF-8 > Content-Transfer-Encoding: quoted-printable > ><HTML><HEAD> ><META http-equiv=3DContent-Type content=3D"text/html; charset=3Dutf-8"> ><META content=3D"MSHTML 6.00.2800.1264" name=3DGENERATOR></HEAD> ><BODY style=3D"MARGIN: 4px 4px 1px; FONT: 10pt Tahoma"> ><DIV>Thanks for the ideas!</DIV> ><DIV>The function Jasen suggests works partially.</DIV> ><DIV>There are also entities like</DIV> ><DIV>‘</DIV> ><DIV> ><DIV>’</DIV> ><DIV>“</DIV> ><DIV>”</DIV> ><DIV>…</DIV> ><DIV>The output of the htmlent function for these looks like | </DIV> ><DIV>Was that what you meant with "characters outside of the LATIN-1 space = > are not handled</DIV> ><DIV>but apparently this is what you want." ?</DIV> ><DIV>Because in that case, they should be handled too.</DIV> ><DIV>How should that be done.</DIV> ><DIV>Thanks,</DIV> ><DIV>Bart </DIV><BR><BR>>>> Jasen Betts <jasen@xnet.co.nz&= > gt; 2009-01-31 12:47 >>><BR>On 2009-01-30, Bart Degryse <Bart.D= > egryse@indicator.be> wrote:<BR>><BR>> --=3D__Part8EA648F8.0__=3D<B= > R>> Content-Type: text/plain; charset=3DUTF-8<BR>> Content-Transfer-E= > ncoding: quoted-printable<BR>><BR>> Hi,<BR>> I have a text field = > with data like this: 'de patiënt niet'<BR>> (without the = > quotes).<BR>> I would like to convert this string to look like this: = > 'de pati=C3=ABnt<BR>> niet'<BR>> Basically what I need to do (I = > think) is<BR>> - get rid of the &, # and ;<BR>> - convert the = > number to hex<BR>> - make a UTF8 from that (thus: \xEB)<BR>that is not = > UTF8.<BR><BR>the UTF8 representation for '=C3=AB' in SQL is = > e'\xC3\xAB' or chr(235)<BR><BR>your input appears to be encoded in LATIN-1 = > (or possibly 8859-13)<BR><BR><BR>I think you'll need to write a function.<B= > R><BR>here, where I have database encoding UTF8 this appears to work as = > you<BR>desire.<BR><BR>CREATE or replace FUNCTION htmlent(inp text) returns = > text as<BR>$f$<BR>DECLARE<BR> str text;<BR>BEGIN<BR>str=3D regexp_rep= > lace(quote_literal( inp)<BR>  = > ; = > ,$$&#(\d+);$$<BR> ,$$'||chr(\1)||'$$ <BR> = > ,'g');<BR>execute 'select '||str into str;<BR>return str;<BR>END<BR>= > $f$<BR>LANGUAGE PLPGSQL;<BR><BR>select htmlent('de patiënt = > niet');<BR><BR>probably the above should be expanded to handle named = > entities <BR>like '&' too.<BR><BR>characters outside of the = > LATIN-1 space are not handled <BR>but aparently this is what you want.<BR><= > BR><BR>-- <BR>Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)<BR= >>To make changes to your subscription:<BR><A href=3D"http://www.postgresql.= > org/mailpref/pgsql-sql">http://www.postgresql.org/mailpref/pgsql-sql</A><BR= >></DIV></BODY></HTML> > > --=__PartF6DE34E1.0__=-- >