Thread: LC_CTYPE and matching accented chars
Trying to understand how I can get Pg 8.2 to match 'martín' when I search for 'martin', and going through the documentation, specially http://www.postgresql.org/docs/8.2/static/locale.html http://www.postgresql.org/docs/8.1/static/functions-matching.html Here is a transcript of my commandline - I suspect I am quite close, but something is missing. Hints _really_ appreciated! # do I have a Spanish locale on this box? locale -a # no I don't - let's make one sudo locale-gen es_AR.utf8 # check that I have it locale -a ... sudo pg_createcluster --locale=es_AR.utf8 8.2 arg sudo pg_ctlcluster 8.2 arg start sudo -u postgres createdb --cluster 8.2/arg testing sudo -u postgres psql --cluster 8.2/arg testing # in psql... =# create table test (value text); CREATE TABLE =# insert into test values ('martín'); INSERT 0 1 # this is apparently the right way to # select base character based on the "equivalence class" # as defined in the LC_CTYPE =# select * from test where value ~ 'mart[=i=]n'; value ------- (0 rows) # grumble - double check =# SHOW LC_CTYPE; lc_ctype ------------ es_AR.utf8 Any hints in general? - How can I read the LC_CTYPE definition? IOWs locale-gen has created a binary blob in /usr/lib/locale/es_AR.utf8/LC_CTYPE but I sure there is a human-readable definition somewhere... - Is that the correct syntax? - Pg bug? I doubt many people are counting on this behaviour... as it seems to be widely announced to be brittle and slow. cheers, m -- ----------------------------------------------------------------------- Martin @ Catalyst .Net .NZ Ltd, PO Box 11-053, Manners St, Wellington WEB: http://catalyst.net.nz/ PHYS: Level 2, 150-154 Willis St OFFICE: +64(4)916-7224 UK: 0845 868 5733 ext 7224 MOB: +64(21)364-017 Make things as simple as possible, but no simpler - Einstein -----------------------------------------------------------------------
Martin Langhoff wrote: > Trying to understand how I can get Pg 8.2 to match 'martín' when I > search for 'martin', and going through the documentation, specially > > http://www.postgresql.org/docs/8.2/static/locale.html > http://www.postgresql.org/docs/8.1/static/functions-matching.html > > Here is a transcript of my commandline - I suspect I am quite close, but > something is missing. Hints _really_ appreciated! > =# create table test (value text); > CREATE TABLE > =# insert into test values ('martín'); > INSERT 0 1 > > # this is apparently the right way to > # select base character based on the "equivalence class" > # as defined in the LC_CTYPE > =# select * from test where value ~ 'mart[=i=]n'; I think it would be much easier if you did something like select * from test where lower(to_ascii(value)) = lower(to_ascii('martín')); When to_ascii doesn't work (for example because it doesn't work in UTF8) you may want to use convert() to recode the text to latin1 or latin9. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote: > Martin Langhoff wrote: >> # this is apparently the right way to >> # select base character based on the "equivalence class" >> # as defined in the LC_CTYPE >> =# select * from test where value ~ 'mart[=i=]n'; > > I think it would be much easier if you did something like > > select * from test where lower(to_ascii(value)) = lower(to_ascii('martín')); > > When to_ascii doesn't work (for example because it doesn't work in UTF8) > you may want to use convert() to recode the text to latin1 or latin9. Well, with the example above to_ascii doesn't work. select to_ascii(value) from test ; ERROR: encoding conversion from UTF8 to ASCII not supported And neither does convert select convert(value using utf8_to_ascii) from test ; ERROR: character 0xc3 of encoding "MULE_INTERNAL" has no equivalent in "SQL_ASCII" select convert('martín' using utf8_to_ascii) from test ; ERROR: character 0xc3 of encoding "MULE_INTERNAL" has no equivalent in "SQL_ASCII" Any other alternatives? BTW, this is on Pg8.2 as seen on Ubuntu Feisty PPC. $ dpkg -l postgresql-8.2 ii postgresql-8.2 8.2.4-0ubuntu0.7.04 cheers, martín -- ----------------------------------------------------------------------- Martin @ Catalyst .Net .NZ Ltd, PO Box 11-053, Manners St, Wellington WEB: http://catalyst.net.nz/ PHYS: Level 2, 150-154 Willis St OFFICE: +64(4)916-7224 UK: 0845 868 5733 ext 7224 MOB: +64(21)364-017 Make things as simple as possible, but no simpler - Einstein -----------------------------------------------------------------------
On Wed, Jun 27, 2007 at 09:28:24AM +1200, Martin Langhoff wrote: > Alvaro Herrera wrote: > > I think it would be much easier if you did something like > > > > select * from test where lower(to_ascii(value)) = lower(to_ascii('martín')); > > > > When to_ascii doesn't work (for example because it doesn't work in UTF8) > > you may want to use convert() to recode the text to latin1 or latin9. > > Well, with the example above to_ascii doesn't work. > > select to_ascii(value) from test ; > ERROR: encoding conversion from UTF8 to ASCII not supported > > And neither does convert > > select convert(value using utf8_to_ascii) from test ; > ERROR: character 0xc3 of encoding "MULE_INTERNAL" has no equivalent > in "SQL_ASCII" As Alvaro suggested, try converting to latin1 or latin9 and then calling to_ascii: select 'martin' = to_ascii(convert('martín', 'latin1'), 'latin1'); ?column? ---------- t (1 row) For other possibilities search the list archives for examples of "unaccent" functions that normalize text to NFD (Unicode Normalization Form D) and remove nonspacing marks. Here's a message with a couple of PL/Perl functions: http://archives.postgresql.org/pgsql-general/2007-01/msg00702.php -- Michael Fuhr
Alvaro Herrera wrote: > Martin Langhoff wrote: >> # this is apparently the right way to >> # select base character based on the "equivalence class" >> # as defined in the LC_CTYPE >> =# select * from test where value ~ 'mart[=i=]n'; > > I think it would be much easier if you did something like > > select * from test where lower(to_ascii(value)) = lower(to_ascii('martín')); > > When to_ascii doesn't work (for example because it doesn't work in UTF8) > you may want to use convert() to recode the text to latin1 or latin9. Well, with the example above to_ascii doesn't work. select to_ascii(value) from test ; ERROR: encoding conversion from UTF8 to ASCII not supported And neither does convert select convert(value using utf8_to_ascii) from test ; ERROR: character 0xc3 of encoding "MULE_INTERNAL" has no equivalent in "SQL_ASCII" select convert('martín' using utf8_to_ascii) from test ; ERROR: character 0xc3 of encoding "MULE_INTERNAL" has no equivalent in "SQL_ASCII" Any other alternatives? BTW, this is on Pg8.2 as seen on Ubuntu Feisty PPC. $ dpkg -l postgresql-8.2 ii postgresql-8.2 8.2.4-0ubuntu0.7.04 cheers, martín -- ----------------------------------------------------------------------- Martin @ Catalyst .Net .NZ Ltd, PO Box 11-053, Manners St, Wellington WEB: http://catalyst.net.nz/ PHYS: Level 2, 150-154 Willis St OFFICE: +64(4)916-7224 UK: 0845 868 5733 ext 7224 MOB: +64(21)364-017 Make things as simple as possible, but no simpler - Einstein -----------------------------------------------------------------------