Re: UTF-8 and =, LIKE problems - Mailing list pgsql-general

From Michael Glaesemann
Subject Re: UTF-8 and =, LIKE problems
Date
Msg-id 53FE8566-2E1C-11D9-9FAD-000A95C88220@myrealbox.com
In response to UTF-8 and =, LIKE problems  (Edmund Lian <elian@inbrief.net>)
List pgsql-general
On Nov 4, 2004, at 1:24 PM, Edmund Lian wrote:

> I am running a web-based accounting package (SQL-Ledger) that supports
> multiple languages on PostgreSQL. When a database encoding is set to
> Unicode, multilingual operation is possible.
>

<snip />

> Semantically, one might expect U+FF17 U+FF19 to be identical to U+0037
> U+0039, but of course they aren't if a simple-minded byte-by-byte or
> character-by-character comparison is done.
>
> In the ideal case, one would probably want to convert all full width
> chars to their half width equivalents because the numbers look weird
> on the screen (e.g., "7 9  B r i s b a n e  S t r e e t" instead of
> "79 Brisbane Street"). Is there any way to get PostgreSQL to do so?
>
> Failing this, is there any way to get PostgreSQL to be a bit smarter
> in doing comparisons? I think I'm SOL, but I thought I'd ask anyway.

I've thought this would be a useful addition to PostgreSQL, but
currently I think it's best handled in the application layer. A brief
glance at the SQL-Ledger homepage shows that it's written in Perl. I'm
still in the early learning stages of Perl (heck, I'm in the early
learning stages of nearly everything), but I'd assume with Perl's good
Unicode support there should be a way to do this, similar to PHP's
mb_convert_kana (which handles much more than just kana, btw). Ideally,
I'd think you'd want to store all numbers and Latin characters as
single-width characters, so you'd filter them before they enter the
database.
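
To make the filtering idea concrete, here is a rough sketch in Python
rather than Perl (my choice for illustration only; SQL-Ledger itself is
Perl). It relies on the fact that the full-width forms U+FF01 through
U+FF5E sit at a fixed offset of 0xFEE0 from ASCII U+0021 through U+007E,
and it also maps the ideographic space U+3000 to an ordinary space. The
function name double_to_single is hypothetical, and the sketch
deliberately leaves kana and everything else untouched.

```python
# Hypothetical helper: fold full-width ASCII variants to ordinary ASCII.
FULLWIDTH_FIRST = 0xFF01   # FULLWIDTH EXCLAMATION MARK
FULLWIDTH_LAST = 0xFF5E    # FULLWIDTH TILDE
OFFSET = 0xFEE0            # U+FF01 - U+0021

# Translation table: code point -> code point.
_TABLE = {cp: cp - OFFSET for cp in range(FULLWIDTH_FIRST, FULLWIDTH_LAST + 1)}
_TABLE[0x3000] = 0x20      # IDEOGRAPHIC SPACE -> SPACE


def double_to_single(text: str) -> str:
    """Return text with full-width ASCII variants replaced by ASCII."""
    return text.translate(_TABLE)
```

Applied to the example from the original question, U+FF17 U+FF19 comes
back as the plain digits U+0037 U+0039, so the value would compare equal
to "79" once it has been filtered on the way into the database.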

I'd think this might be best placed in the SQL-Ledger code, though you
might be able to fashion a plperl function that would do the same
thing. You could either update all entries (UPDATE foo SET bar =
double_to_single(bar)) or make a functional index on
double_to_single(bar).
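
Here is a rough sketch of the two options side by side. Note the
assumptions: I'm using Python's built-in sqlite3 module as a stand-in,
not PostgreSQL or plperl, and the table foo, column bar, and function
double_to_single are just the placeholder names from above. SQLite only
allows a user-defined function in an index if it is registered as
deterministic; as far as I know the PostgreSQL counterpart is that the
function must be declared IMMUTABLE.

```python
import sqlite3


def double_to_single(text):
    """Hypothetical filter: full-width ASCII variants -> ordinary ASCII."""
    if text is None:
        return None
    table = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}
    table[0x3000] = 0x20  # IDEOGRAPHIC SPACE -> SPACE
    return text.translate(table)


conn = sqlite3.connect(":memory:")
# deterministic=True is required before the function can appear in an index.
conn.create_function("double_to_single", 1, double_to_single, deterministic=True)

conn.execute("CREATE TABLE foo (bar TEXT)")
conn.executemany(
    "INSERT INTO foo (bar) VALUES (?)",
    [("\uff17\uff19 Brisbane Street",), ("79 Brisbane Street",)],
)
target = "79 Brisbane Street"

# Plain comparison: only the row stored in ordinary ASCII matches.
matches_plain = conn.execute(
    "SELECT count(*) FROM foo WHERE bar = ?", (target,)
).fetchone()[0]

# Option A: leave the data alone and index the filtered expression.
conn.execute("CREATE INDEX foo_bar_single ON foo (double_to_single(bar))")
matches_indexed = conn.execute(
    "SELECT count(*) FROM foo WHERE double_to_single(bar) = ?", (target,)
).fetchone()[0]

# Option B: rewrite the stored values once, then compare normally.
conn.execute("UPDATE foo SET bar = double_to_single(bar)")
matches_updated = conn.execute(
    "SELECT count(*) FROM foo WHERE bar = ?", (target,)
).fetchone()[0]
```

With the index approach every query has to call double_to_single(bar)
for the index to be considered; with the update approach the queries
stay simple, but new rows still need filtering on the way in.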

I'm not sure which would be best; others out there have more informed
opinions than mine, which I'd love to read.

Hope this helps a bit.

Michael

