Re: Regex code versus Unicode chars beyond codepoint 255 - Mailing list pgsql-hackers

From	David Smith
Subject	Re: Regex code versus Unicode chars beyond codepoint 255
Date	February 17, 2012 04:15:25
Msg-id	Pine.LNX.4.44.1202152050100.2772-100000@localhost.localdomain
In response to	Regex code versus Unicode chars beyond codepoint 255 (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-hackers

On 2010-11-24 at 15:56, Tom Lane wrote:

> Bug #5766 points out that we're still not there yet in terms of having
> sane behavior for locale-specific regex operations in Unicode
> encoding. The reason it's not working is that regc_locale does this to
> expand the set of characters that are considered to match [[:alnum:]]
> : <SNIP>

It would appear that nobody answered this email.

I am currently implementing a library system that needs to search by
whole word. I am using \m...\M regexes, and the database is UTF-8
encoded, containing text in Hebrew, Greek, Arabic and various European
character sets. I need a way to do whole-word searches on this data, which
means either fixing the definition of alnum for UTF-8 to cover all
character sets, or manually generating a list of all characters and
reimplementing word-start/end matching in regex myself. I would prefer to
avoid the latter if at all possible!
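
For illustration only, the hand-rolled word-start/end option might look
roughly like the Python sketch below, run client-side rather than inside
PostgreSQL. It is not a fix for regc_locale. The function names are
hypothetical, and it assumes Python 3, where \w on str patterns covers
Unicode letters and digits. Explicit lookarounds stand in for \m and \M.

```python
import re

def whole_word_pattern(word):
    # Hypothetical helper: (?<!\w) and (?!\w) play the role of \m and \M.
    # In Python 3, \w on a str pattern matches Unicode letters, digits and "_".
    # Caveat: combining marks (e.g. Hebrew or Arabic vowel points) are not
    # matched by \w, so pointed text would need extra handling.
    return re.compile(r"(?<!\w)" + re.escape(word) + r"(?!\w)")

def contains_whole_word(text, word):
    # True only if `word` occurs in `text` with no word character on either side.
    return whole_word_pattern(word).search(text) is not None

if __name__ == "__main__":
    print(contains_whole_word("η λέξη εδώ", "λέξη"))  # standalone Greek word
    print(contains_whole_word("λέξης", "λέξη"))       # prefix of a longer word
```

This only helps when the matching can be done outside the database; a
server-side search would still depend on how alnum is defined for UTF-8.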

What is the current status regarding a full character list for alnum for
utf8, and is there anything I can do to help get it working?

Thanks,

David
