Thread: Queries with Regular Expressions
Hello! My queries work fine with Regular Expressions, as: SELECT field FROM table WHERE field ~ 'something'; SELECT field FROM table WHERE field ~* 'something'; SELECT field FROM table WHERE field ~* 'som[i,e]thing'; And it works fine for special characters: SELECT field FROM table WHERE field ~* 'chão'; But I just can't make it work correctly using brackets: SELECT field FROM table WHERE field ~* 'ch[aã]o'; It just returns tuples that have 'chao', but not 'chão'. My queries are utf-8 an the database is SQL_ASCII. Any idea? Thank you! -- Silas Justiniano - Brazil
> But I just can't make it work correctly using brackets: > SELECT field FROM table WHERE field ~* 'ch[aã]o'; > > It just returns tuples that have 'chao', but not 'chão'. > > My queries are utf-8 an the database is SQL_ASCII. I suspect the bracketed expression is turning into [aXY], where XY is the two-byte sequence corresponding to ã in UTF8. So the regular expression is only going to match strings of the form chao, chXo and chYo. To make sure that this is what's happening, try this: select length('ã'); I bet you get back 2, not 1. I don't know if a UTF8 database will handle this correctly or not. The safest thing to do may be to use queries like this: SELECT field FROM table WHERE field ~* 'ch(a|ã)o'; - John D. Burger MITRE
"John D. Burger" <john@mitre.org> writes: >> My queries are utf-8 an the database is SQL_ASCII. > I suspect the bracketed expression is turning into [aXY], where XY is > the two-byte sequence corresponding to � in UTF8. That's what it looks like to me. You can hardly blame the database for this, when you haven't clued it in that you're using UTF8. > I don't know if a UTF8 database will handle this correctly or not. I believe that it will work in recent PG releases (7.4 and up). regards, tom lane
John, it worked completely fine! Thank you! I don't understand exactly the difference between [] and () for REs, but I'm starting to study them deeply. Thank you very much! Pg is great! On 4/6/06, John D. Burger <john@mitre.org> wrote: > > But I just can't make it work correctly using brackets: > > SELECT field FROM table WHERE field ~* 'ch[aã]o'; > > > > It just returns tuples that have 'chao', but not 'chão'. > > > > My queries are utf-8 an the database is SQL_ASCII. > > I suspect the bracketed expression is turning into [aXY], where XY is > the two-byte sequence corresponding to ã in UTF8. So the regular > expression is only going to match strings of the form chao, chXo and > chYo. To make sure that this is what's happening, try this: > > select length('ã'); > > I bet you get back 2, not 1. I don't know if a UTF8 database will > handle this correctly or not. The safest thing to do may be to use > queries like this: > > SELECT field FROM table WHERE field ~* 'ch(a|ã)o'; > > - John D. Burger > MITRE > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend > -- Silas Justiniano - Brazil