Re: Best practices for moving UTF8 databases - Mailing list pgsql-general

From Phoenix Kiula
Subject Re: Best practices for moving UTF8 databases
Date
Msg-id e373d31e0907220226h6bd5c33ag5ac2c63fa0be241f@mail.gmail.com
Whole thread Raw
In response to Re: Best practices for moving UTF8 databases  (Sam Mason <sam@samason.me.uk>)
Responses Re: Best practices for moving UTF8 databases
Re: Best practices for moving UTF8 databases
two questions about fulltext searchign / tsvector indexes
List pgsql-general
On Tue, Jul 21, 2009 at 6:35 PM, Sam Mason<sam@samason.me.uk> wrote:
> On Tue, Jul 21, 2009 at 09:37:04AM +0200, Daniel Verite wrote:
>> >I'd love to fix them. But if I do a search for
>> >SELECT * FROM xyz WHERE col like '%0x80%'
>> >
>> >it doesn't work. How should I search for these characters?
>>
>> In 8.2, try: WHERE strpos(col, E'\x80') > 0
>>
>> Note that this may find valid data as well, because the error you get
>> is when 0x80 is the first byte of a character in UTF8; when it's at
>> another position, you don't want to change it.
>
> There are various regexs around to check for valid UTF-8 encoding; one
> appears to be:
>
>  http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex
>
> One translation into PG would be:
>
>  WHERE NOT col ~ ( '^('||
>    $$[\09\0A\0D\x20-\x7E]|$$||               -- ASCII
>    $$[\xC2-\xDF][\x80-\xBF]|$$||             -- non-overlong 2-byte
>     $$\xE0[\xA0-\xBF][\x80-\xBF]|$$||        -- excluding overlongs
>    $$[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|$$||  -- straight 3-byte
>     $$\xED[\x80-\x9F][\x80-\xBF]|$$||        -- excluding surrogates
>     $$\xF0[\x90-\xBF][\x80-\xBF]{2}|$$||     -- planes 1-3
>    $$[\xF1-\xF3][\x80-\xBF]{3}|$$||          -- planes 4-15
>     $$\xF4[\x80-\x8F][\x80-\xBF]{2}$$||      -- plane 16
>   '*)$' );
>
> This seems to do the right thing for me in an SQL_ASCII database.
>




I tried this. Get an error.


mypg=# select * from interesting WHERE NOT description ~ ( '^('||
mypg(#    $$[\09\0A\0D\x20-\x7E]|$$||               -- ASCII
mypg(#    $$[\xC2-\xDF][\x80-\xBF]|$$||             -- non-overlong 2-byte
mypg(#     $$\xE0[\xA0-\xBF][\x80-\xBF]|$$||        -- excluding overlongs
mypg(#    $$[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|$$||  -- straight 3-byte
mypg(#     $$\xED[\x80-\x9F][\x80-\xBF]|$$||        -- excluding surrogates
mypg(#     $$\xF0[\x90-\xBF][\x80-\xBF]{2}|$$||     -- planes 1-3
mypg(#    $$[\xF1-\xF3][\x80-\xBF]{3}|$$||          -- planes 4-15
mypg(#     $$\xF4[\x80-\x8F][\x80-\xBF]{2}$$||      -- plane 16
mypg(#   '*)$' )
mypg-#
mypg-#   ;
ERROR:  invalid regular expression: quantifier operand invalid

pgsql-general by date:

Previous
From: Chris Spotts
Date:
Subject: Re: array_agg crash?
Next
From: Tom Lane
Date:
Subject: Re: array_agg crash?