Thread: Text-indexing UTF-8 bytea, convert_from() immutability, null bytes...
Dear Experts, I have a table that contains bytea data which sometimes is UTF-8 text. When it is - and that is indicated by another column - I want to text-search index it. Something like this: db=> create index ix on tbl using gin (to_tsvector('english',body)) where is_utf8_text; (Note my client encoding is UTF-8.) That doesn't work because to_tsvector doesn't take bytea. So I tried to_tsvector('english',body::text) ; that almost works, but CRLFs in the data become \015\012 (i.e. 6 characters) in the text and the tsvector contains things like '12hello'. Next I tried to_tsvector('english',convert_from(body::text,'UTF-8')). That doesn't work because convert_from is not immutable. (This is 9.6; maybe that has changed.) Is there a good reason for that? Maybe because I might change the client encoding? As a hack I tried ALTER FUNCTION to make it immutable, and now I get: ERROR: invalid byte sequence for encoding "UTF8": 0x00 Hmm, as far as I'm aware 0x00 is fine in UTF-8; what's that mean? But actually I'd be more than happy to ignore invalid UTF-8 here, since I'm only using it for text search; there may be some truly invalid UTF-8 in the data. Is there a "permissive" mode for charset conversion? (That error also suggests that the convert_from is not optimising the conversion from UTF-8 to UTF-8 to a no-op.) Anyway: given the problem of creating a text search index over bytea data that contains UTF-8 text, which may include oddities like null bytes, what would you do? Thanks for any suggestions! Phil.
>>>>> "Phil" == Phil Endecott <spam_from_pgsql_lists@chezphil.org> writes: Phil> Next I tried Phil> to_tsvector('english',convert_from(body::text,'UTF-8')). That Phil> doesn't work because convert_from is not immutable. (This is 9.6; Phil> maybe that has changed.) Is there a good reason for that? I would guess because conversions are controlled by the pg_conversion table which can be modified by create/drop conversion. Phil> Maybe because I might change the client encoding? No, because convert_from converts from the specified encoding to the server_encoding, not the client_encoding, and the server_encoding can't be changed except at db creation time. Phil> As a hack I tried ALTER FUNCTION to make it immutable, A better approach is to wrap it in a function of your own which is declared immutable, rather than hacking the catalogs: create function from_utf8(bytea) returns text language plpgsql immutable as $$ begin return convert_from($1, 'UTF8'); end; $$; Phil> and now I get: Phil> ERROR: invalid byte sequence for encoding "UTF8": 0x00 Phil> Hmm, as far as I'm aware 0x00 is fine in UTF-8; what's that mean? PG doesn't allow 0x00 in text values regardless of encoding. Phil> But actually I'd be more than happy to ignore invalid UTF-8 here, Phil> since I'm only using it for text search; there may be some truly Phil> invalid UTF-8 in the data. Is there a "permissive" mode for Phil> charset conversion? Unfortunately not. Phil> (That error also suggests that the convert_from is not optimising Phil> the conversion from UTF-8 to UTF-8 to a no-op.) Indeed not, because it must validate that the data really is UTF-8 before treating it as such. Phil> Anyway: given the problem of creating a text search index over Phil> bytea data that contains UTF-8 text, which may include oddities Phil> like null bytes, what would you do? You can search for 0x00 in a bytea using position() or LIKE. What do you want to do with values that contain null bytes? or values which you think are supposed to be valid utf8 text but are not? -- Andrew (irc:RhodiumToad)
Re: Text-indexing UTF-8 bytea, convert_from() immutability, null bytes...
From
"Phil Endecott"
Date:
Hi Andrew, Thanks for your great reply. Andrew Gierth wrote: >>>>>> "Phil" == Phil Endecott <spam_from_pgsql_lists@chezphil.org> writes: > Phil> As a hack I tried ALTER FUNCTION to make it immutable, > > A better approach is to wrap it in a function of your own which is > declared immutable, rather than hacking the catalogs: > > create function from_utf8(bytea) returns text language plpgsql immutable > as $$ begin return convert_from($1, 'UTF8'); end; $$; Thanks. I'm a bit surprised that it will allow me to declare a function immutable if it calls functions that aren't themselves immutable, but if it works... great. > Phil> Anyway: given the problem of creating a text search index over > Phil> bytea data that contains UTF-8 text, which may include oddities > Phil> like null bytes, what would you do? > > You can search for 0x00 in a bytea using position() or LIKE. What do you > want to do with values that contain null bytes? or values which you > think are supposed to be valid utf8 text but are not? As long as it doesn't crash I don't really care; it would be better if the text search worked for the valid parts of the text but even that isn't essential. I think I will probably need to do some more preprocessing on the data when I load it, at least to remove the null bytes. Regards, Phil.