Home > mailing lists

Re: Text-indexing UTF-8 bytea, convert_from() immutability, null bytes... - Mailing list pgsql-general

From	Andrew Gierth
Subject	Re: Text-indexing UTF-8 bytea, convert_from() immutability, null bytes...
Date	October 7, 2018 05:52:25
Msg-id	87ftxia3l4.fsf@news-spur.riddles.org.uk Whole thread Raw
In response to	Text-indexing UTF-8 bytea, convert_from() immutability, null bytes... ("Phil Endecott" <spam_from_pgsql_lists@chezphil.org>)
Responses	Re: Text-indexing UTF-8 bytea, convert_from() immutability, null bytes...
List	pgsql-general

Tree view

>>>>> "Phil" == Phil Endecott <spam_from_pgsql_lists@chezphil.org> writes:

 Phil> Next I tried
 Phil> to_tsvector('english',convert_from(body::text,'UTF-8')). That
 Phil> doesn't work because convert_from is not immutable. (This is 9.6;
 Phil> maybe that has changed.) Is there a good reason for that?

I would guess because conversions are controlled by the pg_conversion
table which can be modified by create/drop conversion.

 Phil> Maybe because I might change the client encoding?

No, because convert_from converts from the specified encoding to the
server_encoding, not the client_encoding, and the server_encoding can't
be changed except at db creation time.

 Phil> As a hack I tried ALTER FUNCTION to make it immutable,

A better approach is to wrap it in a function of your own which is
declared immutable, rather than hacking the catalogs:

create function from_utf8(bytea) returns text language plpgsql immutable
  as $$ begin return convert_from($1, 'UTF8'); end; $$;

 Phil> and now I get:

 Phil> ERROR:  invalid byte sequence for encoding "UTF8": 0x00

 Phil> Hmm, as far as I'm aware 0x00 is fine in UTF-8; what's that mean?

PG doesn't allow 0x00 in text values regardless of encoding.

 Phil> But actually I'd be more than happy to ignore invalid UTF-8 here,
 Phil> since I'm only using it for text search; there may be some truly
 Phil> invalid UTF-8 in the data. Is there a "permissive" mode for
 Phil> charset conversion?

Unfortunately not.

 Phil> (That error also suggests that the convert_from is not optimising
 Phil> the conversion from UTF-8 to UTF-8 to a no-op.)

Indeed not, because it must validate that the data really is UTF-8
before treating it as such.

 Phil> Anyway: given the problem of creating a text search index over
 Phil> bytea data that contains UTF-8 text, which may include oddities
 Phil> like null bytes, what would you do?

You can search for 0x00 in a bytea using position() or LIKE. What do you
want to do with values that contain null bytes? or values which you
think are supposed to be valid utf8 text but are not?

-- 
Andrew (irc:RhodiumToad)

pgsql-general by date:

From: "Phil Endecott"
Date: 07 October 2018, 00:56:15
Subject: Text-indexing UTF-8 bytea, convert_from() immutability, null bytes...

From: "Phil Endecott"
Date: 07 October 2018, 15:45:07
Subject: Re: Text-indexing UTF-8 bytea, convert_from() immutability, null bytes...

Re: Text-indexing UTF-8 bytea, convert_from() immutability, null bytes... - Mailing list pgsql-general

Previous

Next