Re: UTF8MatchText - Mailing list pgsql-patches

From Tom Lane
Subject Re: UTF8MatchText
Msg-id 3999.1179423188@sss.pgh.pa.us
In response to Re: UTF8MatchText  (Andrew Dunstan <andrew@dunslane.net>)
Responses Re: UTF8MatchText  (Andrew Dunstan <andrew@dunslane.net>)
List pgsql-patches
Andrew Dunstan <andrew@dunslane.net> writes:
> Ok, I have studied some more and I think I understand what's going on.
> AIUI, we are switching from some expensive char-wise comparisons to
> cheap byte-wise comparisons in the UTF8 case because we know that in
> UTF8 the magic characters ('_', '%' and '\') aren't a part of any other
> character sequence. Is that putting it too mildly? Do we need stronger
> conditions than that? If it's correct, are there other MBCS for which
> this is true?

I don't think this is a correct analysis.  If it were correct then we
could use the optimization for all backend charsets because none of them
allow MB characters to contain non-high-bit-set bytes.  But it was
stated somewhere upthread that that doesn't actually work.  Clearly
it's a necessary property that we not falsely detect the magic pattern
characters, but that's not sufficient.
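
The necessary condition above can be spot-checked with a small sketch. It is only an illustration over sample text, not a proof about any encoding, and the helper name `multibyte_chars_avoid_ascii` is made up for this example. It uses Python's standard codecs rather than anything from the backend; `shift_jis` appears purely as a contrast case, since it is not a backend charset.

```python
def multibyte_chars_avoid_ascii(text, encoding):
    """Return True if no multibyte character in text, once encoded,
    contains a byte below 0x80 (i.e. a byte that could be mistaken
    for an ASCII character such as '_', '%' or a backslash)."""
    for ch in text:
        raw = ch.encode(encoding)
        if len(raw) > 1 and any(b < 0x80 for b in raw):
            return False
    return True


# Sample text: HIRAGANA LETTER A and KATAKANA LETTER SO.
sample = "\u3042\u30bd"

for enc in ("utf-8", "euc_jp", "shift_jis"):
    print(enc, multibyte_chars_avoid_ascii(sample, enc))
```

In Shift_JIS the second byte of KATAKANA LETTER SO is 0x5C, the same byte as a backslash, so the check fails there. UTF-8 and EUC-JP both pass it, which is why this property alone cannot explain why only UTF8 gets the fast path.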

I think the real issue is that UTF8 has disjoint representations for
first-bytes and not-first-bytes of MB characters, and thus it is
impossible to make a false match in which an MB pattern character is
matched to the end of one data character plus the start of another.
In character sets without that property, we have to use the slow way to
ensure we don't make out-of-sync matches.
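
A minimal sketch of the out-of-sync match described above, using Python's standard codecs instead of the backend's matching code. The function names are hypothetical and the byte values were picked for illustration: two 2-byte EUC-JP characters whose adjacent bytes happen to spell a third character.

```python
def char_start_offsets(text, encoding):
    """Byte offsets at which each character of text begins once encoded."""
    offsets, pos = set(), 0
    for ch in text:
        offsets.add(pos)
        pos += len(ch.encode(encoding))
    return offsets


def bytewise_hits(data, pat):
    """Every byte offset at which pat occurs in data, overlaps included."""
    hits, start = [], data.find(pat)
    while start != -1:
        hits.append(start)
        start = data.find(pat, start + 1)
    return hits


def out_of_sync_hits(text, pattern, encoding):
    """Byte-wise hits that do not begin on a character boundary."""
    starts = char_start_offsets(text, encoding)
    data, pat = text.encode(encoding), pattern.encode(encoding)
    return [off for off in bytewise_hits(data, pat) if off not in starts]


# Data: two EUC-JP characters, bytes A4 A2 and A4 A4.
# Pattern: one EUC-JP character, bytes A2 A4.
text = b"\xa4\xa2\xa4\xa4".decode("euc_jp")
pattern = b"\xa2\xa4".decode("euc_jp")

print(out_of_sync_hits(text, pattern, "euc_jp"))  # [1]: straddles both characters
print(out_of_sync_hits(text, pattern, "utf-8"))   # []: no straddling hit
```

In EUC-JP both bytes of a 2-byte character come from the same range, so the tail of one character followed by the head of the next can look like a complete character. In UTF-8 a pattern character starts with a lead byte, and lead bytes never occur in continuation positions, so a byte-wise hit can only begin where a data character begins.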

            regards, tom lane
