Re: UTF8MatchText - Mailing list pgsql-patches

From	Tom Lane
Subject	Re: UTF8MatchText
Date	May 17, 2007 17:33:54
Msg-id	3999.1179423188@sss.pgh.pa.us
In response to	Re: UTF8MatchText (Andrew Dunstan <andrew@dunslane.net>)
Responses	Re: UTF8MatchText (Andrew Dunstan <andrew@dunslane.net>)
List	pgsql-patches

Andrew Dunstan <andrew@dunslane.net> writes:
> Ok, I have studied some more and I think I understand what's going on.
> AIUI, we are switching from some expensive char-wise comparisons to
> cheap byte-wise comparisons in the UTF8 case because we know that in
> UTF8 the magic characters ('_', '%' and '\') aren't a part of any other
> character sequence. Is that putting it too mildly? Do we need stronger
> conditions than that? If it's correct, are there other MBCS for which
> this is true?

I don't think this is a correct analysis.  If it were correct then we
could use the optimization for all backend charsets because none of them
allow MB characters to contain non-high-bit-set bytes.  But it was
stated somewhere upthread that that doesn't actually work.  Clearly
it's a necessary property that we not falsely detect the magic pattern
characters, but that's not sufficient.

I think the real issue is that UTF8 has disjoint representations for
first-bytes and not-first-bytes of MB characters, and thus it is
impossible to make a false match in which an MB pattern character is
matched to the end of one data character plus the start of another.
In character sets without that property, we have to use the slow way to
ensure we don't make out-of-sync matches.
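
To make the out-of-sync case concrete, here is a minimal sketch in Python (not the backend's C code). It assumes a simplified EUC-style encoding in which every byte with the high bit set starts a two-byte character and the second byte is drawn from the same range as the first; the helper names are made up for illustration. The byte values 0xA4 0xA2 0xA4 0xA4 stand for two data characters, and the pattern character 0xA2 0xA4 is found by a byte-wise search at offset 1, straddling them. The same search over UTF8 can only land on a character boundary, because a continuation byte (10xxxxxx) can never be mistaken for a first byte.

```python
def euc_char_starts(data):
    """Character start offsets for a simplified EUC-style encoding
    (assumption: high-bit byte => two-byte character, otherwise one byte)."""
    starts = set()
    i = 0
    while i < len(data):
        starts.add(i)
        i += 2 if data[i] >= 0x80 else 1
    return starts


def utf8_char_starts(data):
    """Character start offsets in UTF8: every byte that is not a
    continuation byte (10xxxxxx) begins a character."""
    return {i for i, b in enumerate(data) if (b & 0xC0) != 0x80}


# Two data characters, A4A2 and A4A4, in the EUC-style encoding.
euc_data = bytes([0xA4, 0xA2, 0xA4, 0xA4])
# A different two-byte character, A2A4, used as the pattern.
euc_pattern = bytes([0xA2, 0xA4])

# Byte-wise search finds the pattern in the middle of the data:
# tail of the first character plus head of the second.
euc_hit = euc_data.find(euc_pattern)
euc_in_sync = euc_hit in euc_char_starts(euc_data)

# Same experiment in UTF8 with U+3042 followed by U+3044.
utf8_data = "\u3042\u3044".encode("utf-8")
utf8_pattern = "\u3044".encode("utf-8")
utf8_hit = utf8_data.find(utf8_pattern)
utf8_in_sync = utf8_hit in utf8_char_starts(utf8_data)

print("EUC-style hit at", euc_hit, "on a character boundary:", euc_in_sync)
print("UTF8 hit at", utf8_hit, "on a character boundary:", utf8_in_sync)
```

Note that neither pattern contains a byte without the high bit set, so the "no false magic characters" property holds in both cases; only the UTF8 case also rules out the straddling match.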

            regards, tom lane
