Re: UTF8MatchText - Mailing list pgsql-patches

From Andrew Dunstan
Subject Re: UTF8MatchText
Msg-id 464CB3A5.9020600@dunslane.net
In response to Re: UTF8MatchText  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: UTF8MatchText  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-patches

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>
>> Tom Lane wrote:
>>
>>> Except that the entire point of this patch is to dumb down NextChar to
>>> be the same as NextByte for UTF8 strings.
>>>
>
>
>> That's not what I see in (what I think is) the latest submission, which
>> includes this snippet:
>>
>
> [ scratches head... ]  OK, then I think I totally missed what this patch
> is trying to accomplish; because this code looks just the same as the
> existing multibyte-character operations.  Where does the performance
> improvement come from?
>
>
>

That's what bothered me. The trouble is that we have so much code that
looks *almost* identical.

From my WIP patch, here's where the difference appears to be. Note
that the UTF8 branch has two NextByte calls at the bottom, unlike the
other branch, which calls NextChar both times:


#ifdef UTF8_OPT
        /*
         * UTF8 is optimised to do byte at a time matching in most cases,
         * thus saving expensive calls to NextChar.
         *
         * UTF8 has disjoint representations for first-bytes and
         * not-first-bytes of MB characters, and thus it is
         * impossible to make a false match in which an MB pattern
         * character is matched to the end of one data character
         * plus the start of another.
         * In character sets without that property, we have to use the
         * slow way to ensure we don't make out-of-sync matches.
         */
        else if (*p == '_')
        {
            NextChar(t, tlen);
            NextByte(p, plen);
            continue;
        }
        else if (!BYTEEQ(t, p))
        {
            /*
             * Not the single-character wildcard and no explicit match? Then
             * time to quit...
             */
            return LIKE_FALSE;
        }

        NextByte(t, tlen);
        NextByte(p, plen);
#else
        /*
         * Branch for non-utf8 multi-byte charsets and also for single-byte
         * charsets which don't gain any benefit from the above optimisation.
         */

        else if ((*p != '_') && !CHAREQ(t, p))
        {
            /*
             * Not the single-character wildcard and no explicit match? Then
             * time to quit...
             */
            return LIKE_FALSE;
        }

        NextChar(t, tlen);
        NextChar(p, plen);

#endif /* UTF8_OPT */
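
To make the comment in the UTF8 branch concrete, here is a minimal standalone sketch of the idea, not code from the patch. All names in it (is_utf8_continuation, utf8_char_len, match_underscore_only) are hypothetical, it assumes valid UTF-8 input, and it handles only literal bytes and '_' (no '%', no escapes). Literal bytes are compared one at a time; only '_' has to step over a whole character, which it does by skipping continuation bytes (those of the form 10xxxxxx):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if b is a UTF-8 continuation byte (10xxxxxx). */
static bool
is_utf8_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper: number of bytes in the character starting at s.
 * Assumes valid UTF-8 and len > 0; counts the first byte plus any
 * continuation bytes that follow it.
 */
static size_t
utf8_char_len(const char *s, size_t len)
{
    size_t n = 1;

    assert(len > 0);
    while (n < len && is_utf8_continuation((unsigned char) s[n]))
        n++;
    return n;
}

/*
 * Simplified matcher: literal bytes and '_' only.
 * Literals advance text and pattern one byte at a time; because first
 * bytes and continuation bytes never share a value, a literal pattern
 * character cannot match across a character boundary in the text.
 */
static bool
match_underscore_only(const char *t, size_t tlen,
                      const char *p, size_t plen)
{
    while (tlen > 0 && plen > 0)
    {
        if (*p == '_')
        {
            /* '_' must consume one whole character of text */
            size_t n = utf8_char_len(t, tlen);

            t += n;
            tlen -= n;
            p++;
            plen--;
            continue;
        }
        if (*t != *p)
            return false;

        /* byte-at-a-time advance on both sides */
        t++;
        tlen--;
        p++;
        plen--;
    }
    return tlen == 0 && plen == 0;
}
```

With the text "h\xC3\xA9llo" (6 bytes, where \xC3\xA9 is a two-byte character), the pattern "h_llo" should match because '_' consumes both bytes, while "h__llo" should not.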


cheers

andrew


