Re: UTF8MatchText - Mailing list pgsql-patches
From | Tom Lane |
---|---|
Subject | Re: UTF8MatchText |
Date | |
Msg-id | 13130.1179436715@sss.pgh.pa.us |
In response to | Re: UTF8MatchText (Andrew Dunstan <andrew@dunslane.net>) |
Responses | Re: UTF8MatchText; Re: UTF8MatchText; Re: UTF8MatchText |
List | pgsql-patches |
Andrew Dunstan <andrew@dunslane.net> writes:
> From my WIP patch, here's where the difference appears to be - note
> that UTF8 branch has two NextByte calls at the bottom, unlike the other
> branch:

Oh, I see: NextChar is still "real" but the patch is willing to have t and p pointing into the middle of an MB character. That's a bit risky. I think it works but it's making at least the following undocumented assumptions:

* At a pattern backslash, it applies CHAREQ() but then advances byte-by-byte over the matched characters (implicitly assuming that none of these bytes will look like the magic characters). While that works for backend-safe encodings, it seems a bit strange; you've already paid the price of determining the character length once, not to mention matching the bytes of the characters once, and then you throw that knowledge away. I think BYTEEQ would make more sense in the backslash path.

* At pattern % or _, it's critical that we do NextChar not NextByte on the data side. Else t is pointing into the middle of an MB sequence when p isn't, and we have various out-of-sync conditions to worry about, notably possibly calling NextChar when t is not pointing at the first byte of the character, which will result in a wrong answer about the character length.

* We *must* do NextChar not NextByte for _ since we have to match it to exactly one logical character, not byte. You could imagine trying to do % a byte at a time (and indeed that's what I'd been thinking it did) but that gets you out of sync, which breaks the _ case.

So the actual optimization here is that we do bytewise comparison and advancing, but only when we are either at the start of a character (on both sides, and the pattern char is not a wildcard) or we are in the middle of a character (on both sides) and we've already proven that both sides matched for the previous byte(s) of the character.
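To make those rules concrete, here is a minimal standalone sketch of a matcher that follows them. It is not the patch's MatchText code: the names utf8_char_len, like_match_n and like_match are hypothetical, it assumes valid UTF-8 on both sides, and it uses naive recursion for %. Literal bytes (including the byte after a backslash) are compared and advanced one byte at a time, while % and _ always advance the data side by a whole character.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte length of the UTF-8 character starting at s.
 * Only meaningful when s points at the FIRST byte of a character. */
static size_t utf8_char_len(const unsigned char *s)
{
    if (*s < 0x80) return 1;
    if ((*s & 0xE0) == 0xC0) return 2;
    if ((*s & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Hypothetical simplified LIKE matcher; returns 1 on match, 0 otherwise. */
static int like_match_n(const unsigned char *t, size_t tlen,
                        const unsigned char *p, size_t plen)
{
    while (tlen > 0 && plen > 0)
    {
        if (*p == '\\')
        {
            /* Escape: compare the next pattern byte literally (bytewise).
             * Any continuation bytes of the escaped character are >= 0x80,
             * so they cannot be mistaken for '%', '_' or '\\' later. */
            p++; plen--;
            if (plen == 0 || *p != *t)
                return 0;
        }
        else if (*p == '%')
        {
            p++; plen--;
            if (plen == 0)
                return 1;           /* trailing % matches everything */
            for (;;)
            {
                size_t n;

                if (like_match_n(t, tlen, p, plen))
                    return 1;
                if (tlen == 0)
                    return 0;
                /* Advance the data by one whole character, never one byte,
                 * so t stays on a character boundary for a later '_'. */
                n = utf8_char_len(t);
                if (n > tlen) n = tlen;     /* guard against truncation */
                t += n; tlen -= n;
            }
        }
        else if (*p == '_')
        {
            /* '_' must consume exactly one logical character. */
            size_t n = utf8_char_len(t);
            if (n > tlen) n = tlen;
            t += n; tlen -= n;
            p++; plen--;
            continue;
        }
        else if (*p != *t)
            return 0;

        /* A literal byte matched: advance one byte on both sides. */
        t++; tlen--;
        p++; plen--;
    }

    if (tlen > 0)
        return 0;
    /* Data exhausted: any remaining pattern must consist only of '%'. */
    while (plen > 0 && *p == '%') { p++; plen--; }
    return plen == 0;
}

static int like_match(const char *text, const char *pattern)
{
    return like_match_n((const unsigned char *) text, strlen(text),
                        (const unsigned char *) pattern, strlen(pattern));
}
```

For example, like_match("h\xc3\xa9llo", "h_llo") matches because _ consumes both bytes of the two-byte character, whereas a bytewise _ would leave t in the middle of that character and out of sync with p.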
On the strength of this closer reading, I would say that the patch isn't relying on UTF8's first-byte-vs-not-first-byte property after all. All that it's relying on is that no MB character is a prefix of another one, which seems like a necessary property for any sane encoding; plus that characters are considered equal only if they're bytewise equal. So are we sure it doesn't work for non-UTF8 encodings? Maybe that earlier conclusion was based on a misunderstanding of what the patch really does.

regards, tom lane