Thread: Multibyte LIKE optimization
Andrew - Supernews <andrew+nonews@supernews.com> wrote: > Actually, I think your proposal is fundamentally correct, merely incomplete. Yeah, I fixed the patch to handle '_' correctly. > Doing octet-based rather than character-based matching of strings is a > _design goal_ of UTF8. I think all "safe ASCII-supersets" encodings are comparable by bytes, not only UTF-8. Their all multibyte characters consist of bytes larger than 127. I updated the patch on this presupposition. It uses octet-based matching usually and character-based matching at '_'. There was 30%+ of performance win in selection using multibytes LIKE '%foo%'. encoding | HEAD | patched -----------+---------+--------- SQL_ASCII | 7094ms | 7062ms LATIN1 | 7083ms | 7078ms UTF8 | 17974ms | 11635ms (64.7%) EUC_JP | 17032ms | 12109ms (71.1%) If this patch is acceptable, please drop JOHAB encoding from server encodings before it is applied. Trailing bytes of JOHAB can be less than 128. http://archives.postgresql.org/pgsql-hackers/2007-03/msg01475.php Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Attachment
"Andrew - Supernews" <andrew@supernews.net> wrote: > ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable > ITAGAKI> by bytes, not only UTF-8. > > This is false, particularly for EUC. Umm, I see. I updated the optimization to be used only for UTF8 case. I also added some inlining hints that are useful on my machine (Pentium 4). x1000 of LIKE '%foo% on 10000 rows tables [ms] encoding | HEAD | P1 | P2 | P3 -----------+-------+-------+-------+------- SQL_ASCII | 7094 | 7120 | 7063 | 7031 LATIN1 | 7083 | 7130 | 7057 | 7031 UTF8 | 17974 | 10859 | 10839 | 9682 EUC_JP | 17032 | 17557 | 17599 | 15240 - P1: UTF8MatchText() - P2: P1 + __inline__ GenericMatchText() - P3: P2 + __inline__ wchareq() (The attached patch is P3.) Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Attachment
I assume this replaces all your earlier multi-byte LIKE patches. Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. --------------------------------------------------------------------------- ITAGAKI Takahiro wrote: > "Andrew - Supernews" <andrew@supernews.net> wrote: > > > ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable > > ITAGAKI> by bytes, not only UTF-8. > > > > This is false, particularly for EUC. > > Umm, I see. I updated the optimization to be used only for UTF8 case. > I also added some inlining hints that are useful on my machine (Pentium 4). > > > x1000 of LIKE '%foo% on 10000 rows tables [ms] > encoding | HEAD | P1 | P2 | P3 > -----------+-------+-------+-------+------- > SQL_ASCII | 7094 | 7120 | 7063 | 7031 > LATIN1 | 7083 | 7130 | 7057 | 7031 > UTF8 | 17974 | 10859 | 10839 | 9682 > EUC_JP | 17032 | 17557 | 17599 | 15240 > > - P1: UTF8MatchText() > - P2: P1 + __inline__ GenericMatchText() > - P3: P2 + __inline__ wchareq() > (The attached patch is P3.) > > Regards, > --- > ITAGAKI Takahiro > NTT Open Source Software Center > [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 7: You can help support the PostgreSQL project by donating at > > http://www.postgresql.org/about/donate -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
I do not understand this patch. You have defined two functions, UTF8MatchText() and UTF8MatchTextIC(), and the difference between them is that one calls CHAREQ and the other calls ICHAREQ, but just above those two functions you define the macros identically: #define CHAREQ(p1, p2) wchareq(p1, p2) #define ICHAREQ(p1, p2) wchareq(p1, p2) Why are there two functions? Also, can't you use one function and just pass a boolean to indicate whether case it be ignored? --------------------------------------------------------------------------- ITAGAKI Takahiro wrote: > "Andrew - Supernews" <andrew@supernews.net> wrote: > > > ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable > > ITAGAKI> by bytes, not only UTF-8. > > > > This is false, particularly for EUC. > > Umm, I see. I updated the optimization to be used only for UTF8 case. > I also added some inlining hints that are useful on my machine (Pentium 4). > > > x1000 of LIKE '%foo% on 10000 rows tables [ms] > encoding | HEAD | P1 | P2 | P3 > -----------+-------+-------+-------+------- > SQL_ASCII | 7094 | 7120 | 7063 | 7031 > LATIN1 | 7083 | 7130 | 7057 | 7031 > UTF8 | 17974 | 10859 | 10839 | 9682 > EUC_JP | 17032 | 17557 | 17599 | 15240 > > - P1: UTF8MatchText() > - P2: P1 + __inline__ GenericMatchText() > - P3: P2 + __inline__ wchareq() > (The attached patch is P3.) > > Regards, > --- > ITAGAKI Takahiro > NTT Open Source Software Center > [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 7: You can help support the PostgreSQL project by donating at > > http://www.postgresql.org/about/donate -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > > I do not understand this patch. You have defined two functions, > UTF8MatchText() and UTF8MatchTextIC(), and the difference between them > is that one calls CHAREQ and the other calls ICHAREQ, but just above > those two functions you define the macros identically: > > #define CHAREQ(p1, p2) wchareq(p1, p2) > #define ICHAREQ(p1, p2) wchareq(p1, p2) > > Why are there two functions? Also, can't you use one function and just > pass a boolean to indicate whether case it be ignored? Sorry, typo: Why are there two functions? Also, can't you use one function and just pass a boolean to indicate whether case should be ignored? ------ > > --------------------------------------------------------------------------- > > ITAGAKI Takahiro wrote: > > "Andrew - Supernews" <andrew@supernews.net> wrote: > > > > > ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable > > > ITAGAKI> by bytes, not only UTF-8. > > > > > > This is false, particularly for EUC. > > > > Umm, I see. I updated the optimization to be used only for UTF8 case. > > I also added some inlining hints that are useful on my machine (Pentium 4). > > > > > > x1000 of LIKE '%foo% on 10000 rows tables [ms] > > encoding | HEAD | P1 | P2 | P3 > > -----------+-------+-------+-------+------- > > SQL_ASCII | 7094 | 7120 | 7063 | 7031 > > LATIN1 | 7083 | 7130 | 7057 | 7031 > > UTF8 | 17974 | 10859 | 10839 | 9682 > > EUC_JP | 17032 | 17557 | 17599 | 15240 > > > > - P1: UTF8MatchText() > > - P2: P1 + __inline__ GenericMatchText() > > - P3: P2 + __inline__ wchareq() > > (The attached patch is P3.) > > > > Regards, > > --- > > ITAGAKI Takahiro > > NTT Open Source Software Center > > > > [ Attachment, skipping... ] > > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 7: You can help support the PostgreSQL project by donating at > > > > http://www.postgresql.org/about/donate > > -- > Bruce Momjian <bruce@momjian.us> http://momjian.us > EnterpriseDB http://www.enterprisedb.com > > + If your life is a hard drive, Christ can be your backup. + > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> wrote: > > I do not understand this patch. You have defined two functions, > > UTF8MatchText() and UTF8MatchTextIC(), and the difference between them > > is that one calls CHAREQ and the other calls ICHAREQ, but just above > > those two functions you define the macros identically: > > Why are there two functions? Also, can't you use one function and just > pass a boolean to indicate whether case should be ignored? The same is true of MBMatchText() and MBMatchTextIC(). Now, I'll split the patch into two changes. 1. DropMBMatchTextIC.patch Drop MBMatchTextIC() and use MBMatchText() instead. 2. UTF8MatchText.patch Add UTF8MatchText() as a specialized version of MBMatchText(). As a future work, it might be good to research the performance of rewriting "col ILIKE 'pattern'" to "lower(col) LIKE lower('pattern')" in planner so that we can avoid to call lower() for constant pattern in the right-hand side and can use functional indexes (lower(col)). I think we never need MBMatchTextIC() in the future unless we move to wide-character server encoding :) Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Attachment
Patch removed, updated version submitted. --------------------------------------------------------------------------- ITAGAKI Takahiro wrote: > "Andrew - Supernews" <andrew@supernews.net> wrote: > > > ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable > > ITAGAKI> by bytes, not only UTF-8. > > > > This is false, particularly for EUC. > > Umm, I see. I updated the optimization to be used only for UTF8 case. > I also added some inlining hints that are useful on my machine (Pentium 4). > > > x1000 of LIKE '%foo% on 10000 rows tables [ms] > encoding | HEAD | P1 | P2 | P3 > -----------+-------+-------+-------+------- > SQL_ASCII | 7094 | 7120 | 7063 | 7031 > LATIN1 | 7083 | 7130 | 7057 | 7031 > UTF8 | 17974 | 10859 | 10839 | 9682 > EUC_JP | 17032 | 17557 | 17599 | 15240 > > - P1: UTF8MatchText() > - P2: P1 + __inline__ GenericMatchText() > - P3: P2 + __inline__ wchareq() > (The attached patch is P3.) > > Regards, > --- > ITAGAKI Takahiro > NTT Open Source Software Center > [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 7: You can help support the PostgreSQL project by donating at > > http://www.postgresql.org/about/donate -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. --------------------------------------------------------------------------- ITAGAKI Takahiro wrote: > Bruce Momjian <bruce@momjian.us> wrote: > > > > I do not understand this patch. You have defined two functions, > > > UTF8MatchText() and UTF8MatchTextIC(), and the difference between them > > > is that one calls CHAREQ and the other calls ICHAREQ, but just above > > > those two functions you define the macros identically: > > > > Why are there two functions? Also, can't you use one function and just > > pass a boolean to indicate whether case should be ignored? > > The same is true of MBMatchText() and MBMatchTextIC(). > Now, I'll split the patch into two changes. > > 1. DropMBMatchTextIC.patch > Drop MBMatchTextIC() and use MBMatchText() instead. > > 2. UTF8MatchText.patch > Add UTF8MatchText() as a specialized version of MBMatchText(). > > > As a future work, it might be good to research the performance of rewriting > "col ILIKE 'pattern'" to "lower(col) LIKE lower('pattern')" in planner so that > we can avoid to call lower() for constant pattern in the right-hand side and > can use functional indexes (lower(col)). I think we never need MBMatchTextIC() > in the future unless we move to wide-character server encoding :) > > Regards, > --- > ITAGAKI Takahiro > NTT Open Source Software Center > [ Attachment, skipping... ] [ Attachment, skipping... ] -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Itagaki, I find this still fairly unclean. It certainly took me some time to get me head around what's going on. ISTM we should generate all these match functions from one body of code plus some #define magic. As I understand it, we have three possible encoding switches: Single Byte, UTF8 and other Multi Byte Charsets, and two possible case settings: case Sensitive and Case Insensitive. That would make for a total of six functions, but in the case of both UTF8 and other MBCS we don't need a special Case Insensitive function - instead we downcase both the string and the pattern and then use the Case Sensitive function. That leaves a total of four functions. What is not clear to me is why the UTF8 optimisation work, and why it doesn't apply to other MBCS. At the very least we need a comment on that. I also find the existing function naming convention somewhat annoying - having foo() and MB_foo() is less than clear. I'd rather have SB_foo() and MB_foo(). That's not your fault, of course. If you supply me with some explanation on the UTF8 optimisation issue, I'll prepare a revised patch along these lines. cheers andrew ITAGAKI Takahiro wrote: > Bruce Momjian <bruce@momjian.us> wrote: > > >>> I do not understand this patch. You have defined two functions, >>> UTF8MatchText() and UTF8MatchTextIC(), and the difference between them >>> is that one calls CHAREQ and the other calls ICHAREQ, but just above >>> those two functions you define the macros identically: >>> >> Why are there two functions? Also, can't you use one function and just >> pass a boolean to indicate whether case should be ignored? >> > > The same is true of MBMatchText() and MBMatchTextIC(). > Now, I'll split the patch into two changes. > > 1. DropMBMatchTextIC.patch > Drop MBMatchTextIC() and use MBMatchText() instead. > > 2. UTF8MatchText.patch > Add UTF8MatchText() as a specialized version of MBMatchText(). > > > As a future work, it might be good to research the performance of rewriting > "col ILIKE 'pattern'" to "lower(col) LIKE lower('pattern')" in planner so that > we can avoid to call lower() for constant pattern in the right-hand side and > can use functional indexes (lower(col)). I think we never need MBMatchTextIC() > in the future unless we move to wide-character server encoding :) > > Regards, > --- > ITAGAKI Takahiro > NTT Open Source Software Center > > > ------------------------------------------------------------------------ > > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match >
I wrote: > > > ISTM we should generate all these match functions from one body of > code plus some #define magic. > > As I understand it, we have three possible encoding switches: Single > Byte, UTF8 and other Multi Byte Charsets, and two possible case > settings: case Sensitive and Case Insensitive. That would make for a > total of six functions, but in the case of both UTF8 and other MBCS we > don't need a special Case Insensitive function - instead we downcase > both the string and the pattern and then use the Case Sensitive > function. That leaves a total of four functions. > > What is not clear to me is why the UTF8 optimisation work, and why it > doesn't apply to other MBCS. At the very least we need a comment on that. > > I also find the existing function naming convention somewhat annoying > - having foo() and MB_foo() is less than clear. I'd rather have > SB_foo() and MB_foo(). That's not your fault, of course. > > If you supply me with some explanation on the UTF8 optimisation issue, > I'll prepare a revised patch along these lines. > > Ok, I have studied some more and I think I understand what's going on. AIUI, we are switching from some expensive char-wise comparisons to cheap byte-wise comparisons in the UTF8 case because we know that in UTF8 the magic characters ('_', '%' and '\') aren't a part of any other character sequence. Is that putting it too mildly? Do we need stronger conditions than that? If it's correct, are there other MBCS for which this is true? cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Ok, I have studied some more and I think I understand what's going on. > AIUI, we are switching from some expensive char-wise comparisons to > cheap byte-wise comparisons in the UTF8 case because we know that in > UTF8 the magic characters ('_', '%' and '\') aren't a part of any other > character sequence. Is that putting it too mildly? Do we need stronger > conditions than that? If it's correct, are there other MBCS for which > this is true? I don't think this is a correct analysis. If it were correct then we could use the optimization for all backend charsets because none of them allow MB characters to contain non-high-bit-set bytes. But it was stated somewhere upthread that that doesn't actually work. Clearly it's a necessary property that we not falsely detect the magic pattern characters, but that's not sufficient. I think the real issue is that UTF8 has disjoint representations for first-bytes and not-first-bytes of MB characters, and thus it is impossible to make a false match in which an MB pattern character is matched to the end of one data character plus the start of another. In character sets without that property, we have to use the slow way to ensure we don't make out-of-sync matches. regards, tom lane
Wait a second ... I just thought of a counterexample that destroys the entire concept. Consider the pattern 'A__B', which clearly is supposed to match strings of four *characters*. With the proposed patch in place, it would match strings of four *bytes*. Which is not the correct behavior. regards, tom lane
Tom Lane wrote: > UTF8 has disjoint representations for > first-bytes and not-first-bytes of MB characters, and thus it is > impossible to make a false match in which an MB pattern character is > matched to the end of one data character plus the start of another. > In character sets without that property, we have to use the slow way to > ensure we don't make out-of-sync matches. > > > Thanks. I will include this info in the comments. cheers andrew
Tom Lane wrote: > Wait a second ... I just thought of a counterexample that destroys the > entire concept. Consider the pattern 'A__B', which clearly is supposed > to match strings of four *characters*. With the proposed patch in > place, it would match strings of four *bytes*. Which is not the correct > behavior. > > From what I can see the code is quite careful about when it calls NextByte vs NextChar, and after _ it calls NextChar. So I'll test for this, but I think it's safe. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> Wait a second ... I just thought of a counterexample that destroys the >> entire concept. Consider the pattern 'A__B', which clearly is supposed >> to match strings of four *characters*. With the proposed patch in >> place, it would match strings of four *bytes*. Which is not the correct >> behavior. > From what I can see the code is quite careful about when it calls > NextByte vs NextChar, and after _ it calls NextChar. Except that the entire point of this patch is to dumb down NextChar to be the same as NextByte for UTF8 strings. regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Tom Lane wrote: >> >>> Wait a second ... I just thought of a counterexample that destroys the >>> entire concept. Consider the pattern 'A__B', which clearly is supposed >>> to match strings of four *characters*. With the proposed patch in >>> place, it would match strings of four *bytes*. Which is not the correct >>> behavior. >>> > > >> From what I can see the code is quite careful about when it calls >> NextByte vs NextChar, and after _ it calls NextChar. >> > > Except that the entire point of this patch is to dumb down NextChar to > be the same as NextByte for UTF8 strings. > > > That's not what I see in (what I think is) the latest submission, which includes this snippet: + /* Set up for utf8 characters */ + #define CHAREQ(p1, p2) wchareq(p1, p2) + #define NextChar(p, plen) \ + do { int __l = pg_utf_mblen(p); (p) +=__l; (plen) -=__l; } while (0) + + /* + * UTF8MatchText -- specialized version of MBMatchText for UTF8 + */ + static int + UTF8MatchText(char *t, int tlen, char *p, int plen) Am I looking at the wrong thing? This is from around April 9th I think. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> Except that the entire point of this patch is to dumb down NextChar to >> be the same as NextByte for UTF8 strings. > That's not what I see in (what I think is) the latest submission, which > includes this snippet: [ scratches head... ] OK, then I think I totally missed what this patch is trying to accomplish; because this code looks just the same as the existing multibyte-character operations. Where does the performance improvement come from? regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Tom Lane wrote: >> >>> Except that the entire point of this patch is to dumb down NextChar to >>> be the same as NextByte for UTF8 strings. >>> > > >> That's not what I see in (what I think is) the latest submission, which >> includes this snippet: >> > > [ scratches head... ] OK, then I think I totally missed what this patch > is trying to accomplish; because this code looks just the same as the > existing multibyte-character operations. Where does the performance > improvement come from? > > > That's what bothered me. The trouble is that we have so much code that looks *almost* identical. From my WIP patch, here's where the difference appears to be - note that UTF8 branch has two NextByte calls at the bottom, unlike the other branch: #ifdef UTF8_OPT /* * UTF8 is optimised to do byte at a time matching in most cases, * thus saving expensive calls to NextChar. * * UTF8 has disjoint representations for first-bytes and * not-first-bytes of MB characters, and thus it is * impossible to make a false match in which an MB pattern * character is matched to the end of one data character * plus the start of another. * In character sets without that property, we have to use the * slow way to ensure we don't make out-of-sync matches. */ else if (*p == '_') { NextChar(t, tlen); NextByte(p, plen); continue; } else if (!BYTEEQ(t, p)) { /* * Not the single-character wildcard and no explicit match? Then * time to quit... */ return LIKE_FALSE; } NextByte(t, tlen); NextByte(p, plen); #else /* * Branch for non-utf8 multi-byte charsets and also for single-byte * charsets which don't gain any benefit from the above optimisation. */ else if ((*p != '_') && !CHAREQ(t, p)) { /* * Not the single-character wildcard and no explicit match? Then * time to quit... */ return LIKE_FALSE; } NextChar(t, tlen); NextChar(p, plen); #endif /* UTF8_OPT */ cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > From my WIP patch, here's where the difference appears to be - note > that UTF8 branch has two NextByte calls at the bottom, unlike the other > branch: Oh, I see: NextChar is still "real" but the patch is willing to have t and p pointing into the middle of an MB character. That's a bit risky. I think it works but it's making at least the following undocumented assumptions: * At a pattern backslash, it applies CHAREQ() but then advances byte-by-byte over the matched characters (implicitly assuming that none of these bytes will look like the magic characters). While that works for backend-safe encodings, it seems a bit strange; you've already paid the price of determining the character length once, not to mention matching the bytes of the characters once, and then throw that knowledge away. I think BYTEEQ would make more sense in the backslash path. * At pattern % or _, it's critical that we do NextChar not NextByte on the data side. Else t is pointing into the middle of an MB sequence when p isn't, and we have various out-of-sync conditions to worry about, notably possibly calling NextChar when t is not pointing at the first byte of the character, which will result in a wrong answer about the character length. * We *must* do NextChar not NextByte for _ since we have to match it to exactly one logical character, not byte. You could imagine trying to do % a byte at a time (and indeed that's what I'd been thinking it did) but that gets you out of sync which breaks the _ case. So the actual optimization here is that we do bytewise comparison and advancing, but only when we are either at the start of a character (on both sides, and the pattern char is not wildcard) or we are in the middle of a character (on both sides) and we've already proven that both sides matched for the previous byte(s) of the character. On the strength of this closer reading, I would say that the patch isn't relying on UTF8's first-byte-vs-not-first-byte property after all. All that it's relying on is that no MB character is a prefix of another one, which seems like a necessary property for any sane encoding; plus that characters are considered equal only if they're bytewise equal. So are we sure it doesn't work for non-UTF8 encodings? Maybe that earlier conclusion was based on a misunderstanding of what the patch really does. regards, tom lane
Tom Lane wrote: > > * At a pattern backslash, it applies CHAREQ() but then advances > byte-by-byte over the matched characters (implicitly assuming that none > of these bytes will look like the magic characters). While that works > for backend-safe encodings, it seems a bit strange; you've already paid > the price of determining the character length once, not to mention > matching the bytes of the characters once, and then throw that knowledge > away. I think BYTEEQ would make more sense in the backslash path. > Probably, although the use of CHAREQ is in the present code. Is it legal to follow escape by anything other than _ % or escape? > > So the actual optimization here is that we do bytewise comparison and > advancing, but only when we are either at the start of a character > (on both sides, and the pattern char is not wildcard) or we are in the > middle of a character (on both sides) and we've already proven that both > sides matched for the previous byte(s) of the character. > I think that's correct. > On the strength of this closer reading, I would say that the patch isn't > relying on UTF8's first-byte-vs-not-first-byte property after all. > All that it's relying on is that no MB character is a prefix of another > one, which seems like a necessary property for any sane encoding; plus > that characters are considered equal only if they're bytewise equal. > So are we sure it doesn't work for non-UTF8 encodings? Maybe that > earlier conclusion was based on a misunderstanding of what the patch > really does. > > > Indeed. One more thing - I'm thinking of rolling up the bytea matching routines as well as the text routines to eliminate all the duplication of logic. I can do it by a little type casting from bytea* to text* and back again, or if that's not acceptable by some preprocessor magic. I think the casting is likely to be safe enough in this case - I don't think a null byte will hurt us anywhere in this code - and presumably the varlena stuff is all the same. Does that sound reasonable? cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Is it legal to follow escape by anything other than _ % or escape? Certainly, but once you've compared the first byte you can handle any remaining bytes via the main loop. And in fact the code is already depending on being able to do that --- the use of CHAREQ rather than BYTEEQ is just wasted cycles. > One more thing - I'm thinking of rolling up the bytea matching routines > as well as the text routines to eliminate all the duplication of logic. +1, I think, but I wonder why we had the duplication in the first place. Is there any likelihood that bytea's semantics should diverge from the text case? regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > On the strength of this closer reading, I would say that the patch isn't > relying on UTF8's first-byte-vs-not-first-byte property after all. > All that it's relying on is that no MB character is a prefix of another > one, which seems like a necessary property for any sane encoding; plus > that characters are considered equal only if they're bytewise equal. > So are we sure it doesn't work for non-UTF8 encodings? Maybe that > earlier conclusion was based on a misunderstanding of what the patch > really does. Yes, I only used the 'disjoint representations for first-bytes and not-first-bytes of MB characters' feature in UTF8. Other encodings allows both [AB] and [BA] for MB character patterns. UTF8Match() does not cope with those encodings; If we have '[AB][AB]' in a table and search it with LIKE '%[BA]%', we judge that they are matched by mistake. I've also misunderstood it before, and Dennis Bjorklund pointed out. http://archives.postgresql.org/pgsql-hackers/2007-03/msg01377.php Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes: > Yes, I only used the 'disjoint representations for first-bytes and > not-first-bytes of MB characters' feature in UTF8. Other encodings > allows both [AB] and [BA] for MB character patterns. UTF8Match() does > not cope with those encodings; If we have '[AB][AB]' in a table and > search it with LIKE '%[BA]%', we judge that they are matched by mistake. AFAICS, the patch does *not* make that mistake because % will not advance over a fractional character. regards, tom lane
Tom Lane wrote: > ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes: > >> Yes, I only used the 'disjoint representations for first-bytes and >> not-first-bytes of MB characters' feature in UTF8. Other encodings >> allows both [AB] and [BA] for MB character patterns. UTF8Match() does >> not cope with those encodings; If we have '[AB][AB]' in a table and >> search it with LIKE '%[BA]%', we judge that they are matched by mistake. >> > > AFAICS, the patch does *not* make that mistake because % will not > advance over a fractional character. > > Yeah, I think that's right. Attached is my current WIP patch. If we decide that this optimisation can in fact be applied to all backend encodings, that will be easily incorporated. It will simplify the code further. Note that all the common code in the MatchText and do_like_escape functions has been factored - and the bytea functions just call the single-byte text versions - AFAICS the effect will be identical to having the specialised versions. (I'm always happy when code volume can be reduced.) cheers andrew Index: src/backend/utils/adt/like.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/utils/adt/like.c,v retrieving revision 1.68 diff -c -r1.68 like.c *** src/backend/utils/adt/like.c 27 Feb 2007 23:48:08 -0000 1.68 --- src/backend/utils/adt/like.c 18 May 2007 02:47:41 -0000 *************** *** 28,48 **** #define LIKE_ABORT (-1) ! static int MatchText(char *t, int tlen, char *p, int plen); ! static int MatchTextIC(char *t, int tlen, char *p, int plen); ! static int MatchBytea(char *t, int tlen, char *p, int plen); ! static text *do_like_escape(text *, text *); ! static int MBMatchText(char *t, int tlen, char *p, int plen); ! static int MBMatchTextIC(char *t, int tlen, char *p, int plen); static text *MB_do_like_escape(text *, text *); /*-------------------- * Support routine for MatchText. Compares given multibyte streams * as wide characters. If they match, returns 1 otherwise returns 0. *-------------------- */ ! static int wchareq(char *p1, char *p2) { int p1_len; --- 28,50 ---- #define LIKE_ABORT (-1) ! static int SB_MatchText(char *t, int tlen, char *p, int plen); ! static int SB_MatchTextIC(char *t, int tlen, char *p, int plen); ! static text *SB_do_like_escape(text *, text *); ! static int MB_MatchText(char *t, int tlen, char *p, int plen); static text *MB_do_like_escape(text *, text *); + static int UTF8_MatchText(char *t, int tlen, char *p, int plen); + static int GenericMatchText(char *s, int slen, char* p, int plen); + static int mbtexticlike(text *str, text *pat); + /*-------------------- * Support routine for MatchText. Compares given multibyte streams * as wide characters. If they match, returns 1 otherwise returns 0. *-------------------- */ ! static __inline__ int wchareq(char *p1, char *p2) { int p1_len; *************** *** 72,86 **** * of getting a single character transformed to the system's wchar_t format. * So now, we just downcase the strings using lower() and apply regular LIKE * comparison. This should be revisited when we install better locale support. - * - * Note that MBMatchText and MBMatchTextIC do exactly the same thing now. - * Is it worth refactoring to avoid duplicated code? They might become - * different again in the future. */ /* Set up to compile like_match.c for multibyte characters */ #define CHAREQ(p1, p2) wchareq(p1, p2) - #define ICHAREQ(p1, p2) wchareq(p1, p2) #define NextChar(p, plen) \ do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0) #define CopyAdvChar(dst, src, srclen) \ --- 74,86 ---- * of getting a single character transformed to the system's wchar_t format. * So now, we just downcase the strings using lower() and apply regular LIKE * comparison. This should be revisited when we install better locale support. */ + #define NextByte(p, plen) ((p)++, (plen)--) + #define BYTEEQ(p1, p2) (*(p1) == *(p2)) + /* Set up to compile like_match.c for multibyte characters */ #define CHAREQ(p1, p2) wchareq(p1, p2) #define NextChar(p, plen) \ do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0) #define CopyAdvChar(dst, src, srclen) \ *************** *** 90,122 **** *(dst)++ = *(src)++; \ } while (0) ! #define MatchText MBMatchText ! #define MatchTextIC MBMatchTextIC #define do_like_escape MB_do_like_escape #include "like_match.c" - #undef CHAREQ - #undef ICHAREQ - #undef NextChar - #undef CopyAdvChar - #undef MatchText - #undef MatchTextIC - #undef do_like_escape - /* Set up to compile like_match.c for single-byte characters */ ! #define CHAREQ(p1, p2) (*(p1) == *(p2)) ! #define ICHAREQ(p1, p2) (tolower((unsigned char) *(p1)) == tolower((unsigned char) *(p2))) ! #define NextChar(p, plen) ((p)++, (plen)--) #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--) #include "like_match.c" - /* And some support for BYTEA */ - #define BYTEA_CHAREQ(p1, p2) (*(p1) == *(p2)) - #define BYTEA_NextChar(p, plen) ((p)++, (plen)--) - #define BYTEA_CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--) /* * interface routines called by the function manager --- 90,170 ---- *(dst)++ = *(src)++; \ } while (0) ! #define MatchText MB_MatchText #define do_like_escape MB_do_like_escape #include "like_match.c" /* Set up to compile like_match.c for single-byte characters */ ! #define CHAREQ(p1, p2) BYTEEQ(p1, p2) ! #define NextChar(p, plen) NextByte(p, plen) #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--) + #define MatchText SB_MatchText + #define do_like_escape SB_do_like_escape + + #include "like_match.c" + + /* set up to compile like_match.c for single byte case insensitive matching */ + + #define CHAREQ(p1, p2) (tolower((unsigned char) *(p1)) == tolower((unsigned char) *(p2))) + #define NextChar(p, plen) NextByte(p, plen) + #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--) + + #define MatchText SB_MatchTextIC + + #include "like_match.c" + + /* set up for UTF8 match optimisation */ + + #define CHAREQ(p1, p2) wchareq(p1, p2) + #define NextChar(p, plen) \ + do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0) + #define CopyAdvChar(dst, src, srclen) \ + do { int __l = pg_mblen(src); \ + (srclen) -= __l; \ + while (__l-- > 0) \ + *(dst)++ = *(src)++; \ + } while (0) + + #define MatchText UTF8_MatchText + #define UTF8_OPT + #include "like_match.c" + static __inline__ int + GenericMatchText(char *s, int slen, char* p, int plen) + { + if (pg_database_encoding_max_length() == 1) + return SB_MatchText(s, slen, p, plen); + else if (GetDatabaseEncoding() == PG_UTF8) + return UTF8_MatchText(s, slen, p, plen); + else + return MB_MatchText(s, slen, p, plen); + } + + static __inline__ int + mbtexticlike(text *str, text *pat) + { + char *s, + *p; + int slen, + plen; + + /* Force inputs to lower case to achieve case insensitivity */ + str = DatumGetTextP(DirectFunctionCall1(lower, PointerGetDatum(str))); + pat = DatumGetTextP(DirectFunctionCall1(lower, PointerGetDatum(pat))); + s = VARDATA(str); + slen = (VARSIZE(str) - VARHDRSZ); + p = VARDATA(pat); + plen = (VARSIZE(pat) - VARHDRSZ); + + if (GetDatabaseEncoding() == PG_UTF8) + return UTF8_MatchText(s, slen, p, plen); + else + return MB_MatchText(s, slen, p, plen); + } /* * interface routines called by the function manager *************** *** 138,147 **** p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! if (pg_database_encoding_max_length() == 1) ! result = (MatchText(s, slen, p, plen) == LIKE_TRUE); ! else ! result = (MBMatchText(s, slen, p, plen) == LIKE_TRUE); PG_RETURN_BOOL(result); } --- 186,192 ---- p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (GenericMatchText(s, slen, p, plen) == LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 162,171 **** p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! if (pg_database_encoding_max_length() == 1) ! result = (MatchText(s, slen, p, plen) != LIKE_TRUE); ! else ! result = (MBMatchText(s, slen, p, plen) != LIKE_TRUE); PG_RETURN_BOOL(result); } --- 207,213 ---- p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (GenericMatchText(s, slen, p, plen) != LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 186,195 **** p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! if (pg_database_encoding_max_length() == 1) ! result = (MatchText(s, slen, p, plen) == LIKE_TRUE); ! else ! result = (MBMatchText(s, slen, p, plen) == LIKE_TRUE); PG_RETURN_BOOL(result); } --- 228,234 ---- p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (GenericMatchText(s, slen, p, plen) == LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 210,219 **** p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! if (pg_database_encoding_max_length() == 1) ! result = (MatchText(s, slen, p, plen) != LIKE_TRUE); ! else ! result = (MBMatchText(s, slen, p, plen) != LIKE_TRUE); PG_RETURN_BOOL(result); } --- 249,255 ---- p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (GenericMatchText(s, slen, p, plen) != LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 234,240 **** p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MatchBytea(s, slen, p, plen) == LIKE_TRUE); PG_RETURN_BOOL(result); } --- 270,276 ---- p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (SB_MatchText(s, slen, p, plen) == LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 255,261 **** p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MatchBytea(s, slen, p, plen) != LIKE_TRUE); PG_RETURN_BOOL(result); } --- 291,297 ---- p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (SB_MatchText(s, slen, p, plen) != LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 281,305 **** slen = strlen(s); p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MatchTextIC(s, slen, p, plen) == LIKE_TRUE); } else { - /* Force inputs to lower case to achieve case insensitivity */ text *strtext; strtext = DatumGetTextP(DirectFunctionCall1(name_text, NameGetDatum(str))); ! strtext = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(strtext))); ! pat = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(pat))); ! ! s = VARDATA(strtext); ! slen = (VARSIZE(strtext) - VARHDRSZ); ! p = VARDATA(pat); ! plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MBMatchTextIC(s, slen, p, plen) == LIKE_TRUE); } PG_RETURN_BOOL(result); --- 317,331 ---- slen = strlen(s); p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (SB_MatchTextIC(s, slen, p, plen) == LIKE_TRUE); } else { text *strtext; strtext = DatumGetTextP(DirectFunctionCall1(name_text, NameGetDatum(str))); ! result = (mbtexticlike(strtext, pat) == LIKE_TRUE); } PG_RETURN_BOOL(result); *************** *** 322,346 **** slen = strlen(s); p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MatchTextIC(s, slen, p, plen) != LIKE_TRUE); } else { - /* Force inputs to lower case to achieve case insensitivity */ text *strtext; strtext = DatumGetTextP(DirectFunctionCall1(name_text, NameGetDatum(str))); ! strtext = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(strtext))); ! pat = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(pat))); ! ! s = VARDATA(strtext); ! slen = (VARSIZE(strtext) - VARHDRSZ); ! p = VARDATA(pat); ! plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MBMatchTextIC(s, slen, p, plen) != LIKE_TRUE); } PG_RETURN_BOOL(result); --- 348,362 ---- slen = strlen(s); p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (SB_MatchTextIC(s, slen, p, plen) != LIKE_TRUE); } else { text *strtext; strtext = DatumGetTextP(DirectFunctionCall1(name_text, NameGetDatum(str))); ! result = (mbtexticlike(strtext, pat) != LIKE_TRUE); } PG_RETURN_BOOL(result); *************** *** 363,383 **** slen = (VARSIZE(str) - VARHDRSZ); p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MatchTextIC(s, slen, p, plen) == LIKE_TRUE); } else ! { ! /* Force inputs to lower case to achieve case insensitivity */ ! str = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(str))); ! pat = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(pat))); ! s = VARDATA(str); ! slen = (VARSIZE(str) - VARHDRSZ); ! p = VARDATA(pat); ! plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MBMatchTextIC(s, slen, p, plen) == LIKE_TRUE); ! } PG_RETURN_BOOL(result); } --- 379,388 ---- slen = (VARSIZE(str) - VARHDRSZ); p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (SB_MatchTextIC(s, slen, p, plen) == LIKE_TRUE); } else ! result = (mbtexticlike(str, pat) == LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 399,419 **** slen = (VARSIZE(str) - VARHDRSZ); p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MatchTextIC(s, slen, p, plen) != LIKE_TRUE); } else ! { ! /* Force inputs to lower case to achieve case insensitivity */ ! str = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(str))); ! pat = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(pat))); ! s = VARDATA(str); ! slen = (VARSIZE(str) - VARHDRSZ); ! p = VARDATA(pat); ! plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MBMatchTextIC(s, slen, p, plen) != LIKE_TRUE); ! } PG_RETURN_BOOL(result); } --- 404,413 ---- slen = (VARSIZE(str) - VARHDRSZ); p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (SB_MatchTextIC(s, slen, p, plen) != LIKE_TRUE); } else ! result = (mbtexticlike(str, pat) != LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 430,436 **** text *result; if (pg_database_encoding_max_length() == 1) ! result = do_like_escape(pat, esc); else result = MB_do_like_escape(pat, esc); --- 424,430 ---- text *result; if (pg_database_encoding_max_length() == 1) ! result = SB_do_like_escape(pat, esc); else result = MB_do_like_escape(pat, esc); *************** *** 446,624 **** { bytea *pat = PG_GETARG_BYTEA_P(0); bytea *esc = PG_GETARG_BYTEA_P(1); ! bytea *result; ! char *p, ! *e, ! *r; ! int plen, ! elen; ! bool afterescape; ! ! p = VARDATA(pat); ! plen = (VARSIZE(pat) - VARHDRSZ); ! e = VARDATA(esc); ! elen = (VARSIZE(esc) - VARHDRSZ); ! ! /* ! * Worst-case pattern growth is 2x --- unlikely, but it's hardly worth ! * trying to calculate the size more accurately than that. ! */ ! result = (text *) palloc(plen * 2 + VARHDRSZ); ! r = VARDATA(result); ! ! if (elen == 0) ! { ! /* ! * No escape character is wanted. Double any backslashes in the ! * pattern to make them act like ordinary characters. ! */ ! while (plen > 0) ! { ! if (*p == '\\') ! *r++ = '\\'; ! BYTEA_CopyAdvChar(r, p, plen); ! } ! } ! else ! { ! /* ! * The specified escape must be only a single character. ! */ ! BYTEA_NextChar(e, elen); ! if (elen != 0) ! ereport(ERROR, ! (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE), ! errmsg("invalid escape string"), ! errhint("Escape string must be empty or one character."))); ! ! e = VARDATA(esc); ! ! /* ! * If specified escape is '\', just copy the pattern as-is. ! */ ! if (*e == '\\') ! { ! memcpy(result, pat, VARSIZE(pat)); ! PG_RETURN_BYTEA_P(result); ! } ! ! /* ! * Otherwise, convert occurrences of the specified escape character to ! * '\', and double occurrences of '\' --- unless they immediately ! * follow an escape character! ! */ ! afterescape = false; ! while (plen > 0) ! { ! if (BYTEA_CHAREQ(p, e) && !afterescape) ! { ! *r++ = '\\'; ! BYTEA_NextChar(p, plen); ! afterescape = true; ! } ! else if (*p == '\\') ! { ! *r++ = '\\'; ! if (!afterescape) ! *r++ = '\\'; ! BYTEA_NextChar(p, plen); ! afterescape = false; ! } ! else ! { ! BYTEA_CopyAdvChar(r, p, plen); ! afterescape = false; ! } ! } ! } ! ! SET_VARSIZE(result, r - ((char *) result)); ! PG_RETURN_BYTEA_P(result); } - /* - * Same as above, but specifically for bytea (binary) datatype - */ - static int - MatchBytea(char *t, int tlen, char *p, int plen) - { - /* Fast path for match-everything pattern */ - if ((plen == 1) && (*p == '%')) - return LIKE_TRUE; - - while ((tlen > 0) && (plen > 0)) - { - if (*p == '\\') - { - /* Next pattern char must match literally, whatever it is */ - BYTEA_NextChar(p, plen); - if ((plen <= 0) || !BYTEA_CHAREQ(t, p)) - return LIKE_FALSE; - } - else if (*p == '%') - { - /* %% is the same as % according to the SQL standard */ - /* Advance past all %'s */ - while ((plen > 0) && (*p == '%')) - BYTEA_NextChar(p, plen); - /* Trailing percent matches everything. */ - if (plen <= 0) - return LIKE_TRUE; - - /* - * Otherwise, scan for a text position at which we can match the - * rest of the pattern. - */ - while (tlen > 0) - { - /* - * Optimization to prevent most recursion: don't recurse - * unless first pattern char might match this text char. - */ - if (BYTEA_CHAREQ(t, p) || (*p == '\\') || (*p == '_')) - { - int matched = MatchBytea(t, tlen, p, plen); - - if (matched != LIKE_FALSE) - return matched; /* TRUE or ABORT */ - } - - BYTEA_NextChar(t, tlen); - } - - /* - * End of text with no match, so no point in trying later places - * to start matching this pattern. - */ - return LIKE_ABORT; - } - else if ((*p != '_') && !BYTEA_CHAREQ(t, p)) - { - /* - * Not the single-character wildcard and no explicit match? Then - * time to quit... - */ - return LIKE_FALSE; - } - - BYTEA_NextChar(t, tlen); - BYTEA_NextChar(p, plen); - } - - if (tlen > 0) - return LIKE_FALSE; /* end of pattern, but not of text */ - - /* End of input string. Do we have matching pattern remaining? */ - while ((plen > 0) && (*p == '%')) /* allow multiple %'s at end of - * pattern */ - BYTEA_NextChar(p, plen); - if (plen <= 0) - return LIKE_TRUE; - - /* - * End of text with no match, so no point in trying later places to start - * matching this pattern. - */ - return LIKE_ABORT; - } /* MatchBytea() */ --- 440,447 ---- { bytea *pat = PG_GETARG_BYTEA_P(0); bytea *esc = PG_GETARG_BYTEA_P(1); ! bytea *result = SB_do_like_escape((text *)pat, (text *)esc); ! PG_RETURN_BYTEA_P((bytea *)result); } Index: src/backend/utils/adt/like_match.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/utils/adt/like_match.c,v retrieving revision 1.15 diff -c -r1.15 like_match.c *** src/backend/utils/adt/like_match.c 27 Feb 2007 23:48:08 -0000 1.15 --- src/backend/utils/adt/like_match.c 18 May 2007 02:47:41 -0000 *************** *** 9,19 **** * Before the inclusion, we need to define following macros: * * CHAREQ - * ICHAREQ * NextChar * CopyAdvChar * MatchText (MBMatchText) - * MatchTextIC (MBMatchTextIC) * do_like_escape (MB_do_like_escape) * * Copyright (c) 1996-2007, PostgreSQL Global Development Group --- 9,17 ---- *************** *** 82,88 **** if (*p == '\\') { /* Next pattern char must match literally, whatever it is */ ! NextChar(p, plen); if ((plen <= 0) || !CHAREQ(t, p)) return LIKE_FALSE; } --- 80,86 ---- if (*p == '\\') { /* Next pattern char must match literally, whatever it is */ ! NextByte(p, plen); if ((plen <= 0) || !CHAREQ(t, p)) return LIKE_FALSE; } *************** *** 91,97 **** /* %% is the same as % according to the SQL standard */ /* Advance past all %'s */ while ((plen > 0) && (*p == '%')) ! NextChar(p, plen); /* Trailing percent matches everything. */ if (plen <= 0) return LIKE_TRUE; --- 89,95 ---- /* %% is the same as % according to the SQL standard */ /* Advance past all %'s */ while ((plen > 0) && (*p == '%')) ! NextByte(p, plen); /* Trailing percent matches everything. */ if (plen <= 0) return LIKE_TRUE; *************** *** 123,129 **** */ return LIKE_ABORT; } ! else if ((*p != '_') && !CHAREQ(t, p)) { /* * Not the single-character wildcard and no explicit match? Then --- 121,146 ---- */ return LIKE_ABORT; } ! #ifdef UTF8_OPT ! /* ! * UTF8 is optimised to do byte at a time matching in most cases, ! * thus saving expensive calls to NextChar. ! * ! * UTF8 has disjoint representations for first-bytes and ! * not-first-bytes of MB characters, and thus it is ! * impossible to make a false match in which an MB pattern ! * character is matched to the end of one data character ! * plus the start of another. ! * In character sets without that property, we have to use the ! * slow way to ensure we don't make out-of-sync matches. ! */ ! else if (*p == '_') ! { ! NextChar(t, tlen); ! NextByte(p, plen); ! continue; ! } ! else if (!BYTEEQ(t, p)) { /* * Not the single-character wildcard and no explicit match? Then *************** *** 132,215 **** return LIKE_FALSE; } ! NextChar(t, tlen); ! NextChar(p, plen); ! } ! ! if (tlen > 0) ! return LIKE_FALSE; /* end of pattern, but not of text */ ! ! /* End of input string. Do we have matching pattern remaining? */ ! while ((plen > 0) && (*p == '%')) /* allow multiple %'s at end of ! * pattern */ ! NextChar(p, plen); ! if (plen <= 0) ! return LIKE_TRUE; ! ! /* ! * End of text with no match, so no point in trying later places to start ! * matching this pattern. ! */ ! return LIKE_ABORT; ! } /* MatchText() */ ! ! /* ! * Same as above, but ignore case ! */ ! static int ! MatchTextIC(char *t, int tlen, char *p, int plen) ! { ! /* Fast path for match-everything pattern */ ! if ((plen == 1) && (*p == '%')) ! return LIKE_TRUE; ! ! while ((tlen > 0) && (plen > 0)) ! { ! if (*p == '\\') ! { ! /* Next pattern char must match literally, whatever it is */ ! NextChar(p, plen); ! if ((plen <= 0) || !ICHAREQ(t, p)) ! return LIKE_FALSE; ! } ! else if (*p == '%') ! { ! /* %% is the same as % according to the SQL standard */ ! /* Advance past all %'s */ ! while ((plen > 0) && (*p == '%')) ! NextChar(p, plen); ! /* Trailing percent matches everything. */ ! if (plen <= 0) ! return LIKE_TRUE; ! ! /* ! * Otherwise, scan for a text position at which we can match the ! * rest of the pattern. ! */ ! while (tlen > 0) ! { ! /* ! * Optimization to prevent most recursion: don't recurse ! * unless first pattern char might match this text char. ! */ ! if (ICHAREQ(t, p) || (*p == '\\') || (*p == '_')) ! { ! int matched = MatchTextIC(t, tlen, p, plen); ! ! if (matched != LIKE_FALSE) ! return matched; /* TRUE or ABORT */ ! } ! ! NextChar(t, tlen); ! } ! ! /* ! * End of text with no match, so no point in trying later places ! * to start matching this pattern. ! */ ! return LIKE_ABORT; ! } ! else if ((*p != '_') && !ICHAREQ(t, p)) { /* * Not the single-character wildcard and no explicit match? Then --- 149,163 ---- return LIKE_FALSE; } ! NextByte(t, tlen); ! NextByte(p, plen); ! #else ! /* ! * Branch for non-utf8 multi-byte charsets and also for single-byte ! * charsets which don't gain any benefir from the above optimisation. ! */ ! ! else if ((*p != '_') && !CHAREQ(t, p)) { /* * Not the single-character wildcard and no explicit match? Then *************** *** 220,225 **** --- 168,175 ---- NextChar(t, tlen); NextChar(p, plen); + + #endif /* UTF8_OPT */ } if (tlen > 0) *************** *** 228,234 **** /* End of input string. Do we have matching pattern remaining? */ while ((plen > 0) && (*p == '%')) /* allow multiple %'s at end of * pattern */ ! NextChar(p, plen); if (plen <= 0) return LIKE_TRUE; --- 178,184 ---- /* End of input string. Do we have matching pattern remaining? */ while ((plen > 0) && (*p == '%')) /* allow multiple %'s at end of * pattern */ ! NextByte(p, plen); if (plen <= 0) return LIKE_TRUE; *************** *** 237,248 **** * matching this pattern. */ return LIKE_ABORT; ! } /* MatchTextIC() */ /* * like_escape() --- given a pattern and an ESCAPE string, * convert the pattern to use Postgres' standard backslash escape convention. */ static text * do_like_escape(text *pat, text *esc) { --- 187,200 ---- * matching this pattern. */ return LIKE_ABORT; ! } /* MatchText() */ /* * like_escape() --- given a pattern and an ESCAPE string, * convert the pattern to use Postgres' standard backslash escape convention. */ + #ifdef do_like_escape + static text * do_like_escape(text *pat, text *esc) { *************** *** 336,338 **** --- 288,304 ---- return result; } + #endif /* do_like_escape */ + + #undef CHAREQ + #undef NextChar + #undef CopyAdvChar + #undef MatchText + + #ifdef do_like_escape + #undef do_like_escape + #endif + + #ifdef UTF8_OPT + #undef UTF8_OPT + #endif
Andrew Dunstan <andrew@dunslane.net> writes: > Attached is my current WIP patch. A few quick eyeball comments: > ! static __inline__ int Under *no* circumstances use __inline__, as it will certainly break every non-gcc compiler. Use "inline", which we #define appropriately at need. > ! * UTF8 has disjoint representations for first-bytes and > ! * not-first-bytes of MB characters, and thus it is > ! * impossible to make a false match in which an MB pattern > ! * character is matched to the end of one data character > ! * plus the start of another. > ! * In character sets without that property, we have to use the > ! * slow way to ensure we don't make out-of-sync matches. I thought we'd concluded that this explanation is pseudo-science? > ! * Branch for non-utf8 multi-byte charsets and also for single-byte > ! * charsets which don't gain any benefir from the above optimisation. spellcheck... regards, tom lane
Tom Lane wrote: > Under *no* circumstances use __inline__, as it will certainly break > every non-gcc compiler. Use "inline", which we #define appropriately > at need. > OK. (this was from upstream patch.) > > I thought we'd concluded that this explanation is pseudo-science? > > [...] > > spellcheck... > > > Right. I'm waiting on a consensus about the UTF8-ness of the solution before I adjust comments. cheers andrew
Tom Lane wrote: > ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes: > >> Yes, I only used the 'disjoint representations for first-bytes and >> not-first-bytes of MB characters' feature in UTF8. Other encodings >> allows both [AB] and [BA] for MB character patterns. UTF8Match() does >> not cope with those encodings; If we have '[AB][AB]' in a table and >> search it with LIKE '%[BA]%', we judge that they are matched by mistake. >> > > AFAICS, the patch does *not* make that mistake because % will not > advance over a fractional character. > > > Unless I hear differently, my present intention is to apply the suggested improvement universally. I'll wait a day or two before completing the patch. cheers andrew
Tom Lane skrev: > You could imagine trying to do > % a byte at a time (and indeed that's what I'd been thinking it did) > but that gets you out of sync which breaks the _ case. It is only when you have a pattern like '%_' when this is a problem and we could detect this and do byte by byte when it's not. Now we check (*p == '\\') || (*p == '_') in each iteration when we scan over characters for '%', and we could do it once and have different loops for the two cases. Other than this part that I think can be optimized I don't see anything wrong with the idea behind the patch. To make the '%' case fast might be an important optimization for a lot of use cases. It's not uncommon that '%' matches a bigger part of the string than the rest of the pattern. It's easy to make a misstake when one is used to think about the simple fixed size characters like ascii. Strange that this simple topic can be so difficult to think about... :-) /Dennis
Dennis Bjorklund wrote: > Tom Lane skrev: >> You could imagine trying to do >> % a byte at a time (and indeed that's what I'd been thinking it did) >> but that gets you out of sync which breaks the _ case. > > It is only when you have a pattern like '%_' when this is a problem > and we could detect this and do byte by byte when it's not. Now we > check (*p == '\\') || (*p == '_') in each iteration when we scan over > characters for '%', and we could do it once and have different loops > for the two cases. > > Other than this part that I think can be optimized I don't see > anything wrong with the idea behind the patch. To make the '%' case > fast might be an important optimization for a lot of use cases. It's > not uncommon that '%' matches a bigger part of the string than the > rest of the pattern. > Are you sure? The big remaining char-matching bottleneck will surely be in the code that scans for a place to start matching a %. But that's exactly where we can't use byte matching for cases where the charset might include AB and BA as characters - the pattern might contain %BA and the string AB. However, this isn't a danger for UTF8, which leads me to think that we do indeed need a special case for UTF8, but for a different improvement from that proposed in the original patch. I'll post an updated patch shortly. cheers andrew
I wrote: > > >> >> It is only when you have a pattern like '%_' when this is a problem >> and we could detect this and do byte by byte when it's not. Now we >> check (*p == '\\') || (*p == '_') in each iteration when we scan over >> characters for '%', and we could do it once and have different loops >> for the two cases. >> >> Other than this part that I think can be optimized I don't see >> anything wrong with the idea behind the patch. To make the '%' case >> fast might be an important optimization for a lot of use cases. It's >> not uncommon that '%' matches a bigger part of the string than the >> rest of the pattern. >> > > > Are you sure? The big remaining char-matching bottleneck will surely > be in the code that scans for a place to start matching a %. But > that's exactly where we can't use byte matching for cases where the > charset might include AB and BA as characters - the pattern might > contain %BA and the string AB. However, this isn't a danger for UTF8, > which leads me to think that we do indeed need a special case for > UTF8, but for a different improvement from that proposed in the > original patch. I'll post an updated patch shortly. > Here is a patch that implements this. Please analyse for possible breakage. cheers andrew
oops. patch attached this time Andrew Dunstan wrote: > > > I wrote: >> >> >>> >>> It is only when you have a pattern like '%_' when this is a problem >>> and we could detect this and do byte by byte when it's not. Now we >>> check (*p == '\\') || (*p == '_') in each iteration when we scan >>> over characters for '%', and we could do it once and have different >>> loops for the two cases. >>> >>> Other than this part that I think can be optimized I don't see >>> anything wrong with the idea behind the patch. To make the '%' case >>> fast might be an important optimization for a lot of use cases. It's >>> not uncommon that '%' matches a bigger part of the string than the >>> rest of the pattern. >>> >> >> >> Are you sure? The big remaining char-matching bottleneck will surely >> be in the code that scans for a place to start matching a %. But >> that's exactly where we can't use byte matching for cases where the >> charset might include AB and BA as characters - the pattern might >> contain %BA and the string AB. However, this isn't a danger for UTF8, >> which leads me to think that we do indeed need a special case for >> UTF8, but for a different improvement from that proposed in the >> original patch. I'll post an updated patch shortly. >> > > Here is a patch that implements this. Please analyse for possible > breakage. > > cheers > > andrew > > > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match > Index: src/backend/utils/adt/like.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/utils/adt/like.c,v retrieving revision 1.68 diff -c -r1.68 like.c *** src/backend/utils/adt/like.c 27 Feb 2007 23:48:08 -0000 1.68 --- src/backend/utils/adt/like.c 20 May 2007 14:16:22 -0000 *************** *** 28,48 **** #define LIKE_ABORT (-1) ! static int MatchText(char *t, int tlen, char *p, int plen); ! static int MatchTextIC(char *t, int tlen, char *p, int plen); ! static int MatchBytea(char *t, int tlen, char *p, int plen); ! static text *do_like_escape(text *, text *); ! static int MBMatchText(char *t, int tlen, char *p, int plen); ! static int MBMatchTextIC(char *t, int tlen, char *p, int plen); static text *MB_do_like_escape(text *, text *); /*-------------------- * Support routine for MatchText. Compares given multibyte streams * as wide characters. If they match, returns 1 otherwise returns 0. *-------------------- */ ! static int wchareq(char *p1, char *p2) { int p1_len; --- 28,51 ---- #define LIKE_ABORT (-1) ! static int SB_MatchText(char *t, int tlen, char *p, int plen); ! static int SB_MatchTextIC(char *t, int tlen, char *p, int plen); ! static text *SB_do_like_escape(text *, text *); ! static int MB_MatchText(char *t, int tlen, char *p, int plen); static text *MB_do_like_escape(text *, text *); + static int UTF8_MatchText(char *t, int tlen, char *p, int plen); + + static int GenericMatchText(char *s, int slen, char* p, int plen); + static int mbtexticlike(text *str, text *pat); + /*-------------------- * Support routine for MatchText. Compares given multibyte streams * as wide characters. If they match, returns 1 otherwise returns 0. *-------------------- */ ! static inline int wchareq(char *p1, char *p2) { int p1_len; *************** *** 72,86 **** * of getting a single character transformed to the system's wchar_t format. * So now, we just downcase the strings using lower() and apply regular LIKE * comparison. This should be revisited when we install better locale support. - * - * Note that MBMatchText and MBMatchTextIC do exactly the same thing now. - * Is it worth refactoring to avoid duplicated code? They might become - * different again in the future. */ /* Set up to compile like_match.c for multibyte characters */ #define CHAREQ(p1, p2) wchareq(p1, p2) - #define ICHAREQ(p1, p2) wchareq(p1, p2) #define NextChar(p, plen) \ do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0) #define CopyAdvChar(dst, src, srclen) \ --- 75,87 ---- * of getting a single character transformed to the system's wchar_t format. * So now, we just downcase the strings using lower() and apply regular LIKE * comparison. This should be revisited when we install better locale support. */ + #define NextByte(p, plen) ((p)++, (plen)--) + #define BYTEEQ(p1, p2) (*(p1) == *(p2)) + /* Set up to compile like_match.c for multibyte characters */ #define CHAREQ(p1, p2) wchareq(p1, p2) #define NextChar(p, plen) \ do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0) #define CopyAdvChar(dst, src, srclen) \ *************** *** 89,122 **** while (__l-- > 0) \ *(dst)++ = *(src)++; \ } while (0) ! #define MatchText MBMatchText ! #define MatchTextIC MBMatchTextIC #define do_like_escape MB_do_like_escape #include "like_match.c" ! #undef CHAREQ ! #undef ICHAREQ ! #undef NextChar ! #undef CopyAdvChar ! #undef MatchText ! #undef MatchTextIC ! #undef do_like_escape /* Set up to compile like_match.c for single-byte characters */ ! #define CHAREQ(p1, p2) (*(p1) == *(p2)) ! #define ICHAREQ(p1, p2) (tolower((unsigned char) *(p1)) == tolower((unsigned char) *(p2))) ! #define NextChar(p, plen) ((p)++, (plen)--) #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--) #include "like_match.c" ! /* And some support for BYTEA */ ! #define BYTEA_CHAREQ(p1, p2) (*(p1) == *(p2)) ! #define BYTEA_NextChar(p, plen) ((p)++, (plen)--) ! #define BYTEA_CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--) /* * interface routines called by the function manager --- 90,173 ---- while (__l-- > 0) \ *(dst)++ = *(src)++; \ } while (0) + #define BYTEEQIC(b1, b2) BYTEEQ(b1, b2) ! #define MatchText MB_MatchText #define do_like_escape MB_do_like_escape #include "like_match.c" ! /* Set up to compile like_match.c for UTF8 characters */ ! #define CHAREQ(p1, p2) wchareq(p1, p2) ! #define NextChar(p, plen) \ ! do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0) ! #define CopyAdvChar(dst, src, srclen) \ ! do { int __l = pg_mblen(src); \ ! (srclen) -= __l; \ ! while (__l-- > 0) \ ! *(dst)++ = *(src)++; \ ! } while (0) ! #define BYTEEQIC(b1, b2) BYTEEQ(b1, b2) ! #define UTF8_OPT ! ! #define MatchText UTF8_MatchText ! ! #include "like_match.c" /* Set up to compile like_match.c for single-byte characters */ ! #define CHAREQ(p1, p2) BYTEEQ(p1, p2) ! #define NextChar(p, plen) NextByte(p, plen) #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--) + #define BYTEEQIC(b1, b2) BYTEEQ(b1, b2) + + #define MatchText SB_MatchText + #define do_like_escape SB_do_like_escape #include "like_match.c" ! /* set up to compile like_match.c for single byte case insensitive matching */ + #define CHAREQ(p1, p2) (tolower((unsigned char) *(p1)) == tolower((unsigned char) *(p2))) + #define NextChar(p, plen) NextByte(p, plen) + #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--) + #define BYTEEQIC(b1, b2) CHAREQ(b1, b2) + + #define MatchText SB_MatchTextIC + + #include "like_match.c" + + static inline int + GenericMatchText(char *s, int slen, char* p, int plen) + { + if (pg_database_encoding_max_length() == 1) + return SB_MatchText(s, slen, p, plen); + else if (GetDatabaseEncoding() == PG_UTF8) + return UTF8_MatchText(s, slen, p, plen); + else + return MB_MatchText(s, slen, p, plen); + } + + static inline int + mbtexticlike(text *str, text *pat) + { + char *s, + *p; + int slen, + plen; + + /* Force inputs to lower case to achieve case insensitivity */ + str = DatumGetTextP(DirectFunctionCall1(lower, PointerGetDatum(str))); + pat = DatumGetTextP(DirectFunctionCall1(lower, PointerGetDatum(pat))); + s = VARDATA(str); + slen = (VARSIZE(str) - VARHDRSZ); + p = VARDATA(pat); + plen = (VARSIZE(pat) - VARHDRSZ); + + if (GetDatabaseEncoding() == PG_UTF8) + return UTF8_MatchText(s, slen, p, plen); + else + return MB_MatchText(s, slen, p, plen); + } /* * interface routines called by the function manager *************** *** 138,147 **** p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! if (pg_database_encoding_max_length() == 1) ! result = (MatchText(s, slen, p, plen) == LIKE_TRUE); ! else ! result = (MBMatchText(s, slen, p, plen) == LIKE_TRUE); PG_RETURN_BOOL(result); } --- 189,195 ---- p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (GenericMatchText(s, slen, p, plen) == LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 162,171 **** p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! if (pg_database_encoding_max_length() == 1) ! result = (MatchText(s, slen, p, plen) != LIKE_TRUE); ! else ! result = (MBMatchText(s, slen, p, plen) != LIKE_TRUE); PG_RETURN_BOOL(result); } --- 210,216 ---- p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (GenericMatchText(s, slen, p, plen) != LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 186,195 **** p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! if (pg_database_encoding_max_length() == 1) ! result = (MatchText(s, slen, p, plen) == LIKE_TRUE); ! else ! result = (MBMatchText(s, slen, p, plen) == LIKE_TRUE); PG_RETURN_BOOL(result); } --- 231,237 ---- p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (GenericMatchText(s, slen, p, plen) == LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 210,219 **** p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! if (pg_database_encoding_max_length() == 1) ! result = (MatchText(s, slen, p, plen) != LIKE_TRUE); ! else ! result = (MBMatchText(s, slen, p, plen) != LIKE_TRUE); PG_RETURN_BOOL(result); } --- 252,258 ---- p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (GenericMatchText(s, slen, p, plen) != LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 234,240 **** p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MatchBytea(s, slen, p, plen) == LIKE_TRUE); PG_RETURN_BOOL(result); } --- 273,279 ---- p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (SB_MatchText(s, slen, p, plen) == LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 255,261 **** p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MatchBytea(s, slen, p, plen) != LIKE_TRUE); PG_RETURN_BOOL(result); } --- 294,300 ---- p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (SB_MatchText(s, slen, p, plen) != LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 281,305 **** slen = strlen(s); p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MatchTextIC(s, slen, p, plen) == LIKE_TRUE); } else { - /* Force inputs to lower case to achieve case insensitivity */ text *strtext; strtext = DatumGetTextP(DirectFunctionCall1(name_text, NameGetDatum(str))); ! strtext = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(strtext))); ! pat = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(pat))); ! ! s = VARDATA(strtext); ! slen = (VARSIZE(strtext) - VARHDRSZ); ! p = VARDATA(pat); ! plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MBMatchTextIC(s, slen, p, plen) == LIKE_TRUE); } PG_RETURN_BOOL(result); --- 320,334 ---- slen = strlen(s); p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (SB_MatchTextIC(s, slen, p, plen) == LIKE_TRUE); } else { text *strtext; strtext = DatumGetTextP(DirectFunctionCall1(name_text, NameGetDatum(str))); ! result = (mbtexticlike(strtext, pat) == LIKE_TRUE); } PG_RETURN_BOOL(result); *************** *** 322,346 **** slen = strlen(s); p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MatchTextIC(s, slen, p, plen) != LIKE_TRUE); } else { - /* Force inputs to lower case to achieve case insensitivity */ text *strtext; strtext = DatumGetTextP(DirectFunctionCall1(name_text, NameGetDatum(str))); ! strtext = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(strtext))); ! pat = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(pat))); ! ! s = VARDATA(strtext); ! slen = (VARSIZE(strtext) - VARHDRSZ); ! p = VARDATA(pat); ! plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MBMatchTextIC(s, slen, p, plen) != LIKE_TRUE); } PG_RETURN_BOOL(result); --- 351,365 ---- slen = strlen(s); p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (SB_MatchTextIC(s, slen, p, plen) != LIKE_TRUE); } else { text *strtext; strtext = DatumGetTextP(DirectFunctionCall1(name_text, NameGetDatum(str))); ! result = (mbtexticlike(strtext, pat) != LIKE_TRUE); } PG_RETURN_BOOL(result); *************** *** 363,383 **** slen = (VARSIZE(str) - VARHDRSZ); p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MatchTextIC(s, slen, p, plen) == LIKE_TRUE); } else ! { ! /* Force inputs to lower case to achieve case insensitivity */ ! str = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(str))); ! pat = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(pat))); ! s = VARDATA(str); ! slen = (VARSIZE(str) - VARHDRSZ); ! p = VARDATA(pat); ! plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MBMatchTextIC(s, slen, p, plen) == LIKE_TRUE); ! } PG_RETURN_BOOL(result); } --- 382,391 ---- slen = (VARSIZE(str) - VARHDRSZ); p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (SB_MatchTextIC(s, slen, p, plen) == LIKE_TRUE); } else ! result = (mbtexticlike(str, pat) == LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 399,419 **** slen = (VARSIZE(str) - VARHDRSZ); p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MatchTextIC(s, slen, p, plen) != LIKE_TRUE); } else ! { ! /* Force inputs to lower case to achieve case insensitivity */ ! str = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(str))); ! pat = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(pat))); ! s = VARDATA(str); ! slen = (VARSIZE(str) - VARHDRSZ); ! p = VARDATA(pat); ! plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MBMatchTextIC(s, slen, p, plen) != LIKE_TRUE); ! } PG_RETURN_BOOL(result); } --- 407,416 ---- slen = (VARSIZE(str) - VARHDRSZ); p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (SB_MatchTextIC(s, slen, p, plen) != LIKE_TRUE); } else ! result = (mbtexticlike(str, pat) != LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 430,436 **** text *result; if (pg_database_encoding_max_length() == 1) ! result = do_like_escape(pat, esc); else result = MB_do_like_escape(pat, esc); --- 427,433 ---- text *result; if (pg_database_encoding_max_length() == 1) ! result = SB_do_like_escape(pat, esc); else result = MB_do_like_escape(pat, esc); *************** *** 446,624 **** { bytea *pat = PG_GETARG_BYTEA_P(0); bytea *esc = PG_GETARG_BYTEA_P(1); ! bytea *result; ! char *p, ! *e, ! *r; ! int plen, ! elen; ! bool afterescape; ! ! p = VARDATA(pat); ! plen = (VARSIZE(pat) - VARHDRSZ); ! e = VARDATA(esc); ! elen = (VARSIZE(esc) - VARHDRSZ); ! ! /* ! * Worst-case pattern growth is 2x --- unlikely, but it's hardly worth ! * trying to calculate the size more accurately than that. ! */ ! result = (text *) palloc(plen * 2 + VARHDRSZ); ! r = VARDATA(result); ! ! if (elen == 0) ! { ! /* ! * No escape character is wanted. Double any backslashes in the ! * pattern to make them act like ordinary characters. ! */ ! while (plen > 0) ! { ! if (*p == '\\') ! *r++ = '\\'; ! BYTEA_CopyAdvChar(r, p, plen); ! } ! } ! else ! { ! /* ! * The specified escape must be only a single character. ! */ ! BYTEA_NextChar(e, elen); ! if (elen != 0) ! ereport(ERROR, ! (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE), ! errmsg("invalid escape string"), ! errhint("Escape string must be empty or one character."))); ! ! e = VARDATA(esc); ! ! /* ! * If specified escape is '\', just copy the pattern as-is. ! */ ! if (*e == '\\') ! { ! memcpy(result, pat, VARSIZE(pat)); ! PG_RETURN_BYTEA_P(result); ! } ! ! /* ! * Otherwise, convert occurrences of the specified escape character to ! * '\', and double occurrences of '\' --- unless they immediately ! * follow an escape character! ! */ ! afterescape = false; ! while (plen > 0) ! { ! if (BYTEA_CHAREQ(p, e) && !afterescape) ! { ! *r++ = '\\'; ! BYTEA_NextChar(p, plen); ! afterescape = true; ! } ! else if (*p == '\\') ! { ! *r++ = '\\'; ! if (!afterescape) ! *r++ = '\\'; ! BYTEA_NextChar(p, plen); ! afterescape = false; ! } ! else ! { ! BYTEA_CopyAdvChar(r, p, plen); ! afterescape = false; ! } ! } ! } ! SET_VARSIZE(result, r - ((char *) result)); ! ! PG_RETURN_BYTEA_P(result); } - /* - * Same as above, but specifically for bytea (binary) datatype - */ - static int - MatchBytea(char *t, int tlen, char *p, int plen) - { - /* Fast path for match-everything pattern */ - if ((plen == 1) && (*p == '%')) - return LIKE_TRUE; - - while ((tlen > 0) && (plen > 0)) - { - if (*p == '\\') - { - /* Next pattern char must match literally, whatever it is */ - BYTEA_NextChar(p, plen); - if ((plen <= 0) || !BYTEA_CHAREQ(t, p)) - return LIKE_FALSE; - } - else if (*p == '%') - { - /* %% is the same as % according to the SQL standard */ - /* Advance past all %'s */ - while ((plen > 0) && (*p == '%')) - BYTEA_NextChar(p, plen); - /* Trailing percent matches everything. */ - if (plen <= 0) - return LIKE_TRUE; - - /* - * Otherwise, scan for a text position at which we can match the - * rest of the pattern. - */ - while (tlen > 0) - { - /* - * Optimization to prevent most recursion: don't recurse - * unless first pattern char might match this text char. - */ - if (BYTEA_CHAREQ(t, p) || (*p == '\\') || (*p == '_')) - { - int matched = MatchBytea(t, tlen, p, plen); - - if (matched != LIKE_FALSE) - return matched; /* TRUE or ABORT */ - } - - BYTEA_NextChar(t, tlen); - } - - /* - * End of text with no match, so no point in trying later places - * to start matching this pattern. - */ - return LIKE_ABORT; - } - else if ((*p != '_') && !BYTEA_CHAREQ(t, p)) - { - /* - * Not the single-character wildcard and no explicit match? Then - * time to quit... - */ - return LIKE_FALSE; - } - - BYTEA_NextChar(t, tlen); - BYTEA_NextChar(p, plen); - } - - if (tlen > 0) - return LIKE_FALSE; /* end of pattern, but not of text */ - - /* End of input string. Do we have matching pattern remaining? */ - while ((plen > 0) && (*p == '%')) /* allow multiple %'s at end of - * pattern */ - BYTEA_NextChar(p, plen); - if (plen <= 0) - return LIKE_TRUE; - - /* - * End of text with no match, so no point in trying later places to start - * matching this pattern. - */ - return LIKE_ABORT; - } /* MatchBytea() */ --- 443,450 ---- { bytea *pat = PG_GETARG_BYTEA_P(0); bytea *esc = PG_GETARG_BYTEA_P(1); ! bytea *result = SB_do_like_escape((text *)pat, (text *)esc); ! PG_RETURN_BYTEA_P((bytea *)result); } Index: src/backend/utils/adt/like_match.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/utils/adt/like_match.c,v retrieving revision 1.15 diff -c -r1.15 like_match.c *** src/backend/utils/adt/like_match.c 27 Feb 2007 23:48:08 -0000 1.15 --- src/backend/utils/adt/like_match.c 20 May 2007 14:16:24 -0000 *************** *** 9,19 **** * Before the inclusion, we need to define following macros: * * CHAREQ ! * ICHAREQ * NextChar * CopyAdvChar * MatchText (MBMatchText) - * MatchTextIC (MBMatchTextIC) * do_like_escape (MB_do_like_escape) * * Copyright (c) 1996-2007, PostgreSQL Global Development Group --- 9,18 ---- * Before the inclusion, we need to define following macros: * * CHAREQ ! * BYTEEQIC * NextChar * CopyAdvChar * MatchText (MBMatchText) * do_like_escape (MB_do_like_escape) * * Copyright (c) 1996-2007, PostgreSQL Global Development Group *************** *** 70,75 **** --- 69,82 ---- *-------------------- */ + #ifdef UTF8_OPT + #define PCT_CHAREQ(c1, c2) BYTEEQ(c1, c2) + #define PCT_NEXT(a, b) NextByte(a, b) + #else + #define PCT_CHAREQ(c1, c2) CHAREQ(c1, c2) + #define PCT_NEXT(a, b) NextChar(a, b) + #endif + static int MatchText(char *t, int tlen, char *p, int plen) { *************** *** 81,89 **** { if (*p == '\\') { ! /* Next pattern char must match literally, whatever it is */ ! NextChar(p, plen); ! if ((plen <= 0) || !CHAREQ(t, p)) return LIKE_FALSE; } else if (*p == '%') --- 88,96 ---- { if (*p == '\\') { ! /* Next byte must match literally, whatever it is */ ! NextByte(p, plen); ! if ((plen <= 0) || !BYTEEQ(t, p)) return LIKE_FALSE; } else if (*p == '%') *************** *** 91,97 **** /* %% is the same as % according to the SQL standard */ /* Advance past all %'s */ while ((plen > 0) && (*p == '%')) ! NextChar(p, plen); /* Trailing percent matches everything. */ if (plen <= 0) return LIKE_TRUE; --- 98,104 ---- /* %% is the same as % according to the SQL standard */ /* Advance past all %'s */ while ((plen > 0) && (*p == '%')) ! NextByte(p, plen); /* Trailing percent matches everything. */ if (plen <= 0) return LIKE_TRUE; *************** *** 106,112 **** * Optimization to prevent most recursion: don't recurse * unless first pattern char might match this text char. */ ! if (CHAREQ(t, p) || (*p == '\\') || (*p == '_')) { int matched = MatchText(t, tlen, p, plen); --- 113,119 ---- * Optimization to prevent most recursion: don't recurse * unless first pattern char might match this text char. */ ! if (PCT_CHAREQ(t, p) || (*p == '\\') || (*p == '_')) { int matched = MatchText(t, tlen, p, plen); *************** *** 114,120 **** return matched; /* TRUE or ABORT */ } ! NextChar(t, tlen); } /* --- 121,127 ---- return matched; /* TRUE or ABORT */ } ! PCT_NEXT(t, tlen); } /* *************** *** 123,215 **** */ return LIKE_ABORT; } ! else if ((*p != '_') && !CHAREQ(t, p)) ! { ! /* ! * Not the single-character wildcard and no explicit match? Then ! * time to quit... ! */ ! return LIKE_FALSE; ! } ! ! NextChar(t, tlen); ! NextChar(p, plen); ! } ! ! if (tlen > 0) ! return LIKE_FALSE; /* end of pattern, but not of text */ ! ! /* End of input string. Do we have matching pattern remaining? */ ! while ((plen > 0) && (*p == '%')) /* allow multiple %'s at end of ! * pattern */ ! NextChar(p, plen); ! if (plen <= 0) ! return LIKE_TRUE; ! ! /* ! * End of text with no match, so no point in trying later places to start ! * matching this pattern. ! */ ! return LIKE_ABORT; ! } /* MatchText() */ ! ! /* ! * Same as above, but ignore case ! */ ! static int ! MatchTextIC(char *t, int tlen, char *p, int plen) ! { ! /* Fast path for match-everything pattern */ ! if ((plen == 1) && (*p == '%')) ! return LIKE_TRUE; ! ! while ((tlen > 0) && (plen > 0)) ! { ! if (*p == '\\') ! { ! /* Next pattern char must match literally, whatever it is */ ! NextChar(p, plen); ! if ((plen <= 0) || !ICHAREQ(t, p)) ! return LIKE_FALSE; ! } ! else if (*p == '%') { ! /* %% is the same as % according to the SQL standard */ ! /* Advance past all %'s */ ! while ((plen > 0) && (*p == '%')) ! NextChar(p, plen); ! /* Trailing percent matches everything. */ ! if (plen <= 0) ! return LIKE_TRUE; ! ! /* ! * Otherwise, scan for a text position at which we can match the ! * rest of the pattern. ! */ ! while (tlen > 0) ! { ! /* ! * Optimization to prevent most recursion: don't recurse ! * unless first pattern char might match this text char. ! */ ! if (ICHAREQ(t, p) || (*p == '\\') || (*p == '_')) ! { ! int matched = MatchTextIC(t, tlen, p, plen); ! ! if (matched != LIKE_FALSE) ! return matched; /* TRUE or ABORT */ ! } ! ! NextChar(t, tlen); ! } ! ! /* ! * End of text with no match, so no point in trying later places ! * to start matching this pattern. ! */ ! return LIKE_ABORT; } ! else if ((*p != '_') && !ICHAREQ(t, p)) { /* * Not the single-character wildcard and no explicit match? Then --- 130,142 ---- */ return LIKE_ABORT; } ! else if (*p == '_') { ! NextChar(t, tlen); ! NextByte(p, plen); ! continue; } ! else if (!BYTEEQIC(t, p)) { /* * Not the single-character wildcard and no explicit match? Then *************** *** 217,225 **** */ return LIKE_FALSE; } ! ! NextChar(t, tlen); ! NextChar(p, plen); } if (tlen > 0) --- 144,156 ---- */ return LIKE_FALSE; } ! /* ! * It is safe to use NextByte instead of NextChar here, even for ! * multi-byte character sets, because we are not following ! * immediately after a wildcard character. ! */ ! NextByte(t, tlen); ! NextByte(p, plen); } if (tlen > 0) *************** *** 228,234 **** /* End of input string. Do we have matching pattern remaining? */ while ((plen > 0) && (*p == '%')) /* allow multiple %'s at end of * pattern */ ! NextChar(p, plen); if (plen <= 0) return LIKE_TRUE; --- 159,165 ---- /* End of input string. Do we have matching pattern remaining? */ while ((plen > 0) && (*p == '%')) /* allow multiple %'s at end of * pattern */ ! NextByte(p, plen); if (plen <= 0) return LIKE_TRUE; *************** *** 237,248 **** * matching this pattern. */ return LIKE_ABORT; ! } /* MatchTextIC() */ /* * like_escape() --- given a pattern and an ESCAPE string, * convert the pattern to use Postgres' standard backslash escape convention. */ static text * do_like_escape(text *pat, text *esc) { --- 168,181 ---- * matching this pattern. */ return LIKE_ABORT; ! } /* MatchText() */ /* * like_escape() --- given a pattern and an ESCAPE string, * convert the pattern to use Postgres' standard backslash escape convention. */ + #ifdef do_like_escape + static text * do_like_escape(text *pat, text *esc) { *************** *** 336,338 **** --- 269,288 ---- return result; } + #endif /* do_like_escape */ + + #undef CHAREQ + #undef BYTEEQIC + #undef PCT_CHAREQ + #undef PCT_NEXT + #undef NextChar + #undef CopyAdvChar + #undef MatchText + + #ifdef do_like_escape + #undef do_like_escape + #endif + + #ifdef UTF8_OPT + #undef UTF8_OPT + #endif
Andrew Dunstan <andrew@dunslane.net> writes: > Are you sure? The big remaining char-matching bottleneck will surely > be in the code that scans for a place to start matching a %. But > that's exactly where we can't use byte matching for cases where the > charset might include AB and BA as characters - the pattern might > contain %BA and the string AB. However, this isn't a danger for UTF8, > which leads me to think that we do indeed need a special case for > UTF8, but for a different improvement from that proposed in the > original patch. I'll post an updated patch shortly. > Here is a patch that implements this. Please analyse for possible > breakage. On the strength of this analysis, shouldn't we drop the separate UTF8 match function and just use SB_MatchText for UTF8? It strikes me that we may be overcomplicating matters in another way too. If you believe that the %-scan code is now the bottleneck, that is, the key loop is where we have pattern '%foo' and we are trying to match 'f' to each successive data position, then you should be bothered that SB_MatchTextIC is applying tolower() to 'f' again for each data character. Worst-case we could have O(N^2) applications of tolower() during a match. I think there's a fair case to be made that we should get rid of SB_MatchTextIC and implement *all* the case-insensitive variants by means of an initial lower() call. This would leave us with just two match functions and allow considerable unification of the setup logic. regards, tom lane
Tom Lane wrote: > On the strength of this analysis, shouldn't we drop the separate > UTF8 match function and just use SB_MatchText for UTF8? > Possibly - IIRC I looked at that and there was some reason I didn't, but I'll look again. > It strikes me that we may be overcomplicating matters in another way > too. If you believe that the %-scan code is now the bottleneck, that > is, the key loop is where we have pattern '%foo' and we are trying to > match 'f' to each successive data position, then you should be bothered > that SB_MatchTextIC is applying tolower() to 'f' again for each data > character. Worst-case we could have O(N^2) applications of tolower() > during a match. I think there's a fair case to be made that we should > get rid of SB_MatchTextIC and implement *all* the case-insensitive > variants by means of an initial lower() call. This would leave us with > just two match functions and allow considerable unification of the setup > logic. > > > Yeah, quite possibly. I'm also wondering if we are wasting effort downcasing what will in most cases be the same pattern over and over again. Maybe we need to look at memoizing that somehow, or at least test to see if that would be a gain. We're getting quite a long way from the original patch :-) cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Yeah, quite possibly. I'm also wondering if we are wasting effort > downcasing what will in most cases be the same pattern over and over > again. Maybe we need to look at memoizing that somehow, or at least test > to see if that would be a gain. Someone (Itagaki-san IIRC) suggested that we ought to convert "x ILIKE y" into "lower(x) LIKE lower(y)" at some fairly early stage, definitely before constant-folding in the planner. That would take care of that issue without any run-time mechanism, and would open opportunities for making use of an index on lower(x). I recall thinking at the time that there were some potential downsides, but right at the moment I'm darned if I can see any --- especially if we're going to make ILIKE do this uniformly at runtime anyway. regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Yeah, quite possibly. I'm also wondering if we are wasting effort >> downcasing what will in most cases be the same pattern over and over >> again. Maybe we need to look at memoizing that somehow, or at least test >> to see if that would be a gain. >> > > Someone (Itagaki-san IIRC) suggested that we ought to convert > "x ILIKE y" into "lower(x) LIKE lower(y)" at some fairly early > stage, definitely before constant-folding in the planner. That > would take care of that issue without any run-time mechanism, > and would open opportunities for making use of an index on lower(x). > > I recall thinking at the time that there were some potential downsides, > but right at the moment I'm darned if I can see any --- especially > if we're going to make ILIKE do this uniformly at runtime anyway. > > > Sounds like a TODO item. I'm already concerned a bit about scope creep for this item. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> Andrew Dunstan <andrew@dunslane.net> writes: >>> Yeah, quite possibly. I'm also wondering if we are wasting effort >>> downcasing what will in most cases be the same pattern over and over >>> again. Maybe we need to look at memoizing that somehow, or at least test >>> to see if that would be a gain. >> >> Someone (Itagaki-san IIRC) suggested that we ought to convert >> "x ILIKE y" into "lower(x) LIKE lower(y)" at some fairly early >> stage, definitely before constant-folding in the planner. >> > Sounds like a TODO item. I'm already concerned a bit about scope creep > for this item. Agreed, I don't want to tackle this right now --- I'm just suggesting it's probably a better answer than memoizing at runtime. regards, tom lane
Tom Lane wrote: > > On the strength of this analysis, shouldn't we drop the separate > UTF8 match function and just use SB_MatchText for UTF8? > We still call NextChar() after "_", and I think we probably need to, don't we? If so we can't just marry the cases. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> On the strength of this analysis, shouldn't we drop the separate >> UTF8 match function and just use SB_MatchText for UTF8? > We still call NextChar() after "_", and I think we probably need to, > don't we? If so we can't just marry the cases. Doh, you're right ... but on third thought, what happens with a pattern containing "%_"? If % tries to advance bytewise then we'll be trying to apply NextChar in the middle of a data character, and bad things ensue. I think we need to go back to the scheme with SB_ and MB_ variants and no special case for UTF8. regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Tom Lane wrote: >> >>> On the strength of this analysis, shouldn't we drop the separate >>> UTF8 match function and just use SB_MatchText for UTF8? >>> > > >> We still call NextChar() after "_", and I think we probably need to, >> don't we? If so we can't just marry the cases. >> > > Doh, you're right ... but on third thought, what happens with a pattern > containing "%_"? If % tries to advance bytewise then we'll be trying to > apply NextChar in the middle of a data character, and bad things ensue. > > I think we need to go back to the scheme with SB_ and MB_ variants and > no special case for UTF8. > > > My head is spinning with all these variants. I'll look at ti tomorrow. cheers andrew
> Doh, you're right ... but on third thought, what happens with a pattern > containing "%_"? If % tries to advance bytewise then we'll be trying to > apply NextChar in the middle of a data character, and bad things ensue. Right, when you have '_' after a '%' you need to make sure the '%' advances full characters. In my suggestion the test if '_' (or '\') come after the '%' is done once and it select which of the two loops to use, the one that do byte stepping or the one with NextChar. It's difficult to know for sure that we have thought about all the corner cases. I hope the gain is worth the effort.. :-) /Dennis
db@zigo.dhs.org wrote: >> Doh, you're right ... but on third thought, what happens with a pattern >> containing "%_"? If % tries to advance bytewise then we'll be trying to >> apply NextChar in the middle of a data character, and bad things ensue. >> > > Right, when you have '_' after a '%' you need to make sure the '%' > advances full characters. In my suggestion the test if '_' (or '\') come > after the '%' is done once and it select which of the two loops to use, > the one that do byte stepping or the one with NextChar. > > It's difficult to know for sure that we have thought about all the corner > cases. I hope the gain is worth the effort.. :-) > > > Yes, I came to the same conclusion about how to restructure the code. The current code contains this: while (tlen > 0) { /* * Optimization to prevent most recursion: don't recurse * unless first pattern char might match this text char. */ if (CHAREQ(t, p) || (*p == '\\') || (*p == '_')) { int matched = MatchText(t, tlen, p, plen); if (matched != LIKE_FALSE) return matched; /* TRUE or ABORT */ } NextChar(t, tlen); } The code appears to date from v 1.23 of like.c way back in 2001. I'm not sure I agree with the comment, though. In the first place, the invariant tests should not be in the loop, I think, and I'll hoist them out as Dennis suggests. But why are we doing that CHAREQ? If it succeeds we'll just do it again when we recurse, I think. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > But why are we doing that CHAREQ? To avoid the cost of the recursive call, just like it says. > If it succeeds we'll > just do it again when we recurse, I think. If you move the other two cases then you could advance t and p before entering the recursion. regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> But why are we doing that CHAREQ? >> > > To avoid the cost of the recursive call, just like it says. > > >> If it succeeds we'll >> just do it again when we recurse, I think. >> > > If you move the other two cases then you could advance t and p before > entering the recursion. > > > Yeah. Since I have removed the "_" case I believe it's now safe there to use BYTEEQ/NextByte, and since they are sufficiently cheap it's not worth worrying about. Attached is a patch version that I think draws together all the threads of discussion so far. It's in fact quite a lot simpler than the existing code, with no special UTF8 case - this should improve LIKE/ILIKE processing for all charsets. More eyeballs please for nasty corner cases. cheers andrew ? src/backend/utils/adt/.deps Index: src/backend/utils/adt/like.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/utils/adt/like.c,v retrieving revision 1.68 diff -c -r1.68 like.c *** src/backend/utils/adt/like.c 27 Feb 2007 23:48:08 -0000 1.68 --- src/backend/utils/adt/like.c 21 May 2007 16:50:48 -0000 *************** *** 28,48 **** #define LIKE_ABORT (-1) ! static int MatchText(char *t, int tlen, char *p, int plen); ! static int MatchTextIC(char *t, int tlen, char *p, int plen); ! static int MatchBytea(char *t, int tlen, char *p, int plen); ! static text *do_like_escape(text *, text *); ! static int MBMatchText(char *t, int tlen, char *p, int plen); ! static int MBMatchTextIC(char *t, int tlen, char *p, int plen); static text *MB_do_like_escape(text *, text *); /*-------------------- * Support routine for MatchText. Compares given multibyte streams * as wide characters. If they match, returns 1 otherwise returns 0. *-------------------- */ ! static int wchareq(char *p1, char *p2) { int p1_len; --- 28,48 ---- #define LIKE_ABORT (-1) ! static int SB_MatchText(char *t, int tlen, char *p, int plen); ! static text *SB_do_like_escape(text *, text *); ! static int MB_MatchText(char *t, int tlen, char *p, int plen); static text *MB_do_like_escape(text *, text *); + static int GenericMatchText(char *s, int slen, char* p, int plen); + static int Generic_Text_IC_like(text *str, text *pat); + /*-------------------- * Support routine for MatchText. Compares given multibyte streams * as wide characters. If they match, returns 1 otherwise returns 0. *-------------------- */ ! static inline int wchareq(char *p1, char *p2) { int p1_len; *************** *** 72,86 **** * of getting a single character transformed to the system's wchar_t format. * So now, we just downcase the strings using lower() and apply regular LIKE * comparison. This should be revisited when we install better locale support. - * - * Note that MBMatchText and MBMatchTextIC do exactly the same thing now. - * Is it worth refactoring to avoid duplicated code? They might become - * different again in the future. */ /* Set up to compile like_match.c for multibyte characters */ #define CHAREQ(p1, p2) wchareq(p1, p2) - #define ICHAREQ(p1, p2) wchareq(p1, p2) #define NextChar(p, plen) \ do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0) #define CopyAdvChar(dst, src, srclen) \ --- 72,84 ---- * of getting a single character transformed to the system's wchar_t format. * So now, we just downcase the strings using lower() and apply regular LIKE * comparison. This should be revisited when we install better locale support. */ + #define NextByte(p, plen) ((p)++, (plen)--) + #define BYTEEQ(p1, p2) (*(p1) == *(p2)) + /* Set up to compile like_match.c for multibyte characters */ #define CHAREQ(p1, p2) wchareq(p1, p2) #define NextChar(p, plen) \ do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0) #define CopyAdvChar(dst, src, srclen) \ *************** *** 89,122 **** while (__l-- > 0) \ *(dst)++ = *(src)++; \ } while (0) ! #define MatchText MBMatchText ! #define MatchTextIC MBMatchTextIC #define do_like_escape MB_do_like_escape #include "like_match.c" - #undef CHAREQ - #undef ICHAREQ - #undef NextChar - #undef CopyAdvChar - #undef MatchText - #undef MatchTextIC - #undef do_like_escape - /* Set up to compile like_match.c for single-byte characters */ ! #define CHAREQ(p1, p2) (*(p1) == *(p2)) ! #define ICHAREQ(p1, p2) (tolower((unsigned char) *(p1)) == tolower((unsigned char) *(p2))) ! #define NextChar(p, plen) ((p)++, (plen)--) #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--) #include "like_match.c" ! /* And some support for BYTEA */ ! #define BYTEA_CHAREQ(p1, p2) (*(p1) == *(p2)) ! #define BYTEA_NextChar(p, plen) ((p)++, (plen)--) ! #define BYTEA_CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--) /* * interface routines called by the function manager --- 87,137 ---- while (__l-- > 0) \ *(dst)++ = *(src)++; \ } while (0) + #define BYTEEQIC(b1, b2) BYTEEQ(b1, b2) ! #define MatchText MB_MatchText #define do_like_escape MB_do_like_escape #include "like_match.c" /* Set up to compile like_match.c for single-byte characters */ ! #define CHAREQ(p1, p2) BYTEEQ(p1, p2) ! #define NextChar(p, plen) NextByte(p, plen) #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--) + #define BYTEEQIC(b1, b2) BYTEEQ(b1, b2) + + #define MatchText SB_MatchText + #define do_like_escape SB_do_like_escape #include "like_match.c" ! static inline int ! GenericMatchText(char *s, int slen, char* p, int plen) ! { ! if (pg_database_encoding_max_length() == 1) ! return SB_MatchText(s, slen, p, plen); ! else ! return MB_MatchText(s, slen, p, plen); ! } + static inline int + Generic_Text_IC_like(text *str, text *pat) + { + char *s, + *p; + int slen, + plen; + + /* Force inputs to lower case to achieve case insensitivity */ + str = DatumGetTextP(DirectFunctionCall1(lower, PointerGetDatum(str))); + pat = DatumGetTextP(DirectFunctionCall1(lower, PointerGetDatum(pat))); + s = VARDATA(str); + slen = (VARSIZE(str) - VARHDRSZ); + p = VARDATA(pat); + plen = (VARSIZE(pat) - VARHDRSZ); + + return GenericMatchText(s, slen, p, plen); + } /* * interface routines called by the function manager *************** *** 138,147 **** p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! if (pg_database_encoding_max_length() == 1) ! result = (MatchText(s, slen, p, plen) == LIKE_TRUE); ! else ! result = (MBMatchText(s, slen, p, plen) == LIKE_TRUE); PG_RETURN_BOOL(result); } --- 153,159 ---- p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (GenericMatchText(s, slen, p, plen) == LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 162,171 **** p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! if (pg_database_encoding_max_length() == 1) ! result = (MatchText(s, slen, p, plen) != LIKE_TRUE); ! else ! result = (MBMatchText(s, slen, p, plen) != LIKE_TRUE); PG_RETURN_BOOL(result); } --- 174,180 ---- p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (GenericMatchText(s, slen, p, plen) != LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 186,195 **** p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! if (pg_database_encoding_max_length() == 1) ! result = (MatchText(s, slen, p, plen) == LIKE_TRUE); ! else ! result = (MBMatchText(s, slen, p, plen) == LIKE_TRUE); PG_RETURN_BOOL(result); } --- 195,201 ---- p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (GenericMatchText(s, slen, p, plen) == LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 210,219 **** p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! if (pg_database_encoding_max_length() == 1) ! result = (MatchText(s, slen, p, plen) != LIKE_TRUE); ! else ! result = (MBMatchText(s, slen, p, plen) != LIKE_TRUE); PG_RETURN_BOOL(result); } --- 216,222 ---- p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (GenericMatchText(s, slen, p, plen) != LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 234,240 **** p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MatchBytea(s, slen, p, plen) == LIKE_TRUE); PG_RETURN_BOOL(result); } --- 237,243 ---- p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (SB_MatchText(s, slen, p, plen) == LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 255,261 **** p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MatchBytea(s, slen, p, plen) != LIKE_TRUE); PG_RETURN_BOOL(result); } --- 258,264 ---- p = VARDATA(pat); plen = (VARSIZE(pat) - VARHDRSZ); ! result = (SB_MatchText(s, slen, p, plen) != LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 270,306 **** Name str = PG_GETARG_NAME(0); text *pat = PG_GETARG_TEXT_P(1); bool result; ! char *s, ! *p; ! int slen, ! plen; ! ! if (pg_database_encoding_max_length() == 1) ! { ! s = NameStr(*str); ! slen = strlen(s); ! p = VARDATA(pat); ! plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MatchTextIC(s, slen, p, plen) == LIKE_TRUE); ! } ! else ! { ! /* Force inputs to lower case to achieve case insensitivity */ ! text *strtext; ! strtext = DatumGetTextP(DirectFunctionCall1(name_text, NameGetDatum(str))); ! strtext = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(strtext))); ! pat = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(pat))); ! ! s = VARDATA(strtext); ! slen = (VARSIZE(strtext) - VARHDRSZ); ! p = VARDATA(pat); ! plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MBMatchTextIC(s, slen, p, plen) == LIKE_TRUE); ! } PG_RETURN_BOOL(result); } --- 273,283 ---- Name str = PG_GETARG_NAME(0); text *pat = PG_GETARG_TEXT_P(1); bool result; ! text *strtext; ! strtext = DatumGetTextP(DirectFunctionCall1(name_text, NameGetDatum(str))); ! result = (Generic_Text_IC_like(strtext, pat) == LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 311,347 **** Name str = PG_GETARG_NAME(0); text *pat = PG_GETARG_TEXT_P(1); bool result; ! char *s, ! *p; ! int slen, ! plen; ! ! if (pg_database_encoding_max_length() == 1) ! { ! s = NameStr(*str); ! slen = strlen(s); ! p = VARDATA(pat); ! plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MatchTextIC(s, slen, p, plen) != LIKE_TRUE); ! } ! else ! { ! /* Force inputs to lower case to achieve case insensitivity */ ! text *strtext; ! strtext = DatumGetTextP(DirectFunctionCall1(name_text, NameGetDatum(str))); ! strtext = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(strtext))); ! pat = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(pat))); ! ! s = VARDATA(strtext); ! slen = (VARSIZE(strtext) - VARHDRSZ); ! p = VARDATA(pat); ! plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MBMatchTextIC(s, slen, p, plen) != LIKE_TRUE); ! } PG_RETURN_BOOL(result); } --- 288,298 ---- Name str = PG_GETARG_NAME(0); text *pat = PG_GETARG_TEXT_P(1); bool result; ! text *strtext; ! strtext = DatumGetTextP(DirectFunctionCall1(name_text, NameGetDatum(str))); ! result = (Generic_Text_IC_like(strtext, pat) != LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 352,383 **** text *str = PG_GETARG_TEXT_P(0); text *pat = PG_GETARG_TEXT_P(1); bool result; - char *s, - *p; - int slen, - plen; ! if (pg_database_encoding_max_length() == 1) ! { ! s = VARDATA(str); ! slen = (VARSIZE(str) - VARHDRSZ); ! p = VARDATA(pat); ! plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MatchTextIC(s, slen, p, plen) == LIKE_TRUE); ! } ! else ! { ! /* Force inputs to lower case to achieve case insensitivity */ ! str = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(str))); ! pat = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(pat))); ! s = VARDATA(str); ! slen = (VARSIZE(str) - VARHDRSZ); ! p = VARDATA(pat); ! plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MBMatchTextIC(s, slen, p, plen) == LIKE_TRUE); ! } PG_RETURN_BOOL(result); } --- 303,310 ---- text *str = PG_GETARG_TEXT_P(0); text *pat = PG_GETARG_TEXT_P(1); bool result; ! result = (Generic_Text_IC_like(str, pat) == LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 388,419 **** text *str = PG_GETARG_TEXT_P(0); text *pat = PG_GETARG_TEXT_P(1); bool result; - char *s, - *p; - int slen, - plen; ! if (pg_database_encoding_max_length() == 1) ! { ! s = VARDATA(str); ! slen = (VARSIZE(str) - VARHDRSZ); ! p = VARDATA(pat); ! plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MatchTextIC(s, slen, p, plen) != LIKE_TRUE); ! } ! else ! { ! /* Force inputs to lower case to achieve case insensitivity */ ! str = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(str))); ! pat = DatumGetTextP(DirectFunctionCall1(lower, ! PointerGetDatum(pat))); ! s = VARDATA(str); ! slen = (VARSIZE(str) - VARHDRSZ); ! p = VARDATA(pat); ! plen = (VARSIZE(pat) - VARHDRSZ); ! result = (MBMatchTextIC(s, slen, p, plen) != LIKE_TRUE); ! } PG_RETURN_BOOL(result); } --- 315,322 ---- text *str = PG_GETARG_TEXT_P(0); text *pat = PG_GETARG_TEXT_P(1); bool result; ! result = (Generic_Text_IC_like(str, pat) != LIKE_TRUE); PG_RETURN_BOOL(result); } *************** *** 430,436 **** text *result; if (pg_database_encoding_max_length() == 1) ! result = do_like_escape(pat, esc); else result = MB_do_like_escape(pat, esc); --- 333,339 ---- text *result; if (pg_database_encoding_max_length() == 1) ! result = SB_do_like_escape(pat, esc); else result = MB_do_like_escape(pat, esc); *************** *** 446,624 **** { bytea *pat = PG_GETARG_BYTEA_P(0); bytea *esc = PG_GETARG_BYTEA_P(1); ! bytea *result; ! char *p, ! *e, ! *r; ! int plen, ! elen; ! bool afterescape; ! ! p = VARDATA(pat); ! plen = (VARSIZE(pat) - VARHDRSZ); ! e = VARDATA(esc); ! elen = (VARSIZE(esc) - VARHDRSZ); ! ! /* ! * Worst-case pattern growth is 2x --- unlikely, but it's hardly worth ! * trying to calculate the size more accurately than that. ! */ ! result = (text *) palloc(plen * 2 + VARHDRSZ); ! r = VARDATA(result); ! ! if (elen == 0) ! { ! /* ! * No escape character is wanted. Double any backslashes in the ! * pattern to make them act like ordinary characters. ! */ ! while (plen > 0) ! { ! if (*p == '\\') ! *r++ = '\\'; ! BYTEA_CopyAdvChar(r, p, plen); ! } ! } ! else ! { ! /* ! * The specified escape must be only a single character. ! */ ! BYTEA_NextChar(e, elen); ! if (elen != 0) ! ereport(ERROR, ! (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE), ! errmsg("invalid escape string"), ! errhint("Escape string must be empty or one character."))); ! ! e = VARDATA(esc); ! ! /* ! * If specified escape is '\', just copy the pattern as-is. ! */ ! if (*e == '\\') ! { ! memcpy(result, pat, VARSIZE(pat)); ! PG_RETURN_BYTEA_P(result); ! } ! ! /* ! * Otherwise, convert occurrences of the specified escape character to ! * '\', and double occurrences of '\' --- unless they immediately ! * follow an escape character! ! */ ! afterescape = false; ! while (plen > 0) ! { ! if (BYTEA_CHAREQ(p, e) && !afterescape) ! { ! *r++ = '\\'; ! BYTEA_NextChar(p, plen); ! afterescape = true; ! } ! else if (*p == '\\') ! { ! *r++ = '\\'; ! if (!afterescape) ! *r++ = '\\'; ! BYTEA_NextChar(p, plen); ! afterescape = false; ! } ! else ! { ! BYTEA_CopyAdvChar(r, p, plen); ! afterescape = false; ! } ! } ! } ! ! SET_VARSIZE(result, r - ((char *) result)); ! PG_RETURN_BYTEA_P(result); } - /* - * Same as above, but specifically for bytea (binary) datatype - */ - static int - MatchBytea(char *t, int tlen, char *p, int plen) - { - /* Fast path for match-everything pattern */ - if ((plen == 1) && (*p == '%')) - return LIKE_TRUE; - - while ((tlen > 0) && (plen > 0)) - { - if (*p == '\\') - { - /* Next pattern char must match literally, whatever it is */ - BYTEA_NextChar(p, plen); - if ((plen <= 0) || !BYTEA_CHAREQ(t, p)) - return LIKE_FALSE; - } - else if (*p == '%') - { - /* %% is the same as % according to the SQL standard */ - /* Advance past all %'s */ - while ((plen > 0) && (*p == '%')) - BYTEA_NextChar(p, plen); - /* Trailing percent matches everything. */ - if (plen <= 0) - return LIKE_TRUE; - - /* - * Otherwise, scan for a text position at which we can match the - * rest of the pattern. - */ - while (tlen > 0) - { - /* - * Optimization to prevent most recursion: don't recurse - * unless first pattern char might match this text char. - */ - if (BYTEA_CHAREQ(t, p) || (*p == '\\') || (*p == '_')) - { - int matched = MatchBytea(t, tlen, p, plen); - - if (matched != LIKE_FALSE) - return matched; /* TRUE or ABORT */ - } - - BYTEA_NextChar(t, tlen); - } - - /* - * End of text with no match, so no point in trying later places - * to start matching this pattern. - */ - return LIKE_ABORT; - } - else if ((*p != '_') && !BYTEA_CHAREQ(t, p)) - { - /* - * Not the single-character wildcard and no explicit match? Then - * time to quit... - */ - return LIKE_FALSE; - } - - BYTEA_NextChar(t, tlen); - BYTEA_NextChar(p, plen); - } - - if (tlen > 0) - return LIKE_FALSE; /* end of pattern, but not of text */ - - /* End of input string. Do we have matching pattern remaining? */ - while ((plen > 0) && (*p == '%')) /* allow multiple %'s at end of - * pattern */ - BYTEA_NextChar(p, plen); - if (plen <= 0) - return LIKE_TRUE; - - /* - * End of text with no match, so no point in trying later places to start - * matching this pattern. - */ - return LIKE_ABORT; - } /* MatchBytea() */ --- 349,356 ---- { bytea *pat = PG_GETARG_BYTEA_P(0); bytea *esc = PG_GETARG_BYTEA_P(1); ! bytea *result = SB_do_like_escape((text *)pat, (text *)esc); ! PG_RETURN_BYTEA_P((bytea *)result); } Index: src/backend/utils/adt/like_match.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/utils/adt/like_match.c,v retrieving revision 1.15 diff -c -r1.15 like_match.c *** src/backend/utils/adt/like_match.c 27 Feb 2007 23:48:08 -0000 1.15 --- src/backend/utils/adt/like_match.c 21 May 2007 16:50:48 -0000 *************** *** 9,19 **** * Before the inclusion, we need to define following macros: * * CHAREQ ! * ICHAREQ * NextChar * CopyAdvChar * MatchText (MBMatchText) - * MatchTextIC (MBMatchTextIC) * do_like_escape (MB_do_like_escape) * * Copyright (c) 1996-2007, PostgreSQL Global Development Group --- 9,18 ---- * Before the inclusion, we need to define following macros: * * CHAREQ ! * BYTEEQIC * NextChar * CopyAdvChar * MatchText (MBMatchText) * do_like_escape (MB_do_like_escape) * * Copyright (c) 1996-2007, PostgreSQL Global Development Group *************** *** 81,89 **** { if (*p == '\\') { ! /* Next pattern char must match literally, whatever it is */ ! NextChar(p, plen); ! if ((plen <= 0) || !CHAREQ(t, p)) return LIKE_FALSE; } else if (*p == '%') --- 80,88 ---- { if (*p == '\\') { ! /* Next byte must match literally, whatever it is */ ! NextByte(p, plen); ! if ((plen <= 0) || !BYTEEQ(t, p)) return LIKE_FALSE; } else if (*p == '%') *************** *** 91,97 **** /* %% is the same as % according to the SQL standard */ /* Advance past all %'s */ while ((plen > 0) && (*p == '%')) ! NextChar(p, plen); /* Trailing percent matches everything. */ if (plen <= 0) return LIKE_TRUE; --- 90,96 ---- /* %% is the same as % according to the SQL standard */ /* Advance past all %'s */ while ((plen > 0) && (*p == '%')) ! NextByte(p, plen); /* Trailing percent matches everything. */ if (plen <= 0) return LIKE_TRUE; *************** *** 100,206 **** * Otherwise, scan for a text position at which we can match the * rest of the pattern. */ ! while (tlen > 0) { ! /* ! * Optimization to prevent most recursion: don't recurse ! * unless first pattern char might match this text char. ! */ ! if (CHAREQ(t, p) || (*p == '\\') || (*p == '_')) { int matched = MatchText(t, tlen, p, plen); ! if (matched != LIKE_FALSE) ! return matched; /* TRUE or ABORT */ ! } ! NextChar(t, tlen); } ! ! /* ! * End of text with no match, so no point in trying later places ! * to start matching this pattern. ! */ ! return LIKE_ABORT; ! } ! else if ((*p != '_') && !CHAREQ(t, p)) ! { ! /* ! * Not the single-character wildcard and no explicit match? Then ! * time to quit... ! */ ! return LIKE_FALSE; ! } ! ! NextChar(t, tlen); ! NextChar(p, plen); ! } ! ! if (tlen > 0) ! return LIKE_FALSE; /* end of pattern, but not of text */ ! ! /* End of input string. Do we have matching pattern remaining? */ ! while ((plen > 0) && (*p == '%')) /* allow multiple %'s at end of ! * pattern */ ! NextChar(p, plen); ! if (plen <= 0) ! return LIKE_TRUE; ! ! /* ! * End of text with no match, so no point in trying later places to start ! * matching this pattern. ! */ ! return LIKE_ABORT; ! } /* MatchText() */ ! ! /* ! * Same as above, but ignore case ! */ ! static int ! MatchTextIC(char *t, int tlen, char *p, int plen) ! { ! /* Fast path for match-everything pattern */ ! if ((plen == 1) && (*p == '%')) ! return LIKE_TRUE; ! ! while ((tlen > 0) && (plen > 0)) ! { ! if (*p == '\\') ! { ! /* Next pattern char must match literally, whatever it is */ ! NextChar(p, plen); ! if ((plen <= 0) || !ICHAREQ(t, p)) ! return LIKE_FALSE; ! } ! else if (*p == '%') ! { ! /* %% is the same as % according to the SQL standard */ ! /* Advance past all %'s */ ! while ((plen > 0) && (*p == '%')) ! NextChar(p, plen); ! /* Trailing percent matches everything. */ ! if (plen <= 0) ! return LIKE_TRUE; ! ! /* ! * Otherwise, scan for a text position at which we can match the ! * rest of the pattern. ! */ ! while (tlen > 0) { ! /* ! * Optimization to prevent most recursion: don't recurse ! * unless first pattern char might match this text char. ! */ ! if (ICHAREQ(t, p) || (*p == '\\') || (*p == '_')) { ! int matched = MatchTextIC(t, tlen, p, plen); ! if (matched != LIKE_FALSE) ! return matched; /* TRUE or ABORT */ } ! NextChar(t, tlen); } /* --- 99,147 ---- * Otherwise, scan for a text position at which we can match the * rest of the pattern. */ ! if (*p == '_') { ! while (tlen > 0) { int matched = MatchText(t, tlen, p, plen); ! if (matched != LIKE_FALSE) ! return matched; /* TRUE or ABORT */ ! NextChar(t, tlen); ! } } ! else if (*p == '\\') { ! while (tlen > 0) { ! int matched = MatchText(t, tlen, p, plen); ! if (matched != LIKE_FALSE) ! return matched; /* TRUE or ABORT */ ! ! NextByte(t, tlen); } + } + else + { + + while (tlen > 0) + { + /* + * Optimization to prevent most recursion: don't recurse + * unless first pattern char matches this text char. + */ + if (BYTEEQ(t, p)) + { + int matched = MatchText(t, tlen, p, plen); + + if (matched != LIKE_FALSE) + return matched; /* TRUE or ABORT */ + } ! NextByte(t, tlen); ! } } /* *************** *** 209,215 **** */ return LIKE_ABORT; } ! else if ((*p != '_') && !ICHAREQ(t, p)) { /* * Not the single-character wildcard and no explicit match? Then --- 150,162 ---- */ return LIKE_ABORT; } ! else if (*p == '_') ! { ! NextChar(t, tlen); ! NextByte(p, plen); ! continue; ! } ! else if (!BYTEEQ(t, p)) { /* * Not the single-character wildcard and no explicit match? Then *************** *** 217,225 **** */ return LIKE_FALSE; } ! ! NextChar(t, tlen); ! NextChar(p, plen); } if (tlen > 0) --- 164,176 ---- */ return LIKE_FALSE; } ! /* ! * It is safe to use NextByte instead of NextChar here, even for ! * multi-byte character sets, because we are not following ! * immediately after a wildcard character. ! */ ! NextByte(t, tlen); ! NextByte(p, plen); } if (tlen > 0) *************** *** 228,234 **** /* End of input string. Do we have matching pattern remaining? */ while ((plen > 0) && (*p == '%')) /* allow multiple %'s at end of * pattern */ ! NextChar(p, plen); if (plen <= 0) return LIKE_TRUE; --- 179,186 ---- /* End of input string. Do we have matching pattern remaining? */ while ((plen > 0) && (*p == '%')) /* allow multiple %'s at end of * pattern */ ! NextByte(p, plen); ! if (plen <= 0) return LIKE_TRUE; *************** *** 237,248 **** * matching this pattern. */ return LIKE_ABORT; ! } /* MatchTextIC() */ /* * like_escape() --- given a pattern and an ESCAPE string, * convert the pattern to use Postgres' standard backslash escape convention. */ static text * do_like_escape(text *pat, text *esc) { --- 189,202 ---- * matching this pattern. */ return LIKE_ABORT; ! } /* MatchText() */ /* * like_escape() --- given a pattern and an ESCAPE string, * convert the pattern to use Postgres' standard backslash escape convention. */ + #ifdef do_like_escape + static text * do_like_escape(text *pat, text *esc) { *************** *** 336,338 **** --- 290,302 ---- return result; } + #endif /* do_like_escape */ + + #undef CHAREQ + #undef NextChar + #undef CopyAdvChar + #undef MatchText + + #ifdef do_like_escape + #undef do_like_escape + #endif