Thread: 7.3 no longer using indexes for LIKE queries
I have a database that has a lot of records (~6mil, iirc), with a varchar column that often wants to be queried using something like "where acc like 'foo%'". There is a B-Tree index on the acc column. In 7.2.3, Postgres would use that index to do the queries and things were lightning fast. In 7.3, it is refusing to use the index, even if I set enable_seqscan = off, meaning that the query that used to take a few msec now takes a few aeons. I've run vacuum analyze on the whole database, and it doesn't change anything.

I'm trying to cluster the table on the index (since that's the only way that particular table is ever queried), so I can't give an explain analyze, but here's one for another table using the same idea:

      Index "public.xfoo"
   Column    |          Type
-------------+------------------------
 stringthing | character varying(255)
btree, for table "public.foo"

xxx=> explain analyze select * from foo where stringthing like 'ABCDEF%';
 Seq Scan on foo  (cost=0.00..148503.29 rows=1 width=111) (actual time=30512.99..32082.95 rows=4 loops=1)
   Filter: (stringthing ~~ 'ABCDEF%'::text)
 Total runtime: 32083.07 msec

For reference, there are 4,688,317 rows in this table. Changing the select * to select stringthing doesn't affect the query plan either.

I can coerce it to do an index scan by making the condition "stringthing >= 'ABCDEF' and stringthing < 'ABCDEG'", in which case it executes nice and fast:

xxx=> explain analyze select * from foo where stringthing >= 'ABCDEF' and stringthing < 'ABCDEG';
 Index Scan using xfoo on foo  (cost=0.00..6.02 rows=1 width=111) (actual time=0.08..0.08 rows=0 loops=1)
   Index Cond: ((stringthing >= 'ABCDEF'::character varying) AND (stringthing < 'ABCDEG'::character varying))
 Total runtime: 0.17 msec

This is an ugly workaround, though :(

Something I noticed in trying to force the use of an index scan: setting enable_seqscan = off here doesn't change whether it uses a seq scan, but it makes it change the cost estimate to '100000000.00..100148503.29'; bit weird, that, if you ask me.

-Matt
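[Editor's note: the range-condition workaround above is essentially the rewrite the planner performs for itself in C locale, where an anchored LIKE prefix can be turned into half-open index bounds. A minimal Python sketch of that rewrite, assuming plain single-byte byte-wise ordering; the function name is illustrative (the backend's analogous routine is, I believe, make_greater_string()):]

```python
def like_prefix_bounds(prefix):
    """Turn an anchored pattern  prefix%  into a half-open range
    [lower, upper) usable by a B-tree index scan.  Assumes byte-wise
    (C locale) comparison -- the only case where this rewrite is safe."""
    chars = list(prefix)
    # Increment the last byte; bytes that cannot be incremented (0xFF)
    # are dropped first, shortening the upper bound.
    while chars:
        c = chars.pop()
        if ord(c) < 0xFF:
            return prefix, "".join(chars) + chr(ord(c) + 1)
    return prefix, None  # pattern was all 0xFF bytes: no finite upper bound

lo, hi = like_prefix_bounds("ABCDEF")
print(lo, hi)  # ABCDEF ABCDEG  ->  WHERE x >= 'ABCDEF' AND x < 'ABCDEG'
```

[This is exactly the pair of bounds used in the manual workaround shown above.]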
On Tue, 3 Dec 2002, Matthew Gabeler-Lee wrote:

> I have a database that has a lot of records (~6mil, iirc), with a varchar
> column that often wants to be queried using something like "where acc like
> 'foo%'". There is a B-Tree index on the acc column. In 7.2.3, Postgres
> would use that index to do the queries and things were lightning fast. In
> 7.3, it is refusing to use the index, even if I set enable_seqscan = off,
> meaning that the query that used to take a few msec now takes a few aeons.
> I've run vacuum analyze on the whole database, and it doesn't change
> anything.

Did you perhaps not initdb in "C" locale? I think pg_controldata will tell you what it was created with. The optimization for converting anchored likes into an indexable form doesn't work in all locales and is currently only done in C locale, I believe.

> Something I noticed in trying to force the use of an index scan ... setting
> enable_seqscan = off here doesn't change whether it uses a seq scan, but it
> makes it change the cost estimate to '100000000.00..100148503.29'; bit
> weird, that, if you ask me.

That's probably because it doesn't believe that there's an index scan possible here.
Matthew Gabeler-Lee wrote:

> I have a database that has a lot of records (~6mil, iirc), with a varchar
> column that often wants to be queried using something like "where acc like
> 'foo%'". There is a B-Tree index on the acc column. In 7.2.3, Postgres
> would use that index to do the queries and things were lightning fast. In
> 7.3, it is refusing to use the index, even if I set enable_seqscan = off,
> meaning that the query that used to take a few msec now takes a few aeons.
> I've run vacuum analyze on the whole database, and it doesn't change
> anything.

What's the output of pg_controldata? Specifically, what is LC_COLLATE? If it isn't "C", then LIKE won't use your index. See: http://developer.postgresql.org/docs/postgres/charset.html

Joe
Thanks, it did it in en_US ... This seems like it should be a fixable thing to me.

* decides to do his day long database dump/restore tomorrow ...

-Matt

-----Original Message-----
From: Stephan Szabo [mailto:sszabo@megazone23.bigpanda.com]
Sent: Tuesday, December 03, 2002 19:29
To: Matthew Gabeler-Lee
Cc: 'pgsql-general@postgresql.org'
Subject: Re: [GENERAL] 7.3 no longer using indexes for LIKE queries

Did you perhaps not initdb in "C" locale? I think pg_controldata will tell you what it was created with. The optimization for converting anchored likes into an indexable form doesn't work in all locales and is currently only done in C locale, I believe.
Can someone please elaborate on why it stops doing this optimization? The only reason I can think of for it to be unsafe in a locale is that two characters that are not the same character could still compare as equal (does this ever really happen?).

Otherwise, saying "foo LIKE 'xxx%'" is really asking whether the first 3 characters of foo are equal to 'xxx'. Regardless of sort order, in a sorted list of values, anything starting with 'xxx' should appear together, and thus be optimizable by an index, unless there are values of x and y for which 'x' = 'y'. If that were the case, though, you couldn't use the index for "foo = 'xxx'" either, which is obviously not the case.

So, can someone enlighten me as to where I'm making a wrong assumption or something here?

-Matt

-----Original Message-----
From: Joe Conway [mailto:mail@joeconway.com]
Sent: Tuesday, December 03, 2002 19:31

What's the output of pg_controldata, specifically, what is LC_COLLATE? If it isn't "C", then LIKE won't use your index. See: http://developer.postgresql.org/docs/postgres/charset.html
Matthew Gabeler-Lee <mgabelerlee@zycos.com> writes:

> Can someone please elaborate on why it stops doing this optimization? The
> only reasons for it that I can think of for it to be unsafe in a locale is
> that two characters that are not the same character still compare as being
> equal (does this ever really happen?).

Unfortunately, locale-dependent sorting rules are much stranger than you seem to think. You can dig in the pghackers archives for some of the reasons that we had to disable that optimization (we tried repeatedly to find ways around this, btw). Special digraph rules and multipass comparisons are a couple of examples of things that completely break the LIKE optimization.

			regards, tom lane
Okay, now I understand the problem better.

How about, as an intermediate solution, a list of 'sane' locales in which the optimization applied to the C/POSIX locale works? It should be fairly straightforward to write a test program that will test a locale and see if any of the common problems apply. And I expect that several locales can be added straight away, as they are known to obey simple sorting rules.

Alternately, though I don't think this would be a very good idea, a SET option along the lines of enable_dangerous_like_optimization or such that would (of course) default to off.

-Matt

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

Unfortunately, locale-dependent sorting rules are much stranger than you seem to think. You can dig in the pghackers archives for some of the reasons that we had to disable that optimization (we tried repeatedly to find ways around this, btw). Special digraph rules and multipass comparisons are a couple of examples of things that completely break the LIKE optimization.

			regards, tom lane
Matthew Gabeler-Lee wrote:

> Okay, now I understand the problem better.
>
> How about, as an intermediate solution, a list of 'sane' locales in which
> the optimization applied to the C/POSIX locale works? It should be fairly
> straight forward to write a test program that will test a locale and see if
> any of the common problems apply. And I expect that several locales can be
> added straight away as they are known to obey simple sorting rules.
>
> Alternately, though I don't think this would be a very good idea, a SET
> option along the lines of enable_dangerous_like_optimization or such that
> would (of course) default to off.

I think we do know the C locale is OK for this. It is just the other ones we don't know about. Is there some test we can write for this?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Looking at the list archives, it looks like there are some general classes of problems:

1) Multibyte charsets; me not knowing much, I'd guess that initially none of these are 'sane'.

2) Longer string less than shorter string ('ABC\NNN' < 'ABC'); can test by iterating through bytes (or potentially pairs of bytes, to catch needing two special chars as a digraph or something) appended to a base string and seeing if any of them trigger the longer-less.

3) Characters that are ignored in a sort (e.g. 'ABC\NNNDEF' sorts the same as 'ABCDEF'); can test again by iterating through bytes or pairs of bytes looking for the condition.

4) Accent folding; I'm not entirely sure LIKE is supposed to do this. I'm going to pretend for the rest of this that the LIKE operator shouldn't fold accented characters.

Employing extreme paranoia and assuming great ignorance of any specialties of the locale being tested, I think this should cover all the bases:

    // pglub(x) = postgres' upper bound string for the like optimization on 'x%'
    // <, >, = mean sorts less than, etc.
    report insane if locale is multibyte
    for every 2-byte string $a
        let $b = pglub($a)
        for every 2-byte string $c
            report insane if "$a$c" < $a or "$a$c" > $b
    for some 4-byte string(s) of plain characters $d
        for every 2-byte string $e
            report insane if $d = "$d[0,1]$e$d[2,3]"
    else report sane

On a fast machine, this would probably take a couple minutes[1] to run per locale, so it would definitely be a one-time thing, not a compile-time thing, but would be practical to run on a one-time or per-release basis to generate some source file that listed sane locales.

[1] Basically iterating through all values of 32 bits; it takes my pc a few seconds to just do 'for(unsigned n=0;n<0xffffffff;++n);', so I extrapolated an estimate of what needs to be done beyond the counting. The second loop runs so many fewer times that its runtime is in the noise.
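[Editor's note: problem #3 above (characters the collation ignores outright) can be probed from a script. A minimal Python sketch using the standard locale module follows; note it only catches full strcoll() equality, so it would miss characters like the en_US space that are merely demoted to a later comparison pass -- detecting those needs strxfrm()-level analysis. The function name and probe string are illustrative, not from the thread.]

```python
import locale

def find_ignored_chars(base="AD"):
    """Find single-byte characters c where base[0] + c + base[1]
    collates exactly equal to base under the current LC_COLLATE,
    i.e. the collation ignores c completely (problem #3 above)."""
    ignored = []
    for code in range(1, 128):
        probe = base[0] + chr(code) + base[1]
        if locale.strcoll(probe, base) == 0:
            ignored.append(chr(code))
    return ignored

# The C locale compares byte-wise, so nothing is ever ignored:
locale.setlocale(locale.LC_COLLATE, "C")
print(find_ignored_chars())  # -> []
```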
With some slight modifications, this sanity-finding routine could be used to generate tables for use by a function to find the proper bounds for the LIKE optimization on locales that only suffer from problem #2. Yes, this embeds knowledge about locales into postgres, but it does it through exhaustive examination, not usage of personal expertise.

An addendum on accent folding: To my ignorant mind, characters with different accents are different characters. My French teacher certainly marked me off if I got the accents wrong. It seems to me that the most common place one wants to think about this is in full text searching and the like. In this case, maybe I'm daft, but perhaps the thing to do is to create a functional index, where the function being indexed strips all the accents off characters.

Does the SQL spec have anything to say on accent folding in comparisons?

-Matt

-----Original Message-----
From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]

I think we do know the C locale is OK for this. It is just the other ones we don't know about. Is there some test we can write for this?
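[Editor's note: the accent-stripping function suggested above is easy to sketch with Unicode decomposition: normalize to NFD and drop the combining marks. PostgreSQL would want this wrapped as an immutable function to build the functional index on; the Python below is purely an illustration of the folding itself.]

```python
import unicodedata

def strip_accents(s):
    """Decompose to NFD and drop combining marks, so that an index
    built over this function can serve accent-insensitive searches."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents("CÔTE"), strip_accents("COTÉ"))  # COTE COTE
```

[Note this folds 'CÔTE' and 'COTÉ' to the same key, which is exactly why it can only live in a functional index for searching, never in the column's collation itself, per Jean-Luc's unique-index objection below.]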
LC_COLLATE is what is involved with accents. How would you sort:

CÔTE
CÔTES
COTÉ
COTÉS
COTE

You can't fold accented characters into non-accented ones, because a unique index would barf on CÔTE if COTE is already in. You still need to know if 'CÔTE' < 'COTÉ' or not when you do a sort. Collating in French, for example, is not a byte-to-byte compare. If you compare words based only on their binary representation, the sort will be wrong: CRIME before CÔTE.

JLL

Matthew Gabeler-Lee wrote:
> 4) accent folding; I'm not entirely sure like is supposed to do this. I'm
> going to pretend for the rest of this that the like operator shouldn't fold
> accented characters.
>
> [...]
>
> It seems to me that the most common place one wants to think about this is
> in full text searching and the like. In this case, maybe I'm daft, but
> perhaps the thing to do is to create a functional index, where the function
> being indexed strips all the accents off characters.
>
> Does the SQL spec have anything to say on accent folding in comparisons?
What I was referring to there was the idea that when a user searched for 'cote', you'd transform that into a condition along the lines of "WHERE foldaccents(textcol) LIKE 'cote%'" or such, so that failing to enter accents wouldn't make the query miss, which is especially desirable for someone who wants to query something with accents but can't type them easily on their keyboard.

In the case of optimizing the LIKE operator, if you ask for txtcol LIKE 'CÔTE%', then (assuming the charset is 'sane') it would do an index scan with a filter something along the lines of (txtcol >= 'CÔTE' AND txtcol < 'CÔTF') before applying the LIKE operator. Assuming the locale is set right in the database, the indexing stuff will know how to collate CÔTE vs. COTÉ; that's the whole point of setting the COLLATE thing, so that the index *will* have them in the correct order.

My point was mostly that expecting txtcol LIKE 'CÔTE%' to match 'COTÉ' is probably bad logic. Now, if there is a good argument why that isn't so, I'm all ears, but it seems to me that 'CÔTE' and 'COTÉ' are different words and thus shouldn't match.

-Matt

-----Original Message-----
From: Jean-Luc Lachance [mailto:jllachan@nsd.ca]
Sent: Wednesday, December 04, 2002 15:25
To: Matthew Gabeler-Lee
Cc: 'pgsql-general@postgresql.org'
Subject: Re: [GENERAL] 7.3 no longer using indexes for LIKE queries

LC_COLLATE is what is involved with accents. How would you sort:

CÔTE
CÔTES
COTÉ
COTÉS
COTE

You can't fold accented characters into non-accented ones, because a unique index would barf on CÔTE if COTE is already in. You still need to know if 'CÔTE' < 'COTÉ' or not when you do a sort. Collating in French, for example, is not a byte-to-byte compare. If you compare words based only on their binary representation, the sort will be wrong: CRIME before CÔTE.

JLL

Matthew Gabeler-Lee wrote:
> 4) accent folding; I'm not entirely sure like is supposed to do this. I'm
> going to pretend for the rest of this that the like operator shouldn't fold
> accented characters.
>
> [...]
>
> It seems to me that the most common place one wants to think about this is
> in full text searching and the like. In this case, maybe I'm daft, but
> perhaps the thing to do is to create a functional index, where the function
> being indexed strips all the accents off characters.
>
> Does the SQL spec have anything to say on accent folding in comparisons?
Matthew Gabeler-Lee <mgabelerlee@zycos.com> writes:

> How about, as an intermediate solution, a list of 'sane' locales in which
> the optimization applied to the C/POSIX locale works?

If you provide such a list, we'll be happy to improve locale_is_like_safe(). Offhand though, I suspect that *most* if not all non-C locales have problems; even en_US, which has no character set issues, still manages to insist on a multipass sort algorithm :-(. An example on a recent Linux release:

[tgl@g3]$ echo -e "a\na a\naa\na a\nab\na b" | LC_ALL=en_US sort
a
aa
a a
a a
ab
a b

There's no way to use an index ordered like this to look for strings beginning "a ", because the sorting of spaces depends on what comes after them.

Making any real dent in the problem will probably require in-depth analysis of common locale (mis)behaviors. For example, if space sorting is the only thing that's funny about en_US, it might make sense for us to support a modified form of the LIKE optimization that doesn't consider space as a "safe" prefix character (ie, we could index only for "a", not "a ", even if the pattern is LIKE 'a %'). I have no idea exactly what sort of compromises would be effective, though. Any localedef experts out there?

			regards, tom lane
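[Editor's note: the en_US ordering shown above can be mimicked with a toy two-pass comparison key, which makes the indexing problem concrete: spaces are skipped on the primary pass and only break ties on a later pass, so strings beginning "a " do not form a contiguous run. The key function below is a deliberately crude imitation, not glibc's actual algorithm.]

```python
def toy_en_us_key(s):
    """Crude imitation of a multipass collation: the primary pass
    ignores spaces entirely; ties are broken by the ignored spaces."""
    return (s.replace(" ", ""), s.count(" "), s)

words = ["a", "a a", "aa", "a a", "ab", "a b"]
print(sorted(words, key=toy_en_us_key))
# -> ['a', 'aa', 'a a', 'a a', 'ab', 'a b']
# 'ab' lands between 'a a' and 'a b': a range scan over an index in
# this order cannot isolate the strings with prefix "a ".
```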
Can anyone explain the rationale behind that sort order? I'm guessing it has something to do with getting sequences of words sorted 'right', but I fail to see how that is right. I'd think when sorting it would make sense to have all sequences of words that start with the same word(s) together.

$ echo -e 'bobbill\nbobrob\nbob bill\nbob robber' | LC_ALL=en_US sort
bobbill
bob bill   -\
bobrob      |-- Seems these ought to be adjacent?
bob robber -/

Perhaps I'm just old fashioned. In any case, I don't define locale collation orders, so *shrug*.

The pseudocode I gave for finding insane locales could also be extended to gather information for the modified LIKE optimization you suggest. Find chars (like space) that are ignored in the collation order, as it does, and report them. When picking the lower bound, drop all such chars from the beginning of the expression, then find the upper bound from the thus shortened lower bound. This would work for stuff that sorts like en_US does with spaces, but something that sorted 'a b' *before* 'ab' gets into the realms of the impossible to deal with (there was a message in one of the lists talking about a locale that apparently suffered from this problem; it at least sorted 'abc\NNN' before 'abc' for some value of NNN which I forget).

In locales whose second-pass sort only moves things forwards, at the least the planner could know these 'danger' characters and only optimize if none of them are present in the query string. foo LIKE 'ab%' can be safely optimized in en_US since it only has chars that sort simply. Finding which chars sort simply should be straightforward and only require testing 2^8 or maybe 2^16 values.

-Matt

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

[tgl@g3]$ echo -e "a\na a\naa\na a\nab\na b" | LC_ALL=en_US sort
a
aa
a a
a a
ab
a b

There's no way to use an index ordered like this to look for strings beginning "a ", because the sorting of spaces depends on what comes after them.

Making any real dent in the problem will probably require in-depth analysis of common locale (mis)behaviors. For example, if space sorting is the only thing that's funny about en_US, it might make sense for us to support a modified form of the LIKE optimization that doesn't consider space as a "safe" prefix character (ie, we could index only for "a" not "a ", even if the pattern is LIKE 'a %'). I have no idea exactly what sort of compromises would be effective though. Any localedef experts out there?

			regards, tom lane
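[Editor's note: Tom's suggested compromise (don't treat space as a "safe" prefix character) amounts to truncating the pattern's literal prefix at the first collation-ignored character before deriving index bounds. A sketch, with the ignored-character set hard-coded purely for illustration:]

```python
def safe_prefix(prefix, ignored=" "):
    """Truncate an anchored-LIKE prefix at the first character the
    locale's collation treats specially, so that for LIKE 'a %' in
    en_US we derive index bounds only from 'a', never from 'a '."""
    for i, c in enumerate(prefix):
        if c in ignored:
            return prefix[:i]
    return prefix

print(safe_prefix("a bc"))  # -> 'a'   (index condition covers >= 'a' only)
print(safe_prefix("abc"))   # -> 'abc' (no danger chars: full prefix usable)
```

[The shortened prefix gives a weaker but still correct index condition; LIKE itself is then applied as a filter on the narrowed rows.]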
Matthew Gabeler-Lee <mgabelerlee@zycos.com> writes:

> foo LIKE 'ab%' can be safely optimized in en_US since it only has
> chars that sort simply. Finding which chars sort simply should be
> straight forward and only require testing 2^8 or maybe 2^16 values.

We have several times thought we had a solution to LIKE optimization in non-C locales, only to discover the hard way that our solution didn't cover all cases. After that bitter experience, I am *deeply* distrustful of any proposal that involves black-box testing of locale behavior. A black-box test will only find the behaviors that you already know about.

The only thing that would make me comfortable is a white-box approach: go in and read the source code to find out what collation behaviors are possible (I'm assuming here that commercial locale libraries don't have anything that's not present in an open-source one, eg glibc's locale code). Once we have that information for sure, we can think about how to detect and deal with the different sorting oddities. But right now I have no confidence that we know what we need to deal with.

			regards, tom lane
On Wed, Dec 04, 2002 at 16:02:28 -0500, Matthew Gabeler-Lee <mgabelerlee@zycos.com> wrote:

> Can anyone explain the rationale behind that sort order? I'm guessing it
> has something to do with getting sequences of words sorted 'right', but I
> fail to see how that is right. I'd think when sorting it would make sense
> to have all sequences of words that start with the same word(s) together.
>
> $ echo -e 'bobbill\nbobrob\nbob bill\nbob robber' | LC_ALL=en_US sort
> bobbill
> bob bill   -\
> bobrob      |-- Seems these ought to be adjacent?
> bob robber -/
>
> Perhaps I'm just old fashioned. In any case, I don't define locale
> collation orders, so *shrug*.

I prefer en_US for sorting names. I like that last names like "des Jardins", "desJardins" and "Des Jardins" will all show up next to each other, because that is what I think humans will expect. (Note that I sort by first, middle and last names separately.)
Tom Lane <tgl@sss.pgh.pa.us> writes:

> [tgl@g3]$ echo -e "a\na a\naa\na a\nab\na b" | LC_ALL=en_US sort
> a
> aa
> a a
> a a
> ab
> a b
>
> There's no way to use an index ordered like this to look for strings
> beginning "a ", because the sorting of spaces depends on what comes
> after them.

It seems like there's an obvious easy fix for this. Allow indexes to be created in a simple non-locale-dependent lexical sort order. They wouldn't be useful for sorting in the locale sort order, but they would be useful for the case at hand.

This has the disadvantage of requiring two indexes if you really do want both "WHERE x BETWEEN ? AND ?" and "WHERE x LIKE 'foo%'" to be fast, but then you're saying that if the user really wants one of these "unsafe" locales, then that's what it'll cost to achieve it. Perhaps a warning about it at the initdb stage, since it's probably usually not what the user wants, would be nice too.

--
greg
Greg Stark writes:

> It seems like there's an obvious easy fix for this. Allow indexes to be
> created a simple non-locale dependent lexical sort order. They wouldn't be
> useful for sorting in the locale sort order but they would be useful for the
> case at hand.

There has already been a proposed implementation of that idea, but it has been rejected because of some interpretational problems with how exactly the LIKE operator should respond to locale settings.

According to the SQL standard, constant strings that are part of a pattern should be compared using the relevant collation order. If this were implemented (which it currently isn't), then an index based on strxfrm() should be used. The current implementation should use an index based on a binary comparison opclass. We need to figure out which exactly we want to proceed with.

I will point out that I believe that an implementation following the SQL standard model won't be particularly practical. First of all, strings that are not binary equivalents won't compare as equal under any reasonable collation. Second, even if that were the case, it would certainly not be appropriate to use for the LIKE operator. Third, some of the rules that underlie many collations would cause the LIKE operator to have pretty bizarre results. For example, sometimes the strings are compared backwards from the end of the string to the start, and it's not clear how that should behave when faced with a wildcard pattern anchored to the start.

--
Peter Eisentraut   peter_e@gmx.net
Peter Eisentraut wrote:

> Greg Stark writes:
>
> > It seems like there's an obvious easy fix for this. Allow
> > indexes to be created a simple non-locale dependent lexical
> > sort order. They wouldn't be useful for sorting in the locale
> > sort order but they would be useful for the case at hand.
>
> There has already been a proposed implementation of that idea,
> but it has been rejected because of some interpretational
> problems with how exactly the LIKE operator should respond
> to locale settings.
>
> According to the SQL standard, constant strings that are part of a
> pattern should be compared using the relevant collation order. If this
> were implemented (which it currently isn't), then an index based on
> strxfrm() should be used. The current implementation should use an
> index based on a binary comparison opclass. We need to figure out
> which exactly we want to proceed with.
>
> I will point out that I believe that an implementation following the SQL
> standard model won't be particularly practical. First of all, strings
> that are not binary equivalents won't compare as equal under any
> reasonable collation. Second, even if that were the case, it would
> certainly not be appropriate to use for the LIKE operator.

For example, there could be case-insensitive collations.

regards,
Hiroshi Inoue
	http://w2422.nsk.ne.jp/~inoue/
Hiroshi Inoue writes:

> For example, there could be case-insensitive collations.

You could do that, but normally you would take the case differences into account in the third pass.

Another interesting problem is how you would expect to handle multicharacter collating elements that cross wildcard boundaries (like 'oc%' when 'ch' is a collating element). That would give weird results.

--
Peter Eisentraut   peter_e@gmx.net
Peter Eisentraut wrote:

> Hiroshi Inoue writes:
>
> > For example, there could be case-insensitive collations.
>
> You could do that, but normally you would take account the case
> differences in the third pass.

I would handle such cases as the case may be, if such collations are available. You can develop the code applicable for the restricted cases where the equality induced by the collation is equivalent to binary equality, can't you?

BTW, you seem to be too inclined toward the locale support. How can we unify the collation support (possibly introduced in the near future) and the locale support?

regards,
Hiroshi Inoue
	http://w2422.nsk.ne.jp/~inoue/