Thread: Re: [GENERAL] Problem (bug?) with like
bombadil@wanadoo.es writes:
> 1) # explain SELECT * from v_cliente where nombre like '%DA%';
> Merge Join  (cost=54763.50..62874.36 rows=413980 width=183)
>   ->  Sort  (cost=16238.44..16238.44 rows=54 width=131)
>         ->  Seq Scan on cliente c  (cost=0.00..16236.88 rows=54 width=131)
>   ->  Sort  (cost=38525.06..38525.06 rows=20097 width=74)
>         ->  Subquery Scan d  (cost=891.91..37088.66 rows=20097 width=74)
>               ->  Hash Join  (cost=891.91..37088.66 rows=20097 width=74)
> ...

> 2) explain SELECT * from v_cliente where nombre like '%DAVID%';
> Nested Loop  (cost=891.91..61414.58 rows=7638 width=183)
>   ->  Seq Scan on cliente c  (cost=0.00..16236.88 rows=1 width=131)
>   ->  Subquery Scan d  (cost=891.91..37088.66 rows=20097 width=74)
>         ->  Hash Join  (cost=891.91..37088.66 rows=20097 width=74)
> ... [same subplan as above]

The problem here is that the planner is being way too optimistic about
the selectivity of LIKE '%DAVID%' --- notice the estimate that only
one matching row will be found in cliente, rather than 54 as with '%DA%'.
So it chooses a plan that avoids the sort overhead needed for an
efficient merge join with the other tables.  That would be a win if
there were only one matching row, but as soon as there are lots, it's
a big loss, because the subquery to join the other tables gets redone
for every matching row :-(

>> Also, how many rows are there really that match '%DA%' and '%DAVID%'?

> 1) 2672 rows -> 3.59 sec.
> 2) 257 rows -> 364.69 sec.

I am thinking that the rules for selectivity of LIKE patterns probably
need to be modified.  Presently the code assumes that a long constant
string has probability of occurrence proportional to the product of the
probabilities of the individual letters.  That might be true in a random
world, but people don't search for random strings.  I think we need to
back off the selectivity estimate by some large factor to account for
the fact that the pattern being searched for is probably not random.
Anyone have ideas how to do that?

			regards, tom lane
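[A minimal standalone sketch of the model described above.  This is not
the actual selfuncs.c code; the helper name like_selectivity is made up
for illustration.  Each fixed character in the pattern multiplies the
estimate by a constant per-character factor, so a long constant string
like 'DAVID' ends up with the product of five small probabilities:]

    /* Illustration only: per-character selectivity model for LIKE. */
    #include <stdio.h>

    #define FIXED_CHAR_SEL 0.04     /* assumed per-character factor */

    static double
    like_selectivity(const char *pattern)
    {
        double      sel = 1.0;

        for (; *pattern; pattern++)
        {
            /* wildcards contribute no fixed characters */
            if (*pattern == '%' || *pattern == '_')
                continue;
            sel *= FIXED_CHAR_SEL;
        }
        return sel;
    }

    int
    main(void)
    {
        printf("'%%DA%%'    -> %.10f\n", like_selectivity("%DA%"));
        printf("'%%DAVID%%' -> %.10f\n", like_selectivity("%DAVID%"));
        return 0;
    }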
> I am thinking that the rules for selectivity of LIKE patterns probably
> need to be modified.  Presently the code assumes that a long constant
> string has probability of occurrence proportional to the product of the
> probabilities of the individual letters.  That might be true in a random
> world, but people don't search for random strings.  I think we need to
> back off the selectivity estimate by some large factor to account for
> the fact that the pattern being searched for is probably not random.
> Anyone have ideas how to do that?

But what about '%A%' vs. '%AC%'?  Seems the second is reasonably
different from the first, so our optimizer may be fine with that.  Is it
only when the strings get longer that we lose specificity?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> But what about '%A%' vs. '%AC%'?  Seems the second is reasonably
> different from the first, so our optimizer may be fine with that.  Is it
> only when the strings get longer that we lose specificity?

Yeah, I don't think that the estimates are bad for one or two
characters.  But the estimate gets real small real fast as you
increase the number of match characters in the LIKE pattern.
We need to slow that down some.

			regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > But what about '%A%' vs. '%AC%'?  Seems the second is reasonably
> > different from the first, so our optimizer may be fine with that.  Is it
> > only when the strings get longer that we lose specificity?
>
> Yeah, I don't think that the estimates are bad for one or two
> characters.  But the estimate gets real small real fast as you
> increase the number of match characters in the LIKE pattern.
> We need to slow that down some.

Yea, maybe a log base 2 decrease:

     1 char      1x
     2 char      2x
     4 char      3x
     8 char      4x
    16 char      5x

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
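[A sketch of the log-base-2 damping suggested above.  The function name
and exact formula are an interpretation of the table, not anything from
the actual source: instead of applying FIXED_CHAR_SEL once per
character, it is applied (1 + log2(nchars)) times, so doubling the
pattern length adds one step rather than doubling the exponent:]

    /* Illustration of the proposed log-base-2 damping.  Build with -lm. */
    #include <math.h>
    #include <stdio.h>

    #define FIXED_CHAR_SEL 0.04

    static double
    damped_selectivity(int nchars)
    {
        double      steps = 1.0 + log2((double) nchars);

        return pow(FIXED_CHAR_SEL, steps);
    }

    int
    main(void)
    {
        int         n;

        /* 1, 2, 4, 8, 16 chars -> exponents 1, 2, 3, 4, 5 */
        for (n = 1; n <= 16; n *= 2)
            printf("%2d chars -> %.10f\n", n, damped_selectivity(n));
        return 0;
    }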
Tom Lane wrote:

>Bruce Momjian <pgman@candle.pha.pa.us> writes:
>
>>But what about '%A%' vs. '%AC%'?  Seems the second is reasonably
>>different from the first, so our optimizer may be fine with that.  Is it
>>only when the strings get longer that we lose specificity?
>
>Yeah, I don't think that the estimates are bad for one or two
>characters.  But the estimate gets real small real fast as you
>increase the number of match characters in the LIKE pattern.
>We need to slow that down some.

Could we just assign weights to the first few characters and then
consider only these first few characters when determining the
probability of finding it?

If someone searches for '%New York City%' we have quite good reasons to
believe that there are some of these in there, so we should factor in,
by said weights, the fact that usually one searches for strings that do
exist.

Another option would be to gather statistics not only on individual
letters but on bi- or trigraphs.  Then as a next step we could implement
proper trigraph indexes ;)

----------------
Hannu
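[An illustration of the trigraph idea; print_trigraphs is a made-up
helper, and real trigram support typically also pads the string with
blanks, which is skipped here.  Statistics kept per trigraph, rather
than per letter, would let the planner estimate '%New York City%' from
how often its trigraphs actually occur:]

    /* Illustration only: split a value into overlapping 3-char substrings. */
    #include <stdio.h>
    #include <string.h>

    static void
    print_trigraphs(const char *s)
    {
        size_t      len = strlen(s);
        size_t      i;

        for (i = 0; i + 3 <= len; i++)
            printf("%.3s\n", s + i);
    }

    int
    main(void)
    {
        print_trigraphs("New York City");
        return 0;
    }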
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > But what about '%A%' vs. '%AC%'?  Seems the second is reasonably
> > different from the first, so our optimizer may be fine with that.  Is it
> > only when the strings get longer that we lose specificity?
>
> Yeah, I don't think that the estimates are bad for one or two
> characters.  But the estimate gets real small real fast as you
> increase the number of match characters in the LIKE pattern.
> We need to slow that down some.

See my earlier email about the 50% idea for LIKE.  Do ordinary string
comparisons also have this problem?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > But what about '%A%' vs. '%AC%'?  Seems the second is reasonably
> > > different from the first, so our optimizer may be fine with that.  Is it
> > > only when the strings get longer that we lose specificity?
> >
> > Yeah, I don't think that the estimates are bad for one or two
> > characters.  But the estimate gets real small real fast as you
> > increase the number of match characters in the LIKE pattern.
> > We need to slow that down some.
>
> Yea, maybe a log base 2 decrease:
>
>      1 char      1x
>      2 char      2x
>      4 char      3x
>      8 char      4x
>     16 char      5x

I did a little research on this.  I think the problem is that ordinary
characters are assumed to appear randomly in a character string, while
in practice, if the string has already been specified like 'DAV', there
are very few additional characters that can follow it and make sense.

Looking at backend/utils/adt/selfuncs.c, I see this:

    #define FIXED_CHAR_SEL  0.04    /* about 1/25 */
    ...
                sel *= FIXED_CHAR_SEL;

which means every additional character reduces the selectivity by 96%.
This seems much too restrictive to me.  Because of the new optimizer
buckets, we do have good statistics on the leading character, but
additional characters drastically reduce selectivity.

I think perhaps a number like 0.50 or 50% may be correct.  That would be
a table like this:

     1 char      2x
     2 char      4x
     4 char      8x
     8 char     16x
    16 char     32x

which is more restrictive than I initially suggested above but less
restrictive than we have now.  Should we assume additional characters
are indeed randomly appearing in the string?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
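[A quick arithmetic sketch comparing the current 0.04 per-character
factor with the 0.50 value floated above.  This applies each candidate
once per fixed character, the way selfuncs.c applies FIXED_CHAR_SEL; it
is illustration only, not planner code:]

    /* Illustration only: estimated selectivity after n fixed characters. */
    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        const double current_sel = 0.04;    /* current FIXED_CHAR_SEL */
        const double proposed_sel = 0.50;   /* value floated above */
        int         n;

        printf("chars    sel at 0.04     sel at 0.50\n");
        for (n = 1; n <= 8; n++)
            printf("%5d    %.10f    %.10f\n",
                   n, pow(current_sel, n), pow(proposed_sel, n));
        return 0;
    }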
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Do ordinary string comparisons also have this problem?

No, only LIKE and regex matching.

			regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > But what about '%A%' vs. '%AC%'?  Seems the second is reasonably
> > different from the first, so our optimizer may be fine with that.  Is it
> > only when the strings get longer that we lose specificity?
>
> Yeah, I don't think that the estimates are bad for one or two
> characters.  But the estimate gets real small real fast as you
> increase the number of match characters in the LIKE pattern.
> We need to slow that down some.

OK, I think I have the proper value for FIXED_CHAR_SEL.  It is currently
0.04, roughly 1/26, meaning the letters are assumed to be random, though
this is not usually the case.

If we assume our optimizer buckets have given us a reasonable value for
the first character (suppose it is an 'F'), there are only a few valid
characters after that, at least in English.  There are vowels, and a few
consonants, and given that character, there are only a few characters
that can be valid after that.

To my thinking, two characters represent the same distribution as one
random character, leaving 0.20 as the proper value for FIXED_CHAR_SEL,
because 0.20 * 0.20 is the same as 0.04.

Added to TODO:

    * Change FIXED_CHAR_SEL to 0.20 from 0.04 to give better
      selectivity (Bruce)

If people think there is a better value for this, please chime in.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> The problem here is that the planner is being way too optimistic about
> the selectivity of LIKE '%DAVID%' --- notice the estimate that only
> one matching row will be found in cliente, rather than 54 as with '%DA%'.
> So it chooses a plan that avoids the sort overhead needed for an
> efficient merge join with the other tables.  That would be a win if
> there were only one matching row, but as soon as there are lots, it's
> a big loss, because the subquery to join the other tables gets redone
> for every matching row :-(
>
> >> Also, how many rows are there really that match '%DA%' and '%DAVID%'?
>
> > 1) 2672 rows -> 3.59 sec.
> > 2) 257 rows -> 364.69 sec.
>
> I am thinking that the rules for selectivity of LIKE patterns probably
> need to be modified.  Presently the code assumes that a long constant
> string has probability of occurrence proportional to the product of the
> probabilities of the individual letters.  That might be true in a random
> world, but people don't search for random strings.  I think we need to
> back off the selectivity estimate by some large factor to account for
> the fact that the pattern being searched for is probably not random.
> Anyone have ideas how to do that?

Let's use the above example with the new FIXED_CHAR_SEL value.  With
0.20 for FIXED_CHAR_SEL, we get for DA and DAVID above:

    DA     1)  0.20 ^ 2 = 0.04
    DAVID  2)  0.20 ^ 5 = 0.00032

If we divide these two, we get:

    0.04 / 0.00032 = 125

while the actual row counts reported above give:

    2672 / 257 = ~10.40

The current 0.04 value gives:

    0.04 ^ 2 = 0.0016
    0.04 ^ 5 = 0.0000001024
    0.0016 / 0.0000001024 = 15625

Clearly the 0.20 value overestimates the selectivity ratio by roughly
10x, while the 0.04 value overestimates it by over 1000x.  Because this
was a contrived example, and because some fields hold more random text
than 'DAVID', I think 0.20 is the proper value.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
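[A sketch of the ratio check above, pure arithmetic rather than planner
code: how much more selective the planner thinks '%DAVID%' (5 fixed
chars) is than '%DA%' (2 fixed chars) under each candidate
FIXED_CHAR_SEL, versus the observed row counts from the report at the
top of the thread:]

    /* Illustration only: estimated vs. observed selectivity ratios. */
    #include <math.h>
    #include <stdio.h>

    static void
    show_ratio(double char_sel)
    {
        double      da = pow(char_sel, 2);      /* '%DA%'    */
        double      david = pow(char_sel, 5);   /* '%DAVID%' */

        printf("FIXED_CHAR_SEL = %.2f: estimated ratio = %.0f\n",
               char_sel, da / david);
    }

    int
    main(void)
    {
        show_ratio(0.04);                       /* current:  15625 */
        show_ratio(0.20);                       /* proposed:   125 */
        printf("observed ratio = %.2f\n", 2672.0 / 257.0);  /* ~10.40 */
        return 0;
    }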