Thread: Problem (bug?) with LIKE
I'd like to ask about an error I get when using LIKE on text fields.
I have a table A and a view v_A of A. "name" is a text field (person
names). Look at these queries:

1) select * from A where name like '%DAVID%'

   It works well and fast. Result:
      DAVID
      DAVID FOO
      DAVID
      .../...

2) select * from v_A where name like '%DA%'

   It works too, with a result (obviously) bigger than the first
   query's. Result:
      DAVID
      DANIEL
      DAVID FOO
      .../...

3) select * from v_A where name like '%DAVID%'

   It freezes psql.

Why? It seems like a bug, doesn't it?

Thanks for any info.

David
On Monday, December 3, bombadil@wanadoo.es wrote:
> I'd like to ask about an error I get when using LIKE on text
> fields.

Sorry, info about my system:

   PostgreSQL 7.1.3
   Debian Sid GNU/Linux i386

Greets.

David
bombadil@wanadoo.es writes:
> select * from v_A where name like '%DAVID%'
>
> It freezes psql.

I don't believe that it's really "frozen".  Taking a long time, maybe.

> Why?

You tell us.  What's the EXPLAIN query plan for these three queries?

			regards, tom lane
On Monday, December 3, Tom Lane wrote:
> bombadil@wanadoo.es writes:
> > select * from v_A where name like '%DAVID%'
> >
> > It freezes psql.
>
> I don't believe that it's really "frozen".  Taking a long time, maybe.

Perhaps. But a veeeeeery long time, in any case ;) I have been waiting
more than 3 minutes and... et voilà, here it is :)

> > Why?
>
> You tell us.  What's the EXPLAIN query plan for these three queries?

Oops, sorry for the laziness. Here are my queries:

--------------------------------------------------------------------------
1) # explain SELECT * from cliente where nombre like '%DAVID%';

Result:

NOTICE:  QUERY PLAN:

Seq Scan on cliente  (cost=0.00..16139.44 rows=1 width=131)

--------------------------------------------------------------------------
2) # explain SELECT * from v_cliente where nombre like '%DA%';

Result:

NOTICE:  QUERY PLAN:

Merge Join  (cost=54763.50..62874.36 rows=413980 width=183)
  ->  Sort  (cost=16238.44..16238.44 rows=54 width=131)
        ->  Seq Scan on cliente c  (cost=0.00..16236.88 rows=54 width=131)
  ->  Sort  (cost=38525.06..38525.06 rows=20097 width=74)
        ->  Subquery Scan d  (cost=891.91..37088.66 rows=20097 width=74)
              ->  Hash Join  (cost=891.91..37088.66 rows=20097 width=74)
                    ->  Hash Join  (cost=100.89..26377.49 rows=20097 width=58)
                          ->  Merge Join  (cost=78.96..17190.49 rows=20097 width=42)
                                ->  Index Scan using dir_via_ndx on dirección d  (cost=0.00..8951.65 rows=20097 width=26)
                                ->  Sort  (cost=78.96..78.96 rows=176 width=16)
                                      ->  Seq Scan on vía v  (cost=0.00..72.40 rows=176 width=16)
                          ->  Hash  (cost=21.80..21.80 rows=52 width=16)
                                ->  Seq Scan on provincia p  (cost=0.00..21.80 rows=52 width=16)
                    ->  Hash  (cost=786.20..786.20 rows=1928 width=16)
                          ->  Seq Scan on localidad l  (cost=0.00..786.20 rows=1928 width=16)

------------------------------------------------------------------------------
3) # explain SELECT * from v_cliente where nombre like '%DAVID%';

Result:

NOTICE:  QUERY PLAN:

Merge Join  (cost=54763.50..62874.36 rows=413980 width=183)
  ->  Sort  (cost=16238.44..16238.44 rows=54 width=131)
        ->  Seq Scan on cliente c  (cost=0.00..16236.88 rows=54 width=131)
  ->  Sort  (cost=38525.06..38525.06 rows=20097 width=74)
        ->  Subquery Scan d  (cost=891.91..37088.66 rows=20097 width=74)
              ->  Hash Join  (cost=891.91..37088.66 rows=20097 width=74)
                    ->  Hash Join  (cost=100.89..26377.49 rows=20097 width=58)
                          ->  Merge Join  (cost=78.96..17190.49 rows=20097 width=42)
                                ->  Index Scan using dir_via_ndx on dirección d  (cost=0.00..8951.65 rows=20097 width=26)
                                ->  Sort  (cost=78.96..78.96 rows=176 width=16)
                                      ->  Seq Scan on vía v  (cost=0.00..72.40 rows=176 width=16)
                          ->  Hash  (cost=21.80..21.80 rows=52 width=16)
                                ->  Seq Scan on provincia p  (cost=0.00..21.80 rows=52 width=16)
                    ->  Hash  (cost=786.20..786.20 rows=1928 width=16)
                          ->  Seq Scan on localidad l  (cost=0.00..786.20 rows=1928 width=16)

--------------------------------------------------------------------------------

Greets.

David
bombadil@wanadoo.es writes:
> Here are my queries:

You sure you didn't paste in the same result for #2 and #3?  They're
the same plan with the same rows estimates --- but I'd expect the rows
estimates, at least, to change given the more-selective LIKE pattern.

Also, how many rows are there really that match '%DA%' and '%DAVID%'?
I suspect the planner is being overoptimistic about the selectivity of
'%DAVID%', and is choosing a plan that doesn't work well when there are
lots of DAVIDs :-(

			regards, tom lane
On Monday, December 3, Tom Lane wrote:
> > Here are my queries:
>
> You sure you didn't paste in the same result for #2 and #3?  They're
> the same plan with the same rows estimates --- but I'd expect the rows
> estimates, at least, to change given the more-selective LIKE pattern.

I'm not sure whether I sent the wrong data, but I think not. Tomorrow
I'll get more info.

> Also, how many rows are there really that match '%DA%' and '%DAVID%'?

Very few for such an incredible difference in time:

   '%DA%'    ->  2 sec.
   '%DAVID%' ->  3 min.

> I suspect the planner is being overoptimistic about the selectivity of
> '%DAVID%', and is choosing a plan that doesn't work well when there are
> lots of DAVIDs :-(

I had thought it only happens when the item being searched for appears
in full in some number of rows, but that is only a crude hypothesis :?

Thanks for all.

David
On Monday, December 3, Tom Lane wrote:
> You sure you didn't paste in the same result for #2 and #3?  They're
> the same plan with the same rows estimates --- but I'd expect the rows
> estimates, at least, to change given the more-selective LIKE pattern.

Ahem... You are right. It seems I have a problem with Vim and inserting
command results :( Here are the correct results:

-----------------------------------------------------------------------
1) # explain SELECT * from v_cliente where nombre like '%DA%';

Result:

NOTICE:  QUERY PLAN:

Merge Join  (cost=54763.50..62874.36 rows=413980 width=183)
  ->  Sort  (cost=16238.44..16238.44 rows=54 width=131)
        ->  Seq Scan on cliente c  (cost=0.00..16236.88 rows=54 width=131)
  ->  Sort  (cost=38525.06..38525.06 rows=20097 width=74)
        ->  Subquery Scan d  (cost=891.91..37088.66 rows=20097 width=74)
              ->  Hash Join  (cost=891.91..37088.66 rows=20097 width=74)
                    ->  Hash Join  (cost=100.89..26377.49 rows=20097 width=58)
                          ->  Merge Join  (cost=78.96..17190.49 rows=20097 width=42)
                                ->  Index Scan using dir_via_ndx on dirección d  (cost=0.00..8951.65 rows=20097 width=26)
                                ->  Sort  (cost=78.96..78.96 rows=176 width=16)
                                      ->  Seq Scan on vía v  (cost=0.00..72.40 rows=176 width=16)
                          ->  Hash  (cost=21.80..21.80 rows=52 width=16)
                                ->  Seq Scan on provincia p  (cost=0.00..21.80 rows=52 width=16)
                    ->  Hash  (cost=786.20..786.20 rows=1928 width=16)
                          ->  Seq Scan on localidad l  (cost=0.00..786.20 rows=1928 width=16)

EXPLAIN

---------------------------------------------------------------------------
2) # explain SELECT * from v_cliente where nombre like '%DAVID%';

Result:

NOTICE:  QUERY PLAN:

Nested Loop  (cost=891.91..61414.58 rows=7638 width=183)
  ->  Seq Scan on cliente c  (cost=0.00..16236.88 rows=1 width=131)
  ->  Subquery Scan d  (cost=891.91..37088.66 rows=20097 width=74)
        ->  Hash Join  (cost=891.91..37088.66 rows=20097 width=74)
              ->  Hash Join  (cost=100.89..26377.49 rows=20097 width=58)
                    ->  Merge Join  (cost=78.96..17190.49 rows=20097 width=42)
                          ->  Index Scan using dir_via_ndx on dirección d  (cost=0.00..8951.65 rows=20097 width=26)
                          ->  Sort  (cost=78.96..78.96 rows=176 width=16)
                                ->  Seq Scan on vía v  (cost=0.00..72.40 rows=176 width=16)
                    ->  Hash  (cost=21.80..21.80 rows=52 width=16)
                          ->  Seq Scan on provincia p  (cost=0.00..21.80 rows=52 width=16)
              ->  Hash  (cost=786.20..786.20 rows=1928 width=16)
                    ->  Seq Scan on localidad l  (cost=0.00..786.20 rows=1928 width=16)

EXPLAIN

----------------------------------------------------------------------------
> Also, how many rows are there really that match '%DA%' and '%DAVID%'?

1) 2672 rows ->   3.59 sec.
2)  257 rows -> 364.69 sec.

v_cliente and cliente have the same number of rows: 38975

I hope this is enough.

Greets.

David
bombadil@wanadoo.es writes:
> 1) # explain SELECT * from v_cliente where nombre like '%DA%';

> Merge Join  (cost=54763.50..62874.36 rows=413980 width=183)
>   ->  Sort  (cost=16238.44..16238.44 rows=54 width=131)
>         ->  Seq Scan on cliente c  (cost=0.00..16236.88 rows=54 width=131)
>   ->  Sort  (cost=38525.06..38525.06 rows=20097 width=74)
>         ->  Subquery Scan d  (cost=891.91..37088.66 rows=20097 width=74)
>               ->  Hash Join  (cost=891.91..37088.66 rows=20097 width=74)
>                     ...

> 2) # explain SELECT * from v_cliente where nombre like '%DAVID%';

> Nested Loop  (cost=891.91..61414.58 rows=7638 width=183)
>   ->  Seq Scan on cliente c  (cost=0.00..16236.88 rows=1 width=131)
>   ->  Subquery Scan d  (cost=891.91..37088.66 rows=20097 width=74)
>         ->  Hash Join  (cost=891.91..37088.66 rows=20097 width=74)
>               ... [same subplan as above]

The problem here is that the planner is being way too optimistic about
the selectivity of LIKE '%DAVID%' --- notice the estimate that only
one matching row will be found in cliente, rather than 54 as with '%DA%'.
So it chooses a plan that avoids the sort overhead needed for an
efficient merge join with the other tables.  That would be a win if
there were only one matching row, but as soon as there are lots, it's
a big loss, because the subquery to join the other tables gets redone
for every matching row :-(

>> Also, how many rows are there really that match '%DA%' and '%DAVID%'?

> 1) 2672 rows ->   3.59 sec.
> 2)  257 rows -> 364.69 sec.

I am thinking that the rules for selectivity of LIKE patterns probably
need to be modified.  Presently the code assumes that a long constant
string has probability of occurrence proportional to the product of the
probabilities of the individual letters.  That might be true in a random
world, but people don't search for random strings.  I think we need to
back off the selectivity estimate by some large factor to account for
the fact that the pattern being searched for is probably not random.
Anyone have ideas how to do that?

			regards, tom lane
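To make the model Tom describes concrete, here is a minimal standalone
sketch of it, assuming nothing about the real selfuncs.c beyond the 0.04
per-character constant quoted later in this thread. It just multiplies
one fixed factor per non-wildcard pattern character and scales by the
38975-row table size reported above; the real estimator folds in
per-column statistics as well, so this is illustrative only.

    /*
     * Sketch of the per-character LIKE selectivity model under discussion
     * (NOT the actual selfuncs.c code): every non-wildcard character in an
     * unanchored pattern multiplies the estimate by a fixed constant, so
     * longer patterns look vanishingly selective.
     */
    #include <stdio.h>

    #define FIXED_CHAR_SEL 0.04     /* current per-character factor */

    static double
    like_selectivity(const char *pattern)
    {
        double      sel = 1.0;
        const char *p;

        for (p = pattern; *p != '\0'; p++)
        {
            if (*p == '%' || *p == '_')
                continue;           /* wildcards add no factor in this sketch */
            sel *= FIXED_CHAR_SEL;
        }
        return sel;
    }

    int
    main(void)
    {
        double      nrows = 38975.0;    /* row count of cliente, from the thread */

        printf("'%%DA%%'    -> sel %.10f, estimated rows %.2f\n",
               like_selectivity("%DA%"),
               like_selectivity("%DA%") * nrows);
        printf("'%%DAVID%%' -> sel %.10f, estimated rows %.2f\n",
               like_selectivity("%DAVID%"),
               like_selectivity("%DAVID%") * nrows);
        return 0;
    }

Run that and '%DAVID%' comes out at well under one row (which the
planner clamps up to 1), while '%DA%' lands in the same ballpark as the
plan's estimate of 54 rows; that gap is exactly the mis-estimate that
drives the nested-loop plan choice.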
> I am thinking that the rules for selectivity of LIKE patterns probably
> need to be modified.  Presently the code assumes that a long constant
> string has probability of occurrence proportional to the product of the
> probabilities of the individual letters.  That might be true in a random
> world, but people don't search for random strings.  I think we need to
> back off the selectivity estimate by some large factor to account for
> the fact that the pattern being searched for is probably not random.
> Anyone have ideas how to do that?

But what about '%A%' vs. '%AC%'?  The second seems different enough
from the first that our optimizer may be fine with that.  Is it only
when the strings get longer that we lose specificity?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> But what about '%A%' vs. '%AC%'?  The second seems different enough
> from the first that our optimizer may be fine with that.  Is it only
> when the strings get longer that we lose specificity?

Yeah, I don't think that the estimates are bad for one or two
characters.  But the estimate gets real small real fast as you
increase the number of match characters in the LIKE pattern.
We need to slow that down some.

			regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > But what about '%A%' vs. '%AC%'?  The second seems different enough
> > from the first that our optimizer may be fine with that.  Is it only
> > when the strings get longer that we lose specificity?
>
> Yeah, I don't think that the estimates are bad for one or two
> characters.  But the estimate gets real small real fast as you
> increase the number of match characters in the LIKE pattern.
> We need to slow that down some.

Yea, maybe a log base 2 decrease:

	 1 char		1x
	 2 char		2x
	 4 char		3x
	 8 char		4x
	16 char		5x

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
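If the table above is read with the "x" column as the number of
FIXED_CHAR_SEL factors to charge, the damping could be sketched as
below. This is one possible reading of the proposal and only an
illustration of the shape of the curve, not a patch against selfuncs.c.

    /*
     * Sketch of the "log base 2 decrease" idea: instead of charging a full
     * FIXED_CHAR_SEL factor for every pattern character, charge it only
     * (1 + log2(nchars)) times in total, so doubling the pattern length
     * adds just one more factor.
     */
    #include <math.h>
    #include <stdio.h>

    #define FIXED_CHAR_SEL 0.04

    static double
    damped_like_sel(int nchars)
    {
        /* 1, 2, 3, 4, 5 effective factors for 1, 2, 4, 8, 16 characters */
        double      factors = 1.0 + log((double) nchars) / log(2.0);

        return pow(FIXED_CHAR_SEL, factors);
    }

    int
    main(void)
    {
        int         n;

        for (n = 1; n <= 16; n *= 2)
            printf("%2d chars -> %.1f factors, selectivity %g\n",
                   n, 1.0 + log((double) n) / log(2.0), damped_like_sel(n));
        return 0;
    }

The point is that adding characters still reduces the estimate, but far
more slowly than multiplying in a fresh 0.04 for each one.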
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > But what about '%A%' vs. '%AC%'?  The second seems different enough
> > from the first that our optimizer may be fine with that.  Is it only
> > when the strings get longer that we lose specificity?
>
> Yeah, I don't think that the estimates are bad for one or two
> characters.  But the estimate gets real small real fast as you
> increase the number of match characters in the LIKE pattern.
> We need to slow that down some.

Could we just assign weights to the first few characters and then
consider only those first few characters when determining the
probability of finding the string?

If someone searches for '%New York City%' we have quite good reason to
believe that some of these exist in there, so said weights should
factor in the fact that one usually searches for strings that do
exist.

Another option would be to gather statistics not only on individual
letters but on bi- or trigraphs. Then as a next step we could implement
proper trigraph indexes ;)

----------------
Hannu
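As a toy illustration of the trigraph-statistics idea, the sketch below
counts 3-character substrings over a small sample of values; a pattern
like '%DAVID%' could then be costed from the observed frequencies of
'DAV', 'AVI' and 'VID' rather than from single-letter assumptions. The
sample values, the fixed-size array, and the linear search are made up
for brevity and bear no relation to how ANALYZE actually gathers
statistics.

    /* Toy trigram counter over a sample of column values. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_TRIGRAMS 1024

    struct trigram_stat
    {
        char        key[4];
        int         count;
    };

    static struct trigram_stat stats[MAX_TRIGRAMS];
    static int  nstats = 0;

    static void
    add_trigrams(const char *value)
    {
        size_t      len = strlen(value);
        size_t      i;

        for (i = 0; i + 3 <= len; i++)
        {
            char        key[4];
            int         j;

            memcpy(key, value + i, 3);
            key[3] = '\0';

            for (j = 0; j < nstats; j++)
            {
                if (strcmp(stats[j].key, key) == 0)
                {
                    stats[j].count++;
                    break;
                }
            }
            if (j == nstats && nstats < MAX_TRIGRAMS)
            {
                strcpy(stats[nstats].key, key);
                stats[nstats].count = 1;
                nstats++;
            }
        }
    }

    int
    main(void)
    {
        /* stand-in for a sample of "nombre" values */
        const char *sample[] = {"DAVID", "DANIEL", "DAVID FOO"};
        int         i;

        for (i = 0; i < (int) (sizeof(sample) / sizeof(sample[0])); i++)
            add_trigrams(sample[i]);

        for (i = 0; i < nstats; i++)
            printf("%s : %d\n", stats[i].key, stats[i].count);
        return 0;
    }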
On Tue, 4 Dec 2001, Tom Lane wrote:
> bombadil@wanadoo.es writes:
> > 1) # explain SELECT * from v_cliente where nombre like '%DA%';
> >
> > Merge Join  (cost=54763.50..62874.36 rows=413980 width=183)
> >   ->  Sort  (cost=16238.44..16238.44 rows=54 width=131)
>
> The problem here is that the planner is being way too optimistic about
> the selectivity of LIKE '%DAVID%' --- notice the estimate that only
> one matching row will be found in cliente, rather than 54 as with '%DA%'.
> So it chooses a plan that avoids the sort overhead needed for an
> efficient merge join with the other tables.  That would be a win if
> there were only one matching row, but as soon as there are lots, it's
> a big loss, because the subquery to join the other tables gets redone
> for every matching row :-(
>
> >> Also, how many rows are there really that match '%DA%' and '%DAVID%'?
>
> > 1) 2672 rows ->   3.59 sec.
> > 2)  257 rows -> 364.69 sec.
>
> I am thinking that the rules for selectivity of LIKE patterns probably
> need to be modified.  Presently the code assumes that a long constant
> string has probability of occurrence proportional to the product of the
> probabilities of the individual letters.  That might be true in a random
> world, but people don't search for random strings.  I think we need to
> back off the selectivity estimate by some large factor to account for
> the fact that the pattern being searched for is probably not random.
> Anyone have ideas how to do that?

Is there any statistic kept by a partial index?  For instance, the
number of occurrences of A, B, ...  A fairly small table/index could
track these, couldn't it?  If it were a btree itself, then the
statistics could be split appropriately into sub-branches when the
number of occurrences exceeds some threshold.

david
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > But what about '%A%' vs. '%AC%'?  The second seems different enough
> > from the first that our optimizer may be fine with that.  Is it only
> > when the strings get longer that we lose specificity?
>
> Yeah, I don't think that the estimates are bad for one or two
> characters.  But the estimate gets real small real fast as you
> increase the number of match characters in the LIKE pattern.
> We need to slow that down some.

See my earlier email about the 50% idea for LIKE.  Do ordinary string
comparisons also have this problem?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > But what about '%A%' vs. '%AC%'?  The second seems different enough
> > > from the first that our optimizer may be fine with that.  Is it only
> > > when the strings get longer that we lose specificity?
> >
> > Yeah, I don't think that the estimates are bad for one or two
> > characters.  But the estimate gets real small real fast as you
> > increase the number of match characters in the LIKE pattern.
> > We need to slow that down some.
>
> Yea, maybe a log base 2 decrease:
>
> 	 1 char		1x
> 	 2 char		2x
> 	 4 char		3x
> 	 8 char		4x
> 	16 char		5x

I did a little research on this.  I think the problem is that ordinary
characters are assumed to appear randomly in a character string, while
in practice, if the string has already been specified as 'DAV', there
are very few additional characters that can follow it and make sense.

Looking at backend/utils/adt/selfuncs.c, I see this:

	#define FIXED_CHAR_SEL	0.04	/* about 1/25 */
	...
	sel *= FIXED_CHAR_SEL;

which means every additional character reduces the selectivity by 96%.
This seems much too restrictive to me.  Because of the new optimizer
buckets, we do have good statistics on the leading character, but
additional characters drastically reduce the selectivity.

I think perhaps a number like 0.50 or 50% may be correct.  That would
be a table like this:

	 1 char		2x
	 2 char		4x
	 4 char		8x
	 8 char		16x
	16 char		32x

which is more restrictive than I initially suggested above but less
restrictive than what we have now.  Should we assume additional
characters are indeed appearing randomly in the string?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
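As a back-of-the-envelope check of how the current 0.04 and the
suggested 0.50 behave on the '%DAVID%' case from this thread (38975
rows, 257 real matches), using nothing but the bare
multiply-per-character model:

    /*
     * Compare estimated row counts for LIKE '%DAVID%' under two candidate
     * FIXED_CHAR_SEL values.  Not the real estimator, just the bare
     * product-of-factors model with different constants.
     */
    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double      nrows = 38975.0;    /* rows in cliente */
        double      actual = 257.0;     /* rows really matching '%DAVID%' */
        double      constants[] = {0.04, 0.50};
        int         nchars = 5;         /* D A V I D */
        int         i;

        for (i = 0; i < 2; i++)
        {
            double      sel = pow(constants[i], (double) nchars);

            printf("FIXED_CHAR_SEL %.2f: selectivity %.10f, "
                   "estimated rows %.2f (actual %.0f)\n",
                   constants[i], sel, sel * nrows, actual);
        }
        return 0;
    }

For this one data point, 0.04 underestimates badly (a fraction of a
row) while 0.50 overshoots (over a thousand rows), which hints that a
workable constant sits somewhere in between; it is only a single
example, though.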
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > But what about '%A%' vs. '%AC%'?  The second seems different enough
> > from the first that our optimizer may be fine with that.  Is it only
> > when the strings get longer that we lose specificity?
>
> Yeah, I don't think that the estimates are bad for one or two
> characters.  But the estimate gets real small real fast as you
> increase the number of match characters in the LIKE pattern.
> We need to slow that down some.

OK, I think I have the proper value for FIXED_CHAR_SEL.  It is
currently 0.04, roughly 1/26, meaning each letter is assumed to be
chosen at random from the alphabet, though this is not usually the
case.

If we assume our optimizer buckets have given us a reasonable value
for the first character, say an 'F', there are only a few valid
characters that can follow it, at least in English, and given that
character, there are again only a few characters that can validly come
next.  To my thinking, two characters together represent about the
same distribution as one random character, which leaves 0.20 as the
proper value for FIXED_CHAR_SEL, because 0.20 * 0.20 equals 0.04.

Added to TODO:

	* Change FIXED_CHAR_SEL to 0.20 from 0.04 to give better
	  selectivity (Bruce)

If people think there is a better value for this, please chime in.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> The problem here is that the planner is being way too optimistic about
> the selectivity of LIKE '%DAVID%' --- notice the estimate that only
> one matching row will be found in cliente, rather than 54 as with '%DA%'.
> So it chooses a plan that avoids the sort overhead needed for an
> efficient merge join with the other tables.  That would be a win if
> there were only one matching row, but as soon as there are lots, it's
> a big loss, because the subquery to join the other tables gets redone
> for every matching row :-(
>
> >> Also, how many rows are there really that match '%DA%' and '%DAVID%'?
>
> > 1) 2672 rows ->   3.59 sec.
> > 2)  257 rows -> 364.69 sec.
>
> I am thinking that the rules for selectivity of LIKE patterns probably
> need to be modified.  Presently the code assumes that a long constant
> string has probability of occurrence proportional to the product of the
> probabilities of the individual letters.  That might be true in a random
> world, but people don't search for random strings.  I think we need to
> back off the selectivity estimate by some large factor to account for
> the fact that the pattern being searched for is probably not random.
> Anyone have ideas how to do that?

Let's use the above example with the new FIXED_CHAR_SEL values.  With
the new 0.20 value for FIXED_CHAR_SEL, we see for DA and DAVID above:

	DA     1)  0.20 ^ 2  =  0.04
	DAVID  2)  0.20 ^ 5  =  0.00032

If we divide these two, we get:

	0.04 / 0.00032  =  125

while dividing the actual counts reported above gives:

	2672 / 257  =  ~10.4

The current 0.04 value gives:

	0.04 ^ 2  =  0.0016
	0.04 ^ 5  =  0.0000001024
	0.0016 / 0.0000001024  =  15625

Clearly the 0.20 value is roughly 10x too large, while the 0.04 value
is roughly 1000x too large.  Because this was a contrived example, and
because some people have more random text than DAVID in their fields,
I think 0.20 is the proper value.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
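For anyone who wants to replay the arithmetic, the sketch below uses
only the figures quoted in this thread; the exact multipliers come out
near 12x and 1500x, in line with the rough 10x and 1000x
characterization above.

    /*
     * Spell out the ratio arithmetic: how much the estimate shrinks going
     * from '%DA%' (2 characters) to '%DAVID%' (5) under each per-character
     * constant, versus the shrinkage actually observed in the reported
     * row counts.
     */
    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double      observed = 2672.0 / 257.0;              /* ~10.4 */
        double      ratio_020 = pow(0.20, 2) / pow(0.20, 5);    /* 125 */
        double      ratio_004 = pow(0.04, 2) / pow(0.04, 5);    /* 15625 */

        printf("observed DA/DAVID row ratio : %.2f\n", observed);
        printf("predicted with 0.20         : %.0f (about %.0fx too steep)\n",
               ratio_020, ratio_020 / observed);
        printf("predicted with 0.04         : %.0f (about %.0fx too steep)\n",
               ratio_004, ratio_004 / observed);
        return 0;
    }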