Thread: 7.0 like selectivity
Hi all, There was a bug(??) report about LIKE optimization of 7.0 beta3 in Japan from Akira Imagawa. It may be difficult to solve. Let t_hoge be a table like {hoge_cd int4 primary key,shimeinn text,tel text,.. } index hoge_ix2 on t_hoge(shimeinn). index hoge_ix3 on t_hoge(tel). There are 348236 rows in t_hoge. For the query select hoge_cd,shimeinn,telfrom t_hogewhere shimeinn like 'imag%' and tel like '012%'order by hoge_cdlimit 100; 64 rows returned immediately. And for the query select hoge_cd,shimeinn,telfrom t_hogewhere shimeinn like 'imag%' and tel like '012-3%'order by hoge_cd limit 100; 24 rows returned after waiting 8 minutes. I got the following output from him. explain select * from t_hoge where tel like '012%';Index Scan using t_hoge_ix3 on t_hoge (cost=0.00..0.23 rows=1981width=676) explain select * from t_hoge where tel like '012-3%';Index Scan using t_hoge_ix3 on t_hoge (cost=0.00..0.00 rows=1981width=676) In fact,count(*) is 342323 and 114741 respectively. The first problem is that estimated cost is too low. It seems that the index selectivity of '012-3%' = the index selectivity of '012%' / (256*256),right ? If so,does it give more practical estimation than before ? It doesn't correspond to rows information either. In reality, * shimeinn like 'imag%' * is much more restrictive than * tel like '012-3%' *. However I couldn't think of the way to foresee which is more restrictive. Now I doubt whether we have enough information to estimate LIKE selectivity correctly. It's the second problem. Comments ? Regards. Hiroshi Inoue Inoue@tpf.co.jp
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > For the query > select hoge_cd,shimeinn,tel > from t_hoge > where shimeinn like 'imag%' > and tel like '012%' > order by hoge_cd > limit 100; > 64 rows returned immediately. > And for the query > select hoge_cd,shimeinn,tel > from t_hoge > where shimeinn like 'imag%' > and tel like '012-3%' > order by hoge_cd > limit 100; > 24 rows returned after waiting 8 minutes. So what were the plans for these two queries? Also, has this table been "vacuum analyzed"? > I got the following output from him. > explain select * from t_hoge where tel like '012%'; > Index Scan using t_hoge_ix3 on t_hoge (cost=0.00..0.23 rows=1981 > width=676) > explain select * from t_hoge where tel like '012-3%'; > Index Scan using t_hoge_ix3 on t_hoge (cost=0.00..0.00 rows=1981 > width=676) > In fact,count(*) is 342323 and 114741 respectively. > The first problem is that estimated cost is too low. > It seems that the index selectivity of '012-3%' = the index > selectivity of '012%' / (256*256),right ? > If so,does it give more practical estimation than before ? > It doesn't correspond to rows information either. The rows number is fairly bogus (because it's coming from application of eqsel, which is not the right thing; perhaps someday LIKE should have its very own selectivity estimation function). But the cost estimate is driven by the estimated selectivity oftel >= '012-3' AND tel < '012-4' and it would be nice to think that we have some handle on that. It could be that the thing is getting fooled by a very non-uniform distribution of telephone numbers. You indicate that most of the numbers in the DB begin with '012', but if there are a small number that begin with digits as high as 9, the selectivity estimates would be pretty bad. > In reality, * shimeinn like 'imag%' * is much more restrictive > than * tel like '012-3%' *. However I couldn't think of the > way to foresee which is more restrictive. Now I doubt whether > we have enough information to estimate LIKE selectivity > correctly. No, we don't, because we only keep track of the min and max values in each column and assume that the data is uniformly distributed between those limits. Perhaps someday we could keep a histogram instead --- but VACUUM ANALYZE would get a lot slower and more complicated ... regards, tom lane
> -----Original Message----- > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > > "Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > > For the query > > select hoge_cd,shimeinn,tel > > from t_hoge > > where shimeinn like 'imag%' > > and tel like '012%' > > order by hoge_cd > > limit 100; > > > 64 rows returned immediately. > > > And for the query > > select hoge_cd,shimeinn,tel > > from t_hoge > > where shimeinn like 'imag%' > > and tel like '012-3%' > > order by hoge_cd > > limit 100; > > > 24 rows returned after waiting 8 minutes. > > So what were the plans for these two queries? OK,I would ask him to send them. > Also, has this table been > "vacuum analyzed"? > Yes,his another problem was solved by "vacuum analyze". > > I got the following output from him. > > explain select * from t_hoge where tel like '012%'; > > Index Scan using t_hoge_ix3 on t_hoge (cost=0.00..0.23 rows=1981 > > width=676) > > > explain select * from t_hoge where tel like '012-3%'; > > Index Scan using t_hoge_ix3 on t_hoge (cost=0.00..0.00 rows=1981 > > width=676) > > > In fact,count(*) is 342323 and 114741 respectively. > > > The first problem is that estimated cost is too low. > > It seems that the index selectivity of '012-3%' = the index > > selectivity of '012%' / (256*256),right ? > > If so,does it give more practical estimation than before ? > > It doesn't correspond to rows information either. > > The rows number is fairly bogus (because it's coming from application of > eqsel, which is not the right thing; perhaps someday LIKE should have > its very own selectivity estimation function). But the cost estimate > is driven by the estimated selectivity of > tel >= '012-3' AND tel < '012-4' > and it would be nice to think that we have some handle on that. > Shouldn't rows number and cost estimate correspond in this case ? For example,the following query would return same row numbers.select * from t_hoge where tel = '012'; And the cost estimate is probably > 1000. Is it good that the cost estimate for "tel like '012%'" is much smaller than " tel = '012' " ? PostgreSQL's selectivity doesn't mean a pure probabilty. For example,for int4 type the pure '=' probabity is pow(2,-32). Is current cost estimate for " tel>=val1 and tel <val2'" is effective for narrow range of (val1,val2) ? The range ('012-3','012-4') is veeeery narrow in the vast char(5) space. Regards. Hiroshi Inoue Inoue@tpf.co.jp
I've gotten the plans from Akira Imagawa. Regards. Hiroshi Inoue Inoue@tpf.co.jp > -----Original Message----- > From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org]On > Behalf Of Hiroshi Inoue > Sent: Friday, April 07, 2000 8:39 AM > To: Tom Lane > Cc: pgsql-hackers > Subject: RE: [HACKERS] 7.0 like selectivity > > > > -----Original Message----- > > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > > > > "Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > > > For the query > > > select hoge_cd,shimeinn,tel > > > from t_hoge > > > where shimeinn like 'imag%' > > > and tel like '012%' > > > order by hoge_cd > > > limit 100; > > > > > 64 rows returned immediately. Sort (cost=0.01..0.01 rows=1 width=28) -> Index Scan using t_hoge_ix1 on t_hoge (cost=0.00..0.00 rows=1 wid th=28) > > > > > And for the query > > > select hoge_cd,shimeinn,tel > > > from t_hoge > > > where shimeinn like 'imag%' > > > and tel like '012-3%' > > > order by hoge_cd > > > limit 100; > > > > > 24 rows returned after waiting 8 minutes. Sort (cost=0.01..0.01 rows=1 width=28) -> Index Scan using t_hoge_ix3 on t_hoge (cost=0.00..0.00 rows=1 wid th=28)
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > I've gotten the plans from Akira Imagawa. OK ... I assume t_hoge_ix1 is on shimeinn and t_hoge_ix3 is on tel ... Can we find out what are the min and max values of both the shimeinn and tel columns? regards, tom lane