Thread: 7.0 like selectivity

7.0 like selectivity

From

"Hiroshi Inoue"

Date:

06 April 2000, 14:01:38

Hi all,

There was a bug(??) report about LIKE optimization of
7.0 beta3 in Japan from Akira Imagawa.
It may be difficult to solve. 

Let t_hoge be a table like
{hoge_cd int4 primary key,shimeinn text,tel text,..
}
index hoge_ix2 on t_hoge(shimeinn).
index hoge_ix3 on t_hoge(tel).

There are 348236 rows in t_hoge.

For the query
select hoge_cd,shimeinn,telfrom t_hogewhere shimeinn like 'imag%'  and tel like '012%'order by hoge_cdlimit 100;

64 rows returned immediately.

And for the query
select hoge_cd,shimeinn,telfrom t_hogewhere shimeinn like 'imag%'  and tel like '012-3%'order by hoge_cd limit 100;

24 rows returned after waiting 8 minutes.

I got the following output from him.
explain select * from t_hoge where tel like '012%';Index Scan using t_hoge_ix3 on t_hoge  (cost=0.00..0.23
rows=1981width=676)

explain select * from t_hoge where tel like '012-3%';Index Scan using t_hoge_ix3 on t_hoge  (cost=0.00..0.00
rows=1981width=676)

In fact,count(*) is 342323 and 114741 respectively.

The first problem is that estimated cost is too low.
It seems that the index selectivity of '012-3%' = the index
selectivity of '012%' / (256*256),right ? 
If so,does it give more practical estimation than before ?
It doesn't correspond to rows information either.

In reality, * shimeinn like 'imag%' * is much more restrictive
than * tel like '012-3%' *.  However I couldn't think of the
way to foresee which is more restrictive. Now I doubt whether
we have enough information to estimate LIKE selectivity
correctly. It's the second problem.

Comments ? 

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

Re: 7.0 like selectivity

From

Tom Lane

Date:

06 April 2000, 14:31:06

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> For the query
> select hoge_cd,shimeinn,tel
>  from t_hoge
>  where shimeinn like 'imag%'
>    and tel like '012%'
>  order by hoge_cd
>  limit 100;

> 64 rows returned immediately.

> And for the query
> select hoge_cd,shimeinn,tel
>  from t_hoge
>  where shimeinn like 'imag%'
>    and tel like '012-3%'
>  order by hoge_cd
>   limit 100;

> 24 rows returned after waiting 8 minutes.

So what were the plans for these two queries?  Also, has this table been
"vacuum analyzed"?

> I got the following output from him.
> explain select * from t_hoge where tel like '012%';
>     Index Scan using t_hoge_ix3 on t_hoge  (cost=0.00..0.23 rows=1981
>     width=676)

> explain select * from t_hoge where tel like '012-3%';
>     Index Scan using t_hoge_ix3 on t_hoge  (cost=0.00..0.00 rows=1981
>     width=676)

> In fact,count(*) is 342323 and 114741 respectively.

> The first problem is that estimated cost is too low.
> It seems that the index selectivity of '012-3%' = the index
> selectivity of '012%' / (256*256),right ? 
> If so,does it give more practical estimation than before ?
> It doesn't correspond to rows information either.

The rows number is fairly bogus (because it's coming from application of
eqsel, which is not the right thing; perhaps someday LIKE should have
its very own selectivity estimation function).  But the cost estimate
is driven by the estimated selectivity oftel >= '012-3' AND tel < '012-4'
and it would be nice to think that we have some handle on that.

It could be that the thing is getting fooled by a very non-uniform
distribution of telephone numbers.  You indicate that most of the
numbers in the DB begin with '012', but if there are a small number
that begin with digits as high as 9, the selectivity estimates would
be pretty bad.

> In reality, * shimeinn like 'imag%' * is much more restrictive
> than * tel like '012-3%' *.  However I couldn't think of the
> way to foresee which is more restrictive. Now I doubt whether
> we have enough information to estimate LIKE selectivity
> correctly.

No, we don't, because we only keep track of the min and max values
in each column and assume that the data is uniformly distributed
between those limits.  Perhaps someday we could keep a histogram
instead --- but VACUUM ANALYZE would get a lot slower and more
complicated ...
        regards, tom lane

RE: 7.0 like selectivity

From

"Hiroshi Inoue"

Date:

06 April 2000, 22:32:55

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> 
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > For the query
> > select hoge_cd,shimeinn,tel
> >  from t_hoge
> >  where shimeinn like 'imag%'
> >    and tel like '012%'
> >  order by hoge_cd
> >  limit 100;
> 
> > 64 rows returned immediately.
> 
> > And for the query
> > select hoge_cd,shimeinn,tel
> >  from t_hoge
> >  where shimeinn like 'imag%'
> >    and tel like '012-3%'
> >  order by hoge_cd
> >   limit 100;
> 
> > 24 rows returned after waiting 8 minutes.
> 
> So what were the plans for these two queries?

OK,I would ask him to send them.

> Also, has this table been
> "vacuum analyzed"?
>

Yes,his another problem was solved by "vacuum analyze".
> > I got the following output from him.
> > explain select * from t_hoge where tel like '012%';
> >     Index Scan using t_hoge_ix3 on t_hoge  (cost=0.00..0.23 rows=1981
> >     width=676)
> 
> > explain select * from t_hoge where tel like '012-3%';
> >     Index Scan using t_hoge_ix3 on t_hoge  (cost=0.00..0.00 rows=1981
> >     width=676)
> 
> > In fact,count(*) is 342323 and 114741 respectively.
> 
> > The first problem is that estimated cost is too low.
> > It seems that the index selectivity of '012-3%' = the index
> > selectivity of '012%' / (256*256),right ? 
> > If so,does it give more practical estimation than before ?
> > It doesn't correspond to rows information either.
> 
> The rows number is fairly bogus (because it's coming from application of
> eqsel, which is not the right thing; perhaps someday LIKE should have
> its very own selectivity estimation function).  But the cost estimate
> is driven by the estimated selectivity of
>     tel >= '012-3' AND tel < '012-4'
> and it would be nice to think that we have some handle on that.
>

Shouldn't rows number and cost estimate correspond in this case ?
For example,the following query would return same row numbers.select * from t_hoge where tel = '012';
And the cost estimate is probably > 1000.
Is it good that the cost estimate for "tel like '012%'" is much smaller
than " tel = '012' " ?

PostgreSQL's selectivity doesn't mean a pure probabilty.
For example,for int4 type the pure '=' probabity is pow(2,-32).
Is current cost estimate for " tel>=val1 and tel <val2'" is effective
for narrow range of (val1,val2) ?  The range ('012-3','012-4')
is veeeery narrow in the vast char(5) space.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

RE: 7.0 like selectivity

From

"Hiroshi Inoue"

Date:

06 April 2000, 23:53:51

I've gotten  the plans from Akira Imagawa.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp 

> -----Original Message-----
> From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org]On
> Behalf Of Hiroshi Inoue
> Sent: Friday, April 07, 2000 8:39 AM
> To: Tom Lane
> Cc: pgsql-hackers
> Subject: RE: [HACKERS] 7.0 like selectivity 
> 
> 
> > -----Original Message-----
> > From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> > 
> > "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > > For the query
> > > select hoge_cd,shimeinn,tel
> > >  from t_hoge
> > >  where shimeinn like 'imag%'
> > >    and tel like '012%'
> > >  order by hoge_cd
> > >  limit 100;
> > 
> > > 64 rows returned immediately.

Sort  (cost=0.01..0.01 rows=1 width=28) ->  Index Scan using t_hoge_ix1 on t_hoge  (cost=0.00..0.00 rows=1 wid
th=28)

> > 
> > > And for the query
> > > select hoge_cd,shimeinn,tel
> > >  from t_hoge
> > >  where shimeinn like 'imag%'
> > >    and tel like '012-3%'
> > >  order by hoge_cd
> > >   limit 100;
> > 
> > > 24 rows returned after waiting 8 minutes.

Sort  (cost=0.01..0.01 rows=1 width=28) ->  Index Scan using t_hoge_ix3 on t_hoge  (cost=0.00..0.00 rows=1 wid
th=28)

Re: 7.0 like selectivity

From

Tom Lane

Date:

07 April 2000, 00:12:50

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> I've gotten  the plans from Akira Imagawa.

OK ... I assume t_hoge_ix1 is on shimeinn and t_hoge_ix3 is on tel ...

Can we find out what are the min and max values of both the
shimeinn and tel columns?
        regards, tom lane