Thread: planner question..

planner question..

From

Rajesh Kumar Mallah

Date:

17 April 2003, 04:34:36


Hi,

For a distribution of data like below why does the planner
choses to do an index scan by default for source = 'REGIS' when > 50%
of the rows are having source='REGIS'.

Of course index produced better results 22 sec versus 4 secs.
but can someone explain this case.

tradein_clients=# SELECT source,count(*) from  profile_master group by source;
+----------+--------+
|  source  | count  |
+----------+--------+
| BRANDING |  64008 |
| CATALOG  |    711 |
| EYP      |  10380 |
| IID      |    349 |
| IP       |    493 |
| REGIS    | 102090 |
+----------+--------+
(6 rows)

tradein_clients=# SELECT count(*) from  profile_master ;
+--------+
| count  |
+--------+
| 178031 |
+--------+
(1 row)


tradein_clients=# SET enable_indexscan=off;
tradein_clients=# explain analyze SELECT count(*) from  profile_master where source='REGIS';

+------------------------------------------------------------------------------------------------------------------------------+
|                                                          QUERY PLAN
      |
 

+------------------------------------------------------------------------------------------------------------------------------+
| Aggregate  (cost=23982.58..23982.58 rows=1 width=0) (actual time=22872.97..22872.97 rows=1 loops=1)
      |
 
|   ->  Seq Scan on profile_master  (cost=0.00..23970.40 rows=4871 width=0) (actual time=328.44..22730.69 rows=102090
loops=1)|
 
|         Filter: (source = 'REGIS'::character varying)
      |
 
| Total runtime: 22873.03 msec
      |
 

+------------------------------------------------------------------------------------------------------------------------------+
(4 rows)

tradein_clients=# SET enable_indexscan=on;
SET
tradein_clients=# explain analyze SELECT count(*) from  profile_master where source='REGIS';

+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                          QUERY PLAN
                                       |
 

+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Aggregate  (cost=18225.48..18225.48 rows=1 width=0) (actual time=5919.24..5919.24 rows=1 loops=1)
                                       |
 
|   ->  Index Scan using profile_master_index_source on profile_master  (cost=0.00..18213.31 rows=4871 width=0) (actual
time=9.43..5786.15rows=102090 loops=1) |
 
|         Index Cond: (source = 'REGIS'::character varying)
                                       |
 
| Total runtime: 5919.31 msec
                                       |
 

+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
(4 rows)

-- 
Rajesh Kumar Mallah,
Project Manager (Development)
Infocom Network Limited, New Delhi
phone: +91(11)6152172 (221) (L) ,9811255597 (M)

Visit http://www.trade-india.com ,
India's Leading B2B eMarketplace.

Re: planner question..

From

Stephan Szabo

Date:

17 April 2003, 09:51:10

On Thu, 17 Apr 2003, Rajesh Kumar Mallah wrote:

> For a distribution of data like below why does the planner
> choses to do an index scan by default for source = 'REGIS' when > 50%
> of the rows are having source='REGIS'.

It's not estimating that though.

> tradein_clients=# SET enable_indexscan=off;
> tradein_clients=# explain analyze SELECT count(*) from  profile_master where source='REGIS';
>
+------------------------------------------------------------------------------------------------------------------------------+
> |                                                          QUERY PLAN
        |
 
>
+------------------------------------------------------------------------------------------------------------------------------+
> | Aggregate  (cost=23982.58..23982.58 rows=1 width=0) (actual time=22872.97..22872.97 rows=1 loops=1)
        |
 
> |   ->  Seq Scan on profile_master  (cost=0.00..23970.40 rows=4871 width=0) (actual time=328.44..22730.69 rows=102090
loops=1)|
 

It's expecting that about 5000 rows match not 100000.

Does it get better if you raise the statistics target for the
column (see ALTER TABLE) from 10 to say 50 or 100?

Re: planner question..

From

Rajesh Kumar Mallah

Date:

17 April 2003, 09:57:20

Thanks Stephan,

I think there shud be more documentation on
understanding EXPLAIN (ANALYZE) output.
once i found an webpage but lost it.

where do u see its expecting 5000 not 100000?

Regds
mallah.

On Thursday 17 Apr 2003 7:21 pm, Stephan Szabo wrote:
> On Thu, 17 Apr 2003, Rajesh Kumar Mallah wrote:
> > For a distribution of data like below why does the planner
> > choses to do an index scan by default for source = 'REGIS' when > 50%
> > of the rows are having source='REGIS'.
>
> It's not estimating that though.
>
> > tradein_clients=# SET enable_indexscan=off;
> > tradein_clients=# explain analyze SELECT count(*) from  profile_master
> > where source='REGIS';
> > +------------------------------------------------------------------------
> >------------------------------------------------------+
> >
> > |                                                          QUERY PLAN    
> > |                                                      |
> >
> > +------------------------------------------------------------------------
> >------------------------------------------------------+
> >
> > | Aggregate  (cost=23982.58..23982.58 rows=1 width=0) (actual
> > | time=22872.97..22872.97 rows=1 loops=1)                          | -> 
> > | Seq Scan on profile_master  (cost=0.00..23970.40 rows=4871 width=0)
> > | (actual time=328.44..22730.69 rows=102090 loops=1) |
>
> It's expecting that about 5000 rows match not 100000.
>
> Does it get better if you raise the statistics target for the
> column (see ALTER TABLE) from 10 to say 50 or 100?
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster

-- 
Rajesh Kumar Mallah,
Project Manager (Development)
Infocom Network Limited, New Delhi
phone: +91(11)6152172 (221) (L) ,9811255597 (M)

Visit http://www.trade-india.com ,
India's Leading B2B eMarketplace.

Re: planner question..

From

Stephan Szabo

Date:

17 April 2003, 10:13:05

On Thu, 17 Apr 2003, Rajesh Kumar Mallah wrote:

> I think there shud be more documentation on
> understanding EXPLAIN (ANALYZE) output.
> once i found an webpage but lost it.

I believe there's one inside the docs on performance or optimization.

>
> where do u see its expecting 5000 not 100000?

> > > | Aggregate  (cost=23982.58..23982.58 rows=1 width=0) (actual
> > > | time=22872.97..22872.97 rows=1 loops=1)                          | ->
> > > | Seq Scan on profile_master  (cost=0.00..23970.40 rows=4871 width=0)
> > > | (actual time=328.44..22730.69 rows=102090 loops=1) |

In the Seq Scan line, it's got rows=4871.  That's the number it thinks are
going to match the filter/index condition.

Re: planner question..

From

Tom Lane

Date:

17 April 2003, 10:26:41

Rajesh Kumar Mallah <mallah@trade-india.com> writes:
> For a distribution of data like below why does the planner
> choses to do an index scan by default for source = 'REGIS' when > 50%
> of the rows are having source='REGIS'.

Are there a huge number of dead rows in the table?  ("vacuum verbose"
would give some info)

The given result seems suspect; an indexscan couldn't possibly read >50%
of the rows in less than a quarter of the time for a seqscan.  Unless
(a) the table contains vast amounts of empty space that the seqscan has to
slog through, or (b) your second measurement is bogus due to caching
performed by the first measurement.

Also, might the table be in order by the "source" column?  A
sufficiently high correlation might have persuaded the planner to try an
indexscan even if point (a) isn't true.
        regards, tom lane

Re: planner question..

From

Rajesh Kumar Mallah

Date:

17 April 2003, 11:10:17

On Thursday 17 Apr 2003 7:56 pm, you wrote:
> Rajesh Kumar Mallah <mallah@trade-india.com> writes:
> > For a distribution of data like below why does the planner
> > choses to do an index scan by default for source = 'REGIS' when > 50%
> > of the rows are having source='REGIS'.
>
> Are there a huge number of dead rows in the table?  ("vacuum verbose"
> would give some info)
>

Yes as you predicted after a vacuum verbose the planner did
switch to seq scan and was only marginally slower than index
scan. (do u think we shud live with it?)

so whats the moral ?
ANALYZE itself is not sufficient in updating pg_statistic,
we must do VACUUM ANALYZE always ,
if so why would a seperate ANALYZE command exist ?

here is the info (in the same order as commands were executed)

tradein_clients=# explain analyze SELECT count(*) from  general.profile_master where source='REGIS';

+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                           QUERY PLAN
                                        |
 

+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Aggregate  (cost=18225.48..18225.48 rows=1 width=0) (actual time=6295.33..6295.34 rows=1 loops=1)
                                        |
 
|   ->  Index Scan using profile_master_index_source on profile_master  (cost=0.00..18213.31 rows=4871 width=0) (actual
time=14.54..6132.61rows=102090 loops=1) |
 
|         Index Cond: (source = 'REGIS'::character varying)
                                        |
 
| Total runtime: 6295.41 msec
                                        |
 

+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
(4 rows)



tradein_clients=# VACUUM VERBOSE general.profile_master;
INFO:  --Relation general.profile_master--
INFO:  Index profile_master_profile_id_pkey: Pages 426; Tuples 178031: Deleted 11802.       CPU 0.02s/0.12u sec elapsed
0.24sec.
 
INFO:  Index profile_master_index_userid: Pages 431; Tuples 178031: Deleted 8940.       CPU 0.00s/0.13u sec elapsed
0.14sec.
 
INFO:  Index profile_master_index_creation_date: Pages 420; Tuples 178031: Deleted 8940.       CPU 0.02s/0.12u sec
elapsed0.20 sec.
 
INFO:  Index profile_master_index_company_id: Pages 419; Tuples 178031: Deleted 8940.       CPU 0.01s/0.10u sec elapsed
0.15sec.
 
INFO:  Index profile_master_index_eyp_list_id: Pages 419; Tuples 178031: Deleted 8940.       CPU 0.01s/0.11u sec
elapsed0.45 sec.
 
INFO:  Index profile_master_index_iid_list_id: Pages 420; Tuples 178031: Deleted 8940.       CPU 0.01s/0.10u sec
elapsed0.46 sec.
 
INFO:  Index profile_master_index_ip_list_id: Pages 421; Tuples 178031: Deleted 8940.       CPU 0.00s/0.11u sec elapsed
0.12sec.
 
INFO:  Index profile_master_index_catalog_company_id: Pages 419; Tuples 178031: Deleted 8940.       CPU 0.00s/0.12u sec
elapsed0.46 sec.
 
INFO:  Index profile_master_index_source: Pages 619; Tuples 178031: Deleted 8940.       CPU 0.02s/0.09u sec elapsed
0.29sec.
 
INFO:  Removed 11802 tuples in 495 pages.       CPU 0.01s/0.03u sec elapsed 0.14 sec.
INFO:  Pages 23923: Changed 36, Empty 0; Tup 178031: Vac 11802, Keep 0, UnUsed 325029.       Total CPU 0.96s/1.19u sec
elapsed11.97 sec.
 
INFO:  --Relation pg_toast.pg_toast_122045388--
INFO:  Pages 31: Changed 0, Empty 0; Tup 189: Vac 0, Keep 0, UnUsed 0.       Total CPU 0.00s/0.00u sec elapsed 0.03
sec.
VACUUM

tradein_clients=# explain analyze SELECT count(*) from  general.profile_master where source='REGIS';

+-----------------------------------------------------------------------------------------------------------------------------+
|                                                         QUERY PLAN
     |
 

+-----------------------------------------------------------------------------------------------------------------------------+
| Aggregate  (cost=26398.37..26398.37 rows=1 width=0) (actual time=8355.76..8355.76 rows=1 loops=1)
     |
 
|   ->  Seq Scan on profile_master  (cost=0.00..26148.39 rows=99994 width=0) (actual time=16.23..8237.53 rows=102090
loops=1)|
 
|         Filter: (source = 'REGIS'::character varying)
     |
 
| Total runtime: 8355.88 msec
     |
 

+-----------------------------------------------------------------------------------------------------------------------------+
(4 rows)


> The given result seems suspect; an indexscan couldn't possibly read >50%
> of the rows in less than a quarter of the time for a seqscan.  Unless
> (a) the table contains vast amounts of empty space that the seqscan has to
> slog through, or 

Sorry i dont understand the vacuum verbose output throughly , does it looks like
though ?

(b) your second measurement is bogus due to caching
> performed by the first measurement.

I am not sure this time but i have the habit of running 
EXPLAIN ANALYZE thrice and post the middle one ;-)



>
> Also, might the table be in order by the "source" column?  A
> sufficiently high correlation might have persuaded the planner to try an
> indexscan even if point (a) isn't true.

Yes the data is loaded from one source completely before loading
from another source. So they were in order but i did a lot of updates
and deletes.

regds
mallah.

>
>             regards, tom lane
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

-- 
Rajesh Kumar Mallah,
Project Manager (Development)
Infocom Network Limited, New Delhi
phone: +91(11)6152172 (221) (L) ,9811255597 (M)

Visit http://www.trade-india.com ,
India's Leading B2B eMarketplace.

Re: planner question..

From

Rajesh Kumar Mallah

Date:

17 April 2003, 11:12:26

On Thursday 17 Apr 2003 7:42 pm, Stephan Szabo wrote:
> On Thu, 17 Apr 2003, Rajesh Kumar Mallah wrote:
> > I think there shud be more documentation on
> > understanding EXPLAIN (ANALYZE) output.
> > once i found an webpage but lost it.
>
> I believe there's one inside the docs on performance or optimization.

I have seen that and saw a better one somewhere else which i
cant recall.


> > where do u see its expecting 5000 not 100000?
> >
> > > > | Aggregate  (cost=23982.58..23982.58 rows=1 width=0) (actual
> > > > | time=22872.97..22872.97 rows=1 loops=1)                          |
> > > > | -> Seq Scan on profile_master  (cost=0.00..23970.40 rows=4871
> > > > | width=0) (actual time=328.44..22730.69 rows=102090 loops=1) |
>
> In the Seq Scan line, it's got rows=4871.  That's the number it thinks are
> going to match the filter/index condition.

Thanks for spotting , i will go thru the docs once again though :)

regds
mallah.

-- 
Rajesh Kumar Mallah,
Project Manager (Development)
Infocom Network Limited, New Delhi
phone: +91(11)6152172 (221) (L) ,9811255597 (M)

Visit http://www.trade-india.com ,
India's Leading B2B eMarketplace.

Re: planner question..

From

Tom Lane

Date:

17 April 2003, 12:10:33

Rajesh Kumar Mallah <mallah@trade-india.com> writes:
> INFO:  Pages 23923: Changed 36, Empty 0; Tup 178031: Vac 11802, Keep 0, UnUsed 325029.

> Sorry i dont understand the vacuum verbose output throughly , does it looks like
> though ?

Yeah, I think you could do with a VACUUM FULL there ... the table
appears to be less than 50% occupied.
        regards, tom lane