Thread: planner question..
Hi, For a distribution of data like below why does the planner choses to do an index scan by default for source = 'REGIS' when > 50% of the rows are having source='REGIS'. Of course index produced better results 22 sec versus 4 secs. but can someone explain this case. tradein_clients=# SELECT source,count(*) from profile_master group by source; +----------+--------+ | source | count | +----------+--------+ | BRANDING | 64008 | | CATALOG | 711 | | EYP | 10380 | | IID | 349 | | IP | 493 | | REGIS | 102090 | +----------+--------+ (6 rows) tradein_clients=# SELECT count(*) from profile_master ; +--------+ | count | +--------+ | 178031 | +--------+ (1 row) tradein_clients=# SET enable_indexscan=off; tradein_clients=# explain analyze SELECT count(*) from profile_master where source='REGIS'; +------------------------------------------------------------------------------------------------------------------------------+ | QUERY PLAN | +------------------------------------------------------------------------------------------------------------------------------+ | Aggregate (cost=23982.58..23982.58 rows=1 width=0) (actual time=22872.97..22872.97 rows=1 loops=1) | | -> Seq Scan on profile_master (cost=0.00..23970.40 rows=4871 width=0) (actual time=328.44..22730.69 rows=102090 loops=1)| | Filter: (source = 'REGIS'::character varying) | | Total runtime: 22873.03 msec | +------------------------------------------------------------------------------------------------------------------------------+ (4 rows) tradein_clients=# SET enable_indexscan=on; SET tradein_clients=# explain analyze SELECT count(*) from profile_master where source='REGIS'; +---------------------------------------------------------------------------------------------------------------------------------------------------------------+ | QUERY PLAN | +---------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Aggregate (cost=18225.48..18225.48 rows=1 width=0) (actual time=5919.24..5919.24 rows=1 loops=1) | | -> Index Scan using profile_master_index_source on profile_master (cost=0.00..18213.31 rows=4871 width=0) (actual time=9.43..5786.15rows=102090 loops=1) | | Index Cond: (source = 'REGIS'::character varying) | | Total runtime: 5919.31 msec | +---------------------------------------------------------------------------------------------------------------------------------------------------------------+ (4 rows) -- Rajesh Kumar Mallah, Project Manager (Development) Infocom Network Limited, New Delhi phone: +91(11)6152172 (221) (L) ,9811255597 (M) Visit http://www.trade-india.com , India's Leading B2B eMarketplace.
On Thu, 17 Apr 2003, Rajesh Kumar Mallah wrote: > For a distribution of data like below why does the planner > choses to do an index scan by default for source = 'REGIS' when > 50% > of the rows are having source='REGIS'. It's not estimating that though. > tradein_clients=# SET enable_indexscan=off; > tradein_clients=# explain analyze SELECT count(*) from profile_master where source='REGIS'; > +------------------------------------------------------------------------------------------------------------------------------+ > | QUERY PLAN | > +------------------------------------------------------------------------------------------------------------------------------+ > | Aggregate (cost=23982.58..23982.58 rows=1 width=0) (actual time=22872.97..22872.97 rows=1 loops=1) | > | -> Seq Scan on profile_master (cost=0.00..23970.40 rows=4871 width=0) (actual time=328.44..22730.69 rows=102090 loops=1)| It's expecting that about 5000 rows match not 100000. Does it get better if you raise the statistics target for the column (see ALTER TABLE) from 10 to say 50 or 100?
Thanks Stephan, I think there shud be more documentation on understanding EXPLAIN (ANALYZE) output. once i found an webpage but lost it. where do u see its expecting 5000 not 100000? Regds mallah. On Thursday 17 Apr 2003 7:21 pm, Stephan Szabo wrote: > On Thu, 17 Apr 2003, Rajesh Kumar Mallah wrote: > > For a distribution of data like below why does the planner > > choses to do an index scan by default for source = 'REGIS' when > 50% > > of the rows are having source='REGIS'. > > It's not estimating that though. > > > tradein_clients=# SET enable_indexscan=off; > > tradein_clients=# explain analyze SELECT count(*) from profile_master > > where source='REGIS'; > > +------------------------------------------------------------------------ > >------------------------------------------------------+ > > > > | QUERY PLAN > > | | > > > > +------------------------------------------------------------------------ > >------------------------------------------------------+ > > > > | Aggregate (cost=23982.58..23982.58 rows=1 width=0) (actual > > | time=22872.97..22872.97 rows=1 loops=1) | -> > > | Seq Scan on profile_master (cost=0.00..23970.40 rows=4871 width=0) > > | (actual time=328.44..22730.69 rows=102090 loops=1) | > > It's expecting that about 5000 rows match not 100000. > > Does it get better if you raise the statistics target for the > column (see ALTER TABLE) from 10 to say 50 or 100? > > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster -- Rajesh Kumar Mallah, Project Manager (Development) Infocom Network Limited, New Delhi phone: +91(11)6152172 (221) (L) ,9811255597 (M) Visit http://www.trade-india.com , India's Leading B2B eMarketplace.
On Thu, 17 Apr 2003, Rajesh Kumar Mallah wrote: > I think there shud be more documentation on > understanding EXPLAIN (ANALYZE) output. > once i found an webpage but lost it. I believe there's one inside the docs on performance or optimization. > > where do u see its expecting 5000 not 100000? > > > | Aggregate (cost=23982.58..23982.58 rows=1 width=0) (actual > > > | time=22872.97..22872.97 rows=1 loops=1) | -> > > > | Seq Scan on profile_master (cost=0.00..23970.40 rows=4871 width=0) > > > | (actual time=328.44..22730.69 rows=102090 loops=1) | In the Seq Scan line, it's got rows=4871. That's the number it thinks are going to match the filter/index condition.
Rajesh Kumar Mallah <mallah@trade-india.com> writes: > For a distribution of data like below why does the planner > choses to do an index scan by default for source = 'REGIS' when > 50% > of the rows are having source='REGIS'. Are there a huge number of dead rows in the table? ("vacuum verbose" would give some info) The given result seems suspect; an indexscan couldn't possibly read >50% of the rows in less than a quarter of the time for a seqscan. Unless (a) the table contains vast amounts of empty space that the seqscan has to slog through, or (b) your second measurement is bogus due to caching performed by the first measurement. Also, might the table be in order by the "source" column? A sufficiently high correlation might have persuaded the planner to try an indexscan even if point (a) isn't true. regards, tom lane
On Thursday 17 Apr 2003 7:56 pm, you wrote: > Rajesh Kumar Mallah <mallah@trade-india.com> writes: > > For a distribution of data like below why does the planner > > choses to do an index scan by default for source = 'REGIS' when > 50% > > of the rows are having source='REGIS'. > > Are there a huge number of dead rows in the table? ("vacuum verbose" > would give some info) > Yes as you predicted after a vacuum verbose the planner did switch to seq scan and was only marginally slower than index scan. (do u think we shud live with it?) so whats the moral ? ANALYZE itself is not sufficient in updating pg_statistic, we must do VACUUM ANALYZE always , if so why would a seperate ANALYZE command exist ? here is the info (in the same order as commands were executed) tradein_clients=# explain analyze SELECT count(*) from general.profile_master where source='REGIS'; +----------------------------------------------------------------------------------------------------------------------------------------------------------------+ | QUERY PLAN | +----------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Aggregate (cost=18225.48..18225.48 rows=1 width=0) (actual time=6295.33..6295.34 rows=1 loops=1) | | -> Index Scan using profile_master_index_source on profile_master (cost=0.00..18213.31 rows=4871 width=0) (actual time=14.54..6132.61rows=102090 loops=1) | | Index Cond: (source = 'REGIS'::character varying) | | Total runtime: 6295.41 msec | +----------------------------------------------------------------------------------------------------------------------------------------------------------------+ (4 rows) tradein_clients=# VACUUM VERBOSE general.profile_master; INFO: --Relation general.profile_master-- INFO: Index profile_master_profile_id_pkey: Pages 426; Tuples 178031: Deleted 11802. CPU 0.02s/0.12u sec elapsed 0.24sec. INFO: Index profile_master_index_userid: Pages 431; Tuples 178031: Deleted 8940. CPU 0.00s/0.13u sec elapsed 0.14sec. INFO: Index profile_master_index_creation_date: Pages 420; Tuples 178031: Deleted 8940. CPU 0.02s/0.12u sec elapsed0.20 sec. INFO: Index profile_master_index_company_id: Pages 419; Tuples 178031: Deleted 8940. CPU 0.01s/0.10u sec elapsed 0.15sec. INFO: Index profile_master_index_eyp_list_id: Pages 419; Tuples 178031: Deleted 8940. CPU 0.01s/0.11u sec elapsed0.45 sec. INFO: Index profile_master_index_iid_list_id: Pages 420; Tuples 178031: Deleted 8940. CPU 0.01s/0.10u sec elapsed0.46 sec. INFO: Index profile_master_index_ip_list_id: Pages 421; Tuples 178031: Deleted 8940. CPU 0.00s/0.11u sec elapsed 0.12sec. INFO: Index profile_master_index_catalog_company_id: Pages 419; Tuples 178031: Deleted 8940. CPU 0.00s/0.12u sec elapsed0.46 sec. INFO: Index profile_master_index_source: Pages 619; Tuples 178031: Deleted 8940. CPU 0.02s/0.09u sec elapsed 0.29sec. INFO: Removed 11802 tuples in 495 pages. CPU 0.01s/0.03u sec elapsed 0.14 sec. INFO: Pages 23923: Changed 36, Empty 0; Tup 178031: Vac 11802, Keep 0, UnUsed 325029. Total CPU 0.96s/1.19u sec elapsed11.97 sec. INFO: --Relation pg_toast.pg_toast_122045388-- INFO: Pages 31: Changed 0, Empty 0; Tup 189: Vac 0, Keep 0, UnUsed 0. Total CPU 0.00s/0.00u sec elapsed 0.03 sec. VACUUM tradein_clients=# explain analyze SELECT count(*) from general.profile_master where source='REGIS'; +-----------------------------------------------------------------------------------------------------------------------------+ | QUERY PLAN | +-----------------------------------------------------------------------------------------------------------------------------+ | Aggregate (cost=26398.37..26398.37 rows=1 width=0) (actual time=8355.76..8355.76 rows=1 loops=1) | | -> Seq Scan on profile_master (cost=0.00..26148.39 rows=99994 width=0) (actual time=16.23..8237.53 rows=102090 loops=1)| | Filter: (source = 'REGIS'::character varying) | | Total runtime: 8355.88 msec | +-----------------------------------------------------------------------------------------------------------------------------+ (4 rows) > The given result seems suspect; an indexscan couldn't possibly read >50% > of the rows in less than a quarter of the time for a seqscan. Unless > (a) the table contains vast amounts of empty space that the seqscan has to > slog through, or Sorry i dont understand the vacuum verbose output throughly , does it looks like though ? (b) your second measurement is bogus due to caching > performed by the first measurement. I am not sure this time but i have the habit of running EXPLAIN ANALYZE thrice and post the middle one ;-) > > Also, might the table be in order by the "source" column? A > sufficiently high correlation might have persuaded the planner to try an > indexscan even if point (a) isn't true. Yes the data is loaded from one source completely before loading from another source. So they were in order but i did a lot of updates and deletes. regds mallah. > > regards, tom lane > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org -- Rajesh Kumar Mallah, Project Manager (Development) Infocom Network Limited, New Delhi phone: +91(11)6152172 (221) (L) ,9811255597 (M) Visit http://www.trade-india.com , India's Leading B2B eMarketplace.
On Thursday 17 Apr 2003 7:42 pm, Stephan Szabo wrote: > On Thu, 17 Apr 2003, Rajesh Kumar Mallah wrote: > > I think there shud be more documentation on > > understanding EXPLAIN (ANALYZE) output. > > once i found an webpage but lost it. > > I believe there's one inside the docs on performance or optimization. I have seen that and saw a better one somewhere else which i cant recall. > > where do u see its expecting 5000 not 100000? > > > > > > | Aggregate (cost=23982.58..23982.58 rows=1 width=0) (actual > > > > | time=22872.97..22872.97 rows=1 loops=1) | > > > > | -> Seq Scan on profile_master (cost=0.00..23970.40 rows=4871 > > > > | width=0) (actual time=328.44..22730.69 rows=102090 loops=1) | > > In the Seq Scan line, it's got rows=4871. That's the number it thinks are > going to match the filter/index condition. Thanks for spotting , i will go thru the docs once again though :) regds mallah. -- Rajesh Kumar Mallah, Project Manager (Development) Infocom Network Limited, New Delhi phone: +91(11)6152172 (221) (L) ,9811255597 (M) Visit http://www.trade-india.com , India's Leading B2B eMarketplace.
Rajesh Kumar Mallah <mallah@trade-india.com> writes: > INFO: Pages 23923: Changed 36, Empty 0; Tup 178031: Vac 11802, Keep 0, UnUsed 325029. > Sorry i dont understand the vacuum verbose output throughly , does it looks like > though ? Yeah, I think you could do with a VACUUM FULL there ... the table appears to be less than 50% occupied. regards, tom lane