Thread: Indexes and statistics
Hi all,

This is a further post from last week. I've got a table of phone call detail records.

buns=# select count(*) from cdr;
  count
---------
 2800653
(1 row)

One of the customers is quite large (8.3% of the records):

buns=# select count(*) from cdr where cust_id = 99201110;
 count
--------
 231889
(1 row)

I have indexes on cust_id (integer) and bill_id (integer). If I try to do a query on that customer it doesn't use the index:

buns=# explain analyse select count(*) from cdr where cust_id = 99201110 and bill_id is null;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=87082.81..87082.81 rows=1 width=0) (actual time=82279.63..82279.63 rows=1 loops=1)
   ->  Seq Scan on cdr  (cost=0.00..87037.71 rows=18041 width=0) (actual time=82279.61..82279.61 rows=0 loops=1)
         Filter: ((cust_id = 99201110) AND (bill_id IS NULL))
 Total runtime: 82280.19 msec
(4 rows)

I tried this:

    alter table cdr alter column cust_id set statistics 1000;
    alter table cdr alter column bill_id set statistics 1000;
    analyze verbose;

Then I ran the query again but I still got the same result. Then I tried disabling sequential scans in postgresql.conf, restarted the postmaster and ran the query again:

buns=# explain analyse select count(*) from cdr where cust_id = 99201110 and bill_id is null;
                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=913498.60..913498.60 rows=1 width=0) (actual time=48387.91..48387.91 rows=1 loops=1)
   ->  Index Scan using cdr_ix3 on cdr  (cost=0.00..913453.49 rows=18041 width=0) (actual time=48387.89..48387.89 rows=0 loops=1)
         Index Cond: (cust_id = 99201110)
         Filter: (bill_id IS NULL)
 Total runtime: 48388.47 msec
(5 rows)

The computed cost of using the index was a factor of 10 higher, which I presume is why the query planner wasn't using the index, but the query ran in half the time. So I guess I need to know how to alter the statistics collection so that the index will get used. I gather that values which occur in more than "a few" percent of the table cause the query planner not to use the index on that column. Does that mean I won't ever be able to get the query planner to use the cust_id index for that customer, or can I tune some parameters to change that?

Any suggestions appreciated.

Thanks,
David

David Witham
Telephony Platforms Architect
Unidial, Australia
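A narrower variant of those statistics steps, as a sketch, is to analyze just the one table and then look at what the planner actually recorded in the pg_stats view for the two columns:

    alter table cdr alter column cust_id set statistics 1000;
    alter table cdr alter column bill_id set statistics 1000;
    analyze verbose cdr;   -- analyze only this table rather than the whole database

    -- what the planner now believes about those columns
    select attname, null_frac, n_distinct, most_common_vals, most_common_freqs
    from pg_stats
    where tablename = 'cdr' and attname in ('cust_id', 'bill_id');

If cust_id = 99201110 shows up in most_common_vals with a frequency near 0.083, the planner's row estimate is about as good as it can get, and the problem is elsewhere in the cost model rather than in the statistics.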
"David Witham" <davidw@unidial.com.au> writes: > One of the customers is quite large (8.3% of the records): Hmm. Unless your rows are quite wide, a random sampling of 8.3% of the table would be expected to visit every page of the table, probably several times. So the planner's cost estimates do not seem out of line to me; an indexscan *should* be slow. The first question to ask is why the deviation from reality. Are the rows for that customer ID likely to be physically concentrated into a limited number of physical pages? Do you have so much RAM that the whole table got swapped in, eliminating the extra I/O that the planner is expecting? regards, tom lane
Tom,

I'm running PostgreSQL 7.3.2 on Red Hat Linux 7.3 with 512Mb RAM. The table definition is:

                     Table "public.cdr"
        Column         |         Type          | Modifiers
-----------------------+-----------------------+-----------
 carrier_id            | integer               | not null
 file_id               | integer               | not null
 service_num           | character varying(10) | not null
 day                   | date                  | not null
 time                  | integer               | not null
 destination           | character varying(20) | not null
 duration              | integer               | not null
 charge_wholesale      | numeric(8,2)          | not null
 charge_band_id        | integer               |
 charge_retail         | numeric(8,2)          | not null
 rate_plan_id          | integer               | not null
 item_code             | integer               | not null
 cust_id               | integer               | not null
 bill_id               | integer               |
 prefix                | character varying(12) |
 charge_wholesale_calc | numeric(8,2)          |
Indexes: cdr_ix1 btree ("day"),
         cdr_ix2 btree (service_num),
         cdr_ix3 btree (cust_id),
         cdr_ix4 btree (bill_id),
         cdr_ix5 btree (carrier_id),
         cdr_ix6 btree (file_id)

Does this make it a "wide" table?

The data arrives ordered by service_num, day, time. This customer has one primary service_num that most of the calls are made from. Therefore each day a clump of CDRs will be loaded for that customer, interspersed with CDRs from all the other customers. Therefore the distribution of records for a service_num is clumpy but evenly distributed throughout the table. For a customer with a single primary number, this result applies to the customer as a whole. For a customer with many service_num's the result is a little more doubtful, depending on whether their service_num's arrive sequentially or not. This would not necessarily be the case.

I hope this makes sense. Does it help any?

Thanks,
David
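One way to get a first read on that clumping from the optimizer's side is the correlation column in pg_stats, which records how closely the physical row order tracks each column's values (near 1 or -1 means well ordered on disk, near 0 means scattered). A sketch of the check:

    select attname, n_distinct, correlation
    from pg_stats
    where tablename = 'cdr'
      and attname in ('service_num', 'cust_id', 'bill_id');

Note that correlation measures ordering rather than clumping as such, so a column whose values arrive in daily clumps but not in sorted order can still show a low correlation even though its rows are physically concentrated.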
"David Witham" <davidw@unidial.com.au> writes: > Does this make it a "wide" table? Nope. A rough estimate is that your rows will be about 160 bytes wide, which means you can fit about 50 per 8K page. So a query that needs to select 8% of the table will *on average* need to hit about 4 rows per page. In the absence of any data clumping this would certainly mean that the scan would need to touch every page anyway, and thus that using the index could provide no I/O savings. However, > The data arrives ordered by service_num, day, time. This customer has > one primary service_num that most of the calls are made from. So you do have very strong clumping, which the planner is evidently failing to account for properly. Could we see the pg_stats rows for service_num and cust_id? I'm curious whether the ANALYZE stats picked up the effect at all. As far as actually solving the problem is concerned, you have a few options. I wouldn't recommend turning off enable_seqscan globally, but you could perhaps turn it off locally (just do a SET command) just for this query. Another possibility, if you care a lot about the speed of this particular type of query, is to make a partial index tuned to the query:create index my_idx on cdr (cust_id) WHERE bill_id IS NULL; I gather from your previously shown results that "bill_id IS NULL" covers only a small fraction of the table, so this index would be pretty small and should look quite attractive to the planner. regards, tom lane
> The computed cost of using the index was a factor of 10 higher, which I presume is why the query planner wasn't
> using the index, but it ran in half the time.

Have you tried playing with the random_page_cost parameter? The default is 4. Try:

	set random_page_cost = 1;

in psql to alter it for the current session (you can change this in postgresql.conf too). This will make index usage more attractive by reducing the computed cost of random page reads; that's the simple way of looking at it, anyway.

On my system I tested a 'typical' query exercising some joins on large tables which didn't use an index, though I thought it might perform better if it did. I determined that a random_page_cost of 1.8 would cause the indexes to be used, but in that case the *actual* performance didn't improve very much. I took this to mean that a random_page_cost of around 1.8/1.9 represents a rough balance point on my development server (one slow IDE disk and a big database).

As well as the other things mentioned by Tom, perhaps you should be looking for the "correct" setting of random_page_cost for your system. It may be appropriate to alter it globally in postgresql.conf, and also for specific situations such as the one you mentioned.

HTH
Iain
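A quick way to find that balance point is to step random_page_cost down within one session and watch where the plan flips, for example:

    set random_page_cost = 4;    -- the default
    explain select count(*) from cdr where cust_id = 99201110 and bill_id is null;
    set random_page_cost = 2;
    explain select count(*) from cdr where cust_id = 99201110 and bill_id is null;
    set random_page_cost = 1.5;
    explain select count(*) from cdr where cust_id = 99201110 and bill_id is null;

Once the index plan appears, run the same statement under EXPLAIN ANALYSE to confirm it is actually faster before making the setting permanent in postgresql.conf.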