Thread: Indexes and statistics

Indexes and statistics

From
"David Witham"
Date:
Hi all,

This is a further post from last week. I've got a table of phone call detail records.

buns=# select count(*) from cdr;
  count
---------
 2800653
(1 row)

One of the customers is quite large (8.3% of the records):

buns=# select count(*) from cdr where cust_id = 99201110;
 count
--------
 231889
(1 row)

I have indexes on cust_id (integer) and bill_id (integer). If I try to do a query on that customer it doesn't use the
index:

buns=# explain analyse select count(*) from cdr where cust_id = 99201110 and bill_id is null;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=87082.81..87082.81 rows=1 width=0) (actual time=82279.63..82279.63 rows=1 loops=1)
   ->  Seq Scan on cdr  (cost=0.00..87037.71 rows=18041 width=0) (actual time=82279.61..82279.61 rows=0 loops=1)
         Filter: ((cust_id = 99201110) AND (bill_id IS NULL))
 Total runtime: 82280.19 msec
(4 rows)

I tried this:

alter table cdr alter column cust_id set statistics 1000;
alter table cdr alter column bill_id set statistics 1000;
analyze verbose;
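
As a sanity check that the new targets actually stuck, something like this
against pg_attribute should show 1000 for both columns (attstattarget is
where the per-column target is stored, as far as I can tell):

    select a.attname, a.attstattarget
    from pg_attribute a, pg_class c
    where a.attrelid = c.oid
      and c.relname = 'cdr'
      and a.attname in ('cust_id', 'bill_id');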

Then I ran the query again, but I still got the same result.

Then I tried disabling sequential scans in postgresql.conf, restarted the postmaster and did the query again:

buns=# explain analyse select count(*) from cdr where cust_id = 99201110 and bill_id is null;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=913498.60..913498.60 rows=1 width=0) (actual time=48387.91..48387.91 rows=1 loops=1)
   ->  Index Scan using cdr_ix3 on cdr  (cost=0.00..913453.49 rows=18041 width=0) (actual time=48387.89..48387.89 rows=0 loops=1)
         Index Cond: (cust_id = 99201110)
         Filter: (bill_id IS NULL)
 Total runtime: 48388.47 msec
(5 rows)

The computed cost of using the index was a factor of 10 higher, which I presume is why the query planner wasn't using
the index, but it ran in half the time. So I guess I need to know how to alter the statistics collection so that the
index will get used. I gather that index columns that occur in more than "a few" percent of the table cause the query
planner to not use the index. Does that mean I won't be able to get the query planner to ever use the cust_id index for
that customer, or can I tune some parameters to alter that?

Any suggestions appreciated.

Thanks,
David

David Witham
Telephony Platforms Architect
Unidial, Australia




Re: Indexes and statistics

From
Tom Lane
Date:
"David Witham" <davidw@unidial.com.au> writes:
> One of the customers is quite large (8.3% of the records):

Hmm.  Unless your rows are quite wide, a random sampling of 8.3% of the
table would be expected to visit every page of the table, probably
several times.  So the planner's cost estimates do not seem out of line
to me; an indexscan *should* be slow.  The first question to ask is why
the deviation from reality.  Are the rows for that customer ID likely to
be physically concentrated into a limited number of physical pages?
Do you have so much RAM that the whole table got swapped in, eliminating
the extra I/O that the planner is expecting?
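
A rough way to gauge the table's physical size is to look at relpages in
pg_class -- it's only an estimate, refreshed by VACUUM/ANALYZE, but close
enough for this purpose (assuming the default 8K block size):

    select relpages, reltuples,
           relpages * 8 / 1024 as approx_mb   -- 8K pages -> megabytes
    from pg_class
    where relname = 'cdr';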
        regards, tom lane


Re: Indexes and statistics

From
"David Witham"
Date:
Tom,

I'm running PostgreSQL 7.3.2 on Red Hat Linux 7.3 with 512Mb RAM.

The table definition is:
                   Table "public.cdr"       Column         |         Type          | Modifiers
-----------------------+-----------------------+-----------carrier_id            | integer               | not
nullfile_id              | integer               | not nullservice_num           | character varying(10) | not nullday
                | date                  | not nulltime                  | integer               | not nulldestination
       | character varying(20) | not nullduration              | integer               | not nullcharge_wholesale
|numeric(8,2)          | not nullcharge_band_id        | integer               | charge_retail         | numeric(8,2)
      | not nullrate_plan_id          | integer               | not nullitem_code             | integer               |
notnullcust_id               | integer               | not nullbill_id               | integer               | prefix
            | character varying(12) | charge_wholesale_calc | numeric(8,2)          |  
Indexes: cdr_ix1 btree ("day"),        cdr_ix2 btree (service_num),        cdr_ix3 btree (cust_id),        cdr_ix4
btree(bill_id),        cdr_ix5 btree (carrier_id),        cdr_ix6 btree (file_id) 

Does this make it a "wide" table?

The data arrives ordered by service_num, day, time. This customer has one primary service_num that most of the calls
are made from. Therefore each day a clump of CDRs will be loaded for that customer, interspersed with CDRs from all the
other customers. Therefore the distribution of records for a service_num is clumpy but evenly distributed throughout the
table. For a customer with a single primary number, this result applies to the customer as a whole. For a customer with
many service_num's the result is a little more doubtful, depending on whether their service_num's arrive sequentially or
not. This would not necessarily be the case.

I hope this makes sense. Does it help any?

Thanks,
David



Re: Indexes and statistics

From
Tom Lane
Date:
"David Witham" <davidw@unidial.com.au> writes:
> Does this make it a "wide" table?

Nope.  A rough estimate is that your rows will be about 160 bytes wide,
which means you can fit about 50 per 8K page.  So a query that needs to
select 8% of the table will *on average* need to hit about 4 rows per
page.  In the absence of any data clumping this would certainly mean
that the scan would need to touch every page anyway, and thus that using
the index could provide no I/O savings.  However,

> The data arrives ordered by service_num, day, time. This customer has
> one primary service_num that most of the calls are made from.

So you do have very strong clumping, which the planner is evidently
failing to account for properly.  Could we see the pg_stats rows for
service_num and cust_id?  I'm curious whether the ANALYZE stats picked
up the effect at all.
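
Something like this should pull out the rows I mean (these are just the
standard columns of the pg_stats view):

    select attname, null_frac, n_distinct,
           most_common_vals, most_common_freqs, correlation
    from pg_stats
    where tablename = 'cdr'
      and attname in ('service_num', 'cust_id');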

As far as actually solving the problem is concerned, you have a few
options.  I wouldn't recommend turning off enable_seqscan globally,
but you could perhaps turn it off locally (just do a SET command) just
for this query.  Another possibility, if you care a lot about the speed
of this particular type of query, is to make a partial index tuned to
the query:
    create index my_idx on cdr (cust_id) WHERE bill_id IS NULL;
I gather from your previously shown results that "bill_id IS NULL"
covers only a small fraction of the table, so this index would be pretty
small and should look quite attractive to the planner.
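
For the record, the session-local version would look roughly like this;
the SET only affects your current connection:

    set enable_seqscan = off;
    select count(*) from cdr where cust_id = 99201110 and bill_id is null;
    set enable_seqscan = on;    -- or RESET enable_seqscan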
        regards, tom lane


Re: Indexes and statistics

From
"Iain"
Date:
> The computed cost of using the index was a factor of 10 higher which I
> presume is why the query planner wasn't using the index, but it ran in
> half the time.

Have you tried playing with the random_page_cost parameter? The default is
4. Try:

set random_page_cost = 1;

in psql to alter it for the current session (you can change this in
postgresql.conf too). This will make index usage more attractive by reducing
the computed cost. This is the simple way of looking at it anyway.
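
For example, you can flip the setting around the query in psql and watch
what EXPLAIN ANALYZE does, without touching the global config:

    set random_page_cost = 2;
    explain analyse select count(*) from cdr
        where cust_id = 99201110 and bill_id is null;
    set random_page_cost = 4;   -- back to the default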

On my system I tested a 'typical' query exercising some joins on large
tables which didn't use an index, but I thought maybe it would perform
better if it did. I determined that a random_page_cost of 1.8 would cause
indexes to be used, but in this case the *actual* performance didn't improve
very much. I took this to mean that a random_page_cost of around 1.8/1.9
represents a rough balance point on my development server (one slow IDE
disk, and a big database).

As well as the other things mentioned by Tom, perhaps you should be looking
for the "correct" setting of random_page_cost for your system. It may be
appropriate to alter it globally using postgresql.conf, and for specific
situations such as you mentioned.

HTH
Iain