Thread: index usage
Hi, I have a query which I think should be using an index all of the time but postgres only uses the index part of the time. The index (ticket_crm_map_crm_id_suppid) has the where clause column (crm_id) listed first followed by the selected column (support_person_id). Wouldn't the most efficient plan be to scan the index each time because the only columns needed are in the index? Below is the table, 2 queries showing the difference in plans, followed by the record distribution of ticket_crm_map. I first did a 'vacuum analyze' to update the statistics. Thanks, Brad athenapost=> \d ticket_crm_map Table "public.ticket_crm_map" Column | Type | Modifiers ------------------------+-----------------------------+--------------------- ----------------------- tcrm_map_id | integer | not null ticket_id | integer | not null crm_id | integer | not null support_person_id | integer | not null escalated_to_person_id | integer | not null status | character varying(50) | not null default 'Open'::character varying close_date | timestamp without time zone | updated_date | timestamp without time zone | updated_by | character varying(255) | created_date | timestamp without time zone | created_by | character varying(255) | additional_info | text | subject | character varying(255) | Indexes: "ticket_crm_map_pkey" primary key, btree (tcrm_map_id) "ticket_crm_map_crm_id_key" unique, btree (crm_id, ticket_id) "ticket_crm_map_crm_id_suppid" btree (crm_id, support_person_id) "ticket_crm_map_status" btree (status) "ticket_crm_map_ticket_id" btree (ticket_id) Foreign-key constraints: "$1" FOREIGN KEY (ticket_id) REFERENCES ticket(ticket_id) "$2" FOREIGN KEY (crm_id) REFERENCES company_crm(crm_id) "$3" FOREIGN KEY (support_person_id) REFERENCES person(person_id) "$4" FOREIGN KEY (escalated_to_person_id) REFERENCES person(person_id) "$5" FOREIGN KEY (status) REFERENCES ticket_status(status) athenapost=> explain analyze select distinct support_person_id from ticket_crm_map where crm_id = 7; QUERY PLAN ---------------------------------------------------------------------------- ---------------------------------------------------------------------------- ---------- Unique (cost=1262.99..1265.27 rows=1 width=4) (actual time=15.335..18.245 rows=20 loops=1) -> Sort (cost=1262.99..1264.13 rows=456 width=4) (actual time=15.332..16.605 rows=2275 loops=1) Sort Key: support_person_id -> Index Scan using ticket_crm_map_crm_id_suppid on ticket_crm_map (cost=0.00..1242.85 rows=456 width=4) (actual time=0.055..11.281 rows=2275 loops=1) Index Cond: (crm_id = 7) Total runtime: 18.553 ms (6 rows) Time: 20.598 ms athenapost=> explain analyze select distinct support_person_id from ticket_crm_map where crm_id = 1; QUERY PLAN ---------------------------------------------------------------------------- ----------------------------------------------------- Unique (cost=10911.12..11349.26 rows=32 width=4) (actual time=659.102..791.517 rows=24 loops=1) -> Sort (cost=10911.12..11130.19 rows=87628 width=4) (actual time=659.090..713.285 rows=93889 loops=1) Sort Key: support_person_id -> Seq Scan on ticket_crm_map (cost=0.00..3717.25 rows=87628 width=4) (actual time=0.027..359.299 rows=93889 loops=1) Filter: (crm_id = 1) Total runtime: 814.601 ms (6 rows) Time: 817.095 ms athenapost=> select count(*), crm_id from ticket_crm_map group by crm_id; count | crm_id -------+-------- 2554 | 63 129 | 25 17 | 24 110 | 23 74 | 22 69 | 21 2 | 20 53 | 82 10 | 17 16 | 81 46637 | 16 14 | 80 2 | 15 1062 | 79 87 | 78 93 | 77 60 | 44 363 | 76 225 | 10 4 | 74 83 | 9 27 | 73 182 | 8 2275 | 7 15 | 71 554 | 6 44 | 70 631 | 5 37 | 4 190 | 3 112 | 2 93889 | 1 (32 rows) Time: 436.697 ms
When checking an index in postgres the original table has to be checked for each result to find if the index entry is still valid? In which case you can't blindly scan the whole index and assume the data is good. I was used to Oracle behavior where the index is up to date so it can do the scan without hitting the original table. Does this sound correct to anyone? Thanks, Brad Stephan Szabo writes: > On Fri, 23 Apr 2004 brad-pgperf@duttonbros.com wrote: > >> I have a query which I think should be using an index all of the time but >> postgres only uses the index part of the time. The index >> (ticket_crm_map_crm_id_suppid) has the where clause column (crm_id) listed >> first followed by the selected column (support_person_id). Wouldn't the >> most efficient plan be to scan the index each time because the only columns >> needed are in the index? Below is the table, 2 queries showing the > > Not necessarily. The rows in the actual file still need to be checked to > see if they're visible to the select and if it's expected that the entire > file (or a reasonable % of the pages anyway) will need to be loaded using > the index isn't necessarily a win. >
On Fri, 23 Apr 2004 brad-pgperf@duttonbros.com wrote: > I have a query which I think should be using an index all of the time but > postgres only uses the index part of the time. The index > (ticket_crm_map_crm_id_suppid) has the where clause column (crm_id) listed > first followed by the selected column (support_person_id). Wouldn't the > most efficient plan be to scan the index each time because the only columns > needed are in the index? Below is the table, 2 queries showing the Not necessarily. The rows in the actual file still need to be checked to see if they're visible to the select and if it's expected that the entire file (or a reasonable % of the pages anyway) will need to be loaded using the index isn't necessarily a win. > athenapost=> explain analyze select distinct support_person_id from > ticket_crm_map where crm_id = 1; > QUERY PLAN > ---------------------------------------------------------------------------- > ----------------------------------------------------- > Unique (cost=10911.12..11349.26 rows=32 width=4) (actual > time=659.102..791.517 rows=24 loops=1) > -> Sort (cost=10911.12..11130.19 rows=87628 width=4) (actual > time=659.090..713.285 rows=93889 loops=1) > Sort Key: support_person_id > -> Seq Scan on ticket_crm_map (cost=0.00..3717.25 rows=87628 > width=4) (actual time=0.027..359.299 rows=93889 loops=1) > Filter: (crm_id = 1) > Total runtime: 814.601 ms How far off is this from the index scan version in time? Try doing set enable_seqscan=off; and then explain analyzing again. It's possible that you may wish to lower random_page_cost to change the estimated effect of how much more expensive random reads are compared to sequential ones.
On Mon, 26 Apr 2004, Stephan Szabo wrote: > > On Fri, 23 Apr 2004 brad-pgperf@duttonbros.com wrote: > > > I have a query which I think should be using an index all of the time but > > postgres only uses the index part of the time. The index > > (ticket_crm_map_crm_id_suppid) has the where clause column (crm_id) listed > > first followed by the selected column (support_person_id). Wouldn't the > > most efficient plan be to scan the index each time because the only columns > > needed are in the index? Below is the table, 2 queries showing the > > Not necessarily. The rows in the actual file still need to be checked to > see if they're visible to the select and if it's expected that the entire > file (or a reasonable % of the pages anyway) will need to be loaded using > the index isn't necessarily a win. While those of us familiar with PostgreSQL are well aware of the fact that indexes can't be used directly to garner information, but only as a lookup to a tuple in the table, it seems this misconception is quite common among those coming to postgreSQL from other databases. Is there any information that directly reflects this issue in the docs? There are tons of hints that it works this way in how they're written, but nothing that just comes out and says that with pgsql's mvcc implementation, an index scan still has to hit the pages that contain the tuples, so often in pgsql a seq scan is a win where in other databases and index scan would have been a win? If not, where would I add it if I were going to write something up for the docs? Just wondering...
"scott.marlowe" <scott.marlowe@ihs.com> writes: > There are tons of hints that it works this way in how they're written, but > nothing that just comes out and says that with pgsql's mvcc > implementation, an index scan still has to hit the pages that contain the > tuples, so often in pgsql a seq scan is a win where in other databases and > index scan would have been a win? > If not, where would I add it if I were going to write something up for the > docs? Just wondering... AFAIR the only place in the docs that mentions seqscan or indexscan at all is the discussion of EXPLAIN in "Performance Tips". Perhaps a suitably-enlarged version of that section could cover this. regards, tom lane