Thread: performance discrepancy indexing one column versus two columns

performance discrepancy indexing one column versus two columns

From
Mark Dilger
Date:
All,

In the psql output below, I would expect the second query to run faster, because
the b-tree index on two columns knows the values of 'b' for any given value of
'a', and hence does not need to fetch a row from the actual table.  I am not
seeing a speed-up, however, so I expect my understanding of the index mechanism
is wrong.  Could anyone enlighten me?

Specifically, I would expect the first query to walk the b-tree looking for
values of 'a' equal to 1, 100001, etc., and then a dereference over to the main
table to fetch the value for column 'b'.

But I would expect the second query to walk the b-tree looking for values of 'a'
equal to 1, 100001, etc., and then find on that same page in the b-tree the
value of 'b', thereby avoiding the dereference and extra page fetch.

Is the problem that the two-column b-tree contains more data, is spread across
more disk pages, and is hence slower to access, canceling out the performance
gain of not having to fetch from the main table?  Or is the query system not
using the second column information from the index and doing the table fetch
anyway?  Or does the index store the entire row from the main table regardless
of the column being indexed?

I am running postgresql 8.0.3 on a Pentium 4 with ide hard drives and the
default configuration file settings.

Thanks in advance,

mark



mark=# create sequence test_id_seq;
CREATE SEQUENCE
mark=# create table test (a integer not null default nextval('test_id_seq'), b
integer not null);
CREATE TABLE
mark=# create function testfunc () returns void as $$
mark$# declare
mark$# i integer;
mark$# begin
mark$# for i in 1..1000000 loop
mark$# insert into test (b) values (i);
mark$# end loop;
mark$# return;
mark$# end;
mark$# $$ language plpgsql;
CREATE FUNCTION
mark=# select * from testfunc();
  testfunc
----------

(1 row)

mark=# select count(*) from test;
   count
---------
  1000000
(1 row)

mark=# create index test_single_idx on test(a);
CREATE INDEX
mark=# vacuum full;
VACUUM
mark=# analyze;
ANALYZE
mark=# explain analyze select b from test where a in (1, 100001, 200001, 300001,
400001, 500001, 600001, 700001, 800001, 900001);

                                                      QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Index Scan using test_single_idx, test_single_idx, test_single_idx,
test_single_idx, test_single_idx, test_single_idx, test_single_idx,
test_single_idx, test_single_idx, test_single_idx on test  (cost=0.00..30.36
rows=10 width=4) (actual time=0.145..0.917 rows=10 loops=1)
    Index Cond: ((a = 1) OR (a = 100001) OR (a = 200001) OR (a = 300001) OR (a =
400001) OR (a = 500001) OR (a = 600001) OR (a = 700001) OR (a = 800001) OR(a =
900001))
  Total runtime: 1.074 ms
(3 rows)

mark=# drop index test_single_idx;
DROP INDEX
mark=# create index test_double_idx on test(a,b);
CREATE INDEX
mark=# vacuum full;
VACUUM
mark=# analyze;
ANALYZE
mark=# explain analyze select b from test where a in (1, 100001, 200001, 300001,
400001, 500001, 600001, 700001, 800001, 900001);

                                                      QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Index Scan using test_double_idx, test_double_idx, test_double_idx,
test_double_idx, test_double_idx, test_double_idx, test_double_idx,
test_double_idx, test_double_idx, test_double_idx on test  (cost=0.00..43.48
rows=10 width=4) (actual time=0.283..1.119 rows=10 loops=1)
    Index Cond: ((a = 1) OR (a = 100001) OR (a = 200001) OR (a = 300001) OR (a =
400001) OR (a = 500001) OR (a = 600001) OR (a = 700001) OR (a = 800001) OR(a =
900001))
  Total runtime: 1.259 ms
(3 rows)

Re: performance discrepancy indexing one column versus

From
Gavin Sherry
Date:
On Sun, 11 Sep 2005, Mark Dilger wrote:

> All,
>
> In the psql output below, I would expect the second query to run faster,
> because the b-tree index on two columns knows the values of 'b' for any
> given value of 'a', and hence does not need to fetch a row from the
> actual table.  I am not seeing a speed-up, however, so I expect my
> understanding of the index mechanism is wrong.  Could anyone enlighten
> me?

A common but incorrect assumption. We must consult the underlying table
when we do an index scan so that we can check visibility information. The
reason it is stored there in the table is so that we have only one place
to check for tuple visibility and therefore avoid race conditions.

A brief explanation of this system is described here:
http://www.postgresql.org/docs/8.0/static/mvcc.html.

and this page shows what information we store in the to do visibility
checks:

http://www.postgresql.org/docs/8.0/static/storage-page-layout.html

Thanks,

Gavin