Thread: What determines the cost of an index scan?

What determines the cost of an index scan?

From: Christian Schröder <cs@deriva.de>
Date: Sun, 4 Jan 2009 11:35:51 +0100
Hi list,
I have experienced the following situation: a join between two tables
(one with ~900k rows, the other with ~1600k rows) takes about 20 sec on
our production database. I have created two tables in our test database
with the same data, but with fewer fields (I have omitted several
fields that do not participate in the join). If I run the join in our
test database it takes about 8 sec. Both queries have the same query plan:

prod=# explain analyze select 1 from dtng."Z_UL" inner join
dtng."Z_BARRIER" using ("ISIN", "ID_NOTATION");
                                                                  QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------
Merge Join  (cost=0.00..131201.15 rows=39376 width=0) (actual
time=0.198..16086.185 rows=1652076 loops=1)
  Merge Cond: ((("Z_UL"."ISIN")::bpchar = ("Z_BARRIER"."ISIN")::bpchar)
AND ("Z_UL"."ID_NOTATION" = "Z_BARRIER"."ID_NOTATION"))
  ->  Index Scan using "Z_UL_pkey" on "Z_UL"  (cost=0.00..34806.57
rows=897841 width=20) (actual time=0.075..1743.396 rows=897841 loops=1)
  ->  Index Scan using "Z_BARRIER_ISIN_ID_NOTATION_key" on "Z_BARRIER"
(cost=0.00..83255.17 rows=1652076 width=20) (actual time=0.076..3389.676
rows=1652076 loops=1)
Total runtime: 18123.042 ms

test=# explain analyze select 1 from table1 inner join table2 using
(key1, key2);

QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=0.00..82443.05 rows=36158 width=0) (actual
time=0.092..8036.490 rows=1652076 loops=1)
   Merge Cond: ((table1.key1 = table2.key1) AND (table1.key2 = table2.key2))
   ->  Index Scan using table1_pkey on table1  (cost=0.00..22719.56
rows=897841 width=20) (actual time=0.026..845.916 rows=897841 loops=1)
   ->  Index Scan using table2_key1_key2_key on table2
(cost=0.00..46638.20 rows=1652076 width=20) (actual time=0.049..1843.047
rows=1652076 loops=1)
 Total runtime: 8460.956 ms

No disk I/O occurs on either server, so I assume the whole data set is
already cached in memory. Both servers are idle, both run the same
PostgreSQL version (8.2.9), and both are 64-bit machines. However, the
servers have different CPUs and memory: the production server has 4
dual-core AMD Opteron 8214 processors (2.2 GHz) and 16 GB of memory,
while the test server has 2 dual-core Intel Xeon 5130 processors
(2.0 GHz) and 8 GB of memory. I have not yet done a CPU and memory
benchmark, but that is my next step.
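(To double-check that the data really is served from the cache, I
suppose one could look at the block-level statistics before and after
running the query; just a sketch, using the test table names from above:

    select relname, heap_blks_read, heap_blks_hit, idx_blks_read, idx_blks_hit
    from pg_statio_user_tables
    where relname in ('table1', 'table2');

If everything comes from shared buffers, only the *_hit counters should
go up between two runs.)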

Where does this difference come from? Pure CPU performance? Do the
additional fields in the production database have an impact on the
performance? Or am I missing something?

Regards,
    Christian

--
Deriva GmbH                         Tel.: +49 551 489500-42
Financial IT and Consulting         Fax:  +49 551 489500-91
Hans-Böckler-Straße 2                  http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer


Re: What determines the cost of an index scan?

From: Martin Gainty
Date:
Christian

I would suggest ensuring the results are pre-ordered (according to the columns being merged).
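For example (just a sketch, with the table names taken from the plans above), one could check that both sides of the join already have an index on the merge columns, which is what lets the planner feed the merge join with pre-sorted input:

    select schemaname, tablename, indexname, indexdef
    from pg_indexes
    where tablename in ('Z_UL', 'Z_BARRIER');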

anyone?
Martin

Re: What determines the cost of an index scan?

From: Gregory Stark
Date:
Christian Schröder <cs@deriva.de> writes:

> Where does this difference come from? Pure cpu performance? Do the additional
> fields in the productive database have an impact on the performance? Or do I
> miss something?

Sure, more data takes more time to process.

Other factors that could affect it are how much dead space is in the table due
to previous updates and deletes, as well as how fragmented the indexes have
become over time.
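
A rough way to gauge both effects is to compare how many tuples fit on a
page in each copy of the data -- noticeably fewer tuples per page on the
production tables would point to wider rows and/or dead space. Something
like this (a sketch; run the relevant names on each server, and note the
numbers are only as fresh as the last VACUUM/ANALYZE):

    select relname, relpages, reltuples,
           reltuples / nullif(relpages, 0) as tuples_per_page
    from pg_class
    where relname in ('Z_UL', 'Z_BARRIER', 'table1', 'table2');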

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!

Re: What determines the cost of an index scan?

From: Craig Ringer
Date:
Gregory Stark wrote:
> Christian Schröder <cs@deriva.de> writes:
>
>> Where does this difference come from? Pure cpu performance? Do the additional
>> fields in the productive database have an impact on the performance? Or do I
>> miss something?
>
> Sure, more data takes more time to process.
>
> Other factors that could affect it are how much dead space is in the table due
> to previous updates and deletes, as well as how fragmented the indexes have
> become over time.

In other words: on the production server you might want to VACUUM FULL,
REINDEX, and test again. Even better, use CLUSTER rather than VACUUM
FULL, since it'll be faster and will order your data on disk according
to your selected index (usually the primary key) as well.
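
Something along these lines, run during a maintenance window, for
example (an untested sketch; the index-before-table CLUSTER syntax is
the 8.2 form, and CLUSTER already rebuilds the table's indexes, so the
REINDEX step mainly matters if you go the VACUUM FULL route instead):

    -- rewrite the tables in index order, discarding dead space
    CLUSTER "Z_UL_pkey" ON dtng."Z_UL";
    CLUSTER "Z_BARRIER_ISIN_ID_NOTATION_key" ON dtng."Z_BARRIER";
    -- rebuild the indexes (only really needed after VACUUM FULL)
    REINDEX TABLE dtng."Z_UL";
    REINDEX TABLE dtng."Z_BARRIER";
    -- refresh planner statistics afterwards
    ANALYZE dtng."Z_UL";
    ANALYZE dtng."Z_BARRIER";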

REINDEX, CLUSTER, and VACUUM FULL will completely prevent all access to
the table being operated on while they run, as they all take an ACCESS
EXCLUSIVE lock. See:
http://www.postgresql.org/docs/8.1/static/explicit-locking.html

As a result you'll want to schedule a downtime window, or at least do it
while the affected tables aren't needed. You should also check first
whether you actually need to do it at all: you can get an estimate of
table bloat from the catalog data with a bit of massaging. See:

http://pgsql.tapoueh.org/site/html/news/20080131.bloat.html

http://www.postgresql.org/docs/current/static/functions-admin.html#FUNCTIONS-ADMIN-DBSIZE
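
For a quick before/after comparison the dbsize functions alone go a long
way, e.g. (all of these exist in 8.2):

    SELECT pg_size_pretty(pg_relation_size('dtng."Z_UL"'))       AS table_size,
           pg_size_pretty(pg_total_relation_size('dtng."Z_UL"')) AS table_plus_indexes;

If the production table is much larger than its freshly loaded test twin
relative to the row counts, bloat is a likely culprit.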

(It'd be REALLY, REALLY useful to have functions along the lines of
pg_table_bloat(tablename) and pg_index_bloat(indexname) built into Pg,
actually.)

Note that under normal circumstances you should not run VACUUM FULL.
However, it can be useful if your tables have become really bloated due
to an undersized free space map (max_fsm_pages), infrequent VACUUMing,
etc. If you do run it, it's probably wise to also run REINDEX on the
table(s) you ran VACUUM FULL on.

--
Craig Ringer