Re: Specifying many rows in a table - Mailing list pgsql-general

From Steve Atkins
Subject Re: Specifying many rows in a table
Date
Msg-id 20040129032405.GA11231@gp.word-to-the-wise.com
Whole thread Raw
In response to Re: Specifying many rows in a table  (Steve Atkins <steve@blighty.com>)
List pgsql-general
On Wed, Jan 28, 2004 at 04:14:21PM -0800, Steve Atkins wrote:
> I'm trying a number of variants (in(list), in(select), exists, join) and
> benchmarking all of them across a range of numbers, randomly chosen or
> clustered with appropriate splashing of the DB and disk buffers between
> runs.
>
> Once the perl script has finished I'll post a pointer to the data.

On the off-chance anyone else is interested in the performance
variations (or is searching the archives from the year 2020)...

I have a table of 10,000,000 rows, each row having a few hundred bytes
of data. It has an integer primary key, and is clustered on that key.
Real data, not synthetic, not that it makes a difference for this
test.

I have in the application layer a list of the rows I want to retrieve
(a sorted list of unique integers) and want to retrieve all those rows
from the big table.

The test system is a single processor Athlon 1800 with a Reiser
filesystem on two striped decent scsi drives. PostgreSQL is pretty
well tuned. Between each test run PostgreSQL was shut down, a gig of
data read from the disk to flush OS caches and PostgreSQL restarted.

The big table is indexed and analysed. All temporary tables are analysed
(and times include construction, population, any indexing and analysing
 of the table, but not destruction of it, nor time to transfer data to the
 client).

The sequence of ids to be retrieved is identical in each run. In the
first batch of tests it's completely random, but sorted. In the second batch
it's a sequence (42, 43, 44, 45...) with a random starting point.


inscalar: select * from big where id in (1, 17, 43 ...., 2071)
inselect: select * from big where id in (select id from small)
exists:   select * from big where exists (select 1 from small where id=big.id)
join:     select * from big, small where big.id = small.id;

inselect-idx: As inselect, but small is indexed
exists-idx:   As exists, but small is indexed
join-idx:     As join, but small is indexed

In all but the first case, small is created as a temporary table, populated
with a sequence of inserts (inside a transaction), possibly indexed then
analysed.

                             number of random rows retrieved
                1        10       100      1000      10000    100000
inscalar       0.394    0.818    3.80    23.5
inselect       0.752    0.770    0.751    0.977     0.798    0.790
join           0.728    0.746    0.625    0.766     0.703    0.806
inselect-idx   0.655    0.642    0.748    0.822     0.777    0.820
exists-idx    50.7     49.4     49.4     50.0      49.4     51.2
join-idx       0.657    0.703    0.760    0.730     0.731    0.815

                             number of sequential rows retrieved
                1        10       100      1000      10000    100000
inscalar       0.464    0.474    0.488    0.529
inselect       0.820    0.826    0.778    0.776     0.612    0.770
join           0.754    0.855    0.831    0.829     0.899    0.742
inselect-idx   0.686    0.679    0.703    0.747     0.747    0.729
exists-idx    49.2     79.8     49.5     49.3      49.3     53.7
join-idx       0.625    0.754    0.697    0.770     0.750    0.783



Overall, IN(scalar) looks pretty good for single digit numbers. A simple
join looks nice for anything else - and indexing the temporary table
doesn't seem to buy anything.

Pretty impressive overall, though.

Cheers,
  Steve

pgsql-general by date:

Previous
From: Craig Addleman
Date:
Subject: Pl/tcl auto-load problem
Next
From: "Chris Travers"
Date:
Subject: Re: Interest in database-based email?