Re: Specifying many rows in a table - Mailing list pgsql-general
From | Steve Atkins |
---|---|
Subject | Re: Specifying many rows in a table |
Date | |
Msg-id | 20040129032405.GA11231@gp.word-to-the-wise.com Whole thread Raw |
In response to | Re: Specifying many rows in a table (Steve Atkins <steve@blighty.com>) |
List | pgsql-general |
On Wed, Jan 28, 2004 at 04:14:21PM -0800, Steve Atkins wrote: > I'm trying a number of variants (in(list), in(select), exists, join) and > benchmarking all of them across a range of numbers, randomly chosen or > clustered with appropriate splashing of the DB and disk buffers between > runs. > > Once the perl script has finished I'll post a pointer to the data. On the off-chance anyone else is interested in the performance variations (or is searching the archives from the year 2020)... I have a table of 10,000,000 rows, each row having a few hundred bytes of data. It has an integer primary key, and is clustered on that key. Real data, not synthetic, not that it makes a difference for this test. I have in the application layer a list of the rows I want to retrieve (a sorted list of unique integers) and want to retrieve all those rows from the big table. The test system is a single processor Athlon 1800 with a Reiser filesystem on two striped decent scsi drives. PostgreSQL is pretty well tuned. Between each test run PostgreSQL was shut down, a gig of data read from the disk to flush OS caches and PostgreSQL restarted. The big table is indexed and analysed. All temporary tables are analysed (and times include construction, population, any indexing and analysing of the table, but not destruction of it, nor time to transfer data to the client). The sequence of ids to be retrieved is identical in each run. In the first batch of tests it's completely random, but sorted. In the second batch it's a sequence (42, 43, 44, 45...) with a random starting point. inscalar: select * from big where id in (1, 17, 43 ...., 2071) inselect: select * from big where id in (select id from small) exists: select * from big where exists (select 1 from small where id=big.id) join: select * from big, small where big.id = small.id; inselect-idx: As inselect, but small is indexed exists-idx: As exists, but small is indexed join-idx: As join, but small is indexed In all but the first case, small is created as a temporary table, populated with a sequence of inserts (inside a transaction), possibly indexed then analysed. number of random rows retrieved 1 10 100 1000 10000 100000 inscalar 0.394 0.818 3.80 23.5 inselect 0.752 0.770 0.751 0.977 0.798 0.790 join 0.728 0.746 0.625 0.766 0.703 0.806 inselect-idx 0.655 0.642 0.748 0.822 0.777 0.820 exists-idx 50.7 49.4 49.4 50.0 49.4 51.2 join-idx 0.657 0.703 0.760 0.730 0.731 0.815 number of sequential rows retrieved 1 10 100 1000 10000 100000 inscalar 0.464 0.474 0.488 0.529 inselect 0.820 0.826 0.778 0.776 0.612 0.770 join 0.754 0.855 0.831 0.829 0.899 0.742 inselect-idx 0.686 0.679 0.703 0.747 0.747 0.729 exists-idx 49.2 79.8 49.5 49.3 49.3 53.7 join-idx 0.625 0.754 0.697 0.770 0.750 0.783 Overall, IN(scalar) looks pretty good for single digit numbers. A simple join looks nice for anything else - and indexing the temporary table doesn't seem to buy anything. Pretty impressive overall, though. Cheers, Steve
pgsql-general by date: