Hi,
Im working on a large-scale simulation study in biology, and
Im using PostgreSQL as the hub of a distributed computing system.
Essentially, I have a single table containing 2.7 million rows, each of
which represents a work unit to be processed (think SETI@Home). Among
other fields, each row in the table contains a unique ID and a status,
which is one of WAIT (not yet processed) OUT (processing) or DONE
(completed). To dispatch a unit to a client, Id like to pick a row with
status = WAIT, mark it as OUT, and return the values. The problem I have
is that finding a row takes a really long time (~22s on a 2-way PIII-700
running Linux 2.4.19). My SQL looks like: select min(id) from analyses
where status=WAIT. I have indexes on the id field and the status
field, but an explain shows that theyre not being usedIm assuming
that the optimizer has decided that theres no point in using the index,
since it doesnt sufficiently limit the scope of the search. Is there
some optimization that Im missing (Ive tried indexes on status/id and
id/status as well as the VACUUM and CLUSTER), or will I have to do
something more than the naïve approach? (Were actually sampling 2700
points 1000 times, so I can use replicates_executed counters in each row
and have a separate results table if I have to; itd just be nice to
keep things simple if possible).
Thanks in advance,
Daniel
dmyers@pomona.edu
P.S.: Ive included the code to create the analyses table below.
create table analyses (
id serial,
kind char(5) not null check (kind in ('FAST', 'SLOW')),
host varchar(255) references hosts(hostname),
dispatched timestamp,
received timestamp,
status char(5) not null check (status in ('WAIT', 'OUT', 'DONE'))
default 'WAIT',
/* Plus a bunch of parameters for the simulation... */
);