Thread: Query optimization question
Hi, Im working on a large-scale simulation study in biology, and Im using PostgreSQL as the hub of a distributed computing system. Essentially, I have a single table containing 2.7 million rows, each of which represents a work unit to be processed (think SETI@Home). Among other fields, each row in the table contains a unique ID and a status, which is one of WAIT (not yet processed) OUT (processing) or DONE (completed). To dispatch a unit to a client, Id like to pick a row with status = WAIT, mark it as OUT, and return the values. The problem I have is that finding a row takes a really long time (~22s on a 2-way PIII-700 running Linux 2.4.19). My SQL looks like: select min(id) from analyses where status=WAIT. I have indexes on the id field and the status field, but an explain shows that theyre not being usedIm assuming that the optimizer has decided that theres no point in using the index, since it doesnt sufficiently limit the scope of the search. Is there some optimization that Im missing (Ive tried indexes on status/id and id/status as well as the VACUUM and CLUSTER), or will I have to do something more than the naïve approach? (Were actually sampling 2700 points 1000 times, so I can use replicates_executed counters in each row and have a separate results table if I have to; itd just be nice to keep things simple if possible). Thanks in advance, Daniel dmyers@pomona.edu P.S.: Ive included the code to create the analyses table below. create table analyses ( id serial, kind char(5) not null check (kind in ('FAST', 'SLOW')), host varchar(255) references hosts(hostname), dispatched timestamp, received timestamp, status char(5) not null check (status in ('WAIT', 'OUT', 'DONE')) default 'WAIT', /* Plus a bunch of parameters for the simulation... */ );
Daniel S. Myers wrote: > Hi, > I’m working on a large-scale simulation study in biology, and > I’m using PostgreSQL as the hub of a distributed computing system. > Essentially, I have a single table containing 2.7 million rows, each of > which represents a work unit to be processed (think SETI@Home). Among > other fields, each row in the table contains a unique ID and a status, > which is one of WAIT (not yet processed) OUT (processing) or DONE > (completed). To dispatch a unit to a client, I’d like to pick a row with > status = WAIT, mark it as OUT, and return the values. The problem I have > is that finding a row takes a really long time (~22s on a 2-way PIII-700 > running Linux 2.4.19). My SQL looks like: select min(id) from analyses > where status=’WAIT’. I have indexes on the id field and the status > field, but an explain shows that they’re not being used—I’m assuming > that the optimizer has decided that there’s no point in using the index, > since it doesn’t sufficiently limit the scope of the search. Is there > some optimization that I’m missing (I’ve tried indexes on status/id and > id/status as well as the VACUUM and CLUSTER), or will I have to do > something more than the naïve approach? (We’re actually sampling 2700 > points 1000 times, so I can use replicates_executed counters in each row > and have a separate results table if I have to; it’d just be nice to > keep things simple if possible). > You might try creating a partial index (see the CREATE INDEX page in the documentation -- basically an index built on just part of a table qualified by a where clause) on id where status='WAIT'. I've never actually used one myself, but it seems like it might help in your situation. Also, instead of min(id), try: select id from analyses where status=’WAIT’ order by id limit 1; For reasons discussed before on the list (please search the archives), min() won't use your index, but ORDER BY ... LIMIT can. HTH, Joe
Hi, avoid min(col) or max(col), use "ORDER BY col DESC|ASC LIMIT 1" instead. It is faster on postgresql. I dunno why, but the archives shuold give the answer ;-) Hope this helps, -tb On Mon, Nov 18, 2002 at 05:49:28PM -0800, Daniel S. Myers wrote: > Hi, > Im working on a large-scale simulation study in biology, and > Im using PostgreSQL as the hub of a distributed computing system. > Essentially, I have a single table containing 2.7 million rows, each of > which represents a work unit to be processed (think SETI@Home). Among > other fields, each row in the table contains a unique ID and a status, > which is one of WAIT (not yet processed) OUT (processing) or DONE > (completed). To dispatch a unit to a client, Id like to pick a row with > status = WAIT, mark it as OUT, and return the values. The problem I have > is that finding a row takes a really long time (~22s on a 2-way PIII-700 > running Linux 2.4.19). My SQL looks like: select min(id) from analyses > where status=WAIT. I have indexes on the id field and the status > field, but an explain shows that theyre not being usedIm assuming > that the optimizer has decided that theres no point in using the index, > since it doesnt sufficiently limit the scope of the search. Is there > some optimization that Im missing (Ive tried indexes on status/id and > id/status as well as the VACUUM and CLUSTER), or will I have to do > something more than the naïve approach? (Were actually sampling 2700 > points 1000 times, so I can use replicates_executed counters in each row > and have a separate results table if I have to; itd just be nice to > keep things simple if possible). > > Thanks in advance, > Daniel > dmyers@pomona.edu > > P.S.: Ive included the code to create the analyses table below. > > create table analyses ( > id serial, > kind char(5) not null check (kind in ('FAST', 'SLOW')), > host varchar(255) references hosts(hostname), > dispatched timestamp, > received timestamp, > status char(5) not null check (status in ('WAIT', 'OUT', 'DONE')) > default 'WAIT', > > /* Plus a bunch of parameters for the simulation... */ > ); > > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org -- Thomas Beutin tb@laokoon.IN-Berlin.DE Beam me up, Scotty. There is no intelligent live down in Redmond.