Thread: Query optimization question

Query optimization question

From

"Daniel S. Myers"

Date:

18 November 2002, 20:49:43

Hi,
    Im working on a large-scale simulation study in biology, and
Im using PostgreSQL as the hub of a distributed computing system.
Essentially, I have a single table containing 2.7 million rows, each of
which represents a work unit to be processed (think SETI@Home). Among
other fields, each row in the table contains a unique ID and a status,
which is one of WAIT (not yet processed) OUT (processing) or DONE
(completed). To dispatch a unit to a client, Id like to pick a row with
status = WAIT, mark it as OUT, and return the values. The problem I have
is that finding a row takes a really long time (~22s on a 2-way PIII-700
running Linux 2.4.19). My SQL looks like: select min(id) from analyses
where status=WAIT. I have indexes on the id field and the status
field, but an explain shows that theyre not being usedIm assuming
that the optimizer has decided that theres no point in using the index,
since it doesnt sufficiently limit the scope of the search. Is there
some optimization that Im missing (Ive tried indexes on status/id and
id/status as well as the VACUUM and CLUSTER), or will I have to do
something more than the naïve approach? (Were actually sampling 2700
points 1000 times, so I can use replicates_executed counters in each row
and have a separate results table if I have to; itd just be nice to
keep things simple if possible).

Thanks in advance,
Daniel
dmyers@pomona.edu

P.S.: Ive included the code to create the analyses table below.

create table analyses (
    id serial,
    kind char(5) not null check (kind in ('FAST', 'SLOW')),
    host varchar(255) references hosts(hostname),
    dispatched timestamp,
    received timestamp,
    status char(5) not null check (status in ('WAIT', 'OUT', 'DONE'))
default 'WAIT',

    /* Plus a bunch of parameters for the simulation... */
);

Re: Query optimization question

From

Joe Conway

Date:

18 November 2002, 21:00:53

Daniel S. Myers wrote:
> Hi,
>     I’m working on a large-scale simulation study in biology, and
> I’m using PostgreSQL as the hub of a distributed computing system.
> Essentially, I have a single table containing 2.7 million rows, each of
> which represents a work unit to be processed (think SETI@Home). Among
> other fields, each row in the table contains a unique ID and a status,
> which is one of WAIT (not yet processed) OUT (processing) or DONE
> (completed). To dispatch a unit to a client, I’d like to pick a row with
> status = WAIT, mark it as OUT, and return the values. The problem I have
> is that finding a row takes a really long time (~22s on a 2-way PIII-700
> running Linux 2.4.19). My SQL looks like: select min(id) from analyses
> where status=’WAIT’. I have indexes on the id field and the status
> field, but an explain shows that they’re not being used—I’m assuming
> that the optimizer has decided that there’s no point in using the index,
> since it doesn’t sufficiently limit the scope of the search. Is there
> some optimization that I’m missing (I’ve tried indexes on status/id and
> id/status as well as the VACUUM and CLUSTER), or will I have to do
> something more than the naïve approach? (We’re actually sampling 2700
> points 1000 times, so I can use replicates_executed counters in each row
> and have a separate results table if I have to; it’d just be nice to
> keep things simple if possible).
>

You might try creating a partial index (see the CREATE INDEX page in the
documentation -- basically an index built on just part of a table qualified by
a where clause) on id where status='WAIT'. I've never actually used one
myself, but it seems like it might help in your situation. Also, instead of
min(id), try:

   select id from analyses where status=’WAIT’ order by id limit 1;

For reasons discussed before on the list (please search the archives), min()
won't use your index, but ORDER BY ... LIMIT can.

HTH,

Joe

Re: Query optimization question

From

Thomas Beutin

Date:

19 November 2002, 06:05:10

Hi,

avoid min(col) or max(col), use "ORDER BY col DESC|ASC LIMIT 1" instead.
It is faster on postgresql. I dunno why, but the archives shuold give
the answer ;-)

Hope this helps,
-tb


On Mon, Nov 18, 2002 at 05:49:28PM -0800, Daniel S. Myers wrote:
> Hi,
>     Im working on a large-scale simulation study in biology, and
> Im using PostgreSQL as the hub of a distributed computing system.
> Essentially, I have a single table containing 2.7 million rows, each of
> which represents a work unit to be processed (think SETI@Home). Among
> other fields, each row in the table contains a unique ID and a status,
> which is one of WAIT (not yet processed) OUT (processing) or DONE
> (completed). To dispatch a unit to a client, Id like to pick a row with
> status = WAIT, mark it as OUT, and return the values. The problem I have
> is that finding a row takes a really long time (~22s on a 2-way PIII-700
> running Linux 2.4.19). My SQL looks like: select min(id) from analyses
> where status=WAIT. I have indexes on the id field and the status
> field, but an explain shows that theyre not being usedIm assuming
> that the optimizer has decided that theres no point in using the index,
> since it doesnt sufficiently limit the scope of the search. Is there
> some optimization that Im missing (Ive tried indexes on status/id and
> id/status as well as the VACUUM and CLUSTER), or will I have to do
> something more than the naïve approach? (Were actually sampling 2700
> points 1000 times, so I can use replicates_executed counters in each row
> and have a separate results table if I have to; itd just be nice to
> keep things simple if possible).
>
> Thanks in advance,
> Daniel
> dmyers@pomona.edu
>
> P.S.: Ive included the code to create the analyses table below.
>
> create table analyses (
>     id serial,
>     kind char(5) not null check (kind in ('FAST', 'SLOW')),
>     host varchar(255) references hosts(hostname),
>     dispatched timestamp,
>     received timestamp,
>     status char(5) not null check (status in ('WAIT', 'OUT', 'DONE'))
> default 'WAIT',
>
>     /* Plus a bunch of parameters for the simulation... */
> );
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

--
Thomas Beutin                             tb@laokoon.IN-Berlin.DE
Beam me up, Scotty. There is no intelligent live down in Redmond.