Decision support query inefficiencies ... - Mailing list pgsql-sql

From Gunther Schadow
Subject Decision support query inefficiencies ...
Date
Msg-id 3D459ED0.3030000@aurora.regenstrief.org
Whole thread Raw
Responses Re: Decision support query inefficiencies ...  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-sql
Hi,

this is a more general question than about specific PostgreSQL
implementation, but may be PostgreSQL has some potential specific
answer, and hopefully you can give me some general hints.

We have a large data base of, let's call them "Acts" (in our case
it is clinical observations, but could just as well be purchase orders
or account transactions, etc.)

CREATE TABLE Act (  id         OID       NOT NULL PRIMARY KEY,  subject_id OID       NOT NULL FOREIGN KEY REFERENCES
Subject(id), type_code  VARCHAR   NOT NULL FOREIGN KEY REFERENCES Act_type(code),  time       TIMESTAMP NOT NULL,  /*
detail... */
 
);

CREATE INDEX ON Act(subject_id, type_code, time);

Now let's have 200 million of these acts on 2 million subjects in our
test database. The production version will have about 100 times that
amount.

A very frequent decision support query we have to make is to get the
last act of a certain type that occurred before a certain cut-off
date. Worse, yet, the cutoff date is quite likely even a correlated
subquery, but that's for later. For now, I'm always struck by the
difficulty of asking the simple quesion of the most recent act of
a certain type before a cutoff time:

SELECT Act.*  FROM (SELECT subject_id, type_code, MAX(time) AS time          FROM Act          WHERE Act.type_code =
$ACT_TYPE           AND Act.time < $CUTOFF_TIME          GROUP BY subject_id, type_code) last   INNER JOIN       Act
ON(Act.subject_id,  Act.type_code,  Act.time)     =(last.subject_id, last.type_code, last.time);
 

I'm not so worried about the complexity of writing this query but
how inefficient it appears to be.

1. select the Acts of the specific type code
2. sort by subject_id, type_code and time
3. join the result again with the Act table, selecting the items   again via random access

Wouldn't it be nice to somehow retain the sort that happened for the
GROUP BY operation, and just read the last item of each group with
same group-by key?

Can PostgreSQL do something like that? Is my desire for some
optimization flawed? Am I overlooking an existing, well-known
optimization one can do with plain SQL?

I would greatly appreciate your advice or ideas
-Gunther


PS: The a materialized view or summary table would not help in this
case because we're asking for the last value before a certain
cutoff time.

-- 
Gunther Schadow, M.D., Ph.D.                    gschadow@regenstrief.org
Medical Information Scientist      Regenstrief Institute for Health Care
Adjunct Assistant Professor        Indiana University School of Medicine
tel:1(317)630-7960                         http://aurora.regenstrief.org




pgsql-sql by date:

Previous
From: Peter Atkins
Date:
Subject: Re: Returning PK of first insert for second insert use.
Next
From: Peter Atkins
Date:
Subject: Re: Returning PK of first insert for second insert use.