Re: getting most recent row efficiently - Mailing list pgsql-general

From SHELTON,MICHAEL (Non-HP-Boise,ex1)
Subject Re: getting most recent row efficiently
Date
Msg-id FB60DFB2C0E24449AC0C21F743B935410150F840@xboi02.boi.hp.com
Whole thread Raw
In response to getting most recent row efficiently  (Fran Fabrizio <ffabrizio@mmrd.com>)
List pgsql-general
Here are a few suggestions off the top of my head:

1) It sounds like performance is the issue here.  My first question would be
why you store a "list" of favorites for each doc.  If you "only" need the
latest, just do an update instead of an insert (unless no doc entry...),
then your table is short and to the point.  If you need the other data, then
obviously this isn't the solution.

2) If you need all favorites in order, what about a 2nd table (space vs
performance tradeoff) that stores only the most recent favorite.  Longer
store time when inserting a favorite, quick lookup.

3) In the case of trying to store the order without a timestamp you can't do
it simply by inserting rows (there is no guarantee for row order, it will
insert where it wants -- logical, but not always in order).  However, you
could have a table ID (not the doc id) used only to indicate row #.  It
would be unique, auto-incr, indexed, etc.  This way you could sort by doc,
and take the top one by table ID and get your favorite.

I'm sure there are other ways to do this, but thought these might help.

Mike Shelton

-----Original Message-----
From: Fran Fabrizio [mailto:ffabrizio@mmrd.com]
Sent: Monday, December 17, 2001 2:30 PM
To: pgsql-general@postgresql.org
Subject: [GENERAL] getting most recent row efficiently



Hello,

I've got a table that keeps a log of a person's favorites over time
(what follows is a simplified example)

person_id    favorite_color
1                red
1                blue
2                green
3                yellow
3                purple

I want the set of most recent entries for each person.  So assuming the
order they appear above is the order they were inserted, I want:

person_id    favorite_color
1                blue
2                green
3                purple

Is there any way to retrieve "the most recent row for each distinct
person_id?" without making a timestamp column for when the row was
inserted and then grouping them by person_id?

What I include below is the real world example.  There's a table,
doctor_favorites, that functions like my sample table above.  There's a
view, current_doctor_favorites, which contains the most recent row for
each doctor.  This view has a high cost as you can see.  There's a
second view, doctor_metrics, which combines the data from the
current_doctor_favorites view with that of another table, and that is my
ultimate goal.  It's taking a loooong time to run, I think because of
the sequential scan when creating the current_doctor_favorites view from
the doctor_favorites table.  I'm wondering if there's a way to perhaps
use an OID or something to determine the most recent rows from
doctor_favorites rather than the timestamp and the subselect.  I tried
to create an index on the local_ts timestamp field of the underlying
doctor_favorites table, but it's hard to index something that's
basically unique each time you insert a row anyhow.

monitor-prod=# \d doctor_favorites
                      Table "doctor_favorites"
    Attribute    |           Type           |        Modifier
-----------------+--------------------------+------------------------
 favorites       | integer                  |
 remote_ts       | timestamp with time zone | not null
 local_ts        | timestamp with time zone | not null default now()
 med_practice_id | integer                  |
 doctor_id       | integer                  |
Indices: docid_index,
         docid_medpracid_index,
         localts_index,
         medpracid_index

monitor-test=# \d current_doctor_favorites
   View "current_doctor_favorites"
    Attribute    |  Type   | Modifier
-----------------+---------+----------
 doctor_id       | integer |
 med_practice_id | integer |
 favorites       | integer |
View definition: SELECT df.doctor_id, df.med_practice_id, df.favorites
FROM doctor_favorites df WHERE (df.local_ts = (SELECT
max(doctor_favorites.local_ts) AS max FROM doctor_favorites WHERE
((doctor_favorites.doctor_id = df.doctor_id) AND
(doctor_favorites.med_practice_id = df.med_practice_id))));


monitor-test=# explain  SELECT df.doctor_id, df.med_practice_id,
df.favorites FROM doctor_favorites df WHERE (df.local_ts = (SELECT
max(doctor_favorites.local_ts) AS max FROM doctor_favorites WHERE
((doctor_favorites.doctor_id = df.doctor_id) AND
(doctor_favorites.med_practice_id = df.med_practice_id))));
NOTICE:  QUERY PLAN:

Seq Scan on doctor_favorites df  (cost=0.00..9564.75 rows=47 width=12)
  SubPlan
    ->  Aggregate  (cost=2.02..2.02 rows=1 width=8)
          ->  Index Scan using docid_medpracid_index on
doctor_favorites  (cost=0.00..2.02 rows=1 width=8)

EXPLAIN
monitor-test=#

monitor-prod=# \d doctor_metrics
        View "doctor_metrics"
    Attribute    |  Type   | Modifier
-----------------+---------+----------
 doctor_id       | integer |
 med_practice_id | integer |
 lab_requests    | integer |
 lab_results     | integer |
 rx_auth         | integer |
 transcriptions  | integer |
 omnidocs        | integer |
 phone           | integer |
 favorites       | integer |
View definition: SELECT CASE WHEN (doctor_tasks.doctor_id NOTNULL) THEN
doctor_tasks.doctor_id ELSE current_doctor_favorites.doctor_id END AS
doctor_id, CASE WHEN (doctor_tasks.med_practice_id NOTNULL) THEN
doctor_tasks.med_practice_id ELSE
current_doctor_favorites.med_practice_id END AS med_practice_id,
doctor_tasks.lab_requests, doctor_tasks.lab_results,
doctor_tasks.rx_auth, doctor_tasks.transcriptions,
doctor_tasks.omnidocs, doctor_tasks.phone,
current_doctor_favorites.favorites FROM (doctor_tasks FULL JOIN
current_doctor_favorites USING (doctor_id, med_practice_id));

monitor-prod=# explain SELECT CASE WHEN (doctor_tasks.doctor_id NOTNULL)
THEN do
ctor_tasks.doctor_id ELSE current_doctor_favorites.doctor_id END AS
doctor_id, CASE WHEN (doctor_tasks.med_practice_id NOTNULL) THEN
doctor_tasks.med_practice_id ELSE
current_doctor_favorites.med_practice_id END AS med_practice_id,
doctor_tasks.lab_requests, doctor_tasks.lab_results,
doctor_tasks.rx_auth, doctor_tasks.transcriptions,
doctor_tasks.omnidocs, doctor_tasks.phone,
current_doctor_favorites.favorites FROM (doctor_tasks FULL JOIN
current_doctor_favorites USING (doctor_id, med_practice_id));
NOTICE:  QUERY PLAN:

Merge Join  (cost=9566.05..9640.75 rows=1000 width=44)
  ->  Index Scan using doctor_tasks_pkey on doctor_tasks
(cost=0.00..59.00 rows=1000 width=32)
  ->  Sort  (cost=9566.05..9566.05 rows=47 width=12)
        ->  Subquery Scan current_doctor_favorites  (cost=0.00..9564.75
rows=47 width=12)
              ->  Seq Scan on doctor_favorites df  (cost=0.00..9564.75
rows=47 width=12)
                    SubPlan
                      ->  Aggregate  (cost=2.02..2.02 rows=1 width=8)
                            ->  Index Scan using docid_medpracid_index
on doctor_favorites  (cost=0.00..2.02 rows=1 width=8)

EXPLAIN
monitor-prod=#

Thanks for your feedback,
Fran


---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html

pgsql-general by date:

Previous
From: "Command Prompt, Inc."
Date:
Subject: Practical PostgreSQL moves East!
Next
From: Stephan Szabo
Date:
Subject: Re: getting most recent row efficiently