Thread: getting most recent row efficiently

getting most recent row efficiently

From
Fran Fabrizio
Date:
Hello,

I've got a table that keeps a log of a person's favorites over time
(what follows is a simplified example)

person_id    favorite_color
1                red
1                blue
2                green
3                yellow
3                purple

I want the set of most recent entries for each person.  So assuming the
order they appear above is the order they were inserted, I want:

person_id    favorite_color
1                blue
2                green
3                purple

Is there any way to retrieve "the most recent row for each distinct
person_id?" without making a timestamp column for when the row was
inserted and then grouping them by person_id?

What I include below is the real world example.  There's a table,
doctor_favorites, that functions like my sample table above.  There's a
view, current_doctor_favorites, which contains the most recent row for
each doctor.  This view has a high cost as you can see.  There's a
second view, doctor_metrics, which combines the data from the
current_doctor_favorites view with that of another table, and that is my
ultimate goal.  It's taking a loooong time to run, I think because of
the sequential scan when creating the current_doctor_favorites view from
the doctor_favorites table.  I'm wondering if there's a way to perhaps
use an OID or something to determine the most recent rows from
doctor_favorites rather than the timestamp and the subselect.  I tried
to create an index on the local_ts timestamp field of the underlying
doctor_favorites table, but it's hard to index something that's
basically unique each time you insert a row anyhow.

monitor-prod=# \d doctor_favorites
                      Table "doctor_favorites"
    Attribute    |           Type           |        Modifier
-----------------+--------------------------+------------------------
 favorites       | integer                  |
 remote_ts       | timestamp with time zone | not null
 local_ts        | timestamp with time zone | not null default now()
 med_practice_id | integer                  |
 doctor_id       | integer                  |
Indices: docid_index,
         docid_medpracid_index,
         localts_index,
         medpracid_index

monitor-test=# \d current_doctor_favorites
   View "current_doctor_favorites"
    Attribute    |  Type   | Modifier
-----------------+---------+----------
 doctor_id       | integer |
 med_practice_id | integer |
 favorites       | integer |
View definition: SELECT df.doctor_id, df.med_practice_id, df.favorites
FROM doctor_favorites df WHERE (df.local_ts = (SELECT
max(doctor_favorites.local_ts) AS max FROM doctor_favorites WHERE
((doctor_favorites.doctor_id = df.doctor_id) AND
(doctor_favorites.med_practice_id = df.med_practice_id))));


monitor-test=# explain  SELECT df.doctor_id, df.med_practice_id,
df.favorites FROM doctor_favorites df WHERE (df.local_ts = (SELECT
max(doctor_favorites.local_ts) AS max FROM doctor_favorites WHERE
((doctor_favorites.doctor_id = df.doctor_id) AND
(doctor_favorites.med_practice_id = df.med_practice_id))));
NOTICE:  QUERY PLAN:

Seq Scan on doctor_favorites df  (cost=0.00..9564.75 rows=47 width=12)
  SubPlan
    ->  Aggregate  (cost=2.02..2.02 rows=1 width=8)
          ->  Index Scan using docid_medpracid_index on
doctor_favorites  (cost=0.00..2.02 rows=1 width=8)

EXPLAIN
monitor-test=#

monitor-prod=# \d doctor_metrics
        View "doctor_metrics"
    Attribute    |  Type   | Modifier
-----------------+---------+----------
 doctor_id       | integer |
 med_practice_id | integer |
 lab_requests    | integer |
 lab_results     | integer |
 rx_auth         | integer |
 transcriptions  | integer |
 omnidocs        | integer |
 phone           | integer |
 favorites       | integer |
View definition: SELECT CASE WHEN (doctor_tasks.doctor_id NOTNULL) THEN
doctor_tasks.doctor_id ELSE current_doctor_favorites.doctor_id END AS
doctor_id, CASE WHEN (doctor_tasks.med_practice_id NOTNULL) THEN
doctor_tasks.med_practice_id ELSE
current_doctor_favorites.med_practice_id END AS med_practice_id,
doctor_tasks.lab_requests, doctor_tasks.lab_results,
doctor_tasks.rx_auth, doctor_tasks.transcriptions,
doctor_tasks.omnidocs, doctor_tasks.phone,
current_doctor_favorites.favorites FROM (doctor_tasks FULL JOIN
current_doctor_favorites USING (doctor_id, med_practice_id));

monitor-prod=# explain SELECT CASE WHEN (doctor_tasks.doctor_id NOTNULL)
THEN do
ctor_tasks.doctor_id ELSE current_doctor_favorites.doctor_id END AS
doctor_id, CASE WHEN (doctor_tasks.med_practice_id NOTNULL) THEN
doctor_tasks.med_practice_id ELSE
current_doctor_favorites.med_practice_id END AS med_practice_id,
doctor_tasks.lab_requests, doctor_tasks.lab_results,
doctor_tasks.rx_auth, doctor_tasks.transcriptions,
doctor_tasks.omnidocs, doctor_tasks.phone,
current_doctor_favorites.favorites FROM (doctor_tasks FULL JOIN
current_doctor_favorites USING (doctor_id, med_practice_id));
NOTICE:  QUERY PLAN:

Merge Join  (cost=9566.05..9640.75 rows=1000 width=44)
  ->  Index Scan using doctor_tasks_pkey on doctor_tasks
(cost=0.00..59.00 rows=1000 width=32)
  ->  Sort  (cost=9566.05..9566.05 rows=47 width=12)
        ->  Subquery Scan current_doctor_favorites  (cost=0.00..9564.75
rows=47 width=12)
              ->  Seq Scan on doctor_favorites df  (cost=0.00..9564.75
rows=47 width=12)
                    SubPlan
                      ->  Aggregate  (cost=2.02..2.02 rows=1 width=8)
                            ->  Index Scan using docid_medpracid_index
on doctor_favorites  (cost=0.00..2.02 rows=1 width=8)

EXPLAIN
monitor-prod=#

Thanks for your feedback,
Fran


Practical PostgreSQL moves East!

From
"Command Prompt, Inc."
Date:
Hello,

I have just found out that our book, Pratical PostgreSQL
(http://www.postgresql.info) is going to be translated into Japanese!

Sincerely,

Joshua Draek



Re: getting most recent row efficiently

From
"SHELTON,MICHAEL (Non-HP-Boise,ex1)"
Date:
Here are a few suggestions off the top of my head:

1) It sounds like performance is the issue here.  My first question would be
why you store a "list" of favorites for each doc.  If you "only" need the
latest, just do an update instead of an insert (unless no doc entry...),
then your table is short and to the point.  If you need the other data, then
obviously this isn't the solution.

2) If you need all favorites in order, what about a 2nd table (space vs
performance tradeoff) that stores only the most recent favorite.  Longer
store time when inserting a favorite, quick lookup.

3) In the case of trying to store the order without a timestamp you can't do
it simply by inserting rows (there is no guarantee for row order, it will
insert where it wants -- logical, but not always in order).  However, you
could have a table ID (not the doc id) used only to indicate row #.  It
would be unique, auto-incr, indexed, etc.  This way you could sort by doc,
and take the top one by table ID and get your favorite.

I'm sure there are other ways to do this, but thought these might help.

Mike Shelton

-----Original Message-----
From: Fran Fabrizio [mailto:ffabrizio@mmrd.com]
Sent: Monday, December 17, 2001 2:30 PM
To: pgsql-general@postgresql.org
Subject: [GENERAL] getting most recent row efficiently



Hello,

I've got a table that keeps a log of a person's favorites over time
(what follows is a simplified example)

person_id    favorite_color
1                red
1                blue
2                green
3                yellow
3                purple

I want the set of most recent entries for each person.  So assuming the
order they appear above is the order they were inserted, I want:

person_id    favorite_color
1                blue
2                green
3                purple

Is there any way to retrieve "the most recent row for each distinct
person_id?" without making a timestamp column for when the row was
inserted and then grouping them by person_id?

What I include below is the real world example.  There's a table,
doctor_favorites, that functions like my sample table above.  There's a
view, current_doctor_favorites, which contains the most recent row for
each doctor.  This view has a high cost as you can see.  There's a
second view, doctor_metrics, which combines the data from the
current_doctor_favorites view with that of another table, and that is my
ultimate goal.  It's taking a loooong time to run, I think because of
the sequential scan when creating the current_doctor_favorites view from
the doctor_favorites table.  I'm wondering if there's a way to perhaps
use an OID or something to determine the most recent rows from
doctor_favorites rather than the timestamp and the subselect.  I tried
to create an index on the local_ts timestamp field of the underlying
doctor_favorites table, but it's hard to index something that's
basically unique each time you insert a row anyhow.

monitor-prod=# \d doctor_favorites
                      Table "doctor_favorites"
    Attribute    |           Type           |        Modifier
-----------------+--------------------------+------------------------
 favorites       | integer                  |
 remote_ts       | timestamp with time zone | not null
 local_ts        | timestamp with time zone | not null default now()
 med_practice_id | integer                  |
 doctor_id       | integer                  |
Indices: docid_index,
         docid_medpracid_index,
         localts_index,
         medpracid_index

monitor-test=# \d current_doctor_favorites
   View "current_doctor_favorites"
    Attribute    |  Type   | Modifier
-----------------+---------+----------
 doctor_id       | integer |
 med_practice_id | integer |
 favorites       | integer |
View definition: SELECT df.doctor_id, df.med_practice_id, df.favorites
FROM doctor_favorites df WHERE (df.local_ts = (SELECT
max(doctor_favorites.local_ts) AS max FROM doctor_favorites WHERE
((doctor_favorites.doctor_id = df.doctor_id) AND
(doctor_favorites.med_practice_id = df.med_practice_id))));


monitor-test=# explain  SELECT df.doctor_id, df.med_practice_id,
df.favorites FROM doctor_favorites df WHERE (df.local_ts = (SELECT
max(doctor_favorites.local_ts) AS max FROM doctor_favorites WHERE
((doctor_favorites.doctor_id = df.doctor_id) AND
(doctor_favorites.med_practice_id = df.med_practice_id))));
NOTICE:  QUERY PLAN:

Seq Scan on doctor_favorites df  (cost=0.00..9564.75 rows=47 width=12)
  SubPlan
    ->  Aggregate  (cost=2.02..2.02 rows=1 width=8)
          ->  Index Scan using docid_medpracid_index on
doctor_favorites  (cost=0.00..2.02 rows=1 width=8)

EXPLAIN
monitor-test=#

monitor-prod=# \d doctor_metrics
        View "doctor_metrics"
    Attribute    |  Type   | Modifier
-----------------+---------+----------
 doctor_id       | integer |
 med_practice_id | integer |
 lab_requests    | integer |
 lab_results     | integer |
 rx_auth         | integer |
 transcriptions  | integer |
 omnidocs        | integer |
 phone           | integer |
 favorites       | integer |
View definition: SELECT CASE WHEN (doctor_tasks.doctor_id NOTNULL) THEN
doctor_tasks.doctor_id ELSE current_doctor_favorites.doctor_id END AS
doctor_id, CASE WHEN (doctor_tasks.med_practice_id NOTNULL) THEN
doctor_tasks.med_practice_id ELSE
current_doctor_favorites.med_practice_id END AS med_practice_id,
doctor_tasks.lab_requests, doctor_tasks.lab_results,
doctor_tasks.rx_auth, doctor_tasks.transcriptions,
doctor_tasks.omnidocs, doctor_tasks.phone,
current_doctor_favorites.favorites FROM (doctor_tasks FULL JOIN
current_doctor_favorites USING (doctor_id, med_practice_id));

monitor-prod=# explain SELECT CASE WHEN (doctor_tasks.doctor_id NOTNULL)
THEN do
ctor_tasks.doctor_id ELSE current_doctor_favorites.doctor_id END AS
doctor_id, CASE WHEN (doctor_tasks.med_practice_id NOTNULL) THEN
doctor_tasks.med_practice_id ELSE
current_doctor_favorites.med_practice_id END AS med_practice_id,
doctor_tasks.lab_requests, doctor_tasks.lab_results,
doctor_tasks.rx_auth, doctor_tasks.transcriptions,
doctor_tasks.omnidocs, doctor_tasks.phone,
current_doctor_favorites.favorites FROM (doctor_tasks FULL JOIN
current_doctor_favorites USING (doctor_id, med_practice_id));
NOTICE:  QUERY PLAN:

Merge Join  (cost=9566.05..9640.75 rows=1000 width=44)
  ->  Index Scan using doctor_tasks_pkey on doctor_tasks
(cost=0.00..59.00 rows=1000 width=32)
  ->  Sort  (cost=9566.05..9566.05 rows=47 width=12)
        ->  Subquery Scan current_doctor_favorites  (cost=0.00..9564.75
rows=47 width=12)
              ->  Seq Scan on doctor_favorites df  (cost=0.00..9564.75
rows=47 width=12)
                    SubPlan
                      ->  Aggregate  (cost=2.02..2.02 rows=1 width=8)
                            ->  Index Scan using docid_medpracid_index
on doctor_favorites  (cost=0.00..2.02 rows=1 width=8)

EXPLAIN
monitor-prod=#

Thanks for your feedback,
Fran


---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html

Re: getting most recent row efficiently

From
Stephan Szabo
Date:
On Mon, 17 Dec 2001, Fran Fabrizio wrote:

> monitor-prod=# \d doctor_favorites
>                       Table "doctor_favorites"
>     Attribute    |           Type           |        Modifier
> -----------------+--------------------------+------------------------
>  favorites       | integer                  |
>  remote_ts       | timestamp with time zone | not null
>  local_ts        | timestamp with time zone | not null default now()
>  med_practice_id | integer                  |
>  doctor_id       | integer                  |
> Indices: docid_index,
>          docid_medpracid_index,
>          localts_index,
>          medpracid_index
>
> monitor-test=# \d current_doctor_favorites
>    View "current_doctor_favorites"
>     Attribute    |  Type   | Modifier
> -----------------+---------+----------
>  doctor_id       | integer |
>  med_practice_id | integer |
>  favorites       | integer |
> View definition: SELECT df.doctor_id, df.med_practice_id, df.favorites
> FROM doctor_favorites df WHERE (df.local_ts = (SELECT
> max(doctor_favorites.local_ts) AS max FROM doctor_favorites WHERE
> ((doctor_favorites.doctor_id = df.doctor_id) AND
> (doctor_favorites.med_practice_id = df.med_practice_id))));

Hmm, would something like (untested with this schema)
select df.doctor_id, df.med_practice_id, df.favorites
 from doctor_favorites df,
 (select doctor_id, med_practice_id, max(local_ts) as local_ts from
  doctor_favorites group by doctor_id, med_practice_id) df2
 where df.doctor_id=df2.doctor_id and
       df.med_practice_id=df2.med_practice_id and
       df.local_ts=df2.loca_ts;
potentially be faster?  I'd guess that'd avoid a lot of potential
evaluations.


Re: getting most recent row efficiently

From
Antonio Fiol Bonnín
Date:
There are some cases where using views is not always the most practical
solution.

If you will only query for a given doctor's current favorites, and you
do not care about other doctors' (listing all) or other (previous)
favorites, for that matter, you can use:

SELECT * FROM doctor_favorites WHERE doctor_id=XXX ORDER BY doctor_id,
local_ts DESC LIMIT 1;

The index you want is ON doctor_favorites (doctor_id, local_ts), I bet.

Simpler is not always better, though.

Antonio

>
>monitor-prod=# \d doctor_favorites
>                      Table "doctor_favorites"
>    Attribute    |           Type           |        Modifier
>-----------------+--------------------------+------------------------
> favorites       | integer                  |
> remote_ts       | timestamp with time zone | not null
> local_ts        | timestamp with time zone | not null default now()
> med_practice_id | integer                  |
> doctor_id       | integer                  |
>Indices: docid_index,
>         docid_medpracid_index,
>         localts_index,
>         medpracid_index
>
>monitor-test=# \d current_doctor_favorites
>   View "current_doctor_favorites"
>    Attribute    |  Type   | Modifier
>-----------------+---------+----------
> doctor_id       | integer |
> med_practice_id | integer |
> favorites       | integer |
>View definition: SELECT df.doctor_id, df.med_practice_id, df.favorites
>FROM doctor_favorites df WHERE (df.local_ts = (SELECT
>max(doctor_favorites.local_ts) AS max FROM doctor_favorites WHERE
>((doctor_favorites.doctor_id = df.doctor_id) AND
>(doctor_favorites.med_practice_id = df.med_practice_id))));
>
>
>monitor-test=# explain  SELECT df.doctor_id, df.med_practice_id,
>df.favorites FROM doctor_favorites df WHERE (df.local_ts = (SELECT
>max(doctor_favorites.local_ts) AS max FROM doctor_favorites WHERE
>((doctor_favorites.doctor_id = df.doctor_id) AND
>(doctor_favorites.med_practice_id = df.med_practice_id))));
>NOTICE:  QUERY PLAN:
>
>Seq Scan on doctor_favorites df  (cost=0.00..9564.75 rows=47 width=12)
>  SubPlan
>    ->  Aggregate  (cost=2.02..2.02 rows=1 width=8)
>          ->  Index Scan using docid_medpracid_index on
>doctor_favorites  (cost=0.00..2.02 rows=1 width=8)
>
>EXPLAIN
>monitor-test=#
>
>monitor-prod=# \d doctor_metrics
>        View "doctor_metrics"
>    Attribute    |  Type   | Modifier
>-----------------+---------+----------
> doctor_id       | integer |
> med_practice_id | integer |
> lab_requests    | integer |
> lab_results     | integer |
> rx_auth         | integer |
> transcriptions  | integer |
> omnidocs        | integer |
> phone           | integer |
> favorites       | integer |
>View definition: SELECT CASE WHEN (doctor_tasks.doctor_id NOTNULL) THEN
>doctor_tasks.doctor_id ELSE current_doctor_favorites.doctor_id END AS
>doctor_id, CASE WHEN (doctor_tasks.med_practice_id NOTNULL) THEN
>doctor_tasks.med_practice_id ELSE
>current_doctor_favorites.med_practice_id END AS med_practice_id,
>doctor_tasks.lab_requests, doctor_tasks.lab_results,
>doctor_tasks.rx_auth, doctor_tasks.transcriptions,
>doctor_tasks.omnidocs, doctor_tasks.phone,
>current_doctor_favorites.favorites FROM (doctor_tasks FULL JOIN
>current_doctor_favorites USING (doctor_id, med_practice_id));
>
>monitor-prod=# explain SELECT CASE WHEN (doctor_tasks.doctor_id NOTNULL)
>THEN do
>ctor_tasks.doctor_id ELSE current_doctor_favorites.doctor_id END AS
>doctor_id, CASE WHEN (doctor_tasks.med_practice_id NOTNULL) THEN
>doctor_tasks.med_practice_id ELSE
>current_doctor_favorites.med_practice_id END AS med_practice_id,
>doctor_tasks.lab_requests, doctor_tasks.lab_results,
>doctor_tasks.rx_auth, doctor_tasks.transcriptions,
>doctor_tasks.omnidocs, doctor_tasks.phone,
>current_doctor_favorites.favorites FROM (doctor_tasks FULL JOIN
>current_doctor_favorites USING (doctor_id, med_practice_id));
>NOTICE:  QUERY PLAN:
>
>Merge Join  (cost=9566.05..9640.75 rows=1000 width=44)
>  ->  Index Scan using doctor_tasks_pkey on doctor_tasks
>(cost=0.00..59.00 rows=1000 width=32)
>  ->  Sort  (cost=9566.05..9566.05 rows=47 width=12)
>        ->  Subquery Scan current_doctor_favorites  (cost=0.00..9564.75
>rows=47 width=12)
>              ->  Seq Scan on doctor_favorites df  (cost=0.00..9564.75
>rows=47 width=12)
>                    SubPlan
>                      ->  Aggregate  (cost=2.02..2.02 rows=1 width=8)
>                            ->  Index Scan using docid_medpracid_index
>on doctor_favorites  (cost=0.00..2.02 rows=1 width=8)
>
>EXPLAIN
>monitor-prod=#
>
>Thanks for your feedback,
>Fran
>
>
>---------------------------(end of broadcast)---------------------------
>TIP 5: Have you checked our extensive FAQ?
>
>http://www.postgresql.org/users-lounge/docs/faq.html
>
>---------------------------(end of broadcast)---------------------------
>TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
>
>.
>




Re: getting most recent row efficiently

From
Masaru Sugawara
Date:
On Mon, 17 Dec 2001 16:30:18 -0500
Fran Fabrizio <ffabrizio@mmrd.com> wrote:


> I've got a table that keeps a log of a person's favorites over time
> (what follows is a simplified example)
>
> person_id    favorite_color
> 1                red
> 1                blue
> 2                green
> 3                yellow
> 3                purple
>
> I want the set of most recent entries for each person.  So assuming the
> order they appear above is the order they were inserted, I want:
>
> person_id    favorite_color
> 1                blue
> 2                green
> 3                purple
>
> Is there any way to retrieve "the most recent row for each distinct
> person_id?" without making a timestamp column for when the row was
> inserted and then grouping them by person_id?
>

 Yes, there is an interesting, but often shaky, way of using sequences.
 You, however, need to be careful of disturbances from other sessions;
 otherwise you may get an unexpected result because the sequences have
 strong possibility of being incremented or decremented by others.
 If your platform is a 7.2beta, using a "CREATE TEMP SEQUENCE" clause
 seems to assure an increment of 1 per row.  Then again, in case of your
 real example with a timestamp, I would think the reliable way is to use
 its timestamp in the subselect like Stephan's reply.



create sequence seq_doctor1;
create sequence seq_doctor2;

select setval('seq_doctor1',1), setval('seq_doctor2',1);
select t1.person_id, t1.favorite_color
  from (select person_id, favorite_color, nextval('seq_doctor1')-1 as n
          from doctor
         order by person_id
       ) as t1 inner join
       (select person_id, max(nextval('seq_doctor2')-1) as rank
          from doctor
         group by person_id
       ) as t2 on (t1.n = t2.rank)
;

 person_id | favorite_color
-----------+----------------
         1 | blue
         2 | green
         3 | purple
(3 rows)


Regards,
Masaru Sugawara


Re: getting most recent row efficiently

From
Masaru Sugawara
Date:
On Sun, 23 Dec 2001 01:26:07 +0900
I wrote <rk73@echna.ne.jp>:


> > Is there any way to retrieve "the most recent row for each distinct
> > person_id?" without making a timestamp column for when the row was
> > inserted and then grouping them by person_id?

 snip

> create sequence seq_doctor1;
> create sequence seq_doctor2;
>
> select setval('seq_doctor1',1), setval('seq_doctor2',1);
> select t1.person_id, t1.favorite_color
>   from (select person_id, favorite_color, nextval('seq_doctor1')-1 as n
>           from doctor
>          order by person_id
>        ) as t1 inner join
>        (select person_id, max(nextval('seq_doctor2')-1) as rank
>           from doctor
>          group by person_id
>        ) as t2 on (t1.n = t2.rank)
> ;


  There also seems to be another way:

  select t1.person_id, t1.favorite_color
    from (select person_id, favorite_color, oid as n
              from doctor
            ) as t1 inner join
           (select person_id, max(oid) as rank
              from doctor
             group by person_id
           ) as t2 on (t1.n = t2.rank)
;

 person_id | favorite_color
-----------+----------------
         1 | blue
         2 | green
         3 | purple
(3 rows)

Regards,
Masaru Sugawara