Thread: getting most recent row efficiently
Hello, I've got a table that keeps a log of a person's favorites over time (what follows is a simplified example) person_id favorite_color 1 red 1 blue 2 green 3 yellow 3 purple I want the set of most recent entries for each person. So assuming the order they appear above is the order they were inserted, I want: person_id favorite_color 1 blue 2 green 3 purple Is there any way to retrieve "the most recent row for each distinct person_id?" without making a timestamp column for when the row was inserted and then grouping them by person_id? What I include below is the real world example. There's a table, doctor_favorites, that functions like my sample table above. There's a view, current_doctor_favorites, which contains the most recent row for each doctor. This view has a high cost as you can see. There's a second view, doctor_metrics, which combines the data from the current_doctor_favorites view with that of another table, and that is my ultimate goal. It's taking a loooong time to run, I think because of the sequential scan when creating the current_doctor_favorites view from the doctor_favorites table. I'm wondering if there's a way to perhaps use an OID or something to determine the most recent rows from doctor_favorites rather than the timestamp and the subselect. I tried to create an index on the local_ts timestamp field of the underlying doctor_favorites table, but it's hard to index something that's basically unique each time you insert a row anyhow. monitor-prod=# \d doctor_favorites Table "doctor_favorites" Attribute | Type | Modifier -----------------+--------------------------+------------------------ favorites | integer | remote_ts | timestamp with time zone | not null local_ts | timestamp with time zone | not null default now() med_practice_id | integer | doctor_id | integer | Indices: docid_index, docid_medpracid_index, localts_index, medpracid_index monitor-test=# \d current_doctor_favorites View "current_doctor_favorites" Attribute | Type | Modifier -----------------+---------+---------- doctor_id | integer | med_practice_id | integer | favorites | integer | View definition: SELECT df.doctor_id, df.med_practice_id, df.favorites FROM doctor_favorites df WHERE (df.local_ts = (SELECT max(doctor_favorites.local_ts) AS max FROM doctor_favorites WHERE ((doctor_favorites.doctor_id = df.doctor_id) AND (doctor_favorites.med_practice_id = df.med_practice_id)))); monitor-test=# explain SELECT df.doctor_id, df.med_practice_id, df.favorites FROM doctor_favorites df WHERE (df.local_ts = (SELECT max(doctor_favorites.local_ts) AS max FROM doctor_favorites WHERE ((doctor_favorites.doctor_id = df.doctor_id) AND (doctor_favorites.med_practice_id = df.med_practice_id)))); NOTICE: QUERY PLAN: Seq Scan on doctor_favorites df (cost=0.00..9564.75 rows=47 width=12) SubPlan -> Aggregate (cost=2.02..2.02 rows=1 width=8) -> Index Scan using docid_medpracid_index on doctor_favorites (cost=0.00..2.02 rows=1 width=8) EXPLAIN monitor-test=# monitor-prod=# \d doctor_metrics View "doctor_metrics" Attribute | Type | Modifier -----------------+---------+---------- doctor_id | integer | med_practice_id | integer | lab_requests | integer | lab_results | integer | rx_auth | integer | transcriptions | integer | omnidocs | integer | phone | integer | favorites | integer | View definition: SELECT CASE WHEN (doctor_tasks.doctor_id NOTNULL) THEN doctor_tasks.doctor_id ELSE current_doctor_favorites.doctor_id END AS doctor_id, CASE WHEN (doctor_tasks.med_practice_id NOTNULL) THEN doctor_tasks.med_practice_id ELSE current_doctor_favorites.med_practice_id END AS med_practice_id, doctor_tasks.lab_requests, doctor_tasks.lab_results, doctor_tasks.rx_auth, doctor_tasks.transcriptions, doctor_tasks.omnidocs, doctor_tasks.phone, current_doctor_favorites.favorites FROM (doctor_tasks FULL JOIN current_doctor_favorites USING (doctor_id, med_practice_id)); monitor-prod=# explain SELECT CASE WHEN (doctor_tasks.doctor_id NOTNULL) THEN do ctor_tasks.doctor_id ELSE current_doctor_favorites.doctor_id END AS doctor_id, CASE WHEN (doctor_tasks.med_practice_id NOTNULL) THEN doctor_tasks.med_practice_id ELSE current_doctor_favorites.med_practice_id END AS med_practice_id, doctor_tasks.lab_requests, doctor_tasks.lab_results, doctor_tasks.rx_auth, doctor_tasks.transcriptions, doctor_tasks.omnidocs, doctor_tasks.phone, current_doctor_favorites.favorites FROM (doctor_tasks FULL JOIN current_doctor_favorites USING (doctor_id, med_practice_id)); NOTICE: QUERY PLAN: Merge Join (cost=9566.05..9640.75 rows=1000 width=44) -> Index Scan using doctor_tasks_pkey on doctor_tasks (cost=0.00..59.00 rows=1000 width=32) -> Sort (cost=9566.05..9566.05 rows=47 width=12) -> Subquery Scan current_doctor_favorites (cost=0.00..9564.75 rows=47 width=12) -> Seq Scan on doctor_favorites df (cost=0.00..9564.75 rows=47 width=12) SubPlan -> Aggregate (cost=2.02..2.02 rows=1 width=8) -> Index Scan using docid_medpracid_index on doctor_favorites (cost=0.00..2.02 rows=1 width=8) EXPLAIN monitor-prod=# Thanks for your feedback, Fran
Hello, I have just found out that our book, Pratical PostgreSQL (http://www.postgresql.info) is going to be translated into Japanese! Sincerely, Joshua Draek
Here are a few suggestions off the top of my head: 1) It sounds like performance is the issue here. My first question would be why you store a "list" of favorites for each doc. If you "only" need the latest, just do an update instead of an insert (unless no doc entry...), then your table is short and to the point. If you need the other data, then obviously this isn't the solution. 2) If you need all favorites in order, what about a 2nd table (space vs performance tradeoff) that stores only the most recent favorite. Longer store time when inserting a favorite, quick lookup. 3) In the case of trying to store the order without a timestamp you can't do it simply by inserting rows (there is no guarantee for row order, it will insert where it wants -- logical, but not always in order). However, you could have a table ID (not the doc id) used only to indicate row #. It would be unique, auto-incr, indexed, etc. This way you could sort by doc, and take the top one by table ID and get your favorite. I'm sure there are other ways to do this, but thought these might help. Mike Shelton -----Original Message----- From: Fran Fabrizio [mailto:ffabrizio@mmrd.com] Sent: Monday, December 17, 2001 2:30 PM To: pgsql-general@postgresql.org Subject: [GENERAL] getting most recent row efficiently Hello, I've got a table that keeps a log of a person's favorites over time (what follows is a simplified example) person_id favorite_color 1 red 1 blue 2 green 3 yellow 3 purple I want the set of most recent entries for each person. So assuming the order they appear above is the order they were inserted, I want: person_id favorite_color 1 blue 2 green 3 purple Is there any way to retrieve "the most recent row for each distinct person_id?" without making a timestamp column for when the row was inserted and then grouping them by person_id? What I include below is the real world example. There's a table, doctor_favorites, that functions like my sample table above. There's a view, current_doctor_favorites, which contains the most recent row for each doctor. This view has a high cost as you can see. There's a second view, doctor_metrics, which combines the data from the current_doctor_favorites view with that of another table, and that is my ultimate goal. It's taking a loooong time to run, I think because of the sequential scan when creating the current_doctor_favorites view from the doctor_favorites table. I'm wondering if there's a way to perhaps use an OID or something to determine the most recent rows from doctor_favorites rather than the timestamp and the subselect. I tried to create an index on the local_ts timestamp field of the underlying doctor_favorites table, but it's hard to index something that's basically unique each time you insert a row anyhow. monitor-prod=# \d doctor_favorites Table "doctor_favorites" Attribute | Type | Modifier -----------------+--------------------------+------------------------ favorites | integer | remote_ts | timestamp with time zone | not null local_ts | timestamp with time zone | not null default now() med_practice_id | integer | doctor_id | integer | Indices: docid_index, docid_medpracid_index, localts_index, medpracid_index monitor-test=# \d current_doctor_favorites View "current_doctor_favorites" Attribute | Type | Modifier -----------------+---------+---------- doctor_id | integer | med_practice_id | integer | favorites | integer | View definition: SELECT df.doctor_id, df.med_practice_id, df.favorites FROM doctor_favorites df WHERE (df.local_ts = (SELECT max(doctor_favorites.local_ts) AS max FROM doctor_favorites WHERE ((doctor_favorites.doctor_id = df.doctor_id) AND (doctor_favorites.med_practice_id = df.med_practice_id)))); monitor-test=# explain SELECT df.doctor_id, df.med_practice_id, df.favorites FROM doctor_favorites df WHERE (df.local_ts = (SELECT max(doctor_favorites.local_ts) AS max FROM doctor_favorites WHERE ((doctor_favorites.doctor_id = df.doctor_id) AND (doctor_favorites.med_practice_id = df.med_practice_id)))); NOTICE: QUERY PLAN: Seq Scan on doctor_favorites df (cost=0.00..9564.75 rows=47 width=12) SubPlan -> Aggregate (cost=2.02..2.02 rows=1 width=8) -> Index Scan using docid_medpracid_index on doctor_favorites (cost=0.00..2.02 rows=1 width=8) EXPLAIN monitor-test=# monitor-prod=# \d doctor_metrics View "doctor_metrics" Attribute | Type | Modifier -----------------+---------+---------- doctor_id | integer | med_practice_id | integer | lab_requests | integer | lab_results | integer | rx_auth | integer | transcriptions | integer | omnidocs | integer | phone | integer | favorites | integer | View definition: SELECT CASE WHEN (doctor_tasks.doctor_id NOTNULL) THEN doctor_tasks.doctor_id ELSE current_doctor_favorites.doctor_id END AS doctor_id, CASE WHEN (doctor_tasks.med_practice_id NOTNULL) THEN doctor_tasks.med_practice_id ELSE current_doctor_favorites.med_practice_id END AS med_practice_id, doctor_tasks.lab_requests, doctor_tasks.lab_results, doctor_tasks.rx_auth, doctor_tasks.transcriptions, doctor_tasks.omnidocs, doctor_tasks.phone, current_doctor_favorites.favorites FROM (doctor_tasks FULL JOIN current_doctor_favorites USING (doctor_id, med_practice_id)); monitor-prod=# explain SELECT CASE WHEN (doctor_tasks.doctor_id NOTNULL) THEN do ctor_tasks.doctor_id ELSE current_doctor_favorites.doctor_id END AS doctor_id, CASE WHEN (doctor_tasks.med_practice_id NOTNULL) THEN doctor_tasks.med_practice_id ELSE current_doctor_favorites.med_practice_id END AS med_practice_id, doctor_tasks.lab_requests, doctor_tasks.lab_results, doctor_tasks.rx_auth, doctor_tasks.transcriptions, doctor_tasks.omnidocs, doctor_tasks.phone, current_doctor_favorites.favorites FROM (doctor_tasks FULL JOIN current_doctor_favorites USING (doctor_id, med_practice_id)); NOTICE: QUERY PLAN: Merge Join (cost=9566.05..9640.75 rows=1000 width=44) -> Index Scan using doctor_tasks_pkey on doctor_tasks (cost=0.00..59.00 rows=1000 width=32) -> Sort (cost=9566.05..9566.05 rows=47 width=12) -> Subquery Scan current_doctor_favorites (cost=0.00..9564.75 rows=47 width=12) -> Seq Scan on doctor_favorites df (cost=0.00..9564.75 rows=47 width=12) SubPlan -> Aggregate (cost=2.02..2.02 rows=1 width=8) -> Index Scan using docid_medpracid_index on doctor_favorites (cost=0.00..2.02 rows=1 width=8) EXPLAIN monitor-prod=# Thanks for your feedback, Fran ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
On Mon, 17 Dec 2001, Fran Fabrizio wrote: > monitor-prod=# \d doctor_favorites > Table "doctor_favorites" > Attribute | Type | Modifier > -----------------+--------------------------+------------------------ > favorites | integer | > remote_ts | timestamp with time zone | not null > local_ts | timestamp with time zone | not null default now() > med_practice_id | integer | > doctor_id | integer | > Indices: docid_index, > docid_medpracid_index, > localts_index, > medpracid_index > > monitor-test=# \d current_doctor_favorites > View "current_doctor_favorites" > Attribute | Type | Modifier > -----------------+---------+---------- > doctor_id | integer | > med_practice_id | integer | > favorites | integer | > View definition: SELECT df.doctor_id, df.med_practice_id, df.favorites > FROM doctor_favorites df WHERE (df.local_ts = (SELECT > max(doctor_favorites.local_ts) AS max FROM doctor_favorites WHERE > ((doctor_favorites.doctor_id = df.doctor_id) AND > (doctor_favorites.med_practice_id = df.med_practice_id)))); Hmm, would something like (untested with this schema) select df.doctor_id, df.med_practice_id, df.favorites from doctor_favorites df, (select doctor_id, med_practice_id, max(local_ts) as local_ts from doctor_favorites group by doctor_id, med_practice_id) df2 where df.doctor_id=df2.doctor_id and df.med_practice_id=df2.med_practice_id and df.local_ts=df2.loca_ts; potentially be faster? I'd guess that'd avoid a lot of potential evaluations.
There are some cases where using views is not always the most practical solution. If you will only query for a given doctor's current favorites, and you do not care about other doctors' (listing all) or other (previous) favorites, for that matter, you can use: SELECT * FROM doctor_favorites WHERE doctor_id=XXX ORDER BY doctor_id, local_ts DESC LIMIT 1; The index you want is ON doctor_favorites (doctor_id, local_ts), I bet. Simpler is not always better, though. Antonio > >monitor-prod=# \d doctor_favorites > Table "doctor_favorites" > Attribute | Type | Modifier >-----------------+--------------------------+------------------------ > favorites | integer | > remote_ts | timestamp with time zone | not null > local_ts | timestamp with time zone | not null default now() > med_practice_id | integer | > doctor_id | integer | >Indices: docid_index, > docid_medpracid_index, > localts_index, > medpracid_index > >monitor-test=# \d current_doctor_favorites > View "current_doctor_favorites" > Attribute | Type | Modifier >-----------------+---------+---------- > doctor_id | integer | > med_practice_id | integer | > favorites | integer | >View definition: SELECT df.doctor_id, df.med_practice_id, df.favorites >FROM doctor_favorites df WHERE (df.local_ts = (SELECT >max(doctor_favorites.local_ts) AS max FROM doctor_favorites WHERE >((doctor_favorites.doctor_id = df.doctor_id) AND >(doctor_favorites.med_practice_id = df.med_practice_id)))); > > >monitor-test=# explain SELECT df.doctor_id, df.med_practice_id, >df.favorites FROM doctor_favorites df WHERE (df.local_ts = (SELECT >max(doctor_favorites.local_ts) AS max FROM doctor_favorites WHERE >((doctor_favorites.doctor_id = df.doctor_id) AND >(doctor_favorites.med_practice_id = df.med_practice_id)))); >NOTICE: QUERY PLAN: > >Seq Scan on doctor_favorites df (cost=0.00..9564.75 rows=47 width=12) > SubPlan > -> Aggregate (cost=2.02..2.02 rows=1 width=8) > -> Index Scan using docid_medpracid_index on >doctor_favorites (cost=0.00..2.02 rows=1 width=8) > >EXPLAIN >monitor-test=# > >monitor-prod=# \d doctor_metrics > View "doctor_metrics" > Attribute | Type | Modifier >-----------------+---------+---------- > doctor_id | integer | > med_practice_id | integer | > lab_requests | integer | > lab_results | integer | > rx_auth | integer | > transcriptions | integer | > omnidocs | integer | > phone | integer | > favorites | integer | >View definition: SELECT CASE WHEN (doctor_tasks.doctor_id NOTNULL) THEN >doctor_tasks.doctor_id ELSE current_doctor_favorites.doctor_id END AS >doctor_id, CASE WHEN (doctor_tasks.med_practice_id NOTNULL) THEN >doctor_tasks.med_practice_id ELSE >current_doctor_favorites.med_practice_id END AS med_practice_id, >doctor_tasks.lab_requests, doctor_tasks.lab_results, >doctor_tasks.rx_auth, doctor_tasks.transcriptions, >doctor_tasks.omnidocs, doctor_tasks.phone, >current_doctor_favorites.favorites FROM (doctor_tasks FULL JOIN >current_doctor_favorites USING (doctor_id, med_practice_id)); > >monitor-prod=# explain SELECT CASE WHEN (doctor_tasks.doctor_id NOTNULL) >THEN do >ctor_tasks.doctor_id ELSE current_doctor_favorites.doctor_id END AS >doctor_id, CASE WHEN (doctor_tasks.med_practice_id NOTNULL) THEN >doctor_tasks.med_practice_id ELSE >current_doctor_favorites.med_practice_id END AS med_practice_id, >doctor_tasks.lab_requests, doctor_tasks.lab_results, >doctor_tasks.rx_auth, doctor_tasks.transcriptions, >doctor_tasks.omnidocs, doctor_tasks.phone, >current_doctor_favorites.favorites FROM (doctor_tasks FULL JOIN >current_doctor_favorites USING (doctor_id, med_practice_id)); >NOTICE: QUERY PLAN: > >Merge Join (cost=9566.05..9640.75 rows=1000 width=44) > -> Index Scan using doctor_tasks_pkey on doctor_tasks >(cost=0.00..59.00 rows=1000 width=32) > -> Sort (cost=9566.05..9566.05 rows=47 width=12) > -> Subquery Scan current_doctor_favorites (cost=0.00..9564.75 >rows=47 width=12) > -> Seq Scan on doctor_favorites df (cost=0.00..9564.75 >rows=47 width=12) > SubPlan > -> Aggregate (cost=2.02..2.02 rows=1 width=8) > -> Index Scan using docid_medpracid_index >on doctor_favorites (cost=0.00..2.02 rows=1 width=8) > >EXPLAIN >monitor-prod=# > >Thanks for your feedback, >Fran > > >---------------------------(end of broadcast)--------------------------- >TIP 5: Have you checked our extensive FAQ? > >http://www.postgresql.org/users-lounge/docs/faq.html > >---------------------------(end of broadcast)--------------------------- >TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org > >. >
On Mon, 17 Dec 2001 16:30:18 -0500 Fran Fabrizio <ffabrizio@mmrd.com> wrote: > I've got a table that keeps a log of a person's favorites over time > (what follows is a simplified example) > > person_id favorite_color > 1 red > 1 blue > 2 green > 3 yellow > 3 purple > > I want the set of most recent entries for each person. So assuming the > order they appear above is the order they were inserted, I want: > > person_id favorite_color > 1 blue > 2 green > 3 purple > > Is there any way to retrieve "the most recent row for each distinct > person_id?" without making a timestamp column for when the row was > inserted and then grouping them by person_id? > Yes, there is an interesting, but often shaky, way of using sequences. You, however, need to be careful of disturbances from other sessions; otherwise you may get an unexpected result because the sequences have strong possibility of being incremented or decremented by others. If your platform is a 7.2beta, using a "CREATE TEMP SEQUENCE" clause seems to assure an increment of 1 per row. Then again, in case of your real example with a timestamp, I would think the reliable way is to use its timestamp in the subselect like Stephan's reply. create sequence seq_doctor1; create sequence seq_doctor2; select setval('seq_doctor1',1), setval('seq_doctor2',1); select t1.person_id, t1.favorite_color from (select person_id, favorite_color, nextval('seq_doctor1')-1 as n from doctor order by person_id ) as t1 inner join (select person_id, max(nextval('seq_doctor2')-1) as rank from doctor group by person_id ) as t2 on (t1.n = t2.rank) ; person_id | favorite_color -----------+---------------- 1 | blue 2 | green 3 | purple (3 rows) Regards, Masaru Sugawara
On Sun, 23 Dec 2001 01:26:07 +0900 I wrote <rk73@echna.ne.jp>: > > Is there any way to retrieve "the most recent row for each distinct > > person_id?" without making a timestamp column for when the row was > > inserted and then grouping them by person_id? snip > create sequence seq_doctor1; > create sequence seq_doctor2; > > select setval('seq_doctor1',1), setval('seq_doctor2',1); > select t1.person_id, t1.favorite_color > from (select person_id, favorite_color, nextval('seq_doctor1')-1 as n > from doctor > order by person_id > ) as t1 inner join > (select person_id, max(nextval('seq_doctor2')-1) as rank > from doctor > group by person_id > ) as t2 on (t1.n = t2.rank) > ; There also seems to be another way: select t1.person_id, t1.favorite_color from (select person_id, favorite_color, oid as n from doctor ) as t1 inner join (select person_id, max(oid) as rank from doctor group by person_id ) as t2 on (t1.n = t2.rank) ; person_id | favorite_color -----------+---------------- 1 | blue 2 | green 3 | purple (3 rows) Regards, Masaru Sugawara