Thread: speeding up big query lookup
This is a question on speeding up some types of queries.

I have a very big table that catalogs measurements of some objects over
time. Measurements can be of several (~10) types. It keeps the
observation date in a field, and indicates the type of measurement in
another field.

I often need to get the latest measurement of type A for object X.
The table is indexed by object_id.

The obvious way to get the latest measurement of type A would be to
join the table against

SELECT object_id, object_val_type_id, max(observation_date)
FROM object_val
GROUP BY object_id, object_val_type_id

But this can take a long time, and it needs to be done very often.

My next strategy was to build an auxiliary table that just keeps the
last measurement of each type, for each object. I defined triggers to
keep this table up to date whenever the object_val table was updated.
However, I don't trust this as much, and though it helps read
performance, I'm not sure it's the best option overall.

I found that Postgres can use indexes on pairs of columns, so I'm going
to play with indexing object_val by (object_id, object_val_type_id),
and I'm sure my original JOIN query will be much faster.

But I'm sure many people have this type of problem. Is there a smarter
way to deal with it? I'd love to be able to choose the row with the max
value of one of the fields in just one statement, but I think this
can't be done?

Thanks
Jaime
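For reference, the complete query described above - the table joined
back against that grouped subquery - would look roughly like the
following. This is only a sketch: the value column (here called val)
and the literal placeholders X and A are assumptions standing in for
the real column name and the object/type of interest.

SELECT ov.object_id, ov.object_val_type_id, ov.observation_date, ov.val
FROM object_val ov
JOIN (SELECT object_id, object_val_type_id,
             max(observation_date) AS max_date
      FROM object_val
      GROUP BY object_id, object_val_type_id) latest
  ON  latest.object_id = ov.object_id
  AND latest.object_val_type_id = ov.object_val_type_id
  AND latest.max_date = ov.observation_date
WHERE ov.object_id = X               -- object X
  AND ov.object_val_type_id = A;     -- measurement type A

With only the index on object_id, the grouped subquery has to scan and
aggregate the whole table, which is why this form gets slow on a big
table and benefits so much from the composite indexes discussed below.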
On fös, 2006-08-25 at 18:34 -0400, Silvela, Jaime (Exchange) wrote:

> This is a question on speeding up some types of queries.
>
> I have a very big table that catalogs measurements of some objects over
> time. Measurements can be of several (~10) types. It keeps the
> observation date in a field, and indicates the type of measurement in
> another field.
>
> I often need to get the latest measurement of type A for object X.
> The table is indexed by object_id.

One popular way is to create a composite index:

CREATE INDEX object_val_id_type_date
    ON object_val (object_id, object_val_type_id, observation_date);

Then you could:

SELECT * FROM object_val
WHERE object_id = ? AND object_val_type_id = ?
ORDER BY observation_date DESC
LIMIT 1;

Hope this helps,
gnari
"Silvela, Jaime \(Exchange\)" <JSilvela@Bear.com> writes: > I have a very big table that catalogs measurements of some objects over > time. Measurements can be of several (~10) types. It keeps the > observation date in a field, and indicates the type of measurement in > another field. > I often need to get the latest measurement of type A for object X. This is a pretty common requirement, and since plain SQL doesn't handle it very well, different DBMSes have invented different extensions to help. For instance you can use LIMIT: SELECT * from object_val WHERE object_id = X and object_val_type_id = Y ORDER BY object_id DESC, object_val_type_id DESC, observation_date DESC LIMIT 1; This will work very very fast if there is an index on (object_id, object_val_type_id, observation_date) for it to use. The only problem with it is that there's no obvious way to extend it to fetch latest measurements for several objects in one query. Another way, which AFAIK is Postgres-only, is to use DISTINCT ON: SELECT DISTINCT ON (object_id, object_val_type_id) * from object_val ORDER BY object_id DESC, object_val_type_id DESC, observation_date DESC This can give you all the latest measurements at once, or any subset you need (just add a WHERE clause). It's not amazingly fast but it generally beats the bog-standard-SQL alternatives, which as you mentioned require joining against subselects. regards, tom lane
Silvela, Jaime (Exchange) wrote:

> The obvious way to get the latest measurement of type A would be to
> join the table against
>
> SELECT object_id, object_val_type_id, max(observation_date)
> FROM object_val
> GROUP BY object_id, object_val_type_id

I'm not sure this is actually the result you want; doesn't this give
you all the unique (object_id, object_val_type_id)'s combined with the
max observation_date in the table (as in, not necessarily related to
the records listed)?

I'd think you want this:

SELECT object_id, object_val_type_id, observation_date
FROM object_val
GROUP BY object_id, object_val_type_id, observation_date
HAVING observation_date = max(observation_date)

which would return a single record with the highest observation_date.
Though not strictly necessary, I can imagine you'd want
observation_date to be unique, or you could get grouped observations
with the same date.

Although ordering and limiting is probably faster. I don't think the
planner is intelligent enough to know that this would only return the
record with the highest observation_date - it may be smart enough to
reject ("drop from the result set") found records after finding ones
with a higher observation_date (which would be "interesting" when using
cursors), or something along those lines. Hmm... Now I'm all curious;
an EXPLAIN would be interesting...

Sorry for the mostly useless post :P

Regards,

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.
T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //
No, you can make this work just fine if you JOIN right.
Your way is a more concise way of expressing it, though.

Tom's trick

SELECT DISTINCT ON (object_id, object_val_type_id) * FROM object_val
ORDER BY object_id DESC, object_val_type_id DESC, observation_date DESC

runs about twice as fast as the GROUP BY ... HAVING, but definitely not
as fast as keeping a separate table with only the latest observations,
updated by triggers. I'll be testing out the differences in overall
performance for my applications.

Thanks for the suggestions,
Jaime
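For anyone curious, a trigger-maintained "latest observations" table of
the kind described here might be set up roughly as follows. This is a
sketch only: the table name object_val_latest, the val column, and the
column types are all assumptions (the real schema was not posted), and
concurrent inserts would need extra locking or constraint handling to
be fully safe.

-- hypothetical auxiliary table: one row per (object, measurement type)
CREATE TABLE object_val_latest (
    object_id          integer NOT NULL,
    object_val_type_id integer NOT NULL,
    observation_date   date    NOT NULL,
    val                numeric,
    PRIMARY KEY (object_id, object_val_type_id)
);

CREATE OR REPLACE FUNCTION object_val_latest_upd() RETURNS trigger AS $$
BEGIN
    -- overwrite the stored row only if the new observation is at least
    -- as recent as the one already kept
    UPDATE object_val_latest
       SET observation_date = NEW.observation_date,
           val              = NEW.val
     WHERE object_id = NEW.object_id
       AND object_val_type_id = NEW.object_val_type_id
       AND observation_date <= NEW.observation_date;
    IF NOT FOUND THEN
        -- no row updated: either none exists yet, or the existing one
        -- is newer; insert only in the first case
        INSERT INTO object_val_latest
               (object_id, object_val_type_id, observation_date, val)
        SELECT NEW.object_id, NEW.object_val_type_id,
               NEW.observation_date, NEW.val
        WHERE NOT EXISTS (SELECT 1 FROM object_val_latest
                           WHERE object_id = NEW.object_id
                             AND object_val_type_id = NEW.object_val_type_id);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER object_val_latest_trig
    AFTER INSERT ON object_val
    FOR EACH ROW EXECUTE PROCEDURE object_val_latest_upd();

Every insert into object_val then pays for one extra indexed lookup and
write, which is the write-side overhead being weighed against the read
speedup in the rest of the thread.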
I just finished reading 'The Art of SQL' by Stephane Faroult, who has a
chapter (ch 6) discussing this very topic. I strongly recommend any
developer dealing with databases take a few days to read this
narrative.

A solution would seem to depend on whether you have many objects whose
measurements change only occasionally, or very few objects whose
measurements change very frequently. If you have a chance, go to your
local big-name bookstore that allows you to sit and read the book, jump
to ch 6 (page 156 or so), and get a good understanding of the various
techniques to take into consideration. A good suggestion (imo) is the
composite index already mentioned.

If I can give an example from the book, for the case where you have
many objects whose measurements change occasionally:

select whatever
from object_val as ov_outer
where ov_outer.object_id = someValueForObjectX
  and ov_outer.object_val_type_id = someValueForTypeA
  and ov_outer.observation_date =
      (select max(ov_inner.observation_date)
       from object_val as ov_inner
       where ov_inner.object_id = someValueForObjectX
         and ov_inner.object_val_type_id = someValueForTypeA
         and ov_inner.observation_date <= yourReferenceDate)

Hopefully I haven't completely misunderstood the author's intentions
and given you a silly query. Some may argue that ov_inner.object_id
could be correlated to ov_outer.object_id, but then the inner query
would need to be run multiple times. If used as presented, the inner
query fires only once.

I'd be rather interested in knowing if the above query (or similar)
works in practice, as things that sound good on paper occasionally
don't work in the real world - I'd like to know if in your situation
it provides a good execution time. I would take the step of creating
that composite index as suggested by Ragnar first.

Good luck,
am.

"Silvela, Jaime (Exchange)" wrote:
> This is a question on speeding up some types of queries.
>
> I have a very big table that catalogs measurements of some objects over
> time. Measurements can be of several (~10) types. It keeps the
> observation date in a field, and indicates the type of measurement in
> another field.
>
> I often need to get the latest measurement of type A for object X.
Just for fun, another approach, since I believe pg supports it:

select whateverFields
from object_val as ov_outer
where (ov_outer.object_id, ov_outer.object_val_type_id,
       ov_outer.observation_date) IN
      (select ov_inner.object_id, ov_inner.object_val_type_id,
              max(ov_inner.observation_date)
       from object_val as ov_inner
       where ov_inner.object_id = someValueForObjectX
         and ov_inner.object_val_type_id = someValueForTypeA
         and ov_inner.observation_date <= yourReferenceDate
       group by ov_inner.object_id, ov_inner.object_val_type_id)

The reason these subqueries should run quickly is that object_id,
object_val_type_id and observation_date make up a composite index, so
the subquery should execute extremely fast, eliminating the majority of
the data when you want to display or act on other fields from
object_val (the outer reference).

I suppose if you don't need any further information from object_val,
and you are happy with the speeds, Tom's method is smooth. Adding the
ORDER BY clause will take you out of the 'relational world' and thus
slow you down.

My fear with the triggers and the separate snapshot table is that the
delays are spread out and add questionable complexity, and potentially
unnecessary overhead, to the application. Something to consider
(although admittedly it is arguably a weak consideration in some
circumstances) is the extra space, indexes, and other factors such as
additional time for backup routines (and restoration) that the extra
table creates.

Best of luck,
am

"Silvela, Jaime (Exchange)" wrote:
> No, you can make this work just fine if you JOIN right.
> Your way is a more concise way of expressing it, though.
>
> Tom's trick
>
> SELECT DISTINCT ON (object_id, object_val_type_id) * FROM object_val
> ORDER BY object_id DESC, object_val_type_id DESC, observation_date DESC
>
> runs about twice as fast as the GROUP BY ... HAVING, but definitely not
> as fast as keeping a separate table with only the latest observations,
> updated by triggers. I'll be testing out the differences in overall
> performance for my applications.
>
> Thanks for the suggestions,
> Jaime
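A minor variation on the same idea, in case it is useful: dropping the
type filter from the subquery should give the latest observation of
every type for a single object in one statement (again a sketch, using
the same assumed column names and placeholder values as above):

SELECT ov.*
FROM object_val ov
WHERE (ov.object_id, ov.object_val_type_id, ov.observation_date) IN
      (SELECT object_id, object_val_type_id, max(observation_date)
       FROM object_val
       WHERE object_id = someValueForObjectX
       GROUP BY object_id, object_val_type_id);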
macgillivary wrote:
> I just finished reading 'The Art of SQL' by Stephane Faroult, who has a
> chapter (ch 6) discussing this very topic.

I'd be curious to know any other references or books folks would
recommend when it comes to writing efficient SQL, as well as references
on database design.

--
Until later, Geoffrey

Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety.
 - Benjamin Franklin
Thanks for the tips, am.

Actually, your suggestion is equivalent to JOINing the table with a
GROUP BY copy of itself, and EXPLAIN shows both versions using the same
index and aggregates. Just a matter of style. Your previous suggestion
from the book works well too, but I actually prefer the JOIN method,
since that allows me to set the object_id and/or object_val_type_id
values in only one place.

Tom's method is faster, but has against it a bit of obscurity - it's
very fine-tuned to a very specific behavior of DISTINCT ON and is less
easy to read than the others.

I fully agree that it is annoying to keep another table with triggers.
And of course, that table needs to be indexed too, or it's worthless.
I'm wondering how much extra time the db spends maintaining all those
indexes and triggers, and I'll probably dismantle that in favor of the
composite index and the queries suggested so far.

I'll definitely check out that book; I've been looking for something
like that.

Thanks
Jaime
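For completeness, the JOIN form Jaime refers to - with the object/type
filters pushed down into the grouped subquery so they only have to be
written once - might look something like this (again a sketch, with the
same assumed column names and placeholder values as earlier in the
thread):

SELECT ov.*
FROM (SELECT object_id, object_val_type_id,
             max(observation_date) AS observation_date
      FROM object_val
      WHERE object_id = someValueForObjectX        -- filters set here only
        AND object_val_type_id = someValueForTypeA
      GROUP BY object_id, object_val_type_id) latest
JOIN object_val ov USING (object_id, object_val_type_id, observation_date);

With the composite index in place, both the subquery and the join back
to object_val should be simple index lookups.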