Thread: A challenge for the SQL gurus out there...
or maybe not and I'm just not getting it. So here's the scenario: I have 3 tables forum: with primary key "id" forum_thread: again primary key "id" and a foreign key "forum_id" referencing th primary key of the forum table forum_post: again primary key "id" with a forign key "thread_id" referencing the primary key of the forum_thread table The forum_post table also has a field "date_posted" (timestamp) with an index on it. What I need is an efficient way to create overviews (just think about a forum) I.e. the forum table has 3 records, one for each forum category I want to get a list looking like forum id thread_id post_id 1 6 443 2 9 123 3 3 557 The trick is, that I need the latest post (by the date posted column) for each category (speak forum_id). Due to the keys the forum_thread table has to be involved. I've been thinking about this for hours now, but I just can't come up with a query that will give me 3 records, one for each category showing the latest post. I do have a rather expensive query that involves a lot of sorting, but the forum I'm running has around 40000 posts now and the query takes around 4 seconds - which is unacceptable. So there has to be a better way to query this. Currently I'm using a view to assemble a list with the latest post for each forum thread and then I join that view with the forum categories, sort it and limit it. The thing is that the sorting takes waaay to long, simply because I sort a ton of records just to limit them. So my idea was to limit the resultset before sorting takes place, which would probably cut the query execution time to milliseconds instead of seconds and it would deliver predictable results that are not as dependent on number of posts as they are now. The number of posts per thread is usually fairly equal. Even the longest threads won't make it past 1000 posts, so my intention is to sort a maximum of 1000 records instead of 40000 (due to the join). It all boils down to me not being able to come up with a query that gives me the latest post per forum_id. So any input would be very much appreciated. Uwe
"Uwe C. Schroeder" <uwe@oss4u.com> writes: > I want to get a list looking like > > forum id thread_id post_id > 1 6 443 > 2 9 123 > 3 3 557 ... > It all boils down to me not being able to come up with a query that gives me > the latest post per forum_id. In a situation like this I would probably denormalize the tables slightly by adding a form_id key to the individual posts. That would make it hard to ever move a thread from one forum to another, though not impossible, but would help in this case as well as any other time you want to do an operation on all posts in a forum regardless of thread. If you add that column then you could index <form_id,date> and get the result you're looking for instantly with a DISTINCT ON query (which is a Postgres SQL extension). SELECT DISTINCT ON (form_id) forum_id, thread_id, post_id FROM thread ORDER BY forum_id, date DESC (actually you would have to make the index on <form_id, date DESC> or make both columns DESC in the query and then re-order them in an outer query) Alternatively you could have a trigger on posts which updates a last_updated field on every thread (and possibly a recent_post_id) then you could have a query on forums which pulls the most recently updated thread directly without having to join on form_post at all. That would slow down inserts but speed up views -- possibly a good trade-off for a forum system. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
On Sunday 07 September 2008, Gregory Stark wrote: > "Uwe C. Schroeder" <uwe@oss4u.com> writes: > > I want to get a list looking like > > > > forum id thread_id post_id > > 1 6 443 > > 2 9 123 > > 3 3 557 > > ... > > > It all boils down to me not being able to come up with a query that gives > > me the latest post per forum_id. > > In a situation like this I would probably denormalize the tables slightly > by adding a form_id key to the individual posts. That would make it hard to > ever move a thread from one forum to another, though not impossible, but > would help in this case as well as any other time you want to do an > operation on all posts in a forum regardless of thread. > > If you add that column then you could index <form_id,date> and get the > result you're looking for instantly with a DISTINCT ON query (which is a > Postgres SQL extension). > > SELECT DISTINCT ON (form_id) > forum_id, thread_id, post_id > FROM thread > ORDER BY forum_id, date DESC > > (actually you would have to make the index on <form_id, date DESC> or make > both columns DESC in the query and then re-order them in an outer query) > > Alternatively you could have a trigger on posts which updates a > last_updated field on every thread (and possibly a recent_post_id) then you > could have a query on forums which pulls the most recently updated thread > directly without having to join on form_post at all. That would slow down > inserts but speed up views -- possibly a good trade-off for a forum system. Thanks Gregory. Just to put my final solution on the list: I ended up with a combined approach of what you suggested: I added the forum_id to the posts table and created 2 triggers: one that sets the forum_id in the posts table to the forum_id in the threads table on insert (therefor no change in the application was necessary). The second trigger is to overcome the downside of adding the forum_id to the posts table. On an update to forum_thread.forum_id the trigger updates all posts in that thread to reflect the change in forum_id. That way one can just move the whole thread by changing the forum_id and the posts are moved along by the trigger. Very nice! The query time is now 198ms instead of up to 48seconds !!! Thanks for the idea Uwe
In article <200809070253.15422.uwe@oss4u.com>, "Uwe C. Schroeder" <uwe@oss4u.com> writes: > or maybe not and I'm just not getting it. > So here's the scenario: > I have 3 tables > forum: with primary key "id" > forum_thread: again primary key "id" and a foreign key "forum_id" referencing > th primary key of the forum table > forum_post: again primary key "id" with a forign key "thread_id" referencing > the primary key of the forum_thread table > The forum_post table also has a field "date_posted" (timestamp) with an index > on it. > What I need is an efficient way to create overviews (just think about a forum) > I.e. the forum table has 3 records, one for each forum category > I want to get a list looking like > forum id thread_id post_id > 1 6 443 > 2 9 123 > 3 3 557 > The trick is, that I need the latest post (by the date posted column) for each > category (speak forum_id). Due to the keys the forum_thread table has to be > involved. > I've been thinking about this for hours now, but I just can't come up with a > query that will give me 3 records, one for each category showing the latest > post. Try something like this: SELECT t1.forum_id, p1.thread_id, p1.id AS post_id, p1.date_posted FROM forum f1 JOIN forum_thread t1 ON t1.forum_id = f1.id JOIN forum_post p1 ON p1.thread_id = t1.id LEFT JOIN ( SELECT t2.forum_id, p2.thread_id, p2.date_posted FROM forum_thread t2 JOIN forum_post p2 ON p2.thread_id = t2.id ) AS f2 ON f2.forum_id = f1.id AND f2.date_posted > p1.date_posted WHERE f2.forum_id IS NULL ORDER BY t1.forum_id
On Sun, Sep 7, 2008 at 6:09 AM, Gregory Stark <stark@enterprisedb.com> wrote: > "Uwe C. Schroeder" <uwe@oss4u.com> writes: > >> I want to get a list looking like >> >> forum id thread_id post_id >> 1 6 443 >> 2 9 123 >> 3 3 557 > ... >> It all boils down to me not being able to come up with a query that gives me >> the latest post per forum_id. > > In a situation like this I would probably denormalize the tables slightly by > adding a form_id key to the individual posts. That would make it hard to ever > move a thread from one forum to another, though not impossible, but would help > in this case as well as any other time you want to do an operation on all > posts in a forum regardless of thread. <sql guru hat on> select f.*, ( select (t, ( select p from post p where p.thread_id = t.thread_id order by post_id desc limit 1 )) from thread t where forum_id = f.forum_id order by thread_id desc limit 1 ) as threadpost from forum f; :-) 'thread post' is a nested composite, ((thread), post). The above will pretty much guarantee a fast query unless the number of forums is large. To pull out the composite fields, wrap in a subquery or (better yet) fire up libpqtypes. merlin
On Mon, Sep 8, 2008 at 9:49 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Sun, Sep 7, 2008 at 6:09 AM, Gregory Stark <stark@enterprisedb.com> wrote: >> "Uwe C. Schroeder" <uwe@oss4u.com> writes: >> >>> I want to get a list looking like >>> >>> forum id thread_id post_id >>> 1 6 443 >>> 2 9 123 >>> 3 3 557 >> ... >>> It all boils down to me not being able to come up with a query that gives me >>> the latest post per forum_id. >> >> In a situation like this I would probably denormalize the tables slightly by >> adding a form_id key to the individual posts. That would make it hard to ever >> move a thread from one forum to another, though not impossible, but would help >> in this case as well as any other time you want to do an operation on all >> posts in a forum regardless of thread. > > <sql guru hat on> > > select f.*, > ( > select (t, > ( > select p from post p > where p.thread_id = t.thread_id > order by post_id desc limit 1 > )) > from thread t > where forum_id = f.forum_id > order by thread_id desc limit 1 > ) as threadpost > from forum f; > > :-) > > 'thread post' is a nested composite, ((thread), post). > > The above will pretty much guarantee a fast query unless the number of > forums is large. To pull out the composite fields, wrap in a subquery > or (better yet) fire up libpqtypes. oh, and I assumed id order corresponds to date order...should have mentioned that, and I had forum.id as forum_id, which it should have been in the first place. merlin