Thread: A challenge for the SQL gurus out there...

A challenge for the SQL gurus out there...

From
"Uwe C. Schroeder"
Date:
or maybe not and I'm just not getting it.
So here's the scenario:

I have 3 tables

forum: with primary key "id"
forum_thread: again primary key "id" and a foreign key "forum_id" referencing
th primary key of the forum table
forum_post: again primary key "id" with a forign key "thread_id" referencing
the primary key of the forum_thread table

The forum_post table also has a field "date_posted" (timestamp) with an index
on it.


What I need is an efficient way to create overviews (just think about a forum)
I.e. the forum table has 3 records, one for each forum category

I want to get a list looking like

forum id    thread_id    post_id
1        6    443
2        9    123
3        3    557

The trick is, that I need the latest post (by the date posted column) for each
category (speak forum_id). Due to the keys the forum_thread table has to be
involved.

I've been thinking about this for hours now, but I just can't come up with a
query that will give me 3 records, one for each category showing the latest
post.
I do have a rather expensive query that involves a lot of sorting, but the
forum I'm running has around 40000 posts now and the query takes around 4
seconds - which is unacceptable. So there has to be a better way to query
this.

Currently I'm using a view to assemble a list with the latest post for each
forum thread and then I join that view with the forum categories, sort it and
limit it. The thing is that the sorting takes waaay to long, simply because I
sort a ton of records just to limit them. So my idea was to limit the
resultset before sorting takes place, which would probably cut the query
execution time to milliseconds instead of seconds and it would deliver
predictable results that are not as dependent on number of posts as they are
now.

The number of posts per thread is usually fairly equal. Even the longest
threads won't make it past 1000 posts, so my intention is to sort a maximum
of 1000 records instead of 40000 (due to the join).

It all boils down to me not being able to come up with a query that gives me
the latest post per forum_id.

So any input would be very much appreciated.

Uwe

Re: A challenge for the SQL gurus out there...

From
Gregory Stark
Date:
"Uwe C. Schroeder" <uwe@oss4u.com> writes:

> I want to get a list looking like
>
> forum id    thread_id    post_id
> 1        6    443
> 2        9    123
> 3        3    557
...
> It all boils down to me not being able to come up with a query that gives me
> the latest post per forum_id.

In a situation like this I would probably denormalize the tables slightly by
adding a form_id key to the individual posts. That would make it hard to ever
move a thread from one forum to another, though not impossible, but would help
in this case as well as any other time you want to do an operation on all
posts in a forum regardless of thread.

If you add that column then you could index <form_id,date> and get the result
you're looking for instantly with a DISTINCT ON query (which is a Postgres SQL
extension).

SELECT DISTINCT ON (form_id)
       forum_id, thread_id, post_id
  FROM thread
 ORDER BY forum_id, date DESC

(actually you would have to make the index on <form_id, date DESC> or make
both columns DESC in the query and then re-order them in an outer query)

Alternatively you could have a trigger on posts which updates a last_updated
field on every thread (and possibly a recent_post_id) then you could have a
query on forums which pulls the most recently updated thread directly without
having to join on form_post at all. That would slow down inserts but speed up
views -- possibly a good trade-off for a forum system.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!

Re: A challenge for the SQL gurus out there...

From
"Uwe C. Schroeder"
Date:

On Sunday 07 September 2008, Gregory Stark wrote:
> "Uwe C. Schroeder" <uwe@oss4u.com> writes:
> > I want to get a list looking like
> >
> > forum id    thread_id    post_id
> > 1        6    443
> > 2        9    123
> > 3        3    557
>
> ...
>
> > It all boils down to me not being able to come up with a query that gives
> > me the latest post per forum_id.
>
> In a situation like this I would probably denormalize the tables slightly
> by adding a form_id key to the individual posts. That would make it hard to
> ever move a thread from one forum to another, though not impossible, but
> would help in this case as well as any other time you want to do an
> operation on all posts in a forum regardless of thread.
>
> If you add that column then you could index <form_id,date> and get the
> result you're looking for instantly with a DISTINCT ON query (which is a
> Postgres SQL extension).
>
> SELECT DISTINCT ON (form_id)
>        forum_id, thread_id, post_id
>   FROM thread
>  ORDER BY forum_id, date DESC
>
> (actually you would have to make the index on <form_id, date DESC> or make
> both columns DESC in the query and then re-order them in an outer query)
>
> Alternatively you could have a trigger on posts which updates a
> last_updated field on every thread (and possibly a recent_post_id) then you
> could have a query on forums which pulls the most recently updated thread
> directly without having to join on form_post at all. That would slow down
> inserts but speed up views -- possibly a good trade-off for a forum system.

Thanks Gregory.
Just to put my final solution on the list: I ended up with a combined approach
of what you suggested:
I added the forum_id to the posts table and created 2 triggers: one that sets
the forum_id in the posts table to the forum_id in the threads table on
insert (therefor no change in the application was necessary).
The second trigger is to overcome the downside of adding the forum_id to the
posts table. On an update to forum_thread.forum_id the trigger updates all
posts in that thread to reflect the change in forum_id. That way one can just
move the whole thread by changing the forum_id and the posts are moved along
by the trigger.

Very nice! The query time is now 198ms instead of up to 48seconds !!!

Thanks for the idea

Uwe

Re: A challenge for the SQL gurus out there...

From
Harald Fuchs
Date:
In article <200809070253.15422.uwe@oss4u.com>,
"Uwe C. Schroeder" <uwe@oss4u.com> writes:

> or maybe not and I'm just not getting it.
> So here's the scenario:

> I have 3 tables

> forum: with primary key "id"
> forum_thread: again primary key "id" and a foreign key "forum_id" referencing
> th primary key of the forum table
> forum_post: again primary key "id" with a forign key "thread_id" referencing
> the primary key of the forum_thread table

> The forum_post table also has a field "date_posted" (timestamp) with an index
> on it.


> What I need is an efficient way to create overviews (just think about a forum)
> I.e. the forum table has 3 records, one for each forum category

> I want to get a list looking like

> forum id    thread_id    post_id
> 1        6    443
> 2        9    123
> 3        3    557

> The trick is, that I need the latest post (by the date posted column) for each
> category (speak forum_id). Due to the keys the forum_thread table has to be
> involved.

> I've been thinking about this for hours now, but I just can't come up with a
> query that will give me 3 records, one for each category showing the latest
> post.

Try something like this:

  SELECT t1.forum_id, p1.thread_id, p1.id AS post_id, p1.date_posted
  FROM forum f1
  JOIN forum_thread t1 ON t1.forum_id = f1.id
  JOIN forum_post p1 ON p1.thread_id = t1.id
  LEFT JOIN (
      SELECT t2.forum_id, p2.thread_id, p2.date_posted
      FROM forum_thread t2
      JOIN forum_post p2 ON p2.thread_id = t2.id
    ) AS f2 ON f2.forum_id = f1.id AND f2.date_posted > p1.date_posted
  WHERE f2.forum_id IS NULL
  ORDER BY t1.forum_id

Re: A challenge for the SQL gurus out there...

From
"Merlin Moncure"
Date:
On Sun, Sep 7, 2008 at 6:09 AM, Gregory Stark <stark@enterprisedb.com> wrote:
> "Uwe C. Schroeder" <uwe@oss4u.com> writes:
>
>> I want to get a list looking like
>>
>> forum id    thread_id post_id
>> 1             6       443
>> 2             9       123
>> 3             3       557
> ...
>> It all boils down to me not being able to come up with a query that gives me
>> the latest post per forum_id.
>
> In a situation like this I would probably denormalize the tables slightly by
> adding a form_id key to the individual posts. That would make it hard to ever
> move a thread from one forum to another, though not impossible, but would help
> in this case as well as any other time you want to do an operation on all
> posts in a forum regardless of thread.

<sql guru hat on>

select f.*,
  (
    select (t,
      (
        select p from post p
        where p.thread_id = t.thread_id
        order by post_id desc limit 1
      ))
      from thread t
      where forum_id = f.forum_id
      order by thread_id desc limit 1
  ) as threadpost
  from forum f;

:-)

'thread post' is a nested composite, ((thread), post).

The above will pretty much guarantee a fast query unless the number of
forums is large.  To pull out the composite fields, wrap in a subquery
or (better yet) fire up libpqtypes.

merlin

Re: A challenge for the SQL gurus out there...

From
"Merlin Moncure"
Date:
On Mon, Sep 8, 2008 at 9:49 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Sun, Sep 7, 2008 at 6:09 AM, Gregory Stark <stark@enterprisedb.com> wrote:
>> "Uwe C. Schroeder" <uwe@oss4u.com> writes:
>>
>>> I want to get a list looking like
>>>
>>> forum id    thread_id post_id
>>> 1             6       443
>>> 2             9       123
>>> 3             3       557
>> ...
>>> It all boils down to me not being able to come up with a query that gives me
>>> the latest post per forum_id.
>>
>> In a situation like this I would probably denormalize the tables slightly by
>> adding a form_id key to the individual posts. That would make it hard to ever
>> move a thread from one forum to another, though not impossible, but would help
>> in this case as well as any other time you want to do an operation on all
>> posts in a forum regardless of thread.
>
> <sql guru hat on>
>
> select f.*,
>  (
>    select (t,
>      (
>        select p from post p
>        where p.thread_id = t.thread_id
>        order by post_id desc limit 1
>      ))
>      from thread t
>      where forum_id = f.forum_id
>      order by thread_id desc limit 1
>  ) as threadpost
>  from forum f;
>
> :-)
>
> 'thread post' is a nested composite, ((thread), post).
>
> The above will pretty much guarantee a fast query unless the number of
> forums is large.  To pull out the composite fields, wrap in a subquery
> or (better yet) fire up libpqtypes.

oh, and I assumed id order corresponds to date order...should have
mentioned that, and I had forum.id as forum_id, which it should have
been in the first place.

merlin