Best way to store a threaded message list/tree in SQL - Mailing list pgsql-general
From | Mike Christensen |
---|---|
Subject | Best way to store a threaded message list/tree in SQL |
Date | |
Msg-id | 49CC1656.5050100@comcast.net Whole thread Raw |
Responses |
Re: Best way to store a threaded message list/tree in SQL
|
List | pgsql-general |
Hi guys - I'm looking for the best way to store a set of "posts" as well as comments on those posts in SQL. Imagine a design similar to a "Wall" on Facebook where users can write posts on their wall and other users can comment on those posts. I need to be able to display all wall posts as well as the comments. When I first started out, I came up with a table such as: CREATE Table wallposts ( id uuid NOT NULL, posted timestamp NOT NULL, userid uuid NOT NULL, posterid uuid NOT NULL, parentid uuid NOT NULL, comment text NOT NULL ) id is unique, parentid will be null on original posts and point to an id if the row is a comment on an existing post. Easy enough and super fast to insert new data. However, doing a select which would return me: POST 1 COMMENT 1 COMMENT 2 POST 2 COMMENT 1 COMMENT 2 Regardless of which order the rows existed in the database proved to be extremely difficult. I obviously can't just order by date, as someone might comment on post 1 after post 2 has been posted. If I do a LEFT JOIN to get the parent post on all rows, and then sort by that date first, all the original posts group together as they'd have a value of null. Then I got this idea: CREATE TABLE wallposts ( id uuid NOT NULL, threadposted timestamp, posted timestamp, ... comment text ) On an original post, threadposted and posted would be the same. On a comment, timestamp would be the time the original post was posted and "posted" would be the time the comment on that thread was posted. Now I can just do: select * from wallposts order by threadposted, posted; This works great, however one thing irks me. If two people create a post at the same time, comments on the two posts would get munged together as they'd have the same timestamp. I could use "ticks" instead of a datetime, but still the accuracy is only 1/1000 of a second. I could also setup a unique constraint on threadposted and posted which makes inserts a bit more expensive, but if I had multiple database servers in a farm, the chance of a collision is still there. I almost went ahead with this anyway since the chances of this happening are extremely small, but I wanted to see if I could eat my cake and still have it too. Mostly for my own educational curiosity. Third solution would be to store this data in the form of a graph. Each node would have a v-left and v-right pointer. I could order by "left" which would traverse the tree in the order I need. However, every time someone inserts a comment I'd have to re balance the whole tree. This would create a ton of row locking, and all sorts of problems if the site was very busy. Plus, it's kinda extreme and also causes replication problems. So I tossed this idea quickly. I also thought about just storing the original posts and then serializing the comments in a binary form, since who cares about individual comments. This would be very fast, however if a user wants to delete their comment or append a new comment to the end, I have to deserialize this data, modify the structure, then serialize it back and update the row. If a bunch of people are commenting on the same post at the same time, I might have random issues with that. So here's what I eventually did. I query for all the posts ordered by date entered. In the middle ware layer, I loop through the recordset and create a "stack" of original posts, each node on the stack points to a linked list of comments. When I come across an original post, I push a new node on the stack and when I come across a comment I add a node to the linked list. I organize this in memory so I can traverse the recordset once and have O(n). After I create the in-memory representation of the wall, I traverse through this data structure again and write out HTML. This works great and has super fast inserts and super fast selects, and no weird row locking issues; however it's a bit heavier on my presentation layer and requires me to build an in memory representation of the user's wall to move stuff around so it's in the right order. Still, I believe this is the best approach I've found so far. I thought I'd check with some other SQL experts and see if 1) Postgres has some super nifty tree functions that I've never heard of and 2) see if there's actually a way to do this using joins or unions or something which would still be performant with millions of users. Thoughts? Mike
pgsql-general by date: