Thread: Optimising SELECT on a table with one million rows
Hi,

I'm fairly new to PostgreSQL, so I'm not sure whether the performance problems I'm having are due to poorly constructed queries/indices, or whether I've bumped into a more fundamental problem requiring a redesign of my database structure. That's why I'm requesting your help.

Here's the situation: I have three tables: Users, Stories, and Comments. Stories have an author (a user), and a comment is associated with a story and with the user who posted it. The structure of the database is therefore fairly simple (there are also some sequences, which I've omitted for clarity):

CREATE TABLE users
(
    user_id int UNIQUE NOT NULL,
    user_name text,
    PRIMARY KEY (user_id)
);

CREATE TABLE stories
(
    story_id int UNIQUE NOT NULL,
    story_title text,
    story_body text,
    story_timestamp timestamptz,
    story_author int REFERENCES users (user_id) NOT NULL,
    PRIMARY KEY (story_id)
);

CREATE TABLE comments
(
    comment_id int UNIQUE NOT NULL,
    comment_title text,
    comment_body text,
    comment_timestamp timestamptz,
    comment_story int REFERENCES stories (story_id) NOT NULL,
    comment_author int REFERENCES users (user_id) NOT NULL,
    PRIMARY KEY (comment_id)
);

I've also populated the database with some test data, comprising 1,000 users, 1,000 stories (one per user), and 1,000,000 comments (one comment per user per story).

Now, the query I wish to optimise is also simple: get me all comments (the comment_id suffices) and the corresponding user *names* for a given story. If, for example, the story_id is 100, the query looks like this:

SELECT comments.comment_id, users.user_name
FROM comments, users
WHERE comments.comment_story = 100 AND comments.comment_author = users.user_id;

The problem is that this query takes a *very* long time. With those 1,000,000 comments, it needs at least 1100 ms on my system. EXPLAIN ANALYZE tells me that a sequential scan is being performed on both users and comments:

Hash Join  (cost=28.50..21889.09 rows=988 width=14) (actual time=3.674..1144.779 rows=1000 loops=1)
  Hash Cond: ((comments.comment_author)::integer = (users.user_id)::integer)
  ->  Seq Scan on comments  (cost=0.00..21847.00 rows=988 width=8) (actual time=0.185..1136.067 rows=1000 loops=1)
        Filter: ((comment_story)::integer = 100)
  ->  Hash  (cost=16.00..16.00 rows=1000 width=14) (actual time=3.425..3.425 rows=1000 loops=1)
        ->  Seq Scan on users  (cost=0.00..16.00 rows=1000 width=14) (actual time=0.068..1.845 rows=1000 loops=1)
Total runtime: 1146.424 ms

In the long run, I guess one possible solution to this problem will be to partition the comments table into a number of sub-tables, most likely based on the timestamp attribute (separating current from historic data). Nevertheless, I am wondering if there are more straightforward ways to optimise this query. Some clever use of indices, perhaps? Or is the way I am constructing the SELECT non-optimal? Or do I need some pixie-magic in the PostgreSQL settings? Anyway, any suggestions are welcome! (And thanks in advance.)

Regards,
C.S.
First question: did you create the appropriate indexes on the relevant columns of these tables? Foreign keys do not implicitly create indexes in Postgres.
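If you are not sure, an easy way to check which indexes a table currently has (besides "\d comments" in psql) is to query the pg_indexes view, for example:

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'comments';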
Bryan
Cultural Sublimation wrote:
> CREATE TABLE users
> (
>     user_id int UNIQUE NOT NULL,
>     user_name text,
>     PRIMARY KEY (user_id)
> );
>
> CREATE TABLE stories
> (
>     story_id int UNIQUE NOT NULL,
>     story_title text,
>     story_body text,
>     story_timestamp timestamptz,
>     story_author int REFERENCES users (user_id) NOT NULL,
>     PRIMARY KEY (story_id)
> );
>
> CREATE TABLE comments
> (
>     comment_id int UNIQUE NOT NULL,
>     comment_title text,
>     comment_body text,
>     comment_timestamp timestamptz,
>     comment_story int REFERENCES stories (story_id) NOT NULL,
>     comment_author int REFERENCES users (user_id) NOT NULL,
>     PRIMARY KEY (comment_id)
> );
>
> I've also populated the database with some test data, comprising 1,000
> users, 1,000 stories (one per user), and 1,000,000 comments (one comment
> per user per story).
>
> Now, the query I wish to optimise is also simple: get me all comments (the
> comment_id suffices) and the corresponding user *names* for a given story.
> If, for example, the story_id is 100, the query looks like this:
>
> SELECT comments.comment_id, users.user_name
> FROM comments, users
> WHERE comments.comment_story = 100 AND comments.comment_author = users.user_id;
>
> The problem is that this query takes a *very* long time. With those
> 1,000,000 comments, it needs at least 1100 ms on my system. EXPLAIN
> ANALYZE tells me that a sequential scan is being performed on both
> users and comments:

What else is it supposed to do? You haven't created any indexes. I'm also guessing that you haven't analysed the tables either.

--
Richard Huxton
Archonet Ltd
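("Analysed" here means running ANALYZE so the planner has up-to-date statistics to plan with; for example:

ANALYZE comments;   -- update statistics for one table
ANALYZE;            -- or for every table in the database
)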
On Jul 30, 12:01 pm, cultural_sublimat...@yahoo.com (Cultural Sublimation) wrote:
> Hash Join  (cost=28.50..21889.09 rows=988 width=14) (actual time=3.674..1144.779 rows=1000 loops=1)
>   Hash Cond: ((comments.comment_author)::integer = (users.user_id)::integer)
>   ->  Seq Scan on comments  (cost=0.00..21847.00 rows=988 width=8) (actual time=0.185..1136.067 rows=1000 loops=1)
>         Filter: ((comment_story)::integer = 100)
>   ->  Hash  (cost=16.00..16.00 rows=1000 width=14) (actual time=3.425..3.425 rows=1000 loops=1)
>         ->  Seq Scan on users  (cost=0.00..16.00 rows=1000 width=14) (actual time=0.068..1.845 rows=1000 loops=1)
> Total runtime: 1146.424 ms

Create an index on the comments.comment_story column.
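For example (the index name here is arbitrary):

CREATE INDEX comments_story_idx ON comments (comment_story);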
Cultural Sublimation wrote:
> Hi,
>
> I'm fairly new to PostgreSQL, so I'm not sure whether the performance
> problems I'm having are due to poorly constructed queries/indices,
> or whether I've bumped into a more fundamental problem requiring a
> redesign of my database structure. That's why I'm requesting your help.
>
> Here's the situation: I have three tables: Users, Stories, and Comments.
> Stories have an author (a user), and a comment is associated with a
> story and with the user who posted it. The structure of the database
> is therefore fairly simple (there are also some sequences, which I've
> omitted for clarity):
>
> CREATE TABLE users
> (
>     user_id int UNIQUE NOT NULL,
>     user_name text,
>     PRIMARY KEY (user_id)
> );
>
> CREATE TABLE stories
> (
>     story_id int UNIQUE NOT NULL,
>     story_title text,
>     story_body text,
>     story_timestamp timestamptz,
>     story_author int REFERENCES users (user_id) NOT NULL,
>     PRIMARY KEY (story_id)
> );
>
> CREATE TABLE comments
> (
>     comment_id int UNIQUE NOT NULL,
>     comment_title text,
>     comment_body text,
>     comment_timestamp timestamptz,
>     comment_story int REFERENCES stories (story_id) NOT NULL,
>     comment_author int REFERENCES users (user_id) NOT NULL,
>     PRIMARY KEY (comment_id)
> );

You need indices on comments.comment_story (and probably, later, on comments.comment_author). You should ALWAYS add an index on a FOREIGN KEY column unless you have a very good reason not to. So:

CREATE INDEX comments_story_idx ON comments (comment_story);
CREATE INDEX comments_author_idx ON comments (comment_author);
CREATE INDEX story_author_idx ON stories (story_author);

The first of these should remove the need for a seqscan on comments for your query. The seqscan on users is not a problem - you are returning data from all the rows, so a seqscan is the smart thing to do.

Nis
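Once comments_story_idx exists, you can confirm that the planner actually uses it by re-running the query under EXPLAIN ANALYZE (just the commands; output omitted):

ANALYZE comments;

EXPLAIN ANALYZE
SELECT comments.comment_id, users.user_name
FROM comments, users
WHERE comments.comment_story = 100
  AND comments.comment_author = users.user_id;

The comments side of the plan should then show an index or bitmap scan on comments_story_idx instead of the sequential scan.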
Cultural Sublimation wrote:
> SELECT comments.comment_id, users.user_name
> FROM comments, users
> WHERE comments.comment_story = 100 AND comments.comment_author = users.user_id;
>
> The problem is that this query takes a *very* long time. With those
> 1,000,000 comments, it needs at least 1100 ms on my system. EXPLAIN
> ANALYZE tells me that a sequential scan is being performed on both
> users and comments:
>
> Hash Join  (cost=28.50..21889.09 rows=988 width=14) (actual time=3.674..1144.779 rows=1000 loops=1)
>   Hash Cond: ((comments.comment_author)::integer = (users.user_id)::integer)
>   ->  Seq Scan on comments  (cost=0.00..21847.00 rows=988 width=8) (actual time=0.185..1136.067 rows=1000 loops=1)
>         Filter: ((comment_story)::integer = 100)
>   ->  Hash  (cost=16.00..16.00 rows=1000 width=14) (actual time=3.425..3.425 rows=1000 loops=1)
>         ->  Seq Scan on users  (cost=0.00..16.00 rows=1000 width=14) (actual time=0.068..1.845 rows=1000 loops=1)
> Total runtime: 1146.424 ms

If you have no index on comments.comment_author, then a seqscan will be required for your join between comments and users. Similarly, if you have no index on comments.comment_story, then any query against comments that uses that column as part of a predicate will require a seqscan of the comments table.

Note that an FK constraint does not automatically create an index on the underlying column. You need to create the actual index yourself if it will be necessary for your queries.

> In the long run, I guess one possible solution to this problem will be
> to partition the comments table into a number of sub-tables, most likely
> based on the timestamp attribute (separating current from historic data).

Partitioning on comments.comment_timestamp won't help you at all for this particular query, since you don't have a condition in your query dependent upon that value. It might help you for other queries (such as gathering up all the comments posted on a particular day, or during some other time range), but it won't make any positive difference for this query.

-Jon

--
Senior Systems Developer
Media Matters for America
http://mediamatters.org/
Hi,

> If you have no index on comments.comment_author, then a seqscan will be
> required for your join between comments and users. Similarly, if you
> have no index on comments.comment_story, then any query against comments
> that uses that column as part of a predicate will require a seqscan of
> the comments table.

I see. As I said, I'm still fairly new to this...

> Note that an FK constraint does not automatically create an index on the
> underlying column. You need to create the actual index yourself if it
> will be necessary for your queries.

I see what you mean. The basic idea, then, is to look at the typical queries and to create indices based on them. Is there a good guide on index creation for optimisation purposes?

> Partitioning on comments.comment_timestamp won't help you at all for
> this particular query, since you don't have a condition in your query
> dependent upon that value. It might help you for other queries (such as
> gathering up all the comments posted on a particular day, or during some
> other time range), but it won't make any positive difference for this query.

You are right. Come to think of it, partitioning the comments table based on comment_story might make more sense, since the overwhelming majority of queries will be like the one I just mentioned: asking for all the comments of a given story.

Anyway, thanks a lot for your help! (And that goes for all the other people who have also given their two cents.)

Best regards,
C.S.
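For reference, a story-based split along those lines could be done with the inheritance-based partitioning mechanism. This is only a rough sketch - the child table names, the two-way split, and the boundary value are made up for illustration:

CREATE TABLE comments_low (
    CHECK (comment_story < 500)
) INHERITS (comments);

CREATE TABLE comments_high (
    CHECK (comment_story >= 500)
) INHERITS (comments);

-- Indexes, primary keys, and foreign keys are not inherited, so each
-- partition needs its own index on comment_story:
CREATE INDEX comments_low_story_idx  ON comments_low (comment_story);
CREATE INDEX comments_high_story_idx ON comments_high (comment_story);

-- With constraint exclusion enabled, a query that filters on comment_story
-- skips the partitions whose CHECK constraints rule them out:
SET constraint_exclusion = on;
SELECT comment_id FROM comments WHERE comment_story = 100;

Keep in mind that inheritance partitioning also needs rules or triggers to route inserts into the right child table, and that with only a million rows a plain index on comment_story will almost certainly be enough on its own.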