Thread: Optimising SELECT on a table with one million rows
Hi,

I'm fairly new to PostgreSQL, so I'm not sure whether the performance problems I'm having are due to poorly constructed queries/indices, or whether I've bumped into a more fundamental problem requiring a redesign of my database structure. That's why I'm requesting your help.

Here's the situation: I have three tables: Users, Stories, and Comments. Stories have an author (a user), and a comment is associated with a story and with the user who posted it. The structure of the database is therefore fairly simple (there are also some sequences, which I've omitted for clarity):

CREATE TABLE users
(
    user_id int UNIQUE NOT NULL,
    user_name text,
    PRIMARY KEY (user_id)
);

CREATE TABLE stories
(
    story_id int UNIQUE NOT NULL,
    story_title text,
    story_body text,
    story_timestamp timestamptz,
    story_author int REFERENCES users (user_id) NOT NULL,
    PRIMARY KEY (story_id)
);

CREATE TABLE comments
(
    comment_id int UNIQUE NOT NULL,
    comment_title text,
    comment_body text,
    comment_timestamp timestamptz,
    comment_story int REFERENCES stories (story_id) NOT NULL,
    comment_author int REFERENCES users (user_id) NOT NULL,
    PRIMARY KEY (comment_id)
);

I've also populated the database with some test data, comprising 1,000 users, 1,000 stories (one per user), and 1,000,000 comments (one comment per user per story).

Now, the query I wish to optimise is also simple: get me all comments (the comment_id suffices) and the corresponding user *names* for a given story. If, for example, the story_id is 100, the query looks like this:

SELECT comments.comment_id, users.user_name
FROM comments, users
WHERE comments.comment_story = 100 AND comments.comment_author = users.user_id;

The problem is that this query takes a *very* long time. With those 1,000,000 comments, it needs at least 1100 ms on my system. EXPLAIN ANALYZE tells me that a sequential scan is being performed on both users and comments:

Hash Join  (cost=28.50..21889.09 rows=988 width=14) (actual time=3.674..1144.779 rows=1000 loops=1)
  Hash Cond: ((comments.comment_author)::integer = (users.user_id)::integer)
  ->  Seq Scan on comments  (cost=0.00..21847.00 rows=988 width=8) (actual time=0.185..1136.067 rows=1000 loops=1)
        Filter: ((comment_story)::integer = 100)
  ->  Hash  (cost=16.00..16.00 rows=1000 width=14) (actual time=3.425..3.425 rows=1000 loops=1)
        ->  Seq Scan on users  (cost=0.00..16.00 rows=1000 width=14) (actual time=0.068..1.845 rows=1000 loops=1)
Total runtime: 1146.424 ms

In the long run, I guess one possible solution to this problem will be to partition the comments table into a number of sub-tables, most likely based on the timestamp attribute (separating current from historic data). Nevertheless, I am wondering if there are more straightforward ways to optimise this query. Some clever use of indices, perhaps? Or is the way I am constructing the SELECT non-optimal? Or do I need some pixie-magic in the PostgreSQL settings? Anyway, any suggestions are welcome! (And thanks in advance.)

Regards,
C.S.
First question: did you create the appropriate indexes on the relevant columns of these tables? Foreign keys do not implicitly create indexes in Postgres.
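If you are not sure, an easy way to check which indexes a table currently has (besides "\d comments" in psql) is to query the pg_indexes view, for example:

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'comments';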
Bryan
Cultural Sublimation wrote:
> CREATE TABLE users
> (
>     user_id int UNIQUE NOT NULL,
>     user_name text,
>     PRIMARY KEY (user_id)
> );
>
> CREATE TABLE stories
> (
>     story_id int UNIQUE NOT NULL,
>     story_title text,
>     story_body text,
>     story_timestamp timestamptz,
>     story_author int REFERENCES users (user_id) NOT NULL,
>     PRIMARY KEY (story_id)
> );
>
> CREATE TABLE comments
> (
>     comment_id int UNIQUE NOT NULL,
>     comment_title text,
>     comment_body text,
>     comment_timestamp timestamptz,
>     comment_story int REFERENCES stories (story_id) NOT NULL,
>     comment_author int REFERENCES users (user_id) NOT NULL,
>     PRIMARY KEY (comment_id)
> );
>
> I've also populated the database with some test data, comprising 1,000
> users, 1,000 stories (one per user), and 1,000,000 comments (one comment
> per user per story).
>
> Now, the query I wish to optimise is also simple: get me all comments (the
> comment_id suffices) and the corresponding user *names* for a given story.
> If, for example, the story_id is 100, the query looks like this:
>
> SELECT comments.comment_id, users.user_name
> FROM comments, users
> WHERE comments.comment_story = 100 AND comments.comment_author = users.user_id;
>
> The problem is that this query takes a *very* long time. With those
> 1,000,000 comments, it needs at least 1100 ms on my system. EXPLAIN
> ANALYZE tells me that a sequential scan is being performed on both
> users and comments:

What else is it supposed to do? You haven't created any indexes. I'm also guessing that you haven't analysed the tables either.

--
Richard Huxton
Archonet Ltd
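("Analysed" here means running ANALYZE so the planner has up-to-date statistics to plan with; for example:

ANALYZE comments;   -- update statistics for one table
ANALYZE;            -- or for every table in the database
)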
On Jul 30, 12:01 pm, cultural_sublimat...@yahoo.com (Cultural Sublimation) wrote:
> Hash Join  (cost=28.50..21889.09 rows=988 width=14) (actual time=3.674..1144.779 rows=1000 loops=1)
>   Hash Cond: ((comments.comment_author)::integer = (users.user_id)::integer)
>   ->  Seq Scan on comments  (cost=0.00..21847.00 rows=988 width=8) (actual time=0.185..1136.067 rows=1000 loops=1)
>         Filter: ((comment_story)::integer = 100)
>   ->  Hash  (cost=16.00..16.00 rows=1000 width=14) (actual time=3.425..3.425 rows=1000 loops=1)
>         ->  Seq Scan on users  (cost=0.00..16.00 rows=1000 width=14) (actual time=0.068..1.845 rows=1000 loops=1)
> Total runtime: 1146.424 ms

Create an index on the comments.comment_story column.
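For example (the index name here is arbitrary):

CREATE INDEX comments_story_idx ON comments (comment_story);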
Cultural Sublimation wrote:
> Hi,
>
> I'm fairly new to PostgreSQL, so I'm not sure whether the performance
> problems I'm having are due to poorly constructed queries/indices,
> or whether I've bumped into a more fundamental problem requiring a
> redesign of my database structure. That's why I'm requesting your help.
>
> Here's the situation: I have three tables: Users, Stories, and Comments.
> Stories have an author (a user), and a comment is associated with a
> story and with the user who posted it. The structure of the database
> is therefore fairly simple (there are also some sequences, which I've
> omitted for clarity):
>
> CREATE TABLE users
> (
>     user_id int UNIQUE NOT NULL,
>     user_name text,
>     PRIMARY KEY (user_id)
> );
>
> CREATE TABLE stories
> (
>     story_id int UNIQUE NOT NULL,
>     story_title text,
>     story_body text,
>     story_timestamp timestamptz,
>     story_author int REFERENCES users (user_id) NOT NULL,
>     PRIMARY KEY (story_id)
> );
>
> CREATE TABLE comments
> (
>     comment_id int UNIQUE NOT NULL,
>     comment_title text,
>     comment_body text,
>     comment_timestamp timestamptz,
>     comment_story int REFERENCES stories (story_id) NOT NULL,
>     comment_author int REFERENCES users (user_id) NOT NULL,
>     PRIMARY KEY (comment_id)
> );

You need indices on comments.comment_story (and probably, later, on comments.comment_author). You should ALWAYS add an index on a FOREIGN KEY column unless you have a very good reason not to. So:

CREATE INDEX comments_story_idx ON comments (comment_story);
CREATE INDEX comments_author_idx ON comments (comment_author);
CREATE INDEX story_author_idx ON stories (story_author);

The first of these should remove the need for a seqscan on comments for your query. The seqscan on users is not a problem - you are returning data from all the rows, so a seqscan is the smart thing to do.

Nis
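Once comments_story_idx exists, you can confirm that the planner actually uses it by re-running the query under EXPLAIN ANALYZE (just the commands; output omitted):

ANALYZE comments;

EXPLAIN ANALYZE
SELECT comments.comment_id, users.user_name
FROM comments, users
WHERE comments.comment_story = 100
  AND comments.comment_author = users.user_id;

The comments side of the plan should then show an index or bitmap scan on comments_story_idx instead of the sequential scan.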
Cultural Sublimation wrote:
> SELECT comments.comment_id, users.user_name
> FROM comments, users
> WHERE comments.comment_story = 100 AND comments.comment_author = users.user_id;
>
> The problem is that this query takes a *very* long time. With those
> 1,000,000 comments, it needs at least 1100 ms on my system. EXPLAIN
> ANALYZE tells me that a sequential scan is being performed on both
> users and comments:
>
> Hash Join  (cost=28.50..21889.09 rows=988 width=14) (actual time=3.674..1144.779 rows=1000 loops=1)
>   Hash Cond: ((comments.comment_author)::integer = (users.user_id)::integer)
>   ->  Seq Scan on comments  (cost=0.00..21847.00 rows=988 width=8) (actual time=0.185..1136.067 rows=1000 loops=1)
>         Filter: ((comment_story)::integer = 100)
>   ->  Hash  (cost=16.00..16.00 rows=1000 width=14) (actual time=3.425..3.425 rows=1000 loops=1)
>         ->  Seq Scan on users  (cost=0.00..16.00 rows=1000 width=14) (actual time=0.068..1.845 rows=1000 loops=1)
> Total runtime: 1146.424 ms

If you have no index on comments.comment_author, then a seqscan will be required for your join between comments and users. Similarly, if you have no index on comments.comment_story, then any query against comments that uses that column as part of a predicate will require a seqscan of the comments table.

Note that an FK constraint does not automatically create an index on the underlying column. You need to create the actual index yourself if it will be necessary for your queries.

> In the long run, I guess one possible solution to this problem will be
> to partition the comments table into a number of sub-tables, most likely
> based on the timestamp attribute (separating current from historic data).

Partitioning on comments.comment_timestamp won't help you at all for this particular query, since you don't have a condition in your query dependent upon that value. It might help you for other queries (such as gathering up all the comments posted on a particular day, or during some other time range), but it won't make any positive difference for this query.

-Jon

--
Senior Systems Developer
Media Matters for America
http://mediamatters.org/
Hi,

> If you have no index on comments.comment_author, then a seqscan will be
> required for your join between comments and users. Similarly, if you
> have no index on comments.comment_story, then any query against comments
> that uses that column as part of a predicate will require a seqscan of
> the comments table.

I see. As I said, I'm still fairly new to this...

> Note that an FK constraint does not automatically create an index on the
> underlying column. You need to create the actual index yourself if it
> will be necessary for your queries.

I see what you mean. The basic idea, then, is to look at the typical queries and to create indices based on them. Is there a good guide on index creation for optimisation purposes?

> Partitioning on comments.comment_timestamp won't help you at all for
> this particular query, since you don't have a condition in your query
> dependent upon that value. It might help you for other queries (such as
> gathering up all the comments posted on a particular day, or during some
> other time range), but it won't make any positive difference for this query.

You are right. Come to think of it, partitioning the comments table based on comment_story might make more sense, since the overwhelming majority of queries will be like the one I just mentioned: asking for all the comments of a given story.

Anyway, thanks a lot for your help! (And that goes for all the other people who have also given their two cents.)

Best regards,
C.S.
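For reference, a story-based split along those lines could be done with the inheritance-based partitioning mechanism. This is only a rough sketch - the child table names, the two-way split, and the boundary value are made up for illustration:

CREATE TABLE comments_low (
    CHECK (comment_story < 500)
) INHERITS (comments);

CREATE TABLE comments_high (
    CHECK (comment_story >= 500)
) INHERITS (comments);

-- Indexes, primary keys, and foreign keys are not inherited, so each
-- partition needs its own index on comment_story:
CREATE INDEX comments_low_story_idx  ON comments_low (comment_story);
CREATE INDEX comments_high_story_idx ON comments_high (comment_story);

-- With constraint exclusion enabled, a query that filters on comment_story
-- skips the partitions whose CHECK constraints rule them out:
SET constraint_exclusion = on;
SELECT comment_id FROM comments WHERE comment_story = 100;

Keep in mind that inheritance partitioning also needs rules or triggers to route inserts into the right child table, and that with only a million rows a plain index on comment_story will almost certainly be enough on its own.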